
Don't schedule on Windows nodes by default #9165

Closed · craigminihan opened this issue Nov 15, 2021 · 11 comments

@craigminihan

Is this a bug report or feature request?

  • Feature Request

What should the feature do:
When deployed into a mixed Linux and Windows cluster, don't attempt to schedule on the Windows nodes.

What is the use case behind this feature:
The Rook/Ceph chart should schedule on the Linux nodes, essentially ignoring any Windows nodes that are present. There is support for nodeSelector and the various *NodeAffinity settings, which is partially successful, however it is verbose and some pods still end up scheduled on Windows nodes (node11 in this case), where they simply fail. See:

hvk8s@hvk8s-master:~$ kubectl get po -A -o wide | grep node11
kube-system      kube-flannel-ds-windows-amd64-6nkq2                      1/1     Running                 0              3d1h    172.31.0.21   hvk8s-node11   <none>           <none>
kube-system      kube-proxy-windows-vjmms                                 1/1     Running                 0              3d1h    172.30.2.3    hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-crashcollector-hvk8s-node11-689cf66dcf-d4hk4   0/1     Init:ImagePullBackOff   0              19m     172.30.2.21   hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-crashcollector-hvk8s-node11-864b755d6f-jzpgq   0/1     Init:ImagePullBackOff   0              20m     172.30.2.19   hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-mgr-a-d8f84c875-72dms                          0/1     Init:ImagePullBackOff   0              19m     172.30.2.20   hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-osd-prepare-hvk8s-node11--1-r7pzw              0/1     Init:0/1                0              13m     <none>        hvk8s-node11   <none>           <none>

The operator chart was deployed with values.yaml as follows:

nodeSelector:
  "kubernetes.io/os": linux
discover:
  nodeAffinity: kubernetes.io/os=linux
csi:
  provisionerNodeAffinity: kubernetes.io/os=linux
  pluginNodeAffinity: kubernetes.io/os=linux
  rbdProvisionerNodeAffinity: kubernetes.io/os=linux
  rbdPluginNodeAffinity: kubernetes.io/os=linux
  cephFSProvisionerNodeAffinity: kubernetes.io/os=linux
  cephFSPluginNodeAffinity: kubernetes.io/os=linux
agent:
  nodeAffinity: kubernetes.io/os=linux
admissionController:
  nodeAffinity: kubernetes.io/os=linux
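
A note on what I expect these values to do: each single-line key=value affinity string should be rendered by the chart/operator into a required node affinity term on the corresponding pods, roughly like the sketch below (my reading of the settings, not verified against the templates):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
          - linux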

Environment:
A test cluster showing the problem above:

NAME           STATUS   ROLES                  AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
hvk8s-master   Ready    control-plane,master   3d1h    v1.22.2   172.31.0.10   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
hvk8s-node1    Ready    <none>                 3d1h    v1.22.2   172.31.0.11   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
hvk8s-node11   Ready    <none>                 3d1h    v1.22.2   172.31.0.21   <none>        Windows Server Standard        10.0.19042.867    docker://19.3.18
hvk8s-node2    Ready    <none>                 2d23h   v1.22.2   172.31.0.12   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
hvk8s-node3    Ready    <none>                 2d23h   v1.22.2   172.31.0.13   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
@leseb (Member) commented Nov 15, 2021

Did you also deploy the CephCluster CRD with Helm? If so, can you share the values.yaml (cluster/charts/rook-ceph-cluster/values.yaml)? If you deployed the CephCluster differently, please share that yaml too. Thanks

@craigminihan (Author)

@leseb I've not got as far as the cluster chart yet. I've just implemented Windows node functionality and was making sure I didn't break anything else. I'll take a look at the cluster chart over the next few days and post a values.yaml when I have one.

@craigminihan (Author)

@leseb I've reset my cluster and re-run, this time with the operator and cluster charts at version 1.7.9.

I used the values from the original post above for the operator and the following for the cluster:

toolbox:
  enabled: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
cephClusterSpec:
  dashboard:
    enabled: false
  crashCollector:
    disable: true

There are a number of possible combinations here, however this time I see the rook-ceph-detect-version job scheduled onto the Windows node (node11), which causes the cluster to fail to start. If I drain node11 the job is re-scheduled onto a Linux node, allowing it to complete and the rest of the workloads to be scheduled.

The source for the chart script is here: https://github.com/RipcordSoftware/hvk8scluster/blob/feature/issue-63-rook-ceph/src/hyper-v/remote-commands/install-rook-ceph-chart.sh
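
For completeness, the drain workaround was just the standard kubectl commands, roughly as follows (flags may vary by kubectl version):

kubectl drain hvk8s-node11 --ignore-daemonsets --delete-emptydir-data
# wait for rook-ceph-detect-version to complete on a Linux node, then
kubectl uncordon hvk8s-node11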

@leseb (Member) commented Dec 8, 2021

The rook-ceph-detect-version job uses the same placement as the Monitors. The rook-ceph-cluster.values.yaml in the linked script does not set any placement. Please refer to https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph-cluster/values.yaml#L168-L192.
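
In other words, setting the monitor placement (or the catch-all placement) in the cluster chart values should keep the detect-version job off the Windows nodes. A minimal sketch, mirroring the placement structure from the linked values.yaml:

cephClusterSpec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values:
              - linux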

@craigminihan (Author)

@leseb I've reset and updated my configuration for 3 Linux and 3 Windows nodes. I added cephClusterSpec.placement.all with nodeAffinity set as above. I've noted issue #9445, which matches the behaviour of rook-ceph-mds-ceph-filesystem-a, rook-ceph-mds-ceph-filesystem-b and rook-ceph-rgw-ceph-objectstore-a; these are still placed on Windows nodes.

This time rook-ceph-osd-0 through 3 are placed on Linux nodes, and all fail with the following:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/common/config.cc: In function 'void md_config_t::set_val_default(ConfigValues&, const ConfigTracker&, std::string_view, const string&)' thread 7f3cd2b22080 time 2021-12-19T20:13:33.393019+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/common/config.cc: 279: FAILED ceph_assert(r >= 0)
 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x559c76bee54c]
 2: ceph-osd(+0x56a766) [0x559c76bee766]
 3: (md_config_t::set_val_default(ConfigValues&, ConfigTracker const&, std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc7) [0x559c773c66c7]
 4: (md_config_t::parse_env(unsigned int, ConfigValues&, ConfigTracker const&, char const*)+0x4ed) [0x559c773c9bfd]
 5: (global_pre_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int)+0x326) [0x559c7732d236]
 6: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x86) [0x559c7732f256]
 7: main()
 8: __libc_start_main()
 9: _start()

I'll review the placement YAML for the FileSystem and ObjectStore CRs and do a clean install of chart version 1.8.1.

@leseb (Member) commented Dec 20, 2021

For the OSDs, did the prepare job run successfully?
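
Something like this should show whether they completed and what they logged (assuming the prepare pods carry the usual app=rook-ceph-osd-prepare label):

kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --tail=100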

@craigminihan (Author)

@leseb the OSD issue was likely due to memory pressure causing the pods to be evicted. This is probably because the rook/ceph image size (~1.27GB) causes the node to allocate disk cache, reducing free memory. I'm not sure why the node treats this as low memory, though.
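
If it was kubelet eviction, the MemoryPressure node condition and the eviction events should confirm it; a quick check, assuming standard event reasons:

kubectl describe node hvk8s-node1 | grep -A 6 Conditions
kubectl get events -n rook-ceph --field-selector reason=Evicted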

The good news is that I can now run rook/ceph 1.8.1 on Kubernetes 1.23.1, and all the deployments, daemonsets and jobs are scheduled on the Linux nodes. The bad news is that the YAML for the cluster chart is a bit verbose:

toolbox:
  enabled: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
cephClusterSpec:
  dashboard:
    enabled: false
  crashCollector:
    disable: true
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values:
              - linux

cephFileSystems:
  - name: ceph-filesystem
    spec:
      metadataPool:
        replicated:
          size: 3
      dataPools:
        - failureDomain: host
          replicated:
            size: 3
      metadataServer:
        activeCount: 1
        activeStandby: true
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/os
                  operator: In
                  values:
                  - linux
    storageClass:
      enabled: true
      isDefault: false
      name: ceph-filesystem
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      mountOptions: []
      parameters:
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

cephObjectStores:
  - name: ceph-objectstore
    spec:
      metadataPool:
        failureDomain: host
        replicated:
          size: 3
      dataPool:
        failureDomain: host
        erasureCoded:
          dataChunks: 2
          codingChunks: 1
      preservePoolsOnDelete: true
      gateway:
        port: 80
        instances: 1
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/os
                  operator: In
                  values:
                  - linux
      healthCheck:
        bucket:
          interval: 60s
    storageClass:
      enabled: true
      name: ceph-bucket
      reclaimPolicy: Delete
      parameters:
        region: us-east-1

In order to set placement on metadataServer in cephFileSystems and gateway in cephObjectStores I've had to copy the whole YAML trees from the chart's values.yaml for each, since arrays supplied via --values or --set can't be merged.

Changing the array to an object in the chart values.yaml would allow the values to be overridden (via merge). The template would need to be updated accordingly to handle this.
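
To illustrate the idea (purely hypothetical shape, not something the current chart accepts): if cephFileSystems were keyed by name instead of being a list, an override file could merge in just the placement and inherit the rest of the spec from the chart defaults, since Helm deep-merges maps but replaces lists wholesale:

cephFileSystems:
  ceph-filesystem:        # map key rather than a list entry
    spec:
      metadataServer:
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/os
                  operator: In
                  values:
                  - linux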

@leseb (Member) commented Jan 3, 2022

@craigminihan glad to see it is working now.

Do you think this (changing the array to an object in the chart values.yaml) is something you could send a PR for? Thanks

@leseb added this to To do in v1.8 via automation Jan 3, 2022
@travisn removed this from To do in v1.8 Feb 22, 2022
@travisn added this to To do in v1.9 via automation Feb 22, 2022
@github-actions bot commented Mar 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions bot

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@parth-gr reopened this Mar 16, 2022
@github-actions bot

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@travisn removed this from To do in v1.9 Apr 1, 2022