
Don't schedule on Windows nodes by default #9165

Closed · craigminihan opened this issue Nov 15, 2021 · 11 comments

@craigminihan

Is this a bug report or feature request?

  • Feature Request

What should the feature do:
When deployed into a mixed Linux and Windows cluster, don't attempt to schedule on the Windows nodes.

What is the use case behind this feature:
The Rook/Ceph chart should schedule on the Linux nodes, essentially ignoring any Windows nodes that are present. There is support for nodeSelector and the various *NodeAffinity settings, which is partially successful, however it is verbose and some pods still end up scheduled on Windows nodes (node11 in this case), where they simply fail. See:

hvk8s@hvk8s-master:~$ kubectl get po -A -o wide | grep node11
kube-system      kube-flannel-ds-windows-amd64-6nkq2                      1/1     Running                 0              3d1h    172.31.0.21   hvk8s-node11   <none>           <none>
kube-system      kube-proxy-windows-vjmms                                 1/1     Running                 0              3d1h    172.30.2.3    hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-crashcollector-hvk8s-node11-689cf66dcf-d4hk4   0/1     Init:ImagePullBackOff   0              19m     172.30.2.21   hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-crashcollector-hvk8s-node11-864b755d6f-jzpgq   0/1     Init:ImagePullBackOff   0              20m     172.30.2.19   hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-mgr-a-d8f84c875-72dms                          0/1     Init:ImagePullBackOff   0              19m     172.30.2.20   hvk8s-node11   <none>           <none>
rook-ceph        rook-ceph-osd-prepare-hvk8s-node11--1-r7pzw              0/1     Init:0/1                0              13m     <none>        hvk8s-node11   <none>           <none>

The operator chart was deployed with values.yaml as follows:

nodeSelector:
  "kubernetes.io/os": linux
discover:
  nodeAffinity: kubernetes.io/os=linux
csi:
  provisionerNodeAffinity: kubernetes.io/os=linux
  pluginNodeAffinity: kubernetes.io/os=linux
  rbdProvisionerNodeAffinity: kubernetes.io/os=linux
  rbdPluginNodeAffinity: kubernetes.io/os=linux
  cephFSProvisionerNodeAffinity: kubernetes.io/os=linux
  cephFSPluginNodeAffinity: kubernetes.io/os=linux
agent:
  nodeAffinity: kubernetes.io/os=linux
admissionController:
  nodeAffinity: kubernetes.io/os=linux
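
A note on what I expect these values to do: each single-line key=value affinity string should be rendered by the chart/operator into a required node affinity term on the corresponding pods, roughly like the sketch below (my reading of the settings, not verified against the templates):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/os
          operator: In
          values:
          - linux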

Environment:
A test cluster showing the problem above:

NAME           STATUS   ROLES                  AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
hvk8s-master   Ready    control-plane,master   3d1h    v1.22.2   172.31.0.10   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
hvk8s-node1    Ready    <none>                 3d1h    v1.22.2   172.31.0.11   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
hvk8s-node11   Ready    <none>                 3d1h    v1.22.2   172.31.0.21   <none>        Windows Server Standard        10.0.19042.867    docker://19.3.18
hvk8s-node2    Ready    <none>                 2d23h   v1.22.2   172.31.0.12   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
hvk8s-node3    Ready    <none>                 2d23h   v1.22.2   172.31.0.13   <none>        Debian GNU/Linux 10 (buster)   4.19.0-18-amd64   docker://19.3.15
@leseb (Member) commented Nov 15, 2021

Did you also deploy the CephCluster CRD with Helm? If so, can you share the values.yaml (cluster/charts/rook-ceph-cluster/values.yaml)? If you deployed the CephCluster differently, please share that yaml too. Thanks

@craigminihan (Author)

@leseb I've not got as far as the cluster chart yet. I've just implemented Windows node functionality and was making sure I didn't break anything else. I'll take a look at the cluster chart over the next few days and post a values.yaml when I have one.

@craigminihan (Author)

@leseb I've reset my cluster and re-run, this time with the operator and cluster charts at version 1.7.9.

I used the values from the original post above for the operator and the following for the cluster:

toolbox:
  enabled: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
cephClusterSpec:
  dashboard:
    enabled: false
  crashCollector:
    disable: true

There are a number of possible combinations here, however this time I see the rook-ceph-detect-version job scheduled onto the Windows node (node11), which causes the cluster to fail to start. If I drain node11 the job is re-scheduled onto a Linux node, allowing it to complete and the rest of the workloads to be scheduled.

The source for the chart script is here: https://github.com/RipcordSoftware/hvk8scluster/blob/feature/issue-63-rook-ceph/src/hyper-v/remote-commands/install-rook-ceph-chart.sh
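
For completeness, the drain workaround was just the standard kubectl commands, roughly as follows (flags may vary by kubectl version):

kubectl drain hvk8s-node11 --ignore-daemonsets --delete-emptydir-data
# wait for rook-ceph-detect-version to complete on a Linux node, then
kubectl uncordon hvk8s-node11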

@leseb (Member) commented Dec 8, 2021

The rook-ceph-detect-version job uses the same placement as the Monitors. The rook-ceph-cluster.values.yaml in the linked script does not set any placement. Please refer to https://github.com/rook/rook/blob/master/deploy/charts/rook-ceph-cluster/values.yaml#L168-L192.
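
In other words, setting the monitor placement (or the catch-all placement) in the cluster chart values should keep the detect-version job off the Windows nodes. A minimal sketch, mirroring the placement structure from the linked values.yaml:

cephClusterSpec:
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values:
              - linux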

@craigminihan (Author)

@leseb I've reset and updated my configuration for 3 Linux and 3 Windows nodes. I added cephClusterSpec.placement.all with nodeAffinity set as above. I've noted issue #9445, which matches the behaviour of rook-ceph-mds-ceph-filesystem-a, rook-ceph-mds-ceph-filesystem-b and rook-ceph-rgw-ceph-objectstore-a; these are still placed on Windows nodes.

This time rook-ceph-osd-0 through 3 are placed on Linux nodes, and all fail with the following:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/common/config.cc: In function 'void md_config_t::set_val_default(ConfigValues&, const ConfigTracker&, std::string_view, const string&)' thread 7f3cd2b22080 time 2021-12-19T20:13:33.393019+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/common/config.cc: 279: FAILED ceph_assert(r >= 0)
 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x559c76bee54c]
 2: ceph-osd(+0x56a766) [0x559c76bee766]
 3: (md_config_t::set_val_default(ConfigValues&, ConfigTracker const&, std::basic_string_view<char, std::char_traits<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xc7) [0x559c773c66c7]
 4: (md_config_t::parse_env(unsigned int, ConfigValues&, ConfigTracker const&, char const*)+0x4ed) [0x559c773c9bfd]
 5: (global_pre_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int)+0x326) [0x559c7732d236]
 6: (global_init(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const*, std::vector<char const*, std::allocator<char const*> >&, unsigned int, code_environment_t, int, bool)+0x86) [0x559c7732f256]
 7: main()
 8: __libc_start_main()
 9: _start()

I'll review the placement YAML for the FileSystem and ObjectStore CRs and do a clean install of chart version 1.8.1.

@leseb (Member) commented Dec 20, 2021

For the OSDs, did the prepare job run successfully?
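
Something like this should show whether they completed and what they logged (assuming the prepare pods carry the usual app=rook-ceph-osd-prepare label):

kubectl -n rook-ceph get pods -l app=rook-ceph-osd-prepare
kubectl -n rook-ceph logs -l app=rook-ceph-osd-prepare --tail=100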

@craigminihan (Author)

@leseb the OSD issue was likely due to memory pressure causing the pods to be evicted. This is probably because the rook/ceph image size (~1.27GB) causes the node to allocate disk cache, reducing free memory. I'm not sure why the node treats this as low memory, though.
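
If it was kubelet eviction, the MemoryPressure node condition and the eviction events should confirm it; a quick check, assuming standard event reasons:

kubectl describe node hvk8s-node1 | grep -A 6 Conditions
kubectl get events -n rook-ceph --field-selector reason=Evicted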

The good news is that I can now run rook/ceph 1.8.1 on Kubernetes 1.23.1, and all the deployments, daemonsets and jobs are scheduled on the Linux nodes. The bad news is that the YAML for the cluster chart is a bit verbose:

toolbox:
  enabled: true
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/os
            operator: In
            values:
            - linux
cephClusterSpec:
  dashboard:
    enabled: false
  crashCollector:
    disable: true
  placement:
    all:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values:
              - linux

cephFileSystems:
  - name: ceph-filesystem
    spec:
      metadataPool:
        replicated:
          size: 3
      dataPools:
        - failureDomain: host
          replicated:
            size: 3
      metadataServer:
        activeCount: 1
        activeStandby: true
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/os
                  operator: In
                  values:
                  - linux
    storageClass:
      enabled: true
      isDefault: false
      name: ceph-filesystem
      reclaimPolicy: Delete
      allowVolumeExpansion: true
      mountOptions: []
      parameters:
        csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
        csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
        csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
        csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
        csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
        csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
        csi.storage.k8s.io/fstype: ext4

cephObjectStores:
  - name: ceph-objectstore
    spec:
      metadataPool:
        failureDomain: host
        replicated:
          size: 3
      dataPool:
        failureDomain: host
        erasureCoded:
          dataChunks: 2
          codingChunks: 1
      preservePoolsOnDelete: true
      gateway:
        port: 80
        instances: 1
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/os
                  operator: In
                  values:
                  - linux
      healthCheck:
        bucket:
          interval: 60s
    storageClass:
      enabled: true
      name: ceph-bucket
      reclaimPolicy: Delete
      parameters:
        region: us-east-1

In order to set placement on metadataServer in cephFileSystems and gateway in cephObjectStores I've had to copy the whole YAML trees from the chart's values.yaml for each, since arrays supplied via --values or --set can't be merged.

Changing the array to an object in the chart values.yaml would allow the values to be overridden (via merge). The template would need to be updated accordingly to handle this.
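
To illustrate the idea (purely hypothetical shape, not something the current chart accepts): if cephFileSystems were keyed by name instead of being a list, an override file could merge in just the placement and inherit the rest of the spec from the chart defaults, since Helm deep-merges maps but replaces lists wholesale:

cephFileSystems:
  ceph-filesystem:        # map key rather than a list entry
    spec:
      metadataServer:
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: kubernetes.io/os
                  operator: In
                  values:
                  - linux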

@leseb (Member) commented Jan 3, 2022

@craigminihan glad to see it is working now.

Do you think this (changing the array to an object in the chart values.yaml) is something you could send a PR for? Thanks

@leseb added this to To do in v1.8 via automation Jan 3, 2022
@travisn removed this from To do in v1.8 Feb 22, 2022
@travisn added this to To do in v1.9 via automation Feb 22, 2022
@github-actions bot commented Mar 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

@github-actions bot

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@parth-gr reopened this Mar 16, 2022
@github-actions bot

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

@travisn removed this from To do in v1.9 Apr 1, 2022