
[2.4] system pods being scheduled on dead nodes #30790

Closed
superseb opened this issue Jan 12, 2021 · 2 comments

@superseb (Contributor) commented Jan 12, 2021

Backport of #27734

@sadiapoddar commented Jan 13, 2021

Reproduced with 2.4.13-rc1, commit ID b5935be.
Steps:

  1. Created a custom cluster of four nodes (1 etcd+controlplane node + 3 worker nodes).
  2. Powered off the worker node where the coredns pod was running.
  3. Noticed that all the workloads running on that worker node went to an 'updating' state.
  4. Waited about 30 minutes to see whether the system workloads that had been running on the powered-off node would be recreated automatically on the other worker nodes, but the workloads remained stuck in the 'updating' state indefinitely (see the default-toleration sketch after this list).
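For reference, on a stock Kubernetes cluster the DefaultTolerationSeconds admission plugin injects the tolerations below into every pod that does not declare its own, so pods on an unreachable node would normally be evicted and rescheduled after about five minutes. This is a sketch of upstream Kubernetes defaults shown for context only, not output captured from this cluster:

```yaml
# Upstream Kubernetes defaults injected by the DefaultTolerationSeconds
# admission plugin (illustrative; not taken from this reproduction):
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```

Given those defaults, workloads staying in 'updating' well past the five-minute mark suggests the problem lies in how the system pods are managed rather than in the node tolerations themselves.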

[Screenshot: Screen Shot 2021-01-12 at 6 23 03 PM]

The coredns pod that was running on the powered-off node remained in an 'Unknown' state and was never recreated on another available worker node.

[Screenshot: Screen Shot 2021-01-12 at 6 28 09 PM]

Deleted one of the system pods, metrics-server, which was running on the powered-off node; with 2.4.13-rc1 the replacement pod was spawned on an available node, not on the powered-off node.
This behavior is different from what we saw in the 2.4.3 version.

[Screenshot: Screen Shot 2021-01-12 at 6 39 21 PM]

@sadiapoddar commented Jan 14, 2021

Verified on 2.4.13-rc3, commit 013c038, by adding the following add-on tolerations to the cluster configuration:

```yaml
monitoring:
  provider: metrics-server
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    toleration_seconds: 15
ingress:
  provider: nginx
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    toleration_seconds: 15
dns:
  provider: coredns
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    toleration_seconds: 15
```
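Assuming Rancher propagates these add-on settings onto the generated system workloads, each add-on pod should carry a toleration equivalent to the sketch below; the field names are standard Kubernetes pod-spec names, and the mapping from the snake_case config key is an illustration, not output captured from this cluster:

```yaml
# Illustrative pod-spec equivalent of the add-on config above
# (assumes toleration_seconds maps to tolerationSeconds):
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 15
```

With tolerationSeconds set to 15, pods on a node that becomes unreachable are evicted after 15 seconds rather than the default 300, which matches the quick recreation observed in the steps below.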
  
  1. Created a custom cluster of four nodes (1 etcd+controlplane node + 3 worker nodes).
  2. Powered off two worker nodes: the one where the coredns pod was running and the one where the metrics-server pod was running.
  3. Noticed that all the workloads running on those worker nodes went to an 'updating' state.
  4. All pods with the add-on tolerations that had been running on the powered-off nodes got recreated automatically on the available nodes.
  5. Verified that the add-on pods' tolerations were set properly in the node-scheduling section.

Repeated the above steps a few times to confirm that, with the add-on tolerations, pods are not scheduled on dead nodes and are automatically recreated on the available nodes.

The fix works with add-ons.
