
[2.4] system pods being scheduled on dead nodes #30790

Closed
superseb opened this issue Jan 12, 2021 · 2 comments

@superseb (Contributor) commented Jan 12, 2021

Backport of #27734

@sadiapoddar commented Jan 13, 2021

Reproduced with 2.4.13-rc1, commit ID b5935be.
Steps:

  1. Created a custom cluster of four nodes (1 etcd+controlplane node + 3 worker nodes).
  2. Powered off the worker node where the coredns pod was running.
  3. Noticed that all the workloads running on that worker node went to an 'updating' state.
  4. Waited about 30 minutes to see whether the system workloads that had been running on the powered-off node would be recreated automatically on the other worker nodes, but the workloads remained stuck in the 'updating' state indefinitely (see the default-toleration sketch after this list).
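For reference, on a stock Kubernetes cluster the DefaultTolerationSeconds admission plugin injects the tolerations below into every pod that does not declare its own, so pods on an unreachable node would normally be evicted and rescheduled after about five minutes. This is a sketch of upstream Kubernetes defaults shown for context only, not output captured from this cluster:

```yaml
# Upstream Kubernetes defaults injected by the DefaultTolerationSeconds
# admission plugin (illustrative; not taken from this reproduction):
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
```

Given those defaults, workloads staying in 'updating' well past the five-minute mark suggests the problem lies in how the system pods are managed rather than in the node tolerations themselves.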

[Screenshot: Screen Shot 2021-01-12 at 6 23 03 PM]

The coredns pod that was running on the powered-off node remained in an 'Unknown' state and was never recreated on another available worker node.

[Screenshot: Screen Shot 2021-01-12 at 6 28 09 PM]

Deleted one of the system pods, metrics-server, which was running on the powered-off node; with 2.4.13-rc1 the replacement pod was spawned on an available node, not on the powered-off node.
This behavior is different from what we saw in the 2.4.3 version.

[Screenshot: Screen Shot 2021-01-12 at 6 39 21 PM]

@sadiapoddar commented Jan 14, 2021

Verified on 2.4.13-rc3, commit 013c038, by adding the following add-on tolerations to the cluster configuration:

```yaml
monitoring:
  provider: metrics-server
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    toleration_seconds: 15
ingress:
  provider: nginx
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    toleration_seconds: 15
dns:
  provider: coredns
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    toleration_seconds: 15
```
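Assuming Rancher propagates these add-on settings onto the generated system workloads, each add-on pod should carry a toleration equivalent to the sketch below; the field names are standard Kubernetes pod-spec names, and the mapping from the snake_case config key is an illustration, not output captured from this cluster:

```yaml
# Illustrative pod-spec equivalent of the add-on config above
# (assumes toleration_seconds maps to tolerationSeconds):
spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 15
```

With tolerationSeconds set to 15, pods on a node that becomes unreachable are evicted after 15 seconds rather than the default 300, which matches the quick recreation observed in the steps below.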
  
  1. Created a custom cluster of four nodes (1 etcd+controlplane node + 3 worker nodes).
  2. Powered off two worker nodes: the one where the coredns pod was running and the one where the metrics-server pod was running.
  3. Noticed that all the workloads running on those worker nodes went to an 'updating' state.
  4. All pods with the add-on tolerations that had been running on the powered-off nodes got recreated automatically on the available nodes.
  5. Verified that the add-on pods' tolerations were set properly in the node-scheduling section.

Repeated the above steps a few times to confirm that, with the add-on tolerations, pods are not scheduled on dead nodes and are automatically recreated on the available nodes.

The fix works with add-ons.
