
Thousands of deploy-job pods in pending state #755

Closed
HighwayofLife opened this issue Jul 6, 2018 · 11 comments

@HighwayofLife
Contributor

RKE version:
rke version v0.1.8-rc11

Docker version: (docker version, docker info preferred)
1.12

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
CoreOS

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
Azure Private Cloud

I just ran the newest version of RKE against an existing cluster previously provisioned by RKE, and the rke-network-plugin-deploy-job failed during the run. When I checked the node, I noticed that CPU usage was at 95%, disk writes were going crazy, and the kubelet was consuming a huge amount of CPU. It turned out that 19,000 pods for the rke-network-plugin-deploy-job had been created and were in Pending state.

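For anyone hitting the same pile-up, a quick way to confirm it and clear the stuck pods is to filter `kubectl get pods` output. This is a sketch, not an RKE command; the namespace (`kube-system`) and the sample output below are assumptions, and the pod-name pattern comes from this report:

```shell
# Hypothetical sample of `kubectl -n kube-system get pods --no-headers` output:
sample='rke-network-plugin-deploy-job-abc12   0/1   Pending   0   1m
rke-network-plugin-deploy-job-def34   0/1   Pending   0   1m
kube-dns-xyz                          3/3   Running   0   2d'

# Count the Pending deploy-job pods ($3 is the STATUS column):
echo "$sample" | awk '$3 == "Pending" && $1 ~ /rke-network-plugin-deploy-job/' | wc -l
# prints 2 (the two Pending deploy-job pods in the sample)

# On a live cluster, pipe the real output instead and bulk-delete once
# the root cause is fixed:
#   kubectl -n kube-system get pods --no-headers \
#     | awk '$3 == "Pending" && $1 ~ /rke-network-plugin-deploy-job/ {print $1}' \
#     | xargs -r kubectl -n kube-system delete pod
```

Filtering on the STATUS column avoids relying on `--field-selector` support, which varies across older kubectl versions.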

@deniseschannon

What k8s version were you using? Was this k8s 1.11?

@HighwayofLife
Contributor Author

No, this one was 1.10.3

@willmao

willmao commented Jul 7, 2018

Having a look at the logs of the kubelet and the CNI-install-related containers will help solve your problem.

@willmao

willmao commented Jul 7, 2018

rke/k8s will keep scheduling the CNI plugin pod if it fails

@HighwayofLife
Contributor Author

It should clean up the old pod before starting a new one, and after a certain number of tries it should report an error rather than continuing to start new pods.

@moelsayed
Contributor

moelsayed commented Jul 11, 2018

@HighwayofLife This is a known issue with k8s 1.10.x. It's fixed in 1.10.5. Using RKE version 0.1.8 with the default k8s 1.10.5-rancher1-1 should resolve this.

@deniseschannon deniseschannon added this to the v0.1.9 milestone Jul 11, 2018
@deniseschannon deniseschannon self-assigned this Jul 11, 2018
@deniseschannon

@HighwayofLife Let me know if you start using k8s v1.10.5 and still have these issues.

@deniseschannon deniseschannon modified the milestones: v0.1.9, v0.1.10 Jul 20, 2018
@HighwayofLife
Contributor Author

HighwayofLife commented Jul 22, 2018 via email

@moelsayed
Contributor

@HighwayofLife Are you still seeing this with v0.1.9?

@HighwayofLife
Contributor Author

No, I have not seen this reappear in 0.1.9.

@moelsayed
Contributor

Cool. I will close this issue for now.
