Rancher 2.6.5/2.6.6 liveness crashes the pod and restarts infinitely #38177
Comments
Referring to AWS, I would usually use at least 2 vCPU / 4 GB (2C4G) with a GP3 disk.
@niusmallnan I don't think it's a resources issue. I checked both the logs and monitored the node.
Node info: 4 vCPU, 16 GB memory, 32 GB temp storage. Also, if it were a resource issue, my changes to the livenessProbe would not have helped. Since I changed the delay time, the instance has been running for 10h straight. No restarts.
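For reference, the delay change can be applied in place with a JSON patch instead of editing the manifest by hand. A sketch, not official guidance — the container index (0) and the value 160 are assumptions, so adjust them to your deployment:

```shell
# Bump the liveness probe's initial delay on the rancher deployment.
# Assumes the rancher container is the first (index 0) in the pod spec.
kubectl -n cattle-system patch deployment rancher --type=json -p='[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
   "value": 160}
]'
```

Note that this change is lost the next time the deployment is re-rendered (e.g. by a helm upgrade), which is exactly why a chart-level option would help.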
Rancher's built-in controllers resync about every 10h, which consumes more resources in bursts.
I didn't know that, but in any case, Rancher kept failing on startup until I changed the delay time of the liveness probe. Also, it has now been running for 12h straight. So why would this be related to the mentioned ticket? Besides, I believe the machine I'm using is quite powerful, don't you think?
These logs may be related to the service capacity of kube-api. Could you share some k3s logs?
I think it would be hard to find any, as too much time has passed and I don't have them saved.
I'm using SQLite. But I also have to say that this specific error has happened only once. Before changing the delay I had constant Exit Code 137 failures (NOT OOM), and the livenessProbe was failing in the events description.
For a Rancher setup, the embedded etcd should be better than SQLite. One question: what's the qosClass of your Rancher pod?
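For anyone wanting to check this on their own cluster, the QoS class is visible on the pod status. A sketch — the `app=rancher` label selector is an assumption based on the chart's defaults, so adjust if yours differs:

```shell
# Print each rancher pod with its Kubernetes-assigned QoS class
# (Guaranteed, Burstable, or BestEffort).
kubectl -n cattle-system get pods -l app=rancher \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.qosClass}{"\n"}{end}'
```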
While I understand that, I have never had any issues running Rancher with SQLite. Also, I have a stable Rancher and a specific Rancher version running on the same architecture, and none of them failed like the latest one.
Is there any simple way to change it to etcd? I also want to emphasize that the pod has been running for 43h already, since I changed the liveness probe.
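On k3s, switching the datastore from SQLite to embedded etcd can reportedly be done by restarting the server with `--cluster-init`. A sketch under those assumptions — the paths are k3s defaults, and I'd back up the datastore first since I haven't verified the migration on every version:

```shell
# Stop k3s, back up the SQLite datastore, then restart with embedded etcd.
# --cluster-init tells k3s to initialize (and migrate to) embedded etcd.
systemctl stop k3s
cp -a /var/lib/rancher/k3s/server/db /var/lib/rancher/k3s/server/db.bak
k3s server --cluster-init
```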
Hello! I encountered a similar issue, with the same error at the end.
I managed to solve the error by cleaning up first. I then cloned the repository and edited the chart's probe settings. Now my deployment is stable.
Being able to configure livenessProbe settings in helm would be nice. |
Same issue happening here... |
Had the same problem (Rancher v2.6.1). For me, stopping the Rancher server, waiting for it to cool down a bit (CPU-wise), and starting Rancher again did the trick:

```shell
k3s kubectl -n cattle-system scale --replicas 0 deployments/rancher
sleep 15
k3s kubectl -n cattle-system scale --replicas 1 deployments/rancher
```
Maybe someone can create a PR for that.
SURE-5277
@moio @rmweir - Can you guys investigate this together? |
@Shaked can you try running the above?
The PR above and the issues below are to add probe customization to the Rancher chart, but for now we won't be adding it to the Rancher docs' "Helm chart options" page.
The long-term fix will now be tracked in #40587.
Release note: |
Rancher Server Setup
Currently rancher-2.6.6 (v2.6.6), but I also tried with 2.6.5 before. Basically following the rancher/latest repo.

Information about the Cluster
v1.20.15+k3s1

User Information
Describe the bug
Rancher tries to start and fails. First I kept seeing Exit Code 137 (NOT OOM). After that, I changed the livenessProbe of deploy/rancher. The first time it failed with one error; after I extended initialDelaySeconds to 160, Rancher kept failing with another, until at some point it came back to life.
Obviously this won't last, as once the deployment is updated again, liveness will fail and the pod won't start.
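To confirm that the restarts really come from the probe rather than the container itself, the last exit state and the probe-failure events can be inspected. A sketch — the `app=rancher` label is an assumption based on the chart's defaults:

```shell
# Last termination state of the rancher container (exit code, reason).
kubectl -n cattle-system describe pod -l app=rancher | grep -A 5 'Last State'

# Probe-failure events in the namespace (liveness/readiness failures
# show up with reason "Unhealthy").
kubectl -n cattle-system get events --field-selector reason=Unhealthy
```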
To Reproduce
I am really not sure what causes this issue, except for the things I have mentioned in the bug description.
Additional context
I guess the least that can be done here is to add an option to configure the liveness/readiness settings through Helm. However, this is obviously a patch, and further investigation is required.