[BUG] Rancher not starting after cluster restart #44446
Comments
I work with mjj29. We have now managed to start Rancher. We observed "failed to wait for caches to sync" in the logs and, based on the discussion in #38177, increased the initialDelaySeconds of the livenessProbe and readinessProbe configurations of the rancher deployment to 300 seconds. Although it might be unrelated, after updating Rancher in an attempt to resolve the issue we also observed "Refusing to reset the config and clean up resources of the auth provider ... because its auth config annotation management.cattle.io/auth-provider-cleanup is set to rancher-locked." We solved this by manually triggering cleanup of the auth providers; see "Rancher might retain resources from a disabled auth provider configuration" in the release notes. #43641 also quotes the "mgmt-auth-userattributes-controller: refresh daemon not yet initialized" error and might be fixed by increasing initialDelaySeconds.
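For reference, the probe change described above amounts to something like the following fragment of the rancher Deployment spec (edited via `kubectl -n cattle-system edit deploy rancher`). The probe path, port, and period shown here are illustrative of the chart defaults; only initialDelaySeconds was changed, so verify the other fields against your own deployment:

```yaml
spec:
  template:
    spec:
      containers:
        - name: rancher
          livenessProbe:
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 300   # raised so caches have time to sync
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 300   # raised from the default
            periodSeconds: 30
```

Note that edits made directly to the deployment will be overwritten by the next `helm upgrade`, so you may want to carry the change in your values as well.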
Could you try scaling the deployment to 0 and then back to 1?
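For anyone else hitting this, the suggestion above is roughly the following (assuming the default chart namespace and names, `cattle-system` and `rancher` — adjust if your install differs):

```shell
# Scale the rancher deployment down to zero and wait for the pods to go away,
# then scale back up and watch the rollout.
kubectl -n cattle-system scale deployment rancher --replicas=0
kubectl -n cattle-system wait --for=delete pod -l app=rancher --timeout=300s
kubectl -n cattle-system scale deployment rancher --replicas=1
kubectl -n cattle-system rollout status deployment rancher --timeout=600s
```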
Hi Martin, since we have it running now, we don't want to test that and accidentally wedge it again.
Ok - so waiting until it happens again? ;-) |
For now, I think so. Thanks. |
We had the same problem and the "fix" from @jdoc-sag worked for us as well.
Rancher Server Setup
Describe the bug
We restarted various nodes in the cluster to pick up new kernel versions. After rebooting, all the rancher pods went into a crash loop, failing to become ready. We tried deleting all the pods and letting the deployments recreate them, and they're still not coming up. I tried applying an updated helm chart to upgrade to 2.8.2 in case it was a bug fixed in the latest version, and they're still not coming up. The errors from the logs look like:
To Reproduce
Happens whenever the pods try to start up.
Result
Rancher doesn't work (we get a 404 on the ingress because there's no pod on the back end)
Expected Result
rancher pods to start
Here's the k3s output for the namespace. The first two pods are from the new version, and the second two are from the old version, waiting for the new ones to become ready. None of them reach 'Ready'; they just alternate between Running and CrashLoopBackOff.
The events look like:
This is causing us production down at the moment, so it's urgent.