[BUG] Rancher not starting after cluster restart #44446

Open
mjj29 opened this issue Feb 13, 2024 · 6 comments
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release

Comments

@mjj29

mjj29 commented Feb 13, 2024

Rancher Server Setup

  • Rancher version: 2.7.5 -> 2.8.2
  • Installation option: Helm Chart
    • k3s v1.26.6+k3s1 running on bare metal

Describe the bug
We restarted various nodes in the cluster to pick up new kernel versions. After rebooting, all the Rancher pods went into CrashLoopBackOff, failing to become ready. We tried deleting all the pods and letting the deployments recreate them, but they still don't come up. I tried applying an updated Helm chart to upgrade to 2.8.2 in case it was a bug fixed in the latest version, and they're still not coming up. The errors from the logs look like:

2024/02/13 17:42:24 [ERROR] failed to start cluster controllers c-m-pzv9mbq4: context canceled                                                                                                                                                              
2024/02/13 17:42:25 [ERROR] failed to start cluster controllers c-m-ngmtvgsw: context canceled                                                                                                                                                              
2024/02/13 17:42:32 [ERROR] error syncing 'u-7jdxyyztvn': handler mgmt-auth-userattributes-controller: refresh daemon not yet initialized, requeuing                                                                                                        
2024/02/13 17:42:32 [ERROR] error syncing 'user-mw7ch': handler mgmt-auth-userattributes-controller: refresh daemon not yet initialized, requeuing                                                                                                          
2024/02/13 17:42:57 [ERROR] failed to start cluster controllers c-m-pzv9mbq4: context canceled                                                                                                                                                              
2024/02/13 17:42:57 [ERROR] failed to start cluster controllers c-m-ngmtvgsw: context canceled                                                                                                                                                              
2024/02/13 17:42:57 [ERROR] failed to sync cache for management.cattle.io/v3, Kind=Token       

To Reproduce
Happens whenever the pods try to start up.

Result
Rancher doesn't work (we get a 404 on the ingress because there's no pod on the back end)

Expected Result
Rancher pods to start.

Here's the k3s output for the namespace. The first two rancher pods are from the new version and the second two are from the old version, waiting for the new ones to become ready. None of them get to Ready; they just alternate between Running and CrashLoopBackOff.

k3s kubectl get all -n cattle-system -o wide
NAME                                            READY   STATUS             RESTARTS         AGE   IP             NODE    NOMINATED NODE   READINESS GATES
pod/helm-operation-bwm8j                        0/2     Completed          0                14m   10.42.0.140    k8s07   <none>           <none>
pod/helm-operation-ngjgk                        0/2     Completed          0                16m   10.42.0.138    k8s07   <none>           <none>
pod/helm-operation-vjgg5                        0/2     Completed          0                14m   10.42.1.177    k8s08   <none>           <none>
pod/helm-operation-wk879                        0/2     Completed          0                16m   10.42.0.136    k8s07   <none>           <none>
pod/rancher-7855f7b44c-kk5ss                    0/1     CrashLoopBackOff   8 (3m58s ago)    29m   10.42.1.173    k8s08   <none>           <none>
pod/rancher-7855f7b44c-z87dh                    0/1     Running            9 (5m57s ago)    29m   10.42.0.135    k8s07   <none>           <none>
pod/rancher-dd57c78cf-l8zt8                     0/1     CrashLoopBackOff   16 (5m1s ago)    69m   10.42.4.57     k8s05   <none>           <none>
pod/rancher-dd57c78cf-pqtvb                     0/1     CrashLoopBackOff   16 (4m31s ago)   69m   10.42.5.131    k8s06   <none>           <none>
pod/rancher-webhook-65f5455d9c-b8x4m            1/1     Running            0                16m   10.42.0.137    k8s07   <none>           <none>
pod/system-upgrade-controller-64f5b6857-8vcpn   1/1     Running            0                69m   10.42.15.186   k8s10   <none>           <none>

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE    SELECTOR
service/rancher           ClusterIP   10.43.32.200    <none>        80/TCP,443/TCP   214d   app=rancher
service/rancher-webhook   ClusterIP   10.43.241.123   <none>        443/TCP          214d   app=rancher-webhook

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS                  IMAGES                                      SELECTOR
deployment.apps/rancher                     0/3     2            0           214d   rancher                     rancher/rancher:v2.8.2                      app=rancher
deployment.apps/rancher-webhook             1/1     1            1           214d   rancher-webhook             rancher/rancher-webhook:v0.4.2              app=rancher-webhook
deployment.apps/system-upgrade-controller   1/1     1            1           214d   system-upgrade-controller   rancher/system-upgrade-controller:v0.10.0   upgrade.cattle.io/controller=system-upgrade-controller

NAME                                                  DESIRED   CURRENT   READY   AGE    CONTAINERS                  IMAGES                                      SELECTOR
replicaset.apps/rancher-7855f7b44c                    2         2         0       29m    rancher                     rancher/rancher:v2.8.2                      app=rancher,pod-template-hash=7855f7b44c
replicaset.apps/rancher-dd57c78cf                     2         2         0       214d   rancher                     rancher/rancher:v2.7.5                      app=rancher,pod-template-hash=dd57c78cf
replicaset.apps/rancher-webhook-65f5455d9c            1         1         1       16m    rancher-webhook             rancher/rancher-webhook:v0.4.2              app=rancher-webhook,pod-template-hash=65f5455d9c
replicaset.apps/rancher-webhook-788c48b988            0         0         0       214d   rancher-webhook             rancher/rancher-webhook:v0.3.5              app=rancher-webhook,pod-template-hash=788c48b988
replicaset.apps/system-upgrade-controller-64f5b6857   1         1         1       214d   system-upgrade-controller   rancher/system-upgrade-controller:v0.10.0   pod-template-hash=64f5b6857,upgrade.cattle.io/controller=system-upgrade-controller

The events look like:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  32m                  default-scheduler  Successfully assigned cattle-system/rancher-7855f7b44c-z87dh to k8s07
  Normal   Pulling    32m                  kubelet            Pulling image "rancher/rancher:v2.8.2"
  Normal   Pulled     31m                  kubelet            Successfully pulled image "rancher/rancher:v2.8.2" in 44.153320876s     (44.153335518s including waiting)
  Normal   Killing    29m                  kubelet            Container rancher failed liveness probe, will be restarted
  Normal   Created    29m (x2 over 31m)    kubelet            Created container rancher
  Normal   Pulled     29m                  kubelet            Container image "rancher/rancher:v2.8.2" already present on machine
  Normal   Started    29m (x2 over 31m)    kubelet            Started container rancher
  Warning  Unhealthy  11m (x48 over 31m)   kubelet            Readiness probe failed: Get "http://10.42.0.135:80/healthz": dial tcp 10.42.0.135:80: connect: connection refused
  Warning  BackOff    7m7s (x27 over 21m)  kubelet            Back-off restarting failed container rancher in pod rancher-7855f7b44c-z87dh_cattle-system(b7ff3456-5b4f-4dd1-b152-356cc95cd2a3)
  Warning  Unhealthy  102s (x26 over 30m)  kubelet            Liveness probe failed: Get "http://10.42.0.135:80/healthz": dial tcp 10.42.0.135:80: connect: connection refused

This is causing a production outage for us at the moment, so it's urgent.

@mjj29 mjj29 added the kind/bug label Feb 13, 2024
@jdoc-sag

jdoc-sag commented Feb 16, 2024

I work with mjj29. We have now managed to start Rancher.

We observed "failed to wait for caches to sync" in the log and, based on the discussion in #38177, increased the initialDelaySeconds of the livenessProbe and readinessProbe configurations of the rancher deployment (to 300 seconds).
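
For anyone who wants to try the same workaround, this is roughly what we did, patched directly onto the deployment. It's a sketch only: the container index, probe fields and the 300-second value are assumptions based on our default Helm install, and a later helm upgrade may overwrite the change.

# Bump both probes' initial delay so rancher has time to sync caches before being killed
k3s kubectl -n cattle-system patch deployment rancher --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 300},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 300}
]'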

Although it might be unrelated, we also observed "Refusing to reset the config and clean up resources of the auth provider ... because its auth config annotation management.cattle.io/auth-provider-cleanup is set to rancher-locked." after updating Rancher in an attempt to resolve the issue. We solved this by manually triggering cleanup of the auth-providers. See "Rancher might retain resources from a disabled auth provider configuration" in the release notes.
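
The manual trigger amounted to flipping that annotation on each affected auth config, roughly as below. Treat this as a hedged sketch: the provider name is a placeholder, and setting the annotation to "unlocked" is the cleanup trigger described in the release notes, so double-check the notes for your version before running it.

# List auth configs, then mark the rancher-locked one for cleanup
# (assumes the release-notes workaround of setting the annotation to "unlocked" applies)
k3s kubectl get authconfigs.management.cattle.io
k3s kubectl annotate authconfig <auth-provider-name> management.cattle.io/auth-provider-cleanup=unlocked --overwrite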

#43641 also quotes the "mgmt-auth-userattributes-controller: refresh daemon not yet initialized" error and might be fixed by increasing the initialDelaySeconds.

@Martin-Weiss

Could you try to scale the deployment to 0 and then to 1?
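Something like this, assuming the deployment name and namespace from the output above:

# Scale rancher down and back up; use --replicas=3 afterwards to restore the original replica count
k3s kubectl -n cattle-system scale deployment rancher --replicas=0
k3s kubectl -n cattle-system scale deployment rancher --replicas=1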

@mjj29

mjj29 commented Feb 22, 2024

Hi Martin, since we have it running now, we don't want to test that and accidentally wedge it again.

@Martin-Weiss

Ok - so waiting until it happens again? ;-)

@jdoc-sag

For now, I think so. Thanks.

@wargamez

We had the same problem and the "fix" from @jdoc-sag worked for us as well.
