[BUG] Rancher not starting after cluster restart #44446

Open
mjj29 opened this issue Feb 13, 2024 · 6 comments
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release

Comments

@mjj29

mjj29 commented Feb 13, 2024

Rancher Server Setup

  • Rancher version: 2.7.5 -> 2.8.2
  • Installation option: Helm Chart
    • k3s v1.26.6+k3s1 running on bare metal

Describe the bug
We restarted various nodes in the cluster to pick up new kernel versions. After rebooting, all the Rancher pods went into CrashLoopBackOff, failing to become ready. We tried deleting all the pods and letting the deployments recreate them, but they still don't come up. I tried applying an updated Helm chart to upgrade to 2.8.2 in case it was a bug fixed in the latest version, and they're still not coming up. The errors from the logs look like:

2024/02/13 17:42:24 [ERROR] failed to start cluster controllers c-m-pzv9mbq4: context canceled                                                                                                                                                              
2024/02/13 17:42:25 [ERROR] failed to start cluster controllers c-m-ngmtvgsw: context canceled                                                                                                                                                              
2024/02/13 17:42:32 [ERROR] error syncing 'u-7jdxyyztvn': handler mgmt-auth-userattributes-controller: refresh daemon not yet initialized, requeuing                                                                                                        
2024/02/13 17:42:32 [ERROR] error syncing 'user-mw7ch': handler mgmt-auth-userattributes-controller: refresh daemon not yet initialized, requeuing                                                                                                          
2024/02/13 17:42:57 [ERROR] failed to start cluster controllers c-m-pzv9mbq4: context canceled                                                                                                                                                              
2024/02/13 17:42:57 [ERROR] failed to start cluster controllers c-m-ngmtvgsw: context canceled                                                                                                                                                              
2024/02/13 17:42:57 [ERROR] failed to sync cache for management.cattle.io/v3, Kind=Token       

To Reproduce
Happens whenever the pods try to start up.

Result
Rancher doesn't work (we get a 404 on the ingress because there's no pod on the back end)

Expected Result
Rancher pods to start.

Here's the k3s output for the namespace. The first two rancher pods are from the new version and the second two are from the old version, waiting for the new ones to become ready. None of them get to Ready; they just alternate between Running and CrashLoopBackOff.

k3s kubectl get all -n cattle-system -o wide
NAME                                            READY   STATUS             RESTARTS         AGE   IP             NODE    NOMINATED NODE   READINESS GATES
pod/helm-operation-bwm8j                        0/2     Completed          0                14m   10.42.0.140    k8s07   <none>           <none>
pod/helm-operation-ngjgk                        0/2     Completed          0                16m   10.42.0.138    k8s07   <none>           <none>
pod/helm-operation-vjgg5                        0/2     Completed          0                14m   10.42.1.177    k8s08   <none>           <none>
pod/helm-operation-wk879                        0/2     Completed          0                16m   10.42.0.136    k8s07   <none>           <none>
pod/rancher-7855f7b44c-kk5ss                    0/1     CrashLoopBackOff   8 (3m58s ago)    29m   10.42.1.173    k8s08   <none>           <none>
pod/rancher-7855f7b44c-z87dh                    0/1     Running            9 (5m57s ago)    29m   10.42.0.135    k8s07   <none>           <none>
pod/rancher-dd57c78cf-l8zt8                     0/1     CrashLoopBackOff   16 (5m1s ago)    69m   10.42.4.57     k8s05   <none>           <none>
pod/rancher-dd57c78cf-pqtvb                     0/1     CrashLoopBackOff   16 (4m31s ago)   69m   10.42.5.131    k8s06   <none>           <none>
pod/rancher-webhook-65f5455d9c-b8x4m            1/1     Running            0                16m   10.42.0.137    k8s07   <none>           <none>
pod/system-upgrade-controller-64f5b6857-8vcpn   1/1     Running            0                69m   10.42.15.186   k8s10   <none>           <none>

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE    SELECTOR
service/rancher           ClusterIP   10.43.32.200    <none>        80/TCP,443/TCP   214d   app=rancher
service/rancher-webhook   ClusterIP   10.43.241.123   <none>        443/TCP          214d   app=rancher-webhook

NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE    CONTAINERS                  IMAGES                                      SELECTOR
deployment.apps/rancher                     0/3     2            0           214d   rancher                     rancher/rancher:v2.8.2                      app=rancher
deployment.apps/rancher-webhook             1/1     1            1           214d   rancher-webhook             rancher/rancher-webhook:v0.4.2              app=rancher-webhook
deployment.apps/system-upgrade-controller   1/1     1            1           214d   system-upgrade-controller   rancher/system-upgrade-controller:v0.10.0   upgrade.cattle.io/controller=system-upgrade-controller

NAME                                                  DESIRED   CURRENT   READY   AGE    CONTAINERS                  IMAGES                                      SELECTOR
replicaset.apps/rancher-7855f7b44c                    2         2         0       29m    rancher                     rancher/rancher:v2.8.2                      app=rancher,pod-template-hash=7855f7b44c
replicaset.apps/rancher-dd57c78cf                     2         2         0       214d   rancher                     rancher/rancher:v2.7.5                      app=rancher,pod-template-hash=dd57c78cf
replicaset.apps/rancher-webhook-65f5455d9c            1         1         1       16m    rancher-webhook             rancher/rancher-webhook:v0.4.2              app=rancher-webhook,pod-template-hash=65f5455d9c
replicaset.apps/rancher-webhook-788c48b988            0         0         0       214d   rancher-webhook             rancher/rancher-webhook:v0.3.5              app=rancher-webhook,pod-template-hash=788c48b988
replicaset.apps/system-upgrade-controller-64f5b6857   1         1         1       214d   system-upgrade-controller   rancher/system-upgrade-controller:v0.10.0   pod-template-hash=64f5b6857,upgrade.cattle.io/controller=system-upgrade-controller

The events look like:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  32m                  default-scheduler  Successfully assigned cattle-system/rancher-7855f7b44c-z87dh to k8s07
  Normal   Pulling    32m                  kubelet            Pulling image "rancher/rancher:v2.8.2"
  Normal   Pulled     31m                  kubelet            Successfully pulled image "rancher/rancher:v2.8.2" in 44.153320876s     (44.153335518s including waiting)
  Normal   Killing    29m                  kubelet            Container rancher failed liveness probe, will be restarted
  Normal   Created    29m (x2 over 31m)    kubelet            Created container rancher
  Normal   Pulled     29m                  kubelet            Container image "rancher/rancher:v2.8.2" already present on machine
  Normal   Started    29m (x2 over 31m)    kubelet            Started container rancher
  Warning  Unhealthy  11m (x48 over 31m)   kubelet            Readiness probe failed: Get "http://10.42.0.135:80/healthz": dial tcp 10.42.0.135:80: connect: connection refused
  Warning  BackOff    7m7s (x27 over 21m)  kubelet            Back-off restarting failed container rancher in pod rancher-7855f7b44c-z87dh_cattle-system(b7ff3456-5b4f-4dd1-b152-356cc95cd2a3)
  Warning  Unhealthy  102s (x26 over 30m)  kubelet            Liveness probe failed: Get "http://10.42.0.135:80/healthz": dial tcp 10.42.0.135:80: connect: connection refused

This is causing a production outage for us at the moment, so it's urgent.

@mjj29 mjj29 added the kind/bug label Feb 13, 2024
@jdoc-sag

jdoc-sag commented Feb 16, 2024

I work with mjj29. We have now managed to start Rancher.

We observed "failed to wait for caches to sync" in the log and, based on the discussion in #38177, increased the initialDelaySeconds of the livenessProbe and readinessProbe configurations of the rancher deployment (to 300 seconds).
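
For anyone who wants to try the same workaround, this is roughly what we did, patched directly onto the deployment. It's a sketch only: the container index, probe fields and the 300-second value are assumptions based on our default Helm install, and a later helm upgrade may overwrite the change.

# Bump both probes' initial delay so rancher has time to sync caches before being killed
k3s kubectl -n cattle-system patch deployment rancher --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 300},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 300}
]'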

Although it might be unrelated, we also observed "Refusing to reset the config and clean up resources of the auth provider ... because its auth config annotation management.cattle.io/auth-provider-cleanup is set to rancher-locked." after updating Rancher in an attempt to resolve the issue. We solved this by manually triggering cleanup of the auth-providers. See "Rancher might retain resources from a disabled auth provider configuration" in the release notes.
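
The manual trigger amounted to flipping that annotation on each affected auth config, roughly as below. Treat this as a hedged sketch: the provider name is a placeholder, and setting the annotation to "unlocked" is the cleanup trigger described in the release notes, so double-check the notes for your version before running it.

# List auth configs, then mark the rancher-locked one for cleanup
# (assumes the release-notes workaround of setting the annotation to "unlocked" applies)
k3s kubectl get authconfigs.management.cattle.io
k3s kubectl annotate authconfig <auth-provider-name> management.cattle.io/auth-provider-cleanup=unlocked --overwrite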

#43641 also quotes the "mgmt-auth-userattributes-controller: refresh daemon not yet initialized" error and might be fixed by increasing the initialDelaySeconds.

@Martin-Weiss

Could you try to scale the deployment to 0 and then to 1?
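Something like this, assuming the deployment name and namespace from the output above:

# Scale rancher down and back up; use --replicas=3 afterwards to restore the original replica count
k3s kubectl -n cattle-system scale deployment rancher --replicas=0
k3s kubectl -n cattle-system scale deployment rancher --replicas=1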

@mjj29

mjj29 commented Feb 22, 2024

Hi Martin, since we have it running now, we don't want to test that and accidentally wedge it again.

@Martin-Weiss

Ok - so waiting until it happens again? ;-)

@jdoc-sag

For now, I think so. Thanks.

@wargamez

We had the same problem and the "fix" from @jdoc-sag worked for us as well.
