
Rancher 2.6.5/2.6.6 liveness crashes the pod and restarts infinitely #38177

Closed

Shaked opened this issue Jul 3, 2022 · 20 comments

Labels: JIRA (to be used in correspondence with the internal ticketing system), QA/M, release-note (note this issue in the milestone's release notes), team/area1

Comments

Shaked commented Jul 3, 2022

Rancher Server Setup

  • Rancher version: 2.6.5 and 2.6.6
  • Installation option (Docker install/Helm Chart):
    Currently chart rancher-2.6.6 (v2.6.6), but 2.6.5 was also tried before. Basically following the rancher/latest repo.

Information about the Cluster

  • Kubernetes version: v1.20.15+k3s1
  • Cluster Type (Local/Downstream): Local

User Information

  • What is the role of the user logged in? Admin

Describe the bug
Rancher tries to start and fails. At first I kept seeing exit code 137 (NOT OOM). After that, I changed the deploy/rancher livenessProbe to:

        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /healthz
            port: 80
            scheme: HTTP
          initialDelaySeconds: 120
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 5

The first time it failed with:

2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-kmtbm: context canceled
E0703 10:16:11.471980      32 leaderelection.go:330] error retrieving resource lock kube-system/cattle-controllers: Get "https://10.0.188.63:443/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s": context canceled
I0703 10:16:11.472029      32 leaderelection.go:283] failed to renew lease kube-system/cattle-controllers: timed out waiting for the condition
E0703 10:16:11.472219      32 leaderelection.go:306] Failed to release lock: resource name may not be empty
I0703 10:16:11.473005      32 trace.go:205] Trace[367625047]: "Reflector ListAndWatch" name:pkg/mod/github.com/rancher/client-go@v1.23.3-rancher2/tools/cache/reflector.go:168 (03-Jul-2022 10:15:15.544) (total time: 55928ms):
Trace[367625047]: [55.928576086s] [55.928576086s] END
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-sqmd2: context canceled
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=RoleTemplate workers
2022/07/03 10:16:11 [INFO] Shutting down provisioning.cattle.io/v1, Kind=Cluster workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=CatalogTemplateVersion workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=RkeAddon workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=NodeDriver workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=PodSecurityPolicyTemplateProjectBinding workers
2022/07/03 10:16:11 [INFO] Shutting down catalog.cattle.io/v1, Kind=ClusterRepo workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=MultiClusterApp workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=AuthConfig workers
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=ProjectCatalog workers
2022/07/03 10:16:11 [ERROR] failed to call leader func: Get "https://10.0.188.63:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions": context canceled
2022/07/03 10:16:11 [INFO] Shutting down management.cattle.io/v3, Kind=NodePool workers
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-zr5m6: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-szxcd: context canceled
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=RoleBinding workers
2022/07/03 10:16:11 [INFO] Shutting down /v1, Kind=Namespace workers
2022/07/03 10:16:11 [INFO] Shutting down /v1, Kind=Secret workers
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding workers
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=Role workers
2022/07/03 10:16:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=ClusterRole workers
2022/07/03 10:16:11 [INFO] Shutting down /v1, Kind=ServiceAccount workers
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-g6mz8: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-btxww: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-rr2bb: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-vcvh9: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-rsbzh: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-79bn8: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-kg9ch: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-2tw8h: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-58dwl: context canceled
2022/07/03 10:16:11 [ERROR] failed to start cluster controllers c-q8q2q: context canceled
2022/07/03 10:16:11 [FATAL] leaderelection lost for cattle-controllers

After that I extended initialDelaySeconds to 160, and then Rancher kept failing with:

2022/07/03 10:20:11 [INFO] Shutting down apiextensions.k8s.io/v1, Kind=CustomResourceDefinition workers
2022/07/03 10:20:11 [INFO] Shutting down rbac.authorization.k8s.io/v1, Kind=Role workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=Group workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=CatalogTemplateVersion workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=NodeTemplate workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=MultiClusterAppRevision workers
2022/07/03 10:20:11 [INFO] Shutting down catalog.cattle.io/v1, Kind=App workers
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=ManagedChart workers
2022/07/03 10:20:11 [INFO] Starting management.cattle.io/v3, Kind=Cluster controller
2022/07/03 10:20:11 [INFO] Shutting down management.cattle.io/v3, Kind=Cluster workers
2022/07/03 10:20:11 [FATAL] failed to wait for caches to sync

Until at some point it came back to life.

Obviously this won't last: once the deployment is updated again, the liveness probe will fail and the pod won't start.

To Reproduce
I am really not sure what causes this issue beyond the things I have mentioned within the bug description.

Additional context

I guess the least that can be done here is to add an option to adjust the liveness/readiness settings through Helm. However, this is obviously a workaround, and further investigation is required.

niusmallnan (Contributor) commented:

Referring to AWS, I would usually use at least 2 vCPU / 4 GB with a GP3 disk.
From the log you provided, it is very likely that the compute resources are insufficient.

Shaked commented Jul 4, 2022

@niusmallnan I don't think it's a resources issue. I checked both the logs and monitored k top nodes and k top pods while running the broken instances (before adding delay time to the liveness probe). My Rancher instance runs on:

CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
1473m        38%    9445Mi          74%

Node info: 4 vCPU, 16G memory, 32G temp storage.

Also, if it were a resource issue, my changes to the livenessProbe would not have helped. Since I changed the delay time, the instance has already been running for 10h straight. No restarts.

niusmallnan (Contributor) commented:

Rancher's built-in controllers do resync about every 10h, which consumes more resources in bursts.

#37350

Shaked commented Jul 4, 2022

@niusmallnan

Rancher's built-in controllers do resync about every 10h, which consumes more resources in bursts.

#37350

I didn't know that, but in any case, Rancher kept failing on startup until I changed the delay time of the liveness probe. Also, it has now been running for 12h straight. So why would this be related to the mentioned ticket? I also believe that the machine I'm using is quite powerful, don't you think?

niusmallnan (Contributor) commented:

E0703 10:16:11.471980      32 leaderelection.go:330] error retrieving resource lock kube-system/cattle-controllers: Get "https://10.0.188.63:443/api/v1/namespaces/kube-system/configmaps/cattle-controllers?timeout=15m0s": context canceled
I0703 10:16:11.472029      32 leaderelection.go:283] failed to renew lease kube-system/cattle-controllers: timed out waiting for the condition
E0703 10:16:11.472219      32 leaderelection.go:306] Failed to release lock: resource name may not be empty

These logs may be related to the serving capacity of the kube-apiserver. Could you share some k3s logs?
What is the datastore of your k3s? etcd?

Shaked commented Jul 4, 2022

These logs may be related to the serving capacity of the kube-apiserver. Could you share some k3s logs?

I think it would be hard to find any as too much time has passed and I don't have them saved.

What is the datastore of your k3s? etcd?

sqlite

I also have to say that this specific error happened only once. Before changing the delay, I had a constant failure with exit code 137 (NOT OOM), and the livenessProbe was reported as failing in the pod's events.

niusmallnan (Contributor) commented:

For a Rancher setup, the embedded etcd should be better than SQLite.
https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/#existing-clusters

One question, what's the qosClass of your rancher pod?
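
For context, the linked page covers converting an existing cluster: a single-node k3s on the default SQLite datastore can be switched to embedded etcd by restarting the server with the --cluster-init flag. A minimal sketch, assuming k3s is invoked directly (a systemd-managed install would carry the flag in its service configuration instead):

# One-time migration: k3s converts the existing SQLite data to embedded etcd
# when the server is restarted with --cluster-init.
k3s server --cluster-init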

Shaked commented Jul 5, 2022

For a Rancher setup, the embedded etcd should be better than SQLite.
https://rancher.com/docs/k3s/latest/en/installation/ha-embedded/#existing-clusters

While I understand that, I have never had any issues running Rancher with SQLite. Also, I have a stable Rancher and a specific Rancher version running on the same architecture, and neither of them failed the way the latest one does.

One question, what's the qosClass of your rancher pod?

QoS Class: BestEffort

Any simple way to change it to Guaranteed?

I also want to emphasize that the pod has been running for 43h already, since I changed the liveness probe.
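
For reference, a pod only gets the Guaranteed QoS class when every container in it sets CPU and memory requests equal to its limits. A minimal sketch of such a resources block for the rancher container, with illustrative numbers that are not taken from this thread:

        # Equal requests and limits on every container yield qosClass: Guaranteed.
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
          limits:
            cpu: "2"
            memory: 8Gi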

tacerus (Member) commented Jul 17, 2022

Hello!

I encounter a similar issue with the Rancher pods getting restarted while they are still applying CRDs. How did you patch the initialDelaySeconds? I attempted to dump the deployment using kubectl -n=cattle-system get deployment rancher -oyaml > rancher-deployment.yaml and re-import it using kubectl -n=cattle-system apply -f rancher-deployment.yaml after changing the values; however, it fails with a huge amount of JSON and

Resource: "apps/v1, Resource=deployments", GroupVersionKind: "apps/v1, Kind=Deployment"
Name: "rancher", Namespace: "cattle-system"
for: "rancher-deployment.yaml": Operation cannot be fulfilled on deployments.apps "rancher": the object has been modified; please apply your changes to the latest version and try again

at the end.
I understand it's not ideal and that it would be overwritten by the next Helm run; however, my pods have been running before and only seem to be sluggish when a full redeployment is in progress. Hence, granting the pods more time to come up would, in my opinion, make sense, as not everyone runs their testing environment on blazing-fast hardware or expensive cloud services. Or am I missing something?
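
For what it's worth, that "object has been modified" error usually means the exported YAML still carries a stale metadata.resourceVersion, so the apiserver rejects the update once the live Deployment has changed underneath it. Editing the live object in place avoids that round trip; a sketch, assuming the deployment name shown above:

# Opens the live Deployment in $EDITOR; the probe fields can be adjusted there
# and are written back against the current resourceVersion.
kubectl -n cattle-system edit deployment rancher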

tacerus (Member) commented Jul 17, 2022

I managed to resolve the error by cleaning the entries that did not need patching out of the .yaml. However, it still would not apply the patched values to the deployment.

I then cloned the repository, edited the deployment.yaml template file and upgraded the Helm chart from my local checkout.

        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 600
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 600
          periodSeconds: 30

Now my Rancher pods got all the time they needed and happily came up as "Ready". I presume that means I will continue maintaining my own "fork" of the Helm chart, at least for my "slow" testing environment. I guess that's better than applying non-audited charts from the internet anyway. :-)
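
For anyone taking the same route, a hedged sketch of the upgrade step, assuming the edited chart sits in a local directory named ./rancher and the release should keep its existing values:

# Upgrade the release from the locally edited chart checkout,
# reusing the values already set on the release.
helm upgrade rancher ./rancher -n cattle-system --reuse-values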

ScionOfDesign commented:

Being able to configure livenessProbe settings in Helm would be nice.

sachasmart-weavik commented:

Same issue happening here...

samjustus added the team/area1, JIRA, and [zube]: Next Up labels on Sep 21, 2022
samjustus added this to the v2.7.0 milestone on Sep 21, 2022

GameScripting commented Sep 26, 2022

Had the same problem (Rancher v2.6.1). For me, stopping the Rancher server, waiting for it to cool down a bit (CPU-wise), and starting Rancher again did the trick:

k3s kubectl -n cattle-system scale --replicas 0 deployments/rancher
sleep 15
k3s kubectl -n cattle-system scale --replicas 1 deployments/rancher

GameScripting commented:

Maybe someone can create a kubectl patch command to change the probes without fiddling with YAML?
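
A hedged sketch of such a patch, assuming the container inside the Deployment is named rancher and reusing the delay tacerus chose above:

# Strategic merge patch: containers are matched by name, and only the listed
# probe fields are changed; no need to export and re-apply the whole Deployment.
kubectl -n cattle-system patch deployment rancher -p '
spec:
  template:
    spec:
      containers:
      - name: rancher
        livenessProbe:
          initialDelaySeconds: 600
        readinessProbe:
          initialDelaySeconds: 600
'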

samjustus modified the milestone from v2.7.0 to v2.7.1 on Sep 26, 2022
samjustus (Collaborator) commented:

sure-5277

samjustus (Collaborator) commented:

@moio @rmweir - Can you guys investigate this together?
@steffantucker - can you focus on making the probes configurable in the chart?

rmweir (Contributor) commented Oct 11, 2022

@Shaked can you try running kubectl get --raw /metrics | grep apiserver_storage if you have metrics enabled? If that doesn't return anything, or the values are all 0, can you run the following script?
Note: you will have to pass it one argument, such as all_resources.csv, for the file to write output to. Make sure KUBECONFIG is set to the kubeconfig path for your local cluster.

#!/bin/bash

# Writes a CSV (resource,count,time) listing every namespaced resource type,
# how many objects of it exist, and how long the list call took.
FILENAME=$1
TIMEFORMAT=%R  # make `time` emit only the elapsed real time in seconds
printf '%s\n' resource,count,time > $FILENAME
for resource in $(kubectl api-resources --verbs=list --namespaced -o name | grep -v "events.events.k8s.io" | grep -v "events" | sort | uniq); do
	printf %s ${resource}, >> $FILENAME
	# Count the objects (minus the header line) and time the call; the elapsed
	# time lands in the CSV through the stderr redirect.
	(time $(output=$(printf %s $(kubectl get --ignore-not-found ${resource} -A | wc -l)); if [[ "$output" -ne "0" ]]; then output=$((output-1)); fi; printf %s $output >> $FILENAME && printf %s "," >> $FILENAME)) 2>> $FILENAME
	sleep 2
done
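
A usage sketch, assuming the script is saved as count_resources.sh (the name is arbitrary) and the local cluster is the k3s one, whose kubeconfig sits at the default k3s path:

chmod +x count_resources.sh
KUBECONFIG=/etc/rancher/k3s/k3s.yaml ./count_resources.sh all_resources.csv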

brudnak self-assigned this on Oct 27, 2022
cbron (Contributor) commented Oct 28, 2022

The PR above and the issues below add probe customization to the Rancher chart, but for now we won't be adding it to the Rancher docs "helm charts options" page.
This issue will continue to track the "long term" fix for the slow startup times.

samjustus (Collaborator) commented:

The long-term fix will now be tracked in #40587.

rmweir added the release-note label on Feb 28, 2023
rmweir (Contributor) commented Feb 28, 2023

Release note: the startupProbe value, along with its failureThreshold and periodSeconds sub-values, is now exposed in the Rancher chart. It is off by default. Setting the startupProbe value uses defaults of 1 and 30 for failureThreshold and periodSeconds, respectively, if they are not set.
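
A hedged sketch of turning this on at upgrade time; the value names are assumed from the note above and should be checked against the chart's values before use:

# failureThreshold * periodSeconds = 20 * 30s gives a slow environment up to
# 10 minutes to start before the kubelet begins restarting the pod.
helm upgrade rancher rancher-latest/rancher -n cattle-system --reuse-values \
  --set startupProbe.failureThreshold=20 \
  --set startupProbe.periodSeconds=30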
