
[BUG] rancher v2.7.2 cattle-cluster-agent memory leak oom kill #41225

Closed

taehwanlee opened this issue Apr 20, 2023 · 47 comments
Labels: area/scalability 10k or bust · kind/bug · team/scaling

Comments

@taehwanlee

taehwanlee commented Apr 20, 2023

Rancher Server Setup

  • Rancher version: v2.7.2
  • Installation option (Docker install/Helm Chart):
    • Docker install
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: 1.25
  • Cluster Type (Local/Downstream):
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions:

Describe the bug

A memory leak occurs when using the Rancher Explorer UI.

To Reproduce

Result

Expected Result

Screenshots

image
image

Additional context

taehwanlee added the kind/bug label on Apr 20, 2023
@bananflugan

I upgraded our Rancher from 2.6.11 to 2.7.2 this Monday, and since then we have had the same issue. The cluster agents slowly eat more and more memory until the servers run out and often hang. A quick fix for this would be great.

image

@taehwanlee
Author

As an ad hoc workaround, I set a memory request and limit on the cattle-cluster-agent pod. When the pod restarts, the memory is reclaimed.
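
For reference, a minimal sketch of setting such a limit with kubectl set resources; the values are illustrative, and the container name cluster-register matches the patch shared later in this thread:

# illustrative values only - size them to your cluster
kubectl -n cattle-system set resources deployment/cattle-cluster-agent \
  -c cluster-register \
  --requests=memory=1Gi --limits=memory=2Gi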

taehwanlee changed the title from "[BUG] cattle-cluster-agent memory leak oom kill rancher 2.7.2" to "[BUG] rancher v2.7.2 cattle-cluster-agent memory leak oom kill" on Apr 21, 2023
@niusmallnan
Contributor

I could not reproduce this problem in my setup; I assume it depends on particular resource counts.

Could you share the cattle-cluster-agent logs and Rancher diagnostic data? You can visit /dashboard/diagnostic to get the diagnostic info of the downstream cluster, and provide a screenshot of the Top 10 resources.

@taehwanlee
Author

taehwanlee commented Apr 21, 2023

image

@taehwanlee
Author

taehwanlee commented Apr 21, 2023

cattle-cluster-agent log:
INFO: Environment: CATTLE_ADDRESS=10.1.103.85 CATTLE_CA_CHECKSUM=e85c2ca9019f3cb32847f02189113384b2f1445ee1c9ec55eab372b021fa500b CATTLE_CLUSTER=true CATTLE_CLUSTER_AGENT_PORT=tcp://172.20.38.167:80 CATTLE_CLUSTER_AGENT_PORT_443_TCP=tcp://172.20.38.167:443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_ADDR=172.20.38.167 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PORT=443 CATTLE_CLUSTER_AGENT_PORT_443_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_PORT_80_TCP=tcp://172.20.38.167:80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_ADDR=172.20.38.167 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PORT=80 CATTLE_CLUSTER_AGENT_PORT_80_TCP_PROTO=tcp CATTLE_CLUSTER_AGENT_SERVICE_HOST=172.20.38.167 CATTLE_CLUSTER_AGENT_SERVICE_PORT=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTP=80 CATTLE_CLUSTER_AGENT_SERVICE_PORT_HTTPS_INTERNAL=443 CATTLE_CLUSTER_REGISTRY= CATTLE_FEATURES=embedded-cluster-api=false,fleet=false,monitoringv1=false,multi-cluster-management=false,multi-cluster-management-agent=true,provisioningv2=false,rke2=false CATTLE_INGRESS_IP_DOMAIN=sslip.io CATTLE_INSTALL_UUID=bbe1ffd4-5cb7-42a0-bac1-36973cad0ec3 CATTLE_INTERNAL_ADDRESS= CATTLE_IS_RKE=false CATTLE_K8S_MANAGED=true CATTLE_NODE_NAME=cattle-cluster-agent-664454c687-28qgg CATTLE_SERVER=https://3.34.36.235 CATTLE_SERVER_VERSION=v2.7.2
INFO: Using resolv.conf: search cattle-system.svc.cluster.local svc.cluster.local cluster.local ap-northeast-2.compute.internal nameserver 172.20.0.10 options ndots:5
2023-04-20T22:11:08.602330373Z INFO: https://3.34.36.235/ping is accessible
2023-04-20T22:11:08.660017851Z INFO: Value from https://3.34.36.235/v3/settings/cacerts is an x509 certificate
time="2023-04-20T22:11:09Z" level=info msg="Listening on /tmp/log.sock"
2023-04-20T22:11:09.246868024Z time="2023-04-20T22:11:09Z" level=info msg="Rancher agent version v2.7.2 is starting"
2023-04-20T22:11:09.283108430Z time="2023-04-20T22:11:09Z" level=info msg="Connecting to wss://3.34.36.235/v3/connect/register with token starting with rxq8w4klb7g4mttwtsbs48m6dwc"
2023-04-20T22:11:09.283144861Z time="2023-04-20T22:11:09Z" level=info msg="Connecting to proxy" url="wss://3.34.36.235/v3/connect/register"
2023-04-20T22:11:09.295315489Z time="2023-04-20T22:11:09Z" level=info msg="Requesting kubelet certificate regeneration"
time="2023-04-20T22:11:09Z" level=info msg="Starting /v1, Kind=Service controller"
2023-04-20T22:11:09.466265677Z time="2023-04-20T22:11:09Z" level=info msg="Running in single server mode, will not peer connections"
2023-04-20T22:11:09.830604658Z time="2023-04-20T22:11:09Z" level=info msg="Applying CRD features.management.cattle.io"
2023-04-20T22:11:10.574631406Z time="2023-04-20T22:11:10Z" level=info msg="Applying CRD navlinks.ui.cattle.io"
2023-04-20T22:11:10.660839031Z time="2023-04-20T22:11:10Z" level=info msg="Applying CRD podsecurityadmissionconfigurationtemplates.management.cattle.io"
2023-04-20T22:11:10.762272331Z time="2023-04-20T22:11:10Z" level=info msg="Applying CRD clusters.management.cattle.io"
2023-04-20T22:11:10.879584601Z time="2023-04-20T22:11:10Z" level=info msg="Applying CRD apiservices.management.cattle.io"
2023-04-20T22:11:10.953440444Z time="2023-04-20T22:11:10Z" level=info msg="Applying CRD clusterregistrationtokens.management.cattle.io"
2023-04-20T22:11:11.028764691Z time="2023-04-20T22:11:11Z" level=info msg="Applying CRD settings.management.cattle.io"
2023-04-20T22:11:11.106891888Z time="2023-04-20T22:11:11Z" level=info msg="Applying CRD preferences.management.cattle.io"
2023-04-20T22:11:11.181140909Z time="2023-04-20T22:11:11Z" level=info msg="Applying CRD features.management.cattle.io"
time="2023-04-20T22:11:11Z" level=info msg="Applying CRD clusterrepos.catalog.cattle.io"
2023-04-20T22:11:11.331564796Z time="2023-04-20T22:11:11Z" level=info msg="Applying CRD operations.catalog.cattle.io"
2023-04-20T22:11:11.405439570Z time="2023-04-20T22:11:11Z" level=info msg="Applying CRD apps.catalog.cattle.io"
2023-04-20T22:11:12.738150612Z time="2023-04-20T22:11:12Z" level=info msg="Starting API controllers"
2023-04-20T22:11:12.866262848Z time="2023-04-20T22:11:12Z" level=info msg="Starting management.cattle.io/v3, Kind=Group controller"
2023-04-20T22:11:12.866330349Z time="2023-04-20T22:11:12Z" level=info msg="Starting management.cattle.io/v3, Kind=Token controller"
2023-04-20T22:11:12.866340269Z time="2023-04-20T22:11:12Z" level=info msg="Starting management.cattle.io/v3, Kind=UserAttribute controller"
2023-04-20T22:11:12.866563184Z time="2023-04-20T22:11:12Z" level=info msg="Starting /v1, Kind=Secret controller"
2023-04-20T22:11:12.866755309Z time="2023-04-20T22:11:12Z" level=info msg="Starting management.cattle.io/v3, Kind=User controller"
2023-04-20T22:11:12.866761699Z time="2023-04-20T22:11:12Z" level=info msg="Starting management.cattle.io/v3, Kind=GroupMember controller"
time="2023-04-20T22:11:12Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
2023-04-20T22:11:13.086914699Z time="2023-04-20T22:11:13Z" level=info msg="Starting management.cattle.io/v3, Kind=ClusterRegistrationToken controller"
2023-04-20T22:11:13.086950350Z time="2023-04-20T22:11:13Z" level=info msg="Starting management.cattle.io/v3, Kind=Cluster controller"
2023-04-20T22:11:13.087014611Z time="2023-04-20T22:11:13Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
time="2023-04-20T22:11:13Z" level=info msg="Starting management.cattle.io/v3, Kind=Preference controller"
2023-04-20T22:11:13.087190585Z time="2023-04-20T22:11:13Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=Role controller"
2023-04-20T22:11:13.087405040Z time="2023-04-20T22:11:13Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRole controller"
2023-04-20T22:11:13.087601444Z time="2023-04-20T22:11:13Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
2023-04-20T22:11:13.088187127Z time="2023-04-20T22:11:13Z" level=info msg="Starting apiextensions.k8s.io/v1, Kind=CustomResourceDefinition controller"
2023-04-20T22:11:13.089002255Z time="2023-04-20T22:11:13Z" level=info msg="Starting management.cattle.io/v3, Kind=Feature controller"
2023-04-20T22:11:13.106905399Z time="2023-04-20T22:11:13Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
2023-04-20T22:11:13.107016712Z time="2023-04-20T22:11:13Z" level=info msg="Starting management.cattle.io/v3, Kind=APIService controller"
2023-04-20T22:11:13.107021502Z time="2023-04-20T22:11:13Z" level=info msg="Starting management.cattle.io/v3, Kind=Setting controller"
2023-04-20T22:11:13.107029862Z time="2023-04-20T22:11:13Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=RoleBinding controller"
2023-04-20T22:11:13.107120064Z time="2023-04-20T22:11:13Z" level=info msg="Starting /v1, Kind=Secret controller"
2023-04-20T22:11:13.107305418Z time="2023-04-20T22:11:13Z" level=info msg="Starting catalog.cattle.io/v1, Kind=ClusterRepo controller"
2023-04-20T22:11:13.107388770Z time="2023-04-20T22:11:13Z" level=info msg="Starting steve aggregation client"
2023-04-20T22:11:13.107690896Z time="2023-04-20T22:11:13Z" level=info msg="Starting apiregistration.k8s.io/v1, Kind=APIService controller"
2023-04-20T22:11:13.109476036Z I0420 22:11:13.109392 53 leaderelection.go:248] attempting to acquire leader lease kube-system/cattle-controllers...
time="2023-04-20T22:11:13Z" level=info msg="Active TLS secret cattle-system/serving-cert (ver=239596434) (count 19): map[field.cattle.io/projectId:c-vxnmp:p-dp55h listener.cattle.io/cn-10.1.101.226:10.1.101.226 listener.cattle.io/cn-10.1.103.85:10.1.103.85 listener.cattle.io/cn-10.1.111.117:10.1.111.117 listener.cattle.io/cn-10.1.111.155:10.1.111.155 listener.cattle.io/cn-10.1.111.160:10.1.111.160 listener.cattle.io/cn-10.1.111.179:10.1.111.179 listener.cattle.io/cn-10.1.111.92:10.1.111.92 listener.cattle.io/cn-10.1.113.146:10.1.113.146 listener.cattle.io/cn-10.1.113.158:10.1.113.158 listener.cattle.io/cn-10.1.113.164:10.1.113.164 listener.cattle.io/cn-10.1.113.214:10.1.113.214 listener.cattle.io/cn-10.1.113.239:10.1.113.239 listener.cattle.io/cn-10.1.113.247:10.1.113.247 listener.cattle.io/cn-10.1.113.33:10.1.113.33 listener.cattle.io/cn-10.1.113.79:10.1.113.79 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-rancher.cattle-system:rancher.cattle-system listener.cattle.io/fingerprint:SHA1=6C01AF497D71ED93B7338FE65A01DCA3060B7878]"
2023-04-20T22:11:13.117856680Z time="2023-04-20T22:11:13Z" level=info msg="Listening on :443"
2023-04-20T22:11:13.118050064Z time="2023-04-20T22:11:13Z" level=info msg="Listening on :80"
2023-04-20T22:11:13.138717988Z time="2023-04-20T22:11:13Z" level=info msg="Listening on :444"
2023-04-20T22:11:13.138736299Z time="2023-04-20T22:11:13Z" level=warning msg="dynamiclistener [::]:444: no cached certificate available for preload - deferring certificate load until storage initialization or first client request"
2023-04-20T22:11:13.383218374Z time="2023-04-20T22:11:13Z" level=info msg="Starting /v1, Kind=Secret controller"
2023-04-20T22:11:13.388225324Z time="2023-04-20T22:11:13Z" level=info msg="Updating TLS secret for cattle-system/serving-cert (count: 19): map[field.cattle.io/projectId:c-vxnmp:p-dp55h listener.cattle.io/cn-10.1.101.226:10.1.101.226 listener.cattle.io/cn-10.1.103.85:10.1.103.85 listener.cattle.io/cn-10.1.111.117:10.1.111.117 listener.cattle.io/cn-10.1.111.155:10.1.111.155 listener.cattle.io/cn-10.1.111.160:10.1.111.160 listener.cattle.io/cn-10.1.111.179:10.1.111.179 listener.cattle.io/cn-10.1.111.92:10.1.111.92 listener.cattle.io/cn-10.1.113.146:10.1.113.146 listener.cattle.io/cn-10.1.113.158:10.1.113.158 listener.cattle.io/cn-10.1.113.164:10.1.113.164 listener.cattle.io/cn-10.1.113.214:10.1.113.214 listener.cattle.io/cn-10.1.113.239:10.1.113.239 listener.cattle.io/cn-10.1.113.247:10.1.113.247 listener.cattle.io/cn-10.1.113.33:10.1.113.33 listener.cattle.io/cn-10.1.113.79:10.1.113.79 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-localhost:localhost listener.cattle.io/cn-rancher.cattle-system:rancher.cattle-system listener.cattle.io/fingerprint:SHA1=6C01AF497D71ED93B7338FE65A01DCA3060B7878]"
2023-04-20T22:11:14.235573484Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for discovery.k8s.io/v1, Kind=EndpointSlice"
2023-04-20T22:11:14.235589884Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for catalog.cattle.io/v1, Kind=ClusterRepo"
2023-04-20T22:11:14.235694547Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for argoproj.io/v1alpha1, Kind=Application"
2023-04-20T22:11:14.235701567Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=Cluster"
2023-04-20T22:11:14.235987233Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=Namespace"
2023-04-20T22:11:14.236216578Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for flowcontrol.apiserver.k8s.io/v1beta2, Kind=FlowSchema"
2023-04-20T22:11:14.236401142Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for batch/v1, Kind=Job"
2023-04-20T22:11:14.236408302Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=RoleBinding"
2023-04-20T22:11:14.236434523Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for policy/v1, Kind=PodDisruptionBudget"
2023-04-20T22:11:14.236547655Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for certificates.k8s.io/v1, Kind=CertificateSigningRequest"
2023-04-20T22:11:14.236681568Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for apiextensions.k8s.io/v1, Kind=CustomResourceDefinition"
2023-04-20T22:11:14.236687178Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=ServiceAccount"
2023-04-20T22:11:14.236689378Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=User"
2023-04-20T22:11:14.236691868Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1, Kind=Prometheus"
2023-04-20T22:11:14.236694319Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding"
2023-04-20T22:11:14.236696779Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for apps/v1, Kind=Deployment"
2023-04-20T22:11:14.236699019Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=UserAttribute"
2023-04-20T22:11:14.236701249Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=APIService"
time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for elbv2.k8s.aws/v1beta1, Kind=IngressClassParams"
2023-04-20T22:11:14.236819741Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for apiregistration.k8s.io/v1, Kind=APIService"
2023-04-20T22:11:14.236926724Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for catalog.cattle.io/v1, Kind=App"
2023-04-20T22:11:14.236993255Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=Endpoints"
2023-04-20T22:11:14.236997045Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=Feature"
2023-04-20T22:11:14.236999685Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1, Kind=Alertmanager"
2023-04-20T22:11:14.237002335Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for ui.cattle.io/v1, Kind=NavLink"
2023-04-20T22:11:14.237004755Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=Token"
2023-04-20T22:11:14.237016546Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for autoscaling/v2, Kind=HorizontalPodAutoscaler"
2023-04-20T22:11:14.237019746Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=Pod"
2023-04-20T22:11:14.237024466Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=ConfigMap"
2023-04-20T22:11:14.237027106Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=Setting"
2023-04-20T22:11:14.237029576Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for vpcresources.k8s.aws/v1beta1, Kind=SecurityGroupPolicy"
2023-04-20T22:11:14.237211540Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=PodSecurityAdmissionConfigurationTemplate"
2023-04-20T22:11:14.237242641Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for networking.k8s.io/v1, Kind=Ingress"
2023-04-20T22:11:14.237311272Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=ReplicationController"
2023-04-20T22:11:14.237392844Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for catalog.cattle.io/v1, Kind=Operation"
2023-04-20T22:11:14.237415864Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for crd.k8s.amazonaws.com/v1alpha1, Kind=ENIConfig"
2023-04-20T22:11:14.237614419Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for karpenter.sh/v1alpha5, Kind=Provisioner"
2023-04-20T22:11:14.237620899Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=PodTemplate"
2023-04-20T22:11:14.237628149Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=ResourceQuota"
2023-04-20T22:11:14.237630869Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for storage.k8s.io/v1, Kind=CSINode"
2023-04-20T22:11:14.237633269Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1, Kind=PodMonitor"
2023-04-20T22:11:14.237813503Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1, Kind=ServiceMonitor"
2023-04-20T22:11:14.237819693Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1alpha1, Kind=AlertmanagerConfig"
2023-04-20T22:11:14.237822313Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for apps/v1, Kind=ControllerRevision"
2023-04-20T22:11:14.237824753Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for apps/v1, Kind=StatefulSet"
2023-04-20T22:11:14.237827353Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for events.k8s.io/v1, Kind=Event"
2023-04-20T22:11:14.237829664Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=Role"
2023-04-20T22:11:14.237832104Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for apps/v1, Kind=DaemonSet"
2023-04-20T22:11:14.237834564Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for admissionregistration.k8s.io/v1, Kind=MutatingWebhookConfiguration"
2023-04-20T22:11:14.237836924Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=Event"
2023-04-20T22:11:14.237839314Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for scheduling.k8s.io/v1, Kind=PriorityClass"
2023-04-20T22:11:14.237899805Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=LimitRange"
2023-04-20T22:11:14.238100080Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for node.k8s.io/v1, Kind=RuntimeClass"
time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=AuthConfig"
2023-04-20T22:11:14.238139270Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=Secret"
2023-04-20T22:11:14.238141830Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for rbac.authorization.k8s.io/v1, Kind=ClusterRole"
2023-04-20T22:11:14.238144190Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for storage.k8s.io/v1, Kind=StorageClass"
2023-04-20T22:11:14.238146561Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for argoproj.io/v1alpha1, Kind=ApplicationSet"
2023-04-20T22:11:14.238149081Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1, Kind=ThanosRuler"
2023-04-20T22:11:14.238151501Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for storage.k8s.io/v1, Kind=CSIDriver"
2023-04-20T22:11:14.238153981Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=Preference"
2023-04-20T22:11:14.238156391Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=PersistentVolumeClaim"
2023-04-20T22:11:14.238311454Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=Group"
2023-04-20T22:11:14.238318304Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=Node"
2023-04-20T22:11:14.238320864Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for networking.k8s.io/v1, Kind=IngressClass"
time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1, Kind=Probe"
2023-04-20T22:11:14.238325824Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=Service"
2023-04-20T22:11:14.238337165Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for argoproj.io/v1alpha1, Kind=AppProject"
time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for storage.k8s.io/v1, Kind=VolumeAttachment"
2023-04-20T22:11:14.238342415Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for orleans.dot.net/v1, Kind=OrleansClusterVersion"
time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for karpenter.k8s.aws/v1alpha1, Kind=AWSNodeTemplate"
2023-04-20T22:11:14.238347425Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"
2023-04-20T22:11:14.238349835Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for storage.k8s.io/v1, Kind=CSIStorageCapacity"
2023-04-20T22:11:14.238352215Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for coordination.k8s.io/v1, Kind=Lease"
2023-04-20T22:11:14.238354635Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=GroupMember"
2023-04-20T22:11:14.238357025Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for orleans.dot.net/v1, Kind=OrleansSilo"
2023-04-20T22:11:14.238359465Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for /v1, Kind=PersistentVolume"
2023-04-20T22:11:14.238361825Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for networking.k8s.io/v1, Kind=NetworkPolicy"
2023-04-20T22:11:14.238367525Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for monitoring.coreos.com/v1, Kind=PrometheusRule"
time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for batch/v1, Kind=CronJob"
2023-04-20T22:11:14.238372536Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for apps/v1, Kind=ReplicaSet"
2023-04-20T22:11:14.238392536Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for flowcontrol.apiserver.k8s.io/v1beta2, Kind=PriorityLevelConfiguration"
2023-04-20T22:11:14.238550909Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for management.cattle.io/v3, Kind=ClusterRegistrationToken"
2023-04-20T22:11:14.238557690Z time="2023-04-20T22:11:14Z" level=info msg="Watching metadata for elbv2.k8s.aws/v1beta1, Kind=TargetGroupBinding"
2023-04-21T02:17:01.059479216Z time="2023-04-21T02:17:01Z" level=error msg="Error during subscribe websocket: close sent"
2023-04-21T02:47:03.056931533Z W0421 02:47:03.056825 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T02:47:03.056987255Z W0421 02:47:03.056832 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T02:47:03.158854035Z W0421 02:47:03.158736 53 transport.go:313] Unable to cancel request for *client.addQuery
W0421 03:17:03.063678 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:17:03.063853868Z W0421 03:17:03.063698 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:17:03.165536287Z W0421 03:17:03.165409 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:47:03.070068610Z W0421 03:47:03.069908 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:47:03.070106281Z W0421 03:47:03.069915 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:47:03.173448442Z W0421 03:47:03.173279 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:37.756846494Z W0421 03:59:37.756734 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:37.773385628Z W0421 03:59:37.773309 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.210642186Z W0421 03:59:40.210474 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.210668797Z W0421 03:59:40.210507 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.230921344Z W0421 03:59:40.230848 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.230933814Z W0421 03:59:40.230866 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.331730076Z W0421 03:59:40.331617 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.347375381Z W0421 03:59:40.347284 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.411971765Z W0421 03:59:40.411887 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:40.442833476Z W0421 03:59:40.442716 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.211255521Z W0421 03:59:44.21 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.219293919Z W0421 03:59:44.219102 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.219317369Z W0421 03:59:44.219132 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.222874038Z W0421 03:59:44.222791 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.237908849Z W0421 03:59:44.237822 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.241602511Z W0421 03:59:44.241518 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.245560338Z W0421 03:59:44.245479 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.286606243Z W0421 03:59:44.286486 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.288638478Z W0421 03:59:44.288569 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.292174356Z W0421 03:59:44.292077 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.292384701Z W0421 03:59:44.292285 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.293461814Z W0421 03:59:44.293234 53 transport.go:313] Unable to cancel request for *client.addQuery
W0421 03:59:44.312100 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.312242268Z W0421 03:59:44.312120 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T03:59:44.313383353Z W0421 03:59:44.313239 53 transport.go:313] Unable to cancel request for *client.addQuery
2023-04-21T04:15:44.007271917Z time="2023-04-21T04:15:44Z" level=error msg="Error during subscribe websocket: close sent"

@taehwanlee
Author

@niusmallnan
Contributor

@taehwanlee It looks like you have three downstream clusters, and they are not heavily loaded.

I have some questions:

  1. Does the OOM only happen on hp-stg-cluster?
  2. How was hp-stg-cluster created, EKS? It looks like it was imported into Rancher the generic way.
  3. I see a lot of Unable to cancel request for *client.addQuery messages; do you happen to know if there are many requests proxied to the downstream cluster through the Rancher API?

@taehwanlee
Author

taehwanlee commented Apr 21, 2023

  1. Yes, it is currently happening on hp-stg-cluster.
  2. The cluster was created in AWS as an EKS cluster and imported into Rancher.
  3. I don't know why there are so many requests.

Is there any workaround?

@bananflugan

Here is some info about one of our clusters that has this issue as well.
Our clusters are RKE running on ubuntu-server OS.

image

cattle-cluster-agent-68c85d4cd4-92lwd_cluster-register.log

@niusmallnan
Contributor

@taehwanlee I could not reproduce this memory leak case and don't have any workaround for now.

We can also try to look for clues in the rancher server logs. There are some built-in controllers that send API requests to downstream clusters, which go through the cattle-cluster-agent.
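
A quick way to pull those logs, assuming a Helm-based install where the server pods carry the app=rancher label (for a Docker install, docker logs on the Rancher container gives the same output):

# tail the Rancher server logs across all replicas
kubectl -n cattle-system logs -l app=rancher --tail=500 --timestamps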

@wirwolf

wirwolf commented Apr 21, 2023

I have the same problem, and I fixed it temporarily as described in #38723 (comment):

kubectl patch -n cattle-system deployment cattle-cluster-agent -p '{"spec":{"template":{"spec":{"containers":[{"name":"cluster-register","resources":{"requests":{"cpu":"1","memory":"2Gi"},"limits":{"cpu":"1","memory":"2Gi"}}}]}}}}'

@justdan96

We are experiencing the exact same issue, including the cattle-cluster-agent log entries mentioning transport.go:313] Unable to cancel request for *client.addQuery.

I thought the issue was due to our cattle-node-agent pods asking for kubelet certificate regenerations every 120 seconds, but even after fixing that behaviour the memory usage of the cattle-cluster-agent pods seems to be creeping up and up. You can see in the graph below that the memory usage is completely different after the 2.7.2 upgrade.

image

@wirwolf

wirwolf commented Apr 24, 2023

Today we had an OOM. My patch saved the day:
image

@justdan96

I used the same patch as @wirwolf except I upped the requests and limits to 3GB. It seems to have kept the OOM issue under control for now.

@lasseoe

lasseoe commented Apr 25, 2023

I've just had what I assume is the same issue, but with our "local" cluster, the Rancher cluster and the rancher pod itself.

@bananflugan

I've just had what I assume is the same issue, but with our "local" cluster, the Rancher cluster and the rancher pod itself.

Indeed, this will also happen, just not as often. The last time it happened to us was last night, but it took maybe five days to happen. Now we have limits set on the Rancher deployment as well.

image

@niusmallnan
Contributor

I found rough steps to reproduce; although I haven't caught the root cause yet, it should be close.

Rancher v2.7.3 and a clean downstream K3s v1.25.9 cluster.

In the downstream cluster, create some resources to get close to the real setup. For example, I created 2000 secrets:

#!/bin/bash
# Populate a namespace with service accounts and their token secrets
# to approximate a real cluster's resource count.
# The namespace referenced by NS must already exist.

NS=test
NUM=2000

n=1
while [ $n -le $NUM ]; do
    kubectl create serviceaccount test$n -n $NS
    kubectl create -f - <<EOF
apiVersion: v1
kind: Secret
type: kubernetes.io/service-account-token
metadata:
  name: test$n
  namespace: test
  annotations:
    kubernetes.io/service-account.name: test$n
EOF
    let n++
done

Open a browser, visit the Secrets page of the downstream cluster in the UI, and then refresh repeatedly or open multiple windows.

Check the memory usage of cattle-cluster-agent.
Screenshot 2023-04-27 6:17 PM

@orangedeng
Contributor

orangedeng commented Apr 28, 2023

I think I found the cause of this problem.

The short answer is:
Disabling the list cache by setting the env var CATTLE_REQUEST_CACHE_DISABLED=true makes this issue disappear.

The long answer is:
It might be the LRU cache problem here. The cache key for the list response contains opts.Revision, which is different for every request, so the cache is never hit and the cache size grows with each request.

By following these steps, you can see the memory grow with each request:

  • Set the cluster-agent env var GODEBUG=madvdontneed=1 so you can see the memory growth easily.
  • Create 2000 secrets in your namespace with the script @niusmallnan provided.
  • Navigate to the Storage/Secrets page in the Explorer UI for the first time, and wait 1-2 minutes.
  • Get the heap profile from the cluster-agent pod with the command curl localhost:6060/debug/pprof/heap -o /tmp/heap and copy it to the host.
  • Refresh the Secrets page and get another heap profile from the pod.

By comparing the two heap profiles, you can see the memory increase; most of it goes to sigs.k8s.io/json/internal/golang/encoding/json.(*decodeState).objectInterface.

You can repeat the steps above with the env var CATTLE_REQUEST_CACHE_DISABLED=true set, and the heap profiles show no further memory increase.
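
A sketch of the capture-and-compare step, assuming the profiles are copied to a machine with the Go toolchain installed and saved under the hypothetical names heap-before and heap-after:

# inside the cluster-agent pod (or via a port-forward), capture the two snapshots
curl -s localhost:6060/debug/pprof/heap -o heap-before
# ... refresh the Secrets page in the UI ...
curl -s localhost:6060/debug/pprof/heap -o heap-after
# show only the allocations that grew between the two snapshots
go tool pprof -top -base=heap-before heap-after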

@niusmallnan
Contributor

I can confirm that CATTLE_REQUEST_CACHE_DISABLED is effective. It should be set on both the Rancher server and the cattle-cluster-agent, since both have the Steve API framework built in.

Another mitigation is CATTLE_REQUEST_CACHE_SIZE_INT: it defaults to 1000 and can be set lower to fit the host's memory capacity. Note that if caching is completely disabled, API performance may degrade.
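
A minimal sketch of applying either mitigation with kubectl set env; the deployment names assume a standard Helm install (rancher in the local cluster, cattle-cluster-agent in each downstream cluster), and the cache size value is only an example:

# disable the request cache entirely (may cost API performance)
kubectl -n cattle-system set env deployment/rancher CATTLE_REQUEST_CACHE_DISABLED=true
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_REQUEST_CACHE_DISABLED=true

# or keep caching but shrink it from the default of 1000 entries
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_REQUEST_CACHE_SIZE_INT=200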

@taehwanlee
Author

taehwanlee commented May 3, 2023

I added the CATTLE_REQUEST_CACHE_DISABLE setting, but the memory leak still occurs. I have upgraded to Rancher 2.7.3.
image

image

@moio
Contributor

moio commented May 3, 2023

@taehwanlee can you please share the cattle-cluster-agent deployment YAML of one of the affected clusters, together with its memory profile at the time when memory usage is high?

To collect it:

export KUBECONFIG=/path/to/your/downstream.yaml

export POD=$(kubectl --namespace cattle-system get pod --selector='app=cattle-cluster-agent' --output jsonpath="{.items[0].metadata.name}")

kubectl --namespace cattle-system port-forward pod/$POD 6060:6060 &

wget http://localhost:6060/debug/pprof/heap
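
Once downloaded, the profile can be inspected locally, for example:

go tool pprof -top heap          # text summary of the largest allocations
go tool pprof -http=:8080 heap   # interactive web UI, if you prefer to browse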

TIA

@taehwanlee
Author

@moio I ran it as instructed and captured a heap profile.

heap.zip

@orangedeng
Contributor

orangedeng commented May 4, 2023

@moio
It doesn't seem like a memory leak to me.

  • The memory curve is quite flat over 12h and didn't grow a lot. I think the memory usage over 4G is due to the scale of your cluster. The Go GC won't release memory back to the OS immediately if the OS doesn't reclaim it, so it acts like a buffer kept by the cluster-agent.
  • In the heap profile, I don't see any large orphaned allocations.

You can try adding the env var GODEBUG=madvdontneed=1 to your cluster agent and see if there is any difference. It will slow down your UI queries a little, but not a lot.
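
A one-liner for that, assuming the usual deployment name and namespace:

# madvdontneed=1 makes the Go runtime return freed pages to the OS eagerly,
# so the pod's RSS tracks the live heap more closely
kubectl -n cattle-system set env deployment/cattle-cluster-agent GODEBUG=madvdontneed=1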

@taehwanlee
Author

I think it's because not much memory had leaked yet when the heap profile was captured...

@orangedeng
Contributor

I think it's because not much memory had leaked yet when the heap profile was captured...

Try setting GODEBUG=madvdontneed=1 and keep it running for a while. Then dump the heap before and after refreshing the deployment/pods page in the UI, and we can also look at the metrics after that. Thanks!

moio added a commit to moio/steve that referenced this issue Jun 9, 2023
This prevents a goroutine leak when item.Object is a `runtime.Object` but not a
`metav1.Object`, as in that case `WatchNames`’s `for` loop will quit early and
subsequent calls to `returnErr` will remain parked forever.

This helps with rancher/rancher#41225
Fuller explanation in rancher#107
@rmweir
Contributor

rmweir commented Jun 9, 2023

If you are considering trying the image or want to evaluate whether the fix is likely to work for you, you can follow these steps:

  1. Take a goroutine profile of the affected pod.
    If it is the Rancher server, run:
KUBECONFIG=<pathtokubeconfigforyourlocalcluster>
for pod in $(kubectl get pods -n cattle-system --no-headers -l app=rancher | cut -d ' ' -f1); do
  echo getting goroutine for $pod
  kubectl exec -n cattle-system $pod -- curl -s http://localhost:6060/debug/pprof/goroutine -o goroutine
  kubectl cp cattle-system/${pod}:goroutine ./${pod}-goroutine
  echo saved ${pod}-goroutine
done

If it is the cattle-cluster-agent, run:

KUBECONFIG=<pathtokubeconfigfordownstreamcluster>
for pod in $(kubectl get pods -n cattle-system --no-headers -l app=cattle-cluster-agent | cut -d ' ' -f1); do
  echo getting goroutine for $pod
  kubectl exec -n cattle-system $pod -- curl -s http://localhost:6060/debug/pprof/goroutine -o goroutine
  kubectl cp cattle-system/${pod}:goroutine ./${pod}-goroutine
  echo saved ${pod}-goroutine
done
  2. With go tool pprof installed, run:
    go tool pprof <pathtogoroutineprofile>
  3. Then use the command:
    top 40 --cum
  4. If you observe github.com/rancher/steve/pkg/stores/proxy.returnErr (inline), especially with a relatively high cum%, then you are affected by the leak and likely experiencing performance issues because of it. Usually these are memory issues, but it seems it is also related to the UI freezing. (A non-interactive variant of this check is sketched below.)
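
For reference, a non-interactive variant of steps 2-3 over a saved profile, assuming the Go toolchain is installed locally:

# print only the stacks that include returnErr; a large count here points at the leak
go tool pprof -top -show=returnErr <pathtogoroutineprofile>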

@git-ival

git-ival commented Jun 15, 2023

Rancher and cattle-cluster-agent Memory Leak Validation check

Methodology:

Followed the general guidance from Silvio and Ricardo from here and here.

Three scenarios were tested, all using the same basic test steps:

  1. Fresh install of Rancher 2.7.5-rc5
  2. Image update from Rancher 2.7.5-rc4 (Pre-fix) to 2.7.5-rc5 (Post-fix)
  3. Full Rancher upgrade from Rancher 2.7.4 to 2.7.5-rc5

Steps:

  1. Setup your Rancher cluster
    1. Install the same version of rancher-monitoring 102.0.1+up40.1.2
    2. Set up 3 baseline clusters (one each for K3s, RKE1, and RKE2)
    3. Install rancher-monitoring on each downstream cluster
    4. Load each downstream cluster with a set of "objects"
      • 100 secrets
      • 10 projects
      • 12 namespaces
      • 300 users (users are created in the Local cluster, this does not include the manually created user specified later)
      • 1 project Role Template
      • 300 project Role Template bindings
  2. Add the --debug argument and CATTLE_WATCH_TIMEOUT_SECONDS env var to the Rancher deployments
  3. Create a user called testuser and add them to the cluster members for the Local and downstream clusters
  4. Create 1 configmap that testuser has access to and 1 configmap that it does not have access to on each cluster
  5. Start a script for observing the number of goroutines stuck in the returnErr state (a hypothetical sketch of such a script follows this list)
  6. Log in to Rancher as testuser and navigate to the list of configmaps for each cluster
    1. Use Chrome, with a different tab for each cluster
    2. Verify that only the intended configmap is visible
    3. Wait a few minutes/hours and verify that the expected symptoms occur on 2.7.5-rc4 and 2.7.4 for both the local and downstream clusters
      1. A number of goroutines are stuck in the returnErr state.
      2. Memory usage continues to increase while the watches are open
    4. Verify that the expected symptoms do not occur on 2.7.5-rc5 for both local and downstream clusters
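
For illustration, a hypothetical sketch of what the observation script in step 5 could look like; the pod selector, debug port, and grep pattern are assumptions based on the profiling commands shared earlier in this thread:

#!/bin/bash
# Roughly count goroutines parked in steve's returnErr across the Rancher server pods,
# once a minute, using the text form of the goroutine profile.
while true; do
  for pod in $(kubectl get pods -n cattle-system --no-headers -l app=rancher | cut -d ' ' -f1); do
    count=$(kubectl exec -n cattle-system "$pod" -- \
      curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | grep -c returnErr)
    echo "$(date -Is) $pod returnErr goroutines: $count"
  done
  sleep 60
done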

Memory Leak Screenshots

Fresh Install

Local:
fix-local-memory

K3s:
fix-k3s-memory

RKE2:
fix-rke2-memory

Image Update

Pre:
Local:
pre-update-local-memory

K3s:
pre-update-k3s-memory

RKE2:
pre-update-rke2-memory

Post:
Local:
post-update-local-memory

K3s:
post-update-k3s-memory

RKE2:
post-update-rke2-memory

Upgrade

Pre:
Local:
pre-upgrade-local-memory

K3s:
pre-upgrade-k3s-memory

RKE2:
pre-upgrade-rke2-memory

Post:
Local:
post-upgrade-local-memory

K3s:
post-upgrade-k3s-memory

RKE2:
post-upgrade-rke2-memory

Notes

  • All scenarios passed as expected; the memory leak symptom that was described should be fully resolved
  • The "too old" error messages symptom still persists post-fix, as expected
  • K8s "Compute Resources" kubelet metrics were not being reported for the RKE1 downstream clusters due to a bug in rancher-monitoring (still investigating this before opening an official issue)
  • Performance metrics were not being reported post-upgrade in the Upgrade scenario
  • All artifacts can be found here: Rancher Issue 41225.zip

@moio
Contributor

moio commented Jun 21, 2023

Additional internal reference SURE-6430

@gmanera

gmanera commented Jun 21, 2023

I believe this issue could be related to the problem described in #41684.

rmweir pushed a commit to rmweir/steve that referenced this issue Jun 28, 2023
This prevents a goroutine leak when item.Object is a `runtime.Object` but not a
`metav1.Object`, as in that case `WatchNames`’s `for` loop will quit early and
subsequent calls to `returnErr` will remain parked forever.

This helps with rancher/rancher#41225
Fuller explanation in rancher#107
@thanhtai9606

The cattle-cluster-agent pod uses too much RAM; it is not fixed for me on Rancher 2.7.4 and Kubernetes 1.22.4.
agent

@markusleitoldZBS

We observed exactly the same behaviour. We built a v1.24.10 cluster with Rancher 2.7.4.
After enabling cluster monitoring a few weeks ago, we noticed that after an arbitrary amount of time the cattle-cluster-agent suddenly starts to eat all the memory, which in the worst case led to a crash of the whole k8s controller node.

We then added a resource limit in its deployment:

grafik

Since then things have calmed down a bit: no more sudden controller node crashes. But of course, the container now gets OOM-killed when it runs into the memory-eating situation.

grafik

This happens quite often, as can be seen in monitoring (the last 24 hours at the time of writing):

grafik

Doesn't look very healthy IMHO....

@hoerup

hoerup commented Aug 18, 2023

@markusleitoldZBS what did you expect? This issue has been marked as fixed in 2.7.5, so of course the bug is still present in 2.7.4.

@Richardswe

@hoerup Well, how is 2.7.5, is it stable? I thought 2.7.3 was good, but all of a sudden I got the memory leak as well, which caught me by surprise.

@rmweir
Contributor

rmweir commented Oct 18, 2023

@Richardswe an additional issue was found and resolved involving goroutines. 2.7.5 has both issues resolved.
