
rancher 2.0.4 server memory leak #14356

Closed
balleddog opened this issue Jul 3, 2018 · 29 comments
Labels
kind/bug: Issues that are defects reported by users or that we know have reached a real release

@balleddog

Rancher versions:
rancher/rancher: v2.0.4

kubernetes versions:

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.1", GitCommit:"d4ab47518836c750f9949b9e0d387f20fb92260b", GitTreeState:"clean", BuildDate:"2018-04-12T14:14:26Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Docker version: (docker version, docker info preferred)

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=xenial

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack VM

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
We set up a 3-node Kubernetes cluster manually via RKE on OpenStack-backed VMs and deployed Rancher 2.0.4 as a single-pod deployment (we have been running Rancher 2.0 since the technical preview).

Recently this system became the web console for our production Kubernetes clusters, and we discovered that the pod of the cattle deployment crashes because of OOM (RSS nearly 7 GiB) every few days.

Every time the cattle pod crashes, all of our clusters need to re-establish connections to it, which is quite disruptive.

Is there any progress on this issue, or can we help out with further investigation?

Results:
screen shot 2018-07-03 at 6 19 45 pm
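
For reference, a minimal way to confirm that the restarts really are kernel OOM kills (a rough sketch, assuming the Rancher server runs as a deployment in the cattle-system namespace; adjust names to your install):

# Check whether the last container termination was an OOM kill
kubectl -n cattle-system get pods
kubectl -n cattle-system describe pod <cattle-pod-name> | grep -A3 "Last State"
# On the node that hosted the pod, the kernel log should show the kill as well
dmesg -T | grep -iE "oom|killed process"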

@cloudnautique
Contributor

Thanks for the info. We are looking into resource consumption in Rancher 2.0, both to address issues and to help users plan for capacity.

If you are willing to share the info (a rough kubectl sketch for collecting these numbers is included after the list):

  1. How many clusters do you have?
  2. Type of clusters (Custom, imported, cloud)
  3. Number of nodes per cluster
  4. Number of pods and deployments in each cluster
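
A rough way to collect those counts per cluster, assuming kubectl access with each cluster's kubeconfig (plain kubectl commands, nothing Rancher-specific):

# Node, pod, and deployment counts for one cluster
kubectl get nodes --no-headers | wc -l
kubectl get pods --all-namespaces --no-headers | wc -l
kubectl get deployments --all-namespaces --no-headers | wc -l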

@balleddog
Author

Thanks for your reply!

We are still early in our Kubernetes adoption, so there are no monster-sized clusters.

One thing may be worth noting: some of our clusters are actually far away (network-wise) from our Rancher server, which possibly led to frequent cluster-unavailable events in the Rancher web UI.

We suspected this was the source of the memory leak and relocated the Rancher server so that it sits network-wise closer to these remote clusters.

Since then, the memory consumption of the cattle pod seems reasonable. I don't know whether or not this info is useful.

screen shot 2018-07-04 at 10 49 57 am
screen shot 2018-07-04 at 10 55 54 am

@advancingu

advancingu commented Jul 4, 2018

We are seeing the same issue in production with our cluster nodes. Memory usage creeps up until all available memory is consumed, then the node becomes unresponsive and needs to be replaced. We now run 8 GB nodes to alleviate the problem to a degree, but as you can see in the screenshot (showing a range since Monday), there definitely is a leak somewhere.

screenshot from 2018-07-04 16-21-34

We currently run 3 EC2 m5.large worker nodes with RancherOS 1.4.0 using the EU-Frankfurt AMI provided by RancherOS, and a separate etcd/control t2.medium node with RancherOS 1.4.0. Each worker node has 15-20 pods. Traffic gets to the pods via EC2 load balancers managed by the K8s AWS plugin provided by Rancher. We are still running Rancher 2.0.0, but I would imagine that has little impact on this issue.
Edit: Kubernetes is v1.10.1

So far the etcd/control node has not crashed, but the worker nodes regularly do.

Edit 2: Our Rancher server is also very far away from the actual cluster.
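
If it helps narrow things down, a quick way to see which container is actually consuming the memory on an affected worker (a sketch, assuming SSH access to the node; the --format flag needs Docker 1.13+):

# Per-container memory and CPU snapshot on the node
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"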

@balleddog
Author

An update on this issue.

Even after we relocated our Rancher server's public Internet access point, the cattle pod still crashes due to kernel OOM kills:
screen shot 2018-07-05 at 12 50 53 pm
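
A possible stop-gap until the leak itself is fixed is to cap the container so the OOM kill stays local to the pod instead of destabilizing the node (a sketch, assuming the Rancher deployment is named cattle in the cattle-system namespace; pick limits that fit your nodes):

# Set memory requests/limits on the Rancher deployment
kubectl -n cattle-system set resources deployment cattle --limits=memory=4Gi --requests=memory=2Gi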

@deniseschannon deniseschannon added the kind/bug and kind/performance labels Jul 5, 2018
@deniseschannon deniseschannon added this to the v2.0.6 milestone Jul 5, 2018
@alena1108

#12965

@soumyalj

soumyalj commented Jul 6, 2018

On a setup with Rancher v2.0.5-rc3, I ran tests involving custom host addition with different roles.
The server crashed with the FATAL error below:


2018-07-06 00:27:55.668885 W | etcdserver: apply entries took too long [575.987489ms for 1 entries]
2018-07-06 00:27:55.669555 W | etcdserver: avoid queries with large range/delete range!
WARNING: 2018/07/06 00:27:57 Failed to dial 127.0.0.1:2379: connection error: desc = "transport: authentication handshake failed: context canceled"; please retry.
2018-07-06 00:27:57.297866 W | etcdserver: apply entries took too long [1.51920453s for 7 entries]
2018-07-06 00:27:57.298553 W | etcdserver: avoid queries with large range/delete range!
2018/07/06 00:28:01 [ERROR] NodePoolController c-6lqxl/np-jrqmn [nodepool-provisioner] failed with : Get https://127.0.0.1:6443/apis/management.cattle.io/v3/nodes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
WARNING: 2018/07/06 00:28:02 Failed to dial 127.0.0.1:2379: connection error: desc = "transport: authentication handshake failed: context canceled"; please retry.
E0706 00:28:02.946152       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-scheduler: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints kube-scheduler)
2018/07/06 00:28:03 [ERROR] NodePoolController c-6lqxl/np-nd7b5 [nodepool-provisioner] failed with : Get https://127.0.0.1:6443/apis/management.cattle.io/v3/nodes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018/07/06 00:28:03 [INFO] Updating node ip-172-31-4-192 with labels map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/os:linux kubernetes.io/hostname:ip-172-31-4-192 node-role.kubernetes.io/worker:true] and annotations map[node.alpha.kubernetes.io/ttl:0 rke.cattle.io/external-ip:13.59.87.42 rke.cattle.io/internal-ip:172.31.4.192 volumes.kubernetes.io/controller-managed-attach-detach:true]
E0706 00:28:03.938664       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'df0e43ba6747 stopped leading'
E0706 00:28:03.945831       1 server.go:145] lost master
Error: lost lease
2018/07/06 00:28:03 [ERROR] kube-scheduler exited with error: lost lease
2018/07/06 00:28:03 [INFO] Shutting down SecretController controller
2018/07/06 00:28:03 [INFO] Shutting down NamespaceController controller
2018/07/06 00:28:03 [INFO] Shutting down RoleController controller
2018/07/06 00:28:04 [INFO] Shutting down SecretController controller
2018/07/06 00:28:04 [INFO] Shutting down RoleController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRoleController controller
2018/07/06 00:28:04 [INFO] Shutting down RoleBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRoleBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down ListenConfigController controller
2018/07/06 00:28:04 [INFO] Shutting down SettingController controller
2018/07/06 00:28:04 [INFO] Shutting down NodeDriverController controller
2018/07/06 00:28:04 [INFO] Shutting down DynamicSchemaController controller
2018/07/06 00:28:04 [INFO] Shutting down AuthConfigController controller
2018/07/06 00:28:04 [INFO] Shutting down GroupController controller
2018/07/06 00:28:04 [INFO] Shutting down GroupMemberController controller
2018/07/06 00:28:04 [INFO] Shutting down ProjectController controller
2018/07/06 00:28:04 [INFO] Shutting down PipelineController controller
2018/07/06 00:28:04 [INFO] Shutting down SourceCodeRepositoryController controller
2018/07/06 00:28:04 [INFO] Shutting down SourceCodeCredentialController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterPipelineController controller
2018/07/06 00:28:04 [INFO] Shutting down RoleTemplateController controller
2018/07/06 00:28:04 [INFO] Shutting down UserAttributeController controller
2018/07/06 00:28:04 [INFO] Shutting down TokenController controller
2018/07/06 00:28:04 [INFO] Shutting down ProjectRoleTemplateBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRoleTemplateBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down UserController controller
2018/07/06 00:28:04 [INFO] Shutting down NodeController controller
2018/07/06 00:28:04 [INFO] Shutting down NodeController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRegistrationTokenController controller
2018/07/06 00:28:04 [FATAL] context canceled
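
The repeated "apply entries took too long" warnings suggest etcd could not keep up, which is usually a slow-disk or I/O-contention problem rather than a Rancher one. A rough way to check on the etcd node (a sketch, assuming an RKE-provisioned node whose etcd container is named etcd, with sysstat installed for iostat):

# Look for slow-apply warnings in the etcd container itself
docker logs etcd 2>&1 | grep -i "took too long" | tail -n 20
# Watch disk I/O wait while the problem occurs
iostat -x 1 5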

@angheladrianclaudiu

angheladrianclaudiu commented Jul 17, 2018

Anything new related to this? Do you recommend downgrading?

@sandbox-services

sandbox-services commented Jul 20, 2018

Clusters deployed via AWS EC2 three weeks ago are stable. Trying to deploy some new clusters from the same RKE instance (which has been upgraded), using the same instance sizing and running the exact same workload, the workers on the newly deployed EC2 clusters crash out in minutes at 100% CPU and memory utilization. The same workload on the stable clusters has the hosts sitting at under 1 CPU load and about 30% memory utilization per worker node.

RancherOS hosts, AWS EC2 m5.xlarge x 10 workers.
70 pods

@deniseschannon

There have been several performance-related fixes that have gone in but were not tagged specifically to this issue. @cloudnautique can you use this issue to track your performance testing?

@cloudnautique
Contributor

Yeah, we can use this for the memory consumption work.

@penevigor

penevigor commented Jul 25, 2018

We are experiencing the same issue.

You can see the cattle server using 83% memory on one of the Rancher nodes:
ran_cattle_container_memory1

And here you can see the free-memory graph going down until the container is recreated again:
ran01_mem1 2

@advancingu

@penevigor What are you doing on your machines that you manage to leak > 2.5 GB of RAM in two hours? In our case it takes about three days for a gigabyte, and we're primarily running web servers with comparably low-throughput network traffic and occasional high-CPU tasks.

@sandbox-services

sandbox-services commented Jul 28, 2018

@advancingu we were leaking 16 GB of RAM x 10 nodes in under 2 hours.

The workload consisted of various web services, some Java, some PHP, some Python: roughly 70 containers in 20 pods (workloads at 2-5x scale depending on the service). This workload is stable, with plenty of resources remaining, when run on Rancher 1.6 via Cattle on far fewer resources (5 x m4.large vs 10 x m5.xlarge).

@jzandbergen

Here we have two Rancher 2.0.6 instances, both on CentOS 7.5 and docker-ce 17.03-02. One has a single cluster to manage and the other has two clusters. We now have the same problem on the Rancher 2 instance that manages our two Kubernetes clusters.

It happens multiple times per day as you can see here:
mem
swap

If I can do or test anything to help, I'll gladly do so.

@sboschman

sboschman commented Aug 3, 2018

Rancher 2.0.6 with Kubernetes v1.10.5-rancher1-1

Maybe a hint in which direction to look (from the Rancher server logs, correlated with the CPU/memory spikes):

E0803 05:51:25.597774       1 streamwatcher.go:109] Unable to decode an event from the watch stream: json: cannot unmarshal string into Go struct field dynamicEvent.Object of type v3.NodeStatus
E0803 05:52:28.401636       1 streamwatcher.go:109] Unable to decode an event from the watch stream: backed up reader

The last message repeats about 5 times per second.

Edit: The unmarshal error is already mentioned in #12332. The unmarshal error does indeed occur periodically in the logs, but it is only sometimes followed by the "backed up reader" error and the OOM issue.
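
To correlate those errors with the CPU/memory spikes, grepping the server log with timestamps helps (a sketch; rancher-server is a placeholder for however your Rancher server container is named):

docker logs --timestamps rancher-server 2>&1 | grep -E "streamwatcher|backed up reader" | tail -n 50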

@BowlingX

BowlingX commented Aug 6, 2018

I have a similar issue with v2.0.2, especially while having the Project (workloads/services/volumes) overview open. It seems like the UI tries to reconnect to the websockets every second and fails. Memory usage increases every second and CPU stays at 100%. If I close the browser, the CPU usage goes back down and the memory gets freed after a while as well.

@cjellick cjellick modified the milestones: v2.0.7, v2.0 - Sep Aug 10, 2018
@deniseschannon deniseschannon modified the milestones: v2.0 - Sep, v2.1 Aug 21, 2018
@cjellick

Need to coordinate with @cloudnautique on who is going to verify this and how.

@cloudnautique
Contributor

@cjellick I think with the release of v2.0.8 and the catalog backoff fix, we should resolve this one. If newer leaks pop up, we can open separate issues.

@alv91

alv91 commented Aug 29, 2018

Same issue with v2.0.8

@cloudnautique
Contributor

@alv91 there have been multiple causes, and multiple fixes, addressed in this issue. Can you please provide additional information, or open a new issue specifically outlining the symptoms/conditions under which you are seeing high memory consumption?

Ideally, if you could run in debug mode for a bit and look for any log messages that seem to be repeating, that would be really helpful. Specs on the sizing of the server would be useful as well.

@alv91

alv91 commented Aug 29, 2018

@cloudnautique of course, I will post tomorrow. In short, it is the problem described by @BowlingX: I noticed memory increasing to 100% on the catalog apps tab and probably other tabs.

@alv91

alv91 commented Aug 30, 2018

@cloudnautique
I only see this event repeated multiple times per second:

E0830 08:00:57.504534 7 streamwatcher.go:109] Unable to decode an event from the watch stream: object to decode was longer than maximum allowed size

I have one cluster with 6 nodes. The problem is present on a clean installation, i.e. without any resources in Kubernetes.
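
One way to hunt for unusually large objects that could trip that size limit (a sketch using plain kubectl; repeat for secrets or other resource types as needed):

# Print the ten largest configmaps by serialized size
kubectl get configmaps --all-namespaces --no-headers | while read ns name rest; do
  echo "$(kubectl -n "$ns" get configmap "$name" -o json | wc -c) $ns/$name"
done | sort -n | tail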

@cloudnautique
Contributor

@alv91, is the Rancher server in debug mode? You need to add --debug to the rancher command. Hopefully that will give some clue as to which object is too large. The error itself is actually being thrown by Kubernetes.

It looks like there is an open issue upstream where objects can be larger than 1 MiB but the system doesn't handle them properly: kubernetes/kubernetes#57073
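
For reference, on a single-node Docker install enabling debug logging just means appending the flag to the container command, roughly (a sketch; adjust ports and image tag to your install):

docker run -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:v2.0.8 --debug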

@alv91

alv91 commented Aug 31, 2018

@cloudnautique In debug mode it is just flooding with:

2018/08/31 10:21:29 [DEBUG] closing watcher for configMap
2018/08/31 10:21:29 [DEBUG] Impersonating user user-c9mjc, groups [system:authenticated]
2018/08/31 10:21:29 [DEBUG] watching ingress
2018/08/31 10:21:29 [DEBUG] watching workload
2018/08/31 10:21:29 [DEBUG] watching namespacedSecret
2018/08/31 10:21:29 [DEBUG] watching job
2018/08/31 10:21:29 [DEBUG] watching namespacedDockerCredential
2018/08/31 10:21:29 [DEBUG] watching namespacedSshAuth
2018/08/31 10:21:29 [DEBUG] watching appRevision
2018/08/31 10:21:29 [DEBUG] watching cronJob
2018/08/31 10:21:29 [DEBUG] watching sshAuth
2018/08/31 10:21:29 [DEBUG] watching replicationController
2018/08/31 10:21:29 [DEBUG] watching namespacedServiceAccountToken
2018/08/31 10:21:29 [DEBUG] watching statefulSet
2018/08/31 10:21:29 [DEBUG] watching cronJob
2018/08/31 10:21:29 [DEBUG] watching deployment
2018/08/31 10:21:29 [DEBUG] watching replicaSet
2018/08/31 10:21:29 [DEBUG] watching replicationController
2018/08/31 10:21:29 [DEBUG] watching daemonSet
2018/08/31 10:21:29 [DEBUG] watching serviceAccountToken
2018/08/31 10:21:29 [DEBUG] watching daemonSet
2018/08/31 10:21:29 [DEBUG] watching pod
2018/08/31 10:21:29 [DEBUG] watching replicaSet
2018/08/31 10:21:29 [DEBUG] watching namespacedBasicAuth
2018/08/31 10:21:29 [DEBUG] watching deployment
2018/08/31 10:21:29 [DEBUG] watching statefulSet
2018/08/31 10:21:29 [DEBUG] watching app
2018/08/31 10:21:29 [DEBUG] watching secret
2018/08/31 10:21:29 [DEBUG] watching persistentVolumeClaim
2018/08/31 10:21:29 [DEBUG] watching basicAuth
2018/08/31 10:21:29 [DEBUG] watching dockerCredential
2018/08/31 10:21:29 [DEBUG] watching service
2018/08/31 10:21:29 [DEBUG] watching dnsRecord
2018/08/31 10:21:29 [DEBUG] watching namespacedCertificate
2018/08/31 10:21:29 [DEBUG] watching job
2018/08/31 10:21:29 [DEBUG] watching certificate
2018/08/31 10:21:29 [DEBUG] watching configMap

@cloudnautique
Contributor

Thanks @alv91. Is there nothing going by at roughly the same time as one of the streamwatcher error messages regarding the size?

@alv91

alv91 commented Sep 3, 2018

@cloudnautique unfortunately, nothing more. Not much more I can say: the memory is not released when I stop using the web UI.

I don't know if it is related, but when I inspect the web browser, the console is flooded with:

subscribe.js:287 Project Subscribe disconnected (sockId=null)
socket.js:57 Socket connecting (id=38, url=wss://<ip>/v3/projects/c-mrx9c:p-9t9r9/subscribe...)
socket.js:241 Socket opened (state=reconnecting, id=38)
subscribe.js:277 Project Subscribe connected (sockId=38) (after 1 try, 0.183 sec)
socket.js:186 Socket null closed

@cjellick

We've addressed 3 or 4 different performance issues around this ticket in v2.0.8. @cloudnautique has verified those through performance testing. As he suggests, I am going to close this issue. New bugs or issues around memory leaks or performance should be opened as new issues.

@hesoyamcode

Is this issue solved? I get the same problem: memory increases unexpectedly on a 4 GB RAM / 2 vCPU / 60 GB disk node, even though I only run a hello-world container and a small hello React app. At the moment I'm still using Cattle in production and have not done the migration yet, since this issue still exists in my R&D environment.

@fr4ngus

fr4ngus commented Mar 21, 2019

I have the same problem when I go to the Catalogs tab and either Helm Stable or Helm Incubator is enabled.
