
rancher 2.0.4 server memory leak #14356

Closed
balleddog opened this issue Jul 3, 2018 · 29 comments
Labels
kind/bug: Issues that are defects reported by users or that we know have reached a real release

@balleddog

Rancher versions:
rancher/rancher: v2.0.4

kubernetes versions:

Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.1", GitCommit:"d4ab47518836c750f9949b9e0d387f20fb92260b", GitTreeState:"clean", BuildDate:"2018-04-12T14:14:26Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Docker version: (docker version, docker info preferred)

Client:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.6
 API version:  1.24
 Go version:   go1.6.4
 Git commit:   78d1802
 Built:        Tue Jan 10 20:38:45 2017
 OS/Arch:      linux/amd64

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="16.04.1 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.1 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
UBUNTU_CODENAME=xenial

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack VM

Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
We set up a 3-node Kubernetes cluster manually via RKE on OpenStack-backed VMs and deployed Rancher 2.0.4 as a single-pod deployment (we have been running Rancher 2.0 since the technical preview).

Recently this system became the web console for our production Kubernetes clusters, and we discovered that the pod of the cattle deployment crashes because of OOM (RSS nearly 7 GiB) every few days.

Every time the cattle pod crashes, all of our clusters need to re-establish connections to it, which is quite disruptive.

Is there any progress on this issue, or can we help out with further investigation?

Results:
screen shot 2018-07-03 at 6 19 45 pm
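
For reference, a minimal way to confirm that the restarts really are kernel OOM kills (a rough sketch, assuming the Rancher server runs as a deployment in the cattle-system namespace; adjust names to your install):

# Check whether the last container termination was an OOM kill
kubectl -n cattle-system get pods
kubectl -n cattle-system describe pod <cattle-pod-name> | grep -A3 "Last State"
# On the node that hosted the pod, the kernel log should show the kill as well
dmesg -T | grep -iE "oom|killed process"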

@cloudnautique
Contributor

Thanks for the info. We are looking into resource consumption in Rancher 2.0, both to address issues and to help users plan for capacity.

If you are willing to share the info (a rough kubectl sketch for collecting these numbers is included after the list):

  1. How many clusters do you have?
  2. Type of clusters (Custom, imported, cloud)
  3. Number of nodes per cluster
  4. Number of pods and deployments in each cluster
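
A rough way to collect those counts per cluster, assuming kubectl access with each cluster's kubeconfig (plain kubectl commands, nothing Rancher-specific):

# Node, pod, and deployment counts for one cluster
kubectl get nodes --no-headers | wc -l
kubectl get pods --all-namespaces --no-headers | wc -l
kubectl get deployments --all-namespaces --no-headers | wc -l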

@balleddog
Author

Thanks for your reply!

We are still early in our Kubernetes adoption, so there are no monster-sized clusters.

One thing may be worth noting: some of our clusters are actually far away (network-wise) from our Rancher server, which possibly led to frequent cluster-unavailable events in the Rancher web UI.

We suspected this was the source of the memory leak and relocated the Rancher server so that it sits network-wise closer to these remote clusters.

Since then, the memory consumption of the cattle pod seems reasonable. I don't know whether or not this info is useful.

screen shot 2018-07-04 at 10 49 57 am
screen shot 2018-07-04 at 10 55 54 am

@advancingu

advancingu commented Jul 4, 2018

We are seeing the same issue in production with our cluster nodes. Memory usage creeps up until all available memory is consumed, then the node becomes unresponsive and needs to be replaced. We now run 8 GB nodes to alleviate the problem to a degree, but as you can see in the screenshot (showing a range since Monday), there definitely is a leak somewhere.

screenshot from 2018-07-04 16-21-34

We currently run 3 EC2 m5.large worker nodes with RancherOS 1.4.0 using the EU-Frankfurt AMI provided by RancherOS, and a separate etcd/control t2.medium node with RancherOS 1.4.0. Each worker node has 15-20 pods. Traffic gets to the pods via EC2 load balancers managed by the K8s AWS plugin provided by Rancher. We are still running Rancher 2.0.0, but I would imagine that has little impact on this issue.
Edit: Kubernetes is v1.10.1

So far the etcd/control node has not crashed, but the worker nodes regularly do.

Edit 2: Our Rancher server is also very far away from the actual cluster.
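
If it helps narrow things down, a quick way to see which container is actually consuming the memory on an affected worker (a sketch, assuming SSH access to the node; the --format flag needs Docker 1.13+):

# Per-container memory and CPU snapshot on the node
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.CPUPerc}}"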

@balleddog
Author

An update on this issue.

Even after we relocated our Rancher server's public Internet access point, the cattle pod still crashes due to kernel OOM kills:
screen shot 2018-07-05 at 12 50 53 pm
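
A possible stop-gap until the leak itself is fixed is to cap the container so the OOM kill stays local to the pod instead of destabilizing the node (a sketch, assuming the Rancher deployment is named cattle in the cattle-system namespace; pick limits that fit your nodes):

# Set memory requests/limits on the Rancher deployment
kubectl -n cattle-system set resources deployment cattle --limits=memory=4Gi --requests=memory=2Gi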

@deniseschannon deniseschannon added the kind/bug and kind/performance labels Jul 5, 2018
@deniseschannon deniseschannon added this to the v2.0.6 milestone Jul 5, 2018
@alena1108

#12965

@soumyalj

soumyalj commented Jul 6, 2018

On a setup with Rancher v2.0.5-rc3, I ran tests involving custom host addition with different roles.
The server crashed with the FATAL error below:


2018-07-06 00:27:55.668885 W | etcdserver: apply entries took too long [575.987489ms for 1 entries]
2018-07-06 00:27:55.669555 W | etcdserver: avoid queries with large range/delete range!
WARNING: 2018/07/06 00:27:57 Failed to dial 127.0.0.1:2379: connection error: desc = "transport: authentication handshake failed: context canceled"; please retry.
2018-07-06 00:27:57.297866 W | etcdserver: apply entries took too long [1.51920453s for 7 entries]
2018-07-06 00:27:57.298553 W | etcdserver: avoid queries with large range/delete range!
2018/07/06 00:28:01 [ERROR] NodePoolController c-6lqxl/np-jrqmn [nodepool-provisioner] failed with : Get https://127.0.0.1:6443/apis/management.cattle.io/v3/nodes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
WARNING: 2018/07/06 00:28:02 Failed to dial 127.0.0.1:2379: connection error: desc = "transport: authentication handshake failed: context canceled"; please retry.
E0706 00:28:02.946152       1 leaderelection.go:224] error retrieving resource lock kube-system/kube-scheduler: the server was unable to return a response in the time allotted, but may still be processing the request (get endpoints kube-scheduler)
2018/07/06 00:28:03 [ERROR] NodePoolController c-6lqxl/np-nd7b5 [nodepool-provisioner] failed with : Get https://127.0.0.1:6443/apis/management.cattle.io/v3/nodes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018/07/06 00:28:03 [INFO] Updating node ip-172-31-4-192 with labels map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/os:linux kubernetes.io/hostname:ip-172-31-4-192 node-role.kubernetes.io/worker:true] and annotations map[node.alpha.kubernetes.io/ttl:0 rke.cattle.io/external-ip:13.59.87.42 rke.cattle.io/internal-ip:172.31.4.192 volumes.kubernetes.io/controller-managed-attach-detach:true]
E0706 00:28:03.938664       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"", GenerateName:"", Namespace:"", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'selfLink was empty, can't make reference'. Will not report event: 'Normal' 'LeaderElection' 'df0e43ba6747 stopped leading'
E0706 00:28:03.945831       1 server.go:145] lost master
Error: lost lease
2018/07/06 00:28:03 [ERROR] kube-scheduler exited with error: lost lease
2018/07/06 00:28:03 [INFO] Shutting down SecretController controller
2018/07/06 00:28:03 [INFO] Shutting down NamespaceController controller
2018/07/06 00:28:03 [INFO] Shutting down RoleController controller
2018/07/06 00:28:04 [INFO] Shutting down SecretController controller
2018/07/06 00:28:04 [INFO] Shutting down RoleController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRoleController controller
2018/07/06 00:28:04 [INFO] Shutting down RoleBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRoleBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down ListenConfigController controller
2018/07/06 00:28:04 [INFO] Shutting down SettingController controller
2018/07/06 00:28:04 [INFO] Shutting down NodeDriverController controller
2018/07/06 00:28:04 [INFO] Shutting down DynamicSchemaController controller
2018/07/06 00:28:04 [INFO] Shutting down AuthConfigController controller
2018/07/06 00:28:04 [INFO] Shutting down GroupController controller
2018/07/06 00:28:04 [INFO] Shutting down GroupMemberController controller
2018/07/06 00:28:04 [INFO] Shutting down ProjectController controller
2018/07/06 00:28:04 [INFO] Shutting down PipelineController controller
2018/07/06 00:28:04 [INFO] Shutting down SourceCodeRepositoryController controller
2018/07/06 00:28:04 [INFO] Shutting down SourceCodeCredentialController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterPipelineController controller
2018/07/06 00:28:04 [INFO] Shutting down RoleTemplateController controller
2018/07/06 00:28:04 [INFO] Shutting down UserAttributeController controller
2018/07/06 00:28:04 [INFO] Shutting down TokenController controller
2018/07/06 00:28:04 [INFO] Shutting down ProjectRoleTemplateBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRoleTemplateBindingController controller
2018/07/06 00:28:04 [INFO] Shutting down UserController controller
2018/07/06 00:28:04 [INFO] Shutting down NodeController controller
2018/07/06 00:28:04 [INFO] Shutting down NodeController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterController controller
2018/07/06 00:28:04 [INFO] Shutting down ClusterRegistrationTokenController controller
2018/07/06 00:28:04 [FATAL] context canceled
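
The repeated "apply entries took too long" warnings suggest etcd could not keep up, which is usually a slow-disk or I/O-contention problem rather than a Rancher one. A rough way to check on the etcd node (a sketch, assuming an RKE-provisioned node whose etcd container is named etcd, with sysstat installed for iostat):

# Look for slow-apply warnings in the etcd container itself
docker logs etcd 2>&1 | grep -i "took too long" | tail -n 20
# Watch disk I/O wait while the problem occurs
iostat -x 1 5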

@angheladrianclaudiu

angheladrianclaudiu commented Jul 17, 2018

Anything new related to this? Do you recommend downgrading?

@sandbox-services

sandbox-services commented Jul 20, 2018

Clusters deployed via AWS EC2 three weeks ago are stable. Trying to deploy some new clusters from the same RKE instance (which has been upgraded), using the same instance sizing and running the exact same workload, the workers on the newly deployed EC2 clusters crash out in minutes at 100% CPU and memory utilization. The same workload on the stable clusters has the hosts sitting at under 1 CPU load and about 30% memory utilization per worker node.

RancherOS hosts, AWS EC2 m5.xlarge x 10 workers.
70 pods

@deniseschannon

There have been several performance-related fixes that have gone in but were not tagged specifically to this issue. @cloudnautique can you use this issue to track your performance testing?

@cloudnautique
Contributor

Yeah, we can use this for the memory consumption work.

@penevigor

penevigor commented Jul 25, 2018

We are experiencing the same issue.

You can see the cattle server using 83% memory on one of the Rancher nodes:
ran_cattle_container_memory1

And here you can see the free-memory graph going down until the container is recreated again:
ran01_mem1 2

@advancingu

@penevigor What are you doing on your machines that you manage to leak > 2.5 GB of RAM in two hours? In our case it takes about three days for a gigabyte, and we're primarily running web servers with comparably low-throughput network traffic and occasional high-CPU tasks.

@sandbox-services

sandbox-services commented Jul 28, 2018

@advancingu we were leaking 16 GB of RAM x 10 nodes in under 2 hours.

The workload consisted of various web services, some Java, some PHP, some Python: roughly 70 containers in 20 pods (workloads at 2-5x scale depending on the service). This workload is stable, with plenty of resources remaining, when run on Rancher 1.6 via Cattle on far fewer resources (5 x m4.large vs 10 x m5.xlarge).

@jzandbergen

Here we have two Rancher 2.0.6 instances, both on CentOS 7.5 and docker-ce 17.03-02. One has a single cluster to manage and the other has two clusters. We now have the same problem on the Rancher 2 instance that manages our two Kubernetes clusters.

It happens multiple times per day as you can see here:
mem
swap

If I can do or test anything to help, I'll gladly do so.

@sboschman

sboschman commented Aug 3, 2018

Rancher 2.0.6 with Kubernetes v1.10.5-rancher1-1

Maybe a hint in which direction to look (from the Rancher server logs, correlated with the CPU/memory spikes):

E0803 05:51:25.597774       1 streamwatcher.go:109] Unable to decode an event from the watch stream: json: cannot unmarshal string into Go struct field dynamicEvent.Object of type v3.NodeStatus
E0803 05:52:28.401636       1 streamwatcher.go:109] Unable to decode an event from the watch stream: backed up reader

The last message repeats about 5 times per second.

Edit: The unmarshal error is already mentioned in #12332. The unmarshal error does indeed occur periodically in the logs, but it is only sometimes followed by the "backed up reader" error and the OOM issue.
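
To correlate those errors with the CPU/memory spikes, grepping the server log with timestamps helps (a sketch; rancher-server is a placeholder for however your Rancher server container is named):

docker logs --timestamps rancher-server 2>&1 | grep -E "streamwatcher|backed up reader" | tail -n 50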

@BowlingX

BowlingX commented Aug 6, 2018

I have a similar issue with v2.0.2, especially while having the Project (workloads/services/volumes) overview open. It seems like the UI tries to reconnect to the websockets every second and fails. Memory usage increases every second and CPU stays at 100%. If I close the browser, the CPU usage goes back down and the memory gets freed after a while as well.

@cjellick cjellick modified the milestones: v2.0.7, v2.0 - Sep Aug 10, 2018
@deniseschannon deniseschannon modified the milestones: v2.0 - Sep, v2.1 Aug 21, 2018
@cjellick

Need to coordinate with @cloudnautique on who is going to verify this and how.

@cloudnautique
Contributor

@cjellick I think with the release of v2.0.8 and the catalog backoff fix, we should resolve this one. If newer leaks pop up, we can open separate issues.

@alv91

alv91 commented Aug 29, 2018

Same issue with v2.0.8

@cloudnautique
Contributor

@alv91 there have been multiple causes, and multiple fixes, addressed in this issue. Can you please provide additional information, or open a new issue specifically outlining the symptoms/conditions under which you are seeing high memory consumption?

Ideally, if you could run in debug mode for a bit and look for any log messages that seem to be repeating, that would be really helpful. Specs on the sizing of the server would be useful as well.

@alv91

alv91 commented Aug 29, 2018

@cloudnautique of course, I will post tomorrow. In short, it is the problem described by @BowlingX: I noticed memory increasing to 100% on the catalog apps tab and probably other tabs.

@alv91

alv91 commented Aug 30, 2018

@cloudnautique
I only see this event repeated multiple times per second:

E0830 08:00:57.504534 7 streamwatcher.go:109] Unable to decode an event from the watch stream: object to decode was longer than maximum allowed size

I have one cluster with 6 nodes. The problem is present on a clean installation, i.e. without any resources in Kubernetes.
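
One way to hunt for unusually large objects that could trip that size limit (a sketch using plain kubectl; repeat for secrets or other resource types as needed):

# Print the ten largest configmaps by serialized size
kubectl get configmaps --all-namespaces --no-headers | while read ns name rest; do
  echo "$(kubectl -n "$ns" get configmap "$name" -o json | wc -c) $ns/$name"
done | sort -n | tail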

@cloudnautique
Contributor

@alv91, is the Rancher server in debug mode? You need to add --debug to the rancher command. Hopefully that will give some clue as to which object is too large. The error itself is actually being thrown by Kubernetes.

It looks like there is an open issue upstream where objects can be larger than 1 MiB but the system doesn't handle them properly: kubernetes/kubernetes#57073
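
For reference, on a single-node Docker install enabling debug logging just means appending the flag to the container command, roughly (a sketch; adjust ports and image tag to your install):

docker run -d --restart=unless-stopped -p 80:80 -p 443:443 rancher/rancher:v2.0.8 --debug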

@alv91

alv91 commented Aug 31, 2018

@cloudnautique In debug mode it is just flooding with:

2018/08/31 10:21:29 [DEBUG] closing watcher for configMap
2018/08/31 10:21:29 [DEBUG] Impersonating user user-c9mjc, groups [system:authenticated]
2018/08/31 10:21:29 [DEBUG] watching ingress
2018/08/31 10:21:29 [DEBUG] watching workload
2018/08/31 10:21:29 [DEBUG] watching namespacedSecret
2018/08/31 10:21:29 [DEBUG] watching job
2018/08/31 10:21:29 [DEBUG] watching namespacedDockerCredential
2018/08/31 10:21:29 [DEBUG] watching namespacedSshAuth
2018/08/31 10:21:29 [DEBUG] watching appRevision
2018/08/31 10:21:29 [DEBUG] watching cronJob
2018/08/31 10:21:29 [DEBUG] watching sshAuth
2018/08/31 10:21:29 [DEBUG] watching replicationController
2018/08/31 10:21:29 [DEBUG] watching namespacedServiceAccountToken
2018/08/31 10:21:29 [DEBUG] watching statefulSet
2018/08/31 10:21:29 [DEBUG] watching cronJob
2018/08/31 10:21:29 [DEBUG] watching deployment
2018/08/31 10:21:29 [DEBUG] watching replicaSet
2018/08/31 10:21:29 [DEBUG] watching replicationController
2018/08/31 10:21:29 [DEBUG] watching daemonSet
2018/08/31 10:21:29 [DEBUG] watching serviceAccountToken
2018/08/31 10:21:29 [DEBUG] watching daemonSet
2018/08/31 10:21:29 [DEBUG] watching pod
2018/08/31 10:21:29 [DEBUG] watching replicaSet
2018/08/31 10:21:29 [DEBUG] watching namespacedBasicAuth
2018/08/31 10:21:29 [DEBUG] watching deployment
2018/08/31 10:21:29 [DEBUG] watching statefulSet
2018/08/31 10:21:29 [DEBUG] watching app
2018/08/31 10:21:29 [DEBUG] watching secret
2018/08/31 10:21:29 [DEBUG] watching persistentVolumeClaim
2018/08/31 10:21:29 [DEBUG] watching basicAuth
2018/08/31 10:21:29 [DEBUG] watching dockerCredential
2018/08/31 10:21:29 [DEBUG] watching service
2018/08/31 10:21:29 [DEBUG] watching dnsRecord
2018/08/31 10:21:29 [DEBUG] watching namespacedCertificate
2018/08/31 10:21:29 [DEBUG] watching job
2018/08/31 10:21:29 [DEBUG] watching certificate
2018/08/31 10:21:29 [DEBUG] watching configMap

@cloudnautique
Contributor

Thanks @alv91. Is there nothing going by at roughly the same time as one of the streamwatcher error messages regarding the size?

@alv91

alv91 commented Sep 3, 2018

@cloudnautique unfortunately, nothing more. Not much more I can say: the memory is not released when I stop using the web UI.

I don't know if it is related, but when I inspect the web browser, the console is flooded with:

subscribe.js:287 Project Subscribe disconnected (sockId=null)
socket.js:57 Socket connecting (id=38, url=wss://<ip>/v3/projects/c-mrx9c:p-9t9r9/subscribe...)
socket.js:241 Socket opened (state=reconnecting, id=38)
subscribe.js:277 Project Subscribe connected (sockId=38) (after 1 try, 0.183 sec)
socket.js:186 Socket null closed

@cjellick

We've addressed 3 or 4 different performance issues around this ticket in v2.0.8. @cloudnautique has verified those through performance testing. As he suggests, I am going to close this issue. New bugs or issues around memory leaks or performance should be opened as new issues.

@hesoyamcode

Is this issue solved? I get the same problem: memory increases unexpectedly on a 4 GB RAM / 2 vCPU / 60 GB disk node, even though I only run a hello-world container and a small hello React app. At the moment I'm still using Cattle in production and have not done the migration yet, since this issue still exists in my R&D environment.

@fr4ngus

fr4ngus commented Mar 21, 2019

I have the same problem when I go to the Catalogs tab and either Helm Stable or Helm Incubator is enabled.
