rancher 2.0.4 server memory leak #14356
Comments
Thanks for the info. We are looking into resource consumption in Rancher 2.0, both to fix issues and to help users plan for capacity. If you are willing to share the info:
Thanks for your reply! We are still at an early stage in our Kubernetes adventure, so there are no monster-size clusters. One thing may be worth noting: some of our clusters are actually far away from our Rancher server, which (possibly) led to frequent cluster-unavailable events in the Rancher web UI. We suspected this to be the devil behind the leaking memory and relocated the Rancher server to be network-wise close to these off-shore clusters. Since then, the memory consumption of the cattle pod seems reasonable. Don't know whether or not the info is useful.
We are seeing the same issue in production with our cluster nodes. Memory usage creeps up until all available memory is consumed, then the node becomes unresponsive and needs to be replaced. We now run 8 GB nodes to alleviate the problem to a degree, but as you can see in the screenshot (showing a range since Monday), there definitely is a leak somewhere. We currently run 3 EC2 m5.large worker nodes with RancherOS 1.4.0, using the EU-Frankfurt AMI provided by RancherOS, and a separate etcd/control t2.medium node with RancherOS 1.4.0. Each worker node has 15-20 pods. Traffic gets to pods via EC2 load balancers managed by the Kubernetes AWS plugin provided by Rancher. We are still running Rancher 2.0.0, but I would imagine that has little impact on this issue. So far the etcd/control node has not crashed, but the worker nodes have crashed regularly. Edit 2: Our Rancher server is also very far away from the actual cluster.
On a setup with Rancher v2.0.5-rc3, we ran tests involving custom host addition with different roles.
Is there anything new related to this? Do you recommend downgrading?
Clusters deployed via AWS EC2 three weeks ago are stable. While trying to deploy some new clusters from the same RKE instance (which has been upgraded), the newly deployed EC2 clusters use the same instance sizing and run the exact same workload, yet the workers crash out within minutes at 100% CPU and memory utilization. The same workload on the stable clusters has the hosts sitting at under 1 CPU load and about 30% memory utilization per worker node. RancherOS hosts, AWS EC2 m5.xlarge x 10 workers.
There have been several performance-related fixes that have gone in but were not tagged specifically to this issue. @cloudnautique can you use this issue to track your performance testing?
Yeah, we can use this for the memory consumption work.
@penevigor What are you doing on your machines that you manage to leak > 2.5 GB of RAM in two hours? In our case it takes about three days for a gigabyte, and we're primarily running web servers with comparatively low-throughput network traffic and occasional high-usage CPU tasks.
@advancingu We were leaking 16 GB of RAM x 10 nodes in under 2 hours. The workload consisted of various web services: some Java, some PHP, some Python; roughly 70 containers in 20 pods (workloads at 2-5x scale depending on the service). This workload is stable with plenty of resources remaining when run on Rancher 1.6 via Cattle on far fewer resources (5 x m4.large vs 10 x m5.xlarge).
Rancher 2.0.6 with Kubernetes v1.10.5-rancher1-1. Maybe a hint in which direction to look (from the Rancher server logs, correlated with the CPU/memory spikes):
The last message repeats about 5 times a second. Edit: the unmarshal error is already mentioned in #12332. The unmarshal error does occur periodically in the logs, but is only sometimes followed by the "backup up reader" error and the OOM issue.
I have a similar issue with v2.0.2, especially while having the Project (workloads/services/volumes) overview open. It seems like the UI tries to reconnect to the websockets every second and fails. Memory and CPU usage increase every second (CPU stays at 100%). If I close the browser, the CPU usage goes back down and the memory gets freed after a while as well.
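A tight one-second reconnect loop like the one described here can keep a tab's CPU pinned and accumulate memory if each failed attempt leaves handlers behind. As a hedged illustration only (not the Rancher UI's actual code; the endpoint URL is an assumption), a browser-side reconnect loop with capped exponential backoff might look like this:

```ts
// Hypothetical illustration of a browser-side reconnect loop with
// capped exponential backoff; the endpoint and event handling are
// assumptions, not Rancher UI code.
const WS_URL = "wss://rancher.example.com/v3/subscribe"; // hypothetical endpoint

let attempt = 0;
let socket: WebSocket | null = null;

function connect(): void {
  socket = new WebSocket(WS_URL);

  socket.onopen = () => {
    attempt = 0; // reset backoff once a connection succeeds
  };

  socket.onmessage = (event: MessageEvent) => {
    console.log("update:", event.data);
  };

  socket.onclose = () => {
    // Drop the reference so the closed socket can be garbage-collected,
    // then retry with capped exponential backoff instead of every second.
    socket = null;
    const delayMs = Math.min(30_000, 1_000 * 2 ** attempt);
    attempt += 1;
    setTimeout(connect, delayMs);
  };

  socket.onerror = () => {
    socket?.close(); // funnel errors through onclose so the backoff applies
  };
}

connect();
```

The key design point is that failed attempts back off toward a 30-second ceiling rather than hammering the server once per second, which also bounds how many half-open sockets and handlers can pile up.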
Need to coordinate with @cloudnautique on who and how we're going to verify this.
@cjellick I think with the release of v2.0.8 and the catalog backoff fix, we should resolve this one. If newer leaks pop up, we can open separate issues.
Same issue with v2.0.8.
@alv91 there have been multiple causes, and multiple fixes, addressed in this issue. Can you please provide additional information, or open a new issue specifically outlining the symptoms/conditions under which you are seeing high memory consumption? Ideally, if you could run in debug mode for a bit and look for any log messages that seem to be repeating, that would be really helpful. Also, specs on the sizing of the server would be useful as well.
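For the "repeating log messages" check, a hedged, generic helper (not a Rancher tool) that counts duplicate lines in a log stream could look like the sketch below; the timestamp-stripping regex is an assumption about the log format:

```ts
// Hypothetical helper for spotting repeating messages in a debug log, e.g.:
//   kubectl logs <rancher-pod> | node count-repeats.js
// The timestamp-stripping regex is an assumption about the log format.
import * as readline from "readline";

const counts = new Map<string, number>();
const rl = readline.createInterface({ input: process.stdin });

rl.on("line", (line: string) => {
  // Normalize by dropping a leading ISO-8601-ish timestamp, if present,
  // so identical messages logged at different times count as one key.
  const msg = line.replace(/^\d{4}-\d{2}-\d{2}[T ][\d:.Z+-]+\s*/, "");
  counts.set(msg, (counts.get(msg) ?? 0) + 1);
});

rl.on("close", () => {
  // Print the 20 most frequent messages, most repeated first.
  const top = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, 20);
  for (const [msg, n] of top) {
    console.log(`${String(n).padStart(8)}  ${msg}`);
  }
});
```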
@cloudnautique Of course, I will post tomorrow. In short, it is the problem described by @BowlingX. I noticed memory increasing to 100% on the Catalog Apps tab and probably other tabs.
@cloudnautique I have one cluster with 6 nodes. The problem is present on a clean installation, i.e. without any resources in Kubernetes.
@alv91, is the Rancher server in debug mode? You need to add the debug flag if not. It looks like there is an open issue upstream where objects can be larger than 1MiB but the system doesn't handle them properly: kubernetes/kubernetes#57073
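For context on kubernetes/kubernetes#57073: the upstream problem concerns objects whose encoding exceeds the watch decoder's 1 MiB buffer. A hedged sketch of hunting for such oversized objects, here scanning ConfigMaps with the @kubernetes/client-node library (the response shape differs between client versions, hence the defensive access below), might look like this:

```ts
// Hypothetical sketch: flag ConfigMaps whose serialized size exceeds 1 MiB,
// the threshold discussed in kubernetes/kubernetes#57073. Uses the
// @kubernetes/client-node library; older versions return { body }, newer
// ones return the list directly, hence (res.body ?? res).
import * as k8s from "@kubernetes/client-node";

const ONE_MIB = 1024 * 1024;

async function main(): Promise<void> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const core = kc.makeApiClient(k8s.CoreV1Api);

  const res: any = await core.listConfigMapForAllNamespaces();
  const items: k8s.V1ConfigMap[] = (res.body ?? res).items;

  for (const cm of items) {
    const size = Buffer.byteLength(JSON.stringify(cm), "utf8");
    if (size > ONE_MIB) {
      console.log(`${cm.metadata?.namespace}/${cm.metadata?.name}: ${size} bytes`);
    }
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

The same scan could be repeated for other frequently-watched resource types; ConfigMaps are used here only because they are a common place for oversized objects to accumulate.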
@cloudnautique In debug mode, the log is just flooding with:
Thanks @alv91. Nothing going by at roughly the same time as one of the streamwatcher error messages regarding the size?
@cloudnautique Unfortunately, nothing more. What I can say is that memory is not released when I stop using the web UI. I don't know if it is related, but when I inspect it in the web browser, the console is flooded with:
We've addressed 3 or 4 different performance issues around this ticket in v2.0.8. @cloudnautique has verified those through performance testing. As he suggests, I am going to close this issue. New bugs or issues around memory leaks or performance should be opened as new issues.
Is this issue solved? I get the same trouble with memory increasing unexpectedly on a machine with 4 GB of RAM, 2 vCPUs, and a 60 GB disk, even though I only run a hello-world container and a mini hello React app. At the moment I'm still using Cattle in production and have not yet done a migration, after seeing that this issue still exists in my R&D environment.
I have the same problem when I go to the Catalogs tab and one of Helm Stable or Helm Incubator is enabled.
Rancher versions:
rancher/rancher: v2.0.4
Kubernetes versions:
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack VM
Setup details: (single node rancher vs. HA rancher, internal DB vs. external DB)
We set up a 3-node Kubernetes cluster manually via RKE on OpenStack-backed VMs and deployed a single-pod deployment of Rancher 2.0.4 (we have been deploying Rancher 2.0 since the technical preview).
Recently this system became a web console for our production Kubernetes clusters, and we discovered that the pod of the cattle deployment crashes because of OOM (RSS nearly 7 GiB) every several days (see the monitoring sketch below for one way to watch for these kills).
Every time the cattle pod crashes, all of our clusters need to re-establish connections to it, which is quite annoying.
Is there any progress on this issue, or can we help out with further investigation?
Results:
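Regarding the cattle pod being OOM-killed every few days, below is a hedged sketch of one way to watch for such kills from outside the pod. The namespace and pod-name filter are assumptions about this particular deployment, not values from the thread, and the positional listNamespacedPod call matches older @kubernetes/client-node versions:

```ts
// Hypothetical monitor: poll pods in the Rancher namespace and report
// containers whose last termination reason was OOMKilled. The namespace
// and the "cattle" name filter are assumptions about this deployment.
import * as k8s from "@kubernetes/client-node";

const NAMESPACE = "cattle-system"; // assumption

async function checkOnce(core: k8s.CoreV1Api): Promise<void> {
  // Older @kubernetes/client-node clients return { body }; newer ones
  // return the list directly, hence (res.body ?? res).
  const res: any = await core.listNamespacedPod(NAMESPACE);
  const pods: k8s.V1Pod[] = (res.body ?? res).items;

  for (const pod of pods) {
    if (!pod.metadata?.name?.includes("cattle")) continue; // assumed name filter
    for (const cs of pod.status?.containerStatuses ?? []) {
      const last = cs.lastState?.terminated;
      if (last?.reason === "OOMKilled") {
        console.log(
          `${pod.metadata?.name}/${cs.name} OOM-killed at ${last.finishedAt}, ` +
          `restarts so far: ${cs.restartCount}`
        );
      }
    }
  }
}

async function main(): Promise<void> {
  const kc = new k8s.KubeConfig();
  kc.loadFromDefault();
  const core = kc.makeApiClient(k8s.CoreV1Api);
  // Poll every 60 s; a watch would also work, but polling keeps the sketch short.
  setInterval(() => checkOnce(core).catch(console.error), 60_000);
  await checkOnce(core);
}

main().catch(console.error);
```

Paired with an alert, a monitor like this would at least timestamp each OOM kill so memory growth can be correlated with events in the Rancher server logs.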