
Still a memory leak with k8s - 1.1 RC4 #844
Closed
rrichardson opened this issue Nov 14, 2016 · 21 comments

@rrichardson

I know this was addressed in #387

I also note that somebody reported an existing leak in the new solution (back in May).

I am running 1.1 RC4

Here is what I'm seeing:

(chart: memory usage of the traefik instances)

The 2nd item in the list hit the memory limit and was killed.

@emilevauge
Member

Ouch, can you give us more details? Which Kubernetes version are you using? Were you running traefik 1.0 before? Any difference?
/cc @containous/traefik

@rrichardson
Author

k8s version : Server Version: version.Info{Major:"1", Minor:"4", GitVersion:"v1.4.0", GitCommit:"a16c0a7f71a6f93c7e0f222d961f4675cd97a46b", GitTreeState:"clean", BuildDate:"2016-09-26T18:10:32Z", GoVersion:"go1.6.3", Compiler:"gc", Platform:"linux/amd64"}

I haven't run a prior version of Traefik (at least not while monitoring and using the cluster much)

Are there any debug/gc metrics I can pull out of Traefik to help narrow this down?
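For reference, here is a minimal sketch of the kind of runtime metrics in question. It assumes nothing about Traefik's own debug endpoints; any Go process can report heap usage, GC cycles, and goroutine counts via the runtime package and net/http/pprof:

```go
// Generic illustration only, not Traefik's debug API: periodically print the
// runtime statistics relevant to a leak hunt and expose the standard pprof
// endpoints on localhost:6060 (e.g. /debug/pprof/goroutine?debug=1).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"runtime"
	"time"
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		log.Printf("heap_inuse=%dKiB num_gc=%d goroutines=%d",
			m.HeapInuse/1024, m.NumGC, runtime.NumGoroutine())
		time.Sleep(30 * time.Second)
	}
}
```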

@emilevauge
Member

Anything in the logs?

@rrichardson
Author

rrichardson commented Nov 14, 2016

Here's an anonymized debug log; I'm not really sure what I'm looking for...

traefik.log.gz

There are a couple of errors involving broken connections and failures to parse:

time="2016-11-14T16:43:45Z" level=error msg="Kubernetes connection error failed to decode watch event: GET "https://100.64.0.1:443/apis/extensions/v1beta1/ingresses\" : unexpected EOF, retrying in 605.782735ms"

time="2016-11-14T16:45:58Z" level=error msg="Kubernetes connection error failed to decode watch event: GET "https://100.64.0.1:443/api/v1/endpoints\" : invalid character 'o' looking for beginning of value, retrying in 276.215364ms"

Other than that, I can't find anything that might lead to a leak.
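As a side note (an illustration, not Traefik code), both messages are standard encoding/json errors: "unexpected EOF" when the API server closes the watch mid-object, and "invalid character 'o' looking for beginning of value" when the response body is not JSON at all. On their own they point at the watch connection being cut rather than at a leak:

```go
// Reproduces the two error strings above with encoding/json alone.
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	var v interface{}

	// Stream cut off in the middle of an object -> "unexpected EOF"
	err := json.NewDecoder(strings.NewReader(`{"type":"ADDED","object":`)).Decode(&v)
	fmt.Println(err)

	// Non-JSON body -> "invalid character 'o' looking for beginning of value"
	err = json.NewDecoder(strings.NewReader("oops")).Decode(&v)
	fmt.Println(err)
}
```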

@errm
Contributor

errm commented Nov 14, 2016

Are you running an HA master with the --leader-elect flag set on kube-scheduler and/or kube-controller-manager?

@rrichardson
Author

rrichardson commented Nov 14, 2016

I am running HA with leader-elect=true on the controller-manager.
3 masters.

@emilevauge
Member

There was an old issue with event spamming, but it did not lead to a memory leak...
#449
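For background (a generic sketch, not what Traefik actually does): with --leader-elect, the active kube-scheduler and kube-controller-manager renew their lease every few seconds by updating an Endpoints object in kube-system, so anything watching all endpoints sees a steady stream of modification events even when nothing user-visible changes. A consumer can filter those out, for example:

```go
// Illustrative only: drop the kube-system leader-election Endpoints updates
// so lease renewals do not trigger work. The Event type is a stand-in for a
// Kubernetes watch event, not a real client-go type.
package main

import "fmt"

type Event struct {
	Kind      string // "Endpoints", "Ingress", ...
	Namespace string
	Name      string
}

func interesting(e Event) bool {
	if e.Kind == "Endpoints" && e.Namespace == "kube-system" &&
		(e.Name == "kube-scheduler" || e.Name == "kube-controller-manager") {
		return false // leader-election lease renewal, not a real change
	}
	return true
}

func main() {
	for _, e := range []Event{
		{"Endpoints", "kube-system", "kube-controller-manager"},
		{"Endpoints", "default", "my-service"},
	} {
		fmt.Printf("%+v handle=%v\n", e, interesting(e))
	}
}
```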

@emilevauge
Member

I tried to reproduce this with Kubernetes v1.4.4 and have not run into a memory leak on our side, despite the leader-election event spam.
Another way to investigate would be to kill -SIGABRT $PID_OF_TRAEFIK (while traefik is using a lot of memory) and send us the stack trace.
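For anyone following along, this works because a Go binary that has not overridden the default signal handling exits with a stack trace of every live goroutine when it receives SIGABRT. A tiny demonstration, unrelated to Traefik itself:

```go
// Run this, then `kill -SIGABRT <pid>`: the Go runtime prints "SIGABRT: abort"
// followed by the stack of main and of each blocked worker goroutine.
package main

import (
	"fmt"
	"os"
	"time"
)

func worker() {
	select {} // blocks forever; shows up in the SIGABRT dump
}

func main() {
	fmt.Println("pid:", os.Getpid())
	for i := 0; i < 3; i++ {
		go worker()
	}
	time.Sleep(time.Hour)
}
```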

@emilevauge emilevauge added this to the 1.1 milestone Nov 15, 2016
@rrichardson
Author

rrichardson commented Nov 15, 2016

I wonder if it is a function of the number of services/pods. We currently have about 70 pods across 30 services.

Either way, attached is the stack trace. (with 6 million goroutines :) )

traefik-abrt.log.gz
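Six million goroutines means something is being started per retry or per event and never torn down. Purely as an illustration of that shape (not the actual bug, which #845 addresses), a retry loop that spawns a new watcher without stopping the previous one grows without bound:

```go
// Illustrative sketch of a goroutine leak in a watch/retry loop, not Traefik's
// real code: every "reconnect" starts a watcher, and none are ever stopped.
package main

import "time"

func watch(stop chan struct{}) {
	for {
		select {
		case <-stop:
			return
		case <-time.After(time.Second):
			// ... read one watch event ...
		}
	}
}

func main() {
	for {
		stop := make(chan struct{})
		go watch(stop) // new watcher on every reconnect
		time.Sleep(time.Second)
		// BUG: close(stop) is never called, so each watcher lives forever
		// and the goroutine count climbs on every iteration.
	}
}
```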

@emilevauge
Member

emilevauge commented Nov 15, 2016

@rrichardson thanks for investigating this, it will help us a lot :)

@emilevauge
Member

Hey @rrichardson, it would be awesome if you could test with this Docker image, containous/traefik:k8s; it has been built with the fix from #845.

@rrichardson
Author

The new image has been running for 1 hour and so far the results are encouraging. Each pod is using about 20MB of RAM. I don't have any pretty charts yet, but things look good so far.

@emilevauge
Member

@rrichardson I love what you are saying 👍

@rrichardson
Author

(chart: traefik memory usage with the new image)

@rrichardson
Author

It hasn't leveled off yet, which is a bit of a concern. However, the previous build would increase at a rate of 35MB/hour. The current build went from 20MB to 40MB in 2.5 hours, or 8MB/hr. I'll check again this evening to see if it has stopped increasing.

@emilevauge
Member

@rrichardson

It hasn't leveled off yet, which is a bit of a concern

Indeed...
Could you kill -SIGABRT $PID_OF_TRAEFIK again?

@rrichardson
Author

traefik.log.gz

@emilevauge
Member

emilevauge commented Nov 17, 2016

I updated the Docker image containous/traefik:k8s with another fix.
Goroutines and memory allocation are pretty stable on my laptop:
(screenshot, 2016-11-17: goroutine count and memory allocation graphs)

@rrichardson
Author

I've been breaking and fixing prometheus all day, so I don't have much historical data, but I have been running the new image for almost an hour and so far it looks good.

With the previous build the instances were at 25MB by this point; these seem to be holding steady at 14MB. I'll report back tomorrow.

(chart: traefik memory usage)

@rrichardson
Author

The latest fix definitely worked. All 3 instances are sitting at 14MB after almost 24 hours.

@emilevauge
Member

@rrichardson Awesome news :) I really would like to thank you for your help on this 👍

@traefik traefik locked and limited conversation to collaborators Sep 1, 2019