Fix issue where drain leaks threads after discovering dead consumers. #171
Conversation
… when we sleep old threads aren't getting cleaned up by GC (reference via closure or something else holding on?)
lib/travis/logs/drain_consumer.rb (outdated)

```diff
@@ -95,7 +95,7 @@ def dead?
 ensure
   @dead = true
   @batch_buffer = nil
-  sleep
+  exit
```
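The one-line change swaps an endless `sleep` for `exit`. A minimal, self-contained sketch (assumed behavior, not travis-logs code) of why a parked thread can never be reclaimed:

```ruby
# The Ruby VM keeps every live thread reachable through Thread.list, so
# a thread parked in an endless `sleep` can never be garbage collected,
# even when nothing else holds a reference to it. A thread that exits
# is done and can be reclaimed.
sleeper = Thread.new { sleep }       # parks forever, stays alive
exiter  = Thread.new { Thread.exit } # ends immediately
exiter.join

sleep 0.01 until sleeper.status == 'sleep' # wait for the sleeper to park

still_listed = Thread.list.include?(sleeper)
puts still_listed    # => true: the VM still references the sleeper
puts exiter.alive?   # => false: finished threads can be reclaimed
sleeper.kill         # clean up the demo thread
```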
…ect fall out of scope the threads are all cleaned up, since they're not sleeping infinitely. `exit` felt brute force even if it accomplished similar.
Talked about this with @igorwwwwwwwwwwwwwwwwwwww in Slack (https://travisci.slack.com/archives/C474DLQAK/p1522858075000558 sorry for the private link, y'all). And while it seems this didn't cause the entire process to exit and only the threads exited, it did cause me to ask some questions around how needed the
Initially this didn't work due to another experiment I was trying (clearing …). This feels cleaner and less brute force than the …
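The Slack observation above — that only the threads exited, not the whole process — can be sketched as follows (illustrative demo, not travis-logs code):

```ruby
# Kernel#exit raises SystemExit. Raised inside a child thread, it ends
# only that thread; the main thread and the process keep running.
t = Thread.new do
  Thread.current.report_on_exception = false # keep the demo output quiet
  exit # SystemExit terminates this thread only
end
sleep 0.01 while t.alive? # poll instead of join, to avoid re-raising
                          # the SystemExit in this thread

puts t.alive?             # => false
puts 'process still running'
```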
lgtm! this is very cool
Fix issue where drain leaks threads after discovering dead consumers.
[Enterprise Backport] Fix issue where drain leaks threads after discovering dead consumers. (#171)
1. What is the problem that this PR is trying to fix?
On servers without very much activity, we can see issues where `travis-logs:drain`'s consumers die due to timeouts or other issues. Once declared dead, every tick more threads are created. We were tracking this as a blocker for Enterprise in https://github.com/travis-pro/travis-enterprise/issues/275; from that issue you can follow along as I try different things. Copying a comment there (since it's private):
So I've been researching this today, trying different things, trying to isolate when it happens. What I've noticed is that the thread count is stable until the first time it finds dead consumers:
After this, every tick creates 5 new threads.
Just checking the consumers creates threads:
Up until the first dead consumers, this does not happen. It's stable. After we find dead consumers, it happens every tick.
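The per-tick growth described above can be sketched like this — a hypothetical simplification (names and structure are illustrative, not the real drain.rb code):

```ruby
# Once consumers are flagged dead, a tick replaces them with fresh ones,
# but the dead consumers' threads -- parked in `sleep` by the old
# shutdown code -- stay alive, so the total thread count grows.
consumers = Array.new(5) { { dead: false, thread: Thread.new { sleep } } }
old_threads = consumers.map { |c| c[:thread] }

before = Thread.list.size
consumers.each { |c| c[:dead] = true } # simulate timeouts killing all 5
consumers.map! do |c|
  c[:dead] ? { dead: false, thread: Thread.new { sleep } } : c
end
leaked = Thread.list.size - before

puts leaked # => 5: five fresh threads, while the five old sleepers linger

# demo cleanup
(old_threads + consumers.map { |c| c[:thread] }).each(&:kill)
```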
Not sure if it's Bunny that's leaking the threads or something we're doing with it, but thought it was curious that Bunny would leak so much (also, 27 threads for 5 consumers seems a bit interesting, but I'm sure there are some other tasks around these that I don't know about).
This will continue until it fills the server. It even seems to speed up, as if those threads are still being worked despite no longer being part of `@consumers` (https://github.com/travis-ci/travis-logs/blob/master/lib/travis/logs/drain.rb#L41-L53). After enough time, we get thread errors and the process crashes. In Enterprise, upstart restarts the process. In staging on Heroku it does too, but after enough times it stops, and then staging appears to be broken.

2. What approach did you choose and why?
I tracked this down to the `shutdown` method's `ensure`. After ensuring the buffer is nil and the consumer is marked dead, it used to `sleep`. This would cause the thread to wait for the next time it's called on. I'm not super clear on what mechanism we're using for threading (it may be Bunny's threading in the connection). If all references were cleaned up, this should be enough for it to get garbage collected, but it seems something is holding onto the references. Since we're already in a method that is declaring that we are done with this connection/`DrainConsumer` completely, I switched out the `sleep` for an `exit`
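A hedged sketch of the resulting `shutdown` shape (the class and surrounding code are illustrative; only the `ensure` body mirrors the diff). `Thread.exit` is used here so the demo can `join` quietly; the PR's bare `exit` likewise ends only the calling thread:

```ruby
# Illustrative stand-in for DrainConsumer: mark the consumer dead and
# drop the buffer first, then end the worker thread itself instead of
# parking it in an endless sleep.
class FakeDrainConsumer
  attr_reader :dead

  def shutdown
    # real code would cancel the Bunny subscription here
  ensure
    @dead = true
    @batch_buffer = nil
    Thread.exit # the PR uses bare `exit`; both end only this thread
  end
end

consumer = FakeDrainConsumer.new
worker = Thread.new { consumer.shutdown }
worker.join # Thread.exit terminates quietly, so join returns normally

puts consumer.dead  # => true
puts worker.alive?  # => false
```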
. Ending the thread entirely.3. How can you test this?
Start up a server, put it in debug logging so you can see the dead consumer messages. Run a single job, then don't touch it for a long period. After you wait long enough some consumers will be dead. In the old way, this would start the leaking of threads. Now watching the thread count, it stays stable. 🎉
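For the "watching the thread count" step above, a hypothetical helper (not part of travis-logs) that samples the process's Ruby thread count; with the fix, the numbers should hold steady instead of climbing:

```ruby
# Sample the live thread count a few times; a leak shows up as a
# monotonically climbing number, stability as a flat series.
samples = 3.times.map do
  count = Thread.list.size
  sleep 0.1
  count
end

puts samples.inspect # e.g. a flat series when nothing is leaking
stable = samples.uniq.size == 1
```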
4. What feedback would you like?
I don't like that I'm not super sure exactly where the threads are being created and joined. As such, it's harder to know exactly why my solution works beyond the brute-force reason that `exit` will certainly kill threads. I feel OK with this approach at the moment, since it's in the `shutdown` method, but someone checking my work would be appreciated.