
Fix issue where drain leaks threads after discovering dead consumers. #171

Merged: bnferguson merged 2 commits into master from bf-fix-leaking-thread-issue on Apr 5, 2018

Conversation

bnferguson (Contributor)

1. What is the problem that this PR is trying to fix?

On servers without much activity, we see issues where travis-logs:drain's consumers die due to timeouts or other issues. Once they are declared dead, every tick creates more threads.

We were tracking this as a blocker for Enterprise in https://github.com/travis-pro/travis-enterprise/issues/275; in that issue you can follow along as I tried different things. Copying a comment from there (since the issue is private):


So I've been researching this today, trying different things and trying to isolate when it happens. What I've noticed is that the thread count is stable until the first time the drain finds dead consumers:

[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=1/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=1/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=2/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=2/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=3/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=3/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=4/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=4/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=5/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=5/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="creating new consumer" name=1/5 tid=69990780612440

After this, every tick creates 5 new threads.

Just checking the consumers creates threads:

[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=1/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=2/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=3/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=4/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=5/5 tid=69990780612440

Up until the first dead consumers are found, this does not happen; the thread count is stable. After we find dead consumers, it happens every tick.

I'm not sure whether it's Bunny that's leaking the threads or something we're doing with it, but it seemed curious that Bunny would leak so much (also, 27 threads for 5 consumers seems a bit odd, but I'm sure there are other tasks around these that I don't know about).


This will continue until it fills the server. It even seems to speed up, as if those threads are still being worked despite no longer being part of @consumers (https://github.com/travis-ci/travis-logs/blob/master/lib/travis/logs/drain.rb#L41-L53). After enough time, we get thread errors and the process crashes. In Enterprise, upstart restarts the process. In staging on Heroku it does too, but after enough restarts it stops, and then staging appears to be broken.
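For context, a rough sketch of the replacement loop in question (paraphrased from the drain.rb lines linked above; names and structure are approximate, not the actual travis-logs source):

```ruby
# Paraphrased sketch (not the actual travis-logs source) of the tick loop
# linked above: each tick, consumers are checked and dead ones replaced.
# The bug: the old consumer's thread was parked in an infinite sleep, so
# dropping it from @consumers never freed it.
def check_consumers
  @consumers.each_with_index do |consumer, i|
    logger.debug('checking drain consumer', name: "#{i + 1}/#{@consumers.size}")
    next unless consumer.dead?
    logger.debug('dead consumer found', name: "#{i + 1}/#{@consumers.size}")
    logger.debug('creating new consumer', name: "#{i + 1}/#{@consumers.size}")
    @consumers[i] = create_consumer # old thread lingers, asleep forever
  end
end
```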

2. What approach did you choose and why?

I tracked this down to the shutdown method's ensure block. After ensuring the buffer is nil and the consumer is marked dead, it used to sleep, which caused the thread to wait until the next time it was called on. I'm not entirely clear on what mechanism we're using for threading (it may be Bunny's threading in the connection). If all references were cleaned up, this should be enough for it to get garbage collected, but it seems something is holding onto the references. Since we're already in a method declaring that we are completely done with this connection/DrainConsumer, I switched out the sleep for an exit, ending the thread entirely.
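To illustrate the underlying mechanics, here is a standalone Ruby demonstration (not travis-logs code): a thread parked in a bare sleep stays alive indefinitely, and Ruby never garbage-collects a live thread, even one that nothing references anymore.

```ruby
# Standalone demonstration (not travis-logs code): a thread blocked in a
# bare `sleep` stays alive forever, and Ruby never garbage-collects a live
# thread, even one that nothing references anymore.
before = Thread.list.size

10.times { Thread.new { sleep } } # parks forever, like the old ensure block
sleep 0.1                         # give the threads time to start
GC.start

puts Thread.list.size - before    # => 10: all ten sleepers still alive

10.times { Thread.new { :done } } # block returns, so the thread ends
sleep 0.1
GC.start

puts Thread.list.size - before    # => still 10: only the sleepers remain
```

Replacing the sleep with an exit forces the consumer's thread to terminate instead of parking, which is why the thread count stabilizes with this change.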

3. How can you test this?

Start up a server with debug logging enabled so you can see the dead consumer messages. Run a single job, then don't touch the server for a long period. After you've waited long enough, some consumers will be dead. Previously, this would start the thread leak; now, watching the thread count, it stays stable. 🎉
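One way to watch the thread count during this test (an illustrative helper, not part of the PR; on Linux, each native thread of a process appears under /proc/<pid>/task):

```ruby
require 'time'

# Illustrative helper (not part of the PR): poll a process's native thread
# count on Linux by counting its /proc/<pid>/task entries.
pid = Integer(ARGV.fetch(0))
loop do
  count = Dir.glob("/proc/#{pid}/task/*").size
  puts "#{Time.now.utc.iso8601} threads=#{count}"
  sleep 10
end
```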

4. What feedback would you like?

I don't like that I'm not entirely sure where the threads are being created and joined. As such, it's harder to know exactly why my solution works beyond the brute-force reason that exit will certainly kill threads. I feel OK with this approach for the moment, since it's in the shutdown method, but someone checking my work would be appreciated.

Commit 1: … when we sleep old threads aren't getting cleaned up by GC (reference via closure or something else holding on?)
```diff
@@ -95,7 +95,7 @@ def dead?
     ensure
       @dead = true
       @batch_buffer = nil
-      sleep
+      exit
```

Commit 2: …ect fall out of scope the threads are all cleaned up since they're not sleeping infinitely. exit felt brute force even if it accomplished similar.
@bnferguson (Contributor, Author)

Talked about this with @igorwwwwwwwwwwwwwwwwwwww in Slack (https://travisci.slack.com/archives/C474DLQAK/p1522858075000558, sorry for the private link, y'all). While it seems this didn't cause the entire process to exit, and only the threads exited, it did cause me to ask some questions about how necessary the exit was. After some experiments, I removed the exit entirely, and it seems to have the same effect. As I mentioned in Slack:

Actually testing one other thing: if we are indeed getting a kernel sleep and not a thread sleep (which would make sense), then all of those dead consumers would just be sleeping infinitely, since we called sleep with no interval.

Initially this didn't work because another experiment I was trying (clearing @amqp_conn) interfered, but once I removed that experiment, it seems that once we close the Rabbit connection and this object falls out of scope (as we replace it with a new one in https://github.com/travis-ci/travis-logs/blob/master/lib/travis/logs/drain.rb#L51-L52), the thread is properly cleaned up.
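For clarity, a hedged sketch of what the shutdown path looks like after removing the exit (only the ensure block is taken from the diff above; the method name, signature, and body are approximations):

```ruby
# Hedged sketch of DrainConsumer#shutdown after the second commit; only the
# ensure block comes from the diff above, the rest is approximate.
def shutdown(reason)
  logger.debug('shutting down drain consumer', reason: reason)
  @amqp_conn.close if @amqp_conn
ensure
  @dead = true
  @batch_buffer = nil
  # No trailing sleep/exit: the thread's block simply finishes, so once the
  # drain swaps in a new consumer and drops this one, nothing keeps the old
  # thread alive and it can be cleaned up.
end
```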

This feels cleaner and less brute-force than the exit approach. Thanks to @igorwwwwwwwwwwwwwwwwwwww for their good questions and insight on this issue!

@igorwwwwwwwwwwwwwwwwwwww (Contributor) left a comment


lgtm! this is very cool

@bnferguson deployed to org-production on April 5, 2018 07:30 (Active)
@bnferguson deployed to com-production on April 5, 2018 07:30 (Active)
@bnferguson merged commit 8830f9e into master on Apr 5, 2018
@bnferguson deleted the bf-fix-leaking-thread-issue branch on April 5, 2018 07:47
bnferguson added a commit that referenced this pull request Apr 5, 2018
Fix issue where drain leaks threads after discovering dead consumers.
bnferguson added a commit that referenced this pull request Apr 5, 2018
[Enterprise Backport] Fix issue where drain leaks threads after discovering dead consumers. (#171)