Fix issue where drain leaks threads after discovering dead consumers. #171

bnferguson · 2018-04-04T15:54:14Z

1. What is the problem that this PR is trying to fix?

On servers without very much activity we can see issues where travis-logs:drain's consumers die do to timeouts or other issues. Once declared dead, every tick more threads are created.

We were tracking this as a blocker for Enterprise in https://github.com/travis-pro/travis-enterprise/issues/275, from that issue you can follow along as I try different things. Copying a comment there (since it's private):

So I've been researching this today, trying different things, trying to isolate when it happens. What I've noticed is that the thread count is stable until the first time it finds dead consumers:

[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=1/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=1/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=2/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=2/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=3/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=3/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=4/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=4/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="checking drain consumer" name=5/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="dead consumer found" name=5/5 tid=69990780612440
[2018-03-28T14:45:22+0000][travis-logs:drain] time=2018-03-28T14:45:22+00:00 level=debug msg="creating new consumer" name=1/5 tid=69990780612440

After this, every tick creates 5 new threads.

Just checking the consumers creates threads:

[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=1/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=2/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=3/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=4/5 tid=69990780612440
[2018-03-28T14:45:32+0000][travis-logs:drain] time=2018-03-28T14:45:32+00:00 level=debug msg="checking drain consumer" name=5/5 tid=69990780612440

Up until the first dead consumers, this does not happen. It's stable. After we find dead consumers, it happens every tick.

Not sure if it's Bunny that's leaking the threads or something we're doing with it but thought it was curious that bunny would leak it so much (also 27 threads for 5 consumers seems a bit interesting, but I'm sure there's some other tasks around these that I don't know about).

This will continue until it fills the server. It even seems to speed up like those threads are still being worked despite not longer being part of @consumers (https://github.com/travis-ci/travis-logs/blob/master/lib/travis/logs/drain.rb#L41-L53). After enough time, we get thread errors and the process crashes. In Enterprise upstart restarts the process. In staging on Heroku it does too - but after enough times it stops and then staging appears to be broken.

2. What approach did you choose and why?

I tracked this down to the shutdown method's ensure. After ensuring the buffer is nil and the consumer is marked dead it used to sleep. This would cause the thread to wait for the next time it's called on. I'm not super clear on what mechanism we're using for threading (it may be Bunny's threading in the connection). If all references were cleaned up this should be enough for it to get garbage collected, but it seems something is holding onto the references. Since we're already in a method that is declaring that we are done with this connection/DrainConsumer completely, I switched out the sleep for an exit. Ending the thread entirely.

3. How can you test this?

Start up a server, put it in debug logging so you can see the dead consumer messages. Run a single job, then don't touch it for a long period. After you wait long enough some consumers will be dead. In the old way, this would start the leaking of threads. Now watching the thread count, it stays stable. 🎉

4. What feedback would you like?

I don't like that I'm not super sure exactly where the threads are being created and joined. As such it's harder to know exactly why my solution works besides the brute force reason of exit will certainly kill threads. I feel ok with this approach at the moment since it's in the shutdown method but someone checking my work would be appreciated.

… when we sleep old threads aren't getting cleaned up by GC (reference via closure or something else holding on?)

lib/travis/logs/drain_consumer.rb

@@ -95,7 +95,7 @@ def dead?
      ensure
        @dead = true
        @batch_buffer = nil
-        sleep
+        exit


…ect fall out of scope the threads are all cleaned up since they're not sleeping infinitely. exit felt brute force even if it accomplished similar.

bnferguson · 2018-04-04T17:11:24Z

Talked about this with @igorwwwwwwwwwwwwwwwwwwww in slack (https://travisci.slack.com/archives/C474DLQAK/p1522858075000558 sorry for the private link ya'll). And while it seems this didn't cause the entire process to exit and only the threads exited, it did cause me to ask some questions around around how needed the exit was. After some experiments I removed the exit entirely and it seems to have the same effect. As I mentioned in slack:

Actually testing one other thing, if we are indeed getting a kernel sleep and not at thread sleep (which would make sense), then all of those dead consumers would just be infinitely sleeping since we called sleep with no interval.

Initially this didn't work due to another experiment I was trying (clearing @amqp_conn) interfering, but once I removed that experiment, it seems once we close the Rabbit connection and this object falls out of scope as we replace it with a new one in https://github.com/travis-ci/travis-logs/blob/master/lib/travis/logs/drain.rb#L51-L52 the thread is properly cleaned up.

This feels cleaner and less brute force than the exit method. Thanks to @igorwwwwwwwwwwwwwwwwwwww for their good questions and insight on this issue!

igorwwwwwwwwwwwwwwwwwwww

lgtm! this is very cool

Fix issue where drain leaks threads after discovering dead consumers.

[Enterprise Backport] Fix issue where drain leaks threads after discovering dead consumers. (#171)

When doing shutdown, exit instead of sleep since we're done here. And…

65d33fd

… when we sleep old threads aren't getting cleaned up by GC (reference via closure or something else holding on?)

bnferguson added the enterprise-2.2-backport label Apr 4, 2018

bnferguson self-assigned this Apr 4, 2018

bnferguson requested a review from meatballhat April 4, 2018 15:54

igorwwwwwwwwwwwwwwwwwwww reviewed Apr 4, 2018

View reviewed changes

lib/travis/logs/drain_consumer.rb Outdated

@@ -95,7 +95,7 @@ def dead?

ensure

@dead = true

@batch_buffer = nil

sleep

exit

This comment was marked as spam.

Sign in to view

Remove unneeded exit as once we close the connection and let this obj…

cd2c0e1

…ect fall out of scope the threads are all cleaned up since they're not sleeping infinitely. exit felt brute force even if it accomplished similar.

igorwwwwwwwwwwwwwwwwwwww approved these changes Apr 5, 2018

View reviewed changes

bnferguson deployed to com-staging April 5, 2018 06:54 Active

bnferguson deployed to org-staging April 5, 2018 06:54 Active

bnferguson deployed to org-production April 5, 2018 07:30 Active

bnferguson deployed to com-production April 5, 2018 07:30 Active

bnferguson merged commit 8830f9e into master Apr 5, 2018

bnferguson deleted the bf-fix-leaking-thread-issue branch April 5, 2018 07:47

bnferguson added a commit that referenced this pull request Apr 5, 2018

Merge pull request #171 from travis-ci/bf-fix-leaking-thread-issue

0fc71f2

Fix issue where drain leaks threads after discovering dead consumers.

bnferguson mentioned this pull request Apr 5, 2018

[Enterprise Backport] Fix issue where drain leaks threads after discovering dead consumers. (#171) #172

Merged

bnferguson removed the enterprise-2.2-backport label Apr 5, 2018

bnferguson added a commit that referenced this pull request Apr 5, 2018

Merge pull request #172 from travis-ci/enterprise-backport-171

62e43ba

[Enterprise Backport] Fix issue where drain leaks threads after discovering dead consumers. (#171)

bnferguson mentioned this pull request May 25, 2018

Thread leak in logs-drain (potentially due to TimerTask bug) #178

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue where drain leaks threads after discovering dead consumers. #171

Fix issue where drain leaks threads after discovering dead consumers. #171

bnferguson commented Apr 4, 2018

This comment was marked as spam.

bnferguson commented Apr 4, 2018

igorwwwwwwwwwwwwwwwwwwww left a comment

Fix issue where drain leaks threads after discovering dead consumers. #171

Fix issue where drain leaks threads after discovering dead consumers. #171

Conversation

bnferguson commented Apr 4, 2018

1. What is the problem that this PR is trying to fix?

2. What approach did you choose and why?

3. How can you test this?

4. What feedback would you like?

This comment was marked as spam.

bnferguson commented Apr 4, 2018

igorwwwwwwwwwwwwwwwwwwww left a comment

Choose a reason for hiding this comment