Threads, Forking and New Relic #1101

Open
jasonrclark opened this Issue Aug 9, 2013 · 25 comments

@jasonrclark

TL;DR: We've got ideas for avoiding deadlocks when using New Relic and Resque together, but we'd like feedback on the approach, and advice on what we could contribute to Resque to improve the situation.

Long story short, threads (with locks) and forking don't mix nicely. (〰️ + 🍴 = 💥 )

The Problem

There are long-standing reports of deadlocking Resque jobs when using New Relic (newrelic_rpm). Certain updates have alleviated issues for specific users, but we still hear about problems from others, especially those running lots of jobs (think millions of jobs per day).

Some great debugging by @benweint and others has led us to suspect the problems are variations on this sequence of events:

  1. New Relic starts in the Resque master process and spawns a background thread to transmit data back to our services.
  2. Resque starts running normally, forking to process jobs.
  3. Once a minute New Relic runs a harvest, and during that time the harvest thread takes various locks (internal locks in newrelic_rpm, locks in Ruby's Logger class, or even native locks around, for example, DNS lookup).
  4. While New Relic's harvest thread holds one of those locks, Resque forks a new job.
  5. Only Resque's main thread lives on in the child process, so New Relic's harvest thread isn't around to release the lock from step 3.
  6. If any code in the child tries to acquire that lock, the Resque job hangs 😢
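
Boiled down, the sequence looks something like the sketch below (not our actual code). Whether the child really hangs depends on which lock is involved: Ruby tries to release Mutexes held by threads that don't survive the fork, but native locks (like the ones around DNS lookup) get no such cleanup, and the Ruby-level cleanup hasn't always been reliable either.

# Toy version of the sequence above: a background thread takes a lock,
# the main thread forks, and the child wants the same lock.
lock = Mutex.new

harvester = Thread.new do
  lock.synchronize { sleep 5 }   # stand-in for a harvest holding a lock
end

sleep 0.1                        # let the "harvest" thread grab the lock

pid = fork do
  # Only the forking thread survives in the child, so nothing here can
  # release `lock` for us; if the cleanup at fork time misses it, this
  # synchronize blocks forever.
  lock.synchronize { puts "child got the lock" }
end

Process.wait(pid)
harvester.join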

It seems we can't safely fork while another thread is holding a lock, or bad things happen. If newrelic_rpm owned all the locks in question we could deal with them in our after_fork handler (and we have, where we've identified them). However, at least one common case (DNS lookup) isn't even locking in Ruby-land.

A Solution

I've pushed a branch, resque_harvest_locking, on newrelic_rpm with a proposed change to our instrumentation (newrelic/rpm@d703271). This synchronizes New Relic's background thread running a harvest with Resque forking a new job (just the forking, not the full job processing). We already locked around our harvest thread, so we just sync on the same lock from a patched Resque::Worker#fork.
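
In outline, the change looks something like the sketch below. It's simplified, and NewRelic::Agent.harvest_lock is a hypothetical stand-in for whatever object actually guards the harvest, so treat it as an illustration of the idea rather than the code on the branch.

require 'resque'
require 'newrelic_rpm'

module NewRelic
  module ResqueForkSynchronization
    # Don't fork while the harvest thread holds its lock; while we hold the
    # lock ourselves, a harvest can't start until the fork has completed.
    # NewRelic::Agent.harvest_lock is a stand-in name here, not a real API.
    def fork(*args, &block)
      NewRelic::Agent.harvest_lock.synchronize { super }
    end
  end
end

# Module#prepend is private before Ruby 2.1, hence the send.
Resque::Worker.send(:prepend, NewRelic::ResqueForkSynchronization)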

This has downsides for sure:

  • If the network is slow contacting New Relic, new jobs can't start until the harvest finishes.
  • It takes some ugly patching to get the lock at the right spot in Resque's processing.
  • The existing code doesn't look workable for Resque 2.x, so there's still work to do if this approach is taken.

But in our testing it does alleviate deadlocks.

Gathering More Information

The latest version of the newrelic_rpm gem ships with an executable called nrdebug. Given a pid, nrdebug drops a log file with tons of information, up to and including gdb and Ruby backtraces from the stuck process. Anyone seeing hung workers with newrelic_rpm should definitely run nrdebug before killing those workers. We'd love to see the results.

Other Ideas?

Has anyone else hit these types of issues using threads and Resque together?

Are there places we could add hooks in Resque to make this less invasive?

Even more drastically, could there be a better approach to running plugins periodically in Resque's master so we wouldn't need the background thread and could sidestep the locking woes?

Thanks!

Owner

Thanks so much for looking into this. I've been ill, so sorry about the delay.

I think that this approach sounds reasonable, but I also don't use NewRelic with Resque personally, so I'd like to hear what people who do use it think. Let me tweet this out there, and also cross-post to the mailing list, and see if we can't get more feedback.

could there be a better approach to running plugins periodically in Resque's master so we wouldn't need the background thread and could sidestep the locking woes?

I think this would be the ideal solution for 2.0.

I use NewRelic with Resque but I can't say I've ever noticed a problem. (But we're only running dozens of long-running jobs, not millions per day).

That said, I'm happy to help out with the app I'm talking about above if you'll tell me how I can be useful.

MarkusQ commented Aug 12, 2013

One possibility that occurs to me (without looking at the code) is to replace the background thread with another fork; the mental model is "no process should ever fork after calling Thread.new." This would clearly remove you from the TL;DR-defined (threads+locks+forks) space.
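
For what it's worth, the smallest version of that model I can picture looks something like this (purely illustrative, not anything the agent actually does): fork a dedicated reporter process before any threads exist, and feed it over a pipe.

require 'json'

reader, writer = IO.pipe

reporter_pid = fork do
  writer.close
  while (line = reader.gets)
    # A real reporter would ship this to the collector; we just log it.
    warn "[reporter #{Process.pid}] would send: #{line.strip}"
  end
end
reader.close

# The master (and every job it later forks) only ever writes to the pipe,
# so no process forks while a background thread of its own holds a lock.
writer.puts({ metric: "jobs/processed", value: 1 }.to_json)

writer.close
Process.wait(reporter_pid)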

My second suggestion would be pretty much what Jason implemented.

About ten years ago this came up and Joel VanderWerf sketched a solution http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/61918 but I'm not finding a live link or local copy.

Another idea (which gets messy and may not handle all cases, but I have seen "working") is to manually set all locks to unlocked on forking (iirc the attempt at this I saw used ObjectSpace, evil.rb, and goat entrails).

I keep thinking there are other, cleaner answers, but none are springing to mind.

mperham commented Aug 12, 2013

This is relevant to my interests. Good to learn about nrdebug, could be useful for debugging Sidekiq lockups too.

@MarkusQ Thanks for the feedback. We've definitely considered a forking model to get reporting back to New Relic separated from the app processes, but it's felt like a pretty big change from the agent's current behavior. We will keep it in mind to be sure.

Really appreciate the links to those old solutions as well. While I haven't found the code, they definitely led to some interesting reading.

@codeodor We'll definitely get in touch if there's any way you could help out. We mostly encounter problems with short, frequent jobs, but if we do make changes we'll want to be sure things remain compatible with longer running jobs like you've got.

I can't really comment on where to head codewise, but just reporting in as someone who's experienced this issue a lot.

I have an application running in production on an older version of resque (resque-1.13.0) and Ruby Enterprise Edition (1.8.7) which currently reproduces the locking-up issue reliably (I've had to kill processes several times over the last couple of weeks for it alone). I'm currently in the process of upgrading that application to ruby 1.9.3 and resque 1.24.1 - unsure if that will resolve or alleviate these problems, given some work has already gone into the releases since then.

While I've still got the old system running and can reproduce it, would backtraces / gdb output be useful (from the older resque version)? If so, let me know what you want to see, as outside of reading New Relic's how-to post it all looks a bit foreign to me.

EDIT: I was way off on my count of how many jobs go through. I forgot about some optimisations I did.

Owner
yaauie commented Aug 27, 2013

@jeremyolliver I don't think that backtraces from the older 1.x series will be all that helpful, since that series won't see any updates beyond the forward progress that has already been made in 1.24 and later. Best of luck getting upgraded though 😄.

I figured as much. I can report back post-upgrade if the issue still persists.

Contributor
benweint commented Sep 5, 2013

@jeremyolliver Backtraces from the 1.x series would be useful in as much as they would allow us to confirm whether your issue is the one Jason describes above or some other problem. You can use the nrdebug tool that Jason mentions in the 'Gathering More Information' section above to automate the collection of backtraces so that you won't have to manually carry out the steps in that blog post.

If you're able to get backtraces, we'd be happy to take a look at them to at least confirm whether you're hitting the same issue or something different (I've seen a few cases where hangs people thought were caused by this issue were actually caused by unrelated issues in other libraries or application code).

obrie commented Oct 7, 2013

Hey guys -

I've been following this thread as it actually is in line with what I've experienced on a separate project. I wanted to provide an update because it affects how folks address this issue. I found that there were two issues at play here:

First, there is a bug in Ruby. All mutexes held by background threads are supposed to be unlocked when you fork. Unfortunately, there was a race condition in all versions of Ruby that made this fail. The good news is there's a patch I submitted which has been accepted: https://bugs.ruby-lang.org/issues/8433#note-2. It will need to get backported, but in the meantime I'm personally feeding the patch into rvm.

Second, instead of using the default DNS resolution implementation in Ruby, you can use resolv-replace. This makes DNS resolution thread-friendly.
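
Concretely, it's a one-line require early in boot. The sketch below just shows an HTTP call whose host lookup goes through the pure-Ruby resolver afterwards:

# resolv-replace ships with Ruby's standard library. Requiring it patches
# TCPSocket & friends to resolve host names with the pure-Ruby Resolv
# library instead of the C resolver, whose internal locks are what get
# stranded across a fork.
require 'resolv-replace'

require 'net/http'
require 'uri'

Net::HTTP.get(URI('http://example.com/'))  # DNS lookup now happens in Ruby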

After adding these two changes, all of the issues I was seeing with hung forks and NewRelic went away. There are a few other bugs I've encountered in NewRelic that I'll be putting PRs in for soon, but I don't think any of them were contributing to hung workers -- only the two I noted above.

Hope this is helpful -- let me know if you have any questions!

@obrie That's absolutely fantastic news! Thanks for putting the work into that. We'd love to hear whenever this lands somewhere that we can test against.

We'll also take a look at resolv-replace and see if that improves the situation as well. It sounds promising given that a large number of the hangs we've seen have been DNS related.

obrie commented Oct 7, 2013

No problem :) It's also worth noting that there's a bug in 1.9.3 pre-p374 that can also cause forks to deadlock: https://bugs.ruby-lang.org/issues/7693. We were experiencing this issue (in addition to all the others) -- so we're now using p448 in all of our testing.

Hey all,
We've released version 3.6.8.164 of newrelic_rpm, which includes an optional feature to implement the locking around harvest I describe in this issue. It's documented at https://docs.newrelic.com/docs/ruby/resque-instrumentation#deadlocks, but in short, you can set this in the desired section of your newrelic.yml to enable the locks:

resque:
  use_harvest_lock: true

Keen to hear whether this helps anyone experiencing these sorts of deadlocks. Thanks all!

I'll be looking at Resque 2.x in the near future, so we'll see if we can do anything better at that point.

obrie commented Oct 16, 2013

Just to give one more update... We've now been running live in production (with high throughput) without any issues for about a week. To provide a little more information, this was the environment we required to completely avoid hung workers:

  • Ruby 1.9.3-p448 ... p327+ fixes thread creation deadlock [https://bugs.ruby-lang.org/issues/7693]
  • railsexpress patch (if applicable, the falcon gc patch on p385+ fixes 100% cpu gc deadlock) [https://github.com/skaes/rvm-patchsets/blob/master/patchsets/ruby/1.9.3/p448/railsexpress]
  • Backported mutexes patch (fixes mutex deadlock) [https://bugs.ruby-lang.org/issues/8433]
  • Ubuntu 12.04 / glibc 2.15 (fixes mutex locking bug / deadlock) ... Unfortunately I don't have a link for this one, but I can verify that glibc 2.11.1 has the issue on Ubuntu 10.04
  • resolv-replace Ruby library (fixes dns resolution deadlock)

I'll provide further updates if we discover anything else -- hopefully that's it :)

Owner

❤️❤️❤️❤️❤️

mperham commented Oct 16, 2013

Wow, thank you @obrie! This is helpful to everyone using threads.

@mperham referenced this issue in mperham/sidekiq Oct 17, 2013
Closed

Use resolv-replace #1258

This is awesome @obrie! We're definitely pointing people to these options when they report deadlocks, since most of them are outside of what newrelic_rpm can address directly from within the gem.

Again, thanks for all the deep digging!

TheCorp commented Oct 18, 2013

@obrie - Quick question, any idea if this issue is present on Ruby 2.x or just 1.9.3 p<237? I am on ruby 2.0.0p247.

Also @mperham - Based on the commit, it seems like the NewRelic upgrade with the harvest locking is specifically for Resque users? Does Sidekiq have a similar issue, or is resolv-replace all we need?

We process millions of jobs with Sidekiq so we're definitely very interested in whether this affects us or not.

mperham commented Oct 18, 2013

@TheCorp Sidekiq itself never forks but I can't control if workers want to fork. It's crazy but possible. Sidekiq master requires resolv-replace now.

TheCorp commented Oct 18, 2013

Well then, this might explain why we had so many issues when we tried to implement parallel http calls (we do lots of scraping) by forking threads within a Sidekiq worker.

mperham commented Oct 18, 2013

You don't fork threads, you fork processes. I mean a worker could actually fork the sidekiq process.

Slow DNS can cause those issues. We use a local DNS caching proxy which I think helps. resolv-replace should help a lot.

TheCorp commented Oct 18, 2013

Gotcha, thanks.

kajic commented Oct 25, 2013

In the process of upgrading ruby (from 1.9.3p0 to 2.0.0p247) I was also forced to upgrade newrelic_rpm (from 3.5.3.25 to 3.6.8.168). During the switch, an old resque machine (resque gem at 1.20.0), running ruby 1.9.3p0 and newrelic_rpm 3.6.8.168, started to experience hanging processes. This was not happening at all with newrelic_rpm 3.5.3.25. I was led to this github issue by examining the hanging processes using gdb. After having upgraded ruby to 2.0.0p247 the issue seems to have gone away. I hope this can be of help to someone.

This issue is still open, so I'll comment on here now.

We had NewRelic disabled on our Resque workers (managed with resque-pool) via the environment variable, and just turned it on again after upgrading to the last 1.x Resque. We are now seeing occasional locked-up workers again, though it's a lot better than it was before we turned it off.

Relevant Specs:
Ubuntu 12.04.4 LTS (GNU/Linux 3.2.0-65-virtual x86_64)
ldd (Ubuntu EGLIBC 2.15-0ubuntu10.5) 2.15
ruby 1.9.3p545 (2014-02-24 revision 45159) [x86_64-linux]
resque (1.25.2)
resque-dynamic-queues (0.8.0)
resque-pool (0.3.0)
resque-queue-priority (0.6.0)
resque-scheduler (2.5.5)

I'm about to turn on the harvest locks flag now.

I think I've got a ticket open with NewRelic that I'll follow up on with them directly, but I wondered if there were any updates on this side.

@steveklabnik added the Hard label Jan 18, 2016