Deadlocks in rails 5 #31019

BrentWheeldon · 2017-10-31T17:54:01Z

Steps to reproduce

Since upgrading to rails 5 we've started seeing deadlocks in development and test. Using @matthewd's super helpful tool we've tracked it down to a thread being blocked by the load interlocking. This thread is holding the AR connection's lock while it's blocked. The other threads that are sharing the load interlock are blocked waiting for the connection's lock.

We are using rails 5.1.4 with postgres and are seeing this in both webrick and puma. I have a feature spec that reliably reproduces this issue (running puma with 4 threads) and the state we end up in is:

Puma Thread 1:

This thread is updating records inside an explicit .transaction block and has a record level lock on an object using #lock!. It's waiting to autoload an association's class.

Has the AR connection lock
Blocked waiting to get the load interlock

Puma Thread 2:

This thread is just fetching some data from the database.

Sharing the load interlock
Blocked waiting to get the AR connection lock

Puma Thread 3:

This thread is just fetching some data from the database.

Sharing the load interlock
Blocked waiting to get the AR connection lock

Puma Thread 4:

This thread is just fetching some data from the database.

Sharing the load interlock
Blocked waiting to get the AR connection lock

Removing our .transaction and #lock! calls do not solve the problem.

All of the threads that are waiting on the AR connection lock are waiting in the ActiveRecord::ConnectionAdapters::AbstractAdapter#log method. If I add a ActiveSupport::Dependencies.interlock.permit_concurrent_loads block outside of the locking code here the problem goes away, but I'm not sure exactly what the negative side effects will be.

Is there anything that I'm missing here? I'm really not sure where to start. Any help would be really appreciated!

System configuration

Rails version: 5.1.4

Ruby version: 2.3.3

The text was updated successfully, but these errors were encountered:

matthewd · 2017-10-31T18:53:23Z

Oops... when we added the connection lock I did consider the load lock, but in deciding it wasn't relevant I forgot about transactions. (I can't think of any other way to put user-controlled code inside that lock -- and user-controlled code is the only place that can trigger a load.)

The solution is indeed to wrap the lock acquisition with permit_concurrent_loads -- though after we get the connection lock, we must reacquire the load lock before we can run the contained user code. (And yes, this does create its own deadlock risk.. but it should be much narrower.)

However, I'm hesitant to add so much lock machinery to the log call when 1) most of the time, the only thing happening inside that block will be a database query [which itself does permit_concurrent_loads], and 2) the connection lock is only really relevant in the test environment.

I wonder if we can instead push it out closer to the transaction method, where it can apply permit_concurrent_loads across the whole transaction-initiation bit, and then reacquire the lock just before it yields back to the user code.

BrentWheeldon · 2017-10-31T19:05:52Z

Thanks for the super fast response! I'll have a go at what you described in the transaction-y code and see how I go. I'll report back.

BrentWheeldon · 2017-10-31T20:38:41Z

So I have been doing some poking around and it seems that modifying transaction isn't going to help us here. Claiming the lock when starting a transaction isn't the problem - it's other threads trying to claim that lock when it's owned by a thread that is blocked by the load interlock.

I think basically the issue boils down to us wanting to permit_concurrent_loads while we wait for the DB lock.

I had a go at this with a custom monitor class that we would use in ActiveRecord::ConnectionAdapters::AbstractAdapter instead of the standard Monitor class.

class LoadInterlockAwareMonitor < Monitor
  def mon_enter
    ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
      super
    end
  end
end

That fixes the issue for me, but I'm not in love with the new class's name, and I'm also no sure where that would live necessarily.

Do we think that a custom monitor class is even the way we should be solving this?

matthewd · 2017-11-01T23:02:12Z

Claiming the lock when starting a transaction isn't the problem - it's other threads trying to claim that lock when it's owned by a thread that is blocked by the load interlock.

Ah, right.

Yeah, that does look like an interesting approach. Part of me just wants to monkey-patch the base Monitor (because any time a thread is waiting on something, like a lock acquisition, it's safe for it to permit loads)... but that seems likely to be unpopular. Suitable naming/location will require a bit more thought.

One tweak to its contents: I think we can throw in a mon_try_enter ||, so in the uncontended case it'll take the lock immediately without any extra effort.

BrentWheeldon · 2017-11-02T00:46:40Z

Good suggestion on the mon_try_enter. I will add that in and get a PR going and seek feedback on location/naming. Thanks again for your help!

Dee-2015 · 2017-11-15T14:07:01Z

Apologies when this is not related, but I ran into an issue running rspecs against a postgres 10 db that failed. But they always passed with the postgres 9.5.10 database. Could the issue #31019 possibly also relate to the postgres version?

mlh758 · 2018-04-02T15:50:47Z

This fix also appears to fix an issue where Sidekiq jobs can't run in development due to a deadlock if multiple jobs start at the same time. I was on Rails 5.1.4 using Ruby 2.4.1.

Thanks for finding and fixing this.

janko · 2020-10-28T14:46:22Z

As part of resolving a race condition in a gem that uses Active Record, I'm currently deep in understanding Active Record's connection management, and it would really help me to get something cleared up.

If Active Record's connection pool doesn't allow two threads to use the same connection at the same time, how can then multiple Puma threads be deadlocked by AR connection lock, if each thread will be using their own AR connection which have their own connection lock? Why is a lock needed in the first place, if in theory two threads cannot access the same AR connection?

I would've asked this on the Discourse channel, but generally I didn't have luck getting answers there.

ngan · 2022-11-06T14:34:13Z

@janko i believe the issue is only happening in tests where things are auto loaded. For example, in rails system tests, your server (puma) and your test runner are sharing the same connection even though they’re in separate threads. This is my understanding. I could be wrong.

janko · 2022-11-06T14:56:31Z

Yes, you are correct. I don't think it's necessarily related to autoloading, but to the fact that in system tests the background Puma workers run in a separate threads, all of which use a shared connection in order for transactional tests to work, so this access needs to be synchronized.

rafaelfranca assigned matthewd Oct 31, 2017

BrentWheeldon mentioned this issue Nov 2, 2017

Prevent deadlocks with load interlock and DB lock. #31035

Merged

matthewd closed this as completed in #31035 Nov 17, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deadlocks in rails 5 #31019

Deadlocks in rails 5 #31019

BrentWheeldon commented Oct 31, 2017

matthewd commented Oct 31, 2017

BrentWheeldon commented Oct 31, 2017

BrentWheeldon commented Oct 31, 2017

matthewd commented Nov 1, 2017

BrentWheeldon commented Nov 2, 2017

Dee-2015 commented Nov 15, 2017

mlh758 commented Apr 2, 2018

janko commented Oct 28, 2020 •

edited

ngan commented Nov 6, 2022

janko commented Nov 6, 2022 •

edited

Deadlocks in rails 5 #31019

Deadlocks in rails 5 #31019

Comments

BrentWheeldon commented Oct 31, 2017

Steps to reproduce

System configuration

matthewd commented Oct 31, 2017

BrentWheeldon commented Oct 31, 2017

BrentWheeldon commented Oct 31, 2017

matthewd commented Nov 1, 2017

BrentWheeldon commented Nov 2, 2017

Dee-2015 commented Nov 15, 2017

mlh758 commented Apr 2, 2018

janko commented Oct 28, 2020 • edited

ngan commented Nov 6, 2022

janko commented Nov 6, 2022 • edited

janko commented Oct 28, 2020 •

edited

janko commented Nov 6, 2022 •

edited