Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlocks in rails 5 #31019

Closed
BrentWheeldon opened this issue Oct 31, 2017 · 10 comments · Fixed by #31035
Closed

Deadlocks in rails 5 #31019

BrentWheeldon opened this issue Oct 31, 2017 · 10 comments · Fixed by #31035
Assignees

Comments

@BrentWheeldon
Copy link
Contributor

Steps to reproduce

Since upgrading to rails 5 we've started seeing deadlocks in development and test. Using @matthewd's super helpful tool we've tracked it down to a thread being blocked by the load interlocking. This thread is holding the AR connection's lock while it's blocked. The other threads that are sharing the load interlock are blocked waiting for the connection's lock.

We are using rails 5.1.4 with postgres and are seeing this in both webrick and puma. I have a feature spec that reliably reproduces this issue (running puma with 4 threads) and the state we end up in is:

Puma Thread 1:

This thread is updating records inside an explicit .transaction block and has a record level lock on an object using #lock!. It's waiting to autoload an association's class.

  • Has the AR connection lock
  • Blocked waiting to get the load interlock

Puma Thread 2:

This thread is just fetching some data from the database.

  • Sharing the load interlock
  • Blocked waiting to get the AR connection lock

Puma Thread 3:

This thread is just fetching some data from the database.

  • Sharing the load interlock
  • Blocked waiting to get the AR connection lock

Puma Thread 4:

This thread is just fetching some data from the database.

  • Sharing the load interlock
  • Blocked waiting to get the AR connection lock

Removing our .transaction and #lock! calls do not solve the problem.

All of the threads that are waiting on the AR connection lock are waiting in the ActiveRecord::ConnectionAdapters::AbstractAdapter#log method. If I add a ActiveSupport::Dependencies.interlock.permit_concurrent_loads block outside of the locking code here the problem goes away, but I'm not sure exactly what the negative side effects will be.

Is there anything that I'm missing here? I'm really not sure where to start. Any help would be really appreciated!

System configuration

Rails version: 5.1.4

Ruby version: 2.3.3

@matthewd
Copy link
Member

Oops... when we added the connection lock I did consider the load lock, but in deciding it wasn't relevant I forgot about transactions. (I can't think of any other way to put user-controlled code inside that lock -- and user-controlled code is the only place that can trigger a load.)

The solution is indeed to wrap the lock acquisition with permit_concurrent_loads -- though after we get the connection lock, we must reacquire the load lock before we can run the contained user code. (And yes, this does create its own deadlock risk.. but it should be much narrower.)

However, I'm hesitant to add so much lock machinery to the log call when 1) most of the time, the only thing happening inside that block will be a database query [which itself does permit_concurrent_loads], and 2) the connection lock is only really relevant in the test environment.

I wonder if we can instead push it out closer to the transaction method, where it can apply permit_concurrent_loads across the whole transaction-initiation bit, and then reacquire the lock just before it yields back to the user code.

@BrentWheeldon
Copy link
Contributor Author

Thanks for the super fast response! I'll have a go at what you described in the transaction-y code and see how I go. I'll report back.

@BrentWheeldon
Copy link
Contributor Author

So I have been doing some poking around and it seems that modifying transaction isn't going to help us here. Claiming the lock when starting a transaction isn't the problem - it's other threads trying to claim that lock when it's owned by a thread that is blocked by the load interlock.

I think basically the issue boils down to us wanting to permit_concurrent_loads while we wait for the DB lock.

I had a go at this with a custom monitor class that we would use in ActiveRecord::ConnectionAdapters::AbstractAdapter instead of the standard Monitor class.

class LoadInterlockAwareMonitor < Monitor
  def mon_enter
    ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
      super
    end
  end
end

That fixes the issue for me, but I'm not in love with the new class's name, and I'm also no sure where that would live necessarily.

Do we think that a custom monitor class is even the way we should be solving this?

@matthewd
Copy link
Member

matthewd commented Nov 1, 2017

Claiming the lock when starting a transaction isn't the problem - it's other threads trying to claim that lock when it's owned by a thread that is blocked by the load interlock.

Ah, right.

Yeah, that does look like an interesting approach. Part of me just wants to monkey-patch the base Monitor (because any time a thread is waiting on something, like a lock acquisition, it's safe for it to permit loads)... but that seems likely to be unpopular. Suitable naming/location will require a bit more thought.

One tweak to its contents: I think we can throw in a mon_try_enter ||, so in the uncontended case it'll take the lock immediately without any extra effort.

@BrentWheeldon
Copy link
Contributor Author

Good suggestion on the mon_try_enter. I will add that in and get a PR going and seek feedback on location/naming. Thanks again for your help!

@Dee-2015
Copy link

Apologies when this is not related, but I ran into an issue running rspecs against a postgres 10 db that failed. But they always passed with the postgres 9.5.10 database. Could the issue #31019 possibly also relate to the postgres version?

@mlh758
Copy link

mlh758 commented Apr 2, 2018

This fix also appears to fix an issue where Sidekiq jobs can't run in development due to a deadlock if multiple jobs start at the same time. I was on Rails 5.1.4 using Ruby 2.4.1.

Thanks for finding and fixing this.

@janko
Copy link
Contributor

janko commented Oct 28, 2020

As part of resolving a race condition in a gem that uses Active Record, I'm currently deep in understanding Active Record's connection management, and it would really help me to get something cleared up.

If Active Record's connection pool doesn't allow two threads to use the same connection at the same time, how can then multiple Puma threads be deadlocked by AR connection lock, if each thread will be using their own AR connection which have their own connection lock? Why is a lock needed in the first place, if in theory two threads cannot access the same AR connection?

I would've asked this on the Discourse channel, but generally I didn't have luck getting answers there.

@ngan
Copy link
Contributor

ngan commented Nov 6, 2022

@janko i believe the issue is only happening in tests where things are auto loaded. For example, in rails system tests, your server (puma) and your test runner are sharing the same connection even though they’re in separate threads. This is my understanding. I could be wrong.

@janko
Copy link
Contributor

janko commented Nov 6, 2022

Yes, you are correct. I don't think it's necessarily related to autoloading, but to the fact that in system tests the background Puma workers run in a separate threads, all of which use a shared connection in order for transactional tests to work, so this access needs to be synchronized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants