Fiber-safe ConnectionPool + Dataloader gobbles all available database connections #4739
Comments
Hey, thanks for sharing all those details. Even for queries that don't use `dataloader` explicitly, GraphQL-Ruby still runs execution inside Dataloader's Fibers. I considered adding a maximum queue size to dataloader before (#3913), but the customer I was emailing with found a different solution to the problem. I didn't immediately add it because I didn't encounter this problem, but I'll investigate adding an Async::Semaphore to limit the number of actively-running Fibers.

Out of curiosity, do you know what your connection pool size is, and whether you're running a threaded server or not? If you're running the Rails default of 5 connections in the pool, and using a threaded server, then it might be possible to genuinely run out of connections across the different fibers.
I added a queue size limit of sorts in #4742; you could try out that branch if you want to investigate that possibility. It defaults to 10 fibers. But it only limits fibers for graphql execution -- the Dataloader may have to spin up more fibers to resolve sources, and that can't really be bounded because it may have an arbitrarily deep dependency tree. So you could try setting that limit to a low value and see what happens.
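As a rough sketch of that Async::Semaphore idea (standalone, not graphql-ruby's actual integration):

```ruby
require "async"
require "async/semaphore"

Async do
  # Allow at most 3 fibers to run their blocks concurrently;
  # additional tasks wait for a free slot instead of piling up.
  semaphore = Async::Semaphore.new(3)

  tasks = 10.times.map do |i|
    semaphore.async do
      # Imagine each task checking out a DB connection here --
      # the semaphore caps how many are held at once.
      sleep(0.1)
      puts "finished task #{i}"
    end
  end

  tasks.each(&:wait)
end
```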
Yes, I'm using the Rails default of 5, at least in development mode (as I imagine most others are in dev mode).
I tried setting it to 1 and it still triggers the same exception. I tested with multiple values, and it was always on the 6th request (all in series) after a freshly-launched server, which makes me think for some reason connections aren't properly being returned to the pool after a request completes. |
Ah, yeah, that's a great clue -- so it seems like one connection is being taken (and not released) per request. Just to rule it out: if you keep the `:fiber` isolation level but use the plain `GraphQL::Dataloader` instead of `AsyncDataloader`, does the same thing happen?

In the meantime, I'll do some digging to see if we can track what is holding those connections.
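One way to watch for that kind of leak (a sketch -- `PoolStatLogger` is a made-up name; `ConnectionPool#stat` is ActiveRecord's built-in stats method) is a Rack middleware that logs pool usage after every request:

```ruby
# Logs connection pool stats after each request; a one-connection-per-request
# leak shows up as a `busy` count that climbs by one on every hit.
class PoolStatLogger
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  ensure
    stat = ActiveRecord::Base.connection_pool.stat
    Rails.logger.info("pool after request: #{stat.inspect}")
  end
end

# Enable it in config/application.rb (placement assumed):
#   config.middleware.use PoolStatLogger
```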
Wow, good call! I didn't think to try that. I'm running into the same issue even when using the plain dataloader 🤔 |
Interesting -- that makes me think there's some other part of the system that's holding on to connections longer than expected. Looking at the stacktrace in the original report, the gems I see are:
I wonder if we could isolate some of the variables, for example:
Let me know what you find! I appreciate that using Fibers this way is on the leading edge of Rails development, but I'd love to do whatever I can to make it work, because I think it's a really bright future for Ruby concurrency.
Testing your other ideas:
Worth noting that the last 2 lines of the stack trace are:
So it looks like fibers are still used in the stock dataloader (though I'm sure you knew that 😆).

Appreciate you working with my existing app, as putting together a repro for stuff like this can be quite tricky and time-consuming! Glad to help, as I feel the same way about the future of Ruby concurrency, and appreciate your forward-thinking work (and your ability to engage the pillars of the Ruby concurrency community in your efforts)!
Well, shoot. I was able to replicate this bug in my test app, just like you described. (I didn't see the bug there before -- but I hadn't actually set `config.active_support.isolation_level = :fiber`.)

I debugged using this method:

```ruby
class ActiveRecord::ConnectionAdapters::ConnectionPool
  # Based on ConnectionPool#stat
  def my_stat
    synchronize do
      {
        # size: size,
        connections: @connections.size,
        in_use: @connections.select(&:in_use?).map(&:object_id),
        owner_alive: @connections.select { |c| c.owner&.alive? }.map(&:object_id),
        owner: @connections.map { |c| c.owner },
        current_context: ActiveSupport::IsolatedExecutionState.context,
        # busy: @connections.count { |c| c.in_use? && c.owner.alive? },
        # dead: @connections.count { |c| c.in_use? && !c.owner.alive? },
        # idle: @connections.count { |c| !c.in_use? },
        waiting: num_waiting_in_queue,
      }
    end
  end
end
```

I put it in my GraphQL resolver right before making an ActiveRecord call:

```ruby
def local_count(set:)
  pp ActiveRecord::Base.connection_pool.my_stat
  Card.where(set: set).count
end
```

My first surprise was that my helper raised an error. Here's the output it produced:

```
{:connections=>5,
:in_use=>[14820, 30460, 30600, 30720, 30840],
:owner_alive=>[14820, 30460, 30600, 30720, 30840],
:owner=>
[#<Thread:0x00000001058eb358@puma srv tp 004 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/puma-5.6.7/lib/puma/thread_pool.rb:104 sleep_forever>,
#<Thread:0x00000001058e99e0@puma srv tp 005 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/puma-5.6.7/lib/puma/thread_pool.rb:104 run>,
#<Thread:0x00000001058e99e0@puma srv tp 005 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/puma-5.6.7/lib/puma/thread_pool.rb:104 run>,
#<Thread:0x00000001058e99e0@puma srv tp 005 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/puma-5.6.7/lib/puma/thread_pool.rb:104 run>,
#<Thread:0x00000001058e99e0@puma srv tp 005 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/puma-5.6.7/lib/puma/thread_pool.rb:104 run>],
 :waiting=>1}
```

The surprise to me was that when I started the app again, it worked. (I also uncommented the `current_context` key, which shows up in the output below.)

```
{:connections=>5,
:in_use=>[29700, 29940, 29980, 30120, 30180],
:owner_alive=>[29980, 30120],
:owner=>
[#<Fiber:0x000000010707d520 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/async-2.6.5/lib/async/task.rb:326 (terminated)>,
#<Fiber:0x000000010709f2b0 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/async-2.6.5/lib/async/task.rb:326 (terminated)>,
#<Fiber:0x00000001075b7d10 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/async-2.6.5/lib/async/task.rb:326 (suspended)>,
#<Fiber:0x00000001075bf218 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/async-2.6.5/lib/async/task.rb:326 (suspended)>,
#<Fiber:0x0000000107017b58 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/async-2.6.5/lib/async/task.rb:326 (terminated)>],
:current_context=>
#<Fiber:0x00000001075b1280 /Users/rmosolgo/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/async-2.6.5/lib/async/task.rb:326 (resumed)>,
 :waiting=>0}
```

As expected, the connections are now owned by Fibers -- but notice that some of those Fibers are `(terminated)` and still holding connections, while only the `(suspended)` ones count as alive.
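Here's a minimal sketch of the failure mode those `(terminated)` owners suggest (assuming a Rails 7.1 app with ActiveRecord configured, and nothing else using the pool):

```ruby
require "active_support/isolated_execution_state"

# With fiber-level isolation, each Fiber is its own execution context,
# so each Fiber that touches ActiveRecord checks out its own connection.
ActiveSupport::IsolatedExecutionState.isolation_level = :fiber

Fiber.new do
  # This checkout is owned by the fiber, not the thread:
  ActiveRecord::Base.connection.execute("SELECT 1")
  # The fiber terminates here without checking the connection back in.
end.resume

# The fiber is gone, but the pool still counts its connection as in_use
# until something reaps it -- repeat once per request, and a pool of 5
# is exhausted by the 6th request.
pp ActiveRecord::Base.connection_pool.stat
```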
I'm not sure where this leaves us, since you're already on Rails 7.1.2, but here are a couple of questions that might help:
I'm using Puma 6.4.0. Here's my config, which is pretty simple:

```ruby
bind "tcp://0.0.0.0:#{ENV["PORT"] || 3000}"

if Rails.env.development?
  threads 1, 1
  workers 0

  # Allow puma to be restarted by `rails restart` command.
  plugin :tmp_restart
elsif !Rails.env.test?
  preload_app!
  fork_worker if Rails.env.staging?
  workers Integer(ENV["WEB_CONCURRENCY"] || 1)
end
```

I added

Click to view exception backtrace
Ok, thanks -- the give-away was those suspended fibers. I was able to replicate the bug in my own app, with those same suspended fibers, and #4743 fixed it for me. I just merged that PR -- could you try your app on the newly-merged code?
No more issues! I did have to keep
I don't know that we've totally solved this problem, since the "fix" above created other problems. I was able to replicate the issue in Async directly, and I opened an issue there: socketry/async#295
Describe the bug
When using the Rails 7.1 setting `config.active_support.isolation_level = :fiber`, GraphQL eventually gobbles all available database connections.
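For reference, this is where that setting lives (a sketch; the application module name is hypothetical):

```ruby
# config/application.rb
module MyApp # name assumed
  class Application < Rails::Application
    config.load_defaults 7.1

    # Scope ActiveSupport's execution state (including ActiveRecord
    # connection ownership) per Fiber instead of per Thread:
    config.active_support.isolation_level = :fiber
  end
end
```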
Versions

graphql version: 2.2.1
rails (or other framework): 7.1.2
other applicable versions (graphql-batch, etc.)

GraphQL schema
I can't figure out exactly which part of my schema is triggering this, as I can send the same query twice in series and have it succeed the first time, and the second time it will say it can't obtain a connection from the pool. However, the fields I'm requesting don't involve any `dataloader` calls.

GraphQL query
This happens with many queries, but here's a simple example:

GraphQL query and response
Steps to reproduce
Send multiple queries in series
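As an illustration (the endpoint path and query are assumed), a loop like this reproduces the pattern of sequential requests:

```ruby
require "net/http"
require "json"

uri = URI("http://localhost:3000/graphql") # path assumed

# With a pool of 5, the failure reportedly appears on the 6th request.
10.times do |i|
  res = Net::HTTP.post(
    uri,
    { query: "{ __typename }" }.to_json,
    "Content-Type" => "application/json"
  )
  puts "request #{i + 1}: HTTP #{res.code}"
end
```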
Expected behavior
Queries should complete successfully
Actual behavior
ActiveRecord::ConnectionTimeoutError: could not obtain a connection from the pool within 5.000 seconds (waited 5.000 seconds); all pooled connections were in use
Full backtrace:
Click to view exception backtrace