Set a new default for the Puma thread count #50450

dhh · 2023-12-26T23:55:16Z

We currently set the default Puma thread count to 5. That's very high. It assumes you will massively benefit from a bunch of inline 3rd party calls or super slow DB queries. It's not a good default for apps with quick SQL queries and 3rd party calls running via jobs, which is the recommended way to make Rails applications.

At 37signals, we run a 1 worker to 1 thread ratio. That's after extensive testing. It provided the best latency, at some small cost to ultimate throughput. Maybe that's too much or has negative effects on resource-starved systems, like Heroku Dynos. But it's clear that a default of 1:5 is not right.

So let's find a way to benchmark our way to a good, new default that works for most people, most of the time.

cc @byroot

byroot · 2023-12-27T10:24:01Z

As discussed on Campfire, I've been saying for a long time that 5 threads is way too much for what I consider a well optimized application (as you mention, no slow queries, no N+1, no 3rd party API calls from the request cycle).

The lobste.rs benchmark could be a good start: https://github.com/Shopify/yjit-bench/tree/main/benchmarks/lobsters, however it was specially crafted to not use a web server, so we'd need to rework it a bit for that purpose.

Overall the choice will be a latency vs throughput tradeoff, so there won't be a single setting that satisfies everyone, but we likely can find a better compromise than 5 threads per process (which supposes more than 80% IO which is nuts).

dhh · 2023-12-27T15:53:39Z

Totally. Maybe we can get someone from the community to help here. Let's run all the public benchmarks that are available at different ratios. 1:1, 1:2, 1:3. Then see what things look like in terms of latency and throughput. In our testing, 1:1 proved to be the best on both, but that was with HEY and Basecamp. So let's see what might be true with other benchmarks.

That marks this issue open for collaboration. If you're interested in helping us find the right default ratio, please post test results here. Thanks for helping!

nateberkopec · 2024-01-07T11:34:26Z

@noahgibbs worked on this for many years (although his work stopped ~3 years ago after losing sponsorship). One benchmark I recall of his showed that throughput on Discourse improved 25% when going from 1 to 5 threads. I would certainly consider Discourse a highly-optimized Rails app, maybe one of the most highly optimized in the world. It still spends 25% of its time waiting on I/O.

Noah's result accords pretty well with Amdahl's Law. Amdahl's Law shows that modest throughput gains can be obtained even when small parts of the program can be done in parallel - for example, a Rails application that spends 30% of its time waiting on I/O will have 18% higher throughput with 2 threads than with 1.

Of course, the second thread is the most costly thread of all: now you're multi-threaded, with all the bugs that can cause. But perhaps it is better to keep Rails multi-threaded by default, so that those bugs can occur and be seen and fixed often and early? What bugs would slip through if Rails became single-threaded by default again?

@byroot: I chose the default of 5 in Puma based on the benefit for a Rails app with 25-50% time spent in I/O wait (based on my experience looking at 100+ Rails apps perf dashboards, this covers 80% or more of prod apps). These apps receive a 30 to 65% improvement (respectively) in throughput with 5 threads, w/roughly a 25% increase in memory usage.

If I round the numbers a bit, Noah's benchmark result could be said to show something like the expected speedup given by Amdahl's Law for a 25%-wait-in-I/O application (in the formula though, 25% wait-on-io translates to 25% improvement in throughput w/5 threads). If you work back through Amdahl's Law, you can produce the following table, to give you an idea of the tradeoffs here even in a low-I/O app:

Thread Count	Throughput
1	100%
2	113%
3	120%
4	123%
5	125%
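
The table can be reproduced with a few lines of Ruby. This is a minimal sketch that assumes the I/O-wait fraction is the only part of a request that overlaps across threads; the method name `amdahl_speedup` is made up for illustration, and unrounded results may differ from the rounded figures above by a percentage point.

```ruby
# Amdahl's Law: speedup = 1 / ((1 - p) + p / n)
# p = fraction of request time that can overlap (here: I/O wait)
# n = number of threads
# `amdahl_speedup` is an illustrative name, not part of any library.
def amdahl_speedup(io_fraction, threads)
  1.0 / ((1.0 - io_fraction) + io_fraction / threads)
end

# Print relative throughput for an app that waits on I/O 25% of the time.
(1..5).each do |threads|
  percent = (amdahl_speedup(0.25, threads) * 100).round
  puts "#{threads} thread(s): #{percent}% throughput"
end
```

Changing the first argument (for example to `0.5`) shows how quickly the curve flattens for a given I/O share.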

byroot · 2024-01-07T11:36:50Z

I chose the default of 5 in Puma based on the benefit for a Rails app with 25-50% time spent in I/O wait

Something doesn't add up though. Even assuming 50% IO wait, that means max throughput would be achieved with 2 threads, not 5.

nateberkopec · 2024-01-07T11:54:55Z

Imagine you have 10 requests to process. Each request is 0.5 seconds of a db call, followed by 0.5 seconds of CPU burn. 1 thread would take 10 seconds to process all those requests, one after the other.

With 2 threads and one process, would it take 5 seconds?

If you had 10 threads, you could fire all the db calls at once, wait 0.5 seconds, then do 5 seconds of serial CPU work, processing all requests in 5.5 seconds. That's an 81% speedup. If you had 2 threads, you would have to spend 2.5 seconds doing all the DB calls, and 5 seconds doing CPU work, for a total of 7.5 seconds. With 5 threads, 6 seconds.

That's where I got 5 threads from. 80% of the maximum possible benefit for a 50% I/O wait app.

Amdahl's Law assumes you could keep adding threads and make the db calls go faster, which is where the analogy breaks down, but I find it's still quite accurate and a useful heuristic here.
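
The arithmetic in this example can be written down as a small Ruby sketch. It encodes only the simplified batch model described above (all I/O overlaps across threads, all CPU work runs serially); the method name `batch_seconds` and its keyword arguments are hypothetical, and it assumes there are no more threads than requests.

```ruby
# Simplified batch model: every request does its I/O first, then its CPU
# work. I/O is spread across threads; CPU work is serialized because only
# one thread runs Ruby code at a time. Assumes threads <= requests.
# `batch_seconds` is an illustrative name.
def batch_seconds(requests:, io:, cpu:, threads:)
  io_total = requests * io / threads.to_f  # I/O waits overlap across threads
  cpu_total = requests * cpu               # CPU work runs one at a time
  io_total + cpu_total
end

[1, 2, 5, 10].each do |threads|
  seconds = batch_seconds(requests: 10, io: 0.5, cpu: 0.5, threads: threads)
  puts "#{threads} thread(s): #{seconds}s"
end
```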

byroot · 2024-01-07T12:14:21Z

Yeah, I see what you mean. But I think a bit differently.

In my mental model, a process has a capacity of 1s of CPU time per second. So with 5 threads and 50% IOs you are queuing 2.5 times as much work as you can chew. Which is fine to handle small spikes, but degrades terribly if you are under capacity.

Also in your simplified model, all 5 requests are received at once and do 0.5s of IO followed by 0.5 of CPU time, in practice requests are received continuously and IO and CPU are much more intertwined than that.

All this to say, you're not wrong, but I think you are focusing too much on throughput impact, and not enough on the latency impact (which itself impacts throughput negatively).

But again, it's of course a tradeoff, and depends on what your priorities are.
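
A rough way to put numbers on this capacity model is sketched below in Ruby. It assumes one process can execute about one second of Ruby CPU work per second and that every thread is busy with a request; both method names are made up for illustration.

```ruby
# CPU seconds demanded per second of wall-clock time when every thread is
# busy. Anything above 1.0 is work queued behind the GVL.
# Both method names are illustrative.
def cpu_demand(threads:, io_percent:)
  threads * (100 - io_percent) / 100.0
end

# Number of busy threads at which one process is fully CPU-saturated.
def threads_to_saturate(io_percent:)
  100.0 / (100 - io_percent)
end

puts cpu_demand(threads: 5, io_percent: 50)   # demand with 5 threads at 50% I/O
puts threads_to_saturate(io_percent: 50)      # threads that saturate at 50% I/O
puts threads_to_saturate(io_percent: 80)      # threads that saturate at 80% I/O
```

Under these assumptions, 5 threads only break even when requests spend about 80% of their time in I/O, which is the figure mentioned earlier in the thread.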

dhh · 2024-01-07T13:05:33Z

Our testing on both HEY and Basecamp showed that when we have a higher thread count, we end up with some requests that really spike in latency. It was much easier to ensure a consistent latency with a lower thread count. But actually, now that I think about it, maybe that latency isn't from the threading itself, but from GC'ing? Would be nice to get to the bottom of that. In theory, I dig the idea that we have a higher thread count, and if we can keep that, without the latency swings, that's certainly better.

What's the next step to verify these theories?

victorlcampos · 2024-01-07T13:17:39Z

I understand that a single thread works well for your business,

but I think Basecamp and Shopify both running a single thread (I think Shopify uses a Unicorn fork, right?) creates a risk that new threading bugs emerge.

I think default should be at least 2, even if 1 is better for average apps.

byroot · 2024-01-07T13:19:11Z

But actually, now that I think about it, maybe that latency isn't from the threading itself, but from GC'ing?

It's the same thing. In this context GC is just CPU work, it pauses all threads.

I dig the idea that we have a higher thread count, and if we can keep that, without the latency swings,

That's my whole point, the higher the thread count, the higher the tail latency. You have to choose your preferred middle ground between maximizing throughput (hence reducing hosting cost) and minimizing latency (hence improving user experience). You can't have both.

I think default should be at least 2

That's my recommendation yes.

mensfeld · 2024-01-07T13:20:36Z

maybe that latency isn't from the threading itself, but from GC'ing?

You may want to look into @peterzhu2118 / shopify https://github.com/shopify/autotuner (presented at RailsWorld :) ) - it collects GC data between requests and recommends alignments to the GC engine.

Aside from that IMHO when discussing Puma's default thread count, it's crucial to acknowledge its impact not just on web requests but also on background jobs, which may have different latency and IO characteristics. People often model things based on Rails/Puma defaults. With this change, should the default on connection pool change as well?

byroot · 2024-01-07T13:22:22Z

it's crucial to acknowledge its impact not just on web requests but also on background jobs

Yeah, this discussion is just for web workers, background job workers are much less latency sensitive by definition, and maximizing throughput at the expense of latency makes sense there.

byroot · 2024-01-07T13:51:25Z

If you had 10 threads, you could fire all the db calls at once, wait 0.5 seconds, then do 5 seconds of serial CPU work, processing all requests in 5.5 seconds. That's an 81% speedup.

Since a small drawing is generally worth a lot of words, I made a schema of this theoretical case:

Now the same (not super realistic) scenario with only two threads

Edit: fixed a small mistake in the second drawing.

dhh · 2024-01-07T14:00:01Z

Yeah, I don't want to regress on thread safety. Who knows, maybe one day the great GVL is going to be gone 😄. But it just seems that 5 is too much for most people, if you care about latency more than throughput, and I think most should? It's easier to deal with throughput using multiple processes than it is to deal with latency.

So maybe we just start with a change to 2? Then we can continue to document the factors that might lead someone to change that.

noahgibbs · 2024-01-07T15:09:24Z

I actually found peak throughput with 6 threads/process at the time for Discourse -- and that benchmark would have been in 2017 or 2018. But yeah, that many threads is usually not great for a production app trying to keep latency low. It was purely optimising for big-batch throughput.

dhh · 2024-01-07T15:11:14Z

Yeah, for us, the latency tail just got worse and worse the more threads we added. But it would be nice to get scientific about this!

dhh · 2024-01-07T15:34:36Z

Do we need to line up any other thread pool counts if we make the change from 5 to 2?

p8 · 2024-01-07T15:40:38Z

I've updated the TechEmpower benchmarks to use 3 threads instead of 5.
TechEmpower/FrameworkBenchmarks#8668
Locally I got the following results. I'm waiting for the next run to see if anything improved:

+---------------------+---------+------+-----+-----+-----+-------+--------------+
|                     |plaintext|update| json|   db|query|fortune|weighted_score|
+---------------------+---------+------+-----+-----+-----+-------+--------------+
| 1 thread  per worker|    20799|  9819|83123|17434|10334|   8653|          1048|
| 2 threads per worker|    20870|  9478|66764|16220|11634|  10543|          1042|
| 3 threads per worker|    25077| 10382|84429|16497|12338|  11246|          1141|
| 4 threads per worker|    29257| 10051|69532|18108|11752|  11132|          1093|
| 5 threads per worker|    33152| 10203|77062|18459|11721|  11711|          1114|
+---------------------+---------+------+-----+-----+-----+-------+--------------+

natematykiewicz · 2024-01-07T15:51:09Z

On my app at work I found that if I decreased the thread count from 5 to 4, response times improved. I then decreased them from 4 to 3 and they improved again. 3 to 2 did not improve response times, so I left it at 3. For our app, 3 threads gives response times about twice as fast as 5 threads while retaining the same throughput.

Our app has fast DB queries, avoids web requests in our controller actions (offloading those to activejob). We have a JSON API, a bunch of ERB pages, and we have a CDN in front of our site so we proxy a lot of our ActiveStorage images.

I know that your ideal thread count highly depends on what you're doing, but I think 1 is too low of a default. 3 has served us well.

dhh · 2024-01-07T16:29:25Z

@p8 I think whatever benchmarks we run have to include a histogram of individual request latency. That's the main issue here. The original default of 5 was optimized primarily for throughput, not tail latency. So we should find a way to quantify the tail latency.
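
As a sketch of what quantifying tail latency could look like, the Ruby below computes nearest-rank percentiles and a crude text histogram from a list of per-request latencies. The sample numbers are invented purely for illustration and the `percentile` helper is hypothetical; a real run would feed in latencies recorded by the load-testing tool.

```ruby
# Nearest-rank percentile over a list of request latencies (milliseconds).
# `percentile` is an illustrative helper, not part of any library.
def percentile(latencies, pct)
  sorted = latencies.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up sample: mostly fast requests plus two slow outliers.
latencies_ms = [12, 14, 13, 15, 11, 16, 13, 12, 240, 14,
                13, 15, 12, 180, 14, 13, 12, 15, 13, 14]

[50, 90, 99].each do |pct|
  puts "p#{pct}: #{percentile(latencies_ms, pct)}ms"
end

# Crude text histogram with 50ms-wide buckets.
latencies_ms.group_by { |ms| ms / 50 * 50 }.sort.each do |bucket, values|
  puts format("%4d-%-4d ms | %s", bucket, bucket + 49, "#" * values.size)
end
```

The median hides the outliers entirely, while the high percentiles and the histogram expose them, which is why averages alone are not enough for this comparison.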

byroot · 2024-01-07T17:27:13Z

I don't think TechEmpower is particularly interesting for this discussion, as it's way too simple to reflect reality: https://github.com/TechEmpower/FrameworkBenchmarks/blob/6e4fa674519771a1833e792d5d69f0043e5bebf3/frameworks/Ruby/rails/app/controllers/hello_world_controller.rb

But overall, choosing an application to benchmark is not all, there is also the setup. When doing trivial SQL queries, a local database vs one on another host on LAN will make a major difference. (~0.5ms per query).

So in the end, I think we could make this decision purely based on a synthetic application that can be configured to have a certain IO/CPU ratio.
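
A synthetic application of that kind could look roughly like the following Rack-compatible sketch. The class name `SyntheticApp` and its parameters are assumptions made for illustration, `sleep` stands in for real I/O, and the busy loop is wall-clock based, so under GVL contention it burns less CPU than requested.

```ruby
# Rack-compatible app that spends a configurable share of each request
# sleeping (standing in for I/O) and the rest in a busy loop (CPU work).
# The class name and parameters are illustrative, not an existing tool.
class SyntheticApp
  def initialize(duration: 0.1, io_ratio: 0.5)
    @io_seconds = duration * io_ratio
    @cpu_seconds = duration - @io_seconds
  end

  def call(_env)
    sleep(@io_seconds)       # releases the GVL, like waiting on a DB call
    burn_cpu(@cpu_seconds)   # holds the GVL, like rendering a template
    [200, { "content-type" => "text/plain" }, ["ok\n"]]
  end

  private

  # Spin until the deadline passes (approximate: measures wall-clock time).
  def burn_cpu(seconds)
    deadline = now + seconds
    spins = 0
    spins += 1 while now < deadline
  end

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end
```

In a `config.ru` this would be mounted with `run SyntheticApp.new(io_ratio: 0.5)` and then served with different thread counts.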

natematykiewicz · 2024-01-07T17:35:19Z

The other thing to bear in mind is that the average rails app doesn't have as heavily tuned database queries. From being in some Rails-related slacks, missing indexes and inefficient queries seem to be the norm. Basecamp preferring 1 thread feels like a real outlier due to having knowledgeable staff that's able to make those queries return incredibly quickly. The average rails app benefits from more threads to offset the inefficient queries. Then as they improve their queries, they need fewer threads, because there's less IO wait.

natematykiewicz · 2024-01-07T17:39:28Z

There's also a huge misunderstanding of Ruby's concurrency model among newcomers. So often in these slacks devs complain about slow response times. I ask how many threads they're running and sometimes they say numbers as high as 30. They want more throughput so they crank up the thread count. I'm not sure how to best educate devs on how to tune their thread count (other than recommend Nate Berkopec's courses), but I feel that setting the thread count to 1 will make devs definitely want to crank that number up (rightfully so), and not have any idea how high it should actually be.

dhh · 2024-01-07T18:03:08Z

I'm not actually advocating for a default of 1. I'd like to put some documentation into our default puma config explaining some of the very basics here, and at least propose 1 as a setting for folks.

But it sounds like we're essentially talking about whether the new default should be 2 or 3. I don't have a strong opinion there, except that 5 isn't it.

byroot · 2024-01-07T18:24:29Z

Yeah, I think we're pretty much in agreement for either 2 or 3.

The other thing to bear in mind is that the average rails app doesn't have as heavily tuned database queries.

That is true, and that's probably also why the popular belief that Ruby performance doesn't matter much because it's all IO anyway is so commonplace. When you start optimizing a Rails application, the biggest problems you uncover are always query (or generally IO) related. But even in applications with such issues, I've never really witnessed enough IOs to justify 5 threads.

The only use case that would justify it would be an application whose main endpoint is essentially a proxy to another API. I've seen this once or twice, and there yes you can crank up the threads (or move to async).

But I don't think we should really be optimizing for this kind of situation, and 2 or 3 threads should still be plenty to keep a high enough utilization for apps having a few N+1 and a couple slow query issues.

Also it's really just a default, even more, it's generated when you create the app, so it's really easy to edit.

natematykiewicz · 2024-01-07T18:31:41Z

I'd personally vote for 3 because on my app, 2 threads had the same latency as 3, but 3 had more throughput.

I'm very performance minded, so I find it hard to believe that the average app running defaults would have low enough IO that they'd prefer just 2 threads. As you said, N+1 queries are incredibly common (sometimes in loops that become like 5N+1) and a great way to rack up the IO.

I don't think the defaults should be optimized for apps doing HTTP requests in a controller action. That should be highly discouraged as it ruins throughput and is a good way to get DDoSed if the 3rd party site goes offline (happened to me once many years ago, never again).

byroot · 2024-01-07T19:19:41Z

I find it hard to believe that the average app running defaults would have low enough IO that they'd prefer just 2 threads.

It's not that low at all really.

If you assume a 50% IO ratio in the average request, you only need two concurrent requests to saturate one process, so two threads.

It of course depends on how much data transformation you are doing at the Ruby layer, but view templating or JSON serialization (plus all the Active Record de-serialization) can easily take more time than several decently indexed DB queries.

But of course it's all about our personal experiences and perception, as there's only a handful of open source Rails apps out there to look at, and nothing guarantees that they are representative of what is being done in the wild, and more importantly, having the source is only one part, you need the whole production setup and traffic pattern to really measure the IO/CPU ratio.

So all this to say that whatever value we choose, will be largely based on a guess.

mensfeld · 2024-01-07T21:28:09Z

But of course it's all about our personal experiences and perception, as there's only a handful of open source Rails apps out there to look at, and nothing guarantees that they are representative of what is being done in the wild, and more importantly, having the source is only one part, you need the whole production setup and traffic pattern to really measure the IO/CPU ratio.

Maybe we could reach out to someone in AppSignal (cc @thijsc) and get some aggregated metrics on this from a set of commercial Rails projects. WDYT?

byroot · 2024-01-07T21:31:04Z

If they have the data I'd love to look at it, but we'd need to be very careful about how it's collected because GVL contention is generally mistaken for IO time by most instrumentation. So apps that use too many threads can appear as being more IO heavy than they actually are.

lvonk · 2024-01-09T08:38:23Z

I'm confused. How come in DHH's 32:1 results, there's 8-11ms (3-5% of total runtime) GVL Wait? There's no other threads to compete with. What's the GVL waiting on?

@natematykiewicz There are more threads running than just the puma worker thread. So perhaps some of the internal puma threads or the connection reaper?

ioquatix · 2024-01-09T09:16:29Z

I ran the benchmarks a couple of times each and chose the best for each.

Puma (1 process, 4 threads)

Puma starting in single mode...
* Puma version: 6.4.1 (ruby 3.2.2-p53) ("The Eagle of Durango")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 63741
Testing with:
50.0% time spent waiting on CPU
50.0% time spent waiting on IO
0.1 seconds average response duration (approximate, not including GC)
300000 objects are generated w/each request
We will decide to do CPU or IO every 10.0 milliseconds
CPU iteration: 0.052999937906861305ms
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Warming up for 15 seconds...
Benchmarking...

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: k6/script.js
     output: -

  scenarios: (100.00%) 1 scenario, 10 max VUs, 2m0s max duration (incl. graceful stop):
           * contacts: Up to 8.00 iterations/s for 1m30s over 1 stages (maxVUs: 10, gracefulStop: 30s)


     data_received..................: 43 kB 474 B/s
     data_sent......................: 36 kB 399 B/s
     http_req_blocked...............: min=273µs    avg=608.42µs med=524µs    max=2.84ms   p(75)=654.25µs p(99)=1.52ms   p(99.99)=2.81ms  
     http_req_connecting............: min=234µs    avg=544.71µs med=473.5µs  max=1.81ms   p(75)=577.75µs p(99)=1.19ms   p(99.99)=1.79ms  
     http_req_duration..............: min=120.88ms avg=138.23ms med=135.79ms max=188.97ms p(75)=143.52ms p(99)=166.06ms p(99.99)=188.55ms
       { expected_response:true }...: min=120.88ms avg=138.23ms med=135.79ms max=188.97ms p(75)=143.52ms p(99)=166.06ms p(99.99)=188.55ms
     http_req_failed................: 0.00% ✓ 0       ✗ 450 
     http_req_receiving.............: min=20µs     avg=37.69µs  med=36µs     max=90µs     p(75)=40.75µs  p(99)=64µs     p(99.99)=89.23µs 
     http_req_sending...............: min=16µs     avg=37.67µs  med=31µs     max=670µs    p(75)=37µs     p(99)=173.33µs p(99.99)=665.68µs
     http_req_tls_handshaking.......: min=0s       avg=0s       med=0s       max=0s       p(75)=0s       p(99)=0s       p(99.99)=0s      
   ✓ http_req_waiting...............: min=120.79ms avg=138.16ms med=135.73ms max=188.93ms p(75)=143.45ms p(99)=165.99ms p(99.99)=188.5ms 
     http_reqs......................: 450   4.99257/s
     iteration_duration.............: min=121.33ms avg=138.93ms med=136.4ms  max=189.39ms p(75)=144.22ms p(99)=167.23ms p(99.99)=188.97ms
     iterations.....................: 450   4.99257/s
     vus............................: 1     min=0     max=2 
     vus_max........................: 10    min=10    max=10

Falcon (1 process)

  0.0s     info: Falcon::Command::Serve [oid=0x7bc] [ec=0x7d0] [pid=63816] [2024-01-09 21:49:31 +1300]
               | Falcon v0.42.3 taking flight! Using Async::Container::Forked {:count=>1}.
               | - Binding to: #<Falcon::Endpoint http://localhost:9292/ {}>
               | - To terminate: Ctrl-C or kill 63816
               | - To reload configuration: kill -HUP 63816
 0.01s     info: Falcon::Controller::Serve [oid=0x7f8] [ec=0x7d0] [pid=63816] [2024-01-09 21:49:31 +1300]
               | Starting Falcon Server on http://localhost:9292/
Testing with:
50.0% time spent waiting on CPU
50.0% time spent waiting on IO
0.1 seconds average response duration (approximate, not including GC)
300000 objects are generated w/each request
We will decide to do CPU or IO every 10.0 milliseconds
CPU iteration: 0.05300017073750496ms
 0.02s     info: Async::Container::Process::Instance [oid=0x80c] [ec=0x820] [pid=63818] [2024-01-09 21:49:31 +1300]
               | - Per-process status: kill -USR1 63818
Warming up for 15 seconds...
Benchmarking...

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: k6/script.js
     output: -

  scenarios: (100.00%) 1 scenario, 10 max VUs, 2m0s max duration (incl. graceful stop):
           * contacts: Up to 8.00 iterations/s for 1m30s over 1 stages (maxVUs: 10, gracefulStop: 30s)


     data_received..................: 45 kB 494 B/s
     data_sent......................: 36 kB 399 B/s
     http_req_blocked...............: min=1µs      avg=36.85µs  med=5µs      max=2.79ms   p(75)=6µs      p(99)=1.19ms   p(99.99)=2.78ms  
     http_req_connecting............: min=0s       avg=26.06µs  med=0s       max=2.52ms   p(75)=0s       p(99)=1.08ms   p(99.99)=2.47ms  
     http_req_duration..............: min=116.76ms avg=129.55ms med=128.04ms max=171.12ms p(75)=133.45ms p(99)=150.04ms p(99.99)=170.38ms
       { expected_response:true }...: min=116.76ms avg=129.55ms med=128.04ms max=171.12ms p(75)=133.45ms p(99)=150.04ms p(99.99)=170.38ms
     http_req_failed................: 0.00% ✓ 0        ✗ 450 
     http_req_receiving.............: min=17µs     avg=26.45µs  med=25µs     max=62µs     p(75)=29µs     p(99)=56µs     p(99.99)=61.86µs 
     http_req_sending...............: min=7µs      avg=21.65µs  med=19µs     max=334µs    p(75)=25µs     p(99)=78.03µs  p(99.99)=329.28µs
     http_req_tls_handshaking.......: min=0s       avg=0s       med=0s       max=0s       p(75)=0s       p(99)=0s       p(99.99)=0s      
   ✓ http_req_waiting...............: min=116.72ms avg=129.5ms  med=128ms    max=171.08ms p(75)=133.4ms  p(99)=149.89ms p(99.99)=170.34ms
     http_reqs......................: 450   4.993017/s
     iteration_duration.............: min=116.81ms avg=129.69ms med=128.14ms max=171.16ms p(75)=133.63ms p(99)=151.41ms p(99.99)=170.44ms
     iterations.....................: 450   4.993017/s
     vus............................: 1     min=0      max=1 
     vus_max........................: 10    min=10     max=10


running (1m30.1s), 00/10 VUs, 450 complete and 0 interrupted iterations

Conclusion

It does appear that falcon, at least for this benchmark, on this hardware, consistently outperforms puma, or in other words, "just switch their Rails app to falcon to get free performance" isn't unreasonable, at least in this case. Also bear in mind that as you said, 50/50 IO/CPU is not ideal for an event driven server and as we go up higher, e.g. 90/10, you will see Falcon make extremely significant gains. As things like YJIT take over, I expect this to become more common.

For the sake of this discussion, it appears there is still performance being left on the table - my gut feeling is the request queueing and persistent connection handling are more of an issue. For the sake of thread pool size, I don't think a user should ever need to think about it. In other words, I think puma should probably just keep on creating threads as requests come in, unless the user has configured it otherwise. Of course, this is a vector for DoS, but in that case, everyone will be better off with a proper load balancer and cluster that scales according to queue depth/latency, or a load balancer that starts dropping connections if resources are simply not available.

byroot · 2024-01-09T09:43:17Z

It does appear that falcon, at least for this benchmark, on this hardware, consistently outperforms puma

From your results:

puma:     http_reqs......................: 450   4.99257/s

vs:

falcon: http_reqs......................: 450   4.993017/s

So as I said, very much in the same ballpark, and certainly not capable of "5000 fibers" concurrently.

As for the latency numbers, they look a bit better indeed, hats off to you, but similarly they're very much in the same ballpark and are much more likely to be explained by the IO handling code than by the use of fibers over threads.

Anyway, all I'm asking is not to make misleading claims in this thread. End of discussion for me.

The main change is the default number of threads is reduced from 5 to 3 as discussed in rails#50450. Pending a potential future "Rails tuning" guide, I tried to include in comments the gist of the tradeoffs involved. I also removed the pidfile except for development. It's useful to prevent booting the server twice there but I don't think it makes much sense in production, especially [since Puma no longer supports daemonization and instead recommends using a process monitor](https://github.com/puma/puma/blob/99f83c50fbb712b0987667f3533cce4ea925b2da/docs/deployment.md#should-i-daemonize). And it makes even less sense in a PaaS or containerized world.

byroot · 2024-01-09T10:12:40Z

Alright, I opened #50669 with this change plus a couple smaller ones and a revamp of the comment that tries to give the gist of the tradeoff involved. I'll leave it open for a bit in case there is extra feedback, but after that I believe we can finally close this issue.
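
For readers who want to picture the result, here is a hedged sketch of a `config/puma.rb` along the lines described in this thread. It is not the exact file from #50669: the comments and the pidfile condition are paraphrased from the PR description, and only the default of 3 threads is taken directly from the discussion.

```ruby
# config/puma.rb (illustrative sketch, not the exact contents of #50669)

# Threads per process. More threads raise throughput for I/O-heavy apps,
# but they also raise tail latency because of GVL contention. 3 is a
# compromise default; measure your own app before changing it.
threads_count = ENV.fetch("RAILS_MAX_THREADS", 3)
threads threads_count, threads_count

# Port Puma listens on.
port ENV.fetch("PORT", 3000)

# Allow Puma to be restarted by `bin/rails restart`.
plugin :tmp_restart

# Assumption: only write a pidfile in development, where it helps prevent
# booting the server twice. In production a process monitor or container
# runtime is expected to supervise the server instead.
if ENV.fetch("RAILS_ENV", "development") == "development"
  pidfile ENV.fetch("PIDFILE", "tmp/pids/server.pid")
end
```

Because the file is generated into each new app, changing the thread count later is a one-line edit or a `RAILS_MAX_THREADS` environment variable.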

ioquatix · 2024-01-09T12:31:30Z

So as I said, very much in the same ballpark, and certainly not capable of "5000 fibers" concurrently.

AFAICT, the benchmark only goes up to 10 concurrent requests (VUs? Please correct me if I'm wrong). You'd have to open up the throttle a bit to see falcon really fly. In any case, I have not made any misleading claims and falcon is well capable of 5000 fibers if you make 5000 concurrent requests. In fact I gave a live demonstration of 1 million fibers connected to a single Falcon server process - as you can imagine it does start to choke a bit at that level, but it's definitely possible.

That being said, coming back to puma, even if we have a thread pool capable of up to a few hundred threads, it would be good enough. It would also potentially take advantage of the MN scheduler (which is just the same coroutines that I implemented for fibers), and it also makes it easier for puma to support long running streaming requests, e.g. WebSockets, without the need for a rack.hijack background server thread.

much more likely to be explained by the IO handling code than by the use of fibers over threads.

Actually, it's both, since the fiber scheduler mostly avoids GVL contention and explicitly schedules work. I spent a huge effort optimising the request and response handling to ensure the minimum latency.

Lengthy discussion on default thread count, web concurrency, and pid files in "Set new default for the Puma thread count" (rails/rails#50450) and "Update the default Puma configuration" (rails/rails#50669).

This guide explains major concurrency and performance principles for Puma and CRuby, when to possibly not use Puma, and how to do basic load testing for tuning production performance settings. Incorporates comments from Rails issue rails#50450, PR rails#50669 and feedback from Jean Boussier. Co-authored-by: Jean byroot Boussier <jean.boussier+github@shopify.com>

lypanov · 2024-04-25T10:39:56Z

Actually, it's both, since the fiber scheduler mostly avoids GVL contention and explicitly schedules work. I spent a huge effort optimising the request and response handling to ensure the minimum latency.

We hit this issue in production a while back. To me it feels like Ruby (under multi-threaded Puma) is not prioritizing threads that have IO waiting on them. Alas, since we'd found a fix and that ate all the time we had available, I wasn't able to dig further into the underlying cause.

Benchmarking a minimal local reproduction to confirm this was pretty trivial: we run "eat CPU" requests (which simulate the production load of mixed IO-bound and CPU-bound work, mostly ERB) and simultaneously start a benchmarking tool making many trivial IO-based requests (networking; we used Redis ping). The runtime for threads vs. workers indicates something is really not okay with how things are being scheduled.
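A reproduction in that spirit can be sketched in plain Ruby without Puma or Redis. This is not the benchmark described above, just an illustration under stated assumptions: `burn_cpu` is a hypothetical stand-in for the "eat CPU" requests, and a 1 ms `sleep` stands in for a Redis ping (it releases the GVL while waiting and has to reacquire it afterwards). All method names are made up for this sketch.

```ruby
# Current monotonic time in seconds.
def now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Spin in pure Ruby for roughly `seconds`; stands in for ERB rendering.
def burn_cpu(seconds)
  deadline = now + seconds
  counter = 0
  counter += 1 while now < deadline
  counter
end

# Time `samples` short waits. Each wait releases the GVL and must
# reacquire it before the elapsed time can be recorded.
def measure_io_latency(samples:, io_wait:)
  Array.new(samples) do
    started = now
    sleep(io_wait)
    now - started
  end
end

# Run the IO measurements while `cpu_threads` threads compete for the GVL.
# Returns the observed wall-clock duration of each wait, in seconds.
def run_experiment(cpu_threads:, samples: 10, io_wait: 0.001)
  stop = false
  hogs = Array.new(cpu_threads) { Thread.new { burn_cpu(0.05) until stop } }
  measure_io_latency(samples: samples, io_wait: io_wait)
ensure
  stop = true
  hogs&.each(&:join)
end

if __FILE__ == $PROGRAM_NAME
  [0, 2].each do |n|
    worst = run_experiment(cpu_threads: n).max
    puts format("cpu_threads=%d worst wait=%.1f ms", n, worst * 1000)
  end
end
```

Comparing the worst wait with and without CPU-bound threads in the same process gives a rough feel for how much of the observed latency is time spent waiting for the GVL rather than for the IO itself. Running the CPU work in separate processes instead would be the analogue of the threads vs. workers comparison.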

We saw up to 100ms jitter in production Redis/SQL requests and confirmed with our cloud provider via telemetry that actual responses were < 8ms peak.

Switching to multi-worker Puma, using a 1:3 thread ratio, and moving to memory-optimized instances solved it.
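As a rough sketch, a multi-worker setup with 3 threads per worker could be expressed in `config/puma.rb` like this. The worker count and the `WEB_CONCURRENCY` variable are assumptions for illustration, since the actual values used are not stated here. `workers`, `threads` and `preload_app!` are Puma configuration DSL methods, so this fragment is meant to be loaded by Puma rather than run on its own.

```ruby
# config/puma.rb (sketch, evaluated by Puma's configuration DSL)

# Cluster mode: several worker processes, each with its own GVL.
# The default of 2 is a placeholder; size it to the available cores and memory.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

# Fixed pool of 3 threads per worker (the 1:3 ratio).
threads 3, 3

# Load the app before forking so workers can share memory via copy-on-write.
preload_app!
```

More workers generally means more memory, which fits with the move to memory-optimized instances mentioned above.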

ioquatix · 2024-04-25T12:38:10Z

@lypanov 100ms jitter comes from the thread scheduler in Ruby. https://ivoanjo.me/blog/2023/07/23/understanding-the-ruby-global-vm-lock-by-observing-it/ is a great blog post/presentation about the issue IIRC.

dhh added the railties label Dec 26, 2023

dhh added this to the 7.2.0 milestone Dec 26, 2023

byroot mentioned this issue Jan 9, 2024: Update the default Puma configuration #50669 (Merged)

dentarg mentioned this issue Jan 9, 2024: Tail Latency with queue_requests in Single Threaded mode puma/puma#2311 (Open)

byroot closed this as completed in #50669 Jan 10, 2024

dentarg mentioned this issue Jan 11, 2024: High latency on puma puma/puma#3315 (Closed)

rossta mentioned this issue Jan 11, 2024: Update puma config per latest changes in Rails joyofrails/joyofrails.com#21 (Merged)

lexcao mentioned this issue Jan 23, 2024: Set a new default for the Puma thread count #50450 lexcao/learn-from-issue#1 (Open)

benoittgt mentioned this issue Jan 28, 2024: Permit thread state SUSPENDED to go to STOPPED jhawthorn/vernier#53 (Merged)

noahgibbs mentioned this issue Feb 1, 2024: Deploy Performance Tuning Guide noahgibbs/rails#2 (Closed)

noahgibbs mentioned this issue Feb 2, 2024: Add a Rails Guide for tuning performance for deployment #50949 (Open, 4 tasks)

beauraF mentioned this issue Feb 5, 2024: Add metrics around GVL DataDog/dd-trace-rb#3433 (Open)

trevorturk mentioned this issue Apr 3, 2024: Falcon no longer respects count, new likely increases memory usage on shared hosts socketry/falcon#233 (Closed)

jrochkind mentioned this issue Apr 24, 2024: Change heroku puma worker/thread counts sciencehistory/scihist_digicoll#2465 (Open)
