Set a new default for the Puma thread count #50450

Closed
dhh opened this issue Dec 26, 2023 · 85 comments · Fixed by #50669

@dhh
Member

dhh commented Dec 26, 2023

We currently set the default Puma thread count to 5. That's very high. It assumes you will massively benefit from a bunch of inline 3rd party calls or super slow DB queries. It's not a good default for apps with quick SQL queries and 3rd party calls running via jobs, which is the recommended way to make Rails applications.

At 37signals, we run a 1 worker to 1 thread ratio. That's after extensive testing. It provided the best latency, at some small cost to ultimate throughput. Maybe that's too much or has negative effects on resource-starved systems, like Heroku Dynos. But it's clear that a default of 1:5 is not right.

So let's find a way to benchmark our way to a good, new default that works for most people, most of the time.

cc @byroot

@dhh dhh added the railties label Dec 26, 2023
@dhh dhh added this to the 7.2.0 milestone Dec 26, 2023
@byroot
Member

byroot commented Dec 27, 2023

As discussed on Campfire, I've been saying for a long time that 5 threads is way too much for what I consider a well optimized application (as you mention, no slow queries, no N+1, no 3rd party API calls from the request cycle).

The lobste.rs benchmark could be a good start: https://github.com/Shopify/yjit-bench/tree/main/benchmarks/lobsters, however it was specially crafted to not use a web server, so we'd need to rework it a bit for that purpose.

Overall the choice will be a latency vs throughput tradeoff, so there won't be a single setting that satisfies everyone, but we likely can find a better compromise than 5 threads per process (which supposes more than 80% IO which is nuts).
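
To make the "more than 80% IO" figure concrete, here is a minimal sketch of the back-of-the-envelope model behind it. It assumes an idealized process in which only one thread can run Ruby code at a time, so extra threads only help while another thread is waiting on IO. The method names are hypothetical and exist only for illustration.

```ruby
# Idealized model (an assumption, not a measurement): a process is saturated
# once the combined CPU share of its threads reaches 100%.

# IO fraction at which `threads` concurrent requests exactly saturate one process.
def io_fraction_to_justify(threads)
  1.0 - 1.0 / threads
end

# Number of concurrent requests needed to saturate one process
# when each request spends `io_fraction` of its time waiting on IO.
def threads_to_saturate(io_fraction)
  1.0 / (1.0 - io_fraction)
end

puts io_fraction_to_justify(5) # 0.8
puts threads_to_saturate(0.5)  # 2.0
```

Under this model, going beyond the saturation point adds queued work rather than throughput, which is where the latency cost comes from.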

@dhh
Member Author

dhh commented Dec 27, 2023

Totally. Maybe we can get someone from the community to help here. Let's run all the public benchmarks that are available at different ratios. 1:1, 1:2, 1:3. Then see what things look like in terms of latency and throughput. In our testing, 1:1 proved to be the best on both, but that was with HEY and Basecamp. So let's see what might be true with other benchmarks.

That marks this issue open for collaboration. If you're interested in helping us find the right default ratio, please post test results here. Thanks for helping!

@nateberkopec
Contributor

@noahgibbs worked on this for many years (although his work stopped ~3 years ago after losing sponsorship). One benchmark I recall of his showed that throughput on Discourse improved 25% when going from 1 to 5 threads. I would certainly consider Discourse a highly-optimized Rails app, maybe one of the most highly optimized in the world. It still spends 25% of its time waiting on I/O.

Noah's result accords pretty well with Amdahl's Law. Amdahl's Law shows that modest throughput gains can be obtained even when small parts of the program can be done in parallel - for example, a Rails application that spends 30% of its time waiting on I/O will have 18% higher throughput with 2 threads than with 1.

Of course, the second thread is the most costly thread of all: now you're multi-threaded, with all the bugs that can cause. But perhaps it is better to keep Rails multi-threaded by default, so that those bugs can occur and be seen and fixed often and early? What bugs would slip through if Rails became single-threaded by default again?

@byroot: I chose the default of 5 in Puma based on the benefit for a Rails app with 25-50% time spent in I/O wait (based on my experience looking at 100+ Rails apps perf dashboards, this covers 80% or more of prod apps). These apps receive a 30 to 65% improvement (respectively) in throughput with 5 threads, w/roughly a 25% increase in memory usage.

If I round the numbers a bit, Noah's benchmark result could be said to show something like the expected speedup given by Amdahl's Law for a 25%-wait-in-I/O application (in the formula though, 25% wait-on-io translates to 25% improvement in throughput w/5 threads). If you work back through Amdahl's Law, you can produce the following table, to give you an idea of the tradeoffs here even in a low-I/O app:

Thread Count Throughput
1 100%
2 113%
3 120%
4 123%
5 125%
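
For readers who want to work back through Amdahl's Law themselves, the following is a minimal sketch. It treats the I/O-wait share as the "parallelizable" fraction, which is the simplification used in this comment; the method name is hypothetical.

```ruby
# Amdahl's Law: speedup with `threads` workers when `parallel_fraction`
# of the work (here: time spent waiting on I/O) can overlap with other work.
def amdahl_speedup(parallel_fraction, threads)
  1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads)
end

# Relative throughput for an app that spends 25% of its time in I/O wait.
(1..5).each do |n|
  pct = (amdahl_speedup(0.25, n) * 100).round
  puts "#{n} thread(s): #{pct}%"
end
```

With these inputs the formula gives roughly 114% for 2 threads, one point away from the 113% in the table, and matches the other rows after rounding. Plugging in 30% I/O wait and 2 threads gives about 1.18, the 18% figure mentioned earlier.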

@byroot
Member

byroot commented Jan 7, 2024

I chose the default of 5 in Puma based on the benefit for a Rails app with 25-50% time spent in I/O wait

Something doesn't add up though. Even assuming 50% IO wait, that means max throughput would be achieved with 2 threads, not 5.

@nateberkopec
Contributor

nateberkopec commented Jan 7, 2024

Imagine you have 10 requests to process. Each request is 0.5 seconds of a db call, followed by 0.5 seconds of CPU burn. 1 thread would take 10 seconds to process all those requests, one after the other.

With 2 threads and one process, would it take 5 seconds?

If you had 10 threads, you could fire all the db calls at once, wait 0.5 seconds, then do 5 seconds of serial CPU work, processing all requests in 5.5 seconds. That's an 81% speedup. If you had 2 threads, you would have to spend 2.5 seconds doing all the DB calls, and 5 seconds doing CPU work, for a total of 7.5 seconds. 5 threads, 6 seconds.

That's where I got 5 threads from. 80% of the maximum possible benefit for a 50% I/O wait app.

Amdahl's Law assumes you could keep adding threads and make the db calls go faster, which is where the analogy breaks down, but I find it's still quite accurate and a useful heuristic here.
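
The arithmetic in the 10-request example above can be sketched as follows. This is only the simplified batch model from this comment (DB waits overlap across threads in batches, CPU work is fully serialized), not a realistic server simulation; the method and parameter names are hypothetical.

```ruby
# Simplified batch model: all DB calls happen first, `threads` at a time,
# then the CPU work for every request runs one after the other.
def batch_time(requests:, threads:, io_seconds: 0.5, cpu_seconds: 0.5)
  io_batches = (requests.to_f / threads).ceil
  io_batches * io_seconds + requests * cpu_seconds
end

[1, 2, 5, 10].each do |t|
  total = batch_time(requests: 10, threads: t)
  puts "#{t} thread(s): #{total}s"
end
```

This reproduces the 10, 7.5, 6 and 5.5 second totals quoted above, and shows the diminishing returns between 5 and 10 threads.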

@byroot
Member

byroot commented Jan 7, 2024

Yeah, I see what you mean. But I think a bit differently.

In my mental model, a process has a capacity of 1s of CPU time per second. So with 5 threads and 50% IOs you are queuing 2.5 times as much work as you can chew. Which is fine to handle small spikes, but degrades terribly if you are under capacity.

Also in your simplified model, all 5 requests are received at once and do 0.5s of IO followed by 0.5 of CPU time, in practice requests are received continuously and IO and CPU are much more intertwined than that.

All this to say, you're not wrong, but I think you are focusing too much on throughput impact, and not enough on the latency impact (which itself impacts throughput negatively).

But again, it's of course a tradeoff, and depends on what your priorities are.

@dhh
Member Author

dhh commented Jan 7, 2024

Our testing on both HEY and Basecamp showed that when we have a higher thread count, we end up with some requests that really spike in latency. It was much easier to ensure a consistent latency with a lower thread count. But actually, now that I think about it, maybe that latency isn't from the threading itself, but from GC'ing? Would be nice to get to the bottom of that. In theory, I dig the idea that we have a higher thread count, and if we can keep that, without the latency swings, that's certainly better.

What's the next step to verify these theories?

@victorlcampos

I understand that a single thread works well for your business,

but I think Basecamp and Shopify both running a single thread (I think Shopify uses a Unicorn fork, right?) creates a risk that new threading bugs emerge.

I think default should be at least 2, even if 1 is better for average apps.

@byroot
Member

byroot commented Jan 7, 2024

But actually, now that I think about it, maybe that latency isn't from the threading itself, but from GC'ing?

It's the same thing. In this context GC is just CPU work, it pauses all threads.

I dig the idea that we have a higher thread count, and if we can keep that, without the latency swings,

That's my whole point, the higher the thread count, the higher the tail latency. You have to choose your preferred middle ground between maximizing throughput (hence reducing hosting cost) and minimizing latency (hence improving user experience). You can't have both.

I think default should be at least 2

That's my recommendation yes.

@mensfeld
Contributor

mensfeld commented Jan 7, 2024

maybe that latency isn't from the threading itself, but from GC'ing?

You may want to look into @peterzhu2118 / shopify https://github.com/shopify/autotuner (presented at RailsWorld :) ) - it collects GC data between requests and recommends alignments to the GC engine.

Aside from that IMHO when discussing Puma's default thread count, it's crucial to acknowledge its impact not just on web requests but also on background jobs, which may have different latency and IO characteristics. People often model things based on Rails/Puma defaults. With this change, should the default on connection pool change as well?

@byroot
Member

byroot commented Jan 7, 2024

it's crucial to acknowledge its impact not just on web requests but also on background jobs

Yeah, this discussion is just for web workers, background job workers are much less latency sensitive by definition, and maximizing throughput at the expense of latency makes sense there.

@byroot
Member

byroot commented Jan 7, 2024

If you had 10 threads, you could fire all the db calls at once, wait 0.5 seconds, then do 5 seconds of serial CPU work, processing all requests in 5.5 seconds. That's an 81% speedup.

Since a small drawing is generally worth a lot of words, I made a diagram of this theoretical case:

[Image: screenshot of the diagram, 2024-01-07 14:49:17]

Now the same (not super realistic) scenario with only two threads:

[Image: screenshot of the diagram, 2024-01-07 14:52:11]

Edit: fixed a small mistake in the second drawing.

@dhh
Member Author

dhh commented Jan 7, 2024

Yeah, I don't want to regress on thread safety. Who knows, maybe one day the great GVL is going to be gone 😄. But it just seems that 5 is too much for most people, if you care about latency more than throughput, and I think most should? It's easier to deal with throughput using multiple processes than it is to deal with latency.

So maybe we just start with a change to 2? Then we can continue to document the factors that might lead someone to change that.

@noahgibbs
Contributor

I actually found peak throughput with 6 threads/process at the time for Discourse -- and that benchmark would have been in 2017 or 2018. But yeah, that many threads is usually not great for a production app trying to keep latency low. It was purely optimising for big-batch throughput.

@dhh
Member Author

dhh commented Jan 7, 2024

Yeah, for us, the latency tail just got worse and worse the more threads we added. But it would be nice to get scientific about this!

@dhh
Member Author

dhh commented Jan 7, 2024

Do we need to line up any other thread pool counts if we make the change from 5 to 2?

@p8
Member

p8 commented Jan 7, 2024

I've updated the TechEmpower benchmarks to use 3 threads instead of 5.
TechEmpower/FrameworkBenchmarks#8668
Locally I got the following results. I'm waiting for the next run to see if anything improved:

+---------------------+---------+------+-----+-----+-----+-------+--------------+
|                     |plaintext|update| json|   db|query|fortune|weighted_score|
+---------------------+---------+------+-----+-----+-----+-------+--------------+
| 1 thread  per worker|    20799|  9819|83123|17434|10334|   8653|          1048|
| 2 threads per worker|    20870|  9478|66764|16220|11634|  10543|          1042|
| 3 threads per worker|    25077| 10382|84429|16497|12338|  11246|          1141|
| 4 threads per worker|    29257| 10051|69532|18108|11752|  11132|          1093|
| 5 threads per worker|    33152| 10203|77062|18459|11721|  11711|          1114|
+---------------------+---------+------+-----+-----+-----+-------+--------------+

@natematykiewicz
Contributor

natematykiewicz commented Jan 7, 2024

On my app at work I found that if I decreased the thread count from 5 to 4, response times improved. I then decreased them from 4 to 3 and they improved again. 3 to 2 did not improve response times, so I left it at 3. For our app, 3 threads results in about twice as fast response times as 5 threads while retaining the same throughput.

Our app has fast DB queries, avoids web requests in our controller actions (offloading those to activejob). We have a JSON API, a bunch of ERB pages, and we have a CDN in front of our site so we proxy a lot of our ActiveStorage images.

I know that your ideal thread count highly depends on what you're doing, but I think 1 is too low of a default. 3 has served us well.

@dhh
Member Author

dhh commented Jan 7, 2024

@p8 I think whatever benchmarks we run have to include a histogram of individual request latency. That's the main issue here. The original default of 5 was optimized primarily for throughput, not tail latency. So we should find a way to quantify the tail latency.
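
As one way to quantify tail latency from a list of per-request timings, here is a minimal nearest-rank percentile sketch. The sample data is made up purely for illustration and does not come from any benchmark in this thread.

```ruby
# Nearest-rank percentile over a list of request latencies (milliseconds).
def percentile(latencies, pct)
  raise ArgumentError, "no samples" if latencies.empty?

  sorted = latencies.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical sample: mostly fast requests with a slow tail.
samples = [20] * 90 + [40] * 8 + [300, 900]

puts "avg: #{samples.sum / samples.size.to_f}ms"
[50, 95, 99, 100].each { |p| puts "p#{p}: #{percentile(samples, p)}ms" }
```

In a sample like this the average and median look healthy while p99 and the maximum do not, which is why throughput or mean latency alone would hide the effect being discussed.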

@byroot
Member

byroot commented Jan 7, 2024

I don't think TechEmpower is particularly interesting for this discussion, as it's way too simple to reflect reality: https://github.com/TechEmpower/FrameworkBenchmarks/blob/6e4fa674519771a1833e792d5d69f0043e5bebf3/frameworks/Ruby/rails/app/controllers/hello_world_controller.rb

But overall, choosing an application to benchmark is not all, there is also the setup. When doing trivial SQL queries, a local database vs one on another host on LAN will make a major difference. (~0.5ms per query).

So in the end, I think we could make this decision purely based on a synthetic application that can be configured to have a certain IO/CPU ratio.
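
A synthetic application of that kind could look roughly like the sketch below. It is an assumption about the shape of such an app, not the benchmark actually used later in this thread: IO is simulated with `sleep` (which releases the GVL), CPU work is a busy loop measured against the thread's own CPU clock, and the class name and parameters are hypothetical.

```ruby
# Sketch of a Rack-compatible endpoint with a configurable IO/CPU ratio.
class SyntheticApp
  def initialize(duration: 0.1, io_ratio: 0.5)
    @io_seconds = duration * io_ratio
    @cpu_seconds = duration * (1.0 - io_ratio)
  end

  def call(_env)
    sleep(@io_seconds) if @io_seconds > 0 # simulated IO wait
    burn_cpu(@cpu_seconds)                # simulated Ruby-level work
    [200, { "content-type" => "text/plain" }, ["ok\n"]]
  end

  private

  # Spin until this thread has consumed `seconds` of CPU time.
  def burn_cpu(seconds)
    start = cpu_time
    nil while cpu_time - start < seconds
  end

  def cpu_time
    Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
  end
end

app = SyntheticApp.new(duration: 0.1, io_ratio: 0.5)
status, _headers, body = app.call({})
puts status, body.first
```

To serve it, a `config.ru` containing `run SyntheticApp.new(io_ratio: 0.5)` should be enough for any Rack server, and the ratio can then be varied between benchmark runs.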

@natematykiewicz
Contributor

The other thing to bear in mind is that the average rails app doesn't have as heavily tuned database queries. From being in some Rails-related slacks, missing indexes and inefficient queries seem to be the norm. Basecamp preferring 1 thread feels like a real outlier due to having knowledgeable staff that's able to make those queries return incredibly quickly. The average rails app benefits from more threads to offset the inefficient queries. Then as they improve their queries, they need fewer threads, because there's less IO wait.

@natematykiewicz
Contributor

There's also a huge misunderstanding of Ruby's concurrency model among newcomers. So often in these slacks devs complain about slow response times. I ask how many threads they're running and sometimes they say numbers as high as 30. They want more throughput so they crank up the thread count. I'm not sure how to best educate devs on how to tune their thread count (other than recommend Nate Berkopec's courses), but I feel that setting the thread count to 1 will make devs definitely want to crank that number up (rightfully so), and not have any idea how high it should actually be.

@dhh
Member Author

dhh commented Jan 7, 2024

I'm not actually advocating for a default of 1. I'd like to put some documentation into our default puma config explaining some of the very basics here, and at least propose 1 as a setting for folks.

But it sounds like we're essentially talking about whether the new default should be 2 or 3. I don't have a strong opinion there, except that 5 isn't it.

@byroot
Member

byroot commented Jan 7, 2024

Yeah, I think we're pretty much in agreement for either 2 or 3.

The other thing to bear in mind is that the average rails app doesn't have as heavily tuned database queries.

That is true, and that's probably also why the popular belief that Ruby performance doesn't matter much because it's all IO anyway is so commonplace. When you start optimizing a Rails application, the biggest problems you uncover are always query (or generally IO) related. But even in applications with such issues, I've never really witnessed enough IOs to justify 5 threads.

The only use case that would justify it would be an application whose main endpoint is essentially a proxy to another API. I've seen this once or twice, and there yes you can crank up the threads (or move to async).

But I don't think we should really be optimizing for this kind of situation, and 2 or 3 threads should still be plenty to keep a high enough utilization for apps having a few N+1 and a couple slow query issues.

Also it's really just a default, even more, it's generated when you create the app, so it's really easy to edit.

@natematykiewicz
Contributor

natematykiewicz commented Jan 7, 2024

I'd personally vote for 3 because on my app, 2 threads had identical latency to 3, but 3 had more throughput.

I'm very performance minded, so I find it hard to believe that the average app running defaults would have low enough IO that they'd prefer just 2 threads. As you said, N+1 queries are incredibly common (sometimes in loops that become like 5N+1) and a great way to rack up the IO.

I don't think the defaults should be optimized for apps doing HTTP requests in a controller action. That should be highly discouraged as it ruins throughput and is a good way to get DDoSed if the 3rd party site goes offline (happened to me once many years ago, never again).

@byroot
Member

byroot commented Jan 7, 2024

I find it hard to believe that the average app running defaults would have low enough IO that they'd prefer just 2 threads.

It's not that low at all really.

If you assume a 50% IO ratio in the average request, you only need two concurrent requests to saturate one process, so two threads.

It of course depends on how much data transformation you are doing at the Ruby layer, but view templating or JSON serialization (plus all the Active Record de-serialization) can easily take more time than several decently indexed DB queries.

But of course it's all about our personal experiences and perception, as there's only a handful of open source Rails apps out there to look at, and nothing guarantees that they are representative of what is being done in the wild. More importantly, having the source is only one part: you need the whole production setup and traffic pattern to really measure the IO/CPU ratio.

So all this to say that whatever value we choose, will be largely based on a guess.

@mensfeld
Contributor

mensfeld commented Jan 7, 2024

But of course it's all about our personal experiences and perception, as there's only a handful of open source Rails apps out there to look at, and nothing guarantees that they are representative of what is being done in the wild. More importantly, having the source is only one part: you need the whole production setup and traffic pattern to really measure the IO/CPU ratio.

Maybe we could reach out to someone in AppSignal (cc @thijsc) and get some aggregated metrics on this from a set of commercial Rails projects. WDYT?

@byroot
Member

byroot commented Jan 7, 2024

If they have the data I'd love to look at it, but we'd need to be very careful about how it's collected because GVL contention is generally mistaken for IO time by most instrumentation. So apps that use too many threads can appear as being more IO heavy than they actually are.
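
The effect can be illustrated with a small sketch, assuming CRuby with a GVL. It computes "IO time" the way naive instrumentation might, as wall time minus CPU time, for threads that perform no IO at all; the helper names are hypothetical and the exact numbers will vary by machine.

```ruby
def thread_cpu_time
  Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
end

def wall_time
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Burns `cpu_seconds` of CPU on the current thread and returns the share of
# wall time that was NOT spent on CPU (what naive tooling would call "IO").
def apparent_io_share(cpu_seconds)
  wall_start = wall_time
  cpu_start = thread_cpu_time
  nil while thread_cpu_time - cpu_start < cpu_seconds
  wall = wall_time - wall_start
  cpu = thread_cpu_time - cpu_start
  [(wall - cpu) / wall, 0.0].max
end

[1, 4].each do |thread_count|
  shares = Array.new(thread_count) { Thread.new { apparent_io_share(0.2) } }.map(&:value)
  avg = shares.sum / shares.size
  puts "#{thread_count} thread(s): apparent IO share #{(avg * 100).round}%"
end
```

On CRuby the four-thread run should report a much larger apparent IO share than the single-thread run, even though neither does any IO: the extra time is spent waiting for the GVL.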

@lvonk

lvonk commented Jan 9, 2024

I'm confused. How come in DHH's 32:1 results, there's 8-11ms (3-5% of total runtime) GVL Wait? There's no other threads to compete with. What's the GVL waiting on?

@natematykiewicz There are more threads running than just the puma worker thread. So perhaps some of the internal puma threads or the connection reaper?

@ioquatix
Contributor

ioquatix commented Jan 9, 2024

I ran the benchmarks a couple of times each and chose the best for each.

Puma (1 process, 4 threads)

Puma starting in single mode...
* Puma version: 6.4.1 (ruby 3.2.2-p53) ("The Eagle of Durango")
*  Min threads: 4
*  Max threads: 4
*  Environment: development
*          PID: 63741
Testing with:
50.0% time spent waiting on CPU
50.0% time spent waiting on IO
0.1 seconds average response duration (approximate, not including GC)
300000 objects are generated w/each request
We will decide to do CPU or IO every 10.0 milliseconds
CPU iteration: 0.052999937906861305ms
* Listening on http://0.0.0.0:9292
Use Ctrl-C to stop
Warming up for 15 seconds...
Benchmarking...

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: k6/script.js
     output: -

  scenarios: (100.00%) 1 scenario, 10 max VUs, 2m0s max duration (incl. graceful stop):
           * contacts: Up to 8.00 iterations/s for 1m30s over 1 stages (maxVUs: 10, gracefulStop: 30s)


     data_received..................: 43 kB 474 B/s
     data_sent......................: 36 kB 399 B/s
     http_req_blocked...............: min=273µs    avg=608.42µs med=524µs    max=2.84ms   p(75)=654.25µs p(99)=1.52ms   p(99.99)=2.81ms  
     http_req_connecting............: min=234µs    avg=544.71µs med=473.5µs  max=1.81ms   p(75)=577.75µs p(99)=1.19ms   p(99.99)=1.79ms  
     http_req_duration..............: min=120.88ms avg=138.23ms med=135.79ms max=188.97ms p(75)=143.52ms p(99)=166.06ms p(99.99)=188.55ms
       { expected_response:true }...: min=120.88ms avg=138.23ms med=135.79ms max=188.97ms p(75)=143.52ms p(99)=166.06ms p(99.99)=188.55ms
     http_req_failed................: 0.00% ✓ 0       ✗ 450 
     http_req_receiving.............: min=20µs     avg=37.69µs  med=36µs     max=90µs     p(75)=40.75µs  p(99)=64µs     p(99.99)=89.23µs 
     http_req_sending...............: min=16µs     avg=37.67µs  med=31µs     max=670µs    p(75)=37µs     p(99)=173.33µs p(99.99)=665.68µs
     http_req_tls_handshaking.......: min=0s       avg=0s       med=0s       max=0s       p(75)=0s       p(99)=0s       p(99.99)=0s      
   ✓ http_req_waiting...............: min=120.79ms avg=138.16ms med=135.73ms max=188.93ms p(75)=143.45ms p(99)=165.99ms p(99.99)=188.5ms 
     http_reqs......................: 450   4.99257/s
     iteration_duration.............: min=121.33ms avg=138.93ms med=136.4ms  max=189.39ms p(75)=144.22ms p(99)=167.23ms p(99.99)=188.97ms
     iterations.....................: 450   4.99257/s
     vus............................: 1     min=0     max=2 
     vus_max........................: 10    min=10    max=10

Falcon (1 process)

  0.0s     info: Falcon::Command::Serve [oid=0x7bc] [ec=0x7d0] [pid=63816] [2024-01-09 21:49:31 +1300]
               | Falcon v0.42.3 taking flight! Using Async::Container::Forked {:count=>1}.
               | - Binding to: #<Falcon::Endpoint http://localhost:9292/ {}>
               | - To terminate: Ctrl-C or kill 63816
               | - To reload configuration: kill -HUP 63816
 0.01s     info: Falcon::Controller::Serve [oid=0x7f8] [ec=0x7d0] [pid=63816] [2024-01-09 21:49:31 +1300]
               | Starting Falcon Server on http://localhost:9292/
Testing with:
50.0% time spent waiting on CPU
50.0% time spent waiting on IO
0.1 seconds average response duration (approximate, not including GC)
300000 objects are generated w/each request
We will decide to do CPU or IO every 10.0 milliseconds
CPU iteration: 0.05300017073750496ms
 0.02s     info: Async::Container::Process::Instance [oid=0x80c] [ec=0x820] [pid=63818] [2024-01-09 21:49:31 +1300]
               | - Per-process status: kill -USR1 63818
Warming up for 15 seconds...
Benchmarking...

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

  execution: local
     script: k6/script.js
     output: -

  scenarios: (100.00%) 1 scenario, 10 max VUs, 2m0s max duration (incl. graceful stop):
           * contacts: Up to 8.00 iterations/s for 1m30s over 1 stages (maxVUs: 10, gracefulStop: 30s)


     data_received..................: 45 kB 494 B/s
     data_sent......................: 36 kB 399 B/s
     http_req_blocked...............: min=1µs      avg=36.85µs  med=5µs      max=2.79ms   p(75)=6µs      p(99)=1.19ms   p(99.99)=2.78ms  
     http_req_connecting............: min=0s       avg=26.06µs  med=0s       max=2.52ms   p(75)=0s       p(99)=1.08ms   p(99.99)=2.47ms  
     http_req_duration..............: min=116.76ms avg=129.55ms med=128.04ms max=171.12ms p(75)=133.45ms p(99)=150.04ms p(99.99)=170.38ms
       { expected_response:true }...: min=116.76ms avg=129.55ms med=128.04ms max=171.12ms p(75)=133.45ms p(99)=150.04ms p(99.99)=170.38ms
     http_req_failed................: 0.00% ✓ 0        ✗ 450 
     http_req_receiving.............: min=17µs     avg=26.45µs  med=25µs     max=62µs     p(75)=29µs     p(99)=56µs     p(99.99)=61.86µs 
     http_req_sending...............: min=7µs      avg=21.65µs  med=19µs     max=334µs    p(75)=25µs     p(99)=78.03µs  p(99.99)=329.28µs
     http_req_tls_handshaking.......: min=0s       avg=0s       med=0s       max=0s       p(75)=0s       p(99)=0s       p(99.99)=0s      
   ✓ http_req_waiting...............: min=116.72ms avg=129.5ms  med=128ms    max=171.08ms p(75)=133.4ms  p(99)=149.89ms p(99.99)=170.34ms
     http_reqs......................: 450   4.993017/s
     iteration_duration.............: min=116.81ms avg=129.69ms med=128.14ms max=171.16ms p(75)=133.63ms p(99)=151.41ms p(99.99)=170.44ms
     iterations.....................: 450   4.993017/s
     vus............................: 1     min=0      max=1 
     vus_max........................: 10    min=10     max=10


running (1m30.1s), 00/10 VUs, 450 complete and 0 interrupted iterations

Conclusion

It does appear that falcon, at least for this benchmark, on this hardware, consistently outperforms puma, or in other words, "just switch their Rails app to falcon to get free performance" isn't unreasonable, at least in this case. Also bear in mind that as you said, 50/50 IO/CPU is not ideal for an event driven server and as we go up higher, e.g. 90/10, you will see Falcon make extremely significant gains. As things like YJIT take over, I expect this to become more common.

For the sake of this discussion, it appears there is still performance being left on the table - my gut feeling is the request queueing and persistent connection handling are more of an issue. For the sake of thread pool size, I don't think a user should ever need to think about it. In other words, I think puma should probably just keep on creating threads as requests come in, unless the user has configured it otherwise. Of course, this is a vector for DoS, but in that case, everyone will be better off with a proper load balancer and cluster that scales according to queue depth/latency OR a load balancer that starts dropping connections if resources are simply not available.

@byroot
Member

byroot commented Jan 9, 2024

It does appear that falcon, at least for this benchmark, on this hardware, consistently outperforms puma

From your results:

puma:     http_reqs......................: 450   4.99257/s

vs:

falcon: http_reqs......................: 450   4.993017/s

So as I said, very much in the same ballpark, and certainly not capable of "5000 fibers" concurrently.

As for the latency numbers, they look a bit better indeed, hats off to you, but similarly they're very much in the same ballpark and are much more likely to be explained by the IO handling code than by the use of fibers over threads.

Anyway, all I'm asking is not to make misleading claims in this thread. End of discussion for me.

byroot added a commit to byroot/rails that referenced this issue Jan 9, 2024
The main change is the default number of threads
is reduced from 5 to 3 as discussed in rails#50450

Pending a potential future "Rails tuning" guide, I tried
to include in comments the gist of the tradeoffs involved.

I also removed the pidfile except for development.
It's useful to prevent booting the server twice there
but I don't think it makes much sense in production,
especially [since Puma no longer supports daemonization
and instead recommends using a process
monitor](https://github.com/puma/puma/blob/99f83c50fbb712b0987667f3533cce4ea925b2da/docs/deployment.md#should-i-daemonize).
And it makes even less sense in a PaaS or containerized
world.
@byroot
Member

byroot commented Jan 9, 2024

Alright, I opened #50669 with this change plus a couple smaller ones and a revamp of the comment that tries to give the gist of the tradeoff involved. I'll leave it open for a bit in case there is extra feedback, but after that I believe we can finally close this issue.
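
For anyone who wants to apply the new value to an existing app, a minimal sketch of the relevant part of `config/puma.rb` might look like this. It is not copied from #50669; it only assumes Puma's `threads` DSL method and the conventional `RAILS_MAX_THREADS` environment variable.

```ruby
# config/puma.rb (sketch, evaluated by Puma's configuration DSL)

# Use the same value for the minimum and maximum so the pool size is fixed.
# 3 is the default discussed in this issue; RAILS_MAX_THREADS overrides it.
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))
threads threads_count, threads_count
```

Rails' generated `config/database.yml` conventionally reads the same variable for the connection pool size, so overriding the environment variable should keep the two in step.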

@ioquatix
Contributor

ioquatix commented Jan 9, 2024

So as I said, very much in the same ballpark, and certainly not capable of "5000 fibers" concurrently.

AFAICT, the benchmark only goes up to 10 concurrent requests (VUs? Please correct me if I'm wrong). You'd have to open up the throttle a bit to see falcon really fly. In any case, I have not made any misleading claims and falcon is well capable of 5000 fibers if you make 5000 concurrent requests. In fact I gave a live demonstration of 1 million fibers connected to a single Falcon server process - as you can imagine it does start to choke a bit at that level, but it's definitely possible.

That being said, coming back to Puma, even a thread pool capable of up to a few hundred threads would be good enough. It could also potentially take advantage of the M:N scheduler (which is just the same coroutines that I implemented for fibers), and it would make it easier for Puma to support long-running streaming requests, e.g. WebSockets, without the need for a rack.hijack background server thread.

> much more likely to be explained by the IO handling code than by the use of fibers over threads.

Actually, it's both, since the fiber scheduler mostly avoids GVL contention and explicitly schedules work. I spent a huge effort optimising the request and response handling to ensure the minimum latency.

rossta added a commit to joyofrails/joyofrails.com that referenced this issue Jan 11, 2024
Lengthy discussion on default thread count, web concurrency, and pid
files in

Set new default for the Puma thread count
rails/rails#50450

Update the default Puma configuration
rails/rails#50669
noahgibbs added a commit to noahgibbs/rails that referenced this issue Feb 2, 2024
This guide explains major concurrency and performance principles
for Puma and CRuby, when to possibly not use Puma, and how to
do basic load testing for tuning production performance settings.
Incorporates comments from Rails issue rails#50450, PR rails#50669 and
feedback from Jean Boussier.

Co-authored-by: Jean byroot Boussier <jean.boussier+github@shopify.com>
@lypanov

lypanov commented Apr 25, 2024

> Actually, it's both, since the fiber scheduler mostly avoids GVL contention and explicitly schedules work. I spent a huge effort optimising the request and response handling to ensure the minimum latency.

We hit this issue in production a while back. To me it feels like Ruby (under multi-threaded Puma) is not prioritizing threads that are waiting on IO. Alas, since we had found a fix and it had eaten all the time we had, I wasn't able to dig further into the underlying cause.

Benchmarking a minimal local reproduction to confirm this was pretty trivial: we run "eat CPU" requests (simulating the production load of mixed IO and CPU-bound work, mostly ERB) and simultaneously start a benchmarking tool making many trivial IO-based requests (networking; we used Redis PING). The runtime for threads vs. workers indicates something is really not okay with how things are being scheduled.

We saw up to 100ms of jitter in production Redis/SQL requests, and confirmed with our cloud provider via telemetry that actual response times peaked below 8ms.

Switching to multi-worker Puma with a 1:3 worker-to-thread ratio and moving to memory-optimized instances solved it.
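
The kind of reproduction described above can be approximated without Puma or Redis. The following is a rough sketch under stated assumptions, not the benchmark that was actually run: `burn_cpu` stands in for ERB rendering, a short `sleep` stands in for a Redis PING, and all method names and parameters are hypothetical. It measures how much longer than expected each simulated IO call takes while CPU-bound threads compete for the GVL in the same process.

```ruby
# Sketch: observe how long a simulated IO call takes to resume
# while CPU-bound threads in the same process hold the GVL.

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Spin on the CPU until the deadline (stand-in for template rendering).
def burn_cpu(deadline)
  counter = 0
  counter += 1 while monotonic_now < deadline
  counter
end

# Returns the extra delay, in milliseconds, observed on each simulated
# IO call. `sleep` releases the GVL, much like a blocking socket read,
# so any extra delay is time spent waiting to be scheduled again.
def measure_io_delays(cpu_threads:, io_calls:, io_wait: 0.001, duration: 1.0)
  deadline = monotonic_now + duration
  burners = Array.new(cpu_threads) { Thread.new { burn_cpu(deadline) } }

  delays = []
  io_calls.times do
    break if monotonic_now >= deadline
    started = monotonic_now
    sleep(io_wait)
    delays << ((monotonic_now - started - io_wait) * 1000.0)
  end

  burners.each(&:join)
  delays
end

if __FILE__ == $PROGRAM_NAME
  [0, 2].each do |n|
    delays = measure_io_delays(cpu_threads: n, io_calls: 20, duration: 0.5)
    puts format("cpu_threads=%d max extra delay: %.1f ms", n, delays.max || 0.0)
  end
end
```

Comparing the run with no CPU threads against the run with a few gives a feel for the jitter. Running the CPU work in separate processes instead (as with multiple Puma workers) removes the GVL contention, since each process has its own lock.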

@ioquatix
Contributor

@lypanov 100ms jitter comes from the thread scheduler in Ruby. https://ivoanjo.me/blog/2023/07/23/understanding-the-ruby-global-vm-lock-by-observing-it/ is a great blog post/presentation about the issue IIRC.
