[Q] no free workers in the pool
error on long requests
#799
-
While stress-testing a RoadRunner 2.4.2 deployment under load with ApacheBench (300 concurrent connections), we encounter connection timeout errors and/or "500 internal server" errors whenever a request lasts more than 0.7s. The RoadRunner pod logs multiple messages:
RoadRunner is running a pretty vanilla Laravel app. I expect 100% of the longer requests to be processed, but some of the requests longer than 0.7s fail (roughly 40% of responses are 500s or timeouts). What makes it worse is that only part of the requests fail, and the "no free workers in the pool" error is sporadic, so it is difficult to detect what it really depends on. RR version used: 2.4.2 from Docker Hub, running in Kubernetes v1.21.
Errortrace, Backtrace or Panictrace: please see the attached log, so as not to clutter the PR. If this is a misconfiguration, please point it out.
Replies: 9 comments 24 replies
-
Hey @stl-victor-sudakov. What is the …
-
With num_workers=0 it starts 4 workers. I've tried hard-setting it to 4 and 8, with no real improvement.
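For context, here is a minimal sketch of the pool settings being discussed, as they would appear in a RoadRunner 2.x `.rr.yaml`. The values are illustrative, not a tuned recommendation; the comments reflect the behavior described in this thread.

```yaml
http:
  address: 0.0.0.0:8080
  pool:
    # 0 = auto: allocate one worker per logical CPU core
    # (hence "with num_workers=0 it starts 4" on a 4-core pod)
    num_workers: 4
    # How long a request may wait for a free worker before RR fails it
    # with "no free workers in the pool" (default: 60s)
    allocate_timeout: 60s
    # Grace period for stopping a worker on reload/shutdown
    destroy_timeout: 60s
```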
-
Actually, I'd be happy to change any parameter and report the result if it helps resolve the issue. For example, decreasing the number of concurrent connections (e.g. …)
-
@rustatian This has been very instructive, thank you. Is there a formula to calculate the number of workers and allocate_timeout, knowing the number of requests per second and the average request time?
-
@stl-victor-sudakov Unfortunately no :( . That heavily depends on the particular use case, but there is some general advice.
-
@stl-victor-sudakov Feel free to close the issue if everything is clear :)
-
Here is the equation of success: …
E.g., with the default allocate_timeout=60s and a request duration of 2.9s …
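The equation itself did not survive the copy above, so here is a back-of-envelope reconstruction (my own sketch, not necessarily the exact formula from the reply): a pool of N synchronous workers sustains roughly N / avg_request_time requests per second, and a queued request fails once its wait exceeds `allocate_timeout`.

```python
def sustained_rps(num_workers: int, avg_request_s: float) -> float:
    """Steady-state throughput of a pool of synchronous workers."""
    return num_workers / avg_request_s


def worst_queue_wait(concurrent: int, num_workers: int, avg_request_s: float) -> float:
    """Approximate wait of the last queued request when `concurrent`
    requests arrive at once: they are drained in batches of num_workers."""
    return (concurrent / num_workers) * avg_request_s


# Numbers from this thread: allocate_timeout=60s (the default), 2.9s
# requests, 4 workers, 300 concurrent ApacheBench connections.
ALLOCATE_TIMEOUT_S = 60.0

print(round(sustained_rps(4, 2.9), 2))          # ~1.38 RPS sustained
print(round(worst_queue_wait(300, 4, 2.9), 1))  # ~217.5 s in the queue
print(worst_queue_wait(300, 4, 2.9) > ALLOCATE_TIMEOUT_S)  # True
```

Under this model the last queued requests wait far longer than allocate_timeout, which would also explain the sporadic mix of successes and failures seen above: only the requests that queue past 60s get rejected.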
-
~980 RPS with a 10ms usleep; ~248 RPS with a 40ms usleep.
First of all, you should take into account your application's main load profile. If you calculate (just for example) the 1000th Fibonacci number, your app will be CPU-intensive, and you definitely have to monitor CPU load to prevent overloading. On the other hand, if your app makes a lot of HTTP requests to other services or runs a lot of DB queries, these are IO-bound operations; you have to check your network (DB, other services) limits to prevent IO overload.
So, your app is mostly CPU-bound, which means you should not allocate a lot of workers (generally no more than 2x the logical CPU cores) to handle the load effectively.
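Those usleep figures are consistent with the same simple model, RPS ≈ num_workers / per_request_time. The pool size of ~10 workers below is my inference from the measurements, not a value stated in the thread; the 2x-logical-cores cap comes straight from the reply.

```python
import os


def model_rps(num_workers: int, per_request_s: float) -> float:
    """Throughput model for synchronous workers: each worker serves
    1 / per_request_s requests per second."""
    return num_workers / per_request_s


# A ~10-worker pool reproduces both measurements (inferred, not stated):
print(round(model_rps(10, 0.010)))  # 1000 -> close to the observed ~980 RPS
print(round(model_rps(10, 0.040)))  # 250  -> close to the observed ~248 RPS

# Rule of thumb from the reply: for CPU-bound apps, cap the pool at
# roughly 2x the logical CPU cores.
worker_cap = 2 * (os.cpu_count() or 1)
```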
@stl-victor-sudakov Unfortunately no :( . That heavily depends on the particular use cases. But the general advice is the following: … `allocate_timeout` value, while the slow path may have a relatively bigger number of workers to handle a batch of the requests in parallel. Use `etags` for the static content; you may also use weak `etags` to avoid heavy calculations. Instead of `ab`, I prefer using the `wrk` tool.