[Q] `no free workers in the pool` error on long requests #799

stl-victor-sudakov · 2021-09-15T04:02:05Z

stl-victor-sudakov
Sep 15, 2021

While stress-testing a roadrunner 2.4.2 deployment under load with ApacheBench (300 concurrent connections), we encounter connection timeout errors and/or "500 internal server errors" if the query lasts more than 0.7s. The roadrunner pod logs multiple messages:

2021-09-15T03:17:56.900Z	WARN	server      	server/plugin.go:219	no free workers in the pool	{"error": "static_pool_exec_with_context: NoFreeWorkers:\n\tworker_watcher_get_free_worker:\n\tcontext deadline exceeded"}

The roadrunner is running a pretty vanilla Laravel app.

I expect to see 100% of the longer requests processed but part of the requests longer than 0.7s fail (maybe like 40% of responses are 500x or timeouts). It is even worse that only part of the requests fail and the "no free workers in the pool" error is sporadic and difficult to detect what it really depends on.

The version of RR used: 2.4.2 from Docker Hub, running in Kubernetes v1.21

My .rr.yaml configuration is:

server:
  command: "php ./vendor/bin/rr-worker start"
  relay: pipes
http:
  address: 0.0.0.0:8060
  middleware: ["headers", "gzip"]
  pool:
    num_workers: 0
    max_jobs: 700
    supervisor:
      exec_ttl: 60s
  headers:
    response:
      X-Powered-By: "RoadRunner"
  static:
    dir: "public"
    forbid: [".php"]
status:
  address: 0.0.0.0:2114

Errortrace, Backtrace or Panictrace: Please see the attached log, not to clutter the PR. If this is a misconfiguration, please point it out.
log.txt

Answered by rustatian

Sep 15, 2021

@stl-victor-sudakov Unfortunately no :( . That heavily depends on the particular use-cases. But the general advice is the following:

Track the CPU load. If you see, that the CPU consumption is less than 85-90% (constantly, not spikes), increase the number of workers.
If that is possible, split fast and slow request paths to different RR instances (pods). Fast may have a smaller number of workers with a smaller allocate_timeout value. While the slow path may have a relatively bigger number of workers to handle a batch of the requests in parralel.
Use etags for the static. You may also use weak etags calculations.
Personally, I have a bad experience with the ab, I prefer using the wrk

View full answer

rustatian · 2021-09-15T04:51:29Z

rustatian
Sep 15, 2021
Maintainer

Hey @stl-victor-sudakov . What is the num_workers actual value?

0 replies

stl-victor-sudakov · 2021-09-15T04:54:36Z

stl-victor-sudakov
Sep 15, 2021
Author

Hey @stl-victor-sudakov . What is the num_workers actual value?

WIth num_workers=0 it starts 4. I've tried hard setting 4 and 8, with no real improvement.

0 replies

stl-victor-sudakov · 2021-09-15T04:59:50Z

stl-victor-sudakov
Sep 15, 2021
Author

Actually I'd be happy to change any parameter and report the result, if it helps resolve the issue. For example, decreasing the number of concurrent connections (e.g. ab -c 200) removes the errors.

0 replies

rustatian · 2021-09-15T05:09:02Z

rustatian
Sep 15, 2021
Maintainer

num_workers = 0 means, that RR should detect the actual number of cores and run that number of workers. You may initialize 50 workers for example per-pod if the CPU load is not a 100%. It's not necessary to start with 1-4-8 workers.
Due to the system architecture (especially http/tcp layer) your requests are coming to the RR and accumulating in the queue. How much request will be in the queue defined by the allocate_timeout, the default value is 60s. Now, imagine, you have 300 req/s, 0.7 seconds per each request, 210 seconds in total to handle 1 second of bombarding your application. The request has a 60s timeout. With such throughput capacity, you'll get all requests after 60s failed with a timeout.
To prevent this, do the following (choose which is applicable to you):

Increase the number of workers. Try to use 50-100...etc. It's generally safe. If you have a lot of IO-bound operations, you may allocate as many workers as you have memory.
Increase allocate_timeout to 5-10 minutes. This is a big value, but, you can try.
Scale your application horizontally or vertically. You use pods and Kubernetes. Use more pods and balance your load. RR allows you to start more than 1 instance on the same port (TCP_REUSEPORT).

0 replies

stl-victor-sudakov · 2021-09-15T05:24:00Z

stl-victor-sudakov
Sep 15, 2021
Author

@rustatian This has been very instructive, thank you. Is there a formula to calculate the number of workers and allocate_timeout knowing the number of requests per second and the average request time?

0 replies

rustatian · 2021-09-15T05:37:06Z

rustatian
Sep 15, 2021
Maintainer

@stl-victor-sudakov Unfortunately no :( . That heavily depends on the particular use-cases. But the general advice is the following:

Track the CPU load. If you see, that the CPU consumption is less than 85-90% (constantly, not spikes), increase the number of workers.
If that is possible, split fast and slow request paths to different RR instances (pods). Fast may have a smaller number of workers with a smaller allocate_timeout value. While the slow path may have a relatively bigger number of workers to handle a batch of the requests in parralel.
Use etags for the static. You may also use weak etags calculations.
Personally, I have a bad experience with the ab, I prefer using the wrk

0 replies

rustatian · 2021-09-15T06:05:51Z

rustatian
Sep 15, 2021
Maintainer

@stl-victor-sudakov Feel free to close the issue if everything is clear :)

1 reply

rustatian Sep 15, 2021
Maintainer

Moved to the discussions.

stl-victor-sudakov · 2021-09-15T08:41:52Z

stl-victor-sudakov
Sep 15, 2021
Author

Here is the equation of success:

Conns/sec / num_workers * request_duration < allocate_timeout

E.g. with the default allocate_timeout=60s and request duration=2.9s
300/15*2.9=58
these are the numbers which will still work without errors.

#798

18 replies

rustatian Dec 8, 2021
Maintainer

For my bench env (nvme 980 pro, ryzen 5950x, 64gb ram, ArchLinux 5.15.6, PHP 8.0.13) and simple worker (I test only RR, not the applications, so, I have a simple worker for that purpose) I got an average 124k req/s for the 10 workers. (only HTTP plugin involved).
It's approximately 12400 requests per second for the 1 worker for the request round-trip on my env on average.

victor-sudakov Dec 9, 2021

I got an average 124k req/s for the 10 workers.

What was the request duration? According to my calculations, with 10 workers and 10ms request duration your theoretical maximum should be about 1k RPS and for 40ms request duration and 10 workers, 250 RPS. Or I'm doing the maths wrong.

rustatian Dec 9, 2021
Maintainer

It varies in range from 100 microseconds to 1 millisecond in rare cases (I mean, on my worker).

That is the naked request, w/o middleware. Addition latency is the worker latency. You may have such numbers for many reasons, first of all, worker latency is not a constant, it's varies, especially if you use supervisor in the pool and additional middleware (then RR latency might be up to 1ms). You may read this article to see, that the RPS calculation in real life is far away from the theoretical: https://www.davepacheco.net/blog/2019/performance-puzzler-the-slow-server/. And if in the article we knew that the server has 1s latency (constant), in our case, you have 10 let's say servers (workers) with ranged latency. That's because you see the different results.

victor-sudakov Dec 9, 2021

Can I please ask you to test with 10ms and 40ms requests? What would be your RPS numbers?

We are currently investigating our deployment trying to find causes of the 250 RPS plateau. It may be the Laravel app, or may be RR running in an Alpine container, or something else.

You may read this article to see, that the RPS calculation in real life is far away from the theoretical:

I will, but our test deployment is as far from real life as it could be. The only load is the artificial consistent load created by wrk and Yandex Tank.

victor-sudakov Dec 13, 2021

Could you please share your configuration as well?

It turned out we were running out of CPU power on our dev K8s cluster. Upgrading the worker nodes from large to xlarge to 2xlarge resulted in very predictable, linear scaling of RPS results in the wrk test.

rustatian · 2021-12-13T08:28:03Z

rustatian
Dec 13, 2021
Maintainer

Can I please ask you to test with 10ms and 40ms requests? What would be your RPS numbers?

Sure:

~980 RPS with 10ms usleep.

~248 RPS with 40ms usleep.

Upgrading the worker nodes from large to xlarge to 2xlarge resulted in very predictable, linear scaling of RPS results in the wrk test.

First of all, you should take into account your application's main load. If you calculate (just for example) 1000th fibo number your app will be CPU-intensive and for sure, you have to monitor CPU load to prevent overloading. On the other hand, if your app is making a lot of HTTP requests to other services, has a lot of DB queries, this will be IO-bound operations. You have to check your network (db, other services) limits to prevent IO overload.
All these limitations will affect your RPS count.

It turned out we were running out of CPU power on our dev K8s cluster. Upgrading the worker nodes from large to xlarge to 2xlarge resulted in very predictable, linear scaling of RPS results in the wrk test.

So, your app is mostly CPU bound, which means, that you should not allocate a lot of workers (generally - more than 2x Logical CPU cores) to effectively handle the load.

5 replies

victor-sudakov Dec 13, 2021

~980 RPS with 10ms usleep.
~248 RPS with 40ms usleep.

Great, the same results here. Theoretically should be 1000 and 250 RPS respectively with 10 workers. My test app is mostly for RR vs PHP-FPM vs Nginx Unit testing, it just does usleep() for X ms, so yes, it can be called CPU-bound. Thank you again for the recommendations and the 2x logical CPU cores hint. It would be interesting to actually decrease the number of workers and compare the results.

rustatian Dec 13, 2021
Maintainer

it just does usleep() for X ms, so yes, it can be called CPU-bound

usleep is just a syscall. It's not a CPU-bound operation.

RR vs PHP-FPM vs Nginx Unit testing

Just curious, who is winning in this competition? ))

Great, the same results here. Theoretically should be 1000 and 250 RPS respectively with 10 workers.

Also, 1-2-3 ms to process the request, so It would be a little bit lower than theoretical throughput as expected.

It would be interesting to actually decrease the number of workers and compare the results.

Measure, measure, measure )) As the second metric you may have a look at is a CPU load. For example, you may try to write a simple n-th fibo number method. Pick up some number, which will lead to 10ms (or 40ms) calculation time, and then use it in the benchmark (if you want to simulate a CPU load). The same for the IO and mixed load.

victor-sudakov Dec 13, 2021

Just curious, who is winning in this competition?

Currently, RR is winning with a great advantage. Surprisingly, Unit does not show any significant advantage over PHP-FPM (with the same number of children as RR's and Unit's number of workers, and pm = static). I hope I'll be able to share my results one day.

rustatian Dec 13, 2021
Maintainer

@victor-sudakov Interesting, are you saying, that PHP-FPM which is restarting processes on every request has the same RPS as the RR?

victor-sudakov Dec 13, 2021

@victor-sudakov Interesting, are you saying, that PHP-FPM which is restarting processes on every request has the same RPS as the RR?

No, I am not. I'm saying that Unit and PHP-FPM show similar performance, and RR is ahead of them.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RoadRunner

[Q] `no free workers in the pool` error on long requests #799

{{title}}

Replies: 9 comments 24 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[Q] no free workers in the pool error on long requests #799

Replies: 9 comments · 24 replies

rustatian Sep 15, 2021 Maintainer

stl-victor-sudakov Sep 15, 2021 Author

stl-victor-sudakov Sep 15, 2021 Author

rustatian Sep 15, 2021 Maintainer

stl-victor-sudakov Sep 15, 2021 Author

rustatian Sep 15, 2021 Maintainer

rustatian Sep 15, 2021 Maintainer

rustatian Sep 15, 2021 Maintainer

stl-victor-sudakov Sep 15, 2021 Author

rustatian Dec 8, 2021 Maintainer

rustatian Dec 9, 2021 Maintainer

rustatian Dec 13, 2021 Maintainer

rustatian Dec 13, 2021 Maintainer

rustatian Dec 13, 2021 Maintainer

[Q] `no free workers in the pool` error on long requests #799

Replies: 9 comments 24 replies

rustatian
Sep 15, 2021
Maintainer

stl-victor-sudakov
Sep 15, 2021
Author

stl-victor-sudakov
Sep 15, 2021
Author

rustatian
Sep 15, 2021
Maintainer

stl-victor-sudakov
Sep 15, 2021
Author

rustatian
Sep 15, 2021
Maintainer

rustatian
Sep 15, 2021
Maintainer

rustatian Sep 15, 2021
Maintainer

stl-victor-sudakov
Sep 15, 2021
Author

rustatian Dec 8, 2021
Maintainer

rustatian Dec 9, 2021
Maintainer

rustatian
Dec 13, 2021
Maintainer

rustatian Dec 13, 2021
Maintainer

rustatian Dec 13, 2021
Maintainer