
[BUG] All workers ready, exec 0 #737

Closed
iluuu1994 opened this issue Jun 25, 2021 · 20 comments · Fixed by #738

@iluuu1994

Hi! First of all, thanks for RoadRunner! ❤️ It's a great tool.

We have some stability issues on one of our big websites. This has occurred many times in the past and we can't identify a clear cause. RoadRunner suddenly stops responding to requests completely after several hours or days. rr workers reveals that all of the workers are ready with an EXECS count of 0. The issue never resolves on its own until we do an rr reset.

I'm not sure, but maybe some workers occasionally start in an incomplete state, unable to process requests, and rr never reloads them because they never reach any of the soft or hard limits. Over time these might accumulate until there are no healthy workers left to process any requests. This is just a theory; the workers are displayed as ready.

The version of RR used:

rr version 2.3.0 (build time: 2021-06-11T14:54:08+0000, go1.16.4)

My .rr.yaml configuration is:

rpc:
  listen:     tcp://127.0.0.1:6000

server:
  command: "/usr/bin/php7.4 psr-worker.php"

http:
  address: 0.0.0.0:8082
  
  pool:
    num_workers: 48
    
    supervisor:
      watch_tick: 60s
      max_worker_memory: 200
      ttl: 84600s
      exec_ttl: 30s

We've only just added error logging to our config, so unfortunately I can't provide any logs yet; I will update this report as soon as we have anything.
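
For reference, error/debug logging in RR 2.x is normally enabled through the logs section of .rr.yaml; a minimal sketch, assuming the standard logs plugin keys:

logs:
  # development mode produces verbose, human-readable output
  mode: development
  # capture debug-level events from all plugins
  level: debug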

Do you have any suspicions about what might be causing this? Is there anything other than logs that we could provide that might be helpful?

Thanks for your help!

@iluuu1994 iluuu1994 added B-bug Bug: bug, exception F-need-verification labels Jun 25, 2021
@rustatian rustatian added the Y-Release blocker Priority: Release blocker label Jun 25, 2021
@rustatian rustatian added this to the 2.3.1 milestone Jun 25, 2021
@rustatian
Member

Hey @iluuu1994. That's definitely strange. In the past, I ran a load-testing session for a few days. Let me verify this: I'll run another load-testing session for a few days, starting now, with your configuration and with php7.4 (I previously tested on 8.0).
I'll use my test worker for that: https://github.com/spiral/roadrunner/blob/master/tests/psr-worker.php.

Do you have any suspicions about what might be causing this?

Might be some deadlock with the pool and supervised pool.

Is there anything other than logs that we could provide that might be helpful?

Debug logs will be super helpful.

@rustatian
Member

@iluuu1994 Could you please try the latest beta version if that's possible?

@iluuu1994
Author

iluuu1994 commented Jun 25, 2021

Hey @rustatian, thanks for your quick response!

Debug logs will be super helpful.

We'll do that.

Could you please try the latest beta version if that's possible?

Depends. How stable is it? Unfortunately we can only reproduce it in the production environment. I guess it's load-dependent.

@rustatian
Member

Depends. How stable is it? Unfortunately we can only reproduce it in the production environment. I guess it's load-dependent.

It's pretty stable; we will release the same version, just without the beta postfix, next week. It has a few new configuration options like broadcast and reworked WebSockets with KV, so if you don't use these features, you may safely update.

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

Have you been using RR1 in the past? If yes, did you see the same behavior?

@rustatian
Member

@iluuu1994 Could you please share some info about your environment? OS version (uname -a on Linux) or the full build number on Windows, and the Docker version if running in Docker, along with the base image (alpine, ubuntu, etc.).

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

Can you also provide us with more info on your load:

  • Is your load CPU-intensive?
  • Why do you reload workers once a day (does your code leak)?
  • Does the deadlock appear some time after deployment, or randomly?
  • Did you observe any patterns when this happens?
  • Do your workers sometimes hit the processing timeout (I see 30s)?

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

When you do rr reset, how long does it take to respond?

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

Sounds like we found the core issue, and it's related to a specific edge case where execTTL harms the pool under load. We did not stress-test this scenario originally (we will going forward). It can cause worker degradation over time when the timeout is hit. We are not yet sure where exactly the issue is (we're still debugging it), but it has been reproduced.

Without execTTL (or without the supervisor at all) it should work fine if you need a quick fix.
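
Applied to the configuration from this report, the workaround would look roughly like this (a sketch only; exec_ttl is removed, everything else is unchanged):

http:
  address: 0.0.0.0:8082

  pool:
    num_workers: 48

    supervisor:
      watch_tick: 60s
      max_worker_memory: 200
      ttl: 84600s
      # exec_ttl: 30s  <- temporarily removed until 2.3.1 ships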

No need to update to the beta; we will fix this issue before the next release (2.3.1, next Tuesday). Thank you for bringing this issue to us; we will solve it at the highest priority.

@iluuu1994
Author

@wolfy-j @rustatian

Have you been using RR1 in the past? If yes, did you see the same behavior?

Yes, we've been using RoadRunner for over a year and experienced the same thing in RoadRunner 1. Unfortunately we also had a bad configuration that led to memory issues (no soft limit for the workers, no memory_limit for the CLI SAPI), so I can't say with certainty whether this is the same issue.

Is your load CPU-intensive?

No, CPU was virtually at 0%.

Could you please share some info about your environment? OS version (uname -a on Linux) or the full build number on Windows, and the Docker version if running in Docker, along with the base image (alpine, ubuntu, etc.).

Linux {{hostname}} 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

We're running RoadRunner natively, so no Docker.

Why do you reload workers once a day (does your code leak)?

Yes, we use lots of libraries, and I'm sure the application also has various places where memory could leak. It's also possible that some endpoints exceed the exec TTL. Usually new workers start at around 100 MB and get restarted after several hundred to a thousand executions because they hit the memory limit. Certainly not great.
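
As a side note, leaking workers can also be recycled proactively with the pool's max_jobs option rather than waiting for them to hit the memory limit; a sketch with a hypothetical value of 500:

http:
  pool:
    num_workers: 48
    # hypothetical value: restart each worker after 500 executions,
    # before a leak can push it toward max_worker_memory
    max_jobs: 500

    supervisor:
      watch_tick: 60s
      max_worker_memory: 200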

Does the deadlock appear some time after deployment, or randomly?

AFAIK (I'm not directly involved with the project) it happens randomly; there's no specific time frame after deployment.

Did you observe any patterns when this happens?

Unfortunately, no.

Do your workers sometimes hit the processing timeout (I see 30s)?

I can't say with certainty; admittedly we've not done our part here. Debug logging is active now, and I'll report back as soon as we have some data. The website has not been down since enabling it, but I see lots of these logs:

{"level":"error","ts":1624622826.0332258,"logger":"http","caller":"logger/std_log_adapter.go:15","msg":"server internal error","message":"http_plugin: 2021/06/25 12:07:06 http: superfluous response.WriteHeader call from github.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP (handler.go:162)\n","stacktrace":"github.com/spiral/roadrunner/v2/plugins/logger.(*StdLogAdapter).Write\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/logger/std_log_adapter.go:15\nlog.(*Logger).Output\n\tlog/log.go:184\nlog.(*Logger).Printf\n\tlog/log.go:191\nnet/http.(*Server).logf\n\tnet/http/server.go:3137\nnet/http.(*response).WriteHeader\n\tnet/http/server.go:1143\nnet/http.Error\n\tnet/http/server.go:2081\ngithub.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_handler/handler.go:162\ngithub.com/spiral/roadrunner/v2/plugins/http.(*Plugin).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/http/plugin.go:283\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2887\nnet/http.(*conn).serve\n\tnet/http/server.go:1952"}

And a few of these:

{"level":"error","ts":1624622877.2459464,"logger":"server","caller":"server/plugin.go:254","msg":"worker_watcher_wait: signal: killed; process_wait: signal: killed","stacktrace":"github.com/spiral/roadrunner/v2/plugins/server.(*Plugin).collectEvents\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/server/plugin.go:254\ngithub.com/spiral/roadrunner/v2/pkg/events.(*HandlerImpl).Push\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/events/general.go:37\ngithub.com/spiral/roadrunner/v2/pkg/worker_watcher.(*workerWatcher).wait\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_watcher/worker_watcher.go:235\ngithub.com/spiral/roadrunner/v2/pkg/worker_watcher.(*workerWatcher).addToWatch.func1\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_watcher/worker_watcher.go:260"}

Note that these logs are still from the stable release. I will switch to the beta tonight when there's less traffic.

When you do rr reset, how long does it take to respond?

Just a few seconds.

@rustatian
Member

Got it, thanks. I've reproduced this issue and I'll try to fix it ASAP. It's priority number one at the moment.

@iluuu1994
Author

Without execTTL (or without the supervisor at all) it should work fine if you need a quick fix.

No need to update to the beta; we will fix this issue before the next release (2.3.1, next Tuesday). Thank you for bringing this issue to us; we will solve it at the highest priority.

Oh sorry, I just saw that message now. We'll wait for the fix then and report whether it resolves the issue. Thanks for the very quick reaction! We really appreciate it!

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

{"level":"error","ts":1624622826.0332258,"logger":"http","caller":"logger/std_log_adapter.go:15","msg":"server internal error","message":"http_plugin: 2021/06/25 12:07:06 http: superfluous response.WriteHeader call from github.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP (handler.go:162)\n","stacktrace":"github.com/spiral/roadrunner/v2/plugins/logger.(*StdLogAdapter).Write\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/logger/std_log_adapter.go:15\nlog.(*Logger).Output\n\tlog/log.go:184\nlog.(*Logger).Printf\n\tlog/log.go:191\nnet/http.(*Server).logf\n\tnet/http/server.go:3137\nnet/http.(*response).WriteHeader\n\tnet/http/server.go:1143\nnet/http.Error\n\tnet/http/server.go:2081\ngithub.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_handler/handler.go:162\ngithub.com/spiral/roadrunner/v2/plugins/http.(*Plugin).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/http/plugin.go:283\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2887\nnet/http.(*conn).serve\n\tnet/http/server.go:1952"}

These logs are related to the client closing the connection, and they shouldn't be critical. The second one is disturbing. @rustatian, does it make sense to mark this error as a warning in the future?

@rustatian
Member

does it make sense to mark this error as a warning in the future?

Writing twice into a closed ResponseWriter is an error. This should be fixed.

@iluuu1994 I found the issue; the fix will land on Tuesday (v2.3.1).

@iluuu1994
Author

@rustatian @wolfy-j Great! Thanks again to both of you. Hope you have a great weekend!

@rustatian
Member

@rustatian @wolfy-j Great! Thanks again to both of you. Hope you have a great weekend!

Have a wonderful weekend too, and welcome to the RR/Spiral community :)

@rustatian
Member

Hey @iluuu1994, you may safely update to the latest RC1 -> https://github.com/spiral/roadrunner-binary/releases/tag/v2.3.1-rc.1. Waiting for your feedback on this 😃

@iluuu1994
Author

@rustatian Today was a holiday in our city so we were off. We're probably going to upgrade tomorrow evening. I'll let you know how it goes 🙂

@iluuu1994
Author

iluuu1994 commented Jun 30, 2021

We deployed it yesterday and it works great so far. If we encounter any issues, we'll let you know. Thanks again!

@rustatian
Member

Cool, thanks 👍🏻
