
[BUG] All workers ready, exec 0 #737

Closed
iluuu1994 opened this issue Jun 25, 2021 · 20 comments · Fixed by #738

@iluuu1994

Hi! First of all, thanks for RoadRunner! ❤️ It's a great tool.

We have some stability issues on one of our big websites. This has occurred many times in the past and we can't identify a clear cause. RoadRunner suddenly stops responding to requests completely after several hours or days. rr workers reveals that all of the workers are ready with an EXECS count of 0. The issue never resolves on its own until we do an rr reset.

I'm not sure, but maybe some workers occasionally start in an incomplete state, unable to process requests, and rr never reloads them because they never reach any of the soft or hard limits. Over time these might accumulate until there are no healthy workers left to process any requests. This is just a theory; the workers are displayed as ready.

The version of RR used:

rr version 2.3.0 (build time: 2021-06-11T14:54:08+0000, go1.16.4)

My .rr.yaml configuration is:

rpc:
  listen:     tcp://127.0.0.1:6000

server:
  command: "/usr/bin/php7.4 psr-worker.php"

http:
  address: 0.0.0.0:8082
  
  pool:
    num_workers: 48
    
    supervisor:
      watch_tick: 60s
      max_worker_memory: 200
      ttl: 84600s
      exec_ttl: 30s

We've only just added error logging to our config, so unfortunately I can't provide any logs yet; I will update this report as soon as we have anything.
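
For reference, error/debug logging in RR 2.x is normally enabled through the logs section of .rr.yaml; a minimal sketch, assuming the standard logs plugin keys:

logs:
  # development mode produces verbose, human-readable output
  mode: development
  # capture debug-level events from all plugins
  level: debug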

Do you have any suspicions about what might be causing this? Is there anything other than logs that we could provide that might be helpful?

Thanks for your help!

@iluuu1994 iluuu1994 added B-bug Bug: bug, exception F-need-verification labels Jun 25, 2021
@rustatian rustatian added the Y-Release blocker Priority: Release blocker label Jun 25, 2021
@rustatian rustatian added this to the 2.3.1 milestone Jun 25, 2021
@rustatian
Member

Hey @iluuu1994. That's definitely strange. In the past, I ran a load-testing session for a few days. Let me verify this: I'll run another load-testing session for a few days, starting now, with your configuration and with php7.4 (I previously tested on 8.0).
I'll use my test worker for that: https://github.com/spiral/roadrunner/blob/master/tests/psr-worker.php.

Do you have any suspicions about what might be causing this?

Might be some deadlock with the pool and supervised pool.

Is there anything other than logs that we could provide that might be helpful?

Debug logs will be super helpful.

@rustatian
Member

@iluuu1994 Could you please try the latest beta version if that's possible?

@iluuu1994
Author

iluuu1994 commented Jun 25, 2021

Hey @rustatian, thanks for your quick response!

Debug logs will be super helpful.

We'll do that.

Could you please try the latest beta version if that's possible?

Depends. How stable is it? Unfortunately we can only reproduce it in the production environment. I guess it's load-dependent.

@rustatian
Member

Depends. How stable is it? Unfortunately we can only reproduce it in the production environment. I guess it's load-dependent.

It's pretty stable; we will release the same version, just without the beta postfix, next week. It has a few new configuration options like broadcast and reworked WebSockets with KV, so if you don't use these features, you may safely update.

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

Have you been using RR1 in the past? If yes, did you see the same behavior?

@rustatian
Member

@iluuu1994 Could you please share some info about your environment? OS version (uname -a on Linux) or the full build number on Windows, and the Docker version if running in Docker, along with the base image (alpine, ubuntu, etc.).

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

Can you also provide us with more info on your load:

  • Is your load CPU-intensive?
  • Why do you reload workers once a day (does your code leak)?
  • Does the deadlock appear some time after deployment, or randomly?
  • Did you observe any patterns when this happens?
  • Do your workers sometimes hit the processing timeout (I see 30s)?

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

When you do rr reset, how long does it take to respond?

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

Sounds like we found the core issue, and it's related to a specific edge case where execTTL harms the pool under load. We did not stress-test this scenario originally (we will going forward). It can cause worker degradation over time when the timeout is hit. We are not yet sure where exactly the issue is (we're still debugging it), but it has been reproduced.

Without execTTL (or without the supervisor at all) it should work fine if you need a quick fix.
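
Applied to the configuration from this report, the workaround would look roughly like this (a sketch only; exec_ttl is removed, everything else is unchanged):

http:
  address: 0.0.0.0:8082

  pool:
    num_workers: 48

    supervisor:
      watch_tick: 60s
      max_worker_memory: 200
      ttl: 84600s
      # exec_ttl: 30s  <- temporarily removed until 2.3.1 ships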

No need to update to the beta; we will fix this issue before the next release (2.3.1, next Tuesday). Thank you for bringing this issue to us; we will solve it at the highest priority.

@iluuu1994
Author

@wolfy-j @rustatian

Have you been using RR1 in the past? If yes, did you see the same behavior?

Yes, we've been using RoadRunner for over a year and experienced the same thing in RoadRunner 1. Unfortunately we also had a bad configuration that led to memory issues (no soft limit for the workers, no memory_limit for the CLI SAPI), so I can't say with certainty whether this is the same issue.

Is your load CPU-intensive?

No, CPU was virtually at 0%.

Could you please share some info about your environment? OS version (uname -a on Linux) or the full build number on Windows, and the Docker version if running in Docker, along with the base image (alpine, ubuntu, etc.).

Linux {{hostname}} 4.19.0-16-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux

We're running RoadRunner natively, so no Docker.

Why do you reload workers once a day (does your code leak)?

Yes, we use lots of libraries, and I'm sure the application also has various places where memory could leak. It's also possible that some endpoints exceed the exec TTL. Usually new workers start at around 100 MB and get restarted after several hundred to a thousand executions because they hit the memory limit. Certainly not great.
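
As a side note, leaking workers can also be recycled proactively with the pool's max_jobs option rather than waiting for them to hit the memory limit; a sketch with a hypothetical value of 500:

http:
  pool:
    num_workers: 48
    # hypothetical value: restart each worker after 500 executions,
    # before a leak can push it toward max_worker_memory
    max_jobs: 500

    supervisor:
      watch_tick: 60s
      max_worker_memory: 200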

Does the deadlock appear some time after deployment, or randomly?

AFAIK (I'm not directly involved with the project) it happens randomly; there's no specific time frame after deployment.

Did you observe any patterns when this happens?

Unfortunately, no.

Do your workers sometimes hit the processing timeout (I see 30s)?

I can't say with certainty; admittedly we've not done our part here. Debug logging is active now, and I'll report back as soon as we have some data. The website has not been down since enabling it, but I see lots of these logs:

{"level":"error","ts":1624622826.0332258,"logger":"http","caller":"logger/std_log_adapter.go:15","msg":"server internal error","message":"http_plugin: 2021/06/25 12:07:06 http: superfluous response.WriteHeader call from github.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP (handler.go:162)\n","stacktrace":"github.com/spiral/roadrunner/v2/plugins/logger.(*StdLogAdapter).Write\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/logger/std_log_adapter.go:15\nlog.(*Logger).Output\n\tlog/log.go:184\nlog.(*Logger).Printf\n\tlog/log.go:191\nnet/http.(*Server).logf\n\tnet/http/server.go:3137\nnet/http.(*response).WriteHeader\n\tnet/http/server.go:1143\nnet/http.Error\n\tnet/http/server.go:2081\ngithub.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_handler/handler.go:162\ngithub.com/spiral/roadrunner/v2/plugins/http.(*Plugin).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/http/plugin.go:283\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2887\nnet/http.(*conn).serve\n\tnet/http/server.go:1952"}

And a few of these:

{"level":"error","ts":1624622877.2459464,"logger":"server","caller":"server/plugin.go:254","msg":"worker_watcher_wait: signal: killed; process_wait: signal: killed","stacktrace":"github.com/spiral/roadrunner/v2/plugins/server.(*Plugin).collectEvents\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/server/plugin.go:254\ngithub.com/spiral/roadrunner/v2/pkg/events.(*HandlerImpl).Push\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/events/general.go:37\ngithub.com/spiral/roadrunner/v2/pkg/worker_watcher.(*workerWatcher).wait\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_watcher/worker_watcher.go:235\ngithub.com/spiral/roadrunner/v2/pkg/worker_watcher.(*workerWatcher).addToWatch.func1\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_watcher/worker_watcher.go:260"}

Note that these logs are still from the stable release. I will switch to the beta tonight when there's less traffic.

When you do rr reset, how long does it take to respond?

Just a few seconds.

@rustatian
Member

Got it, thanks. I've reproduced this issue and I'll try to fix it ASAP. It's priority number one at the moment.

@iluuu1994
Author

Without execTTL (or without the supervisor at all) it should work fine if you need a quick fix.

No need to update to the beta; we will fix this issue before the next release (2.3.1, next Tuesday). Thank you for bringing this issue to us; we will solve it at the highest priority.

Oh sorry, I just saw that message now. We'll wait for the fix then and report whether it resolves the issue. Thanks for the very quick reaction! We really appreciate it!

@wolfy-j
Contributor

wolfy-j commented Jun 25, 2021

{"level":"error","ts":1624622826.0332258,"logger":"http","caller":"logger/std_log_adapter.go:15","msg":"server internal error","message":"http_plugin: 2021/06/25 12:07:06 http: superfluous response.WriteHeader call from github.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP (handler.go:162)\n","stacktrace":"github.com/spiral/roadrunner/v2/plugins/logger.(*StdLogAdapter).Write\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/logger/std_log_adapter.go:15\nlog.(*Logger).Output\n\tlog/log.go:184\nlog.(*Logger).Printf\n\tlog/log.go:191\nnet/http.(*Server).logf\n\tnet/http/server.go:3137\nnet/http.(*response).WriteHeader\n\tnet/http/server.go:1143\nnet/http.Error\n\tnet/http/server.go:2081\ngithub.com/spiral/roadrunner/v2/pkg/worker_handler.(*Handler).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/pkg/worker_handler/handler.go:162\ngithub.com/spiral/roadrunner/v2/plugins/http.(*Plugin).ServeHTTP\n\tgithub.com/spiral/roadrunner/v2@v2.3.0/plugins/http/plugin.go:283\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2887\nnet/http.(*conn).serve\n\tnet/http/server.go:1952"}

These logs are related to the client closing the connection, and they shouldn't be critical. The second one is disturbing. @rustatian, does it make sense to mark this error as a warning in the future?

@rustatian
Member

does it make sense to mark this error as a warning in the future?

Writing twice into a closed ResponseWriter is an error. This should be fixed.

@iluuu1994 I found the issue; the fix will land on Tuesday (v2.3.1).

@iluuu1994
Author

@rustatian @wolfy-j Great! Thanks again to both of you. Hope you have a great weekend!

@rustatian
Member

@rustatian @wolfy-j Great! Thanks again to both of you. Hope you have a great weekend!

Have a wonderful weekend too, and welcome to the RR/Spiral community :)

@rustatian
Member

Hey @iluuu1994, you may safely update to the latest RC1 -> https://github.com/spiral/roadrunner-binary/releases/tag/v2.3.1-rc.1. Waiting for your feedback on this 😃

@iluuu1994
Author

@rustatian Today was a holiday in our city so we were off. We're probably going to upgrade tomorrow evening. I'll let you know how it goes 🙂

@iluuu1994
Author

iluuu1994 commented Jun 30, 2021

We deployed it yesterday and it works great so far. If we encounter any issues, we'll let you know. Thanks again!

@rustatian
Member

Cool, thanks 👍🏻
