[🐛 BUG]: HTTP Queue gets too large #1841
Comments
Hey @L3tum 👋

Sure, although it's fairly standard I think (hope):

```yaml
version: '3'

server:
  command: "php public/index.php"
  env:
    APP_RUNTIME: Baldinof\RoadRunnerBundle\Runtime\Runtime

rpc:
  listen: tcp://127.0.0.1:6001

metrics:
  address: "0.0.0.0:9180"

http:
  address: 0.0.0.0:8080
  # Maximal incoming request size in megabytes. Zero means no limit.
  max_request_size: 4
  middleware: [ "http_metrics" ]
  uploads:
    forbid: [ ".php", ".exe", ".bat" ]
  pool:
    num_workers: 4
    # Timeout for worker allocation. Zero means no limit.
    allocate_timeout: 60s
    # Timeout for worker destroying before process killing. Zero means no limit.
    destroy_timeout: 60s
    supervisor:
      # watch_tick defines how often to check the state of the workers (seconds)
      watch_tick: 10s
      # ttl defines maximum time worker is allowed to live (seconds)
      ttl: 0s
      # idle_ttl defines maximum duration worker can spend in idle mode after first use. Disabled when 0 (seconds)
      idle_ttl: 0s
      # exec_ttl defines maximum lifetime per job (seconds)
      exec_ttl: 2s
      # max_worker_memory limits memory usage per worker (MB)
      max_worker_memory: 256
  # HTTP/2 settings.
  http2:
    # HTTP/2 over non-encrypted TCP connection using H2C.
    #
    # Default: false
    h2c: true
    # Maximal concurrent streams count.
    #
    # Default: 128
    max_concurrent_streams: 128

# Health check endpoint (docs: https://roadrunner.dev/docs/beep-beep-health). A 200 response means at
# least one worker is ready to serve requests; 500 means no workers are ready.
# Drop this section to disable the feature.
status:
  # Host and port to listen on (eg.: `127.0.0.1:2114`). Use the following URL: http://127.0.0.1:2114/health?plugin=http
  # Multiple plugins must be separated using "&" - http://127.0.0.1:2114/health?plugin=http&plugin=rpc where "http" and
  # "rpc" are active (connected) plugins.
  #
  # This option is required.
  address: 0.0.0.0:2114
  # Response status code if a requested plugin is not ready to handle requests.
  # Valid for both /health and /ready endpoints.
  #
  # Default: 503
  unavailable_status_code: 503

logs:
  mode: production
  encoding: json
  channels:
    http:
      level: warn # debug logs all http requests; set to info to disable
    server:
      level: info # Everything written to worker stderr is logged
      mode: raw
    metrics:
      level: error
```
Yeah, nothing suspicious. ATM we don't have a configuration option to limit this queue, but that's a good idea; I'll add such an option. I think I'll add it in the
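Roughly, such an option could look like this — the key name `max_queue_size` and its placement under `pool` are illustrative only, not a confirmed setting:

```yaml
http:
  pool:
    num_workers: 4
    # Illustrative only (name/placement not final): cap the number of
    # requests allowed to wait for a free worker; anything beyond the cap
    # is rejected immediately with a 503 instead of piling up in the queue.
    max_queue_size: 128
```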
Definitely, that would be really nice. Thank you!
Hey @L3tum 👋
Thank you! I'll make an update tomorrow and check it out :)
Hey @rustatian! I've been testing this change out a bit and noticed some fairly bad behaviour. It seems that when RR is getting hammered and starts dropping requests because the queue is already full, RR uses up all its CPU time answering 503s rather than responding to queued requests. I could see this in multiple areas:

I realize that this is somewhat an inherent mechanism, but I'm curious whether you think you could fix it in RR itself instead of bolting another service, like a circuit breaker, on top. I don't know, for example, what would happen in Go if you simply stopped accepting new requests (or whether that's even possible). The same issue could be triggered if the side communicating with PHP and managing the workers raises other errors in Go, for example someone sending tons of requests with a too-large body. I ended my search here, where the error would be generated. I think an exponential backoff/circuit breaker would need to sit there somewhere, but I don't know enough about Go to say how to go about it. Another fix would be to assign a lower priority to the RR process, which would give the PHP workers a higher priority, but that would mean (afaik) that RR would still be starved of important resources for answering
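To make the idea concrete, here's roughly the kind of shed/cool-down logic I mean — plain net/http, every name here is made up for illustration, none of it is RR's actual code:

```go
// A minimal sketch of the shed/cool-down idea (plain net/http).
// All names are made up for illustration; this is not RoadRunner code.
package shed

import (
	"net/http"
	"sync/atomic"
	"time"
)

// Shedder rejects requests cheaply once too many are in flight, then
// keeps rejecting for a cool-down window so CPU goes to queued work
// instead of to generating 503s one by one.
type Shedder struct {
	Next     http.Handler
	Limit    int64         // max requests in flight before shedding
	CoolDown time.Duration // how long to stay "open" after tripping

	inFlight  atomic.Int64
	openUntil atomic.Int64 // unix nanos until which the breaker is open
}

func (s *Shedder) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Breaker open: answer immediately with the cheapest possible 503.
	if time.Now().UnixNano() < s.openUntil.Load() {
		w.Header().Set("Retry-After", "1")
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	if s.inFlight.Add(1) > s.Limit {
		s.inFlight.Add(-1)
		// Trip the breaker so subsequent requests short-circuit above.
		s.openUntil.Store(time.Now().Add(s.CoolDown).UnixNano())
		w.Header().Set("Retry-After", "1")
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	defer s.inFlight.Add(-1)
	s.Next.ServeHTTP(w, r)
}
```

Wrapping a handler would then just be `&Shedder{Next: mux, Limit: 64, CoolDown: 250 * time.Millisecond}`; the real question is where an equivalent hook would live inside RR's request path.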
Hey @L3tum 👋
You may create a feature request ticket for a circuit breaker middleware and link this one in the description.
Rust-based SAPI? Tell me more! :D I wanted to learn some Go anyway, so I'll write a feature request down and then start working on it. Full disclosure: that will be my first Go stint, but I hope it won't be too bad.
I thought everyone knew that I'm working on that 😄
Sure, I'd be happy to help you and review your PRs. You may take a look at this one: link, grab the ideas, and reimplement them for our middleware. Here you may find docs about writing a middleware. You may also ping me on our Discord server 😃
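For orientation, an http middleware plugin is roughly shaped like this — treat it as a sketch and check the docs linked above for the exact current interface:

```go
// Rough shape of an http middleware plugin; this is a sketch, not copied
// from RoadRunner — verify the interface against the middleware docs.
package mymiddleware

import "net/http"

const pluginName = "my_middleware" // referenced in http.middleware in .rr.yaml

type Plugin struct{}

// Name is how the http plugin finds this middleware in the config.
func (p *Plugin) Name() string { return pluginName }

// Middleware wraps the next handler in the chain; it's standard net/http.
func (p *Plugin) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// pre-processing (e.g. decide whether to shed this request)
		next.ServeHTTP(w, r)
		// post-processing
	})
}
```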
BTW, when you're ready to start working on that, just ping me on our Discord server and I'll create a repository and give you permissions to work on it w/o forking.
Haha, I've personally left Twitter behind, so I sometimes miss these things. Is it gonna replace RoadRunner or be a middle layer of sorts? Thank you! I'll have a few hours on Sunday where I'll need to watch some graphs, so I'll probably start then. No worries if I don't have a repo by that time though, I can just use a local one. I'll give you a ping on Discord tomorrow, or on Sunday at the latest :)
It'll integrate with RoadRunner 😃
Hey @L3tum 👋
Done. Updated the repo with a basic skeleton, added comments and a sample config. To use it, just start the test in the
No duplicates 🥲.
What happened?
Not necessarily a bug, maybe a feature request, but eh.
We've observed that the HTTP request queue, which is usually a good thing for smoothing over small delays, can grow far too large.
In particular, this seems to happen once no workers are available anymore. What I would expect (from reading the docs) is that a few requests may be queued but most would be denied outright. Instead we observed a request queue of ~5000 requests, which is just too much.
What I would love is for this to be configurable, but it'd be a start to get some kind of documentation on what to expect from this. The only reference to a queue in the docs and issues is related to the JOBS queue.
Here's a screenshot showing a queue of 150000 requests across 30 instances (so 5000 per instance):
I know it's kind of extreme but a circuit breaker within RR would be extremely useful.
Version (rr --version)
Latest (v2023.3.9)
How to reproduce the issue?
Launch RR and run a load test that overloads the system
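For instance, a crude Go hammer along these lines should do it (the URL and counts are arbitrary example values):

```go
// A crude load generator to overload a local RR instance.
// The URL and the counts are arbitrary example values.
package main

import (
	"net/http"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 512; i++ { // far more concurrent clients than num_workers
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				resp, err := http.Get("http://127.0.0.1:8080/")
				if err != nil {
					continue
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
}
```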
Relevant log output