[🐛 BUG]: HTTP Queue gets too large #1841

L3tum · 2024-01-23T08:36:08Z

No duplicates 🥲.

I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

Not necessarily a bug, maybe a feature request, but eh.
We've observed that the HTTP Request Queue, which is usually a good thing for small delays, can grow way too large.
In particular this seems to be happening when no workers are available anymore. What I would expect (from reading the docs) is that a few requests may be queued but most would be denied outright. Instead we observed a request queue of ~5000 requests, which is just too much.
What I would love is for this to be configurable, but it'd be a start to get some kind of documentation on what to expect from this. The only reference to a queue in the docs and issues is related to the JOBS queue.

Here's a screenshot showing a queue of 150000 requests across 30 instances (so 5000 per instance):

I know it's kind of extreme but a circuit breaker within RR would be extremely useful.

Version (rr --version)

Latest (v2023.3.9)

How to reproduce the issue?

Launch RR and run a load test that is overloading the system

Relevant log output

rustatian · 2024-01-23T09:13:58Z

Hey @L3tum 👋
Could you please share your configuration as well? (.rr.yaml) ?

L3tum · 2024-01-23T09:22:20Z

Sure, although it's fairly standard I think (hope):

version: '3'

server:
    command: "php public/index.php"
    env:
        APP_RUNTIME: Baldinof\RoadRunnerBundle\Runtime\Runtime

rpc:
    listen: tcp://127.0.0.1:6001

metrics:
    address: "0.0.0.0:9180"

http:
    address: 0.0.0.0:8080
    # Maximal incoming request size in megabytes. Zero means no limit.
    max_request_size: 4
    middleware: [ "http_metrics" ]
    uploads:
        forbid: [ ".php", ".exe", ".bat" ]
    pool:
        num_workers: 4
        # Timeout for worker allocation. Zero means no limit.
        allocate_timeout: 60s
        # Timeout for worker destroying before process killing. Zero means no limit.
        destroy_timeout: 60s
        supervisor:
            # watch_tick defines how often to check the state of the workers (seconds)
            watch_tick: 10s
            # ttl defines maximum time worker is allowed to live (seconds)
            ttl: 0s
            # idle_ttl defines maximum duration worker can spend in idle mode after first use. Disabled when 0 (seconds)
            idle_ttl: 0s
            # exec_ttl defines maximum lifetime per job (seconds)
            exec_ttl: 2s
            # max_worker_memory limits memory usage per worker (MB)
            max_worker_memory: 256
    # HTTP/2 settings.
    http2:
        # HTTP/2 over non-encrypted TCP connection using H2C.
        #
        # Default: false
        h2c: true

        # Maximal concurrent streams count.
        #
        # Default: 128
        max_concurrent_streams: 128

# Health check endpoint (docs: https://roadrunner.dev/docs/beep-beep-health). If response code is 200 - it means at
# least one worker ready to serve requests. 500 - there are no workers ready to service requests.
# Drop this section for this feature disabling.
status:
    # Host and port to listen on (eg.: `127.0.0.1:2114`). Use the following URL: http://127.0.0.1:2114/health?plugin=http
    # Multiple plugins must be separated using "&" - http://127.0.0.1:2114/health?plugin=http&plugin=rpc where "http" and
    # "rpc" are active (connected) plugins.
    #
    # This option is required.
    address: 0.0.0.0:2114

    # Response status code if a requested plugin not ready to handle requests
    # Valid for both /health and /ready endpoints
    #
    # Default: 503
    unavailable_status_code: 503

logs:
    mode: production
    encoding: json
    channels:
        http:
            level: warn # debug Log all http requests, set to info to disable
        server:
            level: info # Everything written to worker stderr is logged
            mode: raw
        metrics:
            level: error

rustatian · 2024-01-23T09:29:22Z

Yeah, noting suspicious. ATM we don't have a configuration option to limit this queue, but this is a good idea to add this option. I think I'll add it in the 2023.3.10 release with the 503 HTTP code if the server is overloaded. Does that help you?

L3tum · 2024-01-23T09:41:32Z

Definitely, that would be really nice. Thank you!

rustatian · 2024-02-01T15:05:06Z

Hey @L3tum 👋
2023.3.10 will be released today 🥳

L3tum · 2024-02-01T17:28:00Z

Thank you! I'll make an update tomorrow and check it out :)

L3tum · 2024-02-09T17:50:40Z

Hey @rustatian !

I've been testing this change out a bit and noticed a somewhat bad behaviour. It seems like when RR is getting hammered and starts dropping requests because the queue is already full, that RR uses up all CPU time answering 503s rather than responding to queued requests. I could see this in multiple areas:

The log entries for "request queue too large" errors were always with an elapsed time in the microsecond area
The status and metrics plugins stopped responding (leading to the container getting terminated)
The already-worked-on requests weren't getting answered, leading to response times >5 seconds or even none at all

I realize that this is somewhat an inherent mechanism, but I'm curious if you think you could fix it in RR instead of bolting another service, like a circuit breaker, on top. I don't know, for example, what would happen in Go if you'd stop accepting new requests (or if that's even possible). The same issue could be triggered if the side communicating with PHP and managing the workers triggers other errors in Go. For example someone sending tons of requests with a too-large body.

I ended my search here, where the error would be generated. I think an expontential backoff/circuit breaker would need to be there somewhere, but I don't know enough about Go on how to go about it.

Another fix would be to assign a lower priority to the RR process, which would give the PHP workers a higher priority, but that would mean (afaik) that RR would still be starved of important resources for answering status for example. I also haven't tried it yet and I think it wouldn't work, since the PHP workers are children of RR and thus would inherit that nice-ness and you can't increase priority/decrease nice-ness in an unprivileged environment.

rustatian · 2024-02-09T18:00:12Z

Hey @L3tum 👋
Hmm, interesting... Respond with 503 and actual request processing are happening in different threads. But yeah, a circuit breaker would be a preferable solution here. No need to stress pool if we can do that in middleware.
Sorry, a lot of work with our new Rust-based SAPI, so, I would be able to start working on this not as fast as usual 😭

rustatian · 2024-02-09T18:00:57Z

You may create a feature request ticket for circuit breaker middleware and link this one in the description.

L3tum · 2024-02-09T18:42:26Z

Rust-based SAPI? Tell me more! :D

I wanted to learn some Go anyways so I'll write a feature request down and then start working on it. Full disclosure that will be my first Go stint but I hope it won't be too bad.
Would you prefer pulling in some circuit breaker lib and connecting it in a middleware, or writing a simple one "in-house"?
Additionally, a middleware would need to be added to the roadrunner-server/http project, rather than an independent plugin, right?

rustatian · 2024-02-09T19:27:18Z

I thought that everyone knew about that, that I'm working on that 😄

I wanted to learn some Go anyways

Sure, would be happy to help you and review your PRs. You may take a look at this one: link, grab the ideas and reimplement them for our middleware. Here you may find docs about writing a middleware. You may also ping me on our Discord server 😃

rustatian · 2024-02-09T19:31:48Z

BTW, when you'll be ready to start working on that, just ping me on our Discord server, I'll create a repository and give you permissions to work on that w/o forking.

L3tum · 2024-02-09T21:56:28Z

Haha, I've personally left Twitter behind so I sometimes miss these things. Is it gonna replace Roadrunner or be a middlelayer of sorts?

Thank you! I'll have a few hours on sunday where I'll need to watch some graphs so I'll probably start then. No worries if I don't have a repo by that time though, I can just use a local one. I'll give you a ping on Discord tomorrow or on Sunday at the latest :)

rustatian · 2024-02-09T22:21:45Z

Is it gonna replace Roadrunner or be a middlelayer of sorts?

It'll integrate with RoadRunner 😃

rustatian · 2024-02-10T10:13:16Z

Hey @L3tum 👋
Added you to a contributors team, created repo: https://github.com/roadrunner-server/circuit-breaker. I'll create a middleware skeleton with test, so it would be easier for you start.

rustatian · 2024-02-10T11:17:57Z

Done. Updated repo with a basic skeleton, added comments, sample config. To use, just start the test in the tests/ folder.
But before, execute composer u in the tests/php_test_files folder. You may check the composer.json, but since I'm not a PHP dev, this is just my template, which might be contained too many deps.

L3tum added B-bug Bug: bug, exception F-need-verification labels Jan 23, 2024

L3tum assigned rustatian Jan 23, 2024

L3tum changed the title ~~[🐛 BUG]: HTTP Queue gets stupidly large~~ [🐛 BUG]: HTTP Queue gets too large Jan 23, 2024

rustatian added C-feature-request Category: feature requested, but need to be discussed and removed F-need-verification labels Jan 23, 2024

rustatian added this to the v2023.3.10 milestone Jan 23, 2024

rustatian mentioned this issue Feb 1, 2024

feature: max queue size option roadrunner-server/sdk#111

Merged

6 tasks

rustatian closed this as completed in roadrunner-server/sdk#111 Feb 1, 2024

rustatian mentioned this issue Feb 1, 2024

release: v2023.3.10 #1856

Merged

6 tasks

Porkepix mentioned this issue Feb 2, 2024

roadrunner 2023.3.10 Homebrew/homebrew-core#161608

Merged

coderabbitai bot mentioned this issue Apr 11, 2024

release: v2024.1.0 #1902

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[🐛 BUG]: HTTP Queue gets too large #1841

[🐛 BUG]: HTTP Queue gets too large #1841

L3tum commented Jan 23, 2024

rustatian commented Jan 23, 2024

L3tum commented Jan 23, 2024

rustatian commented Jan 23, 2024

L3tum commented Jan 23, 2024

rustatian commented Feb 1, 2024

L3tum commented Feb 1, 2024

L3tum commented Feb 9, 2024 •

edited

rustatian commented Feb 9, 2024

rustatian commented Feb 9, 2024

L3tum commented Feb 9, 2024

rustatian commented Feb 9, 2024 •

edited

rustatian commented Feb 9, 2024

L3tum commented Feb 9, 2024

rustatian commented Feb 9, 2024

rustatian commented Feb 10, 2024

rustatian commented Feb 10, 2024

[🐛 BUG]: HTTP Queue gets too large #1841

[🐛 BUG]: HTTP Queue gets too large #1841

Comments

L3tum commented Jan 23, 2024

No duplicates 🥲.

What happened?

Version (rr --version)

How to reproduce the issue?

Relevant log output

rustatian commented Jan 23, 2024

L3tum commented Jan 23, 2024

rustatian commented Jan 23, 2024

L3tum commented Jan 23, 2024

rustatian commented Feb 1, 2024

L3tum commented Feb 1, 2024

L3tum commented Feb 9, 2024 • edited

rustatian commented Feb 9, 2024

rustatian commented Feb 9, 2024

L3tum commented Feb 9, 2024

rustatian commented Feb 9, 2024 • edited

rustatian commented Feb 9, 2024

L3tum commented Feb 9, 2024

rustatian commented Feb 9, 2024

rustatian commented Feb 10, 2024

rustatian commented Feb 10, 2024

L3tum commented Feb 9, 2024 •

edited

rustatian commented Feb 9, 2024 •

edited