[🐛 BUG]: HTTP Queue gets too large #1841

Closed
L3tum opened this issue Jan 23, 2024 · 16 comments · Fixed by roadrunner-server/sdk#111
Labels: B-bug (Bug: bug, exception) · C-feature-request (Category: feature requested, but need to be discussed)
Milestone: v2023.3.10
Comments


L3tum commented Jan 23, 2024

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

Not necessarily a bug, maybe a feature request, but eh.
We've observed that the HTTP request queue, which is usually a good thing for smoothing out small delays, can grow far too large.
In particular, this seems to happen when no workers are available anymore. What I would expect (from reading the docs) is that a few requests may be queued but most would be denied outright. Instead we observed a request queue of ~5000 requests, which is just too much.
What I would love is for this to be configurable, but getting some kind of documentation on what to expect here would already be a start. The only reference to a queue in the docs and issues relates to the JOBS queue.

Here's a screenshot showing a queue of 150000 requests across 30 instances (so 5000 per instance):
(screenshot of the request-queue metric at ~150,000 queued requests omitted)

I know it's kind of extreme but a circuit breaker within RR would be extremely useful.

Version (rr --version)

Latest (v2023.3.9)

How to reproduce the issue?

Launch RR and run a load test that overloads the system.

Relevant log output

-
@L3tum L3tum added B-bug Bug: bug, exception F-need-verification labels Jan 23, 2024
@L3tum L3tum changed the title from "[🐛 BUG]: HTTP Queue gets stupidly large" to "[🐛 BUG]: HTTP Queue gets too large" Jan 23, 2024
@rustatian (Member)

Hey @L3tum 👋
Could you please share your configuration (.rr.yaml) as well?


L3tum commented Jan 23, 2024

Sure, although it's fairly standard I think (hope):

version: '3'

server:
    command: "php public/index.php"
    env:
        APP_RUNTIME: Baldinof\RoadRunnerBundle\Runtime\Runtime

rpc:
    listen: tcp://127.0.0.1:6001

metrics:
    address: "0.0.0.0:9180"

http:
    address: 0.0.0.0:8080
    # Maximal incoming request size in megabytes. Zero means no limit.
    max_request_size: 4
    middleware: [ "http_metrics" ]
    uploads:
        forbid: [ ".php", ".exe", ".bat" ]
    pool:
        num_workers: 4
        # Timeout for worker allocation. Zero means no limit.
        allocate_timeout: 60s
        # Timeout for worker destroying before process killing. Zero means no limit.
        destroy_timeout: 60s
        supervisor:
            # watch_tick defines how often to check the state of the workers (seconds)
            watch_tick: 10s
            # ttl defines maximum time worker is allowed to live (seconds)
            ttl: 0s
            # idle_ttl defines maximum duration worker can spend in idle mode after first use. Disabled when 0 (seconds)
            idle_ttl: 0s
            # exec_ttl defines maximum lifetime per job (seconds)
            exec_ttl: 2s
            # max_worker_memory limits memory usage per worker (MB)
            max_worker_memory: 256
    # HTTP/2 settings.
    http2:
        # HTTP/2 over non-encrypted TCP connection using H2C.
        #
        # Default: false
        h2c: true

        # Maximal concurrent streams count.
        #
        # Default: 128
        max_concurrent_streams: 128

# Health check endpoint (docs: https://roadrunner.dev/docs/beep-beep-health). If response code is 200 - it means at
# least one worker ready to serve requests. 500 - there are no workers ready to service requests.
# Drop this section for this feature disabling.
status:
    # Host and port to listen on (eg.: `127.0.0.1:2114`). Use the following URL: http://127.0.0.1:2114/health?plugin=http
    # Multiple plugins must be separated using "&" - http://127.0.0.1:2114/health?plugin=http&plugin=rpc where "http" and
    # "rpc" are active (connected) plugins.
    #
    # This option is required.
    address: 0.0.0.0:2114

    # Response status code if a requested plugin not ready to handle requests
    # Valid for both /health and /ready endpoints
    #
    # Default: 503
    unavailable_status_code: 503

logs:
    mode: production
    encoding: json
    channels:
        http:
            level: warn # debug Log all http requests, set to info to disable
        server:
            level: info # Everything written to worker stderr is logged
            mode: raw
        metrics:
            level: error

@rustatian (Member)

Yeah, nothing suspicious. ATM we don't have a configuration option to limit this queue, but it's a good idea to add one. I think I'll add it in the 2023.3.10 release, returning a 503 HTTP code when the server is overloaded. Would that help you?
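
For context, such a limit would presumably sit next to the existing pool settings in .rr.yaml. The snippet below is only a sketch of the shape it might take; the key name (max_queue_size here) and the value are assumptions, so check the 2023.3.10 release notes for the real option:

http:
    pool:
        num_workers: 4
        allocate_timeout: 60s
        # Hypothetical cap on requests queued while waiting for a free worker;
        # once exceeded, new requests would be rejected immediately with a 503.
        max_queue_size: 100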


L3tum commented Jan 23, 2024

Definitely, that would be really nice. Thank you!

@rustatian rustatian added C-feature-request Category: feature requested, but need to be discussed and removed F-need-verification labels Jan 23, 2024
@rustatian rustatian added this to the v2023.3.10 milestone Jan 23, 2024
@rustatian (Member)

Hey @L3tum 👋
2023.3.10 will be released today 🥳


L3tum commented Feb 1, 2024

Thank you! I'll make an update tomorrow and check it out :)


L3tum commented Feb 9, 2024

Hey @rustatian !

I've been testing this change out a bit and noticed some rather bad behaviour. It seems that when RR is getting hammered and starts dropping requests because the queue is already full, RR uses up all its CPU time answering 503s rather than responding to queued requests. I could see this in multiple areas:

  • The log entries for "request queue too large" errors always showed an elapsed time in the microsecond range
  • The status and metrics plugins stopped responding (leading to the container getting terminated)
  • Requests that were already being worked on weren't getting answered, leading to response times of >5 seconds or no response at all

I realize that this is somewhat of an inherent mechanism, but I'm curious whether you think this could be fixed in RR itself instead of bolting another service, like a circuit breaker, on top. I don't know, for example, what would happen in Go if you stopped accepting new requests (or whether that's even possible). The same issue could be triggered if the side that communicates with PHP and manages the workers produces other errors in Go, for example when someone sends tons of requests with a too-large body.

I ended my search here, where the error would be generated. I think an exponential backoff/circuit breaker would need to sit there somewhere, but I don't know enough Go to know how to go about it.

Another fix would be to assign a lower priority to the RR process, which would give the PHP workers a higher priority, but (afaik) RR would then still be starved of the resources it needs to answer status checks, for example. I also haven't tried it yet, and I don't think it would work anyway, since the PHP workers are children of RR and would therefore inherit its niceness, and you can't increase priority/decrease niceness in an unprivileged environment.
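
For illustration, a circuit breaker sitting in front of the worker pool could look roughly like the sketch below. This is a plain net/http middleware written from scratch, not RoadRunner's actual API; the thresholds, the 503 short-circuit, and the idea of counting 5xx responses as failures are all assumptions made for the example:

package circuitbreaker

import (
	"net/http"
	"sync"
	"time"
)

// breaker trips after maxFailures consecutive 5xx responses and then
// rejects requests outright for the cooldown period.
type breaker struct {
	mu        sync.Mutex
	failures  int
	openUntil time.Time

	maxFailures int           // consecutive failures before the circuit opens
	cooldown    time.Duration // how long the circuit stays open
}

func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

func (b *breaker) record(failed bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !failed {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures >= b.maxFailures {
		b.openUntil = time.Now().Add(b.cooldown)
		b.failures = 0
	}
}

// statusRecorder captures the status code the next handler writes.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// Middleware rejects requests with 503 while the circuit is open, so they
// never reach the worker pool or its request queue.
func (b *breaker) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if !b.allow() {
			http.Error(w, "service overloaded", http.StatusServiceUnavailable)
			return
		}
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		// Count 5xx responses from the pool as failures for the breaker.
		b.record(rec.status >= http.StatusInternalServerError)
	})
}

The point being that once the circuit is open, overload traffic is rejected before it ever touches the pool, so the remaining CPU time goes to the requests that were already accepted.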

@rustatian (Member)

Hey @L3tum 👋
Hmm, interesting... Responding with 503 and the actual request processing happen in different threads. But yeah, a circuit breaker would be the preferable solution here. No need to stress the pool if we can do that in middleware.
Sorry, there's a lot of work going on with our new Rust-based SAPI, so I won't be able to start working on this as quickly as usual 😭

@rustatian (Member)

You may create a feature-request ticket for the circuit breaker middleware and link this one in the description.


L3tum commented Feb 9, 2024

Rust-based SAPI? Tell me more! :D

I wanted to learn some Go anyway, so I'll write up a feature request and then start working on it. Full disclosure: this will be my first Go stint, but I hope it won't be too bad.
Would you prefer pulling in some circuit breaker lib and wiring it up in a middleware, or writing a simple one "in-house"?
Additionally, a middleware would need to be added to the roadrunner-server/http project rather than being an independent plugin, right?


rustatian commented Feb 9, 2024

I thought everyone already knew that I'm working on that 😄

I wanted to learn some Go anyways

Sure, I'd be happy to help you and review your PRs. You may take a look at this one: link, grab the ideas, and reimplement them for our middleware. Here you can find docs on writing a middleware. You may also ping me on our Discord server 😃
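
For orientation, an HTTP middleware plugin for RoadRunner tends to have roughly the shape sketched below. The method names follow a reading of the public middleware docs, but the exact Init signature and the types it receives are written from memory here, so treat them as placeholders and check them against the docs and the template linked above:

package circuitbreaker

import "net/http"

// pluginName is the key used to enable the middleware in .rr.yaml, e.g.
// http.middleware: [ "circuit_breaker" ].
const pluginName = "circuit_breaker"

type Plugin struct {
	// configuration loaded in Init: failure threshold, cooldown, etc.
}

// Init is called by the plugin container on startup; in a real plugin it
// receives a configurer (and usually a logger) to read this plugin's
// section from .rr.yaml.
func (p *Plugin) Init( /* cfg config.Configurer, log *zap.Logger */ ) error {
	return nil
}

// Name registers the plugin under the middleware key.
func (p *Plugin) Name() string {
	return pluginName
}

// Middleware wraps the next handler in the HTTP chain; this is where the
// breaker from the earlier sketch would short-circuit with a 503.
func (p *Plugin) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
	})
}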

@rustatian (Member)

BTW, when you're ready to start working on that, just ping me on our Discord server; I'll create a repository and give you permissions to work on it w/o forking.


L3tum commented Feb 9, 2024

Haha, I've personally left Twitter behind, so I sometimes miss these things. Is it gonna replace Roadrunner or be a middle layer of sorts?

Thank you! I'll have a few hours on Sunday where I'll need to watch some graphs, so I'll probably start then. No worries if I don't have a repo by that time though, I can just use a local one. I'll give you a ping on Discord tomorrow, or on Sunday at the latest :)

@rustatian (Member)

Is it gonna replace Roadrunner or be a middle layer of sorts?

It'll integrate with RoadRunner 😃

@rustatian (Member)

Hey @L3tum 👋
Added you to the contributors team and created the repo: https://github.com/roadrunner-server/circuit-breaker. I'll create a middleware skeleton with a test, so it will be easier for you to start.

@rustatian (Member)

Done. Updated the repo with a basic skeleton, added comments and a sample config. To try it out, just start the test in the tests/ folder.
But first, execute composer u in the tests/php_test_files folder. You may check the composer.json, but since I'm not a PHP dev, this is just my template and it might contain too many deps.

coderabbitai bot mentioned this issue Apr 11, 2024