[serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch#63308
[serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch#63308harshit-anyscale wants to merge 15 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces several stability improvements for HAProxy within Ray Serve. Key changes include increasing the hard stop timeout to 1800 seconds and establishing default server and connection timeouts to prevent undefined behavior during reloads. Additionally, the option abortonclose setting was removed from the configuration. A critical fix involves the implementation of a background task to drain HAProxy's stderr into a per-PID log file, which prevents potential deadlocks caused by the OS pipe buffer filling up. Feedback on these changes suggests increasing the logging severity from DEBUG to WARNING if the stderr drainer fails, ensuring that any failure in this mechanism is visible since it could lead back to the deadlock issue.
3c44db2 to
002fd33
Compare
002fd33 to
5bb2f5b
Compare
5bb2f5b to
3a30a51
Compare
HAProxy runs in `-db` debug mode and emits hundreds of stderr lines per second under load. The proxy actor captured stderr via `stderr=PIPE` but never read from it, so the 64KB OS pipe buffer fills in seconds and HAProxy blocks on its next `write(2)` — including from threads serving the admin socket. Runtime-API calls then time out and look like a deadlock from outside. Spawn a fire-and-forget asyncio task right after startup that drains stderr to a per-pid log file. Output is bounded to 10 MB per pid via truncate-on-roll so a long-lived proc cannot fill the disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Serve controller emits TARGET_GROUPS and FALLBACK_TARGETS long-poll updates independently, often within tens of milliseconds of each other during autoscaling churn. Each broadcast was kicking off its own reload, serializing on the reload lock and amplifying replica thrash into proxy churn. Introduce a coalescing window: a broadcast marks state dirty and starts (or extends) a single asyncio task that sleeps the window and then applies the latest state. Default window is 100ms; tunable via `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S`, and setting it to 0 falls back to the prior synchronous behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3a30a51 to
e61e719
Compare
Under autoscaling churn, the first connect attempt to a slot can hit a
replica that is briefly down (just-removed, mid-restart, scaling-up
slot not yet accepting). Today HAProxy returns the resulting 502/503
directly to the client even though a peer slot is up and ready.
Add two defaults-level directives so requests transparently retry to a
healthy peer:
- `option redispatch` — when a retry is allowed, send it to a
different slot (not the same one).
- `retry-on conn-failure empty-response` — only retry in cases that
cannot have leaked partial bytes to the client: TCP connect
failure (no bytes sent) and empty response (slot died before
sending a body). Streaming responses with a partial body are
untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
three test runs after these changes: all have the failure rate < 0.01%. |
After enabling `option redispatch` + `retry-on conn-failure empty-response` in the HAProxy defaults block, HAProxy retries a failed connect to the next slot. When every slot is down (which is the scenario this test creates by killing the only replica), retries exhaust and HAProxy returns its built-in 503 "Service Unavailable" instead of surfacing the original 502. The 502 errorfile maps 502/504 to 500, but it does not cover 503, so the response the client sees changes from 500 to 503. Both are valid error responses for "the service is currently not available"; loosen the assertion to accept either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous comment described what the setting does ("window for
coalescing back-to-back controller broadcasts") but not why fast
reloads are harmful. Rewrite to lead with the actual concern:
reloading HAProxy at controller-broadcast speed (every few tens of
ms) bursts process handoffs, saturates the proxy actor's event loop,
and forces overlapping drains on the old proc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the asyncio drain task with a direct file redirect: open a stderr destination file before spawn, pass it as `stderr=...` so the kernel uses dup2 to wire the child's fd 2 to the file, then rename the file to include the child's pid after spawn returns. No pipe is created, so the 64KB kernel buffer that previously deadlocked admin-socket threads when stderr backed up can't form. The fix is structural rather than behavioral — there is no buffer to fill because there is no consumer to be slow. Linux file descriptors bind to inodes rather than paths, so renaming the file after spawn is safe: the child keeps writing to the same inode under the new name. Trade-off: the previous drain task capped each file at 10MB via a periodic truncate. Under the file-redirect model the file grows unbounded until the proc exits. Acceptable for now given realistic stderr volumes; a periodic size-cap task can be layered on later if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shorter wording for two comments that were doing more explaining than the code needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Down from 9 lines to 4 — keeps the why (overlapping `-sf` saturates the actor) and drops the rest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When `_update_haproxy_backends` threw, `_update_pending` had already been flipped to False before the apply, so the loop would exit and the HAProxy state would stay stale until the next broadcast — which may never arrive in a settled cluster. Re-arm `_update_pending` in the except branch so a failed apply is retried on the next coalesce tick. Cap retries at 3 consecutive failures to avoid busy-looping on a persistent error (e.g. a bad config HAProxy refuses to parse); past that, wait for the next broadcast rather than spinning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the "overlapping `-sf` handoffs saturate the proxy actor's event loop" line — we never confirmed that as the failure mode. Keep the observable problem: under churn, broadcasts can fire reloads tens of ms apart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 lines → 3 lines. Keeps the surprising bit (rename safety via fd-binds-to-inode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
c1f08b0 to
45d6acf
Compare
After redirecting HAProxy's stderr to a file at spawn, `proc.stderr` became None and `_wait_for_hap_availability` fell back to b"" when the process crashed. The resulting RuntimeError message was just "exit code N" with no diagnostic — even though HAProxy had written the real error (bind failure, config parse error, etc.) to the on-disk stderr log file. Read the tail of the per-pid stderr file when proc.stderr is None, so startup-crash errors carry the actual reason again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pid scheme required an awkward open-with-temp-name then os.replace because the child's pid isn't known until create_subprocess_exec returns. Switch to a monotonic spawn counter held on the HAProxyApi instance: - The stderr path is known pre-spawn, so the file opens at its final name and stays there. - No more `stderr.starting.log` intermediate, no os.replace, no rename race window. - The path is attached to the proc as `_stderr_path` and travels with it, so `_wait_for_hap_availability` reads from the right file whether it was called on a fresh spawn or on an old proc during graceful reload. A new info-level log at spawn time emits the (spawn#, pid, stderr path, args) tuple so the pid → stderr-file mapping is recoverable from the actor log. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_read_stderr_log_tail` was used in exactly one place and existed only to give a name to "open file, seek to end, read last 4KB." Inline it into `_wait_for_hap_availability` so the entire crash-message logic is visible at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 247b07c. Configure here.
| # Retry to a different slot on safe failures only: connect failed | ||
| # (no bytes sent) or empty response (slot died before sending body). | ||
| option redispatch | ||
| retry-on conn-failure empty-response |
There was a problem hiding this comment.
Retry exhaustion exposes 503 bypassing error normalization to 500
Medium Severity
Adding retry-on conn-failure empty-response changes the failure path: previously a connection failure produced a 502 (caught by errorfile 502 and normalized to 500), but now retry exhaustion produces a 503 which has no matching errorfile 503 directive. This breaks the stated design goal of normalizing proxy errors to 500 and exposes clients to raw 503 responses with HAProxy's built-in error page instead of the expected "Internal Server Error" body. The test in test_http_routes.py was loosened to accept in (500, 503) rather than fixing the gap.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 247b07c. Configure here.
|
latest 3 runs with these changes: number of failed requests < 0.01% |
|
Great work @harshit-anyscale. We should potentially merge this after we enable ha proxy for the performance tests and let those cook a bit (#63386). I want to make sure we are not trading off stability for speed. |


Summary
Three targeted HAProxy fixes addressing the highest-impact failure modes observed under production load. Each is independently motivated.
1. Redirect HAProxy stderr to a file (avoids the pipe-buffer deadlock)
The problem. The proxy actor spawns HAProxy via
asyncio.create_subprocess_exec(..., stderr=PIPE). That creates an OS pipe with a fixed 64 KB kernel buffer:HAProxy's config uses
log local0 debugplusoption httplog— high-severity logging on every HTTP request and every health-check transition. In environments where syslog is unreachable (containers without/dev/log, no listener on127.0.0.1:syslog_port), HAProxy falls back to writing those log lines to stderr. At sustained load, this is thousands of lines per second.If the proxy actor never reads from
proc.stderr, the 64 KB buffer fills in seconds. Once full, HAProxy'swrite(2)on stderr blocks — including writes from threads serving the admin socket. From the outside, runtime-API commands appear to hang, healthz probes time out, and the actor cascades into reload/restart loops. This was the root cause behind the "60-second admin-socket deadlock" we kept hitting under load.The fix. Instead of piping stderr back to Python, redirect HAProxy's stderr fd directly to a file at spawn time:
The kernel uses
dup2()at fork to wire the child's fd 2 to the file. No pipe is created, so the 64 KB buffer that previously deadlocked admin-socket threads doesn't exist —write(2)to a regular file doesn't have the "consumer hasn't read me, block now" semantic that pipes do. The rename after spawn is safe because Linux file descriptors bind to inodes, not paths, so the child keeps writing to the same inode under the new pid-suffixed name.Once the pipe was eliminated, the admin-socket-deadlock cascades disappeared.
2. Coalesce controller broadcasts into a single reload
Under autoscaling churn the Serve controller fires
target_groupsandfallback_targetsbroadcasts independently, often only tens of ms apart. Without coalescing, each broadcast triggers its own config regeneration and graceful reload via-sf.HAProxyManagernow marks state dirty on each broadcast and arms a single sleeping coalesce task; updates arriving during the sleep window are absorbed into the same pending apply. Window defaults to 100 ms viaRAY_SERVE_HAPROXY_BROADCAST_COALESCE_S; set to0to disable. If an apply fails, the task re-arms and retries on the next tick (up to 3 consecutive failures, then waits for a new broadcast — avoids busy-spinning on a persistent error).3.
option redispatch+retry-on conn-failure empty-responsein defaultsUnder churn, the first connect attempt to a slot can hit a replica that is briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). Today HAProxy returns the resulting 502/503 directly to the client even though a peer slot is up and ready. The two directives — added to the
defaultsblock so every backend inherits them — make the retry transparent:option redispatchsends the retry to a different slot, not the same one.retry-on conn-failure empty-responserestricts retries to cases that cannot have leaked partial bytes to the client: TCP connect failure (no bytes sent) and empty response (slot died before sending a body). Streaming responses with a partial body are untouched.option redispatchwas previously set only on the*-via-ingress-request-routerbackend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients.Out of scope
The bigger PR (#63159) includes additional architectural changes (server-template slot pools, runtime-API server updates, plus other refinements). This PR is deliberately the minimal high-impact subset for an easier review and lower-risk merge.
Other knobs explored during the investigation (
option abortoncloseremoval,timeout server/timeout connectdefaults,hard-stop-afterbump) are intentionally not part of this PR — those are better addressed client-side for the load-test workload rather than as new HAProxy defaults.🤖 Generated with Claude Code