[serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch by harshit-anyscale · Pull Request #63308 · ray-project/ray

harshit-anyscale · 2026-05-12T18:57:18Z

Summary

Three targeted HAProxy fixes addressing the highest-impact failure modes observed under production load. Each is independently motivated.

1. Redirect HAProxy stderr to a file (avoids the pipe-buffer deadlock)

The problem. The proxy actor spawns HAProxy via asyncio.create_subprocess_exec(..., stderr=PIPE). That creates an OS pipe whose kernel buffer is bounded (64 KB by default on Linux):

HAProxy ──[write(2, ...)]──► [64 KB kernel buffer] ──► proc.stderr (read by Python)

HAProxy's config uses log local0 debug plus option httplog, which is the most verbose log level: a line for every HTTP request and every health-check transition. In environments where syslog is unreachable (containers without /dev/log, no listener on 127.0.0.1:syslog_port), HAProxy falls back to writing those log lines to stderr. At sustained load, this is thousands of lines per second.

If the proxy actor never reads from proc.stderr, the 64 KB buffer fills in seconds. Once full, HAProxy's write(2) on stderr blocks — including writes from threads serving the admin socket. From the outside, runtime-API commands appear to hang, healthz probes time out, and the actor cascades into reload/restart loops. This was the root cause behind the "60-second admin-socket deadlock" we kept hitting under load.
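The buffer limit described above can be observed directly. The following is a minimal sketch, not code from this PR: it fills an unread pipe through a non-blocking write end and counts how many bytes fit before the kernel refuses more. The helper name `measure_pipe_capacity` is hypothetical. A blocking writer (as HAProxy is on stderr) would stall at the same point instead of raising.

```python
import os


def measure_pipe_capacity(chunk: int = 4096) -> int:
    """Return how many bytes an unread pipe accepts before it is full."""
    read_fd, write_fd = os.pipe()
    # Non-blocking so a full pipe raises BlockingIOError instead of hanging.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            # Nobody reads from read_fd, mirroring an actor that never
            # drains proc.stderr.
            written += os.write(write_fd, b"x" * chunk)
    except BlockingIOError:
        # The kernel buffer is full; a blocking writer would stall here.
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written


if __name__ == "__main__":
    print(f"pipe accepted {measure_pipe_capacity()} bytes before filling")
```

On a default Linux configuration this usually reports 65536, but the exact value depends on the platform and on any pipe-size tuning.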

The fix. Instead of piping stderr back to Python, redirect HAProxy's stderr fd directly to a file at spawn time:

with open(tmp_path, "ab", buffering=0) as stderr_file:
    proc = await asyncio.create_subprocess_exec(..., stderr=stderr_file)
os.replace(tmp_path, f"{socket_path}.stderr.{proc.pid}.log")

When stderr is a file object, the subprocess machinery calls dup2() in the child between fork and exec to wire the child's fd 2 to the file. No pipe is created, so the 64 KB buffer that previously deadlocked admin-socket threads doesn't exist: write(2) to a regular file doesn't have the "consumer hasn't read me, block now" semantic that pipes do. The rename after spawn is safe because Linux file descriptors bind to inodes, not paths, so the child keeps writing to the same inode under the new pid-suffixed name.
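The rename-safety claim can be checked with a small standalone experiment. This is a sketch under assumptions, not the PR's code: a short Python child stands in for HAProxy, and the file names are illustrative. The child writes to stderr only after the parent has already renamed the file, and the output still lands in the renamed file.

```python
import asyncio
import os
import sys
import tempfile


async def spawn_with_stderr_file(directory: str) -> bytes:
    tmp_path = os.path.join(directory, "stderr.starting.log")
    # Stand-in for HAProxy: a child that writes to stderr after a delay,
    # so the write normally happens after the rename below.
    child_code = "import sys, time; time.sleep(0.3); sys.stderr.write('late line\\n')"
    with open(tmp_path, "ab", buffering=0) as stderr_file:
        proc = await asyncio.create_subprocess_exec(
            sys.executable, "-c", child_code, stderr=stderr_file
        )
    # The parent's handle is closed here; the child keeps its own fd 2.
    final_path = os.path.join(directory, f"stderr.{proc.pid}.log")
    os.replace(tmp_path, final_path)
    await proc.wait()
    with open(final_path, "rb") as f:
        return f.read()


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as d:
        return asyncio.run(spawn_with_stderr_file(d))


if __name__ == "__main__":
    print(demo())
```

Even if the child happened to write before the rename, the bytes would be in the same inode, so the final file would contain them either way.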

Once the pipe was eliminated, the admin-socket-deadlock cascades disappeared.

2. Coalesce controller broadcasts into a single reload

Under autoscaling churn the Serve controller fires target_groups and fallback_targets broadcasts independently, often only tens of ms apart. Without coalescing, each broadcast triggers its own config regeneration and graceful reload via -sf.

HAProxyManager now marks state dirty on each broadcast and arms a single sleeping coalesce task; updates arriving during the sleep window are absorbed into the same pending apply. Window defaults to 100 ms via RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S; set to 0 to disable. If an apply fails, the task re-arms and retries on the next tick (up to 3 consecutive failures, then waits for a new broadcast — avoids busy-spinning on a persistent error).
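The coalescing behaviour described above could look roughly like the sketch below. It is an illustration only, not the HAProxyManager implementation: the class name `BroadcastCoalescer`, its attributes, and the `apply` callback are all hypothetical, and the "window of 0 disables coalescing" path is left out for brevity.

```python
import asyncio


class BroadcastCoalescer:
    """Hypothetical helper: absorb bursts of updates into one apply."""

    def __init__(self, apply, window_s: float = 0.1, max_failures: int = 3):
        self._apply = apply            # async callable taking the latest state
        self._window_s = window_s
        self._max_failures = max_failures
        self._latest = None
        self._pending = False
        self._task = None

    def on_broadcast(self, state) -> None:
        # Mark state dirty and arm the task only if none is running.
        self._latest = state
        self._pending = True
        if self._task is None or self._task.done():
            self._task = asyncio.ensure_future(self._run())

    async def _run(self) -> None:
        failures = 0
        while self._pending and failures < self._max_failures:
            await asyncio.sleep(self._window_s)   # updates land during this sleep
            self._pending = False
            try:
                await self._apply(self._latest)
                failures = 0
            except Exception:
                # Re-arm so the failed apply is retried on the next tick.
                failures += 1
                self._pending = True
        # After max_failures the loop exits; the next broadcast starts a new task.


async def _demo() -> list:
    applied = []

    async def apply(state):
        applied.append(state)

    coalescer = BroadcastCoalescer(apply, window_s=0.05)
    for i in range(5):
        coalescer.on_broadcast(i)
        await asyncio.sleep(0)       # yield so the coalesce task can start
    await asyncio.sleep(0.2)
    return applied


def demo() -> list:
    return asyncio.run(_demo())
```

In this sketch five back-to-back broadcasts result in a single apply carrying the most recent state.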

3. `option redispatch` + `retry-on conn-failure empty-response` in defaults

Under churn, the first connect attempt to a slot can hit a replica that is briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). Today HAProxy returns the resulting 502/503 directly to the client even though a peer slot is up and ready. The two directives — added to the defaults block so every backend inherits them — make the retry transparent:

option redispatch sends the retry to a different slot, not the same one.
retry-on conn-failure empty-response restricts retries to cases where no response bytes can have reached the client: TCP connect failure (no bytes sent) and empty response (the slot closed the connection before sending any response data). Streaming responses with a partial body are untouched.

option redispatch was previously set only on the *-via-ingress-request-router backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients.

Out of scope

The bigger PR (#63159) includes additional architectural changes (server-template slot pools, runtime-API server updates, plus other refinements). This PR is deliberately the minimal high-impact subset for an easier review and lower-risk merge.

Other knobs explored during the investigation (option abortonclose removal, timeout server / timeout connect defaults, hard-stop-after bump) are intentionally not part of this PR — those are better addressed client-side for the load-test workload rather than as new HAProxy defaults.

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request introduces several stability improvements for HAProxy within Ray Serve. Key changes include increasing the hard stop timeout to 1800 seconds and establishing default server and connection timeouts to prevent undefined behavior during reloads. Additionally, the option abortonclose setting was removed from the configuration. A critical fix involves the implementation of a background task to drain HAProxy's stderr into a per-PID log file, which prevents potential deadlocks caused by the OS pipe buffer filling up. Feedback on these changes suggests increasing the logging severity from DEBUG to WARNING if the stderr drainer fails, ensuring that any failure in this mechanism is visible since it could lead back to the deadlock issue.

HAProxy runs in `-db` debug mode and emits hundreds of stderr lines per second under load. The proxy actor captured stderr via `stderr=PIPE` but never read from it, so the 64KB OS pipe buffer fills in seconds and HAProxy blocks on its next `write(2)` — including from threads serving the admin socket. Runtime-API calls then time out and look like a deadlock from outside. Spawn a fire-and-forget asyncio task right after startup that drains stderr to a per-pid log file. Output is bounded to 10 MB per pid via truncate-on-roll so a long-lived proc cannot fill the disk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Serve controller emits TARGET_GROUPS and FALLBACK_TARGETS long-poll updates independently, often within tens of milliseconds of each other during autoscaling churn. Each broadcast was kicking off its own reload, serializing on the reload lock and amplifying replica thrash into proxy churn. Introduce a coalescing window: a broadcast marks state dirty and starts (or extends) a single asyncio task that sleeps the window and then applies the latest state. Default window is 100ms; tunable via `RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S`, and setting it to 0 falls back to the prior synchronous behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Under autoscaling churn, the first connect attempt to a slot can hit a replica that is briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). Today HAProxy returns the resulting 502/503 directly to the client even though a peer slot is up and ready. Add two defaults-level directives so requests transparently retry to a healthy peer:

- `option redispatch`: when a retry is allowed, send it to a different slot (not the same one).
- `retry-on conn-failure empty-response`: only retry in cases that cannot have leaked partial bytes to the client, namely TCP connect failure (no bytes sent) and empty response (slot died before sending a body). Streaming responses with a partial body are untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

harshit-anyscale · 2026-05-14T06:32:24Z

Three test runs after these changes: all have a failure rate < 0.01%.

After enabling `option redispatch` + `retry-on conn-failure empty-response` in the HAProxy defaults block, HAProxy retries a failed connect to the next slot. When every slot is down (which is the scenario this test creates by killing the only replica), retries exhaust and HAProxy returns its built-in 503 "Service Unavailable" instead of surfacing the original 502. The 502 errorfile maps 502/504 to 500, but it does not cover 503, so the response the client sees changes from 500 to 503. Both are valid error responses for "the service is currently not available"; loosen the assertion to accept either. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous comment described what the setting does ("window for coalescing back-to-back controller broadcasts") but not why fast reloads are harmful. Rewrite to lead with the actual concern: reloading HAProxy at controller-broadcast speed (every few tens of ms) bursts process handoffs, saturates the proxy actor's event loop, and forces overlapping drains on the old proc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the asyncio drain task with a direct file redirect: open a stderr destination file before spawn, pass it as `stderr=...` so the kernel uses dup2 to wire the child's fd 2 to the file, then rename the file to include the child's pid after spawn returns. No pipe is created, so the 64KB kernel buffer that previously deadlocked admin-socket threads when stderr backed up can't form. The fix is structural rather than behavioral — there is no buffer to fill because there is no consumer to be slow. Linux file descriptors bind to inodes rather than paths, so renaming the file after spawn is safe: the child keeps writing to the same inode under the new name. Trade-off: the previous drain task capped each file at 10MB via a periodic truncate. Under the file-redirect model the file grows unbounded until the proc exits. Acceptable for now given realistic stderr volumes; a periodic size-cap task can be layered on later if needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
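The size cap mentioned as a possible follow-up could be sketched as below. This is not part of the PR. The helper name `truncate_if_oversized` and the 10 MB default are assumptions modelled on the earlier drain task, and the sketch assumes Linux semantics: the writer opened the file in append mode, so its next write lands at the new end of file after truncation.

```python
import os
import tempfile


def truncate_if_oversized(path: str, max_bytes: int = 10 * 1024 * 1024) -> bool:
    """Truncate the log to zero length once it exceeds max_bytes."""
    try:
        if os.path.getsize(path) > max_bytes:
            os.truncate(path, 0)
            return True
    except OSError:
        # The file may have been removed; treat that as nothing to do.
        pass
    return False


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "stderr.log")
        # "ab" mirrors the append-mode open used for the stderr file.
        with open(path, "ab", buffering=0) as writer:
            writer.write(b"x" * 100)
            truncated = truncate_if_oversized(path, max_bytes=50)
            writer.write(b"after\n")   # appended at the new end of file
        with open(path, "rb") as f:
            return truncated, f.read()
```

A periodic task would call a helper like this on a timer; whether losing older stderr output is acceptable is a separate decision.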

Shorter wording for two comments that were doing more explaining than the code needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Down from 9 lines to 4 — keeps the why (overlapping `-sf` saturates the actor) and drops the rest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When `_update_haproxy_backends` threw, `_update_pending` had already been flipped to False before the apply, so the loop would exit and the HAProxy state would stay stale until the next broadcast — which may never arrive in a settled cluster. Re-arm `_update_pending` in the except branch so a failed apply is retried on the next coalesce tick. Cap retries at 3 consecutive failures to avoid busy-looping on a persistent error (e.g. a bad config HAProxy refuses to parse); past that, wait for the next broadcast rather than spinning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Remove the "overlapping `-sf` handoffs saturate the proxy actor's event loop" line — we never confirmed that as the failure mode. Keep the observable problem: under churn, broadcasts can fire reloads tens of ms apart. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

7 lines → 3 lines. Keeps the surprising bit (rename safety via fd-binds-to-inode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

After redirecting HAProxy's stderr to a file at spawn, `proc.stderr` became None and `_wait_for_hap_availability` fell back to b"" when the process crashed. The resulting RuntimeError message was just "exit code N" with no diagnostic — even though HAProxy had written the real error (bind failure, config parse error, etc.) to the on-disk stderr log file. Read the tail of the per-pid stderr file when proc.stderr is None, so startup-crash errors carry the actual reason again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pid scheme required an awkward open-with-temp-name then os.replace because the child's pid isn't known until create_subprocess_exec returns. Switch to a monotonic spawn counter held on the HAProxyApi instance:

- The stderr path is known pre-spawn, so the file opens at its final name and stays there.
- No more `stderr.starting.log` intermediate, no os.replace, no rename race window.
- The path is attached to the proc as `_stderr_path` and travels with it, so `_wait_for_hap_availability` reads from the right file whether it was called on a fresh spawn or on an old proc during graceful reload.

A new info-level log at spawn time emits the (spawn#, pid, stderr path, args) tuple so the pid → stderr-file mapping is recoverable from the actor log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
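A spawn-counter scheme of the kind this commit describes might be sketched as follows. The class name `StderrPathAllocator` and the file-name pattern are hypothetical (the pattern is borrowed from the earlier pid-based name); only the idea of a per-instance monotonic counter comes from the commit message.

```python
import itertools


class StderrPathAllocator:
    """Hypothetical helper: hand out a fresh stderr path per spawn."""

    def __init__(self, socket_path: str):
        self._socket_path = socket_path
        # Monotonic per-instance counter; the first spawn is number 1.
        self._spawn_counter = itertools.count(1)

    def next_path(self) -> str:
        spawn_no = next(self._spawn_counter)
        # The path is known before the child exists, so no rename is needed.
        return f"{self._socket_path}.stderr.{spawn_no}.log"


if __name__ == "__main__":
    allocator = StderrPathAllocator("/tmp/haproxy.sock")
    print(allocator.next_path())
    print(allocator.next_path())
```

Because the counter is not the pid, something like the spawn-time log line is what ties a given file back to a process.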

`_read_stderr_log_tail` was used in exactly one place and existed only to give a name to "open file, seek to end, read last 4KB." Inline it into `_wait_for_hap_availability` so the entire crash-message logic is visible at the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
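The "open file, seek to end, read last 4KB" step can be written as a few lines. This is a standalone sketch rather than the inlined code from the PR; the function name `read_log_tail` and the choice to return empty bytes on error are assumptions.

```python
import os
import tempfile


def read_log_tail(path: str, max_bytes: int = 4096) -> bytes:
    """Return at most the last max_bytes of the file, or b"" on error."""
    try:
        with open(path, "rb") as f:
            f.seek(0, os.SEEK_END)
            size = f.tell()
            # Clamp at 0 so files shorter than max_bytes are read whole.
            f.seek(max(0, size - max_bytes))
            return f.read()
    except OSError:
        return b""


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "stderr.log")
        with open(path, "wb") as f:
            f.write(b"a" * 6000 + b"bind failed\n")
        return read_log_tail(path)
```

The tail is what would be attached to the startup-crash error message so that it carries the actual reason.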

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 247b07c.

cursor · 2026-05-15T09:33:36Z

+    # Retry to a different slot on safe failures only: connect failed
+    # (no bytes sent) or empty response (slot died before sending body).
+    option redispatch
+    retry-on conn-failure empty-response


Retry exhaustion exposes 503 bypassing error normalization to 500

Medium Severity

Adding retry-on conn-failure empty-response changes the failure path: previously a connection failure produced a 502 (caught by errorfile 502 and normalized to 500), but now retry exhaustion produces a 503 which has no matching errorfile 503 directive. This breaks the stated design goal of normalizing proxy errors to 500 and exposes clients to raw 503 responses with HAProxy's built-in error page instead of the expected "Internal Server Error" body. The test in test_http_routes.py was loosened to accept in (500, 503) rather than fixing the gap.

Additional Locations (1)

python/ray/serve/_private/haproxy_templates.py#L64-L68


harshit-anyscale · 2026-05-15T15:25:04Z

latest 3 runs with these changes:

https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_g7p6lsu6r8g7garwbxifppyz23/jobs/prodjob_cdfkbj1cik2zpedd1w7wgrl79y?job-tab=overview&job-logs-section-tabs=application_logs

number of failed requests < 0.01%

kouroshHakha · 2026-05-16T01:53:35Z

Great work @harshit-anyscale. We should potentially merge this after we enable HAProxy for the performance tests and let those cook a bit (#63386). I want to make sure we are not trading off stability for speed.

harshit-anyscale · 2026-05-18T04:03:33Z

> We should potentially merge this after we enable ha proxy for the performance tests and let those cook a bit (#63386)

Makes sense. Will plan to merge it 3-4 days after #63386 merges.

harshit-anyscale requested a review from a team as a code owner May 12, 2026 18:57

gemini-code-assist Bot reviewed May 12, 2026

Comment thread python/ray/serve/_private/haproxy.py Outdated

cursor Bot reviewed May 12, 2026

Comment thread python/ray/serve/_private/haproxy.py Outdated

Comment thread python/ray/serve/_private/haproxy.py Outdated

ray-gardener Bot added the `serve` label (Ray Serve Related Issue) May 12, 2026

cursor Bot reviewed May 13, 2026

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 3c44db2 to 002fd33 Compare May 13, 2026 15:29

cursor Bot reviewed May 13, 2026

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 002fd33 to 5bb2f5b Compare May 13, 2026 17:45

cursor Bot reviewed May 13, 2026

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 5bb2f5b to 3a30a51 Compare May 13, 2026 19:18

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, abortonclose, timeout defaults, hard-stop~~ [serve] HAProxy stability fixes: stderr drain, abortonclose, broadcast coalescing May 13, 2026

harshit-anyscale and others added 2 commits May 13, 2026 19:52

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 3a30a51 to e61e719 Compare May 13, 2026 19:53

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, abortonclose, broadcast coalescing~~ [serve] HAProxy stability fixes: stderr drain, broadcast coalescing May 13, 2026

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, broadcast coalescing~~ [serve] HAProxy stability fixes: stderr drain, broadcast coalescing, redispatch May 14, 2026

harshit-anyscale and others added 2 commits May 14, 2026 18:01

Merge branch 'master' into serve-haproxy-minimal-fixes

56aee88

harshit-anyscale added the `go` label (add ONLY when ready to merge, run all tests) May 14, 2026

harshit-anyscale self-assigned this May 14, 2026

harshit-anyscale and others added 7 commits May 15, 2026 04:07

[serve] HAProxy: tighten retry and coalescing comments

097e396

Shorter wording for two comments that were doing more explaining than the code needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[serve] Shorten broadcast-coalesce comment

f4773af

Down from 9 lines to 4 — keeps the why (overlapping `-sf` saturates the actor) and drops the rest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[serve] Shorten stderr-redirect comment

45d6acf

7 lines → 3 lines. Keeps the surprising bit (rename safety via fd-binds-to-inode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

harshit-anyscale changed the title ~~[serve] HAProxy stability fixes: stderr drain, broadcast coalescing, redispatch~~ [serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch May 15, 2026

cursor Bot reviewed May 15, 2026

Comment thread python/ray/serve/_private/haproxy.py Outdated

harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from c1f08b0 to 45d6acf Compare May 15, 2026 07:45

harshit-anyscale and others added 3 commits May 15, 2026 08:28

cursor Bot reviewed May 15, 2026

kouroshHakha approved these changes May 16, 2026

harshit-anyscale wants to merge 15 commits into ray-project:master from harshit-anyscale:serve-haproxy-minimal-fixes
