Skip to content

[serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch#63308

Open
harshit-anyscale wants to merge 15 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-minimal-fixes
Open

[serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch#63308
harshit-anyscale wants to merge 15 commits into
ray-project:masterfrom
harshit-anyscale:serve-haproxy-minimal-fixes

Conversation

@harshit-anyscale
Copy link
Copy Markdown
Contributor

@harshit-anyscale harshit-anyscale commented May 12, 2026

Summary

Three targeted HAProxy fixes addressing the highest-impact failure modes observed under production load. Each is independently motivated.

1. Redirect HAProxy stderr to a file (avoids the pipe-buffer deadlock)

The problem. The proxy actor spawns HAProxy via asyncio.create_subprocess_exec(..., stderr=PIPE). That creates an OS pipe with a fixed 64 KB kernel buffer:

HAProxy ──[write(2, ...)]──► [64 KB kernel buffer] ──► proc.stderr (read by Python)

HAProxy's config uses log local0 debug plus option httplog — high-severity logging on every HTTP request and every health-check transition. In environments where syslog is unreachable (containers without /dev/log, no listener on 127.0.0.1:syslog_port), HAProxy falls back to writing those log lines to stderr. At sustained load, this is thousands of lines per second.

If the proxy actor never reads from proc.stderr, the 64 KB buffer fills in seconds. Once full, HAProxy's write(2) on stderr blocks — including writes from threads serving the admin socket. From the outside, runtime-API commands appear to hang, healthz probes time out, and the actor cascades into reload/restart loops. This was the root cause behind the "60-second admin-socket deadlock" we kept hitting under load.

The fix. Instead of piping stderr back to Python, redirect HAProxy's stderr fd directly to a file at spawn time:

with open(tmp_path, "ab", buffering=0) as stderr_file:
    proc = await asyncio.create_subprocess_exec(..., stderr=stderr_file)
os.replace(tmp_path, f"{socket_path}.stderr.{proc.pid}.log")

The kernel uses dup2() at fork to wire the child's fd 2 to the file. No pipe is created, so the 64 KB buffer that previously deadlocked admin-socket threads doesn't exist — write(2) to a regular file doesn't have the "consumer hasn't read me, block now" semantic that pipes do. The rename after spawn is safe because Linux file descriptors bind to inodes, not paths, so the child keeps writing to the same inode under the new pid-suffixed name.

Once the pipe was eliminated, the admin-socket-deadlock cascades disappeared.

2. Coalesce controller broadcasts into a single reload

Under autoscaling churn the Serve controller fires target_groups and fallback_targets broadcasts independently, often only tens of ms apart. Without coalescing, each broadcast triggers its own config regeneration and graceful reload via -sf.

HAProxyManager now marks state dirty on each broadcast and arms a single sleeping coalesce task; updates arriving during the sleep window are absorbed into the same pending apply. Window defaults to 100 ms via RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S; set to 0 to disable. If an apply fails, the task re-arms and retries on the next tick (up to 3 consecutive failures, then waits for a new broadcast — avoids busy-spinning on a persistent error).

3. option redispatch + retry-on conn-failure empty-response in defaults

Under churn, the first connect attempt to a slot can hit a replica that is briefly down (just-removed, mid-restart, scaling-up slot not yet accepting). Today HAProxy returns the resulting 502/503 directly to the client even though a peer slot is up and ready. The two directives — added to the defaults block so every backend inherits them — make the retry transparent:

  • option redispatch sends the retry to a different slot, not the same one.
  • retry-on conn-failure empty-response restricts retries to cases that cannot have leaked partial bytes to the client: TCP connect failure (no bytes sent) and empty response (slot died before sending a body). Streaming responses with a partial body are untouched.

option redispatch was previously set only on the *-via-ingress-request-router backend; the primary routing path had no retry semantics, which is why churn-class 5xx surfaced to clients.

Out of scope

The bigger PR (#63159) includes additional architectural changes (server-template slot pools, runtime-API server updates, plus other refinements). This PR is deliberately the minimal high-impact subset for an easier review and lower-risk merge.

Other knobs explored during the investigation (option abortonclose removal, timeout server / timeout connect defaults, hard-stop-after bump) are intentionally not part of this PR — those are better addressed client-side for the load-test workload rather than as new HAProxy defaults.

🤖 Generated with Claude Code

@harshit-anyscale harshit-anyscale requested a review from a team as a code owner May 12, 2026 18:57
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several stability improvements for HAProxy within Ray Serve. Key changes include increasing the hard stop timeout to 1800 seconds and establishing default server and connection timeouts to prevent undefined behavior during reloads. Additionally, the option abortonclose setting was removed from the configuration. A critical fix involves the implementation of a background task to drain HAProxy's stderr into a per-PID log file, which prevents potential deadlocks caused by the OS pipe buffer filling up. Feedback on these changes suggests increasing the logging severity from DEBUG to WARNING if the stderr drainer fails, ensuring that any failure in this mechanism is visible since it could lead back to the deadlock issue.

Comment thread python/ray/serve/_private/haproxy.py Outdated
Comment thread python/ray/serve/_private/haproxy.py Outdated
Comment thread python/ray/serve/_private/haproxy.py Outdated
@ray-gardener ray-gardener Bot added the serve Ray Serve Related Issue label May 12, 2026
Comment thread python/ray/serve/_private/haproxy.py Outdated
@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 3c44db2 to 002fd33 Compare May 13, 2026 15:29
Comment thread python/ray/serve/_private/haproxy.py Outdated
@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 002fd33 to 5bb2f5b Compare May 13, 2026 17:45
Comment thread python/ray/serve/_private/haproxy.py Outdated
@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 5bb2f5b to 3a30a51 Compare May 13, 2026 19:18
@harshit-anyscale harshit-anyscale changed the title [serve] HAProxy stability fixes: stderr drain, abortonclose, timeout defaults, hard-stop [serve] HAProxy stability fixes: stderr drain, abortonclose, broadcast coalescing May 13, 2026
harshit-anyscale and others added 2 commits May 13, 2026 19:52
HAProxy runs in `-db` debug mode and emits hundreds of stderr lines per
second under load. The proxy actor captured stderr via `stderr=PIPE` but
never read from it, so the 64KB OS pipe buffer fills in seconds and
HAProxy blocks on its next `write(2)` — including from threads serving
the admin socket. Runtime-API calls then time out and look like a
deadlock from outside.

Spawn a fire-and-forget asyncio task right after startup that drains
stderr to a per-pid log file. Output is bounded to 10 MB per pid via
truncate-on-roll so a long-lived proc cannot fill the disk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Serve controller emits TARGET_GROUPS and FALLBACK_TARGETS long-poll
updates independently, often within tens of milliseconds of each other
during autoscaling churn. Each broadcast was kicking off its own reload,
serializing on the reload lock and amplifying replica thrash into proxy
churn.

Introduce a coalescing window: a broadcast marks state dirty and starts
(or extends) a single asyncio task that sleeps the window and then
applies the latest state. Default window is 100ms; tunable via
`RAY_SERVE_HAPROXY_BROADCAST_COALESCE_S`, and setting it to 0 falls back
to the prior synchronous behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from 3a30a51 to e61e719 Compare May 13, 2026 19:53
@harshit-anyscale harshit-anyscale changed the title [serve] HAProxy stability fixes: stderr drain, abortonclose, broadcast coalescing [serve] HAProxy stability fixes: stderr drain, broadcast coalescing May 13, 2026
Under autoscaling churn, the first connect attempt to a slot can hit a
replica that is briefly down (just-removed, mid-restart, scaling-up
slot not yet accepting). Today HAProxy returns the resulting 502/503
directly to the client even though a peer slot is up and ready.

Add two defaults-level directives so requests transparently retry to a
healthy peer:

  - `option redispatch` — when a retry is allowed, send it to a
    different slot (not the same one).
  - `retry-on conn-failure empty-response` — only retry in cases that
    cannot have leaked partial bytes to the client: TCP connect
    failure (no bytes sent) and empty response (slot died before
    sending a body). Streaming responses with a partial body are
    untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshit-anyscale harshit-anyscale changed the title [serve] HAProxy stability fixes: stderr drain, broadcast coalescing [serve] HAProxy stability fixes: stderr drain, broadcast coalescing, redispatch May 14, 2026
harshit-anyscale and others added 2 commits May 14, 2026 18:01
After enabling `option redispatch` + `retry-on conn-failure
empty-response` in the HAProxy defaults block, HAProxy retries a
failed connect to the next slot. When every slot is down (which is the
scenario this test creates by killing the only replica), retries
exhaust and HAProxy returns its built-in 503 "Service Unavailable"
instead of surfacing the original 502.

The 502 errorfile maps 502/504 to 500, but it does not cover 503, so
the response the client sees changes from 500 to 503. Both are valid
error responses for "the service is currently not available"; loosen
the assertion to accept either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshit-anyscale harshit-anyscale added the go add ONLY when ready to merge, run all tests label May 14, 2026
@harshit-anyscale harshit-anyscale self-assigned this May 14, 2026
harshit-anyscale and others added 7 commits May 15, 2026 04:07
The previous comment described what the setting does ("window for
coalescing back-to-back controller broadcasts") but not why fast
reloads are harmful. Rewrite to lead with the actual concern:
reloading HAProxy at controller-broadcast speed (every few tens of
ms) bursts process handoffs, saturates the proxy actor's event loop,
and forces overlapping drains on the old proc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the asyncio drain task with a direct file redirect: open a
stderr destination file before spawn, pass it as `stderr=...` so the
kernel uses dup2 to wire the child's fd 2 to the file, then rename the
file to include the child's pid after spawn returns.

No pipe is created, so the 64KB kernel buffer that previously
deadlocked admin-socket threads when stderr backed up can't form. The
fix is structural rather than behavioral — there is no buffer to fill
because there is no consumer to be slow.

Linux file descriptors bind to inodes rather than paths, so renaming
the file after spawn is safe: the child keeps writing to the same
inode under the new name.

Trade-off: the previous drain task capped each file at 10MB via a
periodic truncate. Under the file-redirect model the file grows
unbounded until the proc exits. Acceptable for now given realistic
stderr volumes; a periodic size-cap task can be layered on later if
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shorter wording for two comments that were doing more explaining than
the code needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Down from 9 lines to 4 — keeps the why (overlapping `-sf` saturates
the actor) and drops the rest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When `_update_haproxy_backends` threw, `_update_pending` had already
been flipped to False before the apply, so the loop would exit and the
HAProxy state would stay stale until the next broadcast — which may
never arrive in a settled cluster.

Re-arm `_update_pending` in the except branch so a failed apply is
retried on the next coalesce tick. Cap retries at 3 consecutive
failures to avoid busy-looping on a persistent error (e.g. a bad
config HAProxy refuses to parse); past that, wait for the next
broadcast rather than spinning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the "overlapping `-sf` handoffs saturate the proxy actor's
event loop" line — we never confirmed that as the failure mode. Keep
the observable problem: under churn, broadcasts can fire reloads
tens of ms apart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 lines → 3 lines. Keeps the surprising bit (rename safety via
fd-binds-to-inode).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshit-anyscale harshit-anyscale changed the title [serve] HAProxy stability fixes: stderr drain, broadcast coalescing, redispatch [serve] HAProxy stability fixes: stderr file redirect, broadcast coalescing, redispatch May 15, 2026
Comment thread python/ray/serve/_private/haproxy.py Outdated
@harshit-anyscale harshit-anyscale force-pushed the serve-haproxy-minimal-fixes branch from c1f08b0 to 45d6acf Compare May 15, 2026 07:45
harshit-anyscale and others added 3 commits May 15, 2026 08:28
After redirecting HAProxy's stderr to a file at spawn, `proc.stderr`
became None and `_wait_for_hap_availability` fell back to b"" when
the process crashed. The resulting RuntimeError message was just
"exit code N" with no diagnostic — even though HAProxy had written
the real error (bind failure, config parse error, etc.) to the
on-disk stderr log file.

Read the tail of the per-pid stderr file when proc.stderr is None,
so startup-crash errors carry the actual reason again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pid scheme required an awkward open-with-temp-name then os.replace
because the child's pid isn't known until create_subprocess_exec
returns. Switch to a monotonic spawn counter held on the HAProxyApi
instance:

- The stderr path is known pre-spawn, so the file opens at its final
  name and stays there.
- No more `stderr.starting.log` intermediate, no os.replace, no rename
  race window.
- The path is attached to the proc as `_stderr_path` and travels with
  it, so `_wait_for_hap_availability` reads from the right file
  whether it was called on a fresh spawn or on an old proc during
  graceful reload.

A new info-level log at spawn time emits the (spawn#, pid, stderr
path, args) tuple so the pid → stderr-file mapping is recoverable
from the actor log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`_read_stderr_log_tail` was used in exactly one place and existed
only to give a name to "open file, seek to end, read last 4KB."
Inline it into `_wait_for_hap_availability` so the entire
crash-message logic is visible at the call site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 247b07c. Configure here.

# Retry to a different slot on safe failures only: connect failed
# (no bytes sent) or empty response (slot died before sending body).
option redispatch
retry-on conn-failure empty-response
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retry exhaustion exposes 503 bypassing error normalization to 500

Medium Severity

Adding retry-on conn-failure empty-response changes the failure path: previously a connection failure produced a 502 (caught by errorfile 502 and normalized to 500), but now retry exhaustion produces a 503 which has no matching errorfile 503 directive. This breaks the stated design goal of normalizing proxy errors to 500 and exposes clients to raw 503 responses with HAProxy's built-in error page instead of the expected "Internal Server Error" body. The test in test_http_routes.py was loosened to accept in (500, 503) rather than fixing the gap.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 247b07c. Configure here.

@kouroshHakha
Copy link
Copy Markdown
Contributor

Great work @harshit-anyscale. We should potentially merge this after we enable ha proxy for the performance tests and let those cook a bit (#63386). I want to make sure we are not trading off stability for speed.

@harshit-anyscale
Copy link
Copy Markdown
Contributor Author

We should potentially merge this after we enable ha proxy for the performance tests and let those cook a bit (#63386)

makes sense; cool, will plan to merge it after 3-4 days of merging of #63386 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants