[serve] Default RAY_SERVE_HAPROXY_TCP_NODELAY to 1 #63353

Merged
kouroshHakha merged 3 commits into ray-project:master from kouroshHakha:kh/serve-haproxy-tcp-nodelay-default
May 15, 2026

@kouroshHakha
Contributor

@kouroshHakha kouroshHakha commented May 14, 2026

Summary

  • Flip the RAY_SERVE_HAPROXY_TCP_NODELAY default from "0" to "1". Setting it to "1" causes HAProxy to emit option http-no-delay, which sets the TCP_NODELAY socket option (= Nagle's algorithm disabled) on both client- and server-facing sockets of every frontend/backend.
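
As a rough illustration of the toggle described above, here is a minimal sketch. `render_defaults` is a hypothetical stand-in for the real Jinja template in haproxy_templates.py, and the `mode http` line is only illustrative; the one directive taken from this PR is `option http-no-delay`.

```python
def render_defaults(tcp_nodelay: bool) -> str:
    """Render an illustrative HAProxy `defaults` section.

    Hypothetical helper, not Ray Serve's actual template: it only shows
    that the directive is emitted when the flag is on and omitted otherwise.
    """
    lines = ["defaults", "    mode http"]
    if tcp_nodelay:
        # The directive this PR turns on by default.
        lines.append("    option http-no-delay")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_defaults(tcp_nodelay=True))
```

With the flag off, the same call returns a section without the directive, which is the behaviour expected when RAY_SERVE_HAPROXY_TCP_NODELAY=0 is exported.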

Motivation

Streaming serving is the dominant Ray Serve workload today — streaming LLM completions, server-sent events, gRPC streaming. Streaming is hostile to Nagle's algorithm. When the upstream produces a small first chunk (e.g. the first SSE event of a chat completion stream), Nagle holds it in the kernel's TCP send buffer waiting for either:

  1. enough data to accumulate to fill an MSS-sized packet, or
  2. the in-flight segment to be acknowledged, or
  3. a timer to expire (in practice the peer's delayed-ACK timer, typically ~40 ms and up to ~200 ms on Linux).

That wait time lands directly in time-to-first-byte / time-to-first-token for every streamed request.
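
For readers less familiar with the socket option involved, the sketch below shows what disabling Nagle's algorithm looks like on a plain TCP socket using Python's standard library. HAProxy does the equivalent internally; the helper names here are illustrative only and are not part of Ray Serve or HAProxy.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the kernel to send small writes immediately
    # instead of holding them back to coalesce with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Return True if TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = make_nodelay_socket()
    print("TCP_NODELAY enabled:", nodelay_enabled(s))
    s.close()
```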

The previous default ("0") was chosen for bulk-transfer HTTP proxying, where packet coalescing helps. For the streaming workloads that have become Ray Serve's primary use case, that default is a poor fit.

Benchmark comparison (TCP_NODELAY=1 vs =0)

Both runs use HAProxy + pytest_serve_microbenchmarks --run-all on the same commit (5124dde36c). Δ is relative to TCP=0; negative = TCP=1 is faster / lower / more stable.

Wins for TCP=1

| Workload | Metric | TCP=0 | TCP=1 | Δ |
|---|---|---|---|---|
| HTTP 10 MB | p90 latency | 6.57 ms | 4.91 ms | −25.3% |
| HTTP 10 MB | p95 latency | 7.19 ms | 5.05 ms | −29.7% |
| HTTP 10 MB | p99 latency | 7.72 ms | 5.67 ms | −26.5% |
| HTTP small | p50 latency | 2.24 ms | 2.14 ms | −4.6% |
| HTTP small | p90 / p95 / p99 latency | | | −3.0% to −3.2% |
| HTTP streaming | throughput std-dev | 1410 | 312 | −77.9% (much more stable) |
| Most paths | throughput std-dev | | | mostly −5% to −25% (tighter) |

Regressions / no-change for TCP=1

| Workload | Metric | TCP=0 | TCP=1 | Δ |
|---|---|---|---|---|
| HTTP streaming | p50 / p99 latency | 13.2 s / 13.6 s | 13.7 s / 14.0 s | +2.5% to +3.4% |
| HTTP streaming | avg TPS | 37 991 | 36 232 | −4.6% |
| gRPC small | p99 latency | 2.11 ms | 2.24 ms | +5.8% |
| gRPC 1 MB | p99 latency | 8.07 ms | 8.64 ms | +7.0% |
| Most paths | RPS / TPS means | | | within ±1.6% (noise) |
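
The Δ column is a relative change against the TCP=0 baseline. A minimal sketch of that calculation follows; `relative_delta_pct` is an illustrative name, not something from the benchmark harness. The reported Δ values were presumably computed from unrounded measurements, so recomputing from the rounded numbers shown here can differ in the last digit for some rows.

```python
def relative_delta_pct(baseline: float, candidate: float) -> float:
    """Percent change of `candidate` relative to `baseline`.

    Negative means the candidate (TCP=1) is lower than the baseline (TCP=0).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # HTTP 10 MB p90 latency: 6.57 ms (TCP=0) vs 4.91 ms (TCP=1)
    print(round(relative_delta_pct(6.57, 4.91), 1))  # -25.3
    # HTTP streaming throughput std-dev: 1410 vs 312
    print(round(relative_delta_pct(1410, 312), 1))   # -77.9
```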

Reading

  • Clear win: HTTP large-payload tail latency (http_10mb_*) drops 25-30% at p90+, the classic multi-segment Nagle case.
  • Moderate win: HTTP small-payload latency improves 3-5% across all percentiles.
  • Variance win: streaming throughput std-dev drops 78% — much more predictable, even when the mean is slightly lower.
  • Caveat: HTTP streaming mean latency is +3% and mean throughput −4.6% with TCP=1. Tail-latency and variance both improve, which is what most streaming consumers actually care about, but this benchmark's particular --run-streaming shape doesn't cleanly capture TTFT.
  • gRPC small-payload p99 is slightly worse (+3-7%); larger payloads neutral to better.
  • Handle (in-process, doesn't traverse HAProxy) is within noise.

Test plan

  • Manual: deploy a streaming Serve LLM app behind HAProxy, observe rendered haproxy.cfg now contains option http-no-delay by default.
  • Manual: deploy with RAY_SERVE_HAPROXY_TCP_NODELAY=0 exported, observe option http-no-delay is absent.
  • No code paths consume the constant beyond the existing Jinja template in haproxy_templates.py, so existing tests for HAProxy config generation continue to cover the wiring.
  • Release pytest_serve_microbenchmarks.aws against HAProxy, with TCP_NODELAY=1 and =0 (see Benchmark comparison above).
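
The two manual checks could be automated along these lines. This is a sketch under assumptions: `has_http_no_delay` is a hypothetical helper that inspects the text of a rendered haproxy.cfg, and the sample config strings are made up for illustration.

```python
def has_http_no_delay(config_text: str) -> bool:
    """Return True if the config contains an active `option http-no-delay` line."""
    for raw in config_text.splitlines():
        # Drop trailing comments; HAProxy treats '#' as a comment marker.
        line = raw.split("#", 1)[0].strip()
        if line.split() == ["option", "http-no-delay"]:
            return True
    return False


if __name__ == "__main__":
    enabled_cfg = "defaults\n    mode http\n    option http-no-delay\n"
    disabled_cfg = "defaults\n    mode http\n    # option http-no-delay\n"
    print(has_http_no_delay(enabled_cfg))   # True
    print(has_http_no_delay(disabled_cfg))  # False
```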

🤖 Generated with Claude Code

Streaming serving (streaming LLM completions, SSE, gRPC streaming) is the
dominant Ray Serve workload, and it is hostile to Nagle's algorithm: when
the upstream emits a small first chunk (e.g. the first SSE event), the
kernel holds it in the TCP send buffer waiting for either more data or
the delayed-ACK timer. That wait time lands directly in time-to-first-
token / time-to-first-byte for every streamed request.

Flip the default for the HAProxy proxy from off to on. The flag is
controlled by RAY_SERVE_HAPROXY_TCP_NODELAY; set to '0' to restore
coalescing for non-streaming workloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kouroshHakha kouroshHakha requested a review from a team as a code owner May 14, 2026 22:06
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enables RAY_SERVE_HAPROXY_TCP_NODELAY by default and adds documentation explaining its benefits for streaming workloads, such as LLM completions and gRPC, by avoiding latency issues caused by Nagle's algorithm. The reviewer suggested using the get_env_bool utility function to maintain consistency with other environment variables in the file and ensure proper validation.

Comment thread python/ray/serve/_private/constants.py Outdated
@kouroshHakha kouroshHakha marked this pull request as draft May 14, 2026 22:34
Switch the env-var read to the shared get_env_bool helper for
consistency with the other boolean RAY_SERVE_* constants in this file,
and to pick up the RAY_SERVE_ prefix validation it provides.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
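
To illustrate the idea behind this commit, here is a minimal sketch of a boolean env-var reader with prefix validation. `read_bool_env` is a hypothetical function written for this example; it is not Ray's `get_env_bool`, whose exact signature and accepted values may differ.

```python
import os


def read_bool_env(name: str, default: str) -> bool:
    """Read a "0"/"1" style boolean from the environment.

    Hypothetical illustration: rejects names without the RAY_SERVE_ prefix
    and treats exactly "1" as True.
    """
    if not name.startswith("RAY_SERVE_"):
        raise ValueError(f"{name!r} must start with 'RAY_SERVE_'")
    return os.environ.get(name, default) == "1"


if __name__ == "__main__":
    # With the variable unset, the new default "1" yields True.
    os.environ.pop("RAY_SERVE_HAPROXY_TCP_NODELAY", None)
    print(read_bool_env("RAY_SERVE_HAPROXY_TCP_NODELAY", "1"))
```

Exporting RAY_SERVE_HAPROXY_TCP_NODELAY=0 before the call would make this sketch return False, mirroring the opt-out described in the PR.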
@kouroshHakha kouroshHakha added the alpha (Alpha release features) and go (add ONLY when ready to merge, run all tests) labels and removed the alpha (Alpha release features) label May 15, 2026
@kouroshHakha kouroshHakha marked this pull request as ready for review May 15, 2026 06:37
@ray-gardener ray-gardener Bot added the serve (Ray Serve Related Issue) label May 15, 2026
…ault

With RAY_SERVE_HAPROXY_TCP_NODELAY now defaulting to "1", the rendered
HAProxy defaults section emits 'option http-no-delay'. Update the
expected_config string in test_generate_config_file_internal so the
exact-match comparison reflects the new template output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@abrarsheikh
Contributor

Let's kick off a full perf regression test and compare master against this change to make sure it does not degrade other metrics.

@kouroshHakha
Contributor Author

kouroshHakha commented May 15, 2026

Run A: RAY_SERVE_ENABLE_HA_PROXY=1 + RAY_SERVE_HAPROXY_TCP_NODELAY=1

https://buildkite.com/ray-project/release/builds/92975

Run B: RAY_SERVE_ENABLE_HA_PROXY=1 + RAY_SERVE_HAPROXY_TCP_NODELAY=0

https://buildkite.com/ray-project/release/builds/92973

@kouroshHakha kouroshHakha merged commit ae6ae5c into ray-project:master May 15, 2026
6 checks passed
kouroshHakha added a commit to kouroshHakha/ray that referenced this pull request May 16, 2026
Adds RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1 to the haproxy variant's
runtime_env and --run-direct-streaming to the command line, so the
nightly haproxy row exercises the optimal-latency configuration end to
end: HAProxy + TCP_NODELAY=1 (default in ray-project#63353) + THROUGHPUT_OPTIMIZED=1
+ direct streaming via ingress_request_router bypass.

Depends on ray-project#63391, which defines --run-direct-streaming and
DirectStreamingRouter. Without ray-project#63391 merged, this command line will
fail at click parse time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akyang-anyscale pushed a commit to kouroshHakha/ray that referenced this pull request May 20, 2026
Adds RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1 to the haproxy variant's
runtime_env and --run-direct-streaming to the command line, so the
nightly haproxy row exercises the optimal-latency configuration end to
end: HAProxy + TCP_NODELAY=1 (default in ray-project#63353) + THROUGHPUT_OPTIMIZED=1
+ direct streaming via ingress_request_router bypass.

Depends on ray-project#63391, which defines --run-direct-streaming and
DirectStreamingRouter. Without ray-project#63391 merged, this command line will
fail at click parse time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Labels

go (add ONLY when ready to merge, run all tests), serve (Ray Serve Related Issue)
