[serve] Default RAY_SERVE_HAPROXY_TCP_NODELAY to 1 by kouroshHakha · Pull Request #63353 · ray-project/ray

kouroshHakha · 2026-05-14T22:06:23Z

Summary

Flip RAY_SERVE_HAPROXY_TCP_NODELAY default from "0" → "1". Setting it to "1" causes HAProxy to emit option http-no-delay, which sets the TCP_NODELAY socket option (= Nagle's algorithm disabled) on both client- and server-facing sockets of every frontend/backend.
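As a rough illustration of the behavior described above, the sketch below gates the `option http-no-delay` line on a boolean. It is not the actual Jinja template in haproxy_templates.py: the function name `render_defaults` and the surrounding `mode http` line are assumptions made for the example.

```python
def render_defaults(tcp_nodelay: bool) -> str:
    """Render a toy HAProxy `defaults` section (hypothetical helper).

    The real template in Ray Serve is a Jinja template with many more
    directives; only the conditional emission is sketched here.
    """
    lines = ["defaults", "    mode http"]
    if tcp_nodelay:
        # Emitted only when the flag is on, matching the PR description.
        lines.append("    option http-no-delay")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_defaults(True))
```

With `tcp_nodelay=False` the directive is simply absent, which is the old default behavior.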

Motivation

Streaming serving is the dominant Ray Serve workload today: streaming LLM completions, server-sent events, gRPC streaming. Streaming is hostile to Nagle's algorithm. When the upstream produces a small first chunk (e.g. the first SSE event of a chat completion stream), Nagle holds it in the kernel's TCP send buffer until one of the following happens:

enough data accumulates to fill an MSS-sized packet, or
the in-flight segment is acknowledged, or
a kernel timer fires (~40 ms delayed-ACK timer, up to ~200 ms full Nagle timer).

That wait time lands directly in time-to-first-byte / time-to-first-token for every streamed request.
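For readers unfamiliar with the socket option itself, here is a minimal sketch of what disabling Nagle's algorithm looks like at the socket level, using Python's standard socket module. This only illustrates TCP_NODELAY on an application socket; HAProxy sets the option internally, and the helper names below are made up for the example.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the kernel to send small segments immediately
    # instead of coalescing them while earlier data is unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = make_nodelay_socket()
    print("TCP_NODELAY enabled:", nodelay_enabled(s))
    s.close()
```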

The old default ("0") was chosen for bulk-transfer HTTP proxying, where packet coalescing helps. For the streaming workloads that have become Ray Serve's primary use case, it is the wrong fit.

Benchmark comparison (TCP_NODELAY=1 vs =0)

Both runs use HAProxy + pytest_serve_microbenchmarks --run-all on the same commit (5124dde36c). Δ is relative to TCP=0; negative = TCP=1 is faster / lower / more stable.

Run A — TCP_NODELAY=1 (new default): https://buildkite.com/ray-project/release/builds/92975
Run B — TCP_NODELAY=0 (old default): https://buildkite.com/ray-project/release/builds/92973

Wins for TCP=1

Workload	Metric	TCP=0	TCP=1	Δ
HTTP 10 MB	p90 latency	6.57 ms	4.91 ms	−25.3%
HTTP 10 MB	p95 latency	7.19 ms	5.05 ms	−29.7%
HTTP 10 MB	p99 latency	7.72 ms	5.67 ms	−26.5%
HTTP small	p50 latency	2.24 ms	2.14 ms	−4.6%
HTTP small	p90 / p95 / p99 latency	—	—	−3.0% to −3.2%
HTTP streaming	throughput std-dev	1410	312	−77.9% (much more stable)
Throughput std-dev (most paths)	—	—	—	mostly −5% to −25% (tighter)

Regressions / no-change for TCP=1

Workload	Metric	TCP=0	TCP=1	Δ
HTTP streaming	p50 / p99 latency	13.2 s / 13.6 s	13.7 s / 14.0 s	+2.5% to +3.4%
HTTP streaming	avg TPS	37 991	36 232	−4.6%
gRPC small	p99 latency	2.11 ms	2.24 ms	+5.8%
gRPC 1 MB	p99 latency	8.07 ms	8.64 ms	+7.0%
RPS / TPS means (most paths)	—	—	—	within ±1.6% (noise)
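
The Δ column in the tables above is a relative change against the TCP=0 run. A minimal sketch of that arithmetic follows; the function name is an assumption, and because the inputs are the rounded values from the tables, a recomputed Δ can differ from the published one by about 0.1 percentage points.

```python
def relative_delta_pct(baseline: float, candidate: float) -> float:
    """Percent change of candidate (TCP=1) relative to baseline (TCP=0)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


if __name__ == "__main__":
    # (label, TCP=0 value, TCP=1 value) taken from the tables above.
    rows = [
        ("HTTP 10 MB p90 latency (ms)", 6.57, 4.91),
        ("HTTP streaming throughput std-dev", 1410, 312),
        ("HTTP streaming avg TPS", 37991, 36232),
    ]
    for label, tcp0, tcp1 in rows:
        print(f"{label}: {relative_delta_pct(tcp0, tcp1):+.1f}%")
```

For latency and std-dev a negative Δ favors TCP=1; for TPS a negative Δ is a regression, so the sign has to be read per metric.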

Reading

Clear win: HTTP large-payload tail latency (http_10mb_*) drops 25-30% at p90+, the classic multi-segment Nagle case.
Moderate win: HTTP small-payload latency improves 3-5% across all percentiles.
Variance win: streaming throughput std-dev drops 78% — much more predictable, even when the mean is slightly lower.
Caveat: HTTP streaming p50 / p99 latency is about +3% and mean throughput is −4.6% with TCP=1. Throughput variance improves, which is what most streaming consumers actually care about, but this benchmark's particular --run-streaming shape doesn't cleanly capture TTFT.
gRPC small-payload p99 is slightly worse (+3-7%); larger payloads neutral to better.
Handle (in-process, doesn't traverse HAProxy) is within noise.

Test plan

Manual: deploy a streaming Serve LLM app behind HAProxy, observe rendered haproxy.cfg now contains option http-no-delay by default.
Manual: deploy with RAY_SERVE_HAPROXY_TCP_NODELAY=0 exported, observe option http-no-delay is absent.
No code paths consume the constant beyond the existing Jinja template in haproxy_templates.py, so existing tests for HAProxy config generation continue to cover the wiring.
Release pytest_serve_microbenchmarks.aws against HAProxy, with TCP_NODELAY=1 and =0 (see Benchmark comparison above).
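
The two manual checks above could be scripted along these lines. This is a sketch only: the helper `has_directive` is hypothetical, and the location of the rendered haproxy.cfg is deployment-specific, so the example works on config text passed in as a string.

```python
def has_directive(config_text: str, directive: str) -> bool:
    """Return True if `directive` appears as its own line in the config.

    Trailing `#` comments are stripped and runs of whitespace are
    collapsed before comparing, so indentation does not matter.
    """
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0]
        if " ".join(line.split()) == directive:
            return True
    return False


if __name__ == "__main__":
    sample = "defaults\n    mode http\n    option http-no-delay\n"
    print(has_directive(sample, "option http-no-delay"))
```

In practice the text would come from reading the rendered haproxy.cfg, once with the default environment and once with RAY_SERVE_HAPROXY_TCP_NODELAY=0 exported.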

🤖 Generated with Claude Code

Streaming serving (streaming LLM completions, SSE, gRPC streaming) is the dominant Ray Serve workload, and it is hostile to Nagle's algorithm: when the upstream emits a small first chunk (e.g. the first SSE event), the kernel holds it in the TCP send buffer waiting for either more data or the delayed-ACK timer. That wait time lands directly in time-to-first-token / time-to-first-byte for every streamed request.

Flip the default for the HAProxy proxy from off to on. The flag is controlled by RAY_SERVE_HAPROXY_TCP_NODELAY; set to '0' to restore coalescing for non-streaming workloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request enables RAY_SERVE_HAPROXY_TCP_NODELAY by default and adds documentation explaining its benefits for streaming workloads, such as LLM completions and gRPC, by avoiding latency issues caused by Nagle's algorithm. The reviewer suggested using the get_env_bool utility function to maintain consistency with other environment variables in the file and ensure proper validation.

Switch the env-var read to the shared get_env_bool helper for consistency with the other boolean RAY_SERVE_* constants in this file, and to pick up the RAY_SERVE_ prefix validation it provides. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
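
To make the review suggestion concrete, here is an illustrative stand-in for a boolean env-var helper with prefix validation. It is not the real get_env_bool from Ray Serve: the name `read_bool_env`, its signature, and the rule that only "1" counts as true are all assumptions for this sketch.

```python
import os


def read_bool_env(name: str, default: str = "0") -> bool:
    """Read a boolean flag from the environment (hypothetical helper).

    Assumption: only the string "1" is treated as true, and names must
    carry the RAY_SERVE_ prefix, mirroring the validation mentioned above.
    """
    if not name.startswith("RAY_SERVE_"):
        raise ValueError(f"{name!r} must start with 'RAY_SERVE_'")
    return os.environ.get(name, default) == "1"


if __name__ == "__main__":
    # With the new default of "1", an unset variable reads as enabled.
    print(read_bool_env("RAY_SERVE_HAPROXY_TCP_NODELAY", "1"))
```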

…ault With RAY_SERVE_HAPROXY_TCP_NODELAY now defaulting to "1", the rendered HAProxy defaults section emits 'option http-no-delay'. Update the expected_config string in test_generate_config_file_internal so the exact-match comparison reflects the new template output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

abrarsheikh · 2026-05-15T16:23:03Z

Let's kick off a full perf regression test and compare master against this change to make sure it does not degrade other metrics.

kouroshHakha · 2026-05-15T17:13:03Z

Run A: RAY_SERVE_ENABLE_HA_PROXY=1 + RAY_SERVE_HAPROXY_TCP_NODELAY=1

https://buildkite.com/ray-project/release/builds/92975

Run B: RAY_SERVE_ENABLE_HA_PROXY=1 + RAY_SERVE_HAPROXY_TCP_NODELAY=0

https://buildkite.com/ray-project/release/builds/92973

Adds RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1 to the haproxy variant's runtime_env and --run-direct-streaming to the command line, so the nightly haproxy row exercises the optimal-latency configuration end to end: HAProxy + TCP_NODELAY=1 (default in ray-project#63353) + THROUGHPUT_OPTIMIZED=1 + direct streaming via ingress_request_router bypass. Depends on ray-project#63391, which defines --run-direct-streaming and DirectStreamingRouter. Without ray-project#63391 merged, this command line will fail at click parse time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

kouroshHakha requested a review from a team as a code owner May 14, 2026 22:06

gemini-code-assist Bot reviewed May 14, 2026

Comment thread python/ray/serve/_private/constants.py Outdated

kouroshHakha marked this pull request as draft May 14, 2026 22:34

kouroshHakha added the alpha (Alpha release features) and go (add ONLY when ready to merge, run all tests) labels and removed the alpha (Alpha release features) label May 15, 2026

kouroshHakha marked this pull request as ready for review May 15, 2026 06:37

ray-gardener Bot added the serve Ray Serve Related Issue label May 15, 2026

abrarsheikh approved these changes May 15, 2026

kouroshHakha merged commit ae6ae5c into ray-project:master May 15, 2026
6 checks passed

This was referenced May 15, 2026

[serve][release] Add HAProxy variant to throughput-optimized serve microbenchmarks #63386

Merged

[serve][release] Add TTFT and inter-token jitter metrics to streaming microbench #63391

Draft

kouroshHakha merged 3 commits into ray-project:master from kouroshHakha:kh/serve-haproxy-tcp-nodelay-default
