[serve] Default RAY_SERVE_HAPROXY_TCP_NODELAY to 1#63353
Merged
kouroshHakha merged 3 commits on May 15, 2026
Streaming serving (streaming LLM completions, SSE, gRPC streaming) is the dominant Ray Serve workload, and it is hostile to Nagle's algorithm: when the upstream emits a small first chunk (e.g. the first SSE event), the kernel holds it in the TCP send buffer waiting for either more data or the delayed-ACK timer. That wait time lands directly in time-to-first-token / time-to-first-byte for every streamed request.
Flip the default for the HAProxy proxy from off to on. The flag is controlled by RAY_SERVE_HAPROXY_TCP_NODELAY; set to '0' to restore coalescing for non-streaming workloads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Code Review
This pull request enables RAY_SERVE_HAPROXY_TCP_NODELAY by default and adds documentation explaining its benefits for streaming workloads, such as LLM completions and gRPC, by avoiding latency issues caused by Nagle's algorithm. The reviewer suggested using the get_env_bool utility function to maintain consistency with other environment variables in the file and ensure proper validation.
Switch the env-var read to the shared get_env_bool helper for consistency with the other boolean RAY_SERVE_* constants in this file, and to pick up the RAY_SERVE_ prefix validation it provides. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
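To illustrate the kind of read this commit describes, here is a minimal sketch of a boolean environment-variable reader with prefix validation. It is not Ray's get_env_bool: the name read_bool_env, its signature, and the rule that only "1" counts as true are assumptions made for the example.

```python
import os


def read_bool_env(name: str, default: str = "0") -> bool:
    """Hypothetical stand-in for a shared boolean env-var helper.

    Assumption: only the string "1" is treated as true. The real helper
    may parse values differently.
    """
    # Reject names outside the RAY_SERVE_ namespace so typos fail loudly.
    if not name.startswith("RAY_SERVE_"):
        raise ValueError(f"{name!r} must start with 'RAY_SERVE_'")
    return os.environ.get(name, default) == "1"


# With this PR's new default, an unset variable reads as enabled.
TCP_NODELAY_ENABLED = read_bool_env("RAY_SERVE_HAPROXY_TCP_NODELAY", "1")
```

Centralizing the read in one helper means every boolean flag shares the same parsing and validation rules, which is the consistency argument the reviewer made.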
…ault With RAY_SERVE_HAPROXY_TCP_NODELAY now defaulting to "1", the rendered HAProxy defaults section emits 'option http-no-delay'. Update the expected_config string in test_generate_config_file_internal so the exact-match comparison reflects the new template output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
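The effect on the rendered config, and why an exact-match test has to change with it, can be sketched as follows. The function render_defaults and the surrounding 'mode http' line are illustrative assumptions, not the contents of Ray's actual template.

```python
def render_defaults(tcp_nodelay: bool) -> str:
    """Render a toy HAProxy 'defaults' section (hypothetical template)."""
    lines = ["defaults", "    mode http"]
    if tcp_nodelay:
        # The directive is emitted only when the flag is enabled.
        lines.append("    option http-no-delay")
    return "\n".join(lines) + "\n"


# An exact-match expectation must include the directive once the
# default flips to enabled.
expected_config = "defaults\n    mode http\n    option http-no-delay\n"
assert render_defaults(tcp_nodelay=True) == expected_config
assert "option http-no-delay" not in render_defaults(tcp_nodelay=False)
```

Because the comparison is byte-for-byte, any change to a default that alters the template output requires a matching update to the expected string.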
Contributor
Let's kick off a full perf regression test and compare master against this change to make sure it does not degrade other metrics.
Contributor
Author
Run A: RAY_SERVE_ENABLE_HA_PROXY=1 + RAY_SERVE_HAPROXY_TCP_NODELAY=1
https://buildkite.com/ray-project/release/builds/92975
Run B: RAY_SERVE_ENABLE_HA_PROXY=1 + RAY_SERVE_HAPROXY_TCP_NODELAY=0
abrarsheikh approved these changes on May 15, 2026
This was referenced May 15, 2026
kouroshHakha added a commit to kouroshHakha/ray that referenced this pull request on May 16, 2026
Adds RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1 to the haproxy variant's runtime_env and --run-direct-streaming to the command line, so the nightly haproxy row exercises the optimal-latency configuration end to end: HAProxy + TCP_NODELAY=1 (default in ray-project#63353) + THROUGHPUT_OPTIMIZED=1 + direct streaming via ingress_request_router bypass. Depends on ray-project#63391, which defines --run-direct-streaming and DirectStreamingRouter. Without ray-project#63391 merged, this command line will fail at click parse time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akyang-anyscale pushed a commit to kouroshHakha/ray that referenced this pull request on May 20, 2026
Adds RAY_SERVE_LLM_ENABLE_DIRECT_STREAMING=1 to the haproxy variant's runtime_env and --run-direct-streaming to the command line, so the nightly haproxy row exercises the optimal-latency configuration end to end: HAProxy + TCP_NODELAY=1 (default in ray-project#63353) + THROUGHPUT_OPTIMIZED=1 + direct streaming via ingress_request_router bypass. Depends on ray-project#63391, which defines --run-direct-streaming and DirectStreamingRouter. Without ray-project#63391 merged, this command line will fail at click parse time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
Summary
Changes the RAY_SERVE_HAPROXY_TCP_NODELAY default from "0" to "1". Setting it to "1" causes the rendered HAProxy config to include 'option http-no-delay', which sets the TCP_NODELAY socket option (i.e. disables Nagle's algorithm) on both client- and server-facing sockets of every frontend/backend.

Motivation
Streaming serving is the dominant Ray Serve workload today: streaming LLM completions, server-sent events, gRPC streaming. Streaming is hostile to Nagle's algorithm. When the upstream produces a small first chunk (e.g. the first SSE event of a chat completion stream), Nagle holds it in the kernel's TCP send buffer waiting for either:
- more data to coalesce with it, or
- the delayed-ACK timer.
That wait time lands directly in time-to-first-byte / time-to-first-token for every streamed request.
The previous default (off) was chosen for bulk-transfer HTTP proxying, where packet coalescing helps. For the streaming workloads that have become Ray Serve's primary use case, that default is the wrong fit.
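For readers unfamiliar with the underlying socket option, the following minimal sketch shows what TCP_NODELAY looks like at the socket level, using only Python's standard library. It illustrates the option itself, not how HAProxy applies it internally; the helper names are made up for the example.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the kernel to send small writes immediately
    # instead of holding them back to coalesce with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

A freshly created TCP socket normally has the option off, which is why a proxy has to opt in explicitly for small streamed chunks to leave without waiting.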
Benchmark comparison (TCP_NODELAY=1 vs =0)
Both runs use HAProxy + pytest_serve_microbenchmarks --run-all on the same commit (5124dde36c). Δ is relative to TCP=0; negative = TCP=1 is faster / lower / more stable.

Wins for TCP=1
Regressions / no-change for TCP=1
Reading
- …http_10mb_*) drops 25-30% at p90+, the classic multi-segment Nagle case.
- …--run-streaming shape doesn't cleanly capture TTFT.

Test plan
- haproxy.cfg now contains 'option http-no-delay' by default.
- …RAY_SERVE_HAPROXY_TCP_NODELAY=0 exported, observe 'option http-no-delay' is absent.
- …haproxy_templates.py, so existing tests for HAProxy config generation continue to cover the wiring.
- …pytest_serve_microbenchmarks.aws against HAProxy, with TCP_NODELAY=1 and =0 (see Benchmark comparison above).

🤖 Generated with Claude Code