[CI][Serve] Add dedicated Buildkite target for HAProxy tests #60914
abrarsheikh merged 18 commits into ray-project:master
Exclude haproxy-tagged tests from 5 general serve test steps and add a dedicated HAProxy step that runs 40 test targets (haproxy-specific + standard serve tests) with RAY_SERVE_ENABLE_HA_PROXY=1, mirroring the rayturbo CI configuration. Signed-off-by: Gene Su <gene@anyscale.com> Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Force-pushed from 98772c5 to 3a1f82c.
--test-env=RAY_SERVE_ENABLE_HA_PROXY=1
--test-env=RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S=0.01
--test-env=RAY_SERVE_DISABLE_SHUTTING_DOWN_INGRESS_REPLICAS_FORCEFULLY=0
--test-env=SERVE_SOCKET_REUSE_PORT_ENABLED=1
I am surprised by this SERVE_SOCKET_REUSE_PORT_ENABLED=1; adding it to the list of things to fix later.
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- Add app_name="" to TargetGroup in _get_proxy_target_groups() to match rayturbo (commit 85adcc0), fixing 3 parametrized variants of test_get_serve_instance_details_json_serializable
- Increase wait_for_condition timeout to 30s in test_num_replicas_auto_basic to match rayturbo, fixing timeout under HAProxy CI

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Revert proactive app_name="" addition to TargetGroup constructors in controller.py — Pydantic exclude_unset=True already handles serialization correctly without explicit empty string. Add HAProxy process cleanup to the serve_ha fixture in test_gcs_failure.py. When GCS is killed mid-test, the HAProxy manager actor dies without cleaning up its subprocess. Orphaned HAProxy processes hold the port 8000 socket and serve stale configs, causing 404s in subsequent tests. This matches the cleanup pattern already used in test_haproxy.py and test_haproxy_api.py. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
PR ray-project#60823 throttled the serve_deployment_replica_healthy gauge with a 10s cache TTL and bumped the timeout in test_metrics.py to 40s, but missed applying the same fix to the HAProxy variant. The default 10s wait_for_condition timeout is too short now that the gauge is cached. Increase the timeout to 40s to match test_metrics.py. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
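The interaction between the 10s cache TTL and the default 10s timeout can be sketched as follows. This is only an illustration under stated assumptions: `FakeClock`, `CachedGauge`, and `gauge_seen_within` are hypothetical names, and `wait_for_condition` here is a simplified stand-in, not Ray's test helper. The sketch models the worst case, where the value changes right after a refresh and is therefore not exported until a full TTL later.

```python
class FakeClock:
    """Deterministic clock so the sketch runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class CachedGauge:
    """Hypothetical gauge whose exported value refreshes at most once per TTL."""

    def __init__(self, clock, ttl_s):
        self._clock = clock
        self._ttl_s = ttl_s
        self._latest = 0
        self._exported = 0
        self._last_refresh = clock.time()

    def set(self, value):
        # Recorded immediately, but only exported on the next refresh.
        self._latest = value

    def read(self):
        if self._clock.time() - self._last_refresh >= self._ttl_s:
            self._exported = self._latest
            self._last_refresh = self._clock.time()
        return self._exported


def wait_for_condition(predicate, clock, timeout=10.0, retry_interval_s=0.5):
    """Simplified stand-in: poll until predicate() is true or the timeout elapses."""
    deadline = clock.time() + timeout
    while clock.time() < deadline:
        if predicate():
            return True
        clock.sleep(retry_interval_s)
    return False


def gauge_seen_within(timeout):
    """Return True if the healthy value becomes visible before the timeout."""
    clock = FakeClock()
    gauge = CachedGauge(clock, ttl_s=10.0)
    gauge.set(1)  # replica turns healthy right after a refresh (worst case)
    return wait_for_condition(lambda: gauge.read() == 1, clock, timeout=timeout)
```

In this model, `gauge_seen_within(10.0)` gives up just as the cache would refresh, while `gauge_seen_within(40.0)` leaves enough headroom to observe the refreshed value.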
Remove the pkill haproxy cleanup from serve_ha fixture — the orphaned process issue needs more investigation before landing a fix. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
- controller.py: Add app_name to TargetGroup in _get_proxy_target_groups
- test_direct_ingress.py: Revert if False guards to match upstream
- test_deploy_2.py: Increase wait_for_condition timeout to 30s
- test_deploy.py: Fix typo
- serve_hap_test_names.txt: Align with upstream test list

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…e-haproxy-port Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Matches upstream HAProxy CI config. Without this, direct ingress mode interferes with HAProxy routing during GCS failure tests. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Remove OSS-only branch that returned empty target groups when HAProxy was enabled and no apps were visible. This caused HAProxy to clear routes when GCS died and get_route_prefix returned None. Match upstream behavior of returning proxy_target_groups. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
This reverts commit dc99494. Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Now the file is identical to the parity implementation.
  )

- if user_exception is not None:
+ if user_exception is not None and not request_metadata.is_direct_ingress:
The function _handle_errors_and_metrics is identical to the parity implementation.
  commands:
    - bazel run //ci/ray_ci:test_in_docker -- //python/ray/serve/... //python/ray/tests/... serve
-       --except-tags post_wheel_build,gpu,ha_integration,serve_tracing,direct_ingress
+       --except-tags post_wheel_build,gpu,ha_integration,serve_tracing,direct_ingress,haproxy
Exclusion pattern also used in parity
      --python-version 3.10
    depends_on: servetracingbuild

+ - label: ":ray-serve: serve: HAProxy tests"
Parallelism, command, env vars identical to parity
Contents identical, minus the expected number
With the new test files in this PR, the contents of the test lists in this file are identical to the parity implementation.
shutdown_ref = self._actor_handle.shutdown.remote()
ray.get(shutdown_ref, timeout=5)

# Shutdown completed successfully, now kill the actor
Unhandled exception in kill() skips ray.kill() call
High Severity
The ray.get(shutdown_ref, timeout=5) call in kill() can raise GetTimeoutError (or RayActorError if the actor dies mid-shutdown), which would prevent the subsequent ray.kill() from ever executing. This leaves the proxy actor potentially alive when it was supposed to be force-killed. The exception also propagates up to ProxyStateManager.shutdown() and _stop_proxies_if_needed(), which iterate over proxies in a loop — an unhandled exception from one proxy's kill() would break the shutdown of all remaining proxies.
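One possible way to address this concern is to wrap the graceful shutdown so that the force-kill always runs. The sketch below is not the fix adopted in this PR, and it uses stand-ins rather than Ray: `FakeProxy`, `ShutdownTimeout`, `kill_proxy`, and `shutdown_all` are hypothetical names. In the real code, `graceful_shutdown` would correspond to `ray.get(shutdown_ref, timeout=5)` and `force_kill` to the `ray.kill()` call mentioned above.

```python
import logging

logger = logging.getLogger(__name__)


class ShutdownTimeout(Exception):
    """Stand-in for a timeout raised while waiting on graceful shutdown."""


class FakeProxy:
    """Hypothetical proxy wrapper used only to exercise the control flow."""

    def __init__(self, name, hang=False):
        self.name = name
        self.hang = hang
        self.killed = False

    def graceful_shutdown(self, timeout_s):
        if self.hang:
            raise ShutdownTimeout(f"{self.name} did not shut down in {timeout_s}s")

    def force_kill(self):
        self.killed = True


def kill_proxy(proxy, timeout_s=5.0):
    """Attempt a graceful shutdown, then force-kill regardless of the outcome."""
    try:
        proxy.graceful_shutdown(timeout_s)
    except Exception as exc:
        # Swallow the error so one stuck proxy cannot abort the caller's loop.
        logger.warning("Graceful shutdown of %s failed: %s", proxy.name, exc)
    finally:
        # Runs whether or not the graceful shutdown succeeded.
        proxy.force_kill()


def shutdown_all(proxies):
    """Kill every proxy; a failure in one does not skip the rest."""
    for proxy in proxies:
        kill_proxy(proxy)
```

With this shape, a proxy that times out during shutdown is still force-killed, and the remaining proxies in the loop are still processed.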
//python/ray/serve/tests:test_replica_sync_methods_with_run_sync_in_threadpool
//python/ray/serve/tests:test_request_timeout
//python/ray/serve/tests:test_streaming_response
//python/ray/serve/tests:test_target_capacity
test_controller_haproxy missing from HAProxy CI test list
Medium Severity
The test_controller_haproxy unit test (//python/ray/serve/tests/unit:test_controller_haproxy) has the haproxy Bazel tag and is now excluded from all general serve CI steps via --except-tags haproxy. However, it's not listed in serve_hap_test_names.txt, so the new HAProxy CI step won't run it either. This test has zero CI coverage. The PR test plan explicitly lists test_controller_haproxy as a test that should pass.
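The coverage gap described above can be checked mechanically. The following is a minimal sketch, assuming the tag mapping and the dedicated list have already been loaded into Python; `uncovered_targets` is a hypothetical helper, the tag assignments are illustrative, and the label used for `test_haproxy` is an assumed path. In the real repository the tags would come from the Bazel BUILD files and the list from `serve_hap_test_names.txt`.

```python
def uncovered_targets(target_tags, excluded_tags, dedicated_targets):
    """Return targets that are excluded by tag and missing from the dedicated list."""
    excluded = {
        target for target, tags in target_tags.items() if tags & excluded_tags
    }
    return sorted(excluded - set(dedicated_targets))


# Illustrative data only; real tags live in the BUILD files.
target_tags = {
    "//python/ray/serve/tests:test_haproxy": {"haproxy"},  # assumed label
    "//python/ray/serve/tests/unit:test_controller_haproxy": {"haproxy"},
    "//python/ray/serve/tests:test_request_timeout": set(),
}

# Subset of the dedicated HAProxy list, without the unit test.
dedicated_targets = [
    "//python/ray/serve/tests:test_haproxy",
    "//python/ray/serve/tests:test_request_timeout",
]

missing = uncovered_targets(target_tags, {"haproxy"}, dedicated_targets)
```

Here `missing` contains only the `test_controller_haproxy` target, which is the situation the comment flags: excluded from the general steps by `--except-tags haproxy` and absent from the dedicated list.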
Needed for #60914 Add test names file for HAProxy tests and update `CODEOWNERS` accordingly Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…#60953)

## Why are these changes needed?

`test_metrics_haproxy.py::test_replica_metrics_fields` is failing in postmerge.

- #60823 added a 10s cache on the health gauge, with `RAY_SERVE_REPLICA_HEALTH_GAUGE_REPORT_INTERVAL_S=0.1` in the metrics BUILD target to keep tests passing
- #60914 added a HAProxy BUILD target that re-runs serve tests with `RAY_SERVE_ENABLE_HA_PROXY=1`, but didn't carry over that env var
- Without it, the health gauge goes stale between scrapes and the test misses one deployment's metric

Fix: add the missing env var to the HAProxy target.

Example failure: https://buildkite.com/ray-project/postmerge/builds/15966#019c49dc-52b4-40dd-a67e-9f5ea5c61755

---------

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…ject#60914)

## Summary

- Exclude `haproxy`-tagged tests from 5 general serve test steps to avoid running them without HAProxy enabled
- Add a dedicated HAProxy CI step that runs 40 test targets (haproxy-specific + standard serve tests) with `RAY_SERVE_ENABLE_HA_PROXY=1`
- Test targets listed in `ci/ray_ci/serve_hap_test_names.txt`, following the same pattern as `serve_di_test_names.txt`

## Test plan

- [x] HAProxy-specific tests pass: `test_haproxy`, `test_haproxy_api`, `test_metrics_haproxy`, `test_controller_haproxy`
- [x] Standard serve tests pass with `RAY_SERVE_ENABLE_HA_PROXY=1` (40 targets in `serve_hap_test_names.txt`)
- [x] Existing serve CI steps unaffected (haproxy tag excluded)

---------

Signed-off-by: Gene Su <gene@anyscale.com>
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Ondrej Prenek <ondra.prenek@gmail.com>


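Summary

- Exclude `haproxy`-tagged tests from 5 general serve test steps to avoid running them without HAProxy enabled
- Add a dedicated HAProxy CI step that runs 40 test targets (haproxy-specific + standard serve tests) with `RAY_SERVE_ENABLE_HA_PROXY=1`
- Test targets listed in `ci/ray_ci/serve_hap_test_names.txt`, following the same pattern as `serve_di_test_names.txt`

Test plan

- [x] HAProxy-specific tests pass: `test_haproxy`, `test_haproxy_api`, `test_metrics_haproxy`, `test_controller_haproxy`
- [x] Standard serve tests pass with `RAY_SERVE_ENABLE_HA_PROXY=1` (40 targets in `serve_hap_test_names.txt`)
- [x] Existing serve CI steps unaffected (haproxy tag excluded)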