
[RayService][HA] Fix flaky tests #1823

Merged 3 commits into ray-project:master on Jan 11, 2024

Conversation

@kevin85421 (Member) commented on Jan 10, 2024:

Why are these changes needed?

Fix the flaky tests introduced by #1808. Note that the RayService tests were already flaky before #1808, but they became much more unstable afterward (failing roughly 5 times out of 6 runs). Even after this PR is merged, the tests remain about as flaky as they were before #1808, but it removes the additional flakiness that #1808 introduced.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -17,7 +17,8 @@ spec:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
@kevin85421 (Member, Author) commented on the changed line:

This ensures that the head and the worker each have at least one Serve replica. Since Ray 2.8.0, a Pod without a Ray Serve replica does not have a proxy actor, so its readiness probe fails and the RayServiceAddCREvent can't converge. See this line for more details.
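To make the failure mode concrete, here is a minimal sketch (not KubeRay's actual harness code; `is_ready` is a hypothetical stand-in for the kubelet readiness probe) of why the convergence wait times out when a Pod hosts no Serve replica:

    import time
    from typing import Callable, List

    def wait_for_all_pods_ready(
        pods: List[str],
        is_ready: Callable[[str], bool],  # hypothetical per-Pod readiness check
        timeout_s: float = 400.0,
        interval_s: float = 5.0,
    ) -> None:
        """Block until every Pod passes its readiness probe, or time out.

        Since Ray 2.8.0 the probe targets the Serve proxy. A Pod that hosts
        no Serve replica has no proxy actor, so is_ready() never returns
        True for it and this loop exhausts the timeout.
        """
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if all(is_ready(pod) for pod in pods):
                return
            time.sleep(interval_s)
        raise TimeoutError("Convergence wait timed out: some Pod never became ready")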

A contributor replied:

I think we should add this as a comment in the test to prevent it from becoming tribal knowledge (for example, if a future maintainer changes num_replicas back to 1 for whatever reason, the test will start flaking and nobody will remember why). Ideally, we should just make the test robust to both cases if it's not too hard.

However, if this issue with the readiness probe is only temporary, then it's fine.

@kevin85421 (Member, Author) replied:

> Ideally, we should just make the test robust to both cases if it's not too hard.

Agreed. I will expose more information in the RayService status later so that RayServiceAddCREvent doesn't need to wait until all Pods are ready.
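A hedged sketch of that follow-up idea, using the official `kubernetes` Python client (the `ray.io` group and `rayservices` plural are KubeRay's; the `v1alpha1` version, the CR name, and the exact status field inspected here are assumptions, since the status schema was still being extended):

    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    # Read the RayService CR and converge on controller-reported status
    # instead of waiting for every Pod to pass its readiness probe.
    rayservice = api.get_namespaced_custom_object(
        group="ray.io",
        version="v1alpha1",        # assumed API version at the time of this PR
        namespace="default",
        plural="rayservices",
        name="rayservice-sample",  # hypothetical CR name
    )
    status = rayservice.get("status", {})
    print(status.get("serviceStatus"))  # assumed field; e.g. "Running" once healthy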

@kevin85421 (Member, Author) commented:

Updated in commit 6a2047d.

            timeout=400,
            message="Releasing all blocked requests. Worker pods should start scaling down..."
        )
        cr_events: List[CREvent] = [
            RayServiceAddCREvent(
                custom_resource_object=self.cr,
@kevin85421 (Member, Author) commented on this snippet:

Without this PR, the custom_resource_object in RayServiceAddCREvent is sourced from ray_v1alpha1_rayservice.yaml, while the filepath is set to ray-service.autoscaler.yaml. This is definitely a bug.
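A minimal sketch of the fix described here (the YAML loading is illustrative; `RayServiceAddCREvent`, `custom_resource_object`, and `filepath` are taken from this thread, the import path is assumed, and the real constructor likely takes additional arguments): both the CR object and `filepath` should come from the same manifest.

    import yaml
    # Assumed import path; RayServiceAddCREvent lives in KubeRay's test framework.
    from framework.prototype import RayServiceAddCREvent

    MANIFEST = "ray-service.autoscaler.yaml"

    # Load the CR object from the same file that the event will apply, so the
    # harness no longer validates one manifest while applying another.
    with open(MANIFEST, encoding="utf-8") as f:
        cr = yaml.safe_load(f)

    event = RayServiceAddCREvent(
        custom_resource_object=cr,
        filepath=MANIFEST,
    )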

@kevin85421 (Member, Author):

cc @Yicheng-Lu-llll

@architkulkarni (Contributor) left a review:

Looks good, just one minor comment

@Yicheng-Lu-llll (Contributor):

lgtm

@kevin85421 merged commit 73f4f21 into ray-project:master on Jan 11, 2024.
23 of 24 checks passed