
[RayService][HA] Fix flaky tests #1823

Merged 3 commits into ray-project:master on Jan 11, 2024

Conversation

@kevin85421 (Member) commented on Jan 10, 2024:

Why are these changes needed?

Fix the flaky tests introduced by #1808. Note that the RayService tests were already flaky before #1808, but they became much more unstable afterward (failing roughly 5 times out of 6 runs). Even after this PR is merged, the tests remain about as flaky as they were before #1808, but it removes the additional flakiness that #1808 introduced.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -17,7 +17,8 @@ spec:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 1
@kevin85421 (Member, Author) commented on the changed line:

This ensures that the head and the worker each have at least one Serve replica. Since Ray 2.8.0, a Pod without a Ray Serve replica does not have a proxy actor, so its readiness probe fails and the RayServiceAddCREvent can't converge. See this line for more details.
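To make the failure mode concrete, here is a minimal sketch (not KubeRay's actual harness code; `is_ready` is a hypothetical stand-in for the kubelet readiness probe) of why the convergence wait times out when a Pod hosts no Serve replica:

    import time
    from typing import Callable, List

    def wait_for_all_pods_ready(
        pods: List[str],
        is_ready: Callable[[str], bool],  # hypothetical per-Pod readiness check
        timeout_s: float = 400.0,
        interval_s: float = 5.0,
    ) -> None:
        """Block until every Pod passes its readiness probe, or time out.

        Since Ray 2.8.0 the probe targets the Serve proxy. A Pod that hosts
        no Serve replica has no proxy actor, so is_ready() never returns
        True for it and this loop exhausts the timeout.
        """
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if all(is_ready(pod) for pod in pods):
                return
            time.sleep(interval_s)
        raise TimeoutError("Convergence wait timed out: some Pod never became ready")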

A contributor replied:

I think we should add this as a comment in the test to prevent it from becoming tribal knowledge (for example, if a future maintainer changes num_replicas back to 1 for whatever reason, the test will start flaking and nobody will remember why). Ideally, we should just make the test robust to both cases if it's not too hard.

However, if this issue with the readiness probe is only temporary, then it's fine.

@kevin85421 (Member, Author) replied:

> Ideally, we should just make the test robust to both cases if it's not too hard.

Agreed. I will expose more information in the RayService status later so that RayServiceAddCREvent doesn't need to wait until all Pods are ready.
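A hedged sketch of that follow-up idea, using the official `kubernetes` Python client (the `ray.io` group and `rayservices` plural are KubeRay's; the `v1alpha1` version, the CR name, and the exact status field inspected here are assumptions, since the status schema was still being extended):

    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()

    # Read the RayService CR and converge on controller-reported status
    # instead of waiting for every Pod to pass its readiness probe.
    rayservice = api.get_namespaced_custom_object(
        group="ray.io",
        version="v1alpha1",        # assumed API version at the time of this PR
        namespace="default",
        plural="rayservices",
        name="rayservice-sample",  # hypothetical CR name
    )
    status = rayservice.get("status", {})
    print(status.get("serviceStatus"))  # assumed field; e.g. "Running" once healthy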

@kevin85421 (Member, Author) commented:

Updated in commit 6a2047d.

            timeout=400,
            message="Releasing all blocked requests. Worker pods should start scaling down..."
        )
        cr_events: List[CREvent] = [
            RayServiceAddCREvent(
                custom_resource_object=self.cr,
@kevin85421 (Member, Author) commented on this snippet:

Without this PR, the custom_resource_object in RayServiceAddCREvent is sourced from ray_v1alpha1_rayservice.yaml, while the filepath is set to ray-service.autoscaler.yaml. This is definitely a bug.
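A minimal sketch of the fix described here (the YAML loading is illustrative; `RayServiceAddCREvent`, `custom_resource_object`, and `filepath` are taken from this thread, the import path is assumed, and the real constructor likely takes additional arguments): both the CR object and `filepath` should come from the same manifest.

    import yaml
    # Assumed import path; RayServiceAddCREvent lives in KubeRay's test framework.
    from framework.prototype import RayServiceAddCREvent

    MANIFEST = "ray-service.autoscaler.yaml"

    # Load the CR object from the same file that the event will apply, so the
    # harness no longer validates one manifest while applying another.
    with open(MANIFEST, encoding="utf-8") as f:
        cr = yaml.safe_load(f)

    event = RayServiceAddCREvent(
        custom_resource_object=cr,
        filepath=MANIFEST,
    )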

@kevin85421 (Member, Author):

cc @Yicheng-Lu-llll

@architkulkarni (Contributor) left a review:

Looks good, just one minor comment

@Yicheng-Lu-llll (Contributor):

lgtm

@kevin85421 merged commit 73f4f21 into ray-project:master on Jan 11, 2024.
23 of 24 checks passed