[RayService][HA] Fix flaky tests #1823
Conversation
@@ -17,7 +17,8 @@ spec:
working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
deployments:
- name: MangoStand
num_replicas: 1
This is used to ensure that both the head and the worker have at least one Serve replica each. Since Ray 2.8.0, a Pod without a Ray Serve replica does not have a Proxy Actor, so its readiness probe fails and the RayServiceAddCREvent can't converge. See this line for more details.
I think we should add this as a comment in the test to prevent it from becoming tribal knowledge (for example, if some future maintainer changes num_replicas to 1 for whatever reason, the test will start flaking and nobody will remember why). Ideally we should just make the test robust to both cases if it's not too hard. However, if this issue with the readiness probe is only temporary, then it's fine.
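As a rough sketch of what such an in-manifest comment could look like (the field names follow the Serve config shown in the diff above; the value 2 is an assumption based on the "one Serve replica per Pod" reasoning, not taken from the PR itself):

```yaml
# num_replicas is deliberately set to at least the number of Ray Pods (head + worker)
# so that every Pod hosts a Serve replica. Since Ray 2.8.0, a Pod without a Serve
# replica has no Proxy Actor, its readiness probe fails, and RayServiceAddCREvent
# never converges. Do not lower this back to 1 without making the test robust to
# Pods that have no proxy.
deployments:
  - name: MangoStand
    num_replicas: 2  # assumed value for illustration; keep >= number of Ray Pods
```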
> Ideally, we should just make the test robust to both cases if it's not too hard.
Agree. I will expose more information in the RayService status later so that the RayServiceAddCREvent doesn't need to wait until all Pods are ready.
Update 6a2047d
timeout=400,
message="Releasing all blocked requests. Worker pods should start scaling down..."
)
cr_events: List[CREvent] = [
    RayServiceAddCREvent(
        custom_resource_object=self.cr,
Without this PR, the custom_resource_object in RayServiceAddCREvent is sourced from ray_v1alpha1_rayservice.yaml, while the filepath is set to ray-service.autoscaler.yaml. This is definitely a bug.
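To illustrate the shape of the fix, here is a minimal sketch that keeps the two values in sync. It assumes RayServiceAddCREvent accepts the custom_resource_object and filepath arguments seen in the diff; load_cr is a hypothetical helper and the import path for the event class is omitted, so this is not the test framework's actual code:

```python
import yaml  # PyYAML

# Hypothetical helper: parse the CR from the same manifest the test applies,
# so custom_resource_object and filepath can never point at different files.
def load_cr(filepath: str) -> dict:
    with open(filepath) as f:
        return yaml.safe_load(f)

cr_path = "ray-service.autoscaler.yaml"  # path assumed for illustration

cr_events = [
    RayServiceAddCREvent(                          # class from the test framework
        custom_resource_object=load_cr(cr_path),   # CR parsed from the same file...
        filepath=cr_path,                          # ...that is passed as filepath
    ),
]
```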
Looks good, just one minor comment
lgtm
Why are these changes needed?
Fix the flaky tests introduced by #1808. The RayService tests were already flaky before #1808, but they became much more unstable afterwards (failing roughly 5 times out of 6 runs). Even with this PR merged the tests remain about as flaky as they were before #1808, but it does remove the additional flakiness that #1808 introduced.
Related issue number
Checks