[Bug] autoscaler not working properly in rayjob #1064
Conversation
@architkulkarni @kevin85421, Would you mind reviewing this PR and letting me know if it looks good to you? Thank you! The issue we're facing is that both the autoscaler and the RayJob controller are trying to update the replicas, causing worker pods to be repeatedly created and terminated. The proposed solution is to prevent the RayJob controller from updating the replicas if the autoscaler is enabled. The rationale behind this is:
Looks good pending @kevin85421's comments. Some minor questions:
- Should we add a log in each of the 4 cases, or will it be too spammy? This is to prevent a future user from complaining about changes being "silently ignored".
- Is there a convenient way to parametrize the tests (similar to pytest.mark.parametrize in Python) to avoid repeated code and isolate the part of the test that differs when the autoscaler is on vs. off? A quick search turned up something about "table-driven tests", but I'm not sure if that's applicable here.
Thank you for pointing this out! I have changed it accordingly and used the table-driven style you mentioned. The difficulty is that we need to create two different RayJobs, one with the autoscaler and one without, and the test logic/behavior will be slightly different. It might be challenging to abstract into common test logic.
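For illustration, here is a minimal sketch of Go's table-driven test pattern being discussed, not the PR's actual test code; shouldOverwriteReplicas and the table fields are hypothetical stand-ins for the controller logic under test:

```go
package ray

import "testing"

// shouldOverwriteReplicas is a hypothetical helper standing in for the
// behavior under test: the RayJob controller should only overwrite the
// replica count when in-tree autoscaling is disabled.
func shouldOverwriteReplicas(autoscalingEnabled bool) bool {
	return !autoscalingEnabled
}

func TestReplicaOverwrite(t *testing.T) {
	// Each table entry describes one scenario; the shared assertion logic
	// below runs once per entry, similar to pytest.mark.parametrize.
	testCases := []struct {
		name               string
		autoscalingEnabled bool
		expectOverwrite    bool
	}{
		{name: "autoscaler disabled", autoscalingEnabled: false, expectOverwrite: true},
		{name: "autoscaler enabled", autoscalingEnabled: true, expectOverwrite: false},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			if got := shouldOverwriteReplicas(tc.autoscalingEnabled); got != tc.expectOverwrite {
				t.Errorf("shouldOverwriteReplicas(%v) = %v, want %v",
					tc.autoscalingEnabled, got, tc.expectOverwrite)
			}
		})
	}
}
```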
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
} else if errors.IsNotFound(err) {
    if len(rayJobInstance.Spec.ClusterSelector) == 0 && rayJobInstance.Spec.RayClusterSpec == nil {
This is not good. We should not try to GET the RayCluster if both ClusterSelector and RayClusterSpec are not set. Please add a "TODO" comment and a new issue to track the progress.
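A minimal sketch of the suggested guard, wrapped in a hypothetical helper (validateRayJobSpec is not the PR's actual code, and the error message is illustrative):

```go
import "fmt"

// validateRayJobSpec illustrates the suggested fail-fast check: reject the
// RayJob up front instead of issuing a GET for a RayCluster that cannot
// exist when neither ClusterSelector nor RayClusterSpec is provided.
// TODO: per the review, track the remaining cleanup in a follow-up issue.
func validateRayJobSpec(rayJobInstance *rayv1alpha1.RayJob) error {
	if len(rayJobInstance.Spec.ClusterSelector) == 0 && rayJobInstance.Spec.RayClusterSpec == nil {
		return fmt.Errorf("one of ClusterSelector or RayClusterSpec must be set")
	}
	return nil
}
```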
LGTM. There are still a few minor things that can be improved, but it is fine to resolve them in another PR.
if rayJobInstance.Spec.RayClusterSpec == nil {
    r.Log.Info("Found associated RayCluster for RayJob", "rayjob", rayJobInstance.Name, "raycluster", rayClusterNamespacedName)
// Case1: The job is submitted to an existing ray cluster, simply return the rayClusterInstance. |
Suggested revision:
// Case1: The job is submitted to an existing RayCluster. Return the existing RayCluster instance and ignore all updates from RayClusterSpec.
autoscaler not working properly in rayjob
Why are these changes needed?
In short: when submitting a Ray job to Kubernetes with the autoscaler enabled, worker pods are repeatedly created and then terminated. The root cause is that both the autoscaler and the RayJob controller try to update the replica count, each overwriting the other's changes. This PR prevents the RayJob controller from updating the replicas when the autoscaler is enabled.
Reproduce
The reproduction file can be found here: ray_v1alpha1_rayjob.yaml.
I believe the root cause of the problem lies in this section of the code: kuberay/ray-operator/controllers/ray/rayjob_controller.go, lines 379 to 382 at commit 3571d52.
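Those lines boil down to something like the following; this is a paraphrased sketch reconstructed from the description below, not the verbatim source:

```go
// Paraphrased sketch of the problematic update: every reconcile loop copies
// the RayJob's embedded cluster spec over the live RayCluster, so any
// replica count the autoscaler has written is reverted to the YAML value.
rayClusterInstance.Spec = *rayJobInstance.Spec.RayClusterSpec
if err := r.Update(ctx, rayClusterInstance); err != nil {
	return nil, err
}
```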
The sequence of events is as follows:
1. After kubectl apply -f ray_v1alpha1_rayjob.yaml, the RayJob operator creates a Ray cluster with 1 head pod and 1 worker pod.
2. Based on the workload (a placement group in this case), the autoscaler adjusts the RayCluster replica count to 2.
3. The RayCluster operator creates a new worker pod.
4. During the RayJob operator's reconciliation, it executes the code above, overwriting RayClusterInstance.Spec with RayJobInstance.Spec.RayClusterSpec (the goal is to reflect the user's updates to the RayJob YAML file). As a result, the replica count reverts to its original value of 1 (Replicas has type *int32).
5. The RayCluster operator deletes an existing worker pod based on the reverted replica count.
So we observe that a new worker pod is created, but another pod is suddenly terminated.
Fix
From my understanding, the need to overwrite RayClusterInstance.Spec with RayJobInstance.Spec.RayClusterSpec (when a RayCluster is found) arises when users update the replicas in the RayJob YAML file, prompting the RayJob controller to adjust the replica count on their behalf. So the root cause is that both the autoscaler (reacting to the workload) and the RayJob controller (reacting to the RayJob YAML file) attempt to update the replica count simultaneously.
So, a potential solution is to prevent the RayJob controller from updating the replicas when the autoscaler is enabled; a sketch follows.
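A minimal sketch of that idea, assuming the EnableInTreeAutoscaling flag on the embedded RayClusterSpec; the exact field handling in the merged PR may differ:

```go
// Sketch: when in-tree autoscaling is enabled, the autoscaler owns the
// replica counts, so the RayJob controller skips the spec overwrite;
// otherwise it keeps the old behavior of syncing from the RayJob YAML.
enableAutoscaling := rayJobInstance.Spec.RayClusterSpec.EnableInTreeAutoscaling
if enableAutoscaling != nil && *enableAutoscaling {
	// Leave rayClusterInstance.Spec untouched; the autoscaler manages replicas.
} else {
	rayClusterInstance.Spec = *rayJobInstance.Spec.RayClusterSpec
	if err := r.Update(ctx, rayClusterInstance); err != nil {
		return nil, err
	}
}
```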
Related issue number
Closes #532
Checks
Most tests are incorporated in the CI. Additionally, I've conducted manual testing for the following cases and confirmed that all of them work.
Case1: With in-tree autoscaling disabled, the user (via the RayJob controller) should be able to update the replica count.
Case2: With in-tree autoscaling disabled, if users update spec fields other than the replica count, the following log message should be displayed: