[Bug] autoscaler not working properly in rayjob #1064

Yicheng-Lu-llll · 2023-05-05T00:36:23Z

Why are these changes needed?

In short:

When submitting a Ray job to Kubernetes with the autoscaler, the following behaviors can be observed:

A new worker pod is created as the autoscaler increases the number of replicas.
An existing worker is terminated simultaneously.
The total number of worker pods remains equal to the original replica count.

The root cause is that both the autoscaler and the RayJob controller are trying to update the replicas, causing worker pods to be repeatedly created and terminated.

So, This PR:

Preventing Ray job controllers from updating the RayCluster when the autoscaler is enabled.
Even when the autoscaler is not enabled, Ray job controller only modifies the replica, rather than the entire RayClusterSpec.

Reproduce

The reproduction file can be found here: ray_v1alpha1_rayjob.yaml.

I believe the root cause of the problem lies in this section of the code:

kuberay/ray-operator/controllers/ray/rayjob_controller.go

Lines 379 to 382 in 3571d52

    
           rayClusterInstance.Spec = *rayJobInstance.Spec.RayClusterSpec 
        
           r.Log.Info("Update ray cluster spec", "raycluster", rayClusterNamespacedName) 
        
           if err := r.Update(ctx, rayClusterInstance); err != nil {

The sequence of events is as follows:

After kubectl apply -f ray_v1alpha1_rayjob.yaml:

RayJob operator creates a Ray cluster with 1 head pod and 1 worker pod.
Based on the workload (placement group in this case), the autoscaler adjusts the RayCluster replica count to 2.
The RayCluster operator creates a new worker pod.
During the reconciliation of the RayJob operator, it executes the above-mentioned code, overwriting RayClusterInstance.Spec with RayJobInstance.Spec.RayClusterSpec（The goal here is to reflect the user's update to the ray job yaml file）. As a result, the replica count reverts to its original value of 1(Replicas's type is *int32).
The RayCluster operator deletes an existing worker pod based on the updated replica count.
So, we can see "New pod did created but another pod terminated suddenly."

Fix

From my understanding, the need to overwrite RayClusterInstance.Spec with RayJobInstance.Spec.RayClusterSpec (when a RayCluster is found) arises when users update the replicas in the Ray job YAML file, prompting the Ray job controller to adjust the replica count on their behalf.

So, the root cause stems from both the autoscaler (based on the workload) and Ray job controllers (based on the Ray job YAML file) attempting to update the replica count simultaneously. Consider these points:

RayCluster currently only permits replica updates; modifying other specs may cause errors.
When the autoscaler is enabled, users are less likely to manually adjust the replica count, as the autoscaler should adapt the replica based on workload.

So, a potential solution to address this issue is:

Preventing Ray job controllers from updating the RayCluster when the autoscaler is enabled.
Even when the autoscaler is not enabled, Ray job controllers should only modify the replica, rather than the entire RayClusterSpec.

Related issue number

Closes #532

Checks

Most tests are incorporated in the CI. Additionally, I've conducted manual testing for the following cases.

I confirm all the below cases work.

Case1: With in-tree autoscaling is disabled, user(RayJob controller) should be able to update the replica.
Case2: With in-tree autoscaling disabled, if users update specs other than the replica, The following log message should be displayed:

2023-05-19T01:05:10.326Z        INFO    controllers.RayJob      RayJob supports replica changes only. Adjustments made to other specs will be disregarded as they may cause unexpected behavior

Case3: With in-tree autoscaling is enabled, the autoscaler should be able to adjust the number of workers based on the workload. The RayJob controller should not be able to change RayJob and propagate to the RayCluster. The following log message should be displayed:

2023-05-19T01:08:28.652Z        INFO    controllers.RayJob      Since in-tree autoscaling is enabled, any adjustments made to the RayJob will be disregarded and will not be propagated to the RayCluster.

Case4: With RayClusterSpec empty and ClusterSelector set, an existing Ray cluster should be used. The following log message should be displayed:

2023-05-19T01:14:23.429Z        INFO    controllers.RayJob      ClusterSelector is being used to select an existing RayCluster. RayClusterSpec will be disregarded      {"raycluster": "default/raycluster-autoscaler"}

Case5: With both RayClusterSpec and ClusterSelector are set, an existing Ray cluster should be used RayClusterSpec should be ignored. The RayJob controller should not be able to change RayJob and propagate to the existing RayCluster. The following log message should be displayed:

2023-05-19T01:19:05.363Z        INFO    controllers.RayJob      ClusterSelector is being used to select an existing RayCluster. RayClusterSpec will be disregarded      {"raycluster": "default/raycluster-autoscaler"}

Case6: With both RayClusterSpec and ClusterSelector are unset, The following error message should be displayed:

2023-05-19T01:22:12.162Z        ERROR   controllers.RayJob      Failed to configure RayCluster instance due to missing configuration    {"error": "Both ClusterSelector and RayClusterSpec are undefined"}

Yicheng-Lu-llll · 2023-05-05T19:50:26Z

@architkulkarni @kevin85421, Would you mind reviewing this PR and letting me know if it looks good to you? Thank you!

The issue we're facing is that both the autoscaler and the RayJob controller are trying to update the replicas, causing worker pods to be repeatedly created and terminated.

The proposed solution is to prevent the RayJob controller from updating the replicas if the autoscaler is enabled.

The rationale behind this is:

When the autoscaler is in use, users are less likely to manually modify the replicas in the RayJob YAML, as the autoscaler will dynamically adjust the replicas based on the workload.

ray-operator/controllers/ray/rayjob_controller_test.go

ray-operator/controllers/ray/rayjob_controller.go

ray-operator/controllers/ray/rayjob_controller_test.go

architkulkarni

Looks good pending @kevin85421 's comments. Some minor questions:

Should we add a log in each of the 4 cases, or will it be too spammy? This is to prevent a future user complaining about changes being "silently ignored"
Is there a convenient way to parametrize the tests (similar to pytest.mark.parametrize in python) to avoid repeated code, to isolate the part of the test that's different when the autoscaler is on vs off? A quick search turned up something about "table-driven tests" but I'm not sure if that's applicable here.

Yicheng-Lu-llll · 2023-05-16T22:52:08Z

Should we add a log in each of the 4 cases, or will it be too spammy? This is to prevent a future user complaining about changes being "silently ignored"

Thank you for pointing out! I have changed according. I use areSpecsIdentical , areReplicasIdentical to ensure we only log when necessary.

Is there a convenient way to parametrize the tests (similar to pytest.mark.parametrize in python) to avoid repeated code, to isolate the part of the test that's different when the autoscaler is on vs off? A quick search turned up something about "table-driven tests" but I'm not sure if that's applicable here.

The diffculty is we need to create two different RayJobs, one with autoscaler, one without. And the test logic/behavior will be slighly different. It might be challenging to abstract into common test logic.

ray-operator/controllers/ray/rayjob_controller.go

ray-operator/controllers/ray/rayjob_controller_test.go

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

ray-operator/controllers/ray/rayjob_controller.go

ray-operator/controllers/ray/rayjob_controller_test.go

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

ray-operator/controllers/ray/rayjob_controller.go

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

ray-operator/controllers/ray/rayjob_controller.go

kevin85421 · 2023-05-18T23:55:55Z

ray-operator/controllers/ray/rayjob_controller.go

 	} else if errors.IsNotFound(err) {
+		if len(rayJobInstance.Spec.ClusterSelector) == 0 && rayJobInstance.Spec.RayClusterSpec == nil {


This is not good. We should not try to GET RayCluster if both ClusterSelector and RayClusterSpec are not set. Please add a comment "TODO" and a new issue to track the progress.

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

kevin85421

LGTM. There are still a few minor things that can be improved, but it is fine to resolve them in another PR.

kevin85421 · 2023-05-19T03:31:36Z

ray-operator/controllers/ray/rayjob_controller.go

-		if rayJobInstance.Spec.RayClusterSpec == nil {
+		r.Log.Info("Found associated RayCluster for RayJob", "rayjob", rayJobInstance.Name, "raycluster", rayClusterNamespacedName)
+
+		// Case1: The job is submitted to an existing ray cluster, simply return the rayClusterInstance.


// Case1: The job is submitted to an existing RayCluster. Return the existing RayCluster instance and ignore all updates from RayClusterSpec.

ray-operator/controllers/ray/rayjob_controller_test.go

autoscaler not working properly in rayjob

architkulkarni self-assigned this May 5, 2023

kevin85421 self-requested a review May 16, 2023 00:34

Yicheng-Lu-llll force-pushed the auto-scaler-not-working-properly-in-rayjob branch from 6131200 to 85c2513 Compare May 16, 2023 18:08

Yicheng-Lu-llll commented May 16, 2023

View reviewed changes

ray-operator/controllers/ray/rayjob_controller_test.go Show resolved Hide resolved

Yicheng-Lu-llll commented May 16, 2023

View reviewed changes

ray-operator/controllers/ray/rayjob_controller_test.go Show resolved Hide resolved

kevin85421 reviewed May 16, 2023

View reviewed changes

architkulkarni approved these changes May 16, 2023

View reviewed changes

Yicheng-Lu-llll marked this pull request as ready for review May 16, 2023 23:02

Yicheng-Lu-llll changed the title ~~[WIP][Bug] autoscaler not working properly in rayjob~~ [Bug] autoscaler not working properly in rayjob May 16, 2023

kevin85421 reviewed May 17, 2023

View reviewed changes

Yicheng-Lu-llll force-pushed the auto-scaler-not-working-properly-in-rayjob branch 2 times, most recently from dba4a90 to 3cb50b3 Compare May 17, 2023 21:37

rebase

e01db44

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

Yicheng-Lu-llll force-pushed the auto-scaler-not-working-properly-in-rayjob branch from 3cb50b3 to e01db44 Compare May 17, 2023 21:59

kevin85421 reviewed May 18, 2023

View reviewed changes

ray-operator/controllers/ray/rayjob_controller.go Outdated Show resolved Hide resolved

ray-operator/controllers/ray/rayjob_controller.go Outdated Show resolved Hide resolved

kevin85421 reviewed May 18, 2023

View reviewed changes

Yicheng-Lu-llll added 2 commits May 18, 2023 16:57

handle an corner case

cef94b9

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

add error log for the corner case

f5c4f49

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

kevin85421 reviewed May 18, 2023

View reviewed changes

ray-operator/controllers/ray/rayjob_controller.go Outdated Show resolved Hide resolved

nit

5c6e248

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

kevin85421 reviewed May 18, 2023

View reviewed changes

ray-operator/controllers/ray/rayjob_controller.go Outdated Show resolved Hide resolved

kevin85421 reviewed May 18, 2023

View reviewed changes

nit

d46e12b

Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>

kevin85421 approved these changes May 19, 2023

View reviewed changes

kevin85421 merged commit 82c925b into ray-project:master May 19, 2023
19 checks passed

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023

[Bug] autoscaler not working properly in rayjob (ray-project#1064)

a794d1f

autoscaler not working properly in rayjob

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] autoscaler not working properly in rayjob #1064

[Bug] autoscaler not working properly in rayjob #1064

Yicheng-Lu-llll commented May 5, 2023 •

edited

Yicheng-Lu-llll commented May 5, 2023

architkulkarni left a comment

Yicheng-Lu-llll commented May 16, 2023 •

edited

kevin85421 May 18, 2023

kevin85421 left a comment

kevin85421 May 19, 2023

	rayClusterInstance.Spec = *rayJobInstance.Spec.RayClusterSpec

	r.Log.Info("Update ray cluster spec", "raycluster", rayClusterNamespacedName)
	if err := r.Update(ctx, rayClusterInstance); err != nil {

		} else if errors.IsNotFound(err) {
		if len(rayJobInstance.Spec.ClusterSelector) == 0 && rayJobInstance.Spec.RayClusterSpec == nil {

[Bug] autoscaler not working properly in rayjob #1064

[Bug] autoscaler not working properly in rayjob #1064

Conversation

Yicheng-Lu-llll commented May 5, 2023 • edited

Why are these changes needed?

In short:

Reproduce

The sequence of events is as follows:

Fix

Related issue number

Checks

Yicheng-Lu-llll commented May 5, 2023

architkulkarni left a comment

Choose a reason for hiding this comment

Yicheng-Lu-llll commented May 16, 2023 • edited

kevin85421 May 18, 2023

Choose a reason for hiding this comment

kevin85421 left a comment

Choose a reason for hiding this comment

kevin85421 May 19, 2023

Choose a reason for hiding this comment

Yicheng-Lu-llll commented May 5, 2023 •

edited

Yicheng-Lu-llll commented May 16, 2023 •

edited