
[Bug] autoscaler not working properly in rayjob #1064

Conversation

@Yicheng-Lu-llll Yicheng-Lu-llll commented May 5, 2023

Why are these changes needed?

In short:

When submitting a Ray job to Kubernetes with the autoscaler, the following behaviors can be observed:

  1. A new worker pod is created as the autoscaler increases the number of replicas.
  2. An existing worker is terminated simultaneously.
  3. The total number of worker pods remains equal to the original replica count.

The root cause is that both the autoscaler and the RayJob controller are trying to update the replicas, causing worker pods to be repeatedly created and terminated.

This PR therefore:

  1. Prevents the RayJob controller from updating the RayCluster when the autoscaler is enabled.
  2. Has the RayJob controller, even when the autoscaler is not enabled, modify only the replica count rather than the entire RayClusterSpec.

Reproduce

The reproduction file can be found here: ray_v1alpha1_rayjob.yaml.

I believe the root cause of the problem lies in this section of the code:

// In the RayJob reconciler: the whole RayCluster spec is overwritten with the
// spec embedded in the RayJob, and the change is pushed to the API server.
rayClusterInstance.Spec = *rayJobInstance.Spec.RayClusterSpec
r.Log.Info("Update ray cluster spec", "raycluster", rayClusterNamespacedName)
if err := r.Update(ctx, rayClusterInstance); err != nil {

The sequence of events is as follows:

After kubectl apply -f ray_v1alpha1_rayjob.yaml:

  1. RayJob operator creates a Ray cluster with 1 head pod and 1 worker pod.

  2. Based on the workload (placement group in this case), the autoscaler adjusts the RayCluster replica count to 2.

  3. The RayCluster operator creates a new worker pod.

  4. During the reconciliation of the RayJob operator, it executes the above-mentioned code, overwriting RayClusterInstance.Spec with RayJobInstance.Spec.RayClusterSpec (the goal there is to reflect the user's updates to the RayJob YAML file). As a result, the replica count reverts to its original value of 1 (Replicas is of type *int32; see the sketch after this list).

  5. The RayCluster operator deletes an existing worker pod based on the updated replica count.

  6. The net effect is what users observe: a new worker pod is created, but another pod is suddenly terminated.
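
Step 4 above is where the scale-up is lost. The following is a minimal, self-contained Go sketch (hypothetical struct and values, not KubeRay code) of how overwriting the whole spec discards the autoscaler's change when the replica field is a *int32:

package main

import "fmt"

type workerGroup struct {
	Replicas *int32 // mirrors WorkerGroupSpec.Replicas, which is a *int32
}

func main() {
	one, two := int32(1), int32(2)
	fromRayJob := workerGroup{Replicas: &one} // the RayJob YAML still says 1
	inCluster := workerGroup{Replicas: &two}  // the autoscaler has already set 2

	inCluster = fromRayJob           // the wholesale spec overwrite in the reconciler
	fmt.Println(*inCluster.Replicas) // prints 1: the scale-up is silently undone
}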

Fix

From my understanding, the reason for overwriting RayClusterInstance.Spec with RayJobInstance.Spec.RayClusterSpec (when a RayCluster is found) is to let users update the replicas in the RayJob YAML file and have the RayJob controller adjust the replica count on their behalf.

So, the root cause stems from both the autoscaler (based on the workload) and Ray job controllers (based on the Ray job YAML file) attempting to update the replica count simultaneously. Consider these points:

  1. RayCluster currently only permits replica updates; modifying other specs may cause errors.
  2. When the autoscaler is enabled, users are less likely to manually adjust the replica count, since the autoscaler adapts the replicas to the workload.

So, a potential solution to address this issue is (see the sketch after this list):

  1. Prevent the RayJob controller from updating the RayCluster when the autoscaler is enabled.
  2. Even when the autoscaler is not enabled, have the RayJob controller modify only the replica count, rather than the entire RayClusterSpec.
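
A rough sketch of what such a guard could look like in the reconcile path is shown below. EnableInTreeAutoscaling, WorkerGroupSpecs, and Replicas are real fields on the RayCluster spec; everything else (the surrounding control flow, the assumption that worker groups line up one-to-one) is simplified for illustration and is not the merged implementation:

// Sketch only, assuming rayJobInstance and rayClusterInstance as in the code above.
jobSpec := rayJobInstance.Spec.RayClusterSpec
if jobSpec.EnableInTreeAutoscaling != nil && *jobSpec.EnableInTreeAutoscaling {
	// The autoscaler owns the replica count; leave the RayCluster untouched.
	r.Log.Info("In-tree autoscaling is enabled; RayJob changes are not propagated to the RayCluster")
	return rayClusterInstance, nil
}
// Autoscaling disabled: propagate only the replica counts, never the whole spec.
for i := range jobSpec.WorkerGroupSpecs {
	rayClusterInstance.Spec.WorkerGroupSpecs[i].Replicas = jobSpec.WorkerGroupSpecs[i].Replicas
}
if err := r.Update(ctx, rayClusterInstance); err != nil {
	return nil, err
}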

Related issue number

Closes #532

Checks

Most tests are incorporated in the CI. Additionally, I've conducted manual testing for the following cases.

I confirm all the cases below work.

  • Case 1: With in-tree autoscaling disabled, the user (via the RayJob controller) should be able to update the replicas.

  • Case 2: With in-tree autoscaling disabled, if users update specs other than the replicas, the following log message should be displayed:

2023-05-19T01:05:10.326Z        INFO    controllers.RayJob      RayJob supports replica changes only. Adjustments made to other specs will be disregarded as they may cause unexpected behavior
  • Case 3: With in-tree autoscaling enabled, the autoscaler should be able to adjust the number of workers based on the workload, and changes to the RayJob should not be propagated to the RayCluster. The following log message should be displayed:
2023-05-19T01:08:28.652Z        INFO    controllers.RayJob      Since in-tree autoscaling is enabled, any adjustments made to the RayJob will be disregarded and will not be propagated to the RayCluster.
  • Case 4: With RayClusterSpec empty and ClusterSelector set, an existing RayCluster should be used. The following log message should be displayed:
2023-05-19T01:14:23.429Z        INFO    controllers.RayJob      ClusterSelector is being used to select an existing RayCluster. RayClusterSpec will be disregarded      {"raycluster": "default/raycluster-autoscaler"}
  • Case 5: With both RayClusterSpec and ClusterSelector set, an existing RayCluster should be used and RayClusterSpec should be ignored. Changes to the RayJob should not be propagated to the existing RayCluster. The following log message should be displayed:
2023-05-19T01:19:05.363Z        INFO    controllers.RayJob      ClusterSelector is being used to select an existing RayCluster. RayClusterSpec will be disregarded      {"raycluster": "default/raycluster-autoscaler"}
  • Case 6: With both RayClusterSpec and ClusterSelector unset, the following error message should be displayed:
2023-05-19T01:22:12.162Z        ERROR   controllers.RayJob      Failed to configure RayCluster instance due to missing configuration    {"error": "Both ClusterSelector and RayClusterSpec are undefined"}

@Yicheng-Lu-llll Yicheng-Lu-llll (author) commented:

@architkulkarni @kevin85421, would you mind reviewing this PR and letting me know if it looks good to you? Thank you!

The issue we're facing is that both the autoscaler and the RayJob controller are trying to update the replicas, causing worker pods to be repeatedly created and terminated.

The proposed solution is to prevent the RayJob controller from updating the replicas if the autoscaler is enabled.

The rationale behind this is:

  1. When the autoscaler is in use, users are less likely to manually modify the replicas in the RayJob YAML, as the autoscaler will dynamically adjust the replicas based on the workload.

@architkulkarni architkulkarni self-assigned this May 5, 2023
@kevin85421 kevin85421 self-requested a review May 16, 2023 00:34
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the auto-scaler-not-working-properly-in-rayjob branch from 6131200 to 85c2513 Compare May 16, 2023 18:08
@architkulkarni architkulkarni left a comment

Looks good pending @kevin85421 's comments. Some minor questions:

  • Should we add a log in each of the 4 cases, or will it be too spammy? This is to prevent a future user complaining about changes being "silently ignored"
  • Is there a convenient way to parametrize the tests (similar to pytest.mark.parametrize in python) to avoid repeated code, to isolate the part of the test that's different when the autoscaler is on vs off? A quick search turned up something about "table-driven tests" but I'm not sure if that's applicable here.


Yicheng-Lu-llll commented May 16, 2023

  • Should we add a log in each of the 4 cases, or will it be too spammy? This is to prevent a future user complaining about changes being "silently ignored"

Thank you for pointing this out! I have changed the code accordingly. I use areSpecsIdentical and areReplicasIdentical to ensure we only log when necessary.

Is there a convenient way to parametrize the tests (similar to pytest.mark.parametrize in python) to avoid repeated code, to isolate the part of the test that's different when the autoscaler is on vs off? A quick search turned up something about "table-driven tests" but I'm not sure if that's applicable here.

The difficulty is that we need to create two different RayJobs, one with the autoscaler and one without, and the test logic/behavior will be slightly different. It might be challenging to abstract this into common test logic.
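
For what it's worth, a table-driven layout along those lines could look roughly like the skeleton below; the package name, test name, and struct fields are hypothetical, and as noted, the autoscaler-on and autoscaler-off branches would still need their own setup and assertions:

package controllers_test

import "testing"

func TestRayJobReplicaPropagation(t *testing.T) {
	tests := []struct {
		name              string
		enableAutoscaling bool
		expectPropagation bool
	}{
		{"autoscaler disabled, replica change propagates", false, true},
		{"autoscaler enabled, RayJob changes are ignored", true, false},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			// Build a RayJob whose RayClusterSpec sets EnableInTreeAutoscaling to
			// tc.enableAutoscaling, run a reconcile, then assert whether the
			// RayCluster's replicas changed (setup and assertions omitted here).
			_ = tc.expectPropagation
		})
	}
}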

@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as ready for review May 16, 2023 23:02
@Yicheng-Lu-llll Yicheng-Lu-llll changed the title [WIP][Bug] autoscaler not working properly in rayjob [Bug] autoscaler not working properly in rayjob May 16, 2023
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the auto-scaler-not-working-properly-in-rayjob branch 2 times, most recently from dba4a90 to 3cb50b3 Compare May 17, 2023 21:37
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the auto-scaler-not-working-properly-in-rayjob branch from 3cb50b3 to e01db44 Compare May 17, 2023 21:59
} else if errors.IsNotFound(err) {
    if len(rayJobInstance.Spec.ClusterSelector) == 0 && rayJobInstance.Spec.RayClusterSpec == nil {

Review comment from a Member:

This is not good. We should not try to GET RayCluster if both ClusterSelector and RayClusterSpec are not set. Please add a comment "TODO" and a new issue to track the progress.
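
A minimal sketch of the kind of early check being asked for here, assuming the reconciler already holds rayJobInstance; the error and log text mirror the Case 6 output above, and this is not the merged code:

// Sketch only: validate before attempting any GET on the RayCluster.
if len(rayJobInstance.Spec.ClusterSelector) == 0 && rayJobInstance.Spec.RayClusterSpec == nil {
	err := fmt.Errorf("Both ClusterSelector and RayClusterSpec are undefined")
	r.Log.Error(err, "Failed to configure RayCluster instance due to missing configuration")
	return nil, err
}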

@kevin85421 kevin85421 left a comment


LGTM. There are still a few minor things that can be improved, but it is fine to resolve them in another PR.

if rayJobInstance.Spec.RayClusterSpec == nil {
r.Log.Info("Found associated RayCluster for RayJob", "rayjob", rayJobInstance.Name, "raycluster", rayClusterNamespacedName)

// Case1: The job is submitted to an existing ray cluster, simply return the rayClusterInstance.

Review comment from a Member:

// Case1: The job is submitted to an existing RayCluster. Return the existing RayCluster instance and ignore all updates from RayClusterSpec.

@kevin85421 kevin85421 merged commit 82c925b into ray-project:master May 19, 2023
19 checks passed
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023