Prioritize WorkersToDelete #208
Conversation
…then adjust the total number of running pods to match Replicas
…cale/kuberay into prioritize-workers-to-delete
@Jeffwan Can you have a look at the PR and help resolve the questions?
Sure. I will help review the change today.
Please excuse my ignorance, but do we know how the WorkersToDelete list is obtained? Is it generated or supplied by the user?
The Ray Autoscaler provides it. I'm not close to a computer right now, so I can't link to the source code.
Please help list the cases. Do you mean the autoscaler scaling up/down combined with unexpected new/removed pods (2 * 2 combinations)?
@@ -248,6 +248,35 @@ func (r *RayClusterReconciler) reconcilePods(instance *rayiov1alpha1.RayCluster)
		}
	}
	diff := *worker.Replicas - int32(len(runningPods.Items))

	//// SriramQ: How do I create a feature flag to guard the new functionality?
	featureFlag := true
This is a good question. We don't have a global feature gate mechanism at the moment, so let's use this temporarily.
I created #211 for the feature gate implementation and discussion.
I would like to do better - we should be able to provide this as a startup option. I think I know how to do this - please wait for my next code update.
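For reference, a minimal sketch of wiring the feature flag as a startup option using Go's standard flag package; the flag name prioritize-workers-to-delete and the wiring below are hypothetical, not the actual KubeRay code.

package main

import (
	"flag"
	"fmt"
)

// prioritizeWorkersToDelete would stand in for the hard-coded featureFlag := true
// in reconcilePods; it defaults to false so existing behavior is preserved.
var prioritizeWorkersToDelete bool

func main() {
	flag.BoolVar(&prioritizeWorkersToDelete, "prioritize-workers-to-delete", false,
		"delete the pods named in WorkersToDelete before reconciling Replicas")
	flag.Parse()

	fmt.Println("prioritize-workers-to-delete =", prioritizeWorkersToDelete)
}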
@@ -262,7 +291,7 @@ func (r *RayClusterReconciler) reconcilePods(instance *rayiov1alpha1.RayCluster)
	} else if diff == 0 {
		log.Info("reconcilePods", "all workers already exist for group", worker.GroupName)
		continue
	} else if int32(len(runningPods.Items)) == (*worker.Replicas + int32(len(worker.ScaleStrategy.WorkersToDelete))) {
	} else if -diff == int32(len(worker.ScaleStrategy.WorkersToDelete)) {
worker.ScaleStrategy.WorkersToDelete has been set to 0 in line 274 if the flag is true. I think you may want to assign the value to a different variable?
Actually no. You are correct that this is dead code when featureFlag is true. However, we need to retain existing behavior when featureFlag is false. Once we have finished testing and commit to the new logic, we will remove the featureFlag check and also delete this case from the if statement (as a followup PR).
	// we need to scale down
	workersToRemove := int32(len(runningPods.Items)) - *worker.Replicas
	//// SriramQ: Isn't this too early? This does not consider the IsNotFound case (see below)
In the code block at lines 255-275, you can probably track the number of pods actually deleted (excluding NOT_FOUND).
My question is related to when featureFlag is false (meaning existing behavior). If you consider my scenario from an earlier comment where 5 of the 10 entries in WorkersToDelete are missing, randomlyRemovedWorkers is off by 5.
I do not want to change existing behavior (when featureFlag is false) - I am just asking a clarifying question. When featureFlag is true, WorkersToDelete is empty - so everything will work fine.
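A minimal sketch of the counting approach suggested above: count only the pods that were actually deleted, ignoring NotFound, so missing WorkersToDelete entries do not skew the later scale-down math. The deleteWorkers/deletePod names and the callback are invented for illustration, not the operator's actual code.

package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// deletePod stands in for r.Delete(ctx, &pod) in the reconciler.
type deletePod func(name string) error

// deleteWorkers returns how many pods were actually deleted, so that
// NotFound entries in WorkersToDelete do not affect the count.
func deleteWorkers(workersToDelete []string, del deletePod) (int32, error) {
	var deleted int32
	for _, name := range workersToDelete {
		if err := del(name); err != nil {
			if apierrors.IsNotFound(err) {
				continue // already gone: skip, but do not count or fail
			}
			return deleted, err
		}
		deleted++
	}
	return deleted, nil
}

func main() {
	gr := schema.GroupResource{Resource: "pods"}
	fake := func(name string) error {
		if name == "missing-worker" {
			return apierrors.NewNotFound(gr, name)
		}
		return nil
	}
	n, _ := deleteWorkers([]string{"w1", "missing-worker", "w2"}, fake)
	fmt.Println("actually deleted:", n) // prints 2
}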
			r.Recorder.Eventf(instance, v1.EventTypeNormal, "Deleted", "Deleted pod %s", pod.Name)
		}
		//// SriramQ: Any difference between this and "worker.ScaleStrategy.WorkersToDelete = ..."
		//// SriramQ: I assume this means that the operator is clearing WorkersToDelete in
		//// UpdateStatus() - which means the clearing in the Autoscaler is redundant
Hmm. How does the autoscaler get accurate data to clear WorkersToDelete?
You are correct here - the current Autoscaler logic is not perfect. I am removing that code to clear WorkersToDelete in the Autoscaler as a separate Ray PR.
I had actually assumed that Kuberay was not clearing WorkersToDelete when I saw the Autoscaler code, but then saw that it was in fact doing it. This is the right approach - glad that it is this way.
Here is the code reference: https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/kuberay/node_provider.py#L332
I mean featureFlag = true/false combined with before/after the Ray Autoscaler change in a PR I will have ready today.
	instance.Spec.WorkerGroupSpecs[index].ScaleStrategy.WorkersToDelete = []string{}

	// remove the remaining pods not part of the scaleStrategy
	i := 0
	if int(randomlyRemovedWorkers) > 0 {
		for _, randomPodToDelete := range runningPods.Items {
			found := false
			//// SriramQ: Isn't the following loop dead code - see my previous question
I'm going to leave this code here - and will delete it when we commit to the new logic and remove featureFlag. It's clearly dead code though.
Do we expect nodes in WorkersToDelete to be re-used? Here is a scenario: say the user downscales the cluster, the autoscaler (with help from GCS) drains the nodes, and the nodes are idle, but the user immediately upscales the cluster again. In such a scenario, do we intend to remove a few workers from WorkersToDelete?
There are situations when the coordination between the Autoscaler and Kuberay can get confused. This PR, along with a Kuberay PR (ray-project/kuberay#208), addresses these situations. Examples:
- The Autoscaler requests that Kuberay delete a specific set of nodes, but before the Kuberay reconciler kicks in, a node dies. This causes Kuberay to delete a random set of nodes instead of the ones specified. This issue is fixed in the Kuberay PR.
- The Autoscaler requests creation or termination of nodes, but simultaneously another request changes the number of replicas (e.g., through the Kuberay API server). In this case, the _wait_for_pods methods will never terminate, causing the Autoscaler to get stuck. This PR fixes that issue.
Details on the code changes:
- The Autoscaler no longer waits for Kuberay to complete the request (by waiting in _wait_for_pods). Instead, it makes sure the previous request has been completed each time before it submits a new request.
- Instead of ensuring that the number of replicas is correct (as _wait_for_pods was doing), which is error-prone, we now check that Kuberay has cleared workersToDelete as the indication that the previous request has been completed.
- The Autoscaler no longer clears workersToDelete.
- The Autoscaler adds a dummy entry into workersToDelete even for createNode requests (which Kuberay will eventually clear) so future requests can ensure the createNode request has been completed.
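A conceptual sketch of the request-gating idea described above, written in Go only for consistency with the operator snippets in this thread (the real change lives in the Ray autoscaler's Python node provider); all types and names are illustrative.

package main

import "fmt"

// crState mirrors the bit of the RayCluster spec the protocol relies on.
type crState struct {
	workersToDelete []string
}

// previousRequestFinished: the operator signals completion by clearing
// workersToDelete, so an empty list means the last request was processed.
func previousRequestFinished(cr *crState) bool {
	return len(cr.workersToDelete) == 0
}

// submitScaleRequest refuses to overwrite a pending request; pure scale-ups
// write a dummy entry so the next request can detect completion the same way.
func submitScaleRequest(cr *crState, workersToDelete []string) error {
	if !previousRequestFinished(cr) {
		return fmt.Errorf("previous scale request still pending")
	}
	if len(workersToDelete) == 0 {
		workersToDelete = []string{"dummy"}
	}
	cr.workersToDelete = workersToDelete
	return nil
}

func main() {
	cr := &crState{}
	fmt.Println(submitScaleRequest(cr, nil))             // <nil>: request accepted
	fmt.Println(submitScaleRequest(cr, []string{"w-3"})) // error: previous request still pending
}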
…the featureFlag as a command line flag.
The PR looks great -- I don't know as much about the code as other people in this thread, so I don't feel like I can approve it. @Jeffwan can you have another look and approve if it looks good to you? I also convinced myself that the code is equivalent if the feature flag is not set, so I'm a bit confused that the CI is failing on the latest commit. It would be great to dig into that a bit more :)
We should follow Kubernetes best practices https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/ , here the desired state is
I think the test is flaky and I will put some time into fixing it. We can rerun the tests later. I will have another check, and let's also address the comments from @chenk008.
@chenk008 My thoughts here are the following, and this is very similar to what @sriram-anyscale talked about in the last community meeting: Kubernetes and its autoscalers uses Now the pod is still a great abstraction for that, but collections of interchangeable pods ( Is that appropriately addressing your comment about the Kubernetes best practices or did you have something more specific or different in mind?
	// Essentially WorkersToDelete has to be deleted to meet the expectations of the Autoscaler.
	log.Info("reconcilePods", "removing the pods in the scaleStrategy of", worker.GroupName)
	for _, podsToDelete := range worker.ScaleStrategy.WorkersToDelete {
		if diff >= 0 {
@chenk008 - this if statement may address your concerns. However, I do not think it is a good idea, but we can use it as a starting point for a discussion. I fully agree with your point about deleting the nodes directly (and the issue of not being declarative). However, my PR did not introduce this problem - the node deletion and non-declarative aspects already existed before my PR. This change with the if statement makes the code strictly better than it was (I hope this part is obvious). I argue that the code will be even better without the if statement (which is the issue to discuss).
The reason we cannot delete the nodes directly is due to how the CRD and its associated logic have been designed. If we delete a node directly, Kuberay will go ahead and add a new one (which is not desirable in most cases). What we need to do in the current setup is to atomically decrease the number of replicas and remove the nodes.
There are multiple scenarios where the scheduler/autoscaler needs to remove one or more nodes but not maintain the current number of replicas. We really need to revisit the CRD design to address this, but this PR is attempting to improve the implementation given the current CRD.
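As an illustration of "atomically decrease the number of replicas and remove the nodes", a sketch where both fields change in one spec update so the reconciler never observes one without the other; the structs below are simplified stand-ins for the RayCluster CRD, not the real API types.

package main

import "fmt"

type scaleStrategy struct {
	WorkersToDelete []string
}

type workerGroupSpec struct {
	Replicas      int32
	ScaleStrategy scaleStrategy
}

// scaleDown changes Replicas and WorkersToDelete together; against a real
// cluster this would be a single Update/Patch of the RayCluster resource.
func scaleDown(group *workerGroupSpec, workers []string) {
	group.Replicas -= int32(len(workers))
	group.ScaleStrategy.WorkersToDelete = append(group.ScaleStrategy.WorkersToDelete, workers...)
}

func main() {
	g := workerGroupSpec{Replicas: 10}
	scaleDown(&g, []string{"worker-3", "worker-7"})
	fmt.Printf("replicas=%d workersToDelete=%v\n", g.Replicas, g.ScaleStrategy.WorkersToDelete)
}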
@pcmoritz @sriram-anyscale I agree that ray is a stateful workload. Maybe there is some gap in our discussion: we start ray with block, and the container entrypoint is ray start. When the ray worker dies, the container will exit, the kubelet will restart the container, and the ray worker will come back. The Kuberay reconciliation loop is not involved in this flow.
Consider the cases:
1. Scale down with a specified node: I think that's most of the cases. It is a little hard to atomically decrease the number of replicas and remove the nodes.
2. Scale down with a random node: We rarely use this.
3. Scale up: It is easy to adjust the replicas.
We should have consensus on the default behavior of ray-operator and autoscaler. @Jeffwan I think we can merge it, but maybe we should discuss this in the other issue.
I think you correctly list the 3 cases. There are also the cases when nodes die - then we need to scale down all the (specific) nodes that are dependencies of that node. We may eventually have to scale up, but that may happen after some time.
Regardless, these discussions go beyond this PR - this PR simply improves on how things are done currently. The "if statement" I pointed out (whether or not to have it) is probably the central part of the discussion required to wrap up the PR. My main argument for removing all the specified nodes (and therefore not having the "if statement") is to support the case when the containers do not die - they stay alive unexpectedly because they get stuck (due to a corner-case bug, for example). Then the only choice is to force-kill that node. Furthermore, the "if statement" only matters in corner-case situations where simultaneous events happen (like the 100-node example). I don't think it is necessary to optimize for those situations - correctness is more important.
To summarize, I hope it is OK to merge after removing the "if statement" I just added. Please comment either way - thanks!
… we have verified that the tests pass with the flag set to true)
@chenk008 The problem is that Ray needs to know that the nodes/pods came back so it can restart the actors (the actors won't be restarted if kubernetes just re-runs the pod with the same entrypoint).
My proposal here is to remove the if statement right now and discuss this more in the next Kuberay meeting (I think this is better discussed in person). We won't switch the feature flag to true before we have discussed this question and agreed on it. Given how the Ray autoscaler is designed today, the code without the if statement makes the most sense, so let's merge the PR with that now, so we can fix the bugs with the Ray Autoscaler <> KubeRay integration. Note it won't have an impact on existing KubeRay users since it is behind a feature flag. @Jeffwan @chenk008 Does this course of action sound good to you?
LGTM! We should merge it to fix the bugs with the integration.
Thanks everybody for your efforts to help with this :)
I was busy with some internal stuff and just got a chance to check the newly updated threads. Overall this looks good to me. Since we already have an MVP version (in v0.2.0) out, we can iterate quickly on master given that it's protected by a feature flag. For the further design improvements, let's create separate issues and discuss them in the community meeting.
This PR flips the flag introduced in Prioritize WorkersToDelete #208. This allows the autoscaler to function properly without additional configuration of the operator deployment. It updates the docs accordingly, makes minor tweaks to the autoscaling documentation (including documenting recently added fields in the sample config), and updates the default autoscaler image with changes from Ray upstream to include the bug fix from [KubeRay][Autoscaler][Core] Add a flag to disable ray status version check ray#26584. Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
* Modifies the reconciliation loop to act on WorkersToDelete first and then adjust the total number of running pods to match Replicas
* Removed my questions that were comments in the source code and added the featureFlag as a command line flag
* Added a change as a potential solution to issues raised in the PR
* Fixed location of if statement
* Removed the if statement and set feature flag back to false (now that we have verified that the tests pass with the flag set to true)
Why are these changes needed?
There are multiple race conditions between the Ray Autoscaler and the Kuberay reconciliation loop that this PR addresses. For example, suppose the Autoscaler requests a downscale by reducing the number of replicas and specifies the workers to delete. And suppose that a worker pod independently dies before the Kuberay reconciliation loop runs. The current code will delete a random set of pods to meet the Replicas count and ignore WorkersToDelete.
This PR makes the reconciliation loop first delete all the named pods in WorkersToDelete, and then reconciles the remaining running pods to match Replicas (either way - scale up or down).
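As shown below, a simplified sketch of this ordering: remove the explicitly named workers first, then compare what is left against Replicas and scale in either direction. The helper and pod names are invented for illustration and are not the operator's actual code.

package main

import "fmt"

// reconcileWorkers returns which pods to delete and how many placeholders to
// create, acting on workersToDelete before the Replicas comparison.
func reconcileWorkers(running []string, replicas int, workersToDelete []string) (toDelete, toCreate []string) {
	doomed := make(map[string]bool, len(workersToDelete))
	for _, w := range workersToDelete {
		doomed[w] = true
	}

	// Step 1: delete the explicitly named workers first.
	var remaining []string
	for _, pod := range running {
		if doomed[pod] {
			toDelete = append(toDelete, pod)
		} else {
			remaining = append(remaining, pod)
		}
	}

	// Step 2: reconcile the remaining pods against Replicas (either direction).
	diff := replicas - len(remaining)
	switch {
	case diff > 0:
		for i := 0; i < diff; i++ {
			toCreate = append(toCreate, fmt.Sprintf("new-worker-%d", i))
		}
	case diff < 0:
		n := -diff
		toDelete = append(toDelete, remaining[:n]...) // scale down the surplus among the remaining pods
	}
	return toDelete, toCreate
}

func main() {
	del, create := reconcileWorkers(
		[]string{"w1", "w2", "w3", "w4"}, // running pods
		2,                                // desired Replicas
		[]string{"w2"},                   // WorkersToDelete
	)
	fmt.Println("delete:", del, "create:", create) // delete: [w2 w1] create: []
}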
To verify that this change is compatible with all other components that work with Kuberay (e.g., Ray Autoscaler), the change is currently guarded by a feature flag - which needs to be set when Kuberay is started. This way we can test version compatibility.
There is a matching change in the Ray Autoscaler (ray-project/ray#23428). During testing we have to make sure that the before/after Ray Autoscaler works with the before/after Kuberay (essentially 4 combinations).
Related issue number
None
Checks
This PR is not ready to be merged right now. I need help on how to do the feature flag as well as how to run tests. Once I get past this, I will update the PR appropriately.