[autoscaler] Improve autoscaler auto-configuration, upstream recent improvements to Kuberay NodeProvider #274
Conversation
```diff
@@ -8,7 +8,7 @@ metadata:
 # An unique identifier for the head node and workers of this cluster.
 name: raycluster-autoscaler
 spec:
-  rayVersion: 'nightly'
+  rayVersion: '1.12.1'
```
The latest Ray release is compatible with the pinned autoscaler image.
```diff
-func addEmptyDir(container *v1.Container, pod *v1.Pod) {
-	if checkIfVolumeMounted(container, pod) {
+func addEmptyDir(container *v1.Container, pod *v1.Pod, volumeName string, volumeMountPath string, storageMedium v1.StorageMedium) {
+	if checkIfVolumeMounted(container, pod, volumeMountPath) {
```
Generalized to support adding the ray-log volume.
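As a loose sketch of what the generalized helper does (the real operator code works with `k8s.io/api/core/v1` types; the simplified local types and the idempotency check here are assumptions for illustration):

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes API types (v1.Container, v1.Pod)
// used by the real operator code; names are illustrative only.
type VolumeMount struct {
	Name      string
	MountPath string
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

// checkIfVolumeMounted reports whether the container already mounts a
// volume at the given path, mirroring the generalized signature above.
func checkIfVolumeMounted(c *Container, mountPath string) bool {
	for _, m := range c.VolumeMounts {
		if m.MountPath == mountPath {
			return true
		}
	}
	return false
}

// addEmptyDir adds a named volume mount unless one already exists at the
// path -- a sketch of the generalized helper discussed in this PR.
func addEmptyDir(c *Container, volumeName, mountPath string) {
	if checkIfVolumeMounted(c, mountPath) {
		return
	}
	c.VolumeMounts = append(c.VolumeMounts, VolumeMount{Name: volumeName, MountPath: mountPath})
}

func main() {
	c := &Container{Name: "ray-head"}
	addEmptyDir(c, "ray-logs", "/tmp/ray")
	addEmptyDir(c, "ray-logs", "/tmp/ray") // second call is a no-op
	fmt.Println(len(c.VolumeMounts))       // prints 1
}
```

The mount-path check is what makes the helper safe to call for both the Ray container and the autoscaler container.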
```diff
 	container.VolumeMounts = append(container.VolumeMounts, mountedVolume)
 	}
 }

-func checkIfVolumeMounted(container *v1.Container, pod *v1.Pod) bool {
+// Format an emptyDir volume.
+// When the storage medium is memory, set the size limit based on container resources.
```
The size limit calculation is kept as it was previously.
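As a loose illustration of that preserved behavior (the function name and plain-byte types here are invented; the real code works with `resource.Quantity` values from the container's resource spec), a memory-medium emptyDir's size limit can be derived from the container's memory limit:

```go
package main

import "fmt"

// shmSizeLimitBytes sketches the size-limit rule for a memory-backed
// emptyDir (e.g. /dev/shm): cap the volume at the container's memory
// limit, and leave it unbounded (0 here) when no limit is set.
// This is an assumption-laden simplification, not the operator's code.
func shmSizeLimitBytes(memoryLimitBytes int64) int64 {
	if memoryLimitBytes <= 0 {
		return 0
	}
	return memoryLimitBytes
}

func main() {
	fmt.Println(shmSizeLimitBytes(2 << 30)) // 2Gi memory limit -> 2147483648
	fmt.Println(shmSizeLimitBytes(0))       // no limit set -> 0
}
```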
```diff
@@ -47,7 +49,7 @@ var instance = rayiov1alpha1.RayCluster{
 	Containers: []v1.Container{
 		{
 			Name:  "ray-head",
-			Image: "rayproject/autoscaler",
```
`rayproject/autoscaler` is deprecated.
```go
		},
		Limits: v1.ResourceList{
			v1.ResourceCPU:    resource.MustParse("1"),
			v1.ResourceMemory: testMemoryLimit,
```
Added to test /dev/shm volume size limit.
That bug fix is here.
> Note: For compatibility with the Ray autoscaler, the KubeRay Operator's entrypoint
> must include the flag `--prioritize-workers-to-delete`. The kustomization overlay
We should make sure to have a plan to remove this flag and make it the default behavior.
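For reference, passing such a flag would land in the operator Deployment's container args; the field paths below are the standard `apps/v1` Deployment shape, but the container name and surrounding structure are assumptions, not taken from the KubeRay manifests:

```yaml
# Hypothetical fragment of the KubeRay operator Deployment spec;
# only the flag itself comes from the note above.
spec:
  template:
    spec:
      containers:
      - name: kuberay-operator   # assumed container name
        args:
        - --prioritize-workers-to-delete
```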
Small consistency ask before we merge: Can we always have a space after
I think I got all of the ones from this PR.
```diff
@@ -259,14 +359,6 @@ func TestDefaultHeadPodTemplate_WithAutoscalingEnabled(t *testing.T) {
 	}
 }

-func TestBuildAutoscalerContainer(t *testing.T) {
```
Logic moved into TestBuildPod_with_autoscaler_enabled.
Thanks for updating the comments :)
```go
	}

	// not found, use second container
	// (This branch shouldn't be accessed -- the autoscaler container should be present.)
```
Should we log an error or warning here, or possibly even exit the program if this happens?
If this were a Python program, I'd add an assertion.
The most correct thing would probably be to bubble an error a couple layers up the call stack, but that's a bit too much work for something that's not logically possible.
I will add a log statement.
Or panic, maybe.
Yeah, why don't we add a warning log statement both here and in the function above, noting that we use the first or second container respectively :)
Panic for autoscaler container, since it's supposed to be logically impossible.
Info statements for Ray container. IMO, using the first container should be the standard (also, that's the pattern suggested by the sample configs.)
```go
	}

	// This should be unreachable.
	panic("Autoscaler container not found!")
```
I think it's standard to panic in a code branch that's not supposed to be accessible.
```go
			return i
		}
	}
}
// not found, use first container
log.Info("Head pod container with index 0 identified as Ray container.")
```
I think it would make sense to document the requirement that the Ray container goes first.
To me, it seems odd to identify a container by an env variable -- also, none of the kuberay sample configs I've seen do that.
SGTM!
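The container-selection convention agreed on above can be sketched with simplified types (the `Container` struct, the `"autoscaler"` name, and the plain `fmt` logging are all illustrative assumptions; the real code uses the Kubernetes API types and the operator's logger):

```go
package main

import "fmt"

// Simplified stand-in for v1.Container; illustrative only.
type Container struct{ Name string }

// getAutoscalerContainerIndex locates the injected autoscaler container by
// name. Its absence panics, because the operator always injects it when
// autoscaling is enabled, so the fallthrough is logically unreachable.
func getAutoscalerContainerIndex(containers []Container) int {
	for i, c := range containers {
		if c.Name == "autoscaler" { // assumed injected container name
			return i
		}
	}
	// This should be unreachable.
	panic("Autoscaler container not found!")
}

// getRayContainerIndex applies the convention that the Ray container goes
// first in the head pod spec, logging the choice for visibility.
func getRayContainerIndex(containers []Container) int {
	fmt.Println("Head pod container with index 0 identified as Ray container.")
	return 0
}

func main() {
	cs := []Container{{Name: "ray-head"}, {Name: "autoscaler"}}
	fmt.Println(getAutoscalerContainerIndex(cs)) // prints 1
	fmt.Println(getRayContainerIndex(cs))        // prints 0
}
```

Panic for the autoscaler container and an info log for the Ray container mirrors the split discussed in this thread: one branch is impossible by construction, the other is a documented convention.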
Why are these changes needed?
Upstreams recent autoscaler changes from the Ray repo.
The Ray image is used to run the autoscaler, via the new entrypoint `ray kuberay-autoscaler`. Redis passwords are removed from the autoscaler configuration, since they're not relevant for Ray >= 1.11.0 and we don't support autoscaling with older Ray.
The autoscaler and Ray container need to share a log volume so that Ray drivers can receive autoscaling events. This PR has the operator configure the relevant volume mounts when autoscaling is enabled. Existing code for configuring volumeMounts is generalized for this purpose.
Some unit test logic around volume mounts is added.
Documentation is updated.
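The shared-log-volume arrangement described above can be pictured as a pod spec fragment (the volume name and mount path here are illustrative, not taken from the operator code):

```yaml
# Illustrative only: both containers mount the same emptyDir, so
# autoscaling events written by the autoscaler are visible to Ray drivers.
spec:
  containers:
  - name: ray-head
    volumeMounts:
    - name: ray-logs          # assumed volume name
      mountPath: /tmp/ray     # assumed mount path
  - name: autoscaler
    volumeMounts:
    - name: ray-logs
      mountPath: /tmp/ray
  volumes:
  - name: ray-logs
    emptyDir: {}
```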
Related issue number
Checks