
Support in-tree autoscaler in ray operator #28

Closed
Jeffwan opened this issue Sep 12, 2021 · 12 comments
Assignees
Labels
enhancement New feature or request

Comments

@Jeffwan
Collaborator

Jeffwan commented Sep 12, 2021

The controller supports scaling in arbitrary pods via the following API. This is extremely helpful for users who use an out-of-tree autoscaler.

https://github.com/ray-project/ray-contrib/blob/f4076b4ec5bfae4cea6d9b66a1ec4e63680ca366/ray-operator/api/v1alpha1/raycluster_types.go#L56-L60
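
For context, here is a rough sketch of the kind of scale-in field the linked API exposes; the struct and field names are written for illustration, not copied from the linked lines.

// Illustrative sketch only -- see the linked raycluster_types.go for the real definition.
// ScaleStrategy lets an external (out-of-tree) autoscaler name specific worker pods
// to remove, instead of only lowering a replica count.
type ScaleStrategy struct {
    // WorkersToDelete lists the worker pods the operator should delete on the next reconcile.
    WorkersToDelete []string `json:"workersToDelete,omitempty"`
}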

In our case, we would still like to use the in-tree autoscaler. The major differences are:

  1. We want to enable the in-tree autoscaler in the head pod via --autoscaling-config, and the head pod will start the monitor process.
  2. The config actually has to come from the operator, so the operator needs to convert the RayCluster custom resource into a config file that the in-tree autoscaler can use (example here).
  3. Since the head and the operator are hosted in different pods, the operator needs to create a ConfigMap and mount it into the head pod transparently (a rough sketch of steps 2 and 3 follows).
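
A minimal sketch of what steps 2 and 3 could look like inside the operator, assuming a hypothetical convertToAutoscalerConfig helper and a config file named autoscaling-config.yaml (neither is taken from an actual implementation):

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    rayv1alpha1 "github.com/ray-project/ray-contrib/ray-operator/api/v1alpha1"
)

// buildAutoscalingConfigMap converts the RayCluster CR into the config file the
// in-tree autoscaler (monitor.py) reads via --autoscaling-config, and wraps it in
// a ConfigMap that can be mounted into the head pod.
func buildAutoscalingConfigMap(cluster *rayv1alpha1.RayCluster) (*corev1.ConfigMap, error) {
    // Hypothetical helper: translate the CR spec (min/max workers, pod templates, ...)
    // into the YAML schema the autoscaler expects.
    configYAML, err := convertToAutoscalerConfig(cluster)
    if err != nil {
        return nil, err
    }
    return &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      cluster.Name + "-autoscaler-config",
            Namespace: cluster.Namespace,
        },
        Data: map[string]string{"autoscaling-config.yaml": configYAML},
    }, nil
}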

A new field has been reserved in the API to support this change.

https://github.com/ray-project/ray-contrib/pull/22/files#diff-edc3be4feb67012c143a57fcaefafb4c95e4cd6e661a67bb2ad1da340255bc00R21-R22

@Jeffwan Jeffwan added the enhancement New feature or request label Sep 12, 2021
@ericl
Collaborator

ericl commented Oct 18, 2021

It would be great to see support for in-tree autoscaling! Are there any API changes to the in-tree autoscaler or proto APIs that might make this easier to implement / maintain?

(I'm happy to work together on this issue)

@Jeffwan
Collaborator Author

Jeffwan commented Oct 19, 2021

@ericl We did some analysis and noticed it's kind of hard to start the monitor and keep the exact same pattern as in ray/core. I do think we need some changes to provide a smooth and pluggable experience. Let us add more details to the issue, and then we can have the discussion.

@ericl
Copy link
Collaborator

ericl commented Oct 19, 2021

Cc @DmitriGekhtman, who maintains the in-tree operator.

@Jeffwan Jeffwan added this to the v0.2.0 release milestone Oct 19, 2021
@DmitriGekhtman
Collaborator

@Jeffwan could you say more about why having the autoscaler run in the head pod is preferable for the use-cases you are considering?

If I understand right, you'd also prefer the autoscaler to directly interact with K8s api server, rather than acting on a custom resource and delegating pod management to the operator.

Just curious if there are particular reasons this way of doing things works best for you, besides the fact that the Ray autoscaler is currently set up to favor this deployment strategy.

@DmitriGekhtman
Collaborator

I guess "in-tree autoscaler" mostly means "monitor.py" from the main Ray project.
One way to make it work is to write a NodeProvider implementation whose "create node" and "terminate node" methods act on the scale fields of the RayCluster CR.

@Jeffwan
Collaborator Author

Jeffwan commented Nov 2, 2021

@Jeffwan could you say more about why having the autoscaler run in the head pod is preferable for the use-cases you are considering?

@DmitriGekhtman I missed your last comment. Running the autoscaler in the head pod scopes it at the cluster level, which is what we expect. Since the autoscaler may have different policies etc. in the future, this gives us enough flexibility to customize the autoscaler for each cluster and for different Ray versions. (We are not end users, and version upgrades take time; it's common to have multiple versions running in the cluster at the same time.)

If I understand right, you'd also prefer the autoscaler to directly interact with K8s api server, rather than acting on a custom resource and delegating pod management to the operator.

I actually prefer to have the autoscaler update the Kubernetes CRD, so there's always one owner of the pods and the responsibility is clear.

@Jeffwan
Collaborator Author

Jeffwan commented Nov 2, 2021

I guess "in-tree autoscaler" mostly means "monitor.py" from the main Ray project.
One way to make it work is to write a NodeProvider implementation whose "create node" and "terminate node" methods act on the scale fields of the RayCluster CR.

That's correct. We did a POC like the one below to verify the functionality, but we feel there are some upstream changes to make. Currently, we are not using autoscaling in our environments yet.

  1. CRD -> a config file the autoscaler can recognize.
  2. The operator converts the CRD to that config, creates a ConfigMap, and mounts it into the head node.
  3. The head node starts the monitor process and reads the config (the pod-side wiring is sketched below).
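
For the pod-side wiring, a rough sketch of how the operator could mount the generated ConfigMap into the head pod; the volume name, mount path, and the assumption that the head container is at index 0 are all illustrative:

import corev1 "k8s.io/api/core/v1"

// mountAutoscalingConfig mounts the autoscaler ConfigMap into the head pod so the
// monitor process can read it.
func mountAutoscalingConfig(headPod *corev1.PodSpec, configMapName string) {
    headPod.Volumes = append(headPod.Volumes, corev1.Volume{
        Name: "autoscaling-config",
        VolumeSource: corev1.VolumeSource{
            ConfigMap: &corev1.ConfigMapVolumeSource{
                LocalObjectReference: corev1.LocalObjectReference{Name: configMapName},
            },
        },
    })
    head := &headPod.Containers[0] // assumes the head container is first
    head.VolumeMounts = append(head.VolumeMounts, corev1.VolumeMount{
        Name:      "autoscaling-config",
        MountPath: "/etc/ray/autoscaler",
    })
    // The head's `ray start --head ...` command then enables the monitor with
    // something like --autoscaling-config=/etc/ray/autoscaler/autoscaling-config.yaml.
}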

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Nov 2, 2021

All of this makes sense.
I think it might be advantageous to deploy the autoscaler as a separate deployment (scoped to a single Ray cluster). That gives more flexibility. Also, it's better for resource management -- we've observed the autoscaler using up a lot of memory under certain conditions.

Mounting a config map works. Another option is to have the autoscaler read the custom resource and do the translation to a suitable format itself, once per autoscaler iteration. This has the advantage that changes to the CR propagate faster to the autoscaler -- mounted config maps take a while to update.

@pcmoritz
Collaborator

I wrote a design doc fleshing out the above proposals a bit more:

https://docs.google.com/document/d/1I2CYu2-hTQUJ29wPonMvCZgEiRPs1-KeqT1mzrC6LXY

Please let us know about the direction and any suggestions or improvements you might have :)

@Jeffwan
Collaborator Author

Jeffwan commented Feb 21, 2022

ray-project/ray#21086
ray-project/ray#22348

Ray upstream already has the support. Under the current implementation, the kuberay operator's work becomes easier: the operator should act on this field to orchestrate the autoscaler. The entire process should be transparent to users.

// EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs
EnableInTreeAutoscaling *bool `json:"enableInTreeAutoscaling,omitempty"`
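
As an illustration only, the operator's reconcile loop could branch on this field roughly as below; ensureAutoscalerRBAC and ensureAutoscalerOnHead are hypothetical helpers, not the actual kuberay code:

// reconcileAutoscaling is a sketch of how the operator might act on the new field.
func reconcileAutoscaling(cluster *rayv1alpha1.RayCluster) error {
    if cluster.Spec.EnableInTreeAutoscaling == nil || !*cluster.Spec.EnableInTreeAutoscaling {
        return nil // field unset or false: do nothing, so the change is transparent to users
    }
    // Hypothetical helpers: grant the autoscaler permission to update the RayCluster CR,
    // then make sure the head pod (or a sidecar) runs the autoscaler for this Ray version.
    if err := ensureAutoscalerRBAC(cluster); err != nil {
        return err
    }
    return ensureAutoscalerOnHead(cluster)
}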

Meanwhile, version management is still tricky. We should not support the autoscaler for earlier Ray versions.

@DmitriGekhtman
Collaborator

Yep, I agree that we don't need to support the Ray autoscaler with earlier Ray versions.

@Jeffwan Jeffwan mentioned this issue Feb 28, 2022
4 tasks
@Jeffwan Jeffwan self-assigned this Feb 28, 2022
@Jeffwan
Collaborator Author

Jeffwan commented Mar 8, 2022

Major implementation is done. Let's create separate issues to track future improvements.

@Jeffwan Jeffwan closed this as completed Mar 8, 2022