
[WIP] Kuberay Ray Autoscaler integration #100

Closed
pcmoritz wants to merge 8 commits

Conversation

pcmoritz (Collaborator)

This PR is not functional yet; it is used to track the Kuberay <-> Ray Autoscaler integration prototype.

Why are these changes needed?

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@@ -13,13 +13,14 @@ RUN go mod download
COPY main.go main.go
COPY api/ api/
COPY controllers/ controllers/
COPY rpc/ rpc/
Collaborator

We may change how the code is organized later. Let's concentrate on the functionality for now and adapt to any structural changes afterwards.

@@ -19,7 +19,8 @@ type RayClusterSpec struct {
// RayVersion is the version of ray being used. this affects the command used to start ray
RayVersion string `json:"rayVersion,omitempty"`
// EnableInTreeAutoscaling indicates whether operator should create in tree autoscaling configs
EnableInTreeAutoscaling *bool `json:"enableInTreeAutoscaling,omitempty"`
EnableInTreeAutoscaling *bool `json:"enableInTreeAutoscaling,omitempty"`
DesiredWorkers []string `json:"desiredWorkers,omitempty"`
Collaborator

Would this be a full snapshot of the cluster workers? I remember the community had some discussions in the past about whether to use a full list or a delta. Some of the feedback was that for a large Ray cluster this list would get long.

Collaborator

ScaleStrategy ScaleStrategy `json:"scaleStrategy,omitempty"`

We defined ScaleStrategy for this purpose in the past. Can that be reused?
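
For context, a minimal sketch (not the PR's code) of how the existing ScaleStrategy field could carry the autoscaler's intent as a delta instead of a full DesiredWorkers snapshot. The WorkersToDelete field name is an illustrative assumption, not necessarily the repo's exact definition.

// Sketch only: reusing the existing scaleStrategy field as a delta, so the
// autoscaler reports only the workers it wants removed rather than a full
// snapshot of the cluster.
package v1alpha1

// ScaleStrategy describes how the autoscaler wants the worker set to change.
type ScaleStrategy struct {
    // WorkersToDelete is a delta: only the pods the autoscaler wants removed.
    WorkersToDelete []string `json:"workersToDelete,omitempty"`
}

// WorkerGroupSpec (excerpt) already embeds the strategy, so the autoscaler
// could patch this field instead of publishing a cluster-wide DesiredWorkers list.
type WorkerGroupSpec struct {
    Replicas      *int32        `json:"replicas,omitempty"`
    ScaleStrategy ScaleStrategy `json:"scaleStrategy,omitempty"`
}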

Collaborator

/cc @akanso

Collaborator Author

In the current iteration, I'm now just launching worker pods directly in the gRPC server from the autoscaler. That's simple and effective and doesn't need schema changes.
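
For context on the approach described above, a hypothetical sketch (not the actual PR code) of a gRPC server handler that creates a worker pod directly with client-go. The request type, label key, and container command are illustrative assumptions.

// Hypothetical sketch: the gRPC server that the autoscaler talks to launches
// worker pods directly via client-go, with no RayCluster schema changes.
package rpc

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// CreateWorkerRequest is an illustrative request shape, not the PR's proto message.
type CreateWorkerRequest struct {
    ClusterName string
    Namespace   string
    Image       string
}

type server struct {
    client kubernetes.Interface
}

// CreateWorker builds a worker pod spec and submits it to the API server.
func (s *server) CreateWorker(ctx context.Context, req *CreateWorkerRequest) error {
    pod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            GenerateName: req.ClusterName + "-worker-",
            // Label key is illustrative; it just ties the pod back to its cluster.
            Labels: map[string]string{"ray.io/cluster": req.ClusterName},
        },
        Spec: corev1.PodSpec{
            Containers: []corev1.Container{{
                Name:    "ray-worker",
                Image:   req.Image,
                Command: []string{"/bin/sh", "-c", "ulimit -n 65536; ray start --block --address=$RAY_HEAD_IP:6379"},
            }},
        },
    }
    _, err := s.client.CoreV1().Pods(req.Namespace).Create(ctx, pod, metav1.CreateOptions{})
    return err
}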

path: code.py
- key: example.py
path: example.py
- name: autoscaler-config
Collaborator

I think you are experimenting with the configuration here to save some debugging effort on the e2e work? Just want to confirm that we are on the same page about the flow (see the sketch after this list).

If EnableInTreeAutoscaling is enabled:

  1. The controller converts the CR into an autoscaling config.
  2. The controller creates the ConfigMap and mounts it into the head pod automatically.
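
A rough sketch of those two steps, assuming the controller uses client-go types; the ConfigMap name, data key, and mount path are illustrative, not taken from the PR.

// Sketch of the flow above: derive an autoscaler config from the RayCluster CR,
// write it into a ConfigMap, and mount that ConfigMap into the head pod.
package controllers

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildAutoscalerConfigMap wraps the generated autoscaling YAML in a ConfigMap.
// The config content itself is a placeholder produced elsewhere from the CR.
func buildAutoscalerConfigMap(clusterName, namespace, configYAML string) *corev1.ConfigMap {
    return &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      clusterName + "-autoscaler-config",
            Namespace: namespace,
        },
        Data: map[string]string{"autoscaler.yaml": configYAML},
    }
}

// headPodVolumes returns the volume and mount that expose the ConfigMap to the
// head container, so the autoscaler can read its config at a known path.
func headPodVolumes(clusterName string) (corev1.Volume, corev1.VolumeMount) {
    vol := corev1.Volume{
        Name: "autoscaler-config",
        VolumeSource: corev1.VolumeSource{
            ConfigMap: &corev1.ConfigMapVolumeSource{
                LocalObjectReference: corev1.LocalObjectReference{Name: clusterName + "-autoscaler-config"},
            },
        },
    }
    mount := corev1.VolumeMount{Name: "autoscaler-config", MountPath: "/etc/ray/autoscaler"}
    return vol, mount
}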

Collaborator Author

Yes, this will be integrated into the controller; it is just here for testing at the moment.

@@ -252,7 +252,9 @@ func setMissingRayStartParams(rayStartParams map[string]string, nodeType rayiov1
func concatenateContainerCommand(nodeType rayiov1alpha1.RayNodeType, rayStartParams map[string]string) (fullCmd string) {
switch nodeType {
case rayiov1alpha1.HeadNode:
return fmt.Sprintf("ulimit -n 65536; ray start --head %s", convertParamMap(rayStartParams))
// ray start --head --no-monitor --port=6379 --redis-password=5241590000000000 --dashboard-host=0.0.0.0 --node-ip-address=$MY_POD_IP --node-manager-port=12346 --num-cpus=1 --object-manager-port=12345 --object-store-memory=100000000
return fmt.Sprintf("ulimit -n 65536; ray start --head --no-monitor --block %s", convertParamMap(rayStartParams))
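
For reference, a hedged sketch of what a convertParamMap helper could look like (the controller's actual implementation may differ); it only illustrates how rayStartParams become the --key=value flags appended to the ray start command above.

// Sketch only: turn rayStartParams into a space-separated flag string.
package controllers

import (
    "fmt"
    "sort"
    "strings"
)

func convertParamMap(rayStartParams map[string]string) string {
    keys := make([]string, 0, len(rayStartParams))
    for k := range rayStartParams {
        keys = append(keys, k)
    }
    sort.Strings(keys) // deterministic flag order for reproducible pod specs
    flags := make([]string, 0, len(keys))
    for _, k := range keys {
        flags = append(flags, fmt.Sprintf("--%s=%s", k, rayStartParams[k]))
    }
    return strings.Join(flags, " ")
}
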
Collaborator

--block is tracked in #77

pcmoritz (Collaborator Author)

We are now doing this differently; see ray-project/ray#21086.

@pcmoritz closed this on Feb 19, 2022
@Jeffwan deleted the kuberay-autoscaler branch on March 14, 2022 at 03:40
@Jeffwan restored the kuberay-autoscaler branch on March 14, 2022 at 03:41
@Jeffwan deleted the kuberay-autoscaler branch on August 18, 2022 at 03:39