[autoscaler][kubernetes] Ray client setup, example config simplification, example scripts. #13920
Ray client now runs on the head node by default using port 10001.
51049a8 to 0b96c84
@wuisawesome @simon-mo, can you please take a look/review this?
This looks good to me.
Here is what would be ideal here.
- Try running ray up with all the examples here.
- Attach to the cluster and check if all the logs look sane and the cluster is autoscaling.
I think we should have that before merging this PR. We should also add the documentation part (not necessary in this PR).
Yep, I've done the tests. From user feedback, autoscaling in response to resource loads seems to be failing on K8s, but that's a whole other issue to track. Working on documentation now.
lgtm.
added a few questions. they may not require any changes.
python/ray/autoscaler/kubernetes/example_scripts/run_local_example.py
@@ -56,3 +56,26 @@ provider:
kind: Role
name: autoscaler
apiGroup: rbac.authorization.k8s.io
Hmmm these examples seem pretty long still. Do you have any thoughts on what it would take to move more of these into defaults.yaml?
hmm, fillout_defaults only fills in top-level fields.
The examples could be shortened if fillout_defaults filled in the subfields of the provider config that give the head node pod permissions it needs to autoscale.
Is that hard to change?
Another thing to consider is that defaults need to be filled out in slightly different ways when using K8s operator vs. when using cluster launcher with K8s.
In any case, I think it's sufficiently complicated [enough decisions involved] that it's better left to another PR.
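To illustrate the distinction raised in this thread, here is a minimal sketch contrasting top-level-only default filling with a recursive fill that also descends into nested fields such as the provider config. It is not Ray's actual fillout_defaults; the function names, keys, and values below are hypothetical and heavily trimmed.

```python
import copy


def fill_top_level(config, defaults):
    """Fill in only the top-level keys that are missing from config."""
    merged = copy.deepcopy(config)
    for key, value in defaults.items():
        merged.setdefault(key, copy.deepcopy(value))
    return merged


def fill_nested(config, defaults):
    """Fill in missing keys recursively, descending into nested dicts."""
    merged = copy.deepcopy(config)
    for key, value in defaults.items():
        if key not in merged:
            merged[key] = copy.deepcopy(value)
        elif isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = fill_nested(merged[key], value)
    return merged


# Hypothetical defaults and user config (illustrative keys only).
defaults = {
    "max_workers": 2,
    "provider": {
        "type": "kubernetes",
        "autoscaler_role": {"kind": "Role", "metadata": {"name": "autoscaler"}},
    },
}
user_config = {"provider": {"type": "kubernetes", "namespace": "ray"}}

shallow = fill_top_level(user_config, defaults)
deep = fill_nested(user_config, defaults)

# The top-level fill leaves the user's provider block untouched, so the
# permission subfields are still missing; the nested fill adds them.
print("autoscaler_role" in shallow["provider"])
print("autoscaler_role" in deep["provider"])
```

With a nested fill, a user config would only need to state the provider fields it overrides, which is the shortening discussed above. How the defaults should differ between the K8s operator and the cluster launcher is left open here, as in the thread.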
f12f1d8 to 620ffe4
@edoakes I think this is ready to merge
…ion, example scripts. (ray-project#13920)
…plification, example scripts. (ray-project#13920)" This reverts commit a4226a4.
Why are these changes needed?
From the current example configs and documentation, it's not clear to users how to actually run a Ray program on Kubernetes.
The preferred scheme will be to run Ray client on the head node and to access the client via a K8s service.
This PR updates the example configs and operator code to reflect that scheme.
The relevant documentation changes will be added soon to this branch: #13839
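As a rough sketch of the scheme described above, a program could reach the Ray client server on the head node through a K8s service name on port 10001. The service name `example-cluster-ray-head` and the helper names are hypothetical. The connect call is also an assumption that depends on the installed Ray version: newer releases accept a `ray://` address in `ray.init`, while older releases exposed `ray.util.connect`.

```python
def ray_client_address(host, port=10001):
    """Build a host:port address; 10001 is the default client port noted in this PR."""
    return f"{host}:{port}"


def connect(host, port=10001):
    """Connect to the Ray client server (requires Ray to be installed)."""
    # Imported lazily so the address helper works without Ray installed.
    import ray

    # Assumption: this Ray version accepts a "ray://" address in ray.init.
    return ray.init(f"ray://{ray_client_address(host, port)}")


# Hypothetical service name; substitute whatever your config defines.
address = ray_client_address("example-cluster-ray-head")
print(address)
```

From inside the K8s cluster, `connect("example-cluster-ray-head")` would then be the whole setup step, assuming the service is resolvable under that name.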
In more detail, this PR does the following:
(1) Adds arg --ray-client-server-port 50051 to the ray start command in the example cluster launching configs and operator configs. Never mind, the Ray Client server now runs on port 10001 by default!
(2) Adds to example cluster launching configs a service that allows access to the head pod's client and dashboard ports.
(3) Has the operator auto-configure the appropriate default service.
(4) Simplifies the example configs.
(5) Adds three example scripts (based on the script in the current docs) showing how to run a Ray program by kubectl exec-ing directly on the head node.
(6) Tests the job submission in the operator unit test.
(7) Makes example-full and defaults multi-node-type, and moves example-full to example-full-legacy.
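For item (2), the kind of service involved might look like the following sketch, written as a Python dict that mirrors a Kubernetes Service manifest. The service name, namespace, and selector labels are hypothetical, and the dashboard port 8265 is assumed as Ray's usual default rather than taken from this PR; the actual example configs may differ.

```python
def head_service_manifest(name, namespace, selector,
                          client_port=10001, dashboard_port=8265):
    """Return a Service manifest exposing the head pod's client and dashboard ports."""
    def port_entry(port_name, port):
        return {
            "name": port_name,
            "protocol": "TCP",
            "port": port,
            "targetPort": port,
        }

    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The selector must match labels on the head pod (hypothetical here).
            "selector": dict(selector),
            "ports": [
                port_entry("client", client_port),
                port_entry("dashboard", dashboard_port),
            ],
        },
    }


manifest = head_service_manifest(
    name="example-cluster-ray-head",   # hypothetical service name
    namespace="ray",                    # hypothetical namespace
    selector={"component": "example-cluster-ray-head"},  # hypothetical label
)
print([p["port"] for p in manifest["spec"]["ports"]])
```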
Related issue number
Closes #13656
Checks
I've run scripts/format.sh to lint the changes in this PR.
Updated unit test works for me locally.