
[autoscaler][kubernetes] Ray client setup, example config simplification, example scripts. #13920

Merged

Conversation

@DmitriGekhtman (Contributor) commented Feb 5, 2021

Why are these changes needed?

From the current example configs and documentation, it's not clear to users how to actually run a Ray program on Kubernetes.
The preferred scheme will be to run Ray client on the head node and to access the client via a K8s service.
This PR updates the example configs and operator code to reflect that scheme.
The relevant documentation changes will be added soon to this branch: #13839

In more detail, this PR does the following:
(1) Adds the argument --ray-client-server-port 50051 to the ray start command in the example cluster launching configs and operator configs. (Update: this is no longer needed; the Ray Client server now runs on port 10001 by default.)
(2) Adds to example cluster launching configs a service that allows access to the head pod's client and dashboard ports.
(3) Has the operator auto-configure the appropriate default service
(4) Simplifies the example configs
(5) Adds three example scripts (based on the script in the current docs) showing how to run a Ray program by

  • running kubectl exec directly on the head node
  • submitting a job and using Ray client from the job's pod
  • running kubectl port-forward from the local host to the head service and using Ray client locally

(6) Tests the job submission in the operator unit test.
(7) Makes example-full and defaults multi-node-type, and moves the previous example-full to example-full-legacy.
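The head-node service from item (2) is not shown in this thread; a minimal sketch of what such a manifest could look like is below. The metadata name and selector labels here are hypothetical (not the ones this PR actually generates); the ports follow the defaults mentioned above, 10001 for the Ray Client server and 8265 for the dashboard.

```yaml
# Hypothetical Service exposing the Ray head pod's client and dashboard ports.
# "ray-head" and the selector label are placeholders, not the PR's real values.
apiVersion: v1
kind: Service
metadata:
  name: ray-head
spec:
  selector:
    component: ray-head
  ports:
    - name: client
      protocol: TCP
      port: 10001
      targetPort: 10001
    - name: dashboard
      protocol: TCP
      port: 8265
      targetPort: 8265
```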
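The third workflow above (port-forwarding to the head service and using Ray client locally) could be sketched roughly as follows. The service name "ray-head" and namespace "ray" are placeholders, and ray.util.connect is the client entry point from the Ray releases contemporary with this PR; these commands assume a running cluster and are illustrative only.

```shell
# Hypothetical sketch: forward the head service's Ray Client port to
# localhost, then connect from a local Python process.
# "ray-head" and the "ray" namespace are placeholders.
kubectl -n ray port-forward service/ray-head 10001:10001 &

# Connect with the Ray client from a local Python interpreter
# (ray.util.connect was the client API around the time of this PR):
python -c "import ray.util; ray.util.connect('127.0.0.1:10001')"
```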

Related issue number

Closes #13656

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Updated unit test works for me locally.

@DmitriGekhtman force-pushed the ray-client-example-config-reworking branch from 51049a8 to 0b96c84 on February 6, 2021.
@AmeerHajAli (Contributor):

@wuisawesome @simon-mo , can you please take a look/review this?

@AmeerHajAli (Contributor) left a comment:

This looks good to me. Here is what would be ideal:

  1. Try running ray up with all the examples here.
  2. Attach to the cluster and check that all the logs look sane and the cluster is autoscaling.

I think we should have that before merging this PR. We should also add the documentation part (not necessarily in this PR).

@DmitriGekhtman (Author) commented Feb 7, 2021:

> This looks good to me. Here is what would be ideal:
>   1. Try running ray up with all the examples here.
>   2. Attach to the cluster and check that all the logs look sane and the cluster is autoscaling.
> I think we should have that before merging this PR. We should also add the documentation part (not necessarily in this PR).

Yep, I've done the tests.

From user feedback, autoscaling in response to resource load appears to be failing on K8s, but that's a separate issue to track.
(The min_workers behavior is covered by the unit tests I run locally.)

Working on documentation now.

@wuisawesome (Contributor) left a comment:

LGTM.

Added a few questions; they may not require any changes.

@@ -56,3 +56,26 @@ provider:
kind: Role
name: autoscaler
apiGroup: rbac.authorization.k8s.io

Contributor:

Hmmm these examples seem pretty long still. Do you have any thoughts on what it would take to move more of these into defaults.yaml?

@DmitriGekhtman (Author) commented Feb 8, 2021:

Hmm, fillout_defaults only fills in top-level fields.

The examples could be shortened if fillout_defaults filled in the subfields of the provider config that give the head node pod permissions it needs to autoscale.
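To make the distinction concrete, here is a minimal sketch (not Ray's actual fillout_defaults implementation; the function names and example config keys are illustrative) contrasting a shallow, top-level-only default fill with a recursive one that could also default subfields of the provider config:

```python
# Illustrative sketch only -- not Ray's real fillout_defaults.


def fill_top_level(config, defaults):
    """Copy in a default only when the whole top-level key is absent."""
    out = dict(config)
    for key, value in defaults.items():
        out.setdefault(key, value)
    return out


def fill_recursive(config, defaults):
    """Merge defaults into nested dicts as well, so subfields (e.g. the
    provider settings that grant the head pod autoscaling permissions)
    could also be defaulted."""
    out = dict(config)
    for key, value in defaults.items():
        if key not in out:
            out[key] = value
        elif isinstance(out[key], dict) and isinstance(value, dict):
            out[key] = fill_recursive(out[key], value)
    return out


defaults = {"provider": {"type": "kubernetes", "namespace": "ray"}}
user_cfg = {"provider": {"type": "kubernetes"}}

# Top-level fill leaves the user's provider dict untouched, since the
# "provider" key already exists:
print(fill_top_level(user_cfg, defaults))
# Recursive fill adds the missing "namespace" subfield:
print(fill_recursive(user_cfg, defaults))
```

The recursive variant is what shortening the example configs would require; the question in this thread is whether that change is worth making in this PR.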

Contributor:

Is that hard to change?

@DmitriGekhtman (Author):

Another thing to consider is that defaults need to be filled out in slightly different ways when using K8s operator vs. when using cluster launcher with K8s.

In any case, there are enough decisions involved that I think it's better left to another PR.

@DmitriGekhtman force-pushed the ray-client-example-config-reworking branch from f12f1d8 to 620ffe4 on February 8, 2021.
@simon-mo simon-mo removed their request for review February 8, 2021 17:43
@simon-mo simon-mo removed their assignment Feb 8, 2021
@DmitriGekhtman (Author):

@edoakes I think this is ready to merge

@edoakes edoakes merged commit 081f3e5 into ray-project:master Feb 9, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
Successfully merging this pull request may close these issues.

[autoscaler][kubernetes] Get Ray client to work with Ray on K8s
6 participants