
[autoscaler][kubernetes] Ray client setup, example config simplification, example scripts. #13920

Merged

Conversation

@DmitriGekhtman (Contributor) commented Feb 5, 2021

Why are these changes needed?

From the current example configs and documentation, it's not clear to users how to actually run a Ray program on Kubernetes.
The preferred scheme will be to run Ray client on the head node and to access the client via a K8s service.
This PR updates the example configs and operator code to reflect that scheme.
The relevant documentation changes will be added soon to this branch: #13839

In more detail, this PR does the following:
(1) Adds the argument --ray-client-server-port 50051 to the ray start command in the example cluster launching configs and operator configs. (Update: this is no longer needed; the Ray Client server now runs on port 10001 by default.)
(2) Adds to example cluster launching configs a service that allows access to the head pod's client and dashboard ports.
(3) Has the operator auto-configure the appropriate default service
(4) Simplifies the example configs
(5) Adds three example scripts (based on the script in the current docs) showing how to run a Ray program by

  • running kubectl exec directly on the head node
  • submitting a job and using Ray client from the job's pod
  • running kubectl port-forward from the local host to the head service and using Ray client locally

(6) Tests the job submission in the operator unit test.
(7) Makes example-full and defaults multi-node-type, and moves the previous example-full to example-full-legacy.
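The head-node service from item (2) is not shown in this thread; a minimal sketch of what such a manifest could look like is below. The metadata name and selector labels here are hypothetical (not the ones this PR actually generates); the ports follow the defaults mentioned above, 10001 for the Ray Client server and 8265 for the dashboard.

```yaml
# Hypothetical Service exposing the Ray head pod's client and dashboard ports.
# "ray-head" and the selector label are placeholders, not the PR's real values.
apiVersion: v1
kind: Service
metadata:
  name: ray-head
spec:
  selector:
    component: ray-head
  ports:
    - name: client
      protocol: TCP
      port: 10001
      targetPort: 10001
    - name: dashboard
      protocol: TCP
      port: 8265
      targetPort: 8265
```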
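The third workflow above (port-forwarding to the head service and using Ray client locally) could be sketched roughly as follows. The service name "ray-head" and namespace "ray" are placeholders, and ray.util.connect is the client entry point from the Ray releases contemporary with this PR; these commands assume a running cluster and are illustrative only.

```shell
# Hypothetical sketch: forward the head service's Ray Client port to
# localhost, then connect from a local Python process.
# "ray-head" and the "ray" namespace are placeholders.
kubectl -n ray port-forward service/ray-head 10001:10001 &

# Connect with the Ray client from a local Python interpreter
# (ray.util.connect was the client API around the time of this PR):
python -c "import ray.util; ray.util.connect('127.0.0.1:10001')"
```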

Related issue number

Closes #13656

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Updated unit test works for me locally.

@DmitriGekhtman force-pushed the ray-client-example-config-reworking branch from 51049a8 to 0b96c84 on February 6, 2021.
@AmeerHajAli (Contributor):

@wuisawesome @simon-mo , can you please take a look/review this?

@AmeerHajAli (Contributor) left a comment:

This looks good to me. Here is what would be ideal:

  1. Try running ray up with all the examples here.
  2. Attach to the cluster and check that all the logs look sane and the cluster is autoscaling.

I think we should have that before merging this PR. We should also add the documentation part (not necessarily in this PR).

@DmitriGekhtman (Author) commented Feb 7, 2021:

> This looks good to me. Here is what would be ideal:
>   1. Try running ray up with all the examples here.
>   2. Attach to the cluster and check that all the logs look sane and the cluster is autoscaling.
> I think we should have that before merging this PR. We should also add the documentation part (not necessarily in this PR).

Yep, I've done the tests.

From user feedback, autoscaling in response to resource load appears to be failing on K8s, but that's a separate issue to track.
(The min_workers behavior is covered by the unit tests I run locally.)

Working on documentation now.

@wuisawesome (Contributor) left a comment:

LGTM.

Added a few questions; they may not require any changes.

@@ -56,3 +56,26 @@ provider:
kind: Role
name: autoscaler
apiGroup: rbac.authorization.k8s.io

Contributor:

Hmmm these examples seem pretty long still. Do you have any thoughts on what it would take to move more of these into defaults.yaml?

@DmitriGekhtman (Author) commented Feb 8, 2021:

Hmm, fillout_defaults only fills in top-level fields.

The examples could be shortened if fillout_defaults filled in the subfields of the provider config that give the head node pod permissions it needs to autoscale.
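To make the distinction concrete, here is a minimal sketch (not Ray's actual fillout_defaults implementation; the function names and example config keys are illustrative) contrasting a shallow, top-level-only default fill with a recursive one that could also default subfields of the provider config:

```python
# Illustrative sketch only -- not Ray's real fillout_defaults.


def fill_top_level(config, defaults):
    """Copy in a default only when the whole top-level key is absent."""
    out = dict(config)
    for key, value in defaults.items():
        out.setdefault(key, value)
    return out


def fill_recursive(config, defaults):
    """Merge defaults into nested dicts as well, so subfields (e.g. the
    provider settings that grant the head pod autoscaling permissions)
    could also be defaulted."""
    out = dict(config)
    for key, value in defaults.items():
        if key not in out:
            out[key] = value
        elif isinstance(out[key], dict) and isinstance(value, dict):
            out[key] = fill_recursive(out[key], value)
    return out


defaults = {"provider": {"type": "kubernetes", "namespace": "ray"}}
user_cfg = {"provider": {"type": "kubernetes"}}

# Top-level fill leaves the user's provider dict untouched, since the
# "provider" key already exists:
print(fill_top_level(user_cfg, defaults))
# Recursive fill adds the missing "namespace" subfield:
print(fill_recursive(user_cfg, defaults))
```

The recursive variant is what shortening the example configs would require; the question in this thread is whether that change is worth making in this PR.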

Contributor:

Is that hard to change?

@DmitriGekhtman (Author):

Another thing to consider is that defaults need to be filled out in slightly different ways when using K8s operator vs. when using cluster launcher with K8s.

In any case, there are enough decisions involved that I think it's better left to another PR.

@DmitriGekhtman force-pushed the ray-client-example-config-reworking branch from f12f1d8 to 620ffe4 on February 8, 2021.
@simon-mo simon-mo removed their request for review February 8, 2021 17:43
@simon-mo simon-mo removed their assignment Feb 8, 2021
@DmitriGekhtman (Author):

@edoakes I think this is ready to merge

@edoakes edoakes merged commit 081f3e5 into ray-project:master Feb 9, 2021
fishbone pushed a commit to fishbone/ray that referenced this pull request Feb 16, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Feb 16, 2021
Successfully merging this pull request may close these issues.

[autoscaler][kubernetes] Get Ray client to work with Ray on K8s
6 participants