[doc][clusters] add doc for setting up Ray and K8s #39408

Merged
merged 15 commits into from
Sep 9, 2023

angelinalg
Contributor

Fill the content gap by providing best practices for two flavors of deployments: interactive development and production.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
@angelinalg added the docs, core-clusters, and v2.7.0-pick labels on Sep 7, 2023
doc/source/cluster/kubernetes/user-guides/ray-k8s-setup.md Outdated

# Set up a Ray + Kubernetes cluster

This document contains recommendations for setting up a Ray + Kubernetes cluster for your organization.

Member

The Ray and Kubernetes ecosystem encompasses various aspects. Could you specify which setup instructions are covered by this document?

Member

It seems to be covered by:

This guide covers best practices for these deployment considerations:

* Where to ship or run your code on the Ray cluster
* Choosing a storage system for artifacts
* Package dependencies for your application

Contributor

should be addressed


### Storage

Use one of these two standard solutions for artifact and log storage during the development process:

Member

It is inconsistent with the table above. We only mention NFS/EFS in the table under the 'interactive development' column. However, here we reference both NFS/EFS and S3/GS.

Contributor

updated



```{eval-rst}
.. image:: ../images/prod.png
```

Member

This image is inconsistent with the table above. We only mention S3/GS in the table under the 'production' column. However, here we only reference NFS/EFS.

Contributor

updated


### Storage

Reading and writing data and artifacts to cloud storage is the most reliable and observable option for production Ray deployments.

Member

Why?

Contributor

updated


Bake your code and its remote and local dependencies into a published Docker image for the workers. This is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes).

Using cloud storage and the `runtime_env` is a less preferred method. In this case, use the runtime environment option to download zip files containing code and other private modules from cloud storage, and to specify the pip packages needed to run your application.
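
As a rough illustration of the `runtime_env` approach described above, the sketch below builds a runtime environment whose `working_dir` points at a zip file in cloud storage and whose `pip` field lists the packages to install. The bucket URI, the package pin, and the helper names `build_runtime_env` and `connect` are hypothetical and not part of this PR; `ray` is imported lazily so the snippet loads even where Ray is not installed.

```python
from typing import Any, Dict, List


def build_runtime_env(code_zip_uri: str, pip_packages: List[str]) -> Dict[str, Any]:
    """Build a runtime_env dict that downloads zipped code and installs pip packages."""
    return {
        # Remote URI of a zip archive holding the application code (hypothetical).
        "working_dir": code_zip_uri,
        # pip requirements installed on the cluster at runtime.
        "pip": list(pip_packages),
    }


# Hypothetical bucket and package pin, for illustration only.
runtime_env = build_runtime_env(
    "s3://example-bucket/my_app.zip",
    ["requests==2.31.0"],
)


def connect(env: Dict[str, Any]) -> None:
    """Start or connect to Ray with the given runtime environment."""
    import ray  # lazy import: needs Ray installed and a reachable cluster

    ray.init(runtime_env=env)
```

Calling `connect(runtime_env)` would then hand the environment to Ray, assuming the archive exists and the cluster can read from the bucket.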

Member

Add a sentence to explain why runtime_env is a less preferred method for production.

Contributor

@architkulkarni left a comment

Looks good to me, just minor comments


This document contains recommendations for setting up a Ray + Kubernetes cluster for your organization.

When you set up Ray on Kubernetes, the KubeRay documentation provides an overview of how to configure the operator to execute and manage the Ray cluster lifecycle. This guide complements the KubeRay documentation by providing best practices for effectively using Ray deployments in your organization.

Contributor

Please link to KubeRay doc

Contributor Author

Done! Good point.

Comment on lines 21 to 22
| Artifact Storage | Set up an EFS | Cloud storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into docker image |

Contributor

Maybe spell out EFS, NFS, S3, GS the first time you use them, and/or add links for them

Comment on lines 21 to 22
| Artifact Storage | Set up an EFS | Cloud storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into docker image |

Contributor

Suggested change
| Artifact Storage | Set up an EFS | Cloud storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into docker image |
| Artifact Storage | Set up an EFS | Cloud storage (S3, GS) |
| Package Dependencies | Install onto NFS <br /> or <br /> Use runtime environments | Bake into Docker image |

Contributor Author

Done. Thanks for catching!

Comment on lines 47 to 50

* Start a Jupyter server on the head node
* SSH onto the head node and run the driver script or application there
* Use the Ray Job Submission client to submit code from a local machine onto a cluster
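
The last option in the list above, submitting code from a local machine with the Ray Job Submission client, might look roughly like the following sketch. The script name, the working directory, and the helper names `build_job_request` and `submit` are hypothetical; `JobSubmissionClient` is imported lazily so the snippet loads without Ray installed.

```python
from typing import Any, Dict


def build_job_request(entrypoint: str, working_dir: str) -> Dict[str, Any]:
    """Collect the keyword arguments passed to JobSubmissionClient.submit_job."""
    return {
        # Shell command that the cluster runs as the job's driver.
        "entrypoint": entrypoint,
        # Local directory uploaded to the cluster along with the job.
        "runtime_env": {"working_dir": working_dir},
    }


# Hypothetical script name and directory, for illustration only.
job_request = build_job_request("python my_script.py", "./")


def submit(address: str, request: Dict[str, Any]) -> str:
    """Submit the job to the cluster at `address` and return its submission ID."""
    from ray.job_submission import JobSubmissionClient  # lazy import

    client = JobSubmissionClient(address)
    return client.submit_job(**request)
```

A call such as `submit("http://127.0.0.1:8265", job_request)` assumes the head node's dashboard port is reachable on localhost, for example through a port-forward.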

Contributor

It's not clear what these are examples of. I thought of "Here are some examples of ways to run a driver script on the head node", but that doesn't seem to fit well with the first bullet about Jupyter.

Contributor Author

should be addressed

Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

## Production

For production, we suggest the following configuration.

Contributor

Add a motivating comment here for recommendations

Contributor Author

should be addressed


This document contains recommendations for setting up a Ray + Kubernetes cluster for your organization.

When you set up Ray on Kubernetes, the KubeRay documentation provides an overview of how to configure the operator to execute and manage the Ray cluster lifecycle. This guide complements the KubeRay documentation by providing best practices for effectively using Ray deployments in your organization.

Contributor

Add a bit more clarity as to why this doc matters

| | Interactive Development | Production |
|---|---|---|
| Cluster Configuration | KubeRay YAML | KubeRay YAML |
| Code | Run driver or Jupyter notebook on head node | S3 + runtime envs <br /> OR <br /> Bake code into Docker image (link) |

Contributor

Remove this link thing / do we need to say more about the docker image setup? or is that common knowledge

Contributor Author

I removed the word "link".

Member

Building a Ray image from scratch is not easy, and our image-building CI pipelines are pretty complex. It will be helpful to have a doc in the future.

Member

https://docs.ray.io/en/master/serve/production-guide/docker.html => This is not enough. For example, some users are sensitive to security and want to build the image with different Linux distributions.

doc/source/cluster/kubernetes/user-guides/ray-k8s-setup.md Outdated

### Code and Dependencies

Bake your code and its remote and local dependencies into a published Docker image for the workers. This is the most common way to deploy applications onto [Kubernetes](https://kube.academy/courses/building-applications-for-kubernetes).

Contributor

Do you also want to add a link to how to build it into the docker image? -> https://docs.ray.io/en/master/serve/production-guide/docker.html

Contributor

done

Contributor

Could you help me refresh this once more?

Contributor Author

done

Contributor

could you help me refresh this one as well?

Contributor Author

done

richardliaw and others added 3 commits September 7, 2023 17:45
Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>

Member

@kevin85421 left a comment

  1. Update https://github.com/ray-project/ray/blob/master/doc/source/cluster/kubernetes/user-guides.md

  2. I am not familiar with NFS/EFS. Could you explain why NFS is inside the "Ray cluster" in the interactive-dev.png but outside the "Ray cluster" in the production.png?

@angelinalg changed the title from "[doc][clusters] add doc for setting up Ray and K8s" to "WIP [doc][clusters] add doc for setting up Ray and K8s" on Sep 8, 2023
@angelinalg changed the title from "WIP [doc][clusters] add doc for setting up Ray and K8s" to "[doc][clusters] add doc for setting up Ray and K8s" on Sep 8, 2023
Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Signed-off-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
@richardliaw merged commit af192d8 into ray-project:master on Sep 9, 2023
13 of 15 checks passed
angelinalg added a commit to angelinalg/ray that referenced this pull request Sep 9, 2023
GeneDer pushed a commit that referenced this pull request Sep 9, 2023
#39510)

* Update metrics.md (#38512)

1. there are 3 dashboards in the folder now. Refer to the folder instead of only 1 dashboard
2. include "Copy" since people need to copy this from the head node to the Grafana server

Signed-off-by: Huaiwei Sun <scottsun94@gmail.com>

* polish observability (o11y) docs (#39069)

Signed-off-by: Huaiwei Sun <scottsun94@gmail.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: matthewdeng <matt@anyscale.com>

* [Doc] Unbold "Use Cases" in sidebar (#39295)

Signed-off-by: pdmurray <peynmurray@gmail.com>

* [docs] Cleanup for other AIR concepts (#39400)

* [doc][clusters] add doc for setting up Ray and K8s (#39408)

---------

Signed-off-by: Huaiwei Sun <scottsun94@gmail.com>
Signed-off-by: pdmurray <peynmurray@gmail.com>
Co-authored-by: Huaiwei Sun <scottsun94@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Peyton Murray <peynmurray@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
Signed-off-by: Jim Thompson <jimthompson5802@gmail.com>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
Labels: core-clusters, docs

5 participants