Documentation and example for running simple NLP service on kuberay #1340

Merged
merged 7 commits into from
Aug 17, 2023

Conversation

gvspraveen
Contributor

Why are these changes needed?

This is needed for KubeRay CUJ (critical user journey) testing.

Related issue number

Checks

Manually tested

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

docs/guidance/aws-eks-gpu-cluster.md (outdated)
docs/guidance/text-summarizer-rayservice.md (outdated)
docs/guidance/text-summarizer-rayservice.md (outdated)

Note that the RayService's Kubernetes service will be created after the Serve applications are ready and running. This process may take approximately 1 minute after all Pods in the RayCluster are running.

## Step 5: Send a request to the text-to-image model
Member

text-to-image -> text summarization (?)
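
Regarding the quoted note that the RayService's Kubernetes service is created only after the Serve applications are running (roughly a minute after the Pods are up): a client script can poll for readiness rather than sleep for a fixed time. The following is a minimal Python sketch under stated assumptions. `wait_until_ready` and the `is_ready` callback are hypothetical names for illustration, not part of KubeRay or Ray; the callback would wrap whatever readiness check you prefer, for example inspecting the output of `kubectl get services`.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `is_ready` until it returns True or `timeout_s` elapses.

    `is_ready` is a hypothetical, caller-supplied check (assumption):
    it should return True once the Kubernetes service exists.
    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters is only a convenience for testing the loop without real delays; in a script the defaults are enough, e.g. `wait_until_ready(my_check, timeout_s=180)`, where `my_check` is your own function.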

docs/guidance/text-summarizer-rayservice.md (outdated)
docs/guidance/text-summarizer-rayservice.md (outdated)
docs/guidance/text-summarizer-rayservice.md (outdated)
gvspraveen and others added 2 commits August 16, 2023 19:24
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Praveen <gorthypraveen@gmail.com>

This RayService configuration contains some important settings:

* Its `tolerations` for workers match the taints on the GPU node group, so they can be scheduled on either a GPU or a CPU node. We don't add these to head nodes, to prevent the head node from being allocated to a GPU node.
Member

The `tolerations` for workers allow them to be scheduled on nodes without any taints or on nodes with specific taints. However, workers will only be scheduled on GPU nodes because we set `nvidia.com/gpu: 1` in the Pod's resource configuration.
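
To make the distinction in this comment concrete, here is a small Python model of the two independent checks involved: taints versus tolerations, and the GPU resource request. It is a deliberately simplified sketch, not the Kubernetes scheduler. It only handles exact key/value/effect matches, and the class names, node names, and taint key/value used here are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Toleration:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass(frozen=True)
class Node:
    name: str
    taints: tuple = ()
    gpus: int = 0


def tolerates(tolerations, taint):
    """True if any toleration exactly matches the taint (simplified rule)."""
    return any(
        t.key == taint.key and t.value == taint.value and t.effect == taint.effect
        for t in tolerations
    )


def can_schedule(node, tolerations, gpu_request):
    """A Pod fits if every taint is tolerated AND the node has enough GPUs."""
    taints_ok = all(tolerates(tolerations, taint) for taint in node.taints)
    return taints_ok and node.gpus >= gpu_request


# Hypothetical cluster: one untainted CPU node, one tainted GPU node.
gpu_taint = Taint(key="example.com/gpu-group", value="worker")
cpu_node = Node(name="cpu-node")
gpu_node = Node(name="gpu-node", taints=(gpu_taint,), gpus=1)

worker_tolerations = (Toleration(key="example.com/gpu-group", value="worker"),)
head_tolerations = ()  # the head Pod carries no tolerations
```

In this model, the worker's tolerations alone would let it land on either node (`can_schedule(cpu_node, worker_tolerations, 0)` is true), and it is the GPU request of 1 that rules out the CPU node. The head Pod, having no tolerations, is kept off the tainted GPU node.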

@@ -21,7 +21,7 @@ kubectl apply -f ray-service.stable-diffusion.yaml

This RayService configuration contains some important settings:

* Its `tolerations` for workers match the taints on the GPU node group. Without the tolerations, worker Pods won't be scheduled on GPU nodes.
* Its `tolerations` for workers match the taints on the GPU node group, so they can be scheduled on either a GPU or a CPU node. We don't add these to `headGroupSpec`, to make sure the head Pod and the KubeRay operator Pod are not allocated to the GPU node group.
Member

The `tolerations` for workers allow them to be scheduled on nodes without any taints or on nodes with specific taints. However, workers will only be scheduled on GPU nodes because we set `nvidia.com/gpu: 1` in the Pod's resource configuration.

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Member

@kevin85421 kevin85421 left a comment

LGTM

@gvspraveen gvspraveen merged commit 1cbac51 into ray-project:master Aug 17, 2023
18 of 22 checks passed
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 22, 2023
…ay-project#1340)

* add service yaml for nlp

* Documentation fixes

* Fix instructions

* Apply suggestions from code review

Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: Praveen <gorthypraveen@gmail.com>

* Fix tolerations comment

* review comments

* Update docs/guidance/stable-diffusion-rayservice.md

Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>

---------

Signed-off-by: Praveen <gorthypraveen@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
blublinsky pushed a commit to blublinsky/kuberay that referenced this pull request Aug 25, 2023
…ay-project#1340)

lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
…ay-project#1340)
