Update kuberay mcad integration doc #1373

Merged 10 commits into ray-project:master on Oct 20, 2023

Conversation

tedhtchang
Contributor

Why are these changes needed?

Closes #1327. Improves the KubeRay MCAD integration doc.

Related issue number

#1327

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@tedhtchang
Contributor Author

This PR should be reviewed together with the PR.
/cc @anishasthana @kevin85421

docs/guidance/kuberay-with-MCAD.md
@@ -5,7 +5,7 @@ The multi-cluster-app-dispatcher is a Kubernetes controller providing mechanisms
## Use case

MCAD allows you to deploy Ray cluster with a guarantee that sufficient resources are available in the cluster prior to actual pod creation in the Kubernetes cluster. It supports features such as:

- Integrates with upstream Kubernetes scheduling stack for features such co-scheduling, Packing on GPU dimension etc.
- Ability to wrap any Kubernetes objects.
- Increases control plane stability by JIT (Just-in Time) object creation.
Member

Out of curiosity, would you mind explaining this feature?

Contributor Author

@asm582 may be able to explain JIT better.
My understanding is that MCAD creates objects only when there are enough resources. In other words, there will not be any pending pods, so control plane stability improves.

Contributor Author

My other thought is that MCAD, when paired with the InstaScale operator, can scale out enough Kubernetes worker nodes to run a job and scale them down afterwards. This fits the just-in-time concept of allocating the right amount of resources to a workstation on an assembly line to complete a task, in order to reduce total inventory and cost.

Contributor

Yep, that's about right! MCAD will not create the underlying Ray resources until there are enough resources to schedule the pods. If InstaScale is enabled, InstaScale scales up new nodes on your cluster -> MCAD schedules Ray resources onto the new nodes -> Ray Cluster runs -> once the Ray Cluster context is finished and the AppWrapper is deleted, InstaScale deletes the nodes. At no point will there be pending pods/services/routes etc. on your cluster.
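The just-in-time behavior described in this thread can be sketched as a toy model. This is not MCAD's implementation or API: the `Wrapper` and `ToyDispatcher` names are hypothetical, only CPU is tracked, and per-node placement is ignored. It only illustrates the idea that a wrapped object is created when its request fits, and otherwise stays queued without producing pending pods.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Wrapper:
    """Hypothetical stand-in for an AppWrapper: a name plus its total CPU request."""
    name: str
    cpus: int


class ToyDispatcher:
    """Toy model of just-in-time object creation (an assumption, not MCAD code)."""

    def __init__(self, allocatable_cpus: int) -> None:
        self.free_cpus = allocatable_cpus
        self.queue = deque()   # wrappers waiting for capacity; nothing created yet
        self.running = []      # wrappers whose wrapped objects have been created

    def submit(self, wrapper: Wrapper) -> None:
        self.queue.append(wrapper)
        self.dispatch()

    def dispatch(self) -> None:
        # Create wrapped objects only while the head of the queue fits.
        while self.queue and self.queue[0].cpus <= self.free_cpus:
            wrapper = self.queue.popleft()
            self.free_cpus -= wrapper.cpus
            self.running.append(wrapper)

    def delete(self, name: str) -> None:
        # Deleting a running wrapper frees its CPUs and may unblock the queue.
        for wrapper in self.running:
            if wrapper.name == name:
                self.running.remove(wrapper)
                self.free_cpus += wrapper.cpus
                break
        self.dispatch()


if __name__ == "__main__":
    dispatcher = ToyDispatcher(allocatable_cpus=8)  # assumed capacity
    dispatcher.submit(Wrapper("raycluster-1", 5))   # fits, so it is created
    dispatcher.submit(Wrapper("raycluster-2", 5))   # does not fit, so it queues
    print([w.name for w in dispatcher.running], [w.name for w in dispatcher.queue])
    dispatcher.delete("raycluster-1")               # frees capacity
    print([w.name for w in dispatcher.running], [w.name for w in dispatcher.queue])
```

In this sketch the second wrapper never becomes a pending pod; it exists only as a queue entry until the first one is deleted.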

docs/guidance/kuberay-with-MCAD.md
Events: <none>
```

As seen the second Ray cluster is queued with no pending pods created.
Member

Every user allocates different CPU/memory resources to their Kubernetes clusters. If a user possesses a high-end workstation, would the RayCluster still be queued?

Contributor Author

No, it likely won't be queued. My cluster has only 16 CPUs, with some overhead already allocated, so the second AppWrapper (AW) was queued. If the high-end workstation has 16+ CPUs, the second AppWrapper may not exceed the total allocatable CPUs.
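As a rough way to reason about whether the second AppWrapper queues on a given machine, the arithmetic can be written out. This is a minimal sketch: `dispatchable_count` is a hypothetical helper, the 16-CPU figure comes from the comment above, and the 2-CPU overhead and 8-CPU request are assumed values chosen only for illustration. It considers aggregate CPU only and ignores memory and per-node fit.

```python
def dispatchable_count(allocatable_cpus: int, overhead_cpus: int, request_cpus: int) -> int:
    """Return how many identical AppWrappers fit before the next one must queue."""
    if request_cpus <= 0:
        raise ValueError("request_cpus must be positive")
    free_cpus = max(allocatable_cpus - overhead_cpus, 0)
    return free_cpus // request_cpus


if __name__ == "__main__":
    # 16 CPUs with assumed 2 CPUs of overhead and an assumed 8-CPU request each.
    print(dispatchable_count(16, 2, 8))  # 1: the second AppWrapper queues
    # A larger, assumed 32-CPU workstation with the same overhead and request.
    print(dispatchable_count(32, 2, 8))  # 3: the second AppWrapper is dispatched
```

Under these assumptions, making the example reproducible means choosing a request size for which the count is exactly 1 on the target cluster.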

Member

We should provide a clear example so users can consistently reproduce the expected behavior across all environments.

Member

Could you also explain when MCAD creates the RayCluster CR (e.g., 5 CPUs)? Without this information, it's difficult for users to understand how MCAD works.

Contributor Author

@kevin85421 I have added more commits. Do you think these two comments have been addressed?

Contributor

asm582 left a comment

LGTM

Contributor

anishasthana left a comment

/lgtm

@tedhtchang
Contributor Author

@kevin85421 I need to put this PR on hold because the MCAD repo is being refactored. The example YAMLs will no longer work.

tedhtchang and others added 7 commits October 16, 2023 17:30
Co-authored-by: Anish Asthana <anishasthana1@gmail.com>
Signed-off-by: ted chang <htchang@us.ibm.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ted chang <htchang@us.ibm.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ted chang <htchang@us.ibm.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ted chang <htchang@us.ibm.com>
@jbusche
Contributor

jbusche commented Oct 18, 2023

Hi @tedhtchang, I tested these steps using the kind cluster on my Mac M1 laptop, and it looked good! I was able to install the items and test the appwrappers.

oc get pods
NAME                                             READY   STATUS    RESTARTS   AGE
kuberay-operator-7cdbdb6f6d-d5tzl                1/1     Running   0          102m
raycluster-complete-1-head-qm98l                 1/1     Running   0          80m
raycluster-complete-1-worker-small-group-dx8sh   1/1     Running   0          80m

Nice work!

Contributor

jbusche left a comment

LGTM

@tedhtchang
Contributor Author

@kevin85421 This doc should be good to merge.

@kevin85421
Member

@tedhtchang could you also update README?

Member

kevin85421 left a comment

I have not retried the doc since September, but some folks tried it and approved this PR.

@kevin85421 kevin85421 merged commit bde5e9a into ray-project:master Oct 20, 2023
11 of 14 checks passed
kevin85421 added a commit to kevin85421/kuberay that referenced this pull request Nov 2, 2023
* Update kuberay mcad integration doc

* Update docs/guidance/kuberay-with-MCAD.md

Co-authored-by: Anish Asthana <anishasthana1@gmail.com>
Signed-off-by: ted chang <htchang@us.ibm.com>

* Update docs/guidance/kuberay-with-MCAD.md

Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ted chang <htchang@us.ibm.com>

* Update docs/guidance/kuberay-with-MCAD.md

Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ted chang <htchang@us.ibm.com>

* Update docs/guidance/kuberay-with-MCAD.md

Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Signed-off-by: ted chang <htchang@us.ibm.com>

* address review comments

* address more comments

* update content

* fix memory spelling

* Update README

---------

Signed-off-by: ted chang <htchang@us.ibm.com>
Co-authored-by: Anish Asthana <anishasthana1@gmail.com>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
Development

Successfully merging this pull request may close these issues.

Update MCAD/CodeFlare documentation to improve usability
5 participants