-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data #1891
Conversation
e56e064
to
af0ed07
Compare
@kevin85421 PTAL, I plan to use this sample for upcoming Kueue documentation (#1890) |
Here's the log output from the Job I ran in my own GKE cluster:
|
- name: kuberay-repo | ||
mountPath: /home/ray/kuberay | ||
containers: | ||
- name: ray-head |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This RayJob uses a single-node Ray cluster to avoid configuring shared storage for checkpointing. Otherwise I would need to configure something like GCSFuse which would not work on every environment
af0ed07
to
a8f346b
Compare
ray-operator/config/samples/pytorch-text-classifier/ray-job.pytorch-distributed-training.yaml
Outdated
Show resolved
Hide resolved
@@ -0,0 +1,153 @@ | |||
import ray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to embed the Python script in the YAML file to maintain consistency between them. Ensuring the same commit for both the YAML and the Python file requires specifying the commit in the YAML's runtime environment. However, it's easy to forget updating the runtime environment whenever the Python script is modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just stick to the ConfigMap route then? It's easier but i don't want users thinking this is the recommended way to upload source for Ray jobs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to keep the script separate because:
- I like how the python script is versioned in a .py file instead of in embedded YAML
- The sample shows how to use working_dir in runtime environment.
- This is more inline with how users will actually use RayJob (I imagine no one is adding their job code in a ConfigMap)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't want users thinking this is the recommended way to upload source for Ray jobs
This makes sense to me. I will check with the Ray team to see if there is a repository where we can upload Ray example scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked with the Ray team, but we didn't reach a consensus on creating a new repository for the Python scripts. Hence, we can initially put them in the KubeRay repository. However, as the KubeRay repository is 186MB, downloading the script may take some time in the runtime environment.
fc9ef9e
to
376e70c
Compare
with Ray Data Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
376e70c
to
30bb84e
Compare
@@ -0,0 +1,41 @@ | |||
# This RayJob is based on the "Fine-tune a PyTorch Lightning Text Classifier with Ray Data" example in the Ray documentation. | |||
# See https://docs.ray.io/en/master/train/examples/lightning/lightning_cola_advanced.html for more details. | |||
apiVersion: ray.io/v1alpha1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevin85421 added a v1alpha1 RayJob as well since Kueue does not fully support v1 until kubernetes-sigs/kueue#1435 is merged
Why are these changes needed?
Add a sample RayJob that packages Fine-tune a PyTorch Lightning Text Classifier with Ray Data into a single RayJob.
I plan to use this sample for future documentation with Kueue, but this change could be a good sample on it's own as well.
Related issue number
Part of #1890
Checks