[Examples] NeMo distributed training for BERT and GPT3 #2533

Merged: 29 commits into master from nemo_example, Oct 11, 2023

Conversation

Collaborator

@romilbhardwaj romilbhardwaj commented Sep 8, 2023

Starter example showing how to run NVIDIA NeMo on SkyPilot to fine-tune a BERT model on GLUE tasks and to train a GPT-style model on a Wikipedia dataset.

Collaborator

@Michaelvll Michaelvll left a comment

Nice! Great to see that we support NeMo out of the box. Left several minor comments. : )

if [ $? -eq 0 ]; then
  echo "conda env exists"
else
  conda create -y --name nemo python==3.10.12
Collaborator

I thought conda only accepts a single =? Also, is the minor version required?

Collaborator Author

Ah, this is from NeMo's official install instructions, but if you'd like to use just python=3.10, I can change it.

Collaborator

Ahh, I see. I remembered that == was not supported by conda, but I think it is fine to follow their official instructions if that works.
(The conda docs mention python=3.8.)
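
For context on the = versus == question: in conda's package match specification, a single = is a fuzzy match (python=3.10 accepts any 3.10.x release) while == is an exact match, so conda accepts both forms. The sketch below shows one way the "create only if missing" step could be written so that setup can be re-run safely. The helper name ensure_conda_env is hypothetical and not part of the PR, and the env-listing check is an assumption about what the command preceding the `$?` test might look like.

```shell
# Sketch: create a conda env only when it does not exist yet.
# ensure_conda_env is a hypothetical helper name used for illustration.
ensure_conda_env() {
  env_name="$1"
  python_spec="$2"   # e.g. "python=3.10" (fuzzy) or "python==3.10.12" (exact)

  # `conda env list` prints one environment per line with the name in the
  # first column; header lines start with "#", so they never match a name.
  if conda env list | awk '{print $1}' | grep -qxF -- "$env_name"; then
    echo "conda env exists"
  else
    conda create -y --name "$env_name" "$python_spec"
  fi
}

# Usage (not executed here, since it needs a working conda install):
#   ensure_conda_env nemo "python=3.10"
```

Passing the version spec as an argument keeps the choice between the fuzzy and exact forms in one place.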


# Install nemo
sudo apt-get update
sudo apt-get install -y libsndfile1 ffmpeg
Collaborator

It seems we install ffmpeg, but we are training on language tasks. Should we train on some CV tasks instead?

Collaborator Author

This is also from the official install instructions. I thought of leaving it in, in case people use NeMo for multi-modal tasks. Let me know if you want to remove it.

Collaborator

I would prefer to remove these to keep our setup minimal. I think we can have another example YAML for multi-modal tasks that includes these commands. Wdyt?

examples/nemo/nemo.yaml (outdated, resolved)
@romilbhardwaj romilbhardwaj changed the title from "[Examples] NeMo distributed finetuning on GLUE" to "[Examples] NeMo distributed finetuning for BERT and GPT3" on Sep 9, 2023
@romilbhardwaj romilbhardwaj changed the title from "[Examples] NeMo distributed finetuning for BERT and GPT3" to "[Examples] NeMo distributed training for BERT and GPT3" on Sep 9, 2023
Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding the example @romilbhardwaj! LGTM.

num_nodes: 2

envs:
  DATASET_ROOT: $HOME/wiki/
Collaborator

Does this $HOME work for remote clusters that do not have the same username as the local machine?

Collaborator Author

Yes, the $HOME is expanded on the remote machine (i.e., this results in /home/sky/wiki/ on a k8s cluster rather than /Users/romilb/wiki).
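
A minimal sketch of the behavior described here, assuming the value reaches the remote shell as the literal text $HOME/wiki/ and is only expanded there. The helper name expand_as is hypothetical, the child shell merely stands in for the shell on the remote node, and nothing here calls SkyPilot itself.

```shell
# Single quotes keep the local shell from expanding $HOME, so the stored
# value is the literal text: $HOME/wiki/
DATASET_ROOT_TEMPLATE='$HOME/wiki/'

# expand_as is a hypothetical helper: it starts a child shell whose HOME is
# set to the given directory and lets that child expand the template.
expand_as() {
  HOME="$1" sh -c "printf '%s\n' \"$DATASET_ROOT_TEMPLATE\""
}

expand_as /home/sky       # prints /home/sky/wiki/
expand_as /Users/romilb   # prints /Users/romilb/wiki/
```

The same literal value yields a different path depending on which shell expands it, which is why the path follows the remote user's home directory rather than the local one.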

@romilbhardwaj
Collaborator Author

Added a note on using GCS when mounting the dataset bucket, since goofys fails with a "transport endpoint is not connected" error. Tested on GKE and GCP; merging now.

@romilbhardwaj romilbhardwaj merged commit 1eea3b8 into master Oct 11, 2023
18 checks passed
@romilbhardwaj romilbhardwaj deleted the nemo_example branch October 11, 2023 22:18