Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FSDP example #1019

Merged
merged 29 commits into from
May 24, 2023
Merged

FSDP example #1019

merged 29 commits into from
May 24, 2023

Conversation

HamidShojanazeri
Copy link
Contributor

This example shows training a HF T5 model with FSDP to be used with its tutorial

@netlify
Copy link

netlify bot commented Jul 7, 2022

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit c15c689
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/62c741fbf9c2cc00089990df

FSDP/wikihowSep.csv Outdated Show resolved Hide resolved
FSDP/.gitignore Outdated Show resolved Hide resolved
FSDP/README.md Outdated Show resolved Hide resolved
FSDP/README.md Outdated Show resolved Hide resolved
FSDP/README.md Outdated
@@ -0,0 +1,25 @@
## FSDP T5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put FSDP in the distributed/ folder? Also please link to this example from the main README.md as well

FSDP/T5_training.py Outdated Show resolved Hide resolved
#init_end_event.record()

#if rank == 0:
#print(f"Cuda event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove the commented out code

currEpoch = (
"-" + str(epoch) + "-" + str(round(curr_val_loss.item(), 4)) + ".pt"
)
print(f"--> attempting to save model prefix {currEpoch}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: saving could be its own function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree - it's done this way in a different fsdp repo I have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

FSDP/T5_training.py Outdated Show resolved Hide resolved
FSDP/README.md Outdated Show resolved Hide resolved
FSDP/T5_training.py Outdated Show resolved Hide resolved
Copy link
Contributor

@hudeven hudeven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HamidShojanazeri I think a good FSDP example would be very useful for users doing large scale training. Thanks for your contribution! Requesting to resolve the comments.

@msaroufim
Copy link
Member

msaroufim commented Sep 22, 2022

@rohan-varma @lessw2020 @HamidShojanazeri once you tell @hudeven and I that you'd like to merge the PR let us know. This has been open for a while. Feel free to close any feedback you don't believe is relevant

@lessw2020
Copy link
Contributor

Let me review - I was not even aware this PR existed until today, so thanks for the direct link.

@lessw2020
Copy link
Contributor

General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR).
But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.

@netlify
Copy link

netlify bot commented May 24, 2023

Deploy Preview for pytorch-examples-preview canceled.

Name Link
🔨 Latest commit f62b4ae
🔍 Latest deploy log https://app.netlify.com/sites/pytorch-examples-preview/deploys/646e47182d49400008c6a694

@HamidShojanazeri
Copy link
Contributor Author

@msaroufim , @hudeven sorry for the delay I addressed the comments and made the code more modular, would be great if we could merge this.

@HamidShojanazeri
Copy link
Contributor Author

General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR). But I think it would be good to update this example with it, to make sure it's present as activation checkpointing is one of our biggest throughput boosters.

Added the AC and rate_lmiter as well+ model checkpointings.

@msaroufim msaroufim self-requested a review May 24, 2023 16:10
@msaroufim
Copy link
Member

@svekars any idea if the doc build is flaking for any reason?

@HamidShojanazeri do you mind rebasing on main to see if the error goes away

@msaroufim msaroufim merged commit 7b7c708 into pytorch:main May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants