FSDP example #1019
✅ Deploy Preview for pytorch-examples-preview canceled.
FSDP/README.md
Outdated
@@ -0,0 +1,25 @@
## FSDP T5
Can we put FSDP in the distributed/ folder? Also, please link to this example from the main README.md as well.
FSDP/T5_training.py
Outdated
#init_end_event.record()

#if rank == 0:
#print(f"Cuda event elapsed time: {init_start_event.elapsed_time(init_end_event) / 1000}sec")
Please remove the commented-out code.
FSDP/T5_training.py
Outdated
currEpoch = (
    "-" + str(epoch) + "-" + str(round(curr_val_loss.item(), 4)) + ".pt"
)
print(f"--> attempting to save model prefix {currEpoch}")
nit: saving could be its own function
agree - it's done this way in a different fsdp repo I have.
updated
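For readers following this thread, here is a minimal sketch of what moving the save logic into its own function could look like. It is not the code that landed in the PR. The names `save_model_checkpoint`, `checkpoint_suffix`, `save_dir`, `prefix`, and `save_fn` are hypothetical, and the default writer uses `pickle` only so the sketch runs without PyTorch; in a real run one would presumably pass `torch.save`. It also assumes `state` has already been gathered into a full state dict, since with FSDP that gathering step generally has to run on every rank before rank 0 writes the file.

```python
import pickle
from pathlib import Path
from typing import Any, Callable, Optional


def _pickle_save(obj: Any, path: Path) -> None:
    # Stand-in writer (assumption) so the sketch runs without PyTorch.
    # In real training, pass save_fn=torch.save instead.
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def checkpoint_suffix(epoch: int, val_loss: float) -> str:
    # Same naming scheme as the snippet under review:
    # "-<epoch>-<validation loss rounded to 4 places>.pt"
    return "-" + str(epoch) + "-" + str(round(val_loss, 4)) + ".pt"


def save_model_checkpoint(
    state: Any,
    save_dir: str,
    prefix: str,
    epoch: int,
    val_loss: float,
    rank: int,
    save_fn: Callable[[Any, Path], None] = _pickle_save,
) -> Optional[Path]:
    """Write one checkpoint from rank 0 only; return its path, or None on other ranks."""
    if rank != 0:
        return None
    suffix = checkpoint_suffix(epoch, val_loss)
    print(f"--> attempting to save model prefix {suffix}")
    path = Path(save_dir) / (prefix + suffix)
    path.parent.mkdir(parents=True, exist_ok=True)
    save_fn(state, path)
    return path
```

The training loop would then reduce to a single call such as `save_model_checkpoint(cpu_state, "checkpoints", "t5", epoch, curr_val_loss.item(), rank)`, where `cpu_state` is a placeholder for the already-gathered state dict.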
@HamidShojanazeri I think a good FSDP example would be very useful for users doing large scale training. Thanks for your contribution! Requesting to resolve the comments.
@rohan-varma @lessw2020 @HamidShojanazeri let @hudeven and me know once you'd like to merge the PR. This has been open for a while. Feel free to close any feedback you don't believe is relevant.
Let me review - I was not even aware this PR existed until today, so thanks for the direct link.
General comment - this example does not use activation checkpointing due to the timing of this PR (it wasn't added in FSDP until after this PR).
@msaroufim, @hudeven sorry for the delay. I addressed the comments and made the code more modular; it would be great if we could merge this.
Added activation checkpointing (AC) and the rate limiter as well, plus model checkpointing.
@svekars any idea if the doc build is flaking for any reason? @HamidShojanazeri do you mind rebasing on main to see if the error goes away?
This example shows how to train an HF T5 model with FSDP and is meant to be used with its tutorial.