use Fused AdamW as default #881
Conversation
tianyu-l left a comment
Thanks for verifying the performance. I have a suggestion inline.
Also, regarding https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/optimizer.py#L212-L213
can fused and foreach coexist?
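For context, a minimal sketch of how recent PyTorch releases treat the two flags on torch.optim.AdamW (the exact error text may differ by version, so worth re-verifying locally):
~~~
import torch

# fused AdamW needs parameters on a supported device (e.g. CUDA)
params = [torch.nn.Parameter(torch.randn(4, 4, device="cuda"))]

torch.optim.AdamW(params, lr=1e-3, fused=True)    # ok: fused kernel
torch.optim.AdamW(params, lr=1e-3, foreach=True)  # ok: multi-tensor path

# the two are mutually exclusive; enabling both raises at construction time
try:
    torch.optim.AdamW(params, lr=1e-3, fused=True, foreach=True)
except RuntimeError as err:
    print(err)  # e.g. "`fused` and `foreach` cannot be `True` together."
~~~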
Updated based on PR feedback to ensure a command-line disable is supported (TOML support was already there). Added --optimizer.disable_fused support and verified both cases: with the flag set (fused disabled) and without it (the default fused setting is used).
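A rough sketch of what that toggle could look like in the job config, assuming a dataclass-style optimizer section like torchtitan's (only the disable_fused field comes from this PR; the surrounding structure is illustrative):
~~~
from dataclasses import dataclass

@dataclass
class OptimizerConfig:
    # illustrative fields; only disable_fused is the flag discussed here,
    # settable via --optimizer.disable_fused or the job .toml
    name: str = "AdamW"
    lr: float = 8e-4
    disable_fused: bool = False  # fused stays the default unless disabled

def adamw_kwargs(cfg: OptimizerConfig) -> dict:
    # map the boolean onto torch.optim.AdamW's `fused` kwarg
    return {"lr": cfg.lr, "fused": not cfg.disable_fused}
~~~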
tianyu-l left a comment
LGTM.
Could you fix CI before merging?
In particular, please update the fused test into a foreach test: https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L267-L274
Also, there seem to be two optimizers used, which we should trim to one.
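The updated test entry might look roughly like this (a sketch only: the OverrideDefinitions shape is inferred from the linked file, and the exact fields and flags may differ):
~~~
OverrideDefinitions(
    [
        [
            # fused is now the default, so flip this test to cover
            # the foreach path instead
            "--optimizer.implementation foreach",
        ],
    ],
    "Foreach optimizer",
    "optimizer_foreach",
)
~~~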
Currently, Titan does not use fused AdamW by default. This PR makes fused the new default.
After benchmarking the current parallelisms with Llama 8B, I found an average speedup of 2.64%, as follows:
| Model | Parallelism | Fused AdamW speedup |
| --- | --- | --- |
| 8B | FSDP, eager | 2.24% |
| 8B | FSDP, compile | 1.63% |
| 8B | TP | 3.62% |
| 8B | AsyncTP | 3.26% |
| 8B | CP | 2.97% |
| (debug model) | PP | 2.10% |
| Gains | Average | 2.64% |
| | Min | 1.63% |
| | Max | 3.62% |
Updated to add --optimizer.implementation with support for ["for-loop", "foreach", "fused"].
Testing:
Beyond verifying no issues with all the parallelisms above, I verified that fused / foreach / for-loop is set correctly with the new default config:
~~~
[rank0]:Using foreach implementation for optimizer
[rank0]:foreach=True, fused=False
~~~
~~~
[rank0]:Using for-loop implementation for optimizer
[rank0]:foreach=False, fused=False
~~~
~~~
[rank0]:Using fused implementation for optimizer
[rank0]:foreach=False, fused=True
~~~