Conversation

@lessw2020 (Contributor) commented Feb 24, 2025

Currently Titan does not use fused AdamW by default.
This PR makes fused the new default.

After investigating current parallelisms using Llama 8B, I found an average speedup of 2.64% as follows:

| Config | Parallelism | Speedup |
| --- | --- | --- |
| Fused AdamW | FSDP, eager | 2.24% |
| 8B | FSDP, compile | 1.63% |
|  | TP | 3.62% |
|  | AsyncTP | 3.26% |
|  | CP | 2.97% |
| (debug model) | PP | 2.10% |
| Gains | Average | 2.64% |
|  | Min | 1.63% |
|  | Max | 3.62% |
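
The summary rows follow directly from the per-parallelism numbers; a quick check:

~~~python
speedups = {
    "FSDP, eager": 2.24, "FSDP, compile": 1.63, "TP": 3.62,
    "AsyncTP": 3.26, "CP": 2.97, "PP": 2.10,
}
vals = list(speedups.values())
print(f"Average {sum(vals) / len(vals):.2f}%  Min {min(vals):.2f}%  Max {max(vals):.2f}%")
# Average 2.64%  Min 1.63%  Max 3.62%
~~~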

Updated to add `--optimizer.implementation` with support for `["for-loop", "foreach", "fused"]`.

Testing:
Beyond verifying that all parallelisms above run without issues, I verified that fused/foreach/for-loop is set correctly with the new config option:

~~~
[rank0]:Using foreach implementation for optimizer
[rank0]:foreach=True, fused=False
~~~
~~~
[rank0]:Using for-loop implementation for optimizer
[rank0]:foreach=False, fused=False
~~~
~~~
[rank0]:Using fused implementation for optimizer
[rank0]:foreach=False, fused=True
~~~
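
For context, here is a minimal sketch of how the three implementation choices map onto `torch.optim.AdamW`'s `foreach`/`fused` flags, matching the logs above. This is a hypothetical helper for illustration, not the actual code in torchtitan/components/optimizer.py:

~~~python
import torch
from torch.optim import AdamW

def build_adamw(params, implementation: str = "fused", lr: float = 8e-4):
    """Hypothetical helper: map an implementation choice to AdamW flags.

    "for-loop" -> foreach=False, fused=False
    "foreach"  -> foreach=True,  fused=False
    "fused"    -> foreach=False, fused=True
    """
    kwargs = {"lr": lr, "betas": (0.9, 0.95), "weight_decay": 0.1}
    if implementation == "fused":
        kwargs.update(fused=True, foreach=False)
    elif implementation == "foreach":
        kwargs.update(fused=False, foreach=True)
    elif implementation == "for-loop":
        kwargs.update(fused=False, foreach=False)
    else:
        raise ValueError(f"unknown implementation: {implementation!r}")
    return AdamW(params, **kwargs)

# Fused kernels need a supported device (e.g. CUDA, or CPU on recent PyTorch),
# so this toy example falls back to foreach when CUDA is unavailable.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8, device=device)
impl = "fused" if torch.cuda.is_available() else "foreach"
opt = build_adamw(model.parameters(), implementation=impl)
~~~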

@facebook-github-bot added the CLA Signed label on Feb 24, 2025
@lessw2020 requested review from tianyu-l and wz337 on February 24, 2025 04:22
@tianyu-l (Contributor) left a comment

Thanks for verifying the performance. I have a suggestion inline.

Also, regarding https://github.com/pytorch/torchtitan/blob/main/torchtitan/components/optimizer.py#L212-L213: can `fused` and `foreach` coexist?

@tianyu-l linked an issue on Feb 24, 2025 that may be closed by this pull request: "Make Fused AdamW default? Need to verify across all parallelisms ..."
@fegin (Contributor) commented Feb 24, 2025

fused and foreach cannot coexist.
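
For reference, recent PyTorch versions enforce this at construction time. Below is a minimal check; it assumes PyTorch 2.x behavior, and the exact error message may vary by version and device:

~~~python
import torch
from torch.optim import AdamW

params = [torch.nn.Parameter(torch.zeros(4))]
try:
    # Requesting both implementations at once is rejected when the optimizer is built.
    AdamW(params, lr=8e-4, fused=True, foreach=True)
except RuntimeError as err:
    print(f"RuntimeError: {err}")
~~~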

@lessw2020 (Contributor, Author) commented
Updated based on PR feedback to ensure the command-line disable is supported (TOML support was already there).

Added `--optimizer.disable_fused` support and verified both cases:

a) using `disable_fused` on the command line:

~~~
[rank0]:Using AdamW optimizer
[rank0]:optimizer_kwargs: {'lr': 0.0008, 'betas': (0.9, 0.95), 'weight_decay': 0.1, 'fused': False, 'foreach': True}
~~~

b) not using `disable_fused` (uses the default setting, fused):

~~~
[rank0]:Using AdamW optimizer
[rank0]:optimizer_kwargs: {'lr': 0.0008, 'betas': (0.9, 0.95), 'weight_decay': 0.1, 'fused': True, 'foreach': False}
~~~
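
For reference, a minimal sketch of the flag-to-kwargs mapping these logs reflect; hypothetical code, not the actual plumbing in torchtitan/components/optimizer.py:

~~~python
def adamw_kwargs(disable_fused: bool = False) -> dict:
    """Mirror the logged kwargs: fused by default, foreach when fused is disabled."""
    return {
        "lr": 0.0008,
        "betas": (0.9, 0.95),
        "weight_decay": 0.1,
        "fused": not disable_fused,
        "foreach": disable_fused,
    }

print(adamw_kwargs(disable_fused=True))  # case a: fused=False, foreach=True
print(adamw_kwargs())                    # case b: fused=True,  foreach=False
~~~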

@lessw2020 requested review from fegin and tianyu-l on February 24, 2025 16:33
@lessw2020 requested a review from tianyu-l on February 24, 2025 23:59
@tianyu-l (Contributor) left a comment

LGTM.
Could you fix CI before merging?
In particular, please update the fused test into a foreach test: https://github.com/pytorch/torchtitan/blob/main/tests/integration_tests.py#L267-L274
Also, there seem to be two optimizers used, which we should trim to one.

@tianyu-l merged commit 6e49885 into main on Feb 25, 2025
6 checks passed
MaxiBoether pushed a commit to eth-easl/torchtitan-mixtera that referenced this pull request Apr 17, 2025