
[DCP] Removes Checkpoint Wrapped Prefix from state dict fqns #118119

Closed · wants to merge 6 commits

Conversation

@LucasLLC (Contributor) commented Jan 23, 2024

Fixes #117399

Soliciting some early feedback here.

Do we happen to know if there are already tests that cover this case, or would it make sense to add some? @fegin, @wz337

Edit: Added tests

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

pytorch-bot commented Jan 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118119

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3bbd358 with merge base abe3c55:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions added the oncall: distributed label Jan 23, 2024
@fegin (Contributor) commented Jan 23, 2024

You can check https://github.com/pytorch/pytorch/blob/main/test/distributed/fsdp/test_fsdp_optim_state.py#L620. But remember NOT to use FSDP to wrap the model, as FSDP will handle the prefix for you, so you won't be able to test this logic if FSDP is used.
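
For reference, a minimal sketch of such a test, assuming the private helpers `checkpoint_wrapper` and `_get_fqns` at their current import paths (the toy model is hypothetical); the submodule is activation-checkpointed directly, without FSDP, so the prefix-stripping logic is actually exercised:

```
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)
# Sketch only: _get_fqns is a private helper and may change.
from torch.distributed.checkpoint.state_dict import _get_fqns

# Activation-checkpoint a submodule directly (no FSDP), so nothing
# strips the _checkpoint_wrapped_module prefix for us.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
model[0] = checkpoint_wrapper(model[0])

# Raw parameter names still carry the wrapper prefix ...
assert any("_checkpoint_wrapped_module" in n for n, _ in model.named_parameters())

# ... but the FQNs computed by _get_fqns should not.
for name, _ in model.named_parameters():
    for fqn in _get_fqns(model, name):
        assert "_checkpoint_wrapped_module" not in fqn
```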

@LucasLLC marked this pull request as ready for review January 25, 2024 16:26
@fegin (Contributor) left a comment


Added one comment. Everything else LGTM.

Comment on lines 462 to 464
for model, name in model_names:
    for fqn in _get_fqns(model, name):
        self.assertNotIn(_CHECKPOINT_WRAPPED_MODULE, fqn)

We should just call get_state_dict() and compare its keys with those of the original model without activation checkpointing (you can deepcopy the original model).
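
A minimal sketch of that comparison, assuming the `get_state_dict` API from `torch.distributed.checkpoint.state_dict` and a hypothetical toy model:

```
import copy

import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)
from torch.distributed.checkpoint.state_dict import get_state_dict

# Deep-copy the model before wrapping, keeping a reference that never
# sees activation checkpointing.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
reference = copy.deepcopy(model)
model[0] = checkpoint_wrapper(model[0])

# get_state_dict returns (model_state_dict, optimizer_state_dict); no
# optimizers are passed here, so only the model state dict matters.
wrapped_sd, _ = get_state_dict(model, optimizers=[])
reference_sd, _ = get_state_dict(reference, optimizers=[])

# The wrapped model's keys should match the unwrapped reference exactly.
assert sorted(wrapped_sd) == sorted(reference_sd)
```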

@LucasLLC

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label Jan 26, 2024
@pytorchmergebot

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team (raised by workflow job)

@LucasLLC

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status here.

@pytorchmergebot

Merge failed

Reason: 1 job has failed: Check mergeability and dependencies for ghstack prs / ghstack-mergeability-check

Details for Dev Infra team (raised by workflow job)

@LucasLLC

@pytorchbot merge

@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced debugging: check the merge workflow status here.

jeffdaily pushed a commit to ROCm/pytorch that referenced this pull request Feb 8, 2024
…#118119)

Fixes pytorch#117399

~~Soliciting some early feedback here.~~

~~Do we happen to know if there are already tests that cover this case, or would it make sense to add some? @fegin, @wz337~~

Edit: Added tests

Pull Request resolved: pytorch#118119
Approved by: https://github.com/fegin
@github-actions deleted the remove_checkpoint_wrapped_from_fqn branch February 29, 2024 02:08
pytorchmergebot pushed a commit that referenced this pull request Apr 24, 2024
Fixes #124546

When `use_orig_params = False` is set and activation checkpointing is used, the FQN mapping retrieved by the `_get_fqns` function is incorrect because `_checkpoint_wrapped_module`, the prefix added to the name of each activation-checkpointed module, can still be present. I think this is an edge case in `_get_fqns` that was not addressed by the previous commit #118119.

Without the change, the list of object names for an activation checkpointed module with FSDP (and `use_orig_params=False`) can be something like:
```
['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_checkpoint_wrapped_module', '_flat_param']
```
This incorrectly returns just one FQN, `{'model.transformer.blocks.0._flat_param'}`, when the FQNs of all of the transformer block's parameters should be returned.

With the change, the list of object names will now have `_checkpoint_wrapped_module` removed:
```
['model', '_fsdp_wrapped_module', 'transformer', 'blocks', '0', '_fsdp_wrapped_module', '_flat_param']
```
And the FQNs are correctly retrieved and returned in `_get_fqns` when [this condition](https://github.com/pytorch/pytorch/blob/ea61c9cb299b6dfebc57dc9d8821c34321d568ab/torch/distributed/checkpoint/state_dict.py#L168) is satisfied. The correct FQNs are:
```
{'model.transformer.blocks.0.attn.Wqkv.bias', 'model.transformer.blocks.0.ffn.up_proj.bias',
'model.transformer.blocks.0.attn.out_proj.weight', 'model.transformer.blocks.0.norm_2.weight',
'model.transformer.blocks.0.ffn.down_proj.weight', 'model.transformer.blocks.0.attn.Wqkv.weight',
'model.transformer.blocks.0.norm_2.bias', 'model.transformer.blocks.0.ffn.up_proj.weight',
'model.transformer.blocks.0.ffn.down_proj.bias', 'model.transformer.blocks.0.norm_1.bias',
'model.transformer.blocks.0.norm_1.weight', 'model.transformer.blocks.0.attn.out_proj.bias'}
```

Pull Request resolved: #124698
Approved by: https://github.com/Skylion007
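
To make the commit's fix concrete, here is a simplified stand-in (hypothetical, not the actual `_get_fqns` implementation) that drops both wrapper names from an object-name path:

```
# Hypothetical, simplified stand-in for the prefix handling in _get_fqns.
FSDP_PREFIX = "_fsdp_wrapped_module"
CHECKPOINT_PREFIX = "_checkpoint_wrapped_module"

def clean_fqn(obj_names):
    # Drop both wrapper names so only user-visible module names remain.
    return ".".join(
        name for name in obj_names if name not in (FSDP_PREFIX, CHECKPOINT_PREFIX)
    )

names = [
    "model", "_fsdp_wrapped_module", "transformer", "blocks", "0",
    "_fsdp_wrapped_module", "_checkpoint_wrapped_module", "_flat_param",
]
print(clean_fqn(names))  # -> model.transformer.blocks.0._flat_param
```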
pytorchmergebot pushed a commit to xuhancn/pytorch that referenced this pull request Apr 24, 2024
petrex pushed a commit to petrex/pytorch that referenced this pull request May 3, 2024
mvpatel2000 pushed a commit to mvpatel2000/pytorch that referenced this pull request May 17, 2024
atalman pushed a commit that referenced this pull request May 24, 2024
…26559)


Co-authored-by: Saaketh <narayan.saaketh@gmail.com>
Labels

ciflow/trunk · Merged · module: distributed_checkpoint · oncall: distributed · topic: not user facing

Successfully merging this pull request may close these issues:

CHECKPOINT_PREFIX is not stripped when non-root module is activation checkpointed