[FSDP] Fix load_sharded_state_dict FQN mismatches for shared parameters #86524
Conversation
`_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/) [ghstack-poisoned]
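For context, here is a hedged, minimal sketch (plain `nn.Module`, not the FSDP hook itself) of the FQN mismatch that tied parameters cause: the deduplicated parameter names miss the secondary FQN that the state dict still contains.

```python
import torch.nn as nn

# Minimal illustration of why shared parameters need extra care when building
# fully qualified name (FQN) lists: a tied weight lives under two FQNs, but
# named_parameters() deduplicates by default, while the state dict keeps both keys.
class TiedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8, bias=False)
        self.decoder = nn.Linear(8, 8, bias=False)
        self.decoder.weight = self.encoder.weight  # weight tying

model = TiedModel()
print([name for name, _ in model.named_parameters()])
# ['encoder.weight']  -- the shared FQN 'decoder.weight' is dropped
print([name for name, _ in model.named_parameters(remove_duplicate=False)])
# ['encoder.weight', 'decoder.weight']
print(list(model.state_dict().keys()))
# ['encoder.weight', 'decoder.weight']
```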
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/86524. Note: links to docs will display an error until the docs builds have been completed.
✅ No Failures, 1 Pending as of commit 5764aae. This comment was automatically generated by Dr. CI and updates every 15 minutes.
`_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/) ghstack-source-id: 169790107 Pull Request resolved: #86524
This change makes sense to me, but if we want to land this formally, should we include a unit test?
Here are some code examples for how to use `TransformerWithSharedParams.init()`, which has shared parameters (see `common_fsdp.py` for the precise details).
No FSDP:
pytorch/test/distributed/fsdp/test_fsdp_use_orig_params.py, lines 84 to 89 at be682be:

```python
model = TransformerWithSharedParams.init(
    self.process_group,
    FSDPInitMode.NO_FSDP,
    CUDAInitMode.CUDA_BEFORE,
    deterministic=True,
)
```
FSDP:
pytorch/test/distributed/fsdp/test_fsdp_use_orig_params.py, lines 115 to 133 at be682be:

```python
fsdp_kwargs = {
    "auto_wrap_policy": functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={
            TransformerEncoderLayer,
            TransformerDecoderLayer,
        },
    ),
    "use_orig_params": True,
    "sharding_strategy": sharding_strategy,
    "backward_prefetch": backward_prefetch,
    "cpu_offload": cpu_offload,
}
model = TransformerWithSharedParams.init(
    self.process_group,
    FSDPInitMode.NO_FSDP,
    cuda_init_mode,
    deterministic=True,
)
```
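A hedged sketch of what such a unit test could verify (the helper name and setup below are illustrative assumptions, not the test actually added in this PR): wrap a model containing tied weights in FSDP, take a sharded state dict, and load it back, which failed with missing FQNs before this fix.

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

# Illustrative round-trip check; assumes torch.distributed and CUDA are already
# initialized (e.g. inside a multi-process test fixture), and that build_model()
# returns a module with shared/tied parameters.
def _check_sharded_state_dict_round_trip(build_model) -> None:
    fsdp_model = FSDP(build_model().cuda())
    with FSDP.state_dict_type(fsdp_model, StateDictType.SHARDED_STATE_DICT):
        sharded_sd = fsdp_model.state_dict()
        # Before this PR, the shared-parameter FQNs were missing from the
        # pre-load hook's FQN list, so load_state_dict() raised a key mismatch.
        fsdp_model.load_state_dict(sharded_sd)
```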
@awgu Yup, that makes sense. What is puzzling me is that I thought …
@awgu Please ignore my previous comment. I didn't find shared parameters testing in …
…ed parameters" `_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/) [ghstack-poisoned]
Pull Request resolved: #86524 `_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. ghstack-source-id: 169844364 Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/)
```python
)

fsdp_model = model_creator()
for tensor in itertools.chain(fsdp_model.parameters(), fsdp_model.buffers()):
```
What does this portion add to the unit test?
```python
@property
def _shared_param_fqns(self) -> Iterator[Tuple[str, str, str]]:
    for param_name, module_name in (
        self._fsdp_wrapped_module.handle.shared_parameter_module_names()
```
nit: I remember that `_fsdp_wrapped_module` can have multiple handles, so should this be `self._fsdp_wrapped_module.handles[0]`? @awgu
I think either @fegin or I will need to rebase, but it is not a big deal either way. The preferred approach will be `self._handles[0]`, since I want to get rid of `self._fsdp_wrapped_module`.
I changed to `self._handles[0]` as suggested.
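For readers unfamiliar with the FSDP handle internals, here is a rough public-API analogue of enumerating shared-parameter FQNs (illustrative only; treating "shared" as the non-primary FQNs of tied parameters is an assumption about the intent of `shared_parameter_module_names()`):

```python
from typing import List
import torch.nn as nn

def shared_param_fqns(module: nn.Module) -> List[str]:
    # FQNs kept after deduplication (one name per underlying tensor)
    primary = {name for name, _ in module.named_parameters()}
    # All FQNs, including the duplicates created by weight tying
    all_fqns = [name for name, _ in module.named_parameters(remove_duplicate=False)]
    return [name for name in all_fqns if name not in primary]
```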
…ed parameters" `_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/) [ghstack-poisoned]
Pull Request resolved: #86524 `_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. ghstack-source-id: 170184507 Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/)
…ed parameters" `_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/) [ghstack-poisoned]
Pull Request resolved: #86524 `_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included. ghstack-source-id: 170321602 Differential Revision: [D40201304](https://our.internmc.facebook.com/intern/diff/D40201304/)
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hey @fegin.
Stack from ghstack (oldest at bottom):

`_sharded_pre_load_state_dict_hook()` should call `_param_fqns()` to ensure shared parameter names are also included.

Differential Revision: D40201304