
[FSDP] Do not clean FQNs even for use_orig_params=True #91767

Closed
wants to merge 8 commits

Conversation

@awgu (Contributor) commented Jan 5, 2023

Stack from ghstack:

Cleaning FQNs for `FullyShardedDataParallel(use_orig_params=True)` can cause discrepancies between the FQNs it reports and those obtained by manually looping over `named_modules()` and `named_parameters()` together.

There is no requirement for the FQNs to be clean when using wrapper FSDP + use_orig_params=True. We can leave clean FQNs to fully_shard.
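
To make the discrepancy concrete, here is a minimal sketch, assuming a single-process gloo group on CPU; the toy model and the exact names in the comments are illustrative, not taken from a real run:

```python
import os

import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Single-process process group so the sketch is self-contained.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = FSDP(nn.Sequential(nn.Linear(4, 4)), use_orig_params=True)

# Path 1: named_parameters() directly. Before this PR, FSDP cleaned these
# names to e.g. "0.weight".
direct = {name for name, _ in model.named_parameters()}

# Path 2: walk named_modules() and take each module's own parameters.
# The module names still contain the wrapper, so the joined FQNs look like
# "_fsdp_wrapped_module.0.weight" and did not match Path 1.
manual = set()
for mod_name, mod in model.named_modules():
    for param_name, _ in mod.named_parameters(recurse=False):
        manual.add(f"{mod_name}.{param_name}" if mod_name else param_name)

print(direct == manual)  # False before this PR; True (both prefixed) after
dist.destroy_process_group()
```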

cc @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @chunyuan-w @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire

pytorch-bot (bot) commented Jan 5, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/91767

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d50af79:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the `release notes: distributed (sharded)` label Jan 5, 2023
awgu added a commit that referenced this pull request Jan 5, 2023
ghstack-source-id: 8a808f41d4773309f2fbc94e39a4549a3ae0df02
Pull Request resolved: #91767
awgu added a commit that referenced this pull request Jan 5, 2023
ghstack-source-id: 3f36724b2025ddfa15ffc4bc4c9bec3fcb754017
Pull Request resolved: #91767
awgu added the `release notes: distributed (fsdp)` and `topic: improvements` labels and removed the `release notes: distributed (sharded)` label Jan 5, 2023
awgu added a commit that referenced this pull request Jan 5, 2023
ghstack-source-id: cb89c7e9de117975ce6e59e103b99783887926dd
Pull Request resolved: #91767
awgu added a commit that referenced this pull request Jan 5, 2023
ghstack-source-id: 66d46f3336b1fe8a69ad53ee39fad2ff58ab4a44
Pull Request resolved: #91767
awgu changed the title [WIP] Do not clean FQNs even for use_orig_params=True → [FSDP] Do not clean FQNs even for use_orig_params=True Jan 8, 2023
awgu added a commit to awgu/pytorch that referenced this pull request Jan 10, 2023
ghstack-source-id: 66d46f3336b1fe8a69ad53ee39fad2ff58ab4a44
Pull Request resolved: pytorch#91767
awgu added a commit to awgu/pytorch that referenced this pull request Jan 10, 2023
ghstack-source-id: 66d46f3336b1fe8a69ad53ee39fad2ff58ab4a44
Pull Request resolved: pytorch#91767
awgu added a commit that referenced this pull request Jan 10, 2023
ghstack-source-id: 23b1adfc2a659147b63c30f1c2b2cd14393c06c4
Pull Request resolved: #91767
awgu added a commit to awgu/pytorch that referenced this pull request Jan 10, 2023
ghstack-source-id: 23b1adfc2a659147b63c30f1c2b2cd14393c06c4
Pull Request resolved: pytorch#91767
@pytorchmergebot (Collaborator)
@awgu your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 12, 2023
awgu added a commit to awgu/pytorch that referenced this pull request Jan 12, 2023
ghstack-source-id: ae7e571104746e336bf1bc8eb0fc55f0a41196de
Pull Request resolved: pytorch#91767
```diff
@@ -46,7 +46,7 @@ def remove_optimized_module_prefix(name):
     prefix = "_orig_mod."
     assert name.startswith(prefix)
     name = name[len(prefix) :]
-    return torch.distributed.fsdp._common_utils.clean_tensor_name(name)
+    return name
```
@awgu (Contributor Author) commented Jan 12, 2023

@wconstab

Context: This PR changes FSDP so that when `use_orig_params=True`, the parameter names returned from `named_parameters()` include the `_fsdp_wrapped_module.` prefixes. Before, FSDP overrode `named_parameters()` to clean the prefix (see the change in fully_sharded_data_parallel.py). I found that overriding it to clean the prefix creates an unwanted discrepancy with the true module structure, which can lead to some issues. We can leave clean parameter names to the currently-developed composable APIs (e.g. `fully_shard`).

This change here in _dynamo/testing.py to not call `clean_tensor_name()` allows the unit tests to pass. I still wanted to check: does Dynamo + FSDP rely on FSDP's `named_parameters()` returning the cleaned names (i.e. without `_fsdp_wrapped_module.`)?
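
For reference, a small sketch of the cleaning being removed; `clean_tensor_name` is the internal helper named in the diff above, and the prefixed FQN is hypothetical:

```python
from torch.distributed.fsdp._common_utils import clean_tensor_name

# Hypothetical FQN as reported by wrapper FSDP after this PR:
prefixed = "_fsdp_wrapped_module.layer1._fsdp_wrapped_module.lin.weight"

# The old named_parameters() override applied this cleaning, yielding the
# wrapper-free name (expected: "layer1.lin.weight"):
print(clean_tensor_name(prefixed))
```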

@wconstab (Contributor)

I don't think dynamo relies on the structure of the name either way, other than that we do rely on finding the "is_fsdp_wrapped_module" boolean to enable the FSDP handling.

Question: what is the history of the remove_optimized_module_prefix fn? At a glance, I have no memory of modifying this myself for FSDP, but I see it had an FSDP clean-param call in it. Did you put that there before, and are you taking it out now?

@awgu (Contributor Author)

`remove_optimized_module_prefix` came from #89113

My understanding is as follows:

```python
for name, param in model.named_parameters():
    if isinstance(model, eval_frame.OptimizedModule):
        name = remove_optimized_module_prefix(name)

correct_results = collect_results(eager_model, correct_outputs.logits, correct_loss, inputs_flat)
opt_results = collect_results(opt_model, opt_outputs.logits, opt_loss, inputs_flat)
self.assertTrue(same(correct_results, opt_results))
```

The `eager_model` removed the prefixes, but the `opt_model` did not, so for parity we needed to manually remove the prefixes for `opt_model`. However, with this PR, `eager_model` no longer removes the prefixes, so we can likewise drop the manual removal for `opt_model`.
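
A tiny sketch of that parity argument (the names are illustrative):

```python
# torch.compile's OptimizedModule adds an "_orig_mod." prefix on top of
# whatever the eager model reports; after this PR, neither side cleans
# FSDP's wrapper prefix, so stripping "_orig_mod." alone restores parity.
eager_name = "_fsdp_wrapped_module.lin.weight"
opt_name = "_orig_mod." + eager_name

prefix = "_orig_mod."
assert opt_name.startswith(prefix)
assert opt_name[len(prefix):] == eager_name  # no clean_tensor_name needed
```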

In this case, we should be safe to land this.

@wconstab (Contributor)

seems ok to me

awgu added a commit to awgu/pytorch that referenced this pull request Jan 14, 2023
ghstack-source-id: f14495f0f1e8fc92c813b9353e2da47c866a529f
Pull Request resolved: pytorch#91767
awgu added a commit to awgu/pytorch that referenced this pull request Jan 17, 2023
ghstack-source-id: d3fab858407489d64e2173ab845ceef488031f88
Pull Request resolved: pytorch#91767
@awgu (Contributor Author) commented Jan 17, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator)
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet (Contributor) commented Jan 17, 2023

@pytorchbot revert -m "Looks like it broke test_compatible_with_named_optimizer distributed tests, see https://hud.pytorch.org/pytorch/pytorch/commit/d6f3265e1add26abedb504910be93b393b9fb33c" -c nosignal

@pytorchmergebot (Collaborator)
@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)
@awgu your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Jan 17, 2023
)"

This reverts commit d6f3265.

Reverted #91767 on behalf of https://github.com/malfet due to Looks like it broke `test_compatible_with_named_optimizer` distributed tests, see https://hud.pytorch.org/pytorch/pytorch/commit/d6f3265e1add26abedb504910be93b393b9fb33c
@awgu (Contributor Author) commented Jan 17, 2023

We had a land race. We will re-land this PR after a fix on the optimizer state side.
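
For context, a hypothetical illustration (not the actual `_NamedOptimizer` code) of how optimizer state keyed by FQN races with this change:

```python
# _NamedOptimizer-style state is keyed by parameter FQN. State produced
# against the old, cleaned names no longer matches the prefixed names that
# named_parameters() reports after this PR.
saved_state = {"lin.weight": {"step": 10}}          # keyed by cleaned FQNs
current_fqns = ["_fsdp_wrapped_module.lin.weight"]  # prefixed FQNs post-PR

missing = [fqn for fqn in current_fqns if fqn not in saved_state]
assert missing == ["_fsdp_wrapped_module.lin.weight"]  # the observed breakage
```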

awgu added a commit to awgu/pytorch that referenced this pull request Jan 17, 2023
ghstack-source-id: d3fab858407489d64e2173ab845ceef488031f88
Pull Request resolved: pytorch#91767
pytorchmergebot pushed a commit that referenced this pull request Jan 30, 2023
The last PR (#91767) had a land race relating to `_NamedOptimizer` + FSDP and got reverted. This is a re-land.

Pull Request resolved: #92662
Approved by: https://github.com/rohan-varma
facebook-github-bot deleted the gh/awgu/288/head branch June 8, 2023 15:33