[2D] Enable 2D FSDP+TP model.load_state_dict() #110925

wz337 · 2023-10-10T04:48:36Z

Stack from ghstack (oldest at bottom):

This PR adds an all_gather_dtensor() method to fsdp/_fsdp_extensions.py, with the actual implementation in tensor/parallel/fsdp.py. This enables FSDP to load a 2D DTensor state_dict into the model when calling model.load_state_dict().

cc. @fegin

[ghstack-poisoned]

pytorch-bot · 2023-10-10T04:48:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110925


✅ You can merge normally! (4 Unrelated Failures)

As of commit 598cde8 with merge base ad24965:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 6fa37cc Pull Request resolved: #110925


fduwjj · 2023-10-10T06:12:28Z

torch/distributed/fsdp/_shard_utils.py

    )
+
+
+def _all_gather_dtensor(


Can we merge this logic and the one in torch/distributed/tensor/parallel/fsdp.py into one common place?

This _all_gather_dtensor() is actually internal to FSDP. We are following the extension design here. See this function as an example: https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fsdp_extensions.py#L91

Essentially, there are two code paths:
FSDP only -> _all_gather_dtensor()
FSDP + TP -> _extensions.all_gather_dtensor()
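The two code paths described above can be sketched in plain Python. This is only an illustration of the extension-dispatch pattern, not the code from this PR: `GatherExtension`, `RecordingExtension`, `_default_all_gather`, and `all_gather` are hypothetical names, and lists of integers stand in for tensor shards so the snippet runs without `torch.distributed`.

```python
from typing import List, Optional, Protocol


class GatherExtension(Protocol):
    """Hypothetical interface a TP-aware extension would implement."""

    def all_gather_dtensor(self, shards: List[List[int]]) -> List[int]:
        ...


def _default_all_gather(shards: List[List[int]]) -> List[int]:
    # FSDP-only path: concatenate the per-rank shards in rank order.
    return [value for shard in shards for value in shard]


class RecordingExtension:
    """Hypothetical FSDP + TP extension; counts how often it is used."""

    def __init__(self) -> None:
        self.calls = 0

    def all_gather_dtensor(self, shards: List[List[int]]) -> List[int]:
        self.calls += 1
        # A real extension would handle the 2D layout here; this
        # stand-in simply reuses the default concatenation.
        return _default_all_gather(shards)


def all_gather(
    shards: List[List[int]],
    extension: Optional[GatherExtension] = None,
) -> List[int]:
    # Dispatch: prefer the registered extension, else use the default.
    if extension is not None:
        return extension.all_gather_dtensor(shards)
    return _default_all_gather(shards)
```

With no extension the caller goes through the internal default; passing an extension object routes the same call through the extension's method instead, which is why the two implementations live in separate places.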


fegin · 2023-10-10T07:17:01Z

torch/distributed/fsdp/_shard_utils.py

+
+def _all_gather_dtensor(
+    tensor: DTensor,
+    parent_mesh: DeviceMesh,


This type annotation should be Optional.
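A minimal sketch of what the suggested annotation looks like, assuming the argument may be `None` when there is no parent mesh (the thread does not state the reason). `Mesh` and `describe_gather` are hypothetical stand-ins, not PyTorch APIs.

```python
from typing import Optional


class Mesh:
    """Hypothetical stand-in for a device mesh."""

    def __init__(self, name: str) -> None:
        self.name = name


def describe_gather(parent_mesh: Optional[Mesh]) -> str:
    # Optional[Mesh] tells type checkers that None is a valid argument,
    # so callers must handle both cases explicitly.
    if parent_mesh is None:
        return "no parent mesh"
    return f"parent mesh: {parent_mesh.name}"
```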


ghstack-source-id: da730a1 Pull Request resolved: #110925

fegin

LGTM

wz337 · 2023-10-11T18:19:42Z

@pytorchmergebot merge

pytorchmergebot · 2023-10-11T18:21:54Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team.

Advanced Debugging

Check the merge workflow status here

enable 2D FSDP+TP load_state_dict()

e47d0bc

[ghstack-poisoned]

wz337 requested review from H-Huang, awgu, d4l3k, fduwjj, fegin, kiukchung, kwen2501, mrshenli, rohan-varma, wanchaol and zhaojuanmao as code owners October 10, 2023 04:48

wz337 mentioned this pull request Oct 10, 2023

[FSDP] Change _create_chunk_dtensor in fsdp/_shard_utils.py to use public API from DTensor #110831

Closed

pytorch-bot bot added the release notes: distributed (fsdp) label Oct 10, 2023

wz337 mentioned this pull request Oct 10, 2023

[2D] Enable 2D DTensor state_dict for FSDP + TP #110846

Closed

wz337 added a commit that referenced this pull request Oct 10, 2023

enable 2D FSDP+TP load_state_dict()

a528f19

ghstack-source-id: 6fa37cc Pull Request resolved: #110925

wz337 changed the title from "enable 2D FSDP+TP load_state_dict()" to "[2D] Enable 2D FSDP+TP model.load_state_dict()" Oct 10, 2023

wz337 added the ciflow/trunk and ciflow/periodic labels Oct 10, 2023

fduwjj reviewed Oct 10, 2023


fegin reviewed Oct 10, 2023


wz337 requested a review from fegin October 10, 2023 07:45

wz337 added a commit that referenced this pull request Oct 10, 2023

enable 2D FSDP+TP load_state_dict()

6e4efcb

ghstack-source-id: da730a1 Pull Request resolved: #110925

fegin approved these changes Oct 10, 2023


pytorchmergebot added the merging label Oct 11, 2023

pytorchmergebot added Merged and removed merging labels Oct 11, 2023

pytorchmergebot closed this in 80dfc97 Oct 11, 2023

facebook-github-bot deleted the gh/wz337/3/head branch October 15, 2023 14:25
