
[HSDP] Fix Node 1 unable to receive parameters from Node 0 #108331

Closed
lxg2015 wants to merge 7 commits

Conversation

lxg2015
Contributor

@lxg2015 lxg2015 commented Aug 31, 2023

When using FSDP in hybrid_shard mode, state.process_group only contains GPUs 0-7 on node 0, so the GPUs on node 1 cannot receive the parameters. Setting process_group to the default (global) group can fix this issue.

Fixes #ISSUE_NUMBER
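
For concreteness, a minimal illustration of the group layout assumed here (2 nodes x 8 GPUs; illustration only, not the FSDP internals, and it assumes torch.distributed is already initialized with 16 ranks):

import torch.distributed as dist

world_size = 16      # assumed: 2 nodes x 8 GPUs
gpus_per_node = 8

# Intra-node sharding groups: [0..7] on node 0 and [8..15] on node 1.
# Broadcasting module states only over these groups never gets rank 0's
# parameters onto node 1, because rank 0 is not a member of node 1's group.
# Note: every rank must participate in every new_group call.
shard_groups = [
    dist.new_group(list(range(n * gpus_per_node, (n + 1) * gpus_per_node)))
    for n in range(world_size // gpus_per_node)
]

# Inter-node replication groups: [0, 8], [1, 9], ..., [7, 15].
replicate_groups = [
    dist.new_group(list(range(r, world_size, gpus_per_node)))
    for r in range(gpus_per_node)
]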

@pytorch-bot

pytorch-bot bot commented Aug 31, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/108331

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ac714a3 with merge base 121cfb6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Aug 31, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Aug 31, 2023
@awgu
Contributor

awgu commented Aug 31, 2023

Thanks for pointing this issue out!

I think that the current broadcast semantics over only the sharding process group is an issue. However, I am not sure if using the global process group is the right thing to do in general.

I think the right semantics are to "union" the sharding process group and replication process group and broadcast over this unioned process group. When only using HSDP alone, this union is the global process group. However, if HSDP does not have full ownership over the cluster (e.g. if composing with some other parallelism), then the union may not be the global process group.

cc: @wanchaol @wz337 Would DeviceMesh support something like: We have a 2D submesh from some global mesh (possibly the global mesh is just the 2D mesh), and we call a collective over two dimensions of the mesh? Could DeviceMesh initialize new process groups if needed under the hood?

@lxg2015
Contributor Author

lxg2015 commented Aug 31, 2023

@awgu You are right, there is indeed an issue with the global process group. I have modified the logic here.

Now the parameters on gpu0 are first synchronized to gpu1,2,...,7 on node 0, and then, over state._inter_node_pg, they are synchronized from gpu0 to gpu8, gpu1 to gpu9, gpu2 to gpu10, and so on, so the params on gpu0 are broadcast to all ranks.
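
In other words, something like the following two-stage broadcast (a minimal sketch with a hypothetical helper name; the PR itself just calls FSDP's existing _sync_module_params_and_buffers a second time over the inter-node group):

import torch.distributed as dist

def _broadcast_from_group_leader(tensors, group):
    # The leader is the group's local rank 0, translated to its global rank.
    leader = dist.get_global_rank(group, 0)
    for t in tensors:
        dist.broadcast(t, src=leader, group=group)

def sync_from_rank0(params, intra_node_pg, inter_node_pg):
    tensors = [p.detach() for p in params]
    # Stage 1: broadcast within each node. Afterwards, every GPU on node 0
    # holds global rank 0's values (node 1 still holds its own leader's values).
    _broadcast_from_group_leader(tensors, intra_node_pg)
    # Stage 2: broadcast across nodes (gpu0 -> gpu8, gpu1 -> gpu9, ...), so every
    # rank ends up with the values that originated on global rank 0.
    _broadcast_from_group_leader(tensors, inter_node_pg)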

Is this all right? Thanks for your reply.

@awgu
Contributor

awgu commented Aug 31, 2023

@lxg2015 This approach seems reasonable to me. I am wondering if you can add a unit test in https://github.com/pytorch/pytorch/blob/main/test/distributed/fsdp/test_fsdp_hybrid_shard.py.

The unit test probably needs 4 GPUs (shard across 2 and replicate across 2).
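
For reference, a rough sketch of what such a test could look like (helper names and tensor values here are assumptions; the merged test in test/distributed/fsdp/test_fsdp_hybrid_shard.py may differ). It assumes a 4-GPU process group is already initialized:

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def check_hsdp_sync_module_states():
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Every rank must create every group; each rank then picks its own groups.
    pg01, pg23 = dist.new_group([0, 1]), dist.new_group([2, 3])   # sharding groups
    pg02, pg13 = dist.new_group([0, 2]), dist.new_group([1, 3])   # replication groups
    shard_pg = pg01 if rank in (0, 1) else pg23
    replicate_pg = pg02 if rank in (0, 2) else pg13

    # Make ranks disagree on purpose: only rank 0 starts from zeros.
    model = nn.Linear(8, 8, device="cuda")
    with torch.no_grad():
        val = 0.0 if rank == 0 else 1.0
        model.weight.fill_(val)
        model.bias.fill_(val)

    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
        process_group=(shard_pg, replicate_pg),  # (shard, replicate)
        sync_module_states=True,                 # broadcast rank 0's states before sharding
        device_id=torch.cuda.current_device(),
    )

    # After the fix, every rank's local shard should come from rank 0's all-zero weights.
    for p in model.parameters():
        assert (p == 0).all()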

@lxg2015
Contributor Author

lxg2015 commented Sep 1, 2023

@awgu I added a commit with a unit test, and it passes on both 4 GPUs and 8 GPUs.

Contributor

@awgu awgu left a comment


Thank you @lxg2015!

model = fsdp_ctor(model)

with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT):
    assert (model.lin1.weight == 0).all()
Contributor


nit: I think the error messaging might be better if we use self.assertTrue((model.lin1.weight == 0).all())? Or at least, I think using self.assert<...> is the common practice.

Contributor Author


Yes, I have changed the assert to self.assertTrue.

@@ -516,6 +516,10 @@ def _init_param_handle_from_module(
        _sync_module_params_and_buffers(
            fully_sharded_module, managed_params, state.process_group
        )
        if hasattr(state, '_inter_node_pg'):
Contributor


nit: I wonder if this might be brittle for checking if we are using HSDP. Perhaps, we can do getattr(state, "_inter_node_pg", None) is not None?
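
In the context of the diff above, the suggested check might look like this (a sketch only, reusing the names from the surrounding FSDP code, not the exact merged diff):

inter_node_pg = getattr(state, "_inter_node_pg", None)
if inter_node_pg is not None:  # only HSDP states carry an inter-node process group
    _sync_module_params_and_buffers(
        fully_sharded_module, managed_params, inter_node_pg
    )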

Contributor Author


Yes, I have changed this.

@lxg2015
Contributor Author

lxg2015 commented Sep 7, 2023

Hello @awgu, can this PR be merged? To reduce memory consumption, many other packages, such as accelerate, only load the checkpoint on gpu 0. If we don't fix this, it leads to abnormal training loss.

@awgu awgu self-assigned this Sep 7, 2023
@awgu
Contributor

awgu commented Sep 7, 2023

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 7, 2023
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed.

Dig deeper by viewing the failures on hud

Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here


@awgu
Contributor

awgu commented Sep 7, 2023

@pytorchbot rebase -s

@pytorch pytorch deleted a comment from pytorch-bot bot Sep 7, 2023
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

When using FSDP in hybrid_shard mode, state.process_group only contains GPUs 0-7 on node 0, so the GPUs on node 1 cannot receive the parameters; setting process_group to the default (global) group can fix this issue.
First broadcast the params from gpu0 to gpu1,...,7 on node 0, then broadcast from gpu0 to gpu8, gpu1 to gpu9, gpu2 to gpu10, and so on, so the params on gpu0 are broadcast to all ranks. This also works when there are more nodes.
@pytorchmergebot
Collaborator

Successfully rebased lxg2015-patch-3 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout lxg2015-patch-3 && git pull --rebase)

@wz337
Contributor

wz337 commented Sep 9, 2023

> Thanks for pointing this issue out!
>
> I think that the current broadcast semantics over only the sharding process group is an issue. However, I am not sure if using the global process group is the right thing to do in general.
>
> I think the right semantics are to "union" the sharding process group and replication process group and broadcast over this unioned process group. When only using HSDP alone, this union is the global process group. However, if HSDP does not have full ownership over the cluster (e.g. if composing with some other parallelism), then the union may not be the global process group.
>
> cc: @wanchaol @wz337 Would DeviceMesh support something like: We have a 2D submesh from some global mesh (possibly the global mesh is just the 2D mesh), and we call a collective over two dimensions of the mesh? Could DeviceMesh initialize new process groups if needed under the hood?

Yes, I think you can call a collective over two dimensions of the mesh. I believe Wanchao removed DeviceMesh's collectives, since they were just a thin layer over functional collectives, so you should be able to use functional collectives directly for this. If I understand your use case correctly, this may be what you are looking for. Code pointer: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L161

And yes, DeviceMesh would initialize new process groups if needed. The current logic is that for a mesh matching the world size, it will reuse the default group if it has already been initialized; for sub-PGs, it will go through group creation. See pointer:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/device_mesh.py#L213
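
For illustration, a hedged sketch of that DeviceMesh-based approach (module paths and the string form of get_group have moved across releases; at the time of this PR DeviceMesh lived under torch.distributed._tensor.device_mesh):

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# 2 nodes x 8 GPUs laid out as a (replicate, shard) mesh.
mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("replicate", "shard"))

# Each mesh dimension is backed by a process group, created on demand if needed.
shard_pg = mesh.get_group("shard")          # this rank's intra-node group
replicate_pg = mesh.get_group("replicate")  # this rank's inter-node group

# Broadcasting over both dimensions (shard first, then replicate) reaches every
# rank of the 2D submesh without assuming the mesh spans the whole world.
t = torch.ones(4, device="cuda")
dist.broadcast(t, src=dist.get_global_rank(shard_pg, 0), group=shard_pg)
dist.broadcast(t, src=dist.get_global_rank(replicate_pg, 0), group=replicate_pg)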

cc. @wanchaol

@lxg2015
Contributor Author

lxg2015 commented Sep 9, 2023

Hi @awgu, all checks appear to be fine. Can this PR be merged? :grin: Or is DeviceMesh a better way to fix it?

@awgu
Contributor

awgu commented Sep 11, 2023

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@haichaoyu

haichaoyu commented Mar 22, 2024

Hi @lxg2015, @awgu , thanks for the PR and reviewing. One question here.

Here, _get_orig_params is called, and it does not return FlatParameters. However, when an auto_wrap_policy is applied, there will actually be some FlatParameters due to nested wrapping. In that case, won't some model parameters fail to be synced, leaving them inconsistent across ranks? Thanks!

@awgu
Contributor

awgu commented Mar 25, 2024

@haichaoyu Each FullyShardedDataParallel instance should be responsible for syncing its own managed parameters, so you would want to pass sync_module_states=True for all FullyShardedDataParallel instances to broadcast the entire model from rank 0.

For example, suppose you had a transformer model with transformer blocks, where you apply FSDP to each transformer block and then finally to the root transformer. When you apply FSDP to each transformer block with sync_module_states=True, it broadcasts the transformer block's parameters from rank 0. Finally, when you apply FSDP to the root, it broadcasts the root's parameters (e.g. embedding weight, output projection weight) from rank 0 and does not re-broadcast the already flattened transformer block parameters.
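A small sketch of that wrapping pattern (the model and class names here are illustrative, not taken from the PR):

import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

class Block(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.Linear(dim, dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.mlp(self.attn(x))

class Transformer(nn.Module):
    def __init__(self, dim=128, n_layers=4, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.out = nn.Linear(dim, vocab)

    def forward(self, idx):
        x = self.embed(idx)
        for blk in self.blocks:
            x = blk(x)
        return self.out(x)

model = Transformer().cuda()
# Auto-wrapping each Block creates a nested FSDP instance per block. With
# sync_module_states=True, each instance (every block and the root) broadcasts
# only the parameters it directly manages from rank 0 before sharding them, so
# the whole model ends up consistent without re-broadcasting flattened params.
model = FSDP(
    model,
    auto_wrap_policy=ModuleWrapPolicy({Block}),
    sync_module_states=True,
    device_id=torch.cuda.current_device(),
)
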

Let me know if this makes sense!

@haichaoyu

Got it. Thanks for the detailed explanation!

@haichaoyu

@awgu Another related question.
The second time _sync_module_params_and_buffers is called, the buffers' FSDP_SYNCED flag is already set to True, so they will not be synced between nodes. Does this cause an inter-node inconsistency if the model has randomly initialized buffers? Thanks!

@awgu
Contributor

awgu commented Apr 9, 2024

If a parent module re-initializes the buffer of a child module, where the child module is part of a different FSDP wrapping, then yes, this could cause issues. The general guidance for FSDP is that each module should only initialize its directly owned parameters/buffers to avoid cases like this.
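
A tiny sketch of that guidance, with illustrative module names:

import torch
import torch.nn as nn

class Child(nn.Module):
    def __init__(self):
        super().__init__()
        # The child owns and initializes its own buffer.
        self.register_buffer("noise", torch.randn(8))

class Parent(nn.Module):
    def __init__(self):
        super().__init__()
        self.child = Child()
        # Avoid re-initializing the child's buffer here, e.g.
        #   self.child.noise.normal_()
        # If Parent and Child end up in different FSDP instances, the child's
        # buffer may already be marked as synced, and a later re-init would not
        # be broadcast again, leaving ranks inconsistent.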

Labels: ciflow/trunk, Merged, open source, release notes: distributed (fsdp)