[FSDP2] Enable HSDP + TP #133335
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133335
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (8 Unrelated Failures)
As of commit d347f6b with merge base cc1cc71:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
BROKEN TRUNK - The following jobs failed but were also present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The SPMD placements change looks good to me.
    2 <= self._spmd_mesh.ndim <= 3
), f"_spmd_mesh.ndim can only be 2 or 3 but got {self._spmd_mesh.ndim}."
self._spmd_placements: Tuple[Placement, ...]
if self._spmd_mesh.ndim == 2:
nit: maybe factor out to avoid some duplication?
fsdp_shard_placement = (
    _StridedShard(0, split_factor=split_factor)
    if split_factor > 1
    else Shard(0)
)
if self._spmd_mesh.ndim == 2:
    self._spmd_placements = (fsdp_shard_placement, self._tp_spec.placements[0])
else:
    self._spmd_placements = (
        Replicate(),
        fsdp_shard_placement,
        self._tp_spec.placements[0],
    )
pp_size = 2 if self.world_size > 4 else 1
return init_device_mesh(
    "cuda",
    (2, 2, 2),
nit: I think we need to use the `dp_size` and `pp_size` above.
Forgot to remove, but we need 8 GPUs, so it is always 2x2x2. I skip the test if the number of GPUs is less than 8.
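For readers following along, a minimal sketch of such a guard, assuming PyTorch's internal `skip_if_lt_x_gpu` test helper; the class and method names are illustrative and not the ones in this PR:

```python
# Sketch (not the PR's actual test): skip when fewer than 8 GPUs are available,
# so the hard-coded (2, 2, 2) mesh always has enough ranks.
# Assumes the distributed test harness has already spawned 8 ranks.
from torch.distributed.device_mesh import init_device_mesh
from torch.testing._internal.common_distributed import skip_if_lt_x_gpu


class TestHSDPTPExample:  # illustrative name
    @skip_if_lt_x_gpu(8)
    def test_hsdp_tp(self):
        mesh = init_device_mesh(
            "cuda",
            (2, 2, 2),  # (replicate, shard, tp) -> requires exactly 8 ranks
            mesh_dim_names=("dp_replicate", "dp_shard", "tp"),
        )
        ...
```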
Not sure if you want to make more changes since the PR is still a draft, but stamping to unblock.
Thanks @awgu for the review. I'm going to add one more test for state_dict before landing.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR enables HSDP + TP.
Pull Request resolved: pytorch#133335
Approved by: https://github.com/awgu
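For reference, a minimal sketch of the composition this PR enables, assuming an 8-rank run (e.g. launched with torchrun) and a recent PyTorch with multi-dim mesh slicing; the toy `MLP`, mesh dim names, and TP plan are illustrative assumptions, not taken from this PR:

```python
# Sketch: HSDP (replicate + shard) composed with TP on a 3D device mesh.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class MLP(nn.Module):  # illustrative toy model
    def __init__(self, dim: int = 16):
        super().__init__()
        self.w1 = nn.Linear(dim, 4 * dim)
        self.w2 = nn.Linear(4 * dim, dim)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


# One global 3D mesh over 8 ranks: (replicate, shard, tp).
mesh = init_device_mesh(
    "cuda", (2, 2, 2), mesh_dim_names=("dp_replicate", "dp_shard", "tp")
)

model = MLP().cuda()

# Apply TP on the innermost mesh dim, then HSDP on the outer 2D (replicate, shard) submesh.
parallelize_module(
    model, mesh["tp"], {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
)
fully_shard(model, mesh=mesh["dp_replicate", "dp_shard"])
```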
@awgu thanks a lot for this PR.
This is the device mesh I pass in; it takes some DP mesh and reshapes it to work with HSDP. I see the following issue:
[rank26]: Traceback (most recent call last):
[rank26]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank26]: File "<frozen runpy>", line 88, in _run_code
[rank26]: File "/u/mayank98/scratch/tmp1/dolomite-engine/dolomite_engine/pretrain.py", line 381, in <module>
[rank26]: main()
[rank26]: File "/u/mayank98/scratch/tmp1/dolomite-engine/dolomite_engine/pretrain.py", line 312, in main
[rank26]: model = wrap_model_for_distributed_training(args, model)
[rank26]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank26]: File "/u/mayank98/scratch/tmp1/dolomite-engine/dolomite_engine/distributed/__init__.py", line 210, in wrap_model_for_distributed_training
[rank26]: fully_shard(
[rank26]: File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/contract.py", line 125, in wrapper
[rank26]: updated = func(inp_module, *args, **kwargs)
[rank26]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank26]: File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/fully_shard.py", line 129, in fully_shard
[rank26]: state._fsdp_param_group = FSDPParamGroup(
[rank26]: ^^^^^^^^^^^^^^^
[rank26]: File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 114, in __init__
[rank26]: self.fsdp_params = [
[rank26]: ^
[rank26]: File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 115, in <listcomp>
[rank26]: FSDPParam(
[rank26]: File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param.py", line 235, in __init__
[rank26]: self._init_sharded_param(param, device)
[rank26]: File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank26]: return func(*args, **kwargs)
[rank26]: ^^^^^^^^^^^^^^^^^^^^^
[rank26]: File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param.py", line 269, in _init_sharded_param
[rank26]: raise AssertionError(
[rank26]: AssertionError: FSDP requires the DP and TP mesh to have the same parent mesh but got:
[rank26]: DP's global mesh: DeviceMesh('cuda', [[0, 8, 16, 24], [2, 10, 18, 26], [4, 12, 20, 28], [6, 14, 22, 30]])
[rank26]: TP's global mesh: DeviceMesh('cuda', [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15], [16, 17], [18, 19], [20, 21], [22, 23], [24, 25], [26, 27], [28, 29], [30, 31]], mesh_dim_names=('dp', 'tp'))
@mayank31398 How did you create the device mesh? Can you share the code that creates it?
@fegin sorry, it was an error on my end.
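For anyone who hits the same assertion: FSDP2 expects the DP mesh passed to `fully_shard` and the TP mesh used for the DTensor parameters to be slices of one parent mesh, not two independently constructed meshes. A hedged sketch of one way to set that up for the 32-rank, TP=2 layout shown in the traceback (the dim names and the 4x4x2 split are illustrative):

```python
# Sketch: build a single global mesh and slice it so the DP and TP meshes
# share the same parent mesh (sizes chosen to match the 32-rank, TP=2 run above).
from torch.distributed.device_mesh import init_device_mesh

global_mesh = init_device_mesh(
    "cuda", (4, 4, 2), mesh_dim_names=("dp_replicate", "dp_shard", "tp")
)
tp_mesh = global_mesh["tp"]                        # for parallelize_module
dp_mesh = global_mesh["dp_replicate", "dp_shard"]  # for fully_shard(..., mesh=...)
```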
Stack from ghstack (oldest at bottom):
This PR enables HSDP + TP
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o