Conversation

fegin
Contributor

@fegin fegin commented Aug 13, 2024

[ghstack-poisoned]
@pytorch-bot pytorch-bot bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Aug 13, 2024

pytorch-bot bot commented Aug 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133335

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit d347f6b with merge base cc1cc71:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following jobs failed but were likely due to flakiness present on trunk and have been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Aug 13, 2024
fegin added a commit that referenced this pull request Aug 13, 2024
This PR enables HSDP + TP

ghstack-source-id: 05a1706
Pull Request resolved: #133335
@fegin fegin marked this pull request as draft August 13, 2024 17:50
@fegin fegin requested a review from awgu August 13, 2024 17:50
Collaborator

@awgu awgu left a comment

The SPMD placements change looks good to me.

assert (
    2 <= self._spmd_mesh.ndim <= 3
), f"_spmd_mesh.ndim can only be 2 or 3 but got {self._spmd_mesh.ndim}."
self._spmd_placements: Tuple[Placement, ...]
if self._spmd_mesh.ndim == 2:
Collaborator

nit: maybe factor out to avoid some duplication?

fsdp_shard_placement = (
    _StridedShard(0, split_factor=split_factor)
    if split_factor > 1
    else Shard(0)
)
if self._spmd_mesh.ndim == 2:
    self._spmd_placements = (fsdp_shard_placement, self._tp_spec.placements[0])
else:
    self._spmd_placements = (
        Replicate(),
        fsdp_shard_placement,
        self._tp_spec.placements[0],
    )
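
For context (not from the PR diff): the 3-D branch is the HSDP + TP case, where the outermost mesh dimension replicates, the middle dimension carries the FSDP shard, and the innermost dimension is TP. A tiny sketch of the resulting placement tuples, assuming split_factor == 1 and an illustrative Shard(1) TP placement:

from torch.distributed._tensor import Replicate, Shard

tp_placement = Shard(1)  # assumed TP placement, for illustration only

# FSDP + TP on a 2-D (shard, tp) mesh
fsdp_tp_placements = (Shard(0), tp_placement)

# HSDP + TP on a 3-D (replicate, shard, tp) mesh
hsdp_tp_placements = (Replicate(), Shard(0), tp_placement)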

pp_size = 2 if self.world_size > 4 else 1
return init_device_mesh(
    "cuda",
    (2, 2, 2),
Collaborator

nit: I think we need to use the dp_size and pp_size above.

Contributor Author

Forgot to remove it, but we need 8 GPUs, so the mesh is always 2x2x2. I skip the test if the number of GPUs is less than 8.
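
For reference, a minimal sketch of what such a guarded 2x2x2 mesh setup could look like; the helper name, skip message, and mesh dimension names are assumptions, not code from this PR:

import unittest

import torch
from torch.distributed.device_mesh import init_device_mesh


def _build_hsdp_tp_mesh():
    # The test hard-codes a 2x2x2 (replicate, shard, tp) mesh, so it needs
    # exactly 8 GPUs; skip on smaller machines.
    if torch.cuda.device_count() < 8:
        raise unittest.SkipTest("HSDP + TP test requires 8 GPUs")
    return init_device_mesh(
        "cuda",
        (2, 2, 2),
        mesh_dim_names=("dp_replicate", "dp_shard", "tp"),
    )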

Collaborator

@awgu awgu left a comment

Not sure if you want to make more changes since the PR is still a draft, but stamping to unblock.

@fegin
Contributor Author

fegin commented Aug 13, 2024

Thanks @awgu for the review. I'm going to add one more test for state_dict before landing.

[ghstack-poisoned]
fegin added a commit that referenced this pull request Aug 14, 2024
This PR enables HSDP + TP

ghstack-source-id: 908504d
Pull Request resolved: #133335
@fegin fegin marked this pull request as ready for review August 14, 2024 06:45
@fegin fegin added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request labels Aug 14, 2024
@fegin
Contributor Author

fegin commented Aug 14, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

WeizhuoZhang-intel pushed a commit to WeizhuoZhang-intel/pytorch that referenced this pull request Aug 15, 2024
This PR enables HSDP + TP

Pull Request resolved: pytorch#133335
Approved by: https://github.com/awgu
@mayank31398
Contributor

@awgu thanks a lot for this PR.
I am trying to make TP work with HSDP, and I want to do:

  1. TP on 2 GPUs
  2. FSDP on 8 GPUs
  3. DDP on 16 GPUs

Basically, 2 nodes do FSDP, TP runs inside them, and DDP runs across them: (2, 8, 16).

This is the device mesh I pass to fully_shard.

[Screenshot: the code that builds the device mesh]

It takes some DP mesh and reshapes it to work with HSDP.
Is this logic not correct?

I see the following issue:

[rank26]: Traceback (most recent call last):
[rank26]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank26]:   File "<frozen runpy>", line 88, in _run_code
[rank26]:   File "/u/mayank98/scratch/tmp1/dolomite-engine/dolomite_engine/pretrain.py", line 381, in <module>
[rank26]:     main()
[rank26]:   File "/u/mayank98/scratch/tmp1/dolomite-engine/dolomite_engine/pretrain.py", line 312, in main
[rank26]:     model = wrap_model_for_distributed_training(args, model)
[rank26]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank26]:   File "/u/mayank98/scratch/tmp1/dolomite-engine/dolomite_engine/distributed/__init__.py", line 210, in wrap_model_for_distributed_training
[rank26]:     fully_shard(
[rank26]:   File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/contract.py", line 125, in wrapper
[rank26]:     updated = func(inp_module, *args, **kwargs)
[rank26]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank26]:   File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/fully_shard.py", line 129, in fully_shard
[rank26]:     state._fsdp_param_group = FSDPParamGroup(
[rank26]:                               ^^^^^^^^^^^^^^^
[rank26]:   File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 114, in __init__
[rank26]:     self.fsdp_params = [
[rank26]:                        ^
[rank26]:   File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 115, in <listcomp>
[rank26]:     FSDPParam(
[rank26]:   File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param.py", line 235, in __init__
[rank26]:     self._init_sharded_param(param, device)
[rank26]:   File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank26]:     return func(*args, **kwargs)
[rank26]:            ^^^^^^^^^^^^^^^^^^^^^
[rank26]:   File "/u/mayank98/miniconda3/envs/ai/lib/python3.11/site-packages/torch/distributed/_composable/fsdp/_fsdp_param.py", line 269, in _init_sharded_param
[rank26]:     raise AssertionError(
[rank26]: AssertionError: FSDP requires the DP and TP mesh to have the same parent mesh but got: 
[rank26]: DP's global mesh: DeviceMesh('cuda', [[0, 8, 16, 24], [2, 10, 18, 26], [4, 12, 20, 28], [6, 14, 22, 30]])
[rank26]: TP's global mesh: DeviceMesh('cuda', [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15], [16, 17], [18, 19], [20, 21], [22, 23], [24, 25], [26, 27], [28, 29], [30, 31]], mesh_dim_names=('dp', 'tp'))

@fegin
Contributor Author

fegin commented Aug 19, 2024

@mayank31398 How did you create the device mesh? Can you share the init_device_mesh() call?

@mayank31398
Contributor

@fegin sorry, it was an error on my end.
This is fixed now :)
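
For anyone hitting the same assertion: the mesh passed to fully_shard and the mesh used for tensor parallelism must both be slices of one parent mesh, rather than two independently created meshes. A minimal sketch, assuming 32 ranks, a 2-way replicate / 8-way shard / 2-way TP layout, and a PyTorch version that supports slicing two mesh dimensions at once:

from torch.distributed.device_mesh import init_device_mesh

# Single parent mesh over all 32 ranks.
mesh = init_device_mesh(
    "cuda",
    (2, 8, 2),
    mesh_dim_names=("dp_replicate", "dp_shard", "tp"),
)

tp_mesh = mesh["tp"]                        # pass to parallelize_module
dp_mesh = mesh["dp_replicate", "dp_shard"]  # 2-D mesh to pass to fully_shard

Both submeshes then share the same parent mesh, which is what the assertion in _init_sharded_param checks.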

@github-actions github-actions bot deleted the gh/fegin/285/head branch September 23, 2024 02:07