
Remove HSDP validation check #112435

Merged

Conversation

@mvpatel2000 (Contributor) commented Oct 30, 2023

Currently, HSDP validates that all intra-/inter-node PGs are the same. This makes sense if you are using HSDP with no other forms of parallelism, and it is a nice but not strictly necessary sanity check.

However, if you want to mix HSDP with other forms of parallelism, say tensor parallelism on the FFN of a transformer block, the intra-/inter-node PGs will be different for that layer. The check raises errors in this scenario, so we need to remove this assumption.
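For illustration, a minimal sketch of the kind of wrapping that currently trips the check. The module, rank lists, and group sizes are hypothetical, and it assumes the default process group has already been initialized on every rank:

```python
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Sketch only: assumes dist.init_process_group(...) has run on every rank
# and that the (hypothetical) rank lists below match the actual world size.
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.attn = nn.Linear(128, 128)
        self.ffn = nn.Linear(128, 128)

    def forward(self, x):
        return self.ffn(self.attn(x))

block = Block().cuda()

# The FFN participates in tensor parallelism, so its HSDP (shard, replicate)
# groups differ from the groups used for the rest of the block.
ffn_pgs = (dist.new_group(ranks=[0, 1]), dist.new_group(ranks=[0, 2]))
default_pgs = (dist.new_group(ranks=[0, 1, 2, 3]), dist.new_group(ranks=[0, 4]))

block.ffn = FSDP(
    block.ffn,
    process_group=ffn_pgs,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
block = FSDP(
    block,
    process_group=default_pgs,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
# With the current validation, the first forward pass fails in _lazy_init
# with a "process groups do not match" ValueError.
```

Removing the check lets each HSDP instance keep its own (shard, replicate) pair, which is what a per-layer tensor-parallel setup needs.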

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

@pytorch-bot (bot) commented Oct 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112435

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e08159e with merge base d444a3b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@awgu (Contributor) commented Oct 30, 2023

If I understand correctly, there is still some value in the check; however, it is currently overly strict and is problematic for manual wrapping + HSDP. I think there was someone internally working on relaxing the check.

The valuable part of the check is that, if you are using HSDP, each HSDP instance that spans the same ranks should use the same process groups. We do not want to create a different pair of process groups per HSDP instance.
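As a minimal sketch of the intended pattern (rank lists are hypothetical, `model`/`model.lin1` stand in for any module and submodule, and the default process group is assumed to be initialized), the (shard, replicate) pair is constructed once and reused for every HSDP instance that spans the same ranks:

```python
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Hypothetical rank lists: construct the intra-/inter-node groups once...
intra_node_ranks = [0, 1, 2, 3]   # ranks on this node
inter_node_ranks = [0, 4]         # one rank per node
shard_pg = dist.new_group(ranks=intra_node_ranks)
replicate_pg = dist.new_group(ranks=inter_node_ranks)

# ...and reuse the same pair for every HSDP instance on these ranks,
# rather than calling dist.new_group() separately per wrapped submodule.
model.lin1 = FSDP(
    model.lin1,
    process_group=(shard_pg, replicate_pg),
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
model = FSDP(
    model,
    process_group=(shard_pg, replicate_pg),
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```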

@mvpatel2000 (Contributor, Author) commented:
@awgu got it! If someone is working on it, feel free to close this PR then :)

@awgu (Contributor) commented Oct 30, 2023

Let me follow up on the progress of that PR and get back to you!

@albanD added the triaged label Nov 2, 2023
@awgu (Contributor) commented Nov 15, 2023

@fegin @wz337 Is there anything from the checkpointing side that requires each FSDP instance to use the HSDP process groups?

If not, then I think removing this requirement/check sounds good to me (and we would need to remove the unit test).

```python
@skip_if_lt_x_gpu(2)
def test_hybrid_shard_pg_mismatch_raises(self):
    model = MyModel().cuda()
    intra_pg = self.process_group
    inter_pg = dist.new_group(ranks=[self.rank])
    # Mismatched process groups for intra-node
    model.lin1 = FSDP(
        model.lin1,
        process_group=(intra_pg, inter_pg),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )
    model = FSDP(
        model,
        process_group=(dist.new_group(), dist.new_group()),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )
    # Errors during _lazy_init
    inp = torch.randn(4, 10)
    with self.assertRaisesRegex(
        ValueError, "intra-node process groups do not match"
    ):
        model(inp)
    # Mismatched process groups for inter-node
    model = MyModel().cuda()
    model.lin1 = FSDP(
        model.lin1,
        process_group=(intra_pg, inter_pg),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )
    model = FSDP(
        model,
        process_group=(intra_pg, dist.new_group()),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )
    with self.assertRaisesRegex(
        ValueError, "inter-node process groups do not match"
    ):
        model(inp)
```

@mvpatel2000 (Contributor, Author) commented, quoting @awgu's question above:

> Is there anything from the checkpointing side that requires each FSDP instance to use the HSDP process groups?
>
> If not, then I think removing this requirement/check sounds good to me (and we would need to remove the unit test).

@awgu @fegin @wz337 bumping this request! would love to have this issue resolved

@mvpatel2000 (Contributor, Author) commented:
@awgu @fegin @wz337 bumping this please!

@fegin added the ciflow/trunk and ciflow/periodic labels Jan 30, 2024
@pytorch-bot (bot) commented Jan 30, 2024

Please seek CI approval before scheduling CIFlow labels

@pytorch-bot (bot) removed the ciflow/periodic label Jan 30, 2024
@fegin (Contributor) commented Jan 30, 2024

I think it is okay to remove the check. Will let @wz337 review again.

@wz337 (Contributor) commented Jan 30, 2024

> Is there anything from the checkpointing side that requires each FSDP instance to use the HSDP process groups?
>
> If not, then I think removing this requirement/check sounds good to me (and we would need to remove the unit test).

We are relying on DTensor to do the all-gather and chunk, so we don't use the HSDP process groups directly. I think it should be fine to remove this requirement.

@wz337 (Contributor) left a review comment:

I think it's ok to remove this check.

Could you also include removing the unit test in the PR as @awgu mentioned so CI doesn't break?

https://github.com/pytorch/pytorch/blob/main/test/distributed/fsdp/test_fsdp_hybrid_shard.py#L120

@fegin added the ciflow/periodic label Jan 30, 2024
@github-actions (bot) added the oncall: distributed and ciflow/inductor labels Feb 1, 2024
@pytorch-bot (bot) commented Feb 1, 2024

Please seek CI approval before scheduling CIFlow labels

@mvpatel2000 (Contributor, Author) commented:
@wz337 test removed!

@wconstab sorry -- updated to be more clear :)

@mvpatel2000 (Contributor, Author) commented:
@pytorchmergebot merge

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:
Merge failed

Reason: 3 mandatory check(s) failed.

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@Skylion007 (Collaborator) commented:
@pytorchbot merge -r

@pytorchmergebot (Collaborator) commented:
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented:
Successfully rebased mvpatel2000/remove-hsdp-validate onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout mvpatel2000/remove-hsdp-validate && git pull --rebase)

@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Pull Request resolved: #112435
Approved by: https://github.com/wz337, https://github.com/Skylion007
@mvpatel2000 deleted the mvpatel2000/remove-hsdp-validate branch February 13, 2024 19:06
mvpatel2000 added a commit to mvpatel2000/pytorch that referenced this pull request Feb 13, 2024
Pull Request resolved: pytorch#112435
Approved by: https://github.com/wz337, https://github.com/Skylion007
atalman pushed a commit that referenced this pull request Feb 14, 2024. Its message includes:
Co-authored-by: Andrew Gu <andgu@fb.com>
resolved: #112435
resolved: #118620
Fixed `device_mesh` and auto wrap (#119064)
fix #118906.
resolved: #119064
resolved: #118638
Fixes #118639.
resolved: #119481
Labels: ciflow/inductor, ciflow/periodic, ciflow/trunk, Merged, oncall: distributed, open source, release notes: distributed (fsdp), triaged

9 participants