
Conversation

tinglvv (Collaborator) commented Oct 30, 2023

fixes "Duplicate GPU detected : rank 1 and rank 0 both on CUDA device" on test_fsdp_fine_tune.py. Only run the test if GPU number > 1.
cc @eqy @ptrblck
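
For reference, one shape such a guard could take is sketched below. This is illustrative only and may not match the exact diff in this PR; it simply skips the whole test class up front when fewer than two GPUs are visible, before any per-rank process is spawned.

```python
import unittest

import torch
from torch.testing._internal.common_fsdp import FSDPTest


# Illustrative only: skip every test in the class on machines with
# fewer than 2 GPUs, before any per-rank process is spawned.
@unittest.skipIf(torch.cuda.device_count() < 2, "requires at least 2 GPUs")
class TestFSDPFineTune(FSDPTest):
    pass  # the actual fine-tune tests live in test_fsdp_fine_tune.py
```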

pytorch-bot bot commented Oct 30, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112406

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6456a48 with merge base 8858eda:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eqy (Collaborator) commented Oct 30, 2023

It seems a bit scary to add a skip to the run function rather than to individual tests; would it be possible to fix this issue by adding the skip decorator to the individual failing tests?

awgu (Collaborator) commented Oct 30, 2023

Strangely, it looks like every test in that file already has @skip_if_lt_x_gpu(2).

tinglvv (Collaborator, Author) commented Oct 31, 2023

Thanks @eqy and @awgu for reviewing!

It seems a bit scary to add a skip to the run function rather than to individual tests; would it be possible to fix this issue by adding the skip decorator to the individual failing tests?

Currently the decorator is only added in test_fsdp_fine_tune.py, and since we go through super we are not modifying the parent class. As far as I can see this is the only way right now. From the error log

Last error: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 81000 dist init r=0, world=2 dist init r=1, world=2

and https://github.com/coyotelll/pytorch/blob/f0ae3b73369849ce6e530effc0e5c770472c67aa/torch/testing/_internal/common_fsdp.py#L904, it appears the _run method is called twice (once per rank), and the duplicate-GPU error is raised there during dist init. Since the main purpose of _run is to call run_test, and the tests it runs are already decorated with @skip_if_lt_x_gpu(2), it should be okay to skip the _run method itself: https://github.com/coyotelll/pytorch/blob/f0ae3b73369849ce6e530effc0e5c770472c67aa/torch/testing/_internal/common_fsdp.py#L931.

Strangely, it looks like every test in that file already has @skip_if_lt_x_gpu(2).

Yes. So it would be okay to skip the _run method in the 1-GPU case.
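
For readers following along, here is a rough sketch of what a skip_if_lt_x_gpu-style decorator does. This is not the upstream implementation, just an illustration of why it only kicks in inside the per-rank test body, i.e. after _run has already tried to initialize the process group.

```python
import sys
from functools import wraps

import torch


# Rough sketch only, not the code in torch.testing._internal.common_distributed.
def skip_if_lt_x_gpu_sketch(x):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if torch.cuda.is_available() and torch.cuda.device_count() >= x:
                return fn(*args, **kwargs)
            # In the real harness a special exit code marks the rank as
            # skipped; the key point is that this check only runs inside
            # the test body, after _run has already done dist init per rank.
            sys.exit(0)
        return wrapper
    return decorator
```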

awgu (Collaborator) commented Nov 1, 2023

@coyotelll Sorry, I am not sure I followed.

What is special about test_fsdp_fine_tune.py as opposed to the other FSDP unit tests? What setup does it take to reproduce the error?

albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Nov 2, 2023
fegin (Contributor) commented Nov 2, 2023

You can change the world_size to min(torch.cuda.device_count(), 2). This will make each skip_if_lt_x_gpu take effect, though I'm also wondering the same thing as @awgu: in what setup do you get the error?
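
Concretely, the suggestion above would look roughly like the following override on the test class; this is an illustrative sketch, not the literal diff.

```python
import torch
from torch.testing._internal.common_fsdp import FSDPTest


class TestFSDPFineTune(FSDPTest):
    @property
    def world_size(self) -> int:
        # Never request more ranks than there are visible GPUs, so on a
        # 1-GPU machine the @skip_if_lt_x_gpu(2) decorators can skip
        # cleanly instead of two ranks colliding on device 0.
        return min(torch.cuda.device_count(), 2)
```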

tinglvv (Collaborator, Author) commented Nov 2, 2023

You can change the world_size to min(torch.cuda.device_count(), 2). This will make each skip_if_lt_x_gpu take effect, though I'm also wondering the same thing as @awgu: in what setup do you get the error?

Thanks for the comment. We get this error when running L0_self_test_distributed on, for example, an H100. To reproduce, you can run test/distributed/fsdp/test_fsdp_fine_tune.py::TestFSDPFineTune::test_parity_with_ddp.

Error is
Last error: Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 101000

tinglvv requested a review from LucasLLC as a code owner on November 3, 2023 23:20
eqy (Collaborator) commented Nov 9, 2023

Rebasing to see if the Android failure is fixed (sorry if you need to update your local branch after this).

eqy (Collaborator) commented Nov 9, 2023

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot (Collaborator) commented
Successfully rebased tingl-fsdpfix onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tingl-fsdpfix && git pull --rebase)

eqy (Collaborator) commented Nov 10, 2023

@awgu are your concerns addressed, or did you want to try out the fix suggested by @fegin?

awgu (Collaborator) commented Nov 10, 2023

@eqy Sorry if I missed something, but I am still not clear on:

What is special about test_fsdp_fine_tune.py as opposed to the other FSDP unit tests?

eqy (Collaborator) commented Nov 17, 2023

@coyotelll does this fix the issue as observed on your end (are we ready to merge?)

tinglvv (Collaborator, Author) commented Nov 17, 2023

@eqy Yes the update solves this issue. We can merge now. Thanks.

eqy (Collaborator) commented Nov 17, 2023

@awgu do you mind stamping this?

awgu (Collaborator) commented Nov 17, 2023

I am happy to stamp. I just want to understand the broader implications for our unit tests, since we typically return 2 or min(4, torch.cuda.device_count()) for the world size.

eqy (Collaborator) commented Nov 17, 2023

Could it be that upstream CI only ever runs FSDP tests on runners with enough GPUs (world size >= 2)? That could explain why the failure is not visible upstream. In other words, my understanding is that a hardcoded world size turns the decorators that gate tests on the number of GPUs into hardcoded no-ops. This fix in particular should also only affect TestFSDPFineTune, correct?
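
For illustration, the failure mode being described is roughly the following; the device-assignment line is a simplification of what the multi-process test harness does, not its actual code.

```python
import torch

world_size = 2                          # hardcoded by the test class
num_gpus = torch.cuda.device_count()    # 1 on the failing machines

# Simplified per-rank device assignment used by many distributed harnesses:
for rank in range(world_size):
    device = rank % max(num_gpus, 1)
    print(f"rank {rank} -> cuda:{device}")

# With num_gpus == 1, both ranks map to cuda:0, and process-group init
# fails with "Duplicate GPU detected : rank 1 and rank 0 both on CUDA
# device ..." instead of the tests being skipped.
```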

awgu (Collaborator) commented Nov 17, 2023

This fix in particular should also only affect TestFSDPFineTune, correct?

Yes, I think so. That is why I still have some uncertainty -- it seems like either we need to make a change to all unit tests that override world_size, or something else is up.

If I run

CUDA_VISIBLE_DEVICES=0 python -m pytest test/distributed/fsdp/test_fsdp_fine_tune.py

I can reproduce the error on my machine. I can also reproduce it for test_fsdp_grad_acc.py, and I imagine the same holds for the other tests that override world_size. I am not sure whether this is somehow a regression, though; I recall that the unit tests would previously just be skipped when there were an insufficient number of GPUs.

Either way, let us stamp this one to unblock, and we can investigate further.

eqy (Collaborator) commented Nov 17, 2023

@pytorchmergebot merge

pytorch-bot bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Nov 17, 2023
pytorchmergebot (Collaborator) commented
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

tinglvv (Collaborator, Author) commented Nov 17, 2023

@awgu, thanks for approving! More data on this: we see this error in several places in our CI where world_size is defined to be >= 2. Another example is https://github.com/pytorch/pytorch/blob/main/test/distributed/fsdp/test_fsdp_state_dict.py#L1239. Fixing it fully would probably require changes in several places, though I am not sure whether this counts as a regression.
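
If changes really are needed in several files, one possible shape is a small shared helper like the sketch below; the helper name is hypothetical, not an existing torch.testing utility.

```python
import torch


# Hypothetical shared helper (not an existing utility): clamp a requested
# world size to the number of visible GPUs so every affected suite can
# reuse the same guard instead of hardcoding 2.
def clamped_world_size(requested: int) -> int:
    return max(1, min(requested, torch.cuda.device_count()))
```

Each affected test class could then return clamped_world_size(2) (or 4) from its world_size property instead of hardcoding the value.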


Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, open source, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
