fixes multiple GPU detected error for test_fsdp_fine_tune.py #112406
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112406
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 6456a48 with merge base 8858eda.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
It seems a bit scary to add a skip for the run function vs. individual tests; would it be possible to fix this issue by adding the skip decorator to the individual failing tests?
Strangely, it looks like every test in that file already has …
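For reference, a minimal sketch of what per-test gating usually looks like in the FSDP suite, assuming the `skip_if_lt_x_gpu` decorator from `torch.testing._internal.common_distributed`; the test name below is purely illustrative and not necessarily one of the tests in this file:

```python
from torch.testing._internal.common_distributed import skip_if_lt_x_gpu
from torch.testing._internal.common_fsdp import FSDPTest


class TestFSDPFineTune(FSDPTest):
    @skip_if_lt_x_gpu(2)  # skips this test when fewer than 2 GPUs are visible
    def test_fine_tune_parity(self):  # illustrative name, not the real test
        ...
```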
Thanks @eqy and @awgu for reviewing!
Currently the decorator is added to test_fsdp_fine_tune.py, and with `super` we are not modifying the parent class. As far as I can see now, it would be the only way, based on the error log and https://github.com/coyotelll/pytorch/blob/f0ae3b73369849ce6e530effc0e5c770472c67aa/torch/testing/_internal/common_fsdp.py#L904 …
Yes. So it would be okay to skip the `_run` method for the 1-GPU case.
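A rough sketch of that idea, overriding `_run` in the test file and deferring to the parent class only when at least 2 GPUs are present; the exact `_run` signature in `common_fsdp.py` may differ from what is shown here:

```python
import torch
from torch.testing._internal.common_fsdp import FSDPTest


class TestFSDPFineTune(FSDPTest):
    @classmethod
    def _run(cls, rank, test_name, file_name, pipe, **kwargs):
        # Bail out early on single-GPU machines so that rank 0 and rank 1
        # are never mapped to the same CUDA device.
        if torch.cuda.device_count() < 2:
            return
        super()._run(rank, test_name, file_name, pipe, **kwargs)
```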
@coyotelll Sorry, I am not sure I followed. What is special about test_fsdp_fine_tune.py as opposed to the other FSDP unit tests? What setup does it take to reproduce the error?
You can change the …
Thanks for the comment. We get this error when running L0_self_test_distributed on, for example, H100. To reproduce, you can run … The error is …
Rebasing to see if the Android failure is fixed (sorry if you need to update your local branch after this).
@pytorchmergebot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased; the branch was updated from 2244f61 to a0feed1.
@eqy Sorry if I missed something, but I am still not clear on: …
@coyotelll does this fix the issue as observed on your end (are we ready to merge?)
@eqy Yes, the update solves this issue. We can merge now. Thanks.
@awgu do you mind stamping this?
I am happy to stamp. I just want to understand the implications more broadly on our unit tests, since we typically return 2 or …
Could it be that upstream CI only ever runs FSDP tests on runners with enough GPUs (world size >= 2)? That could explain why the failure is not visible upstream. In other words, my understanding is that the hardcoded world size would turn decorators gating tests on the number of GPUs into hardcoded no-ops. This fix in particular should also only affect …
Yes, I think so. That is why I still have some uncertainty -- it seems like either we make a change to all unit tests that override … If I run …
I can reproduce the error on my machine. I can also do so for … Either way, let us stamp this one to unblock, and we can investigate further.
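To make the world-size concern above concrete, here is an illustrative contrast (not the exact upstream definitions): a hardcoded `world_size` spawns two ranks even on a single-GPU runner, while capping it by `torch.cuda.device_count()` keeps the rank count within the number of visible devices.

```python
import torch


class HardcodedWorldSize:
    @property
    def world_size(self) -> int:
        # Two ranks are spawned no matter how many GPUs are visible, so on a
        # 1-GPU machine both ranks land on the same CUDA device.
        return 2


class DeviceCappedWorldSize:
    @property
    def world_size(self) -> int:
        # Never spawn more ranks than there are visible GPUs.
        return min(torch.cuda.device_count(), 2)
```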
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@awgu, thanks for approving! More data on this: we see this error in several places in our CI where …
fixes "Duplicate GPU detected : rank 1 and rank 0 both on CUDA device" on test_fsdp_fine_tune.py. Only run the test if GPU number > 1.
cc @eqy @ptrblck
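As a rough sketch of the gating described in the summary (whether the actual change uses a class-level skip, a per-test decorator, or a `_run` override is not shown in this extract), one way to only run the tests when more than one GPU is available is:

```python
import unittest

import torch
from torch.testing._internal.common_fsdp import FSDPTest


# Skip every test in the class on machines with fewer than 2 visible GPUs,
# so two ranks are never placed on the same CUDA device.
@unittest.skipIf(torch.cuda.device_count() < 2, "requires at least 2 GPUs")
class TestFSDPFineTune(FSDPTest):
    ...
```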