Parallel training hangs #1
Is your firewall possibly interfering? Does it work if you use a loopback device instead?
If not, see what other local network devices you have and check which one it's currently using.
If not, does it work if you use just the first 2 or the last 2 GPUs,
then the 2nd pair?
If not, attach to each hanging process and see where it is stuck; a sketch of these checks follows below.
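A minimal sketch of what those checks might look like, assuming the tool is launched through `torch.distributed.run` on a single node; the exact commands and flags were not preserved in the thread, so treat the names below as illustrations rather than quotes:

```bash
# Try the first pair of GPUs:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 torch-distributed-gpu-test.py

# Then the second pair:
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 torch-distributed-gpu-test.py

# Force NCCL onto the loopback interface in case the default network device is the problem:
NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 2 torch-distributed-gpu-test.py

# If a process hangs, attach to it and look at the stack, e.g. with py-spy:
py-spy dump --pid <PID>
```

Testing the GPUs in pairs narrows the hang down to a particular device or PCIe link, and attaching to a stuck process shows whether it is waiting inside an NCCL collective or somewhere else.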
Pure gold! Thank you so much for the insight. I don't think it's a firewall/networking issue, since this machine is on my desk and I'm logged into it directly. I do see some page faults in syslog. I get the same result every time: each GPU being tested hangs, using one third of its total power and ~2GB of VRAM, indefinitely.
I also tried this CUDA bandwidthTest from Nvidia, and it passed. BTW, I have the fourth GPU unplugged for now—just because this Threadripper box needs a dedicated 20A power outlet to run on all cylinders.
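For readers hitting the same symptom, one way to look for the page faults mentioned above is to grep the kernel log; the commands and log strings below are assumptions, not quotes from this machine:

```bash
# Search the kernel ring buffer / journal for IOMMU page-fault events
# (AMD's IOMMU driver typically logs them as "AMD-Vi: ... IO_PAGE_FAULT ...").
sudo dmesg | grep -iE 'iommu|io_page_fault'
sudo journalctl -k | grep -iE 'amd-vi|io_page_fault'
```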
BINGO! Disabling IOMMU did the trick!
Oh, wow! That's some awesome diagnostics you've performed - absolutely awesome, @mhillebrand! Glad to hear you got it working! So the key to unravelling this problem was noticing a page fault in syslog.
We should probably start compiling all the different causes somewhere so that others have an easier time. Glad you resolved it!
@jeffra, tagging you on this one as an FYI, since some users are likely to run into this with DeepSpeed. And this is not the first problem I have seen with AMD and multi-GPU setups.
Yes, that is correct. 😃 Thanks again for all your help!
Oh, duh. You can also disable IOMMU in the BIOS. That's preferable to fiddling with GRUB, methinks.
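For reference, both routes look roughly like this. The exact kernel parameter used in this case is not stated in the thread; `iommu=soft` and `amd_iommu=off` are simply the options commonly suggested for this symptom, so treat this as a sketch:

```bash
# GRUB route (assumed parameter): edit /etc/default/grub and add the option
# to the kernel command line, for example:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=soft"
# then regenerate the config and reboot.
sudo update-grub            # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# BIOS route: look for an "IOMMU" or "AMD-Vi" setting (often under an
# Advanced / AMD CBS / NBIO menu, depending on the board) and set it to Disabled.
```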
Hi, I saw your toolbox link in a Huggingface issue and gave it a try. My four new GPUs hang when trying to fine-tune a transformer, and they appear to do the same thing when running your `torch-distributed-gpu-test.py` tool, too. However, I'm not sure what the expected outcome is here. I should point out that I can fine-tune a transformer with just a single GPU. I'm using Python 3.9.7, Transformers 4.17.0, PyTorch 1.11.0+cu113, NCCL 2.12.7 for CUDA 11.6, and four Nvidia A6000 GPUs.
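For anyone landing here with the same question about the expected outcome: a typical launch on a single 4-GPU node might look like the sketch below. The flags are assumptions based on the usual `torch.distributed.run` interface rather than a quote from the tool's docs, but the key point is that a healthy setup finishes within seconds, whereas the failure mode described in this thread hangs indefinitely.

```bash
# Assumed single-node launch across 4 GPUs; on a working box this returns
# quickly, while the IOMMU problem above makes every rank hang with partial
# GPU utilization and a couple of GB of VRAM allocated.
python -m torch.distributed.run --nproc_per_node 4 torch-distributed-gpu-test.py
```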