dataparallel not working on nvidia gpus and amd cpus #13045
Comments
I believe this is a duplicate of #1637. Specifically, see this comment. I've had success on a Threadripper machine by disabling IOMMU or changing the IOMMU setting to 'soft'.
cc @petrex from AMD and @slayton58 from NVIDIA: can you take a look and make sure this is addressed correctly in the appropriate driver?
@davidmascharka Yes, I also figured out that turning off the hardware IOMMU via BIOS or kernel parameters solves the problem. It just took me a couple of very annoying weeks to pin down the source; it's not the first place you go looking when a simple training session fails to converge. In my case it was even more confusing, as there were no crashes or error messages, just very funky behavior. It's a bit sad that this bug has been known for so long and nobody is addressing it. I found mentions of this issue, i.e. CUDA on AMD platforms, going back years.
Thanks, @jspisak will bring this back to our teams and see what we can do.
EVERYONE My friend Jason J. Yu (https://github.com/JasonYuJjyu) got this under control. He will save us.
Closing this issue via #1637 (comment).
🐛 Bug
We have a number of machines with Threadripper CPUs and 2 NVIDIA GPUs. Some have 1070 Ti cards, some 1080, some 1080 Ti, and one has a Titan Xp. They all displayed this behavior: when switching to data parallel, training would fail, i.e. accuracy would not go up. We first saw this in our code base, but it also happens with the ImageNet example from the PyTorch examples repo.
To Reproduce
Steps to reproduce the behavior:
These error messages were found in the dmesg log:
[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
[1118468.942189] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000040 flags=0x0020]
[1118468.942227] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00007c0 flags=0x0020]
[1118468.942265] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001040 flags=0x0020]
[1118468.942303] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000f40 flags=0x0020]
[1118468.942340] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00016c0 flags=0x0020]
[1118468.942377] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0002040 flags=0x0020]
[1118468.942414] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001e40 flags=0x0020]
[1118468.942452] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00025c0 flags=0x0020]
[1118468.942489] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0003040 flags=0x0020]
[1118468.942525] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0002d40 flags=0x0020]
[1118468.942560] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00034c0 flags=0x0020]
[1118468.942596] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0004040 flags=0x0020]
[1118468.942632] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0003c40 flags=0x0020]
[1118468.942667] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00043c0 flags=0x0020]
[1118468.942703] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0005040 flags=0x0020]
[1118468.942739] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0004b40 flags=0x0020]
[1118468.942774] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00052c0 flags=0x0020]
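Since the failure was silent on the training side, scanning the kernel log for these events may be a quicker way to spot the problem. Below is a minimal sketch, not part of the original report, that parses both line shapes shown in the log above and counts faults per PCI device. The names `FAULT_RE`, `parse_io_page_faults`, and `count_faults_by_device` are hypothetical, and the regular expression is only assumed to cover the two formats seen here; other kernel versions may log differently.

```python
import re
from collections import Counter

# Assumption: only the two line shapes from the log above are covered:
#   "nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=... address=... flags=...]"
#   "AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=... address=... flags=...]"
FAULT_RE = re.compile(
    r"(?:(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): )?"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"(?:device=(?P<dev>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) )?"
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def parse_io_page_faults(text):
    """Return one dict per IO_PAGE_FAULT line found in the given log text."""
    faults = []
    for line in text.splitlines():
        match = FAULT_RE.search(line)
        if match is None:
            continue
        device = match.group("dev") or match.group("pci")
        # Normalize "0000:0a:00.0" to "0a:00.0" so both line shapes agree.
        if device is not None and device.count(":") == 2:
            device = device.split(":", 1)[1]
        faults.append(
            {
                "device": device,
                "domain": int(match.group("domain"), 16),
                "address": int(match.group("address"), 16),
                "flags": int(match.group("flags"), 16),
            }
        )
    return faults


def count_faults_by_device(text):
    """Count IO_PAGE_FAULT events per PCI device (bus:slot.function)."""
    return Counter(fault["device"] for fault in parse_io_page_faults(text))
```

The text can come from a saved log file or from the output of `dmesg`, which may require elevated privileges on some systems. A nonzero count for the GPU's PCI address while a multi-GPU job is running would be consistent with the behavior described here.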
Expected behavior
Accuracy would not pick up in most cases, and it never picked up for the validation set. We managed to work around this problem by turning off IOMMU in the BIOS.
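To confirm which IOMMU-related options a machine actually booted with, the kernel command line can be inspected. The following is a minimal sketch, assuming a Linux system that exposes `/proc/cmdline` and assuming the relevant options are the `iommu=` and `amd_iommu=` kernel parameters (for example `iommu=soft` or `amd_iommu=off`). The thread does not state which exact parameter was used, and a BIOS-level switch will not appear here. The function names `iommu_params` and `read_cmdline` are hypothetical.

```python
def iommu_params(cmdline):
    """Return the IOMMU-related parameters found in a kernel command line.

    Assumption: only the `iommu` and `amd_iommu` keys are of interest.
    """
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key in ("iommu", "amd_iommu"):
            # A bare key without "=" is recorded with an empty value.
            found[key] = value if sep else ""
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

On an affected machine, `iommu_params(read_cmdline())` returning an empty dict would suggest that no IOMMU override was passed at boot, so the hardware IOMMU may still be active unless it was disabled in the BIOS.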
Environment
PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1
Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti
Nvidia driver version: 384.130
cuDNN version: Probably one of the following:
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7.0.5
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn_static.a