
DataParallel not working on NVIDIA GPUs and AMD CPUs #13045

Closed
gregmbi opened this issue Oct 24, 2018 · 6 comments

Comments

@gregmbi

gregmbi commented Oct 24, 2018

🐛 Bug

We have a number of machines with Threadripper CPUs and 2 NVIDIA GPUs each; some have 1070 Ti cards, some 1080, some 1080 Ti, and one has a Titan Xp. They all displayed this behavior: when switching to DataParallel, training would fail, i.e. accuracy would not go up. We first saw this in our own code base, but it also happens with the ImageNet example from the PyTorch examples repo.

To Reproduce

Steps to reproduce the behavior:

  1. Run the ImageNet example from the PyTorch examples repo with DataParallel enabled (see the sketch below).
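For reference, here is a minimal sketch of the kind of DataParallel training loop that triggers the problem. It is an illustrative stand-in for the ImageNet example, not the exact script; the model, the dummy batch, and the hyperparameters are placeholders:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Placeholder model standing in for the ImageNet example's ResNet.
    model = models.resnet18()

    # Wrap the model so each batch is split across both GPUs.
    model = nn.DataParallel(model).cuda()

    criterion = nn.CrossEntropyLoss().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Dummy batch standing in for an ImageNet DataLoader.
    images = torch.randn(64, 3, 224, 224).cuda()
    targets = torch.randint(0, 1000, (64,)).cuda()

    for step in range(10):
        output = model(images)            # inputs scattered to GPUs, outputs gathered
        loss = criterion(output, targets)
        optimizer.zero_grad()
        loss.backward()                   # gradients reduced across GPUs
        optimizer.step()
        print(step, loss.item())          # on affected machines the loss fails to decrease

Per the report above, the same code trains normally without the wrapper; the failure only appears once nn.DataParallel splits work across both cards.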

These error messages were found in the dmesg log:

[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
[1118468.942189] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000040 flags=0x0020]
[1118468.942227] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00007c0 flags=0x0020]
[1118468.942265] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001040 flags=0x0020]
[1118468.942303] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000f40 flags=0x0020]
[1118468.942340] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00016c0 flags=0x0020]
[1118468.942377] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0002040 flags=0x0020]
[1118468.942414] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001e40 flags=0x0020]
[1118468.942452] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00025c0 flags=0x0020]
[1118468.942489] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0003040 flags=0x0020]
[1118468.942525] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0002d40 flags=0x0020]
[1118468.942560] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00034c0 flags=0x0020]
[1118468.942596] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0004040 flags=0x0020]
[1118468.942632] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0003c40 flags=0x0020]
[1118468.942667] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00043c0 flags=0x0020]
[1118468.942703] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0005040 flags=0x0020]
[1118468.942739] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0004b40 flags=0x0020]
[1118468.942774] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00052c0 flags=0x0020]

Expected behavior

Accuracy should improve as it does in single-GPU training. Instead, it did not pick up in most cases, and it never picked up on the validation set. We managed to work around this problem by turning off the IOMMU in the BIOS.

Environment

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 384.130
cuDNN version: Probably one of the following:
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7.0.5
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn_static.a

@davidmascharka
Contributor

I believe this is a duplicate of #1637. Specifically, see this comment. I've had success on a Threadripper machine by disabling the IOMMU or changing the IOMMU setting to 'soft'.
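For anyone trying to confirm whether they are hitting the same issue, here is a rough diagnostic sketch using standard Linux procfs/sysfs paths (nothing PyTorch-specific); it prints the kernel command line and whether the hardware IOMMU is currently active:

    import os

    # Kernel boot parameters: the workarounds discussed here would show up as
    # iommu=soft or amd_iommu=off on this line (exact parameters may vary by setup).
    with open("/proc/cmdline") as f:
        print("kernel cmdline:", f.read().strip())

    # When the hardware IOMMU is active, its groups are exposed in sysfs.
    groups_dir = "/sys/kernel/iommu_groups"
    groups = os.listdir(groups_dir) if os.path.isdir(groups_dir) else []
    print("IOMMU groups:", len(groups),
          "(zero usually means the IOMMU is disabled or running in soft mode)")

If the IOMMU is active and dmesg shows AMD-Vi IO_PAGE_FAULT lines like the ones in the report above, disabling the IOMMU in the BIOS or booting with the 'soft' setting is the workaround that has been reported to help.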

@jspisak
Contributor

jspisak commented Oct 24, 2018

cc @petrex from AMD and @slayton58 from NVIDIA - can you take a look and make sure this is addressed correctly in the appropriate driver?

@gregmbi
Author

gregmbi commented Oct 24, 2018

@davidmascharka Yes, I also figured out that turning off the hardware IOMMU via the BIOS or kernel parameters solves the problem. It just took me a couple of very annoying weeks to pin down the source; a training session that simply fails to converge is not the first place you go looking for a platform issue. In my case it was even more confusing, as there were no crashes or error messages, just very funky behavior. It's a bit sad that this bug has been known for so long and nobody is addressing it. I found mentions of this issue, i.e. CUDA on AMD platforms, going back years.

@petrex
Contributor

petrex commented Oct 24, 2018

Thanks @jspisak, will bring this back to our teams and see what we can do.

@JaeDukSeo

EVERYONE, my friend Jason J. Yu (https://github.com/JasonYuJjyu) has got this under control.

https://gph.is/2RXxlPP

He will save us.

@ailzhang
Contributor

ailzhang commented May 9, 2019

Closing this issue via #1637 (comment).
@JaeDukSeo if you also have another workaround for it, please feel free to share it here so that more people can find it. Thanks!
