
DataParallel not working on NVIDIA GPUs and AMD CPUs #13045

Closed
gregmbi opened this issue Oct 24, 2018 · 6 comments

Comments

@gregmbi

gregmbi commented Oct 24, 2018

🐛 Bug

We have a number of machines with Threadripper CPUs and 2 NVIDIA GPUs each; some have 1070 Ti cards, some 1080, some 1080 Ti, and one has a Titan Xp. They all displayed this behavior: when switching to DataParallel, training would fail, i.e. accuracy would not go up. We first saw this in our own code base, but it also happens with the ImageNet example from the PyTorch examples repo.

To Reproduce

Steps to reproduce the behavior:

  1. Run the ImageNet example from the PyTorch examples repo with DataParallel enabled (see the sketch below).
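For reference, here is a minimal sketch of the kind of DataParallel training loop that triggers the problem. It is an illustrative stand-in for the ImageNet example, not the exact script; the model, the dummy batch, and the hyperparameters are placeholders:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Placeholder model standing in for the ImageNet example's ResNet.
    model = models.resnet18()

    # Wrap the model so each batch is split across both GPUs.
    model = nn.DataParallel(model).cuda()

    criterion = nn.CrossEntropyLoss().cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Dummy batch standing in for an ImageNet DataLoader.
    images = torch.randn(64, 3, 224, 224).cuda()
    targets = torch.randint(0, 1000, (64,)).cuda()

    for step in range(10):
        output = model(images)            # inputs scattered to GPUs, outputs gathered
        loss = criterion(output, targets)
        optimizer.zero_grad()
        loss.backward()                   # gradients reduced across GPUs
        optimizer.step()
        print(step, loss.item())          # on affected machines the loss fails to decrease

Per the report above, the same code trains normally without the wrapper; the failure only appears once nn.DataParallel splits work across both cards.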

These error messages were found in the dmesg log:

[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
[1118468.942189] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000040 flags=0x0020]
[1118468.942227] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00007c0 flags=0x0020]
[1118468.942265] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001040 flags=0x0020]
[1118468.942303] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000f40 flags=0x0020]
[1118468.942340] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00016c0 flags=0x0020]
[1118468.942377] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0002040 flags=0x0020]
[1118468.942414] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001e40 flags=0x0020]
[1118468.942452] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00025c0 flags=0x0020]
[1118468.942489] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0003040 flags=0x0020]
[1118468.942525] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0002d40 flags=0x0020]
[1118468.942560] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00034c0 flags=0x0020]
[1118468.942596] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0004040 flags=0x0020]
[1118468.942632] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0003c40 flags=0x0020]
[1118468.942667] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00043c0 flags=0x0020]
[1118468.942703] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0005040 flags=0x0020]
[1118468.942739] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d0004b40 flags=0x0020]
[1118468.942774] AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x000f address=0x00000000d00052c0 flags=0x0020]

Expected behavior

Accuracy should improve as it does in single-GPU training. Instead, it did not pick up in most cases, and it never picked up on the validation set. We managed to work around this problem by turning off the IOMMU in the BIOS.

Environment

PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 16.04.5 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
CMake version: version 3.5.1

Python version: 2.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: GeForce GTX 1080 Ti
GPU 1: GeForce GTX 1080 Ti

Nvidia driver version: 384.130
cuDNN version: Probably one of the following:
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn.so.7.0.5
/usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudnn_static.a

@davidmascharka
Contributor

I believe this is a duplicate of #1637. Specifically, see this comment. I've had success on a Threadripper machine by disabling the IOMMU or changing the IOMMU setting to 'soft'.
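For anyone trying to confirm whether they are hitting the same issue, here is a rough diagnostic sketch using standard Linux procfs/sysfs paths (nothing PyTorch-specific); it prints the kernel command line and whether the hardware IOMMU is currently active:

    import os

    # Kernel boot parameters: the workarounds discussed here would show up as
    # iommu=soft or amd_iommu=off on this line (exact parameters may vary by setup).
    with open("/proc/cmdline") as f:
        print("kernel cmdline:", f.read().strip())

    # When the hardware IOMMU is active, its groups are exposed in sysfs.
    groups_dir = "/sys/kernel/iommu_groups"
    groups = os.listdir(groups_dir) if os.path.isdir(groups_dir) else []
    print("IOMMU groups:", len(groups),
          "(zero usually means the IOMMU is disabled or running in soft mode)")

If the IOMMU is active and dmesg shows AMD-Vi IO_PAGE_FAULT lines like the ones in the report above, disabling the IOMMU in the BIOS or booting with the 'soft' setting is the workaround that has been reported to help.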

@jspisak
Contributor

jspisak commented Oct 24, 2018

cc @petrex from AMD and @slayton58 from NVIDIA - can you take a look and make sure this is addressed correctly in the appropriate driver?

@gregmbi
Author

gregmbi commented Oct 24, 2018

@davidmascharka Yes, I also figured out that turning off the hardware IOMMU via the BIOS or kernel parameters solves the problem. It just took me a couple of very annoying weeks to pin down the source; a training session that simply fails to converge is not the first place you go looking for a platform issue. In my case it was even more confusing, as there were no crashes or error messages, just very funky behavior. It's a bit sad that this bug has been known for so long and nobody is addressing it. I found mentions of this issue, i.e. CUDA on AMD platforms, going back years.

@petrex
Contributor

petrex commented Oct 24, 2018

Thanks @jspisak, will bring this back to our teams and see what we can do.

@JaeDukSeo

EVERYONE, my friend Jason J. Yu (https://github.com/JasonYuJjyu) has got this under control.

https://gph.is/2RXxlPP

He will save us.

@ailzhang
Contributor

ailzhang commented May 9, 2019

Closing this issue via #1637 (comment).
@JaeDukSeo if you also have another workaround for it, please feel free to share it here so that more people can find it. Thanks!
