Parallel training hangs #1
Is your firewall possibly interfering? Does it work if you use a loopback device instead?
If not, see what other local network devices you have and check which one it's currently using.
If not, does it work if you use just the first 2 or the last 2 GPUs,
then the 2nd pair?
If not, attach to each hanging process and see where it is stuck; a sketch of these checks follows below.
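A minimal sketch of what those checks might look like, assuming the tool is launched through `torch.distributed.run` on a single node; the exact commands and flags were not preserved in the thread, so treat the names below as illustrations rather than quotes:

```bash
# Try the first pair of GPUs:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --nproc_per_node 2 torch-distributed-gpu-test.py

# Then the second pair:
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.run --nproc_per_node 2 torch-distributed-gpu-test.py

# Force NCCL onto the loopback interface in case the default network device is the problem:
NCCL_SOCKET_IFNAME=lo python -m torch.distributed.run --nproc_per_node 2 torch-distributed-gpu-test.py

# If a process hangs, attach to it and look at the stack, e.g. with py-spy:
py-spy dump --pid <PID>
```

Testing the GPUs in pairs narrows the hang down to a particular device or PCIe link, and attaching to a stuck process shows whether it is waiting inside an NCCL collective or somewhere else.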
Pure gold! Thank you so much for the insight. I don't think it's a firewall/networking issue, since this machine is on my desk and I'm logged into it directly. I do see some page faults in syslog. I get the same result every time: each GPU being tested hangs, using one third of its total power and ~2GB of VRAM, indefinitely.
I also tried this CUDA bandwidthTest from Nvidia, and it passed. BTW, I have the fourth GPU unplugged for now—just because this Threadripper box needs a dedicated 20A power outlet to run on all cylinders.
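For readers hitting the same symptom, one way to look for the page faults mentioned above is to grep the kernel log; the commands and log strings below are assumptions, not quotes from this machine:

```bash
# Search the kernel ring buffer / journal for IOMMU page-fault events
# (AMD's IOMMU driver typically logs them as "AMD-Vi: ... IO_PAGE_FAULT ...").
sudo dmesg | grep -iE 'iommu|io_page_fault'
sudo journalctl -k | grep -iE 'amd-vi|io_page_fault'
```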
BINGO! Disabling IOMMU did the trick!
Oh, wow! That's some awesome diagnostics you've performed - absolutely awesome, @mhillebrand! Glad to hear you got it working! So the key to unravelling this problem was noticing a page fault in syslog.
We should probably start compiling all the different causes somewhere so that others have an easier time. Glad you resolved it!
@jeffra, tagging you on this one as an FYI, since some users are likely to run into this with DeepSpeed. And this is not the first problem I have seen with AMD and multi-GPU setups.
Yes, that is correct. 😃 Thanks again for all your help!
Oh, duh. You can also disable IOMMU in the BIOS. That's preferable to fiddling with GRUB, methinks.
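For reference, both routes look roughly like this. The exact kernel parameter used in this case is not stated in the thread; `iommu=soft` and `amd_iommu=off` are simply the options commonly suggested for this symptom, so treat this as a sketch:

```bash
# GRUB route (assumed parameter): edit /etc/default/grub and add the option
# to the kernel command line, for example:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=soft"
# then regenerate the config and reboot.
sudo update-grub            # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot

# BIOS route: look for an "IOMMU" or "AMD-Vi" setting (often under an
# Advanced / AMD CBS / NBIO menu, depending on the board) and set it to Disabled.
```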
Hi, I saw your toolbox link in a Huggingface issue and gave it a try. My four new GPUs hang when trying to fine-tune a transformer, and they appear to do the same thing when running your `torch-distributed-gpu-test.py` tool, too. However, I'm not sure what the expected outcome is here. I should point out that I can fine-tune a transformer with just a single GPU. I'm using Python 3.9.7, Transformers 4.17.0, PyTorch 1.11.0+cu113, NCCL 2.12.7 for CUDA 11.6, and four Nvidia A6000 GPUs.
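For anyone landing here with the same question about the expected outcome: a typical launch on a single 4-GPU node might look like the sketch below. The flags are assumptions based on the usual `torch.distributed.run` interface rather than a quote from the tool's docs, but the key point is that a healthy setup finishes within seconds, whereas the failure mode described in this thread hangs indefinitely.

```bash
# Assumed single-node launch across 4 GPUs; on a working box this returns
# quickly, while the IOMMU problem above makes every rank hang with partial
# GPU utilization and a couple of GB of VRAM allocated.
python -m torch.distributed.run --nproc_per_node 4 torch-distributed-gpu-test.py
```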