Performance tuning parameters for Horovod-TensorFlow benchmarks #288

Closed
vilmara opened this issue Jun 2, 2018 · 50 comments


vilmara commented Jun 2, 2018

Hi all, can you recommend a set of tuning parameters to get the highest performance (throughput in images/sec) using the Horovod-TensorFlow benchmarks? I got the result below and was wondering if there is room for more improvement:

mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 --mca plm_rsh_args "-p 50000" -x python tf_cnn_benchmarks.py --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --model=resnet50 --batch_size=128 --device=gpu --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --display_every=1000 --use_fp16=False --local_parameter_device=gpu --variable_update=horovod

total images/sec: 2047.95


alsrgv commented Jun 2, 2018

@vilmara, these look reasonable. What GPUs are you using for this test (P100? V100?), and what number of images/sec do you get on a single GPU?

alsrgv added the question label Jun 2, 2018

vilmara commented Jun 2, 2018

Hi @alsrgv, the system has 2 nodes with 4 V100 GPUs each; a single GPU produces around 255 images/sec.

I have run the benchmark with parameter tuning on a single node (without Horovod) and got even higher throughput. I tried these parameters with Horovod but got errors (except --variable_update, which I kept set to horovod).

Here is the result for a single node with the regular TF benchmark, without Horovod.
With tuning parameters: --batch_size=128 --use_fp16=True --local_parameter_device=gpu --variable_update=replicated --all_reduce_spec=nccl
python3 tf_cnn_benchmarks.py --device=gpu --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --model=resnet50 --num_gpus=4 --batch_size=128 --use_fp16=True --local_parameter_device=gpu --variable_update=replicated --all_reduce_spec=nccl

Result: 2341.4 images/sec


alsrgv commented Jun 2, 2018

@vilmara, can you try using --use_fp16=True for horovod as well?


vilmara commented Jun 2, 2018

@alsrgv, with --use_fp16=True it gave around 430 images/sec per GPU. Can you suggest what other parameters I can combine with --variable_update=horovod to get the highest throughput?


alsrgv commented Jun 3, 2018

@vilmara, the best performance I got with that combination (ResNet-50, batch size 64, FP16, V100 with NVLink, real data) is ~616 img/sec per GPU. If you have NVLink, you should get ~80-85% scaling efficiency within the node - so ~2000-2100 img/sec on 4 GPUs. You may get slightly better numbers since you're using a slightly higher batch size. Scaling efficiency between nodes will depend on your network. It looks like you have InfiniBand, which is great! What is the speed of your NIC? Do you have NVLink or PCIe?

Here's the command that I used to get a good performance:

$ mpirun -np 8 \
  -H 10.128.0.3:4,10.128.0.2:4 \
  -mca btl_tcp_if_exclude lo,docker0 \
  -mca pml ob1 \
  -mca btl ^openib \
  -bind-to none \
  -map-by slot \
  -x NCCL_SOCKET_IFNAME=^docker0 \
  -x NCCL_DEBUG=INFO \
  -x LD_LIBRARY_PATH \
  -x PATH \
  python tf_cnn_benchmarks.py \
    --model resnet50 \
    --batch_size 64 \
    --num_batches 1000 \
    --use_fp16=True \
    --variable_update horovod \
    --horovod_device gpu \
    --data_dir ~/tf_train \
    --data_name imagenet \
    --datasets_num_private_threads 4

Please note --datasets_num_private_threads 4. It really helps to balance the number of threads TensorFlow creates for preprocessing. You may want to tune it for your CPU.
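For example, a quick sweep over a few thread counts (a sketch only - the thread values are arbitrary, and some flags from the full command above are trimmed here for brevity):

$ for threads in 2 4 8; do
    mpirun -np 8 -H 10.128.0.3:4,10.128.0.2:4 \
      -bind-to none -map-by slot \
      -x NCCL_SOCKET_IFNAME=^docker0 -x LD_LIBRARY_PATH -x PATH \
      python tf_cnn_benchmarks.py \
        --model resnet50 --batch_size 64 --num_batches 200 \
        --use_fp16=True --variable_update horovod --horovod_device gpu \
        --data_dir ~/tf_train --data_name imagenet \
        --datasets_num_private_threads $threads
  done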


vilmara commented Jun 5, 2018

Hi @alsrgv, I have InfiniBand (the NIC speed is 100 Gb/s) and NVLink. I ran the test adding the flag --datasets_num_private_threads=4 and it boosted throughput to ~500 img/sec per GPU, so ~4000 img/sec on 8 GPUs; however, I still don't reach the performance you mentioned.

Here is the command I used:
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 --mca plm_rsh_args "-p 50000" python tf_cnn_benchmarks.py --variable_update=horovod --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --train_dir=/benchmarks/TrainAccuracyLog/ --model=resnet50 --batch_size=128 --device=gpu --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --use_fp16=True --local_parameter_device=gpu --horovod_device=gpu --datasets_num_private_threads=4 --display_every=1000


alsrgv commented Jun 5, 2018

@vilmara, great!

Sorry - I made a mistake in the expected number that I published. With batch size 64, ~2000-2100 img/sec on 4 GPUs is expected within a node.

I don't have numbers for batch size 128 - what performance are you getting within a node in your setup?


vilmara commented Jun 5, 2018

Hi @alsrgv, I am getting ~500 img/sec per GPU, so ~4000 img/sec on 8 GPUs with batch size 128.


alsrgv commented Jun 5, 2018

@vilmara, do you also get ~500 img/sec per GPU with 4 GPUs within a node?


vilmara commented Jun 6, 2018

Hi @alsrgv, sorry, got it; here are the results within a node:
with 1 GPU: ~644 img/sec
with 4 GPUs: ~595 img/sec per GPU, ~2381 img/sec total


alsrgv commented Jun 6, 2018

@vilmara, that's great, you get 92% scaling efficiency within a server. The fact that it goes down to 77% as you cross over the network is a bit underwhelming though. Can you run your test with -x NCCL_DEBUG=INFO and paste the output of NCCL lines? I'm wondering if GPUDirect is enabled and whether it would help.


vilmara commented Jun 7, 2018

Hi @alsrgv, here are the NCCL lines from the output; C4140-V100-1 is the primary host and C4140-V100-2 is the secondary host:

C4140-V100-1:24:201 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:24:201 [0] INFO Using internal Network Socket
C4140-V100-1:24:201 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:24:201 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:24:201 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0

C4140-V100-1:27:199 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:27:199 [3] INFO Using internal Network Socket
C4140-V100-1:27:199 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-1:26:196 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:26:196 [2] INFO Using internal Network Socket
C4140-V100-1:26:196 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-1:25:200 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:25:200 [1] INFO Using internal Network Socket
C4140-V100-1:25:200 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:24:229 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:24:229 [0] INFO Using internal Network Socket
C4140-V100-2:24:229 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:26:231 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:26:231 [2] INFO Using internal Network Socket
C4140-V100-2:26:231 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:25:228 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:25:228 [1] INFO Using internal Network Socket
C4140-V100-2:25:228 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:27:230 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:27:230 [3] INFO Using internal Network Socket
C4140-V100-2:27:230 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:24:229 [0] INFO comm 0x7fadd4312ea0 rank 4 nranks 8
C4140-V100-2:24:229 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:24:229 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-2:27:230 [3] INFO comm 0x7f953430fb40 rank 7 nranks 8
C4140-V100-2:25:228 [1] INFO comm 0x7f75d42dd280 rank 5 nranks 8
C4140-V100-2:26:231 [2] INFO comm 0x7f061c30d3d0 rank 6 nranks 8
C4140-V100-2:27:230 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:27:230 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:25:228 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:25:228 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:26:231 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:26:231 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:25:200 [1] INFO comm 0x7f775c298f10 rank 1 nranks 8
C4140-V100-1:24:201 [0] INFO comm 0x7f7734244ae0 rank 0 nranks 8
C4140-V100-1:25:200 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:25:200 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:26:196 [2] INFO comm 0x7f719c2a1740 rank 2 nranks 8
C4140-V100-1:27:199 [3] INFO comm 0x7f5ae82c0c40 rank 3 nranks 8
C4140-V100-1:26:196 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:26:196 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:27:199 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:27:199 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-1:24:201 [0] INFO Using 256 threads
C4140-V100-1:24:201 [0] INFO Min Comp Cap 7
C4140-V100-1:24:201 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:24:201 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
C4140-V100-2:26:231 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:25:228 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-2:24:229 [0] INFO 3 -> 4 via NET/Socket/0
C4140-V100-2:24:229 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
C4140-V100-1:25:200 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
C4140-V100-1:26:196 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
C4140-V100-1:24:201 [0] INFO 7 -> 0 via NET/Socket/0
C4140-V100-1:24:201 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-1:24:201 [0] INFO Launch mode Parallel

I have tried some of the tuning parameters indicated in this guide without success; could you please recommend some? http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf


alsrgv commented Jun 7, 2018

@vilmara, good news - there is room for improvement on the software side :-)

C4140-V100-1:27:199 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1] indicates that Mellanox drivers are not installed.

C4140-V100-1:24:201 [0] INFO 7 -> 0 via NET/Socket/0 indicates that communication is happening over sockets instead of IB.

Hopefully, these two pointers will help:

  1. https://community.mellanox.com/docs/DOC-2688
  2. http://www.mellanox.com/page/products_dyn?product_family=116


vilmara commented Jun 8, 2018

Hi @alsrgv, I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server. I have also checked its status to make sure GPUDirect RDMA is active, and realized it is not recognized inside the Horovod docker container; see below:

Outside the docker:
service nv_peer_mem status
Output
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
Docs: man:systemd-sysv-generator(8)
Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CPU: 0

Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK

Inside the docker:
service nv_peer_mem status
Output:
nv_peer_mem: unrecognized service


alsrgv commented Jun 8, 2018

@vilmara, you need two more things:

  1. Extend the Dockerfile to install the Mellanox drivers - the same version as the one running on the host.
  2. Plumb through the appropriate devices. The simplest way is to run Docker in --privileged mode (a narrower alternative is sketched below).

Mellanox has much more detail in this document: https://community.mellanox.com/docs/DOC-3014#jive_content_id_Create_or_pull_a_base_image_and_run_Container, but I think you can get away with just doing (1) and (2).
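A sketch of what (2) could look like without full --privileged mode - the device path and capability below are the ones commonly used for Mellanox InfiniBand, but treat them as assumptions and verify them against your setup:

$ nvidia-docker run -it --network=host \
    --device=/dev/infiniband \
    --cap-add=IPC_LOCK \
    -v /root/.ssh:/root/.ssh \
    horovod:latest

Running with --privileged instead, as suggested above, is the simpler option.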


vilmara commented Jun 8, 2018

@alsrgv how can I extend the Dockerfile to install Mellanox drivers?


alsrgv commented Jun 8, 2018

@vilmara, you can either edit the Horovod Dockerfile, or make your own Dockerfile that starts with FROM uber/horovod:0.13.4-tf1.8.0-torch0.4.0-py2.7 (or another version you prefer) and install the drivers there.

This guide has a good introduction to Docker: https://docs.docker.com/get-started/


vilmara commented Jun 8, 2018

@alsrgv, when using the command for option 1 it throws this error: Error response from daemon: pull access denied for mellanox/hpcx-2-0, repository does not exist or may require 'docker login'. I logged into Docker but am still getting the same error.


alsrgv commented Jun 8, 2018

@vilmara, I found this image - https://hub.docker.com/r/mellanox/hpcx20_docker/, but I haven't personally tried it.


alsrgv commented Jun 8, 2018

I don't think their docker image would contain CUDA / NCCL / TF though. So it may be easier to start with the Horovod docker image and just install MLNX_OFED using their installation script.
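As a rough sketch of that approach - the base image tag, MLNX_OFED bundle name, and installer flags below are assumptions, so match them to the driver version and OS on your hosts:

FROM uber/horovod:0.13.4-tf1.8.0-torch0.4.0-py2.7

# MLNX_OFED bundle downloaded from Mellanox; must match the host driver version.
COPY MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz /tmp/
RUN cd /tmp && \
    tar -xzf MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz && \
    MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/mlnxofedinstall \
        --user-space-only --without-fw-update --force && \
    rm -rf /tmp/MLNX_OFED_LINUX-*

Build the image on both hosts (e.g. docker build -t horovod:latest .) so both containers carry the same userland libraries; the kernel modules still come from the host install.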


vilmara commented Jun 11, 2018

Thanks @alsrgv, I have some extra questions:

1. If I select option 3, editing the Horovod Dockerfile and adding MLNX_OFED using their installation script, I won't need to use options 1 or 2, right?

2. When editing the Horovod Dockerfile, do I need to add the GPUDirect RDMA API installation? I guess yes.

3. What flags should be used with docker run and mpirun to indicate we are using GPUDirect?


vilmara commented Jun 12, 2018

Option 3, editing the Horovod Dockerfile:

@alsrgv, I was able to extend the Dockerfile to install the Mellanox driver (only MLNX_OFED, since GPUDirect RDMA was throwing errors), so I installed the GPUDirect RDMA API outside of the docker container and activated it before running the benchmarks.

Now, when running the benchmarks, here is what I got:
Output:
C4140-V100-1:25:200 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:25:200 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:25:200 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:25:200 [0] INFO Using internal Network IB
C4140-V100-1:25:200 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:25:200 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:25:200 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0

C4140-V100-2:87186:87393 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87186:87393 [3] INFO Using internal Network Socket
C4140-V100-2:87186:87393 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:87183:87387 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87183:87387 [0] INFO Using internal Network Socket
C4140-V100-2:87183:87387 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:87185:87396 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87185:87396 [2] INFO Using internal Network Socket
C4140-V100-2:87185:87396 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:87184:87390 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87184:87390 [1] INFO Using internal Network Socket
C4140-V100-2:87184:87390 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:27:197 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:27:197 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:28:203 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:28:203 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:26:204 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:26:204 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:26:204 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:26:204 [1] INFO Using internal Network IB
C4140-V100-1:26:204 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:27:197 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:27:197 [2] INFO Using internal Network IB
C4140-V100-1:27:197 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:28:203 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:28:203 [3] INFO Using internal Network IB
C4140-V100-1:28:203 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:25:200 [0] INFO comm 0x7fce942dd730 rank 0 nranks 8
C4140-V100-1:25:200 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:27:197 [2] INFO comm 0x7fcd402af340 rank 4 nranks 8
C4140-V100-1:28:203 [3] INFO comm 0x7fea482e2420 rank 6 nranks 8
C4140-V100-1:26:204 [1] INFO comm 0x7fe3fc311fb0 rank 2 nranks 8
C4140-V100-1:27:197 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:27:197 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:26:204 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:26:204 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:28:203 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:28:203 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-1:27:197 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:28:203 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:26:204 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:87185:87396 [2] INFO comm 0x7f99382b3e70 rank 5 nranks 8
C4140-V100-2:87184:87390 [1] INFO comm 0x7fda8428e990 rank 3 nranks 8
C4140-V100-2:87186:87393 [3] INFO comm 0x7fa32c2d3260 rank 7 nranks 8
C4140-V100-2:87185:87396 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87185:87396 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:87184:87390 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87184:87390 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:87186:87393 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87186:87393 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:87183:87387 [0] INFO comm 0x7f935835b090 rank 1 nranks 8
C4140-V100-2:87183:87387 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87183:87387 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-1:25:200 [0] INFO Using 256 threads
C4140-V100-1:25:200 [0] INFO Min Comp Cap 7
C4140-V100-1:25:200 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:25:200 [0] INFO Ring 00 : 0 2 4 6 1 3 5 7
C4140-V100-2:87184:87390 [1] INFO Ring 00 : 3[1] -> 5[2] via P2P/IPC
C4140-V100-2:87185:87396 [2] INFO Ring 00 : 5[2] -> 7[3] via P2P/IPC
C4140-V100-1:27:197 [2] INFO Ring 00 : 4[2] -> 6[3] via P2P/IPC
C4140-V100-1:26:204 [1] INFO Ring 00 : 2[1] -> 4[2] via P2P/IPC
C4140-V100-2:87183:87387 [0] INFO 6 -> 1 via NET/Socket/0
C4140-V100-2:87183:87387 [0] INFO Ring 00 : 1[0] -> 3[1] via P2P/IPC
C4140-V100-1:25:200 [0] INFO 7 -> 0 via NET/IB/0
C4140-V100-1:25:200 [0] INFO Ring 00 : 0[0] -> 2[1] via P2P/IPC
C4140-V100-1:28:203 [3] INFO NET/IB: Dev 0 Port 1 qpn 238 mtu 5 LID 1

The primary server hangs at this point. I see it is still showing some network connections over sockets.


alsrgv commented Jun 12, 2018

@vilmara, are you able to run ibv_devinfo inside the docker image? Are you running the container in privileged mode? I believe you should not need to install GPUDirect inside the container.


vilmara commented Jun 12, 2018

Hi @alsrgv, yes, I am able to run ibv_devinfo inside the docker image on both servers. Yes, I am running the container in privileged mode. I tried to install GPUDirect inside the container but it didn't work; also, the Mellanox folks told me that "regarding nv_peer_mem, it is only required to be loaded once, on the host. You don't need to load it inside a container too" Mellanox/nv_peer_memory#41 (comment).

Before installing MLNX_OFED inside the docker image, the network connection was only over sockets; now it is mixed, and the system hangs.


alsrgv commented Jun 12, 2018

@vilmara, gotcha. Do all these things apply to the second node as well, i.e. ibv_devinfo works in the container, it's running in privileged mode, etc?


vilmara commented Jun 12, 2018

@alsrgv, here are my commands on both servers:

Primary server:
nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest

cd /benchmarks/scripts/tf_cnn_benchmarks

mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO --bind-to none --mca plm_rsh_args "-p 50000" python tf_cnn_benchmarks.py --variable_update=horovod --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --train_dir=/benchmarks/TrainAccuracyLog/ --model=resnet50 --batch_size=128 --device=gpu --num_epochs=1 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --use_fp16=True --local_parameter_device=cpu --horovod_device=gpu --datasets_num_private_threads=4 --display_every=1000

Secondary server:
nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest bash -c "/usr/sbin/sshd -p 50000; sleep infinity"


alsrgv commented Jun 12, 2018

@vilmara, is horovod:latest on the second server updated to have MLNX_OFED as well? Can you try running ibv_devinfo in the container on the second server?


vilmara commented Jun 12, 2018

@alsrgv, here are the outputs from both servers inside the docker container:

Primary server:
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.17.2052
node_guid: e41d:2d03:0062:1256
sys_image_guid: e41d:2d03:0062:1256
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: DEL2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 2
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand

Secondary server:
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.20.1820
node_guid: 7cfe:9003:0028:cec6
sys_image_guid: 7cfe:9003:0028:cec6
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: DEL2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 2
port_lid: 2
port_lmc: 0x00
link_layer: InfiniBand


alsrgv commented Jun 12, 2018

@vilmara, it's extremely strange that you have both:

C4140-V100-2:87186:87393 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]

and ibv_devinfo working correctly.

Is it possible that you still have a container running an older version of the docker image on the second server, and that's what gets accessed over SSH from the first node's container?


vilmara commented Jun 12, 2018

@alsrgv, on the secondary node, outside docker, this is what I have:

docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
horovod latest 069fc17c7694 3 hours ago 6.09GB


alsrgv commented Jun 12, 2018

@vilmara, one other way to check:

$ nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest
# inside...
$ ssh -p 50000 192.168.11.1 ibv_devinfo
$ ssh -p 50000 192.168.11.2 ibv_devinfo


vilmara commented Jun 12, 2018

@alsrgv running from the primary server inside the docker:

ssh -p 50000 192.168.11.1 ibv_devinfo
ssh: connect to host 192.168.11.1 port 50000: Connection refused

ssh -p 50000 192.168.11.2 ibv_devinfo
bash: ibv_devinfo: command not found

Also:
root@c4140v1001:/examples# ssh 192.168.11.2
Welcome to Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-127-generic x86_64)

17 packages can be updated.
0 updates are security updates.

*** System restart required ***
Last login: Tue Jun 12 18:22:23 2018 from 192.168.11.1
root@c4140v1002:~#


alsrgv commented Jun 12, 2018

@vilmara, ok, this:

ssh -p 50000 192.168.11.2 ibv_devinfo
bash: ibv_devinfo: command not found

suggests that the container on the second node is running an older version of the image, which does not have ibv_devinfo.

Can you just reboot the second server and run your commands again?


vilmara commented Jun 12, 2018

@alsrgv, here are the commands on the primary node after rebooting the secondary node:

root@c4140v1001:/examples# ssh -p 50000 192.168.11.1 ibv_devinfo
ssh: connect to host 192.168.11.1 port 50000: Connection refused

root@c4140v1001:/examples# ssh -p 50000 192.168.11.2 ibv_devinfo
Failed to get IB devices list: Function not implemented


vilmara commented Jun 12, 2018

@alsrgv, some progress: I ran the benchmark again and the output no longer shows WARN Failed to open libibverbs.so[.1], but the primary server is still hanging and showing socket connections.

Running warm up
c4140v1001:23:198 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:23:198 [0] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:23:198 [0] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:23:198 [0] INFO Using internal Network IB
c4140v1001:23:198 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:23:198 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:23:198 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0
c4140v1001:24:195 [1] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:24:195 [1] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:25:200 [2] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:25:200 [2] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:26:199 [3] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:26:199 [3] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:26:199 [3] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:26:199 [3] INFO Using internal Network IB
c4140v1001:26:199 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:24:195 [1] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:24:195 [1] INFO Using internal Network IB
c4140v1001:24:195 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:34:240 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:34:240 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:34:240 [0] INFO Using internal Network Socket
C4140-V100-2:34:240 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:36:238 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:36:238 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:36:238 [2] INFO Using internal Network Socket
C4140-V100-2:36:238 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:35:241 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:37:239 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:37:239 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:35:241 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:35:241 [1] INFO Using internal Network Socket
C4140-V100-2:37:239 [3] INFO Using internal Network Socket
C4140-V100-2:35:241 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:37:239 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:25:200 [2] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:25:200 [2] INFO Using internal Network IB
c4140v1001:25:200 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:23:198 [0] INFO comm 0x7f81c024db30 rank 0 nranks 8
c4140v1001:23:198 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:34:240 [0] INFO comm 0x7fcf04335a30 rank 4 nranks 8
C4140-V100-2:34:240 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:34:240 [0] INFO NET/Socket : 1 interfaces found
c4140v1001:25:200 [2] INFO comm 0x7f67e42ef520 rank 2 nranks 8
c4140v1001:24:195 [1] INFO comm 0x7f6b6c2e7170 rank 1 nranks 8
c4140v1001:26:199 [3] INFO comm 0x7f64282fc2a0 rank 3 nranks 8
c4140v1001:24:195 [1] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:24:195 [1] INFO NET/Socket : 1 interfaces found
c4140v1001:25:200 [2] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:25:200 [2] INFO NET/Socket : 1 interfaces found
c4140v1001:26:199 [3] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:26:199 [3] INFO NET/Socket : 1 interfaces found
c4140v1001:24:195 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
c4140v1001:25:200 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
c4140v1001:26:199 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:36:238 [2] INFO comm 0x7faa6035bec0 rank 6 nranks 8
C4140-V100-2:36:238 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:36:238 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:37:239 [3] INFO comm 0x7fd6b4468e40 rank 7 nranks 8
C4140-V100-2:37:239 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:37:239 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:35:241 [1] INFO comm 0x7f89bc2f6c30 rank 5 nranks 8
C4140-V100-2:35:241 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:35:241 [1] INFO NET/Socket : 1 interfaces found
c4140v1001:23:198 [0] INFO Using 256 threads
c4140v1001:23:198 [0] INFO Min Comp Cap 7
c4140v1001:23:198 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
c4140v1001:23:198 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
c4140v1001:25:200 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
c4140v1001:24:195 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
c4140v1001:23:198 [0] INFO 7 -> 0 via NET/IB/0
c4140v1001:23:198 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-2:35:241 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-2:36:238 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:34:240 [0] INFO 3 -> 4 via NET/Socket/0
C4140-V100-2:34:240 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
c4140v1001:26:199 [3] INFO NET/IB: Dev 0 Port 1 qpn 252 mtu 5 LID 1


alsrgv commented Jun 12, 2018

@vilmara,

root@c4140v1001:/examples# ssh -p 50000 192.168.11.2 ibv_devinfo
Failed to get IB devices list: Function not implemented

Can you check that MLNX_OFED is installed on the second host?


vilmara commented Jun 13, 2018

@alsrgv, after rebooting the second host it is not installed; I will rebuild the extended Horovod docker image again.


vilmara commented Jun 13, 2018

@alsrgv, here are the outputs on the secondary node after rebuilding the extended Horovod docker image with MLNX_OFED:

root@C4140-V100-2:/examples# ofed_info -s
MLNX_OFED_LINUX-4.3-1.0.1.0:

root@C4140-V100-2:/examples# ibv_devinfo
Failed to get IB devices list: Function not implemented


alsrgv commented Jun 13, 2018

@vilmara, please install MLNX_OFED on the second host as well, since you need the kernel driver in addition to the userland libraries that you have installed in the docker image.
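For reference, a sketch of the host-side install - the bundle name is a placeholder based on the MLNX_OFED version you listed above, and the exact steps may differ for your OS:

# On the second host itself, not inside the container:
$ tar -xzf MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz
$ cd MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64
$ sudo ./mlnxofedinstall              # installs userland libraries and the kernel modules
$ sudo /etc/init.d/openibd restart    # reload the InfiniBand kernel modules
$ sudo service nv_peer_mem restart    # reload the GPUDirect peer-memory module after the driver reload
$ ibv_devinfo                         # should now report mlx5_0 with state PORT_ACTIVE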


vilmara commented Jun 13, 2018

Hi @alsrgv, I have reinstalled MLNX_OFED on the second host and ran the test below again.

In the primary node:
nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest

#inside the docker:
root@C4140-V100-1:/examples# ssh -p 50000 192.168.11.1 ibv_devinfo
ssh: connect to host 192.168.11.1 port 50000: Connection refused

root@C4140-V100-1:/examples# ssh -p 50000 192.168.11.2 ibv_devinfo
ssh: connect to host 192.168.11.2 port 50000: Connection refused


vilmara commented Jun 13, 2018

@alsrgv, I was also able to run the benchmark again, with the performance and output below:

Within a node (single-node mode with 4 GPUs):
1 GPU: ~641 img/sec
4 GPUs: ~595 img/sec per GPU
total img/sec: ~2380
92.80% scaling efficiency within the node

Multi-node mode (2 nodes with 4 GPUs each):
~552.4 img/sec per GPU
total img/sec: ~4419.71
86.19% scaling efficiency across the nodes

Outputs:
C4140-V100-1:340762:340937 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340762:340937 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340762:340937 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340762:340937 [0] INFO Using internal Network IB
C4140-V100-1:340762:340937 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340762:340937 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340762:340937 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0
C4140-V100-1:340765:340939 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340765:340939 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340764:340934 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340764:340934 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340763:340938 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340763:340938 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239186:239393 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239186:239393 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239189:239399 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239189:239399 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239188:239390 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239188:239390 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239187:239394 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239187:239394 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340765:340939 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340765:340939 [3] INFO Using internal Network IB
C4140-V100-1:340765:340939 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340764:340934 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340764:340934 [2] INFO Using internal Network IB
C4140-V100-1:340764:340934 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340763:340938 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340763:340938 [1] INFO Using internal Network IB
C4140-V100-1:340763:340938 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239189:239399 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239189:239399 [3] INFO Using internal Network IB
C4140-V100-2:239189:239399 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239187:239394 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239187:239394 [1] INFO Using internal Network IB
C4140-V100-2:239187:239394 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239186:239393 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239186:239393 [0] INFO Using internal Network IB
C4140-V100-2:239186:239393 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239188:239390 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239188:239390 [2] INFO Using internal Network IB
C4140-V100-2:239188:239390 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340762:340937 [0] INFO comm 0x7fe7d42ca870 rank 0 nranks 8
C4140-V100-1:340765:340939 [3] INFO comm 0x7fc4783263a0 rank 3 nranks 8
C4140-V100-1:340763:340938 [1] INFO comm 0x7fe028299500 rank 1 nranks 8
C4140-V100-1:340762:340937 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340763:340938 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340763:340938 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:340764:340934 [2] INFO comm 0x7f6f042aa040 rank 2 nranks 8
C4140-V100-2:239189:239399 [3] INFO comm 0x7f47f02ea120 rank 7 nranks 8
C4140-V100-1:340764:340934 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340764:340934 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:340763:340938 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340764:340934 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340765:340939 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340765:340939 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239189:239399 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239189:239399 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-1:340765:340939 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239189:239399 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239187:239394 [1] INFO comm 0x7f9de4304d90 rank 5 nranks 8
C4140-V100-2:239187:239394 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239187:239394 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239186:239393 [0] INFO comm 0x7fd1d833e740 rank 4 nranks 8
C4140-V100-2:239188:239390 [2] INFO comm 0x7fdc20304440 rank 6 nranks 8
C4140-V100-2:239187:239394 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239188:239390 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239188:239390 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239186:239393 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239186:239393 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239186:239393 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239188:239390 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340762:340937 [0] INFO Using 256 threads
C4140-V100-1:340762:340937 [0] INFO Min Comp Cap 7
C4140-V100-1:340762:340937 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:340762:340937 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
C4140-V100-2:239188:239390 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:239187:239394 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-1:340763:340938 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
C4140-V100-1:340764:340934 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
C4140-V100-2:239186:239393 [0] INFO 3 -> 4 via NET/IB/0
C4140-V100-2:239186:239393 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
C4140-V100-1:340762:340937 [0] INFO 7 -> 0 via NET/IB/0
C4140-V100-1:340762:340937 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-1:340765:340939 [3] INFO NET/IB: Dev 0 Port 1 qpn 243 mtu 5 LID 1
C4140-V100-2:239189:239399 [3] INFO NET/IB: Dev 0 Port 1 qpn 274 mtu 5 LID 2
C4140-V100-1:340762:340937 [0] INFO Launch mode Parallel


alsrgv commented Jun 13, 2018

@vilmara, great! I think installing the MLNX_OFED drivers on the second node resolved the issue. Right now you're using RDMA, but not GPUDirect.

Can you try a couple of things:

  1. Make sure nv_peer_mem driver is installed on the second host, too.
  2. Show output of nvidia-smi topo -m
  3. Try running the test with -x NCCL_IB_CUDA_SUPPORT=1 in the mpirun command, as sketched below.
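For (3), a sketch based on the mpirun command you posted earlier - the only change is the extra -x NCCL_IB_CUDA_SUPPORT=1, and some of the training flags are trimmed here for brevity:

mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
  -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO \
  --bind-to none --mca plm_rsh_args "-p 50000" \
  python tf_cnn_benchmarks.py \
    --variable_update=horovod --horovod_device=gpu \
    --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
    --model=resnet50 --batch_size=128 --device=gpu --num_epochs=1 \
    --use_fp16=True --datasets_num_private_threads=4 --display_every=1000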


vilmara commented Jun 13, 2018

@alsrgv here is the nvidia-smi topo -m

    GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity

GPU0 X NV2 NV2 NV2 SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU1 NV2 X NV2 NV2 SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU2 NV2 NV2 X NV2 SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU3 NV2 NV2 NV2 X SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
mlx5_0 SYS SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

Regarding the NCCL flag, should I delete -x NCCL_IB_DISABLE=0 from my configuration and add -x NCCL_IB_CUDA_SUPPORT=1, or should I keep both?


alsrgv commented Jun 13, 2018

@vilmara, you should keep both. I'm not sure if GPUDirect actually works across NUMA nodes. It would be far better if the Mellanox NIC were attached to the PCIe root complex that has the NVLink mesh. Your hardware team or Dell contacts should be able to help with that.


vilmara commented Jun 13, 2018

@alsrgv, here is the performance with -x NCCL_IB_CUDA_SUPPORT=1; it looks like it improved a little over plain RDMA:
~560.5 img/sec per GPU
total img/sec: ~4483
~87% scaling efficiency with multinode

And here is some of the NCCL INFO output:
C4140-V100-1:366451:366626 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366451:366626 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366451:366626 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366451:366626 [0] INFO Using internal Network IB
C4140-V100-1:366451:366626 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:366451:366626 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366451:366626 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0
C4140-V100-1:366453:366623 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366453:366623 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366454:366627 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366454:366627 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366452:366632 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366452:366632 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370364:370568 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370364:370568 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370365:370567 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370365:370567 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370363:370569 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370363:370569 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370362:370566 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370362:370566 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366454:366627 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366454:366627 [3] INFO Using internal Network IB
C4140-V100-1:366454:366627 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:366452:366632 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366452:366632 [1] INFO Using internal Network IB
C4140-V100-1:366452:366632 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:366453:366623 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366453:366623 [2] INFO Using internal Network IB
C4140-V100-1:366453:366623 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370364:370568 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370364:370568 [2] INFO Using internal Network IB
C4140-V100-2:370365:370567 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370364:370568 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370365:370567 [3] INFO Using internal Network IB
C4140-V100-2:370365:370567 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370363:370569 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370363:370569 [1] INFO Using internal Network IB
C4140-V100-2:370363:370569 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370362:370566 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370362:370566 [0] INFO Using internal Network IB
C4140-V100-2:370362:370566 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370362:370566 [0] INFO comm 0x7f3014327ab0 rank 4 nranks 8
C4140-V100-2:370362:370566 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370362:370566 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370362:370566 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366451:366626 [0] INFO comm 0x7fa5f026bd30 rank 0 nranks 8
C4140-V100-1:366451:366626 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366452:366632 [1] INFO comm 0x7f516c296610 rank 1 nranks 8
C4140-V100-1:366453:366623 [2] INFO comm 0x7f49b42c1940 rank 2 nranks 8
C4140-V100-1:366454:366627 [3] INFO comm 0x7fde5c2eaa00 rank 3 nranks 8
C4140-V100-1:366453:366623 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366453:366623 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:366452:366632 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366452:366632 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:366454:366627 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366454:366627 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370363:370569 [1] INFO comm 0x7fbe742eee00 rank 5 nranks 8
C4140-V100-1:366453:366623 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370365:370567 [3] INFO comm 0x7efc24339990 rank 7 nranks 8
C4140-V100-2:370364:370568 [2] INFO comm 0x7f13342caef0 rank 6 nranks 8
C4140-V100-1:366452:366632 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366454:366627 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370365:370567 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370365:370567 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370363:370569 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370363:370569 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370364:370568 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370364:370568 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370365:370567 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370363:370569 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370364:370568 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366451:366626 [0] INFO Using 256 threads
C4140-V100-1:366451:366626 [0] INFO Min Comp Cap 7
C4140-V100-1:366451:366626 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:366451:366626 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
C4140-V100-2:370363:370569 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-2:370364:370568 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:370362:370566 [0] INFO 3 -> 4 via NET/IB/0/GDRDMA
C4140-V100-2:370362:370566 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
C4140-V100-1:366453:366623 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
C4140-V100-1:366452:366632 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
C4140-V100-1:366451:366626 [0] INFO 7 -> 0 via NET/IB/0/GDRDMA
C4140-V100-1:366451:366626 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-1:366454:366627 [3] INFO NET/IB: Dev 0 Port 1 qpn 405 mtu 5 LID 1
C4140-V100-2:370365:370567 [3] INFO NET/IB: Dev 0 Port 1 qpn 436 mtu 5 LID 2
C4140-V100-1:366451:366626 [0] INFO Launch mode Parallel

Regarding your recommendation, is it more of a hardware improvement?


alsrgv commented Jun 13, 2018

@vilmara, oh - it worked, great! I think you may get slightly better performance with the hardware change I mentioned, since NVLink <-> PCIe <-> Mellanox performance may be quite a bit better than NVLink <-> CPU <-> Mellanox.


vilmara commented Jun 13, 2018

@alsrgv, thanks so much for your help. It was great boosting the scaling efficiency from 77% to 87% on the multi-node system with your recommendations.


vilmara commented Nov 15, 2019

Hi @alsrgv. I am running the TF benchmarks in multi-node mode with the latest version of Horovod (0.18.2), MLNX_OFED_LINUX-4.7-1.0.0.1, and GPUDirect RDMA (nvidia-peer-memory_1.0-8). I am following the instructions above, but I am not seeing the NET/IB/0/GDRDMA connection in the output. Do you have any comments or recommendations on what else I am missing to activate GPUDirect RDMA with the new software stack?

Current throughput:
1 GPU within a node: ~802 img/sec
per GPU across the nodes: ~760 img/sec
8 GPUs total: ~6071 img/sec

Tracelog
master_node:20:289 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:20:289 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:20:289 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:20:289 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
NCCL version 2.4.7+cuda10.0
master_node:22:295 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:22:295 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:21:290 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:23:288 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:22:295 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:21:290 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:23:288 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:43:309 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:44:311 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:41:312 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:42:310 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:41:312 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:22:295 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
secondary_node:43:309 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:44:311 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
master_node:20:289 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
master_node:23:288 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
master_node:21:290 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
master_node:22:295 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:44:311 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:43:309 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:41:312 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
secondary_node:42:310 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
secondary_node:41:312 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
secondary_node:44:311 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
secondary_node:42:310 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
secondary_node:43:309 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:22:295 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:23:288 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
master_node:21:290 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO Channel 00 : 0 1 3 6 4 5 7 2
master_node:20:289 [0] NCCL INFO Channel 01 : 0 1 3 6 4 5 7 2
master_node:22:295 [2] NCCL INFO Ring 00 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 3 -> 6 [receive] via NET/IB/0
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 3[3] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 00 : 3 -> 6 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 00 : 3[3] -> 1[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 2[2] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 3[3] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 01 : 3 -> 6 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7 -> 2 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 6 -> 2 [receive] via NET/IB/0
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6 -> 2 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 6[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 2 -> 6 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2 -> 6 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 3 -> 6 [receive] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 01 : 3[3] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Trees [0] 0->1->3/-1/-1 [1] 0->1->3/-1/-1
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7 -> 2 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Trees [0] 1->3->-1/-1/-1 [1] 1->3->-1/-1/-1
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 2[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO comm 0x7f4d6839f060 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
master_node:23:288 [3] NCCL INFO comm 0x7f48503a3650 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Trees [0] 2->0->1/-1/-1 [1] 2->0->1/-1/-1
master_node:20:289 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 5[1] via P2P/IPC
master_node:20:289 [0] NCCL INFO comm 0x7f5450362840 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Ring 01 : 2 -> 6 [send] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 2 -> 6 [receive] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Trees [0] 5->7->-1/-1/-1 [1] 5->7->-1/-1/-1
master_node:22:295 [2] NCCL INFO Ring 01 : 6 -> 2 [receive] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 6[2] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO comm 0x7ff2c43f7c00 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
secondary_node:42:310 [1] NCCL INFO Trees [0] 4->5->7/-1/-1 [1] 4->5->7/-1/-1
secondary_node:41:312 [0] NCCL INFO Trees [0] 6->4->5/-1/-1 [1] 6->4->5/-1/-1
secondary_node:41:312 [0] NCCL INFO comm 0x7fd8dc3c6740 rank 4 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6 -> 2 [send] via NET/IB/0
secondary_node:43:309 [2] NCCL INFO Trees [0] 2->6->4/-1/-1 [1] -1->6->4/2/-1
secondary_node:42:310 [1] NCCL INFO comm 0x7fa7cc422c90 rank 5 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO comm 0x7fce9c438c90 rank 6 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Trees [0] -1->2->0/6/-1 [1] 6->2->0/-1/-1
master_node:22:295 [2] NCCL INFO comm 0x7fd8f038f460 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Launch mode Parallel


vilmara commented Jan 15, 2020

solved, see #1523

vilmara closed this as completed Jan 15, 2020
@ajtarraga

Hi, I have a problem while I am working with GPUDirect RDMA.

I have the following scenario:

  • 8 nodes with 1 GPU Tesla T4 each
  • GPUDirect RDMA enabled on them

However, when I compare the performance on 2 nodes with and without GPUDirect RDMA, I get the following results:

  • via NET/IB/0 ~820 images/sec
  • via NET/IB/0/GDRDMA ~808 images/sec

Horovod is showing better results when GPUDirect RDMA is disabled. Can someone explain to me why Horovod works better without GPUDirect RDMA? I don't understand it, because I get 2.7x the bandwidth between GPUs when GPUDirect RDMA is enabled.
