Performance tuning parameters for Horovod-TensorFlow benchmarks #288

Closed
vilmara opened this issue Jun 2, 2018 · 50 comments


vilmara commented Jun 2, 2018

Hi all, can you recommend a set of tuning parameters to get the highest performance (throughput in images/sec) using the Horovod-TensorFlow benchmarks? I got the result below and was wondering if there is room for more improvement:

mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 --mca plm_rsh_args "-p 50000" -x python tf_cnn_benchmarks.py --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --model=resnet50 --batch_size=128 --device=gpu --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --display_every=1000 --use_fp16=False --local_parameter_device=gpu --variable_update=horovod

total images/sec: 2047.95


alsrgv commented Jun 2, 2018

@vilmara, these look reasonable. What GPUs are you using for this test (P100? V100?), and what number of images/sec do you get on a single GPU?

alsrgv added the question label Jun 2, 2018

vilmara commented Jun 2, 2018

Hi @alsrgv, the system has 2 nodes with 4 V100 GPUs each; a single GPU produces around 255 images/sec.

I have run the benchmark with parameter tuning on a single node (without Horovod) and got even higher throughput. I tried these parameters with Horovod but got errors (except --variable_update, which I kept set to horovod).

Here is the result for a single node with the regular TF benchmark, without Horovod.
With tuning parameters: --batch_size=128 --use_fp16=True --local_parameter_device=gpu --variable_update=replicated --all_reduce_spec=nccl
python3 tf_cnn_benchmarks.py --device=gpu --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --model=resnet50 --num_gpus=4 --batch_size=128 --use_fp16=True --local_parameter_device=gpu --variable_update=replicated --all_reduce_spec=nccl

Result: 2341.4 images/sec


alsrgv commented Jun 2, 2018

@vilmara, can you try using --use_fp16=True for horovod as well?


vilmara commented Jun 2, 2018

@alsrgv, with --use_fp16=True it gave around 430 images/sec per GPU. Can you suggest what other parameters I can combine with --variable_update=horovod to get the highest throughput?


alsrgv commented Jun 3, 2018

@vilmara, the best performance I got with that combination (ResNet-50, batch size 64, FP16, V100 with NVLink, real data) is ~616 img/sec per GPU. If you have NVLink, you should get ~80-85% scaling efficiency within the node - so ~2000-2100 img/sec on 4 GPUs. You may get slightly better numbers since you're using a slightly higher batch size. Scaling efficiency between nodes will depend on your network. It looks like you have InfiniBand, which is great! What is the speed of your NIC? Do you have NVLink or PCIe?

Here's the command that I used to get a good performance:

$ mpirun -np 8 \
  -H 10.128.0.3:4,10.128.0.2:4 \
  -mca btl_tcp_if_exclude lo,docker0 \
  -mca pml ob1 \
  -mca btl ^openib \
  -bind-to none \
  -map-by slot \
  -x NCCL_SOCKET_IFNAME=^docker0 \
  -x NCCL_DEBUG=INFO \
  -x LD_LIBRARY_PATH \
  -x PATH \
  python tf_cnn_benchmarks.py \
    --model resnet50 \
    --batch_size 64 \
    --num_batches 1000 \
    --use_fp16=True \
    --variable_update horovod \
    --horovod_device gpu \
    --data_dir ~/tf_train \
    --data_name imagenet \
    --datasets_num_private_threads 4

Please note --datasets_num_private_threads 4. It really helps to balance the number of threads TensorFlow creates for preprocessing. You may want to tune it for your CPU.
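For example, a quick sweep over a few thread counts (a sketch only - the thread values are arbitrary, and some flags from the full command above are trimmed here for brevity):

$ for threads in 2 4 8; do
    mpirun -np 8 -H 10.128.0.3:4,10.128.0.2:4 \
      -bind-to none -map-by slot \
      -x NCCL_SOCKET_IFNAME=^docker0 -x LD_LIBRARY_PATH -x PATH \
      python tf_cnn_benchmarks.py \
        --model resnet50 --batch_size 64 --num_batches 200 \
        --use_fp16=True --variable_update horovod --horovod_device gpu \
        --data_dir ~/tf_train --data_name imagenet \
        --datasets_num_private_threads $threads
  done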


vilmara commented Jun 5, 2018

Hi @alsrgv, I have InfiniBand (the NIC speed is 100 Gb/s) and NVLink. I ran the test adding the flag --datasets_num_private_threads=4 and it boosted throughput to ~500 img/sec per GPU, so ~4000 img/sec on 8 GPUs; however, I still don't reach the performance you mentioned.

Here is the command I used:
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 --mca plm_rsh_args "-p 50000" python tf_cnn_benchmarks.py --variable_update=horovod --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --train_dir=/benchmarks/TrainAccuracyLog/ --model=resnet50 --batch_size=128 --device=gpu --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --use_fp16=True --local_parameter_device=gpu --horovod_device=gpu --datasets_num_private_threads=4 --display_every=1000


alsrgv commented Jun 5, 2018

@vilmara, great!

Sorry - I made a mistake in the expected number that I published. With batch size 64, ~2000-2100 img/sec on 4 GPUs is expected within a node.

I don't have numbers for batch size 128 - what performance are you getting within a node in your setup?


vilmara commented Jun 5, 2018

Hi @alsrgv, I am getting ~500 img/sec per GPU, so ~4000 img/sec on 8 GPUs with batch size 128.


alsrgv commented Jun 5, 2018

@vilmara, do you also get ~500 img/sec per GPU with 4 GPUs within a node?


vilmara commented Jun 6, 2018

Hi @alsrgv, sorry, got it; here are the results within a node:
with 1 GPU: ~644 img/sec
with 4 GPUs: ~595 img/sec per GPU, ~2381 img/sec total


alsrgv commented Jun 6, 2018

@vilmara, that's great, you get 92% scaling efficiency within a server. The fact that it goes down to 77% as you cross over the network is a bit underwhelming though. Can you run your test with -x NCCL_DEBUG=INFO and paste the output of NCCL lines? I'm wondering if GPUDirect is enabled and whether it would help.


vilmara commented Jun 7, 2018

Hi @alsrgv, here are the NCCL lines from the output; C4140-V100-1 is the primary host and C4140-V100-2 is the secondary host:

C4140-V100-1:24:201 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:24:201 [0] INFO Using internal Network Socket
C4140-V100-1:24:201 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:24:201 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:24:201 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0

C4140-V100-1:27:199 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:27:199 [3] INFO Using internal Network Socket
C4140-V100-1:27:199 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-1:26:196 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:26:196 [2] INFO Using internal Network Socket
C4140-V100-1:26:196 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-1:25:200 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:25:200 [1] INFO Using internal Network Socket
C4140-V100-1:25:200 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:24:229 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:24:229 [0] INFO Using internal Network Socket
C4140-V100-2:24:229 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:26:231 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:26:231 [2] INFO Using internal Network Socket
C4140-V100-2:26:231 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:25:228 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:25:228 [1] INFO Using internal Network Socket
C4140-V100-2:25:228 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:27:230 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:27:230 [3] INFO Using internal Network Socket
C4140-V100-2:27:230 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:24:229 [0] INFO comm 0x7fadd4312ea0 rank 4 nranks 8
C4140-V100-2:24:229 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:24:229 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-2:27:230 [3] INFO comm 0x7f953430fb40 rank 7 nranks 8
C4140-V100-2:25:228 [1] INFO comm 0x7f75d42dd280 rank 5 nranks 8
C4140-V100-2:26:231 [2] INFO comm 0x7f061c30d3d0 rank 6 nranks 8
C4140-V100-2:27:230 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:27:230 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:25:228 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:25:228 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:26:231 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:26:231 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:25:200 [1] INFO comm 0x7f775c298f10 rank 1 nranks 8
C4140-V100-1:24:201 [0] INFO comm 0x7f7734244ae0 rank 0 nranks 8
C4140-V100-1:25:200 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:25:200 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:26:196 [2] INFO comm 0x7f719c2a1740 rank 2 nranks 8
C4140-V100-1:27:199 [3] INFO comm 0x7f5ae82c0c40 rank 3 nranks 8
C4140-V100-1:26:196 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:26:196 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:27:199 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:27:199 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-1:24:201 [0] INFO Using 256 threads
C4140-V100-1:24:201 [0] INFO Min Comp Cap 7
C4140-V100-1:24:201 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:24:201 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
C4140-V100-2:26:231 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:25:228 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-2:24:229 [0] INFO 3 -> 4 via NET/Socket/0
C4140-V100-2:24:229 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
C4140-V100-1:25:200 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
C4140-V100-1:26:196 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
C4140-V100-1:24:201 [0] INFO 7 -> 0 via NET/Socket/0
C4140-V100-1:24:201 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-1:24:201 [0] INFO Launch mode Parallel

I have tried some of the tuning parameters indicated in this guide without success; could you please recommend some? http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf


alsrgv commented Jun 7, 2018

@vilmara, good news - there is room for improvement on the software side :-)

C4140-V100-1:27:199 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1] indicates that Mellanox drivers are not installed.

C4140-V100-1:24:201 [0] INFO 7 -> 0 via NET/Socket/0 indicates that communication is happening over sockets instead of IB.

Hopefully, these two pointers will help:

  1. https://community.mellanox.com/docs/DOC-2688
  2. http://www.mellanox.com/page/products_dyn?product_family=116


vilmara commented Jun 8, 2018

Hi @alsrgv, I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server. I have also checked its status to make sure GPUDirect RDMA is active, and realized it is not recognized inside the Horovod docker container; see below:

Outside the docker:
service nv_peer_mem status
Output
● nv_peer_mem.service - LSB: Activates/Deactivates nv_peer_mem to \ start at boot time.
Loaded: loaded (/etc/init.d/nv_peer_mem; bad; vendor preset: enabled)
Active: active (exited) since Thu 2018-06-07 16:02:45 CDT; 16h ago
Docs: man:systemd-sysv-generator(8)
Process: 303965 ExecStart=/etc/init.d/nv_peer_mem start (code=exited, status=0/SUCCESS)
Tasks: 0
Memory: 0B
CPU: 0

Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Jun 07 16:02:45 C4140-V100-1 nv_peer_mem[303965]: starting... OK

Inside the docker:
service nv_peer_mem status
Output:
nv_peer_mem: unrecognized service


alsrgv commented Jun 8, 2018

@vilmara, you need two more things:

  1. Extend the Dockerfile to install the Mellanox drivers - the same version as the one running on the host.
  2. Plumb through the appropriate devices. The simplest way is to run Docker in --privileged mode (a narrower alternative is sketched below).

Mellanox has much more detail in this document: https://community.mellanox.com/docs/DOC-3014#jive_content_id_Create_or_pull_a_base_image_and_run_Container, but I think you can get away with just doing (1) and (2).
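A sketch of what (2) could look like without full --privileged mode - the device path and capability below are the ones commonly used for Mellanox InfiniBand, but treat them as assumptions and verify them against your setup:

$ nvidia-docker run -it --network=host \
    --device=/dev/infiniband \
    --cap-add=IPC_LOCK \
    -v /root/.ssh:/root/.ssh \
    horovod:latest

Running with --privileged instead, as suggested above, is the simpler option.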


vilmara commented Jun 8, 2018

@alsrgv how can I extend the Dockerfile to install Mellanox drivers?


alsrgv commented Jun 8, 2018

@vilmara, you can either edit the Horovod Dockerfile, or make your own Dockerfile that starts with FROM uber/horovod:0.13.4-tf1.8.0-torch0.4.0-py2.7 (or another version you prefer) and install the drivers there.

This guide has a good introduction to Docker: https://docs.docker.com/get-started/


vilmara commented Jun 8, 2018

@alsrgv, when using the command for option 1 it throws this error: Error response from daemon: pull access denied for mellanox/hpcx-2-0, repository does not exist or may require 'docker login'. I logged into Docker but am still getting the same error.


alsrgv commented Jun 8, 2018

@vilmara, I found this image - https://hub.docker.com/r/mellanox/hpcx20_docker/, but I haven't personally tried it.


alsrgv commented Jun 8, 2018

I don't think their docker image would contain CUDA / NCCL / TF though. So it may be easier to start with the Horovod docker image and just install MLNX_OFED using their installation script.
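As a rough sketch of that approach - the base image tag, MLNX_OFED bundle name, and installer flags below are assumptions, so match them to the driver version and OS on your hosts:

FROM uber/horovod:0.13.4-tf1.8.0-torch0.4.0-py2.7

# MLNX_OFED bundle downloaded from Mellanox; must match the host driver version.
COPY MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz /tmp/
RUN cd /tmp && \
    tar -xzf MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz && \
    MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/mlnxofedinstall \
        --user-space-only --without-fw-update --force && \
    rm -rf /tmp/MLNX_OFED_LINUX-*

Build the image on both hosts (e.g. docker build -t horovod:latest .) so both containers carry the same userland libraries; the kernel modules still come from the host install.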


vilmara commented Jun 11, 2018

Thanks @alsrgv, I have some extra questions:

1. If I select option 3, editing the Horovod Dockerfile and adding MLNX_OFED using their installation script, I won't need to use options 1 or 2, right?

2. When editing the Horovod Dockerfile, do I need to add the GPUDirect RDMA API installation? I guess yes.

3. What flags should be used with docker run and mpirun to indicate we are using GPUDirect?


vilmara commented Jun 12, 2018

Option 3, editing the Horovod Dockerfile:

@alsrgv, I was able to extend the Dockerfile to install the Mellanox driver (only MLNX_OFED, since GPUDirect RDMA was throwing errors), so I installed the GPUDirect RDMA API outside of the docker container and activated it before running the benchmarks.

Now, when running the benchmarks, here is what I got:
Output:
C4140-V100-1:25:200 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:25:200 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:25:200 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:25:200 [0] INFO Using internal Network IB
C4140-V100-1:25:200 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:25:200 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:25:200 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0

C4140-V100-2:87186:87393 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87186:87393 [3] INFO Using internal Network Socket
C4140-V100-2:87186:87393 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:87183:87387 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87183:87387 [0] INFO Using internal Network Socket
C4140-V100-2:87183:87387 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:87185:87396 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87185:87396 [2] INFO Using internal Network Socket
C4140-V100-2:87185:87396 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384

C4140-V100-2:87184:87390 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87184:87390 [1] INFO Using internal Network Socket
C4140-V100-2:87184:87390 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:27:197 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:27:197 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:28:203 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:28:203 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:26:204 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:26:204 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:26:204 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:26:204 [1] INFO Using internal Network IB
C4140-V100-1:26:204 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:27:197 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:27:197 [2] INFO Using internal Network IB
C4140-V100-1:27:197 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:28:203 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:28:203 [3] INFO Using internal Network IB
C4140-V100-1:28:203 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:25:200 [0] INFO comm 0x7fce942dd730 rank 0 nranks 8
C4140-V100-1:25:200 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:27:197 [2] INFO comm 0x7fcd402af340 rank 4 nranks 8
C4140-V100-1:28:203 [3] INFO comm 0x7fea482e2420 rank 6 nranks 8
C4140-V100-1:26:204 [1] INFO comm 0x7fe3fc311fb0 rank 2 nranks 8
C4140-V100-1:27:197 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:27:197 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:26:204 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:26:204 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:28:203 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:28:203 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-1:27:197 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:28:203 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:26:204 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:87185:87396 [2] INFO comm 0x7f99382b3e70 rank 5 nranks 8
C4140-V100-2:87184:87390 [1] INFO comm 0x7fda8428e990 rank 3 nranks 8
C4140-V100-2:87186:87393 [3] INFO comm 0x7fa32c2d3260 rank 7 nranks 8
C4140-V100-2:87185:87396 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87185:87396 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:87184:87390 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87184:87390 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:87186:87393 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87186:87393 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:87183:87387 [0] INFO comm 0x7f935835b090 rank 1 nranks 8
C4140-V100-2:87183:87387 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:87183:87387 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-1:25:200 [0] INFO Using 256 threads
C4140-V100-1:25:200 [0] INFO Min Comp Cap 7
C4140-V100-1:25:200 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:25:200 [0] INFO Ring 00 : 0 2 4 6 1 3 5 7
C4140-V100-2:87184:87390 [1] INFO Ring 00 : 3[1] -> 5[2] via P2P/IPC
C4140-V100-2:87185:87396 [2] INFO Ring 00 : 5[2] -> 7[3] via P2P/IPC
C4140-V100-1:27:197 [2] INFO Ring 00 : 4[2] -> 6[3] via P2P/IPC
C4140-V100-1:26:204 [1] INFO Ring 00 : 2[1] -> 4[2] via P2P/IPC
C4140-V100-2:87183:87387 [0] INFO 6 -> 1 via NET/Socket/0
C4140-V100-2:87183:87387 [0] INFO Ring 00 : 1[0] -> 3[1] via P2P/IPC
C4140-V100-1:25:200 [0] INFO 7 -> 0 via NET/IB/0
C4140-V100-1:25:200 [0] INFO Ring 00 : 0[0] -> 2[1] via P2P/IPC
C4140-V100-1:28:203 [3] INFO NET/IB: Dev 0 Port 1 qpn 238 mtu 5 LID 1

The primary server hangs at this point. I see it is still showing some network connections over sockets.


alsrgv commented Jun 12, 2018

@vilmara, are you able to run ibv_devinfo inside the docker image? Are you running the container in privileged mode? I believe you should not need to install GPUDirect inside the container.


vilmara commented Jun 12, 2018

Hi @alsrgv, yes, I am able to run ibv_devinfo inside the docker image on both servers. Yes, I am running the container in privileged mode. I tried to install GPUDirect inside the container but it didn't work; also, the Mellanox folks told me that "regarding nv_peer_mem, it is only required to be loaded once, on the host. You don't need to load it inside a container too" Mellanox/nv_peer_memory#41 (comment).

Before installing MLNX_OFED inside the docker image, the network connection was only over sockets; now it is mixed, and the system hangs.


alsrgv commented Jun 12, 2018

@vilmara, gotcha. Do all these things apply to the second node as well, i.e. ibv_devinfo works in the container, it's running in privileged mode, etc?


vilmara commented Jun 12, 2018

@alsrgv, here are my commands on both servers:

Primary server:
nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest

cd /benchmarks/scripts/tf_cnn_benchmarks

mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO --bind-to none --mca plm_rsh_args "-p 50000" python tf_cnn_benchmarks.py --variable_update=horovod --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --train_dir=/benchmarks/TrainAccuracyLog/ --model=resnet50 --batch_size=128 --device=gpu --num_epochs=1 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --use_fp16=True --local_parameter_device=cpu --horovod_device=gpu --datasets_num_private_threads=4 --display_every=1000

Secondary server:
nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest bash -c "/usr/sbin/sshd -p 50000; sleep infinity"


alsrgv commented Jun 12, 2018

@vilmara, is horovod:latest on the second server updated to have MLNX_OFED as well? Can you try running ibv_devinfo in the container on the second server?


vilmara commented Jun 12, 2018

@alsrgv, here are the outputs from both servers inside the docker container:

Primary server:
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.17.2052
node_guid: e41d:2d03:0062:1256
sys_image_guid: e41d:2d03:0062:1256
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: DEL2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 2
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand

Secondary server:
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.20.1820
node_guid: 7cfe:9003:0028:cec6
sys_image_guid: 7cfe:9003:0028:cec6
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: DEL2180110032
phys_port_cnt: 1
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 2
port_lid: 2
port_lmc: 0x00
link_layer: InfiniBand


alsrgv commented Jun 12, 2018

@vilmara, it's extremely strange that you have both:

C4140-V100-2:87186:87393 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]

and ibv_devinfo working correctly.

Is it possible that you still have a container running an older version of the docker image on the second server, and that's what gets accessed over SSH from the first node's container?


vilmara commented Jun 12, 2018

@alsrgv, on the secondary node, outside docker, this is what I have:

docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
horovod latest 069fc17c7694 3 hours ago 6.09GB


alsrgv commented Jun 12, 2018

@vilmara, one other way to check:

$ nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest
# inside...
$ ssh -p 50000 192.168.11.1 ibv_devinfo
$ ssh -p 50000 192.168.11.2 ibv_devinfo


vilmara commented Jun 12, 2018

@alsrgv running from the primary server inside the docker:

ssh -p 50000 192.168.11.1 ibv_devinfo
ssh: connect to host 192.168.11.1 port 50000: Connection refused

ssh -p 50000 192.168.11.2 ibv_devinfo
bash: ibv_devinfo: command not found

Also:
root@c4140v1001:/examples# ssh 192.168.11.2
Welcome to Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-127-generic x86_64)

17 packages can be updated.
0 updates are security updates.

*** System restart required ***
Last login: Tue Jun 12 18:22:23 2018 from 192.168.11.1
root@c4140v1002:~#


alsrgv commented Jun 12, 2018

@vilmara, ok, this:

ssh -p 50000 192.168.11.2 ibv_devinfo
bash: ibv_devinfo: command not found

suggests that the container on the second node is running an older version of the image, which does not have ibv_devinfo.

Can you just reboot the second server and run your commands again?


vilmara commented Jun 12, 2018

@alsrgv, here are the commands on the primary node after rebooting the secondary node:

root@c4140v1001:/examples# ssh -p 50000 192.168.11.1 ibv_devinfo
ssh: connect to host 192.168.11.1 port 50000: Connection refused

root@c4140v1001:/examples# ssh -p 50000 192.168.11.2 ibv_devinfo
Failed to get IB devices list: Function not implemented


vilmara commented Jun 12, 2018

@alsrgv, some progress: I ran the benchmark again and the output no longer shows WARN Failed to open libibverbs.so[.1], but the primary server is still hanging and showing socket connections.

Running warm up
c4140v1001:23:198 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:23:198 [0] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:23:198 [0] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:23:198 [0] INFO Using internal Network IB
c4140v1001:23:198 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:23:198 [0] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:23:198 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0
c4140v1001:24:195 [1] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:24:195 [1] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:25:200 [2] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:25:200 [2] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:26:199 [3] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:26:199 [3] INFO NET/IB : Using interface ib0 for sideband communication
c4140v1001:26:199 [3] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:26:199 [3] INFO Using internal Network IB
c4140v1001:26:199 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:24:195 [1] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:24:195 [1] INFO Using internal Network IB
c4140v1001:24:195 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:34:240 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:34:240 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:34:240 [0] INFO Using internal Network Socket
C4140-V100-2:34:240 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:36:238 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:36:238 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:36:238 [2] INFO Using internal Network Socket
C4140-V100-2:36:238 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:35:241 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:37:239 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:37:239 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:35:241 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:35:241 [1] INFO Using internal Network Socket
C4140-V100-2:37:239 [3] INFO Using internal Network Socket
C4140-V100-2:35:241 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:37:239 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:25:200 [2] INFO NET/IB: [0] mlx5_0:1/IB
c4140v1001:25:200 [2] INFO Using internal Network IB
c4140v1001:25:200 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
c4140v1001:23:198 [0] INFO comm 0x7f81c024db30 rank 0 nranks 8
c4140v1001:23:198 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:34:240 [0] INFO comm 0x7fcf04335a30 rank 4 nranks 8
C4140-V100-2:34:240 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:34:240 [0] INFO NET/Socket : 1 interfaces found
c4140v1001:25:200 [2] INFO comm 0x7f67e42ef520 rank 2 nranks 8
c4140v1001:24:195 [1] INFO comm 0x7f6b6c2e7170 rank 1 nranks 8
c4140v1001:26:199 [3] INFO comm 0x7f64282fc2a0 rank 3 nranks 8
c4140v1001:24:195 [1] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:24:195 [1] INFO NET/Socket : 1 interfaces found
c4140v1001:25:200 [2] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:25:200 [2] INFO NET/Socket : 1 interfaces found
c4140v1001:26:199 [3] INFO NET : Using interface ib0:192.168.11.1<0>
c4140v1001:26:199 [3] INFO NET/Socket : 1 interfaces found
c4140v1001:24:195 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
c4140v1001:25:200 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
c4140v1001:26:199 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:36:238 [2] INFO comm 0x7faa6035bec0 rank 6 nranks 8
C4140-V100-2:36:238 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:36:238 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:37:239 [3] INFO comm 0x7fd6b4468e40 rank 7 nranks 8
C4140-V100-2:37:239 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:37:239 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:35:241 [1] INFO comm 0x7f89bc2f6c30 rank 5 nranks 8
C4140-V100-2:35:241 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:35:241 [1] INFO NET/Socket : 1 interfaces found
c4140v1001:23:198 [0] INFO Using 256 threads
c4140v1001:23:198 [0] INFO Min Comp Cap 7
c4140v1001:23:198 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
c4140v1001:23:198 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
c4140v1001:25:200 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
c4140v1001:24:195 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
c4140v1001:23:198 [0] INFO 7 -> 0 via NET/IB/0
c4140v1001:23:198 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-2:35:241 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-2:36:238 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:34:240 [0] INFO 3 -> 4 via NET/Socket/0
C4140-V100-2:34:240 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
c4140v1001:26:199 [3] INFO NET/IB: Dev 0 Port 1 qpn 252 mtu 5 LID 1


alsrgv commented Jun 12, 2018

@vilmara,

root@c4140v1001:/examples# ssh -p 50000 192.168.11.2 ibv_devinfo
Failed to get IB devices list: Function not implemented

Can you check that MLNX_OFED is installed on the second host?


vilmara commented Jun 13, 2018

@alsrgv, after rebooting the second host it is not installed; I will rebuild the extended Horovod docker image again.


vilmara commented Jun 13, 2018

@alsrgv, here are the outputs on the secondary node after rebuilding the extended Horovod docker image with MLNX_OFED:

root@C4140-V100-2:/examples# ofed_info -s
MLNX_OFED_LINUX-4.3-1.0.1.0:

root@C4140-V100-2:/examples# ibv_devinfo
Failed to get IB devices list: Function not implemented


alsrgv commented Jun 13, 2018

@vilmara, please install MLNX_OFED on the second host as well, since you need the kernel driver in addition to the userland libraries that you have installed in the docker image.
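For reference, a sketch of the host-side install - the bundle name is a placeholder based on the MLNX_OFED version you listed above, and the exact steps may differ for your OS:

# On the second host itself, not inside the container:
$ tar -xzf MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz
$ cd MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64
$ sudo ./mlnxofedinstall              # installs userland libraries and the kernel modules
$ sudo /etc/init.d/openibd restart    # reload the InfiniBand kernel modules
$ sudo service nv_peer_mem restart    # reload the GPUDirect peer-memory module after the driver reload
$ ibv_devinfo                         # should now report mlx5_0 with state PORT_ACTIVE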


vilmara commented Jun 13, 2018

Hi @alsrgv, I have reinstalled MLNX_OFED on the second host and ran the test below again.

In the primary node:
nvidia-docker run -it --network=host --runtime=nvidia -v /root/.ssh:/root/.ssh -v /home/data/:/data/ -v /home/benchmarks/:/benchmarks -v /etc/localtime:/etc/localtime:ro --privileged horovod:latest

#inside the docker:
root@C4140-V100-1:/examples# ssh -p 50000 192.168.11.1 ibv_devinfo
ssh: connect to host 192.168.11.1 port 50000: Connection refused

root@C4140-V100-1:/examples# ssh -p 50000 192.168.11.2 ibv_devinfo
ssh: connect to host 192.168.11.2 port 50000: Connection refused


vilmara commented Jun 13, 2018

@alsrgv, I was also able to run the benchmark again, with the performance and output below:

Within a node (single-node mode with 4 GPUs):
1 GPU: ~641 img/sec
4 GPUs: ~595 img/sec per GPU
total img/sec: ~2380
92.80% scaling efficiency within the node

Multi-node mode (2 nodes with 4 GPUs each):
~552.4 img/sec per GPU
total img/sec: ~4419.71
86.19% scaling efficiency across the nodes

Outputs:
C4140-V100-1:340762:340937 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340762:340937 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340762:340937 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340762:340937 [0] INFO Using internal Network IB
C4140-V100-1:340762:340937 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340762:340937 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340762:340937 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0
C4140-V100-1:340765:340939 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340765:340939 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340764:340934 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340764:340934 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340763:340938 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340763:340938 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239186:239393 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239186:239393 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239189:239399 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239189:239399 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239188:239390 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239188:239390 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:239187:239394 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239187:239394 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:340765:340939 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340765:340939 [3] INFO Using internal Network IB
C4140-V100-1:340765:340939 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340764:340934 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340764:340934 [2] INFO Using internal Network IB
C4140-V100-1:340764:340934 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340763:340938 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:340763:340938 [1] INFO Using internal Network IB
C4140-V100-1:340763:340938 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239189:239399 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239189:239399 [3] INFO Using internal Network IB
C4140-V100-2:239189:239399 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239187:239394 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239187:239394 [1] INFO Using internal Network IB
C4140-V100-2:239187:239394 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239186:239393 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239186:239393 [0] INFO Using internal Network IB
C4140-V100-2:239186:239393 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:239188:239390 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:239188:239390 [2] INFO Using internal Network IB
C4140-V100-2:239188:239390 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:340762:340937 [0] INFO comm 0x7fe7d42ca870 rank 0 nranks 8
C4140-V100-1:340765:340939 [3] INFO comm 0x7fc4783263a0 rank 3 nranks 8
C4140-V100-1:340763:340938 [1] INFO comm 0x7fe028299500 rank 1 nranks 8
C4140-V100-1:340762:340937 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340763:340938 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340763:340938 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:340764:340934 [2] INFO comm 0x7f6f042aa040 rank 2 nranks 8
C4140-V100-2:239189:239399 [3] INFO comm 0x7f47f02ea120 rank 7 nranks 8
C4140-V100-1:340764:340934 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340764:340934 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:340763:340938 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340764:340934 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340765:340939 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:340765:340939 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239189:239399 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239189:239399 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-1:340765:340939 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239189:239399 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239187:239394 [1] INFO comm 0x7f9de4304d90 rank 5 nranks 8
C4140-V100-2:239187:239394 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239187:239394 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239186:239393 [0] INFO comm 0x7fd1d833e740 rank 4 nranks 8
C4140-V100-2:239188:239390 [2] INFO comm 0x7fdc20304440 rank 6 nranks 8
C4140-V100-2:239187:239394 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239188:239390 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239188:239390 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239186:239393 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:239186:239393 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-2:239186:239393 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:239188:239390 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:340762:340937 [0] INFO Using 256 threads
C4140-V100-1:340762:340937 [0] INFO Min Comp Cap 7
C4140-V100-1:340762:340937 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:340762:340937 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
C4140-V100-2:239188:239390 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:239187:239394 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-1:340763:340938 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
C4140-V100-1:340764:340934 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
C4140-V100-2:239186:239393 [0] INFO 3 -> 4 via NET/IB/0
C4140-V100-2:239186:239393 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
C4140-V100-1:340762:340937 [0] INFO 7 -> 0 via NET/IB/0
C4140-V100-1:340762:340937 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-1:340765:340939 [3] INFO NET/IB: Dev 0 Port 1 qpn 243 mtu 5 LID 1
C4140-V100-2:239189:239399 [3] INFO NET/IB: Dev 0 Port 1 qpn 274 mtu 5 LID 2
C4140-V100-1:340762:340937 [0] INFO Launch mode Parallel


alsrgv commented Jun 13, 2018

@vilmara, great! I think installing the MLNX_OFED drivers on the second node resolved the issue. Right now you're using RDMA, but not GPUDirect.

Can you try a couple of things:

  1. Make sure nv_peer_mem driver is installed on the second host, too.
  2. Show output of nvidia-smi topo -m
  3. Try running the test with -x NCCL_IB_CUDA_SUPPORT=1 in the mpirun command, as sketched below.
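For (3), a sketch based on the mpirun command you posted earlier - the only change is the extra -x NCCL_IB_CUDA_SUPPORT=1, and some of the training flags are trimmed here for brevity:

mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
  -x NCCL_IB_DISABLE=0 -x NCCL_IB_CUDA_SUPPORT=1 \
  -x NCCL_SOCKET_IFNAME=ib0 -x NCCL_DEBUG=INFO \
  --bind-to none --mca plm_rsh_args "-p 50000" \
  python tf_cnn_benchmarks.py \
    --variable_update=horovod --horovod_device=gpu \
    --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
    --model=resnet50 --batch_size=128 --device=gpu --num_epochs=1 \
    --use_fp16=True --datasets_num_private_threads=4 --display_every=1000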


vilmara commented Jun 13, 2018

@alsrgv here is the nvidia-smi topo -m

    GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity

GPU0 X NV2 NV2 NV2 SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU1 NV2 X NV2 NV2 SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU2 NV2 NV2 X NV2 SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
GPU3 NV2 NV2 NV2 X SYS 0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
mlx5_0 SYS SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks

Regarding the NCCL flag, should I delete -x NCCL_IB_DISABLE=0 from my configuration and add -x NCCL_IB_CUDA_SUPPORT=1, or should I keep both?


alsrgv commented Jun 13, 2018

@vilmara, you should keep both. I'm not sure if GPUDirect actually works across NUMA nodes. It would be far better if the Mellanox NIC were attached to the PCIe root complex that has the NVLink mesh. Your hardware team or Dell contacts should be able to help with that.


vilmara commented Jun 13, 2018

@alsrgv, here is the performance with -x NCCL_IB_CUDA_SUPPORT=1; it looks like it improved a little over plain RDMA:
~560.5 img/sec per GPU
total img/sec: ~4483
~87% scaling efficiency with multinode

And here is some of the NCCL INFO output:
C4140-V100-1:366451:366626 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366451:366626 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366451:366626 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366451:366626 [0] INFO Using internal Network IB
C4140-V100-1:366451:366626 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:366451:366626 [0] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366451:366626 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.2.12+cuda9.0
C4140-V100-1:366453:366623 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366453:366623 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366454:366627 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366454:366627 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366452:366632 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366452:366632 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370364:370568 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370364:370568 [2] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370365:370567 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370365:370567 [3] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370363:370569 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370363:370569 [1] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-2:370362:370566 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370362:370566 [0] INFO NET/IB : Using interface ib0 for sideband communication
C4140-V100-1:366454:366627 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366454:366627 [3] INFO Using internal Network IB
C4140-V100-1:366454:366627 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:366452:366632 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366452:366632 [1] INFO Using internal Network IB
C4140-V100-1:366452:366632 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-1:366453:366623 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-1:366453:366623 [2] INFO Using internal Network IB
C4140-V100-1:366453:366623 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370364:370568 [2] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370364:370568 [2] INFO Using internal Network IB
C4140-V100-2:370365:370567 [3] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370364:370568 [2] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370365:370567 [3] INFO Using internal Network IB
C4140-V100-2:370365:370567 [3] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370363:370569 [1] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370363:370569 [1] INFO Using internal Network IB
C4140-V100-2:370363:370569 [1] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370362:370566 [0] INFO NET/IB: [0] mlx5_0:1/IB
C4140-V100-2:370362:370566 [0] INFO Using internal Network IB
C4140-V100-2:370362:370566 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
C4140-V100-2:370362:370566 [0] INFO comm 0x7f3014327ab0 rank 4 nranks 8
C4140-V100-2:370362:370566 [0] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370362:370566 [0] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370362:370566 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366451:366626 [0] INFO comm 0x7fa5f026bd30 rank 0 nranks 8
C4140-V100-1:366451:366626 [0] INFO CUDA Dev 0, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366452:366632 [1] INFO comm 0x7f516c296610 rank 1 nranks 8
C4140-V100-1:366453:366623 [2] INFO comm 0x7f49b42c1940 rank 2 nranks 8
C4140-V100-1:366454:366627 [3] INFO comm 0x7fde5c2eaa00 rank 3 nranks 8
C4140-V100-1:366453:366623 [2] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366453:366623 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-1:366452:366632 [1] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366452:366632 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-1:366454:366627 [3] INFO NET : Using interface ib0:192.168.11.1<0>
C4140-V100-1:366454:366627 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370363:370569 [1] INFO comm 0x7fbe742eee00 rank 5 nranks 8
C4140-V100-1:366453:366623 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370365:370567 [3] INFO comm 0x7efc24339990 rank 7 nranks 8
C4140-V100-2:370364:370568 [2] INFO comm 0x7f13342caef0 rank 6 nranks 8
C4140-V100-1:366452:366632 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366454:366627 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370365:370567 [3] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370365:370567 [3] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370363:370569 [1] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370363:370569 [1] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370364:370568 [2] INFO NET : Using interface ib0:192.168.11.2<0>
C4140-V100-2:370364:370568 [2] INFO NET/Socket : 1 interfaces found
C4140-V100-2:370365:370567 [3] INFO CUDA Dev 3, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370363:370569 [1] INFO CUDA Dev 1, IB Ports : mlx5_0/1(SOC)
C4140-V100-2:370364:370568 [2] INFO CUDA Dev 2, IB Ports : mlx5_0/1(SOC)
C4140-V100-1:366451:366626 [0] INFO Using 256 threads
C4140-V100-1:366451:366626 [0] INFO Min Comp Cap 7
C4140-V100-1:366451:366626 [0] INFO NCCL_SINGLE_RING_THRESHOLD=262144
C4140-V100-1:366451:366626 [0] INFO Ring 00 : 0 1 2 3 4 5 6 7
C4140-V100-2:370363:370569 [1] INFO Ring 00 : 5[1] -> 6[2] via P2P/IPC
C4140-V100-2:370364:370568 [2] INFO Ring 00 : 6[2] -> 7[3] via P2P/IPC
C4140-V100-2:370362:370566 [0] INFO 3 -> 4 via NET/IB/0/GDRDMA
C4140-V100-2:370362:370566 [0] INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
C4140-V100-1:366453:366623 [2] INFO Ring 00 : 2[2] -> 3[3] via P2P/IPC
C4140-V100-1:366452:366632 [1] INFO Ring 00 : 1[1] -> 2[2] via P2P/IPC
C4140-V100-1:366451:366626 [0] INFO 7 -> 0 via NET/IB/0/GDRDMA
C4140-V100-1:366451:366626 [0] INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
C4140-V100-1:366454:366627 [3] INFO NET/IB: Dev 0 Port 1 qpn 405 mtu 5 LID 1
C4140-V100-2:370365:370567 [3] INFO NET/IB: Dev 0 Port 1 qpn 436 mtu 5 LID 2
C4140-V100-1:366451:366626 [0] INFO Launch mode Parallel

Regarding your recommendation, is it more of a hardware improvement?


alsrgv commented Jun 13, 2018

@vilmara, oh - it worked, great! I think you may get slightly better performance with the hardware change I mentioned, since NVLink <-> PCIe <-> Mellanox performance may be quite a bit better than NVLink <-> CPU <-> Mellanox.


vilmara commented Jun 13, 2018

@alsrgv, thanks so much for your help. It was great boosting the scaling efficiency from 77% to 87% on the multi-node system with your recommendations.


vilmara commented Nov 15, 2019

Hi @alsrgv. I am running the TF benchmarks in multi-node mode with the latest version of Horovod (0.18.2), MLNX_OFED_LINUX-4.7-1.0.0.1, and GPUDirect RDMA (nvidia-peer-memory_1.0-8). I am following the instructions above, but I am not seeing the NET/IB/0/GDRDMA connection in the output. Do you have any comments or recommendations on what else I am missing to activate GPUDirect RDMA with the new software stack?

Current throughput:
1 GPU within a node: ~802 img/sec
per GPU across the nodes: ~760 img/sec
8 GPUs total: ~6071 img/sec

Tracelog
master_node:20:289 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:20:289 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:20:289 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:20:289 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
NCCL version 2.4.7+cuda10.0
master_node:22:295 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:22:295 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:21:290 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:23:288 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
master_node:22:295 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:21:290 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:23:288 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:43:309 [2] NCCL INFO NET/Socket : Using [0]ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:44:311 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:41:312 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
secondary_node:43:309 [2] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:44:311 [3] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:42:310 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
secondary_node:41:312 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
master_node:22:295 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:23:288 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
master_node:21:290 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.1<0>
secondary_node:43:309 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:44:311 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:41:312 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
secondary_node:42:310 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:192.168.11.2<0>
master_node:20:289 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
master_node:23:288 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
master_node:21:290 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
master_node:22:295 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:44:311 [3] NCCL INFO Setting affinity for GPU 3 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:43:309 [2] NCCL INFO Setting affinity for GPU 2 to aaaa,aaaaaaaa,aaaaaaaa
secondary_node:41:312 [0] NCCL INFO Setting affinity for GPU 0 to 5555,55555555,55555555
secondary_node:42:310 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
secondary_node:41:312 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
secondary_node:44:311 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
secondary_node:42:310 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
secondary_node:43:309 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:22:295 [2] NCCL INFO CUDA Dev 2[2], IB NIC distance : NODE
master_node:23:288 [3] NCCL INFO CUDA Dev 3[3], IB NIC distance : NODE
master_node:21:290 [1] NCCL INFO CUDA Dev 1[1], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO CUDA Dev 0[0], IB NIC distance : SYS
master_node:20:289 [0] NCCL INFO Channel 00 : 0 1 3 6 4 5 7 2
master_node:20:289 [0] NCCL INFO Channel 01 : 0 1 3 6 4 5 7 2
master_node:22:295 [2] NCCL INFO Ring 00 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 3 -> 6 [receive] via NET/IB/0
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 3[3] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 00 : 3 -> 6 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 00 : 3[3] -> 1[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via P2P/IPC
master_node:20:289 [0] NCCL INFO Ring 00 : 0[0] -> 2[2] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 3[3] via P2P/IPC
master_node:23:288 [3] NCCL INFO Ring 01 : 3 -> 6 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 5[1] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7 -> 2 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 6 -> 2 [receive] via NET/IB/0
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO Ring 00 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 6 -> 2 [send] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 00 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 00 : 4[0] -> 6[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 00 : 2 -> 6 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 00 : 2 -> 6 [send] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 7 -> 2 [receive] via NET/IB/0
master_node:22:295 [2] NCCL INFO Ring 01 : 2[2] -> 0[0] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 3 -> 6 [receive] via NET/IB/0
master_node:23:288 [3] NCCL INFO Ring 01 : 3[3] -> 1[1] via P2P/IPC
master_node:21:290 [1] NCCL INFO Trees [0] 0->1->3/-1/-1 [1] 0->1->3/-1/-1
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7 -> 2 [send] via NET/IB/0
master_node:23:288 [3] NCCL INFO Trees [0] 1->3->-1/-1/-1 [1] 1->3->-1/-1/-1
master_node:20:289 [0] NCCL INFO Ring 01 : 0[0] -> 2[2] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6[2] -> 4[0] via P2P/IPC
master_node:21:290 [1] NCCL INFO comm 0x7f4d6839f060 rank 1 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
master_node:23:288 [3] NCCL INFO comm 0x7f48503a3650 rank 3 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Trees [0] 2->0->1/-1/-1 [1] 2->0->1/-1/-1
master_node:20:289 [0] NCCL INFO Using 256 threads, Min Comp Cap 7, Trees enabled for all sizes
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 7[3] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 5[1] via P2P/IPC
master_node:20:289 [0] NCCL INFO comm 0x7f5450362840 rank 0 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Ring 01 : 2 -> 6 [send] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Ring 01 : 7[3] -> 5[1] via P2P/IPC
secondary_node:43:309 [2] NCCL INFO Ring 01 : 2 -> 6 [receive] via NET/IB/0
secondary_node:44:311 [3] NCCL INFO Trees [0] 5->7->-1/-1/-1 [1] 5->7->-1/-1/-1
master_node:22:295 [2] NCCL INFO Ring 01 : 6 -> 2 [receive] via NET/IB/0
secondary_node:42:310 [1] NCCL INFO Ring 01 : 5[1] -> 4[0] via P2P/IPC
secondary_node:41:312 [0] NCCL INFO Ring 01 : 4[0] -> 6[2] via P2P/IPC
secondary_node:44:311 [3] NCCL INFO comm 0x7ff2c43f7c00 rank 7 nranks 8 cudaDev 3 nvmlDev 3 - Init COMPLETE
secondary_node:42:310 [1] NCCL INFO Trees [0] 4->5->7/-1/-1 [1] 4->5->7/-1/-1
secondary_node:41:312 [0] NCCL INFO Trees [0] 6->4->5/-1/-1 [1] 6->4->5/-1/-1
secondary_node:41:312 [0] NCCL INFO comm 0x7fd8dc3c6740 rank 4 nranks 8 cudaDev 0 nvmlDev 0 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO Ring 01 : 6 -> 2 [send] via NET/IB/0
secondary_node:43:309 [2] NCCL INFO Trees [0] 2->6->4/-1/-1 [1] -1->6->4/2/-1
secondary_node:42:310 [1] NCCL INFO comm 0x7fa7cc422c90 rank 5 nranks 8 cudaDev 1 nvmlDev 1 - Init COMPLETE
secondary_node:43:309 [2] NCCL INFO comm 0x7fce9c438c90 rank 6 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:22:295 [2] NCCL INFO Trees [0] -1->2->0/6/-1 [1] 6->2->0/-1/-1
master_node:22:295 [2] NCCL INFO comm 0x7fd8f038f460 rank 2 nranks 8 cudaDev 2 nvmlDev 2 - Init COMPLETE
master_node:20:289 [0] NCCL INFO Launch mode Parallel


vilmara commented Jan 15, 2020

solved, see #1523

vilmara closed this as completed Jan 15, 2020
@ajtarraga

Hi, I have a problem while I am working with GPUDirect RDMA.

I have the following scenario:

  • 8 nodes with 1 GPU Tesla T4 each
  • GPUDirect RDMA enabled on them

However, when I compare the performance on 2 nodes with and without GPUDirect RDMA, I get the following results:

  • via NET/IB/0 ~820 images/sec
  • via NET/IB/0/GDRDMA ~808 images/sec

Horovod is showing better results when GPUDirect RDMA is disabled. Can someone explain to me why Horovod works better without GPUDirect RDMA? I don't understand it, because I get 2.7x the bandwidth between GPUs when GPUDirect RDMA is enabled.
