Performance tuning parameters for Horovod-TensorFlow benchmarks #288
@vilmara, these look reasonable. What GPUs are you using for this test (P100? V100?), and what number of images/sec do you get on a single GPU?
Hi @alsrgv, the system has 2 nodes with 4 V100 GPUs each; a single node produces around 255 images/second. I have run the benchmark with parameter tuning for only a single node (without Horovod) and got even higher throughput; I tried these parameters with Horovod but got errors (except --variable_update, which I kept =horovod). Here is the result for a single node with the regular TF benchmark without Horovod: 2341.4 images/sec
@vilmara, can you try using --use_fp16=True?
@alsrgv, with --use_fp16=True it gave around 430 images/sec per GPU. Can you suggest what other parameters I can combine with --variable_update=horovod to get the highest throughput?
@vilmara, the best performance I got with that combination (ResNet-50, batch size 64, FP16, V100 with NVLink, real data) is ~616 img/sec. If you have NVLink, you should get ~80-85% scaling efficiency within the node - so ~2000-2100 img/sec on 4 GPUs. You may get slightly better numbers since you're using a slightly higher batch size. Scaling efficiency between nodes will depend on your network. It looks like you have InfiniBand, which is great! What is the speed of your NIC? Do you have NVLink or PCIe? Here's the command that I used to get good performance:
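(The exact command did not survive in this export of the thread. As a rough sketch of what such a launch typically looks like for this benchmark - with host name, data path, and the precise flag set as assumptions rather than the author's original:)

```bash
# Sketch only: host name, slot count, and data path are placeholders.
mpirun -np 4 -H server1:4 \
    -bind-to none -map-by slot \
    -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO \
    python tf_cnn_benchmarks.py \
        --model=resnet50 --batch_size=64 --use_fp16=True \
        --variable_update=horovod \
        --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet
```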
Please note
Hi @alsrgv, I have InfiniBand (the NIC's speed is 100 Gb/s) and I have NVLink. I ran the test adding the flag --datasets_num_private_threads=4 and it boosted throughput to ~500 img/sec per GPU, so ~4000 img/sec on 8 GPUs; however, I still don't reach the performance you mentioned. Here is the command I used:
@vilmara, great! Sorry - I made a mistake in the expected number that I published. With batch size 64, ~2000-2100 img/sec on 4 GPUs is expected within a node. I don't have numbers for batch size 128 - what performance are you getting within a node in your setup?
Hi @alsrgv, I am getting ~500 img/sec per GPU, so ~4000 img/sec on 8 GPUs with batch size 128.
@vilmara, do you also get ~500 img/sec per GPU with 4 GPUs within a node?
Hi @alsrgv, sorry, I got it; here are the results within a node:
@vilmara, that's great: you get 92% scaling efficiency within a server. The fact that it goes down to 77% as you cross the network is a bit underwhelming, though. Can you run your test with NCCL debug output enabled and share the NCCL lines?
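(A minimal way to get that output, assuming the same mpirun launcher used elsewhere in this thread, is to export NCCL's debug variable to every rank; the surrounding flags below are illustrative:)

```bash
# Add NCCL debug output to the existing launch; all other flags stay as before.
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 \
    --mca plm_rsh_args "-p 50000" \
    python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128 \
        --use_fp16=True --variable_update=horovod
```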
Hi @alsrgv, here are the NCCL output lines; C4140-V100-1 is the primary host and C4140-V100-2 is the secondary host:
C4140-V100-1:24:201 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:27:199 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:26:196 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-1:25:200 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:24:229 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:26:231 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:25:228 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:27:230 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
I have tried some of the tuning parameters indicated in this document without success; could you please recommend some? http://www.mellanox.com/related-docs/prod_software/Mellanox_GPUDirect_User_Manual_v1.5.pdf
@vilmara, good news - there is room for improvement on the software side :-)
Hopefully, these two pointers will help:
Hi @alsrgv, I have installed the Mellanox driver and the GPUDirect RDMA API, and loaded the GPUDirect kernel module on each server. I also checked its status to make sure GPUDirect RDMA is active, and realized it is not recognized inside the Horovod docker; see below.
Outside the docker:
Jun 07 16:02:45 C4140-V100-1 systemd[1]: Starting LSB: Activates/Deactivates nv_peer_mem to \ start at boot time....
Inside the docker:
@vilmara, you need two more things:
Mellanox has much more detail in this document: https://community.mellanox.com/docs/DOC-3014#jive_content_id_Create_or_pull_a_base_image_and_run_Container, but I think you can get away with just doing (1) and (2).
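(Steps (1) and (2) themselves were lost in this export. A common way to satisfy these requirements for InfiniBand-aware containers, shown here only as a sketch with a placeholder image name and mount path, is to expose the IB devices to the container and allow locked memory:)

```bash
# Sketch: expose the host's InfiniBand devices and lift the memory-lock limit
# inside the container; the image name and mounted SSH path are placeholders.
nvidia-docker run -it --network=host \
    --device=/dev/infiniband --cap-add=IPC_LOCK \
    -v /mnt/share/ssh:/root/.ssh \
    uber/horovod:latest
```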
@alsrgv, how can I extend the Dockerfile to install the Mellanox drivers?
@vilmara, you can either edit the Horovod Dockerfile, or make your own Dockerfile and specify the Horovod image as its base. This guide has a good introduction to Docker: https://docs.docker.com/get-started/
@alsrgv, when using the command for option 1 it throws this error: Error response from daemon: pull access denied for mellanox/hpcx-2-0, repository does not exist or may require 'docker login'. I logged in to Docker but am still getting the same error.
@vilmara, I found this image - https://hub.docker.com/r/mellanox/hpcx20_docker/, but I haven't personally tried it.
I don't think their docker would contain CUDA / NCCL / TF though. So it may be easier to start with the Horovod docker and just install MLNX_OFED using their installation script.
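(A sketch of what such a Dockerfile extension could look like. The base image tag, the OFED tarball version, and the exact installer flags are assumptions; mlnxofedinstall is Mellanox's standard installer script, but check its options against your OFED release:)

```Dockerfile
# Sketch: extend the Horovod image with the MLNX_OFED userland libraries only.
# The kernel driver still has to be installed on the host itself.
FROM uber/horovod:latest

# The OFED tarball (placeholder version) is assumed to sit next to this Dockerfile.
COPY MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz /tmp/
RUN cd /tmp && \
    tar -xzf MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz && \
    MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64/mlnxofedinstall \
        --user-space-only --without-fw-update --force && \
    rm -rf /tmp/MLNX_OFED_LINUX-*
```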
Thanks @alsrgv, I have some extra questions:
1. If I select option 3 (editing the Horovod Dockerfile and adding MLNX_OFED using their installation script), I won't need options 1 or 2, right?
2. When editing the Horovod Dockerfile, do I need to add the GPUDirect RDMA API installation? I guess yes.
3. What flags should be used with docker run and mpirun to indicate we are using GPUDirect?
Option 3, editing the Horovod Dockerfile: @alsrgv, I was able to extend the Dockerfile to install the Mellanox driver (only MLNX_OFED, since GPUDirect RDMA was throwing errors), so I installed the GPUDirect RDMA API outside of the docker and activated it before running the benchmarks. Now, when running the benchmarks, here is what I got:
C4140-V100-2:87186:87393 [3] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87183:87387 [0] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87185:87396 [2] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
C4140-V100-2:87184:87390 [1] misc/ibvwrap.cu:61 WARN Failed to open libibverbs.so[.1]
The primary server hangs at this point. I see it is still showing some network connections over the socket.
@vilmara, are you able to do
Hi @alsrgv, yes, I am able to. Before installing MLNX_OFED inside the docker, the network connection was only over the socket; now it is mixed, and the system hangs.
@vilmara, gotcha. Do all these things apply to the second node as well, i.e.
@alsrgv, here are my commands on both servers. Primary server:
Secondary server:
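(The docker run commands themselves were dropped from this export. For reference, the usual Horovod-in-Docker pattern for two servers looks roughly like the sketch below; the image name, SSH port, and mounted key path are placeholders and may differ from what was actually used:)

```bash
# Primary server: interactive container from which mpirun will be launched.
nvidia-docker run -it --network=host \
    --device=/dev/infiniband --cap-add=IPC_LOCK \
    -v /mnt/share/ssh:/root/.ssh \
    uber/horovod:latest

# Secondary server: the same image, but only running an SSH daemon on a
# non-default port so mpirun on the primary node can reach this container.
nvidia-docker run -d --network=host \
    --device=/dev/infiniband --cap-add=IPC_LOCK \
    -v /mnt/share/ssh:/root/.ssh \
    uber/horovod:latest \
    bash -c "/usr/sbin/sshd -p 50000; sleep infinity"
```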
@vilmara, is
@alsrgv, here are the outputs of both servers inside the docker:
Primary server:
Secondary server:
@vilmara, it's extremely strange that you have both:
and
Is it possible that you still have a container running an older version of the Docker image on the second server, and that is what gets accessed over SSH from the first node's container?
@alsrgv, on the secondary node outside the docker, this is what I have:
@vilmara, one other way to check:
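(The suggested check itself is missing from this export. A plausible form of it, guessing the intent and reusing the SSH port from the mpirun arguments above, would be to query the IB devices on the second container over SSH:)

```bash
# From inside the primary container: confirm the secondary container can see the IB devices.
ssh -p 50000 192.168.11.2 ibv_devinfo
```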
@alsrgv, running from the primary server inside the docker:
Also:
17 packages can be updated. *** System restart required *** |
@vilmara, ok, this:
suggests that the container on the second node is running an older version of the image, which does not have ibv_devinfo. Can you just reboot the second server and run your commands again?
@alsrgv, here are the commands on the primary node after rebooting the secondary node:
root@c4140v1001:/examples#
root@c4140v1001:/examples#
@alsrgv, some progress: I ran the benchmark again and the outputs no longer show WARN Failed to open libibverbs.so[.1], but the primary server still hangs and shows socket connections:
Running warm up
Can you check that MLNX_OFED is installed on the second host?
@alsrgv, after rebooting the second host it is not installed; I will rebuild Horovod again with the extended Horovod Dockerfile.
@alsrgv, here are the outputs on the secondary node after re-building the extended Horovod docker with MLNX_OFED:
root@C4140-V100-2:/examples#
root@C4140-V100-2:/examples#
@vilmara, please install MLNX_OFED on the second host as well, since you need the kernel driver in addition to the userland libraries that you have installed in the docker.
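(For reference, installing the full MLNX_OFED stack on the host and re-enabling the GPUDirect peer-memory module usually looks roughly like this; the tarball version is a placeholder and the installer flags may vary between OFED releases:)

```bash
# On the second host, outside the container; the version string is a placeholder.
tar -xzf MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64.tgz
cd MLNX_OFED_LINUX-4.3-1.0.1.0-ubuntu16.04-x86_64
sudo ./mlnxofedinstall --force          # install kernel driver and userland libraries
sudo /etc/init.d/openibd restart        # reload the InfiniBand stack
sudo service nv_peer_mem restart        # re-enable the GPUDirect RDMA peer-memory module
```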
Hi @alsrgv, I have reinstalled MLNX_OFED on the second host and ran the test below again. On the primary node, inside the docker:
root@C4140-V100-1:/examples#
@alsrgv, I was also able to run the benchmark again, with the performance and output summary below.
Within a node (single-node mode with 4 GPUs):
Multi-node mode (2 nodes with 4 GPUs each):
Outputs:
@vilmara, great! I think installing the MLNX_OFED drivers on the second node resolved the issue. Right now you're using RDMA, but not GPUDirect. Can you try a couple of things:
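(The two suggested checks did not survive this export. Based on the replies that follow, they appear to have been (a) inspecting the GPU/NIC topology and (b) asking NCCL to use GPUDirect over InfiniBand; the exact environment variable depends on the NCCL version, so treat the sketch below as an assumption:)

```bash
# (a) Show how the GPUs and the Mellanox NIC are connected (NVLink, PCIe, or across CPUs).
nvidia-smi topo -m

# (b) Ask NCCL to use GPUDirect RDMA for InfiniBand transfers. On NCCL 2.x of that era
# the switch was NCCL_IB_CUDA_SUPPORT; newer NCCL releases use NCCL_NET_GDR_LEVEL instead.
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
    -x NCCL_DEBUG=INFO -x NCCL_IB_CUDA_SUPPORT=1 \
    -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 \
    --mca plm_rsh_args "-p 50000" \
    python tf_cnn_benchmarks.py --model=resnet50 --batch_size=128 \
        --use_fp16=True --variable_update=horovod
```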
@alsrgv, here is the topology output:
GPU0  X  NV2  NV2  NV2  SYS  0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22,24-24,26-26,28-28,30-30,32-32,34-34,36-36,38-38
Legend: X = Self
Regarding the second step, should I delete the flags
@vilmara, you should keep both. I'm not sure if GPUDirect actually works across NUMA nodes. It would be far better if the Mellanox NIC were attached to the PCIe root complex that has the NVLink mesh. Your hardware team or Dell contacts should be able to help with that.
@alsrgv, here is the performance with the flags you suggested:
And here is some of the NCCL info output:
Regarding your recommendation, is it more of a hardware improvement?
@vilmara, oh - it worked, great! I think you may get slightly better performance with the hardware change I mentioned, since
@alsrgv, thanks so much for your help. It was great boosting the multi-node scaling efficiency from 77% to 87% with your recommendations.
Hi, @alsrgv. I am running the TF benchmarks in multi-node mode with the latest version of Horovod (0.18.2), MLNX_OFED_LINUX-4.7-1.0.0.1, and GPUDirect RDMA (nvidia-peer-memory_1.0-8). I am following the above instructions, but I am not seeing the expected connection type in the output.
Current throughput:
Tracelog:
Solved, see #1523.
Hi, I have a problem while I am working with GPUDirect RDMA. I have the following scenario:
However, when I compare the performance obtained with GPUDirect RDMA across 2 nodes, I get these results:
Horovod shows better results when GPUDirect RDMA is disabled. Can someone explain why Horovod works better without GPUDirect RDMA? I don't understand it, because I measure 2.7x the bandwidth between GPUs when GPUDirect RDMA is enabled.
Hi all, can you recommend a set of tuning parameters to get the highest performance (throughput in images/sec) using the Horovod-TensorFlow benchmarks? I got the result below and was wondering if there is room for more improvement:
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 --mca plm_rsh_args "-p 50000" -x python tf_cnn_benchmarks.py --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet --model=resnet50 --batch_size=128 --device=gpu --num_epochs=90 --print_training_accuracy=true --summary_verbosity=0 --momentum=0.9 --piecewise_learning_rate_schedule='0.4;10;0.04;60;0.004' --weight_decay=0.0001 --optimizer=momentum --display_every=1000 --use_fp16=False --local_parameter_device=gpu --variable_update=horovod
total images/sec: 2047.95
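(Based on the answers in this thread, the two changes that made the biggest difference were FP16 compute and dedicated dataset input threads. A tuned variant of the command above might therefore look like the following sketch; only those two added flags are grounded in the discussion, and the rest is the original command trimmed for brevity:)

```bash
# Sketch: the original launch plus the two flags recommended in this thread.
mpirun -np 8 -H 192.168.11.1:4,192.168.11.2:4 \
    -x NCCL_IB_DISABLE=0 -x NCCL_SOCKET_IFNAME=ib0 \
    --mca plm_rsh_args "-p 50000" \
    python tf_cnn_benchmarks.py \
        --data_dir=/data/imagenet_tfrecord/train --data_name=imagenet \
        --model=resnet50 --batch_size=128 --device=gpu \
        --variable_update=horovod --local_parameter_device=gpu \
        --use_fp16=True --datasets_num_private_threads=4
```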