Performance tuning on horovod #547

Closed
cheyang opened this issue Oct 8, 2018 · 8 comments

@cheyang

cheyang commented Oct 8, 2018

Hi all,

I'm running a performance benchmark with synthetic data and the Docker image uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5.

Start Docker on each node:

docker run -itd --network=host  -v /nfs/share/ssh:/root/.ssh -v /nfs:/tf-cnn uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5 \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

Single node with one GPU:

python /tf-cnn/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
 --num_gpus=1 \
 --model=resnet50 \
 --num_batches=300 \
 --variable_update=horovod \
 --horovod_device=gpu \
 --trace_file=/tf-cnn/trace_horovod.log \
 --batch_size=64

Performance Result:

----------------------------------------------------------------
total images/sec: 224.30
----------------------------------------------------------------

Two nodes, each with one GPU:

mpirun --allow-run-as-root -np 2 \
     -H 192.168.0.242:1,192.168.0.243:1 \
     -bind-to none -map-by slot \
     --mca btl_tcp_if_include eth0 \
     --mca oob_tcp_if_include eth0  \
     --mca orte_keep_fqdn_hostnames t \
     -x NCCL_SOCKET_IFNAME=eth0 \
     -x LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64: \
     -x NCCL_DEBUG=INFO \
    python /tf-cnn/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --num_gpus=1 \
    --model=resnet50 \
    --num_batches=300 \
    --variable_update=horovod \
    --horovod_device=gpu \
    --trace_file=/tf-cnn/trace_horovod.log \
    --batch_size=64

Performance Result:

----------------------------------------------------------------
300	images/sec: 185.3 +/- 0.6 (jitter = 8.8)	8.235
----------------------------------------------------------------
total images/sec: 370.14
----------------------------------------------------------------
Training time is 0.33924
300	images/sec: 185.2 +/- 0.6 (jitter = 8.8)	8.257
----------------------------------------------------------------
total images/sec: 370.14
----------------------------------------------------------------

The scaling efficiency looks lower than expected: 370.14 / (224.30 × 2) ≈ 82% of linear scaling relative to the single-node result. Do you have any suggestions for optimization?

Hardware configuration:
GPU: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
CPU: 16, Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Memory: 118 GiB
Network bandwidth: 3.0 Gbit/s

@cheyang cheyang changed the title from "Performance tunning on horovod" to "Performance tuning on horovod" on Oct 8, 2018
@alsrgv
Member

alsrgv commented Oct 8, 2018

@cheyang, a network bandwidth of 3.0 Gbit/s is very low for P100 GPUs. If possible, you should upgrade to 25 Gbit/s or 50 Gbit/s if you plan to scale to multiple nodes.
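
For a rough sense of scale, here is a back-of-envelope sketch (the inputs are my assumptions, not measurements from this thread: ~25.6M fp32 parameters for ResNet-50, and a 2-node ring allreduce sending roughly one full gradient copy per step in each direction):

params = 25.6e6               # approximate ResNet-50 parameter count (assumption)
grad_bytes = params * 4       # fp32 gradients: ~102 MB per step
steps_per_sec = 185.3 / 64    # per-worker images/sec divided by batch size
gbit_per_sec = grad_bytes * steps_per_sec * 8 / 1e9
print('~%.1f Gbit/s of gradient traffic per direction' % gbit_per_sec)  # ~2.4

That is already close to saturating a 3 Gbit/s link, which is why the network becomes the bottleneck.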

You can also try fp16 allreduce mode for your training, since it transfers half as much data over the wire. To do so, modify this line: https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L2738

Replace:

        grads = [hvd.allreduce(grad, average=False, device_dense=horovod_device)
                 for grad in grads]

With:

        grads = [hvd.allreduce(grad, average=False, device_dense=horovod_device, compression=hvd.Compression.fp16)
                 for grad in grads]
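
For reference, a minimal end-to-end sketch of the same idea outside tf_cnn_benchmarks (assumptions: TF 1.x, and a Horovod release that includes gradient compression; hvd.DistributedOptimizer accepts the same compression argument as hvd.allreduce):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Toy model, just to exercise the allreduce path.
x = tf.random_normal([64, 32])
w = tf.get_variable('w', [32, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
# Gradients are cast to fp16 for the allreduce and back to fp32 afterwards.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)
train_op = opt.minimize(loss)

# Pin each process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
with tf.train.MonitoredTrainingSession(
        hooks=[hvd.BroadcastGlobalVariablesHook(0)], config=config) as sess:
    for _ in range(10):
        sess.run(train_op)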

@alsrgv alsrgv added the question label Oct 8, 2018
@cheyang
Author

cheyang commented Oct 8, 2018

Thank you for the suggestions! I upgraded the network bandwidth to 25 Gbit/s.

The result is:

----------------------------------------------------------------
total images/sec: 372.05
----------------------------------------------------------------
300	images/sec: 186.1 +/- 0.8 (jitter = 10.1)	8.243
----------------------------------------------------------------
total images/sec: 372.04
----------------------------------------------------------------

Do you think this is normal for a 25 Gbit/s network? Or do you have any suggestions for optimizing the network configuration? I didn't change the benchmark source code.

I'm also wondering whether using fp16 allreduce will impact the training result, such as accuracy. Why is it not the default setting? Thanks in advance.

@byronyi
Contributor

byronyi commented Oct 9, 2018

Why don't you use RoCE if you are using 25GbE and --net=host?

@cheyang
Author

cheyang commented Oct 10, 2018

It's because I'm testing on a public cloud.

@alsrgv
Member

alsrgv commented Oct 11, 2018

@cheyang, can you share which cloud you're using? It may help narrow down the issue.

Additionally, can you capture network utilization using ethtool, similar to #255 (comment), and check whether it actually reaches 25 Gbit/s?

@cheyang
Author

cheyang commented Oct 12, 2018

Thank you. I'm using Alibaba Cloud.

But the output of ethtool -S eth0 is "no stats available". :(

I notice the sar command can only report results once per second. Do you have any other suggestions?

@Jeffwan

Jeffwan commented Feb 1, 2019

@cheyang, you may have already resolved this issue. Try using nload to check the maximum/minimum/average network throughput; that will give a basic idea of whether it has ever reached the limit. It's still good to check millisecond-level metrics, though.

The reason you got "no stats available" is that your network device driver doesn't support it. Run ethtool -i eth0 to check supports-statistics:

$ sudo ethtool -i eth0
driver: virtio_net
version: 1.0.0
firmware-version:
expansion-rom-version:
bus-info: 0000:00:04.0
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
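
Since sar and nload report at one-second granularity at best, here is a minimal sketch for finer-grained sampling on Linux that reads the raw byte counters from /proc/net/dev (the eth0 interface name and the 100 ms interval are assumptions):

import time

def rx_tx_bytes(iface='eth0'):
    # /proc/net/dev: after 'iface:', field 0 is rx bytes, field 8 is tx bytes.
    with open('/proc/net/dev') as f:
        for line in f:
            if line.strip().startswith(iface + ':'):
                fields = line.split(':', 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError('interface not found: ' + iface)

interval = 0.1  # 100 ms sampling
prev_rx, prev_tx = rx_tx_bytes()
while True:
    time.sleep(interval)
    rx, tx = rx_tx_bytes()
    # Convert byte deltas over the interval to Gbit/s.
    print('rx %5.2f Gbit/s  tx %5.2f Gbit/s' % (
        (rx - prev_rx) * 8 / interval / 1e9,
        (tx - prev_tx) * 8 / interval / 1e9))
    prev_rx, prev_tx = rx, tx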

@cheyang
Author

cheyang commented Feb 3, 2019

Thank you for your nice tips!

@cheyang cheyang closed this as completed Feb 3, 2019