Performance tuning on horovod #547

Closed
cheyang opened this issue Oct 8, 2018 · 8 comments

@cheyang

cheyang commented Oct 8, 2018

Hi all,

I'm running a performance benchmark with synthetic data and the Docker image uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5.

Start Docker on each node:

docker run -itd --network=host  -v /nfs/share/ssh:/root/.ssh -v /nfs:/tf-cnn uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5 \
    bash -c "/usr/sbin/sshd -p 12345; sleep infinity"

Single node with one GPU:

python /tf-cnn/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
 --num_gpus=1 \
 --model=resnet50 \
 --num_batches=300 \
 --variable_update=horovod \
 --horovod_device=gpu \
 --trace_file=/tf-cnn/trace_horovod.log \
 --batch_size=64

Performance Result:

----------------------------------------------------------------
total images/sec: 224.30
----------------------------------------------------------------

Two nodes, each with one GPU:

mpirun --allow-run-as-root -np 2 \
     -H 192.168.0.242:1,192.168.0.243:1 \
     -bind-to none -map-by slot \
     --mca btl_tcp_if_include eth0 \
     --mca oob_tcp_if_include eth0  \
     --mca orte_keep_fqdn_hostnames t \
     -x NCCL_SOCKET_IFNAME=eth0 \
     -x LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64: \
     -x NCCL_DEBUG=INFO \
    python /tf-cnn/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
    --num_gpus=1 \
    --model=resnet50 \
    --num_batches=300 \
    --variable_update=horovod \
    --horovod_device=gpu \
    --trace_file=/tf-cnn/trace_horovod.log \
    --batch_size=64

Performance Result:

----------------------------------------------------------------
300	images/sec: 185.3 +/- 0.6 (jitter = 8.8)	8.235
----------------------------------------------------------------
total images/sec: 370.14
----------------------------------------------------------------
Training time is 0.33924
300	images/sec: 185.2 +/- 0.6 (jitter = 8.8)	8.257
----------------------------------------------------------------
total images/sec: 370.14
----------------------------------------------------------------

The scaling efficiency looks lower than expected: 370.14 / (224.30 × 2) ≈ 82% of linear scaling relative to the single-node result. Do you have any suggestions for optimization?

Hardware configuration:
GPU: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
CPU: 16, Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Memory: 118 GiB
Network bandwidth: 3.0 Gbit/s

@cheyang cheyang changed the title from "Performance tunning on horovod" to "Performance tuning on horovod" on Oct 8, 2018
@alsrgv
Member

alsrgv commented Oct 8, 2018

@cheyang, a network bandwidth of 3.0 Gbit/s is very low for P100 GPUs. If possible, you should upgrade to 25 Gbit/s or 50 Gbit/s if you plan to scale to multiple nodes.
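
For a rough sense of scale, here is a back-of-envelope sketch (the inputs are my assumptions, not measurements from this thread: ~25.6M fp32 parameters for ResNet-50, and a 2-node ring allreduce sending roughly one full gradient copy per step in each direction):

params = 25.6e6               # approximate ResNet-50 parameter count (assumption)
grad_bytes = params * 4       # fp32 gradients: ~102 MB per step
steps_per_sec = 185.3 / 64    # per-worker images/sec divided by batch size
gbit_per_sec = grad_bytes * steps_per_sec * 8 / 1e9
print('~%.1f Gbit/s of gradient traffic per direction' % gbit_per_sec)  # ~2.4

That is already close to saturating a 3 Gbit/s link, which is why the network becomes the bottleneck.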

You can also try fp16 allreduce mode for your training, since it transfers half as much data over the wire. To do so, modify this line: https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L2738

Replace:

        grads = [hvd.allreduce(grad, average=False, device_dense=horovod_device)
                 for grad in grads]

With:

        grads = [hvd.allreduce(grad, average=False, device_dense=horovod_device, compression=hvd.Compression.fp16)
                 for grad in grads]
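
For reference, a minimal end-to-end sketch of the same idea outside tf_cnn_benchmarks (assumptions: TF 1.x, and a Horovod release that includes gradient compression; hvd.DistributedOptimizer accepts the same compression argument as hvd.allreduce):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Toy model, just to exercise the allreduce path.
x = tf.random_normal([64, 32])
w = tf.get_variable('w', [32, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
# Gradients are cast to fp16 for the allreduce and back to fp32 afterwards.
opt = hvd.DistributedOptimizer(opt, compression=hvd.Compression.fp16)
train_op = opt.minimize(loss)

# Pin each process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
with tf.train.MonitoredTrainingSession(
        hooks=[hvd.BroadcastGlobalVariablesHook(0)], config=config) as sess:
    for _ in range(10):
        sess.run(train_op)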

@alsrgv alsrgv added the question label Oct 8, 2018
@cheyang
Author

cheyang commented Oct 8, 2018

Thank you for the suggestions! I upgraded the network bandwidth to 25 Gbit/s.

The result is:

----------------------------------------------------------------
total images/sec: 372.05
----------------------------------------------------------------
300	images/sec: 186.1 +/- 0.8 (jitter = 10.1)	8.243
----------------------------------------------------------------
total images/sec: 372.04
----------------------------------------------------------------

Do you think this is normal for a 25 Gbit/s network? Or do you have any suggestions for optimizing the network configuration? I didn't change the benchmark source code.

I'm also wondering whether using fp16 allreduce will impact the training result, such as accuracy. Why is it not the default setting? Thanks in advance.

@byronyi
Contributor

byronyi commented Oct 9, 2018

Why don't you use RoCE if you are using 25GbE and --net=host?

@cheyang
Author

cheyang commented Oct 10, 2018

It's because I'm testing on a public cloud.

@alsrgv
Member

alsrgv commented Oct 11, 2018

@cheyang, can you share which cloud you're using? It may help narrow down the issue.

Additionally, can you capture network utilization using ethtool, similar to #255 (comment), and check whether it actually reaches 25 Gbit/s?

@cheyang
Author

cheyang commented Oct 12, 2018

Thank you. I'm using Alibaba Cloud.

But the output of ethtool -S eth0 is "no stats available". :(

I notice the sar command can only report results once per second. Do you have any other suggestions?

@Jeffwan

Jeffwan commented Feb 1, 2019

@cheyang, you may have already resolved this issue. Try using nload to check the maximum/minimum/average network throughput; that will give a basic idea of whether it has ever reached the limit. It's still good to check millisecond-level metrics, though.

The reason you got "no stats available" is that your network device driver doesn't support it. Run ethtool -i eth0 to check supports-statistics:

$ sudo ethtool -i eth0
driver: virtio_net
version: 1.0.0
firmware-version:
expansion-rom-version:
bus-info: 0000:00:04.0
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
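
Since sar and nload report at one-second granularity at best, here is a minimal sketch for finer-grained sampling on Linux that reads the raw byte counters from /proc/net/dev (the eth0 interface name and the 100 ms interval are assumptions):

import time

def rx_tx_bytes(iface='eth0'):
    # /proc/net/dev: after 'iface:', field 0 is rx bytes, field 8 is tx bytes.
    with open('/proc/net/dev') as f:
        for line in f:
            if line.strip().startswith(iface + ':'):
                fields = line.split(':', 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError('interface not found: ' + iface)

interval = 0.1  # 100 ms sampling
prev_rx, prev_tx = rx_tx_bytes()
while True:
    time.sleep(interval)
    rx, tx = rx_tx_bytes()
    # Convert byte deltas over the interval to Gbit/s.
    print('rx %5.2f Gbit/s  tx %5.2f Gbit/s' % (
        (rx - prev_rx) * 8 / interval / 1e9,
        (tx - prev_tx) * 8 / interval / 1e9))
    prev_rx, prev_tx = rx, tx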

@cheyang
Author

cheyang commented Feb 3, 2019

Thank you for your nice tips!

@cheyang cheyang closed this as completed Feb 3, 2019