Performance tuning on horovod #547
@cheyang, a network bandwidth of 3.0 Gbit/s is very low for P100 GPUs. If you plan to scale to multiple nodes, upgrade to 25 Gbit/s or 50 Gbit/s if possible. You can also try fp16 allreduce mode for your training, since it transfers half as much data over the wire. To do so, modify this line: https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/benchmark_cnn.py#L2738 Replace:
With:
|
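To illustrate why fp16 allreduce halves the traffic, here is a minimal sketch in plain Python. It is not the Horovod or tf_cnn_benchmarks code; the helper names and the gradient values are made up for illustration. It packs the same values as 32-bit and 16-bit floats and measures the rounding error introduced by the round trip.

```python
import struct

def to_fp32_bytes(values):
    """Pack floats as IEEE 754 single precision (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)

def to_fp16_bytes(values):
    """Pack floats as IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)

def from_fp16_bytes(buf):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

# Hypothetical gradient values, chosen to lie in fp16's normal range.
grads = [0.001234567, -0.5, 3.14159, 0.02]

fp32_wire = to_fp32_bytes(grads)
fp16_wire = to_fp16_bytes(grads)
restored = from_fp16_bytes(fp16_wire)

# Largest relative rounding error introduced by the fp16 round trip.
max_rel_err = max(abs(r - g) / abs(g) for r, g in zip(restored, grads))

print(len(fp32_wire), len(fp16_wire))  # 16 bytes vs 8 bytes
print(max_rel_err)
```

The payload is half the size, at the cost of a small rounding error per value. Whether that error matters for final accuracy depends on the model and has to be checked empirically.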
Thank you for the suggestions! I increased the network bandwidth to 25 Gbit/s. The result is:
Do you think this is normal for a 25 Gbit/s network? Do you have any suggestions for optimizing the network configuration? I didn't change the benchmark source code. I'm also wondering whether fp16 allreduce will affect the training result, such as accuracy. Why is it not the default setting? Thanks in advance. |
Why don't you use RoCE if you are using 25GbE and |
It's because I'm testing on the public cloud. |
@cheyang, can you share which cloud you're using? It may help narrow down the issue. Additionally, can you capture network utilization using |
Thank you. I'm using the Alibaba Cloud. But the output of I notice the |
@cheyang, you may have already resolved this issue. Try to use The reason you got
|
Thank you for your nice tips! |
Hi all,
I'm running a performance benchmark with synthetic data and the Docker image uber/horovod:0.13.10-tf1.9.0-torch0.4.0-py3.5.
Start Docker on each node:
Single node, single GPU:
Performance Result:
Two nodes, one GPU each:
Performance Result:
The scaling efficiency is lower than expected: 370.14 / (224.30 * 2) ≈ 82.5% of ideal linear scaling from the single-node result. Do you have any suggestions for optimization?
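The efficiency figure above can be reproduced with a short calculation. This is only a sketch of the arithmetic; the function name is made up, and the two throughput numbers are the ones quoted in this post.

```python
def scaling_efficiency(single_gpu_rate, multi_gpu_rate, num_gpus):
    """Fraction of ideal linear scaling achieved by the multi-GPU run."""
    return multi_gpu_rate / (single_gpu_rate * num_gpus)

# Throughput reported above: 224.30 on one GPU, 370.14 on two nodes with one GPU each.
eff = scaling_efficiency(224.30, 370.14, 2)
print(f"{eff:.1%}")  # 82.5%
```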
Hardware configuration:
GPU: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB]
CPU: 16, Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Memory: 118 Gi
Network bandwidth: 3.0 Gbit/s
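For a rough sense of what 3.0 Gbit/s means per training step, the sketch below estimates a lower bound on gradient transfer time. It assumes a bandwidth-optimal ring allreduce, where each worker sends about 2(N-1)/N times the gradient size, and it ignores latency and any overlap with computation. The parameter count of 25 million is a hypothetical placeholder, not the model used in this benchmark.

```python
def allreduce_seconds(num_params, bytes_per_param, num_workers, gbit_per_s):
    """Lower bound on per-step allreduce time for one worker's link."""
    payload_bytes = num_params * bytes_per_param
    # Ring allreduce: each worker sends roughly 2 * (N - 1) / N times the payload.
    sent_bytes = 2 * (num_workers - 1) / num_workers * payload_bytes
    return sent_bytes * 8 / (gbit_per_s * 1e9)

NUM_PARAMS = 25_000_000  # hypothetical model size, for illustration only

for bandwidth in (3.0, 25.0):
    fp32 = allreduce_seconds(NUM_PARAMS, 4, 2, bandwidth)
    fp16 = allreduce_seconds(NUM_PARAMS, 2, 2, bandwidth)
    print(f"{bandwidth:>4} Gbit/s: fp32 {fp32 * 1e3:.1f} ms, fp16 {fp16 * 1e3:.1f} ms")
```

Comparing these estimates with the measured time per step gives an idea of whether the network, rather than the GPU, is the bottleneck.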