
Unexpected behavior in tensorflow's distributed training #6976

Closed
infwinston opened this issue Jan 20, 2017 · 6 comments

@infwinston

infwinston commented Jan 20, 2017

Hi Tensorflowers,

I was doing some distributed training experiments on tensorflow v0.11.0.
I modified the ResNet code from the official model zoo here so that it can do distributed training.
In the experiments, ResNet-56 is trained on the CIFAR-10 data set, and the settings follow the original paper.

When I set the number of parameter servers = 1 and workers = 1, I expected the behavior to be the same as the single-GPU one (the original code in the repo).
It turns out that there is a huge performance gap. Please see the following figure.
[Figure: resnet_cifar10_1ps1worker_tf]
x-axis: time in seconds; y-axis: test error
tf-0: single-GPU version; tf-1: distributed version with # ps=1, # worker=1
Both versions were trained for 160 epochs with the same hyperparameters and learning rate schedule.

The single-GPU version can achieve a 7% error rate (consistent with the original paper), but the distributed one stalls at 12% error.
I think something might be wrong, since the performance should be similar in both cases.
Could you check whether that is so? (Or maybe I set up distributed training the wrong way.)
Please feel free to ask if you have any questions about the setup. Thanks.

The single-GPU version of the code can be found here (I used the earlier 0.11-compatible version).
The code for the distributed version can be found here.

Some detailed settings:

batch size=128, num_residual_units=9, relu_leakiness=0

The commands I launched:

ps0
> export CUDA_VISIBLE_DEVICES=""; python resnet_dist.py --ps_hosts="localhost:50000" --worker_hosts="localhost:50001" --job_name="ps" --task_id="0" --batch_size=128 --dataset='cifar10' --train_data_path=cifar10/data_batch* --log_root=./tmp-log-root/ --num_gpus=1 --mode train
worker0
> export CUDA_VISIBLE_DEVICES="0"; python resnet_dist.py --ps_hosts="localhost:50000" --worker_hosts="localhost:50001" --job_name="worker" --task_id="0" --batch_size=128 --dataset='cifar10' --train_data_path=cifar10/data_batch* --log_root=./tmp-log-root/ --num_gpus=1 --mode train
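
For reference, here is a minimal sketch (assumptions mine, not the actual contents of resnet_dist.py) of how the --ps_hosts / --worker_hosts / --job_name / --task_id flags above typically map onto a TensorFlow cluster in the 0.11-era graph API:

import tensorflow as tf

# Assumed flag values matching the commands above.
ps_hosts = ["localhost:50000"]
worker_hosts = ["localhost:50001"]
job_name = "worker"   # or "ps" for the parameter server process
task_id = 0

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_id)

if job_name == "ps":
    server.join()  # the parameter server blocks and serves variables
else:
    # Pin variables to the ps task and ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_id,
            cluster=cluster)):
        pass  # build the ResNet model and training op here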

Environment info

Operating System:
Ubuntu 14.04

Installed version of CUDA and cuDNN:
CUDA 7.5, cuDNN 5.1

> ls -l /usr/local/cuda/lib/libcud*
-rw-r--r-- 1 root root 189170 Oct 25 22:51 /usr/local/cuda/lib/libcudadevrt.a
lrwxrwxrwx 1 root root     16 Oct 25 22:51 /usr/local/cuda/lib/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root     19 Oct 25 22:51 /usr/local/cuda/lib/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root 311596 Oct 25 22:51 /usr/local/cuda/lib/libcudart.so.7.5.18
-rw-r--r-- 1 root root 558020 Oct 25 22:51 /usr/local/cuda/lib/libcudart_static.a

If installed from source, provide

  1. The commit hash (git rev-parse HEAD)
    282823b
  2. The output of bazel version
Build label: 0.4.2
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Dec 7 18:47:11 2016 (1481136431)
Build timestamp: 1481136431
Build timestamp as int: 1481136431

I built TensorFlow v0.11.0 from source myself.

@yaroslavvb
Contributor

From the graph, it looks like one of the runs "got lucky". These kinds of jumps indicate your network is badly tuned; a larger network, say, could show a more consistent improvement over time.

It seems possible that small differences in the behavior of the distributed version might sometimes make a badly tuned network do better. I.e., using two processes instead of one may introduce small differences in timing, which affect training (e.g., summary threads pull data from the input queue, so the timing of summary-thread scheduling affects which data the network sees during training).

Also, you are using the SyncReplicas optimizer for your distributed version, whereas a closer comparison would use regular SGD. This list is for bug reports/feature requests; I think you need to isolate the difference more reliably to be sure this is a bug in TensorFlow (e.g., perhaps also by transitioning to SavedModel instead of Supervisor to reduce some thread-scheduling-based randomness).
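
To make the suggested comparison concrete, here is a hedged sketch (placeholder names lrn_rate, total_loss, global_step and num_workers are assumptions, not the issue's actual code) contrasting a plain optimizer with the SyncReplicas-wrapped one used in the distributed run:

# Single-GPU style baseline: plain momentum/SGD optimizer.
opt = tf.train.MomentumOptimizer(lrn_rate, 0.9)
train_op = opt.minimize(total_loss, global_step=global_step)

# Distributed variant: the same optimizer wrapped in SyncReplicasOptimizer,
# which aggregates gradients from the workers before applying an update.
sync_opt = tf.train.SyncReplicasOptimizer(
    opt, replicas_to_aggregate=num_workers, total_num_replicas=num_workers)
train_op = sync_opt.minimize(total_loss, global_step=global_step)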

@asimshankar
Contributor

This question is probably better suited for Stack Overflow, as we try to keep the GitHub issues list focused on bugs and feature requests.

@infwinston
Author

Thanks for your response.
But actually, I don't think it's the case that one run just got lucky.
There might be some randomness, but I don't think the performance gap should be this huge.
The jump in the curve is due to the learning rate decrease, and it is indeed reproducible.
I have run several experiments and found that to be the case.
Also, this kind of curve is consistent with the one in the original paper (see Fig. 6), and many people have reproduced it using the same settings (e.g., [1], [2], or more examples here).
But you're right, I should first report the result without SyncReplicas. I am already working on this.

Anyway, I don't have direct proof that something is wrong in TensorFlow, but if you think this is not the right place to discuss it, I can move to Stack Overflow.
Thanks.

@yaroslavvb
Contributor

Yes, the jump seems consistent with the learning rate getting divided by 10. In your distributed version, the learning rate is constant, so I guess that would explain why it doesn't show the jump?
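
For illustration, a minimal sketch of the kind of step-wise schedule being described (the boundary steps and rates are assumed values, not necessarily those in the author's code), where the rate is divided by 10 at fixed global steps and pushed into a non-trainable variable:

boundaries = [40000, 60000, 80000]      # assumed step boundaries
values = [0.1, 0.01, 0.001, 0.0001]     # rate divided by 10 at each boundary

lrn_rate = tf.Variable(values[0], trainable=False, dtype=tf.float32)
new_rate = tf.placeholder(tf.float32, shape=[])
set_rate = tf.assign(lrn_rate, new_rate)

def rate_for_step(step):
    # Return the learning rate that applies at a given global step.
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]

# In the training loop:
#   sess.run(set_rate, feed_dict={new_rate: rate_for_step(current_step)})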

Discussing on closed issues is fine; it's just that open issues signal that something needs attention from the core team to triage or fix.

@infwinston
Author

@yaroslavvb Thanks!
I think I made a mistake in the distributed version of the code... I wrongly thought I could still control the learning rate by assigning it here.
But the situation is different from the original, since I used a different optimizer instead.
Sorry about the mistake, and many thanks for pointing it out. I think it behaves normally now.

@infwinston
Author

infwinston commented Feb 2, 2017

@yaroslavvb Thanks for your help last time. I've run into another issue related to this. Could you please take a look at my Stack Overflow post when you have time? Thanks a lot.
