
Unexpected behavior in tensorflow's distributed training #6976

Closed
infwinston opened this issue Jan 20, 2017 · 6 comments

@infwinston

infwinston commented Jan 20, 2017

Hi Tensorflowers,

I was doing some distributed training experiments on tensorflow v0.11.0.
I modified the ResNet code from the official model zoo here so that it can do distributed training.
In the experiments, ResNet-56 is trained on the CIFAR-10 data set, and the settings follow the original paper.

When I set the number of parameter servers = 1 and workers = 1, I expected the behavior to be the same as the single-GPU one (the original code in the repo).
It turns out that there is a huge performance gap. Please see the following figure.
[Figure: resnet_cifar10_1ps1worker_tf]
x-axis: time in seconds; y-axis: test error
tf-0: single-GPU version; tf-1: distributed version with # ps=1, # worker=1
Both versions were trained for 160 epochs with the same hyperparameters and learning rate schedule.

The single-GPU version can achieve a 7% error rate (consistent with the original paper), but the distributed one stalls at 12% error.
I think something might be wrong, since the performance should be similar in both cases.
Could you check whether that is so? (Or maybe I set up distributed training the wrong way.)
Please feel free to ask if you have any questions about the setup. Thanks.

The single-GPU version of the code can be found here (I used the earlier 0.11-compatible version).
The code for the distributed version can be found here.

Some detailed settings:

batch size=128, num_residual_units=9, relu_leakiness=0

The commands I launched:

ps0
> export CUDA_VISIBLE_DEVICES=""; python resnet_dist.py --ps_hosts="localhost:50000" --worker_hosts="localhost:50001" --job_name="ps" --task_id="0" --batch_size=128 --dataset='cifar10' --train_data_path=cifar10/data_batch* --log_root=./tmp-log-root/ --num_gpus=1 --mode train
worker0
> export CUDA_VISIBLE_DEVICES="0"; python resnet_dist.py --ps_hosts="localhost:50000" --worker_hosts="localhost:50001" --job_name="worker" --task_id="0" --batch_size=128 --dataset='cifar10' --train_data_path=cifar10/data_batch* --log_root=./tmp-log-root/ --num_gpus=1 --mode train
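
For reference, here is a minimal sketch (assumptions mine, not the actual contents of resnet_dist.py) of how the --ps_hosts / --worker_hosts / --job_name / --task_id flags above typically map onto a TensorFlow cluster in the 0.11-era graph API:

import tensorflow as tf

# Assumed flag values matching the commands above.
ps_hosts = ["localhost:50000"]
worker_hosts = ["localhost:50001"]
job_name = "worker"   # or "ps" for the parameter server process
task_id = 0

cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_id)

if job_name == "ps":
    server.join()  # the parameter server blocks and serves variables
else:
    # Pin variables to the ps task and ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_id,
            cluster=cluster)):
        pass  # build the ResNet model and training op here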

Environment info

Operating System:
Ubuntu 14.04

Installed version of CUDA and cuDNN:
CUDA 7.5, cuDNN 5.1

> ls -l /usr/local/cuda/lib/libcud*
-rw-r--r-- 1 root root 189170 Oct 25 22:51 /usr/local/cuda/lib/libcudadevrt.a
lrwxrwxrwx 1 root root     16 Oct 25 22:51 /usr/local/cuda/lib/libcudart.so -> libcudart.so.7.5
lrwxrwxrwx 1 root root     19 Oct 25 22:51 /usr/local/cuda/lib/libcudart.so.7.5 -> libcudart.so.7.5.18
-rwxr-xr-x 1 root root 311596 Oct 25 22:51 /usr/local/cuda/lib/libcudart.so.7.5.18
-rw-r--r-- 1 root root 558020 Oct 25 22:51 /usr/local/cuda/lib/libcudart_static.a

If installed from source, provide

  1. The commit hash (git rev-parse HEAD)
    282823b
  2. The output of bazel version
Build label: 0.4.2
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Dec 7 18:47:11 2016 (1481136431)
Build timestamp: 1481136431
Build timestamp as int: 1481136431

I built TensorFlow v0.11.0 from source myself.

@yaroslavvb
Contributor

From the graph, it looks like one of the runs "got lucky". These kinds of jumps indicate your network is badly tuned; a larger network, say, could show a more consistent improvement over time.

It seems possible that small differences in the behavior of the distributed version might sometimes make a badly tuned network do better. I.e., using two processes instead of one may introduce small differences in timing, which affect training (e.g., summary threads pull data from the input queue, so the timing of summary-thread scheduling affects which data the network sees during training).

Also, you are using the SyncReplicas optimizer for your distributed version, whereas a closer comparison would use regular SGD. This list is for bug reports/feature requests; I think you need to isolate the difference more reliably to be sure this is a bug in TensorFlow (e.g., perhaps also by transitioning to SavedModel instead of Supervisor to reduce some thread-scheduling-based randomness).
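
To make the suggested comparison concrete, here is a hedged sketch (placeholder names lrn_rate, total_loss, global_step and num_workers are assumptions, not the issue's actual code) contrasting a plain optimizer with the SyncReplicas-wrapped one used in the distributed run:

# Single-GPU style baseline: plain momentum/SGD optimizer.
opt = tf.train.MomentumOptimizer(lrn_rate, 0.9)
train_op = opt.minimize(total_loss, global_step=global_step)

# Distributed variant: the same optimizer wrapped in SyncReplicasOptimizer,
# which aggregates gradients from the workers before applying an update.
sync_opt = tf.train.SyncReplicasOptimizer(
    opt, replicas_to_aggregate=num_workers, total_num_replicas=num_workers)
train_op = sync_opt.minimize(total_loss, global_step=global_step)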

@asimshankar
Contributor

This question is probably better suited for Stack Overflow, as we try to keep the GitHub issues list focused on bugs and feature requests.

@infwinston
Author

Thanks for your response.
But actually, I don't think it's the case that one run just got lucky.
There might be some randomness, but I don't think the performance gap should be this huge.
The jump in the curve is due to the learning rate decrease, and it is indeed reproducible.
I have run several experiments and found that to be the case.
Also, this kind of curve is consistent with the one in the original paper (see Fig. 6), and many people have reproduced it using the same settings (e.g., [1], [2], or more examples here).
But you're right, I should first report the result without SyncReplicas. I am already working on this.

Anyway, I don't have direct proof that something is wrong in TensorFlow, but if you think this is not the right place to discuss it, I can move to Stack Overflow.
Thanks.

@yaroslavvb
Contributor

Yes, the jump seems consistent with the learning rate getting divided by 10. In your distributed version, the learning rate is constant, so I guess that would explain why it doesn't show the jump?
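
For illustration, a minimal sketch of the kind of step-wise schedule being described (the boundary steps and rates are assumed values, not necessarily those in the author's code), where the rate is divided by 10 at fixed global steps and pushed into a non-trainable variable:

boundaries = [40000, 60000, 80000]      # assumed step boundaries
values = [0.1, 0.01, 0.001, 0.0001]     # rate divided by 10 at each boundary

lrn_rate = tf.Variable(values[0], trainable=False, dtype=tf.float32)
new_rate = tf.placeholder(tf.float32, shape=[])
set_rate = tf.assign(lrn_rate, new_rate)

def rate_for_step(step):
    # Return the learning rate that applies at a given global step.
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]

# In the training loop:
#   sess.run(set_rate, feed_dict={new_rate: rate_for_step(current_step)})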

Discussing on closed issues is fine; it's just that open issues signal that something needs attention from the core team to triage or fix.

@infwinston
Author

@yaroslavvb Thanks!
I think I made a mistake in the distributed version of the code... I wrongly thought I could still control the learning rate by assigning it here.
But the situation is different from the original, since I used a different optimizer instead.
Sorry about the mistake, and many thanks for pointing it out. I think it behaves normally now.

@infwinston
Author

infwinston commented Feb 2, 2017

@yaroslavvb Thanks for your help last time. I've run into another issue related to this. Could you please take a look at my Stack Overflow post when you have time? Thanks a lot.
