Unexpected behavior in tensorflow's distributed training #6976
Comments
From the graph, it looks like one of the runs "got lucky". These kinds of jumps indicate the network is badly tuned; a larger network, say, could show a more consistent improvement over time. It seems possible that small differences in the behavior of the distributed version make a badly tuned network sometimes do better. I.e., using two processes instead of one may introduce small timing differences that affect training (e.g., summary threads pull data from the input queue, so the scheduling of the summary thread affects which data the network sees during training). Also, you are using the SyncReplicas optimizer in your distributed version, whereas a closer comparison would use regular SGD (see the sketch below). This list is for bugs/feature requests; I think you need to isolate the difference more reliably to be sure this is a bug in TensorFlow (perhaps also by transitioning to SavedModel instead of Supervisor to reduce some thread-scheduling-based randomness).
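For illustration, a minimal sketch of the optimizer swap being suggested. The toy loss, learning rate, and replica counts below are assumptions for the example, not the issue's actual model or settings:

```python
import tensorflow as tf

# Toy variables standing in for the ResNet model (illustration only).
global_step = tf.Variable(0, trainable=False, name="global_step")
w = tf.Variable(1.0)
loss = tf.square(w - 3.0)

learning_rate = 0.1  # assumed starting rate

# Plain SGD, matching the single-GPU script:
opt = tf.train.GradientDescentOptimizer(learning_rate)

# The distributed script instead wraps the optimizer in SyncReplicasOptimizer,
# roughly like this (left commented out for the closer comparison):
# opt = tf.train.SyncReplicasOptimizer(opt,
#                                      replicas_to_aggregate=1,
#                                      total_num_replicas=1)

train_op = opt.minimize(loss, global_step=global_step)
```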
This question is probably better suited for Stack Overflow, as we try to keep the GitHub issues list focused on bugs and feature requests.
Thanks for your response. Anyway, I don't have direct proof that something is wrong in TensorFlow, but if you think this is not the right place to discuss it, I can move to Stack Overflow.
Yes, the jump seems consistent with the learning rate getting divided by 10. In your distributed version, the learning rate is a constant, so I guess that would explain why it doesn't have the jump? Discussing on closed issues is fine; it's just that open issues are a sign that they need attention from someone on the core team to triage/fix.
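For reference, a minimal sketch of a step-decay schedule that divides the learning rate by 10 at fixed boundaries, contrasted with the constant rate mentioned above. The boundaries and values are assumptions, not the issue's actual settings, and `tf.train.piecewise_constant` may require a release newer than 0.11:

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name="global_step")

# Hypothetical boundaries/values in the spirit of the ResNet/CIFAR-10 recipe:
# divide the learning rate by 10 twice over the course of training.
boundaries = [40000, 60000]   # global steps (assumed)
values = [0.1, 0.01, 0.001]   # rate before, between, and after the boundaries
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)

# The distributed run reportedly used a constant rate instead, e.g.:
# learning_rate = tf.constant(0.1)
```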
@yaroslavvb Thanks! |
@yaroslavvb Thanks for your help last time. I've run into another issue related to this. Could you please take a look at my Stack Overflow post if you have time? Thanks a lot.
Hi Tensorflowers,
I was doing some distributed training experiments with TensorFlow v0.11.0.
I modified the ResNet code from the official model zoo here so that it can do distributed training.
In the experiments, ResNet-56 is trained on the CIFAR-10 data set, and the settings follow the original paper.
When I set the number of parameter servers to 1 and workers to 1, I expected the behavior to be the same as the single-GPU case (the original code in the repo).
It turns out that there is a huge performance gap. Please see the following figure.
x-axis: time in seconds; y-axis: test error
tf-0: single-GPU version; tf-1: distributed version with #ps = 1, #workers = 1
Both versions were run for 160 epochs with the same parameter/learning-rate schedule.
The single-GPU version reaches a 7% error rate (consistent with the original paper), but the distributed one stalls at 12% error.
I think something might be wrong, since the performance should be similar in both cases.
Could you check whether that is the case? (Or maybe I used the wrong way to do distributed training.)
Please feel free to ask if you have any questions about the setup. Thanks.
The single-GPU version code can be found here (I used the earlier 0.11-compatible version).
The code for the distributed version can be found here.
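For context, here is a minimal sketch of the ps = 1 / worker = 1 topology described above. The hostnames, ports, and task indices are assumptions for illustration, not the actual launch configuration:

```python
import tensorflow as tf

# Assumed single-machine cluster with one parameter server and one worker.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
})

# Each process starts one server; the ps process would instead use
# job_name="ps", task_index=0.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Variables are placed on the ps task, ops on the worker:
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.Variable(0, trainable=False, name="global_step")
    # ... build the ResNet model here ...
```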
Some detailed settings:
The command I launched:
Environment info
Operating System: Ubuntu 14.04
Installed version of CUDA and cuDNN: CUDA 7.5, cuDNN 5.1
If installed from source, provide:
`git rev-parse HEAD`: 282823b
`bazel version`:
I built TensorFlow v0.11.0 from source myself.