Horovod behavior with dataset API #223
Hi,
I was doing some tests with the TensorFlow dataset API (tf.data.Dataset) on a single machine with multiple GPUs, but it looks like Horovod is sending the same data to each GPU every time the dataset's iterator is called (I am not using MonitoredTrainingSession but a standard tf.Session()).
Is this behavior intended? Do you have any idea how to overcome this problem?
Comments
@benyoti, that's very interesting. Are you manually setting a random seed? Are you using shuffling?
I am not doing any shuffling or setting a random seed, in order to see what's going on. I wrote a basic script using the Dataset API and AlexNet, and incorporated Horovod following examples/tensorflow_word2vec.py. If I use one GPU:
If I use 2 GPUs:
What is shown is the labels input after calling sess.run(). I have 3 classes with 4 samples per class (12 samples in total), the batch size is 4, and the number of epochs is 2. Ideally, what you would like when using 2 GPUs would be something like:
This is my training loop, in case it helps:
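A minimal sketch of a comparable Horovod + tf.data loop under the 12-sample setup above (all names and parameters here are assumptions, not the original code):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Toy data mirroring the setup above: 3 classes, 4 samples per class.
labels = [0, 1, 2] * 4
dataset = tf.data.Dataset.from_tensor_slices(labels)
dataset = dataset.batch(4).repeat(2)  # batch size 4, 2 epochs
next_labels = dataset.make_one_shot_iterator().get_next()

# Pin each process to its own GPU, as in the Horovod examples.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    while True:
        try:
            batch = sess.run(next_labels)
            # Every rank builds its own iterator over the full dataset,
            # so without sharding all ranks see the same batches.
            print('rank %d labels %s' % (hvd.rank(), batch))
        except tf.errors.OutOfRangeError:
            break
```

Running this with two processes prints identical label sequences on both ranks, which matches the behavior described.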
Can you add shuffling to your dataset? That should help different processes read different data. It also helps to reduce overfitting.
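For reference, a minimal sketch of adding an unseeded shuffle (the buffer size is illustrative):

```python
import tensorflow as tf

d = tf.data.Dataset.range(12)
# No seed is passed, so every process shuffles independently and the
# workers are unlikely to draw identical batches in lockstep.
d = d.shuffle(buffer_size=12).batch(4)
```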
@alsrgv @benyoti I have the same problem. In my tests I have found that if, according to my specified batch size, one epoch should be over in 200 steps, and I am training on 8 GPUs, then instead one epoch takes 1600 steps. This corresponds exactly to the issue pointed out. @alsrgv I do not see any reason for things to improve by shuffling the dataset; this is not a problem of dataset shuffling. Instead it has to do with how each process consumes the dataset: every worker iterates over the full dataset rather than its own slice of it.
@calledbymountains, thanks for sharing. Indeed, per https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shard, you can do: `d = d.shard(hvd.size(), hvd.rank())`
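In context, a minimal pipeline sketch (the range, buffer size, and batch size are illustrative, matching the 12-sample example above):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

d = tf.data.Dataset.range(12)
# Keep every size-th element starting at this process's rank; with
# 2 workers, each one gets a disjoint half (6 samples) of the data.
d = d.shard(hvd.size(), hvd.rank())
d = d.shuffle(buffer_size=6)
d = d.batch(4)
```

Sharding before shuffling keeps the per-worker subsets disjoint.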
Great, thanks! I can confirm this is working as intended :)
So will the same behavior hold true for a multi-node setting as well? Say there are 2 servers with 2 GPUs each (server 1 and server 2). How can I modify this for the multi-node setting?
@gururao001, I believe your desired outcome is what the dataset API will do for you.
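Since hvd.rank() is a global rank across all nodes, the same shard call should cover the two-server case; a sketch (host names and the launch command are assumptions):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# e.g. launched with: horovodrun -np 4 -H server1:2,server2:2 python train.py
hvd.init()

d = tf.data.Dataset.range(12)
# hvd.rank() is global (0..3 across both servers), so each of the
# four processes reads a disjoint quarter of the data.
d = d.shard(hvd.size(), hvd.rank())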
Note that this pipeline with `shard` still makes every instance read (and preprocess) the whole dataset, only to discard everything except its own 1/size share. It would be much more efficient to just load (and preprocess) the data once (e.g. in instance 0, or some external instance), and then use something like a scatter operation to distribute the shards. I posted this question also on StackOverflow.
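One common mitigation, short of an explicit scatter, is to shard on file names before any reading or decoding, so each process only ever opens its own files (the file names here are hypothetical):

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

filenames = ['/data/train-%02d.tfrecord' % i for i in range(8)]  # hypothetical
files = tf.data.Dataset.from_tensor_slices(filenames)
# Shard on file names, before any decoding, so each process opens only
# its own subset of files rather than parsing everything and throwing
# away (size - 1) / size of the work.
files = files.shard(hvd.size(), hvd.rank())
d = files.flat_map(tf.data.TFRecordDataset)
```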