tf.keras model.fit calls slow with TPU distribute strategy #30162
In order to expedite the troubleshooting process, please provide a code snippet to reproduce the issue reported here. Thanks!
I have just added a few lines to illustrate the issue. I have a very big workflow, so I cannot add the whole code here. The issue is that when model.fit is called in a loop (as shown above), the training slows down considerably.
@capilano It is difficult to reproduce the issue. Would it be possible to provide a minimal code snippet that reproduces the issue, so that we can reproduce it in our environment for faster resolution? Thanks!
I have not actually tested this code. I just wrote it here on the fly, so there may be minor errors.
Ideally, the distribute strategy should support the fit_generator method, because that makes it possible to use a TFRecord Dataset and load data directly from GCS buckets. It is almost never going to be possible to preload data into memory, especially when there is a data augmentation step in the input pipeline.
@capilano I tried reproducing the issue with the provided code but I received
TPU_WORKER is the TPU address. If you are using Google Colab (with the TPU accelerator), I think you can leave it blank: just call the function without passing any argument. If that does not work,
@capilano I have tried reproducing the issue by adding the piece of code
OK, I have changed the code. If you just copy/paste the code it should work. I have also changed the network; I am just using a few Conv layers to check, since maybe the ResNet50 model has some unsupported layers. In the data loading path, the last two lines are outside the for loop. I am not able to indent the code here for some reason.
@capilano Thanks for the complete code. I am able to reproduce the issue now with TF 1.14.0. Thanks!
I'm facing the same issue. As
Even when I load more data to decrease the |
You can use a TFRecord dataset and do this with fit, applying your data augmentation pipeline via a map function, as long as you can express your augmentations with TensorFlow functions.
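A minimal sketch of that suggestion (the `augment` and `make_dataset` names are hypothetical, and the in-memory tensors stand in for a real TFRecord source): the augmentation uses only TF ops, so it can run inside the tf.data pipeline.

```python
import tensorflow as tf

def augment(image, label):
    # Augmentations built purely from TF ops, so they run inside the
    # tf.data input pipeline rather than in Python.
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, label

def make_dataset(images, labels, batch_size):
    # Hypothetical helper: builds a batched, augmented dataset that can
    # be passed straight to model.fit in a single call.
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    return ds.batch(batch_size, drop_remainder=True)
```

The same `map` step would follow a `TFRecordDataset` parse step in a real pipeline.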
I could reproduce the issue. But I am not sure whether it is 100X slower or not. Here is the gist. Thanks! |
@jvishnuvardhan Just to give you guys a heads up: you can pass a dataset directly to model.fit, so multiple calls to fit are not really necessary if you are using a pipeline with only TensorFlow functions for data augmentation.
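A hedged sketch of that point (the tiny Sequential model and synthetic dataset below are stand-ins, not the original workload): a repeated dataset plus `steps_per_epoch` lets one model.fit call drive the whole run, instead of a Python loop that calls fit repeatedly.

```python
import tensorflow as tf

def train_once(dataset, epochs=2, steps_per_epoch=2):
    # A toy model stands in for the real network; the point is the single
    # model.fit call, which avoids repeatedly adding ops to the graph.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(4, activation="relu", input_shape=(3,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mse", optimizer="sgd")
    # One fit call; Keras handles the epoch/step looping internally.
    return model.fit(dataset, epochs=epochs,
                     steps_per_epoch=steps_per_epoch, verbose=0)
```

Under a TPU strategy the model construction and compile would sit inside `strategy.scope()`, but the single-fit structure is the same.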
I am not surprised that the notebook is slow, as the data processing is all happening on the Colab VM rather than on the TPU system (which has much more processing power than the Colab VMs). With the notebook, the For performance reasons, especially on non-trivial image models, you need to use tf.data Datasets with TF-supported ops, and load the raw data from GCS.
Is |
Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template
System information
You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
TPU distribution strategy does not support model.fit_generator, and repeated model.fit calls result in a 50x slowdown, presumably because each call adds operations to the graph.
Describe the expected behavior
Code to reproduce the issue
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
with strategy.scope():
    model = .....  ## Your tf.keras model
    model.compile(loss=custom_loss, optimizer=custom_optimizer)
for i in range(num_its):
    data, labels = next(generator_fn())
    model.fit(data, labels)
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.