tf.nn.ctc_loss calculated sequentially when using tf.distribute.MirroredStrategy() #52752
Labels: comp:dist-strat, stat:awaiting tensorflower, TF 2.5, type:performance
System information
Describe the current behavior
I am working on a CTC-based model and wanted to accelerate training by applying MirroredStrategy to my custom training loop. I modified my code according to the tutorials but did not observe any speedup. After digging into it with the profiler, I noticed in the trace viewer that each GPU seemed to compute the loss one after another instead of concurrently (see image further below). There also seemed to be a lot of communication going on between the devices and the host. Moreover, the overview page stated that ~95% of the device time was spent on eager execution. Maybe that's related somehow?
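For reference, a trace like this can be captured with the TF profiler API. The following is only a minimal sketch of how such a capture might look; the logdir and the dummy matmul step are illustrative stand-ins for the actual CTC training step shown further below, not the code the issue author ran:

```python
import tensorflow as tf

# Sketch of capturing a trace for TensorBoard's trace viewer.
# "logs/profile" and the dummy matmul are illustrative only; in the
# actual run, the distributed CTC train step would sit inside the loop.
tf.profiler.experimental.start("logs/profile")
for step in range(10):
    with tf.profiler.experimental.Trace("train", step_num=step):
        x = tf.random.normal([256, 256])
        tf.matmul(x, x)  # placeholder for strategy.run(train_step, ...)
tf.profiler.experimental.stop()
```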
Describe the expected behavior
Maybe there is some misunderstanding on my side about how exactly ctc_loss and MirroredStrategy work, but I would expect it to be possible to compute the loss concurrently on all GPUs.
Standalone code to reproduce the issue
Here is some example code that reproduces the issue I am facing. I removed the model forward/backward pass, as it is not important and did not seem to cause any problems:
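The original code block did not survive extraction. Below is a minimal sketch of a repro consistent with the description: a custom training loop that calls tf.nn.ctc_loss inside strategy.run under MirroredStrategy, with no model in the loop. All shapes, the blank index, and the synthetic data are assumptions for illustration, not the author's actual values:

```python
import tensorflow as tf

# Illustrative shapes only (the original repro's values are unknown).
BATCH_PER_REPLICA = 8
MAX_TIME = 100        # number of logit frames per example
NUM_CLASSES = 30      # includes the blank symbol at index 0
MAX_LABEL_LEN = 20

strategy = tf.distribute.MirroredStrategy()
GLOBAL_BATCH = BATCH_PER_REPLICA * strategy.num_replicas_in_sync

# Synthetic logits/labels standing in for real model outputs and targets;
# labels avoid index 0 because it is used as the blank index below.
logits = tf.random.normal([GLOBAL_BATCH * 10, MAX_TIME, NUM_CLASSES])
labels = tf.random.uniform(
    [GLOBAL_BATCH * 10, MAX_LABEL_LEN],
    minval=1, maxval=NUM_CLASSES, dtype=tf.int32)
dataset = (tf.data.Dataset.from_tensor_slices((logits, labels))
           .batch(GLOBAL_BATCH))
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(dist_inputs):
    def step_fn(inputs):
        x, y = inputs
        # Per-example CTC loss on this replica's shard of the batch.
        per_example_loss = tf.nn.ctc_loss(
            labels=y,
            logits=x,
            label_length=tf.fill(tf.shape(y)[:1], MAX_LABEL_LEN),
            logit_length=tf.fill(tf.shape(x)[:1], MAX_TIME),
            logits_time_major=False,
            blank_index=0)
        # Scale by the global batch size so the cross-replica sum
        # below yields the global mean loss.
        return tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH)

    per_replica_loss = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

for batch in dist_dataset:
    print(train_step(batch).numpy())
```

Under the reported behavior, profiling this loop would show the ctc_loss computation of each replica executing one after another in the trace viewer rather than overlapping in time.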
Other info / logs
Here is a screenshot of the trace viewer of the above code when executed on 3 GPUs.