Bijectors are orders of magnitude slower in tf2.1 autograph distributed mirrored single-gpu mode #35415
Comments
olegmyrk@ Thank you for posting a detailed summary! To debug performance issues such as this one we need timeline traces. Could you post traces for the configurations you compared?
Here is trace1 with TF 1.x. Here is trace3 with TF 2.x with single-GPU Mirrored strategy. Please note that adding tracing makes training significantly slower on its own (especially in TF 1.x).
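For reference, a minimal sketch of how a Chrome-format timeline trace can be captured in TF 1.x (the matmul graph below is a trivial stand-in for the real training step):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Trivial stand-in graph; in practice this would be the actual training step.
x = tf.random_normal([1024, 1024])
y = tf.matmul(x, x)

# Request a full trace for one session.run call.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Convert the collected step stats to a Chrome trace, viewable at chrome://tracing.
tl = timeline.Timeline(run_metadata.step_stats)
with open('trace1.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())
```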
I have managed to create minimal TF1 and TF2 scripts that demonstrate the issue. As you can see from the logs, building the graph in TF1 is 2x faster than in TF2 single-GPU mirrored mode and 10x faster than in TF2 multi-GPU mirrored mode. Note that the TF2 script contains a commented-out line of code that also makes the training step 2x slower in multi-GPU mirrored mode.
CUDA_VISIBLE_DEVICES=0 time python3 test_maf_tf1.py
CUDA_VISIBLE_DEVICES=0 time python3 test_maf_tf2.py
CUDA_VISIBLE_DEVICES=0,1,2,3 time python3 test_maf_tf2.py
TF1 script:
TF2 script:
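The attached scripts are not reproduced here. As a rough sketch of what such a TF2 benchmark might look like, assuming a small stack of masked autoregressive flows trained under MirroredStrategy (the flow sizes, dimensions, and batch shape below are illustrative guesses, not values from the attached scripts; `strategy.run` is spelled `strategy.experimental_run_v2` in TF 2.1):

```python
import time
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # A stack of 4 masked autoregressive flows over a 64-dim standard normal.
    maf = tfd.TransformedDistribution(
        distribution=tfd.MultivariateNormalDiag(loc=tf.zeros([64])),
        bijector=tfb.Chain([
            tfb.MaskedAutoregressiveFlow(
                shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
                    params=2, hidden_units=[256, 256]))
            for _ in range(4)]))
    _ = maf.log_prob(tf.zeros([1, 64]))  # force variable creation under the scope
    optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def train_step(x):
    def step_fn(x):
        with tf.GradientTape() as tape:
            loss = -tf.reduce_mean(maf.log_prob(x))
        grads = tape.gradient(loss, maf.trainable_variables)
        optimizer.apply_gradients(zip(grads, maf.trainable_variables))
        return loss
    per_replica = strategy.run(step_fn, args=(x,))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica, axis=None)

x = tf.random.normal([32, 64])
for label in ('first step (includes graph build)', 'second step', 'third step'):
    start = time.time()
    train_step(x)
    print(label, time.time() - start)
```

The first call to `train_step` includes tracing and graph construction, which is where the reported TF1-vs-TF2 gap shows up.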
Could you please test the performance with the latest TensorFlow version? Many of the experimental modules have been moved to stable, so there could be an improvement in performance. Refer to this document for details. Thank you!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
System information
* GPU model and memory: Tesla V100-SXM2-16GB
Describe the current behavior
I'm using Bijectors as a flexible prior for a VAE. This code has negligible overhead in tf1.x (for input batch size 18x256x256x3). In tf2.1 autograph distributed mirrored mode
https://github.com/olegmyrk/SPADE-Tensorflow/blob/85b5fd7943296561dc3d54557fec5346c2adea58/SPADE.py#L1152
with a single GPU it increases the training step duration from 1 second (tf1.x) to 1.9 seconds (tf2.1):
https://github.com/olegmyrk/SPADE-Tensorflow/blob/85b5fd7943296561dc3d54557fec5346c2adea58/SPADE.py#L190
I'm using a custom masked autoregressive template
https://github.com/olegmyrk/SPADE-Tensorflow/blob/85b5fd7943296561dc3d54557fec5346c2adea58/masked_autoregressive.py
but it is just as slow with the default one:
https://www.tensorflow.org/probability/api_docs/python/tfp/bijectors/masked_autoregressive_default_template
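For context, a flexible prior of this kind is typically wired up roughly as follows (TF1-style graph-mode code, since the default template uses tf1 variable templates; the latent dimension and hidden-layer sizes are illustrative, not the ones used in SPADE.py, and newer TFP versions recommend `tfb.AutoregressiveNetwork` over this deprecated template):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

latent_dim = 256  # illustrative; not the actual VAE latent size

# Flexible VAE prior: a standard normal pushed through a masked
# autoregressive flow built from the default template.
prior = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(loc=tf.zeros([latent_dim])),
    bijector=tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.masked_autoregressive_default_template(
            hidden_layers=[512, 512])))

z = prior.sample(4)        # sampling is autoregressive: one network pass per dimension
log_p = prior.log_prob(z)  # density evaluation needs only a single pass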
Possible suspects:
https://github.com/olegmyrk/SPADE-Tensorflow/blob/85b5fd7943296561dc3d54557fec5346c2adea58/masked_autoregressive.py#L44
https://github.com/olegmyrk/SPADE-Tensorflow/blob/85b5fd7943296561dc3d54557fec5346c2adea58/masked_autoregressive.py#L115
Describe the expected behavior
Performance in tf2.1 and tf1.x should be comparable.
Code to reproduce the issue
TF2.x code:
https://github.com/olegmyrk/SPADE-Tensorflow/blob/85b5fd7943296561dc3d54557fec5346c2adea58/SPADE.py#L190
TF1.x code:
https://github.com/olegmyrk/SPADE-Tensorflow/blob/develop/SPADE.py#L190
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.