Enabling XLA in tensorflow 2.16 causes memory leaks #64170
Comments
I observed memory leaks in Google Colab with a T4 GPU.
@Venkat6871, is there any progress on this issue? I'm also affected by it. I tried all TF versions from 2.11 through 2.16, and the leak seems to have been present since TF 2.12; it did not seem to happen with TF 2.11. Thank you.
I face the exact same problem, using Ubuntu 20.04, an NVIDIA RTX 3060, and Python 3.11. @Venkat6871, is there any progress on this issue?
This document from NVIDIA on tweaking environment variables for XLA memory could help. We have yet to test this out, though, and will try to keep this thread updated on any findings: By the way, XLA's cluster formation, operation fusion, and compilation increase memory use for each distinct input shape that the framework comes across. If we keep the shape constant, we've found that the memory growth stabilizes after some time, even with TF 2.15.
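The point about input shapes can be illustrated with a small sketch. It assumes the varying shapes come from variable-length sequences, which may not match every setup; `pad_to_fixed_length` is a hypothetical helper, not part of TensorFlow. Truncating or padding every example to one length means XLA only ever sees a single input shape.

```python
def pad_to_fixed_length(sequences, length, pad_value=0):
    """Truncate or right-pad each sequence so every row has exactly `length` items."""
    padded = []
    for seq in sequences:
        row = list(seq)[:length]                        # drop anything past `length`
        row.extend([pad_value] * (length - len(row)))   # fill short rows with pad_value
        padded.append(row)
    return padded


# Three sequences of different lengths become one fixed-shape batch (3 x 4).
batch = pad_to_fixed_length([[1, 2, 3], [4], [5, 6, 7, 8, 9]], length=4)
print(batch)  # [[1, 2, 3, 0], [4, 0, 0, 0], [5, 6, 7, 8]]
```

A smaller final batch is another common source of a second shape. With `tf.data`, passing `drop_remainder=True` to `Dataset.batch` keeps the batch dimension constant as well.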
I'm also encountering the exact same problem, using Ubuntu 20.04, an NVIDIA GTX 1070, and Python 3.10. It would be great to get this fixed, since the memory leak can get pretty astronomical if left unchecked. I've had it grow up to 12 GB until it crashed my kernel.
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
tensorflow 2.16.1
(tensorflow[and-cuda] 2.16.1)
Custom code
Yes
OS platform and distribution
Linux Ubuntu 22.04
Mobile device
No response
Python version
3.12
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
12.3/8.9.7
GPU model and memory
RTX 3060 Ti, RTX 3060
Current behavior?
Calling tf.keras.Model.fit with XLA enabled causes a memory leak.
Note that XLA seems to be enabled by default as of TensorFlow 2.16.1.
Setting tf.keras.Model.jit_compile to False disables XLA and eliminates the memory leak.
I think that updating to TensorFlow 2.16 will cause memory leaks in almost all existing programs.
Please fix this problem as soon as possible, or warn people about it in documents such as the release notes or the README.
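As a minimal sketch of the workaround described above: `compile_without_xla` is a hypothetical helper (not a TensorFlow API) that forwards its arguments to the model's `compile` method and forces `jit_compile=False`, overriding whatever the caller passed. It assumes the model exposes a Keras-style `compile(**kwargs)` method that accepts `jit_compile`.

```python
def compile_without_xla(model, **compile_kwargs):
    """Compile a Keras-style model with XLA JIT compilation explicitly disabled.

    Hypothetical helper: any `jit_compile` value supplied by the caller is
    overridden with False before the arguments reach `model.compile`.
    """
    compile_kwargs["jit_compile"] = False
    model.compile(**compile_kwargs)
    return model
```

With a `tf.keras` model this would be called before `fit`, for example `compile_without_xla(model, optimizer="adam", loss="mse")`. Passing `jit_compile=False` directly to `model.compile` has the same effect; the helper only makes the intent harder to forget.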
Standalone code to reproduce the issue
Relevant log output
No response