
Memory leak in tf.data when iterating over Dataset.from_generator #65675

Open
cohaegen opened this issue Apr 14, 2024 · 2 comments
Labels
comp:data (tf.data related issues), TF 2.16, type:bug (Bug)

Comments

@cohaegen

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

v1.12.1-108954-g88310ddcbdd 2.17.0-dev20240412

Custom code

Yes

OS platform and distribution

Docker: tensorflow/tensorflow:nightly

Mobile device

No response

Python version

3.11.0rc1

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I discovered what appears to be a memory leak when iterating over a tf.data.Dataset created with from_generator. Process memory usage grows without bound. The effect only appears with certain combinations of TensorFlow and Python, and it may have been introduced with Python 3.11. Here are some combinations I've tested (yes = leak observed):

Python 3.10.10, tensorflow 2.13.0: yes
Python 3.10.10, tensorflow 2.16.1: no
Python 3.10.12, tensorflow v2.15.0-0-g6887368d6d4: no
Python 3.11, tensorflow 2.16.1: yes
Python: 3.11.0rc1, tensorflow v1.12.1-109002-g2c2c0a17f05: yes

Maybe related to https://docs.python.org/3/whatsnew/3.11.html#faster-cpython? I thought Python might be re-using the memory rather than freeing it, but usage grows out of hand (I noticed it because it started taking up tens of GB in one case), and with a generator it seems like it shouldn't. It's odd, though, that 2.13.0 also shows the problem with Python 3.10.10.

Standalone code to reproduce the issue

https://colab.research.google.com/drive/1LmdIqWME19GLFG0E7dsCtRtscLLFF89R?usp=sharing
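For context, here is a minimal sketch of the kind of loop described above; the authoritative reproduction is the Colab notebook linked here. It wraps an endless generator with tf.data.Dataset.from_generator and samples process RSS with psutil. The generator contents, the iteration count, and the use of psutil are assumptions for illustration, not taken from the notebook.

# Hypothetical reproduction sketch -- the actual repro is in the Colab link above.
# Assumed details: an endless generator of small float32 arrays, ds.take(1_000_000),
# and psutil for sampling the process RSS.
import sys
import numpy as np
import psutil
import tensorflow as tf

def gen():
    # Nothing here should accumulate between yields.
    while True:
        yield np.zeros(8, dtype=np.float32)

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=(8,), dtype=tf.float32),
)

print("Python:", sys.version)
print("TensorFlow:", (tf.version.GIT_VERSION, tf.version.VERSION))

proc = psutil.Process()
for i, _ in enumerate(ds.take(1_000_000), start=1):
    if i % 10_000 == 0:
        # RSS in bytes; on affected Python/TF combinations this climbs steadily.
        print(f"Iterations: {i:12d} Memory use: {proc.memory_info().rss}")

On an unaffected combination the RSS figure should plateau after warm-up; on an affected one it grows roughly linearly with the iteration count, as in the log below.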

Relevant log output

2024-04-14 16:12:10.319053: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
TensorFlow: ('v1.12.1-109002-g2c2c0a17f05', '2.17.0-dev20240413')
Iterations:        10000 Memory use: 489308160
Iterations:        20000 Memory use: 496451584
Iterations:        30000 Memory use: 503353344
Iterations:        40000 Memory use: 510517248
Iterations:        50000 Memory use: 517464064
Iterations:        60000 Memory use: 524591104
...
Iterations:       980000 Memory use: 1173454848
Iterations:       990000 Memory use: 1180618752
Iterations:      1000000 Memory use: 1187770368
2024-04-14 16:19:00.851851: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
379



Output from running valgrind (PYTHONMALLOC=malloc forces CPython to use the system allocator, so valgrind can track individual allocations instead of pymalloc arenas):
PYTHONMALLOC=malloc valgrind --track-origins=yes --leak-check=full --show-leak-kinds=definite --trace-children=yes python ./tfdata_test.py
I stopped it at about 80,000 iterations because it runs much more slowly under valgrind.
==324== 24,434,160 (4,660,032 direct, 19,774,128 indirect) bytes in 72,813 blocks are definitely lost in loss record 188,561 of 188,562
==324==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==324==    by 0x4D4AC1: ??? (in /usr/bin/python3.11)
==324==    by 0x5BA613: ??? (in /usr/bin/python3.11)
==324==    by 0x5BB4C4: ??? (in /usr/bin/python3.11)
==324==    by 0x4FE1B5: _PyEval_EvalFrameDefault (in /usr/bin/python3.11)
==324==    by 0x531822: _PyFunction_Vectorcall (in /usr/bin/python3.11)
==324==    by 0x530FB8: ??? (in /usr/bin/python3.11)
==324==    by 0x5DA693: _PyObject_CallMethod_SizeT (in /usr/bin/python3.11)
==324==    by 0x26C7748A: _descriptor_from_pep3118_format (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==    by 0x26C8CBD5: _array_from_buffer_3118.part.0 (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==    by 0x26C8DF4E: _array_from_array_like (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==    by 0x26C6FCD4: PyArray_DiscoverDTypeAndShape_Recursive (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==
==324== 26,302,320 bytes in 73,062 blocks are definitely lost in loss record 188,562 of 188,562
==324==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==324==    by 0x625128: ??? (in /usr/bin/python3.11)
==324==    by 0x6250BA: PyThreadState_New (in /usr/bin/python3.11)
==324==    by 0x64E44F: PyGILState_Ensure (in /usr/bin/python3.11)
==324==    by 0xA597C58: tensorflow::PyFuncOp::Compute(tensorflow::OpKernelContext*) (in /usr/local/lib/python3.11/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
==324==    by 0x92F0F3F: tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x927A539: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode const&, long) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x9A3D39D: std::_Function_handler<void (), std::_Bind<tensorflow::data::RunnerWithMaxParallelism(std::function<void (std::function<void ()>)>, int)::$_0::operator()(std::function<void (std::function<void ()>)> const&, std::function<void ()>) const::{lambda(std::function<void ()> const&)#1} (std::function<void ()>)> >::_M_invoke(std::_Any_data const&) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x9C2299F: Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x9C223D0: void std::__invoke_impl<void, tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}&>(std::__invoke_other, tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}&) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x960E55A: tsl::(anonymous namespace)::PThread::ThreadFn(void*) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x4A27AC2: start_thread (pthread_create.c:442)
==324==
==324== LEAK SUMMARY:
==324==    definitely lost: 30,984,576 bytes in 146,328 blocks
==324==    indirectly lost: 19,783,920 bytes in 72,736 blocks
==324==      possibly lost: 136,684,089 bytes in 1,082,632 blocks
==324==    still reachable: 17,341,568 bytes in 182,582 blocks
==324==                       of which reachable via heuristic:
==324==                         stdstring          : 341 bytes in 8 blocks
==324==                         newarray           : 104,387 bytes in 320 blocks
==324==                         multipleinheritance: 7,168 bytes in 24 blocks
==324==         suppressed: 0 bytes in 0 blocks
@Venkat6871

Hi @cohaegen,
I tried to run your code on Colab using TF v2.15, 2.16.1, and nightly, but I am not seeing the issue. Please find the gist here for reference.

Thank you!

Venkat6871 added the stat:awaiting response (Status - Awaiting response from author) and comp:data (tf.data related issues) labels on Apr 15, 2024
@cohaegen
Author

cohaegen commented Apr 15, 2024 via email

google-ml-butler bot removed the stat:awaiting response (Status - Awaiting response from author) label on Apr 15, 2024