
Memory leak in tf.data when iterating over Dataset.from_generator #65675

Open
cohaegen opened this issue Apr 14, 2024 · 2 comments
Labels
comp:data (tf.data related issues), TF 2.16, type:bug (Bug)

Comments

@cohaegen

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

binary

TensorFlow version

v1.12.1-108954-g88310ddcbdd 2.17.0-dev20240412

Custom code

Yes

OS platform and distribution

Docker: tensorflow/tensorflow:nightly

Mobile device

No response

Python version

3.11.0rc1

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

I discovered what appears to be a memory leak when iterating over a tf.data.Dataset created with from_generator. Process memory usage grows without bound. The effect only appears with certain combinations of TensorFlow and Python, and it may have been introduced with Python 3.11. Here are some combinations I've tested (yes = leak observed):

Python 3.10.10, tensorflow 2.13.0: yes
Python 3.10.10, tensorflow 2.16.1: no
Python 3.10.12, tensorflow v2.15.0-0-g6887368d6d4: no
Python 3.11, tensorflow 2.16.1: yes
Python: 3.11.0rc1, tensorflow v1.12.1-109002-g2c2c0a17f05: yes

Maybe related to https://docs.python.org/3/whatsnew/3.11.html#faster-cpython? I thought Python might be re-using the memory rather than freeing it, but usage grows out of hand (I noticed it because it started taking up tens of GB in one case), and with a generator it seems like it shouldn't. It's odd, though, that 2.13.0 also shows the problem with Python 3.10.10.

Standalone code to reproduce the issue

https://colab.research.google.com/drive/1LmdIqWME19GLFG0E7dsCtRtscLLFF89R?usp=sharing
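For context, here is a minimal sketch of the kind of loop described above; the authoritative reproduction is the Colab notebook linked here. It wraps an endless generator with tf.data.Dataset.from_generator and samples process RSS with psutil. The generator contents, the iteration count, and the use of psutil are assumptions for illustration, not taken from the notebook.

# Hypothetical reproduction sketch -- the actual repro is in the Colab link above.
# Assumed details: an endless generator of small float32 arrays, ds.take(1_000_000),
# and psutil for sampling the process RSS.
import sys
import numpy as np
import psutil
import tensorflow as tf

def gen():
    # Nothing here should accumulate between yields.
    while True:
        yield np.zeros(8, dtype=np.float32)

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=tf.TensorSpec(shape=(8,), dtype=tf.float32),
)

print("Python:", sys.version)
print("TensorFlow:", (tf.version.GIT_VERSION, tf.version.VERSION))

proc = psutil.Process()
for i, _ in enumerate(ds.take(1_000_000), start=1):
    if i % 10_000 == 0:
        # RSS in bytes; on affected Python/TF combinations this climbs steadily.
        print(f"Iterations: {i:12d} Memory use: {proc.memory_info().rss}")

On an unaffected combination the RSS figure should plateau after warm-up; on an affected one it grows roughly linearly with the iteration count, as in the log below.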

Relevant log output

2024-04-14 16:12:10.319053: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
TensorFlow: ('v1.12.1-109002-g2c2c0a17f05', '2.17.0-dev20240413')
Iterations:        10000 Memory use: 489308160
Iterations:        20000 Memory use: 496451584
Iterations:        30000 Memory use: 503353344
Iterations:        40000 Memory use: 510517248
Iterations:        50000 Memory use: 517464064
Iterations:        60000 Memory use: 524591104
...
Iterations:       980000 Memory use: 1173454848
Iterations:       990000 Memory use: 1180618752
Iterations:      1000000 Memory use: 1187770368
2024-04-14 16:19:00.851851: I tensorflow/core/framework/local_rendezvous.cc:404] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence
379



Output from running valgrind (PYTHONMALLOC=malloc forces CPython to use the system allocator, so valgrind can track individual allocations instead of pymalloc arenas):
PYTHONMALLOC=malloc valgrind --track-origins=yes --leak-check=full --show-leak-kinds=definite --trace-children=yes python ./tfdata_test.py
I stopped it at about 80,000 iterations because it runs much more slowly under valgrind.
==324== 24,434,160 (4,660,032 direct, 19,774,128 indirect) bytes in 72,813 blocks are definitely lost in loss record 188,561 of 188,562
==324==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==324==    by 0x4D4AC1: ??? (in /usr/bin/python3.11)
==324==    by 0x5BA613: ??? (in /usr/bin/python3.11)
==324==    by 0x5BB4C4: ??? (in /usr/bin/python3.11)
==324==    by 0x4FE1B5: _PyEval_EvalFrameDefault (in /usr/bin/python3.11)
==324==    by 0x531822: _PyFunction_Vectorcall (in /usr/bin/python3.11)
==324==    by 0x530FB8: ??? (in /usr/bin/python3.11)
==324==    by 0x5DA693: _PyObject_CallMethod_SizeT (in /usr/bin/python3.11)
==324==    by 0x26C7748A: _descriptor_from_pep3118_format (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==    by 0x26C8CBD5: _array_from_buffer_3118.part.0 (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==    by 0x26C8DF4E: _array_from_array_like (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==    by 0x26C6FCD4: PyArray_DiscoverDTypeAndShape_Recursive (in /usr/local/lib/python3.11/dist-packages/numpy/core/_multiarray_umath.cpython-311-x86_64-linux-gnu.so)
==324==
==324== 26,302,320 bytes in 73,062 blocks are definitely lost in loss record 188,562 of 188,562
==324==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==324==    by 0x625128: ??? (in /usr/bin/python3.11)
==324==    by 0x6250BA: PyThreadState_New (in /usr/bin/python3.11)
==324==    by 0x64E44F: PyGILState_Ensure (in /usr/bin/python3.11)
==324==    by 0xA597C58: tensorflow::PyFuncOp::Compute(tensorflow::OpKernelContext*) (in /usr/local/lib/python3.11/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so)
==324==    by 0x92F0F3F: tensorflow::ThreadPoolDevice::Compute(tensorflow::OpKernel*, tensorflow::OpKernelContext*) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x927A539: tensorflow::(anonymous namespace)::ExecutorState<tensorflow::SimplePropagatorState>::Process(tensorflow::SimplePropagatorState::TaggedNode const&, long) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x9A3D39D: std::_Function_handler<void (), std::_Bind<tensorflow::data::RunnerWithMaxParallelism(std::function<void (std::function<void ()>)>, int)::$_0::operator()(std::function<void (std::function<void ()>)> const&, std::function<void ()>) const::{lambda(std::function<void ()> const&)#1} (std::function<void ()>)> >::_M_invoke(std::_Any_data const&) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x9C2299F: Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x9C223D0: void std::__invoke_impl<void, tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}&>(std::__invoke_other, tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}&) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x960E55A: tsl::(anonymous namespace)::PThread::ThreadFn(void*) (in /usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2)
==324==    by 0x4A27AC2: start_thread (pthread_create.c:442)
==324==
==324== LEAK SUMMARY:
==324==    definitely lost: 30,984,576 bytes in 146,328 blocks
==324==    indirectly lost: 19,783,920 bytes in 72,736 blocks
==324==      possibly lost: 136,684,089 bytes in 1,082,632 blocks
==324==    still reachable: 17,341,568 bytes in 182,582 blocks
==324==                       of which reachable via heuristic:
==324==                         stdstring          : 341 bytes in 8 blocks
==324==                         newarray           : 104,387 bytes in 320 blocks
==324==                         multipleinheritance: 7,168 bytes in 24 blocks
==324==         suppressed: 0 bytes in 0 blocks
@Venkat6871

Hi @cohaegen,
I tried to run your code on Colab using TF v2.15, 2.16.1, and nightly, but I am not seeing the issue. Please find the gist here for reference.

Thank you!

Venkat6871 added the stat:awaiting response (Status - Awaiting response from author) and comp:data (tf.data related issues) labels on Apr 15, 2024
@cohaegen
Author

cohaegen commented Apr 15, 2024 via email

google-ml-butler bot removed the stat:awaiting response (Status - Awaiting response from author) label on Apr 15, 2024