
Unbounded Memory leak when using tf.py_function in tf.data.Dataset.map() #61344

Open
Pyrestone opened this issue Jul 20, 2023 · 4 comments
Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.13 (For issues related to Tensorflow 2.13), type:bug (Bug), type:performance (Performance Issue)

@Pyrestone

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

v2.13.0-rc2-7-g1cb1a030a62 2.13.0

Custom code

Yes

OS platform and distribution

Linux Ubuntu 20.04, Google Colab

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.8 / 8.6

GPU model and memory

various, e.g. 2080ti, 3080ti mobile, Colab T4

Current behavior?

Using tf.py_function in a function that is applied to a tf.data.Dataset via its map() method causes a (C++-level) memory leak.

In my real training run, with more complex code inside the py_function, this led to the Python script eventually consuming upwards of 30 GB of RAM during a model.fit() loop, despite taking less than 3 GB of RAM during the initial epoch.

tf.py_function also more generally causes memory leaks in all kinds of places. See the flags at the top of the linked Colab for details.

Standalone code to reproduce the issue

See Colab: https://colab.research.google.com/drive/1auVJPyHApl4__4FF-rV3xNcJrqYZc38R?usp=sharing

Iterating through a dataset with a tf.py_function in it causes unbounded linear memory consumption growth.
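
For reference, a minimal sketch of the pattern that triggers the growth (illustrative only, not the exact notebook code; psutil is assumed for the RSS numbers shown in the log below, and the dataset and function names are made up):

```python
import os

import psutil
import tensorflow as tf


def eager_identity(x):
    # Arbitrary eager-mode work; the growth appears regardless of the body.
    return x + 1


def map_fn(x):
    # Wrapping the Python function with tf.py_function is what triggers the leak.
    return tf.py_function(eager_identity, inp=[x], Tout=tf.int64)


dataset = (
    tf.data.Dataset.range(1_000_000)
    .map(map_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
)

process = psutil.Process(os.getpid())
previous_rss = process.memory_info().rss

for step, _ in enumerate(dataset):
    if step % 200 == 0:
        rss = process.memory_info().rss
        print(f"**Batch {step}**")
        print(f"Memory usage: {rss}")
        print(f"Delta: {(rss - previous_rss) / 2**20:.2f} MiB")
        previous_rss = rss
```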

Relevant log output

**Batch 0**
Memory usage: 1732120576
Delta: 1651.88 MiB
**Batch 200**
Memory usage: 1736859648
Delta: 4.52 MiB
**Batch 400**
Memory usage: 1740644352
Delta: 3.61 MiB
**Batch 600**
Memory usage: 1744699392
Delta: 3.87 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.34 MiB
**Batch 800**
Memory usage: 1748750336
Delta: 3.86 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.33 MiB
**Batch 1000**
Memory usage: 1752805376
Delta: 3.87 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.33 MiB
**Batch 1200**
Memory usage: 1757401088
Delta: 4.38 MiB
Average Delta since start: 4.00 MiB/iteration
Estimated growth per 1000 steps: 19.98 MiB
**Batch 1400**
Memory usage: 1761456128
Delta: 3.87 MiB
Average Delta since start: 3.97 MiB/iteration
Estimated growth per 1000 steps: 19.85 MiB
**Batch 1600**
Memory usage: 1765240832
Delta: 3.61 MiB
Average Delta since start: 3.91 MiB/iteration
Estimated growth per 1000 steps: 19.55 MiB
**Batch 1800**
Memory usage: 1769025536
Delta: 3.61 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.33 MiB
**Batch 2000**
Memory usage: 1773621248
Delta: 4.38 MiB
Average Delta since start: 3.93 MiB/iteration
Estimated growth per 1000 steps: 19.66 MiB
**Batch 2200**
Memory usage: 1777676288
Delta: 3.87 MiB
Average Delta since start: 3.92 MiB/iteration
Estimated growth per 1000 steps: 19.62 MiB
**Batch 2400**
Memory usage: 1781731328
Delta: 3.87 MiB
Average Delta since start: 3.92 MiB/iteration
Estimated growth per 1000 steps: 19.59 MiB
**Batch 2600**
Memory usage: 1785786368
Delta: 3.87 MiB
Average Delta since start: 3.91 MiB/iteration
Estimated growth per 1000 steps: 19.57 MiB
@Pyrestone (Author)

See also #35084 and #51839 (both closed, but the issue reappeared in 2.13.0).

@SuryanarayanaY added the TF 2.13 and type:performance labels on Jul 21, 2023
@SuryanarayanaY (Collaborator)

Hi @Pyrestone ,

I have replicated the reported behaviour and attached a gist for reference.

I can observe memory increasing from batch 0 to batch 3600. This needs to be investigated.

@SuryanarayanaY added the stat:awaiting tensorflower label on Jul 21, 2023
@Pyrestone (Author)

Note for people finding this issue in the meantime:

A temporary (or, I suppose, permanent) workaround is to use tf.numpy_function instead of tf.py_function in tf.data.Dataset.map(); see the sketch below.

It behaves very similarly, except that the wrapped function receives and returns NumPy arrays instead of eager tensors.
The only other notable difference is that tf.numpy_function doesn't support gradient computation (which probably shouldn't matter inside a dataset.map() call).
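
A rough sketch of the swap, following the same illustrative pipeline as in the reproduction above (names are made up; the wrapped function now sees NumPy arrays):

```python
import numpy as np
import tensorflow as tf


def eager_identity_np(x):
    # x arrives as a NumPy array rather than an eager tensor.
    return x + np.int64(1)


def map_fn(x):
    # tf.numpy_function instead of tf.py_function; same inp/Tout interface.
    return tf.numpy_function(eager_identity_np, inp=[x], Tout=tf.int64)


dataset = (
    tf.data.Dataset.range(1_000_000)
    .map(map_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
)
```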

@hyliker commented Sep 7, 2023

@Pyrestone Thanks for your solution.
