
Unbounded Memory leak when using tf.py_function in tf.data.Dataset.map() #61344

Open
Pyrestone opened this issue Jul 20, 2023 · 4 comments
Labels: stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.13 (For issues related to Tensorflow 2.13), type:bug (Bug), type:performance (Performance Issue)

@Pyrestone

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

No

Source

binary

TensorFlow version

v2.13.0-rc2-7-g1cb1a030a62 2.13.0

Custom code

Yes

OS platform and distribution

Linux Ubuntu 20.04, Google Colab

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.8 / 8.6

GPU model and memory

various, e.g. 2080ti, 3080ti mobile, Colab T4

Current behavior?

Using tf.py_function in a function that is applied to a tf.data.Dataset via its map() method causes a (C++-level) memory leak.

In my real training run, with more complex code inside the py_function, this led to the Python script eventually consuming upwards of 30 GB of RAM during a model.fit() loop, despite taking less than 3 GB of RAM during the initial epoch.

tf.py_function also more generally causes memory leaks in all kinds of places. See the flags at the top of the linked Colab for details.

Standalone code to reproduce the issue

See Colab: https://colab.research.google.com/drive/1auVJPyHApl4__4FF-rV3xNcJrqYZc38R?usp=sharing

Iterating through a dataset with a tf.py_function in it causes unbounded linear memory consumption growth.
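
For reference, a minimal sketch of the pattern that triggers the growth (illustrative only, not the exact notebook code; psutil is assumed for the RSS numbers shown in the log below, and the dataset and function names are made up):

```python
import os

import psutil
import tensorflow as tf


def eager_identity(x):
    # Arbitrary eager-mode work; the growth appears regardless of the body.
    return x + 1


def map_fn(x):
    # Wrapping the Python function with tf.py_function is what triggers the leak.
    return tf.py_function(eager_identity, inp=[x], Tout=tf.int64)


dataset = (
    tf.data.Dataset.range(1_000_000)
    .map(map_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
)

process = psutil.Process(os.getpid())
previous_rss = process.memory_info().rss

for step, _ in enumerate(dataset):
    if step % 200 == 0:
        rss = process.memory_info().rss
        print(f"**Batch {step}**")
        print(f"Memory usage: {rss}")
        print(f"Delta: {(rss - previous_rss) / 2**20:.2f} MiB")
        previous_rss = rss
```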

Relevant log output

**Batch 0**
Memory usage: 1732120576
Delta: 1651.88 MiB
**Batch 200**
Memory usage: 1736859648
Delta: 4.52 MiB
**Batch 400**
Memory usage: 1740644352
Delta: 3.61 MiB
**Batch 600**
Memory usage: 1744699392
Delta: 3.87 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.34 MiB
**Batch 800**
Memory usage: 1748750336
Delta: 3.86 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.33 MiB
**Batch 1000**
Memory usage: 1752805376
Delta: 3.87 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.33 MiB
**Batch 1200**
Memory usage: 1757401088
Delta: 4.38 MiB
Average Delta since start: 4.00 MiB/iteration
Estimated growth per 1000 steps: 19.98 MiB
**Batch 1400**
Memory usage: 1761456128
Delta: 3.87 MiB
Average Delta since start: 3.97 MiB/iteration
Estimated growth per 1000 steps: 19.85 MiB
**Batch 1600**
Memory usage: 1765240832
Delta: 3.61 MiB
Average Delta since start: 3.91 MiB/iteration
Estimated growth per 1000 steps: 19.55 MiB
**Batch 1800**
Memory usage: 1769025536
Delta: 3.61 MiB
Average Delta since start: 3.87 MiB/iteration
Estimated growth per 1000 steps: 19.33 MiB
**Batch 2000**
Memory usage: 1773621248
Delta: 4.38 MiB
Average Delta since start: 3.93 MiB/iteration
Estimated growth per 1000 steps: 19.66 MiB
**Batch 2200**
Memory usage: 1777676288
Delta: 3.87 MiB
Average Delta since start: 3.92 MiB/iteration
Estimated growth per 1000 steps: 19.62 MiB
**Batch 2400**
Memory usage: 1781731328
Delta: 3.87 MiB
Average Delta since start: 3.92 MiB/iteration
Estimated growth per 1000 steps: 19.59 MiB
**Batch 2600**
Memory usage: 1785786368
Delta: 3.87 MiB
Average Delta since start: 3.91 MiB/iteration
Estimated growth per 1000 steps: 19.57 MiB
@Pyrestone (Author)

See also #35084 and #51839 (both closed, but the issue reappeared in 2.13.0).

@SuryanarayanaY added the TF 2.13 and type:performance labels on Jul 21, 2023
@SuryanarayanaY (Collaborator)

Hi @Pyrestone ,

I have replicated the reported behaviour and attached a gist for reference.

I can observe memory increasing from batch 0 to batch 3600. This needs to be investigated.

@SuryanarayanaY added the stat:awaiting tensorflower label on Jul 21, 2023
@Pyrestone (Author)

Note for people finding this issue in the meantime:

A temporary (or, I suppose, permanent) workaround is to use tf.numpy_function instead of tf.py_function in tf.data.Dataset.map(); see the sketch below.

It behaves very similarly, except that the wrapped function receives and returns NumPy arrays instead of eager tensors.
The only other notable difference is that tf.numpy_function doesn't support gradient computation (which probably shouldn't matter inside a dataset.map() call).
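
A rough sketch of the swap, following the same illustrative pipeline as in the reproduction above (names are made up; the wrapped function now sees NumPy arrays):

```python
import numpy as np
import tensorflow as tf


def eager_identity_np(x):
    # x arrives as a NumPy array rather than an eager tensor.
    return x + np.int64(1)


def map_fn(x):
    # tf.numpy_function instead of tf.py_function; same inp/Tout interface.
    return tf.numpy_function(eager_identity_np, inp=[x], Tout=tf.int64)


dataset = (
    tf.data.Dataset.range(1_000_000)
    .map(map_fn, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
)
```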

@hyliker commented Sep 7, 2023

@Pyrestone Thanks for your solution.
