tf.data.Dataset.map() makes unnecessary memory allocations #62788

Open
hrsht opened this issue Jan 13, 2024 · 5 comments
Labels
comp:data (tf.data related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.15 (For issues related to 2.15.x), type:bug (Bug), type:performance (Performance Issue)

Comments

@hrsht

hrsht commented Jan 13, 2024

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

tf 2.15.0

Custom code

Yes

OS platform and distribution

Linux

Mobile device

No response

Python version

3.10.13

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

When tf.data.Dataset.map is called with a function that references tensor(s) (or a nested object containing tensors), it seems to make a copy of these tensors. If the tensors are large, this causes large memory allocations which can make the process OOM.

Please see the code snippet below, which reproduces the issue (here is a reference to the colab). I allocate a tensor that takes 2GB of memory and then reference it in the get_data function, which is used in tf.data.Dataset.map to construct the dataset. I chain multiple map calls to exacerbate the bug and cause an OOM in Colab. Each map call allocates a new copy of the original tensor referenced by the passed function.

Please note that this is not a memory leak, as these copies are subsequently freed and the memory is released back to the memory allocator. However, depending on the allocator and its settings, the allocator may hold on to the memory for a long time and not release it back to the OS, causing memory bloat for the process in the best case and an OOM in the worst case.

The expected behavior is that these tensor copies do not happen, since there is no functional need for them.

It is possible that the root cause of this issue is the same as #61344 , in which case feel free to close this issue and track the underlying bug over there.

Standalone code to reproduce the issue

import tensorflow as tf

print(tf.version.VERSION)

# Depending on the underlying RAM resources,
# increase this value to see OOM. With 10,
# this code should OOM with total RAM resources of 20GB or less.
NUM_MAP_CALLS=10

# Allocate a large tensor. This will take 2GB of RAM.
t = tf.random.uniform((2048, 1024*256))
    
def get_data(_idx):
  return t[0, 0]
    
ds = tf.data.Dataset.range(1)
for _ in range(NUM_MAP_CALLS):
  ds = ds.map(get_data)

next(iter(ds))

Relevant log output

Jan 13, 2024, 12:04:55 PM	WARNING	WARNING:root:kernel 0585280c-54b4-4ab5-8a4d-e5502242c92a restarted
Jan 13, 2024, 12:04:55 PM	INFO	KernelRestarter: restarting kernel (1/5), keep random ports
Jan 13, 2024, 12:04:49 PM	WARNING	2024-01-13 17:04:49.124663: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2147483648 exceeds 10% of free system memory.
Jan 13, 2024, 12:04:44 PM	WARNING	2024-01-13 17:04:44.817746: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2147483648 exceeds 10% of free system memory.
Jan 13, 2024, 12:04:39 PM	WARNING	2024-01-13 17:04:39.468439: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2147483648 exceeds 10% of free system memory.
Jan 13, 2024, 12:04:32 PM	WARNING	2024-01-13 17:04:32.994518: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2147483648 exceeds 10% of free system memory.
Jan 13, 2024, 12:04:28 PM	WARNING	2024-01-13 17:04:28.102305: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2147483648 exceeds 10% of free system memory.
@google-ml-butler google-ml-butler bot added the type:bug Bug label Jan 13, 2024
@SuryanarayanaY SuryanarayanaY added comp:data tf.data related issues TF 2.15 For issues related to 2.15.x type:performance Performance Issue labels Jan 16, 2024
@SuryanarayanaY
Collaborator

Hi @hrsht ,

Thanks for reporting. The dataset.map() method indeed returns a new dataset object. But in the attached code snippet, while iterating over the mapped dataset (even with ds.as_numpy_iterator), some memory copies seem to be happening because the same variable is reused to store the dataset object every time.

There seems to be an issue with the map() function on the dataset object. Attached gist for reference. Needs more digging for the root cause. Thanks!

@SuryanarayanaY SuryanarayanaY added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jan 18, 2024
@SuryanarayanaY
Collaborator

Hi @hrsht ,

It is intended behaviour. Consider the following code:

ds = tf.data.Dataset.range(1)
for call in range(10):
    print('Call Number:',call)
    ds = ds.map(get_data)
    # iterate ds

When we call map on a dataset, it returns the pipeline dataset.map() in the first call, i.e. call=0. Note that we assign this back to the same variable (ds = ds.map(get_data)), so in the second call, i.e. call=1, ds is no longer just tf.data.Dataset.range(1) but tf.data.Dataset.range(1).map(). Hence in call=1 we are mapping it again and ds becomes ds.map().map(); when call=2 it becomes ds.map().map().map(), and so on. Hence the problem (a small sketch after the snippet below illustrates this).

Changing the code as shown below will resolve the issue:

ds = tf.data.Dataset.range(1)
for call in range(10):
    print('Call Number:',call)
    ds1 = ds.map(get_data)
    # iterate ds1
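
A quick way to see the chaining effect (a minimal sketch, not part of the original comment; each map call wraps the previous dataset, so the function is applied once per wrap):

import tensorflow as tf

ds = tf.data.Dataset.range(3)        # elements: 0, 1, 2
for _ in range(3):
    ds = ds.map(lambda x: x + 1)     # each call wraps the previous dataset

# The lambda has been applied three times per element, confirming the nesting.
print(list(ds.as_numpy_iterator()))  # [3, 4, 5]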

@SuryanarayanaY SuryanarayanaY added stat:awaiting response Status - Awaiting response from author and removed stat:awaiting tensorflower Status - Awaiting response from tensorflower labels Jan 25, 2024
@hrsht
Author

hrsht commented Jan 26, 2024

Thanks @SuryanarayanaY for the investigation.

I don't think you quite understood the issue here. The chaining of multiple map calls in my example code was only to highlight the unnecessary memory allocations by exacerbating them; it is most likely not how one would write code in reality. In fact, if you change the size of the tensor referenced by get_data to 8GB, the code you referenced last, with the ds1 dataset, would also OOM with the default RAM of ~12GB in Colab. (See this run: https://colab.research.google.com/drive/1UykDpf0FefrcNyCgPW9-KW-7zCQFbzTg#scrollTo=OtY6ihLwfvNJ&line=17&uniqifier=1)

I did further investigation on my local machine by printing malloc stats at different steps (I used tcmalloc for this run, as the stats it prints are more concise than those of the default malloc). This highlights the issue much better.

My code:

import ctypes
import ctypes.util
import os
import sys  # To print python version
import tensorflow as tf

print('Python version:', sys.version)
print('Tensorflow version:', tf.version.VERSION)

def _load_malloc_lib():
  if 'LD_PRELOAD' in os.environ:
    malloc_lib = os.environ['LD_PRELOAD'].split('/')[-1]
  else:
    # Else find the standard libc library and return
    malloc_lib = ctypes.util.find_library('c')
  return ctypes.CDLL(malloc_lib)

_libmalloc = _load_malloc_lib()

def print_malloc_stats():
  if _libmalloc is not None:
    _libmalloc.malloc_stats()

# Allocate a large tensor. This will take 2GB of RAM.
t = tf.random.uniform((2048, 1024*256))
print('Malloc stats after tensor initialization:')
print_malloc_stats()
print()

def get_data(_idx):
  return t[0, 0]
print('Malloc stats after get_data function definition:')
print_malloc_stats()
print()

NUM_MAP_CALLS=1
ds = tf.data.Dataset.range(1)
for _ in range(NUM_MAP_CALLS):
  ds = ds.map(get_data)
print('Malloc stats after dataset initialization:')
print_malloc_stats()
print()

print('dataset initialized\n')

itr = iter(ds)
print('Malloc stats after iterator setup:')
print_malloc_stats()
print()

_ = next(itr)
print('Malloc stats after calling next on iterator:')
print_malloc_stats()
print()

print('done')

Here are the logs from running the above code with tcmalloc using LD_PRELOAD=<tcmalloc_lib>

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
Tensorflow version: 2.15.0
2024-01-26 12:02:45.458639: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2147483648 exceeds 10% of free system memory.
Malloc stats after tensor initialization:
------------------------------------------------
MALLOC:     2231009920 ( 2127.7 MiB) Bytes in use by application
MALLOC: +      1351680 (    1.3 MiB) Bytes in page heap freelist
MALLOC: +      1794504 (    1.7 MiB) Bytes in central cache freelist
MALLOC: +      1254400 (    1.2 MiB) Bytes in transfer cache freelist
MALLOC: +      1906616 (    1.8 MiB) Bytes in thread cache freelists
MALLOC: +      4980736 (    4.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2242297856 ( 2138.4 MiB) Actual memory used (physical + swap)
MALLOC: +       475136 (    0.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2242772992 ( 2138.9 MiB) Virtual address space used
MALLOC:
MALLOC:           7083              Spans in use
MALLOC:             28              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

Malloc stats after get_data function definition:
------------------------------------------------
MALLOC:     2231009920 ( 2127.7 MiB) Bytes in use by application
MALLOC: +      1351680 (    1.3 MiB) Bytes in page heap freelist
MALLOC: +      1794504 (    1.7 MiB) Bytes in central cache freelist
MALLOC: +      1254400 (    1.2 MiB) Bytes in transfer cache freelist
MALLOC: +      1906616 (    1.8 MiB) Bytes in thread cache freelists
MALLOC: +      4980736 (    4.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2242297856 ( 2138.4 MiB) Actual memory used (physical + swap)
MALLOC: +       475136 (    0.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2242772992 ( 2138.9 MiB) Virtual address space used
MALLOC:
MALLOC:           7083              Spans in use
MALLOC:             28              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

Malloc stats after dataset initialization:
------------------------------------------------
MALLOC:     2231332056 ( 2128.0 MiB) Bytes in use by application
MALLOC: +       983040 (    0.9 MiB) Bytes in page heap freelist
MALLOC: +      1790136 (    1.7 MiB) Bytes in central cache freelist
MALLOC: +      1185280 (    1.1 MiB) Bytes in transfer cache freelist
MALLOC: +      2026608 (    1.9 MiB) Bytes in thread cache freelists
MALLOC: +      4980736 (    4.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   2242297856 ( 2138.4 MiB) Actual memory used (physical + swap)
MALLOC: +       475136 (    0.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   2242772992 ( 2138.9 MiB) Virtual address space used
MALLOC:
MALLOC:           7104              Spans in use
MALLOC:             28              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

dataset initialized

2024-01-26 12:02:46.317419: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 2147483648 exceeds 10% of free system memory.
Malloc stats after iterator setup:
------------------------------------------------
MALLOC:     2231803544 ( 2128.4 MiB) Bytes in use by application
MALLOC: +   2148024320 ( 2048.5 MiB) Bytes in page heap freelist
MALLOC: +      1740992 (    1.7 MiB) Bytes in central cache freelist
MALLOC: +      1188352 (    1.1 MiB) Bytes in transfer cache freelist
MALLOC: +      2248360 (    2.1 MiB) Bytes in thread cache freelists
MALLOC: +      7208960 (    6.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   4392214528 ( 4188.7 MiB) Actual memory used (physical + swap)
MALLOC: +       270336 (    0.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   4392484864 ( 4189.0 MiB) Virtual address space used
MALLOC:
MALLOC:           7145              Spans in use
MALLOC:             40              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

Malloc stats after calling next on iterator:
------------------------------------------------
MALLOC:     2231872752 ( 2128.5 MiB) Bytes in use by application
MALLOC: +   2147926016 ( 2048.4 MiB) Bytes in page heap freelist
MALLOC: +      1840152 (    1.8 MiB) Bytes in central cache freelist
MALLOC: +      1189376 (    1.1 MiB) Bytes in transfer cache freelist
MALLOC: +      2177272 (    2.1 MiB) Bytes in thread cache freelists
MALLOC: +      7208960 (    6.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   4392214528 ( 4188.7 MiB) Actual memory used (physical + swap)
MALLOC: +       270336 (    0.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   4392484864 ( 4189.0 MiB) Virtual address space used
MALLOC:
MALLOC:           7154              Spans in use
MALLOC:             42              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

done

As you can see from the malloc stats, there is an extra allocation of 2GB when the dataset iterator is initialized using iter(ds). This allocation is freed before returning to the caller, but malloc still holds on to that memory and has not yet returned it to the OS.

To exacerbate the problem, if you increase the number of chained map calls on the dataset, you will see the memory allocations increase by a multiple of 2GB (which is the size of the tensor t referenced by the get_data function).

This is not a problem in general when the function used in map references small data, but if the referenced data is big, it may become an issue.
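
As a possible workaround sketch (an assumption on my part, not verified against the malloc stats above): keep the large data in a tf.Variable instead of an eager tensor. Traced functions capture variables by resource handle rather than by value, which may avoid the extra copy when the dataset graph is built.

import tensorflow as tf

# Hypothetical mitigation: capture a tf.Variable rather than a constant tensor.
# The variable is captured as a resource handle, so the 2GB buffer may not be
# copied into the map function's graph.
t_var = tf.Variable(tf.random.uniform((2048, 1024 * 256)), trainable=False)

def get_data(_idx):
  return t_var[0, 0]

ds = tf.data.Dataset.range(1).map(get_data)
print(next(iter(ds)))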

FWIW, I also see a similar copy of the tensor when using tf.data.Dataset.from_tensors, where the tensor passed to from_tensors is copied. Not sure if it is the same underlying problem.
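
A quick way to check the from_tensors copy along the same lines (a sketch, reusing the print_malloc_stats helper defined in the code above; the variable names here are only illustrative):

big = tf.random.uniform((2048, 1024 * 256))   # ~2GB of float32
print_malloc_stats()                          # baseline after allocating the tensor

ds2 = tf.data.Dataset.from_tensors(big)
next(iter(ds2))
print_malloc_stats()                          # per the observation above, another ~2GB shows up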

I hope this helps clarify the underlying issue. Thanks!

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Jan 26, 2024
@AugustusHsu

I have the same problem on 2.12, any solution? 🫠🫠

@SuryanarayanaY
Collaborator

@wilsingosti, could you please comment on how chaining of map functions accumulates memory with each chained map call? It seems each chained call actually allocates memory for the input array, and this accumulates each time. Is this intended behaviour?

@SuryanarayanaY SuryanarayanaY added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Feb 22, 2024