
Massive memory leak with awkward.Array #1127

Closed

Asa-Nisi-Masa opened this issue Oct 29, 2021 · 10 comments · Fixed by #2311

Comments

@Asa-Nisi-Masa

Asa-Nisi-Masa commented Oct 29, 2021

Version of Awkward Array

1.5.1

Python version

Python 3.6 and 3.8

OS

Ubuntu 18.04 and macOS Big Sur

Problem

Using ak.Array leaks memory. The following program causes RAM consumption to grow indefinitely:

import awkward as ak

while True:
    ak.Array([1])

Yes, that is the whole program.

Rate of leakage

Memory increase can be tracked with utilities (e.g. htop) or from Python itself. Using psutil, memory usage in megabytes can be printed with the following code:

import os

import awkward as ak
import psutil


process = psutil.Process(os.getpid())

while True:
    ak.Array([1])

    usage = process.memory_info().rss / 1000 / 1000  # resident set size in MB
    print(usage)

which (for me) shows a steady RAM usage increase at a rate of about 2 MB/s.

I briefly tried to track down the cause with various memory-profiling tools but had no luck (I tried three tools, and none of them even saw any change in object counts or sizes).
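A minimal sketch of one such Python-level check, using the standard-library tracemalloc (it only tracks allocations made through Python's allocator, so growth in native C++ allocations would not appear, which may explain why these tools saw nothing):

import tracemalloc

import awkward as ak

tracemalloc.start()
before = tracemalloc.take_snapshot()

for _ in range(100_000):
    ak.Array([1])

after = tracemalloc.take_snapshot()

# top 10 Python-level allocation differences; native allocations are invisible here
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)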

Expected behavior

I see no good reason why this should be happening; I would expect ak.Array to cause no memory leakage.

@Asa-Nisi-Masa Asa-Nisi-Masa added the bug (unverified) The problem described would be a bug, but needs to be triaged label Oct 29, 2021
@jpivarski jpivarski added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Oct 29, 2021
@jpivarski
Member

Thanks for catching this, though it is disappointing: quite a lot of unit tests are dedicated to checking the reference counting between Python and C++ to avoid leaks and double-frees. Actually, it's not disappointing: in the course of writing this response, I discovered that there's no problem here. You could skip to the end.

The expression ak.Array([1]) invokes the ArrayBuilder, which walks over the Python data to build a tree of Content nodes representing the data. In this case, that's just an iterable containing a Python int, so the only node in that tree is an Int64Builder. It allocates a GrowableBuffer of initial length 1024 × 8 bytes = 8 kB, which calls kernel::malloc, and since this isn't a GPU, that's this code:

https://github.com/scikit-hep/awkward-1.0/blob/0b0fb3f1c18b8acf5ab664ffefe513f67c82076b/include/awkward/kernel-dispatch.h#L174-L178

which returns a shared pointer with a custom deleter that releases the array when the reference count goes to zero:

https://github.com/scikit-hep/awkward-1.0/blob/0b0fb3f1c18b8acf5ab664ffefe513f67c82076b/include/awkward/kernel-dispatch.h#L73-L81

When it gets attached to a Python object, the C++ shared_ptr's reference count is controlled by the Python object's reference count. When the Python object is deleted, this whole chain of deleters gets invoked, going all the way back to the array.
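For intuition, ak.Array([1]) is roughly equivalent to this explicit use of the public ArrayBuilder API (a sketch of what happens under the hood):

import awkward as ak

builder = ak.ArrayBuilder()  # builds a tree of nodes; an int makes it an Int64Builder
builder.integer(1)           # appends to the GrowableBuffer allocated by kernel::malloc
array = builder.snapshot()   # wraps the buffer in an ak.Array; Python now shares ownership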

I tried testing it too, and I added a counter to see how many times an array is created:

>>> import awkward as ak
>>> counter = 0
>>> while True:
...   tmp = ak.Array([1])
...   counter += 1
... 
^CTraceback (most recent call last):
  File "<stdin>", line 3, in <module>
KeyboardInterrupt
>>> counter
1140231

If it weren't deleting those arrays, then we'd have 1140231 × 8 kB ≈ 9 GB of leaked memory, which isn't the case. Looking at free, it's more like 300 MB, so the leak is not in the (large) array buffers; it might be in the (small) C++ or Python object instances.

While running the above Python code, I watched free memory go down with

% while (true); do free -m | fgrep Mem: | awk '{print $4}' ; sleep 1; done

Here's a plot: the horizontal axis is seconds; the vertical axis is megabytes (according to free -m). The flat parts are when I started and stopped the loop.

[plot: free memory (MB) vs. time (s), declining steadily while the loop runs]

With garbage-collected languages, we have to be careful about interpreting this, because the garbage collector will let memory use grow up to a point and then invoke collection. Maybe we're just not seeing it reach that point; after all, only 300 MB have been used in this minute. (On my Linux box, that's a little more than the rate you saw on macOS.) However, CPython only uses the garbage collector for data that it can't eliminate via reference counting, which I had thought was anything with a cycle in it. I tried to construct some pure Python examples with cycles, but Python (3.9) is too smart: it figured out that the self-referential data was cyclic and deleted it without waiting for the garbage collector.
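One way to probe whether an object participates in cyclic garbage collection at all is the standard-library gc.is_tracked (a quick check, not proof of a leak):

import gc

import awkward as ak

tmp = ak.Array([1])
print(gc.is_tracked(tmp))  # True means the collector must consider this object
print(gc.is_tracked(42))   # ints are never tracked, for comparison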

So I might as well ask: does the garbage collector find this data if it gets invoked? In Python, we can invoke it manually:

>>> import awkward as ak
>>> import gc
>>> while True:
...   tmp = ak.Array([1])
...   tmp2 = gc.collect()
... 
^CTraceback (most recent call last):
  File "<stdin>", line 3, in <module>
KeyboardInterrupt

The difference is striking: note the much smaller vertical axis:

[plot: free memory (MB) vs. time (s) with gc.collect() in the loop; much smaller vertical range]

Memory usage jumps around (though not very far) because we're sampling at random times relative to when the garbage collector runs, but there clearly isn't the steady 300 MB decline that we saw without it. So one conclusion is that there's no permanent leak: when the garbage collector runs, the objects do get cleaned up.
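For reference, when the collector runs is governed by allocation-count thresholds, which the standard-library gc module exposes for inspection and tuning:

import gc

print(gc.get_threshold())      # defaults to (700, 10, 10): a gen-0 pass after ~700 net allocations
print(gc.get_count())          # allocations tallied since the last collection, per generation
gc.set_threshold(100, 10, 10)  # example: collect more eagerly, trading speed for memory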

Why, with our non-cyclic data, is Python not cleaning these objects up right away? I'm not sure, but I had a hunch that it might be because the reference goes into a compiled extension. We can test the same sort of thing with NumPy (also a compiled extension) by creating a reference cycle through a NumPy array:

>>> import numpy as np
>>> while True:
...   tmp = np.array([None], dtype=object)
...   tmp[0] = tmp
... 
^CTraceback (most recent call last):
  File "<stdin>", line 2, in <module>
KeyboardInterrupt

Here, the memory loss is dramatic: whatever NumPy's doing is leaving a lot more data to be cleaned up only at garbage collection time. It uses up all the memory on my computer until it has only a few hundred MB left, then the garbage collector finally decides to get to work and the usage levels out. (The flat part at the bottom is before I stopped the loop.)

[plot: free memory (MB) vs. time (s) for the NumPy reference cycle; usage falls until the collector levels it out]

Just for completeness, let's do the NumPy example with an explicit garbage collector invocation:

>>> import numpy as np
>>> import gc
>>> while True:
...   tmp = np.array([None], dtype=object)
...   tmp[0] = tmp
...   tmp2 = gc.collect()
... 
^CTraceback (most recent call last):
  File "<stdin>", line 4, in <module>
KeyboardInterrupt

Indeed, it keeps it clean from the start:

[plot: free memory (MB) vs. time (s) for NumPy with explicit gc.collect(); flat from the start]

The bottom line is that this isn't a problem with Awkward Array; it's just how Python garbage collection works with data in extension libraries. You shouldn't be making millions of little ak.Arrays anyway; they'll be horribly inefficient compared to a few big ak.Arrays, but I understand that this was raised as an in-principle issue.
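For example, a few big arrays instead of millions of little ones (a sketch: one ArrayBuilder accumulates everything, then a single snapshot):

import awkward as ak

# preferred: accumulate into one builder, snapshot once
builder = ak.ArrayBuilder()
for i in range(1_000_000):
    builder.integer(i)
array = builder.snapshot()  # one big ak.Array instead of a million tiny ones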

There's another point that I could have raised at the beginning of all of this, though: we're working on Awkward 2.0, which removes the C++ layer in favor of Python that calls out to zero-memory-allocation C functions with ctypes, so even if there were a real memory leak here, it would be gone in the upcoming pure-Python version. I could have led with that, though it feels like a cop-out: a memory leak in the present version would be bad news. Anyway, for these two reasons, this is not an issue that needs action.

Thanks for your diligence in reporting it, though!

@jpivarski jpivarski added wontfix This will not be worked on and removed bug The problem described is something that must be fixed labels Oct 29, 2021
@Asa-Nisi-Masa
Author

Asa-Nisi-Masa commented Oct 29, 2021

Thanks for the awesome analysis. I was using ak.Array together with PyTorch, so at first I thought the problem might be there. In the end I worked around it by converting the arrays to regular lists with .to_list(). But thanks for providing the details of why this happens - such devilish behavior from Python

@jpivarski
Member

You can also work around it by calling gc.collect() at opportune times. Although I'd try to avoid that in a library (as opposed to an application), I've found it necessary in some very tight code:

https://github.com/scikit-hep/uproot4/blob/7fc47ac486b77983f80f65973490c16e7ebca589/src/uproot/interpretation/library.py#L798-L811

But since you're developing an application, you know the exact speed-to-memory tradeoffs and can decide where you want the garbage collector to collect (in addition to the times it would decide to do so on its own). PyTorch also has compiled extensions, so the above may apply to it, too.
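In application code, that pattern can be as simple as collecting every N iterations (a sketch; process and batches are hypothetical stand-ins, and the interval should be tuned to your workload):

import gc

for step, batch in enumerate(batches):  # hypothetical iterable of work items
    process(batch)                      # hypothetical function creating short-lived extension objects
    if step % 1000 == 0:                # hypothetical interval; trades speed for memory
        gc.collect()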

If you're calling .to_list() and that's not a noticeable performance (speed) penalty, you might have been able to get away with working with Python lists from the start (unless you found the array-at-a-time interface convenient, which is the other reason one might want to use Awkward Array).

such devilish behavior from Python

Well, all garbage-collected languages. CPython's hybrid of reference counting and garbage collection makes the garbage-collection penalties less significant than in, say, PyPy or the Java Virtual Machine. Instead, the cost is an extra 8 bytes (the reference count) in every object...

@jrueb
Contributor

jrueb commented Jul 26, 2022

I was also observing this effect, but I'm not fully convinced it can be explained or solved by garbage-collector invocation, which is why I'm replying to this rather old issue.

You are calling gc.collect in every loop iteration, but what if that slows down the loop so much that you stop seeing the leak? When I do

import awkward as ak
import gc

i = 0
while True:
    tmp = ak.Array([1])
    i += 1
    if i > 100000:  # collect only every ~100k iterations so gc doesn't dominate the loop
        print("collect")
        gc.collect()
        i = 0

I get a steady increase in memory usage, even though garbage collection is being done.

@agoose77
Collaborator

agoose77 commented Jul 26, 2022

I can reproduce growing memory for your trivial example. If I change the Array object to the v2 implementation (ak._v2.Array), I don't see this reproduced. It's not clear to me at this stage whether this is a memory leak or fragmentation, but the former seems like a reasonable bet.
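That is, the same loop with the v2 class named above:

import awkward as ak

while True:
    tmp = ak._v2.Array([1])  # v2 implementation; memory stays flat in this test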

@ianna
Collaborator

ianna commented Jul 26, 2022

I can reproduce growing memory for your trivial example. If I change the Array object to the v2 implementation (ak._v2.Array), I don't see this reproduced. It's not clear to me at this stage whether this is a memory leak or fragmentation, but the former seems like a reasonable bet.

@agoose77 - how long before you see it? I've been running on master + PR #1560 for ~15 min. It looks like v1 memory oscillates between 250 and 560 MB:
[screenshot: v1 memory usage oscillating between 250 and 560 MB]

and for v2 it's pretty stable at 50 MB:
[screenshot: v2 memory usage stable at about 50 MB]

@agoose77
Collaborator

@ianna good question. I gave it a while and observed memory growing from $t=0$. Let me run it for longer and see whether it saturates as expected.

@agoose77
Collaborator

For me, it climbs past 1 GB in 5 minutes. I don't have much spare memory right now with all of my open tasks, so I won't be able to let it run any longer 😅

@jpivarski
Member

We've moved from C++ owning the data (ArrayBuilder based on a std::shared_ptr that is shared with Python) to Python owning the data (the std::unique_ptr-held data in C++ is copied to a NumPy-allocated array). That was true until a few weeks ago (1.9.0rc8? 9? 10?); now ArrayBuilder uses @ManasviGoyal's new GrowableBuffer, so there isn't even a single std::unique_ptr anymore: the only full copy of the data is allocated and owned by NumPy.

So memory management for data created by ArrayBuilder (ak.Array([1])) has changed dramatically in the last few versions, with more and more reliance on NumPy owning the data (in both v1 and v2). Let's only run these tests on the latest version, since that's what's relevant going forward; we shouldn't spend a lot of time diagnosing issues that have already been fixed.

@jpivarski
Member

Good news! PR #2311 apparently fixes this memory leak, too! I'm reopening this issue just so that it can be closed by the PR, for record-keeping.

@jpivarski jpivarski reopened this Mar 13, 2023
@jpivarski jpivarski linked a pull request Mar 13, 2023 that will close this issue