
Massive memory leak with awkward.Array #1127

Closed

Asa-Nisi-Masa opened this issue Oct 29, 2021 · 10 comments · Fixed by #2311

Comments

@Asa-Nisi-Masa

Asa-Nisi-Masa commented Oct 29, 2021

Version of Awkward Array

1.5.1

Python version

Python 3.6 and 3.8

OS

Ubuntu 18.04 and macOS Big Sur

Problem

Using ak.Array leaks memory. The following program causes RAM consumption to grow indefinitely:

import awkward as ak

while True:
    ak.Array([1])

Yes, that is the whole program.

Rate of leakage

Memory increase can be tracked with utilities (e.g. htop) or from Python itself. Using psutil, memory usage in megabytes can be printed with the following code:

import os

import awkward as ak
import psutil


process = psutil.Process(os.getpid())

while True:
    ak.Array([1])

    usage = process.memory_info().rss / 1000 / 1000  # resident set size in MB
    print(usage)

which (for me) shows a steady RAM usage increase at a rate of about 2 MB/s.

I briefly tried to track down the cause with various memory-profiling tools but had no luck (I tried three tools, and none of them even saw any change in object counts or sizes).
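A minimal sketch of one such Python-level check, using the standard-library tracemalloc (it only tracks allocations made through Python's allocator, so growth in native C++ allocations would not appear, which may explain why these tools saw nothing):

import tracemalloc

import awkward as ak

tracemalloc.start()
before = tracemalloc.take_snapshot()

for _ in range(100_000):
    ak.Array([1])

after = tracemalloc.take_snapshot()

# top 10 Python-level allocation differences; native allocations are invisible here
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)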

Expected behavior

I see no good reason why this should be happening; I would expect ak.Array to cause no memory leakage.

@Asa-Nisi-Masa Asa-Nisi-Masa added the bug (unverified) The problem described would be a bug, but needs to be triaged label Oct 29, 2021
@jpivarski jpivarski added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Oct 29, 2021
@jpivarski
Member

Thanks for catching this, though it is disappointing: quite a lot of unit tests are dedicated to checking the reference counting between Python and C++ to avoid leaks and double-frees. Actually, it's not disappointing: in the course of writing this response, I discovered that there's no problem here. You could skip to the end.

The expression ak.Array([1]) invokes the ArrayBuilder, which walks over the Python data to build a tree of Content nodes representing the data. In this case, that's just an iterable containing a Python int, so the only node in that tree is an Int64Builder. It allocates a GrowableBuffer of initial length 1024 × 8 bytes = 8 kB, which calls kernel::malloc, and since this isn't a GPU, that's this code:

https://github.com/scikit-hep/awkward-1.0/blob/0b0fb3f1c18b8acf5ab664ffefe513f67c82076b/include/awkward/kernel-dispatch.h#L174-L178

which returns a shared pointer with a custom deleter that releases the array when the reference count goes to zero:

https://github.com/scikit-hep/awkward-1.0/blob/0b0fb3f1c18b8acf5ab664ffefe513f67c82076b/include/awkward/kernel-dispatch.h#L73-L81

When it gets attached to a Python object, the C++ shared_ptr's reference count is controlled by the Python object's reference count. When the Python object is deleted, this whole chain of deleters gets invoked, going all the way back to the array.
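For intuition, ak.Array([1]) is roughly equivalent to this explicit use of the public ArrayBuilder API (a sketch of what happens under the hood):

import awkward as ak

builder = ak.ArrayBuilder()  # builds a tree of nodes; an int makes it an Int64Builder
builder.integer(1)           # appends to the GrowableBuffer allocated by kernel::malloc
array = builder.snapshot()   # wraps the buffer in an ak.Array; Python now shares ownership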

I tried testing it too, and I added a counter to see how many times an array is created:

>>> import awkward as ak
>>> counter = 0
>>> while True:
...   tmp = ak.Array([1])
...   counter += 1
... 
^CTraceback (most recent call last):
  File "<stdin>", line 3, in <module>
KeyboardInterrupt
>>> counter
1140231

If it weren't deleting those arrays, then we'd have 1140231 × 8 kB ≈ 9 GB of leaked memory, which isn't the case. Looking at free, it's more like 300 MB, so the leak is not in the (large) array buffers; it might be in the (small) C++ or Python object instances.

While running the above Python code, I watched free memory go down with

% while (true); do free -m | fgrep Mem: | awk '{print $4}' ; sleep 1; done

Here's a plot: the horizontal axis is seconds; the vertical axis is megabytes (according to free -m). The flat parts are when I started and stopped the loop.

[plot: free memory (MB) vs. time (s), declining steadily while the loop runs]

With garbage-collected languages, we have to be careful about interpreting this, because the garbage collector will let memory use grow up to a point and then invoke collection. Maybe we're just not seeing it reach that point; after all, only 300 MB have been used in this minute. (On my Linux box, that's a little more than the rate you saw on macOS.) However, CPython only uses the garbage collector for data that it can't eliminate via reference counting, which I had thought was anything with a cycle in it. I tried to construct some pure Python examples with cycles, but Python (3.9) is too smart: it figured out that the self-referential data was cyclic and deleted it without waiting for the garbage collector.
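One way to probe whether an object participates in cyclic garbage collection at all is the standard-library gc.is_tracked (a quick check, not proof of a leak):

import gc

import awkward as ak

tmp = ak.Array([1])
print(gc.is_tracked(tmp))  # True means the collector must consider this object
print(gc.is_tracked(42))   # ints are never tracked, for comparison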

So I might as well ask: does the garbage collector find this data if it gets invoked? In Python, we can invoke it manually:

>>> import awkward as ak
>>> import gc
>>> while True:
...   tmp = ak.Array([1])
...   tmp2 = gc.collect()
... 
^CTraceback (most recent call last):
  File "<stdin>", line 3, in <module>
KeyboardInterrupt

The difference is striking: note the much smaller vertical axis:

[plot: free memory (MB) vs. time (s) with gc.collect() in the loop; much smaller vertical range]

Memory usage jumps around (though not very far) because we're sampling at random times relative to when the garbage collector runs, but there clearly isn't the steady 300 MB decline that we saw without it. So one conclusion is that there's no permanent leak: when the garbage collector runs, the objects do get cleaned up.
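For reference, when the collector runs is governed by allocation-count thresholds, which the standard-library gc module exposes for inspection and tuning:

import gc

print(gc.get_threshold())      # defaults to (700, 10, 10): a gen-0 pass after ~700 net allocations
print(gc.get_count())          # allocations tallied since the last collection, per generation
gc.set_threshold(100, 10, 10)  # example: collect more eagerly, trading speed for memory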

Why, with our non-cyclic data, is Python not cleaning these objects up right away? I'm not sure, but I had a hunch that it might be because the reference goes into a compiled extension. We can test the same sort of thing with NumPy (also a compiled extension) by creating a reference cycle through a NumPy array:

>>> import numpy as np
>>> while True:
...   tmp = np.array([None], dtype=object)
...   tmp[0] = tmp
... 
^CTraceback (most recent call last):
  File "<stdin>", line 2, in <module>
KeyboardInterrupt

Here, the memory loss is dramatic: whatever NumPy's doing is leaving a lot more data to be cleaned up only at garbage collection time. It uses up all the memory on my computer until it has only a few hundred MB left, then the garbage collector finally decides to get to work and the usage levels out. (The flat part at the bottom is before I stopped the loop.)

[plot: free memory (MB) vs. time (s) for the NumPy reference cycle; usage falls until the collector levels it out]

Just for completeness, let's do the NumPy example with an explicit garbage collector invocation:

>>> import numpy as np
>>> import gc
>>> while True:
...   tmp = np.array([None], dtype=object)
...   tmp[0] = tmp
...   tmp2 = gc.collect()
... 
^CTraceback (most recent call last):
  File "<stdin>", line 4, in <module>
KeyboardInterrupt

Indeed, it keeps it clean from the start:

[plot: free memory (MB) vs. time (s) for NumPy with explicit gc.collect(); flat from the start]

The bottom line is that this isn't a problem with Awkward Array; it's just how Python garbage collection works with data in extension libraries. You shouldn't be making millions of little ak.Arrays anyway; they'll be horribly inefficient compared to a few big ak.Arrays, but I understand that this was raised as an in-principle issue.
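For example, a few big arrays instead of millions of little ones (a sketch: one ArrayBuilder accumulates everything, then a single snapshot):

import awkward as ak

# preferred: accumulate into one builder, snapshot once
builder = ak.ArrayBuilder()
for i in range(1_000_000):
    builder.integer(i)
array = builder.snapshot()  # one big ak.Array instead of a million tiny ones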

There's another point that I could have raised at the beginning of all of this, though: we're working on Awkward 2.0, which removes the C++ layer in favor of Python that calls out to zero-memory-allocation C functions with ctypes, so even if there were a real memory leak here, it would be gone in the upcoming pure-Python version. I could have led with that, though it feels like a cop-out: a memory leak in the present version would be bad news. Anyway, for these two reasons, this is not an issue that needs action.

Thanks for your diligence in reporting it, though!

@jpivarski jpivarski added wontfix This will not be worked on and removed bug The problem described is something that must be fixed labels Oct 29, 2021
@Asa-Nisi-Masa
Author

Asa-Nisi-Masa commented Oct 29, 2021

Thanks for the awesome analysis. I was using ak.Array together with PyTorch, so at first I thought the problem might be there. In the end I worked around it by converting the arrays to regular lists with .to_list(). But thanks for providing the details of why this happens - such devilish behavior from Python

@jpivarski
Member

You can also work around it by calling gc.collect() at opportune times. Although I'd try to avoid that in a library (as opposed to an application), I've found it necessary in some very tight code:

https://github.com/scikit-hep/uproot4/blob/7fc47ac486b77983f80f65973490c16e7ebca589/src/uproot/interpretation/library.py#L798-L811

But since you're developing an application, you know the exact speed-to-memory tradeoffs and can decide where you want the garbage collector to collect (in addition to the times it would decide to do so on its own). PyTorch also has compiled extensions, so the above may apply to it, too.
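In application code, that pattern can be as simple as collecting every N iterations (a sketch; process and batches are hypothetical stand-ins, and the interval should be tuned to your workload):

import gc

for step, batch in enumerate(batches):  # hypothetical iterable of work items
    process(batch)                      # hypothetical function creating short-lived extension objects
    if step % 1000 == 0:                # hypothetical interval; trades speed for memory
        gc.collect()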

If you're calling .to_list() and that's not a noticeable performance (speed) penalty, you might have been able to get away with working with Python lists from the start (unless you found the array-at-a-time interface convenient, which is the other reason one might want to use Awkward Array).

such devilish behavior from Python

Well, all garbage-collected languages. CPython's hybrid of reference counting and garbage collection makes the garbage-collection penalties less significant than in, say, PyPy or the Java Virtual Machine. Instead, the cost is an extra 8 bytes (the reference count) in every object...

@jrueb
Contributor

jrueb commented Jul 26, 2022

I was also observing this effect, but I'm not fully convinced it can be explained or solved by garbage-collector invocation, which is why I'm replying to this rather old issue.

You are calling gc.collect in every loop iteration, but what if that slows down the loop so much that you stop seeing the leak? When I do

import awkward as ak
import gc

i = 0
while True:
    tmp = ak.Array([1])
    i += 1
    if i > 100000:  # collect only every ~100k iterations so gc doesn't dominate the loop
        print("collect")
        gc.collect()
        i = 0

I get a steady increase in memory usage, even though garbage collection is being done.

@agoose77
Collaborator

agoose77 commented Jul 26, 2022

I can reproduce growing memory for your trivial example. If I change the Array object to the v2 implementation (ak._v2.Array), I don't see this reproduced. It's not clear to me at this stage whether this is a memory leak or fragmentation, but the former seems like a reasonable bet.
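That is, the same loop with the v2 class named above:

import awkward as ak

while True:
    tmp = ak._v2.Array([1])  # v2 implementation; memory stays flat in this test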

@ianna
Collaborator

ianna commented Jul 26, 2022

I can reproduce growing memory for your trivial example. If I change the Array object to the v2 implementation (ak._v2.Array), I don't see this reproduced. It's not clear to me at this stage whether this is a memory leak or fragmentation, but the former seems like a reasonable bet.

@agoose77 - how long before you see it? I've been running on master + PR #1560 for ~15 min. It looks like v1 memory oscillates between 250 and 560 MB:
[screenshot: v1 memory usage oscillating between 250 and 560 MB]

and for v2 it's pretty stable at 50 MB:
[screenshot: v2 memory usage stable at about 50 MB]

@agoose77
Collaborator

@ianna good question. I gave it a while and observed memory growing from $t=0$. Let me run it for longer and see whether it saturates as expected.

@agoose77
Collaborator

For me, it climbs past 1 GB in 5 minutes. I don't have much spare memory right now with all of my open tasks, so I won't be able to let it run any longer 😅

@jpivarski
Member

We've moved from C++ owning the data (ArrayBuilder based on a std::shared_ptr that is shared with Python) to Python owning the data (the std::unique_ptr-held data in C++ is copied to a NumPy-allocated array). That was true until a few weeks ago (1.9.0rc8? 9? 10?); now ArrayBuilder uses @ManasviGoyal's new GrowableBuffer, so there isn't even a single std::unique_ptr anymore: the only full copy of the data is allocated and owned by NumPy.

So memory management for data created by ArrayBuilder (ak.Array([1])) has changed dramatically in the last few versions, with more and more reliance on NumPy owning the data (in both v1 and v2). Let's only run these tests on the latest version, since that's what's relevant going forward; we shouldn't spend a lot of time diagnosing issues that have already been fixed.

@jpivarski
Member

Good news! PR #2311 apparently fixes this memory leak, too! I'm reopening this issue just so that it can be closed by the PR, for record-keeping.

@jpivarski jpivarski reopened this Mar 13, 2023
@jpivarski jpivarski linked a pull request Mar 13, 2023 that will close this issue