Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak when iterating over array #1280

Closed
jpearkes opened this issue Feb 9, 2022 · 4 comments · Fixed by #2311
Closed

Memory leak when iterating over array #1280

jpearkes opened this issue Feb 9, 2022 · 4 comments · Fixed by #2311
Labels
performance Works, but not fast enough or uses too much memory

Comments

@jpearkes
Copy link

jpearkes commented Feb 9, 2022

Version of Awkward Array

1.7.0

Description and code to reproduce

Hi,

A student I am working with noticed what appears to be a memory leak when iterating over array. I've replicated the issue with the following snippet.

import psutil

A = ak.Array([[0,1],[0],[1,0]]*10000000)
my_sum = np.zeros(len(A))
i = 0
for event in A:
    my_sum[i] =  np.sum(event)
    i += 1
    if i %10000 == 0:
        print('RAM memory % used:', psutil.virtual_memory()[2])

A (much faster) solution without the leak to do the sum on the awkward array directly

my_sum = np.sum(A, axis=1) 

But I figured I should report the issue here in case it is helpful.

@jpearkes jpearkes added the bug (unverified) The problem described would be a bug, but needs to be triaged label Feb 9, 2022
@jpivarski
Copy link
Member

I'm on a phone right now, so I can't test this directly yet, but Python will use all the memory it has available until it reaches a limit before calling the garbage collector. The above code should show a linear increase in memory consumption until it gets to that limit, then I think it plateaus (instead of a sawtooth shape, which is the other conceptual possiblity). If this eventually stops increasing in memory use, even if that's at the limit of your computer's resources, or at the process's ulimit, then that is correct behavior for a garbage-collected language.

What it comes down to, though, is that this is an antipattern: you don't want to iterate over all elements of a large array with a Python for loop, since that creates Python objects for each element in the array (here, ak.Arrays of length 1 or 2). You want to do:

my_sum = ak.sum(A, axis=-1)

and possibly send that through ak.to_numpy if you need that to be NumPy, rather than Awkward. That does the sum entirely in compiled code—no Python objects for each short list, and no waiting on the garbage collector to bring memory use under control.

These should be thought of as techniques to avoid using Python for anything large (memory or time). You get the computational result, but without representing all the intermediate steps in Python objects. (Same philosophy as NumPy.)

@jpearkes
Copy link
Author

jpearkes commented Feb 9, 2022

Thanks for the detailed explanation Jim! We were seeing it use up all the memory available and then promptly crash the lxplus/SWAN node it was running on.

@jpivarski
Copy link
Member

I didn't do this question justice at all. I missed this, for instance:

A (much faster) solution without the leak to do the sum on the awkward array directly

my_sum = np.sum(A, axis=1) 

where you were pretty clear that you know how it's supposed to be done.

I tested the sample code and the extra memory used after the loop doesn't seem to go away with gc.collect(), though it's hard to set up a clean experiment on this computer that also has Zoom running (since psutil.virtual_memory() returns total RAM usage). I think it could really be a memory leak, most likely in the "handle" objects that point to the arrays, rather than the arrays themselves.

If we weren't porting all of these "handle" objects from C++ to Python, that would be something we'd have to fix. As it is, the new Python version of this should be automatically memory-clean. I just tried it by replacing ak.Array with ak._v2.Array and np.sum with ak._v2.sum and the memory use vs i was a lot flatter. So, just because it seems to be a v1-only thing and we'll be replacing that, I'll be labeling this as "won't fix." Thanks for reporting it, though!

@jpivarski jpivarski added performance Works, but not fast enough or uses too much memory wontfix This will not be worked on and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Feb 9, 2022
@jpivarski jpivarski removed the wontfix This will not be worked on label Mar 13, 2023
@jpivarski
Copy link
Member

Good news! PR #2311 apparently fixes this memory leak, too! I'm reopening this issue just so that it can be closed by the PR, for record-keeping.

@jpivarski jpivarski reopened this Mar 13, 2023
@jpivarski jpivarski linked a pull request Mar 13, 2023 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Works, but not fast enough or uses too much memory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants