[BUG] Serious memory leak in struct-column construction #15525

Closed
rjzamora opened this issue Apr 12, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@rjzamora
Member

rjzamora commented Apr 12, 2024

Describe the bug
@mpenn uncovered a serious memory leak related to cudf struct-columns.

Steps/Code to reproduce bug

import cudf

# Each iteration builds a new struct column; the previous one should be freed.
while True:
    gdf = cudf.Series([{"test": {str(i): "test" * 100000 for i in range(50)}}])

Note that host memory consumption will grow continuously as this while loop runs.
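One rough way to put a number on that growth, rather than watching a system monitor, is to compare the process's peak resident set size before and after a bounded number of iterations. The sketch below is an illustration only: the helper names `peak_rss_bytes` and `peak_rss_growth` are hypothetical, and it relies on the standard-library `resource` module, which is available on Unix-like systems but not on Windows.

```python
import resource
import sys


def peak_rss_bytes():
    """Return this process's peak resident set size in bytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def peak_rss_growth(build, iterations=100):
    """Call build() repeatedly and return how far peak RSS rose, in bytes."""
    before = peak_rss_bytes()
    for _ in range(iterations):
        build()  # the result is dropped immediately, so it should be freed
    return peak_rss_bytes() - before
```

Passing the body of the loop above as `build` (for example, a lambda that constructs the `cudf.Series`) and comparing the result for 10 and 100 iterations gives a quick signal: without a leak the growth should level off after the first few iterations, whereas with a leak it should keep scaling with the iteration count.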

EDIT: It seems that the root problem is in PyArrow. The following code produces the memory leak without cudf:

import pyarrow as pa

# Same nested-dict input as above, passed directly to pyarrow.
while True:
    arr = pa.array(
        [{"test": {str(i): "test" * 100000 for i in range(50)}}],
    )

(Issue raised here: apache/arrow#41172)

rjzamora added the bug label Apr 12, 2024
@galipremsagar
Contributor

galipremsagar commented Apr 12, 2024

I tried to avoid this memory leak in cudf by using pa.StructArray.from_pandas(pd.Series(...)), but that path leaks too.

@wence-
Contributor

wence- commented Apr 12, 2024

I think it is a pyarrow bug:

import pyarrow as pa

for _ in range(1000):
    long_strings = ["test" * 100000 for i in range(50)]
    val = [{"test": {str(i): s for i, s in enumerate(long_strings)}}]
    x = pa.array(val)

This loop leaks host memory.

@rjzamora
Member Author

Note that I added a link to the root pyarrow problem above.

@JohnZed
Contributor

JohnZed commented Apr 12, 2024

In the linked bug, @galipremsagar mentions that StructArray.from_arrays does not seem to manifest that issue. Is that something that could potentially be used to work around this?

@galipremsagar
Contributor

This leak has been patched in 14.0.2 and 15.0.2 builds of arrow:

I verified by installing these builds locally that the issue is resolved.
