Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: protect ak.to_parquet against memory explosion when args are swapped. #2523

Merged

Conversation

jpivarski
Copy link
Member

If an Awkward Array is passed as destination to fsspec.core.url_to_fs(destination), the memory use explodes before Python crashes. The array doesn't even need to be very large to send fsspec into dozens of GB; I think it might be converting the array into a string and interpreting [ ] { } as wildcards, a combinatorially large file list.

Anyway, it's very common for me to forget the argument order in I/O functions (ak.to_parquet, json.dump, etc.) and I just try one until it works. It needs to be the case that trying the wrong one produces an immediate error message.

@jpivarski jpivarski requested a review from agoose77 June 13, 2023 17:13
@jpivarski jpivarski changed the title Protect ak.to_parquet against memory explosion when args are swapped. fix: protect ak.to_parquet against memory explosion when args are swapped. Jun 13, 2023
@jpivarski
Copy link
Member Author

Improvements: this function isn't going through the standard ErrorContext, so the error message contains text about the fact that this is happening in ak.to_parquet and what the correct order of arguments ought to be. These tidbits wouldn't be necessary if there was an error context.

Also, only strings for destination? That's what the docstring says. What about accepting a pathlib.Path? pq.ParquetWriter accepts a file-like object—we could in principle pass on that capability, but we'd have to bypass a lot of fsspec to do it.

@jpivarski jpivarski temporarily deployed to docs-preview June 13, 2023 17:25 — with GitHub Actions Inactive
@codecov
Copy link

codecov bot commented Jun 13, 2023

Codecov Report

Merging #2523 (5d3ea7e) into main (e627d4a) will decrease coverage by 0.01%.
The diff coverage is 60.00%.

❗ Current head 5d3ea7e differs from pull request most recent head faac18c. Consider uploading reports for the commit faac18c to get more accurate results

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/operations/ak_to_parquet.py 52.88% <60.00%> (+0.35%) ⬆️

... and 3 files with indirect coverage changes

@raymondEhlers
Copy link
Contributor

Just poking around and saw this, so feel free to disregard - although it's just a nice quality of life thing, it would be nice to have support for Path. Every time I use to_parquet, I forget to convert from Path, so if it's not a big burden, it would be a helpful addition

@jpivarski
Copy link
Member Author

That one's easy. Allowing file-like objects to pass through to the ParquetWriter, to take advantage of its ability to use them, would be harder.

@jpivarski jpivarski temporarily deployed to docs-preview June 13, 2023 21:43 — with GitHub Actions Inactive
@agoose77 agoose77 temporarily deployed to docs-preview June 14, 2023 13:06 — with GitHub Actions Inactive
Copy link
Collaborator

@agoose77 agoose77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! I made a change to open up support for any path-likes, are you happy with these modifications? I'll just test them locally to ensure that there are no corner-cases.

@agoose77 agoose77 temporarily deployed to docs-preview June 14, 2023 17:49 — with GitHub Actions Inactive
@agoose77 agoose77 enabled auto-merge (squash) June 15, 2023 16:07
@agoose77 agoose77 temporarily deployed to docs-preview June 15, 2023 16:29 — with GitHub Actions Inactive
@agoose77 agoose77 merged commit 28f243e into main Jun 15, 2023
34 checks passed
@agoose77 agoose77 deleted the jpivarski/protect-ak-to_parquet-against-memory-explosion branch June 15, 2023 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants