solve memleak in parquet numpy reader #19
Is there a memleak for the parquet/numpy reader as well?
I don't know.
I didn't actually run out of memory, so maybe it's just that the gc is not running for some reason until it's needed (but it did use more than 200GB of RAM..)
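One way to test the "gc just hasn't run yet" hypothesis is to force a collection between batches and see whether resident memory stops growing. This is a minimal sketch, not the project's actual reader loop; `batches` and `handle` are hypothetical stand-ins:

```python
import gc

def process_with_gc(batches, handle):
    """Process each batch, forcing a full gc pass between batches.

    If memory stays flat with this, the leak is really just deferred
    collection; if it still grows, objects are being kept alive somewhere.
    """
    for batch in batches:
        handle(batch)
        gc.collect()  # force a full collection after each batch
```

If this makes the growth go away, the fix is to drop references promptly rather than calling `gc.collect()` in production code.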
I am not sure if this is the same as what I have faced. I am running into OOM when running the NSFW model example from PR #16. I ran it with all the defaults, 10 parallel processes, except that the embeddings are already on the file system (not from a cloud storage). I see that memory keeps increasing as more batches are processed, and it eventually leads to OOM.

After doing some debugging with a profiler, the growth seems to come most specifically from reading the metadata. I did a small modification to the read_piece method to move out the metadata part so that I can apply it separately. I tried a bunch of things like setting various options, without success.

One thing to note is that we are reading all the metadata using pyarrow and then picking the relevant columns. I see that pandas has a way to read only selected columns.
Probably the same problem. The embeddings are there now: https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/
OK, so after testing some more, it seems the problem is closely related to the fact that it doesn't make much sense to read parquet files in arbitrary pieces, because the on-disk serialization doesn't allow retrieving only those pieces. At best, splitting along row groups could make sense. I will test some more and maybe simply change the default so that the parquet+numpy reader and the parquet reader read the whole file at once, or whole row groups at once.

Another way to solve the problem more generally could be to keep the piece-fetching logic, but instead of using it to load pieces of embedding files, use it to prepare pieces of files on disk (or in memory) at the byte level, and then load the batch all at once using numpy/pyarrow.
Solved now.