solve memleak in parquet numpy reader #19

Closed

rom1504 opened this issue Mar 24, 2022 · 7 comments

@rom1504
Owner

rom1504 commented Mar 24, 2022

No description provided.

@hitchhicker
Collaborator

Is there a memleak in the parquet/numpy reader as well?

@rom1504
Owner Author

rom1504 commented Mar 24, 2022

I don't know.
I would think not, or we would have seen it in autofaiss.

@rom1504
Owner Author

rom1504 commented Mar 24, 2022

I didn't actually run out of memory, so maybe it's just that the GC is not running for some reason until it's needed (but it did use more than 200 GB of RAM...)

@vanga

vanga commented Apr 10, 2022

I am not sure if this is the same as what I have faced.

I am running into OOM when running the NSFW model example from PR #16.

I ran it with all the defaults (10 parallel processes), except that the embeddings are already on the file system (not from cloud storage).

I see that memory keeps increasing as more batches are processed, which eventually leads to OOM.
The VM has 128 GB of RAM, all of it is available, and no other process is contending for it.

After doing some debugging with a profiler, the root cause seems to be somewhere in the read_piece function from https://github.com/rom1504/embedding-reader/blob/main/embedding_reader/parquet_numpy_reader.py

More specifically, it seems to come from reading the metadata (see the profiler screenshots).

I made a small modification to the read_piece method to move the metadata part out so that I could apply @profile to it.
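
For reference, the profiling change looked roughly like this (read_metadata_only is a hypothetical helper for illustration, not the actual refactor):

```python
# Rough sketch of the profiling setup; read_metadata_only is a hypothetical
# helper mirroring what read_piece does for the metadata, not the real code.
from memory_profiler import profile
import pyarrow.parquet as pq


@profile  # prints line-by-line memory usage when the function is called
def read_metadata_only(metadata_path, columns):
    # Read the whole metadata file with pyarrow, then keep only the relevant columns.
    table = pq.read_table(metadata_path)
    return table.select(columns).to_pandas()
```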

I tried a bunch of things, like setting parallel_pieces to 1, deleting the pandas objects, and calling the GC explicitly. None of them changed the behavior.

One thing to note is that we are reading all of the metadata using pyarrow and then picking the relevant columns. pandas has a read_parquet method where we can specify the columns we want to read, and pandas internally uses pyarrow as the default engine, so maybe using pandas read_parquet would be more elegant? I have tried that change as well and didn't see any difference; it still leads to OOM eventually.
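
For what it's worth, the two variants look roughly like this (path and column names are placeholders):

```python
import pandas as pd
import pyarrow.parquet as pq

metadata_path = "metadata/0000.parquet"  # placeholder path
columns = ["image_path", "caption"]      # placeholder column names

# Current approach: read the whole metadata file with pyarrow, then pick the columns.
table = pq.read_table(metadata_path)
df_pyarrow = table.select(columns).to_pandas()

# Alternative I tried: let pandas (pyarrow engine underneath) read only those columns.
df_pandas = pd.read_parquet(metadata_path, columns=columns)
```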

@rom1504
Owner Author

rom1504 commented Apr 11, 2022

probably the same problem

The embeddings are there now: https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/
fsspec supports https as well, so I'm going to use that to reproduce and fix.
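
Roughly, reading one of those files over https looks like this (the file name is a placeholder, and this assumes fsspec's http filesystem, i.e. aiohttp, is installed):

```python
import fsspec

# Placeholder file name under the directory above.
url = "https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/<some_file>"

# fsspec's http filesystem gives a file-like object; range reads work when the
# server supports them, which is what piece-based reading relies on.
with fsspec.open(url, "rb") as f:
    first_bytes = f.read(1024)
print(len(first_bytes))
```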

@rom1504
Owner Author

rom1504 commented Apr 11, 2022

OK, so after testing some more, it seems the problem is closely related to the fact that it doesn't make much sense to read parquet files in arbitrary pieces, because the on-disk serialization doesn't allow retrieving only those pieces. At best, splitting along row groups could make sense.

I will test some more and maybe simply change the default so that the parquet+numpy reader and the parquet reader read the whole file at once, or whole row groups at once.
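
Reading whole row groups at a time would look roughly like this (placeholder path and column name):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("embeddings/0000.parquet")  # placeholder path
for i in range(pf.num_row_groups):
    # Each row group is decoded in one shot, so no arbitrary byte-range pieces
    # need to be decoded separately.
    batch = pf.read_row_group(i, columns=["embedding"])  # placeholder column
    df = batch.to_pandas()
    # ... hand df to the consumer ...
```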

Another way to solve the problem more generally could be to keep the piece-fetching logic, but instead of using it to load pieces of the embedding files, use it to prepare pieces of the files on disk (or in memory) at the byte level, and then load the batch all at once using numpy/pyarrow.
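
Something along these lines (a rough sketch of the idea, not an actual implementation):

```python
import io

import fsspec
import pyarrow.parquet as pq

path = "embeddings/0000.parquet"  # placeholder path
piece_size = 50 * 1024 * 1024     # placeholder byte-level piece size (50 MB)

fs, fs_path = fsspec.core.url_to_fs(path)
size = fs.size(fs_path)

# Fetch the file as byte-level pieces (this part could be parallelized),
# reassembling them in memory.
buffer = io.BytesIO()
with fs.open(fs_path, "rb") as f:
    for start in range(0, size, piece_size):
        f.seek(start)
        buffer.write(f.read(min(piece_size, size - start)))

# Then parse the whole batch in one go with pyarrow (or numpy for .npy files).
buffer.seek(0)
table = pq.read_table(buffer)
```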

@rom1504
Owner Author

rom1504 commented May 16, 2022

solved now

@rom1504 rom1504 closed this as completed May 16, 2022