reimplement parquet reader using ParquetFile #20
maybe just try to use the current code but cache usage of ParquetFile and close the file when all pieces have been read |
consider using https://arrow.apache.org/docs/python/dataset.html , it's really good |
```python
import fsspec
import pyarrow.dataset as ds
from tqdm import tqdm

fs, p = fsspec.core.url_to_fs("https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/laion1B-nolang-metadata")
files = fs.ls(p, detail=False)
d = ds.dataset(files, filesystem=fs)
b = d.to_batches()
for _ in tqdm(b):
    pass
```

about 1M sample/s, i.e. 200MB/s; saturates the external server connection |
idea: this should fix the parquet speed and will solve the memory issue |
doesn't work due to no support of start/end in pyarrow.dataset |
numpy reader doing 300MB/s from https fsspec |
maybe https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner can be used to support start stop |
no, can't |
ok best idea is to cache the files |
seems a little bit better but not much |
actually lru cache doesn't work with multiple threads... |
makes things much much faster |
estimated 325min total (laion1B-nolang) with use_threads=True; probably the same with use_threads=False |
150min for numpy alone |
1012min without this change |
seems to have solved the memleak too |
https://github.com/rom1504/embedding-reader/pull/21/files is faster but didn't solve the memleak |
actually looks like it did. memory usage is a bit high (15GB with these settings), but memleak seems gone |
did a fix here |
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html
the whole logic of splitting into small pieces and using our own thread pool makes little sense with pyarrow, since it has its own pool
try not doing that and instead use ParquetFile directly
think about whether it still makes sense for small parquet files