Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reimplement parquet reader using ParquetFile #20

Closed
rom1504 opened this issue Apr 12, 2022 · 20 comments
Closed

reimplement parquet reader using ParquetFile #20

rom1504 opened this issue Apr 12, 2022 · 20 comments

Comments

@rom1504
Copy link
Owner

rom1504 commented Apr 12, 2022

https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html

the whole logic of splitting in small pieces and using our own thread pool make little sense with pyarrow since it has its own pool

try not doing that and instead use directly ParquetFile

think if it still makes sense for small parquet files

@rom1504
Copy link
Owner Author

rom1504 commented Apr 12, 2022

maybe just try to use the current code but cache usage of ParquetFile and close the file when all pieces have been read

@rom1504
Copy link
Owner Author

rom1504 commented Apr 16, 2022

consider using https://arrow.apache.org/docs/python/dataset.html , it's really good

@rom1504
Copy link
Owner Author

rom1504 commented Apr 17, 2022

@rom1504
Copy link
Owner Author

rom1504 commented Apr 17, 2022

import fsspec
import pyarrow.dataset as ds
from tqdm import tqdm
fs, p = fsspec.core.url_to_fs("https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/laion1B-nolang-metadata")
files = fs.ls(p, detail=False)
d = ds.dataset(files, filesystem=fs)
b = d.to_batches()
for _ in tqdm(b):
    pass

about 1M sample/s eg 200MB/s, saturates the external server connection

@rom1504
Copy link
Owner Author

rom1504 commented Apr 17, 2022

idea:

  • use numpy read with the current implementation for numpy
  • use pyarrow dataset for parquet
  • zip the 2 for the numpy parquet reader

this should fix the parquet speed and will solve the memory issue

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

doesn't work due to no support of start/end in pyarrow.dataset

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

numpy reader doing 300MB/s from https fsspec
numpy parquet is at 30MB/s ...
this needs to be improved

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

no, can't

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

ok best idea is to cache the files

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

seems a little bit better but not much

@lru_cache(maxsize=None)
def open_parquet_file(fs, filename):
    return pq.read_table(fs.open(filename, "rb"), use_threads=False)

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

actually lru cache doesn't work with multiple threads...

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

r = Semaphore(1)

d = {}
def open_parquet_file(fs, filename):
    r.acquire()
    if filename in d:
        r.release()
        return d[filename]
    print(filename)
    t = pq.read_table(fs.open(filename, "rb"), use_threads=False)
    d[filename] = t
    r.release()
    return t

makes things much much faster

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

estimated 325min total (laion1B-nolang) with use_threads=True ; probably the same with False

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

150min for numpy alone

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

1012 without this change

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

seems to have solved the memleak too

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

https://github.com/rom1504/embedding-reader/pull/21/files

faster but didn't solve memleak

@rom1504
Copy link
Owner Author

rom1504 commented Apr 19, 2022

actually looks like it did. memory usage is a bit high (15GB with these settings), but memleak seems gone

@rom1504
Copy link
Owner Author

rom1504 commented May 16, 2022

did a fix here

@rom1504 rom1504 closed this as completed May 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant