reimplement parquet reader using ParquetFile #20

rom1504 · 2022-04-12T21:39:04Z

https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html

the whole logic of splitting into small pieces and using our own thread pool makes little sense with pyarrow, since it has its own pool

try dropping that and using ParquetFile directly

think about whether it still makes sense for small parquet files

rom1504 · 2022-04-12T22:20:12Z

maybe just keep the current code but cache the ParquetFile objects and close each file once all its pieces have been read

rom1504 · 2022-04-16T17:02:04Z

consider using https://arrow.apache.org/docs/python/dataset.html , it's really good

rom1504 · 2022-04-17T01:34:15Z

along with https://arrow.apache.org/docs/python/filesystems.html#filesystem-fsspec

rom1504 · 2022-04-17T22:32:15Z

import fsspec
import pyarrow.dataset as ds
from tqdm import tqdm
fs, p = fsspec.core.url_to_fs("https://mystic.the-eye.eu/public/AI/cah/laion5b/embeddings/laion1B-nolang/laion1B-nolang-metadata")
files = fs.ls(p, detail=False)
d = ds.dataset(files, filesystem=fs)
b = d.to_batches()
for _ in tqdm(b):
    pass

about 1M samples/s, i.e. roughly 200MB/s, which saturates the external server connection

rom1504 · 2022-04-17T22:33:42Z

idea:

use numpy read with the current implementation for numpy
use pyarrow dataset for parquet
zip the 2 for the numpy parquet reader

this should fix the parquet speed and will solve the memory issue

rom1504 · 2022-04-19T20:54:56Z

doesn't work: pyarrow.dataset has no support for start/end

rom1504 · 2022-04-19T21:11:45Z

numpy reader does 300MB/s over https via fsspec
numpy parquet reader is at 30MB/s ...
this needs to be improved

rom1504 · 2022-04-19T21:22:00Z

maybe https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html#pyarrow.dataset.Scanner can be used to support start/stop

rom1504 · 2022-04-19T21:24:29Z

no, can't

rom1504 · 2022-04-19T21:25:35Z

ok best idea is to cache the files

rom1504 · 2022-04-19T21:32:37Z

seems a little bit better but not much

from functools import lru_cache
import pyarrow.parquet as pq

@lru_cache(maxsize=None)
def open_parquet_file(fs, filename):
    return pq.read_table(fs.open(filename, "rb"), use_threads=False)

rom1504 · 2022-04-19T21:34:10Z

actually lru_cache doesn't help with multiple threads: threads that miss on the same file at the same time each do their own read...
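
A small stdlib-only experiment that illustrates the problem, assuming a `time.sleep` as a stand-in for the slow remote read (`slow_read`, `worker` and the file name are made up). `functools.lru_cache` is thread-safe, but it does not hold a lock while the wrapped function runs, so threads that miss on the same key at the same time can each do the read. A lock held across the lookup and the read avoids that:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

N_THREADS = 4
barrier = threading.Barrier(N_THREADS)
calls = []  # one entry per real "read"


def slow_read(filename):
    calls.append(filename)
    time.sleep(0.2)  # stand-in for a slow remote read
    return f"table for {filename}"


cached_read = lru_cache(maxsize=None)(slow_read)


def worker(filename):
    barrier.wait()  # make all threads ask for the file at the same moment
    return cached_read(filename)


with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    results = list(pool.map(worker, ["a.parquet"] * N_THREADS))

calls_with_lru_cache = len(calls)

# Same experiment with a lock held across both the lookup and the read.
calls.clear()
lock = threading.Lock()
tables = {}


def locked_read(filename):
    with lock:
        if filename not in tables:
            tables[filename] = slow_read(filename)
        return tables[filename]


def locked_worker(filename):
    barrier.wait()
    return locked_read(filename)


with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
    locked_results = list(pool.map(locked_worker, ["a.parquet"] * N_THREADS))

calls_with_lock = len(calls)
```

Comparing `calls_with_lru_cache` with `calls_with_lock` shows how many redundant reads the lock saves; with the lock the file is read exactly once.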

rom1504 · 2022-04-19T21:40:38Z

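from threading import Lock

import pyarrow.parquet as pq

lock = Lock()
cache = {}  # filename -> fully loaded pyarrow Table

def open_parquet_file(fs, filename):
    # Hold the lock across lookup and read so each file is read only once;
    # "with" also releases the lock if the read raises.
    with lock:
        if filename not in cache:
            print(filename)
            with fs.open(filename, "rb") as f:
                cache[filename] = pq.read_table(f, use_threads=False)
        return cache[filename]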

makes things much much faster

rom1504 · 2022-04-19T22:00:45Z

estimated 325min total (laion1B-nolang) with use_threads=True; probably the same with use_threads=False

rom1504 · 2022-04-19T22:09:56Z

150min for numpy alone

rom1504 · 2022-04-19T22:13:41Z

1012min without this change

rom1504 · 2022-04-19T22:28:50Z

seems to have solved the memleak too

rom1504 · 2022-04-19T22:56:36Z

https://github.com/rom1504/embedding-reader/pull/21/files

faster but didn't solve memleak

rom1504 · 2022-04-19T23:04:19Z

actually looks like it did. memory usage is a bit high (15GB with these settings), but memleak seems gone

rom1504 · 2022-05-16T00:22:47Z

did a fix here

rom1504 closed this as completed May 16, 2022
