[BUG] cudf.read_parquet takes too much time(due to cudaMallocHost overhead etc.) to load the zstd compressed parquet files with few thousands to millions of rows #15481
There might be some tricks to avoid the long first-run cudaHostAlloc, which I haven't figured out yet; the code below may only handle the GPU-side memory pre-allocation:

import rmm
# rmm.reinitialize(pool_allocator=True, initial_pool_size=4 * 10 ** 9)
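One possible workaround (a hedged sketch, not something cudf documents): pay the one-time pinned-allocation cost at process startup by doing a throwaway read of a tiny in-memory parquet file, so later latency-sensitive reads don't hit it. The helper name is hypothetical, and the whole body is guarded so it is a no-op where cudf or a GPU is unavailable:

```python
import io


def warm_up_reader():
    """Best-effort warm-up; returns True if the throwaway read succeeded."""
    try:
        import cudf
        import pandas as pd

        buf = io.BytesIO()
        # a one-row file is enough; the goal is only to trigger the
        # one-time host/device allocation paths inside the reader
        pd.DataFrame({'warm': [0]}).to_parquet(buf)
        buf.seek(0)
        cudf.read_parquet(buf)  # first read pays the one-time cudaMallocHost cost
        return True
    except Exception:
        return False
```

Whether this actually removes the per-read cost depends on whether libcudf caches or pools its pinned buffers between calls, which would need to be confirmed against the cudf version in use.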
I also tried 1M rows of all (the same) integers and 1M rows of all (the same) strings; cudf.read_parquet still suffers the performance issue, very likely due to the long cudaMallocHost call.

df = pandas.DataFrame({'j2333c': [2333] * 1000000})
df.to_parquet('/dev/shm/j2333c.parquet', compression='ZSTD')
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> # timing here is rough, but the difference is large enough that precise timing is unnecessary for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.08919477462768555
>>>
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.026215314865112305
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.014030933380126953
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> import rmm
>>>
>>> rmm.reinitialize(pool_allocator=True, initial_pool_size= 4 * 10 ** 9)
>>>
>>> # timing here is rough, but the difference is large enough that precise timing is unnecessary for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.08475613594055176
>>>
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.025774002075195312
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('j2333c.parquet'); te = time.time()
>>> print(te - ts)
0.011544227600097656
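Since the RMM device pool above does not change the cudf numbers, the remaining cost is presumably on the host side. A quick way to localize it is to profile one read with the stdlib profiler (the helper below is illustrative; the reader function is passed in, so it works with any of the three libraries). Note that cProfile only sees Python-level frames, so it can attribute time to the `read_parquet` call but cannot name `cudaMallocHost` itself; confirming that would take a native profiler such as Nsight Systems:

```python
import cProfile
import io
import pstats


def profile_read(read_fn, path):
    """Profile a single call of read_fn(path) and return the top-10 report."""
    prof = cProfile.Profile()
    prof.enable()
    read_fn(path)
    prof.disable()
    out = io.StringIO()
    # sort by cumulative time so the read call itself floats to the top
    pstats.Stats(prof, stream=out).sort_stats('cumulative').print_stats(10)
    return out.getvalue()
```

For example, `print(profile_read(cudf.read_parquet, '/dev/shm/j2333c.parquet'))` would show how much of the wall time is spent inside the libcudf binding versus Python-side overhead.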
df = pandas.DataFrame({'jstrc': ['2333'] * 1000000})
df.to_parquet('/dev/shm/jstrc.parquet', compression='ZSTD')
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> import rmm
>>>
>>> # rmm.reinitialize(pool_allocator=True, initial_pool_size= 4 * 10 ** 9)
>>>
>>> # timing here is rough, but the difference is large enough that precise timing is unnecessary for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.08581995964050293
>>>
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.057205915451049805
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jstrc.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jstrc.parquet'); te = time.time()
>>> print(te - ts)
0.022694826126098633
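The timings above are single-shot `time.time()` deltas. For steadier numbers, a small stdlib-only harness (a hypothetical helper, not part of cudf) can warm up first and report the minimum and median over several runs:

```python
import statistics
import time


def bench(fn, *args, repeats=5, warmup=1):
    """Return (min, median) wall time of fn(*args) over `repeats` runs."""
    # warm-up runs pay any one-time costs (pools, pinned allocations,
    # import-time laziness) outside the measured window
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times), statistics.median(times)
```

Usage would be e.g. `bench(cudf.read_parquet, '/dev/shm/jstrc.parquet')`; `time.perf_counter()` is the appropriate clock for short intervals, and the minimum is the least noise-contaminated statistic for a cached in-memory file.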
Well, as GPUs are throughput machines, the advantage shows clearly once the row count grows from millions to billions:

>>> import cudf
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'jnac': [None] * 1000000000})
>>> df.to_parquet('/dev/shm/jnac.parquet', compression='ZSTD')
>>> import cudf
>>> import pandas
>>> import pyarrow.parquet
>>>
>>> import time
>>>
>>> # timing here is rough, but the difference is large enough that precise timing is unnecessary for now
>>>
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = cudf.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
0.15029525756835938
>>>
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pandas.read_parquet('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
7.30379843711853
>>>
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jnac.parquet'); te = time.time()
>>> time.sleep(1)
>>> ts = time.time(); tb = pyarrow.parquet.read_table('/dev/shm/jnac.parquet'); te = time.time()
>>> print(te - ts)
1.51247239112854
So the major remaining problem is how to resolve the issue for chunked tables at the millions-of-rows scale.
Describe the bug
Performance improvement proposal for cudf parquet file reading efficiency.
Steps/Code to reproduce bug
Expected behavior
Environment overview (please complete the following information)
internal T4 node, py3.9, cudf 24.02.02
Additional context
It just takes too much time per read, especially for cudf when the row count is only around 1K (though the latency is similar even for 10M rows of NA).