[BUG] Unable to read a dataframe with multiIndex properly #14352

galipremsagar · 2023-11-01T15:40:59Z

Describe the bug
The cuDF Parquet reader does not restore a MultiIndex correctly when `engine="pyarrow"` is used. Only the first index level is set as the index; the second level comes back as a regular `__index_level_1__` column.

Steps/Code to reproduce bug

In [1]: import pandas as pd

In [2]: import cudf

In [3]: expected = pd.DataFrame(
   ...:         {"A": [1, 2, 3]},
   ...:         index=pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]),
   ...:     )

In [4]: expected
Out[4]: 
     A
a 1  1
  2  2
b 1  3

In [5]: expected.to_parquet("a.parquet", engine="pyarrow")


In [8]: pd.read_parquet("a.parquet", engine="pyarrow")
Out[8]: 
     A
a 1  1
  2  2
b 1  3

In [9]: cudf.read_parquet("a.parquet", engine="pyarrow")
/nvme/0/pgali/envs/cudfdev/lib/python3.10/site-packages/cudf/io/parquet.py:544: UserWarning: Using CPU via PyArrow to read Parquet dataset. This option is both inefficient and unstable!
  warnings.warn(
Out[9]: 
                   A  __index_level_1__
__index_level_0__                      
a                  1                  1
a                  2                  2
b                  3                  1

Expected behavior
The result should match pandas: a two-level MultiIndex and a single data column `A`.
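For context on where the `__index_level_N__` names come from: when pandas writes Parquet through pyarrow, it stores a JSON document under the `pandas` key of the schema metadata, and its `index_columns` entry lists the fields that should become the index again on read. The sketch below shows, in plain Python, how a reader could use that entry to tell index fields from data fields. The metadata string is a hand-written, abbreviated stand-in (not copied from `a.parquet`), and `split_index_and_data_columns` is a hypothetical helper, not part of cuDF, pandas, or pyarrow.

```python
import json

# Hand-written, abbreviated stand-in for the JSON that pandas stores under the
# "pandas" key of the Parquet schema metadata. This is an assumption for
# illustration; real metadata carries more fields (dtypes, versions, ...).
PANDAS_METADATA = json.dumps({
    "index_columns": ["__index_level_0__", "__index_level_1__"],
    "columns": [
        {"name": "A", "field_name": "A"},
        {"name": None, "field_name": "__index_level_0__"},
        {"name": None, "field_name": "__index_level_1__"},
    ],
})


def split_index_and_data_columns(metadata_json, file_columns):
    """Hypothetical helper: separate index fields from data fields."""
    meta = json.loads(metadata_json)
    # A RangeIndex is described by a dict rather than a string, so only
    # string entries refer to columns physically present in the file.
    index_fields = [c for c in meta.get("index_columns", []) if isinstance(c, str)]
    data_fields = [c for c in file_columns if c not in index_fields]
    # Map each index field back to its user-visible level name (None if unnamed).
    names = {c["field_name"]: c["name"] for c in meta.get("columns", [])}
    index_names = [names.get(field) for field in index_fields]
    return index_fields, index_names, data_fields


index_fields, index_names, data_fields = split_index_and_data_columns(
    PANDAS_METADATA, ["A", "__index_level_0__", "__index_level_1__"]
)
print(index_fields, index_names, data_fields)
```

Under these assumptions, both `__index_level_0__` and `__index_level_1__` are identified as index levels with no user-visible names, which corresponds to the unnamed two-level index in the pandas output above. The cuDF output in `Out[9]` behaves as if only the first of those fields had been promoted to the index.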

Environment overview (please complete the following information)

Environment location: [Bare-metal]
Method of cuDF install: [from source]

galipremsagar added the bug and Python labels Nov 1, 2023

galipremsagar self-assigned this Nov 1, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Nov 1, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Nov 1, 2023

rapids-bot bot closed this as completed Nov 1, 2023

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Nov 1, 2023
