[BUG] Parquet reader unable to read duration types written by pyarrow #13410
Labels
- 2 - In Progress: Currently a work in progress
- bug: Something isn't working
- cuIO: cuIO issue
- libcudf: Affects libcudf (C++/CUDA) code.
galipremsagar added the bug, Needs Triage, and cuIO labels on May 22, 2023
GregoryKimball added the 0 - Backlog and libcudf labels and removed the Needs Triage label on Jun 7, 2023
Investigating this. Some insights:
Summary: Seems like the issue is in our
Note this interesting example:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from io import BytesIO
import cudf

# Using type="time32[ms]", "time64[us]", etc. here makes cudf.DataFrame.from_arrow fail
times = pa.array([1234, 3456, 32442], type="duration[ms]")
names = ["s"]
pa_table = pa.Table.from_arrays([times], names=names)

# Case 1: Parquet file written by pyarrow, read back by cudf and by pyarrow
buf = BytesIO()
pq.write_table(pa_table, buf)
df2 = cudf.read_parquet(buf)
df3 = pq.read_table(buf)
print("Original table (pa)", pa_table, pa_table["s"].type)
print("cudf read parquet", df2, df2["s"].dtype)
print("pyarrow read parquet", df3, df3["s"].type)

# Case 2: same table converted with from_arrow, written by cudf, then read back
df = cudf.DataFrame.from_arrow(pa_table)
buf2 = BytesIO()
df.to_parquet(buf2)
# Read buf2 (the cudf-written file), not buf, so this case differs from case 1
df4 = cudf.read_parquet(buf2)
df5 = pq.read_table(buf2)
print("from_arrow table (cudf)", df, df["s"].dtype)
print("cudf read parquet", df4, df4["s"].dtype)
print("pyarrow read parquet", df5, df5["s"].type)
Updates:
mhaseeb123 added the 2 - In Progress label and removed the 0 - Backlog label on Apr 29, 2024
3 tasks
rapids-bot bot pushed a commit that referenced this issue on May 15, 2024
This PR adds support for reading and using the `arrow:schema` struct from the serialized `arrow:ipc` message written to the key-value metadata section of the Parquet file under the `ARROW:schema` key. This allows cudf to read and interoperate with Arrow for non-standard Parquet types (`DurationType` in this PR).

Arrow uses Google FlatBuffers (Schema.fbs) to serialize the `arrow:Schema` structure (containing column descriptors) and puts it (padded for 8-byte alignment) into the header of an empty `ipc:Message` (also a FlatBuffers-serialized structure, defined in Message.fbs). The `ipc:Message` is prepended with two integers containing a `validity` message and the `size of the header` (the `arrow:Schema` + padding). The final message is encoded as a base64 string and written to the Parquet file footer key-value metadata using the `"ARROW:schema"` key.

In this PR, we base64-decode the `ipc:Message`, then decode the `validity` message and the header size, and offset pointers to the `arrow:Schema` flatbuffer. We then use FlatBuffers structs to walk the `arrow:Schema` and collect information on columns of interest as an unordered_map (using column name as key). This unordered_map is used inside the `select_columns` function to build cudf Table columns and get the correct `dtype`.

Closes #13410

Authors:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Bradley Dice (https://github.com/bdice)

URL: #15617
Describe the bug
When a pyarrow table containing duration types is written to Parquet, the cudf reader seems to read the columns as `int64` as opposed to the correct `timedelta64[..]` types.
Steps/Code to reproduce bug