[BUG] DataFrame.from_arrow results in incorrect order of child columns inside a StructColumn #11693

Closed
galipremsagar opened this issue Sep 13, 2022 · 0 comments · Fixed by #11698
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Describe the bug
Reading the same pyarrow data through two different APIs (`cudf.DataFrame.from_arrow` on the table and `cudf.Series.from_arrow` on one of its columns) gives different results. Both should return the same, correct result.
sample.parquet.zip

Steps/Code to reproduce bug

In [1]: import pyarrow as pa

In [2]: import pyarrow.parquet as pq

In [4]: table = pq.read_table('sample.parquet')

In [5]: import cudf

In [6]: cudf.DataFrame.from_arrow(table)['2']
Out[6]: 
0       {'0': None, '1': None, '2': None}
1         {'0': '', '1': None, '2': None}
2    {'0': None, '1': 'W&RR=+I', '2': ''}
Name: 2, dtype: struct

In [7]: cudf.Series.from_arrow(table['2'])
Out[7]: 
0       {'2': None, '0': None, '1': None}
1         {'2': '', '0': None, '1': None}
2    {'2': None, '0': 'W&RR=+I', '1': ''}
dtype: struct
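
Comparing the two outputs above row by row suggests that the child values sit in the same positions in both results and that only the field names differ: `DataFrame.from_arrow` labels them '0', '1', '2', while `Series.from_arrow` labels them '2', '0', '1'. The following is a minimal pure-Python check of that observation. The rows are copied by hand from the outputs above, and the helper names are hypothetical; it does not call cuDF or pyarrow.

```python
# Rows copied from Out[6] (DataFrame.from_arrow) and Out[7] (Series.from_arrow).
# Plain dicts keep insertion order, so they model the printed field order.
from_dataframe = [
    {'0': None, '1': None, '2': None},
    {'0': '', '1': None, '2': None},
    {'0': None, '1': 'W&RR=+I', '2': ''},
]
from_series = [
    {'2': None, '0': None, '1': None},
    {'2': '', '0': None, '1': None},
    {'2': None, '0': 'W&RR=+I', '1': ''},
]


def positional_values(rows):
    """Return each row's child values in printed order, ignoring field names."""
    return [list(row.values()) for row in rows]


def field_names(rows):
    """Return each row's field names in printed order."""
    return [list(row.keys()) for row in rows]


# True if both APIs put the same value in the same child position.
same_values = positional_values(from_dataframe) == positional_values(from_series)
# True if both APIs report the same field names in the same order.
same_names = field_names(from_dataframe) == field_names(from_series)

print("values match by position:", same_values)
print("field names match:", same_names)
```

If the values match by position while the names do not, the discrepancy is in the struct field names attached to the children rather than in the child data itself.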

Expected behavior

In [6]: cudf.DataFrame.from_arrow(table)['2']
Out[6]: 
0       {'2': None, '0': None, '1': None}
1         {'2': '', '0': None, '1': None}
2    {'2': None, '0': 'W&RR=+I', '1': ''}
Name: 2, dtype: struct

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [from source]
@galipremsagar galipremsagar added bug Something isn't working Python Affects Python cuDF API. labels Sep 13, 2022
@galipremsagar galipremsagar self-assigned this Sep 13, 2022
@galipremsagar galipremsagar added this to Issue-Needs prioritizing in v22.10 Release via automation Sep 13, 2022
@galipremsagar galipremsagar moved this from Issue-Needs prioritizing to Issue-P1 in v22.10 Release Sep 13, 2022
rapids-bot bot pushed a commit that referenced this issue Sep 14, 2022
Fixes: #11693 
This PR fixes `DataFrame.from_arrow`, which did not preserve type metadata for `struct`, `list` & `decimal` types.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #11698
v22.10 Release automation moved this from Issue-P1 to Done Sep 14, 2022