[BUG] Conversion of list column to pandas produces array(..., dtype=object) entries #14568

Closed
wence- opened this issue Dec 5, 2023 · 0 comments · Fixed by #15155
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

wence- commented Dec 5, 2023

Describe the bug

import cudf
import pandas as pd

s = cudf.Series([[["abc", "def"]]])
expect = pd.Series([[["abc", "def"]]])
got = s.to_pandas()

I would expect the entries of got.values to have the same type as those of expect.values. However, type(got.values[0]) is numpy.ndarray, whereas type(expect.values[0]) is list.

In and of itself this doesn't seem to be a big problem, since cudf.from_pandas(expect) produces the same result as cudf.from_pandas(got). But writing the two pandas objects out (specifically to CSV) produces different results, because the string form of a numpy.ndarray differs from that of a list.

Consequently, I can't round-trip through to_csv/read_csv:

from io import StringIO
import cudf
import pandas as pd

s = cudf.Series([[["abc", "def"]]])
expect = pd.Series([[["abc", "def"]]])
got = s.to_pandas()

ebuf = StringIO()
gbuf = StringIO()

expect.to_frame().to_csv(ebuf, header=False, index=False)
got.to_frame().to_csv(gbuf, header=False, index=False)

ebuf.seek(0)
gbuf.seek(0)

expect_io = pd.read_csv(ebuf, index_col=None, header=None)
got_io = pd.read_csv(gbuf, index_col=None, header=None)

print(expect.to_frame())
#               0
# 0  [[abc, def]]
print(expect_io)
#                   0
# 0  [['abc', 'def']]
print(got.to_frame())
#               0
# 0  [[abc, def]]
print(got_io)
#                                        0
# 0  [array(['abc', 'def'], dtype=object)]
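To illustrate why the second string breaks a round trip, here is a minimal sketch that assumes the reader recovers the nested list by parsing the cell text with `ast.literal_eval`. That parsing step is an assumption, not something the report specifies, and `parse_cell` is a hypothetical helper. The two strings are the cell values read back from CSV in the example above.

```python
import ast

# Cell text read back from CSV for the pandas-built and cudf-built series.
from_list = "[['abc', 'def']]"
from_ndarray = "[array(['abc', 'def'], dtype=object)]"


def parse_cell(text):
    """Hypothetical helper: parse a cell as a Python literal, or return None."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # literal_eval rejects anything that is not a plain literal,
        # including the array(...) call in the ndarray string form.
        return None


print(parse_cell(from_list))     # [['abc', 'def']]
print(parse_cell(from_ndarray))  # None
```

The list form is a valid Python literal, so it can be parsed back; the ndarray form contains a function call and cannot.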

Expected behavior

I think this round trip should work: the entries produced by to_pandas() should be lists, matching what pandas itself produces.

The way to_pandas() is implemented, we just go via arrow and then call to_pandas() on the arrow array, which converts a ListArray to a series of numpy array objects. Plausibly this is therefore a bug in arrow, but I am not sure. I suspect that, at least in the short term, we'll need to patch up ListColumn.to_pandas() to post-process the result from arrow.

wence- added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels on Dec 5, 2023
galipremsagar self-assigned this on Feb 27, 2024
galipremsagar added the Python (Affects Python cuDF API.) label and removed the Needs Triage label on Feb 27, 2024
rapids-bot bot pushed a commit that referenced this issue Mar 4, 2024
Fixes: #14568 

This PR fixes `ListColumn.to_pandas()` by calling the `ArrowArray.to_pylist()` method, so that entries keep the `list` type in the resulting pandas series.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #15155
2 participants