Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Column.to_pandas return Index instead of Series #15833

Merged

Conversation

mroeschke
Copy link
Contributor

Description

Column.to_pandas backs Index.to_pandas/Series.to_pandas/DataFrame.to_pandas and returned a pandas.Series; however, the index of this pandas.Series was not strictly necessary for Index.to_pandas and DataFrame.to_pandas.

Additionally, pandas.Index is 1D-like like Column and provides a better mental model to to_pandas conversion.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 23, 2024
@mroeschke mroeschke requested a review from a team as a code owner May 23, 2024 01:47
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really make sense for column objects to convert to pandas objects at all? I think we still need some sort of method at the column level so that columns of different dtypes can overload it to produce different results (e.g. CategoricalColumn needs to convert the categories as well as the labels), but is there a lower-level representation that we could change these methods to return instead so that we could delegate the actual pandas object construction to the Frame level? I'm imagining, in cudf.Series for example, something like return pd.Series(self._column.to_pandas_data()) instead of return self._column.to_pandas(). Does that make sense?

@mroeschke
Copy link
Contributor Author

is there a lower-level representation that we could change these methods to return instead so that we could delegate the actual pandas object construction to the Frame level?

I'm fairly skeptical that there is a common lower-level representation. Generally Column.to_arrow is called first, then each column dtype manipulates that as needed (from just calling to_pandas to needing a Python list of objects).

I agree it would be nice to have the pandas construction happen at the Frame level, but I think Frame would have to inherit the column dtype specific logic depending on what self._column.to_pandas_data() returns which isn't great.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I'd file this in the back of your head in the list of "things we should consider when rewriting cudf internals to use pylibcudf". I suspect that we'll want to think hard about problems like this when reenvisioning what the column layer of cudf ought to look like.

@mroeschke
Copy link
Contributor Author

Thanks. Yeah I think for this use case converting a pylibcudf column to something arrow-like is probably the correct "common data" for the to_pandas API in the future, especially if pandas gains pyarrow as a required dependency

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit fe74129 into rapidsai:branch-24.08 Jun 4, 2024
73 checks passed
@mroeschke mroeschke deleted the ref/column_to_pandas/index branch June 4, 2024 18:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants