ENH: Implement DataFrame.select #61527

Open
wants to merge 1 commit into
base: main
from

Conversation

datapythonista
Member

Based on the feedback in #61522 and on the last devs call, I implemented DataFrame.select in the simplest way. It works with MultiIndex, but it does not directly support equivalents to filter(regex=) or filter(like=). I added examples to the docs so users can do that easily in plain Python (I can add one for regex if people think it's worth it).
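
As a rough illustration of the plain-Python equivalents mentioned above, here is a minimal sketch that uses only existing pandas indexing. The sample frame and its column names are made up for illustration, and this is not the code from the PR's docs.

```python
import re

import pandas as pd

# Hypothetical sample data, only for illustration.
df = pd.DataFrame(
    {
        "first_name": ["Alice", "Bob"],
        "last_name": ["Cooper", "Marley"],
        "age": [22, 35],
    }
)

# Equivalent of df.filter(like="name"): keep columns whose label contains a substring.
like_cols = [col for col in df.columns if "name" in col]
by_like = df[like_cols]

# Equivalent of df.filter(regex="^last"): keep columns whose label matches a pattern.
pattern = re.compile(r"^last")
regex_cols = [col for col in df.columns if pattern.search(col)]
by_regex = df[regex_cols]
```

Both list comprehensions assume string column labels; non-string labels would need a `str(col)` conversion first.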

The examples in the docs and the tests should make the behavior quite clear. Feedback welcome.

For context, this is added so we can make DataFrame.filter focus on filtering rows, for example:

df = df.select("name", "age")
df = df.filter(df.age >= 18)

or

(df.select("name", "age")
   .filter(lambda df: df.age >= 18))
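
For readers who want to try the idea with a released pandas, the following is a minimal sketch under stated assumptions: `select` here is a hypothetical free function standing in for the proposed method (it is not the PR's implementation), and the row filter uses `.loc` with a callable, which pandas already supports, instead of the proposed row-oriented `filter`.

```python
import pandas as pd


def select(df: pd.DataFrame, *args) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed DataFrame.select (not the PR's code)."""
    # Each positional argument is one column label; list(args) keeps their order.
    return df.loc[:, list(args)]


# Made-up sample data for illustration.
df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [22, 15], "city": ["Oslo", "Rome"]}
)

# Select columns first, then keep rows matching a boolean condition.
adults = select(df, "name", "age").loc[lambda d: d.age >= 18]
```

When the labels are already in a list, `select(df, *cols)` unpacks them into separate arguments.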

CC: @pandas-dev/pandas-core

@datapythonista added the Indexing (related to indexing on series/frames, not to indexes themselves), API Design, and Enhancement labels on May 31, 2025
Parameters
----------
*args : hashable or tuple of hashable
Contributor

Can we also support a list of hashables?

Member Author

What would be the meaning of a list? Same as a tuple, for MultiIndex?

1 Cooper Alice 22
2 Marley Bob 35
In case the columns are in a list, Python unpacking with star can be used:
Contributor

I'm not a fan of this - I'd prefer just passing the list

Member Author

I'm open to it, and it was my first idea to support both df.select("col1", "col2") and df.select(["col1", "col2"]).

But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable just a star makes it work.

And besides readability, which to me would be enough reason to implement it like this, allowing a list adds a decent amount of complexity. For example, what would you do here: df.select(["col1", "col2"], "col3")? Raise? Return all columns? What about this other case: df.select(["col1", "col2"], ["col3", "col4"])? Same as the previous? What about df.select("col1", ["col2", "col3"])? Personally, I think we shouldn't have to answer this, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.
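
To make the variadic option concrete, here is a minimal sketch in which every positional argument is exactly one column label and any list argument is rejected. `select_varargs` is a hypothetical helper, and raising `TypeError` is just one possible answer to the questions above, not necessarily what the PR does.

```python
import pandas as pd


def select_varargs(df: pd.DataFrame, *args) -> pd.DataFrame:
    """Hypothetical variadic helper: every argument is exactly one column label."""
    for arg in args:
        if isinstance(arg, list):
            # Lists are unhashable, so they can never be a column label.
            raise TypeError(f"expected a column label, got a list: {arg!r}")
    return df.loc[:, list(args)]


df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})

cols = ["col1", "col2"]
# Unpack the list with a star, then add one more label.
picked = select_varargs(df, *cols, "col3")

try:
    select_varargs(df, cols, "col3")  # mixed list and label: rejected
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```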

Contributor

> I'm open to it, and it was my first idea to support both df.select("col1", "col2") and df.select(["col1", "col2"]).

Why not support ONLY a list?

> But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable just a star makes it work.

I think this is about consistency in the API. For example, with DataFrame.groupby(), you can't do df.groupby("a", "b"), you have to do df.groupby(["a", "b"]).
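
For reference, a small example of the existing list convention in `DataFrame.groupby` that this comparison relies on; the sample data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"], "value": [10, 20, 30]})

# Multiple grouping keys are passed as a single list argument.
totals = df.groupby(["a", "b"])["value"].sum()
```

The result is a Series indexed by the (a, b) pairs, one entry per group.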

> And besides readability, which to me would be enough reason to implement it like this, allowing a list adds a decent amount of complexity.

It's complexity in the implementation versus consistency of the API.

> For example, what would you do here: df.select(["col1", "col2"], "col3")? Raise? Return all columns?

Raise. Only support lists or callables. And a static type checker would see that as invalid.

> What about this other case: df.select(["col1", "col2"], ["col3", "col4"])? Same as the previous?

Raise. And a static type checker would see that as invalid.

> What about df.select("col1", ["col2", "col3"])?

Raise. And a static type checker would see that as invalid.

> Personally, I think we shouldn't have to answer this, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.

I don't see why a list isn't simple (and consistent), and it allows better type checking, as well as additions to the API in the future, if we should decide to do so.
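
A minimal sketch of the list-only alternative, assuming a single `columns` parameter annotated as a list. `select_list` is a hypothetical helper, not code from the PR, and callables are left out because the thread does not spell out their semantics.

```python
from __future__ import annotations

from collections.abc import Hashable

import pandas as pd


def select_list(df: pd.DataFrame, columns: list[Hashable]) -> pd.DataFrame:
    """Hypothetical list-only variant: a single list argument, nothing else."""
    if not isinstance(columns, list):
        raise TypeError(f"columns must be a list, got {type(columns).__name__}")
    return df.loc[:, columns]


df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
picked = select_list(df, ["col1", "col2"])

try:
    # Also a static type error: "col1" is a str, not list[Hashable].
    select_list(df, "col1")
    bare_label_rejected = False
except TypeError:
    bare_label_rejected = True

try:
    # Extra positional arguments do not fit the signature, so Python itself raises.
    select_list(df, ["col1", "col2"], "col3")
    extra_arg_rejected = False
except TypeError:
    extra_arg_rejected = True
```

With one parameter, the mixed calls discussed above fail at the signature level, which is also what lets a static type checker flag them.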

Member Author

Thanks for the detailed feedback, what you say seems reasonable. To me, there is a significant advantage in readability and usability in using df.select("col1", "col2") over df.select(["col1", "col2"]). I see your point on consistency with groupby, and while the list is still not my favorite option, it does seem reasonable. I'll let others share their opinions too, since in the end there is a trade-off and it's a question of personal preference.

@jbrockmendel
Member

Slight preference for (arg) over (*arg), strong preference for supporting one, not both.

Development

Successfully merging this pull request may close these issues.

ENH: Implement DataFrame.select to select columns
3 participants