ENH: Implement DataFrame.select #61527

datapythonista · 2025-05-31T13:10:03Z

closes ENH: Implement DataFrame.select to select columns #61522
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Based on the feedback in #61522 and on the last devs call, I implemented DataFrame.select in the simplest way. It works with MultiIndex, but it does not directly support equivalents to filter(regex=) or filter(like=). I added examples in the docs so users can do that easily in Python (I can add one for regex if people think it's worth it).
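As a rough illustration of the "do that in Python" approach mentioned above, the sketch below builds the column list with a comprehension. The sample frame and variable names are made up for illustration, and plain `df[...]` indexing stands in for the proposed method so the snippet runs on released pandas; with this PR the lists would presumably be unpacked as `df.select(*like_cols)`.

```python
import re

import pandas as pd

# Illustrative data only; not taken from the PR's docs or tests.
df = pd.DataFrame(
    {
        "first_name": ["Alice", "Bob"],
        "last_name": ["Cooper", "Marley"],
        "age": [22, 35],
    }
)

# Rough equivalent of df.filter(like="name"): labels containing "name".
like_cols = [col for col in df.columns if "name" in col]

# Rough equivalent of df.filter(regex=r"^a"): labels matching the pattern.
regex_cols = [col for col in df.columns if re.search(r"^a", col)]

# Plain indexing is used here as a stand-in for df.select(*like_cols).
by_like = df[like_cols]
by_regex = df[regex_cols]
```

Both results should match what `DataFrame.filter(like=...)` and `DataFrame.filter(regex=...)` return today for the same inputs.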

The examples in the docs and the tests should make the behavior quite clear. Feedback welcome.

For context, this is added so we can make DataFrame.filter focus on filtering rows, for example:

df = df.select("name", "age")
df = df.filter(df.age >= 18)

or

(df.select("name", "age")
   .filter(lambda df: df.age >= 18))
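For readers who want to try the intended workflow on released pandas, here is a minimal sketch. The `select` function is a hypothetical stand-in for the proposed method, not the PR's implementation, and `.loc` with a callable replaces the row-filtering `filter`, since the current `DataFrame.filter` selects by label rather than by boolean condition.

```python
import pandas as pd


def select(df: pd.DataFrame, *args) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed DataFrame.select(*args)."""
    return df[list(args)]


# Illustrative data only.
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [22, 35, 15],
        "city": ["Oslo", "Lima", "Rome"],
    }
)

# Select columns first, then keep rows where age >= 18.
adults = select(df, "name", "age").loc[lambda d: d.age >= 18]
```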

CC: @pandas-dev/pandas-core

Dr-Irv · 2025-06-03T18:22:25Z

pandas/core/frame.py

+
+        Parameters
+        ----------
+        *args : hashable or tuple of hashable


Can we also support a list of hashables?

What would be the meaning of a list? Same as a tuple, for MultiIndex?
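To make the tuple case concrete, the sketch below shows how tuples address individual columns of a MultiIndex in released pandas. It assumes, based only on the `*args : hashable or tuple of hashable` docstring line, that the proposed method would accept the same tuples as separate arguments; the data and level names are invented.

```python
import pandas as pd

# Illustrative two-level column index.
columns = pd.MultiIndex.from_tuples(
    [("name", "first"), ("name", "last"), ("info", "age")]
)
df = pd.DataFrame(
    [["Alice", "Cooper", 22], ["Bob", "Marley", 35]],
    columns=columns,
)

# Each tuple identifies one column of the MultiIndex.
# Under the proposal this would presumably read:
#     df.select(("name", "last"), ("info", "age"))
subset = df[[("name", "last"), ("info", "age")]]
```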

Dr-Irv · 2025-06-03T18:23:23Z

pandas/core/frame.py

+        1    Cooper      Alice   22
+        2    Marley        Bob   35
+
+        In case the columns are in a list, Python unpacking with star can be used:


I'm not a fan of this - I'd prefer just passing the list

I'm open to it, and my first idea was to support both df.select("col1", "col2") and df.select(["col1", "col2"]).

But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable just a star makes it work.

And besides readability, that to me would be enough reason to implement it like this, allowing a list adds a decent amount of complexity. For example, what would you do here? df.select(["col1", "col2"], "col3"). Raise? Return all columns? What about this other case: df.select(["col1", "col2"], ["col3", "col4"]) Same as the previous? What about: df.select("col1", ["col2", "col3"]). Personally, I think we shouldn't have to answer this, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.

I'm open to it, and my first idea was to support both df.select("col1", "col2") and df.select(["col1", "col2"]).

Why not support ONLY a list?

But after checking in more detail, I find the second version not so readable with the double brackets, and for the case when the columns are already in a variable just a star makes it work.

I think this is about consistency in the API. For example, with DataFrame.groupby(), you can't do df.groupby("a", "b"), you have to do df.groupby(["a", "b"]).

And besides readability, that to me would be enough reason to implement it like this, allowing a list adds a decent amount of complexity.

It's complexity in the implementation versus consistency of the API.

For example, what would you do here? df.select(["col1", "col2"], "col3"). Raise? Return all columns?

Raise. Only support lists or callables. And a static type checker would see that as invalid.

What about this other case: df.select(["col1", "col2"], ["col3", "col4"]) Same as the previous?

Raise. And a static type checker would see that as invalid.

What about: df.select("col1", ["col2", "col3"]).

Raise. And a static type checker would see that as invalid.

Personally, I think we shouldn't have to answer this, or make users guess much. The simplest approach seems to be good enough, if I'm not missing any use case.

I don't see why a list isn't simple (and consistent), and it allows better type checking, as well as additions to the API in the future, if we should decide to do so.
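The two signatures under discussion can be compared with a small sketch. Both functions are hypothetical helpers written for this comparison, not code from the PR. With the list-only form, the mixed calls asked about above fail on their own because Python rejects the extra positional argument.

```python
from collections.abc import Hashable

import pandas as pd


def select_varargs(df: pd.DataFrame, *args: Hashable) -> pd.DataFrame:
    """Hypothetical variadic form: select_varargs(df, "a", "b")."""
    return df[list(args)]


def select_list(df: pd.DataFrame, columns: list[Hashable]) -> pd.DataFrame:
    """Hypothetical list-only form: select_list(df, ["a", "b"])."""
    if not isinstance(columns, list):
        raise TypeError(f"columns must be a list, got {type(columns).__name__}")
    return df[columns]


# Illustrative data only.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
cols = ["a", "b"]

from_varargs = select_varargs(df, *cols)  # star-unpack an existing list
from_list = select_list(df, cols)         # pass the list directly

# A mixed call such as select_list(df, ["a", "b"], "c") raises TypeError
# because the function accepts only one positional argument after df.
```

A static type checker would also flag `select_list(df, "a")`, whereas the variadic form has no single "wrong shape" to reject.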

Thanks for the detailed feedback; what you say seems reasonable. To me, there is a significant advantage in readability and usability in using df.select("col1", "col2") over df.select(["col1", "col2"]). I see your point on consistency with groupby, and while the list is still not my favorite option, it does seem reasonable. I'll let others share their opinions too, since in the end there is a trade-off and it is a question of personal preference.

jbrockmendel · 2025-06-03T21:04:25Z

Slight preference for (arg) over (*arg), strong preference for supporting one, not both.

ENH: Implement DataFrame.select

0f64c13

datapythonista added Indexing API Design Enhancement labels May 31, 2025
