ENH: Implement DataFrame.select to select columns #61522

Open
datapythonista opened this issue May 30, 2025 · 3 comments · May be fixed by #61527
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action

Comments

@datapythonista
Member

Add a new method DataFrame.select to select columns from a DataFrame. The exact specs are still open to discussion; what follows is a draft of what the method could look like.

Basic case: select columns. Personally, I think both forms should be supported for convenience, either a list or multiple parameters via *args:

df.select("column1", "column2")
df.select(["column1", "column2"])
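A minimal sketch of how the two call forms could be normalized. The free function `select` below is a hypothetical stand-in written for illustration, not the implementation in the linked pull request, and it simply delegates to the existing `df[...]` indexing:

```python
import pandas as pd


def select(df, *args):
    """Hypothetical stand-in for the proposed DataFrame.select."""
    # A single list argument is treated the same as separate arguments.
    if len(args) == 1 and isinstance(args[0], list):
        columns = args[0]
    else:
        columns = list(args)
    # Delegate to the existing list-based column indexing.
    return df[columns]


df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "other": [5, 6]})

by_args = select(df, "column1", "column2")
by_list = select(df, ["column1", "column2"])
```

Both calls return the same two-column DataFrame, so the choice between the forms is purely one of convenience.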

Cases to consider.

What if a provided column doesn't exist? I assume we want to raise a ValueError.

What if a column is duplicated? I assume we want to return the column twice.
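For reference, this is how the existing list-based `df[...]` indexing handles the same two cases today. It is only a point of comparison, not a statement of what `DataFrame.select` will do:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4]})

# A label listed twice is returned twice.
dup = df[["column1", "column1"]]

# A missing label raises KeyError with current indexing (not ValueError).
try:
    df[["column1", "missing"]]
    raised = None
except KeyError as exc:
    raised = type(exc).__name__
```

If `select` is meant to mirror `__getitem__`, the exception type for a missing column is one of the details that would need to be settled.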

How to select with a wildcard or regex? Some options:

  1. Not support them (users can do anything fancy with df.columns themselves).
  2. Assume the column is a regex if the name starts with ^ and ends with $. For wildcards, I guess it could be OK, if column* is provided, to first check whether a column with the literal star exists; if it does, return it, otherwise assume the star is a wildcard.
  3. Accept callables, so users can do df.select(lambda col: col.startswith("column"))
  4. Have an extra parameter regex, like df.select(regex="column\d")
  5. Same as 2, but make users enable it explicitly with a flag: df.select("column\d", regex=True)

Personally, I'd start with 1, not supporting anything fancy, and decide later. It's way easier to add something than to remove something we don't like once it is released.
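As a rough illustration of option 1, here is how users can already build the column list themselves with existing pandas features. The column names are made up for the example, and the results are passed to plain `df[...]` indexing rather than to the proposed method:

```python
import re

import pandas as pd

df = pd.DataFrame({"column1": [1], "column2": [2], "other": [3]})

# Predicate on the column names (what a callable argument would do).
by_prefix = df[[c for c in df.columns if c.startswith("column")]]

# Explicit regex applied to the column names.
pattern = re.compile(r"^column\d$")
by_regex = df[[c for c in df.columns if pattern.match(c)]]

# The existing DataFrame.filter already accepts a regex for column labels.
by_filter = df.filter(regex=r"^column\d$")
```

All three approaches pick `column1` and `column2` here, which suggests that starting without built-in wildcard or regex support does not block these use cases.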

What to do with MultiIndex? I guess if a list of strings is provided, they should select from the first level of the MultiIndex. Should we support the elements being tuples to select multiple levels at once? I haven't worked much with MultiIndex myself for a while, @Dr-Irv maybe you have an idea on what the expectation should be.

Can anyone think of anything else not trivial for implementing this?

@datapythonista datapythonista added Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action labels May 30, 2025
@datapythonista datapythonista changed the title ENH: Implement select to select columns ENH: Implement DataFrame.select to select columns May 30, 2025
@rhshadrach
Member

Related issues

#40322
#55289
#61317

> What if...

When just *args are provided, this should have the same behavior as __getitem__ when a row is not provided. Doing anything else would be a -1 on my end.

> How to select with a wildcard or regex? Some options:

I do not think we should support wildcards. I'd be okay with regex, but I don't think it's necessary. Long term I'd much rather see pd.col and have regex support there.

@datapythonista datapythonista linked a pull request May 31, 2025 that will close this issue
@Dr-Irv
Contributor

Dr-Irv commented Jun 3, 2025

> What to do with MultiIndex? I guess if a list of strings is provided, they should select from the first level of the MultiIndex. Should we support the elements being tuples to select multiple levels at once? I haven't worked much with MultiIndex myself for a while, @Dr-Irv maybe you have an idea on what the expectation should be.

I use MultiIndex on rows, not columns. There are times when a MultiIndex is backing the columns, and it is painful to work with in code (although useful if you export data to Excel). I think we should support a list of tuples to represent selecting columns backed by MultiIndex. e.g., based on example in https://stackoverflow.com/questions/18470323/selecting-columns-from-pandas-multiindex :

# sample data
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
# proposed API: select columns by a list of (level 0, level 1) tuples
data.select([('one', 'a'), ('one', 'b'), ('two', 'a'), ('two', 'b')])

In terms of the choices above, I like 3 - support callables.
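For comparison, a sketch of what the same selection looks like with indexing that exists today, plus a callable-style filter over the column tuples. The predicate is only an illustrative assumption about how a callable might be used, not a proposed signature:

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays([["one"] * 3 + ["two"] * 3,
                                 ["a", "b", "c"] * 2])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# Existing indexing already accepts a list of tuples for MultiIndex columns.
wanted = [("one", "a"), ("one", "b"), ("two", "a"), ("two", "b")]
subset = data[wanted]

# Callable-style selection: each column label is a tuple of level values.
def keep(label):
    return label[1] != "c"

filtered = data.loc[:, [c for c in data.columns if keep(c)]]
```

Both `subset` and `filtered` contain the same four columns, so a list of tuples and a callable are two routes to the same result in this example.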

@datapythonista
Member Author

Thanks for the feedback. I got your example already implemented in #61527.

For now I preferred to not implement wildcards, regex, callables or equivalent. I added to the examples how to do it, which I think for many cases will be reasonable. I think it's better to start like this, see what cases are not well supported, and then add as needed. Adding is very easy, removing something is very annoying, so I think better to make things as simple as possible to start with.

Feedback on the PR is very welcome.


3 participants