ENH: Implement DataFrame.select to select columns #61522
Labels: Enhancement, Indexing (related to indexing on series/frames, not to indexes themselves), Needs Discussion (requires discussion from core team before further action)
Add a new method, `DataFrame.select`, to select columns from a DataFrame. The exact spec is still open to discussion; what follows is a draft of what the method could look like.
Basic case: select columns. Personally, I think passing the columns either as a list or as multiple parameters with `*args` should both be supported for convenience.
Cases to consider:
What if a provided column doesn't exist? I assume we want to raise a `ValueError`.
What if a column is duplicated? I assume we want to return the column twice.
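To make the basic case and the two edge cases above concrete, here is a minimal sketch written as a free function. The name `select` and its signature are assumptions for illustration only; nothing here is an existing pandas API beyond `DataFrame.loc`.

```python
import pandas as pd


def select(df: pd.DataFrame, *columns) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed DataFrame.select."""
    # Accept either select(df, "a", "b") or select(df, ["a", "b"]).
    if len(columns) == 1 and isinstance(columns[0], list):
        columns = tuple(columns[0])
    # Assumed behavior from the draft: a missing column raises ValueError.
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    # A list indexer keeps the requested order and repeats duplicates.
    return df.loc[:, list(columns)]


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
as_args = select(df, "a", "c")      # columns passed as *args
as_list = select(df, ["c", "a"])    # columns passed as a single list
repeated = select(df, "a", "a")     # the same column requested twice
```

Whether a single list argument and `*args` should both be accepted is exactly the open question in the draft; the sketch simply shows that supporting both is cheap.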
How to select with a wildcard or regex? Some options:
1. Not support anything fancy, and let users filter `df.columns` themselves.
2. Treat the name as a regex if it starts with `^` and ends with `$`. For wildcards, I guess it could be ok, if `column*` is provided, to first check whether a column with the star exists; if it does, return it, otherwise assume the star is a wildcard.
3. Accept a callable: `df.select(lambda col: col.startswith("column"))`
4. A `regex` parameter, like `df.select(regex="column\d")`
5. A boolean flag: `df.select("column\d", regex=True)`
Personally, I'd start with 1, not supporting anything fancy, and decide later. It's far easier to add something than to remove something we don't like once it's released.
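For reference, this is roughly what users can already write today when `select` supports nothing fancy. The snippet only uses existing pandas features (`Index.str`, `DataFrame.filter`, plain list indexing); the example frame and column names are made up.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1], "column2": [2], "other": [3]})

# Prefix match through the string accessor on the column Index.
by_prefix = df.loc[:, df.columns.str.startswith("column")]

# Regex match; DataFrame.filter applies re.search to each column label.
by_regex = df.filter(regex=r"column\d")

# Arbitrary predicate, close in spirit to the callable variant.
by_predicate = df[[col for col in df.columns if col.endswith("1")]]
```

All three return a DataFrame with the matching columns in their original order, which is one argument for keeping the first version of `select` minimal.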
What to do with MultiIndex? I guess if a list of strings is provided, they should select from the first level of the MultiIndex. Should we support the elements being tuples to select multiple levels at once? I haven't worked much with MultiIndex myself for a while, @Dr-Irv maybe you have an idea on what the expectation should be.
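As a starting point for the MultiIndex discussion, the two behaviors mentioned above can be reproduced with current pandas indexing. This is only a sketch of possible semantics, not a proposal for how `select` must behave; the frame and labels are invented for the example.

```python
import pandas as pd

columns = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])
df = pd.DataFrame([[1, 2, 3, 4]], columns=columns)

# Strings matched against the first level only: keeps every column under "a".
first_level = df.loc[:, df.columns.get_level_values(0).isin(["a"])]

# Tuples naming all levels at once: picks exact columns.
by_tuple = df.loc[:, [("a", "x"), ("b", "x")]]
```

Note that the boolean mask keeps the columns in their original order rather than the requested order, which is one more detail the spec would need to settle.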
Can anyone think of anything else non-trivial that implementing this would involve?