ENH: Implement DataFrame.select to select columns #61522
When just
I do not think we should support wildcards. I'd be okay with regex, but don't think it'd be necessary. Long term I'd much rather see
I use
# sample data
import numpy as np
import pandas as pd
col = pd.MultiIndex.from_arrays([['one', 'one', 'one', 'two', 'two', 'two'],
                                 ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
# select (level 0, level 1) pairs with tuples
data.select([('one', 'a'), ('one', 'b'), ('two', 'a'), ('two', 'b')])
In terms of the choices above, I like 3 - support callables.
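For comparison, here is a small sketch of the same tuple-based selection using plain `[]` indexing, which pandas already supports for MultiIndex columns. It is only meant to illustrate the behaviour a `select` method might mirror; the variable names `wanted`, `subset` and `first_level` are made up for the example.

```python
import numpy as np
import pandas as pd

col = pd.MultiIndex.from_arrays(
    [["one", "one", "one", "two", "two", "two"],
     ["a", "b", "c", "a", "b", "c"]]
)
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

# A list of tuples picks individual (level 0, level 1) columns.
wanted = [("one", "a"), ("one", "b"), ("two", "a"), ("two", "b")]
subset = data[wanted]

# A list of plain strings selects everything under those first-level labels.
first_level = data[["one"]]
```

Whether `select` should behave exactly like this for MultiIndex columns is the open question in the issue; the sketch only shows what indexing does today.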
Thanks for the feedback. I already have your example implemented in #61527. For now I preferred not to implement wildcards, regex, callables or equivalent. I added to the examples how to do it, which I think will be reasonable for many cases. I think it's better to start like this, see what cases are not well supported, and then add as needed. Adding is very easy, removing something is very annoying, so I think it's better to make things as simple as possible to start with. Feedback on the PR is very welcome.
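As a rough illustration of doing this without built-in wildcard, regex or callable support, the column list can be built from `df.columns` first and then passed to the selection. This is a sketch with made-up column names, using plain `[]` indexing, and is not necessarily what the examples in the PR look like.

```python
import re

import pandas as pd

df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "other": [5, 6]})

# Callable-style selection: keep the columns for which a predicate is true.
by_prefix = df[[c for c in df.columns if c.startswith("column")]]

# Regex-style selection: keep the columns whose whole name matches a pattern.
pattern = re.compile(r"column\d")
by_regex = df[[c for c in df.columns if pattern.fullmatch(c)]]
```

The same list comprehensions could be passed to a `select` method once it exists, which is why leaving out the fancier options at first does not block these use cases.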
Add a new method DataFrame.select to select columns from a DataFrame. The exact specs are still open to discussion; here I write a draft of what the method could look like.
Basic case, select columns. Personally, I think both forms should be supported for convenience: passing a list, or passing multiple parameters with *args.
Cases to consider:
What if a provided column doesn't exist? I assume we want to raise a ValueError.
What if a column is duplicated? I assume we want to return the column twice.
How to select with a wildcard or regex? Some options:
1. Not support it; users can filter df.columns themselves.
2. Treat the name as a regex when it starts with ^ and ends with $. For wildcards, I guess it could be OK, if column* is provided, to first check whether a column with the star exists; if it does, return it, otherwise assume the star is a wildcard.
3. Support callables, like df.select(lambda col: col.startswith("column"))
4. A regex parameter, like df.select(regex="column\d") or df.select("column\d", regex=True)
Personally, I'd start with 1, not supporting anything fancy, and decide later. It's way easier to add something than to remove something we don't like once released.
What to do with MultiIndex? I guess if a list of strings is provided, they should select from the first level of the MultiIndex. Should we support the elements being tuples to select multiple levels at once? I haven't worked much with MultiIndex myself for a while, @Dr-Irv maybe you have an idea on what the expectation should be.
Can anyone think of anything else not trivial for implementing this?
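To make the basic cases above concrete, here is a minimal sketch written as a standalone function. The name `select_columns` is hypothetical and this is not the implementation in any PR; it only assumes the semantics drafted above (a list or `*args`, `ValueError` for a missing column, duplicates returned twice).

```python
import pandas as pd


def select_columns(df, *args):
    """Hypothetical stand-in for the proposed DataFrame.select."""
    # Accept both select_columns(df, ["a", "b"]) and select_columns(df, "a", "b").
    if len(args) == 1 and isinstance(args[0], list):
        columns = args[0]
    else:
        columns = list(args)

    # Assumed behaviour: a missing column raises ValueError.
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")

    # Indexing with a list keeps the requested order and repeats duplicates.
    return df[columns]


df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})
as_list = select_columns(df, ["a", "b"])
as_args = select_columns(df, "c", "a")
twice = select_columns(df, ["a", "a"])
```

Wildcards, regex, callables and any special MultiIndex handling are deliberately left out of the sketch, in line with starting from option 1.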