ENH: Implement DataFrame.select #61527

Closed
wants to merge 10 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -33,6 +33,7 @@ Other enhancements
- :meth:`pandas.api.interchange.from_dataframe` now uses the `PyCapsule Interface <https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html>`_ if available, only falling back to the Dataframe Interchange Protocol if that fails (:issue:`60739`)
- Added :meth:`.Styler.to_typst` to write Styler objects to file, buffer or string in Typst format (:issue:`57617`)
- Added missing :meth:`pandas.Series.info` to API reference (:issue:`60926`)
- Added new :meth:`DataFrame.select` method to select a subset of columns from the :class:`DataFrame` (:issue:`61522`)
- :class:`pandas.api.typing.NoDefault` is available for typing ``no_default``
- :func:`DataFrame.to_excel` now raises an ``UserWarning`` when the character count in a cell exceeds Excel's limitation of 32767 characters (:issue:`56954`)
- :func:`pandas.merge` now validates the ``how`` parameter input (merge type) (:issue:`59435`)
113 changes: 113 additions & 0 deletions pandas/core/frame.py
@@ -4479,6 +4479,119 @@ def _get_item(self, item: Hashable) -> Series:
# ----------------------------------------------------------------------
# Unsorted

def select(self, *args):
Contributor

One issue here is that it is then possible to do df.select(), since when you specify *args, you don't have to specify any arguments. Maybe change the API to this:

def select(self, arg0: Hashable | list[Hashable], *args: Hashable) -> pd.DataFrame:

This then requires the first argument, which is either a hashable or a list, and the arguments after that (if provided) have to also be hashables.

This also allows better type checking for users.

Member Author

That's a good point. I was giving it a try, but after checking in detail this doesn't seem to be a good idea. What about this:

df.select([])

Given that df[[]] returns an empty dataframe, I think the above example should also return an empty dataframe. And I don't think df.select(*my_list) should raise a TypeError when df.select(my_list) doesn't (in the case of an empty list).
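
The inconsistency described here can be sketched without pandas. The function below is a hypothetical stand-in (`select_required` is not part of the PR) whose first argument is required, as in the signature proposed above; it only returns the resolved column names:

```python
def select_required(arg0, *args):
    """Hypothetical stand-in: the first argument is required."""
    if isinstance(arg0, list):
        return list(arg0)
    return [arg0, *args]


my_list = []

# Passing the empty list itself works and selects nothing.
print(select_required(my_list))

try:
    # Unpacking an empty list turns this into a call with no arguments.
    select_required(*my_list)
except TypeError as exc:
    print("TypeError:", exc)
```

With a required first argument, the two spellings of "select nothing" behave differently, which is the case being objected to.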

Contributor

From a typing perspective, if you have a *args argument, it can't support both lists and "deconstructed" lists.

E.g., if you had def select(*args: Hashable) then that says you would have Hashable separated by commas.

I think this will do what you need:

def select(arg0: Hashable | list[Hashable] = [], *args: Hashable) -> pd.DataFrame: ...

Then select(), select([]), select("a", "b") and select(["a", "b"]) will all pass typing checks, and any combination of lists and Hashable would fail.

Member Author

I see your point, and from a typing perspective I fully agree. But I don't want to introduce the inconsistency I mention above just to have more accurate typing.

Contributor

But my suggestion would not introduce that inconsistency. You'd be able to do select() and select([]) and they'd both return an empty dataframe.

Member Author

Ah, sorry, I read it on my phone and didn't see the default value. I don't like that it makes the signature significantly more difficult to understand. But I'm open to it; maybe someone else has an opinion?

Member Author

I've been thinking about this @Dr-Irv, and I'd prefer to keep the implementation as it is now. Unfortunately there have been no other opinions. I understand your point, and I think it's a very reasonable one, and I can probably be convinced if I'm missing advantages of your proposal other than the typing. But to me the signature is much clearer and easier to understand when it only accepts *args. I didn't find good examples, but in the cases I've seen, using just *args seems more common. In the standard library, max supports this pattern, but it's implemented in C with some fancy tuple unpacking, and it actually has both signatures.

Also, I think it's considered bad practice to use mutable objects as default parameter values. If we are not careful and the implementation of your proposal mutates arg0, later calls to df.select() will return unexpected results.
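
For reference, a minimal illustration of that pitfall in plain Python (`collect` is a hypothetical function, unrelated to the PR's code):

```python
def collect(name, seen=[]):
    # The default list is created once, when the function is defined,
    # and is shared by every call that omits `seen`.
    seen.append(name)
    return seen


first = collect("a")
second = collect("b")

print(second)           # ['a', 'b']
print(first is second)  # True
```

A default of `[]` is only safe as long as the function body never mutates it, for example by copying it before use.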

As said, your proposal seems reasonable, but I think keeping just *args and unpacking if args[0] is a list seems to be a simpler and better option.

Contributor

The sole advantage is in typing. The question is whether you look at an API as just the signature, or as the docs. With pandas-stubs being embedded now in VS Code, if you keep select(*args), you really can't do any decent type checking.

You can also make the API more "complete" if you just want to look at signatures, by indicating the possible calling sequences via typing overloads.
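
A rough sketch of what such overloads might look like, using a hypothetical `Frame` class as a stand-in for DataFrame. The three call forms and the unpacking logic are assumptions for illustration, not the signatures used in pandas or pandas-stubs:

```python
from collections.abc import Hashable
from typing import overload


class Frame:
    """Hypothetical stand-in for DataFrame that only tracks column names."""

    def __init__(self, columns: list[Hashable]) -> None:
        self.columns = columns

    # Each overload documents one accepted calling sequence for type checkers.
    @overload
    def select(self) -> "Frame": ...
    @overload
    def select(self, arg0: list[Hashable], /) -> "Frame": ...
    @overload
    def select(self, arg0: Hashable, /, *args: Hashable) -> "Frame": ...

    def select(self, *args):
        # Runtime implementation shared by all overloads.
        if len(args) == 1 and isinstance(args[0], list):
            names = list(args[0])
        else:
            names = list(args)
        missing = [name for name in names if name not in self.columns]
        if missing:
            raise KeyError(f"{missing} not in columns")
        return Frame(names)


frame = Frame(["a", "b", "c"])
print(frame.select("c", "a").columns)    # ['c', 'a']
print(frame.select(["a", "c"]).columns)  # ['a', 'c']
print(frame.select().columns)            # []
```

The runtime method keeps the single `*args` parameter, while the overloads carry the more precise calling sequences.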

I also thought that we had a "policy" that we want all new code to be properly typed. But I'm not sure about that.

Member Author

You're right. Sorry, I didn't realize I hadn't added a type annotation to the parameter. I've added it now, thanks for the feedback. What I meant is that I'd prefer not to split *args into arg0=[], *args.

Let me know if this looks reasonable now, and when it is, if you can remove the "requested changes" flag that would be great. Once this is merged I'll start working on making .filter filter rows, which I'd like to have ready before pandas 3.

"""
Select a subset of columns from the DataFrame.

Select can be used to return a DataFrame with some specific columns.
This can be used to remove unwanted columns, as well as to return a
Contributor

I wouldn't use the word "remove" here, because it implies the columns are removed from the source DF. So instead of "remove unwanted columns", maybe say "select a subset of columns"

DataFrame with the columns sorted in a specific order.

Parameters
----------
*args : hashable or tuple of hashable
Contributor

Can we also support a list of hashables?

Member Author

What would be the meaning of a list? Same as a tuple, for MultiIndex?

The names of the columns to return. In general this will be strings,
but pandas supports other types of column names, if they are hashable.

Returns
-------
DataFrame
The DataFrame with the selected columns.

See Also
--------
DataFrame.filter : To return a subset of rows, instead of a subset of columns.

Examples
--------
>>> df = pd.DataFrame(
... {
... "first_name": ["John", "Alice", "Bob"],
... "last_name": ["Smith", "Cooper", "Marley"],
... "age": [61, 22, 35],
... }
... )

Select a subset of columns:

>>> df.select("first_name", "age")
first_name age
0 John 61
1 Alice 22
2 Bob 35

Selecting with a pattern can be done with Python expressions:

>>> df.select(*[col for col in df.columns if col.endswith("_name")])
first_name last_name
0 John Smith
1 Alice Cooper
2 Bob Marley

All columns can be selected, but in a different order:

>>> df.select("last_name", "first_name", "age")
last_name first_name age
0 Smith John 61
1 Cooper Alice 22
2 Marley Bob 35

In case the columns are in a list, Python unpacking with star can be used:

>>> columns = ["last_name", "age"]
>>> df.select(*columns)
last_name age
0 Smith 61
1 Cooper 22
2 Marley 35

Note that a DataFrame is always returned. If a single column is requested, a
DataFrame with a single column is returned, not a Series:

>>> df.select("age")
age
0 61
1 22
2 35

The ``select`` method also works when columns are a ``MultiIndex``:

>>> df = pd.DataFrame(
... [("John", "Smith", 61), ("Alice", "Cooper", 22), ("Bob", "Marley", 35)],
... columns=pd.MultiIndex.from_tuples(
... [("names", "first_name"), ("names", "last_name"), ("other", "age")]
... ),
... )

If just column names are provided, they will select from the first level of the
``MultiIndex``:

>>> df.select("names")
names
first_name last_name
0 John Smith
1 Alice Cooper
2 Bob Marley

To select from multiple or all levels, tuples can be provided:

>>> df.select(("names", "last_name"), ("other", "age"))
Contributor

Is it worth also showing the list variant of this, i.e., df.select([("names", "last_name"), ("other", "age")])

Member Author

I gave this a try, but personally I don't think it adds too much value, as it's already explained in the parameters, and in the second example that this is possible. So, it really felt like repeating this already complex example for little gain, causing more confusion than adding value.

Contributor

IMHO, I think it also shows that you can pass a list of tuples, just like in the MultiIndex.from_tuples() call.

names other
last_name age
0 Smith 61
1 Cooper 22
2 Marley 35
"""
        if args and isinstance(args[0], list):
            raise ValueError(
                "`DataFrame.select` does not support a list. Please use "
                "`df.select('col1', 'col2',...)` or `df.select(*['col1', 'col2',...])` "
                "instead"
            )

        indexer = self.columns._get_indexer_strict(list(args), "columns")[1]
        return self.take(indexer, axis=1)

    @overload
    def query(
        self,
85 changes: 85 additions & 0 deletions pandas/tests/frame/methods/test_select.py
@@ -0,0 +1,85 @@
import pytest

import pandas as pd
from pandas import DataFrame
import pandas._testing as tm


@pytest.fixture
def regular_df():
    return DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})


@pytest.fixture
def multiindex_df():
    return DataFrame(
        [(0, 2, 4), (1, 3, 5)],
        columns=pd.MultiIndex.from_tuples([("A", "c"), ("A", "d"), ("B", "e")]),
    )


class TestSelect:
    def test_select_subset_cols(self, regular_df):
        expected = DataFrame({"a": [1, 2], "c": [5, 6]})
Contributor

Why not use expected = df[["a", "c"]] ? (here and in other tests)

Member Author

I don't want the test to fail for changes in df[[]], I think it's a better practice to make tests as simple and focused as possible.

What you say makes sense if we think of the test as checking that select and [] behave the same. But I see it as testing that select does what I want it to do, regardless of [].

Contributor

I see your point. I could go either way on this

        result = regular_df.select("a", "c")
        tm.assert_frame_equal(result, expected)

    def test_single_value(self, regular_df):
        expected = DataFrame({"a": [1, 2]})
        result = regular_df.select("a")
        assert isinstance(result, DataFrame)
        tm.assert_frame_equal(result, expected)

    def test_select_change_order(self, regular_df):
        expected = DataFrame({"b": [3, 4], "d": [7, 8], "a": [1, 2], "c": [5, 6]})
        result = regular_df.select("b", "d", "a", "c")
        tm.assert_frame_equal(result, expected)

    def test_select_none(self, regular_df):
        result = regular_df.select()
        assert result.empty

    def test_select_duplicated(self, regular_df):
        expected = ["a", "d", "a"]
        result = regular_df.select("a", "d", "a")
        assert result.columns.tolist() == expected

    def test_select_list(self, regular_df):
        with pytest.raises(ValueError, match="does not support a list"):
            regular_df.select(["a", "b"])

    def test_select_missing(self, regular_df):
        with pytest.raises(KeyError, match=r"None of .* are in the \[columns\]"):
            regular_df.select("z")

    def test_select_not_hashable(self, regular_df):
        with pytest.raises(TypeError, match="unhashable type"):
            regular_df.select(set())

    def test_select_multiindex_one_level(self, multiindex_df):
        expected = DataFrame(
            [(0, 2), (1, 3)],
            columns=pd.MultiIndex.from_tuples([("A", "c"), ("A", "d")]),
        )
        result = multiindex_df.select("A")
        tm.assert_frame_equal(result, expected)

    def test_select_multiindex_single_column(self, multiindex_df):
        expected = DataFrame(
            [(2,), (3,)], columns=pd.MultiIndex.from_tuples([("A", "d")])
        )
        result = multiindex_df.select(("A", "d"))
        assert isinstance(result, DataFrame)
        tm.assert_frame_equal(result, expected)

    def test_select_multiindex_multiple_columns(self, multiindex_df):
        expected = DataFrame(
            [(0, 4), (1, 5)],
            columns=pd.MultiIndex.from_tuples([("A", "c"), ("B", "e")]),
        )
        result = multiindex_df.select(("A", "c"), ("B", "e"))
        tm.assert_frame_equal(result, expected)

    def test_select_multiindex_missing(self, multiindex_df):
        with pytest.raises(KeyError, match="not in index"):
            multiindex_df.select("Z")