BUG: Boolean selection edge case. #61191

Open
2 of 3 tasks
ptth222 opened this issue Mar 27, 2025 · 2 comments
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Needs Discussion (requires discussion from core team before further action)

Comments

@ptth222

ptth222 commented Mar 27, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas

df1 = pandas.DataFrame()
boolean_series = pandas.Series(dtype=bool)
df1[boolean_series]
# Empty DataFrame
# Columns: []
# Index: []

df2 = pandas.DataFrame(index=[0, 1])
df2[boolean_series]
# IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Issue Description

Trying to use an empty boolean Series to select on an empty DataFrame that has an index results in an error.

Expected Behavior

I would expect this to return an empty DataFrame. The expectation might make more sense with an example.

import pandas

df1 = pandas.DataFrame(['a', 'b'], index = [0, 1])
df1[df1.duplicated()]
# Empty DataFrame
# Columns: [0]
# Index: []

df2 = pandas.DataFrame(index = [0, 1])
df2[df2.duplicated()]
# IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Both of these DataFrames have no duplicate values, but only one results in an error. It would be nice not to require a test for this special case and just get an empty DataFrame as the result since an empty DataFrame does not contain any duplicates.

I looked into this a little because I thought the .duplicated method might just need to return the empty Series with the index attached. As far as I can tell, though, it is not possible to create a Series with an index but no values the way you can with a DataFrame. If you try, the values are set to some default; for bool it is True.

I think the selection code would have to check for an empty Series before trying to use the index and return an empty DataFrame. If I am reading the code correctly, you could add a case to the if chain at the top of the ._getitem_bool_array method in pandas/core/frame.py. Something like:

if isinstance(key, Series) and key.empty:
    return self._take_with_is_copy(Index([]), axis=0)
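
For comparison, the same check can be applied on the caller's side without touching pandas internals. This is only a minimal sketch: `select_rows` is a hypothetical helper name, not a pandas API, and it assumes that "empty boolean Series means select no rows" is the behavior you want.

```python
import pandas as pd


def select_rows(df: pd.DataFrame, mask) -> pd.DataFrame:
    """Hypothetical helper (not a pandas API): boolean selection that
    treats an empty boolean Series as "select no rows"."""
    if isinstance(mask, pd.Series) and mask.empty:
        # Positional empty slice: keeps the columns, drops every row.
        return df.iloc[0:0]
    return df[mask]


with_columns = pd.DataFrame(["a", "b"], index=[0, 1])
no_columns = pd.DataFrame(index=[0, 1])

# Both calls return a DataFrame with zero rows instead of raising.
print(select_rows(with_columns, with_columns.duplicated()).shape)
print(select_rows(no_columns, no_columns.duplicated()).shape)
```

The helper leaves every non-empty mask to the normal `df[mask]` path, so misaligned non-empty masks still raise as before.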

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.10.5
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United States.1252

pandas : 2.2.3
numpy : 1.24.4
pytz : 2022.1
dateutil : 2.8.2
pip : 25.0.1
Cython : 3.0.11
sphinx : 5.1.1
IPython : 8.21.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : 4.9.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.1
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.9
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : 3.2.0
zstandard : None
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None

ptth222 added the Bug and Needs Triage labels on Mar 27, 2025
@rhshadrach
Member

Thanks for the report. First, a case that does not involve empty objects:

import pandas as pd

df1 = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
# The mask's only label is "a", which is not in df1.index (0, 1, 2).
mask = pd.Series({"a": True})
df1[mask]
# pandas.errors.IndexingError: Unalignable boolean Series provided as indexer

I do not think there is an appetite for changing this behavior. I agree it would be great if

df2 = pandas.DataFrame(index=[0, 1])
df2[df2.duplicated()]

could always work, but I do not see a way to change duplicated nor __getitem__ to make this so. E.g. if we were to follow your proposal:

> I think the selection code would have to check for an empty Series before trying to use the index and return an empty DataFrame.

this would introduce an edge case that goes against the general rule (unalignable Series will raise, except when they are empty). As such, I'm opposed to this way forward.
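
The general rule referred to here can be illustrated with a small sketch. It assumes a recent pandas where `pandas.errors.IndexingError` is available; the variable names are only for illustration. A boolean Series whose labels match the frame's index is aligned by label, while one whose labels cannot be matched raises.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# Same labels as df.index, in a different order: the mask is aligned by label,
# so only the row labelled 2 is selected (not the first row by position).
reordered = pd.Series([True, False, False], index=[2, 1, 0])
with warnings.catch_warnings():
    # pandas may warn that the key is reindexed to match the DataFrame index.
    warnings.simplefilter("ignore")
    selected = df[reordered]
print(selected.index.tolist())

# Labels that cannot be matched against df.index: selection raises.
unalignable = pd.Series({"a": True})
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        df[unalignable]
    raised = False
except pd.errors.IndexingError:
    raised = True
print(raised)
```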

rhshadrach added the Needs Discussion and Indexing labels and removed the Needs Triage label on Mar 28, 2025
@ptth222
Author

ptth222 commented Mar 28, 2025

> this would introduce an edge case that goes against the general rule (unalignable Series will raise, except when they are empty). As such, I'm opposed to this way forward.

You are just picking your edge case, though. You either introduce the one I suggested or leave in the one that's there, and the one that's there is much more onerous in my opinion. Is anyone really relying on an empty Series mask to result in an error? How many people would rather rely on the DataFrame's own built-in methods to always work with the DataFrame itself? Do you want to keep the current edge case so that some arbitrary general rule has no exception, at the cost of a less functional class? Or do you want the edge case that improves functionality but breaks a general rule in a scenario where it basically doesn't matter (IMO)? Note that these are questions to actually consider, not me just trying to win an argument or something.

Some alternatives:
You could change Series so it could have an index without values, or DataFrame so it can't. It's weird that these don't have the same behavior in that respect. That seems a bit more involved though.

Another solution is to have every method that returns a boolean Series check whether it is about to return an empty Series and, if so, return one with all False values and the same length as the DataFrame. Tracking all of these methods down would also be a pain. Making the default value of a boolean Series False instead of True, and always passing the DataFrame index into the Series constructor, would also work, but it's likely people rely on the default True somewhere.

Ultimately this is a structural issue: a Series cannot have an index without values while a DataFrame can, and these methods return boolean Series.
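
The asymmetry can be seen directly in a short sketch. The variable names are only for illustration, and the exact fill values pandas chooses for the Series are deliberately not checked here.

```python
import pandas as pd

# A DataFrame can carry row labels while holding no values at all.
frame = pd.DataFrame(index=[0, 1])
print(frame.shape)  # two row labels, zero columns
print(frame.size)   # number of stored values

# A Series built from the same index always gets one value per label;
# pandas fills them in with a default for the requested dtype.
series = pd.Series(index=[0, 1], dtype=bool)
print(len(series))
print(series.tolist())
```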

The way I see it the options are:

  1. Deal with the structural issue.
  2. Apply one of the patches I outlined.
  3. Ignore the problem and leave it because option 1 is too cumbersome and option 2 violates some rule in a way that's unlikely to matter (IMO).

Note that there is a similar error with unaligned sizes:

import pandas

df1 = pandas.DataFrame(['a', 'b'], index = [0, 1])
df1[df1.duplicated().values]
# Empty DataFrame
# Columns: [0]
# Index: []

df = pandas.DataFrame(index=[0, 1])
df[df.duplicated().values]
# ValueError: Item wrong length 0 instead of 2.

To me this suggests that having methods always return a Series of the same size as the DataFrame might be a better overall fix, at least when the Series dtype is bool. What's more logical: returning False for every row when there are no columns in a DataFrame, or returning an empty Series? I don't really know, but for the scenario that I started this issue for, the all-False Series doesn't result in an error.
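
A rough sketch of what a same-size mask looks like from the caller's side, assuming a hypothetical helper `full_length_mask` (not a pandas API) that pads any missing labels with False:

```python
import pandas as pd


def full_length_mask(df: pd.DataFrame, mask: pd.Series) -> pd.Series:
    """Hypothetical helper: reindex a boolean mask to df.index,
    filling labels that are missing from the mask with False."""
    return mask.reindex(df.index, fill_value=False).astype(bool)


df = pd.DataFrame(index=[0, 1])
mask = full_length_mask(df, df.duplicated())

print(mask.tolist())              # one False per row of df
print(df[mask].shape)             # Series mask: aligned, selects no rows
print(df[mask.to_numpy()].shape)  # array mask: length now matches df
```

With the padded mask, both the Series form and the `.values`-style array form select zero rows instead of raising.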
