BUG: Boolean selection edge case. #61191
Comments
Thanks for the report. First, a case where we do not deal with empty objects:
df1 = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
mask = pd.Series({"a": True})
df1[mask]
# pandas.errors.IndexingError: Unalignable boolean Series provided as indexer
I do not think there is an appetite for changing this behavior. I agree it would be great if
df2 = pandas.DataFrame(index=[0, 1])
df2[df2.duplicated()]
could always work, but I do not see a way to change it: doing so would introduce an edge case that goes against the general rule (unalignable Series will raise, except when they are empty). As such, I'm opposed to this way forward.
You are just picking your edge case though. You either introduce the one I suggested or leave in the one that's there. The one that's there is much more onerous in my opinion. Is anyone really relying on an empty Series mask to result in an error? How many people would rather rely on the DataFrame's own built-in methods to always work with itself? Do you want to keep the current edge case so some arbitrary general rule doesn't have an exception, resulting in a less functional class? Or do you want the edge case that improves functionality, but breaks a general rule in a scenario where it basically doesn't matter (IMO)? Note that these are questions to actually consider, not me just trying to win an argument or something.
Some alternatives: Another solution is to have every method that returns a boolean Series check whether it is going to return an empty Series and instead return one with all False values, the same length as the DataFrame. Tracking all of these methods down would also be a pain. Having the boolean Series default value be False instead of True and always passing the DataFrame index into the Series constructor would also work, but it's likely people rely on the default True somewhere.
Ultimately this is a structural issue of Series not being able to have an index without values while DataFrames can, and of choosing to return boolean Series from methods. The way I see it the options are:
Note that there is a similar error with unaligned sizes:
To me this suggests that having methods always return a Series of the same size as the DataFrame might be a better overall fix, at least when the Series dtype is bool. What's more logical: returning False for every row when there are no columns in a DataFrame, or returning an empty Series? I don't really know, but for the scenario I opened this issue for, the all-False Series doesn't result in an error.
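The "all False, same length as the DataFrame" idea can be approximated on the caller's side today. The sketch below is a workaround under assumptions, not a change to pandas: `aligned_mask` is a hypothetical helper name, the example frames are made up for illustration, and the padding relies on `Series.reindex` with `fill_value=False`.

```python
import pandas as pd


def aligned_mask(df: pd.DataFrame, mask: pd.Series) -> pd.Series:
    """Hypothetical helper: pad a boolean mask to df's index with False."""
    # Labels missing from the mask (all of them, if the mask is empty)
    # are treated as "not selected".
    return mask.reindex(df.index, fill_value=False).astype(bool)


# Index-only frame: duplicated() may return an empty mask here.
no_columns = pd.DataFrame(index=[0, 1])
empty_result = no_columns[aligned_mask(no_columns, no_columns.duplicated())]

# Mask shorter than the frame: it covers labels 0 and 1 only.
df1 = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
short_mask = pd.Series([True, False])
short_result = df1[aligned_mask(df1, short_mask)]

print(len(empty_result))          # 0
print(list(short_result.index))   # [0]
```

Note the trade-off: padding silently hides a mask that is genuinely misaligned, which is exactly what the "unalignable Series will raise" rule protects against, so this is probably only appropriate where the mask is known to come from the same DataFrame.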
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Trying to use an empty boolean Series to select on an empty DataFrame that has an index results in an error.
Expected Behavior
I would expect to return an empty DataFrame. The expectation might make more sense with an example.
Both of these DataFrames have no duplicate values, but only one results in an error. It would be nice not to require a test for this special case and just get an empty DataFrame as the result since an empty DataFrame does not contain any duplicates.
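The original snippet is not preserved in this copy of the issue. As a rough sketch of the contrast described, assuming one frame with a single column and one index-only frame like the `df2` quoted in the comments, something along these lines shows the difference. `try_select_duplicates` is a hypothetical helper used only to report the outcome, and the exact result for the index-only frame may depend on the pandas version.

```python
import warnings

import pandas as pd


def try_select_duplicates(df: pd.DataFrame) -> str:
    """Hypothetical helper: report what df[df.duplicated()] does."""
    try:
        with warnings.catch_warnings():
            # Silence the reindexing UserWarning pandas may emit first.
            warnings.simplefilter("ignore", UserWarning)
            result = df[df.duplicated()]
    except pd.errors.IndexingError as exc:
        return f"IndexingError: {exc}"
    return f"{len(result)} rows"


with_column = pd.DataFrame({"a": [1, 2]})  # no duplicate rows
no_columns = pd.DataFrame(index=[0, 1])    # index only, no columns

print(try_select_duplicates(with_column))  # 0 rows
print(try_select_duplicates(no_columns))   # reported to raise on 2.2.3
```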
I looked into this a little bit because I thought maybe the .duplicated method just needed to have the empty Series also return the index, but it is not possible, as far as I can tell, to create a Series with an index but no values like you can with a DataFrame. If you try, the values are set to some default. In the case for bool it is True. I think the selection code would have to check for an empty Series before trying to use the index and return an empty DataFrame. If I am investigating this correctly, it looks like in pandas/core/frame.py in the ._getitem_bool_array method you could add a case to the if chain at the top. Something like:
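The proposed code is missing from this copy, and the following is not a patch to `pandas/core/frame.py`. It is only a user-level sketch of the same idea, with a hypothetical wrapper name `select_rows`: check for an empty Series before the index alignment happens and return an empty DataFrame in that case.

```python
import pandas as pd


def select_rows(df: pd.DataFrame, mask: pd.Series) -> pd.DataFrame:
    """Hypothetical wrapper: treat an empty boolean Series as "select nothing"."""
    if isinstance(mask, pd.Series) and mask.empty:
        # Same columns as df, but zero rows.
        return df.iloc[:0]
    # Anything else goes through normal boolean selection.
    return df[mask]


no_columns = pd.DataFrame(index=[0, 1])
result = select_rows(no_columns, no_columns.duplicated())
print(result.shape)  # (0, 0)

has_dupes = pd.DataFrame({"a": [1, 1]})
print(list(select_rows(has_dupes, has_dupes.duplicated()).index))  # [1]
```

Inside pandas the equivalent check would sit at the top of `._getitem_bool_array`, as described above, but whether that is acceptable is the open question in this thread.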
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.10.5
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 12, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_United States.1252
pandas : 2.2.3
numpy : 1.24.4
pytz : 2022.1
dateutil : 2.8.2
pip : 25.0.1
Cython : 3.0.11
sphinx : 5.1.1
IPython : 8.21.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : 1.1
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : 4.9.1
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.1
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.9
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : 3.2.0
zstandard : None
tzdata : 2024.1
qtpy : 2.4.1
pyqt5 : None