Notifications
Description
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
## example A
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns.drop_duplicates())
# Traceback (most recent call last):
# File "/home/cameron/.vim-excerpt", line 5, in <module>
# print(df.columns.drop_duplicates())
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3117, in drop_duplicates
# if self.is_unique:
# ^^^^^^^^^^^^^^
# File "properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
# File "/home/cameron/repos/opensource/narwhals-dev/.venv/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 2346, in is_unique
# return self._engine.is_unique
# ^^^^^^^^^^^^^^^^^^^^^^
# File "index.pyx", line 266, in pandas._libs.index.IndexEngine.is_unique.__get__
# File "index.pyx", line 271, in pandas._libs.index.IndexEngine._do_unique_check
# File "index.pyx", line 333, in pandas._libs.index.IndexEngine._ensure_mapping_populated
# File "pandas/_libs/hashtable_class_helper.pxi", line 7115, in pandas._libs.hashtable.PyObjectHashTable.map_locations
# TypeError: unhashable type: 'list'
## --------
## example B
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
# hasattr triggers a side effect where the `df.columns.drop_duplicates()` now works.
hasattr(df, 'hello_world')
print(df.columns.drop_duplicates())
# Index(['a', ['b', 'c']], dtype='object')
Issue Description
`pandas.Index.drop_duplicates()` inconsistently raises `TypeError: unhashable type: 'list'` when the index's values include a list. The error does not prevent the underlying uniqueness computation from completing. Beyond the reproducible example above, the behavior can be triggered directly on the `Index` object.
Calling `.drop_duplicates()` on an `Index` that contains unhashable values raises a `TypeError`:
import pandas as pd
idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
idx.drop_duplicates() # TypeError: unhashable type: 'list'
But if we simply ignore the error the first time and call `.drop_duplicates()` again, it works and removes the duplicated entries, including the unhashable ones:
import pandas as pd
idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
try:
idx.drop_duplicates() # TypeError: unhashable type: 'list'
except TypeError:
pass
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')
The underlying `Index` implementation populates its hashtable mapping even though the original call to `drop_duplicates` fails. We know this population is successful because the second attempt at `.drop_duplicates()` works:
import pandas as pd
idx = pd.Index(['a', ['b', 'c'], ['b', 'c']])
print(idx._engine.mapping) # None
try:
idx.drop_duplicates() # TypeError: unhashable type: 'list'
except TypeError:
pass
print(idx._engine.mapping) # <pandas._libs.hashtable.PyObjectHashTable>
print(idx.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')
Finally, it appears that attribute checking on a `pandas.DataFrame` causes the `PyObjectHashTable` to be constructed for the column index. This is likely due to the shared code path between `__getattr__` and `__getitem__`:
import pandas as pd
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns._engine.mapping) # None
hasattr(df, 'hello_world')
print(df.columns._engine.mapping) # <pandas._libs.hashtable.PyObjectHashTable>
print(df.columns.drop_duplicates()) # Index(['a', ['b', 'c']], dtype='object')
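The side effect above can be sketched in pure Python. This is a hypothetical toy class, not the real pandas code path (which handles the error differently); it only illustrates how a failed attribute probe can leave a populated lookup table behind:

```python
class Frame:
    """Toy stand-in (hypothetical) for the shared __getattr__/__getitem__ path."""

    def __init__(self, columns):
        self._columns = columns
        self._mapping = None  # lazily built column lookup table

    def _get_loc(self, key):
        if self._mapping is None:
            self._mapping = {}
            for i, col in enumerate(self._columns):
                try:
                    self._mapping[col] = i
                except TypeError:
                    pass  # unhashable label skipped; the table still exists
        return self._mapping[key]

    def __getattr__(self, name):
        try:  # attribute lookup falls back to column lookup
            return self._get_loc(name)
        except KeyError:
            raise AttributeError(name) from None

df = Frame(['a', ['b', 'c'], ['b', 'c']])
print(hasattr(df, 'hello_world'))  # False, yet the probe built the table
print(df._mapping)                 # {'a': 0}
```

Even though the attribute check fails, the lookup table it built as a side effect persists, which is the same shape of behavior `hasattr(df, 'hello_world')` triggers on a real `DataFrame`.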
Expected Behavior
I expect `Index.drop_duplicates()` to behave the same regardless of whether an attribute has been checked. The following two snippets should produce equivalent results (whether that is raising an error or producing a result):
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
print(df.columns.drop_duplicates()) # Currently produces → TypeError
import pandas as pd # 2.2.3
df = pd.DataFrame([[1, 2, 3]], columns=['a', ['b', 'c'], ['b', 'c']])
hasattr(df, 'hello_world')
print(df.columns.drop_duplicates()) # Currently produces → Index(['a', ['b', 'c']], dtype='object')
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.7
python-bits : 64
OS : Linux
OS-release : 6.6.52-1-lts
Version : #1 SMP PREEMPT_DYNAMIC Wed, 18 Sep 2024 19:02:04 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 2.2.2
pytz : 2025.1
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.2.0
html5lib : None
hypothesis : 6.125.3
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : 8.3.4
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.1
qtpy : None
pyqt5 : None
Activity
Title changed from "BUG: `Index.drop_duplicates()` is inconsistent for hashable values" to "BUG: `Index.drop_duplicates()` is inconsistent for unhashable values"
rhshadrach commented on Feb 13, 2025
Thanks for the report! This should raise consistently. Further investigations and PRs to fix are welcome!
Edit for visibility: As @MarcoGorelli points out below, this should even raise on index construction!
The source of the issue appears to be here:
pandas/pandas/_libs/index.pyx, lines 347 to 348 in 0305656
We create an unpopulated hash table, and then fail on the `map_locations` line. However, this only happens when `self.is_mapping_populated` is False, which uses:
pandas/pandas/_libs/index.pyx, lines 335 to 337 in 0305656
Still, I do not see how this could then return correct results. It is certainly inefficient.
My guess is that the hash table is somehow degenerating into saying everything is a collision, and therefore doing an O(n) lookup (where n is the size of the hash table).
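The failure mode described above can be sketched in pure Python (a hypothetical toy class, not the real Cython engine): the table object is created before population, so a later `is_mapping_populated`-style check sees a non-None table even though population failed partway.

```python
class Engine:
    """Toy stand-in (hypothetical) for IndexEngine's lazily built hash table."""

    def __init__(self, values):
        self.values = values
        self.mapping = None  # built lazily, like IndexEngine.mapping

    def _ensure_mapping_populated(self):
        if self.mapping is not None:  # the is_mapping_populated-style check
            return
        self.mapping = {}  # table object created first...
        for i, v in enumerate(self.values):
            self.mapping[v] = i  # ...then populated: raises on unhashable v

    @property
    def is_unique(self):
        self._ensure_mapping_populated()
        return len(self.mapping) == len(self.values)

eng = Engine(['a', ['b', 'c'], ['b', 'c']])
try:
    eng.is_unique  # first call raises TypeError mid-population
except TypeError:
    pass
print(eng.mapping)    # {'a': 0} -- a partially populated table survives
print(eng.is_unique)  # False -- second call skips population and "works"
```

In this sketch the second call short-circuits on the non-None table and returns a result computed from partial state, mirroring how the real second `drop_duplicates()` call no longer raises.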
johnmwu commented on Feb 14, 2025
Took a look. Think I see what's happening.
The root cause is that values are put in hash tables in two different ways: one where `hash(val)` is called (raises) and one where it isn't.
First path. Raises due to calling `hash()`:
pandas/pandas/_libs/hashtable_class_helper.pxi.in, lines 1387 to 1392 in 19ea997
Second path. Does not raise:
pandas/pandas/_libs/hashtable_func_helper.pxi.in, lines 169 to 172 in 19ea997
The reason there are two code paths is:
1. In `drop_duplicates()`, computation of `self.is_unique` initializes `idx._engine.mapping`, which needs to call `PyObjectHashTable.map_locations` (the snippet linked for the first path):
pandas/pandas/core/indexes/base.py, lines 2847 to 2850 in 19ea997
2. `self.is_unique` is False. We then call `super().drop_duplicates`, which if you trace through goes down the second code path. Here, we actually rebuild an entirely new and identical (from what I can tell) hash table to `idx._engine.mapping` in the `duplicated` function of `hashtable_func_helper.pxi`. The natural question of course is why this is done.
As for what to do, I'm going on the premise given by @rhshadrach that we should raise twice. Intuitively, there is no need to have strong support for unhashable values.
I'm going to suggest a few options, though I'm not yet familiar with the code so I'm not sure which is best (if any):
1. Call `hash(values[i])` in the second path. This achieves the desired behavior.
2. In `drop_duplicates`: if computation of `self.is_unique` failed the first time, should it be False on the second call? I would think no. In this case, the second call will go down exactly the same code path as the first.
3. Rework `drop_duplicates`. Not a direct fix, but would remove the second code path.
Thoughts on how to proceed?
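Option 1 can be illustrated with a pure-Python sketch (the function name and structure here are illustrative, not the actual pandas Cython code): hashing each value explicitly makes the second code path raise the same `TypeError` as the first.

```python
def duplicated(values):
    """Pure-Python stand-in: mark later occurrences of each value as duplicates."""
    seen = set()
    flags = []
    for v in values:
        hash(v)  # explicit hash call: raises TypeError for unhashable values
        flags.append(v in seen)
        seen.add(v)
    return flags

print(duplicated(['a', 'b', 'a']))  # [False, False, True]
try:
    duplicated(['a', ['b', 'c'], ['b', 'c']])
except TypeError as exc:
    print(exc)  # unhashable type: 'list'
```

With the explicit `hash()` probe, both code paths would fail consistently instead of only the first one.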
MarcoGorelli commented on Feb 14, 2025
Should this not raise even earlier? As in, should `columns=['a', ['b', 'c'], ['b', 'c']]` even be allowed in the `DataFrame` constructor? (I think not?)
kiranzo commented on Jun 6, 2025
MarcoGorelli commented on Jun 6, 2025
that's a separate issue
rhshadrach commented on Jun 6, 2025
Agreed @MarcoGorelli - marked as off-topic. @kiranzo - please open a new issue if you'd like to discuss this.
Andre-Andreati commented on Jun 13, 2025
take
Andre-Andreati commented on Jun 13, 2025
I was taking a look at this issue. Tracked down the index checks and constructor, and noticed some things regarding lists as columns/index items:
- With `columns=[['a', 'b'], ['b', 'c'], ['b', 'c']]`, it will try to create a MultiIndex. Correct.
- With `columns=['a', ['b', 'c'], ['b', 'c']]`, there's no check for this condition, and creation proceeds as if all items, including the lists, are valid column names.
A simple solution would be to add a check for this condition in the index constructor.
If it seems a valid solution I will make a PR for it. Thoughts?
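For illustration, such a check might look roughly like this (a hedged sketch with made-up names, not the actual pandas constructor code). Note that `isinstance(item, Hashable)` is only a heuristic: a tuple containing a list passes the check but still fails to hash, which is why pandas' own helpers attempt `hash()` directly.

```python
from collections.abc import Hashable

def validate_index_items(items):
    """Reject unhashable labels up front (hypothetical helper)."""
    for item in items:
        if not isinstance(item, Hashable):
            raise TypeError(
                f"Index items must be hashable, got {type(item).__name__!r}"
            )

validate_index_items(['a', 'b', 'c'])  # OK, all labels hashable
try:
    validate_index_items(['a', ['b', 'c'], ['b', 'c']])
except TypeError as exc:
    print(exc)  # Index items must be hashable, got 'list'
```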