DataFrame.duplicated() flags rows as duplicates when they are in fact distinct. This happens with large DataFrames when calling duplicated(keep=False):
import pandas as pd, numpy as np
df = pd.DataFrame({'a': pd.Series(range(1,100000)),
'b': pd.Series(range(10,1000000)),
'c': pd.Series(3*range(2,200000,2))})
df.head()
np.sum(df.duplicated())
Out[]: 0
np.sum(df.duplicated(keep=False))
Out[]: 110
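The expected answer of 0 can be checked independently of duplicated(): column 'b' is the longest column and holds strictly increasing integers, so no two rows can be identical. The following is a minimal sketch of that check, not part of the original report. The helper name `build_frame` is hypothetical, and `list(range(...))` is used in place of the bare `3*range(...)` from the report so that the snippet also runs on Python 3.

```python
import pandas as pd


def build_frame(n=100000):
    # Hypothetical helper: rebuilds the frame from the report.
    # Columns have different lengths, so 'a' and 'c' are padded with NaN.
    return pd.DataFrame({
        'a': pd.Series(range(1, n)),
        'b': pd.Series(range(10, 10 * n)),
        # list() keeps the repetition working on Python 3 as well as Python 2.
        'c': pd.Series(3 * list(range(2, 2 * n, 2))),
    })


df = build_frame()

# If any single column is unique, no two full rows can be equal.
b_is_unique = df['b'].is_unique

# Counts reported by pandas; both should be 0 for a frame with distinct rows.
n_first = int(df.duplicated().sum())
n_any = int(df.duplicated(keep=False).sum())

print(b_is_unique, n_first, n_any)
```

If `b_is_unique` is True while `n_any` is nonzero, the installed pandas version is affected by the behavior described here.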
Changing the column order gives a different (but still incorrect) count:
np.sum(df[['c','b','a']].duplicated(keep=False))
Out[]: 2138
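Whether two rows are equal does not depend on the order of the columns, so the count from duplicated(keep=False) should be the same for every column permutation. A rough sketch of that invariance check follows. The helper `duplicate_counts_by_column_order` and the small example frame are illustrative assumptions, not taken from the report.

```python
import itertools

import pandas as pd


def duplicate_counts_by_column_order(frame):
    # Hypothetical helper: count rows flagged by duplicated(keep=False)
    # for every ordering of the frame's columns.
    counts = {}
    for cols in itertools.permutations(frame.columns):
        counts[cols] = int(frame[list(cols)].duplicated(keep=False).sum())
    return counts


# Small illustrative frame: the first two rows are identical, the rest are distinct.
small = pd.DataFrame({'a': [1, 1, 2, 3],
                      'b': [5, 5, 6, 7],
                      'c': [9, 9, 9, 9]})

counts = duplicate_counts_by_column_order(small)

# A correct implementation yields a single distinct count across all orderings.
distinct_counts = set(counts.values())
print(distinct_counts)
```

Running the same helper on the large frame from the report is one way to reproduce the order-dependent counts on an affected version.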
Tested on 0.17.1. Environment details are provided below:
>> pd.util.print_versions.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
This looks like the same kind of problem described in #11668, although the specific examples given in that issue work correctly in 0.17.1.