pandas.Series.__eq__ is broken for series with different index #1134

Closed
lesteve opened this issue Apr 25, 2012 · 14 comments · Fixed by #13894
Labels: API Design, Enhancement, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)

Comments

@lesteve
Contributor

lesteve commented Apr 25, 2012

Something seems to be wrong with s1 == s2 when s1 and s2 don't have the same index. Here is a snippet example:

import operator
import pandas
s1 = pandas.Series([1,2], ['a','b'])
s2 = pandas.Series([2,3], ['b','c'])
s1 == s2
s2 == s1

with the output:

In [5]: s1 == s2
Out[5]: 
a    False
b    False

In [6]: s2 == s1
Out[6]: 
b    False
c    False

On the other hand using combine works fine:

In [7]: s1.combine(s2, operator.eq)
Out[7]: 
a    0
b    1
c    0

In [8]: s2.combine(s1, operator.eq)
Out[8]: 
a    0
b    1
c    0

I guess you can first align s1 and s2 and then compare them, but is there a good reason why this couldn't work out of the box?
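The align-then-compare workaround mentioned above can be sketched as follows. This is a minimal illustration using `Series.align` with its default outer join, not a proposed implementation of `==`; the variable names `left`, `right`, and `aligned_eq` are only for this example.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([2, 3], index=["b", "c"])

# Reindex both Series to the union of their labels; missing entries become NaN.
left, right = s1.align(s2)

# Both operands now share the index ['a', 'b', 'c'], so == compares label by label.
# NaN never compares equal, so labels present on only one side yield False.
aligned_eq = left == right
print(aligned_eq)
```

Only label `b` exists in both Series with the same value, which matches the result of the `combine` call shown earlier.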

There don't seem to be any tests for pandas.Series.__eq__ with two series that have different indexes in pandas/pandas/tests/test_series.py. I have a patch lying around to add such a test and I could commit it if that's useful.

@wesm
Member

wesm commented Apr 26, 2012

This is actually a feature / deliberate choice and not a bug; it's related to #652. Back in January I changed the comparison methods to do auto-alignment, but found that it led to a large number of bugs / breakage for users and, in particular, many NumPy functions (which regularly do things like arr[1:] == arr[:-1]; example: np.unique) stopped working.
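To illustrate why positional comparison matters for such functions, here is a simplified sketch of the shifted-slice idiom on a sorted array. It is written for this thread as an illustration and is not NumPy's actual `np.unique` source; the names `changed`, `keep`, and `uniques` are assumptions.

```python
import numpy as np

arr = np.sort(np.array([3, 1, 2, 3, 1]))  # sorted: [1, 1, 2, 3, 3]

# Compare each element with its predecessor purely by position.
changed = arr[1:] != arr[:-1]

# The first element is always kept; later ones are kept where the value changes.
keep = np.concatenate(([True], changed))
uniques = arr[keep]
print(uniques)
```

If the two slices were aligned on labels instead of compared by position, each element would be compared with itself and the idiom would no longer detect changes between neighbours.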

This gets back to the issue that Series isn't quite ndarray-like enough and should probably not be a subclass of ndarray.

So, I haven't got a good answer for you except for that; auto-alignment would be ideal but I don't think I can do it unless I make Series not a subclass of ndarray. I think this is probably a good idea but not likely to happen until 0.9 or 0.10 (several months down the road).

@lesteve
Contributor Author

lesteve commented Apr 26, 2012

Interesting, thanks for the answer. Is s[1:] == s[:-1] the main use case though, where you need to have this not completely intuitive == operator?

Out of interest, is there a way to figure out whether s1 and s2 are views on the same underlying series and, in that case, have s1 == s2 do the current comparison? When s1 and s2 don't have anything to do with each other, you would do the equivalent of aligning + comparison.

Not sure whether you would want to do that even if it was possible, e.g. s1 == s2.copy() would potentially return something different than s1 == s2.
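At the array level, one way such a check could look is `np.shares_memory`. The sketch below uses plain NumPy arrays; whether the same test would be reliable on the data underlying two Series is an assumption that this example does not verify.

```python
import numpy as np

base = np.arange(5)
head, tail = base[:-1], base[1:]

# Two slices of one array overlap in memory.
same_buffer = np.shares_memory(head, tail)

# A copy owns its own buffer, so the link to the original is lost.
after_copy = np.shares_memory(head, tail.copy())

print(same_buffer, after_copy)
```

This also shows the concern raised above: a rule based on shared memory would treat `tail` and `tail.copy()` differently even though they hold the same values.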

@wesm
Member

wesm commented Apr 26, 2012

Interesting, that would be a hack around the np.unique problem. I'll take a look into it sooner rather than later to see

@jreback
Contributor

jreback commented Sep 21, 2013

closing as not a bug

@jreback jreback closed this as completed Sep 21, 2013
@jorisvandenbossche
Member

@jreback I was answering this SO question: http://stackoverflow.com/questions/22983523/comparing-pandas-series-for-equality-when-they-are-in-different-orders/22983621#22983621. And I was wondering:

I understand that s1 == s2 is not flexible, just like df1 == df2 gives the error message ValueError: Can only compare identically-labeled DataFrame objects.
But for dataframes, you can overcome this with the flexible DataFrame.eq method. There is also a Series.eq method, but it is not a flexible method (it does not align). Is there a reason that Series.eq is not flexible?

In [154]: x = pd.Series(index=["A", "B", "C"], data=[1,2,3])
In [155]: y = pd.Series(index=["C", "B", "A"], data=[3,2,1])
In [156]: x == y
Out[156]: 
A    False
B     True
C    False
dtype: bool

In [157]: x.eq(y)
Out[157]: 
A    False
B     True
C    False
dtype: bool

In [158]: x.to_frame() == y.to_frame()
Traceback (most recent call last):
...
ValueError: Can only compare identically-labeled DataFrame objects

In [159]: x.to_frame().eq(y.to_frame())
Out[159]: 
      0
A  True
B  True
C  True
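For comparison, the sketch below shows the behaviour in pandas releases that include the fix linked from this issue (#13894), assuming the flexible method aligns on labels and the operator rejects non-identical indexes. The flag name `operator_raised` is only for illustration, and the output on older versions will differ, as the session above shows.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=["A", "B", "C"])
y = pd.Series([3, 2, 1], index=["C", "B", "A"])

# The flexible method aligns on labels before comparing.
flexible = x.eq(y)

# The operator refuses to compare Series whose indexes are not identical.
try:
    x == y
    operator_raised = False
except ValueError:
    operator_raised = True

print(flexible)
print(operator_raised)
```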

@jreback
Contributor

jreback commented Apr 10, 2014

@jorisvandenbossche interesting, let's reopen and i'll take a look

@snth
Contributor

snth commented May 29, 2014

+1 for @jorisvandenbossche's suggestion of at least making the .eq, .ne, .lt, .le, .gt, .ge methods flexible, i.e. use alignment.

@snth
Contributor

snth commented May 29, 2014

From what I am reading above, it sounds like a fix might be more complicated/a while off. Could we in the meantime add something to the documentation about this?

I discovered this issue for myself recently and it took me a long time to figure out what was going on. At some point I did check the documentation to see if my understanding of index alignment was correct and there was no mention there that this only applies to the +, -, *, / operators and not to ==, !=, <, <=, >, >=.

In particular,

  • the Caveats and Gotchas section of the Pandas documentation has sections on if/truth statements with Pandas and Bitwise boolean operators (I think this would better be called Elementwise boolean operators). However these make no mention of the fact that indices will be ignored and not aligned in elementwise comparisons.

  • The Basics section of the docs has a section (Flexible Comparisons) on the basics of comparisons. This states that

    Starting in v0.8, pandas introduced binary comparison methods eq, ne, lt, gt, le, and ge to Series and DataFrame whose behavior is analogous to the binary arithmetic operations described above:
    

    However the behaviour for Series.eq is not analogous to the binary arithmetic operations as the arithmetic operations do perform index alignment while the comparison operators do not. Also, Series.eq does not appear to exist in 0.12.0 (i.e. not from 0.8 onwards as claimed) but I do find it in 0.13.1 although the signature there is different from DataFrame.eq (of course axis makes no sense for Series.eq but level and axis could still be included like they are in Series.add).
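The index alignment performed by the arithmetic operations, which the comparison operators are being contrasted with here, can be illustrated with a short sketch. It reuses the two Series from the top of this issue; `total` and `filled` are illustrative names.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([2, 3], index=["b", "c"])

# The + operator aligns on labels; labels missing on one side give NaN.
total = s1 + s2

# The flexible method Series.add accepts fill_value for the missing side.
filled = s1.add(s2, fill_value=0)

print(total)
print(filled)
```

With `fill_value=0`, a label present on only one side is treated as 0 on the other side instead of producing NaN.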

@jreback
Contributor

jreback commented May 29, 2014

See #6860; the fix is not that 'hard' at all. Though I think that doing it JUST for .eq, .ne, etc. might be better (as @jorisvandenbossche suggests).

(I think the docs really mean __eq__ and not .eq.) Agreed that the signatures of .eq et al. need to be updated / integrated with DataFrame.eq; going to create an issue for that.

@jreback
Contributor

jreback commented May 29, 2014

#7278

@hayd
Contributor

hayd commented Aug 22, 2014

Also came up here: http://stackoverflow.com/q/25435229/1240268

(update: oh, maybe that's with comparison)

@jreback
Contributor

jreback commented May 17, 2015

@jreback jreback modified the milestones: 0.17.0, Next Major Release May 17, 2015
@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@wesm
Member

wesm commented Jan 14, 2016

Bumping this issue

@sinhrks
Member

sinhrks commented Jul 11, 2016

+1 on adding flexible methods.

Also, there are inconsistencies in normal ops between Series and DataFrame, as @jorisvandenbossche pointed out. I've organized the differences, including arithmetic / bool ops (xref: #4581, #7278, #13538, #13587).

Series

Arithmetic

aligns with labels.

pd.Series([1, 2, 3], index=list('ABC')) + pd.Series([2, 2, 2], index=list('ABD')) 
# A    3.0
# B    4.0
# C    NaN
# D    NaN
# dtype: float64

pd.Series([1, 2, 3], index=list('ABC')) + pd.Series([2, 2, 2, 2], index=list('ABCD'))
# A    3.0
# B    4.0
# C    5.0
# D    NaN
# dtype: float64

Comparison

ignores labels, raises when lengths are different.

pd.Series([1, 2, 3], index=list('ABC')) > pd.Series([2, 2, 2], index=list('ABD')) 
# A    False
# B    False
# C     True
# dtype: bool

pd.Series([1, 2, 3], index=list('ABC')) > pd.Series([2, 2, 2, 2], index=list('ABCD'))
# ValueError: Series lengths must match to compare

Boolean (logical)

ignores labels, ignores length mismatch.

pd.Series([True, False, True], index=list('ABC')) & pd.Series([True, True, True], index=list('ABD'))
# A     True
# B    False
# C    False
# dtype: bool

pd.Series([True, False, True], index=list('ABC')) & pd.Series([True, True, True, True], index=list('ABCD'))
# A     True
# B    False
# C     True
# dtype: bool

DataFrame

Arithmetic

aligns with labels.

pd.DataFrame([1, 2, 3], index=list('ABC')) + pd.DataFrame([2, 2, 2], index=list('ABD'))
#      0
# A  3.0
# B  4.0
# C  NaN
# D  NaN

pd.DataFrame([1, 2, 3], index=list('ABC')) + pd.DataFrame([2, 2, 2, 2], index=list('ABCD')) 
#      0
# A  3.0
# B  4.0
# C  5.0
# D  NaN

Comparison

raises when labels are different.

pd.DataFrame([1, 2, 3], index=list('ABC')) > pd.DataFrame([2, 2, 2], index=list('ABD'))
# ValueError: Can only compare identically-labeled DataFrame objects

pd.DataFrame([1, 2, 3], index=list('ABC')) > pd.DataFrame([2, 2, 2, 2], index=list('ABCD'))
# ValueError: Can only compare identically-labeled DataFrame objects

Boolean (logical)

aligns with labels.

pd.DataFrame([True, False, True], index=list('ABC')) & pd.DataFrame([True, True, True], index=list('ABD')) 
#        0
# A   True
# B  False
# C    NaN
# D    NaN

pd.DataFrame([True, False, True], index=list('ABC')) & pd.DataFrame([True, True, True, True], index=list('ABCD'))
#        0
# A   True
# B  False
# C    NaN
# D    NaN

Based on the above, I think the following rules would be consistent:

  • arithmetic always aligns with labels
  • comparison is allowed when labels are identical; otherwise it raises
  • boolean always aligns with labels

If OK, I'd like to make 2 changes:

  • series comparison to check whether labels are identical
  • series boolean to align with labels
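The first proposed change can be sketched as a small wrapper. `compare_identical_labels` is a hypothetical helper written for this illustration, not the code that was eventually merged; it uses `Index.equals` to test whether the labels are identical.

```python
import operator
import pandas as pd

def compare_identical_labels(left, right, op=operator.eq):
    """Hypothetical helper: compare only when both indexes match exactly."""
    if not left.index.equals(right.index):
        raise ValueError("Can only compare identically-labeled Series objects")
    # Labels are known to match, so a positional comparison is safe here.
    return pd.Series(op(left.to_numpy(), right.to_numpy()), index=left.index)

a = pd.Series([1, 2, 3], index=list("ABC"))
b = pd.Series([2, 2, 2], index=list("ABC"))
c = pd.Series([2, 2, 2], index=list("ABD"))

same_labels = compare_identical_labels(a, b, operator.gt)
print(same_labels)

try:
    compare_identical_labels(a, c, operator.gt)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(mismatch_raised)
```

Raising on mismatched labels mirrors the DataFrame comparison behaviour listed above, where differing labels produce a ValueError instead of a silent positional comparison.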

7 participants