Binary operators between DataFrame and Series object doesn't seem to work #5284

Closed
liori opened this issue Oct 20, 2013 · 27 comments · Fixed by #28741
Labels
API Design; Dtype Conversions (unexpected or buggy dtype conversions); Indexing (related to indexing on series/frames, not to indexes themselves); Numeric Operations (arithmetic, comparison, and logical operations)

Comments

@liori

liori commented Oct 20, 2013

related similar operation

http://stackoverflow.com/questions/19484344/how-do-i-use-a-specific-columns-value-in-a-pandas-dataframe-where-clause/19494873#19494873

http://stackoverflow.com/questions/19507088/filtering-a-pandas-dataframe-without-removing-rows/19516869#19516869

http://stackoverflow.com/q/21627926/190597

This should be a bit more intuitive

In [59]: data = """      A    B    C    D
1/1   0    1    0    1
1/2   2    1    1    1
1/3   3    0    1    0 
1/4   1    0    1    2
1/5   1    0    1    1
1/6   2    0    2    1
1/7   3    5    2    3"""

In [60]: df = read_csv(StringIO(data), sep=r'\s+')

In [61]: df
Out[61]: 
     A  B  C  D
1/1  0  1  0  1
1/2  2  1  1  1
1/3  3  0  1  0
1/4  1  0  1  2
1/5  1  0  1  1
1/6  2  0  2  1
1/7  3  5  2  3

In [62]: df.where((df>df.shift(1)).values & DataFrame(df.D==1).values)
Out[62]: 
      A   B   C   D
1/1 NaN NaN NaN NaN
1/2   2 NaN   1 NaN
1/3 NaN NaN NaN NaN
1/4 NaN NaN NaN NaN
1/5 NaN NaN NaN NaN
1/6   2 NaN   2 NaN
1/7 NaN NaN NaN NaN

Given that normal binary operators like addition or logical and work well between a pair of Series objects, or between a pair of DataFrame objects (returning an element-wise addition/conjunction), I found it surprising that I cannot do the same between a Series object and a DataFrame object.

Here's a demonstration of what doesn't work now and what would be the expected result: http://nbviewer.ipython.org/urls/dl.dropboxusercontent.com/u/52886258/000-qdoqud/Untitled0.ipynb

@jtratner
Contributor

Your example is hard to parse because wakari escapes html and JS. Do you mind posting your example somewhere else or making it readable (maybe nbviewer.ipython.org would work?)

@liori
Author

liori commented Oct 20, 2013

My apologies, I didn't know Wakari does stuff like this. Here's an nbviewer link: http://nbviewer.ipython.org/urls/dl.dropboxusercontent.com/u/52886258/000-qdoqud/Untitled0.ipynb

@jtratner
Contributor

@liori thanks for re-posting that.

frame & series / series & frame - your #5 is a known failure that we need to fix. I believe there's an issue open about it. #4615 is sort of related, but this one should definitely be kept open because it's not quite the same.

series + frame - this has been the behavior for a long time, because it combines on columns first, then on index. You can get around it by using frame.add(series, axis=0), which matches the Series index against the frame's index, but I personally agree that this is unexpected. It doesn't make sense to me that you'd want to broadcast a Series over rows, given that DataFrame generates columns as Series. Pretty sure that others will disagree.
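A minimal sketch of the two alignments described above, assuming a recent pandas and made-up labels (the names frame, series, by_columns and by_index are only for illustration):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=["x", "y", "z"])
series = pd.Series([100, 200, 300], index=["x", "y", "z"])

# Operator form: the Series index is matched against the frame's COLUMNS.
# No labels overlap here, so every cell is NaN and the columns become
# the union of {"a", "b"} and {"x", "y", "z"}.
by_columns = frame + series

# Flex method with axis=0 (same as axis="index"): the Series index is
# matched against the frame's ROW index, so each row gets its own addend.
by_index = frame.add(series, axis=0)

print(by_columns)
print(by_index)
```

The operator has no way to pass an axis, which is why the flex methods (add, sub, mul, ...) exist alongside it.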

@jtratner
Contributor

For reference, in R this sort of 'works as you expect':

> n = c(2, 3, 5)
> s = c('aa', 'bb', 'cc')
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df
  n  s     b
1 2 aa  TRUE
2 3 bb FALSE
3 5 cc  TRUE
> df * n
   n  s b
1  4 NA 2
2  9 NA 0
3 25 NA 5
Warning message:
In Ops.factor(left, right) : * not meaningful for factors
> df * df$n
   n  s b
1  4 NA 2
2  9 NA 0
3 25 NA 5
Warning message:
In Ops.factor(left, right) : * not meaningful for factors
> 

Whereas in pandas it gives very strange errors (this is 0.12.0)

n = [2, 3, 5]
s = ['aa', 'bb', 'cc']
b = [True, False, True]
df = pandas.DataFrame({'n': n, 's': s, 'b': b})

df * df['n'] # TypeError: Could not operate [array([ nan])] with block values [too many boolean indices]
df + n # TypeError: Could not operate [array([5], dtype=int64)] with block values [too many boolean indices]
df * n # works

   b   n           s
0  2   6  aaaaaaaaaa
1  0   9  bbbbbbbbbb
2  2  15  cccccccccc
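The df * n output above is consistent with the list being matched position by position against the columns (b, n, s), not against the rows. A small sketch with an all-numeric frame, assuming a recent pandas (the frame and variable names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A plain list with one entry per column is applied column-wise:
# column "a" is multiplied by 2, "b" by 3 and "c" by 5.
per_column = df * [2, 3, 5]

# To apply one value per ROW instead, use the flex method with axis=0;
# the list then needs one entry per row.
per_row = df.mul([10, 100], axis=0)

print(per_column)
print(per_row)
```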

And with the original example, it feels weird that doing arithmetic with a selected column still results in the garbled output:

@jtratner
Contributor

frame = pandas.DataFrame({'Column': {1: True, 2: False, 3: True, 4: False},
                          'Another': {1: True, 2: True, 3: True, 4: False}})

frame
Out[25]: 
  Another Column
1    True   True
2    True  False
3    True   True
4   False  False

series = pandas.Series({1: True, 2: True, 3: False, 4: False})

frame + series
Out[27]: 
     1    2    3    4 Another Column
1  NaN  NaN  NaN  NaN     NaN    NaN
2  NaN  NaN  NaN  NaN     NaN    NaN
3  NaN  NaN  NaN  NaN     NaN    NaN
4  NaN  NaN  NaN  NaN     NaN    NaN

frame + frame['Another']
Out[28]: 
     1    2    3    4 Another Column
1  NaN  NaN  NaN  NaN     NaN    NaN
2  NaN  NaN  NaN  NaN     NaN    NaN
3  NaN  NaN  NaN  NaN     NaN    NaN
4  NaN  NaN  NaN  NaN     NaN    NaN

@jreback
Contributor

jreback commented Oct 20, 2013

you need to use mul/add, which provide for alignment - that's what they are for

@liori
Author

liori commented Oct 20, 2013

@jreback, there's no equivalent for __and__.

@jtratner
Contributor

There is in 0.13 - it's called and_.
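Independent of whether a flex method for & is available in a given version, a row-aligned logical and can be written with a column-wise apply. This is only a sketch using the frame and series from earlier in the thread; row_wise_and is a made-up name, and it assumes the Series shares the frame's row index:

```python
import pandas as pd

frame = pd.DataFrame({"Column": {1: True, 2: False, 3: True, 4: False},
                      "Another": {1: True, 2: True, 3: True, 4: False}})
series = pd.Series({1: True, 2: True, 3: False, 4: False})

# apply() hands each column to the lambda as a Series indexed by the row
# labels, so "col & series" is an ordinary Series-with-Series operation
# aligned on the row index.
row_wise_and = frame.apply(lambda col: col & series)

print(row_wise_and)
```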

@jreback
Contributor

jreback commented Oct 20, 2013

@liori to be honest, your example is aligning correctly: a DataFrame and a Series align across columns, so this result is correct

you can use add/mul to force an explicit alignment if you wish, but keep in mind that what you are expecting is not natural

@jtratner
Contributor

@jreback any way we could improve the error messages to advise using the arithmetic flex methods? Maybe we could also warn when you're going to get something like this (since this is probably never what you want). I'm thinking specifically when you union a Series index with DataFrame columns:

     1    2    3    4 Another Column
1  NaN  NaN  NaN  NaN     NaN    NaN
2  NaN  NaN  NaN  NaN     NaN    NaN
3  NaN  NaN  NaN  NaN     NaN    NaN
4  NaN  NaN  NaN  NaN     NaN    NaN

e.g. warn("Arithmetic with Series and DataFrame align along columns, use %s() method to explicitly align on Index" % name.strip("_"))

I also find it confusing that you can't actually do arithmetic with the whole dataframe when you select out a column.

@jtratner
Contributor

So, to be clear, this broadcasts:

pd.Series([False, True], index=['Another', 'Column'])
Out[46]: 
Another    False
Column      True
dtype: bool

ser = _

frame * ser
Out[48]: 
  Another Column
1   False   True
2   False  False
3   False   True
4   False  False

@jreback
Contributor

jreback commented Oct 20, 2013

yep, could use a better error msg - but it can't be right all the time; imagine a df with index and columns of 1-4, then it's ambiguous. But most of the time, if there is a length/index type mismatch, it's an incorrect alignment

@jtratner
Contributor

I agree, there are certainly ambiguous cases. But we could warn whenever you have the case of Series + DataFrame with no elements overlapping between columns and Series index. I think that would've headed that off. If you're playing around with pandas / have loaded from some IO source, I'd assume that your columns will be string-like and your index will be integer-like (or at least different from the columns), so it would cover the majority of cases.
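The check proposed here could look roughly like the sketch below. warn_if_labels_disjoint is a hypothetical helper written for illustration, not something pandas provides, and the message wording is an assumption:

```python
import warnings

import pandas as pd


def warn_if_labels_disjoint(frame, series, op_name="add"):
    """Hypothetical helper: warn when no Series label matches any column."""
    if not frame.columns.isin(series.index).any():
        warnings.warn(
            "Arithmetic with Series and DataFrame aligns along columns; "
            f"use the {op_name}() method with axis=0 to align on the index",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


frame = pd.DataFrame({"Column": {1: True, 2: False}, "Another": {1: True, 2: True}})
row_series = pd.Series({1: True, 2: False})                 # labelled like the rows
col_series = pd.Series({"Column": True, "Another": False})  # labelled like the columns

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flagged_rows = warn_if_labels_disjoint(frame, row_series)
    flagged_cols = warn_if_labels_disjoint(frame, col_series)

print(flagged_rows, flagged_cols)
```

As noted above, this cannot catch the ambiguous case where index and columns share labels; it only flags the fully disjoint one.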

@liori
Author

liori commented Oct 20, 2013

@jreback: I just wanted to reuse my knowledge of R dataframes in pandas; especially given that pandas is described as a library bringing data analysis workflow from “languages like R” to Python. But if pandas doesn't actually work the same way—that's fine for me, just please make the error messages clear.

@jtratner
Contributor

and I guess you'd want to say this:

warn("Arithmetic with Series and DataFrame align along columns, " "use the %s() method with axis=0 to explicitly align on index" % name.strip("_"))

@jtratner
Contributor

@liori I have little experience with R. How would you broadcast along rows rather than along columns and vice-versa? There are a few sections on comparisons with R, so it would probably be helpful to add that. (And at least in 0.13 you get the relatively comprehensible frame.and_(series, axis='index').)

@liori
Author

liori commented Oct 20, 2013

@jtratner: It can be done using an apply-type of method, or by using a different data type. For example:

> frame <- data.frame(column=c(TRUE, FALSE, TRUE, FALSE), another=c(TRUE, TRUE, TRUE, FALSE))
> frame
  column another
1   TRUE    TRUE
2  FALSE    TRUE
3   TRUE    TRUE
4  FALSE   FALSE
> lst <- list(column=FALSE, another=TRUE)
> lst
$column
[1] FALSE

$another
[1] TRUE

> data.frame(frame & lst)
  column another
1  FALSE    TRUE
2  FALSE    TRUE
3  FALSE    TRUE
4  FALSE   FALSE

BTW, note that in R a data.frame object is just a specialized list of vectors. So frame & lst is just a typical element-wise (list of vectors) vs. (list of scalars) operation, whereas frame & series is a (list of vectors) vs. vector operation.

@jtratner
Contributor

Thanks - that's helpful! I'll try to put some more comparisons together so it's there for people to reference (and hopefully you can take a look then). My impression is that the usage of both is relatively similar conceptually, even if they default to aligning on different axes. Anything else you notice that's confusing would be helpful to add to the docs on R vs pandas, and you can ask here or on the pydata mailing list.

@unutbu
Contributor

unutbu commented Feb 7, 2014

Related use case: http://stackoverflow.com/q/21627926/190597

Broadcasting equality testing between DataFrame and Series

df == rowmax    

replaced with

df.values == rowmax.values[:,None]
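A sketch of why the NumPy workaround broadcasts the way the operator does not, assuming a recent pandas; the data is a small made-up frame and the names mask and is_max are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"cat1": [0, 3, 1], "cat2": [2, 0, 1], "cat3": [2, 1, 0]})
rowmax = df.max(axis=1)  # one maximum per row: 2, 3, 1

# Dropping to NumPy discards the labels. Reshaping the per-row maxima from
# shape (3,) to (3, 1) makes them broadcast across the columns of the
# (3, 3) value array, i.e. each row is compared with its own maximum.
mask = df.to_numpy() == rowmax.to_numpy()[:, None]

# Wrap the raw boolean array back into a labelled frame.
is_max = pd.DataFrame(mask, index=df.index, columns=df.columns)

print(is_max)
```

The label-aware equivalent is df.eq(rowmax, axis=0), which should give the same frame without leaving pandas.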

@jreback
Contributor

jreback commented Feb 7, 2014

@unutbu this is a bit tricky....

you will want to emulate something like this:

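import numpy as np
from pandas import DataFrame, Series

df = DataFrame(np.random.randn(5, 2), columns=list('ab'))
s = Series(np.arange(5))

# align s on the row index instead of the columns
df.mul(s, axis='index')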

what you want to create is a set of functions, exactly like mul/add etc...

that are called eq/and/or, which are literally called the same exact way, except their functions are operator.eq, operator.and_....

df.mul calls this (which then uses the arguments to align the series and such), but that is all done already
https://github.com/pydata/pandas/blob/master/pandas/core/ops.py#L759

you just need to add the functions eq/and/or to the frame in a similar manner to how mul/add are added (it's a 'bit' magical, but not crazy)

that's it (plus tests of course)!
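To illustrate the idea of generating a family of flex-style functions from the operator module, here is a rough stand-in. It is not how pandas builds its methods internally; make_flex, flex_eq and flex_and are hypothetical names, and the sketch assumes the Series labels exactly match the chosen axis:

```python
import operator

import pandas as pd


def make_flex(op):
    """Hypothetical factory: wrap a binary operator in a function with an axis argument."""
    def flex(frame, series, axis="columns"):
        if axis in (0, "index"):
            # Each column is a Series labelled by the row index.
            return frame.apply(lambda col: op(col, series))
        # Each row is a Series labelled by the column names.
        return frame.apply(lambda row: op(row, series), axis=1)
    return flex


flex_eq = make_flex(operator.eq)
flex_and = make_flex(operator.and_)

df = pd.DataFrame({"cat1": [0, 3, 1], "cat2": [2, 0, 1]})
rowmax = df.max(axis=1)
matches = flex_eq(df, rowmax, axis="index")

flags = pd.DataFrame({"p": [True, False], "q": [True, True]})
keep = pd.Series([True, False])
kept = flex_and(flags, keep, axis="index")

print(matches)
print(kept)
```

Every function produced this way is called identically; only the wrapped operator differs, which is the point made above.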

lmk

@unutbu
Contributor

unutbu commented Feb 7, 2014

@jreback: Okay, I'll give it a go...

@unutbu
Contributor

unutbu commented Feb 10, 2014

@jreback: Am I missing something, or do eq, __and__, and __or__ already work as desired?

df = pd.DataFrame({'cat1':[0,3,1], 'cat2':[2,0,1], 'cat3':[2,1,0]})
rowmax = df.max(axis=1)
df.eq(rowmax, axis='index')
df.__and__(rowmax, axis='index')
df.__or__(rowmax, axis='index')

@jreback
Contributor

jreback commented Feb 10, 2014

hmm maybe just need to make and/ or be the same as those methods then
so trivial fix then

@unutbu
Contributor

unutbu commented Feb 10, 2014

Python syntax prevents and and or from being attribute names.

@jreback
Contributor

jreback commented Feb 18, 2014

@unutbu I read above that I think and_ and or_ are defined....hmm...not documented though

@unutbu
Contributor

unutbu commented Feb 18, 2014

@jreback: If I understand correctly, and_ and or_ define the __and__ and __or__ attributes, because of this code. I could add a quick mention of __and__ and __or__ to the docs around here. Perhaps __and__ and __or__ should have their own page? Are they automatically generated? I don't know how that is done.

By the way, I'm still working on fixing the nan-sort PR; it's failing nosetests after rebasing...

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Mar 30, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@jreback jreback modified the milestones: Next Major Release, 0.16.0 Mar 3, 2015
@jreback
Contributor

jreback commented Feb 13, 2016

So this is causing a warning here: https://github.com/pydata/pandas/blob/master/pandas/tests/series/test_operators.py#L1200

because df & s raises a ValueError, which is really doing: df.__and__(s, axis='columns')

and the alignment should be df.__and__(s, axis='index') but of course & is going to default align this way.

@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Feb 13, 2016
@jreback jreback modified the milestones: Contributions Welcome, 1.0 Oct 2, 2019

4 participants