ENH: Dataframe should have a .isin() method #4211

TomAugspurger · 2013-07-11T15:45:24Z

Any reason not to give DataFrame a .isin() method like Series?

The new wrinkle is that the user needs to specify if they want a logical OR or AND.

e.g.

In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'], 
'ids2': ['e', 'f', 'c', 'f']})

In [27]: df
Out[27]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
3   f    f     4

In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])

df.isin(other_ids, how='or')

  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3

df.isin(other_ids, how='and')

See this SO post maybe.

If someone else wants to take this, feel free. Can't promise a PR any time soon, but maybe in the fall :)

cpcloud · 2013-07-12T15:03:18Z

i like this idea but it should return a bool not the frame indexed by and-ing or or-ing the isins, because the bool version is consistent with Series.isin

TomAugspurger · 2013-07-12T15:14:58Z

Absolutely on returning the bool. Mistake on my part.

The most obvious uses are how='and' and how='or', which will return a 1-d bool that can be indexed into.

Would it also be useful to have some way to get just the locations where the value is in the array?

# pseudocode

df.isin(['a', 'b', 'c'], how='any')  # any probably not the right argument name.

     ids   ids2   vals
0   True  False  False
1   True  False  False
2  False   True  False
3  False  False  False

This can be achieved by applying .isin() to each series and then concatenating the results together, or even using df.applymap(lambda x: x in arr). Not sure we'd need a third way. I'd say just 'and' and 'or' for now.
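A minimal sketch of the per-column approach described above, using the example frame from the top of the thread. The `how='or'` and `how='and'` behaviour is emulated here with `any(axis=1)` and `all(axis=1)`; the names `mask`, `rows_or` and `rows_and` are illustrative only and are not part of any pandas API.

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'f', 'f'],
                   'ids2': ['e', 'f', 'c', 'f']})
arr = ['a', 'b', 'c']

# Element-wise mask: call Series.isin on each column, then concatenate
# the resulting boolean Series back into a frame (column names are kept).
mask = pd.concat([df[col].isin(arr) for col in df.columns], axis=1)

# Assumed semantics of the proposed how= argument, restricted to the id columns:
# 'or'  -> the row matches if any selected column is in arr
# 'and' -> the row matches only if every selected column is in arr
id_mask = mask[['ids', 'ids2']]
rows_or = id_mask.any(axis=1)
rows_and = id_mask.all(axis=1)

# A 1-d bool Series can be used directly to index the frame.
print(df[rows_or])
```

Reducing an element-wise mask afterwards keeps the two steps separate, which is one argument for having the method return the full boolean frame and leaving the and/or choice to the caller.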

TomAugspurger · 2013-07-12T15:43:25Z

If no one else minds, I can try to take a crack at this. I haven't done much with cython so you might want to label it someday. It looks like the Series.isin() uses ismember from lib.pyx. I was thinking about either using that for each column of df we're checking for, or writing something similar for the DataFrame version.

Is there any preference for a default on and vs. or for the how parameter?

Also for the examples I've been giving it should be df[['ids', 'ids2']].isin(), not df.isin() since the vals column won't be in the reference array.

jreback · 2013-07-12T15:53:54Z

u can prob just do this in python:
iterate over the columns and use isin

if it's not fast enough we could change it (but also good to get the algo right first)

TomAugspurger · 2013-07-12T15:59:57Z

Yep. Should be much easier than I was thinking. I'll try to get this done today. Going to (possibly) be without internet until Sunday, but I should have it done by then.

TomAugspurger · 2013-07-12T16:48:42Z

So right now I'm thinking about going with @jreback's idea and calling isin on each column.

I'm getting one failure locally... (unrelated to what I've changed, I think.)

FAIL: test_convert_objects (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/tom/pandas/pandas/tests/test_frame.py", line 5486, in test_convert_objects
    assert_frame_equal(result, expected)
  File "/Users/tom/pandas/pandas/util/testing.py", line 238, in assert_frame_equal
    check_less_precise=check_less_precise)
  File "/Users/tom/pandas/pandas/util/testing.py", line 197, in assert_series_equal
    assert_almost_equal(left.values, right.values, check_less_precise)
  File "/Users/tom/pandas/pandas/util/testing.py", line 141, in assert_almost_equal
    assert_almost_equal(a[i], b[i], check_less_precise)
  File "/Users/tom/pandas/pandas/util/testing.py", line 129, in assert_almost_equal
    assert a == b, "%s != %s" % (a, b)
AssertionError: na != nan

I've got it pushed to my branch here. I'll do a proper PR in a sec, but I forgot to add the changes to the release notes. Can I still fold the release notes update and everything else into the same commit without screwing up the world? I know you aren't supposed to rebase changes pushed to a remote, but in this case is the remote pandas/pandas, or does TomAugspurger/pandas count too?

cpcloud · 2013-07-12T17:54:58Z

I get this but only when I use tox. Weird

TomAugspurger · 2013-07-12T18:45:37Z

I think I'm also going to add an ordered argument. If you pass that as True then the function will expect a sequence of tuples. That may be more common than just a flat array.

TomAugspurger · 2013-07-13T15:52:20Z

Design question for you all. If we have

In [43]: df = DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, 2, 3, 4]})

In [44]: df
Out[44]: 
   A  B
0  a  1
1  b  2
2  c  3
3  d  4

with something like other = [('a'), (1, 3)]. I'd like to compare column 'A' to 'a' and column 'B' to (1, 3). Should we insist that each sequence be a tuple or list? The way I've done it so far requires that each sub-sequence in other be iterable. So other =['a', (1, 3)] would work, but other = [('a'), 1] would fail since ints are not iterable. Not a great outcome.

FWIW, Series will fail on something like .isin(1), which makes sense because in that case you'd just do series == 1.

cpcloud · 2013-07-13T17:10:44Z

hm do you mean other = [('a',), 1]? [('a'), 1] == ['a', 1] since parens with no commas forms an expression.

if it's a scalar then make it a list and it will be fine. OTOH it might be useful to make s.isin(some_scalar) == s == some_scalar where s is a Series object. Then you wouldn't have to worry about forming a list since it would be done in Series
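To make the two points above concrete, here is a small sketch. The helper `as_list` is hypothetical (it is not a pandas function); it wraps scalars, including strings, in a list before the per-column `isin` call. The final comparison uses the dict form of `DataFrame.isin` found in released pandas versions, which maps column names to allowed values; the thread itself had not settled on that form at this point.

```python
import pandas as pd

# Parentheses alone do not make a tuple; the trailing comma does.
assert ('a') == 'a'
assert ('a',) != 'a'
assert [('a'), 1] == ['a', 1]


def as_list(values):
    """Hypothetical helper: wrap a scalar (or a string) in a list."""
    if isinstance(values, (str, bytes)) or not hasattr(values, '__iter__'):
        return [values]
    return list(values)


df = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, 2, 3, 4]})
other = ['a', (1, 3)]  # one entry per column, in column order

# Compare column 'A' against 'a' and column 'B' against (1, 3).
mask = pd.concat(
    [df[col].isin(as_list(vals)) for col, vals in zip(df.columns, other)],
    axis=1,
)

# The dict form in released pandas expresses the same per-column lookup.
expected = df.isin({'A': ['a'], 'B': [1, 3]})
print(mask)
```

Treating strings as scalars is a deliberate choice in this sketch: iterating over `'abc'` would otherwise test membership against the individual characters.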

jreback · 2013-07-24T21:48:08Z

@hayd I see u waited a long time to merge after 0.12 :)

hayd · 2013-07-24T22:04:47Z

@jreback I figured best to get in early on the 0.13 merge-storm...

jreback · 2013-09-15T19:53:24Z

@hayd @TomAugspurger I think this should have an example in v0.13.0.txt, u can put the same example in the isin docs section

TomAugspurger · 2013-09-16T13:14:14Z

I'll write up a quick example. Should I put in a warning about issue #4421, where the value passed to .isin() is another DataFrame?

jreback · 2013-09-16T13:22:37Z

should that just raise if the index is not identical? (or maybe have a keyword `index=False` or something to control it?)
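For context on why the index matters when the argument is another DataFrame, the sketch below shows the behaviour of the pandas releases I am familiar with: a cell only counts as a match when the value agrees at the same index and column label. Whether that is exactly what was agreed for #4421 is not stated in this thread, so treat it as an illustration of the alignment question rather than the resolution; no `index=` keyword is used here, and the frames are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})

# Same values in rows 0 and 2, same default index 0..2.
same_index = pd.DataFrame({'A': [1, 0, 3], 'B': ['a', 'x', 'f']})

# Identical values, but under index labels that do not overlap with df.
shifted = pd.DataFrame({'A': [1, 0, 3], 'B': ['a', 'x', 'f']},
                       index=[10, 11, 12])

aligned = df.isin(same_index)   # matches are found label by label
unaligned = df.isin(shifted)    # no shared labels, so nothing matches

print(aligned)
print(unaligned)
```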

hayd · 2013-09-16T14:07:15Z

can we hold off on adding this to What's New until 4421's fixed, and add both at the same time?

TomAugspurger · 2013-09-16T16:05:57Z

@hayd are you saying just keep pushing it to the next release until someone, ok fine me :), gets around to fixing 4421 in a reasonable way?

If so, what section should I put it under in What's New?

hayd · 2013-09-16T17:24:30Z

I think we should fix 4421 before 0.13 (so it'll still be in the same What's New), not sure how long we have til release (?), but will have a look at it soon.

TomAugspurger mentioned this issue Jul 14, 2013

ENH: DataFrame isin #4237

Merged

hayd mentioned this issue Jul 16, 2013

ENH: Dataframe isin2 #4258

Merged

hayd closed this as completed in #4258 Jul 24, 2013

TomAugspurger mentioned this issue Jul 31, 2013

ENH/API: DataFrame's isin should accept DataFrames #4421

Closed
