ENH: Dataframe should have a .isin() method #4211

Closed
TomAugspurger opened this issue Jul 11, 2013 · 18 comments · Fixed by #4258

@TomAugspurger
Contributor

Any reason not to give DataFrame a .isin() method like Series?

The new wrinkle is that the user needs to specify if they want a logical OR or AND.

e.g.

In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'], 
'ids2': ['e', 'f', 'c', 'f']})

In [27]: df
Out[27]: 
  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3
3   f    f     4

In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])

df.isin(other_ids, how='or')

  ids ids2  vals
0   a    e     1
1   b    f     2
2   f    c     3

df.isin(other_ids, how='and')

See this SO post maybe.

If someone else wants to take this, feel free. Can't promise a PR any time soon, but maybe in the fall :)

@cpcloud
Member

cpcloud commented Jul 12, 2013

I like this idea, but it should return a boolean mask rather than the frame indexed by and-ing or or-ing the isins, because the bool version is consistent with Series.isin.

@TomAugspurger
Contributor Author

Absolutely on returning the bool. Mistake on my part.

The most obvious uses are how='and' and how='or', which will return a 1-d bool that can be indexed into.

Would it also be useful to have some way to get just the locations where the value is in the array?

# pseudocode

df.isin(['a', 'b', 'c'], how='any')  # any probably not the right argument name.

     ids   ids2   vals
0   True  False  False
1   True  False  False
2  False   True  False
3  False  False  False

This can be achieved by applying .isin() to each series and then concatenating the results together, or even by using df.applymap(lambda x: x in arr). Not sure we'd need a third way. I'd say just 'and' and 'or' for now.
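A minimal sketch of the two routes mentioned above, for readers who want to try them. It builds the element-wise mask once by calling Series.isin on each column and concatenating, and once with a plain Python membership test per cell. The variable names (`arr`, `mask_concat`, `mask_python`) are illustrative only, and the second route uses `apply` with `Series.map` as a stand-in for the `applymap` call in the comment.

```python
import pandas as pd

df = pd.DataFrame({
    "vals": [1, 2, 3, 4],
    "ids": ["a", "b", "f", "f"],
    "ids2": ["e", "f", "c", "f"],
})
arr = ["a", "b", "c"]

# Route 1: call Series.isin on each column, then concatenate the boolean columns.
mask_concat = pd.concat([df[col].isin(arr) for col in df.columns], axis=1)

# Route 2: plain Python membership test on every cell (slower, same mask).
mask_python = df.apply(lambda col: col.map(lambda x: x in arr))

print(mask_concat)
```

Both routes give the same element-wise mask as the pseudocode output above; reducing it across columns is a separate step.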

@TomAugspurger
Contributor Author

If no one else minds, I can try to take a crack at this. I haven't done much with cython so you might want to label it someday. It looks like the Series.isin() uses ismember from lib.pyx. I was thinking about either using that for each column of df we're checking for, or writing something similar for the DataFrame version.

Is there any preference for a default on and vs. or for the how parameter?

Also for the examples I've been giving it should be df[['ids', 'ids2']].isin(), not df.isin() since the vals column won't be in the reference array.

@jreback
Contributor

jreback commented Jul 12, 2013

You can probably just do this in Python: iterate over the columns and use isin.

If it's not fast enough we could change it (but it's also good to get the algorithm right first).
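The pure-Python approach suggested here could look roughly like the sketch below. `frame_isin` and its `how` argument are hypothetical names taken from the proposal in this thread, not an existing pandas API. The helper calls Series.isin on each column and then reduces across columns with `any` (for 'or') or `all` (for 'and').

```python
import pandas as pd


def frame_isin(frame, values, how="or"):
    """Hypothetical helper: row-wise membership test across the columns of `frame`."""
    # Build the element-wise mask one column at a time with Series.isin.
    mask = pd.DataFrame(
        {col: frame[col].isin(values) for col in frame.columns},
        index=frame.index,
    )
    if how == "or":
        return mask.any(axis=1)   # True where at least one column matches
    if how == "and":
        return mask.all(axis=1)   # True only where every column matches
    raise ValueError("how must be 'or' or 'and'")


df = pd.DataFrame({
    "vals": [1, 2, 3, 4],
    "ids": ["a", "b", "f", "f"],
    "ids2": ["e", "f", "c", "f"],
})
other_ids = pd.Series(["a", "b", "c", "c"])

# Only the id columns are compared, as noted earlier in the thread.
subset = df[["ids", "ids2"]]
rows_or = frame_isin(subset, other_ids, how="or")
rows_and = frame_isin(subset, other_ids, how="and")

print(df[rows_or])   # rows 0, 1 and 2
```

Returning the 1-d boolean Series, rather than the filtered frame, keeps the result usable as an indexer in the same way as Series.isin.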

@TomAugspurger
Contributor Author

Yep. Should be much easier than I was thinking. I'll try to get this done today. Going to (possibly) be without internet until Sunday, but I should have it done by then.

@TomAugspurger
Contributor Author

So right now I'm thinking about going with @jreback's idea and calling isin on each column.

I'm getting one failure locally... (unrelated to what I've changed, I think.)

FAIL: test_convert_objects (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/tom/pandas/pandas/tests/test_frame.py", line 5486, in test_convert_objects
    assert_frame_equal(result, expected)
  File "/Users/tom/pandas/pandas/util/testing.py", line 238, in assert_frame_equal
    check_less_precise=check_less_precise)
  File "/Users/tom/pandas/pandas/util/testing.py", line 197, in assert_series_equal
    assert_almost_equal(left.values, right.values, check_less_precise)
  File "/Users/tom/pandas/pandas/util/testing.py", line 141, in assert_almost_equal
    assert_almost_equal(a[i], b[i], check_less_precise)
  File "/Users/tom/pandas/pandas/util/testing.py", line 129, in assert_almost_equal
    assert a == b, "%s != %s" % (a, b)
AssertionError: na != nan

I've got it pushed to my branch here. I'll do a proper PR in a sec, but I forgot to add the changes to the release notes. Can I still fold the release notes update and everything else into the same commit without screwing up the world? I know you aren't supposed to rebase changes pushed to a remote, but in this case is the remote pandas/pandas, or does TomAugspurger/pandas count too?

@cpcloud
Member

cpcloud commented Jul 12, 2013

I get this but only when I use tox. Weird

@TomAugspurger
Contributor Author

I think I'm also going to add an ordered argument. If you pass that as True then the function will expect a sequence of tuples. That may be more common than just a flat array.

@TomAugspurger
Contributor Author

Design question for you all. If we have

In [43]: df = DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, 2, 3, 4]})

In [44]: df
Out[44]: 
   A  B
0  a  1
1  b  2
2  c  3
3  d  4

with something like other = [('a'), (1, 3)]. I'd like to compare column 'A' to 'a' and column 'B' to (1, 3). Should we insist that each sequence be a tuple or list? The way I've done it so far requires that each sub-sequence in other be iterable. So other =['a', (1, 3)] would work, but other = [('a'), 1] would fail since ints are not iterable. Not a great outcome.

FWIW, Series will fail on something like .isin(1), which makes sense because in that case you'd just do series == 1.

@cpcloud
Member

cpcloud commented Jul 13, 2013

hm do you mean other = [('a',), 1]? [('a'), 1] == ['a', 1] since parens with no commas forms an expression.

If it's a scalar, then make it a list and it will be fine. OTOH it might be useful to make s.isin(some_scalar) equivalent to s == some_scalar, where s is a Series object. Then you wouldn't have to worry about forming a list, since it would be done in Series.
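A small sketch of both points, under stated assumptions. `as_list_like` is a hypothetical helper, not part of pandas, that wraps scalars (strings included) in a list. The per-column check then uses the dict form of DataFrame.isin found in current pandas; whether the implementation discussed in this thread took that shape is not stated here.

```python
import pandas as pd

# Parentheses alone do not make a tuple; the trailing comma does.
assert ("a") == "a"
assert isinstance(("a",), tuple)


def as_list_like(value):
    """Hypothetical helper: wrap scalars (including strings) in a list."""
    if isinstance(value, (list, tuple, set, pd.Series)):
        return list(value)
    return [value]


df = pd.DataFrame({"A": ["a", "b", "c", "d"], "B": [1, 2, 3, 4]})
other = ["a", (1, 3)]  # one entry per column, in column order

# Normalise each entry, then test column 'A' against ['a'] and 'B' against [1, 3].
per_column = {col: as_list_like(val) for col, val in zip(df.columns, other)}
mask = df.isin(per_column)

print(mask)
```

With this normalisation, `other = ["a", 1]` also works, because the bare integer becomes `[1]` before the comparison.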

@jreback
Contributor

jreback commented Jul 24, 2013

@hayd I see u waited a long time to merge after 0.12 :)

@hayd
Contributor

hayd commented Jul 24, 2013

@jreback I figured best to get in early on the 0.13 merge-storm...

@jreback
Contributor

jreback commented Sep 15, 2013

@hayd @TomAugspurger I think this should have an example in v0.13.0.txt; you can put the same example in the isin docs section.

@TomAugspurger
Contributor Author

I'll write up a quick example. Should I put in a warning about issue #4421, where the value passed to .isin() is another DataFrame?

@jreback
Contributor

jreback commented Sep 16, 2013

Should that just raise if the index is not identical? (Or maybe have a keyword like `index=False` or something to control it?)
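For context, a short sketch of why index handling matters when the argument is another DataFrame. It shows the behaviour of recent pandas releases, where DataFrame.isin matches a DataFrame argument on both index and column labels. That is an assumption about the reader's installed version, not a description of the code at the time of this comment. The frames are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
# Same values, but attached to the index labels in reverse order.
other = pd.DataFrame({"A": [1, 2, 3]}, index=[2, 1, 0])

# Label-aligned comparison: a cell is True only if the value at the
# same index and column label in `other` is equal.
aligned = df.isin(other)

# Passing a plain list ignores labels and tests membership only.
by_value = df.isin(other["A"].tolist())

print(aligned)
print(by_value)
```

The two results differ, which is the kind of surprise a warning or an explicit keyword would address.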

@hayd
Contributor

hayd commented Sep 16, 2013

Can we hold off on adding this to What's New until 4421 is fixed, and add both at the same time?

@TomAugspurger
Contributor Author

@hayd are you saying just keep pushing it to the next release until someone, ok fine me :), gets around to fixing 4421 in a reasonable way?

If so, what section should I put it under in What's New?

@hayd
Contributor

hayd commented Sep 16, 2013

I think we should fix 4421 before 0.13 (so it'll still be in the same What's New), not sure how long we have til release (?), but will have a look at it soon.
