Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: add 'level' kwarg to 'Index.isin' method #7892

Merged
merged 1 commit into from
Aug 4, 2014

Conversation

immerrr
Copy link
Contributor

@immerrr immerrr commented Jul 31, 2014

closes #7890

This PR adds level kwarg for Index objects as discussed (briefly in #7890):

Index.isin(iterable, level=None)

Summary

  • for non-MultiIndex classes:
    • no functional changes in the actual membership test
    • valid level values are None, 0, -1 and self.name
    • if level doesn't match the name, it's a KeyError (was AssertionError before)
  • for MultiIndex classes:
    • if level=None, elements of self.values (tuples) are used
    • otherwise elements of corresponding level are used
    • valid level values are the same as in MultiIndex.get_level_values method, -self.nlevels..(self.nlevels - 1) plus all unique index names

TODO

  • add to documentation
  • optimize MultiIndex case

@immerrr
Copy link
Contributor Author

immerrr commented Jul 31, 2014

Also, probably related to #3268

@jreback jreback added this to the 0.15.0 milestone Jul 31, 2014
@@ -687,13 +687,29 @@ def _engine(self):
# property, for now, slow to look up
return self._engine_type(lambda: self.values, len(self))

def _validate_index_level(self, level):
"""
Validate index level.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

his already exists as _get_level_number (oh I see you are extending it), ok then

@immerrr
Copy link
Contributor Author

immerrr commented Jul 31, 2014

Performance patch is in, here's a synthetic benchmark:

In [7]: idx = pd.MultiIndex.from_product([np.arange(10000), ['a', 'b', 'c']])

In [8]: timeit idx.isin(['a'], level=1)
10000 loops, best of 3: 104 µs per loop

In [9]: timeit idx.get_level_values(1).isin(['a'])
1000 loops, best of 3: 1.37 ms per loop

In [10]: timeit idx.isin([1], level=0)
1000 loops, best of 3: 1.02 ms per loop

In [11]: timeit idx.get_level_values(0).isin([1])
100 loops, best of 3: 2.84 ms per loop

if level is None:
return lib.ismember(self._array_values(), set(values))
else:
num = self._get_level_number(level)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this currently breaks if level is a list, you want to support that? (and pass tuples for the isin values)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't go for that in this PR, I tend to be conservative about API and would wait for an actual use case where

idx.isin(vals1, lev1) & idx.isin(vals2, lev2)

is clearly inferior to

idx.isin([vals1, vals2], [lev1, lev2])

@jreback
Copy link
Contributor

jreback commented Aug 1, 2014

looks good....doc and then can merge

FYI, if you want to revise/update isin section gr8! (I would also make it a top-level, rather than below boolean indexing)

@immerrr
Copy link
Contributor Author

immerrr commented Aug 1, 2014

re: doc

other than that I've already added?

@jreback
Copy link
Contributor

jreback commented Aug 1, 2014

I think it makes sense to combine those 2 sections (isin of Index objects) and isin of PandasObject as they are really the same (and confusing to look in 2 sections). http://pandas-docs.github.io/pandas-docs-travis/indexing.html#indexing-with-isin

I would simply change the isin section (that I am pointing), make it a top-level (e.g. same as Boolean Indexing), them put the examples there

@immerrr
Copy link
Contributor Author

immerrr commented Aug 3, 2014

I'm a bit concerned about introducing functionality of classes that themselves have not yet been described, because Index & MultiIndex sections are way below the "isin" section on containers while they probably should be introduced with the rest in dsintro.rst.

Although, I see some of that happening for docs on where that do mention Multiindex, so you may be ok with that, but just to be sure, are you?

@jreback
Copy link
Contributor

jreback commented Aug 3, 2014

I agree with all of that

index/MultiIndex should have an intro at the top of indexing (side issue)

but I think splitting isin up (as it is now) is a bigger problem

so I would consolidate and out as a big bullet after boolean indexing

feel free to do a short intro if u would like as well (for index/mi)

@@ -2157,12 +2178,17 @@ def isin(self, values):
Parameters
----------
values : set or sequence of values
level : {0, None}

`level` argument is provided for compatibility with MultiIndex.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two remarks:

  • can you follow the numpy docstring format?
  • I think we should write the docstring for the 'general' case (so it is also applicable for MultiIndex), as this is the docstring that will be rendered in the html docs.

So roughly something like:

level : int, string, default None
    Level number or name to specify the index level to use if the index is a MultiIndex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you follow the numpy docstring format?

Sure, although I thought I was close. What do you think stands out?

I think we should write the docstring for the 'general' case

I was trying to state explicitly that this argument means nothing in particular for simple Index classes while adding a meaningful docstring for MultiIndex.isin, but I do understand that some might like to have it generic and in one place.

Also, is it stated somewhere that level names must be (or rather, are expected to be) strings?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

level can be a name or int (level number)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you follow the numpy docstring format?

Sure, although I thought I was close. What do you think stands out?

It is the exact formatting that is important for the correct rendering (see https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt). It should be formatted as:

arg : type
    Explanation

(so no blank line, and the explanation should be indented)

I think we should write the docstring for the 'general' case

I was trying to state explicitly that this argument means nothing in particular for simple Index classes while adding a meaningful docstring for MultiIndex.isin, but I do understand that some might like to have it generic and in one place.

The docstring in MultiIndex.isin should also be meaningful. But it is just that in the online documentation, there are references to Index.isin, so I think the all relevant information should also be there (to not have to link to both Index.isin and MultiIndex.isin)

Also, is it stated somewhere that level names must be (or rather, are expected to be) strings?

Ah, no, it can be any object I think, but int and string will be the normal ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(so no blank line, and the explanation should be indented)

hah, ok, that particular line was meant as a note, hence non-indented and separated, but ok, let's make it a parameter description :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, but 'notes' like this don't work in the 'Parameter' section (it is then interpreted as another parameter, the format is quit strict), therefore your have the 'Notes' section, as you did now :-)

PERF: optimize MultiIndex.isin when level != None

DOC: update doc bits on Index.isin and MultiIndex.isin

DOC: merge isin sections for container & index classes

DOC: make isin a separate section

It's no longer a subsection of boolean indexing.
@jreback
Copy link
Contributor

jreback commented Aug 4, 2014

looks good...

@jorisvandenbossche anything else?

@jorisvandenbossche
Copy link
Member

no, looking good!

jreback added a commit that referenced this pull request Aug 4, 2014
API: add 'level' kwarg to 'Index.isin' method
@jreback jreback merged commit 0646ad5 into pandas-dev:master Aug 4, 2014
@jreback
Copy link
Contributor

jreback commented Aug 4, 2014

thanks

let's check docs after it's built and review for correctness

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: add 'level' kwarg for Index.isin method
3 participants