Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: slicing with decreasing monotonic indexes #8680

Merged
merged 2 commits into from
Nov 2, 2014

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Oct 30, 2014

Fixes #7860

The first commit adds Index.is_monotonic_decreasing and Index.is_monotonic_increasing (alias for is_monotonic).

is_monotonic will have a performance degradation (still O(n) time) in cases where the Index is decreasing monotonic. If necessary, we could work around this, but I think we can probably get away with this because the fall-back options are much slower and in many cases (e.g., for slice indexing) the next thing we'll want to know is if it's decreasing monotonic, anyways.

Next up will be handling indexing monotonic decreasing indexes with reversed slices, e.g., s.loc[30:10].

CC @jreback @hugadams @immerrr (thank you, specialities wiki!)

@immerrr
Copy link
Contributor

immerrr commented Oct 30, 2014

Great addition, I thought that this was already implemented.

I think is_monotonic should become an alias of is_monotonic_increasing, not vice versa. Intuitively it should be an OR of direction-specific ones, but it is mentioned in the public API, so it is probably better off as a property with a deprecation warning till 0.17.0.

return False, None
is_monotonic_inc = 0
if not is_monotonic_dec:
return False, False, None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to short circuit this and return early if both incr and decr monotonoc are 0 (break out of loop)
so each branch should be something like

if is_monotonoc_incr & cur < prev:
....
in the else
if not is_monotonoc_incr & not is_monotonic_decr:
return False, False, False

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the existing logic did the same thing, but definitely not as clear

@shoyer
Copy link
Member Author

shoyer commented Oct 30, 2014

@immerrr Agreed on all those points. I switched IndexEngine to use is_monotonic_increasing in the second to last commit (still need to clean up Index).

@hughesadam87
Copy link

Looks awesome, thanks!

On Thu, Oct 30, 2014 at 3:02 AM, Stephan Hoyer notifications@github.com
wrote:

@immerrr https://github.com/immerrr Agreed on all those points. I
switched IndexEngine to use is_monotonic_increasing in the second to last
commit (still need to clean up Index).


Reply to this email directly or view it on GitHub
#8680 (comment).

Adam Hughes
Physics Ph.D Candidate
George Washington University

@shoyer
Copy link
Member Author

shoyer commented Oct 30, 2014

By the way, the current behavior of Index.slice_locs has a bug for non-unique, monotonic integer indexes:

>>> idx = pd.Index([10, 12, 12, 14])
>>> idx.slice_locs(12, 12)
(1, 3)
>>> idx.slice_locs(11, 13)
(11, 13)

I'll add a fix (and tests) when I add support for slicing monotonic decreasing indexes.

@jreback
Copy link
Contributor

jreback commented Oct 30, 2014

I'd u want to add this fix separately with tests would be ok

@shoyer
Copy link
Member Author

shoyer commented Oct 30, 2014

Latest commit adds support for monotonic decreasing indexes and should fix that bug. I ended up just removing the special case for non-unique, monotonic integer indexes, because I couldn't think of any scenarios where that shortcut was valid. Please correct me if I'm mistaken there.

@jreback jreback added Enhancement Indexing Related to indexing on series/frames, not to indexes themselves labels Oct 30, 2014
@shoyer
Copy link
Member Author

shoyer commented Oct 31, 2014

One of the tests is still failing. The current version looks like this (I mangled ts2 to ensure that it wasn't monotonic decreasing, either):

    def test_ix_getitem_not_monotonic(self):
        d1, d2 = self.ts.index[[5, 15]]

        ts2 = self.ts.iloc[lrange(15) + lrange(15, len(self.ts))[::-1]]

        self.assertRaises(KeyError, ts2.ix.__getitem__, slice(d1, d2))
        self.assertRaises(KeyError, ts2.ix.__setitem__, slice(d1, d2), 0)

I think this test should not have been passing before (not quite sure how it was), because d1 and d2 are in the index (a datetimeindex), so slicing should not raise an error, even if isn't monotonic

We don't make those sort of guarantees for slice_loc, and honestly we can't, unless we always check if the index is monotonic first. I'm not even sure if that would actually be desirable, given that I think the new behavior (not raising) is what is actually documented.

Thoughts?

@@ -1461,7 +1461,7 @@ def xs(self, key, axis=0, level=None, copy=None, drop_level=True):
name=self.index[loc])

else:
result = self[loc]
result = self.iloc[loc]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was surprising to encounter (needed to change it to fix some tests), but maybe it was there for a reason? I don't think there is any reason for loc here to do label based rather than integer indexing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be
I have tried to not touch xs recently. it seems s but fragile and multi slicing obviates the need for it anyhow

@shoyer
Copy link
Member Author

shoyer commented Oct 31, 2014

@jreback thoughts on this test case I want to remove? That is to do on this (besides release notes, etc.).

@jreback
Copy link
Contributor

jreback commented Oct 31, 2014

no that test is on ts2 which DOESN't have d1 or d2 in it (they both are missing)

But the index is unique and monotonic (increasing). It tries to find d1 but KeyError so it raises.

I think its correct for both monotonic increase and decrease.

You have to have at least 1 end point in the index for it 2 work

@@ -1521,7 +1521,7 @@ def test_ix_getitem(self):
def test_ix_getitem_not_monotonic(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the failing test (that I think should be removed)

d1 and d2 are in the index, so slicing should not raise a KeyError

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nevermind.... I misread the ::2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will add a fix

@shoyer
Copy link
Member Author

shoyer commented Oct 31, 2014

OK, still needs a rebase, but I added what's new notes and tests are passing now.

@jreback jreback added this to the 0.15.1 milestone Oct 31, 2014
@jreback
Copy link
Contributor

jreback commented Oct 31, 2014

ok, looks reasonable. pls do a perf test and post if anything comes up.

@shoyer shoyer changed the title WIP: slicing with decreasing monotonic indexes ENH: slicing with decreasing monotonic indexes Oct 31, 2014
@shoyer
Copy link
Member Author

shoyer commented Nov 1, 2014

Here are the perf tests that came up 10% slower or more. This all looks like noise to me (would be nice if vbench included an auto-repeat thing like timeit):

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_dtypes                                 |   0.1370 |   0.1240 |   1.1051 |
groupby_ngroups_100_head                     |   0.9100 |   0.8230 |   1.1056 |
frame_from_records_generator_nrows           |   1.1386 |   1.0297 |   1.1058 |
frame_mult_no_ne                             |   7.7243 |   6.9840 |   1.1060 |
frame_mult                                   |   7.6423 |   6.8657 |   1.1131 |
frame_isnull                                 |   0.8477 |   0.7557 |   1.1217 |
frame_multi_and                              |  33.9894 |  30.2606 |   1.1232 |
groupby_first_float64                        |   3.3354 |   2.9689 |   1.1234 |
timeseries_custom_bmonthend_decr_n           |   0.2897 |   0.2549 |   1.1362 |
index_datetime_intersection                  |  13.2007 |  11.6110 |   1.1369 |
datetime_index_union                         |   0.0977 |   0.0856 |   1.1411 |
strings_startswith                           |   4.4993 |   3.9283 |   1.1453 |
strings_endswith                             |   4.5290 |   3.9484 |   1.1471 |
frame_mask_bools                             |  13.8410 |  12.0500 |   1.1486 |
frame_getitem_single_column                  |  30.3197 |  26.3140 |   1.1522 |
frame_mask_floats                            |   9.1396 |   7.8970 |   1.1574 |
frame_mult_st                                |   8.5073 |   7.3420 |   1.1587 |
frame_shift_axis0                            |  16.8714 |  14.4614 |   1.1667 |
datetime_index_intersection                  |   0.3490 |   0.2923 |   1.1939 |
timeseries_day_apply                         |   0.0103 |   0.0083 |   1.2381 |
dtype_infer_float32                          |   0.4770 |   0.3787 |   1.2596 |
groupby_ngroups_100_first                    |   0.6906 |   0.5420 |   1.2742 |
dtype_infer_int32                            |   0.6863 |   0.5360 |   1.2805 |
reindex_daterange_pad                        |   0.9470 |   0.7266 |   1.3033 |
frame_float_equal                            |   2.9230 |   2.2174 |   1.3182 |
timeseries_1min_5min_mean                    |   0.9743 |   0.7337 |   1.3280 |
reindex_fillna_backfill                      |   0.4797 |   0.3503 |   1.3693 |
timeseries_custom_bday_cal_incr_n            |   0.0437 |   0.0317 |   1.3784 |

any of these look suspicious?

@jreback
Copy link
Contributor

jreback commented Nov 1, 2014

these r all ok
FYI - their are a bunch of options to vbench
I think -n -r will do repeating (it's like timeit)

@jreback
Copy link
Contributor

jreback commented Nov 1, 2014

can u add some tests if the index had nan/NaT bedded ? (I think will work just verifying)

@shoyer
Copy link
Member Author

shoyer commented Nov 1, 2014

Good idea on NA checks, have to admit I'm a little worried about that.

Actually, I'm pretty sure the existing code for is_monotonic does not handle NA properly. I guess any NA values should imply not monotonic? I believe Numpy's searchsorted assumes a particular sort order for NaN.

On Fri, Oct 31, 2014 at 5:32 PM, jreback notifications@github.com wrote:

can u add some tests if the index had nan/NaT bedded ? (I think will work just verifying)

Reply to this email directly or view it on GitHub:
#8680 (comment)

@jreback
Copy link
Contributor

jreback commented Nov 1, 2014

I think na should be ignored for monotonic checks? eg it doesn't change anything
you can test with cur != cur (though for i8 u have to check the tslib.iNaT value)

searchsorted will work but I think u have to trick it a bit - eg see how argmax and such are done (in base.py)

@shoyer
Copy link
Member Author

shoyer commented Nov 1, 2014

@jreback OK, I looked into searchsorted with missing values.

argmax and the like can use the NA skipping versions from bottleneck with no speed penalty, since those functions need to look at all the elements, anyways. However, for binary search there is a steep price, because skipping an arbitrary number of missing values means the algorithm runs in O(n) time instead of O(log(n)).

And indeed, np.searchsorted does not handle out of order NaN:

In [9]: x = np.array([1, 2, np.nan, np.nan, np.nan, 3, 4, np.nan, 5])

In [10]: x.searchsorted(3)
Out[10]: 2

In [11]: x.searchsorted(4)
Out[11]: 2

In [12]: x.searchsorted(5)
Out[12]: 2

So, given that our current implementation is buggy, we could write our own version of searchsorted that skips NA values (at least if the array has any missing values), even though in the worst case scenario is no better than checking every element. But this seems like a lot of work, and given how pathological indexes with NA values are, I question if it is worth the effort.

Indeed, even if searchsorted sometimes makes conceptual sense when there are missing values, we would have to come up with a rule for cases where the missing value is a adjacent to the search value, i.e., what should the result of Index([1, nan, 2]).slice_locs(1.5, None) be?

My preferred choice is to simply declare that the presence of NA means an array is not monotonic. I think this is actually a reasonable interpretation (it should be documented), given that the presence of missing values really means the monotonicity is unknown. Users are also likely to assume that monotonicity implies binary search is possible.

@jreback
Copy link
Contributor

jreback commented Nov 1, 2014

that seems reasonable - and easy enough to detect in is_monotonic_* routines

@shoyer
Copy link
Member Author

shoyer commented Nov 1, 2014

Okay, the latest commit should ensure that the presence of NaN or NaT means the index is not monotonic.

Include NaT in the same template ended up pretty awkward... perhaps it would be best to write another function specifically for time types? Or perhaps we don't even really need to check NaT, given that iNaT has a well defined sort order?

@jreback
Copy link
Contributor

jreback commented Nov 1, 2014

we had to do this here as well: https://github.com/pydata/pandas/blob/master/pandas/src/generate_code.py#L1194

I think you can just do:

if cur != cur or cur == iNaT or prev == iNat:
    # is a nan

possibly maybe kill perf though

@shoyer
Copy link
Member Author

shoyer commented Nov 1, 2014

Turned up a bug for get_loc with np.nan on numpy 1.7. The root cause is numpy was raising an IndexError when it should have raised (and now does raise) ValueError:

In [1]: import numpy as np

In [2]: x = np.array([0.1, np.nan])

In [3]: x.item()
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-3-f89491118da8> in <module>()
----> 1 x.item()

IndexError: index 511 is out of bounds for axis 0 with size 2

Added a test and a note to what's new.

shoyer added a commit to shoyer/pandas that referenced this pull request Nov 2, 2014
Fixes pandas-dev#7640, pandas-dev#8625

This is a work in progress, but it's far enough along that I'd love to get
some feedback.

TODOs (more called out in the code):

- [ ] documentation + docstrings
- [ ] finish the index methods:
   - [ ] `get_loc`
   - [ ] `get_indexer`
   - [ ] `slice_locs`
- [ ] comparison operations
- [ ] fix `is_monotonic` (pending pandas-dev#8680)
- [ ] ensure sorting works
- [ ] arithmetic operations (not essential for MVP)
- [ ] cythonize the bottlenecks:
   - [ ] `from_breaks`
   - [ ] `_data`
   - [ ] `Interval`?
- [ ] `MultiIndex`
- [ ] `Categorical`/`cut`
- [ ] serialization
- [ ] lots more tests

CC @jreback @cpcloud @immerrr
@shoyer shoyer mentioned this pull request Nov 2, 2014
35 tasks
@@ -85,7 +85,7 @@ def test_asobject_tolist(self):
def test_minmax(self):
for tz in self.tz:
# monotonic
idx1 = pd.DatetimeIndex([pd.NaT, '2011-01-01', '2011-01-02',
idx1 = pd.DatetimeIndex(['2011-01-01', '2011-01-02',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a test here (and below), with the nat and assertsFalse for monotonic_incr/decr

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see test_is_monotonic_na

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool

@jreback
Copy link
Contributor

jreback commented Nov 2, 2014

ok, if you'd rebase/squash. (and maybe do a perf sanity check) I think ok to merge.

Index.is_monotonic will have a performance degradation (still O(n) time) in
cases where the Index is decreasing monotonic. If necessary, we could work
around this, but I think we can probably get away with this because the fall-
back options are much slower and in many cases (e.g., for slice indexing) the
next thing we'll want to know is if it's decreasing monotonic, anyways.

see GH7860
@shoyer
Copy link
Member Author

shoyer commented Nov 2, 2014

OK, rebased and squashed (I put the first three commits together, which had all passed earlier on Travis pre-squashing).

Ran another performance test and everything looks OK! (I found the -r option for quickly only redoing slow tests.)

So I think we are good to merge

jreback added a commit that referenced this pull request Nov 2, 2014
ENH: slicing with decreasing monotonic indexes
@jreback jreback merged commit 1382aa3 into pandas-dev:master Nov 2, 2014
@jreback
Copy link
Contributor

jreback commented Nov 2, 2014

thank you sir! (may I have another?)

@shoyer shoyer deleted the reversed-monotonic-slices branch November 2, 2014 22:46
@immerrr
Copy link
Contributor

immerrr commented Nov 6, 2014

Oh, this made it to master with slice bounds switch in reverse monotonic case. I thought I managed to dissuade @shoyer from doing that, bummer :(

UPD: disregard this message, I've misread the code that does searchsorted on monotonically decreasing index.

@jorisvandenbossche
Copy link
Member

@immerrr Nothing is too late if we are convinced things should change!
But, what do you mean exactly?

This is on master:

In [30]: s = pd.Series([1,2,3,4], index=[4,3,2,1])

In [32]: s.loc[3:1]
Out[32]:
3    2
2    3
1    4
dtype: int64

In [33]: s.loc[1:3]
Out[33]: Series([], dtype: int64)

In [34]: s.index.is_monotonic_decreasing
Out[34]: True

So you argue it should be the other way around? s.loc[1:3] still returning the result, and s.loc[3:1] not? (while in s.loc[3:1] the labels are in order, following the order of the index)

@immerrr
Copy link
Contributor

immerrr commented Nov 6, 2014

Oh, disregard that message, I've misread the code that does searchsorted on monotonically decreasing index. I agree with your example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: indexing support for reversed is_monotonic
5 participants