API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc #9258

shoyer · 2015-01-15T05:53:32Z

Currently only an index method; I'm open to ideas for how to expose it more
directly -- maybe df.loc_nearest?.

In particular, I think this is usually a more useful way to do indexing for
floats than pandas's reindex based .loc, because floats are inherently
imprecise.

CC @jreback @jorisvandenbossche @immerrr @hugadams

shoyer · 2015-01-15T06:18:05Z

Hmm. I think this has all the functionality we need, but it would also be good to reconcile the API design here with asof. I'll dig into that, but I'm also open to any proposals.

jorisvandenbossche · 2015-01-15T10:40:49Z

Should it return an array or an Index when given multiple values?
For asof, I think this can completely replace the usage of asof, and is also compatible in signature. Only differences are:
- asof would have a default of left in terms of get_nearest
- it returns NaN if it cannot find anything.
I think we can leave asof as is, but add above remarks to the docs. Referring in the docstring of asof to get_nearest and indicating how you can replace it.

jreback · 2015-01-15T10:46:57Z

I agree with @jorisvandenbossche this should just replace asof entirely
asod was essentially a simple version of this

jreback · 2015-01-15T10:48:48Z

I think you should extract the indexer into another method maybe asof_indexer (and asof just calls that)
then we can support 'asof' joins! (I think there is an outstanding issue for this)

shoyer · 2015-01-15T23:23:13Z

Should it return an array or an Index when given multiple values?

I'm returning an array, based on how get_indexer works.

I'm currently modeling this function as a sort of combination of get_loc and get_indexer. Perhaps it would indeed make more sense to separate this into two interfaces -- get_loc_nearest (could handle duplicate indexes) and get_indexer_nearest.

jreback · 2015-01-18T20:44:18Z

@shoyer yes this should be 2 methods get_indexer_nearest which is called by get_loc_nearest (whose impl is trivial), but this exposes the indexer which is important.

my 2c, is that we already have asof so this should be just an impl, e.g. get_loc_asof/get_indexer_asof. No reason to have both

shoyer · 2015-01-19T09:20:13Z

OK, latest commits include the split to 2 methods and handle missing values properly (that was not being done right before). The differences are:

get_loc_nearest will coerce input to the appropriate Timedelta/Timestamp/Period scalar, like get_loc
get_indexer_nearest uses -1 to mark missing values that cannot be found whereas get_loc_nearest raises KeyError

I was able to make Index.asof be a thin wrapper around get_loc_nearest. So it should be able to handle non-datetimes now (for whatever that's worth).

The other use of asof is in the method Series.asof and the undocumented method Index.asof_locs (which appears to only be used or tested by Series.asof). This method is a little different and somewhat strange IMO -- it skips missing values in the Series. For now I've left it alone because the skipping of missing values makes it somewhat different, but is could definitely also use a cleanup.

hughesadam87 · 2015-01-19T17:32:09Z

Where do you find the time man!

shoyer · 2015-01-20T18:58:55Z

One more design question: How should we handle partial string matching for DatetimeIndex.get_loc_nearest?

The current design casts the input to a Timestamp and finds the nearest label. This seems reasonable, but it's not consistent with get_loc, which will return all matching labels (even a slice, if possible). Arguably, get_loc_nearest should be equivalent to calling get_loc and then falling back to nearest lookup if the key is not found.

Here's an concrete example:

idx = pd.to_datetime(['2000-01-31', '2000-02-28'])
print idx.get_loc_nearest('2000-02', side='left')
print idx.get_loc_nearest('2000-02', side='nearest')

With the current implementation, each of these would print 0. This is consistent with the current behavior of asof (as verified by test_asof_datetime_partial in test_index.py), but not get_loc, which would print 1 for each of these.

shoyer · 2015-02-10T00:05:32Z

OK, it looks like I need to make my own API decision :).

I've settled on using the result of get_loc for the result of DatetimeIndex.get_loc_nearest, if possible. This means that the "nearest" value to a partial string index like '2000-01' will be any values in January 2001, not the nearest value to 2000-01-01T00:00:00.

This is also a small breaking API change for DatetimeIndex.asof. I've added documentation to that effect to what's new in the latest commit (along with more functionality/tests).

Any other comments? I think this is possibly ready to merge (once rebased).

jorisvandenbossche · 2015-02-10T00:35:35Z

Would it be much trouble to not have this API change in asof? (possibly just leaving it as it is if it causes too much problems)

jorisvandenbossche · 2015-02-10T00:38:53Z

pandas/core/index.py


-        return label
+        See also: get_loc_nearest


can you make this a See also section?

See also -------- get_loc_nearest : small explanation what this is

then you get automatic linking in the html pages

jorisvandenbossche · 2015-02-10T00:50:52Z

A question: when you have a partial string, what is the 'nearest' value, so what will DatetimeIndex.get_loc_nearest return? (say if you have all days of the month in your index, what day is 'nearest' to the month)?

shoyer · 2015-02-10T00:58:25Z

@jorisvandenbossche We definitely could include a work around to avoid the change for asof by overwriting the subclass DatetimeIndex.asof method.... but I'm not even entirely sure this should really be considered an API change rather than a bug fix (even though there was a test for it).

Note the asof docstring: "For a sorted index, return the most recent label up to and including the passed label." A string like '2000-02' arguably does include the timestamp '2000-02-28'.

Looking into the history a little more: this test appeared in PR #8246 by @TomAugspurger. There, it appears that the issue was something else entirely -- asof was returning time stamps not in the index at all (see #8245). It's not clear to me that anyone carefully considered exactly how partial datetime strings should be interpreted in this change.

shoyer · 2015-02-10T01:05:39Z

@jorisvandenbossche The "nearest" value from idx.get_loc_nearest('2000-01') will be a slice object indexing all values in January 2001 if idx.get_loc('2000-01') is valid; otherwise, it will return the location of the nearest value to 2000-01-01T00:00:00. I suppose it would be a good idea to document this more directly!

jreback · 2015-02-10T02:16:13Z

I haven't really looked at this in depth - but

the string slicing should be analogous to partial string indexing with the same semantics

at first glance it appears that they are the same?

shoyer · 2015-02-10T03:23:33Z

@jreback Yes, the nearest partial string indexing of a DatetimeIndex is exactly equivalent to partial partial string indexing if there is a match; otherwise, it casts the string label to a Timestamp and finds the nearest value.

jorisvandenbossche · 2015-02-10T09:19:54Z

pandas/tseries/index.py

@@ -1256,6 +1256,19 @@ def get_loc(self, key):
            except (KeyError, ValueError):
                raise KeyError(key)

+    def get_loc_nearest(self, key, side='nearest'):


These should also get a docstring (there are normally some decorators for adding a docstring)

jorisvandenbossche · 2015-02-10T09:35:18Z

@shoyer on asof, OK then, given this is rather new behaviour as you pointed out, so lets regard that as a bug fix (maybe also specify it like that in the whatsnew docs)

Some other things:

On the partial string indexing. The docstring of get_loc_nearest says it returns an int -> so this is not always correct then.
For a DatetimeIndex, you only implemented get_loc_nearest and not get_indexer_nearest. Is this on purpose? If you now try using get_indexer_nearest with a string, you get rather unhelpful error messages.

I know you modelled the functions to the existing get_loc and get_indexer, but supposing these wouldn't exist, I would have the following remarks:

Why have both? Why not just one function that accepts label or array of labels? (and returns int or array of ints)
I find the name a bit confusing. get_loc_nearest returns the 'integer location by position'. But in the 'indexing terminology', loc is for the 'label-location' based indexer.
The return of -1 if a key is not found, seems a bit strange for me. As you would typically use the result to index your frame/series. But -1 has in that case a specific meaning, namely the last value. And that is not always what you want I think. (I know this is not that simple, as you cannot have NaN in integer, and you cannot have non-integer in the indexer, ..)

jreback · 2015-02-10T13:39:47Z

@jorisvandenbossche

I think this impl is fine. The signal of -1 is typical in the way the indexers work. These are by definition int64 and -1 is a not-found value. It is up to the user to process these.

By definition these are always indexers, by position, always, full-stop. These only have a single semantic for the output results. loc here is not location based, but just location (its an older nomenclature and I can see it being somewhat confusing, but its how all of the Index classes work.)

@shoyer I would think this API simpler if you do this;

def get_indexer(self, target, method=None, limit=None):   
        """
        Compute indexer and mask for new index given the current index. The
        indexer should be then used as an input to ndarray.take to align the
        current data to the new index. The mask determines whether labels are
        found or not in the current index

        Parameters
        ----------
        target : Index
        method : {'pad', 'ffill', 'backfill', 'bfill'}
            pad / ffill: propagate LAST valid observation forward to next valid
            backfill / bfill: use NEXT valid observation to fill gap

and allow method='nearest' (and allow the side parameter to be passed as well).
(similarly for get_loc).

I think its just think its adding API noise otherwise. These conceptually do exactly the same thing. (you certainly could have an internal _get_indexer_nearest if that makes the impl clearer though).

shoyer · 2015-02-10T19:25:05Z

@jreback This is a good point about incorporating this into the get_indexer method. In fact, I realize now that get_indexer_nearest(side='left') is actually just a slower version of get_indexer(method='pad'). So it looks like adding method='nearest' is probably the way to go here.

shoyer · 2015-02-10T21:45:59Z

It appears that the existing get_indexer method only works for monotonic ascending indexes and keys. So I'll probably keep around some of this code as a fallback option for pad/backfill with decreasing indexes and/or unordered keys.

shoyer · 2015-02-11T04:21:41Z

@jorisvandenbossche I agree, the current index API is a little strange -- some refactoring to provide a unified API and to untangle core.indexers would definitely be in order. See also: #7651.

I'm implementing only the low level methods now, but eventually we should figure out a higher level API for nearest neighbor lookups in pandas.

jreback · 2015-02-13T19:55:18Z

pandas/core/generic.py

        new_index, indexer = axis_values.reindex(labels, method, level,
                                                 limit=limit)
        return self._reindex_with_indexers(
-            {axis: [new_index, indexer]}, method=method, fill_value=fill_value,
-            limit=limit, copy=copy)
+            {axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)


oh, right, method is extraneous here (above too?)

jreback · 2015-02-13T19:59:38Z

minor comments, lgtm otherwise.

jorisvandenbossche · 2015-02-13T20:15:57Z

pandas/core/index.py

+            pad / ffill: find the PREVIOUS index value if no exact match.
+            backfill / bfill: use NEXT index value if no exact match
+            nearest: use the NEAREST index value if no exact match. Tied
+            distances are broken by preferring the larger index value.


For the html docs, you have to make bullet points of these (use - or * before each line) to have this render properly.
The same for the other docstrings.

jorisvandenbossche · 2015-02-13T20:16:29Z

docstring of reindex should still be updated

shoyer · 2015-02-14T06:00:57Z

Yep, the docstrings need a bit of work. I'll do that next week -- going camping and will be offline for the next three days.

shoyer · 2015-02-17T21:46:56Z

OK, I updated docstrings for reindex, reindex_like, get_indexer and get_loc, and they appear to generate well-formatted API docs. Merge ready, I think?

jreback · 2015-02-18T00:40:23Z

pandas/tests/test_series.py

+        s = Series(np.arange(10))
+        target = [0.1, 0.9, 1.5, 2.0]
+        actual = s.reindex(target, method='nearest')
+        expected = Series(np.around(target).astype(int), target)


astype these to int64 otherwiset hey will fail on windows.

jreback · 2015-02-18T00:42:18Z

lgtm

minor dtype chaneges needed in some tests
haven't checked, but make basics.rst docs need updating (where reindex is, to maybe have an example for nearest)

Fixes GH8845

shoyer · 2015-02-18T03:04:34Z

@jreback Fixed the dtypes.

@jorisvandenbossche @jreback I was planning on updating the basics.rst docs later (once "nearest" works for fillna), but it was simple enough to do it now. Let me know what you think. In particular, I removed some musing from Wes about the possibility of more sophisticated interpolation logic.

jreback · 2015-02-18T03:10:55Z

lgtm

shoyer · 2015-02-19T20:45:36Z

@jorisvandenbossche do you want to take another look? otherwise I will merge...

shoyer · 2015-02-23T18:30:24Z

OK, I'm going to merge this. Doc issues can be dealt with in a followup PR if necessary.

API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc

jreback · 2015-02-23T22:23:31Z

nice work @shoyer !

windows test dtype fix

mjroth · 2015-04-13T22:26:31Z

Is the intent for this to return the nearest valid observation?
i.e. for the example given in the whats new 0.16,
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html
if we throw in a NaN, it does not search for the next valid observation.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'x': range(5)})
In [4]: df.x[0] = np.nan
In [5]: df.reindex([0.2, 1.8, 3.5], method='nearest')
Out[5]: 
      x
0.2 NaN
1.8   2
3.5   4

I'd expect x=1 at 0.2, given the Docs.

shoyer · 2015-04-13T22:55:42Z

@mjroth the intent is for it to return values at the nearest index label, similarly to the existing behavior for forward and backward filling reindexing methods. If you don't want to include NA values, you'll need to do a dropna first.

shoyer added Indexing Related to indexing on series/frames, not to indexes themselves Enhancement labels Jan 15, 2015

shoyer added this to the 0.16.0 milestone Jan 15, 2015

shoyer force-pushed the get_nearest branch from 9b6fa0d to ab4d622 Compare January 15, 2015 06:06

shoyer changed the title ~~API/ENH: add Index.get_nearest~~ API/ENH: add get_loc_nearest and get_indexer_nearest to Index Jan 19, 2015

shoyer mentioned this pull request Jan 28, 2015

Use KDTrees to support nearest neighbor queries/joins on MultiIndexes? #9365

Closed

jorisvandenbossche reviewed Feb 10, 2015
View reviewed changes

jreback reviewed Feb 13, 2015
View reviewed changes

jorisvandenbossche reviewed Feb 13, 2015
View reviewed changes

shoyer force-pushed the get_nearest branch from c2e2faf to 473cbde Compare February 17, 2015 20:01

shoyer changed the title ~~API/ENH: add method='nearest' to Index.get_indexer and method to get_loc~~ API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc Feb 17, 2015

jreback reviewed Feb 18, 2015
View reviewed changes

shoyer force-pushed the get_nearest branch from 473cbde to 547678c Compare February 18, 2015 00:46

API/ENH: add method='nearest' for Index.get_indexer and reindex

f116421

Fixes GH8845

shoyer force-pushed the get_nearest branch from 547678c to f116421 Compare February 18, 2015 03:00

This was referenced Feb 18, 2015

API: Interpolate at new values #9340

Open

Support pad/backfill/nearest reindexing even for unsorted indexes by storing a sorted index? #9510

Closed

shoyer mentioned this pull request Feb 20, 2015

Allow for near misses when doing alignment with Float64Index? #9530

Open

shoyer added a commit that referenced this pull request Feb 23, 2015

Merge pull request #9258 from shoyer/get_nearest

4e563ea

API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc

shoyer merged commit 4e563ea into pandas-dev:master Feb 23, 2015

shoyer deleted the get_nearest branch February 23, 2015 18:30

jreback added a commit that referenced this pull request Feb 28, 2015

xref #9258

fc843d3

windows test dtype fix

shoyer mentioned this pull request Sep 4, 2015

CLN: Series.asof uses reindex GH10343 #10873

Closed

API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc #9258

API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc #9258

Conversation

shoyer commented Jan 15, 2015

shoyer commented Jan 15, 2015

jorisvandenbossche commented Jan 15, 2015

jreback commented Jan 15, 2015

jreback commented Jan 15, 2015

shoyer commented Jan 15, 2015

jreback commented Jan 18, 2015

shoyer commented Jan 19, 2015

hughesadam87 commented Jan 19, 2015

shoyer commented Jan 20, 2015

shoyer commented Feb 10, 2015

jorisvandenbossche commented Feb 10, 2015

jorisvandenbossche Feb 10, 2015

Choose a reason for hiding this comment

jorisvandenbossche commented Feb 10, 2015

shoyer commented Feb 10, 2015

shoyer commented Feb 10, 2015

jreback commented Feb 10, 2015

shoyer commented Feb 10, 2015

jorisvandenbossche Feb 10, 2015

Choose a reason for hiding this comment

jorisvandenbossche commented Feb 10, 2015

jreback commented Feb 10, 2015

shoyer commented Feb 10, 2015

shoyer commented Feb 10, 2015

shoyer commented Feb 11, 2015

jreback Feb 13, 2015

Choose a reason for hiding this comment

jreback commented Feb 13, 2015

jorisvandenbossche Feb 13, 2015

Choose a reason for hiding this comment

jorisvandenbossche commented Feb 13, 2015

shoyer commented Feb 14, 2015

shoyer commented Feb 17, 2015

jreback Feb 18, 2015

Choose a reason for hiding this comment

jreback commented Feb 18, 2015

shoyer commented Feb 18, 2015

jreback commented Feb 18, 2015

shoyer commented Feb 19, 2015

shoyer commented Feb 23, 2015

jreback commented Feb 23, 2015

mjroth commented Apr 13, 2015

shoyer commented Apr 13, 2015