Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc #9258

Merged
merged 1 commit into from
Feb 23, 2015

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jan 15, 2015

Fixes #8845

Currently only an index method; I'm open to ideas for how to expose it more
directly -- maybe df.loc_nearest?.

In particular, I think this is usually a more useful way to do indexing for
floats than pandas's reindex based .loc, because floats are inherently
imprecise.

CC @jreback @jorisvandenbossche @immerrr @hugadams

@shoyer shoyer added Indexing Related to indexing on series/frames, not to indexes themselves Enhancement labels Jan 15, 2015
@shoyer shoyer added this to the 0.16.0 milestone Jan 15, 2015
@shoyer
Copy link
Member Author

shoyer commented Jan 15, 2015

Hmm. I think this has all the functionality we need, but it would also be good to reconcile the API design here with asof. I'll dig into that, but I'm also open to any proposals.

@jorisvandenbossche
Copy link
Member

  • Should it return an array or an Index when given multiple values?

  • For asof, I think this can completely replace the usage of asof, and is also compatible in signature. Only differences are:

    • asof would have a default of left in terms of get_nearest
    • it returns NaN if it cannot find anything.

    I think we can leave asof as is, but add above remarks to the docs. Referring in the docstring of asof to get_nearest and indicating how you can replace it.

@jreback
Copy link
Contributor

jreback commented Jan 15, 2015

I agree with @jorisvandenbossche this should just replace asof entirely
asod was essentially a simple version of this

@jreback
Copy link
Contributor

jreback commented Jan 15, 2015

I think you should extract the indexer into another method maybe asof_indexer (and asof just calls that)
then we can support 'asof' joins! (I think there is an outstanding issue for this)

@shoyer
Copy link
Member Author

shoyer commented Jan 15, 2015

Should it return an array or an Index when given multiple values?

I'm returning an array, based on how get_indexer works.

I'm currently modeling this function as a sort of combination of get_loc and get_indexer. Perhaps it would indeed make more sense to separate this into two interfaces -- get_loc_nearest (could handle duplicate indexes) and get_indexer_nearest.

@jreback
Copy link
Contributor

jreback commented Jan 18, 2015

@shoyer yes this should be 2 methods get_indexer_nearest which is called by get_loc_nearest (whose impl is trivial), but this exposes the indexer which is important.

my 2c, is that we already have asof so this should be just an impl, e.g. get_loc_asof/get_indexer_asof. No reason to have both

@shoyer shoyer changed the title API/ENH: add Index.get_nearest API/ENH: add get_loc_nearest and get_indexer_nearest to Index Jan 19, 2015
@shoyer
Copy link
Member Author

shoyer commented Jan 19, 2015

OK, latest commits include the split to 2 methods and handle missing values properly (that was not being done right before). The differences are:

  1. get_loc_nearest will coerce input to the appropriate Timedelta/Timestamp/Period scalar, like get_loc
  2. get_indexer_nearest uses -1 to mark missing values that cannot be found whereas get_loc_nearest raises KeyError

I was able to make Index.asof be a thin wrapper around get_loc_nearest. So it should be able to handle non-datetimes now (for whatever that's worth).

The other use of asof is in the method Series.asof and the undocumented method Index.asof_locs (which appears to only be used or tested by Series.asof). This method is a little different and somewhat strange IMO -- it skips missing values in the Series. For now I've left it alone because the skipping of missing values makes it somewhat different, but is could definitely also use a cleanup.

@hughesadam87
Copy link

Where do you find the time man!

@shoyer
Copy link
Member Author

shoyer commented Jan 20, 2015

One more design question: How should we handle partial string matching for DatetimeIndex.get_loc_nearest?

The current design casts the input to a Timestamp and finds the nearest label. This seems reasonable, but it's not consistent with get_loc, which will return all matching labels (even a slice, if possible). Arguably, get_loc_nearest should be equivalent to calling get_loc and then falling back to nearest lookup if the key is not found.

Here's an concrete example:

idx = pd.to_datetime(['2000-01-31', '2000-02-28'])
print idx.get_loc_nearest('2000-02', side='left')
print idx.get_loc_nearest('2000-02', side='nearest')

With the current implementation, each of these would print 0. This is consistent with the current behavior of asof (as verified by test_asof_datetime_partial in test_index.py), but not get_loc, which would print 1 for each of these.

@shoyer
Copy link
Member Author

shoyer commented Feb 10, 2015

OK, it looks like I need to make my own API decision :).

I've settled on using the result of get_loc for the result of DatetimeIndex.get_loc_nearest, if possible. This means that the "nearest" value to a partial string index like '2000-01' will be any values in January 2001, not the nearest value to 2000-01-01T00:00:00.

This is also a small breaking API change for DatetimeIndex.asof. I've added documentation to that effect to what's new in the latest commit (along with more functionality/tests).

Any other comments? I think this is possibly ready to merge (once rebased).

@jorisvandenbossche
Copy link
Member

Would it be much trouble to not have this API change in asof? (possibly just leaving it as it is if it causes too much problems)


return label
See also: get_loc_nearest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make this a See also section?

See also
--------
get_loc_nearest : small explanation what this is

then you get automatic linking in the html pages

@jorisvandenbossche
Copy link
Member

A question: when you have a partial string, what is the 'nearest' value, so what will DatetimeIndex.get_loc_nearest return? (say if you have all days of the month in your index, what day is 'nearest' to the month)?

@shoyer
Copy link
Member Author

shoyer commented Feb 10, 2015

@jorisvandenbossche We definitely could include a work around to avoid the change for asof by overwriting the subclass DatetimeIndex.asof method.... but I'm not even entirely sure this should really be considered an API change rather than a bug fix (even though there was a test for it).

Note the asof docstring: "For a sorted index, return the most recent label up to and including the passed label." A string like '2000-02' arguably does include the timestamp '2000-02-28'.

Looking into the history a little more: this test appeared in PR #8246 by @TomAugspurger. There, it appears that the issue was something else entirely -- asof was returning time stamps not in the index at all (see #8245). It's not clear to me that anyone carefully considered exactly how partial datetime strings should be interpreted in this change.

@shoyer
Copy link
Member Author

shoyer commented Feb 10, 2015

@jorisvandenbossche The "nearest" value from idx.get_loc_nearest('2000-01') will be a slice object indexing all values in January 2001 if idx.get_loc('2000-01') is valid; otherwise, it will return the location of the nearest value to 2000-01-01T00:00:00. I suppose it would be a good idea to document this more directly!

@jreback
Copy link
Contributor

jreback commented Feb 10, 2015

I haven't really looked at this in depth - but

the string slicing should be analogous to partial string indexing with the same semantics

at first glance it appears that they are the same?

@shoyer
Copy link
Member Author

shoyer commented Feb 10, 2015

@jreback Yes, the nearest partial string indexing of a DatetimeIndex is exactly equivalent to partial partial string indexing if there is a match; otherwise, it casts the string label to a Timestamp and finds the nearest value.

@@ -1256,6 +1256,19 @@ def get_loc(self, key):
except (KeyError, ValueError):
raise KeyError(key)

def get_loc_nearest(self, key, side='nearest'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should also get a docstring (there are normally some decorators for adding a docstring)

@jorisvandenbossche
Copy link
Member

@shoyer on asof, OK then, given this is rather new behaviour as you pointed out, so lets regard that as a bug fix (maybe also specify it like that in the whatsnew docs)

Some other things:

  • On the partial string indexing. The docstring of get_loc_nearest says it returns an int -> so this is not always correct then.
  • For a DatetimeIndex, you only implemented get_loc_nearest and not get_indexer_nearest. Is this on purpose? If you now try using get_indexer_nearest with a string, you get rather unhelpful error messages.

I know you modelled the functions to the existing get_loc and get_indexer, but supposing these wouldn't exist, I would have the following remarks:

  • Why have both? Why not just one function that accepts label or array of labels? (and returns int or array of ints)
  • I find the name a bit confusing. get_loc_nearest returns the 'integer location by position'. But in the 'indexing terminology', loc is for the 'label-location' based indexer.
  • The return of -1 if a key is not found, seems a bit strange for me. As you would typically use the result to index your frame/series. But -1 has in that case a specific meaning, namely the last value. And that is not always what you want I think. (I know this is not that simple, as you cannot have NaN in integer, and you cannot have non-integer in the indexer, ..)

@jreback
Copy link
Contributor

jreback commented Feb 10, 2015

@jorisvandenbossche

I think this impl is fine. The signal of -1 is typical in the way the indexers work. These are by definition int64 and -1 is a not-found value. It is up to the user to process these.

By definition these are always indexers, by position, always, full-stop. These only have a single semantic for the output results. loc here is not location based, but just location (its an older nomenclature and I can see it being somewhat confusing, but its how all of the Index classes work.)

@shoyer I would think this API simpler if you do this;

def get_indexer(self, target, method=None, limit=None):   
        """
        Compute indexer and mask for new index given the current index. The
        indexer should be then used as an input to ndarray.take to align the
        current data to the new index. The mask determines whether labels are
        found or not in the current index

        Parameters
        ----------
        target : Index
        method : {'pad', 'ffill', 'backfill', 'bfill'}
            pad / ffill: propagate LAST valid observation forward to next valid
            backfill / bfill: use NEXT valid observation to fill gap

and allow method='nearest' (and allow the side parameter to be passed as well).
(similarly for get_loc).

I think its just think its adding API noise otherwise. These conceptually do exactly the same thing. (you certainly could have an internal _get_indexer_nearest if that makes the impl clearer though).

@shoyer
Copy link
Member Author

shoyer commented Feb 10, 2015

@jreback This is a good point about incorporating this into the get_indexer method. In fact, I realize now that get_indexer_nearest(side='left') is actually just a slower version of get_indexer(method='pad'). So it looks like adding method='nearest' is probably the way to go here.

@shoyer
Copy link
Member Author

shoyer commented Feb 10, 2015

It appears that the existing get_indexer method only works for monotonic ascending indexes and keys. So I'll probably keep around some of this code as a fallback option for pad/backfill with decreasing indexes and/or unordered keys.

@shoyer
Copy link
Member Author

shoyer commented Feb 11, 2015

@jorisvandenbossche I agree, the current index API is a little strange -- some refactoring to provide a unified API and to untangle core.indexers would definitely be in order. See also: #7651.

I'm implementing only the low level methods now, but eventually we should figure out a higher level API for nearest neighbor lookups in pandas.

new_index, indexer = axis_values.reindex(labels, method, level,
limit=limit)
return self._reindex_with_indexers(
{axis: [new_index, indexer]}, method=method, fill_value=fill_value,
limit=limit, copy=copy)
{axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right, method is extraneous here (above too?)

@jreback
Copy link
Contributor

jreback commented Feb 13, 2015

minor comments, lgtm otherwise.

pad / ffill: find the PREVIOUS index value if no exact match.
backfill / bfill: use NEXT index value if no exact match
nearest: use the NEAREST index value if no exact match. Tied
distances are broken by preferring the larger index value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the html docs, you have to make bullet points of these (use - or * before each line) to have this render properly.
The same for the other docstrings.

@jorisvandenbossche
Copy link
Member

docstring of reindex should still be updated

@shoyer
Copy link
Member Author

shoyer commented Feb 14, 2015

Yep, the docstrings need a bit of work. I'll do that next week -- going camping and will be offline for the next three days.

@shoyer shoyer changed the title API/ENH: add method='nearest' to Index.get_indexer and method to get_loc API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc Feb 17, 2015
@shoyer
Copy link
Member Author

shoyer commented Feb 17, 2015

OK, I updated docstrings for reindex, reindex_like, get_indexer and get_loc, and they appear to generate well-formatted API docs. Merge ready, I think?

s = Series(np.arange(10))
target = [0.1, 0.9, 1.5, 2.0]
actual = s.reindex(target, method='nearest')
expected = Series(np.around(target).astype(int), target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

astype these to int64 otherwiset hey will fail on windows.

@jreback
Copy link
Contributor

jreback commented Feb 18, 2015

lgtm

  • minor dtype chaneges needed in some tests
  • haven't checked, but make basics.rst docs need updating (where reindex is, to maybe have an example for nearest)

@shoyer
Copy link
Member Author

shoyer commented Feb 18, 2015

@jreback Fixed the dtypes.

@jorisvandenbossche @jreback I was planning on updating the basics.rst docs later (once "nearest" works for fillna), but it was simple enough to do it now. Let me know what you think. In particular, I removed some musing from Wes about the possibility of more sophisticated interpolation logic.

@jreback
Copy link
Contributor

jreback commented Feb 18, 2015

lgtm

@shoyer
Copy link
Member Author

shoyer commented Feb 19, 2015

@jorisvandenbossche do you want to take another look? otherwise I will merge...

@shoyer
Copy link
Member Author

shoyer commented Feb 23, 2015

OK, I'm going to merge this. Doc issues can be dealt with in a followup PR if necessary.

shoyer added a commit that referenced this pull request Feb 23, 2015
API/ENH: add method='nearest' to Index.get_indexer/reindex and method to get_loc
@shoyer shoyer merged commit 4e563ea into pandas-dev:master Feb 23, 2015
@shoyer shoyer deleted the get_nearest branch February 23, 2015 18:30
@jreback
Copy link
Contributor

jreback commented Feb 23, 2015

nice work @shoyer !

jreback added a commit that referenced this pull request Feb 28, 2015
windows test dtype fix
@mjroth
Copy link

mjroth commented Apr 13, 2015

Is the intent for this to return the nearest valid observation?
i.e. for the example given in the whats new 0.16,
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html
if we throw in a NaN, it does not search for the next valid observation.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'x': range(5)})
In [4]: df.x[0] = np.nan
In [5]: df.reindex([0.2, 1.8, 3.5], method='nearest')
Out[5]: 
      x
0.2 NaN
1.8   2
3.5   4

I'd expect x=1 at 0.2, given the Docs.

@shoyer
Copy link
Member Author

shoyer commented Apr 13, 2015

@mjroth the intent is for it to return values at the nearest index label, similarly to the existing behavior for forward and backward filling reindexing methods. If you don't want to include NA values, you'll need to do a dropna first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Index.get_nearest method
5 participants