Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow interpolate() to fill backwards as well as forwards #10691

Merged
merged 1 commit into from Sep 1, 2015
Merged

Allow interpolate() to fill backwards as well as forwards #10691

merged 1 commit into from Sep 1, 2015

Conversation

lmjohns3
Copy link

closes #9218
closes #10420

Edit: changed approach, see the rest of the thread

Currently, interpolate() has a "limit" kwarg that, when set to n,
prevents interpolated values from propagating more than n rows forward.

This change adds a "backward_limit" kwarg that, when set to n, prevents
interpolated values from propagating more than n rows backward.

The behavior prior to this change is as though "backward_limit" existed
but was always set to 0. That is, interpolated values never filled in
NaNs immediately before a non-NaN value (unless those same NaNs happened
to follow a different, non-NaN value within "limit" rows).

In this change the "backward_limit" kwarg has no effect if "limit" is
not specified (i.e., None).

@lmjohns3
Copy link
Author

I also have a local change to the documentation that adds an example of using the new kwarg, but I'd like to verify first that people think the behavior implemented here is good.

@shoyer
Copy link
Member

shoyer commented Jul 28, 2015

From an API perspective, it would be nice if backward_limit was treated the same way as limit. Also, maybe we want to deprecate the existing limit keyword argument in favor of the more explicit forward_limit?

@jreback
Copy link
Contributor

jreback commented Jul 28, 2015

I would rather see a method='forward|backward' (or for compat, ffill|bfill) to be like .fillna. Then this becomes unambiguous.

@lmjohns3
Copy link
Author

Thanks for the feedback!

@jreback if there's a forward/backward flag (or ffill/bfill), then the interpolation can only fill in one direction at a time? So I would have to do two interpolations to fill values around non-NaNs? (This is currently the case for fillna as well, just confirming that's what you have in mind.)

It would be nice if I could fill "forward", "backward", or "around" with just one call, since I can currently use your suggestion from #9218 and just reverse the sequence and interpolate (or fill) if I'm going to interpolate twice anyway.

@shoyer I agree that it would be nice to respect passing the "backward_limit" kwarg only ... this was just my first attempt to put some code in place. It seems like we ought to work out this forward/backward/around issue before getting too detailed on the backward_limit.

@jreback
Copy link
Contributor

jreback commented Jul 28, 2015

around would be ok too. I just want to limit the api differences with other routines.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Jul 28, 2015
@lmjohns3
Copy link
Author

Yes, totally agree, api consistency is a good thing!

I like the idea of adding method=forward|backward|around and then just using limit in whatever is the indicated direction. But method is already used by interpolate to indicate which type of interpolation to use (linear, quadratic etc.)

What if limit accepted an int or a pair of ints? If it's an int it would just apply like it currently does, if a list/tuple it would contain (backward, forward) limits? Perhaps too different from existing api conventions?

@lmjohns3
Copy link
Author

Sorry for the noise, I got my branch in a weird state and did a hard reset.

I changed the code so that limit accepts either an int (preserves current behavior) or a tuple of (backward, forward) ints.

@lmjohns3 lmjohns3 changed the title Add a "backward_limit" kwarg to interpolate(). Allow interpolate() to fill backwards as well as forwards Jul 30, 2015
@lmjohns3
Copy link
Author

Any comments on this version of the change?

@jreback
Copy link
Contributor

jreback commented Jul 30, 2015

@shoyer what you do you think?

@shoyer
Copy link
Member

shoyer commented Jul 30, 2015

To me, limit does not seem like something that should be inherently directional. So if we were starting over from scratch, I would make it apply in both directions.

That said, we are here where we are today, and we do want to make some effort to preserve backwards compatibility.

I'm not a really big fan of using tuples instead of another keyword argument. It seems a little awkward and not so Pythonic. I'm also not sure if there are legitimate use cases for limiting the interpolation only in one direction (maybe?).

If we don't care about asymmetrical limits, we could add a new keyword argument symmetric_limit. For now, the default would be False, but would also issue a deprecation warning. In some future version of pandas, we switch the default to True. Then, we eventually remove it entirely.

@lmjohns3
Copy link
Author

@shoyer I get what you're saying about using a tuple, it's a bit odd considering the way the rest of the pandas api works (at least what I've seen of it).

If we want to add a new kwarg, it could be something categorical like limit_direction='forward'|'backward'|'both'?

I'd also be fine with your symmetric_limit idea, too.

@johne13
Copy link

johne13 commented Aug 1, 2015

"If we want to add a new kwarg, it could be something categorical like limit_direction='forward'|'backward'|'both'?"

This seems like a very good way to me. You can preserve backwards compatibility by setting the default to "forward" which also makes it obvious to a new user in the documentation why interpolate is only extrapolating in one direction (and how to change that).

It's also consistent with fillna behavior aside from the necessary use of a keyword other than 'method'.

FWIW, I think allowing asymmetric behavior is really important overall for this command. I don't think having different limits is all that important though. That is, a user will often want to interpolate forwards or backwards (but not both). In cases where the user wants to go in both directions I'd expect a symmetric limit is fine. I.e. I think it would be fairly uncommon to want to set a limit of, say, 2 in one direction and 5 in the other. And on those rare occasions, I think just running 2 separate interpolations is not that onerous.

@jreback
Copy link
Contributor

jreback commented Aug 2, 2015

I would be +1 on adding limit_direction.

@lmjohns3
Copy link
Author

lmjohns3 commented Aug 2, 2015

I've updated the change in my local branch to add a limit_direction keyword argument. It defaults to 'forward' which preserves the existing behavior. If it's set to 'backward' the limit applies in a backward direction, and if it's set to some other value the limit applies in both directions. (I use 'both' in the test and the docs.)

The change does not change the current behavior at all when limit is falsey. So in particular if you want to fill all the values at the beginning of a series (e.g. #10420) you need to provide a nonzero limit and also limit_direction='backward' or limit_direction='both'. This matches the current default behavior and to me seems like an ok thing to ask of people who really want to fill in all the NaN values in their series.

I'm not such a big fan of extrapolate since in my brain extrapolation only refers to the extreme beginning or end of a series, i.e., when there's not another valid data point to aim for.

I thought a bit about a 'none' or something like that, but then why would you even call interpolate in the first place?

@johne13
Copy link

johne13 commented Aug 2, 2015

@lmjohns3 Sorry about the extrapolation comment, I have since deleted it after I realized the limit arg is applied to both interpolations and extrapolations, so it didn't really make any sense.

Regarding extrapolation, I think things are still a little strange in this sense. If you do the simplest version df.interpolate() you really just get an interpolation followed by ffill(). But if you do df.interpolate(method='spline',order=1) you'll get a true extrapolation. (see below)

But I don't mean to further complicate the current issue. One step at a time and I think this is a very good change you are making.

df=pd.DataFrame( [ 0, np.nan, 2, np.nan] )

df.interpolate()
Out[890]: 
   0
0  0
1  1
2  2
3  2

df.interpolate(method='spline',order=1)
Out[891]: 
   0
0  0
1  1
2  2
3  3

@lmjohns3
Copy link
Author

lmjohns3 commented Aug 2, 2015

@johne13 This is a good point you bring up, and might require a bit more thought/testing (but I'm not sure!). It looks like interpolate() actually is a sort of wrapper for several different interpolation methods. The change as it currently stands only applies to the variant where method is specified. I'm not sure how much that limits the change, or if that would be confusing to people.

@lmjohns3
Copy link
Author

lmjohns3 commented Aug 2, 2015

I just checked further and the code is a bit wacky, but it looks like this change should be ok.

There are basically two branches inside the Block.interpolate method in internals.py. If method is one of ffill, bfill, pad, backfill, or nearest, then interpolate basically does the same thing as fillna (i.e., it doesn't actually do an interpolation in the function-fitting sense). But if method is one of the scipy interpolate methods, then it does a function-fitting interpolation.

There's also a CategoricalBlock.interpolate method, and a SparseBlock.interpolate method. Both of these appear only to fill, not to interpolate.

The interpolate methods are called from a number of different places, but the main one I'm after is NDFrame.interpolate in generic.py (both Series and DataFrame are subclasses). The default method here is 'linear' so it seems like this change should cover the right cases.

If anyone is more familiar with the API and this raises any red flags let me know!

@lmjohns3
Copy link
Author

lmjohns3 commented Aug 7, 2015

Any further comments on this PR?

ser.interpolate(limit=2)

By default, ``limit`` applies in a forward direction, so that only ``NaN``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would make a sub-section (under interpolation).

@lmjohns3
Copy link
Author

Thanks for the comments!

I've addressed the test issues, and I've tried to update the docs in the way that I think you were describing.

I made a new "interpolation limits" section in the "missing_data.rst" file, but I'm unsure if it's kosher -- I used ^^^^^ for the headline level, but there are no other headlines in any of the other docs that use that level. Maybe I should just make the section header bold?

Likewise in the whatsnew file I added a small example, but if anything about the formatting etc. looks weird to you let me know.

@lmjohns3
Copy link
Author

Any further comments on this? And suggestions for the ^^^^ header level?

@jreback
Copy link
Contributor

jreback commented Aug 14, 2015

@lmjohns3 this is what I do, others just add everything in their own commit.

Its sort of up to you when deving. Eventually I'll have you squash if you didn't, so this is sort of a continuous squash.

@jreback
Copy link
Contributor

jreback commented Aug 14, 2015

lgtm.

@jorisvandenbossche
@shoyer

limit_direction='symmetric')
assert_series_equal(result, expected)

expected = Series([1., 3., np.nan, 7., 9., 11.])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we don't have a full tests for #10420. E.g. I think Series([np.nan.....]) is the idea

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's a good catch. I'll add that as soon as I can, but probably won't be until Sunday.

@jreback
Copy link
Contributor

jreback commented Aug 18, 2015

@lmjohns3 need validation on the limit_direction that it can only be forward|backward|both (and tests for same).

@lmjohns3
Copy link
Author

OK, I've made a number of updates to the patch that address the comments above.

  • limit_direction is limited to 'forward', 'backward', 'both', or 'symmetric' ('both' and 'symmetric' do the same thing), but can be passed in any combination of uppercase/lowercase. raises ValueError if not one of these values
  • tests for case/validity of limit_direction kwarg, filling values backward to beginning of series, and different limit_direction cases

In dealing with the case of filling a series backward to the beginning, I had to muck around with the interpolate_1d function a good bit more than before. This makes the diff quite ugly ... I'd be happier with it if I could check in a separate PR first that does the code-rearranging part of things. Any opinions on how to do that? Or is everyone comfortable with an ugly change that is nonetheless pretty well tested?

Thanks for everyone's feedback on this by the way!

@shoyer
Copy link
Member

shoyer commented Aug 18, 2015

Redundant "both" and "symmetric" arguments seems like a bad idea. I would
pick one and stick to it (probably "both").

On Tue, Aug 18, 2015 at 11:11 AM, Leif Johnson notifications@github.com
wrote:

OK, I've made a number of updates to the patch that address the comments
above.

  • limit_direction is limited to 'forward', 'backward', 'both', or
    'symmetric' ('both' and 'symmetric' do the same thing), but can be passed
    in any combination of uppercase/lowercase. raises ValueError if not one of
    these values
  • tests for case/validity of limit_direction kwarg, filling values
    backward to beginning of series, and different limit_direction cases

In dealing with the case of filling a series backward to the beginning, I
had to muck around with the interpolate_1d function a good bit more than
before. This makes the diff quite ugly ... I'd be happier with it if I
could check in a separate PR first that does the code-rearranging part of
things. Any opinions on how to do that? Or is everyone comfortable with an
ugly change that is nonetheless pretty well tested?

Thanks for everyone's feedback on this by the way!


Reply to this email directly or view it on GitHub
#10691 (comment).

@lmjohns3
Copy link
Author

Done -- it's just 'both' now.

@lmjohns3
Copy link
Author

I was also a bit surprised by this change, because it turns out that both numpy.interp and scipy.interp1d resort to "filling" behavior for extrapolations -- they just fill the extrapolated index values with the closest observed value.

To get a true extrapolation you'd have to use something like this:

ser.interpolate(method='spline', order=1, limit=10000, limit_direction='both')

But at least it's possible now!

@jreback
Copy link
Contributor

jreback commented Aug 18, 2015

hmm, can you show what you mean (e.g. using numpy/scipy) and the new method side-by-side in a small example?

if not valid.any():
# have to call np.array(xvalues) since xvalues could be an Index
# which cant be mutated
result = np.empty_like(np.array(xvalues), dtype=np.float64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just do np.asarray(xvalues) I think.

@jreback
Copy link
Contributor

jreback commented Aug 18, 2015

you need some tests where you have multiple groups of NaN, some of which are greater than the limit and some not

Series([1,3,np.nan,np.nan,np.nan,7,9,np.nan,np.nan,11,np.nan]) with limit=2,limit_direction='both'. Not even sure what this should do, but its a valid case

and we need some cases for testing where you have multiple groups some of which satisfy the limit and some don't. I dont think your algorithm will handle these ATM.

@jreback
Copy link
Contributor

jreback commented Aug 26, 2015

@lmjohns3 any luck with this?

@lmjohns3
Copy link
Author

Sorry, I've had this on hold. I'm supposed to turn in a draft of my dissertation this week.

I will have some more time to work on this next week.

This change adds a new keyword argument for specifying the direction in
which an interpolation limit ought to be applied.

The default, when ``limit_direction`` is 'forward', fills in NaN values
after non-NaNs, for up to ``limit`` consecutive fills. No interpolated
values are used to fill NaNs before non-NaNs. This matches the current
behavior for interpolate.

If ``limit_direction`` is 'backward', the interpolation will fill in up
to ``limit`` NaN values *before* non-NaN values, for up to ``limit``
consecutive fills. No interpolated values will be used to fill NaNs
after non-NaNs.

If ``limit_direction`` is another value, the interpolation will fill in
NaN values both before and after non-NaNs, up to ``limit`` in each
direction, so that up to ``2 * limit`` consecutive values will be filled
in around each non-NaN segment.
@lmjohns3
Copy link
Author

I just checked in a few changes to address the comments above. It looks to me like the algorithm does work for gaps of any length. In the example you suggested here is the behavior:

>>> s = Series([1,3,np.nan,np.nan,np.nan,7,9,np.nan,np.nan,12,np.nan])
>>> np.array(s.interpolate(limit=2, limit_direction='both'))
array([  1.,   3.,   4.,   5.,   6.,   7.,   9.,  10.,  11.,  12.,  12.])
>>> np.array(s.interpolate(limit=1, limit_direction='both'))
array([  1.,   3.,   4.,  nan,   6.,   7.,   9.,  10.,  11.,  12.,  12.])

In the first example, all values are filled, because none of the NaNs is further than 2 from a non-NaN value. In the second, the "5" value remains a NaN because that slot is further than 1 from all non-NaN values. I added these two examples as tests.

Also, here's an example of the "filling" behavior of np.interp that I was talking about for extrapolations:

>>> np.interp(x=[0, 1, 2, 3, 4], xp=[1, 3], fp=[3, 5])
array([ 3.,  3.,  4.,  5.,  5.])

To me this says, "I've observed value (fp) 3 at index (xp) 1 and value 5 at index 3; now I want to fill in values at indexes (x) 0 through 4." Because 0 < 1, interp returns 3 for this index, and because 4 > 3, interp returns 5 for this index.

But if you think these values are actually sampled from some continuous underlying function, and you want to reconstruct this function even outside the [3, 5] interval, you have to use a spline interpolation:

>>> scipy.interpolate.InterpolatedUnivariateSpline([1, 3], [3, 5], k=1)(list(range(5)))
array([ 2.,  3.,  4.,  5.,  6.])

This works pretty well in pandas:

>>> s = pd.Series([np.nan, 3, np.nan, 5, np.nan])
>>> np.array(s.interpolate(method='linear', limit=10, limit_direction='both'))
array([ 3.,  3.,  4.,  5.,  5.])
>>> np.array(s.interpolate(method='spline', limit=10, limit_direction='both', order=1))
array([ 2.,  3.,  4.,  5.,  6.])

You have to specify method='spline' and order=k to get a spline that will extrapolate, though!

@jreback
Copy link
Contributor

jreback commented Aug 31, 2015

@lmjohns3 ok, I see you added those results as tests. gr8! ping me when green and I'll have another look.

@lmjohns3
Copy link
Author

Looks like the tests have passed!

jreback added a commit that referenced this pull request Sep 1, 2015
Allow interpolate() to fill backwards as well as forwards
@jreback jreback merged commit b7374ca into pandas-dev:master Sep 1, 2015
@jreback
Copy link
Contributor

jreback commented Sep 1, 2015

thanks @lmjohns3 nice change!

@lmjohns3
Copy link
Author

lmjohns3 commented Sep 1, 2015

Thanks for the guidance!!

@l736x
Copy link
Contributor

l736x commented Sep 15, 2015

Looking at the "what's new" doc I see the following example:

In [43]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])

In [44]: ser.interpolate(limit=1, limit_direction='both')
Out[44]: 
0     5
1     5
2     5
3     7
4   NaN
5    11
6    13
dtype: float64

Given the description of the method I would have expected:

In [43]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])

In [44]: ser.interpolate(limit=1, limit_direction='both')
Out[44]: 
0     NaN
1     5
2     5
3     7
4   NaN
5    11
6    13
dtype: float64

Am I wrong? Why do we fill the first element?

@lmjohns3
Copy link
Author

Your expectation seems correct to me -- the first element is further than 1 from a non-NaN value and doesn't seem like it should be filled. I'll look into it and see if something is going haywire in the code, or if it's just a documentation issue.

@lmjohns3
Copy link
Author

Yes, this is buggy, excellent find! The method for computing the NaN mask got bogged down when the NaN occurs within limit of the beginning of the series and limit_direction is set to 'both'. I'll file an issue for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
6 participants