Allow interpolate() to fill backwards as well as forwards #10691

lmjohns3 · 2015-07-28T19:26:37Z

Edit: changed approach, see the rest of the thread

Currently, interpolate() has a "limit" kwarg that, when set to n,
prevents interpolated values from propagating more than n rows forward.

This change adds a "backward_limit" kwarg that, when set to n, prevents
interpolated values from propagating more than n rows backward.

The behavior prior to this change is as though "backward_limit" existed
but was always set to 0. That is, interpolated values never filled in
NaNs immediately before a non-NaN value (unless those same NaNs happened
to follow a different, non-NaN value within "limit" rows).

In this change the "backward_limit" kwarg has no effect if "limit" is
not specified (i.e., None).

lmjohns3 · 2015-07-28T19:28:06Z

I also have a local change to the documentation that adds an example of using the new kwarg, but I'd like to verify first that people think the behavior implemented here is good.

shoyer · 2015-07-28T21:21:19Z

From an API perspective, it would be nice if backward_limit was treated the same way as limit. Also, maybe we want to deprecate the existing limit keyword argument in favor of the more explicit forward_limit?

jreback · 2015-07-28T21:24:06Z

I would rather see a method='forward|backward' (or for compat, ffill|bfill) to be like .fillna. Then this becomes unambiguous.

lmjohns3 · 2015-07-28T21:42:56Z

Thanks for the feedback!

@jreback if there's a forward/backward flag (or ffill/bfill), then the interpolation can only fill in one direction at a time? So I would have to do two interpolations to fill values around non-NaNs? (This is currently the case for fillna as well, just confirming that's what you have in mind.)

It would be nice if I could fill "forward", "backward", or "around" with just one call, since I can currently use your suggestion from #9218 and just reverse the sequence and interpolate (or fill) if I'm going to interpolate twice anyway.

@shoyer I agree that it would be nice to respect passing the "backward_limit" kwarg only ... this was just my first attempt to put some code in place. It seems like we ought to work out this forward/backward/around issue before getting too detailed on the backward_limit.

jreback · 2015-07-28T21:45:16Z

around would be ok too. I just want to limit the api differences with other routines.

lmjohns3 · 2015-07-28T21:56:37Z

Yes, totally agree, api consistency is a good thing!

I like the idea of adding method=forward|backward|around and then just using limit in whatever is the indicated direction. But method is already used by interpolate to indicate which type of interpolation to use (linear, quadratic etc.)

What if limit accepted an int or a pair of ints? If it's an int it would just apply like it currently does, if a list/tuple it would contain (backward, forward) limits? Perhaps too different from existing api conventions?

lmjohns3 · 2015-07-29T14:25:51Z

Sorry for the noise, I got my branch in a weird state and did a hard reset.

I changed the code so that limit accepts either an int (preserves current behavior) or a tuple of (backward, forward) ints.

lmjohns3 · 2015-07-30T19:28:46Z

Any comments on this version of the change?

jreback · 2015-07-30T22:23:52Z

@shoyer what you do you think?

shoyer · 2015-07-30T22:32:30Z

To me, limit does not seem like something that should be inherently directional. So if we were starting over from scratch, I would make it apply in both directions.

That said, we are here where we are today, and we do want to make some effort to preserve backwards compatibility.

I'm not a really big fan of using tuples instead of another keyword argument. It seems a little awkward and not so Pythonic. I'm also not sure if there are legitimate use cases for limiting the interpolation only in one direction (maybe?).

If we don't care about asymmetrical limits, we could add a new keyword argument symmetric_limit. For now, the default would be False, but would also issue a deprecation warning. In some future version of pandas, we switch the default to True. Then, we eventually remove it entirely.

lmjohns3 · 2015-07-31T00:14:40Z

@shoyer I get what you're saying about using a tuple, it's a bit odd considering the way the rest of the pandas api works (at least what I've seen of it).

If we want to add a new kwarg, it could be something categorical like limit_direction='forward'|'backward'|'both'?

I'd also be fine with your symmetric_limit idea, too.

johne13 · 2015-08-01T22:23:09Z

"If we want to add a new kwarg, it could be something categorical like limit_direction='forward'|'backward'|'both'?"

This seems like a very good way to me. You can preserve backwards compatibility by setting the default to "forward" which also makes it obvious to a new user in the documentation why interpolate is only extrapolating in one direction (and how to change that).

It's also consistent with fillna behavior aside from the necessary use of a keyword other than 'method'.

FWIW, I think allowing asymmetric behavior is really important overall for this command. I don't think having different limits is all that important though. That is, a user will often want to interpolate forwards or backwards (but not both). In cases where the user wants to go in both directions I'd expect a symmetric limit is fine. I.e. I think it would be fairly uncommon to want to set a limit of, say, 2 in one direction and 5 in the other. And on those rare occasions, I think just running 2 separate interpolations is not that onerous.

jreback · 2015-08-02T15:30:30Z

I would be +1 on adding limit_direction.

lmjohns3 · 2015-08-02T18:30:16Z

I've updated the change in my local branch to add a limit_direction keyword argument. It defaults to 'forward' which preserves the existing behavior. If it's set to 'backward' the limit applies in a backward direction, and if it's set to some other value the limit applies in both directions. (I use 'both' in the test and the docs.)

The change does not change the current behavior at all when limit is falsey. So in particular if you want to fill all the values at the beginning of a series (e.g. #10420) you need to provide a nonzero limit and also limit_direction='backward' or limit_direction='both'. This matches the current default behavior and to me seems like an ok thing to ask of people who really want to fill in all the NaN values in their series.

I'm not such a big fan of extrapolate since in my brain extrapolation only refers to the extreme beginning or end of a series, i.e., when there's not another valid data point to aim for.

I thought a bit about a 'none' or something like that, but then why would you even call interpolate in the first place?

johne13 · 2015-08-02T19:22:13Z

@lmjohns3 Sorry about the extrapolation comment, I have since deleted it after I realized the limit arg is applied to both interpolations and extrapolations, so it didn't really make any sense.

Regarding extrapolation, I think things are still a little strange in this sense. If you do the simplest version df.interpolate() you really just get an interpolation followed by ffill(). But if you do df.interpolate(method='spline',order=1) you'll get a true extrapolation. (see below)

But I don't mean to further complicate the current issue. One step at a time and I think this is a very good change you are making.

df=pd.DataFrame( [ 0, np.nan, 2, np.nan] )

df.interpolate()
Out[890]: 
   0
0  0
1  1
2  2
3  2

df.interpolate(method='spline',order=1)
Out[891]: 
   0
0  0
1  1
2  2
3  3

lmjohns3 · 2015-08-02T19:26:18Z

@johne13 This is a good point you bring up, and might require a bit more thought/testing (but I'm not sure!). It looks like interpolate() actually is a sort of wrapper for several different interpolation methods. The change as it currently stands only applies to the variant where method is specified. I'm not sure how much that limits the change, or if that would be confusing to people.

lmjohns3 · 2015-08-02T19:44:39Z

I just checked further and the code is a bit wacky, but it looks like this change should be ok.

There are basically two branches inside the Block.interpolate method in internals.py. If method is one of ffill, bfill, pad, backfill, or nearest, then interpolate basically does the same thing as fillna (i.e., it doesn't actually do an interpolation in the function-fitting sense). But if method is one of the scipy interpolate methods, then it does a function-fitting interpolation.

There's also a CategoricalBlock.interpolate method, and a SparseBlock.interpolate method. Both of these appear only to fill, not to interpolate.

The interpolate methods are called from a number of different places, but the main one I'm after is NDFrame.interpolate in generic.py (both Series and DataFrame are subclasses). The default method here is 'linear' so it seems like this change should cover the right cases.

If anyone is more familiar with the API and this raises any red flags let me know!

lmjohns3 · 2015-08-07T15:28:39Z

Any further comments on this PR?

jreback · 2015-08-08T00:29:26Z

doc/source/missing_data.rst

   ser.interpolate(limit=2)

+By default, ``limit`` applies in a forward direction, so that only ``NaN``


would make a sub-section (under interpolation).

lmjohns3 · 2015-08-10T16:40:29Z

Thanks for the comments!

I've addressed the test issues, and I've tried to update the docs in the way that I think you were describing.

I made a new "interpolation limits" section in the "missing_data.rst" file, but I'm unsure if it's kosher -- I used ^^^^^ for the headline level, but there are no other headlines in any of the other docs that use that level. Maybe I should just make the section header bold?

Likewise in the whatsnew file I added a small example, but if anything about the formatting etc. looks weird to you let me know.

lmjohns3 · 2015-08-14T17:57:45Z

Any further comments on this? And suggestions for the ^^^^ header level?

jreback · 2015-08-14T21:03:10Z

@lmjohns3 this is what I do, others just add everything in their own commit.

Its sort of up to you when deving. Eventually I'll have you squash if you didn't, so this is sort of a continuous squash.

jreback · 2015-08-14T21:03:55Z

lgtm.

@jorisvandenbossche
@shoyer

jreback · 2015-08-14T21:05:37Z

pandas/tests/test_generic.py

+                               limit_direction='symmetric')
+        assert_series_equal(result, expected)
+
+        expected = Series([1., 3., np.nan, 7., 9., 11.])


actually we don't have a full tests for #10420. E.g. I think Series([np.nan.....]) is the idea

Yes, that's a good catch. I'll add that as soon as I can, but probably won't be until Sunday.

jreback · 2015-08-18T10:54:20Z

@lmjohns3 need validation on the limit_direction that it can only be forward|backward|both (and tests for same).

lmjohns3 · 2015-08-18T18:11:28Z

OK, I've made a number of updates to the patch that address the comments above.

limit_direction is limited to 'forward', 'backward', 'both', or 'symmetric' ('both' and 'symmetric' do the same thing), but can be passed in any combination of uppercase/lowercase. raises ValueError if not one of these values
tests for case/validity of limit_direction kwarg, filling values backward to beginning of series, and different limit_direction cases

In dealing with the case of filling a series backward to the beginning, I had to muck around with the interpolate_1d function a good bit more than before. This makes the diff quite ugly ... I'd be happier with it if I could check in a separate PR first that does the code-rearranging part of things. Any opinions on how to do that? Or is everyone comfortable with an ugly change that is nonetheless pretty well tested?

Thanks for everyone's feedback on this by the way!

shoyer · 2015-08-18T18:12:52Z

Redundant "both" and "symmetric" arguments seems like a bad idea. I would
pick one and stick to it (probably "both").

On Tue, Aug 18, 2015 at 11:11 AM, Leif Johnson notifications@github.com
wrote:

OK, I've made a number of updates to the patch that address the comments
above.

limit_direction is limited to 'forward', 'backward', 'both', or
'symmetric' ('both' and 'symmetric' do the same thing), but can be passed
in any combination of uppercase/lowercase. raises ValueError if not one of
these values

tests for case/validity of limit_direction kwarg, filling values
backward to beginning of series, and different limit_direction cases

In dealing with the case of filling a series backward to the beginning, I
had to muck around with the interpolate_1d function a good bit more than
before. This makes the diff quite ugly ... I'd be happier with it if I
could check in a separate PR first that does the code-rearranging part of
things. Any opinions on how to do that? Or is everyone comfortable with an
ugly change that is nonetheless pretty well tested?

Thanks for everyone's feedback on this by the way!

—
Reply to this email directly or view it on GitHub
#10691 (comment).

lmjohns3 · 2015-08-18T18:16:28Z

Done -- it's just 'both' now.

lmjohns3 · 2015-08-18T18:20:21Z

I was also a bit surprised by this change, because it turns out that both numpy.interp and scipy.interp1d resort to "filling" behavior for extrapolations -- they just fill the extrapolated index values with the closest observed value.

To get a true extrapolation you'd have to use something like this:

ser.interpolate(method='spline', order=1, limit=10000, limit_direction='both')

But at least it's possible now!

jreback · 2015-08-18T21:45:01Z

hmm, can you show what you mean (e.g. using numpy/scipy) and the new method side-by-side in a small example?

jreback · 2015-08-18T21:46:33Z

pandas/core/common.py

+    if not valid.any():
+        # have to call np.array(xvalues) since xvalues could be an Index
+        # which cant be mutated
+        result = np.empty_like(np.array(xvalues), dtype=np.float64)


you can just do np.asarray(xvalues) I think.

jreback · 2015-08-18T22:06:28Z

you need some tests where you have multiple groups of NaN, some of which are greater than the limit and some not

Series([1,3,np.nan,np.nan,np.nan,7,9,np.nan,np.nan,11,np.nan]) with limit=2,limit_direction='both'. Not even sure what this should do, but its a valid case

and we need some cases for testing where you have multiple groups some of which satisfy the limit and some don't. I dont think your algorithm will handle these ATM.

jreback · 2015-08-26T01:26:45Z

@lmjohns3 any luck with this?

lmjohns3 · 2015-08-26T02:02:51Z

Sorry, I've had this on hold. I'm supposed to turn in a draft of my dissertation this week.

I will have some more time to work on this next week.

This change adds a new keyword argument for specifying the direction in which an interpolation limit ought to be applied. The default, when ``limit_direction`` is 'forward', fills in NaN values after non-NaNs, for up to ``limit`` consecutive fills. No interpolated values are used to fill NaNs before non-NaNs. This matches the current behavior for interpolate. If ``limit_direction`` is 'backward', the interpolation will fill in up to ``limit`` NaN values *before* non-NaN values, for up to ``limit`` consecutive fills. No interpolated values will be used to fill NaNs after non-NaNs. If ``limit_direction`` is another value, the interpolation will fill in NaN values both before and after non-NaNs, up to ``limit`` in each direction, so that up to ``2 * limit`` consecutive values will be filled in around each non-NaN segment.

lmjohns3 · 2015-08-30T18:47:16Z

I just checked in a few changes to address the comments above. It looks to me like the algorithm does work for gaps of any length. In the example you suggested here is the behavior:

>>> s = Series([1,3,np.nan,np.nan,np.nan,7,9,np.nan,np.nan,12,np.nan])
>>> np.array(s.interpolate(limit=2, limit_direction='both'))
array([  1.,   3.,   4.,   5.,   6.,   7.,   9.,  10.,  11.,  12.,  12.])
>>> np.array(s.interpolate(limit=1, limit_direction='both'))
array([  1.,   3.,   4.,  nan,   6.,   7.,   9.,  10.,  11.,  12.,  12.])

In the first example, all values are filled, because none of the NaNs is further than 2 from a non-NaN value. In the second, the "5" value remains a NaN because that slot is further than 1 from all non-NaN values. I added these two examples as tests.

Also, here's an example of the "filling" behavior of np.interp that I was talking about for extrapolations:

>>> np.interp(x=[0, 1, 2, 3, 4], xp=[1, 3], fp=[3, 5])
array([ 3.,  3.,  4.,  5.,  5.])

To me this says, "I've observed value (fp) 3 at index (xp) 1 and value 5 at index 3; now I want to fill in values at indexes (x) 0 through 4." Because 0 < 1, interp returns 3 for this index, and because 4 > 3, interp returns 5 for this index.

But if you think these values are actually sampled from some continuous underlying function, and you want to reconstruct this function even outside the [3, 5] interval, you have to use a spline interpolation:

>>> scipy.interpolate.InterpolatedUnivariateSpline([1, 3], [3, 5], k=1)(list(range(5)))
array([ 2.,  3.,  4.,  5.,  6.])

This works pretty well in pandas:

>>> s = pd.Series([np.nan, 3, np.nan, 5, np.nan])
>>> np.array(s.interpolate(method='linear', limit=10, limit_direction='both'))
array([ 3.,  3.,  4.,  5.,  5.])
>>> np.array(s.interpolate(method='spline', limit=10, limit_direction='both', order=1))
array([ 2.,  3.,  4.,  5.,  6.])

You have to specify method='spline' and order=k to get a spline that will extrapolate, though!

jreback · 2015-08-31T00:45:38Z

@lmjohns3 ok, I see you added those results as tests. gr8! ping me when green and I'll have another look.

lmjohns3 · 2015-08-31T14:56:00Z

Looks like the tests have passed!

Allow interpolate() to fill backwards as well as forwards

jreback · 2015-09-01T12:04:35Z

thanks @lmjohns3 nice change!

lmjohns3 · 2015-09-01T14:35:16Z

Thanks for the guidance!!

l736x · 2015-09-15T09:52:39Z

Looking at the "what's new" doc I see the following example:

In [43]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])

In [44]: ser.interpolate(limit=1, limit_direction='both')
Out[44]: 
0     5
1     5
2     5
3     7
4   NaN
5    11
6    13
dtype: float64

Given the description of the method I would have expected:

In [43]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13])

In [44]: ser.interpolate(limit=1, limit_direction='both')
Out[44]: 
0     NaN
1     5
2     5
3     7
4   NaN
5    11
6    13
dtype: float64

Am I wrong? Why do we fill the first element?

lmjohns3 · 2015-09-15T16:15:07Z

Your expectation seems correct to me -- the first element is further than 1 from a non-NaN value and doesn't seem like it should be filled. I'll look into it and see if something is going haywire in the code, or if it's just a documentation issue.

lmjohns3 · 2015-09-15T16:39:18Z

Yes, this is buggy, excellent find! The method for computing the NaN mask got bogged down when the NaN occurs within limit of the beginning of the series and limit_direction is set to 'both'. I'll file an issue for this.

jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Jul 28, 2015

lmjohns3 mentioned this pull request Jul 28, 2015

Pandas Interpolate Does not Handle Empty Values at Front of Series #10420

Closed

lmjohns3 closed this Jul 29, 2015

lmjohns3 reopened this Jul 29, 2015

lmjohns3 changed the title ~~Add a "backward_limit" kwarg to interpolate().~~ Allow interpolate() to fill backwards as well as forwards Jul 30, 2015

jreback reviewed Aug 8, 2015
View reviewed changes

jreback reviewed Aug 14, 2015
View reviewed changes

jreback reviewed Aug 18, 2015
View reviewed changes

jreback added a commit that referenced this pull request Sep 1, 2015

Merge pull request #10691 from lmjohns3/master

b7374ca

Allow interpolate() to fill backwards as well as forwards

jreback merged commit b7374ca into pandas-dev:master Sep 1, 2015

lmjohns3 mentioned this pull request Sep 15, 2015

interpolate erroneously fills NaNs at start of series when limit_direction == 'both' #11115

Closed

		ser.interpolate(limit=2)

		By default, ``limit`` applies in a forward direction, so that only ``NaN``

Allow interpolate() to fill backwards as well as forwards #10691

Allow interpolate() to fill backwards as well as forwards #10691

Conversation

lmjohns3 commented Jul 28, 2015

lmjohns3 commented Jul 28, 2015

shoyer commented Jul 28, 2015

jreback commented Jul 28, 2015

lmjohns3 commented Jul 28, 2015

jreback commented Jul 28, 2015

lmjohns3 commented Jul 28, 2015

lmjohns3 commented Jul 29, 2015

lmjohns3 commented Jul 30, 2015

jreback commented Jul 30, 2015

shoyer commented Jul 30, 2015

lmjohns3 commented Jul 31, 2015

johne13 commented Aug 1, 2015

jreback commented Aug 2, 2015

lmjohns3 commented Aug 2, 2015

johne13 commented Aug 2, 2015

lmjohns3 commented Aug 2, 2015

lmjohns3 commented Aug 2, 2015

lmjohns3 commented Aug 7, 2015

jreback Aug 8, 2015

Choose a reason for hiding this comment

lmjohns3 commented Aug 10, 2015

lmjohns3 commented Aug 14, 2015

jreback commented Aug 14, 2015

jreback commented Aug 14, 2015

jreback Aug 14, 2015

Choose a reason for hiding this comment

lmjohns3 Aug 15, 2015

Choose a reason for hiding this comment

jreback commented Aug 18, 2015

lmjohns3 commented Aug 18, 2015

shoyer commented Aug 18, 2015

lmjohns3 commented Aug 18, 2015

lmjohns3 commented Aug 18, 2015

jreback commented Aug 18, 2015

jreback Aug 18, 2015

Choose a reason for hiding this comment

jreback commented Aug 18, 2015

jreback commented Aug 26, 2015

lmjohns3 commented Aug 26, 2015

lmjohns3 commented Aug 30, 2015

jreback commented Aug 31, 2015

lmjohns3 commented Aug 31, 2015

jreback commented Sep 1, 2015

lmjohns3 commented Sep 1, 2015

l736x commented Sep 15, 2015

lmjohns3 commented Sep 15, 2015

lmjohns3 commented Sep 15, 2015