
ENH: add time-window capability to .rolling #13513

Conversation

@jreback (Contributor) commented Jun 26, 2016

xref #13327
closes #936

This notebook shows the use case.

This implements time-aware windows; in other words, you can now pass a ragged / sparse timeseries to .rolling() and have it work with an offset (e.g. 2s). Previously you could achieve these results by resampling first and then using an integer period (though you would have to jump through hoops when crossing day boundaries and such).

This now provides a nice, easy, performant implementation (though, as indicated in the linked notebook, the min/max impl is giving scaling issues).

In [1]: df = DataFrame({'B': range(5)})

In [2]: df.index = [Timestamp('20130101 09:00:00'),
   ...:             Timestamp('20130101 09:00:02'),
   ...:             Timestamp('20130101 09:00:03'),
   ...:             Timestamp('20130101 09:00:05'),
   ...:             Timestamp('20130101 09:00:06')]

In [3]: df
Out[3]: 
                     B
2013-01-01 09:00:00  0
2013-01-01 09:00:02  1
2013-01-01 09:00:03  2
2013-01-01 09:00:05  3
2013-01-01 09:00:06  4

In [4]: df.rolling(2, min_periods=1).sum()
Out[4]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  5.0
2013-01-01 09:00:06  7.0

In [5]: df.rolling('2s', min_periods=1).sum()
Out[5]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  3.0
2013-01-01 09:00:06  7.0
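
Note the fourth row: with the '2s' offset, the window ending at 09:00:05 spans (09:00:03, 09:00:05] and so contains only that single observation, which is why the two outputs differ there. For illustration, a rough sketch (my reconstruction, not part of the PR) of the resampling workaround described above — for sum, empty 1-second bins contribute nothing, so on this on-grid index it reproduces Out[5]:

In [6]: regular = df.resample('1s').sum()

In [7]: regular.rolling(2, min_periods=1).sum().loc[df.index]  # matches Out[5]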

@jreback added the Enhancement, Timeseries, and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on Jun 26, 2016
@jreback added this to the 0.18.2 milestone on Jun 26, 2016
@jreback (Contributor, Author) commented Jun 26, 2016

cc @jorisvandenbossche
cc @wesm
cc @chrisaycock
cc @joshuastorck

If anyone would like to try to fix up min/max, that would be great. Here is the same PR with a priority heap (but it's a fair bit slower; I am probably doing something non-optimal).

@max-sixty (Contributor) commented:

closes #936

@jreback (Contributor, Author) commented Jun 26, 2016

Yeah, we'll close #936, but I think that impl is a bit different.

@codecov-io commented Jun 27, 2016

Current coverage is 84.39%

Merging #13513 into master will increase coverage by 0.01%

@@             master     #13513   diff @@
==========================================
  Files           142        142          
  Lines         51235      51299    +64   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43234      43296    +62   
- Misses         8001       8003     +2   
  Partials          0          0          


@chrisaycock (Contributor) commented:

Looks good to me. Regarding max/min, is the priority list faster than just recomputing the maximum value from scratch for each iteration? Caching and branch prediction can work wonders for small sample sizes.

@wesm (Member) commented Jun 27, 2016

Long-awaited feature! Will try to review this today.

@joshuastorck (Contributor) commented Jun 27, 2016

Take a look at this: https://people.cs.uct.ac.za/~ksmith/articles/sliding_window_minimum.html.

You don't want to use a heap for min/max, as it's worst-case O(n log n). There is a more efficient algorithm that is O(n). It should also be a bit easier to implement, as all you really need for a data structure is a deque.

That's the algorithm that the existing implementation was using.
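
For reference, a minimal sketch of that O(n) monotonic-deque algorithm, assuming a fixed integer window (the function name is illustrative, not the pandas internals):

from collections import deque

def sliding_window_min(values, window):
    dq = deque()  # holds indices; values[dq[0]] is the current window minimum
    result = []
    for i, v in enumerate(values):
        while dq and dq[0] <= i - window:    # drop indices that left the window
            dq.popleft()
        while dq and values[dq[-1]] >= v:    # drop values that can never be the minimum again
            dq.pop()
        dq.append(i)
        result.append(values[dq[0]])
    return result

sliding_window_min([3, 1, 4, 1, 5], 2)  # -> [3, 1, 1, 1, 1]

Each index is pushed and popped at most once, hence O(n) total despite the inner loops.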

@jreback (Contributor, Author) commented Jun 27, 2016

@chrisaycock

OK, silly of me not to test the completely naive solution, which is quite fast. It still scales as O(n), where n is the window size, but now the constant is 1/10 of my priority-list constant. So this is pretty reasonable.

In [1]: N = 100000
   ...: dfp = DataFrame({'B': np.random.randn(N)},
   ...:                 index=pd.date_range('20130101', periods=N, freq='s'))

In [2]: %timeit dfp.rolling(10).min()
100 loops, best of 3: 2.47 ms per loop

In [3]: %timeit dfp.rolling('10s').min()
100 loops, best of 3: 3.2 ms per loop

In [4]: %timeit dfp.rolling(100).min()
100 loops, best of 3: 2.58 ms per loop

In [5]: %timeit dfp.rolling('100s').min()
100 loops, best of 3: 11.2 ms per loop

In [6]: %timeit dfp.rolling(1000).min()
100 loops, best of 3: 2.62 ms per loop

In [7]: %timeit dfp.rolling('1000s').min()
10 loops, best of 3: 97.2 ms per loop
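
These numbers are consistent with the naive approach: recompute the min from scratch over each variable window, costing O(window) per output point (hence the '1000s' case being roughly 30x slower than '10s'). A sketch of the idea, with illustrative names rather than the actual Cython:

import numpy as np

def roll_min_variable(values, starts, ends):
    # values[starts[i]:ends[i]] are the observations inside window i,
    # with the bounds derived from the timestamps and the offset.
    out = np.empty(len(starts))
    for i, (s, e) in enumerate(zip(starts, ends)):
        out[i] = values[s:e].min() if e > s else np.nan
    return out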

@jreback (Contributor, Author) commented Jun 27, 2016

@joshuastorck yes, the deque is theoretically faster, but it was very awkward with a variable-sized window; others are welcome to take a shot, though!

@wesm (Member) commented Jun 27, 2016

Meta-comment: since getting exposed to Gerrit over the last couple of years, large reviews on GitHub PRs tend to encourage less careful review (because it's difficult to keep track of what still needs doing), at least for me. It might be worth playing around with gerrithub.io sometime to mirror patches there for review. Not sure what's the best solution.

if index is None:
    index = self._on
    return index, index.asi8, self.win_unit
return index, index, self.win_unit
A reviewer (Member) commented on this diff:

This is a little bit opaque. Is the purpose of the 2nd entry in the tuple to ensure you are always passing integers to the Cython function?

@wesm (Member) commented Jun 28, 2016

This looks like it was quite a slog -- we should add to our ever-increasing TODO list a review of all of our aggregation code and ways to foster reuse or more flexibility. Trying to think if there was a way this could have been made less painful. C++ templates could help (for example, a particular aggregator could be passed as a template argument, so the update inner loop could all get inlined at compile time -- e.g. the add_sum / remove_sum bits). But we'd need to decide if we want to commit to the route of writing more C++ and less Cython (I think there are good arguments for it, but evolving the build-and-deployment tools to accommodate that would be, as discussed, an upfront investment of work) -- this is somewhat orthogonal to the bigger discussion of writing more of the data structure internals in C or C++.
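
For readers following along, the add/remove pattern referenced above looks roughly like this in Python -- a sketch of the technique with illustrative names, not the PR's actual Cython:

def roll_sum_variable(values, starts, ends):
    # Generic variable-window loop: as the window slides forward, add the
    # points that entered and remove the points that left, instead of
    # recomputing each window's sum from scratch.
    total = 0.0
    count = 0
    out = []
    prev_start = prev_end = 0
    for s, e in zip(starts, ends):
        for j in range(prev_end, e):      # the "add_sum" step
            total += values[j]
            count += 1
        for j in range(prev_start, s):    # the "remove_sum" step
            total -= values[j]
            count -= 1
        out.append(total if count else float('nan'))
        prev_start, prev_end = s, e
    return out

A templated C++ version would pass the add/remove callables as template parameters so the compiler can inline them into the loop.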

@wesm (Member) commented Jun 28, 2016

Also, I'm always happy to see 4+ year-old issues that were once just a twinkle in my eye get realized, so better late than never =) Great stuff

@joshuastorck (Contributor) commented:

I think there are a couple of compelling arguments for going to C++ templates, mostly that you can vary the implementation based on type at compile time. For example, since integer columns don't have NaNs, you don't really need the min_periods for rolling_* functions and could use SFINAE for that distinction. Alternatively, you could put if statements inside the inner loops that are based on type, such as "if std::is_integral<T>::value". With the rolling functions, much of the looping code is common, but the algorithm and state vary. You can define a rolling_apply function that takes iterators, a functor/lambda, and variadic arguments to hold the state. And it goes without saying that there are a lot of possibilities for using SIMD when dealing with contiguous columns.

The downside with C++ is that you have to do a lot of boilerplate dispatching code to templates based on numpy type, the cross-platform compile gets more complicated, and the code could arguably be harder to read.

At a minimum, it would be worth comparing the runtime of hand-converted code against the Cython version, as the Cython code generator introduces a lot of constructs that might make it harder for the compiler to optimize, such as sanity checks in inner loops that could probably be done up front.

@jorisvandenbossche (Member) commented:

@jreback There is a memory leak when using center=True (first time I got a MemoryError, but second time it just crashed my computer :-))

@jreback (Contributor, Author) commented Jun 30, 2016

@jorisvandenbossche I didn't change anything w.r.t. center (and it's disabled for rolling with time series). Can you show a repro?

@jorisvandenbossche (Member) commented:

With the example from your notebook: df.rolling('2s', center=True).sum()

@jreback (Contributor, Author) commented Jun 30, 2016

@jorisvandenbossche Pushed. It's not implemented.

@@ -3,15 +3,16 @@
v0.19.0 (August ??, 2016)
-------------------------

This is a major release from 0.18.2 and includes a small number of API changes, several new features,
@jreback (Contributor, Author) commented on this diff:

@jorisvandenbossche slight changes are needed here (they are in master); what's done here I think is fine.

@jreback (Contributor, Author) commented Jul 10, 2016

OK, well I'd like to push this into master now; now that we are on 0.19, this can be tried out for a bit by the deer-in-the-headlights crowd who run pandas master 👯

If there are suggestions later for a different API, they can be addressed then. To be honest, this is a highly theoretical use case where you'd actually want separate args, so I think we'd need to go whole hog and deprecate window and add 2 new args instead.

But that is really -1 from me.

@jreback force-pushed the rolling branch 2 times, most recently from 5a8d290 to 9af83c7 on July 13, 2016
@jreback (Contributor, Author) commented Jul 15, 2016

Any final comments? Merging soon.

@shoyer (Member) commented Jul 15, 2016

Well, I'm still not happy with this API choice but I suppose I can live with @wesm's suggested compromise. It seems very awkward to me, though.

@shoyer (Member) commented Jul 15, 2016

And to be clear, by "awkward" I mean that I think the current implementation will lead to bugs and confusion in user code. But that's the price we pay for convenience...

@wesm (Member) commented Jul 15, 2016

@shoyer so to make sure I understand the core concern, it's that this overloads the semantics of the window argument -- either you get a fixed number of periods or a variable-length window based on the type of input? We could certainly introduce auxiliary APIs rolling_fixed and rolling_variable (leaving the existing rolling) which only accept one input or the other to offer a more precise / readable result.

On the other hand, you could imagine having code like

def my_rolling_stat(df, win):
    return df.rolling(win).mean()

(this might be a lot more complex in practice), and this could be easily switched from a variable window to a fixed window through the win argument.

@wesm (Member) commented Jul 15, 2016

At least, I feel the current state of things is much better / more consistent than the earlier smorgasbord of rolling_* functions

@shoyer (Member) commented Jul 15, 2016

> @shoyer so to make sure I understand the core concern, it's that this overloads the semantics of the window argument -- either you get a fixed number of periods or a variable-length window based on the type of input?

Yes, this is my concern. I strongly prefer function arguments that only do one thing, independent of the type of the data -- especially when the behavior is potentially ambiguous, as in this case.

> We could certainly introduce auxiliary APIs rolling_fixed and rolling_variable (leaving the existing rolling) which only accept one input or the other to offer a more precise / readable result.

Yes, this is another clean way to handle the API.

> On the other hand, you could imagine having code like
>
>     def my_rolling_stat(df, win):
>         return df.rolling(win).mean()
>
> (this might be a lot more complex in practice), and this could be easily switched from a variable window to a fixed window through the win argument.

Indeed, it's certainly more convenient -- at least at first -- to reuse a single argument for these two different purposes. This could even make sense for special-purpose code, e.g., if the user is happy to make the assumption that df will always have a DatetimeIndex. But I don't think these sorts of special rules make sense for pandas, which does an admirable job of providing a composable language for data analysis that works consistently, independent of the particular data types. I also don't think it's hard for a user to add some custom logic before calling .rolling() to implement this sort of functionality themselves.

> At least, I feel the current state of things is much better / more consistent than the earlier smorgasbord of rolling_* functions.

Totally agreed! I'm very happy we were able to switch to the .rolling() methods.
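
A minimal sketch of the user-side dispatch @shoyer describes above (a hypothetical helper, not a pandas API):

def my_rolling_mean(df, win):
    # Decide the window type explicitly instead of relying on pandas
    # to infer it from the argument's type.
    if isinstance(win, str):             # treat strings as time offsets
        return df.rolling(win).mean()    # variable, time-based window
    return df.rolling(int(win)).mean()   # fixed, count-based window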

@chris-b1 (Contributor) commented:

This is smaller, but is on= the right keyword for column selection? xref #13500 - I'd want to use the same keyword there.

Although I suggested key= in that issue, I'm now thinking that by= would be better, for consistency with groupby and the sorting api?

@jreback (Contributor, Author) commented Jul 16, 2016

We already use on= for pd.merge.

@jorisvandenbossche (Member) commented:

We are a bit stuck, it seems ... Below I tried to list the possibilities for going forward based on the discussions above (note: all keyword/method names used are of course open for discussion :-); I used 'size' and 'delta' for now to make the distinction).

  1. The current PR: DataFrame.rolling(window, ...)
    • Use the existing window argument for both types of window
    • Advantage: simple usage (df.rolling('2s'))
    • Disadvantage: overloading of one keyword (window)
  2. Add separate keyword for new functionality: DataFrame.rolling(window, window_delta=.., ...)
    • Advantage: no overloading of the same keyword with different behaviours
    • Disadvantage: not being able to use the shorthand df.rolling('2s') for the new functionality, as you always have to use a named keyword: df.rolling(window_delta='2s'), so worse ergonomics
    • Alternative 1: introduce two new keywords: DataFrame.rolling(window_size, window_delta, ...) to have a more descriptive keyword for the existing functionality as well.
      • Note that there is not necessarily a need to deprecate the window keyword. E.g. DataFrame.rolling(window_size=None, window_delta=None, ..., **kwargs) -> window_size can handle the current use as a positional arg, and the keyword use window= can be checked for in the kwargs.
    • Alternative 2 (from @wesm): introduce two new keywords for explicitness, but keep the current window for convenience: DataFrame.rolling(window, window_size=None, window_delta=None, ...), and then possibly assign the first positional window argument to one of these based on its type (int, string, DateOffset, etc.) -- see the sketch after this list.
  3. Add separate method for new functionality: DataFrame.rolling_delta(window, ...)
    • Advantage: no overloading of the same keyword/function with different behaviours
    • Disadvantage: yet another method(s) added to the DataFrame namespace
    • Options: also add a DataFrame.rolling_fixed() (but keep existing rolling as an alias for this)
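
A sketch of what Alternative 2's argument resolution might look like (all names illustrative, not a pandas API):

import pandas as pd

def resolve_window(window=None, window_size=None, window_delta=None):
    # Coerce the convenient positional argument to one of the two
    # explicit keywords based on its type.
    if window is not None:
        if window_size is not None or window_delta is not None:
            raise TypeError("pass either window or an explicit keyword, not both")
        if isinstance(window, (str, pd.DateOffset)):
            window_delta = window     # e.g. '2s' -> time-based window
        else:
            window_size = window      # e.g. 5 -> count-based window
    return window_size, window_delta

resolve_window('2s')  # -> (None, '2s')
resolve_window(5)     # -> (5, None)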

@jreback (Contributor, Author) commented Jul 20, 2016

I still don't buy that this overloading is bad in any way.

I'd like @shoyer to convince me why having the API in this PR is not a good idea.

> And to be clear, by "awkward" I mean that I think the current implementation will lead to bugs and confusion in user code. But that's the price we pay for convenience...

This is just bunk. It is more awkward to have confusing arguments in the first place, which is what options 2/3 from @jorisvandenbossche suggest.

Python is NOT C++, nor is it Java. We should be pythonic and follow current conventions. Pandas DOES quite a lot of inference. The POINT is to have convenience.

Trying to change something just for the sake of some theoretical purity is not in the spirit of python (or pandas at all). In fact, I would argue that making this NOT similar to simple, unambiguous usage is a MAJOR break in and of itself.

You have to have a VERY convincing argument to actually change things.

@chrisaycock (Contributor) commented:

I agree with @jreback. Overloading the parameters makes things pretty easy, since the user only needs to know about one input. Let the machine figure out what to do with it.

@shoyer (Member) commented Jul 20, 2016

> Python is NOT C++, nor is it Java. We should be pythonic and follow current conventions. Pandas DOES quite a lot of inference. The POINT is to have convenience.

There is certainly precedent for this sort of behavior in pandas, but it's not the good kind -- it's confusing features like .ix or the overloading of __getitem__. Things that require entries on the "gotchas" page.

Can you find a single example of this sort of thing in the Python standard library? In my view it is very non-Pythonic, in addition to being poor software design more generally. "In the face of ambiguity, refuse the temptation to guess."

@chris-b1 (Contributor) commented Jul 20, 2016

What feels different about this one (vs. .ix - which I agree is too magical) is that there is no guessing or fallback - there are well defined behaviors or error messages on each path. (edit: to be more explicit - df.ix[5:10, :] does different things depending on what's in df. df.rolling(..) will always do the same thing or raise).

I do see the point about adding an interval window in the future...but even then - say this was the API - is the second row really ambiguous?

df.rolling('2s').sum()
df.rolling(5).sum()
df.rolling(numeric_interval=5.0).sum()

@wesm (Member) commented Jul 20, 2016

Here's my advice for how to move forward:

  • Proceed for now with the overloaded argument. This is much lower risk for conflation than .ix had with labels vs integers
  • Observe user behavior and see how it goes for people.
  • We always have the option to add explicit keyword arguments
  • If the string argument is causing problems, we can decide to deprecate and encourage using explicit arguments. This is obviously a last resort.

While not exactly equivalent, I see this as similar to 'ndarray.astype' -- effectively what we are doing is coercing the argument to a Window spec. We might consider adding a "WindowSpec" type of object that can be passed.

The variable windows will only be used by a pretty narrow subset of users -- I agree we should continue to be careful about argument overloading.
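
A hypothetical sketch of what such a "WindowSpec" coercion could look like (not an actual pandas API, just the astype-style idea floated above):

class FixedWindow:
    # A count-based window of `size` observations.
    def __init__(self, size):
        self.size = size

class VariableWindow:
    # A time-based window covering the trailing `offset` span.
    def __init__(self, offset):
        self.offset = offset

def to_window_spec(arg):
    # Coerce the raw argument to an explicit spec up front, so the window
    # type is stated once rather than inferred at each call site.
    return VariableWindow(arg) if isinstance(arg, str) else FixedWindow(int(arg))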

@shoyer (Member) commented Jul 20, 2016

@chris-b1 I agree, this is not as bad as .ix. I still don't like functionality specific to the dtype of the function arguments.

@wesm OK, I can live with that. I'm pretty sure that we'll eventually need to add explicit keyword arguments, because there's no other sane way to handle fixed-label-width rolling windows for non-time labels.

WindowSpec is indeed one way to make this explicit, although it feels rather contrived to me -- it's a way to make an ambiguous API explicit without breaking it, not how you would design an explicit API from scratch.

@jreback (Contributor, Author) commented Jul 20, 2016

@shoyer I think you missed my original point. pandas does lots of 'magical' things. Changing something to non-magical will be a huge change. Yes you can attempt to forestall increasing the magic, but decreasing magic itself is a rather large API change.

You certainly could, but it should then be vetted like any other API change.

merging

@shoyer (Member) commented Jul 20, 2016

> Changing something to non-magical will be a huge change. Yes you can attempt to forestall increasing the magic, but decreasing magic itself is a rather large API change.

This is exactly why I argued for not adding the magic here in the first place....

@jreback jreback closed this in 786edc7 Jul 20, 2016
@jreback (Contributor, Author) commented Jul 20, 2016

@shoyer

> You certainly could, but it should then be vetted like any other API change.

pandas already has magic around these types of things; changing it is bad (unless you basically change everything).

@wesm (Member) commented Jul 22, 2016

hey folks, just a meta-comment on this discussion. While these are all plain text exchanges and tone is devilishly hard to convey, I encourage all to be particularly mindful of how this discussion might read to outside observers or would-be contributors to the project.

I realize that the disagreement about the API dragged on for about 2 weeks, and I think we all would agree that 2 weeks is too long, and some frustration about this is understandable (always be closing, right?). Can we come up with a working plan for the future when resolving these sorts of contentious issues? I should have intervened sooner to resolve the impasse (bad timing: had a particularly busy couple of weeks, and also was on vacation and not looking at e-mail / github very much). The plan might be just to bug me to help arbitrate.

As far as how the specific points of view were argued, as frustrating as it may be, there are some things we should definitely avoid. I hate to nail a particular comment to the wall, but: #13513 (comment) — statements like "this is just bunk" are not helpful. It is OK to place the burden of proof on the person raising objections, but I think a more constructive way would be "I still disagree with you about this. Can you give some concrete examples illustrating why you feel this way?" We won't always agree, but we should always strive to validate others' concerns and keep the conversation both constructive and productive (which I think we can do without being needling or dismissive).

I agree that pandas has teetered on the spectrum between preciseness / unambiguity and convenience, and we (e.g. yours truly) have made some lapses in judgment that have been costly to remedy (like .ix and the integer indexing problem). I try to avoid "Edge Case Driven Development", unless the edge cases can be demonstrated to be particularly harmful to the users (for example: anything that would cause a user to unwittingly obtain erroneous output with some reasonable probability).

Thanks

@magicmathmandarin commented:

I find the time-based window implementation to be confusing.
For example,

df = pd.DataFrame({'x': [0, 1, 2, np.nan, 4]},
                    index=pd.date_range('20200101',
                    periods=5, freq='d'))
df.rolling(2).sum()

and

df.rolling('2d').sum()

return totally different results. df.rolling('2d').sum() seems to assume min_periods=1.
A user has to go through trial and error to figure that out, and cannot be sure whether the result would be as expected for a different case.
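
The differing defaults here are documented pandas behavior: with an integer window, min_periods defaults to the window size, while with an offset window it defaults to 1. Making min_periods explicit reconciles the two forms on this regular daily index:

df.rolling(2).sum()                 # min_periods defaults to 2
df.rolling('2d').sum()              # min_periods defaults to 1
df.rolling(2, min_periods=1).sum()  # now matches df.rolling('2d').sum() here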

@albertbuchard commented Sep 18, 2020

As @magicmathmandarin pointed out, min_periods does not work as expected; just try:

df.rolling("2s", min_periods=2).sum()

                       B
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  NaN
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  7.0

But that seems to be due to the variability of the sampling rate:

df = pd.DataFrame({'B': range(5)})
now = pd.Timestamp.now()
df.index = [now,
            now + pd.Timedelta(seconds=2),
            now + pd.Timedelta(seconds=3),
            now + pd.Timedelta(seconds=5),
            now + pd.Timedelta(seconds=6)]
df.rolling("2s", min_periods=2).sum()

                              B
2020-09-18 10:45:08.811555  NaN
2020-09-18 10:45:10.811555  NaN
2020-09-18 10:45:11.811555  3.0
2020-09-18 10:45:13.811555  NaN
2020-09-18 10:45:14.811555  7.0

df_rs = df.resample("1s").mean().ffill()
df_rs.rolling("2s", min_periods=2).sum()

                       B
2020-09-18 10:45:08  NaN
2020-09-18 10:45:09  0.0
2020-09-18 10:45:10  1.0
2020-09-18 10:45:11  3.0
2020-09-18 10:45:12  4.0
2020-09-18 10:45:13  5.0
2020-09-18 10:45:14  7.0                           

I guess you could recover the correct output by doing something like:

rolled = df_rs.rolling("2s", min_periods=2).sum()
rolled.iloc[[df_rs.index.get_loc(ts, method='ffill') for ts in df.index]]

                       B
2020-09-18 10:45:08  NaN
2020-09-18 10:45:10  1.0
2020-09-18 10:45:11  3.0
2020-09-18 10:45:13  5.0
2020-09-18 10:45:14  7.0


Successfully merging this pull request may close these issues: Add time-length windowing capability to moving statistics.