
ENH: add time-window capability to .rolling #13513

Conversation

@jreback (Contributor) commented Jun 26, 2016

xref #13327
closes #936

This notebook shows the use case.

This implements time-aware windows; in other words, you can now pass a ragged / sparse timeseries to .rolling() and have it work with an offset (e.g. 2s). Previously you could achieve these results by resampling first and then using an integer period (though you would have to jump through hoops when crossing day boundaries and such).

This now provides a nice, easy, performant implementation (though, as indicated in the linked notebook, the min/max impl is giving scaling issues).

In [1]: df = DataFrame({'B': range(5)})

In [2]: df.index = [Timestamp('20130101 09:00:00'),
   ...:             Timestamp('20130101 09:00:02'),
   ...:             Timestamp('20130101 09:00:03'),
   ...:             Timestamp('20130101 09:00:05'),
   ...:             Timestamp('20130101 09:00:06')]

In [3]: df
Out[3]: 
                     B
2013-01-01 09:00:00  0
2013-01-01 09:00:02  1
2013-01-01 09:00:03  2
2013-01-01 09:00:05  3
2013-01-01 09:00:06  4

In [4]: df.rolling(2, min_periods=1).sum()
Out[4]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  5.0
2013-01-01 09:00:06  7.0

In [5]: df.rolling('2s', min_periods=1).sum()
Out[5]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  3.0
2013-01-01 09:00:06  7.0
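
Note the fourth row: with the '2s' offset, the window ending at 09:00:05 spans (09:00:03, 09:00:05] and so contains only that single observation, which is why the two outputs differ there. For illustration, a rough sketch (my reconstruction, not part of the PR) of the resampling workaround described above — for sum, empty 1-second bins contribute nothing, so on this on-grid index it reproduces Out[5]:

In [6]: regular = df.resample('1s').sum()

In [7]: regular.rolling(2, min_periods=1).sum().loc[df.index]  # matches Out[5]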

@jreback added the Enhancement, Timeseries, and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on Jun 26, 2016
@jreback added this to the 0.18.2 milestone on Jun 26, 2016
@jreback (Contributor, Author) commented Jun 26, 2016

cc @jorisvandenbossche
cc @wesm
cc @chrisaycock
cc @joshuastorck

If anyone would like to try to fix up min/max, that would be great. Here is the same PR with a priority heap (but it's a fair bit slower; I am probably doing something non-optimal).

@max-sixty (Contributor) commented:

closes #936

@jreback (Contributor, Author) commented Jun 26, 2016

Yeah, we'll close #936, but I think that impl is a bit different.

@codecov-io commented Jun 27, 2016

Current coverage is 84.39%

Merging #13513 into master will increase coverage by 0.01%

@@             master     #13513   diff @@
==========================================
  Files           142        142          
  Lines         51235      51299    +64   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43234      43296    +62   
- Misses         8001       8003     +2   
  Partials          0          0          


@chrisaycock (Contributor) commented:

Looks good to me. Regarding max/min, is the priority list faster than just recomputing the maximum value from scratch for each iteration? Caching and branch prediction can work wonders for small sample sizes.

@wesm (Member) commented Jun 27, 2016

Long-awaited feature! Will try to review this today.

@joshuastorck (Contributor) commented Jun 27, 2016

Take a look at this: https://people.cs.uct.ac.za/~ksmith/articles/sliding_window_minimum.html.

You don't want to use a heap for min/max, as it's worst-case O(n log n). There is a more efficient algorithm that is O(n). It should also be a bit easier to implement, as all you really need for a data structure is a deque.

That's the algorithm that the existing implementation was using.
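
For reference, a minimal sketch of that O(n) monotonic-deque algorithm, assuming a fixed integer window (the function name is illustrative, not the pandas internals):

from collections import deque

def sliding_window_min(values, window):
    dq = deque()  # holds indices; values[dq[0]] is the current window minimum
    result = []
    for i, v in enumerate(values):
        while dq and dq[0] <= i - window:    # drop indices that left the window
            dq.popleft()
        while dq and values[dq[-1]] >= v:    # drop values that can never be the minimum again
            dq.pop()
        dq.append(i)
        result.append(values[dq[0]])
    return result

sliding_window_min([3, 1, 4, 1, 5], 2)  # -> [3, 1, 1, 1, 1]

Each index is pushed and popped at most once, hence O(n) total despite the inner loops.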

@jreback (Contributor, Author) commented Jun 27, 2016

@chrisaycock

OK, silly of me not to test the completely naive solution, which is quite fast. It still scales as O(n), where n is the window size, but now the constant is 1/10 of my priority-list constant. So this is pretty reasonable.

In [1]: N = 100000
   ...: dfp = DataFrame({'B': np.random.randn(N)},
   ...:                 index=pd.date_range('20130101', periods=N, freq='s'))

In [2]: %timeit dfp.rolling(10).min()
100 loops, best of 3: 2.47 ms per loop

In [3]: %timeit dfp.rolling('10s').min()
100 loops, best of 3: 3.2 ms per loop

In [4]: %timeit dfp.rolling(100).min()
100 loops, best of 3: 2.58 ms per loop

In [5]: %timeit dfp.rolling('100s').min()
100 loops, best of 3: 11.2 ms per loop

In [6]: %timeit dfp.rolling(1000).min()
100 loops, best of 3: 2.62 ms per loop

In [7]: %timeit dfp.rolling('1000s').min()
10 loops, best of 3: 97.2 ms per loop
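
These numbers are consistent with the naive approach: recompute the min from scratch over each variable window, costing O(window) per output point (hence the '1000s' case being roughly 30x slower than '10s'). A sketch of the idea, with illustrative names rather than the actual Cython:

import numpy as np

def roll_min_variable(values, starts, ends):
    # values[starts[i]:ends[i]] are the observations inside window i,
    # with the bounds derived from the timestamps and the offset.
    out = np.empty(len(starts))
    for i, (s, e) in enumerate(zip(starts, ends)):
        out[i] = values[s:e].min() if e > s else np.nan
    return out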

@jreback (Contributor, Author) commented Jun 27, 2016

@joshuastorck yes, the deque is theoretically faster, but it was very awkward with a variable-sized window; others are welcome to take a shot, though!

@wesm (Member) commented Jun 27, 2016

Meta-comment: since getting exposed to Gerrit over the last couple of years, large reviews on GitHub PRs tend to encourage less careful review (because it's difficult to keep track of what still needs doing), at least for me. It might be worth playing around with gerrithub.io sometime to mirror patches there for review. Not sure what's the best solution.

if index is None:
    index = self._on
    return index, index.asi8, self.win_unit
return index, index, self.win_unit
A reviewer (Member) commented on this diff:

This is a little bit opaque. Is the purpose of the 2nd entry in the tuple to ensure you are always passing integers to the Cython function?

@wesm (Member) commented Jun 28, 2016

This looks like it was quite a slog -- we should add to our ever-increasing TODO list a review of all of our aggregation code and ways to foster reuse or more flexibility. Trying to think if there was a way this could have been made less painful. C++ templates could help (for example, a particular aggregator could be passed as a template argument, so the update inner loop could all get inlined at compile time -- e.g. the add_sum / remove_sum bits). But we'd need to decide if we want to commit to the route of writing more C++ and less Cython (I think there are good arguments for it, but evolving the build-and-deployment tools to accommodate that would be, as discussed, an upfront investment of work) -- this is somewhat orthogonal to the bigger discussion of writing more of the data structure internals in C or C++.
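
For readers following along, the add/remove pattern referenced above looks roughly like this in Python -- a sketch of the technique with illustrative names, not the PR's actual Cython:

def roll_sum_variable(values, starts, ends):
    # Generic variable-window loop: as the window slides forward, add the
    # points that entered and remove the points that left, instead of
    # recomputing each window's sum from scratch.
    total = 0.0
    count = 0
    out = []
    prev_start = prev_end = 0
    for s, e in zip(starts, ends):
        for j in range(prev_end, e):      # the "add_sum" step
            total += values[j]
            count += 1
        for j in range(prev_start, s):    # the "remove_sum" step
            total -= values[j]
            count -= 1
        out.append(total if count else float('nan'))
        prev_start, prev_end = s, e
    return out

A templated C++ version would pass the add/remove callables as template parameters so the compiler can inline them into the loop.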

@wesm (Member) commented Jun 28, 2016

Also, I'm always happy to see 4+ year-old issues that were once just a twinkle in my eye get realized, so better late than never =) Great stuff

@joshuastorck (Contributor) commented:

I think there are a couple of compelling arguments for going to C++ templates, mostly that you can vary the implementation based on type at compile time. For example, since integer columns don't have NaNs, you don't really need the min_periods for rolling_* functions and could use SFINAE for that distinction. Alternatively, you could put if statements inside the inner loops that are based on type, such as "if std::is_integral<T>::value". With the rolling functions, much of the looping code is common, but the algorithm and state vary. You can define a rolling_apply function that takes iterators, a functor/lambda, and variadic arguments to hold the state. And it goes without saying that there are a lot of possibilities for using SIMD when dealing with contiguous columns.

The downside with C++ is that you have to do a lot of boilerplate dispatching code to templates based on numpy type, the cross-platform compile gets more complicated, and the code could arguably be harder to read.

At a minimum, it would be worth comparing the runtime of hand-converted code against the Cython version, as the Cython code generator introduces a lot of constructs that might make it harder for the compiler to optimize, such as sanity checks in inner loops that could probably be done up front.

@jorisvandenbossche (Member) commented:

@jreback There is a memory leak when using center=True (first time I got a MemoryError, but second time it just crashed my computer :-))

@jreback (Contributor, Author) commented Jun 30, 2016

@jorisvandenbossche I didn't change anything w.r.t. center (and it's disabled for rolling with time series). Can you show a repro?

@jorisvandenbossche (Member) commented:

With the example from your notebook: df.rolling('2s', center=True).sum()

@jreback (Contributor, Author) commented Jun 30, 2016

@jorisvandenbossche Pushed. It's not implemented.

@@ -3,15 +3,16 @@
v0.19.0 (August ??, 2016)
-------------------------

This is a major release from 0.18.2 and includes a small number of API changes, several new features,
@jreback (Contributor, Author) commented on this diff:

@jorisvandenbossche slight changes are needed here (they are in master); what's done here I think is fine.

@jreback (Contributor, Author) commented Jul 10, 2016

OK, well I'd like to push this into master now; now that we are on 0.19, this can be tried out for a bit by the deer-in-the-headlights crowd who run pandas master 👯

If there are suggestions later for a different API, they can be addressed then. To be honest, this is a highly theoretical use case where you'd actually want separate args, so I think we'd need to go whole hog and deprecate window and add 2 new args instead.

But that is really -1 from me.

@jreback force-pushed the rolling branch 2 times, most recently from 5a8d290 to 9af83c7 on July 13, 2016
@jreback (Contributor, Author) commented Jul 15, 2016

Any final comments? Merging soon.

@shoyer (Member) commented Jul 15, 2016

Well, I'm still not happy with this API choice but I suppose I can live with @wesm's suggested compromise. It seems very awkward to me, though.

@shoyer (Member) commented Jul 15, 2016

And to be clear, by "awkward" I mean that I think the current implementation will lead to bugs and confusion in user code. But that's the price we pay for convenience...

@wesm (Member) commented Jul 15, 2016

@shoyer so to make sure I understand the core concern, it's that this overloads the semantics of the window argument -- either you get a fixed number of periods or a variable-length window based on the type of input? We could certainly introduce auxiliary APIs rolling_fixed and rolling_variable (leaving the existing rolling) which only accept one input or the other to offer a more precise / readable result.

On the other hand, you could imagine having code like

def my_rolling_stat(df, win):
    return df.rolling(win).mean()

(this might be a lot more complex in practice), and this could be easily switched from a variable window to a fixed window through the win argument.

@wesm (Member) commented Jul 15, 2016

At least, I feel the current state of things is much better / more consistent than the earlier smorgasbord of rolling_* functions

@shoyer (Member) commented Jul 15, 2016

> @shoyer so to make sure I understand the core concern, it's that this overloads the semantics of the window argument -- either you get a fixed number of periods or a variable-length window based on the type of input?

Yes, this is my concern. I strongly prefer function arguments that only do one thing, independent of the type of the data -- especially when the behavior is potentially ambiguous, as in this case.

> We could certainly introduce auxiliary APIs rolling_fixed and rolling_variable (leaving the existing rolling) which only accept one input or the other to offer a more precise / readable result.

Yes, this is another clean way to handle the API.

> On the other hand, you could imagine having code like
>
>     def my_rolling_stat(df, win):
>         return df.rolling(win).mean()
>
> (this might be a lot more complex in practice), and this could be easily switched from a variable window to a fixed window through the win argument.

Indeed, it's certainly more convenient -- at least at first -- to reuse a single argument for these two different purposes. This could even make sense for special-purpose code, e.g., if the user is happy to make the assumption that df will always have a DatetimeIndex. But I don't think these sorts of special rules make sense for pandas, which does an admirable job of providing a composable language for data analysis that works consistently, independent of the particular data types. I also don't think it's hard for a user to add some custom logic before calling .rolling() to implement this sort of functionality themselves.

> At least, I feel the current state of things is much better / more consistent than the earlier smorgasbord of rolling_* functions.

Totally agreed! I'm very happy we were able to switch to the .rolling() methods.
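
A minimal sketch of the user-side dispatch @shoyer describes above (a hypothetical helper, not a pandas API):

def my_rolling_mean(df, win):
    # Decide the window type explicitly instead of relying on pandas
    # to infer it from the argument's type.
    if isinstance(win, str):             # treat strings as time offsets
        return df.rolling(win).mean()    # variable, time-based window
    return df.rolling(int(win)).mean()   # fixed, count-based window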

@chris-b1 (Contributor) commented:

This is smaller, but is on= the right keyword for column selection? xref #13500 - I'd want to use the same keyword there.

Although I suggested key= in that issue, I'm now thinking that by= would be better, for consistency with groupby and the sorting api?

@jreback (Contributor, Author) commented Jul 16, 2016

We already use on= for pd.merge.

@jorisvandenbossche (Member) commented:

We are a bit stuck, it seems ... Below I tried to list the possibilities for going forward based on the discussions above (note: all keyword/method names used are of course open for discussion :-); I used 'size' and 'delta' for now to make the distinction).

  1. The current PR: DataFrame.rolling(window, ...)
    • Use the existing window argument for both types of window
    • Advantage: simple usage (df.rolling('2s'))
    • Disadvantage: overloading of one keyword (window)
  2. Add separate keyword for new functionality: DataFrame.rolling(window, window_delta=.., ...)
    • Advantage: no overloading of the same keyword with different behaviours
    • Disadvantage: not being able to use the shorthand df.rolling('2s') for the new functionality, as you always have to use a named keyword: df.rolling(window_delta='2s'), so worse ergonomics
    • Alternative 1: introduce two new keywords: DataFrame.rolling(window_size, window_delta, ...) to have a more descriptive keyword for the existing functionality as well.
      • Note that there is not necessarily a need to deprecate the window keyword. E.g. DataFrame.rolling(window_size=None, window_delta=None, ..., **kwargs) -> window_size can handle the current use as a positional arg, and the keyword use window= can be checked for in the kwargs.
    • Alternative 2 (from @wesm): introduce two new keywords for explicitness, but keep the current window for convenience: DataFrame.rolling(window, window_size=None, window_delta=None, ...), and then possibly assign the first positional window argument to one of these based on its type (int, string, DateOffset, etc.) -- see the sketch after this list.
  3. Add separate method for new functionality: DataFrame.rolling_delta(window, ...)
    • Advantage: no overloading of the same keyword/function with different behaviours
    • Disadvantage: yet another method(s) added to the DataFrame namespace
    • Options: also add a DataFrame.rolling_fixed() (but keep existing rolling as an alias for this)
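
A sketch of what Alternative 2's argument resolution might look like (all names illustrative, not a pandas API):

import pandas as pd

def resolve_window(window=None, window_size=None, window_delta=None):
    # Coerce the convenient positional argument to one of the two
    # explicit keywords based on its type.
    if window is not None:
        if window_size is not None or window_delta is not None:
            raise TypeError("pass either window or an explicit keyword, not both")
        if isinstance(window, (str, pd.DateOffset)):
            window_delta = window     # e.g. '2s' -> time-based window
        else:
            window_size = window      # e.g. 5 -> count-based window
    return window_size, window_delta

resolve_window('2s')  # -> (None, '2s')
resolve_window(5)     # -> (5, None)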

@jreback (Contributor, Author) commented Jul 20, 2016

I still don't buy that this overloading is bad in any way.

I'd like @shoyer to convince me why having the API in this PR is not a good idea.

> And to be clear, by "awkward" I mean that I think the current implementation will lead to bugs and confusion in user code. But that's the price we pay for convenience...

This is just bunk. It is more awkward to have confusing arguments in the first place, which is what options 2/3 from @jorisvandenbossche suggest.

Python is NOT C++, nor is it Java. We should be pythonic and follow current conventions. Pandas DOES quite a lot of inference. The POINT is to have convenience.

Trying to change something just for the sake of some theoretical purity is not in the spirit of python (or pandas at all). In fact, I would argue that making this NOT similar to simple, unambiguous usage is a MAJOR break in and of itself.

You have to have a VERY convincing argument to actually change things.

@chrisaycock (Contributor) commented:

I agree with @jreback. Overloading the parameters makes things pretty easy, since the user only needs to know about one input. Let the machine figure out what to do with it.

@shoyer (Member) commented Jul 20, 2016

> Python is NOT C++, nor is it Java. We should be pythonic and follow current conventions. Pandas DOES quite a lot of inference. The POINT is to have convenience.

There is certainly precedent for this sort of behavior in pandas, but it's not the good kind -- it's confusing features like .ix or the overloading of __getitem__. Things that require entries on the "gotchas" page.

Can you find a single example of this sort of thing in the Python standard library? In my view it is very non-Pythonic, in addition to being poor software design more generally. "In the face of ambiguity, refuse the temptation to guess."

@chris-b1 (Contributor) commented Jul 20, 2016

What feels different about this one (vs. .ix - which I agree is too magical) is that there is no guessing or fallback - there are well defined behaviors or error messages on each path. (edit: to be more explicit - df.ix[5:10, :] does different things depending on what's in df. df.rolling(..) will always do the same thing or raise).

I do see the point about adding an interval window in the future...but even then - say this was the API - is the second row really ambiguous?

df.rolling('2s').sum()
df.rolling(5).sum()
df.rolling(numeric_interval=5.0).sum()

@wesm (Member) commented Jul 20, 2016

Here's my advice for how to move forward:

  • Proceed for now with the overloaded argument. This is much lower risk for conflation than .ix had with labels vs integers
  • Observe user behavior and see how it goes for people.
  • We always have the option to add explicit keyword arguments
  • If the string argument is causing problems, we can decide to deprecate and encourage using explicit arguments. This is obviously a last resort.

While not exactly equivalent, I see this as similar to 'ndarray.astype' -- effectively what we are doing is coercing the argument to a Window spec. We might consider adding a "WindowSpec" type of object that can be passed.

The variable windows will only be used by a pretty narrow subset of users -- I agree we should continue to be careful about argument overloading.
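
A hypothetical sketch of what such a "WindowSpec" coercion could look like (not an actual pandas API, just the astype-style idea floated above):

class FixedWindow:
    # A count-based window of `size` observations.
    def __init__(self, size):
        self.size = size

class VariableWindow:
    # A time-based window covering the trailing `offset` span.
    def __init__(self, offset):
        self.offset = offset

def to_window_spec(arg):
    # Coerce the raw argument to an explicit spec up front, so the window
    # type is stated once rather than inferred at each call site.
    return VariableWindow(arg) if isinstance(arg, str) else FixedWindow(int(arg))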

@shoyer (Member) commented Jul 20, 2016

@chris-b1 I agree, this is not as bad as .ix. I still don't like functionality specific to the dtype of the function arguments.

@wesm OK, I can live with that. I'm pretty sure that we'll eventually need to add explicit keyword arguments, because there's no other sane way to handle fixed-label-width rolling windows for non-time labels.

WindowSpec is indeed one way to make this explicit, although it feels rather contrived to me -- it's a way to make an ambiguous API explicit without breaking it, not how you would design an explicit API from scratch.

@jreback (Contributor, Author) commented Jul 20, 2016

@shoyer I think you missed my original point. pandas does lots of 'magical' things. Changing something to non-magical will be a huge change. Yes you can attempt to forestall increasing the magic, but decreasing magic itself is a rather large API change.

You certainly could, but it should then be vetted like any other API change.

merging

@shoyer (Member) commented Jul 20, 2016

> Changing something to non-magical will be a huge change. Yes you can attempt to forestall increasing the magic, but decreasing magic itself is a rather large API change.

This is exactly why I argued for not adding the magic here in the first place....

@jreback jreback closed this in 786edc7 Jul 20, 2016
@jreback (Contributor, Author) commented Jul 20, 2016

@shoyer

> You certainly could, but it should then be vetted like any other API change.

pandas already has magic around these types of things; changing it is bad (unless you basically change everything).

@wesm (Member) commented Jul 22, 2016

hey folks, just a meta-comment on this discussion. While these are all plain text exchanges and tone is devilishly hard to convey, I encourage all to be particularly mindful of how this discussion might read to outside observers or would-be contributors to the project.

I realize that the disagreement about the API dragged on for about 2 weeks, and I think we all would agree that 2 weeks is too long, and some frustration about this is understandable (always be closing, right?). Can we come up with a working plan for the future when resolving these sorts of contentious issues? I should have intervened sooner to resolve the impasse (bad timing: had a particularly busy couple of weeks, and also was on vacation and not looking at e-mail / github very much). The plan might be just to bug me to help arbitrate.

As far as how the specific points of view were argued, as frustrating as it may be, there are some things we should definitely avoid. I hate to nail a particular comment to the wall, but: #13513 (comment) — statements like "this is just bunk" are not helpful. It is OK to place the burden of proof on the person raising objections, but I think a more constructive way would be "I still disagree with you about this. Can you give some concrete examples illustrating why you feel this way?" We won't always agree, but we should always strive to validate others' concerns and keep the conversation both constructive and productive (which I think we can do without being needling or dismissive).

I agree that pandas has teetered on the spectrum between preciseness / unambiguity and convenience, and we (e.g. yours truly) have made some lapses in judgment that have been costly to remedy (like .ix and the integer indexing problem). I try to avoid "Edge Case Driven Development", unless the edge cases can be demonstrated to be particularly harmful to the users (for example: anything that would cause a user to unwittingly obtain erroneous output with some reasonable probability).

Thanks

@magicmathmandarin commented:

I find the time-based window implementation to be confusing.
For example,

df = pd.DataFrame({'x': [0, 1, 2, np.nan, 4]},
                    index=pd.date_range('20200101',
                    periods=5, freq='d'))
df.rolling(2).sum()

and

df.rolling('2d').sum()

return totally different results. df.rolling('2d').sum() seems to assume min_periods=1.
A user has to go through trial and error to figure that out, and cannot be sure whether the result would be as expected for a different case.
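
The differing defaults here are documented pandas behavior: with an integer window, min_periods defaults to the window size, while with an offset window it defaults to 1. Making min_periods explicit reconciles the two forms on this regular daily index:

df.rolling(2).sum()                 # min_periods defaults to 2
df.rolling('2d').sum()              # min_periods defaults to 1
df.rolling(2, min_periods=1).sum()  # now matches df.rolling('2d').sum() here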

@albertbuchard commented Sep 18, 2020

As @magicmathmandarin pointed out, min_periods does not work as expected; just try:

df.rolling("2s", min_periods=2).sum()

                       B
2013-01-01 09:00:00  NaN
2013-01-01 09:00:02  NaN
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  7.0

But that seems to be due to the variability of the sampling rate:

df = pd.DataFrame({'B': range(5)})
now = pd.Timestamp.now()
df.index = [now,
            now + pd.Timedelta(seconds=2),
            now + pd.Timedelta(seconds=3),
            now + pd.Timedelta(seconds=5),
            now + pd.Timedelta(seconds=6)]
df.rolling("2s", min_periods=2).sum()

                              B
2020-09-18 10:45:08.811555  NaN
2020-09-18 10:45:10.811555  NaN
2020-09-18 10:45:11.811555  3.0
2020-09-18 10:45:13.811555  NaN
2020-09-18 10:45:14.811555  7.0

df_rs = df.resample("1s").mean().ffill()
df_rs.rolling("2s", min_periods=2).sum()

                       B
2020-09-18 10:45:08  NaN
2020-09-18 10:45:09  0.0
2020-09-18 10:45:10  1.0
2020-09-18 10:45:11  3.0
2020-09-18 10:45:12  4.0
2020-09-18 10:45:13  5.0
2020-09-18 10:45:14  7.0                           

I guess you could recover the correct output by doing something like:

rolled = df_rs.rolling("2s", min_periods=2).sum()
rolled.iloc[[df_rs.index.get_loc(ts, method='ffill') for ts in df.index]]

                       B
2020-09-18 10:45:08  NaN
2020-09-18 10:45:10  1.0
2020-09-18 10:45:11  3.0
2020-09-18 10:45:13  5.0
2020-09-18 10:45:14  7.0


Successfully merging this pull request may close these issues: Add time-length windowing capability to moving statistics.