-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Rolling window C++ API #732
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My review just suggests block comment improvements. Jake's review critiques the API.
My main concern is choosing to write NaN in the case where Pandas writes NA. These are not the same in my view. We should probably use the valid bit mask for this and just invalidate the element rather than storing NaN.
@harrism & @jrhemstad can you give it another pass to make sure your comments are addressed? The plan is to finalize the API and merge, then I'll work on a separate PR for the implementation. |
@kkraus14 can someone from the Python devs review this, hopefully someone who would be working on the bindings, just to make sure there are no surprises there? |
cpp/include/cudf/functions.h
Outdated
/* --------------------------------------------------------------------------* | ||
* @brief Computes the rolling window function of the values in a column. | ||
* | ||
* This matches Pandas' API for DataFrame.rolling with a couple notable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good idea to have a self-contained description of what a function does rather than relying on someone to know what the Pandas DataFrame.rolling
is.
It's fine to keep the documentation of what the deviations from the Pandas API are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember a while back seeing a big long list of all the stuff we wanted to support with rolling functions and this seems to be missing quite a bit of it. Even if we do it column by column we need to be able to do things like partitioned windows based not just on window size.
* | ||
* --------------------------------------------------------------------------*/ | ||
gdf_error gdf_rolling_window(gdf_column *output_col, | ||
const gdf_column *input_col, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually rolling operations are done in groups as opposed to a column at a time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in pandas it might be different but windows are often combined in many data operations across multiple values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if windows are executed across multiple values, these are independent operations and gdf_rolling_window
can be called multiple times. I do see some value in providing an API that can handle multiple columns, which can benefit kernel launch timings, but otherwise there is no reuse of data between the columns. To avoid complicating the initial API, I would postpone this feature to the next version of the API, or until we get a realistic case that suffers from kernel launch latencies for rolling windows. Sounds good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this sounds fine
* --------------------------------------------------------------------------*/ | ||
gdf_error gdf_rolling_window(gdf_column *output_col, | ||
const gdf_column *input_col, | ||
gdf_size_type window, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about partitioned windows? windows that are based on another columns value for example as opposed to based on fixed size? this is common when users dont have fixed number of events
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 4f61fee to address this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
@harrism and @felipeblazing I updated the API, I believe it's ready to merge, let me know if you have further comments |
rerun tests |
If @felipeblazing is happy with this API now, I think you can proceed with implementation. I think the next step is to create a branch from this branch to do the implementation in, so you can start a new PR with the full PR, and close this one out. |
@felipeblazing please answer whether you are now happy with the API after |
I am thrilled and excited at the changes and do so humbly wish to approve it as well. |
@nsakharnykh I'm closing this PR. You can branch from this branch and open a new PR with the implementation for issue #926, or if you prefer you can reopen this PR and rename it and add implementation. |
Posting to review the proposed API first, then I'll work on a GPU implementation (will be a separate PR).
This PR introduces a C++ API for rolling windows. It mirrors Pandas'
DataFrame.rolling
API with some minor changes - seegdf_rolling_window
for more detail.Also see #714 - pure Python implementation of rolling windows using numba kernels