Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Rolling window C++ API #732

Closed
wants to merge 10 commits into from

Conversation

nsakharnykh
Copy link
Collaborator

@nsakharnykh nsakharnykh commented Jan 18, 2019

Posting to review the proposed API first, then I'll work on a GPU implementation (will be a separate PR).

This PR introduces a C++ API for rolling windows. It mirrors Pandas' DataFrame.rolling API with some minor changes - see gdf_rolling_window for more detail.

Also see #714 - pure Python implementation of rolling windows using numba kernels

cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My review just suggests block comment improvements. Jake's review critiques the API.

My main concern is choosing to write NaN in the case where Pandas writes NA. These are not the same in my view. We should probably use the valid bit mask for this and just invalidate the element rather than storing NaN.

cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
@mike-wendt mike-wendt added the 2 - In Progress Currently a work in progress label Jan 24, 2019
@nsakharnykh nsakharnykh changed the title [WIP] Rolling window C++ API and implementation [WIP] Rolling window C++ API Jan 24, 2019
@nsakharnykh
Copy link
Collaborator Author

@harrism & @jrhemstad can you give it another pass to make sure your comments are addressed? The plan is to finalize the API and merge, then I'll work on a separate PR for the implementation.

@nsakharnykh
Copy link
Collaborator Author

@kkraus14 can someone from the Python devs review this, hopefully someone who would be working on the bindings, just to make sure there are no surprises there?

@nsakharnykh nsakharnykh changed the title [WIP] Rolling window C++ API [REVIEW] Rolling window C++ API Jan 24, 2019
/* --------------------------------------------------------------------------*
* @brief Computes the rolling window function of the values in a column.
*
* This matches Pandas' API for DataFrame.rolling with a couple notable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to have a self-contained description of what a function does rather than relying on someone to know what the Pandas DataFrame.rolling is.

It's fine to keep the documentation of what the deviations from the Pandas API are.

Copy link

@felipeblazing felipeblazing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember a while back seeing a big long list of all the stuff we wanted to support with rolling functions and this seems to be missing quite a bit of it. Even if we do it column by column we need to be able to do things like partitioned windows based not just on window size.

*
* --------------------------------------------------------------------------*/
gdf_error gdf_rolling_window(gdf_column *output_col,
const gdf_column *input_col,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually rolling operations are done in groups as opposed to a column at a time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess in pandas it might be different but windows are often combined in many data operations across multiple values

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if windows are executed across multiple values, these are independent operations and gdf_rolling_window can be called multiple times. I do see some value in providing an API that can handle multiple columns, which can benefit kernel launch timings, but otherwise there is no reuse of data between the columns. To avoid complicating the initial API, I would postpone this feature to the next version of the API, or until we get a realistic case that suffers from kernel launch latencies for rolling windows. Sounds good?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes this sounds fine

* --------------------------------------------------------------------------*/
gdf_error gdf_rolling_window(gdf_column *output_col,
const gdf_column *input_col,
gdf_size_type window,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about partitioned windows? windows that are based on another columns value for example as opposed to based on fixed size? this is common when users dont have fixed number of events

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 4f61fee to address this

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

cpp/include/cudf/functions.h Outdated Show resolved Hide resolved
@nsakharnykh
Copy link
Collaborator Author

@harrism and @felipeblazing I updated the API, I believe it's ready to merge, let me know if you have further comments

@nsakharnykh
Copy link
Collaborator Author

rerun tests

@nsakharnykh nsakharnykh added 5 - Ready to Merge Testing and reviews complete, ready to merge 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 4, 2019
@harrism
Copy link
Member

harrism commented Feb 5, 2019

If @felipeblazing is happy with this API now, I think you can proceed with implementation. I think the next step is to create a branch from this branch to do the implementation in, so you can start a new PR with the full PR, and close this one out.

@harrism
Copy link
Member

harrism commented Feb 12, 2019

@felipeblazing please answer whether you are now happy with the API after
@nsakharnykh's latest changes.

@harrism harrism removed the 5 - Ready to Merge Testing and reviews complete, ready to merge label Feb 12, 2019
@felipeblazing
Copy link

I am thrilled and excited at the changes and do so humbly wish to approve it as well.

@harrism
Copy link
Member

harrism commented Feb 12, 2019

@nsakharnykh I'm closing this PR. You can branch from this branch and open a new PR with the implementation for issue #926, or if you prefer you can reopen this PR and rename it and add implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants