-
Notifications
You must be signed in to change notification settings - Fork 150
Add groupby aggregate tests, along with a few additions for cudf integration. #268
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #268 +/- ##
==========================================
- Coverage 94.71% 94.69% -0.03%
==========================================
Files 13 13
Lines 1609 1620 +11
==========================================
+ Hits 1524 1534 +10
- Misses 85 86 +1
Continue to review full report at Codecov.
|
|
|
||
|
|
||
| @pytest.fixture(params=['core', 'dask']) | ||
| @pytest.fixture(params=["core", "dask"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we standardizing on double quotes? From what I've seen, Python code has always used single quotes for strings unless there was a reason to do otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to go this way we should commit to black and be done with it IMO.
streamz/dataframe/aggregations.py
Outdated
| if hasattr(og, 'index'): | ||
| assert (o.index == og.index).all() | ||
| # if hasattr(og, 'index'): | ||
| # assert (o.index == og.index).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not check in commented out code. Just delete the code and if we need to restore it will be in the file history in Git.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
| old = [] | ||
| while dfs[0].index.min() < mn: | ||
| while pd.Timestamp(dfs[0].index.min()) < mn: | ||
| o = dfs[0].loc[:mn] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious as to how casting to a pandas timestamp helps here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, cudf's df.index.min() (or max()) returns a numpy.datetime64 as opposed to Pandas dataframes returning a pandas._libs.tslibs.timestamps.Timestamp. Hence the explicit cast to make it compatible with Pandas Timedelta, which is required for these operations.
The statements modified for this purpose would be redundant for Pandas, since the types are compatible with Pandas Timedelta.
streamz/dataframe/aggregations.py
Outdated
| mx = max(df.index.max() for df in dfs) | ||
| mn = mx - window | ||
| mx = pd.Timestamp(max(df.index.max() for df in dfs)) | ||
| mn = pd.Timestamp(mx) - window |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't mx already a pd.Timestamp type, why pass it into another pd.Timestamp(...) type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
|
@martindurant Could you please review my code and merge it soon, if possible? I'd be happy to make any changes you think would be necessary. |
|
Hey @CJ-Wright, could you please have a look at this? I'd appreciate it if this could be merged soon! :) |
|
I'll try to look at it soon (I need to read in more of the dataframe things) |
|
That would be great, thanks @CJ-Wright! |
|
@CJ-Wright Did you have a chance to look at this yet? This is a major blocker for a bigger project! :( |
|
Seems reasonable to me. I wish we had codecov on this repo. |
CJ-Wright
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next time please don't make style changes (single vs double quote) to code that you aren't substantially changing, it makes review a bit more difficult, since I need to parse which lines are logic changes and which are style changes.
Sure, will definitely keep this in mind moving forward. @CJ-Wright Thanks a lot for reviewing and merging this, really appreciate it! Is it possible to upload the updated conda package with these changes? |
|
We'd need to cut a release. Can you open up an issue for this? |
|
Sure, created #271. Please let me know if I need to do anything else. Thanks! |
Most changes made to aggregations.py attempt to fix #266. Somehow, as of now, cudf does not do the index naming implicitly like Pandas.
Other changes in the file attempt to create a temporary fallback on Pandas Timedelta API to perform window-over-time-groupby aggregations.