@chinmaychandak
Contributor

@chinmaychandak chinmaychandak commented Aug 19, 2019

Most of the changes to aggregations.py are an attempt to fix #266. As of now, cudf does not name the index implicitly the way Pandas does.

The other changes in the file add a temporary fallback to the Pandas Timedelta API so that window-over-time groupby aggregations can be performed.
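
For reference, a minimal Pandas-only sketch of the behaviour #266 is about: Pandas names the result index after the grouping key. The `ensure_index_name` helper is hypothetical and is not the code in this PR. cudf is not used here; a backend that drops the name is simulated with `rename_axis(None)`.

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "a"], "y": [1, 2, 3]})

# Pandas names the result index after the grouping key implicitly.
sums = df.groupby("x")["y"].sum()
assert sums.index.name == "x"


def ensure_index_name(result, name):
    """Hypothetical helper: restore the index name if a backend dropped it."""
    if result.index.name is None:
        return result.rename_axis(name)
    return result


# Simulate a backend that returns the aggregate with an unnamed index.
unnamed = sums.rename_axis(None)
restored = ensure_index_name(unnamed, "x")
```

With a named index the helper returns its input unchanged, so it would be a no-op on plain Pandas results.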

codecov-io commented Aug 19, 2019

Codecov Report

Merging #268 into master will decrease coverage by 0.02%.
The diff coverage is 93.75%.

@@            Coverage Diff             @@
##           master     #268      +/-   ##
==========================================
- Coverage   94.71%   94.69%   -0.03%     
==========================================
  Files          13       13              
  Lines        1609     1620      +11     
==========================================
+ Hits         1524     1534      +10     
- Misses         85       86       +1
Impacted Files Coverage Δ
streamz/dataframe/aggregations.py 98.84% <93.75%> (-0.27%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a00b6e3...e691394. Read the comment docs.



- @pytest.fixture(params=['core', 'dask'])
+ @pytest.fixture(params=["core", "dask"])
Contributor

Are we standardizing on double quotes? From what I've seen, Python code has always used single quotes for strings unless there was a reason to do otherwise.

Contributor Author

Fixed

Member

If we are going to go this way we should commit to black and be done with it IMO.

- if hasattr(og, 'index'):
-     assert (o.index == og.index).all()
+ # if hasattr(og, 'index'):
+ #     assert (o.index == og.index).all()
Contributor

I would rather not check in commented-out code. Just delete the code; if we need to restore it, it will be in the file's Git history.

Contributor Author

Fixed

  old = []
- while dfs[0].index.min() < mn:
+ while pd.Timestamp(dfs[0].index.min()) < mn:
      o = dfs[0].loc[:mn]
Contributor

I'm curious as to how casting to a pandas timestamp helps here.

Contributor Author

@chinmaychandak chinmaychandak Aug 20, 2019

For now, cudf's df.index.min() (or max()) returns a numpy.datetime64, whereas a Pandas dataframe returns a pandas._libs.tslibs.timestamps.Timestamp. Hence the explicit cast, which makes the value compatible with Pandas Timedelta, as these operations require.

For Pandas alone, the statements modified for this purpose would be redundant, since its types are already compatible with Pandas Timedelta.
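
A minimal sketch of the cast described above, using only numpy and Pandas. The `np.datetime64` value is a stand-in for what cudf's `index.max()` returned at the time; the timestamp and the 5-second window are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for cudf's df.index.max(), which returned numpy.datetime64.
raw_max = np.datetime64("2019-08-20T12:00:00")

# The cast yields a pandas Timestamp whether the input is a
# numpy.datetime64 or already a Timestamp.
mx = pd.Timestamp(raw_max)
window = pd.Timedelta("5s")
mn = mx - window  # Timestamp minus Timedelta is a Timestamp

assert isinstance(mx, pd.Timestamp)
assert isinstance(mn, pd.Timestamp)
# Casting an existing Timestamp a second time changes nothing.
assert pd.Timestamp(mx) == mx
```

On a Pandas index the cast is harmless, so the same code path can serve both backends.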

- mx = max(df.index.max() for df in dfs)
- mn = mx - window
+ mx = pd.Timestamp(max(df.index.max() for df in dfs))
+ mn = pd.Timestamp(mx) - window
Contributor

Isn't mx already a pd.Timestamp? Why pass it into another pd.Timestamp(...) call?

Contributor Author

Fixed

@chinmaychandak
Contributor Author

@martindurant Could you please review my code and merge it soon, if possible? I'd be happy to make any changes you think would be necessary.

@chinmaychandak
Contributor Author

Hey @CJ-Wright, could you please have a look at this? I'd appreciate it if this could be merged soon! :)

@CJ-Wright
Member

CJ-Wright commented Aug 31, 2019

I'll try to look at it soon (I need to read more of the dataframe code first).

@chinmaychandak
Contributor Author

That would be great, thanks @CJ-Wright!

@chinmaychandak
Contributor Author

chinmaychandak commented Sep 4, 2019

@CJ-Wright Did you have a chance to look at this yet? This is a major blocker for a bigger project! :(

@CJ-Wright
Member

Seems reasonable to me. I wish we had codecov on this repo.

Member

@CJ-Wright CJ-Wright left a comment

Next time, please don't make style changes (single vs. double quotes) to code that you aren't substantially changing. It makes review a bit more difficult, since I need to parse which lines are logic changes and which are style changes.

@CJ-Wright CJ-Wright merged commit d60a6e4 into python-streamz:master Sep 5, 2019
@chinmaychandak
Contributor Author

> Next time, please don't make style changes (single vs. double quotes) to code that you aren't substantially changing. It makes review a bit more difficult, since I need to parse which lines are logic changes and which are style changes.

Sure, will definitely keep this in mind moving forward.

@CJ-Wright Thanks a lot for reviewing and merging this, really appreciate it! Is it possible to upload the updated conda package with these changes?

@CJ-Wright
Member

We'd need to cut a release. Can you open up an issue for this?

@chinmaychandak
Contributor Author

chinmaychandak commented Sep 5, 2019

Sure, created #271. Please let me know if I need to do anything else. Thanks!

Successfully merging this pull request may close these issues.

Index name lost during groupby aggregates on Streamz Dataframes.
