Add cudf support for Streamz DataFrame #224
mrocklin
left a comment
Nice work. A few questions and comments below.
It would also be useful to start setting up a few tests in a separate streamz/dataframe/tests/test_cudf.py file (which may eventually have to move elsewhere, depending on where we place this code).
|
Thanks for the review. I will be adding tests soon. |
|
@mrocklin I hand-picked a few tests from test_dataframes and modified them to work with cudf. Most of them worked after simply replacing pd.DataFrame with cudf.DataFrame, but a few needed further changes. I haven't changed anything that I didn't need to. |
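As an illustration of the "swap pd.DataFrame for cudf.DataFrame" approach described above, here is a minimal sketch of a test that runs the same assertion against every DataFrame library that can be imported. It is not taken from the PR's test_cudf.py; `load_backends` and `TestDataFrameBackends` are hypothetical names, and the sketch uses the standard-library `unittest` rather than whatever test runner the project uses.

```python
import importlib
import unittest


def load_backends(names=("pandas", "cudf")):
    """Import each candidate library, silently skipping any that fail.

    Hypothetical helper. A broad `except` is used on the assumption that a
    GPU library may fail at import time for reasons other than ImportError
    (for example, when no usable GPU is present).
    """
    backends = {}
    for name in names:
        try:
            backends[name] = importlib.import_module(name)
        except Exception:
            continue
    return backends


class TestDataFrameBackends(unittest.TestCase):
    def test_column_sum(self):
        backends = load_backends()
        if not backends:
            self.skipTest("neither pandas nor cudf is installed")
        for name, module in backends.items():
            # Same test body for every backend; only the constructor differs.
            with self.subTest(backend=name):
                df = module.DataFrame({"x": [1, 2, 3]})
                self.assertEqual(int(df["x"].sum()), 6)
```

Running the case with `unittest.main()` exercises whichever backends are present, so the same file can run on machines with and without a GPU.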
|
In principle these changes look really good to me. The cudf changes are well isolated and generally smaller than I expected. Nice work! We'll probably want the
|
|
Just a question, would this be cleaner with an oop/inheritance model where, instead of saying Also, are we not able to run the cudf tests in every build? Could this be achieved by installing cudf? |
|
@jsmaupin I merged those |
|
@mrocklin I spent some time looking into a lazy import implementation and moving the cudf import to the backends file. Here are my thoughts about handling this issue.
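For readers unfamiliar with the term, a lazy optional import defers `import cudf` until the library is actually needed, so that pandas-only users never pay for (or fail on) the GPU dependency. The sketch below is for illustration only and is not the code proposed in this PR; `optional_import` and `dataframe_backends` are hypothetical names.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def optional_import(name):
    """Return the named module, or None if it cannot be imported.

    Hypothetical helper, not part of streamz. The import happens on the
    first call rather than at module load time, and the result is cached.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def dataframe_backends(candidates=("pandas", "cudf")):
    """Collect whichever candidate libraries are importable right now."""
    found = {}
    for name in candidates:
        module = optional_import(name)
        if module is not None:
            found[name] = module
    return found
```

Code that needs cudf can then call `optional_import("cudf")` at the point of use and fall back to pandas when it returns `None`.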
|
Pull latest changes from upstream
Codecov Report
@@            Coverage Diff             @@
##           master     #224      +/-   ##
==========================================
+ Coverage   93.61%   93.86%   +0.25%
==========================================
  Files          13       13
  Lines        1550     1566     +16
==========================================
+ Hits         1451     1470     +19
+ Misses         99       96      -3
|
|
@mrocklin and @martindurant I made the suggested changes. Let me know if I should make any further changes before this is merged. |
|
@martindurant any concerns here? I've glanced over things briefly, but you may want to take a stronger look. |
|
Thanks for reviewing, @mrocklin , I'll look it over in the near future. |
|
Note that tests did not run for the latest commits, not sure why. |
martindurant
left a comment
Mostly some style questions here.
It would be nice to see examples and documentation specifically about the cudf story - but I understand if you want this merged early to get some hands-on experience first. In that case, users creating cudf streams should probably see a warning that this is an experimental feature.
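A warning of the kind suggested here could be sketched with the standard `warnings` module. This is an assumption about how it might look, not what the PR implements: `ExperimentalWarning` and `warn_if_experimental` are hypothetical names, and the check relies only on the package name recorded in the object's class.

```python
import warnings


class ExperimentalWarning(UserWarning):
    """Hypothetical category for features whose API may still change."""


def warn_if_experimental(example):
    """Emit a warning when `example` is an object defined in the cudf package.

    Returns True if a warning was issued, False otherwise.
    """
    module = type(example).__module__ or ""
    if module.split(".")[0] == "cudf":
        warnings.warn(
            "cudf support in streaming DataFrames is experimental",
            ExperimentalWarning,
            stacklevel=2,  # point at the caller, not at this helper
        )
        return True
    return False
```

Using a dedicated warning category lets users silence it selectively with `warnings.filterwarnings("ignore", category=ExperimentalWarning)` once they have read it.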
|
@martindurant Could restructuring the repo be the reason that new commits are not triggering a build? See travis-ci/travis-ci#3264. |
|
I did a sync on travis and a manually triggered build did succeed - can you try pushing something to see if it builds for you? |
|
I think this is good to go, but would like a green mark for propriety's sake. If there's something else I need to do in Travis settings or elsewhere, please let me know. |
Pull latest commits from original repo
|
Should this have triggered a code coverage report as well? |
|
The test run did post coverage results to codecov (https://codecov.io/gh/python-streamz/streamz/commit/d22149db96959a54c38754ecae0e9100e580d6f1 ), but perhaps it too is trying to post its graphical report. I'm not going to worry about that. Merging this now - thank you for the hard work here. |
This adds some functionality for streaming cudf DataFrames with Streamz DataFrame. Some of the available methods are listed here: mrocklin/streamz/#222
Most Streamz DataFrame methods are still available only with pandas, and invoking them with cudf may result in ambiguous error messages.
@mrocklin I changed the definitions of is_series_like and is_index_like a bit to make them work for both cudf and pandas DataFrames. I'm not sure whether I am doing this correctly.
Also, there are still a lot of references to the pandas and cudf modules in the Streamz DataFrame implementation. Most of these would go away after adding support for pandas-like methods in cudf. I would be happy to get suggestions on removing or reducing the number of these references as we aspire to make it more general.
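One way to reduce direct references to pandas and cudf is to detect Series-like and Index-like objects by the attributes they expose rather than by their concrete type. The sketch below shows that duck-typing idea only; it is not the implementation in this PR, the attribute lists are assumptions, and `ToySeries` and `ToyIndex` are stand-ins so the example runs without either library installed.

```python
def is_series_like(obj):
    """True if obj looks like a Series: has groupby, name and dtype,
    and its class name does not mention "index"."""
    typ = type(obj)
    return (
        all(hasattr(typ, attr) for attr in ("groupby", "name", "dtype"))
        and "index" not in typ.__name__.lower()
    )


def is_index_like(obj):
    """True if obj looks like an Index: has name and dtype,
    and its class name mentions "index"."""
    typ = type(obj)
    return (
        all(hasattr(obj, attr) for attr in ("name", "dtype"))
        and "index" in typ.__name__.lower()
    )


class ToySeries:
    """Stand-in exposing the attributes a Series is assumed to have."""
    name = "x"
    dtype = "int64"

    def groupby(self, by):
        raise NotImplementedError


class ToyIndex:
    """Stand-in exposing the attributes an Index is assumed to have."""
    name = None
    dtype = "int64"
```

Because the checks never import pandas or cudf, the same functions accept objects from either library, at the cost of also accepting any unrelated class that happens to expose the same attribute names.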