Skip to content

Conversation

@skmatti
Copy link
Contributor

@skmatti skmatti commented Feb 7, 2019

This adds some functionalities for streaming cudf DataFrames with Streamz DataFrame. Some of the available methods are listed in here: mrocklin/streamz/#222

Most of Streamz DataFrame methods are still available only with pandas and invoking these methods with cudf may result in ambiguous error messages.

@mrocklin I changed definitions of is_series_like and is_index_like a bit to make them work for both cudf and pandas DataFrames. Not sure if I am doing this correct.

Also, there are still a lot of references to pandas and cudf modules in Streamz DataFrame implementation. Most of these would go away after adding support for pandas like methods in cudf. I would be happy to get suggestions on removing/decreasing count of these references as we aspire to make it more general.

Copy link
Collaborator

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. A few questions and comments below.

It would also be useful to start setting up a few tests in a separate streamz/dataframe/tests/test_cudf.py file (which may eventually have to move elsewhere, depending on where we place this code).

@skmatti
Copy link
Contributor Author

skmatti commented Feb 7, 2019

Thanks for the review. I will be adding tests soon.

@skmatti
Copy link
Contributor Author

skmatti commented Feb 8, 2019

@mrocklin I hand picked a few tests from test_dataframes and modified them to work for cudf. Most of these tests worked by replacing pd.DataFrame with cudf.DataFrame. But I had to change a few things for some tests. I haven't changed anything that I didn't need to.

@mrocklin
Copy link
Collaborator

mrocklin commented Feb 9, 2019

In principle these changes look really good to me. The cudf changes are well isolated and generally smaller than I expected. Nice work!

We'll probably want the cudf import to be either:

  1. Lazily done, as with the register_lazy trick used by the Dask Dispatcher objects. See the following links:

    This would mean replacing the curent _stream_types dict with a Dispatch function

    This is important so that cudf isn't imported whenever a Pandas user wants to use streamz. The cudf import time is a few seconds at this point, and may have other side effects.

  2. In a separate custreamz library. This is cleaner from an organizational point of view, but can sometimes provide a confusing user experience. Users sometimes don't understand why they need to import custreamz when all they ever use is the streamz library.

@jsmaupin
Copy link
Contributor

jsmaupin commented Feb 11, 2019

Just a question, would this be cleaner with an oop/inheritance model where, instead of saying if is_dataframe_like, the code would be run in a class that had the code to handle the items in the if block?

Also, are we not able to run the cudf tests in every build? Could this be achieved by installing cudf?

@skmatti
Copy link
Contributor Author

skmatti commented Feb 12, 2019

@jsmaupin I merged those if blocks into a single method.
Travis does not support GPU based builds yet. So I thought I would skip those tests on travis CI.

@skmatti
Copy link
Contributor Author

skmatti commented Feb 13, 2019

@mrocklin I spent some time looking into lazy import implementation and move cudf import to backends file. Here are my thoughts about handling this issue.

  1. I don't think we need a Dispatch function for current implementation, as we are not invoking any cudf functions explicitly using cudf import in core.py. We just need a few dataframe like class types to identify the corresponding stream types. This can be achieved by making use of is_dataframe_like, is_series_like and is_index_like functions. But I have to add streamz dataframe specific code in collection.py which is undesirable.

  2. In future if we have to access cudf.concat or any other cudf functions explicitly in core.py, then we need a dispatch function. Also, adding dispatch function makes it more difficult to not have dataframe specific code in collection.py.

Satish Kumar Matti added 2 commits March 21, 2019 10:15
@codecov-io
Copy link

codecov-io commented Mar 21, 2019

Codecov Report

Merging #224 into master will increase coverage by 0.25%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #224      +/-   ##
==========================================
+ Coverage   93.61%   93.86%   +0.25%     
==========================================
  Files          13       13              
  Lines        1550     1566      +16     
==========================================
+ Hits         1451     1470      +19     
+ Misses         99       96       -3
Impacted Files Coverage Δ
streamz/collection.py 94.44% <100%> (+1.4%) ⬆️
streamz/dataframe/core.py 92.49% <100%> (+0.32%) ⬆️
streamz/dataframe/utils.py 100% <100%> (ø)
streamz/dataframe/aggregations.py 99.1% <100%> (ø) ⬆️
streamz/utils.py

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35b8e9c...d22149d. Read the comment docs.

@skmatti
Copy link
Contributor Author

skmatti commented Mar 22, 2019

@mrocklin and @martindurant I made the suggested changes. Let me know if I should be make any changes to merge this.

@mrocklin
Copy link
Collaborator

@martindurant any concerns here? I've glanced over things briefly, but you may want to take a stronger look.

@martindurant
Copy link
Member

Thanks for reviewing, @mrocklin , I'll look it over in the near future.

@martindurant
Copy link
Member

Note that tests did not run for the latest commits, not sure why.

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly some style questions here.

It would be nice to see examples and documentation specifically about the cudf story - but I understand if you want this merged early to get some hands-on experience first. In that case, users creating cudf streams should probably see a warning that this is an experimental feature.

@skmatti
Copy link
Contributor Author

skmatti commented Apr 22, 2019

@martindurant Restructuring the repo is the reason for new commits not triggering a build: travis-ci/travis-ci#3264?

@martindurant
Copy link
Member

I did a sync on travis and a manually triggered build did succeed - can you try pushing something to see if it builds for you?

@martindurant
Copy link
Member

I think this is good to go, but would like a green mark for propriety's sake. If there's something else I need to do in Travis settings or elsewhere, please let me know.

Satish Kumar and others added 3 commits April 25, 2019 13:50
@skmatti
Copy link
Contributor Author

skmatti commented Apr 25, 2019

Should this have triggered code coverage report as well?

@martindurant
Copy link
Member

The running of the test did post coverage results to codecov (https://codecov.io/gh/python-streamz/streamz/commit/d22149db96959a54c38754ecae0e9100e580d6f1 ), but perhaps it too is trying to post it's graphical, I'm not going to worry about that. Merging this now - thank you for the hard work here.

@martindurant martindurant merged commit 6445fa9 into python-streamz:master Apr 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants