-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable backend dispatching for Dask-DataFrame creation #11920
Conversation
Codecov ReportBase: 87.40% // Head: 88.13% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #11920 +/- ##
================================================
+ Coverage 87.40% 88.13% +0.72%
================================================
Files 133 133
Lines 21833 21987 +154
================================================
+ Hits 19084 19379 +295
+ Misses 2749 2608 -141
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good Rick, few comments before this is ready to merge.
def from_dict(data, npartitions, orient="columns", **kwargs): | ||
if orient != "columns": | ||
raise ValueError(f"orient={orient} is not supported") | ||
return dd.from_pandas( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is from_pandas
required here because we don't support cudf.DataFrame.from_dict
API yet? If so, can we add a todo here to change this after #11934 is resolved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is from_pandas required here because we don't support cudf.DataFrame.from_dict API yet?
Yes and no - We should certainly use cudf.from_dict
when it is supported (I'll add a TODO). However, I'll change the dd.from_pandas
code to dask_cudf.from_cudf
for clarity (from_cudf
is just a cudf-friendly alias for from_pandas
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rjzamora !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the entrypoint need to be added to setup.py/cfg
or similar?
Correct - The entrypoint is defined in |
Ah, somehow I missed that this had already been done back in March |
@gpucibot merge |
Description
This PR depends on dask/dask#9475 (Now Merged)
After dask#9475, external libraries are now able to implement (and expose) their own
DataFrameBackendEntrypoint
definitions to specify custom creation functions for DataFrame collections. This PR introduces theCudfBackendEntrypoint
class to createdask_cudf.DataFrame
collections using thedask.dataframe
API. By installingdask_cudf
with this entrypoint definition in place, you get the following behavior indask.dataframe
:Note that the code snippet above does not require an explicit import of
cudf
ordask_cudf
. The following creation functions will support backend dispatching after dask#9475:from_dict
read_paquet
read_json
read_orc
read_csv
read_hdf
See also: dask/design-docs#1
Checklist