Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Adding to_parquet and write_partition definitions to dask_cudf #3369

Merged
merged 3 commits into from
Nov 13, 2019

Conversation

rjzamora
Copy link
Member

Closes #3365

This PR adds explicit definitions for to_parquet and write_partition in dask_cudf. Previously, dask_cudf was falling back to upstream dask for this functionality. However, there is pandas-specific code in write_partition that is causing problems for cudf-based dask dataframes. Since we will soon need to define a CudfEngine implementation of write_partition when a GPU-accelerated cudf.io.to_parquet is supported, it probably makes sense to add an initial implementation to address the #3365 bug.

@rjzamora rjzamora requested a review from a team as a code owner November 13, 2019 18:43
@rjzamora rjzamora added the dask Dask issue label Nov 13, 2019
@rjzamora rjzamora changed the title Adding to_parquet and write_partition definitions to dask_cudf [REVIEW] Adding to_parquet and write_partition definitions to dask_cudf Nov 13, 2019
gddf.to_parquet(tmpdir)

# NOTE: Need `.compute()` to resolve correct index
# name after `from_dask_dataframe`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this. Is there something wrong with our metadata?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also a bit confused by this. It seems that gddf = ddf.map_partitions(cudf.from_pandas) will result in a dask_cudf dataframe with the _meta not having the same index name ans the original dask dataframe. Is it reasonable for me to raise a separate issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can find a quick resolution to that I think that that would be ideal. If it ends up being more complicated then sure, let's wait.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can find a quick resolution to that I think that that would be ideal.

Okay - Looks like the index name is dropped in iloc calls for cudf dataframes (and not for pandas). I'll see if the fix is simple

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrocklin My idea for a simple fix doesn't seem to work - Lets address the index-name issue separately so I can focus on higher priority items in the short term.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sounds good. Happy to merge on passed tests

Co-Authored-By: Matthew Rocklin <mrocklin@gmail.com>
@codecov
Copy link

codecov bot commented Nov 13, 2019

Codecov Report

Merging #3369 into branch-0.11 will increase coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.11    #3369      +/-   ##
===============================================
+ Coverage        87.15%   87.18%   +0.02%     
===============================================
  Files               49       49              
  Lines             9213     9220       +7     
===============================================
+ Hits              8030     8038       +8     
+ Misses            1183     1182       -1
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/core.py 69.5% <100%> (+0.63%) ⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100% <100%> (ø) ⬆️
python/cudf/cudf/core/index.py 89.62% <0%> (-0.05%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70084f4...d096061. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond dask Dask issue
Projects
None yet
4 participants