[REVIEW] Adding to_parquet and write_partition definitions to dask_cudf #3369

rjzamora · 2019-11-13T18:43:33Z

This PR adds explicit definitions for to_parquet and write_partition in dask_cudf. Previously, dask_cudf was falling back to upstream dask for this functionality. However, there is pandas-specific code in write_partition that is causing problems for cudf-based dask dataframes. Since we will soon need to define a CudfEngine implementation of write_partition when a GPU-accelerated cudf.io.to_parquet is supported, it probably makes sense to add an initial implementation to address the #3365 bug.
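The difference between falling back to an inherited `write_partition` and defining one explicitly can be sketched in plain Python. This is only an illustration of the pattern, not the code in this PR: `HostTable`, `DeviceTable`, `UpstreamFrame` and `GpuFrame` are hypothetical stand-ins for pandas partitions, cudf partitions, the upstream dask collection and the dask_cudf collection, and the "pandas-specific" step is simulated with an `isinstance` check.

```python
class HostTable:
    """Hypothetical stand-in for a pandas-like partition."""

    def __init__(self, rows):
        self.rows = rows


class DeviceTable:
    """Hypothetical stand-in for a cudf-like partition."""

    def __init__(self, rows):
        self.rows = rows


class UpstreamFrame:
    """Hypothetical stand-in for the upstream collection."""

    def __init__(self, partitions):
        self.partitions = partitions

    def write_partition(self, part):
        # Simulates backend-specific code: only host tables are accepted.
        if not isinstance(part, HostTable):
            raise TypeError("expected a HostTable partition")
        return len(part.rows)

    def to_parquet(self):
        # Returns the number of rows "written" for each partition.
        return [self.write_partition(p) for p in self.partitions]


class GpuFrame(UpstreamFrame):
    """Defines its own write_partition instead of falling back."""

    def write_partition(self, part):
        # Explicit definition that makes no host-only assumption.
        return len(part.rows)
```

With device partitions, the inherited method raises while the explicit definition succeeds, which is the kind of failure an explicit definition avoids.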

python/dask_cudf/dask_cudf/core.py

mrocklin · 2019-11-13T19:19:46Z

python/dask_cudf/dask_cudf/io/tests/test_parquet.py

+    gddf.to_parquet(tmpdir)
+
+    # NOTE: Need `.compute()` to resolve correct index
+    #       name after `from_dask_dataframe`


I'm not sure I understand this. Is there something wrong with our metadata?

I was also a bit confused by this. It seems that gddf = ddf.map_partitions(cudf.from_pandas) results in a dask_cudf dataframe whose _meta does not have the same index name as the original dask dataframe. Is it reasonable for me to raise a separate issue?

If you can find a quick resolution to that I think that that would be ideal. If it ends up being more complicated then sure, let's wait.

> If you can find a quick resolution to that I think that that would be ideal.

Okay - it looks like the index name is dropped in iloc calls for cudf dataframes (but not for pandas). I'll see if the fix is simple.

@mrocklin My idea for a simple fix doesn't seem to work. Let's address the index-name issue separately so I can focus on higher-priority items in the short term.

OK, sounds good. Happy to merge on passed tests
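The pandas half of the observation in this thread (that `iloc` keeps the index name) can be checked with a minimal sketch. It assumes pandas is installed; the column name `a` and index name `idx` are made up for illustration. Whether dask derives `_meta` through exactly this kind of slice is not stated in the thread, so treat the empty slice here as an assumed stand-in for metadata.

```python
import pandas as pd

# Small frame with a named index (names chosen only for illustration).
df = pd.DataFrame(
    {"a": [1, 2, 3]},
    index=pd.Index([10, 20, 30], name="idx"),
)

# An empty positional slice: a zero-row frame with the same schema.
meta_like = df.iloc[:0]

# In pandas the index name survives the iloc call.
index_name_preserved = meta_like.index.name == df.index.name
print(index_name_preserved)
```

Running the same comparison against a cudf dataframe would be one way to reproduce the difference reported above; that variant is not shown because it needs a GPU.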

codecov · 2019-11-13T20:17:01Z

Codecov Report

Merging #3369 into branch-0.11 will increase coverage by 0.02%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##           branch-0.11    #3369      +/-   ##
===============================================
+ Coverage        87.15%   87.18%   +0.02%     
===============================================
  Files               49       49              
  Lines             9213     9220       +7     
===============================================
+ Hits              8030     8038       +8     
+ Misses            1183     1182       -1

Impacted Files	Coverage Δ
python/dask_cudf/dask_cudf/core.py	`69.5% <100%> (+0.63%)`	⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py	`100% <100%> (ø)`	⬆️
python/cudf/cudf/core/index.py	`89.62% <0%> (-0.05%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 70084f4...d096061.

rjzamora added 2 commits November 13, 2019 10:31

adding to_parquet and write_partition definitions in dask_cudf

322b210

changelog

5a38c47

rjzamora requested a review from a team as a code owner November 13, 2019 18:43

rjzamora added the dask Dask issue label Nov 13, 2019

rjzamora changed the title ~~Adding to_parquet and write_partition definitions to dask_cudf~~ [REVIEW] Adding to_parquet and write_partition definitions to dask_cudf Nov 13, 2019

rjzamora added 3 - Ready for Review Ready for review by team 4 - Needs Dask Reviewer labels Nov 13, 2019

kkraus14 approved these changes Nov 13, 2019

mrocklin reviewed Nov 13, 2019

python/dask_cudf/dask_cudf/core.py (outdated, resolved)

mrocklin reviewed Nov 13, 2019

mrocklin approved these changes Nov 13, 2019

Update python/dask_cudf/dask_cudf/core.py

d096061

Co-Authored-By: Matthew Rocklin <mrocklin@gmail.com>

rjzamora mentioned this pull request Nov 13, 2019

[FEA] dask-cudf doesn't support "corr"/correlation function like Pandas and cuDF #3363

Closed

rjzamora merged commit 043d894 into rapidsai:branch-0.11 Nov 13, 2019

rjzamora deleted the to_parquet branch November 21, 2019 18:34

beckernick mentioned this pull request Nov 22, 2019

[BUG] dask dataframe to_parquet fails with missing StringIndex attribute #2637

Closed

vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs Dask Reviewer labels Feb 23, 2024
