Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Review] Updating dask_cudf.read_parquet() to leverage new Dask API #2368

Merged
merged 9 commits into from Aug 4, 2019

Conversation

@rjzamora
Copy link
Contributor

commented Jul 22, 2019

After Dask#4995, we can now expand the functionality of dask_cudf.read_parquet to accept additional arguments (as well as parse metadata and utilize row-group statistics).

This PR basically re-implements dask_cudf.read_parquet as a thin wrapper for dask.dataframe.read_parquet. Currently. the following input arguments are supported/tested:

    Parameters
    ----------
    path : string or list
        Source directory for data, or path(s) to individual parquet files.
    columns : string, list or None (default)
        Field name(s) to read in as columns in the output. By default all
        non-index fields will be read (as determined by the pandas parquet
        metadata, if present).
    filters : list
        List of filters to apply, like ``[('x', '>', 0), ...]``. This
         implements row-group (partition) -level filtering only, i.e., to
        prevent the loading of some chunks of the data, and only if relevant
        statistics have been included in the metadata.
    index : string, list, False or None (default)
        Field name(s) to use as the output frame index. By default will be
        inferred from the pandas parquet file metadata (if present). Use False
        to read all fields as columns.
    gather_statistics : bool or None (default).
        Gather the statistics for each dataset partition. By default,
        this will only be done if the _metadata file is available. Otherwise,
        statistics will only be gathered if True, because the footer of
        every file will be parsed (which is very slow on some systems).

cc @mrocklin

@rjzamora rjzamora requested a review from rapidsai/cudf-dask-codeowners as a code owner Jul 22, 2019

@rjzamora rjzamora self-assigned this Jul 22, 2019

@mrocklin

This comment has been minimized.

Copy link
Member

commented Jul 22, 2019

cc @randerzander this should provide full cudf+dask Parquet support, including reading statistics, identifying sorted columns, filters, and so on.

@kkraus14 kkraus14 added this to PR-WIP in v0.9 Release via automation Jul 23, 2019

@mrocklin mrocklin added this to In Progress in Dask Jul 26, 2019

@rjzamora

This comment has been minimized.

Copy link
Contributor Author

commented Jul 31, 2019

Thanks for merging with branch-0.9 @mrocklin (I was just about to commit when I saw CI change status).

Let me know if you have any thoughts/comments about the ArrowEngine check here. I am making the assumption that we will need to support pre-parquet-refactor dask versions here.

@mrocklin

This comment has been minimized.

Copy link
Member

commented Jul 31, 2019

I am making the assumption that we will need to support pre-parquet-refactor dask versions here.

I think that we can be pretty aggressive here. CuDF generally has to depend on latest Dask anyway because of the tight dask.dataframe integration.

@codecov

This comment has been minimized.

Copy link

commented Jul 31, 2019

Codecov Report

Merging #2368 into branch-0.9 will increase coverage by 0.05%.
The diff coverage is 93.58%.

Impacted file tree graph

@@              Coverage Diff               @@
##           branch-0.9    #2368      +/-   ##
==============================================
+ Coverage       83.11%   83.16%   +0.05%     
==============================================
  Files              56       57       +1     
  Lines            8203     8269      +66     
==============================================
+ Hits             6818     6877      +59     
- Misses           1385     1392       +7
Impacted Files Coverage Δ
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100% <100%> (ø) ⬆️
python/dask_cudf/dask_cudf/io/__init__.py 71.42% <50%> (-28.58%) ⬇️
python/dask_cudf/dask_cudf/__init__.py 80% <60%> (-10.91%) ⬇️
python/dask_cudf/dask_cudf/io/parquet.py 95.23% <94.44%> (-4.77%) ⬇️
python/cudf/cudf/io/avro.py 80% <0%> (ø)
python/cudf/cudf/dataframe/numerical.py 94.87% <0%> (+0.03%) ⬆️
python/cudf/cudf/utils/ioutils.py 93.22% <0%> (+0.23%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cd9fff9...fd09572. Read the comment docs.

rjzamora added some commits Jul 31, 2019

@rjzamora

This comment has been minimized.

Copy link
Contributor Author

commented Jul 31, 2019

Okay - Removed support for older dask versions. This should help with the codcov failure.

@rjzamora rjzamora changed the title Updating dask_cudf.read_parquet() to leverage new Dask API [Review] Updating dask_cudf.read_parquet() to leverage new Dask API Aug 1, 2019

v0.9 Release automation moved this from PR-WIP to PR-Reviewer approved Aug 4, 2019

@kkraus14 kkraus14 merged commit 511b75c into rapidsai:branch-0.9 Aug 4, 2019

11 checks passed

codecov/patch 93.58% of diff hit (target 83.11%)
Details
codecov/project 83.16% (+0.05%) compared to cd9fff9
Details
gpuCI/cudf/check-changelog Build #3617 succeeded in 9.6 sec
Details
gpuCI/cudf/check-style Build #3615 succeeded in 17 sec
Details
gpuCI/cudf/conda-build/ubuntu16.04-cuda10.0-py3.6 Build #17021 succeeded in 39 min
Details
gpuCI/cudf/conda-build/ubuntu16.04-cuda10.0-py3.7 Build #17024 succeeded in 39 min
Details
gpuCI/cudf/conda-build/ubuntu16.04-cuda9.2-py3.6 Build #17022 succeeded in 38 min
Details
gpuCI/cudf/conda-build/ubuntu16.04-cuda9.2-py3.7 Build #17023 succeeded in 37 min
Details
gpuCI/cudf/gpu-test/centos7-cuda10.0 Build #7942 succeeded in 51 min
Details
gpuCI/cudf/gpu-test/ubuntu16.04-cuda9.2 Build #7943 succeeded in 49 min
Details
gpuCI/cudf/pr-builder Build succeeded
Details

Dask automation moved this from In Progress to Done Aug 4, 2019

v0.9 Release automation moved this from PR-Reviewer approved to Done Aug 4, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
3 participants
You can’t perform that action at this time.