[RELEASE] cudf v22.02 #10101

GPUtester · 2022-01-21T15:41:50Z

❄️ Code freeze for `branch-22.02` and v22.02 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-22.02 until release (merging of this PR).

What is the purpose of this PR?

- Update documentation
- Allow testing for the new release
- Enable a means to merge `branch-22.02` into main for the release

Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

This PR improves the C++ developer guide. My primary goal was to fix some invalid links. The diff is a bit large because of some minor changes in the interest of establishing consistent style and improving the reading/editing experience (e.g. replacing a few instances of tabs with spaces, trimming trailing whitespace, wrapping sections that were not wrapped like the rest of the file, and correcting typos that I came across while reading). To save time, I recommend that reviewers use the option in GitHub's review tab that will ignore whitespace changes. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Karthikeyan (https://github.com/karthikeyann) URL: #9675

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Signed-off-by: Peixin Li <pxli@nyu.edu> cudfjni version update. NOTE: this includes a change to use gpuci/cuda images, since the official cuda images are not yet available on Docker Hub. Authors: - Peixin (https://github.com/pxLi) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #9681

…xed_point` (#9658) This PR adds Java bindings for `is_fixed_point` Authors: - Raza Jafri (https://github.com/razajafri) Approvers: - Nghia Truong (https://github.com/ttnghia) - Robert (Bobby) Evans (https://github.com/revans2) - David Wendt (https://github.com/davidwendt) - Mike Wilson (https://github.com/hyperbolic2346) URL: #9658

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Fixes: #9642 This PR fixes an issue where null values were treated as `False` when a `boolean` dtype was passed to the `Series` constructor. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #9691

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Fixes: #7102 Replaces: [#9488](https://github.com/rapidsai/cudf/pull/9488/files) Authors: - Sheilah Kirui (https://github.com/skirui-source) - Mayank Anand (https://github.com/mayankanand007) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Michael Wang (https://github.com/isVoid) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9571

Add additional checks for int8 and int16. Fixes NVIDIA/spark-rapids#4127. Authors: - Raza Jafri (https://github.com/razajafri) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Nghia Truong (https://github.com/ttnghia) URL: #9707

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Closes #9615

Adds the following API to the Parquet writer:
- Set maximum row group size, in bytes (minimum of 512KB);
- Set maximum row group size, in rows (minimum of 5000).

The API is more limited than its ORC equivalent because of limitations in Parquet page size control/estimation.

Other changes:
- Fix naming in some ORC APIs to be consistent.
- Change `rowgroup` to `row_group` in APIs, since Parquet specs refer to this as "row group", not "rowgroup".
- Replace some `uint32_t` use in Parquet writer.
- Remove unused `target_page_size`.

Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Ashwin Srinath (https://github.com/shwina) URL: #9677
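To make the interaction of the two row group limits concrete, here is a minimal pure-Python sketch. It is not the libcudf implementation: `RowGroupLimits` and `split_into_row_groups` are hypothetical names, and the sketch assumes a fixed number of bytes per row, whereas a real writer has to estimate sizes per column. Only the two minimums (512KB and 5000 rows) come from the description above.

```python
from dataclasses import dataclass

MIN_ROW_GROUP_BYTES = 512 * 1024  # 512KB minimum, from the PR description
MIN_ROW_GROUP_ROWS = 5000         # 5000-row minimum, from the PR description


@dataclass(frozen=True)
class RowGroupLimits:
    """Hypothetical container for the two limits (not a cudf class)."""

    max_bytes: int
    max_rows: int

    def __post_init__(self):
        # Reject limits below the documented minimums.
        if self.max_bytes < MIN_ROW_GROUP_BYTES:
            raise ValueError("row group size must be at least 512KB")
        if self.max_rows < MIN_ROW_GROUP_ROWS:
            raise ValueError("row group size must be at least 5000 rows")


def split_into_row_groups(num_rows, bytes_per_row, limits):
    """Return the row count of each row group, honoring both limits.

    Assumes every row occupies exactly `bytes_per_row` bytes, which is a
    simplification made only for this illustration.
    """
    rows_by_bytes = max(1, limits.max_bytes // bytes_per_row)
    rows_per_group = min(limits.max_rows, rows_by_bytes)
    groups = []
    remaining = num_rows
    while remaining > 0:
        take = min(rows_per_group, remaining)
        groups.append(take)
        remaining -= take
    return groups
```

Whichever limit is reached first decides where a row group ends, so narrow rows tend to be bounded by the row limit and wide rows by the byte limit.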

This PR is a pretty thorough rewrite of the internals of merging. There is a ton of complexity imposed by matching all the different edge cases allowed by the pandas API, but I've tried to unify the logic for different code paths as much as possible. I've also added checks for a number of edge cases that were not previously being handled. I see about a 10% performance improvement for merges on small to medium data sizes from this PR (as expected, there's no change for large data where most time is spent in C++). There's also a substantial reduction in total code that should make it easier to address issues going forward. I'm still not entirely happy with the complexity of the result and I think that further simplification should be possible, but I think this is a sufficiently large step forward to be worth pushing forward in this state, especially if it helps enable other changes to joining. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #9516

This PR is a basic implementation of the [interchange dataframe protocol](https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py) for cudf.

As is well known, there are many dataframe libraries out there, where one library's weakness is handled by another. To work across these libraries, we rely on `pandas` with methods like `from_pandas` and `to_pandas`. This is a bad design, as it forces libraries to maintain an additional dependency on pandas and its peculiarities. This protocol provides a high-level API that must be implemented by dataframe libraries to allow communication between them. Thus, we get rid of the tight coupling with pandas and depend only on the protocol API, where each library has the freedom of its own implementation details.

To illustrate:
- `df_obj = cudf_dataframe.__dataframe__()`: `df_obj` can be consumed by any library implementing the protocol.
- `df = cudf.from_dataframe(any_supported_dataframe)`: here we create a cudf dataframe from any dataframe object supporting the protocol.

So far, it supports the following:
- Column dtypes: `uint8`, `int`, `float`, `bool` and `categorical`.
- Missing values are handled for all these dtypes.
- `string` support is on the way.

Additionally, we support dataframes from a CPU device, like `pandas`. But this is not testable here, as pandas has not yet adopted the protocol. We've tested it locally with a monkey-patched pandas implementation of the protocol.

Authors: - Ismaël Koné (https://github.com/iskode) - Bradley Dice (https://github.com/bdice) Approvers: - Ashwin Srinath (https://github.com/shwina) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9071
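The producer/consumer split described above can be sketched with a toy example in plain Python. Everything here is hypothetical and heavily simplified: `ToyInterchangeFrame`, `ProducerFrame`, `ConsumerFrame` and this `from_dataframe` are illustration-only names, and columns are plain lists rather than the buffer-backed column objects that the real protocol defines. The only parts taken from the protocol are the `__dataframe__` entry point and the idea that the consumer talks to the returned object instead of to pandas.

```python
class ToyInterchangeFrame:
    """Hypothetical, simplified stand-in for the object returned by __dataframe__."""

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}

    def column_names(self):
        return list(self._columns)

    def get_column_by_name(self, name):
        # The real protocol returns a column object exposing buffers;
        # a plain list (with None for missing values) is used here instead.
        return self._columns[name]


class ProducerFrame:
    """Toy dataframe from 'library A' that exposes the protocol entry point."""

    def __init__(self, data):
        self._data = data

    def __dataframe__(self):
        return ToyInterchangeFrame(self._data)


class ConsumerFrame:
    """Toy dataframe from 'library B', built only through the protocol."""

    def __init__(self, data):
        self.data = data


def from_dataframe(obj):
    """Build a ConsumerFrame from any object that implements __dataframe__."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the interchange protocol")
    xchg = obj.__dataframe__()
    return ConsumerFrame(
        {name: xchg.get_column_by_name(name) for name in xchg.column_names()}
    )
```

Note that `from_dataframe` never imports or inspects `ProducerFrame`; it depends only on the methods of the exchange object, which is the decoupling the PR description is after.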

Depends on #9040 and (unfortunately) #9041 Authors: - Christopher Harris (https://github.com/cwharris) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Vukasin Milovanovic (https://github.com/vuule) URL: #9089

Follow-up to #9571 where we add `ceil` and `floor` support for `Series`. Here we add `ceil` and `floor` support to `DatetimeIndex` class. This PR is dependent on #9571 getting merged first since it assumes the `libcudf` implementation for `floor` exists. Authors: - Mayank Anand (https://github.com/mayankanand007) Approvers: - Michael Wang (https://github.com/isVoid) - Ashwin Srinath (https://github.com/shwina) URL: #9554
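For readers unfamiliar with the semantics, the following standard-library sketch shows what flooring and ceiling a timestamp to a fixed frequency means. It is not the cudf or libcudf implementation; `floor_dt` and `ceil_dt` are hypothetical helpers, and the sketch assumes buckets are aligned to the Unix epoch.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment point for the buckets


def floor_dt(ts, freq):
    """Round `ts` down to the nearest multiple of `freq` since the epoch."""
    return ts - (ts - EPOCH) % freq


def ceil_dt(ts, freq):
    """Round `ts` up to the nearest multiple of `freq` since the epoch."""
    floored = floor_dt(ts, freq)
    # A timestamp already on a boundary is left unchanged.
    return floored if floored == ts else floored + freq
```

For example, with an hourly frequency, `datetime(2022, 1, 21, 15, 41, 50)` floors to 15:00:00 and ceils to 16:00:00 on the same day.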

This PR continues to address #8974, adding support for structs in `min` and `max` reduction. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Mark Harris (https://github.com/harrism) - https://github.com/nvdbaranec URL: #9697
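As a rough illustration of what a `min`/`max` reduction over struct rows can mean, the sketch below compares rows field by field (lexicographically) and skips null rows. Both of those choices are assumptions made for the example, not a statement of libcudf's exact ordering or null handling, and `Point`, `struct_min` and `struct_max` are hypothetical names.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Hypothetical two-field struct row used only for this example."""

    x: int
    y: str


def struct_min(rows):
    """Smallest non-null row under field-by-field comparison, or None."""
    valid = [row for row in rows if row is not None]
    return min(valid) if valid else None


def struct_max(rows):
    """Largest non-null row under field-by-field comparison, or None."""
    valid = [row for row in rows if row is not None]
    return max(valid) if valid else None
```

With rows `Point(1, "b")`, `None`, `Point(1, "a")` and `Point(0, "z")`, the first field decides the minimum, and the second field breaks the tie for the maximum.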

Regular spell check fixes in comments and docs. Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Charles Blackmon-Luca (https://github.com/charlesbluca) - Vukasin Milovanovic (https://github.com/vuule) URL: #9682

…#9715) Closes #9620 Fixes an edge case described in https://docs.python.org/3/library/re.html#re.MULTILINE where the '$' EOL regex pattern character (without `MULTILINE` set) should match at the very end of a string and also just before the end of the string if the end of that string contains a new-line. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Christopher Harris (https://github.com/cwharris) - Vukasin Milovanovic (https://github.com/vuule) - Sheilah Kirui (https://github.com/skirui-source) URL: #9715
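The reference behavior from the linked Python documentation can be checked directly with the standard `re` module. This only demonstrates the Python semantics that the fix targets; it does not exercise cudf. The helper name `eol_positions` is hypothetical.

```python
import re


def eol_positions(text, flags=0):
    """Return every index at which the bare '$' pattern matches in `text`."""
    return [match.start() for match in re.finditer(r"$", text, flags)]


# Without MULTILINE, '$' matches at the very end of the string and also
# just before a trailing newline.
trailing_newline = eol_positions("abc\n")

# A newline in the middle of the string is not a match point by default,
# but it becomes one when MULTILINE is set.
default_mode = eol_positions("ab\ncd\n")
multiline_mode = eol_positions("ab\ncd\n", re.MULTILINE)

# The same rule applied to a non-empty pattern.
matches_before_final_newline = re.search(r"abc$", "abc\n") is not None
matches_before_inner_newline = re.search(r"abc$", "abc\nxyz") is not None
```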

This PR is adding clang-tidy to cudf and adding the initial checks. Note more checks will be enabled in the future.

Relevant PRs:
* `rmm`: rapidsai/rmm#857
* `cuml`: rapidsai/cuml#1945

To do list:
* [x] Add `.clang-tidy` file
* [x] Add python script
* [x] Apply `modernize-` changes
* [x] Revert `cxxopts` changes
* [x] Fixed Python parquet failures
* [x] Ignore `cxxopts` file
* [x] Ignore the `build/_deps` directories

Splitting out the following into a separate PR so we can get the changes merged for 22.02 (#10064):
* ~~[ ] Disable `clang-diagnostic-errors/warnings`~~
* ~~[ ] Fix include files being skipped~~
* ~~[ ] Set up CI script~~
* ~~[ ] Clean up python script~~

Authors: - Conor Hoekstra (https://github.com/codereport) Approvers: - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) - David Wendt (https://github.com/davidwendt) - Mark Harris (https://github.com/harrism) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9860

The `libcudacxx.patch` was required to fix issues with libcudacxx 1.6 and incorrect detection of the arm nvcc 11.4 compiler. As we move to libcudacxx 1.7 this patch is not needed, and should be removed. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Mark Harris (https://github.com/harrism) URL: #10057

…storage (#9589)

**Important Note**: ~Marking this as WIP until the `fsspec.parquet` module is available in a filesystem_spec release~ (fsspec.parquet module is available)

This PR modifies `cudf.read_parquet` and `dask_cudf.read_parquet` to leverage the new `fsspec.parquet.open_parquet_file` function for optimized data transfer/caching from remote storage. The ~long-term~ goal is to remove the temporary data-transfer optimizations that we currently use in `cudf.read_parquet`.

**Performance Motivation**:

```python
In [1]: import cudf, dask_cudf
   ...: path = [
   ...:     "gs://my-bucket/criteo-parquet/day_0.parquet",
   ...:     "gs://my-bucket/criteo-parquet/day_1.parquet",
   ...: ]

# cudf BEFORE
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 11.1 s, sys: 11.5 s, total: 22.6 s
Wall time: 24.4 s

# cudf AFTER
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 3.48 s, sys: 722 ms, total: 4.2 s
Wall time: 6.32 s

# (Threaded) Dask-cudf BEFORE
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 27.1 s, sys: 15.5 s, total: 42.6 s
Wall time: 57.6 s

# (Threaded) Dask-cudf AFTER
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 3.43 s, sys: 851 ms, total: 4.28 s
Wall time: 13.1 s
```

Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - https://github.com/brandon-b-miller - Benjamin Zaitlen (https://github.com/quasiben) URL: #9589

As we will remove Python 3.7, we need to update the Python version in the upload scripts Authors: - Jordan Jacobelli (https://github.com/Ethyling) Approvers: - Sevag Hanssian (https://github.com/sevagh) - AJ Schmidt (https://github.com/ajschmidt8) URL: #10092

Depends on #10041. The erstwhile ORC writer API exposed only a binary choice to choose the level of statistics: ENABLED/DISABLED. This commit allows the ORC writer to further choose whether statistics are collected at the ROW_GROUP or STRIPE level. This commit also includes the relevant changes to `java/` and `python/`. Authors: - MithunR (https://github.com/mythrocks) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Jason Lowe (https://github.com/jlowe) - GALI PREM SAGAR (https://github.com/galipremsagar) - Christopher Harris (https://github.com/cwharris) - Vukasin Milovanovic (https://github.com/vuule) URL: #10058

codecov · 2022-01-21T15:42:04Z

Codecov Report

Merging #10101 (cfcb3ac) into main (41a20f6) will decrease coverage by 0.14%.
The diff coverage is n/a.

❗ Current head cfcb3ac differs from pull request most recent head a7d88cd. Consider uploading reports for the commit a7d88cd to get more accurate results

@@            Coverage Diff             @@
##             main   #10101      +/-   ##
==========================================
- Coverage   10.56%   10.42%   -0.15%     
==========================================
  Files         116      119       +3     
  Lines       18677    20606    +1929     
==========================================
+ Hits         1974     2148     +174     
- Misses      16703    18458    +1755

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/dask_cudf/dask_cudf/backends.py | `83.13% <0.00%> (-2.58%)` | ⬇️ |
| python/dask_cudf/dask_cudf/sorting.py | `92.66% <0.00%> (-0.72%)` | ⬇️ |
| python/custreamz/custreamz/kafka.py | `29.16% <0.00%> (-0.63%)` | ⬇️ |
| python/cudf/cudf/io/csv.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/hdf.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/orc.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/_typing.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/io/avro.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/__init__.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/_version.py | `0.00% <0.00%> (ø)` | |
| ... and 90 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 06540b9...a7d88cd. Read the comment docs.

I know that this is past the freeze date. This is a fix for a P1 bug that we just found when trying to build Scalar values of Lists and Structs that contain Decimal128 values. We might be able to work around it some other way, but it would take a lot of changes to the existing Spark plugin code to do that so I wanted to try this first. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - Kuhu Shukla (https://github.com/kuhushukla) - Niranjan Artal (https://github.com/nartal1)

…n `_drop_na_rows` (#10123) Currently, when `drop_nan == False`, the variable `data_columns` is never created but is still referenced further down. This PR fixes that. Authors: - Michael Wang (https://github.com/isVoid) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice)
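The class of bug described here, a variable bound in only one branch and then used unconditionally, can be reproduced in a few lines of plain Python. This is a hypothetical reconstruction of the pattern, not the actual cudf code: the function names, the list-of-lists column layout, and the NaN-to-null conversion are all assumptions made for illustration.

```python
import math


def _drop_rows_with_nulls(columns):
    """Drop every row that contains None in any column (toy helper)."""
    kept = [row for row in zip(*columns) if all(v is not None for v in row)]
    return [[row[i] for row in kept] for i in range(len(columns))]


def _nans_to_nulls(column):
    """Replace float NaN values with None so they count as nulls."""
    return [
        None if isinstance(v, float) and math.isnan(v) else v for v in column
    ]


def drop_na_rows_buggy(columns, drop_nan):
    if drop_nan:
        data_columns = [_nans_to_nulls(col) for col in columns]
    # BUG: `data_columns` is unbound here when drop_nan is False.
    return _drop_rows_with_nulls(data_columns)


def drop_na_rows_fixed(columns, drop_nan):
    if drop_nan:
        data_columns = [_nans_to_nulls(col) for col in columns]
    else:
        # Bind the name on every path before it is used.
        data_columns = columns
    return _drop_rows_with_nulls(data_columns)
```

Calling the buggy version with `drop_nan=False` raises `UnboundLocalError`, while the fixed version drops only the rows that contain nulls.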

Always upload all cudf packages Authors: - Ray Douglass (https://github.com/raydouglass) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Jordan Jacobelli (https://github.com/Ethyling)

ajschmidt8 and others added 30 commits November 4, 2021 10:13

DOC v22.02 Updates

ad02545

Merge branch-21.12 into branch-22.02

a497d73

Merge pull request #9664 from robertmaynard/branch-22.02-merge-21.12

1ee86e7

Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9670 from rapidsai/branch-21.12

6a3ef7d

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9671 from rapidsai/branch-21.12

9ec8b30

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9673 from rapidsai/branch-21.12

8311512

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9678 from rapidsai/branch-21.12

4a277ca

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9680 from rapidsai/branch-21.12

31f92d7

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9692 from rapidsai/branch-21.12

753db88

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9698 from rapidsai/branch-21.12

c667518

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9699 from rapidsai/branch-21.12

b14d883

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9700 from rapidsai/branch-21.12

4363a55

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9702 from rapidsai/branch-21.12

55c9701

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9708 from rapidsai/branch-21.12

60380de

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge pull request #9714 from rapidsai/branch-21.12

3b38aa7

[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]

Merge branch-21.12 into branch-22.02

967a333

codereport and others added 5 commits January 20, 2022 16:05

GPUtester requested review from a team as code owners January 21, 2022 15:41

GPUtester requested review from hyperbolic2346, jrhemstad, galipremsagar and skirui-source January 21, 2022 15:41

github-actions bot added the CMake, conda, Java, Python, and libcudf labels Jan 21, 2022

galipremsagar approved these changes Jan 21, 2022

revans2 and others added 4 commits January 21, 2022 14:24

pin dask release version (#10108)

893f540

Always upload cudf packages (#10147)

cfcb3ac

karthikeyann approved these changes Jan 27, 2022

update changelog

a7d88cd

raydouglass merged commit f39f559 into main Feb 2, 2022
