[RELEASE] cudf v22.08 #11444

Merged
merged 267 commits into from
Aug 17, 2022

Conversation

ajschmidt8
Member

❄️ Code freeze for branch-22.08 and v22.08 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-22.08 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-22.08 into main for the release

GPUtester and others added 30 commits May 24, 2022 18:37
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
Signed-off-by: Peixin Li <pxli@nyu.edu>

update build version of cudfjni to 22.08.0-SNAPSHOT

Authors:
  - Peixin (https://github.com/pxLi)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #10910
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
This PR makes `Buffer.ptr` read-only and introduces `Buffer.from_buffer`:
```python 
@classmethod
def from_buffer(cls, buffer: Buffer, size: int = None, offset: int = 0):
    """
    Create a buffer from another buffer

    Parameters
    ----------
    buffer : Buffer
        The base buffer, which will also be set as the owner of
        the memory allocation.
    size : int, optional
        Size of the memory allocation (default: `buffer.size`).
    offset : int, optional
        Start offset relative to `buffer.ptr`.
    """
```

This is mainly motivated by my work on [spilling](#10746): it makes it a bit easier to reason about the relationship between buffers.
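
To illustrate the ownership relationship described above, here is a minimal sketch built on a toy class. It is not cuDF's `Buffer`: the name `ToyBuffer`, the integer `ptr`, and the range check are assumptions made purely for illustration.

```python
from __future__ import annotations

from typing import Optional


class ToyBuffer:
    """Hypothetical stand-in for a device buffer (not cuDF's Buffer)."""

    def __init__(self, ptr: int, size: int, owner: object = None):
        self._ptr = ptr
        self._size = size
        self._owner = owner  # keeps the underlying allocation alive

    @property
    def ptr(self) -> int:
        # Read-only: no setter is defined, so `buf.ptr = ...` raises AttributeError.
        return self._ptr

    @property
    def size(self) -> int:
        return self._size

    @property
    def owner(self) -> object:
        return self._owner

    @classmethod
    def from_buffer(
        cls, buffer: ToyBuffer, size: Optional[int] = None, offset: int = 0
    ) -> ToyBuffer:
        """Create a view into `buffer`; the base buffer becomes the owner."""
        if size is None:
            size = buffer.size  # default taken from the docstring above
        if offset < 0 or size < 0 or offset + size > buffer.size:
            raise ValueError("requested range lies outside the base buffer")
        return cls(ptr=buffer.ptr + offset, size=size, owner=buffer)


base = ToyBuffer(ptr=0x1000, size=64)
view = ToyBuffer.from_buffer(base, size=16, offset=8)
```

Because `view.owner` is `base`, the relationship between the two buffers is explicit, and because `ptr` has no setter it cannot silently drift away from the allocation it was derived from.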

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ashwin Srinath (https://github.com/shwina)

URL: #10872
…h` (#10838)

This PR uses the (presumably soon-to-be-merged) dask `Grouper` dispatch to register `cudf.core.groupby.Grouper` objects to `cudf.DataFrame` objects. This should allow our own Grouper objects to be used in critical places in dask rather than pandas objects.

IMO this solution is preferable to changing cuDF to handle pandas grouper objects directly.

Xref dask/dask#9074

Authors:
  - https://github.com/brandon-b-miller

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10838
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
Fixes parts of #9373
Adds missing documentation to fix doxygen warnings in multiple files.

Fixes 93 warnings.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10913
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
Cleans up the `regcomp.cpp` source to fix class names, comments, and simplify logic around processing operators and operands returned by the parser. Several class member variables used for state are moved or eliminated. Some member functions and variables are renamed. Cleanup of the parser logic will be in a follow-on PR.

Reference #3582
Follow on to #10843

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #10879
[gpuCI] Forward-merge branch-22.06 to branch-22.08 [skip gpuci]
Files in the groupby benchmark do not need the `.cu` extension because they don't contain any device code. This PR changes them to the `.cpp` extension.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)
  - Conor Hoekstra (https://github.com/codereport)

URL: #10985
Adds duration columns to benchmarks of the formats that support these types (everything except ORC).

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

URL: #10933
This PR changes the Python build system for cudf to use scikit-build and leverage CMake under the hood.

This PR depends on rapidsai/rapids-cmake#198. Once that PR is merged, I can update the pull of rapids-cmake into the cudf Python CMake build.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Ashwin Srinath (https://github.com/shwina)

URL: #10919
Closes #10909

This PR was intended to fix a bug in the `distinct` implementation where the stream parameter was not passed when invoking `static_map::contains`. During the work, @ttnghia pointed out that the `contains` + `thrust::copy_if` logic can be simplified by using `static_map::retrieve_all`. Finally, the PR fetches a newer version of `cuco` to utilize `retrieve_all` and fixes a bug in the unit tests, where results should be sorted before comparison.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10916
Issue #10755

Fixes an issue in the protobuf writer where the length of the row index entry was written into a single byte. This caused errors when the size was larger than 127.
The issue was uncovered when row group statistics were added. String statistics contain copies of the min/max strings, so the size is unbounded.
This PR changes the protobuf writer to write the entry size as a generic uint, allowing larger entries.
Also fixes `start_row` in the row group info array in the reader (unrelated).
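
The 127 limit follows from protobuf's base-128 varint encoding, where each byte carries 7 bits of payload plus a continuation bit. The sketch below shows that encoding in Python; it is an illustration of the wire format, not the libcudf writer code, and the function names are made up for this example.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(low7)  # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint from the start of `data`."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated varint")


print(encode_varint(127).hex())  # 7f   (fits in one byte)
print(encode_varint(128).hex())  # 8001 (needs two bytes)
```

Any length above 127 needs at least two bytes, so a writer that always emits exactly one byte produces a corrupt entry as soon as string statistics push the size past that point.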

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10989
…uble. (#10891)

This PR changes a requirement to ensure that both value inputs to a sort-groupby covariance computation are convertible to double.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - https://github.com/nvdbaranec
  - David Wendt (https://github.com/davidwendt)

URL: #10891
… files (#10912)

Fixes parts of #9373
Adds missing documentation to fix doxygen warnings in the following files:
- cpp/include/cudf/io/avro.hpp
- cpp/include/cudf/io/csv.hpp
- cpp/include/cudf/io/json.hpp
- cpp/include/cudf/io/orc.hpp
- cpp/include/cudf/io/orc_metadata.hpp
- cpp/include/cudf/io/parquet.hpp

Fixes 194 warnings.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10912
Fixes parts of #9373
Adds missing documentation in aggregation.hpp to fix doxygen warnings.
Fixes 108 warnings.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

URL: #10887
Fixes parts of #9373
Adds missing documentation to fix doxygen warnings in multiple files under cpp/include/cudf/*.hpp.
Fixes 40 warnings.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #10896
This PR adds the missing `#pragma once` to a few header files in libcudf.
It also includes a minor include cleanup, changing `""` to `<>`.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #11004
galipremsagar and others added 6 commits August 1, 2022 21:08
Arrow version pinnings were relaxed in commit d740c3c. This PR makes the same change in the dev environment.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #11418
Fixes: rapidsai/docs#284

This PR fixes the day (light) and night (dark) mode color schemes, which made text and many HTML elements hard to read.

In dark mode:
Before:
 
<img width="1489" alt="Screen Shot 2022-07-28 at 10 01 38 PM" src="https://user-images.githubusercontent.com/11664259/181674172-48e9dd8f-9fb9-447c-a63b-4a6f359b2c4f.png">

After:
<img width="1569" alt="Screen Shot 2022-07-29 at 3 31 54 PM" src="https://user-images.githubusercontent.com/11664259/181838929-f27d664a-eb4c-4a72-8ad9-cf54246b3098.png">


In light mode:
Before:

<img width="1545" alt="Screen Shot 2022-07-28 at 10 03 36 PM" src="https://user-images.githubusercontent.com/11664259/181674247-2307b7a4-0dd5-410a-9cb2-ca18d641d89d.png">

After:
<img width="1506" alt="Screen Shot 2022-07-29 at 3 31 07 PM" src="https://user-images.githubusercontent.com/11664259/181838856-fd0abb85-cc56-4392-8cef-182bb790fff4.png">



Introduced darker color schemes so that code highlighting is properly visible in dark mode:

Before:
<img width="741" alt="Screen Shot 2022-07-28 at 10 06 08 PM" src="https://user-images.githubusercontent.com/11664259/181674530-aa78290f-b011-437e-a955-4e85bbbee5e9.png">

After:
<img width="704" alt="Screen Shot 2022-07-28 at 10 06 15 PM" src="https://user-images.githubusercontent.com/11664259/181674545-3d0ba553-8b35-49b1-972a-a5fb1e33b0b9.png">


Introduced a custom JavaScript method that adds hover text to the "**Theme switcher**" button:

<img width="552" alt="Screen Shot 2022-07-28 at 9 59 40 PM" src="https://user-images.githubusercontent.com/11664259/181674649-091c4b27-aa4b-4752-a8c8-45b7e71e3417.png">
<img width="622" alt="Screen Shot 2022-07-28 at 9 59 28 PM" src="https://user-images.githubusercontent.com/11664259/181674651-88bf5388-bf81-4633-9360-88ca5df88b85.png">

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #11400
This PR resolves the following error showing up in latest `distributed`:

```text

python/dask_cudf/dask_cudf/tests/test_distributed.py EE                                                                                                                                [100%]

=========================================================================================== ERRORS ===========================================================================================
_____________________________________________________________________________ ERROR at setup of test_basic[True] _____________________________________________________________________________
file /nvme/0/pgali/cudf/python/dask_cudf/dask_cudf/tests/test_distributed.py, line 24
  @pytest.mark.parametrize("delayed", [True, False])
  def test_basic(loop, delayed):  # noqa: F811
file /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/distributed/utils_test.py, line 145
  @pytest.fixture
  def loop(loop_in_thread):
E       fixture 'loop_in_thread' not found
>       available fixtures: benchmark, benchmark_weave, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, cleanup, current_cases, doctest_namespace, loop, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, testrun_uid, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory, worker_id
>       use 'pytest --fixtures [testpath]' for help on them.

/nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/distributed/utils_test.py:145
____________________________________________________________________________ ERROR at setup of test_basic[False] _____________________________________________________________________________
file /nvme/0/pgali/cudf/python/dask_cudf/dask_cudf/tests/test_distributed.py, line 24
  @pytest.mark.parametrize("delayed", [True, False])
  def test_basic(loop, delayed):  # noqa: F811
file /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/distributed/utils_test.py, line 145
  @pytest.fixture
  def loop(loop_in_thread):
E       fixture 'loop_in_thread' not found
>       available fixtures: benchmark, benchmark_weave, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, cleanup, current_cases, doctest_namespace, loop, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, testrun_uid, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory, worker_id
>       use 'pytest --fixtures [testpath]' for help on them.

```

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #11428
…11429)

Fixes #11425

Changes the `CUDF_VERSION_Arrow` CMake variable to be a cache entry so that a user can override it by providing `-DCUDF_VERSION_Arrow` at configure time.

Happy to repeat this pattern for other dependencies if desired.

Authors:
  - Keith Kraus (https://github.com/kkraus14)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

URL: #11429
This PR uses a documented template auto-generated by `doxygen` and inserts our custom js & css links to it. The process is documented here: https://doxygen.nl/manual/customize.html#minor_tweaks_header_css

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - David Wendt (https://github.com/davidwendt)

URL: #11430
This PR pins `dask` & `distributed` to `2022.7.1` for `22.08` release.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - AJ Schmidt (https://github.com/ajschmidt8)
  - https://github.com/jakirkham

URL: #11433
@ajschmidt8 ajschmidt8 requested review from a team as code owners August 2, 2022 20:02
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Aug 2, 2022
@codecov

codecov bot commented Aug 2, 2022

Codecov Report

Merging #11444 (dccb586) into main (41a20f6) will increase coverage by 75.90%.
The diff coverage is n/a.

❗ Current head dccb586 differs from pull request most recent head 6ca81bb. Consider uploading reports for the commit 6ca81bb to get more accurate results

@@             Coverage Diff             @@
##             main   #11444       +/-   ##
===========================================
+ Coverage   10.56%   86.47%   +75.90%     
===========================================
  Files         116      144       +28     
  Lines       18677    22856     +4179     
===========================================
+ Hits         1974    19765    +17791     
+ Misses      16703     3091    -13612     

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/custreamz/custreamz/kafka.py | 29.16% <0.00%> (-0.63%) | ⬇️ |
| python/dask_cudf/dask_cudf/backends.py | 85.26% <0.00%> (-0.45%) | ⬇️ |
| python/dask_cudf/dask_cudf/sorting.py | 93.03% <0.00%> (-0.35%) | ⬇️ |
| python/cudf/cudf/comm/serialize.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/_fuzz_testing/io.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/_fuzz_testing/orc.py | 0.00% <0.00%> (ø) | |
| python/custreamz/custreamz/_version.py | 0.00% <0.00%> (ø) | |
| python/dask_cudf/dask_cudf/__init__.py | 82.35% <0.00%> (ø) | |
| python/dask_cudf/dask_cudf/_version.py | 0.00% <0.00%> (ø) | |
| python/cudf/cudf/_fuzz_testing/utils.py | 0.00% <0.00%> (ø) | |
| ... and 134 more | | |


ttnghia and others added 6 commits August 3, 2022 10:43
Generic atomic operations are currently implemented using an `atomicCAS` in a loop that determines when the current thread's result is the one that was actually saved to the address. The final check is performed by directly comparing the values, which can lead to infinite loops when the value being set is a `NaN` because all equality comparisons involving `NaN`s return false. This PR fixes that issue by casting the data to an integral type and comparing those, bypassing `NaN` comparison.

This error was discovered in the hash-based aggregates `min` and `max`, and fixing it is a blocker for NVIDIA/spark-rapids#5989.
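
The failure mode can be reproduced in a single-threaded Python simulation. This is a sketch under stated assumptions, not the libcudf device code: `Cell`, `cas_update`, and `nan_max` are hypothetical names, the loop is capped at `max_iters` so the stuck case can be observed, and the cell is assumed to already hold a `NaN` so that a `NaN` takes part in the final comparison.

```python
import math
import struct


def float_bits(x: float) -> int:
    """Reinterpret a float64 as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


class Cell:
    """Single-threaded stand-in for one memory location (hypothetical)."""

    def __init__(self, value: float):
        self.value = value

    def compare_and_swap(self, expected: float, desired: float) -> float:
        """Store `desired` if the cell bitwise-equals `expected`; return the old value."""
        old = self.value
        if float_bits(old) == float_bits(expected):
            self.value = desired
        return old


def nan_max(a: float, b: float) -> float:
    """A max that propagates NaN (one possible convention, assumed here)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return max(a, b)


def by_value(a: float, b: float) -> bool:
    return a == b  # always False when either side is NaN


def by_bits(a: float, b: float) -> bool:
    return float_bits(a) == float_bits(b)


def cas_update(cell, operand, op, same, max_iters=100):
    """Apply `op` through a CAS loop.

    Returns the number of iterations used, or None if the exit check
    never succeeded within `max_iters` iterations.
    """
    old = cell.value
    for iteration in range(1, max_iters + 1):
        assumed = old
        old = cell.compare_and_swap(assumed, op(assumed, operand))
        if same(old, assumed):  # did our swap take effect?
            return iteration
    return None


print(cas_update(Cell(math.nan), 2.0, nan_max, by_value))  # None: never exits
print(cas_update(Cell(math.nan), 2.0, nan_max, by_bits))   # 1
```

In the value-comparing variant the swap itself succeeds on every iteration, but the exit check can never observe that, which matches the infinite loop described above. Comparing the integer bit patterns makes the check agree with what the compare-and-swap actually did.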

Authors:
   - Nghia Truong (https://github.com/ttnghia)
   - Bradley Dice (https://github.com/bdice)

Approvers:
   - Bradley Dice (https://github.com/bdice)
   - Vyas Ramasubramani (https://github.com/vyasr)
This PR adds Java bindings for the binary option in `ParquetOptions`.

Authors:

Approvers:
   - Jim Brennan (https://github.com/jbrennan333)
   - Mike Wilson (https://github.com/hyperbolic2346)
This PR fixes a flaky test introduced by #11272. By default, cudf joins do not guarantee the order of the returned rows, which may lead to occasional test failures. This PR adds the `sort` argument to make sure the result is deterministic.

Note that `index.union` and `index.intersection` may also produce nondeterministic output ordering, but by default these methods sort the result before returning, so the `sort` argument does not need to be modified.
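
The reason an unsorted join result makes a test flaky can be shown with a toy example. The `inner_join` helper below is hypothetical and runs on plain Python lists rather than cudf objects; reversing the left input stands in for a join that happens to emit the same rows in a different order.

```python
def inner_join(left, right):
    """Toy hash join on the first tuple element.

    The order of the output rows is an implementation detail, just as it
    is for a join that gives no ordering guarantee.
    """
    table = {}
    for key, value in right:
        table.setdefault(key, []).append(value)
    return [
        (key, lval, rval)
        for key, lval in left
        for rval in table.get(key, [])
    ]


left = [(2, "b"), (1, "a"), (3, "c")]
right = [(3, "z"), (1, "x"), (2, "y")]

result_a = inner_join(left, right)
result_b = inner_join(list(reversed(left)), right)  # same rows, different order

print(result_a == result_b)                  # False
print(sorted(result_a) == sorted(result_b))  # True
```

Both results are valid joins, so an assertion that compares them row by row passes or fails depending on which order is produced. Sorting first removes that dependence.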

Authors:
   - Michael Wang (https://github.com/isVoid)

Approvers:
   - GALI PREM SAGAR (https://github.com/galipremsagar)
   - https://github.com/brandon-b-miller
   - Nghia Truong (https://github.com/ttnghia)
This PR switches the loading of `custom.js` to `defer` because the entire page needs to finish loading before the methods in this script can execute correctly.

Authors:
   - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
   - AJ Schmidt (https://github.com/ajschmidt8)
In `cudf::detail::label_segments`, when the input lists column has empty/null lists at the end of the column, its `offsets` column will contain out-of-bounds indices. This leads to an invalid memory access bug. The bug is elusive and doesn't show up consistently. The test failures reported in NVIDIA/spark-rapids#6249 are due to this.
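
To see why trailing empty lists produce out-of-bounds indices, consider the following Python sketch. It assumes a scatter-then-scan formulation of segment labeling, which may differ from the actual libcudf implementation, and the `check_bounds` flag is a hypothetical switch added only to contrast the two behaviors.

```python
from itertools import accumulate


def label_segments(offsets, check_bounds=True):
    """Label each element with the index of the list (segment) it belongs to.

    `offsets` has one more entry than there are lists; list i spans
    elements offsets[i] .. offsets[i + 1] - 1.
    """
    num_elements = offsets[-1] - offsets[0]
    marks = [0] * num_elements
    # Mark the position where each list after the first one begins.
    for boundary in offsets[1:-1]:
        position = boundary - offsets[0]
        if check_bounds and position >= num_elements:
            continue  # trailing empty list: it starts past the last element
        marks[position] += 1  # without the check, this index can be out of range
    # An inclusive prefix sum turns the boundary marks into segment labels.
    return list(accumulate(marks))


# Four lists of sizes 2, 3, 0, 0: the last two are empty and start at index 5,
# which is one past the end of the 5-element output.
print(label_segments([0, 2, 5, 5, 5]))  # [0, 0, 1, 1, 1]
```

Python raises `IndexError` when `check_bounds=False` is used with that input. On the GPU the equivalent write would silently touch memory outside the output buffer, which is consistent with a bug that only shows up on some systems.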

The existing unit tests already cover this corner case. Unfortunately, the bug didn't show up until the tests were run on some systems, and even then it was very difficult to reproduce.

Closes #11495.

Authors:
   - Nghia Truong (https://github.com/ttnghia)

Approvers:
   - Tobias Ribizel (https://github.com/upsj)
   - Bradley Dice (https://github.com/bdice)
   - Jim Brennan (https://github.com/jbrennan333)
   - Alessandro Bellina (https://github.com/abellina)
   - Karthikeyan (https://github.com/karthikeyann)
@raydouglass raydouglass merged commit a7f8de5 into main Aug 17, 2022
Labels
CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.