[RELEASE] cudf v22.10 #11858

Merged
merged 272 commits into from
Oct 12, 2022

GPUtester
Collaborator

❄️ Code freeze for branch-22.10 and v22.10 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-22.10 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-22.10 into main for the release

GPUtester and others added 30 commits August 1, 2022 09:00
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - https://github.com/brandon-b-miller
  - David Wendt (https://github.com/davidwendt)
  - Jason Lowe (https://github.com/jlowe)

URL: #11383
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This PR adds Java tests for the Spark list hashing feature added in #11292.

Depends on #11292.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: #11379
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This version of CPM corrects issues when the build directory contains symlinks

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11417
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
Populate the `schema_info` structure (in addition to `column_names`) to match the behavior of a (future) JSON reader that supports nested columns.
Use the `schema_info` in Cython to set the struct columns' field names (unused until nested type support is added).

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mark Harris (https://github.com/harrism)

URL: #11419
As a CLI tool, CMake belongs in the build section and shouldn't need to be present in the host requirements.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)

URL: #11376
Currently, if the beginning of a field coincides with either the beginning (inclusive) or end (exclusive) of a byte range, the field will be part of the output. This PR fixes the resulting field duplication if we concatenate the results from a partition of the input into byte ranges.

The issue stems from the fact that we use lower_bound to determine the beginning of a field, but upper_bound to determine its end, so if the end of the byte range coincides with the beginning of a field, the result from the range [a,b) doesn't fit exactly onto the result from the range [b,c).

To keep the previous behavior of emitting an empty field if the input ends with a delimiter, I needed to add a small fix that differentiates between byte ranges whose size matches the input size exactly, and ones that overrun the input size (which is the default behavior).
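
The duplication described above can be illustrated with a minimal Python sketch. It is not the libcudf implementation: `field_starts` and `fields_in_range` are hypothetical helpers, and the model is simplified so that a field belongs to a byte range based only on where it starts. Python's `bisect_left` and `bisect_right` play the roles of `lower_bound` and `upper_bound`. The trailing-delimiter special case from the last paragraph is not modeled.

```python
from __future__ import annotations

from bisect import bisect_left, bisect_right


def field_starts(data: bytes, delimiter: bytes) -> list[int]:
    """Return the byte offset at which each delimited field begins."""
    starts = [0]
    pos = data.find(delimiter)
    while pos != -1:
        starts.append(pos + len(delimiter))
        pos = data.find(delimiter, pos + len(delimiter))
    return starts


def fields_in_range(starts, begin, end, *, inclusive_end):
    """Select the field offsets assigned to the byte range [begin, end).

    inclusive_end=True mimics the old behavior (upper_bound on the end),
    which also picks up a field that starts exactly at `end`.
    inclusive_end=False uses lower_bound on both sides.
    """
    first = bisect_left(starts, begin)
    if inclusive_end:
        last = bisect_right(starts, end)
    else:
        last = bisect_left(starts, end)
    return starts[first:last]


data = b"aa,bb,cc"
starts = field_starts(data, b",")  # offsets of "aa", "bb", "cc"
ranges = [(0, 3), (3, 8)]  # a partition of the input; 3 is where "bb" starts

# Concatenate the per-range results, as a caller reading in chunks would.
old = [s for a, b in ranges
       for s in fields_in_range(starts, a, b, inclusive_end=True)]
new = [s for a, b in ranges
       for s in fields_in_range(starts, a, b, inclusive_end=False)]

print(old)  # the field starting at offset 3 appears twice
print(new)  # each field appears exactly once
```

With the same bound used on both sides, the result for [a,b) ends exactly where the result for [b,c) begins, so the concatenation has no duplicates.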

Authors:
  - Tobias Ribizel (https://github.com/upsj)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Nghia Truong (https://github.com/ttnghia)
  - Mark Harris (https://github.com/harrism)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11371
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
…#11297)

Instead of waiting until compile time and getting a confusing error about int128 support, terminate quickly at CMake configure time when an insufficient nvcc version is detected.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11297
…11431)

Adds a new Java binding that reads a JSON buffer and returns the metadata along with the table when inferring the schema.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Jim Brennan (https://github.com/jbrennan333)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11431
Add Python API to expose the future experimental JSON reader implementation.
Add tests for C++ and Python experimental APIs.

Issue #8827

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #11426
Closes #10952 

After #10770 was merged, there are no more uses of `unflatten_nested_columns`. This PR removes `unflatten_nested_columns` and adjusts the tests accordingly.

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11421
When reviewing PR #11322 it was noted that it would be preferable to use `std::byte` for the data type, but at the time that didn't work out, so the plan was to address it later and issue #11362 was created to track it.

Fixes #11362

Authors:
  - Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11424
Closes #11115 

This PR adds a `column` constructor that takes a `device_uvector&&`, so that a column can be constructed from a `device_uvector` using move semantics.

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #11356
… option (#11446)

Changes are mostly equivalent to Parquet changes in #11018.

Store the `columns` option as `optional`:

- `nullopt` when columns are not passed by caller - read all columns.
- Empty vector when caller explicitly passes an empty list/vector - return empty dataframe.
- Vector of column names - read columns with given names.
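
The three cases above can be sketched in Python, with `None` standing in for `nullopt`. This is only an illustration of the semantics, not the reader's code: `select_columns` and its error handling for unknown names are assumptions made for this example.

```python
from __future__ import annotations

from typing import Optional, Sequence


def select_columns(available: Sequence[str],
                   columns: Optional[Sequence[str]] = None) -> list[str]:
    """Resolve which columns to read (hypothetical helper, not a cudf API).

    None        -> caller did not pass `columns`: read every column.
    empty list  -> caller explicitly asked for nothing: read no columns.
    list of str -> read exactly the named columns.
    """
    if columns is None:
        return list(available)
    if len(columns) == 0:
        return []
    missing = [name for name in columns if name not in available]
    if missing:
        # Assumption for this sketch: unknown names are reported as an error.
        raise KeyError(f"columns not found: {missing}")
    return list(columns)


file_columns = ["a", "b", "c"]
read_all = select_columns(file_columns)            # columns not passed
read_none = select_columns(file_columns, [])       # explicit empty list
read_some = select_columns(file_columns, ["c", "a"])
```

The point of storing the option as an optional is that "not passed" and "passed as empty" stay distinguishable; a plain vector would collapse both into the same empty value.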

Also includes a small cleanup of the code equivalent in the Parquet reader.

Fixes #11021

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - MithunR (https://github.com/mythrocks)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11446
As noted in #11368, we should strive to keep thrust types out of our 'public' API.
This removes occurrences of `thrust::optional` from cudf/io host classes in favor of `std::optional`.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Tobias Ribizel (https://github.com/upsj)
  - Bradley Dice (https://github.com/bdice)

URL: #11455
The hooks for cmake-format and cmake-lint can fail silently if the necessary config files are not available. When creating these hooks we chose this behavior because depending on where and how people build the libraries the location of the format file may not be discoverable. However, this often leads to user confusion where the hooks appear to pass locally when in fact they never ran. This PR changes the hooks to be verbose so that they can provide more useful diagnostic output. In order to leave that output at a maintainable level, it forces these hooks to run serially. On my machine, this results in the cmake-format hook taking ~3.5s instead of ~1.2s to run on all files, which is an acceptable compromise for readable output.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11456
Adds regex compile logic to check that a quantifier can be applied to the previous item, even if it's within a capture group.
This prevents an infinite loop when evaluating the expression.
Additional gtests check for this condition, which should throw an error.
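
As an analogy only, the same class of pattern can be tried against Python's standard `re` module, which is a different engine from libcudf's regex compiler. The sketch below assumes that a pattern such as `(*)`, where the quantifier has nothing to repeat inside the capture group, is representative of the condition this PR rejects. The `compiles` helper is hypothetical.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's `re` accepts the pattern at compile time."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


# A quantifier applied to a whole capture group is fine.
valid = compiles(r"(a)*")

# A quantifier with no preceding item inside the group is rejected
# when the pattern is compiled, before any input is evaluated.
invalid = compiles(r"(*)")
```

Rejecting the pattern at compile time means the failure surfaces as an error rather than as a hang during evaluation.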

Closes #11311

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Tobias Ribizel (https://github.com/upsj)
  - Elias Stehle (https://github.com/elstehle)

URL: #11373
Thrust 1.16 removed internal header inclusions that libcudf relied on. This PR adds missing `#include`s that were found automatically by a script I wrote. See notes on #10489. This was previously applied in #10489 but the script became more sophisticated (and libcudf has changed) since I last applied it, so more missing `#include`s were found.

Required for #11437 to upgrade to Thrust 1.17. This change has been separated from #11437 to minimize that PR's diff. Some additional changes will be needed on that PR but we don't want to hold off on fixing these includes, as recommended by @davidwendt.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #11457
This adds a simple benchmark for groupby `max` aggregation.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - David Wendt (https://github.com/davidwendt)

URL: #11464
[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]
This PR removes the Dremel encoding logic from Parquet-specific files and places it into a separate set of files for consumption by non-Parquet code. This PR also includes a minor rename of `utilities/column.hpp`->`utilities/linked_column.hpp` to more accurately reflect the contents of that file.

These changes were split out from #11129 to minimize future conflicts with Parquet development (which is very active at present) and to allow further refactoring and other improvements on this Dremel code to proceed independently of the list lexicographic comparator.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11461
This PR adds a primary developer guide for Python. It provides a more complete and informative landing page for new developers. When #11217, #11199, and #11122 are merged, they will all be linked from this page to provide a complete set of developer documentation.

There is one main point of discussion that I would like reviewer comments on, and that is the section on directory and file organization. How do we want that aspect of cuDF to look?

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Lawrence Mitchell (https://github.com/wence-)
  - Ashwin Srinath (https://github.com/shwina)

URL: #11235
@codecov
codecov bot commented Oct 4, 2022

Codecov Report

Base: 10.56% // Head: 87.51% // Increases project coverage by +76.94% 🎉

Coverage data is based on head (17868b7) compared to base (41a20f6).
Patch has no changes to coverable lines.

❗ Current head 17868b7 differs from pull request most recent head f817d96. Consider uploading reports for the commit f817d96 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #11858       +/-   ##
===========================================
+ Coverage   10.56%   87.51%   +76.94%     
===========================================
  Files         116      133       +17     
  Lines       18677    21826     +3149     
===========================================
+ Hits         1974    19100    +17126     
+ Misses      16703     2726    -13977     
| Impacted Files | Coverage Δ |
|---|---|
| python/custreamz/custreamz/kafka.py | 29.16% <0.00%> (-0.63%) ⬇️ |
| python/dask_cudf/dask_cudf/backends.py | 85.26% <0.00%> (-0.45%) ⬇️ |
| python/dask_cudf/dask_cudf/sorting.py | 93.29% <0.00%> (-0.09%) ⬇️ |
| python/cudf/cudf/comm/serialize.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/_fuzz_testing/io.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/_fuzz_testing/orc.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/_fuzz_testing/utils.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/_fuzz_testing/fuzzer.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/_fuzz_testing/parquet.py | 0.00% <0.00%> (ø) |
| python/dask_cudf/dask_cudf/io/tests/test_csv.py | 100.00% <0.00%> (ø) |

... and 139 more

@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Oct 4, 2022
vuule and others added 5 commits October 7, 2022 14:45
## Description
Zstandard decompression in nvCOMP 2.4 can produce incorrect results on compute 6.0 GPUs. 
This PR disables the Zstandard decompression in this configuration.

## Checklist
- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.

Authors:
   - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
   - MithunR (https://github.com/mythrocks)
   - Jim Brennan (https://github.com/jbrennan333)
   - Vyas Ramasubramani (https://github.com/vyasr)
   - Nghia Truong (https://github.com/ttnghia)
   - Joseph (https://github.com/jolorunyomi)
Fixes bug in temporary decompression space estimation before calling nvcomp
@raydouglass raydouglass merged commit b466b6a into main Oct 12, 2022
Labels
CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.