[RELEASE] cudf v22.10 #11858

[gpuCI] Forward-merge branch-22.08 to branch-22.10 [skip gpuci]

This PR closes #11296. While implementing Spark list hashing in #11292, I noticed that `HASH_SERIAL_MURMUR3` does not appear to be used except in tests. It is not exposed in Python. While it is exposed in the JNI bindings, it is not used by spark-rapids. I discussed this with @rwlee and it seems that this feature was added only for parallel design with the Spark serial hash implementation in #6781, which is superseded by #11292. We do not need to keep this vestigial feature. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Robert Maynard (https://github.com/robertmaynard) - https://github.com/brandon-b-miller - David Wendt (https://github.com/davidwendt) - Jason Lowe (https://github.com/jlowe) URL: #11383

This PR adds Java tests for the Spark list hashing feature added in #11292. Depends on #11292. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - Jason Lowe (https://github.com/jlowe) URL: #11379

This version of CPM corrects issues when the build directory contains symlinks Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #11417

Populate the `schema_info` structure (in addition to `column_names`) to match the behavior of a (future) JSON reader that supports nested columns. Use the `schema_info` in Cython to set the struct columns' field names (unused until nested type support is added). Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) - Mark Harris (https://github.com/harrism) URL: #11419

As a CLI tool CMake belongs in the build section and shouldn't need to be present in the host requirements. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #11376

Currently, if the beginning of a field coincides with either the beginning (inclusive) or end (exclusive) of a byte range, the field will be part of the output. This PR fixes the resulting field duplication if we concatenate the results from a partition of the input into byte ranges. The issue stems from the fact that we use lower_bound to determine the beginning of a field, but upper_bound to determine its end, so if the end of the byte range coincides with the beginning of a field, the result from the range [a,b) doesn't fit exactly onto the result from the range [b,c). To keep the previous behavior of emitting an empty field if the input ends with a delimiter, I needed to add a small fix that differentiates between byte ranges whose size matches the input size exactly, and ones that overrun the input size (which is the default behavior). Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Christopher Harris (https://github.com/cwharris) - Nghia Truong (https://github.com/ttnghia) - Mark Harris (https://github.com/harrism) - Charles Blackmon-Luca (https://github.com/charlesbluca) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11371
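
The boundary rule described above can be sketched on the host. This is a simplified Python model, not the libcudf implementation: `field_starts` and `fields_in_range` are hypothetical names, and the rule shown (a field belongs to the byte range that contains its first byte, with a lower-bound search used for both ends) is an assumption about the intended behavior.

```python
from bisect import bisect_left


def field_starts(data, delimiter):
    """Return the byte offset at which every field begins."""
    starts = [0]
    pos = data.find(delimiter)
    while pos != -1:
        starts.append(pos + len(delimiter))
        pos = data.find(delimiter, pos + len(delimiter))
    return starts


def fields_in_range(data, delimiter, begin, end):
    """Return the fields whose first byte lies in the half-open range [begin, end)."""
    starts = field_starts(data, delimiter)
    # Using the same lower-bound search for both ends makes adjacent
    # ranges [a, b) and [b, c) fit together without overlap.
    first = bisect_left(starts, begin)
    last = bisect_left(starts, end)
    fields = []
    for i in range(first, last):
        if i + 1 < len(starts):
            stop = starts[i + 1] - len(delimiter)
        else:
            stop = len(data)
        fields.append(data[starts[i]:stop])
    return fields


data = b"a,bb,ccc,"
# A field starting exactly at a range boundary is emitted by one range only.
parts = (
    fields_in_range(data, b",", 0, 2)
    + fields_in_range(data, b",", 2, 5)
    + fields_in_range(data, b",", 5, 100)
)
```

In this model the trailing empty field starts at offset `len(data)`, so a range that ends exactly at the input size does not emit it, while a range that overruns the input does.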

…#11297) Instead of waiting until compilation time to get a confusing error about int128 support, quickly terminate at CMake time when we detect an insufficient nvcc version. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Bradley Dice (https://github.com/bdice) - David Wendt (https://github.com/davidwendt) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11297

…11431) Adds in a new java binding to allow reading a JSON buffer and getting back the metadata along with the table when inferring the schema. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - Jim Brennan (https://github.com/jbrennan333) - Nghia Truong (https://github.com/ttnghia) URL: #11431

Add Python API to expose the future experimental JSON reader implementation. Add tests for C++ and Python experimental APIs. Issue #8827 Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11426

Closes #10952 After #10770 was merged there are no more uses of `unflatten_nested_columns`. This PR removes `unflatten_nested_columns` and adjusts the tests accordingly. Authors: - Srikar Vanavasam (https://github.com/SrikarVanavasam) Approvers: - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11421

When reviewing PR #11322 it was noted that it would be preferable to use `std::byte` for the data type, but at the time that didn't work out, so the plan was to address it later and issue #11362 was created to track it. Fixes #11362 Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Tobias Ribizel (https://github.com/upsj) - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) URL: #11424

Closes #11115 This PR adds a `column` constructor that takes a `device_uvector&&` and uses move semantics. Authors: - Srikar Vanavasam (https://github.com/SrikarVanavasam) Approvers: - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) - Jake Hemstad (https://github.com/jrhemstad) URL: #11356

… option (#11446) Changes are mostly equivalent to Parquet changes in #11018. Store the `columns` option as `optional`: - `nullopt` when columns are not passed by caller - read all columns. - Empty vector when caller explicitly passes an empty list/vector - return empty dataframe. - Vector of column names - read columns with given names. Also includes a small cleanup of the code equivalent in the Parquet reader. Fixes #11021 Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - MithunR (https://github.com/mythrocks) - Nghia Truong (https://github.com/ttnghia) URL: #11446
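
The three cases for the `columns` option can be modeled with a short Python sketch, where `None` stands in for `nullopt`. The function `select_columns` and its parameters are hypothetical and only illustrate the selection rule; validation of unknown names is deliberately left out because the text above does not describe it.

```python
def select_columns(available, columns=None):
    """Decide which columns to read, given an optional selection."""
    if columns is None:
        # Caller did not pass `columns`: read everything.
        return list(available)
    if len(columns) == 0:
        # Caller explicitly passed an empty list: read nothing.
        return []
    # Caller passed names: read exactly those (no validation in this sketch).
    return list(columns)


all_cols = select_columns(["a", "b", "c"])
no_cols = select_columns(["a", "b", "c"], [])
some_cols = select_columns(["a", "b", "c"], ["c", "a"])
```

Keeping "not passed" distinct from "passed but empty" is the point of storing the option as an optional rather than as a plain vector.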

As noted in #11368 we should strive towards not having thrust types in our 'public' API. This removes occurrences of `thrust::optional` from cudf/io host classes in favor of `std::optional`. Authors: - Robert Maynard (https://github.com/robertmaynard) - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - Tobias Ribizel (https://github.com/upsj) - Bradley Dice (https://github.com/bdice) URL: #11455

The hooks for cmake-format and cmake-lint can fail silently if the necessary config files are not available. When creating these hooks we chose this behavior because depending on where and how people build the libraries the location of the format file may not be discoverable. However, this often leads to user confusion where the hooks appear to pass locally when in fact they never ran. This PR changes the hooks to be verbose so that they can provide more useful diagnostic output. In order to leave that output at a maintainable level, it forces these hooks to run serially. On my machine, this results in the cmake-format hook taking ~3.5s instead of ~1.2s to run on all files, which is an acceptable compromise for readable output. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Nghia Truong (https://github.com/ttnghia) URL: #11456

Adds regex compile logic to check that a quantifier can be used with the previous item even if it's within a capture group. This prevents an infinite loop from occurring when evaluating the expression. Additional gtests are included to check for this condition, which should throw an error. Closes #11311 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Tobias Ribizel (https://github.com/upsj) - Elias Stehle (https://github.com/elstehle) URL: #11373

Thrust 1.16 removed internal header inclusions that libcudf relied on. This PR adds missing `#include`s that were found automatically by a script I wrote. See notes on #10489. This was previously applied in #10489 but the script became more sophisticated (and libcudf has changed) since I last applied it, so more missing `#include`s were found. Required for #11437 to upgrade to Thrust 1.17. This change has been separated from #11437 to minimize that PR's diff. Some additional changes will be needed on that PR but we don't want to hold off on fixing these includes, as recommended by @davidwendt. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) - Robert Maynard (https://github.com/robertmaynard) URL: #11457

This adds a simple benchmark for groupby `max` aggregation. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Karthikeyan (https://github.com/karthikeyann) - David Wendt (https://github.com/davidwendt) URL: #11464

This PR removes the Dremel encoding logic from Parquet-specific files and places it into a separate set of files for consumption by non-Parquet code. This PR also includes a minor rename of `utilities/column.hpp`->`utilities/linked_column.hpp` to more accurately reflect the contents of that file. These changes were split out from #11129 to minimize future conflicts with Parquet development (which is very active at present) and to allow further refactoring and other improvements on this Dremel code to proceed independently of the list lexicographic comparator. Authors: - Vyas Ramasubramani (https://github.com/vyasr) - Devavret Makkar (https://github.com/devavret) Approvers: - Bradley Dice (https://github.com/bdice) - Ray Douglass (https://github.com/raydouglass) - Vukasin Milovanovic (https://github.com/vuule) URL: #11461

This PR adds a primary developer guide for Python. It provides a more complete and informative landing page for new developers. When #11217, #11199, and #11122 are merged, they will all be linked from this page to provide a complete set of developer documentation. There is one main point of discussion that I would like reviewer comments on, and that is the section on directory and file organization. How do we want that aspect of cuDF to look? Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Lawrence Mitchell (https://github.com/wence-) - Ashwin Srinath (https://github.com/shwina) URL: #11235

This PR documents best practices for writing cuDF Python benchmarks. It includes an overview of the various fixtures provided by our benchmarking suite to all benchmarks and indicates how best to make use of them. It also discusses the various features of our benchmarking suite (including easy comparison to pandas and running in CI) and what developers must do to maintain compatibility with those features. A PR to incorporate the [cudf_benchmarks](https://github.com/vyasr/cudf_benchmarks) repo into cudf proper is imminent, but this documentation PR can be reviewed (and merged) independently. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Ashwin Srinath (https://github.com/shwina) URL: #11122

…1480) This PR removes support for `skiprows` & `num_rows` in parquet reader. A continuation of #11218 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Vukasin Milovanovic (https://github.com/vuule) URL: #11480

…unctor (#11482) Refactored the `group_nunique.cu` source to use the `nullate::DYNAMIC` for the equal operator and the unique-iterator. This improves the compile time by almost 2x without much change to performance by reducing the number of calls to `thrust::reduce_by_key`. Found while investigating compile issues for #11437 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) - Yunsong Wang (https://github.com/PointKernel) URL: #11482

…1365) release() sets the null_count of a column to zero, so previously asking for the null_count provided an incorrect value. Fortunately this never exhibited in the final column, since Column.__init__ always ignores the provided null_count and computes it from the null_mask (if one is given). Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Ashwin Srinath (https://github.com/shwina) - Nghia Truong (https://github.com/ttnghia) - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11365
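
The relationship between a null mask and its null count can be sketched on the host. This is an illustrative Python model, not cuDF code: `null_count_from_mask` is a hypothetical helper, and the bit layout (one bit per row, least-significant bit first, 1 meaning valid) is assumed to follow the Arrow validity-bitmap convention.

```python
def null_count_from_mask(mask, size):
    """Count null rows by inspecting the validity bit of each of `size` rows."""
    valid = 0
    for row in range(size):
        byte = mask[row // 8]
        # Bit `row % 8` of the byte is 1 when the row is valid.
        if (byte >> (row % 8)) & 1:
            valid += 1
    return size - valid


# Rows 0, 2 and 3 are valid; row 1 is null.
count = null_count_from_mask(bytes([0b00001101]), 4)
```

Deriving the count from the mask, rather than trusting a separately supplied value, is what keeps a stale `null_count` from reaching the final column.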

This document aims to give instruction on the following two things: - What to throw given invalid user inputs - How cuDF should handle exceptions from libcudf Authors: - Michael Wang (https://github.com/isVoid) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #7917

This PR builds on the _Finite-State Transducer_ (_FST_) algorithm and the _Logical Stack_ to implement a tokenizer that demarcates sections from the JSON input and assigns a category to each such section. **This PR builds on:** ⛓️ #11242 ⛓️ #11078 Specifically, the tokenizer comprises the following processing steps: 1. FST to emit a sequence of stack operations (i.e., emit push(LIST), push(STRUCT), pop(), read()). This FST transduces each occurrence of an opening semantic bracket or brace to a push(LIST) or push(STRUCT) operation, respectively. Each semantic closing bracket or brace is transduced to a pop() operation. All other input is transduced to a read() operation. 2. The sequence of stack operations from (1) is fed into the logical stack that resolves what is on top of the stack before each operation from (1) (i.e., STRUCT, LIST). After this stage, for every input character we know what is on top of the stack: either a STRUCT or LIST or ROOT, if there is no symbol on top of the stack. 3. We use the top-of-stack information from (2) for a second FST. This part can be considered a full pushdown automaton or DVPA (because now, we also have stack context). State transitions are caused by the combination of the input character + the top-of-stack for that character. The output of this stage is the token stream: ({beginning-of, end-of}x{struct, list}, field name, value, etc.). Authors: - Elias Stehle (https://github.com/elstehle) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Tobias Ribizel (https://github.com/upsj) - Karthikeyan (https://github.com/karthikeyann) - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #11264
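
Steps (1) and (2) can be illustrated with a sequential Python sketch. This is only a model of the data flow: the real implementation runs as parallel GPU algorithms, and the names `stack_ops` and `top_of_stack`, as well as the simplified string and escape handling, are assumptions made for this example.

```python
def stack_ops(text):
    """Step 1: map each input character to a stack operation."""
    ops = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            # Brackets inside a quoted string are not semantic.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            ops.append("read")
        elif ch == '"':
            in_string = True
            ops.append("read")
        elif ch == "[":
            ops.append("push(LIST)")
        elif ch == "{":
            ops.append("push(STRUCT)")
        elif ch in "]}":
            ops.append("pop")
        else:
            ops.append("read")
    return ops


def top_of_stack(ops):
    """Step 2: resolve what is on top of the stack before each operation."""
    stack = []
    tops = []
    for op in ops:
        tops.append(stack[-1] if stack else "ROOT")
        if op == "push(LIST)":
            stack.append("LIST")
        elif op == "push(STRUCT)":
            stack.append("STRUCT")
        elif op == "pop" and stack:
            stack.pop()
    return tops


ops = stack_ops('{"a":[1]}')
tops = top_of_stack(ops)
```

The `tops` sequence pairs each input character with its stack context, which is the extra input the second FST in step (3) needs to decide its state transitions.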

This adds a simple benchmark for groupby `nunique` aggregation. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Bradley Dice (https://github.com/bdice) - Tobias Ribizel (https://github.com/upsj) URL: #11472

This PR unpins `dask` & `distributed` for `22.10` development. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Charles Blackmon-Luca (https://github.com/charlesbluca) - Ray Douglass (https://github.com/raydouglass) URL: #11492

This PR is a breaking change that disables Arrow S3 support by default. Enabling this feature by default has caused build issues for many downstream consumers, all of whom (to my knowledge) manually disable support for this feature. Most commonly, that build error appears as `fatal error: aws/core/Aws.h: No such file or directory`. In my understanding, several downstream consumers of cudf no longer rely on Arrow S3 support from this library and instead get S3 access via fsspec. I am not aware of any users of libcudf who rely on this being enabled by default (or enabled at all). See related issues and discussions: #8617, #11333, #8867, #10644 (comment), NVIDIA/spark-rapids#2827. Build errors caused by this default behavior have also been reported internally. cc: @rjzamora @beckernick @jdye64 @randerzander @robertmaynard @jlowe @quasiben if you have comments following our previous discussion. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) - AJ Schmidt (https://github.com/ajschmidt8) URL: #11470

This PR fixes a minor misalignment in `get_dummies` docstring example. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Michael Wang (https://github.com/isVoid) URL: #11443

This PR removes the unused `is_struct` trait. Users should instead check the column `data_type` id, like `col->type().id() == cudf::type_id::STRUCT`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) URL: #11450

This PR moves the `SparkMurmurHash3_32` functor from `hash_functions.cuh` to `spark_murmur_hash.cu`, the only place where it is used. **This is a pure move**, with one small exception to avoid compiler warnings about unused members of the hash functor template instantiations for nested types. I refactored the class template to disallow nested types for the hash functor and removed those specializations using `CUDF_UNREACHABLE`, rather than allowing type dispatching to create template instantiations that have no defined use. (Nested types are being handled by the custom device row hasher in `spark_murmur_hash.cu`, and require some state information that cannot be easily carried in the functor itself.) I am planning to do further refactoring later, but wanted to separate this "pure move" as much as possible. Part of #10081. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - Ryan Lee (https://github.com/rwlee) URL: #11489

This adds a simple benchmark for reduction `distinct_count`. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Elias Stehle (https://github.com/elstehle) URL: #11473

…#11505) In a previous PR #11480, `skiprows` & `num_rows` were removed from `cudf.read_parquet`, this PR updates the corresponding parquet reader fuzz tests. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #11505

This PR upgrades `arrow` to `9.x` in `cudf`. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) URL: #11507

Updates the bundled version of Thrust to 1.17.0. I will run benchmarks and include results in a comment below. Depends on #11457. Supersedes #10489, #10577, #10586. Closes #10841. **This should be merged concurrently with rapidsai/rapids-cmake#231.** Authors: - Bradley Dice (https://github.com/bdice) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) - Robert Maynard (https://github.com/robertmaynard) URL: #11437

We can simplify the logic around determining the warp_mask by having both queries issued without a dependency Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #11508

This PR introduces factory functions to create `Buffer` instances, which makes it possible to change the returned buffer type based on a configuration option in a follow-up PR. Besides simplifying the code base a bit, this is motivated by the spilling work in #10746. We would like to introduce a new spillable Buffer class that requires minimal changes to the existing code and is only used when enabled explicitly. This way, we can introduce spilling in cuDF as an experimental feature with minimal risk to the existing code. @shwina and I discussed the possibility of letting `Buffer.__new__` return instances of different classes instead of using factory functions, but we concluded that having `Buffer()` return anything other than an instance of `Buffer` is simply too surprising :) **Notice**, this is breaking because it removes unused methods such as `Buffer.copy()` and `Buffer.nbytes`. ~~However, we still support creating a buffer directly by calling `Buffer(obj)`. AFAIK, this is the only way `Buffer` is created outside of cuDF, which [a github search seems to confirm](https://github.com/search?l=&q=cudf.core.buffer+-repo%3Arapidsai%2Fcudf&type=code).~~ This PR doesn't change the signature of `Buffer.__init__()` anymore. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-) - Bradley Dice (https://github.com/bdice) - https://github.com/brandon-b-miller URL: #11447

After upgrading to Arrow 9 (#11507), some systems experience a CMake issue: ``` cudf/cpp/build/_deps/arrow-src/cpp/CMakeLists.txt:864: error: The dependency target "xsimd" of target "arrow_dependencies" does not exist. ``` This may be because the Arrow configuration looks for a local installation of `xsimd` that does not exist or whose installation path is not provided. This PR adds an option to the cudf CMake configuration specifying that Arrow should handle `xsimd` by itself. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Bradley Dice (https://github.com/bdice) - Robert Maynard (https://github.com/robertmaynard) URL: #11513

…11512) This PR resolves #11225. It fixes binary operator dispatch for reverse ops like `__radd__` acting on host scalars and `cudf.Scalar` objects in expressions like `1 + cudf.Scalar(3)`, which previously threw an error. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - https://github.com/brandon-b-miller - Matthew Roeschke (https://github.com/mroeschke) URL: #11512
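
The dispatch mechanism involved can be shown with a toy class in plain Python. `ToyScalar` is a hypothetical stand-in and not `cudf.Scalar`; the sketch only demonstrates how Python falls back to the right operand's `__radd__` when the left operand (here an `int`) cannot handle the addition.

```python
class ToyScalar:
    """Minimal wrapper around a number that supports forward and reverse addition."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, ToyScalar):
            return ToyScalar(self.value + other.value)
        if isinstance(other, (int, float)):
            return ToyScalar(self.value + other)
        return NotImplemented

    def __radd__(self, other):
        # Called for `other + self` after `other.__add__(self)` returns NotImplemented.
        if isinstance(other, (int, float)):
            return ToyScalar(other + self.value)
        return NotImplemented


forward = ToyScalar(3) + 1
reverse = 1 + ToyScalar(3)
```

Returning `NotImplemented` rather than raising lets Python try the other operand's method before reporting a `TypeError`.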

This PR follows up on #10459, #10491 to remove the deprecated `expand` parameter and update the behavior of `str.findall` to always return a list column of results. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11030

Closes #10856 If all of the tdigest inputs to percentile_approx are empty, it will return a column containing null rows, as expected, but they will have unsanitary offsets. This PR checks if all the inputs are empty and returns an empty column as expected. Authors: - Srikar Vanavasam (https://github.com/SrikarVanavasam) Approvers: - Nghia Truong (https://github.com/ttnghia) - MithunR (https://github.com/mythrocks) - https://github.com/nvdbaranec URL: #11498

This fixes a test warning from `test_feather.py`. ``` cudf/python/cudf/cudf/tests/test_feather.py:15: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. ``` The code used `distutils` to check the pandas version and import a special `feather` package if it was less than pandas 0.24. However, we no longer need to test against pandas versions that old so we can just remove this check entirely instead of updating it to use `packaging.version`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11511

#11503) Removes support for skip_rows / num_rows options in the parquet reader. Users retain control of what gets read via row groups. Did some before/after benchmarking. As expected, this doesn't change much except for a minor boost in list reading (due to simplification of the preprocessing step). Most of the ways the row bounds affected the code were in the page setup process (making it slippery to think through the logic) and didn't do much in the actual process of decoding. A selection of before/after benchmarks (all input files ~512 MB):
```
ParquetRead/integral_buffer_input/29/1000/32/0/1/manual_time
  Before: bytes_per_second=31.4564G/s
  After:  bytes_per_second=31.58G/s
ParquetRead/floats_buffer_input/31/1000/32/0/1/manual_time
  Before: bytes_per_second=49.2819G/s
  After:  bytes_per_second=49.7408G/s
ParquetRead/string_file_input/23/1000/32/0/0/manual_time
  Before: bytes_per_second=24.634G/s
  After:  bytes_per_second=24.6563G/s
ParquetRead/string_buffer_input/23/0/1/0/1/manual_time
  Before: bytes_per_second=5.03313G/s
  After:  bytes_per_second=5.03535G/s
ParquetRead/list_buffer_input/24/0/1/1/1/manual_time
  Before: bytes_per_second=1.11488G/s
  After:  bytes_per_second=1.31447G/s
```
Authors: - https://github.com/nvdbaranec Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Yunsong Wang (https://github.com/PointKernel) URL: #11503

Resolves #5944, resolves #1214 This PR adds support for basic use cases of `crosstab` and `pivot_table` functions. Authors: - Shaswat Anand (https://github.com/shaswat-indian) Approvers: - Ashwin Srinath (https://github.com/shwina) - Bradley Dice (https://github.com/bdice) URL: #11314

Bumps hadoop-common from 3.2.3 to 3.2.4. Authors: - https://github.com/apps/dependabot Approvers: - Jason Lowe (https://github.com/jlowe) URL: #11516

This PR builds on the [JSON tokenizer](#11264) algorithm to implement an end-to-end JSON parser that parses to a `table_with_metadata`. **Chained PR depending on:** ⛓️ #11264 Authors: - Elias Stehle (https://github.com/elstehle) - Karthikeyan (https://github.com/karthikeyann) Approvers: - https://github.com/nvdbaranec - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) URL: #11388

…11517) Fixes a bug found in the `ConditionalLeftAntiJoinTest/*.TestCompareRandomToHashNulls` gtests in the `conditional_join` kernel. It appears to be a race condition that is fixed by calling `__syncwarp()` before the final `flush_output_cache()` call. The sync call is necessary to make sure shared data is synchronized; otherwise garbage data is read from the shared data. This error only appears in a debug build of libcudf. The gtest uses random data so the error is somewhat intermittent. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Robert Maynard (https://github.com/robertmaynard) - Bradley Dice (https://github.com/bdice) URL: #11517

Added a builder to enable complex initialization of `data_profile` objects. The builder slightly expands the API to make some common uses easier: - Setting distribution no longer requires passing a value range (default used to be passed in some benchmarks). - The special case where `set_null_frequency(nullopt)` is called to prevent the generator from materializing the null mask is now a more explicit call `no_validity()`. Updated the benchmarks to use the new builder. Setters are still used in places where `data_profile` object is modified and reused. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Karthikeyan (https://github.com/karthikeyann) - David Wendt (https://github.com/davidwendt) URL: #11479

…11404) Adds an ASCII flag to the libcudf `regex_flags` for use with the builtin character classes: `\w, \W, \s, \S, \d, \D`. Somewhat equivalent to https://docs.python.org/3/library/re.html#re.ASCII But strictly the flag modifies matching for these classes as follows: - `\w` = `[a-zA-Z_0-9]` (alphanumeric characters plus underscore) - `\W` = `[^\w]` (basically not `\w`) - `\s` = `[\t- ]` (tab through space in the [ASCII table](https://www.asciitable.com/)) - `\S` = `[^\s]` (basically not `\s`) - `\d` = `[0-9]` (digit characters) - `\D` = `[^\d]` (basically not `\d`) Additional gtests are included for this flag with these classes. This will be exposed through Python/Cython in a follow up PR. Closes #10894 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Robert Maynard (https://github.com/robertmaynard) URL: #11404
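
The class definitions listed above can be tried out with Python's standard `re` module by spelling each class as an explicit bracket expression. This is a sketch for illustration only: `ASCII_CLASSES` and `ascii_match` are hypothetical names, and the code does not call libcudf.

```python
import re

# Explicit bracket expressions for each class, taken from the list above.
ASCII_CLASSES = {
    r"\w": r"[a-zA-Z_0-9]",
    r"\W": r"[^a-zA-Z_0-9]",
    r"\s": r"[\t- ]",
    r"\S": r"[^\t- ]",
    r"\d": r"[0-9]",
    r"\D": r"[^0-9]",
}


def ascii_match(char_class, ch):
    """Return True if the single character `ch` matches the ASCII-only class."""
    return re.fullmatch(ASCII_CLASSES[char_class], ch) is not None


arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit
unicode_digit = re.fullmatch(r"\d", arabic_three) is not None
ascii_digit = ascii_match(r"\d", arabic_three)
```

Python's default Unicode matching accepts the Arabic-Indic digit for `\d`, while the ASCII-only class rejects it. Note also that the `\s` range above covers every code point from tab (0x09) through space (0x20).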

Resolves the first step of #11519 by deprecating `skiprows` and `num_rows` in orc reader. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #11522

This fixes the crash described in the bug related to writing nested data in parquet with the binary flag set to write binary data as byte_arrays. We were incorrectly selecting the top-most node instead of the list<int8>, which resulted in a crash down in the kernels when the data pointer was null for those upper list columns. closes #11506 Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Bradley Dice (https://github.com/bdice) URL: #11526

Arrow 9's CMake code generates new imported interface targets which cudf needs to replicate so that consumers of cudf don't get errors about `arrow::hadoop` or `arrow::flatbuffers`. Fixes #11521 Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Jason Lowe (https://github.com/jlowe) - Bradley Dice (https://github.com/bdice) URL: #11535

Adds support for Spark's null aware equality binop and expands/improves Java testing for struct binops. Properly tests null structs and full operator testing coverage. Utilizes existing Spark struct binop support with JNI changes to force the full null-aware comparison. Expands on #11153 Partial solution to #8964 -- `NULL_MAX` and `NULL_MIN` still outstanding. Authors: - Ryan Lee (https://github.com/rwlee) Approvers: - Tobias Ribizel (https://github.com/upsj) - Vukasin Milovanovic (https://github.com/vuule) - Jason Lowe (https://github.com/jlowe) URL: #11520

Refactors `cudf::strings::pad_side` and `cudf::strings::strip_type` into a single enum `cudf::strings::side_type`. These have the same values as used by `cudf::strings::pad` and `cudf::strings::strip`. Moving these into a single header helps with reusing them in the `strings_udf` work. Updates to gtests and Cython code layers are also included. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Jason Lowe (https://github.com/jlowe) - AJ Schmidt (https://github.com/ajschmidt8) - Robert Maynard (https://github.com/robertmaynard) - Bradley Dice (https://github.com/bdice) URL: #11438

Removes the possibility of another project's `RAPIDS.cmake` being used, and removes the need to always download a version. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Bradley Dice (https://github.com/bdice) URL: #11493

Closes #10988 Exposes page_size_rows and page_size_bytes properties of the Parquet writer. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11454

Adds an API to create a single random column, so users don't need to create a table even when a single column is required. The interface is the same as `create_random_table`, except that it only takes a single data type. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11490

Adds a schema for reading Parquet files. This is useful for cases such as binary data, where cudf reads the data as a string column by default but users want a `list<int8>` column instead. Using a schema allows nested data types to be expressed completely. Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - MithunR (https://github.com/mythrocks) - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #11524

@jrhemstad

This PR replaces #11509, and adds hexadecimal value separators `'` (supported in C++14 and newer) every 4 characters from the least significant (right) side. For example, values like `0xffffffff` should be written as `0xffff'ffff`. In many cases, I added an unsigned suffix `u` as well, if I could identify the value as needing to be unsigned while refactoring. I also added a note to the Developer Guide referencing [C++ Core Guidelines NL.11](https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#nl11-make-literals-readable), which supports the general arguments for readability that @jrhemstad made in the PR conversation on #11509. I did not add separators to "magic values" used in some of the hashing code, because they're copy-pasted directly from a reference and it should be possible to search and find the original value. Happy to make those changes if reviewers think they're needed. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Jason Lowe (https://github.com/jlowe) - Mark Harris (https://github.com/harrism) - Nghia Truong (https://github.com/ttnghia) URL: #11527
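The grouping rule described in this entry can be sketched as a small Python helper that rewrites a C++ hexadecimal literal. `add_hex_separators` is a hypothetical name used only for this illustration; it is not part of the PR, and it does not handle suffixes such as `u`.

```python
def add_hex_separators(literal: str, group: int = 4) -> str:
    """Insert ' every `group` hex digits, counting from the least significant digit."""
    prefix, digits = literal[:2], literal[2:]
    if prefix.lower() != "0x" or not digits:
        raise ValueError(f"not a hex literal: {literal!r}")
    groups = []
    while digits:
        # Peel off the rightmost group first so any short group ends up on the left.
        groups.append(digits[-group:])
        digits = digits[:-group]
    return prefix + "'".join(reversed(groups))


print(add_hex_separators("0xffffffff"))
print(add_hex_separators("0x1ffff"))
```

Literals with four or fewer digits come back unchanged, since there is nothing to separate.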

As noticed in review of #11524 there are unnecessary asserts in the parquet tests. This removes those. closes #11541 Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #11544

Core changes: - Implement the data ingest for the experimental JSON reader. - Call the new JSON parser when the option/flag to use the experimental implementation is set. - Modify C++ and Python tests so they don't expect an exception and check the output instead. Additional fix: - Return the vector of root columns' names from the JSON reader (along with the nested column info) to conform to the current Cython implementation. Info in these structures is redundant and can be removed in the future. Marked as breaking only because the experimental path no longer throws. No changes in behavior when the experimental option is not selected. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Jason Lowe (https://github.com/jlowe) - Bradley Dice (https://github.com/bdice) - Elias Stehle (https://github.com/elstehle) - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11364

This removes a re-implementation of a hashing function identical to the `MurmurHash3_32` used elsewhere in libcudf and uses the common hashing function code instead. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11528

Adds Java APIs to support writing binary columns in Parquet. Authors: - Robert (Bobby) Evans (https://github.com/revans2) Approvers: - Jason Lowe (https://github.com/jlowe) URL: #11556

This extends the `cudf::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `cudf::contains` will work with literally any type of input data. In addition, this fixes null handling of `cudf::contains` with structs column + struct scalar input when the structs column contains null rows at the top level while the scalar key is valid but all nulls at children levels. Closes: #8965 Depends on: * #10730 * #10883 * #10802 * #10997 * NVIDIA/cuCollections#172 * NVIDIA/cuCollections#173 * #11037 * #11356 Authors: - Nghia Truong (https://github.com/ttnghia) - Devavret Makkar (https://github.com/devavret) - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #10656

Continuation of #11216, this adds ability to use 3, 5, 6, 10, and 20 bit dictionary keys in the Parquet encoder. Also adds unit tests for each of the supported bit widths. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Yunsong Wang (https://github.com/PointKernel) URL: #11547
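For context on why dictionary key widths such as 3, 5, or 20 bits come up, the sketch below computes the smallest number of bits that can index a dictionary of a given size. It is an illustration only: `dictionary_key_bits` is a hypothetical helper, and how the encoder chooses among its supported widths is not shown here.

```python
def dictionary_key_bits(num_entries: int) -> int:
    """Smallest bit width able to represent the indices 0 .. num_entries - 1."""
    if num_entries < 1:
        raise ValueError("dictionary must have at least one entry")
    # The largest index is num_entries - 1; use at least one bit per key.
    return max(1, (num_entries - 1).bit_length())


for entries in (5, 8, 9, 1024, 2**20):
    print(entries, dictionary_key_bits(entries))
```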

This folder contains 3 image (.png) files that are not referenced anywhere in the repo. One of them includes mention of GDF which is the original name for cuDF. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #11554

Should unblock CI that is now failing due to upstream breakage. It appears that ipywidgets 8.0.0 is incompatible with streamz 0.6.4. Authors: - Ashwin Srinath (https://github.com/shwina) - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) URL: #11567

…11566) After #11364 `TableWithMeta` no longer returns the column names for a table since it is looking at the old `column_names` field instead of the newer `schema_info` field in the table metadata. This updates the JNI to use the `schema_info` to get the column names and adds a test for this API which was missing before. Authors: - Jason Lowe (https://github.com/jlowe) Approvers: - Ryan Lee (https://github.com/rwlee) - Peixin (https://github.com/pxLi) URL: #11566

Dask-cudf groupby tests *should* be failing as a result of dask/dask#9302 (see the [failures](https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-gpu-test/CUDA=11.5,GPU_LABEL=driver-495,LINUX_VER=ubuntu20.04,PYTHON=3.9/9946/) in #11565, where dask/main is being installed correctly). This PR updates the dask_cudf groupby code to fix these failures. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11561

Resolves: #10116 Authors: - Shaswat Anand (https://github.com/shaswat-indian) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #11523

Resolves: #10529 Authors: - Shaswat Anand (https://github.com/shaswat-indian) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #11538

This adds a second benchmark testing byte ranges with 1% - 50% of the input size to the multibyte_split benchmarks. Example output (--run-once):

<details>

```
# Benchmark Results

## multibyte_split

### [0] Tesla T4

| source_type | delim_size | delim_percent | size_approx | byte_range_percent | Samples | CPU Time | Noise | GPU Time | Noise | Peak Memory Usage | Encoded file size |
|-------------|------------|---------------|-------------------|--------------------|---------|------------|-------|------------|-------|-------------------|-------------------|
| 0 | 1 | 1 | 2^15 = 32768 | 1 | 1x | 683.062 us | inf% | 677.888 us | inf% | 86.672 KiB | 315.000 B |
| 1 | 1 | 1 | 2^15 = 32768 | 1 | 1x | 11.870 ms | inf% | 11.864 ms | inf% | 117.055 KiB | 315.000 B |
| 2 | 1 | 1 | 2^15 = 32768 | 1 | 1x | 8.232 ms | inf% | 8.226 ms | inf% | 112.094 KiB | 315.000 B |
| 0 | 4 | 1 | 2^15 = 32768 | 1 | 1x | 647.178 us | inf% | 642.464 us | inf% | 81.938 KiB | 325.000 B |
| 1 | 4 | 1 | 2^15 = 32768 | 1 | 1x | 8.410 ms | inf% | 8.405 ms | inf% | 113.719 KiB | 325.000 B |
| 2 | 4 | 1 | 2^15 = 32768 | 1 | 1x | 8.194 ms | inf% | 8.188 ms | inf% | 113.086 KiB | 325.000 B |
| 0 | 7 | 1 | 2^15 = 32768 | 1 | 1x | 644.414 us | inf% | 638.944 us | inf% | 82.352 KiB | 328.000 B |
| 1 | 7 | 1 | 2^15 = 32768 | 1 | 1x | 9.467 ms | inf% | 9.461 ms | inf% | 113.703 KiB | 328.000 B |
| 2 | 7 | 1 | 2^15 = 32768 | 1 | 1x | 8.199 ms | inf% | 8.192 ms | inf% | 113.344 KiB | 328.000 B |
| 0 | 1 | 25 | 2^15 = 32768 | 1 | 1x | 661.234 us | inf% | 656.864 us | inf% | 146.719 KiB | 243.000 B |
| 1 | 1 | 25 | 2^15 = 32768 | 1 | 1x | 9.465 ms | inf% | 9.458 ms | inf% | 169.945 KiB | 243.000 B |
| 2 | 1 | 25 | 2^15 = 32768 | 1 | 1x | 8.310 ms | inf% | 8.304 ms | inf% | 105.070 KiB | 243.000 B |
| 0 | 4 | 25 | 2^15 = 32768 | 1 | 1x | 722.079 us | inf% | 717.024 us | inf% | 97.547 KiB | 304.000 B |
| 1 | 4 | 25 | 2^15 = 32768 | 1 | 1x | 9.565 ms | inf% | 9.558 ms | inf% | 126.891 KiB | 304.000 B |
| 2 | 4 | 25 | 2^15 = 32768 | 1 | 1x | 8.242 ms | inf% | 8.237 ms | inf% | 111.031 KiB | 304.000 B |
| 0 | 7 | 25 | 2^15 = 32768 | 1 | 1x | 735.674 us | inf% | 729.984 us | inf% | 90.633 KiB | 309.000 B |
| 1 | 7 | 25 | 2^15 = 32768 | 1 | 1x | 9.623 ms | inf% | 9.617 ms | inf% | 120.508 KiB | 309.000 B |
| 2 | 7 | 25 | 2^15 = 32768 | 1 | 1x | 8.448 ms | inf% | 8.443 ms | inf% | 111.547 KiB | 309.000 B |
| 0 | 1 | 1 | 2^30 = 1073741824 | 1 | 1x | 457.785 ms | inf% | 457.785 ms | inf% | 161.148 MiB | 10.066 MiB |
| 1 | 1 | 1 | 2^30 = 1073741824 | 1 | 1x | 1.283 s | inf% | 1.283 s | inf% | 165.148 MiB | 10.066 MiB |
| 2 | 1 | 1 | 2^30 = 1073741824 | 1 | 1x | 3.676 s | inf% | 3.676 s | inf% | 4.079 MiB | 10.066 MiB |
| 0 | 4 | 1 | 2^30 = 1073741824 | 1 | 1x | 466.860 ms | inf% | 466.861 ms | inf% | 30.663 MiB | 10.144 MiB |
| 1 | 4 | 1 | 2^30 = 1073741824 | 1 | 1x | 1.251 s | inf% | 1.251 s | inf% | 34.663 MiB | 10.144 MiB |
| 2 | 4 | 1 | 2^30 = 1073741824 | 1 | 1x | 3.517 s | inf% | 3.517 s | inf% | 4.079 MiB | 10.144 MiB |
| 0 | 7 | 1 | 2^30 = 1073741824 | 1 | 1x | 479.563 ms | inf% | 479.564 ms | inf% | 21.915 MiB | 10.156 MiB |
| 1 | 7 | 1 | 2^30 = 1073741824 | 1 | 1x | 1.207 s | inf% | 1.207 s | inf% | 25.915 MiB | 10.156 MiB |
| 2 | 7 | 1 | 2^30 = 1073741824 | 1 | 1x | 3.218 s | inf% | 3.218 s | inf% | 4.079 MiB | 10.156 MiB |
| 0 | 1 | 25 | 2^30 = 1073741824 | 1 | 1x | 357.260 ms | inf% | 357.258 ms | inf% | 2.043 GiB | 7.625 MiB |
| 1 | 1 | 25 | 2^30 = 1073741824 | 1 | 1x | 948.294 ms | inf% | 948.298 ms | inf% | 2.047 GiB | 7.625 MiB |
| 2 | 1 | 25 | 2^30 = 1073741824 | 1 | 1x | 2.471 s | inf% | 2.471 s | inf% | 4.079 MiB | 7.625 MiB |
| 0 | 4 | 25 | 2^30 = 1073741824 | 1 | 1x | 477.226 ms | inf% | 477.227 ms | inf% | 520.458 MiB | 9.531 MiB |
| 1 | 4 | 25 | 2^30 = 1073741824 | 1 | 1x | 1.180 s | inf% | 1.180 s | inf% | 524.458 MiB | 9.531 MiB |
| 2 | 4 | 25 | 2^30 = 1073741824 | 1 | 1x | 3.095 s | inf% | 3.095 s | inf% | 4.079 MiB | 9.531 MiB |
| 0 | 7 | 25 | 2^30 = 1073741824 | 1 | 1x | 481.854 ms | inf% | 481.854 ms | inf% | 301.800 MiB | 9.803 MiB |
| 1 | 7 | 25 | 2^30 = 1073741824 | 1 | 1x | 1.234 s | inf% | 1.234 s | inf% | 305.800 MiB | 9.803 MiB |
| 2 | 7 | 25 | 2^30 = 1073741824 | 1 | 1x | 3.100 s | inf% | 3.100 s | inf% | 4.079 MiB | 9.803 MiB |
| 0 | 1 | 1 | 2^15 = 32768 | 5 | 1x | 580.967 us | inf% | 577.184 us | inf% | 87.977 KiB | 1.540 KiB |
| 1 | 1 | 1 | 2^15 = 32768 | 5 | 1x | 12.855 ms | inf% | 12.851 ms | inf% | 117.055 KiB | 1.540 KiB |
| 2 | 1 | 1 | 2^15 = 32768 | 5 | 1x | 7.830 ms | inf% | 7.826 ms | inf% | 112.094 KiB | 1.540 KiB |
| 0 | 4 | 1 | 2^15 = 32768 | 5 | 1x | 434.534 us | inf% | 430.720 us | inf% | 83.531 KiB | 1.589 KiB |
| 1 | 4 | 1 | 2^15 = 32768 | 5 | 1x | 8.882 ms | inf% | 8.877 ms | inf% | 113.719 KiB | 1.589 KiB |
| 2 | 4 | 1 | 2^15 = 32768 | 5 | 1x | 7.756 ms | inf% | 7.751 ms | inf% | 113.086 KiB | 1.589 KiB |
| 0 | 7 | 1 | 2^15 = 32768 | 5 | 1x | 395.562 us | inf% | 391.616 us | inf% | 83.742 KiB | 1.602 KiB |
| 1 | 7 | 1 | 2^15 = 32768 | 5 | 1x | 8.917 ms | inf% | 8.911 ms | inf% | 113.703 KiB | 1.602 KiB |
| 2 | 7 | 1 | 2^15 = 32768 | 5 | 1x | 7.763 ms | inf% | 7.758 ms | inf% | 113.344 KiB | 1.602 KiB |
| 0 | 1 | 25 | 2^15 = 32768 | 5 | 1x | 445.522 us | inf% | 441.760 us | inf% | 149.008 KiB | 1.188 KiB |
| 1 | 1 | 25 | 2^15 = 32768 | 5 | 1x | 8.994 ms | inf% | 8.989 ms | inf% | 169.945 KiB | 1.188 KiB |
| 2 | 1 | 25 | 2^15 = 32768 | 5 | 1x | 7.975 ms | inf% | 7.971 ms | inf% | 105.070 KiB | 1.188 KiB |
| 0 | 4 | 25 | 2^15 = 32768 | 5 | 1x | 425.132 us | inf% | 421.248 us | inf% | 99.031 KiB | 1.486 KiB |
| 1 | 4 | 25 | 2^15 = 32768 | 5 | 1x | 8.846 ms | inf% | 8.841 ms | inf% | 126.891 KiB | 1.486 KiB |
| 2 | 4 | 25 | 2^15 = 32768 | 5 | 1x | 7.740 ms | inf% | 7.735 ms | inf% | 111.031 KiB | 1.486 KiB |
| 0 | 7 | 25 | 2^15 = 32768 | 5 | 1x | 502.938 us | inf% | 499.392 us | inf% | 92.023 KiB | 1.512 KiB |
| 1 | 7 | 25 | 2^15 = 32768 | 5 | 1x | 8.887 ms | inf% | 8.882 ms | inf% | 120.508 KiB | 1.512 KiB |
| 2 | 7 | 25 | 2^15 = 32768 | 5 | 1x | 7.789 ms | inf% | 7.784 ms | inf% | 111.547 KiB | 1.512 KiB |
| 0 | 1 | 1 | 2^30 = 1073741824 | 5 | 1x | 454.848 ms | inf% | 454.849 ms | inf% | 204.424 MiB | 50.332 MiB |
| 1 | 1 | 1 | 2^30 = 1073741824 | 5 | 1x | 1.203 s | inf% | 1.203 s | inf% | 208.424 MiB | 50.332 MiB |
| 2 | 1 | 1 | 2^30 = 1073741824 | 5 | 1x | 3.307 s | inf% | 3.307 s | inf% | 4.079 MiB | 50.332 MiB |
| 0 | 4 | 1 | 2^30 = 1073741824 | 5 | 1x | 476.083 ms | inf% | 476.083 ms | inf% | 71.647 MiB | 50.722 MiB |
| 1 | 4 | 1 | 2^30 = 1073741824 | 5 | 1x | 1.267 s | inf% | 1.267 s | inf% | 75.647 MiB | 50.722 MiB |
| 2 | 4 | 1 | 2^30 = 1073741824 | 5 | 1x | 3.208 s | inf% | 3.208 s | inf% | 4.079 MiB | 50.722 MiB |
| 0 | 7 | 1 | 2^30 = 1073741824 | 5 | 1x | 473.810 ms | inf% | 473.810 ms | inf% | 62.772 MiB | 50.780 MiB |
| 1 | 7 | 1 | 2^30 = 1073741824 | 5 | 1x | 1.202 s | inf% | 1.202 s | inf% | 66.772 MiB | 50.780 MiB |
| 2 | 7 | 1 | 2^30 = 1073741824 | 5 | 1x | 3.216 s | inf% | 3.216 s | inf% | 4.079 MiB | 50.780 MiB |
| 0 | 1 | 25 | 2^30 = 1073741824 | 5 | 1x | 428.377 ms | inf% | 428.376 ms | inf% | 2.113 GiB | 38.123 MiB |
| 1 | 1 | 25 | 2^30 = 1073741824 | 5 | 1x | 986.342 ms | inf% | 986.346 ms | inf% | 2.117 GiB | 38.123 MiB |
| 2 | 1 | 25 | 2^30 = 1073741824 | 5 | 1x | 2.498 s | inf% | 2.498 s | inf% | 4.079 MiB | 38.123 MiB |
| 0 | 4 | 25 | 2^30 = 1073741824 | 5 | 1x | 446.426 ms | inf% | 446.427 ms | inf% | 568.747 MiB | 47.654 MiB |
| 1 | 4 | 25 | 2^30 = 1073741824 | 5 | 1x | 1.167 s | inf% | 1.167 s | inf% | 572.747 MiB | 47.654 MiB |
| 2 | 4 | 25 | 2^30 = 1073741824 | 5 | 1x | 2.998 s | inf% | 2.998 s | inf% | 4.079 MiB | 47.654 MiB |
| 0 | 7 | 25 | 2^30 = 1073741824 | 5 | 1x | 451.670 ms | inf% | 451.670 ms | inf% | 346.822 MiB | 49.016 MiB |
| 1 | 7 | 25 | 2^30 = 1073741824 | 5 | 1x | 1.184 s | inf% | 1.184 s | inf% | 350.822 MiB | 49.016 MiB |
| 2 | 7 | 25 | 2^30 = 1073741824 | 5 | 1x | 3.174 s | inf% | 3.174 s | inf% | 4.079 MiB | 49.016 MiB |
| 0 | 1 | 1 | 2^15 = 32768 | 25 | 1x | 501.600 us | inf% | 497.728 us | inf% | 94.703 KiB | 7.702 KiB |
| 1 | 1 | 1 | 2^15 = 32768 | 25 | 1x | 12.835 ms | inf% | 12.831 ms | inf% | 117.055 KiB | 7.702 KiB |
| 2 | 1 | 1 | 2^15 = 32768 | 25 | 1x | 7.827 ms | inf% | 7.822 ms | inf% | 112.094 KiB | 7.702 KiB |
| 0 | 4 | 1 | 2^15 = 32768 | 25 | 1x | 400.909 us | inf% | 396.960 us | inf% | 89.906 KiB | 7.947 KiB |
| 1 | 4 | 1 | 2^15 = 32768 | 25 | 1x | 8.860 ms | inf% | 8.855 ms | inf% | 113.719 KiB | 7.947 KiB |
| 2 | 4 | 1 | 2^15 = 32768 | 25 | 1x | 7.785 ms | inf% | 7.780 ms | inf% | 113.086 KiB | 7.947 KiB |
| 0 | 7 | 1 | 2^15 = 32768 | 25 | 1x | 394.719 us | inf% | 390.816 us | inf% | 89.344 KiB | 8.010 KiB |
| 1 | 7 | 1 | 2^15 = 32768 | 25 | 1x | 8.853 ms | inf% | 8.848 ms | inf% | 113.703 KiB | 8.010 KiB |
| 2 | 7 | 1 | 2^15 = 32768 | 25 | 1x | 7.728 ms | inf% | 7.723 ms | inf% | 113.344 KiB | 8.010 KiB |
| 0 | 1 | 25 | 2^15 = 32768 | 25 | 1x | 411.273 us | inf% | 407.424 us | inf% | 160.180 KiB | 5.945 KiB |
| 1 | 1 | 25 | 2^15 = 32768 | 25 | 1x | 8.919 ms | inf% | 8.913 ms | inf% | 169.945 KiB | 5.945 KiB |
| 2 | 1 | 25 | 2^15 = 32768 | 25 | 1x | 7.732 ms | inf% | 7.728 ms | inf% | 105.070 KiB | 5.945 KiB |
| 0 | 4 | 25 | 2^15 = 32768 | 25 | 1x | 413.954 us | inf% | 409.376 us | inf% | 106.562 KiB | 7.434 KiB |
| 1 | 4 | 25 | 2^15 = 32768 | 25 | 1x | 8.895 ms | inf% | 8.889 ms | inf% | 126.891 KiB | 7.434 KiB |
| 2 | 4 | 25 | 2^15 = 32768 | 25 | 1x | 7.756 ms | inf% | 7.751 ms | inf% | 111.031 KiB | 7.434 KiB |
| 0 | 7 | 25 | 2^15 = 32768 | 25 | 1x | 424.539 us | inf% | 420.160 us | inf% | 98.930 KiB | 7.561 KiB |
| 1 | 7 | 25 | 2^15 = 32768 | 25 | 1x | 8.947 ms | inf% | 8.943 ms | inf% | 120.508 KiB | 7.561 KiB |
| 2 | 7 | 25 | 2^15 = 32768 | 25 | 1x | 7.770 ms | inf% | 7.766 ms | inf% | 111.547 KiB | 7.561 KiB |
| 0 | 1 | 1 | 2^30 = 1073741824 | 25 | 1x | 454.401 ms | inf% | 454.400 ms | inf% | 420.754 MiB | 251.660 MiB |
| 1 | 1 | 1 | 2^30 = 1073741824 | 25 | 1x | 1.216 s | inf% | 1.216 s | inf% | 424.754 MiB | 251.660 MiB |
| 2 | 1 | 1 | 2^30 = 1073741824 | 25 | 1x | 3.169 s | inf% | 3.169 s | inf% | 4.079 MiB | 251.660 MiB |
| 0 | 4 | 1 | 2^30 = 1073741824 | 25 | 1x | 473.311 ms | inf% | 473.311 ms | inf% | 276.569 MiB | 253.610 MiB |
| 1 | 4 | 1 | 2^30 = 1073741824 | 25 | 1x | 1.265 s | inf% | 1.265 s | inf% | 280.569 MiB | 253.610 MiB |
| 2 | 4 | 1 | 2^30 = 1073741824 | 25 | 1x | 3.215 s | inf% | 3.215 s | inf% | 4.079 MiB | 253.610 MiB |
| 0 | 7 | 1 | 2^30 = 1073741824 | 25 | 1x | 460.715 ms | inf% | 460.715 ms | inf% | 267.056 MiB | 253.902 MiB |
| 1 | 7 | 1 | 2^30 = 1073741824 | 25 | 1x | 1.294 s | inf% | 1.294 s | inf% | 271.056 MiB | 253.902 MiB |
| 2 | 7 | 1 | 2^30 = 1073741824 | 25 | 1x | 3.179 s | inf% | 3.179 s | inf% | 4.079 MiB | 253.902 MiB |
| 0 | 1 | 25 | 2^30 = 1073741824 | 25 | 1x | 433.565 ms | inf% | 433.564 ms | inf% | 2.465 GiB | 190.615 MiB |
| 1 | 1 | 25 | 2^30 = 1073741824 | 25 | 1x | 1.012 s | inf% | 1.012 s | inf% | 2.469 GiB | 190.615 MiB |
| 2 | 1 | 25 | 2^30 = 1073741824 | 25 | 1x | 2.436 s | inf% | 2.436 s | inf% | 4.079 MiB | 190.615 MiB |
| 0 | 4 | 25 | 2^30 = 1073741824 | 25 | 1x | 473.925 ms | inf% | 473.927 ms | inf% | 810.193 MiB | 238.269 MiB |
| 1 | 4 | 25 | 2^30 = 1073741824 | 25 | 1x | 1.222 s | inf% | 1.222 s | inf% | 814.193 MiB | 238.269 MiB |
| 2 | 4 | 25 | 2^30 = 1073741824 | 25 | 1x | 2.954 s | inf% | 2.954 s | inf% | 4.079 MiB | 238.269 MiB |
| 0 | 7 | 25 | 2^30 = 1073741824 | 25 | 1x | 475.940 ms | inf% | 475.940 ms | inf% | 571.933 MiB | 245.080 MiB |
| 1 | 7 | 25 | 2^30 = 1073741824 | 25 | 1x | 1.216 s | inf% | 1.216 s | inf% | 575.933 MiB | 245.080 MiB |
| 2 | 7 | 25 | 2^30 = 1073741824 | 25 | 1x | 3.035 s | inf% | 3.035 s | inf% | 4.079 MiB | 245.080 MiB |
| 0 | 1 | 1 | 2^15 = 32768 | 50 | 1x | 453.082 us | inf% | 449.248 us | inf% | 103.055 KiB | 15.404 KiB |
| 1 | 1 | 1 | 2^15 = 32768 | 50 | 1x | 12.784 ms | inf% | 12.778 ms | inf% | 118.523 KiB | 15.404 KiB |
| 2 | 1 | 1 | 2^15 = 32768 | 50 | 1x | 8.021 ms | inf% | 8.017 ms | inf% | 112.094 KiB | 15.404 KiB |
| 0 | 4 | 1 | 2^15 = 32768 | 50 | 1x | 450.433 us | inf% | 445.536 us | inf% | 97.781 KiB | 15.895 KiB |
| 1 | 4 | 1 | 2^15 = 32768 | 50 | 1x | 8.740 ms | inf% | 8.735 ms | inf% | 113.719 KiB | 15.895 KiB |
| 2 | 4 | 1 | 2^15 = 32768 | 50 | 1x | 7.715 ms | inf% | 7.711 ms | inf% | 113.086 KiB | 15.895 KiB |
| 0 | 7 | 1 | 2^15 = 32768 | 50 | 1x | 400.055 us | inf% | 395.264 us | inf% | 97.750 KiB | 16.020 KiB |
| 1 | 7 | 1 | 2^15 = 32768 | 50 | 1x | 8.755 ms | inf% | 8.751 ms | inf% | 113.750 KiB | 16.020 KiB |
| 2 | 7 | 1 | 2^15 = 32768 | 50 | 1x | 7.687 ms | inf% | 7.682 ms | inf% | 113.344 KiB | 16.020 KiB |
| 0 | 1 | 25 | 2^15 = 32768 | 50 | 1x | 410.891 us | inf% | 407.360 us | inf% | 174.227 KiB | 11.892 KiB |
| 1 | 1 | 25 | 2^15 = 32768 | 50 | 1x | 8.795 ms | inf% | 8.790 ms | inf% | 186.125 KiB | 11.892 KiB |
| 2 | 1 | 25 | 2^15 = 32768 | 50 | 1x | 7.757 ms | inf% | 7.753 ms | inf% | 105.070 KiB | 11.892 KiB |
| 0 | 4 | 25 | 2^15 = 32768 | 50 | 1x | 414.914 us | inf% | 411.488 us | inf% | 115.992 KiB | 14.868 KiB |
| 1 | 4 | 25 | 2^15 = 32768 | 50 | 1x | 8.790 ms | inf% | 8.784 ms | inf% | 130.867 KiB | 14.868 KiB |
| 2 | 4 | 25 | 2^15 = 32768 | 50 | 1x | 7.702 ms | inf% | 7.698 ms | inf% | 111.031 KiB | 14.868 KiB |
| 0 | 7 | 25 | 2^15 = 32768 | 50 | 1x | 426.890 us | inf% | 422.528 us | inf% | 107.648 KiB | 15.121 KiB |
| 1 | 7 | 25 | 2^15 = 32768 | 50 | 1x | 8.816 ms | inf% | 8.811 ms | inf% | 122.789 KiB | 15.121 KiB |
| 2 | 7 | 25 | 2^15 = 32768 | 50 | 1x | 7.710 ms | inf% | 7.706 ms | inf% | 111.547 KiB | 15.121 KiB |
| 0 | 1 | 1 | 2^30 = 1073741824 | 50 | 1x | 465.599 ms | inf% | 465.599 ms | inf% | 691.190 MiB | 503.319 MiB |
| 1 | 1 | 1 | 2^30 = 1073741824 | 50 | 1x | 1.362 s | inf% | 1.362 s | inf% | 695.190 MiB | 503.319 MiB |
| 2 | 1 | 1 | 2^30 = 1073741824 | 50 | 1x | 3.214 s | inf% | 3.214 s | inf% | 4.079 MiB | 503.319 MiB |
| 0 | 4 | 1 | 2^30 = 1073741824 | 50 | 1x | 494.805 ms | inf% | 494.807 ms | inf% | 532.722 MiB | 507.221 MiB |
| 1 | 4 | 1 | 2^30 = 1073741824 | 50 | 1x | 1.282 s | inf% | 1.282 s | inf% | 536.722 MiB | 507.221 MiB |
| 2 | 4 | 1 | 2^30 = 1073741824 | 50 | 1x | 3.178 s | inf% | 3.178 s | inf% | 4.079 MiB | 507.221 MiB |
| 0 | 7 | 1 | 2^30 = 1073741824 | 50 | 1x | 478.472 ms | inf% | 478.473 ms | inf% | 522.410 MiB | 507.803 MiB |
| 1 | 7 | 1 | 2^30 = 1073741824 | 50 | 1x | 1.287 s | inf% | 1.287 s | inf% | 526.410 MiB | 507.803 MiB |
| 2 | 7 | 1 | 2^30 = 1073741824 | 50 | 1x | 3.229 s | inf% | 3.229 s | inf% | 4.079 MiB | 507.803 MiB |
| 0 | 1 | 25 | 2^30 = 1073741824 | 50 | 1x | 451.871 ms | inf% | 451.872 ms | inf% | 2.904 GiB | 381.231 MiB |
| 1 | 1 | 25 | 2^30 = 1073741824 | 50 | 1x | 1.029 s | inf% | 1.029 s | inf% | 2.908 GiB | 381.231 MiB |
| 2 | 1 | 25 | 2^30 = 1073741824 | 50 | 1x | 2.379 s | inf% | 2.379 s | inf% | 4.079 MiB | 381.231 MiB |
| 0 | 4 | 25 | 2^30 = 1073741824 | 50 | 1x | 456.983 ms | inf% | 456.983 ms | inf% | 1.086 GiB | 476.537 MiB |
| 1 | 4 | 25 | 2^30 = 1073741824 | 50 | 1x | 1.245 s | inf% | 1.245 s | inf% | 1.090 GiB | 476.537 MiB |
| 2 | 4 | 25 | 2^30 = 1073741824 | 50 | 1x | 2.932 s | inf% | 2.932 s | inf% | 4.079 MiB | 476.537 MiB |
| 0 | 7 | 25 | 2^30 = 1073741824 | 50 | 1x | 505.587 ms | inf% | 505.586 ms | inf% | 853.321 MiB | 490.160 MiB |
| 1 | 7 | 25 | 2^30 = 1073741824 | 50 | 1x | 1.264 s | inf% | 1.264 s | inf% | 857.321 MiB | 490.160 MiB |
| 2 | 7 | 25 | 2^30 = 1073741824 | 50 | 1x | 2.991 s | inf% | 2.991 s | inf% | 4.079 MiB | 490.160 MiB |
| 0 | 1 | 1 | 2^15 = 32768 | 100 | 1x | 452.555 us | inf% | 448.512 us | inf% | 119.547 KiB | 30.809 KiB |
| 1 | 1 | 1 | 2^15 = 32768 | 100 | 1x | 12.755 ms | inf% | 12.750 ms | inf% | 150.359 KiB | 30.809 KiB |
| 2 | 1 | 1 | 2^15 = 32768 | 100 | 1x | 8.670 ms | inf% | 8.666 ms | inf% | 142.914 KiB | 30.809 KiB |
| 0 | 4 | 1 | 2^15 = 32768 | 100 | 1x | 402.123 us | inf% | 398.688 us | inf% | 114.047 KiB | 31.790 KiB |
| 1 | 4 | 1 | 2^15 = 32768 | 100 | 1x | 8.997 ms | inf% | 8.993 ms | inf% | 145.844 KiB | 31.790 KiB |
| 2 | 4 | 1 | 2^15 = 32768 | 100 | 1x | 8.703 ms | inf% | 8.698 ms | inf% | 144.891 KiB | 31.790 KiB |
| 0 | 7 | 1 | 2^15 = 32768 | 100 | 1x | 396.810 us | inf% | 393.184 us | inf% | 113.891 KiB | 32.040 KiB |
| 1 | 7 | 1 | 2^15 = 32768 | 100 | 1x | 8.775 ms | inf% | 8.770 ms | inf% | 145.938 KiB | 32.040 KiB |
| 2 | 7 | 1 | 2^15 = 32768 | 100 | 1x | 8.561 ms | inf% | 8.555 ms | inf% | 145.398 KiB | 32.040 KiB |
| 0 | 1 | 25 | 2^15 = 32768 | 100 | 1x | 410.632 us | inf% | 407.136 us | inf% | 202.391 KiB | 23.783 KiB |
| 1 | 1 | 25 | 2^15 = 32768 | 100 | 1x | 8.663 ms | inf% | 8.658 ms | inf% | 226.180 KiB | 23.783 KiB |
| 2 | 1 | 25 | 2^15 = 32768 | 100 | 1x | 8.569 ms | inf% | 8.564 ms | inf% | 128.867 KiB | 23.783 KiB |
| 0 | 4 | 25 | 2^15 = 32768 | 100 | 1x | 420.616 us | inf% | 416.288 us | inf% | 134.828 KiB | 29.736 KiB |
| 1 | 4 | 25 | 2^15 = 32768 | 100 | 1x | 8.701 ms | inf% | 8.697 ms | inf% | 164.570 KiB | 29.736 KiB |
| 2 | 4 | 25 | 2^15 = 32768 | 100 | 1x | 8.700 ms | inf% | 8.696 ms | inf% | 140.781 KiB | 29.736 KiB |
| 0 | 7 | 25 | 2^15 = 32768 | 100 | 1x | 420.545 us | inf% | 417.184 us | inf% | 125.000 KiB | 30.243 KiB |
| 1 | 7 | 25 | 2^15 = 32768 | 100 | 1x | 8.689 ms | inf% | 8.684 ms | inf% | 155.250 KiB | 30.243 KiB |
| 2 | 7 | 25 | 2^15 = 32768 | 100 | 1x | 8.651 ms | inf% | 8.646 ms | inf% | 141.805 KiB | 30.243 KiB |
| 0 | 1 | 1 | 2^30 = 1073741824 | 100 | 1x | 473.237 ms | inf% | 473.238 ms | inf% | 1.203 GiB | 1006.638 MiB |
| 1 | 1 | 1 | 2^30 = 1073741824 | 100 | 1x | 1.388 s | inf% | 1.388 s | inf% | 1.207 GiB | 1006.638 MiB |
| 2 | 1 | 1 | 2^30 = 1073741824 | 100 | 1x | 3.482 s | inf% | 3.483 s | inf% | 1010.718 MiB | 1006.638 MiB |
| 0 | 4 | 1 | 2^30 = 1073741824 | 100 | 1x | 475.645 ms | inf% | 475.645 ms | inf% | 1.021 GiB | 1014.442 MiB |
| 1 | 4 | 1 | 2^30 = 1073741824 | 100 | 1x | 1.384 s | inf% | 1.384 s | inf% | 1.024 GiB | 1014.442 MiB |
| 2 | 4 | 1 | 2^30 = 1073741824 | 100 | 1x | 3.432 s | inf% | 3.432 s | inf% | 1018.521 MiB | 1014.442 MiB |
| 0 | 7 | 1 | 2^30 = 1073741824 | 100 | 1x | 488.596 ms | inf% | 488.597 ms | inf% | 1.009 GiB | 1015.606 MiB |
| 1 | 7 | 1 | 2^30 = 1073741824 | 100 | 1x | 1.404 s | inf% | 1.404 s | inf% | 1.013 GiB | 1015.606 MiB |
| 2 | 7 | 1 | 2^30 = 1073741824 | 100 | 1x | 3.407 s | inf% | 3.407 s | inf% | 1019.686 MiB | 1015.606 MiB |
| 0 | 1 | 25 | 2^30 = 1073741824 | 100 | 1x | 444.580 ms | inf% | 444.580 ms | inf% | 3.783 GiB | 762.462 MiB |
| 1 | 1 | 25 | 2^30 = 1073741824 | 100 | 1x | 1.086 s | inf% | 1.086 s | inf% | 3.787 GiB | 762.462 MiB |
| 2 | 1 | 25 | 2^30 = 1073741824 | 100 | 1x | 2.761 s | inf% | 2.761 s | inf% | 766.541 MiB | 762.462 MiB |
| 0 | 4 | 25 | 2^30 = 1073741824 | 100 | 1x | 517.450 ms | inf% | 517.450 ms | inf% | 1.675 GiB | 953.075 MiB |
| 1 | 4 | 25 | 2^30 = 1073741824 | 100 | 1x | 1.627 s | inf% | 1.627 s | inf% | 1.679 GiB | 953.075 MiB |
| 2 | 4 | 25 | 2^30 = 1073741824 | 100 | 1x | 4.586 s | inf% | 4.586 s | inf% | 957.154 MiB | 953.075 MiB |
| 0 | 7 | 25 | 2^30 = 1073741824 | 100 | 1x | 527.723 ms | inf% | 527.723 ms | inf% | 1.383 GiB | 980.321 MiB |
| 1 | 7 | 25 | 2^30 = 1073741824 | 100 | 1x | 1.584 s | inf% | 1.584 s | inf% | 1.387 GiB | 980.321 MiB |
| 2 | 7 | 25 | 2^30 = 1073741824 | 100 | 1x | 4.653 s | inf% | 4.653 s | inf% | 984.400 MiB | 980.321 MiB |
```

</details>

Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Christopher Harris (https://github.com/cwharris) - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: #11562

Adds truncation of the min and max values of the Parquet column indexes, as recommended by the [Parquet format](https://github.com/apache/parquet-format/blob/master/PageIndex.md). Adds a parameter column_index_truncate_length to the writer options/builder. It currently defaults to 64, which is the default used by parquet-mr. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: #11403
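The idea behind truncating column-index bounds can be sketched in plain Python on byte strings. This is an illustration of the general approach (a truncated minimum must stay at or below the true minimum, and a truncated maximum must stay at or above the true maximum), not the libcudf implementation. The function names and the default of 64 mirror the description above but are otherwise assumptions.

```python
def truncate_min(value: bytes, length: int = 64) -> bytes:
    """A prefix always sorts at or before the full value, so it is a valid lower bound."""
    return value[:length]


def truncate_max(value: bytes, length: int = 64) -> bytes:
    """Shorten `value` while keeping a result that sorts at or after it."""
    if len(value) <= length:
        return value
    prefix = bytearray(value[:length])
    # Increment the last byte that can be incremented and drop everything after it.
    for i in reversed(range(len(prefix))):
        if prefix[i] != 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
    # Every byte in the prefix is 0xFF: no shorter upper bound exists, keep the value.
    return value


print(truncate_min(b"abcdef", 3), truncate_max(b"abcdef", 3))
```

Incrementing the truncated maximum is what keeps it a valid upper bound: a bare prefix of the maximum would sort before the original value.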

Handles an incomplete range in a regex cclass pattern `[a-], [-z], [-]` by interpreting the hyphen `-` as a literal. Added gtest to test this new behavior. Closes #11537 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Mark Harris (https://github.com/harrism) - Tobias Ribizel (https://github.com/upsj) URL: #11557
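Python's `re` module treats a leading or trailing hyphen in a character class the same way, so it can serve as a quick reference for the behavior described here. The sketch below uses Python only, not libcudf, and the sample text is made up.

```python
import re

text = "a-z"

# A hyphen at the start or end of a class has no range to form, so it is a literal.
trailing = re.findall(r"[a-]", text)
leading = re.findall(r"[-z]", text)
alone = re.findall(r"[-]", text)

# For contrast, a complete range matches the letters but not the hyphen itself.
full_range = re.findall(r"[a-z]", text)

print(trailing, leading, alone, full_range)
```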

This extends the `lists::contains` API to support nested types (lists + structs) with arbitrarily nested levels. As such, `lists::contains` will work with literally any type of input data. In addition, the related implementation has been significantly refactored to facilitate adding new implementation. Closes #8958. Depends on: * #10730 * #10883 * #10999 * #11019 * #11037 Authors: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) Approvers: - MithunR (https://github.com/mythrocks) - Bradley Dice (https://github.com/bdice) URL: #10548
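To make the row-wise semantics concrete, here is a minimal pure-Python sketch of a "contains" check over a column of lists whose elements are themselves nested (structs modeled as dicts, lists as lists). It illustrates the idea only and is not the libcudf algorithm. `lists_contains` is a hypothetical helper, and returning `None` for a null row is a simplifying assumption about null handling.

```python
from typing import Any, List, Optional


def lists_contains(rows: List[Optional[list]], key: Any) -> List[Optional[bool]]:
    """For each list row, report whether any element equals `key`."""
    result = []
    for row in rows:
        if row is None:
            # Assumption: a null row yields a null result.
            result.append(None)
        else:
            # Python's == compares nested dicts and lists element by element.
            result.append(any(element == key for element in row))
    return result


rows = [
    [{"a": 1, "b": [1, 2]}, {"a": 2, "b": []}],
    [],
    None,
]
print(lists_contains(rows, {"a": 2, "b": []}))
```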

…er (#11568) Resolves #11293 This PR introduces support for writing a column of type `ListDtype(StructDtype)` as a `map` type in an ORC file. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #11568

Issue #10941 This PR rewrites the existing ORC reader benchmarks with nvbench w.r.t. #7960. It improves the `input` test case in which all data types were benchmarked with all compression and IO types. By splitting `input` into `decode` and `io_compression`, it reduces the number of test cases from 112 to 44. The PR also removes the current `row_selection` test suite. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #11543

@shwina

Recently I have reviewed a handful of PRs with problems in their docstrings that I've been fixing with GitHub review suggestions. I took 40 minutes and enabled a bunch of pydocstyle rules that we agreed on in #10711, to help prevent some of these problems and reduce the amount of review effort required for the future. There are a handful of big ones (`D200, D202, D205, D400`) that will require a more intense effort to implement -- those rules may not be worth the significant refactoring effort. I think this may resolve the part of #10711 that we wanted to tackle in the short term, though I'm happy to hear others' views (@shwina @vyasr). Error code reference: https://www.pydocstyle.org/en/stable/error_codes.html Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Ashwin Srinath (https://github.com/shwina) - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11582
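For readers unfamiliar with the four rules singled out above, the sketch below shows docstrings that would satisfy them, based on the pydocstyle error code reference: D200 (a one-line docstring fits on one line), D202 (no blank line after a function docstring), D205 (one blank line between summary and description), and D400 (the first line ends with a period). The functions are made-up examples and are not taken from cudf.

```python
def scale(values, factor):
    """Multiply every value by a constant factor.

    The summary line ends with a period (D400) and is separated from this
    description by exactly one blank line (D205).
    """
    return [value * factor for value in values]


def identity(value):
    """Return the input unchanged."""
    return value


print(scale([1, 2], 3), identity("x"))
```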

#11479 accidentally changed the distribution data type so the custom distribution was set for the wrong type. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Tobias Ribizel (https://github.com/upsj) - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #11584

…11590) Fixes: #11589 This PR fixes an issue with `DataFrame.to_arrow` when the column names are `int`/`tuples`, i.e., non-string types. The fix is trivial and matches the behavior of `pa.Table.from_pandas`, which converts all non-string column names to `strings`. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - https://github.com/brandon-b-miller URL: #11590

On some systems, the CUDA compiler may emit unused-variable warnings if you use `if constexpr` with an `else` branch. Such warnings were reported after a recently merged PR (#10548). This PR rewrites the `else` branch of the `if constexpr` in `lists/contains.cu` into a second `if constexpr` statement with the opposite condition to avoid the warnings. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) - Mark Harris (https://github.com/harrism) URL: #11581

Move and refactor some internal strings/numeric conversion functions for reuse with the strings-udf work in #11319. Some functions needed modification to satisfy the finicky jitify compiler. Most functions have been moved from `cpp/src/strings/convert/utilities.cuh` and `cpp/include/cudf/strings/string.cuh` into separate new files in the `cpp/include/cudf/strings/detail/convert` folder. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Tobias Ribizel (https://github.com/upsj) - Nghia Truong (https://github.com/ttnghia) URL: #11545

This PR enables using [upstream jitify2](https://github.com/NVIDIA/jitify/tree/jitify2) rather than RAPIDS' fork of [jitify2](https://github.com/rapidsai/jitify/tree/cudf_0.19). This enables us to take advantage of the latest additions/improvements to jitify. Most notably: upstream jitify2 dlsym/dlopens `libcuda.so`, which enables us to [drop our shared library dependency on `libcuda.so`](#11370).

---

Two major issues came up when making the switch:

1. NVIDIA/jitify#107 - I used the workaround mentioned in that issue. Hopefully it is fixed soon and we can eliminate the workaround.
2. We need to pass `-D_FILE_OFFSET_BITS=64` to jitify. Due to limitations in the way conda-forge builds glibc, we must explicitly state that we require 64-bit file offset support.

Authors: - Ashwin Srinath (https://github.com/shwina) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) - Robert Maynard (https://github.com/robertmaynard) URL: #11287

This PR adds support for the [json lines](https://jsonlines.org/) (aka [newline-delimited json](http://ndjson.org/)) format to the nested JSON reader. Authors: - Elias Stehle (https://github.com/elstehle) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Tobias Ribizel (https://github.com/upsj) - GALI PREM SAGAR (https://github.com/galipremsagar) - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) URL: #11534
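For readers unfamiliar with the format, the sketch below shows what "one JSON value per line" means in practice. It uses only Python's standard `json` module and is not the cudf reader; the function name `read_json_lines` and the sample records are made up for illustration.

```python
import json


def read_json_lines(text):
    """Parse JSON Lines text: every non-empty line is one complete JSON value."""
    records = []
    # Split on "\n" only; a trailing "\r" is whitespace to json.loads.
    for line in text.split("\n"):
        if line.strip():
            records.append(json.loads(line))
    return records


# Two records, each on its own line, with nested struct and list fields.
sample = '{"a": 1, "b": {"c": [1, 2]}}\n{"a": 2, "b": {"c": []}}\n'
records = read_json_lines(sample)
print(records)
```

Unlike a regular JSON document, there is no enclosing array and no comma between records, so each line can be parsed independently of the others.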

This fixes an issue where the Java API is using a default stream instead of the stream passed in. This is a small change separated from #11600 because it only touches one Java file and thus only needs a Java reviewer. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #11601

While working through #11031 it was discovered that we were missing the ability to "cast" between python "classes" (`int`, `float`, and `bool`) within UDFs. This PR introduces the equivalent syntax into masked UDFs. These operations shall be interpreted as mapping to `int64`, `float64` and `bool` types, following numpy and numba's existing handling for scalar types. Authors: - https://github.com/brandon-b-miller Approvers: - Bradley Dice (https://github.com/bdice) - Matthew Roeschke (https://github.com/mroeschke) URL: #11578
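A minimal sketch of the kind of function this enables. The function name is hypothetical and the code is exercised here as plain Python only; passing it to `Series.apply` on a GPU machine is shown as a comment and is an assumption about usage rather than something run for this note.

```python
def truncate_then_halve(x):
    whole = int(x)            # per the PR, treated as a cast to int64 inside a UDF
    return float(whole) / 2   # per the PR, treated as a cast to float64 inside a UDF


# Plain-Python behaviour of the same function:
print(truncate_then_halve(7.9))

# Assumed usage with cudf (requires a GPU environment, not executed here):
#   import cudf
#   cudf.Series([7.9, 2.1]).apply(truncate_then_halve)
```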

The multibyte_split benchmark has two issues related to the host buffer inputs:

* The host buffer reader does a well-hidden copy when creating a reader, since it uses stringstream internally. I added a host buffer reader that takes care of this issue.
* The host string was initialized before the string data was fetched from the device.

The second issue only became apparent once the first one was fixed, since suddenly the host buffer benchmark was faster than the device buffer benchmark. This also explains why the memory footprint of the host benchmarks was smaller than the other benchmarks.

## TODO

- [x] Add data_chunk_source tests

Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - David Wendt (https://github.com/davidwendt) - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) - Christopher Harris (https://github.com/cwharris) URL: #11583

…11600) This PR is derived from changes I made in #11577 while attempting to consolidate stream handling in public APIs. During that refactoring, I noticed three repeated problems across libcudf APIs that I have addressed in this PR. These refactors will make future work on streams much more straightforward as well as increase consistency and quality in the library.

1. Some APIs were putting too much implementation in a public method. I split these so that the public/detail balance is consistent with the rest of libcudf.
2. A number of public APIs were missing `CUDF_FUNC_RANGE`, making it difficult to recognize those functions in profiles (cc: @GregoryKimball).
3. Stream handling was not consistent, with some functions not using the `stream` they were passed and using `cudf::default_stream_value` instead.

Authors: - Bradley Dice (https://github.com/bdice) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #11600

Fixes compile warning introduced in #11534

```
/cudf/cpp/src/io/json/nested_json_gpu.cu(970): warning #177-D: variable "single_item_count" was declared but never referenced
```

Removed unreferenced variable declaration. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #11607

Adds new strings `like` function to cudf. This is a wildcard-based string matching function based on SQL's LIKE statement. https://www.sqltutorial.org/sql-like/ Though some SQL implementations provide regex-like capabilities in the `like` statement pattern, the implementation here is strictly limited to the `%` (multi-character placeholder) and the `_` (single character placeholder) behavior. It also accepts an optional escape character that can be used when trying to match strings that contain `%` or `_` in them. This is an easier (and faster) alternative to using the regex based `contains` function.

Example usage:

```
s = cudf.Series(["David", "Daniel", "Darcy"])
s.str.like('Da%') ==> [True, True, True]  # starts with 'Da'
s.str.like('_a_i%') ==> [True, True, False]  # 2nd character is 'a' and 4th character is 'i'
s.str.like('_____') ==> [True, False, True]  # match any 5 characters
s.str.like('%y') ==> [False, False, True]  # ends with 'y'
```

This PR includes gtests, pytest, and an nvbench-mark. Reference #10797 Authors: - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) Approvers: - Michael Wang (https://github.com/isVoid) - Tobias Ribizel (https://github.com/upsj) - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11558
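The `%`, `_`, and escape-character semantics described for `like` can be modelled in a few lines of plain Python. This is an illustrative sketch only: it is not the libcudf kernel, the function name `like` and its `escape` argument are chosen here for the example, and it checks one host string at a time rather than a GPU column.

```python
def like(text, pattern, escape=""):
    """Return True if the whole of `text` matches the SQL LIKE `pattern`."""
    # Tokenize the pattern: "any" for %, "one" for _, "lit" for a literal char.
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if escape and ch == escape and i + 1 < len(pattern):
            tokens.append(("lit", pattern[i + 1]))  # escaped char is literal
            i += 2
            continue
        if ch == "%":
            tokens.append(("any", ""))
        elif ch == "_":
            tokens.append(("one", ""))
        else:
            tokens.append(("lit", ch))
        i += 1

    # Two-pointer match that backtracks to the most recent % on a mismatch.
    t = p = 0
    last_any_p = last_any_t = -1
    while t < len(text):
        if p < len(tokens) and tokens[p][0] == "any":
            last_any_p, last_any_t = p, t
            p += 1
        elif p < len(tokens) and (tokens[p][0] == "one" or tokens[p][1] == text[t]):
            p += 1
            t += 1
        elif last_any_p != -1:
            last_any_t += 1          # let the last % absorb one more character
            t = last_any_t
            p = last_any_p + 1
        else:
            return False
    # Any trailing % tokens match the empty string.
    while p < len(tokens) and tokens[p][0] == "any":
        p += 1
    return p == len(tokens)


names = ["David", "Daniel", "Darcy"]
print([like(n, "_a_i%") for n in names])
print(like("100%", "100!%", escape="!"))
```

Run against the three names from the example usage, this model gives the same results as listed there for `'Da%'`, `'_a_i%'`, `'_____'`, and `'%y'`.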

Forward-merge branch-22.08 to branch-22.10 [skip gpuci]

…11604) Fixes: #11487 This PR switches the default value of the `ordered` parameter in `CategoricalDtype` to `False`. This fixes some issues around concat and building categorical columns. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #11604

This PR removes the deprecated `Series.applymap` function. This function does not exist in pandas. Users should switch to using `Series.apply`. (Note that `DataFrame.applymap` does exist in both pandas and cudf.) Deprecated in #10497. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - https://github.com/brandon-b-miller - Michael Wang (https://github.com/isVoid) URL: #11031

Fixes exception thrown in `REDUCTION_NVBENCH`.

```
Run: [121/768] segmented_reduction_simple [Device=0 InputType=bool OutputType=F32 AggregationKinds=2 column_size=100000 num_segments=1000]
Fail: Unexpected error: cuDF failure at: /cudf/cpp/src/reductions/segmented_min.cu:33: segmented_min() operation requires matching output type
```

(There are 144 of these)

The code is refactored to reduce the generated benchmark code in order to match input and output types and to force the output type to `BOOL8` for aggregation kind `ALL`. This also removes generated code that is only _skipped_:

```
Run: [277/768] segmented_reduction_simple [Device=0 InputType=I32 OutputType=I32 AggregationKinds=7 column_size=100000 num_segments=1000]
Skip: Invalid combination of dtype and aggregation type.
```

(There are also 144 of these)

So 37.5% (288/768) of the benchmark runs were either invalid or threw an exception. The code change reduces the benchmarks to 192 valid ones. Matching input/output types should provide sufficient coverage since internally the numeric reductions `SUM` and `PRODUCT` are performed on `int64` or `double` and then the results are cast to the output type. So the `int32` and `float` types will always cover this cast step. Extra measurements for casting `double` to `int32`, for example, should not be necessary. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) URL: #11588

Adds a gtest for `is_timestamp` for valid second value 60 which is allowed only for leap seconds. Added comment for why the range check does not include leap-second. For consistency with Python and Spark, we will not support leap-seconds in `is_timestamp`. Closes #11593 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Tobias Ribizel (https://github.com/upsj) - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) URL: #11594
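The Python behaviour referred to above can be seen with the standard library alone. In this sketch the helper name `python_accepts_timestamp` is hypothetical and it only demonstrates `datetime.strptime`, not cudf's `is_timestamp`.

```python
from datetime import datetime


def python_accepts_timestamp(value, fmt):
    """Return True if datetime.strptime can parse `value` with `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


fmt = "%Y-%m-%d %H:%M:%S"
print(python_accepts_timestamp("2016-12-31 23:59:59", fmt))
# 2016-12-31 23:59:60 UTC was a real leap second, but datetime only
# accepts seconds in the range 0..59, so this is rejected.
print(python_accepts_timestamp("2016-12-31 23:59:60", fmt))
```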

Replaces `cudf::strings::findall` with the implementation from `cudf::strings::findall_record`. As referenced in #11510, the column-based `findall` implementation is unused and made unnecessary by `findall_record`, which returns a lists result. For documentation and discoverability, `findall_record` is renamed to `findall` and the current `findall` implementation is removed. Closes #11510 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) - Jason Lowe (https://github.com/jlowe) URL: #11575

The preceding/following columns constructed for the grouped rolling window functions are temporary columns, to be discarded after aggregation. These should be constructed on the default memory resource. This commit should correct the problem. Authors: - MithunR (https://github.com/mythrocks) Approvers: - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11618

Issue #10941

Changes involved in this PR:

- Rewrites the ORC writer benchmarks with nvbench
- Cleans up the existing implementations by moving nvbench macros to a new `nvbench_helpers.hpp` header
- Fixes two small issues in `orc_writer_chunks` benchmarks
  - Set `state` target stream
  - Set up memory pool
- Removes unused headers
- Gets rid of `cudf_io` aliases
- Fixes a bug in the current ORC writer benchmark: calculating the size of one single file

Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Nghia Truong (https://github.com/ttnghia) URL: #11598

This changed the cmake version for jni building to 3.23.3. Matching PR to [this one](NVIDIA/spark-rapids-jni#512). Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Jason Lowe (https://github.com/jlowe) - Jim Brennan (https://github.com/jbrennan333) URL: #11619

This PR adds the final bits that remove `libcuda.so` as a shared library dependency of `libcudf`. Currently, that's just one place that uses the driver API `cuContextGetCurrent()` to get the current CUDA context and maintain a per-context singleton. It was determined in the discussions within this PR that that could be replaced with a per-device singleton and a call to the runtime API `cudaGetDevice()` instead. Authors: - Ashwin Srinath (https://github.com/shwina) Approvers: - Robert Maynard (https://github.com/robertmaynard) - David Wendt (https://github.com/davidwendt) - Jake Hemstad (https://github.com/jrhemstad) - Wonchan Lee (https://github.com/magnatelee) URL: #11370
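The change from a per-context to a per-device singleton amounts to keying a cache on the device ID instead of the context handle. The sketch below models that idea in plain Python; it is not the libcudf code, the class and parameter names are invented, and `get_device` is a stand-in callable for a runtime call such as `cudaGetDevice()`.

```python
import threading


class PerDeviceSingleton:
    """Lazily creates and caches one instance per device ID."""

    def __init__(self, factory, get_device):
        self._factory = factory        # builds the object for a given device ID
        self._get_device = get_device  # stand-in for querying the current device
        self._instances = {}
        self._lock = threading.Lock()

    def get(self):
        device = self._get_device()
        with self._lock:
            if device not in self._instances:
                self._instances[device] = self._factory(device)
            return self._instances[device]


# Simulate switching the current device between calls.
current = {"id": 0}
cache = PerDeviceSingleton(lambda d: {"device": d}, lambda: current["id"])
a = cache.get()
b = cache.get()      # same device, same object
current["id"] = 1
c = cache.get()      # different device, new object
print(a is b, c is a, c["device"])
```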

This adds a new `multibyte_split` implementation that needs to scan the input only once, and takes full advantage of `byte_range`. To accomplish this, I introduce a new data structure `output_builder` (naming bikeshedding welcome 😄) for pre-allocating outputs of unknown but bounded size. The structure contains a vector of exponentially growing `device_uvector`s, such that either the last vector has size 0 and the second-to-last vector has `size() < capacity()`, or all vectors but the last are full (`size() == capacity()`). It provides the operations

* `next_output(stream)` returns a `split_device_span` of at least the `worst_cast_size` provided at construction, pointing to the next free entries from the last two vectors.
* `advance_output(actual_size)` marks the first `actual_size` entries of the previously returned `split_device_span` as filled. `split_device_span` takes care of writing to the smaller `device_uvector` first, and the larger `device_uvector` second, if the first one is full.
* `gather` copies all elements that were previously written into a single `device_uvector` of the correct size.

This data structure should provide a good balance between allocation overheads and memory usage. I only modified the actual multibyte_split kernel slightly to stop writing offsets once it passes the end of the `byte_range`. This way, we can determine all required offsets from a single scan, regardless of whether we provide a byte range or not.
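The three operations described above can be modelled on the host to make the invariant concrete. The following is a rough Python sketch, not the CUDA implementation: Python lists stand in for `device_uvector`s, the class names and the doubling growth factor are assumptions, there is no stream argument, and the constructor parameter is spelled `worst_case_size` here.

```python
class _Chunk:
    def __init__(self, capacity):
        self.data = [None] * capacity  # stands in for one pre-allocated vector
        self.size = 0                  # number of entries marked as filled

    @property
    def capacity(self):
        return len(self.data)


class SplitSpan:
    """Writable view over the free part of one chunk followed by the next chunk."""

    def __init__(self, head, tail):
        self.head, self.tail = head, tail
        self.head_free = head.capacity - head.size

    def __len__(self):
        return self.head_free + (self.tail.capacity - self.tail.size)

    def __setitem__(self, i, value):
        if i < self.head_free:   # fill the smaller (older) chunk first
            self.head.data[self.head.size + i] = value
        else:                    # then spill into the larger chunk
            self.tail.data[self.tail.size + i - self.head_free] = value


class OutputBuilder:
    def __init__(self, worst_case_size):
        # Two chunks up front; the last one is always empty between calls.
        self.chunks = [_Chunk(worst_case_size), _Chunk(2 * worst_case_size)]

    def next_output(self):
        return SplitSpan(self.chunks[-2], self.chunks[-1])

    def advance_output(self, actual_size):
        head, tail = self.chunks[-2], self.chunks[-1]
        into_head = min(actual_size, head.capacity - head.size)
        head.size += into_head
        tail.size += actual_size - into_head
        if tail.size > 0:
            # The last chunk is now in use: add a larger empty one so the
            # next span again has at least worst_case_size free entries.
            self.chunks.append(_Chunk(2 * tail.capacity))

    def gather(self):
        out = []
        for chunk in self.chunks:
            out.extend(chunk.data[:chunk.size])
        return out


builder = OutputBuilder(worst_case_size=4)
for batch in ([1, 2, 3], [4, 5, 6, 7], [8], [9, 10, 11, 12]):
    span = builder.next_output()
    for i, value in enumerate(batch):
        span[i] = value
    builder.advance_output(len(batch))
result = builder.gather()
print(result)
```

Because the last chunk is empty whenever `next_output` is called and is never smaller than `worst_case_size`, every span has room for a worst-case batch, while the total allocation stays within a constant factor of what is actually written.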
- [x] Depends on #11562 and #11583 for reliable benchmarking

Benchmark results:

<details>

# multibyte_split

## [0] Tesla T4

| source type | delim size | delim percent | size approx | byte_range percent | Ref Time | Cmp Time | Diff | %Diff |
|---------------|--------------|-----------------|---------------|----------------------|------------|------------|-----------------|---------|
| device | 1 | 1 | 2^15 | 1 | 501.373 us | 255.010 us | -246.363 us | -49.14% |
| file | 1 | 1 | 2^15 | 1 | 10.973 ms | 3.740 ms | -7233.161 us | -65.91% |
| host paged | 1 | 1 | 2^15 | 1 | 514.387 us | 258.478 us | -255.909 us | -49.75% |
| host pinned | 1 | 1 | 2^15 | 1 | 523.335 us | 273.483 us | -249.852 us | -47.74% |
| device | 4 | 1 | 2^15 | 1 | 502.462 us | 247.451 us | -255.011 us | -50.75% |
| file | 4 | 1 | 2^15 | 1 | 9.269 ms | 3.717 ms | -5551.573 us | -59.90% |
| host paged | 4 | 1 | 2^15 | 1 | 517.398 us | 251.635 us | -265.763 us | -51.37% |
| host pinned | 4 | 1 | 2^15 | 1 | 527.321 us | 266.926 us | -260.395 us | -49.38% |
| device | 7 | 1 | 2^15 | 1 | 511.079 us | 269.646 us | -241.433 us | -47.24% |
| file | 7 | 1 | 2^15 | 1 | 10.103 ms | 3.719 ms | -6383.437 us | -63.18% |
| host paged | 7 | 1 | 2^15 | 1 | 530.471 us | 275.805 us | -254.666 us | -48.01% |
| host pinned | 7 | 1 | 2^15 | 1 | 539.715 us | 284.943 us | -254.771 us | -47.20% |
| device | 1 | 25 | 2^15 | 1 | 522.656 us | 267.467 us | -255.189 us | -48.83% |
| file | 1 | 25 | 2^15 | 1 | 10.106 ms | 3.715 ms | -6390.694 us | -63.24% |
| host paged | 1 | 25 | 2^15 | 1 | 536.277 us | 273.775 us | -262.502 us | -48.95% |
| host pinned | 1 | 25 | 2^15 | 1 | 549.473 us | 281.362 us | -268.111 us | -48.79% |
| device | 4 | 25 | 2^15 | 1 | 571.825 us | 300.687 us | -271.138 us | -47.42% |
| file | 4 | 25 | 2^15 | 1 | 10.103 ms | 3.714 ms | -6388.482 us | -63.23% |
| host paged | 4 | 25 | 2^15 | 1 | 588.115 us | 303.107 us | -285.008 us | -48.46% |
| host pinned | 4 | 25 | 2^15 | 1 | 600.452 us | 314.675 us | -285.777 us | -47.59% |
| device | 7 | 25 | 2^15 | 1 | 576.380 us | 301.141 us | -275.240 us | -47.75% |
| file | 7 | 25 | 2^15 | 1 | 10.019 ms | 3.705 ms | -6313.942 us | -63.02% |
| host paged | 7 | 25 | 2^15 | 1 | 590.588 us | 307.916 us | -282.673 us | -47.86% |
| host pinned | 7 | 25 | 2^15 | 1 | 605.426 us | 315.630 us | -289.795 us | -47.87% |
| device | 1 | 1 | 2^30 | 1 | 759.165 ms | 6.113 ms | -753052.285 us | -99.19% |
| file | 1 | 1 | 2^30 | 1 | 900.476 ms | 134.544 ms | -765932.007 us | -85.06% |
| host paged | 1 | 1 | 2^30 | 1 | 766.142 ms | 7.020 ms | -759122.773 us | -99.08% |
| host pinned | 1 | 1 | 2^30 | 1 | 813.716 ms | 6.483 ms | -807232.765 us | -99.20% |
| device | 4 | 1 | 2^30 | 1 | 773.977 ms | 6.180 ms | -767797.237 us | -99.20% |
| file | 4 | 1 | 2^30 | 1 | 933.311 ms | 133.473 ms | -799837.803 us | -85.70% |
| host paged | 4 | 1 | 2^30 | 1 | 778.010 ms | 7.066 ms | -770943.560 us | -99.09% |
| host pinned | 4 | 1 | 2^30 | 1 | 785.550 ms | 6.545 ms | -779004.832 us | -99.17% |
| device | 7 | 1 | 2^30 | 1 | 776.541 ms | 6.212 ms | -770328.726 us | -99.20% |
| file | 7 | 1 | 2^30 | 1 | 926.038 ms | 130.654 ms | -795384.376 us | -85.89% |
| host paged | 7 | 1 | 2^30 | 1 | 929.224 ms | 7.113 ms | -922110.964 us | -99.23% |
| host pinned | 7 | 1 | 2^30 | 1 | 808.404 ms | 6.581 ms | -801823.263 us | -99.19% |
| device | 1 | 25 | 2^30 | 1 | 649.553 ms | 4.856 ms | -644696.331 us | -99.25% |
| file | 1 | 25 | 2^30 | 1 | 754.855 ms | 100.332 ms | -654522.457 us | -86.71% |
| host paged | 1 | 25 | 2^30 | 1 | 797.427 ms | 5.755 ms | -791672.232 us | -99.28% |
| host pinned | 1 | 25 | 2^30 | 1 | 694.769 ms | 5.235 ms | -689534.397 us | -99.25% |
| device | 4 | 25 | 2^30 | 1 | 803.722 ms | 6.831 ms | -796891.259 us | -99.15% |
| file | 4 | 25 | 2^30 | 1 | 1.089 s | 122.617 ms | -965958.117 us | -88.74% |
| host paged | 4 | 25 | 2^30 | 1 | 809.025 ms | 7.731 ms | -801293.659 us | -99.04% |
| host pinned | 4 | 25 | 2^30 | 1 | 863.132 ms | 7.206 ms | -855926.593 us | -99.17% |
| device | 7 | 25 | 2^30 | 1 | 822.436 ms | 6.828 ms | -815608.102 us | -99.17% |
| file | 7 | 25 | 2^30 | 1 | 953.459 ms | 125.146 ms | -828312.756 us | -86.87% |
| host paged | 7 | 25 | 2^30 | 1 | 878.591 ms | 7.729 ms | -870862.213 us | -99.12% |
| host pinned | 7 | 25 | 2^30 | 1 | 853.347 ms | 7.202 ms | -846145.222 us | -99.16% |
| device | 1 | 1 | 2^15 | 5 | 494.135 us | 267.240 us | -226.896 us | -45.92% |
| file | 1 | 1 | 2^15 | 5 | 8.396 ms | 3.790 ms | -4605.974 us | -54.86% |
| host paged | 1 | 1 | 2^15 | 5 | 511.952 us | 267.649 us | -244.304 us | -47.72% |
| host pinned | 1 | 1 | 2^15 | 5 | 523.247 us | 281.380 us | -241.868 us | -46.22% |
| device | 4 | 1 | 2^15 | 5 | 513.061 us | 277.739 us | -235.322 us | -45.87% |
| file | 4 | 1 | 2^15 | 5 | 8.393 ms | 3.745 ms | -4647.934 us | -55.38% |
| host paged | 4 | 1 | 2^15 | 5 | 532.465 us | 283.033 us | -249.433 us | -46.84% |
| host pinned | 4 | 1 | 2^15 | 5 | 541.863 us | 292.102 us | -249.761 us | -46.09% |
| device | 7 | 1 | 2^15 | 5 | 536.721 us | 282.465 us | -254.256 us | -47.37% |
| file | 7 | 1 | 2^15 | 5 | 8.374 ms | 3.770 ms | -4603.511 us | -54.97% |
| host paged | 7 | 1 | 2^15 | 5 | 554.098 us | 285.165 us | -268.933 us | -48.54% |
| host pinned | 7 | 1 | 2^15 | 5 | 566.133 us | 298.437 us | -267.696 us | -47.29% |
| device | 1 | 25 | 2^15 | 5 | 523.197 us | 274.622 us | -248.575 us | -47.51% |
| file | 1 | 25 | 2^15 | 5 | 8.383 ms | 3.721 ms | -4661.680 us | -55.61% |
| host paged | 1 | 25 | 2^15 | 5 | 538.202 us | 274.924 us | -263.278 us | -48.92% |
| host pinned | 1 | 25 | 2^15 | 5 | 552.161 us | 288.881 us | -263.280 us | -47.68% |
| device | 4 | 25 | 2^15 | 5 | 571.249 us | 300.540 us | -270.709 us | -47.39% |
| file | 4 | 25 | 2^15 | 5 | 8.365 ms | 3.729 ms | -4636.104 us | -55.42% |
| host paged | 4 | 25 | 2^15 | 5 | 587.178 us | 302.430 us | -284.748 us | -48.49% |
| host pinned | 4 | 25 | 2^15 | 5 | 600.085 us | 314.728 us | -285.357 us | -47.55% |
| device | 7 | 25 | 2^15 | 5 | 576.040 us | 301.207 us | -274.833 us | -47.71% |
| file | 7 | 25 | 2^15 | 5 | 8.557 ms | 3.728 ms | -4828.929 us | -56.43% |
| host paged | 7 | 25 | 2^15 | 5 | 605.849 us | 306.993 us | -298.856 us | -49.33% |
| host pinned | 7 | 25 | 2^15 | 5 | 605.730 us | 315.480 us | -290.250 us | -47.92% |
| device | 1 | 1 | 2^30 | 5 | 760.295 ms | 26.647 ms | -733647.234 us | -96.50% |
| file | 1 | 1 | 2^30 | 5 | 900.988 ms | 144.221 ms | -756766.941 us | -83.99% |
| host paged | 1 | 1 | 2^30 | 5 | 861.858 ms | 27.640 ms | -834218.288 us | -96.79% |
| host pinned | 1 | 1 | 2^30 | 5 | 814.384 ms | 27.071 ms | -787313.049 us | -96.68% |
| device | 4 | 1 | 2^30 | 5 | 774.991 ms | 26.829 ms | -748162.355 us | -96.54% |
| file | 4 | 1 | 2^30 | 5 | 939.380 ms | 145.972 ms | -793408.072 us | -84.46% |
| host paged | 4 | 1 | 2^30 | 5 | 824.653 ms | 27.778 ms | -796875.210 us | -96.63% |
| host pinned | 4 | 1 | 2^30 | 5 | 780.742 ms | 27.220 ms | -753521.950 us | -96.51% |
| device | 7 | 1 | 2^30 | 5 | 777.662 ms | 26.946 ms | -750715.400 us | -96.53% |
| file | 7 | 1 | 2^30 | 5 | 1.062 s | 146.987 ms | -915037.565 us | -86.16% |
| host paged | 7 | 1 | 2^30 | 5 | 792.441 ms | 27.890 ms | -764551.527 us | -96.48% |
| host pinned | 7 | 1 | 2^30 | 5 | 844.303 ms | 27.352 ms | -816951.184 us | -96.76% |
| device | 1 | 25 | 2^30 | 5 | 650.803 ms | 24.079 ms | -626723.464 us | -96.30% |
| file | 1 | 25 | 2^30 | 5 | 819.158 ms | 113.911 ms | -705247.805 us | -86.09% |
| host paged | 1 | 25 | 2^30 | 5 | 892.323 ms | 25.012 ms | -867311.636 us | -97.20% |
| host pinned | 1 | 25 | 2^30 | 5 | 705.445 ms | 24.465 ms | -680980.519 us | -96.53% |
| device | 4 | 25 | 2^30 | 5 | 804.875 ms | 27.849 ms | -777026.452 us | -96.54% |
| file | 4 | 25 | 2^30 | 5 | 939.873 ms | 139.019 ms | -800853.228 us | -85.21% |
| host paged | 4 | 25 | 2^30 | 5 | 818.237 ms | 28.763 ms | -789474.346 us | -96.48% |
| host pinned | 4 | 25 | 2^30 | 5 | 982.444 ms | 28.217 ms | -954226.125 us | -97.13% |
| device | 7 | 25 | 2^30 | 5 | 823.389 ms | 29.782 ms | -793606.771 us | -96.38% |
| file | 7 | 25 | 2^30 | 5 | 1.009 s | 146.087 ms | -862689.868 us | -85.52% |
| host paged | 7 | 25 | 2^30 | 5 | 1.064 s | 30.730 ms | -1033256.695 us | -97.11% |
| host pinned | 7 | 25 | 2^30 | 5 | 877.682 ms | 30.180 ms | -847501.645 us | -96.56% |
| device | 1 | 1 | 2^15 | 25 | 495.226 us | 260.457 us | -234.769 us | -47.41% |
| file | 1 | 1 | 2^15 | 25 | 8.485 ms | 3.781 ms | -4703.671 us | -55.43% |
| host paged | 1 | 1 | 2^15 | 25 | 512.784 us | 267.446 us | -245.337 us | -47.84% |
| host pinned | 1 | 1 | 2^15 | 25 | 524.160 us | 273.471 us | -250.689 us | -47.83% |
| device | 4 | 1 | 2^15 | 25 | 513.812 us | 268.105 us | -245.707 us | -47.82% |
| file | 4 | 1 | 2^15 | 25 | 8.494 ms | 3.779 ms | -4715.096 us | -55.51% |
| host paged | 4 | 1 | 2^15 | 25 | 534.551 us | 274.615 us | -259.936 us | -48.63% |
| host pinned | 4 | 1 | 2^15 | 25 | 542.672 us | 282.072 us | -260.600 us | -48.02% |
| device | 7 | 1 | 2^15 | 25 | 535.988 us | 277.391 us | -258.597 us | -48.25% |
| file | 7 | 1 | 2^15 | 25 | 8.344 ms | 3.725 ms | -4619.012 us | -55.36% |
| host paged | 7 | 1 | 2^15 | 25 | 554.427 us | 284.250 us | -270.176 us | -48.73% |
| host pinned | 7 | 1 | 2^15 | 25 | 567.662 us | 295.147 us | -272.515 us | -48.01% |
| device | 1 | 25 | 2^15 | 25 | 523.935 us | 275.170 us | -248.765 us | -47.48% |
| file | 1 | 25 | 2^15 | 25 | 8.333 ms | 3.734 ms | -4599.214 us | -55.19% |
| host paged | 1 | 25 | 2^15 | 25 | 539.151 us | 277.941 us | -261.210 us | -48.45% |
| host pinned | 1 | 25 | 2^15 | 25 | 553.701 us | 289.672 us | -264.028 us | -47.68% |
| device | 4 | 25 | 2^15 | 25 | 572.250 us | 300.713 us | -271.537 us | -47.45% |
| file | 4 | 25 | 2^15 | 25 | 8.467 ms | 3.732 ms | -4734.345 us | -55.92% |
| host paged | 4 | 25 | 2^15 | 25 | 589.709 us | 306.826 us | -282.883 us | -47.97% |
| host pinned | 4 | 25 | 2^15 | 25 | 602.942 us | 314.485 us | -288.457 us | -47.84% |
| device | 7 | 25 | 2^15 | 25 | 577.125 us | 301.039 us | -276.086 us | -47.84% |
| file | 7 | 25 | 2^15 | 25 | 8.477 ms | 3.766 ms | -4711.039 us | -55.57% |
| host paged | 7 | 25 | 2^15 | 25 | 593.513 us | 307.753 us | -285.760 us | -48.15% |
| host pinned | 7 | 25 | 2^15 | 25 | 606.763 us | 315.183 us | -291.580 us | -48.06% |
| device | 1 | 1 | 2^30 | 25 | 765.578 ms | 129.396 ms | -636182.115 us | -83.10% |
| file | 1 | 1 | 2^30 | 25 | 929.694 ms | 221.490 ms | -708204.682 us | -76.18% |
| host paged | 1 | 1 | 2^30 | 25 | 954.516 ms | 130.558 ms | -823957.460 us | -86.32% |
| host pinned | 1 | 1 | 2^30 | 25 | 966.739 ms | 129.951 ms | -836787.576 us | -86.56% |
| device | 4 | 1 | 2^30 | 25 | 780.141 ms | 132.059 ms | -648082.202 us | -83.07% |
| file | 4 | 1 | 2^30 | 25 | 955.247 ms | 224.799 ms | -730447.507 us | -76.47% |
| host paged | 4 | 1 | 2^30 | 25 | 906.541 ms | 133.090 ms | -773450.918 us | -85.32% |
| host pinned | 4 | 1 | 2^30 | 25 | 803.184 ms | 132.560 ms | -670623.617 us | -83.50% |
| device | 7 | 1 | 2^30 | 25 | 782.807 ms | 132.482 ms | -650324.997 us | -83.08% |
| file | 7 | 1 | 2^30 | 25 | 1.006 s | 224.442 ms | -781614.108 us | -77.69% |
| host paged | 7 | 1 | 2^30 | 25 | 879.983 ms | 133.681 ms | -746302.241 us | -84.81% |
| host pinned | 7 | 1 | 2^30 | 25 | 806.090 ms | 133.161 ms | -672929.604 us | -83.48% |
| device | 1 | 25 | 2^30 | 25 | 657.313 ms | 116.237 ms | -541076.105 us | -82.32% |
| file | 1 | 25 | 2^30 | 25 | 792.993 ms | 187.620 ms | -605372.917 us | -76.34% |
| host paged | 1 | 25 | 2^30 | 25 | 707.506 ms | 117.467 ms | -590039.693 us | -83.40% |
| host pinned | 1 | 25 | 2^30 | 25 | 714.342 ms | 116.772 ms | -597570.059 us | -83.65% |
| device | 4 | 25 | 2^30 | 25 | 810.447 ms | 138.744 ms | -671702.648 us | -82.88% |
| file | 4 | 25 | 2^30 | 25 | 984.757 ms | 226.860 ms | -757897.389 us | -76.96% |
| host paged | 4 | 25 | 2^30 | 25 | 870.188 ms | 139.849 ms | -730338.049 us | -83.93% |
| host pinned | 4 | 25 | 2^30 | 25 | 1.005 s | 139.282 ms | -865389.388 us | -86.14% |
| device | 7 | 25 | 2^30 | 25 | 828.847 ms | 142.057 ms | -686789.436 us | -82.86% |
| file | 7 | 25 | 2^30 | 25 | 986.067 ms | 231.655 ms | -754412.174 us | -76.51% |
| host paged | 7 | 25 | 2^30 | 25 | 929.677 ms | 143.445 ms | -786232.463 us | -84.57% |
| host pinned | 7 | 25 | 2^30 | 25 | 913.866 ms | 142.981 ms | -770884.883 us | -84.35% |
| device | 1 | 1 | 2^15 | 50 | 494.177 us | 259.079 us | -235.098 us | -47.57% |
| file | 1 | 1 | 2^15 | 50 | 8.305 ms | 3.781 ms | -4524.620 us | -54.48% |
| host paged | 1 | 1 | 2^15 | 50 | 515.479 us | 267.290 us | -248.189 us | -48.15% |
| host pinned | 1 | 1 | 2^15 | 50 | 527.832 us | 273.050 us | -254.781 us | -48.27% |
| device | 4 | 1 | 2^15 | 50 | 513.578 us | 267.341 us | -246.237 us | -47.95% |
| file | 4 | 1 | 2^15 | 50 | 8.478 ms | 3.784 ms | -4694.029 us | -55.37% |
| host paged | 4 | 1 | 2^15 | 50 | 536.324 us | 274.019 us | -262.305 us | -48.91% |
| host pinned | 4 | 1 | 2^15 | 50 | 545.107 us | 281.623 us | -263.484 us | -48.34% |
| device | 7 | 1 | 2^15 | 50 | 536.213 us | 279.726 us | -256.487 us | -47.83% |
| file | 7 | 1 | 2^15 | 50 | 8.402 ms | 3.790 ms | -4611.689 us | -54.89% |
| host paged | 7 | 1 | 2^15 | 50 | 558.127 us | 285.683 us | -272.444 us | -48.81% |
| host pinned | 7 | 1 | 2^15 | 50 | 569.002 us | 298.287 us | -270.715 us | -47.58% |
| device | 1 | 25 | 2^15 | 50 | 524.190 us | 275.036 us | -249.154 us | -47.53% |
| file | 1 | 25 | 2^15 | 50 | 8.467 ms | 3.719 ms | -4748.710 us | -56.08% |
| host paged | 1 | 25 | 2^15 | 50 | 540.475 us | 282.192 us | -258.283 us | -47.79% |
| host pinned | 1 | 25 | 2^15 | 50 | 554.028 us | 288.838 us | -265.189 us | -47.87% |
| device | 4 | 25 | 2^15 | 50 | 571.287 us | 300.003 us | -271.283 us | -47.49% |
| file | 4 | 25 | 2^15 | 50 | 8.342 ms | 3.732 ms | -4609.666 us | -55.26% |
| host paged | 4 | 25 | 2^15 | 50 | 591.604 us | 308.682 us | -282.922 us | -47.82% |
| host pinned | 4 | 25 | 2^15 | 50 | 603.550 us | 314.584 us | -288.966 us | -47.88% |
| device | 7 | 25 | 2^15 | 50 | 576.753 us | 300.907 us | -275.846 us | -47.83% |
| file | 7 | 25 | 2^15 | 50 | 8.353 ms | 3.725 ms | -4628.266 us | -55.41% |
| host paged | 7 | 25 | 2^15 | 50 | 595.478 us | 309.244 us | -286.235 us | -48.07% |
| host pinned | 7 | 25 | 2^15 | 50 | 609.549 us | 315.463 us | -294.087 us | -48.25% |
| device | 1 | 1 | 2^30 | 50 | 772.301 ms | 259.032 ms | -513268.599 us | -66.46% |
| file | 1 | 1 | 2^30 | 50 | 969.116 ms | 321.799 ms | -647316.758 us | -66.79% |
| host paged | 1 | 1 | 2^30 | 50 | 1.141 s | 260.091 ms | -881389.084 us | -77.21% |
| host pinned | 1 | 1 | 2^30 | 50 | 949.102 ms | 259.552 ms | -689549.243 us | -72.65% |
| device | 4 | 1 | 2^30 | 50 | 786.655 ms | 262.064 ms | -524590.867 us | -66.69% |
| file | 4 | 1 | 2^30 | 50 | 1.098 s | 326.710 ms | -770958.900 us | -70.24% |
| host paged | 4 | 1 | 2^30 | 50 | 948.089 ms | 263.495 ms | -684593.933 us | -72.21% |
| host pinned | 4 | 1 | 2^30 | 50 | 907.653 ms | 262.806 ms | -644846.480 us | -71.05% |
| device | 7 | 1 | 2^30 | 50 | 789.401 ms | 263.368 ms | -526033.714 us | -66.64% |
| file | 7 | 1 | 2^30 | 50 | 1.083 s | 327.480 ms | -755255.187 us | -69.75% |
| host paged | 7 | 1 | 2^30 | 50 | 962.268 ms | 264.524 ms | -697744.071 us | -72.51% |
| host pinned | 7 | 1 | 2^30 | 50 | 971.881 ms | 264.065 ms | -707815.282 us | -72.83% |
| device | 1 | 25 | 2^30 | 50 | 665.425 ms | 232.463 ms | -432961.322 us | -65.07% |
| file | 1 | 25 | 2^30 | 50 | 911.572 ms | 282.358 ms | -629214.772 us | -69.03% |
| host paged | 1 | 25 | 2^30 | 50 | 924.965 ms | 234.220 ms | -690745.245 us | -74.68% |
| host pinned | 1 | 25 | 2^30 | 50 | 699.093 ms | 233.032 ms | -466060.898 us | -66.67% |
| device | 4 | 25 | 2^30 | 50 | 817.301 ms | 276.820 ms | -540481.648 us | -66.13% |
| file | 4 | 25 | 2^30 | 50 | 1.028 s | 337.968 ms | -690458.768 us | -67.14% |
| host paged | 4 | 25 | 2^30 | 50 | 989.999 ms | 278.600 ms | -711398.589 us | -71.86% |
| host pinned | 4 | 25 | 2^30 | 50 | 859.267 ms | 278.173 ms | -581093.622 us | -67.63% |
| device | 7 | 25 | 2^30 | 50 | 835.640 ms | 282.740 ms | -552899.603 us | -66.16% |
| file | 7 | 25 | 2^30 | 50 | 1.028 s | 344.770 ms | -682948.230 us | -66.45% |
| host paged | 7 | 25 | 2^30 | 50 | 1.175 s | 283.889 ms | -890653.851 us | -75.83% |
| host pinned | 7 | 25 | 2^30 | 50 | 878.720 ms | 283.505 ms | -595215.368 us | -67.74% |
| device | 1 | 1 | 2^15 | 100 | 495.879 us | 265.133 us | -230.747 us | -46.53% |
| file | 1 | 1 | 2^15 | 100 | 8.474 ms | 3.771 ms | -4703.913 us | -55.51% |
| host paged | 1 | 1 | 2^15 | 100 | 521.063 us | 272.803 us | -248.260 us | -47.64% |
| host pinned | 1 | 1 | 2^15 | 100 | 535.848 us | 277.043 us | -258.805 us | -48.30% |
| device | 4 | 1 | 2^15 | 100 | 513.719 us | 271.754 us | -241.965 us | -47.10% |
| file | 4 | 1 | 2^15 | 100 | 8.369 ms | 3.767 ms | -4602.203 us | -54.99% |
| host paged | 4 | 1 | 2^15 | 100 | 542.547 us | 281.336 us | -261.212 us | -48.15% |
| host pinned | 4 | 1 | 2^15 | 100 | 553.792 us | 290.824 us | -262.968 us | -47.48% |
| device | 7 | 1 | 2^15 | 100 | 536.841 us | 284.016 us | -252.826 us | -47.10% |
| file | 7 | 1 | 2^15 | 100 | 8.390 ms | 3.764 ms | -4626.076 us | -55.14% |
| host paged | 7 | 1 | 2^15 | 100 | 563.217 us | 292.971 us | -270.246 us | -47.98% |
| host pinned | 7 | 1 | 2^15 | 100 | 578.990 us | 302.168 us | -276.822 us | -47.81% |
| device | 1 | 25 | 2^15 | 100 | 524.359 us | 280.308 us | -244.051 us | -46.54% |
| file | 1 | 25 | 2^15 | 100 | 8.374 ms | 3.776 ms | -4598.153 us | -54.91% |
| host paged | 1 | 25 | 2^15 | 100 | 560.218 us | 288.964 us | -271.253 us | -48.42% |
| host pinned | 1 | 25 | 2^15 | 100 | 566.101 us | 294.120 us | -271.980 us | -48.04% |
| device | 4 | 25 | 2^15 | 100 | 582.539 us | 305.593 us | -276.946 us | -47.54% |
| file | 4 | 25 | 2^15 | 100 | 8.445 ms | 3.775 ms | -4670.815 us | -55.31% |
| host paged | 4 | 25 | 2^15 | 100 | 596.100 us | 314.954 us | -281.146 us | -47.16% |
| host pinned | 4 | 25 | 2^15 | 100 | 606.737 us | 319.597 us | -287.140 us | -47.33% |
| device | 7 | 25 | 2^15 | 100 | 577.114 us | 305.427 us | -271.687 us | -47.08% |
| file | 7 | 25 | 2^15 | 100 | 8.459 ms | 3.775 ms | -4683.776 us | -55.37% |
| host paged | 7 | 25 | 2^15 | 100 | 600.814 us | 315.401 us | -285.413 us | -47.50% |
| host pinned | 7 | 25 | 2^15 | 100 | 617.354 us | 323.114 us | -294.240 us | -47.66% |
| device | 1 | 1 | 2^30 | 100 | 785.740 ms | 515.240 ms | -270499.817 us | -34.43% |
| file | 1 | 1 | 2^30 | 100 | 1.043 s | 521.471 ms | -521600.353 us | -50.01% |
| host paged | 1 | 1 | 2^30 | 100 | 1.030 s | 518.769 ms | -511254.716 us | -49.64% |
| host pinned | 1 | 1 | 2^30 | 100 | 922.861 ms | 517.999 ms | -404862.705 us | -43.87% |
| device | 4 | 1 | 2^30 | 100 | 800.010 ms | 521.530 ms | -278479.331 us | -34.81% |
| file | 4 | 1 | 2^30 | 100 | 1.190 s | 527.630 ms | -661958.288 us | -55.65% |
| host paged | 4 | 1 | 2^30 | 100 | 1.090 s | 524.598 ms | -565287.699 us | -51.87% |
| host pinned | 4 | 1 | 2^30 | 100 | 1.059 s | 524.101 ms | -535396.658 us | -50.53% |
| device | 7 | 1 | 2^30 | 100 | 802.697 ms | 524.818 ms | -277878.923 us | -34.62% |
| file | 7 | 1 | 2^30 | 100 | 1.122 s | 531.504 ms | -590129.710 us | -52.61% |
| host paged | 7 | 1 | 2^30 | 100 | 1.365 s | 527.805 ms | -837014.954 us | -61.33% |
| host pinned | 7 | 1 | 2^30 | 100 | 942.695 ms | 527.151 ms | -415543.608 us | -44.08% |
| device | 1 | 25 | 2^30 | 100 | 681.754 ms | 461.792 ms | -219961.806 us | -32.26% |
| file | 1 | 25 | 2^30 | 100 | 888.544 ms | 467.517 ms | -421027.176 us | -47.38% |
| host paged | 1 | 25 | 2^30 | 100 | 873.000 ms | 464.901 ms | -408099.261 us | -46.75% |
| host pinned | 1 | 25 | 2^30 | 100 | 798.295 ms | 463.761 ms | -334534.095 us | -41.91% |
| device | 4 | 25 | 2^30 | 100 | 831.389 ms | 550.050 ms | -281338.656 us | -33.84% |
| file | 4 | 25 | 2^30 | 100 | 1.079 s | 555.891 ms | -522743.242 us | -48.46% |
| host paged 
| 4 | 25 | 2^30 | 100 | 1.060 s | 552.930 ms | -506979.337 us | -47.83% | | host pinned | 4 | 25 | 2^30 | 100 | 945.572 ms | 552.551 ms | -393021.062 us | -41.56% | | device | 7 | 25 | 2^30 | 100 | 849.229 ms | 562.074 ms | -287155.401 us | -33.81% | | file | 7 | 25 | 2^30 | 100 | 1.104 s | 567.974 ms | -535690.202 us | -48.54% | | host paged | 7 | 25 | 2^30 | 100 | 1.375 s | 565.169 ms | -809622.226 us | -58.89% | | host pinned | 7 | 25 | 2^30 | 100 | 1.099 s | 564.743 ms | -534429.395 us | -48.62% | </details> ## TODO - [x] Handle `byte_range` edge cases - [x] Handle issues with large inputs - [x] ~~Extend to overlapping delimiters by providing `previous_chunk` support for `data_chunk_source`~~ That will be another PR Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Christopher Harris (https://github.com/cwharris) - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #11500

Moving this header from `src/strings` to `include/cudf/strings/detail` to be accessed by the strings-udf work. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #11585

Rework the `contains_scalar` function to check nulls at runtime and reduce the amount of generated code. The `contains_scalar.cu` compile time increased dramatically recently: https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11743/Build_20Metrics_20Report/ The change here is expected to improve compile time significantly. The nvbench benchmarks showed no change in performance. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Tobias Ribizel (https://github.com/upsj) URL: #11622

This improves the `multibyte_split` kernel by

* Reducing register pressure: Instead of storing `ITEMS_PER_THREAD` individual states, store only the initial `multistate` for the thread and recompute the individual states on-the-fly
* Eliminating local memory usage: Manipulate the `multistate` via shifts instead of array random access
* Eliminating trie overhead: Since we have only a single delimiter, the trie is a path. We can do the traversal implicitly
* Memoizing which chars were a match: We don't need to recompute this information, but can store it in a bitmask
* Changing the block load algorithm: `BLOCK_LOAD_VECTORIZE` was slightly less efficient than `BLOCK_LOAD_WARP_TRANSPOSE`
* Reducing the allocation overhead by limiting the `output_builder` max allocation size
* Tuning the parameters: `ITEMS_PER_THREAD = 64` works better, and we can improve performance further by operating on larger chunks

Overall, this gives a roughly 2x speedup in my benchmarks. Based on #11500

Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) URL: #11587
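The idea of treating a single delimiter as a path-shaped trie and tracking all partial matches in one bitmask can be illustrated with the classic shift-and (bitap) matcher. This is a minimal CPU-side sketch in Python, not the CUDA kernel from this PR; the function name `find_delimiter_ends` is hypothetical and chosen only for illustration.

```python
def find_delimiter_ends(data: bytes, delimiter: bytes) -> list:
    """Return the index just past every occurrence of `delimiter` in `data`.

    Bit i of `state` is set when the last i+1 bytes read equal
    delimiter[:i+1], so one integer tracks every partial match at once.
    """
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    # For each byte value, precompute which delimiter positions it matches.
    masks = [0] * 256
    for i, byte in enumerate(delimiter):
        masks[byte] |= 1 << i
    accept = 1 << (len(delimiter) - 1)
    state = 0
    ends = []
    for pos, byte in enumerate(data):
        # Advance every partial match by one position (shift), start a new
        # match at position 0 (| 1), and keep only those that still agree.
        state = ((state << 1) | 1) & masks[byte]
        if state & accept:
            ends.append(pos + 1)
    return ends


print(find_delimiter_ends(b"a::b::c", b"::"))
```

Because the state update is a shift followed by a mask, no per-position array of states and no explicit trie node lookup is needed, which is the property the bullet points above rely on.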

Moves the `is_xxxx(data_type)` function definitions from `traits.hpp` to `traits.cpp`. There were only a few places that called these functions from device code. A couple of these were potentially problematic (e.g. `row_comparator.cuh`) and were corrected, one was avoidable (`column_statistics.cuh`), and the final one (`column_utils.cuh`) was easily worked around. The main goal was to help improve the include dependencies when building code that uses libcudf APIs. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Jake Hemstad (https://github.com/jrhemstad) - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #11616

…11627) As brought up in #11626, converted type being written on INT32 and INT64 columns is not out of spec, but abnormal for parquet writers. This change brings cudf's writer in line with pandas and spark by not including converted type information on these types. closes #11626 Authors: - Mike Wilson (https://github.com/hyperbolic2346) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Jim Brennan (https://github.com/jbrennan333) - Yunsong Wang (https://github.com/PointKernel) URL: #11627

Adds nvbench for nested json parser. Depends on #11388 Authors: - Karthikeyan (https://github.com/karthikeyann) - Elias Stehle (https://github.com/elstehle) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #11466

Fixes compile error introduced in PR #11466 due to mismatched changes occurring in PR #11534 https://gpuci.gpuopenanalytics.com/job/rapidsai/job/gpuci/job/cudf/job/prb/job/cudf-cpu-cuda-build/CUDA=11.5/11851/console Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Elias Stehle (https://github.com/elstehle) - Karthikeyan (https://github.com/karthikeyann) - Nghia Truong (https://github.com/ttnghia) URL: #11637

Fixes: #11602

This PR:

- [x] Fixes issues with extracting host scalar values from nested struct & list columns by constructing the metadata properly for the libcudf `to_arrow` API to function correctly.
- [x] Changes the `to_arrow` API to take a more detailed `cols_dtypes` dict, which has a `column-name:column-dtype` mapping. This helps preserve both parent column names and child column names in the case of nested dtypes.
- [x] Introduces tests to extract host scalars of nested types.

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #11612

#11302 added `STATISTICS_COLUMN` to the `statistics_freq` enum in libcudf. This adds the same to python. Authors: - Ed Seidl (https://github.com/etseidl) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Ashwin Srinath (https://github.com/shwina) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11453

@vuule

This PR unifies the type casting logic across the existing JSON lines reader, the CSV reader, and the new nested JSON reader.

_I'm taking over this excellent piece of work from @vuule, who did all the heavy lifting!_ 🙏 I just adapted the parsing of string values for things like escape sequences, optionally stripping off quotation characters, etc., to meet all the requirements for the new nested JSON parser.

## Supported escape sequences:

We decided to follow [ECMA-404](https://www.ecma-international.org/wp-content/uploads/ECMA-404_2nd_edition_december_2017.pdf#page=12) and [RFC 7159](https://www.ietf.org/rfc/rfc7159.txt).

```
\" represents the quotation mark character (U+0022).
\\ represents the reverse solidus character (U+005C).
\/ represents the solidus character (U+002F).
\b represents the backspace character (U+0008).
\f represents the form feed character (U+000C).
\n represents the line feed character (U+000A).
\r represents the carriage return character (U+000D).
\t represents the character tabulation character (U+0009).
\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, for code points on the BMP
\uDDDD\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, representing UTF-16 surrogate pairs for remaining unicode code points
```

Authors: - Elias Stehle (https://github.com/elstehle) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) URL: #11613
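To make the escape rules listed above concrete, here is a small pure-Python decoder that handles the same sequences, including UTF-16 surrogate pairs. It is a sketch for illustration only and is not the libcudf implementation; `unescape_json_string` and `_read_hex4` are hypothetical names.

```python
SIMPLE_ESCAPES = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _read_hex4(text, pos):
    """Parse exactly four hex digits starting at `pos`."""
    digits = text[pos:pos + 4]
    if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
        raise ValueError(f"invalid \\u escape at offset {pos}")
    return int(digits, 16)


def unescape_json_string(text):
    """Decode the escape sequences of a JSON string body (quotes removed)."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of input")
        esc = text[i + 1]
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 2
        elif esc == "u":
            code = _read_hex4(text, i + 2)
            i += 6
            # A high surrogate followed by a low surrogate forms one code point.
            if 0xD800 <= code <= 0xDBFF and text[i:i + 2] == "\\u":
                low = _read_hex4(text, i + 2)
                if 0xDC00 <= low <= 0xDFFF:
                    code = 0x10000 + ((code - 0xD800) << 10) + (low - 0xDC00)
                    i += 6
            out.append(chr(code))
        else:
            raise ValueError(f"unsupported escape \\{esc}")
    return "".join(out)


print(unescape_json_string(r"tab:\there, smile: \uD83D\uDE00"))
```

The surrogate-pair branch is the only place where two escapes combine into one output character; every other escape maps one-to-one.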

Fixes `cudf::strings::zfill` to match Python's `zfill` behavior. This will match Pandas 1.5 `zfill` as well. The new behavior correctly skips the leading sign character when applying the '0' character fill. Updated the gtests and added more test data. The pytest was updated to xfail for test data with leading sign characters until Pandas 1.5 is supported. The Java tests did not include any test data with sign characters. Closes #11632 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #11634
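For reference, this is the behavior of Python's built-in `str.zfill` that the fix aligns with: the fill is inserted after a leading `+` or `-`. The example uses only the standard library, not cudf, and contrasts `zfill` with a plain left pad to show the difference.

```python
values = ["42", "-42", "+42", "-1234567", ""]

# str.zfill keeps a leading sign in front of the zero fill.
sign_aware = [v.zfill(6) for v in values]

# A plain left pad treats the sign like any other character.
naive = [v.rjust(6, "0") for v in values]

for original, a, b in zip(values, sign_aware, naive):
    print(repr(original), repr(a), repr(b))
```

Strings that are already at least as wide as the requested width are returned unchanged in both cases.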

…1597) Closes #11486 Authors: - https://github.com/brandon-b-miller Approvers: - Ashwin Srinath (https://github.com/shwina) - Matthew Roeschke (https://github.com/mroeschke) URL: #11597

Issue #10941

This PR refactors parquet reader benchmarks with nvbench and reduces the number of test cases by isolating data type, compression type and IO type. It also sets `min_samples` in orc benchmarks to reduce runtime.

Example output of the new benchmarks:

<details>
<summary>Benchmark results</summary>

## parquet_read_decode

### [0] Quadro RTX 8000

| data_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|------------|-------|------------|-------|------------------|-------------------|-------------------|
| INTEGRAL | 0 | 1 | 21x | 724.748 ms | 0.50% | 724.748 ms | 0.50% | 740769526 | 1.014 GiB | 506.565 MiB |
| INTEGRAL | 1000 | 1 | 18x | 252.209 ms | 0.49% | 252.205 ms | 0.49% | 2128710298 | 698.017 MiB | 165.810 MiB |
| INTEGRAL | 0 | 32 | 255x | 58.948 ms | 1.43% | 58.941 ms | 1.43% | 9108549008 | 561.160 MiB | 27.592 MiB |
| INTEGRAL | 1000 | 32 | 371x | 40.414 ms | 0.65% | 40.407 ms | 0.65% | 13286592857 | 547.897 MiB | 14.368 MiB |
| FLOAT | 0 | 1 | 19x | 718.892 ms | 0.49% | 718.889 ms | 0.49% | 746806205 | 1.010 GiB | 510.303 MiB |
| FLOAT | 1000 | 1 | 19x | 166.333 ms | 0.49% | 166.326 ms | 0.49% | 3227819573 | 633.646 MiB | 110.206 MiB |
| FLOAT | 0 | 32 | 12x | 45.956 ms | 0.49% | 45.951 ms | 0.49% | 11683615377 | 547.070 MiB | 23.587 MiB |
| FLOAT | 1000 | 32 | 19x | 26.427 ms | 0.41% | 26.420 ms | 0.41% | 20320912461 | 533.362 MiB | 9.888 MiB |
| DECIMAL | 0 | 1 | 10x | 247.598 ms | 0.50% | 247.592 ms | 0.50% | 2168372563 | 678.171 MiB | 141.643 MiB |
| DECIMAL | 1000 | 1 | 17x | 74.017 ms | 0.49% | 74.009 ms | 0.49% | 7254148321 | 565.465 MiB | 44.818 MiB |
| DECIMAL | 0 | 32 | 21x | 23.914 ms | 0.39% | 23.907 ms | 0.39% | 22456657584 | 573.297 MiB | 8.325 MiB |
| DECIMAL | 1000 | 32 | 726x | 20.638 ms | 1.81% | 20.631 ms | 1.81% | 26022379782 | 563.388 MiB | 6.670 MiB |
| TIMESTAMP | 0 | 1 | 18x | 681.565 ms | 0.50% | 681.560 ms | 0.50% | 787708788 | 1005.327 MiB | 462.140 MiB |
| TIMESTAMP | 1000 | 1 | 19x | 141.332 ms | 0.49% | 141.325 ms | 0.49% | 3798835430 | 614.400 MiB | 92.808 MiB |
| TIMESTAMP | 0 | 32 | 16x | 43.671 ms | 0.49% | 43.664 ms | 0.49% | 12295636362 | 543.157 MiB | 20.855 MiB |
| TIMESTAMP | 1000 | 32 | 22x | 23.758 ms | 0.37% | 23.751 ms | 0.37% | 22603868672 | 530.390 MiB | 8.718 MiB |
| DURATION | 0 | 1 | 14x | 639.169 ms | 0.48% | 639.164 ms | 0.48% | 839958193 | 1.025 GiB | 436.918 MiB |
| DURATION | 1000 | 1 | 26x | 141.232 ms | 0.50% | 141.225 ms | 0.50% | 3801522636 | 672.149 MiB | 92.663 MiB |
| DURATION | 0 | 32 | 34x | 40.988 ms | 0.50% | 40.981 ms | 0.50% | 13100505608 | 600.078 MiB | 19.551 MiB |
| DURATION | 1000 | 32 | 22x | 23.576 ms | 0.48% | 23.569 ms | 0.48% | 22778384224 | 588.165 MiB | 8.541 MiB |
| STRING | 0 | 1 | 14x | 892.957 ms | 0.49% | 892.953 ms | 0.49% | 601230777 | 1.740 GiB | 597.486 MiB |
| STRING | 1000 | 1 | 11x | 88.505 ms | 0.49% | 88.498 ms | 0.49% | 6066500017 | 1.173 GiB | 46.473 MiB |
| STRING | 0 | 32 | 14x | 893.754 ms | 0.50% | 893.750 ms | 0.50% | 600694814 | 1.740 GiB | 597.486 MiB |
| STRING | 1000 | 32 | 15x | 33.743 ms | 0.36% | 33.737 ms | 0.35% | 15913425600 | 1.136 GiB | 8.504 MiB |
| LIST | 0 | 1 | 8x | 1.059 s | 0.49% | 1.059 s | 0.49% | 507190581 | 1.061 GiB | 526.626 MiB |
| LIST | 1000 | 1 | 5x | 594.205 ms | 0.46% | 594.200 ms | 0.46% | 903518627 | 738.329 MiB | 175.888 MiB |
| LIST | 0 | 32 | 5x | 332.162 ms | 0.17% | 332.157 ms | 0.17% | 1616317248 | 599.170 MiB | 38.433 MiB |
| LIST | 1000 | 32 | 5x | 319.827 ms | 0.14% | 319.822 ms | 0.14% | 1678656424 | 586.061 MiB | 25.114 MiB |
| STRUCT | 0 | 1 | 17x | 858.119 ms | 0.49% | 858.113 ms | 0.49% | 625640971 | 1.508 GiB | 569.526 MiB |
| STRUCT | 1000 | 1 | 13x | 169.824 ms | 0.49% | 169.816 ms | 0.49% | 3161479058 | 1.018 GiB | 90.699 MiB |
| STRUCT | 0 | 32 | 14x | 636.846 ms | 0.49% | 636.841 ms | 0.49% | 843022470 | 1.352 GiB | 409.289 MiB |
| STRUCT | 1000 | 32 | 8x | 62.571 ms | 0.41% | 62.564 ms | 0.41% | 8581146955 | 967.512 MiB | 15.400 MiB |

## parquet_read_io_compression

### [0] Quadro RTX 8000

| io | compression | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-------------|-------------|-------------|------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| FILEPATH | SNAPPY | 0 | 1 | 5x | 2.511 s | 0.22% | 2.511 s | 0.22% | 213832406 | 1.065 GiB | 521.113 MiB |
| FILEPATH | SNAPPY | 1000 | 1 | 5x | 2.013 s | 0.24% | 2.013 s | 0.24% | 266699568 | 741.251 MiB | 170.914 MiB |
| FILEPATH | SNAPPY | 0 | 32 | 5x | 1.579 s | 0.38% | 1.579 s | 0.38% | 340040086 | 619.892 MiB | 50.836 MiB |
| FILEPATH | SNAPPY | 1000 | 32 | 5x | 1.578 s | 0.08% | 1.578 s | 0.08% | 340273896 | 593.253 MiB | 24.365 MiB |
| FILEPATH | NONE | 0 | 1 | 5x | 2.365 s | 0.31% | 2.365 s | 0.31% | 227035133 | 1.065 GiB | 529.611 MiB |
| FILEPATH | NONE | 1000 | 1 | 5x | 1.949 s | 0.24% | 1.949 s | 0.24% | 275494639 | 741.258 MiB | 180.315 MiB |
| FILEPATH | NONE | 0 | 32 | 5x | 1.558 s | 0.04% | 1.558 s | 0.04% | 344639390 | 619.903 MiB | 58.968 MiB |
| FILEPATH | NONE | 1000 | 32 | 9x | 1.522 s | 0.50% | 1.522 s | 0.50% | 352751882 | 593.268 MiB | 32.308 MiB |
| HOST_BUFFER | SNAPPY | 0 | 1 | 5x | 2.493 s | 0.12% | 2.493 s | 0.12% | 215335507 | 1.065 GiB | 521.113 MiB |
| HOST_BUFFER | SNAPPY | 1000 | 1 | 5x | 2.024 s | 0.17% | 2.024 s | 0.17% | 265230150 | 741.251 MiB | 170.914 MiB |
| HOST_BUFFER | SNAPPY | 0 | 32 | 5x | 1.588 s | 0.38% | 1.588 s | 0.38% | 338124854 | 619.891 MiB | 50.835 MiB |
| HOST_BUFFER | SNAPPY | 1000 | 32 | 5x | 1.590 s | 0.12% | 1.590 s | 0.12% | 337724979 | 593.253 MiB | 24.365 MiB |
| HOST_BUFFER | NONE | 0 | 1 | 5x | 2.361 s | 0.14% | 2.361 s | 0.14% | 227387701 | 1.065 GiB | 529.611 MiB |
| HOST_BUFFER | NONE | 1000 | 1 | 5x | 1.957 s | 0.30% | 1.957 s | 0.30% | 274280188 | 741.258 MiB | 180.315 MiB |
| HOST_BUFFER | NONE | 0 | 32 | 5x | 1.566 s | 0.34% | 1.566 s | 0.34% | 342852317 | 619.903 MiB | 58.968 MiB |
| HOST_BUFFER | NONE | 1000 | 32 | 5x | 1.532 s | 0.35% | 1.532 s | 0.35% | 350454541 | 593.268 MiB | 32.308 MiB |

## parquet_read_column_selection

### [0] Quadro RTX 8000

| column_selection | row_selection | str_to_categories | uses_pandas_metadata | timestamp_type | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|------------------|---------------|-------------------|----------------------|----------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| ALL | ALL | YES | YES | EMPTY | 5x | 2.131 s | 0.18% | 2.131 s | 0.18% | 251957606 | 688.633 MiB | 122.689 MiB |
| ALTERNATE | ALL | YES | YES | EMPTY | 5x | 2.052 s | 0.04% | 2.052 s | 0.04% | 130843237 | 344.275 MiB | 122.661 MiB |
| FIRST_HALF | ALL | YES | YES | EMPTY | 5x | 2.041 s | 0.04% | 2.041 s | 0.04% | 131522620 | 344.259 MiB | 122.656 MiB |
| SECOND_HALF | ALL | YES | YES | EMPTY | 5x | 2.039 s | 0.03% | 2.039 s | 0.03% | 131633269 | 344.375 MiB | 122.644 MiB |

## parquet_read_misc_options

### [0] Quadro RTX 8000

| column_selection | row_selection | str_to_categories | uses_pandas_metadata | timestamp_type | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|------------------|---------------|-------------------|----------------------|----------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| ALL | ALL | YES | YES | EMPTY | 5x | 2.131 s | 0.24% | 2.131 s | 0.24% | 251911310 | 688.633 MiB | 122.654 MiB |
| ALL | ALL | YES | NO | EMPTY | 5x | 2.131 s | 0.13% | 2.131 s | 0.13% | 251962422 | 688.633 MiB | 122.655 MiB |
| ALL | ALL | NO | YES | EMPTY | 5x | 2.134 s | 0.16% | 2.134 s | 0.16% | 251594099 | 703.482 MiB | 122.653 MiB |
| ALL | ALL | NO | NO | EMPTY | 5x | 2.133 s | 0.26% | 2.133 s | 0.26% | 251645745 | 703.482 MiB | 122.639 MiB |

</details>

Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11611

@thomcom

Fixes an issue where `get_json_object` returns an incorrect null count on large inputs. The source of the bug is that each thread was resetting `out_valid_count` to zero, then only the threads that execute in the final block contribute to the value of `out_valid_count`. cc: @thomcom Authors: - Paul Taylor (https://github.com/trxcllnt) - H. Thomson Comer (https://github.com/thomcom) Approvers: - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #11633

Generate unique keys table in Java JNI `contiguousSplitGroups`

closes #11615

`contiguousSplitGroups` splits a table into sub-groups, but each group's `group-by` key is not collected and de-duplicated. This PR generates an extra table that collects and deduplicates the unique keys corresponding to the sub-groups.

```
contiguousSplitGroups
 * Example:
 *   Grouping column index: 0
 *   Input: A table of 3 rows (two groups)
 *     a 1
 *     b 2
 *     b 3
 *
 * Result: GroupByResult
 *   groups: Two tables, one group one table.
 *     group[0]:
 *       a 1
 *
 *     group[1]:
 *       b 2
 *       b 3
 *   uniqKeysTable: Two rows, one row is corresponding to one group.
 *     a // for group 0
 *     b // for group 1
```

Authors: - Chong Gao (https://github.com/res-life) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) URL: #11614
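The example above can be mimicked in a few lines of plain Python. This is only a sketch of the result shape (groups plus one key row per group); it is not the JNI implementation, the function name `contiguous_split_groups` is hypothetical, and it keeps keys in first-seen order, which is an assumption of this sketch rather than something the PR states.

```python
def contiguous_split_groups(rows, key_index):
    """Split `rows` into per-key groups and return (groups, unique_keys).

    unique_keys[i] is the grouping key of groups[i].
    """
    groups = {}  # dicts preserve insertion order, so first-seen order is kept
    for row in rows:
        groups.setdefault(row[key_index], []).append(row)
    return list(groups.values()), list(groups.keys())


table = [("a", 1), ("b", 2), ("b", 3)]
groups, uniq_keys = contiguous_split_groups(table, key_index=0)
for key, group in zip(uniq_keys, groups):
    print(key, group)
```

Returning the keys alongside the groups saves the caller from re-reading the first row of every group just to learn which key it belongs to.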

The implementation of `hostdevice_vector` has a lot of code duplication and a defect (probably) because its span cast operators use `max_size` and not `size`. I'd like to give `hostdevice_vector` more `thrust::host_vector`-like functionality and semantics, so this is a first step towards that goal. Needs to be benchmarked before merging, to see if the default-initialization slows anything down significantly Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Bradley Dice (https://github.com/bdice) - David Wendt (https://github.com/davidwendt) URL: #11631

@GregoryKimball

This PR fixes an issue that arises from benchmark size exceeding our `size_type`'s supported max value size. Thanks to @GregoryKimball for reporting the issue 🙏 Authors: - Elias Stehle (https://github.com/elstehle) Approvers: - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11649

Closes #11237

This PR enables caching of `Scalar` objects. When the `Scalar()` constructor is invoked with the same arguments more than once, a cached instance is retrieved:

```python
>>> x1 = cudf.Scalar(1)
>>> x2 = cudf.Scalar(1)
>>> x1 is x2
True
```

It also registers a reinitialization hook with RMM that clears the cache. This ensures that no objects are left behind that hold on to the previous RMM memory resource after the call to `rmm.reinitialize()`.

Authors: - Ashwin Srinath (https://github.com/shwina) Approvers: - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) - Lawrence Mitchell (https://github.com/wence-) URL: #11246
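One way to get this kind of constructor caching is a metaclass that intercepts the call and looks up the arguments in a bounded cache. The sketch below is a stand-alone illustration with a stand-in `Scalar` class; the metaclass name `CachedInstanceMeta`, the cache size, and the `clear_cache` method are all assumptions of this sketch and are not taken from the cudf source.

```python
from collections import OrderedDict


class CachedInstanceMeta(type):
    """Metaclass that returns a cached instance for repeated constructor args."""

    _max_size = 128  # illustrative bound, not taken from cudf

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        cls._instances = OrderedDict()

    def __call__(cls, value, dtype=None):
        # Include the value's type so that 1, 1.0 and True get separate entries.
        key = (value, type(value), dtype)
        try:
            cls._instances.move_to_end(key)
            return cls._instances[key]
        except (KeyError, TypeError):
            pass  # cache miss, or an unhashable argument
        obj = super().__call__(value, dtype)
        try:
            cls._instances[key] = obj
        except TypeError:
            return obj  # unhashable arguments are simply not cached
        if len(cls._instances) > cls._max_size:
            cls._instances.popitem(last=False)  # evict the least recently used
        return obj

    def clear_cache(cls):
        """Drop every cached instance, e.g. from a reinitialization hook."""
        cls._instances.clear()


class Scalar(metaclass=CachedInstanceMeta):
    """Stand-in for a scalar type; it only stores its arguments."""

    def __init__(self, value, dtype=None):
        self.value = value
        self.dtype = dtype


x1 = Scalar(1)
x2 = Scalar(1)
print(x1 is x2)
```

A hook like the one described above would call `Scalar.clear_cache()` so that later constructions create fresh objects instead of handing out ones tied to the old memory resource.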

Dask-cudf currently maintains a specialized groupby-aggregation code path, which is faster for cudf-based data than the upstream (`dask.dataframe`) code path. However, the custom implementation does not take advantage of Dask's `apply_concat_apply` function, even though the tree-reduction aspect of the algorithm is the same. This PR refactors the dask_cudf groupby-aggregation code to use `apply_concat_apply`. This reduces the amount of code we will need to maintain in cudf, and should improve graph optimizations (like fusion). Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11571
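The tree-reduction shape mentioned above (per-partition chunk step, repeated combine step, final aggregate step) can be sketched without Dask. The code below computes a groupby mean over plain Python lists; it does not call `apply_concat_apply`, and the function names and the `split_every` parameter are chosen here for illustration.

```python
from collections import defaultdict


def chunk(partition):
    """Per-partition partial aggregation: [sum, count] for each key."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        partial[key][0] += value
        partial[key][1] += 1
    return dict(partial)


def combine(partials):
    """Merge several partial results into one; safe to apply repeatedly."""
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return dict(merged)


def aggregate(partials):
    """Final step: merge what is left and turn [sum, count] into a mean."""
    return {key: total / count for key, (total, count) in combine(partials).items()}


def tree_groupby_mean(partitions, split_every=2):
    """Reduce partitions level by level, `split_every` partials at a time."""
    if split_every < 2:
        raise ValueError("split_every must be at least 2")
    level = [chunk(p) for p in partitions]
    while len(level) > split_every:
        level = [combine(level[i:i + split_every])
                 for i in range(0, len(level), split_every)]
    return aggregate(level)


partitions = [[("a", 1), ("b", 2)], [("a", 3)], [("b", 4), ("b", 6)], [("a", 5)]]
print(tree_groupby_mean(partitions))
```

Keeping sums and counts separate until the last step is what makes the intermediate results mergeable in any grouping, which is the property a tree reduction needs.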

Fixes some calls that were not passing the stream variable to detail functions. Found these while looking into improvements for #11577 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) - Mark Harris (https://github.com/harrism) URL: #11642

Previously, we did not explicitly check for zero-sized inputs in to/from_dlpack. cupy>=11 changes behaviour in how it provides information for zero-sized arrays (strides of empty tensors were previously reported as one, but now are zero). This broke some of the validity checking in `from_dlpack`. To handle such cases, check the shape of the incoming tensor as well as strides. Authors: - Lawrence Mitchell (https://github.com/wence-) - GALI PREM SAGAR (https://github.com/galipremsagar) - Benjamin Zaitlen (https://github.com/quasiben) Approvers: - Ashwin Srinath (https://github.com/shwina) - Bradley Dice (https://github.com/bdice) - GALI PREM SAGAR (https://github.com/galipremsagar) - AJ Schmidt (https://github.com/ajschmidt8) URL: #11449
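The fix described above amounts to looking at the shape before trusting the strides. The following sketch shows that ordering in plain Python; the concrete layout rules (unit stride along the first axis, second-axis stride at least the row count) are assumptions made for illustration and are not copied from libcudf, and `layout_is_supported` is a hypothetical name.

```python
def layout_is_supported(shape, strides):
    """Return True if a 1-D or 2-D tensor layout passes this sketch's checks.

    `strides` are expressed in elements, one entry per dimension.
    """
    if len(shape) not in (1, 2) or len(strides) != len(shape):
        return False
    # Zero-sized tensors address no elements, so whatever strides the
    # producer reports (0 or 1) must not cause a rejection.
    if any(extent == 0 for extent in shape):
        return True
    if strides[0] != 1:
        return False
    if len(shape) == 2 and strides[1] < shape[0]:
        return False
    return True


# Both conventions for an empty 1-D tensor are accepted.
print(layout_is_supported((0,), (1,)), layout_is_supported((0,), (0,)))
```

Checking the shape first means the stride checks only ever run on tensors where strides actually determine which memory is read.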

Adds calls to `cudf::column.set_null_count()` when the null-count is known. Found these while looking into improvements for #11577. There are several ways to make a `cudf::column` object to be returned. Many times the column is created and then filled in by calling the `cudf::column.mutable_view()` function and using the `mutable_view` object. The `cudf::column::mutable_view()` function has a side-effect that invalidates its internal null-count. This is for efficiency, so the null-count is only computed when the value is specifically requested through the `cudf::column::null_count()` method. Computing the null-count inside `null_count()` requires a kernel launch. However, there are several places where the null-count is known before returning the column, and setting the value means a later call to `cudf::column::null_count()` does not require it to be computed. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - https://github.com/nvdbaranec URL: #11646
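The lazy null-count pattern described above can be modelled in a few lines. This Python class is a toy stand-in, not `cudf::column`: the `count_passes` counter stands in for the kernel launch, and `UNKNOWN_NULL_COUNT` is a sentinel chosen for this sketch.

```python
UNKNOWN_NULL_COUNT = -1  # sentinel used by this sketch only


class Column:
    """Toy column whose null count is computed lazily and cached."""

    def __init__(self, values):
        self._values = list(values)  # None marks a null element
        self._null_count = UNKNOWN_NULL_COUNT
        self.count_passes = 0  # how often the expensive count actually ran

    def mutable_view(self):
        # Handing out mutable access invalidates the cached count.
        self._null_count = UNKNOWN_NULL_COUNT
        return self._values

    def set_null_count(self, count):
        # The caller already knows the answer, so no counting pass is needed.
        self._null_count = count

    def null_count(self):
        if self._null_count == UNKNOWN_NULL_COUNT:
            self.count_passes += 1  # stands in for a kernel launch
            self._null_count = sum(v is None for v in self._values)
        return self._null_count


col = Column([0, 0, 0, 0])
view = col.mutable_view()
view[1] = None          # the producer writes exactly one null
col.set_null_count(1)   # and records that fact before returning the column
print(col.null_count(), col.count_passes)
```

Without the `set_null_count` call, the first `null_count()` after `mutable_view()` would have to scan the data.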

This PR fixes a typo in the cuDF conda environment pinning for pandoc. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #11658

Moves the internal string strip function to `strip.cuh` header in the `include/cudf/strings/detail` folder. Allows this function to be shared with strings-udf code for strip. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - https://github.com/nvdbaranec URL: #11635

Fixes a couple small issues I caught while reviewing #11319. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #11647

@vuule

Fixed incorrect size when calling `device_write_async()`. Additionally, CI now runs the csv, orc, and parquet tests using the KvikIO backend. cc. @vuule Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Vukasin Milovanovic (https://github.com/vuule) - Nghia Truong (https://github.com/ttnghia) URL: #11651

- [x] This PR adds support for `group_keys` in `groupby`. Starting with pandas 1.5.0, issues around `group_keys` have been resolved: pandas-dev/pandas#34998 pandas-dev/pandas#47185
- [x] This PR defaults `group_keys` to `False`, which is the same as what pandas will default to in a future version.
- [x] Required to unblock the `pandas-1.5.0` upgrade in cudf: #11617

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) URL: #11659

…r. (#11657) Reverts: #11480 <s>https://github.com/rapidsai/cudf/pull/11503</s> Authors: - https://github.com/nvdbaranec Approvers: - Nghia Truong (https://github.com/ttnghia) - Yunsong Wang (https://github.com/PointKernel) URL: #11657

Fixes the regex negated class NCCLASS builder to not automatically include the new-line character in the negated range. This logic is from the original plan9 source but does not appear to be honored by any other known regex implementation. Marking this as a breaking change since it changes the existing behavior in case someone is depending on the current logic. Also added a new gtest to include `\n` data when matching with a negated class. Closes #11643 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Elias Stehle (https://github.com/elstehle) URL: #11644
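Python's `re` module is one implementation where a negated class matches the new-line character like any other character that is not listed, which is the behavior this change moves to. The snippet uses only the standard library and contrasts the negated class with `.`, which skips `\n` by default.

```python
import re

text = "ab\ncd"

# A negated class matches every character not listed, including "\n".
negated_matches = re.findall(r"[^a-c]", text)

# By contrast, "." does not match "\n" unless re.DOTALL is given.
dot_matches = re.findall(r".", text)

print(negated_matches, dot_matches)
```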

Fixes the check for invalid quantifier pattern to not include the alternation character. This moves the alternation instruction resolution before the quantifier validation checks. Added a gtest for this pattern as well. Closes #11650 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Karthikeyan (https://github.com/karthikeyann) - Bradley Dice (https://github.com/bdice) URL: #11654

This adds `gdb` pretty printers for `rmm::device_uvector`, `thrust::*_vector`, `thrust::device_reference` and `cudf::*_span`. The implementation is based on NVIDIA/thrust#1631. I will probably backport the thrust-specific changes to there as well, but since the location of the thrust source is not fixed, I'd prefer having all types in a self-contained file. Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Bradley Dice (https://github.com/bdice) - Karthikeyan (https://github.com/karthikeyann) URL: #11499

The intention of this PR is to address the dataset-truncation bug described in #11324. It seems that cudf does not currently handle multiple pyarrow-based datasources correctly. This means `cudf.read_parquet` will return incorrect results when reading a list of files from remote storage (e.g. from s3 or gcs). **The underlying problem**: Instead of passing a vector of `datasource` objects to libcudf, the python/cython API only passes along a `datasource` object for the very **first** file in the user-specified list. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - https://github.com/brandon-b-miller URL: #11655

Update to current mypy, primarily since on Apple silicon, the previous pinned version of mypy is no longer installable (`typed-ast` does not build from source). This necessitates some minor updates to the type annotation rules, since the newer mypy version is a bit pickier. While we're here, exclude directories we were previously just ignoring errors in. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) - Jordan Jacobelli (https://github.com/Ethyling) - Vyas Ramasubramani (https://github.com/vyasr) - AJ Schmidt (https://github.com/ajschmidt8) URL: #11640

Closes #11664 Authors: - Ashwin Srinath (https://github.com/shwina) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11677

Contributes to #10186

This PR enables lexicographic comparisons between list columns. The comparison is robust to arbitrary levels of nesting, but does not yet support lists of (lists of...) structs. The comparison is based on the Dremel encoding already in use in the Parquet file format.

To assist reviewers, here is a reasonably complete list of the changes in this PR:

1. A helper function to get per-column Dremel data (for list columns) when constructing a preprocessed table, which now owns the Dremel data.
2. Updating the set of lexicographically compatible columns to now include list columns as long as they do not have any nested structs within.
3. An implementation of `lexicographic::device_row_comparator::operator()` for lists. **This function is the heart of the change to enable comparisons between list columns.**
4. A new benchmark for sorting that uses list data.
5. An update to a preexisting rolling collect set test that previously failed (because it requires list comparison) but now works.
6. New tests for list comparison.

Authors: - Devavret Makkar (https://github.com/devavret) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Mark Harris (https://github.com/harrism) - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) URL: #11129
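As a reference for what lexicographic ordering of nested lists means, here is a recursive three-way comparator in plain Python. It works directly on nested Python lists rather than on a Dremel encoding, ignores nulls, and is only meant to show the ordering semantics; `compare_lists` is a hypothetical name.

```python
from functools import cmp_to_key


def compare_lists(lhs, rhs):
    """Three-way lexicographic comparison of (possibly nested) lists.

    Returns -1, 0 or 1. Elements are compared pairwise; if one list is a
    prefix of the other, the shorter list orders first.
    """
    for a, b in zip(lhs, rhs):
        if isinstance(a, list) and isinstance(b, list):
            result = compare_lists(a, b)
        else:
            result = (a > b) - (a < b)
        if result != 0:
            return result
    return (len(lhs) > len(rhs)) - (len(lhs) < len(rhs))


rows = [[2], [1, 5], [1], []]
print(sorted(rows, key=cmp_to_key(compare_lists)))
```

The recursion depth follows the nesting depth of the data, which is the part a flattened encoding such as Dremel avoids on the GPU.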

#11640 was missing an exclusion for `cudf/cudf/utils/metadata/orc_column_statistics_pb2.py`, which causes the following errors:

```bash
(cudfdev) pgali@dt07:/nvme/0/pgali/cudf$ git commit -m "address reviews"
[WARNING] Unstaged files detected.
[INFO] Stashing unstaged files to /home/nfs/pgali/.cache/pre-commit/patch1662993629-63423.
isort....................................................................Passed
black....................................................................Passed
flake8...................................................................Passed
mypy.....................................................................Failed
- hook id: mypy
- exit code: 1

python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:7: error: Library stubs not installed for "google.protobuf.internal" (or incompatible with Python 3.9)
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:7: note: Hint: "python3 -m pip install types-protobuf"
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:7: note: (or run "mypy --install-types" to install all missing stub packages)
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:7: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:8: error: Library stubs not installed for "google.protobuf" (or incompatible with Python 3.9)
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:25: error: Name "_BUCKETSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:26: error: Name "_BUCKETSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:27: error: Name "_INTEGERSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:28: error: Name "_INTEGERSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:29: error: Name "_DOUBLESTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:30: error: Name "_DOUBLESTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:31: error: Name "_STRINGSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:32: error: Name "_STRINGSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:33: error: Name "_BUCKETSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:34: error: Name "_BUCKETSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:35: error: Name "_DECIMALSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:36: error: Name "_DECIMALSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:37: error: Name "_DATESTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:38: error: Name "_DATESTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:39: error: Name "_TIMESTAMPSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:40: error: Name "_TIMESTAMPSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:41: error: Name "_BINARYSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:42: error: Name "_BINARYSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:43: error: Name "_COLUMNSTATISTICS" is not defined
python/cudf/cudf/utils/metadata/orc_column_statistics_pb2.py:44: error: Name "_COLUMNSTATISTICS" is not defined
Found 22 errors in 1 file (checked 138 source files)

pydocstyle...............................................................Passed
clang-format.........................................(no files to check)Skipped
no-deprecationwarning....................................................Passed
cmake-format.........................................(no files to check)Skipped
- hook id: cmake-format
cmake-lint...........................................(no files to check)Skipped
- hook id: cmake-lint
copyright-check..........................................................Passed
doxygen-check........................................(no files to check)Skipped
- hook id: doxygen-check
[INFO] Restored changes from /home/nfs/pgali/.cache/pre-commit/patch1662993629-63423.
```

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #11685

…1671) This PR:
- [x] Fixes #11670 by correctly generating the `column_metadata` for nested scenarios.
- [x] Also fixes a dtype mismatch that occurred after updating `children` in a `ListColumn`. See the pytest below.

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-) URL: #11671

This PR runs the conda recipe checks for all headers in `include/` with `pre-commit` instead of as a separate step in `style.sh`. This means that developers using `pre-commit` will be able to fix mistakes before pushing, rather than having those errors cause failures in CI. Combined with #11668, most of our style check suite will be executed via `pre-commit`, enabling us to simplify `style.sh`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Jordan Jacobelli (https://github.com/Ethyling) - Ray Douglass (https://github.com/raydouglass) URL: #11669

This PR removes a redundant style check for clang-format. Our configuration in `.pre-commit-config.yaml` already runs clang-format so we don't need a separate step for that purpose in `style.sh`. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Ray Douglass (https://github.com/raydouglass) - Jordan Jacobelli (https://github.com/Ethyling) - Nghia Truong (https://github.com/ttnghia) URL: #11668

Closes #9058, #9056 Expands nvCOMP adapter to include ZSTD compression. Adds centralized nvCOMP policy, `is_compression_enabled`. Adds centralized nvCOMP alignment utility, `compress_input_alignment_bits`. Adds centralized nvCOMP utility to get the maximum supported compression chunk size, `batched_compress_max_allowed_chunk_size`. Encoded ORC row groups are aligned based on compression requirements. Encoded Parquet pages are aligned based on compression requirements. Parquet fragment size now scales with the page size to better fit the default page size with ZSTD compression. Small refactoring around `decompress_status` for improved type safety and, hopefully, naming. Replaced `snappy_compress` in the Parquet writer with the nvCOMP adapter call. Vectors of `compression_result`s are initialized before compression to avoid issues with random chunk skipping due to uninitialized memory. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Jason Lowe (https://github.com/jlowe) - Jim Brennan (https://github.com/jbrennan333) - Mike Wilson (https://github.com/hyperbolic2346) - Tobias Ribizel (https://github.com/upsj) - Matthew Roeschke (https://github.com/mroeschke) URL: #11551

This PR adds the developer documentation to our doxygen. The changes to enable this are minimal: the files have been moved from cpp/docs to cpp/doxygen and added to the `INPUT` section of the Doxyfile, and a custom `DoxygenLayout.xml` has been added to include the necessary header (note that this file was autogenerated with `doxygen -l` and the only modification was adding one tab for the developer guide). I added an anchor to the main developer guide header so that it can be linked to from the header.

Our current developer guide was written to be viewed on GitHub. As a result, it uses some [GitHub Flavored Markdown](https://github.github.com/gfm/) extensions, primarily around the way that code is displayed. Since doxygen does not support those, I had to make some modifications to the contents of the docs so that they would render the same way. Some care is needed around escaping the `@` symbol in the right locations in the docs to prevent doxygen from interpreting commands in examples. Finally, due to doxygen/doxygen#6054 we cannot put certain commands inside code blocks and get them rendered correctly. However, with a few cycles of rebuilding and checking the output I was able to get everything to look correct.

These changes mean that the guide will no longer look as nice as before when viewed directly on GitHub. However, the goal of this PR is to allow everyone to move towards viewing the built and published documentation instead, so this isn't a problem. This is how it looks now: ![image](https://user-images.githubusercontent.com/1538165/182971788-b946f406-f490-4698-a484-9046e37a4d88.png)

Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Bradley Dice (https://github.com/bdice) URL: #11475

This is needed so that `rapids-cmake` population of `rpath-link` link options is provided to all executables. Fixes #11628 Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - Keith Kraus (https://github.com/kkraus14) - Bradley Dice (https://github.com/bdice) URL: #11666

When a groupby has multiple grouping keys, passing `group_keys=True` raises an error:

```python
In [1]: import cudf

In [2]: gdf = cudf.DataFrame(
   ...:     {"A": "a a b".split(), "B": [1, 1, 3], "C": [4, 6, 5]}
   ...: )

In [3]: g_group = gdf.groupby(["A", "B"], group_keys=True)

In [4]: g_group[["B", "C"]].apply(lambda x: x / x.sum())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [4], line 1
----> 1 g_group[["B", "C"]].apply(lambda x: x / x.sum())

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/groupby/groupby.py:782, in GroupBy.apply(self, function, *args)
    778 result = cudf.concat(chunk_results)
    779 if self._group_keys:
    780     result.index = cudf.MultiIndex._from_data(
    781         {
--> 782             group_keys.name: group_keys._column,
    783             None: grouped_values.index._column,
    784         }
    785     )
    787 if self._sort:
    788     result = result.sort_index()

AttributeError: 'MultiIndex' object has no attribute '_column'
```

This PR fixes the issue:

```python
In [1]: import cudf

In [2]: gdf = cudf.DataFrame(
   ...:     {"A": "a a b".split(), "B": [1, 1, 3], "C": [4, 6, 5]}
   ...: )

In [3]: g_group = gdf.groupby(["A", "B"], group_keys=True)

In [4]: g_group[["B", "C"]].apply(lambda x: x / x.sum())
Out[4]:
         B    C
A B
a 1 0  0.5  0.4
    1  0.5  0.6
b 3 2  1.0  1.0
```

Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) Approvers: - Bradley Dice (https://github.com/bdice) URL: #11689

Fixes #11525

Contains a chain of fixes:
1. Allow negative nanoseconds in negative timestamps - aligns writer with pyorc;
2. Limit seconds adjustment to positive nanoseconds - fixes the off-by-one issue reported in #11525;
3. Fix the decode of large uint64_t values (>max `int64_t`) - fixes reading of cuDF encoded negative nanoseconds;
4. Avoid mode 2 encode when the base value is larger than max `int64_t` - follows the specs and fixes reading of negative nanoseconds using non-cuDF readers.

Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) URL: #11586
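The off-by-one in #11586 is easier to picture with a small arithmetic sketch. The code below is not the ORC encoding and not the cuDF implementation; it only shows, under that assumption, two ways to split a signed nanosecond count into (seconds, nanoseconds) and what happens when the two conventions are mixed. All function names are hypothetical.

```python
NS_PER_SECOND = 1_000_000_000


def split_floor(t_ns: int) -> tuple:
    """Floor convention: nanoseconds are always in [0, NS_PER_SECOND)."""
    return divmod(t_ns, NS_PER_SECOND)


def split_trunc(t_ns: int) -> tuple:
    """Truncation convention: nanoseconds carry the sign of the timestamp."""
    seconds, nanos = divmod(abs(t_ns), NS_PER_SECOND)
    return (seconds, nanos) if t_ns >= 0 else (-seconds, -nanos)


def join(seconds: int, nanos: int) -> int:
    """Recombine a (seconds, nanoseconds) pair into a nanosecond count."""
    return seconds * NS_PER_SECOND + nanos


t = -1_500_000_000  # 1.5 seconds before the reference point
floor_pair = split_floor(t)   # seconds = -2, nanos = +500_000_000
trunc_pair = split_trunc(t)   # seconds = -1, nanos = -500_000_000

# Either convention round-trips on its own ...
round_trips = join(*floor_pair) == t and join(*trunc_pair) == t
# ... but pairing truncated seconds with floor-style nanoseconds is off by one second.
mixed_error = join(trunc_pair[0], floor_pair[1]) - t
```

Here `mixed_error` comes out to exactly one second, which is the kind of discrepancy a reader and writer can produce when they disagree on the sign convention for nanoseconds.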

…1690) Fix `to_orc` defaults for the compression type in cuDF and Dask. Aligns the default to the libcudf default (and to the Parquet default). Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11690

cuDF grouped RANGE window functions currently support only integral types and timestamps as the ORDER BY (OBY) column. This commit adds support for DECIMAL types (i.e. decimal32, decimal64, and decimal128) to be used as the ORDER BY column in RANGE window functions. This feature allows `spark-rapids` to address NVIDIA/spark-rapids#6400. Authors: - MithunR (https://github.com/mythrocks) Approvers: - Mike Wilson (https://github.com/hyperbolic2346) - David Wendt (https://github.com/davidwendt) - Jason Lowe (https://github.com/jlowe) URL: #11645
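As an illustration of what a RANGE window over a decimal ORDER BY column computes, here is a small host-side sketch using Python's `decimal.Decimal`. It is a quadratic reference loop for a single group, not the libcudf implementation, and `range_window_sum` is a hypothetical name.

```python
from decimal import Decimal
from typing import List, Sequence


def range_window_sum(
    order_by: Sequence[Decimal],
    values: Sequence[int],
    preceding: Decimal,
    following: Decimal,
) -> List[int]:
    """For each row, sum the values whose ORDER BY key lies within
    [key - preceding, key + following] of that row's key (inclusive)."""
    out = []
    for key in order_by:
        lo, hi = key - preceding, key + following
        out.append(sum(v for k, v in zip(order_by, values) if lo <= k <= hi))
    return out


order_by = [Decimal("1.10"), Decimal("1.25"), Decimal("1.40"), Decimal("2.00")]
values = [10, 20, 30, 40]
result = range_window_sum(order_by, values, Decimal("0.15"), Decimal("0.15"))
```

Because decimal arithmetic is exact, the inclusive bounds behave predictably: the row with key `1.25` picks up both `1.10` and `1.40`, which a binary floating-point key might miss at the boundary.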

The recently merged PR (#11551) did not include the `<optional>` header, which may cause compile errors on some systems (in particular, CUDA 11.7 + gcc-11.2):

```
error: ‘std::optional’ has not been declared
error: ‘optional’ in namespace ‘std’ does not name a template type
```

This PR adds that missing header to fix the compile issue. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - David Wendt (https://github.com/davidwendt) URL: #11697

Fixes: #11693 This PR fixes `DataFrame.from_arrow`, which did not preserve type metadata for `struct`, `list`, and `decimal` types. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #11698

Discussion in dask/dask#9490 and dask/dask#9491 suggests that `split_out=None` as a default value was never really intended and is likely to be deprecated. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ashwin Srinath (https://github.com/shwina) - Nghia Truong (https://github.com/ttnghia) URL: #11704

…or (#11699) Closes #11525 Not sure why, but the Apache Java ORC reader does the following when reading negative timestamps: https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1284-L1285 This detail does not impact the cuDF and pyorc writers (reading cuDF files with the Apache reader already works) because these libraries write negative timestamps with negative nanoseconds. This PR modifies the ORC reader behavior to match the Apache reader so that cuDF correctly reads ORC files written by the Apache ORC library. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Ashwin Srinath (https://github.com/shwina) - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) - Elias Stehle (https://github.com/elstehle) - Nghia Truong (https://github.com/ttnghia) URL: #11699

multibyte_split does two scans to determine delimiters: One using the trie to determine matches, one to determine the offsets. For single-byte delimiters, the trie scan is unnecessary, since we can determine without context what is a delimiter. So I added a single `byte_split_kernel` by replacing the trie logic with a char comparison. The difference is quite significant:

| source_type | delim_size | delim_percent | size_approx | GPU Time | Peak Memory Usage | Encoded file size |
|-------------|------------|---------------|-------------------|------------|-------------------|-------------------|
| 0 | 1 | 1 | 2^30 = 1073741824 | 110.196 ms | 3.947 GiB | 1006.638 MiB |
| 0 | 2 | 1 | 2^30 = 1073741824 | 198.067 ms | 3.745 GiB | 1011.775 MiB |
| 0 | 1 | 25 | 2^30 = 1073741824 | 122.626 ms | 9.978 GiB | 762.462 MiB |
| 0 | 2 | 25 | 2^30 = 1073741824 | 224.000 ms | 6.163 GiB | 889.541 MiB |

This might point to the fact that the custom prefix scan implementation in multibyte_split is a bottleneck, but this needs more investigation. Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #11681
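The reasoning in #11681 (a single-byte delimiter can be recognized without looking at neighbouring bytes) can be sketched on the host as follows. This is a sequential illustration only; the function names are hypothetical, and `bytes.find` stands in for the trie-based matching, which it does not reproduce.

```python
from typing import List


def delimiter_ends_single_byte(data: bytes, delim: int) -> List[int]:
    """One-past-the-end offsets of every occurrence of a one-byte delimiter.

    Each byte is tested independently of its neighbours, which is why no
    matching state has to be carried along.
    """
    return [i + 1 for i, byte in enumerate(data) if byte == delim]


def delimiter_ends_multi_byte(data: bytes, delim: bytes) -> List[int]:
    """Same offsets for a multi-byte delimiter (non-overlapping matches).

    Here a byte only counts as part of a delimiter in the context of the
    bytes around it, so some form of pattern matching is needed.
    """
    ends = []
    start = data.find(delim)
    while start != -1:
        ends.append(start + len(delim))
        start = data.find(delim, start + len(delim))
    return ends
```

For a one-byte delimiter both functions return the same offsets; the first one simply gets there with a comparison per byte.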

This PR adds the type inference component for the nested JSON work. Authors: - Yunsong Wang (https://github.com/PointKernel) - Elias Stehle (https://github.com/elstehle) Approvers: - Elias Stehle (https://github.com/elstehle) - Vukasin Milovanovic (https://github.com/vuule) - Karthikeyan (https://github.com/karthikeyann) URL: #11121

…11710) As a follow-up to #11645, this change includes `DECIMAL` among the list of data-types that may be used in the order-by column for `RANGE`-based window functions, listed in the exception message. Authors: - MithunR (https://github.com/mythrocks) Approvers: - Gera Shegalov (https://github.com/gerashegalov) - Nghia Truong (https://github.com/ttnghia) URL: #11710

While working on #11711, I noticed a set of files that did not have copyright headers in place. This PR fixes them. I used the "start year" from the output of `git log` indicating the first commit to each file. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Jordan Jacobelli (https://github.com/Ethyling) - Nghia Truong (https://github.com/ttnghia) URL: #11712

Places the `PATCH_COMMAND` on its own line, separate from the `GIT_SHALLOW` command line. Recently got a style-check error on `get_thrust.cmake`: https://gpuci.gpuopenanalytics.com/blue/organizations/jenkins/rapidsai%2Fgpuci%2Fcudf%2Fprb%2Fcudf-style/detail/cudf-style/28876/pipeline The style-check tool was used to fix this. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Bradley Dice (https://github.com/bdice) URL: #11715

This PR removes all excluded filename patterns from the `isort` configuration. We should run `isort` on all files, and if exclusions are needed, those should be handled with action comments like `# isort: skip` on a case-by-case basis (this is sometimes needed for `setup.py` to control import order with Cython / setuptools / etc.). See: https://pycqa.github.io/isort/docs/configuration/action_comments.html Authors: - Bradley Dice (https://github.com/bdice) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Matthew Roeschke (https://github.com/mroeschke) - H. Thomson Comer (https://github.com/thomcom) URL: #11680

Since libcudf doesn't keep track of StructDtype key names, round-tripping through outer_explode loses information. We know the correct dtype, since it is the element_type of the exploded list column, so attach that type metadata before handing back the return value.

Exploding a list series should be equivalent to unwrapping one level of list from the dtype, so that

    x = cudf.Series([[{'a': 'b'}]])
    x.explode().dtype == x.dtype.element_type

Previously this was not the case, since we would lose the names resulting in

    x.explode().dtype == StructDtype({'0': dtype('O')})

Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #11687

Updates libcudf to use Thrust 1.17.2. This version contains a newer cub util_namespace.h with the upstream fix needed to correct ODR issues inside thrust-cub. Fixes #11368. Authors: - Robert Maynard (https://github.com/robertmaynard) - Bradley Dice (https://github.com/bdice) Approvers: - Mark Harris (https://github.com/harrism) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11665

Adds GPU implementation of JSON-token-stream to JSON-tree

Depends on PR [Adds JSON-token-stream to JSON-tree](#11291) #11291

<details>

---

This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the [JSON tokenizer PR](#11264).

**This PR depends on:**
⛓️ #11264
⛓️ #11242
⛓️ #11078

**Each node has one of the following categories:**
```
/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR
```

**For each node, the tree representation stores the following information:**
- node category
- node level
- node range begin (index of the first character from the original JSON input that this node demarcates)
- node range end (index of one past the last character from the original JSON input that this node demarcates)

**An example tree:**
The following is just an example print of the information represented in the tree generated by the algorithm.
- Each line prints the full path to the next node in the tree.
- For each node along the path we have the following format: `<[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>`

**The original JSON for this tree:**
```
[{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95}, {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}]
```

**The tree:**
```
<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
```

**The original JSON pretty-printed for this tree:**
```
[
  {
    "category": "reference",
    "index:": [
      4,
      12,
      42
    ],
    "author": "Nigel Rees",
    "title": "[Sayings of the Century]",
    "price": 8.95
  },
  {
    "category": "reference",
    "index": [
      4,
      {},
      null,
      {
        "a": [
          {},
          {}
        ]
      }
    ],
    "author": "Nigel Rees",
    "title": "{}[], <=semantic-symbols-string",
    "price": 8.95
  }
]
```

</details>

---

Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Michael Wang (https://github.com/isVoid) - David Wendt (https://github.com/davidwendt) URL: #11518
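To show what the per-node information in #11518 (category, level, range begin, range end) looks like in practice, here is a small sequential sketch that scans well-formed JSON text and emits such nodes. It is an assumption-laden illustration: the PR builds the tree on the GPU from a token stream, whereas this code walks the raw characters on the host, produces no error nodes, and uses hypothetical names (`Node`, `json_to_tree`).

```python
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    category: str  # "STRUCT", "LIST", "FN", "STR" or "VAL"
    level: int     # depth in the tree, root is 0
    begin: int     # index of the first character of the node
    end: int       # index one past the last character of the node
    parent: int    # index of the parent node, -1 for the root


def json_to_tree(text: str) -> List[Node]:
    nodes: List[Node] = []
    open_containers: List[int] = []   # indices of unclosed LIST/STRUCT nodes
    pending_fn: Optional[int] = None  # field name still waiting for its value

    def add(category: str, begin: int, end: int) -> int:
        nonlocal pending_fn
        if pending_fn is not None:
            # A value directly following a field name becomes its child.
            parent, pending_fn = pending_fn, None
        elif open_containers:
            parent = open_containers[-1]
        else:
            parent = -1
        level = nodes[parent].level + 1 if parent >= 0 else 0
        nodes.append(Node(category, level, begin, end, parent))
        return len(nodes) - 1

    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "[{":
            open_containers.append(add("LIST" if ch == "[" else "STRUCT", i, i + 1))
            i += 1
        elif ch in "]}":
            open_containers.pop()
            i += 1
        elif ch == '"':
            j = i + 1
            while text[j] != '"':
                j += 2 if text[j] == "\\" else 1  # skip escaped characters
            in_struct = bool(open_containers) and nodes[open_containers[-1]].category == "STRUCT"
            if in_struct and pending_fn is None:
                pending_fn = add("FN", i + 1, j)  # range excludes the quotes
            else:
                add("STR", i + 1, j)
            i = j + 1
        elif ch in " \t\r\n,:":
            i += 1
        else:
            j = i
            while j < len(text) and text[j] not in ",]} \t\r\n":
                j += 1
            add("VAL", i, j)  # number or literal such as true/false/null
            i = j
    return nodes


tree = json_to_tree('[{"a": [1, "x"]}, null]')
```

As in the example print above, a field name's value is attached as the child of the `FN` node, and the ranges of `FN` and `STR` nodes exclude the surrounding quotes.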

…d JSON parser (#11574) Adds type inference and type conversion for leaf-columns to the nested JSON parser

**Note to the reviewers**: It's important to note that we're talking about two different stages of quote-stripping here.
1. Including/excluding quotes in the tokenizer stage (currently always set to `true` using a `constexpr bool`)
2. Including/excluding quotes in the type conversion stage

Currently, we always include quotes in the tokenizer stage (1), such that the type casting stage (2) can differentiate between string values and literals (e.g. `[true, "true"]`) and, based on the user-provided choice in `json_reader_options::keep_quotes`, can strip off the quotes or keep them in the values returned to the user.

**In addition to adding type inference and type casting:**
- Switches logic for inferring nested columns. Inferring any column with at least one nested item (list or struct) as that respective nested column, making all other _non-nested_ items of that column invalid. E.g., `[null,{"a":1},"foo"] => List<Struct<a:int>> with struct col validity: 0, 1, 0`
- Adds option for `keep_quotes` to differentiate between string values and numeric & literal values, like (`123.4`, `true`, `false`, `null`).
- Migrated libcudf test to cudf test to avoid having large byte BLOBs in source file
- Changing column order to match the behaviour of pandas and existing JSON lines reader. That is, column order corresponds to the order they were discovered in: `[{"b":1, "c":1}, {"a":1}] => order: <b, c, a>`
- Support for escape sequences (see below)

## Performance comparison
### Tokenizer
The following is a comparison of the **JSON tokenizer** stage before this PR and after:
#### Before
```
# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

| string_size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
|-------------------|---------|-----------|-------|-----------|-------|----------|
| 2^20 = 1048576 | 2176x | 2.489 ms | 9.62% | 2.480 ms | 9.61% | 422.729M |
| 2^21 = 2097152 | 1936x | 2.501 ms | 7.14% | 2.492 ms | 7.12% | 841.482M |
| 2^22 = 4194304 | 1152x | 2.612 ms | 5.43% | 2.604 ms | 5.42% | 1.611G |
| 2^23 = 8388608 | 1456x | 2.855 ms | 4.26% | 2.847 ms | 4.23% | 2.947G |
| 2^24 = 16777216 | 1104x | 3.395 ms | 5.34% | 3.387 ms | 5.33% | 4.954G |
| 2^25 = 33554432 | 560x | 4.410 ms | 2.25% | 4.402 ms | 2.25% | 7.623G |
| 2^26 = 67108864 | 1552x | 6.482 ms | 2.23% | 6.473 ms | 2.22% | 10.367G |
| 2^27 = 134217728 | 1435x | 10.430 ms | 2.70% | 10.422 ms | 2.70% | 12.879G |
| 2^28 = 268435456 | 815x | 18.396 ms | 1.95% | 18.387 ms | 1.95% | 14.599G |
| 2^29 = 536870912 | 15x | 34.389 ms | 0.42% | 34.381 ms | 0.42% | 15.615G |
| 2^30 = 1073741824 | 11x | 66.097 ms | 0.20% | 66.088 ms | 0.20% | 16.247G |
```
#### After
```
# Benchmark Results

## json_tokenizer

### [0] Tesla V100-SXM2-32GB

| string_size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
|-------------------|---------|------------|--------|------------|--------|----------|
| 2^20 = 1048576 | 1408x | 2.600 ms | 11.28% | 2.592 ms | 11.26% | 404.547M |
| 2^21 = 2097152 | 800x | 2.838 ms | 7.68% | 2.829 ms | 7.67% | 741.243M |
| 2^22 = 4194304 | 2752x | 3.719 ms | 9.24% | 3.710 ms | 9.23% | 1.130G |
| 2^23 = 8388608 | 128x | 4.855 ms | 3.38% | 4.846 ms | 3.37% | 1.731G |
| 2^24 = 16777216 | 720x | 7.029 ms | 4.67% | 7.021 ms | 4.66% | 2.390G |
| 2^25 = 33554432 | 832x | 10.760 ms | 3.83% | 10.751 ms | 3.83% | 3.121G |
| 2^26 = 67108864 | 576x | 17.961 ms | 2.86% | 17.953 ms | 2.86% | 3.738G |
| 2^27 = 134217728 | 461x | 32.550 ms | 2.13% | 32.542 ms | 2.13% | 4.124G |
| 2^28 = 268435456 | 243x | 61.813 ms | 1.60% | 61.805 ms | 1.60% | 4.343G |
| 2^29 = 536870912 | 125x | 120.445 ms | 1.21% | 120.437 ms | 1.21% | 4.458G |
| 2^30 = 1073741824 | 66x | 228.833 ms | 0.75% | 228.825 ms | 0.75% | 4.692G |
```
### JSON Parser
The overall parser performance is obviously impacted as we're now also doing type conversion instead of just returning string columns.
#### Before
```
# Benchmark Results

## nested_json_gpu_parser

### [0] Tesla V100-SXM2-32GB

| string_size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
|-------------------|---------|------------|-------|------------|-------|----------|
| 2^20 = 1048576 | 1040x | 7.361 ms | 5.61% | 7.353 ms | 5.61% | 142.614M |
| 2^21 = 2097152 | 832x | 11.549 ms | 3.63% | 11.541 ms | 3.63% | 181.708M |
| 2^22 = 4194304 | 740x | 20.264 ms | 2.98% | 20.257 ms | 2.98% | 207.054M |
| 2^23 = 8388608 | 407x | 36.844 ms | 2.26% | 36.837 ms | 2.26% | 227.724M |
| 2^24 = 16777216 | 80x | 75.590 ms | 1.95% | 75.582 ms | 1.95% | 221.974M |
| 2^25 = 33554432 | 80x | 179.442 ms | 4.40% | 179.434 ms | 4.40% | 187.001M |
| 2^26 = 67108864 | 40x | 379.821 ms | 0.98% | 379.815 ms | 0.98% | 176.688M |
| 2^27 = 134217728 | 20x | 777.351 ms | 1.72% | 777.347 ms | 1.72% | 172.661M |
| 2^28 = 268435456 | 10x | 1.550 s | 0.99% | 1.550 s | 0.99% | 173.212M |
| 2^29 = 536870912 | 5x | 3.055 s | 0.41% | 3.055 s | 0.41% | 175.749M |
| 2^30 = 1073741824 | 3x | 6.315 s | inf% | 6.315 s | inf% | 170.018M |
```
#### After
```
| string_size | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s |
|-------------------|---------|------------|-------|------------|-------|----------|
| 2^20 = 1048576 | 1568x | 7.908 ms | 5.24% | 7.900 ms | 5.24% | 132.730M |
| 2^21 = 2097152 | 576x | 12.235 ms | 3.24% | 12.228 ms | 3.24% | 171.509M |
| 2^22 = 4194304 | 192x | 21.171 ms | 2.09% | 21.164 ms | 2.09% | 198.182M |
| 2^23 = 8388608 | 96x | 38.990 ms | 1.96% | 38.983 ms | 1.96% | 215.188M |
| 2^24 = 16777216 | 192x | 78.414 ms | 2.21% | 78.407 ms | 2.21% | 213.977M |
| 2^25 = 33554432 | 81x | 187.007 ms | 6.47% | 187.000 ms | 6.47% | 179.435M |
| 2^26 = 67108864 | 38x | 400.007 ms | 1.59% | 400.000 ms | 1.59% | 167.772M |
| 2^27 = 134217728 | 19x | 801.575 ms | 1.29% | 801.571 ms | 1.29% | 167.443M |
| 2^28 = 268435456 | 10x | 1.590 s | 0.42% | 1.590 s | 0.42% | 168.799M |
| 2^29 = 536870912 | 5x | 3.150 s | 0.40% | 3.150 s | 0.40% | 170.456M |
| 2^30 = 1073741824 | 3x | 6.402 s | inf% | 6.402 s | inf% | 167.712M |
```
## Supported escape sequences:
```
\" represents the quotation mark character (U+0022).
\\ represents the reverse solidus character (U+005C).
\/ represents the solidus character (U+002F).
\b represents the backspace character (U+0008).
\f represents the form feed character (U+000C).
\n represents the line feed character (U+000A).
\r represents the carriage return character (U+000D).
\t represents the character tabulation character (U+0009).
\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, for code points on the BMP
\uDDDD\uDDDD, where `D` is a hex digit 0-9, a-f, A-F, representing UTF-16 surrogate pairs for remaining unicode code points
```

Authors: - Elias Stehle (https://github.com/elstehle) - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) - Karthikeyan (https://github.com/karthikeyann) Approvers: - Michael Wang (https://github.com/isVoid) - Karthikeyan (https://github.com/karthikeyann) - Vukasin Milovanovic (https://github.com/vuule) URL: #11574
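Two of the behaviours described for #11574, discovery-order columns and `\uDDDD` escape handling, can be illustrated with Python's standard `json` module. This sketch is only an analogy for the expected results; it does not call the cuDF reader, and `discovered_column_order` is a hypothetical helper.

```python
import json
from typing import Dict, List


def discovered_column_order(records: List[Dict[str, object]]) -> List[str]:
    """Return field names in the order they are first encountered."""
    order: Dict[str, None] = {}
    for record in records:
        for key in record:
            order.setdefault(key, None)  # dicts preserve insertion order
    return list(order)


rows = json.loads('[{"b":1, "c":1}, {"a":1}]')
column_order = discovered_column_order(rows)

# A single \uDDDD escape encodes a code point on the Basic Multilingual Plane;
# a pair of escapes forms a UTF-16 surrogate pair for a code point beyond it.
bmp_char = json.loads('"\\u00e9"')
astral_char = json.loads('"\\ud83d\\ude00"')
```

The column order comes out as `b`, `c`, `a`, matching the example in the PR text, and the surrogate pair decodes to a single code point.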

…ps (#11695) Capture groups are used for extracting specific matching substrings but are also used for grouping alternation or other sub-pattern matches. If the capture group is not used for extraction, then a non-capture group could be specified for these cases. A non-capture group will generate fewer regex instructions, which can help reduce device memory usage. Since the libcudf strings regex API calls already check where capture groups are required, the API can inform the regex compiler if capture groups are necessary. Then the compiler could automatically convert to non-capture groups, reducing the number of instructions produced. Introduces a new `capture_groups` parameter for use in the regex compiler step for this purpose. This is an improvement in efficiency and no external behavior has changed. Also fixes a bug found when testing where a non-capture group pattern is used with an invalid quantifier sequence. A test case was added to verify the bug fix. Closes #11663 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Tobias Ribizel (https://github.com/upsj) URL: #11695
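The equivalence that #11695 relies on can be checked with Python's `re` module: when only a yes/no match is needed, a capture group and its non-capture form accept exactly the same strings. This is an analogy using Python's regex engine, not the libcudf regex compiler, and the sample pattern is made up for illustration.

```python
import re

# Same alternation, once as a capture group and once as a non-capture group.
capturing = re.compile(r"(ab|cd)+x")
non_capturing = re.compile(r"(?:ab|cd)+x")

samples = ["ababx", "cdx", "abcd", "x"]
capturing_hits = [bool(capturing.search(s)) for s in samples]
non_capturing_hits = [bool(non_capturing.search(s)) for s in samples]

# The match results agree; only the bookkeeping for extraction differs.
same_matches = capturing_hits == non_capturing_hits
group_counts = (capturing.groups, non_capturing.groups)
```

Only APIs that actually extract substrings need the group bookkeeping, which is why a contains-style call can safely use the cheaper form.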

@davidwendt

…ries.apply` (#11319) This PR provides initial support for string data inside UDFs passed to `DataFrame.apply` and `Series.apply`. The allowed APIs are based on Python's `str` class. It aims to implement Python string semantics as closely as possible, starting with APIs that ***return numeric data only.*** These are the following 21 functions:
- `str.count`
- `str.startswith`
- `str.endswith`
- `str.find`
- `str.rfind`
- `str.isalnum`
- `str.isdecimal`
- `str.isdigit`
- `str.islower`
- `str.isupper`
- `str.isalpha`
- `str.istitle`
- `str.isspace`
- `==`, `!=`, `>=`, `<=`, `>`, `<` (between two strings)
- `len`
- `__contains__`

The following 3 functions are not included because there is no equivalent libcudf code available to back them (they refer to Python concepts):
- `str.isascii`
- `str.isidentifier`
- `str.isprintable`

This works by creating a library of `__device__` functions based on libcudf which perform the above functions for one single string. The rest of the code is a library of numba extensions that replace a Python UDF with a chain of those `__device__` functions and create a kernel that calls the result across a grid of threads, taking a full column of strings as input. cc @davidwendt @gmarkall Authors: - https://github.com/brandon-b-miller - Bradley Dice (https://github.com/bdice) - David Wendt (https://github.com/davidwendt) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Bradley Dice (https://github.com/bdice) - Vyas Ramasubramani (https://github.com/vyasr) - Ashwin Srinath (https://github.com/shwina) - David Wendt (https://github.com/davidwendt) URL: #11319
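As a rough picture of the kind of UDF #11319 targets, the function below uses only operations from the supported list and returns numeric data. It is run here in plain Python over a list; the assumption (not verified in this sketch) is that the same function could be passed to `Series.apply` on a string column. The function name `score` and the sample inputs are made up.

```python
def score(s: str) -> int:
    """A UDF restricted to str operations that return numeric/boolean data."""
    if s.startswith("ab") and s.endswith("z"):
        return s.count("a") + len(s)
    if "x" in s:  # __contains__
        return s.find("x")
    return -1


# Plain-Python stand-in for applying the UDF to every row of a string column.
results = [score(s) for s in ["abaz", "oxo", "hello"]]
```

Each branch returns an integer, which matches the PR's restriction to APIs that return numeric data only.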

This PR introduces `pandas-1.5` support in `cudf`. The changes include:

- [x] Requires `group_keys` support in `groupby` for `dask_cudf` to work: #11659
- [x] Requires `zfill` updates to match `pandas-1.5` behavior: #11634
- [x] `where` API: Ability to inspect whether a scalar value fits into the existing dtype, similar to: pandas-dev/pandas#48373
- [x] Switches `ValueError` to `TypeError` when an unknown category is being set to a `CategoricalColumn`
- [x] Handles a breaking change in an `ArrowIntervalType` related import that caused `cudf` to fail on import.
- [x] Fix an issue with `IntervalColumn.to_pandas`.
- [x] Raises an error when an object of `boolean` dtype is being set to a `NumericalColumn`.
- [x] Raises an error when `pat` is None in `Series.str.startswith` & `Series.str.endswith`.
- [x] Add `IntervalDtype.to_pandas` with appropriate versioning.
- [x] Handle `get_window_bounds` signature changes.
- [x] Fix and version a bunch of pytests.

```python
branch-22.10:
== 4275 failed, 79837 passed, 2049 skipped, 1193 xfailed, 1923 xpassed, 6597 warnings, 4 errors in 1103.52s (0:18:23) ==
== 803 failed, 106 passed, 14 skipped, 14 xfailed, 324 warnings, 17 errors in 148.46s (0:02:28) ==

This PR:
== 84041 passed, 2049 skipped, 1199 xfailed, 1710 xpassed, 6599 warnings in 359.27s (0:05:59) ==
== 954 passed, 14 skipped, 7 xfailed, 3 xpassed, 580 warnings in 54.75s ==
```

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Ashwin Srinath (https://github.com/shwina)
- Matthew Roeschke (https://github.com/mroeschke)
- Mark Sadang (https://github.com/msadang)

URL: #11617

The stream is only `constexpr` in the default case for cudf. When compiled with PTDS (or in the future, with a custom stream) the `default_stream_value` is a constant, but not necessarily a compile-time constant. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) URL: #11725

@shwina

This will align the implementation with those in other libraries, xref data-apis/dataframe-api#80. Cc @shwina Authors: - Ralf Gommers (https://github.com/rgommers) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #11692

…1682) This PR adds the option to take an explicit nested schema, allowing users to specify the target data types of the leaf columns in the nested JSON reader. This PR adds the corresponding interface and implementation to libcudf. In addition, the PR converts the existing JSON reader tests to parametrised tests and enables them for dual execution of (1) the existing JSON reader and (2) the new nested JSON reader. Authors: - Elias Stehle (https://github.com/elstehle) - Vukasin Milovanovic (https://github.com/vuule) - Yunsong Wang (https://github.com/PointKernel) - Karthikeyan (https://github.com/karthikeyann) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - David Wendt (https://github.com/davidwendt) - Karthikeyan (https://github.com/karthikeyann) URL: #11682

Closes #10941

This PR refactors the CSV reader benchmarks with nvbench and reduces the number of test cases by isolating data type, IO type, column selection, and row selection. Example output of the new benchmarks:

<details>
<summary>Benchmark results</summary>

## csv_read_data_type

### [0] Quadro RTX 8000

| data_type | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| INTEGRAL | 5x | 1.140 s | 0.09% | 1.140 s | 0.09% | 235553841 | 1.202 GiB | 668.564 MiB |
| FLOAT | 5x | 1.262 s | 0.04% | 1.262 s | 0.04% | 212718321 | 1.041 GiB | 713.885 MiB |
| DECIMAL | 5x | 272.787 ms | 0.03% | 272.784 ms | 0.03% | 984060406 | 396.279 MiB | 167.951 MiB |
| TIMESTAMP | 7x | 1.681 s | 0.47% | 1.681 s | 0.47% | 159723724 | 2.281 GiB | 814.268 MiB |
| DURATION | 7x | 2.121 s | 0.50% | 2.121 s | 0.50% | 126587514 | 2.588 GiB | 971.320 MiB |
| STRING | 19x | 496.713 ms | 0.50% | 496.710 ms | 0.50% | 540426462 | 859.526 MiB | 277.082 MiB |

## csv_read_io

### [0] Quadro RTX 8000

| io | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|----|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| FILEPATH | 9x | 1.185 s | 0.49% | 1.185 s | 0.49% | 226466264 | 1.445 GiB | 618.876 MiB |
| HOST_BUFFER | 5x | 1.170 s | 0.14% | 1.170 s | 0.14% | 229459856 | 1.445 GiB | 618.876 MiB |

## csv_read_column_selection

### [0] Quadro RTX 8000

| column_selection | row_selection | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|------------------|---------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| ALL | ALL | 5x | 1.246 s | 0.18% | 1.246 s | 0.18% | 215514992 | 1.582 GiB | 653.520 MiB |
| ALTERNATE | ALL | 5x | 1.128 s | 0.08% | 1.128 s | 0.08% | 119009844 | 1.116 GiB | 648.908 MiB |
| FIRST_HALF | ALL | 5x | 1.143 s | 0.07% | 1.143 s | 0.07% | 117443933 | 1.121 GiB | 653.520 MiB |
| SECOND_HALF | ALL | 5x | 1.152 s | 0.16% | 1.152 s | 0.16% | 116478469 | 1.121 GiB | 653.520 MiB |

## csv_read_row_selection

### [0] Quadro RTX 8000

| column_selection | row_selection | num_chunks | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|------------------|---------------|------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| ALL | BYTE_RANGE | 1 | 5x | 1.244 s | 0.16% | 1.244 s | 0.16% | 215763257 | 1.582 GiB | 653.520 MiB |
| ALL | BYTE_RANGE | 8 | 5x | 1.170 s | 0.04% | 1.170 s | 0.04% | 229339594 | 202.596 MiB | 653.520 MiB |
| ALL | NROWS | 1 | 5x | 1.244 s | 0.12% | 1.244 s | 0.12% | 215808401 | 1.582 GiB | 653.520 MiB |
| ALL | NROWS | 8 | 4x | 4.560 s | inf% | 4.560 s | inf% | 58870122 | 320.771 MiB | 653.520 MiB |
| ALL | SKIPFOOTER | 1 | 5x | 1.245 s | 0.10% | 1.245 s | 0.10% | 215660012 | 1.582 GiB | 653.520 MiB |
| ALL | SKIPFOOTER | 8 | 3x | 7.443 s | inf% | 7.443 s | inf% | 36065528 | 1.269 GiB | 653.520 MiB |

</details>

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11678

Issue #10941

<details>
<summary>Example benchmark Results</summary>

## parquet_write_encode

### [0] Quadro RTX 8000

| data_type | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|-----------|-------------|------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| INTEGRAL | 0 | 1 | 13x | 1.100 s | 2.13% | 1.100 s | 2.13% | 487932662 | 2.146 GiB | 506.565 MiB |
| INTEGRAL | 1000 | 1 | 37x | 371.301 ms | 3.51% | 371.290 ms | 3.51% | 1445960381 | 2.770 GiB | 165.810 MiB |
| INTEGRAL | 0 | 32 | 6x | 95.736 ms | 0.33% | 95.727 ms | 0.33% | 5608374209 | 2.770 GiB | 27.592 MiB |
| INTEGRAL | 1000 | 32 | 183x | 67.311 ms | 2.25% | 67.304 ms | 2.25% | 7976843257 | 2.770 GiB | 14.369 MiB |
| FLOAT | 0 | 1 | 13x | 1.094 s | 1.57% | 1.094 s | 1.57% | 490681898 | 1.100 GiB | 510.303 MiB |
| FLOAT | 1000 | 1 | 53x | 245.566 ms | 1.58% | 245.553 ms | 1.58% | 2186376800 | 1.765 GiB | 110.206 MiB |
| FLOAT | 0 | 32 | 159x | 74.142 ms | 2.54% | 74.134 ms | 2.54% | 7241937929 | 1.765 GiB | 23.587 MiB |
| FLOAT | 1000 | 32 | 266x | 45.006 ms | 3.69% | 44.999 ms | 3.69% | 11930657028 | 1.765 GiB | 9.888 MiB |
| DECIMAL | 0 | 1 | 33x | 426.241 ms | 1.20% | 426.228 ms | 1.20% | 1259587153 | 1.039 GiB | 141.641 MiB |
| DECIMAL | 1000 | 1 | 111x | 109.277 ms | 3.73% | 109.266 ms | 3.73% | 4913426291 | 1.145 GiB | 44.820 MiB |
| DECIMAL | 0 | 32 | 309x | 37.947 ms | 3.60% | 37.940 ms | 3.60% | 14150565744 | 1.145 GiB | 8.327 MiB |
| DECIMAL | 1000 | 32 | 371x | 32.174 ms | 4.67% | 32.167 ms | 4.67% | 16690275220 | 1.145 GiB | 6.669 MiB |
| TIMESTAMP | 0 | 1 | 14x | 1.047 s | 2.11% | 1.047 s | 2.11% | 512870450 | 1.178 GiB | 462.140 MiB |
| TIMESTAMP | 1000 | 1 | 60x | 208.567 ms | 2.25% | 208.555 ms | 2.25% | 2574239221 | 1.474 GiB | 92.808 MiB |
| TIMESTAMP | 0 | 32 | 162x | 71.909 ms | 1.82% | 71.901 ms | 1.82% | 7466791943 | 1.474 GiB | 20.855 MiB |
| TIMESTAMP | 1000 | 32 | 296x | 40.141 ms | 3.10% | 40.134 ms | 3.10% | 13376977353 | 1.474 GiB | 8.718 MiB |
| DURATION | 0 | 1 | 14x | 1.010 s | 2.36% | 1.010 s | 2.36% | 531706626 | 1.150 GiB | 436.918 MiB |
| DURATION | 1000 | 1 | 59x | 208.890 ms | 2.81% | 208.877 ms | 2.81% | 2570271173 | 1.474 GiB | 92.663 MiB |
| DURATION | 0 | 32 | 166x | 69.930 ms | 1.94% | 69.922 ms | 1.94% | 7678100086 | 1.474 GiB | 19.551 MiB |
| DURATION | 1000 | 32 | 295x | 39.998 ms | 3.72% | 39.991 ms | 3.72% | 13424718570 | 1.474 GiB | 8.541 MiB |
| STRING | 0 | 1 | 5x | 1.281 s | 0.45% | 1.281 s | 0.45% | 418985121 | 1.342 GiB | 597.486 MiB |
| STRING | 1000 | 1 | 100x | 123.906 ms | 3.22% | 123.895 ms | 3.22% | 4333268264 | 677.964 MiB | 46.473 MiB |
| STRING | 0 | 32 | 5x | 1.283 s | 0.22% | 1.283 s | 0.22% | 418593329 | 1.342 GiB | 597.486 MiB |
| STRING | 1000 | 32 | 96x | 36.813 ms | 4.16% | 36.806 ms | 4.16% | 14586612568 | 677.964 MiB | 8.504 MiB |
| LIST | 0 | 1 | 5x | 1.552 s | 0.09% | 1.552 s | 0.09% | 345842800 | 1.695 GiB | 526.626 MiB |
| LIST | 1000 | 1 | 5x | 697.747 ms | 0.23% | 697.734 ms | 0.23% | 769449441 | 2.911 GiB | 175.888 MiB |
| LIST | 0 | 32 | 42x | 336.564 ms | 1.01% | 336.555 ms | 1.01% | 1595194403 | 2.911 GiB | 38.433 MiB |
| LIST | 1000 | 32 | 45x | 316.764 ms | 0.68% | 316.757 ms | 0.68% | 1694897420 | 2.911 GiB | 25.115 MiB |
| STRUCT | 0 | 1 | 5x | 1.236 s | 0.16% | 1.236 s | 0.16% | 434277368 | 1.283 GiB | 569.525 MiB |
| STRUCT | 1000 | 1 | 5x | 225.491 ms | 0.36% | 225.478 ms | 0.36% | 2381034954 | 1.324 GiB | 90.699 MiB |
| STRUCT | 0 | 32 | 5x | 903.626 ms | 0.21% | 903.615 ms | 0.21% | 594136463 | 1.477 GiB | 409.290 MiB |
| STRUCT | 1000 | 32 | 182x | 67.608 ms | 2.69% | 67.601 ms | 2.69% | 7941800457 | 1.324 GiB | 15.399 MiB |

## parquet_write_io_compression

### [0] Quadro RTX 8000

| io | compression | cardinality | run_length | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|----|-------------|-------------|------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| FILEPATH | SNAPPY | 0 | 1 | 4x | 3.939 s | inf% | 3.939 s | inf% | 136302643 | 1.643 GiB | 521.113 MiB |
| FILEPATH | SNAPPY | 1000 | 1 | 5x | 1.941 s | 0.49% | 1.941 s | 0.49% | 276656089 | 2.727 GiB | 170.914 MiB |
| FILEPATH | SNAPPY | 0 | 32 | 5x | 1.329 s | 0.45% | 1.329 s | 0.45% | 403934692 | 2.722 GiB | 50.835 MiB |
| FILEPATH | SNAPPY | 1000 | 32 | 12x | 1.275 s | 0.51% | 1.275 s | 0.51% | 421015682 | 2.727 GiB | 24.365 MiB |
| FILEPATH | NONE | 0 | 1 | 7x | 2.378 s | 0.77% | 2.378 s | 0.77% | 225765543 | 1.643 GiB | 529.611 MiB |
| FILEPATH | NONE | 1000 | 1 | 7x | 1.262 s | 0.49% | 1.262 s | 0.49% | 425263712 | 2.727 GiB | 180.315 MiB |
| FILEPATH | NONE | 0 | 32 | 5x | 1.116 s | 0.30% | 1.116 s | 0.30% | 480884592 | 2.722 GiB | 58.968 MiB |
| FILEPATH | NONE | 1000 | 32 | 8x | 1.014 s | 0.50% | 1.014 s | 0.50% | 529606276 | 2.727 GiB | 32.308 MiB |
| HOST_BUFFER | SNAPPY | 0 | 1 | 4x | 4.181 s | inf% | 4.181 s | inf% | 128399871 | 1.643 GiB | 521.112 MiB |
| HOST_BUFFER | SNAPPY | 1000 | 1 | 6x | 2.026 s | 0.48% | 2.026 s | 0.48% | 264969784 | 2.727 GiB | 170.914 MiB |
| HOST_BUFFER | SNAPPY | 0 | 32 | 5x | 1.363 s | 0.41% | 1.363 s | 0.41% | 393913005 | 2.722 GiB | 50.835 MiB |
| HOST_BUFFER | SNAPPY | 1000 | 32 | 5x | 1.277 s | 0.43% | 1.277 s | 0.43% | 420459944 | 2.727 GiB | 24.364 MiB |
| HOST_BUFFER | NONE | 0 | 1 | 5x | 2.649 s | 0.42% | 2.649 s | 0.42% | 202649168 | 1.643 GiB | 529.611 MiB |
| HOST_BUFFER | NONE | 1000 | 1 | 5x | 1.332 s | 0.41% | 1.332 s | 0.41% | 403090403 | 2.727 GiB | 180.315 MiB |
| HOST_BUFFER | NONE | 0 | 32 | 5x | 1.151 s | 0.46% | 1.151 s | 0.46% | 466449565 | 2.722 GiB | 58.968 MiB |
| HOST_BUFFER | NONE | 1000 | 32 | 13x | 1.039 s | 0.50% | 1.039 s | 0.50% | 516732638 | 2.727 GiB | 32.308 MiB |
| VOID | SNAPPY | 0 | 1 | 5x | 3.559 s | 0.62% | 3.559 s | 0.62% | 150867866 | 1.643 GiB | 521.113 MiB |
| VOID | SNAPPY | 1000 | 1 | 7x | 1.817 s | 0.47% | 1.817 s | 0.47% | 295405582 | 2.727 GiB | 170.914 MiB |
| VOID | SNAPPY | 0 | 32 | 5x | 1.299 s | 0.04% | 1.299 s | 0.04% | 413272964 | 2.722 GiB | 50.836 MiB |
| VOID | SNAPPY | 1000 | 32 | 5x | 1.264 s | 0.28% | 1.264 s | 0.28% | 424605071 | 2.727 GiB | 24.364 MiB |
| VOID | NONE | 0 | 1 | 5x | 2.003 s | 0.50% | 2.003 s | 0.50% | 268012332 | 1.643 GiB | 529.611 MiB |
| VOID | NONE | 1000 | 1 | 5x | 1.127 s | 0.45% | 1.127 s | 0.45% | 476312808 | 2.727 GiB | 180.315 MiB |
| VOID | NONE | 0 | 32 | 5x | 1.081 s | 0.47% | 1.081 s | 0.47% | 496747581 | 2.722 GiB | 58.968 MiB |
| VOID | NONE | 1000 | 32 | 5x | 999.381 ms | 0.48% | 999.378 ms | 0.48% | 537205288 | 2.727 GiB | 32.308 MiB |

## parquet_write_options

### [0] Quadro RTX 8000

| statistics | compression | file_path | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|------------|-------------|-----------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| STATISTICS_NONE | SNAPPY | unused_path.parquet | 3x | 5.961 s | inf% | 5.961 s | inf% | 90067884 | 2.427 GiB | 122.010 MiB |
| STATISTICS_NONE | SNAPPY | | 3x | 5.962 s | inf% | 5.962 s | inf% | 90054559 | 2.427 GiB | 121.968 MiB |
| STATISTICS_NONE | NONE | unused_path.parquet | 4x | 4.253 s | inf% | 4.253 s | inf% | 126221980 | 2.427 GiB | 141.623 MiB |
| STATISTICS_NONE | NONE | | 4x | 4.249 s | inf% | 4.249 s | inf% | 126356682 | 2.427 GiB | 141.623 MiB |
| STATISTICS_ROWGROUP | SNAPPY | unused_path.parquet | 3x | 6.011 s | inf% | 6.011 s | inf% | 89314511 | 2.427 GiB | 122.055 MiB |
| STATISTICS_ROWGROUP | SNAPPY | | 3x | 5.983 s | inf% | 5.983 s | inf% | 89740066 | 2.427 GiB | 122.022 MiB |
| STATISTICS_ROWGROUP | NONE | unused_path.parquet | 4x | 4.282 s | inf% | 4.282 s | inf% | 125372100 | 2.427 GiB | 141.626 MiB |
| STATISTICS_ROWGROUP | NONE | | 4x | 4.287 s | inf% | 4.287 s | inf% | 125241731 | 2.427 GiB | 141.626 MiB |
| STATISTICS_PAGE | SNAPPY | unused_path.parquet | 3x | 5.976 s | inf% | 5.976 s | inf% | 89837494 | 2.427 GiB | 122.090 MiB |
| STATISTICS_PAGE | SNAPPY | | 3x | 5.979 s | inf% | 5.979 s | inf% | 89788086 | 2.427 GiB | 121.977 MiB |
| STATISTICS_PAGE | NONE | unused_path.parquet | 4x | 4.290 s | inf% | 4.290 s | inf% | 125138510 | 2.427 GiB | 141.633 MiB |
| STATISTICS_PAGE | NONE | | 4x | 4.292 s | inf% | 4.292 s | inf% | 125087291 | 2.427 GiB | 141.633 MiB |

## parquet_write_num_cols

### [0] Quadro RTX 8000

| num_cols | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|----------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| 8 | 5x | 217.270 ms | 0.13% | 217.262 ms | 0.13% | 2471073081 | 2.648 GiB | 114.635 MiB |
| 1024 | 5x | 339.592 ms | 0.25% | 339.582 ms | 0.25% | 1580974198 | 2.649 GiB | 145.293 MiB |

## parquet_chunked_write

### [0] Quadro RTX 8000

| num_cols | num_chunks | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|----------|------------|---------|----------|-------|----------|-------|------------------|-------------------|-------------------|
| 8 | 8 | 5x | 239.509 ms | 0.05% | 239.501 ms | 0.05% | 2241622403 | 338.950 MiB | 115.038 MiB |
| 1024 | 8 | 5x | 441.931 ms | 0.46% | 441.921 ms | 0.46% | 1214856630 | 339.430 MiB | 158.714 MiB |
| 8 | 64 | 5x | 458.133 ms | 0.10% | 458.125 ms | 0.10% | 1171887455 | 42.372 MiB | 117.129 MiB |
| 1024 | 64 | 12x | 1.284 s | 0.80% | 1.284 s | 0.80% | 418236962 | 42.828 MiB | 214.851 MiB |

</details>

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Tobias Ribizel (https://github.com/upsj)

URL: #11623

Issue #9313 The root cause is that the sum value was encoded as an unsigned int. The ORC specs show that the value should be encoded as signed. Because both encode and decode were assuming unsigned encoding, the existing C++ test (OrcStatisticsTest, Basic) was passing even without this fix. Added a Python test that uses a different decode method, so it fails without the fix. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Tobias Ribizel (https://github.com/upsj) - David Wendt (https://github.com/davidwendt) - Bradley Dice (https://github.com/bdice) URL: #11740
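Why a round trip inside one library can hide this kind of bug is sketched below. The sketch assumes the statistic is stored as a protobuf-style varint and that "signed" means zigzag encoding (as protobuf's `sint64` does); the helper names are hypothetical and this is not the cudf writer code.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data):
    """Decode a base-128 varint into a non-negative integer."""
    result = 0
    for i, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            break
    return result


def zigzag_encode(n):
    """Map a signed 64-bit integer to an unsigned one (0, -1, 1, -2 -> 0, 1, 2, 3)."""
    return (n << 1) ^ (n >> 63)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


column_sum = 10

# Buggy writer: stores the sum as a plain unsigned varint.
buggy_bytes = encode_varint(column_sum)
# Fixed writer: applies the signed (zigzag) mapping first.
fixed_bytes = encode_varint(zigzag_encode(column_sum))

# A reader with the same wrong assumption round-trips correctly ...
same_library_roundtrip = decode_varint(buggy_bytes)
# ... while a reader that follows the signed encoding sees a different value.
other_reader_on_buggy = zigzag_decode(decode_varint(buggy_bytes))
other_reader_on_fixed = zigzag_decode(decode_varint(fixed_bytes))
```

This is why a test that decodes with an independent implementation catches the problem while a same-library round trip does not.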

Updates the instructions for building the libcudf documentation files in DOCUMENTATION.md. The `cmake --build . --target docs_cudf` command will invoke the appropriate make tool that was set up when CMake was configured for building libcudf. Closes #11719 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Tobias Ribizel (https://github.com/upsj) - Bradley Dice (https://github.com/bdice) URL: #11735

Disables a `ContiguousSplitUntypedTest` that simply creates a very large (over 3GB) column to test that the output buffer size does not overflow. The gtest ends up requiring 25GB of device memory when used with the arena allocator, as mentioned in #11249. Very large columns like this should not be part of the unit tests for libcudf. This PR disables the test so it remains available for testing under specific conditions outside of CI. Closes #11249 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) URL: #11706

This PR fixes an issue where the `strings_udf` conda package for python 3.9 is missing, due to the way `strings_udf` is plumbed through CI. Authors: - https://github.com/brandon-b-miller Approvers: - AJ Schmidt (https://github.com/ajschmidt8) URL: #11730

…memory usage to benchmarks (#11732) This PR reduces memory requirements in the new nested JSON parser and adds `bytes_per_second` and `peak_memory_usage` to the benchmarks. Authors: - Elias Stehle (https://github.com/elstehle) Approvers: - Tobias Ribizel (https://github.com/upsj) - Karthikeyan (https://github.com/karthikeyann) - Yunsong Wang (https://github.com/PointKernel) URL: #11732

This PR adds the ability to construct a `ListColumn` when `size` is `None`. Previously, this failed with:

```python
In [1]: from cudf.core.column import build_list_column

In [2]: from cudf.core.column import as_column

In [3]: build_list_column(indices = as_column([0, 3]), elements = as_column([0, 2, 4]))
...
TypeError: an integer is required
```

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #11745
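One way a missing `size` can be derived is from the offsets themselves: in an offsets-based list layout, the number of rows is one less than the number of offsets. The sketch below uses plain Python lists and hypothetical helper names; whether cudf infers the size exactly this way is an assumption.

```python
def infer_list_size(offsets):
    """Number of list rows described by an offsets array."""
    return max(len(offsets) - 1, 0)


def to_pylist(offsets, elements):
    """Materialize the list rows described by offsets into Python lists."""
    size = infer_list_size(offsets)
    return [elements[offsets[i]:offsets[i + 1]] for i in range(size)]


# Same inputs as in the snippet above: offsets [0, 3] over elements [0, 2, 4].
size = infer_list_size([0, 3])
rows = to_pylist([0, 3], [0, 2, 4])
```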

The `sort=True` default change in dask/dask#9486 was not meant to propagate to the DataFrameGroupby and SeriesGroupby classes just yet. This PR adds the necessary `sort=None` defaults needed to avoid CI failures. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ashwin Srinath (https://github.com/shwina) URL: #11755

Currently many of our tests are only stream-safe because libcudf runs everything on the default stream. This PR updates tests to ensure that any function that launches a kernel and supports passing streams will act on cudf's default stream even when it is _not_ CUDA's default stream. There are other aspects required for stream-safety that are not addressed in this PR. For instance, some of our tests make use of `thrust::device_vector`, and its initialization is implicitly always on the default stream. I'll work on that in a separate PR since that also requires some discussion with the team on what expectations a stream-based libcudf API could look like for consumers that make use of thrust (i.e. do we start requiring device syncs for such consumers?). There are also numerous tests that fail when swapping in an alternate default stream, indicating other potential dependencies on streams. I'll work through those remaining issues separately as well to limit the scope of this PR. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Bradley Dice (https://github.com/bdice) - Robert Maynard (https://github.com/robertmaynard) - Nghia Truong (https://github.com/ttnghia) URL: #11726

Adds a JSON tree traversal algorithm in host and device. It generates column indices for the _record_ orient JSON format: a list of structs at the root, where each struct is a row.

- [x] column indices generation
- [x] row offset

Depends on PR #11518

### Tree Traversal

This algorithm assigns a unique column id to each node in the tree. The row offset is the row index of the node in that column id.

Algorithm:

1. Convert node_category+fieldname to node_type.
   a. Create a hashmap to hash field name and assign unique node id as values.
   b. Convert the node categories to node types. Node type is defined as node category enum value if it is not a field node, otherwise it is the unique node id assigned by the hashmap (value shifted by #NUM_CATEGORY).
2. Preprocessing: Translate parent node ids after sorting by level.
   a. sort by level
   b. get gather map of sorted indices
   c. translate parent_node_ids to new sorted indices
3. Find level boundaries. copy_if index of first unique values of sorted levels.
4. Per-Level Processing: Propagate parent node ids for each level. For each level,
   a. gather col_id from previous level results. input=col_id, gather_map is parent_indices.
   b. stable sort by {parent_col_id, node_type}
   c. scan sum of unique {parent_col_id, node_type}
   d. scatter the col_id back to stable node_level order (using scatter_indices)

   Restore original node_id order
5. Generate row_offset.
   a. stable_sort by parent_col_id.
   b. scan_by_key {parent_col_id} (required only on nodes whose parent is list)
   c. propagate to non-list leaves from parent list node by recursion

Authors:
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Elias Stehle (https://github.com/elstehle)
- Tobias Ribizel (https://github.com/upsj)
- Yunsong Wang (https://github.com/PointKernel)
- David Wendt (https://github.com/davidwendt)

URL: #11610
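The outcome of the traversal can be sketched on the host with a simple sequential loop. This is a simplified stand-in for the sort/scan/scatter formulation above, not the PR's implementation: the function name is hypothetical, nodes are assumed to be stored in pre-order (parents before children), and dictionaries replace the hashmap and key-based scans.

```python
from collections import defaultdict


def traverse(categories, names, parents):
    """Assign a column id and row offset to every node of a JSON node tree."""
    n = len(categories)
    levels = [0] * n
    for i in range(n):  # valid because parents precede children
        levels[i] = 0 if parents[i] < 0 else levels[parents[i]] + 1

    # Stable sort by level, so each level sees its parents' column ids.
    order = sorted(range(n), key=lambda i: levels[i])

    col_of_key = {}  # (parent_col_id, node_type) -> col_id
    col_ids = [None] * n
    row_offsets = [0] * n
    next_row = defaultdict(int)  # running row count per column

    for i in order:
        p = parents[i]
        parent_col = -1 if p < 0 else col_ids[p]
        # Field nodes are distinguished by name, other nodes by category.
        if categories[i] == "FIELD":
            node_type = ("FIELD", names[i])
        else:
            node_type = (categories[i], None)
        key = (parent_col, node_type)
        col_ids[i] = col_of_key.setdefault(key, len(col_of_key))

        if p >= 0 and categories[p] == "LIST":
            # Children of a list start a new row in their column.
            row_offsets[i] = next_row[col_ids[i]]
            next_row[col_ids[i]] += 1
        elif p >= 0:
            # Everything else inherits the row of its parent.
            row_offsets[i] = row_offsets[p]
    return col_ids, row_offsets


# Node tree for: [{"a": 1, "b": "x"}, {"a": 2}]
categories = ["LIST", "STRUCT", "FIELD", "VALUE", "FIELD", "STRING",
              "STRUCT", "FIELD", "VALUE"]
names = [None, None, "a", None, "b", None, None, "a", None]
parents = [-1, 0, 1, 2, 1, 4, 0, 6, 7]

col_ids, row_offsets = traverse(categories, names, parents)
```

Both `"a"` fields end up in the same column, and the nodes under the second struct get row offset 1, which matches the records-orient reading of the input.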

Adds the ability for ORC statistics reader to read the value `ColumnStatistics::hasNull`. Contributes to #7087. Does not close it because the issue also requires the ability to write the field in the orc writer. Authors: - Devavret Makkar (https://github.com/devavret) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #11747

This PR Fixes: #11620 by passing `dtype` parameter wherever necessary in the API code to remove unnecessary pandas warnings to the end-user. Note this PR doesn't make changes to the testing code. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #11761

This PR corrects a couple of places where we are running on the CUDA default stream instead of cudf's default stream. Authors: - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Nghia Truong (https://github.com/ttnghia) - Tobias Ribizel (https://github.com/upsj) URL: #11759

Fixes the logic in `cudf::lists::sort_lists` for sorting floating-point values containing `NaN`, `-NaN`, `Infinity` and `-Infinity`. For large lists (more than 100 elements) of any numeric type, `cub::DeviceSegmentedRadixSort` is used for sorting. This function sorts using bit values, which results in `-NaN` values being sorted to the front. Smaller lists and non-numeric types use the libcudf row-operator sorting. The logic fix in this PR bypasses the radix sort for floating-point values. This is consistent with other sort logic which uses radix sort similarly: https://github.com/rapidsai/cudf/blob/972708afcfaca006a8483a133dfacd540f232ef1/cpp/src/sort/sort_column.cu#L34 A gtest was also added to verify the fix. Closes #11630 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert (Bobby) Evans (https://github.com/revans2) - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11703
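The effect of sorting floats by their bit values can be reproduced on the host. The key transform below is the common technique for radix-sorting IEEE-754 values (flip all bits of negative numbers, set the sign bit of non-negative ones); that CUB uses exactly this transform is an assumption, and the helper names are hypothetical.

```python
import math
import struct


def float_bits(x):
    """Raw IEEE-754 bit pattern of a double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def radix_key(x):
    """Unsigned key whose integer order matches a bitwise radix sort of doubles."""
    bits = float_bits(x)
    if bits >> 63:  # sign bit set: reverse the order of negatives
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)  # non-negative values sort after negatives


# Build NaNs with an explicit sign bit from their bit patterns.
neg_nan = struct.unpack("<d", struct.pack("<Q", 0xFFF8000000000000))[0]
pos_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]

values = [1.0, neg_nan, -math.inf, pos_nan, math.inf, -1.0]
by_bits = sorted(values, key=radix_key)
```

With this ordering `-NaN` lands before `-Infinity` while `NaN` lands after `Infinity`, so the two NaN variants are split across both ends of the list instead of being treated alike.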

Allows use of any bit width from 1 to 24 for Parquet dictionary keys. Also removes some more magic numbers and cleans up the dictionary test code. Finishes off changes for #10948. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - https://github.com/nvdbaranec - Vukasin Milovanovic (https://github.com/vuule) URL: #11580
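For intuition, the number of bits needed to index a dictionary grows with the number of entries. The helper below is a hypothetical sketch of that relationship; how the writer actually picks the width, and what it does beyond 24 bits, is not stated here and is not modeled.

```python
def dictionary_key_bit_width(num_entries):
    """Smallest bit width (at least 1) that can index num_entries dictionary entries."""
    if num_entries < 1:
        raise ValueError("dictionary must contain at least one entry")
    return max(1, (num_entries - 1).bit_length())


# Widths for a few dictionary sizes, including sizes that are not powers of two.
widths = {n: dictionary_key_bit_width(n) for n in (1, 2, 3, 5, 1000, 2**24)}
```

Allowing every width in the range, rather than a fixed set of widths, lets the encoded keys use only as many bits as the dictionary size requires.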

This PR updates the version updater script so that we don't have to manually upgrade the strings_udf version in cmake file here: #11771 Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ray Douglass (https://github.com/raydouglass) URL: #11772

Fixes: #11011 This PR: - [x] Adds a side-section for `list` & `struct` handling. - [x] Reduces duplication. - [x] Exposes more `ListMethods` APIs. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ashwin Srinath (https://github.com/shwina) URL: #11770

Fixes: #11721

This PR:

- [x] Fixes: #11721, by not going through the fill & fill_inplace APIs which don't support `struct` and `list` columns.
- [x] Fixes an issue in caching while constructing a `struct` or `list` scalar as `list` & `dict` objects are not hashable and we were running into the following errors:

```python
In [9]: i = cudf.Scalar([10, 11])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:51, in CachedScalarInstanceMeta.__call__(self, value, dtype)
     49 try:
     50     # try retrieving an instance from the cache:
---> 51     self.__instances.move_to_end(cache_key)
     52     return self.__instances[cache_key]

KeyError: ([10, 11], <class 'list'>, None, <class 'NoneType'>)

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In [9], line 1
----> 1 i = cudf.Scalar([10, 11])

File /nvme/0/pgali/envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:57, in CachedScalarInstanceMeta.__call__(self, value, dtype)
     53 except KeyError:
     54     # if an instance couldn't be found in the cache,
     55     # construct it and add to cache:
     56     obj = super().__call__(value, dtype=dtype)
---> 57     self.__instances[cache_key] = obj
     58     if len(self.__instances) > self.__maxsize:
     59         self.__instances.popitem(last=False)

TypeError: unhashable type: 'list'
```

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Matthew Roeschke (https://github.com/mroeschke)

URL: #11760
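One possible way to make such a cache tolerate unhashable values is to bypass it for them. The class below is a hypothetical, self-contained sketch of that idea using an `OrderedDict` LRU; it is not the code from this PR, and the actual fix may differ.

```python
from collections import OrderedDict


class ScalarCache:
    """Tiny LRU cache that skips caching for unhashable values (hypothetical)."""

    def __init__(self, maxsize=128):
        self._instances = OrderedDict()
        self._maxsize = maxsize

    def get_or_create(self, value, factory):
        key = (value, type(value))
        try:
            hash(key)
        except TypeError:
            # list/dict values cannot be dictionary keys: build without caching.
            return factory(value)
        if key in self._instances:
            self._instances.move_to_end(key)  # mark as most recently used
            return self._instances[key]
        obj = factory(value)
        self._instances[key] = obj
        if len(self._instances) > self._maxsize:
            self._instances.popitem(last=False)  # evict least recently used
        return obj


cache = ScalarCache(maxsize=2)
first = cache.get_or_create(1, lambda v: {"value": v})
second = cache.get_or_create(1, lambda v: {"value": v})
list_scalar = cache.get_or_create([10, 11], tuple)
```

Checking hashability up front keeps the hashable fast path unchanged while avoiding the chained `KeyError`/`TypeError` shown in the traceback.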

This PR fixes: #11159 by returning correct object type for the result of `isna` & `notna` in `Index`. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #11769

Fixes: #11683, #10823 This PR: - [x] Removes `kwargs` in the CSV reader & writer so that users get clear errors when they misspell a parameter. - [x] Re-orders the `read_csv` & `to_csv` parameters so that they now match pandas. The diff effectively adds `storage_options` to `read_csv` & `to_csv` after removing `kwargs`; the rest is re-ordering. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ashwin Srinath (https://github.com/shwina) - Vukasin Milovanovic (https://github.com/vuule) URL: #11762
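The motivation for dropping `kwargs` can be shown with two toy functions. Both functions and their parameters are hypothetical stand-ins, not the cudf signatures.

```python
def write_csv_with_kwargs(path, sep=",", **kwargs):
    """Old-style signature: unknown keywords are silently swallowed."""
    return {"path": path, "sep": sep}


def write_csv_strict(path, sep=",", storage_options=None):
    """Explicit signature: unknown keywords raise immediately."""
    return {"path": path, "sep": sep, "storage_options": storage_options}


# A misspelled keyword is ignored, so the default separator is used unnoticed.
silently_wrong = write_csv_with_kwargs("out.csv", seperator=";")

# With the explicit signature, the same typo fails loudly.
error = None
try:
    write_csv_strict("out.csv", seperator=";")
except TypeError as exc:
    error = str(exc)
```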

…ut table (#11709) By definition, the `cudf::partition*` API will return a vector of offsets whose size is at least the number of partitions. As such, an empty output table should be associated with an output offsets array like `[0, 0, ..., 0]` (all zeros). However, currently the output offsets in such situations are an empty array. This PR corrects the implementation for such corner cases. Closes #11700. Authors: - Nghia Truong (https://github.com/ttnghia) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Vukasin Milovanovic (https://github.com/vuule) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11709
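The expected contract can be sketched with a small host-side partition function. This is a hypothetical illustration over plain Python lists, not the libcudf implementation; it returns one start offset per partition.

```python
def hash_partition(rows, num_partitions, key=hash):
    """Group rows by key(row) % num_partitions; return (rows, start offsets)."""
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[key(row) % num_partitions].append(row)

    partitioned, offsets = [], []
    for bucket in buckets:
        offsets.append(len(partitioned))  # where this partition starts
        partitioned.extend(bucket)
    return partitioned, offsets


rows_out, offsets_out = hash_partition([0, 1, 2, 3, 4, 5], 3)

# An empty input still yields one (zero) offset per partition.
empty_rows, empty_offsets = hash_partition([], 3)
```

Keeping the offsets array the same length regardless of input size means callers can slice partitions uniformly without special-casing empty tables.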

This PR generates JSON columns from the traversed JSON tree. It has the following parts:

1. `reduce_to_column_tree` - Reduce node tree into column tree by aggregating each property of each column and number of rows in each column.
2. `make_json_column2` - creates the GPU json column tree structure from tree and column info
3. `json_column_to_cudf_column2` - converts this GPU json column to cudf column.
4. `parse_nested_json2` - combines all json tokenizer, json tree generation, traversal, json column creation, cudf column conversion together. All steps run on device.

Depends on PR #11518 #11610

For code-review, use PR karthikeyann#5 which contains only this tree changes.

### Overview

- PR #11264 Tokenizes the JSON string to Tokens
- PR #11518 Converts Tokens to Nodes (tree representation)
- PR #11610 Traverses this node tree --> assigns column id and row index to each node.
- This PR #11714 Converts this traversed tree into JSON Column, which in turn is translated to `cudf::column`

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, STRING. STRUCT and LIST are nested types. FIELD nodes are struct columns' keys. A VALUE node is similar to a STRING column but without double quotes. Actual datatype conversion happens in `json_column_to_cudf_column2`.

Tree Representation: `tree_meta_t` has 4 data members.

1. node categories
2. node parents' id
3. node level
4. node's string range {begin, end} (as 2 vectors)

Currently supported JSON formats are records orient, and JSON lines.

### This PR - Detailed explanation

The steps in this PR are:

1. `reduce_to_column_tree`
   - Required to compute total number of columns, column type, nested column structure, and number of rows in each column.
   - Generates `tree_meta_t` data members for column.
     - Sort node tree by col_id (stable sort)
     - reduce_by_key custom_op on node_categories, collapses to column category
     - unique_by_key_copy by col_id, copies first parent_node_id, string_ranges. This parent_node_id will be transformed to parent_column_id.
     - reduce_by_key max on row_offsets gives maximum row offset in each column. Propagate list column children's max row offset to their children, because structs may sometimes miss entries, so the parent list gives the correct count.
2. `make_json_column2`
   - Converts nodes to GPU json columns in tree structure
     - get column tree, transfer column names to host.
     - Create `d_json_column` for non-field columns.
     - if 2 columns occur on the same path, and one of them is nested and the other is a string column, discard the string column.
     - For STRUCT, LIST, VALUE, STRING nodes, set the validity bits, and copy string {begin, end} range to string_offsets and string length.
     - Compute list offset
     - Perform scan max operation on offsets. (to fill 0's with previous offset value).
   - Now the `d_json_column` is nested, and contains offsets, validity bits, unparsed unconverted string information.
3. `json_column_to_cudf_column2` - converts this GPU json column to cudf column.
   - Recursively goes over each `d_json_column` and converts to `cudf::column` by inferring the type, parsing the string to type, and setting validity bits further.
4. `parse_nested_json2` - combines all json tokenizer, json tree generation, traversal, json column creation, cudf column conversion together. All steps run on device.

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Elias Stehle (https://github.com/elstehle)
- Yunsong Wang (https://github.com/PointKernel)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Tobias Ribizel (https://github.com/upsj)
- https://github.com/nvdbaranec
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11714
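Two of the reductions described above can be sketched on the host: collapsing per-node data into per-column data, and the max-scan that fills zero offsets with the previous value. The function names, the dictionary layout and the sample arrays are hypothetical; the category reduction and the list-child propagation are left out.

```python
from itertools import accumulate


def reduce_to_column_tree(col_ids, parent_node_ids, row_offsets):
    """Per column: parent column id (from the first node) and max row offset."""
    columns = {}
    for node, col in enumerate(col_ids):
        parent = parent_node_ids[node]
        parent_col = -1 if parent < 0 else col_ids[parent]
        info = columns.setdefault(
            col, {"parent_col_id": parent_col, "max_row_offset": 0}
        )
        info["max_row_offset"] = max(info["max_row_offset"], row_offsets[node])
    return columns


def fill_offsets(offsets):
    """Inclusive max-scan: zeros are replaced by the previous non-zero offset."""
    return list(accumulate(offsets, max))


# Traversal result for: [{"a": 1, "b": "x"}, {"a": 2}]
col_ids = [0, 1, 2, 4, 3, 5, 1, 2, 4]
parent_node_ids = [-1, 0, 1, 2, 1, 4, 0, 6, 7]
row_offsets = [0, 0, 0, 0, 0, 0, 1, 1, 1]

columns = reduce_to_column_tree(col_ids, parent_node_ids, row_offsets)
filled = fill_offsets([0, 2, 0, 0, 5, 0])
```

In this input the column for field `"b"` only reaches row offset 0 even though its parent struct column has two rows, which is the kind of missing-entry case the propagation from the parent is meant to handle.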

This adds a BGZIP `data_chunk_reader` usable with `multibyte_split`. The BGZIP format is a modified GZIP format that consists of multiple blocks of at most 65536 bytes of compressed data, each describing at most 65536 bytes of uncompressed data. The data can be accessed with record offsets provided by Tabix index files, which contain so-called virtual offsets (unsigned 64 bit) of the following form:

```
 63                  16      0
+----------------------+-------+
|     block offset     | local |
+----------------------+-------+
```

The lower 16 bits describe the offset inside the uncompressed data belonging to a single compressed block; the upper 48 bits describe the offset of the compressed block inside the BGZIP file. The interface allows two modes: reading a full compressed file, and reading between the locations described by two Tabix virtual offsets. For a description of the BGZIP format, see section 4 of the [SAM specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf).

Closes #10466

## TODO

- [x] Use events to avoid clobbering data that is still in use
- [x] stricter handling of local_begin (currently it may overflow into subsequent blocks)
- [x] add tests where local_begin and local_end are in the same chunk or even block
- [x] ~~add cudf deflate fallback if nvComp doesn't support it~~ this should not be necessary, since we only test with compatible nvcomp versions

Authors: - Tobias Ribizel (https://github.com/upsj) Approvers: - Michael Wang (https://github.com/isVoid) - Yunsong Wang (https://github.com/PointKernel) - Vukasin Milovanovic (https://github.com/vuule) URL: #11652
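The virtual-offset layout described above amounts to a shift and a mask. A minimal sketch follows; the helper names are hypothetical and are not part of the cudf API.

```python
LOCAL_BITS = 16                      # lower 16 bits: offset inside the uncompressed block
LOCAL_MASK = (1 << LOCAL_BITS) - 1   # 0xFFFF


def split_virtual_offset(virtual_offset):
    """Return (block_offset, local_offset) for a 64-bit Tabix virtual offset."""
    block_offset = virtual_offset >> LOCAL_BITS   # upper 48 bits
    local_offset = virtual_offset & LOCAL_MASK    # lower 16 bits
    return block_offset, local_offset


def make_virtual_offset(block_offset, local_offset):
    """Pack a compressed-block offset and a local offset into one integer."""
    if not 0 <= local_offset <= LOCAL_MASK:
        raise ValueError("local offset must fit in 16 bits")
    if not 0 <= block_offset < (1 << 48):
        raise ValueError("block offset must fit in 48 bits")
    return (block_offset << LOCAL_BITS) | local_offset


packed = make_virtual_offset(1234, 56)
unpacked = split_virtual_offset(packed)
```

Reading between two virtual offsets then means starting at `local_offset` bytes into the block at the first `block_offset` and stopping at the corresponding position in the block named by the second.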

This PR plumbs `schema_element` and `keep_quotes` support into the JSON reader. **Deprecation:** This PR also deprecates passing `dtype` as a `list`. This appears to be an outdated legacy feature that we continued to support, and it cannot be supported with `schema_element`. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Lawrence Mitchell (https://github.com/wence-) URL: #11746

This PR adds support for using the `str.istitle()` method within UDFs for `apply`. Authors: - https://github.com/brandon-b-miller - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - David Wendt (https://github.com/davidwendt) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11738
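For reference, the semantics that such a UDF would rely on are those of Python's built-in `str.istitle()`. The snippet below runs on the host with plain Python strings; the function name is hypothetical, and it does not exercise the cuDF `apply` path, which needs a GPU build with string UDF support.

```python
def is_title_case(text):
    """True when every cased word starts uppercase and continues lowercase."""
    return text.istitle()


samples = ["Hello World", "Hello world", "HELLO", ""]
flags = [is_title_case(s) for s in samples]
```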

With rapids-cmake now requiring CMake 3.23.1, update consumers to correctly express this requirement. Authors: - Robert Maynard (https://github.com/robertmaynard) Approvers: - Nghia Truong (https://github.com/ttnghia) - GALI PREM SAGAR (https://github.com/galipremsagar) - Ray Douglass (https://github.com/raydouglass) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11751

…talls (#11565) After dask/dask#9367 was fixed in dask upstream, we had to bump the minimum version of dask to 2022.8.0 to correctly fetch nightly (if the channel exists) or stable (if the `dask/dev` label doesn't exist) packages. Without this fix, conda builds were always picking up `2022.7.1` only, and/or there would be a mix of nightly and stable packages in an env. This PR also does some cleanup and makes the `build.sh` script easier to maintain. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - AJ Schmidt (https://github.com/ajschmidt8) - Charles Blackmon-Luca (https://github.com/charlesbluca) URL: #11565

…CI (#11785) Authors: - https://github.com/brandon-b-miller Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Ray Douglass (https://github.com/raydouglass) URL: #11785

…11576) This PR exposes an option to use Dask-CUDA's explicit-comms shuffle for the primary shuffle-based `dask_cudf.DataFrame` methods: `shuffle`, `sort_values`, and `set_index`. Although "explicit-comms" is still experimental, the explicit-shuffle algorithm is known to consistently outperform the "task"-based shuffle. As far as I can tell, it is not currently possible to use an "explicit-comms" shuffle in `dask_cudf` without directly importing the function from Dask-CUDA (@madsbk - please do correct me if I am mistaken). In order to simplify benchmarking, and to utilize the optimized shuffle within high-cardinality groupby code, I propose that we make it easier to access the explicit shuffle. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Lawrence Mitchell (https://github.com/wence-) - Benjamin Zaitlen (https://github.com/quasiben) URL: #11576

…y` and update guide to UDFs notebook (#11733) This PR updates some docstrings around cuDF to show some examples of how to use strings inside UDFs, as well as provide some caveats. It also adds a section with some detail and examples to our guide to udfs ipython notebook. Authors: - https://github.com/brandon-b-miller Approvers: - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-) URL: #11733

…ltiple levels. (#11779) `row_bit_count` keeps track of a stack of "branches" which represent a span of rows to be included in the computed size. As you traverse through a hierarchy of lists, that span of rows is maintained as a stack. The code that was handling jumping out from the bottom of a stack to a new column was making the faulty assumption that the jump was only 1 level up. Authors: - https://github.com/nvdbaranec Approvers: - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) - Alessandro Bellina (https://github.com/abellina) URL: #11779

Fixes an out-of-bounds write error when a large number of strings requires a strided loop to meet an internal memory maximum. For row sizes that do not require strided loops, the row index never exceeds the size of the column preventing any out-of-bounds access. For large row counts, the CUDA `thread index` may be larger than the minimal count used for building the working-memory buffer. Since the kernel is launched with a thread-count with a specific block size, extra threads past the end of the minimal count are necessary to fill out the last block. These threads never contribute to the overall result but will attempt to access past the end of the working memory. Writing to this memory may corrupt memory for another kernel launched in parallel from another CPU thread. This change adds logic to prevent the extra threads from doing any work. Fixes #11768 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - MithunR (https://github.com/mythrocks) - Nghia Truong (https://github.com/ttnghia) - Mike Wilson (https://github.com/hyperbolic2346) URL: #11797
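The failure mode can be modeled on the host. The sketch below is a plain-Python simulation, not the CUDA kernel from the PR: the function name, parameters, and the exact way the working buffer is sized are assumptions for illustration. It shows why rounding the launch up to a whole block creates threads whose index lies past the working buffer, and how an early-exit guard keeps them from touching it.

```python
def simulate_strided_kernel(n_rows, max_working_rows, block_size):
    """Simulate one launch; returns (per-slot work counts, number of guarded threads)."""
    # Working memory is capped, so large inputs are processed with a strided loop.
    working_rows = min(n_rows, max_working_rows)
    # The launch is rounded up to whole blocks, so extra threads may exist.
    n_blocks = (working_rows + block_size - 1) // block_size
    launched_threads = n_blocks * block_size

    working = [0] * working_rows
    guarded = 0
    for tid in range(launched_threads):
        if tid >= working_rows:
            # The guard: without it, working[tid] would be out of bounds.
            guarded += 1
            continue
        # Strided loop: thread tid handles rows tid, tid + working_rows, ...
        for _row in range(tid, n_rows, working_rows):
            working[tid] += 1
    return working, guarded


counts, guarded_threads = simulate_strided_kernel(n_rows=10, max_working_rows=6, block_size=4)
```

In this model the two guarded threads correspond to the padding needed to fill the last block; every row is still visited exactly once by the remaining threads.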

Due to some unfortunate issues with #11576 and rapidsai/dask-cuda#992, I feel that these PRs should be reverted before the 22.10 release. This PR rolls back some recent changes that allow users to explicitly pass `shuffle="explicit-comms"` to certain shuffle-based algorithms. cc @wence- Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Lawrence Mitchell (https://github.com/wence-) URL: #11803

…n packages are built (#11808) Authors: - https://github.com/brandon-b-miller Approvers: - Ray Douglass (https://github.com/raydouglass)

This PR corresponds to the `dask_cudf` version of dask/dask#9302 (adding a shuffle-based algorithm for high-cardinality groupby aggregations). The benefits of this algorithm are most significant for cases where `split_out>1` is necessary:

```python
agg = ddf.groupby("id").agg({"x": "mean", "y": "max"}, split_out=4, shuffle=True)
```

**NOTES**:
- ~`shuffle="explicit-comms"` is also supported (when `dask_cuda` is installed)~
- It should be possible to refactor or remove some of this code in the future. However, due to some subtle differences between the groupby code in `dask.dataframe` and `dask_cudf`, the specialized `_shuffle_aggregate` is currently necessary.

Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Benjamin Zaitlen (https://github.com/quasiben) - Lawrence Mitchell (https://github.com/wence-) URL: #11800

Root cause:

```python
In [1]: import numpy as np

In [2]: x = np.uint8(1)

In [3]: y = np.float64(1.0)

In [4]: x.__ge__(y)
Out[4]: NotImplemented

In [8]: x >= y
Out[8]: True
```

This leads to the following error whenever a Scalar binary operation is involved:

```python
python/cudf/cudf/tests/test_series.py:449:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../envs/cudfdev/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/series.py:2988: in describe
    data = _describe_categorical(self, percentiles)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/series.py:152: in _describe_categorical
    val_counts = obj.value_counts(ascending=False)
../envs/cudfdev/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/series.py:2862: in value_counts
    res = res.sort_values(ascending=ascending)
../envs/cudfdev/lib/python3.9/contextlib.py:79: in inner
    return func(*args, **kwds)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/series.py:1910: in sort_values
    return super().sort_values(
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/indexed_frame.py:1916: in sort_values
    out = self._gather(
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/indexed_frame.py:1523: in _gather
    if not libcudf.copying._gather_map_is_valid(
copying.pyx:67: in cudf._lib.copying._gather_map_is_valid
    ???
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/mixins/mixin_factory.py:11: in wrapper
    return method(self, *args1, *args2, **kwargs1, **kwargs2)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:350: in _binaryop
    return Scalar(result, dtype=out_dtype)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:56: in __call__
    obj = super().__call__(value, dtype=dtype)
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:128: in __init__
    self._host_value, self._host_dtype = self._preprocess_host_value(
../envs/cudfdev/lib/python3.9/site-packages/cudf/core/scalar.py:222: in _preprocess_host_value
    value = to_cudf_compatible_scalar(value, dtype=dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

val = NotImplemented, dtype = <class 'numpy.bool_'>

    def to_cudf_compatible_scalar(val, dtype=None):
        """
        Converts the value `val` to a numpy/Pandas scalar,
        optionally casting to `dtype`.

        If `val` is None, returns None.
        """
        if cudf._lib.scalar._is_null_host_scalar(val) or isinstance(
            val, cudf.Scalar
        ):
            return val

        if not cudf.api.types._is_scalar_or_zero_d_array(val):
>           raise ValueError(
                f"Cannot convert value of type {type(val).__name__} "
                "to cudf scalar"
            )
E           ValueError: Cannot convert value of type NotImplementedType to cudf scalar

../envs/cudfdev/lib/python3.9/site-packages/cudf/utils/dtypes.py:248: ValueError
```

This PR fixes the issue by first looking up `op` in the standard-library `operator` module and calling that function, and falling back to `getattr` on the operand only if `op` is not found in the `operator` module. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) - https://github.com/brandon-b-miller URL: #11816
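The difference between calling a dunder method directly and going through `operator` can be reproduced without NumPy. The sketch below is an illustration of the approach described above, not the code from the PR: `apply_binop`, `Small`, and `Big` are hypothetical names. Calling `lhs.__ge__(rhs)` directly returns `NotImplemented` with no further dispatch, whereas `operator.__ge__` follows the full Python protocol and tries the reflected method on the right operand.

```python
import operator


def apply_binop(lhs, rhs, op):
    """Apply the dunder-named `op`, preferring the `operator` module."""
    fn = getattr(operator, op, None)
    if fn is not None:
        # Full binary-op protocol, including the reflected fallback.
        return fn(lhs, rhs)
    # Names such as "__radd__" do not exist in `operator`; call the method directly.
    return getattr(lhs, op)(rhs)


class Small:
    """Stands in for an operand whose __ge__ declines the comparison."""

    def __init__(self, value):
        self.value = value

    def __ge__(self, other):
        return NotImplemented


class Big:
    """Stands in for an operand that knows how to compare with Small."""

    def __init__(self, value):
        self.value = value

    def __le__(self, other):
        return self.value <= other.value


direct = Small(1).__ge__(Big(1.0))                     # no reflected dispatch
via_operator = apply_binop(Small(1), Big(1.0), "__ge__")
reflected = apply_binop(3, 4, "__radd__")              # falls back to getattr
```

Here `direct` is the `NotImplemented` sentinel that caused the `ValueError` in the traceback, while `via_operator` is an ordinary boolean.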

We need to actually call the method; otherwise we get false positives for the validity of the operands. Fortunately, this seems to have been a benign bug, since the host pandas `NAType` handles all of the operations appropriately, so the code was "working" before, but the logic was incorrect. Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Bradley Dice (https://github.com/bdice) URL: #11818

Adding some examples to show off the nested type JSON reading Authors: - Gregory Kimball (https://github.com/GregoryKimball) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Matthew Roeschke (https://github.com/mroeschke) URL: #11814

## Description Disable the use of nvCOMP DEFLATE because of issues with nvCOMP 2.4. Also fix a Python test (it did not block CI because the comparison in the test is only done with `LIBCUDF_NVCOMP_POLICY="ALWAYS"`). ## Checklist - [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [x] New or existing tests cover these changes. - [x] The documentation is up to date with these changes. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Nghia Truong (https://github.com/ttnghia) - Jim Brennan (https://github.com/jbrennan333) - GALI PREM SAGAR (https://github.com/galipremsagar) - Robert Maynard (https://github.com/robertmaynard) - Vyas Ramasubramani (https://github.com/vyasr)

This PR resolves #10323 and phases out the `gitutils.py` module in favor of a dependency on GitPython that is managed by pre-commit. It fixes the pre-commit check for copyright years so that only modifications between the target branch (`branch-X.Y`) and the current git stage will trigger copyright changes (years will not be updated for unmodified files, or for changes that have not been staged). Additionally, it changes the return code to `1` if changes are requested and applied (if modifications were required, that should be considered a failure). This is the last step to making our entire style check pipeline friendly to pre-commit. Authors: - Bradley Dice (https://github.com/bdice) Approvers: - Jordan Jacobelli (https://github.com/Ethyling) - GALI PREM SAGAR (https://github.com/galipremsagar) - Vyas Ramasubramani (https://github.com/vyasr) URL: #11711

## Description This switches to using CubinLinker (from PTXCompiler, but CubinLinker uses PTXCompiler internally) for Minor Version Compatibility. This enables support for all Numba features except linking archives with MVC, in support of use cases such as String UDFs (#11319) with MVC. ## Checklist - [X] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [X] New or existing tests cover these changes. - [X] The documentation is up to date with these changes. Authors: - Graham Markall (https://github.com/gmarkall) - https://github.com/brandon-b-miller - Ashwin Srinath (https://github.com/shwina) Approvers: - Ray Douglass (https://github.com/raydouglass)

## Description The docstring for `cudf.read_text` did not include the `byte_range` argument ## Checklist - [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [x] New or existing tests cover these changes. - [x] The documentation is up to date with these changes. Authors: - Gregory Kimball (gkimball@nvidia.com) Approvers: - Ashwin Srinath (https://github.com/shwina) - Lawrence Mitchell (https://github.com/wence-)

## Description This PR fixes a subtle bug introduced in #11800. While working on the corresponding dask-cuda benchmark for that PR rapidsai/dask-cuda#979, we discovered that non-deterministic column ordering in `_groupby_partition_agg` and `_tree_node_agg` can trigger metadata-enforcement errors in follow-up operations. This PR simply sorts the output column ordering in those functions (so that the column ordering is always deterministic). Note that this bug is difficult to reproduce in a pytest, because it rarely occurs with a smaller number of devices (I need to use a full dgx machine to consistently trigger the error). ## Checklist - [ ] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [ ] New or existing tests cover these changes. - [ ] The documentation is up to date with these changes. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Ashwin Srinath (https://github.com/shwina)

Reset `strings_udf` CEC and solve several related issues

This PR pins `dask` and `distributed` to `2022.9.2` for `22.10` release. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Ray Douglass (https://github.com/raydouglass) - Lawrence Mitchell (https://github.com/wence-) - Ashwin Srinath (https://github.com/shwina) - https://github.com/jakirkham URL: #11822

Co-authored-by: robertmaynard <robertmaynard@users.noreply.github.com>

Update `guide-to-udfs` notebook

[REVIEW] Handle `ptx` file paths during `strings_udf` import

## Description Zstandard decompression in nvCOMP 2.4 can produce incorrect results on compute 6.0 GPUs. This PR disables the Zstandard decompression in this configuration. ## Checklist - [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md). - [x] New or existing tests cover these changes. - [ ] The documentation is up to date with these changes. Authors: - Vukasin Milovanovic (https://github.com/vuule) Approvers: - MithunR (https://github.com/mythrocks) - Jim Brennan (https://github.com/jbrennan333) - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) - Joseph (https://github.com/jolorunyomi)

…nvcomp

Fixes bug in temporary decompression space estimation before calling nvcomp

Commits on Oct 12, 2022

update changelog

raydouglass committed Oct 12, 2022

f817d96
