
[RELEASE] cudf v0.17 #6935

Merged
merged 344 commits into from
Dec 10, 2020

Conversation

GPUtester
Collaborator

❄️ Code freeze for branch-0.17 and v0.17 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-0.17 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-0.17 into main for the release

jrhemstad and others added 30 commits October 27, 2020 11:06

This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`.

Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes.

To verify correctness, I ensured that we get the same result as `perfect-hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) created on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt).

The main change here is that I have gotten rid of the `non-compact` code path, as that caused failures like the one at [issue](#5760 (comment)).
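A minimal usage sketch of the upstreamed helper, assuming it is exposed as `hash_vocab` in `cudf.utils.hash_vocab_utils` (the module path and signature here are assumptions, not confirmed by this PR):

```python
# Hypothetical entry point; module path and argument order are assumptions.
from cudf.utils.hash_vocab_utils import hash_vocab

# Build a perfect-hash vocabulary file from a raw BERT vocabulary for
# later use by the subword tokenizer.
hash_vocab("bert-base-uncased-vocab.txt", "vocab_hash.txt")
```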


### TODO: 
- [x] Add function
- [x] Add Test to ensure equivalence 
- [x] Add ChangeLog  


### Previous Problems:
The issues below have now been addressed by sampling non-special symbols.
1.  Adding this test will:
a. Add `30s` to the test suite
b. Add `1.8 Mb` because of the `ground truth` and `vocabulary` files

We can reduce both, if the above are unacceptable, by sampling the vocabulary down to fewer words.



### Updated PR:
The issues below have now been addressed by sampling non-special symbols.
1.  Adding this test will:
a. Add `1.5 s` to the test suite
b. Add `112 kb` because of the `ground truth` and `vocabulary` files

This fixes the JNI build by updating to the new device memory resource API that uses rmm::cuda_stream_view instead of cudaStream_t directly.
This changes the JNI native dependency load order to always load libcudf_base.so before loading libcudf_comms.so.

When built with some toolchains, libcudf_comms.so has an explicit dependency on libcudf_base.so whereas with other toolchains it does not. The JNI code currently loads libcudf_base.so and libcudf_comms.so in parallel, but this only works when building with a toolchain smart enough to realize libcudf_comms.so does not need libcudf_base.so despite linking against it.
This PR improves the subword tokenizer docs by improving the example as well as the general docstring, and closes the last bits of #5799.

I wasn't sure about the exact details of `max_rows_tensor` (CC: @davidwendt to confirm).

It is rendered like below: 
![image](https://user-images.githubusercontent.com/4837571/97377583-a0a3cc80-187d-11eb-8fc6-21ae18c7a76e.png)
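For context, a hedged sketch of the API being documented, based on the docstring discussed above (the exact signature, parameter names, and file names are assumptions and may differ):

```python
import cudf

ser = cudf.Series(["This is a sentence.", "And another one."])

# Assumed invocation; "vocab_hash.txt" is a perfect-hash vocabulary file.
tokens, attention_masks, metadata = ser.str.subword_tokenize(
    "vocab_hash.txt",
    max_length=32,
    max_rows_tensor=32,
    do_lower=True,
)
```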
Reference #5963

This PR adds dictionary specialization logic to the `cudf::unary_operation` API.
All math and bitwise operations return new dictionary columns; logical operations return BOOL8 type columns.
This includes a reworked/simplified math_ops.cu and removal of unary_ops.cuh, which is no longer needed.
Also, the math/logical operations in the unary_ops_test.cpp gtests were split out into a new math_ops_test.cpp to simplify updating tests.
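A rough conceptual model of why the dictionary specialization pays off (illustrative Python, not the libcudf implementation): a unary op only needs to run over the unique keys, and the row indices are reused unchanged.

```python
import numpy as np

keys = np.array([1.0, 4.0, 9.0])      # unique values of the dictionary column
indices = np.array([0, 2, 1, 0, 2])   # each row references a key

new_keys = np.sqrt(keys)              # math op applied to the keys only
dense_equivalent = new_keys[indices]  # same result as applying the op per row
print(dense_equivalent)               # [1. 3. 2. 1. 3.]
```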
Fixes #6580 . Adds missing stream parameters in various calls to rmm::device_scalar methods. Also removes one unused line of code and fixes a narrowing cast warning (nvcc warning that wasn't being treated as error).
Speeds up the conversion to and from row major formats using the GPU.
This adds a new `cudf::strings::contains()` API that accepts a target column instead of a scalar. Each row of the target is checked against the corresponding row in the source column. If the target string appears inside the source string, then `true` is set in the output column for that row.
The gtests in `find_tests.cpp` were also refactored to make it easier to add new tests and features.
This also includes an update to the `cudf str.contains()` API so that it accepts a column for the `pat` argument. An appropriate pytest was also added to `test_string.py`.
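A short sketch of the Python-side behavior described above, using the new column-accepting `pat` argument (values are illustrative):

```python
import cudf

source = cudf.Series(["hello world", "goodbye", "hello"])
targets = cudf.Series(["world", "bye", "planet"])

# Row-wise check: does targets[i] appear inside source[i]?
print(source.str.contains(targets))
# 0     True
# 1     True
# 2    False
# dtype: bool
```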
…issue in `cudf.to_json` (#6614)

This PR adds a `nullable` parameter to the `to_pandas` APIs; if it is `True`, the cudf object is converted into a pandas object with the corresponding pandas nullable dtypes. Note that this parameter is `False` by default. This change helps fix the `cudf.to_json` issue where the JSON contents would vary due to the conversion of integer columns with null values into float columns.
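A minimal sketch of the difference the new flag makes:

```python
import cudf

s = cudf.Series([1, 2, None])

print(s.to_pandas().dtype)               # float64 -- nulls become NaN
print(s.to_pandas(nullable=True).dtype)  # Int64 -- nulls stay <NA>
```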
Fixes #6570 

Use the unsigned type for variables that store the result of zigzag integer encoding. 
Refactor the zigzag encoding to simplify the use of the overload set.
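For reference, a small illustrative sketch of zigzag encoding itself (not the libcudf code): it maps signed values to unsigned ones so that small magnitudes get small encodings, which is why the result belongs in an unsigned variable.

```python
def zigzag_encode(value: int, bits: int = 64) -> int:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    mask = (1 << bits) - 1
    return ((value << 1) ^ (value >> (bits - 1))) & mask

print([zigzag_encode(v) for v in (0, -1, 1, -2, 2)])  # [0, 1, 2, 3, 4]
```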
This fixes some changes to the JNI cmake that were checked in incorrectly.
This PR:
* Adds operator overloading to Column and cleans up all lengthy `.binary_operator` calls across the code-base.
* Changes error messages to use f-strings.
* Adds support for an Avro fuzz worker
* Utilizes fastavro to write/create avro files.
* Adds varying test parameter combinations for cudf.read_avro
* fix future TZ entries cnt; fix name skip; fix default transition hour;

* Update CHANGELOG.md

* remove unused variable

* add test

* style

* use smaller test file

* clean up skip_name
Closes #5345 

Features to implement (see the sketch after this list):

- [x] numerics -> numerics
- [x] timedeltas -> numerics
- [x] datetimes -> numerics
- [x] strings -> numerics
- [x] categorical -> numerics
- [x] test match downcast behavior
- [x] test match error behavior
- [x] strings that contain `inf` -> numeric
- [x] account for empty strings
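A hedged usage sketch, assuming the feature lands as a `cudf.to_numeric` analogous to `pandas.to_numeric` (the exact entry point is an assumption here):

```python
import cudf

s = cudf.Series(["1", "2.5", "inf", "not-a-number"])

# errors="coerce" mirrors pandas: unparseable strings become null
print(cudf.to_numeric(s, errors="coerce"))

# downcast picks the smallest dtype that fits, as in pandas
print(cudf.to_numeric(cudf.Series(["1", "2"]), downcast="integer"))
```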
karthikeyann and others added 16 commits December 2, 2020 15:22
closes #6228
closes #4400 

- [x] added groupby hash mean aggregation. (multi-pass method).
- [x] added multi-pass method (collated second pass)
- [x] enabled MEAN, STD, VARIANCE, ~SUM_OF_SQUARES~
- [x] unit tests

Implemented a 2-pass approach for compound aggregations.
Compound aggregations are aggregations that can be computed from the results of simple aggregations; simple aggregations need only one pass through the grouped values.
`aggregation::get_simple_aggregations()` returns the simple aggregations required for a given aggregation. The steps are (see the sketch after this list):

- find the required simple aggregations for the compound aggregations and add them to the list.
- the first pass calculates the list of simple aggregations. (1 kernel launch)
- the second pass takes the results of the simple aggregations and computes the results of the compound aggregations. (1 kernel launch)
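An illustrative model of the two passes for a single group (plain Python, not the libcudf kernels):

```python
values = [1.0, 2.0, 3.0, 4.0]  # one group's values

# Pass 1: simple aggregations, each computable in a single sweep
total = sum(values)
count = len(values)
sum_sq = sum(v * v for v in values)

# Pass 2: compound aggregations derived from the simple results
mean = total / count
variance = (sum_sq - count * mean * mean) / (count - 1)  # ddof = 1
std = variance ** 0.5
print(mean, variance, std)  # 2.5 1.666... 1.290...
```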

Authors:
  - Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
  - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>

Approvers:
  - Devavret Makkar
  - Ashwin Srinath
  - Jake Hemstad

URL: #6392
This forces the `ci/gpu/build.sh` script to install the local conda package artifact from the CUDA build. This is achieved by specifying the exact version and build string of the artifact.

Skipping CI since the Project Flash branch is not tested by this CI yet. I will manually test this change.

Authors:
  - Raymond Douglass <ray@raydouglass.com>
  - Ray Douglass <3107146+raydouglass@users.noreply.github.com>

Approvers:
  - AJ Schmidt
  - Dillon Cullinan

URL: #6806
Cleans up apt's cache after installing everything.

Also caps parallelism at the number of cores of the current machine, to avoid an unbounded number of threads being spawned and using far too much RAM.

Fixes #881.

Authors:
  - Igor Moura <imphilippini@gmail.com>
  - Igor Moura <imp2@cin.ufpe.br>
  - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>

Approvers:
  - AJ Schmidt
  - Karthikeyan

URL: #6619
Closes #6478

`cudf::gather` now will not run a pre-pass to check for index validity.

For `out_of_bounds_policy`, the `FAIL` option is removed, while `NULLIFY` and `DONT_CHECK` are exposed to the user. `NULLIFY` checks for out-of-bounds indices and sets them to null rows, while `DONT_CHECK` skips all checks. Using `DONT_CHECK` should yield higher performance, provided the `gather_map` contains only valid indices.

Note that the negative-index (wrap-around) policy is unchanged: when the gather map dtype is signed, wrap-around is applied.

A new Cython binding to `cudf::minmax`, used for bounds checking in the Cython `gather`, is added. This will also close #6731.
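A pure-Python model of the `NULLIFY` semantics described above (illustrative only, not the libcudf implementation):

```python
def gather_nullify(source, gather_map):
    # Out-of-bounds indices produce null rows; negative indices wrap
    # around, as they do for signed gather maps.
    n = len(source)
    out = []
    for i in gather_map:
        if i < 0:
            i += n
        out.append(source[i] if 0 <= i < n else None)
    return out

print(gather_nullify(["a", "b", "c"], [0, 2, -1, 5]))  # ['a', 'c', 'c', None]
```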

Authors:
  - Michael Wang <michelwang0905@icloud.com>
  - Michael Wang <isVoid@users.noreply.github.com>

Approvers:
  - Devavret Makkar
  - Ashwin Srinath
  - Keith Kraus
  - Jake Hemstad

URL: #6875
This PR intends to
- Allow `hash_partition` to select a different hash function (e.g. the identity hash function) in addition to `MurmurHash3_32`. (Close #6307)
- Remove redundant identical `hash_partition` implementation in `src/hash/hashing.cu`.

Restrictions:
- MD5 is not supported.

Authors:
  - Hao Gao <haog@nvidia.com>

Approvers:
  - Nikolay Sakharnykh
  - Mark Harris
  - Ram (Ramakrishna Prabhu)

URL: #6726
Fixes a typo and 0-d numpy array handling. When a numpy scalar is used on the lhs while executing a binary operation, `__eq__` from numpy returns a 0-d array rather than a scalar.

closes #6778
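A quick illustration of the 0-d array behavior at issue (generic numpy, not the cudf fix itself):

```python
import numpy as np

zero_d = np.array(3)      # a 0-d array, not a Python scalar
print(zero_d.ndim)        # 0
print(bool(zero_d == 3))  # True -- one way to normalize to a plain bool
print(zero_d.item())      # 3 -- .item() extracts the underlying scalar
```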

Authors:
  - Ramakrishna Prabhu <ramakrishnap@nvidia.com>
  - Ram (Ramakrishna Prabhu) <42624703+rgsl888prabhu@users.noreply.github.com>

Approvers:
  - Keith Kraus

URL: #6887
…to cupy for cudf.Series (#6839)

This PR adds index handling when dispatching to cupy functions with `__ufunc__` and `__array_function__` for `cudf.Series`.

This PR does the following:

- [x] Adds index handling for `__ufunc__` and `__array_function__` (when being dispatched to `cupy`)
- [x] Adds tests to ensure the same results as pandas with aligned indexes
- [x] Adds tests for appropriate errors on non-aligned indexes
- [x] Removes support for `list` inputs (should not have been supported in the first place)


Please note that I am unsure how to handle `list` inputs here. 
The problem being solved here is below:

With this **PR #6839** we get the correct index when we do the following:
```python
>>> cudf_s1 = cudf.Series(data=[-1, 2, 3, 0], index=[2, 3, 1, 0])
>>> cudf_s2 = cudf.Series(data=[-1, -2, 3, 0], index=[2, 3, 1, 0])
>>> o = np.logaddexp(cudf_s1, cudf_s2)
>>> o.index
Int64Index([2, 3, 1, 0], dtype='int64')
>>> print(o)
2   -0.306853
3    2.018150
1    3.693147
0    0.693147
dtype: float64
```
On **Master** we get:
```python
>>> cudf_s1 = cudf.Series(data=[-1, 2, 3, 0], index=[2, 3, 1, 0])
>>> cudf_s2 = cudf.Series(data=[-1, -2, 3, 0], index=[2, 3, 1, 0])
>>> o = np.logaddexp(cudf_s1, cudf_s2)
>>> o.index
RangeIndex(start=0, stop=4, step=1)
>>> print(o)
0   -0.306853
1    2.018150
2    3.693147
3    0.693147
dtype: float64
```

Authors:
  - Vibhu Jawa <vjawa@nvidia.com>
  - Vibhu Jawa <vibhujawa@gmail.com>

Approvers:
  - Michael Wang
  - GALI PREM SAGAR

URL: #6839
Expand existing murmur3 hashing functionality to hash the row elements serially rather than using a merge function. Also enables configuring the hash seed and null hash value.

Authors:
  - Ryan Lee <ryanlee@nvidia.com>
  - rwlee <rwlee@users.noreply.github.com>

Approvers:
  - Mark Harris
  - GALI PREM SAGAR
  - Robert (Bobby) Evans

URL: #6781
Closes #6530 

Changes:
- Added a method of specifying the nullability of list columns. The API change is as follows: `table_metadata_with_nullability.column_nullable[i]` used to be the nullability of column[i]. Now it contains the flattened nullability of the table e.g. for a table of three columns, `int, list<double>, float`, the nullability vector contains the values:

|Index|Nullability of|
|-|-|
|0|int column|
|1|Level 0 of list column (list itself)|
|2|Level 1 of list column (double values)|
|3|float column|

- Modified the method of checking the schema across `write_chunk()` calls. Now the entire schema vector is compared rather than just the types.
- Fixed a bug introduced in the list writing PR where a non-nested column following a list column would have the wrong value of definition bits. All such cases where the information was being queried from the schema have been fixed to use `parquet_column_view`.
- Fixed a regression introduced in a later commit in the list writing PR while adding column_view-with-offset support to list columns. Changed pinned memory to normal pageable memory.
- Added missing tests for the chunked writer where the nullability is mismatched across calls, or nullability is specified only in the first call.

Authors:
  - Devavret Makkar <dmakkar@nvidia.com>
  - Devavret Makkar <devavret@users.noreply.github.com>

Approvers:
  - Vukasin Milovanovic
  - Keith Kraus
  - Mark Harris

URL: #6831
This PR adds support for reading decimals in parquet into decimal32 and decimal64 cudf types. A test was added to test these types by embedding a parquet data file into the cpp file. This is temporary until python supports decimal and the tests move there.

partially closes issue #6474

Authors:
  - Mike Wilson <knobby@burntsheep.com>
  - Mike Wilson <hyperbolic2346@users.noreply.github.com>
  - Keith Kraus <kkraus@nvidia.com>

Approvers:
  - Devavret Makkar
  - Vukasin Milovanovic
  - Mark Harris

URL: #6808
Fixes #5683, #6852 

This PR modifies the `get_filepath_or_buffer` utility to support paths resolving to more than one file, returning a list of buffers. Currently `read_parquet` is the only reader that allows wildcard-like paths.

Note: `cudf.read_parquet` will still not work with parquet datasets partitioned on columns.
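A minimal sketch of the wildcard usage (the path is hypothetical):

```python
import cudf

# A glob pattern resolving to several parquet files in one directory
df = cudf.read_parquet("data/2020-12/part-*.parquet")
```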

Authors:
  - Ayush Dattagupta <ayushdg95@gmail.com>

Approvers:
  - Keith Kraus
  - GALI PREM SAGAR

URL: #6815
* Update JNI to new gather boundary check API

* changelog
Fixes #6891 

Adds missing `clone()` overrides on aggregations that are derived but do not use `derived_aggregation`.

Authors:
  - Jason Lowe <jlowe@nvidia.com>

Approvers:
  - Mark Harris
  - MithunR
  - Alessandro Bellina

URL: #6898
This PR adds a parquet option that determines whether to strictly read all decimal columns as fixed-point decimal types or to convert decimal columns that are not backed by int32/int64 to float64.
Closes #5247

Adds `agg` function for DataFrame
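A short usage sketch of the new `DataFrame.agg`, mirroring the pandas API (values are illustrative):

```python
import cudf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.agg("sum"))                      # single aggregation across columns
print(df.agg(["sum", "min"]))             # a list of aggregations
print(df.agg({"a": "max", "b": "mean"}))  # per-column aggregations
```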

Authors:
  - Sheilah Kirui <skirui@dt08.aselab.nvidia.com>
  - Sheilah Kirui <kirui.sheilah@gmail.com>
  - Michael Wang <isVoid@users.noreply.github.com>
  - skirui-source <71867292+skirui-source@users.noreply.github.com>
  - galipremsagar <sagarprem75@gmail.com>
  - GALI PREM SAGAR <sagarprem75@gmail.com>
  - Keith Kraus <keith.j.kraus@gmail.com>
  - Ashwin Srinath <shwina@users.noreply.github.com>

Approvers:
  - Michael Wang
  - Keith Kraus

URL: #6483
Authors:
  - Ashwin Srinath <shwina@users.noreply.github.com>
  - Ashwin Srinath <3190405+shwina@users.noreply.github.com>

Approvers:
  - Keith Kraus

URL: #6914

- PR #6805 Implement `cudf::detail::copy_if` for `decimal32` and `decimal64`
- PR #6843 Implement `cudf::copy_range` for `decimal32` and `decimal64`
- PR #6528 Enable `fixed_point` binary operations
- PR #6460 Add is_timestamp format check API
Member

Duplicates line 7.

Suggested change
- PR #6460 Add is_timestamp format check API

@ajschmidt8 ajschmidt8 merged commit d72b1eb into main Dec 10, 2020