
[RELEASE] cudf v0.17 #6935

Merged
merged 344 commits into from
Dec 10, 2020

Conversation

GPUtester
Collaborator

❄️ Code freeze for branch-0.17 and v0.17 release

What does this mean?

Only critical/hotfix level issues should be merged into branch-0.17 until release (merging of this PR).

What is the purpose of this PR?

  • Update documentation
  • Allow testing for the new release
  • Enable a means to merge branch-0.17 into main for the release

jrhemstad and others added 30 commits October 27, 2020 11:06

This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`.

Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes.

To verify correctness, I ensured that we get the same result as `perfect-hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) created on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt).

The main change here is that I have gotten rid of the `non-compact` code path, as that caused failures like the one at [issue](#5760 (comment)).
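A minimal usage sketch of the upstreamed helper, assuming it is exposed as `hash_vocab` in `cudf.utils.hash_vocab_utils` (the module path and signature here are assumptions, not confirmed by this PR):

```python
# Hypothetical entry point; module path and argument order are assumptions.
from cudf.utils.hash_vocab_utils import hash_vocab

# Build a perfect-hash vocabulary file from a raw BERT vocabulary for
# later use by the subword tokenizer.
hash_vocab("bert-base-uncased-vocab.txt", "vocab_hash.txt")
```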


### TODO: 
- [x] Add function
- [x] Add Test to ensure equivalence 
- [x] Add ChangeLog  


### Previous Problems:
The issues below have now been addressed by sampling non-special symbols.
1.  Adding this test will:
a. Add `30s` to the test suite
b. Add `1.8 Mb` because of the `ground truth` and `vocabulary` files

We can reduce both, if the above are unacceptable, by sampling the vocabulary down to fewer words.



### Updated PR:
The issues below have now been addressed by sampling non-special symbols.
1.  Adding this test will:
a. Add `1.5 s` to the test suite
b. Add `112 kb` because of the `ground truth` and `vocabulary` files

This fixes the JNI build by updating to the new device memory resource API that uses rmm::cuda_stream_view instead of cudaStream_t directly.
This changes the JNI native dependency load order to always load libcudf_base.so before loading libcudf_comms.so.

When built with some toolchains, libcudf_comms.so has an explicit dependency on libcudf_base.so whereas with other toolchains it does not. The JNI code currently loads libcudf_base.so and libcudf_comms.so in parallel, but this only works when building with a toolchain smart enough to realize libcudf_comms.so does not need libcudf_base.so despite linking against it.
This PR improves the subword tokenizer docs by improving the example as well as the general docstring, and closes the last bits of #5799.

I wasn't sure about the exact details of `max_rows_tensor` (CC: @davidwendt to confirm).

It is rendered like below: 
![image](https://user-images.githubusercontent.com/4837571/97377583-a0a3cc80-187d-11eb-8fc6-21ae18c7a76e.png)
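For context, a hedged sketch of the API being documented, based on the docstring discussed above (the exact signature, parameter names, and file names are assumptions and may differ):

```python
import cudf

ser = cudf.Series(["This is a sentence.", "And another one."])

# Assumed invocation; "vocab_hash.txt" is a perfect-hash vocabulary file.
tokens, attention_masks, metadata = ser.str.subword_tokenize(
    "vocab_hash.txt",
    max_length=32,
    max_rows_tensor=32,
    do_lower=True,
)
```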
Reference #5963

This PR adds dictionary specialization logic to the `cudf::unary_operation` API.
All math and bitwise operations return new dictionary columns; logical operations return BOOL8 type columns.
This includes a reworked/simplified math_ops.cu and removal of unary_ops.cuh, which is no longer needed.
Also, the math/logical operations in the unary_ops_test.cpp gtests were split out into a new math_ops_test.cpp to simplify updating tests.
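A rough conceptual model of why the dictionary specialization pays off (illustrative Python, not the libcudf implementation): a unary op only needs to run over the unique keys, and the row indices are reused unchanged.

```python
import numpy as np

keys = np.array([1.0, 4.0, 9.0])      # unique values of the dictionary column
indices = np.array([0, 2, 1, 0, 2])   # each row references a key

new_keys = np.sqrt(keys)              # math op applied to the keys only
dense_equivalent = new_keys[indices]  # same result as applying the op per row
print(dense_equivalent)               # [1. 3. 2. 1. 3.]
```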
Fixes #6580 . Adds missing stream parameters in various calls to rmm::device_scalar methods. Also removes one unused line of code and fixes a narrowing cast warning (nvcc warning that wasn't being treated as error).
Speeds up the conversion to and from row major formats using the GPU.
This adds a new `cudf::strings::contains()` API that accepts a target column instead of a scalar. Each row of the target is checked against the corresponding row in the source column. If the target string appears inside the source string, then `true` is set in the output column for that row.
The gtests in `find_tests.cpp` were also refactored to make it easier to add new tests and features.
This also includes an update to the `cudf str.contains()` API so that it accepts a column for the `pat` argument. An appropriate pytest was also added to `test_string.py`.
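A short sketch of the Python-side behavior described above, using the new column-accepting `pat` argument (values are illustrative):

```python
import cudf

source = cudf.Series(["hello world", "goodbye", "hello"])
targets = cudf.Series(["world", "bye", "planet"])

# Row-wise check: does targets[i] appear inside source[i]?
print(source.str.contains(targets))
# 0     True
# 1     True
# 2    False
# dtype: bool
```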
…issue in `cudf.to_json` (#6614)

This PR adds a `nullable` parameter to the `to_pandas` APIs; if it is `True`, the cudf object is converted into a pandas object with the corresponding pandas nullable dtypes. Note that this parameter is `False` by default. This change helps fix the `cudf.to_json` issue where the JSON contents would vary due to the conversion of integer columns with null values into float columns.
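A minimal sketch of the difference the new flag makes:

```python
import cudf

s = cudf.Series([1, 2, None])

print(s.to_pandas().dtype)               # float64 -- nulls become NaN
print(s.to_pandas(nullable=True).dtype)  # Int64 -- nulls stay <NA>
```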
Fixes #6570 

Use the unsigned type for variables that store the result of zigzag integer encoding. 
Refactor the zigzag encoding to simplify the use of the overload set.
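For reference, a small illustrative sketch of zigzag encoding itself (not the libcudf code): it maps signed values to unsigned ones so that small magnitudes get small encodings, which is why the result belongs in an unsigned variable.

```python
def zigzag_encode(value: int, bits: int = 64) -> int:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    mask = (1 << bits) - 1
    return ((value << 1) ^ (value >> (bits - 1))) & mask

print([zigzag_encode(v) for v in (0, -1, 1, -2, 2)])  # [0, 1, 2, 3, 4]
```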
This fixes some changes to the JNI cmake that were checked in incorrectly.
This PR:
* Adds operator overloading to Column and cleans up all lengthy `.binary_operator` calls across the code-base.
* Changes error messages to use f-strings.
* Adds support for an Avro fuzz worker
* Utilizes fastavro to write/create avro files.
* Adds varying test parameter combinations for cudf.read_avro
* fix future TZ entries cnt; fix name skip; fix default transition hour;

* Update CHANGELOG.md

* remove unused variable

* add test

* style

* use smaller test file

* clean up skip_name
Closes #5345 

Features to implement (see the sketch after this list):

- [x] numerics -> numerics
- [x] timedeltas -> numerics
- [x] datetimes -> numerics
- [x] strings -> numerics
- [x] categorical -> numerics
- [x] test match downcast behavior
- [x] test match error behavior
- [x] strings that contain `inf` -> numeric
- [x] account for empty strings
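A hedged usage sketch, assuming the feature lands as a `cudf.to_numeric` analogous to `pandas.to_numeric` (the exact entry point is an assumption here):

```python
import cudf

s = cudf.Series(["1", "2.5", "inf", "not-a-number"])

# errors="coerce" mirrors pandas: unparseable strings become null
print(cudf.to_numeric(s, errors="coerce"))

# downcast picks the smallest dtype that fits, as in pandas
print(cudf.to_numeric(cudf.Series(["1", "2"]), downcast="integer"))
```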
karthikeyann and others added 16 commits December 2, 2020 15:22
closes #6228
closes #4400 

- [x] added groupby hash mean aggregation. (multi-pass method).
- [x] added multi-pass method (collated second pass)
- [x] enabled MEAN, STD, VARIANCE, ~SUM_OF_SQUARES~
- [x] unit tests

Implemented a 2-pass approach for compound aggregations.
Compound aggregations are aggregations that can be computed from the results of simple aggregations; simple aggregations need only one pass through the grouped values.
`aggregation::get_simple_aggregations()` returns the simple aggregations required for a given aggregation. The steps are (see the sketch after this list):

- find the required simple aggregations for the compound aggregations and add them to the list.
- the first pass calculates the list of simple aggregations. (1 kernel launch)
- the second pass takes the results of the simple aggregations and computes the results of the compound aggregations. (1 kernel launch)
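An illustrative model of the two passes for a single group (plain Python, not the libcudf kernels):

```python
values = [1.0, 2.0, 3.0, 4.0]  # one group's values

# Pass 1: simple aggregations, each computable in a single sweep
total = sum(values)
count = len(values)
sum_sq = sum(v * v for v in values)

# Pass 2: compound aggregations derived from the simple results
mean = total / count
variance = (sum_sq - count * mean * mean) / (count - 1)  # ddof = 1
std = variance ** 0.5
print(mean, variance, std)  # 2.5 1.666... 1.290...
```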

Authors:
  - Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
  - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>

Approvers:
  - Devavret Makkar
  - Ashwin Srinath
  - Jake Hemstad

URL: #6392
This forces the `ci/gpu/build.sh` script to install the local conda package artifact from the CUDA build. This is achieved by specifying the exact version and build string of the artifact.

Skipping CI since the Project Flash branch is not tested by this CI yet. I will manually test this change.

Authors:
  - Raymond Douglass <ray@raydouglass.com>
  - Ray Douglass <3107146+raydouglass@users.noreply.github.com>

Approvers:
  - AJ Schmidt
  - Dillon Cullinan

URL: #6806
Cleans up apt's cache after installing everything.

Also caps parallelism at the number of cores of the current machine, to avoid an unbounded number of threads being spawned and using far too much RAM.

Fixes #881.

Authors:
  - Igor Moura <imphilippini@gmail.com>
  - Igor Moura <imp2@cin.ufpe.br>
  - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>

Approvers:
  - AJ Schmidt
  - Karthikeyan

URL: #6619
Closes #6478

`cudf::gather` now will not run a pre-pass to check for index validity.

For `out_of_bounds_policy`, the `FAIL` option is removed, while `NULLIFY` and `DONT_CHECK` are exposed to the user. `NULLIFY` checks for out-of-bounds indices and sets them to null rows, while `DONT_CHECK` skips all checks. Using `DONT_CHECK` should yield higher performance, provided the `gather_map` contains only valid indices.

Note that the negative-index (wrap-around) policy is unchanged: when the gather map dtype is signed, wrap-around is applied.

A new Cython binding to `cudf::minmax`, used for bounds checking in the Cython `gather`, is added. This will also close #6731.
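A pure-Python model of the `NULLIFY` semantics described above (illustrative only, not the libcudf implementation):

```python
def gather_nullify(source, gather_map):
    # Out-of-bounds indices produce null rows; negative indices wrap
    # around, as they do for signed gather maps.
    n = len(source)
    out = []
    for i in gather_map:
        if i < 0:
            i += n
        out.append(source[i] if 0 <= i < n else None)
    return out

print(gather_nullify(["a", "b", "c"], [0, 2, -1, 5]))  # ['a', 'c', 'c', None]
```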

Authors:
  - Michael Wang <michelwang0905@icloud.com>
  - Michael Wang <isVoid@users.noreply.github.com>

Approvers:
  - Devavret Makkar
  - Ashwin Srinath
  - Keith Kraus
  - Jake Hemstad

URL: #6875
This PR intends to
- Allow `hash_partition` to select a different hash function (e.g. the identity hash function) in addition to `MurmurHash3_32`. (Close #6307)
- Remove redundant identical `hash_partition` implementation in `src/hash/hashing.cu`.

Restrictions:
- MD5 is not supported.

Authors:
  - Hao Gao <haog@nvidia.com>

Approvers:
  - Nikolay Sakharnykh
  - Mark Harris
  - Ram (Ramakrishna Prabhu)

URL: #6726
Fixes a typo and 0-d numpy array handling. When a numpy scalar is used on the lhs while executing a binary operation, `__eq__` from numpy returns a 0-d array rather than a scalar.

closes #6778
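A quick illustration of the 0-d array behavior at issue (generic numpy, not the cudf fix itself):

```python
import numpy as np

zero_d = np.array(3)      # a 0-d array, not a Python scalar
print(zero_d.ndim)        # 0
print(bool(zero_d == 3))  # True -- one way to normalize to a plain bool
print(zero_d.item())      # 3 -- .item() extracts the underlying scalar
```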

Authors:
  - Ramakrishna Prabhu <ramakrishnap@nvidia.com>
  - Ram (Ramakrishna Prabhu) <42624703+rgsl888prabhu@users.noreply.github.com>

Approvers:
  - Keith Kraus

URL: #6887
…to cupy for cudf.Series (#6839)

This PR adds index handling when dispatching to cupy functions with `__ufunc__` and `__array_function__` for `cudf.Series`.

This PR does the following:

- [x] Adds index handling for `__ufunc__` and `__array_function__` (when being dispatched to `cupy`)
- [x] Adds tests to ensure the same results as pandas with aligned indexes
- [x] Adds tests for appropriate errors on non-aligned indexes
- [x] Removes support for `list` inputs (should not have been supported in the first place)


Please note that I am unsure how to handle `list` inputs here. 
The problem being solved here is below:

With this **PR #6839** we get the correct index when we do the following:
```python
>>> cudf_s1 = cudf.Series(data=[-1, 2, 3, 0], index=[2, 3, 1, 0])
>>> cudf_s2 = cudf.Series(data=[-1, -2, 3, 0], index=[2, 3, 1, 0])
>>> o = np.logaddexp(cudf_s1, cudf_s2)
>>> o.index
Int64Index([2, 3, 1, 0], dtype='int64')
>>> print(o)
2   -0.306853
3    2.018150
1    3.693147
0    0.693147
dtype: float64
```
On **Master** we get:
```python
>>> cudf_s1 = cudf.Series(data=[-1, 2, 3, 0], index=[2, 3, 1, 0])
>>> cudf_s2 = cudf.Series(data=[-1, -2, 3, 0], index=[2, 3, 1, 0])
>>> o = np.logaddexp(cudf_s1, cudf_s2)
>>> o.index
RangeIndex(start=0, stop=4, step=1)
>>> print(o)
0   -0.306853
1    2.018150
2    3.693147
3    0.693147
dtype: float64
```

Authors:
  - Vibhu Jawa <vjawa@nvidia.com>
  - Vibhu Jawa <vibhujawa@gmail.com>

Approvers:
  - Michael Wang
  - GALI PREM SAGAR

URL: #6839
Expand existing murmur3 hashing functionality to hash the row elements serially rather than using a merge function. Also enables configuring the hash seed and null hash value.

Authors:
  - Ryan Lee <ryanlee@nvidia.com>
  - rwlee <rwlee@users.noreply.github.com>

Approvers:
  - Mark Harris
  - GALI PREM SAGAR
  - Robert (Bobby) Evans

URL: #6781
Closes #6530 

Changes:
- Added a method of specifying the nullability of list columns. The API change is as follows: `table_metadata_with_nullability.column_nullable[i]` used to be the nullability of column[i]. Now it contains the flattened nullability of the table e.g. for a table of three columns, `int, list<double>, float`, the nullability vector contains the values:

|Index|Nullability of|
|-|-|
|0|int column|
|1|Level 0 of list column (list itself)|
|2|Level 1 of list column (double values)|
|3|float column|

- Modified the method of checking the schema across `write_chunk()` calls. Now the entire schema vector is compared rather than just the types.
- Fixed a bug introduced in the list writing PR where a non-nested column following a list column would have the wrong value of definition bits. All such cases where the information was being queried from the schema have been fixed to use `parquet_column_view`.
- Fixed a regression introduced in a later commit in the list writing PR while adding column_view-with-offset support to list columns. Changed pinned memory to normal pageable memory.
- Added missing tests for the chunked writer where the nullability is mismatched across calls, or nullability is specified only in the first call.

Authors:
  - Devavret Makkar <dmakkar@nvidia.com>
  - Devavret Makkar <devavret@users.noreply.github.com>

Approvers:
  - Vukasin Milovanovic
  - Keith Kraus
  - Mark Harris

URL: #6831
This PR adds support for reading decimals in parquet into decimal32 and decimal64 cudf types. A test was added to test these types by embedding a parquet data file into the cpp file. This is temporary until python supports decimal and the tests move there.

partially closes issue #6474

Authors:
  - Mike Wilson <knobby@burntsheep.com>
  - Mike Wilson <hyperbolic2346@users.noreply.github.com>
  - Keith Kraus <kkraus@nvidia.com>

Approvers:
  - Devavret Makkar
  - Vukasin Milovanovic
  - Mark Harris

URL: #6808
Fixes #5683, #6852 

This PR modifies the `get_filepath_or_buffer` utility to support paths resolving to more than one file, returning a list of buffers. Currently `read_parquet` is the only reader that allows wildcard-like paths.

Note: `cudf.read_parquet` will still not work with parquet datasets partitioned on columns.
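A minimal sketch of the wildcard usage (the path is hypothetical):

```python
import cudf

# A glob pattern resolving to several parquet files in one directory
df = cudf.read_parquet("data/2020-12/part-*.parquet")
```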

Authors:
  - Ayush Dattagupta <ayushdg95@gmail.com>

Approvers:
  - Keith Kraus
  - GALI PREM SAGAR

URL: #6815
* Update JNI to new gather boundary check API

* changelog
Fixes #6891 

Adds missing `clone()` overrides on aggregations that are derived but do not use `derived_aggregation`.

Authors:
  - Jason Lowe <jlowe@nvidia.com>

Approvers:
  - Mark Harris
  - MithunR
  - Alessandro Bellina

URL: #6898
This PR adds a parquet option that determines whether to strictly read all decimal columns as fixed-point decimal types or to convert decimal columns that are not backed by int32/int64 to float64.
Closes #5247

Adds `agg` function for DataFrame
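A short usage sketch of the new `DataFrame.agg`, mirroring the pandas API (values are illustrative):

```python
import cudf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

print(df.agg("sum"))                      # single aggregation across columns
print(df.agg(["sum", "min"]))             # a list of aggregations
print(df.agg({"a": "max", "b": "mean"}))  # per-column aggregations
```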

Authors:
  - Sheilah Kirui <skirui@dt08.aselab.nvidia.com>
  - Sheilah Kirui <kirui.sheilah@gmail.com>
  - Michael Wang <isVoid@users.noreply.github.com>
  - skirui-source <71867292+skirui-source@users.noreply.github.com>
  - galipremsagar <sagarprem75@gmail.com>
  - GALI PREM SAGAR <sagarprem75@gmail.com>
  - Keith Kraus <keith.j.kraus@gmail.com>
  - Ashwin Srinath <shwina@users.noreply.github.com>

Approvers:
  - Michael Wang
  - Keith Kraus

URL: #6483
Authors:
  - Ashwin Srinath <shwina@users.noreply.github.com>
  - Ashwin Srinath <3190405+shwina@users.noreply.github.com>

Approvers:
  - Keith Kraus

URL: #6914

- PR #6805 Implement `cudf::detail::copy_if` for `decimal32` and `decimal64`
- PR #6843 Implement `cudf::copy_range` for `decimal32` and `decimal64`
- PR #6528 Enable `fixed_point` binary operations
- PR #6460 Add is_timestamp format check API
Member

Duplicates line 7.

Suggested change
- PR #6460 Add is_timestamp format check API

@ajschmidt8 ajschmidt8 merged commit d72b1eb into main Dec 10, 2020