[RELEASE] cudf v0.18 #7405

Add a cmake find module to locate cuFile. If found, add the include directory and link to the shared library. This shouldn't have any effect if cuFile is not installed locally.

[gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]

This implements the `non_numeric` argument for `DataFrame.quantile`, meaning that it now works on `datetime` and `timedelta` data. However, because of the difference in how `DataFrame.iloc` behaves between Pandas and cuDF, this implementation returns a DataFrame when `non_numeric=False` even when Pandas returns a Series. Passes tests locally. This closes #6799 Authors: - Chris Jarrett <cjarrett@dt08.aselab.nvidia.com> - ChrisJar <chris.jarrett.0@gmail.com> Approvers: - Keith Kraus URL: #6902

When the `--rmm_mode=managed` parameter is passed to gtests, an `Invalid RMM allocation mode: managed` exception is thrown. The logic in `include/cudf_test/base_fixture.hpp` is just missing a return statement. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Paul Taylor - Mark Harris URL: #6912

Resolves: #6870 This PR adds support for `set_names` API in both `Index` & `MultiIndex`. Authors: - galipremsagar <sagarprem75@gmail.com> - GALI PREM SAGAR <sagarprem75@gmail.com> Approvers: - Keith Kraus URL: #6929

Fixes: #6821 This PR fixes issue where `columns` and `index` are currently not being handled correctly in specific scenarios. Authors: - galipremsagar <sagarprem75@gmail.com> - GALI PREM SAGAR <sagarprem75@gmail.com> Approvers: - Richard (Rick) Zamora - Ashwin Srinath URL: #6838

Update to libcu++ on Github. Authors: - ptaylor <paul.e.taylor@me.com> - Paul Taylor <paul.e.taylor@me.com> Approvers: - Mark Harris - Keith Kraus - Christopher Harris - Mark Harris URL: #6275

This PR removes `**kwargs` from the string/categorical accessors where unnecessary, and exposes keyword arguments like `inplace` to the user directly. If we want to maintain parity with Pandas APIs for Dask/others using cuDF internally, we can consider using the approach described in #6135, which will automatically raise `NotImplementedError` when unsupported kwargs are passed. Authors: - Ashwin Srinath <shwina@users.noreply.github.com> Approvers: - GALI PREM SAGAR - Keith Kraus - Keith Kraus URL: #6750

Fixes #6682, #6680

Currently, empty fields are treated as N/A regardless of parsing options. However, the desired behavior is to handle empty fields the same way as fields with special values (apply default_na_values, na_filter logic). This PR irons out the behavior so it matches Pandas in this regard.

- Tries now support matching empty strings.
- The list of special NA values is now generated more robustly, so it has correct elements in any parameter combination.
- Empty string is added to the list of special NA values.
- A quoted empty string (`"\"\""`) is added to the NA value list if the empty string (`""`) is included (mirrors Pandas behavior).
- Added tests for previously failing parameter combinations.
- Reworked some of the tests to check against Pandas results instead of assumed desired behavior.

Authors: - vuule <vmilovanovic@nvidia.com> - vuule <vukasin.milovanovic.87@gmail.com> - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com> - Vukasin Milovanovic <vmilovanovic@nvidia.com> Approvers: - Ram (Ramakrishna Prabhu) - Christopher Harris - Keith Kraus URL: #6922

The include directory was renamed from `simt` to `cuda`. Authors: - Rong Ou <rong.ou@gmail.com> Approvers: - Jason Lowe URL: #6948

The `cudf::merge` API expects the key columns to be sorted. This means that if null rows are included, these null entries should all appear either at beginning or at the end of the column depending on the null_order for the sort. The `MergeDictionaryTest.WithNull` gtest placed null rows in the middle of the column. The expected results should also have included null entries at the beginning or the end. This PR also includes an extra test for checking merge results are consistent with the sort parameters `cudf::order` and `cudf::null_order`. This test also includes a larger number of rows to ensure `thrust::merge` requires more than one tile/block in its runtime logic. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Ram (Ramakrishna Prabhu) - Vukasin Milovanovic URL: #6942
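
The sortedness precondition can be illustrated with a small pure-Python sketch. This is not libcudf code: `None` stands in for a null row, and `sort_key` and `merge_sorted` are hypothetical helpers that assume ascending order with nulls first.

```python
import heapq


def sort_key(value):
    # Nulls (None) sort before every non-null value, which mimics a
    # "nulls first" null order on an ascending sort.
    return (value is not None, value if value is not None else 0)


def merge_sorted(left, right):
    """Merge two key lists that are each already sorted under sort_key."""
    return list(heapq.merge(left, right, key=sort_key))


# Nulls sit at the beginning of each input, never in the middle.
left = [None, 1, 4]
right = [None, None, 2, 3]
merged = merge_sorted(left, right)
```

If a null were placed in the middle of either input, the input would no longer be sorted under `sort_key` and the merged output would not be sorted either, which is the situation the old test created.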

Updating the Java bindings package version to match the libcudf version. Authors: - Jason Lowe <jlowe@nvidia.com> Approvers: - Robert (Bobby) Evans URL: #6949

This exposes `logical_cast` through a JNI API. I also updated some of the test code to take a ColumnView instead of a ColumnVector so I could test it more easily. Authors: - Robert (Bobby) Evans <bobby@apache.org> Approvers: - Jason Lowe URL: #6954

Updates libcudf to use the new, simplified `rmm::exec_policy` and include the new refactored headers `rmm/exec_policy.hpp` and `rmm/device_vector.hpp`. The new `exec_policy` can be passed directly to Thrust; there is no longer any need to call `rmm::exec_policy(stream)->on(stream)`. Depends on rapidsai/rmm#647

As a part of trying to support upper and lower bounds for decimal, I found that type checking for this function was broken because it used `==` for equality instead of `.equals`. Looking further, I found a few other places where this was a bug (one in ColumnVector that is mostly a performance issue and one in Scalar). I decided to update all of the code to use `.equals` for comparison of types to make it consistent, so bugs like this are less likely to crop up in the future. I also took the opportunity to internally move away from using `isTimestamp` (which is deprecated) to `isTimestampType`. Authors: - Robert (Bobby) Evans <bobby@apache.org> Approvers: - Jason Lowe URL: #6970

Adding Java bindings for the `url_decode` and `url_encode` functions. Authors: - Jason Lowe <jlowe@nvidia.com> Approvers: - Robert (Bobby) Evans - Kuhu Shukla URL: #6972

This adds a `libcufilejni.so` that's by default not built nor loaded. The unit tests are controlled similarly. Tested locally with the corresponding spark-rapids plugin changes. @jlowe @revans2 @abellina Authors: - Rong Ou <rong.ou@gmail.com> Approvers: - Robert (Bobby) Evans URL: #6940

Currently we're missing a few kwargs in `Series.groupby`, which is causing issues due to the dask change in dask/dask#6854. Adds the missing kwargs and validates that we support the values passed in. Authors: - Keith Kraus <keith.j.kraus@gmail.com> Approvers: - GALI PREM SAGAR - Michael Wang URL: #6964

Closes #6778. Fixes a typo missed in PR #6887; the else condition where it is situated is the last resort for any unaccounted-for scalar type values. Authors: - Ramakrishna Prabhu <ramakrishnap@nvidia.com> Approvers: - Keith Kraus URL: #6957

More Pandas-like behaviour for groupby when no keys are passed. Possibly fixes #6927. Authors: - Ashwin Srinath <shwina@users.noreply.github.com> Approvers: - Keith Kraus URL: #6945

… no-op(#6975) @nartal1 found a small bug while working on: NVIDIA/spark-rapids#1244 Problem is that for `fixed_point`, when the column `scale = -decimal_places`, it should be a no-op. Fix is to make it a no-op. Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - David - Karthikeyan URL: #6975

An `array_tests.cu` with the same content exists in the same directory. (It can't be converted to .cpp because a .cuh is included in array_tests.cu and template source code is tested.) Also, array_tests.cpp is not referenced in `cpp/tests/CMakeLists.txt`. Authors: - Karthikeyan <6488848+karthikeyann@users.noreply.github.com> Approvers: - David - Vukasin Milovanovic URL: #6953

Addresses groupby part of #2188 - [x] Add cython interfaces for aggregation argmin, argmax as idxmin, idxmax - [x] unit tests Authors: - Karthikeyan Natarajan <karthikeyann@users.noreply.github.com> - Karthikeyan <6488848+karthikeyann@users.noreply.github.com> Approvers: - David - Ram (Ramakrishna Prabhu) - GALI PREM SAGAR - Jake Hemstad URL: #6856

… column support(#6907) Part 1 for issue #1361 - Adds `PRECEDING` and `FOLLOWING` options to `replace_nulls` in `libcudf`. This PR provides support for `fixed_width_type` type columns. - Adds Cython binding Authors: - Michael Wang <michaelwang0905@gmail.com> - Michael Wang <isVoid@users.noreply.github.com> Approvers: - Ashwin Srinath - Jake Hemstad - Mark Harris - Mark Harris URL: #6907

This is a small cleanup that replaces a `cudf::binary_operation` with a much cleaner `cudf::cast`. Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - Vukasin Milovanovic - Mark Harris URL: #6976

Currently has changes for #6950 included. The full set of null mask `fixed_point_column_wrapper` constructors isn't supported. This PR adds them all and also adds unit tests for each of them across different `fixed_point` API tests.

**To Do List:**
* [x] Add constructors
* [x] Add basic unit test
* [x] Add all unit tests
* [x] Update docs

Authors: - Mark Harris <mharris@nvidia.com> - Conor Hoekstra <codereport@outlook.com> Approvers: - null - Vukasin Milovanovic URL: #6951

Fixes #6671, #6851

- Set the `rows_per_chunk` in `csv_writer_options` to the size of the input table.
- Change `rows_per_chunk` type to `size_type` (used for number of rows).
- Set the default compression in `to_parquet`/`write_parquet` to "snappy".

Authors: - vuule <vmilovanovic@nvidia.com> Approvers: - Keith Kraus - Conor Hoekstra - Ram (Ramakrishna Prabhu) - Mark Harris URL: #6967

Just disables the tests when cufile is not installed. Authors: - Robert (Bobby) Evans <bobby@apache.org> Approvers: - Kuhu Shukla URL: #6987

…ated as unequal(#6943) This change mirrors what is done in `groupby` to eliminate null-containing columns from the join hash table if nulls not equal is set. This prevents absolute runaway of the process. I added benchmarks for joins with nulls and I can't even get it to finish without these changes. The 195ms test without nulls takes 2,000,000ms to complete and the larger tests I haven't had the patience to even see complete. With this change, the timings are faster than without nulls proportional to the % of nulls. Meaning half the table is nulls means the query is twice as fast as the non-null version, which makes sense. closes #6052 Authors: - Mike Wilson <knobby@burntsheep.com> - Mike Wilson <hyperbolic2346@users.noreply.github.com> Approvers: - Jake Hemstad - Jake Hemstad - null - Mark Harris URL: #6943
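
A minimal pure-Python sketch of the idea, not the libcudf hash join itself: when nulls compare unequal, rows with a null key can never match, so they are skipped when building and probing the table. `None` stands in for null and `inner_join` is a hypothetical helper.

```python
from collections import defaultdict


def inner_join(left_keys, right_keys, nulls_equal=False):
    """Return (left_index, right_index) pairs for matching keys."""
    table = defaultdict(list)
    for i, key in enumerate(left_keys):
        if key is None and not nulls_equal:
            # A null key can never match, so keep it out of the hash table.
            continue
        table[key].append(i)

    pairs = []
    for j, key in enumerate(right_keys):
        if key is None and not nulls_equal:
            continue  # nothing to probe for
        for i in table.get(key, []):
            pairs.append((i, j))
    return pairs
```

With `nulls_equal=False` the table only holds non-null rows, so a half-null build side means half as many entries to insert and probe against.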

…6959) Fixes #6947

When TZif file has no transitions (e.g. GMT), `build_timezone_transition_table` has an out-of-bounds read that leads to undefined behavior and intermittent issues. This PR makes two changes to behavior:

1. When there are no transitions, the ancient rule is initialized from the first time offset (instead of the first transition rule, which does not exist in this case).
2. When there are no transitions and the time offset is zero, an empty table is returned (avoid using a no-op table in CUDA).

Authors: - vuule <vmilovanovic@nvidia.com> - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com> Approvers: - GALI PREM SAGAR - null - Ram (Ramakrishna Prabhu) - David URL: #6959

Fixes #6719 Authors: - Kumar Aatish <kaatish@nvidia.com> Approvers: - David - GALI PREM SAGAR - Mark Harris URL: #6991

Closes #6955 This will improve the compile time and size for any function using `thrust::sort` and `thrust::stable_sort`. The PR includes a .patch file to be applied during cmake when downloading the thrust library. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Mark Harris URL: #6982

Fixes #6642 Authors: - Kumar Aatish <kaatish@nvidia.com> - skirui-source <71867292+skirui-source@users.noreply.github.com> Approvers: - Vukasin Milovanovic - Devavret Makkar - Ram (Ramakrishna Prabhu) URL: #6889

Allows for scalars of the same dtype as a column to be passed along a fast codepath to libcudf, instead of being inspected to reduce their dtype beforehand. Authors: - brandon-b-miller <brmiller@nvidia.com> - GALI PREM SAGAR <sagarprem75@gmail.com> Approvers: - GALI PREM SAGAR URL: #6938

PR #6982 added a `PATCH_COMMAND` when fetching Thrust to remove unrolling in `thrust::sort`, thereby improving compile time and performance in some cases. But the command failed on local builds from source (At least on my machine under rapids-compose). This PR simplifies the command. Authors: - Mark Harris <mharris@nvidia.com> Approvers: - Keith Kraus URL: #7002

Closes #6801 This PR adds an extra reduce call in the libcudf gather specialization logic for strings column. This will check to make sure the output size of the gather does not exceed the size limit for the child characters column. The offsets column is first created with the individual output string sizes. Then the reduce call will add these sizes to check for overflow. Also added a gtest to check for the overflow condition. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Devavret Makkar - Karthikeyan URL: #6997
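
The check can be sketched in pure Python as follows. This is an illustration only: `gather_strings`, the `limit` parameter, and the assumed 32-bit signed limit are hypothetical stand-ins for the libcudf logic.

```python
INT32_MAX = 2**31 - 1  # assumed size limit for the child characters column


def gather_strings(strings, gather_map, limit=INT32_MAX):
    """Gather strings by index, returning (offsets, chars)."""
    # Individual output string sizes, in bytes.
    sizes = [len(strings[i].encode("utf-8")) for i in gather_map]

    # The extra reduce step: add up the sizes before allocating anything.
    total = sum(sizes)
    if total > limit:
        raise OverflowError("total size of output strings is too large")

    # Turn the sizes into offsets with a running sum.
    offsets = [0]
    for size in sizes:
        offsets.append(offsets[-1] + size)

    chars = b"".join(strings[i].encode("utf-8") for i in gather_map)
    return offsets, chars
```

Passing a small `limit` makes it easy to exercise the overflow path without building a multi-gigabyte input.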

I think this is how I started and verified it was working, but then I was trying to exclude the source as well, which didn't work for tests. Then I realized we need the source built for the plugin so remove it. Anyway, no presubmit CI check is really painful. :( @revans2

```console
$ mvn test
...
[WARNING] Tests run: 775, Failures: 0, Errors: 0, Skipped: 4
$ mvn test -DUSE-GDS=ON
...
[WARNING] Tests run: 777, Failures: 0, Errors: 0, Skipped: 4
```

Authors: - Rong Ou <rong.ou@gmail.com> Approvers: - Robert (Bobby) Evans URL: #6988

Fix #6926.

Hi! When invoking `from_dlpack()` and `to_dlpack()`, the following warnings are displayed:

from_dlpack()
```
/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/dlpack.py:33: UserWarning: WARNING: cuDF from_dlpack() assumes column-major (Fortran order) input. If the input tensor is row-major, transpose it before passing it to this function.
  res = libdlpack.from_dlpack(pycapsule_obj)
```

to_dlpack()
```
/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/dlpack.py:74: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.
  return libdlpack.to_dlpack(gdf_cols)
```

I think those warnings should be removed, because they contain information that should be available in the API documentation, and not necessarily displayed each time the methods are invoked. Some users, like me, love to have their notebooks/code without warnings. Even if it is possible to disable those warnings, I think the user should not go that way, because the warning is just repeating what the API documentation should cover.

Hope it helps! Miguel

Authors: - Miguel Martínez <26169771+miguelusque@users.noreply.github.com> - GALI PREM SAGAR <sagarprem75@gmail.com> Approvers: - GALI PREM SAGAR - Ram (Ramakrishna Prabhu) URL: #7001

This PR adds a test to exercise the issue described in #6733. This issue was only reproduced on a laptop Pascal GPU, but I think it's a good test to have. In summary, `copy_if`, used by `apply_boolean_mask`, computes the output null count as part of its custom scatter kernel, rather than using `cudf::count_unset_bits`. #6733 describes an issue where the former is different from the latter. So it's good to have a test that verifies they get the same null count. And since it's difficult to get a repro on a similar machine, this is a first step. Authors: - Mark Harris <mharris@nvidia.com> - Keith Kraus <kkraus@nvidia.com> Approvers: - Karthikeyan - Devavret Makkar URL: #6903

#7002 attempted to fix the temporary Thrust sort patch introduced in #6982 which didn't work with CMake 3.19+. This PR updates the thirdparty CMakeLists.txt file to continue if the Thrust sort patch has already been applied. Today, the first time cmake is run, the Thrust sort.h is patched. But if cmake is run again without cleaning the build directory, the build will fail, because the file has already been patched. @trxcllnt showed us the correct `patch` incantation to ignore the patch if already applied. CC @davidwendt Authors: - Mark Harris <mharris@nvidia.com> Approvers: - Paul Taylor - Keith Kraus URL: #7009

This pull request is to address #6909. Authors: - sperlingxx <lovedreamf@gmail.com> - Alfred Xu <lovedreamf@gmail.com> Approvers: - Robert (Bobby) Evans - Mike Wilson - Devavret Makkar URL: #6969

This resolves #6996 Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - Mike Wilson - Devavret Makkar URL: #7006

Implements `cudf.DateOffset` - an object used for calendrical arithmetic, similar to pandas.DateOffset - for month units only. Closes #6754 Authors: - brandon-b-miller <brmiller@nvidia.com> - brandon-b-miller <53796099+brandon-b-miller@users.noreply.github.com> - Keith Kraus <kkraus@nvidia.com> Approvers: - GALI PREM SAGAR - Keith Kraus - Keith Kraus URL: #6775
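
For intuition, month-only calendrical arithmetic can be sketched with the standard library. This is not the cuDF implementation: `add_months` is a hypothetical helper, and clamping the day to the end of the target month is an assumption modelled on how `pandas.DateOffset(months=n)` behaves.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by a whole number of months (may be negative)."""
    # Count months since year 0 so that year rollover is plain integer math.
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Clamp the day so that e.g. Jan 31 + 1 month lands on the last day of Feb.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))
```

Unlike adding a fixed `timedelta`, the shift depends on the calendar, which is why month offsets need their own code path.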

Resolves: #6370

This PR enables the parallel execution of pytests of `cudf`, `dask_cudf` & `custreamz` in CI. The changes also include adding `pytest-xdist` to dev environments. With these changes, here is the change in pytest execution times in CI:

| module | without pytest-xdist | with pytest-xdist(n=6) |
| ----------- | ----------- | ----------- |
| cudf | 1 hr | 14 min |
| dask_cudf | 4 min | 1 min |
| custreamz | 6 min | 2 min |

Related Integration changes: rapidsai/integration#188

Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - AJ Schmidt - Keith Kraus URL: #6958

`librdkafka` 1.5.3 required gcc 9.3 (https://anaconda.org/conda-forge/librdkafka/files?version=1.5.3) and `libcudf_kafka` (https://anaconda.org/rapidsai-nightly/libcudf_kafka/files?version=0.18.0a201215) is being built requiring 1.5.3, which is not compatible with the rest of RAPIDS. Authors: - Ray Douglass Approvers: - Mike Wendt

* Fixes #6823
* Raise a `KeyError` similar to Pandas rather than an `IndexError` when loc fails
* Improve tests to compare more directly with Pandas behaviour

Authors: - Ashwin Srinath <shwina@users.noreply.github.com> Approvers: - Ram (Ramakrishna Prabhu) - Michael Wang - GALI PREM SAGAR URL: #6993

Found a small bug while working on NVIDIA/spark-rapids#1244. For negative integers, it was not rounding to nearest even number. Authors: - Niranjan Artal <nartal@nvidia.com> - Conor Hoekstra <codereport@outlook.com> Approvers: - Conor Hoekstra - Mark Harris URL: #7014
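
A sketch of round-half-to-even for integers in plain Python, to show why negative inputs need care. This is not the libcudf kernel; `round_half_even` is a hypothetical helper and it only handles rounding an integer to a non-positive number of decimal places.

```python
def round_half_even(value, decimal_places):
    """Round an integer to `decimal_places` (<= 0), sending ties to even."""
    if decimal_places >= 0:
        return value  # an integer has no fractional digits to round
    factor = 10 ** (-decimal_places)
    # Python's divmod floors, so the remainder is always in [0, factor),
    # for negative values too. Truncating division would not give that.
    quotient, remainder = divmod(value, factor)
    twice = 2 * remainder
    if twice > factor or (twice == factor and quotient % 2 == 1):
        quotient += 1
    return quotient * factor
```

The tie case is decided by the parity of the floored quotient, so -25 and 25 both round towards the even multiple of ten.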

This PR resolves a part of #3556.

Supporting `cudf::reduce`:
1. Part 1 (`MIN`, `MAX`, `SUM` & `PRODUCT` & `NUNIQUE`) #6814
2. Part 2 (the rest) ◀️

**Reduction Ops:**

**Done in Previous PR**
* ✔️ `SUM, ///< sum reduction`
* ✔️ `PRODUCT, ///< product reduction`
* ✔️ `MIN, ///< min reduction`
* ✔️ `MAX, ///< max reduction`
* ✔️ `NUNIQUE, ///< count number of unique elements`

**Not supported by `cudf::reduce`:**
* [x] `COUNT_VALID, ///< count number of valid elements`
* [x] `COUNT_ALL, ///< count number of elements`
* [x] `COLLECT, ///< collect values into a list`
* [x] `LEAD, ///< window function, accesses row at specified offset following current row`
* [x] `LAG, ///< window function, accesses row at specified offset preceding current row`
* [x] `PTX, ///< PTX UDF based reduction`
* [x] `CUDA ///< CUDA UDf based reduction`
* [x] `ARGMAX, ///< Index of max element`
* [x] `ARGMIN, ///< Index of min element`
* [x] `ROW_NUMBER, ///< get row-number of element`

**Won't be supported:**
* [x] `ANY, ///< any reduction`
* [x] `ALL, ///< all reduction`

**To Do / Investigate:**
* [x] `SUM_OF_SQUARES, ///< sum of squares reduction`
* [x] `MEDIAN, ///< median reduction`
* [x] `QUANTILE, ///< compute specified quantile(s)`
* [x] `NTH_ELEMENT, ///< get the nth element`

**Deferred until requested**
* [x] `MEAN, ///< arithmetic mean reduction`
* [x] `VARIANCE, ///< groupwise variance`
* [x] `STD, ///< groupwise standard deviation`

Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - null - Karthikeyan - David URL: #6980

Follow up for PR #6907 - `replace_null` policy function now supports `string` and `dictionary` dtype column. Since original implementation depends only on column validity and index, this extension trivially removes SFINAE on `replace_null` functor and removes `type_dispatcher`. Authors: - Michael Wang <isVoid@users.noreply.github.com> Approvers: - Mark Harris - Karthikeyan URL: #7004

`pd.DateOffset` uses a metaclass that overrides the usual instance/subclass checking behaviour. Any subclass of `pd._libs.tslibs.offsets.BaseOffset` will be reported as a subclass of `pd.DateOffset` (itself a `pd.DateOffset`). This can lead to some surprising behaviour:

```python
In [3]: isinstance(pd.DateOffset(), cudf.DateOffset)
Out[3]: True
```

Note that `cudf.DateOffset` inherits from `pd.DateOffset`. But, a `pd.DateOffset` is reported as an instance of `cudf.DateOffset` -- [Child Is Father of the Man](https://en.wikipedia.org/wiki/Child_Is_Father_of_the_Man)!

Authors: - Ashwin Srinath <shwina@users.noreply.github.com> Approvers: - GALI PREM SAGAR URL: #7029

Closes #6850 dask_cudf version of the `dask.dataframe` changes proposed in [dask#6960](dask/dask#6960). Uses `fsspec` to infer the default `compression` argument from the suffix of the first file-path argument. Authors: - rjzamora <rzamora217@gmail.com> Approvers: - Keith Kraus URL: #7013

Closes #6472.

`rolling.cu` is taking inordinately long to compile, slowing down the `libcudf` build. The following changes were made to mitigate this:

1. Moved `grouped_rolling_window()` and `grouped_time_based_rolling_window()` to `grouped_rolling.cu`. Common functions were moved to `rolling_detail.cuh`.
2. Normalized timestamp columns to use int64_t representations. This reduces the number of template instantiations for `time_based_grouped_rolling_window()`.
3. `grouped_*_rolling_window()` functions used to pass around fancy iterators, causing massive template instantiations. This has been changed to materialize the window offsets as separate columns, and use those with existing `rolling_window()` functions to produce the final result.

These changes have been tested by running a window function test from SparkSQL, over a 2.4GB ORC file with 155M records (1.5M groups of about 97 records each on average):

1. There has been no discernible change in the end-to-end runtime. (The `nsys` profile seems to indicate that the total time spent in the `gpu_rolling` kernel has reduced. This is still being examined, to confirm.)
2. Compiling `rolling.cu` and `grouped_rolling.cu` in parallel now takes 60s as opposed to about 300s before.
3. The object file size seems to have reduced by a factor of 3.

Authors: - Mithun RK <mythrocks@gmail.com> Approvers: - Vukasin Milovanovic - Karthikeyan URL: #6512

We compared the wrong thing on a cast optimization. This fixes that. Authors: - Robert (Bobby) Evans <bobby@apache.org> Approvers: - Jason Lowe - Alessandro Bellina URL: #7032

Closes #1361 - Provides "`ffill`" and "`bfill`" `fillna` methods for `Numerical`, `Datetime`, `Timedelta` and `Categorical` type column. - Supports `method` parameter for `Series.fillna` and `DataFrame.fillna` Authors: - Michael Wang <isVoid@users.noreply.github.com> Approvers: - Ashwin Srinath - GALI PREM SAGAR URL: #6998
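
The semantics of the two methods can be shown on a plain list. This is a sketch of the behaviour only, with `None` standing in for null and `fill_nulls` as a hypothetical helper, not the cuDF column code.

```python
def fill_nulls(values, method):
    """Fill None entries from the previous ("ffill") or next ("bfill") valid value."""
    if method not in ("ffill", "bfill"):
        raise ValueError(f"unsupported method: {method}")

    # A backward fill is a forward fill over the reversed sequence.
    sequence = values if method == "ffill" else values[::-1]

    filled = []
    last_valid = None
    for value in sequence:
        if value is not None:
            last_valid = value
        filled.append(last_valid)

    return filled if method == "ffill" else filled[::-1]
```

Leading nulls stay null under `ffill` and trailing nulls stay null under `bfill`, since there is no valid value to copy from.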

Fixes #6923 Included other minor cuIO improvements that are too small for individual PRs: - Remove unnecessary NaN-related conditions in JSON, CSV. - Expand a comment in `createSerializedTrie` to make initialization clearer. Authors: - vuule <vmilovanovic@nvidia.com> - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com> Approvers: - GALI PREM SAGAR - Karthikeyan - Christopher Harris URL: #7012

…umn data(#7020) I tried experimenting with changing the `cudf::size_type` to `int64_t` and found many, many places that assume `size_type` and `int32_t` (and `int`) are interchangeable. This PR attempts to fix some of the places where an offsets column is created as INT32 but the column data is incorrectly referenced as `data<size_type>()`, for example. Also, this PR fixes some places that accept/return only int32_t (regex internal functions) or size_type (factories), which should be cast or accounted for. This is not a full set of possible violations found but may help minimize future errors. No function has changed/added. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Conor Hoekstra - Devavret Makkar URL: #7020

This corrects an issue with the sampling range used when replacement=True. Before, it sampled the range 0 through `num_rows` meaning it could sample `num_rows` even though it's one position out of bounds. This caused sample to return values not present in the original DataFrame. I also created exceptions for sampling on empty DataFrames that match pandas, as well as an exception for sampling when `axis=1` and `replace=True` as cudf does not support DataFrames with duplicate columns. This closes #6532 Authors: - Chris Jarrett <cjarrett@dt08.aselab.nvidia.com> - Mark Harris <mharris@nvidia.com> - ChrisJar <chris.jarrett.0@gmail.com> Approvers: - Keith Kraus - Mark Harris URL: #6884
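
The off-by-one is easy to see in a small standard-library sketch. `sample_positions` is a hypothetical helper, not the cuDF code, and the error message is illustrative rather than the exact pandas text.

```python
import random


def sample_positions(num_rows, n, seed=None):
    """Draw n row positions with replacement from a frame of num_rows rows."""
    if num_rows == 0:
        raise ValueError("cannot take a sample from an empty frame")
    rng = random.Random(seed)
    # randrange excludes its upper bound, so positions are 0 .. num_rows - 1.
    # Sampling 0 .. num_rows inclusive would sometimes return num_rows,
    # which is one position past the last row.
    return [rng.randrange(num_rows) for _ in range(n)]
```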

Follow up of PR #7004 Adds `method` field to `fillna` method in string type column to support `ffill` and `bfill`. Also involves a small change to a `datetime64` `ffill`, `bfill` test case to improve test robustness. Authors: - Michael Wang <isVoid@users.noreply.github.com> Approvers: - GALI PREM SAGAR URL: #7036

The `run_pos` that was being used came from the data stream rather than from the secondary stream (which is for scale), but the resulting value was being used for the secondary stream `scale`. The code change fixes that issue and also adds a test case to cover it. closes #7016 Authors: - Ramakrishna Prabhu <ramakrishnap@nvidia.com> Approvers: - Vukasin Milovanovic - GALI PREM SAGAR - Devavret Makkar URL: #7034

This PR closes #6921 by dispatching to appropriate cudf alias for numpy functions from the UFUNC_ALIASES dictionary :

```python
_UFUNC_ALIASES = {
    "power": "pow",
    "equal": "eq",
    "not_equal": "ne",
    "less": "lt",
    "less_equal": "le",
    "greater": "gt",
    "greater_equal": "ge",
    "absolute": "abs",
}
```

Authors: - Vibhu Jawa <vibhujawa@gmail.com> Approvers: - Keith Kraus - null URL: #6973
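
A rough sketch of how such an alias table can drive dispatch, without NumPy or cuDF. `TinySeries` and `dispatch_ufunc` are hypothetical names for illustration; only the alias dictionary comes from the PR description.

```python
_UFUNC_ALIASES = {
    "power": "pow",
    "equal": "eq",
    "not_equal": "ne",
    "less": "lt",
    "less_equal": "le",
    "greater": "gt",
    "greater_equal": "ge",
    "absolute": "abs",
}


class TinySeries:
    """Toy stand-in for a series that exposes pandas-style method names."""

    def __init__(self, data):
        self.data = list(data)

    def pow(self, other):
        return TinySeries(x ** other for x in self.data)

    def abs(self):
        return TinySeries(abs(x) for x in self.data)

    def add(self, other):
        return TinySeries(x + other for x in self.data)


def dispatch_ufunc(obj, ufunc_name, *args):
    # Translate the NumPy name to the method name when they differ;
    # names such as "add" already match and pass through unchanged.
    method_name = _UFUNC_ALIASES.get(ufunc_name, ufunc_name)
    method = getattr(obj, method_name, None)
    if method is None:
        return NotImplemented
    return method(*args)
```

For example, `dispatch_ufunc(TinySeries([1, -2, 3]), "absolute")` resolves to the `abs` method, while an unknown name falls through to `NotImplemented`.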

Fixes: #7025 This PR: 1. Handles loading of pickle files which have been created with rangeIndex prior to introduction of `step` parameter support. 2. Introduces special-case handling of stringcolumn size where we were previously storing it as a pickled object. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Ram (Ramakrishna Prabhu) URL: #7033

Share the implementation of `cudf.Series.factorize` with the `Index` class and the `cudf` module namespace. Closes #6871 Authors: - brandon-b-miller <brmiller@nvidia.com> - Keith Kraus <kkraus@nvidia.com> - brandon-b-miller <53796099+brandon-b-miller@users.noreply.github.com> Approvers: - Ashwin Srinath - Keith Kraus URL: #6885

Fixes: #6936 This PR introduces changes to `MultiIndex.__repr__`, where the output is now more readable and easy to understand similar to that of pandas MultiIndex. Changes also include handling of `<NA>`, `nan` values and spacing issues around them. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - null - Keith Kraus URL: #6992

Making this PR since wrong formatting keeps getting propagated in new PRs and (sometimes) corrected in code review. Changes: - Ironed out the formatting of Doxygen comments to match the guidelines. - Removed the outdated file with formatting examples. Authors: - vuule <vmilovanovic@nvidia.com> - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com> - vukasin <vmilovanovic@nvidia.com> Approvers: - David - Karthikeyan URL: #7041

…7050) Fixes: #7046 This PR: - [x] Updates all doc examples with new NA_REP(`<NA>`) - [x] Fixes reference warnings during doc build. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Keith Kraus URL: #7050

Improves performance of parquet reader on certain multi-GPU systems, which take a long time to allocate pinned memory, by reducing the number of `hostdevice_vector` allocations. Closes #7049 Authors: - Devavret Makkar <dmakkar@nvidia.com> Approvers: - null - Ram (Ramakrishna Prabhu) - Karthikeyan URL: #7005

This PR resolves a part of #3556.

Aggregation ops supported:
* `MIN`
* `MAX`
* `COUNT` (both `null_policy` - `EX/INCLUDE`)
* `LEAD`
* `LAG`

**To Do List:**
* [x] Basic unit tests
* [x] Comprehensive unit tests
* [x] Implementation
* [x] Figure out which rolling ops to support

Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - Vukasin Milovanovic - Ram (Ramakrishna Prabhu) URL: #7037

Reference #7027 and #5698 This adds a strings column to the current gbenchmark for sort. This will help measure improvements or changes over time to the column and strings comparator functions. No code logic changed or added. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Vukasin Milovanovic - Devavret Makkar - Keith Kraus URL: #7040

Closes #6699 The timestamp format(s) used by the CSV writer have the form `%Y-%m-%dT%H:%M:%SZ`. This means if the column delimiter `','` or the line delimiter `\n` is either `':'` or `'-'` then the timestamp string output could conflict with these delimiters. The current logic simply removed these delimiters from the format if they detected a conflicting column or line delimiter. For example, specifying a dash `'-'` as column delimiter caused the timestamp format to change to `%Y%m%d...` (the dash is removed). I admit this was kind of hacky and also made the output inconsistent with Pandas `to_csv()`. It is easy enough to simply add double-quotes around the timestamp format to prevent these conflicts as well as make the output consistent. This PR fixes that logic. Exception logic to check for a dash as column separator was also found in [csv.py](https://github.com/rapidsai/cudf/blob/8c1f01e1fd713d873cf3d943ab409f3e9efc48f8/python/cudf/cudf/io/csv.py#L139-L149), specifically citing issue 6699 in the exception message. Also, there was a pytest specifically created to check for this exception. The exception is removed and the pytest function updated in this PR as well. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - GALI PREM SAGAR - Karthikeyan - null URL: #7023
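
The quoting approach can be demonstrated with Python's own `csv` module, which quotes a field only when it contains the delimiter. This is an illustration of the idea, not the libcudf CSV writer; `write_rows` is a hypothetical helper.

```python
import csv
import io


def write_rows(rows, delimiter):
    """Serialize rows to CSV text, quoting only fields that need it."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


rows = [[1, "2020-12-15T08:30:00Z"]]
comma_output = write_rows(rows, ",")  # timestamp left unquoted
dash_output = write_rows(rows, "-")   # timestamp contains '-', so it is quoted
```

The timestamp text itself is never altered; only the surrounding quotes change with the delimiter, so a reader can always recover the original format.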

Adding support for `cudf::scan` for `decimal32` and `decimal64`. `cudf::scan` only supports 4 operations (sum, product, min and max) but the decimal types will only support `SUM`, `MAX` and `MIN`. This PR resolves a part of #3556. Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - Jake Hemstad - Mark Harris URL: #7063

Resolves #6863 Expands existing murmur3 hashing functionality to match Spark's murmur3 hashing algorithm by modifying tail processing for unaligned bytes and processing booleans as 32bit integers rather than singular bytes. Authors: - Ryan Lee <ryanlee@nvidia.com> - rwlee <rwlee@users.noreply.github.com> Approvers: - Jake Hemstad - null - Robert (Bobby) Evans - GALI PREM SAGAR URL: #7024
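
The two differences can be sketched in pure Python. This is written from my understanding of Spark's `Murmur3_x86_32` and is not the libcudf code: the function names are hypothetical, the seed of 42 is an assumed default, and results are returned as unsigned 32-bit values whereas Spark reports signed ints.

```python
MASK32 = 0xFFFFFFFF
C1 = 0xCC9E2D51
C2 = 0x1B873593


def _rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_k1(k1):
    k1 = (k1 * C1) & MASK32
    k1 = _rotl32(k1, 15)
    return (k1 * C2) & MASK32


def _mix_h1(h1, k1):
    h1 ^= k1
    h1 = _rotl32(h1, 13)
    return (h1 * 5 + 0xE6546B64) & MASK32


def _fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & MASK32
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & MASK32
    h1 ^= h1 >> 16
    return h1


def spark_hash_int(value, seed=42):
    h1 = _mix_h1(seed & MASK32, _mix_k1(value & MASK32))
    return _fmix(h1, 4)


def spark_hash_bool(value, seed=42):
    # Booleans are widened to a 32-bit integer (1 or 0), not hashed as one byte.
    return spark_hash_int(1 if value else 0, seed)


def spark_hash_bytes(data, seed=42):
    h1 = seed & MASK32
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        word = int.from_bytes(data[i:i + 4], "little")
        h1 = _mix_h1(h1, _mix_k1(word))
    # Tail: each leftover byte is sign-extended and run through a full
    # mix round on its own, instead of being packed into one final word.
    for byte in data[aligned:]:
        signed = byte - 256 if byte >= 128 else byte
        h1 = _mix_h1(h1, _mix_k1(signed & MASK32))
    return _fmix(h1, len(data))
```

For inputs whose length is a multiple of four the tail loop never runs, so the Spark-style and standard variants only diverge on unaligned lengths.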

This version is more friendly to ccache:

```console
ccache -C # clear the cache
time mvn clean package -DskipTests

real 4m43.015s
user 11m18.426s
sys 0m21.891s

time mvn clean package -DskipTests # everything is now cached

real 0m20.265s
user 0m45.810s
sys 0m3.670s
```

Not sure about the ABI flag, but leaving it in causes the .so to not load:

```console
/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java: symbol lookup error: /tmp/nvcomp5478764208255606671.so: undefined symbol: _ZN6nvcomp5Check8not_nullEPKvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_i
```

@jlowe

Authors: - Rong Ou <rong.ou@gmail.com> Approvers: - Jason Lowe URL: #7069

Since cudf doesn't support precision, the precision must be passed in as a write option. This is handled as a vector of uint8's that indicates the precision of each flattened column in order to support nested types. Partially closes #6474 Authors: - Mike Wilson <knobby@burntsheep.com> - Mike Wilson <hyperbolic2346@users.noreply.github.com> Approvers: - Vukasin Milovanovic - Mark Harris URL: #7017

…7028) Closes #6774 This PR adds a check for a valid day value for a year/month (if these are specified in the format) in the `cudf::is_timestamp()` API. Also, a chunk of messy year/month/day logic in a related functor was replaced with the libcu++ implementation of the `year_month_day()` function instead. A gtest is also updated to include a test for an invalid day. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Vukasin Milovanovic - Devavret Makkar - Karthikeyan - Jake Hemstad URL: #7028
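
The day check amounts to comparing the day against the length of the given month. A standard-library sketch, with `is_valid_date` as a hypothetical helper rather than the libcudf functor:

```python
import calendar


def is_valid_date(year, month, day):
    """Return True when day exists in the given year and month."""
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the 1st, number of days in the month)
    # and accounts for leap years.
    days_in_month = calendar.monthrange(year, month)[1]
    return 1 <= day <= days_in_month
```

A purely range-based check such as `1 <= day <= 31` would accept April 31 or February 30, which is the kind of input the new check rejects.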

Should only upload the packages that were actually built. Project Flash sets `BUILD_CUDF` and `BUILD_LIBCUDF` as needed to control this. Skipping CI as this change only affects uploads which isn't tested by CI. Authors: - Raymond Douglass <ray@raydouglass.com> Approvers: - Dillon Cullinan URL: #7077

…7061) This PR adds support for `cudf::rolling` for the `ROW_NUMBER` option for `decimal32` and `decimal64`. It also clarifies the documentation. Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - David - Devavret Makkar URL: #7061

Related to #5826 Refactor the `ProtobufReader` API to facilitate expansion to support robust reading of column statistics. Changes include:
- Move `orc::metadata` from `reader_impl.cu` to `orc.h` so it can be reused for statistics related APIs.
- Removed duplicated code in `read_orc_statistics` - use `orc::metadata` instead.
- Rename `ColumnStatistics` to `ColStatsBlob`, since that's what it currently is.
- Avoid redundant copies in `read_orc_statistics`.
- Replace `get_u32`, `get_i32`, etc. with templated `get`.
- Replace per-type functors (e.g. `FieldUInt64`) with templated `field_reader`s to reduce code repetition.
- The two type-specific parts of `FieldXYZ` functors (field enum and read impl) are now separate to avoid redundant code.
- `field_reader` dispatches based on the value type, so also added `packed_field_reader` and `raw_field_reader` for packed fields and blob reads (respectively).
- Replace return value based error checking in `ProtobufReader` with `CUDF_EXPECTS`.
- Removed `InitSchema` from `ProtobufReader` - schema is only used to determine column names. The names are now lazily calculated in `metadata::get_column_name`.

Authors: - vuule <vmilovanovic@nvidia.com> - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com> Approvers: - Kumar Aatish - Conor Hoekstra URL: #7055

Reference #5963 Add dictionary support to groupby.
- [x] argmax
- [x] argmin
- [x] collect
- [x] count
- [x] max
- [x] mean*
- [x] median
- [x] min
- [x] nth element
- [x] nunique
- [x] quantile
- [x] std*
- [x] sum*
- [x] var*

\* _not supported due to 10.2 compile segfault_

Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Jake Hemstad - Karthikeyan URL: #6585

Closes #6360 Updates `digitize()` API to accept any non-nullable column-like objects, as opposed to previously only `np.array`. Adds better error handling and further simplifies the implementation. Authors: - Michael Wang <isVoid@users.noreply.github.com> Approvers: - Ashwin Srinath URL: #7071

Closes #6694 When `unstack()` receives a dataframe with a "single" index, it now returns a series to match pandas behavior. Authors: - Michael Wang <isVoid@users.noreply.github.com> Approvers: - null URL: #7054

Fixes: #7070 This PR fixes a failure in `to_pandas` when `nullable` is set to `True`. The changes in this PR implement `__hash__` in `listDtype`. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Ashwin Srinath URL: #7081

@kuhushukla

…d bug in struct with no children(#7084) The primary goal of this is to add in java APIs to create a struct column from other existing columns. As a part of this work I found a very small bug in the column_vector constructor that copies data from a column view for a struct column with no children in it. Spark supports this use case so I thought it would be good to test/fix the issue. Authors: - Robert (Bobby) Evans <bobby@apache.org> Approvers: - Kuhu Shukla (@kuhushukla) - Vukasin Milovanovic (@vuule) - Jason Lowe (@jlowe) URL: #7084

@kkraus14

Fixes: #7056 This PR handles `nan` values separately in `one_hot_encoding` when the given input category is `None`. Previously we were combining both `nan` & `<NA>` values to be the same when cat is `None`. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Keith Kraus (@kkraus14) URL: #7059

@cwharris

This PR adds `pyorc` package to dev environment yml files. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Christopher Harris (@cwharris) - AJ Schmidt (@ajschmidt8) URL: #7085

@kkraus14

Fixes: #7091 This PR introduces validation and throwing of informative error messages for the `sep` parameter in csv writer. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Keith Kraus (@kkraus14) - Vukasin Milovanovic (@vuule) URL: #7095

@jlowe

This pull request attempts to verify support for decimal casts in the Java package. Authors: - sperlingxx <lovedreamf@gmail.com> Approvers: - Jason Lowe (@jlowe) URL: #7051

@kkraus14

Closes #6858. Adds a "GroupBy" page to our docs. Authors: - Ashwin Srinath <shwina@users.noreply.github.com> Approvers: - Keith Kraus (@kkraus14) URL: #7100

@jlowe

…7112) Adds in java support to be able to create a list column from other existing columns. Authors: - Robert (Bobby) Evans <bobby@apache.org> Approvers: - Jason Lowe (@jlowe) URL: #7112

@vuule

closes #6542 - [x] Add segmented_gather(list, list) - [x] Add unit tests - [x] Documentation Authors: - Karthikeyan Natarajan <karthikeyann@users.noreply.github.com> - Karthikeyan <6488848+karthikeyann@users.noreply.github.com> Approvers: - Vukasin Milovanovic (@vuule) - @nvdbaranec - AJ Schmidt (@ajschmidt8) - Jake Hemstad (@jrhemstad) URL: #7003

@revans2

This pull request is to verify window operations on decimal columns in java package, which is required by spark-rapids on [issue 1333](NVIDIA/spark-rapids#1333). Authors: - sperlingxx <lovedreamf@gmail.com> Approvers: - Robert (Bobby) Evans (@revans2) URL: #7120

@mythrocks

This PR adds `fixed_point::scale()` and `fixed_point::value()`. It enables developers to avoid the following piece of code (which is how you can currently access scale and value).

```cpp
auto si = numeric::scaled_integer<rep_type>{value};
// use si.value or si.scale
```

Note that this PR should be merged after #7105 (or I can resolve the conflict if it gets merged first). Authors: - Conor Hoekstra <codereport@outlook.com> - Conor Hoekstra <36027403+codereport@users.noreply.github.com> Approvers: - MithunR (@mythrocks) - David (@davidwendt) URL: #7109
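
To make the value/scale pairing concrete, here is a toy Python model of a base-10 fixed-point number. It is a sketch under the assumption that the represented number equals `value * 10**scale`; the `FixedPoint` class and its `to_decimal` helper are hypothetical and are not part of libcudf.

```python
# Illustrative sketch only; `FixedPoint` is a hypothetical toy class.
from decimal import Decimal


class FixedPoint:
    """Toy base-10 fixed-point number: represented number = value * 10**scale."""

    def __init__(self, value, scale):
        self._value = int(value)  # unscaled integer representation
        self._scale = int(scale)  # power-of-ten exponent

    def value(self):
        return self._value

    def scale(self):
        return self._scale

    def to_decimal(self):
        # scaleb shifts the decimal exponent without any rounding.
        return Decimal(self._value).scaleb(self._scale)


amount = FixedPoint(1234, -2)  # models 12.34
```

Direct accessors mean callers can read both parts without building an intermediate object first.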

@karthikeyann

) Fixes a specific corner case: String columns with no children (a special form of empty string column that can happen) that are nested inside a list (or struct) column. This would be useful as a 0.17 PR but isn't strictly necessary, since it's pretty late. Edit: Updated the fix so that it always includes a record for src/dst buffers, even if they are of size 0 or have null data pointers. The previous method that only checked the data pointer being null was unclean and didn't handle a particularly strange case that came up with the Spark plugin: the plugin was reconstructing columns (on the receiver side of a shuffle) that had size 0 but a non-null data pointer. This is technically legal but super weird. Authors: - Dave Baranec <dbaranec@nvidia.com> - Karthikeyan <6488848+karthikeyann@users.noreply.github.com> Approvers: - Karthikeyan (@karthikeyann) - Alfred Xu (@sperlingxx) - Karthikeyan (@karthikeyann) - Alfred Xu (@sperlingxx) - Karthikeyan (@karthikeyann) - Devavret Makkar (@devavret) URL: #6864

@harrism

…or `decimal32` and `decimal64`(#7119) This PR resolves #7115. Add `cudf::binary_operation` support for `NULL_MAX`, `NULL_MIN` and `NULL_EQUALS` for `decimal32` and `decimal64`. Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - Mark Harris (@harrism) - David (@davidwendt) - Mike Wilson (@hyperbolic2346) URL: #7119
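
The null-aware operators can be modelled in plain Python with `None` standing in for a null element. This sketch reflects my understanding of the semantics (a single null operand is ignored by `NULL_MAX`/`NULL_MIN`, and `NULL_EQUALS` treats two nulls as equal); it is an assumption rather than a statement of the libcudf implementation, and the function names are hypothetical.

```python
# Illustrative sketch only; semantics are assumed, names are hypothetical.
from decimal import Decimal


def null_equals(a, b):
    """True when both are null or both are equal; False when only one is null."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def null_max(a, b):
    """Largest non-null operand; null only when both operands are null."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def null_min(a, b):
    """Smallest non-null operand; null only when both operands are null."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)


lhs = [Decimal("1.50"), None, Decimal("2.00"), None]
rhs = [Decimal("1.25"), Decimal("3.00"), None, None]
maxima = [null_max(a, b) for a, b in zip(lhs, rhs)]
```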

@codereport

I discovered we're not building libcudf with the `-Wall` GCC flag. This PR enables `-Wall` for GCC and nvcc, and fixes most of the errors. ~~The only error I haven't fixed yet is `-Werror=uninitialized` on [this line](https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/include/cudf/scalar/scalar.hpp#L334), but @codereport is on it.~~ Fixed ✔️ Authors: - ptaylor <paul.e.taylor@me.com> - Conor Hoekstra <codereport@outlook.com> - Paul Taylor <paul.e.taylor@me.com> Approvers: - Conor Hoekstra (@codereport) - Keith Kraus (@kkraus14) - Mark Harris (@harrism) URL: #7105

@jlowe

The GDS 0.9 release changed the location where it puts the header and library files. @jlowe @kkraus14 Authors: - Rong Ou <rong.ou@gmail.com> Approvers: - Keith Kraus (@kkraus14) - Jason Lowe (@jlowe) URL: #7131

@trxcllnt

#7105 somehow got merged but broke compilation. These are the necessary fixes. Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - Paul Taylor (@trxcllnt) - Vukasin Milovanovic (@vuule) URL: #7134

@kkraus14

After recent changes in libcudf compilation in #7105, the compilation of libcudf on my local machine is broken and these changes fixed the compilation errors. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Keith Kraus (@kkraus14) - Devavret Makkar (@devavret) - David (@davidwendt) URL: #7138

@ajschmidt8

Closes #7027 The internal `cudf::strings::detail::sort()` function is faster at sorting a single strings column than `cudf::sort`. Details are in the #7027 comments. Results using the sort gbenchmark:

```
Baseline:
SortStrings/stringssort/1024/manual_time          1.18 ms    1.20 ms    593
SortStrings/stringssort/4096/manual_time          1.98 ms    2.00 ms    352
SortStrings/stringssort/32768/manual_time         2.73 ms    2.75 ms    256
SortStrings/stringssort/262144/manual_time        4.36 ms    4.38 ms    160
SortStrings/stringssort/2097152/manual_time       66.2 ms    66.2 ms     10
SortStrings/stringssort/16777216/manual_time       547 ms     548 ms      1

Calling cudf::strings::detail::sort from cudf::sort:
SortStrings/stringssort/1024/manual_time         0.692 ms   0.711 ms   1002
SortStrings/stringssort/4096/manual_time          1.13 ms    1.15 ms    615
SortStrings/stringssort/32768/manual_time         1.59 ms    1.61 ms    440
SortStrings/stringssort/262144/manual_time        2.82 ms    2.84 ms    247
SortStrings/stringssort/2097152/manual_time       43.1 ms    43.1 ms     16
SortStrings/stringssort/16777216/manual_time       386 ms     386 ms      2
```

Authors: - davidwendt <dwendt@nvidia.com> Approvers: - AJ Schmidt (@ajschmidt8) - Conor Hoekstra (@codereport) - Jake Hemstad (@jrhemstad) - Christopher Harris (@cwharris) URL: #7075

@nvdbaranec

Fixes #6716 Authors: - Devavret Makkar <dmakkar@nvidia.com> Approvers: - @nvdbaranec - David (@davidwendt) URL: #7142

@codereport

While investigating the long compile times of the reduction source files `any.cu` and `all.cu`, I found it necessary to build a gbenchmark to ensure changes did not affect the performance of these functions. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Conor Hoekstra (@codereport) - Vukasin Milovanovic (@vuule) - Paul Taylor (@trxcllnt) - Keith Kraus (@kkraus14) URL: #7129

@vuule

Fixes: #7103 This PR introduces: - [x] a new doc page which contains dtypes & IO formats matrix supported by cudf currently. This matrix currently lists whether a dtype is supported by a reader / writer. How the table looks can be seen in the below screenshot. - [x] As part of this PR I have also introduced informative error messages in some IO reader/writers. - [x] Raising an error in ORC writer if there is any categorical data. ![Screenshot from 2021-01-15 09-40-57](https://user-images.githubusercontent.com/11664259/104747156-cb335200-5715-11eb-92b3-85a246fbdc8a.png) Authors: - galipremsagar <sagarprem75@gmail.com> - GALI PREM SAGAR <sagarprem75@gmail.com> Approvers: - Vukasin Milovanovic (@vuule) - Ashwin Srinath (@shwina) - AJ Schmidt (@ajschmidt8) - Keith Kraus (@kkraus14) URL: #7139

@davidwendt

…7147) This PR resolves #7117 by adding support for `cudf::rolling` for the `SUM` option for `decimal32` and `decimal64`. Authors: - Conor Hoekstra <codereport@outlook.com> Approvers: - David (@davidwendt) - Karthikeyan (@karthikeyann) URL: #7147

@jlowe

Allow overriding `GPU_ARCHS` with an empty string in cudfjni to enable automatic detection

```bash
mvn clean install -DARROW_STATIC_LIB=ON -DBoost_USE_STATIC_LIBS=ON -DGPU_ARCHS=
...
[exec] -- CUDA_VERSION: 11.0
[exec] Auto detection of gpu-archs: 75
[exec] GPU_ARCHS = 75
```

Allow `--h[elp]` switch to `$CUDF_HOME/build.sh` Authors: - Gera Shegalov <gshegalov@nvidia.com> - Gera Shegalov <gera@apache.org> Approvers: - Jason Lowe (@jlowe) - Keith Kraus (@kkraus14) URL: #7155

@galipremsagar

Fixes #7043, gives less than ideal results due to #7066. Authors: - brandon-b-miller <brmiller@nvidia.com> Approvers: - GALI PREM SAGAR (@galipremsagar) URL: #7072

@razajafri

#7146) @razajafri noticed that precision could not be equal to scale when writing decimals. This should be allowed and this fixes that and adds a test to verify it. closes #7145 Authors: - Mike Wilson <knobby@burntsheep.com> Approvers: - Raza Jafri (@razajafri) - Vukasin Milovanovic (@vuule) URL: #7146

@harrism

Not sure why these aren't being caught in local 10.2 envs or CI builds, but I can't build a local CUDA 11.0 env due to a mamba bug. Authors: - ptaylor <paul.e.taylor@me.com> Approvers: - Mark Harris (@harrism) - David (@davidwendt) URL: #7164

@ajschmidt8

The Doxyfile project number is set to 0.16. I know I've seen it in the UI before but cannot find it now. I've updated the number just in case. And I've added a line to the update-version.sh (thanks @ajschmidt8 ) to automatically update the file when a new release is created. Neither of these 2 files affects the CI/CD build. Authors: - davidwendt <dwendt@nvidia.com> Approvers: - Karthikeyan (@karthikeyann) - AJ Schmidt (@ajschmidt8) - Mark Harris (@harrism) URL: #7161

@brandon-b-miller

Implementation of the feature includes:
- Renamed libcudf `read_orc_statistics` to `read_raw_orc_statistics` to make a distinction from the new function.
- Changed the `read_raw_orc_statistics` return type to `raw_orc_statistics` instead of the vector with heterogeneous data.
- Added `read_parsed_orc_statistics` that also parses the statistics blobs to make the API usable without the Python layer.
- Fixed a few compiler warnings (i.e. errors).
- Added read functions for statistics to ProtobufReader.
- Added support for optional fields to ProtobufReader (such fields are `std::unique_ptr` for now).

Other changes:
- Renamed the existing ORC statistics API to `read_raw_orc_statistics`.
- Replaced some explicit H2D and D2H copies with appropriate abstractions.
- Enabled several ORC tests for bool columns that were missed when the support for such columns was added.
- Remove unused `zigzag(uint64_t)`.

Authors: - vuule <vmilovanovic@nvidia.com> - Vukasin Milovanovic <vmilovanovic@nvidia.com> Approvers: - @brandon-b-miller - GALI PREM SAGAR (@galipremsagar) - Conor Hoekstra (@codereport) - Mark Harris (@harrism) URL: #7136

@galipremsagar

This PR updates s3 tests to use `moto_server` instead of going via a moto mock_s3 context. This enables cleaner s3 testing with `s3fs>=0.5` which incorporates aiobotocore for s3 connections. - The pytests starts up a moto-server for each worker running tests. - Ports used: `5000, 5550 - 5550+ (n_pytest_workers-1)` Updated integration repo with requirements: rapidsai/integration#207 Authors: - Ayush Dattagupta <ayushdg95@gmail.com> Approvers: - GALI PREM SAGAR (@galipremsagar) - Keith Kraus (@kkraus14) URL: #7144

@rgsl888prabhu

Fixes: #7137, #7148 This PR fixes converting a pyarrow table which has list and struct types via `from_arrow`. In case of `list` dtype we shouldn't have to perform any typecast, and in case of `struct` dtype we should be renaming the fields appropriately. Authors: - galipremsagar <sagarprem75@gmail.com> Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - Keith Kraus (@kkraus14) URL: #7162

@harrism

Closes #6210
- Rename datetime utils to snake_case.
- Rename datetime utils so that `parse_xyz` functions move the input iterator past the parsed value, and `to_xyz` functions do not change the input iterators.
- Replace `findFirstOccurrence` with `thrust::find`.
- Replace use of offsets with pointers in `datetime.cuh` and `csv_gpu.cu`.
- Rename some variables in the CSV parser to make the code clearer.

Note: the semantics of variables/parameters did not change in this PR - `T* end` points to the last element in the range in many places. Authors: - vuule <vmilovanovic@nvidia.com> - Vukasin Milovanovic <vmilovanovic@nvidia.com> Approvers: - Mark Harris (@harrism) - Christopher Harris (@cwharris) URL: #7150

@vuule

This PR updates `cpp/doxygen/Doxyfile` to consume the generated Doxygen tags from `rmm` (see rapidsai/rmm#672). This will enable linking between the `cudf` docs and `rmm` docs. This PR along with rapidsai/rmm#672 closes issue #5152. I also updated `docs/cudf/source/conf.py` and added it to `update-version.sh`. Authors: - AJ Schmidt <aschmidt@nvidia.com> - AJ Schmidt <ajschmidt8@users.noreply.github.com> Approvers: - Vukasin Milovanovic (@vuule) - @nvdbaranec - Dillon Cullinan (@dillon-cullinan) - Karthikeyan (@karthikeyann) URL: #7149

@galipremsagar

Closes #7057 Properly overrides `MultiIndex.rename` from `Index.rename`, reusing API from `MultiIndex.set_names`. Authors: - Michael Wang <isVoid@users.noreply.github.com> Approvers: - GALI PREM SAGAR (@galipremsagar) URL: #7172

@codereport

) This PR resolves a part of #3556. I decided to push the changes for sort `cudf::group_by` and hash `group_by` in different PRs. Authors: - Conor Hoekstra (@codereport) Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - Karthikeyan (@karthikeyann) URL: #7169

@VibhuJawa

This PR closes #7083 by adding an encoding argument to our CSV writer. It also adds a compression argument to the writer. This will help address some issues with feature tool compatibility [PR](alteryx/featuretools#1246). Authors: - Vibhu Jawa (@VibhuJawa) Approvers: - GALI PREM SAGAR (@galipremsagar) - Michael Wang (@isVoid) URL: #7168

@ChrisJar

This enables round for DataFrames and Series using the libcudf round implementation and removes the old numba round implementation. Closes #1270 Authors: - @ChrisJar Approvers: - Ashwin Srinath (@shwina) - Michael Wang (@isVoid) - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - GALI PREM SAGAR (@galipremsagar) URL: #7022

@ayushdg

Fixes issue for failing s3 tests on machines missing aws credentials. The PR exports fake credentials to prevent botocore from looking for aws credentials on the machine. Authors: - Ayush Dattagupta (@ayushdg) Approvers: - Keith Kraus (@kkraus14) URL: #7176

@rgsl888prabhu

Replaces the API with a class for the chunked ORC writer to ease usage; for additional information see #6911. This PR also adds support for ORC chunked writing in Python along with test cases. Authors: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - Jason Lowe (@jlowe) Approvers: - Vukasin Milovanovic (@vuule) - GALI PREM SAGAR (@galipremsagar) - Devavret Makkar (@devavret) - AJ Schmidt (@ajschmidt8) - Jason Lowe (@jlowe) - Robert (Bobby) Evans (@revans2) URL: #7099

@razajafri

Adds in java support to be able to write fixed-point type to parquet Authors: - Raza Jafri (@razajafri) Approvers: - Karthikeyan (@karthikeyann) - Jason Lowe (@jlowe) URL: #7153

@galipremsagar

Fixes: #7031 This PR introduces array-like inputs support in `cudf.get_dummies`. I think in the near future we will have to deprecate `get_dummies` and adopt a new name for it: pandas-dev/pandas#35724 Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Keith Kraus (@kkraus14) URL: #7181

@skirui-source

Resolves: #5543 This PR adds support for updating a DataFrame with non-NA values from another DataFrame, whereby only the values at matching index/column labels are updated. Only left join is supported, keeping the index and columns of original DataFrame. Authors: - @skirui-source Approvers: - GALI PREM SAGAR (@galipremsagar) - Michael Wang (@isVoid) URL: #6883
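
The update rule described above (overwrite only where labels match and the incoming value is not NA, keeping the original index and columns) can be sketched on plain nested dictionaries. This is an illustration of the semantics only, not the cuDF implementation; the `update` helper and its `{column: {index_label: value}}` layout are hypothetical, with `None` standing in for NA.

```python
# Illustrative sketch only; `update` and the dict layout are hypothetical.
def update(left, right):
    """Return a copy of `left` with non-NA values from `right` applied
    at matching column and index labels. Shape of `left` is preserved."""
    result = {col: dict(values) for col, values in left.items()}
    for col, values in right.items():
        if col not in result:
            continue  # columns only present in `right` are ignored
        for label, value in values.items():
            if label in result[col] and value is not None:
                result[col][label] = value
    return result


original = {"a": {0: 1, 1: 2, 2: 3}, "b": {0: 10, 1: 20, 2: 30}}
other = {"b": {0: None, 1: 99, 3: 7}, "c": {0: 5}}
updated = update(original, other)
```

Here only `b[1]` changes: `b[0]` is NA in `other`, index `3` does not exist in `original`, and column `c` is not part of `original`.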

@shwina

Resolves #6657. Authors: - Ashwin Srinath (@shwina) - Conor Hoekstra (@codereport) - Keith Kraus (@kkraus14) Approvers: - Karthikeyan (@karthikeyann) - Keith Kraus (@kkraus14) - Vukasin Milovanovic (@vuule) URL: #6715

@galipremsagar

…7019) Fixes: #7007 This PR introduces changes to handle the filling of `np.nan` values in `fillna` code by converting `nan` to `null`. This fix surfaced an issue with `can_cast_safely`, where converting a float column with `nan`s to an `int` column was being allowed. This is incorrect, so a check was added to return False if there is at least 1 `nan` value in the float column. `nan` was also not being handled in `dropna`, although it is handled in `isna`, so this PR adds `nan` handling to `dropna` too. Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Christopher Harris (@cwharris) - Keith Kraus (@kkraus14) URL: #7019
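
A plain-Python sketch of the two behaviours described: treating `nan` as null before filling, and refusing a float-to-int cast when a `nan` is present. The helpers below are hypothetical stand-ins that operate on lists (with `None` as null); they are not the cuDF functions.

```python
# Illustrative sketch only; helper names are hypothetical, not cuDF APIs.
import math


def nans_to_nulls(values):
    """Replace float NaN entries with None (the stand-in for null)."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in values]


def fillna(values, fill_value):
    """Fill nulls, treating NaN as null first."""
    return [fill_value if v is None else v for v in nans_to_nulls(values)]


def can_cast_float_to_int_safely(values):
    """False if any non-null entry is NaN or has a fractional part."""
    for v in values:
        if v is None:
            continue
        if math.isnan(v) or not float(v).is_integer():
            return False
    return True
```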

@codereport

) After a discussion with Keith and Ashwin today, I realized `libcudf` was missing a couple corner cases for `fixed_point` and decided to open a small PR to add them. Authors: - Conor Hoekstra (@codereport) Approvers: - Vukasin Milovanovic (@vuule) - Keith Kraus (@kkraus14) - Mike Wilson (@hyperbolic2346) - @nvdbaranec - Mark Harris (@harrism) URL: #7178

@rjzamora

Adds a `CudfSeriesGroupBy` class to dask_cudf. This allows the optimizations from #6248 to be used for `CudfSeriesGroupBy.mean` (in addition to `CudfDataFrameGroupBy.aggregate`). Authors: - Richard (Rick) Zamora (@rjzamora) Approvers: - Keith Kraus (@kkraus14) URL: #7194

@hyperbolic2346

This is an operation that expands lists into rows and duplicates the existing rows from other columns. Explanation can be found in the issue #6151 partially fixes #6151 Missing pos_explode support required to completely close out #6151 Authors: - Mike Wilson (@hyperbolic2346) Approvers: - Robert (Bobby) Evans (@revans2) - Jake Hemstad (@jrhemstad) - Karthikeyan (@karthikeyann) - @nvdbaranec URL: #7140
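
The row expansion can be pictured with a small pure-Python sketch over a dictionary of columns. It is not the libcudf implementation; `explode` here is a hypothetical helper, and dropping rows whose list is empty or null is an assumption of this sketch.

```python
# Illustrative sketch only; not the libcudf implementation.
def explode(table, column):
    """table: {column name: list of row values}; `column` holds lists.
    Each list element becomes its own row; other columns are repeated."""
    out = {name: [] for name in table}
    for row_idx, items in enumerate(table[column]):
        for item in items or []:  # assumption: empty/null lists yield no rows
            for name in table:
                out[name].append(item if name == column else table[name][row_idx])
    return out


exploded = explode({"a": [[1, 2], [3]], "b": ["x", "y"]}, "a")
```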

@davidwendt

This adds the libcudf part of #7157

```
std::unique_ptr<column> cudf::lists::count_elements(
  lists_column_view const& input,
  rmm::mr::device_memory_resource* mr);
```

Returns the size of each element in the input lists column. The PR also includes gtests for this new API. Authors: - David (@davidwendt) Approvers: - @nvdbaranec - AJ Schmidt (@ajschmidt8) - Karthikeyan (@karthikeyann) - Mark Harris (@harrism) URL: #7173
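
Since a lists column is stored as an offsets array plus a child column, the per-row sizes fall out of adjacent offsets. The following is a pure-Python sketch of that idea, not the libcudf code; the function below is a hypothetical stand-in, and returning null for null rows is an assumption.

```python
# Illustrative sketch only; a stand-in for the idea, not the libcudf API.
def count_elements(offsets, validity=None):
    """Row i spans child elements offsets[i]..offsets[i+1], so its size
    is the difference of adjacent offsets."""
    sizes = [offsets[i + 1] - offsets[i] for i in range(len(offsets) - 1)]
    if validity is None:
        return sizes
    # Assumption: a null row reports a null size rather than 0.
    return [size if valid else None for size, valid in zip(sizes, validity)]


# Offsets for the lists [[a, b], [], [c, d, e]].
sizes = count_elements([0, 2, 2, 5])
```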

@firestarman

This PR is to add Java interface for the new API '`explode`', along with its unit tests. This PR depends on the PR #7140 . Authors: - Liangcai Li (@firestarman) Approvers: - Jason Lowe (@jlowe) - Robert (Bobby) Evans (@revans2) URL: #7151

@isVoid

Closes #5038, also closes #7026 Using `sort=False` yields better `groupby` performance, this PR changes `groupby` API to refrain from sorting the group index by default. Besides, this PR updates docstring to address the performance diff when using `sort=False`. Authors: - Michael Wang (@isVoid) Approvers: - Keith Kraus (@kkraus14) - Ashwin Srinath (@shwina) URL: #7180

@devavret

Adds a level to JIT cache which segregates kernels compiled for different compute capabilities. Closes #5469 Authors: - Devavret Makkar (@devavret) Approvers: - Paul Taylor (@trxcllnt) - Mark Harris (@harrism) URL: #7090

@davidwendt

While working on improving the sort performance for strings columns in #7075, we tried a vector-load approach in the `string_view::compare()` function. This approach used some CUDA math intrinsic functions like `__funnelshift_r()` and `__byte_perm()`. Unfortunately, adding these to the `string_view` source would cause compile errors for some .cpp files. This is because `string_view.cuh` was being included by some .cpp files even though these only used the appropriate `__host__ __device__` functions. This PR separates the host/device functions from the device-only functions so the .cpp files can include `string_view.cuh` without processing the device-only definitions. The host/device functions are now defined in `string_view.cuh` directly and the device-only source is isolated in `string_view.inl`. The include of `string_view.inl` is then wrapped in a `#if CUDA_ARCH` so it will not be processed by a .cpp file compilation. Also, I attempted to minimize includes of `string_view.cuh` by removing it from `traits.hpp` and replacing it with a forward reference. This found a few files that were not including `string_view.cuh` directly as they should have. This also exposed `cpp/tests/utilities/scalar_utilities.cu`, which appears to be unused and was thus removed along with its header. No functionality has changed. Build times may be slightly faster since `string_view.cuh` is included in fewer source files and .cpp files no longer process `string_view.inl`. This means changing this file will also have slightly less impact on rebuilding libcudf. Authors: - David (@davidwendt) Approvers: - Keith Kraus (@kkraus14) - AJ Schmidt (@ajschmidt8) - Karthikeyan (@karthikeyann) - Jake Hemstad (@jrhemstad) URL: #7159

@davidwendt

This change is based on the changes in PR #7075. When `cudf::sort()` or `cudf::sorted_order()` is called with a `cudf::table_view` and specifies only a single strings column, we choose a fast-path sort algorithm with a simpler comparator specifically coded for string compares. The specialized code path was added to `cudf::sorted_order()`, which is called by the other libcudf sort functions. For example, `cudf::sort()` calls `cudf::sorted_order()` and then calls `cudf::gather()` on the input `cudf::table_view()` to materialize the results. The libcudf `sorted_order` feature has two APIs: `cudf::sorted_order()` and `cudf::stable_sorted_order()`, which internally use `thrust::sort()` and `thrust::stable_sort()` respectively. Each uses the `row_lexicographic_comparator` for managing the sort of multiple columns. A simpler comparator can be used in the case of a single column per the implementation in #7075. In this PR, I generalized this fast-path for other single column types. I found the same comparator from #7075, templated by type, could be used for speeding up sorting of any comparable type where a single column is specified. Further, there are some conditions with numeric types when a comparator is not required and where the `cub::DeviceRadixSort` functions can be used instead of thrust. The restrictions to account for when _not using a comparator_:
- the type must support an assignment operator as well as the compare operators (basically only numeric types)
- the column must not contain nulls since these are handled specially with a `null_order` parameter
- `thrust::sort()` and `thrust::stable_sort()` sort the input data in-place and do not support descending order
- `cub::DeviceRadixSort` does not sort in-place and does not have stable-sort but does have a descending order option

Here is how these are used in `cudf::detail::sorted_order<stable>()`, matching conditions with these restrictions.

| stable | nulls | numeric | ascending | function |
|:---:|:---:|:---:|:---:| --- |
| y | y | - | - | `thrust::stable_sort()` with comparator |
| y | - | n | - | `thrust::stable_sort()` with comparator |
| y | - | - | n | `thrust::stable_sort()` with comparator |
| y | n | y | y | `thrust::stable_sort_by_key()` with input column copied |
| n | y | - | - | `thrust::sort()` with comparator |
| n | - | n | - | `thrust::sort()` with comparator |
| n | n | y | y | `cub::DeviceRadixSort::SortPairs` with input column copied and output indices copied |
| n | n | y | n | `cub::DeviceRadixSort::SortPairsDescending` with input column copied and output indices copied |

The `sort_benchmarks.cu` was updated to include a non-nulls set of tests to show the speedups for the bottom half of the chart. The benchmark sorts integers in ascending order. With nulls, the sort is now 1.2x faster. With no nulls, the sort is about 14x faster. The faster speed comes at the expense of 2-3 times the memory required for the `thrust::stable_sort_by_key()` or `cub::DeviceRadixSort::SortPairs()` functions. The generalization using the new single-column comparator accounts for strings columns as well, so the strings-specific code for this has been removed in this PR. Authors: - David (@davidwendt) Approvers: - Jake Hemstad (@jrhemstad) - Karthikeyan (@karthikeyann) URL: #7167
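
The dispatch table can be restated as a small decision function. This Python sketch only mirrors the table's rows and returns the name of the path taken; `choose_sort_path` is a hypothetical helper and does not exist in libcudf.

```python
# Illustrative sketch only; mirrors the table above, not libcudf source.
def choose_sort_path(stable, has_nulls, numeric, ascending):
    """Return the name of the sort path the table selects for a single column."""
    needs_comparator = has_nulls or not numeric
    if stable:
        # The comparator-free stable path is only listed for ascending order.
        if needs_comparator or not ascending:
            return "thrust::stable_sort() with comparator"
        return "thrust::stable_sort_by_key()"
    if needs_comparator:
        return "thrust::sort() with comparator"
    if ascending:
        return "cub::DeviceRadixSort::SortPairs"
    return "cub::DeviceRadixSort::SortPairsDescending"
```

For example, an unstable descending sort of a non-null numeric column lands on `cub::DeviceRadixSort::SortPairsDescending`, while any column with nulls falls back to a comparator-based thrust sort.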

@mythrocks

Closes #6944. This commit adds a method (`contains()`) to check whether each row of a `LIST` column contains the scalar value specified as an argument. The operation returns a `BOOL8` column (with as many rows as the input `LIST`), each row indicating `true` if the value is found, `false` if not. Output `column[i]` is set to null if even one of the following holds true (in line with the semantics of `array_contains()` in SQL):
1. The search key `skey` is null
2. The list row `lists[i]` is null
3. The list row `lists[i]` contains even *one* null, *and* `lists[i]` does not contain the search key.

This implementation currently supports the operation on lists of numerics or strings. Authors: - MithunR (@mythrocks) Approvers: - AJ Schmidt (@ajschmidt8) - Mark Harris (@harrism) - David (@davidwendt) - Karthikeyan (@karthikeyann) URL: #7039
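
The three null rules can be sketched in plain Python, with `None` standing in for a null row, a null element, or a null search key. This only illustrates the stated semantics; `list_contains` is a hypothetical helper, not the libcudf function.

```python
# Illustrative sketch only; `list_contains` is a hypothetical helper.
def list_contains(lists, skey):
    """Per-row membership test following the three null rules above."""
    result = []
    for row in lists:
        if skey is None or row is None:
            result.append(None)   # rules 1 and 2
        elif skey in row:
            result.append(True)   # found, even if the row also holds nulls
        elif any(element is None for element in row):
            result.append(None)   # rule 3: not found, but a null could match
        else:
            result.append(False)
    return result


rows = [[1, 2], [3, None], [1, None], None, []]
found = list_contains(rows, 1)
```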

@vuule

…ary (#7179) Closes #6252 Fix the `end` parameter semantics to match the standard C++ library. Move `is_whitespace` and `trim_field_start_end` to parsing_utils and use in both CSV and JSON. Authors: - Vukasin Milovanovic (@vuule) Approvers: - Christopher Harris (@cwharris) - Conor Hoekstra (@codereport) URL: #7179

@rgsl888prabhu

This PR contains changes only pertaining to Parquet. Instead of an API function, a class is now used to control state and options, reducing the burden on the user. For more information, see #6911. These changes will break Java since the main API changed. Authors: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - Jason Lowe (@jlowe) Approvers: - Vukasin Milovanovic (@vuule) - Devavret Makkar (@devavret) - Robert (Bobby) Evans (@revans2) - @brandon-b-miller - David (@davidwendt) URL: #7058

@rgsl888prabhu

`return_filemetadata` was removed in a recent PR, but its removal from the benchmarks was missed. Authors: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) Approvers: - Conor Hoekstra (@codereport) - Christopher Harris (@cwharris) URL: #7214

@galipremsagar

…ting (#7216) This PR adds coverage for `skiprows` and `num_rows` parameters in parquet reader fuzz tests. Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - Vukasin Milovanovic (@vuule) - Keith Kraus (@kkraus14) URL: #7216

@davidwendt

Closes #7212 Reference #7167 (comment) Using radix sort for all fixed-width types causes an [error in Spark when floating point columns contain NaN elements](NVIDIA/spark-rapids#1585). This PR removes floating-point column types from the radix fast-path. This means the original `relational_compare` row operator is used to handle sorting floating point columns since they could possibly contain NaN elements. The `NANSorting` gtest included null elements so it did not catch the fast-path output discrepancy. This PR adds a `NANSortingNonNull` gtest to check for the desired NaN sorting behavior. Authors: - David (@davidwendt) Approvers: - Jake Hemstad (@jrhemstad) - Conor Hoekstra (@codereport) URL: #7215

@shwina

Adds static type checking to cuDF Python via MyPy. * An additional `mypy` style check is enabled in CI * `mypy` is run as part of the pre-commit hook * Many parts of the cuDF internal code now have type annotations * Any new internal code is expected to be written with type annotations (not public-facing APIs) Authors: - Ashwin Srinath (@shwina) Approvers: - Dillon Cullinan (@dillon-cullinan) - Keith Kraus (@kkraus14) - Christopher Harris (@cwharris) URL: #6381

@kuhushukla

Adds JNI and Java side bindings for `list_contains` that is being added as part of #7039. Authors: - Kuhu Shukla (@kuhushukla) Approvers: - Robert (Bobby) Evans (@revans2) - MithunR (@mythrocks) URL: #7125

@nvdbaranec

…lures (#7219) Fixes #7210 Fixes #6733 List of fixes included: - [x] Restore `null_count()` check in `expect_columns_equal` / `expect_columns_equivalent` - [x] Fix issue in `structs_column_view::get_sliced_child` - [x] Fix test failures in COPYING_TEST - [x] Fix test failures in STREAM_COMPACTION_TEST - [x] Fix test failures in RESHAPE_TEST Authors: - @nvdbaranec - Mark Harris (@harrism) Approvers: - Mark Harris (@harrism) - MithunR (@mythrocks) - Jake Hemstad (@jrhemstad) URL: #7219

@tgravescs

…7222) This adds the JNI layer needed to build up Arrow column vectors, which are just references to off-heap Arrow buffers, and then convert those into CUDF ColumnVectors by directly copying the Arrow data to the GPU. The way this works is that you create an ArrowColumnBuilder for each column you need. You call addBatch for each separate Arrow buffer you want to add into that column, and then you call buildAndPutOnDevice() on the Builder. That causes the Arrow pointer to be passed into CUDF, where an Arrow Table with 1 column is created; that Arrow table gets passed into `cudf::from_arrow`, which returns a CUDF Table, and we grab the 1 column from that and return it. Note this only supports primitive types and Strings for now. List, Struct, Dictionary, and Decimal are not supported yet. Signed-off-by: Thomas Graves <tgraves@nvidia.com> Authors: - Thomas Graves (@tgravescs) Approvers: - Robert (Bobby) Evans (@revans2) - Jason Lowe (@jlowe) URL: #7222

@isVoid

Closes #7174 This PR adds support for the `numeric_only` argument to `DataFrame.rank()` and `Series.rank()`. When the user specifies `numeric_only=True`, only the columns with numerical data types are selected to construct a cudf object, which is then passed to the lower level for processing. Two minor refactors are also included in this PR: - The internal API `Frame._get_columns_by_label` is refactored so that both `DataFrame` and `Series` can dispatch to it. - `test_rank.py` is refactored: the test functions inside class `TestRank` are moved out as top-level functions, and all test variables shared among test cases are moved to a `pytest.fixture` method. A `DataFrame.rank` test case that is expected to raise due to a [pandas bug](pandas-dev/pandas#32593) is now captured under `pytest.raises`. Authors: - Michael Wang (@isVoid) Approvers: - Ashwin Srinath (@shwina) - @brandon-b-miller URL: #7213

@kuhushukla

#7125 added a test column vector leak. This PR fixes this minor leak. Authors: - Kuhu Shukla (@kuhushukla) Approvers: - Jason Lowe (@jlowe) - Thomas Graves (@tgravescs) URL: #7238

@revans2

This fixes some bugs in the java support for decimal scalar values. They are fairly minor but prevented me from doing some debugging earlier, and could impact tests in the future. Authors: - Robert (Bobby) Evans (@revans2) Approvers: - Jason Lowe (@jlowe) URL: #7237

@tgravescs

Fixes leaks found in the ArrowColumnVectorTest. Signed-off-by: Thomas Graves <tgraves@nvidia.com> Authors: - Thomas Graves (@tgravescs) Approvers: - Robert (Bobby) Evans (@revans2) - Jason Lowe (@jlowe) URL: #7241

@davidwendt

Reference #5963 Add support for dictionary column to `cudf::rolling_window` (non-udf) Rolling aggregations - [x] min/max - [x] lead/lag - [x] counting, row-number These only require aggregating the dictionary indices and do not need to access the keys. Authors: - David (@davidwendt) Approvers: - Mark Harris (@harrism) - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) URL: #7186
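The claim that min/max need only the indices can be illustrated in plain Python. This is a minimal sketch, not libcudf code: it assumes the dictionary keys are stored sorted and unique, so that ordering the indices is the same as ordering the keys. `rolling_min_indices` is a hypothetical helper, and the window convention (where `preceding` counts the current row) is also an assumption.

```python
def rolling_min_indices(indices, preceding, following):
    """Rolling minimum computed on dictionary indices only (hypothetical helper).

    The window for row i spans rows [i - preceding + 1, i + following],
    clamped to the column bounds, so `preceding` includes the current row.
    """
    n = len(indices)
    out = []
    for i in range(n):
        lo = max(0, i - preceding + 1)
        hi = min(n, i + following + 1)
        out.append(min(indices[lo:hi]))
    return out


keys = ["apple", "banana", "cherry"]  # assumed sorted and unique
indices = [2, 0, 1, 2, 1]             # encodes cherry, apple, banana, cherry, banana

min_idx = rolling_min_indices(indices, preceding=2, following=0)
# The keys are read only to display the result, never during aggregation.
decoded = [keys[i] for i in min_idx]
```

Because the keys are assumed sorted, the smallest index in a window always maps to the smallest key in that window, so the result can reuse the input's keys unchanged.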

@codereport

…nd `decimal64` (#7198) This resolves a part of #7132 **ToDo:** * [x] Simple unit test * [x] Comprehensive unit tests * [x] Initial Column + Column * [x] Full Column + Column * [x] Column + Scalar * [x] Scalar + Column * [x] Cleanup Authors: - Conor Hoekstra (@codereport) Approvers: - Mark Harris (@harrism) - @nvdbaranec URL: #7198

@ChrisJar

This replaces `cudaMemcpyAsync(hostdevice_vector)` with `hostdevice_vector.device_to_host()` or `hostdevice_vector.host_to_device()` when appropriate. Issue #6538 Authors: - @ChrisJar Approvers: - Karthikeyan (@karthikeyann) - Vukasin Milovanovic (@vuule) URL: #7035

@shwina

Fixes #7221 and adds improvements to `loc` with a MultiIndex.

* Previously, `loc` on a `Series` with a `MultiIndex` would fail. For example:

```python
In [7]: sr
Out[7]:
n_workers  type
1          fit        1
2          load       2
3          predict    3
Name: x, dtype: int64

In [8]: sr.loc[(1, "fit")]  # KeyError
```

* Previously, `loc` on a `DataFrame` with a `MultiIndex` would fail when a slice without `start` or `end` was used. For example:

```python
In [3]: df
Out[3]:
                   x
n_workers type
1         fit      1
2         load     2
3         predict  3

In [4]: df.loc[:(2, "load")]  # TypeError
```

Both the above issues have been addressed and tests added.

Authors: - Ashwin Srinath (@shwina) Approvers: - Keith Kraus (@kkraus14) - Michael Wang (@isVoid) - GALI PREM SAGAR (@galipremsagar) - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) URL: #7243

@mythrocks

Closes #7133. This is an implementation of the `COLLECT` aggregation in the context of rolling window functions. This enables the collection of rows (of type `T`) within specified window boundaries into a list column (containing elements of type `T`). In this context, one list row is generated per input row. For example, consider the following input:

```c++
auto input_col = fixed_width_column_wrapper<int32_t>{70, 71, 72, 73, 74};
```

Calling `rolling_window()` with `preceding=2`, `following=1`, `min_periods=1` produces the following:

```c++
auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
// == [ [70,71], [70,71,72], [71,72,73], [72,73,74], [73,74] ]
```

`COLLECT` is supported with `rolling_window()`, `grouped_rolling_window()`, and `grouped_time_range_rolling_window()`, across primitive types and arbitrarily nested lists and structs. `min_periods` is also honoured: if the number of observations is fewer than `min_periods`, the resulting list row is null.

Authors: - MithunR (@mythrocks) Approvers: - Keith Kraus (@kkraus14) - Vukasin Milovanovic (@vuule) - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) URL: #7189
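The window semantics in the example above can be reproduced with a small pure-Python sketch. This is not the libcudf implementation; `rolling_collect` is a hypothetical helper, and it assumes (consistent with the output shown) that `preceding` counts the current row and that the input contains no nulls.

```python
def rolling_collect(values, preceding, following, min_periods):
    """Collect each row's window into a list (hypothetical helper).

    Row i sees rows [i - preceding + 1, i + following], clamped to the
    column bounds. A window with fewer than `min_periods` observations
    yields None, standing in for a null list row.
    """
    n = len(values)
    result = []
    for i in range(n):
        lo = max(0, i - preceding + 1)
        hi = min(n, i + following + 1)
        window = values[lo:hi]
        result.append(window if len(window) >= min_periods else None)
    return result


rows = rolling_collect([70, 71, 72, 73, 74], preceding=2, following=1, min_periods=1)
# rows == [[70, 71], [70, 71, 72], [71, 72, 73], [72, 73, 74], [73, 74]]

# Raising min_periods to 3 nulls out the two truncated edge windows.
strict_rows = rolling_collect([70, 71, 72, 73, 74], preceding=2, following=1, min_periods=3)
```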

@vuule

Resolves: #6263 This PR introduces changes which will enable generation of random list columns in datagenerator which will be used as part of fuzz tests. cc: @vuule Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Vukasin Milovanovic (@vuule) - @brandon-b-miller URL: #7064

@galipremsagar

Fixes: #7206 The `replace` API has two overloaded parameters, `to_replace` and `value`. Each accepts several types of input, and each combination of input types behaves differently. These changes introduce a clear code flow for each possible parameter combination. This should make it easier to support new parameters in the future, such as `regex` and nested dict types, which would change the behaviour of the `to_replace` and `value` parameters. - [x] Ensure all combinations are covered for `to_replace` & `value` for both `DataFrame.replace` & `Series.replace`. - [x] Document changes inline & update func docs. - [x] Add tests to include coverage for all combinations that are not yet covered. Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Keith Kraus (@kkraus14) - @brandon-b-miller URL: #7207

@brandon-b-miller

…es (#7209) Fixes #6892 Defines the desired behavior for an implicit merge of two possibly differing categorical variables, or one categorical variable and one non-categorical variable, as a function of the dtypes and the merge configuration. The desired behavior is defined through the tests and then implemented in `casting_logic.py`. Authors: - @brandon-b-miller - Keith Kraus (@kkraus14) Approvers: - GALI PREM SAGAR (@galipremsagar) - Keith Kraus (@kkraus14) URL: #7209

@devavret

Only top level columns can be selected by name Fixes #7229 Authors: - Devavret Makkar (@devavret) Approvers: - Karthikeyan (@karthikeyann) - Vukasin Milovanovic (@vuule) - @nvdbaranec - Keith Kraus (@kkraus14) URL: #7248

@davidwendt

PR #7215 removed single floating point columns from radix sort fast-path but missed disabling the fast-path sort for floating-point in `cudf::sort()`. This PR fixes `cudf::sort` and adds a new test to the existing `RowOperatorTestForNAN.NANSortingNonNull` gtest. Authors: - David (@davidwendt) Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - Conor Hoekstra (@codereport) URL: #7250

@harrism

Adds a new developer guide for libcudf. This is based on the existing libcudf++ transition guide. Fixes #5273 TODO - [x] Description of `dictionary_column_wrapper` and `fixed_point_column_wrapper` - [x] Benchmarking Section (put in a new file, Benchmarking.md)? - [x] Better discussion of nested types - [x] Introductory section on data types - [x] Consider splitting into multiple documents: DEVELOPER_GUIDE.md, TESTING.md, BENCHMARKING.md? - [x] Placeholder for cuIO? - [x] Add section on code and documentation style and formatting Authors: - Mark Harris (@harrism) - Jake Hemstad (@jrhemstad) Approvers: - @nvdbaranec - Conor Hoekstra (@codereport) - Jake Hemstad (@jrhemstad) - David (@davidwendt) URL: #6977

@shwina

NumPy 1.20 is [typed](https://numpy.org/devdocs/release/1.20.0-notes.html#numpy-is-now-typed), which exposed a few typing errors in cuDF that this PR addresses. Authors: - Ashwin Srinath (@shwina) Approvers: - Keith Kraus (@kkraus14) - GALI PREM SAGAR (@galipremsagar) - AJ Schmidt (@ajschmidt8) URL: #7279

@ajschmidt8

This PR prepares the changelog to be automatically updated during releases. Authors: - AJ Schmidt (@ajschmidt8) Approvers: - Keith Kraus (@kkraus14) URL: #7272

@galipremsagar

Fixes: #6963 This PR introduces a "Working with missing data" doc page that outlines how to work with missing data in cudf. The behavior shown in #6963 is correct, because cudf treats `NaT` as `<NA>` values. The page therefore highlights how the behavior of `NaT` in datetime/timedelta values differs between pandas and cudf. Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) URL: #7010

@nvdbaranec

…format. (#7096) Addresses #3793 Depends on #6864 (This affects contiguous_split.cu. For the purposes of this PR, the only changes that are relevant are those that involve the generation of metadata)

- `pack()` performs a `contiguous_split()` on the incoming table to arrange the memory into a unified device buffer, and generates a host-side metadata buffer. These are returned in the `packed_columns` struct.
- `unpack()` takes the data stored in the `packed_columns` struct and returns a deserialized `table_view` that points into it.

The intent of this functionality is as follows (pseudocode):

```
// serialize-side
table_view t;
packed_columns p = pack(t);
send_over_network(p.gpu_data);
send_over_network(p.metadata);

// deserialize-side
packed_columns p = receive_from_network();
table_view t = unpack(p);
```

This PR also renames `contiguous_split_result` to `packed_table` (which is just a bundled `table_view` and `packed_columns`).

Authors: - @nvdbaranec Approvers: - Jake Hemstad (@jrhemstad) - Paul Taylor (@trxcllnt) - Mike Wilson (@hyperbolic2346) URL: #7096
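As a CPU-only analogy for the pack/unpack idea (one contiguous data buffer plus a small metadata buffer, with unpacking returning non-owning views), here is a minimal Python sketch. It does not reflect libcudf's actual metadata format. The dict-of-int64-lists table, the JSON metadata layout, and the function bodies are all assumptions made for illustration.

```python
import json
from array import array


def pack(table):
    """Copy every column into one contiguous buffer and describe it in metadata."""
    data = bytearray()
    metadata = []
    for name, values in table.items():
        offset = len(data)
        data += array("q", values).tobytes()  # int64 values, native byte order
        metadata.append({"name": name, "offset": offset, "size": len(values)})
    return json.dumps(metadata).encode(), bytes(data)


def unpack(metadata, data):
    """Rebuild column views that point into `data` without copying it."""
    view = memoryview(data)
    table = {}
    for col in json.loads(metadata):
        start = col["offset"]
        stop = start + 8 * col["size"]  # 8 bytes per int64 element
        table[col["name"]] = view[start:stop].cast("q")
    return table


meta, buf = pack({"a": [1, 2, 3], "b": [10, 20]})
# `meta` and `buf` are plain bytes, so they could be sent separately.
restored = unpack(meta, buf)
```

As in the description above, the unpacked columns are views into the received buffer, so the buffer must outlive them.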

@mythrocks

Fixes #7265. `cudf::detail::get_num_child_rows()` is currently defined in `cudf/lists/detail/utilities.cuh`. The build pipelines for #7189 are fine, but there seem to be build failures in dependent projects such as `spark-rapids`:

```
[2021-01-31T08:12:10.611Z] /.../workspace/spark/cudf18_nightly/cpp/include/cudf/lists/detail/utilities.cuh:31:18: error: 'cudf::size_type cudf::detail::get_num_child_rows(const cudf::column_view&, rmm::cuda_stream_view)' defined but not used [-Werror=unused-function]
[2021-01-31T08:12:10.611Z] static cudf::size_type get_num_child_rows(cudf::column_view const& list_offsets,
[2021-01-31T08:12:10.611Z] ^~~~~~~~~~~~~~~~~~
[2021-01-31T08:12:11.981Z] cc1plus: all warnings being treated as errors
[2021-01-31T08:12:12.238Z] make[2]: *** [CMakeFiles/cudf_hash.dir/build.make:82: CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o] Error 1
[2021-01-31T08:12:12.238Z] make[1]: *** [CMakeFiles/Makefile2:220: CMakeFiles/cudf_hash.dir/all] Error 2
```

In any case, it is less than ideal for the function to be completely defined in the header, especially given that the likes of `hashing.cu` are exposed to it (by way of `scatter.cuh`). This commit moves the function definition to a separate translation unit, without changing implementation or interface.

Authors: - MithunR (@mythrocks) Approvers: - @nvdbaranec - Mike Wilson (@hyperbolic2346) - David (@davidwendt) URL: #7266

@karthikeyann

Addresses part of #6541: segmented sort of lists - [x] lists_column_view segmented_sort - [x] numerical types (cub segmented sort limitation) - [x] sort_lists(table_view) - [x] unit tests. Closes #4603: segmented sort - [x] segmented_sort - [x] unit tests. Authors: - Karthikeyan (@karthikeyann) Approvers: - AJ Schmidt (@ajschmidt8) - Keith Kraus (@kkraus14) - Jake Hemstad (@jrhemstad) - Conor Hoekstra (@codereport) URL: #7122

@vuule

…#7261) Issue #6763 Authors: - Vukasin Milovanovic (@vuule) Approvers: - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - @nvdbaranec - GALI PREM SAGAR (@galipremsagar) - Keith Kraus (@kkraus14) URL: #7261

@jlowe

This PR requires the libcudf changes in #7096, fixing the Java bindings to `contiguous_split` that are broken by that change. This also adds the ability to create a `ContiguousTable` instance without manifesting a `Table` instance and all `ColumnVector` instances underneath it which should prove useful during Spark's shuffle. Authors: - Jason Lowe (@jlowe) Approvers: - Robert (Bobby) Evans (@revans2) - Alessandro Bellina (@abellina) URL: #7127

@rongou

Turns out we need version > 5.4 of the junit jupiter engine to support `@TempDir`. - Changed the file mode to match Spark's disk manager. - Changed to use `fstat` to get the file length when appending. - Add tests for when a file already exists. Authors: - Rong Ou (@rongou) Approvers: - Jason Lowe (@jlowe) - Robert (Bobby) Evans (@revans2) URL: #7296

@isVoid

Closes #7199 Refactors scalar handling inside `assert_eq`. At a high level, this PR proposes "whitelist"-style testing: all comparisons go to the "strict equal" code path unless explicitly allowed. This lets the test system capture every unintended inequality except the ones that have been explicitly agreed upon. For example, this PR creates two whitelist items: - If the operands override `__eq__`, use it to determine equality. - If the operands are of floating type, assert approximate equality. For all other cases, the operands should be strictly equal. Note that for testing purposes, `np.nan` is considered equal to itself. Authors: - Michael Wang (@isVoid) Approvers: - GALI PREM SAGAR (@galipremsagar) - @brandon-b-miller URL: #7220
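One possible reading of the "strict unless explicitly allowed" idea is sketched below in plain Python. This is not the code in `assert_eq`. `scalars_equal` is a hypothetical helper, the tolerance is arbitrary, and treating "strict" as "same type and `==`" is an assumption. Since Python's `==` already dispatches to an overridden `__eq__`, that whitelist item is covered by the final branch.

```python
import math


def scalars_equal(left, right, rel_tol=1e-9):
    """Hypothetical scalar comparison: strict by default, with listed exceptions."""
    if isinstance(left, float) and isinstance(right, float):
        # Allowed exception: NaN is considered equal to itself in tests.
        if math.isnan(left) and math.isnan(right):
            return True
        # Allowed exception: floating-point operands compare approximately.
        return math.isclose(left, right, rel_tol=rel_tol)
    # Default path: operands must have the same type and compare equal.
    # `==` calls an overridden __eq__ when the operands define one.
    return type(left) is type(right) and bool(left == right)


nan_matches = scalars_equal(float("nan"), float("nan"))
approx_matches = scalars_equal(0.1 + 0.2, 0.3)
mixed_types_match = scalars_equal(1, 1.0)
```

Keeping the exceptions few and explicit means a new, unplanned mismatch (such as an `int` silently becoming a `float`) fails the test instead of slipping through.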

@galipremsagar

This PR prepares the changelog to be automatically updated during releases. Authors: - GALI PREM SAGAR (@galipremsagar) Approvers: - Keith Kraus (@kkraus14) - AJ Schmidt (@ajschmidt8) URL: #7309

@kaatish

Closes #6893, closes #6894. Contributes to #5682. Replaces uses of `stats_column_desc` members in the Parquet writer with `column_device_view` members. Authors: - Kumar Aatish (@kaatish) Approvers: - David (@davidwendt) - Vukasin Milovanovic (@vuule) - Devavret Makkar (@devavret) URL: #7097

@shwina

Fixes #7249 Copies dtype metadata after calling `ColumnBase.copy()`. Moves logic for copying dtype metadata after calling libcudf functions from `Frame` to `ColumnBase`. Authors: - Ashwin Srinath (@shwina) Approvers: - Keith Kraus (@kkraus14) - GALI PREM SAGAR (@galipremsagar) URL: #7271

@isVoid

#7256) Small PR to provide two fixes: - Use `rmm::device_uvector` in place of `device_vector` to improve efficiency. This is a scratch space, so supplied stream and default memory resource is used. Part of #5380 - Update `sort_helper::grouped_value` docstring to reflect change after use of stable sort. Authors: - Michael Wang (@isVoid) Approvers: - Vukasin Milovanovic (@vuule) - Ram (Ramakrishna Prabhu) (@rgsl888prabhu) - Mark Harris (@harrism) URL: #7256

@vuule

Use a buffer for output in the newly added ORC test. Authors: - Vukasin Milovanovic (@vuule) Approvers: - GALI PREM SAGAR (@galipremsagar) URL: #7313

@firestarman

Add unit tests for the 'collect' aggregation with windowing. This PR depends on PR #7189. Signed-off-by: Liangcai Li <liangcail@nvidia.com> Authors: - Liangcai Li (@firestarman) Approvers: - MithunR (@mythrocks) - Robert (Bobby) Evans (@revans2) URL: #7121

@adelevie

change: on -> one I read the contributing guidelines, but since this is just a documentation fix, I'm not sure which apply. Great library, I just got started using it. A little rough around the edges, but great so far, and well worth some of the added steps. Authors: - Alan deLevie (@adelevie) - AJ Schmidt (@ajschmidt8) Approvers: - GALI PREM SAGAR (@galipremsagar) - Keith Kraus (@kkraus14) - Michael Wang (@isVoid) - Ray Douglass (@raydouglass) URL: #7253

@davidwendt

Returning a unique pointer using `std::move` causes a compile error for gcc 9 and above. Simple fix to remove the incorrect move semantic in `segmented_sort.cu` `get_segment_indices`. Authors: - David (@davidwendt) Approvers: - Karthikeyan (@karthikeyann) - Devavret Makkar (@devavret) URL: #7319

@galipremsagar

Constructing a DataFrame from a ColumnAccessor previously had unintended side-effects:

```python
In [1]: import cudf

In [2]: a = cudf.DataFrame({'a': [1, 2, 3]})

In [3]: a._data['a'].__cuda_array_interface__
Out[3]:
{'shape': (3,),
 'strides': (8,),
 'typestr': '<i8',
 'data': (140409137266688, False),
 'version': 1}

In [4]: a[['a']]
Out[4]:
   a
0  1
1  2
2  3

In [5]: a._data['a'].__cuda_array_interface__
Out[5]:
{'shape': (3,),
 'strides': (8,),
 'typestr': '<i8',
 'data': (140409137267200, False),
 'version': 1}
```

In a discussion with @galipremsagar, we decided that it's probably best not to handle `ColumnAccessor` in the frame constructors.

* Remove special handling of `ColumnAccessor` in `Frame` constructors
* Collapse `Series.copy()` and `DataFrame.copy()` into a single `Frame.copy()`

Authors: - Ashwin Srinath (@shwina) - GALI PREM SAGAR (@galipremsagar) Approvers: - GALI PREM SAGAR (@galipremsagar) URL: #7298

@isVoid

Closes #7246 This PR fixes a bug in `DataFrame.iloc`. When the slice provided to `iloc` is decrementing and terminates at the position before index zero, such as `slice(2, -1, -1)` or `slice(4, None, -1)`, the terminal position was still wrapped around. `Frame._slice` is moved to `DataFrame._slice` to resolve a typing issue. Authors: - Michael Wang (@isVoid) Approvers: - Keith Kraus (@kkraus14) - GALI PREM SAGAR (@galipremsagar) URL: #7277
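The wrap-around pitfall behind this kind of bug can be shown with built-in Python slices. This sketch illustrates the general hazard on a plain list and is not cuDF's code: once `slice.indices` normalizes a decrementing slice, a stop of `-1` means "one position before row 0", but passing that value back into ordinary slicing re-reads it as "the last row".

```python
data = [10, 11, 12, 13, 14]

requested = slice(4, None, -1)                    # from row 4 down to the start
start, stop, step = requested.indices(len(data))  # normalized to (4, -1, -1)

# Correct: treat the normalized stop as a literal position before row 0.
rows = [data[i] for i in range(start, stop, step)]

# Pitfall: slicing with the normalized values re-interprets stop=-1 as
# "the last row", so the selection comes back empty.
wrapped = data[start:stop:step]
```

Here `rows` walks the list in reverse as intended, while `wrapped` is empty because the stop position was wrapped to the end of the list.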

@ChrisJar

This updates the "10 minutes to cuDF and CuPy" notebook to use the new methods for moving between cuDF data structures and CuPy arrays. Closes #7160 Authors: - @ChrisJar Approvers: - Ashwin Srinath (@shwina) URL: #7158

@shwina

Closes #7311 Authors: - Ashwin Srinath (@shwina) Approvers: - Keith Kraus (@kkraus14) - AJ Schmidt (@ajschmidt8) URL: #7318

@jolorunyomi

This PR adds the GitHub action [PR Labeler](https://github.com/actions/labeler) to auto-label PRs based on their content. Labeling is managed with a configuration file `.github/labeler.yml` using the following [options](https://github.com/actions/labeler#usage). Authors: - Joseph (@jolorunyomi) - Mike Wendt (@mike-wendt) Approvers: - AJ Schmidt (@ajschmidt8) - Keith Kraus (@kkraus14) - Mike Wendt (@mike-wendt) URL: #7044

@shwina

Authors: - Ashwin Srinath (@shwina) Approvers: - Keith Kraus (@kkraus14) - @jakirkham - Ray Douglass (@raydouglass) URL: #7335

@dillon-cullinan

Issues and PRs without activity for 30d will be marked as stale. If there is no activity for 90d, they will be marked as rotten. Authors: - Jordan Jacobelli (@Ethyling) Approvers: - Dillon Cullinan (@dillon-cullinan) URL: #7388

@mike-wendt

Follows #7388 Updates the stale GHA with the following changes: - [x] Uses `inactive-30d` and `inactive-90d` labels instead of `stale` and `rotten` - [x] Updates comments to reflect changes in labels - [x] Exempts the following labels from being marked `inactive-30d` or `inactive-90d` - `0 - Blocked` - `0 - Backlog` - `good first issue` Authors: - Mike Wendt (@mike-wendt) Approvers: - Keith Kraus (@kkraus14) - Ray Douglass (@raydouglass) URL: #7395

Commits on Nov 24, 2020

DOC v0.18 Updates

ajschmidt8 committed Nov 24, 2020

Commit 80464ce

Commits on Feb 24, 2021

update changelog

raydouglass committed Feb 24, 2021

Commit 1544474
