[RELEASE] cudf v0.18 #7405

Merged
merged 204 commits into from
Feb 24, 2021

Commits on Nov 24, 2020

  1. DOC v0.18 Updates

    ajschmidt8 committed Nov 24, 2020
    80464ce

Commits on Nov 30, 2020

  1. Add a cmake option to link to GDS/cuFile (#6847)

    Add a cmake find module to locate cuFile. If found, add the include directory and link to the shared library.
    
    This shouldn't have any effect if cuFile is not installed locally.
    rongou authored Nov 30, 2020
    0e94bab

Commits on Dec 1, 2020

  1. Merge pull request #6866 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 1, 2020
    2ed7e13
  2. Merge pull request #6867 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 1, 2020
    a091304

Commits on Dec 2, 2020

  1. Merge pull request #6874 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 2, 2020
    c0e03d6
  2. Merge pull request #6876 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 2, 2020
    018d036
  3. Merge pull request #6877 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 2, 2020
    7aa3863
  4. Merge pull request #6878 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 2, 2020
    36c03a5
  5. Merge pull request #6879 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 2, 2020
    48adcc0
  6. Merge pull request #6880 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 2, 2020
    36d5205

Commits on Dec 3, 2020

  1. 536d23a
  2. Merge pull request #6890 from kkraus14/fix_automerge

    Keith Kraus authored Dec 3, 2020
    737e715
  3. Merge pull request #6896 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 3, 2020
    3d80bb8

Commits on Dec 4, 2020

  1. Merge pull request #6900 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 4, 2020
    c6f39b1
  2. Merge pull request #6904 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 4, 2020
    009c307
  3. Merge pull request #6906 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 4, 2020
    dd6cf15
  4. Merge pull request #6910 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 4, 2020
    8c8e05f
  5. Merge pull request #6913 from rapidsai/branch-0.17

    [gpuCI] Auto-merge branch-0.17 to branch-0.18 [skip ci]
    GPUtester authored Dec 4, 2020
    522103d

Commits on Dec 6, 2020

  1. Implement DataFrame.quantile for datetime and timedelta data types(#6902)
    
    This implements the `numeric_only` argument for `DataFrame.quantile`, meaning that it now works on `datetime` and `timedelta` data. However, because `DataFrame.iloc` behaves differently in Pandas and cuDF, this implementation returns a DataFrame when `numeric_only=False` even in cases where Pandas returns a Series.
    
    Passes tests locally
    
    This closes #6799
    
    Authors:
      - Chris Jarrett <cjarrett@dt08.aselab.nvidia.com>
      - ChrisJar <chris.jarrett.0@gmail.com>
    
    Approvers:
      - Keith Kraus
    
    URL: #6902
    ChrisJar authored Dec 6, 2020
    214dccc
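
As a rough illustration of why quantiles are well defined for `datetime` data, the sketch below computes a linearly interpolated quantile over Python `datetime` values by working in integer microseconds since an epoch. It uses only the standard library and is not cuDF's implementation; the helper name `datetime_quantile` and the choice of epoch are assumptions made for the example.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def datetime_quantile(values, q):
    """Linear-interpolation quantile of naive datetimes (hypothetical helper)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    # Work in integer microseconds so interpolation is ordinary arithmetic.
    ticks = sorted((v - EPOCH) // timedelta(microseconds=1) for v in values)
    pos = q * (len(ticks) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ticks) - 1)
    frac = pos - lo
    interpolated = ticks[lo] + (ticks[hi] - ticks[lo]) * frac
    return EPOCH + timedelta(microseconds=round(interpolated))


days = [datetime(2020, 12, 1), datetime(2020, 12, 3), datetime(2020, 12, 7)]
median = datetime_quantile(days, 0.5)
```

The same arithmetic applies to `timedelta` values, which are already durations and need no epoch.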

Commits on Dec 7, 2020

  1. Fix rmm_mode=managed parameter for gtests(#6912)

    When the `--rmm_mode=managed` parameter is passed to gtests, an `Invalid RMM allocation mode: managed` exception is thrown.
    The logic in `include/cudf_test/base_fixture.hpp` is just missing a return statement.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Paul Taylor
      - Mark Harris
    
    URL: #6912
    davidwendt authored Dec 7, 2020
    598a14d

Commits on Dec 8, 2020

  1. Add Index.set_names api(#6929)

    Resolves: #6870 
    
    This PR adds support for `set_names` API in both `Index` & `MultiIndex`.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
      - GALI PREM SAGAR <sagarprem75@gmail.com>
    
    Approvers:
      - Keith Kraus
    
    URL: #6929
    galipremsagar authored Dec 8, 2020
    917759b
  2. Fix columns & index handling in dataframe constructor(#6838)

    Fixes: #6821 
    
    This PR fixes an issue where `columns` and `index` are not handled correctly in specific scenarios.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
      - GALI PREM SAGAR <sagarprem75@gmail.com>
    
    Approvers:
      - Richard (Rick) Zamora
      - Ashwin Srinath
    
    URL: #6838
    galipremsagar authored Dec 8, 2020
    f6b16ab
  3. 9120992
  4. Update to official libcu++ on Github(#6275)

    Update to libcu++ on Github.
    
    Authors:
      - ptaylor <paul.e.taylor@me.com>
      - Paul Taylor <paul.e.taylor@me.com>
    
    Approvers:
      - Mark Harris
      - Keith Kraus
      - Christopher Harris
      - Mark Harris
    
    URL: #6275
    trxcllnt authored Dec 8, 2020
    78f9789
  5. Remove **kwargs from string/categorical methods(#6750)

    This PR removes `**kwargs` from the string/categorical accessors where unnecessary, and exposes keyword arguments like `inplace` to the user directly.
    
    If we want to maintain parity with Pandas APIs for Dask/others using cuDF internally, we can consider using the approach described in #6135, which will automatically raise `NotImplementedError` when unsupported kwargs are passed.
    
    Authors:
      - Ashwin Srinath <shwina@users.noreply.github.com>
    
    Approvers:
      - GALI PREM SAGAR
      - Keith Kraus
      - Keith Kraus
    
    URL: #6750
    shwina authored Dec 8, 2020
    8a1a6d7
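
The difference between swallowing `**kwargs` and exposing keyword arguments explicitly can be sketched in plain Python. The `Accessor` class and its methods below are hypothetical and are not cuDF code. They only illustrate that an explicit signature rejects a misspelled or unsupported keyword with a `TypeError` at the call site, while a `**kwargs` signature silently ignores it.

```python
class Accessor:
    """Hypothetical accessor, used only to contrast the two signatures."""

    def __init__(self, values):
        self.values = list(values)

    def upper_kwargs(self, **kwargs):
        # Old style: unknown keywords are silently accepted and ignored.
        inplace = kwargs.get("inplace", False)
        return self._apply(inplace)

    def upper_explicit(self, inplace=False):
        # New style: the supported keyword is visible in the signature.
        return self._apply(inplace)

    def _apply(self, inplace):
        result = [v.upper() for v in self.values]
        if inplace:
            self.values = result
            return None
        return result


acc = Accessor(["a", "b"])
silently_ignored = acc.upper_kwargs(inplce=True)  # the typo goes unnoticed
try:
    acc.upper_explicit(inplce=True)
    typo_detected = False
except TypeError:
    typo_detected = True
```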

Commits on Dec 9, 2020

  1. Fix N/A detection for empty fields in CSV reader(#6922)

    Fixes #6682, #6680 
    
    Currently, empty fields are treated as N/A regardless of parsing options. However, the desired behavior is to handle empty fields the same way as fields with special values (apply the default_na_values and na_filter logic).
    This PR irons out the behavior so it matches Pandas in this regard.
    
    - Tries now support matching empty strings.
    - The list of special NA values is now generated more robustly, so it has correct elements in any parameter combination.
    - Empty string is added to the list of special NA values.
    - Empty quoted string ("\"\"") is added to the NA value list if the empty string ("") is included (mirrors Pandas behavior).
    - Added tests for previously failing parameter combinations.
    - Reworked some of the tests to check against Pandas results instead of assumed desired behavior.
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
      - vuule <vukasin.milovanovic.87@gmail.com>
      - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>
      - Vukasin Milovanovic <vmilovanovic@nvidia.com>
    
    Approvers:
      - Ram (Ramakrishna Prabhu)
      - Christopher Harris
      - Keith Kraus
    
    URL: #6922
    vuule authored Dec 9, 2020
    17c8f97
  2. fix libcu++ include path for jni(#6948)

    The include directory was renamed from `simt` to `cuda`.
    
    Authors:
      - Rong Ou <rong.ou@gmail.com>
    
    Approvers:
      - Jason Lowe
    
    URL: #6948
    rongou authored Dec 9, 2020
    83b1851
  3. Fix cudf::merge gtest for dictionary columns(#6942)

    The `cudf::merge` API expects the key columns to be sorted. This means that if null rows are included, these null entries should all appear either at the beginning or at the end of the column, depending on the null_order for the sort. The `MergeDictionaryTest.WithNull` gtest placed null rows in the middle of the column. The expected results should also have included null entries at the beginning or the end.
    
    This PR also includes an extra test for checking merge results are consistent with the sort parameters `cudf::order` and `cudf::null_order`. This test also includes a larger number of rows to ensure `thrust::merge` requires more than one tile/block in its runtime logic.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Ram (Ramakrishna Prabhu)
      - Vukasin Milovanovic
    
    URL: #6942
    davidwendt authored Dec 9, 2020
    b45fd4d
  4. Update Java bindings version to 0.18-SNAPSHOT(#6949)

    Updating the Java bindings package version to match the libcudf version.
    
    Authors:
      - Jason Lowe <jlowe@nvidia.com>
    
    Approvers:
      - Robert (Bobby) Evans
    
    URL: #6949
    jlowe authored Dec 9, 2020
    44eeb70
  5. 6d230ee
  6. Add in basic support to JNI for logical_cast(#6954)

    This exposes `logical_cast` through a JNI API. I also updated some of the test code to take a ColumnView instead of a ColumnVector so I could test it more easily.
    
    Authors:
      - Robert (Bobby) Evans <bobby@apache.org>
    
    Approvers:
      - Jason Lowe
    
    URL: #6954
    revans2 authored Dec 9, 2020
    a301e65
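
For the CSV reader change (#6922) listed earlier under Dec 9, the way the set of N/A strings depends on the parsing options can be sketched as follows. This is a minimal standard-library sketch, not the libcudf code. The option names follow the Pandas `read_csv` parameters (`na_values`, `keep_default_na`, `na_filter`), the default list is an abbreviated stand-in, and the helper names are hypothetical.

```python
# Abbreviated stand-in for the default NA list; the real list is longer.
DEFAULT_NA_VALUES = {"", "NA", "N/A", "NaN", "null"}


def build_na_values(na_values=(), keep_default_na=True, na_filter=True):
    """Assemble the set of field strings treated as N/A (hypothetical helper)."""
    if not na_filter:
        return set()  # NA detection is disabled entirely
    result = set(na_values)
    if keep_default_na:
        result |= DEFAULT_NA_VALUES
    if "" in result:
        # A quoted empty field is treated the same as an empty field.
        result.add('""')
    return result


def is_na(field, na_set):
    """An empty field is N/A only if the empty string is in the NA set."""
    return field in na_set


custom_only = build_na_values(["missing"], keep_default_na=False)
```

With `custom_only`, an empty field is no longer reported as N/A, which is the case the commit message says was previously handled incorrectly.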

Commits on Dec 10, 2020

  1. Use simplified rmm::exec_policy (#6939)

    Updates libcudf to use the new, simplified `rmm::exec_policy` and to include the new refactored headers `rmm/exec_policy.hpp` and `rmm/device_vector.hpp`.
    
    The new `exec_policy` can be passed directly to Thrust; there is no longer any need to call `rmm::exec_policy(stream)->on(stream)`.
    
    Depends on rapidsai/rmm#647
    harrism authored Dec 10, 2020
    f117b68
  2. Fix type comparison for java(#6970)

    As part of trying to support upper and lower bounds for decimal, I found that type checking for this function was broken because it used `==` for equality instead of `.equals`. Looking further, I found a few other places where this was a bug (one in ColumnVector that is mostly a performance issue, and one in Scalar). I decided to update all of the code to use `.equals` for comparing types so that it is consistent and bugs like this are less likely to crop up in the future.
    
    I also took the opportunity to internally move away from using `isTimestamp` (which is deprecated) to `isTimestampType`.
    
    Authors:
      - Robert (Bobby) Evans <bobby@apache.org>
    
    Approvers:
      - Jason Lowe
    
    URL: #6970
    revans2 authored Dec 10, 2020
    dc05261
  3. Add Java bindings for URL conversion(#6972)

    Adding Java bindings for the `url_decode` and `url_encode` functions.
    
    Authors:
      - Jason Lowe <jlowe@nvidia.com>
    
    Approvers:
      - Robert (Bobby) Evans
      - Kuhu Shukla
    
    URL: #6972
    jlowe authored Dec 10, 2020
    d028db6
  4. Add JNI wrapper for the cuFile API (GDS)(#6940)

    This adds a `libcufilejni.so` that is neither built nor loaded by default. The unit tests are controlled similarly.
    
    Tested locally with the corresponding spark-rapids plugin changes.
    
    @jlowe @revans2 @abellina
    
    Authors:
      - Rong Ou <rong.ou@gmail.com>
    
    Approvers:
      - Robert (Bobby) Evans
    
    URL: #6940
    rongou authored Dec 10, 2020
    89938fa
  5. Align Series.groupby API to match Pandas(#6964)

    Currently we're missing a few kwargs in `Series.groupby` which is causing issues due to the dask change in dask/dask#6854
    
    Adds the missing kwargs and validates that we support the values passed in.
    
    Authors:
      - Keith Kraus <keith.j.kraus@gmail.com>
    
    Approvers:
      - GALI PREM SAGAR
      - Michael Wang
    
    URL: #6964
    Keith Kraus authored Dec 10, 2020
    f965d9a

Commits on Dec 11, 2020

  1. Fix typo in numerical.py(#6957)

    closes #6778 
    
    Fixes a typo missed in PR #6887. The else condition where it is situated is the last resort for any unaccounted scalar type values.
    
    Authors:
      - Ramakrishna Prabhu <ramakrishnap@nvidia.com>
    
    Approvers:
      - Keith Kraus
    
    URL: #6957
    rgsl888prabhu authored Dec 11, 2020
    ea9c689
  2. Fix groupby agg/apply behaviour when no key columns are provided(#6945)

    More Pandas-like behaviour for groupby when no keys are passed.
    
    Possibly fixes #6927.
    
    Authors:
      - Ashwin Srinath <shwina@users.noreply.github.com>
    
    Approvers:
      - Keith Kraus
    
    URL: #6945
    shwina authored Dec 11, 2020
    b136469
  3. Make cudf::round for fixed_point when scale = -decimal_places a no-op(#6975)
    
    @nartal1 found a small bug while working on: NVIDIA/spark-rapids#1244
    
    The problem is that for `fixed_point`, when the column's `scale = -decimal_places`, rounding should be a no-op. The fix is to make it a no-op.
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - David
      - Karthikeyan
    
    URL: #6975
    codereport authored Dec 11, 2020
    13acc98
  4. Remove duplicate file array_tests.cpp(#6953)

    array_tests.cu exists in the same directory with the same content.
    (It can't be converted to .cpp because a .cuh is included in array_tests.cu and template source code is tested.)
    Also, array_tests.cpp is not referenced in `cpp/tests/CMakeLists.txt`.
    
    Authors:
      - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>
    
    Approvers:
      - David
      - Vukasin Milovanovic
    
    URL: #6953
    karthikeyann authored Dec 11, 2020
    d842327
  5. Add groupby idxmin, idxmax aggregation(#6856)

    Addresses groupby part of #2188
    - [x] Add cython interfaces for aggregation argmin, argmax as idxmin, idxmax
    - [x] unit tests
    
    Authors:
      - Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
      - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>
    
    Approvers:
      - David
      - Ram (Ramakrishna Prabhu)
      - GALI PREM SAGAR
      - Jake Hemstad
    
    URL: #6856
    karthikeyann authored Dec 11, 2020
    2b656b0
  6. Add replace_null API with replace_policy parameter, fixed_width column support(#6907)
    
    Part 1 for issue #1361 
    
    - Adds `PRECEDING` and `FOLLOWING` options to `replace_nulls` in `libcudf`. This PR provides support for `fixed_width_type` type columns.
    - Adds Cython binding
    
    Authors:
      - Michael Wang <michaelwang0905@gmail.com>
      - Michael Wang <isVoid@users.noreply.github.com>
    
    Approvers:
      - Ashwin Srinath
      - Jake Hemstad
      - Mark Harris
      - Mark Harris
    
    URL: #6907
    isVoid authored Dec 11, 2020
    f2b9a36
  7. Minor cudf::round internal refactoring(#6976)

    This is a small cleanup that replaces a `cudf::binary_operation` with a much cleaner `cudf::cast`.
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - Vukasin Milovanovic
      - Mark Harris
    
    URL: #6976
    codereport authored Dec 11, 2020
    c017cb4
  8. Add null mask fixed_point_column_wrapper constructors(#6951)

    Currently has changes for #6950 included.
    
    The full set of null mask `fixed_point_column_wrapper` constructors isn't supported. This PR adds them all and also adds unit tests for each of them across different `fixed_point` API tests.
    
    **To Do List:**
    * [x] Add constructors
    * [x] Add basic unit test
    * [x] Add all unit tests
    * [x] Update docs
    
    Authors:
      - Mark Harris <mharris@nvidia.com>
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - null
      - Vukasin Milovanovic
    
    URL: #6951
    codereport authored Dec 11, 2020
    252f478
  9. Fix default parameter values of write_csv and write_parquet(#6967)

    Fixes #6671, #6851
    
    - Set the `rows_per_chunk` in `csv_writer_options` to the size of the input table.
    - Change `rows_per_chunk` type to `size_type` (used for number of rows).
    - Set the default compression in `to_parquet`/`write_parquet` to "snappy".
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
    
    Approvers:
      - Keith Kraus
      - Conor Hoekstra
      - Ram (Ramakrishna Prabhu)
      - Mark Harris
    
    URL: #6967
    vuule authored Dec 11, 2020
    df5d452
  10. Fix java cufile tests when cufile is not installed(#6987)

    Just disables the tests when cufile is not installed.
    
    Authors:
      - Robert (Bobby) Evans <bobby@apache.org>
    
    Approvers:
      - Kuhu Shukla
    
    URL: #6987
    revans2 authored Dec 11, 2020
    3c15d30
  11. 5735da5
  12. Merge pull request #6995 from shwina/branch-0.18-merge-0.17

    Keith Kraus authored Dec 11, 2020
    4c26155
  13. Avoid inserting null elements into join hash table when nulls are treated as unequal(#6943)
    
    This change mirrors what is done in `groupby`: null elements are left out of the join hash table when nulls are treated as unequal. This prevents the process from running away. I added benchmarks for joins with nulls, and without these changes I can't even get them to finish: the test that takes 195ms without nulls takes 2,000,000ms to complete, and I haven't had the patience to see the larger tests complete. With this change, the timings are faster than the no-null case in proportion to the percentage of nulls. For example, if half the table is nulls, the query is twice as fast as the non-null version, which makes sense.
    
    closes #6052
    
    Authors:
      - Mike Wilson <knobby@burntsheep.com>
      - Mike Wilson <hyperbolic2346@users.noreply.github.com>
    
    Approvers:
      - Jake Hemstad
      - Jake Hemstad
      - null
      - Mark Harris
    
    URL: #6943
    hyperbolic2346 authored Dec 11, 2020
    ab8c931
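
The idea behind the join change in #6943 can be sketched with a dictionary standing in for the hash table. This is an illustrative standard-library sketch, not the libcudf implementation: `None` stands in for a null key, and the function names and the `nulls_equal` flag are assumptions made for the example. When nulls compare unequal, a null key can never produce a match, so it is skipped on both the build side and the probe side.

```python
from collections import defaultdict


def build_hash_table(keys, nulls_equal):
    """Map each build-side key to its row indices; None stands in for null."""
    table = defaultdict(list)
    for row, key in enumerate(keys):
        if key is None and not nulls_equal:
            continue  # a null can never match, so do not insert it
        table[key].append(row)
    return table


def inner_join(left_keys, right_keys, nulls_equal=False):
    """Return (left_row, right_row) pairs whose keys match."""
    table = build_hash_table(right_keys, nulls_equal)
    pairs = []
    for left_row, key in enumerate(left_keys):
        if key is None and not nulls_equal:
            continue  # skip probing for nulls as well
        for right_row in table.get(key, ()):
            pairs.append((left_row, right_row))
    return pairs


left = [1, None, 2, None]
right = [None, 2, None, 1]
unequal_pairs = inner_join(left, right)
equal_pairs = inner_join(left, right, nulls_equal=True)
```

In this toy input half of the keys are null, so the table built with `nulls_equal=False` holds only two entries instead of three.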

Commits on Dec 12, 2020

  1. Fix timestamp parsing in ORC reader for timezones without transitions(#6959)
    
    Fixes #6947 
    
    When a TZif file has no transitions (e.g. GMT), `build_timezone_transition_table` has an out-of-bounds read that leads to undefined behavior and intermittent issues.
    
    This PR makes two changes to behavior:
    1. When there are no transitions, the ancient rule is initialized from the first time offset (instead of the first transition rule, which does not exist in this case).
    2. When there are no transitions and the time offset is zero, an empty table is returned (avoid using a no-op table in CUDA).
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
      - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>
    
    Approvers:
      - GALI PREM SAGAR
      - null
      - Ram (Ramakrishna Prabhu)
      - David
    
    URL: #6959
    vuule authored Dec 12, 2020
    929c3f4
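
The two behavior changes listed for #6959 can be sketched as a guard around the empty-transitions case. The data layout below (a list of `(utc_seconds, offset_index)` transitions plus a list of offsets) and the function name are assumptions made for illustration; the actual `build_timezone_transition_table` in libcudf is C++ and is structured differently.

```python
def build_transition_table(transitions, offsets):
    """Hypothetical sketch of the guarded logic.

    transitions: list of (utc_seconds, offset_index), possibly empty.
    offsets: list of UTC offsets in seconds; offsets[0] is the first offset.
    Returns (utc_seconds, offset) rows. Row 0 is the "ancient" rule that
    applies before any recorded transition.
    """
    if not transitions:
        if offsets[0] == 0:
            return []  # zero offset and no transitions: nothing to adjust
        ancient_offset = offsets[0]  # fall back to the first time offset
    else:
        # Reading transitions[0] is only safe because the list is non-empty.
        ancient_offset = offsets[transitions[0][1]]
    table = [(float("-inf"), ancient_offset)]
    table.extend((when, offsets[idx]) for when, idx in transitions)
    return table


gmt_table = build_transition_table([], [0])
fixed_offset_table = build_transition_table([], [3600])
```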

Commits on Dec 13, 2020

  1. Fix int to datetime conversion in csv_read(#6991)

    Fixes #6719
    
    Authors:
      - Kumar Aatish <kaatish@nvidia.com>
    
    Approvers:
      - David
      - GALI PREM SAGAR
      - Mark Harris
    
    URL: #6991
    kaatish authored Dec 13, 2020
    2ede7df
  2. Disable some pragma unroll statements in thrust sort.h(#6982)

    Closes #6955 
    
    This will improve the compile time and size for any function using `thrust::sort` and `thrust::stable_sort`. 
    The PR includes a .patch file to be applied during cmake when downloading the thrust library.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Mark Harris
    
    URL: #6982
    davidwendt authored Dec 13, 2020
    b986220
  3. Fix nullmask offset handling in parquet and orc writer(#6889)

    Fixes #6642
    
    Authors:
      - Kumar Aatish <kaatish@nvidia.com>
      - skirui-source <71867292+skirui-source@users.noreply.github.com>
    
    Approvers:
      - Vukasin Milovanovic
      - Devavret Makkar
      - Ram (Ramakrishna Prabhu)
    
    URL: #6889
    kaatish authored Dec 13, 2020
    8dbaa2f
  4. Pass numeric scalars of the same dtype through numeric binops(#6938)

    Allows for scalars of the same dtype as a column to be passed along a fast codepath to libcudf, instead of being inspected to reduce their dtype beforehand.
    
    Authors:
      - brandon-b-miller <brmiller@nvidia.com>
      - GALI PREM SAGAR <sagarprem75@gmail.com>
    
    Approvers:
      - GALI PREM SAGAR
    
    URL: #6938
    brandon-b-miller authored Dec 13, 2020
    b0cb9db

Commits on Dec 14, 2020

  1. Fix Thrust unroll patch command(#7002)

    PR #6982 added a `PATCH_COMMAND` when fetching Thrust to remove unrolling in `thrust::sort`, thereby improving compile time and performance in some cases. But the command failed on local builds from source (at least on my machine under rapids-compose).
    
    This PR simplifies the command.
    
    Authors:
      - Mark Harris <mharris@nvidia.com>
    
    Approvers:
      - Keith Kraus
    
    URL: #7002
    harrism authored Dec 14, 2020
    29c0af1
  2. Check output size overflow on strings gather(#6997)

    Closes #6801 
    
    This PR adds an extra reduce call in the libcudf gather specialization logic for strings column. This will check to make sure the output size of the gather does not exceed the size limit for the child characters column. The offsets column is first created with the individual output string sizes. Then the reduce call will add these sizes to check for overflow.
    
    Also added a gtest to check for the overflow condition.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Devavret Makkar
      - Karthikeyan
    
    URL: #6997
    davidwendt authored Dec 14, 2020
    f37d42d
  3. fix excluding cufile tests by default(#6988)

    I think this is how I started and verified it was working, but then I was trying to exclude the source as well, which didn't work for tests. Then I realized we need the source built for the plugin so remove it. Anyway, no presubmit CI check is really painful. :(
    
    @revans2 
    
    ```console
    $ mvn test
    ...
    [WARNING] Tests run: 775, Failures: 0, Errors: 0, Skipped: 4
    
    $ mvn test -DUSE-GDS=ON
    ...
    [WARNING] Tests run: 777, Failures: 0, Errors: 0, Skipped: 4
    ```
    
    Authors:
      - Rong Ou <rong.ou@gmail.com>
    
    Approvers:
      - Robert (Bobby) Evans
    
    URL: #6988
    rongou authored Dec 14, 2020
    2a2b4d6
  4. Remove warning in from_dlpack and to_dlpack methods(#7001)

    Fixes #6926.
    
    Hi!
    
    When invoking from_dlpack() and to_dlpack(), the following warnings are displayed:
    
    from_dlpack()
    ```
    /opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/dlpack.py:33: UserWarning: WARNING: cuDF from_dlpack() assumes column-major (Fortran order) input. If the input tensor is row-major, transpose it before passing it to this function.
    res = libdlpack.from_dlpack(pycapsule_obj)
    ```
    
    to_dlpack()
    ```
    /opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/io/dlpack.py:74: UserWarning: WARNING: cuDF to_dlpack() produces column-major (Fortran order) output. If the output tensor needs to be row major, transpose the output of this function.
    return libdlpack.to_dlpack(gdf_cols)
    ```
    
    I think those warnings should be removed, because they contain information that should be available in the API documentation and need not be displayed each time the methods are invoked.
    
    Some users, like me, love to have their notebooks/code without warnings. Even if it is possible to disable those warnings, I think the user should not go that way, because the warning is just repeating what the API documentation should cover.
    
    Hope it helps!
    Miguel
    
    Authors:
      - Miguel Martínez <26169771+miguelusque@users.noreply.github.com>
      - GALI PREM SAGAR <sagarprem75@gmail.com>
    
    Approvers:
      - GALI PREM SAGAR
      - Ram (Ramakrishna Prabhu)
    
    URL: #7001
    miguelusque authored Dec 14, 2020
    a5515f2
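
The layout distinction that the removed warnings describe can be shown with nested lists. This is a plain standard-library sketch that does not call cuDF or DLPack, and the helper names are hypothetical. A column-major (Fortran order) buffer stores each column contiguously, while a row-major (C order) buffer stores each row contiguously; transposing converts one view into the other.

```python
def flatten_row_major(matrix):
    """C order: walk each row in turn."""
    return [value for row in matrix for value in row]


def flatten_column_major(matrix):
    """Fortran order: walk each column in turn."""
    return [row[c] for c in range(len(matrix[0])) for row in matrix]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


table = [[1, 2, 3],
         [4, 5, 6]]  # 2 rows x 3 columns

row_major = flatten_row_major(table)
column_major = flatten_column_major(table)
```

Flattening the transposed table in row-major order gives the same sequence as flattening the original in column-major order, which is why the warning text suggests a transpose for row-major tensors.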

Commits on Dec 15, 2020

  1. Add null count test for apply_boolean_mask(#6903)

    This PR adds a test to exercise the issue described in #6733. This issue was only reproduced on a laptop Pascal GPU, but I think it's a good test to have. In summary, `copy_if`, used by `apply_boolean_mask`, computes the output null count as part of its custom scatter kernel, rather than using `cudf::count_unset_bits`. #6733 describes an issue where the former is different from the latter. So it's good to have a test that verifies they get the same null count.
    
    And since it's difficult to get a repro on a similar machine, this is a first step.
    
    Authors:
      - Mark Harris <mharris@nvidia.com>
      - Keith Kraus <kkraus@nvidia.com>
    
    Approvers:
      - Karthikeyan
      - Devavret Makkar
    
    URL: #6903
    harrism authored Dec 15, 2020
    15f9530
  2. Skip Thrust sort patch if already applied(#7009)

    #7002 attempted to fix the temporary Thrust sort patch introduced in #6982 which didn't work with CMake 3.19+.  
    
    This PR updates the thirdparty CMakeLists.txt file to continue if the Thrust sort patch has already been applied.
    Today, the first time cmake is run, the Thrust sort.h is patched. But if cmake is run again without cleaning the build directory, the build will fail, because the file has already been patched.  @trxcllnt showed us the correct `patch` incantation to ignore the patch if already applied.
    
    CC @davidwendt
    
    Authors:
      - Mark Harris <mharris@nvidia.com>
    
    Approvers:
      - Paul Taylor
      - Keith Kraus
    
    URL: #7009
    harrism authored Dec 15, 2020
    515a173
  3. Enable strict_decimal_types in parquet reading(#6969)

    This pull request is to address #6909.
    
    Authors:
      - sperlingxx <lovedreamf@gmail.com>
      - Alfred Xu <lovedreamf@gmail.com>
    
    Approvers:
      - Robert (Bobby) Evans
      - Mike Wilson
      - Devavret Makkar
    
    URL: #6969
    sperlingxx authored Dec 15, 2020
    6d1b076
  4. Fix cudf::hash_partition for decimal32 and decimal64(#7006)

    This resolves #6996
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - Mike Wilson
      - Devavret Makkar
    
    URL: #7006
    codereport authored Dec 15, 2020
    b370963
  5. Implement cudf.DateOffset for months(#6775)

    Implements `cudf.DateOffset` - an object used for calendrical arithmetic, similar to pandas.DateOffset - for month units only. 
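
To illustrate what month-based calendrical arithmetic involves, here is a minimal standard-library sketch. The helper `add_months` is hypothetical, and the rule of clamping the day to the end of a shorter month is an assumption modeled on pandas behaviour, not a statement about the cuDF implementation.

```python
import calendar
from datetime import date


def add_months(d, months):
    # Shift by whole months, clamping the day to the target month's length.
    month_index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


shifted = add_months(date(2020, 1, 31), 1)  # lands on the last day of February 2020
```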
    
    Closes #6754
    
    Authors:
      - brandon-b-miller <brmiller@nvidia.com>
      - brandon-b-miller <53796099+brandon-b-miller@users.noreply.github.com>
      - Keith Kraus <kkraus@nvidia.com>
    
    Approvers:
      - GALI PREM SAGAR
      - Keith Kraus
    
    URL: #6775
    brandon-b-miller authored Dec 15, 2020
    1963111

Commits on Dec 16, 2020

  1. Add pytest-xdist to dev environment.yml(#6958)

    Resolves: #6370 
    
    This PR enables the parallel execution of pytests of `cudf`, `dask_cudf` & `custreamz` in CI. The changes also include adding `pytest-xdist` to dev environments.
    
    With these changes, here is the change in pytest execution times in CI:
    
    | module      | without pytest-xdist | with pytest-xdist(n=6)  |
    | ----------- | ----------- | -----------|
    | cudf      | 1 hr       |  14 min |
    | dask_cudf   |    4 min     | 1 min |
    | custreamz  |  6 min |  2 min |
    
    
    Related Integration changes: rapidsai/integration#188
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - AJ Schmidt
      - Keith Kraus
    
    URL: #6958
    galipremsagar authored Dec 16, 2020
    8c1f01e
  2. Pin librdkafka to gcc 7 compatible version (#7021)

    `librdkafka` 1.5.3 requires gcc 9.3 (https://anaconda.org/conda-forge/librdkafka/files?version=1.5.3), and `libcudf_kafka` (https://anaconda.org/rapidsai-nightly/libcudf_kafka/files?version=0.18.0a201215) is being built against 1.5.3, which is not compatible with the rest of RAPIDS.
    
    Authors:
      - Ray Douglass
    
    Approvers:
      - Mike Wendt
    raydouglass authored Dec 16, 2020
    6bc71c8
  3. Fix loc behaviour when key of incorrect type is used(#6993)

    * Fixes #6823
    * Raise a `KeyError` similar to Pandas rather than an `IndexError` when loc fails
    * Improve tests to compare more directly with Pandas behaviour
    
    Authors:
      - Ashwin Srinath <shwina@users.noreply.github.com>
    
    Approvers:
      - Ram (Ramakrishna Prabhu)
      - Michael Wang
      - GALI PREM SAGAR
    
    URL: #6993
    shwina authored Dec 16, 2020
    7ca3fad

Commits on Dec 17, 2020

  1. Fix round operator's HALF_EVEN computation for negative integers(#7014)

    Found a small bug while working on NVIDIA/spark-rapids#1244. 
    For negative integers, it was not rounding to nearest even number.
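
For reference, the expected HALF_EVEN behaviour on integers can be sketched as follows. This is a hypothetical helper written for illustration, not the libcudf kernel; it relies on Python's floor-based `divmod`, which keeps the remainder non-negative even for negative inputs.

```python
def round_half_even(value, decimal_places):
    """Round an integer to `decimal_places` (negative rounds to tens, hundreds, ...)."""
    if decimal_places >= 0:
        return value  # integers have no fractional digits to round
    step = 10 ** (-decimal_places)
    quotient, remainder = divmod(value, step)  # floor division: 0 <= remainder < step
    doubled = 2 * remainder
    if doubled > step or (doubled == step and quotient % 2 == 1):
        quotient += 1  # round up; on an exact tie only when the quotient is odd
    return quotient * step


examples = [round_half_even(v, -1) for v in (-35, -25, -15, 15, 25, 35)]
```

Ties go to the even multiple on both sides of zero, so `-25` and `-15` both round to `-20`.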
    
    Authors:
      - Niranjan Artal <nartal@nvidia.com>
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - Conor Hoekstra
      - Mark Harris
    
    URL: #7014
    nartal1 authored Dec 17, 2020
    e5d3742
  2. Implement cudf::reduce for decimal32 and decimal64 (part 2)(#6980)

    This PR resolves a part of #3556.
    
    Supporting `cudf::reduce`:
    1. Part 1 (`MIN`, `MAX`, `SUM` & `PRODUCT` & `NUNIQUE`) #6814
    2. Part 2 (the rest) ◀️ 
    
    **Reduction Ops:**
    
    **Done in Previous PR**
    ✔️  `SUM,             ///< sum reduction`
    ✔️ `PRODUCT,         ///< product reduction`
    ✔️ `MIN,             ///< min reduction`
    ✔️ `MAX,             ///< max reduction`
    ✔️ `NUNIQUE,         ///< count number of unique elements`
    
    **Not supported by `cudf::reduce`:**
    * [x] `COUNT_VALID,     ///< count number of valid elements`
    * [x] `COUNT_ALL,       ///< count number of elements`
    * [x] `COLLECT,         ///< collect values into a list`
    * [x] `LEAD,            ///< window function, accesses row at specified offset following current row`
    * [x] `LAG,             ///< window function, accesses row at specified offset preceding current row`
    * [x] `PTX,             ///< PTX UDF based reduction`
    * [x] `CUDA             ///< CUDA UDf based reduction`
    * [x] `ARGMAX,          ///< Index of max element`
    * [x] `ARGMIN,          ///< Index of min element`
    * [x] `ROW_NUMBER,      ///< get row-number of element`
    
    **Won't be supported:**
    * [x] `ANY,             ///< any reduction`
    * [x] `ALL,             ///< all reduction`
    
    **To Do / Investigate:**
    * [x] `SUM_OF_SQUARES,  ///< sum of squares reduction`
    * [x] `MEDIAN,          ///< median reduction`
    * [x] `QUANTILE,        ///< compute specified quantile(s)`
    * [x] `NTH_ELEMENT,     ///< get the nth element`
    
    **Deferred until requested**
    * [x] `MEAN,            ///< arithmetic mean reduction`
    * [x] `VARIANCE,        ///< groupwise variance`
    * [x] `STD,             ///< groupwise standard deviation`
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - null
      - Karthikeyan
      - David
    
    URL: #6980
    codereport authored Dec 17, 2020
    ae17c14
  3. Extend replace_nulls_policy to string and dictionary type(#7004)

    Follow up for PR #6907 
    
    - `replace_null` policy function now supports `string` and `dictionary` dtype columns.
    
    Since the original implementation depends only on column validity and index, this extension trivially removes SFINAE on the `replace_null` functor and removes `type_dispatcher`.
    
    Authors:
      - Michael Wang <isVoid@users.noreply.github.com>
    
    Approvers:
      - Mark Harris
      - Karthikeyan
    
    URL: #7004
    isVoid authored Dec 17, 2020
    da60cce
  4. Restore usual instance/subclass checking to cudf.DateOffset(#7029)

    `pd.DateOffset` uses a metaclass that overrides the usual instance/subclass checking behaviour. Any subclass of `pd._libs.tslibs.offsets.BaseOffset` will be reported as a subclass of `pd.DateOffset` (itself a `pd.DateOffset`). This can lead to some surprising behaviour:
    
    ```python
    In [3]: isinstance(pd.DateOffset(), cudf.DateOffset)
    Out[3]: True
    ``` 
    
    Note that `cudf.DateOffset` inherits from `pd.DateOffset`. But, a `pd.DateOffset` is reported as an instance of `cudf.DateOffset` -- [Child Is Father of the Man](https://en.wikipedia.org/wiki/Child_Is_Father_of_the_Man)!
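
The mechanism can be reproduced with toy classes. The sketch below uses hypothetical names (`BaseOffset`, `LooseMeta`, `StrictMeta`, and so on) and shows one way a subclass could restore the default check; it is an assumption for illustration, not necessarily how this PR does it.

```python
class BaseOffset:
    pass


class LooseMeta(type):
    # Mimics a metaclass that answers isinstance() from a shared base class.
    def __instancecheck__(cls, obj):
        return isinstance(obj, BaseOffset)


class ParentOffset(BaseOffset, metaclass=LooseMeta):
    pass


class ChildOffset(ParentOffset):
    pass  # inherits LooseMeta, so it inherits the loose check too


class StrictMeta(LooseMeta):
    # Restore the default instance check for classes using this metaclass.
    def __instancecheck__(cls, obj):
        return type.__instancecheck__(cls, obj)


class StrictChildOffset(ParentOffset, metaclass=StrictMeta):
    pass


surprising = isinstance(ParentOffset(), ChildOffset)       # parent reported as child
restored = isinstance(ParentOffset(), StrictChildOffset)   # usual semantics again
```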
    
    Authors:
      - Ashwin Srinath <shwina@users.noreply.github.com>
    
    Approvers:
      - GALI PREM SAGAR
    
    URL: #7029
    shwina authored Dec 17, 2020
    1c8f2a8
  5. Add compression="infer" as default for dask_cudf.read_csv(#7013)

    Closes #6850 
    
    dask_cudf version of the `dask.dataframe` changes proposed in [dask#6960](dask/dask#6960).  Uses `fsspec` to infer the default `compression` argument from the suffix of the first file-path argument.
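
The inference step can be sketched without `fsspec`. The suffix table and the `infer_compression` helper below are hypothetical and cover only a few formats; the real lookup is delegated to `fsspec`.

```python
import os

# Hypothetical suffix table; real libraries may recognise more formats.
_SUFFIX_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz", ".zip": "zip"}


def infer_compression(paths):
    # Look only at the first path, as described above.
    first = paths[0] if isinstance(paths, (list, tuple)) else paths
    suffix = os.path.splitext(str(first))[1].lower()
    return _SUFFIX_TO_COMPRESSION.get(suffix)  # None means "no compression inferred"
```

Because only the first path is inspected, a list that mixes compressed and uncompressed files is treated according to its first entry.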
    
    Authors:
      - rjzamora <rzamora217@gmail.com>
    
    Approvers:
      - Keith Kraus
    
    URL: #7013
    rjzamora authored Dec 17, 2020
    8c8c421
  6. Refactor rolling.cu to reduce compile time(#6512)

    Closes #6472.
    
    `rolling.cu` is taking inordinately long to compile, slowing down the `libcudf` build. The following changes were made to mitigate this:
    
    1. Moved `grouped_rolling_window()` and `grouped_time_based_rolling_window()` to `grouped_rolling.cu`. Common functions were moved to `rolling_detail.cuh`.
    2. Normalized timestamp columns to use int64_t representations. This reduces the number of template instantiations for `time_based_grouped_rolling_window()`.
    3. `grouped_*_rolling_window()` functions used to pass around fancy iterators, causing massive template instantiations. This has been changed to materialize the window offsets as separate columns, and use those with existing `rolling_window()` functions to produce the final result.
    
    These changes have been tested by running a window function test from SparkSQL, over a 2.4GB ORC file with 155M records (1.5M groups of about 97 records each on average):
    1. There has been no discernible change in the end-to-end runtime. (The `nsys` profile seems to indicate that the total time spent in the `gpu_rolling` kernel has reduced. This is still being examined, to confirm.)
    2. Compiling `rolling.cu` and `grouped_rolling.cu` in parallel now takes 60s as opposed to about 300s before.
    3. The object file size seems to have reduced by a factor of 3.
    
    Authors:
      - Mithun RK <mythrocks@gmail.com>
    
    Approvers:
      - Vukasin Milovanovic
      - Karthikeyan
    
    URL: #6512
    mythrocks authored Dec 17, 2020
    ce21296
  7. Decimal casts in JNI became a NOOP(#7032)

    We compared the wrong thing on a cast optimization.  This fixes that.
    
    Authors:
      - Robert (Bobby) Evans <bobby@apache.org>
    
    Approvers:
      - Jason Lowe
      - Alessandro Bellina
    
    URL: #7032
    revans2 authored Dec 17, 2020
    c16a0a5
  8. Add method field to fillna for fixed width columns(#6998)

    Closes #1361 
    
    - Provides "`ffill`" and "`bfill`" `fillna` methods for `Numerical`, `Datetime`, `Timedelta` and `Categorical` type columns.
    - Supports `method` parameter for `Series.fillna` and `DataFrame.fillna`
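
The semantics of the two methods can be sketched on a plain list, with `None` standing in for a null. The `fillna` function here is a hypothetical list-based helper, not the cuDF method.

```python
def fillna(values, method):
    """Fill None entries by propagating the nearest valid value."""
    if method not in ("ffill", "bfill"):
        raise ValueError(f"unsupported method: {method!r}")
    # bfill is ffill applied to the reversed sequence.
    ordered = values if method == "ffill" else reversed(values)
    filled, last_valid = [], None
    for value in ordered:
        if value is not None:
            last_valid = value
        filled.append(last_valid)
    return filled if method == "ffill" else filled[::-1]


data = [None, 1, None, None, 4, None]
forward = fillna(data, "ffill")
backward = fillna(data, "bfill")
```

Leading nulls survive `ffill` and trailing nulls survive `bfill`, since there is no valid value to propagate into them.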
    
    Authors:
      - Michael Wang <isVoid@users.noreply.github.com>
    
    Approvers:
      - Ashwin Srinath
      - GALI PREM SAGAR
    
    URL: #6998
    isVoid authored Dec 17, 2020
    4385f54
  9. Correct ORC docstring; other minor cuIO improvements(#7012)

    Fixes #6923
    
    Included other minor cuIO improvements that are too small for individual PRs:
    - Remove unnecessary NaN-related conditions in JSON, CSV.
    - Expand a comment in `createSerializedTrie` to make initialization clearer.
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
      - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>
    
    Approvers:
      - GALI PREM SAGAR
      - Karthikeyan
      - Christopher Harris
    
    URL: #7012
    vuule authored Dec 17, 2020
    ff56585
  10. Fix libcudf strings logic where size_type is used to access INT32 column data(#7020)
    
    I tried experimenting with changing the `cudf::size_type` to `int64_t` and found many, many places that assume `size_type` and `int32_t` (and `int`) are interchangeable. This PR attempts to fix some of the places where an offsets column is created as INT32 but the column data is incorrectly referenced as `data<size_type>()`, for example.
    Also, this PR fixes some places that accept or return only `int32_t` (regex internal functions) or `size_type` (factories) and that should be cast or otherwise accounted for.
    This is not a full set of the possible violations found, but it may help minimize future errors. No functions were changed or added.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Conor Hoekstra
      - Devavret Makkar
    
    URL: #7020
    davidwendt authored Dec 17, 2020
    05653ef
  11. Correct the sampling range when sampling with replacement(#6884)

    This corrects an issue with the sampling range used when replacement=True. Before, it sampled the range 0 through `num_rows` meaning it could sample `num_rows` even though it's one position out of bounds. This caused sample to return values not present in the original DataFrame.
    
    I also created exceptions for sampling on empty DataFrames that match pandas, as well as an exception for sampling when `axis=1` and `replace=True` as cudf does not support DataFrames with duplicate columns.
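
The off-by-one can be illustrated with the standard `random` module. `sample_with_replacement` is a hypothetical helper on a plain list, not the cuDF `sample` implementation.

```python
import random


def sample_with_replacement(rows, n, seed=None):
    rng = random.Random(seed)
    num_rows = len(rows)
    if num_rows == 0:
        raise ValueError("cannot sample from an empty sequence")
    # randrange(num_rows) draws from [0, num_rows). An inclusive upper bound,
    # such as randint(0, num_rows), could yield the out-of-bounds index num_rows.
    return [rows[rng.randrange(num_rows)] for _ in range(n)]


picked = sample_with_replacement(["a", "b", "c"], 1000, seed=0)
```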
    
    This closes #6532
    
    Authors:
      - Chris Jarrett <cjarrett@dt08.aselab.nvidia.com>
      - Mark Harris <mharris@nvidia.com>
      - ChrisJar <chris.jarrett.0@gmail.com>
    
    Approvers:
      - Keith Kraus
      - Mark Harris
    
    URL: #6884
    ChrisJar authored Dec 17, 2020
    3be4428

Commits on Dec 18, 2020

  1. Add ffill and bfill to string columns(#7036)

    Follow up of PR #7004 
    
    Adds `method` field to `fillna` method in string type column to support `ffill` and `bfill`.
    Also involves a small change to a `datetime64` `ffill`, `bfill` test case to improve test robustness.
    
    Authors:
      - Michael Wang <isVoid@users.noreply.github.com>
    
    Approvers:
      - GALI PREM SAGAR
    
    URL: #7036
    isVoid authored Dec 18, 2020
    ae90dd9
  2. Fix read_orc for decimal type(#7034)

    The `run_pos` being used came from the data stream rather than from the secondary stream (which holds the scale), but the resulting value was applied to the secondary stream `scale`. The code change fixes that issue and also adds a test case to cover it.
    
    closes #7016
    
    Authors:
      - Ramakrishna Prabhu <ramakrishnap@nvidia.com>
    
    Approvers:
      - Vukasin Milovanovic
      - GALI PREM SAGAR
      - Devavret Makkar
    
    URL: #7034
    rgsl888prabhu authored Dec 18, 2020
    c24171b
  3. Add Ufunc alias look up for appropriate numpy ufunc dispatching(#6973)

    This PR closes #6921 by dispatching NumPy ufuncs to the appropriate cudf alias, looked up in the `_UFUNC_ALIASES` dictionary:
    ```python
    _UFUNC_ALIASES = {
        "power": "pow",
        "equal": "eq",
        "not_equal": "ne",
        "less": "lt",
        "less_equal": "le",
        "greater": "gt",
        "greater_equal": "ge",
        "absolute": "abs",
    }
    ```
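
How such a table might drive dispatch is sketched below. `ToySeries` and `dispatch_ufunc` are hypothetical names introduced for illustration, and only a subset of the aliases is repeated; the actual cuDF dispatch code is not shown in this message.

```python
_UFUNC_ALIASES = {"power": "pow", "less": "lt", "absolute": "abs"}  # subset of the table above


class ToySeries:
    # Hypothetical stand-in for a column type; not the cuDF Series class.
    def __init__(self, data):
        self.data = list(data)

    def pow(self, exponent):
        return ToySeries(x ** exponent for x in self.data)

    def lt(self, other):
        return ToySeries(x < other for x in self.data)

    def abs(self):
        return ToySeries(abs(x) for x in self.data)

    def add(self, other):
        return ToySeries(x + other for x in self.data)


def dispatch_ufunc(ufunc_name, series, *args):
    # Translate the NumPy name to the method name, falling back to the same name.
    method_name = _UFUNC_ALIASES.get(ufunc_name, ufunc_name)
    method = getattr(series, method_name, None)
    if method is None:
        raise NotImplementedError(ufunc_name)
    return method(*args)


squared = dispatch_ufunc("power", ToySeries([1, -2, 3]), 2).data
```

Names that already match a method, such as `add` in this sketch, need no alias entry.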
    
    Authors:
      - Vibhu Jawa <vibhujawa@gmail.com>
    
    Approvers:
      - Keith Kraus
      - null
    
    URL: #6973
    VibhuJawa authored Dec 18, 2020
    442985a
  4. Fix backward compatibility of loading a 0.16 pkl file(#7033)

    Fixes: #7025 
    
    This PR:
    
    1. Handles loading of pickle files that were created with a `RangeIndex` before support for the `step` parameter was introduced.
    2. Introduces special-case handling of the string column size, which was previously stored as a pickled object.
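
The first point follows a common pattern for evolving pickled state, sketched here with a hypothetical `ToyRangeIndex` class. The state layout is an assumption for illustration and does not reflect cuDF's actual serialization format.

```python
class ToyRangeIndex:
    # Hypothetical class illustrating the pattern; not cuDF's RangeIndex.
    def __init__(self, start, stop, step=1):
        self.start, self.stop, self.step = start, stop, step

    def __getstate__(self):
        return {"start": self.start, "stop": self.stop, "step": self.step}

    def __setstate__(self, state):
        self.start = state["start"]
        self.stop = state["stop"]
        # Older payloads predate `step`; fall back to the implied value of 1.
        self.step = state.get("step", 1)


# Simulate an old payload by restoring a state dict that lacks "step".
legacy = ToyRangeIndex.__new__(ToyRangeIndex)
legacy.__setstate__({"start": 0, "stop": 5})
```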
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Ram (Ramakrishna Prabhu)
    
    URL: #7033
    galipremsagar authored Dec 18, 2020
    9317361

Commits on Dec 19, 2020

  1. Share factorize implementation with Index and cudf module(#6885)

    Share the implementation of `cudf.Series.factorize` with the `Index` class and the `cudf` module namespace. 
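
The idea of one implementation serving several entry points can be sketched as follows. `ToySeries` and `ToyIndex` are hypothetical classes, and returning uniques in order of first appearance is an assumption borrowed from the pandas default rather than a statement about cuDF.

```python
def factorize(values):
    """Return (codes, uniques), with uniques in order of first appearance."""
    positions, codes, uniques = {}, [], []
    for value in values:
        if value not in positions:
            positions[value] = len(uniques)
            uniques.append(value)
        codes.append(positions[value])
    return codes, uniques


class _FactorizeMixin:
    # Both container types delegate to the single module-level function.
    def factorize(self):
        return factorize(self.data)


class ToySeries(_FactorizeMixin):
    def __init__(self, data):
        self.data = list(data)


class ToyIndex(_FactorizeMixin):
    def __init__(self, data):
        self.data = list(data)


codes, uniques = ToySeries(["b", "a", "b", "c"]).factorize()
```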
    
    Closes #6871
    
    Authors:
      - brandon-b-miller <brmiller@nvidia.com>
      - Keith Kraus <kkraus@nvidia.com>
      - brandon-b-miller <53796099+brandon-b-miller@users.noreply.github.com>
    
    Approvers:
      - Ashwin Srinath
      - Keith Kraus
    
    URL: #6885
    brandon-b-miller authored Dec 19, 2020
    923cf49

Commits on Dec 21, 2020

  1. Improve representation of MultiIndex(#6992)

    Fixes: #6936 
    
    This PR changes `MultiIndex.__repr__` so that the output is more readable and closer to that of a pandas MultiIndex. The changes also cover the handling of `<NA>` and `nan` values and the spacing issues around them.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - null
      - Keith Kraus
    
    URL: #6992
    galipremsagar authored Dec 21, 2020
    7556e23

Commits on Dec 23, 2020

  1. Make Doxygen comments formatting consistent(#7041)

    Making this PR since wrong formatting keeps getting propagated in new PRs and (sometimes) corrected in code review.
    
    Changes:
    
    - Ironed out the formatting of Doxygen comments to match the guidelines. 
    - Removed the outdated file with formatting examples.
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
      - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>
      - vukasin <vmilovanovic@nvidia.com>
    
    Approvers:
      - David
      - Karthikeyan
    
    URL: #7041
    vuule authored Dec 23, 2020
    2780a8c

Commits on Dec 29, 2020

  1. Update cudf python docstrings with new null representation (<NA>)(#7050)
    
    Fixes: #7046
    
    This PR:
    
    - [x] Updates all doc examples with new NA_REP(`<NA>`)
    - [x] Fixes reference warnings during doc build.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Keith Kraus
    
    URL: #7050
    galipremsagar authored Dec 29, 2020
    4a1e465
  2. Reduce number of hostdevice_vector allocations in parquet reader(#7005)

    Improves performance of parquet reader on certain multi-GPU systems, which take a long time to allocate pinned memory, by reducing the number of `hostdevice_vector` allocations.
    
    Closes #7049
    
    Authors:
      - Devavret Makkar <dmakkar@nvidia.com>
    
    Approvers:
      - null
      - Ram (Ramakrishna Prabhu)
      - Karthikeyan
    
    URL: #7005
    devavret authored Dec 29, 2020
    277bd9f

Commits on Dec 31, 2020

  1. Implement cudf::rolling for decimal32 and decimal64(#7037)

    This PR resolves a part of #3556.
    
    Aggregation ops supported:
    * `MIN`
    * `MAX`
    * `COUNT` (both `null_policy` - `EX/INCLUDE`)
    * `LEAD`
    * `LAG`
    
    **To Do List:**
    * [x] Basic unit tests
    * [x] Comprehensive unit tests
    * [x] Implementation
    * [x] Figure out which rolling ops to support
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - Vukasin Milovanovic
      - Ram (Ramakrishna Prabhu)
    
    URL: #7037
    codereport authored Dec 31, 2020
    28d18d6

Commits on Jan 4, 2021

  1. Create sort gbenchmark for strings column(#7040)

    Reference #7027 and #5698
    
    This adds a strings column to the current gbenchmark for sort. This will help measure improvements or changes over time to the column and strings comparator functions.
    
    No code logic changed or added.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Vukasin Milovanovic
      - Devavret Makkar
      - Keith Kraus
    
    URL: #7040
    davidwendt authored Jan 4, 2021
    af41136
  2. Fix to_csv delimiter handling of timestamp format(#7023)

    Closes #6699 
    The timestamp format(s) used by the CSV writer have the form `%Y-%m-%dT%H:%M:%SZ`. This means that if the column delimiter (default `','`) or the line delimiter (default `\n`) is set to `':'` or `'-'`, the timestamp string output could conflict with these delimiters. The previous logic simply removed these characters from the format when it detected a conflicting column or line delimiter. For example, specifying a dash `'-'` as the column delimiter caused the timestamp format to change to `%Y%m%d...` (the dashes are removed). I admit this was kind of hacky and also made the output inconsistent with Pandas `to_csv()`.
    
    It is easy enough to simply add double-quotes around the timestamp format to prevent these conflicts as well as make the output consistent. This PR fixes that logic.
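
The quoting approach can be demonstrated with Python's standard `csv` module, which quotes a field only when it contains the delimiter. This is an analogy to show the resulting output shape, not the libcudf writer; `write_rows` is a hypothetical helper.

```python
import csv
import io


def write_rows(rows, delimiter):
    buffer = io.StringIO()
    # QUOTE_MINIMAL quotes only fields containing the delimiter, quote char, or newline.
    writer = csv.writer(
        buffer, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL, lineterminator="\n"
    )
    writer.writerows(rows)
    return buffer.getvalue()


dash_output = write_rows([["2020-01-01T00:00:00Z", 1]], delimiter="-")
comma_output = write_rows([["2020-01-01T00:00:00Z", 1]], delimiter=",")
```

With a dash delimiter the timestamp is wrapped in double quotes and keeps its dashes; with a comma delimiter it is written unquoted.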
    
    Exception logic to check for a dash as column separator was also found in [csv.py](https://github.com/rapidsai/cudf/blob/8c1f01e1fd713d873cf3d943ab409f3e9efc48f8/python/cudf/cudf/io/csv.py#L139-L149), specifically citing issue 6699 in the exception message. Also, there was a pytest specifically created to check for this exception. The exception is removed and the pytest function updated in this PR as well.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - GALI PREM SAGAR
      - Karthikeyan
      - null
    
    URL: #7023
    davidwendt authored Jan 4, 2021
    ca1a4d6
  3. cudf::scan support for decimal32 and decimal64(#7063)

    Adding support for `cudf::scan` for `decimal32` and `decimal64`. `cudf::scan` only supports 4 operations (sum, product, min and max) but the decimal types will only support `SUM`, `MAX` and `MIN`.
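
An inclusive scan over fixed-point values can be sketched with Python's `decimal` module. `inclusive_scan` and the `_SCAN_OPS` table are hypothetical; `PRODUCT` is left out only to mirror the description above.

```python
from decimal import Decimal
from itertools import accumulate
import operator

_SCAN_OPS = {"SUM": operator.add, "MIN": min, "MAX": max}  # PRODUCT deliberately absent


def inclusive_scan(values, op):
    if op not in _SCAN_OPS:
        raise ValueError(f"{op} is not supported for decimal input in this sketch")
    # accumulate() yields the running result after each element (inclusive scan).
    return list(accumulate(values, _SCAN_OPS[op]))


prices = [Decimal("1.10"), Decimal("0.25"), Decimal("3.00")]
running_sum = inclusive_scan(prices, "SUM")
running_min = inclusive_scan(prices, "MIN")
```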
    
    This PR resolves a part of #3556.
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - Jake Hemstad
      - Mark Harris
    
    URL: #7063
    codereport authored Jan 4, 2021
    fc92bb9
  4. Spark Murmur3 hash functionality(#7024)

    Resolves #6863
    
    Expands the existing murmur3 hashing functionality to match Spark's murmur3 hashing algorithm by modifying tail processing for unaligned bytes and processing booleans as 32-bit integers rather than single bytes.
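
The two differences can be sketched in pure Python. This is an illustration under stated assumptions: the constants are the standard MurmurHash3 x86_32 ones, the tail rule (each trailing byte sign-extended and mixed as its own word) reflects the Spark variant as commonly described, and the function names are hypothetical. It is not the libcudf code and has not been checked against Spark output here.

```python
_MASK = 0xFFFFFFFF


def _rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & _MASK


def _mix_k1(k1):
    k1 = (k1 * 0xCC9E2D51) & _MASK
    k1 = _rotl(k1, 15)
    return (k1 * 0x1B873593) & _MASK


def _mix_h1(h1, k1):
    h1 = _rotl(h1 ^ k1, 13)
    return (h1 * 5 + 0xE6546B64) & _MASK


def _fmix(h1, length):
    h1 ^= length
    h1 ^= h1 >> 16
    h1 = (h1 * 0x85EBCA6B) & _MASK
    h1 ^= h1 >> 13
    h1 = (h1 * 0xC2B2AE35) & _MASK
    return h1 ^ (h1 >> 16)


def hash_bytes_spark_style(data, seed):
    h1 = seed & _MASK
    aligned = len(data) - len(data) % 4
    for i in range(0, aligned, 4):
        h1 = _mix_h1(h1, _mix_k1(int.from_bytes(data[i:i + 4], "little")))
    for byte in data[aligned:]:
        # Assumed Spark-style tail: each byte is sign-extended and fully mixed.
        signed = byte - 256 if byte >= 128 else byte
        h1 = _mix_h1(h1, _mix_k1(signed & _MASK))
    h = _fmix(h1, len(data))
    return h - (1 << 32) if h & 0x80000000 else h  # report as signed 32-bit


def hash_bool_spark_style(flag, seed):
    # Booleans are hashed as the 4-byte integer 0 or 1, not as a single byte.
    return hash_bytes_spark_style(int(flag).to_bytes(4, "little"), seed)
```

Hashing `True` as a 4-byte integer gives a different result from hashing the single byte `0x01`, because the input length feeds into the final mix.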
    
    Authors:
      - Ryan Lee <ryanlee@nvidia.com>
      - rwlee <rwlee@users.noreply.github.com>
    
    Approvers:
      - Jake Hemstad
      - null
      - Robert (Bobby) Evans
      - GALI PREM SAGAR
    
    URL: #7024
    rwlee authored Jan 4, 2021
    8860baf
  5. Upgrade nvcomp to 1.2.1(#7069)

    This version is more friendly to ccache:
    ```console
    ccache -C  # clear the cache
    time mvn clean package -DskipTests
    
    real	4m43.015s
    user	11m18.426s
    sys	0m21.891s
    
    time mvn clean package -DskipTests  # everything is now cached
    
    real	0m20.265s
    user	0m45.810s
    sys	0m3.670s
    ```
    
    Not sure about the ABI flag, but leaving it in causes the .so to not load:
    ```console
    /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java: symbol lookup error: /tmp/nvcomp5478764208255606671.so: undefined symbol: _ZN6nvcomp5Check8not_nullEPKvRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_i
    ```
    
    @jlowe
    
    Authors:
      - Rong Ou <rong.ou@gmail.com>
    
    Approvers:
      - Jason Lowe
    
    URL: #7069
    rongou authored Jan 4, 2021
    d641688

Commits on Jan 5, 2021

  1. Adding decimal writing support to parquet(#7017)

    Since cudf doesn't support precision, the precision must be passed in as a write option. This is handled as a vector of uint8's that indicates the precision of each flattened column in order to support nested types.
    
    Partially closes #6474
    
    Authors:
      - Mike Wilson <knobby@burntsheep.com>
      - Mike Wilson <hyperbolic2346@users.noreply.github.com>
    
    Approvers:
      - Vukasin Milovanovic
      - Mark Harris
    
    URL: #7017
    hyperbolic2346 authored Jan 5, 2021
    31c0d29
  2. Add days check to cudf::is_timestamp using cuda::std::chrono classes(#7028)
    
    Closes #6774 
    
    This PR adds a check for a valid day value for a year/month (if these are specified in the format) in the `cudf::is_timestamp()` API. Also, a chunk of messy year/month/day logic in a related functor was replaced with libcu++ implementation of the `year_month_day()` function instead.
    A gtest is also updated to include a test for an invalid day.
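
The day check amounts to comparing the day against the length of the given month in the given year. Here is a minimal sketch with the standard `calendar` module; `is_valid_date` is a hypothetical helper, not the `cudf::is_timestamp()` implementation.

```python
import calendar


def is_valid_date(year, month, day):
    if not 1 <= month <= 12:
        return False
    # monthrange returns (weekday of the 1st, number of days in the month).
    return 1 <= day <= calendar.monthrange(year, month)[1]
```

For example, February 29 is accepted for 2020 but rejected for 2021.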
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Vukasin Milovanovic
      - Devavret Makkar
      - Karthikeyan
      - Jake Hemstad
    
    URL: #7028
    davidwendt authored Jan 5, 2021
    6ebd264
  3. Only upload packages that were built(#7077)

    Should only upload the packages that were actually built. Project Flash sets `BUILD_CUDF` and `BUILD_LIBCUDF` as needed to control this.
    
    Skipping CI as this change only affects uploads, which aren't tested by CI.
    
    Authors:
      - Raymond Douglass <ray@raydouglass.com>
    
    Approvers:
      - Dillon Cullinan
    
    URL: #7077
    raydouglass authored Jan 5, 2021
    873ab4a
  4. cudf::rolling ROW_NUMBER support for decimal32 and decimal64(#7061)
    
    This PR adds support for `cudf::rolling` for the `ROW_NUMBER` option for `decimal32` and `decimal64`. It also clarifies the documentation.
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - David
      - Devavret Makkar
    
    URL: #7061
    codereport authored Jan 5, 2021
    91322ba
  5. Refactor ORC ProtobufReader to make it more extendable(#7055)

    Related to #5826
    
    Refactor the `ProtobufReader` API to facilitate expansion to support robust reading of column statistics.
    Changes include:
    
    - Move `orc::metadata` from `reader_impl.cu` to `orc.h` so it can be reused for statistics related APIs.
    - Removed duplicated code in `read_orc_statistics` - use `orc::metadata` instead.
    - Rename `ColumnStatistics` to `ColStatsBlob`, since that's what it currently is.
    - Avoid redundant copies in `read_orc_statistics`.
    - Replace `get_u32`, `get_i32`, etc. with templated `get`.
    - Replace per-type functors (e.g. `FieldUInt64`) with templated `field_reader`s to reduce code repetition.
    - The two type-specific parts of `FieldXYZ` functors (field enum and read impl) are now separate to avoid redundant code.
    - `field_reader` dispatches based on the value type, so also added `packed_field_reader` and `raw_field_reader` for packed fields and blob reads (respectively).
    - Replace return value based error checking in `ProtobufReader` with `CUDF_EXPECTS`.
    - Removed `InitSchema` from `ProtobufReader` - schema is only used to determine column names. The names are now lazily calculated in `metadata::get_column_name`
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
      - Vukasin Milovanovic <vukasin.milovanovic.87@gmail.com>
    
    Approvers:
      - Kumar Aatish
      - Conor Hoekstra
    
    URL: #7055
    vuule authored Jan 5, 2021
    7bf0505
  6. Add dictionary support to libcudf groupby functions(#6585)

    Reference #5963 Add dictionary support to groupby.
    
    - [x] argmax
    - [x] argmin
    - [x] collect
    - [x] count
    - [x] max
    - [x] mean* 
    - [x] median
    - [x] min
    - [x] nth element
    - [x] nunique
    - [x] quantile
    - [x] std*
    - [x] sum* 
    - [x] var* 
    
    * _not supported due to 10.2 compile segfault_
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Jake Hemstad
      - Karthikeyan
    
    URL: #6585
    davidwendt authored Jan 5, 2021
    6828e2c

Commits on Jan 6, 2021

  1. Improve digitize API(#7071)

    Closes #6360 
    
    Updates `digitize()` API to accept any non-nullable column-like objects, as opposed to previously only `np.array`. Adds better error handling and further simplifies the implementation.
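
The binning semantics can be sketched with the standard `bisect` module. This hypothetical list-based `digitize` follows the NumPy convention for ascending bins and rejects nulls in `bins`; it is an illustration, not the cuDF implementation.

```python
from bisect import bisect_left, bisect_right


def digitize(values, bins, right=False):
    """Return, for each value, the index of the bin it falls into (bins ascending)."""
    if any(b is None for b in bins):
        raise ValueError("bins must not contain nulls")
    # right=False: bins[i-1] <= x < bins[i]; right=True: bins[i-1] < x <= bins[i].
    locate = bisect_left if right else bisect_right
    return [locate(bins, v) for v in values]


indices = digitize([0.5, 1.0, 2.5, 9.0], bins=[1.0, 2.0, 3.0])
```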
    
    Authors:
      - Michael Wang <isVoid@users.noreply.github.com>
    
    Approvers:
      - Ashwin Srinath
    
    URL: #7071
    isVoid authored Jan 6, 2021
    c0920e6
  2. Add unstack() support for non-multiindexed dataframes(#7054)

    Closes #6694 
    
    When `unstack()` receives a dataframe with a single-level index, it returns a series to match pandas behavior.
    
    Authors:
      - Michael Wang <isVoid@users.noreply.github.com>
    
    Approvers:
      - null
    
    URL: #7054
    isVoid authored Jan 6, 2021
    1930432
  3. Implement __hash__ method for ListDtype(#7081)

    Fixes: #7070 
    
    This PR fixes a failure in `to_pandas` when `nullable` is set to `True`. The changes in this PR implement `__hash__` in `ListDtype`.
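
For background, a Python class that defines `__eq__` without `__hash__` becomes unhashable, so it cannot be used as a dictionary key or set member. Whether that is exactly what failed inside `to_pandas` is an assumption; the two classes below are hypothetical stand-ins for a list dtype, not the cudf implementation.

```python
class UnhashableListDtype:
    """Defines __eq__ only, so Python sets __hash__ to None."""

    def __init__(self, element_type):
        self.element_type = element_type

    def __eq__(self, other):
        return (
            isinstance(other, UnhashableListDtype)
            and self.element_type == other.element_type
        )


class HashableListDtype:
    """Defines __eq__ and a consistent __hash__."""

    def __init__(self, element_type):
        self.element_type = element_type

    def __eq__(self, other):
        return (
            isinstance(other, HashableListDtype)
            and self.element_type == other.element_type
        )

    def __hash__(self):
        # Equal dtypes must hash equally, so hash the same state __eq__ compares.
        return hash(("list", self.element_type))


lookup = {HashableListDtype("int64"): "ok"}
```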
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Ashwin Srinath
    
    URL: #7081
    galipremsagar authored Jan 6, 2021
    8787a64

Commits on Jan 7, 2021

  1. JNI support for creating struct column from existing columns and fixed bug in struct with no children(#7084)
    
    The primary goal of this is to add Java APIs to create a struct column from other existing columns. As part of this work I found a very small bug in the column_vector constructor that copies data from a column view for a struct column with no children. Spark supports this use case, so I thought it would be good to test and fix the issue.
    
    Authors:
      - Robert (Bobby) Evans <bobby@apache.org>
    
    Approvers:
      - Kuhu Shukla (@kuhushukla)
      - Vukasin Milovanovic (@vuule)
      - Jason Lowe (@jlowe)
    
    URL: #7084
    revans2 authored Jan 7, 2021
    f768da7
  2. Handle nan values correctly in Series.one_hot_encoding(#7059)

    Fixes: #7056 
    
    This PR handles `nan` values separately in `one_hot_encoding` when the given input category is `None`. Previously, `nan` and `<NA>` values were treated as the same when the category was `None`.
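
The distinction can be illustrated with a small pure-Python sketch, where `None` stands in for `<NA>` and `float("nan")` for `nan`. The helper `one_hot_sketch` is hypothetical and only models the intended result, not the cudf code path.

```python
import math


def _is_nan(x):
    return isinstance(x, float) and math.isnan(x)


def one_hot_sketch(values, cats):
    """Return one 0/1 indicator list per category, in the order of `cats`."""

    def matches(value, cat):
        if cat is None:
            return value is None  # only true nulls match the None category
        if _is_nan(cat):
            return _is_nan(value)  # nan matches nan, never null
        return value == cat

    return [[int(matches(v, c)) for v in values] for c in cats]


encoded = one_hot_sketch([1.0, None, float("nan")], [1.0, None])
```

With the null and `nan` checks kept apart, the `None` category marks only the second element.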
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Keith Kraus (@kkraus14)
    
    URL: #7059
    galipremsagar authored Jan 7, 2021
    9439ed8
  3. Add pyorc to dev environment(#7085)

    This PR adds `pyorc` package to dev environment yml files.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Christopher Harris (@cwharris)
      - AJ Schmidt (@ajschmidt8)
    
    URL: #7085
    galipremsagar authored Jan 7, 2021
    ee65a47
  4. Add informative error message for sep in CSV writer(#7095)

    Fixes: #7091 
    
    This PR validates the `sep` parameter in the CSV writer and raises informative error messages for invalid values.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - Vukasin Milovanovic (@vuule)
    
    URL: #7095
    galipremsagar authored Jan 7, 2021
    aa38f85

Commits on Jan 8, 2021

  1. Add Java tests for decimal casts(#7051)

    This pull request adds tests that verify decimal cast support in the Java package.
    
    Authors:
      - sperlingxx <lovedreamf@gmail.com>
    
    Approvers:
      - Jason Lowe (@jlowe)
    
    URL: #7051
    sperlingxx authored Jan 8, 2021
    30e154c
  2. Add groupby docs(#7100)

    Closes #6858. Adds a "GroupBy" page to our docs.
    
    Authors:
      - Ashwin Srinath <shwina@users.noreply.github.com>
    
    Approvers:
      - Keith Kraus (@kkraus14)
    
    URL: #7100
    shwina authored Jan 8, 2021
    04aa30c

Commits on Jan 11, 2021

  1. Adds in JNI support for creating a list column from existing columns(#7112)
    
    Adds Java support for creating a list column from other existing columns.
    
    Authors:
      - Robert (Bobby) Evans <bobby@apache.org>
    
    Approvers:
      - Jason Lowe (@jlowe)
    
    URL: #7112
    revans2 authored Jan 11, 2021
    11ebc3e

Commits on Jan 12, 2021

  1. Add segmented_gather(list_column, gather_list)(#7003)

    closes #6542
    
    - [x] Add segmented_gather(list, list)
    - [x] Add unit tests
    - [x] Documentation
    
    Authors:
      - Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
      - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - @nvdbaranec
      - AJ Schmidt (@ajschmidt8)
      - Jake Hemstad (@jrhemstad)
    
    URL: #7003
    karthikeyann authored Jan 12, 2021
    87e414c
  2. verify window operations on decimal with java tests(#7120)

    This pull request verifies window operations on decimal columns in the Java package, as required by spark-rapids [issue 1333](NVIDIA/spark-rapids#1333).
    
    Authors:
      - sperlingxx <lovedreamf@gmail.com>
    
    Approvers:
      - Robert (Bobby) Evans (@revans2)
    
    URL: #7120
    sperlingxx authored Jan 12, 2021
    9a66576
  3. Add scale and value methods to fixed_point(#7109)

    This PR adds `fixed_point::scale()` and `fixed_point::value()`. It enables developers to avoid the following piece of code (which is how you can currently access scale and value).
    ```cpp
    auto si = numeric::scaled_integer<rep_type>{value};
    // use si.value or si.scale
    ```
    Note that this PR should be merged after #7105 (or I can resolve the conflict if it gets merged first).
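
To make the two accessors concrete, here is a pure-Python sketch of a base-10 fixed-point number stored as an integer plus a scale, assuming the represented number is `value * 10**scale`. The class `FixedPointSketch` is a hypothetical model, not the libcudf `fixed_point` type.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class FixedPointSketch:
    _value: int  # underlying integer representation
    _scale: int  # base-10 exponent; negative means fractional digits

    def value(self):
        """Return the stored integer, without applying the scale."""
        return self._value

    def scale(self):
        """Return the base-10 exponent."""
        return self._scale

    def to_fraction(self):
        """Return the exact represented number, value * 10**scale."""
        return Fraction(self._value) * Fraction(10) ** self._scale


price = FixedPointSketch(1234, -2)  # models 12.34
```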
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
      - Conor Hoekstra <36027403+codereport@users.noreply.github.com>
    
    Approvers:
      - MithunR (@mythrocks)
      - David (@davidwendt)
    
    URL: #7109
    codereport authored Jan 12, 2021
    4da8312
  4. Handle nested string columns with no children in contiguous_split.(#6864)
    
    Fixes a specific corner case: string columns with no children (a special form of empty string column that can occur) that are nested inside a list (or struct) column.
    
    This would be useful as a 0.17 PR but isn't strictly necessary, since it's pretty late.
    
    Edit: Updated the fix so that it always includes a record for src/dst buffers, even if they are of size 0 or have null data pointers. The previous method, which only checked whether the data pointer was null, was unclean and didn't handle a particularly strange case that came up with the Spark plugin: the plugin was reconstructing columns (on the receiver side of a shuffle) that had size 0 but a non-null data pointer. This is technically legal but super weird.
    
    Authors:
      - Dave Baranec <dbaranec@nvidia.com>
      - Karthikeyan <6488848+karthikeyann@users.noreply.github.com>
    
    Approvers:
      - Karthikeyan (@karthikeyann)
      - Alfred Xu (@sperlingxx)
      - Devavret Makkar (@devavret)
    
    URL: #6864
    nvdbaranec authored Jan 12, 2021
    d791e20
  5. Add cudf::binary_operation NULL_MIN, NULL_MAX & NULL_EQUALS for `decimal32` and `decimal64`(#7119)
    
    This PR resolves #7115.
    
    Add `cudf::binary_operation` support for `NULL_MAX`, `NULL_MIN` and `NULL_EQUALS` for `decimal32` and `decimal64`.
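
A minimal sketch of the null-aware semantics, under the assumption that these operators behave as follows: `NULL_EQUALS` is true when both operands are null and false when exactly one is, while `NULL_MAX` / `NULL_MIN` return the non-null operand when one side is null and null when both are. `None` models a null and `decimal.Decimal` models a decimal value; the function names are illustrative.

```python
from decimal import Decimal


def null_equals(a, b):
    """Null-aware equality: two nulls compare equal, null vs value does not."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def null_max(a, b):
    """Return the larger operand, ignoring a null on either side."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def null_min(a, b):
    """Return the smaller operand, ignoring a null on either side."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)


left = [Decimal("1.10"), None, Decimal("3.00"), None]
right = [Decimal("1.1"), Decimal("2.50"), None, None]
equal = [null_equals(a, b) for a, b in zip(left, right)]
largest = [null_max(a, b) for a, b in zip(left, right)]
```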
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - Mark Harris (@harrism)
      - David (@davidwendt)
      - Mike Wilson (@hyperbolic2346)
    
    URL: #7119
    codereport authored Jan 12, 2021
    9790ff7
  6. Build libcudf with -Wall(#7105)

    I discovered we're not building libcudf with the `-Wall` GCC flag. This PR enables `-Wall` for GCC and nvcc, and fixes most of the errors.
    
    ~~The only error I haven't fixed yet is `-Werror=uninitialized` on [this line](https://github.com/rapidsai/cudf/blob/branch-0.18/cpp/include/cudf/scalar/scalar.hpp#L334), but @codereport is on it.~~ Fixed ✔️
    
    Authors:
      - ptaylor <paul.e.taylor@me.com>
      - Conor Hoekstra <codereport@outlook.com>
      - Paul Taylor <paul.e.taylor@me.com>
    
    Approvers:
      - Conor Hoekstra (@codereport)
      - Keith Kraus (@kkraus14)
      - Mark Harris (@harrism)
    
    URL: #7105
    trxcllnt authored Jan 12, 2021
    68d4791

Commits on Jan 13, 2021

  1. update GDS/cuFile location for 0.9 release(#7131)

    The GDS 0.9 release changed the location where it puts the header and library files.
    
    @jlowe @kkraus14
    
    Authors:
      - Rong Ou <rong.ou@gmail.com>
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - Jason Lowe (@jlowe)
    
    URL: #7131
    rongou authored Jan 13, 2021
    0c7b36e
  2. Fix compilation failure caused by -Wall addition.(#7134)

    #7105 somehow got merged but broke compilation. These are the necessary fixes.
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - Paul Taylor (@trxcllnt)
      - Vukasin Milovanovic (@vuule)
    
    URL: #7134
    codereport authored Jan 13, 2021
    e647d1a
  3. Fix compilation errors in libcudf(#7138)

    After the recent changes to libcudf compilation in #7105, compiling libcudf on my local machine was broken; these changes fix the compilation errors.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - Devavret Makkar (@devavret)
      - David (@davidwendt)
    
    URL: #7138
    galipremsagar authored Jan 13, 2021
    e0e2cf8

Commits on Jan 15, 2021

  1. Fastpath single strings column in cudf::sort(#7075)

    Closes #7027 
    
    The internal `cudf::strings::detail::sort()` function is faster at sorting a single strings column than `cudf::sort`. Details are in the #7027 comments.
    
    Results using the sort gbenchmark:
    
    ```
    Baseline:
    SortStrings/stringssort/1024/manual_time               1.18 ms         1.20 ms          593
    SortStrings/stringssort/4096/manual_time               1.98 ms         2.00 ms          352
    SortStrings/stringssort/32768/manual_time              2.73 ms         2.75 ms          256
    SortStrings/stringssort/262144/manual_time             4.36 ms         4.38 ms          160
    SortStrings/stringssort/2097152/manual_time            66.2 ms         66.2 ms           10
    SortStrings/stringssort/16777216/manual_time            547 ms          548 ms            1
    
    Calling cudf::strings::detail::sort from cudf::sort:
    SortStrings/stringssort/1024/manual_time              0.692 ms        0.711 ms         1002
    SortStrings/stringssort/4096/manual_time               1.13 ms         1.15 ms          615
    SortStrings/stringssort/32768/manual_time              1.59 ms         1.61 ms          440
    SortStrings/stringssort/262144/manual_time             2.82 ms         2.84 ms          247
    SortStrings/stringssort/2097152/manual_time            43.1 ms         43.1 ms           16
    SortStrings/stringssort/16777216/manual_time            386 ms          386 ms            2
    
    ```
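
The speedup implied by the manual-time column can be worked out directly. The snippet below only restates the figures from the two benchmark runs above; the variable names are illustrative.

```python
# Manual time in milliseconds per input size, copied from the benchmark output.
baseline_ms = {
    1024: 1.18, 4096: 1.98, 32768: 2.73,
    262144: 4.36, 2097152: 66.2, 16777216: 547.0,
}
fastpath_ms = {
    1024: 0.692, 4096: 1.13, 32768: 1.59,
    262144: 2.82, 2097152: 43.1, 16777216: 386.0,
}

# Ratio of baseline time to fast-path time for each input size.
speedup = {rows: baseline_ms[rows] / fastpath_ms[rows] for rows in baseline_ms}

for rows, ratio in speedup.items():
    print(f"{rows:>9} rows: {ratio:.2f}x")
```

Across the measured sizes the ratio stays between roughly 1.4x and 1.75x.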
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - AJ Schmidt (@ajschmidt8)
      - Conor Hoekstra (@codereport)
      - Jake Hemstad (@jrhemstad)
      - Christopher Harris (@cwharris)
    
    URL: #7075
    davidwendt authored Jan 15, 2021
    c2e9ffd
  2. Fix JIT cache multi-process test flakiness in slow drives(#7142)

    Fixes #6716
    
    Authors:
      - Devavret Makkar <dmakkar@nvidia.com>
    
    Approvers:
      - @nvdbaranec
      - David (@davidwendt)
    
    URL: #7142
    devavret authored Jan 15, 2021
    5828cef
  3. Add gbenchmarks for reduction aggregations any() and all()(#7129)

    While investigating the long compile times of the reduction source files `any.cu` and `all.cu`, I found it necessary to build a gbenchmark to ensure changes did not affect the performance of these functions.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Conor Hoekstra (@codereport)
      - Vukasin Milovanovic (@vuule)
      - Paul Taylor (@trxcllnt)
      - Keith Kraus (@kkraus14)
    
    URL: #7129
    davidwendt authored Jan 15, 2021
    bce9552
  4. Add documentation for support dtypes in all IO formats(#7139)

    Fixes: #7103 
    
    This PR introduces:
    
    - [x] A new doc page containing a matrix of the dtypes and IO formats currently supported by cudf. The matrix lists whether each dtype is supported by a reader / writer. The screenshot below shows how the table looks.
    - [x] Informative error messages in some IO readers/writers.
    - [x] An error raised by the ORC writer if there is any categorical data.
    
    ![Screenshot from 2021-01-15 09-40-57](https://user-images.githubusercontent.com/11664259/104747156-cb335200-5715-11eb-92b3-85a246fbdc8a.png)
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
      - GALI PREM SAGAR <sagarprem75@gmail.com>
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - Ashwin Srinath (@shwina)
      - AJ Schmidt (@ajschmidt8)
      - Keith Kraus (@kkraus14)
    
    URL: #7139
    galipremsagar authored Jan 15, 2021
    e86cc65

Commits on Jan 18, 2021

  1. cudf::rolling_window SUM support for decimal32 and decimal64(#7147)
    
    This PR resolves #7117 by adding support for `cudf::rolling` for the `SUM` option for `decimal32` and `decimal64`.
    
    Authors:
      - Conor Hoekstra <codereport@outlook.com>
    
    Approvers:
      - David (@davidwendt)
      - Karthikeyan (@karthikeyann)
    
    URL: #7147
    codereport authored Jan 18, 2021
    835ccf9

Commits on Jan 19, 2021

  1. Enable logic for GPU auto-detection in cudfjni(#7155)

    Allow overriding `GPU_ARCHS` with an empty string in cudfjni to enable automatic detection
    ```bash
    mvn clean install -DARROW_STATIC_LIB=ON -DBoost_USE_STATIC_LIBS=ON -DGPU_ARCHS=
    ...
         [exec] -- CUDA_VERSION: 11.0
         [exec] Auto detection of gpu-archs: 75
         [exec] GPU_ARCHS = 75
    ```
    
    Allow the `--h[elp]` switch in `$CUDF_HOME/build.sh`.
    
    Authors:
      - Gera Shegalov <gshegalov@nvidia.com>
      - Gera Shegalov <gera@apache.org>
    
    Approvers:
      - Jason Lowe (@jlowe)
      - Keith Kraus (@kkraus14)
    
    URL: #7155
    gerashegalov authored Jan 19, 2021
    e8ecb24
  2. Fix comparisons between Series and cudf.NA(#7072)

    Fixes #7043, gives less than ideal results due to #7066.
    
    Authors:
      - brandon-b-miller <brmiller@nvidia.com>
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
    
    URL: #7072
    brandon-b-miller authored Jan 19, 2021
    8d80d5c
  3. Fixing parquet precision writing failing if scale is equal to precision(#7146)
    
    @razajafri noticed that precision could not be equal to scale when writing decimals. This should be allowed and this fixes that and adds a test to verify it.
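
To see why precision equal to scale is a valid combination: with precision 3 and scale 3, all three digits sit after the decimal point, so a value such as 0.123 fits while 1.5 does not. The check below is a sketch of that rule using Python's `decimal` module; `fits_decimal` is a hypothetical helper, not the writer's actual validation.

```python
from decimal import Decimal


def fits_decimal(value, precision, scale):
    """Return True if `value` is representable with the given precision/scale."""
    if not 0 <= scale <= precision:  # scale == precision is allowed
        return False
    unscaled = value.scaleb(scale)  # shift the decimal point right by `scale`
    is_whole = unscaled == unscaled.to_integral_value()
    return is_whole and abs(unscaled) < 10 ** precision


ok = fits_decimal(Decimal("0.123"), precision=3, scale=3)
too_big = fits_decimal(Decimal("1.5"), precision=3, scale=3)
```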
    
    closes #7145
    
    Authors:
      - Mike Wilson <knobby@burntsheep.com>
    
    Approvers:
      - Raza Jafri (@razajafri)
      - Vukasin Milovanovic (@vuule)
    
    URL: #7146
    hyperbolic2346 authored Jan 19, 2021
    b0525f4

Commits on Jan 20, 2021

  1. Fix -Werror=sign-compare errors in device code(#7164)

    Not sure why these aren't being caught in local 10.2 envs or CI builds, but I can't build a local CUDA 11.0 env due to a mamba bug.
    
    Authors:
      - ptaylor <paul.e.taylor@me.com>
    
    Approvers:
      - Mark Harris (@harrism)
      - David (@davidwendt)
    
    URL: #7164
    trxcllnt authored Jan 20, 2021
    5828be5
  2. Update doxyfile project number(#7161)

    The Doxyfile project number is set to 0.16. I know I've seen it in the UI before but cannot find it now. I've updated the number just in case. And I've added a line to the update-version.sh (thanks @ajschmidt8 ) to automatically update the file when a new release is created.
    
    Neither of these 2 files affects the CI/CD build.
    
    Authors:
      - davidwendt <dwendt@nvidia.com>
    
    Approvers:
      - Karthikeyan (@karthikeyann)
      - AJ Schmidt (@ajschmidt8)
      - Mark Harris (@harrism)
    
    URL: #7161
    davidwendt authored Jan 20, 2021
    7df4a4c
  3. Add libcudf API for parsing of ORC statistics(#7136)

    Implementation of the feature includes:
    - Renamed libcudf `read_orc_statistics` to `read_raw_orc_statistics` to make a distinction from the new function.
    - Changed the `read_raw_orc_statistics` return type to `raw_orc_statistics` instead of the vector with heterogeneous data.
    - Added `read_parsed_orc_statistics` that also parses the statistics blobs to make the API usable without the Python layer.
    - Fixed a few compiler warnings (i.e. errors).
    - Added read functions for statistics to ProtobufReader.
    - Added support for optional fields to ProtobufReader (such fields are `std::unique_ptr` for now).
    
    Other changes:
    
    - Renamed the existing ORC statistics API to `read_raw_orc_statistics`.
    - Replaced some explicit H2D and D2H copies with appropriate abstractions.
    - Enabled several ORC tests for bool columns that were missed when the support for such columns was added.
    - Remove unused `zigzag(uint64_t)`.
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
      - Vukasin Milovanovic <vmilovanovic@nvidia.com>
    
    Approvers:
      - @brandon-b-miller
      - GALI PREM SAGAR (@galipremsagar)
      - Conor Hoekstra (@codereport)
      - Mark Harris (@harrism)
    
    URL: #7136
    vuule authored Jan 20, 2021
    3e0af46
  4. Update s3 tests to use moto_server(#7144)

    This PR updates s3 tests to use `moto_server` instead of going via a moto mock_s3 context. This enables cleaner s3 testing with `s3fs>=0.5` which incorporates aiobotocore for s3 connections.
    
    
    - The pytest run starts up a `moto_server` for each worker running tests.
    - Ports used: `5000, 5550 - 5550+ (n_pytest_workers-1)`
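
One way to read the port list above is that a non-distributed run uses port 5000 and each parallel worker gets 5550 plus its index. The sketch below encodes that reading. The mapping, the `moto_port` helper, and the reliance on pytest-xdist style worker ids (`master`, `gw0`, `gw1`, ...) are assumptions, not details confirmed by this PR.

```python
def moto_port(worker_id):
    """Map a pytest-xdist style worker id to a distinct server port."""
    if worker_id == "master":  # single-process run, no xdist workers
        return 5000
    if not worker_id.startswith("gw"):
        raise ValueError(f"unexpected worker id: {worker_id!r}")
    return 5550 + int(worker_id[2:])  # gw0 -> 5550, gw1 -> 5551, ...


ports = [moto_port(w) for w in ("master", "gw0", "gw1", "gw2")]
```

Giving every worker its own port avoids two servers trying to bind the same address when tests run in parallel.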
    
    Updated integration repo with requirements: rapidsai/integration#207
    
    Authors:
      - Ayush Dattagupta <ayushdg95@gmail.com>
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
      - Keith Kraus (@kkraus14)
    
    URL: #7144
    ayushdg authored Jan 20, 2021
    5855bfa
  5. Fix importing list & struct types in from_arrow(#7162)

    Fixes: #7137, #7148
    
    This PR fixes converting a pyarrow table which has list and struct types via `from_arrow`. In case of the `list` dtype we shouldn't have to perform any typecast, and in case of the `struct` dtype we should be renaming the fields appropriately.
    
    Authors:
      - galipremsagar <sagarprem75@gmail.com>
    
    Approvers:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - Keith Kraus (@kkraus14)
    
    URL: #7162
    galipremsagar authored Jan 20, 2021
    0515a42
  6. Replace offsets with iterators in cuIO utilities and CSV parser(#7150)

    Closes #6210 
    
    - Rename datetime utils to snake_case.
    - Rename datetime utils so that `parse_xyz` functions move the input iterator past the parsed value, and `to_xyz` functions do not change the input iterators.
    - Replace `findFirstOccurrence` with `thrust::find`.
    - Replace use of offsets with pointers in `datetime.cuh` and `csv_gpu.cu`.
    - Rename some variables in the CSV parser to make the code clearer.
    
    Note: the semantics of variables/parameters did not change in this PR - `T* end` points to the last element in the range in many places.
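
The `parse_xyz` / `to_xyz` naming convention and the inclusive `end` can be illustrated with a small Python analogue, where integer positions stand in for the C++ iterators. Both helpers are hypothetical and only model the convention described above.

```python
def to_int(text, begin, end):
    """Convert the digits in text[begin..end] (end inclusive).

    Like a `to_xyz` function, it does not advance any position.
    """
    return int(text[begin:end + 1])


def parse_int(text, pos, end):
    """Read digits starting at `pos`, stopping at the inclusive `end`.

    Like a `parse_xyz` function, it returns the value together with the
    position just past the parsed digits.
    """
    start = pos
    while pos <= end and text[pos].isdigit():
        pos += 1
    return int(text[start:pos]), pos


date = "2021-01-20"
year, after_year = parse_int(date, 0, len(date) - 1)
month = to_int(date, 5, 6)
```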
    
    Authors:
      - vuule <vmilovanovic@nvidia.com>
      - Vukasin Milovanovic <vmilovanovic@nvidia.com>
    
    Approvers:
      - Mark Harris (@harrism)
      - Christopher Harris (@cwharris)
    
    URL: #7150
    vuule authored Jan 20, 2021
    36f85dc
  7. Cross link RMM & libcudf Doxygen docs(#7149)

    This PR updates `cpp/doxygen/Doxyfile` to consume the generated Doxygen tags from `rmm` (see rapidsai/rmm#672). This will enable linking between the `cudf` docs and `rmm` docs.
    
    This PR along with rapidsai/rmm#672 closes issue #5152.
    
    I also updated `docs/cudf/source/conf.py` and added it to `update-version.sh`.
    
    Authors:
      - AJ Schmidt <aschmidt@nvidia.com>
      - AJ Schmidt <ajschmidt8@users.noreply.github.com>
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - @nvdbaranec
      - Dillon Cullinan (@dillon-cullinan)
      - Karthikeyan (@karthikeyann)
    
    URL: #7149
    ajschmidt8 authored Jan 20, 2021
    d79da2c
  8. Add MultiIndex.rename API(#7172)

    Closes #7057 
    
    Properly overrides `MultiIndex.rename` from `Index.rename`, reusing API from `MultiIndex.set_names`.
    
    Authors:
      - Michael Wang <isVoid@users.noreply.github.com>
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
    
    URL: #7172
    isVoid authored Jan 20, 2021
    02e25b6
  9. Implement cudf::group_by (sort) for decimal32 and decimal64 (#7169)
    
    This PR resolves a part of #3556.
    
    I decided to push the changes for sort `cudf::group_by` and hash `group_by` in different PRs.
    
    Authors:
      - Conor Hoekstra (@codereport)
    
    Approvers:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - Karthikeyan (@karthikeyann)
    
    URL: #7169
    codereport authored Jan 20, 2021
    27893db
  10. Add encoding and compression argument to CSV writer (#7168)

    This PR closes #7083 by adding an encoding argument to our CSV writer. It also adds a compression argument to the writer.
    
    This will help address some issues with featuretools compatibility [PR](alteryx/featuretools#1246).
    
    Authors:
      - Vibhu Jawa (@VibhuJawa)
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
      - Michael Wang (@isVoid)
    
    URL: #7168
    VibhuJawa authored Jan 20, 2021
    95059b8
  11. Enable round in cudf for DataFrame and Series (#7022)

    This enables round for DataFrames and Series using the libcudf round implementation and removes the old numba round implementation.
    
    Closes #1270
    
    Authors:
      - @ChrisJar
    
    Approvers:
      - Ashwin Srinath (@shwina)
      - Michael Wang (@isVoid)
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - GALI PREM SAGAR (@galipremsagar)
    
    URL: #7022
    ChrisJar authored Jan 20, 2021
    a51caa5
  12. Export mock aws credentials for s3 tests (#7176)

    Fixes issue for failing s3 tests on machines missing aws credentials. 
    The PR exports fake credentials to prevent botocore from looking for aws credentials on the machine.
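
A minimal sketch of the idea, assuming the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables are what the client library looks up. The helper name and the placeholder value are hypothetical.

```python
import os


def export_mock_aws_credentials(env=None):
    """Set placeholder AWS credentials in `env` (defaults to os.environ).

    Existing values are left untouched so real credentials are not clobbered.
    """
    env = os.environ if env is None else env
    env.setdefault("AWS_ACCESS_KEY_ID", "testing")
    env.setdefault("AWS_SECRET_ACCESS_KEY", "testing")
    return env


# Demonstrated on a plain dict to avoid mutating the real environment.
fake_env = export_mock_aws_credentials({})
```

Using `setdefault` is a design choice for the sketch; a test suite that must never touch real credentials might overwrite the variables unconditionally instead.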
    
    Authors:
      - Ayush Dattagupta (@ayushdg)
    
    Approvers:
      - Keith Kraus (@kkraus14)
    
    URL: #7176
    ayushdg authored Jan 20, 2021
    81952d0

Commits on Jan 21, 2021

  1. Replace ORC writer api with class (#7099)

    Replaces the API with a class for the chunked ORC writer to ease usage; see #6911 for additional information.
    
    This PR also adds support for ORC chunked writing in Python, along with test cases.
    
    Authors:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - Jason Lowe (@jlowe)
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - GALI PREM SAGAR (@galipremsagar)
      - Devavret Makkar (@devavret)
      - AJ Schmidt (@ajschmidt8)
      - Jason Lowe (@jlowe)
      - Robert (Bobby) Evans (@revans2)
    
    URL: #7099
    rgsl888prabhu authored Jan 21, 2021
    6390498
  2. Java bindings for Fixed-point type support for Parquet (#7153)

    Adds Java support for writing fixed-point types to Parquet.
    
    Authors:
      - Raza Jafri (@razajafri)
    
    Approvers:
      - Karthikeyan (@karthikeyann)
      - Jason Lowe (@jlowe)
    
    URL: #7153
    razajafri authored Jan 21, 2021
    4111cb7
  3. Add support for array-like inputs in cudf.get_dummies (#7181)

    Fixes: #7031
    
    This PR introduces support for array-like inputs in `cudf.get_dummies`. I think in the near future we will have to deprecate `get_dummies` and adopt a new name: pandas-dev/pandas#35724
    
    Authors:
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - Keith Kraus (@kkraus14)
    
    URL: #7181
    galipremsagar authored Jan 21, 2021
    4c6a57c
  4. Implement update() function (#6883)

    Resolves: #5543
    
    This PR adds support for updating a DataFrame with non-NA values from another DataFrame, whereby only the values at matching index/column labels are updated. Only left join is supported, keeping the index and columns of the original DataFrame.
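
The update rule can be sketched with nested dictionaries standing in for DataFrames (`{column: {index_label: value}}`), with `None` modelling NA. The helper `update_sketch` is hypothetical and only illustrates the semantics described above.

```python
def update_sketch(left, right):
    """Overwrite values in `left` with non-null values from `right`.

    Only cells whose column and index label exist in `left` are touched,
    so the shape of `left` never changes (left-join behaviour).
    """
    for column, right_values in right.items():
        if column not in left:
            continue  # columns missing from `left` are ignored
        for label, value in right_values.items():
            if label in left[column] and value is not None:
                left[column][label] = value
    return left


left = {"a": {0: 1, 1: 2, 2: 3}, "b": {0: 10, 1: 20, 2: 30}}
right = {"b": {0: None, 1: 200, 5: 500}, "c": {0: 7}}
update_sketch(left, right)
```

Here only `b` at label 1 changes: the null at label 0 is skipped, label 5 does not exist in `left`, and column `c` is not part of `left`.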
    
    Authors:
      - @skirui-source
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
      - Michael Wang (@isVoid)
    
    URL: #6883
    skirui-source authored Jan 21, 2021
    6c116e3

Commits on Jan 22, 2021

  1. Add Python DecimalColumn (#6715)

    Resolves #6657.
    
    Authors:
      - Ashwin Srinath (@shwina)
      - Conor Hoekstra (@codereport)
      - Keith Kraus (@kkraus14)
    
    Approvers:
      - Karthikeyan (@karthikeyann)
      - Keith Kraus (@kkraus14)
      - Vukasin Milovanovic (@vuule)
    
    URL: #6715
    shwina authored Jan 22, 2021
    797f004
  2. Fix fillna & dropna to also consider np.nan as a missing value (#7019)
    
    Fixes: #7007 
    
    This PR introduces changes to handle the filling of `np.nan` values in `fillna` by converting `nan` to `null`. This fix surfaced an issue with `can_cast_safely`, which allowed converting a float column containing `nan`s to an `int` column. This is incorrect, so a check was added to return False if there is at least one `nan` value in the float column.
    
    `nan` was not being handled in `dropna` either, although it is handled in `isna`, so this PR introduces changes to handle `nan` in `dropna` too.
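
The behaviour can be modelled in pure Python, with `None` standing in for a null. All helper names below are hypothetical, and the cast check is a simplified reading of the rule described above, not the cudf implementation of `can_cast_safely`.

```python
import math


def nans_to_nulls(values):
    """Replace float nan with None so both count as missing."""
    return [None if isinstance(v, float) and math.isnan(v) else v for v in values]


def fillna_sketch(values, fill_value):
    """Fill nulls and nans alike."""
    return [fill_value if v is None else v for v in nans_to_nulls(values)]


def dropna_sketch(values):
    """Drop nulls and nans alike."""
    return [v for v in nans_to_nulls(values) if v is not None]


def can_cast_float_to_int_safely(values):
    """Refuse the cast if any value is nan or has a fractional part."""
    return all(
        v is None or (not math.isnan(v) and float(v).is_integer())
        for v in values
    )


column = [1.0, float("nan"), None, 4.0]
filled = fillna_sketch(column, 0.0)
dropped = dropna_sketch(column)
```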
    
    Authors:
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - Christopher Harris (@cwharris)
      - Keith Kraus (@kkraus14)
    
    URL: #7019
    galipremsagar authored Jan 22, 2021
    78113f5

Commits on Jan 23, 2021

  1. Adding unit tests for fixed_point with extremely large scales (#7178)
    
    After a discussion with Keith and Ashwin today, I realized `libcudf` was missing a couple corner cases for `fixed_point` and decided to open a small PR to add them.
    
    Authors:
      - Conor Hoekstra (@codereport)
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - Keith Kraus (@kkraus14)
      - Mike Wilson (@hyperbolic2346)
      - @nvdbaranec
      - Mark Harris (@harrism)
    
    URL: #7178
    codereport authored Jan 23, 2021
    70cefa4
  2. Add CudfSeriesGroupBy to optimize dask_cudf groupby-mean (#7194)

    Adds a `CudfSeriesGroupBy` class to dask_cudf.  This allows the optimizations from #6248 to be used for `CudfSeriesGroupBy.mean` (in addition to `CudfDataFrameGroupBy.aggregate`).
    
    Authors:
      - Richard (Rick) Zamora (@rjzamora)
    
    Approvers:
      - Keith Kraus (@kkraus14)
    
    URL: #7194
    rjzamora authored Jan 23, 2021
    2e0889a

Commits on Jan 25, 2021

  1. Adding support for explode to cuDF (#7140)

    This operation expands lists into rows and duplicates the corresponding values from the other columns. See issue #6151 for a full explanation.
    
    Partially fixes #6151.
    
    `pos_explode` support is still missing and is required to fully close #6151.
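
    As a rough illustration of the row expansion, here is a plain-Python sketch. It is not the libcudf implementation: the `explode` helper and its dict-of-lists table are hypothetical, and the sketch assumes there are no null lists (an empty list simply contributes no rows here).

```python
def explode(table, column):
    """Expand each list in `column` into one row per element.

    `table` maps column names to equal-length lists of row values.
    Values in the other columns are repeated for every element produced.
    """
    names = list(table)
    out = {name: [] for name in names}
    for row in range(len(table[column])):
        for element in table[column][row]:
            for name in names:
                out[name].append(element if name == column else table[name][row])
    return out
```

    For example, exploding column `a` of `{"a": [[1, 2], [3]], "b": ["x", "y"]}` repeats `"x"` once for each of the two elements in the first list.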
    
    Authors:
      - Mike Wilson (@hyperbolic2346)
    
    Approvers:
      - Robert (Bobby) Evans (@revans2)
      - Jake Hemstad (@jrhemstad)
      - Karthikeyan (@karthikeyann)
      - @nvdbaranec
    
    URL: #7140
    hyperbolic2346 authored Jan 25, 2021
    f422391
  2. Add libcudf lists column count_elements API (#7173)

    This adds the libcudf part of #7157 
    
    ```
    std::unique_ptr<column> cudf::lists::count_elements(
      lists_column_view const& input,
      rmm::mr::device_memory_resource* mr);
    ```
    
    Returns the size of each element in the input lists column.
    The PR also includes gtests for this new API.
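
    A minimal plain-Python model of the result, for illustration only. The `count_elements` helper below is hypothetical, and mapping a null list row (`None`) to a null count is an assumption of this sketch.

```python
def count_elements(lists):
    """Return the number of elements in each list row; None rows stay None."""
    return [None if row is None else len(row) for row in lists]
```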
    
    Authors:
      - David (@davidwendt)
    
    Approvers:
      - @nvdbaranec
      - AJ Schmidt (@ajschmidt8)
      - Karthikeyan (@karthikeyann)
      - Mark Harris (@harrism)
    
    URL: #7173
    davidwendt authored Jan 25, 2021
    6c2675c
  3. Add Java interface for the new API 'explode' (#7151)

    This PR adds a Java interface for the new `explode` API, along with its unit tests.
    
    This PR depends on #7140.
    
    Authors:
      - Liangcai Li (@firestarman)
    
    Approvers:
      - Jason Lowe (@jlowe)
      - Robert (Bobby) Evans (@revans2)
    
    URL: #7151
    firestarman authored Jan 25, 2021
    bf0c37a
  4. Default groupby to sort=False (#7180)

    Closes #5038, also closes #7026
    
    Using `sort=False` yields better `groupby` performance, so this PR changes the `groupby` API to not sort the group index by default.
    
    This PR also updates the docstring to describe the performance difference when using `sort=False`.
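
    To show what the default change means for result order, here is a plain-Python sketch. The `groupby_sum` helper is hypothetical. In this sketch the unsorted result follows first-appearance order, which is only a property of the sketch; the order cuDF returns with `sort=False` is not specified in this message.

```python
def groupby_sum(keys, values, sort=False):
    """Sum `values` per key; only pay for ordering the keys when sort=True."""
    totals = {}
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0) + value
    items = list(totals.items())
    if sort:
        # The extra sort over the group keys is the cost that sort=False avoids.
        items.sort(key=lambda item: item[0])
    return items
```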
    
    Authors:
      - Michael Wang (@isVoid)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - Ashwin Srinath (@shwina)
    
    URL: #7180
    isVoid authored Jan 25, 2021
    93ef1d2
  5. Add JIT cache per compute capability (#7090)

    Adds a level to JIT cache which segregates kernels compiled for different compute capabilities.
    
    Closes #5469
    
    Authors:
      - Devavret Makkar (@devavret)
    
    Approvers:
      - Paul Taylor (@trxcllnt)
      - Mark Harris (@harrism)
    
    URL: #7090
    devavret authored Jan 25, 2021
    f09a75f
  6. Refactor cudf::string_view host and device code (#7159)

    While working on improving the sort performance for strings columns in #7075, we tried a vector-load approach in the `string_view::compare()` function. This approach used some CUDA math intrinsic functions like `__funnelshift_r()` and `__byte_perm()`. Unfortunately, adding these to the `string_view` source would cause compile errors for some .cpp files. This is because `string_view.cuh` was being included by some .cpp files even though these only used the appropriate `__host__ __device__` functions.
    
    This PR separates the host/device functions from the device-only functions so that .cpp files can include `string_view.cuh` without processing the device-only definitions. The host/device functions are now defined directly in `string_view.cuh`, and the device-only source is isolated in `string_view.inl`. The include of `string_view.inl` is then wrapped in a `#if CUDA_ARCH` so it will not be processed when compiling a .cpp file.
    
    Also, I attempted to minimize includes of `string_view.cuh` by removing it from `traits.hpp` and replacing it with a forward declaration. This found a few files that were not including `string_view.cuh` directly as they should have. It also exposed `cpp/tests/utilities/scalar_utilities.cu`, which appears to be unused and was therefore removed along with its header.
    
    No functionality has changed. Build times may be slightly faster since `string_view.cuh` is included in fewer source files and .cpp files no longer process `string_view.inl`. Changing this file should also have slightly less impact on rebuilding libcudf.
    
    Authors:
      - David (@davidwendt)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - AJ Schmidt (@ajschmidt8)
      - Karthikeyan (@karthikeyann)
      - Jake Hemstad (@jrhemstad)
    
    URL: #7159
    davidwendt authored Jan 25, 2021
    103c41a
  7. Fast path single column sort (#7167)

    This change is based on the changes in PR #7075. When `cudf::sort()` or `cudf::sorted_order()` is called with a `cudf::table_view` that specifies only a single strings column, we choose a fast-path sort algorithm with a simpler comparator specifically coded for string compares. The specialized code path was added to `cudf::sorted_order()`, which is called by the other libcudf sort functions. For example, `cudf::sort()` calls `cudf::sorted_order()` and then calls `cudf::gather()` on the input `cudf::table_view` to materialize the results. The libcudf `sorted_order` feature has two APIs, `cudf::sorted_order()` and `cudf::stable_sorted_order()`, which internally use `thrust::sort()` and `thrust::stable_sort()` respectively. Each uses the `row_lexicographic_comparator` to manage the sorting of multiple columns. A simpler comparator can be used in the case of a single column, per the implementation in #7075.
    
    In this PR, I generalized this fast-path for other single column types. I found the same comparator from #7075, templated by type, could be used for speeding up sorting of any comparable type -- where a single column is specified. Further, there are some conditions with numeric types when a comparator is not required and where the `cub::DeviceRadixSort` functions can be used instead of thrust.
    
    The restrictions to account for when _not using a comparator_:
    - the type must support an assignment operator as well as the compare operators (basically only numeric types)
    - the column must not contain nulls since these are handled specially with a `null_order` parameter
    - `thrust::sort()` and `thrust::stable_sort()` sort the input data in-place and do not support descending order
    - `cub::DeviceRadixSort` does not sort in-place and does not have stable-sort but does have a descending order option
    
    Here is how these are used in `cudf::detail::sorted_order<stable>()` matching conditions with these restrictions.
    
    | stable | nulls | numeric | ascending | function |
    |:---:|:---:|:---:|:---:| --- |
    | y | y | - | - |  `thrust::stable_sort()` with comparator |
    | y | - | n | - | `thrust::stable_sort()` with comparator |
    | y | - | - | n | `thrust::stable_sort()` with comparator |
    | y | n | y | y | `thrust::stable_sort_by_key()` with input column copied |
    | n | y | - | - | `thrust::sort()` with comparator |
    | n | - | n | - | `thrust::sort()` with comparator |
    | n | n | y | y | `cub::DeviceRadixSort::SortPairs` with input column copied and output indices copied |
    | n | n | y | n | `cub::DeviceRadixSort::SortPairsDescending` with input column copied and output indices copied |
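
    The table above can be read as a small decision function. The following plain-Python sketch only mirrors the rows of the table as written in this commit message; `choose_sort_path` is a hypothetical helper, not a libcudf function.

```python
def choose_sort_path(stable, has_nulls, numeric, ascending):
    """Return the name of the routine the table selects for a single-column sort."""
    # The comparator-free paths require a numeric column without nulls.
    comparator_free = numeric and not has_nulls
    if stable:
        if comparator_free and ascending:
            return "thrust::stable_sort_by_key"
        return "thrust::stable_sort with comparator"
    if not comparator_free:
        return "thrust::sort with comparator"
    if ascending:
        return "cub::DeviceRadixSort::SortPairs"
    return "cub::DeviceRadixSort::SortPairsDescending"
```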
    
    The `sort_benchmarks.cu` file was updated to include a non-nulls set of tests to show the speedups for the bottom half of the chart. The benchmark sorts integers in ascending order. With nulls, the sort is now 1.2x faster. With no nulls, the sort is about 14x faster. The faster speed comes at the expense of 2-3 times the memory required for the `thrust::stable_sort_by_key()` or `cub::DeviceRadixSort::SortPairs()` functions.
    
    The generalization using the new single-column comparator accounts for strings columns as well. So the strings-specific code for this has been removed in this PR.
    
    Authors:
      - David (@davidwendt)
    
    Approvers:
      - Jake Hemstad (@jrhemstad)
      - Karthikeyan (@karthikeyann)
    
    URL: #7167
    davidwendt authored Jan 25, 2021
    eb1336f
  8. Support contains() on lists of primitives (#7039)

    Closes #6944.
    
    This commit adds a method (`contains()`) to check whether each row of a `LIST` column contains the scalar value specified as an argument. The operation returns a `BOOL8` column (with as many rows as the input `LIST`), each row indicating `true` if the value is found, `false` if not.
    
    Output `column[i]` is set to null if even one of the following holds true (in line with the semantics of `array_contains()` in SQL):
      1. The search key `skey` is null
      2. The list row `lists[i]` is null
      3. The list row `lists[i]` contains even *one* null, *and* `lists[i]` does not contain the search key.
    
    This implementation currently supports the operation on lists of numerics or strings.
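
    The three null rules above can be sketched in plain Python, with `None` standing in for null. `list_contains` is a hypothetical helper that models the described semantics only; it is not the libcudf API.

```python
def list_contains(lists, skey):
    """Return True/False/None per list row, following the null rules above."""
    out = []
    for row in lists:
        if skey is None or row is None:
            out.append(None)   # rules 1 and 2: null search key or null list row
        elif skey in row:
            out.append(True)   # key found, even if the row also holds nulls
        elif any(element is None for element in row):
            out.append(None)   # rule 3: key not found and the row contains a null
        else:
            out.append(False)
    return out
```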
    
    Authors:
      - MithunR (@mythrocks)
    
    Approvers:
      - AJ Schmidt (@ajschmidt8)
      - Mark Harris (@harrism)
      - David (@davidwendt)
      - Karthikeyan (@karthikeyann)
    
    URL: #7039
    mythrocks authored Jan 25, 2021
    b1e9e20

Commits on Jan 26, 2021

  1. Modify the semantics of end pointers in cuIO to match standard library (#7179)
    
    Closes #6252
    
    Fix the `end` parameter semantics to match the standard C++ library.
    Move `is_whitespace` and `trim_field_start_end` to parsing_utils and use in both CSV and JSON.
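
    In the standard C++ library, `end` points one past the last element, so a range is the half-open interval `[begin, end)`. The sketch below illustrates that convention in plain Python. The function names are borrowed from the message above, but the signatures and the whitespace set are assumptions of this sketch, not the cuIO code.

```python
WHITESPACE = " \t"


def is_whitespace(ch):
    """Assumed whitespace set for this sketch: space and tab."""
    return ch in WHITESPACE


def trim_field_start_end(data, begin, end):
    """Trim whitespace from the half-open range [begin, end) and return the new bounds."""
    while begin < end and is_whitespace(data[begin]):
        begin += 1
    # `end` is one past the last character, so the last character is data[end - 1].
    while end > begin and is_whitespace(data[end - 1]):
        end -= 1
    return begin, end
```

    With this convention an empty field is simply `begin == end`, and `data[begin:end]` yields the trimmed field without any `+ 1` adjustment.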
    
    Authors:
      - Vukasin Milovanovic (@vuule)
    
    Approvers:
      - Christopher Harris (@cwharris)
      - Conor Hoekstra (@codereport)
    
    URL: #7179
    vuule authored Jan 26, 2021
    a1db5c5
  2. Replace parquet writer api with class (#7058)

    This PR contains changes only pertaining to Parquet. 
    
    Instead of a function-based API, a class is now used to manage state and options, reducing the burden on the user. For more information, see #6911.
    
    These changes will break Java because the main API has changed.
    
    Authors:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - Jason Lowe (@jlowe)
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - Devavret Makkar (@devavret)
      - Robert (Bobby) Evans (@revans2)
      - @brandon-b-miller
      - David (@davidwendt)
    
    URL: #7058
    rgsl888prabhu authored Jan 26, 2021
    6a4c760
  3. Fixing parquet benchmarks (#7214)

    `return_filemetadata` was removed in a recent PR, but its use in the benchmarks was not removed at the time. This PR removes it.
    
    Authors:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
    
    Approvers:
      - Conor Hoekstra (@codereport)
      - Christopher Harris (@cwharris)
    
    URL: #7214
    rgsl888prabhu authored Jan 26, 2021
    d97b09e
  4. Add coverage for skiprows and num_rows in parquet reader fuzz testing (#7216)
    
    This PR adds coverage for `skiprows` and `num_rows` parameters in parquet reader fuzz tests.
    
    Authors:
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - Vukasin Milovanovic (@vuule)
      - Keith Kraus (@kkraus14)
    
    URL: #7216
    galipremsagar authored Jan 26, 2021
    ccf4ffa

Commits on Jan 27, 2021

  1. Remove floating point types from radix sort fast-path (#7215)

    Closes #7212 
    
    Reference #7167 (comment)
    
    Using radix sort for all fixed-width types causes an [error in Spark when floating point columns contain NaN elements](NVIDIA/spark-rapids#1585).
    
    This PR removes floating-point column types from the radix fast-path. This means the original `relational_compare` row operator is used to handle sorting floating point columns since they could possibly contain NaN elements.
    
    The `NANSorting` gtest included null elements so it did not catch the fast-path output discrepancy. This PR adds a `NANSortingNonNull` gtest to check for the desired NaN sorting behavior.
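
    To see why NaN needs a dedicated comparison rather than a plain numeric sort, consider this plain-Python sketch. It assumes NaN should be ordered after every other value, which is an assumption of the sketch and not stated in this message; `nan_aware_key` is a hypothetical helper.

```python
import math


def nan_aware_key(x):
    """Sort key that places NaN after every other value."""
    # NaN compares False against everything, so it needs an explicit rank.
    return (1, 0.0) if math.isnan(x) else (0, x)


values = [2.0, float("nan"), -1.0, 0.5]
ordered = sorted(values, key=nan_aware_key)
```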
    
    Authors:
      - David (@davidwendt)
    
    Approvers:
      - Jake Hemstad (@jrhemstad)
      - Conor Hoekstra (@codereport)
    
    URL: #7215
    davidwendt authored Jan 27, 2021
    d19cb40
  2. Add static type checking via Mypy (#6381)

    Adds static type checking to cuDF Python via MyPy.
    
    * An additional `mypy` style check is enabled in CI
    * `mypy` is run as part of the pre-commit hook
    * Many parts of the cuDF internal code now have type annotations
    * Any new internal code is expected to be written with type annotations (not public-facing APIs)
    
    Authors:
      - Ashwin Srinath (@shwina)
    
    Approvers:
      - Dillon Cullinan (@dillon-cullinan)
      - Keith Kraus (@kkraus14)
      - Christopher Harris (@cwharris)
    
    URL: #6381
    shwina authored Jan 27, 2021
    fc40c52
  3. Add JNI and Java bindings for list_contains (#7125)

    Adds JNI and Java side bindings for `list_contains` that is being added as part of #7039.
    
    Authors:
      - Kuhu Shukla (@kuhushukla)
    
    Approvers:
      - Robert (Bobby) Evans (@revans2)
      - MithunR (@mythrocks)
    
    URL: #7125
    Kuhu Shukla authored Jan 27, 2021
    dd1efe1
  4. Fix missing null_count() comparison in test framework and related failures (#7219)
    
    Fixes #7210
    Fixes #6733 
    
    List of fixes included:
    
    - [x] Restore `null_count()` check in `expect_columns_equal` / `expect_columns_equivalent`
    - [x] Fix issue in `structs_column_view::get_sliced_child`
    - [x] Fix test failures in COPYING_TEST
    - [x] Fix test failures in STREAM_COMPACTION_TEST
    - [x] Fix test failures in RESHAPE_TEST
    
    Authors:
      - @nvdbaranec
      - Mark Harris (@harrism)
    
    Approvers:
      - Mark Harris (@harrism)
      - MithunR (@mythrocks)
      - Jake Hemstad (@jrhemstad)
    
    URL: #7219
    nvdbaranec authored Jan 27, 2021
    9631660

Commits on Jan 28, 2021

  1. Add JNI support for converting Arrow buffers to CUDF ColumnVectors (#7222)
    
    This adds the JNI layer needed to build up Arrow column vectors, which are just references to off-heap Arrow buffers, and then convert them into CUDF ColumnVectors by copying the Arrow data directly to the GPU.
    
    The way this works is that you create an ArrowColumnBuilder for each column you need. You call addBatch for each separate Arrow buffer you want to add to that column, and then you call buildAndPutOnDevice() on the builder. That causes the Arrow pointer to be passed into CUDF, where an Arrow Table with one column is created. That Arrow Table is passed to cudf::from_arrow, which returns a CUDF Table, and we take the single column from it and return it.
    
    Note this only supports primitive types and Strings for now. List, Struct, Dictionary, and Decimal are not supported yet.
    
    Signed-off-by: Thomas Graves <tgraves@nvidia.com>
    
    Authors:
      - Thomas Graves (@tgravescs)
    
    Approvers:
      - Robert (Bobby) Evans (@revans2)
      - Jason Lowe (@jlowe)
    
    URL: #7222
    tgravescs authored Jan 28, 2021
    cbc0394
  2. Support numeric_only field for rank() (#7213)

    Closes #7174 
    
    This PR adds support for the `numeric_only` parameter of `DataFrame.rank()` and `Series.rank()`. When the user specifies `numeric_only=True`, only the columns with numerical data types are selected to construct a cudf object, which is then passed to the lower level for processing.
    
    Two minor refactors are also included in this PR:
    
    - This PR refactors the internal API `Frame._get_columns_by_label`, which can now be dispatched to from both `DataFrame` and `Series`.
    - This PR refactors `test_rank.py`, moving the test functions out of class `TestRank` to the top level. All test variables shared among test cases are moved to a `pytest.fixture` method. A `DataFrame.rank` test case that is expected to raise due to a [pandas bug](pandas-dev/pandas#32593) is now captured under `pytest.raises`.
    
    Authors:
      - Michael Wang (@isVoid)
    
    Approvers:
      - Ashwin Srinath (@shwina)
      - @brandon-b-miller
    
    URL: #7213
    isVoid authored Jan 28, 2021
    7d52970
  3. Fix test column vector leak (#7238)

    #7125 added a test column vector leak. This PR fixes this minor leak.
    
    Authors:
      - Kuhu Shukla (@kuhushukla)
    
    Approvers:
      - Jason Lowe (@jlowe)
      - Thomas Graves (@tgravescs)
    
    URL: #7238
    Kuhu Shukla authored Jan 28, 2021
    ab34580
  4. Fix some bugs in java scalar support for decimal (#7237)

    This fixes some bugs in the java support for decimal scalar values.  They are fairly minor but prevented me from doing some debugging earlier, and could impact tests in the future.
    
    Authors:
      - Robert (Bobby) Evans (@revans2)
    
    Approvers:
      - Jason Lowe (@jlowe)
    
    URL: #7237
    revans2 authored Jan 28, 2021
    02166da
  5. Fix Arrow column test leaks (#7241)

    Fixes leaks found in the ArrowColumnVectorTest.
    
    Signed-off-by: Thomas Graves <tgraves@nvidia.com>
    
    Authors:
      - Thomas Graves (@tgravescs)
    
    Approvers:
      - Robert (Bobby) Evans (@revans2)
      - Jason Lowe (@jlowe)
    
    URL: #7241
    tgravescs authored Jan 28, 2021
    9672e3d
  6. Add dictionary column support to rolling_window (#7186)

    Reference #5963 
    
    Add support for dictionary column to `cudf::rolling_window` (non-udf)
    
    Rolling aggregations
    - [x] min/max
    - [x] lead/lag
    - [x] counting, row-number
    
    These only require aggregating the dictionary indices and do not need to access the keys.
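
    A plain-Python sketch of why min/max can operate on the indices alone. It assumes the dictionary keys are stored in sorted order, so the smallest index in a window refers to the smallest key; that ordering is an assumption of this sketch. `dictionary_encode` and `rolling_min` are hypothetical helpers, and the window is taken to include the current row in `preceding`.

```python
def dictionary_encode(values):
    """Encode values as (sorted unique keys, per-row indices into the keys)."""
    keys = sorted(set(values))
    index_of = {key: i for i, key in enumerate(keys)}
    return keys, [index_of[v] for v in values]


def rolling_min(indices, preceding, following):
    """Rolling minimum over the indices only; the keys are never read."""
    n = len(indices)
    out = []
    for i in range(n):
        lo = max(0, i - preceding + 1)   # `preceding` counts the current row
        hi = min(n, i + following + 1)
        out.append(min(indices[lo:hi]))
    return out
```

    The result is a new list of indices that can be paired with the unchanged keys, so the output stays dictionary-encoded.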
    
    Authors:
      - David (@davidwendt)
    
    Approvers:
      - Mark Harris (@harrism)
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
    
    URL: #7186
    davidwendt authored Jan 28, 2021
    b608832

Commits on Jan 29, 2021

  1. Add support for cudf::binary_operation TRUE_DIV for decimal32 and decimal64 (#7198)
    
    This resolves a part of #7132
    
    **ToDo:**
    * [x] Simple unit test
    * [x] Comprehensive unit tests
    * [x] Initial Column + Column
    * [x] Full Column + Column
    * [x] Column + Scalar
    * [x] Scalar + Column
    * [x] Cleanup
    
    Authors:
      - Conor Hoekstra (@codereport)
    
    Approvers:
      - Mark Harris (@harrism)
      - @nvdbaranec
    
    URL: #7198
    codereport authored Jan 29, 2021
    b097b5a
  2. Refactor io memory fetches to use hostdevice_vector methods (#7035)

    This replaces `cudaMemcpyAsync(hostdevice_vector)` with `hostdevice_vector.device_to_host()` or `hostdevice_vector.host_to_device()` when appropriate.
    
    Issue #6538
    
    Authors:
      - @ChrisJar
    
    Approvers:
      - Karthikeyan (@karthikeyann)
      - Vukasin Milovanovic (@vuule)
    
    URL: #7035
    ChrisJar authored Jan 29, 2021
    fe5e07d
  3. Fix loc for Series with a MultiIndex (#7243)

    Fixes #7221 and adds improvements to `loc` with a MultiIndex.
    
    * Previously, `loc` on a `Series` with a `MultiIndex` would fail. For example:
    
    ```python
    In [7]: sr
    Out[7]:
    n_workers  type
    1          fit        1
    2          load       2
    3          predict    3
    Name: x, dtype: int64
    
    In [8]: sr.loc[(1, "fit")]  # KeyError
    ```
    
    * Previously, `loc` on a `DataFrame` with a `MultiIndex` would fail when a slice without `start` or `end` was used. For example:
    
    ```python
    In [3]: df
    Out[3]:
                       x
    n_workers type
    1         fit      1
    2         load     2
    3         predict  3
    
    In [4]: df.loc[:(2, "load")]  # TypeError
    ```
    
    Both the above issues have been addressed and tests added.
    
    Authors:
      - Ashwin Srinath (@shwina)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - Michael Wang (@isVoid)
      - GALI PREM SAGAR (@galipremsagar)
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
    
    URL: #7243
    shwina authored Jan 29, 2021
    019d7cc

Commits on Jan 30, 2021

  1. Implement COLLECT rolling window aggregation (#7189)

    Closes #7133. 
    
    This is an implementation of the `COLLECT` aggregation in the context of rolling window functions. This enables the collection of rows (of type `T`) within specified window boundaries into a list column (containing elements of type `T`). In this context, one list row is generated per input row. Consider the following example:
    ```c++
    auto input_col = fixed_width_column_wrapper<int32_t>{70, 71, 72, 73, 74};
    ```
    Calling `rolling_window()` with `preceding=2`, `following=1`, `min_periods=1` produces the following:
    ```c++
    auto output_col = cudf::rolling_window(input_col, 2, 1, 1, collect_aggr);
                // == [ [70,71], [70,71,72], [71,72,73], [72,73,74], [73,74] ]
    ```
    `COLLECT` is supported with `rolling_window()`, `grouped_rolling_window()`, and `grouped_time_range_rolling_window()`, across primitive types and arbitrarily nested lists and structs.
    
    `min_periods` is also honoured:  If the number of observations is fewer than min_periods, the resulting list row is null.
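
    The window arithmetic in the example above can be reproduced with a plain-Python sketch. `rolling_collect` is a hypothetical helper, with `None` standing in for a null list row. It assumes `preceding` counts the current row, which is what the example output implies.

```python
def rolling_collect(values, preceding, following, min_periods):
    """Collect each row's window into a list; None when the window is too small."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - preceding + 1)   # `preceding` includes the current row
        hi = min(n, i + following + 1)   # `following` rows after the current row
        window = values[lo:hi]
        out.append(window if len(window) >= min_periods else None)
    return out
```

    Raising `min_periods` to 3 on the same input would turn the first and last rows into nulls, since their windows hold only two observations.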
    
    Authors:
      - MithunR (@mythrocks)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - Vukasin Milovanovic (@vuule)
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
    
    URL: #7189
    mythrocks authored Jan 30, 2021
    14b0900

Commits on Feb 1, 2021

  1. Add List types support in data generator (#7064)

    Resolves: #6263 
    
    This PR enables the generation of random list columns in the data generator, which will be used in fuzz tests.
    
    cc: @vuule
    
    Authors:
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - @brandon-b-miller
    
    URL: #7064
    galipremsagar authored Feb 1, 2021
    50be922
  2. Handle various parameter combinations in replace API (#7207)

    Fixes: #7206 
    
    The `replace` API has two overloaded parameters, `to_replace` and `value`. Each accepts several types of input, and each combination of input types behaves differently. These changes introduce a clear code path for each possible parameter combination. This should make it easier to support new parameters in the future, such as `regex` and nested dict types, which would change the behaviour of the `to_replace` and `value` parameters.
    
    - [x] Ensure all combinations are covered for `to_replace` & `value` for both `DataFrame.replace` & `Series.replace`.
    - [x] Document changes inline & Update func docs.
    - [x] Add tests to include coverage for all combinations that are not yet covered.
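
    One way to picture "a clear code path per parameter combination" is the plain-Python sketch below. It loosely follows pandas-style semantics for scalar, list and dict inputs. The exact combinations cuDF accepts are not listed in this message, so the `replace` helper and its error handling are assumptions.

```python
def replace(values, to_replace, value=None):
    """Dispatch on the types of `to_replace` and `value`, then apply one mapping."""
    if isinstance(to_replace, dict):
        # dict: the mapping is given directly, so `value` must be omitted
        if value is not None:
            raise TypeError("value must be None when to_replace is a dict")
        mapping = dict(to_replace)
    elif isinstance(to_replace, list):
        if isinstance(value, list):
            # list + list: replace element-wise, so the lengths must match
            if len(to_replace) != len(value):
                raise ValueError("to_replace and value must have the same length")
            mapping = dict(zip(to_replace, value))
        else:
            # list + scalar: every listed entry becomes the same value
            mapping = {old: value for old in to_replace}
    else:
        # scalar + scalar
        if isinstance(value, list):
            raise TypeError("value cannot be a list when to_replace is a scalar")
        mapping = {to_replace: value}
    return [mapping.get(v, v) for v in values]
```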
    
    Authors:
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - @brandon-b-miller
    
    URL: #7207
    galipremsagar authored Feb 1, 2021
    b8cb8c7
  3. Define and implement more behavior for merging on categorical variables (#7209)
    
    Fixes #6892
    Defines the desired behavior for an implicit merge of two possibly differing categorical variables, or one categorical variable and one non-categorical variable, as a function of the dtypes and the merge configuration.
    
    The desired behavior is defined through the tests and then implemented in `casting_logic.py`.
    
    Authors:
      - @brandon-b-miller
      - Keith Kraus (@kkraus14)
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
      - Keith Kraus (@kkraus14)
    
    URL: #7209
    brandon-b-miller authored Feb 1, 2021
    ccc9173
  4. Disallow picking output columns from nested columns. (#7248)

    Only top level columns can be selected by name
    
    Fixes #7229
    
    Authors:
      - Devavret Makkar (@devavret)
    
    Approvers:
      - Karthikeyan (@karthikeyann)
      - Vukasin Milovanovic (@vuule)
      - @nvdbaranec
      - Keith Kraus (@kkraus14)
    
    URL: #7248
    devavret authored Feb 1, 2021
    0ee8004
  5. Remove floating point types from cudf::sort fast-path (#7250)

    PR #7215 removed single floating point columns from radix sort fast-path but missed disabling the fast-path sort for floating-point in `cudf::sort()`. 
    
    This PR fixes `cudf::sort` and adds a new test to the existing `RowOperatorTestForNAN.NANSortingNonNull` gtest.
    
    Authors:
      - David (@davidwendt)
    
    Approvers:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - Conor Hoekstra (@codereport)
    
    URL: #7250
    davidwendt authored Feb 1, 2021
    3ecde9d

Commits on Feb 3, 2021

  1. libcudf Developer Guide (#6977)

    Adds a new developer guide for libcudf. This is based on the
    existing libcudf++ transition guide.
    
    Fixes #5273 
    
    TODO
    
    - [x] Description of `dictionary_column_wrapper` and `fixed_point_column_wrapper`
    - [x] Benchmarking Section (put in a new file, Benchmarking.md)?
    - [x] Better discussion of nested types
    - [x] Introductory section on data types
    - [x] Consider splitting into multiple documents: DEVELOPER_GUIDE.md, TESTING.md, BENCHMARKING.md?
    - [x] Placeholder for cuIO?
    - [x] Add section on code and documentation style and formatting
    
    Authors:
      - Mark Harris (@harrism)
      - Jake Hemstad (@jrhemstad)
    
    Approvers:
      - @nvdbaranec
      - Conor Hoekstra (@codereport)
      - Jake Hemstad (@jrhemstad)
      - David (@davidwendt)
    
    URL: #6977
    harrism authored Feb 3, 2021
    52f5b32
  2. Fix style issues related to NumPy (#7279)

    NumPy 1.20 is [typed](https://numpy.org/devdocs/release/1.20.0-notes.html#numpy-is-now-typed), which exposed a few typing errors in cuDF that this PR addresses.
    
    Authors:
      - Ashwin Srinath (@shwina)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - GALI PREM SAGAR (@galipremsagar)
      - AJ Schmidt (@ajschmidt8)
    
    URL: #7279
    shwina authored Feb 3, 2021
    900c1e1
  3. Prepare Changelog for Automation (#7272)

    This PR prepares the changelog to be automatically updated during releases.
    
    Authors:
      - AJ Schmidt (@ajschmidt8)
    
    Approvers:
      - Keith Kraus (@kkraus14)
    
    URL: #7272
    ajschmidt8 authored Feb 3, 2021
    54cddb1

Commits on Feb 4, 2021

  1. Add docs for working with missing data (#7010)

    Fixes: #6963 
    
    This PR introduces a "Working with missing data" doc page where we clearly outline how we can work with missing data in cudf. 
    
    The behavior shown in #6963 is correct because cudf treats `NaT` as `<NA>` values. The new page therefore highlights how `NaT` in datetime/timedelta values behaves differently in pandas and cudf.
    
    Authors:
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
    
    URL: #7010
    galipremsagar authored Feb 4, 2021
    2e71b36
  2. Pack/unpack functionality to convert tables to and from a serialized format. (#7096)
    
    Addresses #3793
    
    Depends on #6864, which affects `contiguous_split.cu`. For the purposes of this PR, the only relevant changes are those that involve the generation of metadata.
    
    - `pack()` performs a `contiguous_split()` on the incoming table to arrange the memory into a unified device buffer, and generates a host-side metadata buffer. These are returned in the `packed_columns` struct.
    
    - `unpack()` takes the data stored in the `packed_columns` struct and returns a deserialized `table_view` that points into it.
    
    The intent of this functionality is as follows (pseudocode):
    
    ```
    // serialize-side
    table_view t;
    packed_columns p = pack(t);
    send_over_network(p.gpu_data);
    send_over_network(p.metadata);
    
    // deserialize-side
    packed_columns p = receive_from_network();
    table_view t = unpack(p);
    ```
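
The round trip described above can be illustrated with a small, self-contained Python analogy. This is only a sketch under stated assumptions, not the libcudf implementation: `PackedColumns`, the metadata layout, and the use of host `bytes` in place of a device buffer are all hypothetical choices made for illustration.

```python
import struct
from dataclasses import dataclass


@dataclass
class PackedColumns:
    """Hypothetical stand-in for the `packed_columns` struct."""
    metadata: bytes  # host-side description of the layout (invented format)
    data: bytes      # one contiguous buffer (device memory in libcudf)


def pack(columns):
    """Concatenate byte columns into one buffer and record each column's size."""
    sizes = [len(c) for c in columns]
    # Metadata: a 4-byte column count followed by one 8-byte size per column.
    metadata = struct.pack(f"<I{len(sizes)}Q", len(sizes), *sizes)
    return PackedColumns(metadata=metadata, data=b"".join(columns))


def unpack(packed):
    """Return zero-copy views into `packed.data`, one per original column."""
    (count,) = struct.unpack_from("<I", packed.metadata, 0)
    sizes = struct.unpack_from(f"<{count}Q", packed.metadata, 4)
    buffer = memoryview(packed.data)
    views, offset = [], 0
    for size in sizes:
        views.append(buffer[offset:offset + size])
        offset += size
    return views


columns = [b"\x01\x02\x03", b"hello"]
packed = pack(columns)
restored = [bytes(view) for view in unpack(packed)]
```

As in the description, the unpacked result only points into the packed buffer, so the buffer has to outlive the views.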
    
    This PR also renames `contiguous_split_result` to `packed_table` (which is just a bundled `table_view` and `packed_columns`).
    
    Authors:
      - @nvdbaranec
    
    Approvers:
      - Jake Hemstad (@jrhemstad)
      - Paul Taylor (@trxcllnt)
      - Mike Wilson (@hyperbolic2346)
    
    URL: #7096
    nvdbaranec authored Feb 4, 2021
    fd2d0e2
  3. Move lists utility function definition out of header (#7266)

    Fixes #7265.
    
    `cudf::detail::get_num_child_rows()` is currently defined in `cudf/lists/detail/utilities.cuh`. The build pipelines for #7189 are fine, but there seem to be build failures in dependent projects such as `spark-rapids`:
    ```
    [2021-01-31T08:12:10.611Z] /.../workspace/spark/cudf18_nightly/cpp/include/cudf/lists/detail/utilities.cuh:31:18: error: 'cudf::size_type cudf::detail::get_num_child_rows(const cudf::column_view&, rmm::cuda_stream_view)' defined but not used [-Werror=unused-function]
    [2021-01-31T08:12:10.611Z]  static cudf::size_type get_num_child_rows(cudf::column_view const& list_offsets,
    [2021-01-31T08:12:10.611Z]                   ^~~~~~~~~~~~~~~~~~
    [2021-01-31T08:12:11.981Z] cc1plus: all warnings being treated as errors
    [2021-01-31T08:12:12.238Z] make[2]: *** [CMakeFiles/cudf_hash.dir/build.make:82: CMakeFiles/cudf_hash.dir/src/hash/hashing.cu.o] Error 1
    [2021-01-31T08:12:12.238Z] make[1]: *** [CMakeFiles/Makefile2:220: CMakeFiles/cudf_hash.dir/all] Error 2
    ```
    In any case, it is less than ideal for the function to be completely defined in the header, especially given that the likes of `hashing.cu` are exposed to it (by way of `scatter.cuh`). 
    
    This commit moves the function definition to a separate translation unit, without changing implementation or interface.
    
    Authors:
      - MithunR (@mythrocks)
    
    Approvers:
      - @nvdbaranec
      - Mike Wilson (@hyperbolic2346)
      - David (@davidwendt)
    
    URL: #7266
    mythrocks authored Feb 4, 2021
    fd38b4c
  4. Add Segmented sort (#7122)

    Addresses part of #6541 (segmented sort of lists):
    
    - [x] lists_column_view segmented_sort
    - [x] numerical types (cub segmented sort limitation)
    - [x] sort_lists(table_view)
    - [x] unit tests
    
    Closes #4603 (segmented sort):
    - [x] segmented_sort
    - [x] unit tests.
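
For readers unfamiliar with the term, a segmented sort orders the elements of each segment independently and leaves the segment boundaries unchanged. The sketch below shows the idea in plain Python. The `segmented_sort` helper and the offsets-based layout (segment `i` spans `offsets[i]` to `offsets[i + 1]`) are assumptions for illustration and are not the libcudf API.

```python
def segmented_sort(values, offsets):
    """Sort each segment values[offsets[i]:offsets[i + 1]] independently."""
    result = list(values)
    for begin, end in zip(offsets, offsets[1:]):
        result[begin:end] = sorted(result[begin:end])
    return result


values = [3, 1, 2, 9, 7, 5, 4]
offsets = [0, 3, 5, 7]  # three segments: [3, 1, 2], [9, 7], [5, 4]
sorted_values = segmented_sort(values, offsets)
```

Elements never cross a segment boundary, which is what distinguishes this from sorting the whole column.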
    
    Authors:
      - Karthikeyan (@karthikeyann)
    
    Approvers:
      - AJ Schmidt (@ajschmidt8)
      - Keith Kraus (@kkraus14)
      - Jake Hemstad (@jrhemstad)
      - Conor Hoekstra (@codereport)
    
    URL: #7122
    karthikeyann authored Feb 4, 2021
    369ec98
  5. Throw if bool column would cause incorrect result when writing to ORC (#7261)
    
    Issue #6763
    
    Authors:
      - Vukasin Milovanovic (@vuule)
    
    Approvers:
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - @nvdbaranec
      - GALI PREM SAGAR (@galipremsagar)
      - Keith Kraus (@kkraus14)
    
    URL: #7261
    vuule authored Feb 4, 2021
    4f87a59
  6. Update JNI for contiguous_split packed results (#7127)

    This PR requires the libcudf changes in #7096, fixing the Java bindings to `contiguous_split` that are broken by that change.
    
    This also adds the ability to create a `ContiguousTable` instance without manifesting a `Table` instance and all `ColumnVector` instances underneath it, which should prove useful during Spark's shuffle.
    
    Authors:
      - Jason Lowe (@jlowe)
    
    Approvers:
      - Robert (Bobby) Evans (@revans2)
      - Alessandro Bellina (@abellina)
    
    URL: #7127
    jlowe authored Feb 4, 2021
    110ef3e
  7. fix java cuFile tests (#7296)

    Turns out we need version > 5.4 of the junit jupiter engine to support `@TempDir`.
    - Changed the file mode to match Spark's disk manager.
    - Changed to use `fstat` to get the file length when appending.
    - Add tests for when a file already exists.
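
The `fstat`-based length check when appending can be sketched in Python as follows. The actual change is in the Java/JNI cuFile code, so this is only an analogy: the `append_and_report` helper is hypothetical, and the temporary directory stands in for what `@TempDir` provides in the JUnit tests.

```python
import os
import tempfile


def append_and_report(path, payload):
    """Append payload to path; return (offset before write, length after write)."""
    with open(path, "ab") as f:
        # Ask the OS for the current length of the open file descriptor.
        offset = os.fstat(f.fileno()).st_size
        f.write(payload)
        f.flush()  # push buffered bytes to the OS before querying the size again
        length = os.fstat(f.fileno()).st_size
    return offset, length


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    first = append_and_report(path, b"abcd")     # file does not exist yet
    second = append_and_report(path, b"efghij")  # file already exists
```

The second call covers the "file already exists" case: the reported offset is the length left behind by the first append.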
    
    Authors:
      - Rong Ou (@rongou)
    
    Approvers:
      - Jason Lowe (@jlowe)
      - Robert (Bobby) Evans (@revans2)
    
    URL: #7296
    rongou authored Feb 4, 2021
    1062fbc
  8. Improve assert_eq handling of scalar (#7220)

    Closes #7199
    
    Refactors scalar handling inside `assert_eq`. At a high level, this PR proposes "whitelist"-style testing: every comparison goes through the strict-equality code path unless explicitly allowed otherwise. This lets the test system catch all unintended inequalities except those that have been explicitly agreed upon. This PR creates two whitelist items:
    - If the operands override `__eq__`, use it to determine equality.
    - If the operands are floating-point types, assert approximate equality.
    
    In all other cases, the operands must be strictly equal. Note that for testing purposes, `np.nan` is considered equal to itself.
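
The rules above can be sketched as a small standalone function. This is an illustration of the whitelist idea only, not the code in `assert_eq`: the names `scalars_equal` and `overrides_eq`, the tolerance, and the choice to treat only non-builtin classes as "overriding `__eq__`" are all assumptions.

```python
import math


def overrides_eq(obj):
    """True for non-builtin types that define their own __eq__ (assumed rule)."""
    cls = type(obj)
    return cls.__module__ != "builtins" and cls.__eq__ is not object.__eq__


def scalars_equal(left, right, rel_tol=1e-9):
    """Strict equality by default, with two explicitly allowed exceptions."""
    # Whitelist item 1: a type that overrides __eq__ decides for itself.
    if overrides_eq(left):
        return bool(left == right)
    # Whitelist item 2: floats compare approximately; NaN equals NaN for tests.
    if isinstance(left, float) and isinstance(right, float):
        if math.isnan(left) and math.isnan(right):
            return True
        return math.isclose(left, right, rel_tol=rel_tol)
    # Everything else must match in both type and value.
    return type(left) is type(right) and left == right


class Tag:
    """Example type with a custom, case-insensitive __eq__."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, Tag) and self.name.lower() == other.name.lower()
```

Under these assumptions, `scalars_equal(1, 1.0)` is rejected by the strict path, while `scalars_equal(0.1 + 0.2, 0.3)` is accepted by the floating-point exception.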
    
    Authors:
      - Michael Wang (@isVoid)
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
      - @brandon-b-miller
    
    URL: #7220
    isVoid authored Feb 4, 2021
    fc9a00f
  9. Prepare Changelog for Automation (#7309)

    This PR prepares the changelog to be automatically updated during releases.
    
    Authors:
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - AJ Schmidt (@ajschmidt8)
    
    URL: #7309
    galipremsagar authored Feb 4, 2021
    568df5b
  10. Add column_device_view pointers to EncColumnDesc (#7097)

    Closes #6893, closes #6894. Contributes to #5682 
    
    Replaces usage of `stats_column_desc` members in the Parquet writer with `column_device_view` members.
    
    Authors:
      - Kumar Aatish (@kaatish)
    
    Approvers:
      - David (@davidwendt)
      - Vukasin Milovanovic (@vuule)
      - Devavret Makkar (@devavret)
    
    URL: #7097
    kaatish authored Feb 4, 2021
    8334700
  11. Fix copying dtype metadata after calling libcudf functions (#7271)

    Fixes #7249
    
    Copies dtype metadata after calling `ColumnBase.copy()`. Moves logic for copying dtype metadata after calling libcudf functions from `Frame` to `ColumnBase`.
    
    Authors:
      - Ashwin Srinath (@shwina)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - GALI PREM SAGAR (@galipremsagar)
    
    URL: #7271
    shwina authored Feb 4, 2021
    253dfdf
  12. Use uvector in replace_nulls; Fix sort_helper::grouped_value doc (#7256)
    
    Small PR to provide two fixes:
    - Use `rmm::device_uvector` in place of `device_vector` to improve efficiency. This is scratch space, so the supplied stream and the default memory resource are used. Part of #5380
    - Update `sort_helper::grouped_value` docstring to reflect change after use of stable sort.
    
    Authors:
      - Michael Wang (@isVoid)
    
    Approvers:
      - Vukasin Milovanovic (@vuule)
      - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
      - Mark Harris (@harrism)
    
    URL: #7256
    isVoid authored Feb 4, 2021
    fb33b94
  13. Fix failing CI ORC test (#7313)

    Use a buffer for output in the newly added ORC test.
    
    Authors:
      - Vukasin Milovanovic (@vuule)
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
    
    URL: #7313
    vuule authored Feb 4, 2021
    3a52d93
  14. Add Java unit tests for window aggregate 'collect' (#7121)

    Add unit tests for aggregate 'collect' with windowing.
    
    This PR depends on the PR #7189 . 
    
    Signed-off-by: Liangcai Li <liangcail@nvidia.com>
    
    Authors:
      - Liangcai Li (@firestarman)
    
    Approvers:
      - MithunR (@mythrocks)
      - Robert (Bobby) Evans (@revans2)
    
    URL: #7121
    firestarman authored Feb 4, 2021
    e2f6952

Commits on Feb 5, 2021

  1. Fix typo in cudf.core.column.string.extract docs (#7253)

    change: on -> one
    
    I read the contributing guidelines, but since this is just a documentation fix, I'm not sure which apply.
    
    Great library, I just got started using it. A little rough around the edges, but great so far, and well worth some of the added steps.
    
    Authors:
      - Alan deLevie (@adelevie)
      - AJ Schmidt (@ajschmidt8)
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
      - Keith Kraus (@kkraus14)
      - Michael Wang (@isVoid)
      - Ray Douglass (@raydouglass)
    
    URL: #7253
    adelevie authored Feb 5, 2021
    3fef7f7
  2. Remove incorrect std::move call on return variable (#7319)

    Returning a unique pointer using `std::move` causes a compile error for gcc 9 and above.
    This is a simple fix that removes the incorrect `std::move` from `get_segment_indices` in `segmented_sort.cu`.
    
    Authors:
      - David (@davidwendt)
    
    Approvers:
      - Karthikeyan (@karthikeyann)
      - Devavret Makkar (@devavret)
    
    URL: #7319
    davidwendt authored Feb 5, 2021
    f1a6616
  3. Disallow constructing frames from a ColumnAccessor (#7298)

    Constructing a DataFrame from a ColumnAccessor previously had unintended side-effects:
    
    ```python
    
    In [1]: import cudf
    
    In [2]: a = cudf.DataFrame({'a': [1, 2, 3]})
    
    In [3]: a._data['a'].__cuda_array_interface__
    Out[3]:
    {'shape': (3,),
     'strides': (8,),
     'typestr': '<i8',
     'data': (140409137266688, False),
     'version': 1}
    
    In [4]: a[['a']]
    Out[4]:
       a
    0  1
    1  2
    2  3
    
    In [5]: a._data['a'].__cuda_array_interface__
    Out[5]:
    {'shape': (3,),
     'strides': (8,),
     'typestr': '<i8',
     'data': (140409137267200, False),
     'version': 1}
    ```
    
    In a discussion with @galipremsagar, we decided that it's probably best not to handle `ColumnAccessor` in the frame constructors.
    
    * Remove special handling of `ColumnAccessor` in `Frame` constructors
    * Collapse `Series.copy()` and `DataFrame.copy()` into a single `Frame.copy()`
    
    Authors:
      - Ashwin Srinath (@shwina)
      - GALI PREM SAGAR (@galipremsagar)
    
    Approvers:
      - GALI PREM SAGAR (@galipremsagar)
    
    URL: #7298
    shwina authored Feb 5, 2021
    26b8c60
  4. Fix bug when iloc slice terminates at before-the-zero position (#7277)

    Closes #7246
    
    This PR fixes a bug in `DataFrame.iloc`. When the slice provided to `iloc` is decrementing and terminates at the `before-the-zero` position, such as `slice(2, -1, -1)` or `slice(4, None, -1)`, the terminal position still gets wrapped around.
    
    `Frame._slice` is moved to `DataFrame._slice` to resolve a typing issue.
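
The wrap-around pitfall can be reproduced with built-in Python slicing, which helps explain the kind of bug described here. The sketch below is not the cuDF implementation, and the `positions` helper is hypothetical. It shows that a stop of `-1` produced by `slice.indices` means "one before index 0", but is reinterpreted as "the last element" if it is fed back into a slice.

```python
def positions(key, length):
    """Resolve a slice to explicit row positions without wrap-around."""
    start, stop, step = key.indices(length)
    # range() treats stop=-1 literally, so it walks down to index 0.
    return list(range(start, stop, step))


rows = ["r0", "r1", "r2", "r3", "r4"]
key = slice(4, None, -1)

resolved = key.indices(len(rows))
# Feeding the resolved bounds back into a slice wraps the -1 stop
# around to the last element and yields no rows.
naive = rows[resolved[0]:resolved[1]:resolved[2]]
# Indexing by explicit positions keeps the intended rows.
correct = [rows[i] for i in positions(key, len(rows))]
```

Here `resolved` is `(4, -1, -1)`, `naive` is empty, and `correct` contains all five rows in reverse order.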
    
    Authors:
      - Michael Wang (@isVoid)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - GALI PREM SAGAR (@galipremsagar)
    
    URL: #7277
    isVoid authored Feb 5, 2021
    0410a36
  5. Update 10 minutes to cuDF and CuPy with new APIs (#7158)

    This updates the 10 minutes to cuDF and CuPy notebook to use the new methods for moving between cuDF data structures and CuPy arrays.
    
    Closes #7160
    
    Authors:
      - @ChrisJar
    
    Approvers:
      - Ashwin Srinath (@shwina)
    
    URL: #7158
    ChrisJar authored Feb 5, 2021
    658e91a
  6. Update readme (#7318)

    Closes #7311
    
    Authors:
      - Ashwin Srinath (@shwina)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - AJ Schmidt (@ajschmidt8)
    
    URL: #7318
    shwina authored Feb 5, 2021
    da0e794

Commits on Feb 8, 2021

  1. Auto-label PRs based on their content (#7044)

    This PR adds the GitHub action [PR Labeler](https://github.com/actions/labeler) to auto-label PRs based on their content. 
    
    Labeling is managed with a configuration file `.github/labeler.yml` using the following [options](https://github.com/actions/labeler#usage).
    
    Authors:
      - Joseph (@jolorunyomi)
      - Mike Wendt (@mike-wendt)
    
    Approvers:
      - AJ Schmidt (@ajschmidt8)
      - Keith Kraus (@kkraus14)
      - Mike Wendt (@mike-wendt)
    
    URL: #7044
    jolorunyomi authored Feb 8, 2021
    a86d5dd

Commits on Feb 9, 2021

  1. Unpin from numpy < 1.20 (#7335)

    Authors:
      - Ashwin Srinath (@shwina)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - @jakirkham
      - Ray Douglass (@raydouglass)
    
    URL: #7335
    shwina authored Feb 9, 2021
    d3f5add

Commits on Feb 16, 2021

  1. Add GHA to mark issues/prs as stale/rotten (#7388)

    Issues and PRs without activity for 30d will be marked as stale.
    If there is no activity for 90d, they will be marked as rotten.
    
    Authors:
      - Jordan Jacobelli (@Ethyling)
    
    Approvers:
      - Dillon Cullinan (@dillon-cullinan)
    
    URL: #7388
    jjacobelli authored Feb 16, 2021
    26c2dfe

Commits on Feb 17, 2021

  1. Update stale GHA with exemptions & new labels (#7395)

    Follows #7388
    
    Updates the stale GHA with the following changes:
    
    - [x] Uses `inactive-30d` and `inactive-90d` labels instead of `stale` and `rotten`
    - [x] Updates comments to reflect changes in labels
    - [x] Exempts the following labels from being marked `inactive-30d` or `inactive-90d`
      - `0 - Blocked`
      - `0 - Backlog`
      - `good first issue`
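
The labeling policy can be summarized as a small decision function. This is a sketch of the rules as described in this commit and in #7388 (30 and 90 days), not the workflow configuration itself. The function name and the inclusive thresholds are assumptions.

```python
EXEMPT_LABELS = {"0 - Blocked", "0 - Backlog", "good first issue"}


def inactivity_label(days_inactive, labels):
    """Return the label to apply, or None if nothing should be applied."""
    # Issues and PRs carrying an exempt label are never marked inactive.
    if EXEMPT_LABELS & set(labels):
        return None
    # Thresholds are assumed to be inclusive.
    if days_inactive >= 90:
        return "inactive-90d"
    if days_inactive >= 30:
        return "inactive-30d"
    return None
```

For example, `inactivity_label(45, ["bug"])` yields `"inactive-30d"`, while the same item labeled `good first issue` is left untouched.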
    
    Authors:
      - Mike Wendt (@mike-wendt)
    
    Approvers:
      - Keith Kraus (@kkraus14)
      - Ray Douglass (@raydouglass)
    
    URL: #7395
    mike-wendt authored Feb 17, 2021
    53ed28e

Commits on Feb 24, 2021

  1. update changelog

    raydouglass committed Feb 24, 2021
    1544474