Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add separator-on-null parameter to strings concatenate APIs #8282

Merged
merged 10 commits into from
May 24, 2021

Conversation

davidwendt
Copy link
Contributor

Closes #4728

This PR adds a new parameter to the cudf::strings::concatenate APIs to specify if separators should be added between null entries when the null-replacement (narep) parameter is valid. If the narep scalar is invalid (i.e. null itself) then the entire output row becomes null. If not, separators are added between each element. Examples:

s1 = ['a', 'b', null, 'dd', null]
s2 = ['A', null, 'CC', 'D', null]
concatenate( {s1, s2}, sep='+', narep=invalid ) -> ['a+A', null, null, 'dd+D', null]
concatenate( {s1, s2}, sep='+', narep='@' ) -> ['a+A', 'b+@', '@+CC', 'dd+D', '@+@']
concatenate( {s1, s2}, sep='+', narep='' ) -> ['a+A', 'b+', '+CC', 'dd+D', '+']

The new parameter is an enum separator_on_nulls which has YES or NO settings. The default parameter value will be YES to keep the current behavior as expected by Python cudf and for consistency with Pandas behavior.
Specifying NO here will suppress the separator with null elements (when narep is valid).

concatenate( {s1, s2}, sep='+', narep='', NO ) -> ['a+A', 'b', 'CC', 'dd+D', '']

This PR also changes the name of the cudf::strings::concatenate_list_elements API to cudf::strings::join_list_elements instead. The API pattern and behavior more mimic the cudf::strings::join_strings then the concatenate functions. Also, these are called by the Python join functions so the rename makes it more consistent with cudf.

This is a breaking change in order to make these APIs more consistent. Previously, the separators column version was returning nulls only for an all-null row. This has been changed to honor the separator_on_null parameter instead. Currently there was no Python cudf API calling this version. Only the rename required minor changes to the Cython layer.

The gtests were updated to reflect the new behavior. None of the pytests required any changes since the default parameter value matches the original behavior for those APIs that cudf actually calls.

@davidwendt davidwendt added bug Something isn't working 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) breaking Breaking change labels May 19, 2021
@davidwendt davidwendt self-assigned this May 19, 2021
@github-actions github-actions bot added CMake CMake build issue Python Affects Python cuDF API. labels May 19, 2021
@davidwendt davidwendt requested a review from ttnghia May 19, 2021 13:08
@davidwendt davidwendt changed the title Separator on nulls policy Add separator-on-null parameter to strings concatenate APIs May 19, 2021
@codecov
Copy link

codecov bot commented May 19, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.06@3975f10). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.06    #8282   +/-   ##
===============================================
  Coverage                ?   82.88%           
===============================================
  Files                   ?      105           
  Lines                   ?    17874           
  Branches                ?        0           
===============================================
  Hits                    ?    14814           
  Misses                  ?     3060           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3975f10...18b1e03. Read the comment docs.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 19, 2021
@davidwendt davidwendt marked this pull request as ready for review May 19, 2021 17:40
@davidwendt davidwendt requested review from a team as code owners May 19, 2021 17:40
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes look good to me

Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmake / python lgtm

@kkraus14 kkraus14 added the 4 - Needs Review Waiting for reviewer to review or respond label May 20, 2021
@davidwendt davidwendt requested a review from cwharris May 21, 2021 02:01
@kkraus14 kkraus14 removed the 4 - Needs Review Waiting for reviewer to review or respond label May 21, 2021
Copy link
Contributor

@tgravescs tgravescs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested this latest patch out (along with pr 8285) and all my Spark tests pass so we have the desired null handling behavior.

@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ef20706 into rapidsai:branch-21.06 May 24, 2021
@davidwendt davidwendt deleted the separator-on-nulls-policy branch May 24, 2021 11:27
rapids-bot bot pushed a commit that referenced this pull request May 24, 2021
to stringConcatenate when using a scalar separator.

Reference #8282 changed to throw an exception if only a single column is passed in to the stringConcatenate using scalar separator.  Update our Java test for that functionality.

Signed-off-by: Thomas Graves <tgraves@nvidia.com>

Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jason Lowe (https://github.com/jlowe)

URL: #8330
rapids-bot bot pushed a commit that referenced this pull request May 26, 2021
This PR implements a new option for `strings::join_list_elements` on top of #8282. In particular, the new option is:
```
/**
 * @brief Setting for specifying what will be output from `join_list_elements` when an input list
 * is empty.
 */
enum class output_if_empty_list {
  EMPTY_STRING,  ///< Empty list will result in empty string
  NULL_ELEMENT   ///< Empty list will result in a null
};
```
This new option is necessary for implementing `concat_ws` in Spark, since the behavior of the output string is required to be different depending on the situation.

Currently blocked from merging by #8282.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Keith Kraus (https://github.com/kkraus14)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ashwin Srinath (https://github.com/shwina)

URL: #8285
rapids-bot bot pushed a commit that referenced this pull request May 26, 2021
Add java api's to be able to call string concatenate with separators.  

DO NOT MERGE until:  #8285 and #8282 are merged.  We need those changes for the API used in this PR as well as functionality to match what Spark needs.

new arguments were added to the existing concatenate api that takes a scalar for the separator, so I extended that. I added new api's for the concatenate api that takes a column as the separator.  I also added new api's for both join_list_elements api's, one with scalar separator and one with column separator.

Authors:
  - Thomas Graves (https://github.com/tgravescs)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Jason Lowe (https://github.com/jlowe)

URL: #8289
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change bug Something isn't working CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Concatenate with separator, ignore separator for null replacement Inconsistent
6 participants