Add nvtext hash_character_ngrams function #13654

davidwendt · 2023-07-03T15:01:31Z

Description

Adds a new nvtext function that returns a column of hashes of the generated character ngrams. The result can help speed up minhash and Jaccard calculations, which can use the hash values as unique tokens in place of the character ngrams themselves. Hash collisions should be rare, especially when the ngram width is small.
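As a rough CPU-side illustration of the idea (not the libcudf implementation), the sketch below hashes each character ngram and computes a Jaccard index over the hashes instead of over the ngram strings. The helper names are hypothetical, and `zlib.crc32` is only a stand-in hash chosen because it ships with Python; it is not the hash algorithm this PR uses.

```python
import zlib


def char_ngrams(text, n):
    """Return the overlapping character ngrams of width n (empty if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_char_ngrams(text, n):
    """Hash each ngram to a 32-bit integer. CRC32 is a stand-in, not the hash used by cuDF."""
    return [zlib.crc32(gram.encode("utf-8")) for gram in char_ngrams(text, n)]


def jaccard(a, b):
    """Jaccard index of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s1, s2 = "the quick brown fox", "the quick brown dog"

# Jaccard over the ngram strings themselves.
by_text = jaccard(char_ngrams(s1, 5), char_ngrams(s2, 5))

# Jaccard over fixed-width integer hashes; identical to by_text when no two
# distinct ngrams collide.
by_hash = jaccard(hashed_char_ngrams(s1, 5), hashed_char_ngrams(s2, 5))

print(by_text, by_hash)
```

The integer tokens are fixed width, so set operations on them avoid comparing variable-length strings, which is the motivation described above.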

Also, since the character ngrams are never materialized, the cudf column size limit becomes less of an issue: the limit applies to the number of hashes rather than to the total number of generated characters. Larger batch sizes may therefore be possible with this approach.
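A back-of-the-envelope sketch of that size argument follows. It assumes the column limit is the maximum of a 32-bit signed index (2**31 - 1 elements), that every character is a single byte, and a hypothetical batch of 30 million 20-character strings; none of these numbers come from the PR itself.

```python
# Assumption: one column can hold at most 2**31 - 1 elements (32-bit signed index).
SIZE_LIMIT = 2**31 - 1


def ngram_totals(num_strings, chars_per_string, n):
    """Return (number of ngrams, total characters if every ngram were materialized)."""
    per_string = max(chars_per_string - n + 1, 0)
    hashes = num_strings * per_string  # one hash per ngram
    chars = hashes * n                 # n characters per materialized ngram
    return hashes, chars


# Hypothetical batch: 30 million strings of 20 single-byte characters, 5-character ngrams.
hashes, chars = ngram_totals(30_000_000, 20, 5)

print(f"hashes: {hashes:,}  fits: {hashes <= SIZE_LIMIT}")
print(f"chars:  {chars:,}  fits: {chars <= SIZE_LIMIT}")
```

Under these assumptions the materialized ngrams would need 2.4 billion characters and exceed the limit, while the 480 million hashes stay well under it.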

The code for the ngram function from https://github.com/rapidsai/rapids-deduplication/issues/36 can be modified as follows:

```python
def ngram(ds):
    # 5-character ngrams; True returns the hashes as one list per input string
    return ds.str.hash_character_ngrams(5, True).list.unique()
```

The rest of the code can remain the same, except perhaps for this additional improvement.

This change reduced the runtime for the reference code above from 3.9s to 2.4s on my local machine.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence-

Python changes look good, thanks!

karthikeyann

Looks good to me. Minor suggestion on the doc.

cpp/include/nvtext/generate_ngrams.hpp

python/cudf/cudf/core/column/string.py

davidwendt · 2023-07-12T15:06:30Z

/merge

…er strings (#13874)

Fixes performance regression when generating character ngrams. The regression was introduced as part of refactoring common code when adding the `nvtext::hash_character_ngrams` function (Reference #13654). Defactoring the code fixed the regression. Overall, these functions only share about 6 lines of code in common so the defactoring is expected to require minimal maintenance. The defactoring involves reinstating the original kernel code logic for `nvtext::generate_character_ngrams`.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Bradley Dice (https://github.com/bdice)

URL: #13874

davidwendt added 2 commits July 3, 2023 10:44

Add nvtext hash_character_ngrams function

7182a47

Merge branch 'branch-23.08' into ngram-chars-hashed

d9477d6

davidwendt added the 2 - In Progress, libcudf, Python, strings, improvement, and non-breaking labels Jul 3, 2023

davidwendt self-assigned this Jul 3, 2023

github-actions bot added the CMake label Jul 3, 2023

davidwendt added 2 commits July 5, 2023 10:08

Merge branch 'branch-23.08' into ngram-chars-hashed

83d982a

add detail declaration

5b0dd13

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Jul 5, 2023

davidwendt marked this pull request as ready for review July 5, 2023 17:50

davidwendt requested review from a team as code owners July 5, 2023 17:50

davidwendt requested review from wence-, vyasr and PointKernel July 5, 2023 17:50

Merge branch 'branch-23.08' into ngram-chars-hashed

094440b

wence- approved these changes Jul 6, 2023

davidwendt mentioned this pull request Jul 6, 2023

Add nvtext::jaccard_index API for strings columns #13669

Merged

3 tasks

karthikeyann approved these changes Jul 9, 2023

cpp/include/nvtext/generate_ngrams.hpp

python/cudf/cudf/core/column/string.py

davidwendt added 2 commits July 10, 2023 16:17

Merge branch 'branch-23.08' into ngram-chars-hashed

9dd80de

add the name of the hash algorithm to the docs

e283c5c

PointKernel approved these changes Jul 11, 2023

davidwendt added 2 commits July 11, 2023 15:54

Merge branch 'branch-23.08' into ngram-chars-hashed

10c44d1

remove include of deleted header file

83be132

rapids-bot bot merged commit 22b00f5 into rapidsai:branch-23.08 Jul 12, 2023
56 of 57 checks passed

davidwendt deleted the ngram-chars-hashed branch July 12, 2023 15:06

davidwendt mentioned this pull request Aug 14, 2023

Fix nvtext::generate_character_ngrams performance regression for longer strings #13874

Merged

3 tasks
