Add nvtext hash_character_ngrams function #13654

Merged: 9 commits, Jul 12, 2023


@davidwendt davidwendt commented Jul 3, 2023

Description

Adds a new nvtext function that returns a column of hashes of the generated character ngrams. The result can help speed up minhash and Jaccard calculations, which can use the hash values as unique tokens in place of the actual character ngrams. Hash collisions should be rare, especially when the ngram size is small.
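
To illustrate why hashes can stand in for the ngrams themselves, here is a minimal pure-Python sketch (not the libcudf implementation). The helper names are hypothetical, and CRC32 is used only as a stand-in 32-bit hash; the hash function actually used by `hash_character_ngrams` is not stated here. As long as no two distinct ngrams collide, the Jaccard similarity computed on the hashes matches the one computed on the ngrams.

```python
import zlib


def char_ngrams(text, n):
    """Return the character ngrams of text (empty list if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hashed_char_ngrams(text, n):
    """Hash each ngram to a 32-bit integer. CRC32 is an assumed stand-in hash."""
    return [zlib.crc32(gram.encode("utf-8")) for gram in char_ngrams(text, n)]


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


s1 = "the quick brown fox"
s2 = "the quick brown dog"

# Similarity on the ngram strings versus on their integer hashes.
exact = jaccard(char_ngrams(s1, 5), char_ngrams(s2, 5))
hashed = jaccard(hashed_char_ngrams(s1, 5), hashed_char_ngrams(s2, 5))
print(exact, hashed)
```

The integer tokens are fixed-width, so set operations and minhash steps avoid comparing variable-length strings.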

Also, because the character ngrams are never materialized, the cudf column size limit becomes less of an issue: the limit applies to the number of hashes rather than to the total number of generated characters. Larger batch sizes may therefore be possible with this approach.
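
A rough back-of-the-envelope sketch of the size argument, assuming the limit in question is the maximum of a signed 32-bit size type (2^31 - 1) and single-byte characters. The batch below is hypothetical and the helper name is made up for illustration.

```python
INT32_MAX = 2**31 - 1  # assumed column size limit (signed 32-bit size type)


def ngram_output_sizes(lengths, n):
    """Return (number of ngrams, characters needed to materialize them all)."""
    count = sum(max(length - n + 1, 0) for length in lengths)
    return count, count * n


# Hypothetical batch: one million strings of 1000 single-byte characters each.
lengths = [1000] * 1_000_000
count, chars = ngram_output_sizes(lengths, 5)

print(count, count <= INT32_MAX)   # one hash per ngram
print(chars, chars <= INT32_MAX)   # characters if the ngrams were generated
```

In this hypothetical batch the hash count stays under the assumed limit while the materialized character count does not, which is the situation the paragraph above describes.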

The code for the ngram function from https://github.com/rapidsai/rapids-deduplication/issues/36 can be modified as follows:

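def ngram(ds):
    # 5 characters per ngram; True requests list output (one list of hashes per string),
    # which is then de-duplicated per row.
    return ds.str.hash_character_ngrams(5, True).list.unique()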

The rest of the code can remain the same, except perhaps for this additional improvement.

This change reduced the runtime for the reference code above from 3.9s to 2.4s on my local machine.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the labels 2 - In Progress, libcudf, Python, strings, improvement, and non-breaking on Jul 3, 2023
@davidwendt self-assigned this on Jul 3, 2023
@github-actions bot added the CMake label on Jul 3, 2023
@davidwendt added the label 3 - Ready for Review and removed the label 2 - In Progress on Jul 5, 2023
@davidwendt davidwendt marked this pull request as ready for review July 5, 2023 17:50
@davidwendt davidwendt requested review from a team as code owners July 5, 2023 17:50

@wence- wence- left a comment


Python changes look good, thanks!


@karthikeyann karthikeyann left a comment


Looks good to me.
Minor suggestion on the doc.

cpp/include/nvtext/generate_ngrams.hpp (resolved)
python/cudf/cudf/core/column/string.py (resolved)
@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit 22b00f5 into rapidsai:branch-23.08 Jul 12, 2023
56 of 57 checks passed
@davidwendt davidwendt deleted the ngram-chars-hashed branch July 12, 2023 15:06
rapids-bot bot pushed a commit that referenced this pull request Aug 16, 2023
…er strings (#13874)

Fixes a performance regression when generating character ngrams. The regression was introduced by refactoring common code when adding the `nvtext::hash_character_ngrams` function (Reference #13654). Defactoring the code fixed the regression. Overall, these functions share only about 6 lines of code, so the defactoring is expected to require minimal maintenance.
The defactoring involves reinstating the original kernel code logic for `nvtext::generate_character_ngrams`.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)

URL: #13874