Add nvtext::minhash function #12961

Merged
rapids-bot merged 89 commits into rapidsai:branch-23.06 from text-minhashing on Apr 21, 2023

Conversation

davidwendt
Contributor

Description

Adds the nvtext::minhash() function to compute the MinHash of strings in a string column without needing extra memory to hold the intermediate substrings and hash values.
This also uses the MurmurHash3_32 algorithm directly so the hash results have better parity with external algorithms. The API also accepts a hash seed value.

std::unique_ptr<cudf::column> minhash(
  cudf::strings_column_view const& input,
  cudf::size_type width               = 4,
  cudf::hash_value_type seed          = cudf::DEFAULT_HASH_SEED,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
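
As a rough illustration of the idea described above, here is a minimal host-side sketch of a per-string MinHash: hash every `width`-sized window with MurmurHash3_32 and keep the smallest value, without materializing the substrings. It is not the libcudf CUDA implementation. The names `murmur3_32` and `minhash_reference` are hypothetical, and several details are assumptions made for the example: windows are measured in bytes, strings shorter than `width` are hashed whole, empty strings produce 0, the default seed is 0, and the host is little-endian.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>
#include <vector>

// Rotate a 32-bit value left by r bits.
inline std::uint32_t rotl32(std::uint32_t x, int r) { return (x << r) | (x >> (32 - r)); }

// Reference MurmurHash3_x86_32 over a byte range (hypothetical helper name).
// Assumes a little-endian host when reading 4-byte blocks.
std::uint32_t murmur3_32(const char* data, std::size_t len, std::uint32_t seed)
{
  const std::uint32_t c1 = 0xcc9e2d51u;
  const std::uint32_t c2 = 0x1b873593u;
  std::uint32_t h        = seed;

  // Body: mix each full 4-byte block into the running hash.
  const std::size_t nblocks = len / 4;
  for (std::size_t i = 0; i < nblocks; ++i) {
    std::uint32_t k;
    std::memcpy(&k, data + i * 4, 4);
    k *= c1;
    k = rotl32(k, 15);
    k *= c2;
    h ^= k;
    h = rotl32(h, 13);
    h = h * 5 + 0xe6546b64u;
  }

  // Tail: mix the remaining 1 to 3 bytes, if any.
  const unsigned char* tail = reinterpret_cast<const unsigned char*>(data + nblocks * 4);
  const std::size_t rem     = len & 3;
  std::uint32_t k           = 0;
  if (rem >= 3) { k ^= static_cast<std::uint32_t>(tail[2]) << 16; }
  if (rem >= 2) { k ^= static_cast<std::uint32_t>(tail[1]) << 8; }
  if (rem >= 1) {
    k ^= static_cast<std::uint32_t>(tail[0]);
    k *= c1;
    k = rotl32(k, 15);
    k *= c2;
    h ^= k;
  }

  // Finalization: fold in the length and avalanche the bits.
  h ^= static_cast<std::uint32_t>(len);
  h ^= h >> 16;
  h *= 0x85ebca6bu;
  h ^= h >> 13;
  h *= 0xc2b2ae35u;
  h ^= h >> 16;
  return h;
}

// Host-side MinHash sketch (hypothetical name): one hash value per input string.
// Each window is hashed in place, so no substring copies are stored.
std::vector<std::uint32_t> minhash_reference(const std::vector<std::string>& input,
                                             std::size_t width  = 4,
                                             std::uint32_t seed = 0)
{
  std::vector<std::uint32_t> out;
  out.reserve(input.size());
  for (const auto& s : input) {
    if (s.empty()) {
      out.push_back(0);  // assumption: empty rows yield 0
      continue;
    }
    // Assumption: a string shorter than `width` is hashed as a single window.
    const std::size_t w = std::min(width, s.size());
    std::uint32_t best  = std::numeric_limits<std::uint32_t>::max();
    for (std::size_t pos = 0; pos + w <= s.size(); ++pos) {
      best = std::min(best, murmur3_32(s.data() + pos, w, seed));
    }
    out.push_back(best);
  }
  return out;
}
```

Calling `minhash_reference({"abcdefg", "abc"}, 4, 0)` returns one 32-bit value per string. A GPU version would presumably assign strings or windows to threads and reduce with a minimum, but that mapping is not described here.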

Closes #12950

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the feature request, 2 - In Progress, libcudf, strings, and non-breaking labels on Mar 16, 2023
@davidwendt self-assigned this on Mar 16, 2023
@github-actions bot added the CMake label on Mar 16, 2023
@davidwendt changed the title from "Text minhashing" to "Add nvtext::minhash function" on Mar 16, 2023
@github-actions bot added the Python label on Mar 17, 2023
@GregoryKimball removed the request for review from elstehle on April 17, 2023 16:51
@davidwendt requested a review from bdice on April 19, 2023 12:30
@davidwendt
Contributor Author

/merge

@rapids-bot bot merged commit c7f2342 into rapidsai:branch-23.06 on Apr 21, 2023
@davidwendt deleted the text-minhashing branch on April 21, 2023 13:36
Labels
  • 3 - Ready for Review: Ready for review by team
  • CMake: CMake build issue
  • feature request: New feature or request
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.
  • strings: strings issues (C++ and Python)
Development

Successfully merging this pull request may close these issues.

[QST] Efficient minhashing with cuDF
5 participants