Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QST] Efficient minhashing with cuDF #12950

Closed
ayushdg opened this issue Mar 15, 2023 · 1 comment · Fixed by #12961
Closed

[QST] Efficient minhashing with cuDF #12950

ayushdg opened this issue Mar 15, 2023 · 1 comment · Fixed by #12961
Assignees
Labels
2 - In Progress Currently a work in progress feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@ayushdg
Copy link
Member

ayushdg commented Mar 15, 2023

What is your question?
Minhashing or locality sensitive hashing is a popular technique used to group similar documents based on the similarity b/w words/shingles in those documents.

Right now I have an approach that computes the minhashes for documents as follows:
Series of documents -> n-grams(shingles)-> hash_values -> groupby+min


Documents

0            doc 1
1    this is doc 2
2            doc 3
Name: doc_text, dtype: object

exploded n-grams

0     doc 
0     oc 1
1     this
1     his 
1     is i
1     s is
1     is 
1     is d
1     s do
1     doc
1     doc 
1     oc 2
3     doc 
3     oc 3
Name: doc_text, dtype: object

hash_values -> groupby+min

2    2005254014
0     216665549
1    1081029400
dtype: uint32

The need to explode each document into a bunch of tokens increase the memory consumption per document and adds the need to groupby document_id at the end

I was wondering if some sort of UDF or custom kernel might be more efficient for this task which can parallelize the sliding window hashing + storing minhash that would avoid this extra memory usage and the need for a groupby at the end.

cc: @davidwendt @wence-

@ayushdg ayushdg added question Further information is requested Needs Triage Need team to review and classify labels Mar 15, 2023
@davidwendt davidwendt self-assigned this Mar 15, 2023
@ayushdg
Copy link
Member Author

ayushdg commented Mar 23, 2023

Edit: Also it is common in practice to compute multiple minhashes k, either by using k different hash function, 1 hash function and k different permutations of that hash, or 1 hash function with k different seeds.

Taking the last case as an example we would compute 1 minhash for each seed per document, and end up with k such minhashes.

@GregoryKimball GregoryKimball added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed question Further information is requested Needs Triage Need team to review and classify labels Apr 2, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 21, 2023
Adds the `nvtext::minhash()` function to compute the MinHash of strings in a string column without needing extra memory to hold the intermediate substrings and hash values.
This also uses the `MurmurHash3_32` algorithm directly so the hash results have better parity with external algorithms. The API also accepts a hash seed value.

```
std::unique_ptr<cudf::column> minhash(
  cudf::strings_column_view const& input,
  cudf::size_type width               = 4,
  cudf::hash_value_type seed          = cudf::DEFAULT_HASH_SEED,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```

Closes #12950

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)
  - Karthikeyan (https://github.com/karthikeyann)
  - Ayush Dattagupta (https://github.com/ayushdg)

URL: #12961
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants