Add nvtext::jaccard_index API for strings columns #13669
Conversation
…ngram-chars-hashed
We are seeing a 6x+ speedup 🚀 with this PR on our internal workflows. Thanks for the great work on this, @davidwendt. Let's try to get this one in soon.
Implementation looks good. My primary concern is around the default width (I'd prefer no default) and the requirement for a minimum of width 5.
A couple small suggestions, otherwise LGTM!
This is a slick implementation, nice work! I have some queries, but most are either not terribly substantial or could be punted on if needed.
Do we care about the potential for hash collisions, or is the approximation we get from the hashed values good enough?
Thanks for the fixes! One request, but totally fine to address it later.
/merge
Adds the `nvtext::jaccard_index()` API to cudf.

```
std::unique_ptr<cudf::column> jaccard_index(
  cudf::strings_column_view const& input1,
  cudf::strings_column_view const& input2,
  cudf::size_type width,
  rmm::mr::device_memory_resource* mr);
```

The Jaccard Index is described here for computing the similarity between two arrays of elements: https://en.wikipedia.org/wiki/Jaccard_index

```
Formula is J = |A ∩ B|
               -------
               |A ∪ B|

where |A ∩ B| is the number of common values between A and B
and |x| is the number of unique values in x.
```

The computation here compares strings columns by treating each string as text (e.g. sentences, paragraphs, articles) instead of individual words or tokens to be compared directly. The algorithm applies a sliding window (size specified by the `width` parameter) to each string to form the set of tokens to compare within each row of the two input columns.

These substrings are essentially character ngrams, and each substring is part of the union and intersection calculations for that row. For efficiency, the substrings are hashed using the default MurmurHash32 to identify uniqueness within each row.

Once the union and intersection sizes for the row are resolved, the Jaccard index is computed using the above formula and returned as a float32 value.

The two input columns must be the same size and will match the output column size.

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#13669
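As a worked illustration of the formula (the strings and the width of 5 are chosen here for the example and do not come from the PR), consider one row holding `abcdefg` in the first column and `bcdefgh` in the second. Sliding a 5-character window over each string gives the two sets below, and the index follows directly:

```latex
\begin{aligned}
A &= \{\texttt{abcde},\ \texttt{bcdef},\ \texttt{cdefg}\} \\
B &= \{\texttt{bcdef},\ \texttt{cdefg},\ \texttt{defgh}\} \\
|A \cap B| &= 2, \qquad |A \cup B| = 4 \\
J &= \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4} = 0.5
\end{aligned}
```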
Description
Adds the `nvtext::jaccard_index()` API to cudf.

The Jaccard Index is described here for computing the similarity between two arrays of elements: https://en.wikipedia.org/wiki/Jaccard_index
The computation here compares strings columns by treating each string as text (e.g. sentences, paragraphs, articles) instead of individual words or tokens to be compared directly. The algorithm applies a sliding window (size specified by the `width` parameter) to each string to form the set of tokens to compare within each row of the two input columns.

These substrings are essentially character ngrams, and each substring is part of the union and intersection calculations for that row. For efficiency, the substrings are hashed using the default MurmurHash32 to identify uniqueness within each row.
Once the union and intersect sizes for the row are resolved, the Jaccard index is computed using the above formula and returned as a float32 value.
The two input columns must be the same size and will match the output column size.
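The row-wise computation described above can be sketched on the CPU as follows. This is a minimal reference sketch, not the libcudf implementation: the names `hashed_ngrams`, `jaccard_row`, and `jaccard_index_sketch` are hypothetical, `std::hash` stands in for MurmurHash32, windows are taken over bytes rather than UTF-8 characters, and a row with no ngrams on either side is assumed to yield 0.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <unordered_set>
#include <vector>

// Hash every `width`-byte window of `text` and keep only the unique hashes.
// Assumption: std::hash replaces MurmurHash32, and windows are byte-based.
std::unordered_set<std::size_t> hashed_ngrams(std::string_view text, std::size_t width)
{
  std::unordered_set<std::size_t> hashes;
  if (width == 0 || text.size() < width) { return hashes; }  // no full window fits
  for (std::size_t pos = 0; pos + width <= text.size(); ++pos) {
    hashes.insert(std::hash<std::string_view>{}(text.substr(pos, width)));
  }
  return hashes;
}

// Jaccard index for one row: |A ∩ B| / |A ∪ B| over the hashed ngram sets.
float jaccard_row(std::string_view a, std::string_view b, std::size_t width)
{
  auto const set_a = hashed_ngrams(a, width);
  auto const set_b = hashed_ngrams(b, width);

  std::size_t intersection_size = 0;
  for (auto const h : set_a) { intersection_size += set_b.count(h); }

  std::size_t const union_size = set_a.size() + set_b.size() - intersection_size;
  // Assumption: an empty union (both strings shorter than `width`) maps to 0.
  if (union_size == 0) { return 0.0f; }
  return static_cast<float>(intersection_size) / static_cast<float>(union_size);
}

// Column-level wrapper: both inputs must have the same number of rows,
// and the output has one float per row.
std::vector<float> jaccard_index_sketch(std::vector<std::string> const& input1,
                                        std::vector<std::string> const& input2,
                                        std::size_t width)
{
  if (input1.size() != input2.size()) {
    throw std::invalid_argument("input columns must be the same size");
  }
  std::vector<float> output;
  output.reserve(input1.size());
  for (std::size_t row = 0; row < input1.size(); ++row) {
    output.push_back(jaccard_row(input1[row], input2[row], width));
  }
  return output;
}
```

Because only hashes are compared, two different substrings that collide would be counted as the same ngram, so the result is an approximation of the exact set-based index.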