Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nvtext::jaccard_index API for strings columns #13669

Merged
merged 36 commits into from
Jul 20, 2023

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Jul 6, 2023

Description

Adds the nvtext::jaccard_index() API to cudf.

std::unique_ptr<cudf::column> jaccard_index(
  cudf::strings_column_view const& input1,  cudf::strings_column_view const& input2,  cudf::size_type width,
  rmm::mr::device_memory_resource* mr);

The Jaccard Index is described here for computing the distance between two arrays of elements: https://en.wikipedia.org/wiki/Jaccard_index

 Formula is J = |A ∩ B| 
                -------
                |A ∪ B|

 where |A ∩ B| is number of common values between A and B
 and |x| is the number of unique values in x.

The computation here compares strings columns by treating each string as text (i.e. sentences, paragraphs, articles) instead of individual words or tokens to be compared directly. The algorithm applies a sliding window (size specified by the width parameter) to each string to form the set of tokens to compare within each row of the two input columns.

These substrings are essentially character ngrams and each substring is part of the union and intersect calculations for that row. For efficiency, the substrings are hashed using the default MurmurHash32 to identify uniqueness within each row.
Once the union and intersect sizes for the row are resolved, the Jaccard index is computed using the above formula and returned as a float32 value.

The two input columns must be the same size and will match the output column size.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. non-breaking Non-breaking change labels Jul 6, 2023
@davidwendt davidwendt self-assigned this Jul 6, 2023
@github-actions github-actions bot added the CMake CMake build issue label Jul 6, 2023
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 13, 2023
@davidwendt davidwendt marked this pull request as ready for review July 13, 2023 20:34
@davidwendt davidwendt marked this pull request as ready for review July 17, 2023 18:50
@VibhuJawa
Copy link
Member

VibhuJawa commented Jul 17, 2023

We are seeing a 6x+speedup 🚀 with this PR on our internal workflows. Thanks for the great work on this @davidwendt .

Let's try to get this one in soon.
CC: @ayushdg , @GregoryKimball

@davidwendt davidwendt changed the title Add nvtext::jaccard_index API for strings columns Add nvtext::jaccard_index API for strings columns Jul 18, 2023
@davidwendt davidwendt changed the title Add nvtext::jaccard_index API for strings columns Add nvtext::jaccard_index API for strings columns Jul 18, 2023
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good. My primary concern is around the default width (I'd prefer no default) and the requirement for a minimum of width 5.

cpp/include/nvtext/jaccard.hpp Outdated Show resolved Hide resolved
cpp/include/nvtext/jaccard.hpp Outdated Show resolved Hide resolved
cpp/src/text/jaccard.cu Show resolved Hide resolved
cpp/src/text/jaccard.cu Outdated Show resolved Hide resolved
cpp/src/text/jaccard.cu Outdated Show resolved Hide resolved
cpp/src/text/jaccard.cu Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
cpp/src/text/jaccard.cu Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/cpp/nvtext/jaccard.pxd Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
@davidwendt davidwendt requested a review from bdice July 18, 2023 18:15
@davidwendt davidwendt changed the title Add nvtext::jaccard_index API for strings columns Add nvtext::jaccard_index API for strings columns Jul 18, 2023
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple small suggestions, otherwise LGTM!

python/cudf/cudf/_lib/cpp/nvtext/jaccard.pxd Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a slick implementation, nice work! I have some queries, but most are either not terribly substantial or could be punted on if needed.

Do we care about the potential for hash collisions, or is the approximation we get from the hashed values good enough?

python/cudf/cudf/_lib/cpp/nvtext/jaccard.pxd Show resolved Hide resolved
cpp/include/nvtext/jaccard.hpp Show resolved Hide resolved
cpp/include/nvtext/jaccard.hpp Show resolved Hide resolved
cpp/src/text/jaccard.cu Outdated Show resolved Hide resolved
cpp/src/text/jaccard.cu Outdated Show resolved Hide resolved
cpp/src/text/jaccard.cu Show resolved Hide resolved
cpp/src/text/jaccard.cu Outdated Show resolved Hide resolved
cpp/src/text/jaccard.cu Outdated Show resolved Hide resolved
cpp/tests/text/jaccard_tests.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/text/jaccard.cpp Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes! One request, but totally fine to address it later.

python/cudf/cudf/tests/text/test_text_methods.py Outdated Show resolved Hide resolved
@vyasr
Copy link
Contributor

vyasr commented Jul 20, 2023

/merge

@rapids-bot rapids-bot bot merged commit ad64c66 into rapidsai:branch-23.08 Jul 20, 2023
60 checks passed
etseidl pushed a commit to etseidl/cudf that referenced this pull request Jul 21, 2023
Adds the `nvtext::jaccard_index()` API to cudf.

```
std::unique_ptr<cudf::column> jaccard_index(
  cudf::strings_column_view const& input1,  cudf::strings_column_view const& input2,  cudf::size_type width,
  rmm::mr::device_memory_resource* mr);
```

The Jaccard Index is described here for computing the distance between two arrays of elements:  https://en.wikipedia.org/wiki/Jaccard_index

```
 Formula is J = |A ∩ B| 
                -------
                |A ∪ B|

 where |A ∩ B| is number of common values between A and B
 and |x| is the number of unique values in x.
```

The computation here compares strings columns by treating each string as text (i.e. sentences, paragraphs, articles) instead of individual words or tokens to be compared directly. The algorithm applies a sliding window (size specified by the `width` parameter) to each string to form the set of tokens to compare within each row of the two input columns.

These substrings are essentially character ngrams and each substring is part of the union and intersect calculations for that row. For efficiency, the substrings are hashed using the default MurmurHash32 to identify uniqueness within each row.
Once the union and intersect sizes for the row are resolved, the Jaccard index is computed using the above formula and returned as a float32 value.

The two input columns must be the same size and will match the output column size.

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#13669
@davidwendt davidwendt deleted the fea-jaccard-compute branch July 31, 2023 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

4 participants