Add nvtext::jaccard_index API for strings columns #13669

davidwendt · 2023-07-06T21:40:34Z

Description

Adds the nvtext::jaccard_index() API to cudf.

std::unique_ptr<cudf::column> jaccard_index(
  cudf::strings_column_view const& input1,
  cudf::strings_column_view const& input2,
  cudf::size_type width,
  rmm::mr::device_memory_resource* mr);

The Jaccard index measures the similarity between two sets of elements and is described here: https://en.wikipedia.org/wiki/Jaccard_index

Formula is J = |A ∩ B| / |A ∪ B|

where |A ∩ B| is the number of common values between A and B
and |x| is the number of unique values in x.
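For illustration only, here is a minimal pure-Python sketch of the formula above. The function name `jaccard` and the choice to score two empty inputs as 0.0 are assumptions made for this example; they are not taken from the libcudf implementation.

```python
def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two iterables, treating each as a set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption for this sketch: two empty inputs score 0.0
    return len(set_a & set_b) / len(union)


# Duplicates are ignored: 2 common values (2, 3) out of 4 unique values (1, 2, 3, 4).
print(jaccard([1, 2, 3, 3], [2, 3, 4]))
```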

The computation here compares strings columns by treating each string as text (e.g. sentences, paragraphs, articles) rather than as individual words or tokens to be compared directly. The algorithm applies a sliding window (its size specified by the width parameter) to each string to form the set of tokens to compare within each row of the two input columns.

These substrings are essentially character ngrams, and each substring contributes to the union and intersection calculations for that row. For efficiency, the substrings are hashed using the default MurmurHash32 to identify uniqueness within each row.
Once the union and intersection sizes for the row are resolved, the Jaccard index is computed using the above formula and returned as a float32 value.

The two input columns must be the same size, and the output column has that same size.
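The steps described above can be sketched on the CPU in plain Python. This is not the libcudf code: `hashed_ngrams` and this `jaccard_index` are hypothetical helpers, `zlib.crc32` is only a stand-in 32-bit hash (the PR uses MurmurHash32), the result is a list of Python floats rather than a float32 column, and scoring rows that produce no ngrams as 0.0 is an assumption.

```python
import zlib


def hashed_ngrams(text, width):
    """Hash every `width`-character window of `text` into a set of 32-bit values."""
    # zlib.crc32 is only a stand-in 32-bit hash for this sketch, not MurmurHash.
    return {
        zlib.crc32(text[i:i + width].encode("utf-8"))
        for i in range(len(text) - width + 1)
    }


def jaccard_index(col1, col2, width):
    """Return one Jaccard index per row for two equal-length lists of strings."""
    if len(col1) != len(col2):
        raise ValueError("input columns must be the same size")
    result = []
    for s1, s2 in zip(col1, col2):
        a, b = hashed_ngrams(s1, width), hashed_ngrams(s2, width)
        union_size = len(a | b)
        # Assumption for this sketch: rows with no ngrams at all score 0.0.
        result.append(len(a & b) / union_size if union_size else 0.0)
    return result


# "abcde" and "abcdf" share 3 of their 5 distinct 2-character ngrams.
print(jaccard_index(["abcde", "hello"], ["abcdf", "hello"], width=2))
```

Storing hashes instead of substrings keeps each set element a fixed-size integer, which is the efficiency point made in the description.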

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


VibhuJawa · 2023-07-17T18:55:38Z

We are seeing a 6x+ speedup 🚀 with this PR on our internal workflows. Thanks for the great work on this, @davidwendt.

Let's try to get this one in soon.
CC: @ayushdg, @GregoryKimball

bdice

Implementation looks good. My primary concerns are the default width (I'd prefer no default) and the requirement of a minimum width of 5.

cpp/include/nvtext/jaccard.hpp

cpp/src/text/jaccard.cu

python/cudf/cudf/core/column/string.py

cpp/src/text/jaccard.cu

python/cudf/cudf/_lib/cpp/nvtext/jaccard.pxd

python/cudf/cudf/core/column/string.py

bdice

A couple small suggestions, otherwise LGTM!

python/cudf/cudf/_lib/cpp/nvtext/jaccard.pxd

python/cudf/cudf/core/column/string.py

vyasr

This is a slick implementation, nice work! I have some queries, but most are either not terribly substantial or could be punted on if needed.

Do we care about the potential for hash collisions, or is the approximation we get from the hashed values good enough?
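As background for the collision question, a rough estimate can be made with the standard birthday approximation. This sketch assumes an idealized, uniformly distributed hash and is not a measurement of the PR's hash function; `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n, bits=32):
    """Approximate chance that any two of n distinct items share a hash value.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**bits)),
    which assumes a uniformly distributed hash.
    """
    space = 2 ** bits
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))


# Estimate for a row that yields 1,000 distinct ngrams hashed to 32 bits.
print(collision_probability(1_000))
```

Under these assumptions, a row with 1,000 distinct ngrams has a collision chance on the order of 1e-4, and the estimate only reaches about 50% near 77,000 distinct ngrams in a single row. A collision would make two distinct ngrams count as one, slightly changing the union and intersection sizes.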

python/cudf/cudf/_lib/cpp/nvtext/jaccard.pxd

cpp/include/nvtext/jaccard.hpp

cpp/src/text/jaccard.cu

cpp/tests/text/jaccard_tests.cpp

cpp/benchmarks/text/jaccard.cpp

vyasr

Thanks for the fixes! One request, but totally fine to address it later.

python/cudf/cudf/tests/text/test_text_methods.py

vyasr · 2023-07-20T17:11:07Z

/merge

Authors:
- David Wendt (https://github.com/davidwendt)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: rapidsai#13669

davidwendt added 8 commits July 3, 2023 10:44

Add nvtext hash_character_ngrams function

7182a47

Merge branch 'branch-23.08' into ngram-chars-hashed

d9477d6

Merge branch 'branch-23.08' into ngram-chars-hashed

83d982a

add detail declaration

5b0dd13

Merge branch 'branch-23.08' into ngram-chars-hashed

094440b

Add nvtext::jaccard_index API for strings columns

7074251

Merge branch 'branch-23.08' into ngram-chars-hashed

c21b26a

Merge branch 'ngram-chars-hashed' into fea-jaccard-compute

f0bfbdb

davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. non-breaking Non-breaking change labels Jul 6, 2023

davidwendt self-assigned this Jul 6, 2023

github-actions bot added the CMake CMake build issue label Jul 6, 2023

davidwendt added 11 commits July 10, 2023 16:17

Merge branch 'branch-23.08' into ngram-chars-hashed

9dd80de

add the name of the hash algorithm to the docs

e283c5c

Merge branch 'branch-23.08' into ngram-chars-hashed

10c44d1

remove include of deleted header file

83be132

Merge branch 'ngram-chars-hashed' of github.com:davidwendt/cudf into …

cfd8091

…ngram-chars-hashed

Merge branch 'ngram-chars-hashed' into fea-jaccard-compute

6e3c1dd

remove unneeded include

3845990

fix merge conflict

d11e0d6

fix doxygen

698e708

add gtests with nulls

8fb43b7

Merge branch 'branch-23.08' into fea-jaccard-compute

e492d77

davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 13, 2023

Merge branch 'branch-23.08' into fea-jaccard-compute

d9207b5

davidwendt marked this pull request as ready for review July 13, 2023 20:34

davidwendt added 4 commits July 17, 2023 11:45

Merge branch 'branch-23.08' into fea-jaccard-compute

423e747

add pytest

94b4f05

Merge branch 'branch-23.08' into fea-jaccard-compute

38523da

free some working memory before final result

631c60c

davidwendt marked this pull request as ready for review July 17, 2023 18:50

fix errors gtests

03cb399

davidwendt changed the title ~~Add nvtext::jaccard_index API for strings columns~~ Add nvtext::jaccard_index API for strings columns Jul 18, 2023

davidwendt added 2 commits July 18, 2023 11:36

Merge branch 'branch-23.08' into fea-jaccard-compute

1aed45d

add more const decls

b37d9f1

bdice reviewed Jul 18, 2023


add defaulted stream parameter

562c873

davidwendt requested a review from bdice July 18, 2023 18:15

davidwendt changed the title ~~Add nvtext::jaccard_index API for strings columns~~ Add nvtext::jaccard_index API for strings columns Jul 18, 2023

Merge branch 'branch-23.08' into fea-jaccard-compute

5266166

bdice mentioned this pull request Jul 19, 2023

[ENH] Audit argsort + gather/scatter patterns for missing performance #13557

Open

Merge branch 'branch-23.08' into fea-jaccard-compute

ab83dd7

bdice approved these changes Jul 19, 2023


davidwendt added 2 commits July 19, 2023 13:18

removed default value for width parameter

f844f07

Merge branch 'branch-23.08' into fea-jaccard-compute

7d7d5af

vyasr requested changes Jul 19, 2023


vyasr mentioned this pull request Jul 19, 2023

[FEA] Benchmark Jaccard with alternative distributions #13726

Open

add to pytest

7c93605

vyasr approved these changes Jul 19, 2023


python/cudf/cudf/tests/text/test_text_methods.py

Update test

0778c74

rapids-bot bot merged commit ad64c66 into rapidsai:branch-23.08 Jul 20, 2023
60 checks passed

davidwendt deleted the fea-jaccard-compute branch July 31, 2023 14:19
