Remove hash_character_ngrams dependency from jaccard_index #16241

Merged (15 commits) on Jul 18, 2024


@davidwendt davidwendt commented Jul 10, 2024

Description

Removes the internal dependency of nvtext::jaccard_index on nvtext::hash_character_ngrams.
This works around the size_type limit imposed by hash_character_ngrams, which returns a list column.
It also specializes the hashing logic for the Jaccard calculation.
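
The limit comes from the column type: cudf::size_type is a 32-bit signed integer, so a list column's offsets can address at most 2^31 - 1 child elements in total. As a rough host-side sketch (plain C++, not libcudf code), the following estimates how many character ngrams a batch of strings produces and whether that total still fits. The helper names and the ngram-count formula (chars - width + 1, or 0 for shorter strings) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: number of character ngrams of size `width`
// in a string that has `chars` characters (assumed formula).
std::int64_t ngram_count(std::int64_t chars, std::int64_t width)
{
  assert(width > 0);
  return chars >= width ? chars - width + 1 : 0;
}

// Total ngrams over all rows, accumulated in 64 bits so the sum
// itself does not overflow for realistic inputs.
std::int64_t total_ngrams(std::vector<std::int64_t> const& chars_per_row, std::int64_t width)
{
  std::int64_t total = 0;
  for (auto chars : chars_per_row) { total += ngram_count(chars, width); }
  return total;
}

// True when the total cannot be indexed by 32-bit signed offsets.
bool exceeds_size_type(std::int64_t total)
{
  return total > std::numeric_limits<std::int32_t>::max();
}
```

For example, one million rows of 4096 characters each with width 5 yield 4092 ngrams per row, which is more than a 32-bit offset can index.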

The overall algorithm has not changed. Code has moved around a bit, and the internal list columns have been replaced with plain offsets and values vectors.
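
A minimal host-side sketch of the idea, assuming the flat layout means that row i owns values[offsets[i], offsets[i+1]). This is not the libcudf implementation: std::hash stands in for the real hash function, ngrams are taken byte-wise rather than per character, the 64-bit offsets are an assumption, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <string>
#include <string_view>
#include <vector>

// Flat stand-in for a list column: row i owns values[offsets[i], offsets[i+1]).
struct hashed_ngrams {
  std::vector<std::int64_t> offsets;  // assumed 64-bit, so totals are not capped at 2^31 - 1
  std::vector<std::size_t> values;    // sorted, de-duplicated ngram hashes of each row
};

// Hash every ngram of each row, then sort and de-duplicate within the row.
hashed_ngrams hash_rows(std::vector<std::string> const& rows, std::size_t width)
{
  assert(width > 0);
  hashed_ngrams out;
  out.offsets.push_back(0);
  for (auto const& row : rows) {
    std::string_view const sv{row};
    auto const first = out.values.size();  // where this row's hashes start
    for (std::size_t pos = 0; pos + width <= sv.size(); ++pos) {
      out.values.push_back(std::hash<std::string_view>{}(sv.substr(pos, width)));
    }
    auto const begin = out.values.begin() + static_cast<std::ptrdiff_t>(first);
    std::sort(begin, out.values.end());
    out.values.erase(std::unique(begin, out.values.end()), out.values.end());
    out.offsets.push_back(static_cast<std::int64_t>(out.values.size()));
  }
  return out;
}

// Jaccard index of row idx: intersection size divided by union size.
// Returning 0 for an empty union is an assumption of this sketch.
float jaccard_row(hashed_ngrams const& a, hashed_ngrams const& b, std::size_t idx)
{
  assert(idx + 1 < a.offsets.size() && idx + 1 < b.offsets.size());
  auto const a_begin = a.values.begin() + a.offsets[idx];
  auto const a_end   = a.values.begin() + a.offsets[idx + 1];
  auto const b_begin = b.values.begin() + b.offsets[idx];
  auto const b_end   = b.values.begin() + b.offsets[idx + 1];
  std::vector<std::size_t> common;
  std::set_intersection(a_begin, a_end, b_begin, b_end, std::back_inserter(common));
  auto const intersects = static_cast<std::int64_t>(common.size());
  auto const unions     = (a_end - a_begin) + (b_end - b_begin) - intersects;
  return unions > 0 ? static_cast<float>(intersects) / static_cast<float>(unions) : 0.0f;
}
```

Calling hash_rows on each input and then jaccard_row per row index mirrors the two stages described above: build per-row unique hashes, then compare them.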

Closes #16157

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the labels 2 - In Progress, libcudf, strings, improvement, and non-breaking on Jul 10, 2024
@davidwendt self-assigned this on Jul 10, 2024
@davidwendt changed the title from "Jaccard hash limit" to "Remove hash_character_ngrams dependency from jaccard_index" on Jul 10, 2024
@davidwendt added the label 3 - Ready for Review and removed the label 2 - In Progress on Jul 11, 2024
@davidwendt

The benchmark also shows good improvements for long strings:

|  rows  |  width  | ssw  |   Ref Time |   Cmp Time |           Diff |   %Diff |
|--------|---------|------|------------|------------|----------------|---------|
| 32768  |   128   |   5  |   3.348 ms |   3.288 ms |     -60.283 us |  -1.80% |
| 131072 |   128   |   5  |  12.687 ms |  12.259 ms |    -428.319 us |  -3.38% |
| 262144 |   128   |   5  |  29.288 ms |  24.124 ms |   -5164.332 us | -17.63% |
| 32768  |   512   |   5  |   6.630 ms |   5.427 ms |   -1203.693 us | -18.15% |
| 131072 |   512   |   5  |  36.891 ms |  20.705 ms |  -16186.027 us | -43.87% |
| 262144 |   512   |   5  | 141.644 ms |  41.023 ms | -100620.349 us | -71.04% |
| 32768  |  1024   |   5  |  12.238 ms |   9.189 ms |   -3048.618 us | -24.91% |
| 131072 |  1024   |   5  |  79.830 ms |  35.594 ms |  -44236.498 us | -55.41% |
| 262144 |  1024   |   5  | 305.998 ms |  70.760 ms | -235238.131 us | -76.88% |
| 32768  |  2048   |   5  |  19.203 ms |  12.492 ms |   -6710.927 us | -34.95% |
| 131072 |  2048   |   5  | 147.357 ms |  47.998 ms |  -99359.754 us | -67.43% |
| 262144 |  2048   |   5  | 660.775 ms |  95.292 ms | -565483.399 us | -85.58% |
| 32768  |   128   |  10  |   3.386 ms |   3.168 ms |    -218.336 us |  -6.45% |
| 131072 |   128   |  10  |  12.029 ms |  11.760 ms |    -268.798 us |  -2.23% |
| 262144 |   128   |  10  |  27.850 ms |  23.167 ms |   -4683.010 us | -16.82% |
| 32768  |   512   |  10  |   7.598 ms |   5.621 ms |   -1976.327 us | -26.01% |
| 131072 |   512   |  10  |  38.809 ms |  21.401 ms |  -17407.598 us | -44.85% |
| 262144 |   512   |  10  | 183.316 ms |  42.413 ms | -140903.753 us | -76.86% |
| 32768  |  1024   |  10  |  14.295 ms |   9.648 ms |   -4646.415 us | -32.50% |
| 131072 |  1024   |  10  |  85.602 ms |  37.213 ms |  -48388.792 us | -56.53% |
| 262144 |  1024   |  10  | 416.415 ms |  73.965 ms | -342449.973 us | -82.24% |
| 32768  |  2048   |  10  |  23.723 ms |  13.409 ms |  -10314.216 us | -43.48% |
| 131072 |  2048   |  10  | 161.892 ms |  51.430 ms | -110462.043 us | -68.23% |
| 262144 |  2048   |  10  | 918.178 ms | 102.122 ms | -816056.502 us | -88.88% |
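
The Diff and %Diff columns appear to follow Diff = Cmp - Ref and %Diff = Diff / Ref, which can be checked against any row. The short sketch below recomputes them and also derives a speedup factor (Ref / Cmp), which the table does not list. The struct and function names are hypothetical.

```cpp
#include <cassert>

// Derived figures for one benchmark row (times in milliseconds).
struct comparison {
  double diff_ms;   // Cmp - Ref; negative means the new code is faster
  double pct_diff;  // 100 * (Cmp - Ref) / Ref
  double speedup;   // Ref / Cmp
};

comparison compare(double ref_ms, double cmp_ms)
{
  assert(ref_ms > 0.0 && cmp_ms > 0.0);
  double const diff = cmp_ms - ref_ms;
  return {diff, 100.0 * diff / ref_ms, ref_ms / cmp_ms};
}
```

For the last row (918.178 ms reference, 102.122 ms after the change), compare gives a %Diff that matches the table's -88.88% after rounding and a speedup of roughly 9x.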

@davidwendt davidwendt marked this pull request as ready for review July 11, 2024 17:08
@davidwendt davidwendt requested a review from a team as a code owner July 11, 2024 17:08

@bdice bdice left a comment

A few small comments but generally all looks fine.

cpp/src/text/jaccard.cu: three outdated review threads (resolved)
cudf::size_type const* d_uniques2;
cudf::size_type const* d_intersects;

__device__ float operator()(cudf::size_type idx) const

@bdice

It surprises me a bit that we return float and not double for this function.

@davidwendt

I believe this is what is expected. I can double-check though.

@davidwendt

Confirmed float is correct for now.
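
For context, the excerpt under review shows only two members and the call signature. A host-side sketch of what such a functor might compute is given below. The d_uniques1 member, the function body, and the zero result for an empty union are assumptions; `__device__` is omitted so the sketch compiles as plain C++, and the alias stands in for cudf::size_type.

```cpp
#include <cassert>
#include <cstdint>

using size_type = std::int32_t;  // stand-in for cudf::size_type

// Hypothetical functor: Jaccard index per row from precomputed counts.
struct jaccard_fn {
  size_type const* d_uniques1;    // assumed: unique ngram count per row, first input
  size_type const* d_uniques2;    // unique ngram count per row, second input
  size_type const* d_intersects;  // ngrams common to both inputs, per row

  float operator()(size_type idx) const
  {
    assert(idx >= 0);
    auto const intersects = d_intersects[idx];
    auto const unions     = d_uniques1[idx] + d_uniques2[idx] - intersects;
    // The result is a ratio in [0, 1] and is returned as float, as in the excerpt.
    return unions > 0 ? static_cast<float>(intersects) / static_cast<float>(unions) : 0.0f;
  }
};
```

With this shape, the return type only affects the precision of the final ratio; the counts themselves stay integral until the division.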

@davidwendt added the label 5 - DO NOT MERGE on Jul 16, 2024
@davidwendt removed the label 5 - DO NOT MERGE on Jul 17, 2024
@davidwendt

/merge

@rapids-bot rapids-bot bot merged commit aeef0a1 into rapidsai:branch-24.08 Jul 18, 2024
88 checks passed
@davidwendt davidwendt deleted the jaccard-hash-limit branch July 18, 2024 20:56
Linked issue (may be closed by merging this pull request): [BUG] StringMethods - Jaccard-index fails with long strings
3 participants