
[BUG] StringMethods - Jaccard-index fails with long strings #16157

Open
ayushdg opened this issue Jul 2, 2024 · 2 comments · May be fixed by #16241
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@ayushdg
Member

ayushdg commented Jul 2, 2024

Describe the bug
Calling jaccard_index on long strings leads to OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit

Steps/Code to reproduce bug

import cudf
import numpy as np

test_string = "a" * (np.iinfo(np.int32).max // 10)
df = cudf.Series([test_string] * 11)
res = df.str.jaccard_index(input=df, width=5)

Results in:

File /opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/column/string.py:5378, in StringMethods.jaccard_index(self, input, width)
   5353 def jaccard_index(self, input: cudf.Series, width: int) -> SeriesOrIndex:
   5354     """
   5355     Compute the Jaccard index between this column and the given
   5356     input strings column.
   (...)
   5374     dtype: float32
   5375     """
   5377     return self._return_or_inplace(
-> 5378         libstrings.jaccard_index(self._column, input._column, width),
   5379     )

File /opt/conda/envs/rapids/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File jaccard.pyx:26, in cudf._lib.nvtext.jaccard.jaccard_index()

OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit

Expected behavior
Perhaps it is expected for long strings not to work with this method since I don't see it listed on #13048, but it would be good to get confirmation.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: conda 24.08 nightly
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Add any other context about the problem here.

@ayushdg ayushdg added the bug Something isn't working label Jul 2, 2024
@davidwendt davidwendt self-assigned this Jul 2, 2024
@mroeschke mroeschke added the libcudf Affects libcudf (C++/CUDA) code. label Jul 2, 2024
@davidwendt
Contributor

davidwendt commented Jul 3, 2024

The jaccard API uses hash_character_ngrams internally which produces a list column of integer values. The total number of integers in that list column is the number of ngrams for this strings column. The number of integers exceeds the max size_type and so the function is unable to build the output list column.

So you would need to limit the strings column size so that the total number of generated ngrams does not exceed the max size_type (int32) across the individual strings.
Meanwhile, I can work on modifying jaccard to avoid this limit since it is an internal detail of that API.
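For illustration, a rough pre-check along these lines can show whether a column will hit the limit before calling the API (the helper name and the simplification to byte lengths are assumptions for this sketch, not part of the cuDF API):

```python
import numpy as np

SIZE_TYPE_MAX = np.iinfo(np.int32).max  # libcudf's size_type limit

def estimated_ngram_count(byte_lengths, width):
    # Each string of length L produces max(L - width + 1, 0) character ngrams,
    # and hash_character_ngrams stores one int32 hash per ngram, so the total
    # for the column must stay below SIZE_TYPE_MAX.
    return sum(max(n - width + 1, 0) for n in byte_lengths)

# The repro above: 11 strings of ~214.7M characters each, width=5
lengths = [np.iinfo(np.int32).max // 10] * 11
total = estimated_ngram_count(lengths, width=5)
print(f"{total:,} ngrams; overflows size_type: {total > SIZE_TYPE_MAX}")
# ~2.36 billion ngrams, past the ~2.15 billion limit
```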

@davidwendt davidwendt linked a pull request Jul 10, 2024 that will close this issue
@davidwendt
Contributor

Even with large-strings support the amount of memory needed to process this example will be significant.
The original df size is 11 rows of 214,748,364 bytes each = ~2.4GB for the total input strings size.
Using a width=5 means each row generates 214,748,368 individual substrings at 5 bytes each = ~1.1GB per row. (11 rows ~ 12GB). The internal code uses hashing which reduces the 5 bytes to 4 bytes = ~859MB per row. (11 rows ~ 9.5GB).
Since the jaccard call in this example compares the df with itself, the temporary memory doubles to ~19GB.
Internally the intermediate substrings/hashes are sorted to help with counting the unique values. The sorted output requires a 2nd temporary copy (of the 9.5GB) which gets us to (19+9.5) = 28.5GB peak memory.

So overall jaccard_index would need about 6x the input memory available for processing.
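
As a rough re-derivation of those numbers, here is a back-of-the-envelope sketch (it assumes byte lengths only and no allocator overhead, mirroring the arithmetic in this comment rather than actual libcudf internals):

```python
rows = 11
row_bytes = 2_147_483_647 // 10          # 214,748,364 bytes per string
width = 5

input_bytes = rows * row_bytes                     # ~2.4 GB of input strings
ngrams_per_row = row_bytes - width + 1
substr_bytes = rows * ngrams_per_row * width       # ~12 GB of raw 5-byte substrings
hash_bytes = rows * ngrams_per_row * 4             # ~9.5 GB once hashed to int32
both_sides = 2 * hash_bytes                        # ~19 GB (the df is compared with itself)
peak_bytes = both_sides + hash_bytes               # + sorted copy => ~28.5 GB peak

# Counting both input columns (~4.7 GB combined), the peak is roughly 6x the input.
print(peak_bytes / 1e9, peak_bytes / (2 * input_bytes))
```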
