
[BUG] StringMethods - Jaccard-index fails with long strings #16157

Open
ayushdg opened this issue Jul 2, 2024 · 2 comments · May be fixed by #16241
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@ayushdg
Member

ayushdg commented Jul 2, 2024

Describe the bug
Calling jaccard_index on long strings leads to OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit

Steps/Code to reproduce bug

import cudf
import numpy as np

test_string = "a" * (np.iinfo(np.int32).max // 10)
df = cudf.Series([test_string] * 11)
res = df.str.jaccard_index(input=df, width=5)

Results in:

File /opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/column/string.py:5378, in StringMethods.jaccard_index(self, input, width)
   5353 def jaccard_index(self, input: cudf.Series, width: int) -> SeriesOrIndex:
   5354     """
   5355     Compute the Jaccard index between this column and the given
   5356     input strings column.
   (...)
   5374     dtype: float32
   5375     """
   5377     return self._return_or_inplace(
-> 5378         libstrings.jaccard_index(self._column, input._column, width),
   5379     )

File /opt/conda/envs/rapids/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File jaccard.pyx:26, in cudf._lib.nvtext.jaccard.jaccard_index()

OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit

Expected behavior
Perhaps it is expected for long strings not to work with this method since I don't see it listed on #13048, but it would be good to get confirmation.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: conda 24.08 nightly
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Add any other context about the problem here.

@ayushdg ayushdg added the bug Something isn't working label Jul 2, 2024
@davidwendt davidwendt self-assigned this Jul 2, 2024
@mroeschke mroeschke added the libcudf Affects libcudf (C++/CUDA) code. label Jul 2, 2024
@davidwendt
Contributor

davidwendt commented Jul 3, 2024

The jaccard API uses hash_character_ngrams internally which produces a list column of integer values. The total number of integers in that list column is the number of ngrams for this strings column. The number of integers exceeds the max size_type and so the function is unable to build the output list column.

So you would need to limit the strings column size so that the total number of generated ngrams does not exceed the max size_type (int32) across the individual strings.
Meanwhile, I can work on modifying jaccard to avoid this limit since it is an internal detail of that API.
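For illustration, a rough pre-check along these lines can show whether a column will hit the limit before calling the API (the helper name and the simplification to byte lengths are assumptions for this sketch, not part of the cuDF API):

```python
import numpy as np

SIZE_TYPE_MAX = np.iinfo(np.int32).max  # libcudf's size_type limit

def estimated_ngram_count(byte_lengths, width):
    # Each string of length L produces max(L - width + 1, 0) character ngrams,
    # and hash_character_ngrams stores one int32 hash per ngram, so the total
    # for the column must stay below SIZE_TYPE_MAX.
    return sum(max(n - width + 1, 0) for n in byte_lengths)

# The repro above: 11 strings of ~214.7M characters each, width=5
lengths = [np.iinfo(np.int32).max // 10] * 11
total = estimated_ngram_count(lengths, width=5)
print(f"{total:,} ngrams; overflows size_type: {total > SIZE_TYPE_MAX}")
# ~2.36 billion ngrams, past the ~2.15 billion limit
```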

@davidwendt davidwendt linked a pull request Jul 10, 2024 that will close this issue
@davidwendt
Contributor

Even with large-strings support the amount of memory needed to process this example will be significant.
The original df size is 11 rows of 214,748,364 bytes each = ~2.4GB for the total input strings size.
Using a width=5 means each row generates 214,748,368 individual substrings at 5 bytes each = ~1.1GB per row. (11 rows ~ 12GB). The internal code uses hashing which reduces the 5 bytes to 4 bytes = ~859MB per row. (11 rows ~ 9.5GB).
Since the jaccard call in this example compares the df with itself, the temporary memory doubles to ~19GB.
Internally the intermediate substrings/hashes are sorted to help with counting the unique values. The sorted output requires a 2nd temporary copy (of the 9.5GB) which gets us to (19+9.5) = 28.5GB peak memory.

So overall jaccard_index would need about 6x the input memory available for processing.
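
As a rough re-derivation of those numbers, here is a back-of-the-envelope sketch (it assumes byte lengths only and no allocator overhead, mirroring the arithmetic in this comment rather than actual libcudf internals):

```python
rows = 11
row_bytes = 2_147_483_647 // 10          # 214,748,364 bytes per string
width = 5

input_bytes = rows * row_bytes                     # ~2.4 GB of input strings
ngrams_per_row = row_bytes - width + 1
substr_bytes = rows * ngrams_per_row * width       # ~12 GB of raw 5-byte substrings
hash_bytes = rows * ngrams_per_row * 4             # ~9.5 GB once hashed to int32
both_sides = 2 * hash_bytes                        # ~19 GB (the df is compared with itself)
peak_bytes = both_sides + hash_bytes               # + sorted copy => ~28.5 GB peak

# Counting both input columns (~4.7 GB combined), the peak is roughly 6x the input.
print(peak_bytes / 1e9, peak_bytes / (2 * input_bytes))
```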
