
[FEA] Increase maximum characters in strings columns #13733

Closed

GregoryKimball opened this issue Jul 22, 2023 · 12 comments
Labels: 0 - Backlog · feature request · libcudf (C++/CUDA) · strings (C++ and Python)

Comments

GregoryKimball (Contributor) commented Jul 22, 2023

In libcudf, strings columns have child columns containing character data and offsets, and the offsets child column uses a 32-bit signed size type. This limits a strings column to ~2.1 billion characters. For LLM training, documents have up to ~1M characters, with a median around 3K characters. Due to the size type limit, LLM training pipelines have to carefully batch the data down to a few thousand rows to stay comfortably within it. We have a general issue open to explore a 64-bit size type in libcudf (#13159). For the size issues in LLM training pipelines, we should consider a targeted change that only addresses the size limit for strings columns.
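The batching arithmetic behind that constraint can be sketched in a few lines (illustrative only; the 3K and 1M document sizes are the figures quoted above):

```python
# Back-of-envelope check of the 32-bit offset limit described above.
INT32_MAX = 2**31 - 1             # libcudf's signed 32-bit size type maximum
max_chars_per_column = INT32_MAX  # ~2.147 billion characters

median_doc_chars = 3_000          # median LLM training document
worst_case_doc_chars = 1_000_000  # largest LLM training document

# Rows per batch that stay within the limit if every document were worst case:
rows_worst_case = max_chars_per_column // worst_case_doc_chars
# Rows per batch at the median document size:
rows_median = max_chars_per_column // median_doc_chars

print(rows_worst_case, rows_median)  # 2147 715827
```

A pipeline that must tolerate worst-case documents therefore batches down to a few thousand rows, matching the constraint described above.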

Requirements

  • We must maintain or improve throughput for functions processing strings columns with <2.1 billion characters. This requirement prevents us from using 64-bit offsets for all strings columns. It does not prevent us from using 64-bit offsets for strings columns with >2.1 billion characters.
  • We must not introduce a new data type or otherwise increase compile times significantly. This requirement prevents us from dispatching between "strings" types and "large strings" types.

Proposed solution

One idea that satisfies these requirements would be to represent the character data as an int64-typed column instead of an int8-typed column. This would allow us to store 8x more bytes of character data. To access the character bytes, we would use an offset-normalizing iterator (inspired by the "indexalator") to identify byte positions via an int64 iterator output. Please note that the 32-bit size type would still apply to the row count of the proposed "large strings" columns.
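The capacity gain from typing the character data as int64 follows directly from the element width; a sketch of the arithmetic (not libcudf code):

```python
# The size type caps the *element count* of a column at INT32_MAX,
# so widening the element type widens the byte capacity.
INT32_MAX = 2**31 - 1

int8_char_capacity = INT32_MAX * 1   # bytes: one byte per int8 element (~2.1 GB)
int64_char_capacity = INT32_MAX * 8  # bytes: eight bytes per int64 element (~17.2 GB)

assert int64_char_capacity == 8 * int8_char_capacity  # 8x more character bytes
```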

We should also consider an "unbounded" character data allocation that is not typed, but rather a single buffer up to 2^64 bytes in size. The 64-bit offset type would be able to index into much larger allocations.

Please note that this solution will not impact the offsets for list columns. We believe that the best design to allow for more than 2.1B elements in lists will be to use 64-bit size type in libcudf as discussed in #13159.

Creating strings columns

Strings column factories would choose child column types at the time of column creation, based on the size of the character data: int32 offsets with int8 character data, or int64 offsets with int64 character data. This change would impact strings column factories, as well as algorithms that use strings column utilities or generate their own offsets buffers. Any function that calls make_offsets_child_column will need to be aware of the alternate child column types for large strings.
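A rough Python sketch of the dispatch such a factory could perform (the helper name and return shape are hypothetical; libcudf's actual make_offsets_child_column returns a column, not a stdlib array):

```python
from array import array

INT32_MAX = 2**31 - 1

def make_offsets_child(sizes):
    """Hypothetical sketch: build an offsets buffer whose element type is
    chosen from the total character size, as the factories would."""
    total = sum(sizes)
    typecode = "i" if total <= INT32_MAX else "q"  # int32 vs int64 offsets
    offsets = array(typecode, [0])
    for s in sizes:
        offsets.append(offsets[-1] + s)  # scan of string sizes into offsets
    return offsets

small = make_offsets_child([5, 6])         # total fits in int32
assert small.typecode == "i" and list(small) == [0, 5, 11]

big = make_offsets_child([INT32_MAX, 10])  # total exceeds the int32 limit
assert big.typecode == "q"                 # factory switches to int64 offsets
```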

Accessing strings data

The offset-normalizing iterator would always return int64 type so that strings column consumers would not need to support both int32 and int64 offset types. See cudf::detail::sizes_to_offsets_iterator for an example of how an iterator operating on int32 data can output int64 data.
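The idea can be mimicked in Python, where string sizes are accumulated into wide offsets and the total is checked against the size limit (a sketch inspired by cudf::detail::sizes_to_offsets_iterator, not its actual code):

```python
def sizes_to_offsets(sizes, limit=2**31 - 1):
    """Scan string sizes into offsets, accumulating in wide integers so
    overflow of the 32-bit limit can be detected rather than wrapped."""
    offsets = [0]
    for s in sizes:
        offsets.append(offsets[-1] + s)  # Python ints never overflow
    if offsets[-1] > limit:
        # mirrors the error message reported later in this thread
        raise OverflowError("Size of output exceeds the column size limit")
    return offsets

assert sizes_to_offsets([3, 4, 5]) == [0, 3, 7, 12]

try:
    sizes_to_offsets([2**31 - 1, 1])  # one character too many
except OverflowError as e:
    print(e)
```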

Interoperability with Arrow

The new strings column variant, with int64 offsets and int64 character data, may already be Arrow-compatible. This requires more testing and some changes to our Arrow interop utilities.
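For reference, Arrow's string layout differs from large_string only in the offsets width (int32 vs int64). The sketch below lays out both for ["ab", "cde"] using stdlib arrays, without depending on pyarrow:

```python
from array import array

values = b"abcde"  # contiguous character data shared by both layouts

string_offsets = array("i", [0, 2, 5])        # Arrow string: 4-byte offsets
large_string_offsets = array("q", [0, 2, 5])  # Arrow large_string: 8-byte offsets

assert string_offsets.itemsize == 4
assert large_string_offsets.itemsize == 8

# Row i is values[offsets[i]:offsets[i+1]] in either layout.
assert values[large_string_offsets[1]:large_string_offsets[2]] == b"cde"
```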

Part 1: libcudf changes to support large strings columns

Definitions:
"strings column": int8 character data and int32 offset data (up to ~2.1 billion characters)
"large strings column": int8 character data up to 2^64 bytes and int64 offset data (up to ~18.4 quintillion characters)

  • Replace offset_type references with size_type (#13788). Offsets generated by the offset-normalizing iterator will have type int64_t.
  • Add a new data-size member to cudf::column_view, cudf::mutable_column_view and cudf::column_device_view (#14031), as a solution for character counts greater than int32.
  • Create an offset-normalizing iterator over character data that always outputs 64-bit offsets (#14206, #14234).
  • First step in #14043 (#14202; performance blocker resolved in ✅ #14540): add the character data buffer to the parent strings column, rather than as a child column; refactor algorithms such as concat, contiguous split and gather which access character data; update code in cuDF-python and cudf-java that interacts with character child columns.
  • Deprecate unneeded factories and use strings column factories consistently (#14461, #14771, #14695, #14612, +one more).
  • Introduce an environment variable to control the threshold for converting to 64-bit indices, to enable testing on smaller strings columns (LIBCUDF_LARGE_STRINGS_THRESHOLD, added as part of #14612).
  • Transition strings APIs to use the offset-normalizing iterator ("offsetalator"): see #14611, #14700, #14744, #14745, #14757, #14783, #14824.
  • Remove references to strings_column_view::offsets_begin() in libcudf, since it hardcodes the return type as int32 (see #15112, #15077).
  • Remove references to create_chars_child_column in libcudf, since it wraps a column around chars data (#15241).
  • Change the current make_strings_children to return a uvector for chars instead of a column (see #15171).
  • Introduce an environment variable LIBCUDF_LARGE_STRINGS_ENABLED to let users force libcudf to throw rather than start using 64-bit offsets, to allow try-catch repartitioning instead (#15195).
  • Introduce an environment variable LIBCUDF_LARGE_STRINGS_THRESHOLD (#14612).
  • Rework concatenate to produce large strings when LIBCUDF_LARGE_STRINGS_ENABLED is set and the character count is above LIBCUDF_LARGE_STRINGS_THRESHOLD (see #15195).
  • cuDF-python testing 🔄: use concat to create a large strings column. We should be able to operate on this column, as long as we aren't creating a new large string. Can we call APIs that (1) return int/bool (e.g. contains), (2) slice, and (3) return smaller strings?
  • Add an experimental version of make_strings_children that generates 64-bit offsets when the total character length exceeds the threshold (#15363).
  • Add a large strings test fixture that stores large columns between unit tests and controls the environment variables (#15513).
  • Check that the appropriate cudf tests pass with LIBCUDF_LARGE_STRINGS_THRESHOLD at zero.
  • Benchmark regressions analyzed and approved.
  • Spark-RAPIDS tests pass.
  • Remove the experimental namespace and replace make_strings_children with the experimental-namespace implementation (#15702).
  • Live session with a cuDF-python expert to start producing and operating on large strings.
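For testing, the two environment variables would be set before libcudf reads them; a sketch of doing so from Python (the variable semantics are as described in the steps above, and the threshold value here is arbitrary):

```python
import os

# Must be set before libcudf reads them, i.e. before the first cudf import.
os.environ["LIBCUDF_LARGE_STRINGS_ENABLED"] = "1"           # allow 64-bit offsets
os.environ["LIBCUDF_LARGE_STRINGS_THRESHOLD"] = str(2**20)  # lower the switch
                                                            # point for testing

assert os.environ["LIBCUDF_LARGE_STRINGS_ENABLED"] == "1"
```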

Part 2: cuIO changes to read and write large strings columns

  • JSON reader: probably solved by #15930.
  • Parquet reader and writer: added in #15632; still needs to be tested.
  • ORC: partial fix in #15891; needs investigation.
  • CSV: uses factories that are already enabled; still needs testing.
  • Text: uses factories that are already enabled; still needs testing.

Part 3: Interop changes to read and write large strings columns

Also see #15093 about large_strings compatibility for pandas-2.2

  • to_arrow functions: use the offsets type to set the LARGE_STRING type.
  • from_arrow functions: honor the env var settings as well.
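Conceptually, to_arrow only needs to map the offsets width to the Arrow string type (the helper name below is hypothetical; the real logic lives in libcudf's interop code):

```python
def arrow_string_type_for(offsets_itemsize_bytes):
    """Hypothetical sketch: pick the Arrow string type from offset width."""
    if offsets_itemsize_bytes == 8:
        return "large_string"  # int64 offsets
    return "string"            # int32 offsets

assert arrow_string_type_for(4) == "string"
assert arrow_string_type_for(8) == "large_string"
```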
GregoryKimball added the feature request, Needs Triage, libcudf, and strings labels on Jul 22, 2023
GregoryKimball added the 0 - Backlog label and removed the Needs Triage label on Jul 23, 2023
rapids-bot bot pushed a commit that referenced this issue Aug 3, 2023
Replace all occurrences of `cudf::offset_type` with `cudf::size_type`.
This helps clear up code where sizes are computed and then converted to offsets in place.

Also, reference #13733

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - https://github.com/brandon-b-miller
  - Nghia Truong (https://github.com/ttnghia)
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)

URL: #13788
rapids-bot bot pushed a commit that referenced this issue Sep 3, 2023
Replaces places where the `cudf::column_view(type,size,...)` constructor was used to create an empty view with a call to `cudf::make_empty_column(type)->view()`.

This helps minimize the dependency on calling the constructors directly, as part of the work needed for #13733, which may require an update to the `column_view` classes and their constructor(s).

Most of the changes occur in strings gtests source files.
No functionality or behavior has changed.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14030
harrism (Member) commented Sep 20, 2023

👏 praise: Great descriptive issue!
💡 suggestion: I think the table in the description could be a GH task list so each "step" can be made into an issue.
❓ question:

"large strings column": int64 character data and int64 offset data

int64 data with int64 offsets sounds larger than:

"unbounded strings column": int8 character data up to 2^64 bytes and int64 offset data

And this sounds bounded, but large. Not boundless.

Can you clarify or give an example of the structure?

gaohao95 (Contributor) commented Oct 2, 2023

This feature would be a great help for us. We use cudf::table as CPU-memory storage to enable zero-copy. In theory, CPU memory has enough capacity to hold larger tables, but we constantly run into the character limit.

LutzCle commented Oct 3, 2023

Following up on @gaohao95's comment, our use case is storing TPC-H data in a cudf::table. At scale factor 100, the l_comment and ps_comment string columns overflow the 32-bit offset.

rapids-bot bot pushed a commit that referenced this issue Jan 17, 2024
Eliminates chars column and moves chars data to parent string column's _data buffer.

Summary of changes 
- chars child column is removed, chars buffer is added to parent column
- Adds stream to `chars_size()`, `chars_end()` in `strings_column_view` and their invocations
- Remove `chars_column_index`, and deprecate `chars()` from `strings_column_view`
- Replace `chars_col.begin<char>()` with `static_cast<char*>(parent.head())`
- Adds string column factory which accepts `rmm::device_buffer` instead of chars column
- Deprecate string column factory which accepts chars column
- IO changes - contiguous split (From @nvdbaranec ), to_arrow, parquet writer.
- Fix binary ops, column_view,  interleave columns, byte cast, strings APIs, text APIs
- Fix tests, benchmarks (mostly adding `stream` parameter to chars_size)
- Java fixes (From @andygrove)
- Python changes
  - .data special case for string column
  - get size from offsets column for rmm.DeviceBuffer in column
  - special condition for string slice
  - Pickle file update for string column
  - a few unit tests updates

Preparing for #13733

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Jason Lowe (https://github.com/jlowe)
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Lawrence Mitchell (https://github.com/wence-)
  - Matthew Roeschke (https://github.com/mroeschke)
  - Ashwin Srinath (https://github.com/shwina)

URL: #14202
rapids-bot bot pushed a commit that referenced this issue Feb 20, 2024
This PR adds support for the `large_string` type of `arrow` arrays in `cudf`. The `cudf` strings column lacks 64-bit offset support, which is WIP: #13733

This workaround is essential because `pandas-2.2+` now defaults to the `large_string` type for arrow-strings instead of `string`: pandas-dev/pandas#56220

This PR fixes all 25 `dask-cudf` failures.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Ashwin Srinath (https://github.com/shwina)

URL: #15093
GregoryKimball (Contributor, Author) commented Apr 17, 2024

Now that #15195 is merged, I did some testing via cuDF-python

>>> df2['char'].str.slice(0,2)  
OK

>>> df2['char'].str.contains(0,2)  
OK

>>> df2['char'].str.contains('p', regex=True)
OK

>>> df2['char'].nunique()
57

# add a column, df2['c'] = 1
>>> df2.groupby('char').sum()['c'].reset_index()
OK

>>> df2['char'] + 'b'
OverflowError: CUDF failure at: /nfs/repo/cudf24.06/cpp/include/cudf/strings/detail/strings_children.cuh:82: Size of output exceeds the column size limit

>>> df2['char'].str.upper()
RuntimeError: THRUST_INDEX_TYPE_DISPATCH 64-bit count is unsupported in libcudf

So far so good! Many APIs can successfully consume large strings, and only concat can produce them for now. 🎉

beckernick (Member) commented Apr 26, 2024

This is incredibly exciting! More than any individual string operation, one of the most common pain points I see in workflows is the inability to bring strings along as a payload during joins (now that concat works):

%env LIBCUDF_LARGE_STRINGS_ENABLED=1

import cudf
import numpy as np

N = 6000

df1 = cudf.DataFrame({
    "val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
    "key": [0, 1]*N
})

res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Cell In[11], line 13
      6 N = 6000
      8 df1 = cudf.DataFrame({
      9     "val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
     10     "key": [0, 1]*N
     11 })
---> 13 res = df1.merge(df1, on="key")
     14 print(f"{res.val_x.str.len().sum():,} characters in string column")

File [/nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116]
...
File copying.pyx:151, in cudf._lib.copying.gather()

File copying.pyx:34, in cudf._lib.pylibcudf.copying.gather()

File copying.pyx:66, in cudf._lib.pylibcudf.copying.gather()

OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit

If I only have numeric data, this works smoothly as the output dataframe is only 72M rows.

%env LIBCUDF_LARGE_STRINGS_ENABLED=1

import cudf
import numpy as np

N = 6000

df1 = cudf.DataFrame({
    "val": [10, 100]*N,
    "key": [0, 1]*N
})

res = df1.merge(df1, on="key")
print(f"{len(res):,} rows in dataframe")
72,000,000 rows in dataframe

I'd love to be able to complete this (contrived) example, because I think it's representative of something we see often: this limit causing failures in workflows where users expect things to work smoothly.

As a reference, the self-join works with N=5000 as it only ends up with 1.7B total characters.

%env LIBCUDF_LARGE_STRINGS_ENABLED=1

import cudf
import numpy as np

N = 5000

df1 = cudf.DataFrame({
    "val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
    "key": [0, 1]*N
})

res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
env: LIBCUDF_LARGE_STRINGS_ENABLED=1
1,675,000,000 characters in string column
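The reported character counts check out by hand (same string literals as the snippets above):

```python
# Self-join on "key" pairs every row in a key group with every other row,
# so each of the 2 key groups blows up to N*N rows.
len_a = len("this is a fairly short string")           # 29 chars (key 0)
len_b = len("this one is a bit longer, but not much")  # 38 chars (key 1)

def joined_chars(N):
    return N * N * (len_a + len_b)  # N*N rows per key group, one string each

assert joined_chars(5000) == 1_675_000_000  # fits under the ~2.147B limit
assert joined_chars(6000) == 2_412_000_000  # exceeds it, hence the OverflowError
```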

rapids-bot bot pushed a commit that referenced this issue May 1, 2024
Replaces `make_offsets_child_column` with strings specific version in `cudf::strings::detail::gather` function.
Fixes issue found here: #13733 (comment)

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Mark Harris (https://github.com/harrism)

URL: #15621
rapids-bot bot pushed a commit that referenced this issue May 3, 2024
…5632)

Part of #13733.

Adds support for reading and writing cuDF string columns where the string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded, and instead uses `cudf::strings::detail::make_offsets_child_column()`.  This could lead to increased overhead with many columns (see #13024), so this will need some more benchmarking. But if there are many columns that exceed the 2GB limit, it's likely reads will have to be chunked to stay within the memory budget.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - David Wendt (https://github.com/davidwendt)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15632
davidwendt (Contributor):

The issue described here #13733 (comment) should be fixed with #15621

davidwendt (Contributor):

And now the issue described here #13733 (comment) should be fixed with #15721

beckernick (Member) commented Jun 4, 2024

CSV   This uses factories that are already enabled. Still needs testing.

Gave this a quick spin -- the CSV reader itself works, but we fail if we need to copy a libcudf column/table from device to host (e.g., if we call .head(), which goes down the to_pandas codepath for device-to-host copies).

So I'd say:

  • CSV reader appears to work
  • to_pandas device-to-host transfer fails (will file an issue), while to_arrow works (which makes sense given "part 3" above)
%env LIBCUDF_LARGE_STRINGS_ENABLED=1

import cudf

N = int(5e7)

df = cudf.DataFrame({
    "val": ["this is a short string", "this one is a bit longer, but not much"]*N,
    "key": [0, 1]*N
})
df.to_csv("large_string_df.csv", chunksize=1000000, index=False)
del df
df = cudf.read_csv("large_string_df.csv")
print(len(df))
print(df.iloc[0])
100000000
                      val  key
0  this is a short string    0
df.head()
---------------------------------------------------------------------------
ArrowException                            Traceback (most recent call last)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
--> 711 printer.pretty(obj)

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
--> 411 return _repr_pprint(obj, self, cycle)

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
--> 779 output = repr(obj)

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
--> 116 result = func(*args, **kwargs)

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1973, in DataFrame.__repr__(self)
-> 1973 return self._clean_renderable_dataframe(output)

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1835, in DataFrame._clean_renderable_dataframe(self, output)
-> 1835 output = output.to_pandas().to_string(

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
--> 116 result = func(*args, **kwargs)

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5324, in DataFrame.to_pandas(self, nullable, arrow_type)
-> 5324 out_data = {

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5325, in <dictcomp>(.0)
-> 5325 i: col.to_pandas(

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/string.py:5802, in StringColumn.to_pandas(self, index, nullable, arrow_type)
-> 5802 return super().to_pandas(index=index, nullable=nullable)

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/column.py:215, in ColumnBase.to_pandas(self, index, nullable, arrow_type)
--> 215 pd_series = pa_array.to_pandas()

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:872, in pyarrow.lib._PandasConvertible.to_pandas()

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1517, in pyarrow.lib.Array._to_pandas()

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1916, in pyarrow.lib._array_like_to_pandas()

File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowException: Unknown error: Wrapping

beckernick (Member) commented Jun 5, 2024

From a local run of our Python test suite with large strings enabled, it seems like:

  • Enabling this doesn't break things when we've got < 2.1B characters
  • We don't test scenarios with >= 2.1B characters, except when we intentionally trigger and catch a large-string OverflowError (which still fails correctly with this turned on), so a passing suite doesn't give a strong signal about readiness

@GregoryKimball @vyasr @davidwendt, what do you think is the right balance of exhaustiveness vs. practicality? I don't think we want to add a large-string scenario to every Python unit test.

(Running the test suite with large strings enabled in #15932 so folks can review more easily)

vyasr (Contributor) commented Jun 11, 2024

I thought about this a little bit today, but haven't come up with any very concrete guidelines yet. At a high level, a reasonable balance would be testing each API that supports large strings once: we skip exhaustive sweeps over parameters, but we at least exercise all of the code paths once from Python and ensure that the basic logic around dtype handling works. Perhaps @davidwendt also has some ideas around common failure modes for large strings, or properties of the large strings data type that are different from normal strings. If there are particular dimensions along which we expect that type to differ, then we would benefit from emphasizing testing them. I'll keep thinking though.

davidwendt (Contributor):

Finding the right balance has been a challenge. I've tried to isolate the large-strings decision logic to a few utilities and use those as much as possible hoping that only a few tests would be needed to catch all the common usages. Of course, this coverage requires internal knowledge of how the APIs are implemented. For libcudf, I created a set of large strings unit tests here: https://github.com/rapidsai/cudf/tree/branch-24.08/cpp/tests/large_strings that should cover the known usages.
Additional coverage from the Python layer is welcome of course. It would be good to consider the data size requirements (and scope) as well as runtime so as to not overload the CI systems.

GregoryKimball (Contributor, Author):

Congratulations! Starting in 24.08, large strings are enabled by default in libcudf and cudf! 🎉🎉🎉🎉

We will track further development in separate issues.
