[FEA] Increase maximum characters in strings columns #13733
Comments
Replace all occurrences of `cudf::offset_type` with `cudf::size_type`. This helps clear up code where sizes are computed and then converted to offsets in-place. Also, reference #13733 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - https://github.com/brandon-b-miller - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) - MithunR (https://github.com/mythrocks) URL: #13788
Replaces places where the `cudf::column_view(type,size,...)` constructor was used to create an empty view with a call to `cudf::make_column_view(type)->view()`. This helps minimize the dependency on calling the constructors directly as part of the work needed for #13733, which may require an update to the `column_view` classes and their constructor(s). Most of the changes occur in strings gtests source files. No functionality or behavior has changed. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Mike Wilson (https://github.com/hyperbolic2346) URL: #14030
👏 praise: Great descriptive issue!
int64 data with int64 offsets sounds larger than:
And this sounds bounded, but large. Not boundless. Can you clarify or give an example of the structure?
This feature would be a great help for us. We use …
Following up on @gaohao95's comment, our use case is storing TPC-H data in a …
Eliminates chars column and moves chars data to parent string column's `_data` buffer. Summary of changes:
- chars child column is removed, chars buffer is added to parent column
- Adds stream to `chars_size()`, `chars_end()` in `strings_column_view` and their invocations
- Remove `chars_column_index`, and deprecate `chars()` from `strings_column_view`
- Replace `chars_col.begin<char>()` with `static_cast<char*>(parent.head())`
- Adds string column factory which accepts `rmm::device_buffer` instead of chars column
- Deprecate string column factory which accepts chars column
- IO changes - contiguous split (from @nvdbaranec), to_arrow, parquet writer
- Fix binary ops, column_view, interleave columns, byte cast, strings APIs, text APIs
- Fix tests, benchmarks (mostly adding `stream` parameter to chars_size)
- Java fixes (from @andygrove)
- Python changes
  - .data special case for string column
  - get size from offsets column for rmm.DeviceBuffer in column
  - special condition for string slice
  - Pickle file update for string column
  - a few unit tests updates

Preparing for #13733 Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Jason Lowe (https://github.com/jlowe) - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) - Lawrence Mitchell (https://github.com/wence-) - Matthew Roeschke (https://github.com/mroeschke) - Ashwin Srinath (https://github.com/shwina) URL: #14202
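For readers skimming the thread, a rough sketch of the access-pattern change this PR describes (`scv` and `stream` are assumed to be an existing `cudf::strings_column_view` and CUDA stream; this is illustrative, not the PR's code):

```cpp
// Before this change: characters lived in a dedicated chars child column.
//   auto const* chars = scv.chars().begin<char>();            // chars() is now deprecated
// After this change: characters live in the parent string column's _data buffer.
auto const* chars  = static_cast<char const*>(scv.parent().head());
auto const n_bytes = scv.chars_size(stream);                   // chars_size() now takes a stream
```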
This PR adds support for the `large_string` type of `arrow` arrays in `cudf`. `cudf` strings columns lack 64-bit offset support, and that work is WIP: #13733. This workaround is essential because `pandas-2.2+` now defaults to the `large_string` type for arrow-strings instead of `string`: pandas-dev/pandas#56220. This PR fixes all 25 `dask-cudf` failures. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Ashwin Srinath (https://github.com/shwina) URL: #15093
Now that #15195 is merged, I did some testing via cuDF-python
So far so good! Many APIs can successfully consume large strings, and only concat can produce them for now. 🎉
This is incredibly exciting! More than any individual string operation, one of the most common pain points I see in workflows is the inability to bring strings along as a payload during joins (now that concat works):
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[11], line 13
6 N = 6000
8 df1 = cudf.DataFrame({
9 "val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
10 "key": [0, 1]*N
11 })
---> 13 res = df1.merge(df1, on="key")
14 print(f"{res.val_x.str.len().sum():,} characters in string column")
File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116
...
File copying.pyx:151, in cudf._lib.copying.gather()
File copying.pyx:34, in cudf._lib.pylibcudf.copying.gather()
File copying.pyx:66, in cudf._lib.pylibcudf.copying.gather()
OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit
If I only have numeric data, this works smoothly as the output dataframe is only 72M rows.
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": [10, 100]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{len(res):,} rows in dataframe")
72,000,000 rows in dataframe
I'd love to be able to complete this (contrived) example, because I think it's representative of something we see often: this limit causing failures in workflows where users expect things to work smoothly. As a reference, the self-join works with `N = 5000`:
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 5000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
env: LIBCUDF_LARGE_STRINGS_ENABLED=1
1,675,000,000 characters in string column
Replaces `make_offsets_child_column` with strings specific version in `cudf::strings::detail::gather` function. Fixes issue found here: #13733 (comment) Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Mark Harris (https://github.com/harrism) URL: #15621
…5632) Part of #13733. Adds support for reading and writing cuDF string columns where the string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded, and instead uses `cudf::strings::detail::make_offsets_child_column()`. This could lead to increased overhead with many columns (see #13024), so this will need some more benchmarking. But if there are many columns that exceed the 2GB limit, it's likely reads will have to be chunked to stay within the memory budget. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - Muhammad Haseeb (https://github.com/mhaseeb123) - David Wendt (https://github.com/davidwendt) - Vukasin Milovanovic (https://github.com/vuule) URL: #15632
The issue described here #13733 (comment) should be fixed with #15621
And now the issue described here #13733 (comment) should be fixed with #15721
Gave this a quick spin -- the CSV reader itself works, but we fail when displaying the result (converting it to pandas); see the traceback below. So I'd say:
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
N = int(5e7)
df = cudf.DataFrame({
"val": ["this is a short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
df.to_csv("large_string_df.csv", chunksize=1000000, index=False)
del df
df = cudf.read_csv("large_string_df.csv")
print(len(df))
print(df.iloc[0])
100000000
val key
0 this is a short string 0
df.head()
---------------------------------------------------------------------------
ArrowException Traceback (most recent call last)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
704 stream = StringIO()
705 printer = pretty.RepresentationPrinter(stream, self.verbose,
706 self.max_width, self.newline,
707 max_seq_length=self.max_seq_length,
708 singleton_pprinters=self.singleton_printers,
709 type_pprinters=self.type_printers,
710 deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
712 printer.flush()
713 return stream.getvalue()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
408 return meth(obj, self, cycle)
409 if cls is not object \
410 and callable(cls.__dict__.get('__repr__')):
--> 411 return _repr_pprint(obj, self, cycle)
413 return _default_pprint(obj, self, cycle)
414 finally:
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
777 """A pprint that just redirects to the normal repr function."""
778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
780 lines = output.splitlines()
781 with p.group():
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1973, in DataFrame.__repr__(self)
1970 @_cudf_nvtx_annotate
1971 def __repr__(self):
1972 output = self._get_renderable_dataframe()
-> 1973 return self._clean_renderable_dataframe(output)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1835, in DataFrame._clean_renderable_dataframe(self, output)
1832 else:
1833 width = None
-> 1835 output = output.to_pandas().to_string(
1836 max_rows=max_rows,
1837 min_rows=min_rows,
1838 max_cols=max_cols,
1839 line_width=width,
1840 max_colwidth=max_colwidth,
1841 show_dimensions=show_dimensions,
1842 )
1844 lines = output.split("\n")
1846 if lines[-1].startswith("["):
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5324, in DataFrame.to_pandas(self, nullable, arrow_type)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
-> 5324 out_data = {
5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5325, in <dictcomp>(.0)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
5324 out_data = {
-> 5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/string.py:5802, in StringColumn.to_pandas(self, index, nullable, arrow_type)
5800 return pd.Series(pandas_array, copy=False, index=index)
5801 else:
-> 5802 return super().to_pandas(index=index, nullable=nullable)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/column.py:215, in ColumnBase.to_pandas(self, index, nullable, arrow_type)
211 return pd.Series(
212 pd.arrays.ArrowExtensionArray(pa_array), index=index
213 )
214 else:
--> 215 pd_series = pa_array.to_pandas()
217 if index is not None:
218 pd_series.index = index
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:872, in pyarrow.lib._PandasConvertible.to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1517, in pyarrow.lib.Array._to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1916, in pyarrow.lib._array_like_to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowException: Unknown error: Wrapping
From a local run of our Python test suite with large strings enabled, it seems like:
@GregoryKimball @vyasr @davidwendt, what do you think is the right balance of exhaustiveness vs. practicality? I don't think we want to add a large string scenario to every Python unit test. (Running the test suite with large strings enabled in #15932 so folks can review more easily)
I thought about this a little bit today, but haven't come up with any very concrete guidelines yet. At a high level, a reasonable balance would be testing each API that supports large strings once, so that we skip any exhaustive sweeps over parameters but still exercise all of the code paths at least once from Python and ensure that all of the basic logic around dtype handling works. Perhaps @davidwendt also has some ideas around common failure modes for large strings, or properties of the large strings data type that are different from normal strings. If there are particular dimensions along which we expect that type to differ, then we would benefit from emphasizing testing there. I'll keep thinking though.
Finding the right balance has been a challenge. I've tried to isolate the large-strings decision logic to a few utilities and use those as much as possible, hoping that only a few tests would be needed to catch all the common usages. Of course, this coverage requires internal knowledge of how the APIs are implemented. For libcudf, I created a set of large strings unit tests here: https://github.com/rapidsai/cudf/tree/branch-24.08/cpp/tests/large_strings that should cover the known usages.
Congratulations! Starting in 24.08, large strings are enabled by default in libcudf and cudf! 🎉🎉🎉🎉 We will track further development in separate issues. |
In libcudf, strings columns have child columns containing character data and offsets, and the offsets child column uses a 32-bit signed size type. This limits strings columns to ~2.1 billion characters in total. For LLM training, documents have up to 1M characters and a median of around 3K characters, so LLM training pipelines have to carefully batch the data down to a few thousand rows to stay comfortably within the size type limit. We have a general issue open to explore a 64-bit size type in libcudf (#13159). For size issues with LLM training pipelines, we should consider a targeted change that only addresses the size limit for strings columns.
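To make the limit concrete, here is a toy illustration of the current layout (the struct and names below are illustrative only, not libcudf types): the characters of all rows live in one contiguous buffer, each row is addressed by a pair of 32-bit offsets, and the final offset equals the total character count, so it cannot exceed `INT32_MAX`.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Toy model of today's strings column layout (illustrative, not libcudf code).
struct toy_strings_column {
  std::vector<char>         chars;    // all rows' characters, concatenated
  std::vector<std::int32_t> offsets;  // row_count + 1 entries; row i spans [offsets[i], offsets[i+1])
};

// The last offset equals chars.size(), so with int32 offsets the total
// character data is capped at 2,147,483,647 bytes (~2.1 billion characters).
constexpr std::int64_t kMaxChars = std::numeric_limits<std::int32_t>::max();
```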
Requirements
Proposed solution
One idea that satisfies these requirements would be to represent the character data as an `int64` typed column instead of an `int8` typed column. This would allow us to store 8x more bytes of character data. To access the character bytes, we would use an offset-normalizing iterator (inspired by the "indexalator") to identify byte positions using an `int64` iterator output. Please note that the 32-bit size type would still apply to the row count of the proposed "large strings" columns.

We should also consider an "unbounded" character data allocation that is not typed, but rather a single buffer up to 2^64 bytes in size. The 64-bit offset type would be able to index into much larger allocations.

Please note that this solution will not impact the offsets for list columns. We believe that the best design to allow for more than 2.1B elements in lists will be to use a 64-bit size type in libcudf as discussed in #13159.
Creating strings columns
Strings column factories would choose child column types at the time of column creation, based on the size of the character data. This change would impact strings column factories, as well as algorithms that use strings column utilities or generate their own offsets buffers. At column creation time, the constructor will choose between `int32` offsets with `int8` character data and `int64` offsets with `int64` character data, based on the size of the character data. Any function that calls `make_offsets_child_column` will need to be aware of the alternate child column types for large strings.
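The creation-time choice described above might look roughly like the following sketch (a hypothetical helper, not the actual factory): pick the offsets/character representation from the total character count.

```cpp
#include <cstdint>
#include <limits>

// Hedged sketch of the creation-time choice (not libcudf code): small columns
// keep int32 offsets with int8 character data; columns whose character data
// would overflow int32 switch to int64 offsets with int64 character data.
enum class strings_layout { int32_offsets_int8_chars, int64_offsets_int64_chars };

inline strings_layout choose_layout(std::int64_t total_chars)
{
  return total_chars <= std::numeric_limits<std::int32_t>::max()
             ? strings_layout::int32_offsets_int8_chars
             : strings_layout::int64_offsets_int64_chars;
}
```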
Accessing strings data

The offset-normalizing iterator would always return `int64` type so that strings column consumers would not need to support both `int32` and `int64` offset types. See `cudf::detail::sizes_to_offsets_iterator` for an example of how an iterator operating on `int32` data can output `int64` data.
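A minimal sketch of that normalization idea (illustrative only; the real libcudf iterators such as the indexalator are device-side and more general): wrap an offsets buffer whose physical type may be `int32` or `int64` and always hand back `int64` values.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative offset-normalizing wrapper (not libcudf's implementation):
// reads int32 or int64 storage, always yields int64 offsets.
class offset_normalizer {
 public:
  offset_normalizer(void const* data, bool is_int64) : data_{data}, is_int64_{is_int64} {}

  std::int64_t operator[](std::size_t i) const
  {
    return is_int64_ ? static_cast<std::int64_t const*>(data_)[i]
                     : static_cast<std::int64_t>(static_cast<std::int32_t const*>(data_)[i]);
  }

 private:
  void const* data_;
  bool is_int64_;
};
```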
Interoperability with Arrow

The new strings column variant with `int64` offsets and `int64` character data may already be Arrow-compatible. This requires more testing and some changes to our Arrow interop utilities.
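For context on why the `int64`-offset variant lines up with Arrow: Arrow's `large_string` (large_utf8) arrays already store one character buffer plus 64-bit offsets, so in principle the buffers could be handed over directly. The snippet below is a speculative sketch using Arrow C++; the buffer arguments are placeholders, not cudf interop API.

```cpp
#include <arrow/api.h>
#include <cstdint>
#include <memory>

// Speculative sketch: wrap an existing int64 offsets buffer and chars buffer as
// an Arrow large_string array. `offsets` must hold row_count + 1 int64 values and
// `chars` the concatenated UTF-8 bytes; null handling is omitted for brevity.
std::shared_ptr<arrow::Array> wrap_as_large_string(std::int64_t row_count,
                                                   std::shared_ptr<arrow::Buffer> offsets,
                                                   std::shared_ptr<arrow::Buffer> chars)
{
  return std::make_shared<arrow::LargeStringArray>(row_count, offsets, chars);
}
```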
Part 1: libcudf changes to support large strings columns

Definitions:
- "strings column": `int8` character data and `int32` offset data (2.1B characters)
- "large strings column": `int8` character data up to 2^64 bytes and `int64` offset data (18400T characters)

- Replace `offset_type` references with `size_type`
- Add new `int64_t` data-size member to `cudf::column_view`, `cudf::mutable_column_view` and `cudf::column_device_view`
- `int32` ✅ #14234
  * Also refactor algorithms such as concat, contiguous split and gather which access character data
  * Update code in cuDF-python that interacts with character child columns
  * Update code in cudf-java that interacts with character child columns
- `LIBCUDF_LARGE_STRINGS_THRESHOLD` added
- Deprecate `strings_column_view::offsets_begin()` in libcudf since it hardcodes the return type as int32
- Deprecate `create_chars_child_column` in libcudf since it wraps a column around chars data
- Change `make_strings_children` to return a uvector for chars instead of a column
- Add `LIBCUDF_LARGE_STRINGS_ENABLED` to let users force libcudf to throw rather than start using 64-bit offsets, to allow try-catch-repartitioning instead (see the sketch after Part 3 below)
- Update `concatenate` to produce large strings when `LIBCUDF_LARGE_STRINGS_ENABLED` is set and the character count is above the `LIBCUDF_LARGE_STRINGS_THRESHOLD`
- Add an `experimental` version of `make_strings_children` that generates 64-bit offsets when the total character length exceeds the threshold
- Set `LIBCUDF_LARGE_STRINGS_THRESHOLD` at zero
- Remove the `experimental` namespace. Replace the `make_strings_children` implementation with the `experimental` namespace version.

Part 2: cuIO changes to read and write large strings columns

Part 3: Interop changes to read and write large strings columns
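Tying together the environment-variable items from Part 1, here is a simplified sketch of the decision logic (hypothetical host-side helpers, not the actual libcudf implementation): when an output's character count crosses the threshold, either switch to 64-bit offsets or throw so the caller can catch and repartition.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <stdexcept>

// Hypothetical helpers illustrating the LIBCUDF_LARGE_STRINGS_* behavior
// described in Part 1 above (simplified; not libcudf internals).
inline bool large_strings_enabled()
{
  char const* env = std::getenv("LIBCUDF_LARGE_STRINGS_ENABLED");
  return env != nullptr && env[0] == '1';
}

inline std::int64_t large_strings_threshold()
{
  char const* env = std::getenv("LIBCUDF_LARGE_STRINGS_THRESHOLD");
  // Fall back to the int32 limit when the threshold is not set.
  return env != nullptr ? std::atoll(env) : std::numeric_limits<std::int32_t>::max();
}

// Decide how to build offsets for an output with `total_chars` characters.
inline bool use_int64_offsets(std::int64_t total_chars)
{
  if (total_chars <= large_strings_threshold()) { return false; }  // int32 offsets are fine
  if (large_strings_enabled()) { return true; }                    // build 64-bit offsets
  // Otherwise fail loudly so callers can try-catch and repartition their input.
  throw std::overflow_error("string character count exceeds the size_type limit");
}
```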
Also see #15093 about large_strings compatibility for pandas-2.2