[FEA] Increase maximum characters in strings columns #13733
Comments
Replace all occurrences of `cudf::offset_type` with `cudf::size_type`. This helps clear up code where sizes are computed and then converted to offsets in-place. Also, reference #13733 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - https://github.com/brandon-b-miller - Nghia Truong (https://github.com/ttnghia) - Bradley Dice (https://github.com/bdice) - MithunR (https://github.com/mythrocks) URL: #13788
Replaces places where the `cudf::column_view(type,size,...)` constructor was used to create an empty view with a call to `cudf::make_column_view(type)->view()`. This helps minimize the dependency on calling the constructors directly as part of the work needed for #13733, which may require an update to the `column_view` classes and their constructor(s). Most of the changes occur in strings gtests source files. No functionality or behavior has changed. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Mike Wilson (https://github.com/hyperbolic2346) URL: #14030
👏 praise: Great descriptive issue!
int64 data with int64 offsets sounds larger than:
And this sounds bounded, but large. Not boundless. Can you clarify or give an example of the structure?
This feature would be a great help for us. We use …
Following up on @gaohao95's comment, our use case is storing TPC-H data in a …
Eliminates chars column and moves chars data to parent string column's `_data` buffer. Summary of changes:
- chars child column is removed, chars buffer is added to parent column
- Adds stream to `chars_size()`, `chars_end()` in `strings_column_view` and their invocations
- Remove `chars_column_index`, and deprecate `chars()` from `strings_column_view`
- Replace `chars_col.begin<char>()` with `static_cast<char*>(parent.head())`
- Adds string column factory which accepts `rmm::device_buffer` instead of chars column
- Deprecate string column factory which accepts chars column
- IO changes - contiguous split (from @nvdbaranec), to_arrow, parquet writer
- Fix binary ops, column_view, interleave columns, byte cast, strings APIs, text APIs
- Fix tests, benchmarks (mostly adding `stream` parameter to chars_size)
- Java fixes (from @andygrove)
- Python changes
  - .data special case for string column
  - get size from offsets column for rmm.DeviceBuffer in column
  - special condition for string slice
  - Pickle file update for string column
  - a few unit tests updates

Preparing for #13733 Authors: - Karthikeyan (https://github.com/karthikeyann) Approvers: - Jason Lowe (https://github.com/jlowe) - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) - Lawrence Mitchell (https://github.com/wence-) - Matthew Roeschke (https://github.com/mroeschke) - Ashwin Srinath (https://github.com/shwina) URL: #14202
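For readers skimming the thread, a rough sketch of the access-pattern change this PR describes (`scv` and `stream` are assumed to be an existing `cudf::strings_column_view` and CUDA stream; this is illustrative, not the PR's code):

```cpp
// Before this change: characters lived in a dedicated chars child column.
//   auto const* chars = scv.chars().begin<char>();            // chars() is now deprecated
// After this change: characters live in the parent string column's _data buffer.
auto const* chars  = static_cast<char const*>(scv.parent().head());
auto const n_bytes = scv.chars_size(stream);                   // chars_size() now takes a stream
```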
This PR adds support for the `large_string` type of `arrow` arrays in `cudf`. `cudf` strings columns lack 64-bit offset support, and that work is WIP: #13733. This workaround is essential because `pandas-2.2+` now defaults to the `large_string` type for arrow-strings instead of `string`: pandas-dev/pandas#56220. This PR fixes all 25 `dask-cudf` failures. Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - Ashwin Srinath (https://github.com/shwina) URL: #15093
Now that #15195 is merged, I did some testing via cuDF-python
So far so good! Many APIs can successfully consume large strings, and only concat can produce them for now. 🎉
This is incredibly exciting! More than any individual string operation, one of the most common pain points I see in workflows is the inability to bring strings along as a payload during joins (now that concat works):
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
---------------------------------------------------------------------------
OverflowError Traceback (most recent call last)
Cell In[11], line 13
6 N = 6000
8 df1 = cudf.DataFrame({
9 "val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
10 "key": [0, 1]*N
11 })
---> 13 res = df1.merge(df1, on="key")
14 print(f"{res.val_x.str.len().sum():,} characters in string column")
File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116
...
File copying.pyx:151, in cudf._lib.copying.gather()
File copying.pyx:34, in cudf._lib.pylibcudf.copying.gather()
File copying.pyx:66, in cudf._lib.pylibcudf.copying.gather()
OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit
If I only have numeric data, this works smoothly as the output dataframe is only 72M rows.
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 6000
df1 = cudf.DataFrame({
"val": [10, 100]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{len(res):,} rows in dataframe")
72,000,000 rows in dataframe
I'd love to be able to complete this (contrived) example, because I think it's representative of something we see often: this limit causing failures in workflows where users expect things to work smoothly. As a reference, the self-join works with `N = 5000`:
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
import numpy as np
N = 5000
df1 = cudf.DataFrame({
"val": ["this is a fairly short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
res = df1.merge(df1, on="key")
print(f"{res.val_x.str.len().sum():,} characters in string column")
env: LIBCUDF_LARGE_STRINGS_ENABLED=1
1,675,000,000 characters in string column
Replaces `make_offsets_child_column` with strings specific version in `cudf::strings::detail::gather` function. Fixes issue found here: #13733 (comment) Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) - Mark Harris (https://github.com/harrism) URL: #15621
…5632) Part of #13733. Adds support for reading and writing cuDF string columns where the string data exceeds 2GB. This is accomplished by skipping the final offsets calculation in the string decoding kernel when the 2GB threshold is exceeded, and instead uses `cudf::strings::detail::make_offsets_child_column()`. This could lead to increased overhead with many columns (see #13024), so this will need some more benchmarking. But if there are many columns that exceed the 2GB limit, it's likely reads will have to be chunked to stay within the memory budget. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) Approvers: - Bradley Dice (https://github.com/bdice) - Muhammad Haseeb (https://github.com/mhaseeb123) - David Wendt (https://github.com/davidwendt) - Vukasin Milovanovic (https://github.com/vuule) URL: #15632
The issue described here #13733 (comment) should be fixed with #15621
And now the issue described here #13733 (comment) should be fixed with #15721
Gave this a quick spin -- the CSV reader itself works, but we fail when displaying the result (converting it to pandas); see the traceback below. So I'd say:
%env LIBCUDF_LARGE_STRINGS_ENABLED=1
import cudf
N = int(5e7)
df = cudf.DataFrame({
"val": ["this is a short string", "this one is a bit longer, but not much"]*N,
"key": [0, 1]*N
})
df.to_csv("large_string_df.csv", chunksize=1000000, index=False)
del df
df = cudf.read_csv("large_string_df.csv")
print(len(df))
print(df.iloc[0])
100000000
val key
0 this is a short string 0
df.head()
---------------------------------------------------------------------------
ArrowException Traceback (most recent call last)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/core/formatters.py:711, in PlainTextFormatter.__call__(self, obj)
704 stream = StringIO()
705 printer = pretty.RepresentationPrinter(stream, self.verbose,
706 self.max_width, self.newline,
707 max_seq_length=self.max_seq_length,
708 singleton_pprinters=self.singleton_printers,
709 type_pprinters=self.type_printers,
710 deferred_pprinters=self.deferred_printers)
--> 711 printer.pretty(obj)
712 printer.flush()
713 return stream.getvalue()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:411, in RepresentationPrinter.pretty(self, obj)
408 return meth(obj, self, cycle)
409 if cls is not object \
410 and callable(cls.__dict__.get('__repr__')):
--> 411 return _repr_pprint(obj, self, cycle)
413 return _default_pprint(obj, self, cycle)
414 finally:
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/IPython/lib/pretty.py:779, in _repr_pprint(obj, p, cycle)
777 """A pprint that just redirects to the normal repr function."""
778 # Find newlines and replace them with p.break_()
--> 779 output = repr(obj)
780 lines = output.splitlines()
781 with p.group():
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1973, in DataFrame.__repr__(self)
1970 @_cudf_nvtx_annotate
1971 def __repr__(self):
1972 output = self._get_renderable_dataframe()
-> 1973 return self._clean_renderable_dataframe(output)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:1835, in DataFrame._clean_renderable_dataframe(self, output)
1832 else:
1833 width = None
-> 1835 output = output.to_pandas().to_string(
1836 max_rows=max_rows,
1837 min_rows=min_rows,
1838 max_cols=max_cols,
1839 line_width=width,
1840 max_colwidth=max_colwidth,
1841 show_dimensions=show_dimensions,
1842 )
1844 lines = output.split("\n")
1846 if lines[-1].startswith("["):
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
113 @wraps(func)
114 def inner(*args, **kwargs):
115 libnvtx_push_range(self.attributes, self.domain.handle)
--> 116 result = func(*args, **kwargs)
117 libnvtx_pop_range(self.domain.handle)
118 return result
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5324, in DataFrame.to_pandas(self, nullable, arrow_type)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
-> 5324 out_data = {
5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/dataframe.py:5325, in <dictcomp>(.0)
5249 """
5250 Convert to a Pandas DataFrame.
5251
(...)
5321 dtype: object
5322 """
5323 out_index = self.index.to_pandas()
5324 out_data = {
-> 5325 i: col.to_pandas(
5326 index=out_index, nullable=nullable, arrow_type=arrow_type
5327 )
5328 for i, col in enumerate(self._data.columns)
5329 }
5331 out_df = pd.DataFrame(out_data, index=out_index)
5332 out_df.columns = self._data.to_pandas_index()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/string.py:5802, in StringColumn.to_pandas(self, index, nullable, arrow_type)
5800 return pd.Series(pandas_array, copy=False, index=index)
5801 else:
-> 5802 return super().to_pandas(index=index, nullable=nullable)
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column/column.py:215, in ColumnBase.to_pandas(self, index, nullable, arrow_type)
211 return pd.Series(
212 pd.arrays.ArrowExtensionArray(pa_array), index=index
213 )
214 else:
--> 215 pd_series = pa_array.to_pandas()
217 if index is not None:
218 pd_series.index = index
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:872, in pyarrow.lib._PandasConvertible.to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1517, in pyarrow.lib.Array._to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/array.pxi:1916, in pyarrow.lib._array_like_to_pandas()
File /raid/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()
ArrowException: Unknown error: Wrapping
From a local run of our Python test suite with large strings enabled, it seems like:
@GregoryKimball @vyasr @davidwendt, what do you think is the right balance of exhaustiveness vs. practicality? I don't think we want to add a large string scenario to every Python unit test. (Running the test suite with large strings enabled in #15932 so folks can review more easily)
I thought about this a little bit today, but haven't come up with any very concrete guidelines yet. At a high level, a reasonable balance would be testing each API that supports large strings once, so that we skip any exhaustive sweeps over parameters but still exercise all of the code paths at least once from Python and ensure that all of the basic logic around dtype handling works. Perhaps @davidwendt also has some ideas around common failure modes for large strings, or properties of the large strings data type that are different from normal strings. If there are particular dimensions along which we expect that type to differ, then we would benefit from emphasizing testing there. I'll keep thinking though.
Finding the right balance has been a challenge. I've tried to isolate the large-strings decision logic to a few utilities and use those as much as possible, hoping that only a few tests would be needed to catch all the common usages. Of course, this coverage requires internal knowledge of how the APIs are implemented. For libcudf, I created a set of large strings unit tests here: https://github.com/rapidsai/cudf/tree/branch-24.08/cpp/tests/large_strings that should cover the known usages.
Congratulations! Starting in 24.08, large strings are enabled by default in libcudf and cudf! 🎉🎉🎉🎉 We will track further development in separate issues. |
In libcudf, strings columns have child columns containing character data and offsets, and the offsets child column uses a 32-bit signed size type. This limits strings columns to ~2.1 billion characters in total. For LLM training, documents have up to 1M characters and a median of around 3K characters, so LLM training pipelines have to carefully batch the data down to a few thousand rows to stay comfortably within the size type limit. We have a general issue open to explore a 64-bit size type in libcudf (#13159). For size issues with LLM training pipelines, we should consider a targeted change that only addresses the size limit for strings columns.
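To make the limit concrete, here is a toy illustration of the current layout (the struct and names below are illustrative only, not libcudf types): the characters of all rows live in one contiguous buffer, each row is addressed by a pair of 32-bit offsets, and the final offset equals the total character count, so it cannot exceed `INT32_MAX`.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Toy model of today's strings column layout (illustrative, not libcudf code).
struct toy_strings_column {
  std::vector<char>         chars;    // all rows' characters, concatenated
  std::vector<std::int32_t> offsets;  // row_count + 1 entries; row i spans [offsets[i], offsets[i+1])
};

// The last offset equals chars.size(), so with int32 offsets the total
// character data is capped at 2,147,483,647 bytes (~2.1 billion characters).
constexpr std::int64_t kMaxChars = std::numeric_limits<std::int32_t>::max();
```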
Requirements
Proposed solution
One idea that satisfies these requirements would be to represent the character data as an `int64` typed column instead of an `int8` typed column. This would allow us to store 8x more bytes of character data. To access the character bytes, we would use an offset-normalizing iterator (inspired by the "indexalator") to identify byte positions using an `int64` iterator output. Please note that the 32-bit size type would still apply to the row count of the proposed "large strings" columns.

We should also consider an "unbounded" character data allocation that is not typed, but rather a single buffer up to 2^64 bytes in size. The 64-bit offset type would be able to index into much larger allocations.

Please note that this solution will not impact the offsets for list columns. We believe that the best design to allow for more than 2.1B elements in lists will be to use a 64-bit size type in libcudf as discussed in #13159.
Creating strings columns
Strings column factories would choose child column types at the time of column creation, based on the size of the character data. This change would impact strings column factories, as well as algorithms that use strings column utilities or generate their own offsets buffers. At column creation time, the constructor will choose between `int32` offsets with `int8` character data and `int64` offsets with `int64` character data, based on the size of the character data. Any function that calls `make_offsets_child_column` will need to be aware of the alternate child column types for large strings.
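The creation-time choice described above might look roughly like the following sketch (a hypothetical helper, not the actual factory): pick the offsets/character representation from the total character count.

```cpp
#include <cstdint>
#include <limits>

// Hedged sketch of the creation-time choice (not libcudf code): small columns
// keep int32 offsets with int8 character data; columns whose character data
// would overflow int32 switch to int64 offsets with int64 character data.
enum class strings_layout { int32_offsets_int8_chars, int64_offsets_int64_chars };

inline strings_layout choose_layout(std::int64_t total_chars)
{
  return total_chars <= std::numeric_limits<std::int32_t>::max()
             ? strings_layout::int32_offsets_int8_chars
             : strings_layout::int64_offsets_int64_chars;
}
```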
Accessing strings data

The offset-normalizing iterator would always return `int64` type so that strings column consumers would not need to support both `int32` and `int64` offset types. See `cudf::detail::sizes_to_offsets_iterator` for an example of how an iterator operating on `int32` data can output `int64` data.
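A minimal sketch of that normalization idea (illustrative only; the real libcudf iterators such as the indexalator are device-side and more general): wrap an offsets buffer whose physical type may be `int32` or `int64` and always hand back `int64` values.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative offset-normalizing wrapper (not libcudf's implementation):
// reads int32 or int64 storage, always yields int64 offsets.
class offset_normalizer {
 public:
  offset_normalizer(void const* data, bool is_int64) : data_{data}, is_int64_{is_int64} {}

  std::int64_t operator[](std::size_t i) const
  {
    return is_int64_ ? static_cast<std::int64_t const*>(data_)[i]
                     : static_cast<std::int64_t>(static_cast<std::int32_t const*>(data_)[i]);
  }

 private:
  void const* data_;
  bool is_int64_;
};
```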
Interoperability with Arrow

The new strings column variant with `int64` offsets and `int64` character data may already be Arrow-compatible. This requires more testing and some changes to our Arrow interop utilities.
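For context on why the `int64`-offset variant lines up with Arrow: Arrow's `large_string` (large_utf8) arrays already store one character buffer plus 64-bit offsets, so in principle the buffers could be handed over directly. The snippet below is a speculative sketch using Arrow C++; the buffer arguments are placeholders, not cudf interop API.

```cpp
#include <arrow/api.h>
#include <cstdint>
#include <memory>

// Speculative sketch: wrap an existing int64 offsets buffer and chars buffer as
// an Arrow large_string array. `offsets` must hold row_count + 1 int64 values and
// `chars` the concatenated UTF-8 bytes; null handling is omitted for brevity.
std::shared_ptr<arrow::Array> wrap_as_large_string(std::int64_t row_count,
                                                   std::shared_ptr<arrow::Buffer> offsets,
                                                   std::shared_ptr<arrow::Buffer> chars)
{
  return std::make_shared<arrow::LargeStringArray>(row_count, offsets, chars);
}
```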
Part 1: libcudf changes to support large strings columns

Definitions:
- "strings column": `int8` character data and `int32` offset data (2.1B characters)
- "large strings column": `int8` character data up to 2^64 bytes and `int64` offset data (18400T characters)

- Replace `offset_type` references with `size_type`
- Add new `int64_t` data-size member to `cudf::column_view`, `cudf::mutable_column_view` and `cudf::column_device_view`
- `int32` ✅ #14234
  * Also refactor algorithms such as concat, contiguous split and gather which access character data
  * Update code in cuDF-python that interacts with character child columns
  * Update code in cudf-java that interacts with character child columns
- `LIBCUDF_LARGE_STRINGS_THRESHOLD` added
- Deprecate `strings_column_view::offsets_begin()` in libcudf since it hardcodes the return type as int32
- Deprecate `create_chars_child_column` in libcudf since it wraps a column around chars data
- Change `make_strings_children` to return a uvector for chars instead of a column
- Add `LIBCUDF_LARGE_STRINGS_ENABLED` to let users force libcudf to throw rather than start using 64-bit offsets, to allow try-catch-repartitioning instead (see the sketch after Part 3 below)
- Update `concatenate` to produce large strings when `LIBCUDF_LARGE_STRINGS_ENABLED` is set and the character count is above the `LIBCUDF_LARGE_STRINGS_THRESHOLD`
- Add an `experimental` version of `make_strings_children` that generates 64-bit offsets when the total character length exceeds the threshold
- Set `LIBCUDF_LARGE_STRINGS_THRESHOLD` at zero
- Remove the `experimental` namespace. Replace the `make_strings_children` implementation with the `experimental` namespace version.

Part 2: cuIO changes to read and write large strings columns

Part 3: Interop changes to read and write large strings columns
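Tying together the environment-variable items from Part 1, here is a simplified sketch of the decision logic (hypothetical host-side helpers, not the actual libcudf implementation): when an output's character count crosses the threshold, either switch to 64-bit offsets or throw so the caller can catch and repartition.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <stdexcept>

// Hypothetical helpers illustrating the LIBCUDF_LARGE_STRINGS_* behavior
// described in Part 1 above (simplified; not libcudf internals).
inline bool large_strings_enabled()
{
  char const* env = std::getenv("LIBCUDF_LARGE_STRINGS_ENABLED");
  return env != nullptr && env[0] == '1';
}

inline std::int64_t large_strings_threshold()
{
  char const* env = std::getenv("LIBCUDF_LARGE_STRINGS_THRESHOLD");
  // Fall back to the int32 limit when the threshold is not set.
  return env != nullptr ? std::atoll(env) : std::numeric_limits<std::int32_t>::max();
}

// Decide how to build offsets for an output with `total_chars` characters.
inline bool use_int64_offsets(std::int64_t total_chars)
{
  if (total_chars <= large_strings_threshold()) { return false; }  // int32 offsets are fine
  if (large_strings_enabled()) { return true; }                    // build 64-bit offsets
  // Otherwise fail loudly so callers can try-catch and repartition their input.
  throw std::overflow_error("string character count exceeds the size_type limit");
}
```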
Also see #15093 about large_strings compatibility for pandas-2.2