Move chars column to parent data buffer in strings column #14202

karthikeyann · 2023-09-26T19:30:07Z

Description

Eliminates the chars child column and moves the chars data into the parent strings column's `_data` buffer.

Summary of changes

- The chars child column is removed; the chars buffer is added to the parent column.
- Adds a stream parameter to `chars_size()` and `chars_end()` in `strings_column_view`, and updates their invocations.
- Removes `chars_column_index` and deprecates `chars()` in `strings_column_view`.
- Replaces `chars_col.begin<char>()` with `static_cast<char*>(parent.head())`.
- Adds a strings column factory that accepts an `rmm::device_buffer` instead of a chars column.
- Deprecates the strings column factory that accepts a chars column.
- IO changes: contiguous split (from @nvdbaranec), to_arrow, parquet writer.
- Fixes binary ops, column_view, interleave columns, byte cast, strings APIs, text APIs.
- Fixes tests and benchmarks (mostly adding the stream parameter to `chars_size`).
- Java fixes (from @andygrove).
- Python changes:
  - `.data` special case for string column
  - get size from offsets column for `rmm.DeviceBuffer` in column
  - special condition for string slice
  - pickle file update for string column
  - a few unit test updates
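
As a rough illustration of the layout change summarized above, here is a minimal pure-Python sketch. It is not cuDF code: `StringsColumnSketch` and its methods are hypothetical names chosen for this example. It only models the idea that the parent holds one flat chars buffer, the offsets stay in a child, and the chars size is derived from the offsets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StringsColumnSketch:
    """Toy model (hypothetical): offsets plus a chars buffer owned by the parent."""

    offsets: List[int]  # len(rows) + 1 entries; row i spans offsets[i]:offsets[i + 1]
    data: bytes         # UTF-8 bytes held directly by the parent, no chars child

    @classmethod
    def from_strings(cls, rows: List[str]) -> "StringsColumnSketch":
        offsets = [0]
        chunks = []
        for row in rows:
            encoded = row.encode("utf-8")
            chunks.append(encoded)
            offsets.append(offsets[-1] + len(encoded))
        return cls(offsets=offsets, data=b"".join(chunks))

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def chars_size(self) -> int:
        # Without a chars child there is no child size to query,
        # so the byte count is taken from the first and last offsets.
        return self.offsets[-1] - self.offsets[0]

    def row(self, i: int) -> str:
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")


col = StringsColumnSketch.from_strings(["ab", "", "h\u00e9llo"])
print(len(col), col.chars_size(), col.row(2))
```

In this model the chars size comes from the offsets rather than from a child column. That is presumably why `chars_size()` gained a stream parameter in libcudf, since reading an offset that lives in device memory needs a stream; treat that as an inference, not something the description states.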

Preparing for #13733

karthikeyann · 2023-11-02T21:37:14Z

@galipremsagar failing pytests
~~FAILED test_pickling.py::test_pickle_string_column[slices1] - AssertionError: Series are different~~
~~FAILED test_pickling.py::test_pickle_string_column[slices3] - AssertionError: Series are different~~
FAILED test_serialize.py::test_deserialize_cudf_0_16 - RuntimeError: CUDF failure at: /home/knataraj/dev/rapids/cudf/cpp/src/column/column_view.cpp:57: Null data pointer.
FAILED test_testing.py::test_assert_column_memory_basic_same[arrow_arrays1] - ValueError: Buffer size must be divisible by element size
FAILED test_orc.py::test_writer_timestamp_stream_size - AssertionError: numpy array are different
FAILED test_orc.py::test_orc_reader_negative_timestamp[pyarrow] - AssertionError: numpy array are different
FAILED test_orc.py::test_orc_writer_negative_timestamp - AssertionError: numpy array are different

Update: these issues are fixed.

wence-

Approving Python changes with a (non-blocking) suggestion to introduce a single type definition for the char type.

wence- · 2024-01-10T10:06:04Z

python/cudf/cudf/core/column/column.py

-            build_column(
-                data=as_buffer(
-                    rmm.DeviceBuffer(
-                        size=row_count * cudf.dtype("int8").itemsize


In general this condition can also be true for (at least) List and Struct columns. But, those are handled by specific cases above.

From a quick test, I think we don't need an empty chars buffer, so can you try using rmm.DeviceBuffer(0)?
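
To see why a zero-size chars buffer can suffice here, consider a minimal pure-Python sketch. It is not cuDF code, and `empty_strings_layout` is a hypothetical helper. It assumes every row of the column built in this branch is empty or null, so all offsets are zero and no character bytes are ever referenced.

```python
from typing import List, Tuple


def empty_strings_layout(row_count: int) -> Tuple[List[int], bytes]:
    """Hypothetical helper: layout of a strings column whose rows hold no characters."""
    # row_count + 1 offsets, all zero: every row spans the empty range [0, 0).
    offsets = [0] * (row_count + 1)
    # No row references any byte, so the chars buffer can have size zero
    # instead of row_count bytes.
    chars = bytes(0)
    return offsets, chars


offsets, chars = empty_strings_layout(3)
print(offsets, len(chars))
```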

wence- · 2024-01-10T10:19:23Z

python/cudf/cudf/core/column/string.py

@@ -5938,15 +5930,15 @@ def view(self, dtype) -> "cudf.core.column.ColumnBase":
        str_end_byte_offset = self.base_children[0].element_indexing(
            self.offset + self.size
        )
-        char_dtype_size = self.base_children[1].dtype.itemsize
+        char_dtype_size = cudf.api.types.dtype("int8").itemsize


Ah, ok. Can we introduce (like cudf._lib.types.size_type_dtype) a single source of truth for the type of the string char buffer, perhaps cudf._lib.types.char_type_dtype?
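
One way to picture the suggestion is a single module-level definition that every byte-size computation refers to. The sketch below is plain Python using the standard `array` module rather than cuDF. `CHAR_TYPECODE`, `CHAR_ITEMSIZE`, and `viewed_element_count` are hypothetical names, and the arithmetic is only an assumption about what a `view`-style reinterpretation needs to check.

```python
import array
from typing import Sequence

# Hypothetical single source of truth for the chars buffer element type.
CHAR_TYPECODE = "b"  # signed 8-bit integer, standing in for int8
CHAR_ITEMSIZE = array.array(CHAR_TYPECODE).itemsize


def viewed_element_count(
    offsets: Sequence[int], start: int, size: int, target_itemsize: int
) -> int:
    """Number of target-typed elements covered by rows [start, start + size)."""
    begin_byte = offsets[start]
    end_byte = offsets[start + size]
    n_bytes = (end_byte - begin_byte) * CHAR_ITEMSIZE
    if n_bytes % target_itemsize != 0:
        # Illustrative check only; the wording is not taken from cuDF.
        raise ValueError("byte span is not a multiple of the target item size")
    return n_bytes // target_itemsize


print(viewed_element_count([0, 2, 2, 8], 0, 3, 4))
```

With one shared constant, call sites would not need to repeat a literal `"int8"`, which is the duplication the comment above is pointing at.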

davidwendt · 2024-01-10T23:52:13Z

We'll probably want to update the developer's guide once this is merged as well
https://github.com/rapidsai/cudf/blob/branch-24.02/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#strings-columns

mroeschke

Optional comment, otherwise LGTM

mroeschke · 2024-01-12T20:10:20Z

python/cudf/cudf/core/column/string.py

@@ -5938,15 +5930,15 @@ def view(self, dtype) -> "cudf.core.column.ColumnBase":
        str_end_byte_offset = self.base_children[0].element_indexing(
            self.offset + self.size
        )
-        char_dtype_size = self.base_children[1].dtype.itemsize
+        char_dtype_size = cudf.api.types.dtype("int8").itemsize


I think it would be worth adding a # TODO comment noting that int8 is a workaround

karthikeyann · 2024-01-17T11:38:44Z

/merge

Fixes deprecation warnings introduced when #14202 merged. Most of these are for calls to `cudf::make_strings_column` which deprecated the chars-column function overload.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #14771

Removes the functions deprecated in 24.02 in #14202.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #14848

This PR contains a number of different fixes currently required to get cugraph tests passing:

- There are two main changes for pandas 2 compatibility:
  - [pandas renamed `DataFrame.applymap` to `DataFrame.map`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) so creating the renumbering map with a column `map` caused problems for attribute-based column access `renumber_map.map`. Those columns are now renamed to `renumber_map`.
  - Empty columns now default to str rather than float, so tests that assumed we could access the values as cupy arrays failed because cudf's string columns cannot be converted to cupy arrays. These columns are now always cast to float in the tests before the cupy conversion.
- cugraph-dgl and cugraph-pyg's wheel builds were not downloading the latest cugraph/pylibcugraph wheels to run tests. As a result, the above pandas 2 fixes didn't take when running the dgl and pyg tests. I updated the wheel building scripts to account for this discrepancy.
- rapidsai/cudf#14202 made a breaking change to how characters are encoded in strings columns in cudf, which broke cugraph_etl. This PR fixes the code that depended on the old APIs. This code also includes a small patch to the cugraph_etl CMake so that it exports the correct package name (previously it was using cugraph).

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Bradley Dice (https://github.com/bdice)
- Chuck Hastings (https://github.com/ChuckHastings)
- Rick Ratzel (https://github.com/rlratzel)
- Jake Awe (https://github.com/AyodeAwe)

URL: #4144

karthikeyann added 6 commits September 27, 2023 00:49

add stream to chars(), chars_size, chars_end, use head()

f5ca3d3

src/ changes

aced40d

tests/ changes

7c34040

benchmarks/ changes

b592458

java/ changes

86a1e59

examples/ changes

55dce95

karthikeyann added the feature request, 2 - In Progress, 5 - DO NOT MERGE, and breaking labels Sep 26, 2023

karthikeyann self-assigned this Sep 26, 2023

github-actions bot added the libcudf and Java labels Sep 26, 2023

karthikeyann and others added 2 commits September 27, 2023 15:52

Merge branch 'branch-23.10' into fea-char_limit_experiment

8212300

fix typo

a874f32

GregoryKimball mentioned this pull request Sep 27, 2023

[FEA] Increase maximum characters in strings columns #13733

Open

python/ changes

3b23642

github-actions bot added the Python label Sep 29, 2023

karthikeyann changed the base branch from branch-23.10 to branch-23.12 October 3, 2023 09:07

karthikeyann and others added 6 commits October 3, 2023 14:37

Merge branch 'branch-23.12' into fea-char_limit_experiment

766c939

string fixes for contiguous split (nvdbaranec)

75d683c

fix view char ptr, naming with numbers for a pytest

5e4ef98

Merge branch 'branch-23.12' of github.com:rapidsai/cudf into fea-char_limit_experiment

8ff34b0

base_data fix in strings column in Cython

a900572

Merge branch 'branch-23.12' of github.com:rapidsai/cudf into fea-char_limit_experiment

88c2b36

karthikeyann added 4 commits November 3, 2023 21:48

fix serialization of sliced string column

ee1fff2

fix test_assert_column_memory_basic_same for string column

dc5acd9

fix test_deserialize_cudf_0_16, rename to 23_12

902b466

fix ParquetWriterTest.StringsAsBinary test

13aa651

karthikeyann and others added 2 commits January 10, 2024 04:11

address review comments

d3db9e7

use Optional; move string special case for data to string.py; use data for memory usage calculation; remove chars_data declaration

Merge branch 'branch-24.02' into fea-char_limit_experiment

f543551

karthikeyann requested a review from wence- January 9, 2024 22:46

wence- approved these changes Jan 10, 2024

karthikeyann and others added 2 commits January 10, 2024 21:33

zero size buffer for column_empty

f2e6e15

Merge branch 'branch-24.02' into fea-char_limit_experiment

e7ff5c0

karthikeyann and others added 3 commits January 12, 2024 17:50

Update DEVELOPER_GUIDE.md

906f27e

Merge branch 'branch-24.02' into fea-char_limit_experiment

39ee47d

update strings.png

555cc67

davidwendt reviewed Jan 12, 2024

cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md (outdated, resolved)

karthikeyann and others added 2 commits January 12, 2024 20:12

Update cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md

4f4ee07

Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>

remove space at end

87b8bce

mroeschke approved these changes Jan 12, 2024

karthikeyann and others added 3 commits January 13, 2024 02:54

remove int8 usage in string.py

3ed1557

Merge branch 'branch-24.02' into fea-char_limit_experiment

debf7de

Merge branch 'branch-24.02' into fea-char_limit_experiment

f8e5845

shwina approved these changes Jan 16, 2024

Merge branch 'branch-24.02' into fea-char_limit_experiment

a577c88

rapids-bot bot merged commit c7acdaa into rapidsai:branch-24.02 Jan 17, 2024
67 of 68 checks passed

jlowe mentioned this pull request Jan 17, 2024

Update to new cudf strings where character data is no longer a child column NVIDIA/spark-rapids-jni#1708

Merged

davidwendt mentioned this pull request Jan 17, 2024

Fix calls to deprecated strings factory API #14771

Merged

3 tasks

karthikeyann mentioned this pull request Jan 22, 2024

[BUG] deprecated warnings should be enabled #14819

Open

davidwendt mentioned this pull request Jan 23, 2024

Remove deprecated strings functions #14848

Merged

3 tasks

vyasr mentioned this pull request Feb 6, 2024

Fixes for pandas 2, latest cudf, and wheel building rapidsai/cugraph#4144

Merged

PointKernel mentioned this pull request Feb 29, 2024

Change make_strings_children to return uvector #15171

Merged

3 tasks
