Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move chars column to parent data buffer in strings column #14202

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Sep 26, 2023

Description

Eliminates chars column and moves chars data to parent string column's _data buffer.

Summary of changes

  • chars child column is removed, chars buffer is added to parent column
  • Adds stream to chars_size(), chars_end() in strings_column_view and their invocations
  • Remove chars_column_index, and deprecate chars() from strings_column_view
  • Replace chars_col.begin<char>() with static_cast<char*>(parent.head())
  • Adds string column factory which accepts rmm::device_buffer instead of chars column
  • Deprecate string column factory which accepts chars column
  • IO changes - contiguous split (From @nvdbaranec ), to_arrow, parquet writer.
  • Fix binary ops, column_view, interleave columns, byte cast, strings APIs, text APIs
  • Fix tests, benchmarks (mostly adding stream parameter to chars_size)
  • Java fixes (From @andygrove)
  • Python changes
    • .data special case for string column
    • get size from offsets column for rmm.DeviceBuffer in column
    • special condition for string slice
    • Pickle file update for string column
    • a few unit tests updates

Preparing for #13733

@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress 5 - DO NOT MERGE Hold off on merging; see PR for details breaking Breaking change labels Sep 26, 2023
@karthikeyann karthikeyann self-assigned this Sep 26, 2023
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Java Affects Java cuDF API. labels Sep 26, 2023
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 29, 2023
@karthikeyann karthikeyann changed the base branch from branch-23.10 to branch-23.12 October 3, 2023 09:07
@karthikeyann
Copy link
Contributor Author

karthikeyann commented Nov 2, 2023

@galipremsagar failing pytests
FAILED test_pickling.py::test_pickle_string_column[slices1] - AssertionError: Series are different
FAILED test_pickling.py::test_pickle_string_column[slices3] - AssertionError: Series are different
FAILED test_serialize.py::test_deserialize_cudf_0_16 - RuntimeError: CUDF failure at: /home/knataraj/dev/rapids/cudf/cpp/src/column/column_view.cpp:57: Null data pointer.
FAILED test_testing.py::test_assert_column_memory_basic_same[arrow_arrays1] - ValueError: Buffer size must be divisible by element size
FAILED test_orc.py::test_writer_timestamp_stream_size - AssertionError: numpy array are different
FAILED test_orc.py::test_orc_reader_negative_timestamp[pyarrow] - AssertionError: numpy array are different
FAILED test_orc.py::test_orc_writer_negative_timestamp - AssertionError: numpy array are different

Update: these issues are fixed.

karthikeyann and others added 2 commits January 10, 2024 04:11
use Optional
move string special case for data to string.py
use data for memory usage calculation
remove chars_data declaration
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving python changes with (non-blocking) suggestion to introduce a single type definition for the char type.

build_column(
data=as_buffer(
rmm.DeviceBuffer(
size=row_count * cudf.dtype("int8").itemsize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this condition can also be true for (at least) List and Struct columns. But, those are handled by specific cases above.

From a quick test, I think we don't need an empty chars buffer, so can you try using rmm.DeviceBuffer(0)?

python/cudf/cudf/core/column/string.py Show resolved Hide resolved
@@ -5938,15 +5930,15 @@ def view(self, dtype) -> "cudf.core.column.ColumnBase":
str_end_byte_offset = self.base_children[0].element_indexing(
self.offset + self.size
)
char_dtype_size = self.base_children[1].dtype.itemsize
char_dtype_size = cudf.api.types.dtype("int8").itemsize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. Can we introduce (like cudf._lib.types.size_type_dtype) a single source of truth for the type of the string char buffer, perhaps cudf._lib.types.char_type_dtype?

python/cudf/cudf/tests/test_testing.py Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Show resolved Hide resolved
python/cudf/cudf/_lib/column.pyx Show resolved Hide resolved
@davidwendt
Copy link
Contributor

We'll probably want to update the developer's guide once this is merged as well
https://github.com/rapidsai/cudf/blob/branch-24.02/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md#strings-columns

karthikeyann and others added 2 commits January 12, 2024 20:12
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
Copy link
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional comment otherwise LGTM

@@ -5938,15 +5930,15 @@ def view(self, dtype) -> "cudf.core.column.ColumnBase":
str_end_byte_offset = self.base_children[0].element_indexing(
self.offset + self.size
)
char_dtype_size = self.base_children[1].dtype.itemsize
char_dtype_size = cudf.api.types.dtype("int8").itemsize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be worth adding a # TODO comment noting that int8 is a workaround

@karthikeyann
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit c7acdaa into rapidsai:branch-24.02 Jan 17, 2024
67 of 68 checks passed
rapids-bot bot pushed a commit that referenced this pull request Jan 19, 2024
Fixes deprecation warnings introduced when #14202 merged.
Most of these are for calls to `cudf::make_strings_column` which deprecated the chars-column function overload.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14771
rapids-bot bot pushed a commit that referenced this pull request Feb 3, 2024
Removes the functions deprecated in 24.02 in #14202.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #14848
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Feb 6, 2024
This PR contains a number of different fixes currently required to get cugraph tests passing:
- There are two main changes for pandas 2 compatibility:
    - [pandas renamed `DataFrame.applymap` to `DataFrame.map`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html) so creating the renumbering map with a column `map` caused problems for attribute-based column access `renumber_map.map`. Those columns are now renamed to `renumber_map`.
    - Empty columns now default to str rather than float, so tests that assumed we could access the values as cupy arrays failed because cudf's string columns cannot be converted to cupy arrays. These columns are now always cast to float in the tests before the cupy conversion.
- cugraph-dgl and cugraph-pyg's wheel builds were not downloading the latest cugraph/pylibcugraph wheels to run tests. As a result, the above pandas 2 fixes didn't take when running the dgl and pyg tests. I updated the wheel building scripts to account for this discrepancy.
- rapidsai/cudf#14202 made a breaking change to how characters are encoded in strings columns in cudf, which broke cugraph_etl. This PR fixes the code that depended on the old APIs.

This code also includes a small patch to the cugraph_etl CMake so that it exports the correct package name (previously it was using cugraph).

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Bradley Dice (https://github.com/bdice)
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Rick Ratzel (https://github.com/rlratzel)
  - Jake Awe (https://github.com/AyodeAwe)

URL: #4144
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

9 participants