Implement `IndexedFrame.duplicated` with `distinct_indices` + `scatter` #14493

wence- · 2023-11-24T17:12:00Z

Description

To obtain the duplicate rows in a dataframe we previously performed a drop-duplicates with a carrier column of row indices and then set entries in a boolean column to False for those row indices that remained. Furthermore, we were performing an unnecessary merge after the drop-duplicates call to obtain the row indices.

Note that the carrier column provides exactly the information that is computed internally in libcudf by `cudf::detail::get_distinct_indices` (called as part of `cudf::distinct`). We therefore promote `get_distinct_indices` to a public function (as `cudf::distinct_indices`) and replace the (unnecessary) merge plus `iloc`-based setting of the result with a call to `libcudf.copying.scatter`.
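As a rough illustration of the idea (not the cuDF or libcudf implementation), the following pure-Python sketch marks duplicates by starting from an all-True mask and scattering False at the indices of the rows that a distinct operation would keep. The helper names `distinct_indices_sketch` and `duplicated_sketch` are hypothetical, rows are modelled as plain tuples, and the `keep` argument is assumed to follow the pandas `duplicated` convention (`"first"`, `"last"`, or `False`).

```python
def distinct_indices_sketch(rows, keep="first"):
    """Return sorted indices of the rows a distinct operation would keep.

    Hypothetical stand-in for the role played by cudf::distinct_indices.
    """
    first, last, count = {}, {}, {}
    for i, row in enumerate(rows):
        first.setdefault(row, i)  # index of the first occurrence of this row
        last[row] = i             # index of the last occurrence of this row
        count[row] = count.get(row, 0) + 1
    if keep == "first":
        return sorted(first.values())
    if keep == "last":
        return sorted(last.values())
    # keep=False: only rows that occur exactly once count as distinct
    return sorted(i for row, i in first.items() if count[row] == 1)


def duplicated_sketch(rows, keep="first"):
    """Mark duplicates: start all True, then scatter False at the kept indices."""
    result = [True] * len(rows)
    for i in distinct_indices_sketch(rows, keep):
        result[i] = False
    return result


rows = [(1, "a"), (2, "b"), (1, "a"), (3, "c"), (2, "b")]
print(duplicated_sketch(rows))  # [False, False, True, False, True]
```

The point of the sketch is that no merge is needed: once the kept indices are known, a single scatter into a boolean column produces the result.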

This provides a reasonable speedup (around 1.5x) for `duplicated()` on Series, and significantly improves performance of `duplicated()` on DataFrames, especially when providing a `subset` argument. Previously, the drop-duplicates call paid the cost of moving all columns of the distinct rows to the output table, even though we only needed the carrier "indices" column. Now we obtain those indices directly, so `duplicated()` scales only with the number of "active" columns. In some simple benchmarking this is between two and five times faster for tables with 10% distinct rows, depending on the number of passive additional columns.
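To make the scaling argument concrete, here is a minimal sketch, assuming a table modelled as a dict of equal-length Python lists. The function name `duplicated_subset_sketch` is hypothetical and this is not how cuDF stores data. Only the columns named in `subset` (the "active" columns) are ever read; passive columns are never touched, so adding more of them does not add work.

```python
def duplicated_subset_sketch(table, subset=None):
    """Mark rows whose key (taken from the active columns) was seen earlier."""
    names = list(table) if subset is None else list(subset)
    key_columns = [table[name] for name in names]  # only active columns are read
    n_rows = len(key_columns[0]) if key_columns else 0
    result = [True] * n_rows
    seen = set()
    for i, key in enumerate(zip(*key_columns)):
        if key not in seen:
            seen.add(key)
            result[i] = False  # first occurrence of this key is not a duplicate
    return result


table = {
    "k": [1, 1, 2, 2],
    "v": [10, 20, 30, 30],
    "payload": ["a", "b", "c", "d"],  # passive column when subset excludes it
}
print(duplicated_subset_sketch(table, subset=["k"]))       # [False, True, False, True]
print(duplicated_subset_sketch(table, subset=["k", "v"]))  # [False, False, False, True]
```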

Closes [PERF/ENH]: DataFrame.duplicated does unnecessary inner merge #14486

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence- · 2023-11-27T15:58:11Z

Note that one could also imagine a distinct_by_key(table_view values, table_view keys), but the choice in these functions (which is deliberate policy, #3303 (comment)) is to just allow specifying keys by subsetting a table. However, if one only needs a subset of the table as output to continue, this will move more data in the gather phase than necessary: there's no way with the current distinct(table_view values, std::vector<size_type> key_columns) to say "use these columns as keys, but only gather these other columns".
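For illustration only, a hypothetical `distinct_by_key` along the lines imagined above might look like the following pure-Python sketch. No such function exists in libcudf. Tables are modelled as dicts of equal-length lists, and keeping the first occurrence of each key is an assumption made for the example. The keys decide which rows survive, but only the `values` columns are gathered.

```python
def distinct_by_key(values, keys):
    """Hypothetical: deduplicate on `keys`, but gather only the `values` columns."""
    kept = []
    seen = set()
    for i, key in enumerate(zip(*keys.values())):
        if key not in seen:  # assumption: keep the first occurrence of each key
            seen.add(key)
            kept.append(i)
    # Gather phase: only the requested value columns are moved to the output.
    return {name: [column[i] for i in kept] for name, column in values.items()}


keys = {"k": [1, 1, 2, 2, 3]}
values = {"x": ["a", "b", "c", "d", "e"]}
print(distinct_by_key(values, keys))  # {'x': ['a', 'c', 'e']}
```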

wence- · 2023-12-04T11:14:30Z

Ready for another look @bdice, thanks!

karthikeyann

LGTM 👍

wence- · 2023-12-12T13:46:35Z

/merge

wence- requested review from a team as code owners November 24, 2023 17:12

wence- requested review from galipremsagar, charlesbluca, PointKernel and divyegala November 24, 2023 17:12

github-actions bot added labels libcudf ("Affects libcudf (C++/CUDA) code.") and Python ("Affects Python cuDF API.") Nov 24, 2023

wence- added labels improvement ("Improvement / enhancement to an existing function"), non-breaking ("Non-breaking change") and tech debt Nov 24, 2023

wence- force-pushed the wence/fix/14486 branch from fa1cd62 to 8f1891a November 24, 2023 17:33

wence- commented Nov 24, 2023

cpp/include/cudf/stream_compaction.hpp (resolved)

python/cudf/cudf/core/indexed_frame.py (resolved)

python/cudf/cudf/core/indexed_frame.py (resolved)

wence- force-pushed the wence/fix/14486 branch from 8f1891a to a2c77e6 November 30, 2023 16:54

wence- added 3 commits December 1, 2023 09:09

Expose cudf::detail::get_distinct_indices as cudf::distinct_indices

a10e27e

As well as being able to drop duplicates from a table, it is useful to be able to mark duplicates in the table. This is the information provided by get_distinct_indices, so promote it to a public function.

Expose new cudf::distinct_indices as stream_compaction.distinct_indices

67d4d6d

Implement IndexedFrame.duplicated with distinct_indices

ac1ee09

Rather than drop_duplicates + merge + iloc, use the distinct_indices to get the duplicate rows and do a single scatter. - Closes rapidsai#14486

wence- force-pushed the wence/fix/14486 branch from a2c77e6 to ac1ee09 December 1, 2023 09:10

bdice reviewed Dec 1, 2023

cpp/include/cudf/detail/stream_compaction.hpp (outdated, resolved)

cpp/include/cudf/stream_compaction.hpp (resolved)

python/cudf/cudf/core/indexed_frame.py (resolved)

wence- added 2 commits December 4, 2023 11:00

Scatter a scalar rather than a column

eca95b3

No need to allocate more memory than necessary.
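The difference can be sketched in plain Python, under the assumption that scatter writes source values into a copy of a target at the given indices. Both helpers below are hypothetical stand-ins, not the libcudf API: the column version needs a source with one element per index, whereas the scalar version needs only the single value.

```python
def scatter_column(source, indices, target):
    """Write source[j] into a copy of target at position indices[j]."""
    out = list(target)
    for value, i in zip(source, indices):
        out[i] = value
    return out


def scatter_scalar(value, indices, target):
    """Write the same value into a copy of target at every given index."""
    out = list(target)
    for i in indices:
        out[i] = value
    return out


target = [True] * 5
kept = [0, 1, 3]  # indices of the distinct rows

# Column form: allocates a source of len(kept) identical values first.
via_column = scatter_column([False] * len(kept), kept, target)
# Scalar form: no intermediate source column is needed.
via_scalar = scatter_scalar(False, kept, target)

print(via_scalar)  # [False, False, True, False, True]
```

Both forms give the same mask; the scalar form simply avoids materialising a column whose entries are all the same.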

Rename get_distinct_indices to distinct_indices

f6064b5

wence- force-pushed the wence/fix/14486 branch from 092997f to f6064b5 December 4, 2023 11:00

rapidsai deleted a comment from copy-pr-bot bot Dec 4, 2023

bdice approved these changes Dec 10, 2023

python/cudf/cudf/_lib/stream_compaction.pyx (outdated, resolved)

python/cudf/cudf/_lib/stream_compaction.pyx (outdated, resolved)

Typography

d2983f1

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

karthikeyann reviewed Dec 12, 2023

python/cudf/cudf/_lib/cpp/stream_compaction.pxd (outdated, resolved)

distinct_indices is except +

14514d6

karthikeyann approved these changes Dec 12, 2023

rapids-bot bot merged commit ef11061 into rapidsai:branch-24.02 Dec 12, 2023
67 checks passed

wence- deleted the wence/fix/14486 branch January 8, 2024 12:24
