Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement IndexedFrame.duplicated with distinct_indices + scatter #14493

Merged
merged 7 commits into from
Dec 12, 2023

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Nov 24, 2023

Description

To obtain the duplicate rows in a dataframe we previously performed a drop-duplicates with a carrier column of row indices and then set entries in a boolean column to False for those row indices that remained. Furthermore, we were performing an unnecessary merge after the drop-duplicates call to obtain the row indices.

Note that the carrier column provides exactly the information that is computed internally in libcudf by cudf::detail::get_distinct_indices (called as part of cudf::distinct). We therefore promote get_distinct_indices to a public function (as cudf::distinct_indices) and replace the (unnecessary) merge plus iloc-based setting of the result with a call to libcudf.copying.scatter.

This provides a reasonable speedup (around 1.5x) for duplicated() on Series, and significantly improves performance of duplicated() on DataFrames, especially when providing a subset argument. Previously we would pay the cost in drop-duplicates of moving all columns of the distinct rows to the output table, even though we only actually needed the carrier "indices" column. Now we just obtain those indices directly, duplicated() scales only with the number of "active" columns. In some simple benchmarking this is between two and five times faster for tables with 10% distinct rows depending on the number of passive additional columns.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested review from a team as code owners November 24, 2023 17:12
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Nov 24, 2023
@wence- wence- added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change tech debt labels Nov 24, 2023
@wence-
Copy link
Contributor Author

wence- commented Nov 27, 2023

Note that one could also imagine a distinct_by_key(table_view values, table_view keys), but the choice in these functions (which is deliberate policy, #3303 (comment)) is to just allow specifying keys by subsetting a table. However, if one only needs a subset of the table as output to continue, this will move more data in the gather phase than necessary: there's no way with the current distinct(table_view values, std::vector<size_type> key_columns) to say "use these columns as keys, but only gather these other columns".

As well as being able to drop duplicates from a table, it is useful to
be able to mark duplicates in the table. This is the information
provided by get_distinct_indices, so promote it to a public function.
Rather than drop_duplicates + merge + iloc, use the distinct_indices
to get the duplicate rows and do a single scatter.

- Closes rapidsai#14486
@wence-
Copy link
Contributor Author

wence- commented Dec 4, 2023

Ready for another look @bdice, thanks!

python/cudf/cudf/_lib/stream_compaction.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/stream_compaction.pyx Outdated Show resolved Hide resolved
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Copy link
Contributor

@karthikeyann karthikeyann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@wence-
Copy link
Contributor Author

wence- commented Dec 12, 2023

/merge

@rapids-bot rapids-bot bot merged commit ef11061 into rapidsai:branch-24.02 Dec 12, 2023
67 checks passed
@wence- wence- deleted the wence/fix/14486 branch January 8, 2024 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[PERF/ENH]: DataFrame.duplicated does unnecessary inner merge
3 participants