Implement IndexedFrame.duplicated with distinct_indices + scatter #14493
fa1cd62 to 8f1891a
Note that one could also imagine a |
8f1891a to a2c77e6
As well as being able to drop duplicates from a table, it is useful to be able to mark duplicates in the table. This is the information provided by get_distinct_indices, so promote it to a public function.
Rather than drop_duplicates + merge + iloc, use the distinct_indices to get the duplicate rows and do a single scatter.
- Closes rapidsai#14486
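As a rough illustration of the distinct-indices-plus-scatter idea in the commit message above, here is a minimal host-side sketch in plain Python. It is not the libcudf or cudf implementation: the helper names `distinct_indices`, `scatter`, and `duplicated` are hypothetical stand-ins that only mirror the roles of the functions named in this PR, and keep-first semantics are assumed.

```python
def distinct_indices(rows):
    """Return the position of the first occurrence of each distinct row."""
    first_seen = {}
    for i, row in enumerate(rows):
        # setdefault only records the index the first time a row is seen.
        first_seen.setdefault(row, i)
    return list(first_seen.values())


def scatter(value, indices, target):
    """Write `value` into a copy of `target` at each position in `indices`."""
    out = list(target)
    for i in indices:
        out[i] = value
    return out


def duplicated(rows):
    """Mark every row that repeats an earlier row (keep-first assumed)."""
    # Start with every row marked as a duplicate, then clear the flag for
    # the distinct rows in a single scatter.
    all_true = [True] * len(rows)
    return scatter(False, distinct_indices(rows), all_true)


rows = [(1, "x"), (2, "y"), (1, "x"), (3, "z"), (2, "y")]
mask = duplicated(rows)
```

Only the key columns are inspected and only a boolean column is written, which is why this formulation does not need to move any other columns.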
a2c77e6 to ac1ee09
No need to allocate more memory than necessary.
092997f to f6064b5
Ready for another look @bdice, thanks!
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
LGTM 👍
/merge
Description
To obtain the duplicate rows in a dataframe we previously performed a drop-duplicates with a carrier column of row indices and then set entries in a boolean column to False for those row indices that remained. Furthermore, we were performing an unnecessary merge after the drop-duplicates call to obtain the row indices.
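The carrier-column approach described above can be sketched in plain Python as follows. This is an illustrative host-side model under stated assumptions, not the cudf code: the table is a list of dicts, `drop_duplicates` and `duplicated_via_carrier` are hypothetical helpers, `_row_index` is a made-up carrier column name, keep-first semantics are assumed, and the extra merge step is omitted.

```python
def drop_duplicates(table, key_columns):
    """Keep the first row for each distinct key, copying every column."""
    seen = set()
    kept = []
    for row in table:
        key = tuple(row[c] for c in key_columns)
        if key not in seen:
            seen.add(key)
            # The whole row is copied to the output, including columns
            # that play no part in deciding which rows are distinct.
            kept.append(dict(row))
    return kept


def duplicated_via_carrier(table, key_columns):
    """Mark duplicate rows by carrying row indices through drop_duplicates."""
    carrier = "_row_index"  # hypothetical carrier column name
    tagged = [{**row, carrier: i} for i, row in enumerate(table)]
    survivors = drop_duplicates(tagged, key_columns)
    # Every row starts as a duplicate; rows that survived are set to False.
    result = [True] * len(table)
    for row in survivors:
        result[row[carrier]] = False
    return result


table = [
    {"a": 1, "b": 10},
    {"a": 2, "b": 20},
    {"a": 1, "b": 30},
    {"a": 2, "b": 40},
]
mask_subset = duplicated_via_carrier(table, ["a"])
mask_all = duplicated_via_carrier(table, ["a", "b"])
```

In this model the cost of `drop_duplicates` grows with the total number of columns, even though only the carrier column is read afterwards.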
Note that the carrier column provides exactly the information that is computed internally in `libcudf` by `cudf::detail::get_distinct_indices` (called as part of `cudf::distinct`). We therefore promote `get_distinct_indices` to a public function (as `cudf::distinct_indices`) and replace the (unnecessary) merge plus `iloc`-based setting of the result with a call to `libcudf.copying.scatter`.

This provides a reasonable speedup (around 1.5x) for `duplicated()` on `Series`, and significantly improves performance of `duplicated()` on `DataFrames`, especially when providing a `subset` argument. Previously we would pay the cost in drop-duplicates of moving all columns of the distinct rows to the output table, even though we only actually needed the carrier "indices" column. Now we just obtain those indices directly, so `duplicated()` scales only with the number of "active" columns. In some simple benchmarking this is between two and five times faster for tables with 10% distinct rows, depending on the number of passive additional columns.

`DataFrame.duplicated` does unnecessary inner merge #14486

Checklist