Improve memory footprint of isin by using contains #14478

wence- · 2023-11-22T18:36:17Z

Description

Previously, isin was implemented using an inner join between the column we are searching (the haystack) and the values we are searching for (the needles). This had a large memory footprint when there were repeated needles (since that blows up the cardinality of the merge).
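To make the cardinality blow-up concrete, here is a minimal pure-Python sketch (not the cuDF implementation) that counts how many rows an inner join would produce. The helper name `inner_join_row_count` and the sample data are made up for illustration.

```python
from collections import Counter


def inner_join_row_count(haystack, needles):
    """Count the rows an inner join on equal values would emit (illustrative only)."""
    needle_counts = Counter(needles)
    # Every haystack entry is paired with every equal needle, so repeated
    # needles multiply the number of output rows.
    return sum(needle_counts[value] for value in haystack)


haystack = [1, 1, 1, 2]
needles = [1, 1, 1, 1, 3]

rows_with_repeats = inner_join_row_count(haystack, needles)
rows_deduplicated = inner_join_row_count(haystack, set(needles))
```

With repeats, the three `1`s in the haystack each match four needles, so the intermediate result is larger than either input. Deduplicating the needles first keeps the join output bounded by the haystack length, at the cost of an extra pass over the needles.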

To fix this, note that we don't need to do a merge at all, since libcudf provides a primitive (contains) to search for many needles in a haystack. The only thing we must bear in mind is that left.isin(right) is asking for the locations in left that match an entry in right, whereas contains(haystack, needles) provides a bool mask that selects needles that are in the haystack. To get the behaviour we want, we therefore need to do contains(right, left) and treat the values to search for as the haystack.
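The argument swap can be sketched in plain Python. This is only a model of the semantics described above, using a `set` as a stand-in for the hash-based search; the function names mirror the description and are not the libcudf or cuDF signatures.

```python
def contains(haystack, needles):
    """Return one bool per needle: True if that needle occurs in haystack."""
    lookup = set(haystack)  # built once, then probed for each needle
    return [needle in lookup for needle in needles]


def isin(left, right):
    """Return one bool per entry of left: True if it matches an entry in right."""
    # The mask must line up with `left`, so `left` plays the role of the
    # needles and `right` (the values to search for) is the haystack.
    return contains(right, left)


left = [1, 2, 3, 2, 5]
right = [2, 5, 5, 7]
mask = isin(left, right)
```

Note that the result always has the length of `left`, and repeats in `right` only affect the size of the lookup structure, not the size of the output.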

As well as having a much better memory footprint, this hash-based search is significantly faster than the previous merge-based one.

While we are here, lower the memory footprint of MultiIndex.isin by using a left-semi join (the implementation is separate from the isin implementation on columns and looks a little more complicated to unpick).
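As a rough illustration of why a left-semi join helps for multi-column keys, the sketch below models each MultiIndex entry as a tuple. It is an assumption-laden toy, not the code in this PR, and `left_semi_join_mask` is a hypothetical helper.

```python
def left_semi_join_mask(left_rows, right_rows):
    """Mark each left row that has at least one match in right_rows."""
    right_keys = set(right_rows)  # repeated right rows collapse to one key
    # A semi join reports each left row at most once, so the output size
    # never exceeds the number of left rows.
    return [row in right_keys for row in left_rows]


left_rows = [("a", 1), ("a", 2), ("b", 1)]
right_rows = [("a", 2), ("a", 2), ("c", 3)]
semi_mask = left_semi_join_mask(left_rows, right_rows)
```

Unlike an inner join, the repeated `("a", 2)` on the right does not produce an extra output row.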

Closes [BUG] OOM with isin when lhs/rhs contain repeats / spill fails #14298

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

wence- · 2023-11-22T18:46:23Z

As well as having a much better memory footprint, this hash-based search is significantly faster than the previous merge-based one.

Using the data from #14298:

Old:

In [1]: %timeit result = haystack.isin(needles.unique())
6.08 ms ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

New:

# No need for unique now either
In [1]: %timeit result = haystack.isin(needles)
123 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Happy thanksgiving @thomcom

wence- · 2023-11-22T20:19:46Z

Hmph, broken somehow...

wence- · 2023-11-23T10:56:02Z

The problem is that contains preserves the null mask of the needles being searched for. I have fixed this by discarding the returned null mask, but I am marking the PR as do-not-merge so that we can discuss whether this is the best approach.

python/cudf/cudf/core/column/column.py

The result returned from libcudf is masked by the null mask of the needles. If it has any nulls we must replace them with whether or not the haystack contains nulls to match the semantics we need for isin.
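A minimal sketch of this null fix-up, using `None` as a stand-in for a null entry. Both helpers are hypothetical models of the behaviour described in the comment, not the libcudf primitive or the code in column.py.

```python
def contains_with_needle_nulls(haystack, needles):
    """Model of a search whose result keeps the needles' null mask."""
    lookup = {value for value in haystack if value is not None}
    # A null needle yields a null result instead of True or False.
    return [None if needle is None else needle in lookup for needle in needles]


def isin_with_nulls(left, right):
    """isin semantics: a null in left matches only if right also has a null."""
    raw = contains_with_needle_nulls(right, left)
    haystack_has_nulls = any(value is None for value in right)
    # Replace each null result with whether the haystack contains nulls.
    return [haystack_has_nulls if flag is None else flag for flag in raw]


with_null_values = isin_with_nulls([1, None, 3], [3, None])
without_null_values = isin_with_nulls([1, None, 3], [3])
```

The first call treats the null in `left` as a match because `right` contains a null; the second does not.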

wence- · 2023-11-30T17:00:17Z

/merge

wence- requested a review from a team as a code owner November 22, 2023 18:36

wence- requested review from bdice and charlesbluca November 22, 2023 18:36

github-actions bot added the Python label Nov 22, 2023

wence- added the bug, non-breaking, and no-oom labels and removed the Python label Nov 22, 2023

wence- force-pushed the wence/fix/14298 branch from 09939f0 to 7848147 Compare November 22, 2023 18:37

github-actions bot added the Python label Nov 22, 2023

bdice approved these changes Nov 22, 2023

galipremsagar approved these changes Nov 22, 2023

wence- mentioned this pull request Nov 22, 2023

[ENH] Audit cudf APIs for use of inappropriate algorithms #14479

Open

wence- added the "5 - DO NOT MERGE" label Nov 23, 2023

wence- requested review from bdice and galipremsagar November 23, 2023 10:56

wence- commented Nov 24, 2023

python/cudf/cudf/core/column/column.py (outdated)

Fix null handling

884f9ce

The result returned from libcudf is masked by the null mask of the needles. If it has any nulls we must replace them with whether or not the haystack contains nulls to match the semantics we need for isin.

wence- force-pushed the wence/fix/14298 branch from cef017e to 884f9ce Compare November 28, 2023 17:23

wence- removed the "5 - DO NOT MERGE" label Nov 28, 2023

rapids-bot bot merged commit ac35d19 into rapidsai:branch-24.02 Nov 30, 2023
67 checks passed

wence- deleted the wence/fix/14298 branch November 30, 2023 17:00
