Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve memory footprint of isin by using contains #14478

Merged
merged 2 commits into from
Nov 30, 2023

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Nov 22, 2023

Description

Previously, isin was implemented using an inner join between the column we are searching (the haystack) and the values we are searching for (the needles). This had a large memory footprint when there were repeated needles (since that blows up the cardinality of the merge).

To fix this, note that we don't need to do a merge at all, since libcudf provides a primitive (contains) to search for many needles in a haystack. The only thing we must bear in mind is that left.isin(right) is asking for the locations in left that match an entry in right, whereas contains(haystack, needles) provides a bool mask that selects needles that are in the haystack. To get the behaviour we want, we therefore need to do contains(right, left) and treat the values to search for as the haystack.

As well as having a much better memory footprint, this hash-based approach search is significantly faster than the previous merge-based one.

While we are here, lower the memory footprint of MultiIndex.isin by using a left-semi join (the implementation is separate from the isin implementation on columns and looks a little more complicated to unpick).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner November 22, 2023 18:36
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 22, 2023
@wence- wence- added bug Something isn't working non-breaking Non-breaking change no-oom Reducing memory footprint of cudf algorithms and removed Python Affects Python cuDF API. labels Nov 22, 2023
Previously, isin was implemented using an inner join between the
column we are searching (the haystack) and the values we are searching
for (the needles). This had a large memory footprint when there were
repeated needles (since that blows up the cardinality of the merge).

To fix this, note that we don't need to do a merge at all, since
libcudf provides a primitive (contains) to search for many needles in
a haystack. The only thing we must bear in mind is that
left.isin(right) is asking for the locations in left that match an
entry in right, whereas contains(haystack, needles) provides a bool
mask that selects needles that are in the haystack. To get the
behaviour we want, we therefore need to do contains(right, left) and
treat the values to search for as the haystack.

As well as having a much better memory footprint, this hash-based
approach search is significantly faster than the previous merge-based
one.

While we are here, lower the memory footprint of MultiIndex.isin by
using a left-semi join (the implementation is separate from the isin
implementation on columns and looks a little more complicated to
unpick).

- Closes rapidsai#14298
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 22, 2023
@wence-
Copy link
Contributor Author

wence- commented Nov 22, 2023

As well as having a much better memory footprint, this hash-based approach search is significantly faster than the previous merge-based one

Using the data from #14298:

Old:

In [1]: %timeit result = haystack.isin(needles.unique())
6.08 ms ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

New

# No need for unique now either
In [1]: %timeit result = haystack.isin(needles)
123 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Happy thanksgiving @thomcom

@wence-
Copy link
Contributor Author

wence- commented Nov 22, 2023

Hmph, broken somehow...

@wence- wence- added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 23, 2023
@wence-
Copy link
Contributor Author

wence- commented Nov 23, 2023

The problem is that contains preserves the null mask of the needles being searched for. I fix this by dropping the returned null mask on the floor, but marking as do-not-merge so that we can discuss if this is the best approach.

The result returned from libcudf is masked by the null mask of the
needles. If it has any nulls we must replace them with
whether or not the haystack contains nulls to match the semantics we
need for isin.
@wence- wence- removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Nov 28, 2023
@wence-
Copy link
Contributor Author

wence- commented Nov 30, 2023

/merge

@rapids-bot rapids-bot bot merged commit ac35d19 into rapidsai:branch-24.02 Nov 30, 2023
67 checks passed
@wence- wence- deleted the wence/fix/14298 branch November 30, 2023 17:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working no-oom Reducing memory footprint of cudf algorithms non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] OOM with isin when lhs/rhs contain repeats / spill fails
3 participants