Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for inserting duplicates in groupby result cache #9508

Merged

Conversation

karthikeyann
Copy link
Contributor

@karthikeyann karthikeyann commented Oct 22, 2021

Fixes #9507

Prevents inserting to groupby result cache if the result for <column, aggregation> pair is already present in the cache.
Added unit test to test this.

Details:
When add_result(col1, agg1, column1); add_result(col1, agg1, column2); is called (see twice), then _cache doesn't contain any value for {col1, agg1} anymore.
Issue is in _cache std::unordered_map with std::reference_wrapper<aggregation const> in the key.

When _cache[{input, key}] = std::move(value); executes 2nd time, old key is destroyed. But the key's reference never changes which points to the destroyed key.
So, when compared again, pair_column_aggregation_equal_to fails because we are comparing a destroyed object (whose memory may have been overwritten).
#9507 (comment)

@karthikeyann karthikeyann added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond non-breaking Non-breaking change labels Oct 22, 2021
@karthikeyann karthikeyann self-assigned this Oct 22, 2021
@karthikeyann karthikeyann requested a review from a team as a code owner October 22, 2021 23:00
@karthikeyann karthikeyann added this to PR-WIP in v21.12 Release via automation Oct 22, 2021
@codecov
Copy link

codecov bot commented Oct 23, 2021

Codecov Report

Merging #9508 (96fef91) into branch-21.12 (ab4bfaa) will decrease coverage by 0.11%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9508      +/-   ##
================================================
- Coverage         10.79%   10.67%   -0.12%     
================================================
  Files               116      117       +1     
  Lines             18869    19714     +845     
================================================
+ Hits               2036     2104      +68     
- Misses            16833    17610     +777     
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/sorting.py 92.90% <0.00%> (-1.21%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
... and 63 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72694d2...96fef91. Read the comment docs.

v21.12 Release automation moved this from PR-WIP to PR-Reviewer approved Oct 23, 2021
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to read the issue description to understand why this fix works and how it works. So please add some comments in the code to clarify the problem.

@bdice
Copy link
Contributor

bdice commented Oct 23, 2021

I had to read the issue description to understand why this fix works and how it works. So please add some comments in the code to clarify the problem.

Agreed. The fix looks good but the test needs comments. I would add comments that specifically indicate which calls should exit early and do not modify the cache.

Is the design of using a reference to an aggregation in the map’s key (which requires storing a copy of the aggregation in the value for lifetime reasons) motivated primarily by performance during lookup or convenience? I saw the comment about it being designed to allow lookup by reference but I was surprised this isn’t possible in some other way that is less awkward and avoids object lifetime gymnastics. I have to think about it some more but it seems like this bug might have been avoided in the first place with a different design.

@karthikeyann
Copy link
Contributor Author

Added comments to test & code.
The problem is that keys are not updated when access by map[key]. Old key retains. replaced [] with C++17 try_emplace.

reference wrapper design was old code. Not sure of why this design decision is taken. I just changed old code to add column_view and shallow comparison to the cache.

@bdice bdice mentioned this pull request Oct 25, 2021
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@bdice
Copy link
Contributor

bdice commented Oct 26, 2021

Thanks @karthikeyann! This should be ready to merge.

@bdice bdice added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Oct 26, 2021
@jrhemstad
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d6b624c into rapidsai:branch-21.12 Oct 26, 2021
v21.12 Release automation moved this from PR-Reviewer approved to Done Oct 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[BUG] groupby result cache error in dask-cudf test
5 participants