New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite GroupedTopNBuilder with flat data structures #6072
Conversation
9ffaf75
to
f9b5a11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments for the first few commits. I haven't reviewed "Add GroupedTopNRowNumberAccumulator" or "Refactor GroupedTopNBuilder for higher performance and reduced memory"
If LongLong2Long
isn't going to be used, I suggest doing a rename to LongLongToLongBig
when that commit is introduced. We we need the non-big version later, we can get the code from git history.
presto-main/src/main/java/io/prestosql/util/LongLong2LongOpenCustomHashMap.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/util/TestLongLong2LongOpenCustomBigHashMap.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/util/TestLongLong2LongOpenCustomBigHashMap.java
Show resolved
Hide resolved
return defRetValue; | ||
} | ||
|
||
/** | ||
* {@inheritDoc} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe remove these in the original fork commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dain, are we sure we want to remove these in the original fork? The reason I have them removed here is because this is the actual commit that actually removes the interface and override (and thus the doc linkage).
presto-main/src/main/java/io/prestosql/util/Long2LongOpenBigHashMap.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/CyclingGroupByHash.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/CyclingGroupByHash.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/operator/TestCyclingGroupByHash.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/operator/TestCyclingGroupByHash.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments... I'm about halfway through the last commit.
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Outdated
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/operator/TestGroupedTopNRowNumberAccumulator.java
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/operator/TestGroupedTopNRowNumberAccumulator.java
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/operator/TestGroupedTopNRowNumberAccumulator.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNBuilder.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java
Outdated
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java
Show resolved
Hide resolved
presto-main/src/test/java/io/prestosql/operator/TestRowReferencePageManager.java
Show resolved
Hide resolved
presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java
Outdated
Show resolved
Hide resolved
ffa4b03
to
55a1e1b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. One new suggestion on adding a clarifying comment.
GitHub doesn't seem to want to let me follow up on some of the old comments:
- For
CyclingGroupByHash
I just looked and I think all of the uses are in the tests tree. If so, I would move it there. - For the bulk removal of the empty from fast utils javadocs and the unused features. I would consider doing that in a separate commit as it will make your refactorings stand out more.
Other than that, I had a follow up comment on:
int IdRegistry.add(IntFunction<T>)
- How to easily create a group id block from an array (no need to change anything here)
Finally, you have some CI test failures
presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java
Show resolved
Hide resolved
Ok everything updated, thanks @dain |
Copying the one that was already implemented for IntBigArray.
Fork a copy of fastutil Long2LongOpenCustomHashMap to a new LongLong2LongOpenCustomHashMap stub
Fork a copy of fastutil Long2LongOpenHashMap to new Long2LongOpenBigHashMap stub
Fork a copy of fastutil LongArrayFIFOQueue to new LongBigArrayFIFOQueue stub
- Update benchmark to actually run over multiple groups - Reduce TopN to most common sizes - Fix benchmark scoping so that it doesn't bleed the same instance across iterations (GroupedTopNBuilder is only supposed to be used for add/drain pass per instance).
This is a flat data structure for managing multiple top N row number group calculations that doesn't require per-row or per-group object allocations. The main idea is that a heap (while classically represented as an array), can be also represented as a tree with node pointers. These nodes (even across groups) can be efficiently compacted into a single data structure.
- Introduce RowReferencePageManager to handle the generation of stable row IDs across compaction events - Refactor GroupedTopNBuilder to use RowReferencePageManager as well as new GroupedTopNRowNumberAccumulator Improved characteristics - Vastly improved GC characteristics: Negligible object allocations regardless of row count or group count - Performance benchmarks upwards of 4x performance improvements when working with large numbers of groups, and parity with the existing solution in the worst case - Requires up to 20% less memory than the current solution when there are many groups, but does have a increased constant memory overhead when dealing with tiny data sets.
The structure has a lot of invariants that can be easily verified, and it is quite easy for developers to make mistakes when modifying this code.
GroupedTopNBuilder would previously generate tons of objects for each row and group. By rewriting this entirely with flat data structures we can get massive GC improvements, as well as performance and memory benefits.
Compared to the preexisting solution, this new implementation shows the following characteristics:
Additionally, this is setup in preparation for #1073, which will share similar structures for rank and dense rank implementations.