Rewrite GroupedTopNBuilder with flat data structures #6072

erichwang · 2020-11-24T12:20:05Z

GroupedTopNBuilder would previously generate tons of objects for each row and group. By rewriting this entirely with flat data structures we can get massive GC improvements, as well as performance and memory benefits.

Compared to the preexisting solution, this new implementation shows the following characteristics:

Vastly improved GC characteristics: negligible object allocations regardless of row count or group count.
Benchmarks show upwards of 4x performance improvements when working with large numbers of groups, and no worse than parity with the existing solution in the worst case.
Requires up to 20% less memory than the current solution when there are many groups, but does have increased constant memory overhead when dealing with tiny data sets.

Additionally, this is set up in preparation for #1073, which will share similar structures for rank and dense rank implementations.

dain

Comments for the first few commits. I haven't reviewed "Add GroupedTopNRowNumberAccumulator" or "Refactor GroupedTopNBuilder for higher performance and reduced memory"

If LongLong2Long isn't going to be used, I suggest doing a rename to LongLongToLongBig when that commit is introduced. If we need the non-big version later, we can get the code from git history.

presto-array/src/main/java/io/prestosql/array/BigArrays.java

presto-main/src/main/java/io/prestosql/util/LongLong2LongOpenCustomHashMap.java

presto-main/src/test/java/io/prestosql/util/TestLongLong2LongOpenCustomBigHashMap.java

dain · 2020-12-01T19:43:14Z

presto-main/src/main/java/io/prestosql/util/LongLong2LongOpenCustomHashMap.java

        return defRetValue;
    }

-    /**
-     * {@inheritDoc}


Maybe remove these in the original fork commit

erichwang · 2020-12-01

@dain, are we sure we want to remove these in the original fork? I removed them here because this is the commit that actually removes the interface and override (and thus the doc linkage).

presto-main/src/main/java/io/prestosql/util/Long2LongOpenBigHashMap.java

presto-main/src/main/java/io/prestosql/operator/CyclingGroupByHash.java

presto-main/src/test/java/io/prestosql/operator/TestCyclingGroupByHash.java

dain

Some more comments... I'm about halfway through the last commit.

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java

presto-main/src/test/java/io/prestosql/operator/TestGroupedTopNRowNumberAccumulator.java

presto-main/src/main/java/io/prestosql/operator/GroupedTopNBuilder.java

presto-main/src/main/java/io/prestosql/operator/IdRegistry.java

presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java

presto-main/src/main/java/io/prestosql/operator/GroupedTopNRowNumberAccumulator.java

presto-main/src/test/java/io/prestosql/operator/TestRowReferencePageManager.java

presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java

dain

Looking good. One new suggestion on adding a clarifying comment.

GitHub doesn't seem to want to let me follow up on some of the old comments:

For CyclingGroupByHash I just looked and I think all of the uses are in the tests tree. If so, I would move it there.
For the bulk removal of the empty javadocs from fastutil and the unused features, I would consider doing that in a separate commit, as it will make your refactorings stand out more.

Other than that, I had a follow up comment on:

int IdRegistry.add(IntFunction<T>)
How to easily create a group id block from an array (no need to change anything here)

Finally, you have some CI test failures

presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java

erichwang · 2020-12-08T03:40:28Z

Ok everything updated, thanks @dain

- Update benchmark to actually run over multiple groups
- Reduce TopN to most common sizes
- Fix benchmark scoping so that it doesn't bleed the same instance across iterations (GroupedTopNBuilder is only supposed to be used for a single add/drain pass per instance)

This is a flat data structure for managing multiple top N row number group calculations that doesn't require per-row or per-group object allocations. The main idea is that a heap, while classically represented as an array, can also be represented as a tree with node pointers. These nodes (even across groups) can be efficiently compacted into a single data structure.
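To illustrate the idea, here is a minimal sketch (not the code from this PR) of several max-heaps kept as pointer-linked nodes inside shared primitive arrays. The class name `FlatGroupedHeaps`, its fields, and the push-down insertion strategy are assumptions made for this example; the real accumulator is more involved.

```java
import java.util.Arrays;

// Illustrative sketch only: names and layout are assumptions, not the PR's code.
final class FlatGroupedHeaps
{
    private static final int NULL = -1;

    // Node storage shared by every group: one slot per heap node.
    private long[] values = new long[16];
    private int[] left = new int[16];
    private int[] right = new int[16];
    private int nodeCount;

    // Per-group state, indexed by group id.
    private final int[] root;
    private final int[] size;

    FlatGroupedHeaps(int groupCount)
    {
        root = new int[groupCount];
        size = new int[groupCount];
        Arrays.fill(root, NULL);
    }

    // Adds a value to the group's max-heap; only array slots are written
    // (apart from occasional array growth), no per-row objects are created.
    void add(int groupId, long value)
    {
        int node = allocateNode();
        int newSize = ++size[groupId];
        if (newSize == 1) {
            values[node] = value;
            root[groupId] = node;
            return;
        }
        // The bits of newSize below its leading 1 spell the path from the
        // root to the new leaf position (0 = left, 1 = right).
        int current = root[groupId];
        long carried = value;
        for (int bit = Integer.highestOneBit(newSize) >> 1; bit > 0; bit >>= 1) {
            // Keep the larger value at this node and carry the smaller one down.
            if (carried > values[current]) {
                long swap = values[current];
                values[current] = carried;
                carried = swap;
            }
            boolean goRight = (newSize & bit) != 0;
            if (bit == 1) {
                // Last step: link the freshly allocated leaf.
                if (goRight) {
                    right[current] = node;
                }
                else {
                    left[current] = node;
                }
            }
            else {
                current = goRight ? right[current] : left[current];
            }
        }
        values[node] = carried;
    }

    long peekMax(int groupId)
    {
        if (root[groupId] == NULL) {
            throw new IllegalStateException("group is empty");
        }
        return values[root[groupId]];
    }

    int size(int groupId)
    {
        return size[groupId];
    }

    private int allocateNode()
    {
        if (nodeCount == values.length) {
            int newCapacity = values.length * 2;
            values = Arrays.copyOf(values, newCapacity);
            left = Arrays.copyOf(left, newCapacity);
            right = Arrays.copyOf(right, newCapacity);
        }
        left[nodeCount] = NULL;
        right[nodeCount] = NULL;
        return nodeCount++;
    }

    public static void main(String[] args)
    {
        FlatGroupedHeaps heaps = new FlatGroupedHeaps(2);
        for (long v : new long[] {5, 9, 3, 7}) {
            heaps.add(0, v);
        }
        heaps.add(1, 2);
        System.out.println(heaps.peekMax(0) + " " + heaps.peekMax(1)); // prints "9 2"
    }
}
```

Nodes from different groups sit side by side in the same arrays, so the number of live Java objects stays constant no matter how many rows or groups are added.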

- Introduce RowReferencePageManager to handle the generation of stable row IDs across compaction events
- Refactor GroupedTopNBuilder to use RowReferencePageManager as well as the new GroupedTopNRowNumberAccumulator

Improved characteristics:
- Vastly improved GC characteristics: negligible object allocations regardless of row count or group count
- Benchmarks show upwards of 4x performance improvements when working with large numbers of groups, and parity with the existing solution in the worst case
- Requires up to 20% less memory than the current solution when there are many groups, but does have an increased constant memory overhead when dealing with tiny data sets
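As a rough illustration of what "stable row IDs across compaction" can mean, the sketch below keeps a level of indirection from an ID to the row's current (page, position) location. It is not the RowReferencePageManager implementation: `StableRowIds` and all of its methods are hypothetical names chosen for this example.

```java
import java.util.Arrays;

// Illustrative sketch only: a hypothetical id table, not RowReferencePageManager itself.
final class StableRowIds
{
    // Current location of each row, indexed by row id.
    private int[] pageOf = new int[8];
    private int[] positionOf = new int[8];

    // Stack of released ids that can be handed out again.
    private int[] freeIds = new int[8];
    private int freeCount;
    private int nextId;

    // Registers a row and returns an id that stays valid until release().
    int allocate(int page, int position)
    {
        int id = (freeCount > 0) ? freeIds[--freeCount] : nextId++;
        if (id >= pageOf.length) {
            pageOf = Arrays.copyOf(pageOf, pageOf.length * 2);
            positionOf = Arrays.copyOf(positionOf, positionOf.length * 2);
        }
        pageOf[id] = page;
        positionOf[id] = position;
        return id;
    }

    // Called when compaction moves the row; the id itself does not change.
    void relocate(int id, int newPage, int newPosition)
    {
        pageOf[id] = newPage;
        positionOf[id] = newPosition;
    }

    // Returns the id to the free list once the row is no longer referenced.
    void release(int id)
    {
        if (freeCount == freeIds.length) {
            freeIds = Arrays.copyOf(freeIds, freeIds.length * 2);
        }
        freeIds[freeCount++] = id;
    }

    int page(int id)
    {
        return pageOf[id];
    }

    int position(int id)
    {
        return positionOf[id];
    }

    public static void main(String[] args)
    {
        StableRowIds ids = new StableRowIds();
        int row = ids.allocate(0, 5);
        ids.relocate(row, 3, 0); // compaction moved the row to page 3, position 0
        System.out.println(row + " -> page " + ids.page(row) + ", position " + ids.position(row));
    }
}
```

Structures that hold only the ID, such as heap nodes, never need to be rewritten when rows are moved; only the table entry changes.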

erichwang force-pushed the topnrownum branch 3 times, most recently from 9ffaf75 to f9b5a11 Compare December 1, 2020 00:01

dain requested review from sopel39 and dain December 1, 2020 18:01

erichwang force-pushed the topnrownum branch from f9b5a11 to 73b78a6 Compare December 1, 2020 19:07

trinodb deleted a comment from cla-bot bot Dec 1, 2020

erichwang force-pushed the topnrownum branch from 73b78a6 to 2dc3425 Compare December 1, 2020 19:18

dain reviewed Dec 1, 2020

erichwang force-pushed the topnrownum branch from 2dc3425 to cf623a5 Compare December 2, 2020 06:25

trinodb deleted a comment from cla-bot bot Dec 3, 2020

dain reviewed Dec 3, 2020

dain reviewed Dec 4, 2020

erichwang force-pushed the topnrownum branch from cf623a5 to 1d882be Compare December 6, 2020 01:04

cla-bot bot added the cla-signed label Dec 6, 2020

erichwang force-pushed the topnrownum branch 2 times, most recently from ffa4b03 to 55a1e1b Compare December 6, 2020 03:08

dain self-requested a review December 7, 2020 18:33

trinodb deleted a comment from cla-bot bot Dec 7, 2020

dain reviewed Dec 7, 2020

presto-main/src/main/java/io/prestosql/operator/RowReferencePageManager.java

erichwang force-pushed the topnrownum branch from 55a1e1b to ee58037 Compare December 8, 2020 03:39

erichwang added 4 commits December 7, 2020 19:40

Add fill method to BigArray collections

b5d1e1b

Copying the one that was already implemented for IntBigArray.

Add copyTo methods to presto BigArray collections

2024120

Make BigArrays.SEGMENT_SIZE publicly accessible for optimizations

bd8a7ca

Fork a local copy of fastutil Long2LongOpenCustomHashMap

0fddc95

Fork a copy of fastutil Long2LongOpenCustomHashMap to a new LongLong2LongOpenCustomHashMap stub

erichwang added 10 commits December 7, 2020 19:40

Properly implement LongLong2LongOpenCustomHashMap within stub

41da377

Rename LongLong2LongOpenCustomHashMap into new big variant stub

83847a0

Properly implement LongLong2LongOpenCustomBigHashMap within stub

3774ff6

Fork a local copy of fastutil Long2LongOpenHashMap

9829fbb

Fork a copy of fastutil Long2LongOpenHashMap to new Long2LongOpenBigHashMap stub

Properly implement Long2LongOpenBigHashMap within stub

7544842

Fork a local copy of fastutil LongArrayFIFOQueue

734b99d

Fork a copy of fastutil LongArrayFIFOQueue to new LongBigArrayFIFOQueue stub

Properly implement LongBigArrayFIFOQueue within stub

cc8d875

Add heap data structure utility for locating path from root to a node

126b156
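For readers unfamiliar with the technique, one common way to locate the path from the root to a node in a complete binary heap is to read the bits of the node's 1-based breadth-first index. The sketch below shows that approach; `HeapPaths.pathFromRoot` is a hypothetical helper written for this example, and the utility added in this commit may be shaped differently.

```java
import java.util.Arrays;

// Illustrative sketch only: a hypothetical helper, not the utility from this commit.
final class HeapPaths
{
    private HeapPaths() {}

    /**
     * Returns the turns to take from the root to reach the node with the
     * given 1-based breadth-first index: false = left child, true = right child.
     */
    static boolean[] pathFromRoot(int nodeIndex)
    {
        if (nodeIndex < 1) {
            throw new IllegalArgumentException("nodeIndex must be >= 1");
        }
        // Depth of the node equals the position of the highest set bit.
        int depth = 31 - Integer.numberOfLeadingZeros(nodeIndex);
        boolean[] turns = new boolean[depth];
        for (int i = 0; i < depth; i++) {
            // Bits below the leading 1, read from most to least significant,
            // are the successive left/right decisions.
            turns[i] = ((nodeIndex >> (depth - 1 - i)) & 1) == 1;
        }
        return turns;
    }

    public static void main(String[] args)
    {
        // Node 6 is binary 110: right from the root, then left.
        System.out.println(Arrays.toString(pathFromRoot(6))); // prints "[true, false]"
    }
}
```

This is what makes a pointer-based heap workable: the insertion point for the next element can be found from the heap size alone, without parent pointers or an array layout.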

erichwang force-pushed the topnrownum branch from ee58037 to e857197 Compare December 8, 2020 03:44

erichwang force-pushed the topnrownum branch from e857197 to 6210036 Compare December 8, 2020 04:07

Add structural sanity check method to GroupedTopNRowNumberAccumulator

622be17

The structure has a lot of invariants that can be easily verified, and it is quite easy for developers to make mistakes when modifying this code.
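A sanity check of this kind might look roughly like the following sketch, which walks one pointer-linked max-heap and verifies heap order and node count. The array layout, the `-1` null marker, and the name `HeapInvariantCheck.verifyMaxHeap` are assumptions for illustration, not the method added to GroupedTopNRowNumberAccumulator.

```java
// Illustrative sketch only: a hypothetical invariant check over a flat, pointer-linked heap.
final class HeapInvariantCheck
{
    static final int NULL = -1;

    private HeapInvariantCheck() {}

    // Throws IllegalStateException if the heap rooted at 'root' violates
    // max-heap order or does not contain exactly 'expectedSize' nodes.
    static void verifyMaxHeap(long[] values, int[] left, int[] right, int root, int expectedSize)
    {
        int[] stack = new int[Math.max(expectedSize, 1)];
        int top = 0;
        int visited = 0;
        if (root != NULL) {
            stack[top++] = root;
        }
        while (top > 0) {
            int node = stack[--top];
            visited++;
            if (visited > expectedSize) {
                throw new IllegalStateException("more reachable nodes than the recorded size (or a cycle)");
            }
            int[] children = {left[node], right[node]};
            for (int child : children) {
                if (child == NULL) {
                    continue;
                }
                if (values[child] > values[node]) {
                    throw new IllegalStateException("child exceeds parent at node " + node);
                }
                if (top == stack.length) {
                    throw new IllegalStateException("more reachable nodes than the recorded size");
                }
                stack[top++] = child;
            }
        }
        if (visited != expectedSize) {
            throw new IllegalStateException("expected " + expectedSize + " nodes but found " + visited);
        }
    }

    public static void main(String[] args)
    {
        // Heap: 9 at the root, children 7 and 3, and 5 below 7.
        long[] values = {9, 7, 3, 5};
        int[] left = {1, 3, NULL, NULL};
        int[] right = {2, NULL, NULL, NULL};
        verifyMaxHeap(values, left, right, 0, 4);
        System.out.println("heap is consistent");
    }
}
```

Calling such a check from tests after every mutation catches broken links or ordering mistakes close to where they are introduced.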

erichwang force-pushed the topnrownum branch from b0d8310 to 622be17 Compare December 9, 2020 03:56

dain approved these changes Dec 9, 2020

dain merged commit ad41446 into trinodb:master Dec 9, 2020

erichwang deleted the topnrownum branch December 9, 2020 23:22

dain mentioned this pull request Dec 12, 2020

Release notes for 348 #6100

Closed

9 tasks
