Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite GroupedTopNBuilder with flat data structures #6072

Merged
merged 16 commits into from Dec 9, 2020

Conversation

erichwang
Copy link
Contributor

GroupedTopNBuilder would previously generate tons of objects for each row and group. By rewriting this entirely with flat data structures we can get massive GC improvements, as well as performance and memory benefits.

Compared to the preexisting solution, this new implementation shows the following characteristics:

  • Vastly improved GC characteristics: negligible object allocations regardless of row count or group count.
  • Performance benchmarks upwards of 4x performance improvements when working with large numbers of groups, and no worse than parity with the existing solution in the worst case.
  • Requires up to 20% less memory than the current solution when there are many groups, but does have increased constant memory overhead when dealing with tiny data sets.

Additionally, this is setup in preparation for #1073, which will share similar structures for rank and dense rank implementations.

@erichwang erichwang force-pushed the topnrownum branch 3 times, most recently from 9ffaf75 to f9b5a11 Compare December 1, 2020 00:01
@dain dain requested review from sopel39 and dain December 1, 2020 18:01
@trinodb trinodb deleted a comment from cla-bot bot Dec 1, 2020
@trinodb trinodb deleted a comment from cla-bot bot Dec 1, 2020
@trinodb trinodb deleted a comment from cla-bot bot Dec 1, 2020
@trinodb trinodb deleted a comment from cla-bot bot Dec 1, 2020
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments for the first few commits. I haven't reviewed "Add GroupedTopNRowNumberAccumulator" or "Refactor GroupedTopNBuilder for higher performance and reduced memory"

If LongLong2Long isn't going to be used, I suggest doing a rename to LongLongToLongBig when that commit is introduced. We we need the non-big version later, we can get the code from git history.

return defRetValue;
}

/**
* {@inheritDoc}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove these in the original fork commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dain, are we sure we want to remove these in the original fork? The reason I have them removed here is because this is the actual commit that actually removes the interface and override (and thus the doc linkage).

@trinodb trinodb deleted a comment from cla-bot bot Dec 3, 2020
@trinodb trinodb deleted a comment from cla-bot bot Dec 3, 2020
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments... I'm about halfway through the last commit.

@cla-bot cla-bot bot added the cla-signed label Dec 6, 2020
@erichwang erichwang force-pushed the topnrownum branch 2 times, most recently from ffa4b03 to 55a1e1b Compare December 6, 2020 03:08
@dain dain self-requested a review December 7, 2020 18:33
@trinodb trinodb deleted a comment from cla-bot bot Dec 7, 2020
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. One new suggestion on adding a clarifying comment.

GitHub doesn't seem to want to let me follow up on some of the old comments:

  • For CyclingGroupByHash I just looked and I think all of the uses are in the tests tree. If so, I would move it there.
  • For the bulk removal of the empty from fast utils javadocs and the unused features. I would consider doing that in a separate commit as it will make your refactorings stand out more.

Other than that, I had a follow up comment on:

  • int IdRegistry.add(IntFunction<T>)
  • How to easily create a group id block from an array (no need to change anything here)

Finally, you have some CI test failures

@erichwang
Copy link
Contributor Author

Ok everything updated, thanks @dain

Copying the one that was already implemented for IntBigArray.
Fork a copy of fastutil Long2LongOpenCustomHashMap to a new
LongLong2LongOpenCustomHashMap stub
Fork a copy of fastutil Long2LongOpenHashMap to new
Long2LongOpenBigHashMap stub
Fork a copy of fastutil LongArrayFIFOQueue to new LongBigArrayFIFOQueue
stub
- Update benchmark to actually run over multiple groups
- Reduce TopN to most common sizes
- Fix benchmark scoping so that it doesn't bleed the same instance
across iterations (GroupedTopNBuilder is only supposed to be used for
add/drain pass per instance).
This is a flat data structure for managing multiple top N row number
group calculations that doesn't require per-row or per-group object
allocations.  The main idea is that a heap (while classically
represented as an array), can be also represented as a tree with node
pointers. These nodes (even across groups) can be efficiently compacted
into a single data structure.
- Introduce RowReferencePageManager to handle the generation of stable
row IDs across compaction events
- Refactor GroupedTopNBuilder to use
RowReferencePageManager as well as new GroupedTopNRowNumberAccumulator

Improved characteristics
- Vastly improved GC characteristics:
Negligible object allocations regardless of row count or group count
- Performance benchmarks upwards of 4x performance improvements when
working with large numbers of groups, and parity with the existing
solution in the worst case
- Requires up to 20% less memory than the current solution when there
are many groups, but does have a increased constant memory overhead when
dealing with tiny data sets.
The structure has a lot of invariants that can be easily verified, and
it is quite easy for developers to make mistakes when modifying this
code.
@dain dain merged commit ad41446 into trinodb:master Dec 9, 2020
@erichwang erichwang deleted the topnrownum branch December 9, 2020 23:22
@dain dain mentioned this pull request Dec 12, 2020
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

None yet

2 participants