[BEAM-5195] TopPerKey support #22

Merged
VaclavPlajt merged 1 commit into dsl-euphoria from vasek/support-top-per-key on Aug 24, 2018

VaclavPlajt

The TopPerKey decomposition now uses ReduceByKey (RBK) instead of ReduceStateByKey (RSBK). The documentation was also updated in VaclavPlajt/beam-site#2.
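To illustrate why a plain reduce-by-key is enough for a single top element per key, here is a minimal sketch in plain Java. It does not use the Euphoria or Beam API; the `Scored` type, the `topPerKey` helper, and the tie-breaking rule are all assumptions made for illustration. The point is only that "keep the higher-scored element" is an associative pairwise combiner, so no per-key state beyond the current best is needed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class TopPerKeySketch {

  /** Hypothetical (key, value, score) triple, used only for this illustration. */
  static final class Scored {
    final String key;
    final String value;
    final long score;

    Scored(String key, String value, long score) {
      this.key = key;
      this.value = value;
      this.score = score;
    }
  }

  /** Pairwise combiner: keeps the higher-scored element; on ties, the first one seen. */
  static final BinaryOperator<Scored> MAX_BY_SCORE =
      (a, b) -> b.score > a.score ? b : a;

  /** Reduces the input per key with MAX_BY_SCORE, leaving one element per key. */
  static Map<String, Scored> topPerKey(List<Scored> input) {
    Map<String, Scored> best = new HashMap<>();
    for (Scored s : input) {
      // merge() stores s if the key is new, otherwise combines old and new.
      best.merge(s.key, s, MAX_BY_SCORE);
    }
    return best;
  }

  public static void main(String[] args) {
    List<Scored> input = Arrays.asList(
        new Scored("a", "x", 1L),
        new Scored("a", "y", 5L),
        new Scored("b", "z", 3L));
    for (Scored s : topPerKey(input).values()) {
      System.out.println(s.key + " -> " + s.value + " (" + s.score + ")");
    }
  }
}
```

Because the combiner only ever compares two elements, partial results computed on different partitions could be merged with the same function, which is what makes a reduce-style decomposition attractive here.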


je-ik commented Aug 23, 2018

Is this operator really limited to a single top value? If so, should we extend it to support a configurable number of top elements in this refactoring? I would say we could do it, because it is quite trivial. We should probably also add more complex statistics operators (e.g. DataSummary), which would use TDigest to calculate overall statistics over the data. That would probably be for a separate issue.

@VaclavPlajt VaclavPlajt requested a review from je-ik August 23, 2018 09:20
@VaclavPlajt
Author

Yes, TopPerKey really outputs a single element per key. Both ideas expressed by @je-ik are compelling to me. TopNPerKey is likely a useful operator (we may name it differently, of course). And having the ability to compute TDigests over datasets may open a way for guided optimizations in the future.
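For reference, one way the top-N-per-key idea discussed above could be sketched in plain Java is with a bounded min-heap per key. This is not an existing Euphoria operator; the class name, the `topNPerKey` helper, and the use of `Long` scores are hypothetical and chosen only to keep the example self-contained.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopNPerKeySketch {

  /** Returns, for each key, its n largest values in descending order. */
  static Map<String, List<Long>> topNPerKey(List<Map.Entry<String, Long>> input, int n) {
    Map<String, PriorityQueue<Long>> heaps = new HashMap<>();
    for (Map.Entry<String, Long> e : input) {
      // Min-heap per key: the smallest retained value sits at the head.
      PriorityQueue<Long> heap =
          heaps.computeIfAbsent(e.getKey(), k -> new PriorityQueue<>());
      heap.add(e.getValue());
      if (heap.size() > n) {
        heap.poll(); // evict the smallest so at most n values remain
      }
    }
    Map<String, List<Long>> result = new HashMap<>();
    for (Map.Entry<String, PriorityQueue<Long>> e : heaps.entrySet()) {
      List<Long> sorted = new ArrayList<>(e.getValue());
      sorted.sort(Comparator.reverseOrder());
      result.put(e.getKey(), sorted);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Long>> input = Arrays.asList(
        new AbstractMap.SimpleEntry<>("a", 1L),
        new AbstractMap.SimpleEntry<>("a", 7L),
        new AbstractMap.SimpleEntry<>("a", 4L),
        new AbstractMap.SimpleEntry<>("b", 2L));
    System.out.println(topNPerKey(input, 2));
  }
}
```

With n set to 1 this degenerates to the single-element behaviour TopPerKey has today, which suggests the generalization would not need a different decomposition.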

@VaclavPlajt VaclavPlajt merged commit 16ed929 into dsl-euphoria Aug 24, 2018
@VaclavPlajt VaclavPlajt deleted the vasek/support-top-per-key branch August 24, 2018 12:42