CMS computes heavy hitters incorrectly when combining CMS instances with ++ #353
Comments
I'm fairly sure you're correct on both issues. Your fix seems to be what SketchMap already does (yet another reason that putting these together would be good); see SketchMapParams#updatedHeavyHitters etc.:
But I believe @johnynek would be talking about this case (N=1): (AAABBBB)(CCC) -> definitely B
Thanks for your response, Joe. Slightly off-topic: As a follow-up on your SketchMap code link, is there a specific reason that in the current CMS code [...]? I am asking because, if I understand correctly, the CMS code already sorts heavy hitters behind the scenes because of its use of [...]:
    case class CMSInstance(...) {
      def heavyHitters: Set[Long] = hhs.items
    }
    case class HeavyHitters(hhs: SortedSet[HeavyHitter] = SortedSet[HeavyHitter]()(HeavyHitter.ordering)) {
      def items: Set[Long] = hhs.map { _.item } // <<< this `map` actually returns a SortedSet
    }
    object HeavyHitter {
      val ordering = Ordering.by { hh: HeavyHitter => (hh.count, hh.item) }
    }
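To make the effect of that ordering concrete, here is a minimal standalone sketch. The `HeavyHitter` class below is a simplified stand-in modelled on the snippet above (an assumption, not Algebird's actual class), and `HeavyHittersUnionSketch`, `leftHhs`, `rightHhs`, and `merged` are illustrative names. Because the ordering compares `(count, item)`, two entries for the same item with different counts are distinct set elements, so a plain `++` keeps both of them.

```scala
import scala.collection.immutable.SortedSet

object HeavyHittersUnionSketch {
  // Simplified stand-in for illustration only; not Algebird's actual class.
  final case class HeavyHitter(item: Long, count: Long)

  object HeavyHitter {
    // Same shape as the quoted ordering: compare by count first, then by item.
    val ordering: Ordering[HeavyHitter] =
      Ordering.by((hh: HeavyHitter) => (hh.count, hh.item))
  }

  val leftHhs: SortedSet[HeavyHitter] =
    SortedSet(HeavyHitter(20L, 1L))(HeavyHitter.ordering)
  val rightHhs: SortedSet[HeavyHitter] =
    SortedSet(HeavyHitter(20L, 2L))(HeavyHitter.ordering)

  // Plain set union: item 20 now appears twice, once per distinct count.
  val combinedHhs = leftHhs ++ rightHhs

  // Mapping to items collapses the duplicates, so the two cardinalities differ.
  val items = combinedHhs.map(_.item)

  // Merging entries per item by summing their counts yields a single entry.
  val merged: Set[HeavyHitter] =
    combinedHhs.toSeq
      .groupBy(_.item)
      .map { case (item, hhs) => HeavyHitter(item, hhs.map(_.count).sum) }
      .toSet

  def main(args: Array[String]): Unit = {
    println(combinedHhs.size) // 2
    println(items.size)       // 1
    println(merged)           // Set(HeavyHitter(20,3))
  }
}
```

Summing per item only repairs the duplicate entries; it does not recover counts for items that one side saw but did not keep as a heavy hitter.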
Updated original post to rename GH of @johnynek from "posco" to "johnynek". Why can't you use consistent names across systems, eh? ;-)
Yes, exactly. This example is similar to what I had in mind when I was talking about merging being subject to "left-biased" mono-culture when folding left-to-right, i.e. the top-N CMS instances on the left get preferential treatment over instances on the right simply because they are being merged earlier -- a crude analogy would be preferential attachment from network theory.
My example: (A)(A)(B)(B)(B) when folded left-to-right => A (after the first merge A has count 2, meaning it would always win against the individual B=1's).
Here, the correct behavior would be to always return B. In the example above, however, sometimes we incorrectly return A, and in the case where we do return B it is kinda by chance. ;-)
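The (A)(A)(B)(B)(B) example can be replayed with a deliberately simplified model. This is a sketch under stated assumptions, not Algebird's or SketchMap's code: a hypothetical `Summary` keeps only its top-N `(item, count)` pairs and nothing else, and the hypothetical `merge` sums the retained counts and truncates to N again, which is the behavior the comment describes. With N=1, the result then depends on how the merges are grouped.

```scala
object TopNMergeSketch {
  // Hypothetical top-N summary: item -> retained count, at most n entries.
  type Summary = Map[String, Long]

  // Sum the retained counts per item, then keep only the n largest
  // (ties broken by item name so the result is deterministic).
  def merge(l: Summary, r: Summary, n: Int): Summary = {
    val summed = (l.keySet ++ r.keySet).toSeq.map { k =>
      k -> (l.getOrElse(k, 0L) + r.getOrElse(k, 0L))
    }
    summed.sortBy { case (item, count) => (-count, item) }.take(n).toMap
  }

  def top1(l: Summary, r: Summary): Summary = merge(l, r, 1)

  val a: Summary = Map("A" -> 1L)
  val b: Summary = Map("B" -> 1L)

  // ((((A)(A))(B))(B))(B): A reaches count 2 first and then beats every single B.
  val leftFold: Summary = Seq(a, a, b, b, b).reduceLeft(top1)

  // ((A)(A)) merged with ((B)(B)(B)): B arrives with count 3 and wins.
  val grouped: Summary = top1(top1(a, a), top1(top1(b, b), b))

  def main(args: Array[String]): Unit = {
    println(leftFold) // Map(A -> 2)
    println(grouped)  // Map(B -> 3)
  }
}
```

Both expressions merge the same five inputs in the same order; only the grouping differs, which is what it means for the operation not to be associative.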
I documented the known misbehavior of top-N CMS in a unit test example:
I think for the very same reason SketchMap will not correctly merge heavy hitters either. The relevant code (see source) is:
    hitters.sorted(specificOrdering).take(heavyHittersCount).toList
The problem is not the code itself, but that merging top-N based heavy hitters is not an associative operation (the [...]
I don't know exactly, but it's no doubt related to the use of [...]
But yes, the top-N associativity issue is a fundamental limitation directly related to the [...]
Addendum as FYI for future readers of this issue: The top-N limitation only applies when adding CMS instances via [...]
This is a follow-up on a conversation I had with @johnynek and @avibryant on Twitter. Many thanks to both of you for your help on narrowing this down. And please do let me know if I'm mistaken with what I'm writing here.
From what I understand today, I think I was partly right and partly wrong in my Twitter comments from yesterday:
In summary:
Background
As part of my refactoring work in GH-345 I decoupled the counting/querying functionality of CMS from the heavy hitters functionality (as per @jnievelt's suggestion), and once I reached that point I could easily provide both top-% and top-N CMS variants (having a top-N CMS variant brings us one step closer to eventually retiring SketchMap). While testing the top-N variant I first discovered the heavy hitters issue I describe here.
Problem statement
In the discussion below I assume the two CMS instances we want to merge are called `left` and `right`, respectively.
The computation to combine two sets of heavy hitters is not correct right now when you use `CMSInstance#++(other: CMS): CMS` for combining two CMS instances. The problem itself is two-fold, though both aspects are related and can be fixed in one go.
First, the current way to combine the "left" and "right" heavy hitters is a simple call to `++` of the `HeavyHitters` case class (same `++` name, but different class; see code) -- or, alternatively speaking, we don't post-process the combined heavy hitters to merge entries for the same item.
Here, heavy hitters elements for the same item are not summed/merged. The effect is that items that are elements of the intersection of left's and right's heavy hitters are not represented correctly.
Example REPL output:
As you can see in this example, the item `20L` is actually represented twice in the combined heavy hitters `combinedHhs` -- `HeavyHitter(20L, 1)` and `HeavyHitter(20L, 2)` -- though it should be represented only once as a single `HeavyHitter(20L, 3)`.
I think in summary the first problem means:
a) Items that are in the left+right HH intersection will have lower effective counts than they should (`max(individual counts)` instead of `sum(individual counts)`), which means they are less likely to stay in the post-merge set of heavy hitters -- the opposite of what one would intuitively expect for such intersection items.
b) The cardinality of the heavy hitters is `>=` the cardinality of the items in the heavy hitters -- because we have duplicate HH entries for some items -- which means the memory footprint is larger than it needs to be (at least until one or more subsequent "purging/dropping" operations are run on the heavy hitters), though this additional memory footprint may be of little concern in practice at the moment. If we had a truly parameterized `CMS[K, V]` at some point in the future, e.g. with `V=HyperLogLog` like Avi suggested, the additional memory footprint might become an issue.
This first problem could be fixed by properly merging the left and right sets (think: `groupBy(_.item)`). But fixing only this two-set merge is not sufficient, because there's a second issue.
Second, items that are in the heavy hitters of only one of left or right ("left.heavyHitters and not right.heavyHitters", "right.heavyHitters and not left.heavyHitters") may not be represented correctly. For example, it may happen that item `A` is in left's heavy hitters and not in right's BUT it was seen by right and thus counted (`A` simply didn't make it into right's heavy hitters). Here, in order to represent `A` correctly when combining left and right, we do not want to forget right's count of `A` when doing the merge.
Unit test that demonstrates the problem
You can see that problem by adding the following unit test to CountMinSketchTest.scala:
Here, we would expect that the final heavy hitters in the "merged" `aggregated` CMS are identical to the heavy hitters in the `single` CMS. However, in the current CMS implementation they are not.
How to fix
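The basic algorithm to correctly combine two sets of heavy hitters is as follows. We assume the two corresponding CMSs are called `left` and `right`, respectively.
1. Build the set of heavy hitter candidates as the union of left's and right's heavy hitters.
2. Combine the counting tables via `left.cms ++ right.cms`, where the `cms` field is a CMS that can only count and query.
3. Determine the new heavy hitters by querying each candidate from step 1 against the combined counting table from step 2.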
Proof of concept fix for the current CMS implementation:
Please note that the purpose of the patch above is to let me demonstrate the idea of the fix in a simple, short diff. A proper, cleaner variant of the fix would touch multiple places in the CMS code. (For instance, such a clean fix is already part of the GH-345 code.)
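Independently of the patch, the merge approach from the "How to fix" section (union the candidates, combine the counts, re-query the candidates) can be sketched in a few lines. Everything here is an assumption made for illustration: `Counts` is a hypothetical exact-count stand-in for the counting-only CMS, and `TopPctCms` and `fromItems` are made-up names rather than Algebird API. A real CMS would return approximate counts instead.

```scala
object HeavyHittersMergeSketch {
  // Hypothetical stand-in for a counting-only CMS: exact counts, no sketching.
  final case class Counts(table: Map[Long, Long]) {
    def totalCount: Long = table.values.sum
    def frequency(item: Long): Long = table.getOrElse(item, 0L)
    def ++(other: Counts): Counts =
      Counts((table.keySet ++ other.table.keySet).map { k =>
        k -> (frequency(k) + other.frequency(k))
      }.toMap)
  }

  // Hypothetical top-% CMS: counting part plus the current heavy hitter items.
  final case class TopPctCms(cms: Counts, heavyHitters: Set[Long], heavyHittersPct: Double) {
    def ++(other: TopPctCms): TopPctCms = {
      // Step 1: candidates are the union of both sides' heavy hitters.
      val candidates = heavyHitters ++ other.heavyHitters
      // Step 2: combine the counting parts.
      val combined = cms ++ other.cms
      // Step 3: re-query every candidate against the combined counts.
      val threshold = heavyHittersPct * combined.totalCount
      val hhs = candidates.filter(item => combined.frequency(item) >= threshold)
      TopPctCms(combined, hhs, heavyHittersPct)
    }
  }

  // Builds an instance from raw items using count >= heavyHittersPct * totalCount.
  def fromItems(items: Seq[Long], pct: Double): TopPctCms = {
    val counts = Counts(items.groupBy(identity).map { case (k, v) => k -> v.size.toLong })
    val threshold = pct * counts.totalCount
    TopPctCms(counts, counts.table.keySet.filter(k => counts.frequency(k) >= threshold), pct)
  }

  val left: TopPctCms = fromItems(Seq(1L, 1L, 2L), 0.4)  // heavy hitters: Set(1)
  val right: TopPctCms = fromItems(Seq(1L, 2L, 2L), 0.4) // heavy hitters: Set(2)
  val aggregated: TopPctCms = left ++ right
  val single: TopPctCms = fromItems(Seq(1L, 1L, 2L, 1L, 2L, 2L), 0.4)

  def main(args: Array[String]): Unit = {
    println(aggregated.heavyHitters == single.heavyHitters) // true
  }
}
```

In this example item `1` is a heavy hitter only on the left, yet the right side's count of `1` still contributes after the merge, because the candidates are re-queried against the combined counts rather than against the per-side heavy hitter entries.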
Limitation of this fix
The fix approach above does address the heavy hitters problem for top-% CMS variants, i.e. a CMS that computes heavy hitters based on the rule `count >= heavyHittersPct * totalCount`. The current (pre GH-345) CMS code in Algebird is such a top-% CMS variant, and can thus be fixed with the approach above.
However, the fix does not address the heavy hitters problem for top-N CMS variants, i.e. a CMS that computes heavy hitters by sorting items by count and keeping only the top N items. In a nutshell, the reason is that there may be items that should be promoted to the "global" (aka post-merge) heavy hitters list but which will not be. What are the characteristics of these items? They happen to never make it into the heavy hitters of any of the CMS instances being merged (think: `cms1 ++ cms2 ++ ...`), and thus never make it into the union-based set of HH candidates (cf. step 1 in the "How to fix" section above). Alternatively speaking, this limitation is caused by the operation of merging "top-N" sets not being associative. (The latter may or may not be what @johnynek was alluding to in his Twitter comment that "SketchMap's approach is not strictly associative [...]".)
For the current top-% CMS implementation in comparison, every item that should make it into the global, post-merge set of heavy hitters will at some point become an element of the left+right heavy hitter union set, aka the list of HH candidates, thanks to the `count >= heavyHittersPct * totalCount` rule.
Next steps?
What I say below assumes you agree there actually is an issue with the current CMS code. Please let me know if I'm mistaken, which may well be the case. :-)
My work in GH-345 already includes this fix in a "clean" way, and I should get open source approval by the end of this week. So depending on how pressing / important this issue is for you, we could opt to just wait until GH-345 is in.
Alternatively, we could patch the current CMS implementation -- either with the PoC fix in this issue, or with a cleaner variant (the PoC fix is intentionally not clean to better demonstrate the fix approach).