GH-345: Parameterize CMS to CMS[K] and decouple counting/querying from heavy hitters #354

Merged
67 commits merged on Nov 19, 2014

@miguno miguno commented Oct 16, 2014

See the discussion in GH-345 for details on this pull request.

Note: This PR contains a "clean" fix plus regression tests for GH-353, i.e. the bug where heavy hitters are not correctly computed in CMS.

High-level summary of this pull request

  • Parameterize CMS with a type K to decouple the item identifier type from hardcoded Long.
  • Two traits CMSCounting[K, C[_]] and CMSHeavyHitters[K] to decouple counting/querying from heavy hitters tracking.
  • Two implementations of these traits:
    • CMS[K], which can only count and estimate.
    • TopCMS[K], which can count, estimate, and track heavy hitters. The logic to compute heavy hitters is pluggable, see below.
  • A trait HeavyHittersLogic (for lack of a better name) that controls how a CMS that implements CMSHeavyHitters tracks heavy hitters. This decouples the "core" Top CMS implementation from the heavy hitters sub-functionality.
  • Two implementations of this trait: TopPctLogic, TopNLogic.
  • Monoids and aggregators for CMS, TopPctCMS, TopNCMS.
  • A lot of new/improved CMS documentation.
  • An updated spec/test suite, including regression tests for GH-353 (CMS computes heavy hitters incorrectly when combining CMS instances with ++).
    • Includes tests for K in {Short, Int, Long, BigInt}.
    • Includes a "negative test case" that documents the ++ order bias of top-N CMS.
  • A Caliper benchmark for CMS. Also adds a README to algebird-caliper.
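To make the decoupling in the summary above more concrete, here is a toy sketch of a key-parameterized counter with a pluggable heavy hitters policy. It is not the code in this PR: `ToyCMS`, `ToyTopCMS`, `TopPct`, `TopN`, the tuple-based hashing, and all signatures are simplified assumptions that only borrow the PR's vocabulary.

```scala
// Toy sketch only: names echo the PR's vocabulary, but none of this is Algebird code.
object CmsSketch {
  /** Counting and querying only, generic in the key type K. */
  final case class ToyCMS[K](width: Int, rows: Vector[Vector[Long]], seeds: Vector[Int], totalCount: Long) {
    // floorMod keeps the column index in [0, width) even for negative hash codes.
    private def index(seed: Int, item: K): Int =
      java.lang.Math.floorMod((item, seed).hashCode, width)

    /** Increment one cell per row. */
    def +(item: K): ToyCMS[K] = {
      val updated = rows.zip(seeds).map { case (row, seed) =>
        val i = index(seed, item)
        row.updated(i, row(i) + 1L)
      }
      copy(rows = updated, totalCount = totalCount + 1L)
    }

    /** Point query: the minimum over all rows, so it never undercounts. */
    def frequency(item: K): Long =
      rows.zip(seeds).map { case (row, seed) => row(index(seed, item)) }.min
  }

  object ToyCMS {
    def empty[K](depth: Int, width: Int): ToyCMS[K] = {
      require(depth > 0 && width > 0, "depth and width must be positive")
      ToyCMS[K](width, Vector.fill(depth)(Vector.fill(width)(0L)), Vector.tabulate(depth)(_ + 1), 0L)
    }
  }

  /** Pluggable policy deciding which items are tracked as heavy hitters. */
  trait HeavyHittersLogic[K] {
    def update(current: Set[K], item: K, cms: ToyCMS[K]): Set[K]
  }

  /** Keep items whose estimate is at least `pct` of the total count. */
  final case class TopPct[K](pct: Double) extends HeavyHittersLogic[K] {
    def update(current: Set[K], item: K, cms: ToyCMS[K]): Set[K] =
      (current + item).filter(k => cms.frequency(k) >= pct * cms.totalCount)
  }

  /** Keep the `n` items with the largest estimates. */
  final case class TopN[K](n: Int) extends HeavyHittersLogic[K] {
    def update(current: Set[K], item: K, cms: ToyCMS[K]): Set[K] =
      (current + item).toSeq.sortBy(k => -cms.frequency(k)).take(n).toSet
  }

  /** Counting plus heavy hitters; the policy is just a constructor argument. */
  final case class ToyTopCMS[K](cms: ToyCMS[K], logic: HeavyHittersLogic[K], heavyHitters: Set[K]) {
    def +(item: K): ToyTopCMS[K] = {
      val next = cms + item
      copy(cms = next, heavyHitters = logic.update(heavyHitters, item, next))
    }
    def frequency(item: K): Long = cms.frequency(item)
  }

  // Usage: the same counting core, here combined with the percentage-based policy.
  val demo: ToyTopCMS[String] =
    Seq("a", "a", "a", "b", "a", "c", "a").foldLeft(
      ToyTopCMS(ToyCMS.empty[String](4, 64), TopPct[String](0.5), Set.empty[String]))(_ + _)
}
```

Swapping `TopPct` for `TopN` changes only the tracking policy; the counting table and the point query stay untouched. Estimates can overcount when keys collide in every row, which is why this sketch makes no claim about which items are excluded from the heavy hitters.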

Please take a look at the code and let me know which further changes you'd like me to make so that this PR, and thus the improved CMS implementation, can be merged.

And many thanks for all the help and support so far!
Michael

PS: For the sake of this PR I decided not to squash individual commits into a single, big commit so it's easier to back out of changes you may not like etc. I can of course squash the commits if needed once the PR code is approved.

Michael G. Noll added 30 commits October 9, 2014 12:31
This change parameterizes the type K that is used to identify the
elements to be counted.  Previously, the CMS implementation hardcoded
`Long` as the type.  In other words, this change turns the hardcoded
CMS[K=Long] into a parameterized CMS[K].

"If a static checker finds a violation of assert it considers it an
error in the code, while when require is violated it assumes the caller
is at fault."

This commit changes assert to require in those places where it is used
to validate input parameters.
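A small sketch of the distinction this commit relies on. `widthFor` is a hypothetical helper, not a function from this PR; it uses the textbook relation width = ceil(e / eps) purely for illustration.

```scala
object ParamChecks {
  // Hypothetical helper (not Algebird's API): derive a table width from eps.
  def widthFor(eps: Double): Int = {
    // Input validation: a violation is the caller's fault and
    // throws IllegalArgumentException.
    require(eps > 0.0 && eps < 1.0, "eps must lie in (0, 1)")
    val width = math.ceil(math.E / eps).toInt
    // Internal invariant: a violation would be a bug in this method and
    // throws AssertionError.
    assert(width > 0, "width must be positive")
    width
  }
}
```

With this split, `ParamChecks.widthFor(0.0)` fails with an `IllegalArgumentException` that points at the caller, while the `assert` documents something the method itself guarantees.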

See stackoverflow.com/questions/8002835

Providing default values leads to scalac errors, example::

  method aggregator$default$4:[K]=> Double and
  method aggregator$default$4:[K]=> Double
  have same type

Hashes must be >= 0 to prevent IndexOutOfBoundsException when
accessing rows in the counting table.
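The failure mode described in this commit message can be sketched as follows. This is an illustration under assumed names (`RowIndex`, `naive`, `safe`), not the fix that the commit actually applies.

```scala
object RowIndex {
  // Naive version: on the JVM, % keeps the sign of the dividend, so a
  // negative hash yields a negative column index.
  def naive(hash: Int, width: Int): Int = hash % width

  // floorMod returns a value in [0, width) for any hash when width > 0.
  def safe(hash: Int, width: Int): Int = java.lang.Math.floorMod(hash, width)

  // A single row of a counting table with five columns.
  val row: Vector[Long] = Vector.fill(5)(0L)
}
```

Here `RowIndex.row(RowIndex.naive(-7, 5))` throws an `IndexOutOfBoundsException`, whereas `RowIndex.safe(-7, 5)` is a valid index. Note that `math.abs` alone is not a complete remedy, because `math.abs(Int.MinValue)` is still negative.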
/**
* Zero element. Used for initialization.
*/
case class CMSZero[K: Ordering](override val params: CMSParams[K]) extends CMS[K](params) {
Collaborator

where is the Ordering used here? Is it needed?

Collaborator

Ahhh. On creating CMSInstance.

Author

Yeah, this seemingly innocent change of going from CMS to CMS[K] resulted in more context bound fiddling than I initially expected. ;-)

@johnynek
Collaborator

This is thorough, excellent work. I really appreciate it. I want to merge it, just a few issues.

I am concerned that we now have SketchMap and CMS, which are almost identical except that SketchMap parameterizes the value. Given that making the key generic didn't hurt performance, I tend to think there must just be some small bottleneck in SketchMap that could be fixed.

It would be great to give this a going over with yourkit and see if we can identify a few hotspots, but clearly this is an improvement.

This is API breaking, but I think it is enough of a win and has a straightforward upgrade path, so it's not a big pain.

One question: is there a type definition we could make to get all the old methods? Like:

object CMS {
  // This was what CMS was in prior versions of algebird. Change import CMS to CMS.LongKey
  type LongKey = TopPctCms[Long]
}

@johnynek
Collaborator

One more note: would @specialized(Int, Long) K on CMS[K] and CMSHash improve perf at all? Just a guess that it could possibly if we could avoid boxing Longs, which might be a common case. Just a thought.

@miguno
Author

miguno commented Nov 18, 2014

Many thanks for the review, Oscar. I really do appreciate it -- this has been a lot of lines to go through!

SketchMap vs. CMS

I see your point. Shall I create a separate ticket for that discussion? Also, IIRC @avibryant was mentioning that SketchMap might be superseded by SpaceSaver. So the discussion may become "SketchMap vs. CMS and/or SpaceSaver".

One question: is there a type definition we could make to get all the old methods? Like:

object CMS {
  // This was what CMS was in prior versions of algebird. Change import CMS to CMS.LongKey
  type LongKey = TopPctCms[Long]
}

I have thought a bit about the very same question but haven't come up with a really good approach. :-(

The following would be working examples of your LongKey idea. Here, we have a legacy monoid and a legacy CMS based on Long. I think the monoid is needed, too, because the monoid bootstraps the initial CMS instances.

object CMS {
  type LegacyCMSMonoid = TopPctCMSMonoid[Long]
  type LegacyCMS = TopCMS[Long]
}

Another idea: Instead of putting it into the CMS object, we may also add a com.twitter.algebird package object, and add the Legacy* types there, which may allow the following pattern to further minimize changes in legacy code.

import com.twitter.algebird.{LegacyCMS => CMS, LegacyCMSMonoid => CountMinSketchMonoid}

// Now legacy code can still refer to `CMS` and `CountMinSketchMonoid`, if I'm not mistaken.

I haven't really tried the latter approach in detail though.
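For readers unfamiliar with the mechanics, the alias-plus-renamed-import pattern discussed above can be sketched in a self-contained way. The classes below are drastically simplified stand-ins (assumptions), and the aliases live in a plain object rather than in a real `com.twitter.algebird` package object.

```scala
object AliasDemo {
  // Stand-ins for the new, parameterized classes (hypothetical, heavily simplified).
  object newapi {
    class TopCMS[K](val items: List[K])
    class TopPctCMSMonoid[K] {
      def create(item: K): TopCMS[K] = new TopCMS[K](List(item))
    }
    // Aliases that pin K to Long, as in the Legacy* idea.
    type LegacyCMS = TopCMS[Long]
    type LegacyCMSMonoid = TopPctCMSMonoid[Long]
  }

  // Renaming on import lets old call sites keep using the old names.
  import newapi.{LegacyCMS => CMS, LegacyCMSMonoid => CountMinSketchMonoid}

  val monoid: CountMinSketchMonoid = new CountMinSketchMonoid
  val cms: CMS = monoid.create(42L)
}
```

The renaming is purely a compile-time convenience: `AliasDemo.cms` is an ordinary `TopCMS[Long]` at runtime. Whether `new` keeps working in real legacy code depends on the constructor signatures of the aliased classes, which this sketch does not model.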

Note: The new code hides how the heavy hitters are tracked -- the user only sees a TopCMS[K] (there are no TopPctCMS[K] or TopNCMS[K] classes). The user needs to pick the correct Top*CMSMonoid to instantiate the desired "Top" CMS variants.

One more note: would @specialized(Int, Long) K on CMS[K] and CMSHash improve perf at all? Just a guess that it could possibly if we could avoid boxing Longs, which might be a common case. Just a thought.

Ah, nice idea. (some time passes) My quick comparison using the new com.twitter.algebird.caliper.CMSBenchmark indicates that @specialized does not improve the CMS[K] performance.
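For context, this is roughly what the annotation looks like on a key-generic type. `KeyHash` and `LongHash` are made-up names for illustration, not the PR's `CMSHash`, and the sketch makes no performance claim.

```scala
object SpecializedDemo {
  // Under Scala 2, @specialized asks the compiler to also generate Int and
  // Long variants of this trait, so calls with those key types can avoid boxing.
  trait KeyHash[@specialized(Int, Long) K] {
    def apply(key: K): Int
  }

  // A trivial Long-keyed instance that maps keys into [0, 1024).
  object LongHash extends KeyHash[Long] {
    def apply(key: Long): Int = java.lang.Math.floorMod(key, 1024L).toInt
  }
}
```

Whether the generated variants pay off depends on how much boxing remains elsewhere in the call path, which is what the benchmark comparison above was checking.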

I'll send an updated PR with the latest changes.

@miguno
Author

miguno commented Nov 18, 2014

PS: Regardless of what we come up with regarding backwards compatibility / upgrade instructions, do you have any plans to add such information to CHANGES.md? At the moment the Algebird change log only contains a list of fixed tickets, and there does not seem to be a place where such upgrade information is tracked.

@avibryant
Contributor

For clarity: I don't think SpaceSaver replaces SketchMap, but I do think it might be a good alternative to TopNLogic. (It would be interesting to do some benchmarking of error rates between these two). But SpaceSaver doesn't even parameterize the value type currently, which is something else we should fix.

@johnynek
Collaborator

  1. About sketchmap: yes let's create a new ticket and deal with it after we merge this code (hopefully before our next full release).

  2. Changes.md: we have been too lazy about summarizing the meaningful changes and how to deal with them. We can just add a section for each release about how to upgrade if needed.

  3. Junking up the algebird package object does not feel right to me. What about an algebird.legacy package, with these types in its package object? Then import com.twitter.algebird.legacy.{CMS, CountMinSketchMonoid} could give you the old algebird.CMS, etc. How does that sound?

  4. About hiding the algorithm for TopK: I am not sweating it too much. We could put the algorithm as a type parameter and force us to combine only items with the same heavy-hitters approach, but I think there is a cost to type-complexity, and given that errors of this type seem so unlikely, I would be comfortable with keeping the code as is. What do you think?

  5. Thanks for checking specialized. I guess so much boxing happens already, you can't get a win by adding it here. Maybe one day miniboxing will be on by default (or the JVM will get value types).
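The alternative mentioned in point 4, putting the heavy hitters algorithm into the type, could look roughly like the sketch below. All names are hypothetical; the PR deliberately does not do this and keeps a single `TopCMS[K]`.

```scala
object TypedLogicDemo {
  // Marker types for the heavy hitters policy.
  sealed trait Logic
  sealed trait Pct extends Logic
  sealed trait N extends Logic

  // L is a phantom type: it carries no data, only the policy.
  final case class TopCMS[K, L <: Logic](heavyHitters: Set[K]) {
    // Only sketches with the same policy type L can be combined.
    def ++(other: TopCMS[K, L]): TopCMS[K, L] =
      TopCMS[K, L](heavyHitters ++ other.heavyHitters)
  }

  val a = TopCMS[String, Pct](Set("x"))
  val b = TopCMS[String, Pct](Set("y"))
  val ok = a ++ b
  // val bad = a ++ TopCMS[String, N](Set("z")) // rejected by the compiler: type mismatch
}
```

The extra type parameter shows up in every signature that mentions the sketch, which is the type-complexity cost weighed in point 4 against a mistake that seems unlikely in practice.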

@miguno
Author

miguno commented Nov 19, 2014

Thanks for the clarification on SpaceSaver, Avi!

  1. About sketchmap: yes let's create a new ticket and deal with it after we merge this code (hopefully before our next full release).

I created GH-360: Decide on the future of SketchMap, CMS, SpaceSaver.

  3. Junking up the algebird package object does not feel right to me. What about an algebird.legacy package, with these types in its package object? Then import com.twitter.algebird.legacy.{CMS, CountMinSketchMonoid} could give you the old algebird.CMS, etc. How does that sound?

Looks good to me. I'll experiment with that and will either update the PR or write a comment why it didn't work. ;-)

  4. About hiding the algorithm for TopK: I am not sweating it too much. We could put the algorithm as a type parameter and force us to combine only items with the same heavy-hitters approach, but I think there is a cost to type-complexity, and given that errors of this type seem so unlikely, I would be comfortable with keeping the code as is. What do you think?

I, too, am fine with keeping the code as is.

But just to clarify, because I am not sure whether I read your comment ("keeping the code as is") correctly: When I talked about TopCMS[K] (and not having TopPctCMS[K] or TopNCMS[K]), I was describing the actual code that is already in the PR -- it was not a suggestion to introduce it. Again, I am only highlighting this so that, after we all said "Ok, let's keep the code as is.", you won't be surprised about what you'll end up seeing in Algebird post-merge. :-)

@miguno
Author

miguno commented Nov 19, 2014

I added com.twitter.algebird.legacy, see https://github.com/miguno/algebird/commit/bec16fe435b751f9a6248ab4718f1024b4342150. Our original idea to help users of legacy code didn't work out exactly as planned, but I came very close. I also added two simple test cases that demonstrate the usage of the legacy helpers.

In most cases, users should only need to remove the "new" when creating legacy CMS monoids:

// previously
val CMS_MONOID: CountMinSketchMonoid = new CountMinSketchMonoid(EPS, DELTA, SEED)

// with code in this PR and its legacy package
import com.twitter.algebird.legacy.CountMinSketchMonoid
val CMS_MONOID: CountMinSketchMonoid = CountMinSketchMonoid(EPS, DELTA, SEED)
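One way a helper with this call shape can be arranged is a type alias paired with a same-named factory object. The sketch below is an assumption for illustration only: the stand-in class, its constructor parameters, and the default `heavyHittersPct` value are invented and may differ from the linked commit.

```scala
object LegacyDemo {
  // Stand-in for a new-style monoid whose constructor differs from the old one.
  class TopPctCMSMonoid[K](val eps: Double, val delta: Double, val seed: Int, val heavyHittersPct: Double)

  object legacy {
    // The old name as a type alias with the key type pinned to Long.
    type CountMinSketchMonoid = TopPctCMSMonoid[Long]

    // A same-named factory object accepts the old argument list without `new`.
    // The default percentage here is an assumption made for this sketch.
    object CountMinSketchMonoid {
      def apply(eps: Double, delta: Double, seed: Int, heavyHittersPct: Double = 0.01): CountMinSketchMonoid =
        new TopPctCMSMonoid[Long](eps, delta, seed, heavyHittersPct)
    }
  }

  // A single import brings in both the type alias and the factory object.
  import legacy.CountMinSketchMonoid
  val monoid: CountMinSketchMonoid = CountMinSketchMonoid(0.001, 1e-8, 1)
}
```

Types and terms live in separate namespaces in Scala, so the alias and the object can share one name and be imported together.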

Note that because of name collisions for CMS (c.t.a.CMS vs. c.t.a.legacy.CMS), users may need to explicitly use the name legacy.CMS in type annotations for CMS:

import com.twitter.algebird._
import com.twitter.algebird.legacy.CountMinSketchMonoid

val CMS_MONOID: CountMinSketchMonoid = CountMinSketchMonoid(EPS, DELTA, SEED)

// Example 1: without type annotation
// Here, the written code is the same pre and post merge.
// Behind the scenes, however, scalac will infer type `TopCMS[Long]` for `cms`,
// whereas previously the inferred type was `CMS`.
val cms = CMS_MONOID.create(0L)

// Example 2: with type annotation
// The following does not compile because of the name collision for CMS:
// val cms2: CMS = CMS_MONOID.create(0L)

// The following works:
val cms2: legacy.CMS = CMS_MONOID.create(0L)

If you have a better idea, please let me know.

@miguno
Author

miguno commented Nov 19, 2014

I added new commits as well as a comment/question regarding testing delta/depth and width/eps via roundtripping, see #354 (comment). GitHub hides this comment by default because the updated PR modified the code on which the first code review comment was made.

johnynek added a commit that referenced this pull request Nov 19, 2014
GH-345: Parameterize CMS to CMS[K] and decouple counting/querying from heavy hitters
@johnynek johnynek merged commit 6ed1356 into twitter:develop Nov 19, 2014
@miguno
Author

miguno commented Nov 20, 2014

Thanks Oscar for merging, and @jnievelt and @avibryant for their help getting there!

@miguno
Author

miguno commented Nov 20, 2014

PS: Let me know if you need a text snippet for CMS upgrade instructions in CHANGES.md.

@miguno miguno mentioned this pull request Jan 1, 2015