
A generic version of CountMinSketch[T] to decouple CMS from Long? #345

Closed
miguno opened this Issue Sep 12, 2014 · 19 comments

Comments

@miguno

miguno commented Sep 12, 2014

This is a question about a) whether this is a good idea in general and b) whether you'd be interested in a PR with draft code.

"Why should I care about this issue?"

If we had a way to use CMS with "bigger" number types than Long, we would be able to cover use cases that require counting more distinct elements than fit into the range of Long (which has 2^64 distinct values). This would, for instance, allow us to probabilistically count IPv6 addresses, which is currently not possible.

Background

The current CMS implementation (CountMinSketch.scala) is backed by a Long, which means that CMS will count the occurrences of Long instances. As a user I must ensure that I can properly encode as well as decode my "real" data entities to and from Long (and this mapping should be bijective).

What does this mean in practice? Long has 64 bits, i.e. 2^64 distinct values (= you can count up to about 1.8E19 distinct elements), which means you can't use the current CMS implementation if what you are trying to count has more than 2^64 distinct elements.

Unfortunately, there are use cases where the address space of Long is not sufficient. One example is counting IPv6 addresses, which have an address space of 2^128. (IPv4 addresses can be counted, as they fit into 32 bits.) Similarly, you may want to count the occurrences of unique strings, where one option is to encode each string as a BigInt (e.g. via BigInt(s.getBytes("UTF-8"))) and then perform the probabilistic counts via a CMS[BigInt] instance (though this is but one option to count strings, of course).
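For illustration, such a String <-> BigInt encoding could be sketched as follows (stringToBigInt/bigIntToString are made-up helper names; using signum = 1 keeps the result non-negative, though strings with leading NUL bytes would not round-trip under this particular scheme):

```scala
import java.nio.charset.StandardCharsets

// Sketch of one possible String <-> BigInt encoding (helper names are made up).
def stringToBigInt(s: String): BigInt =
  BigInt(1, s.getBytes(StandardCharsets.UTF_8))

def bigIntToString(i: BigInt): String =
  new String(i.toByteArray.dropWhile(_ == 0), StandardCharsets.UTF_8)
```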

Alternatives to CMS in Algebird

The one option I found is SketchMap, which is "a generalized version of Count-Min Sketch":

A Sketch Map is a generalized version of the Count-Min Sketch that is an approximation of Map[K, V] that stores reference to top heavy hitters. The Sketch Map can approximate the sums of any summable value that has a monoid.

Unfortunately, SketchMap seems to be really slow. In a simple benchmark -- all other things being equal -- the updates/s were 0.3k/s (SketchMap with K=BigInt and V=Long) vs. 150k/s (CMS with hardcoded V=Long).

Code snippet for SketchMap (similar setup for CMS):

    type K = BigInt
    type V = Long

    val SM_MONOID = {
      val eps = 0.005
      val delta = 1E-8
      val seed = 1
      val heavyHittersCount = 1000
      def explicitSerialization(i: K): Array[Byte] = i.toByteArray
      val params = SketchMapParams[K](seed, eps, delta, heavyHittersCount)(explicitSerialization)
      new SketchMapMonoid[K, V](params)
    }

    val sketch = longRange(1, 1000000000).foldLeft(SM_MONOID.zero)((l, r) => {
      val pair = (BigInt(r), 1L)
      SM_MONOID.plus(l, SM_MONOID.create(pair))
    })

Other alternatives

At this point I investigated modifying CMS to use (for example) BigInt instead of Long as the backend data type for identifying the counted elements. This worked, and the performance was much better than SketchMap.

  • CMS (hardcoded K=Long): 160k/s
  • CMS (hardcoded K=BigInt): 65k/s ~ 40% as fast as the existing Long variant
  • SketchMap[K, V] (K=BigInt, V=Long): 0.3k/s

That's reasonably fast given that we can count significantly more distinct elements than we can in the Long variant of CMS.

I then experimented with turning CMS into a generic variant CMS[K], so to speak. In other words, I added a second, parameterized CMS implementation based on the existing, Long-backed CMS.

Benchmark results:

  • CMS (hardcoded K=Long): 160k/s
  • CMS[K] (K=Long): 160k/s (as fast as the hardcoded Long CMS variant)
  • CMS (hardcoded K=BigInt): 65k/s
  • CMS[K] (K=BigInt): 65k/s (as fast as the hardcoded BigInt variant)
  • SketchMap[K, V] (K=BigInt, V=Long): 0.3k/s

Usage of the generic CMS variant is basically:

val genCmsMonoid = new GenCountMinSketchMonoid[BigInt](eps, delta, seed, heavyHittersPct)

(I chose the Gen prefix deliberately in order not to break any existing code that might be written against the hardcoded CMS variant.)

The generic CMS requires implicit values (type classes) to be in scope for the K you choose (here: BigInt), of the following form. Also, a) there is a context bound [K : Ordering], which is not shown as it is not relevant for the code snippet below; and b) K must be serializable.

trait GenCMSHasher[K] {
  val PRIME_MODULUS = (1L << 31) - 1
  def hash(a: Int, b: Int, width: Int)(x: K): Int
}

case class GenCMSHash[K: GenCMSHasher](a: Int, b: Int, width: Int) {
  def apply(x: K): Int = implicitly[GenCMSHasher[K]].hash(a, b, width)(x)
}

Then you'd only need to put correct implicits in place, e.g. for Long:

  implicit object GenCMSHasherLong extends GenCMSHasher[Long] {
    def hash(a: Int, b: Int, width: Int)(x: Long) = {
      // The code below is taken verbatim from the existing CMSHash code in Algebird.
      val unmodded: Long = (x * a) + b
      val modded: Long = (unmodded + (unmodded >> 32)) & PRIME_MODULUS
      modded.toInt % width
    }
  }

(I already have such implicits for Byte, Short, Int, Long, BigInt, which the user can import if needed.)
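For example, a BigInt hasher could be defined along these lines (the trait is repeated to keep the snippet self-contained; the concrete hashing scheme below -- reducing modulo the Mersenne prime first and then reusing the Long recurrence -- is an illustrative assumption, not necessarily what the draft code does):

```scala
trait GenCMSHasher[K] {
  val PRIME_MODULUS = (1L << 31) - 1
  def hash(a: Int, b: Int, width: Int)(x: K): Int
}

// Illustrative only: reduce the BigInt modulo the Mersenne prime, then reuse
// the same universal-hashing recurrence as the Long variant above.
implicit object GenCMSHasherBigInt extends GenCMSHasher[BigInt] {
  def hash(a: Int, b: Int, width: Int)(x: BigInt): Int = {
    val reduced: Long = x.mod(BigInt(PRIME_MODULUS)).toLong // fits in a Long
    val unmodded: Long = reduced * a + b
    val modded: Long = (unmodded + (unmodded >> 32)) & PRIME_MODULUS
    modded.toInt % width
  }
}
```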

Open questions

There may or may not be an issue with handling type erasure correctly. The current draft code happily ignores type erasure although some pattern matching is done on GenFoo[K] instead of GenFoo[_]. I won't go into details here but can do that later if needed. (Any feedback you'd have here would help of course. :-P)

Summary

Does the approach above make sense to you? And would you be interested in a pull request? If you are, let me know how you'd like to handle backwards compatibility with the existing CMS code -- e.g. shall the PR add a, say, GenCountMinSketch.scala or modify the existing CountMinSketch.scala code?

I already performed some basic testing, and the actual counting/estimation of CMS[K] seems to be correct, too. But before I'd spend the time to cover all the test cases of the current CMS implementation in Algebird I'd like to get your feedback on whether it's worth the time. ;-)

@miguno


miguno commented Sep 12, 2014

PS: Note that the performance of BigInt -- and the corresponding benchmark results above -- depends on the size of the values it represents (this is due to how BigInt is implemented).

The benchmark numbers above are based on small values (from 1 upwards). When I re-ran the same benchmark with randomly drawn BigInts from a 2048 bit address space (2^2048 unique elements), the updates/s performance drops from 65k/s to 22k/s.
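For reference, drawing such inputs can be done with Scala's built-in BigInt(numbits, rnd) constructor (a sketch of the input generation only, not of the actual benchmark harness):

```scala
import scala.util.Random

// Sketch: generate random BigInts drawn uniformly from [0, 2^2048),
// mirroring the "2048 bit address space" input described above.
val rnd = new Random(1) // fixed seed for reproducibility
val inputs: Seq[BigInt] = Seq.fill(100)(BigInt(2048, rnd))
```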

@miguno

miguno commented Sep 29, 2014

Here's a gist with the generic CMS code (prior to any needed modifications):
https://gist.github.com/miguno/3bf1a24be446b2edaeb1

@johnynek


Collaborator

johnynek commented Sep 29, 2014

I have a few questions and comments. Maybe @echen will chime in who wrote the original CMS.

  1. Why is sketchmap slow? My knee jerk is that we should make SketchMap fast rather than add another CMS type. I wonder what YourKit would show as the difference here. It could be boxing, it could be the way we represent the underlying matrix.

  2. I seem to remember asking Edwin about the Long key, and his argument was basically that you will be hashed anyway, so the Long space is fine. Consider your IPv6 example. If you were counting 1 ip / nanosecond, it would still take about 585 years to enumerate all the longs. So, it seems to me that Edwin was right, and anything you want to add to a CMS, if you first hash it to a Long (say using Murmur or the xor of some of the normal hashcodes of the object if it is compound at all), then you should be okay.

  3. I'm really impressed with the writing and detail with which you have considered the problem (which is frankly greater than we have examined the performance issues). There is an algebird-caliper project to run benchmarks on these types. Perhaps we can add some benchmarks for CMS and SketchMap and see if we can close the gap.
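The "hash everything to a Long first" approach from point 2 could be sketched as follows (FNV-1a 64-bit is used purely for illustration, and fnv1a64 is a made-up helper; the comment suggests e.g. MurmurHash instead):

```scala
// FNV-1a 64-bit, shown as one concrete way to map arbitrary bytes to a Long.
def fnv1a64(bytes: Array[Byte]): Long = {
  var h = 0xcbf29ce484222325L // FNV offset basis
  bytes.foreach { b =>
    h ^= (b & 0xff)
    h *= 0x100000001b3L // FNV prime
  }
  h
}

// Any key, e.g. an IPv6 address string, then becomes a Long for the existing CMS:
val cmsKey: Long = fnv1a64("FE80:0000:0000:0000:0202:B3FF:FE1E:8329".getBytes("UTF-8"))
```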

@miguno What do you think? Let me know if you think I have this wrong. I want to improve performance and increase applicability, but also I am reluctant to maintain 3 almost identical classes/algorithms.

@miguno


miguno commented Sep 29, 2014

Thanks for taking the time to reply, @posco.

  1. Why is sketchmap slow? My knee jerk is that we should make SketchMap fast rather than add another CMS type. I wonder what YourKit would show as the difference here. It could be boxing, it could be the way we represent the underlying matrix.

I can't say why SketchMap is so slow. To be frank, I didn't bother to check, because the performance discrepancy was so pronounced -- orders of magnitude -- that I assumed the slowness was inherent to how SketchMap works.

  2. I seem to remember asking Edwin about the Long key, and his argument was basically that you will be hashed anyway, so the Long space is fine. Consider your IPv6 example. If you were counting 1 ip / nanosecond, it would still take about 585 years to enumerate all the longs. So, it seems to me that Edwin was right, and anything you want to add to a CMS, if you first hash it to a Long (say using Murmur or the xor of some of the normal hashcodes of the object if it is compound at all), then you should be okay.

(I hope what I'll be saying now does not make me look totally stupid. :-P)

First, you're completely right that covering the full address space of e.g. IPv6 (2^128) will take a very long time. This is even more true in streaming applications, where we have time windows on the order of seconds or minutes.

But is the important aspect really how fast/slow we can cover the full space (~ sparsity of target address space, possibly within a certain time window), or isn't it rather the requirement that the "thing" <-> Long mapping must be bijective -- i.e. not unidirectional -- in order to reconstruct the original thing once you're done counting with CMS (~ source vs. target address space)? For instance, let's say we want to compute the top N IPv6 addresses. We can use the CMS for this, notably its heavy hitters functionality. But how, at the end of the counting process, can we actually reconstruct the original IPv6 addresses that are now listed as top 1, top 2, etc.? I mean, either we track the "thing" <-> Long mappings elsewhere (e.g. via external systems/DB) or we use computation -- a mapping -- to move back and forth between them. (And all things being equal I'd say the computation approach is preferable as it means less complexity, less architectural dependencies, etc.)

In other words, yes, I can hash an object (such as an IPv6 address, or something "much" bigger like a text file) to a small, fixed-size fingerprint, then map this fingerprint to a Long, and finally count this Long via CMS (obj -> hash -> Long). But how can I retrieve the original object out of the CMS, i.e. Long -> obj?

If the mapping is bijective -- which a hash, due to its nature, isn't -- then even though I am not able to retrieve the original object from the CMS directly, I can reconstruct it via the mapping function. Example: Use the CMS to count IPv6 addresses and compute heavy hitters (by mapping e.g. FE80:0000:0000:0000:0202:B3FF:FE1E:8329 to BigInt 123456789123456789), then list the actual IPv6 addresses of which the heavy hitters are comprised (by mapping in the opposite direction, from BigInt 123456789123456789 back to FE80:0000:0000:0000:0202:B3FF:FE1E:8329).
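Such a bijective IPv6 <-> BigInt mapping could be sketched as follows (ipv6ToBigInt/bigIntToIpv6 are made-up helper names, and the BigInt value in the example above is purely illustrative):

```scala
import java.net.InetAddress

// Sketch of a bijective IPv6 <-> BigInt mapping (helper names are made up).
def ipv6ToBigInt(ip: String): BigInt =
  BigInt(1, InetAddress.getByName(ip).getAddress) // 16 address bytes -> non-negative BigInt

def bigIntToIpv6(i: BigInt): String = {
  val magnitude = i.toByteArray.dropWhile(_ == 0)                     // strip the sign byte, if any
  val bytes = Array.fill[Byte](16 - magnitude.length)(0) ++ magnitude // left-pad back to 16 bytes
  InetAddress.getByAddress(bytes).getHostAddress
}
```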

However, if the mapping must be bijective, then the address space of a Long (2^64) can't accommodate use cases that have significantly larger address spaces (e.g. 2^128 in the case of IPv6, but we have use cases that would need even larger spaces). And here, increasing the target address space from Long to something bigger (like BigInt) makes it easier to create/find such bijective mappings when the source address space is >> 2^64.

All the above being said with the big disclaimer that it's totally possible that I am misunderstanding the way I should use the CMS. :-)

  3. I'm really impressed with the writing and detail with which you have considered the problem (which is frankly greater than we have examined the performance issues). There is an algebird-caliper project to run benchmarks on these types. Perhaps we can add some benchmarks for CMS and SketchMap and see if we can close the gap.

Thanks, Oscar! The algebird-caliper project is new to me, I can take a look.

@miguno What do you think? Let me know if you think I have this wrong. I want to improve performance and increase applicability, but also I am reluctant to maintain 3 almost identical classes/algorithms.

Oh, I can totally understand.

Just to clarify and to give you more options to think about: I think we could actually consolidate the current CountMinSketch with the generic variant GenCountMinSketch, as the former is simply the special case of CMS[Long]. The only two reasons IMHO to keep both of them around in the code base would be a) if there was some deficiency in the generic variant and/or b) backwards compatibility (e.g. users at Twitter who have coded against the current, Long-based CMS implementation). I think (a) could be technically addressed if needed, and (b) could potentially be addressed by making the current CMS functionality available under the same name but backed by the generic implementation (i.e. CMS[Long]). The same applies to the existing test suite -- we could reuse it and, say, loop through the CMS test suite with various CMS types, e.g. CMS[Long] (= mimicking the current test suite) and CMS[BigInt].

If the above (CMS consolidation) is possible, then there would be only 2 similar classes -- CMS and SketchMap -- just like there are today. (I'll skip brainstorming about what to possibly do with SketchMap, which I think you were hinting at. I am not very familiar with this data structure yet, and on top of that I also can't say why it's so slow as mentioned above.)

@jnievelt


Contributor

jnievelt commented Sep 30, 2014

Regarding the size of the key, individual hashes may be Int sized (arguably this isn't necessary to begin with) but we are also dealing with a sequence of hashes. So if we have w x h there are w^h possible combinations; these values don't need to be very large to eclipse 2^64. Further, if we are interested in returning heavyHitters, @miguno is correct that we'd need a bijection there for it to be useful.
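As a back-of-the-envelope check of that claim (assuming the standard CMS sizing formulas width = ceil(e / eps) and depth = ceil(ln(1 / delta)), with the eps/delta used in the benchmarks elsewhere in this thread):

```scala
// With eps = 0.005 and delta = 1E-8, assuming the standard CMS sizing formulas:
val eps = 0.005
val delta = 1e-8
val width = math.ceil(math.E / eps).toInt        // number of columns (w)
val depth = math.ceil(math.log(1 / delta)).toInt // number of rows (h)
val combinations = BigInt(width).pow(depth)      // w^h distinct hash signatures
// Even these modest parameters eclipse the 2^64 key space of Long:
assert(combinations > BigInt(2).pow(64))
```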

Overall I'm in favor of putting this into three traits:

  • CMS[K, V] - a query-only CMS that does not have any heavy hitters
    • hash functions K => Int
    • Monoid[V] for construction and combination
    • meet-semilattice Semigroup[V] for querying
  • "top k" CMS (SketchMap)
    • Ordering[V]
  • "top %" CMS (which is what CMS is today)
    • Ordering[K], Ordering[V]

The latter two, of course, being based on the first (possibly extending it?).
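A rough Scala sketch of that split might look like the following (all names and signatures here are illustrative assumptions, not a concrete proposal):

```scala
// Query-only CMS: point queries and merging, no heavy hitters.
trait QueryCMS[K, V] {
  def frequency(key: K): V
  def ++(other: QueryCMS[K, V]): QueryCMS[K, V]
}

// "Top k" flavor (SketchMap-style): a fixed number of heavy hitters.
trait TopKCMS[K, V] extends QueryCMS[K, V] {
  def heavyHitters: Seq[(K, V)]
}

// "Top %" flavor (what CMS is today): heavy hitters above a frequency threshold.
trait TopPctCMS[K, V] extends QueryCMS[K, V] {
  def heavyHitters: Set[K]
}

// Trivial map-backed implementation, just to show the intended usage.
case class ExactCMS[K](counts: Map[K, Long]) extends QueryCMS[K, Long] {
  def frequency(key: K): Long = counts.getOrElse(key, 0L)
  def ++(other: QueryCMS[K, Long]): QueryCMS[K, Long] = this // placeholder merge
}
```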

@miguno


miguno commented Oct 2, 2014

I ran a couple of Caliper benchmarks for CMS, CMS[K], and SketchMap[K, V] with K = { Long, BigInt }. I include the summary below. Details at: https://gist.github.com/miguno/00217d87fd340ad7f370

Disclaimer: First, I'm new to Caliper. Second, these were very simple benchmarks (see gist above for details). Still, I think they should be in the right ballpark.

FYI: I used the com.twitter.algebird.caliper.Runner included in algebird-caliper, run via IntelliJ IDEA 13. I ran into compilation errors (AsyncSummerBenchmark) straight away; this might be a code relic, because apparently the normal Algebird build does not compile algebird-caliper. The Caliper benchmark classes I used -- including CLI and VM options -- are listed in the gist above.


Summary results, variant A

For the following parameters:

  • eps = 0.005
  • delta = 1E-8
  • seed = 1
  • heavyHittersPct: 0.2 (CMS variants) / heavyHittersCount: 1000 (SketchMap)

Input data: 100 small numbers, i.e. the sequence (1, 2, ..., 100).

Results:

(*) denotes new, generic CMS implementation

Summary results, variant B

Same parameters as above but different input data. This benchmark, when run against BigInt, is expected to run slower than benchmark variant A above because BigInt operations become slower the larger the numbers represented by a BigInt are.

Input data: large(r) numbers, i.e. randomly drawn numbers from a 2^2048 address space.

Because of the input data characteristics this benchmark variant cannot be run for Long.

Results:

(*) denotes new, generic CMS implementation

Notes

Several benchmarks logged warnings and/or errors similar to the following:

WARNING: Hotspot compilation occurred during timing. Depending on the scope of the benchmark, this might significantly impact results. Consider running with a longer warmup.

WARNING: Hotspot compilation occurred after warmup, but outside of timing. Depending on the scope of the benchmark, this might significantly impact results. Consider running with a longer warmup.

WARNING: GC occurred during timing. Depending on the scope of the benchmark, this might significantly impact results. Consider running with a larger heap size.

Benchmark environment

  • MacBook Pro Retina (Late 2013), 2.3 GHz Intel Core i7, 16 GB RAM
  • Mac OS X 10.9.5
  • Oracle Java 7 (1.7.0_55)
  • IntelliJ IDEA 13.1

Caliper references

@miguno

miguno commented Oct 2, 2014

Thanks for jumping in, @jnievelt -- much appreciated!

@jnievelt wrote:
Overall I'm in favor of putting this into three traits:

  • CMS[K, V] - a query-only CMS that does not have any heavy hitters
    • hash functions K => Int
    • Monoid[V] for construction and combination
    • meet-semilattice Semigroup[V] for querying
  • "top k" CMS (SketchMap)
    • Ordering[V]
  • "top %" CMS (which is what CMS is today)
    • Ordering[K], Ordering[V]

The latter two, of course, being based on the first (possibly extending it?).

I like Joe's idea about splitting the responsibilities of the "core" CMS and the heavy hitters part. Notably, this helps to prevent the Ordering requirement from leaking everywhere into the core CMS code. I already noticed this leakage while working on the CMS[K] draft and wasn't too happy about it.

Also, turning CMS[K] (w/ hardcoded V=Long) into CMS[K, V] would give us the option to deprecate SketchMap, if that is desired.

Question: How important is it to also turn the "count value" Long into a type parameter V? (And thus, indirectly, how important is it to use CMS to replace SketchMap?) Without meaning to be suggestive, this will make the CMS implementation more complex. From what I understand, for instance, moving to a type parameter V means that we'd also need Numeric[V] in addition to Ordering[V] because of Approximate:

case class Approximate[N](min: N, estimate: N, max: N, probWithinBounds: Double)(implicit val numeric: Numeric[N]) { ... }

I think this would also mean that the "core" CMS[K, V] code -- the one without heavy hitters -- would need to require Numeric[V]. (And there might be more such required changes lurking in the code.)

Addendum: I think Numeric[V] would also be required because of the following code in CountMinSketch.scala. I'll just list a few examples, but in general I am referring to (potentially all?) the lines that currently have 0L or 1L in them. From what I understand, having "only" a Monoid[V] would not be enough to implement the following semantics: Monoid[V] would provide us with a zero element (aka 0L), but not with a one (aka 1L).

case class CMSZero(params: CMSParams) extends CMS {
  ...
  def totalCount = 0L
  def frequency(item: Long) = Approximate.exact(0L)
  def innerProduct(other: CMS) = Approximate.exact(0L)
  ...
}
/**
 * Used for holding a single element, to avoid repeatedly adding elements from
 * sparse counts tables.
 */
case class CMSItem(item: Long, params: CMSParams) extends CMS {
  ...
  def totalCount = 1L
  def frequency(x: Long) = if (item == x) Approximate.exact(1L) else Approximate.exact(0L)
  ...
}
case class CMSInstance(countsTable: CMSCountsTable, totalCount: Long,
  hhs: HeavyHitters, params: CMSParams) extends CMS {

  ...

  def frequency(item: Long): Approximate[Long] = {
    val estimates = countsTable.counts.zipWithIndex.map {
      case (row, i) =>
        row(params.hashes(i)(item))
    }
    makeApprox(estimates.min)
  }

  private def makeApprox(est: Long): Approximate[Long] = {
    if (est == 0L) Approximate.exact(0L)
    else {
      val lower = math.max(0L, est - (eps * totalCount).toLong)
      Approximate(lower, est, est, 1 - delta)
    }
  }

  def +(item: Long): CMSInstance = this + (item, 1L)
  ...
}

Scala's Numeric[T] trait requires such zero and one elements, as well as e.g. methods that convert T to Int, Long, Float, Double.

// From scala.math.Numeric.scala
trait Numeric[T] extends Ordering[T] {
  def plus(x: T, y: T): T
  def minus(x: T, y: T): T
  def times(x: T, y: T): T
  def negate(x: T): T
  def fromInt(x: Int): T
  def toInt(x: T): Int
  def toLong(x: T): Long
  def toFloat(x: T): Float
  def toDouble(x: T): Double

  def zero = fromInt(0)
  def one = fromInt(1)

  def abs(x: T): T = if (lt(x, zero)) negate(x) else x
  def signum(x: T): Int =
    if (lt(x, zero)) -1
    else if (gt(x, zero)) 1
    else 0
  ...
}

In other words, there is some effort moving from current CMS (w/ hardcoded K=Long, V=Long) to CMS[K] (w/ hardcoded V=Long), and further effort to move to CMS[K, V]. Since I don't know how important the V part is I raised my question above -- that is, on my side I do see the need for the K change, but I haven't run into a use case yet that would require the V change (I only experimented with SketchMap because it allowed me to control K, but not because I needed a V other than Long). I understand that the V part is the key (pun intended) to unlocking the door that deprecates SketchMap.

I'm fine either way. :-)

miguno commented Oct 2, 2014

Thanks for jumping in, @jnievelt -- much appreciated!

@jnievelt wrote:
Overall I'm in favor of putting this into three traits:

  • CMS[K, V] - a query-only CMS that does not have any heavy hitters
    • hash functions K => Int
    • Monoid[V] for construction and combination
    • meet-semilattice Semigroup[V] for querying
  • "top k" CMS (SketchMap)
    • Ordering[V]
  • "top %" CMS (which is what CMS is today)
    • Ordering[K], Ordering[V]
      The latter two, of course, being based on the first (possibly extending it?).

I like Joe's idea about splitting the responsibilities of the "core" CMS and the heavy hitters part. Notably, this helps to prevent the Ordering requirement from leaking everywhere into the core CMS code. I already noticed this leakage while working on the CMS[K] draft and wasn't too happy about it.

Also, turning CMS[K] (w/ hardcoded V=Long) into CMS[K, V] would give us the option to deprecate SketchMap, if that is desired.

Question: How important is it to also turn the "count value" Long into a type parameter V? (And thus, indirectly, how important is it to use CMS to replace SketchMap?) Without wanting to prejudge the answer, this change will make the CMS implementation more complex. From what I understand, for instance, introducing a type parameter V means that we'd also need Numeric[V] in addition to Ordering[V], because of Approximate:

case class Approximate[N](min: N, estimate: N, max: N, probWithinBounds: Double)(implicit val numeric: Numeric[N]) { ... }

I think this would also mean that the "core" CMS[K, V] code -- the one without heavy hitters -- would need to require Numeric[V]. (And there might be more such required changes lurking in the code.)
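To make the Numeric[N] requirement concrete, here is a minimal, self-contained restatement of the quoted signature (a local sketch named ApproximateSketch, not algebird's actual class); a BigInt-valued instance only compiles because scala.math.Numeric[BigInt] exists out of the box:

```scala
// Local sketch mirroring the quoted Approximate signature: an implicit
// Numeric[N] must be in scope for each concrete N (Long, BigInt, ...).
case class ApproximateSketch[N](min: N, estimate: N, max: N, probWithinBounds: Double)(
    implicit val numeric: Numeric[N])

// Both resolve their Numeric instance from scala.math.Numeric's companion.
val longApprox = ApproximateSketch[Long](10L, 12L, 15L, 0.99)
val bigApprox  = ApproximateSketch[BigInt](BigInt(10), BigInt(12), BigInt(15), 0.99)
```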

Addendum: I think the Numeric[V] would also be required because of the following code in CountMinSketch.scala. I just list a few examples, but in general I am referring to (potentially all?) the lines that currently have 0L or 1L in them. From what I understand having "only" a Monoid[V] would not be enough to implement the following semantics, even though Monoid[V] would provide us with a zero element (aka 0L) -- but not with a one (aka 1L).

case class CMSZero(params: CMSParams) extends CMS {
  ...
  def totalCount = 0L
  def frequency(item: Long) = Approximate.exact(0L)
  def innerProduct(other: CMS) = Approximate.exact(0L)
  ..
}
/**
 * Used for holding a single element, to avoid repeatedly adding elements from
 * sparse counts tables.
 */
case class CMSItem(item: Long, params: CMSParams) extends CMS {
  ...
  def totalCount = 1L
  def frequency(x: Long) = if (item == x) Approximate.exact(1L) else Approximate.exact(0L)
  ...
}
case class CMSInstance(countsTable: CMSCountsTable, totalCount: Long,
  hhs: HeavyHitters, params: CMSParams) extends CMS {

  ...

  def frequency(item: Long): Approximate[Long] = {
    val estimates = countsTable.counts.zipWithIndex.map {
      case (row, i) =>
        row(params.hashes(i)(item))
    }
    makeApprox(estimates.min)
  }

  private def makeApprox(est: Long): Approximate[Long] = {
    if (est == 0L) Approximate.exact(0L)
    else {
      val lower = math.max(0L, est - (eps * totalCount).toLong)
      Approximate(lower, est, est, 1 - delta)
    }
  }

  def +(item: Long): CMSInstance = this + (item, 1L)
  ...
}

Scala's Numeric[T] trait requires such zero and one elements, as well as e.g. methods that convert T to Int, Long, Float, Double.

// From scala.math.Numeric.scala
trait Numeric[T] extends Ordering[T] {
  def plus(x: T, y: T): T
  def minus(x: T, y: T): T
  def times(x: T, y: T): T
  def negate(x: T): T
  def fromInt(x: Int): T
  def toInt(x: T): Int
  def toLong(x: T): Long
  def toFloat(x: T): Float
  def toDouble(x: T): Double

  def zero = fromInt(0)
  def one = fromInt(1)

  def abs(x: T): T = if (lt(x, zero)) negate(x) else x
  def signum(x: T): Int =
    if (lt(x, zero)) -1
    else if (gt(x, zero)) 1
    else 0
  ...
}
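As a sketch of how Numeric's zero and one could replace the hardcoded 0L/1L (hypothetical names, not the actual algebird types):

```scala
// Hypothetical generic counterpart of CMSItem: Numeric[V] supplies the
// `one` for totalCount and the `zero` for misses, replacing 1L and 0L.
case class GenericCMSItem[K, V](item: K)(implicit num: Numeric[V]) {
  def totalCount: V = num.one
  def frequency(x: K): V = if (item == x) num.one else num.zero
}

val li = GenericCMSItem[String, Long]("a")
val bi = GenericCMSItem[String, BigInt]("a")
```

A Monoid[V] alone would only provide the `zero` half of this.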

In other words, there is some effort in moving from the current CMS (w/ hardcoded K=Long, V=Long) to CMS[K] (w/ hardcoded V=Long), and further effort in moving to CMS[K, V]. Since I don't know how important the V part is, I raised my question above: on my side I do see the need for the K change, but I haven't yet run into a use case that would require the V change (I experimented with SketchMap only because it allowed me to control K, not because I needed a V other than Long). I understand that the V part is the key (pun intended) to unlocking the door that deprecates SketchMap.

I'm fine either way. :-)

miguno commented Oct 2, 2014

meet-semilattice Semigroup[V] for querying

I assume this is because of the min in the point query formula â_i = min_j(count[j, h_j(i)]) of CMS, where â_i is the frequency estimate, right? That is, the frequency estimate is guaranteed to be at least the true frequency (CMS never underestimates), and taking the minimum across the rows gives the tightest such upper bound. So we'd need a way to compute the infimum/minimum of a sequence of V's.

If that's true, then I think the code that would require the Semigroup[V] form is limited to CMSInstance, which is the only place where min is used, and the only place where we do a "real" point query aka frequency():

/**
 * The general Count-Min sketch structure, used for holding any number of elements.
 */
case class CMSInstance(countsTable: CMSCountsTable, totalCount: Long,
  hhs: HeavyHitters, params: CMSParams) extends CMS {

  ...

  def frequency(item: Long): Approximate[Long] = {
    val estimates = countsTable.counts.zipWithIndex.map { case (row, i) => row(params.hashes(i)(item)) }
    makeApprox(estimates.min) // <<< HERE
  }

  def innerProduct(other: CMS): Approximate[Long] = {
    other match {
      case other: CMSInstance => {
        assert((other.depth, other.width) == (depth, width), "Tables must have the same dimensions.")

        def innerProductAtDepth(d: Int) = (0 to (width - 1)).map { w =>
          countsTable.getCount(d, w) * other.countsTable.getCount(d, w)
        }.sum // <<< this would be covered already since we'd also have Monoid[V]

        val est = (0 to (depth - 1)).map { innerProductAtDepth(_) }.min // <<< AND HERE
        Approximate(est - (eps * totalCount * other.totalCount).toLong, est, est, 1 - delta)
      }
      case _ => other.innerProduct(this)
    }
  }

  ...
}

The current CMS implementation also provides innerProduct(), and here min is used, too.
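A minimal sketch of that min-as-meet reading of the point query (illustrative only, not the algebird code):

```scala
// Point-query estimate = minimum of the counters this item hashes to in
// each of the `depth` rows; generically this is a reduce with Ordering.min.
def pointQueryEstimate[V](perRowCounters: Seq[V])(implicit ord: Ordering[V]): V =
  perRowCounters.reduce(ord.min)

// Three rows report 7, 3, 5 for the same item; the estimate is their min.
val est = pointQueryEstimate(Seq(7L, 3L, 5L))
```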

miguno commented Oct 6, 2014

I updated my comments above (to keep the information in one place) with additional remarks.

jnievelt commented Oct 6, 2014

Right about the meet-semilattice. Ideally we should have at least the following invariant:

sg.plus(x, monoid.plus(x, y)) == x

Note that CMS already doesn't support negative numbers because they would violate the above.
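The invariant can be spot-checked with plus = Long addition and the meet sg.plus = min (a quick sketch, which also shows why a negative increment breaks it):

```scala
// sg.plus(x, monoid.plus(x, y)) == x should hold for all counted y.
def monoidPlus(x: Long, y: Long): Long = x + y       // construction/combination
def sgPlus(x: Long, y: Long): Long = math.min(x, y)  // querying (meet)

val holds  = sgPlus(3L, monoidPlus(3L, 4L)) == 3L    // min(3, 7) == 3
val broken = sgPlus(3L, monoidPlus(3L, -2L)) == 3L   // min(3, 1) == 1, invariant violated
```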

You're right about the Numeric[V], but I believe this is only leveraged in the frequency and innerProduct methods themselves, so we could make it implicit there? Of course we'd have to have separate methods for the frequency estimate and the approximation, but I don't think that's very bad. For example, updateHeavyHitters already calls frequency only to get the estimate.

miguno commented Oct 6, 2014

Regarding Numeric[V]:

Please do correct me if I'm wrong, but I think we'd need this elsewhere too.

case class CMSItem(item: Long, params: CMSParams) extends CMS {
  ...
  def totalCount = 1L   // <<< this one
  ...
}

(This is one of the examples I included above w.r.t. Numeric[V], just shortened even further.)

From what I understand we need a zero and a one element for V, which would replace the current uses of 0L and 1L in the code, respectively. The zero is required e.g. to initialize the counting table (currently a 2-dimensional array of Long values) and the CMSZero implementation. The one is required for i++-like operations. If this assumption of mine is true, then we'd need Numeric[V] in more places than just frequency() and innerProduct().

We could still make it implicit, though it would spread throughout larger parts of the CMS code. Thinking about it, this might be a hint that there is a closer coupling between Ordering and Numeric given the CMS[K, V] semantics. Thanks to the current use of Long we got all of these properties for free.
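For instance, initializing the counting table generically would draw on that zero (a sketch with an illustrative helper name):

```scala
// Hypothetical table initialization: a depth x width grid filled with the
// zero of V, replacing the current Array-of-0L-style code.
def emptyCountsTable[V](depth: Int, width: Int)(implicit num: Numeric[V]): Vector[Vector[V]] =
  Vector.fill(depth, width)(num.zero)

val table = emptyCountsTable[BigInt](3, 4)
```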

miguno commented Oct 7, 2014

Also, something that I realized only now though it might have been obvious to others:

CMS[K, V] - a query-only CMS that does not have any heavy hitters

Joe's suggestion to split the current CMS implementation into a counting-only CMS and the associated top N/% CMS's that add heavy hitters functionality also changes the memory footprint of CMS.

That is, the counting-only CMS would be a fixed-memory data structure whose size is independent of the number of elements being counted (i.e. independent of both the number of unique elements and the number of count operations, aka totalCount). Once you have chosen eps and delta (we can ignore seed), the size of such a counting-only CMS instance is fixed. Roughly speaking: O(w x d), with w and d being derived from eps and delta, respectively.

The heavy hitter variants are different. Since they track heavy hitters, their size depends on how many elements qualify as heavy hitters. Roughly speaking: O(m) with m <= n, where m is the number of the n distinct elements that make it into the heavy hitters. This memory footprint adds to the footprint of the counting-only CMS.

(The current CMS implementation is a combination of the counting-only CMS and the "top %" CMS variant discussed above, and its memory footprint is thus dependent on w, d, and m.)

I think these characteristics -- and hence Joe's suggestion -- may become important once we move from CMS w/ hardcoded K=Long / V=Long to a parameterized CMS[K, V]. Depending on your heavy-hitter parameter choices (either N or %), it becomes easier to run into OOMs with types larger than Long for K (and V, for that matter), notably with the heavy-hitter CMS variants. Similarly, for use cases where OOMs are a concern it might be helpful to have the option of limiting yourself to the counting-only CMS, something which is currently not possible.
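For reference, w and d follow from eps and delta via the standard Count-Min sizing formulas (a sketch; the exact rounding in algebird may differ slightly):

```scala
// Standard CMS sizing: w = ceil(e / eps) columns, d = ceil(ln(1 / delta))
// rows. The resulting d x w table is fixed once eps/delta are chosen,
// independent of how many elements are counted.
def width(eps: Double): Int = math.ceil(math.E / eps).toInt
def depth(delta: Double): Int = math.ceil(math.log(1.0 / delta)).toInt

val w = width(0.005) // ~544 columns
val d = depth(1e-7)  // ~17 rows
```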

jnievelt commented Oct 7, 2014

Regarding the totalCount -- I do believe there is a deeper entanglement. Really it seems that all of the logic around approximation relies not only on having a Numeric[V], but also that the Numeric[V] is the Monoid (i.e., plus is regular old plus). It may be the case that the proper separation of these provides a "top %" CMS only for Numeric[V], with frequency (approximation) and innerProduct only available there. It can then infer the Monoid, Ordering, and meet-semilattice Semigroup (from Numeric's plus, zero, compare, and min).
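That inference can be sketched directly from scala.math.Numeric (illustrative helper names, not algebird's typeclass machinery):

```scala
// A single Numeric[V] yields all three structures mentioned above: the
// monoid plus, the Ordering (Numeric extends Ordering), and the meet.
def monoidPlus[V](implicit n: Numeric[V]): (V, V) => V = n.plus
def ordering[V](implicit n: Numeric[V]): Ordering[V]   = n
def meet[V](implicit n: Numeric[V]): (V, V) => V       = (x, y) => n.min(x, y)
```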

Regarding fixed-size CMS, it's true that a basic CMS would remove variability in size related to the size of K (the number of heavy hitters for both variants of CMS is already bounded if we allow heavyHittersPct to be fixed). But there is still potential for variance in size in V (consider a naive CMS[String, Set[String]]). That being said, you're probably correct in practice as folks are more likely to use the compact, fixed-size structures for V.

miguno commented Oct 7, 2014

Hmm, I see your point. Given your comments (and also what I wrote earlier), how much value would it actually add then if we were to untangle this in order to parameterize V, too, instead of parameterizing only K?

I mean, if it turns out to be mostly an academic exercise then it might not be worth the effort, and we could just stick with V=Long. Maybe I'm biased but I can't really see a reason to use something like a CMS[String, Set[String]]. :-) That is, use cases where V is not a number and thus has no Numeric[V]. And if V is a number, then similarly I can't think of many (if any) reasons not to use a positive integer (whole number) anyway, and here Long seems to be the one-size-fits-all data type (yes, pun intended). But again this might be my biased point of view, of course.

Do you see or even have a use case for parameterizing V that would justify the required effort?

jnievelt commented Oct 7, 2014

I don't think there's a ton of immediate value, and I would fully support going to CMS[K]. I don't think it will be a lot of trouble to later make it CMS[K, V : Numeric].

The value, I think, is in de-duplicating code between CMS and SketchMap. Of course we'd want to verify that it can be done in a sufficiently performant way, but it sounds like you have a more immediate need that shouldn't be blocked by all this.

miguno commented Oct 7, 2014

Ok, that sounds like a plan. I'll take a stab at updating the CMS[K] draft, where applicable, with any of the changes discussed above (excluding parameterizing V), and then send a "real" pull request.

avibryant commented Oct 8, 2014

FWIW: one non-Numeric parameterization of V which I think is practical and interesting to think about is where V is an approximate set like hyperloglog, minhash signature, or bloom filter. Each of these is themselves represented by a vector of something Numeric, and their meet-semilattice for the purposes of CMS would be an element-wise min. Ultimately it would be great if our base CMS abstraction was able to support this (I really liked @jnievelt's original decomposition), but understood that this is a lot more work than just starting with CMS[K].
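Avi's element-wise min over vector-backed sketch values can be sketched as follows (illustrative only; real HLL/minhash register semantics are more involved than plain Int registers):

```scala
// The meet for a vector-of-registers V: element-wise min of two
// equal-length register vectors, as described above.
def elementWiseMin(a: Vector[Int], b: Vector[Int]): Vector[Int] = {
  require(a.length == b.length, "register vectors must have the same length")
  a.zip(b).map { case (x, y) => math.min(x, y) }
}

val merged = elementWiseMin(Vector(3, 1, 4), Vector(2, 5, 2))
```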

miguno commented Oct 8, 2014

@avibryant Thanks for the feedback, Avi. As I said above I will try to modify the current CMS[K] similar to @jnievelt's suggestions above. That would at least pave the way (a little bit) to move to a fully parameterized CMS[K, V] later on, if needed.

miguno commented Oct 12, 2014

Status update: I got a first version of the refactored code. I'll follow up with the actual pull request shortly. In the meantime I wanted to give you a heads up and a high-level summary of the upcoming PR.

Incoming changes

I have split the conflated "counting/querying" and "heavyHitters" responsibilities of the original CMS[K=Long] code as follows:

  • Two traits
    • trait CMSCounting[K, C[_]] -- contains the contract of all the previous CMS functionality excluding heavy hitters (see next trait)
    • trait CMSHeavyHitters[K] -- contains only the heavy hitters functionality, i.e. heavyHittersPct: Double and def heavyHitters: Set[K]
    • FYI: As you can see below I decided not to implement TopPctCMS[K] -- the "top %" CMS -- as a direct subclass of CMS[K], but rather split the contract into two traits, and then mix-in those traits into the respective CMS variants. This simplifies e.g. the pattern matching (e.g. the three TopPctCMS classes must only pattern match against their direct siblings); it also simplifies the return type of the API defined in the traits, because e.g. a TopPctCMS will always return a TopPctCMS. If you e.g. want to treat a TopPctCMS as a "normal" count-min sketch, just check against the CMSCounting trait.
  • Two implementations of these traits. The first is a counting/querying-only CMS named CMS[K], the second is a combination of the first and the heavy hitters (top %) functionality, named TopPctCMS[K].
    • CMS[K], with three sub-classes CMS{Zero, Item, Instance} (= hierarchy that matches previous implementation), a corresponding CMSMonoid, etc.

      sealed abstract class CMS[K: Ordering](params: CMSParams[K])
        extends java.io.Serializable with CMSCounting[K, CMS]
    • TopPctCMS[K], with three sub-classes TopPctCMS{Zero, Item, Instance} (= hierarchy that matches previous implementation), a corresponding TopPctCMSMonoid, etc..

      sealed abstract class TopPctCMS[K: Ordering](val cms: CMS[K], params: TopPctCMSParams)
         extends java.io.Serializable with CMSCounting[K, TopPctCMS] with CMSHeavyHitters[K]
  • I also improved the respective code documentation (i.e. scaladoc) so that users have an easier time understanding how they can use the CMS code. Notably, I documented the purpose of the K parameter and the recommended types ("gimme the TL;DR!").
  • Specs/tests:
    • I have updated the unit tests to check the monoid laws for both CMS[K] and TopPctCMS[K].

    • I have opted to focus the "normal" tests on TopPctCMS[K], as it encapsulates CMS[K] for all the counting/querying, i.e. by testing one I also test the other. That is not perfect, but I haven't come up with an easy way to rewrite the specs so that it is easy to a) "loop" through both CMS implementations AND b) "loop" through various types for K. Currently, I have added a type parameter to the original CMSTest class, and am "looping" through K types like so:

      class CMSShortTest extends CMSTest[Short]
      class CMSIntTest extends CMSTest[Int]
      class CMSLongTest extends CMSTest[Long]
      class CMSBigIntTest extends CMSTest[BigInt]
      
      abstract class CMSTest[K: Ordering: CMSHasher: Numeric] extends WordSpec with Matchers { ... }
    • The changes required to update the specs/tests were minimal, too, which is a good sign. The only reason the tests look slightly more complex than before is that I needed a way to support "looping" through various K types. Here I am exploiting the fact that Numeric[K] is available for Short, Int, Long, BigInt out of the box, and that Numeric provides a fromInt(x: Int): K method, which I now use to create the test input data (e.g. Seq(1L, 2L) has become Set(1, 2) with some implicit conversions that turn 1 into 1L or BigInt(1) etc. as needed).

  • From the user perspective / API basically nothing has changed, which is good (see my comment on specs/tests above). The user only needs to pick the CMS variant she wants, e.g. via CMS.monoid[Long](eps, delta, seed) or TopPctCMS.monoid[Long](eps, delta, seed, heavyHittersPct).
  • Because CMS is now K-parameterized, so is CMSHash[K]. Because of that I needed to update e.g. SketchMap to use CMSHash[Long] where it was just CMSHash before. Note that I did not change SketchMap to use CMSHash[K] (SketchMap itself is K-parameterized) because the original SketchMap code apparently worked perfectly fine with Long-backed hashes, and I didn't want to open up another can of worms by touching SketchMap, too.
  • I added a CMSBenchmark that runs caliper benchmarks against TopPctCMS for K=Long and K=BigInt. Here, I am benchmarking counting the numbers 1..100 (Long + BigInt) and 100 random 2048-bit numbers (BigInt only). This benchmark is not identical to the previous one because cappi enforces an older version of caliper, so I couldn't reuse my benchmark code (see above) as is.
    • Side note: The cappi sbt plugin is not perfect (I noticed @posco already filed one bug report). It still uses caliper 0.5-something, which is from 2012. I found a way to customize its settings to use the latest caliper (1.0-beta-1), but because of some backwards-incompatible caliper changes cappi won't work with that version (the Runner API changed). Hence I couldn't just copy my benchmark code from above, which is based on @MacroBenchmark. Instead, I am using the SimpleBenchmark API of caliper 0.5, similar to the other benchmarks in algebird-caliper.
  • I created a short README that explains how to use algebird-caliper.
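The Numeric-based test-data trick described above can be sketched as follows (illustrative helper, not the actual spec code):

```scala
// One generator serves all K types under test: Numeric[K].fromInt lifts
// the Int inputs 1..n into Long, BigInt, etc.
def testData[K](n: Int)(implicit num: Numeric[K]): Seq[K] = (1 to n).map(num.fromInt)

val longs   = testData[Long](3)   // 1L, 2L, 3L
val bigints = testData[BigInt](3) // BigInt(1), BigInt(2), BigInt(3)
```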

Caliper benchmark results of latest PR code

Micro benchmarks (via SimpleBenchmark in caliper), with cappi (sbt plugin) and caliper 0.5. I added the previous benchmark results in <<< comments.

Note that the previous benchmark results were based on caliper 1.0-beta-1 and macro benchmarks. For this reason I wouldn't read too much into the absolute before-after numbers, but rather point out that the relative performance numbers of the new benchmark are close to the relative performance numbers of the old benchmark (e.g. Long being about 2-3x as fast as BigInt for the same benchmark scenario, and BigInt being 3x slower when we draw a hundred random 2048 bit numbers instead of counting a hundred small numbers, i.e. 1..100).

Edit: Re-run all tests with laptop connected to power supply.

[info] Running com.google.caliper.Runner com.twitter.algebird.caliper.CMSBenchmark
[info]  0% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithLongCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 310584.31 ns; σ=2170.53 ns @ 3 trials
[info] 17% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithBigIntCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 862000.46 ns; σ=21005.66 ns @ 10 trials
[info] 33% Scenario{vm=java, trial=0, benchmark=PlusOfRandom2048BitNumbersWithBigIntCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 3400540.65 ns; σ=31420.80 ns @ 3 trials
[info] 50% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithLongCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 362521.41 ns; σ=3097.73 ns @ 4 trials
[info] 67% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithBigIntCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 919076.25 ns; σ=8675.47 ns @ 7 trials
[info] 83% Scenario{vm=java, trial=0, benchmark=PlusOfRandom2048BitNumbersWithBigIntCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 3459892.49 ns; σ=34131.65 ns @ 6 trials
[info]
[info]                               benchmark   eps   us linear runtime
[info]   PlusOfFirstHundredIntegersWithLongCms   0.1  311 ==
[info]   PlusOfFirstHundredIntegersWithLongCms 0.005  363 ===   <<< now 363us, before 360us (w/ caliper 1.0-beta-1 and macro benchmark)
[info] PlusOfFirstHundredIntegersWithBigIntCms   0.1  862 =======
[info] PlusOfFirstHundredIntegersWithBigIntCms 0.005  919 =======   <<< now 919us, before 912us (w/ caliper 1.0-beta-1 and macro benchmark)
[info] PlusOfRandom2048BitNumbersWithBigIntCms   0.1 3401 =============================
[info] PlusOfRandom2048BitNumbersWithBigIntCms 0.005 3460 ==============================  <<< now 3,460us, before 3,067us (w/ caliper 1.0-beta-1 and macro benchmark)
[info]
[info] vm: java
[info] trial: 0
[info] delta: 0.0000001
[info] heavyHittersPct: 0.2
[info] maxBits: 2048
[info] operations: 100
[success] Total time: 62 s, completed Oct 13, 2014 10:36:54 AM

Benchmark results for the previous CMS implementation (w/ hardcoded K=Long):

> cappi::benchmarkOnly com.twitter.algebird.caliper.CMSBenchmark
[info] Running com.google.caliper.Runner com.twitter.algebird.caliper.CMSBenchmark
[info]  0% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithHardcodedLongCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 304399.97 ns; σ=7517.47 ns @ 10 trials
[info] 50% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithHardcodedLongCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 373643.43 ns; σ=7146.40 ns @ 10 trials
[info]
[info]   eps  us linear runtime
[info]   0.1 304 ========================
[info] 0.005 374 ==============================   <<< now 374us, before 354us (w/ caliper 1.0-beta-1 and macro benchmark)
[info]
[info] vm: java
[info] trial: 0
[info] benchmark: PlusOfFirstHundredIntegersWithHardcodedLongCms
[info] delta: 0.0000001
[info] heavyHittersPct: 0.2
[info] maxBits: 2048
[info] operations: 100
[success] Total time: 30 s, completed Oct 13, 2014 9:35:31 AM

(Note: The ASCII performance bar ==== of the second benchmark output above is not normalized to the scale of the first benchmark output further above.)

Summary results, variant A

Note: "Current" results with cappi, caliper 0.5 and micro benchmark. "Before" results: caliper 1.0-beta-1 and macro benchmark.

  • CMS (hardcoded K=Long): 363us, before 354us (also: 160k writes/s)
  • CMS[Long] (*): 384us, before 360us (also: 160k writes/s)
  • CMS[BigInt] (*): 919us, before 912us (also: 65k writes/s)
  • SketchMap[Long, Long]: N/A, before 38,382us (also: N/A)
  • SketchMap[BigInt, Long]: N/A, before 38,377us (also: 0.3k writes/s)

Summary results, variant B

Note: "Current" results with cappi, caliper 0.5 and micro benchmark. "Before" results: caliper 1.0-beta-1 and macro benchmark.

  • CMS[BigInt] (*): 3,460us, before 3,067us (also: 22k writes/s)
  • SketchMap[BigInt, Long]: N/A, before 119,841us (also: N/A)

Regarding names of classes

Feel free to let me know if you prefer different names than CMS and TopPctCMS. In general tried to keep the names as close as possible to the original code. Personally, I do not really like the name "TopPctCMS" -- because the heavy hitters it tracks are not really the "top N%" CMS; rather, they are those elements in the data stream that have occurred >= a minimum/treshold percentage, and they are provided in desendingly sorted order. In other words, if you set heavyHittersPct = 0.1, you won't get the top 10% of elements (e.g. 20 out of 200) in the true sense of the word. (Also, heavyHittersPct might be better called sth like minimumPct or minOccurrencePct.)
On the other hand, I couldn't come up with a more descriptive name that is also short and/or that stays close to the terminology used in the CMS literature. Hence I kept the name TopPctCMS for the time being.

Future changes

Lastly, it might make sense at this point to do one or more of the following:

  • Add a cms sub-package to better structure the currently flat namespace. We have many CMS* and TopPctCMS* classes in the com.twitter.algebird namespace now.
  • Possibly split CMS[K] and TopPctCMS[K] into two separate files instead of the single CountMinSketch.scala (same for the test spec).

PS: I apologize for the verbosity. Since there's at least one ocean between us, I'd rather err on the side of giving you too much information. ;-)

miguno commented Oct 12, 2014

Status update: I got a first version of the refactored code. I'll follow up with the actual pull request shortly. In the meantime I wanted to give you a heads up and a high-level summary of the upcoming PR.

Incoming changes

I have split the conflated "counting/querying" and "heavyHitters" responsibilities of the original CMS[K=Long] code as follows:

  • Two traits
    • trait CMSCounting[K, C[_]] -- contains the contract of all the previous CMS functionality excluding heavy hitters (see next trait)
    • trait CMSHeavyHitters[K] -- contains only the heavy hitters functionality, i.e. heavyHittersPct: Double and def heavyHitters: Set[K]
    • FYI: As you can see below, I decided not to implement TopPctCMS[K] -- the "top %" CMS -- as a direct subclass of CMS[K], but rather to split the contract into two traits and then mix those traits into the respective CMS variants. This simplifies, e.g., the pattern matching (the three TopPctCMS classes only need to match against their direct siblings); it also simplifies the return type of the API defined in the traits, because a TopPctCMS will always return a TopPctCMS. If you want to treat a TopPctCMS as a "normal" count-min sketch, just match against the CMSCounting trait.
  • Two implementations of these traits. The first is a counting/querying-only CMS named CMS[K]; the second combines the first with the heavy hitters (top %) functionality and is named TopPctCMS[K].
    • CMS[K], with three sub-classes CMS{Zero, Item, Instance} (= hierarchy that matches previous implementation), a corresponding CMSMonoid, etc.

      sealed abstract class CMS[K: Ordering](params: CMSParams[K])
        extends java.io.Serializable with CMSCounting[K, CMS]
    • TopPctCMS[K], with three sub-classes TopPctCMS{Zero, Item, Instance} (= hierarchy that matches the previous implementation), a corresponding TopPctCMSMonoid, etc.

      sealed abstract class TopPctCMS[K: Ordering](val cms: CMS[K], params: TopPctCMSParams)
         extends java.io.Serializable with CMSCounting[K, TopPctCMS] with CMSHeavyHitters[K]
  • I also improved the respective code documentation (i.e. scaladoc) so that users have an easier time understanding how they can use the CMS code. Notably, I documented the purpose of the K parameter and which types are recommended ("gimme the TL;DR!").
  • Specs/tests:
    • I have updated the unit tests to check the monoid laws for both CMS[K] and TopPctCMS[K].

    • I have opted to focus the "normal" tests on TopPctCMS[K], as it encapsulates CMS[K] for all the counting/querying, i.e. by testing one I also test the other. That is not perfect, but I haven't come up with an easy way to rewrite the specs such that it is easy to a) "loop" through both CMS implementations AND b) "loop" through various types for K. Currently, I have added a type parameter to the original CMSTest class, and am "looping" through K types like so:

      class CMSShortTest extends CMSTest[Short]
      class CMSIntTest extends CMSTest[Int]
      class CMSLongTest extends CMSTest[Long]
      class CMSBigIntTest extends CMSTest[BigInt]
      
      abstract class CMSTest[K: Ordering: CMSHasher: Numeric] extends WordSpec with Matchers { ... }
    • The changes required to update the specs/tests were minimal, too, which is a good sign. The only reason the tests look slightly more complex than before is that I needed a way to support "looping" through various K types. Here I am exploiting the fact that Numeric[K] is available for Short, Int, Long, BigInt out of the box, and that Numeric provides a fromInt(x: Int): K method, which I now use to create the test input data (e.g. Seq(1L, 2L) has become Set(1, 2) with some implicit conversions that turn 1 into 1L or BigInt(1) etc. as needed).

  • From the user/API perspective basically nothing has changed, which is good (see my comment on specs/tests above). The user only needs to pick the CMS variant she wants, e.g. via CMS.monoid[Long](eps, delta, seed) or TopPctCMS.monoid[Long](eps, delta, seed, heavyHittersPct).
  • Because CMS is now K-parameterized, so is CMSHash[K]. And because of that I needed to update e.g. SketchMap to use CMSHash[Long] where it was just CMSHash before. Note that I did not change SketchMap to use CMSHash[K] (SketchMap itself is K-parameterized) because the original SketchMap code apparently worked perfectly fine with Long-backed hashes, and I didn't want to open another can of worms by touching SketchMap, too.
  • I added a CMSBenchmark that runs caliper benchmarks against TopPctCMS for K=Long and K=BigInt. Here, I am benchmarking counting the numbers 1..100 (Long + BigInt) and 100 random 2048-bit numbers (BigInt only). This benchmark is not identical to the previous one because cappi enforces an older version of caliper, so I couldn't reuse my benchmark code (see above) as is.
    • Side note: The cappi sbt plugin has some rough edges (I noticed @posco already filed one bug report). It still uses caliper 0.5-something, which is from 2012. I found a way to customize its settings to use the latest caliper (1.0-beta-1), but because of some backwards-incompatible caliper changes cappi won't work with that version (the Runner API changed). Hence I couldn't just copy my benchmark code from above, which is based on @MacroBenchmark. Instead, I am using the SimpleBenchmark API of caliper 0.5, similar to the other benchmarks in algebird-caliper.
  • I created a short README that explains how to use algebird-caliper.
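The Numeric-based trick for generating type-generic test data mentioned in the specs/tests bullet above can be sketched in plain Scala. Note that `inputs` is a hypothetical helper name for illustration; the actual test code may be organized differently.

```scala
// Sketch of the Numeric-based trick described above: Numeric[K].fromInt turns
// plain Ints into the K type under test, so one spec can cover many K types.
// (Plain Scala, no test framework; `inputs` is a hypothetical helper name.)
def inputs[K](upTo: Int)(implicit num: Numeric[K]): Seq[K] =
  (1 to upTo).map(num.fromInt)

// Numeric instances for Short, Int, Long, and BigInt ship with the standard library.
assert(inputs[Long](3) == Seq(1L, 2L, 3L))
assert(inputs[BigInt](3) == Seq(BigInt(1), BigInt(2), BigInt(3)))
```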

Caliper benchmark results of latest PR code

Micro benchmarks (via SimpleBenchmark in caliper), with cappi (sbt plugin) and caliper 0.5. I added the previous benchmark results in <<< comments.

Note that the previous benchmark results were based on caliper 1.0-beta-1 and macro benchmarks. For this reason I wouldn't read too much into the absolute before-after numbers; instead I'd point out that the relative performance numbers of the new benchmark are close to those of the old benchmark (e.g. Long being about 2-3x as fast as BigInt for the same benchmark scenario, and BigInt being 3x slower when we draw a hundred random 2048-bit numbers instead of counting a hundred small numbers, i.e. 1..100).
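For reference, here is a hypothetical sketch of how the "100 random 2048-bit numbers" input could be generated; the actual CMSBenchmark code may differ.

```scala
// Hypothetical sketch of the random 2048-bit BigInt benchmark input.
// BigInt(numbits, rnd) draws a uniformly random non-negative number
// of up to `numbits` bits.
import scala.util.Random

val maxBits = 2048
val rnd = new Random(1) // fixed seed for reproducible benchmark inputs
val randomNumbers: Seq[BigInt] = (1 to 100).map(_ => BigInt(maxBits, rnd))
```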

Edit: Re-ran all tests with the laptop connected to a power supply.

[info] Running com.google.caliper.Runner com.twitter.algebird.caliper.CMSBenchmark
[info]  0% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithLongCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 310584.31 ns; σ=2170.53 ns @ 3 trials
[info] 17% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithBigIntCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 862000.46 ns; σ=21005.66 ns @ 10 trials
[info] 33% Scenario{vm=java, trial=0, benchmark=PlusOfRandom2048BitNumbersWithBigIntCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 3400540.65 ns; σ=31420.80 ns @ 3 trials
[info] 50% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithLongCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 362521.41 ns; σ=3097.73 ns @ 4 trials
[info] 67% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithBigIntCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 919076.25 ns; σ=8675.47 ns @ 7 trials
[info] 83% Scenario{vm=java, trial=0, benchmark=PlusOfRandom2048BitNumbersWithBigIntCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 3459892.49 ns; σ=34131.65 ns @ 6 trials
[info]
[info]                               benchmark   eps   us linear runtime
[info]   PlusOfFirstHundredIntegersWithLongCms   0.1  311 ==
[info]   PlusOfFirstHundredIntegersWithLongCms 0.005  363 ===   <<< now 363us, before 360us (w/ caliper 1.0-beta-1 and macro benchmark)
[info] PlusOfFirstHundredIntegersWithBigIntCms   0.1  862 =======
[info] PlusOfFirstHundredIntegersWithBigIntCms 0.005  919 =======   <<< now 919us, before 912us (w/ caliper 1.0-beta-1 and macro benchmark)
[info] PlusOfRandom2048BitNumbersWithBigIntCms   0.1 3401 =============================
[info] PlusOfRandom2048BitNumbersWithBigIntCms 0.005 3460 ==============================  <<< now 3,460us, before 3,067us (w/ caliper 1.0-beta-1 and macro benchmark)
[info]
[info] vm: java
[info] trial: 0
[info] delta: 0.0000001
[info] heavyHittersPct: 0.2
[info] maxBits: 2048
[info] operations: 100
[success] Total time: 62 s, completed Oct 13, 2014 10:36:54 AM

Benchmark results for the previous CMS implementation (w/ hardcoded K=Long):

> cappi::benchmarkOnly com.twitter.algebird.caliper.CMSBenchmark
[info] Running com.google.caliper.Runner com.twitter.algebird.caliper.CMSBenchmark
[info]  0% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithHardcodedLongCms, delta=0.0000001, eps=0.1, heavyHittersPct=0.2, maxBits=2048, operations=100} 304399.97 ns; σ=7517.47 ns @ 10 trials
[info] 50% Scenario{vm=java, trial=0, benchmark=PlusOfFirstHundredIntegersWithHardcodedLongCms, delta=0.0000001, eps=0.005, heavyHittersPct=0.2, maxBits=2048, operations=100} 373643.43 ns; σ=7146.40 ns @ 10 trials
[info]
[info]   eps  us linear runtime
[info]   0.1 304 ========================
[info] 0.005 374 ==============================   <<< now 374us, before 354us (w/ caliper 1.0-beta-1 and macro benchmark)
[info]
[info] vm: java
[info] trial: 0
[info] benchmark: PlusOfFirstHundredIntegersWithHardcodedLongCms
[info] delta: 0.0000001
[info] heavyHittersPct: 0.2
[info] maxBits: 2048
[info] operations: 100
[success] Total time: 30 s, completed Oct 13, 2014 9:35:31 AM

(Note: The ASCII performance bar ==== of the second benchmark output above is not normalized to the scale of the first benchmark output further above.)

Summary results, variant A

Note: "Current" results with cappi, caliper 0.5 and micro benchmark. "Before" results: caliper 1.0-beta-1 and macro benchmark.

  • CMS (hardcoded K=Long): 363us, before 354us (also: 160k writes/s)
  • CMS[Long] (*): 384us, before 360us (also: 160k writes/s)
  • CMS[BigInt] (*): 919us, before 912us (also: 65k writes/s)
  • SketchMap[Long, Long]: N/A, before 38,382us (also: N/A)
  • SketchMap[BigInt, Long]: N/A, before 38,377us (also: 0.3k writes/s)

Summary results, variant B

Note: "Current" results with cappi, caliper 0.5 and micro benchmark. "Before" results: caliper 1.0-beta-1 and macro benchmark.

  • CMS[BigInt] (*): 3,460us, before 3,067us (also: 22k writes/s)
  • SketchMap[BigInt, Long]: N/A, before 119,841us (also: N/A)

Regarding names of classes

Feel free to let me know if you prefer different names than CMS and TopPctCMS. In general I tried to keep the names as close as possible to the original code. Personally, I do not really like the name "TopPctCMS" -- the heavy hitters it tracks are not really the "top N%"; rather, they are those elements in the data stream that have occurred at least a minimum/threshold percentage of the time, and they are provided in descending order of frequency. In other words, if you set heavyHittersPct = 0.1, you won't get the top 10% of elements (e.g. 20 out of 200) in the true sense of the word. (Also, heavyHittersPct might be better called something like minimumPct or minOccurrencePct.)
On the other hand, I couldn't come up with a more descriptive name that is also short and/or stays close to the terminology used in the CMS literature. Hence I kept the name TopPctCMS for the time being.
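To make the threshold semantics concrete, here is a plain-Scala illustration (exact counts, independent of the CMS code) of what the heavy-hitters threshold means; the element values and heavyHittersPct = 0.2 are chosen purely for the example.

```scala
// Plain-Scala illustration (exact counts, no CMS involved): heavyHittersPct is a
// minimum-frequency threshold, not a "top N%" cut.
val heavyHittersPct = 0.2
val stream = Seq("a", "a", "a", "a", "b", "b", "c", "d", "e", "f")
val total = stream.size // 10, so the threshold is 0.2 * 10 = 2 occurrences
val counts = stream.groupBy(identity).map { case (k, v) => k -> v.size }

// "a" (4x) and "b" (2x) meet the threshold; c, d, e, f (1x each) do not --
// so we get 2 of the 6 distinct elements, not "the top 20%" of them.
val heavyHitters = counts.collect { case (k, c) if c >= heavyHittersPct * total => k }.toSet
```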

Future changes

Lastly, it might make sense at this point to do one or more of the following:

  • Add a cms sub-package to better structure the currently flat namespace. We have many CMS* and TopPctCMS* classes in the com.twitter.algebird namespace now.
  • Possibly split CMS[K] and TopPctCMS[K] into two separate files instead of the single CountMinSketch.scala (same for the test spec).

PS: I apologize for the verbosity. Since there's at least one ocean between us, I'd rather err on the side of giving you too much information. ;-)

miguno pushed a commit to miguno/algebird that referenced this issue Nov 18, 2014

johnynek added a commit that referenced this issue Nov 19, 2014

Merge pull request #354 from miguno/GH-345
GH-345: Parameterize CMS to CMS[K] and decouple counting/querying from heavy hitters

johnynek added a commit that referenced this issue Dec 3, 2014

Merge pull request #361 from miguno/GH-345
Improve `require` setup for depth/delta and associated test spec

@miguno miguno referenced this issue Jan 1, 2015

Closed

CMSItems #390

@miguno miguno closed this Jan 14, 2015
