Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Range queries for CountMinSketch #58

Open
avibryant opened this issue Dec 8, 2012 · 3 comments
Open

Range queries for CountMinSketch #58

avibryant opened this issue Dec 8, 2012 · 3 comments

Comments

@avibryant
Copy link
Contributor

See eg http://www.cse.cuhk.edu.hk/~taoyf/course/5705f10/lec8.pdf for how this works.

I have a basic working implementation of the DyadicRange logic (for positive integer keys, which you can map whatever else to), but it needs to be hooked up to levels instances of CMS. We also want to implement binary search using this to get quantile queries.

case class DyadicRange(maxValue : Long = Long.MaxValue) {
  val levels = math.ceil(math.log(maxValue) / math.log(2)).toInt

  def indicesForPoint(v : Long) = (1 to levels).map{level => (level, indexForPoint(v, level))}
  def indicesForRange(start : Long, end : Long)  : List[(Int,Long)] = indicesForRange(start, end, levels)

  def indexForPoint(v : Long, level : Int) = v >> (level - 1)
  def rangeForIndex(i : Long, level : Int) = (i << (level-1), ((i+1) << (level-1)) - 1)

  def indicesForRange(start : Long, end : Long, maxLevel : Int) : List[(Int,Long)] = {
    if(start > end) {
      Nil
    } else if(start == end) {
      List((1, start))
    } else {
      val startIndex = indexForPoint(start, maxLevel)
      val (a,b) = rangeForIndex(startIndex, maxLevel)
      if(a >= start) {
        if(b <= end) {
          List((maxLevel, startIndex)) ++ indicesForRange(start, a - 1, maxLevel) ++ indicesForRange(b + 1, end, maxLevel) 
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      } else {
        val (a2, b2) = rangeForIndex(startIndex + 1, maxLevel)
        if(b2 <= end && a2 <= end) {
          List((maxLevel, startIndex + 1)) ++ indicesForRange(start, a2 - 1, maxLevel - 1) ++ indicesForRange(b2 + 1, end, maxLevel)
        } else {
          indicesForRange(start, end, maxLevel - 1)
        }
      }
    }
  }

}
@avibryant
Copy link
Contributor Author

BTW: the above could also be used with levels instances of Map[Long,V] where we have a Monoid[V], to allow arbitrary exact range queries with at most 2log_2(maxValue) plus() operations. We'd only want to use the exact version for sufficiently small values of maxValue, though if the Map is backed by disk storage, or sharded in memory across a cluster, this could still be reasonably large.

@johnynek
Copy link
Collaborator

This looks good to me. Some tests and I'll be convinced.

About the monoid approach, I like the idea. The problem is we are going to have some issues proving bounds. So, I wonder if we have a monoid on V and a metric on V, can we push the proof through so we can show that |sum(V) - approxSum(V)| < epsilon except with probability delta for some known epsilon, delta.

By the way, pushing this all through for any monoid with a metric is probably worth a paper, if it is possible. We may need to assume something about the structure of the metric.

@avibryant avibryant mentioned this issue Apr 10, 2013
@sritchie
Copy link
Collaborator

Nice! This blog post describes the dyadic range approach as well.

Apparently http://madlib.incubator.apache.org/ uses this algorithm to do range queries and estimate percentiles on big datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants