Merge with develop
johnynek committed Jul 1, 2015
2 parents 2fcf1cd + b1e35ac commit 7b188ad
Showing 92 changed files with 4,425 additions and 1,029 deletions.
12 changes: 9 additions & 3 deletions .travis.yml
@@ -1,4 +1,10 @@
language: scala
scala:
- 2.10.4
- 2.11.4
sudo: false
matrix:
include:
- scala: 2.10.4
script: ./sbt ++$TRAVIS_SCALA_VERSION clean test

- scala: 2.11.5
script: ./sbt ++$TRAVIS_SCALA_VERSION clean test
after_success: "./sbt coveralls"
60 changes: 60 additions & 0 deletions CHANGES.md
@@ -1,5 +1,65 @@
# Algebird #

### Version 0.10.2 ###
* QTree quantileBounds assert percentile <= 1.0 #447

### Version 0.10.1 ###
* Make HLL easier to use, add Hash128 typeclass #440
* add ! to ApproximateBoolean #442
* add QTreeAggregator and add approximatePercentileBounds to Aggregator #443
* Make level configurable in QTreeAggregators #444

### Version 0.10.0 ###
* HyperLogLogSeries #295
* CMS: add contramap to convert CMS[K] to CMS[L], add support for String and Bytes, remove Ordering context bound for K #399
* EventuallyAggregator and variants #407
* Add MultiAggregator.apply #408
* Return a MonoidAggregator from MultiAggregator when possible #409
* Add SummingWithHitsCache class to also track key hits. #410
* Add MapAggregator to compose tuples of (key, agg) pairs #411
* fix README.md. 2.9.3 no longer published #412
* Add Coveralls Badge to the README #413
* Add some combinators on MonoidAggregator #417
* Added function to safely downsize a HyperLogLog sketch #418
* AdaptiveCache #419
* fix property tests #421
* Make Preparer extend Serializable #422
* Make MutableBackedMap Serializable. #424
* A couple of performance optimizations: HyperLogLog and BloomFilter #426
* Adds a presenting benchmark and optimizes it #427
* Fixed broken links in README #428
* Speed up QTree #433
* Moments returns NaN when count is too low for higher order statistics #434
* Add HLL method to do error-based Aggregator #438
* Bump bijection to 0.8.0 #441

### Version 0.9.0 ###
* Replace mapValues with one single map to avoid serialization in frameworks like Spark. #344
* Add Fold trait for composable incremental processing (for develop) #350
* Add a GC friendly LRU cache to improve SummingCache #341
* BloomFilter should warn or raise on unrealistic input. #355
* GH-345: Parameterize CMS to CMS[K] and decouple counting/querying from heavy hitters #354
* Add Array Monoid & Group. #356
* Improvements to Aggregator #359
* Improve require setup for depth/delta and associated test spec #361
* Bump from 2.11.2 to 2.11.4 #365
* Move to sbt 0.13.5 #364
* Correct wrong comment in estimation function #372
* Add increments to all Summers #373
* removed duplicate semigroup #375
* GH-381: Fix serialization errors when using new CMS implementation in Storm #382
* Fix snoble's name #384
* Lift methods for Aggregator and MonoidAggregator #380
* applyCumulative method on Aggregator #386
* Add Aggregator.zip #389
* GH-388: Fix CMS test issue caused by roundtripping depth->delta->depth #395
* GH-392: Improve hashing of BigInt #394
* add averageFrom to DecayedValue #391
* Freshen up Applicative instances a bit #387
* less noise on DecayedValue tests #405
* Preparer #400
* Upgrade bijection to 0.7.2 #406

### Version 0.8.0 ###
* Removes deprecated monoid: https://github.com/twitter/algebird/pull/342
* Use some value classes: https://github.com/twitter/algebird/pull/340
17 changes: 9 additions & 8 deletions README.md
@@ -1,4 +1,5 @@
## Algebird [![Build Status](https://secure.travis-ci.org/twitter/algebird.png)](http://travis-ci.org/twitter/algebird)
## Algebird [![Build status](https://img.shields.io/travis/twitter/algebird/develop.svg)](http://travis-ci.org/twitter/algebird) [![Coverage status](https://img.shields.io/coveralls/twitter/algebird/develop.svg)](https://coveralls.io/r/twitter/algebird?branch=develop)


Abstract algebra for Scala. This code is targeted at building aggregation systems (via [Scalding](https://github.com/twitter/scalding) or [Storm](https://github.com/nathanmarz/storm)). It was originally developed as part of Scalding's Matrix API, where Matrices had values which are elements of Monoids, Groups, or Rings. Subsequently, it was clear that the code had broader application within Scalding and on other projects within Twitter.

@@ -47,17 +48,17 @@ Discussion occurs primarily on the [Algebird mailing list](https://groups.google

## Maven

Algebird modules are available on maven central. The current groupid and version for all modules is, respectively, `"com.twitter"` and `0.8.1`.
Algebird modules are available on maven central. The current groupid and version for all modules is, respectively, `"com.twitter"` and `0.10.2`.

Current published artifacts are

* `algebird-core_2.9.3`
* `algebird-core_2.11`
* `algebird-core_2.10`
* `algebird-test_2.9.3`
* `algebird-test_2.11`
* `algebird-test_2.10`
* `algebird-util_2.9.3`
* `algebird-util_2.11`
* `algebird-util_2.10`
* `algebird-bijection_2.9.3`
* `algebird-bijection_2.11`
* `algebird-bijection_2.10`

The suffix denotes the scala version.
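To pull one of these modules into an sbt build, a dependency line like the following would work (a sketch: `algebird-core` is just one example module, and the `%%` operator appends the Scala-version suffix shown above automatically):

```scala
// Illustrative build.sbt line: %% resolves to algebird-core_2.10 or
// algebird-core_2.11 depending on the build's scalaVersion.
libraryDependencies += "com.twitter" %% "algebird-core" % "0.10.2"
```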
@@ -68,7 +69,7 @@ The suffix denotes the scala version.
We didn't know about it when we started this code, but it seems like we're more focused on
large scale analytics.

> Why not use Scalaz's [Monoid](https://github.com/scalaz/scalaz/blob/master/core/src/main/scala/scalaz/Monoid.scala) trait?
> Why not use Scalaz's [Monoid](http://docs.typelevel.org/api/scalaz/stable/7.0.4/doc/#scalaz.Monoid) trait?
The answer is a mix of the following:
* The trait itself is tiny, we just need zero and plus, it is the implementations for all the types that are important. We wrote a code generator to derive instances for all the tuples, and by hand wrote monoids for List, Set, Option, Map, and several other objects used for counting (DecayedValue for exponential decay, AveragedValue for averaging, HyperLogLog for approximate cardinality counting). It's the instances that are useful in scalding and elsewhere.
@@ -82,7 +83,7 @@ The answer is a mix of the following:
* Avi Bryant <http://twitter.com/avibryant>
* Edwin Chen <http://twitter.com/echen>
* ellchow <http://github.com/ellchow>
* Mike Gagnon <https://twitter.com/MichaelNGagnon>
* Mike Gagnon <https://twitter.com/gmike>
* Moses Nakamura <https://twitter.com/mnnakamura>
* Steven Noble <http://twitter.com/snoble>
* Sam Ritchie <http://twitter.com/sritchie>
@@ -16,11 +16,9 @@

package com.twitter.algebird.bijection

import org.scalatest.{ PropSpec, Matchers }
import org.scalatest.prop.PropertyChecks
import org.scalacheck.{ Arbitrary, Properties }
import com.twitter.algebird.CheckProperties

class AlgebirdBijectionLaws extends PropSpec with PropertyChecks with Matchers {
class AlgebirdBijectionLaws extends CheckProperties {
// TODO: Fill in tests. Ideally we'd publish an algebird-testing
// module before merging this in.
}
@@ -4,11 +4,11 @@ import java.lang.Math._
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicLong

import com.google.caliper.{Param, SimpleBenchmark}
import com.twitter.algebird.{HyperLogLogMonoid, _}
import com.google.caliper.{ Param, SimpleBenchmark }
import com.twitter.algebird.{ HyperLogLogMonoid, _ }
import com.twitter.algebird.util.summer._
import com.twitter.bijection._
import com.twitter.util.{Await, Duration, FuturePool}
import com.twitter.util.{ Await, Duration, FuturePool }

import scala.util.Random

@@ -1,7 +1,7 @@
package com.twitter.algebird.caliper

import com.google.caliper.{ Param, SimpleBenchmark }
import com.twitter.algebird.{ TopPctCMS, TopCMS, CMSHasherImplicits, TopPctCMSMonoid }
import com.twitter.algebird.{ TopPctCMS, CMSHasherImplicits, TopPctCMSMonoid }

/**
* Benchmarks the Count-Min sketch implementation in Algebird.
@@ -14,6 +14,9 @@ import com.twitter.algebird.{ TopPctCMS, TopCMS, CMSHasherImplicits, TopPctCMSMo
// - Annotate `timePlus` with `@MacroBenchmark`.
class CMSBenchmark extends SimpleBenchmark {

val Seed = 1
val JavaCharSizeInBits = 2 * 8

@Param(Array("0.1", "0.005"))
val eps: Double = 0.0

@@ -32,22 +35,24 @@ class CMSBenchmark extends SimpleBenchmark {
var random: scala.util.Random = _
var cmsLongMonoid: TopPctCMSMonoid[Long] = _
var cmsBigIntMonoid: TopPctCMSMonoid[BigInt] = _
var cmsStringMonoid: TopPctCMSMonoid[String] = _
var inputsBigInt: Seq[BigInt] = _
var inputsString: Seq[String] = _

override def setUp {
override def setUp() {
// Required import of implicit values (e.g. for BigInt- or Long-backed CMS instances)
import CMSHasherImplicits._

cmsLongMonoid = {
val seed = 1
TopPctCMS.monoid[Long](eps, delta, seed, heavyHittersPct)
}

cmsBigIntMonoid = {
val seed = 1
TopPctCMS.monoid[BigInt](eps, delta, seed, heavyHittersPct)
}
cmsLongMonoid = TopPctCMS.monoid[Long](eps, delta, Seed, heavyHittersPct)
cmsBigIntMonoid = TopPctCMS.monoid[BigInt](eps, delta, Seed, heavyHittersPct)
cmsStringMonoid = TopPctCMS.monoid[String](eps, delta, Seed, heavyHittersPct)

random = new scala.util.Random

inputsString = (0 to operations).map { i => random.nextString(maxBits / JavaCharSizeInBits) }.toSeq
Console.out.println(s"Created ${inputsString.size} input records for String")
inputsBigInt = inputsString.map { s => BigInt(s.getBytes) }
Console.out.println(s"Created ${inputsBigInt.size} input records for BigInt")
}

// Case A (K=Long): We count the first hundred integers, i.e. [1, 100]
@@ -70,14 +75,21 @@ class CMSBenchmark extends SimpleBenchmark {
dummy
}

// Case B.2 (K=BigInt): We draw numbers randomly from a 2^maxBits address space
// Case B.2 (K=BigInt): We count numbers drawn randomly from a 2^maxBits address space
def timePlusOfRandom2048BitNumbersWithBigIntCms(reps: Int): Int = {
var dummy = 0
while (dummy < reps) {
(1 to operations).view.foldLeft(cmsBigIntMonoid.zero)((l, r) => {
val n = scala.math.BigInt(maxBits, random)
l ++ cmsBigIntMonoid.create(n)
})
inputsBigInt.view.foldLeft(cmsBigIntMonoid.zero)((l, r) => l ++ cmsBigIntMonoid.create(r))
dummy += 1
}
dummy
}

// Case C (K=String): We count strings drawn randomly from a 2^maxBits address space
def timePlusOfRandom2048BitNumbersWithStringCms(reps: Int): Int = {
var dummy = 0
while (dummy < reps) {
inputsString.view.foldLeft(cmsStringMonoid.zero)((l, r) => l ++ cmsStringMonoid.create(r))
dummy += 1
}
dummy
@@ -0,0 +1,100 @@
package com.twitter.algebird.caliper

import com.google.caliper.{ Param, SimpleBenchmark }

/**
* Benchmarks the hashing algorithms used by Count-Min sketch for CMS[BigInt].
*
* The input values are generated ahead of time to ensure that each trial uses the same input (and that the RNG is not
* influencing the runtime of the trials).
*
* More details available at https://github.com/twitter/algebird/issues/392.
*/
// Once we can convince cappi (https://github.com/softprops/cappi) -- the sbt plugin we use to run
// caliper benchmarks -- to work with the latest caliper 1.0-beta-1, we would:
// - Let `CMSHashingBenchmark` extend `Benchmark` (instead of `SimpleBenchmark`)
// - Annotate `timePlus` with `@MacroBenchmark`.
class CMSHashingBenchmark extends SimpleBenchmark {

/**
* The `a` parameter for CMS' default ("legacy") hashing algorithm: `h_i(x) = a_i * x + b_i (mod p)`.
*/
@Param(Array("5123456"))
val a: Int = 0

/**
* The `b` parameter for CMS' default ("legacy") hashing algorithm: `h_i(x) = a_i * x + b_i (mod p)`.
*
* Algebird's CMS implementation hard-codes `b` to `0`.
*/
@Param(Array("0"))
val b: Int = 0

/**
* Width of the counting table.
*/
@Param(Array("11" /* eps = 0.271 */ , "544" /* eps = 0.005 */ , "2719" /* eps = 1E-3 */ , "271829" /* eps = 1E-5 */ ))
val width: Int = 0

/**
* Number of operations per benchmark repetition.
*/
@Param(Array("100000"))
val operations: Int = 0

/**
* Maximum number of bits for randomly generated BigInt instances.
*/
@Param(Array("128", "1024", "2048"))
val maxBits: Int = 0

var random: scala.util.Random = _
var inputs: Seq[BigInt] = _

override def setUp() {
random = new scala.util.Random
// We draw numbers randomly from a 2^maxBits address space.
inputs = (1 to operations).view.map { _ => scala.math.BigInt(maxBits, random) }
}

private def murmurHashScala(a: Int, b: Int, width: Int)(x: BigInt) = {
val hash: Int = scala.util.hashing.MurmurHash3.arrayHash(x.toByteArray, a)
val h = {
// We only want positive integers for the subsequent modulo. This method mimics Java's Hashtable
// implementation. The Java code uses `0x7FFFFFFF` for the bit-wise AND, which is equal to Int.MaxValue.
val positiveHash = hash & Int.MaxValue
positiveHash % width
}
assert(h >= 0, "hash must not be negative")
h
}

private val PRIME_MODULUS = (1L << 31) - 1

private def brokenCurrentHash(a: Int, b: Int, width: Int)(x: BigInt) = {
val unModded: BigInt = (x * a) + b
val modded: BigInt = (unModded + (unModded >> 32)) & PRIME_MODULUS
val h = modded.toInt % width
assert(h >= 0, "hash must not be negative")
h
}

def timeBrokenCurrentHashWithRandomMaxBitsNumbers(operations: Int): Int = {
var dummy = 0
while (dummy < operations) {
inputs.foreach { input => brokenCurrentHash(a, b, width)(input) }
dummy += 1
}
dummy
}

def timeMurmurHashScalaWithRandomMaxBitsNumbers(operations: Int): Int = {
var dummy = 0
while (dummy < operations) {
inputs.foreach { input => murmurHashScala(a, b, width)(input) }
dummy += 1
}
dummy
}

}
@@ -7,7 +7,6 @@ import com.twitter.bijection._
import java.util.concurrent.Executors
import com.twitter.algebird.util._
import com.google.caliper.{ Param, SimpleBenchmark }
import com.twitter.algebird.HyperLogLogMonoid
import java.nio.ByteBuffer

import scala.math._
@@ -19,7 +18,7 @@ class OldMonoid(bits: Int) extends HyperLogLogMonoid(bits) {
else {
val buffer = new Array[Byte](size)
items.foreach { _.updateInto(buffer) }
Some(DenseHLL(bits, buffer.toIndexedSeq))
Some(DenseHLL(bits, Bytes(buffer)))
}
}

@@ -0,0 +1,42 @@
package com.twitter.algebird.caliper

import com.google.caliper.{ SimpleBenchmark, Param }
import com.twitter.algebird.{ HyperLogLogMonoid, HLL }
import com.twitter.bijection._
import java.nio.ByteBuffer

class HLLPresentBenchmark extends SimpleBenchmark {
@Param(Array("5", "10", "17", "20"))
val bits: Int = 0

@Param(Array("10", "100", "500", "1000", "10000"))
val max: Int = 0

@Param(Array("10", "20", "100"))
val numHLL: Int = 0

var data: IndexedSeq[HLL] = _

implicit val byteEncoder = implicitly[Injection[Long, Array[Byte]]]

override def setUp {
val hllMonoid = new HyperLogLogMonoid(bits)
val r = new scala.util.Random(12345L)
data = (0 until numHLL).map { _ =>
val input = (0 until max).map(_ => r.nextLong).toSet
hllMonoid.batchCreate(input)(byteEncoder.toFunction)
}.toIndexedSeq

}

def timeBatchCreate(reps: Int): Int = {
var dummy = 0
while (dummy < reps) {
data.foreach { hll =>
hll.approximateSize
}
dummy += 1
}
dummy
}
}