A couple of performance optimizations: HyperLogLog and BloomFilter #426

Bekbolatov · 2015-03-14T03:12:55Z

Added possibility of HLL creation from pre-computed hashes to HyperLogLogMonoid.
This makes it possible to re-use pre-computed hashes and increase performance.

…gLogMonoid (streamline and re-use pre-computed hashes for performance).
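
To make the hash re-use idea concrete, here is a minimal sketch of a HyperLogLog-style update that accepts an already-computed hash. The names `ToyHLL`, `addHash`, and `hash64` are hypothetical and the register layout is a textbook one; this is not the HyperLogLogMonoid API from this PR.

```scala
import scala.util.hashing.MurmurHash3

object HllHashReuseSketch {
  // Toy HyperLogLog registers. ToyHLL, addHash and hash64 are illustrative
  // assumptions, not Algebird's types. Assumes 1 <= p <= 30.
  final case class ToyHLL(p: Int, registers: Vector[Int]) {
    // Update from an already-computed 64-bit hash: no hashing happens here.
    def addHash(hash: Long): ToyHLL = {
      val j = (hash >>> (64 - p)).toInt // top p bits select the register
      val w = hash << p                 // remaining 64 - p bits, left-aligned
      val rho =
        if (w == 0L) 64 - p + 1
        else java.lang.Long.numberOfLeadingZeros(w) + 1
      if (rho > registers(j)) copy(registers = registers.updated(j, rho)) else this
    }

    // Convenience path: hash the item, then delegate to addHash.
    def add(item: String): ToyHLL = addHash(hash64(item))
  }

  object ToyHLL {
    def empty(p: Int): ToyHLL = ToyHLL(p, Vector.fill(1 << p)(0))
  }

  // Hypothetical 64-bit hash built from two seeded 32-bit MurmurHash3 values.
  def hash64(item: String): Long =
    (MurmurHash3.stringHash(item, 1).toLong << 32) |
      (MurmurHash3.stringHash(item, 2) & 0xffffffffL)
}
```

A caller that already holds `hash64(item)`, for example because it feeds several sketches, can pass the same value to each `addHash` call instead of hashing the item again.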

Bekbolatov · 2015-03-21T00:58:03Z

Possibility of HLL creation from pre-computed hashes to HyperLogLogMonoid.
Performance optimization for BloomFilter checkAndAdd operation.

Bekbolatov · 2015-03-21T01:04:00Z

Basically, the current changes try to save time by re-using hashed values in two useful situations. BloomFilter checkAndAdd will allow for a small error during partition merges (reduce): when the same item appears once in each of two different partitions. The worst case is when all partitions contain exactly the same set of items and each item appears at most once in each partition.
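
As a rough illustration of the trade-off described above, the sketch below uses a toy Bloom filter whose `checkAndAdd` hashes the item once and returns both the updated filter and the prior membership answer. `ToyBF`, `countNew`, and the seeded-hash scheme are assumptions made for the example, not Algebird's BFInstance. The per-partition count shows one reading of the merge caveat: an item that appears once in each of two partitions is reported as new in both.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

object CheckAndAddSketch {
  // Toy immutable Bloom filter; names and hashing scheme are illustrative only.
  final case class ToyBF(numHashes: Int, width: Int, bits: BitSet = BitSet.empty) {
    // Bit positions for an item, one per seed.
    private def indices(item: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        java.lang.Math.floorMod(MurmurHash3.stringHash(item, seed), width)
      }

    def contains(item: String): Boolean = indices(item).forall(bits.contains)

    // Hash once, then use the same indices for the membership test and the update.
    def checkAndAdd(item: String): (ToyBF, Boolean) = {
      val idx = indices(item)
      val contained = idx.forall(bits.contains)
      (copy(bits = bits ++ idx), contained)
    }
  }

  // Folds one partition into a filter and counts items reported as not yet seen.
  def countNew(items: Seq[String], empty: ToyBF): (ToyBF, Int) =
    items.foldLeft((empty, 0)) { case ((bf, n), item) =>
      val (next, seen) = bf.checkAndAdd(item)
      (next, if (seen) n else n + 1)
    }
}
```

With partitions `Seq("a", "b")` and `Seq("a", "c")`, each partition on its own reports `"a"` as new, so summing the per-partition counts overstates the number of distinct items seen across the merge.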

johnynek · 2015-03-23T18:49:17Z

algebird-core/src/main/scala/com/twitter/algebird/BloomFilter.scala

+    val contained = this.contains(item)
+    (BFInstance(hashes, bits ++ bitsToActivate, width), contained)
+  }
+
  def bitSetContains(bs: BitSet, il: Int*): Boolean = {
    il.map{ i: Int => bs.contains(i) }.reduce{ _ && _ }


this could be optimized by stopping at the first false:

    val ilit = il.iterator
    var contains = true
    while (ilit.hasNext && contains) {
      contains = contains && bs.contains(ilit.next)
    }
    contains

it is mutable, but it might speed things up measurably.

Definitely. I considered this earlier, but performance measurements showed that hash calculations and bit array conversions dwarfed anything small like this. If we do add this, how about:

    il.foreach { i => if (!bs.contains(i)) return false }
    return true

Not sure how we feel about break-out returns in Scala, but this would avoid the extra assignments.
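
For comparison, a third option that neither comment proposes is `forall`, which also stops at the first missing index and needs neither a mutable flag nor a non-local return. The sketch below is self-contained; the object and method names are made up for the example.

```scala
import scala.collection.immutable.BitSet

object BitSetContainsSketch {
  // Same shape as the code under review: tests every index, then ANDs the results.
  // Note that reduce throws on an empty argument list.
  def containsAllEager(bs: BitSet, il: Int*): Boolean =
    il.map(i => bs.contains(i)).reduce(_ && _)

  // Short-circuiting alternative: forall stops at the first absent index
  // and returns true for an empty argument list.
  def containsAll(bs: BitSet, il: Int*): Boolean =
    il.forall(bs.contains)
}
```

Whether the difference is measurable would need benchmarking; as noted in the reply above, hashing and bit array conversions were the dominant costs in the author's measurements.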

johnynek · 2015-03-23T18:53:51Z

algebird-core/src/main/scala/com/twitter/algebird/BloomFilter.scala

@@ -150,6 +154,8 @@ case class BFItem(item: String, hashes: BFHash, width: Int) extends BF {

  def +(other: String) = this ++ BFItem(other, hashes, width)


In ++, should we optimize the case where other.item == item and avoid the allocation? I guess that's rare.

An extra "if" shortcut is definitely cheaper than the hash calculations; the benefit of this trade-off will depend on the data consumed. It probably doesn't hurt to put the check there, but as you said, it would happen pretty rarely, since the data values must be the same. Your call.
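
The shortcut under discussion could look roughly like the sketch below. `ToyBF`, `ToyItem`, and `ToyMany` are simplified stand-ins invented for the example (the real BFItem also carries hashes and width), so this shows only the shape of the check, not the code in the PR.

```scala
object SameItemShortcutSketch {
  sealed trait ToyBF { def ++(other: ToyBF): ToyBF }

  // Stand-in for a single-item filter.
  final case class ToyItem(item: String) extends ToyBF {
    def ++(other: ToyBF): ToyBF = other match {
      case ToyItem(o) if o == item => this // same value: nothing to hash or allocate
      case ToyItem(o)              => ToyMany(Set(item, o))
      case ToyMany(items)          => ToyMany(items + item)
    }
  }

  // Stand-in for the multi-item representation.
  final case class ToyMany(items: Set[String]) extends ToyBF {
    def ++(other: ToyBF): ToyBF = other match {
      case ToyItem(o)  => ToyMany(items + o)
      case ToyMany(os) => ToyMany(items ++ os)
    }
  }
}
```

The guard costs one string comparison per call, which only pays off when the input actually contains repeated values.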

johnynek · 2015-05-06T21:45:52Z

Thanks. This will be of help!

Bekbolatov and others added 3 commits March 13, 2015 20:10

Added possibility of HLL creation from pre-computed hashes to HyperLo…

06ba4f7

…gLogMonoid (streamline and re-use pre-computed hashes for performance).

Merge remote-tracking branch 'upstream/develop' into develop

53dc964

Performance optimization for BloomFilter checkAndAdd operation.

02ecb08

Bekbolatov changed the title ~~Added possibility of HLL creation from pre-computed hashes to HyperLogLogMonoid~~ A couple of performance optimizations: HyperLogLog and BloomFilter Mar 21, 2015

johnynek reviewed Mar 23, 2015

Merge remote-tracking branch 'upstream/develop' into develop

bec8b47

johnynek reviewed Mar 23, 2015

Renat Bekbolatov added 4 commits March 23, 2015 12:22

BloomFilter, HyperLogLog - a couple of shortcuts/optimizations

302f88d

BloomFilter: expand and to re-use hashes in

94d2b38

BloomFilter: BFSparse will delegate to BFInstance for (further operat…

583484e

…ions will require conversions anyway)

Merge remote-tracking branch 'upstream/develop' into develop

ed39056

Bekbolatov closed this May 4, 2015

Bekbolatov reopened this May 4, 2015

johnynek added a commit that referenced this pull request May 6, 2015

Merge pull request #426 from Bekbolatov/develop

e9aabba

A couple of performance optimizations: HyperLogLog and BloomFilter

johnynek merged commit e9aabba into twitter:develop May 6, 2015
