-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A couple of performance optimizations: HyperLogLog and BloomFilter #426
Conversation
…gLogMonoid (streamline and re-use pre-computed hashes for performance).
|
Basically, current changes try to save time by re-using hashed values, in two useful situations. BloomFilter checkAndAdd will allow for a little error during partition merges (reduce): when the same item appears in two different partitions once. Worst case is when all partitions contain exactly same sets of items and items appears at most once in each partition. |
val contained = this.contains(item) | ||
(BFInstance(hashes, bits ++ bitsToActivate, width), contained) | ||
} | ||
|
||
def bitSetContains(bs: BitSet, il: Int*): Boolean = { | ||
il.map{ i: Int => bs.contains(i) }.reduce{ _ && _ } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this could be optimized by stopping at the first false:
val ilit = il.iterator
var contains = true
while(ilit.hasNext && contains) {
contains = contains && bs.contains(ilit.next)
}
contains
it is mutable, but it might speed things up measurably.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defintiely. I considered this ealier - performance measurements showed hash calculations and bit array conversions trumped everything small like this. If we do add this, how about:
il.foreach { i => if (!bs.contains(i)) return false }
return true
Not sure how we feel about break-out returns on Scala, but we will avoid extra assignments.
@@ -150,6 +154,8 @@ case class BFItem(item: String, hashes: BFHash, width: Int) extends BF { | |||
|
|||
def +(other: String) = this ++ BFItem(other, hashes, width) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in ++ should we optimize the case where other.item == item
and not allocate? I guess that's rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An extra "if " shortcut is definitely cheaper than hashes calculations - benefit of this trade-off will depend on data consumed. Probably doesn't hurt to put the check there - but as you said, it would happen pretty rarely - data values must be same. Your call.
…ions will require conversions anyway)
Thanks. This will be of help! |
A couple of performance optimizations: HyperLogLog and BloomFilter
Added possibility of HLL creation from pre-computed hashes to HyperLogLogMonoid.
This makes it possible to re-use pre-computed hashes and increase performance.