Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A couple of performance optimizations: HyperLogLog and BloomFilter #426

Merged
merged 8 commits into from
May 6, 2015

Conversation

Bekbolatov
Copy link
Contributor

Added possibility of HLL creation from pre-computed hashes to HyperLogLogMonoid.
This makes it possible to re-use pre-computed hashes and increase performance.

@Bekbolatov Bekbolatov changed the title Added possibility of HLL creation from pre-computed hashes to HyperLogLogMonoid A couple of performance optimizations: HyperLogLog and BloomFilter Mar 21, 2015
@Bekbolatov
Copy link
Contributor Author

  • Possibility of HLL creation from pre-computed hashes to HyperLogLogMonoid.
  • Performance optimization for BloomFilter checkAndAdd operation.

@Bekbolatov
Copy link
Contributor Author

Basically, current changes try to save time by re-using hashed values, in two useful situations. BloomFilter checkAndAdd will allow for a little error during partition merges (reduce): when the same item appears in two different partitions once. Worst case is when all partitions contain exactly same sets of items and items appears at most once in each partition.

val contained = this.contains(item)
(BFInstance(hashes, bits ++ bitsToActivate, width), contained)
}

def bitSetContains(bs: BitSet, il: Int*): Boolean = {
il.map{ i: Int => bs.contains(i) }.reduce{ _ && _ }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this could be optimized by stopping at the first false:

val ilit = il.iterator
var contains = true
while(ilit.hasNext && contains) {
  contains = contains && bs.contains(ilit.next)
}
contains

it is mutable, but it might speed things up measurably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defintiely. I considered this ealier - performance measurements showed hash calculations and bit array conversions trumped everything small like this. If we do add this, how about:

il.foreach { i => if (!bs.contains(i)) return false }
    return true

Not sure how we feel about break-out returns on Scala, but we will avoid extra assignments.

@@ -150,6 +154,8 @@ case class BFItem(item: String, hashes: BFHash, width: Int) extends BF {

def +(other: String) = this ++ BFItem(other, hashes, width)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in ++ should we optimize the case where other.item == item and not allocate? I guess that's rare.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An extra "if " shortcut is definitely cheaper than hashes calculations - benefit of this trade-off will depend on data consumed. Probably doesn't hurt to put the check there - but as you said, it would happen pretty rarely - data values must be same. Your call.

@Bekbolatov Bekbolatov closed this May 4, 2015
@Bekbolatov Bekbolatov reopened this May 4, 2015
@johnynek
Copy link
Collaborator

johnynek commented May 6, 2015

Thanks. This will be of help!

johnynek added a commit that referenced this pull request May 6, 2015
A couple of performance optimizations: HyperLogLog and BloomFilter
@johnynek johnynek merged commit e9aabba into twitter:develop May 6, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants