Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change BloomFilter implementation for Sparse Joins #1806

Merged
merged 46 commits into from May 3, 2019

Conversation

anish749
Copy link
Contributor

@anish749 anish749 commented Apr 2, 2019

I am trying to add the optimisations from #1686 and twitter/algebird#673.

This involves using a java.util.BitSet instead of a custom Array[Long] to back the BloomFilter, and not reallocating the BitSet after adding each element. Since adding elements to a BloomFilter are idempotent, and we do not need a reference to the old BloomFilter after it has been created, using a mutable BloomFilter works faster with less things to copy.

Here is what I did:
Use java.util.BitSet backed bloom filter from twitter/algebird#673.
It works but the memory allocation is based on the width of the filter which is derived from the number of keys hint passed in sparse joins.
This means if the hint is misleading (abnormally high) we end up with OOM. Also we have test cases which pass 1Billion elements but doesn't actually add so many elements. This means allocating memory for for the estimated width. The original Algebird BF uses a compressed bitmap and has a sparse bloom filter implementation when less than 10% of the bits are used. This reduces memory usage.
Now, EWAHCompressedBitMaps, used by Algebird, is immutable and is very slow when inserting, because it again copies and reallocates when inserting data. Using this bitmap would again reduce the throughput compared to ClearSpring's BF that I used in #1686. Also this would use Kryo serialization since this bitmap doesn't have a Beam Coder, but that is probably not a problem.
So now we write another compressed mutable bitmap which essentially is a mutable.Set[Int]. This works better, and has BeamCoders. With this we don't have any OOM issues for sparse filters and is also mutable, hence we are not reallocating.
But this is slower than ClearSpring. I didn't quite understand why. But it was 50% slower in terms of throughput compared to util.BitSet.
Then we build a DelayedBloomFilter (in this version) where we just calculate the hashes(as an Array[Int]), and keep the references in a mutable.Buffer[Array[Int]] when inserting values.
We wait till this is filled up by the aggregator to a point where this no longer offers any benefit of not allocation the complete memory of util.BitSet and then we change to the MutableBFInstance[T]. When the first query happens, we materialise the mutable.Buffer[Array[Int]] into a Set[Int] and store that, to answer queries. For multiple queries, we don't re materialise this. If a new element is inserted, we discard the previously materialised Set[Int] and wait till we see the first query. We use aggregators here, which are nice to us and they don't interleave inserts and reads from the filter. This means we can delay the Set allocation till we have a query. Also this BF is not meant to be in caches and only in ephemeral pipelines, which means that we have clearly separated steps for creation and usage of the bloom filter.

With the above implementation,
we have the following results for putting 1M elements in the BloomFilter:
Original Implementation (v0.7.4)
Screenshot 2019-04-25 at 14 28 44
Current Branch (v0.7.5-SNAPSHOT)
Screenshot 2019-04-25 at 14 28 58

JMH Benchmark (locally outside this repo)

Benchmark                         (falsePositiveRate)  (nbrOfElements)   Mode  Cnt      Score       Error  Units
algebirdBF                 0.01              100  thrpt    3  13146.325 ± 11543.779  ops/s
algebirdBF                 0.01            10000  thrpt    3      6.977 ±     4.191  ops/s
clearSpringBF                 0.01              100  thrpt    3  65031.775 ±  5328.316  ops/s
clearSpringBF                 0.01            10000  thrpt    3    575.296 ±   544.399  ops/s
scioBF                        0.01              100  thrpt    3  53440.614 ± 17581.014  ops/s
scioBF                        0.01            10000  thrpt    3    499.989 ±   312.770  ops/s

@jto jto added the WIP label Apr 5, 2019
@codecov
Copy link

codecov bot commented Apr 24, 2019

Codecov Report

Merging #1806 into master will decrease coverage by 1.74%.
The diff coverage is 94.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1806      +/-   ##
==========================================
- Coverage   71.16%   69.42%   -1.75%     
==========================================
  Files         196      197       +1     
  Lines        6077     6211     +134     
  Branches      395      443      +48     
==========================================
- Hits         4325     4312      -13     
- Misses       1752     1899     +147
Impacted Files Coverage Δ
...spotify/scio/values/PairSCollectionFunctions.scala 96.04% <100%> (ø) ⬆️
...main/scala/com/spotify/scio/util/BloomFilter.scala 94.02% <94.02%> (ø)
.../spotify/scio/bigquery/client/BigQueryConfig.scala 0% <0%> (-70.59%) ⬇️
...scala/com/spotify/scio/bigquery/client/Cache.scala 0% <0%> (-70.38%) ⬇️
.../spotify/scio/bigquery/BigQueryPartitionUtil.scala 0% <0%> (-57.5%) ⬇️
...scala/com/spotify/scio/bigquery/BigQueryUtil.scala 50% <0%> (-50%) ⬇️
...la/com/spotify/scio/bigquery/client/QueryOps.scala 0.82% <0%> (-47.11%) ⬇️
...spotify/scio/coders/instances/AlgebirdCoders.scala 50% <0%> (-25%) ⬇️
...com/spotify/scio/bigquery/types/BigQueryType.scala 36.84% <0%> (-21.06%) ⬇️
...la/com/spotify/scio/bigquery/client/TableOps.scala 0% <0%> (-16.37%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72a6a33...2ff5176. Read the comment docs.

Copy link
Contributor

@nevillelyh nevillelyh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rough first pass. LGTM mostly. I'd make most BF stuff private for now until we're ready to open it for end users.

For that we should probably verify that the MutableBF:

  • passes Beam mutation detection in transforms
  • has efficient and deterministic coder (for those store it to disk for future jobs)

nevillelyh and others added 7 commits May 1, 2019 15:13
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
@anish749
Copy link
Contributor Author

anish749 commented May 1, 2019

Thanks for the review @nevillelyh .
I'll look into the comments by tomorrow and fix them, and also resolve conflicts.
I tried the mutation detection, and it works. I'll add that and a verifyDeterministic test on the coder as a unit test:

val bf: MutableBF[String] = BloomFilter[String](10, 0.1).create(items: _*)
val beamMutationDetector = MutationDetectors
  .forValueWithCoder(
      bf,
      CoderMaterializer
        .beam(ScioContext.forTest(), Coder[MutableBF[String]]) // Create Coder for MutableBF.
    )
beamMutationDetector.verifyUnmodified()

For your second point, any pointers about comparing the efficiency of coders?

Copy link
Contributor

@regadas regadas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some minor comments. It's looking good! Thanks @anish749

// TODO: investigate this upper bound and density more closely (or derive a better formula).
// TODO: The following logic is same for immutable Bloom Filters and may be referred here.
val fpProb =
if (density > 0.95)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎨 use some {} just because it's a multiline

Anish and others added 8 commits May 2, 2019 13:13
# Conflicts:
#	scio-core/src/main/scala/com/spotify/scio/values/PairSCollectionFunctions.scala
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
Co-Authored-By: anish749 <anish749@users.noreply.github.com>
@anish749
Copy link
Contributor Author

anish749 commented May 2, 2019

@regadas thanks for reviewing. I've made the changes.

@anish749
Copy link
Contributor Author

anish749 commented May 2, 2019

Here are the updated benchmarks

Benchmark      (falsePositiveRate)  (nbrOfElements)   Mode  Cnt     Score     Error  Units
clearSpringBF                 0.01             1000  thrpt    7  5987.608 ± 1363.882  ops/s
clearSpringBF                 0.01            10000  thrpt    7   652.844 ±   63.079  ops/s
scioBF                        0.01             1000  thrpt    7  5155.060 ±  311.114  ops/s
scioBF                        0.01            10000  thrpt    7   425.954 ±   99.796  ops/s

@regadas regadas removed the WIP label May 3, 2019
@regadas regadas merged commit 9c8aa07 into spotify:master May 3, 2019
@anish749 anish749 deleted the mutableBloomFilters branch August 10, 2019 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants