Change BloomFilter implementation for Sparse Joins #1806

anish749 · 2019-04-02T21:20:33Z

I am trying to add the optimisations from #1686 and twitter/algebird#673.

This involves using a java.util.BitSet instead of a custom Array[Long] to back the BloomFilter, and not reallocating the BitSet after adding each element. Since adding elements to a BloomFilter is idempotent, and we do not need a reference to the old BloomFilter after it has been created, a mutable BloomFilter is faster because there are fewer things to copy.
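As a rough illustration of the idea above (not the code in this PR), here is a minimal sketch of a mutable, java.util.BitSet-backed Bloom filter that sets bits in place instead of copying on every insert. The names BFSizing, MutableBitSetBF, optimalWidth and optimalNumHashes are hypothetical, the sizing uses the textbook formulas, and the double-hashing scheme on top of MurmurHash3 is an assumption made for the example.

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Hypothetical helper: textbook Bloom filter sizing, not taken from the PR.
object BFSizing {
  // Width in bits: m = ceil(-n * ln(p) / (ln 2)^2)
  def optimalWidth(numEntries: Long, fpProb: Double): Long =
    math.ceil(-numEntries * math.log(fpProb) / (math.log(2) * math.log(2))).toLong

  // Number of hash functions: k = round(m / n * ln 2), at least 1
  def optimalNumHashes(numEntries: Long, width: Long): Int =
    math.max(1, math.round(width.toDouble / numEntries * math.log(2)).toInt)
}

// Hypothetical sketch of a mutable filter: inserts flip bits in place,
// so no new bitmap is allocated per element.
final class MutableBitSetBF(val width: Int, val numHashes: Int) {
  private val bits = new BitSet(width)

  // Assumed hashing scheme (double hashing): index_i = (h1 + i * h2) mod width
  private def indices(item: String): Array[Int] = {
    val h1 = MurmurHash3.stringHash(item, 0)
    val h2 = MurmurHash3.stringHash(item, h1)
    Array.tabulate(numHashes)(i => java.lang.Math.floorMod(h1 + i * h2, width))
  }

  // Adding the same item twice sets the same bits, so inserts are idempotent.
  def +=(item: String): this.type = {
    indices(item).foreach(i => bits.set(i))
    this
  }

  // True if the item may have been added; false means it was definitely not added.
  def maybeContains(item: String): Boolean =
    indices(item).forall(i => bits.get(i))
}
```

Under these formulas, a filter sized for 1M entries at a 1% false positive rate needs roughly 9.6 million bits (about 1.2 MB), and a plain BitSet allocates that up front regardless of how many elements are eventually added.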

Here is what I did:
Use java.util.BitSet backed bloom filter from twitter/algebird#673.
It works, but memory is allocated based on the width of the filter, which is derived from the number-of-keys hint passed to sparse joins.
This means that if the hint is misleading (abnormally high) we end up with an OOM. We also have test cases which pass a hint of 1 billion elements but don't actually add that many, which still means allocating memory for the full estimated width. The original Algebird BF uses a compressed bitmap and has a sparse bloom filter implementation when less than 10% of the bits are used. This reduces memory usage.
Now, EWAHCompressedBitmap, used by Algebird, is immutable and is very slow when inserting, because it again copies and reallocates when inserting data. Using this bitmap would again reduce the throughput compared to ClearSpring's BF that I used in #1686. It would also use Kryo serialization since this bitmap doesn't have a Beam Coder, but that is probably not a problem.
So now we write another compressed mutable bitmap, which is essentially a mutable.Set[Int]. This works better, and has Beam Coders. With this we don't have any OOM issues for sparse filters, and it is also mutable, so we are not reallocating.
But this is slower than ClearSpring, and I didn't quite understand why. It was 50% slower in terms of throughput compared to util.BitSet.
Then we build a DelayedBloomFilter (in this version) where we just calculate the hashes (as an Array[Int]) and keep the references in a mutable.Buffer[Array[Int]] when inserting values.
We wait until the aggregator has filled this buffer to the point where it no longer offers any benefit over allocating the complete memory of a util.BitSet, and then we change to the MutableBFInstance[T]. When the first query happens, we materialise the mutable.Buffer[Array[Int]] into a Set[Int] and store that to answer queries. For multiple queries, we don't rematerialise this. If a new element is inserted, we discard the previously materialised Set[Int] and wait until we see the next query. We use aggregators here, which are nice to us: they don't interleave inserts and reads from the filter. This means we can delay the Set allocation until we have a query. Also, this BF is not meant to live in caches, only in ephemeral pipelines, which means that we have clearly separated steps for creation and usage of the bloom filter.
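The delayed approach described in the last two steps can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the merged implementation: the class name DelayedBF, the hashesOf parameter and the switch-over rule (switch once the buffered Ints would take as many bits as the dense bitmap) are all hypothetical.

```scala
import java.util.BitSet
import scala.collection.mutable

// Hypothetical sketch of a delayed filter. `hashesOf` is an assumed function
// that maps an item to its bit indices in [0, width).
final class DelayedBF(width: Int, hashesOf: String => Array[Int]) {
  // One hash array per inserted element; nothing width-sized is allocated yet.
  private val pending = mutable.Buffer.empty[Array[Int]]
  // Built lazily on the first query and discarded on the next insert.
  private var materialised: Option[Set[Int]] = None
  // Dense bitmap, allocated only once buffering stops saving memory.
  private var dense: Option[BitSet] = None
  private var bufferedInts = 0L

  // Assumed threshold: each buffered Int costs 32 bits, the bitmap costs `width` bits.
  private def shouldSwitch: Boolean = bufferedInts * 32L >= width.toLong

  def isDense: Boolean = dense.isDefined

  def +=(item: String): this.type = {
    val hs = hashesOf(item)
    dense match {
      case Some(bits) => hs.foreach(i => bits.set(i))
      case None =>
        pending += hs
        bufferedInts += hs.length
        materialised = None // a new insert invalidates the materialised set
        if (shouldSwitch) {
          val bits = new BitSet(width)
          pending.foreach(_.foreach(i => bits.set(i)))
          pending.clear()
          dense = Some(bits)
        }
    }
    this
  }

  def maybeContains(item: String): Boolean = {
    val hs = hashesOf(item)
    dense match {
      case Some(bits) => hs.forall(i => bits.get(i))
      case None =>
        // Materialise once; repeated queries reuse the same Set[Int].
        val set = materialised.getOrElse {
          val s = pending.iterator.flatMap(_.iterator).toSet
          materialised = Some(s)
          s
        }
        hs.forall(set.contains)
    }
  }
}
```

Because inserts and queries are not interleaved in the aggregator use case, the set is typically materialised at most once in this sketch; interleaving would still be correct but would rebuild the set after every insert.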

With the above implementation, we have the following results for putting 1M elements in the BloomFilter:
Original Implementation (v0.7.4)

Current Branch (v0.7.5-SNAPSHOT)

JMH Benchmark (locally outside this repo)

Benchmark      (falsePositiveRate)  (nbrOfElements)   Mode  Cnt      Score       Error  Units
algebirdBF                    0.01              100  thrpt    3  13146.325 ± 11543.779  ops/s
algebirdBF                    0.01            10000  thrpt    3      6.977 ±     4.191  ops/s
clearSpringBF                 0.01              100  thrpt    3  65031.775 ±  5328.316  ops/s
clearSpringBF                 0.01            10000  thrpt    3    575.296 ±   544.399  ops/s
scioBF                        0.01              100  thrpt    3  53440.614 ± 17581.014  ops/s
scioBF                        0.01            10000  thrpt    3    499.989 ±   312.770  ops/s

codecov · 2019-04-24T21:42:27Z

Codecov Report

Merging #1806 into master will decrease coverage by 1.74%.
The diff coverage is 94.07%.

@@            Coverage Diff             @@
##           master    #1806      +/-   ##
==========================================
- Coverage   71.16%   69.42%   -1.75%     
==========================================
  Files         196      197       +1     
  Lines        6077     6211     +134     
  Branches      395      443      +48     
==========================================
- Hits         4325     4312      -13     
- Misses       1752     1899     +147

Impacted Files	Coverage Δ
...spotify/scio/values/PairSCollectionFunctions.scala	`96.04% <100%> (ø)`	⬆️
...main/scala/com/spotify/scio/util/BloomFilter.scala	`94.02% <94.02%> (ø)`
.../spotify/scio/bigquery/client/BigQueryConfig.scala	`0% <0%> (-70.59%)`	⬇️
...scala/com/spotify/scio/bigquery/client/Cache.scala	`0% <0%> (-70.38%)`	⬇️
.../spotify/scio/bigquery/BigQueryPartitionUtil.scala	`0% <0%> (-57.5%)`	⬇️
...scala/com/spotify/scio/bigquery/BigQueryUtil.scala	`50% <0%> (-50%)`	⬇️
...la/com/spotify/scio/bigquery/client/QueryOps.scala	`0.82% <0%> (-47.11%)`	⬇️
...spotify/scio/coders/instances/AlgebirdCoders.scala	`50% <0%> (-25%)`	⬇️
...com/spotify/scio/bigquery/types/BigQueryType.scala	`36.84% <0%> (-21.06%)`	⬇️
...la/com/spotify/scio/bigquery/client/TableOps.scala	`0% <0%> (-16.37%)`	⬇️
... and 7 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 72a6a33...2ff5176.

nevillelyh

Rough first pass. LGTM mostly. I'd make most BF stuff private for now until we're ready to open it for end users.

For that we should probably verify that the MutableBF:

passes Beam mutation detection in transforms
has an efficient and deterministic coder (for those who store it to disk for future jobs)
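To make the "deterministic coder" requirement concrete, here is a small sketch, assuming a filter backed by java.util.BitSet. BitSetCodec is a hypothetical helper and not the Beam Coder from this PR; it only shows the property to verify: filters with the same bits set must encode to the same bytes, and decoding must round-trip.

```scala
import java.util.BitSet

// Hypothetical helper for illustration; the PR's actual coder is not shown in this thread.
object BitSetCodec {
  // toByteArray depends only on which bits are set, not on insertion order
  // or on the capacity the BitSet was created with.
  def encode(bits: BitSet): Array[Byte] = bits.toByteArray

  // Rebuild a BitSet holding exactly the encoded bits.
  def decode(bytes: Array[Byte]): BitSet = BitSet.valueOf(bytes)
}
```

A check along these lines would encode two filters built from the same elements in different orders and compare the resulting byte arrays.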

anish749 · 2019-05-01T13:52:29Z

Thanks for the review @nevillelyh.
I'll look into the comments by tomorrow and fix them, and also resolve conflicts.
I tried the mutation detection, and it works. I'll add that and a verifyDeterministic test on the coder as a unit test:

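import com.spotify.scio.ScioContext
import com.spotify.scio.coders.{Coder, CoderMaterializer}
import org.apache.beam.sdk.util.MutationDetectors

// Build a small filter, then check that Beam's mutation detector sees no change.
val bf: MutableBF[String] = BloomFilter[String](10, 0.1).create(items: _*)
val beamMutationDetector = MutationDetectors
  .forValueWithCoder(
    bf,
    CoderMaterializer
      .beam(ScioContext.forTest(), Coder[MutableBF[String]]) // Create Coder for MutableBF.
  )
beamMutationDetector.verifyUnmodified()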

For your second point, any pointers about comparing the efficiency of coders?

regadas

Left some minor comments. It's looking good! Thanks @anish749

regadas · 2019-05-01T18:06:09Z

scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

+      // TODO: investigate this upper bound and density more closely (or derive a better formula).
+      // TODO: The following logic is same for immutable Bloom Filters and may be referred here.
+      val fpProb =
+        if (density > 0.95)


🎨 use some {} just because it's a multiline

anish749 · 2019-05-02T15:36:36Z

@regadas thanks for reviewing. I've made the changes.

anish749 · 2019-05-02T15:56:03Z

Here are the updated benchmarks

Benchmark      (falsePositiveRate)  (nbrOfElements)   Mode  Cnt     Score     Error  Units
clearSpringBF                 0.01             1000  thrpt    7  5987.608 ± 1363.882  ops/s
clearSpringBF                 0.01            10000  thrpt    7   652.844 ±   63.079  ops/s
scioBF                        0.01             1000  thrpt    7  5155.060 ±  311.114  ops/s
scioBF                        0.01            10000  thrpt    7   425.954 ±   99.796  ops/s

nevillelyh and others added 19 commits March 22, 2019 16:30

update docs

d70edee

fix maven-metadata.xml URL in scripts/bump_scio.sh

8181a11

Move scio-contrib back to scio-extra (spotify#1776)

b4c2a0c

Add a hand written Coder for pairs (spotify#1775)

9584792

* Add groupBy benchmark * Add hand made impl of Tuple2 Coder

Make BQ annotations serializable (spotify#1773)

b75aaa1

use golang image for CircleCI deploy, fix spotify#1772 (spotify#1779)

d3a41b5

Remove the usage of Future around ScioContext and Tap's (spotify#1666)

01dd7bc

Fix benchmark rebase

e5675b9

Add schema and row coders (spotify#1698)

62bbd20

- SchemaCoder - BeamSQL support

Simplify query row transform (spotify#1767)

dc6f4a4

* Simplify query row transform

Fix typos

f48c2cd

use camelCase for typed arguments, fix spotify#1770 (spotify#1780)

4a8496c

Rework sql syntax (spotify#1778)

ddf1f09

Fix: use same protoc (spotify#1781)

ac346dc

Bump coursier version (spotify#1782)

65f0f43

Merge remote-tracking branch 'origin/master'

6bb54ed

update pair scollection functions to use mutable bloom filters.

a8fba4f

turn off scalastyle for use of return

96b9088

Merge branch 'master' of github.com:spotify/scio

80ab0b5

jto added the WIP label Apr 5, 2019

anish749 added 3 commits April 8, 2019 11:06

Merge branch 'master' of github.com:spotify/scio

def347f

Merge branch 'master' into mutableBloomFilters

244210a

add sparse bloom filter implementation to save memory

5ebb747

anish749 commented Apr 10, 2019

Anish added 2 commits April 24, 2019 22:37

delayed mutable bf

4d63959

fix bf set init condition, delayed init tests.

756d6f4

Anish added 3 commits April 25, 2019 15:58

clean up sparse mutable bf

5780fae

less values

32348f5

add gens for sparse mutable bf

1a59d10

fix imports and ret type

dcb8056

nevillelyh approved these changes Apr 30, 2019

nevillelyh and others added 7 commits May 1, 2019 15:13

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

0f251dd

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

d416e6a

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

d346d61

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

070bb8b

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

a5fed50

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

1b719c3

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

cdfe002

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

add mutation detection tests

699fe13

regadas reviewed May 1, 2019

Anish and others added 8 commits May 2, 2019 13:13

Merge branch 'master' into mutableBloomFilters

4f1a3d7

# Conflicts: # scio-core/src/main/scala/com/spotify/scio/values/PairSCollectionFunctions.scala

Update scio-core/src/main/scala/com/spotify/scio/util/BloomFilter.scala

52f6313

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

Apply suggestions from code review

ae146cf

Co-Authored-By: anish749 <anish749@users.noreply.github.com>

address review comments

9ff97a7

more restricted access

e6bab9f

foreach -> while

66bd624

add warning

5ed4707

remove return

2ff5176

regadas reviewed May 2, 2019

regadas approved these changes May 3, 2019

regadas removed the WIP label May 3, 2019

regadas merged commit 9c8aa07 into spotify:master May 3, 2019

regadas mentioned this pull request May 3, 2019

Change Bloom Filter implementation for Sparse Joins #1686

Closed

anish749 mentioned this pull request May 9, 2019

jmh: Micro bench marks for BloomFilter #1913

Merged

nevillelyh mentioned this pull request Jun 24, 2019

Unified probabilistic filter API #1997

Closed

anish749 deleted the mutableBloomFilters branch August 10, 2019 11:46
