
Change Bloom Filter implementation for Sparse Joins #1686

Closed
anish749 opened this issue Feb 18, 2019 · 11 comments
Labels
blocked Blocked by upstream

Comments

@anish749
Contributor

In the current version of sparse joins and lookups, we are using Algebird's Bloom Filter aggregators.
This works well and is very easy and elegant to express in Scala, but it's not performant: it takes a lot of wall time and CPU to build the filter (~3 hrs of wall time for 3M entries). For smaller datasets (~5M entries) I found a simple HashSet filter to work better, though that starts to approach the SideInput size limits. For around 50M entries, it starts taking longer than a shuffle join.
The Bloom Filters from Breeze as well as ClearSpring's stream-lib, however, get constructed within 3 to 4 minutes of wall time for 3M entries.
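
For context, the Algebird-based construction looks roughly like the following. This is a minimal sketch outside of Scio, assuming Algebird 0.13.x and its built-in Hash128[String] instance; it is not the exact code Scio uses internally.

```scala
import com.twitter.algebird.{BloomFilter, BloomFilterAggregator}

object AlgebirdBfSketch {
  def main(args: Array[String]): Unit = {
    val numEntries = 3000000 // expected number of keys
    val fpProb     = 0.01    // acceptable false-positive probability

    // The monoid carries the implicit Hash128[String]; the aggregator folds
    // keys into a single Bloom filter.
    val bfMonoid     = BloomFilter[String](numEntries, fpProb)
    val bfAggregator = BloomFilterAggregator(bfMonoid)

    // In Scio this aggregation runs over the SCollection of keys; here we
    // fold a small local collection just to show the shape of the API.
    val keys = (1 to 1000).map(i => s"key$i")
    val bf   = bfAggregator(keys)

    println(bf.contains("key1").isTrue)     // very likely true
    println(bf.contains("missing").isTrue)  // false, barring a false positive
  }
}
```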

Suggestion 1:
I feel that using stream-lib from AddThis / ClearSpring instead of Algebird's BloomFilter would allow us to optimise a lot more jobs. I am not sure why, but I was getting a very high false positive rate with Breeze BFs (maybe I was making a mistake, as that seems weird).
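
Roughly, building the filter with stream-lib looks like the sketch below, assuming com.clearspring.analytics:stream is on the classpath. The constructor taking an expected element count and a false-positive probability, and the add/isPresent methods, are the ones I have in mind; this is worth double-checking against the stream-lib version used.

```scala
import com.clearspring.analytics.stream.membership.BloomFilter

object StreamLibBfSketch {
  def main(args: Array[String]): Unit = {
    val expectedEntries = 3000000
    val fpProbability   = 0.01

    // stream-lib's BloomFilter is mutable and hashes raw Strings / byte arrays
    // directly, instead of going through an implicit Hash128[K] as Algebird does.
    val bf = new BloomFilter(expectedEntries, fpProbability)
    (1 to 1000).foreach(i => bf.add(s"key$i"))

    println(bf.isPresent("key1"))     // very likely true
    println(bf.isPresent("missing"))  // false, barring a false positive
  }
}
```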

Breaking API changes(?)
Algebird provides us with a Hash128[K]; however, the BF in stream-lib takes a byte array / String. This means we would need to add an implicit ByteArrayEncoder for the primitive types covered by Algebird's Hash128, to make sure we are not breaking any API.
Can we go ahead with adding this additional implicit and move away from Algebird's BF for the sparse functions?
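
To be concrete, the kind of typeclass I have in mind is sketched below. The name ByteArrayEncoder and its instances are purely illustrative, not an existing Scio or Algebird API.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Hypothetical typeclass: turns a key of type K into the byte array that
// stream-lib's BloomFilter can hash.
trait ByteArrayEncoder[K] extends Serializable {
  def encode(k: K): Array[Byte]
}

object ByteArrayEncoder {
  def apply[K](f: K => Array[Byte]): ByteArrayEncoder[K] =
    new ByteArrayEncoder[K] { def encode(k: K): Array[Byte] = f(k) }

  // Instances for the primitive types Algebird's Hash128 already covers, so
  // existing call sites that rely on Hash128[K] would keep compiling.
  implicit val intEncoder: ByteArrayEncoder[Int] =
    ByteArrayEncoder(i => ByteBuffer.allocate(4).putInt(i).array())
  implicit val longEncoder: ByteArrayEncoder[Long] =
    ByteArrayEncoder(l => ByteBuffer.allocate(8).putLong(l).array())
  implicit val stringEncoder: ByteArrayEncoder[String] =
    ByteArrayEncoder(_.getBytes(StandardCharsets.UTF_8))
  implicit val bytesEncoder: ByteArrayEncoder[Array[Byte]] =
    ByteArrayEncoder(identity)
}
```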

Also, because this adds an additional dependency, should it be part of scio-extra? That would again mean removing those methods from scio-core, and thus breaking the API again. I am a bit unsure about adding an extra dependency.

I have some code in an internal repository using stream-lib, and here are some benchmarks. Based on feedback on the above, I can create a PR fixing this.

Here is a Dataflow join of ~4M with two datasets of 1B each.
Using Algebird Bloom Filters
(screenshot of the Dataflow job metrics, 2019-02-15)

Using ClearSpring / AddThis BloomFilters
(screenshot of the Dataflow job metrics, 2019-02-15)

Suggestion 2:
Optimise Algebird's Bloom Filters and make them run faster. The Bloom Filter implementation in Algebird is very slow at the moment, but it can surely be fixed to run faster.

@regadas
Contributor

regadas commented Feb 19, 2019

@anish749 interesting stuff! I'll have a look at this and I'll come back to you.

@anish749
Contributor Author

I've worked on this a bit. I did some benchmarks on Algebird Bloom Filters; we can optimise the implementation and improve the wall time. I'll post something soon.

@regadas
Contributor

regadas commented Feb 21, 2019

Nice! That's awesome! To be honest, I haven't had proper time to look at this yet. Maybe today...

@anish749
Contributor Author

Here is a PR in Algebird with an implementation that would be able to match this performance with no breaking changes to any API: twitter/algebird#673

@regadas
Contributor

regadas commented Feb 25, 2019

PR #1699 brings some nice improvements. @anish749, can you try the snapshot with your use case once it's available?

@regadas
Contributor

regadas commented Feb 26, 2019

Our benchmarks have shown improvements of around 30% 😄

@anish749
Contributor Author

That's awesome! I'll try this out tomorrow 👍

@anish749
Contributor Author

anish749 commented Mar 1, 2019

I tried this out, but I'm unable to replicate it. It is a bit difficult to get snapshot versions and try them out without releasing the internal mono repo; I was in dependency hell for quite some time. Do you have any canary to test against directly?

I did rename some functions, copied these classes, and sort of hacked my way around to run this optimisation, but it looks so dirty that I suspect I am introducing bugs!

@anish749
Contributor Author

anish749 commented Mar 6, 2019

@regadas
I tried out 0.7.2 with #1699, and there is a ~30% improvement in wall time. 👍 4 hr -> 2 hr 50 min

@jbx jbx added blocked Blocked by upstream P2 labels Mar 15, 2019
@anish749
Contributor Author

There is another option: use Guava, which offers a BloomFilter with similar performance. However, we would need to add another piece of implicit evidence, similar to Hash128, for the Funnel that Guava uses; this is essentially just an implicit to convert from a type T to a byte array for hashing.
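
A sketch of what that could look like with Guava's hashing API is below. The Funnel supplied to BloomFilter.create plays the role of the implicit evidence mentioned above; the sizes and keys here are just illustrative.

```scala
import java.nio.charset.StandardCharsets
import com.google.common.hash.{BloomFilter, Funnels}

object GuavaBfSketch {
  def main(args: Array[String]): Unit = {
    val expectedInsertions = 3000000L
    val fpp                = 0.01

    // The Funnel tells Guava how to feed a value of type T into the hasher,
    // much like the implicit encoder / Hash128 evidence discussed above.
    val bf = BloomFilter.create[CharSequence](
      Funnels.stringFunnel(StandardCharsets.UTF_8), expectedInsertions, fpp)

    (1 to 1000).foreach(i => bf.put(s"key$i"))

    println(bf.mightContain("key1"))     // very likely true
    println(bf.mightContain("missing"))  // false, barring a false positive
  }
}
```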

@regadas
Contributor

regadas commented May 3, 2019

Closed via #1806

@regadas regadas closed this as completed May 3, 2019