Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance issues? #9

Open
jnysteen opened this issue Jun 9, 2019 · 3 comments
Open

Performance issues? #9

jnysteen opened this issue Jun 9, 2019 · 3 comments
Assignees

Comments

@jnysteen
Copy link

jnysteen commented Jun 9, 2019

Hi - thanks for creating a library for Bloom filters (with a great API!)

I have run some experiments using your scalable Bloom filters, but they do not seem to perform very well :(

I created a fork containing code for benchmarking the use cases I want to use your library for, as well as experiments with optimizing some of your code.

One performance bottleneck can be found in your choice of method for converting objects to bytes, in order to hash the bytes. I have created a project for benchmarking various methods for the conversion, showing that the BinaryFormatter used in your library performs horribly compared to the alternatives.

The benchmarks in my fork shows the performance improvements achievable by replacing BinaryFormatter with alternatives (I have included the benchmarking results at the bottom of this issue - notice the memory usage column on the far right)

Despite the new optimizations, the memory usage of your Bloom filters (esspecially the scalable version, which I really want to use) is very high compared to an alternative like a HashSet.

Is this just the nature of the implementation, or can it be improved?

(The table below is a part of the output of the benchmarking code. In the tabl, your original implementation of a scalable Bloom filter is called ScalableBloomFilter.)

BenchmarkDotNet=v0.11.5, OS=macOS Mojave 10.14.5 (18F203) [Darwin 18.6.0]
Intel Core i7-8850H CPU 2.60GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=2.2.300
  [Host] : .NET Core 2.2.5 (CoreCLR 4.6.27617.05, CoreFX 4.6.27618.01), 64bit RyuJIT
  Core   : .NET Core 2.2.5 (CoreCLR 4.6.27617.05, CoreFX 4.6.27618.01), 64bit RyuJIT

Job=Core  Runtime=Core  
Method ItemsToInsert MaximumErrorRate Mean Error StdDev Ratio RatioSD Rank Gen 0 Gen 1 Gen 2 Allocated
HashSet 1000 0.02 146.3 μs 0.1939 μs 0.1620 μs 1.00 0.00 1 15.1367 - - 70.31 KB
StringOptimizedScalableBloomFilter 1000 0.02 909.7 μs 5.6858 μs 5.0404 μs 6.21 0.03 2 234.3750 - - 1083.49 KB
GenericOptimizedScalableBloomFilter 1000 0.02 5,701.6 μs 42.8845 μs 40.1142 μs 39.00 0.27 3 2414.0625 - - 11144.79 KB
ScalableBloomFilter 1000 0.02 6,271.6 μs 56.0022 μs 52.3845 μs 42.89 0.41 4 2390.6250 - - 11029.23 KB
HashSet 10000 0.02 1,492.3 μs 3.4257 μs 3.0368 μs 1.00 0.00 1 152.3438 - - 703.13 KB
StringOptimizedScalableBloomFilter 10000 0.02 17,605.3 μs 45.3247 μs 37.8482 μs 11.80 0.04 2 4062.5000 - - 18814.31 KB
GenericOptimizedScalableBloomFilter 10000 0.02 106,426.8 μs 1,297.8431 μs 1,214.0032 μs 71.35 0.88 3 43600.0000 - - 201469.91 KB
ScalableBloomFilter 10000 0.02 114,675.1 μs 952.2085 μs 844.1081 μs 76.84 0.62 4 43200.0000 - - 199364.2 KB
HashSet 50000 0.02 8,018.3 μs 23.7326 μs 21.0383 μs 1.00 0.00 1 828.1250 - - 3828.13 KB
StringOptimizedScalableBloomFilter 50000 0.02 103,353.9 μs 1,100.2632 μs 975.3547 μs 12.89 0.11 2 26400.0000 - - 122239.81 KB
GenericOptimizedScalableBloomFilter 50000 0.02 693,657.8 μs 2,892.3871 μs 2,564.0259 μs 86.51 0.44 3 285000.0000 - - 1316991.46 KB
ScalableBloomFilter 50000 0.02 734,364.0 μs 9,817.7735 μs 9,183.5514 μs 91.65 1.16 4 282000.0000 - - 1303198.18 KB
@rmc00
Copy link
Owner

rmc00 commented Jun 11, 2019

Thanks for this input! I really appreciate the thorough data. I've been aware of this issue for a while now, but frankly it's been tough for me to decide what the right solution is exactly. The answer that I see requires some breaking changes in the API and more importantly also requires that consumers handle serialization prior to passing objects to any of the data structures that require hashing (almost all of them). I've got some code where I've tinkered with making the change as a new major version and adding some serialization helper methods as extension methods on the data structures. I think you've convinced me that now is the time to make that code production-ready and get it out for folks to use. I appreciate your feedback, and I'll keep this issue updated as I work through this update.

@rmc00 rmc00 self-assigned this Jun 11, 2019
@jnysteen
Copy link
Author

I'm happy that you found it useful and looking forward to seeing what you have in mind :)

@TrevorDArcyEvans
Copy link

Are there any updates to share on this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants