A fast Bloom filter for Ruby. Written in C.
C Ruby
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.
ext/bloom_filter
lib
spec
.gitignore
LICENSE
README.textile
Rakefile
VERSION.yml
bloom_filter.gemspec

README.textile

Bloom Filter

This is a Bloom filter for Ruby written in C.

What is a Bloom filter?

I suck at explaining things. Wikipedia doesn’t. http://wikipedia.org/wiki/Bloom_filter.

But in short a Bloom filter is a probabilistic data structure which is used to test for membership in a set. False positives can happen, but false negatives can not. The exact error rate of a Bloom filter can be tuned by changing the size of it.

A Bloom filter is really just a bitmap of some size. When you insert a given key into it, it hashes that key with some number of hash functions, and uses the result of those hash functions to set bits in the bitmap. Retrieval obviously works in the same way, run the N hash functions and see if the bits are set.

Why would I use a Bloom filter?

They’re extremely compact and quite fast. The size of a Bloom filter is dependent upon the number of keys you’ll be inserting into it and the acceptable error rate. The smaller the size of the bloom filter, the higher the error rate.

So, one idea is that you could use a Bloom filter as a pre-filter for database calls. Before requesting a particular id from the database, you could query the Bloom filter. If it returns true for the id, you can query the database for the data. If it returns false, you don’t. Since false negatives cannot happen, you’ll never miss querying the database when you should. However, you may (at a particular error rate) query the database when you don’t necessarily need to. But, as before, this can be tuned according to your needs.

How do I use a Bloom filter?

filter = BloomFilter.new( {size of bitmap}, {number of hash functions} )
filter = BloomFilter.new(1024, 3)

filter.set(123)
filter.set('abc')
filter.set(:foo)

filter.get(123)    #=> true
filter.get('abc')  #=> true
filter.get(:foo)   #=> true

filter.get('something else') #=> nil

Of course, expecting the user to come up with meaningful values for the bitmap size and the hash function count is a bit silly. So, you can also use the BloomFilter.for_error_rate method.

filter = BloomFilter.for_error_rate( {maximum error rate}, {estimate of how many keys will be added} )
filter = BloomFilter.for_error_rate(0.05, 1000)

filter.filter_size #=> 6236
filter.hash_count  #=> 4

And mind you, filter_size returns the size of the bitmap in bits, not bytes. So the above example would have an actual size of 780 bytes. Pretty compact indeed.

Todo

  • Add benchmarks
  • Add graphs of size vs. error rate