# Bloom Filter
A [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) is a probabilistic data structure based on hashing. It’s very similar to a [hash table](https://en.wikipedia.org/wiki/Hash_table), but differs in several important respects:

* only `add()` and `contains()` operations are supported (I’ll skip `union`)
* `contains()` may return false positives
* uses a fixed amount of memory (it can’t grow), but scales well to large data sets

A Bloom filter is relatively simple. It uses a fixed-size bit array, zeroed at the beginning, and a fixed collection of `k` hash functions.

`add(item)` sets all `k` bits of the array selected by the hash functions to `1`, `array[hash[i](item)] = 1`.

`contains(item)` conversely checks if all the `k` bits are set,
`all(array[hash[i](item)] == 1)`.

Any item that has been added is always reported as present, because its bits are never cleared. However, an item that has not been added may also be reported as present if other items happened to set all `k` of its bits. That’s a false positive.
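To make the false positive case concrete, here is a toy sketch with a 10-bit array and two hash functions. The hash functions (last digit and tens digit of an integer) are hypothetical and chosen only so the collisions are easy to follow by hand; they are not the ones used later in this article.

```python
# Toy Bloom filter: 10 bits and k = 2 deliberately simple hash functions.
M = 10
bits = [False] * M

def positions(x):
    # Hypothetical hash functions, for illustration only:
    # the last digit and the tens digit of a non-negative integer.
    return [x % M, (x // 10) % M]

def add(x):
    for p in positions(x):
        bits[p] = True

def contains(x):
    return all(bits[p] for p in positions(x))

add(12)  # sets bits 2 and 1
add(34)  # sets bits 4 and 3

print(contains(12))  # True: an added item is always found
print(contains(56))  # False: bits 6 and 5 are unset
print(contains(13))  # True: false positive, bits 3 and 1 were set by 34 and 12
```

The item `13` was never added, yet both of its bits were set by other items, so the filter reports it as present.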

I will skip the math behind the false positive probability and try the Bloom filter directly in code. Let’s use a textbook example: users are coming to a website and, based on the user’s IP address, we want to find out whether the user is returning.

There are two groups of about a million users each: A, the returning users, and B, the new users. Using a standard hash table, we would need about `6*10**6` bytes of memory.

A Bloom filter with `10**6` bytes of memory and `3` hash functions has a false positive rate of about 4%. A Bloom filter with `4*10**6` bytes of memory and `6` hash functions is below 0.1%.
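As a rough cross-check of these numbers, the standard approximation for the false positive rate of a Bloom filter with `m` bits, `n` stored items and `k` hash functions is `(1 - exp(-k*n/m))**k`. The sketch below evaluates it for the two configurations above. The function names are my own, and the formula assumes independent, uniformly distributed hash functions, which a simple hash may not satisfy.

```python
import math

def false_positive_rate(m_bits, n_items, k):
    # Standard approximation, assuming k independent uniform hash functions.
    return (1.0 - math.exp(-k * n_items / m_bits)) ** k

def optimal_k(m_bits, n_items):
    # Number of hash functions that minimizes the approximation above.
    return (m_bits / n_items) * math.log(2)

n = 10 ** 6
for array_bytes, k in [(10 ** 6, 3), (4 * 10 ** 6, 6)]:
    m = array_bytes * 8  # bytes to bits
    rate = false_positive_rate(m, n, k)
    print('{} bytes, {} hashes: {:.5f} expected false positive rate, optimal k {:.1f}'.format(
        array_bytes, k, rate, optimal_k(m, n)))
```

The approximation gives roughly 3% for the first configuration and far below 0.1% for the second. Measured rates can differ from these values, since they depend on how evenly the actual hash function spreads the items over the array.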

See the run section at the end of the article.

In [1]:
import numpy as np
from collections import deque
from bitarray import bitarray

## algorithm

In [2]:
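def ihash(x):
    # Endless generator of 32-bit hash values for a sequence of ints x.
    # Each pass over x mixes the items into the running state h again,
    # so successive yields act as the different hash functions.
    h = 86813
    while True:
        for i in x:
            h = ((h + i) * 127733) % (1 << 32)
        yield h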

In [3]:
def bloom_filter(array_bytes, k):
    array = bitarray(array_bytes * 8)
    array.setall(0)

    def _hash(x):
        for _, h in zip(range(k), ihash(x)):
            yield h % len(array)
    
    def _add(x):
        for h in _hash(x):
            array[h] = 1

    def _contains(x):
        return all(array[h] for h in _hash(x))

    return _add, _contains

In [4]:
def measure_accuracy(A, B, array_bytes, k):
    add, contains = bloom_filter(array_bytes, k)
    
    # store A (a deque with maxlen=0 just consumes the generator)
    deque((add(x) for x in A), 0)

    # find false positives in B
    fp = sum(contains(x) for x in B)

    # result
    acc = 1 - fp / len(B)
    print('{} hashes, {} false positives, {:.4f} accuracy'.format(k, fp, acc))

## run

In [5]:
n = 10 ** 6
A = set(map(tuple, np.random.randint(0, 256, (n, 4))))
B = set(map(tuple, np.random.randint(0, 256, (n, 4)))) - A
len(A), len(B)

(999876, 999654)

In [6]:
for k in [1, 2, 3, 4]:
    measure_accuracy(A, B, n, k)

1 hashes, 117928 false positives, 0.8820 accuracy
2 hashes, 67614 false positives, 0.9324 accuracy
3 hashes, 40024 false positives, 0.9600 accuracy
4 hashes, 61675 false positives, 0.9383 accuracy


In [None]:
for k in [1, 2, 4, 6, 8]:
    measure_accuracy(A, B, n * 4, k)

1 hashes, 30717 false positives, 0.9693 accuracy
2 hashes, 5569 false positives, 0.9944 accuracy
4 hashes, 968 false positives, 0.9990 accuracy
