# Bloom Filters Implementation

Lifted extant from GeeksForGeeks [Bloom Filters -- Indoduction and Implementation](https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/)

Uses [mmh3](https://pypi.org/project/mmh3/2.0/) and [bitarray](https://pypi.org/project/bitarray/) for hashing

In [1]:
# Install mmh3 and bitarray 3rd party modules first:
%pip install mmh3 
%pip install bitarray

Collecting mmh3
  Downloading mmh3-4.0.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata (13 kB)
Downloading mmh3-4.0.1-cp311-cp311-macosx_10_9_x86_64.whl (34 kB)
Installing collected packages: mmh3
Successfully installed mmh3-4.0.1

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
Collecting bitarray
  Downloading bitarray-2.9.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata (34 kB)
Downloading bitarray-2.9.1-cp311-cp311-macosx_10_9_x86_64.whl (127 kB)
[2K   [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m127.9/127.9 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m MB/s[0m eta [36m0:00:01[0m
[?25hInstalling collected packages: bitarray
Successfully installed bitarray-2.9.1

[1m[[0m[34;49mn

In [6]:
# Python implementation of Bloom Filters
import math
import mmh3
from bitarray import bitarray

class BloomFilter(object):
  '''
  Class for Bloom filter, using murmur3 hash function
  '''
  def __init__(self, items_count, fp_prob):
    # int items_count = number of items expected to be stored in filter
    # float fp_prob = false positive probability
    self.fp_prob = fp_prob
    
    # Size of bit array
    self.size = self.get_size(items_count, fp_prob) 
    
    # number of hash functions to use:
    self.hash_count = self.get_hash_count(self.size, items_count)

    # Bit array of given size
    self.bit_array = bitarray(self.size)
    
    # initialize all bits as 0
    self.bit_array.setall(0)
    
  def add(self, item):
    # add item to filter
    digests = []
    for i in range(self.hash_count):
      # create digest for given item
      # use i as seed to mmh3.hash()
      # differing seeds = differing digests
      digest = mmh3.hash(item, i) % self.size
      digests.append(digest)
      
      # set the bit to True in bitarray
      self.bit_array[digest] = True

  def check(self, item):
    # check of existence of item in filter
    for i in range(self.hash_count):
      digest = mmh3.hash(item, i) % self.size
      if self.bit_array[digest] == False:
        # if any of the bit is False the item is not present in filter
        # else there is a change that it exists
        return False
    return True

    
    
  @classmethod # decorator functions
  def get_size(self, n, p):
    ''' 
      Return the size of bit array(m) according to this formula:
      m = -(n * lg(p) / lg(2)^2)
      
      int n = number of items expected to be processed
      float p = false positive probability in decimal
    '''
    m = -(n * math.log(p)) / (math.log(2)**2)
    return int(m)

  @classmethod
  def get_hash_count(self, m, n):
    '''
    Return the size of bitarray(m) according to the following formula
    k = (m/n) * log(2) where
    
    int m = size of bit array
    int n = number of items expected to be stored in filter
    '''
    k = (m/n) * math.log(2)
    return int(k)


In [8]:

# import BloomFilter
from random import shuffle

n = 20 # number of items
p = 0.05 # false positive probability

bloomf = BloomFilter(n, p)
print("Size of bit array:{}".format(bloomf.size))
print("False positive probability:{}".format(bloomf.fp_prob))
print("Number of hash functions:{}".format(bloomf.hash_count))

# words to be added 
word_present = ['abound','abounds','abundance','abundant','accessible', 
                'bloom','blossom','bolster','bonny','bonus','bonuses', 
                'coherent','cohesive','colorful','comely','comfort', 
                'gems','generosity','generous','generously','genial'] 
  
# words not added 
word_absent = ['bluff','cheater','hate','war','humanity', 
               'racism','hurt','nuke','gloomy','facebook', 
               'geeksforgeeks','twitter'] 

for item in word_present:
  bloomf.add(item)
  
  shuffle(word_present)
  shuffle(word_absent)
  
  test_words = word_present[:10] + word_absent
  shuffle(test_words)
  for word in test_words:
    if bloomf.check(word):
      if word in word_absent:
        print("'{}' is a false positive!",format(word))
      else:
        print("'{}' is probably present!".format(word))
    else:
      print("'{}' is definitely not present!".format(word))
    



Size of bit array:124
False positive probability:0.05
Number of hash functions:4
'racism' is definitely not present!
'war' is definitely not present!
'bonus' is definitely not present!
'bonny' is definitely not present!
'gems' is definitely not present!
'cheater' is definitely not present!
'generosity' is definitely not present!
'abound' is probably present!
'geeksforgeeks' is definitely not present!
'bluff' is definitely not present!
'comfort' is definitely not present!
'humanity' is definitely not present!
'facebook' is definitely not present!
'genial' is definitely not present!
'bolster' is definitely not present!
'hurt' is definitely not present!
'nuke' is definitely not present!
'hate' is definitely not present!
'comely' is definitely not present!
'generous' is definitely not present!
'gloomy' is definitely not present!
'twitter' is definitely not present!
'cheater' is definitely not present!
'bluff' is definitely not present!
'gloomy' is definitely not present!
'hate' is definite