In [29]:
!pip install mmh3








# Bloom Filter
This notebook is part of the series **Mastering Data Structures for Databases**.
You can find the article at https://medium.com/@utkarshpriyadarshi5026/mastering-data-structures-for-databases-part-3-bloom-filters-f92f3bff7dcc

## Hash functions used in the Bloom Filter

Let's create a hash function type that will be used in the Bloom Filter.

Our hash function will take two arguments:
- a string to be hashed
- an integer that will be used as a seed for the hash function

In [30]:
from typing import Callable

HashFunction = Callable[[str, int], int]


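### Cryptographic Hash Functions

Cryptographic hash functions such as SHA-256, MD5, and SHA-1 can serve as Bloom filter hashes. Each function below appends the seed to the item before hashing, so different seeds produce different values for the same item.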


In [31]:
import hashlib

def sha256_hash(item: str, seed: int = 0) -> int:
    hash_value = int(hashlib.sha256((item + str(seed)).encode()).hexdigest(), 16)
    return hash_value

def md5_hash(item: str, seed: int = 0) -> int:
    hash_value = int(hashlib.md5((item + str(seed)).encode()).hexdigest(), 16)
    return hash_value

def sha1_hash(item: str, seed: int = 0) -> int:
    return int(hashlib.sha1((item + str(seed)).encode()).hexdigest(), 16)

### Non-Cryptographic Hash Functions

#### Murmur Hash
MurmurHash processes the input data in blocks, mixing the bits in each block to produce a final hash value. It uses a combination of multiplication and bitwise operations to achieve a good distribution of hash values.
#### DJB2 Hash
DJB2 starts with an initial hash value (often 5381) and iterates over each character in the input string. For each character, it multiplies the current hash value by 33 (written in code as `(hash << 5) + hash`) and adds the character's code point, which equals the ASCII value for ASCII text.
#### FNV-1a Hash
FNV-1a starts with an initial hash value (the FNV offset basis) and iterates over each character in the input string. For each character, it XORs the current hash value with the character's code point and then multiplies the result by the FNV prime. The reference algorithm works on bytes, so the per-character form matches it only for ASCII input.
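
As a quick sanity check on the two descriptions above, here is a minimal byte-wise sketch of both algorithms in their unseeded 32-bit form. The function names `fnv1a_32` and `djb2_32` are hypothetical and are not used elsewhere in this notebook. Working on bytes rather than characters is an assumption made here so that the FNV-1a result can be compared against the published test vector for the input `a`.

```python
def fnv1a_32(data: bytes) -> int:
    """Unseeded 32-bit FNV-1a over raw bytes (illustrative sketch)."""
    hash_value = 0x811C9DC5  # 32-bit FNV offset basis
    for byte in data:
        hash_value ^= byte  # XOR in the next byte
        hash_value = (hash_value * 0x01000193) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return hash_value


def djb2_32(data: bytes) -> int:
    """Unseeded DJB2 over raw bytes, truncated to 32 bits (illustrative sketch)."""
    hash_value = 5381  # conventional DJB2 starting value
    for byte in data:
        hash_value = (hash_value * 33 + byte) & 0xFFFFFFFF
    return hash_value


print(hex(fnv1a_32(b"a")))  # 0xe40c292c
print(djb2_32(b"a"))        # 177670
```

An empty input returns the starting value unchanged in both cases, which is an easy way to confirm the constants are typed correctly.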


In [32]:
import mmh3

def murmur_hash(item: str, seed: int = 0) -> int:
    hash_value = mmh3.hash(item, seed)
    return hash_value

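def djb2_hash(item: str, seed: int = 0) -> int:
    # Start from the conventional DJB2 value 5381, offset by the seed
    hash_value = 5381 + seed
    for char in item:
        # (hash << 5) + hash is hash * 33
        hash_value = ((hash_value << 5) + hash_value) + ord(char)
    return hash_value & 0xFFFFFFFF  # keep the low 32 bits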

def fnv1a_hash(item: str, seed: int = 0) -> int:
    hash_value = 0x811c9dc5 + seed # FNV offset basis
    fnv_prime = 0x01000193 # 32 bit FNV prime
    for char in item:
        hash_value ^= ord(char) # XOR
        hash_value *= fnv_prime # Multiplication
    return hash_value & 0xFFFFFFFF 

## Structure of the Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. The structure of a Bloom filter consists of the following components:

1. Bit Array:  
    - A fixed-size bit array (or bit vector) initialized to all zeros. The size of the bit array determines the accuracy and space efficiency of the Bloom filter.
2. Hash Functions:  
    - Multiple independent hash functions that map elements to positions in the bit array. Each hash function should uniformly distribute the input elements across the bit array.
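
To make the link between bit array size and accuracy concrete, the sketch below evaluates the standard approximation for the false-positive probability, `(1 - e^(-k*n/m))^k`, where `m` is the number of bits, `k` the number of hash functions, and `n` the number of inserted items. The approximation assumes independent, uniformly distributed hash functions, and the helper name `estimated_false_positive_rate` is hypothetical.

```python
import math


def estimated_false_positive_rate(num_bits: int, num_hashes: int, num_items: int) -> float:
    """Approximate false-positive probability for a Bloom filter.

    Uses (1 - e^(-k*n/m))^k, which assumes ideal (independent, uniform) hashes.
    """
    fraction_of_bits_set = 1.0 - math.exp(-num_hashes * num_items / num_bits)
    return fraction_of_bits_set ** num_hashes


# How the estimate grows as more items go into a 100-bit filter with 3 hashes
for n in (10, 50, 100):
    rate = estimated_false_positive_rate(num_bits=100, num_hashes=3, num_items=n)
    print(f"n={n}: estimated false-positive rate = {rate:.4f}")
```

The estimate rises quickly as the filter fills up, which is why the bit array is usually sized for the expected number of items.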

## Operations on the Bloom Filter

1. Add:
    - To add an element to the Bloom filter, each hash function is applied to the element to get multiple hash values. The corresponding positions in the bit array are then set to 1.
2. Check:
    - To check if an element is in the Bloom filter, each hash function is applied to the element to get multiple hash values. If all the corresponding positions in the bit array are set to 1, the element is likely in the set. If any position is 0, the element is definitely not in the set.

In [33]:
from typing import List

class BloomFilter:
    def __init__(self, size: int, hash_functions: List[HashFunction]) -> None:
        self.size = size
        self.hash_functions = hash_functions
        self.bit_array = [0] * size
        print(f"Initialized Bloom Filter with size {size} and {len(hash_functions)} hash functions.")

    def _hashes(self, item: str) -> List[int]:
        hashes = []
        for i, hash_func in enumerate(self.hash_functions):
            hash_value = hash_func(item, i) % self.size
            hashes.append(hash_value)
            print(f"Hash {i+1} {hash_func.__name__} for '{item}' is {hash_value}.")
        return hashes

    def add(self, item: str) -> None:
        print("== Add Operation ==")
        hashes = self._hashes(item)
        for hash_value in hashes:
            # Mark every position produced by the hash functions
            self.bit_array[hash_value] = 1
            print(f"Set bit_array[{hash_value}] to 1.")

        print(f"'{item}' added to Bloom Filter.\n\n")

    def check(self, item: str) -> bool:
        hashes = self._hashes(item)
        result = all(self.bit_array[hash_value] == 1 for hash_value in hashes)
        print(f"Checking '{item}': {'Present' if result else 'Not Present'} in Bloom Filter.")
        return result
    
    def show(self) -> None:
        print(self.bit_array)

In [34]:
import random


# Create a Bloom Filter with random hash functions
def create_random_bloom_filter(size: int, num_functions: int) -> BloomFilter:
    available_hash_functions: List[HashFunction] = [
        sha256_hash,
        md5_hash,
        sha1_hash,
        murmur_hash,
        djb2_hash,
        fnv1a_hash,
    ]
    if num_functions > len(available_hash_functions):
        raise ValueError("Number of hash functions exceeds the available functions.")
    
    selected_functions = random.sample(available_hash_functions, num_functions)
    
    for i, func in enumerate(selected_functions):
        print(f"Selected Hash Function {i+1}: {func.__name__}")
        
    return BloomFilter(size, selected_functions)

In [35]:
bloom_filter = create_random_bloom_filter(100, 3)

Selected Hash Function 1: sha1_hash
Selected Hash Function 2: murmur_hash
Selected Hash Function 3: fnv1a_hash
Initialized Bloom Filter with size 100 and 3 hash functions.


In [36]:
bloom_filter.add("apple")
bloom_filter.check("apple")

== Add Operation ==
Hash 1 sha1_hash for 'apple' is 75.
Hash 2 murmur_hash for 'apple' is 23.
Hash 3 fnv1a_hash for 'apple' is 85.
Set bit_array[75] to 1.
Set bit_array[23] to 1.
Set bit_array[85] to 1.
'apple' added to Bloom Filter.


Hash 1 sha1_hash for 'apple' is 75.
Hash 2 murmur_hash for 'apple' is 23.
Hash 3 fnv1a_hash for 'apple' is 85.
Checking 'apple': Present in Bloom Filter.


True
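
The run above only checks an item that was added. To see the "likely in the set" versus "definitely not in the set" behaviour, the sketch below counts false positives empirically. It is self-contained and uses a hypothetical `QuietBloomFilter` class (no logging, seeded SHA-256 only) rather than the `BloomFilter` class above, so the exact numbers are an illustration under those assumptions and not a measurement of the filter built earlier.

```python
import hashlib
from typing import Iterator


class QuietBloomFilter:
    """Minimal Bloom filter without logging (illustrative sketch)."""

    def __init__(self, size: int, num_hashes: int) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per bit position, for simplicity

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes positions from SHA-256 by prefixing a seed
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits[position] = 1

    def check(self, item: str) -> bool:
        return all(self.bits[position] for position in self._positions(item))


bf = QuietBloomFilter(size=100, num_hashes=3)
added = [f"fruit-{i}" for i in range(20)]
for item in added:
    bf.add(item)

# Every added item must be reported as present: no false negatives
assert all(bf.check(item) for item in added)

# Items that were never added may still be reported as present
probes = [f"other-{i}" for i in range(1000)]
false_positives = sum(bf.check(probe) for probe in probes)
print(f"False positives: {false_positives} / {len(probes)}")
```

Increasing `size` while keeping the number of added items fixed should drive the false-positive count down.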