
##Problem:
Implement a data structure which carries out the following operations without resizing the underlying array:

- `add(value)`: Add a value to the set of values.
- `check(value)`: Check whether a value is in the set.

The `check` method may return occasional false positives (in other words, incorrectly identifying an element as part of the set), but should always correctly identify a true element.

##Solution:
This problem describes a Bloom filter: a probabilistic data structure that gains space efficiency at the cost of occasional false positives on membership checks. It never returns false negatives, which means that if it reports an element as absent from the set, the element truly is absent.

Bloom filters use multiple hash functions to map each added element to several positions in a bit array. The structure works in three steps:

1. **Initialization**: Choose the size of the bit array (`m`) and the number of hash functions (`k`). The size of the underlying array doesn't change once the Bloom filter is created, which satisfies the no-resizing requirement.
2. **Add(value)**: Hash the value `k` times and for each hash, set the bit at the hash's index in the bit array to 1.
3. **Check(value)**: Hash the value `k` times and check the bits at the indices of these hashes in the bit array. If all bits are 1, the value may be in the set (with a possibility of false positive); if any bit is 0, the value is definitely not in the set.

A Bloom filter depends directly on the properties of its hash functions. Four of them matter here:

1. **Uniform Distribution**: Good hash functions distribute values uniformly across the hash space. This property is crucial for minimizing collisions in a Bloom filter, where the goal is to spread out the indicators (bits) of different elements as evenly as possible across the bit array.

2. **Determinism**: A hash function will always produce the same output for the same input. This determinism is essential for checking membership in a Bloom filter because it ensures that the same indices are accessed in the bit array for a given value during both the `add` and `check` operations.

3. **Efficiency**: Hash functions are generally computationally efficient, allowing for quick computation of hash values. This efficiency is important for Bloom filters, as every `add` and `check` operation requires computing multiple hashes.

4. **Pseudo-randomness**: While deterministic, good hash functions appear pseudo-random, meaning they make it hard to predict where in the bit array a particular value will be mapped. This helps in ensuring that the bits set by different elements are spread out, reducing the chances of false positives.
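
The determinism and uniformity properties above can be checked with a short experiment. The sketch below assumes the seeded SHA-256 scheme used in the Implementation section; the helper name `seeded_hash`, the key names, and the bucket count are illustrative choices, not part of the original problem.

```python
import hashlib
from collections import Counter

def seeded_hash(value, seed, size):
    # Hypothetical helper: derive one of several hash functions by
    # appending a seed to the value before hashing with SHA-256.
    digest = hashlib.sha256(f"{value}{seed}".encode("utf-8")).hexdigest()
    return int(digest, 16) % size

# Determinism: the same (value, seed) pair always maps to the same index.
assert seeded_hash("hello", 0, 1000) == seeded_hash("hello", 0, 1000)

# Uniformity: hash 10,000 distinct values into 10 buckets and count them.
buckets = Counter(seeded_hash(f"item-{i}", 0, 10) for i in range(10_000))
for bucket in sorted(buckets):
    print(bucket, buckets[bucket])  # each count should be close to 1,000
```

If the hash spreads values evenly, no bucket should stand out, which is the behaviour a Bloom filter relies on to keep set bits scattered across the array.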

For this problem, hashing with multiple hash functions lets the Bloom filter support `add` and `check` without ever resizing the underlying array. The fixed-size bit array trades accuracy for space: false positives are the price of never growing.

This approach contrasts with direct addressing or a dynamic data structure like a hash table, which might need to resize to accommodate more elements or reduce collisions. By accepting a controlled rate of false positives, a Bloom filter remains extremely space-efficient, and each `add` or `check` costs `k` hash computations regardless of how many elements are stored.

How often a Bloom filter returns a false positive depends on the balance between the size of the bit array, the number of hash functions, and the number of elements stored. The expected false positive rate can be quantified in terms of:

- **n**: The number of elements added to the Bloom filter.
- **m**: The size of the bit array.
- **k**: The number of hash functions.

The false positive probability (FPP) can be approximated by the formula:

$ \text{FPP} \approx \left( 1 - \left[ 1 - \frac{1}{m} \right]^{kn} \right)^k $
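
As a rough numerical check, the sketch below evaluates this formula next to the common exponential approximation $(1 - e^{-kn/m})^k$, which follows from $\left[ 1 - \frac{1}{m} \right]^{kn} \approx e^{-kn/m}$ for large $m$. The function names and the sample values (`m = 1000`, `n = 100`, `k = 3`) are illustrative assumptions.

```python
from math import exp

def fpp_exact(m, n, k):
    # Probability that a given bit is still 0 after k*n bit-setting steps.
    p_zero = (1 - 1 / m) ** (k * n)
    # A false positive needs all k probed bits to be 1.
    return (1 - p_zero) ** k

def fpp_approx(m, n, k):
    # Uses (1 - 1/m)^(k*n) ~= exp(-k*n/m), valid when m is large.
    return (1 - exp(-k * n / m)) ** k

m, n, k = 1000, 100, 3
print(f"formula above: {fpp_exact(m, n, k):.4f}")
print(f"approximation: {fpp_approx(m, n, k):.4f}")
```

For these values both results come out close to 0.017, so the simpler exponential form is usually good enough for sizing decisions.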

It's worth noting that there's an optimal number of hash functions, $k$, that minimizes the false positive probability for a given Bloom filter size, $m$, and a number of elements, $n$. This optimal number can be calculated using the formula:

$ k = \frac{m}{n} \ln{2} $

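This means that for each element added, up to $k$ bits are set in the bit array (two hashes may land on the same bit), where $k$ is chosen to minimize the false positive rate for the expected number of elements and the size of the bit array. Because $k$ must be a whole number, the result of the formula is rounded in practice.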
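
To see the effect of $k$, the sketch below scans a range of values for a filter with `m = 1000` bits and `n = 100` elements and compares the best one with the formula. It assumes the exponential approximation of the false positive probability, $(1 - e^{-kn/m})^k$; the function and variable names are illustrative.

```python
from math import exp, log

def approx_fpp(m, n, k):
    # Exponential approximation of the false positive probability.
    return (1 - exp(-k * n / m)) ** k

m, n = 1000, 100

# Closed-form optimum; k must be a positive integer, so round it.
k_formula = max(1, round(m / n * log(2)))

# Brute-force check: try k = 1..15 and keep the one with the lowest FPP.
k_best = min(range(1, 16), key=lambda k: approx_fpp(m, n, k))

print(k_formula, k_best)  # both are 7 for these values
print(f"{approx_fpp(m, n, 3):.4f} vs {approx_fpp(m, n, k_best):.4f}")
```

The second line of output compares `k = 3` with the optimum, showing how much the false positive probability drops when `k` is tuned to the ratio `m / n`.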


##Implementation:

Let's implement this in Python. For demonstration, the `k` hash functions are derived from SHA-256 by appending a different seed to the value. This is simple and well distributed but slow; a real application would typically use a faster non-cryptographic hash.

```python
import hashlib

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = [0] * size

    def _hash(self, value, seed):
        # A simple hash function using hashlib; different seeds simulate different hash functions
        hash_obj = hashlib.sha256()
        hash_obj.update(bytes(f"{value}{seed}", 'utf-8'))
        return int(hash_obj.hexdigest(), 16) % self.size

    def add(self, value):
        for seed in range(self.num_hashes):
            index = self._hash(value, seed)
            self.bit_array[index] = 1

    def check(self, value):
        for seed in range(self.num_hashes):
            index = self._hash(value, seed)
            if self.bit_array[index] == 0:
                return False  # Definitely not present
        return True  # Possibly present

# Example of using BloomFilter
bloom = BloomFilter(size=1000, num_hashes=3)
bloom.add("hello")
bloom.add("world")

print(bloom.check("hello"))  # True
print(bloom.check("world"))  # True
print(bloom.check("not_in_set"))  # False, possibly True if collision occurs
```

This code provides a basic Bloom filter implementation. Keep in mind, the effectiveness (in terms of minimizing false positives) of a Bloom filter depends on choosing optimal values for `size` and `num_hashes` based on the expected number of elements to be added. There are formulas based on probability theory to help guide these choices.
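
One common way to make these choices is sketched below: given the expected number of elements `n` and a target false positive probability `p`, pick $m = -\frac{n \ln p}{(\ln 2)^2}$ bits and $k = \frac{m}{n} \ln 2$ hash functions. The sizing formula comes from substituting the optimal $k$ into the exponential approximation of the false positive probability; the function name `suggest_parameters` is hypothetical.

```python
from math import ceil, log

def suggest_parameters(n, p):
    """Return (size, num_hashes) for n expected elements and target FPP p."""
    if n <= 0 or not 0 < p < 1:
        raise ValueError("n must be positive and p must be in (0, 1)")
    # Bits needed when the optimal number of hash functions is used.
    size = ceil(-n * log(p) / (log(2) ** 2))
    # Optimal number of hash functions, rounded to a whole number.
    num_hashes = max(1, round(size / n * log(2)))
    return size, num_hashes

size, num_hashes = suggest_parameters(n=100, p=0.01)
print(size, num_hashes)  # 959 7
```

The returned values can be passed straight to `BloomFilter(size=size, num_hashes=num_hashes)`.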

Running the `BloomFilter` example prints `True`, `True`, and `False`, in that order.


##Testing:
Let's calculate the false positive probability for a given Bloom filter setup. Given values for $m$, $n$, and $k$, the calculation below uses the approximation $\left[ 1 - \frac{1}{m} \right]^{kn} \approx e^{-kn/m}$, which is accurate when $m$ is large:

```python
from math import exp

def calculate_fpp(m, n, k):
    # Approximate false positive probability: (1 - e^(-kn/m))^k
    return (1 - exp(-k * n / m)) ** k

# Example: a Bloom filter with a bit array of size 1000, 100 elements added, and using 3 hash functions
m = 1000
n = 100
k = 3

fpp = calculate_fpp(m, n, k)
print(f"False positive probability: {fpp:.4f}")
```

This prints `False positive probability: 0.0174`.

This calculation gives a quantitative measure of a Bloom filter's performance and helps in designing a filter with an acceptable false positive rate for a specific application.
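
The estimate can also be compared against a measurement. The sketch below is a compact restatement of the `BloomFilter` class from the Implementation section (so that the block runs on its own), filled with 100 keys and probed with 10,000 keys that were never added. The key names and the probe count are arbitrary choices.

```python
import hashlib

class BloomFilter:
    def __init__(self, size, num_hashes):
        self.size = size
        self.num_hashes = num_hashes
        self.bit_array = [0] * size

    def _indices(self, value):
        # One index per seed, using the seeded SHA-256 scheme.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{value}{seed}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for index in self._indices(value):
            self.bit_array[index] = 1

    def check(self, value):
        return all(self.bit_array[index] for index in self._indices(value))

bloom = BloomFilter(size=1000, num_hashes=3)
members = [f"member-{i}" for i in range(100)]
for key in members:
    bloom.add(key)

# No false negatives: every added key must be reported as present.
assert all(bloom.check(key) for key in members)

# Probe keys that were never added and count how many are reported present.
trials = 10_000
false_positives = sum(bloom.check(f"absent-{i}") for i in range(trials))
measured_rate = false_positives / trials
print(f"Measured false positive rate: {measured_rate:.4f}")
```

The measured rate should land near the 0.0174 estimate, though it varies with the particular keys because the number of bits actually set differs slightly from its expected value.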
