# Set Membership

The cell below defines two **abstract classes**. The first represents a set and its basic insert/search operations; you will need to implement this API four times, using (1) sequential search, (2) a binary search tree, (3) a balanced search tree, and (4) a Bloom filter. The second defines the synthetic data generator you will need to implement as part of your experimental framework. <br><br>**Do NOT modify the next cell** - use the dedicated cells further below for your implementation instead. <br>

In [28]:
# DO NOT MODIFY THIS CELL

from abc import ABC, abstractmethod   

# abstract class to represent a set and its insert/search operations
class AbstractSet(ABC):
    
    # constructor
    @abstractmethod
    def __init__(self):
        pass           
        
    # inserts "element" in the set
    # returns "True" after successful insertion, "False" if the element is already in the set
    # element : str
    # inserted : bool
    @abstractmethod
    def insertElement(self, element):     
        inserted = False
        return inserted   
    
    # checks whether "element" is in the set
    # returns "True" if it is, "False" otherwise
    # element : str
    # found : bool
    @abstractmethod
    def searchElement(self, element):
        found = False
        return found    
    
    
    
# abstract class to represent a synthetic data generator
class AbstractTestDataGenerator(ABC):
    
    # constructor
    @abstractmethod
    def __init__(self):
        pass           
        
    # creates and returns a list of length "size" of strings
    # size : int
    # data : list<str>
    @abstractmethod
    def generateData(self, size):     
        data = [""]*size
        return data   


Use the cell below to define any auxiliary data structures and Python functions you may need. Leave the implementation of the main API to the subsequent code cells.

In [2]:
# ADD AUXILIARY DATA STRUCTURE DEFINITIONS AND HELPER CODE HERE



Use the cell below to implement the requested API by means of **sequential search**.

In [3]:
class SequentialSearchSet(AbstractSet):
    
    def __init__(self):
        # ADD YOUR CODE HERE
        self.val = []
     
    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
        if element not in self.val:
            self.val.append(element)
            inserted = True
        return inserted

    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE
        if element in self.val:
            found = True
        return found    

In [5]:
# Delete this cell before submission.
def sequentialCorrectnessTest():
    test = SequentialSearchSet()
    assert test.insertElement("a") == True
    assert test.insertElement("b") == True
    assert test.insertElement("c") == True
    assert test.insertElement("a") == False
    assert test.insertElement("b") == False
    assert test.insertElement("c") == False
    assert test.searchElement("a") == True
    assert test.searchElement("b") == True
    assert test.searchElement("c") == True
    assert test.searchElement("d") == False
    assert test.searchElement("e") == False
    assert test.searchElement("f") == False
    print("✅ Sequential search correctness test passed")

sequentialCorrectnessTest()

✅ Sequential search correctness test passed


Use the cell below to implement the requested API by means of a **binary search tree**.

In [7]:
class BinarySearchTreeSet(AbstractSet):
    
    def __init__(self, val=None, left=None, right=None):
        # ADD YOUR CODE HERE
        # val is None for an empty tree; left/right are child subtrees (or None)
        self.val = val
        self.left = left
        self.right = right
     
    
        
    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
        if self.val is None: # if the tree is empty
            self.val = element
            inserted = True
            return inserted
        
        # traverse the tree
        thisNode = self
        while thisNode:
            if element < thisNode.val:
                if thisNode.left:
                    thisNode = thisNode.left
                else:
                    thisNode.left = BinarySearchTreeSet(element)
                    inserted = True
                    break
            elif element > thisNode.val:
                if thisNode.right:
                    thisNode = thisNode.right
                else:
                    thisNode.right = BinarySearchTreeSet(element)
                    inserted = True
                    break
            else:
                break
        return inserted
    
    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE
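        # an empty tree stores val None, which cannot be compared with a str
        thisNode = self if self.val is not None else None
        while thisNode: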
            if element < thisNode.val:
                thisNode = thisNode.left
            elif element > thisNode.val:
                thisNode = thisNode.right
            else:
                found = True
                break
        
        return found    

In [9]:
# Delete this cell before submission.
def binaryTreeCorrectnessTest():
    test = BinarySearchTreeSet(None)
    assert test.insertElement("a") == True
    assert test.insertElement("b") == True
    assert test.insertElement("c") == True
    assert test.insertElement("a") == False
    assert test.insertElement("b") == False
    assert test.insertElement("c") == False
    assert test.searchElement("a") == True
    assert test.searchElement("b") == True
    assert test.searchElement("c") == True
    assert test.searchElement("d") == False
    assert test.searchElement("e") == False
    assert test.searchElement("f") == False
    print("✅ Binary tree correctness test passed")

binaryTreeCorrectnessTest()

✅ Binary tree correctness test passed


Use the cell below to implement the requested API by means of a **balanced search tree**.

In [3]:
class BalancedSearchTreeSet(AbstractSet):
    
    def __init__(self):
        # ADD YOUR CODE HERE

        
        pass   

    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE

        return inserted
        
    
    

    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE
        
        return found    

In [4]:
def balancedTreeCorrectnessTest():
    test = BalancedSearchTreeSet()
    assert test.insertElement("a") == True
    assert test.insertElement("b") == True
    assert test.insertElement("c") == True
    assert test.insertElement("a") == False
    assert test.insertElement("b") == False
    assert test.insertElement("c") == False
    assert test.searchElement("a") == True
    assert test.searchElement("b") == True
    assert test.searchElement("c") == True
    assert test.searchElement("d") == False
    assert test.searchElement("e") == False
    assert test.searchElement("f") == False
    print("✅ Balanced tree correctness test passed")

balancedTreeCorrectnessTest()

AssertionError: 
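
The balanced tree cell above is still an empty template, which is why its test fails. As one possible starting point, the block below is a minimal sketch of an AVL tree, which is only one of several balanced search trees that would satisfy the API. The names `_AVLNode`, `AVLSetSketch` and the helper functions are hypothetical and not part of the assignment; the sketch deliberately does not inherit from `AbstractSet` so that it stays self-contained.

```python
class _AVLNode:
    """Hypothetical helper node (not part of the assignment's API)."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # height of the subtree rooted at this node


def _height(node):
    return node.height if node else 0


def _update(node):
    node.height = 1 + max(_height(node.left), _height(node.right))


def _rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    _update(y)  # y is now the child, so refresh it first
    _update(x)
    return x


def _rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    _update(x)
    _update(y)
    return y


def _rebalance(node):
    _update(node)
    balance = _height(node.left) - _height(node.right)
    if balance > 1:  # left-heavy
        if _height(node.left.left) < _height(node.left.right):
            node.left = _rotate_left(node.left)  # left-right case
        return _rotate_right(node)
    if balance < -1:  # right-heavy
        if _height(node.right.right) < _height(node.right.left):
            node.right = _rotate_right(node.right)  # right-left case
        return _rotate_left(node)
    return node


class AVLSetSketch:
    """Sketch of a balanced set with the same two operations as AbstractSet."""
    def __init__(self):
        self.root = None

    def insertElement(self, element):
        self.root, inserted = self._insert(self.root, element)
        return inserted

    def _insert(self, node, element):
        if node is None:
            return _AVLNode(element), True
        if element < node.key:
            node.left, inserted = self._insert(node.left, element)
        elif element > node.key:
            node.right, inserted = self._insert(node.right, element)
        else:
            return node, False  # duplicate: tree is unchanged
        return _rebalance(node), inserted

    def searchElement(self, element):
        node = self.root
        while node:
            if element < node.key:
                node = node.left
            elif element > node.key:
                node = node.right
            else:
                return True
        return False
```

Unlike the plain binary search tree, inserting keys in sorted order (as the correctness tests do with "a", "b", "c") does not degrade this structure into a linked list, because every insertion re-establishes the height balance on the way back up the recursion.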

Use the cell below to implement the requested API by means of a **Bloom filter**.

$$
\ln(x) = 2 \sum_{i=1}^{\infty} \frac{1}{2i-1} \left(\frac{x-1}{x+1}\right)^{2i-1}
$$

In [19]:
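# math cannot be imported here, so ln is computed from its series expansion
def ln(x, n=100):
    """
    Approximate the natural logarithm of x (x > 0) using the first n terms of the series.
    n defaults to 100, which is accurate unless x is very close to 0 or very large;
    in those cases (x-1)/(x+1) approaches -1 or 1 and the series converges slowly.
    http://www.math.com/tables/expansion/log.htm
    """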
    total = 0
    for i in range(1, n+1):
        total += (((x-1)/(x+1))**(2*i-1))/(2*i-1)
    return total*2
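
To make the accuracy of the truncated series concrete, here is a small sketch that compares it against hard-coded reference values. The function `ln_series` is a hypothetical stand-in that mirrors the series above, and the reference constants are typed in by hand so that `math` does not need to be imported.

```python
def ln_series(x, terms=100):
    """Hypothetical stand-in: truncated series for ln(x), x > 0."""
    r = (x - 1) / (x + 1)
    return 2 * sum(r ** (2 * i - 1) / (2 * i - 1) for i in range(1, terms + 1))


# Reference values of the natural logarithm, hard-coded to avoid importing math.
LN_2 = 0.6931471805599453
LN_0_001 = -6.907755278982137

# Close to 1 the ratio (x-1)/(x+1) is small, so 100 terms are plenty.
err_near_one = abs(ln_series(2) - LN_2)

# For x = 0.001 the ratio is about -0.998, so 100 terms are not enough ...
err_small_100 = abs(ln_series(0.001, terms=100) - LN_0_001)

# ... but the series still converges if many more terms are summed.
err_small_5000 = abs(ln_series(0.001, terms=5000) - LN_0_001)

print("error for x=2,     100 terms:", err_near_one)
print("error for x=0.001, 100 terms:", err_small_100)
print("error for x=0.001, 5000 terms:", err_small_5000)
```

This matters for the Bloom filter sizing below, because the series is evaluated at the target false positive probability: a moderate value such as 0.05 is handled well by 100 terms, while a very small target probability would need more terms.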

$$
    P(\text{false positive}) = (1 - [1 - \frac{1}{m}]^{kn})^k
$$

In [21]:
def p_false_positive(m: int, k: int, n: int) -> float:
    """
    Calculate the probability of false positive.
    where:
    m : int
        the size of the bit array
    k : int
        the number of hash functions
    n : int
        number of expected elements in the set
    ===
    source: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
    """
    return (1 - (1 - 1/m)**(k*n))**k

$$
    m = - \frac{n \ln(p)}{(\ln 2)^2}
$$

In [22]:
def optimal_size_of_bit_array(n: int, p: float) -> int:
    """
    Calculate the optimal size of the bit array.
    where:
    n : int
        the number of expected elements in the set
    p : float
        the probability of false positive
    ===
    source: https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/
    """
    m = -(n*ln(p))/(ln(2)**2)
    return max(1, int(m))

$$
    k = \frac{m}{n} \ln(2)
$$

In [23]:
def optimal_number_of_hash_functions(m: int, n: int) -> int:
    """
    Calculate the optimal number of hash functions.
    where:
    m : int
        is the size of the bit array
    n : int
        is the number of expected elements in the set
    ===
    source: https://en.wikipedia.org/wiki/Bloom_filter
    """
    k = (m/n)*ln(2)
    return max(1, int(k))
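
The three formulas above can be chained as a quick sanity check. The sketch below assumes n = 114 (the word count of the sonnet used in the Bloom filter test, obtained by counting) and a target of p = 0.05; `ln_series` is a hypothetical stand-in for the series-based logarithm so that the block runs on its own.

```python
def ln_series(x, terms=100):
    """Hypothetical stand-in: truncated series for ln(x), x > 0."""
    r = (x - 1) / (x + 1)
    return 2 * sum(r ** (2 * i - 1) / (2 * i - 1) for i in range(1, terms + 1))


n = 114          # assumed number of inserted elements
p_target = 0.05  # desired false positive probability

# Optimal bit array size: m = -n ln(p) / (ln 2)^2, truncated to an int.
m = max(1, int(-(n * ln_series(p_target)) / ln_series(2) ** 2))

# Optimal number of hash functions: k = (m / n) ln 2, truncated to an int.
k = max(1, int((m / n) * ln_series(2)))

# Expected false positive probability for the rounded m and k.
p_expected = (1 - (1 - 1 / m) ** (k * n)) ** k

print("m =", m, "k =", k, "expected p =", round(p_expected, 4))
```

With these inputs the sketch gives 710 bits and 4 hash functions, and the expected false positive probability comes out close to the 0.05 target despite both values being truncated to integers.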

In [30]:
def hash_with_seed(x, seed=''):
    """
    Hash function with seed.
    Silly hash function that concatenates the string representation of x and seed
    and then hashes the result.
    """
    return hash(str(x)+str(seed))
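
Two properties of this approach are worth knowing. Python salts the built-in `hash` of strings per process (unless `PYTHONHASHSEED` is set), so bit positions are not reproducible between runs, and plain concatenation maps different inputs such as `("a1", 2)` and `("a", 12)` to the same string. The sketch below shows one possible deterministic alternative; `stable_hash_with_seed` is a hypothetical name, and it assumes that importing `hashlib` from the standard library is permitted.

```python
import hashlib


def stable_hash_with_seed(x, seed=0):
    """Hypothetical deterministic variant of a seeded hash (assumes hashlib is allowed)."""
    # The integer seed comes first, followed by a separator, so (seed, x)
    # pairs cannot collide through plain string concatenation.
    payload = "{}:{}".format(seed, x).encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    # Interpret the first 8 bytes as a non-negative 64-bit integer.
    return int.from_bytes(digest[:8], "big")


array_size = 710  # example bit array size
indices = [stable_hash_with_seed("summer", seed) % array_size for seed in range(4)]
print(indices)  # identical in every run of the program
```

A cryptographic hash is slower than the built-in `hash`, so this trades some speed for reproducible experiments; which side of that trade-off is preferable depends on what is being measured.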

In [31]:
from bitarray import bitarray 
# pip install bitarray
# allowed as in the assignment description

class BloomFilterSet(AbstractSet):
    
    def __init__(self, n: int, p: float):
        # ADD YOUR CODE HERE
        """
        n : int
            the number of expected elements in the set
        p : float
            the probability of false positive
        """
        self.arraySize = optimal_size_of_bit_array(n, p)
        # create bit array
        self.bitArray = bitarray(self.arraySize)
        # initialize bit array with all 0s
        self.bitArray.setall(0)

        self.numHashFunctions: int = optimal_number_of_hash_functions(self.arraySize, n)
        
     
    def insertElement(self, element):
        inserted = False
        # ADD YOUR CODE HERE
        for i in range(self.numHashFunctions):
            # use hash function with seed i, and mod by array size to prevent index out of range
            index = hash_with_seed(element, i) % self.arraySize
            self.bitArray[index] = 1
        inserted = True
        return inserted

    def searchElement(self, element):     
        found = False
        # ADD YOUR CODE HERE
        for i in range(self.numHashFunctions):
            index = hash_with_seed(element, i) % self.arraySize
            if self.bitArray[index] == 0:
                break
        else:
            # for/else: this branch runs only if the loop finished without break,
            # i.e. every probed bit is set
            found = True
        
        return found    

In [50]:
def bloomFilterCorrectnessTest():
    # words to be added
    word_present = """
Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:
Sometime too hot the eye of heaven shines,
And often is his gold complexion dimm'd;
And every fair from fair sometime declines,
By chance, or nature's changing course untrimm'd;
But thy eternal summer shall not fade
Nor lose possession of that fair thou owest;
Nor shall Death brag thou wander'st in his shade,
When in eternal lines to time thou growest:
So long as men can breathe or eyes can see,
So long lives this, and this gives life to thee.
""".split()
    
    # word not added
    word_absent = """
While the tournament would have to take place over
a longer period, it is felt cutting preparation for
teams - from nearly three weeks before Russia 2018,
although not quite as drastically as the week they had
before Qatar 2022 - would mean players were not on duty
 for a greater length of time.""".split()
    word_absent = [word for word in word_absent if word not in word_present]
    
        
    n = len(word_present) #number of items to add
    p = 0.05 #false positive probability
    
    bloomf = BloomFilterSet(n,p)
    print("Size of bit array: {}".format(bloomf.arraySize))
    print("Number of hash functions: {}".format(bloomf.numHashFunctions))

    for item in word_present:
        assert bloomf.insertElement(item) == True

    false_positive_count = 0
    for word in word_absent+word_present:
        found = bloomf.searchElement(word)  # query the filter once per word
        if found and word in word_absent:
            false_positive_count += 1
        elif not found and word in word_present:
            # a Bloom filter must never report an inserted word as missing
            raise AssertionError("Word not found in bloom filter, but should be present")


    print("✅ Bloom filter correctness test passed")
    print("    False positive count: {}".format(false_positive_count))
    # only words that were never inserted can be false positives, so they form the denominator
    print("    False positive probability: {}".format(false_positive_count/len(word_absent)))

bloomFilterCorrectnessTest()

Size of bit array: 710
Number of hash functions: 4
✅ Bloom filter correctness test passed
    False positive count: 3
    False positive probability: 0.07142857142857142


Use the cell below to implement the **synthetic data generator** as part of your experimental framework.

In [7]:
import string
import random

class TestDataGenerator(AbstractTestDataGenerator):
    
    def __init__(self):
        # ADD YOUR CODE HERE
        
        
        pass           
        
    def generateData(self, size):     
        # ADD YOUR CODE HERE
        data = [""]*size
        

        return data   
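
The generator template above still returns a list of empty strings. As an illustration only, the following sketch produces random lowercase strings of varying length. The class name `RandomStringGeneratorSketch` and the parameters `min_len`, `max_len` and `seed` are hypothetical choices, and the class does not inherit from `AbstractTestDataGenerator` so that the block is self-contained.

```python
import random
import string


class RandomStringGeneratorSketch:
    """Hypothetical generator of random lowercase strings."""

    def __init__(self, min_len=3, max_len=10, seed=None):
        self.min_len = min_len
        self.max_len = max_len
        # A private Random instance makes runs reproducible when a seed is given.
        self.rng = random.Random(seed)

    def generateData(self, size):
        data = []
        for _ in range(size):
            length = self.rng.randint(self.min_len, self.max_len)
            letters = self.rng.choices(string.ascii_lowercase, k=length)
            data.append("".join(letters))
        return data


sample = RandomStringGeneratorSketch(seed=42).generateData(5)
print(sample)
```

Uniformly random strings are only one kind of workload. Sorted input or input with many duplicates stresses the four implementations differently, so an experimental framework would likely want more than one generation mode.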



Use the cells below for the Python code needed to **fully evaluate your implementations**, first on real data and subsequently on synthetic data (i.e., read data from test files / generate synthetic data, instantiate each of the 4 set implementations in turn, then thoroughly experiment with insert/search operations and measure their performance).
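
One way such a measurement could be structured is sketched below. The helper `time_operations` and the stand-in class `ListSetSketch` are hypothetical; in the notebook, any of the four set classes could be passed in place of the stand-in, provided it can be constructed without arguments (or is wrapped in a `lambda`).

```python
import timeit


class ListSetSketch:
    """Hypothetical stand-in for one of the set implementations."""

    def __init__(self):
        self.items = []

    def insertElement(self, element):
        if element in self.items:
            return False
        self.items.append(element)
        return True

    def searchElement(self, element):
        return element in self.items


def time_operations(set_factory, words, queries):
    """Return (average insert time, average search time, number of hits) in seconds."""
    s = set_factory()

    start = timeit.default_timer()
    for w in words:
        s.insertElement(w)
    insert_time = timeit.default_timer() - start

    start = timeit.default_timer()
    hits = sum(1 for q in queries if s.searchElement(q))
    search_time = timeit.default_timer() - start

    # max(1, ...) avoids a division by zero for empty inputs
    return insert_time / max(1, len(words)), search_time / max(1, len(queries)), hits


words = ["w{}".format(i) for i in range(2000)]
queries = words[:100] + ["missing{}".format(i) for i in range(100)]
avg_insert, avg_search, hits = time_operations(ListSetSketch, words, queries)
print("avg insert: {:.2e} s, avg search: {:.2e} s, hits: {}".format(avg_insert, avg_search, hits))
```

Single runs are noisy, so repeating each measurement several times and varying the input size would give a more reliable picture of how each implementation scales.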

In [8]:
import timeit

# ADD YOUR TEST CODE HERE TO WORK ON REAL DATA





In [9]:
import timeit

# ADD YOUR TEST CODE HERE TO WORK ON SYNTHETIC DATA



