### Name: Jaimon Thyparambil Thomas
### StudentID: 29566428
### Date: 20/10/2019

# Q4 String Register (25 marks)

We want to design a data structure which we will call a *register* and which has the following properties:
* It stores strings of characters belonging to an alphabet $A$. We denote $c$ the total number of characters in the alphabet $A$. $A$ can be set to be any set of *comparable* characters, e.g. $\{a, b, c\}$, or $\{0, \dots,9\}$, or $\{a, \dots, z, A, \dots, Z\}$.
* The operation to determine whether a string of size $k$ belongs to the register takes $O(k \log c)$ runtime in the worst case.
* Adding a string to the register takes $O(k \times c)$ runtime in the worst case.
* Removing a string from the register takes $O(k \log c)$ runtime in the worst case.

Note that the runtime complexities are independent of the number of elements stored in the register.
Finally, remark that a string and (some of) its substring can all belong to the register.

## Q4.1 (22 marks)
Without using other data structures than Python lists, describe a data structure which meets the requirements described above. We recommend using a tree.

If you cannot find a way to meet the runtime requirements, provide an algorithm and implementation anyway. You may get up to half of the marks for doing so.

Create a tree-like class in which every node holds a list of branches, and each branch is itself an object of the same class. Each node represents one character of a stored string, and that character can be any character contained in the Alphabet class. For example, the tree holding CATTLE and CAR:
<br>..........C
<br>........./
<br>........A
<br>......./ \
<br>......R   T
<br>...........\
<br>............T
<br>.............\
<br>..............L
<br>...............\
<br>................E

Here, at the third level, there are two branches, T and R. The branches of a node must always be kept sorted, so that when a character arrives we can use binary search (complexity $O(\log c)$) to find the branch that represents it. To keep the list sorted, insertion first finds the position where the new branch belongs and then inserts it there, shifting the remaining nodes. Each node also needs a counter that indicates whether a string ends at that node: for example, if "cattle" has been inserted, a search for "cat" must not return true. If a string is present, the counter at the node of its last character is at least 1. The counter also records how many copies of the string are stored, so removing a string just decrements it. Note also that before insertion, removal, and search, each character is first checked against the Alphabet, and the operation continues only if the character is present. This check takes $O(\log c)$ time, because the Alphabet class sorts its characters on creation and can therefore use binary search.
<br>
<br>The space complexity of this algorithm is $O(nk)$ nodes in the worst case, where $n$ is the number of stored strings and $k$ is the length of the longest string, since each inserted string adds at most $k$ new nodes. Shared prefixes reduce this number.
<br>With this algorithm the time complexities are:
<br>For insertion: $O(kc)$ (inserting one character may take $O(c + \log c) = O(c)$ time, since the branch list must stay sorted; for $k$ characters this is $O(kc)$)
<br>For search: $O(k \log c)$ (at each level, checking the alphabet and finding the branch for the current character takes $O(2\log c) = O(\log c)$ time; for $k$ characters this is $O(k \log c)$)
<br>For removal: $O(k \log c)$ (at each level, checking the alphabet and finding the branch for the current character takes $O(2\log c) = O(\log c)$ time; for $k$ characters this is $O(k \log c)$)
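
The sorted-insertion step described above can be sketched on a plain list of characters. This is a minimal illustration, not the class implemented below: `find_slot` and `sorted_insert` are hypothetical helper names, and the branches are reduced to single characters.

```python
def find_slot(sorted_chars, ch):
    """Binary search: return (index, found) for ch in a sorted list."""
    lo, hi = 0, len(sorted_chars)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_chars[mid] == ch:
            return mid, True
        if ch > sorted_chars[mid]:
            lo = mid + 1
        else:
            hi = mid
    # Not found: lo is the position that keeps the list sorted
    return lo, False


def sorted_insert(sorted_chars, ch):
    """Insert ch so the list stays sorted; an existing ch is left alone."""
    pos, found = find_slot(sorted_chars, ch)  # O(log c) comparisons
    if not found:
        sorted_chars.insert(pos, ch)          # O(c) shift in the worst case
    return pos


branches = []
for ch in "tarc":
    sorted_insert(branches, ch)
print(branches)  # ['a', 'c', 'r', 't']
```

The search costs $O(\log c)$ but the list shift costs $O(c)$, which is why insertion is the only operation whose bound contains a factor of $c$.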

Implement this algorithm:

In [1]:
#This function returns whether the current index is the last index in the string
def isLastIndex(index, string):
    return index == len(string) - 1

class Alphabet:
    def __init__(self,alphabetRange):
        self.alphabet = ''.join(sorted(alphabetRange))#This takes a time complexity of O(clogc)
        return 
    
    #This process has a complexity of O(log(c)) as it uses binary search
    #where c is the total no of characters in the alphabet
    def isCharNotPresent(self,char):
        startIndex = 0
        endIndex = len(self.alphabet)
        while(startIndex < endIndex):
            mid = (startIndex + endIndex)//2
            if(self.alphabet[mid] == char):
                return False
            elif (char > self.alphabet[mid]):
                startIndex = mid + 1
            else:
                endIndex = mid
        return True
    
    def __len__(self):
        return len(self.alphabet)

class RegisterTree:
    def __init__(self,alpha=None,character = None,count=0):
        self.branches = [] #list of all the child branches this character node has
        #count is the number of stored strings that end at this node
        self.count = count
        self.alphabet = alpha
        self.character = character #Indicates this node is for which character
        return

    def __gt__(self,char):
        return self.character > char

    def __ge__(self,char):
        return self.character >= char

    def __eq__(self,char):
        return self.character == char

    def __lt__(self,char):
        return self.character < char

    def __le__(self,char):
        return self.character <= char

    #This Function returns False if string is not present and True if It is Present
    def isStringPresent(self,string,index = 0):
        if index < len(string):
            if(self.alphabet.isCharNotPresent(string[index])):
                #This process has a time complexity of O(log(c))
                #Checking if the current character is present in the Alphabet
                return False
            isLastElement = isLastIndex(index, string)
            pos = self.getBranchIndex(string[index])#has a time complexity of O(log(c))
            if(pos == None):
                return False
            if (isLastElement and self.branches[pos].count > 0):
                return True
            else:
                return self.branches[pos].isStringPresent(string, index + 1)
        return False

    #This function returns False if String is not present returns True if it is present
    def removeString(self,string,index = 0):
        if index < len(string):
            if(self.alphabet.isCharNotPresent(string[index])):
                #This process has a time complexity of O(log(c))
                #Checking if the current character is present in the Alphabet
                return False
            isLastElement = isLastIndex(index, string)
            pos = self.getBranchIndex(string[index])#has a time complexity of O(log(c))
            if(pos == None):
                return False
            if (isLastElement and self.branches[pos].count > 0):
                self.branches[pos].count -= 1
                return True
            else:
                return self.branches[pos].removeString(string, index + 1)
        return False

    #Here index indicates which character of the given string we are currently processing
    def insertString(self,string,index=0):
        if index < len(string):
            if(self.alphabet.isCharNotPresent(string[index])):
                #This process has a time complexity of O(log(c))
                #Checking if the current character is present in the Alphabet
                return False
            isLastElement = isLastIndex(index, string)
            pos = self.getBranchIndex(string[index])#has a time complexity of O(log(c))
            if(pos == None):
                pos = self.insertBranch(string[index])#has a time complexity O(log(c) + c)
            if(isLastElement):
                self.branches[pos].count += 1
                return True
            else:
                return self.branches[pos].insertString(string,index+1)
        return True

    def insertBranch(self,char):
        node = RegisterTree(self.alphabet,char)
        pos = self.getBranchIndex(char,True)#has a time complexity of O(log(c))
        self.branches.insert(pos,node) #has a time complexity of O(c)
        return pos

    #This Function returns index of the character you are searching for.
    #If the character is not present and it is not insert case then it will return None
    #If it is the insert case then it will return the index where we have to insert the character
    #It uses binary search for finding index so its time complexity is O(log(c)) where c is the total no of branches
    def getBranchIndex(self,char,isForInsert = False):
        startIndex = 0
        endIndex = len(self.branches)
        while(startIndex < endIndex):
            mid = (startIndex + endIndex)//2
            if(self.branches[mid] == char):
                return mid
            elif (char > self.branches[mid]):
                startIndex = mid + 1
            else:
                endIndex = mid
        return startIndex if isForInsert else None
    
class Register:
    def __init__(self,alpha):
        self.register = RegisterTree(alpha)
        return
    def isStringPresent(self,string):
        return self.register.isStringPresent(string)
    
    def insertString(self,string):
        return self.register.insertString(string)
    
    def removeString(self,string):
        return self.register.removeString(string)
        


Using the module *unittest*, write unit tests for your class. To obtain full marks, you need to write unit tests which extensively cover all cases. We recommend using the module *random*. Note that this question will only be marked if you provide __both__ a functional program and unit tests. You will only receive marks for features which are implemented __and__ tested convincingly.

In [2]:
import unittest
import random
random.seed(a=0)  # fixed seed keeps the random test data reproducible
class TestRegister(unittest.TestCase):
    def setUp(self):
        alphabetAll = ""
        smallAplha = ""
        for i in range(ord('0'),ord('9')+1):
            alphabetAll += chr(i)
        for i in range(ord('a'),ord('z')+1):
            smallAplha +=chr(i)
            alphabetAll += chr(i)
        for i in range(ord('A'),ord('Z')+1):
            alphabetAll += chr(i)
        self.alphabetAll = Alphabet(alphabetAll)
        self.smallAplha = Alphabet(smallAplha)
        self.wordsList = []
        for _ in range(10000):
            # 30 distinct random positions in the full alphabet
            numbers = random.sample(range(len(self.alphabetAll)), 30)
            tempString = ""
            for position in numbers:
                tempString += alphabetAll[position]
            self.wordsList.append(tempString)

    def testEmptyCase(self):
        register = Register(self.alphabetAll)
        self.assertEqual(False, register.isStringPresent("cat"))
        self.assertEqual(False, register.removeString("cat"))
    
    def testEmptyAlphabet(self):
        register = Register(Alphabet(""))
        self.assertEqual(False, register.insertString("cat"))
        self.assertEqual(False, register.isStringPresent("cat"))
        self.assertEqual(False, register.removeString("cat"))
        
        
    def testInvalidEntryCheck(self):
        register = Register(self.alphabetAll)
        self.assertEqual(True, register.insertString("cattle"))
        self.assertEqual(False, register.isStringPresent("cat"))
        self.assertEqual(False, register.removeString("cat"))
        
    def testValidEntryCheck(self):
        register = Register(self.alphabetAll)
        self.assertEqual(True, register.insertString("cattle"))
        self.assertEqual(True, register.isStringPresent("cattle"))
        self.assertEqual(True, register.removeString("cattle"))
        self.assertEqual(False, register.isStringPresent("cattle"))
    
    def testDifferentStartingWordCase(self):
        register = Register(self.alphabetAll)
        self.assertEqual(True, register.insertString("cattle"))
        self.assertEqual(True, register.insertString("test"))
        self.assertEqual(True, register.isStringPresent("cattle"))
        self.assertEqual(True, register.isStringPresent("test"))
        self.assertEqual(True, register.removeString("cattle"))
        self.assertEqual(True, register.isStringPresent("test"))
        self.assertEqual(False, register.isStringPresent("cattle"))
    
    def testDifferentInbetweenWordCase(self):
        register = Register(self.alphabetAll)
        self.assertEqual(True, register.insertString("cattle"))
        self.assertEqual(True, register.insertString("carset"))
        self.assertEqual(True, register.isStringPresent("cattle"))
        self.assertEqual(True, register.isStringPresent("carset"))
        self.assertEqual(True, register.removeString("cattle"))
        self.assertEqual(True, register.isStringPresent("carset"))
        self.assertEqual(False, register.isStringPresent("cattle"))
        
    def testDuplicateEntryCheck(self):
        register = Register(self.alphabetAll)
        self.assertEqual(True, register.insertString("cattle"))
        self.assertEqual(True, register.insertString("cat"))
        self.assertEqual(True, register.insertString("cat"))
        self.assertEqual(True, register.isStringPresent("cat"))
        self.assertEqual(True, register.removeString("cat"))
        self.assertEqual(True, register.isStringPresent("cat"))
        self.assertEqual(True, register.removeString("cat"))
        self.assertEqual(False, register.isStringPresent("cat"))
        self.assertEqual(False, register.removeString("cat"))
        
    def testOutOfAlphabetCheck(self):
        register = Register(self.smallAplha)
        self.assertEqual(True, register.insertString("cattle"))
        self.assertEqual(False,register.insertString("caT"))
        self.assertEqual(True,register.insertString("cat"))
        self.assertEqual(False, register.isStringPresent("caT"))
        self.assertEqual(False, register.removeString("caT"))
        self.assertEqual(True, register.isStringPresent("cat"))
        self.assertEqual(True, register.removeString("cat"))
        self.assertEqual(False, register.isStringPresent("cat"))
        
    
    def testHugeDataSetWithAllAlpha(self):
        register = Register(self.alphabetAll)
        for each in self.wordsList:
            register.insertString(each)
        for each in self.wordsList:
            self.assertEqual(True, register.isStringPresent(each))
        for each in self.wordsList:
            self.assertEqual(True, register.removeString(each))
        for each in self.wordsList:
            self.assertEqual(False, register.isStringPresent(each))
            self.assertEqual(False, register.removeString(each))


In [3]:
# Load every test method of TestRegister and run it inside the notebook
suite = unittest.TestLoader().loadTestsFromTestCase(TestRegister)
unittest.TextTestRunner().run(suite)

.........
----------------------------------------------------------------------
Ran 9 tests in 7.514s

OK


<unittest.runner.TextTestResult run=9 errors=0 failures=0>

## Q4.2 (3 marks)

### For Alphabet Class
The time complexity of creating the class is $O(c \log c)$, as it uses Python's built-in sort to sort the alphabet.
Its method isCharNotPresent has a time complexity of $O(\log c)$, as it uses binary search to check whether a character is present.
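
As a sanity check on the $O(\log c)$ claim, the sketch below counts loop iterations of the same half-open binary search over a 62-character alphabet. The function name `search_steps` is hypothetical and introduced only for this illustration; the bound $\lfloor \log_2 c \rfloor + 1$ is the standard worst case for this form of binary search.

```python
import math
import string


def search_steps(sorted_chars, ch):
    """Return (found, number of loop iterations) for a binary search."""
    lo, hi = 0, len(sorted_chars)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_chars[mid] == ch:
            return True, steps
        if ch > sorted_chars[mid]:
            lo = mid + 1
        else:
            hi = mid
    return False, steps


# Digits, upper-case and lower-case letters: c = 62
alphabet = "".join(sorted(string.digits + string.ascii_letters))
bound = math.floor(math.log2(len(alphabet))) + 1

# Try every character of the alphabet plus two characters outside it
worst = max(search_steps(alphabet, ch)[1] for ch in alphabet + "!~")
assert worst <= bound
print(len(alphabet), bound, worst)
```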

### For Register class 

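The worst-case space complexity of this algorithm is $O(nk)$ nodes (where $n$ is the number of stored strings and $k$ is the length of the longest string), since each inserted string adds at most $k$ new nodes. Strings that share a prefix share the nodes of that prefix, so the actual number is often smaller.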
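
To get a feel for how the space grows, the sketch below builds a simplified list-based prefix tree and counts its nodes. It is only an illustration under stated assumptions: nodes are plain lists `[character, counter, children]`, children are found by linear scan and are not kept sorted, and the names `new_node`, `insert`, and `count_nodes` are hypothetical rather than taken from the classes above.

```python
def new_node(char):
    # [character, end-of-string counter, list of child nodes]
    return [char, 0, []]


def insert(root, word):
    node = root
    for ch in word:
        for child in node[2]:
            if child[0] == ch:
                node = child          # reuse the existing branch
                break
        else:
            child = new_node(ch)      # no branch for ch yet: create one
            node[2].append(child)
            node = child
    node[1] += 1                      # one more string ends here


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node[2])


root = new_node(None)
words = ["cattle", "cat", "car", "test"]
for w in words:
    insert(root, w)

nodes = count_nodes(root) - 1                 # exclude the empty root
total_chars = sum(len(w) for w in words)
print(nodes, total_chars)  # 11 16
```

Here "cat" adds no nodes and "car" adds one, because both share a prefix with "cattle"; the node count never exceeds the total number of characters inserted.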

### Searching a String of Length k
At each level, the branch list of a node holds at most $c$ characters in the worst case. Since we use binary search to find the branch to follow, finding the branch for each character of the string takes $O(\log c)$. For each character we also check whether it is in the Alphabet, which takes another $O(\log c)$. The total time complexity for one character is therefore $O(\log c + \log c) = O(2\log c) = O(\log c)$, as we can ignore constants. Since there are $k$ characters in the string, the total search complexity is $O(k \log c)$.

### Inserting a String of Length k
At each level, the branch list of a node holds at most $c$ characters in the worst case. We insert with Python's list insert function, which has a worst-case complexity of $O(c)$, since in the worst case all existing branches must be shifted to make room for the new one. To find the index at which to insert we use binary search, which has a complexity of $O(\log c)$. For each character we also check whether it is in the Alphabet, which takes another $O(\log c)$. The total time complexity for one character is therefore $O(c + \log c + \log c) = O(c + 2\log c) = O(c)$, as we can ignore constants and $c$ dominates $\log c$ for large $c$. We repeat this for $k$ characters, so the overall complexity is $O(kc)$.
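
The insertion bound can also be written as a sum over the $k$ characters of the string. This only restates the argument above; $T_{\text{insert}}$ is a symbol introduced here for illustration, and the three terms are the alphabet check, the branch search, and the list shift:

$$
T_{\text{insert}}(k) = \sum_{i=1}^{k} \left( \log c + \log c + c \right) = k\,(c + 2\log c) = O(kc)
$$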

### Removing a String of Length k
At each level, the branch list of a node holds at most $c$ characters in the worst case. Since we use binary search to find the branch to follow, finding the branch for each character of the string we are removing takes $O(\log c)$. For each character we also check whether it is in the Alphabet, which takes another $O(\log c)$. The total time complexity for one character is therefore $O(\log c + \log c) = O(2\log c) = O(\log c)$, as we can ignore constants. Since there are $k$ characters in the string, the total removal complexity is $O(k \log c)$.