### 1. Phone Book

- Qn Summary:
    - The task is to implement `process_queries`, the function behind a simple phone book
    - The phone book maps numbers to names
    - 3 operations are required
        - `add`: given a number and name, add a contact (overwriting the name if the number already exists)
        - `find`: given a number, return the name if it exists, else return `not found`
        - `del`: remove a number from the map (do nothing if it is absent)

- Approach:
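    - To solve this, back the phone book with a structure that gives O(1) average-case lookup by number (i.e. a hashmap, implemented as `dict()` in Python)
    - The first cell below keeps contacts in a list and scans it on every query, which is O(n) per operation; the second cell swaps the list for a `dict`, and nothing else is needed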

In [None]:
class Query:
    def __init__(self, query):
        self.type = query[0]
        self.number = int(query[1])
        if self.type == 'add':
            self.name = query[2]

def read_queries():
    n = int(input())
    return [Query(input().split()) for i in range(n)]

def write_responses(result):
    print('\n'.join(result))

def process_queries(queries):
    result = []
    # Keep list of all existing (i.e. not deleted yet) contacts.
    contacts = []
    for cur_query in queries:
        if cur_query.type == 'add':
            # if we already have contact with such number,
            # we should rewrite contact's name
            for contact in contacts:
                if contact.number == cur_query.number:
                    contact.name = cur_query.name
                    break
            else: # otherwise, just add it
                contacts.append(cur_query)
        elif cur_query.type == 'del':
            for j in range(len(contacts)):
                if contacts[j].number == cur_query.number:
                    contacts.pop(j)
                    break
        else:
            response = 'not found'
            for contact in contacts:
                if contact.number == cur_query.number:
                    response = contact.name
                    break
            result.append(response)
    return result

In [None]:
queries = [
    'add 911 police',
    'add 76213 Mom',
    'add 17239 Bob',
    'find 76213',
    'find 910',
    'find 911',
    'del 910',
    'del 911',
    'find 911',
    'find 76213',
    'add 76213 daddy',
    'find 76213',
]
queries = [Query(query_str.split()) for query_str in queries]

def process_queries(queries):
    results = []
    contacts = {}
    for query in queries:
        if query.type == 'add':
            contacts[query.number] = query.name
        elif query.type == 'find':
            # `get` with a default covers both the found and the not-found case
            results.append(contacts.get(query.number, 'not found'))
        elif query.type == 'del':
            # `pop` with a default is a no-op when the number is absent
            contacts.pop(query.number, None)
        else:
            raise ValueError('Operation must be `add`, `find`, or `del`')

    return results

process_queries(queries)

### 2. Hashing with chains

- Summary:
    - Any hashing scheme needs a way to deal with multiple objects that share the same hash. That is, if string A and string B have the same hash, adding B should not overwrite A
        - If each bucket held only a single value, as a naive extension of Q1 would, B would overwrite A
    - In this question, we handle this by storing a list of values, called a chain, in each bucket instead of a single value
        - So in the event of a collision, values are added to the chain instead of overwritten
        - When looking up a value, we simply go to the appropriate chain and iterate through all values there
    - Task
        - Implement the following hashmap functions while preserving the O(1) average-case look-up of a hashmap (each operation costs O(1 + chain length)):
            - `add`: Given a string, hash it and add to the appropriate chain
            - `check`: Given an integer representing a chain ID, return all values in the chain as a string
            - `find`: Given a string, return 'yes' if value exists in the hashmap, else 'no'
            - `del`: Given a string, remove value from hashmap if it exists
    
- Approach:
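    - To implement hashing with chains, we use a standard Python dictionary keyed by bucket index, with a `deque` as the chain (so new values can be prepended in O(1))
        - This keeps lookups at O(1 + chain length), which is O(1) on average while the chains stay short, and it handles hash collisions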
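
As a minimal sketch of the bucket computation described above, the polynomial hash and a chain insert might look like the following. The helper name `bucket_of` and the module-level `chains` dict are illustrative and not part of the assignment's code; the multiplier 263 and prime 1,000,000,007 are the constants used in the code cells of this section.

```python
from collections import deque

MULTIPLIER = 263          # same constants as the code cells in this section
PRIME = 1_000_000_007

def bucket_of(s: str, bucket_count: int) -> int:
    """Polynomial hash of `s`, reduced to a bucket index (hypothetical helper)."""
    h = 0
    # Horner's rule over the reversed string, so s[i] ends up multiplied by MULTIPLIER**i
    for ch in reversed(s):
        h = (h * MULTIPLIER + ord(ch)) % PRIME
    return h % bucket_count

# One chain (deque) per bucket; colliding strings share a chain instead of overwriting
chains = {b: deque() for b in range(5)}
for word in ["world", "HellO"]:
    chain = chains[bucket_of(word, 5)]
    if word not in chain:
        chain.appendleft(word)  # newest first

print(bucket_of("world", 5), bucket_of("HellO", 5))
print(list(chains[4]))
```

With 5 buckets, both strings land in bucket 4, which is why `check 4` prints `HellO world` in the sample run of this section.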

In [22]:
class Query:

    def __init__(self, query: list[str]) -> None:
        self.type: str= query[0]
        if self.type == 'check':
            self.ind = int(query[1])
        else:
            self.s: str = query[1]

class QueryProcessor:
    _multiplier = 263
    _prime = 1000000007

    def __init__(self, bucket_count):
        self.bucket_count = bucket_count
        # store all strings in one list
        self.elems = []

    def _hash_func(self, s):
        ans = 0
        for c in reversed(s):
            ans = (ans * self._multiplier + ord(c)) % self._prime
        return ans % self.bucket_count

    def write_search_result(self, was_found):
        print('yes' if was_found else 'no')

    def write_chain(self, chain):
        print(' '.join(chain))

    def read_query(self):
        return Query(input().split())

    def process_query(self, query):
        if query.type == "check":
            # use reverse order, because we append strings to the end
            self.write_chain(cur for cur in reversed(self.elems) if self._hash_func(cur) == query.ind)
        else:
            try:
                ind = self.elems.index(query.s)
            except ValueError:
                ind = -1
            if query.type == 'find':
                self.write_search_result(ind != -1)
            elif query.type == 'add':
                if ind == -1:
                    self.elems.append(query.s)
            else:
                if ind != -1:
                    self.elems.pop(ind)

    def process_queries(self):
        n = int(input())
        for i in range(n):
            self.process_query(self.read_query())

queries = [
    'add world',
    'add HellO',
    'check 4',
    'find World',
    'find world',
    'del world',
    'check 4',
    'del HellO',
    'add luck',
    'add GooD',
    'check 2',
    'del good',
]

bucket_count = 5
proc = QueryProcessor(bucket_count)
for query in queries:
    # print('='*50)
    # print(query)
    proc.process_query(Query(query.split()))

HellO world
no
yes
HellO
GooD luck


In [38]:
from collections import deque

class Query:
    def __init__(self, query):
        self.type = query[0]
        if self.type == 'check':
            self.chain_index = int(query[1])
        else:
            self.string = query[1]

class QueryProcessor:
    def __init__(self, m) -> None:
        self.bucket_counts = m
        self.prime = int(1e9 + 7)
        self.base = 263
        self.hash_table_with_chain = {k: deque() for k in range(m)}
        
        self.result = []
        
    def _write_search_result(self, was_found) -> None:
        print('yes' if was_found else 'no')

    def _write_chain(self, chain) -> None:
        print(' '.join(chain))

    def _read_query(self) -> Query:
        return Query(input().split())

    def polyhash(self, s) -> int:
            
        hashval = 0
        for char in s[::-1]:
            hashval = (
                ((hashval * self.base) % self.prime) + ord(char)
            ) % self.prime
        return hashval % self.bucket_counts
    
    def process_query(self, query) -> None:
        if query.type == "check":
            # chains are stored newest-first (see `appendleft` below), so print as-is
            self._write_chain(
                self.hash_table_with_chain.get(query.chain_index, deque())
            )
        else:
            index = self.polyhash(query.string)
            chain = self.hash_table_with_chain[index]
            if query.type == 'add':
                if query.string not in chain:
                    chain.appendleft(query.string)
            elif query.type == 'find':
                self._write_search_result(query.string in chain)
            elif query.type == 'del':
                if query.string in chain:
                    chain.remove(query.string)

    def process_queries(self):
        n = int(input())
        for _ in range(n):
            self.process_query(self._read_query())

queries = [
    'add world',
    'add HellO',
    'check 4',
    'find World',
    'find world',
    'del world',
    'check 4',
    'del HellO',
    'add luck',
    'add GooD',
    'check 2',
    'del good',
]

# bucket_count = 5
# proc = QueryProcessor(bucket_count)
# for query in queries:
#     proc.process_query(Query(query.split()))

queries = [
    'add test',
    'add test',
    'find test',
    'del test',
    'find test',
    'find Test',
    'add Test',
    'find Test',
]
# bucket_count = 4
# proc = QueryProcessor(bucket_count)
# for query in queries:
#     proc.process_query(Query(query.split()))

queries = [
    'check 0',
    'find help',
    'add help',
    'add del',
    'add add',
    'find add',
    'find del',
    'del del',
    'find del',
    'check 0',
    'check 1',
    'check 2',
]
bucket_count = 3
proc = QueryProcessor(bucket_count)
for query in queries:
    proc.process_query(Query(query.split()))



no
yes
yes
no

add help



### 3. Find pattern in text

- Summary:
    - Given 2 strings `text` and `pattern`, find the start index of all occurrences of `pattern` in `text`
        - i.e. In text `abacaba`, starting positions of pattern `aba` are [0, 4]
    
- Approach:
    - The naive approach slides a window of `len(pattern)` over the text and compares each substring with the pattern. 
        - This is `O(N*M)`, where N = `len(text)` and M = `len(pattern)`, which is horrible for long inputs
    - To do this faster, we use the rolling hash (Rabin Karp) algorithm, and make use of the fact that we can hash consecutive substrings in `text` using a recurrence relation, rather than doing a character by character comparison (See notes in `3. Search Substring`)
        - This implementation uses the polyhash hash function, but any hash that is linear in the characters will allow you to derive a recurrence
    - Rabin Karp reduces the time complexity to `O(M+N)` on average, though the worst case is still `O(M*N)` (if every position produces a hash match that must be verified character by character)
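
To make the recurrence concrete, here is a small sketch that fills the window hashes from right to left and checks them against hashing each window directly. The function names `poly_hash` and `rolling_hashes`, and the multiplier 263, are assumptions for illustration (the cells below use a multiplier of 10); only the recurrence itself comes from the notes.

```python
def poly_hash(s: str, x: int = 263, p: int = 10**9 + 7) -> int:
    """Direct polynomial hash: sum of ord(s[j]) * x**j, mod p."""
    h = 0
    for ch in reversed(s):
        h = (h * x + ord(ch)) % p
    return h

def rolling_hashes(text: str, m: int, x: int = 263, p: int = 10**9 + 7) -> list[int]:
    """Hashes of every length-m window of `text`, filled right to left."""
    n_windows = len(text) - m + 1
    if n_windows <= 0:
        return []
    hashes = [0] * n_windows
    hashes[-1] = poly_hash(text[n_windows - 1:], x, p)
    x_pow_m = pow(x, m, p)
    for i in range(n_windows - 2, -1, -1):
        # H[i] = x * H[i+1] + T[i] - T[i+m] * x^m   (mod p)
        hashes[i] = (x * hashes[i + 1] + ord(text[i]) - ord(text[i + m]) * x_pow_m) % p
    return hashes

text, pattern = "abacaba", "aba"
rolled = rolling_hashes(text, len(pattern))
direct = [poly_hash(text[i:i + len(pattern)]) for i in range(len(rolled))]
print(rolled == direct)  # the recurrence agrees with direct hashing

target = poly_hash(pattern)
# Verify each hash hit with a string comparison to rule out collisions
print([i for i, h in enumerate(rolled) if h == target and text[i:i + len(pattern)] == pattern])
```

Each step of the loop costs O(1), which is where the O(M+N) average-case bound comes from.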

In [1]:
# python3
def read_input():
    return (input().rstrip(), input().rstrip())

def print_occurrences(output):
    print(' '.join(map(str, output)))

def get_occurrences(pattern, text):
    return [
        i 
        for i in range(len(text) - len(pattern) + 1) 
        if text[i:i + len(pattern)] == pattern
    ]

In [2]:
def polyhash(text, polynomial: int = 10, prime: int = 10**9 + 7) -> int:
    '''
    Time complexity: O(N) where N is the length of the text

    `prime` is an int (not the float `1e9+7`) so all arithmetic stays exact.
    '''

    hashval: int = 0
    for char in text[::-1]:
        hashval = (((hashval * polynomial) % prime) + ord(char)) % prime
    return hashval

def precompute_hash(text, pattern, polynomial: int = 10, prime: int = 10**9 + 7) -> list[int]:
    '''
    Time complexity: 
        - O(M) for computing the last item in the substring_hash_store
        - O(M) for computing x^|P|
        - O(N-M) for computing the rest of the hash store
        - Total: O(M+M+N-M) = O(M+N)
    '''
    ## Declare an array to hold the hash values of the substrings. 
    ## The length should be len(text) - len(pattern) + 1
    textlen: int = len(text)
    patternlen: int = len(pattern)
    
    count_valid_substrings: int = textlen - patternlen + 1
    ## Pattern longer than text: there are no substrings to hash
    if count_valid_substrings <= 0:
        return []
    
    substring_hash_store: list[int] = [0] * count_valid_substrings
    
    ## Compute hash of last possible substring at the end of text
    substring_hash_store[-1] = polyhash(text[count_valid_substrings-1:], polynomial, prime)

    ## Compute x^|P|
    x_power_p: int = 1
    for _ in range(patternlen):
        x_power_p = (x_power_p * polynomial) % prime

    ## Use the recurrence relation in a loop to find the value of each 
    ## substring hash: 
    ## H[i] = (x * H[i+1] + T[i] - T[i + |P|] * x^|P|) mod p
    for i in range(count_valid_substrings-2, -1, -1):
        substring_hash_store[i] = (
            ((polynomial * substring_hash_store[i+1]) % prime) + 
            (ord(text[i]) % prime) -
            ((ord(text[i + patternlen]) * x_power_p) % prime)
        ) % prime
    
    return substring_hash_store 

def rabin_karp(text, pattern) -> list[int]:
    '''
    Time complexity: 
        - O(M) for computing pattern hash
        - O(N+M) for precomputing hashes for text
        - The loop is slightly complex:
            - O((N-M+1)) for loop over all possible substrings
            - Assuming `q` hashes match and ignoring collisions, it is possible to incur q*M for each loop to check for string equality
            - Total: O(N-M+1 + qM) = O(N-M+1) assuming q is small
                - Worst case for q is N, so this can become O(N-M+1 + N*M) = O(N*M)
        - Total: O(M+N+M+N-M+1) = O(M+2N+1) = O(M+N)
    '''
    pattern_hash: float = polyhash(pattern)
    precomputed_hash: list[float] = precompute_hash(text, pattern)
    result: list[int] = []
    for i in range(len(text)-len(pattern)+1):
        if precomputed_hash[i] == pattern_hash:
            if text[i:(i+len(pattern))] == pattern:
                result.append(i)
    return result

rabin_karp('abacaba', 'aba')

[0, 4]

In [3]:
pattern = 'aba'
text = 'abacaba'

# polyhash(pattern)
# print(precompute_hash(pattern=pattern, text=text))
rabin_karp(text, pattern)

[0, 4]

### 4. Substring equality

- Summary:
    - Given a string `s`, check whether the length-`l` substrings starting at indices `a` and `b` are equal
    
- Approach:
    - Naively, we can simply compare the strings at `s[a:a+l]` and `s[b:b+l]` directly for equality, which will give us time complexity of `O(l)`
        - But imagine if there are multiple queries on the same string `s`
        - This will quickly become `O(q * l)`
    - So in the event of multiple queries, there is a better way using hashing
    - First, let's discuss how to precompute the hash (Note: you can precompute hashes the same way we did it in question 3, but this discusses another approach)
        - Let's suppose we have a string `ABCDEFG`
        - For illustration, map each letter to its position in the alphabet (A=1, B=2, ...), so the string corresponds to the values `1234567` (real code uses the ordinal value from `ord()` instead)
        - Let's suppose we want to find the hash value of a given substring `CDE`, which we call $H(CDE)$
        - Using polyhash, 
            $$\begin{aligned}
                H(CDE) &= 3x^2 + 4x + 5 \\
                H(ABCDE) &= x^4 + 2x^3 + 3x^2 + 4x + 5 \\
                H(AB) &= x + 2 \\[1ex]
                H(ABCDE) - x^3 \cdot H(AB) &= x^4 + 2x^3 + 3x^2 + 4x + 5 - x^3(x+2) \\
                &= x^4 + 2x^3 + 3x^2 + 4x + 5 - x^4 - 2x^3 \\
                &= 3x^2 + 4x + 5 \\
                &= H(CDE)
            \end{aligned}$$
        - In general, with prefix hashes $P[i] = H(s[0:i])$, we get $H(s[a:a+l]) = P[a+l] - x^l \cdot P[a] \pmod p$. So after an $O(|s|)$ precomputation of the prefix hashes, the hash of any substring can be computed in constant time (plus $x^l \bmod p$, which can be precomputed or obtained in $O(\log l)$)! 
    - Once we have this relation, to check if the substrings match at positions $a$ and $b$, we simply check if their hashes computed above match up, which takes constant time per query instead of $O(l)$ 
    - To guard against hash collisions, you can either do a character by character comparison in the event of a match, or simply compute multiple hashes to check that they all match
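
The `ABCDEFG` derivation above can be checked numerically. This is a small sketch under the same toy encoding (A=1, B=2, ...) with $x = 10$, so the hashes read off as ordinary decimal numbers; `prefix_hashes` and `substring_hash` are illustrative names, not functions from the cells below.

```python
def prefix_hashes(values: list[int], x: int, p: int) -> list[int]:
    """prefix[i] is the hash of the first i values; prefix[0] = 0."""
    prefix = [0] * (len(values) + 1)
    for i, v in enumerate(values):
        prefix[i + 1] = (prefix[i] * x + v) % p
    return prefix

def substring_hash(prefix: list[int], start: int, length: int, x: int, p: int) -> int:
    """Hash of values[start:start+length], derived from two prefix hashes."""
    return (prefix[start + length] - pow(x, length, p) * prefix[start]) % p

X, P = 10, 10**9 + 7
values = [1, 2, 3, 4, 5, 6, 7]             # toy encoding of "ABCDEFG" (A=1, B=2, ...)
prefix = prefix_hashes(values, X, P)

print(prefix[5])                            # H(ABCDE) = 12345
print(prefix[2])                            # H(AB)    = 12
print(substring_hash(prefix, 2, 3, X, P))   # H(CDE)   = 12345 - 10^3 * 12 = 345
```

The prefix array is built once in linear time; every later substring hash needs only two lookups and one modular power.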

In [33]:
# random.choice(range(10))

In [42]:
%%prun

import sys
import random
import string

def make_random_string(strlen=5000):
    return ''.join([random.choice(string.ascii_lowercase) for _ in range(strlen)])

def get_random_query(strlen=5000):
    p1 = random.randrange(strlen - (strlen // 10))
    p2 = random.randrange(p1, strlen)
    # randint is inclusive; the upper bound keeps p2 + length within the string
    length = random.randint(0, strlen - p2)
    return p1, p2, length

class Solver:
    def __init__(self, s) -> None:
        self.s: str = s
        self.polynomial=int(10)
        self.prime1 = int(1e9+7)
        self.prime2 = int(1e9+9)
        self.precompute_hash_1 = self.compute_hash(
            string=s, polynomial=self.polynomial, prime=self.prime1
        )
        # self.precompute_hash_2 = self.compute_hash(
        #     string=s, polynomial=self.polynomial, prime=self.prime2
        # )

    def _ask(self, a, b, l):
        return self.s[a:(a+l)] == self.s[b:(b+l)]
        
    def compute_hash(self, string, polynomial, prime):
        precompute_substring_hash = [int(0)] * (len(string)+1)
        for i in range(len(string)):
            precompute_substring_hash[i+1] = (
                (precompute_substring_hash[i] * polynomial) % prime + ord(string[i])
            ) % prime
        # print(precompute_substring_hash)
        return precompute_substring_hash

    def ask(self, a, b, l):
        polynomial_multiple1 = pow(int(self.polynomial), int(l), int(self.prime1))
        # polynomial_multiple2 = pow(int(self.polynomial), int(l), int(self.prime2))
        
        hash1_a = (
            self.prime1 + 
            self.precompute_hash_1[a+l] - 
            ((polynomial_multiple1 * self.precompute_hash_1[a]) % self.prime1)
         ) % self.prime1
        
        # hash2_a = (
        #     self.prime2 + 
        #     self.precompute_hash_2[a+l] - 
        #     ((polynomial_multiple2 * self.precompute_hash_2[a]) % self.prime2)
        # ) % self.prime2

        hash1_b = (
            self.prime1 + 
            self.precompute_hash_1[b+l] - 
            ((polynomial_multiple1 * self.precompute_hash_1[b]) % self.prime1)
        ) % self.prime1  

        # hash2_b = (
        #     self.prime2 + 
        #     self.precompute_hash_2[b+l] - 
        #     ((polynomial_multiple2 * self.precompute_hash_2[b]) % self.prime2)
        # ) % self.prime2      
        
        # Short-circuit with `and`: only compare the strings when the hashes agree
        return (hash1_a == hash1_b) and (self.s[a:(a+l)] == self.s[b:(b+l)])

# s = sys.stdin.readline()
# q = int(sys.stdin.readline())
strlen = 30000
s = make_random_string(strlen)
# s = 'trololo'

solver = Solver(s)
queries = [(0,0,7), (2,4,3), (3,5,1), (1,3,2)]
# queries = [get_random_query(strlen) for _  in range(100)]

for query in queries:
    print(query)
    print(solver.ask(query[0],query[1],query[2]))
# for i in range(q):
# 	a, b, l = map(int, sys.stdin.readline().split())
# 	print("Yes" if solver.ask(a, b, l) else "No")

(0, 0, 7)
True
(2, 4, 3)
False
(3, 5, 1)
False
(1, 3, 2)
False
 

         187241 function calls in 0.034 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    30000    0.009    0.000    0.012    0.000 random.py:239(_randbelow_with_getrandbits)
    30000    0.008    0.000    0.021    0.000 random.py:375(choice)
        1    0.006    0.006    0.007    0.007 <string>:30(compute_hash)
        1    0.006    0.006    0.026    0.026 <string>:6(<listcomp>)
    37028    0.002    0.000    0.002    0.000 {method 'getrandbits' of '_random.Random' objects}
    30018    0.001    0.000    0.001    0.000 {built-in method builtins.len}
    30000    0.001    0.000    0.001    0.000 {method 'bit_length' of 'int' objects}
    30000    0.001    0.000    0.001    0.000 {built-in method builtins.ord}
        1    0.001    0.001    0.034    0.034 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.034    0.034 <string>:1(<

### 5. Longest common substring

- Summary:
    - Given 2 strings `s` and `t`, find the longest substring `w` that occurs in both `s` and `t`
    
- Approach:
    - Naively, we can brute force the entire solution: 
        - Loop over every possible substring length $L$
            - Then, loop over all possible substrings in `s` with length $L$. This gives us $O(N - L + 1)$
                - Then, loop over all possible substrings in `t` with length $L$. This gives us $O(M - L + 1)$
                    - For each pair of substrings from `s` and `t`, do a character-wise comparison in $O(L)$
        - Overall, this gives a total of $O(L \cdot (N-L+1) \cdot (M-L+1) \cdot L) \approx O(N \cdot M \cdot L^2)$
        - Horrible approach
    
    - There is a better approach!!
        - We don't know what the length $l$ of the longest common substring is, so we binary search for it
            - i.e. We check for a common substring of length $l = \frac{\min(|s|, |t|)}{2}$. If one exists, we next search between $l = \frac{\min(|s|, |t|)}{2}$ and $l = \min(|s|,|t|)$; otherwise we search the lower half
                - This works because if there is no common substring of length $x$, there cannot be a common substring of length larger than $x$
            - This gives us $O(\log(\min(N, M)))$ candidate lengths
        - For every value of $l$ we want to test: 
            - Use the prefix-hash relation from Q4 to compute the hash value of every substring of length $l$ in `s`
                - Store the hash values in a map; the key is the hash, and the value is the list of indices where a substring with that hash starts
                - Optionally store 2 maps, representing 2 hash functions with different primes, to reduce the chance of collisions (the code below uses one prime and verifies matches by direct comparison)
                - Each hashmap is built in $O(N-L)$ (see notes on `precompute_hash`)
            - Once done for `s`, start computing hashes for substrings in `t`
                - For each substring, check if its hash value exists in the earlier map
                - If it matches, store the current index from `t` and get the index from `s` by looking it up in the hashmap
                    - The question only asks us to identify one longest common substring, not all instances of longest common substrings
                - Then, increase the value of $l$, or return if the search range is empty
                - This is done in $O(M)$
            
        - So overall complexity is $O(\log(\min(N,M)) \cdot (N-L+M-L))$, which is approximately linear time with a $\log$ factor

- ERRORS TO NOTE
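    - Make sure your `PRIME` and `POLYNOMIAL` constants are declared as `int` (e.g. `int(1e9+7)` or `10**9 + 7`), and not the default float of `1e9+7`, or you'll get weird off-by-1 errors
    - When computing the hash values in the loop in `get_hashmap_for_substrings`, add `prime` before the final modulo so the subtraction never yields a negative value (Python's `%` already returns a non-negative result for a positive modulus, but languages such as C++ or Java do not, so the habit is worth keeping)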
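
The binary search on the answer length described above can be sketched independently of the hashing details. To keep the example short, this version stores the actual substrings as dictionary keys (Python hashes them internally) rather than rolling hashes, so each length check costs more than the $O(N + M)$ of the hash-based version; the function names are illustrative.

```python
def common_substring_of_length(s: str, t: str, length: int):
    """Return (i, j) with s[i:i+length] == t[j:j+length], or None if no such pair exists."""
    first_index = {}
    for i in range(len(s) - length + 1):
        # Remember the first start index in `s` for each distinct substring
        first_index.setdefault(s[i:i + length], i)
    for j in range(len(t) - length + 1):
        i = first_index.get(t[j:j + length])
        if i is not None:
            return i, j
    return None

def longest_common_substring(s: str, t: str):
    """Binary search on the answer length; returns (i, j, length)."""
    lo, hi = 0, min(len(s), len(t))
    best = (0, 0, 0)
    while lo <= hi:
        mid = (lo + hi) // 2
        found = common_substring_of_length(s, t, mid)
        if found is not None:
            best = (found[0], found[1], mid)
            lo = mid + 1        # a match of length mid exists, try longer
        else:
            hi = mid - 1        # no match of length mid, so none longer either
    return best

i, j, length = longest_common_substring("aabaa", "babbaab")
print(length, "aabaa"[i:i + length], "babbaab"[j:j + length])
```

Swapping the substring keys for the prefix-hash values from Q4 gives the faster variant implemented in the cells below.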

In [2]:
import sys
from collections import namedtuple

Answer = namedtuple('answer_type', ['i', 'j', 'len'])

def solve(s, t):
	ans = Answer(0, 0, 0)
	for i in range(len(s)):
		for j in range(len(t)):
			for l in range(min(len(s) - i, len(t) - j) + 1):
				if (l > ans.len) and (s[i:i+l] == t[j:j+l]):
					ans = Answer(i, j, l)
	return ans

s='cool'
t='toolbox'

s='aaa'
t='bb'

s='aabaa'
t='babbaab'

solve(s,t)

answer_type(i=0, j=4, len=3)

In [3]:
import random
import string
def get_2_random_strings(maxlen=200, make_common_substring=True):
    if make_common_substring:
        common_substring_len = random.choice(range(maxlen))
        common_substring = ''.join([random.choice(string.ascii_lowercase) for _ in range(common_substring_len)])
    else:
        common_substring_len = 0
        common_substring = ''
    
    first_substring_len = random.choice(range(maxlen - common_substring_len))
    second_substring_len = random.choice(range(maxlen - common_substring_len))
    a = ''.join([random.choice(string.ascii_lowercase) for _ in range(first_substring_len)]) + common_substring + ''.join([random.choice(string.ascii_lowercase) for _ in range(maxlen-first_substring_len)])
    b = ''.join([random.choice(string.ascii_lowercase) for _ in range(second_substring_len)]) + common_substring + ''.join([random.choice(string.ascii_lowercase) for _ in range(maxlen-second_substring_len)])
    return a, b

# s, t = get_2_random_strings(100, make_common_substring=False)

In [25]:
POLYNOMIAL=int(10)
PRIME1=int(1e9+7)
PRIME2=int(1e9+9)
s, t = get_2_random_strings(100, make_common_substring=False)
# s, t = 'abcdefg', '0129u49sjdfhabcdefg'
# print(s, t)
    
Answer = namedtuple('answer_type', ['i', 'j', 'len'])

def naive_solve(s,t):
	ans = Answer(0, 0, 0)
	for i in range(len(s)):
		for j in range(len(t)):
			for l in range(min(len(s) - i, len(t) - j) + 1):
				if (l > ans.len) and (s[i:i+l] == t[j:j+l]):
					ans = Answer(i, j, l)
	return ans

def precompute_cumulative_hash(string, prime):
    '''
    See Q4 for computation. Basically we use the relation that for string `s`, the hash `H()` of s[2:4] is simply H(s[0:4]) - H(s[0:2]) * x^2
    '''
    hashvals_arr = [int(0)] * (len(string)+1)
    for i in range(len(string)):
        hashvals_arr[i+1] = int(
            (((hashvals_arr[i] * POLYNOMIAL) % prime) + ord(string[i]))
            % prime
        )
    return hashvals_arr

def get_hashmap_for_substrings(string, string_cumulative_hash, substring_len, prime):
    '''
    Returns a map with the hashvalues as keys, and substring starting indices as values.

    Note that the formula provided in the course is not accurate. Because there is a subtraction term, it is possible that you can end up with a negative number. As such, always add `prime` to the final sum to ensure that you are dealing with positive values. Adding `prime` doesn't change the final modulo, because prime % prime = 0, and won't affect the final answer by modular arithmetic.
    '''
    # substr_hashes = [0.] * (len(string)-substring_len+1)
    substr_hash_map = {}
    polynomial_term = pow(int(POLYNOMIAL), int(substring_len), int(prime))

    for substring_start_index in range(len(string)-substring_len+1):
        substring_end_index = substring_start_index+substring_len
        hashval = int((
            ##add `prime` to avoid negative values!!
            prime +
            string_cumulative_hash[substring_end_index] - 
            ((string_cumulative_hash[substring_start_index] * polynomial_term) % prime)
        ) % prime)
        # print(f'{substring_start_index}, {substring_end_index}, {string[substring_start_index:substring_end_index]}, {hashval}')
        if hashval in substr_hash_map:
            substr_hash_map[hashval].append(substring_start_index) 
        else:
            substr_hash_map[hashval] = [substring_start_index]
    return substr_hash_map

def get_common_substring_index(string, string_cumulative_hash, string_to_compare, hash_to_compare, substring_len, prime):
    # substr_hashes = [0.] * (len(string)-substring_len+1)
    polynomial_term = pow(int(POLYNOMIAL), int(substring_len), int(prime))

    for substring_start_index in range(len(string)-substring_len+1):
        substring_end_index = substring_start_index+substring_len
        hashval = (
            string_cumulative_hash[substring_end_index] - 
            ((string_cumulative_hash[substring_start_index] * polynomial_term) % prime) + 
            prime
        ) % prime
        if hashval in hash_to_compare:
            for index in hash_to_compare.get(hashval):
                if string[substring_start_index:(substring_start_index+substring_len)] == string_to_compare[index:(index+substring_len)]:
                    return index, substring_start_index
    return -1, -1

def solve(s, t):
    
    max_common_substr_len = min(len(s), len(t))
    min_common_substr_len = 0
    max_matching_substring_len = 0
    string1_match_index, string2_match_index = -1, -1
    
    s_hashval_arr_p1 = precompute_cumulative_hash(s, PRIME1)
    t_hashval_arr_p1 = precompute_cumulative_hash(t, PRIME1)
    # s_hashval_arr_p2 = precompute_cumulative_hash(s, PRIME2)
    # t_hashval_arr_p2 = precompute_cumulative_hash(t, PRIME2)

    while max_common_substr_len >= min_common_substr_len:
        # print('='*50)
        check_length: int = (min_common_substr_len + max_common_substr_len)//2
        # print(f"{max_common_substr_len=}, {min_common_substr_len=}, {check_length=}")

        s_substring_hash_p1 = get_hashmap_for_substrings(s, s_hashval_arr_p1, check_length, PRIME1)
        
        ## For debugging only (unused below); uncomment to inspect the hashes of `t`
        # t_substring_hash_p1 = get_hashmap_for_substrings(t, t_hashval_arr_p1, check_length, PRIME1)
        
        has_common_substring_p1 = False
        common_substring_string1_index_p1, common_substring_string2_index_p1  = get_common_substring_index(string=t, string_cumulative_hash=t_hashval_arr_p1, string_to_compare=s, hash_to_compare=s_substring_hash_p1, substring_len=check_length, prime=PRIME1)
        
        # print(common_substring_string1_index_p1, common_substring_string2_index_p1)
        common_hash_found = common_substring_string1_index_p1 != -1
        is_not_collision = s[common_substring_string1_index_p1:(common_substring_string1_index_p1+check_length)] == t[common_substring_string2_index_p1:(common_substring_string2_index_p1+check_length)]
        # print(f'{common_hash_found=}, {is_not_collision=}')
        if common_hash_found and is_not_collision:
            has_common_substring_p1 = True
        
        if has_common_substring_p1:
            # print(s[common_substring_string1_index_p1:(common_substring_string1_index_p1+check_length)], t[common_substring_string2_index_p1:(common_substring_string2_index_p1+check_length)])
            max_matching_substring_len = check_length
            min_common_substr_len = check_length+1
            string1_match_index = common_substring_string1_index_p1
            string2_match_index = common_substring_string2_index_p1
        else:
            max_common_substr_len = check_length-1
        # print(f"{max_common_substr_len=}, {min_common_substr_len=}")

    return Answer(string1_match_index, string2_match_index, max_matching_substring_len)

print(f'{s=},{t=}')

my_i, my_j, my_len = solve(s,t)
print(my_i, my_j, my_len)
print(s[my_i:(my_i+my_len)])
print(t[my_j:(my_j+my_len)])

print('='*100)

naive_i, naive_j, naive_len = naive_solve(s,t)
print(naive_i, naive_j, naive_len)
print(s[naive_i:(naive_i+naive_len)])
print(t[naive_j:(naive_j+naive_len)])


s='zddqrevcnposdljbsddgeifqhrongyhepfcokabxoxuqfzxnnvllsbqisaidziqzlkqzaxtsakbagqegazyspgqjwnxdnvhzpear',t='unhmkttkkmxsqsavoodoaokxrnwvbuvgfyognbedogznbzshquicdkowmkxqecfflfacydqafccdjprepjtnukgpbzqyniwhrcfe'
56 13 2
sa
sa
2 69 2
dq
dq


### [** INCOMPLETE] 6. Pattern matching with mismatches

- Summary
    - Given strings `t` and `p` and integer `k`
    - Assume $|t| = n$ and $|p| = m$ and $m < n$
    - Find the number of times that `p` occurs in `t`, with a tolerance for `k` mismatches
        - i.e. if `t = 'abcde'`, then `p = 'abf'` occurs 1 time with 1 mismatch (at `t[0:3] = 'abc'`). So if `k>=1`, then it occurs 1 time. If `k=0` (i.e. no mismatch tolerated), then it occurs 0 times 

- Approach: 
    - Hash $p$
        - This is done in $O(m)$
    - Hash $t$
        - This is done in $O(n)$
    - Loop over every valid substring (window) in $t$, of which there are $O(n)$
        - For each window, use hash-guided binary search to count mismatches against $p$
            - In each step, we compare the middle character, then recurse into the left/right halves only where the hashes differ

    - How to apply binary search here?
        - Given the precomputed prefix hashes, the hash of any substring of $t$ (or $p$) can be computed in $O(1)$ time
        - For a given window of $t$, run binary search to count the number of mismatches
            - Check if the mid-point characters are equal. If not, add 1 to mismatches
            - Check if the hash value of the LHS substring of $t$ matches the hash value of the LHS of $p$. If the hashes match, there is no need to search the left half any more
            - Do the same for the right
            - Break when the count of mismatches exceeds $k$    
        - Locating a single mismatch this way costs $O(\log m)$
    - We need to repeat this for each of the $O(n)$ windows of $t$

- Time complexity is given as $O(nk \log n)$. This is consistent with the breakdown above: each window stops after at most $k+1$ mismatches, each found in $O(\log m)$, giving $O(n \cdot k \cdot \log m) \subseteq O(nk \log n)$ since $m < n$.
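
One way to realise this idea is sketched below. It is an iterative variant rather than the recursive halving described above: from the current position it binary searches for the longest run whose hashes agree, counts the next character as a mismatch, and continues. The names, the multiplier 263, and the choice to trust hash equality without re-checking characters are all assumptions of this sketch.

```python
X, P = 263, 10**9 + 7   # assumed hash parameters

def prefix_hashes(s: str) -> list[int]:
    """prefix[i] is the polynomial hash of s[:i]."""
    prefix = [0] * (len(s) + 1)
    for i, ch in enumerate(s):
        prefix[i + 1] = (prefix[i] * X + ord(ch)) % P
    return prefix

def sub_hash(prefix: list[int], start: int, length: int) -> int:
    """Hash of the substring of the given length starting at `start`."""
    return (prefix[start + length] - pow(X, length, P) * prefix[start]) % P

def count_approx_matches(t: str, p: str, k: int) -> int:
    """Count windows of t that differ from p in at most k positions."""
    n, m = len(t), len(p)
    ht, hp = prefix_hashes(t), prefix_hashes(p)
    count = 0
    for i in range(n - m + 1):
        mismatches, pos = 0, 0
        while pos < m and mismatches <= k:
            # Binary search the longest run starting at `pos` whose hashes agree
            lo, hi = 0, m - pos
            while lo < hi:
                mid = (lo + hi + 1) // 2
                if sub_hash(ht, i + pos, mid) == sub_hash(hp, pos, mid):
                    lo = mid
                else:
                    hi = mid - 1
            pos += lo
            if pos < m:          # the character at `pos` is a mismatch
                mismatches += 1
                pos += 1
        if mismatches <= k:
            count += 1
    return count

print(count_approx_matches("abcde", "abf", 1))
print(count_approx_matches("abcde", "abf", 0))
```

Each window performs at most $k+1$ binary searches of $O(\log m)$ steps, which matches the $O(nk \log n)$ bound quoted above (ignoring the cost of `pow`, which could be replaced by a precomputed table of powers).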

In [28]:
POLYNOMIAL = int(10)
PRIME1 = int(1e9+7)
PRIME2 = int(1e9+9)

t = 'xabcabc'
p = 'ccc'
k = 1
count_substring_approx_match = 0
print(f'{t=}, {p=}')

def make_cumulative_hash(string, polynomial=POLYNOMIAL, prime=PRIME1):
    hashvals = [int(0)] * (len(string)+1)
    for i in range(len(string)):
        hashvals[i+1] = (
            ((hashvals[i] * polynomial) % prime) + ord(string[i])
        ) % prime
    return hashvals

cum_hash_t = make_cumulative_hash(t, POLYNOMIAL, PRIME1)
cum_hash_p = make_cumulative_hash(p, POLYNOMIAL, PRIME1)
# print(cum_hash_t)
# print(cum_hash_p)

def substring_hashes_are_equal(substring_start_index, pattern_len, cum_hash_t, cum_hash_p, polynomial=POLYNOMIAL, prime=PRIME1):
    print('*'*25)
    print(f'Calling substring_hashes_are_equal({substring_start_index=}, {pattern_len=})')
    substring_hash = (
        prime +
        cum_hash_t[(substring_start_index+pattern_len)] - 
        ((cum_hash_t[substring_start_index] * pow(polynomial, pattern_len, prime)) % prime)
    ) % prime

    if pattern_len == (len(cum_hash_p)-1):
        pattern_hash = cum_hash_p[-1]
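    else:
        # BUG (this cell is incomplete): `substring_start_index` is an offset into `t`,
        # but it is used here to index the pattern's cumulative hash, which raises
        # an IndexError once substring_start_index + pattern_len exceeds len(p)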
        pattern_hash = (
            prime +
            cum_hash_p[(substring_start_index+pattern_len)] - 
            ((cum_hash_p[substring_start_index] * pow(polynomial, pattern_len, prime)) % prime)
        ) % prime

    return substring_hash==pattern_hash

def less_than_k_mismatch(substring_start_index, substring, pattern, cum_hash_t, cum_hash_p, counter, k):
    print('+'*25)
    print(f'Calling less_than_k_mismatch({substring_start_index=}, {substring=}, {pattern=}, {counter=}, {k=})')
    left=0
    right=len(pattern)
    mid = (left+right)//2
    
    if substring_hashes_are_equal(substring_start_index, len(pattern), cum_hash_t, cum_hash_p):
        if substring == pattern:
            return True

    if substring[mid] != pattern[mid]:
        counter += 1
    
    if counter > k:
        return False
    else:
        left_counter_lower_than_k = less_than_k_mismatch(
            substring_start_index, 
            substring[left:mid], 
            pattern[left:mid], 
            cum_hash_t, cum_hash_p,
            counter, k
        )
        print(f'{left_counter_lower_than_k=}')
        if not left_counter_lower_than_k:
            return False
        
        right_counter_lower_than_k = less_than_k_mismatch(
            substring_start_index, 
            substring[(mid+1):right], 
            pattern[(mid+1):right], 
            cum_hash_t, cum_hash_p,
            counter, k
        )
        print(f'{right_counter_lower_than_k=}')
        if not right_counter_lower_than_k:
            return False
    
    return True

for i in range(len(t)-len(p)+1):
    # i = 0
    print('='*50)
    curr_substring = t[i:(i+len(p))]
    print(f'index={i}')
    print(f'{curr_substring=}')
    print(f'{p=}')
    print(f'count matches = {count_substring_approx_match}')

    if substring_hashes_are_equal(i, len(p), cum_hash_t, cum_hash_p):
        print('found equal substring')
        if curr_substring == p:
            count_substring_approx_match += 1
            continue

    less_than_k_mismatch_found = less_than_k_mismatch(i, curr_substring, p, cum_hash_t, cum_hash_p, 0, k)
    if less_than_k_mismatch_found:
        count_substring_approx_match += 1

print(count_substring_approx_match)

t='xabcabc', p='ccc'
index=0
curr_substring='xab'
p='ccc'
count matches = 0
*************************
Calling substring_hashes_are_equal(substring_start_index=0, pattern_len=3)
+++++++++++++++++++++++++
Calling less_than_k_mismatch(substring_start_index=0, substring='xab', pattern='ccc', counter=0, k=1)
*************************
Calling substring_hashes_are_equal(substring_start_index=0, pattern_len=3)
+++++++++++++++++++++++++
Calling less_than_k_mismatch(substring_start_index=0, substring='x', pattern='c', counter=1, k=1)
*************************
Calling substring_hashes_are_equal(substring_start_index=0, pattern_len=1)
left_counter_lower_than_k=False
index=1
curr_substring='abc'
p='ccc'
count matches = 0
*************************
Calling substring_hashes_are_equal(substring_start_index=1, pattern_len=3)
+++++++++++++++++++++++++
Calling less_than_k_mismatch(substring_start_index=1, substring='abc', pattern='ccc', counter=0, k=1)
*************************
Calling substring_hashes_ar

IndexError: list index out of range

In [6]:
len(cum_hash_p)

5