# Rabin-Karp method

If you think hard enough, there is nothing that differentiates pieces of text from numbers. You can think of letters as digits, and of the base as a number big enough to accommodate all the digits. For example, take the following text:

$$
babacb
$$

It can be thought of as a number in base 26 (one digit per English letter, with $a = 0$, $b = 1$, $c = 2$ and so on):

$$
(1,0,1,0,2,1)_{26}
$$

We can transform this number to base 10 using the following equation.

$$
1*26^5 + 0 * 26^4 + 1*26^3 + 0*26^2 + 2*26^1 + 1*26^0 = 11899005
$$

From the formulation above the following property should be clear.

$$
abba = abb * 26 + a
$$

In general, we can write $concat(word, letter) = base * word + letter$ (1).
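
As a quick sanity check of the arithmetic above, here is a minimal sketch that reads a word as a base-26 number with $a = 0$, $b = 1$ and so on. The helper name `word_to_number` is made up for this illustration; it simply applies property (1) one letter at a time.

```python
def word_to_number(word, base=26):
    """Interpret a lowercase word as a number with a=0, b=1, ..., z=25."""
    value = 0
    for letter in word:
        # property (1): appending a letter multiplies by the base
        # and adds the new digit
        value = value * base + (ord(letter) - ord('a'))
    return value

print(word_to_number("babacb"))  # 11899005

# property (1) on a concrete word: the last letter of "abba" is "a"
print(word_to_number("abba") == word_to_number("abb") * 26 + word_to_number("a"))  # True
```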

There is also a small technicality. When we compare numbers, $0001$, $001$ and $1$ are all equivalent. This means that we cannot map any letter to 0 if we want to be able to successfully compare the numbers.
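
The leading-zero problem is easy to reproduce. The sketch below uses a hypothetical helper `to_number` whose `first_digit` argument decides what the letter `a` maps to; with `first_digit=0` different words collapse to the same number, with `first_digit=1` they do not.

```python
def to_number(word, first_digit):
    """Read a lowercase word as a number; 'a' is mapped to first_digit."""
    base = 26 + first_digit  # large enough to hold every digit
    value = 0
    for letter in word:
        value = value * base + (ord(letter) - ord('a') + first_digit)
    return value

# a -> 0: leading a's behave like leading zeros and vanish
print(to_number("b", 0), to_number("ab", 0), to_number("aab", 0))  # 1 1 1

# a -> 1: the three words get three different numbers
print(to_number("b", 1), to_number("ab", 1), to_number("aab", 1))  # 2 29 758
```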

Observation (1) allows us to quickly compute hashes for all the prefixes of a given word. Just like in class we are going to use modular arithmetic for our computations.

In [8]:
# we need to map letters to numbers. Python function ord does the job
print(repr('A'), ord('A'))
print(repr('a'), ord('a'))
print(repr('b'), ord('b'))
print(repr('c'), ord('c'))
print(repr(' '), ord(' '))

print('%s - %s + 1 = %d' % (repr('c'), repr('a'), ord('c') - ord('a') + 1))


'A' 65
'a' 97
'b' 98
'c' 99
' ' 32
'c' - 'a' + 1 = 3


In [17]:
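# NOTE: despite the name, 2**32 - 1 is not prime (3 * 5 * 17 * 257 * 65537).
# The value is kept so that the outputs shown in this notebook stay reproducible.
BIG_FAT_PRIME = 2**32 - 1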
ENGLISH_BASE = 30 # in theory 27 is sufficient but better safe than sorry!

def compute_hashes(text, base=ENGLISH_BASE, modulo = BIG_FAT_PRIME):
    # h[k] will hold the hash of the prefix text[:k]
    h = [None for _ in range(len(text) + 1)]
    h[0] = 0 # hash of empty word is 0
    for i in range(len(text)):
        # we only deal with english letters so we subtract 'a'
        # to normalize range. We add 1 to avoid creating zero digit.
        letter_as_number = (ord(text[i]) - ord('a') + 1)
        h[i + 1] = h[i] * base + letter_as_number
        h[i + 1] %= modulo
        # at the end of the iteration h[i+1] is the hash
        # of prefix of text of length (i+1) which in
        # Python is text[:(i+1)]
    return h

PROTIP: If you ever implement this in a lower-level programming language like C or C++, beware of integer overflows.
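
To make the warning concrete, here is a rough sketch in Python that simulates what a 32-bit unsigned accumulator would do. The names `BASE`, `MODULO` and `wrap_uint32` are introduced only for this illustration; the constants mirror `ENGLISH_BASE` and `BIG_FAT_PRIME`. The point is that the intermediate value `h * base + letter` does not fit in 32 bits, so a lower-level implementation would need a wider type (for example a 64-bit integer) for that step.

```python
BASE = 30
MODULO = 2**32 - 1

# Worst case before the modulo is applied: the previous hash is just
# below MODULO and the new letter is 'z' (digit 26).
worst_case = (MODULO - 1) * BASE + 26
print(worst_case.bit_length())  # 37

def wrap_uint32(x):
    """Simulate unsigned 32-bit wraparound (assumed C-like behaviour)."""
    return x & 0xFFFFFFFF

h = 1484031649  # hash of the prefix "babddas" from the output above
digit = ord('d') - ord('a') + 1

correct = (h * BASE + digit) % MODULO
overflowed = wrap_uint32(h * BASE + digit) % MODULO
print(correct, overflowed, correct == overflowed)
```

Python integers never overflow, which is why the notebook code works without any extra care.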

In [18]:
compute_hashes("babacb")

[0, 2, 61, 1832, 54961, 1648833, 49464992]

In [19]:
compute_hashes("babddd")

[0, 2, 61, 1832, 54964, 1648924, 49467724]

In [20]:
# for longer strings modulo matters
compute_hashes("babddasdasdsad")

[0,
 2,
 61,
 1832,
 54964,
 1648924,
 49467721,
 1484031649,
 1571276524,
 4188622771,
 1104631594,
 3074176759,
 2030989594,
 800145691,
 2529534259]

## Hashes of substrings

Now here's a crucial observation. Let's take the polynomial representation of the string $babacb$ (where $X$ is the base):

$$
b*X^5 + a * X^4 + b*X^3 + a*X^2 + c*X^1 + b*X^0 
$$

Say we want to compute the hash of $ac$, which conveniently appears in the string we originally hashed, starting at index 3 (the 4th character). Moreover, we have the hashes of all the prefixes, so it seems like we are in good shape:

\begin{align}
\text{we have }\ \ \ & hash(babac) &=\ & b*X^4 + a * X^3 + b*X^2 &+& a*X^1 + c*X^0 \\
\text{we have }\ \ \  & hash(bab)   &=\ & b*X^2 + a * X^1 + b*X^0&&\\
\text{we WANT }\ \ \  & hash(ac)    &=\ &                         && a*X^1 + c*X^0\\
\end{align}


From above we can clearly see that:

$$
hash(ac) = hash(babac) - X^2 * hash(bab)
$$
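
The identity can be checked numerically. The sketch below assumes the same base (30), modulus and letter mapping as `compute_hashes`; `simple_hash` is a helper defined only for this check.

```python
BASE = 30
MODULO = 2**32 - 1

def simple_hash(word):
    """Polynomial hash of a lowercase word, with a=1, b=2, ..."""
    h = 0
    for letter in word:
        h = (h * BASE + ord(letter) - ord('a') + 1) % MODULO
    return h

lhs = simple_hash("ac")
# hash(babac) - X^2 * hash(bab), reduced modulo MODULO
rhs = (simple_hash("babac") - BASE**2 * simple_hash("bab")) % MODULO
print(lhs, rhs)  # 33 33
```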

We can generalize this to an arbitrary substring of our hashed string. Assume we hashed the string $s_0, s_1, ..., s_{n-1}$ such that $h_0 = hash(\emptyset)$, $h_1 = hash(s_0)$, $h_2 = hash(s_0, s_1)$ etc.
Then for $i \le j$ we can compute $hash(s_i, ..., s_{j-1})$ using the following formula:

$$
hash(s_i, ..., s_{j-1}) = h_j - h_i * X ^{j - i}
$$

This looks very close to $O(1)$ complexity hash computation if not for $X ^{j - i}$. But since there are at most $n$ different powers of $X$ that we are interested in, we can precompute them in $O(n)$ time.

In [37]:
def compute_powers(n, base=ENGLISH_BASE, modulo=BIG_FAT_PRIME):
    powers = [None for _ in range(n + 1)]
    powers[0] = 1
    for i in range(n):
        powers[i+1] = (powers[i] * base) % modulo
    return powers

In [38]:
compute_powers(10)

[1,
 30,
 900,
 27000,
 810000,
 24300000,
 729000000,
 395163525,
 3264971160,
 3459854310,
 716414220]

Now we can put all those observations together into an efficient data structure that allows us to compute hashes of substrings in $O(1)$:

In [77]:
class Hasher(object):
    def __init__(self, word):
        self.h = compute_hashes(word)
        self.powers = compute_powers(len(word))
        
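    def substring_hash(self, i, j):
        # hash of word[i:j]; the difference may be negative, but Python's %
        # with a positive modulus always returns a non-negative result
        result = self.h[j] - self.h[i] * self.powers[j-i]
        return result % BIG_FAT_PRIME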

In [78]:
TEXT = "abcxabcx"
h = Hasher(TEXT)

In [79]:
def highlight(word, i, j):
    return word[:i] + "[" + word[i:j] + "]" + word[j:]

print(highlight(TEXT, 0, 2), h.substring_hash(0, 2))
print(highlight(TEXT, 4, 6), h.substring_hash(4, 6))
print(highlight(TEXT, 3, 5), h.substring_hash(3, 5))

[ab]cxabcx 32
abcx[ab]cx 32
abc[xa]bcx 721
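
One way to gain confidence in the substring formula is to compare it against hashing every substring from scratch. The following is a self-contained sketch, not the notebook's `Hasher` itself: `prefix_hashes` and `direct_hash` are stand-in names, and the base, modulus and letter mapping are assumed to match the cells above.

```python
BASE = 30
MODULO = 2**32 - 1

def prefix_hashes(text):
    """h[k] is the hash of text[:k]."""
    h = [0]
    for letter in text:
        h.append((h[-1] * BASE + ord(letter) - ord('a') + 1) % MODULO)
    return h

def direct_hash(word):
    """Hash a word from scratch (O(len(word)))."""
    return prefix_hashes(word)[-1]

text = "abcxabcx"
h = prefix_hashes(text)

mismatches = 0
for i in range(len(text) + 1):
    for j in range(i, len(text) + 1):
        # h_j - h_i * X^(j-i), reduced modulo MODULO
        fast = (h[j] - h[i] * pow(BASE, j - i, MODULO)) % MODULO
        if fast != direct_hash(text[i:j]):
            mismatches += 1

print(mismatches)  # 0
```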


## Hasher complexity analysis

Preprocessing (`__init__`):
- `compute_hashes` is $O(n)$
- `compute_powers` is $O(n)$

Therefore precomputing is $O(n)$.

Each query (`substring_hash`) has complexity $O(1)$ - it is just a simple formula.

Notice that this technique is very powerful, more powerful than we need for pattern matching. It should not be a surprise that we can easily use it to solve pattern matching:

In [100]:
# hasher for text
text_h = Hasher("to be or not to be")

In [101]:
# hash of the pattern
compute_hashes("be")[-1]

65

In [102]:
# hash of the occurrence of "be" in original text.
text_h.substring_hash(3, 5)

65

In [105]:
def compute_matches(text, pattern):
    # hash of pattern
    pattern_hash = compute_hashes(pattern)[-1]
    # hasher for text
    text_h = Hasher(text)
    res = []
    for i in range(len(text) - len(pattern) + 1):
        # i is potential match start index
        # compare hash in text with hash of pattern; equal hashes almost
        # always mean a match, but a hash collision is possible in principle
        if text_h.substring_hash(i, i + len(pattern)) == pattern_hash:
            # if matching append to result list.
            res.append(i)
    return res

In [106]:
compute_matches("to be or not to be", "be")

[3, 16]
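
Since two different strings can in principle share a hash, a common safeguard is to confirm every hash hit with a direct string comparison. The sketch below is one possible variant under the same assumptions as above (base 30, the same modulus, `a` mapped to 1); the function name `find_matches_verified` is made up for this example and it is written to be self-contained rather than to reuse `Hasher`.

```python
BASE = 30
MODULO = 2**32 - 1

def find_matches_verified(text, pattern):
    """Return start indices of pattern in text, confirming each hash hit."""
    m = len(pattern)
    if m > len(text):
        return []

    # prefix hashes of the text: h[k] is the hash of text[:k]
    h = [0]
    for ch in text:
        h.append((h[-1] * BASE + ord(ch) - ord('a') + 1) % MODULO)

    pattern_hash = 0
    for ch in pattern:
        pattern_hash = (pattern_hash * BASE + ord(ch) - ord('a') + 1) % MODULO

    # every window has the same length, so a single power is enough
    top_power = pow(BASE, m, MODULO)

    matches = []
    for i in range(len(text) - m + 1):
        window_hash = (h[i + m] - h[i] * top_power) % MODULO
        # the slice comparison only runs when the hashes agree
        if window_hash == pattern_hash and text[i:i + m] == pattern:
            matches.append(i)
    return matches

print(find_matches_verified("to be or not to be", "be"))  # [3, 16]
```

The extra comparison costs $O(m)$ per hash hit, so the running time stays close to linear as long as false hits are rare.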