### Author: Jose Miguel Bautista
### Updated: 05/31/2024

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
import sys

# Primitive Structures

Remember that data is stored as binary somewhere in memory, which is physically a bunch of two-state switches.  
Data structures are ways of organizing that data, and associated operations to manipulate or manage the data.  

Below we talk about some of the simplest forms of structure, arrays and linked lists, which are almost entirely made of [primitive data types](https://en.wikipedia.org/wiki/Primitive_data_type) (floats, bools, etc.).  
These are also examples of *linear* structures which organize data according to position (index) in the structure.  
From these, we can construct more sophisticated data structures (e.g. dictionaries and queues).  
Later in the notes we'll also go over some *nonlinear* structures, trees, where some level of hierarchy or extrinsic organization is encoded in the structure. 

## Abstract Data Types (ADT)

Formally, an ADT is 3 things combined:
1. A collection of data values
1. The relationships between the data
1. Any supported operations on the data

More simply, an ADT is a logical description of how to store and manage data.  
Data structures are the actual, concrete implementations in memory and code based on the ADT.  
When we talk about data structures, we're really talking about ADTs and how they're implemented in a computer. 

These implementations can be split into 2 large categories: contiguous or linked structures.  
1. **Contiguous** - e.g. arrays, matrices, heaps, hash tables
2. **Linked** - e.g. linked lists, binary search trees

Contiguous structures, as the name implies, are structures made of a single allocation of memory (a bunch of switches physically adjacent to each other).  

Linked structures on the other hand can occupy multiple disjoint blocks of memory.  
Such structures would have the blocks connected by *pointers* which tell you the location of the next item in memory (see appendix if unfamiliar).  

As an analogy, a contiguous structure is like a box of chocolates: all the items are in one spot, neatly arranged, and you can pick at your leisure.  
Meanwhile, a linked structure is more like a quest/scavenger hunt: you need to go through each portion in order to progress through it.  
There are tradeoffs between the two.  

Contiguous structures tend to have the following benefits:
- Fast random access 
- Good speed on sequential calls due to the proximity of memory bits
- Often space-efficient (no memory is needed to point to other portions of the structure). 

On the other hand, linked structures have the following benefits:
- No intrinsic size limits; if you need more space, just point to an empty memory block. 
Contiguous structures grow by remaking the entire structure. 
- Simpler insertion and deletion. Again, it is largely pointer manipulation versus remaking the structure. 
- Pointer rearrangement is faster than data rearrangement on large records

I will point out that the abstraction is quite important.  
This abstraction lets us design our algorithms independently without worrying about whether the structure affects correctness.  
Assuming the algorithm is correct, changing the implemented structures should only change the efficiency of the algorithm's execution.  
That said, it is clearly better to have a good structure in mind when designing the algorithm in the first place (to save effort refining the algorithm design). 

## Arrays

Arrays are the fundamental contiguously-allocated data structure, where the elements are organized by an index.  
You have probably experienced these as numpy arrays, but even Python `lists` are implemented as arrays (not linked lists, as the name might suggest).  
You may also think of them abstractly as vectors.  

They're great because the index maps directly to the memory block that holds the data.  
So each element can be read effectively immediately ($O(1)$) when called by index.  
Plus, unlike linked lists, arrays will only ever hold data so it is very space efficient.  
They are also not particularly complicated, they're just big blocks of memory that hold data, so they can store pretty much any supported data type and can exist on almost any hardware.  

Normally, arrays come in two flavors: static and dynamic.  

**Static** arrays are arrays that occupy a memory block of fixed size.  

This fixed size is both a blessing and a curse.  
On older systems or small embedded devices, where memory is quite limited, it is good to have a fixed size to budget out your memory more accurately.  
Even on large/modern systems, it may be good to have static arrays either because: 
1. You want to limit the amount of entries, or 
1. You don't want to spend time managing the array size.  

The main problem then is that you would need to have a good idea of how much memory you actually want to allocate to the array in the first place.  
Too small, and you can't store all the data you want (overflow).  
Too large, and the array has some empty memory slots which is space-inefficient.  

**Dynamic** arrays can change their size when they get full. 

This solves the problem of potential overflow, at the cost of requiring some time managing array size every now and then.  
In the book, they go through the basic steps of resizing, which I will translate below:
1. Start as a contiguous array of size $m$. 
1. Allocate a new contiguous array of size $2m$. 
1. Copy the old array contents to the lower half of the new space. 
1. Return the old array space to the allocator. 

where step 3 takes $O(m)$ time for any given resizing.  

To figure out the total amount of time managing an array of size $n$, we need to account for all the copyings across multiple resizings.  
Suppose we start with an array size of $1$ and go up to an array size of $n$ which (for simplicity) is exactly a power of $2$.  
We reach that array size within $\log_2(n)$ doublings.  
Over the course of those doublings, half of the elements get copied once, a quarter get copied twice, and so on.  
**Sanity check:** The proportions [need to add up to 1](https://en.wikipedia.org/wiki/1/2_%2B_1/4_%2B_1/8_%2B_1/16_%2B_%E2%8B%AF).  
So the total number of copying operations $M$ is the sum
\begin{align}
M &= \sum_{i=1}^{\log_2(n)}i\left( \frac{n}{2^i}\right) \\
&\leq \sum_{i=1}^{\infty}i\left( \frac{n}{2^i}\right) \\
&= 2n
\end{align}
If you need help with evaluating the sum, see the appendix.  
The point is that the total effort spent managing (doubling) this array is $O(n)$.  
You'll notice that this is the same cost as just making a static array of the correct size.  
The difference is that the static allocation pays its $O(n)$ up front.  
Dynamic allocation is fast ($O(1)$) for most insertions and slows down greatly at the doublings, such that *[as a whole](https://en.wikipedia.org/wiki/Amortized_analysis)* the total is still $O(n)$.  
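As a rough check on the $2n$ bound, here is a minimal sketch that simulates the doubling scheme and counts copies directly instead of evaluating the sum. The name `count_copies` is made up for this illustration, and the array is assumed to start at capacity 1 as in the derivation.

```python
def count_copies(n):
    """Count element copies made while appending n items to a doubling array."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:   # array is full: allocate 2x and copy everything over
            copies += size
            capacity *= 2
        size += 1              # the append itself is O(1)
    return copies

for n in (8, 1024, 10**6):
    copies = count_copies(n)
    print(n, copies, copies <= 2 * n)
```

The count stays below $2n$ for every $n$ tried, which is consistent with the $O(n)$ total.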

Note that when managing `lists` in Python, the reallocation is only [proportional to the original size](https://stackoverflow.com/questions/75074524/why-does-python-list-resize-overallocation-differ-from-other-languages) rather than a full doubling.  
I have not found exactly why, but we can infer *a* reason from the source comments: for moderate to large arrays, doubling the memory is usually overkill.  
You have probably experienced something similar before when budgeting.  
You have an idea of how much you need, budget out the amount, and the actual amount needed is only slightly over/under.  
It seems Python is trying to capture this and reallocates proportionately less new space the larger the original array is.  
This is more space efficient, but potentially increases the number of times you would need to resize the array.  
Apparently the total is still $O(n)$ in the amortized analysis, because the array still grows geometrically. 
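You can watch the over-allocation happen with `sys.getsizeof`. This is a sketch that assumes CPython; the helper name `resize_points` is made up here, and the exact byte counts differ between versions, so no expected output is shown.

```python
import sys

def resize_points(n):
    """Record (length, bytes) each time appending makes the list's allocation grow."""
    lst = []
    last = sys.getsizeof(lst)
    points = []
    for i in range(n):
        lst.append(i)
        size = sys.getsizeof(lst)
        if size != last:       # the underlying array was reallocated
            points.append((len(lst), size))
            last = size
    return points

for length, size in resize_points(100):
    print(length, size)
```

The gaps between reallocations should widen as the list grows, but much more slowly than they would under doubling.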

## Linked Lists

All linked structures have the following properties:
- Has a pointer to the head of the structure in order to access it 
- Contains a set of nodes, each containing one or more data fields
- Each node contains a pointer to at least one other node (so it “wastes” some space).

Similar to arrays, **linked lists** are the simplest linked data structure, where the subsequent elements are connected by pointers rather than proximity/indices.  
Unlike arrays, which are fundamentally quite simple, there are differences between linked lists in terms of the connectedness of the elements.  

In **singly linked lists**, nodes contain the data and a pointer to at most 1 other *successor* node denoting the next element of the list.  
The ending node normally has a `NULL` pointer to terminate it, which points to an invalid address.  

**Doubly linked lists** are similar to singly linked lists, but nodes can have 2 pointers: one for the successor, and one for its predecessor.  
This can speed up some operations by allowing you to traverse the list more easily, at the cost of space for the extra pointer.  

**Circular linked lists** are variations where the last node points to the head node instead of `NULL`. 

Linked lists support 3 basic operations: searching, insertion, and deletion.  
Python does use linked lists internally for things like the `deque` objects, but (as far as I know) it is not intrinsically available due to the lack of explicit pointers.  
There is [`llist`](https://ajakubek.github.io/python-llist/index.html) which implements singly and doubly-linked lists, but installing the library may require you to install/update your C++.  
If you want to use it, **READ THE DOCUMENTATION FOR IT**, it has code examples.  

Below, I have mocked up a singly linked list using classes with the basic operations in it, as well as an `append` operation.  
I have also left some extra operations unfilled as practice. 

In [10]:
# Singly Linked List class
class Node():
    
    def __init__(self, value):
        self.value = value
        self.successor = None
        self.predecessor = None # not used here, but needed for double-linking
        
class LinkedList(): 
    
    def __init__(self):
        self.head = None

    def __repr__(self):
        out = "["
        if self.head is None:
            pass
        else:
            current = self.head
            out += "{}".format(current.value)
            while current.successor is not None:
                current = current.successor
                out += ", {}".format(current.value)
        out += "]"
        return out
        
    def __len__(self): 
        current = self.head
        index = 0
        while current is not None:
            index += 1
            current = current.successor
        return index
    
    def insert(self, value): # Insert at the head of the list
        headNode = Node(value)
        headNode.successor = self.head
        self.head = headNode
    
    def delete(self, value): # Delete the first instance of a value in the list
        current = self.head
        if current is not None:
            if current.value == value: # value already in head
                self.head = current.successor
            else: 
                while current.successor is not None: 
                    if current.successor.value == value:
                        current.successor = current.successor.successor
                        break
                    current = current.successor
        # Python has no explicit free: once nothing references the removed node, the garbage collector reclaims it
        
    def search(self, value): # Find the index where a value occurs, raises error if not in list
        current = self.head
        index = 0
        if current is None:
            raise ValueError("List is Empty")
        while current is not None:
            if current.value == value:
                return index
            index += 1
            current = current.successor
        raise ValueError("Value not found")
    
    def append(self, value): # Insertion at the end of the list
        if self.head is None:
            self.head = Node(value)
        else:
            current = self.head
            while current.successor is not None:
                current = current.successor
            current.successor = Node(value)  

    def pop(self): # Removes from the end of the list
        if self.head is None:
            pass
        elif self.head.successor is None: # only one node in the list
            self.head = None
        else:
            current = self.head
            while current.successor.successor is not None:
                current = current.successor
            current.successor = None
    
    def popLeft(self): # Removes from the start of the list
        if self.head is None:
            pass
        else:
            current = self.head
            self.head = current.successor

    def get(self, index): # Return value at specified index
        if self.head is None:
            pass
        else:
            current = self.head
            while index > 0:
                current = current.successor
                index -= 1
            return current.value
    
    def inject(self, value, index): # Insert value at specified index
        newNode = Node(value)
        if index == 0: # new head keeps the old head as its successor (also works on an empty list)
            newNode.successor = self.head
            self.head = newNode
        elif self.head is not None:
            current = self.head
            while index > 1: # stop at the node just before the target index
                current = current.successor
                index -= 1
            newNode.successor = current.successor
            current.successor = newNode
    
    def replace(self, value, index): # Replace value at specified index
        if self.head is None:
            pass
        else:
            current = self.head
            while index > 0:
                current = current.successor
                index -= 1
            current.value = value


In [11]:
# Test space for Linked Lists

linked = LinkedList()

for i in range(10):
    linked.append(i)

linked.inject(6, 0) # insert 6 at the head; inject returns None, so there is nothing to print

print(linked)

[6, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


# Containers and Dictionaries

*Containers* are ADTs which hold and return data independent of content.  
These include structures like stacks and queues, where the data is stored and retrieved in order. 

By contrast, *dictionaries* are ADTs which retrieve by content.  
More specifically, dictionaries store (key, value) pairs where keys are unique and used to access the corresponding value.  
This is similar to indices in arrays, which is why dictionaries as an ADT are also called "associative arrays". 

## Stacks and Queues

Stacks and queues are both ADTs which store and retrieve data, with the difference being the order of retrieval. 

Stacks follow first-in-last-out (FILO) order, and the associated operations are called `PUSH` and `POP` which add and remove elements respectively. 
- PUSH(x, s) – Insert item **x** at the top of stack **s**.
- POP(s) – Return (and remove) the top item of stack **s**.

Some stack implementations also have a `PEEK` operation that can look at the top element without removing it. 

Queues follow first-in-first-out (FIFO) order, and the associated operations are called `ENQUEUE` and `DEQUEUE`:
- ENQUEUE(x, q) – Insert item **x** at the back of queue **q**.
- DEQUEUE(q) – Return (and remove) the front item from queue **q**.

As an analogy, the stack is like a literal stack of cafeteria trays while the queue is like the literal queue at the cafeteria serving line.  
For the former, the trays are stacked from the bottom up, but you get/return trays from the top out of convenience.  
For the latter, the first person served is the first person to queue-in out of fairness.  
Consider this when deciding which one to use.  

To implement these, one could use either an array or a linked list.  
Below I give a demo of stacks using Python `lists` (which are actually arrays) and the translated `PUSH` and `POP` operations.  
For queues, I used the [`deque`](https://docs.python.org/3/library/collections.html#collections.deque) object (internally a [doubly linked list](https://github.com/python/cpython/blob/main/Modules/_collectionsmodule.c)) for variety, but you could use the linked list class above to do it. 

Actually the `deque` object is a related ADT, the double-ended queue, which generalizes stacks and queues.  
This particular Python implementation uses a doubly linked list, but instead of storing one datum per node, it actually stores the data in linked blocks.  
This scheme reduces the relative space taken up by pointers.  
Much like with arrays, this also means most additions/removals to the list (on either end) take $O(1)$ time since the memory has already been allocated.  
The downside is that `deques` still take up more memory per element.  
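To see how the double-ended queue generalizes both ADTs, here is a short sketch using only the standard `collections.deque` methods. The right end acts as the top of a stack, and the left end acts as the front of a queue.

```python
from collections import deque

d = deque([1, 2, 3])

d.append(4)          # PUSH / ENQUEUE on the right end
d.appendleft(0)      # the left end accepts insertions too

top = d.pop()        # POP: remove from the right end (stack behavior)
front = d.popleft()  # DEQUEUE: remove from the left end (queue behavior)

print(top, front, d)
```

Using only `append` and `pop` gives a stack, while using only `append` and `popleft` gives a queue.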

In [4]:
# Stack demo
stack = [x for x in range(5)]
spacer = 12

print('STACK(s):'.ljust(spacer), stack)
print('SIZE(s) =', sys.getsizeof(stack), 'bytes\n')

stack.append(6)
print('PUSH(6, s):'.ljust(spacer), stack)
stack.pop()
print('POP(s):'.ljust(spacer), stack)
stack.append(7)
print('PUSH(7, s):'.ljust(spacer), stack)
stack.pop()
print('POP(s):'.ljust(spacer), stack)
stack.pop()
print('POP(s):'.ljust(spacer), stack)

STACK(s):    [0, 1, 2, 3, 4]
SIZE(s) = 120 bytes

PUSH(6, s):  [0, 1, 2, 3, 4, 6]
POP(s):      [0, 1, 2, 3, 4]
PUSH(7, s):  [0, 1, 2, 3, 4, 7]
POP(s):      [0, 1, 2, 3, 4]
POP(s):      [0, 1, 2, 3]


In [5]:
# Queue demo
queue = deque([x for x in range(5)])
spacer = 15

print('QUEUE(q):'.ljust(spacer), queue)
print('SIZE(q) =', sys.getsizeof(queue), 'bytes\n')

queue.append(6)
print('ENQUEUE(6, q): '.ljust(spacer), queue)
queue.popleft()
print('DEQUEUE(q): '.ljust(spacer), queue)
queue.popleft()
print('DEQUEUE(q): '.ljust(spacer), queue)
queue.append(0)
print('ENQUEUE(0, q): '.ljust(spacer), queue)
queue.popleft()
print('DEQUEUE(q): '.ljust(spacer), queue)


QUEUE(q):       deque([0, 1, 2, 3, 4])
SIZE(q) = 760 bytes

ENQUEUE(6, q):  deque([0, 1, 2, 3, 4, 6])
DEQUEUE(q):     deque([1, 2, 3, 4, 6])
DEQUEUE(q):     deque([2, 3, 4, 6])
ENQUEUE(0, q):  deque([2, 3, 4, 6, 0])
DEQUEUE(q):     deque([3, 4, 6, 0])


## Dictionaries

Dictionaries are ADTs that store (key, value) pairs, where the key is unique and takes the role of an index in retrieving the value (read: data entry).  
In a very real sense, dictionaries correspond to [maps or functions](https://en.wikipedia.org/wiki/Map_(mathematics)) in math, which associate elements in a domain (the keys) to the codomain (the values).  
So when you go through a dictionary, remember that **its fundamental elements are always pairs.**  

Dictionaries support the following operations:
- SEARCH(D, k) – Return a pointer to key-value pair **x** in dictionary **D** whose key is **k**, (nil if none exist).
- INSERT(D, x) – For pair **x**, add it to **D**.
- DELETE(D, x) – For pointer to **x** in **D**, remove it from **D**.
- MIN(D), MAX(D) – Returns the item of **D** which has the smallest/largest key.
- PREDECESSOR(D,k), SUCCESSOR(D,k) – For key **k** in **D**, return the next smallest (largest) key. 
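As a rough translation, the operations above map onto Python's `dict` as follows. This is a sketch under assumptions: `get` returns the value rather than a pointer to the pair, and `predecessor` and `successor` are helper names made up here, implemented by sorting the keys because a hash table keeps no key order of its own.

```python
from bisect import bisect_left, bisect_right

D = {}
D["pear"] = 3                 # INSERT(D, x)
D["apple"] = 5
D["banana"] = 2
D["mango"] = 7

found = D.get("apple")        # SEARCH(D, k): the value, or None if absent
missing = D.get("kiwi")
del D["pear"]                 # DELETE(D, x)

smallest, largest = min(D), max(D)   # MIN / MAX: a full O(n) scan over the keys

def predecessor(d, k):
    """Largest key strictly smaller than k, or None (hypothetical helper)."""
    keys = sorted(d)
    i = bisect_left(keys, k)
    return keys[i - 1] if i > 0 else None

def successor(d, k):
    """Smallest key strictly larger than k, or None (hypothetical helper)."""
    keys = sorted(d)
    i = bisect_right(keys, k)
    return keys[i] if i < len(keys) else None

print(smallest, largest, predecessor(D, "banana"), successor(D, "banana"))
```

The order-based operations are the expensive ones in this sketch, which is one motivation for the search tree approach.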

If this were C, you could implement dictionaries yourself with either arrays or linked lists, and the book goes through the performance in each case.  
In the wild, most implementations use either a *hash table* or a *search tree* structure to do it.  
We will go over both concepts in the next sections. 

For now, I will point out that Python already has dictionaries implemented as the `dict` object - they went with the hash table approach.  
In short, a hash table turns keys into numbers (they "hash" the key) in order to index a table of values. 

This may explain a common source of frustration when handling a `dict`: the keys cannot be changed (they must be immutable).  
`Dicts` hash the keys to produce the right index.  
If you could mutate a key into something else, hashing it would produce a different number and send the lookup to the wrong part of the table.  
This is also why you can't use something like a `list` as a key: it is mutable, so Python refuses to hash it at all. 
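A quick demonstration of the restriction: a tuple works as a key, while the equivalent list raises a `TypeError`. The wording of the error message varies between Python versions, so the sketch only prints whatever message it gets.

```python
d = {}
d[(1, 2)] = "tuple key works"      # tuples are immutable, so they can be hashed

message = None
try:
    d[[1, 2]] = "list key"         # lists are mutable, so hashing is refused
except TypeError as err:
    message = str(err)

print(d)
print(message)
```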

`Dicts` are also dynamic in the sense that the memory requirement for them grows as you INSERT pairs.  
The object will double in size once there are fewer than 1/3 of the slots for key-value pairs remaining.  
If you start to DELETE a lot of entries, one would expect that there should be an equivalent process to shrink the size.  
But according to [Tim Peters](https://mail.python.org/pipermail/python-list/2000-March/048085.html) the DELETE operation never actually triggers shrinkage (though he does give tips on how to do it). 

Sidenote: `dict` objects are enclosed by curly braces `{}`, the same as `set` objects (unordered collections of items).  
This is for historical reasons: originally, the source code for `sets` was [pretty much copy-pasted](https://github.com/python/cpython/blob/main/Objects/setobject.c) from the `dict` source code, but with dummy entries for the stored values.  
Nowadays, the codes have diverged; `dicts` are even ordered now.

## Binary Search Trees (BST)

A *tree* is a linked structure, similar to the linked list, except now each node can have more than one successor, now called its "children".  
Each node still has at most one predecessor, called its "parent".   

A *rooted tree* is a tree with a single "root" node, the only node that has no parent (an empty tree has no root at all).  

A *binary tree* is a tree where each node contains at most two children, called the left child and right child.  

With all the background out of the way, a **binary search tree** is a rooted binary tree whose nodes are sorted by key.  
For a node with key `x`, its children form subtrees.  
All keys in the left subtree are smaller than `x`, while all keys in the right subtree are larger than `x`.  

Such a structure supports dictionary operations, so in the homework I will ask you to build one yourself.  
They are in fact one of the efficient ways to implement dictionaries because they are naturally ordered:
- Finding the minimum (maximum) is equivalent to finding the leftmost (rightmost) element of the tree. 
- The fact that they are *binary* trees also means we can use binary search on them (see: notebook 01-02).  

Hopefully this second fact makes it visually apparent why binary search was so effective in the guessing game.  

Note that subtrees of BSTs are themselves BSTs by construction, so BSTs are recursive structures.  
Among other things, this means BSTs lend themselves well to recursive algorithms.  
As an example, consider the following problem:  

    Given: n nodes, how many BSTs can be formed? 
The answer to this is surprisingly straightforward.  

**Solution:**  
Denote the number of BSTs for $n$ nodes as $C_n$.  
Sort the keys and pick the $i^{th}$ one to be the root node; then there will be $i-1$ nodes to the left and $n-i$ nodes to the right.  
These 2 sets of nodes also have to make BSTs independently, with $C_{i-1}$ possible on the left and $C_{n-i}$ on the right.  
So the total number of BSTs for that choice of root is their product, $C_{i-1}C_{n-i}$.  
The total for all $n$ nodes is then the sum over all possible root nodes: 
$$C_n = \sum_{i=1}^n C_{i-1}C_{n-i}$$
This is great, we now have $C_n$ in terms of all previous values, a *recursion relation*, so we just need one value to start us off.  
If $n=0$ there is only one possible tree (the empty one), so $C_0 = 1$.  
With this, we can bootstrap up to arbitrary $n$, either analytically or with a quick bit of code.  
Mercifully though, someone already did that and the answer is the [Catalan numbers](https://en.wikipedia.org/wiki/Catalan_number).
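The "quick bit of code" could look like the sketch below, which bootstraps the recursion from $C_0 = 1$ and compares against the closed form for the Catalan numbers, $\binom{2n}{n}/(n+1)$. The function name `count_bsts` is made up for this illustration.

```python
from math import comb

def count_bsts(n):
    """Number of BSTs on n distinct keys, via C_k = sum_i C_{i-1} * C_{k-i}."""
    C = [1] + [0] * n                  # C[0] = 1: the empty tree
    for k in range(1, n + 1):
        C[k] = sum(C[i - 1] * C[k - i] for i in range(1, k + 1))
    return C[n]

print([count_bsts(n) for n in range(7)])             # [1, 1, 2, 5, 14, 42, 132]
print([comb(2 * n, n) // (n + 1) for n in range(7)])  # closed form, same values
```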



# Hashing 

Hashing and hash tables are common ways to store things.  
The key idea behind them is that arrays are very fast at random access ($O(1)$) provided you have an index.  
So if we can just store the values in a hash table and turn the keys into indices, we can speed up operations.  
This conversion of key to index is accomplished with a *hash function*.  

## Hash Functions 

A hash function is a mathematical function which maps keys to indices of the hash table.  

For now, I will just assume keys are strings as the most common scenario.  
To turn these into integers, we could do something very simple like assign integers in order, e.g. ($a\rightarrow0$, $b\rightarrow1$, $c\rightarrow2$, ...) and then add them together.  
This would be a particularly poor hash because we want different keys to map to different indices, but this scheme sends "gods" and "dogs" to the same index.  
This behavior is called **collision** and is generally undesirable.  
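The collision is easy to reproduce. The sketch below assumes lowercase keys and the mapping $a\rightarrow0$, $b\rightarrow1$, and so on; `additive_hash` is a made-up name.

```python
def additive_hash(key):
    """Naive hash: add up the alphabet positions of the characters (a -> 0)."""
    return sum(ord(c) - ord("a") for c in key)

print(additive_hash("gods"), additive_hash("dogs"))   # 41 41
```

Any two anagrams collide under this scheme, since addition does not care about order.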

Our naive hash failed because we didn't account for the order of items, and the order of characters in a string is crucial to its information.  
For that matter, the order of characters matters when writing down numbers: 4321 $\neq$ 1234.  
Mathematically, an $N$-digit number, $H$,  can be written in terms of its digits (in base-10) $d_i$ as 
$$ H = \sum_{i=0}^{N-1} 10^{N-(i+1)} \times d_i$$
For the two numbers above
\begin{align}
4321 &= \sum_{i=0}^{4-1} 10^{4-(i+1)} \times d_i; \quad d_i = [4, 3, 2, 1]\\
1234 &= \sum_{i=0}^{4-1} 10^{4-(i+1)} \times d_i; \quad d_i = [1, 2, 3, 4]
\end{align}

This can be generalized to strings in order to generate a more useful hash.  
If a key is made of characters $s_i$ from an alphabet of size $\alpha$, then an $N$ character string can be similarly hashed as 
$$ H = \sum_{i=0}^{N-1} \alpha^{N-(i+1)} \times \text{char}(s_i)$$
where $\text{char}(s_i)$ is a function to assign a number to a given character $s_i$.  
If we only used the Latin alphabet to make our key, then $\alpha = 26$; in actual implementations it would be larger to at least include more Unicode characters.  
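Here is a minimal sketch of the positional hash, assuming lowercase keys, $\alpha = 26$, and $\text{char}$ given by alphabet position ($a\rightarrow0$). The name `positional_hash` is made up. The loop evaluates the sum one character at a time: multiplying the running total by $\alpha$ shifts everything one "digit" to the left.

```python
def positional_hash(key, alpha=26):
    """H = sum_i alpha^(N-(i+1)) * char(s_i), accumulated left to right."""
    h = 0
    for c in key:
        h = h * alpha + (ord(c) - ord("a"))   # shift left one place, then add the new digit
    return h

print(positional_hash("gods"), positional_hash("dogs"))   # no longer equal
print(positional_hash("abc"))                             # 0*26^2 + 1*26 + 2 = 28
```

One caveat of mapping $a\rightarrow0$: leading "a"s act like leading zeros, so "a" and "aa" still collide in this sketch.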

The immediate problem of the scheme is that the hashes can get very large very quickly.  
Remember, they have to actually index a table with some finite size $m$, and these hashes will rapidly exceed sensible values for $m$.  
Still, we went through all that effort to make a bunch of hash values that almost certainly don't collide, and we definitely like that property.  

It turns out we can keep using that hash function, as long as we can re-map the large values down into indices $[0, m-1]$ uniformly.  
The easiest way to do this is to just take $H \mod(m)$, and for good choices of $\alpha$ and $m$, this makes a [pseudorandom generator](https://en.wikipedia.org/wiki/Pseudorandom_number_generator) of indices.  
This pseudorandomness is exactly what we want (uniform distribution of indices), and as the book explains, it functions much like a roulette wheel.  
The ball of a roulette wheel may travel for a very long distance, but it ultimately ends up on exactly one of the slots in a more or less random fashion.  
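A sketch of the re-mapping, under the same assumptions as before (lowercase keys, $\alpha = 26$). The table size $m = 101$ and the name `table_index` are illustrative choices. Taking the remainder at every step gives the same result as computing the full $H$ first and reducing once, but the intermediate numbers never get large.

```python
def table_index(key, m, alpha=26):
    """Positional hash reduced modulo the table size m."""
    index = 0
    for c in key:
        index = (index * alpha + (ord(c) - ord("a"))) % m   # stay within [0, m-1]
    return index

m = 101
indices = {word: table_index(word, m) for word in ("gods", "dogs", "hash", "table")}
print(indices)
```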


Note that for hashing in the Python `dicts`, [randomness is mostly ignored](https://github.com/python/cpython/blob/main/Objects/dictobject.c).  
As they explain in the source code comments (lines 292+), most of the important hashes they use are regularly spaced by construction.  
This is because for common uses of a `dict`, the keys will themselves be regularly spaced and this will already guarantee it is collision-free.  
Additionally, this hashing is simple so it tends to be quite fast.  

## Chaining

Inevitably at least *some* of the hashes will collide.  
But if we did our job right, those collisions should be very rare - rare enough that we can just track all the times it happens.  
**Chaining** represents the hash table as a set of linked lists, where the $i^{th}$ list corresponds to all items hashed to $i$.  
Then the operations on the dictionary reduce to operations on the correct list.  
This is one of the simpler schemes, but it does sacrifice some memory to store the list pointers.  
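A minimal sketch of chaining is below. To keep it short, each bucket is a Python `list` of (key, value) pairs standing in for a linked list, and the built-in `hash` is used as the hash function. The class name `ChainedHashTable` and the default table size are assumptions made for this illustration.

```python
class ChainedHashTable:
    """Hash table with chaining; each bucket holds every pair hashed to that index."""

    def __init__(self, m=8):
        self.m = m
        self.buckets = [[] for _ in range(m)]

    def _index(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None                      # nil if no such key exists

    def delete(self, key):
        idx = self._index(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]

table = ChainedHashTable(m=2)            # tiny table to force collisions
for i in range(6):
    table.insert(i, i * i)
print(table.buckets)
print(table.search(3))
```

Every dictionary operation first picks the bucket, then does a plain list operation inside it, so the cost depends on how long the chains get.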

Note that Python's collision handling is in the family of [open addressing](https://en.wikipedia.org/wiki/Open_addressing), where collided hashes are instead redirected to the next open slot.  
The actual scheme is again in comments of the [source code](https://github.com/python/cpython/blob/main/Objects/dictobject.c), lines 320+.  
It gets somewhat messy so I will just summarize it as "you look around for the next available slot in a pseudorandom fashion that depends on the hash value."  

**Sidenote:** There was a recent advance in open-addressing (led by an undergraduate) on [how fast open-addressing](https://arxiv.org/pdf/2501.02305v1) can be.  
The paper author also has a very approachable [presentation](https://www.youtube.com/watch?v=ArQNyOU1hyE) about both open-addressing and the new method, if you don't want to read so much.  
The key to their technique is what they call *funnel hashing*.  
I will let them explain the full method, but make 2 comments.  
1. It should be intuitively clear why their method is faster - they're effectively making a binary tree (upside-down) for their probing scheme. 
1. Personally I would have called it "percolation" rather than funneling, because the probe only descends the layers when the top layers are "saturated." 

## Substring Matching: Rabin-Karp Algorithm

One of the most common problems when handling strings is *substring matching*.  
Given a text **T** of length $n$, and a pattern **P** of length $m$, does **T** contain **P** and if so, where?

    Ex: “man” is in “human” at indices 2-4

The brute force method to this would be: 
1. Divide **T** into all possible $m$-length substrings (windows); this will make $n-m+1$ of them. 
1. For each window, compare its leftmost character with the leftmost character of the pattern.
    1. If they don't match, move to the next window. 
    1. If they match, check the second character, and repeat until a mismatch occurs or you reach the end of the pattern.
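The brute force steps can be sketched as follows. The function name `brute_force_find` is made up, and returning `-1` for "not found" is a convention borrowed from Python's `str.find`.

```python
def brute_force_find(text, pattern):
    """Return the index of the first window of text matching pattern, or -1."""
    n, m = len(text), len(pattern)
    for j in range(n - m + 1):            # one iteration per window
        i = 0
        while i < m and text[j + i] == pattern[i]:
            i += 1                        # characters agree so far: keep checking
        if i == m:                        # reached the end of the pattern
            return j
    return -1

print(brute_force_find("human", "man"))   # 2
```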

Since the pattern can be matched in multiple positions, and more than once, it really seems like we need to do an exhaustive search ($O(nm)$).  
One way to speed this up is with hashing and the **Rabin-Karp algorithm**. 

The basic Rabin-Karp algorithm is as follows
1. Divide **T** into all possible $m$-length substrings (windows); this will make $n-m+1$ of them. 
1. Hash all the windows and **P**. 
1. Compare all window hashes to the hash of **P** 
    1. If hashes are not equal, the strings have to be different and you can move on. 
    1. If hashes are equal, the strings are almost certainly the same.  
    They could be collisions, so you would need to manually check every character ($O(m)$). 

Ideally, the hashes are random enough that the false positive rate of the last step is low.  
Normally the hashing process takes $O(m)$ time so it still looks like the algorithm is $O(nm)$ in total. 

The trick, if you can call it that, is that we actually don't need to recalculate the full hash every time.  
This is because adjacent windows share $m-1$ characters.  
Explicitly, the hash $H$ for a substring $S$ of length $m$, starting position $j$, characters $s_{i+j}$ is 

$$H(S, j) = \sum_{i=0}^{m-1} \alpha^{m-(i+1)} \times \text{char}(s_{i+j})$$

So the next window, $j\rightarrow j+1$ has hash
\begin{align}
H(S, j+1) &= \sum_{i=0}^{m-1} \alpha^{m-(i+1)} \times \text{char}(s_{i+j+1})\\
&= \alpha\left[H(S, j) - \alpha^{m-1}\text{char}(s_{j})\right] + \text{char}(s_{j+m})
\end{align}

As an example, suppose the text is $S=$"$9876$", window length $m=3$, alphabet size $\alpha=10$.  
There are 2 windows: "$987$" and "$876$".  
The hashes are 
\begin{align}
H_1 &= 10^2\times9 + 10^1 \times 8 + 10^0 \times 7 = 987\\
H_2 &= 10^2\times8 + 10^1 \times 7 + 10^0 \times 6 = 876
\end{align}
i.e. the hashes are just the strings themselves.  
To turn $H_1$ into $H_2$, we need to remove "$9$" from the hundreds digit, shift "$87$" left, and put a "$6$" in the ones digit.   
Operationally these correspond to (in order) subtracting $900$, multiplying by $10$, and adding $6$ to $H_1$.  

$$H_2 = 10 (H_1 - (10^2 \times 9)) + (10^0 \times 6) $$

This expression of the next hash in terms of the current one is called a *rolling hash*, and it's the key to speeding up the operations.  
Once we know the hash somewhere, we know the hash anywhere else; it only takes some combinations of arithmetic which is $O(1)$.  
So assuming no collisions, we would expect $O(n+m)$ performance on average - far faster than the $O(nm)$ without the hash relations.  
That said, the worst case scenario is still $O(nm)$ if we’re unlucky and get nothing but collisions.  
This is a common theme with (random) hashes: good average performance, bad worst case performance.  
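
Putting the pieces together, a minimal sketch of the full search might look like the following. Several details here are assumptions that do not come from the text above: the function name `rabin_karp` is hypothetical, `ord` stands in for $\text{char}$, $\alpha=256$, and every hash is reduced modulo a large prime `mod` to keep the numbers bounded.

```python
def rabin_karp(text, pattern, alpha=256, mod=1_000_000_007):
    """Return the start indices of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(alpha, m - 1, mod)      # alpha^(m-1): weight of the leading character
    hp = hw = 0
    for i in range(m):                 # O(m): hash P and the first window
        hp = (alpha * hp + ord(pattern[i])) % mod
        hw = (alpha * hw + ord(text[i])) % mod
    matches = []
    for j in range(n - m + 1):
        # Equal hashes may still be a collision, so verify character by character.
        if hw == hp and text[j:j + m] == pattern:
            matches.append(j)
        if j < n - m:                  # roll the hash to the next window
            hw = (alpha * (hw - high * ord(text[j])) + ord(text[j + m])) % mod
    return matches

print(rabin_karp("ANPANMAN", "PAN"))   # [2]
print(rabin_karp("ANPANMAN", "AN"))    # [0, 3, 6]
```

The explicit slice comparison is the $O(m)$ check from the last step of the algorithm; it only runs when the two hashes agree.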

## Substring-Matching: Boyer-Moore-Horspool Algorithm

For completeness, I will go through the approach underlying Python's own substring matching.  
Python can do substring matching alone with the `.find()` method, which is implemented internally with a variant of the **Boyer-Moore-Horspool algorithm**.  

Recall that in brute force matching, we do an exhaustive search.  
For every window, compare characters starting from the leftmost one: keep checking the next character while they match, and move on to the next window as soon as one does not.  
Suppose we were trying to match "PAN" in "ANPANMAN".  
Visually, it would look something like this

\begin{matrix}
A & N & P & A & N & M & A & N\\
 P & A & N & & & & & \\
 & P & A & N & & & & \\
 & & P & A & N & & & \\
 & & & P & A & N & & \\
 & & & & P & A & N & \\
 & & & & & P & A & N \\
\end{matrix}
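
For reference, the exhaustive search pictured above can be sketched as follows. The function name `brute_force_search` and the comparison counter are hypothetical additions for illustration.

```python
def brute_force_search(text, pattern):
    """Return (match indices, number of character comparisons made)."""
    n, m = len(text), len(pattern)
    matches, comparisons = [], 0
    for j in range(n - m + 1):             # every window start, left to right
        for k in range(m):                 # compare from the leftmost character
            comparisons += 1
            if text[j + k] != pattern[k]:  # mismatch: give up on this window
                break
        else:                              # no break: all m characters matched
            matches.append(j)
    return matches, comparisons

print(brute_force_search("ANPANMAN", "PAN"))   # ([2], 8)
```

All six windows are visited here, even though most of them fail on the very first character.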

This is, again, inefficient because a lot of characters are shared between windows.  
The problem is that we are matching on the leftmost characters.  
If we match on the rightmost characters instead, we gain more information.  
The heart of Boyer-Moore-Horspool is that we can skip some windows when mismatched.  
Visually, it looks like this: 
\begin{matrix}
A & N & P & A & N & M & A & N\\
 P & A & N & & & & & \\
 & & P & A & N & & & \\
 & & & P & A & N & & \\
\end{matrix}

In detail:
1. We start on the leftmost window, "ANP" 
    1. We see "P" is the last character, so it is not a match.  
    But it could be a match 2 windows down, where "P" would be the starting character.  
    1. We also know the window directly adjacent can't be a match either because the letter before the end should be "A".  
1. We move up 2 window positions, "PAN"
    1. Checking from right to left, we get a full match. 
1. We move up 1 window position, "ANM"
    1. We see "M" is the last character, so it actually can't be a match anywhere in the pattern, and we can safely skip the full length of the pattern. 
1. Using the skips, there is no more space to fit the pattern, so we are done. 

So the speedup of Boyer-Moore-Horspool is in figuring out how many windows can be skipped based only on the window's last character and the pattern (the *bad character rule*).  
This can be done purely by pre-processing the pattern and recording the number of safe skips for each character in a table.  
For "PAN", it would look something like this: 
| Ending Window Character | # of safe skips |
| --- | --- |
| N | $0$ | 
| A | $1$ |
| P | $2$ |
| (else) | $3$ |

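where for longer patterns, the highest number of safe skips is the pattern length $m$.  
An entry of $0$ means the window's last character already agrees with the pattern, so the rest of the window has to be compared before moving on by at least one position.  
This pre-processing is always $O(m)$, so the bottleneck of the algorithm is still in the matching phase.  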
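
A minimal sketch of this pre-processing step, in the standard Horspool formulation, is given below. The function name `horspool_shifts` is hypothetical. Note one difference from the table above: this formulation builds the table from the first $m-1$ pattern characters only, so a window ending in "N" falls under the default shift $m$ rather than $0$, which guarantees the window always advances.

```python
def horspool_shifts(pattern):
    """Map each character to how far the window may slide when it ends in that character."""
    m = len(pattern)
    shifts = {}
    for i, ch in enumerate(pattern[:-1]):   # the last pattern character is excluded
        shifts[ch] = m - 1 - i              # later occurrences overwrite earlier ones
    return shifts

table = horspool_shifts("PAN")
print(table)                  # {'P': 2, 'A': 1}
print(table.get("M", 3))      # 3: any other character allows a full-length shift
```

The loop runs once over the pattern, which is where the $O(m)$ pre-processing cost comes from.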

In the worst case, the number of skips is almost always $0$ (think about how to make this for pattern "PAN"), and the performance is $O(nm)$.  
In the "average" case, considering random strings and an arbitrary pattern, the number of skips is some constant set by the statistics and it is $O(n)$.  
But in *typical* use cases, which is what Python is aiming for, the number of skips is almost always maximal.  
This is because patterns to match are intrinsically rare in natural language, and the performance ends up being $O(n/m)$ most of the time.  
If you were doing substring matching for other reasons (e.g. cryptography), you may be better off using another method. 
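
To see the skipping in action, here is a minimal sketch of a textbook Horspool search. This is not CPython's actual implementation, and the function name `horspool_search` and the list of visited windows are hypothetical additions. It uses the standard shift after a window ending in "N" (a full pattern length), so it inspects a slightly different set of windows than the walk-through above.

```python
def horspool_search(text, pattern):
    """Return (match indices, window start positions that were inspected)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], []
    # Shift table built from the first m-1 pattern characters; default shift is m.
    shifts = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches, visited = [], []
    j = 0
    while j <= n - m:
        visited.append(j)
        k = m - 1
        while k >= 0 and text[j + k] == pattern[k]:   # compare right to left
            k -= 1
        if k < 0:                                     # every character matched
            matches.append(j)
        j += shifts.get(text[j + m - 1], m)           # slide by the window's last character
    return matches, visited

print(horspool_search("ANPANMAN", "PAN"))   # ([2], [0, 2, 5])
print(horspool_search("abcdefgh", "xyz"))   # ([], [0, 3])
```

In the second call no character of the text appears in the pattern, so every shift is maximal and only about $n/m$ windows are inspected.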

# Ending Note

At this stage, I should point out that the Python wiki graciously lists the [time complexities](https://wiki.python.org/moin/TimeComplexity) of all the operations on previously discussed data structures. 

Please look at them and make sure you understand why (in principle) they work that way. 

# Appendix: Geometric Sum

We want to show that
$$\sum_{i=1}^{\infty}\frac{i}{2^i} = 2$$
From your calculus class, remember that a convergent geometric series with ratio $|r|<1$ has partial sums $S_N$, and in the limit of $N\rightarrow \infty$  
$$S_\infty = \sum_{i=0}^{\infty} r^{i}=\frac{1}{1-r} $$
The sum we actually want to evaluate is nearly in that form, but there is an extra factor of $i$. 

To get around this, take the derivative of $S_\infty$ w.r.t. $r$ to pull down a factor of $i$, then multiply both sides by $r$ (the $i=0$ term vanishes, so the sum can start at $i=1$)

\begin{align}
\frac{d}{dr}\left(\sum_{i=0}^{\infty} r^{i}\right) &= \frac{d}{dr}\left(\frac{1}{1-r}\right)\\
\sum_{i=0}^{\infty} ir^{i-1}&=\frac{+1}{(1-r)^2}\\
\sum_{i=1}^{\infty} ir^{i}&=\frac{r}{(1-r)^2}\\
\end{align}

Then you just plug in $r=1/2$ 
$$\sum_{i=1}^{\infty}\frac{i}{2^i} = \frac{r}{(1-r)^2}\bigg\rvert_{r=1/2} = 2$$
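
The result is easy to check numerically. The sketch below (the helper name `partial_sum` is hypothetical) evaluates the partial sums for a few values of $N$ and shows them approaching $2$.

```python
def partial_sum(N):
    """Sum of i / 2**i for i = 1..N."""
    return sum(i / 2**i for i in range(1, N + 1))

for N in (5, 10, 20, 50):
    print(N, partial_sum(N))   # values creep up towards 2 as N grows
```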

# Appendix: Pointers
Data is normally stored as binary somewhere in memory, which is physically a bunch of magnetic switches.  
*Pointers* to those data types are their addresses in memory, not the data themselves. 

**In C:** if `a` is an integer (say $5$), then that information is stored as binary at some memory block with an address that I can refer to by `&a`.  
Here, `&` is the address-of operator, which returns a pointer to the variable; it is typically printed as a hexadecimal number.  
So printing `a` returns `5`, but printing `&a` returns something like `0x7ffe5367e044`.  
If you then go to memory block with address `0x7ffe5367e044` and read off the data inside of it, it would read as $5$ in binary.

**In Python:** Python does not use pointers explicitly, but it does use them internally (it is implemented in C after all).  
For example, a `list` is actually an array of pointers, and knowledge of pointers can explain a few oddities. 

Consider the code below:

In [6]:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = a

a.append(51)
b.append(909)
print(a)
print(b)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 51, 909]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 51, 909]


In the first line, I am using the assignment operator, `=`, to both allocate the memory for the `list` `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]` and give it the name `a`.  
More precisely, there is now the variable `a` in the *namespace* that points to the `list` object.  
To reiterate: **variables are pointers to objects in memory.**  

In the second line, `b = a`, I am generating a new name in the namespace and making it point at the same `list` as `a`.  
So any time I invoke either `a` or `b`, it refers to the same `list` object in memory, and any operations I do on them affect the same item.  
This is **pointer aliasing**.  
It is very much like saying a person has two names they go by (a person named "Jose Miguel" may be called either "Jose" or "Miguel") but they both refer to the same person.  
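
One way to confirm that two names refer to the same object is the `is` operator, or equivalently comparing the values returned by the built-in `id()`. A minimal sketch with shorter lists than the cell above:

```python
a = [0, 1, 2]
b = a            # a second name for the same list object
c = [0, 1, 2]    # a separate list that happens to have equal contents

print(a is b, id(a) == id(b))   # True True
print(a is c, a == c)           # False True
```

Here `==` compares contents while `is` compares identity, which is why `c` is equal to `a` without being an alias of it.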

Compare this with the `.copy()` method below: 

In [7]:
a = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
b = a.copy()

a.append(51)
b.append(909)
print(a)
print(b)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 51]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 909]


In this case, we make a copy of the `list` object so the names no longer point to the same object.  
Now, operations on lists `a` and `b` do not affect each other. 

There is one complication with this: a `list` is itself an array of pointers.  
So if you try this with a nested `list`:

In [8]:
a = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
b = a.copy()

a[0].append(51)
b[0].append(909)
print(a)
print(b)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 51, 909]]
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 51, 909]]


The aliasing happens again because the copy we did was only a *shallow copy*.  
A `list` is an array of pointers, so a nested `list` involves pointers to pointers, and the aliasing is now happening one layer deeper.  
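
The same identity check with `is` makes the one-layer-deeper aliasing visible. This is a minimal sketch using a shorter inner list:

```python
a = [[0, 1, 2]]
b = a.copy()         # shallow copy: new outer list, same inner list

print(a is b)        # False: the outer lists are distinct objects
print(a[0] is b[0])  # True: both outer lists point to the same inner list
```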

To avoid this, we can use *deepcopy*.  
Deepcopy will copy all of the structure of an object down to the actual memory blocks, and make an otherwise perfect clone that is independent of the original. 

In [9]:
from copy import deepcopy
a = [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
b = deepcopy(a)

a[0].append(55)
b[0].append(909)
print(a)
print(b)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 55]]
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 909]]


A further note: recall that `NumPy` also has `arrays`.  
These are more or less true arrays (not arrays of pointers) and operate directly on contiguous blocks of memory.  
I say more or less, because there is one exception: the `dtype=object` case.  
This dtype allows you to have mixed data types in a `NumPy array`, but now it *is* an array of pointers again.  
Life is full of trade-offs.  