# String Similarity II
Jaccard similarity is a fairly naive metric: it does not consider the order of words, and it does not cope well with spelling errors. However, it admits an O(n) time approximation algorithm. In this lecture, we will consider the opposite extreme: a much more comprehensive string similarity metric, but one that is much harder to compute. For the rest of this lecture, we will assume tokenization at the granularity of individual characters.

## Levenshtein (Edit) Distance
The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. Let's see a few examples that have an edit distance of 1:

In [1]:
strs = ('abc','ab')
#the edit distance is 1 (delete c)

strs = ('abc','abd')
#the edit distance is 1 (substitute c with d)

strs = ('abc','abec')
#the edit distance is 1 (insert e after b)

We can of course consider more complicated comparisons:

In [2]:
strs = ('bac','abd')
#the edit distance is 3 (insert a at the front, delete the second a, substitute c with d)

The edit distance $d(s,t)$ satisfies all of the properties one would expect from a distance measure:
$$d(s,t) = 0 \iff s = t$$
$$d(s,t) = d(t,s)$$
$$d(s,t) + d(t, u) \ge d(s,u)$$
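
As a quick sanity check, the three properties can be tested empirically on a handful of short strings. The sketch below is only illustrative: `lev` is a hypothetical helper introduced here (a plain recursive Levenshtein distance that is practical for short inputs only), and the word list is an arbitrary choice built from the examples above.

```python
from itertools import product

def lev(s, t):
    """Hypothetical helper: Levenshtein distance by plain recursion (short strings only)."""
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    # cost 1 to drop a character from either string,
    # cost 0 or 1 to pair up the two first characters
    return min(1 + lev(s[1:], t),
               1 + lev(s, t[1:]),
               (s[0] != t[0]) + lev(s[1:], t[1:]))

words = ['', 'ab', 'abc', 'abd', 'abec', 'bac']

for s, t, u in product(words, repeat=3):
    assert (lev(s, t) == 0) == (s == t)        # zero distance exactly when equal
    assert lev(s, t) == lev(t, s)              # symmetry
    assert lev(s, t) + lev(t, u) >= lev(s, u)  # triangle inequality

print("all three properties hold on", len(words) ** 3, "triples")
```

Passing this check on a few strings is not a proof, but it is a cheap way to catch a broken implementation.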

## A Simple Recursive Algorithm
The Levenshtein distance is easy to work out by hand in the cases above, but what if you were given very long, arbitrary strings? We need a computational method. Here is a simple recursive algorithm that computes the distance.

In [3]:
def edit_distance(s,t):

    #base cases if either string is empty, or they match
    if s == "":
        return len(t)
    
    if t == "":
        return len(s)
    
    if s == t:
        return 0
    
    #recursive step!!
    return min([ 1 + edit_distance(s[1:], t) , #1 step insertion
                 1 + edit_distance(s, t[1:]) , #1 step deletion
                 (s[0] != t[0]) + edit_distance(s[1:], t[1:]) ]) #1 step substitution
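
Written as a recurrence, the function above computes the following. The notation is introduced here for illustration and is not part of the original notes: $\varepsilon$ is the empty string, $s_1$ is the first character of $s$, and $s'$ is $s$ without its first character (that is, `s[1:]`).

$$
d(s,t) =
\begin{cases}
|t| & \text{if } s = \varepsilon \\
|s| & \text{if } t = \varepsilon \\
0 & \text{if } s = t \\
\min\{\, 1 + d(s', t),\ 1 + d(s, t'),\ [s_1 \ne t_1] + d(s', t') \,\} & \text{otherwise}
\end{cases}
$$

Here $[s_1 \ne t_1]$ is 1 when the first characters differ and 0 when they match, mirroring the `(s[0] != t[0])` term in the code.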

In [4]:
edit_distance('kitten','sitting')

3

In [5]:
edit_distance('kittens on','sitting')

6

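This code is slow even for moderately sized strings: every call that does not hit a base case makes three further recursive calls, so the running time grows exponentially with the length of the strings. How can we make this faster?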

## A Faster Algorithm 
Let's get some intuition for why this algorithm is slow. We can add a print statement to track which pairs of substrings are compared by the function:

In [22]:
def edit_distance_print(s,t):
    
    print("Comparing", s, t)

    #base cases if either string is empty, or they match
    if s == "":
        return len(t)
    
    if t == "":
        return len(s)
    
    if s == t:
        return 0
    
    #recursive step!!
    return min([ 1 + edit_distance_print(s[1:], t) , #1 step insertion
                 1 + edit_distance_print(s, t[1:]) , #1 step deletion
                 (s[0] != t[0]) + edit_distance_print(s[1:], t[1:]) ]) #1 step substitution

In [23]:
edit_distance_print('abc','abd') #look at c bd

Comparing abc abd
Comparing bc abd
Comparing c abd
Comparing  abd
Comparing c bd
Comparing  bd
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing  d
Comparing  bd
Comparing bc bd
Comparing c bd
Comparing  bd
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing  d
Comparing bc d
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing bc 
Comparing c 
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing c bd
Comparing  bd
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing  d
Comparing abc bd
Comparing bc bd
Comparing c bd
Comparing  bd
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing  d
Comparing bc d
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing bc 
Comparing c 
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing abc d
Comparing bc d
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing bc 
Comparing c 
Comparing abc 
Comparing bc 
Comparing bc d
Comparing c d
Comparing  d
Comparing c 
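Comparing  
Comparing bc 
Comparing c 
Comparing bc bd
Comparing c bd
Comparing  bd
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing  d
Comparing bc d
Comparing c d
Comparing  d
Comparing c 
Comparing  
Comparing bc 
Comparing c 
Comparing c d
Comparing  d
Comparing c 
Comparing  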

1

Check out how many times the pair ("c", "bd") shows up. There is a lot of duplicated work in this algorithm: the recursive function recalculates results that it has already computed!
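
To put a number on the duplicated work, one option is to tally how often each pair is visited instead of printing it. The sketch below assumes the same base cases and recursive calls as the function above; `count_subproblems` and `visits` are names introduced here for illustration.

```python
from collections import Counter

def count_subproblems(s, t, visits=None):
    """Tally how often the naive recursion visits each (s, t) pair."""
    if visits is None:
        visits = Counter()
    visits[(s, t)] += 1

    # same base cases as edit_distance: nothing further to explore
    if s == "" or t == "" or s == t:
        return visits

    # same three recursive calls as edit_distance
    count_subproblems(s[1:], t, visits)
    count_subproblems(s, t[1:], visits)
    count_subproblems(s[1:], t[1:], visits)
    return visits

visits = count_subproblems('abc', 'abd')
print("total calls:   ", sum(visits.values()))
print("distinct pairs:", len(visits))
print("visits to ('c', 'bd'):", visits[('c', 'bd')])
```

The gap between the total number of calls and the number of distinct pairs is exactly the repeated work that a cache can eliminate.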

In [26]:
result_cache = {}

def edit_distance_mem(s,t):
    
    #reuse the cached result if this pair has already been solved
    if (s,t) in result_cache:
        return result_cache[(s,t)]
    
    #base cases if either string is empty, or they match
    if s == "":
        rtn = len(t)
    
    elif t == "":
        rtn = len(s)
    
    elif s == t:
        rtn = 0
    
    else:
        #recursive step!!
        rtn = min([ 1 + edit_distance_mem(s[1:], t) , #1 step insertion
                    1 + edit_distance_mem(s, t[1:]) , #1 step deletion
                    (s[0] != t[0]) + edit_distance_mem(s[1:], t[1:]) ]) #1 step substitution
    
    #cache the result
    result_cache[(s,t)] = rtn
    
    return rtn

In [27]:
edit_distance_mem('abc','abd') #look at c bd

1
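
The same caching idea can also be written with the standard library's `functools.lru_cache` decorator, which stores results keyed by the function's arguments. This is a minimal sketch rather than the notes' own implementation; the name `edit_distance_cached` is introduced here for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # remember every (s, t) pair that has been solved
def edit_distance_cached(s, t):
    # base cases if either string is empty, or they match
    if s == "":
        return len(t)
    if t == "":
        return len(s)
    if s == t:
        return 0

    # recursive step: the same three options as the uncached version
    return min(1 + edit_distance_cached(s[1:], t),
               1 + edit_distance_cached(s, t[1:]),
               (s[0] != t[0]) + edit_distance_cached(s[1:], t[1:]))

print(edit_distance_cached('kitten', 'sitting'))  # 3
print(edit_distance_cached.cache_info())          # hits vs. misses show the reuse
```

For inputs of lengths n and m there are at most (n+1)(m+1) distinct suffix pairs, so the cached recursion solves each of them at most once instead of revisiting them exponentially often.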