# Levenshtein distance as a recursive function

by Koenraad De Smedt at UiB

---
Read ➜ Jurafsky & Martin. *Speech and Language Processing*, 3rd ed. [Ch. 2: Regular Expressions, Text Normalization, Edit Distance](https://web.stanford.edu/~jurafsky/slp3/2.pdf)

For spelling correction and other applications, it may be useful to compute how similar or different two strings are. One string distance measure is *edit distance*, in particular *Levenshtein* distance. The idea is to compute the cost of editing one string into another by deleting (*d*), inserting (*i*) or substituting (*s*) letters. The minimum edit distance, i.e. the lowest cost of editing one string into the other, determines string similarity. Here is an example from the book.

>```
I N T E * N T I O N
| | | | | | | | | |
* E X E C U T I O N
d s s   i s
```

Many possible combinations of editing steps can be tried out, but the minimum edit distance for these strings turns out to be 5 when every edit has a cost of 1. Here is another sequence of five edits.

>```
I N T E N * T I O N
| | | | | | | | | |
E * X E C U T I O N
s d s   s i
```
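
To make the cost of such an alignment concrete, here is a minimal sketch that labels each column of an alignment and counts the edits. It assumes the notation of the examples above, with `*` marking a gap and a cost of 1 per edit. The function name `alignment_ops` is made up for this illustration and does not come from the book.

```python
def alignment_ops(source, target, gap="*"):
    """Label each column of an alignment: d, i, s, or a blank for a match."""
    ops = []
    for a, b in zip(source, target):
        if a == b:
            ops.append(" ")   # same letter: no edit
        elif b == gap:
            ops.append("d")   # letter only in the source: deletion
        elif a == gap:
            ops.append("i")   # letter only in the target: insertion
        else:
            ops.append("s")   # two different letters: substitution
    return ops

ops = alignment_ops("INTE*NTION", "*EXECUTION")
print(" ".join(ops).rstrip())          # prints: d s s   i s
print(sum(op != " " for op in ops))    # prints: 5
```

Note that this only scores an alignment that is already given. Finding the alignment with the lowest cost is the actual problem that the rest of this notebook addresses.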

The book describes a fairly complex algorithm for computing the Levenshtein distance by creating a matrix. The present notebook takes a simpler, more elegant approach. It shows how to:

1.  Write a recursive function for the Levenshtein distance, inspired by [an example on Rosettacode](http://rosettacode.org/wiki/Levenshtein_distance#Memoized_recursion).

2.  Optimize execution by remembering earlier partial results.

---

If one of the strings is empty, return the length of the other string. If the first character of each string is the same, compute the distance of the remainders of the strings. Otherwise, try recursion with insertion, substitution or deletion and return the minimal cost of those three with the `min` function.

In [None]:
def ld(s, t):
  if not s: return len(t)
  if not t: return len(s)
  if s[0] == t[0]: return ld(s[1:], t[1:])
  lins = ld(s, t[1:]) +1
  lsub = ld(s[1:], t[1:]) +1
  ldel = ld(s[1:], t) +1
  return min(lins, lsub, ldel)
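
The same definition can also be written as a recurrence. This is only a restatement of the function above, and the notation is an assumption made for this note: $\varepsilon$ is the empty string, $|s|$ is the length of $s$, $s_1$ is its first character, and $s'$ is $s$ without its first character.

$$
\mathrm{ld}(s, t) =
\begin{cases}
|t| & \text{if } s = \varepsilon \\
|s| & \text{if } t = \varepsilon \\
\mathrm{ld}(s', t') & \text{if } s_1 = t_1 \\
1 + \min\bigl(\mathrm{ld}(s, t'),\ \mathrm{ld}(s', t'),\ \mathrm{ld}(s', t)\bigr) & \text{otherwise}
\end{cases}
$$

The three arguments of $\min$ correspond to `lins`, `lsub` and `ldel`, in that order.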

Let's test.

In [None]:
ld("graf", "giraffe")

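This program is inefficient because, in its search for the minimum, it recomputes the distance for the same pairs of substrings many times. In the worst case the running time grows exponentially with the length of the strings, which becomes apparent when comparing long strings. The recursion depth also grows with the combined length of the strings, so very long strings can exceed Python's recursion limit and raise a `RecursionError`.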
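
To get a feeling for how much work is repeated, the following sketch counts the calls made by the plain recursive approach. The helper names `count_calls` and `ld_counted` are made up for this illustration. The logic is the same as in `ld` above, with only a counter added.

```python
def count_calls(s, t):
    """Return (distance, number of calls) for the plain recursive approach."""
    calls = 0

    def ld_counted(s, t):
        nonlocal calls
        calls += 1                      # count every call, including repeated ones
        if not s: return len(t)
        if not t: return len(s)
        if s[0] == t[0]: return ld_counted(s[1:], t[1:])
        return 1 + min(ld_counted(s, t[1:]),       # insertion
                       ld_counted(s[1:], t[1:]),   # substitution
                       ld_counted(s[1:], t))       # deletion

    return ld_counted(s, t), calls

for pair in [("graf", "giraffe"), ("intention", "execution")]:
    distance, calls = count_calls(*pair)
    print(pair, "distance:", distance, "calls:", calls)
```

Every call receives a suffix of each original string, so there are at most `(len(s) + 1) * (len(t) + 1)` different argument pairs, which is 100 for the second pair. Any calls beyond that number are repeated work.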

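The `@lru_cache` decorator from the `functools` module speeds up execution by remembering earlier results (called “caching” or “memoizing”), so that a repeated call with the same arguments returns the stored result instead of recomputing it.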

In [None]:
from functools import lru_cache

@lru_cache(maxsize=4095)
def ld(s, t):
  # check if s or t is empty
  if not s: return len(t)
  if not t: return len(s)
  # check if first characters are the same: recursion at no cost
  if s[0] == t[0]: return ld(s[1:], t[1:])
  # try recursion with insertion, substitution and deletion, adding a cost
  lins = ld(s, t[1:]) +1
  lsub = ld(s[1:], t[1:]) +1
  ldel = ld(s[1:], t) +1
  # return the recursive result with the lowest cost
  return min(lins, lsub, ldel)

More testing.

In [None]:
ld("intention", "execution")
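
To check that the cache is actually being used, one can inspect it with `cache_info()`, which `lru_cache` adds to the decorated function. The sketch below repeats the memoized function from the cell above only so that it runs on its own.

```python
from functools import lru_cache

@lru_cache(maxsize=4095)
def ld(s, t):
    if not s: return len(t)
    if not t: return len(s)
    if s[0] == t[0]: return ld(s[1:], t[1:])
    lins = ld(s, t[1:]) + 1
    lsub = ld(s[1:], t[1:]) + 1
    ldel = ld(s[1:], t) + 1
    return min(lins, lsub, ldel)

ld.cache_clear()                      # start from an empty cache
print(ld("intention", "execution"))   # prints: 5
print(ld.cache_info())                # shows hits, misses, maxsize and currsize
```

Each miss is a call that had to be computed, and each hit is a call that was answered from the cache. Since the arguments are always suffixes of the two original strings, the number of misses cannot exceed `(len(s) + 1) * (len(t) + 1)`, which is 100 here.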

### Exercises

- This program uses the same cost of 1 for all edits. One could argue that a substitution is a combination of a deletion and an insertion, so it should be more costly. Change the program so that a substitution has a cost of 2 and test.

## Note

There are other approaches to spelling errors that take into account the context of letters, keyboard layout, sound similarity, or [different possible error sources](https://aclanthology.org/A88-1011/).