- Hamming distance
- Levenshtein distance (coming soon)
The Hamming distance between two strings of the same length is the minimum number of substitutions required to transform one into the other.
Let u and v be strings of length n, then the Hamming distance d(u,v) = card({ i | ui ≠ vi })
The Levenshtein distance between two strings (not necessarily the same length) is the minimum number of additions, substitutions and deletions required to transform one into another.
The most straightforward way to derive the formula for Levenshtein distance is using recursion.
Let u=u1u2...um, v=v1v2...vn, we wish to find the formula for the Levenshtein distance d(u,v)
- If min(m,n) = 0, then max(len(m), len(n)). (This is obvious, since if one string is empty, then the distance is just the number of additions to get from the empty string to the non-empty string)
- If the last two characters are the same, that is: um = vn, then d(u,v)=d(u1u2...um-1, v1v2...vn-1)
- Otherwise, there are three options left to check, compute them all and pick the minimum:
- Delete um and recurse
- Delete vn and recurse or
- Delete both and recurse
Why does this last case work?
By the last case, we know that both strings are non-empty, and that their last characters differ. This means that we can delete the last character of one or the other or both, counting the deletion as one, and recurse until either of the first two cases are met.
To see this in action, suppose we are testing "solve" and "resolve", we know that we can append two characters to the former to get the latter. Using our algorithm, we know that they are both non-empty, and we will hit case 2 exactly five times, at which point we test "" and "re", which case 1 returns a distance of 2.
From Ch3 (MMDS):
The edit distance d(x,y) can be calculated as the length of x plus the length of y minus twice the length of their LCS.