# String Distance Examples

In [4]:
cd ../../

/home/abelm


In [5]:
import textsim
from textsim import *

In [6]:
s1 = "PCCW's chief operating officer, Mike Butcher, and Alex Arena, the chief financial officer, will report directly to Mr So."
s2 = "Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So."

### **Listing all available distances!**

In [7]:
print('total',len(textsim.stringdists.__all__))
textsim.stringdists.__all__

total 29


['lcs_similarity',
 'levenshtein_similarity_pattern',
 'edit_similarity_nltk',
 'jaro_winkler_distance',
 'sorensen_distance_textsim',
 'hamming_distance',
 'levenshtein_distance',
 'edit_distance_nltk',
 'dice_coefficient',
 'match_rating_comparison',
 'containment_distance',
 'binary_distance',
 'edit_distance',
 'needleman_wunsch_similarity',
 'levenshtein_distance_pattern',
 'smith_waterman_distance',
 'lcs_distance',
 'damerau_levenshtein_distance_textsim',
 'damerau_levenshtein_distance_jellyfish',
 'levenshtein_distance_jellyfish',
 'dice_coefficient_pattern',
 'needleman_wunsch_distance',
 'levenshtein_similarity_jellyfish',
 'lcs',
 'edit_similarity',
 'damerau_levenshtein_distance',
 'needleman_wunsch_distance_pystring',
 'levenshtein_similarity',
 'jaro_distance']

In [21]:
# Mutually exclusive distances (one name per metric)
print('total:',len(textsim.stringdists.__distances__))
for metric in sorted(textsim.stringdists.__distances__.keys()):
    print(metric)

total: 15
binary_distance
containment_distance
damerau_levenshtein_distance
dice_coefficient
edit_similarity
hamming_distance
jaro_distance
jaro_winkler_distance
lcs_distance
lcs_similarity
levenshtein_distance
match_rating_comparison
needleman_wunsch_distance
needleman_wunsch_similarity
smith_waterman_distance


### **Calling all distances in a flash!**

In [9]:
dictdist = {}
for metric, func in textsim.stringdists.PAIRED_DISTANCES.items():
    try:
        dictdist[metric] = float(func(s1, s2))
    except (TypeError, ValueError):
        # lcs returns a string, which cannot be converted to float
        pass

# Print metrics ordered by value; sorted() is stable, so ties keep dict order
for metric, value in sorted(dictdist.items(), key=lambda item: item[1]):
    print('%.3f: %s' % (value, metric))

0.000: binary_distance
0.587: levenshtein_similarity_pattern
0.587: edit_similarity_nltk
0.587: levenshtein_similarity_jellyfish
0.587: edit_similarity
0.587: levenshtein_similarity
0.707: lcs_similarity
0.721: containment_distance
0.740: jaro_winkler_distance
0.740: jaro_distance
0.747: sorensen_distance_textsim
0.747: dice_coefficient
0.747: dice_coefficient_pattern
0.873: needleman_wunsch_similarity
1.000: match_rating_comparison
50.000: levenshtein_distance
50.000: edit_distance_nltk
50.000: levenshtein_distance_pattern
50.000: edit_distance
50.000: damerau_levenshtein_distance_textsim
50.000: levenshtein_distance_jellyfish
50.000: damerau_levenshtein_distance
50.000: damerau_levenshtein_distance_jellyfish
81.000: lcs_distance
100.000: needleman_wunsch_distance
100.000: needleman_wunsch_distance_pystring
102.000: hamming_distance
115.000: smith_waterman_distance


### **Calling a specific distance!**

In [10]:
levenshtein_distance_jellyfish(s1,s2)

50
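The Levenshtein-family similarities reported earlier (0.587) are consistent with dividing this distance by the length of the longer string: s1 has 121 characters and 1 - 50/121 ≈ 0.587. The following is a minimal pure-Python sketch under that assumption. The names `levenshtein` and `levenshtein_similarity_sketch` are hypothetical and are not part of textsim, whose actual normalization has not been checked here.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_similarity_sketch(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / float(longest)
```

For example, `levenshtein("kitten", "sitting")` is 3, so the sketched similarity is 1 - 3/7.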

## Performance Comparison of String Similarity Distances

Some distances return the same value but come from different implementations. Which performs better?
The names of the distances inside textsim were changed after this performance test was first run.

    E.g. levenshtein_distance = levenshtein_distance_jellyfish
         damerau_levenshtein_distance = levenshtein_distance_pattern
         
damerau_levenshtein_distance_textsim is textsim's own implementation of damerau_levenshtein_distance. The code is included so that students can use it as an example of how to implement this kind of distance.
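As an illustration of what such an implementation can look like, here is a minimal sketch of the optimal-string-alignment variant of Damerau-Levenshtein (Levenshtein plus transposition of adjacent characters). It is not the textsim code; the function name `damerau_levenshtein_osa` is hypothetical, and textsim may implement a different variant.

```python
def damerau_levenshtein_osa(a, b):
    """Optimal-string-alignment distance: edits plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i leading characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j leading characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition, e.g. "ab" -> "ba"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

With this variant, `damerau_levenshtein_osa("ab", "ba")` is 1, whereas plain Levenshtein would count two substitutions.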

In [11]:
%timeit edit_similarity_nltk(s1,s2)
%timeit levenshtein_similarity_jellyfish(s1,s2)
%timeit levenshtein_similarity_pattern(s1,s2)
%timeit levenshtein_similarity(s1,s2)


100 loops, best of 3: 12.5 ms per loop
100 loops, best of 3: 7.01 ms per loop
100 loops, best of 3: 7.21 ms per loop
100 loops, best of 3: 7.06 ms per loop
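Outside IPython, where `%timeit` is not available, a similar comparison can be sketched with the standard `timeit` module. The helpers `best_time_ms` and `compare` below are hypothetical names introduced for illustration, and their figures will not match `%timeit` output exactly.

```python
import timeit


def best_time_ms(func, args, number=100, repeat=3):
    """Return the best average time per call, in milliseconds."""
    timer = timeit.Timer(lambda: func(*args))
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number * 1000.0


def compare(candidates, args, number=100, repeat=3):
    """Time each named callable on the same arguments, fastest first."""
    results = [(best_time_ms(func, args, number, repeat), name)
               for name, func in candidates.items()]
    return sorted(results)
```

With textsim imported, a call such as `compare({'jellyfish': levenshtein_similarity_jellyfish, 'nltk': edit_similarity_nltk}, (s1, s2))` would return `(milliseconds, name)` pairs ordered from fastest to slowest.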


In [12]:
%timeit damerau_levenshtein_distance_jellyfish(s1,s2)
%timeit damerau_levenshtein_distance_textsim(s1,s2)

100 loops, best of 3: 13.8 ms per loop
100 loops, best of 3: 15 ms per loop


In [13]:
%timeit edit_distance_nltk(s1,s2)
%timeit levenshtein_distance_pattern(s1,s2)
%timeit levenshtein_distance_jellyfish(s1,s2)
%timeit levenshtein_distance(s1,s2)

100 loops, best of 3: 12.6 ms per loop
100 loops, best of 3: 7.2 ms per loop
100 loops, best of 3: 7.08 ms per loop
100 loops, best of 3: 7.15 ms per loop


In [14]:
%timeit dice_coefficient_pattern(s1,s2)
%timeit sorensen_distance_textsim(s1,s2)

10000 loops, best of 3: 67.3 µs per loop
10000 loops, best of 3: 69.3 µs per loop


In [15]:
%timeit smith_waterman_distance(s1,s2)

10 loops, best of 3: 38.1 ms per loop


The old smith_waterman_distance, based on the swalign package, reported 52.6 ms on the same test data.

In [16]:
from textsim.stringdists.distances import needleman_wunsch_distance_textsim
%timeit needleman_wunsch_distance(s1,s2)
%timeit needleman_wunsch_distance_textsim(s1,s2)

10 loops, best of 3: 29.7 ms per loop
100 loops, best of 3: 10.6 ms per loop


Although the textsim implementation of Needleman-Wunsch is faster than the pystring one, the object-oriented design of the pystring implementation and its flexibility to accept an arbitrary user-defined function (to score substitution/copy) as a parameter make it a goal for the next version of the textsim package.
Also, testing by hand revealed an error in the textsim implementation.
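To make the idea of a pluggable substitution function concrete, here is a minimal sketch of a Needleman-Wunsch global alignment score that accepts the scoring function as a parameter. It is neither the pystring nor the textsim code: the name `needleman_wunsch_score`, the `score_fn` parameter, and the default scores (match 1, mismatch -1, gap -1) are assumptions made for illustration, and it returns an alignment score rather than a distance.

```python
def needleman_wunsch_score(a, b, score_fn=None, gap=-1):
    """Global alignment score with a caller-supplied substitution function."""
    if score_fn is None:
        # Assumed default: +1 for a copy (match), -1 for a substitution
        score_fn = lambda x, y: 1 if x == y else -1
    # Row 0: aligning the empty prefix of a against b costs only gaps
    previous = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * gap]
        for j, cb in enumerate(b, start=1):
            current.append(max(previous[j - 1] + score_fn(ca, cb),  # align ca with cb
                               previous[j] + gap,                   # gap in b
                               current[j - 1] + gap))               # gap in a
        previous = current
    return previous[-1]
```

A caller could, for instance, pass `score_fn=lambda x, y: 1 if x.lower() == y.lower() else -1` to make the alignment case-insensitive without touching the algorithm itself.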

In [17]:
%timeit lcs(s1,s2)

100 loops, best of 3: 6.35 ms per loop


In [18]:
textsim.stringdists.__not_implemented__

['Gotoh distance', 'Monge Elkan distance', 'N-grams Overlap']