# Jaccard Distance

Jaccard Distance is a measure of how dissimilar two sets are. The lower the distance, the more similar the two sets: 0 means the sets are identical, and 1 means they have no elements in common.

Jaccard Distance depends on another concept called the “Jaccard Similarity Index”, which is (the number of elements in both sets) / (the number of elements in either set):

\begin{equation}J(X,Y) = |X \cap Y| / |X \cup Y|\end{equation}

Then we can calculate the Jaccard Distance as follows:

\begin{equation}D(X,Y) = 1 - J(X,Y)\end{equation}
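
The two formulas map directly onto Python's built-in set operations. The following is a minimal sketch in plain Python, written only to make the arithmetic visible. The function names `jaccard_similarity` and `jaccard_distance_py` are hypothetical, and returning a similarity of 1.0 for two empty sets is an assumption of this sketch (the formula itself is undefined in that case).

```python
def jaccard_similarity(x, y):
    """J(X, Y) = |X ∩ Y| / |X ∪ Y| for two Python sets."""
    union = x | y
    if not union:
        # Assumption of this sketch: two empty sets count as identical.
        return 1.0
    return len(x & y) / len(union)


def jaccard_distance_py(x, y):
    """D(X, Y) = 1 - J(X, Y)."""
    return 1 - jaccard_similarity(x, y)


# {2, 3} are shared; {1, 2, 3, 4} is the union, so J = 2 / 4.
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))   # 0.5
print(jaccard_distance_py({1, 2, 3}, {2, 3, 4}))  # 0.5
```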

## Example #1: Jaccard Distance between two words

In [2]:
import nltk

w1 = set('mapping')
w2 = set('mappings')

nltk.jaccard_distance(w1, w2)

0.14285714285714285
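
The result can be checked by hand. `set('mapping')` keeps each letter once, so X = {m, a, p, i, n, g} has 6 elements, and `set('mappings')` adds only the letter s, so Y has 7 elements and contains all of X:

\begin{equation}J(X,Y) = |X \cap Y| / |X \cup Y| = 6 / 7\end{equation}

\begin{equation}D(X,Y) = 1 - 6/7 = 1/7 \approx 0.1429\end{equation}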

## Example #2: Basic Spelling Checker

In [3]:
import nltk

mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

for word in words:
    jd = nltk.jaccard_distance(set(mistake), set(word))
    print(word, jd)

apple 0.875
bag 0.8571428571428571
drawing 0.6666666666666666
listing 0.16666666666666666
linking 0.3333333333333333
living 0.3333333333333333
lighting 0.16666666666666666
orange 0.7777777777777778
walking 0.5
zoo 1.0


In [15]:
# The same example with the NLTK "words" corpus

import nltk
# nltk.download('words')  # Uncomment on first run to download the corpus

mistake = "ligting"

words = nltk.corpus.words.words()

jds = []

for word in words:
    jd = nltk.jaccard_distance(set(mistake), set(word))
    rounded_jd = round(jd, 2)
    jds.append((rounded_jd, word))

In [16]:
# Print the nearest 10 words

print(*sorted(jds)[0:10], sep="\n")

(0.0, 'glint')
(0.0, 'littling')
(0.0, 'tiling')
(0.0, 'tilting')
(0.0, 'tingling')
(0.0, 'titling')
(0.17, 'Tlingit')
(0.17, 'aglint')
(0.17, 'antling')
(0.17, 'clinting')
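
Several unrelated words score 0.0 here because a set of characters ignores both letter order and repeated letters: 'glint' and 'ligting' reduce to the same five letters. One common way to bring order back in is to compare sets of adjacent character pairs (bigrams) instead of single characters. The following is a rough sketch of that idea in plain Python on a short word list. The helpers `char_bigrams` and `jaccard` are hypothetical names, not NLTK functions, and the candidate list is chosen only for illustration.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard(x, y):
    """Jaccard distance between two non-empty sets."""
    return 1 - len(x & y) / len(x | y)


mistake = "ligting"
candidates = ["glint", "tiling", "listing", "linking", "living", "lighting"]

# Rank candidates by bigram-based distance, closest first.
ranked = sorted(
    (jaccard(char_bigrams(mistake), char_bigrams(word)), word)
    for word in candidates
)

for distance, word in ranked:
    print(word, round(distance, 2))
```

On this list 'lighting' ranks first and 'glint' drops to last place, though the ranking depends on the candidate list and on the n-gram size.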


## Example #3: Sentence-level Jaccard Distance (character sets)

In [12]:
import nltk

sent1 = set("It might help to re-install Python if possible.")
sent2 = set("It can help to install Python again if possible.")
sent3 = set("It can be so helpful to reinstall C++ if possible.")
sent4 = set("help It possible Python to re-install if might.") # The same words as sent1 with a different order.
sent5 = set("I love Python programming.")

jd_sent_1_2 = nltk.jaccard_distance(sent1, sent2)
jd_sent_1_3 = nltk.jaccard_distance(sent1, sent3)
jd_sent_1_4 = nltk.jaccard_distance(sent1, sent4)
jd_sent_1_5 = nltk.jaccard_distance(sent1, sent5)


print(jd_sent_1_2, 'Jaccard Distance between sent1 and sent2')
print(jd_sent_1_3, 'Jaccard Distance between sent1 and sent3')
print(jd_sent_1_4, 'Jaccard Distance between sent1 and sent4')
print(jd_sent_1_5, 'Jaccard Distance between sent1 and sent5')

0.18181818181818182 Jaccard Distance between sent1 and sent2
0.36 Jaccard Distance between sent1 and sent3
0.0 Jaccard Distance between sent1 and sent4
0.22727272727272727 Jaccard Distance between sent1 and sent5
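
Because `set()` is applied to the raw strings, this example compares sets of characters, which is why sent5 (0.23) ends up almost as close to sent1 as sent2 (0.18). To compare sets of words instead, each sentence can be split into tokens first. The sketch below uses a deliberately simple tokenizer (lowercase, split on whitespace, strip punctuation from both ends of each word). The helper names `word_set` and `jaccard` are hypothetical, and a real tokenizer such as `nltk.word_tokenize` would treat punctuation differently.

```python
import string


def word_set(sentence):
    """Lowercase, split on whitespace, and strip punctuation at word edges."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return {w for w in words if w}


def jaccard(x, y):
    """Jaccard distance between two non-empty sets."""
    return 1 - len(x & y) / len(x | y)


sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent4 = "help It possible Python to re-install if might."
sent5 = "I love Python programming."

for name, other in [("sent2", sent2), ("sent4", sent4), ("sent5", sent5)]:
    print(name, round(jaccard(word_set(sent1), word_set(other)), 2))
```

With word sets, sent5 shares only one word with sent1 and moves far away, while sent4 still scores 0.0 because sets ignore word order.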


## Example #4: Character-level n-gram Jaccard Distance

In [17]:
import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might." # The same words as sent1 with a different order.
sent5 = "I love Python programming."


ng1_chars = set(nltk.ngrams(sent1, n=3))
ng2_chars = set(nltk.ngrams(sent2, n=3))
ng3_chars = set(nltk.ngrams(sent3, n=3))
ng4_chars = set(nltk.ngrams(sent4, n=3))
ng5_chars = set(nltk.ngrams(sent5, n=3))

jd_sent_1_2 = nltk.jaccard_distance(ng1_chars, ng2_chars)
jd_sent_1_3 = nltk.jaccard_distance(ng1_chars, ng3_chars)
jd_sent_1_4 = nltk.jaccard_distance(ng1_chars, ng4_chars)
jd_sent_1_5 = nltk.jaccard_distance(ng1_chars, ng5_chars)

print(jd_sent_1_2, "Jaccard Distance between sent1 and sent2 with ngram 3")
print(jd_sent_1_3, "Jaccard Distance between sent1 and sent3 with ngram 3")
print(jd_sent_1_4, "Jaccard Distance between sent1 and sent4 with ngram 3")
print(jd_sent_1_5, "Jaccard Distance between sent1 and sent5 with ngram 3")

0.43103448275862066 Jaccard Distance between sent1 and sent2 with ngram 3
0.6323529411764706 Jaccard Distance between sent1 and sent3 with ngram 3
0.3333333333333333 Jaccard Distance between sent1 and sent4 with ngram 3
0.9047619047619048 Jaccard Distance between sent1 and sent5 with ngram 3


## Example #5: Token-level n-gram Jaccard Distance

In [18]:
import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might." # The same words as sent1 with a different order.
sent5 = "I love Python programming."

tokens1 = nltk.word_tokenize(sent1)
tokens2 = nltk.word_tokenize(sent2)
tokens3 = nltk.word_tokenize(sent3)
tokens4 = nltk.word_tokenize(sent4)
tokens5 = nltk.word_tokenize(sent5)

ng1_tokens = set(nltk.ngrams(tokens1, n=3))
ng2_tokens = set(nltk.ngrams(tokens2, n=3))
ng3_tokens = set(nltk.ngrams(tokens3, n=3))
ng4_tokens = set(nltk.ngrams(tokens4, n=3))
ng5_tokens = set(nltk.ngrams(tokens5, n=3))

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
jd_sent_1_3 = nltk.jaccard_distance(ng1_tokens, ng3_tokens)
jd_sent_1_4 = nltk.jaccard_distance(ng1_tokens, ng4_tokens)
jd_sent_1_5 = nltk.jaccard_distance(ng1_tokens, ng5_tokens)

print(jd_sent_1_2, "Jaccard Distance between tokens1 and tokens2 with ngram 3")
print(jd_sent_1_3, "Jaccard Distance between tokens1 and tokens3 with ngram 3")
print(jd_sent_1_4, "Jaccard Distance between tokens1 and tokens4 with ngram 3")
print(jd_sent_1_5, "Jaccard Distance between tokens1 and tokens5 with ngram 3")

0.9285714285714286 Jaccard Distance between tokens1 and tokens2 with ngram 3
0.9333333333333333 Jaccard Distance between tokens1 and tokens3 with ngram 3
1.0 Jaccard Distance between tokens1 and tokens4 with ngram 3
1.0 Jaccard Distance between tokens1 and tokens5 with ngram 3
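
The 1.0 for sent4 follows from how n-grams encode order: sent4 contains the same words as sent1, but no three consecutive tokens appear in the same order in both. The sketch below varies n for that pair to show the effect. It uses a simple whitespace tokenizer instead of `nltk.word_tokenize`, so the token boundaries are an assumption of this sketch and the values need not match NLTK's for other sentence pairs. `tokenize`, `token_ngrams`, and `jaccard` are hypothetical helper names.

```python
import string


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip punctuation at word edges."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w]


def token_ngrams(tokens, n):
    """Return the set of runs of n consecutive tokens, as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(x, y):
    """Jaccard distance between two non-empty sets."""
    return 1 - len(x & y) / len(x | y)


sent1 = "It might help to re-install Python if possible."
sent4 = "help It possible Python to re-install if might."

tokens1 = tokenize(sent1)
tokens4 = tokenize(sent4)

# Larger n makes the comparison more sensitive to word order.
for n in (1, 2, 3):
    distance = jaccard(token_ngrams(tokens1, n), token_ngrams(tokens4, n))
    print(n, round(distance, 2))
```

For this pair the distance is 0.0 at n=1, since the word sets are identical, and rises to 1.0 at n=3, since no trigram is shared.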


Read the full tutorial on Jaccard Distance at: https://python.gotrained.com/nltk-edit-distance-jaccard-distance/