# Application of NLP: Spell Checker

Spell checkers can use approximate string matching algorithms, such as `Levenshtein distance` and `Jaccard distance`, to find the correct spelling of a misspelled word.

Further reading: [Understanding the Levenshtein Distance Equation for Beginners](https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0)

In [16]:
import nltk
from nltk.corpus import words

# load the dictionary
nltk.download('words')
correct_words = words.words()
# len(correct_words) # 236736 words

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


## Using the Jaccard Distance Method

In [17]:
# import jaccard_distance from nltk.metrics.distance and ngrams from nltk.util
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

In [18]:
# list of incorrect spellings
# that need to be corrected
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

# for each misspelled word, pick the dictionary word with the same
# first letter whose character bigrams have the smallest Jaccard distance
for word in incorrect_words:
    word_bigrams = set(ngrams(word, 2))
    temp = [(jaccard_distance(word_bigrams, set(ngrams(w, 2))), w)
            for w in correct_words if w[0] == word[0]]
    print(min(temp, key=lambda val: val[0])[1])

happy
amazing
intelligent


In [19]:
set(ngrams(word, 2))

{('e', 'l'),
 ('e', 'n'),
 ('g', 't'),
 ('i', 'e'),
 ('i', 'n'),
 ('l', 'i'),
 ('l', 'l'),
 ('n', 'g'),
 ('n', 't'),
 ('t', 'e')}
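To make the comparison concrete, here is a minimal pure-Python sketch of the Jaccard distance between character bigram sets. The helper names `bigrams` and `jaccard_distance_sketch` are hypothetical and are not part of NLTK. Returning 0.0 for two empty sets is an assumption made here for convenience.

```python
def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard_distance_sketch(a, b):
    """Jaccard distance: 1 - |A & B| / |A | B| (hypothetical helper)."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1 - len(a & b) / len(union)


# 'happpy' and 'happy' produce the same bigram set, so the distance is 0.0
print(jaccard_distance_sketch(bigrams("happpy"), bigrams("happy")))
# 'hippy' shares only two of six distinct bigrams, so it is farther away
print(jaccard_distance_sketch(bigrams("happpy"), bigrams("hippy")))
```

Because sets discard repeated bigrams, the extra `p` in `happpy` does not change its bigram set, which is why `happy` is the closest match.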

## Using the Edit Distance Method
Edit distance measures the dissimilarity between two strings as the minimum number of operations needed to transform one string into the other.
The operations that can be performed are:

- Inserting a new character: `bat -> bats (insertion of 's')`
- Deleting an existing character: `care -> car (deletion of 'e')`
- Substituting an existing character: `bin -> bit (substitution of 'n' with 't')`
- Transposing two adjacent characters: `sing -> sign (transposition of 'ng' to 'gn')`

Note that `nltk.edit_distance` does not count transpositions by default, so a swap of two adjacent characters costs 2. Pass `transpositions=True` to count it as a single edit.
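The operations listed above can be sketched as a small dynamic-programming routine. This is an illustrative implementation, not NLTK's own code; the name `edit_distance_sketch` is hypothetical, and the optional `transpositions` flag is modeled here as the restricted (adjacent swap) variant.

```python
def edit_distance_sketch(a, b, transpositions=False):
    """Minimum number of edits needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # optionally treat a swap of two adjacent characters as one edit
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


for source, target in [("bat", "bats"), ("care", "car"),
                       ("bin", "bit"), ("sing", "sign")]:
    print(source, target,
          edit_distance_sketch(source, target),
          edit_distance_sketch(source, target, transpositions=True))
```

With transpositions disabled, `sing -> sign` costs 2 (two substitutions); with the flag enabled it costs 1. The other three examples cost 1 either way.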


In [20]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [21]:
from nltk.metrics.distance import edit_distance
from nltk.corpus import words
correct_words = words.words()

incorrect_words = ['happpy', 'azmaing', 'intelliengt']

# for each misspelled word, compute the edit distance to every
# dictionary word with the same first letter and print all
# candidates, sorted from closest to farthest
for word in incorrect_words:
    temp = [(edit_distance(word, w), w) for w in correct_words if w[0] == word[0]]
    print(sorted(temp))
    # to print only the best match instead:
    # print(sorted(temp, key=lambda val: val[0])[0][1])



[(1, 'happy'), (1, 'happy'), (2, 'haply'), (2, 'happen'), (2, 'happify'), (2, 'happily'), (2, 'hippy'), (2, 'hoppy'), (3, 'hackly'), (3, 'hacky'), (3, 'haggly'), (3, 'haggy'), (3, 'haily'), (3, 'hairy'), (3, 'hammy'), (3, 'hamper'), (3, 'handy'), (3, 'hangby'), (3, 'hanky'), (3, 'hap'), (3, 'happier'), (3, 'happing'), (3, 'hapten'), (3, 'haptic'), (3, 'hapu'), (3, 'hapuku'), (3, 'hardly'), (3, 'hardy'), (3, 'harp'), (3, 'harper'), (3, 'harry'), (3, 'hashy'), (3, 'hasky'), (3, 'hasp'), (3, 'hasty'), (3, 'hatpin'), (3, 'hatty'), (3, 'haulmy'), (3, 'haunty'), (3, 'hawky'), (3, 'hay'), (3, 'hayey'), (3, 'hazily'), (3, 'hazy'), (3, 'heapy'), (3, 'helply'), (3, 'hempy'), (3, 'heppen'), (3, 'hepper'), (3, 'hipped'), (3, 'hippen'), (3, 'hippic'), (3, 'hipple'), (3, 'hippo'), (3, 'hippus'), (3, 'hopped'), (3, 'hopper'), (3, 'hoppet'), (3, 'hoppity'), (3, 'hopple'), (3, 'humpty'), (3, 'humpy'), (4, 'ha'), (4, 'haab'), (4, 'haaf'), (4, 'habble'), (4, 'habeas'), (4, 'habena'), (4, 'habile'), (4, '

## Working with Your Own Dictionary

In [22]:
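mistake = "पात"
dictionary = ['पातलो', 'पात', 'पत्रु', 'पात्र']

for word in dictionary:
    # Bug: the whole list `dictionary` is passed instead of `word`, so the
    # string is compared against the list and every row prints the same value
    ed = nltk.edit_distance(mistake, dictionary)
    print(word, ed)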

पातलो 4
पात 4
पत्रु 4
पात्र 4


In [23]:
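mistake = "पात"
# named `candidates` rather than `words` to avoid shadowing nltk.corpus.words
candidates = ['पातलो', 'पात', 'पत्रु', 'पात्र']
for word in candidates:
    # corrected: compare the misspelling with each word, not with the whole list
    ed = nltk.edit_distance(mistake, word)
    print(word, ed)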

पातलो 2
पात 0
पत्रु 4
पात्र 2
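The distances above are counted over Unicode code points, not over the characters a reader sees. In Devanagari, vowel signs and the virama are separate code points, which is why `पात्र` is two edits away from `पात`. The following sketch uses only the standard library to show this; the helper name `code_points` is hypothetical.

```python
import unicodedata


def code_points(text):
    """Return the Unicode name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


for word in ["पात", "पात्र"]:
    # len() counts code points, which is also what edit distance iterates over
    print(word, len(word))
    for name in code_points(word):
        print("   ", name)
```

If distances per visible character are wanted instead, the strings would first need to be split into grapheme clusters, which this sketch does not attempt.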
