# 🎯 Project Showcase: Autocorrect using textdistance library

Autocorrect is an application of Natural Language Processing (NLP) that corrects spelling and grammar errors as we type.

Several cases come to mind. For example, what if the typed word does not exist in the vocabulary? In that case, the program suggests vocabulary words that are similar to it.

We will be using the textdistance library for this task.

`pip install textdistance`

In [17]:
!pip install textdistance



Text data used: ***The Project Gutenberg Ebook Moby Dick; or the Whale by Herman Melville***

## Step 1: Importing the libraries

In [18]:
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter

## Step 2: Let us make a list of words from our `book.txt` file

In [19]:
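# Read the book, lowercase it, and split it into word tokens
with open('book.txt', encoding="utf8") as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # Raw string avoids Python's invalid-escape warning for '\w'
    words = re.findall(r'\w+', file_name_data)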

# This is our vocabulary
V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")

The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
There are 17647 unique words in the vocabulary.
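
To make the tokenization step concrete, here is a small sketch of what the `\w+` pattern does on a single sentence. The sample sentence is made up for illustration and is not taken from `book.txt`. Note that `\w` matches letters, digits, and underscores, so punctuation is dropped and an apostrophe splits a word in two.

```python
import re

# Made-up sample sentence, used only to illustrate the pattern
sample = "Call me Ishmael. Don't call me Moby_Dick 1851!"

# Same approach as above: lowercase, then keep runs of word characters
tokens = re.findall(r'\w+', sample.lower())
print(tokens)
# ['call', 'me', 'ishmael', 'don', 't', 'call', 'me', 'moby_dick', '1851']
```

Fragments such as `don` and `t` end up in the vocabulary as separate words, which is worth keeping in mind when reading the suggestions later on.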


In [20]:
# Count how often each word occurs in the text
word_freq_dict = Counter(words)
print(word_freq_dict.most_common()[0:10])

[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]


## Step 3: Relative Frequency of Words

Next, we estimate the probability of occurrence of each word. This is its relative frequency: the word's count divided by the total number of words in the text.

In [21]:
probs = {}     
Total = sum(word_freq_dict.values())    
for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k]/Total
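
As a quick sanity check, the same calculation can be run on a toy word list. The list below is made up for illustration, and the names `toy_words`, `toy_counts`, and `toy_probs` are not part of the notebook. It shows that the relative frequencies sum to 1, as probabilities should.

```python
from collections import Counter

# Made-up word list, used only to illustrate the calculation
toy_words = ["the", "whale", "the", "sea", "the", "whale"]

toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())

# Relative frequency: count of the word divided by the total number of words
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_probs["the"])         # 3 of 6 words -> 0.5
print(sum(toy_probs.values()))  # close to 1.0, up to floating-point rounding
```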

## Step 4: Finding Similar Words

In [24]:
def my_autocorrect(input_word):
    """Confirm a known word, or return the five most similar vocabulary words."""
    input_word = input_word.lower()
    if input_word in V:
        return 'Your word seems to be correct'
    else:
        # Bigram (qval=2) Jaccard similarity between the input and every vocabulary word
        similarities = [1-(textdistance.Jaccard(qval=2).distance(v,input_word)) for v in word_freq_dict.keys()]
        df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
        df = df.rename(columns={'index':'Word', 0:'Prob'})
        df['Similarity'] = similarities
        # Rank by similarity first; word probability breaks ties
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return output
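
To see what the similarity score measures, here is a minimal standard-library sketch of a bigram Jaccard similarity and of the two-key ranking. It is a hand-rolled approximation, not the `textdistance` implementation: the helpers `bigrams` and `bigram_jaccard` are hypothetical names, and returning 0.0 when neither word has any bigrams is a choice made for this sketch. The candidate probabilities are copied from the `neverteless` output shown in this notebook.

```python
from collections import Counter

def bigrams(word):
    """Multiset of overlapping 2-character substrings of a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def bigram_jaccard(a, b):
    """Shared bigrams divided by all bigrams found in either word."""
    ca, cb = bigrams(a), bigrams(b)
    intersection = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return intersection / union if union else 0.0  # sketch-specific fallback

# Probabilities copied from the 'neverteless' output in this notebook
candidates = {
    "never": 0.000925,
    "level": 0.000108,
    "elevates": 4e-06,
    "boneless": 1.3e-05,
    "nevertheless": 0.000225,
}

typo = "neverteless"
# Sort by similarity first, then by probability, both descending
ranked = sorted(
    candidates,
    key=lambda w: (bigram_jaccard(w, typo), candidates[w]),
    reverse=True,
)
for w in ranked:
    print(w, round(bigram_jaccard(w, typo), 6))
```

`nevertheless` shares 9 of the 12 distinct bigrams found across the two words, which gives 0.75. `boneless` and `elevates` tie on similarity, so the higher word probability places `boneless` first.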

In [26]:
my_autocorrect('nevertheless')

'Your word seems to be correct'

In [27]:
my_autocorrect('neverteless')

|  | Word | Prob | Similarity |
|---|---|---|---|
| 2571 | nevertheless | 0.000225 | 0.75 |
| 13657 | boneless | 1.3e-05 | 0.416667 |
| 12684 | elevates | 4e-06 | 0.416667 |
| 1105 | never | 0.000925 | 0.4 |
| 7136 | level | 0.000108 | 0.4 |


In [29]:
my_autocorrect('corect')

|  | Word | Prob | Similarity |
|---|---|---|---|
| 8530 | correct | 2.7e-05 | 0.833333 |
| 4546 | record | 3.1e-05 | 0.666667 |
| 8447 | correctly | 1.8e-05 | 0.625 |
| 12272 | incorrect | 4e-06 | 0.625 |
| 16260 | core | 4e-06 | 0.6 |


# Conclusion: 
We have built a simple model that confirms a word when it is in the vocabulary and otherwise suggests the most similar vocabulary words, ranked by similarity and then by how often they occur in the text.