# Autocorrect

Have you ever wondered how the autocorrect feature in a smartphone keyboard works? Almost every smartphone, regardless of price, ships with autocorrect in its keyboard today. In this article I will walk you through how it works and how to build a simple version with Python. In machine learning terms, autocorrect is a natural language processing task: as the name suggests, it corrects spelling mistakes and other errors as you type. So how does it work?

Before getting into the code, let’s look at the idea. When you type a word, the keyboard checks whether it exists in the smartphone’s vocabulary. If it does, the keyboard assumes you typed the word correctly, whether it is a name, a noun, or any other kind of word.

In other words, if the word exists in the smartphone’s history, it is treated as correct. What if it doesn’t? In that case autocorrect looks for the most similar words in that history and offers them as suggestions.
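The lookup logic described above can be sketched in a few lines. This is only an illustration, not how any particular keyboard is implemented: the `vocabulary` set and the `suggest` function are hypothetical names, and the standard library's `difflib.get_close_matches` is used as a stand-in for the similarity measure.

```python
from difflib import get_close_matches

# Hypothetical toy vocabulary standing in for the phone's word history.
vocabulary = {"hello", "help", "world", "word"}

def suggest(word, vocab, n=3):
    """Return the word itself if it is known, else up to n similar known words."""
    word = word.lower()
    if word in vocab:
        return [word]  # known word: assume it was typed correctly
    # Unknown word: rank known words by difflib's similarity ratio (assumed stand-in).
    return get_close_matches(word, sorted(vocab), n=n)

print(suggest("world", vocabulary))
print(suggest("helo", vocabulary))
```

A known word comes back unchanged, while an unknown word returns a short list of close candidates.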

That is what autocorrect is and how it works. Now let’s see how to build an autocorrect feature with Python. Just as a smartphone uses its history to decide whether a typed word is correct, our program needs a collection of known words to check against.

I will use the text of a book, which you can download from the dataset link given below. Now let’s get started with building an autocorrect with Python.

For this task, we need a few libraries. Most of them are standard tools for a machine learning practitioner, so you probably have them installed already, with one exception: textdistance. You can install it with the pip command: pip install textdistance.

Dataset: https://github.com/rahulmuggalla/Auto_Correct/blob/main/book.txt

In [2]:
# Importing libraries
import numpy as np  # mathematical operations on arrays
import pandas as pd  # data analysis
# pip install textdistance
import textdistance  # computes distances between sequences
import re  # regular expression matching operations similar to those found in Perl

from collections import Counter  # dict subclass for counting hashable items, sometimes called a bag or multiset

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textdistance
  Downloading textdistance-4.3.0-py3-none-any.whl (29 kB)
Installing collected packages: textdistance
Successfully installed textdistance-4.3.0


In [4]:
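with open('book.txt', 'r') as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # \w+ matches runs of letters, digits and underscores;
    # the raw string (r'...') avoids an invalid escape sequence warning
    words = re.findall(r'\w+', file_name_data)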

In [5]:
# This is our vocabulary
V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")

The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
There are 17647 unique words in the vocabulary.
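To see what the `\w+` pattern actually extracts, here is a small sketch on a made-up sentence (the `sample` string is an assumption for illustration, not taken from the book). Note that punctuation is dropped and that an apostrophe splits a word in two, which is one limitation of this simple tokenizer.

```python
import re

# Made-up sample text, lowercased the same way as the book text.
sample = "The whale, the WHALE! Don't panic."
tokens = re.findall(r"\w+", sample.lower())
vocab = set(tokens)

print(tokens)      # ['the', 'whale', 'the', 'whale', 'don', 't', 'panic']
print(len(vocab))  # 5
```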


In [6]:
# The code above gave us a list of words. Now we need the frequency of each word,
# which is easy to get with the Counter class from the collections module:
word_freq_dict = Counter(words)
print(word_freq_dict.most_common()[0:10])

[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]
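As a quick illustration of how `Counter` behaves, here is a sketch on a toy list (the `tokens` list is made up for this example). A word that never appeared simply has a count of zero rather than raising an error.

```python
from collections import Counter

# Toy token list, not taken from the book.
tokens = ["the", "whale", "the", "sea", "the", "whale"]
counts = Counter(tokens)

print(counts.most_common(2))  # [('the', 3), ('whale', 2)]
print(counts["ship"])         # 0, missing keys count as zero
```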


## Relative Frequency of words

In [7]:
# Now we want the probability of occurrence of each word, which equals its relative frequency
# (the word's count divided by the total number of words):
probs = {}
total = sum(word_freq_dict.values())
for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k] / total
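The same calculation on a toy example makes the idea concrete. The counts below are made up for illustration; the point is that each probability is count divided by total, and the probabilities sum to 1.

```python
from collections import Counter

# Toy counts, not taken from the book.
toy_counts = Counter(["the", "whale", "the", "sea"])
toy_total = sum(toy_counts.values())
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}

print(toy_probs)  # {'the': 0.5, 'whale': 0.25, 'sea': 0.25}
```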

## Finding Similar Words

In [10]:
# Now we score every vocabulary word against the input using Jaccard similarity
# computed over character 2-grams (qval=2), where similarity = 1 - Jaccard distance.
# Then we return the 5 most similar words, ordered by similarity and then by probability:
def my_autocorrect(input_word):
    input_word = input_word.lower()

    if input_word in V:
        return('Your word seems to be correct')

    else:
        similarities = [1-(textdistance.Jaccard(qval=2).distance(v,input_word)) for v in word_freq_dict.keys()]
        df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
        df = df.rename(columns={'index':'Word', 0:'Prob'})
        df['Similarity'] = similarities
        
        output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
        return(output)
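To make the similarity score less of a black box, here is a plain-Python sketch of Jaccard similarity over character 2-grams. The helper names `bigrams` and `jaccard_bigram_similarity` are hypothetical, and the handling of repeated 2-grams and of very short words is an assumption that may differ from what textdistance does internally.

```python
from collections import Counter

def bigrams(word):
    """Count the overlapping two-character substrings of a word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def jaccard_bigram_similarity(a, b):
    """Size of the 2-gram intersection divided by the size of the 2-gram union."""
    ga, gb = bigrams(a), bigrams(b)
    union = sum((ga | gb).values())
    if union == 0:  # both words are shorter than two characters (assumed to score 0)
        return 0.0
    return sum((ga & gb).values()) / union

# "learing" has 6 2-grams, all of which appear in "clearing" (7 2-grams): 6 / 7
print(round(jaccard_bigram_similarity("learing", "clearing"), 6))  # 0.857143
# "bearing" shares 5 of them, and the union has 7: 5 / 7
print(round(jaccard_bigram_similarity("learing", "bearing"), 6))   # 0.714286
```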

In [16]:
#Now, let’s find the similar words by using our autocorrect function:
word = input("Enter word : ")
my_autocorrect(word)

Enter word : learing


| | Word | Prob | Similarity |
|---|---|---|---|
| 333 | clearing | 2.7e-05 | 0.857143 |
| 12403 | clearings | 4e-06 | 0.75 |
| 5064 | bearing | 0.000112 | 0.714286 |
| 2498 | hearing | 5.8e-05 | 0.714286 |
| 4862 | rearing | 2.2e-05 | 0.714286 |
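The ordering in this output comes from sorting on two keys: similarity first, then probability to break ties. The sketch below reproduces that ordering in plain Python using the five rows from the output, deliberately shuffled. Taking the first row as the single best suggestion is an assumption for illustration; the function above returns all five.

```python
# (word, probability, similarity) rows copied from the output, in shuffled order.
rows = [
    ("bearing", 0.000112, 0.714286),
    ("clearing", 2.7e-05, 0.857143),
    ("rearing", 2.2e-05, 0.714286),
    ("clearings", 4e-06, 0.75),
    ("hearing", 5.8e-05, 0.714286),
]

# Sort by similarity, then by probability, both descending.
ranked = sorted(rows, key=lambda r: (r[2], r[1]), reverse=True)
best_word = ranked[0][0]

print([r[0] for r in ranked])  # ['clearing', 'clearings', 'bearing', 'hearing', 'rearing']
print(best_word)               # clearing
```

The three words tied at 0.714286 are ordered by how often they occur in the book, so the more common word is suggested first.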


## Summary
Just as we took our words from a book, a smartphone keyboard starts with some words already present in its vocabulary and records new ones as the user types.
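The idea of recording new words as the user types can be sketched with `Counter.update`. This is a hypothetical illustration: the `learn` function and the toy counts are assumptions, and a real keyboard would likely be more selective about what it stores.

```python
from collections import Counter

# Toy counts standing in for the frequencies built from the book.
word_counts = Counter({"the": 3, "whale": 2})

def learn(counts, typed_text):
    """Add the words a user typed to the running counts."""
    counts.update(typed_text.lower().split())

learn(word_counts, "The kraken")
print(word_counts["the"], word_counts["kraken"])  # 4 1
```

After this update, a word such as "kraken" would be found in the vocabulary and no longer flagged as a misspelling.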