<a href="https://colab.research.google.com/github/ssatendra790/Auto-Correct-Spelling-Checker/blob/main/Spell_Checker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**AUTO CORRECT SPELLING CHECKER**

**Identify a misspelled word**\
A word is misspelled if it is not found in the vocabulary of the text corpus the autocorrect system is working with.
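
As a minimal sketch of this check, assuming a tiny hand-made vocabulary (the `toy_vocab` set and the `is_misspelled` helper are illustrative names, not part of the notebook):

```python
# Hypothetical helper: a word counts as misspelled when it is absent
# from the vocabulary built from the corpus.
def is_misspelled(word, vocabulary):
    return word.lower() not in vocabulary

# Toy vocabulary standing in for the real corpus vocabulary (assumption).
toy_vocab = {"the", "spelling", "checker", "misspelled"}

print(is_misspelled("misspelled", toy_vocab))  # False
print(is_misspelled("mispelled", toy_vocab))   # True
```

Note that this definition also flags correctly spelled words that simply never occur in the corpus, so the quality of the check depends on how large the corpus is.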

In [2]:
import re
import string
from collections import Counter
import numpy as np

In [3]:
def read_corpus(filename):
  with open(filename, "r") as file:
    lines = file.readlines()
    words = []

    for line in lines:
      words += re.findall(r'\w+',line.lower())

  return words

In [11]:
words = read_corpus("./bigf.txt")
print(f"There are {len(words)} total words in the corpus")

There are 4893393 total words in the corpus


In [12]:
vocabs = set(words)
print(f"There are {len(vocabs)} unique words in the vocabulary")

There are 186347 unique words in the vocabulary


In [13]:
word_counts = Counter(words)
print(word_counts["the"])

274722


In [14]:
total_word_count = float(sum(word_counts.values()))
word_probas = {word: word_counts[word] / total_word_count for word in word_counts.keys()}

In [15]:
print(word_probas["the"])

0.05614141353453524
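
The value above is the count of "the" divided by the total number of words, 274722 / 4893393. A small sketch on a made-up token list (the `toy_words` list is an assumption, not the notebook's corpus) shows the same calculation and that the probabilities sum to 1:

```python
from collections import Counter

# Toy corpus (assumption): six tokens, "the" appears three times.
toy_words = ["the", "cat", "saw", "the", "dog", "the"]

toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())

# Relative frequency of each word: count(w) / total number of tokens.
toy_probas = {w: c / toy_total for w, c in toy_counts.items()}

print(toy_probas["the"])         # 0.5
print(sum(toy_probas.values()))  # approximately 1.0 (floating point)
```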


In [17]:
def split(word):
  return [(word[:i], word[i:]) for i in range(len(word) + 1)]

In [19]:
print(split("satendra"))

[('', 'satendra'), ('s', 'atendra'), ('sa', 'tendra'), ('sat', 'endra'), ('sate', 'ndra'), ('saten', 'dra'), ('satend', 'ra'), ('satendr', 'a'), ('satendra', '')]


In [20]:
def delete(word):
  return [l + r[1:] for l,r in split(word) if r]

In [21]:
print(delete("satendra"))

['atendra', 'stendra', 'saendra', 'satndra', 'satedra', 'satenra', 'satenda', 'satendr']


In [22]:
def swap(word):
  return [l + r[1] + r[0] + r[2:] for l,r in split(word) if len(r)>1]

In [23]:
print(swap("satendra"))

['astendra', 'staendra', 'saetndra', 'satnedra', 'satednra', 'satenrda', 'satendar']


In [28]:
def replace(word):
  letters = string.ascii_lowercase
  return [l + c + r[1:] for l,r in split(word) if r for c in letters]

In [29]:
print(replace("satendra"))

['aatendra', 'batendra', 'catendra', 'datendra', 'eatendra', 'fatendra', 'gatendra', 'hatendra', 'iatendra', 'jatendra', 'katendra', 'latendra', 'matendra', 'natendra', 'oatendra', 'patendra', 'qatendra', 'ratendra', 'satendra', 'tatendra', 'uatendra', 'vatendra', 'watendra', 'xatendra', 'yatendra', 'zatendra', 'satendra', 'sbtendra', 'sctendra', 'sdtendra', 'setendra', 'sftendra', 'sgtendra', 'shtendra', 'sitendra', 'sjtendra', 'sktendra', 'sltendra', 'smtendra', 'sntendra', 'sotendra', 'sptendra', 'sqtendra', 'srtendra', 'sstendra', 'sttendra', 'sutendra', 'svtendra', 'swtendra', 'sxtendra', 'sytendra', 'sztendra', 'saaendra', 'sabendra', 'sacendra', 'sadendra', 'saeendra', 'safendra', 'sagendra', 'sahendra', 'saiendra', 'sajendra', 'sakendra', 'salendra', 'samendra', 'sanendra', 'saoendra', 'sapendra', 'saqendra', 'sarendra', 'sasendra', 'satendra', 'sauendra', 'savendra', 'sawendra', 'saxendra', 'sayendra', 'sazendra', 'satandra', 'satbndra', 'satcndra', 'satdndra', 'satendra', 'sa

In [30]:
def insert(word):
  letters = string.ascii_lowercase
  return [l+c+ r for l,r in split(word) for c in letters]

In [31]:
print(insert("satendra"))

['asatendra', 'bsatendra', 'csatendra', 'dsatendra', 'esatendra', 'fsatendra', 'gsatendra', 'hsatendra', 'isatendra', 'jsatendra', 'ksatendra', 'lsatendra', 'msatendra', 'nsatendra', 'osatendra', 'psatendra', 'qsatendra', 'rsatendra', 'ssatendra', 'tsatendra', 'usatendra', 'vsatendra', 'wsatendra', 'xsatendra', 'ysatendra', 'zsatendra', 'saatendra', 'sbatendra', 'scatendra', 'sdatendra', 'seatendra', 'sfatendra', 'sgatendra', 'shatendra', 'siatendra', 'sjatendra', 'skatendra', 'slatendra', 'smatendra', 'snatendra', 'soatendra', 'spatendra', 'sqatendra', 'sratendra', 'ssatendra', 'statendra', 'suatendra', 'svatendra', 'swatendra', 'sxatendra', 'syatendra', 'szatendra', 'saatendra', 'sabtendra', 'sactendra', 'sadtendra', 'saetendra', 'saftendra', 'sagtendra', 'sahtendra', 'saitendra', 'sajtendra', 'saktendra', 'saltendra', 'samtendra', 'santendra', 'saotendra', 'saptendra', 'saqtendra', 'sartendra', 'sastendra', 'sattendra', 'sautendra', 'savtendra', 'sawtendra', 'saxtendra', 'saytendra'

In [35]:
def level_one_edits(word):
  return set(delete(word) + swap(word) + replace(word) + insert(word))

**level_one_edits** returns the set of all strings that are one edit (a delete, swap, replace, or insert) away from the word.

In [39]:
def level_two_edits(word):
  return set(e2 for e1 in level_one_edits(word) for e2 in level_one_edits(e1))

**level_two_edits** returns the set of all strings that are two edits away from the word, by applying **level_one_edits** to every level-one candidate.
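
To give a feel for how many candidates these functions generate, the sketch below repeats the edit functions so that it runs on its own, then counts the results for a short example word. For a word of n lowercase letters there are n deletions, n − 1 swaps, 26n replacements, and 26(n + 1) insertions, so 54n + 25 strings before duplicates are removed.

```python
import string

def split(word):
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]

def delete(word):
    return [l + r[1:] for l, r in split(word) if r]

def swap(word):
    return [l + r[1] + r[0] + r[2:] for l, r in split(word) if len(r) > 1]

def replace(word):
    return [l + c + r[1:] for l, r in split(word) if r for c in string.ascii_lowercase]

def insert(word):
    return [l + c + r for l, r in split(word) for c in string.ascii_lowercase]

def level_one_edits(word):
    return set(delete(word) + swap(word) + replace(word) + insert(word))

def level_two_edits(word):
    return set(e2 for e1 in level_one_edits(word) for e2 in level_one_edits(e1))

word = "cat"  # example word chosen for illustration
raw = delete(word) + swap(word) + replace(word) + insert(word)
one = level_one_edits(word)
two = level_two_edits(word)

print(len(raw))   # 54 * 3 + 25 = 187
print(len(one))   # smaller than 187, because duplicates collapse in the set
print(len(two))   # much larger: every level-one candidate is edited again
```

Because `replace` can substitute a letter with itself, the original word is one of its own level-one candidates, which also means every level-one candidate reappears in the level-two set.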

In [42]:
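def correct_spelling(word, vocabulary, word_probabilities):
  if word in vocabulary:
    print(f"{word} is Correctly Spelled")
    return
  # Filter by vocabulary before falling back: level_one_edits(word) is never
  # empty, so an unfiltered `or` chain would never reach level_two_edits.
  suggestion = level_one_edits(word)
  best_guesses = [w for w in suggestion if w in vocabulary]
  if not best_guesses:
    suggestion = level_two_edits(word)
    best_guesses = [w for w in suggestion if w in vocabulary]
  return [(w, word_probabilities[w]) for w in best_guesses]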






The **correct_spelling** function above works as follows:

*   It first checks whether the word is already in the vocabulary. If it is, the function reports that the word is correctly spelled and returns.
*   **suggestion** holds the candidate strings produced by *level_one_edits*, with *level_two_edits* as the fallback.
*   **best_guesses** keeps only the candidates that are also present in the vocabulary.
*   Lastly, it returns each word in *best_guesses* along with its probability.
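
The steps above can be sketched end to end on a toy vocabulary. This is an illustrative variant rather than the notebook's function: the names `one_edit`, `suggest`, and `toy_probas` are assumptions, a known word is returned with its probability instead of printing a message, and the vocabulary filter is applied before falling back to two edits.

```python
import string

LETTERS = string.ascii_lowercase

def one_edit(word):
    """All strings one delete, swap, replace, or insert away from `word`."""
    parts = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in parts if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in parts if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in parts if r for c in LETTERS]
    inserts = [l + c + r for l, r in parts for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def suggest(word, probas):
    """Hypothetical variant: known one-edit words, else known two-edit words."""
    if word in probas:
        return [(word, probas[word])]
    known = [w for w in one_edit(word) if w in probas]
    if not known:
        # Only pay for the much larger two-edit set when one edit finds nothing.
        two = {e2 for e1 in one_edit(word) for e2 in one_edit(e1)}
        known = [w for w in two if w in probas]
    return [(w, probas[w]) for w in known]

# Toy probabilities (assumption, not taken from the notebook's corpus).
toy_probas = {"spelling": 0.5, "spilling": 0.3, "checker": 0.2}

print(suggest("speling", toy_probas))  # [('spelling', 0.5)], one insert away
print(suggest("chekr", toy_probas))    # [('checker', 0.2)], needs two inserts
```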








In [68]:
Input_Word = "mispelled"
guesses = correct_spelling(Input_Word, vocabs, word_probas)
print(guesses)

[('misspelled', 8.174287248132329e-07), ('dispelled', 8.174287248132329e-07)]
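
The function returns every known candidate without ranking them. One possible follow-up, sketched here under the assumption that the most probable word should be preferred, is to sort the list by probability. The `guesses` list is copied from the output above; `total_words` and `ranked` are illustrative names.

```python
# Output printed above, copied verbatim.
guesses = [("misspelled", 8.174287248132329e-07), ("dispelled", 8.174287248132329e-07)]
total_words = 4893393  # corpus size reported earlier in the notebook

# Recover the approximate corpus count behind each probability.
for w, p in guesses:
    print(w, round(p * total_words))

# Rank by probability, highest first. sorted() is stable, so tied
# candidates keep their original order.
ranked = sorted(guesses, key=lambda g: g[1], reverse=True)
print(ranked[0][0])
```

Both candidates are equally frequent in this corpus, so probability alone cannot separate them. Any tie-breaking rule would be an additional design choice.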
