<a href="https://colab.research.google.com/github/samyxandz/Ml-playgound/blob/main/AutoCorrect2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data preprocessing step

In [None]:
import re
from collections import Counter
import numpy as np
import pandas as pd

In [None]:
def process_data(file_name):
    """
    Input:
        file_name: path to a text file in the current directory.
    Output:
        words: a list containing all the words in the corpus (the text file) in lower case.
    """
    # Read the whole file; the with-block closes it even if reading fails
    with open(file_name, 'r') as file:
        text = file.read().lower()

    # \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
    words = re.findall(r'\w+', text)

    return words

##### testing above function

In [None]:
word_l = process_data('shakespeare.txt')
vocab = set(word_l)  # this will be your new vocabulary
print(f"The first ten words in the text are: \n{word_l[0:10]}")
print(f"There are {len(vocab)} unique words in the vocabulary.")

The first ten words in the text are: 
['o', 'for', 'a', 'muse', 'of', 'fire', 'that', 'would', 'ascend', 'the']
There are 6116 unique words in the vocabulary.
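
To see what the tokenizing step does without needing `shakespeare.txt`, here is a small sketch that applies the same lower-casing and `\w+` pattern to an in-memory string. The sample sentence is made up for illustration and is not read from the corpus file.

```python
import re

# Hypothetical sample text, used only for this example
sample = "O for a Muse of fire, that would ascend!"

# Same two steps as process_data: lower-case, then keep runs of word characters
tokens = re.findall(r'\w+', sample.lower())
print(tokens)

# \w+ also splits on apostrophes, so a contraction becomes two tokens
print(re.findall(r'\w+', "don't"))
```

Punctuation never reaches the vocabulary, and contractions are split, which may matter if the text to be corrected contains them.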


#### Building the word frequency table

In [None]:
def get_count(words):
    '''
    Input:
        words: a list of words representing the corpus.
    Output:
        word_count_dict: The word count dictionary where key is the word and value is its frequency.
    '''
    return dict(Counter(words))

testing the above function

In [None]:
word_count_dict = get_count(word_l)
print(f"There are {len(word_count_dict)} key values pairs")
print(f"The count for the word 'thee' is {word_count_dict.get('thee')}")

There are 6116 key values pairs
The count for the word 'thee' is 240
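
As a small illustration of what `Counter` produces, the sketch below counts a made-up word list and uses `most_common`, which is convenient for inspecting the most frequent words. The word list is an assumption for the example and stands in for `word_l`.

```python
from collections import Counter

# Hypothetical word list standing in for word_l
words = ['i', 'am', 'happy', 'because', 'i', 'am', 'learning']

counts = Counter(words)
word_count_dict = dict(counts)

print(word_count_dict['am'])        # count of 'am'
print(counts.most_common(2))        # the two most frequent words with their counts
print(word_count_dict.get('sad'))   # .get returns None for a word that never occurs
```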


#### Model intuition

Given the dictionary of word counts, compute the probability that each word will appear if randomly selected from the corpus of words.

$$P(w_i) = \frac{C(w_i)}{M}$$

where $C(w_i)$ is the total number of times $w_i$ appears in the corpus and $M$ is the total number of words in the corpus.

For example, the probability of the word 'am' in the sentence **'I am happy because I am learning'** is:

$$P(\text{am}) = \frac{C(\text{am})}{M} = \frac{2}{7}$$
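
The example sentence can be checked in a few lines. This is only a sketch of the arithmetic: it splits on whitespace rather than using `process_data`, which gives the same tokens for this particular sentence.

```python
from collections import Counter

sentence = 'I am happy because I am learning'
words = sentence.lower().split()

counts = Counter(words)
M = len(words)             # total number of words in the corpus
p_am = counts['am'] / M    # C(am) / M

print(f"C(am) = {counts['am']}, M = {M}, P(am) = {p_am:.4f}")

# The probabilities of all distinct words sum to 1
total = sum(c / M for c in counts.values())
print(f"sum of probabilities = {total:.4f}")
```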
    





In [None]:
def get_probs(word_frequencies):
    '''
    Input:
        word_frequencies: The word count dictionary where key is the word and value is its frequency.
    Output:
        probabilities: A dictionary where keys are the words and the values are the probability that a word will occur.
    '''
    probabilities = {}
    # M is the corpus size: the sum of all counts, taken from the argument rather than a global
    total_words = sum(word_frequencies.values())

    for word, frequency in word_frequencies.items():
        probabilities[word] = frequency / total_words

    return probabilities

##### testing above function

In [None]:
probs = get_probs(word_count_dict)
print(f"Length of probs is {len(probs)}")
print(f"P('the') is {probs['the']:.4f}")

Length of probs is 6116
P('the') is 0.0284


# String Manipulations

We will write the following four functions:

- delete_letter: given a word, it returns all the possible strings that have one character removed.
- switch_letter: given a word, it returns all the possible strings that have two adjacent letters switched.
- replace_letter: given a word, it returns all the possible strings that have one character replaced by another different letter.
- insert_letter: given a word, it returns all the possible strings that have an additional character inserted.

#### delete_letter( )
Given a word, returns a list of strings with one character deleted.

For example, given the word 'nice', it returns the list:

['ice', 'nce', 'nie', 'nic'].

In [None]:
def delete_letter(word, verbose=False):
    '''
    Input:
        word: the string/word from which one character will be deleted
    Output:
        delete_l: a list of all possible strings obtained by deleting 1 character from word
    '''

    delete_l = [word[:i] + word[i + 1:] for i in range(len(word))]
    split_l = [(word[:i], word[i:]) for i in range(len(word))]

    if verbose: print(f"input word {word} \nsplit_l = {split_l}, \ndelete_l = {delete_l}")

    return delete_l

In [None]:
delete_word_l = delete_letter(word="cans",verbose=True)

input word cans 
split_l = [('', 'cans'), ('c', 'ans'), ('ca', 'ns'), ('can', 's')], 
delete_l = ['ans', 'cns', 'cas', 'can']
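
One property worth noting: because `delete_letter` returns a list, a word with a repeated adjacent letter yields duplicate strings. The sketch below restates the same list comprehension in a standalone helper (named `deletes` only for this example) to show the effect.

```python
def deletes(word):
    # Same comprehension as delete_letter: drop the character at each index
    return [word[:i] + word[i + 1:] for i in range(len(word))]

result = deletes('hello')
print(result)
# Deleting either 'l' gives 'helo', so the list is longer than the set
print(len(result), len(set(result)))
```

The duplicates disappear once the results are collected into a set, as `edit_one_letter` does later.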


#### switch_letter( )

A function that switches two adjacent letters in a word. It takes in a word and returns a list of all the strings obtained by switching two letters that are adjacent to each other.

For example, given the word 'eta', it returns
['tea', 'eat'], but does not return 'ate'.



In [None]:
def switch_letter(word, verbose=False):
    '''
    Input:
        word: input string
    Output:
        switch_l: a list of all possible strings with one pair of adjacent characters switched
    '''

    switch_l = [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(len(word) - 1)]
    split_l = [(word[:i], word[i:]) for i in range(len(word))]

    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nswitch_l = {switch_l}")

    return switch_l

##### testing above function

In [None]:
switch_word_l = switch_letter(word="eta", verbose=True)

Input word = eta 
split_l = [('', 'eta'), ('e', 'ta'), ('et', 'a')] 
switch_l = ['tea', 'eat']
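
Two edge cases are easy to miss here. The sketch below restates the same comprehension in a standalone helper (named `switches` only for this example) and shows that swapping a doubled letter reproduces the input word, and that a one-letter word has no adjacent pair to swap.

```python
def switches(word):
    # Same comprehension as switch_letter: swap the characters at i and i + 1
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(len(word) - 1)]

print(switches('eta'))     # adjacent swaps only, so 'ate' is not produced
print(switches('hello'))   # swapping the two l's gives back 'hello'
print(switches('a'))       # nothing to swap
```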


#### replace_letter( )

Takes in a word and returns a sorted list of strings in which one letter of the original word is replaced by a different letter.


    Input:
        word: the input string/word
    Output:
        replaces: a list of all possible strings where we replaced one letter from the original word.



In [None]:
def replace_letter(word, verbose=False):


    letters = 'abcdefghijklmnopqrstuvwxyz'
    replace_l = []
    for i in range(len(word)):
        for letter in letters:
            if letter != word[i]:
                replace_l.append(word[:i] + letter + word[i + 1:])
    replace_l.sort()

    split_l = [(word[:i], word[i:]) for i in range(len(word))]

    if verbose: print(f"Input word = {word} \n split_l = {split_l} \nreplace_l= \n {replace_l}")

    return replace_l

testing above function

In [None]:
replace_l = replace_letter(word='can', verbose=True)

Input word = can 
 split_l = [('', 'can'), ('c', 'an'), ('ca', 'n')] 
replace_l= 
 ['aan', 'ban', 'caa', 'cab', 'cac', 'cad', 'cae', 'caf', 'cag', 'cah', 'cai', 'caj', 'cak', 'cal', 'cam', 'cao', 'cap', 'caq', 'car', 'cas', 'cat', 'cau', 'cav', 'caw', 'cax', 'cay', 'caz', 'cbn', 'ccn', 'cdn', 'cen', 'cfn', 'cgn', 'chn', 'cin', 'cjn', 'ckn', 'cln', 'cmn', 'cnn', 'con', 'cpn', 'cqn', 'crn', 'csn', 'ctn', 'cun', 'cvn', 'cwn', 'cxn', 'cyn', 'czn', 'dan', 'ean', 'fan', 'gan', 'han', 'ian', 'jan', 'kan', 'lan', 'man', 'nan', 'oan', 'pan', 'qan', 'ran', 'san', 'tan', 'uan', 'van', 'wan', 'xan', 'yan', 'zan']
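
The size of this output is predictable: each position can take any of the 25 other lowercase letters, so a word of length n produces 25n strings. A minimal sketch that checks this, using a standalone helper (named `replaces` only for this example) with the same logic:

```python
import string

def replaces(word, letters=string.ascii_lowercase):
    # Replace each position with every letter that differs from the original
    return sorted(
        word[:i] + letter + word[i + 1:]
        for i in range(len(word))
        for letter in letters
        if letter != word[i]
    )

result = replaces('can')
print(len(result))        # 25 alternatives for each of the 3 positions
print('can' in result)    # the original word is never produced
```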


#### insert_letter( )

A function that takes in a word and returns a list of strings with a letter inserted at every offset.


    Input:
        word: the input string/word
    Output:
        inserts: a list of all possible strings with one new letter inserted at every offset
  


In [None]:
def insert_letter(word, verbose=False):

    letters = 'abcdefghijklmnopqrstuvwxyz'
    split_l = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    insert_l = [a + letter + b for a, b in split_l for letter in letters]

    if verbose: print(f"Input word {word} \nsplit_l = {split_l} \ninsert_l = {insert_l}")

    return insert_l

testing above function

In [None]:
insert_l = insert_letter('at', True)
print(f"Number of strings output by insert_letter('at') is {len(insert_l)}")

Input word at 
split_l = [('', 'at'), ('a', 't'), ('at', '')] 
insert_l = ['aat', 'bat', 'cat', 'dat', 'eat', 'fat', 'gat', 'hat', 'iat', 'jat', 'kat', 'lat', 'mat', 'nat', 'oat', 'pat', 'qat', 'rat', 'sat', 'tat', 'uat', 'vat', 'wat', 'xat', 'yat', 'zat', 'aat', 'abt', 'act', 'adt', 'aet', 'aft', 'agt', 'aht', 'ait', 'ajt', 'akt', 'alt', 'amt', 'ant', 'aot', 'apt', 'aqt', 'art', 'ast', 'att', 'aut', 'avt', 'awt', 'axt', 'ayt', 'azt', 'ata', 'atb', 'atc', 'atd', 'ate', 'atf', 'atg', 'ath', 'ati', 'atj', 'atk', 'atl', 'atm', 'atn', 'ato', 'atp', 'atq', 'atr', 'ats', 'att', 'atu', 'atv', 'atw', 'atx', 'aty', 'atz']
Number of strings output by insert_letter('at') is 78
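
The count of 78 is 26 letters times the 3 offsets of a two-letter word. The list contains repeats, though: inserting a letter next to an identical letter gives the same string from two offsets. A sketch with a standalone helper (named `inserts` only for this example) that makes the repeats visible:

```python
import string
from collections import Counter

def inserts(word, letters=string.ascii_lowercase):
    # Insert every letter at every offset, including before the first
    # character and after the last one
    return [word[:i] + letter + word[i:] for i in range(len(word) + 1) for letter in letters]

result = inserts('at')
print(len(result))         # 26 letters at each of 3 offsets
print(len(set(result)))    # fewer distinct strings than list entries

# Strings that were generated more than once
dupes = sorted(s for s, c in Counter(result).items() if c > 1)
print(dupes)
```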


# Combining the edits

Next, create two functions that, given a string, return all the possible single and double edits on that string: `edit_one_letter()` and `edit_two_letters()`.

## Edit at One letter

This function returns all the strings that are one edit away from a word. The edits consist of the replace, insert, delete, and optionally the switch operation, and the function reuses the four helpers implemented above. The switch operation is a less common edit, so its use is controlled by the `allow_switches` input argument.


    Input:
        word: the string/word for which we will generate all possible words that are one edit away.
    Output:
        edit_one_set: a set (not a list) of words that are one edit away from the input.
   



In [None]:
def edit_one_letter(word, allow_switches = True):


    edit_one_set = set().union(replace_letter(word)).union(insert_letter(word)).union(delete_letter(word))

    if allow_switches:
        edit_one_set = edit_one_set.union(switch_letter(word))

    return edit_one_set

Testing the function


In [None]:
tmp_word = "at"
tmp_edit_one_set = edit_one_letter(tmp_word)
# turn this into a list to sort it, in order to view it
tmp_edit_one_l = sorted(list(tmp_edit_one_set))

print(f"input word {tmp_word} \nedit_one_l \n{tmp_edit_one_l}\n")
print(f"Number of outputs from edit_one_letter('at') is {len(edit_one_letter('at'))}")

input word at 
edit_one_l 
['a', 'aa', 'aat', 'ab', 'abt', 'ac', 'act', 'ad', 'adt', 'ae', 'aet', 'af', 'aft', 'ag', 'agt', 'ah', 'aht', 'ai', 'ait', 'aj', 'ajt', 'ak', 'akt', 'al', 'alt', 'am', 'amt', 'an', 'ant', 'ao', 'aot', 'ap', 'apt', 'aq', 'aqt', 'ar', 'art', 'as', 'ast', 'ata', 'atb', 'atc', 'atd', 'ate', 'atf', 'atg', 'ath', 'ati', 'atj', 'atk', 'atl', 'atm', 'atn', 'ato', 'atp', 'atq', 'atr', 'ats', 'att', 'atu', 'atv', 'atw', 'atx', 'aty', 'atz', 'au', 'aut', 'av', 'avt', 'aw', 'awt', 'ax', 'axt', 'ay', 'ayt', 'az', 'azt', 'bat', 'bt', 'cat', 'ct', 'dat', 'dt', 'eat', 'et', 'fat', 'ft', 'gat', 'gt', 'hat', 'ht', 'iat', 'it', 'jat', 'jt', 'kat', 'kt', 'lat', 'lt', 'mat', 'mt', 'nat', 'nt', 'oat', 'ot', 'pat', 'pt', 'qat', 'qt', 'rat', 'rt', 'sat', 'st', 't', 'ta', 'tat', 'tt', 'uat', 'ut', 'vat', 'vt', 'wat', 'wt', 'xat', 'xt', 'yat', 'yt', 'zat', 'zt']

Number of outputs from edit_one_letter('at') is 129
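
The figure of 129 can be reconstructed from the four edit types. The sketch below uses a compact helper written for this example (`one_edit_lists` is a hypothetical name, not part of the notebook) that builds all four lists from one set of splits, then compares the raw total with the number of distinct strings.

```python
import string

LETTERS = string.ascii_lowercase

def one_edit_lists(word):
    # Hypothetical helper for this example: returns the four edit lists separately
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    switches = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS if c != b[0]]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return deletes, switches, replaces, inserts

deletes, switches, replaces, inserts = one_edit_lists('at')
counts = [len(deletes), len(switches), len(replaces), len(inserts)]
print(counts, sum(counts))

# The union removes the repeated insert results
unique = set(deletes) | set(switches) | set(replaces) | set(inserts)
print(len(unique))
```

The raw total is 2 + 1 + 50 + 78 = 131. The union drops the two repeated insert results, which leaves 129.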


## Edit at Two letters

Generalize this to get all the strings that are two edits away from a word.

First get all the possible single edits on the word, then apply a single edit again to each of those modified strings.

    Input:
        word: the input string/word
    Output:
        edit_two_set: a set of strings with all possible two edits

In [None]:
def edit_two_letters(word, allow_switches = True):

    edit_one_set = edit_one_letter(word, allow_switches)
    edit_two_set = set()

    # Apply a second edit to every one-edit string; update() grows the set in place
    for entry in edit_one_set:
        edit_two_set.update(edit_one_letter(entry, allow_switches))

    return edit_two_set

Testing the function


In [None]:
tmp_edit_two_set = edit_two_letters("a")
tmp_edit_two_l = sorted(list(tmp_edit_two_set))
print(f"Number of strings with edit distance of two: {len(tmp_edit_two_l)}")
print(f"First 10 strings {tmp_edit_two_l[:10]}")
print(f"Last 10 strings {tmp_edit_two_l[-10:]}")

print(f"Number of strings that are 2 edit distances from 'a' is {len(edit_two_letters('a'))}")

Number of strings with edit distance of two: 2654
First 10 strings ['', 'a', 'aa', 'aaa', 'aab', 'aac', 'aad', 'aae', 'aaf', 'aag']
Last 10 strings ['zv', 'zva', 'zw', 'zwa', 'zx', 'zxa', 'zy', 'zya', 'zz', 'zza']
Number of strings that are 2 edit distances from 'a' is 2654
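
Note that the result is the set of strings reachable by applying two edits in a row, so it also contains the input word itself (an edit followed by its inverse) and every one-edit string. The sketch below checks this with compact stand-ins written for this example (`edits1` and `edits2` are hypothetical names that mirror `edit_one_letter` and `edit_two_letters` with switches enabled) and breaks the 2654 strings down by length.

```python
import string

LETTERS = string.ascii_lowercase

def edits1(word):
    # Compact stand-in for edit_one_letter with switches enabled
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    switches = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in LETTERS if c != b[0]}
    inserts = {a + c + b for a, b in splits for c in LETTERS}
    return deletes | switches | replaces | inserts

def edits2(word):
    # Apply one edit to every string that is already one edit away
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

one = edits1('a')
two = edits2('a')
print(len(one), len(two))
print('a' in two)    # the second edit can undo the first
print(one <= two)    # every one-edit string is also reachable in two steps

# Number of results of each length, from 0 to 3 characters
by_length = {n: sum(1 for s in two if len(s) == n) for n in range(4)}
print(by_length)
```

For 'a' the result holds the empty string, all 26 single letters, all 676 two-letter strings, and the 1951 three-letter strings that contain at least one 'a'.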


## WARNING

The next cell prints every string generated by `edit_two_letters("a")`, so its output is very long.


In [None]:
tmp_edit_two_set = edit_two_letters("a")
tmp_edit_two_l = sorted(list(tmp_edit_two_set))

print(f" The strings are :{tmp_edit_two_l}")

 The strings are :['', 'a', 'aa', 'aaa', 'aab', 'aac', 'aad', 'aae', 'aaf', 'aag', 'aah', 'aai', 'aaj', 'aak', 'aal', 'aam', 'aan', 'aao', 'aap', 'aaq', 'aar', 'aas', 'aat', 'aau', 'aav', 'aaw', 'aax', 'aay', 'aaz', 'ab', 'aba', 'abb', 'abc', 'abd', 'abe', 'abf', 'abg', 'abh', 'abi', 'abj', 'abk', 'abl', 'abm', 'abn', 'abo', 'abp', 'abq', 'abr', 'abs', 'abt', 'abu', 'abv', 'abw', 'abx', 'aby', 'abz', 'ac', 'aca', 'acb', 'acc', 'acd', 'ace', 'acf', 'acg', 'ach', 'aci', 'acj', 'ack', 'acl', 'acm', 'acn', 'aco', 'acp', 'acq', 'acr', 'acs', 'act', 'acu', 'acv', 'acw', 'acx', 'acy', 'acz', 'ad', 'ada', 'adb', 'adc', 'add', 'ade', 'adf', 'adg', 'adh', 'adi', 'adj', 'adk', 'adl', 'adm', 'adn', 'ado', 'adp', 'adq', 'adr', 'ads', 'adt', 'adu', 'adv', 'adw', 'adx', 'ady', 'adz', 'ae', 'aea', 'aeb', 'aec', 'aed', 'aee', 'aef', 'aeg', 'aeh', 'aei', 'aej', 'aek', 'ael', 'aem', 'aen', 'aeo', 'aep', 'aeq', 'aer', 'aes', 'aet', 'aeu', 'aev', 'aew', 'aex', 'aey', 'aez', 'af', 'afa', 'afb', 'afc', 'afd'

# Spelling Suggestions

The 'suggestion algorithm' follows this logic:

- If the word is in the vocabulary, suggest the word.
- Otherwise, if there are suggestions from edit_one_letter that are in the vocabulary, use those.
- Otherwise, if there are suggestions from edit_two_letters that are in the vocabulary, use those.
- Otherwise, suggest the input word.

The idea is that words generated from fewer edits are more likely than words with more edits.

> Then create a 'best_words' dictionary where the 'key' is a suggestion and the 'value' is the probability of that word in your vocabulary.

> Select the n best suggestions. There may be fewer than n


    Input:
        word: a user entered string to check for suggestions
        probs: a dictionary that maps each word to its probability in the corpus
        vocab: a set containing all the vocabulary
        n: number of possible word corrections you want returned in the dictionary
    Output:
        n_best: a list of tuples with the most probable n corrected words and their probabilities.
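
The fallback order and the top-n selection can be sketched separately from the edit functions. In the example below the candidate sets are passed in directly, and the names `pick_candidates` and `n_best`, the toy vocabulary, and the probabilities are all assumptions made for illustration. It is not the notebook's implementation.

```python
import heapq

def pick_candidates(word, vocab, one_edit, two_edits):
    # one_edit and two_edits are precomputed candidate sets (assumed inputs).
    # `or` returns the first non-empty option, which gives the fallback order.
    known = {word} if word in vocab else set()
    return known or (one_edit & vocab) or (two_edits & vocab) or {word}

def n_best(candidates, probs, n):
    # Keep the n candidates with the highest probability; unknown words score 0
    scored = ((w, probs.get(w, 0.0)) for w in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Toy data, made up for illustration
vocab = {'days', 'dye', 'dry'}
probs = {'days': 0.5, 'dye': 0.1, 'dry': 0.4}

cands = pick_candidates('dys', vocab, one_edit={'days', 'dye', 'dry', 'dyz'}, two_edits={'day'})
best = n_best(cands, probs, 2)
print(best)
```

`heapq.nlargest` avoids sorting the full candidate list when n is small. Sorting by probability and slicing the first n entries gives the same ranking.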


In [1]:
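def get_corrections(word, probs, vocab, n=2, verbose = False):

    # Prefer the word itself, then one-edit candidates, then two-edit candidates,
    # and fall back to the input word if nothing is in the vocabulary.
    # The word is wrapped in a list so that list() does not split it into characters.
    suggestions = list(
        (word in vocab and [word])
        or edit_one_letter(word).intersection(vocab)
        or edit_two_letters(word).intersection(vocab)
        or [word]
    )

    # Map each suggestion to its probability, then keep the n most probable
    best_words = {s: probs.get(s, 0) for s in suggestions}
    n_best = sorted(best_words.items(), key=lambda item: item[1], reverse=True)[:n]

    if verbose: print("entered word = ", word, "\nsuggestions = ", suggestions)

    return n_best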

Testing the function

In [None]:

my_word = 'dys'
tmp_corrections = get_corrections(my_word, probs, vocab, 2, verbose=True) # keep verbose=True
for i, word_prob in enumerate(tmp_corrections):
    print(f"word {i}: {word_prob[0]}, probability {word_prob[1]:.6f}")

