## Spell Check

For each misspelled word, identify a list of candidate words using a Trie structure, find the candidates with the minimum edit distance, and choose the correct word based on its context.
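
As a rough illustration of the candidate-generation idea, the sketch below stores a toy vocabulary in a Trie and keeps the words within a given edit distance of the input. The names `Trie`, `edit_distance`, and `candidates`, and the five-word vocabulary, are hypothetical and chosen only for this example. For simplicity it enumerates every stored word; a fuller implementation would prune branches while walking the Trie.

```python
class Trie:
    """Minimal prefix tree storing a vocabulary (illustrative helper)."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def words(self, prefix=""):
        """Yield every stored word, depth-first in alphabetical order."""
        if self.is_word:
            yield prefix
        for ch, child in sorted(self.children.items()):
            yield from child.words(prefix + ch)


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidates(trie, word, max_distance=1):
    """Vocabulary words within max_distance edits of word."""
    return [w for w in trie.words() if edit_distance(word, w) <= max_distance]


# Toy vocabulary (an assumption for this sketch)
vocab = Trie()
for w in ["quiet", "quite", "quit", "heap", "help"]:
    vocab.insert(w)

print(candidates(vocab, "heep"))  # ['heap', 'help']
```

Edit distance alone cannot separate `heap` from `help` here, which is why a context step is still needed to pick between candidates.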

### Import necessary libraries and download the packages

In [1]:
import string
import nltk
from nltk.corpus import stopwords, brown, gutenberg
from nltk.util import ngrams
import itertools

In [2]:
# Uncomment on first run to fetch the corpora used below
# nltk.download('brown')
# nltk.download('gutenberg')
# nltk.download('stopwords')

### Load the data

In [2]:
data = [
    "Mr Patrick is our new principle",
    "The company excepted all the terms",
    "Please don't keep your dog on the lose",
    "The later is my best friend",
    "I need some stationary products for my craftwork",
    "The actor excepted the Oscar",
    "I will call you later in the evening",
    "Covid affects the lungs",
    "The council of the ministers were sworn in yesterday",
    "Robert too wants to accompany us to the park",
    "Mia will council me about choosing fashion as my career",
    "The bear at the zoo was very playful",
    "The sheep have a lot of fur that keeps them warm",
    "The hot spring is at the furthest corner of the street",
    "Can you advice me on how to study for exams",
    "The team will loose the match if they don't play well",
    "Can you go to the market for me",
    "The teachers asked the students to keep quite",
    "The heap of garbage should be cleaned immediately",
    "This is there house",
    "Mr Patrick is our new principal",
    "The company accepted all the terms",
    "Please don't keep your dog on the loose",
    "The latter is my best friend",
    "I need some stationery products for my craftwork",
    "The actor accepted the Oscar",
    "I will call you later in the evening",
    "Covid affects the lungs",
    "The council of the ministers were sworn in yesterday",
    "Robert too wants to accompany us to the park",
    "Mia will counsel me about choosing fashion as my career",
    "The bear at the zoo was very playful",
    "The sheep have a lot of fur that keeps them warm",
    "The hot spring is at the farthest corner of the street",
    "Can you advise me on how to study for exams",
    "The team will lose the match if they don't play wel.",
    "Can you go to the market for me",
    "The teachers asked the students to keep quiet",
    "The heap of garbage should be cleaned immediately",
    "This is their house",
    "Mr Patrick is our new principal",
    "The company accepted all the terms",
    "Please don't keep your dog on the loose",
    "The latter is my best friend",
    "I need some stationery products for my craftwork",
    "The actor accepted the Oscar",
    "I will call you later in the evening",
    "Covid affects the lungs",
    "The council of the ministers were sworn in yesterday",
    "Robert too wants to accompany us to the park",
    "Mia will counsel me about choosing fashion as my career",
    "The bear at the zoo was very playful",
    "The sheep have a lot of fur that keeps them warm",
    "The hot spring is at the farthest corner of the street",
    "Can you advise me on how to study for exams",
    "The team will lose the match if they don't play well",
    "Can you go to the market for me",
    "The teachers asked the students to keep quiet",
    "The heap of garbage should be cleaned immediately",
    "This is their house",
]

### Understanding the data and Data Preprocessing

- Extending the original data
- Splitting the data into words
- Adding the Brown and Gutenberg corpora to the data to create the final corpus
- Preprocessing steps: lowercasing, then punctuation and stopword removal
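
The preprocessing steps listed above can be sketched on a single sentence as follows. This is a minimal, self-contained illustration: `STOPWORDS` is a hypothetical five-word stand-in for NLTK's English stopword list, and `preprocess` is a name introduced only for this example.

```python
import string

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"the", "is", "our", "all", "my"}
PUNCTUATION = set(string.punctuation)


def preprocess(tokens):
    """Lowercase, then drop punctuation, stopwords, and non-alphabetic or one-letter tokens."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered
            if t not in PUNCTUATION
            and t not in STOPWORDS
            and t.isalpha()
            and len(t) > 1]


tokens = "The company accepted all the terms .".split()
print(preprocess(tokens))  # ['company', 'accepted', 'terms']
```

Lowercasing before the stopword check matters: the stopword list is lowercase, so a capitalised "The" would otherwise slip through.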

In [3]:
len(data)

60

In [4]:
data.extend(data) # double the data

In [5]:
len(data)

120

In [6]:
sentence = []
for i in data:
    sentence.append(i.split()) # split each sentence into a list of words

In [7]:
len(sentence)

120

In [8]:
sentence[:5]

[['Mr', 'Patrick', 'is', 'our', 'new', 'principle'],
 ['The', 'company', 'excepted', 'all', 'the', 'terms'],
 ['Please', "don't", 'keep', 'your', 'dog', 'on', 'the', 'lose'],
 ['The', 'later', 'is', 'my', 'best', 'friend'],
 ['I', 'need', 'some', 'stationary', 'products', 'for', 'my', 'craftwork']]

In [9]:
words = []
for i in data:
    words.extend(i.split()) # split the data into words

In [10]:
words[:5]

['Mr', 'Patrick', 'is', 'our', 'new']

In [11]:
brown.words()[:10] # The Brown Corpus was a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [12]:
len(brown.words())

1161192

In [13]:
gutenberg.words()[:10] # gutenberg corpus

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', 'VOLUME', 'I', 'CHAPTER']

In [14]:
len(gutenberg.words())

2621613

In [15]:
1161192+2621613+len(words)

3783753

In [16]:
corpus = brown.words() + gutenberg.words() + words # combine the Brown and Gutenberg corpora with our own words
len(corpus)

3783753

In [17]:
corpus[-10:] # last 10 words in corpus

['of',
 'garbage',
 'should',
 'be',
 'cleaned',
 'immediately',
 'This',
 'is',
 'their',
 'house']

In [18]:
len(set(corpus)) # unique words in corpus

84569

In [19]:
list(string.punctuation)[:10]

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']

In [20]:
len(list(string.punctuation))

32

In [21]:
stopwords.words('english')[:7]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours']

In [22]:
len(stopwords.words('english'))

179

In [23]:
unwanted_words = list(string.punctuation) + stopwords.words('english') 
len(unwanted_words)

211

In [24]:
# Lowercasing
lower = [x.lower() for x in corpus]
lower[:10]

['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of']


In [25]:
# Punctuation and stop words removal
filtered_words = [x for x in lower if x not in unwanted_words]
filtered_words[:10]

['fulton', 'county', 'grand', 'jury', 'said', 'friday', 'investigation', "atlanta's", 'recent', 'primary']


In [26]:
len(filtered_words) 

1676346

In [27]:
filtered_words = list(filter(lambda x: x.isalpha() and len(x)>1, filtered_words))
len(filtered_words) # clean corpus

1534110

### Building Bi-Gram Model
- An N-gram model captures regularities in word order
- It predicts each word from the N-1 preceding words of context (one word for a bigram)
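
For N = 2, the estimate behind the model is the relative frequency count(prev, word) / count(prev, any word). The pure-Python sketch below shows that calculation on a toy token list; `bigram_counts` and `conditional_freq` are hypothetical helper names, and the tokens are made up for illustration.

```python
from collections import Counter, defaultdict


def bigram_counts(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def conditional_freq(counts, prev, word):
    """Relative frequency of `word` after `prev`; 0.0 if `prev` was never seen."""
    total = sum(counts[prev].values()) if prev in counts else 0
    return counts[prev][word] / total if total else 0.0


# Toy tokens (an assumption for this sketch)
tokens = ["keep", "quiet", "please", "keep", "quiet", "keep", "calm"]
counts = bigram_counts(tokens)

print(conditional_freq(counts, "keep", "quiet"))  # 2 of the 3 words seen after "keep"
print(conditional_freq(counts, "keep", "quite"))  # never observed after "keep"
```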

In [28]:
bigrams = ngrams(filtered_words,2)

In [29]:
type(bigrams)

zip

In [30]:
bigrams = list(bigrams) 
print(type(bigrams), bigrams[:5])

<class 'list'> [('fulton', 'county'), ('county', 'grand'), ('grand', 'jury'), ('jury', 'said'), ('said', 'friday')]


In [31]:
# ConditionalFreqDist builds one frequency distribution per condition (the first
# word of each pair) and counts the events (the second word) seen under it,
# i.e. how often each word follows a given word.
model = nltk.ConditionalFreqDist(bigrams)

In [32]:
type(model)

nltk.probability.ConditionalFreqDist

In [33]:
for condition in itertools.islice(model.conditions(), 5):  # Only print for the first 5 conditions
    print(condition, model[condition].most_common(5))

fulton [('county', 6), ('superior', 2), ('legislators', 2), ('taxpayers', 1), ('court', 1)]
county [('school', 5), ('jail', 4), ('unit', 3), ('new', 3), ('hospital', 3)]
grand [('jury', 11), ('old', 4), ('champion', 3), ('pianoforte', 3), ('foe', 3)]
jury [('said', 10), ('judge', 4), ('box', 4), ('recommended', 2), ('room', 2)]
said [('unto', 1698), ('lord', 146), ('mr', 132), ('alice', 123), ('turnbull', 121)]


### Predictions

In [34]:
questions = ["Mr Patrick is our new MASK.","The company MASK all the terms.","Please don't keep your dog on the MASK.","The MASK is my best friend.","I need some MASK products for my craftwork.","The actor MASK the Oscar.","I will call you MASK in the evening.","Covid MASK the lungs.","The MASK of the ministers were sworn in yesterday.","Robert MASK wants to accompany us to the park.","Mia will MASK me about choosing fashion as my career.", "The MASK at the zoo was very playful.","The sheep have a lot of MASK that keeps them warm.","The hot spring is at the MASK corner of the street.","Can you MASK me on how to study for exams.","The team will MASK the match if they don't play well.","Can you go MASK the market for me.","The teachers asked the students to keep MASK.", "The MASK of garbage should be cleaned immediately.","This is MASK house."]

In [42]:
choices = [['principle','principal'],['accepted','excepted'],['lose','loose'],['later','latter'],['stationary','stationery'],['accepted','excepted'],['later','latter'],['affects','effects'],['council','counsel'],['too','to'],['council','counsel'],['bear','bare'],['fur','fir'],['furthest','farthest'],['advice','advise'],['loose','lose'],['to','for'],['quiet','quite'],['heep','heap'],['there','their']]

In [46]:
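def predict(question, model, choice1, choice2):
    # Lowercase first so capitalised stopwords ("The", "Can") are filtered too
    tokens = [t.lower() for t in question.split()]
    processed = [t for t in tokens if t not in unwanted_words]
    prev = None  # stays None when no content word precedes MASK
    for t in processed:
        if t in ["mask", "mask."]:
            break
        prev = t
    # Relative frequency of each choice following the previous content word
    prob1 = model[prev].freq(choice1) if prev is not None else 0.0
    prob2 = model[prev].freq(choice2) if prev is not None else 0.0
    print('--------------------------------------------------')
    print(f"{choice1} = {prob1:.4f}\n{choice2} = {prob2:.4f}")
    # Ties (including 0 vs 0) fall through to choice2
    if prob1 > prob2:
        return question.replace("MASK", choice1.upper())
    return question.replace("MASK", choice2.upper())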

In [47]:
for i, (question, (choice1, choice2)) in enumerate(zip(questions, choices), start=1):
    print(f"{i}.", predict(question, model, choice1, choice2))

--------------------------------------------------
principle = 0.0012
principal = 0.0016
1. Mr Patrick is our new PRINCIPAL.
--------------------------------------------------
accepted = 0.0066
excepted = 0.0033
2. The company ACCEPTED all the terms.
--------------------------------------------------
lose = 0.0103
loose = 0.0205
3. Please don't keep your dog on the LOOSE.
--------------------------------------------------
later = 0.0000
latter = 0.0000
4. The LATTER is my best friend.
--------------------------------------------------
stationary = 0.0031
stationery = 0.0062
5. I need some STATIONERY products for my craftwork.
--------------------------------------------------
accepted = 0.0678
excepted = 0.0339
6. The actor ACCEPTED the Oscar.
--------------------------------------------------
later = 0.0071
latter = 0.0000
7. I will call you LATER in the evening.
--------------------------------------------------
affects = 1.0000
effects = 0.0000
8. Covid AFFECTS the lungs.
----------

### Inferences

`1. Lack of Context:` A bigram model looks only at adjacent word pairs, and `predict` uses just the one content word before MASK. Anything to the right of the blank, or further to the left, is ignored, which can lead to a wrong choice when the deciding cue lies elsewhere in the sentence.

`2. Sparse Data:` When a bigram occurs rarely or never in the corpus, its estimated frequency is unreliable. In sentence 4, for example, both candidates score 0.0000, so the answer comes from the tie-break in `predict` (the second choice) rather than from any evidence in the corpus.

`3. No Long-Range Dependencies:` Bigrams capture word order only within adjacent pairs. They cannot model dependencies or patterns that span more than two consecutive words, which limits their ability to represent complex linguistic structures.
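
One possible way to soften the first two limitations is to score each candidate against both of its neighbours and to apply add-one (Laplace) smoothing so unseen bigrams do not collapse to zero. The sketch below is only an illustration under stated assumptions: the helper names `train`, `smoothed`, and `score`, the toy token list, and the `<s>` marker standing in for a missing left neighbour are all hypothetical.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Count bigrams and collect the vocabulary from a token list."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts, set(tokens)


def smoothed(counts, vocab, prev, word):
    """Add-one (Laplace) estimate of P(word | prev); never exactly zero."""
    following = counts.get(prev, Counter())
    return (following[word] + 1) / (sum(following.values()) + len(vocab))


def score(counts, vocab, left, candidate, right):
    """Product of P(candidate | left) and P(right | candidate)."""
    return (smoothed(counts, vocab, left, candidate)
            * smoothed(counts, vocab, candidate, right))


# Toy, already-preprocessed tokens (an assumption for this sketch)
tokens = "latter best friend call later evening latter best choice".split()
counts, vocab = train(tokens)

# "The MASK is my best friend": no content word on the left, "best" on the right.
# "<s>" is a hypothetical placeholder for the missing left neighbour.
for cand in ("later", "latter"):
    print(cand, round(score(counts, vocab, "<s>", cand, "best"), 4))
```

With these toy counts the right-hand neighbour "best" favours `latter`, whereas a left-only bigram lookup would see no evidence for either word.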