## NLP Class 3 Exercise 1:
- You are given the following sentence: *"I have a `fazt` car"*, where `fazt` is the misspelled word.
- Create a bigram model that looks at the word preceding the misspelled word and suggests a correction.
- Create a trigram model that looks at both the preceding and the following word and suggests a correction.
- Improve your spelling suggestions by considering only tokens within an edit distance of 1 of the misspelled word.  
**Suggestions:** 
- You can use either the NLTK Brown Corpus or the NLTK Reuters Corpus to build the model.

`Brown Corpus`: a collection of texts compiled at Brown University from a wide range of sources, all written in 1961. It includes texts from 500 sources and covers ~1.1M words; the sources are categorized by genre, for example news, editorial, adventure fiction, mystery fiction, and romance.  
`Reuters Corpus`: contains ~10K news documents totaling over 1.7 million words. The documents come from the Reuters newswire in the late 1980s and are classified into 90 topics, so the corpus is often used for text categorization experiments.
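
The edit-distance constraint from the exercise can be tried out without any corpus. The following is a minimal sketch, assuming a plain Levenshtein distance (insertions, deletions, substitutions); the helper names `levenshtein` and `within_one_edit` and the toy vocabulary are hypothetical and not part of the notebook. The notebook itself imports `edit_distance` from `nltk.metrics.distance`, which could be used in place of the hand-written function.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (hypothetical helper for illustration)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def within_one_edit(misspelled, vocabulary):
    """Return vocabulary words at most one edit away, excluding the word itself."""
    return sorted(
        w for w in set(vocabulary)
        if w != misspelled and levenshtein(misspelled, w) <= 1
    )


# Toy vocabulary, assumed for illustration only
toy_vocab = ["fast", "fat", "fact", "faze", "red", "new", "fazt"]
print(within_one_edit("fazt", toy_vocab))
```

Filtering the vocabulary this way before ranking by n-gram frequency keeps unrelated but frequent words (such as `new` or `red` in the toy list) out of the suggestions.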

In [3]:
# %pip install nltk==3.8.1

import nltk
nltk.download('reuters', halt_on_error=False)
nltk.download('brown', halt_on_error=False)

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/ziyuanye/nltk_data...
[nltk_data] Downloading package brown to /Users/ziyuanye/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

## Spelling correction using Brown Corpus

In [2]:
import re
import nltk
from nltk import ngrams, bigrams, trigrams
from nltk.corpus import reuters, brown
from nltk.metrics.distance import edit_distance
from collections import Counter, defaultdict

import pandas as pd

# Download the necessary datasets
# nltk.download('punkt')
# nltk.download('brown')

### Create most likely corrections, purely based on ngram frequency

In [4]:
# Get tokens from NLTK Corpus
brown_words = brown.words() # for Brown corpus
reuters_words = reuters.words() # for Reuters corpus
print(f'Corpus size brown_words: {len(brown_words):,}')
print(f'Corpus size reuters_words: {len(reuters_words):,}')

# Choose your preferred corpus
words = [word.lower() for word in brown_words] 
# words = [word.lower() for word in reuters_words]

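# Eliminate punctuation tokens from the corpus.
# Note: this pattern matches single-character tokens only, so multi-character
# punctuation tokens such as '' are kept (one shows up in the trigram table below).
# Using r'[^\w\s]+' instead would drop those as well.
filtered_words = [word for word in words if not re.fullmatch(r'[^\w\s]', word)]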

# Create bigrams and trigrams from the corpus
corpus_bigrams = list(bigrams(filtered_words))
corpus_trigrams = list(trigrams(filtered_words))

# Get frequencies of bigrams and trigrams in the corpus
bigram_freq = Counter(corpus_bigrams)
trigram_freq = Counter(corpus_trigrams)

Corpus size brown_words: 1,161,192
Corpus size reuters_words: 1,720,901
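
One way to turn bigram counts like these into a suggestion is to collect every word that follows the preceding word and rank those words by frequency. The sketch below assumes a tiny hand-written token list instead of the Brown corpus so that it runs on its own; `toy_tokens`, `toy_bigram_freq`, and `suggest_from_bigrams` are hypothetical names, and `zip` stands in for `nltk.bigrams`.

```python
from collections import Counter

# Toy corpus, assumed for illustration only (not the Brown corpus)
toy_tokens = (
    "i have a fast car she has a fast car he has a red car a fast train"
).split()

# Count adjacent word pairs, analogous to Counter(bigrams(filtered_words))
toy_bigram_freq = Counter(zip(toy_tokens, toy_tokens[1:]))


def suggest_from_bigrams(prev_word, bigram_counts, top_n=3):
    """Rank the words that follow prev_word by their bigram frequency."""
    followers = Counter({
        w2: count
        for (w1, w2), count in bigram_counts.items()
        if w1 == prev_word
    })
    return followers.most_common(top_n)


# Candidates for the slot after "a" in "I have a fazt car"
print(suggest_from_bigrams("a", toy_bigram_freq))
```

On the real corpus the same function could be called with `bigram_freq`, although the top results would then be dominated by whatever most often follows `a`, which is why the exercise adds the edit-distance constraint.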


In [5]:
# Create a DataFrame for the bigrams
df_bigram = pd.DataFrame(list(bigram_freq.items()), columns=['Bigram', 'Frequency'])
df_bigram.sort_values(by='Frequency', ascending=False, inplace=True)
display(df_bigram.head(5))

# Create a DataFrame for the trigrams
df_trigram = pd.DataFrame(list(trigram_freq.items()), columns=['Trigram', 'Frequency'])
df_trigram.sort_values(by='Frequency', ascending=False, inplace=True)
display(df_trigram.head(5))

| | Bigram | Frequency |
|---|---|---|
| 40 | (of, the) | 9721 |
| 83 | (in, the) | 6041 |
| 158 | (to, the) | 3492 |
| 401 | (on, the) | 2477 |
| 117 | (and, the) | 2247 |


| | Trigram | Frequency |
|---|---|---|
| 334 | (one, of, the) | 404 |
| 6294 | (the, united, states) | 336 |
| 6657 | (as, well, as) | 238 |
| 2263 | ('', he, said) | 222 |
| 1183 | (some, of, the) | 179 |
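
For the trigram part of the exercise, the misspelled word sits in the middle of a (previous word, candidate, next word) triple, so candidates can be ranked by how often each one fills that middle slot. This is a minimal sketch on a toy token list, not the notebook's own solution; `toy_tokens`, `toy_trigram_freq`, and `suggest_from_trigrams` are hypothetical names, and `zip` stands in for `nltk.trigrams`.

```python
from collections import Counter

# Toy corpus, assumed for illustration only (not the Brown corpus)
toy_tokens = (
    "i have a fast car she has a fast car he has a red car a fast train"
).split()

# Count consecutive word triples, analogous to Counter(trigrams(filtered_words))
toy_trigram_freq = Counter(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))


def suggest_from_trigrams(prev_word, next_word, trigram_counts, top_n=3):
    """Rank middle words of trigrams that match both neighbouring words."""
    middles = Counter({
        w2: count
        for (w1, w2, w3), count in trigram_counts.items()
        if w1 == prev_word and w3 == next_word
    })
    return middles.most_common(top_n)


# Candidates for the slot between "a" and "car" in "I have a fazt car"
print(suggest_from_trigrams("a", "car", toy_trigram_freq))
```

Using both neighbours narrows the candidate set compared with the bigram lookup: in the toy data, `a fast train` no longer contributes because the following word is not `car`.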


In [5]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Mon, 26 June 2023 11:43:31'