# Introduction to NLTK

Part 1 uses NLTK to explore Herman Melville's novel Moby Dick. Part 2 develops a spelling recommender that uses NLTK to find correctly spelled words similar to a misspelled word.

## Part 1 - Analyzing Moby Dick

In [2]:
#import the required libraries
import nltk
import pandas as pd
import numpy as np

# If you would like to work with the raw text you can use 'moby_raw'
with open(r'C:\Users\sharm.LAPTOP-118C54MT\OneDrive - York University\Coursera\Course_4\Assignment_2\moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)
moby_tokens[0:20]

In [4]:
# number of tokens (words and punctuation symbols) in text1
len(moby_tokens)  # equivalently len(text1)

255018

In [5]:
# number of unique tokens (unique words and punctuation) in text1
len(set(moby_tokens))  # equivalently len(set(text1))

20754

In [6]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# lemmatize every token as a verb (pos='v')
lemmatized = [lemmatizer.lemmatize(w, 'v') for w in text1]
len(set(lemmatized))  # number of unique tokens after lemmatizing

16899

In [7]:
# lexical diversity: ratio of unique tokens to total tokens
len(set(moby_tokens)) / len(moby_tokens)

0.08138249064771899
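
The ratio above can be wrapped in a small helper so that the text is tokenized only once. This is a minimal sketch: the function name `lexical_diversity` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
def lexical_diversity(tokens):
    """Return unique tokens / total tokens, or 0.0 for an empty sequence."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 8 tokens, 5 of them distinct.
sample = "the whale and the ship and the sea".split()
print(lexical_diversity(sample))  # 5 / 8 = 0.625
```

Passing `moby_tokens` to the helper should reproduce the value shown above.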

In [8]:
# fraction of tokens that are exactly 'whale'; the comparison is case-sensitive,
# so 'Whale' is not counted (compare w.lower() == 'whale' to include it)
len([w for w in moby_tokens if w == 'whale']) / len(moby_tokens)

0.00306645021135763
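
To count both `whale` and `Whale`, each token can be lowercased before the comparison. The following is a minimal sketch on a hypothetical token list; `token_fraction` is an illustrative name, and the result for the full novel is not shown here because it was not computed in the notebook.

```python
def token_fraction(tokens, target):
    """Fraction of tokens equal to `target`, ignoring case."""
    if not tokens:
        return 0.0
    target = target.lower()
    matches = sum(1 for t in tokens if t.lower() == target)
    return matches / len(tokens)


# Hypothetical sample: 'Whale' and 'whale' both count, 2 of 10 tokens.
sample = ["The", "Whale", "is", "a", "whale", ",", "not", "a", "fish", "."]
print(token_fraction(sample, "whale"))  # 0.2
```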

In [9]:
# the 20 most frequently occurring tokens in the text and their frequencies
dist = nltk.FreqDist(moby_tokens)
dist.most_common(20)

[(',', 19204),
 ('the', 13715),
 ('.', 7308),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2111),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]
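
Punctuation takes up several of the top slots in the list above. `nltk.FreqDist` is a subclass of `collections.Counter`, so the same counting can be sketched with the standard library alone. The example below runs on a hypothetical sample sentence and filters with `str.isalpha()` to keep only alphabetic tokens.

```python
from collections import Counter

# Hypothetical pre-tokenized sample; the notebook would use moby_tokens.
sample = "the sea , the ship , the sea , and the whale .".split()

# Count every token, punctuation included.
counts = Counter(sample)

# Count alphabetic tokens only, which drops ',' and '.'.
word_counts = Counter(t for t in sample if t.isalpha())

print(counts.most_common(2))       # [('the', 4), (',', 3)]
print(word_counts.most_common(2))  # [('the', 4), ('sea', 2)]
```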

In [10]:
# tokens with length greater than 5 and frequency of more than 150   
dist = nltk.FreqDist(moby_tokens)
vocab1 = dist.keys()
sorted([w for w in vocab1 if len(w) > 5 and dist[w] > 150])

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

In [11]:
#the longest word in text1 and its length 
max_len = max([len(w) for w in moby_tokens])
max_tup = [(w,len(w)) for w in moby_tokens if len(w) == max_len]
max_tup

[("twelve-o'clock-at-night", 23)]

In [13]:
# alternative approach using pandas (imported above as pd)
distu = nltk.FreqDist(moby_tokens)
moby_frame = pd.DataFrame(distu.most_common(),
                          columns=["token", "frequency"])
length = max(moby_frame.token.str.len())
# the pattern (?P<long>.{N}) matches only tokens with at least N characters
longest = moby_frame.token.str.extractall("(?P<long>.{{{}}})".format(length))
print(longest.long.iloc[0], length)

twelve-o'clock-at-night 23
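
A shorter route to the same kind of answer is the built-in `max` with `key=len`, which returns the first token of maximal length. This is a minimal sketch on a hypothetical token list, not the notebook's own method.

```python
# Hypothetical sample; the notebook would pass moby_tokens instead.
sample = ["Call", "me", "Ishmael", ".", "twelve-o'clock-at-night"]

# max() compares tokens by length and returns the first longest one.
longest = max(sample, key=len)
print((longest, len(longest)))  # ("twelve-o'clock-at-night", 23)
```

If several tokens share the maximum length, this returns only the first, whereas the list comprehension used earlier returns all of them.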


In [14]:
# alphabetic tokens that occur more than 2000 times, with their frequencies
distu = nltk.FreqDist(moby_tokens)
vocab1 = distu.keys()
unq_word = [(distu[w],w) for w in vocab1 if w.isalpha() and distu[w]>2000]
sorted(unq_word, key=lambda tup : tup[0], reverse=True)

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2111, 'I')]

In [15]:
#average number of tokens per sentence
sent_tokens = nltk.sent_tokenize(moby_raw)
len(moby_tokens)/len(sent_tokens)

25.88489646772229

In [16]:
# the 5 most frequent parts of speech in the text and their frequencies
word_tagged = nltk.pos_tag(moby_tokens)
distw = nltk.FreqDist(tag for (word,tag) in word_tagged)
distw.most_common(5)

[('NN', 32730), ('IN', 28658), ('DT', 25870), (',', 19204), ('JJ', 17619)]

## Part 2 - Spelling Recommender

For this part, three spelling recommenders are developed. Each takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender finds the word in `correct_spellings` that starts with the same letter as the misspelled word and has the shortest distance to it, and returns that word as the recommendation.

Each of the three different recommenders will use a different distance measure (outlined below).

In [17]:
# load the list of correctly spelled English words from the NLTK words corpus
from nltk.corpus import words

correct_spellings = words.words()

In [18]:
# Jaccard distance on the character trigrams of the two words, for the
# misspelled words ['cormulent', 'incendenece', 'validrate']
entries=['cormulent', 'incendenece', 'validrate']
temp_list = []
final_list = []
for mistake in entries:
    for word in correct_spellings:
        if word[0] == mistake[0]:
            jd = nltk.jaccard_distance(set(nltk.ngrams(mistake,3)),set(nltk.ngrams(word,3)))
            temp_list.append((word,jd))
    final_list.append((sorted(temp_list, key=lambda tup : tup[1]))[0])
    temp_list.clear()
final_list
     

[('corpulent', 0.6),
 ('indecence', 0.6666666666666666),
 ('validate', 0.5555555555555556)]
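
The Jaccard distance between two sets A and B is (|A ∪ B| - |A ∩ B|) / |A ∪ B|. As a check on the first result above: `cormulent` and `corpulent` each have 7 character trigrams and share 4 of them, so the union has 10 elements and the distance is 6/10 = 0.6. The sketch below reimplements the recommender with the standard library only. The names `char_ngrams`, `jaccard_distance` and `recommend`, the small vocabulary, and the choice to return 0.0 for two empty sets are all assumptions made for illustration.

```python
def char_ngrams(word, n):
    """Set of contiguous character n-grams in `word` (empty if too short)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def jaccard_distance(a, b):
    """(|A u B| - |A n B|) / |A u B|; 0.0 when both sets are empty (assumption)."""
    union = a | b
    if not union:
        return 0.0
    return (len(union) - len(a & b)) / len(union)


def recommend(mistake, vocabulary, n=3):
    """Closest word in `vocabulary` that starts with the same letter, or None."""
    if not mistake:
        return None
    grams = char_ngrams(mistake, n)
    candidates = [w for w in vocabulary if w and w[0] == mistake[0]]
    return min(candidates,
               key=lambda w: jaccard_distance(grams, char_ngrams(w, n)),
               default=None)


# Hypothetical stand-in for correct_spellings.
vocabulary = ["corpulent", "cormus", "validate", "valid", "indecence"]
print(recommend("cormulent", vocabulary))  # corpulent
print(recommend("validrate", vocabulary))  # validate
```

Using `min` instead of sorting the whole candidate list avoids an O(n log n) sort per misspelled word while still returning the first closest match.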

In [19]:
# Jaccard distance on the 4-grams of the two words.
temp_list = []
final_list = []
for mistake in entries:
    for word in correct_spellings:
        if word[0] == mistake[0]:
            jd = nltk.jaccard_distance(set(nltk.ngrams(mistake,4)),set(nltk.ngrams(word,4)))
            temp_list.append((word,jd))
    final_list.append((sorted(temp_list, key=lambda tup : tup[1]))[0])
    temp_list.clear()
[w[0] for w in final_list]


['cormus', 'incendiary', 'valid']
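
The 4-gram results favor shorter words than the trigram results. One way to see why is to compare two candidates for `cormulent` at both n-gram sizes: a single wrong letter near the middle of a word breaks up to four 4-grams, so `corpulent` shares only 2 of 10, while the short word `cormus` shares 2 of 7. The sketch below is illustrative; `ngram_distance` is a hypothetical helper name.

```python
def ngram_distance(a, b, n):
    """Jaccard distance between the character n-gram sets of two words."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    return (len(union) - len(grams_a & grams_b)) / len(union)


mistake = "cormulent"
for candidate in ("corpulent", "cormus"):
    for n in (3, 4):
        print(candidate, n, round(ngram_distance(mistake, candidate, n), 3))
# corpulent 3 0.6
# corpulent 4 0.8
# cormus 3 0.625
# cormus 4 0.714
```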

In [20]:
# Edit distance between the two words, with transpositions allowed
temp_list = []
final_list = []
for mistake in entries:
    for word in correct_spellings:
        if word[0] == mistake[0]:
            ed = nltk.edit_distance(mistake, word, transpositions=True)
            temp_list.append((word, ed))
    final_list.append((sorted(temp_list, key=lambda tup : tup[1]))[0])
    temp_list.clear()

[w[0] for w in final_list]


['corpulent', 'intendence', 'validate']
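
Edit distance counts the insertions, deletions and substitutions needed to turn one word into another. With transpositions allowed, swapping two adjacent characters also costs one edit. The sketch below implements one common variant, the optimal string alignment distance, in plain Python. It is an illustration of the idea and is not claimed to match `nltk.edit_distance` on every input; `osa_distance` is a hypothetical name.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute,
    and swap of two adjacent characters each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            # adjacent transposition, e.g. 'eh' <-> 'he'
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("teh", "the"))              # 1 (one transposition)
print(osa_distance("cormulent", "corpulent"))  # 1 (substitute m -> p)
print(osa_distance("validrate", "validate"))   # 1 (delete r)
```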