# Introduction to NLTK

In Part 1 of this assignment you will use NLTK to explore Herman Melville's novel *Moby Dick*. In Part 2 you will create a spelling recommender function that uses NLTK to find correctly spelled words similar to a misspelling.

## Part 1 - Analyzing Moby Dick

In [1]:
import nltk
import pandas as pd
import numpy as np

# If you would like to work with the raw text you can use 'moby_raw'
with open('moby.txt', 'r') as f:
    moby_raw = f.read()
    
# If you would like to work with the novel in nltk.Text format you can use 'text1'
moby_tokens = nltk.word_tokenize(moby_raw)
text1 = nltk.Text(moby_tokens)

### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [2]:
def example_one():
    
    return len(nltk.word_tokenize(moby_raw)) # or alternatively len(text1)

example_one()

255038

In [3]:
len(moby_tokens)

255038

### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [4]:
def example_two():
    
    return len(set(nltk.word_tokenize(moby_raw))) # or alternatively len(set(text1))

example_two()

20742

### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [5]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

16887

In [6]:
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w,'v') for w in moby_tokens]


In [7]:
my_list = ['I', 'was', 'once', 'lost', 'but', 'now', 'have', 'been' ,'found', 'gardening', ';']
[lemmatizer.lemmatize(w,'v') for w in my_list]

['I', 'be', 'once', 'lose', 'but', 'now', 'have', 'be', 'find', 'garden', ';']

In [8]:
my_list = ['I', 'was', 'once', 'lost', 'but', 'now', 'have', 'been' ,'found', 'gardening', ';']
[lemmatizer.lemmatize(w,'s') for w in my_list]

['I',
 'was',
 'once',
 'lost',
 'but',
 'now',
 'have',
 'been',
 'found',
 'gardening',
 ';']

### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [9]:
def answer_one():
    '''
    Returns: The ratio between unique tokens and all tokens
    '''
    
    total_tokens = len(moby_tokens)
    
    unique_tokens = len(set(moby_tokens))
    
    return unique_tokens/total_tokens

answer_one()

0.08132905684643073
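
As a sanity check on the definition, the same ratio can be computed by hand on a tiny token list. This is a minimal sketch in plain Python; `lexical_diversity` and `toy_tokens` are illustrative names that are not part of the assignment.

```python
def lexical_diversity(tokens):
    """Return unique tokens divided by total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical toy input: 8 tokens, 6 of them distinct.
# 'the' and 'whale' repeat; 'Whale' counts separately because matching is case-sensitive.
toy_tokens = ['the', 'whale', ',', 'the', 'white', 'whale', '!', 'Whale']
print(lexical_diversity(toy_tokens))  # 6 / 8 = 0.75
```

Lowercasing the tokens first would merge 'whale' and 'Whale' and lower the ratio, which is why the notebook's figure depends on the tokenization used.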

In [10]:
print(f"Total tokens = {len(moby_tokens)} Length of unique tokens = {len(set(moby_tokens))}")
    


Total tokens = 255038 Length of unique tokens = 20742


### Question 2

What percentage of tokens is 'whale' or 'Whale'?

*This function should return a float.*

In [11]:
from nltk.probability import FreqDist


In [12]:
def answer_two():
    '''
        Alternative 1: Uses list comprehension to get count through exact match
    '''
    
    whale_tokens = [token for token in moby_tokens if token in ('whale', 'Whale')]
    
    return len(whale_tokens) / len(moby_tokens) * 100

answer_two()

0.41248755087477157

In [13]:
#Instead of working with lists, create a dataframe for all the tasks
def getWordFreqDF():
    '''
        This function sets up a dataframe that will be used by all functions below
        The FreqDist of the tokens dictionary is converted to a dataframe
        The length of the tokens and a flag to indicate if the word is alphabetic or punctuation is added
        The returned dataframe is sorted by frequency count
    '''
   
    from nltk.probability import FreqDist
    moby_freq_dist = FreqDist(moby_tokens)

    word_df = pd.DataFrame.from_dict( moby_freq_dist, orient='index' )
    word_df.reset_index(inplace=True)
    word_df.columns = ['word_token', 'word_freq']
    word_df['word_len'] = word_df['word_token'].apply(len)
    word_df['word_isalpha'] = word_df['word_token'].apply(lambda x: x.isalpha())
    word_df.sort_values(by=['word_freq'], ascending=False, inplace=True)
    return word_df
 

In [14]:
moby_text_df = getWordFreqDF()

In [15]:
moby_text_df

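| | word_token | word_freq | word_len | word_isalpha |
|---|---|---|---|---|
| 26 | , | 19204 | 1 | False |
| 50 | the | 13715 | 3 | True |
| 9 | . | 7306 | 1 | False |
| 53 | of | 6513 | 2 | True |
| 29 | and | 6010 | 3 | True |
| ... | ... | ... | ... | ... |
| 13310 | passionateness | 1 | 14 | True |
| 13311 | ireful | 1 | 6 | True |
| 13312 | aggrieved | 1 | 9 | True |
| 5155 | bruised | 1 | 7 | True |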


In [16]:
def answer_two():
    '''
        Add the frequencies of the words whale and Whale
    '''
   
    moby_freq_dist = FreqDist(moby_tokens)
    whale_freq = moby_freq_dist['whale'] +  moby_freq_dist['Whale'] 
    return (whale_freq)/moby_text_df['word_freq'].sum() * 100
answer_two()

0.41248755087477157

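In [17]:
#Using dataframe
def answer_two():
    # Multiply by 100 so the result is a percentage, matching the versions above
    whale_freq = moby_text_df[moby_text_df['word_token'].isin(['Whale', 'whale'])]['word_freq'].sum()
    return whale_freq / len(moby_tokens) * 100
answer_two()

0.41248755087477157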

### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [18]:
#Using dataframe
def answer_three():
    '''
       Since the dataframe is already sorted, take the top 20 rows.
       zip combines the two columns into a list of (token, frequency) tuples
    '''
   
    top_20 = moby_text_df.head(20)
    return list(zip(top_20['word_token'], top_20['word_freq']))

answer_three()

[(',', 19204),
 ('the', 13715),
 ('.', 7306),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2113),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]

In [19]:
def answer_three():
    moby_freq_dist = FreqDist(moby_tokens)
   
    # sort the (token, frequency) pairs by frequency, highest first
    moby_freq_sorted = sorted(moby_freq_dist.items(), key=lambda item: item[1], reverse=True)
   
    return moby_freq_sorted[:20]


answer_three()

[(',', 19204),
 ('the', 13715),
 ('.', 7306),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2113),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]

In [20]:
freq_dist_tokens = FreqDist(moby_tokens)
freq_dist_tokens.most_common(20)

[(',', 19204),
 ('the', 13715),
 ('.', 7306),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2113),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]
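
`FreqDist` is a subclass of the standard library's `collections.Counter`, so `most_common` can be tried without NLTK. The sketch below uses a made-up token list (`toy_tokens` is hypothetical) to show the shape of the result.

```python
from collections import Counter

# Hypothetical toy input: ',' appears 4 times, 'a' 3 times, 'b' twice, 'c' once.
toy_tokens = ['a', 'b', 'a', ',', 'c', 'a', ',', 'b', ',', ',']

counts = Counter(toy_tokens)

# most_common(n) returns the n highest-count (token, frequency) pairs,
# already sorted in descending order of frequency.
top_two = counts.most_common(2)
print(top_two)  # [(',', 4), ('a', 3)]
```

As in the notebook's output, punctuation is counted like any other token unless it is filtered out first.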

### Question 4

What tokens have a length of greater than 5 and frequency of more than 150?

*This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [21]:
#Using dataframe
def answer_four():
    
    to_return_df = moby_text_df[(moby_text_df['word_freq'] > 150) & (moby_text_df['word_len'] > 5)]
   
    # sorted() returns the alphabetically sorted list the question asks for
    return sorted(to_return_df['word_token'])

answer_four()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

In [22]:
def answer_four():
    
    to_return = sorted(
                    [item[0] for item in freq_dist_tokens.items()
                     if item[1] > 150 and len(item[0]) > 5])
    
    return to_return

answer_four()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

### Question 5

Find the longest word in text1 and that word's length.

*This function should return a tuple `(longest_word, length)`.*

In [23]:
def answer_five():
    '''
    Since the length of each token is a column in the dataframe, find the index of the row containing 
    the max length to access the word
    '''
   
    max_idx = moby_text_df['word_len'].idxmax()
    return (moby_text_df.loc[max_idx, 'word_token'], moby_text_df.loc[max_idx, 'word_len'])

answer_five()

("twelve-o'clock-at-night", 23)

### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

"Hint:  you may want to use `isalpha()` to check if the token is a word and not punctuation."

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [24]:
def answer_six():
    
    # keep alphabetic tokens only, using the precomputed word_isalpha flag
    to_return_df = moby_text_df[(moby_text_df['word_freq'] > 2000) & moby_text_df['word_isalpha']]
    
    # sort into a new frame rather than modifying a slice in place
    to_return_df = to_return_df.sort_values(by=['word_freq'], ascending=False)
    return list(zip(to_return_df['word_freq'], to_return_df['word_token']))
  
answer_six()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2113, 'I')]

### Question 7

What is the average number of tokens per sentence?

*This function should return a float.*

In [25]:
def answer_seven():
    
    from nltk import sent_tokenize
    from nltk import word_tokenize

    moby_sentences = sent_tokenize(moby_raw)
    moby_word_tokens = word_tokenize(moby_raw)

    
    return len(moby_word_tokens)/len(moby_sentences)

answer_seven()

25.886926512383273
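
The calculation is simply total tokens divided by total sentences. The sketch below reproduces it on a short string, using two regular expressions as rough stand-ins for `sent_tokenize` and `word_tokenize`. These stand-ins are assumptions made for illustration and will not match NLTK's tokenizers on real text (abbreviations, quotes, contractions and so on).

```python
import re

def average_tokens_per_sentence(text):
    """Average token count per sentence using simple regex stand-ins."""
    # Stand-in for sent_tokenize: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    if not sentences:
        return 0.0
    # Stand-in for word_tokenize: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return len(tokens) / len(sentences)

toy_text = "Call me Ishmael. Some years ago, I went to sea! Why?"
print(average_tokens_per_sentence(toy_text))  # 15 tokens / 3 sentences = 5.0
```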

### Question 8

What are the 5 most frequent parts of speech in this text? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [26]:
def answer_eight():
    
    moby_onlywords_df = moby_text_df[moby_text_df['word_isalpha']].copy()
    # pos_tag returns (word, tag) pairs. Note that the input here is the list of unique
    # alphabetic tokens in frequency order, so each word is tagged outside its sentence context.
    moby_onlywords_df['pos_tag'] = nltk.pos_tag(moby_onlywords_df['word_token'].tolist())
    moby_onlywords_df[['word', 'tag']] = pd.DataFrame(moby_onlywords_df['pos_tag'].tolist(), index=moby_onlywords_df.index)   
    
    # Sum word_freq per tag: counting rows would count unique tokens, not occurrences
    tag_df = moby_onlywords_df.groupby(['tag']).agg(tag_freq=('word_freq', 'sum'))
    tag_df = tag_df.reset_index()
    tag_df = tag_df.sort_values(by=['tag_freq'], ascending=False)
    return  list(zip(tag_df['tag'], tag_df['tag_freq']))[:5]
    
df = answer_eight()

In [27]:
df

[('DT', 28037), ('IN', 27240), ('NN', 26565), ('JJ', 20311), ('RB', 14315)]
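
The comment about summing frequencies rather than counting rows can be illustrated on a small example. This sketch uses a hand-tagged sentence (the tags are hypothetical, written in Penn Treebank style rather than produced by `nltk.pos_tag`) and compares three ways of counting.

```python
from collections import Counter

# Hypothetical hand-tagged running text: "the whale saw the ship as the whale dived"
tagged_tokens = [
    ('the', 'DT'), ('whale', 'NN'), ('saw', 'VBD'), ('the', 'DT'), ('ship', 'NN'),
    ('as', 'IN'), ('the', 'DT'), ('whale', 'NN'), ('dived', 'VBD'),
]

# 1. Count every occurrence in the running text.
tag_counts = Counter(tag for _, tag in tagged_tokens)

# 2. Count unique (word, tag) rows only: repeated words are under-counted.
row_counts = Counter(tag for _, tag in set(tagged_tokens))

# 3. Sum the frequency of each unique (word, tag) pair per tag.
summed_counts = Counter()
for (word, tag), freq in Counter(tagged_tokens).items():
    summed_counts[tag] += freq

print(tag_counts['DT'], row_counts['DT'], summed_counts['DT'])  # 3 1 3
```

Summing per-word frequencies recovers the occurrence counts, while counting rows reports only one determiner because 'the' is a single unique token.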

## Part 2 - Spelling Recommender

For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.
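
All three recommenders share the same shape: filter the vocabulary by first letter, then take the word with the minimum distance. The sketch below shows that shape with the distance passed in as a function. The names `recommend`, `mismatch_count` and `toy_vocabulary` are hypothetical, and `mismatch_count` is only a placeholder metric, not one of the three the assignment asks for.

```python
def recommend(misspelled, vocabulary, distance):
    """Return the vocabulary word closest to `misspelled` among words
    that start with the same letter. Raises ValueError if none qualify."""
    candidates = [w for w in vocabulary if w.startswith(misspelled[0])]
    return min(candidates, key=lambda w: distance(w, misspelled))

def mismatch_count(a, b):
    """Placeholder distance: differing positions plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

# Hypothetical five-word vocabulary standing in for correct_spellings.
toy_vocabulary = ['corpulent', 'cormorant', 'validate', 'valid', 'incidence']
print(recommend('cormulent', toy_vocabulary, mismatch_count))  # corpulent
```

Swapping in a Jaccard or edit distance function changes only the `distance` argument. When several words tie, `min` returns the first one in vocabulary order.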

In [28]:

from nltk.corpus import words
from nltk.metrics.distance import edit_distance, jaccard_distance
from nltk.util import ngrams

correct_spellings = words.words()

In [29]:
#create a dataframe of all words
correct_spellings_df = pd.DataFrame(correct_spellings, columns=['correct_word'])

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [30]:
def jacDist(col_word, entry_word, gramno):
    '''
       Utility function to find Jaccard distance between sets of ngrams
    '''
   
    return jaccard_distance(set(ngrams(col_word, gramno)), set(ngrams(entry_word, gramno)))

In [31]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    
    correct_word_list = []
    for entry in entries:
        first_char = entry[0]
        temp_df = correct_spellings_df[correct_spellings_df['correct_word'].str.startswith(first_char)].copy()
        temp_df['jac_dist'] = temp_df['correct_word'].apply(jacDist, args=(entry, 3))
                                    
        correct_word_list.append(temp_df.loc[temp_df['jac_dist'].idxmin()]['correct_word'])
    return correct_word_list
    
answer_nine()

['corpulent', 'indecence', 'validate']
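
To make the metric concrete, here is a plain-Python sketch of Jaccard distance on character n-grams, $1 - |A \cap B| / |A \cup B|$, worked through for one of the default words. The helper names are hypothetical, and this is an illustration of the formula rather than NLTK's implementation.

```python
def char_ngrams(word, n):
    """Return the set of contiguous character n-grams in `word`."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_ngram_distance(a, b, n=3):
    """Jaccard distance between the n-gram sets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:  # both words shorter than n: treat them as identical
        return 0.0
    return 1 - len(grams_a & grams_b) / len(union)

# 'validrate' has 7 trigrams, 'validate' has 6, and they share 4
# ('val', 'ali', 'lid', 'ate'), so the union has 9 and the distance is 1 - 4/9.
print(round(jaccard_ngram_distance('validrate', 'validate', 3), 4))  # 0.5556
```

Longer n-grams leave short words with very few grams, so a single shared gram can dominate the ratio.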

### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [32]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    correct_word_list = []
    for entry in entries:
        first_char = entry[0]
        temp_df = correct_spellings_df[correct_spellings_df['correct_word'].str.startswith(first_char)].copy()
        temp_df['jac_dist'] = temp_df['correct_word'].apply(jacDist, args=(entry, 4))
                                    
        correct_word_list.append(temp_df.loc[temp_df['jac_dist'].idxmin()]['correct_word'])
    return correct_word_list
  
    
answer_ten()

['cormus', 'incendiary', 'valid']

### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

*This function should return a list of length three:
`['cormulent_reccomendation', 'incendenece_reccomendation', 'validrate_reccomendation']`.*

In [33]:
edit_distance('some', 'simi')

2

In [34]:
def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    correct_word_list = []
    for entry in entries:
        first_char = entry[0]
        temp_df = correct_spellings_df[correct_spellings_df['correct_word'].str.startswith(first_char)].copy()
        temp_df['edit_dist'] = temp_df['correct_word'].apply(lambda x: edit_distance(x, entry, transpositions=True))
                                    
        correct_word_list.append(temp_df.loc[temp_df['edit_dist'].idxmin()]['correct_word'])
    return correct_word_list
    
answer_eleven()

['corpulent', 'intendence', 'validate']
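
For reference, an edit distance that counts a swap of two adjacent characters as a single operation can be sketched with dynamic programming. The version below is the optimal string alignment variant, written in plain Python under the hypothetical name `osa_distance`. It is an illustration of the idea and is not claimed to match NLTK's `edit_distance` in every case.

```python
def osa_distance(a, b):
    """Edit distance allowing insertion, deletion, substitution,
    and transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

print(osa_distance('whale', 'wahle'))  # 1: one adjacent swap
print(osa_distance('some', 'simi'))    # 2: two substitutions
```

Without the transposition step, 'whale' to 'wahle' would cost 2 (two substitutions), which is the difference the question's metric is meant to capture.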