# Assignment 2 - Introduction to NLTK

In Part 1 of this assignment you will use NLTK to explore the <a href='http://www.cs.cmu.edu/~ark/personas/'>CMU Movie Summary Corpus</a>. All data is released under a <a href='https://creativecommons.org/licenses/by-sa/3.0/us/legalcode'>Creative Commons Attribution-ShareAlike License</a>. In Part 2 you will create spelling recommender functions that use NLTK to find correctly spelled words similar to a misspelling.

## Part 1 - Analyzing Plots Summary Text

In [6]:
import nltk
import pandas as pd
import numpy as np

nltk.data.path.append("assets/")

# If you would like to work with the raw text you can use 'plots_raw'
with open('assets/plots.txt', 'rt', encoding="utf8") as f:
    plots_raw = f.read()

# If you would like to work with the plot summaries in nltk.Text format you can use 'text1'.
plots_tokens = nltk.word_tokenize(plots_raw)
text1 = nltk.Text(plots_tokens)

### Example 1

How many tokens (words and punctuation symbols) are in text1?

*This function should return an integer.*

In [3]:
def example_one():
    
    return len(nltk.word_tokenize(plots_raw)) # or alternatively len(text1)

example_one()

374441

### Example 2

How many unique tokens (unique words and punctuation) does text1 have?

*This function should return an integer.*

In [4]:
def example_two():
    
    return len(set(nltk.word_tokenize(plots_raw))) # or alternatively len(set(text1))

example_two()

25933

### Example 3

After lemmatizing the verbs, how many unique tokens does text1 have?

*This function should return an integer.*

In [5]:
from nltk.stem import WordNetLemmatizer

def example_three():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

example_three()

21760

### Question 1

What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

*This function should return a float.*

In [6]:
def answer_one():
    
    aux = example_two()
    aux_two = example_one()
    
    return aux/aux_two

In [8]:
answer_one()

0.06925790712021386
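
As a quick sanity check of the ratio, here is a minimal sketch on a toy token list. The list and the helper name `lexical_diversity` are made up for illustration and are not part of the assignment data.

```python
# Toy token list standing in for text1 (hypothetical data, not from the corpus)
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

def lexical_diversity(tokens):
    """Return the ratio of unique tokens to the total number of tokens."""
    return len(set(tokens)) / len(tokens)

# "the" appears twice, so there are 6 unique tokens out of 7
diversity = lexical_diversity(tokens)
print(diversity)
```

A value close to 1 means almost every token is distinct; the much lower value for the plot summaries reflects how often common words and punctuation repeat.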

### Question 2

What percentage of tokens is 'love' or 'Love'?

*This function should return a float.*

In [7]:
def answer_two():
    dist = nltk.FreqDist(text1)
    return (((dist['Love']+dist['love'])*100)/len(text1))

answer_two()

0.12391805384559917

### Question 3

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?

*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [8]:
def answer_three():
    return  nltk.FreqDist(text1).most_common(20)

answer_three()

[(',', 19420),
 ('the', 18698),
 ('.', 16624),
 ('to', 12149),
 ('and', 11400),
 ('a', 8979),
 ('of', 6510),
 ('is', 5699),
 ('in', 5109),
 ('his', 4693),
 ("'s", 3682),
 ('her', 3674),
 ('he', 3556),
 ('that', 3517),
 ('with', 3293),
 ('him', 2570),
 ('for', 2433),
 ('by', 2321),
 ('The', 2234),
 ('on', 1925)]

### Question 4

What tokens have a length of greater than 5 and frequency of more than 200?

*This function should return an alphabetically sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [43]:
def answer_four():
    
    dist = nltk.FreqDist(text1)
    list_aux = []
    vector_aux = dist.keys()

    for a in vector_aux:
        if dist[a] > 200 and len(a) > 5:
            list_aux.append(a)
    
    return sorted(list_aux)

answer_four()

['However',
 'Meanwhile',
 'another',
 'because',
 'becomes',
 'before',
 'begins',
 'daughter',
 'decides',
 'escape',
 'family',
 'father',
 'friend',
 'friends',
 'himself',
 'killed',
 'leaves',
 'mother',
 'people',
 'police',
 'returns',
 'school',
 'through']
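
The same filter can probably be written as a single comprehension. The sketch below uses `collections.Counter` on a hypothetical toy list (an `nltk.FreqDist` behaves like a `Counter`), with the thresholds lowered so that the toy data has something to match.

```python
from collections import Counter

# Hypothetical toy tokens, not taken from the corpus
tokens = ["police", "family", "police", "cat", "family", "police", "friend"]
counts = Counter(tokens)

def frequent_long_tokens(counts, min_len, min_freq):
    """Return tokens longer than min_len that occur more than min_freq times, sorted."""
    return sorted(t for t, c in counts.items() if len(t) > min_len and c > min_freq)

# 'friend' is long enough but occurs only once; 'cat' is too short
result = frequent_long_tokens(counts, min_len=5, min_freq=1)
print(result)  # ['family', 'police']
```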

### Question 5

Find the longest token in text1 and that token's length.

*This function should return a tuple `(longest_word, length)`.*

In [66]:
def answer_five():

    dist = nltk.FreqDist(text1)
    vector_aux = dist.keys()
    list_aux = sorted(vector_aux, key=len) 
    
    return (list_aux[-1], len(list_aux[-1]))

answer_five()

('live-for-today-for-tomorrow-we-die', 34)
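
Sorting every token just to read off the last one works, but `max` with `key=len` finds the longest token in a single pass. A minimal sketch on a hypothetical toy list:

```python
# Hypothetical toy tokens, not taken from the corpus
tokens = ["he", "live-for-today", "escape", "tomorrow-we-die"]

def longest_token(tokens):
    """Return (longest_token, length); on ties the first longest token wins."""
    longest = max(tokens, key=len)
    return (longest, len(longest))

print(longest_token(tokens))  # ('tomorrow-we-die', 15)
```

If several tokens share the maximum length, which one is returned depends on iteration order, so the result over a set or dictionary may not be stable between runs.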

### Question 6

What unique words have a frequency of more than 2000? What is their frequency?

"Hint:  you may want to use `isalpha()` to check if the token is a word and not punctuation."

*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*

In [78]:
def answer_six():
    
    dist = nltk.FreqDist(text1)
    list_aux = []
    vector_aux = dist.keys()

    for a in vector_aux:
        if dist[a] > 2000 and a.isalpha():
            list_aux.append((dist[a], a))

    return sorted(list_aux, reverse=True)

answer_six()

[(18698, 'the'),
 (12149, 'to'),
 (11400, 'and'),
 (8979, 'a'),
 (6510, 'of'),
 (5699, 'is'),
 (5109, 'in'),
 (4693, 'his'),
 (3674, 'her'),
 (3556, 'he'),
 (3517, 'that'),
 (3293, 'with'),
 (2570, 'him'),
 (2433, 'for'),
 (2321, 'by'),
 (2234, 'The')]

In [70]:
dist = nltk.FreqDist(text1)
# isalpha() is a str method, not a FreqDist method, so it must be called on each token
alpha_tokens = [token for token in dist if token.isalpha()]

### Question 7

`text1` is in `nltk.Text` format that has been constructed using tokens output by `nltk.word_tokenize(plots_raw)`. 

Now, use `nltk.sent_tokenize` on the tokens in `text1` by joining them using whitespace to output a sentence-tokenized copy of `text1`. Report the average number of whitespace separated tokens per sentence in the sentence-tokenized copy of `text1`.

*This function should return a float.*

In [20]:
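def answer_seven():
    # Sentences are taken from plots_raw here; tokenizing ' '.join(text1),
    # as the question describes, may split slightly differently.
    sentences = nltk.sent_tokenize(plots_raw)
    # Average number of tokens per sentence
    return len(text1) / len(sentences)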

answer_seven()

22.31737990225295
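
To make the averaging step concrete, here is a small sketch that assumes the sentences have already been split. The two sentences are hand-written stand-ins for the output of `nltk.sent_tokenize`, and `average_tokens_per_sentence` is a hypothetical helper name.

```python
# Hand-written stand-ins for sentence-tokenized text (tokens joined by spaces)
sentences = [
    "Her older sister Katniss volunteers to take her place .",
    "The tributes must fight .",
]

def average_tokens_per_sentence(sentences):
    """Average number of whitespace-separated tokens per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)

# 10 tokens in the first sentence, 5 in the second
average = average_tokens_per_sentence(sentences)
print(average)  # 7.5
```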

In [9]:
from nltk.tokenize import sent_tokenize
senteces = sent_tokenize(plots_raw)
# Lazily count the word tokens in each sentence
counts = (len(nltk.word_tokenize(sentence)) for sentence in senteces)

senteces

["Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.",
 'The nation of Panem consists of a wealthy Capitol and twelve poorer districts.',
 'As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games.',
 'The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth.',
 'In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12.',
 'Her older sister Katniss volunteers to take her place.',
 "Peeta Mellark, a baker's son who once gave Katniss bread when she was starving, is the other District 12 tribute.",
 'Katniss and Peeta are taken to the Capitol, accompanied by their frequently drunk mentor, past victor Haymitch Abernathy.',
 'He warns them about the "Career" tributes who train intensively at special a

In [17]:
# word_tokenize expects one string, so pass each sentence rather than the whole list
counts = (len(nltk.word_tokenize(sentece)) for sentece in senteces)

total_tokens = sum(counts)

In [10]:
len(senteces)

16778

### Question 8

What are the 5 most frequent parts of speech in `text1`? What is their frequency?

*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [21]:
def answer_eight():
    import collections
    pos_token = nltk.pos_tag(text1)
    pos_counts = collections.Counter((subl[1] for subl in pos_token))
    return pos_counts.most_common(5)

answer_eight()

[('NN', 51452), ('IN', 39225), ('NNP', 38361), ('DT', 34471), ('VBZ', 23799)]
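
The counting step does not depend on the tagger itself. The sketch below uses hand-written `(token, tag)` pairs in the shape that `nltk.pos_tag` returns; the tags are illustrative and were not produced by running the tagger.

```python
from collections import Counter

# Hand-written (token, tag) pairs; illustrative only, not real tagger output
tagged = [
    ("The", "DT"), ("tributes", "NNS"), ("fight", "VBP"), ("in", "IN"),
    ("an", "DT"), ("arena", "NN"), (".", "."),
]

# Count only the tag (second element) of each pair
tag_counts = Counter(tag for _, tag in tagged)
top = tag_counts.most_common(1)
print(top)  # [('DT', 2)]
```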

## Part 2 - Spelling Recommender

For this part of the assignment you will create three different spelling recommenders, each of which takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest distance* and starts with the same letter as the misspelled word, and return that word as a recommendation.

*Each of the three different recommenders will use a different distance measure (outlined below).

Each of the recommenders should provide recommendations for the three default words provided: `['cormulent', 'incendenece', 'validrate']`.

In [23]:
from nltk.corpus import words

correct_spellings = words.words()

### Question 9

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the trigrams of the two words.**

Refer to:
- [NLTK Jaccard distance](https://www.nltk.org/api/nltk.metrics.distance.html?highlight=jaccard_distance#nltk.metrics.distance.jaccard_distance)
- [NLTK ngrams](https://www.nltk.org/api/nltk.util.html?highlight=ngrams#nltk.util.ngrams)

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [24]:
def answer_nine(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams

    spellings_series = pd.Series(correct_spellings)
    correct = []
    for entry in entries:
        # Only consider dictionary words that start with the same letter as the entry
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        # Pair each candidate with its Jaccard distance on character trigrams
        distances = ((jaccard_distance(set(ngrams(entry, 3)), set(ngrams(word, 3))), word) for word in spellings)
        # min() compares the distance first and falls back to the word on ties
        closest = min(distances)
        correct.append(closest[1])

    return correct
    
answer_nine()

['corpulent', 'indecence', 'validate']
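
To see why 'corpulent' comes out as the closest match, the trigram distance can be worked through by hand. The sketch below is a pure-Python stand-in for `nltk.util.ngrams` and `nltk.metrics.distance.jaccard_distance`; the helper names and the four-word vocabulary are hypothetical.

```python
def char_ngrams(word, n):
    """Return the set of character n-grams of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_dist(a, b):
    """1 - |intersection| / |union|; assumes at least one set is non-empty."""
    return 1 - len(a & b) / len(a | b)

def recommend(entry, vocabulary, n=3):
    """Closest word by n-gram Jaccard distance among words sharing the first letter."""
    candidates = [w for w in vocabulary if w.startswith(entry[0])]
    return min(candidates, key=lambda w: jaccard_dist(char_ngrams(entry, n), char_ngrams(w, n)))

# Hypothetical mini vocabulary standing in for correct_spellings
vocabulary = ["corpulent", "cormus", "validate", "valid"]

# The two words share 4 trigrams (cor, ule, len, ent) out of 10 distinct ones
distance = jaccard_dist(char_ngrams("cormulent", 3), char_ngrams("corpulent", 3))
print(distance)                             # 0.6
print(recommend("cormulent", vocabulary))   # corpulent
```

On this toy vocabulary 'cormus' shares three of eight distinct trigrams with 'cormulent' (distance 0.625), so 'corpulent' wins narrowly.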

### Question 10

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Jaccard distance](https://en.wikipedia.org/wiki/Jaccard_index) on the 4-grams of the two words.**

Refer to:
- [NLTK Jaccard distance](https://www.nltk.org/api/nltk.metrics.distance.html?highlight=jaccard_distance#nltk.metrics.distance.jaccard_distance)
- [NLTK ngrams](https://www.nltk.org/api/nltk.util.html?highlight=ngrams#nltk.util.ngrams)

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [26]:
def answer_ten(entries=['cormulent', 'incendenece', 'validrate']):
    from nltk.metrics.distance import jaccard_distance
    from nltk.util import ngrams

    spellings_series = pd.Series(correct_spellings)
    correct = []
    for entry in entries:
        # Only consider dictionary words that start with the same letter as the entry
        spellings = spellings_series[spellings_series.str.startswith(entry[0])]
        # Pair each candidate with its Jaccard distance on character 4-grams
        distances = ((jaccard_distance(set(ngrams(entry, 4)), set(ngrams(word, 4))), word) for word in spellings)
        # min() compares the distance first and falls back to the word on ties
        closest = min(distances)
        correct.append(closest[1])

    return correct
    
answer_ten()

['cormus', 'incendiary', 'valid']

### Question 11

For this recommender, your function should provide recommendations for the three default words provided above using the following distance metric:

**[Edit distance on the two words with transpositions.](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)**

Refer to:
- [NLTK edit distance](https://www.nltk.org/api/nltk.metrics.distance.html?highlight=edit_distance#nltk.metrics.distance.edit_distance)

*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [None]:
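def answer_eleven(entries=['cormulent', 'incendenece', 'validrate']):
    # One possible implementation, following the same pattern as answer_nine
    from nltk.metrics.distance import edit_distance
    correct = []
    for entry in entries:
        # Only consider dictionary words that start with the same letter as the entry
        candidates = (word for word in correct_spellings if word.startswith(entry[0]))
        # transpositions=True counts a swap of two adjacent letters as a single edit
        distances = ((edit_distance(entry, word, transpositions=True), word) for word in candidates)
        correct.append(min(distances)[1])

    return correct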
answer_eleven()
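
For intuition about what an edit distance with transpositions measures, here is a pure-Python sketch of the optimal string alignment variant. It is an illustration only: the function name `osa_distance` is hypothetical, and NLTK's `edit_distance` may give different values on some inputs where the same characters are edited more than once.

```python
def osa_distance(s, t):
    """Edit distance allowing insertion, deletion, substitution,
    and transposition of two adjacent characters (each costs 1)."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of s
    for j in range(cols):
        d[0][j] = j  # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("validrate", "validate"))  # 1 (delete the extra 'r')
print(osa_distance("ab", "ba"))               # 1 (one transposition)
```

Without the transposition step, "ab" to "ba" would cost 2, which is the difference the `transpositions=True` flag is meant to capture.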