# Week 10 Problem 4

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and CheckPoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

7. **If your code does not pass the unit tests, it will not pass the autograder.**

## Author: John Nguyen
### Primary Reviewer: Kelechi Ikegwu

# Due Date: 6 PM, April 2, 2018

In [2]:
import numpy as np
import nltk
import string
from nltk.corpus import stopwords, inaugural
from nltk import pos_tag, FreqDist
from nltk import sent_tokenize, word_tokenize, WhitespaceTokenizer, WordPunctTokenizer
from gensim.models import Word2Vec

from nose.tools import (
    assert_equal,
    assert_is_instance,
    assert_almost_equal,
    assert_true
)
from numpy.testing import assert_array_equal

# Download the stopwords
nltk.download('stopwords');

# Download inaugural address
nltk.download('inaugural');

Using TensorFlow backend.


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/data_scientist/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package inaugural to
[nltk_data]     /home/data_scientist/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


In this assignment, we will apply the NLP techniques you've learned from this week's notebooks to analyze the [Inaugural Address Corpus](http://www.nltk.org/book/ch02.html#inaugural-corpus), a collection of presidential inaugural addresses starting from 1789. The linked documentation describes 55 texts, one for each presidential address; the copy loaded below contains 56.

In [3]:
# View the number of files
print(len(inaugural.fileids()))

# View the list of files
print(inaugural.fileids())

56
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Re

In [4]:
# View Washington's second inauguration address
inaugural.raw('1793-Washington.txt')

'Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n \n'

## Question 1: Tokenizer

Create the wrapper function _tokenizer()_ that tokenizes an input string according to the specified <i>token_type</i>:

- "sentence": Tokenize by sentence.
- "word": Tokenize by word.
- "whitespace": Tokenize by whitespace.
- "wordpunctuation": Tokenize by word and punctuation.

Note: do not remove punctuation. Your function should return a list of strings. You may use any of the built-in functions in the [nltk.tokenize package](http://www.nltk.org/api/nltk.tokenize.html).

__Example:__

- _tokenizer("All your base belongs to us.", "sentence")_ should return ['All your base belongs to us.']
- _tokenizer("All your base belongs to us.", "word")_ should return ['All', 'your', 'base', 'belongs', 'to', 'us', '.']
- _tokenizer("All your base belongs to us.", "whitespace")_ should return ['All', 'your', 'base', 'belongs', 'to', 'us.']
- _tokenizer("All your base belongs to us.", "wordpunctuation")_ should return ['All', 'your', 'base', 'belongs', 'to', 'us', '.']
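
To see how the "whitespace" and "wordpunctuation" options differ, here is a minimal sketch that approximates both with the standard library only. It is not NLTK's implementation: the helper names `whitespace_tokens` and `wordpunct_tokens` are hypothetical, and the regular expression is the pattern the NLTK documentation gives for `WordPunctTokenizer`.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: runs of word
# characters, or runs of characters that are neither word nor space.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

def whitespace_tokens(text):
    # Hypothetical helper: split on any run of whitespace.
    return text.split()

def wordpunct_tokens(text):
    # Hypothetical helper: separate punctuation from the words it touches.
    return WORDPUNCT_PATTERN.findall(text)

sentence = "All your base belongs to us."
print(whitespace_tokens(sentence))  # ['All', 'your', 'base', 'belongs', 'to', 'us.']
print(wordpunct_tokens(sentence))   # ['All', 'your', 'base', 'belongs', 'to', 'us', '.']
```

Punctuation stays attached to the neighboring word under whitespace splitting, but becomes its own token under the word-punctuation pattern.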

In [5]:
def tokenizer(text, token_type):
    '''
    Converts text into tokens.
    
    Parameters
    ----------
    text: a String.
    token_type: a String specifying either a sentence, word, whitespace or word punctuation tokenizer
    
    Returns
    -------
    tokens: a List
    '''

    # YOUR CODE HERE
    
    if token_type == 'sentence' :
        tokens = sent_tokenize(text)
    elif token_type == 'word':
        tokens = word_tokenize(text)
    elif token_type == 'whitespace':
        tokens = WhitespaceTokenizer().tokenize(text)
    else:
        tokens = WordPunctTokenizer().tokenize(text)
 
    return tokens

In [6]:
test1 = tokenizer(inaugural.raw('2001-Bush.txt'), "wordpunctuation")
assert_equal(type(test1), list)
assert_equal(len(test1), 1825)

test2 = tokenizer(inaugural.raw('2001-Bush.txt'), "sentence")
assert_equal(type(test2), list)
assert_equal(len(test2), 97)

test3 = tokenizer(inaugural.raw('1837-VanBuren.txt'), "word")
assert_equal(type(test3), list)
assert_equal(len(test3), 4160)

test4 = tokenizer(inaugural.raw('1941-Roosevelt.txt'), "whitespace")
assert_equal(type(test4), list)
assert_equal(len(test4), 1360)

## Question 2: Part-of-Speech Tagging

President Kennedy famously called Americans to take action and do more for their country in his inaugural address: _"And so, my fellow Americans: ask not what your country can do for you—ask what you can do for your country."_ Let us determine which president has the highest proportion of action words in an address.

Create a function <i>proportion_action_words()</i> that takes tokens from an inaugural address and performs part-of-speech tagging. Because the addresses vary widely in length, your function will return the number of verbs divided by the number of tokens rather than a raw count. Your function should do the following:

- Use the built-in function <i>pos_tag()</i> with _tagset='universal'_ to do POS tagging.
- <i>pos_tag()</i> will return a list of tuples. Iterate through the list and count the number of tags that are "VERB".
- Your function should return the proportion $\frac{\text{number of VERB tags}}{\text{len(tokens)}}$

__Example:__ proportion_action_words(['ask', 'what', 'you', 'can', 'do', 'for', 'your', 'country', '.']) should return approximately 0.3333 (3 of the 9 tokens tagged as verbs).
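
The counting step can be sketched on its own, without running the tagger. The tagged list below is hand-written to resemble `pos_tag(..., tagset='universal')` output for the example sentence; it is an assumption for illustration, and the real tagger may assign different tags. The helper name `verb_proportion` is hypothetical.

```python
# Hand-written (token, tag) pairs standing in for pos_tag output.
# These tags are an assumption, not the result of running the tagger.
tagged_example = [
    ('ask', 'VERB'), ('what', 'PRON'), ('you', 'PRON'),
    ('can', 'VERB'), ('do', 'VERB'), ('for', 'ADP'),
    ('your', 'PRON'), ('country', 'NOUN'), ('.', '.'),
]

def verb_proportion(tagged):
    # Hypothetical helper: fraction of pairs whose tag is 'VERB'.
    n_verbs = sum(1 for _, tag in tagged if tag == 'VERB')
    return n_verbs / len(tagged)

print(round(verb_proportion(tagged_example), 4))  # 0.3333
```

Dividing by the number of tokens is what makes addresses of very different lengths comparable.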

In [7]:
def proportion_action_words(tokens):
    '''
    Compute the proportion of verbs among the tokens.
    
    Parameters
    ----------
    tokens: a list of strings.
    
    Returns
    -------
    result: a Float.
    '''
    # YOUR CODE HERE
    tagged = pos_tag(tokens, tagset='universal')

    # Count the tokens whose universal tag is VERB
    n_verbs = sum(1 for _, tag in tagged if tag == 'VERB')

    result = n_verbs / len(tagged)
    
    return result

In [8]:
test1 = tokenizer(inaugural.raw('1817-Monroe.txt'), "word")
test1_verb_prop = proportion_action_words(test1)
assert_equal(type(test1_verb_prop), float)
assert_almost_equal(test1_verb_prop, 0.1655, 3)

test2 = tokenizer(inaugural.raw('2009-Obama.txt'), "word")
test2_verb_prop = proportion_action_words(test2)
assert_equal(type(test2_verb_prop), float)
assert_almost_equal(test2_verb_prop, 0.1744, 3)

test3 = tokenizer(inaugural.raw('1789-Washington.txt'), "word")
test3_verb_prop = proportion_action_words(test3)
assert_equal(type(test3_verb_prop), float)
assert_almost_equal(test3_verb_prop, 0.1600, 3)

Let's see which address has the highest proportion of verbs. Does the result surprise you?

In [9]:
# Determine which address has the highest proportion of verbs
from operator import itemgetter

call_to_action = []
for address in inaugural.fileids():
    address_tokens = tokenizer(inaugural.raw(address), "word")
    address_verb_prop = proportion_action_words(address_tokens)
    call_to_action.append((address, address_verb_prop))

sorted(call_to_action, key=itemgetter(1), reverse=True)

[('1865-Lincoln.txt', 0.19383825417201542),
 ('1793-Washington.txt', 0.1836734693877551),
 ('1989-Bush.txt', 0.1824600520252694),
 ('1869-Grant.txt', 0.18048780487804877),
 ('1969-Nixon.txt', 0.1762278167560875),
 ('1965-Johnson.txt', 0.17561260210035007),
 ('1913-Wilson.txt', 0.17541070482246954),
 ('1805-Jefferson.txt', 0.17506297229219145),
 ('2009-Obama.txt', 0.17444444444444446),
 ('1861-Lincoln.txt', 0.17254313578394598),
 ('1925-Coolidge.txt', 0.1725225225225225),
 ('1973-Nixon.txt', 0.17248255234297108),
 ('1981-Reagan.txt', 0.17175709665828243),
 ('1945-Roosevelt.txt', 0.17061611374407584),
 ('1977-Carter.txt', 0.17005813953488372),
 ('1821-Monroe.txt', 0.16908904810644831),
 ('1917-Wilson.txt', 0.16908212560386474),
 ('1937-Roosevelt.txt', 0.16725263686589653),
 ('1993-Clinton.txt', 0.1668472372697725),
 ('1921-Harding.txt', 0.16590726346823909),
 ('1873-Grant.txt', 0.16564833672776647),
 ('1817-Monroe.txt', 0.16557911908646003),
 ('1949-Truman.txt', 0.16526946107784432),
 ('

## Question 3: Word Frequency

The function <i>frequent_tokens()</i> will take a document of tokens, remove the punctuation tokens (e.g., ".", "?", etc.), compute the frequency distribution of the remaining tokens, and return the top n tokens with their counts. Your function must return a list of tuples.

__Hint:__

- You can use _string.punctuation_, which is a string of ASCII punctuation characters. You can iterate over the tokens and drop any that are punctuation, or use a one-line list comprehension. Refer to this week's notebook.
- Use the _nltk_ built-in function _FreqDist()_ to compute the frequency count and <i>most_common()</i> to obtain the most frequent tokens. Refer to the following [documentation](http://www.nltk.org/api/nltk.html?highlight=freqdist).
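
A minimal sketch of the filter-then-count idea using only the standard library. `collections.Counter` stands in for `FreqDist` here (`FreqDist` is built on `Counter` and shares `most_common()`), and the helper name `top_tokens` and the sample tokens are made up for illustration.

```python
import string
from collections import Counter

def top_tokens(tokens, n):
    # Hypothetical helper: drop punctuation tokens, then count the rest.
    kept = [tok for tok in tokens if tok not in string.punctuation]
    return Counter(kept).most_common(n)

# Invented sample tokens, not taken from the corpus.
sample = ['the', 'people', ',', 'the', 'union', '.', 'the', 'people', '--']
print(top_tokens(sample, 2))  # [('the', 3), ('people', 2)]
```

Because `string.punctuation` is a string, `tok not in string.punctuation` is a substring test: single punctuation characters are dropped, but a multi-character token such as `'--'` is kept.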

In [14]:
def frequent_tokens(tokens, n):
    '''
    Compute the token frequency distribution and return the top n tokens.
    
    Parameters
    ----------
    tokens: a List of strings.
    n: an int
    
    Returns
    -------
    result: a List of tuples
    '''
    # YOUR CODE HERE
    # Keep only the tokens that are not punctuation
    new_mvr = [wtk for wtk in tokens if wtk not in string.punctuation]
    fdist = FreqDist(new_mvr)
    result = fdist.most_common(n)
    
    return result

In [15]:
test1 = tokenizer(inaugural.raw('1817-Monroe.txt'), "word")
test1_result = frequent_tokens(test1, 10)
assert_equal(type(test1_result), list)
assert_equal(type(test1_result[0]), tuple)
assert_equal(test1_result, [('the', 264), ('of', 162), ('to', 120), ('and', 120),
                                      ('in', 71), ('our', 60), ('a', 58), ('be', 50),
                                      ('it', 45), ('is', 41)])

test2 = tokenizer(inaugural.raw('1885-Cleveland.txt'), "word")
test2_result = frequent_tokens(test2, 5)
assert_equal(type(test2_result), list)
assert_equal(type(test2_result[0]), tuple)
assert_equal(test2_result, [('the', 167), ('of', 117), ('and', 102), ('to', 57), ('a', 29)])

test3 = tokenizer(inaugural.raw('1957-Eisenhower.txt'), "word")
test3_result = frequent_tokens(test3, 8)
assert_equal(type(test3_result), list)
assert_equal(type(test3_result[0]), tuple)
assert_equal(test3_result, [('the', 106), ('of', 96), ('and', 50), ('to', 41),
                            ('in', 39), ('we', 35), ('our', 35), ('all', 26)])

In [16]:
# Create the token documents without the punctuation
inaugural_docs = []

for address in inaugural.fileids():
    address_tokens = tokenizer(inaugural.raw(address), "wordpunctuation")
    inaugural_docs.append([token for token in address_tokens if token not in string.punctuation])

## Question 4: Word2Vec

The function <i>w2v_similarity()</i> will take a document of tokens, a string specifying a word and an integer. Your function should do the following:

- Create a Word2Vec model for the documents with _size=10_, _window=5_, <i>min_count=3</i>, _seed=10_, _workers=1_.
- Using the model, compute the cosine similarity between the input word and the other words in the vocabulary, and return the top n most similar words.

__Note__: By default, [word2vec](https://radimrehurek.com/gensim/models/word2vec.html) is multi-threaded, so setting the seed alone does not guarantee consistent results. The documentation recommends setting the workers to 1 to ensure consistency, but this is not sufficient in our current environment. Each time a kernel is started, Python's string hashing is randomized, so a different hash is generated for the Word2Vec model. To guarantee the same similarity scores, you would need to set this hash explicitly (for example, with the PYTHONHASHSEED environment variable). To avoid the trouble, we will not check the similarity scores, but make sure the model parameters are as shown above.
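
Cosine similarity itself is easy to sketch by hand. The example below uses invented two-dimensional vectors and a hypothetical helper `top_similar`; it only illustrates the ranking idea and is not how gensim implements the lookup.

```python
import math

# Toy 2-dimensional word vectors, invented for illustration only.
toy_vectors = {
    'nation':  [1.0, 0.0],
    'country': [0.9, 0.1],
    'liberty': [0.0, 1.0],
    'freedom': [0.1, 0.9],
}

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_similar(vectors, word, n):
    # Hypothetical helper: rank every other word by similarity to `word`.
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:n]

for other, score in top_similar(toy_vectors, 'nation', 2):
    print(other, round(score, 3))
```

The result is a list of `(word, score)` tuples with scores between -1 and 1, the same shape the unit tests below check for.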

In [11]:
def w2v_similarity(documents, word, n):
    '''
    Create a Word2Vec model and compute the top n similar words to the input
    
    Parameters
    ----------
    documents: a list of lists of tokens.
    word: a String.
    n: an int
    
    Returns
    -------
    scores: a List of tuples
    '''
    # YOUR CODE HERE
    
    model = Word2Vec(documents, size=10, window=5, min_count=3, seed=10, workers=1)
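    # Query the trained word vectors (model.wv); calling most_similar on the
    # model itself is deprecated in later gensim releases
    vals = model.wv.most_similar(word, topn=n)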
    
    return vals

In [14]:
test1 = w2v_similarity(inaugural_docs, "American", 10)
assert_equal(len(test1), 10)
assert_equal(type(test1), list)
assert_equal(type(test1[1]), tuple)
assert_equal(type(test1[1][0]), str)
assert_equal(type(test1[1][1]), float)
assert_equal(test1[1][1] <= 1, True)
assert_equal(test1[1][1] >= -1, True)


test2 = w2v_similarity(inaugural_docs, "citizen", 200)
assert_equal(len(test2), 200)
assert_equal(type(test2), list)
assert_equal(type(test2[101][0]), str)
assert_equal(type(test2[101]), tuple)
assert_equal(type(test2[101][1]), float)
assert_equal(test2[101][1] <= 1, True)
assert_equal(test2[101][1] >= -1, True)