# Week 10 Problem 1

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run All*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

## Author: Radhir Kothuri
### Primary Reviewer: Apurv Garg

# Due Date: 6 PM, April 2, 2018

In [1]:
%matplotlib inline
import matplotlib as mpl
import random
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numpy.testing import assert_array_equal
from nose.tools import assert_equal, assert_true, assert_almost_equal, assert_is_instance, assert_is_not
import nltk, re
from nltk import pos_tag
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
from sklearn.datasets import load_files
from sklearn import metrics
from collections import Counter
# Suppress all warnings to keep the notebook output readable
import warnings
warnings.filterwarnings("ignore")

## Question 1

In this question, we will explore collocations, using bigrams and the Pointwise Mutual Information (PMI) measure to return the top `k` bigrams in the input text data.

- Finish the function `top_k` that takes in 2 parameters: `text_data`, a corpus of text data that is already tokenized, and `k`, an integer giving the number of top bigrams to return.
- Compute the `k` highest-scoring bigrams under PMI and return them as a list of `2-tuples`, where each element of a `2-tuple` is a string.

In [2]:
def top_k(text_data, k):
    '''    
    Return the top k bigrams ranked by PMI
    
    Parameters
    ----------
    text_data: list of strings
    k: An int
    
    Returns
    -------
    A list of 2-tuples where each element in 2-tuple is a string
    '''
    # YOUR CODE HERE
    bigram_measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(text_data)
    bgs = finder.nbest(bigram_measures.pmi, k)
    
    return bgs

In [3]:
mvr = nltk.corpus.movie_reviews
top_k_bigrams = top_k(list(mvr.words()[:1000]), 6)
assert_equal(len(top_k_bigrams), 6)
for bigram in top_k_bigrams:
    assert_true(type(bigram[0]) is str)
    assert_true(type(bigram[1]) is str)
assert_equal(top_k_bigrams[0], ('action', 'sequences'))
assert_equal(top_k_bigrams[-1], ('baldwin', 'brother'))
top_k_bigrams = top_k(list(mvr.words()[1000:2000]), 8)
assert_equal(len(top_k_bigrams), 8)
for bigram in top_k_bigrams:
    assert_true(type(bigram[0]) is str)
    assert_true(type(bigram[1]) is str)
assert_equal(top_k_bigrams[0], ('20th', 'century'))
assert_equal(top_k_bigrams[-1], ('big', 'pink'))

## Question 2

In this question, we will explore part of speech tagging. Specifically, we will write a function that takes `text_data` and returns all the words tagged with the part of speech given by the input variable `classifier`.

- Finish the function `get_part_of_speech` that takes in 2 variables: `text_data`, a corpus of text data tokenized by white space, and `classifier`, a string that is one of the [12 `universal part of speech target` tags](http://www.nltk.org/book/ch05.html) (see section 2.3 of the linked chapter).
- For this question, however, we will only pass in 'NOUN', 'VERB', or 'NUM' as the `classifier`.
- Use part of speech tagging to return a list of strings based on the `classifier` passed in. For example, if the classifier is `NOUN`, tag the text, iterate through all the tagged words, and return a list containing only the words tagged `NOUN`.
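
The filtering step described above can be sketched without running a tagger. In this illustration the `tagged` list is hand-written to stand in for tagger output (an assumption, not real output from `pos_tag`), and the helper name `words_with_tag` is hypothetical.

```python
# Hand-written (word, tag) pairs standing in for tagger output.
# These tags are assumed for illustration, not produced by NLTK.
tagged = [
    ("the", "DET"),
    ("movie", "NOUN"),
    ("earned", "VERB"),
    ("10", "NUM"),
    ("awards", "NOUN"),
]

def words_with_tag(tagged_pairs, tag):
    """Return the words whose tag equals `tag`, in their original order."""
    return [word for word, word_tag in tagged_pairs if word_tag == tag]

nouns = words_with_tag(tagged, "NOUN")
numbers = words_with_tag(tagged, "NUM")
```

Each tagged item is a `(word, tag)` pair, so the filter compares the second element and keeps the first.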

In [21]:
def get_part_of_speech(text_data, classifier):
    '''    
    Return a list of strings based on the value of `classifier`
    
    Parameters
    ----------
    text_data: list of strings
    classifier: A string
    
    Returns
    -------
    A list of strings
    '''
    # YOUR CODE HERE
    
    tagged = pos_tag(text_data, tagset='universal')

    a = list()
    for item in tagged:
        if item[1] == classifier:
            a.append(item[0])
    
    return a

In [22]:
mvr = nltk.corpus.movie_reviews
all_numbers = get_part_of_speech(list(mvr.words()[:1000]), 'NUM')
all_verbs = get_part_of_speech(list(mvr.words()[:1000]), 'VERB')
assert_true('10' in all_numbers)
for val in all_numbers:
    assert_true(type(val) is str)
    assert_true(val not in all_verbs)
all_nouns = get_part_of_speech(list(mvr.words()[1000:2000]), 'NOUN')
for val in all_nouns:
    assert_true(type(val) is str)

## Question 3

In this question, we will explore tagged text extraction. Instead of a `classifier` variable, we will use regular expressions to retrieve the words with the parts of speech that we want, along with a count for each of those parts of speech.

- Finish the function `get_tagged_text` that takes in the parameter `text_data`, a corpus of text data tokenized by white space. 
- The function will return 2 data structures:
    - A list of the words whose part of speech tag is either a plural noun (NNS), a proper noun (NNP), a past tense verb (VBD), or a foreign word (FW)
    - A dictionary that maps each of the above part of speech tags to the number of times it appears in `text_data`. For example, if after the tagging stage there are 5 NNS, 4 NNP, and 3 VBD words, the dictionary should map `NNS`: 5, `NNP`: 4, and `VBD`: 3.
    - Return the 2 data structures as a 2-tuple with the list of strings as the first element and the dictionary as the second element
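
The two return values can be illustrated on a small hand-tagged sample. This is only a sketch under stated assumptions: the `tagged` list is written by hand rather than produced by a tagger, and `TAG_PATTERN` and `extract_tagged` are hypothetical names. It uses `re.fullmatch` so that a longer tag such as `NNPS` is not mistaken for `NNP`.

```python
import re
from collections import Counter

# Matches exactly one of the four target tags
TAG_PATTERN = re.compile(r"NNS|NNP|VBD|FW")

# Hand-written (word, tag) pairs standing in for tagger output (assumed)
tagged = [
    ("critics", "NNS"),
    ("loved", "VBD"),
    ("Titanic", "NNP"),
    ("the", "DT"),
    ("films", "NNS"),
    ("Smiths", "NNPS"),
]

def extract_tagged(tagged_pairs):
    """Return (matching words, {tag: count}) for the target tags."""
    matches = [(word, tag) for word, tag in tagged_pairs
               if TAG_PATTERN.fullmatch(tag)]
    words = [word for word, _ in matches]
    counts = dict(Counter(tag for _, tag in matches))
    return words, counts

words, counts = extract_tagged(tagged)
```

Note that the list holds the words themselves, while the dictionary is keyed by tag.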

In [34]:
def get_tagged_text(text_data):
    '''    
    Use part of speech tagging with regular expressions in order 
    to return all words matching the particular regular expression as well as
    a dictionary mapping the part of speech to the number of appearances in 
    text_data
    
    Parameters
    ----------
    text_data: list of strings
    
    Returns
    -------
    A 2-tuple of a list of strings and a dictionary
    '''
    # YOUR CODE HERE
    
    ptgs = pos_tag(text_data)
    
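    words = list()  # words whose tag is one of the four targets
    tags = list()   # the matching tags, kept only for counting
    for word, tag in ptgs:
        if tag in ['NNS', 'NNP', 'VBD', 'FW']:
            words.append(word)
            tags.append(tag)

    tc = Counter(tags)
    
    return words, dict(tc)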

In [35]:
mvr = nltk.corpus.movie_reviews
tagged_text, mapped_text = get_tagged_text(list(mvr.words()[:1000]))
assert_true(type(tagged_text) is list)
assert_true(type(mapped_text) is dict)
for text in tagged_text:
    assert_true(type(text) is str)
for key in mapped_text:
    assert_true(type(key) is str)
    assert_true(type(mapped_text[key]) is int)
for key in mapped_text.keys():
    assert_true(key in ['NNS', 'NNP', 'VBD', 'FW'])