# Week 9 Problem 4

A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select *Kernel*, then restart the kernel and run all cells (*Restart & Run all*).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select *File* → *Save and Checkpoint*).

5. When you are ready to submit your assignment, go to *Dashboard* → *Assignments* and click the *Submit* button. Your work is not submitted until you click *Submit*.

6. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

## Author: Radhir Kothuri
### Primary Reviewer: Kelechi Ikegwu

# Due Date: 6 PM, March 26, 2018

In [1]:
%matplotlib inline
import matplotlib as mpl
import random
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from numpy.testing import assert_array_equal
from nose.tools import assert_equal, assert_true, assert_almost_equal, assert_is_instance, assert_is_not
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_files
from sklearn import metrics
# We do this to ignore several specific warnings
import warnings
warnings.filterwarnings("ignore")

## Question 1

In this question, we will investigate simple text analysis by implementing a simple search algorithm.

- Finish the function `get_words` so that it returns the strings in `text_data` that are either 5 or 6 characters long.
- The function has one parameter, `text_data`, a list of strings to search.
- For example, if `text_data` = [`hello`, `nice`, `abbrev`, `total`], then the function should return the list [`hello`, `abbrev`, `total`], because these are the only strings that are 5 or 6 characters long.
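
The example above can be sketched with a list comprehension. This is a minimal illustration rather than the required solution; the helper name `filter_by_length` and its `lengths` parameter are hypothetical and not part of the assignment.

```python
def filter_by_length(words, lengths=(5, 6)):
    """Return the strings in `words` whose length is one of `lengths`.

    Hypothetical helper for illustration only; not part of the assignment API.
    """
    return [word for word in words if len(word) in lengths]


example = ['hello', 'nice', 'abbrev', 'total']
print(filter_by_length(example))  # ['hello', 'abbrev', 'total']
```

A membership test such as `len(word) in (5, 6)` avoids combining two `==` comparisons by hand.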

In [2]:
def get_words(text_data):
    '''    
    Return all 5 or 6 letter words in a list of strings
    
    Parameters
    ----------
    text_data: list of strings
    
    Returns
    -------
    A list of strings
    '''
    # YOUR CODE HERE
    
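    # Keep only the strings whose length is exactly 5 or 6.
    # A membership test is used here because the bitwise `|` binds more
    # tightly than `==`, so `a == 5 | a == 6` does not mean "5 or 6".
    newwords = []
    for word in text_data:
        if len(word) in (5, 6):
            newwords.append(word)

    return newwords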

In [3]:
mvr = nltk.corpus.movie_reviews
all_matches = get_words(list(mvr.words()[:1000]))
assert_true(type(all_matches) is list)
for match in all_matches:
    assert_true(len(match) == 5 or len(match) == 6)
for match in all_matches:
    assert_true(type(match) is str)

## Question 2

In this question, we will be implementing the term frequency algorithm. Use the input list `text_data` and return a dictionary mapping each word to the number of times that word appears in `text_data`.

- `text_data` is a list containing the first 1000 words of the movie review corpus.
- Return a dictionary mapping strings to ints, where the strings are the words in `text_data` and the ints are the number of occurrences of each word in `text_data`.
- Remove the following stop words from `text_data` before counting: [`we`, `for`, `to`, `a`, `(`, `was`, `why`, `ve`, `in`, `is`, `.`, `&`].
- Start counting from 1 and not 0 (i.e. on the first occurrence of a word, the value for that word in the dictionary should be 1 and not 0).
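
The counting rule described above can be sketched on a tiny input. This is only an illustration under assumed data: `toy_words` and `toy_stop_words` are hypothetical and are not the corpus or stop word list used by the tests.

```python
from collections import Counter

# Hypothetical toy input; the assignment uses words from the movie review corpus
toy_words = ['the', 'film', 'is', 'a', 'film', 'to', 'see', '.']
toy_stop_words = ['is', 'a', 'to', '.']

# Drop stop words first, then count what remains
kept = [word for word in toy_words if word not in toy_stop_words]
toy_counts = dict(Counter(kept))

print(toy_counts)  # {'the': 1, 'film': 2, 'see': 1}
```

A word that appears once maps to 1, and stop words never appear as keys.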

In [38]:
def term_frequency(text_data, stop_words):
    '''    
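    Return a dictionary mapping the words in text_data to the number of occurrences in the text_data list
    
    Parameters
    ----------
    text_data: list of strings
    stop_words: list of strings to exclude from the counts
    
    Returns
    -------
    A dictionary mapping strings to ints
    '''
    # YOUR CODE HERE
    from collections import Counter

    # Filter out stop words without mutating the caller's list
    excluded = set(stop_words)
    kept_words = [word for word in text_data if word not in excluded]

    # Counter gives each word a count of 1 on its first occurrence
    return dict(Counter(kept_words))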

In [41]:
mvr = nltk.corpus.movie_reviews
stop_words = ['we', 'for', 'to', 'a', '(', 'was', 'why', 've', 'in', 'is', '.', '&']
word_counts = term_frequency(list(mvr.words()[:1000]), stop_words)
assert_true(type(word_counts) is dict)
for word in stop_words:
    assert_true(word not in word_counts.keys())
for word in word_counts:
    assert_true(type(word) is str)
    assert_true(type(word_counts[word]) is int)
    assert_true(word_counts[word] > 0)

## Question 3

In this question, we will create a pipeline that contains a `TfidfVectorizer` and a `LinearSVC` model for document classification.

- Create a pipeline object with a `TfidfVectorizer` object named `tfidf` followed by a `LinearSVC` model named `svc`.
- Set the pipeline parameter `tfidf__stop_words` to `english`.
- Fit the model to the `mvr_train` and `y_train` variables.
- Return a 2-tuple of the pipeline object followed by the result of the `predict()` function on the `mvr_test` variable.

- Order of Pipeline: 1. `TfidfVectorizer` 2. `LinearSVC`. **Please use the names `tfidf` for the `TfidfVectorizer` and `svc` for `LinearSVC` or you will NOT pass the tests.**
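
How step names turn into parameter prefixes can be sketched on a few made-up documents. This is an illustration under assumptions, not the graded solution: the `toy_*` names, documents, and labels are hypothetical, and the result on such a tiny sample says nothing about the real data set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy documents and labels (1 = positive, 0 = negative)
toy_docs = [
    'a wonderful brilliant film',
    'an awful dreadful movie',
    'brilliant acting wonderful plot',
    'dreadful plot awful acting',
]
toy_labels = [1, 0, 1, 0]

# Step names become prefixes for parameters, e.g. tfidf__stop_words
toy_pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('svc', LinearSVC())])
toy_pipeline.set_params(tfidf__stop_words='english')

# Fitting runs the vectorizer first, then trains the classifier on its output
toy_pipeline.fit(toy_docs, toy_labels)
toy_predicted = toy_pipeline.predict(['wonderful brilliant', 'awful dreadful'])

print(type(toy_predicted), toy_predicted.shape)
```

The double underscore in `tfidf__stop_words` separates the step name from the parameter of that step.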

In [46]:
def tfidf_pipeline(mvr_train, y_train, mvr_test):
    '''    
    Use TfidfVectorizer and LinearSVC for document classification
    using the Pipeline object
    
    Parameters
    ----------
    mvr_train: list (independent variable training data)
    y_train: np.ndarray (dependent variable training data)
    mvr_test: list (independent variable testing data)
    
    Returns
    -------
    A 2-tuple of the fitted Pipeline object followed by the result of its
    predict() function on mvr_test
    Return type: (pipeline.Pipeline, np.ndarray)
    '''
    # YOUR CODE HERE
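    # Build the pipeline: TF-IDF features feed a linear SVM classifier.
    # TfidfVectorizer, Pipeline, and LinearSVC are imported in the first cell.
    tools = [('tfidf', TfidfVectorizer()), ('svc', LinearSVC())]
    clf = Pipeline(tools)

    # Remove English stop words during vectorization
    clf.set_params(tfidf__stop_words='english')

    # fit() returns the fitted pipeline itself
    clf = clf.fit(mvr_train, y_train)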
    predicted = clf.predict(mvr_test)
    
    return clf, predicted

In [47]:
# Load the raw review files from the course data directory
data_dir = '/home/data_scientist/data/nltk_data/corpora/movie_reviews'
mvr = load_files(data_dir, shuffle = False)
mvr_train, mvr_test, y_train, y_test = train_test_split(mvr.data, mvr.target, test_size=0.25, random_state=23)
pipeline, predicted = tfidf_pipeline(mvr_train, y_train, mvr_test)
assert_true(type(pipeline) is Pipeline)
assert_true(type(pipeline.get_params()['tfidf']) is TfidfVectorizer)
assert_true(type(pipeline.get_params()['svc']) is LinearSVC)
assert_true(pipeline.get_params()['tfidf'].get_params()['stop_words'] == 'english')
assert_true(type(predicted) is np.ndarray)

In [48]:
# Display the metrics of the document classification model from above
data_dir = '/home/data_scientist/data/nltk_data/corpora/movie_reviews'
mvr = load_files(data_dir, shuffle = False)
mvr_train, mvr_test, y_train, y_test = train_test_split(mvr.data, mvr.target, test_size=0.25, random_state=23)
pipeline, predicted = tfidf_pipeline(mvr_train, y_train, mvr_test)
print(metrics.classification_report(y_test, predicted, target_names = mvr.target_names))

             precision    recall  f1-score   support

        neg       0.82      0.82      0.82       259
        pos       0.81      0.81      0.81       241

avg / total       0.81      0.81      0.81       500
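
The columns of the report above can be reproduced by hand on a small example. The sketch below uses hypothetical labels (`toy_true`, `toy_pred`) and a hypothetical helper `scores_for`; it is not the computation that produced the numbers in the report.

```python
# Hypothetical true and predicted labels (0 = neg, 1 = pos)
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0]


def scores_for(label, y_true, y_pred):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    support = tp + fn  # number of true instances of this class
    return precision, recall, f1, support


for label in (0, 1):
    print(label, scores_for(label, toy_true, toy_pred))
```

Precision is the share of predictions for a class that are correct, recall is the share of true instances that are found, and support is the number of true instances of that class.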

