### NLP on IMDB Dataset

This notebook investigates the basics of natural language processing, specifically as applied to sentiment classification. I used a large dataset of IMDB movie reviews and tried to classify each review as positive or negative. My initial model used a vanilla bag-of-words method, treating each text as an unordered multiset of the words it contains. I then attempted to use the word2vec neural network model to learn more sophisticated word representations and improve classification accuracy.

### The Problem

We have a set of example movie reviews written by people after watching a movie. We want to learn a function $f: X \rightarrow Y$ that takes in a review and outputs a binary value indicating whether the review expresses a positive or negative sentiment about the movie.

We assume that there is some (unknown) true data-generating distribution $D$ that defines $p(x, y)$, the joint probability of observing a particular review together with a particular sentiment. This distribution assigns a probability to every pair $(x, y) \in X \times Y$, but we only have access to a very small sample from it (i.e., the training data). Our function $f$ must be able to generalize well, and we define this as having the minimum expected loss on unseen data.
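
One standard way to write this objective is as expected-risk minimization. The loss function $\ell$ and the hypothesis class $\mathcal{F}$ below are notation assumed here for illustration; the text above does not name them.

$$f^{*} = \arg\min_{f \in \mathcal{F}} \; \mathbb{E}_{(x, y) \sim D}\left[\ell(f(x), y)\right]$$

For classification, a common choice is the 0-1 loss $\ell(\hat{y}, y) = \mathbb{1}[\hat{y} \neq y]$, under which the expected loss is simply the probability of misclassifying an unseen review.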


First, we read in our labelled training data using `pandas`. The reviews are held in the column `train["review"]`. Let's take a look at a single review and its label.

In [1]:
# read in data
import pandas as pd
train = pd.read_csv('data/labeledTrainData.tsv', header = 0, delimiter = '\t', quoting = 3)
print(train["review"][0])
print(train["sentiment"][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

Next, we need to clean each review so that we eliminate punctuation and extract only the words. Moreover, some words occur so often in English (such as "the", "a", "it", etc.) that we don't want to consider them in our model. They are so common that they carry little sentiment on their own, so we lose very little by throwing them out. In fact, we gain by not having to represent these words in our model, which saves a bit of computation time as well as space.

The BeautifulSoup library was used to strip the HTML markup, and a regular expression was used to keep only the letters. NLTK was used to find "stopwords", or words that occur very often in the English language. The following cell is heavily based on this Kaggle tutorial.

In [2]:
# cleaning data
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
import re
# build the stopword set once instead of once per review
# (requires the NLTK stopwords corpus, e.g. via nltk.download("stopwords"))
stops = set(stopwords.words("english"))

def clean(review):
    # remove html
    text = BeautifulSoup(review, "html5lib").get_text()
    # regexp matching to extract letters only
    letters_only = re.sub("[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    # remove common words
    meaningful_words = [w for w in words if w not in stops]
    return " ".join(meaningful_words)

cleaned_reviews = [clean(review) for review in train["review"]]

Finally, we can look at the words in a clean review. 

In [3]:
print(cleaned_reviews[1].split())

['classic', 'war', 'worlds', 'timothy', 'hines', 'entertaining', 'film', 'obviously', 'goes', 'great', 'effort', 'lengths', 'faithfully', 'recreate', 'h', 'g', 'wells', 'classic', 'book', 'mr', 'hines', 'succeeds', 'watched', 'film', 'appreciated', 'fact', 'standard', 'predictable', 'hollywood', 'fare', 'comes', 'every', 'year', 'e', 'g', 'spielberg', 'version', 'tom', 'cruise', 'slightest', 'resemblance', 'book', 'obviously', 'everyone', 'looks', 'different', 'things', 'movie', 'envision', 'amateur', 'critics', 'look', 'criticize', 'everything', 'others', 'rate', 'movie', 'important', 'bases', 'like', 'entertained', 'people', 'never', 'agree', 'critics', 'enjoyed', 'effort', 'mr', 'hines', 'put', 'faithful', 'h', 'g', 'wells', 'classic', 'novel', 'found', 'entertaining', 'made', 'easy', 'overlook', 'critics', 'perceive', 'shortcomings']


Next, we define the core functions that let us take English words and represent them as numerical training data, which is what our machine learning algorithms actually look at. This requires a few steps:

- Creating a vocabulary, which is just a set of all the words in all the reviews. 

- Obtaining, for each review, an occurrence dictionary. This function will take a single review and return a dictionary that enumerates how often each word occurred. 

- Using the above two functions, we can define another function that creates the feature vectors for our bag of words model. If there are n words in the vocabulary, then for each review the feature vector f has n entries, where f[i] is the number of times the i-th vocabulary word occurred in the review (and 0 if it was not present in the review). 

You may notice that this "bag of words" model already has a few weaknesses. Most significantly, it does not take into account the ordering of words or any sense of context. The English language is full of phrases and idioms whose meaning, taken as a whole, is entirely different from the meaning of their individual words. 

However, despite its disadvantages, the bag of words model has actually seen some significant success in practice, most notably for spam filtering. This makes sense - even if spam emails do have phrases where the words greatly depend on the context around them, we can probably get really good spam classification just by detecting the presence and relative frequency of certain words. 
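
The three steps listed above, and the word-order weakness, can be sketched on a toy corpus. This is a minimal standard-library illustration, not the notebook's implementation: `build_vocab`, `bag_of_words`, and the three toy reviews are hypothetical names and data chosen for the example.

```python
from collections import Counter

def build_vocab(reviews):
    """Sorted list of every distinct word across all reviews."""
    return sorted({word for review in reviews for word in review.split()})

def bag_of_words(review, vocab):
    """Count vector: entry i is how often vocab[i] occurs in the review."""
    counts = Counter(review.split())
    return [counts[word] for word in vocab]

# Hypothetical, already-cleaned toy reviews.
toy_reviews = [
    "great plot terrible acting",
    "terrible plot great acting",
    "great great plot",
]
toy_vocab = build_vocab(toy_reviews)
toy_vectors = [bag_of_words(review, toy_vocab) for review in toy_reviews]

print(toy_vocab)
for review, vector in zip(toy_reviews, toy_vectors):
    print(vector, review)
```

The first two toy reviews say different things about the plot and the acting, yet they map to the same count vector, because only word frequencies survive the transformation.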


In [4]:
from collections import defaultdict
import numpy as np
# creates a vocabulary - set of all words in all reviews
def create_vocab(cleaned_reviews):
    """
    Takes in a bunch of reviews and creates a vocabulary. 
    """
    li = []
    for review in cleaned_reviews:
        a = review.split()
        for item in a:
            li.append(item)
    return list(set(li))

def get_word_occ_dict(review):
    d = defaultdict(int)
    words = review.split()
    for w in words:
        d[w]+=1
    return d

# takes in a vocab and a review and returns a feature vector for the review
# the feature vector f has d dimensions where d = len(vocab)
# for each index i in [0..d-1], f[i] = n where n is the number of times vocab[i] occurred in the review
# the feature vectors are sparse, since most words in the vocab do not occur in a specific review
def create_feature_vector(review, vocab):
    word_dict = get_word_occ_dict(review) 
    feature_vector = [word_dict[v] if v in word_dict else 0 for v in vocab]
    return np.array(feature_vector)

def create_feature_vectors(cleaned_reviews, vocab):
    feature_vectors = [create_feature_vector(review, vocab) for review in cleaned_reviews]
    return np.array(feature_vectors)

vocab = create_vocab(cleaned_reviews)
X = create_feature_vectors(cleaned_reviews, vocab)


Next, let's get the labels for our training data, and look at the shape of both the feature matrix and the label vector. 

In [5]:
y = train['sentiment']
X.shape
y.shape


(25000,)

Now we get to the actual learning portion of our investigation. To do this, I first imported the functions I'll be using, mostly from the scikit-learn library. I also imported a personal machine learning utilities library that I wrote to help with cross-validation and hyperparameter tuning. 

In [6]:
from utils import *
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# separate data into training and testing
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size = 0.3)





First, I fit a single linear SVM to the data, before considering hyperparameter settings and more complex models, to see what kind of accuracy I get. To save computation time, I decided not to engineer any additional features or do any feature expansion. SVMs can be kernelized, which means that a kernel function can replace explicit feature transformations anyway, so we can use that if we want to consider higher-dimensional spaces.
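
To illustrate what "a kernel function can replace the feature transformations" means, here is a small sketch using the homogeneous degree-2 polynomial kernel on 2-dimensional inputs. The helper names `dot`, `phi`, and `poly_kernel` and the example points are assumptions made for this illustration. Note that `LinearSVC`, used in the next cell, is a purely linear model; in scikit-learn it is `SVC` that accepts a `kernel` argument.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z) ** 2."""
    return dot(x, z) ** 2

x, z = [1.0, 2.0], [3.0, 4.0]
explicit = dot(phi(x), phi(z))   # inner product in the expanded 3-d space
implicit = poly_kernel(x, z)     # same quantity, computed in the original 2-d space
print(explicit, implicit)
```

Both values come out to 121 (up to floating-point rounding), so the kernel gives the inner product in the expanded space without ever building the expanded vectors.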

In [None]:
# fit SVM to the data
clf = LinearSVC(verbose = 10)
clf.fit(X_train, y_train)
y_train_pred, y_test_pred = clf.predict(X_train), clf.predict(X_test)
# note the order: the first score is computed on the training split, the second on the test split
train_acc, test_acc = accuracy_score(y_train, y_train_pred), accuracy_score(y_test, y_test_pred)
print("test accuracy: {}".format(test_acc))
print("training accuracy: {}".format(train_acc))

[LibLinear]

test accuracy: 0.8594666666666667
training accuracy: 1.0


Next, I tried out several different linear SVMs, primarily by changing the hyperparameter `C`. Rather than rehash exactly what this hyperparameter does, I refer you to my Quora answer for an explanation. 

I also used a utility function that I wrote in my personal utils library called `get_best_hyperparams_cv`. 
This function takes in a list of classifiers with different hyperparameter settings, and returns the classifier and setting that perform best (where best is defined by lowest cross-validation error). 
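
The source of `get_best_hyperparams_cv` is not shown here, so the following is only a hypothetical sketch of the general technique (k-fold cross-validation over a list of candidate settings), not that function's implementation. Every name in it (`k_fold_indices`, `cross_val_error`, `select_best`, `make_fit`) and the toy 1-d data are assumptions for illustration. The toy "classifier" thresholds at the midpoint of the two class means, shifted by a hyperparameter.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(fit, X, y, k):
    """Mean held-out error rate of the predictor returned by fit(X_train, y_train)."""
    errors = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / len(errors)

def select_best(candidates, X, y, k):
    """candidates is a list of (fit_function, param); returns (param, error) with lowest CV error."""
    scored = [(cross_val_error(fit, X, y, k), param) for fit, param in candidates]
    best_error, best_param = min(scored, key=lambda pair: pair[0])
    return best_param, best_error

def make_fit(shift):
    """Toy model family: threshold at the midpoint of the class means, plus a shift."""
    def fit(X_train, y_train):
        zeros = [x for x, label in zip(X_train, y_train) if label == 0]
        ones = [x for x, label in zip(X_train, y_train) if label == 1]
        threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2 + shift
        return lambda x: int(x > threshold)
    return fit

X_toy = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.15, 0.85]
y_toy = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
candidates = [(make_fit(shift), shift) for shift in [-0.5, 0.0, 0.5]]
best_param, best_error = select_best(candidates, X_toy, y_toy, k=5)
print(best_param, best_error)
```

On this toy data the unshifted threshold classifies every held-out point correctly, while either shift pushes the threshold past one whole class, so the selection returns a shift of `0.0` with zero cross-validation error.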

In [None]:
# try several different classifiers by changing the value for C, which indicates how much slack variables are penalized.
clfs_and_params = [(LinearSVC(C = c, verbose = 10), c) for c in [0.01, 0.1, 1.0, 5.0, 10, 100]]
clf, best_params, best_test_err, best_train_err = get_best_hyperparams_cv(X_train, y_train, k = 10, 
                                                                          classifiers = clfs_and_params, 
                                                                          verbose = True)

training with params: 0.01
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]found params with test error: 0.12194285714285713
training with params: 0.1
[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]

In [None]:
test_data = pd.read_csv('data/unlabeledTrainData.tsv', header = 0, delimiter = '\t', quoting = 3)
cleaned_reviews = [clean(test_data["review"][i]) for i in range(len(test_data["review"]))]
X = create_feature_vectors(cleaned_reviews, vocab)
final_test_preds = clf.predict(X)

In [54]:
import tensorflow as tf

# cleaned reviews are a bunch of reviews where we will get our training examples from.
# Let's look at one cleaned review: 
print(cleaned_reviews[0])
window_size = 1
vocab = create_vocab(cleaned_reviews)

def word_one_hot(word, vocab):
    # list.index raises ValueError for a missing word (it never returns a negative index)
    try:
        idx = vocab.index(word)
    except ValueError:
        return -1
    vec = np.zeros(len(vocab))
    vec[idx] = 1
    return vec

def create_vectorized_word_pairs(review, vocab, window_size):
    words = review.split()
    data = []
    for i in range(len(words)):
        left = [words[i-j] for j in range(1, window_size + 1) if i-j >= 0]
        right = [words[i+j] for j in range(1, window_size + 1) if i+j < len(words)]
        neighbors = left + right
        pairs = [(word_one_hot(words[i], vocab), word_one_hot(n,vocab)) for n in neighbors]
        data.append(pairs)
    
    return data

def create_word_pairs_all_reviews(cleaned_reviews, vocab, window_size):
    data = []
    for review in cleaned_reviews:
        li = create_vectorized_word_pairs(review, vocab, window_size)
        data.extend(li)
    return data


stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working
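
To make the windowing in `create_vectorized_word_pairs` easier to follow, here is a sketch that emits (center, neighbor) pairs as plain strings instead of one-hot vectors. The function name `context_pairs` is hypothetical, and unlike the cell above it returns one flat list rather than one sub-list per center word.

```python
def context_pairs(words, window_size):
    """Return (center, neighbor) pairs for every neighbor within window_size positions."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window_size)                  # clip the window at the start
        hi = min(len(words), i + window_size + 1)     # clip the window at the end
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

# First four words of the cleaned review printed above.
toy_words = "stuff going moment mj".split()
toy_pairs = context_pairs(toy_words, window_size=1)
print(toy_pairs)
```

With `window_size=1`, the first and last words contribute one pair each and interior words contribute two, which is why the number of pairs per word is at most `2 * window_size`.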

In [55]:
example_pairs = create_vectorized_word_pairs(cleaned_reviews[0], vocab, 1)
print(example_pairs[0])
print(cleaned_reviews[0])


[(array([ 0.,  0.,  0., ...,  0.,  0.,  0.]), array([ 0.,  0.,  0., ...,  0.,  0.,  0.]))]
stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also 

In [62]:
# example_pairs is a list of lists; each inner list holds up to 2 * window_size pairs
# (fewer for words near the start or end of the review).
# each pair is (example, label)
# map data into concrete X/Y input output lists

features, labels = [], []
for li in example_pairs:
    features.extend(elm[0] for elm in li)
    labels.extend(elm[1] for elm in li)

# TODO - implement model - probably will be written up in a module and imported here. 


TypeError: unhashable type: 'numpy.ndarray'