# Machine Learning Lab 3

### Eric Johnson and Quincy Schurr

#### Overview

The dataset that we will be using for Lab 3 was obtained from a Kaggle competition called Bag of Words Meets Bags of Popcorn. The dataset can be found here: https://www.kaggle.com/c/word2vec-nlp-tutorial/data.

The dataset consists of three files: the test dataset, the unlabeled training dataset, and the labeled training dataset. For this lab we will be using the test and labeled training datasets. The test dataset has 25,000 records, each containing an id and a review. The labeled training dataset has 25,000 records, each containing an id, a sentiment score, and a review. The sentiment score is a binary value: 1 marks a positive review (IMDB rating greater than or equal to 7) and 0 marks a negative review (IMDB rating less than 5).
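The labeling rule described above can be written as a small function. This is only an illustrative sketch: `label_from_rating` is a hypothetical helper, not part of the Kaggle data or tooling, and returning `None` for ratings of 5 and 6 is an assumption that reflects the fact that the stated rule gives them no label.

```python
def label_from_rating(rating):
    """Map an IMDB rating (1-10) to the binary sentiment label described above.

    Hypothetical helper, for illustration only.
    """
    if rating >= 7:
        return 1      # positive review
    if rating < 5:
        return 0      # negative review
    return None       # ratings 5 and 6 are not covered by the stated rule


labels = [label_from_rating(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)  # [0, 0, None, None, 1, 1]
```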

The purpose of this dataset is understanding sentiment analysis through deep learning architectures. This means that the machine learns how to interpret the meaning behind an expression by learning how language is constructed and then piecing together patterns to determine whether the expression is positive or negative in meaning. The dataset was made public to users of Kaggle so that they could learn how to use deep learning architectures to classify what people mean when they say certain things, or to predict what they could say next based on what they have said before.

For this lab we will be classifying the movie reviews as positive or negative based on the sentiment analysis. Since the dataset was already split into a test set and a training set, we do not have to split it ourselves during preprocessing. We will choose two deep learning architectures to run our data through and will test the accuracy of each. We will also tune the hyperparameters to try to obtain the best accuracy for each architecture.

##### Define all imports needed for project

In [5]:
import pandas as pd
import numpy as np
from sklearn import metrics
import tensorflow as tf
from tensorflow.contrib import learn

##### Import the data

In [12]:
test1_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/test1'
test2_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/test2'

test1 = pd.read_csv(test1_url, delimiter='\t')
test2 = pd.read_csv(test2_url, delimiter='\t')
test2.columns = ['id', 'review']
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
test_data = pd.concat([test1, test2], ignore_index=True)


train1_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/train1'
train2_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/train2'
train1 = pd.read_csv(train1_url, delimiter='\t')
train2 = pd.read_csv(train2_url, delimiter='\t')
train2.columns = ['id', 'sentiment', 'review']
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
train_data = pd.concat([train1, train2], ignore_index=True)
train_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24998 entries, 0 to 24997
Data columns (total 3 columns):
id           24998 non-null object
sentiment    24998 non-null int64
review       24998 non-null object
dtypes: int64(1), object(2)
memory usage: 586.0+ KB


##### Remove punctuation from the data

In [7]:
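# Strip HTML line breaks, then punctuation, from the test reviews.
# Assigning the result back avoids relying on an in-place edit of a column selection.
test_data['review'] = test_data['review'].replace(regex=True, to_replace=r'<br \/>', value=r'')
test_data['review'] = test_data['review'].replace(regex=True, to_replace=r'[.,\\/|#!$%\^&\*;:{}=\-_\'~()\?"]', value=r'')
#print(test_data.loc[9, 'review'])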
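
To see what the two patterns above do to a single string, here is a minimal sketch using only the standard `re` module. The `clean_review` helper and the sample text are made up for illustration; the patterns themselves are copied from the cell above.

```python
import re

# Same two patterns as in the cell above.
BR_TAG = re.compile(r'<br \/>')
PUNCTUATION = re.compile(r'[.,\\/|#!$%\^&\*;:{}=\-_\'~()\?"]')


def clean_review(text):
    """Remove <br /> tags first, then the listed punctuation characters."""
    text = BR_TAG.sub('', text)
    return PUNCTUATION.sub('', text)


sample = 'Great movie!<br /><br />Loved it, 10/10.'  # made-up example text
print(clean_review(sample))  # Great movieLoved it 1010
```

Note that because the replacement is an empty string, words on either side of a `<br />` tag are joined ("movieLoved"). Replacing the tag with a single space instead would keep them apart, if that matters for the vocabulary.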

##### Create embedding of all the words in the datasets

In [15]:
# Try tweaking these numbers
MAX_DOCUMENT_LENGTH = 30
EMBEDDING_SIZE = 100

vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
x_train = train_data['review']
x_test = test_data['review']
x_train = np.array(list(vocab_processor.fit_transform(x_train)))
x_test = np.array(list(vocab_processor.transform(x_test)))
n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Total words: 127377
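
As a rough picture of what the vocabulary step produces, the sketch below maps words to integer ids and forces every document to a fixed length. It is a simplified stand-in, not the `VocabularyProcessor` implementation: the whitespace tokenizer, the `fit_vocabulary` and `to_ids` helpers, and the tiny `MAX_LEN` are all assumptions chosen to keep the output readable.

```python
MAX_LEN = 5  # stands in for MAX_DOCUMENT_LENGTH; small so the output is readable


def fit_vocabulary(documents):
    """Assign an integer id to each distinct word; id 0 is reserved."""
    vocab = {'<UNK>': 0}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_ids(doc, vocab, max_len=MAX_LEN):
    """Map words to ids, truncate to max_len, and pad short documents with 0."""
    ids = [vocab.get(word, 0) for word in doc.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


train_docs = ['a great movie', 'a really bad movie with a bad plot']
vocab = fit_vocabulary(train_docs)
print(to_ids(train_docs[0], vocab))          # [1, 2, 3, 0, 0]
print(to_ids(train_docs[1], vocab))          # [1, 4, 5, 3, 6]
print(to_ids('a great plot twist', vocab))   # [1, 2, 7, 0, 0]
```

The second document shows the effect of truncation: with `MAX_DOCUMENT_LENGTH = 30`, only the first 30 tokens of each review reach the model, which is one reason that value is worth tweaking.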


##### Create model functions
From in-class notebook 15

In [None]:
def lstm_model(x, y):
    # Convert word indexes into embeddings.
    # This creates an embedding matrix of shape [n_words, EMBEDDING_SIZE] and
    # then maps the word indexes of each sequence into a tensor of shape
    # [batch_size, sequence_length, EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into a list of embeddings, one per word position, removing the
    # document-length dimension. word_list is a list of tensors, each of shape
    # [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create an LSTM cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.LSTMCell(EMBEDDING_SIZE,state_is_tuple=False)
    #cell = tf.nn.rnn_cell.BasicRNNCell(HIDDEN_SIZE)

    # Create a recurrent neural network unrolled to a length of
    # MAX_DOCUMENT_LENGTH, passing one element of word_list to each step.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Take the RNN's encoding at the last step (i.e., its final state) and
    # pass it as features to a logistic regression over the output classes.
    # Sentiment is binary, so the one-hot target has 2 classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op

In [None]:
def gru_model(x, y):
    # Convert word indexes into embeddings.
    # This creates an embedding matrix of shape [n_words, EMBEDDING_SIZE] and
    # then maps the word indexes of each sequence into a tensor of shape
    # [batch_size, sequence_length, EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into a list of embeddings, one per word position, removing the
    # document-length dimension. word_list is a list of tensors, each of shape
    # [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create a Gated Recurrent Unit cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.GRUCell(EMBEDDING_SIZE)

    # Create a recurrent neural network unrolled to a length of
    # MAX_DOCUMENT_LENGTH, passing one element of word_list to each step.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Take the RNN's encoding at the last step (i.e., its final state) and
    # pass it as features to a logistic regression over the output classes.
    # Sentiment is binary, so the one-hot target has 2 classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op
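
To make the unrolled recurrence in `gru_model` more concrete, here is a minimal NumPy sketch of a GRU applied step by step to a list of per-word inputs. It follows a common GRU formulation and is not TensorFlow's implementation: gate conventions differ slightly between libraries, and `gru_step`, the `params` dictionary, the weight names, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x, h, params):
    """One GRU time step for a batch: x is [batch, input], h is [batch, hidden]."""
    xh = np.concatenate([x, h], axis=1)
    r = sigmoid(xh @ params['W_r'] + params['b_r'])   # reset gate
    u = sigmoid(xh @ params['W_u'] + params['b_u'])   # update gate
    xrh = np.concatenate([x, r * h], axis=1)
    c = np.tanh(xrh @ params['W_c'] + params['b_c'])  # candidate state
    return u * h + (1.0 - u) * c                      # blend old state and candidate


rng = np.random.RandomState(0)
batch, steps, size = 2, 4, 3  # toy sizes, not the notebook's 30 and 100

# Random weights stand in for trained parameters (input size == hidden size here).
params = {name: rng.normal(scale=0.1, size=(2 * size, size))
          for name in ('W_r', 'W_u', 'W_c')}
params.update({name: np.zeros(size) for name in ('b_r', 'b_u', 'b_c')})

# One [batch, size] array per word position, like word_list in the model.
word_list = [rng.normal(size=(batch, size)) for _ in range(steps)]

h = np.zeros((batch, size))
for x in word_list:
    h = gru_step(x, h, params)

final_encoding = h
print(final_encoding.shape)  # (2, 3)
```

The state after the last word plays the role of `final_encoding` in the model functions above: a fixed-size summary of the whole sequence that the logistic regression layer classifies.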