# Machine Learning Lab 3

### Eric Johnson and Quincy Schurr

#### Overview

The dataset that we will be using for Lab 3 was obtained from a Kaggle competition called Bag of Words Meets Bags of Popcorn. The dataset can be found at https://www.kaggle.com/c/word2vec-nlp-tutorial/data.

The dataset consists of three files: the test dataset, the unlabeled training dataset, and the labeled training dataset. For this lab we will be using the test and labeled training datasets. The test dataset has 25,000 records that contain an id and a review. The labeled training dataset contains 25,000 records that include an id, a sentiment score, and a review. The sentiment score is a binary value: 1 marks a positive review, given if the rating on IMDB was greater than or equal to 7, and 0 marks a negative review, given if the rating on IMDB was less than 5.
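
The labeling rule described above can be sketched as a small function. This is only an illustration of the thresholds as stated here. The name `sentiment_from_rating` is hypothetical, and the Kaggle files already ship with the label computed, so nothing in this lab needs to call it.

```python
def sentiment_from_rating(rating):
    """Map an IMDB rating (1-10) to the binary sentiment label.

    Returns 1 for ratings >= 7, 0 for ratings < 5, and None for the
    ratings in between, which get no label under the rule described above.
    """
    if rating >= 7:
        return 1
    if rating < 5:
        return 0
    return None


labels = [sentiment_from_rating(r) for r in (2, 5, 6, 7, 10)]
print(labels)
```

Ratings of 5 and 6 fall between the two cutoffs, which is why they map to `None` in this sketch.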

The purpose of this dataset is understanding sentiment analysis through deep learning architectures. This means that the machine is learning how to interpret the meaning behind an expression by learning how language is constructed and then trying to piece together patterns to determine whether an expression is positive or negative in meaning. The dataset was made public to users of Kaggle so that they could learn how to use deep learning architectures to classify or predict what people mean when they say certain things, or to predict what they could say next based on previous things they have said.

For this lab we will be classifying the movie reviews as positive or negative based on the sentiment analysis. Since the dataset was already split into a test and training set, we do not have to preprocess the data in order to split it up. We will be choosing two deep learning architectures to run our data through and will be testing the accuracy of each of the architectures. We will also be tuning the hyperparameters to try to obtain the best accuracy for each architecture.

##### Define all imports needed for project

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics
import tensorflow as tf
from tensorflow.contrib import learn
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import ensemble



##### Import the data

When we first looked at the data, we brought in both the training and testing data and started using it for analysis. After a while we realized that we could not compute an accuracy metric on the test data because it does not include a sentiment value. Accuracy is the best measure for this dataset because we are classifying the movie reviews into two categories, so each prediction is either correct or incorrect. Looking only at false positives or false negatives would not help us determine whether these architectures could be useful for similar deep learning problems in the future.

For this reason, we decided to just use the training data for our lab.

In [2]:
train1_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/train1'
train2_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/train2'
train1 = pd.read_csv(train1_url, delimiter='\t')
train2 = pd.read_csv(train2_url, delimiter='\t')
train2.columns = ['id', 'sentiment', 'review']
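# pd.concat gives the same result as DataFrame.append and also works in newer pandas releases
X = pd.concat([train1, train2], ignore_index=True)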
X.rename(columns={'id' : 'id_'}, inplace=True)

##### Remove punctuation from the data

We decided to remove all punctuation from the data set to focus on just the words. The reviews also contained some leftover HTML line breaks, so we removed those as well.

In [3]:
X['review'].replace(regex=True,inplace=True,to_replace=r'<br \/>',value=r'')
X['review'].replace(regex=True,inplace=True,to_replace=r'[.,\\/|#!$%\^&\*;:{}=\-_\'~()\?"]',value=r'')
y = X['sentiment']
#print(np.min(train_data.review.str.len()))
#print(np.min(test_data.review.str.len()))
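
To see what the two replacements do to a single review, the sketch below applies the same two patterns with the standard `re` module. The helper name `clean_review` and the sample string are made up for illustration.

```python
import re

# Same two patterns as in the cell above
HTML_BREAK = re.compile(r'<br \/>')
PUNCTUATION = re.compile(r'[.,\\/|#!$%\^&\*;:{}=\-_\'~()\?"]')


def clean_review(text):
    """Strip leftover <br /> tags, then strip punctuation characters."""
    text = HTML_BREAK.sub('', text)
    return PUNCTUATION.sub('', text)


sample = 'Great movie!<br /><br />Loved it, 10/10.'
cleaned = clean_review(sample)
print(cleaned)
```

Because each match is replaced with an empty string, words that were separated only by a tag or by punctuation are joined together (here "movie" and "Loved"). Substituting a space instead would be one way to keep them apart.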

#### Convolutional Neural Network

The first model we will run the dataset through is a convolutional model. We got the idea from a paper published by IBM that can be found here (http://www.aclweb.org/anthology/C14-1008). We then read a bit more on a blog that can be found here (http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/). This blog talked about implementing a CNN in tensorflow for sentiment analysis. We liked the idea and decided to do an implementation of it for our dataset. We could not figure out how to do the embedding using tensorflow, so we used sklearn for most of the following.

##### Create embedding of all the words in the datasets

In this section we create an embedding of all the words in the dataset to use in our convolutional model. The first step was to run the review attribute from the whole dataset through a vocab processor. This told us how many unique words there are in the dataset and assigned a number to each one of those words.


In [None]:
# Try tweaking these numbers
MAX_DOCUMENT_LENGTH = 30
EMBEDDING_SIZE = 100

#get unique words out of X

#Step1 - Vocab Processor
X = X['review']

vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
vocab_processor.fit(X)

X= np.array(list(vocab_processor.transform(X))).astype('float32')
#X_train = np.array(list(vocab_processor.transform(X_train))).astype('float32')
#y_train = np.array(y_train)
#X_test = np.array(list(vocab_processor.transform(X_test))).astype('float32')
#y_test = np.array(y_test)
print('Total words: %d' % len(vocab_processor.vocabulary_))
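
The idea behind this step can be sketched in plain Python. This is a simplified stand-in for the vocab processor, not its actual implementation: the function names are hypothetical, words are split on whitespace only, and id 0 is reserved for padding as an assumption of the sketch.

```python
MAX_DOCUMENT_LENGTH = 30  # same value as in the cell above


def fit_vocabulary(documents):
    """Assign each distinct word an integer id, starting at 1.

    Id 0 is reserved for padding (an assumption of this sketch).
    """
    vocabulary = {}
    for doc in documents:
        for word in doc.split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


def transform(documents, vocabulary, max_length=MAX_DOCUMENT_LENGTH):
    """Turn each document into exactly max_length ids (truncate or zero-pad)."""
    rows = []
    for doc in documents:
        ids = [vocabulary.get(word, 0) for word in doc.split()[:max_length]]
        ids += [0] * (max_length - len(ids))
        rows.append(ids)
    return rows


docs = ['a good movie', 'a bad movie']
vocab = fit_vocabulary(docs)
encoded = transform(docs, vocab, max_length=5)
print(vocab)
print(encoded)
```

Every row ends up with the same length, so with `MAX_DOCUMENT_LENGTH = 30` only the first 30 words of each review are kept.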

##### One Hot Encoding

The next step is to convert the integer representation to a one hot encoding, creating a sparse matrix representation of all the numbered words. The encoder's n_values parameter takes an array of ints with one entry per column, so we build that array from the largest word index, repeated once for each of the MAX_DOCUMENT_LENGTH positions.

In [None]:
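#Step 2, one hot encoding
# ids run from 0 to max(X), so each position needs max(X) + 1 columns;
# cast to int because X was converted to float32 above
num = int(np.max(X)) + 1
num2 = [num] * MAX_DOCUMENT_LENGTH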

le = preprocessing.OneHotEncoder(n_values=num2, categorical_features='all', sparse=True, handle_unknown='ignore')
le.fit(X)
X = le.transform(X)

#reshape into something with only 30 attributes wide      
sparse_words = []
for i in range(MAX_DOCUMENT_LENGTH):
    sparse_words.append(X[:, num*i:(i+1)*num])

orig_len = X.shape[0] #need this for hstack later

print(X.shape)
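
The layout produced by this step can be sketched without sklearn: every word position gets its own block of columns, and a single 1 inside that block marks the word id. The function name and the tiny input are made up for illustration, and `n_values` here is the largest id plus one so that every id has a column.

```python
def one_hot_positions(rows, n_values):
    """One-hot encode each position and concatenate the blocks.

    rows: lists of integer ids, every id in range(n_values).
    Returns rows of length len(row) * n_values.
    """
    encoded = []
    for row in rows:
        out = [0] * (len(row) * n_values)
        for position, word_id in enumerate(row):
            # block `position` starts at column position * n_values
            out[position * n_values + word_id] = 1
        encoded.append(out)
    return encoded


rows = [[1, 2, 0], [3, 0, 0]]
n_values = 4  # largest id (3) plus one
encoded = one_hot_positions(rows, n_values)
print(len(encoded[0]))
```

With a real vocabulary the blocks are far too wide to store densely, which is why the notebook keeps the result as a sparse matrix.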

In [None]:
from scipy.sparse import vstack, hstack

#create a vertical stacking of all these words so that we can input into embedding

a = vstack(sparse_words)
print(a.shape)

##### Embedding

The next step is to take the one hot encoding and transform it into an embedding so that it is in the correct format to input into a convolutional model. The Random Trees Embedding maps each row of the sparse matrix to a much narrower sparse binary code that records which leaf of each tree the row falls into, so it takes up less space. To get a dense matrix instead, set the sparse_output parameter to False.

In [None]:
#step 3 - try embedding
rf = ensemble.RandomTreesEmbedding(n_estimators=10, max_depth=5, min_samples_split=2, min_samples_leaf=1, 
                                   min_weight_fraction_leaf=0.0, max_leaf_nodes=None, 
                                   min_impurity_split=1e-07, sparse_output=True, n_jobs=1)

rf.fit(a)
X = rf.transform(a).astype('float32') 
#convert to type float32 because the convolutional network model is expecting a float32, not 64.
# X_test = rf.transform(X_test).astype('float32')

print(X.shape)
#embedding size is shape(1)

In [None]:
type(X)

In [None]:
from scipy.sparse import vstack, hstack

orig_len
tmp = []
for i in range(MAX_DOCUMENT_LENGTH):
    tmp.append(X[i*orig_len:(i+1)*orig_len,:])
    
b = hstack(tmp)
print(b.shape)

X = b.todense() #create dense matrix for convolution
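
The vertical and horizontal stacking used around the embedding is easier to follow on a tiny example. The sketch below uses plain lists instead of scipy sparse matrices, and the function names are hypothetical. It only shows the bookkeeping: block `i` of the stacked matrix holds word position `i` of every document, and unstacking puts the positions of each document back side by side.

```python
def stack_positions(rows, n_positions, width):
    """Vertical stacking: block i holds word position i of every document."""
    stacked = []
    for i in range(n_positions):
        for row in rows:
            stacked.append(row[i * width:(i + 1) * width])
    return stacked


def unstack_positions(stacked, n_docs, n_positions):
    """Horizontal stacking: rebuild one row per document from the blocks."""
    docs = []
    for d in range(n_docs):
        row = []
        for i in range(n_positions):
            row.extend(stacked[i * n_docs + d])
        docs.append(row)
    return docs


# 3 documents, 2 word positions, 2 columns per position
rows = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
stacked = stack_positions(rows, n_positions=2, width=2)
restored = unstack_positions(stacked, n_docs=3, n_positions=2)
print(stacked)
print(restored)
```

In the notebook the embedding is applied between the two steps, so the width of each stacked row changes while the number of rows and their order stay the same.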

##### Split data into testing and training sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

##### Convolutional Model

In [None]:
%%time
# Copy TensorFlow Architecture from 
#   Deep MNIST for experts
#   https://www.tensorflow.org/versions/r0.11/tutorials/mnist/pros/index.html
#which we then copied from Notebook 14 from in class

import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib import layers

def conv_model(X, y):
    print('===============================')
    print(X)
    # get in format expected by conv2d
    # (batch_size, height, width, channels)
    # each review is treated as a one-channel "image" of
    #   MAX_DOCUMENT_LENGTH words by 60 embedding columns
    # we don't know the batch size, so just let it 
    # figure that out from the input data (-1 designation)
    features = tf.reshape(X, [-1, MAX_DOCUMENT_LENGTH, 60, 1])
    print(features)
    
    # create 32 filters of 5x5 size 
    n_out = 32
    kernel = [5,5]
    features = layers.conv2d(inputs=features, 
                            num_outputs=n_out, 
                            kernel_size=kernel)
    print('Through conv')
    # add a bias and pass through relu (for concentrated gradients)
    features = tf.nn.relu(layers.bias_add(features))
    
    # 2x2 max pool to halve each dimension (30x60 -> 15x30)
    kernel = [1, 2, 2, 1]
    stride = [1, 2, 2, 1]
    features = tf.nn.max_pool(features, ksize=kernel,strides=stride, padding='SAME')
    
    # create 64 filters of 5x5 size
    n_out = 64
    kernel = [5,5]
    features = layers.conv2d(inputs=features, 
                            num_outputs=n_out, 
                            kernel_size=kernel)
    
    # add a bias and pass through relu (for concentrated gradients)
    features = tf.nn.relu(layers.bias_add(features))
    
    # 2x2 max pool to halve each dimension again (15x30 -> 8x15 with 'SAME' padding)
    kernel = [1, 2, 2, 1]
    stride = [1, 2, 2, 1]
    features = tf.nn.max_pool(features, ksize=kernel,strides=stride, padding='SAME')
    

    # flatten the feature maps into one vector per review, 8x15x64 = 7680
    features = layers.flatten(features)
    print(features)
    
    # pass through fully connected layer with 1024 hidden neurons, W=7680x1024
    features = layers.stack(features, layers.fully_connected, [1024])
    
    # add bias and pass through relu
    features = tf.nn.relu(layers.bias_add(features))
    print(features)
    
    # then make a fully connected layer with bias and sigmoid nonlinearity 
    #  which... is... just logistic regression with one versus all
    pred, loss = learn.models.logistic_regression(features, y)
    print(pred)
    
    print('===============================')
    
    return pred, loss


# Create a classifier, train and predict.
classifier = learn.TensorFlowEstimator(model_fn=conv_model, 
                                       n_classes=2, steps=20000, 
                                       learning_rate=0.05, batch_size=30)

print ('classifier created')

# this operation can take a little while to complete
#   Google says it should take about 30 minutes, but 
#   my machine took a lot longer...
classifier.fit(X_train, y_train)
print ('fit')

# now predict the outcome
score = metrics.accuracy_score(y_test, classifier.predict(X_test))
print(score)
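
The tensor sizes inside `conv_model` can be checked with a few lines of arithmetic. This sketch assumes that the convolutions keep the spatial size unchanged (stride 1 with 'SAME' padding, which we believe is the default for `layers.conv2d`) and that each 'SAME'-padded 2x2 pool with stride 2 rounds the halved size up.

```python
import math


def same_pool(size, stride=2):
    """Output length of a 'SAME'-padded pooling layer along one dimension."""
    return math.ceil(size / stride)


height, width = 30, 60  # MAX_DOCUMENT_LENGTH and the width used in tf.reshape
for _ in range(2):      # two 2x2 max-pool layers
    height, width = same_pool(height), same_pool(width)

flattened = height * width * 64  # 64 filters in the second convolution
print(height, width, flattened)
```

Under these assumptions the fully connected layer receives 7680 inputs per review rather than the 3136 of the MNIST example this architecture was copied from.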

The run time for this model was long, so the output is included below. We ran the convolutional model three different times, changing the second layer's kernel from 5x5 to 3x3 and the learning rate from 0.05 to 0.01, but each time we obtained about 50% accuracy. 50% accuracy does not tell us much about the value of the architecture, since it is about the same as guessing.

We thought that this model might be a good one to try based on the papers that we read, but with the dataset we chose it did not turn out to be the best model. The reason for this could be that the rating metric used in this dataset was a bit arbitrary, since the dataset leaves out reviews rated between the negative cutoff (below 5) and the positive cutoff (7 or above), and many words could be used in both positive and negative reviews. Below is the accuracy given for the convolutional model run with a learning rate of 0.05.


<img src = 'CNNOutput.png'>

### Testing RNNs 

#### Perform embedding and train-test split
Based on code from in-class notebook 15

In [4]:
# We redo the embedding to make sure all data is in the expected form and not tainted by the code above
MAX_DOCUMENT_LENGTH = 30
EMBEDDING_SIZE = 100

X = X['review']

vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
vocab_processor.fit(X)

X= np.array(list(vocab_processor.transform(X)))

n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Total words: 157132


Split into training and testing sets.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

#### Create RNN functions (LSTM and GRU cells are tested)
From in-class notebook 15

In [6]:
def lstm_model(x, y):
    # Convert indexes of words into embeddings.
    # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
    # maps word indexes of the sequence into [batch_size, sequence_length,
    # EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into list of embedding per word, while removing doc length dim.
    # word_list results to be a list of tensors [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create a Long Short-Term Memory (LSTM) cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.LSTMCell(EMBEDDING_SIZE,state_is_tuple=False)
    #cell = tf.nn.rnn_cell.BasicRNNCell(HIDDEN_SIZE)

    # Create an unrolled Recurrent Neural Networks to length of
    # MAX_DOCUMENT_LENGTH and passes word_list as inputs for each unit.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Given encoding of RNN, take encoding of last step (e.g hidden size of the
    # neural network of last step) and pass it as features for logistic
    # regression over output classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.1)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op

In [7]:
def lstm_model_2(x, y):
    # Convert indexes of words into embeddings.
    # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
    # maps word indexes of the sequence into [batch_size, sequence_length,
    # EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into list of embedding per word, while removing doc length dim.
    # word_list results to be a list of tensors [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create a Long Short-Term Memory (LSTM) cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.LSTMCell(EMBEDDING_SIZE,state_is_tuple=False)
    #cell = tf.nn.rnn_cell.BasicRNNCell(HIDDEN_SIZE)

    # Create an unrolled Recurrent Neural Networks to length of
    # MAX_DOCUMENT_LENGTH and passes word_list as inputs for each unit.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Given encoding of RNN, take encoding of last step (e.g hidden size of the
    # neural network of last step) and pass it as features for logistic
    # regression over output classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op

In [8]:
def lstm_model_3(x, y):
    # Convert indexes of words into embeddings.
    # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
    # maps word indexes of the sequence into [batch_size, sequence_length,
    # EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into list of embedding per word, while removing doc length dim.
    # word_list results to be a list of tensors [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create a Long Short-Term Memory (LSTM) cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.LSTMCell(EMBEDDING_SIZE,state_is_tuple=False)
    #cell = tf.nn.rnn_cell.BasicRNNCell(HIDDEN_SIZE)

    # Create an unrolled Recurrent Neural Networks to length of
    # MAX_DOCUMENT_LENGTH and passes word_list as inputs for each unit.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Given encoding of RNN, take encoding of last step (e.g hidden size of the
    # neural network of last step) and pass it as features for logistic
    # regression over output classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.001)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op

In [9]:
def gru_model(x, y):
    # Convert indexes of words into embeddings.
    # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
    # maps word indexes of the sequence into [batch_size, sequence_length,
    # EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into list of embedding per word, while removing doc length dim.
    # word_list results to be a list of tensors [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create a Gated Recurrent Unit cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.GRUCell(EMBEDDING_SIZE)

    # Create an unrolled Recurrent Neural Networks to length of
    # MAX_DOCUMENT_LENGTH and passes word_list as inputs for each unit.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Given encoding of RNN, take encoding of last step (e.g hidden size of the
    # neural network of last step) and pass it as features for logistic
    # regression over output classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.001)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op

##### Tune hyperparameters with a manual grid search and repeated hold-out validation

In [None]:
%%time
# Couldn't get normal grid search to work with Tensorflow, so this is a manual version
# Each model function uses a different learning rate (0.1, 0.01, 0.001)

scores = []

for model_fn in (lstm_model, lstm_model_2, lstm_model_3):
    # Reset per model so each mean only covers that model's runs
    sub_scores = []
    for k1 in range(0,2):
        # Two random 70/30 splits per model instead of a full k-fold,
        # because the model takes 40 minutes to run each time
        X_train_sub, X_test_sub, y_train_sub, y_test_sub = train_test_split(X_train, y_train, test_size=0.30)
        classifier = learn.Estimator(model_fn=model_fn)
        classifier.fit(X_train_sub, y_train_sub, steps=200)
        y_pred = [p['class'] for p in classifier.predict(X_test_sub, as_iterable=True)]
        sub_scores.append(metrics.accuracy_score(y_test_sub, y_pred))
        print('Fold done ', k1)
    scores.append(np.mean(sub_scores))

print(scores)

Due to the time constraints of running the grid search, a screen shot of a previous run is shown below.

<img src="GridSearch.png">

This shows that the best of the three learning rates we looked at is 0.001, so we use it in the LSTM model below. We did not have time to also perform this grid search on the GRU model, so we just use the same learning rate as the LSTM. Previous testing has led us to believe that the GRU model does not perform as well as the LSTM model.
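
The search procedure itself is independent of TensorFlow and can be sketched as a small helper. The names `manual_grid_search` and `evaluate` are hypothetical, and the scores in the example are made-up placeholders used only to exercise the function. They are not results from this lab.

```python
from statistics import mean


def manual_grid_search(candidates, evaluate, n_repeats=2):
    """Score every candidate n_repeats times and return (best, mean_scores).

    evaluate(candidate, repeat) is a hypothetical callback that trains on a
    fresh split and returns an accuracy.
    """
    mean_scores = {}
    for candidate in candidates:
        runs = [evaluate(candidate, repeat) for repeat in range(n_repeats)]
        mean_scores[candidate] = mean(runs)  # averaged per candidate only
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Placeholder scores, not measurements
toy_scores = {0.1: [0.2, 0.4], 0.01: [0.5, 0.7], 0.001: [0.6, 0.8]}
best, means = manual_grid_search(
    [0.1, 0.01, 0.001], lambda lr, repeat: toy_scores[lr][repeat])
print(best, means)
```

Keeping one list of runs per candidate makes sure that the mean reported for a learning rate only reflects runs made with that learning rate.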

In [10]:
%%time
# Build LSTM model
classifier = learn.Estimator(model_fn=lstm_model_3)

# Train and predict
classifier.fit(X_train, y_train, steps=200)
y_predicted = [p['class'] for p in classifier.predict(X_test, as_iterable=True)]
score = metrics.accuracy_score(y_test, y_predicted)
print('Accuracy: {0:f}'.format(score))
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)



Accuracy: 0.701067
[[2710 1044]
 [1198 2548]]
CPU times: user 1h 53min 46s, sys: 14min 50s, total: 2h 8min 36s
Wall time: 41min 24s


In [11]:
%%time
# Build GRU model
gru_classifier = learn.Estimator(model_fn=gru_model)

# Train and predict
gru_classifier.fit(X_train, y_train, steps=200)
y_predicted_gru = [p['class'] for p in gru_classifier.predict(X_test, as_iterable=True)]
score = metrics.accuracy_score(y_test, y_predicted_gru)
print('Accuracy: {0:f}'.format(score))
cm = metrics.confusion_matrix(y_test, y_predicted_gru)
print(cm)



Accuracy: 0.694400
[[2515 1239]
 [1053 2693]]
CPU times: user 1h 29min 28s, sys: 13min 1s, total: 1h 42min 30s
Wall time: 32min 41s


Both of our models have very similar accuracy results (about 70.1% for the LSTM and 69.4% for the GRU). These accuracies are close enough that we cannot say that one model outperforms the other. The GRU model trains in roughly 80% of the time of the LSTM model (about 33 minutes versus 41 minutes of wall time), so this would make it better for a production version of the system.
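
As a quick check on the comparison, the reported accuracies and the training time ratio can be recomputed from the confusion matrices and wall times printed above. The helper names are hypothetical, and the numbers are copied from the two output cells.

```python
def accuracy_from_confusion(cm):
    """Accuracy = correct predictions (the diagonal) / all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def to_seconds(minutes, seconds):
    """Convert a wall time given as minutes and seconds to seconds."""
    return minutes * 60 + seconds


lstm_cm = [[2710, 1044], [1198, 2548]]
gru_cm = [[2515, 1239], [1053, 2693]]

lstm_acc = accuracy_from_confusion(lstm_cm)
gru_acc = accuracy_from_confusion(gru_cm)
time_ratio = to_seconds(32, 41) / to_seconds(41, 24)

print('LSTM accuracy: {0:f}'.format(lstm_acc))
print('GRU accuracy:  {0:f}'.format(gru_acc))
print('GRU / LSTM wall time: {0:.2f}'.format(time_ratio))
```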