# Machine Learning Lab 3

### Eric Johnson and Quincy Schurr

#### Overview

The dataset that we will be using for Lab 3 was obtained from a Kaggle competition called Bag of Words Meets Bags of Popcorn. The dataset can be found here: https://www.kaggle.com/c/word2vec-nlp-tutorial/data.

The dataset consists of 3 files: the test dataset, the unlabeled training dataset, and the labeled training dataset. For this lab we will be using the test and labeled training datasets. The test dataset has 25,000 records that each contain an id and a review. The labeled training dataset contains 25,000 records that each include an id, a sentiment score, and a review. The sentiment score is a binary value where 1 is a positive review, given if the rating on IMDB was greater than or equal to 7, and 0 is a negative review, given if the rating on IMDB was less than 5.
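
The labeling rule described above can be sketched as a small function. The function name is hypothetical and is not part of the dataset or the Kaggle tooling; the handling of ratings 5 and 6 (returning `None`) is an assumption, since the rule as stated does not assign them a label.

```python
from typing import Optional

def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB star rating to the binary sentiment label.

    Returns 1 for rating >= 7, 0 for rating < 5, and None for the
    in-between ratings (5 and 6), which the stated rule does not cover.
    """
    if rating >= 7:
        return 1
    if rating < 5:
        return 0
    return None

labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)  # [0, 0, None, None, 1, 1]
```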

The purpose of this dataset is to teach sentiment analysis through deep learning architectures. This means that the machine learns how to interpret the meaning behind an expression by learning how language is constructed and then piecing together patterns to determine whether an expression is positive or negative in meaning. The dataset was made public to users of Kaggle so that they could learn how to use deep learning architectures to classify or predict what people mean when they say certain things, or to predict what they could say next based on previous things they have said.

For this lab we will be classifying the movie reviews as positive or negative based on the sentiment analysis. Since the dataset was already split into a test and training set, we do not have to preprocess the data in order to split it up. We will be choosing two deep learning architectures to run our data through and will be testing the accuracy of each of the architectures. We will also be tuning the hyperparameters to try to obtain the best accuracy for each architecture.

##### Define all imports needed for project

In [1]:
import pandas as pd
import numpy as np
from sklearn import metrics, grid_search
import tensorflow as tf
from tensorflow.contrib import learn
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import ensemble



##### Import the data

When we first looked at the data, we brought in both the training and testing data and started using it for analysis. After a while we realized that we could not compute accuracy on the provided test data because it does not include a sentiment value. Testing accuracy is the best measure with this dataset because we are classifying the movie reviews into two categories and each prediction is either correct or incorrect. Reporting false positives or false negatives on their own would not give us a conclusion that would be helpful in determining if this architecture could be used to help with deep learning problems in the future.

For this reason, we decided to just use the training data and to split the data into 70% training and 30% testing data.
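
As a rough illustration of the two ideas above, a 70/30 holdout split and accuracy as the fraction of correct predictions, here is a standard-library sketch. The helper names, the toy rows, and the always-positive placeholder "model" are hypothetical; the notebook itself uses `train_test_split` from scikit-learn.

```python
import random

def holdout_split(rows, test_fraction=0.30, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical toy data: (review, sentiment) pairs.
rows = [("review %d" % i, i % 2) for i in range(10)]
train, test = holdout_split(rows)
print(len(train), len(test))  # 7 3

# A placeholder "model" that always predicts the positive class.
y_true = [label for _, label in test]
y_pred = [1] * len(test)
print(accuracy(y_true, y_pred))
```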

In [2]:
train1_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/train1'
train2_url = 'https://raw.githubusercontent.com/quincyschurr/machineLearning/master/Lab3/train2'
train1 = pd.read_csv(train1_url, delimiter='\t')
train2 = pd.read_csv(train2_url, delimiter='\t')
train2.columns = ['id', 'sentiment', 'review']
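# Stack the two halves of the training file into one frame.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
X = pd.concat([train1, train2], ignore_index=True)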
X.rename(columns={'id' : 'id_'}, inplace=True)

##### Remove punctuation from the data

In [3]:
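# Strip the HTML line-break tags, then the listed punctuation characters.
# Assigning the result back avoids relying on inplace edits of a column selection.
X['review'] = X['review'].replace(regex=True,to_replace=r'<br \/>',value=r'')
X['review'] = X['review'].replace(regex=True,to_replace=r'[.,\\/|#!$%\^&\*;:{}=\-_\'~()\?"]',value=r'')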
y = X['sentiment']
#print(np.min(train_data.review.str.len()))
#print(np.min(test_data.review.str.len()))
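
To see what the two patterns in the cell above do to a single review, the same regular expressions can be tried with the standard `re` module. The helper name and the sample string are made up for illustration.

```python
import re

# Same two patterns as the pandas calls above.
BR_TAG = re.compile(r'<br \/>')
PUNCTUATION = re.compile(r'[.,\\/|#!$%\^&\*;:{}=\-_\'~()\?"]')

def clean_review(text):
    """Drop <br /> tags, then drop the listed punctuation characters."""
    text = BR_TAG.sub('', text)
    return PUNCTUATION.sub('', text)

sample = 'Great movie!<br /><br />Loved it, really.'
print(clean_review(sample))  # Great movieLoved it really
```

Because the tags are replaced with an empty string, the words on either side of a tag are joined ("movieLoved"). Replacing with a single space instead would keep them apart, if that matters for the vocabulary.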

In [4]:
#try new type of embedding - found here 
#(https://www.tensorflow.org/versions/r0.10/tutorials/wide_and_deep/index.html#tensorflow-wide-deep-learning-tutorial)
#categorical base columns
'''id_ = tf.contrib.layers.sparse_column_with_hash_bucket("id", hash_bucket_size=25000)
review = tf.contrib.layers.sparse_column_with_hash_bucket("review", hash_bucket_size=250000)
sentiment = tf.contrib.layers.sparse_column_with_keys(column_name="sentiment", keys=[0, 1])
#review = tf.contrib.layers.real_valued_column("review")

deep_columns = [
  tf.contrib.layers.embedding_column(sentiment, dimension=8),
  id_, review]

wide_columns = [
  sentiment,
  tf.contrib.layers.crossed_column([id_, sentiment], hash_bucket_size=int(1e4)),
  tf.contrib.layers.crossed_column([review, sentiment], hash_bucket_size=int(1e4))]

import tempfile
model_dir = tempfile.mkdtemp()
m = tf.contrib.learn.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])

m.fit(x=X_train, y=y_train, steps=2000)

# Evaluate accuracy.
accuracy_score = m.evaluate(x=X_test,
                            y=y_test)["accuracy"]
print('Accuracy: {0:f}'.format(accuracy_score))'''


##### Split data into testing and training sets

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

##### Create embedding of all the words in the datasets

In [22]:
# Try tweaking these numbers
MAX_DOCUMENT_LENGTH = 30
EMBEDDING_SIZE = 100

#get unique words out of X
#Step1 - Vocab Processor
X_train = X_train['review']
X_test = X_test['review']
vocab_processor = learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)
vocab_processor.fit(X_train.append(X_test))
X_train = np.array(list(vocab_processor.transform(X_train))).astype('float32')
y_train = np.array(y_train)
X_test = np.array(list(vocab_processor.transform(X_test)))
print('Total words: %d' % len(vocab_processor.vocabulary_))

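#Step 2, one hot encoding
# NOTE: calling fit_transform on X_test refits the encoder on the test ids, so the
# two matrices end up with different column counts (see the shapes printed below).
# Fitting once on X_train and calling le.transform(X_test), with
# handle_unknown='ignore' on the encoder, would keep the columns aligned.
le = preprocessing.OneHotEncoder(n_values='auto', categorical_features='all', sparse=True)
X_train = le.fit_transform(X_train)
X_test = le.fit_transform(X_test)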

print(X_train.shape)
print(X_test.shape)

Total words: 157132
(17498, 130458)
(7500, 73041)
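
The idea behind the vocabulary step, mapping each word to an integer id and forcing every review to the same length, can be sketched without TensorFlow. This is a simplified stand-in and not the `VocabularyProcessor` implementation: the function names are hypothetical, id 0 is assumed to mean "padding or unknown word", and, unlike the cell above, the vocabulary here is fit on the training text only so that the handling of unseen test words is visible.

```python
MAX_LEN = 5  # illustrative; the notebook uses MAX_DOCUMENT_LENGTH = 30

def fit_vocabulary(documents):
    """Assign an integer id to every word seen; id 0 is reserved for padding/unknown."""
    vocab = {}
    for doc in documents:
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def transform(documents, vocab, max_len=MAX_LEN):
    """Map each document to exactly max_len ids, truncating or zero-padding."""
    rows = []
    for doc in documents:
        ids = [vocab.get(word, 0) for word in doc.split()][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

train_docs = ["a great movie", "a dull and far too long movie"]
test_docs = ["a great unknown film"]

vocab = fit_vocabulary(train_docs)      # fit on training text only
train_ids = transform(train_docs, vocab)
test_ids = transform(test_docs, vocab)  # reuse the fitted vocabulary
print(train_ids)  # [[1, 2, 3, 0, 0], [1, 4, 5, 6, 7]]
print(test_ids)   # [[1, 2, 0, 0, 0]]
```

Reusing one fitted mapping for both sets is what keeps an id meaning the same word in the training and test matrices.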


In [23]:
#step 3 - try embedding
rf = ensemble.RandomTreesEmbedding(n_estimators=10, max_depth=5, min_samples_split=2, min_samples_leaf=1, 
                                   min_weight_fraction_leaf=0.0, max_leaf_nodes=None, 
                                   min_impurity_split=1e-07, sparse_output=True, n_jobs=1)
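# NOTE: fit_transform on X_test grows a new set of random trees, so the test
# columns do not correspond to the training columns (see the shapes printed below).
# Fitting on X_train and calling rf.transform(X_test) would keep them aligned.
X_train = rf.fit_transform(X_train)
X_test = rf.fit_transform(X_test)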

print(X_train.shape)
print(X_test.shape)

(17498, 82)
(7500, 67)


In [24]:
%%time
# Copy TensorFlow Architecture from 
#   Deep MNIST for experts
#   https://www.tensorflow.org/versions/r0.11/tutorials/mnist/pros/index.html
#which we then copied from Notebook 14 from in class

import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib import layers

def conv_model(X, y):
    print('===============================')
    print(X)
    # get in format expected by conv2d
    # (batch_size, height, width, channels)
    # each review is treated as a MAX_DOCUMENT_LENGTH x EMBEDDING_SIZE
    #   single-channel "image", so the last three elements are 30x100x1
    # we don't know the batch size, so just let it 
    # figure that out from the input data (-1 designation)
    features = tf.reshape(X, [-1, MAX_DOCUMENT_LENGTH, EMBEDDING_SIZE, 1])
    print(features)
    
    # create 32 filters of 5x5 size 
    n_out = 32
    kernel = [5,5]
    features = layers.conv2d(inputs=features, 
                            num_outputs=n_out, 
                            kernel_size=kernel)
    print('Through conv')
    # add a bias and pass through relu (for concentrated gradients)
    features = tf.nn.relu(layers.bias_add(features))
    
    # 2x2 max pool halves each dimension (30x100 -> 15x50)
    kernel = [1, 2, 2, 1]
    stride = [1, 2, 2, 1]
    features = tf.nn.max_pool(features, ksize=kernel,strides=stride, padding='SAME')
    
    # create 5 filters of 3x3 size
    n_out = 5
    kernel = [3,3]
    features = layers.conv2d(inputs=features, 
                            num_outputs=n_out, 
                            kernel_size=kernel)
    
    # add a bias and pass through relu (for concentrated gradients)
    features = tf.nn.relu(layers.bias_add(features))
    
    # 2x2 max pool halves each dimension again (15x50 -> 8x25)
    kernel = [1, 2, 2, 1]
    stride = [1, 2, 2, 1]
    features = tf.nn.max_pool(features, ksize=kernel,strides=stride, padding='SAME')
    

    # flatten the feature maps into one vector per example, 8x25x5 = 1000
    features = layers.flatten(features)
    print(features)
    
    # pass through fully connected layer with 1024 hidden neurons, W=1000x1024
    features = layers.stack(features, layers.fully_connected, [1024])
    
    # add bias and pass through relu
    features = tf.nn.relu(layers.bias_add(features))
    print(features)
    
    # then make a fully connected layer with bias and sigmoid nonlinearity 
    #  which... is... just logistic regression with one versus all
    pred, loss = learn.models.logistic_regression(features, y)
    print(pred)
    
    print('===============================')
    
    return pred, loss


# Create a classifier, train and predict.
classifier = learn.TensorFlowEstimator(model_fn=conv_model, 
                                       n_classes=2, steps=20000, 
                                       learning_rate=0.05, batch_size=30)

# this operation can take a little while to complete
#   Google says it should take about 30 minutes, but 
#   my machine took a lot longer...
classifier.fit(X_train, y_train)

# now predict the outcome
#score = accuracy_score(y_test, classifier.predict(X_test))
#print(score)



Tensor("input:0", shape=(?, 82), dtype=float64)
Tensor("Reshape:0", shape=(?, 30, 100, 1), dtype=float64)
Through conv


TypeError: DataType float64 for attr 'T' not in list of allowed values: float32, float16
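
The feature-map sizes inside `conv_model` can be checked by hand. The sketch below assumes the convolution layers keep height and width unchanged (stride 1 with 'SAME' padding, which is believed to be the default for `layers.conv2d`) and that a 'SAME' max pool with stride 2 produces `ceil(size / 2)` outputs. The helper name is hypothetical.

```python
import math

def same_pool_output(size, stride=2):
    """Output length of a pooling layer with padding='SAME': ceil(size / stride)."""
    return math.ceil(size / stride)

height, width = 30, 100        # MAX_DOCUMENT_LENGTH x EMBEDDING_SIZE
channels = 5                   # num_outputs of the second conv layer

# The conv layers are assumed to keep height and width (stride 1, 'SAME' padding),
# so only the two 2x2 max pools change the spatial size.
for _ in range(2):
    height, width = same_pool_output(height), same_pool_output(width)

flattened = height * width * channels
print(height, width, flattened)  # 8 25 1000
```

The reshape at the top of `conv_model` also expects 30 * 100 = 3000 values per example, while the input tensor printed above has 82 columns, so the input would need to be a 30x100 representation of each review before this architecture can run.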

##### Create model functions
From in-class notebook 15

In [None]:
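# Vocabulary size used for the embedding matrix (the 'Total words' value printed earlier).
n_words = len(vocab_processor.vocabulary_)

def lstm_model(x, y):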
    # Convert indexes of words into embeddings.
    # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
    # maps word indexes of the sequence into [batch_size, sequence_length,
    # EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into list of embedding per word, while removing doc length dim.
    # word_list results to be a list of tensors [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create a Long Short-Term Memory (LSTM) cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.LSTMCell(EMBEDDING_SIZE,state_is_tuple=False)
    #cell = tf.nn.rnn_cell.BasicRNNCell(HIDDEN_SIZE)

    # Create an unrolled Recurrent Neural Networks to length of
    # MAX_DOCUMENT_LENGTH and passes word_list as inputs for each unit.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Given encoding of RNN, take encoding of last step (e.g hidden size of the
    # neural network of last step) and pass it as features for logistic
    # regression over output classes.
    # Sentiment is binary, so one-hot encode the labels over 2 classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op
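
As a small aside on the label encoding used before the logistic regression layer: one-hot encoding turns each integer label into a row with a single "on" entry. The pure-Python sketch below mimics the argument order of `tf.one_hot(indices, depth, on_value, off_value)` but is only an illustration, assuming a depth of 2 because the sentiment label has two classes.

```python
def one_hot(labels, depth, on_value=1, off_value=0):
    """Return one row per label with on_value at the label's index."""
    return [[on_value if i == label else off_value for i in range(depth)]
            for label in labels]

# Binary sentiment labels need a depth of 2: one column per class.
print(one_hot([0, 1, 1], depth=2))  # [[1, 0], [0, 1], [0, 1]]
```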

In [None]:
def gru_model(x, y):
    # Convert indexes of words into embeddings.
    # This creates embeddings matrix of [n_words, EMBEDDING_SIZE] and then
    # maps word indexes of the sequence into [batch_size, sequence_length,
    # EMBEDDING_SIZE].
    word_vectors = learn.ops.categorical_variable(x, n_classes=n_words,
      embedding_size=EMBEDDING_SIZE, name='words')

    # Split into list of embedding per word, while removing doc length dim.
    # word_list results to be a list of tensors [batch_size, EMBEDDING_SIZE].
    word_list = tf.unpack(word_vectors, axis=1)

    # Create a Gated Recurrent Unit cell with hidden size of EMBEDDING_SIZE.
    cell = tf.nn.rnn_cell.GRUCell(EMBEDDING_SIZE)

    # Create an unrolled Recurrent Neural Networks to length of
    # MAX_DOCUMENT_LENGTH and passes word_list as inputs for each unit.
    per_rnn_output, final_encoding = tf.nn.rnn(cell, word_list, dtype=tf.float32)

    # Given encoding of RNN, take encoding of last step (e.g hidden size of the
    # neural network of last step) and pass it as features for logistic
    # regression over output classes.
    # Sentiment is binary, so one-hot encode the labels over 2 classes.
    target = tf.one_hot(y, 2, 1, 0)
    prediction, loss = learn.models.logistic_regression(final_encoding, target)

    # Create a training op.
    train_op = tf.contrib.layers.optimize_loss(
      loss, tf.contrib.framework.get_global_step(),
      optimizer='Adam', learning_rate=0.01)

    return {'class': tf.argmax(prediction, 1), 'prob': prediction}, loss, train_op
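
To make the "unrolled" recurrence in the comments above more concrete, here is a scalar sketch of a GRU update applied step by step, with the final state playing the role of `final_encoding`. It is not the TensorFlow `GRUCell` implementation: real cells use weight matrices and vector states, the weights below are arbitrary made-up numbers, and references differ on whether `z` or `1 - z` multiplies the old state.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h, p):
    """One GRU update for scalar input x and scalar state h.

    p holds hypothetical scalar weights; real cells use weight matrices.
    """
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])   # reset gate
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # Blend old state and candidate; some references swap the roles of z and 1 - z.
    return (1.0 - z) * h + z * candidate

params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.3, "ur": 0.2, "br": 0.0,
          "wh": 0.8, "uh": 0.4, "bh": 0.0}

state = 0.0                      # initial hidden state
for x in [0.2, -0.1, 0.7]:       # one value per time step
    state = gru_step(x, state, params)
print(state)                     # final state, analogous to final_encoding
```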

##### Hyper-tune parameters with grid search

In [None]:
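# Basic outline of the grid search; the grid values are starting points to tweak.
param_grid = {"steps": [200, 400],
              "learning_rate": [0.1, 0.2]}

# Sentiment is a binary label, so use a classifier rather than a regressor;
# the 'accuracy' scorer needs predicted class labels.
dnn = learn.TensorFlowDNNClassifier(hidden_units=[10, 10], n_classes=2,
                                    steps=5000, learning_rate=0.1, batch_size=10)

# run grid search; steps and learning_rate are already searched through
# param_grid, so no fit_params are passed
gs = grid_search.GridSearchCV(
         dnn, param_grid=param_grid, scoring='accuracy',
         verbose=10, n_jobs=-1, cv=2)
gs.fit(X_train, y_train)

# summarize the results of the grid search
print(gs.best_score_)
print(gs.best_params_)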
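
For a sense of how much work this grid implies, the combinations can be enumerated with the standard library. This sketch only mirrors the bookkeeping that a grid search performs (every combination is fit once per cross-validation fold); it does not train anything, and the variable names are illustrative.

```python
from itertools import product

param_grid = {"steps": [200, 400],
              "learning_rate": [0.1, 0.2]}
cv_folds = 2

# Expand the grid into one dict per parameter combination.
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[n] for n in names))]

for params in combinations:
    print(params)
print("model fits:", len(combinations) * cv_folds)  # model fits: 8
```

With 4 combinations and 2 folds this is 8 training runs, plus one more if the search refits the best setting on the full training data.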