# IMDB Reviews Sentiment Analysis

This notebooks covers a few different approaches to sentiment analysis and compares their respective results. We'll be taking a look at IMDB movie reviews and trying to correctly predict whether the reviews were positive or negative. The techniques investigated are:<br>
- LSA and Logistic Regression
- CNN
- LSTM


The data used can be found here: http://ai.stanford.edu/~amaas/data/sentiment/

## Imports and Load Data

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
import os
from glob import glob

Here we just define the locations of the training and testing data

In [2]:
HOME_DIR = os.environ['HOME']
DATA_DIR = HOME_DIR +'/git/imdb-sentiment-analysis/data/aclImdb/'
SAVE_MODEL_DIR = HOME_DIR +'/git/imdb-sentiment-analysis/saved_models/'
# train data directories
train_dir = DATA_DIR + 'train/'
train_pos_dir = train_dir + 'pos/' 
train_neg_dir = train_dir + 'neg/'
# test data directories
test_dir = DATA_DIR + 'test/'
test_pos_dir = test_dir + 'pos/'
test_neg_dir = test_dir + 'neg/'

I created a couple of functions for loading the data. The first, get_data, takes the filepath to a folder containing the individual text files which contain the reviews. The function returns a list of id numbers, ratings, text in the file, and the label (positive or negative).

In [3]:
def get_data(paths):
    """
    Return review data given file paths
    params:
    :paths: list of data file paths 
    
    returns: idx,ratings,texts,labels
    """
    idx = []
    ratings = []
    texts = []
    labels = []

    for path in paths:
        with open(path) as f:
            _,filename = os.path.split(path)
            
            if 'neg' in path: labels.append(0)
            if 'pos' in path: labels.append(1)
            
            idx.append(filename[0:filename.find('_')])
            ratings.append(filename[filename.find('_')+1])
            texts.append(f.read().lower())

    return idx,ratings,texts,labels

This function calls the previous function to load all the training and testing data. It also has the option of one-hot labeling which we'll use with one of the models later on.

In [4]:
from sklearn.preprocessing import OneHotEncoder

def load_imdb_data(one_hot_labels=True):
    """Load the imdb review data
    The data can be downloaded here: http://ai.stanford.edu/~amaas/data/sentiment/

    params:
    :one_hot_labels: if True, encode labels in one-hot format.

    returns: X_train, X_test, y_train, y_test
    """
    train_paths = glob(train_neg_dir+'*') + glob(train_pos_dir+'*')
    _,_,train_texts,train_labels = get_data(train_paths)

    test_paths = glob(test_neg_dir+'*') + glob(test_pos_dir+'*')
    _,_,test_texts,test_labels = get_data(train_paths)
    
    if one_hot_labels:
        enc = OneHotEncoder()
        train_label_array = np.array(train_labels).reshape((len(train_labels),1))
        test_label_array = np.array(test_labels).reshape((len(test_labels),1))
        enc.fit(train_label_array)
        train_labels = enc.transform(train_label_array).toarray()
        test_labels = enc.transform(test_label_array).toarray()
    
    return train_texts,train_labels,test_texts,test_labels

# LSA Text Classifier

The first technique we're going to try is LSA with a Logistic Regression classifier. LSA consists of converting each review into a term-frequency inverse-document-frequency (tfidf) vector and then applying SVD dimensionality reduction.

In this case, I'm going to convert the training data into tf-idf vectors first, and those will be the input to the model. This way we can create our model with the amount of dimensionality reduction as a parameter, and use grid search to determine how much dimenionality reduction optimizes accuracy.

### Convert the text string into a tf-idf vector

In [30]:
train_texts, train_labels, test_texts, test_labels = load_imdb_data(one_hot_labels=False)

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
train_tfidf = tfidf_vectorizer.fit_transform(train_texts)
test_tfidf = tfidf_vectorizer.fit_transform(test_texts)

### Model which performs LSA and fits a Logistic Regression model

In [38]:
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

class LSALogisticRegression(BaseEstimator):
    
    def __init__(self, n_components=2):
        # parameters
        self.n_components = n_components
        
    def fit(self, X, y):
        # model
        self.model = LogisticRegression()
        self.svd = TruncatedSVD(self.n_components)
        
        # dimensionality reduction
        X_svd = self.svd.fit_transform(X)
        self.model.fit(X_svd,y)
        
        self.X_ = X_svd
        self.y_ = y
        
        return self
        
    def predict(self, X):
        
        check_is_fitted(self, ['X_','y_'])
        X_svd = self.svd.transform(X)
        y_pred = self.model.predict(X_svd)
        
        return y_pred
    
    def get_params(self, deep=True):
        return {"n_components": self.n_components}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

In [39]:
from sklearn.model_selection import GridSearchCV
params = {'n_components':[100,500,1000,1500,2000], 'normalize':[True,False]}
lsa = LSALogisticRegression()
model = GridSearchCV(lsa, params, 'accuracy', cv=3)
model.fit(train_tfidf, train_labels)

GridSearchCV(cv=3, error_score='raise',
       estimator=LSALogisticRegression(n_components=2), fit_params={},
       iid=True, n_jobs=1,
       param_grid={'normalize': [True, False], 'n_components': [100, 500, 1000, 1500, 2000]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

In [40]:
test_pred = model.predict(test_tfidf)
accuracy = accuracy_score(test_labels,test_pred)
print "Best parameters: {}".format(model.best_params_)
print "Test accuracy: {}%".format(accuracy*100)

Best parameters: {'normalize': False, 'n_components': 2000}
Test accuracy: 90.556%


Here we get a final test accuracy of 90.5%. Pretty good considering it's pretty much just using a bag of word counts to make it's prediction.

Hmm... looks like the 2000 dimension version did the best. The less we reduced the dimensionality, the better. Maybe none is best? Let's try:

In [42]:
model = LogisticRegression()
model.fit(train_tfidf, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [44]:
test_pred = model.predict(test_tfidf)
accuracy = accuracy_score(test_labels,test_pred)
print "Test accuracy: {}%".format(accuracy*100)

Test accuracy: 93.328%


Linear Regression on tf-idf FTW!

# CNN Classifier

The next technique tried is a Convolutional Neural Network. As with the previous model, there are a few preprocessing steps we're going to have to perform before feeding the reviews into the CNN.

We need to map each word string to a word vector (aka. word embeddings). The word vector is the word's location in word-space. I won't go into detail here, but just imagine that word-space is this nice n-dimensional space where words that are similar are close to each other. A 2-dimensional word-space could be a table covered in cue-cards which each have a word on them. You can use your own understanding of words to move the words around the table so related words are close to each other (animals might be grouped, people, places, etc.). Now imagine you could do this in 3d. Or 4d! Or n-d! That might be hard for you and I, but computers and math don't mind.

I'm going to try three different ways of mapping our words into word-space:

1) **Let the neural network do it:** We can do this by putting a layer at the beginning of our network (called an embedding layer) which takes the words and maps them to 100d vectors. As the model trains, it will learn whatever word representation seems to be improving the accuracy the most (yay backprop!)<br>

2) **Let Gensim do it:** [Gensim](https://radimrehurek.com/gensim/models/word2vec.html) is a library with all sorts of fancy functionality for learning things from text. We'll be using Word2Vec, which produces word vectors with deep learning via word2vec’s “skip-gram and CBOW models”. It reads through your corpus of text, and learns a representation which is best able to predict words based on the words around it. This turns out to work pretty well.<br>

3) **Let GloVe do it:** [GloVe](https://github.com/stanfordnlp/GloVe) is another model which is able to learn word representations. And the creators already went through all the trouble of training it on the entirety of wikipedia! Given that amount of data, and more time than I'm willing to spend here, they probably learned a word representations that contains more useful information than anything we would get with little training on little data.<br>

In [5]:
# only create word vectors for the 
# 10,000 most import words
max_num_words = 10000 
# maximum review length of 1000 words
max_seq_length = 1000
# dimensions of word vector
word_vector_size = 100

### Tokenize texts

First we need to tokenize the text. In this case, it will include stripping out unneccessary characters, and spliting the review texts into lists of words.

In [6]:
train_texts, train_labels, test_texts, test_labels = load_imdb_data()

In [7]:
from bs4 import BeautifulSoup
from keras.preprocessing.text import Tokenizer

Using Theano backend.
Using cuDNN version 5103 on context None
Mapped name None to device cuda: Tesla K80 (0000:00:1E.0)


In [8]:
# Clean text
train_soup = [BeautifulSoup(text, 'lxml').get_text() for text in train_texts]

# Create corpus dictionary 
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_soup)
word_index = tokenizer.word_index
index_word = {index:word for word,index in word_index.items()}

In [9]:
from keras.preprocessing.sequence import pad_sequences

In [10]:
#text to sequence of indices from word_index dictionary
train_index_sequences = tokenizer.texts_to_sequences(train_texts)
test_index_sequences = tokenizer.texts_to_sequences(test_texts)

#pad training and testing sequences
train_data = pad_sequences(train_index_sequences, maxlen=max_seq_length)
test_data = pad_sequences(test_index_sequences, maxlen=max_seq_length)

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(train_data,train_labels,test_size=0.2)

### Gensim word embeddings

Here's where we create the gensim word embeddings. Like I mentioned earlier, Word2Vec trains on our training texts to try to learn the most useful word representations. We then take those word vectors and place them all in a matrix. This is just how the CNN likes it's word vectors stored.

In [None]:
#text to sequence of words
word_sequences = index_sequences[:]
for i,sequence in enumerate(word_sequences):
    for j,index in enumerate(sequence):
        word_sequences[i][j] = index_word[index]

In [None]:
from gensim.models import Word2Vec

In [297]:
# create word vectors
model_name = SAVE_MODEL_DIR + "mrsa_word2vec"
model = Word2Vec(word_sequences,size = word_vector_size)
model.save(model_name)

In [298]:
# create embedding matrix
gensim_embedding_matrix = np.zeros((len(word_index)+1,word_vector_size))
word_vec_vocab = model.wv.vocab.keys()
for word in word_vec_vocab:
    idx = tokenizer.word_index[word]
    gensim_embedding_matrix[idx] = model.wv[word]

# save embeddings
embed_mat_name = SAVE_MODEL_DIR + "mrsa_gensim_embeds.npy"
np.save(embed_mat_name,embedding_matrix)

In [251]:
# load embedding matrix
gensim_embedding_matrix = np.load(embed_mat_name)

### GloVe word embedding

Same thing as above, except there's no training required!

In [245]:
embeddings_index = {}
with open(os.path.join('glove.6B.100d.txt')) as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [248]:
glove_embedding_matrix = np.zeros((len(word_index) + 1, word_vector_size))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        glove_embedding_matrix[i] = embedding_vector

Ok, now for the fun part! Time to have the word embeddings fight to the death! <br>
... well, more like compete at sentiment prediction, but you get what I mean

I'll use the same model architecture in each case, just changing the embedding layer weights at the top.

## CNN with trainable embeddings

This CNN starts with random word embeddings. Over the course of training, it'll learn more and more useful embeddings via backprop.

In [25]:
CNN1 = Sequential()
CNN1.add(Embedding(len(word_index)+1,
                  word_vector_size,
                  input_length=max_seq_length))
CNN1.add(Conv1D(128, 5, activation='relu'))
CNN1.add(MaxPooling1D(5))
CNN1.add(Conv1D(128, 5, activation='relu'))
CNN1.add(MaxPooling1D(5))
CNN1.add(Conv1D(128, 5, activation='relu'))
CNN1.add(MaxPooling1D(5))
CNN1.add(Flatten())
CNN1.add(Dense(128, activation='relu'))
CNN1.add(BatchNormalization())
CNN1.add(Dropout(0.5))
CNN1.add(Dense(2, activation='softmax'))
CNN1.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])

In [287]:
# checkpointer for best weights
CNN_no_embeds_path = SAVE_MODEL_DIR + 'mrsa_CNN_no_embeds.h5'
checkpointer = ModelCheckpoint(CNN_no_embeds_path, save_best_only=True)

# Fit model     
CNN1.fit(X_train,y_train, 
         validation_data = (X_valid,y_valid), 
         batch_size=128, epochs=5,
         callbacks=[checkpointer])

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f959cc77e10>

Looks like it overfit pretty quickly. We'll see later that this overfits more quickly than the other models, which makes sense. Since this model has the extra trainable parameters in the embedding layer, it can use those to remember specific training data information.

### Results

In [288]:
CNN1.load_weights(CNN_no_embeds_path)
score, acc = CNN1.evaluate(test_data,test_labels)
print('Test score: %.2f' % score)
print('Test accuracy: %.2f' % acc)

Test score: 0.25
Test accuracy: 0.96


## CNN with gensim embeddings

Next up is Gensim. Let's see how it fairs.

In [22]:
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, MaxPooling1D, BatchNormalization, Dense, Flatten, Embedding
from keras.callbacks import ModelCheckpoint

In [299]:
CNN2 = Sequential()
CNN2.add(Embedding(len(word_index)+1,
                    word_vector_size,
                    weights = [imdb_embedding_matrix],
                    input_length=max_seq_length,
                    trainable=False))
CNN2.add(Conv1D(128, 5, activation='relu'))
CNN2.add(BatchNormalization())
CNN2.add(MaxPooling1D(5))
CNN2.add(Conv1D(128, 5, activation='relu'))
CNN2.add(BatchNormalization())
CNN2.add(MaxPooling1D(5))
CNN2.add(Conv1D(128, 5, activation='relu'))
CNN2.add(BatchNormalization())
CNN2.add(MaxPooling1D(5))
CNN2.add(Flatten())
CNN2.add(Dense(128, activation='relu'))
CNN2.add(BatchNormalization())
CNN2.add(Dropout(0.5))
CNN2.add(Dense(2, activation='softmax'))
CNN2.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])

In [300]:
# checkpointer for best weights
CNN_weights_path = SAVE_MODEL_DIR + 'mrsa_CNN.h5'
checkpointer = ModelCheckpoint(CNN_weights_path, save_best_only=True)

# Fit model     
CNN2.fit(X_train,y_train, 
         validation_data = (X_valid,y_valid), 
         batch_size=128, epochs=5,
         callbacks=[checkpointer])

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f95aeceffd0>

In [301]:
CNN2.fit(X_train,y_train, 
         validation_data = (X_valid,y_valid), 
         batch_size=128, epochs=2,
         callbacks=[checkpointer])

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f95980fdb10>

It eventually starts to overfit, but not as quickly as the last model.

### Results

In [302]:
CNN2.load_weights(CNN_weights_path)
score, acc = CNN2.evaluate(test_data,test_labels)
print('Test score: %.2f' % score)
print('Test accuracy: %.2f' % acc)

Test score: 0.18
Test accuracy: 0.94


## CNN using GloVe embeddings

And finally, it's GloVe's turn.

In [305]:
CNN3 = Sequential()
CNN3.add(Embedding(len(word_index)+1,
                         word_vector_size,
                         weights = [pretrained_embedding_matrix],
                         input_length=max_seq_length,
                         trainable=False))
CNN3.add(Conv1D(128, 5, activation='relu'))
CNN3.add(BatchNormalization())
CNN3.add(MaxPooling1D(5))
CNN3.add(Conv1D(128, 5, activation='relu'))
CNN3.add(BatchNormalization())
CNN3.add(MaxPooling1D(5))
CNN3.add(Conv1D(128, 5, activation='relu'))
CNN3.add(BatchNormalization())
CNN3.add(MaxPooling1D(5))
CNN3.add(Flatten())
CNN3.add(Dense(128, activation='relu'))
CNN3.add(BatchNormalization())
CNN3.add(Dropout(0.5))
CNN3.add(Dense(2, activation='softmax'))
CNN3.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])

In [306]:
# checkpointer for best weights
CNN_pretrain_path = SAVE_MODEL_DIR + 'mrsa_CNN_pretrain.h5'
checkpointer = ModelCheckpoint(CNN_pretrain_path, save_best_only=True)

# Fit model     
CNN3.fit(X_train,y_train, 
         validation_data = (X_valid,y_valid), 
         batch_size=128, epochs=5,
         callbacks=[checkpointer])

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f958b818090>

In [307]:
CNN3.fit(X_train,y_train, 
         validation_data = (X_valid,y_valid), 
         batch_size=128, epochs=2,
         callbacks=[checkpointer])

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f958a9a35d0>

### Results

In [308]:
CNN3.load_weights(CNN_pretrain_path)
score, acc = CNN3.evaluate(test_data,test_labels)
print('Test score: %.2f' % score)
print('Test accuracy: %.2f' % acc)

Test score: 0.25
Test accuracy: 0.91


## All embeddings results

So we ended up with testing accuracies of **96%** for the trainable embeds, **94%** for the gensim embeds, **91%** for the GloVe embeds. It's hard to tell from the outset whether pretrained embeds or embeds learned from your own corpus will be more useful. If the context in which words are used is drastically different in your corpus than in the corpus used for the pretrained embeds, it would make sense that learned embeds would fair better. In this case, it seems that learned embeds are better, and learned embeds via accuracy maximization and backprop are best.

# LSTM with trainable embeddings

LSTMs are especially good at learning useful representations from text. See Andrei Karpathy's great blog post for some cool examples and explanations on the topic: [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

Since the trainable embedding layer worked best for the CNN, that's what I'm going to use with the LSTM model. I kept the architecture simple, otherwise the training way took too long. I added the dropout layer when I found the model wasn't generalizing well to the validation set.

In [15]:
from keras.models import Sequential
from keras.callbacks import ModelCheckpoint
from keras.layers import LSTM, Embedding, Dropout, BatchNormalization, Dense

LSTM1 = Sequential()
LSTM1.add(Embedding(len(word_index)+1,
                   word_vector_size,
                   input_length=max_seq_length))
LSTM1.add(Dropout(0.25))
LSTM1.add(LSTM(64))
LSTM1.add(BatchNormalization())
LSTM1.add(Dense(2, activation='softmax'))

LSTM1.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

In [18]:
# checkpointer for best weights
lstm_weights_path = SAVE_MODEL_DIR + "mrsa_lstm_weights.h5"
checkpointer = ModelCheckpoint(lstm_weights_path, save_best_only=True)

LSTM1.fit(X_train,y_train, 
         validation_data = (X_valid,y_valid), 
         batch_size=128, epochs=2,
         callbacks=[checkpointer])

Train on 20000 samples, validate on 5000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fb68d653ad0>

In [19]:
LSTM1.fit(X_train,y_train, 
         validation_data = (X_valid,y_valid), 
         batch_size=128, epochs=1,
         callbacks=[checkpointer])

Train on 20000 samples, validate on 5000 samples
Epoch 1/1


<keras.callbacks.History at 0x7fb68bfb77d0>

Looks like the model is overfitting at this point. Let's see how the best weights do on the testing set.

In [20]:
LSTM1.load_weights(lstm_weights_path)
score, acc = LSTM1.evaluate(test_data,test_labels)
print('Test score: %.2f' % score)
print('Test accuracy: %.2f' % acc)

Test score: 0.22
Test accuracy: 0.94


Tied for second! Let's try to get the best of both worlds next.

# CNN and LSTM

The above model took *forever* to train. To help speed things along, I added a 1D convolutional layer and a max pooling layer. This way the size of the input to the LSTM is much smaller. I'm not sure whether the addition of the convolution layer will reveal useful latent features, or throw away relevant information...<br> Time to find out!

In [23]:
model = Sequential()
model.add(Embedding(len(word_index)+1,
                   word_vector_size,
                   input_length=max_seq_length))
model.add(Dropout(0.25))
model.add(Conv1D(64,5,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(4))
model.add(LSTM(70))
model.add(Dense(2, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

The table below shows what I was saying about the dimensionality of the input to the LSTM layer. The output from the embedding layer is (1000,100) which was the input into the LSTM layer on the previous model. In this case, after the convolutional layer and max pooling, the input to the LSTM is (249,64)!

In [24]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 1000, 100)         8977400   
_________________________________________________________________
dropout_5 (Dropout)          (None, 1000, 100)         0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 996, 64)           32064     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 249, 64)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 70)                37800     
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 142       
Total params: 9,047,406
Trainable params: 9,047,406
Non-trainable params: 0
_________________________________________________________________


In [329]:
model.fit(X_train, y_train,
          batch_size=128,
          epochs=1,
          validation_data=(X_valid, y_valid))

Train on 20000 samples, validate on 5000 samples
Epoch 1/1


<keras.callbacks.History at 0x7f959ad24810>

In [330]:
model.fit(X_train, y_train,
          batch_size=128,
          epochs=3,
          validation_data=(X_valid, y_valid))

Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f959ab8abd0>

In [332]:
#model.load_weights(lstm_weights_path)
score, acc = model.evaluate(test_data,test_labels)
print('Test score: %.2f' % score)
print('Test accuracy: %.2f' % acc)

Test score: 0.11
Test accuracy: 0.97


97% test accuracy! LSTM FTW!

Crazy that logistic regression on tf-idf vectors only does a few percent worse though. And in a *fraction* of the training time.