# Movie Review Sentiment Analysis

In this Kernel I will be taking on the Kaggle competition in which the challenge is to classify phrases from the famous Rotten Tomatoes Movie Review dataset, with the aim of assigning each phrase to 1 of 5 sentiment labels.

This will be my first application of Deep Learning within a Kaggle competition, therefore I would greatly appreciate any feedback on the techniques employed during this section - as well as any other section for that matter! 

I hope you enjoy the read.

In [None]:
# First up, every library used in this project is imported at the start.

# Data handling and processing
import pandas as pd
import numpy as np

# Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
from scipy import stats
import statsmodels.api as sm
from scipy.stats import randint as sp_randint
from time import time

# NLP
import nltk
nltk.download('wordnet')
import re
from textblob import TextBlob
from sklearn.feature_extraction import text
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Machine Learning
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

In [None]:
# Reading the competition files (tab-separated train/test, comma-separated sample submission) from the Kaggle input directory
train = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/train.tsv', sep = '\t')
test = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/test.tsv', sep = '\t')
sub = pd.read_csv('../input/movie-review-sentiment-analysis-kernels-only/sampleSubmission.csv', sep=',')

## 1. Initial Inspection

Submissions are evaluated on classification accuracy (the percent of labels that are predicted correctly) for every parsed phrase. The sentiment labels are:

0 - negative<br>
1 - somewhat negative<br>
2 - neutral<br>
3 - somewhat positive<br>
4 - positive

Let's take an initial peek into what we're dealing with:

In [None]:
# General information about the Dataset
train.info()

In [None]:
# First 10 rows of the Dataset
train.head(10)

Looks like a simple Dataset to get to grips with (which is always welcome). We can see here that each sentence is broken into multiple phrases, each having their own sentiment classification. Each sentence is denoted by the 'SentenceId' column.

In [None]:
# Checking out the total number of unique sentences
train['SentenceId'].nunique()

I'll take a further dive into understanding more about the sentence structure within both the training & test Dataset:

In [None]:
# Returning average count of phrases per sentence, per Dataset
print('Average count of phrases per sentence in train is {0:.0f}.'.format(train.groupby('SentenceId')['Phrase'].count().mean()))
print('Average count of phrases per sentence in test is {0:.0f}.'.format(test.groupby('SentenceId')['Phrase'].count().mean()))

In [None]:
# Returning total phrase and sentence count, per Dataset
print('Number of phrases in train: {}; number of sentences in train: {}.'.format(train.shape[0], len(train.SentenceId.unique())))
print('Number of phrases in test: {}; number of sentences in test: {}.'.format(test.shape[0], len(test.SentenceId.unique())))

In [None]:
# Returning average word length of phrases, per Dataset
print('Average word length of phrases in train is {0:.0f}.'.format(np.mean(train['Phrase'].apply(lambda x: len(x.split())))))
print('Average word length of phrases in test is {0:.0f}.'.format(np.mean(test['Phrase'].apply(lambda x: len(x.split())))))

Some interesting headlines from the above: there are far more phrases than sentences, and the average word length of phrases, at 7, is quite low. This will need to be considered when cleaning the text in order to strike the right balance between making the data neater and losing so much data that Machine Learning becomes more difficult.

The last step in this initial exploration is to explore the target in a little more detail. A graph should help with that!

In [None]:
# Set up graph
fig, ax = plt.subplots(1, 1, dpi = 100, figsize = (10, 5))

# Get data, sorted by label so that the bars and the tick labels line up
sentiment_count = train['Sentiment'].value_counts().sort_index()
sentiment_labels = sentiment_count.index

# Plot graph
sns.barplot(x = sentiment_labels, y = sentiment_count)

# Plot labels
ax.set_ylabel('Count')    
ax.set_xlabel('Sentiment Label')
ax.set_xticklabels(sentiment_labels , rotation=30)

The sentiment classes are far from evenly represented in this dataset. A strong class imbalance may cause a few issues with machine learning, so it is good to be aware of this now!
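One practical way to keep the imbalance in view is to compare any model against a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent label. The sketch below uses made-up labels rather than the competition data, so the numbers are purely illustrative:

```python
from collections import Counter

# Hypothetical labels, for illustration only (not the competition data)
labels = [2, 2, 2, 2, 2, 1, 1, 3, 3, 0, 4, 2]

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the most frequent class
baseline_accuracy = majority_count / len(labels)

print(dict(sorted(counts.items())))
print('Majority label:', majority_label)
print('Baseline accuracy: {:.2f}'.format(baseline_accuracy))
```

Any classifier worth keeping should beat this figure on the real data.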

## 2. Text Preprocessing

To make this step easier I will combine both Datasets and clean as a whole. Marking the Sentiment column as -999 within the test set means there's zero chance of cross contamination.

In [None]:
# New column in the test set for concatenating
test['Sentiment']=-999
test.head()

In [None]:
# Concatenating Datasets before the cleaning can begin
data = pd.concat([train,test], ignore_index = True)
print(data.shape)
data.tail()

In [None]:
# Deleting previous Datasets from memory
del train,test

### Noise Removal

Any piece of text which is not relevant to the context of the data and the end-output can be considered noise.

For example: language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, punctuation and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.

Typically I would define a function to clear all of this noise, to then yield better Machine Learning results. However, this dataset contains a lot of single-word entries. Removing numbers, punctuation, special characters and whole words would wipe out a large percentage of the available observations, potentially proving more trouble than it's worth when modelling. So, I am going to keep them in and proceed with a more 'light touch' noise removal function:

In [None]:
# Basic text cleaning function
def remove_noise(text):
    
    # Convert to string first, so the string methods below are safe to call
    text = text.astype(str)
    
    # Make lowercase
    text = text.apply(lambda x: x.lower())
    
    # Strip leading/trailing whitespace and collapse repeated whitespace
    text = text.apply(lambda x: " ".join(x.split()))
        
    return text

In [None]:
# Apply the function and create a new column for the cleaned text
data['Clean Review'] = remove_noise(data['Phrase'])
data.head()

With the text cleaned, the data can now be split back into training and test sets.

In [None]:
# Re-instating the training set
train = data[data['Sentiment'] != -999]
train.shape

In [None]:
# Re-instating the test set
# Copy the slice so the in-place drop does not act on a view of `data`
test = data[data['Sentiment'] == -999].copy()
test.drop('Sentiment', axis=1, inplace=True)
test.shape

## 3. Creating a text matrix

To analyse preprocessed data, it needs to be converted into features. Depending on the use case, text features can be constructed using a variety of techniques; in this kernel I will be converting the data into statistical features.

The specific model in question is known as <b>'Term Frequency – Inverse Document Frequency' (TF-IDF)</b>.

TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text documents into vectors on the basis of the occurrence of words in the documents, without considering their exact ordering. For example, say there is a dataset of N text documents. For any document “D”, TF and IDF are defined as follows:

- <b>Term Frequency (TF)</b> – TF for a term “t” is defined as the count of the term “t” in a document “D”
- <b>Inverse Document Frequency (IDF)</b> – IDF for a term “t” is defined as the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the term “t”.
- <b>TF . IDF</b> – The TF-IDF score is the product TF × IDF, and gives the relative importance of a term to a document within a corpus (list of documents). Following is the code using Python’s scikit-learn package to convert text into TF-IDF vectors:

In [None]:
# Getting a count of words from the documents
# ngram_range is set to (1, 2), meaning both single words and two-word combinations will be extracted
tokenizer = TweetTokenizer()

cvec = CountVectorizer(ngram_range=(1,2), tokenizer=tokenizer.tokenize)
full_text = list(train['Clean Review'].values) + list(test['Clean Review'].values)
cvec.fit(full_text)

# Getting the total n-gram count
len(cvec.vocabulary_)

I am happy with that number as a starting point; less than 1000 was my initial aim. If I wanted to be more or less restrictive on n-gram selection, I could adjust the 'min_df' and 'max_df' parameters within my CountVectorizer, which control the minimum and maximum number of documents each term should feature in.
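To make the n-gram extraction above concrete, here is a minimal plain-Python sketch of what an n-gram range of (1, 2) produces for a single phrase. The helper `extract_ngrams` and the example phrase are hypothetical and only for illustration; CountVectorizer does the equivalent internally, on top of its own tokenization:

```python
def extract_ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams whose size falls in ngram_range, as space-joined strings."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

# Hypothetical phrase, split on whitespace for simplicity
tokens = 'not a good movie'.split()
ngrams = extract_ngrams(tokens)
print(ngrams)
```

A four-word phrase yields four unigrams and three bigrams, which is why adding bigrams grows the vocabulary so quickly.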

We can now tackle the next step, which is to turn this document into a <b>'bag of words' representation</b>. This is essentially just a separate column for each term containing the count within each document. After that, we’ll take a look at the <b>sparsity</b> of this representation which lets us know how many <b>nonzero values</b> there are in the dataset. The more sparse the data is the more challenging it will be to model, but that’s a discussion for another day:

In [None]:
# Creating the bag-of-words representation: training set
train_vectorized = cvec.transform(train['Clean Review'])

# Getting the matrix shape
print('sparse matrix shape:', train_vectorized.shape)

# Getting the nonzero count
print('nonzero count:', train_vectorized.nnz)

# Getting sparsity %
print('sparsity: %.2f%%' % (100.0 * train_vectorized.nnz / (train_vectorized.shape[0] * train_vectorized.shape[1])))

In [None]:
# Creating the bag-of-words representation: test set
test_vectorized = cvec.transform(test['Clean Review'])

# Getting the matrix shape
print('sparse matrix shape:', test_vectorized.shape)

# Getting the nonzero count
print('nonzero count:', test_vectorized.nnz)

# Getting sparsity %
print('sparsity: %.2f%%' % (100.0 * test_vectorized.nnz / (test_vectorized.shape[0] * test_vectorized.shape[1])))
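A note on terminology: the percentage printed above is the share of nonzero cells, which is often called the density; sparsity is then commonly taken to be one minus that figure. The toy matrix below is hypothetical and only illustrates the arithmetic:

```python
# Hypothetical 3 x 5 bag-of-words matrix (rows = documents, columns = terms)
counts = [
    [1, 0, 0, 2, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]

total_cells = len(counts) * len(counts[0])
nonzero = sum(1 for row in counts for value in row if value != 0)

density = nonzero / total_cells   # share of nonzero cells
sparsity = 1 - density            # share of zero cells

print('nonzero count:', nonzero)
print('density: {:.2%}'.format(density))
print('sparsity: {:.2%}'.format(sparsity))
```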

Now that we have term counts for each document, the TfidfTransformer can be applied to calculate the weights for each term in each document:

In [None]:
# Instantiating the TfidfTransformer
transformer = TfidfTransformer()

# Fitting on the training n-grams only, then applying the same weights to the test set
train_tdidf = transformer.fit_transform(train_vectorized)
test_tdidf = transformer.transform(test_vectorized)
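To make the weighting concrete, here is a hand computation that follows the plain definitions of TF and IDF given earlier. It is a sketch on a hypothetical toy corpus, and it assumes every queried term appears in at least one document. Note that scikit-learn's TfidfTransformer, with its default settings, uses a smoothed IDF and L2-normalises each row, so its numbers will differ from these:

```python
import math

# Hypothetical toy corpus, already tokenized
docs = [
    ['good', 'movie'],
    ['bad', 'movie'],
    ['good', 'good', 'fun'],
]

def tf(term, doc):
    # Raw count of the term in one document
    return doc.count(term)

def idf(term, docs):
    # log(total documents / documents containing the term)
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf('good', docs[2], docs))   # 2 * log(3 / 2)
print(tf_idf('fun', docs[2], docs))    # 1 * log(3 / 1)
```

The rarer term 'fun' ends up with the larger weight, even though 'good' occurs twice in the same document.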

With words weighted, the time has now come to make some predictions :).

## 4. Machine Learning

Kaggle already provides us with training and test datasets, so data preparation at this stage is as simple as computing X & y variables for the training set.

In [None]:
# Create X & y variables for Machine Learning
X_train = train_tdidf
y_train = train['Sentiment']

X_test = test_tdidf

To help with Machine Learning I will define a function that returns the most prized statistics in one go. After instantiating a model, this function will return a mean accuracy score across 5 folds of cross-validation, so that the score is averaged over several train/validation splits rather than relying on a single one. Next, it will provide the Confusion Matrix: how many correct vs incorrect classifications have actually taken place within the given model? Lastly, it will produce a Classification Report, which details other important metrics such as Precision, Recall, the F1 score (which is just the harmonic mean of the former two), and support (the number of true observations in each class).

Combined, these metrics will provide rich insight into individual model performance and will guide better selection towards the best performing model, and how best to optimise it.
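As a rough illustration of how these metrics relate to one another, the sketch below builds a small confusion matrix by hand and derives precision, recall and F1 for one class. The helper names and the labels are hypothetical; scikit-learn's `confusion_matrix` and `classification_report` compute the same quantities on the real data:

```python
def confusion_counts(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def precision_recall_f1(matrix, i):
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)   # column sum: times class i was predicted
    actual = sum(matrix[i])                     # row sum: support of class i
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical true and predicted labels for a 3-class problem
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

cm = confusion_counts(y_true, y_pred, labels=[0, 1, 2])
print(cm)
print(precision_recall_f1(cm, 1))
```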

In [None]:
# Model Fit and Prediction
def model(mod, model_name, X_train, y_train):
    
    # Fitting model
    mod.fit(X_train, y_train)
    
    # Print model name
    print(model_name)
    
    # Compute 5-fold cross validation: Accuracy
    acc = cross_val_score(mod, X_train, y_train, scoring = "accuracy", cv = 5)

    # Compute 5-fold prediction on training set
    predictions = cross_val_predict(mod, X_train, y_train, cv = 5)

    # Return accuracy score to 3dp
    print("Accuracy:", round(acc.mean(), 3))
 
    # Compute confusion matrix (true labels first, then predictions)
    cm = confusion_matrix(y_train, predictions)
    
    # Print confusion matrix
    print("Confusion Matrix:  \n", cm)

    # Print classification report (true labels first, then predictions)
    print("Classification Report \n", classification_report(y_train, predictions))

I will apply a Logistic Regression initially, since it can handle multi-class classification problems and trains in a relatively short amount of time.

In [None]:
# Logistic Regression
log = LogisticRegression(multi_class='ovr')
model(log, "Logistic Regression", X_train, y_train)

As initially expected, we're not getting great performance with Machine Learning. I sense this Dataset may better suit a Deep Learning application, so that's where I'll head next.

## 5. Deep Learning

Machine learning provided limited success in solving this challenge, so over to deep learning we head. The reason deep learning typically outperforms bag-of-words models is its ability to capture the sequential dependency between words in a sentence. This has been made possible by special neural network architectures called Recurrent Neural Networks.

Specifically in this kernel, I will be working with Long Short-Term Memory networks, or 'LSTMs' for short. These networks possess a memory that "remembers" previous data from the input and makes decisions based on that knowledge. These networks are more directly suited to written data inputs, since each word in a sentence has meaning based on the surrounding words (previous and upcoming words).

Before I can get my hands dirty with these networks, I'll need to import the required tools from the Keras package:

In [None]:
# Importing all required tools for deep learning from keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, Conv1D, GRU, CuDNNGRU, CuDNNLSTM, BatchNormalization
from keras.layers import Bidirectional, GlobalMaxPool1D, MaxPooling1D, Add, Flatten
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D, concatenate, SpatialDropout1D
from keras.models import Model, load_model
from keras import initializers, regularizers, constraints, optimizers, layers, callbacks
from keras import backend as K
from keras.engine import InputSpec, Layer
from keras.optimizers import Adam

from keras.callbacks import ModelCheckpoint, TensorBoard, Callback, EarlyStopping

Before any words can be fed into the neural network, a series of processing steps are first required:

1. Tokenization - We need to break down the sentence into unique words. E.g. "I love cats and love dogs" will become ["I","love","cats","and","dogs"]
2. Indexing - We put the words in a dictionary-like structure and give each of them an index. E.g. {1:"I",2:"love",3:"cats",4:"and",5:"dogs"}
3. Index Representation - We represent the sequence of words in each phrase as a sequence of indices, and feed this chain of indices into the network (e.g. [1,2,3,4,2,5]).
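As a toy illustration of these three steps in plain Python, the sketch below reproduces the example sentence. The variable names are hypothetical, and the indices are assigned in order of first appearance to match the example; the Keras `Tokenizer` instead orders its index by word frequency, so its numbers would differ:

```python
sentence = "I love cats and love dogs"

# 1. Tokenization: split the sentence into word tokens
tokens = sentence.split()

# 2. Indexing: give each unique word an integer, in order of first appearance
word_to_index = {}
for token in tokens:
    if token not in word_to_index:
        word_to_index[token] = len(word_to_index) + 1   # 0 is kept free for padding

# 3. Index representation: replace each token by its index
sequence = [word_to_index[token] for token in tokens]

print(word_to_index)
print(sequence)
```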

The below code will complete each of these steps in turn:

In [None]:
# 1. Tokenization
tokenizer = Tokenizer(lower=True, filters='')

tokenizer.fit_on_texts(full_text)

In [None]:
# 2. Indexing
train_sequences = tokenizer.texts_to_sequences(train['Clean Review'])
test_sequences = tokenizer.texts_to_sequences(test['Clean Review'])

In [None]:
# 3. Index Representation, padded to a provisional length for a first look
MAX_LENGTH = 50

padded_train_sequences = pad_sequences(train_sequences, maxlen=MAX_LENGTH)
padded_test_sequences = pad_sequences(test_sequences, maxlen=MAX_LENGTH)
padded_train_sequences

Next up, I will settle on the length used to 'pad' the features so that they are of a consistent length - another important processing step.

The below code will first of all reveal the appropriate max length to set based on the phrases. This 'max_len' will then be applied in the padding step that follows.

In [None]:
# Find and plot total word count per phrase
totalNumWords = [len(one_comment) for one_comment in train_sequences]

plt.hist(totalNumWords,bins = np.arange(0,20,1))
plt.show()

In [None]:
# Setting max_len to 20 and padding data
max_len = 20
X_train = pad_sequences(train_sequences, maxlen = max_len)
X_test = pad_sequences(test_sequences, maxlen = max_len)
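For intuition, the default behaviour of `pad_sequences` (zeros added at the front, and overly long sequences cut from the front) can be sketched in a few lines of plain Python. The helper `pad_pre` is hypothetical and handles one sequence at a time:

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        return sequence[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

short = pad_pre([1, 2, 3], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)
print(long)
```

Any phrase longer than `max_len` therefore loses its earliest words, which is worth bearing in mind when choosing the cut-off.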

### LSTM-CNN Model

Now onto the model! This LSTM-CNN model consists of an initial LSTM layer which will receive word embeddings for each token in the review as inputs. The intuition is that its output tokens will store information not only of the initial token, but also any previous tokens; in other words, the LSTM layer is generating a new encoding for the original input. The output of the LSTM layer is then fed into a convolution layer which we expect will extract local features. Finally, the convolution layer’s output will be pooled to a smaller dimension and ultimately output as one of the five sentiment labels.

I must credit the following source which introduces the idea of LSTM-CNN Model for Sentiment Analysis. I encourage you to take a read! - http://konukoii.com/blog/2018/02/19/twitter-sentiment-analysis-using-combined-lstm-cnn-models/

In [None]:
# Link to 2 million word vectors trained on Common Crawl (600B tokens)
embedding_path = '../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec'

In [None]:
# Setting embedding size & max number of features
embed_size = 300
max_features = 30000

In [None]:
# Prepare the embedding matrix
def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')

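# Read the vectors inside a context manager so the file is closed afterwards
with open(embedding_path, encoding='utf-8') as f:
    embedding_index = dict(get_coefs(*o.strip().split(" ")) for o in f)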

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embedding_index.get(word)
    
    # Words not found in the embedding index keep their all-zero row
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector

In [None]:
# One hot encoding the y variable ready for deep learning application
ohe = OneHotEncoder(sparse=False)
y_ohe = ohe.fit_transform(y_train.values.reshape(-1, 1))
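For intuition, one-hot encoding and its inverse can be sketched in plain Python. The helpers below are hypothetical; they mirror what `OneHotEncoder` does for the five sentiment labels, and how taking the position of the largest score later turns a row of model outputs back into a single label:

```python
def one_hot(label, num_classes=5):
    """Return a vector with a 1.0 at the position of `label` and 0.0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

def decode(scores):
    """Return the index of the largest score (the predicted label)."""
    return max(range(len(scores)), key=lambda i: scores[i])

encoded = one_hot(3)
print(encoded)

# Hypothetical model output for one phrase
print(decode([0.10, 0.20, 0.60, 0.05, 0.05]))
```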

Finally, we can now define the model architecture.

In [None]:
# Create check-point
file_path = "best_model.hdf5"
check_point = ModelCheckpoint(file_path, monitor = "val_loss", verbose = 1,
                              save_best_only = True, mode = "min")

# Define callbacks
early_stop = EarlyStopping(monitor = "val_loss", mode = "min", patience = 3)

# Build model
def build_model(lr = 0.0, lr_d = 0.0, units = 0, dr = 0.0):
    inp = Input(shape = (max_len,))
    # Vocabulary size is taken from the embedding matrix rather than hard-coded
    x = Embedding(embedding_matrix.shape[0], embed_size, weights = [embedding_matrix], trainable = False)(inp)
    x1 = SpatialDropout1D(dr)(x)

    x_gru = Bidirectional(CuDNNGRU(units, return_sequences = True))(x1)
    x1 = Conv1D(32, kernel_size=3, padding='valid', kernel_initializer='he_uniform')(x_gru)
    avg_pool1_gru = GlobalAveragePooling1D()(x1)
    max_pool1_gru = GlobalMaxPooling1D()(x1)
    
    x3 = Conv1D(32, kernel_size=2, padding='valid', kernel_initializer='he_uniform')(x_gru)
    avg_pool3_gru = GlobalAveragePooling1D()(x3)
    max_pool3_gru = GlobalMaxPooling1D()(x3)
    
    x_lstm = Bidirectional(CuDNNLSTM(units, return_sequences = True))(x1)
    x1 = Conv1D(32, kernel_size=3, padding='valid', kernel_initializer='he_uniform')(x_lstm)
    avg_pool1_lstm = GlobalAveragePooling1D()(x1)
    max_pool1_lstm = GlobalMaxPooling1D()(x1)
    
    x3 = Conv1D(32, kernel_size=2, padding='valid', kernel_initializer='he_uniform')(x_lstm)
    avg_pool3_lstm = GlobalAveragePooling1D()(x3)
    max_pool3_lstm = GlobalMaxPooling1D()(x3)
    
    
    x = concatenate([avg_pool1_gru, max_pool1_gru, avg_pool3_gru, max_pool3_gru,
                    avg_pool1_lstm, max_pool1_lstm, avg_pool3_lstm, max_pool3_lstm])
    x = BatchNormalization()(x)
    x = Dropout(0.2)(Dense(128,activation='relu') (x))
    x = BatchNormalization()(x)
    x = Dropout(0.2)(Dense(100,activation='relu') (x))
    x = Dense(5, activation = "sigmoid")(x)
    
    model = Model(inputs = inp, outputs = x)
    model.compile(loss = "binary_crossentropy", optimizer = Adam(lr = lr, decay = lr_d), metrics = ["accuracy"])
    history = model.fit(X_train, y_ohe, batch_size = 128, epochs = 15, validation_split=0.1, 
                        verbose = 1, callbacks = [check_point, early_stop])
    model = load_model(file_path)
    
    return model

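The targets are one-hot encoded and the output layer uses sigmoid activations, so the loss is binary cross-entropy, which treats each of the five classes as a separate yes/no problem. Since every phrase carries exactly one label, a softmax output with categorical cross-entropy would be the more conventional pairing and may be worth trying. With the LSTM-CNN model defined, we can now fit, predict and then submit.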

In [None]:
# Instantiating model
model = build_model(lr = 1e-4, lr_d = 0, units = 128, dr = 0.5)

In [None]:
# Making predictions with test data
pred = model.predict(X_test, batch_size = 1024)

In [None]:
# Preparing final submission file
# argmax already returns integer class indices, so no rounding is needed
predictions = np.argmax(pred, axis=1)

sub['Sentiment'] = predictions
sub.to_csv("submission.csv", index=False)

Thank you for reading this Kernel. I note possible areas for expansion that include ensembling multiple Deep Learning models, as well as blending the final results with those from the Logistic Regression, perhaps with the inclusion of further Machine Learning algorithms that were not introduced in this Kernel (Decision Trees, SVC). These are areas that I may return to in future; they may even be of interest to you, too.

Last up, please feel free to offer any feedback on the approach/code within this Kernel. I am always looking for new or better ways of doing things. Thank you again!