# Sentiment Analysis of Movie Reviews using BERT

This project aims to train a set of deep learning models on movie review data in order to predict overall sentiment - positive or negative - using BERT, a pretrained NLP model developed by Google.

## Overview

Initially, I loaded the imdb movie review data using pandas dataframe. Then, I tokenized the reviews and used distilBERT to model the reviews as vectors of length 768, representing the hidden units of the classification token. Finally, I trained the logistic regression, SVM, and Naive Bayes models on this vector data and reported the accuracies of each model.

## Installing/Importing Libraries
First, I imported all the neccessary modules.

In [1]:
import time
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import BayesianRidge
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

## Parameter Initialization
Then, I set a few parameters for training and test data size and for the classification models.

In [2]:
sample_size = 250
MAX_ITER = 10000

## Importing Dataset
Next, I read in the csv file using pandas and stored it into a dataframe object. The dataset should have no header (no column labels), since those can be inferred. I also sliced the dataset to 2*sample_size in order to work with a smaller dataset resulting in shorter program run times.

In [3]:
start_time = time.time()

df_large = pd.read_csv('imdb_original_cleaned.csv', delimiter=',', header=None, encoding='utf-8')
df_pos = df_large[df_large[1] == 1]
df_pos_small = df_pos[:sample_size]
df_neg = df_large[(df_large[1] == 0)]
df_neg_small = df_neg[:sample_size]
df = pd.concat([df_pos_small,df_neg_small])
print(df)

                                                     0  1
0    Match 1: Tag Team Table Match Bubba Ray and Sp...  1
1    There's a sign on The Lost Highway that says:<...  1
2    (Some spoilers included:)<br /><br />Although,...  1
3    Back in the mid/late 80s, an OAV anime by titl...  1
4    **Attention Spoilers**<br /><br />First of all...  1
..                                                 ... ..
564  Title: Zombie 3 (1988) <br /><br />Directors: ...  0
565  I have read several reviews that ask the quest...  0
566  While filming an 80's horror movie called 'Hot...  0
567  I am not surprised to find user comments for t...  0
568  The Howling II starts as it means to go on wit...  0

[500 rows x 2 columns]


## Preprocessing and Cleaning Dataset
Since BERT can only model up to 512 tokens, the reviews in the dataset needed to be shortened. So, the data was cleaned and processed to only include adjectives and adverbs instrumental in determining whether the review is positive or negative. 

To do this, I first created a function to read files containing lists of common adjectives and adverbs and storing them in a set. Then, I wrote a function to keep only those adjectives and adverbs from each review and applied the function to all the rows of the dataframe.

In [4]:
# Reads input from a file line by line and returns a set of all the information
def SetFromFile(filename):
    inp = open(filename, 'r')
    adj_set = set()
    line = inp.readline().strip('\n')
    while line:
        if line != "review":
            adj_set.add(line)
        line = inp.readline().strip('\n')
    return adj_set

# Creating the adjective and adverb sets
ADJ_SET = SetFromFile("large_adjectives.txt")
ADV_SET = SetFromFile("large_adverbs.txt")

# Keeps only the words in a given string that are present in the adjective or adverb sets
def keep_adj_and_adv(strg, adjSet, advSet):
    strg = strg.replace("[^a-zA-Z0-9]", " ")
    str_list = strg.split()
    str_set = set(str_list)
    setA = str_set.intersection(adjSet)
    setB = str_set.intersection(advSet)
    final_set = setA.union(setB)
    final_list = list(final_set)
    final_str = ",".join(final_list)
    return final_str

only_adjectives_df = df[0].apply(lambda x: keep_adj_and_adv(x, ADJ_SET, ADV_SET))

## Loading Pre-trained BERT Model
The next step is to load in the BERT model and tokenizer.

In [5]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')


# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

## Tokenization
Then, I tokenized the reviews which consists of breaking up the reviews (containing only adjectives and adverbs) into words and subwords and mapping them to their respective IDs, only keeping the first 512 tokens.

In [6]:
# tokenizes the data
tokenized = only_adjectives_df.apply(lambda x: tokenizer.encode(x, add_special_tokens=True)[:512])

Token indices sequence length is longer than the specified maximum sequence length for this model (526 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (645 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (560 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (583 > 512). Running this sequence through the model will result in indexing errors


## Padding
Right now, tokenized is a list of sentences where each sentence is a list of tokens - thus, a list of lists. However, in order for BERT to process all of the data at once, it needs to be uniform. This can be accomplished by padding the internal lists to the same size. Then, it can be represented as a 2D array instead of a list of lists.

In [7]:
# pads the vectors with 0's
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)
        
padded = np.array([j + [0]*(max_len-len(j)) for j in tokenized.values])

Can view the dimensions of the padded data

In [8]:
padded_shape = np.array(padded).shape
print(padded_shape)

(500, 512)


## Masking
The final step is to create a mask to ignore the padding put in earlier which will be used in the BERT model when processing the input.

In [9]:
attention_mask = np.where(padded != 0, 1, 0)

## BERT Processing
First, the padded and attention_mask arrays are turned into tensors and then are inputted into the model() function which creates sentence embeddings. This function will return a tuple where the first value is a 3D array consisting of all the hidden states (768) for each token in each sequence.

In [10]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Since sentiment analysis is a classification task, I sliced the output to contain only the information relevant to the first token (CLS special token) of each sentence. This token can be thought of as representing the entire sentence. The sliced output is a 2D matrix where each row is a feature vector consisting of the 768 hidden units of the CLS token for the corresponding sentence.

This matrix will be stored in a features variable and the positive or negative labels will be stored in the labels variable.

In [11]:
features = last_hidden_states[0][:,0,:].numpy()
labels = df[1]

## Train/Test Split
Next, I split the data into a training set and a test set.

In [12]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

## Creating Deep Learning Model

Next, I trained several models using the training data and evaluated the accuracy using the test dataset.

The models I trained are
- Logistic Regression
- SVM
- Naive Bayes

### Training
I trained the models from Scikit Learn using the default parameters, with the exception of max_iter.

In [13]:
# Logistic Regression
lr_clf = LogisticRegression(max_iter = MAX_ITER)
lr_clf.fit(train_features, train_labels)

# Support Vector Machine Classification
svm_clf = SVC(max_iter = MAX_ITER)
svm_clf.fit(train_features, train_labels)

# Naive Bayes
br_clf = BayesianRidge(n_iter = MAX_ITER)
br_clf.fit(train_features, train_labels)

BayesianRidge(n_iter=10000)

### Evaluation
Then, I checked the accuracy of the models against the test dataset.

In [14]:
# Logistic Regression
lr_test_score = lr_clf.score(test_features, test_labels)

# Support Vector Machine Classification
svm_test_score = svm_clf.score(test_features, test_labels)

# Naive Bayes
br_test_score = br_clf.score(test_features, test_labels)


## Dummy Classifier
In order to normalize the accuracy of each model, I used a dummy classifier provided by scikit learn.

In [15]:
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_labels)



## Accuracy of Models

In [16]:
print("Dataset Size: " + str(sample_size*2))
print("Logistic Regression Model Accuracy: %0.2f" % lr_test_score)
print("SVM Model Accuracy: %0.2f" % svm_test_score)
print("Bayesian Ridge Model Accuracy: %0.2f" % br_test_score)
print("Dummy classifier score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dataset Size: 500
Logistic Regression Model Accuracy: 0.80
SVM Model Accuracy: 0.75
Bayesian Ridge Model Accuracy: 0.40
Dummy classifier score: 0.52 (+/- 0.08)


## Program Run Time

In [17]:
end_time = time.time()
elapsed_time = end_time - start_time
# print("Start Time: " + str(start_time))
# print("End Time: " + str(end_time))
print("Run Time (min): %0.2f" % (elapsed_time/60))

Run Time (min): 6.26


## Problems Encountered

Problem: pandas read_csv function was not complying with the file containing the dataset. It was not loading the data propoerly.

Solution: adding the field - encoding='utf-8' - to the read_csv function in order for pandas to load the data into a dataframe

Problem: BERT supports only up to 512 tokens, which was much smaller than some of the reviews in the dataset

Solution: I filtered out the nouns and verbs, choosing to keep only the adjectives and adverbs for sentiment classification.

Problem: The logisitic regression model used to train the feature extraction data was timing out.

Solution: After researching this issue, I decided to increase the max iterations field of the model.

## Conclusions
After getting the logistic regression model to work, I added the SVM and naive bayes models to determine which of these widely-used classification models performs the best on text data and text classification. I found that the logisitc regression model works the best, however, these model should still be tested on larger dataset sizes. Additionally, part of its accuracy can be accounted for due to random choice, such as by the dummy classifier. 