# Sentiment Analysis with IMDB Dataset of 50K Movie Reviews

[Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

## Overview: This project has three parts
- Part 1: Cleaning the data
- Part 2: Classifying sentiments using Bag of Words (BOW) and Term Frequency - Inverse Document Frequency (TFIDF) as text encoders, with Naive Bayes and Support Vector Machine as models
- Part 3: Classifying sentiments using deep learning (a traditional neural network and a bidirectional LSTM)

Basically, my idea is to start with simple models and increase the complexity to see whether more complex models perform better.

---

First let's import the necessary libraries

In [1]:
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling1D, LSTM, Bidirectional, Embedding, TextVectorization
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import pad_sequences

np.random.seed(69)
tf.random.set_seed(69)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


And read the dataset

In [2]:
df = pd.read_csv("IMDB Dataset.csv")
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


Looking at the description, we can see that this dataset has 50000 reviews, half of which are positive and the other half negative.

There are duplicates (49582 unique reviews out of 50000), but I don't think they cause big issues in this case, so I decided not to remove them.

In [3]:
df.isna().any()

review       False
sentiment    False
dtype: bool

Luckily, there is no missing data in the dataset


---



## Part 1: Clean the dataset

Let's take a peek at some of the reviews

In [4]:
for review in df['review'].iloc[:10]:
    print(review)

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

Looking at the texts, I came up with a few steps to prepare our data for training *(look at the ```clean_text``` function to see how this works)*:
- First, standardize by making everything lowercase, since it results in a lower number of distinct tokens
- Remove all the HTML tags (I skimmed through a big chunk of the dataset and these seem to be the only weird things)
- Tokenize the texts using the NLTK tokenizer
- Remove all the punctuation and symbols (symbols can indicate sentiments but I believe the effect is negligible in this task)
- Also remove all words with just one character, because they usually don't indicate sentiments (words like "I"). This also removes the fragments created by the previous step ("e.g" -> "e g")
- Then, remove all the stop words because they don't indicate sentiments. Note that I used the stop words list from NLTK, and there are some words in that list that I want to keep (words that can indicate sentiments and negation)
- Finally, lemmatize all the words to, again, standardize and reduce the number of tokens
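
To get a feel for what these steps do to a single review, here is a dependency-free approximation. It is only a sketch: `simple_clean`, `TOY_STOP_WORDS` and `KEEP` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's, regular expressions replace BeautifulSoup and the NLTK tokenizer, and lemmatization is skipped entirely, so its output will not match the notebook's `clean_text` exactly.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's much longer English list.
TOY_STOP_WORDS = {"the", "a", "an", "is", "was", "it", "this", "and", "of", "to", "did"}
KEEP = {"not", "no", "never"}  # negations that must survive the filter

def simple_clean(text):
    text = text.lower()                        # standardize case
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags such as <br />
    text = re.sub(r"n't\b", " not", text)      # "didn't" -> "did not"
    tokens = re.findall(r"[a-z]+|\d+", text)   # keep alphabetic words and numbers
    return " ".join(
        t for t in tokens
        if (len(t) > 1 or t.isdigit())             # no one-character words
        and (t not in TOY_STOP_WORDS or t in KEEP)  # no stop words, except negations
    )

print(simple_clean("I didn't like it.<br /><br />The plot was a mess, 1 out of 10!"))
# not like plot mess 1 out 10
```

Note how the negation ("not") and the rating ("1 out 10") survive while filler words disappear, which is the point of keeping some stop words.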

In [5]:
# NLTK stop words, minus the ones that can express sentiment or negation
IMPORTANT_STOP_WORDS = {"n't", "no", "nor", "not", "never", "against", "above", "below", "up", "down", "out", "on", "off", "over", "under", "again"}
# Build the set once instead of reloading the NLTK list for every token
STOP_WORDS = set(stopwords.words("english")) - IMPORTANT_STOP_WORDS
lemmatizer = WordNetLemmatizer()

def is_stop_word(text):
    return text in STOP_WORDS

# A valid word is an alphabetic word longer than one character, or a number
def is_valid_word(text):
    return (text.isalpha() and len(text) > 1) or (text.isnumeric())

def clean_text(row):
    # Lowercase
    row = row.lower()

    # Get rid of the HTML tags
    row = BeautifulSoup(row, "html.parser").get_text()

    # Tokenize the text
    tokens = word_tokenize(row)

    # Change all the "n't" into "not" so the negation survives the filter below
    tokens = ["not" if token == "n't" else token for token in tokens]

    # Drop invalid tokens (symbols, punctuation, one-character words) and
    # stop words (except the important ones listed above), then lemmatize
    row = " ".join([lemmatizer.lemmatize(token) for token in tokens if is_valid_word(token) and not is_stop_word(token)])

    return row


Process the data and save it to a file

In [6]:
# Change this to False to reprocess the data again
has_preprocessed = True

if has_preprocessed:
    df['review'] = pd.read_csv("preprocessed_data.csv").squeeze()
else:
    df['review'] = df['review'].apply(clean_text)
    df['review'].to_csv("preprocessed_data.csv", index=False)

# Encode the labels as 0 (negative) and 1 (positive)
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

Split our data into a training dataset and a test dataset, and we are ready to train models

In [7]:
X = df['review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=69)



---



## Part 2: Using BOW and TFIDF as text encoders and Naive Bayes and SVM as models

### A. Text Encoding
We cannot use strings as the inputs to ML models, so the first step is encoding the texts. Let's use two classic ways to encode texts: Bag of Words (BOW) and Term Frequency - Inverse Document Frequency (TFIDF)
- BOW creates features from texts using word frequencies
- TFIDF also relies on word frequencies, but it down-weights words that appear in many documents and rescales each document vector, so the features are more "normalized"
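
The difference between the two encodings is easier to see on a toy corpus. The sketch below computes both by hand with the standard library; the three documents are invented, and the smoothed IDF formula and L2 normalization are my understanding of the `TfidfVectorizer` defaults rather than something taken from this notebook.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]  # invented mini-corpus
vocab = sorted({w for d in docs for w in d.split()})  # ['bad', 'good', 'movie', 'plot']

# Bag of Words: raw term counts per document
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Smoothed inverse document frequency: ln((1 + n_docs) / (1 + doc_freq)) + 1
n_docs = len(docs)
doc_freq = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + f)) + 1 for f in doc_freq]

def l2_normalize(row):
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

# TFIDF: counts reweighted by IDF, then each document vector scaled to unit length
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in bow]

print(vocab)
print(bow)
print([[round(v, 2) for v in row] for row in tfidf])
```

In the second document, "bad" ends up with a larger weight than "movie" even though both occur once, because "movie" appears in more documents.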

In [8]:
bow = CountVectorizer()
tfidf = TfidfVectorizer()

X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

### B. Models
There are two classification models that are known to work really well with text data: Naive Bayes and Support Vector Machine. I believe the reason is that they work really well with sparse data (BOW and TFIDF features are sparse matrices)

Also, let's train all four of the combinations since they are really quick to train
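
To make the Naive Bayes side less of a black box, here is a minimal multinomial Naive Bayes written from scratch. It is a sketch under assumptions: the training sentences are invented, `log_posterior` and `predict` are hypothetical helper names, and Laplace smoothing with `alpha=1.0` is used; it is not how `MultinomialNB` is implemented internally.

```python
import math
from collections import Counter

# Toy training data (invented for illustration): (text, label), 1 = positive
train = [
    ("great fun great cast", 1),
    ("loved great story", 1),
    ("boring plot bad acting", 0),
    ("bad boring waste", 0),
]

vocab = {w for text, _ in train for w in text.split()}
word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

def log_posterior(text, label, alpha=1.0):
    # log P(label) + sum of log P(word | label), with Laplace smoothing
    total = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for w in text.split():
        if w in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    # Pick the label with the higher (unnormalized) log posterior
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("great story"), predict("boring waste of time"))  # 1 0
```

The model only needs per-class word counts, which is one reason it trains so quickly on sparse count features.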

In [9]:
svc_bow = LinearSVC().fit(X_train_bow, y_train)
print("Linear Support Vector Machine with Bag of Words: \n", classification_report(y_test, svc_bow.predict(X_test_bow)))

svc_tfidf = LinearSVC().fit(X_train_tfidf, y_train)
print("Linear Support Vector Machine with TFIDF: \n", classification_report(y_test, svc_tfidf.predict(X_test_tfidf)))

nb_bow = MultinomialNB().fit(X_train_bow, y_train)
print("Naive Bayes with Bag of Words: \n", classification_report(y_test, nb_bow.predict(X_test_bow)))

nb_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)
print("Naive Bayes with TFIDF: \n", classification_report(y_test, nb_tfidf.predict(X_test_tfidf)))



Linear Support Vector Machine with Bag of Words: 
               precision    recall  f1-score   support

           0       0.87      0.87      0.87      4971
           1       0.87      0.87      0.87      5029

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000

Linear Support Vector Machine with TFIDF: 
               precision    recall  f1-score   support

           0       0.90      0.89      0.90      4971
           1       0.89      0.90      0.90      5029

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

Naive Bayes with Bag of Words: 
               precision    recall  f1-score   support

           0       0.84      0.88      0.86      4971
           1       0.87      0.83      0.85      5029

    accuracy                           0.85     10000


We can see that all four models yield pretty good results with Linear Support Vector Machine with TFIDF yielding the highest accuracy.

I also notice there isn't much difference in precision and recall between the two labels, so the models are not biased toward one label.
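
For reference, the per-label numbers in these reports can be reproduced by hand. The sketch below uses a hypothetical helper, `precision_recall_f1`, and invented labels; the values are not results from this notebook.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented labels for illustration, not results from the notebook
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

for label in (0, 1):
    p, r, f = precision_recall_f1(y_true, y_pred, positive=label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

A model biased toward one label would show high recall but low precision for that label, and the reverse for the other.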

--------------------------

## Part 3: Using Deep Learning

Let's try to train more complex Deep Learning models to see if they perform better than previous models

### A. Traditional Neural Network
First, let's try a traditional neural network:
- Neural networks don't work well with sparse data, so we cannot use the BOW and TFIDF features from the previous steps. Let's use the TextVectorization layer from TensorFlow. It encodes each word as an integer
- An Embedding layer
- Two Dense layers with activation function 'relu' and a dropout rate of 0.5 for regularization
- And since this is binary classification, we use the 'sigmoid' function for the last layer and 'binary_crossentropy' for the loss function
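
As a quick illustration of the last bullet, the sketch below shows how a sigmoid output turns into a binary cross-entropy loss. The logits are invented, and the clipping constant `eps` is an assumption made to avoid `log(0)`, not a value taken from Keras.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Invented logits for illustration
for logit, label in [(2.0, 1), (2.0, 0), (0.0, 1)]:
    p = sigmoid(logit)
    print(f"p={p:.3f} label={label} loss={binary_crossentropy(label, p):.3f}")
```

A confident prediction that matches the label gives a small loss, the same prediction with the opposite label gives a large one, and an undecided output of 0.5 sits in between.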

(You're probably wondering why there is a GlobalAveragePooling1D layer. I first tried a Flatten layer, but the loss did not improve and I could not figure out why; I probably missed something. After some research on StackOverflow, I tried GlobalAveragePooling1D instead, and it worked really well.)
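
Conceptually, global average pooling collapses a variable-length sequence of word embeddings into one fixed-size vector by averaging over the time axis, and with `mask_zero=True` the padded positions are left out of the average. The sketch below mimics that with plain lists; `masked_average_pool` is a hypothetical helper and the embedding values are invented.

```python
def masked_average_pool(embeddings, token_ids):
    """Average the embedding vectors of non-padding tokens (id != 0)."""
    kept = [vec for vec, tok in zip(embeddings, token_ids) if tok != 0]
    if not kept:
        return [0.0] * len(embeddings[0])
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# One review with 2 real tokens padded to length 4 (id 0 = padding); values invented
token_ids = [5, 9, 0, 0]
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.5, 0.5], [0.5, 0.5]]

print(masked_average_pool(embeddings, token_ids))  # [2.0, 3.0]
```

Whatever the review length, the output has as many values as the embedding dimension (64 in this notebook), which is what the Dense layers receive.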

In [None]:
load_model_nn = False

nn = tf.keras.Model()

if load_model_nn:
    nn = tf.keras.models.load_model('nn.keras')
else:
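    # Vectorize the strings
    # Note: adapting on the full X lets the vocabulary include test-set words;
    # adapting on X_train only (as in the Bi-LSTM cell) would avoid that
    encoder1 = tf.keras.layers.TextVectorization()
    encoder1.adapt(X)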

    nn = Sequential([
        tf.keras.Input(shape=(1,), dtype=tf.string),
        encoder1,
        Embedding(len(encoder1.get_vocabulary()), 64, mask_zero=True),
        GlobalAveragePooling1D(),
        Dense(32, activation='relu'),
        Dropout(0.5),
        Dense(16, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])

    nn.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(nn.summary())
    nn.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.1 , callbacks=[EarlyStopping(patience=2)])

    # nn.save('nn.keras')

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization (TextVe  (None, None)              0         
 ctorization)                                                    
                                                                 
 embedding (Embedding)       (None, None, 64)          5747648   
                                                                 
 global_average_pooling1d (  (None, 64)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dense (Dense)               (None, 32)                2080      
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 16)                5

We can see that the model overfits really quickly. Observing this, I tried adding regularization. It would be best to train the model for just 1 or 2 epochs, but I still let it train for 3 epochs to demonstrate the overfitting.
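
The `EarlyStopping(patience=2)` callback is what cuts training short here. Its core rule can be sketched in a few lines; `early_stopping_epoch` is a hypothetical helper, the validation losses are invented rather than taken from the training log, and the sketch ignores options such as `min_delta`.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # improvement: remember it and reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Invented validation losses: best at the first epoch, worse afterwards
val_losses = [0.30, 0.33, 0.37, 0.42]
print(early_stopping_epoch(val_losses))  # 2, i.e. training ends after the third epoch
```

As far as I know, the callback keeps the weights from the last epoch unless `restore_best_weights=True` is passed, so that option may be worth trying when the best epoch is the first one.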

In [None]:
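# Threshold the sigmoid outputs at 0.5 (the second argument of tf.round is a name, not a threshold)
print("Accuracy score of Traditional Neural Network: \n", classification_report(y_test, (nn.predict(X_test) > 0.5).astype(int).ravel()))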

Accuracy score of Traditional Neural Network: 
               precision    recall  f1-score   support

           0       0.90      0.88      0.89      4971
           1       0.88      0.91      0.89      5029

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000



The accuracy of the neural network (0.89) is about the same as that of SVM with TFIDF (0.90).

### B. Bi-LSTM
Recurrent Neural Networks are known for performing well with sequential data, such as text, so let's try an LSTM for our model. I also wrap the LSTM layer inside Bidirectional() so the model reads each text backwards as well as forwards to identify any additional patterns. I read on the TensorFlow website that Bi-LSTM tends to work really well with text.

The model consists of
- Text Vectorization layer
- A Bidirectional-LSTM layer
- A Dense layer with activation function "relu"
- Dropout for regularization
- Final output layer with "sigmoid"
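
What the Bidirectional wrapper does can be sketched with a toy recurrent cell: run the same kind of recurrence once left to right and once right to left, then concatenate the two final states. The cell below is deliberately not an LSTM, the weights `w` and `u` and the inputs are invented, and concatenation is my understanding of the wrapper's default merge mode.

```python
import math

def run_rnn(sequence, w=0.5, u=0.8):
    """Toy recurrent cell (not an LSTM): h_t = tanh(w * x_t + u * h_{t-1})."""
    h = 0.0
    for x in sequence:
        h = math.tanh(w * x + u * h)
    return h

def bidirectional(sequence):
    forward = run_rnn(sequence)          # reads left to right
    backward = run_rnn(sequence[::-1])   # reads right to left
    return [forward, backward]           # concatenated final states

sequence = [1.0, -1.0, 2.0]  # invented inputs standing in for word embeddings
print(bidirectional(sequence))
```

This is also why wrapping `LSTM(32)` gives an output of size 64 in the model summary: 32 units for each direction.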

In [None]:
load_model_bilstm = False

bilstm = tf.keras.Model()

if load_model_bilstm:
    bilstm = tf.keras.models.load_model('bilstm.keras')
else:
    # Vectorize the strings
    encoder2 = tf.keras.layers.TextVectorization()
    encoder2.adapt(X_train)

    bilstm = Sequential([
        tf.keras.Input(shape=(1,), dtype=tf.string),
        encoder2,
        Embedding(len(encoder2.get_vocabulary()), 64, mask_zero=True),
        Bidirectional(LSTM(32, dropout=0.5)),
        Dense(16, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])

    bilstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(bilstm.summary())
    bilstm.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.1 , callbacks=[EarlyStopping(patience=2)])

    # bilstm.save('bilstm.keras')

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_1 (Text  (None, None)              0         
 Vectorization)                                                  
                                                                 
 embedding_1 (Embedding)     (None, None, 64)          5238848   
                                                                 
 bidirectional (Bidirection  (None, 64)                24832     
 al)                                                             
                                                                 
 dense_3 (Dense)             (None, 16)                1040      
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_4 (Dense)             (None, 1)                

Again, we can see that the model starts to overfit after the first few epochs.

In [None]:
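# Threshold the sigmoid outputs at 0.5 (the second argument of tf.round is a name, not a threshold)
print("Accuracy score of Bi-LSTM Neural Network: \n", classification_report(y_test, (bilstm.predict(X_test) > 0.5).astype(int).ravel()))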

Accuracy score of Bi-LSTM Neural Network: 
               precision    recall  f1-score   support

           0       0.86      0.91      0.89      4971
           1       0.91      0.86      0.88      5029

    accuracy                           0.88     10000
   macro avg       0.89      0.88      0.88     10000
weighted avg       0.89      0.88      0.88     10000



Turns out the model that usually works very well with texts does not perform any better than previous models.

The accuracy of Bi-LSTM is 0.88, which is about the same as the traditional NN (0.89) and SVM with TFIDF (0.90).


---



## Conclusion

- All three approaches (SVM with TFIDF, traditional NN, Bi-LSTM) yield about the same accuracy, roughly 0.88 to 0.90 (the traditional NN and Bi-LSTM perform a bit worse, but I believe that if I trained them for fewer epochs, they would reach the same accuracy)

- Since Bi-LSTM was "the model" for text classification, I expected it to perform much better than the other two approaches, but that was not the case. Maybe Bi-LSTM would perform better with more complex datasets.

- I skimmed Kaggle to see how other practitioners did on this dataset, and their accuracy is around 0.8 to 0.9, so our models perform pretty well in comparison. On Kaggle, deep learning models are actually on the lower end, and simpler models yield higher accuracy

## Future work
- During this project, I tried to train a word embedding model with Word2Vec and use it to extract features from texts, but the accuracy of the model was about 0.86 and the training time was hours, so I decided not to include it in this notebook. In the future, I may try word embeddings from pretrained models to see if they perform better
- Adding an attention mechanism to Bi-LSTM may be another thing we can try. Some words definitely indicate a sentiment more than others
- Finally, we can try to add more rule-based approaches. Maybe put more weight on words that we know can indicate sentiments (like "love", "hate", "1 out of 10") and remove more words that aren't important (like "film", "movies")


