# Sentiment Analysis of IMDB Movie Reviews

This project explores basic classifiers for sentiment analysis on a dataset of 50,000 movie reviews from IMDB. As my first foray into Natural Language Processing (NLP) and text analysis, much of the project serves as an exploration of text preprocessing techniques. I hope to return to this project to make improvements as I learn more from outside projects and coursework. Let's get started!

## Preparation, Ingestion and Preprocessing
### Imports

We'll load and explore the data using pandas, preprocess the text with nltk, and train our models with scikit-learn's classifiers. We import all necessary libraries below.

In [1]:
import os, json, gzip, requests, re 
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import string
import nltk


### Downloads and Preprocessing Helpers

Next, we download necessary tools from the nltk library.  

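1. We'll use wordnet to lemmatize each token in our text. Lemmatization reduces a word to its dictionary base form (its lemma), so that different inflections of the same word are counted as a single term. This can mean undoing a verb tense, a plural, or a comparative or superlative form. Note that nltk's WordNetLemmatizer treats every token as a noun unless a part-of-speech tag is passed to it, so verb and adjective forms like those below would need a matching part-of-speech argument to be reduced. For example: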

<div align="center">'caring' -> 'care'</div>
<div align="center">'best' -> 'good'</div>
<div align="center">'wrote' -> 'write'</div>

2. We'll use stopwords to remove words that add to the dimensionality of the data but carry little information. These include words like 'a', 'the', 'and', 'is', 'what', and many more. The nltk library ships a preset list of English stopwords; later, we'll use this list to filter the stopwords out of each review.
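
As a rough illustration of the second step, the sketch below filters a tokenized sentence against a tiny stopword set. The four-word set and the example sentence are made up for the demonstration; the project itself uses nltk's much longer English list.

```python
# Hypothetical mini stopword set, standing in for nltk's English list.
demo_stopwords = {"a", "the", "and", "is"}

sentence = "The plot is a mess and the acting is wooden"
tokens = sentence.lower().split()

# Keep only the tokens that are not stopwords.
kept = [tok for tok in tokens if tok not in demo_stopwords]

print(len(set(tokens)), "distinct tokens before filtering")
print(len(set(kept)), "distinct tokens after filtering:", kept)
```

Even in this toy sentence, half of the distinct tokens are stopwords, which hints at how much the vocabulary can shrink without losing the words that carry sentiment.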


In [2]:
nltk.download([ "names",
                "stopwords",
                "averaged_perceptron_tagger",
                "vader_lexicon",
                "punkt",
                "wordnet"])


[nltk_data] Downloading package names to
[nltk_data]     /Users/scandukuri/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/scandukuri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/scandukuri/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/scandukuri/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/scandukuri/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/scandukuri/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# Read in IMDB data from a local CSV file
df = pd.read_csv('IMDB Dataset.csv')   

# Pull in nltk lemmatizer tool and stopword dictionary
wnl = nltk.stem.WordNetLemmatizer()
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes membership tests fast

# Process each token individually after a review is split into its tokens
def preprocess_token(token):
    token = token.lower()   # convert to lowercase
    if token in stopwords: 
        return ''           # leave stopwords as empty strings
    return wnl.lemmatize(token)     # return lemmatized version of token

# Process an entire movie review using helper function above
def preprocess_doc(doc):
    doc = re.compile(r'<.*?>').sub('', doc)     # use regex to remove html tags
    tokens = nltk.tokenize.word_tokenize(doc)   # create list of tokens
    tokens=[token.lower() for token in tokens if token.isalpha()]
    words = [preprocess_token(token) for token in tokens]
    return ' '.join((' '.join(words)).split())  # remove extra whitespace if necessary
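
One detail of the tag-stripping step may be worth checking on real reviews: replacing a tag with an empty string glues together the words on either side of it whenever there is no other whitespace between them. The sketch below shows the effect on a made-up snippet, along with an alternative that substitutes a space instead. Whether this changes the final scores is not something measured in this project.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')

# Made-up review fragment with line-break tags and no surrounding spaces.
snippet = "a masterpiece<br /><br />Loved it"

glued = TAG_PATTERN.sub('', snippet)    # same substitution as in preprocess_doc
spaced = TAG_PATTERN.sub(' ', snippet)  # alternative: replace each tag with a space

print(glued)           # the two words around the tags end up merged
print(spaced.split())  # splitting on whitespace recovers four separate words
```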


### Preprocessing, Splitting and Vectorization

Now, we'll use our preprocessing functions to process our column of movie reviews, and save the cleaned text as a new column. We'll also convert our 'positive' and 'negative' labels to integers $1$ and $0$ respectively.

In [5]:
df['processed_review'] = df['review'].apply(preprocess_doc)
df['sentiment_labels'] = (df['sentiment'] == 'positive').astype(int)

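Now, we'll split our data into training and test sets. With its default settings, train_test_split holds out 25% of the rows, which gives us 37,500 training reviews and 12,500 test reviews.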

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df['processed_review'], df['sentiment_labels'], random_state=42)

Finally, we'll vectorize our reviews, turning each one into a numeric vector, so we can begin training our models. We have many options for how we want to vectorize our documents. For now, my project only explores linear classifiers, so I will ignore most options that are not applicable (e.g. word2vec).

1. The Bag of Words model simply builds a vocabulary from the union of <i>all</i> words used in <i>all</i> reviews, and for each review generates a vector of the number of times each of those words is used.
2. The TF-IDF (Term Frequency-Inverse Document Frequency) model instead computes a weight that reflects the 'importance' of a term to a review. It does this by multiplying the frequency of the term within the review by its inverse document frequency, which is typically the logarithm of the total number of documents divided by the number of documents the term shows up in. This results in a weight that is highest when a term is more <b>"special"</b> for a particular document (shows up in very few documents, and shows up a lot in a given review).
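
To make the two options above concrete, here is a small hand-rolled sketch on three made-up "reviews". It uses the textbook formula (term count times the log of the inverse document frequency); scikit-learn's TfidfVectorizer uses a smoothed variant and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three toy documents, invented for this example.
docs = ["great movie great cast", "terrible movie", "great fun"]
tokenized = [doc.split() for doc in docs]

# Bag of Words: one count per vocabulary term, per document.
vocab = sorted({tok for doc in tokenized for tok in doc})
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: in how many documents does each term appear?
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
n_docs = len(tokenized)

def tfidf(doc):
    """Textbook TF-IDF weights for the terms present in one tokenized document."""
    counts = Counter(doc)
    return {term: n * math.log(n_docs / doc_freq[term]) for term, n in counts.items()}

weights = [tfidf(doc) for doc in tokenized]

print(vocab)
print(bow[0])
print(weights[0])
```

In the first document, 'cast' appears only once but in no other document, so it ends up with a larger weight than 'great', which appears twice but is shared with another document.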

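We will use TF-IDF vectorization for our classifiers. The scikit-learn library has such a vectorizer that will turn our single column of text reviews into a matrix with one row per review and one column per term. We fit it on the training data only, so that the vocabulary and document frequencies are learned without looking at the test set, and then apply the same fitted transformation to the test data, so our testing process is easier after our models are trained later on.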

In [7]:
vectorizer = TfidfVectorizer()
tf_x_train = vectorizer.fit_transform(X_train)
tf_x_test = vectorizer.transform(X_test)
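
The fit-then-transform pattern in the cell above can be pictured with a simplified, count-based stand-in. The helper names `fit_vocabulary` and `transform` below are hypothetical and only mimic the idea: the vocabulary is learned from the training documents, and test tokens that were never seen in training are simply dropped.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn a term -> column index mapping from the training documents only."""
    vocab = sorted({tok for doc in train_docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(docs, vocab_index):
    """Turn documents into count vectors, ignoring tokens outside the vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocab_index)
        row = [0] * len(vocab_index)
        for tok, n in counts.items():
            row[vocab_index[tok]] = n
        rows.append(row)
    return rows

# Toy documents, invented for this example.
train_docs = ["great movie", "terrible movie"]
test_docs = ["great great soundtrack"]

vocab_index = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocab_index)
test_rows = transform(test_docs, vocab_index)

print(vocab_index)
print(test_rows)  # 'soundtrack' has no column because it never appeared in training
```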

### Multinomial Naive Bayes

First, we'll train a Multinomial Naive Bayes classifier on our data. This is a common baseline classifier used in text classification for its simplicity. It assumes that for any particular vector of token weights $\vec{x}$ where $\vec{x} = (x_1, x_2, \ldots, x_n)$, the $x_i$ are conditionally independent of each other given the corresponding true sentiment label $y$.

<b>In other words, once we know a review's sentiment, knowing the weight of one token doesn't give us any more information about the weight of any other token.</b>

The multinomial model is formally defined for integer word counts, but in practice fractional weights such as TF-IDF values also work, so we are able to choose either the word counts themselves OR the TF-IDF vectors that we created earlier. We use the TF-IDF vectors, as described above.
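
To show what the independence assumption buys us, here is a minimal from-scratch sketch of Multinomial Naive Bayes on raw token counts with add-one (Laplace) smoothing. The function names and the four toy reviews are made up for illustration; this is not how scikit-learn implements MultinomialNB internally, only the same idea in a few lines.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_mnb(model, doc):
    """Pick the class with the highest log prior plus summed token log likelihoods."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in doc if t in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training set: 1 = positive, 0 = negative.
train_docs = [["great", "fun"], ["great", "cast"], ["terrible", "plot"], ["boring", "plot"]]
train_labels = [1, 1, 0, 0]

model = train_mnb(train_docs, train_labels)
print(predict_mnb(model, ["great", "plot", "fun"]))
print(predict_mnb(model, ["boring", "plot"]))
```

Because the tokens are treated as independent given the class, the score of a review is just a sum of per-token terms, which is what makes the classifier so cheap to train and evaluate.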

In [8]:
clfMNB = MultinomialNB()
clfMNB.fit(tf_x_train, y_train)
MNB_y_test_pred = clfMNB.predict(tf_x_test)
MNBReport = classification_report(y_test, MNB_y_test_pred, output_dict=True)

MNBReport

{'0': {'precision': 0.849718221665623,
  'recall': 0.8815981809322722,
  'f1-score': 0.8653646871263452,
  'support': 6157},
 '1': {'precision': 0.880726439790576,
  'recall': 0.848652057386095,
  'f1-score': 0.8643918105178643,
  'support': 6343},
 'accuracy': 0.86488,
 'macro avg': {'precision': 0.8652223307280995,
  'recall': 0.8651251191591836,
  'f1-score': 0.8648782488221047,
  'support': 12500},
 'weighted avg': {'precision': 0.8654530318709492,
  'recall': 0.86488,
  'f1-score': 0.8648710106201377,
  'support': 12500}}

We find that our Multinomial Naive Bayes classifier has an accuracy of $86.49\%$ on the test data.

### Linear SVC

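Next, we'll train a Linear Support Vector Classifier on our data. Given an $n$-dimensional vector for each document, this type of classifier looks for a hyperplane in $n$-dimensional space that separates the two classes with the largest possible margin, that is, the largest distance between the hyperplane and the nearest samples of each class.

In a sense, for us this means a hyperplane that sits as far as possible from both the <b>most negative positive review</b> and the <b>most positive negative review</b>. In practice our reviews are unlikely to be perfectly separable, so scikit-learn's LinearSVC fits a soft margin: a regularization parameter (C, left at its default here) trades a wider margin against misclassified training reviews.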
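
The notion of margin can be made concrete with a little geometry. The sketch below, on four invented 2-D points, measures the margin of two candidate hyperplanes (lines, in two dimensions) that both separate the classes; a support vector classifier would prefer the one with the wider margin. The helper names are hypothetical and this only evaluates given hyperplanes, it does not train anything.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Smallest distance from the hyperplane to a correctly classified point.

    Labels are +1 / -1; the result is negative if any point is on the wrong side.
    """
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Toy data: two 'positive' and two 'negative' points.
points = [(2, 2), (3, 3), (0, 0), (-1, 0)]
labels = [1, 1, -1, -1]

margin_a = margin((1, 1), -2, points, labels)  # diagonal separator
margin_b = margin((1, 0), -1, points, labels)  # vertical separator

print(margin_a, margin_b)
```

Both lines classify all four points correctly, but the diagonal one keeps its closest points further away, so it is the better choice under the maximum-margin criterion.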


In [9]:
clfLinearSVC = LinearSVC(random_state=0)
clfLinearSVC.fit(tf_x_train, y_train)
LinearSVC_y_test_pred = clfLinearSVC.predict(tf_x_test)
LinearSVCReport = classification_report(y_test, LinearSVC_y_test_pred, output_dict=True)

LinearSVCReport

{'0': {'precision': 0.897029702970297,
  'recall': 0.8828975150235504,
  'f1-score': 0.8899075059343537,
  'support': 6157},
 '1': {'precision': 0.8880434782608696,
  'recall': 0.9016238373009617,
  'f1-score': 0.8947821325197528,
  'support': 6343},
 'accuracy': 0.8924,
 'macro avg': {'precision': 0.8925365906155833,
  'recall': 0.892260676162256,
  'f1-score': 0.8923448192270533,
  'support': 12500},
 'weighted avg': {'precision': 0.8924697331037452,
  'recall': 0.8924,
  'f1-score': 0.8923810864488486,
  'support': 12500}}

We find that our Linear SVC has an accuracy of $89.24\%$ on the test data.
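
As a sanity check on the accuracy figures, they can be recomputed from the per-class recall and support in the two reports: recall times support gives the number of correctly classified reviews in each class, and accuracy is the total number correct over the total number of test reviews. The numbers below are copied from the report outputs above; the helper name is made up.

```python
# Per-class recall and support, copied from the two classification reports.
reports = {
    "Multinomial Naive Bayes": {
        "0": {"recall": 0.8815981809322722, "support": 6157},
        "1": {"recall": 0.848652057386095, "support": 6343},
    },
    "Linear SVC": {
        "0": {"recall": 0.8828975150235504, "support": 6157},
        "1": {"recall": 0.9016238373009617, "support": 6343},
    },
}

def accuracy_from_report(per_class):
    """Recover overall accuracy from per-class recall and support."""
    # recall * support = number of correctly classified reviews in that class
    correct = sum(round(c["recall"] * c["support"]) for c in per_class.values())
    total = sum(c["support"] for c in per_class.values())
    return correct / total

accuracies = {name: accuracy_from_report(rep) for name, rep in reports.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.2%}")
```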

## Working Conclusions

For now, we've found that, with TF-IDF vectorization, the Linear SVC performed better for sentiment analysis on IMDB movie reviews than the Multinomial Naive Bayes classifier ($89.24\%$ versus $86.49\%$ test accuracy). In my next update of this project, I plan to explore another linear classifier (logistic regression) and learn more about non-linear classification (e.g. k-nearest neighbors).

