# Sentiment analysis with IMDb dataset

We are going to perform sentiment analysis, i.e. classifying documents based on the sentiment of the writer. To do so, we will work with a dataset of $50000$ movie reviews from IMDb (the Internet Movie Database), inspired by the paper *Learning Word Vectors for Sentiment Analysis* by A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, Association for Computational Linguistics, June 2011. Every review in the dataset is labelled as either positive, if the movie was rated with more than six stars on IMDb, or negative, if the movie was rated with fewer than five stars.

We download a compressed archive of the movie review dataset from <https://ai.stanford.edu/~amaas/data/sentiment/> as a gzip-compressed tarball archive. We begin by unpacking the archive.

In [4]:
import tarfile

#extract documents from compressed archive
with tarfile.open('aclImdb_v1.tar', 'r') as tar:
    tar.extractall()

Now that we successfully extracted the documents from the zip file, we will read the reviews into a single pandas `DataFrame` object.

In [11]:
import pyprind
import pandas as pd
import os
import sys

#set basepath to the directory where the unzipped IMDb files are
basepath = 'aclImdb'

#encode labels
labels = {'pos':1, 'neg':0}

#set up progress bar
pbar = pyprind.ProgBar(50000, stream=sys.stdout)

#read reviews and sentiment into a DataFrame
df_list = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            df_temp = pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
            df_list.append(df_temp)
            pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:05


In [12]:
#concatenate reviews into a single dataframe
df = pd.concat(df_list, ignore_index=True)

#look at first few rows of dataframe
df.head(5)

Unnamed: 0,review,sentiment
0,I went and saw this movie last night after bei...,1
1,Actor turned director Bill Paxton follows up h...,1
2,As a recreational golfer with some knowledge o...,1
3,"I saw this film in a sneak preview, and it is ...",1
4,Bill Paxton has taken the true story of the 19...,1


Now, we will shuffle the dataset to ensure that the class labels are evenly dispersed throughout the dataset.

In [15]:
import numpy as np

#shuffle dataset
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

We will use the bag-of-words model to represent text as numerical feature vectors. This works as follows:
 1. Create a vocabulary of unique tokens (i.e. words) from the entire set of documents.
 2. Construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.

Note that the vectors will be sparse because only a small subset of all the words in the bag-of-words vocabulary will occur in a given document. To implement this model, we can use the `TfidfVectorizer` function in the sklearn library. The `TfidfVectorizer` function implements the term frequency-inverse document frequency (tf-idf) algorithm, which takes an array of text data and constructs a vocabulary stored as a Python dictionary mapping unique words to integer indices. Then, any text data is transformed to vectors of word counts, which are also called the raw term frequencies. The order in which the term frequencies appear in the feature vector is derived from the vocabulary indices, which are usually assigned alphabetically.

Then, `TfidfVectorizer` also downweights frequently occurring words in the feature vectors by multiplying the term frequencies $\operatorname{tf}(t, d)$ in a given document $d$ by the inverse document frequency $\operatorname{idf}(t, d)$, which is defined by
$$
\operatorname{idf}(t, d) = \log\left(\frac{1+n_d}{1+\operatorname{df}(d, t)} \right),
$$
where $n_d$ denotes the total number of documents and $\operatorname{df}(d, t)$ represents the number of documents $d$ which contain the term $t$. Formally, we obtain the feature vectors
$$
\operatorname{tf}(t, d) \cdot (\operatorname{idf}(t, d)+1).
$$

Before we can apply the `TfidfVectorizer`, we need to clean the text in the dataset. We use Python's regex library to clean the text data by stripping it of unwanted characters.

In [17]:
import re

#define text cleaning function
def preprocessor(text):
    
    #remove all the HTML markup from the text
    text = re.sub('<[^>]*>', '', text)
    
    #find and store emoticons
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    
    #remove all non-word characters from the text and convert text into lowercase characters
    text = (re.sub('[\W]+', ' ', text.lower()) +' '.join(emoticons).replace('-', ''))
    
    return text

#apply preprocessing to the movie reviews
df['review'] = df['review'].apply(preprocessor)

Next, we implement two tokenizers. The first simply splits the input document into individual words. The second makes use of the Porter stemmer algorithm to tokenize the reviews, as described in *An algorithm for suffix stripping* by Martin F. Porter, Program: Electronic Library and Information Systems, 14(3): 130–137, 1980. This involves transforming a word into its root form, allowing us to map related words to the same stem. We can implement the Porter stemming algorithm using the Natural Language Toolkit (NLTK) library.

Stop words are words that are extremely common across many contexts and therefore probably contain very little useful information that can be used to distinguish between different classes of documents. Examples of stop words are "is", "and", "has", and "like." To remove stop words from the movie reviews, we will use the set of 127 English stop words that is available from the NLTK library.

In [21]:
import nltk

#obtain stop words
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/theoteske/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [22]:
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

#define stop words
stop = stopwords.words('english')

#define tokenizer that simply splits document into words
def tokenizer(text):
    return text.split()

#create instance of PorterStemmer class
porter = PorterStemmer()

#define tokenizer function using Porter stemmer
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

Having created the possible tokenizers and obtained the stop words for English, we can now define a logistic regression classifier that uses the vectors generated by `TfidfVectorizer` to predict each review's sentiment and train this classifier on the training set.

In [23]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

#perform 50-50 train-test split
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

#create TfidfVectorizer object to vectorize text
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

#create parameter grid over which to optimize
param_grid = [
    {
        'vect__ngram_range': [(1, 1)], 
        'vect__stop_words': [stop, None],
        'vect__tokenizer': [tokenizer, tokenizer_porter], 
        'clf__penalty': ['l2'],
        'clf__C': [1.0, 10.0]
    },
    {
        'vect__ngram_range': [(1, 1)], 
        'vect__stop_words': [stop, None], 
        'vect__tokenizer': [tokenizer], 
        'vect__use_idf':[False], 
        'vect__norm':[None], 
        'clf__penalty': ['l2'], 
        'clf__C': [1.0, 10.0]
    }
]

#create pipeline using vectorizer and logistic regression
lr_tfidf = Pipeline([
    ('vect', tfidf),
    ('clf', LogisticRegression(solver='liblinear'))
])

#optimize hyperparameters using grid search cross-validation
gs = GridSearchCV(lr_tfidf, param_grid, 
                  scoring='accuracy', 
                  cv=5, 
                  verbose=1,
                  n_jobs=-1)

#fit the optimal model to the training set
gs.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits






In [24]:
#print best parameters
print(f'Best parameter set: {gs.best_params_}')

#print CV accuracy
print(f'CV Accuracy: {gs.best_score_:.3f}')

#evaluate the optimal model on the test set
clf = gs.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x3264f2f20>}
CV Accuracy: 0.891
Test Accuracy: 0.897


We observe that the best-performing model didn't make use of the stop words and used the vanilla tokenizer (splitting documents into words) rather than the tokenizer which employed Porter stemming algorithm. The logistic regression classifier achieved a cross-validation accuracy of $89.1\%$, and the optimal model found through grid search on the hyperparameter grid was able to achieve $89.7\%$ accuracy on the test set.