## IMDB Reviews Sentiment Analysis

The goal is to train a simple NLP model on the IMDB Dataset of 50K Movie Reviews, a dataset for binary sentiment classification that contains substantially more data than previous benchmark datasets. It consists of 25,000 highly polar movie reviews for training and 25,000 for testing.

In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv('./IMDB Dataset.csv', 
                 encoding='utf-8')

df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


The text contains HTML markup and other peculiarities, so let's clean it using Beautiful Soup and a regular expression.

In [3]:
from bs4 import BeautifulSoup
import re
import warnings
warnings.filterwarnings('ignore')

# Strip HTML tags, keeping only the text content
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Remove square brackets and anything between them
# (raw string avoids an invalid-escape warning for '\[')
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Apply both cleaning steps
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

# Apply the cleaning function to the review column
df['review'] = df['review'].apply(denoise_text)
df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. The filming tec...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive
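To make the bracket-removal step concrete, here is a minimal sketch that runs the same pattern on a few made-up strings (the sample sentences are hypothetical, not taken from the dataset). It only uses the standard library, so it leaves out the Beautiful Soup step.

```python
import re

# Same pattern as remove_between_square_brackets, written as a raw string.
BRACKETED = re.compile(r'\[[^]]*\]')

# Hypothetical sample strings, for illustration only.
samples = [
    "Great film [spoiler: he dies] overall",
    "No brackets here",
    "[a][b] two tags in a row",
]

cleaned = [BRACKETED.sub('', s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Removing a bracketed span can leave a double space behind. That is harmless here because the whitespace-based tokenizers used later collapse runs of whitespace.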


Let's define two tokenizers, one with Porter stemming and one without, so that the choice can be tuned later during classification.

In [4]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# No stemming
def tokenizer(text):
    return text.split()

# Porter stemming
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
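As a quick illustration of how the two tokenizers differ, the sketch below applies both to a made-up sentence (the sentence is an assumption for demonstration). Note that both split on whitespace only, so punctuation stays attached to the neighbouring word.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Whitespace tokenizer without stemming
def tokenizer(text):
    return text.split()

# Whitespace tokenizer followed by Porter stemming of each token
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

sample = "The actors were running and the plot runs long"  # hypothetical text
plain_tokens = tokenizer(sample)
stemmed_tokens = tokenizer_porter(sample)

print(plain_tokens)
print(stemmed_tokens)
```

Stemming maps inflected forms such as "running" and "runs" to a shared stem, which shrinks the vocabulary at the cost of an extra pass over every token.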

Let's do a 50-50 train-test split, as in the dataset description.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['review'].values, df['sentiment'].values, test_size=0.5)
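The split above is random on every run. If reproducibility and balanced classes matter, one option (an addition, not what the notebook ran) is to pass `random_state` and `stratify`. The sketch below uses a toy array in place of the real reviews; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the review and sentiment arrays (hypothetical data).
X = np.array([f"review {i}" for i in range(20)])
y = np.array(['positive', 'negative'] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```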

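Let's build a pipeline with TF-IDF and a Support Vector Machine classifier. In the cross-validation part, I tune the tokenizer (with or without stemming) as well as the SVM hyperparameters: the regularization parameter C and the kernel (linear or Radial Basis Function). The n-gram range is kept at unigrams, (1, 1), in this grid; adding (1, 2) would also try unigrams plus bigrams, at the cost of more fits.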
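To show what the TF-IDF step with L2 normalization produces, here is a minimal sketch on three made-up documents (the documents are assumptions for illustration). With `norm='l2'`, every non-empty row of the resulting matrix has unit Euclidean length, so document length does not dominate the features.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, for illustration only.
docs = ["a wonderful little production",
        "a terrible little film",
        "wonderful wonderful film"]

# Whitespace tokenizer, as in the non-stemming variant;
# token_pattern=None signals that the default regex is not used.
vect = TfidfVectorizer(norm='l2', tokenizer=str.split, token_pattern=None)
X = vect.fit_transform(docs)

print(sorted(vect.vocabulary_))      # learned vocabulary
row_norms = np.linalg.norm(X.toarray(), axis=1)
print(row_norms)                     # each row has length 1 under norm='l2'
```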

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm_param= [
            {'clf__C': [0.1, 1],
             'clf__kernel': ['linear', 'rbf'],
             'vect__ngram_range': [(1, 1)],
             'vect__tokenizer': [tokenizer, tokenizer_porter]}
           ]

svm_tfidf = Pipeline([
    ('vect', TfidfVectorizer(norm='l2')),
    ('clf', SVC())
])

gs_svm_tfidf = GridSearchCV(svm_tfidf, svm_param,
                            scoring='accuracy',
                            cv=5,
                            verbose=3,
                            n_jobs=-1,
                            return_train_score=True)

gs_svm_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits

[CV 5/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x113ba1000>;, score=(train=0.884, test=0.845) total time=14.6min
[CV 4/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x111495000>;, score=(train=0.883, test=0.847) total time=14.6min
[CV 2/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x11ab15000>;, score=(train=0.882, test=0.846) total time=14.7min
[CV 1/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10d8a5000>;, score=(train=0.881, test=0.856) total time=14.7min
[CV 3/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10f219000>;, score=(train=0.881, test=0.856) total time=14.8min
[CV 2/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x118d15000>;, score=(train=0.883, test=0.848) total time=17.3min
[CV 1/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x118445000>;, score=(train=0.883, test=0.858) total time=17.4min
[CV 3/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x10f169000>;, score=(train=0.882, test=0.859) total time=17.4min
[CV 5/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x111495000>;, score=(train=0.886, test=0.847) total time=17.2min
[CV 4/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x113ba1000>;, score=(train=0.883, test=0.845) total time=17.3min
[CV 1/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x11ab15000>;, score=(train=0.846, test=0.814) total time=17.3min
[CV 2/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10d8a5000>;, score=(train=0.852, test=0.795) total time=17.3min
[CV 3/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10f219000>;, score=(train=0.848, test=0.816) total time=17.4min
[CV 5/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10da3db40>;, score=(train=0.849, test=0.800) total time=18.1min
[CV 4/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10e395b40>;, score=(train=0.849, test=0.799) total time=18.2min
[CV 1/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x10479db40>;, score=(train=0.855, test=0.822) total time=21.4min
[CV 1/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10f219000>;, score=(train=0.976, test=0.894) total time=25.3min
[CV 2/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x111495000>;, score=(train=0.859, test=0.804) total time=22.7min
[CV 2/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x118445000>;, score=(train=0.977, test=0.885) total time=25.0min
[CV 3/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x113ba1000>;, score=(train=0.857, test=0.823) total time=22.8min
[CV 3/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x118d15000>;, score=(train=0.978, test=0.888) total time=25.4min
[CV 5/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x10d8a5000>;, score=(train=0.857, test=0.810) total time=22.6min
[CV 4/5] END clf__C=0.1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x11ab15000>;, score=(train=0.858, test=0.812) total time=22.7min
[CV 4/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10f169000>;, score=(train=0.979, test=0.888) total time=24.7min
[CV 5/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10f219000>;, score=(train=0.979, test=0.877) total time=25.4min
[CV 1/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x111495000>;, score=(train=0.973, test=0.892) total time=26.2min
[CV 2/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x10da3db40>;, score=(train=0.972, test=0.881) total time=26.4min
[CV 3/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x113ba1000>;, score=(train=0.973, test=0.888) total time=26.5min
[CV 5/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x10d8a5000>;, score=(train=0.974, test=0.875) total time=26.4min
[CV 4/5] END clf__C=1, clf__kernel=linear, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x10e395b40>;, score=(train=0.974, test=0.882) total time=26.6min
[CV 1/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x11ab15000>;, score=(train=0.994, test=0.890) total time=33.8min
[CV 2/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10479db40>;, score=(train=0.994, test=0.883) total time=34.2min
[CV 3/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x10f219000>;, score=(train=0.994, test=0.888) total time=32.9min
[CV 4/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x111495000>;, score=(train=0.994, test=0.882) total time=33.4min
[CV 5/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer at 0x118445000>;, score=(train=0.995, test=0.872) total time=33.3min
[CV 1/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x113ba1000>;, score=(train=0.992, test=0.885) total time=35.6min
[CV 2/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x10d8a5000>;, score=(train=0.992, test=0.879) total time=35.4min
[CV 3/5] END clf__C=1, clf__kernel=rbf, vect__ngram_range=(1, 1), vect__tokenizer=<function tokenizer_porter at 0x118d15000>;, scor
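The candidate and fit counts reported at the top of the log follow directly from the parameter grid. A minimal sketch with scikit-learn's `ParameterGrid` reproduces them; the two tokenizer functions here are placeholders, since only the number of grid values matters for the count.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder callables standing in for the notebook's two tokenizers.
def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return text.split()  # stemming omitted; only the count matters here

svm_param = [
    {'clf__C': [0.1, 1],
     'clf__kernel': ['linear', 'rbf'],
     'vect__ngram_range': [(1, 1)],
     'vect__tokenizer': [tokenizer, tokenizer_porter]}
]

n_candidates = len(ParameterGrid(svm_param))  # 2 * 2 * 1 * 2 combinations
n_fits = n_candidates * 5                     # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```

With fits taking roughly 15 to 35 minutes each in the log, every extra grid value multiplies the total run time, which is worth weighing before widening the grid.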

Here are the best parameter set and its CV accuracy.

In [9]:
print(f'Best parameter set for SVM: {gs_svm_tfidf.best_params_}')
print(f'Cross-validation Accuracy of SVM: {gs_svm_tfidf.best_score_:.4f}')

Best parameter set for SVM: {'clf__C': 1, 'clf__kernel': 'linear', 'vect__ngram_range': (1, 1), 'vect__tokenizer': <function tokenizer at 0x13035de10>}
Cross-validation Accuracy of SVM: 0.8863
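Beyond the single best score, the fitted search object keeps per-candidate results in `cv_results_`. The sketch below shows one way to tabulate them. It fits a small grid on a made-up toy corpus so that it runs in seconds; the texts, labels, and the reduced grid are assumptions, not the notebook's data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus, repeated so every CV fold contains both classes.
texts = ["great movie", "loved it", "wonderful acting", "a fine film",
         "terrible movie", "hated it", "awful acting", "a dull film"] * 3
labels = (["positive"] * 4 + ["negative"] * 4) * 3

pipe = Pipeline([('vect', TfidfVectorizer()), ('clf', SVC())])
grid = {'clf__C': [0.1, 1], 'clf__kernel': ['linear', 'rbf']}

gs = GridSearchCV(pipe, grid, scoring='accuracy', cv=3,
                  return_train_score=True)
gs.fit(texts, labels)

# One row per candidate, ordered from best to worst mean CV accuracy.
results = pd.DataFrame(gs.cv_results_)
cols = ['param_clf__C', 'param_clf__kernel',
        'mean_train_score', 'mean_test_score',
        'std_test_score', 'rank_test_score']
summary = results[cols].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

Comparing `mean_train_score` with `mean_test_score` per candidate makes the train/validation gap visible without scanning the verbose log.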


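Surprisingly, the best model chooses the tokenizer without stemming, although the fold scores in the log show only a small margin over the stemmed variant. The fact that the linear kernel wins suggests that there is not much nonlinearity in this classification problem. Let's show the best estimator's test accuracy.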

In [10]:
clf_svm = gs_svm_tfidf.best_estimator_
print(f'Test Accuracy of tuned SVM: {clf_svm.score(X_test, y_test):.4f}')

Test Accuracy of tuned SVM: 0.8920


That is a pretty good test accuracy for a simple SVM model on the IMDB dataset.