## IMDB Reviews spaCy Sentiment Analysis

The goal is to train a simple NLP model on the IMDB Dataset of 50K Movie Reviews, a dataset for binary sentiment classification that contains substantially more data than previous benchmark datasets. According to its description, it consists of 25,000 highly polar movie reviews for training and 25,000 for testing.

In [28]:
import numpy as np
import pandas as pd

df = pd.read_csv('./IMDB Dataset.csv', 
                 encoding='utf-8')

df.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


The text contains HTML tags and other peculiarities. Let's remove the HTML tags using a regular expression and lemmatize the text using spaCy.

In [29]:
import re
import spacy

nlp = spacy.load('en_core_web_sm')

df.dropna(inplace=True)

def preprocess_text(text):
    # Replace HTML tags such as <br /> with a space
    text = re.sub(r'<.*?>', ' ', text)
    doc = nlp(text)
    # Keep lowercased lemmas; drop URLs, stop words, punctuation and whitespace
    tokens = [
        token.lemma_.lower()
        for token in doc
        if not (token.like_url or token.is_stop or token.is_punct or token.is_space)
    ]
    return " ".join(tokens)

# To try this on a small sample first: df['review'].head(10).apply(preprocess_text)
df['review_processed'] = df['review'].apply(preprocess_text)
df.head(10)

Unnamed: 0,review,sentiment,review_processed
0,One of the other reviewers has mentioned that ...,positive,reviewer mention watch 1 oz episode hook right...
1,A wonderful little production. <br /><br />The...,positive,wonderful little production film technique una...
2,I thought this was a wonderful way to spend ti...,positive,think wonderful way spend time hot summer week...
3,Basically there's a family where a little boy ...,negative,basically family little boy jake think zombie ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visually stunnin...
5,"Probably my all-time favorite movie, a story o...",positive,probably time favorite movie story selflessnes...
6,I sure would like to see a resurrection of a u...,positive,sure like resurrection date seahunt series tec...
7,"This show was an amazing, fresh & innovative i...",negative,amazing fresh innovative idea 70 air 7 8 year ...
8,Encouraged by the positive comments about this...,negative,encourage positive comment film look forward w...
9,If you like original gut wrenching laughter yo...,positive,like original gut wrench laughter like movie y...
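The tag-removal step can be checked in isolation, without loading spaCy. The following is a minimal sketch: the helper name `strip_html`, the whitespace collapsing, and the sample string are illustrative assumptions and not part of the notebook.

```python
import re

# Non-greedy pattern: matches one tag at a time, e.g. "<br />"
TAG_RE = re.compile(r'<.*?>')

def strip_html(text):
    """Replace HTML tags with spaces, then collapse repeated whitespace."""
    no_tags = TAG_RE.sub(' ', text)
    return ' '.join(no_tags.split())

# Made-up sample in the style of the reviews (not taken from the dataset)
sample = 'A wonderful little production. <br /><br />The rest of the review.'
cleaned = strip_html(sample)
print(cleaned)
```

A regular expression like this is enough for simple markup such as `<br />`, but it is not a general HTML parser.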


Let's do a 50-50 train-test split, as in the dataset description.

In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['review_processed'].values, df['sentiment'].values, test_size=0.5)
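The split above is random, so the exact accuracy figures will vary from run to run. As a hedged sketch on made-up data, this shows how `random_state` and `stratify` could make the split reproducible and class-balanced. Both arguments are assumptions added for illustration and are not used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-in for the processed reviews and their labels
texts = [f'review {i}' for i in range(20)]
labels = ['positive', 'negative'] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.5,      # 50-50 split, as in the notebook
    stratify=labels,    # assumption: keep the class ratio equal in both halves
    random_state=42,    # assumption: fixed seed for a reproducible split
)

print(len(X_tr), len(X_te))
print(y_tr.count('positive'), y_te.count('positive'))
```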

Let's build a pipeline with TF-IDF and a Support Vector Machine classifier. In the cross-validation part, I tune the n-gram range (unigrams only, or unigrams and bigrams) as well as the SVM hyperparameters: the regularization parameter C and the kernel (linear or Radial Basis Function).

In [31]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm_param = [
    {'clf__C': [0.1, 1, 10],
     'clf__kernel': ['linear', 'rbf'],
     'vect__ngram_range': [(1, 1), (1, 2)]}
]

svm_tfidf = Pipeline([
    ('vect', TfidfVectorizer(norm='l2')),
    ('clf', SVC())
])

gs_svm_tfidf = GridSearchCV(svm_tfidf, svm_param,
                            scoring='accuracy',
                            cv=5,
                            verbose=3,
                            n_jobs=-1,
                            return_train_score=True)

gs_svm_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV 5/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1);, score=(train=0.894, test=0.861) total time= 8.4min
[CV 4/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1);, score=(train=0.894, test=0.861) total time= 8.4min
[CV 3/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1);, score=(train=0.893, test=0.858) total time= 8.4min
[CV 2/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1);, score=(train=0.891, test=0.865) total time= 8.4min
[CV 1/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 1);, score=(train=0.889, test=0.882) total time= 8.5min
[CV 3/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 2);, score=(train=0.875, test=0.842) total time=16.4min
[CV 2/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 2);, score=(train=0.875, test=0.847) total time=16.4min
[CV 1/5] END clf__C=0.1, clf__kernel=linear, vect__ngram_range=(1, 2);, score=
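The candidate and fit counts reported at the top of the log follow directly from the grid: 3 values of C, 2 kernels and 2 n-gram ranges, each evaluated on 5 folds. A small sketch using scikit-learn's `ParameterGrid` reproduces that arithmetic. The variable names `n_candidates` and `n_folds` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook
svm_param = [
    {'clf__C': [0.1, 1, 10],
     'clf__kernel': ['linear', 'rbf'],
     'vect__ngram_range': [(1, 1), (1, 2)]}
]

n_candidates = len(ParameterGrid(svm_param))  # 3 * 2 * 2 combinations
n_folds = 5                                   # cv=5 in GridSearchCV

print(n_candidates, n_candidates * n_folds)   # 12 60
```

Given the per-fit times in the log, counting the fits before launching the search is a cheap way to estimate the total run time.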

Here are the best parameter set and its CV accuracy.

In [32]:
print(f'Best parameter set for SVM: {gs_svm_tfidf.best_params_}')
print(f'Cross-validation Accuracy of SVM: {gs_svm_tfidf.best_score_:.4f}')

Best parameter set for SVM: {'clf__C': 10, 'clf__kernel': 'linear', 'vect__ngram_range': (1, 2)}
Cross-validation Accuracy of SVM: 0.8955
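Beyond the single best parameter set, the fitted search object stores the score of every candidate in `cv_results_`, which makes it possible to compare the kernels directly. The sketch below runs on a tiny made-up corpus with a reduced grid and `cv=2`, so its numbers say nothing about the IMDB data. Only the inspection pattern is the point.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus, for illustration only
texts = ['great film', 'wonderful acting', 'loved it', 'brilliant story',
         'terrible film', 'awful acting', 'hated it', 'boring story']
labels = ['positive'] * 4 + ['negative'] * 4

pipe = Pipeline([('vect', TfidfVectorizer()), ('clf', SVC())])
grid = GridSearchCV(pipe,
                    {'clf__kernel': ['linear', 'rbf'], 'clf__C': [0.1, 1, 10]},
                    scoring='accuracy',
                    cv=2)
grid.fit(texts, labels)

# One row per candidate; take the best mean CV score for each kernel
results = pd.DataFrame(grid.cv_results_)
by_kernel = results.groupby('param_clf__kernel')['mean_test_score'].max()
print(by_kernel)
```

Applied to `gs_svm_tfidf`, the same `groupby` would show how far the best RBF candidate trails the best linear one.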


So, the best model uses unigrams and bigrams. The linear kernel was selected over RBF, which suggests there is not much nonlinearity in this classification problem. Let's show the best estimator's test accuracy.

In [33]:
clf_svm = gs_svm_tfidf.best_estimator_
print(f'Test Accuracy of tuned SVM: {clf_svm.score(X_test, y_test):.4f}')

Test Accuracy of tuned SVM: 0.8937


That is a pretty good test accuracy for a simple SVM model on the IMDB dataset. The test accuracy with spaCy pre-processing is almost the same as in the previous IMDB project, which used manual RegEx and NLTK pre-processing.