Testing Naive Bayes classifiers for sentiment analysis on a subset of the IMDB review dataset.

Start by loading the necessary libraries as well as the IMDB dataset.

In [2]:
# Import libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

from wordcloud import WordCloud

import re
import nltk
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB, GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Download necessary NLTK resources

nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [6]:
# Load the dataset (IMDB reviews) from a local CSV file

df = pd.read_csv("IMDB-Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [17]:
# Check the size of the data:

df.shape

(50000, 2)

In [18]:
# Check the number of positive and negative reviews in the data:

df['sentiment'].value_counts()

sentiment,count
positive,25000
negative,25000


There is an equal number of positive and negative reviews, so the dataset is balanced.
For testing purposes, we will use a random sample of 3,000 reviews rather than the full 50,000.

In [31]:
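# Sample the data. No random_state is set, so each run draws a different
# sample and the exact counts and scores below will vary between runs.
df = df.sample(3000)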

# Reset the index.
df.reset_index(drop=True, inplace=True)

# Check the size of the sample dataset.
df.shape

# Check the value counts to ensure we have a relatively balanced sample.
df['sentiment'].value_counts()

sentiment,count
0,1514
1,1486
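
A plain random sample is only approximately balanced and changes on every run. As a minimal sketch, assuming pandas 1.1 or later, the sample could instead be drawn per class with a fixed seed. The helper name `balanced_sample`, the seed value, and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def balanced_sample(frame, label_col, n_per_class, seed=42):
    """Draw the same number of rows from each class, reproducibly."""
    sampled = frame.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # Shuffle so the classes are interleaved, then rebuild a clean 0..n-1 index.
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy stand-in for the IMDB frame (hypothetical data, for illustration only).
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": ["positive", "negative"] * 10,
})

subset = balanced_sample(toy, "sentiment", n_per_class=4)
print(subset["sentiment"].value_counts())
```

On the real data, `balanced_sample(df, 'sentiment', n_per_class=1500)` would give exactly 1,500 reviews per class.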


We will map the sentiment labels to binary values: 0 for negative and 1 for positive. The counts above already show 0 and 1, which suggests the cells were run out of order.

In [33]:
# Replace the sentiment values with binary values.
df['sentiment'] = df['sentiment'].replace({'positive':1, 'negative':0})

df

Unnamed: 0,review,sentiment
0,"Of course the plot, script, and, especially ca...",1
1,I went to see this movie mostly because it loo...,0
2,Nice combination of the giant monster and samu...,1
3,I feel like I've just watched a snuff film.......,0
4,These reviews that claim this movie is so bad ...,0
...,...,...
2995,I have not seen this movie in ages but figured...,0
2996,"""Sir"" has played Lear over 200 times,but tonig...",1
2997,"""Tragic Hero"" is a film that is most definitel...",0
2998,The previous reviewer has said it exactly. I s...,1


# Data Preprocessing

We start by cleaning the text: converting it to lowercase and removing URL links, HTML tags, standalone numbers, and punctuation. We will also expand contractions.

In [36]:
pip install contractions


In [37]:
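# Define a function to preprocess the data.

import contractions

def clean_up(text):
  # Lowercase the text, then strip URLs, standalone numbers, HTML tags, and punctuation.
  text = text.lower() # convert to lowercase
  text = re.sub(r'https?://\S+|www\.\S+', '', text) # remove URL links
  text = re.sub(r"\b\d+\b", "", text) # remove standalone numbers
  text = re.sub(r'<.*?>+', '', text) # remove HTML tags such as <br />
  text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove punctuation
  text = re.sub(r'\n', '', text) # remove newlines
  text = re.sub('[’“”…]', '', text) # remove curly quotes and ellipsis characters

  # Expand contractions. Apostrophes were already stripped above, so this
  # only catches forms the library recognizes without them (e.g. "ive").
  # Note: deleting tags and punctuation without inserting a space can join
  # neighboring words (see "timesbut" in the output below).
  text = contractions.fix(text)

  return text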

In [41]:
dt = df['review'].apply(clean_up)

For testing purposes, I am going to keep every version of the review text as preprocessing continues. First, I will convert the cleaned Series into a DataFrame.

In [43]:
dt = pd.DataFrame(dt)
dt['sentiment']=df['sentiment']

dt

Unnamed: 0,review,sentiment
0,of course the plot script and especially casti...,1
1,i went to see this movie mostly because it loo...,0
2,nice combination of the giant monster and samu...,1
3,i feel like i have just watched a snuff filma ...,0
4,these reviews that claim this movie is so bad ...,0
...,...,...
2995,i have not seen this movie in ages but figured...,0
2996,sir has played lear over timesbut tonight he ...,1
2997,tragic hero is a film that is most definitely ...,0
2998,the previous reviewer has said it exactly i sa...,1
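
The output above contains joined words such as "timesbut", which happens when a separator is deleted rather than replaced. A minimal sketch of one way to avoid this, using only the standard library, is to substitute a space and then collapse repeated whitespace. The function name `clean_with_spaces` and the sample string are hypothetical. Apostrophes also become spaces here, so contraction expansion would need to run before this step.

```python
import re
import string

# Pre-compiled pattern matching any single ASCII punctuation character.
_PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))

def clean_with_spaces(text):
    """Lowercase and strip URLs, HTML tags, numbers, and punctuation,
    replacing each with a space so neighboring words stay separate."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URL links
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"\b\d+\b", " ", text)                # standalone numbers
    text = _PUNCT_RE.sub(" ", text)                     # punctuation
    return " ".join(text.split())                       # collapse whitespace

# Hypothetical input, for illustration only.
sample = "Loved it,but the ending<br /><br />dragged. See www.example.com!"
print(clean_with_spaces(sample))
```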


At this stage of preprocessing we remove the stopwords. However, removing some stopwords (e.g. "but" or "not") may change the meaning of a review, so I will also keep a version of the reviews with all stopwords intact.

In [49]:
# Build the NLTK stopword set, then store each review with and without stopwords

stop_words = set(stopwords.words('english'))
dt['with_sw'] = dt['review']
dt['no_sw'] = dt['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In [50]:
dt

Unnamed: 0,review,sentiment,no_sw,with_sw
0,of course the plot script and especially casti...,1,course plot script especially casting strong f...,of course the plot script and especially casti...
1,i went to see this movie mostly because it loo...,0,went see movie mostly looked good trailers rob...,i went to see this movie mostly because it loo...
2,nice combination of the giant monster and samu...,1,nice combination giant monster samurai genres ...,nice combination of the giant monster and samu...
3,i feel like i have just watched a snuff filma ...,0,feel like watched snuff filma beautifully acte...,i feel like i have just watched a snuff filma ...
4,these reviews that claim this movie is so bad ...,0,reviews claim movie bad good going way overboa...,these reviews that claim this movie is so bad ...
...,...,...,...,...
2995,i have not seen this movie in ages but figured...,0,seen movie ages figured id comment anyway most...,i have not seen this movie in ages but figured...
2996,sir has played lear over timesbut tonight he ...,1,sir played lear timesbut tonight cannot rememb...,sir has played lear over timesbut tonight he ...
2997,tragic hero is a film that is most definitely ...,0,tragic hero film definitely trying emulate cla...,tragic hero is a film that is most definitely ...
2998,the previous reviewer has said it exactly i sa...,1,previous reviewer said exactly saw enchanted s...,the previous reviewer has said it exactly i sa...
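
A middle ground between the two versions is to remove stopwords but keep the ones that carry sentiment, such as negations. The sketch below uses a small hand-written stopword set so that it runs without NLTK data; in the notebook, `stop_words` would be the NLTK set built above. The `keep` set and the sample review are assumptions chosen for illustration.

```python
# Hypothetical mini stopword list standing in for NLTK's English list.
stop_words = {"the", "a", "is", "was", "this", "it", "but", "not", "no", "and"}

# Words we choose to retain because they can flip or qualify sentiment.
keep = {"but", "not", "no"}
filtered_stop_words = stop_words - keep

def remove_stopwords(text, stop_set):
    """Drop every whitespace-separated token that appears in stop_set."""
    return " ".join(word for word in text.split() if word not in stop_set)

review = "this movie was not good but the cast is great"
print(remove_stopwords(review, stop_words))           # negation is lost
print(remove_stopwords(review, filtered_stop_words))  # negation is kept
```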


Next, we will lemmatize both sets of review data (with and without stop words).

In [51]:
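# Lemmatization:
# Note: WordNetLemmatizer.lemmatize expects a single word. Applied to a whole
# review string as done here, it returns the string unchanged, which is why
# the *_lem columns in the output match their input columns.

lemmatizer = WordNetLemmatizer()

dt['no_sw_lem'] = dt['no_sw'].apply(lemmatizer.lemmatize)
dt['with_sw_lem'] = dt['with_sw'].apply(lemmatizer.lemmatize)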

In [52]:
dt

Unnamed: 0,review,sentiment,no_sw,with_sw,no_sw_lem,with_sw_lem
0,of course the plot script and especially casti...,1,course plot script especially casting strong f...,of course the plot script and especially casti...,course plot script especially casting strong f...,of course the plot script and especially casti...
1,i went to see this movie mostly because it loo...,0,went see movie mostly looked good trailers rob...,i went to see this movie mostly because it loo...,went see movie mostly looked good trailers rob...,i went to see this movie mostly because it loo...
2,nice combination of the giant monster and samu...,1,nice combination giant monster samurai genres ...,nice combination of the giant monster and samu...,nice combination giant monster samurai genres ...,nice combination of the giant monster and samu...
3,i feel like i have just watched a snuff filma ...,0,feel like watched snuff filma beautifully acte...,i feel like i have just watched a snuff filma ...,feel like watched snuff filma beautifully acte...,i feel like i have just watched a snuff filma ...
4,these reviews that claim this movie is so bad ...,0,reviews claim movie bad good going way overboa...,these reviews that claim this movie is so bad ...,reviews claim movie bad good going way overboa...,these reviews that claim this movie is so bad ...
...,...,...,...,...,...,...
2995,i have not seen this movie in ages but figured...,0,seen movie ages figured id comment anyway most...,i have not seen this movie in ages but figured...,seen movie ages figured id comment anyway most...,i have not seen this movie in ages but figured...
2996,sir has played lear over timesbut tonight he ...,1,sir played lear timesbut tonight cannot rememb...,sir has played lear over timesbut tonight he ...,sir played lear timesbut tonight cannot rememb...,sir has played lear over timesbut tonight he ...
2997,tragic hero is a film that is most definitely ...,0,tragic hero film definitely trying emulate cla...,tragic hero is a film that is most definitely ...,tragic hero film definitely trying emulate cla...,tragic hero is a film that is most definitely ...
2998,the previous reviewer has said it exactly i sa...,1,previous reviewer said exactly saw enchanted s...,the previous reviewer has said it exactly i sa...,previous reviewer said exactly saw enchanted s...,the previous reviewer has said it exactly i sa...
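
In the table above the lemmatized columns are identical to their inputs (for example, "reviews" and "trailers" stay plural), because the lemmatizer received each review as one long string. A minimal sketch of word-by-word lemmatization follows. The helper `lemmatize_text` is hypothetical, and `toy_lemmatize` is a deliberately crude stand-in so the example runs without the WordNet corpus; it is not a real lemmatizer.

```python
def lemmatize_text(text, lemmatize_word):
    """Apply a single-word lemmatizer to every token in a review."""
    return " ".join(lemmatize_word(token) for token in text.split())

def toy_lemmatize(word):
    # Crude stand-in: strip one trailing "s" from longer words.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

text = "reviews claim movie bad good"
print(toy_lemmatize(text))                  # whole string: nothing changes
print(lemmatize_text(text, toy_lemmatize))  # per token: "reviews" -> "review"
```

With NLTK, the same helper could be used as `dt['no_sw'].apply(lambda t: lemmatize_text(t, lemmatizer.lemmatize))`. Doing so would change the features and therefore the scores reported in this notebook.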


With the lemmatized data, I will create two sets of data (one with stop words and one without). From here, we will tokenize both sets of reviews and create our train/test datasets. First, we will work with the lemmatized data WITHOUT stopwords.

In [53]:
nb = dt.drop(columns=['review', 'no_sw', 'with_sw', 'with_sw_lem'])
nb.columns=['sentiment', 'review']

nb

Unnamed: 0,sentiment,review
0,1,course plot script especially casting strong f...
1,0,went see movie mostly looked good trailers rob...
2,1,nice combination giant monster samurai genres ...
3,0,feel like watched snuff filma beautifully acte...
4,0,reviews claim movie bad good going way overboa...
...,...,...
2995,0,seen movie ages figured id comment anyway most...
2996,1,sir played lear timesbut tonight cannot rememb...
2997,0,tragic hero film definitely trying emulate cla...
2998,1,previous reviewer said exactly saw enchanted s...


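Let's tokenize the data to inspect the tokens. The vectorizers used later tokenize the review text themselves, so tok_reviews is only for inspection.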

In [54]:
tok_reviews = nb['review'].apply(lambda x: x.split())
tok_reviews.head(5)

Unnamed: 0,review
0,"[course, plot, script, especially, casting, st..."
1,"[went, see, movie, mostly, looked, good, trail..."
2,"[nice, combination, giant, monster, samurai, g..."
3,"[feel, like, watched, snuff, filma, beautifull..."
4,"[reviews, claim, movie, bad, good, going, way,..."


# Feature extraction using Bag of Words Vectorization.

In [55]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize)
text_counts = cv.fit_transform(nb['review'])


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'



In [56]:
text_counts

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 262444 stored elements and shape (3000, 40703)>

In [57]:
# Split data into train and test sets.
X = text_counts
y = nb['sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,random_state=30)
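
In the cells above the vectorizer is fitted on all 3,000 reviews before the split, so the vocabulary also reflects the test reviews. A common alternative, sketched here under the assumption that scikit-learn's `make_pipeline` is acceptable for this workflow, is to split the raw text first and fit the vectorizer on the training portion only. The toy reviews and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = positive, 0 = negative.
texts = [
    "wonderful film with great acting", "terrible plot and awful pacing",
    "loved the story and the cast", "boring script with weak dialogue",
    "great cast and wonderful music", "awful acting and boring story",
    "loved every minute of it", "weak plot and terrible ending",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, keeping the class ratio in both parts.
X_train_txt, X_test_txt, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=30, stratify=labels)

# The pipeline fits the vectorizer on the training text only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(X_train_txt, y_tr)
preds = model.predict(X_test_txt)
print(len(preds), "test predictions")
```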

# Naive Bayes Modeling

We will compare three Naive Bayes models: ComplementNB, MultinomialNB, and BernoulliNB.

1. Complement NB Model:

In [61]:
CNB = ComplementNB()
CNB.fit(X_train, y_train)

predicted = CNB.predict(X_test)
# Named accuracy_cnb so the imported accuracy_score function is not shadowed.
accuracy_cnb = metrics.accuracy_score(y_test, predicted)

print(f'Complement NB model accuracy is {accuracy_cnb * 100:.2f}%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))

Complement NB model accuracy is 82.83%
------------------------------------------------
Confusion Matrix:
     0    1
0  269   46
1   57  228
------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       315
           1       0.83      0.80      0.82       285

    accuracy                           0.83       600
   macro avg       0.83      0.83      0.83       600
weighted avg       0.83      0.83      0.83       600
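
The figures in the classification report can be derived directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The following sketch recomputes them in plain Python from the matrix printed above; the helper name `precision_recall_f1` is illustrative.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[269, 46],
      [57, 228]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions

def precision_recall_f1(cm, k):
    """Per-class metrics for class k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum: predicted as k
    actual_k = sum(cm[k])                    # row sum: truly k
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(f"accuracy = {accuracy:.4f}")
for k in (0, 1):
    p, r, f = precision_recall_f1(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```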



2. Multinomial NB Model:

In [60]:
MNB = MultinomialNB()
MNB.fit(X_train, y_train)

predicted = MNB.predict(X_test)
# Named accuracy_mnb so the imported accuracy_score function is not shadowed.
accuracy_mnb = metrics.accuracy_score(y_test, predicted)

print(f'Multinomial NB model accuracy is {accuracy_mnb * 100:.2f}%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))

Multinomial NB model accuracy is 82.83%
------------------------------------------------
Confusion Matrix:
     0    1
0  269   46
1   57  228
------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       315
           1       0.83      0.80      0.82       285

    accuracy                           0.83       600
   macro avg       0.83      0.83      0.83       600
weighted avg       0.83      0.83      0.83       600



3. Bernoulli NB Model:

In [62]:
from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()
BNB.fit(X_train, y_train)

predicted = BNB.predict(X_test)
accuracy_score_bnb = metrics.accuracy_score(predicted,y_test)

print('Bernoulli NB model accuracy = ' + str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, predicted))

Bernoulli NB model accuracy = 80.00%
------------------------------------------------
Confusion Matrix:
     0    1
0  289   26
1   94  191
------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.92      0.83       315
           1       0.88      0.67      0.76       285

    accuracy                           0.80       600
   macro avg       0.82      0.79      0.79       600
weighted avg       0.81      0.80      0.80       600



# Feature Extraction using TF-IDF

In [63]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
text_count_2 = tfidf.fit_transform(nb['review'])

In [64]:
x_train, x_test, y_train, y_test = train_test_split(text_count_2, nb['sentiment'],test_size=0.20,random_state=30)

In [66]:
# fitting the model with CNB
CNB.fit(x_train, y_train)
accuracy_score_cnb = metrics.accuracy_score(CNB.predict(x_test), y_test)
print('accuracy_score_cnb = '+str('{:4.2f}'.format(accuracy_score_cnb*100))+'%')

accuracy_score_cnb = 85.17%


In [65]:
#fitting the model with MNB
MNB.fit(x_train, y_train)
accuracy_score_mnb = metrics.accuracy_score(MNB.predict(x_test), y_test)

print('accuracy_score_mnb = '+str('{:4.2f}'.format(accuracy_score_mnb*100))+'%')

accuracy_score_mnb = 85.17%


In [67]:
#fitting the model with BNB
BNB.fit(x_train, y_train)
accuracy_score_bnb = metrics.accuracy_score(BNB.predict(x_test), y_test)
print('accuracy_score_bnb = '+str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')

accuracy_score_bnb = 80.83%
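
For intuition about what the TF-IDF features contain, the sketch below computes the weights by hand for a tiny corpus. It assumes the smoothed formula documented for TfidfVectorizer's default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The toy documents are invented.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [["great", "movie", "great", "cast"],
        ["bad", "movie"],
        ["great", "plot"]]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    raw = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
           for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(docs[1])
print(weights)  # "bad" (in 1 document) outweighs "movie" (in 2 documents)
```

Terms that appear in fewer documents receive larger weights, which is the main difference from the raw counts used in the bag-of-words section.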


# Re-running this analysis with the sample dataset that INCLUDES stopwords.

In [69]:
nb_SW = dt.drop(columns=['review', 'no_sw', 'with_sw', 'no_sw_lem'])
nb_SW.columns=['sentiment', 'review']

nb_SW

Unnamed: 0,sentiment,review
0,1,of course the plot script and especially casti...
1,0,i went to see this movie mostly because it loo...
2,1,nice combination of the giant monster and samu...
3,0,i feel like i have just watched a snuff filma ...
4,0,these reviews that claim this movie is so bad ...
...,...,...
2995,0,i have not seen this movie in ages but figured...
2996,1,sir has played lear over timesbut tonight he ...
2997,0,tragic hero is a film that is most definitely ...
2998,1,the previous reviewer has said it exactly i sa...


Tokenize the data (keeping stop words)

In [71]:
tok_review_SW = nb_SW['review'].apply(lambda x: x.split())
tok_review_SW.head(5)

Unnamed: 0,review
0,"[of, course, the, plot, script, and, especiall..."
1,"[i, went, to, see, this, movie, mostly, becaus..."
2,"[nice, combination, of, the, giant, monster, a..."
3,"[i, feel, like, i, have, just, watched, a, snu..."
4,"[these, reviews, that, claim, this, movie, is,..."


# Bag of Words (with stop words)

In [85]:
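token_SW = RegexpTokenizer(r'[a-zA-Z0-9]+')
# Note: stop_words='english' makes CountVectorizer drop scikit-learn's built-in
# English stop words, so this bag-of-words run does not fully retain stopwords.
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token_SW.tokenize)
text_counts_SW = cv.fit_transform(nb_SW['review'])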


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'



In [86]:
text_counts_SW

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 266098 stored elements and shape (3000, 40711)>
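
The vocabulary here has 40,711 terms, only 8 more than the 40,703 in the run without stopwords. This is consistent with `stop_words='english'` removing stop words inside the vectorizer. The sketch below illustrates the effect on a single invented sentence; it uses the default tokenizer rather than the RegexpTokenizer above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentence, for illustration only.
sample = ["this movie is not good but the cast is great"]

# With the built-in English stop word list.
with_filter = CountVectorizer(stop_words="english").fit(sample)
# With the default stop_words=None, every token of 2+ characters is kept.
without_filter = CountVectorizer().fit(sample)

print(sorted(with_filter.vocabulary_))
print(sorted(without_filter.vocabulary_))
```

To keep stopwords in the bag-of-words features, the `stop_words` argument would be left at its default of `None`.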

In [87]:
# Split data into train and test sets.
X = text_counts_SW
y = nb_SW['sentiment']

X_SW_train, X_SW_test, y_SW_train, y_SW_test = train_test_split(X, y, test_size=0.20,random_state=30)

# Naive Bayes Modeling (keeping stopwords)

1. Complement NB:

In [88]:
CNB_SW = ComplementNB()
CNB_SW.fit(X_SW_train, y_SW_train)

predicted = CNB_SW.predict(X_SW_test)
# Named accuracy_cnb_sw so the imported accuracy_score function is not shadowed.
accuracy_cnb_sw = metrics.accuracy_score(y_SW_test, predicted)

print(f'Complement NB model accuracy is {accuracy_cnb_sw * 100:.2f}%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_SW_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_SW_test, predicted))

Complement NB model accuracy is 83.17%
------------------------------------------------
Confusion Matrix:
     0    1
0  270   45
1   56  229
------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       315
           1       0.84      0.80      0.82       285

    accuracy                           0.83       600
   macro avg       0.83      0.83      0.83       600
weighted avg       0.83      0.83      0.83       600



2. Multinomial NB:

In [89]:
MNB_SW = MultinomialNB()
MNB_SW.fit(X_SW_train, y_SW_train)

predicted = MNB_SW.predict(X_SW_test)
# Named accuracy_mnb_sw so the imported accuracy_score function is not shadowed.
accuracy_mnb_sw = metrics.accuracy_score(y_SW_test, predicted)

print(f'Multinomial NB model accuracy is {accuracy_mnb_sw * 100:.2f}%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_SW_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_SW_test, predicted))

Multinomial NB model accuracy is 83.17%
------------------------------------------------
Confusion Matrix:
     0    1
0  270   45
1   56  229
------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       315
           1       0.84      0.80      0.82       285

    accuracy                           0.83       600
   macro avg       0.83      0.83      0.83       600
weighted avg       0.83      0.83      0.83       600



3. Bernoulli NB:

In [90]:
BNB_SW = BernoulliNB()
BNB_SW.fit(X_SW_train, y_SW_train)

predicted = BNB_SW.predict(X_SW_test)
accuracy_score_bnb = metrics.accuracy_score(predicted,y_SW_test)

print('Bernoulli NB model accuracy = ' + str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_SW_test, predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_SW_test, predicted))

Bernoulli NB model accuracy = 80.17%
------------------------------------------------
Confusion Matrix:
     0    1
0  290   25
1   94  191
------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.92      0.83       315
           1       0.88      0.67      0.76       285

    accuracy                           0.80       600
   macro avg       0.82      0.80      0.80       600
weighted avg       0.82      0.80      0.80       600



# TF-IDF (keeping stopwords)

In [84]:
tfidf_SW = TfidfVectorizer()
text_count_2 = tfidf_SW.fit_transform(nb_SW['review'])

In [91]:
x_SW_train, x_SW_test, y_SW_train, y_SW_test = train_test_split(text_count_2, nb_SW['sentiment'],test_size=0.20,random_state=30)

In [92]:
# fitting the model with CNB
CNB.fit(x_SW_train, y_SW_train)
accuracy_score_cnb = metrics.accuracy_score(CNB.predict(x_SW_test), y_SW_test)
print('accuracy_score_cnb = '+str('{:4.2f}'.format(accuracy_score_cnb*100))+'%')

accuracy_score_cnb = 86.00%


In [93]:
#fitting the model with MNB
MNB.fit(x_SW_train, y_SW_train)
accuracy_score_mnb = metrics.accuracy_score(MNB.predict(x_SW_test), y_SW_test)

print('accuracy_score_mnb = '+str('{:4.2f}'.format(accuracy_score_mnb*100))+'%')

accuracy_score_mnb = 86.00%


In [94]:
#fitting the model with BNB
BNB.fit(x_SW_train, y_SW_train)
accuracy_score_bnb = metrics.accuracy_score(BNB.predict(x_SW_test), y_SW_test)
print('accuracy_score_bnb = '+str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')

accuracy_score_bnb = 84.00%


# Conclusions after testing with datasets where stop words had been removed versus dataset that kept stop words

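With TF-IDF features, accuracy was higher when stopwords were retained: 86.00% versus 85.17% for ComplementNB and MultinomialNB, and 84.00% versus 80.83% for BernoulliNB. With bag-of-words features the gap was much smaller (83.17% versus 82.83%, and 80.17% versus 80.00%) and the F1 scores were nearly unchanged. Note that the bag-of-words vectorizer was called with stop_words='english' in both runs, so that comparison removes most stopwords either way. The differences are small and come from a single 600-review test split, so they suggest, rather than establish, that some stopwords help in detecting the overall sentiment of a review.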
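
To judge how much weight small accuracy differences can bear, a rough uncertainty estimate helps. The sketch below uses the normal approximation to the binomial to compute an approximate 95% interval for the TF-IDF ComplementNB accuracies on a 600-review test set. The approximation and the helper name `accuracy_interval` are assumptions; because both models were scored on the same reviews, a paired comparison or cross-validation would be a more careful check.

```python
import math

def accuracy_interval(accuracy, n_test, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_test samples."""
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - z * standard_error, accuracy + z * standard_error

# TF-IDF ComplementNB accuracies reported in this notebook.
results = {
    "stopwords removed": 0.8517,
    "stopwords kept": 0.8600,
}

for name, acc in results.items():
    low, high = accuracy_interval(acc, n_test=600)
    print(f"{name}: {acc:.4f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

The two intervals overlap substantially, so a gap of under one percentage point on this split is within what sampling noise alone could produce.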