Sentiment Analysis of IMDB Movie Reviews

Import the nltk and necessary libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud,STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
import re

import os
print(os.listdir("../input"))
import warnings
warnings.filterwarnings('ignore')


['IMDB Dataset.csv']


Import the training data

In [2]:
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
print(imdb_data.shape)
imdb_data.head(10)

(50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


Exploratery data analysis

In [3]:
imdb_data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


Finding the number of counts of each sentiment reviews

In [4]:
imdb_data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

Import the reviews and sentiments from IMDB dataset and split the dataset

In [5]:
reviews=np.array(imdb_data['review'])
sentiments=np.array(imdb_data['sentiment'])
                    
train_reviews=reviews[:40000]
train_sentiments=sentiments[:40000]
test_reviews=reviews[40000:]
test_sentiments=sentiments[40000:]
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)

(40000,) (40000,)
(10000,) (10000,)


Text preprocessing or text normalization

In [6]:
from bs4 import BeautifulSoup
import spacy
import re,string,unicodedata
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import LancasterStemmer,WordNetLemmatizer
tokenizer=ToktokTokenizer()
stopword_list=nltk.corpus.stopwords.words('english')

Removing html strips characters

In [7]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

def remove_between_square_brackets(text):
    return re.sub('\[[^]]*\]', '', text)

def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
imdb_data['review']=imdb_data['review'].apply(denoise_text)


Remove accented characters

In [8]:
def remove_accented_chars(text):
    text=unicodedata.normalize('NFKD',text).encode('ascii','ignore').decode('utf-8','ignore')
    return text
imdb_data['review']=imdb_data['review'].apply(remove_accented_chars)

Expand contractions

In [9]:
CONTRACTION_MAP = {"ain't": "is not", "aren't": "are not","can't": "cannot", 
                   "can't've": "cannot have", "'cause": "because", "could've": "could have", 
                   "couldn't": "could not", "couldn't've": "could not have","didn't": "did not", 
                   "doesn't": "does not", "don't": "do not", "hadn't": "had not", 
                   "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not", 
                   "he'd": "he would", "he'd've": "he would have", "he'll": "he will", 
                   "he'll've": "he he will have", "he's": "he is", "how'd": "how did", 
                   "how'd'y": "how do you", "how'll": "how will", "how's": "how is", 
                   "I'd": "I would", "I'd've": "I would have", "I'll": "I will", 
                   "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                   "i'd": "i would", "i'd've": "i would have", "i'll": "i will", 
                   "i'll've": "i will have","i'm": "i am", "i've": "i have", 
                   "isn't": "is not", "it'd": "it would", "it'd've": "it would have", 
                   "it'll": "it will", "it'll've": "it will have","it's": "it is", 
                   "let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                   "might've": "might have","mightn't": "might not","mightn't've": "might not have", 
                   "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", 
                   "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                   "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
                   "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", 
                   "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", 
                   "she's": "she is", "should've": "should have", "shouldn't": "should not", 
                   "shouldn't've": "should not have", "so've": "so have","so's": "so as", 
                   "this's": "this is",
                   "that'd": "that would", "that'd've": "that would have","that's": "that is", 
                   "there'd": "there would", "there'd've": "there would have","there's": "there is", 
                   "they'd": "they would", "they'd've": "they would have", "they'll": "they will", 
                   "they'll've": "they will have", "they're": "they are", "they've": "they have", 
                   "to've": "to have", "wasn't": "was not", "we'd": "we would", 
                   "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", 
                   "we're": "we are", "we've": "we have", "weren't": "were not", 
                   "what'll": "what will", "what'll've": "what will have", "what're": "what are", 
                   "what's": "what is", "what've": "what have", "when's": "when is", 
                   "when've": "when have", "where'd": "where did", "where's": "where is", 
                   "where've": "where have", "who'll": "who will", "who'll've": "who will have", 
                   "who's": "who is", "who've": "who have", "why's": "why is", 
                   "why've": "why have", "will've": "will have", "won't": "will not", 
                   "won't've": "will not have", "would've": "would have", "wouldn't": "would not", 
                   "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
                   "y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                   "you'd": "you would", "you'd've": "you would have", "you'll": "you will", 
                   "you'll've": "you will have", "you're": "you are", "you've": "you have" } 

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
imdb_data['review']=imdb_data['review'].apply(expand_contractions)


Removing special characters

In [10]:
def remove_special_characters(text, remove_digits=False):
    pattern=r'[^a-zA-z0-9\s]' if not remove_digits else r'[a-zA-z\s]'
    text=re.sub(pattern,'',text)
    return text
imdb_data['review']=imdb_data['review'].apply(remove_special_characters)

Text stemming


In [11]:
def simple_stemmer(text):
    ps=nltk.porter.PorterStemmer()
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
imdb_data['review']=imdb_data['review'].apply(simple_stemmer)


Removing stopwords

In [12]:
stop=set(stopwords.words('english'))
print(stop)

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
imdb_data['review']=imdb_data['review'].apply(remove_stopwords)

{'were', 'on', 'have', 'yourself', "it's", 'where', 'just', 'be', 'should', 'can', 'ma', 'most', 'few', 'of', 'in', "mustn't", 'an', 'this', 'any', 's', 'aren', 'mightn', 'weren', "weren't", 'are', 'it', 'more', "didn't", 'some', 'having', "needn't", 'doing', 'through', 'its', 'her', 'only', 'yourselves', 'themselves', 'hers', 'll', 'what', 'needn', 'while', 'his', 'who', "doesn't", 'both', 'by', 't', "you've", 'is', 'why', 'now', 'isn', 'about', 're', "hasn't", 'same', 'when', 'wouldn', 'itself', 'will', 'didn', "hadn't", 'before', 'too', "you'll", 'a', 'because', "won't", 'you', "wouldn't", 'until', 'or', "she's", 'above', 'we', 'ourselves', 've', 'does', 'himself', 'here', 'couldn', 'so', 'between', 'very', 'for', 'ain', 'being', 'was', 'wasn', 'no', 'yours', "you're", "should've", 'own', 'do', 'off', "shan't", 'that', 'again', 'am', 'as', 'don', 'o', 'herself', "that'll", 'hasn', 'those', 'shouldn', 'under', 'haven', 'they', 'nor', 'm', 'won', 'once', 'not', 'theirs', 'their', 'd',

Normalize train reviews

In [13]:
norm_train_reviews=imdb_data.review[:40000]
norm_train_reviews[0]

'one review ha mention watch 1 Oz episod hook right thi exactli happen meth first thing struck Oz wa brutal unflinch scene violenc set right word GO trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use wordit call OZ nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda Em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show would dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doe mess around first episod ever saw struck nasti wa surreal could say wa readi watch develop tast Oz got accustom high level graphic violenc violenc injustic crook guard sold nickel inmat kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch Oz may becom co

Normalize test reviews

In [14]:
norm_test_reviews=imdb_data.review[40000:]
norm_test_reviews[40000]

'first want say lean liber polit scale found movi offens manag watch whole doggon disgrac film thi movi bring low origin idea ye wa origin thu 2 star instead 1 film writer uncr onli come thi act wa horribl charact unlik part lead ladi stori good qualiti made bf sort bad guy see mayb miss someth knowh wa earth relev charact movi shell ani money thi garbag almost wish peta would come rescu thi aw offens movi form protest disgust say anymor'

Bags of words model used to convert categorical data to numerical data

In [15]:
cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,2))
cv_train_reviews=cv.fit_transform(norm_train_reviews)
cv_test_reviews=cv.transform(norm_test_reviews)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)
#vocab=cv.get_feature_names()-toget feature names

BOW_cv_train: (40000, 1927678)
BOW_cv_test: (10000, 1927678)


TFIDF model used to convert categorical data to  numerical data

In [16]:
tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,2),sublinear_tf=True)
tv_train_reviews=tv.fit_transform(norm_train_reviews)
tv_test_reviews=tv.transform(norm_test_reviews)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (40000, 1927678)
Tfidf_test: (10000, 1927678)


LabelBinarizer used to convert catrgorical data to numerical data

In [17]:
from sklearn.preprocessing import LabelBinarizer
lb=LabelBinarizer()
sentiment_data=lb.fit_transform(imdb_data['sentiment'])
print(sentiment_data.shape)

(50000, 1)


Split the sentiment data

In [18]:
train_sentiments=sentiment_data[:40000]
test_sentiments=sentiment_data[40000:]
print(train_sentiments)
print(test_sentiments)

[[1]
 [1]
 [1]
 ...
 [1]
 [0]
 [0]]
[[0]
 [0]
 [0]
 ...
 [0]
 [0]
 [0]]


Supervised learning models

In [19]:
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report,confusion_matrix

Logistic regression and Stochastic Gradient descent Classifier

In [20]:
lr=LogisticRegression(penalty='l2',max_iter=100,C=1,random_state=32)
svm=SGDClassifier(loss='hinge',n_iter=100,random_state=32)

Logistic regression for Bag of words model

fit the model

In [21]:
lr_bow=lr.fit(cv_train_reviews,train_sentiments)
lr_bow

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=32, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

predict the model

In [22]:
lr_bow_predict=lr.predict(cv_test_reviews)
lr_bow_predict

array([0, 0, 0, ..., 0, 1, 0])

print the classification report

In [23]:
lr_bow_report=classification_report(test_sentiments,lr_bow_predict,target_names=['Positive','Negative'])
print(lr_bow_report)

              precision    recall  f1-score   support

    Positive       0.65      0.78      0.71      4993
    Negative       0.72      0.58      0.64      5007

   micro avg       0.68      0.68      0.68     10000
   macro avg       0.68      0.68      0.67     10000
weighted avg       0.68      0.68      0.67     10000



plot the confusion matrix

In [24]:
cm_bow=confusion_matrix(test_sentiments,lr_bow_predict,labels=[1,0])
print(cm_bow)

[[2898 2109]
 [1116 3877]]


From the confusion matrix, we can conclude that out of 5007 reviews, 2898 reviews are correctly predict as positive
and out of 4993 reviews,3877 reviews correctly predicted as negative.

Logistic Regression for TFIDF model

fit the model

In [25]:
lr_tfidf=lr.fit(tv_train_reviews,train_sentiments)
lr_tfidf

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=32, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

predict the model

In [26]:
lr_tfidf_predict=lr.predict(tv_test_reviews)
lr_tfidf

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=32, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

plot the classification report

In [27]:
lr_tfidf_report=classification_report(test_sentiments,lr_tfidf_predict,target_names=['positive','negative'])
print(lr_tfidf_report)

              precision    recall  f1-score   support

    positive       0.69      0.69      0.69      4993
    negative       0.69      0.68      0.69      5007

   micro avg       0.69      0.69      0.69     10000
   macro avg       0.69      0.69      0.69     10000
weighted avg       0.69      0.69      0.69     10000



Plot the confusion matrix

In [28]:
cm_tfidf=confusion_matrix(test_sentiments,lr_tfidf_predict)
print(cm_tfidf)

[[3459 1534]
 [1578 3429]]


From the confusion matrix, we can predict that out of 4993 reviews, 3459 reviews are correctly predicted as positive
and out of 5007 reviews, 3429 reviews are correctly predicted as negative.

SVM for Bag of words model

fit the model

In [29]:
svm_bow=svm.fit(cv_train_reviews,train_sentiments)
svm_bow

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=100, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=32, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

predict the model

In [30]:
svm_bow_pred=svm.predict(cv_test_reviews)
svm_bow_pred

array([0, 0, 0, ..., 0, 1, 0])

plot the classification report

In [31]:
svm_bow_report=classification_report(test_sentiments,svm_bow_pred,target_names=['positive','negative'])
print(svm_bow_report)

              precision    recall  f1-score   support

    positive       0.65      0.73      0.69      4993
    negative       0.70      0.62      0.65      5007

   micro avg       0.67      0.67      0.67     10000
   macro avg       0.68      0.67      0.67     10000
weighted avg       0.68      0.67      0.67     10000



plot the confusion matrix

In [32]:
cm_bow=confusion_matrix(test_sentiments,svm_bow_pred)
cm_bow

array([[3644, 1349],
       [1924, 3083]])

From the confusion matrix, we can predict that out of 4993 reviews, 3644 reviews are correctly predicted as positive
and out of 5007 reviews, 3083 reviews are correctly predicted as negative.

SVM for TFIDF model

fit the model

In [33]:
svm_tf=svm.fit(tv_train_reviews,train_sentiments)
svm_tf

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=100, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=32, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

predict the model

In [34]:
svm_tf_pred=svm.predict(tv_test_reviews)
svm_tf_pred

array([1, 1, 1, ..., 1, 1, 1])

plot the classification report

In [35]:
svm_tf_report=classification_report(test_sentiments,svm_tf_pred,target_names=['positive','negative'])
print(svm_tf_report)

              precision    recall  f1-score   support

    positive       1.00      0.02      0.04      4993
    negative       0.51      1.00      0.67      5007

   micro avg       0.51      0.51      0.51     10000
   macro avg       0.75      0.51      0.36     10000
weighted avg       0.75      0.51      0.36     10000



In [36]:
cm_tf_report=confusion_matrix(test_sentiments,svm_tf_pred)
cm_tf_report

array([[ 104, 4889],
       [   0, 5007]])

From the confusion matrix,as we can predict that out of 4993 reviews, 104 reviews are correctly predicted as positive
and out of 5007 reviews,5007 reviews are correctly predicted as negative.