In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
imdb = pd.read_csv('IMDB Dataset.csv')

In [3]:
imdb.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [5]:
imdb.columns

Index(['review', 'sentiment'], dtype='object')

In [6]:
print(imdb.iloc[0,0])

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fac

In [7]:
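# dtype=int keeps the indicator column as 0/1 (newer pandas versions default to True/False)
sentiment = pd.get_dummies(imdb['sentiment'], drop_first=True, dtype=int)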

In [8]:
imdb = pd.concat([imdb,sentiment],axis=1)

In [52]:
imdb.head()

Unnamed: 0,review,sentiment,positive
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [54]:
# text_process is defined under "Data Preparation" below; this cell was run after it
imdb['review'].head(5).apply(text_process)

0    [one, review, mention, watch, 1, oz, episod, y...
1    [wonder, littl, product, film, techniqu, unass...
2    [thought, wonder, way, spend, time, hot, summe...
3    [basic, there, famili, littl, boy, jake, think...
4    [petter, mattei, love, time, money, visual, st...
Name: review, dtype: object

Data Preparation

In [11]:
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rishik\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
# Build these once: calling stopwords.words('english') for every word is very slow
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def text_process(review):
    # make everything lower case
    review = review.lower()
    # remove punctuation
    review = ''.join(char for char in review if char not in string.punctuation)
    # remove 'br', which is left over from the <br /> line-break tags
    words = [word for word in review.split() if word != 'br']
    # remove stopwords
    words = [word for word in words if word not in STOP_WORDS]
    # stemming
    return [STEMMER.stem(word) for word in words]
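
One side effect of stripping punctuation before removing `br` is worth seeing on a small input: when a sentence ends directly before a `<br />` tag, the tag's letters get glued onto the preceding word. The sketch below uses only the standard library and leaves out stopword removal and stemming; `BR_TAG`, `strip_then_split` and `split_with_tag_removal` are hypothetical names introduced for illustration, not part of the notebook.

```python
import re
import string

# Matches "<br>", "<br/>" and "<br />" in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_then_split(text):
    """Mirror the first steps of text_process: lower-case, drop punctuation, drop 'br'."""
    cleaned = text.lower().translate(PUNCT_TABLE)
    return [word for word in cleaned.split() if word != "br"]

def split_with_tag_removal(text):
    """Variant: replace line-break tags with a space before dropping punctuation."""
    cleaned = BR_TAG.sub(" ", text.lower()).translate(PUNCT_TABLE)
    return cleaned.split()

sample = "what happened with me.<br /><br />The first thing"
print(strip_then_split(sample))        # contains the glued token 'mebr'
print(split_with_tag_removal(sample))  # keeps 'me' and 'the' separate
```

Whether this matters for the final scores is not something the notebook measures; the variant is only one possible way to avoid the glued tokens.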
   

In [49]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import sys

In [15]:
from sklearn.model_selection import train_test_split

Vectorising each review

In [17]:
cv = CountVectorizer(analyzer=text_process).fit(imdb['review'])

print(len(cv.vocabulary_))

143223


In [20]:
review1 = imdb['review'][4]
print(review1)

Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way, or another to the next person, but no one seems to know the previous point of contact. Stylishly, the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment, as one discerns is the case with m

In [22]:
bow4 = cv.transform([review1])
print(bow4)
print(bow4.shape)

  (0, 4448)	1
  (0, 4572)	1
  (0, 5192)	1
  (0, 6662)	1
  (0, 8570)	1
  (0, 9088)	1
  (0, 10286)	1
  (0, 11666)	1
  (0, 15071)	1
  (0, 15557)	1
  (0, 20334)	1
  (0, 22102)	1
  (0, 22340)	1
  (0, 22462)	1
  (0, 23640)	2
  (0, 25558)	1
  (0, 27173)	1
  (0, 28210)	2
  (0, 28453)	1
  (0, 32488)	1
  (0, 34972)	3
  (0, 35340)	1
  (0, 35401)	1
  (0, 35676)	1
  (0, 40717)	1
  :	:
  (0, 115018)	1
  (0, 117315)	1
  (0, 117485)	1
  (0, 118951)	1
  (0, 119809)	1
  (0, 121126)	1
  (0, 121274)	1
  (0, 121569)	1
  (0, 123656)	1
  (0, 123729)	1
  (0, 124813)	1
  (0, 125677)	1
  (0, 126122)	1
  (0, 127195)	2
  (0, 129059)	1
  (0, 133517)	2
  (0, 134147)	1
  (0, 135411)	1
  (0, 135484)	1
  (0, 136686)	1
  (0, 136932)	1
  (0, 139293)	1
  (0, 140027)	1
  (0, 140138)	1
  (0, 141710)	1
(1, 143223)
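
Each printed line above is a `(row, column)` pair followed by a count: the column is the token's index in the vocabulary and the count is how often that token occurs in the review. A minimal pure-Python sketch of the same idea, using a made-up two-document corpus and hypothetical helper names (`build_vocabulary`, `to_sparse_row`), could look like this:

```python
from collections import Counter

def build_vocabulary(tokenised_docs):
    """Map each distinct token to a column index (sorted for reproducibility)."""
    tokens = sorted({tok for doc in tokenised_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_sparse_row(tokens, vocabulary, row=0):
    """Return ((row, column), count) pairs for the tokens found in the vocabulary."""
    counts = Counter(tok for tok in tokens if tok in vocabulary)
    return sorted(((row, vocabulary[tok]), n) for tok, n in counts.items())

# Made-up, already tokenised documents (not taken from the IMDB data).
docs = [["love", "time", "money"], ["money", "power", "money"]]
vocab = build_vocabulary(docs)
sparse_row = to_sparse_row(docs[1], vocab)
print(vocab)
print(sparse_row)
```

Only non-zero counts are stored, which is why a single review prints a few dozen entries even though the row has 143223 columns.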


In [24]:
review_bow = cv.transform(imdb['review'])

Now normalise the counts using a TF-IDF transformer

In [26]:
tfidf_transformer = TfidfTransformer().fit(review_bow)

In [27]:
review_tfidf = tfidf_transformer.transform(review_bow)

In [30]:
print(review_tfidf.shape)

(50000, 143223)
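
To make the weighting concrete, here is a small sketch of the arithmetic on a dense toy matrix. It assumes the `TfidfTransformer` defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); the function name `tfidf_rows` and the toy counts are made up for illustration.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed IDF and L2 row normalisation for a dense count matrix."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents does each term appear?
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1.0 for df in doc_freq]
    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Two toy documents over three terms; term 1 occurs in both documents.
counts = [[1, 1, 0], [0, 2, 1]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first row both terms occur once, but the term that appears in only one document ends up with the larger weight, which is the point of the IDF factor.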


Splitting data into training and test sets

In [32]:
X = review_tfidf
y = imdb['positive']

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
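
Note that the vocabulary and the IDF weights above were fitted on all 50,000 reviews before the split, so the test reviews influence the features. One possible way to avoid this is to split the raw text first and let a `Pipeline` (imported earlier but not used) fit every step on the training part only. The sketch below runs on a made-up toy corpus and uses the default analyzer; in the notebook the first step would take `analyzer=text_process`. The step names are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy stand-in for imdb['review'] and imdb['positive'] (made-up data).
reviews = [
    "a wonderful little film", "wonderful acting and a great story",
    "great fun from start to finish", "a truly wonderful experience",
    "a dull and boring mess", "boring plot and terrible acting",
    "terrible pacing and a dull script", "an awful boring film",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so nothing is fitted on the test reviews.
texts_train, texts_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels
)

pipeline = Pipeline([
    ("bow", CountVectorizer()),        # the notebook would pass analyzer=text_process
    ("tfidf", TfidfTransformer()),
    ("classifier", LogisticRegression()),
])

pipeline.fit(texts_train, y_train)     # vocabulary and IDF come from training text only
predictions = pipeline.predict(texts_test)
print(predictions)
```

Swapping `LogisticRegression()` for `MultinomialNB()` in the last step would give the Naive Bayes counterpart with the same preprocessing.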

Try first with a Naive-Bayes model

In [None]:
NB_model = MultinomialNB().fit(X_train, y_train)

In [38]:
predictions_NB = NB_model.predict(X_test)

In [51]:
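# Note: scikit-learn expects (y_true, y_pred). With the predictions passed first, as here,
# the precision and recall columns are swapped, support counts predicted labels, and the
# confusion matrix is transposed; accuracy and per-class f1-score are unaffected.
print(classification_report(predictions_NB, y_test))
print(confusion_matrix(predictions_NB,y_test))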

              precision    recall  f1-score   support

           0       0.88      0.85      0.86      8487
           1       0.84      0.87      0.86      8013

    accuracy                           0.86     16500
   macro avg       0.86      0.86      0.86     16500
weighted avg       0.86      0.86      0.86     16500

[[7193 1294]
 [1015 6998]]
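
The figures in the report can be reproduced by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout, where rows correspond to the first argument passed to `confusion_matrix` and columns to the second; `per_class_metrics` is a hypothetical helper, and the matrix is copied from the output above.

```python
def per_class_metrics(matrix):
    """Return [(precision, recall), ...] per class plus overall accuracy.

    Rows are treated as the first argument given to confusion_matrix
    (the "true" side in scikit-learn's convention), columns as the second.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    metrics = []
    for k in range(n):
        row_sum = sum(matrix[k])
        col_sum = sum(matrix[i][k] for i in range(n))
        precision = matrix[k][k] / col_sum if col_sum else 0.0
        recall = matrix[k][k] / row_sum if row_sum else 0.0
        metrics.append((precision, recall))
    return metrics, correct / total

nb_matrix = [[7193, 1294], [1015, 6998]]  # copied from the Naive Bayes output above
nb_metrics, nb_accuracy = per_class_metrics(nb_matrix)
for label, (precision, recall) in enumerate(nb_metrics):
    print(label, round(precision, 2), round(recall, 2))
print(round(nb_accuracy, 2))
```

The rounded values match the precision, recall and accuracy columns printed in the report.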


Now try a logistic regression model

In [42]:
logistic_model = LogisticRegression().fit(X_train, y_train)

In [43]:
predictions_log = logistic_model.predict(X_test)

In [50]:
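# Note: as in the Naive Bayes cell, the predictions are passed first, so the precision and
# recall columns are swapped and the confusion matrix is transposed relative to (y_true, y_pred).
print(classification_report(predictions_log, y_test))
print(confusion_matrix(predictions_log,y_test))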

              precision    recall  f1-score   support

           0       0.88      0.90      0.89      7991
           1       0.91      0.88      0.90      8509

    accuracy                           0.89     16500
   macro avg       0.89      0.89      0.89     16500
weighted avg       0.89      0.89      0.89     16500

[[7218  773]
 [ 990 7519]]


Logistic regression is therefore slightly better: 0.89 accuracy versus 0.86 for Naive Bayes on the same 16,500-review test set.