In this notebook we apply machine learning models to the pre-processed IMDb dataset to predict the sentiment of a new review.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#reading the processed dataset
df= pd.read_csv('pre_processed.csv')
df.drop('Unnamed: 0', axis=1, inplace= True)
df.head()

Unnamed: 0,review,sentiment,simle_words_tokens,nltk_word_tokenize,nltk_sentence_tokenize,porter_stemming,lemmitized
0,one reviewers mentioned watching 1 oz e...,positive,"['one', 'reviewers', 'mentioned', 'watching', ...","['one', 'reviewers', 'mentioned', 'watching', ...",['one reviewers mentioned watching 1 oz...,one review mention watch 1 oz episod youll hoo...,one reviewer mentioned watching 1 oz episode y...
1,wonderful little production filming techniqu...,positive,"['wonderful', 'little', 'production', 'filming...","['wonderful', 'little', 'production', 'filming...",[' wonderful little production filming techni...,wonder littl product film techniqu unassum old...,wonderful little production filming technique ...
2,thought wonderful way spend time hot s...,positive,"['thought', 'wonderful', 'way', 'spend', 'time...","['thought', 'wonderful', 'way', 'spend', 'time...",[' thought wonderful way spend time hot...,thought wonder way spend time hot summer weeke...,thought wonderful way spend time hot summer we...
3,basically theres family little boy jake thi...,negative,"['basically', 'theres', 'family', 'little', 'b...","['basically', 'theres', 'family', 'little', 'b...",['basically theres family little boy jake t...,basic there famili littl boy jake think there ...,basically there family little boy jake think t...
4,petter matteis love time money visually s...,positive,"['petter', 'matteis', 'love', 'time', 'money',...","['petter', 'matteis', 'love', 'time', 'money',...",['petter matteis love time money visually...,petter mattei love time money visual stun film...,petter matteis love time money visually stunni...


## DataFrame Pre-Processing:
The text is already pre-processed, but we still have to check for missing values, duplicate rows and class balance.

**Checking for Duplicate Rows in the Dataset and Dropping Them:**

In [3]:
df.duplicated().sum() 

422

In [4]:
df.drop_duplicates(inplace= True)

**Checking for the Null Values:**

In [5]:
df.isnull().sum()

review                    0
sentiment                 0
simle_words_tokens        0
nltk_word_tokenize        0
nltk_sentence_tokenize    0
porter_stemming           0
lemmitized                0
dtype: int64

**Checking Class Balance:**

In [6]:
df['sentiment'].value_counts()

sentiment
positive    24883
negative    24695
Name: count, dtype: int64

Class balance looks good: 24,883 positive vs. 24,695 negative reviews, so the two classes are nearly even :)

**We will use the lemmatized column (`lemmitized`) of the textual reviews:**

In [8]:
df_main= df[['lemmitized', 'sentiment']]

In [9]:
df_main.head()

Unnamed: 0,lemmitized,sentiment
0,one reviewer mentioned watching 1 oz episode y...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there family little boy jake think t...,negative
4,petter matteis love time money visually stunni...,positive


In [10]:
#converting to text (features) and label form
#x stays a one-column DataFrame so that x['lemmitized'] works later on
x = df_main.iloc[:,0:1]
y = df_main['sentiment']



In [11]:
x.head()

Unnamed: 0,lemmitized
0,one reviewer mentioned watching 1 oz episode y...
1,wonderful little production filming technique ...
2,thought wonderful way spend time hot summer we...
3,basically there family little boy jake think t...
4,petter matteis love time money visually stunni...


In [12]:
y

0        positive
1        positive
2        positive
3        negative
4        positive
           ...   
49995    positive
49996    negative
49997    negative
49998    negative
49999    negative
Name: sentiment, Length: 49578, dtype: object

**Encoding the Labels of the Reviews:**

In [14]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)



In [15]:
y

array([1, 1, 1, ..., 0, 0, 0])

In the encoding above, 1 represents positive and 0 represents negative (`LabelEncoder` assigns the codes in sorted class order).
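To double-check which integer belongs to which label, the fitted encoder can be inspected through its `classes_` attribute. A minimal sketch on a toy label list (the four `labels` values are made up for illustration and stand in for the `sentiment` column):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the notebook's `sentiment` column (illustrative only)
labels = ["positive", "negative", "positive", "negative"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

The same `encoder.classes_` lookup should work on the encoder fitted in the cell above.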

**Splitting the Dataset into Train-Test:**

In [17]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

In [18]:
X_train.shape

(39662, 1)
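As a sanity check on the shape above, the split sizes can be reproduced without the review text. This sketch assumes only the row count of 49,578 reported earlier and splits a plain index array (`idx` is a stand-in, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 49578  # rows left after dropping duplicates, as reported above
idx = np.arange(n_rows)  # stand-in for the real rows

train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=1)
print(len(train_idx), len(test_idx))
```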

## Feature Extraction:

We will use the Bag of Words technique for feature extraction:

In [21]:
from sklearn.feature_extraction.text import CountVectorizer

#with the full vocabulary the dense array cannot be built (memory allocation error),
#so we keep only the 10000 most frequent words via max_features
cv = CountVectorizer(max_features=10000)
X_train_bow = cv.fit_transform(X_train['lemmitized']).toarray()
X_test_bow = cv.transform(X_test['lemmitized']).toarray()

In [24]:
X_train_bow.shape

(39662, 10000)

## Machine Learning Modelling:

In [26]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,confusion_matrix
#default hyperparameters; no random_state is set, so the score can vary slightly between runs
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8425776522791448

In [27]:
confusion_matrix(y_test,y_pred)

array([[4152,  763],
       [ 798, 4203]])
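The confusion matrix has true labels in rows and predicted labels in columns, so with negative = 0 and positive = 1 it reads `[[TN, FP], [FN, TP]]`. As a small worked example, precision and recall for the positive class can be derived from the numbers printed above (the matrix is copied by hand here, so nothing is re-trained):

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted
cm = np.array([[4152, 763],
               [798, 4203]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all reviews classified correctly
precision = tp / (tp + fp)        # of the reviews predicted positive, how many are positive
recall = tp / (tp + fn)           # of the truly positive reviews, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way agrees with the `accuracy_score` output, and the two error counts (763 false positives, 798 false negatives) are close, so the model does not lean strongly towards either class.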

**Using n-grams for the Feature Extraction Part:**

In [29]:
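#ngram_range=(1,3) counts unigrams, bigrams and trigrams; still capped at the 10000 most frequent features
cv = CountVectorizer(ngram_range=(1,3),max_features=10000)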

X_train_bow = cv.fit_transform(X_train['lemmitized']).toarray()
X_test_bow = cv.transform(X_test['lemmitized']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)



0.8520572811617588