# Sentiment Analysis on the IMDB Dataset of 50K Movie Reviews

<b>Introduction</b>

This notebook uses the IMDB dataset of 50,000 movie reviews, a common benchmark for natural language processing and text analytics.
It is a dataset for binary sentiment classification that contains substantially more data than earlier benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing.
For more information about the dataset, see http://ai.stanford.edu/~amaas/data/sentiment/

This notebook focuses on the following steps:

- Clean and pre-process text data.

- Tokenize and stem the text with NLTK, then extract TF-IDF features with scikit-learn.

- Build and employ a logistic regression classifier using scikit-learn.

- Tune model hyperparameters and evaluate model accuracy.

<b>Loading the dataset</b>

In [2]:
import pandas as pd

# load the reviews; the CSV has two columns, 'review' and 'sentiment'
df = pd.read_csv('IMDB Dataset.csv')
df.head(10)

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
5  Probably my all-time favorite movie, a story o...  positive
6  I sure would like to see a resurrection of a u...  positive
7  This show was an amazing, fresh & innovative i...  negative
8  Encouraged by the positive comments about this...  negative
9  If you like original gut wrenching laughter yo...  positive


In [3]:
df['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [4]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [5]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

<b>Data Preparation</b>

In [14]:
# transform categorical labels to numeric values
# (LabelEncoder sorts the classes, so negative -> 0 and positive -> 1)
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])
df

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  1
1  A wonderful little production. <br /><br />The...  1
2  I thought this was a wonderful way to spend ti...  1
3  Basically there's a family where a little boy ...  0
4  Petter Mattei's "Love in the Time of Money" is...  1
5  Probably my all-time favorite movie, a story o...  1
6  I sure would like to see a resurrection of a u...  1
7  This show was an amazing, fresh & innovative i...  0
8  Encouraged by the positive comments about this...  0
9  If you like original gut wrenching laughter yo...  1


In [15]:
import re

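def preprocessor(text):
    '''
    Remove HTML tags, lowercase the text, replace runs of non-word
    characters with a space, and append any emoticons at the end.
    '''
    text = re.sub(r'<[^>]*>', '', text)
    # emoticons: eyes (: ; =), optional nose (-), mouth ( ) or ( or D or P )
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    words = re.sub(r'[\W]+', ' ', text.lower()).strip()
    return (words + ' ' + ' '.join(emoticons).replace('-', '')).strip()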
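
To see what this cleaning step does, the sketch below applies a preprocessor of the same shape to a few made-up strings. It restates the function so that it runs on its own; the sample inputs are hypothetical, and the emoticon pattern assumes that the intended mouth characters are `)`, `(`, `D`, and `P`.

```python
import re

def preprocessor(text):
    """Strip HTML tags, lowercase, drop punctuation, keep emoticons at the end."""
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is removed.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    words = re.sub(r'[\W]+', ' ', text.lower()).strip()
    return (words + ' ' + ' '.join(emoticons).replace('-', '')).strip()

# Hypothetical sample inputs (not taken from the dataset).
print(preprocessor("</a>This :) is :( a test :-)!"))  # this is a test :) :( :)
print(preprocessor("I loved it =)"))                  # i loved it =)
print(preprocessor("<b>Bad</b> movie..."))            # bad movie
```

Emoticons are kept because they can carry sentiment, while the nose character `-` is dropped so that `:-)` and `:)` map to the same token.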

In [16]:
df['review'] = df['review'].apply(preprocessor)

In [17]:
df['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically there s a family where a little boy ...
4        petter mattei s love in the time of money is a...
5        probably my all time favorite movie a story of...
6        i sure would like to see a resurrection of a u...
7        this show was an amazing fresh innovative idea...
8        encouraged by the positive comments about this...
9        if you like original gut wrenching laughter yo...
10       phil the alien is one of those quirky films wh...
11       i saw this movie when i was about 12 when it c...
12       so im not a big fan of boll s work but then ag...
13       the cast played shakespeare shakespeare lost i...
14       this a fantastic movie of three prisoners who ...
15       kind of drawn in by the erotic scenes only to ...
16       some films just simply should not be remade th.

<b>Tokenization</b>

In [19]:
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]


<b> Transform Text Data into TF-IDF Vectors </b>

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(strip_accents = None,
                       lowercase = False,
                       preprocessor = None,
                       tokenizer = tokenizer_porter,
                       use_idf = True,
                       norm = 'l2',
                       smooth_idf = True)
y = df['sentiment'].values
x = tfidf.fit_transform(df.review)
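
The settings `use_idf = True`, `smooth_idf = True`, and `norm = 'l2'` correspond to the weighting that scikit-learn documents for `TfidfVectorizer`: raw term counts are multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document vector is then scaled to unit Euclidean length. A minimal pure-Python sketch of that arithmetic follows. It assumes a hypothetical three-document corpus and plain whitespace tokenization without stemming, and it is meant to illustrate the formula, not to replace the vectorizer.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the IMDB data).
docs = ["good movie", "bad movie", "good good plot"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Raw count times idf for each vocabulary term, then L2-normalized."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(v, 3) for v in vec])
```

Terms that occur in fewer documents (here `bad` and `plot`) receive a larger idf than terms shared across documents (`good`, `movie`), which is why TF-IDF down-weights words that appear in most reviews.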

<b> Document Classification using Logistic Regression </b>

In [21]:
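# training and validation split

from sklearn.model_selection import train_test_split

# shuffle = False keeps the original row order: the first half of the rows is
# used for training and the second half for testing (random_state is ignored)
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 1, test_size = 0.5, shuffle = False)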
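
With `shuffle = False`, `train_test_split` does not permute the rows: the leading block becomes the training set and the trailing block the test set. The sketch below mimics that behavior in pure Python, assuming a float `test_size` and no `train_size`. The helper name `ordered_split` is hypothetical and is not part of scikit-learn.

```python
import math

def ordered_split(rows, test_size=0.5):
    """Split rows without shuffling: leading rows train, trailing rows test."""
    n_test = math.ceil(test_size * len(rows))  # the test share is rounded up
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

train, test = ordered_split(list(range(10)), test_size=0.5)
print(train)  # [0, 1, 2, 3, 4]
print(test)   # [5, 6, 7, 8, 9]
```

Because nothing is shuffled, the class balance of each half depends on how the rows are ordered in the CSV, so it is worth checking the label counts of `y_train` and `y_test` after splitting.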

In [22]:
import pickle
from sklearn.linear_model import LogisticRegressionCV

# 5-fold cross-validation selects the regularization strength C
clf = LogisticRegressionCV(cv = 5,
                          scoring = 'accuracy',
                          random_state = 0,
                          n_jobs = -1,
                          verbose = 3,
                          max_iter = 300).fit(x_train,y_train)

# save the fitted model; the with block closes the file even if dump fails
with open('saved_model.sav','wb') as saved_model:
    pickle.dump(clf,saved_model)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:  3.1min remaining:  4.6min
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  4.0min finished


In [27]:
file_name = 'saved_model.sav'
with open(file_name, 'rb') as f:
    saved_clf = pickle.load(f)
saved_clf.score(x_test,y_test)



0.8946
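
Since the model was built with `scoring = 'accuracy'`, `score` reports the fraction of test reviews whose predicted label matches the true label; 0.8946 on the 25,000 held-out reviews corresponds to 22,365 correct predictions. The sketch below computes accuracy by hand and also the four confusion counts, which show whether errors fall mostly on positive or on negative reviews. The labels are hypothetical, and the helper names are illustrative, not scikit-learn functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

# Hypothetical labels: 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy(y_true, y_pred))          # 0.75
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```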