# IMDB Dataset of 50K Movie Reviews

## Contents:

1. Background
2. Import Data and Data Cleansing
3. Create SID Object using Sentiment Intensity Analyzer
4. Add scores and labels to the dataframe
5. Testing Sentiment Analyzer on other movie reviews

## 1. Background

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.<br>
It is a binary sentiment classification dataset with substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The goal is to predict the number of positive and negative reviews using either classification or deep learning algorithms.<br>
For more information about the dataset, see:<br>
http://ai.stanford.edu/~amaas/data/sentiment/

## 2. Import Data and Data Cleansing

In [1]:
import nltk
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon')  # run once if the VADER lexicon is not installed yet

In [2]:
df = pd.read_csv('IMDB Dataset.csv')
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


Based on the dataset, we will clean the data as follows:
- Remove NaN values
- Remove empty strings

In [3]:
# Remove NaN values
df.isnull().sum()  # missing values per column (only displayed when it is the last line of a cell)
df.dropna(inplace=True)

In [4]:
# Remove empty strings
blanks = []  # start with an empty list

for i, rv, lb in df.itertuples():  # each row unpacks as (index, review, sentiment)
    if isinstance(rv, str):        # skip non-string values such as NaN
        if rv.isspace():           # test 'review' for whitespace-only text
            blanks.append(i)       # add matching index numbers to the list
print(len(blanks), 'blanks: ', blanks)

df.drop(blanks, inplace=True)

0 blanks:  []
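
The same blank check can be written without an explicit loop. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`) in place of the IMDB data; the names `toy`, `is_blank`, `blank_idx`, and `cleaned` are illustrative and do not appear in the notebook. Unlike `str.isspace()`, which returns `False` for an empty string, stripping and comparing with `''` also catches reviews that are completely empty.

```python
import pandas as pd

# Hypothetical toy frame standing in for the IMDB DataFrame.
toy = pd.DataFrame({
    'review': ['Great film', '   ', '', 'Terrible plot'],
    'sentiment': ['positive', 'negative', 'positive', 'negative'],
})

# True where the review is empty or contains only whitespace.
is_blank = toy['review'].str.strip().eq('')

# Index labels of the blank rows, mirroring the `blanks` list in the loop.
blank_idx = toy[is_blank].index.tolist()
print(len(blank_idx), 'blanks: ', blank_idx)

# Drop the blank rows; the original frame is left untouched here.
cleaned = toy.drop(blank_idx)
print(len(cleaned), 'rows remain')
```

On the real data the same expression would be applied to `df['review']` after the NaN rows have been dropped.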


In [5]:
# Number of distinct reviews (fewer than 50,000, so the dataset contains duplicates)
df['review'].nunique()

49582

## 3. Create SID Object using Sentiment Intensity Analyzer

In [6]:
sid = SentimentIntensityAnalyzer()

## 4. Add scores and labels to the dataframe

In [7]:
# Compute the VADER polarity scores (neg, neu, pos, compound) for each review
df['scores'] = df['review'].apply(sid.polarity_scores)

In [8]:
# Split the score dictionary into one column per component
df['positive'] = df['scores'].apply(lambda score_dict: score_dict['pos'])
df['negative'] = df['scores'].apply(lambda score_dict: score_dict['neg'])
df['neutral'] = df['scores'].apply(lambda score_dict: score_dict['neu'])
df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
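
The four `.apply` calls above can also be collapsed into a single expansion of the dictionary column. This is a sketch under stated assumptions: the `toy` frame and its score values are made up for illustration (they are not VADER output for any real review), and `expanded` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical score dictionaries shaped like sid.polarity_scores() output.
toy = pd.DataFrame({'scores': [
    {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.6},
    {'neg': 0.4, 'neu': 0.6, 'pos': 0.0, 'compound': -0.5},
]})

# Turn the list of dicts into one column per key, keeping the original index.
expanded = pd.DataFrame(toy['scores'].tolist(), index=toy.index)

# Use the same column names as the notebook.
expanded = expanded.rename(
    columns={'pos': 'positive', 'neg': 'negative', 'neu': 'neutral'}
)

# Attach the new columns to the original frame.
toy = toy.join(expanded)
print(toy.columns.tolist())
```

The result carries the same four columns as the cell above; only the column order may differ.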

In [9]:
# Label each review from its compound score (a score of exactly 0 counts as 'pos')
df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >= 0 else 'neg')
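
The `comp_score` labels can be compared with the dataset's `sentiment` column to estimate how well the analyzer performs. The following is a minimal, dependency-free sketch: the two lists are hypothetical stand-ins for `df['sentiment']` and `df['comp_score']`, and the helper names `label_map`, `accuracy`, and `confusion_counts` are illustrative. The numbers it prints describe the toy lists only, not the IMDB data.

```python
# Hypothetical stand-ins for df['sentiment'] and df['comp_score'];
# these values are illustrative and do not come from the dataset.
true_labels = ['positive', 'negative', 'positive', 'negative', 'positive']
pred_short = ['pos', 'pos', 'pos', 'neg', 'neg']

# comp_score uses 'pos'/'neg' while sentiment uses 'positive'/'negative',
# so the short labels are mapped before comparing.
label_map = {'pos': 'positive', 'neg': 'negative'}
pred_labels = [label_map[p] for p in pred_short]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def confusion_counts(y_true, y_pred, positive='positive'):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


acc = accuracy(true_labels, pred_labels)
print('accuracy:', acc)
print('tp, fp, fn, tn:', confusion_counts(true_labels, pred_labels))
```

On the full DataFrame, the same comparison could be made with the scikit-learn functions imported in the first cell, for example `accuracy_score(df['sentiment'], df['comp_score'].map(label_map))`.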

## 5. Testing Sentiment Analyzer on other movie reviews

In [61]:
# Classification function: map a compound score to a sentiment label
def classifier_func(a):
    if a >= 0:
        return 'positive'
    else:
        return 'negative'

In [65]:
movie_review = input(r'Please enter your movie review: ')

Please enter your movie review: It is probably one of the worst movies i have seen in a long time


In [66]:
x = sid.polarity_scores(movie_review).get('compound')
print(classifier_func(x), '\ncompound score: {}'.format(x))

negative 
compound score: -0.6249
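
Because the classifier treats a compound score of exactly 0 as positive, text with no sentiment-bearing words is labelled positive. A possible variant, assuming the ±0.05 cut-offs suggested in the VADER documentation, adds a neutral band. The function name `classify_with_neutral` and its `threshold` parameter are hypothetical and not part of the original notebook.

```python
def classify_with_neutral(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'.

    Scores at or above `threshold` are positive, scores at or below
    `-threshold` are negative, and anything in between is neutral.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Compound score reported for the review entered above.
print(classify_with_neutral(-0.6249))

# A score of 0 falls into the neutral band instead of being called positive.
print(classify_with_neutral(0.0))
```

Note that the IMDB labels are strictly binary, so a neutral class is useful for free-form input like the example above but would need to be folded back into two classes before comparing against the dataset's `sentiment` column.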
