# Problem statement description

*The IMDB dataset contains 50K movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.
For more information about the dataset, see
http://ai.stanford.edu/~amaas/data/sentiment/*

## ***Steps*** [ Plan of attack ]

1. Text Preprocessing
    - Removal of HTML tags (left over from web scraping of the data)
    - Removal of stopwords, which contribute nothing to the analysis (a, and, the, or, many, which, ...)
    - Removal of special characters. For our use case (distinguishing positive from negative sentiment), special characters aren't required
    
    
2. Vectorization of Data (Technique : Bag of Words)

3. Apply appropriate ML algorithm

4. Hyperparameter Tuning

5. Building deployment ready pipelines

# Importing required libraries

In [None]:
import numpy as np
import pandas as pd

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

In [None]:
df.head()

In [None]:
# One review
df['review'][0]

# Text Cleaning 

1. Remove HTML tags
2. Convert everything to lowercase
3. Remove special characters
4. Remove stopwords
5. Stemming ( playing, played, plays ------> play )
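
The first four cleaning steps can be tried on a single string before they are applied to the whole DataFrame. This is a minimal sketch using only the standard library: `TOY_STOPWORDS` and `clean_review` are hypothetical names, the stopword set is a tiny stand-in for NLTK's English list, and stemming is left out so the example has no NLTK dependency.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's English list).
TOY_STOPWORDS = {"a", "an", "and", "is", "the", "by", "of"}

def clean_review(text):
    """Return the cleaned tokens of one review (no stemming in this sketch)."""
    text = re.sub(r"<.*?>", "", text)                            # drop HTML tags
    text = text.lower()                                          # lowercase
    text = "".join(ch if ch.isalnum() else " " for ch in text)   # special chars -> spaces
    return [tok for tok in text.split() if tok not in TOY_STOPWORDS]  # drop stopwords

sample = "<p>Swades is an excellent, mindblowing movie!<br /></p>"
print(clean_review(sample))
```

Keeping the steps in one small function makes it easy to check the effect of each line on a known input before running it over 50K reviews.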

In [None]:
df.shape

In [None]:
df.info()

Neither column has missing values.

## Label encoding the sentiment column

In [None]:
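# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})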

In [None]:
df.head()

## Removing HTML tags using RegEx

In [None]:
# Testing the RegEx
clean = re.compile('<.*?>')
re.sub(clean,'',df.iloc[2].review)

In [None]:
# Function to clean HTML tags
def clean_html(text):
    clean = re.compile('<.*?>')
    return re.sub(clean,'',text)

In [None]:
# Removing HTML tags from reviews column
df['review'] = df['review'].apply(clean_html)
df.head()

## Converting all the reviews to lower case

In [None]:
def convert_lower(text):
    return text.lower()

In [None]:
df['review'] = df['review'].apply(convert_lower)
df.head()

## Removing Special characters

In [None]:
def remove_special(text):
    # Keep alphanumeric characters; replace everything else with a space
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

In [None]:
df['review'] = df['review'].apply(remove_special)
df.head()

## Removing the stopwords

In [None]:
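# Build the stopword set once; calling stopwords.words('english') for every
# word is very slow, and set lookup is faster than list lookup
STOP_WORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    # Returns a list of tokens (not a string); the stemming step expects a list
    return [word for word in text.split() if word not in STOP_WORDS]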

In [None]:
df['review'] = df['review'].apply(remove_stopwords)
df.head()

## Stemming

In [None]:
ps = PorterStemmer()

In [None]:
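def stem_words(tokens):
    # Build the result locally: a module-level `y` buffer would break once
    # `y` is rebound to the label array in the vectorization section
    return [ps.stem(token) for token in tokens]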

In [None]:
df['review'] = df['review'].apply(stem_words)

In [None]:
# Join back
def join_back(list_input):
    return " ".join(list_input)

In [None]:
df['review'] = df['review'].apply(join_back)
df.head()

# Vectorization [ Bag of Words ]
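
As a conceptual sketch of what bag of words produces (not `CountVectorizer` itself), the matrix can be built by hand with the standard library. The two documents below are made-up examples of already-stemmed reviews.

```python
from collections import Counter

# Made-up, already preprocessed documents.
docs = ["good movi good act", "bad movi"]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row per document, one column per vocabulary term, cell = term count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
print(matrix)
```

`CountVectorizer` does the same counting, but by default it also drops single-character tokens, and with `max_features=5000` it keeps only the 5000 most frequent terms across the corpus.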

In [None]:
cv = CountVectorizer(max_features=5000)

In [None]:
X = cv.fit_transform(df['review']).toarray()

In [None]:
X.shape

In [None]:
y = df.iloc[:,-1].values
y.shape

# ML Algorithm

In [None]:
clf1 = GaussianNB()
clf2 = MultinomialNB()
clf3 = BernoulliNB()

In [None]:
print("GaussianNB accuracy : ", cross_val_score(clf1,X,y,cv=10,scoring='accuracy').mean()*100 , ' %')
print("MultinomialNB accuracy : ", cross_val_score(clf2,X,y,cv=10,scoring='accuracy').mean()*100 , ' %')
print("BernoulliNB accuracy : ", cross_val_score(clf3,X,y,cv=10,scoring='accuracy').mean()*100 , ' %')

***The cross-validated results show that Bernoulli Naive Bayes outperforms the other two Naive Bayes variants, with a mean accuracy of 84.73%***
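
Step 4 of the plan, hyperparameter tuning, could be sketched with `GridSearchCV` over the smoothing parameter `alpha` of `BernoulliNB`. This is a hedged illustration: `X_toy` and `y_toy` are synthetic stand-ins for the real bag-of-words matrix and labels, and the grid values are arbitrary choices, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the bag-of-words matrix: 200 "reviews", 20 "terms".
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 3, size=(200, 20))
# The label depends on whether the first term occurs, so there is signal to learn.
y_toy = (X_toy[:, 0] > 0).astype(int)

# Arbitrary illustrative grid for the additive smoothing parameter.
param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0]}
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

On the real data, the same call would take `X` and `y` in place of the synthetic arrays, at the cost of one 5-fold cross-validation per grid value.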

In [None]:
clf3.fit(X,y)

In [None]:
trnf1 = FunctionTransformer(func = clean_html)
# Eg:
trnf1.transform("<p> Swades is an excellent, mindblowing movie played by Shah Rukh Khan </p>")

In [None]:
trnf2 = FunctionTransformer(func = convert_lower)
# Eg: 
trnf2.transform(' Swades is an excellent, mindblowing movie played by Shah Rukh Khan ')

In [None]:
trnf3 = FunctionTransformer(func = remove_special)
# Eg: 
trnf3.transform(' swades is an excellent, mindblowing movie played by shah rukh khan ')

In [None]:
trnf4 = FunctionTransformer(func = remove_stopwords)
# Eg: 
trnf4.transform(' swades is an excellent  mindblowing movie played by shah rukh khan ')

In [None]:
trnf5 = FunctionTransformer(func = stem_words)
# Eg: 
trnf5.transform(['swades',
 'excellent',
 'mindblowing',
 'movie',
 'played',
 'shah',
 'rukh',
 'khan'])

In [None]:
trnf6 = FunctionTransformer(func = join_back)
# Eg: 
trnf6.transform(['swade', 'excel', 'mindblow', 'movi', 'play', 'shah', 'rukh', 'khan'])

# ***Building readily deployable Pipeline***

In [None]:
pipe = Pipeline([
    ('trnf1',trnf1),
    ('trnf2',trnf2),
    ('trnf3',trnf3),
    ('trnf4',trnf4),
    ('trnf5',trnf5),
    ('trnf6',trnf6)
])

In [None]:
review1 = 'Bhool Bhoolaiya, Phir Hera pheri, De Dana Dan and Bhaagam Bhaag are a few of the good comedy movies played by Akshay Kumar as a lead actor'
review2 = 'Golmaal 2 is the worst movie in the entire franchise'

In [None]:
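def sentiment_analyzer(text):
    # Preprocess the raw review, then wrap it in a list because
    # CountVectorizer.transform expects an iterable of documents
    cleaned = [pipe.transform(text)]
    prediction = clf3.predict(cv.transform(cleaned))[0]

    if prediction == 0:
        return 'the review is negative'
    else: 
        return 'the review is positive'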

In [None]:
sentiment_analyzer(review1)

In [None]:
sentiment_analyzer(review2)

## ***We have now built a pipeline that can be readily deployed to a testing environment: a raw review is passed through the chain of function transformers that preprocess it, and the Bernoulli Naive Bayes model then classifies the preprocessed text as either positive or negative***
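
One possible variation, not the approach used in the cells above, is to put the vectorizer and the classifier inside the same `Pipeline` so that a single object goes from raw text to a prediction. The sketch below makes several assumptions: `basic_clean` and `text_clf` are hypothetical names, the four training sentences are made up, and stopword removal and stemming are omitted to keep the example free of NLTK.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

def basic_clean(text):
    """Strip HTML tags, lowercase, and replace non-alphanumerics with spaces."""
    text = re.sub(r"<.*?>", "", text).lower()
    return "".join(ch if ch.isalnum() else " " for ch in text)

# Made-up toy training data: 1 = positive, 0 = negative.
train_texts = [
    "An excellent, wonderful movie<br />",
    "Wonderful acting and an excellent story",
    "The worst movie, truly terrible",
    "Terrible plot and the worst acting",
]
train_labels = [1, 1, 0, 0]

# Cleaning runs inside the vectorizer, so fit/predict accept raw strings.
text_clf = Pipeline([
    ("bow", CountVectorizer(preprocessor=basic_clean, max_features=5000)),
    ("nb", BernoulliNB()),
])
text_clf.fit(train_texts, train_labels)

print(text_clf.predict(["an excellent film", "the worst film"]))
```

With this layout only `text_clf` needs to be saved and loaded for deployment. If it is pickled, the module that defines `basic_clean` has to be importable wherever the pipeline is loaded.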