# IMDB MOVIE REVIEWS
#### dataset resources: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
#### The dataset has 50k entries in two columns: the review text and its sentiment (positive or negative)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#creating a dataframe object from our train csv file
movies_df=pd.read_csv("../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
#keep only the first 5000 rows
movies_df = movies_df.iloc[:5000,:]

In [None]:
movies_df.info()
#the second column holds the string 'positive' or 'negative' for each review

In [None]:
movies_df.describe()

In [None]:
movies_df.head()

## Exploring our dataset

In [None]:
print(movies_df.isnull().sum())
#there are no null entries or NaN
#changing column names
movies_df.columns=['review','label']

In [None]:
sns.heatmap(movies_df.isnull(),yticklabels=False,cbar=False,cmap='Blues')
#the heatmap is one uniform colour, so there are no null values

In [None]:
movies_df['label'].hist(figsize=(13,5),color='g')
#positive reviews are around 2500 and negative reviews are around 2500

In [None]:
sns.countplot(x='label',data=movies_df)
#a count plot shows the balance between the two classes more directly than the histogram

In [None]:
positive=movies_df[movies_df['label']=='positive']
negative=movies_df[movies_df['label']=='negative']
#split the reviews into a positive subset and a negative subset

## Plotting the word cloud

A word cloud collects the words used in our dataset and presents them in pictorial form: the more often a word occurs, the larger it is drawn.
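
To make the idea concrete, here is a minimal sketch of the word counting that a word cloud is built on, using only the standard library. The sample reviews and the `word_frequencies` helper are made up for illustration; they are not part of the dataset or of the `wordcloud` package.

```python
from collections import Counter

def word_frequencies(texts):
    """Count how often each lowercase, whitespace-separated word occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

# Hypothetical mini-corpus, for illustration only
sample_reviews = ["A great great movie", "Not a great plot"]
frequencies = word_frequencies(sample_reviews)

# The most frequent words are the ones a word cloud would draw largest
print(frequencies.most_common(2))  # [('great', 3), ('a', 2)]
```

Note that a filler word like "a" ranks near the top, which is why the stop words are removed later in this notebook.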

In [None]:
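#grab the review column and convert it into one massive string
sentences=movies_df['review'].tolist()
#join with a space so the last word of one review does not merge with the first word of the next
sentences=' '.join(sentences)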

In [None]:
from wordcloud import WordCloud
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(sentences))

In [None]:
#let's see the positive movie reviews' words
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(positive['review'].tolist())))

In [None]:
#similarly, for negative movie reviews
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate(' '.join(negative['review'].tolist())))

## DATA CLEANING

For efficient data analysis, we keep only the words that add value to our predictions, so unnecessary punctuation marks and stop words are removed.
Stop words are very common words such as 'I', 'we', 'they' and 'and'.
Moreover, lemmatization and stemming reduce a word to its root form: lemmatization maps it to a dictionary word (its lemma), while stemming chops off suffixes. This way, different forms of the same word are counted as a single token.
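
As a rough illustration of the first two steps, the sketch below strips punctuation and drops stop words using only the standard library. It is a simplified stand-in, not the cleaning function used in this notebook: the `simple_clean` helper and the tiny `SAMPLE_STOP_WORDS` set are assumptions made for the example, and lemmatization is left out.

```python
import string

# Tiny hypothetical stop-word set; NLTK's English list is much longer
SAMPLE_STOP_WORDS = {"i", "we", "they", "and", "the", "was", "a"}

def simple_clean(text, stop_words=SAMPLE_STOP_WORDS):
    """Strip punctuation, lowercase, and drop stop words; returns a list of words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word not in stop_words]

# Stripping punctuation turns the HTML tag "<br />" into the stray token "br"
tokens = simple_clean("The plot was great! <br /><br /> We loved it.")
print(tokens)  # ['plot', 'great', 'br', 'br', 'loved', 'it']
```

The leftover "br" tokens come from the HTML line breaks in the raw reviews, which is why 'br' is filtered out as a redundant word in the cleaning function of this notebook.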

In [None]:
import string #for punctuation
import nltk #natural language toolkit
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

#uncomment if the NLTK corpora are not installed yet
#nltk.download('stopwords'); nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words=set(stopwords.words('english')) #built once, so the list is not reloaded for every word

print(string.punctuation)
print('-'*120)
print(stopwords.words('english'))
redundant_words=['br'] #left over from "<br />" tags once punctuation is stripped
def text_cleaning(sentence): #sentence is the full text of one review
    #punctuation is removed character by character
    sentence_punc_removed=''.join(letter for letter in sentence if letter not in string.punctuation)
    #stopwords are filtered out
    sentence_clean=[word for word in sentence_punc_removed.split() if word.lower() not in stop_words]
    #redundant words are filtered out
    sentence_clean=[word for word in sentence_clean if word.lower() not in redundant_words]
    #lemmatization is done here
    final_sentence = [lemmatizer.lemmatize(word.lower()) for word in sentence_clean]
    return final_sentence #list of cleaned words

In [None]:
movies_df_clean=movies_df['review'].apply(text_cleaning) #Series of word lists, one list per review
# print(len(movies_df_clean))
#join each cleaned word list back into a single string
listOfSentences=[' '.join(sentence) for sentence in movies_df_clean]
#word cloud of the cleaned reviews
from wordcloud import WordCloud
plt.figure(figsize=(20,20))
plt.imshow(WordCloud().generate('\n'.join(listOfSentences)))

## FEATURE EXTRACTION

### TOKENIZATION / COUNT VECTORIZER

##### tokenization splits each text into individual words (tokens); count vectorization then converts those tokens into a numeric form
##### our count vectorizer picks up the unique words in our text, counts how often each word occurs in each row, and builds a 2D matrix with one row per review and one column per unique word.
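
The idea can be sketched in a few lines of plain Python. This is only an illustration of the counting scheme, not how scikit-learn implements it; the `count_vectorize` helper and the two sample documents are made up for the example.

```python
def count_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts vocabulary[j] in documents[i]."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of unique words over all documents
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in tokenized]
    for i, tokens in enumerate(tokenized):
        for word in tokens:
            matrix[i][index[word]] += 1
    return vocabulary, matrix

# Hypothetical documents, for illustration only
docs = ["good movie good cast", "bad movie"]
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)  # ['bad', 'cast', 'good', 'movie']
for row in matrix:
    print(row)     # [0, 1, 2, 1] then [1, 0, 0, 1]
```

Each row has as many columns as there are unique words, so for thousands of reviews most entries are zero. That is why scikit-learn returns the counts as a sparse matrix.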

In [None]:

from sklearn.feature_extraction.text import CountVectorizer

#passing text_cleaning as the analyzer makes CountVectorizer clean each review before counting its words

#note: despite its name, movies_vectorizer holds the resulting sparse count matrix, not the vectorizer itself
movies_vectorizer=CountVectorizer(analyzer=text_cleaning,dtype='uint8').fit_transform(movies_df['review'])

In [None]:
print(type(movies_vectorizer)) # a SciPy sparse matrix
X=movies_vectorizer.toarray() # converted to a dense 2D NumPy array
print(X.shape)

In [None]:
y=movies_df['label']

## MAKING OUR ML MODEL USING NAIVE BAYES AND LINEAR SVM (SVC) CLASSIFIERS

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2, random_state = 42) #setting up train and test datasets
X_test.shape

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

NB_classifier=MultinomialNB()
svc = LinearSVC()
NB_classifier.fit(X_train,y_train) #training our model using training dataset
svc.fit(X_train, y_train)

## Assessing Performance and making Report

We are going to use a confusion matrix, which tells us how **OFTEN** our predictions match the true class

#### It lists both false positives and false negatives
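
As a small worked example, the four cells of a two-class confusion matrix and the scores derived from them can be computed by hand. The `confusion_counts` helper and the label lists below are made up for illustration; the notebook itself relies on scikit-learn for this.

```python
def confusion_counts(y_true, y_pred, positive_label="positive"):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive_label:
            if truth == positive_label:
                tp += 1  # predicted positive, really positive
            else:
                fp += 1  # predicted positive, really negative
        else:
            if truth == positive_label:
                fn += 1  # predicted negative, really positive
            else:
                tn += 1  # predicted negative, really negative
    return tp, fp, fn, tn

# Hypothetical labels, for illustration only
y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "positive", "positive"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all predictions that are right
precision = tp / (tp + fp)           # share of predicted positives that are right
recall = tp / (tp + fn)              # share of true positives that were found
print(tp, fp, fn, tn)                # 2 1 1 1
print(accuracy)                      # 0.6
```

scikit-learn's `confusion_matrix` puts the true labels on the rows and the predicted labels on the columns, with labels in sorted order. With the labels 'negative' and 'positive', the top-left cell is therefore the true negatives and the bottom-right cell the true positives.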

In [None]:
from sklearn.metrics import confusion_matrix , classification_report

y_test_predictions=NB_classifier.predict(X_test)
y_test_pred_svm = svc.predict(X_test)

In [None]:
cm=confusion_matrix(y_test,y_test_predictions)
cm2=confusion_matrix(y_test,y_test_pred_svm)

fig, axis = plt.subplots(1, 2, figsize=(15, 5))

#fmt='d' prints the counts as plain integers instead of scientific notation
sns.heatmap(ax = axis[0],data= cm,annot=True,fmt='d')
axis[0].set_title("With Naive Bayes")

sns.heatmap(ax = axis[1],data=cm2,annot=True,fmt='d')
axis[1].set_title("With Linear SVC")

In [None]:
print(classification_report(y_test,y_test_predictions))
print(classification_report(y_test,y_test_pred_svm))

# AND WE ARE DONE :)
