# Binary classification of movie reviews

The objective of this notebook is to identify positive movie reviews. The reviews were obtained from [rottentomatoes.com](https://www.rottentomatoes.com/), and the code used to scrape them can be found in my [scrapers repository](https://github.com/varun-jois/scrapers). For this analysis, 60 movies from various time periods were chosen across four genres (Action, Horror, Romance and Sports), 15 movies from each genre. The list of movies can be found in the file "movies_list.xlsx".

## Loading the reviews and building the input for the neural network 

The reviews have already been obtained and can be found in the pickle "movie_reviews.pickle". First we shall load the data. 

In [1]:
import pickle
from NeuralNet import NeuralNet
import numpy as np
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn import feature_extraction

with open('movie_reviews.pickle', 'rb') as f:
    movie_reviews = pickle.load(f)

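movie_reviews is a list of dicts, one per movie. As used in the cells below, each dict has a 'reviews' key holding a list of review dicts, and each review dict holds the 'review' text and its numeric 'rating'.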
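
For orientation, the sketch below mirrors the layout that the later cells rely on. The sample values and the `title` key are hypothetical; only the `reviews`, `review` and `rating` keys are taken from the code in this notebook.

```python
# Hypothetical sample mirroring the assumed layout of movie_reviews.
sample_movie_reviews = [
    {
        'title': 'Example Movie',  # hypothetical key, not used by later cells
        'reviews': [
            {'review': 'A great film with a moving story.', 'rating': 4.5},
            {'review': 'Dull and far too long.', 'rating': 2.0},
        ],
    },
]

# Flatten to parallel lists of texts and ratings, as the later cells do.
texts = [r['review'] for m in sample_movie_reviews for r in m['reviews']]
scores = [r['rating'] for m in sample_movie_reviews for r in m['reviews']]
print(len(texts), scores)
```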

#### Keywords that will constitute our input units

Next we extract the frequently used words from the full set of reviews; these words will serve as the input units of our network.

In [2]:
# Get all the tokens
all_tokens = []
for m in movie_reviews:
    for r in m['reviews']:
        all_tokens.extend(word_tokenize(r['review']))

In [3]:
# Keeping only purely alphabetic tokens and converting them to lowercase
all_tokens = [t.lower() for t in all_tokens if t.isalpha()]

In [4]:
# Obtaining the stopwords and removing them from all_tokens
stop = set(stopwords.words('english'))
all_tokens = [t for t in all_tokens if t not in stop]
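
To make the effect of the two filtering steps concrete, here is a small sketch on a hand-written token list. The tokens and the tiny stop-word set are made up for illustration; the notebook itself uses `word_tokenize` and NLTK's English stop-word list.

```python
# Hand-written tokens standing in for tokenizer output (hypothetical).
tokens = ['The', 'movie', 'was', "n't", 'great', ',', '10/10', 'Acting', '!']

# Step 1: keep purely alphabetic tokens, lowercased.
cleaned = [t.lower() for t in tokens if t.isalpha()]

# Step 2: drop stop words (a tiny illustrative set, not NLTK's list).
toy_stop = {'the', 'was'}
cleaned = [t for t in cleaned if t not in toy_stop]

print(cleaned)  # ['movie', 'great', 'acting']
```

Note that filtering on `isalpha()` also discards negation fragments such as `n't`, which may matter for sentiment.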

In [5]:
# Creating our lemmatizer to reduce words to their lemmas (lemmatize() defaults to pos='n', so only noun forms are reduced)
lemmatizer = WordNetLemmatizer()
all_tokens = [lemmatizer.lemmatize(t) for t in all_tokens]

In [6]:
# Creating our bag-of-words
bag = Counter(all_tokens)
len(bag)

27891

In [7]:
# Let's see the 10 most common words
print(bag.most_common(10))

[('movie', 16756), ('film', 13559), ('one', 7440), ('great', 5418), ('good', 4875), ('time', 4687), ('story', 4399), ('like', 3944), ('best', 3855), ('horror', 3563)]


For our model, we shall take all the words that had at least 10 occurrences.

In [8]:
kw = [k for k, v in bag.items() if v >= 10]
len(kw)

5500

Since there are 5500 words that had at least 10 occurrences, our neural network will have 5500 input units.
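
The frequency cutoff can be seen on a toy example. The token list below is hypothetical, and the threshold is lowered from 10 to 2 so that the effect is visible on so few tokens.

```python
from collections import Counter

# Toy token stream (hypothetical).
toy_tokens = ['movie', 'film', 'movie', 'great', 'movie', 'film', 'plot']
toy_bag = Counter(toy_tokens)

min_count = 2  # the notebook uses 10
toy_kw = [word for word, count in toy_bag.items() if count >= min_count]

print(toy_kw)       # ['movie', 'film']
print(len(toy_kw))  # 2
```

Each surviving keyword becomes one input unit, so the length of the keyword list fixes the size of the input layer.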

#### Converting our reviews to matrices

On Rotten Tomatoes, reviewers rate movies on a 5-point scale. We treat every review with a rating of 4 or higher as a positive review.

In [11]:
reviews = [r['review'] for m in movie_reviews for r in m['reviews']]
ratings = [1 if r['rating'] >= 4 else 0 for m in movie_reviews for r in m['reviews']]

In [13]:
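# Vectorizing the data: one binary feature per keyword (1 if the keyword occurs in the review).
# The vocabulary is fixed, so fitting learns nothing new. CountVectorizer lowercases the text
# but does not lemmatize it, so only exact matches of a keyword are counted.
cv = feature_extraction.text.CountVectorizer(vocabulary=kw, binary=True)
# Transposing so that rows are keywords and columns are reviews
reviews_array = cv.fit_transform(reviews).toarray().T
# Labels as a row vector, one column per review
ratings_array = np.array(ratings).reshape(1, -1)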
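
The following is a plain-Python stand-in for the binary vectorization, meant only to show the resulting layout. It is not `CountVectorizer` itself; the keywords and reviews are hypothetical, and the tokenization (lowercasing, word tokens of two or more characters) is an approximation of the vectorizer's default behaviour.

```python
import re

toy_kw = ['movie', 'great', 'boring']        # hypothetical keywords
toy_reviews = ['Great movie, great cast!',    # hypothetical reviews
               'A boring movie.',
               'Loved the movies.']

def binary_features(text, vocabulary):
    """Return one 0/1 entry per keyword: 1 if the keyword occurs in text."""
    # Lowercase and keep word tokens of two or more characters.
    present = set(re.findall(r'\b\w\w+\b', text.lower()))
    return [1 if word in present else 0 for word in vocabulary]

# Rows are reviews here; transpose so rows are keywords, columns are reviews.
by_review = [binary_features(r, toy_kw) for r in toy_reviews]
by_keyword = [list(col) for col in zip(*by_review)]

print(by_keyword)  # [[1, 1, 0], [1, 0, 0], [0, 1, 0]]
```

The third review contains "movies" but not "movie", so its column is all zeros: without lemmatization at this stage, inflected forms do not match the lemmatized keywords.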

We shuffle the data and split it into training and test sets.

In [16]:
np.random.seed(0)

# Shuffling the input units (rows) in place; afterwards row i no longer lines up with kw[i]
np.random.shuffle(reviews_array)

# Shuffling the reviews (columns) and creating the train and test data
per = np.random.permutation(reviews_array.shape[1])
reviews_array, ratings_array = reviews_array[:, per], ratings_array[:, per]
train_index = round(reviews_array.shape[1] * 0.95)
X_train, X_test = reviews_array[:, :train_index], reviews_array[:, train_index:]
Y_train, Y_test = ratings_array[:, :train_index], ratings_array[:, train_index:]

The number of rows of X_train should equal the number of keywords (kw), and the number of columns should equal the number of reviews in the training dataset.
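
A minimal sketch of the same shuffle-and-split logic on toy data, with the shapes printed at the end. The array sizes and random contents are hypothetical. With the notebook's own variables, the equivalent check would be to compare `X_train.shape` against `(len(kw), train_index)`.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical toy data: 4 keywords (rows) x 20 reviews (columns).
X = rng.randint(0, 2, size=(4, 20))
Y = rng.randint(0, 2, size=(1, 20))

# Permute the columns of X and Y together so each review keeps its label.
perm = rng.permutation(X.shape[1])
X, Y = X[:, perm], Y[:, perm]

# 95% of the columns go to training, the rest to testing.
split = round(X.shape[1] * 0.95)
X_tr, X_te = X[:, :split], X[:, split:]
Y_tr, Y_te = Y[:, :split], Y[:, split:]

print(X_tr.shape, X_te.shape)  # (4, 19) (4, 1)
print(Y_tr.shape, Y_te.shape)  # (1, 19) (1, 1)
```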

## Building our Neural Network
We shall train our network using mini-batch gradient descent with the [Adam](https://arxiv.org/pdf/1412.6980.pdf) optimizer. The network consists of 2 hidden layers with 50 hidden units each, followed by a single output unit. Since we use the tanh activation, we initialize the weights with Xavier initialization.
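
The source of the `NeuralNet` class is not shown in this notebook, so the following is only a sketch of what Xavier initialization for these layer sizes might look like. The function name `xavier_init` and the choice of the normal variant with variance 2 / (fan_in + fan_out) are assumptions; the actual `initializer_xavier` may use a different variant.

```python
import numpy as np

def xavier_init(n_inputs, layer_sizes, seed=0):
    """Return a list of (W, b) pairs using Xavier (Glorot) normal initialization."""
    rng = np.random.RandomState(seed)
    params = []
    fan_in = n_inputs
    for fan_out in layer_sizes:
        # Standard deviation sqrt(2 / (fan_in + fan_out)) scales the weights
        # so that tanh units are less likely to start out saturated.
        std = np.sqrt(2.0 / (fan_in + fan_out))
        W = rng.randn(fan_out, fan_in) * std
        b = np.zeros((fan_out, 1))
        params.append((W, b))
        fan_in = fan_out
    return params

# 5500 input units, two hidden layers of 50 units, one output unit.
params = xavier_init(5500, [50, 50, 1])
print([W.shape for W, _ in params])  # [(50, 5500), (50, 50), (1, 50)]
```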

In [73]:
np.random.seed(0)
nn = NeuralNet()
nn.initializer_xavier(X_train.shape[0], [50, 50, 1])
nn.gd_mini_batch(X_train, Y_train, alpha=0.001, activation='tanh', 
                 mini_batch_size=256, epochs=100, optimizer='adam')

The average cost for the 20th epoch was 0.1078769182205211.
The average cost for the 40th epoch was 0.06163014755262198.
The average cost for the 60th epoch was 0.04156796180617574.
The average cost for the 80th epoch was 0.03156289006690934.
The average cost for the 100th epoch was 0.02441794968370799.
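
For reference, a single Adam update as described in the linked paper can be sketched as follows. This is not the code inside `NeuralNet`, which is not shown here. The function name `adam_step` is hypothetical, the values of `beta1`, `beta2` and `eps` are the defaults suggested in the paper, and `alpha` matches the value passed above.

```python
import numpy as np

def adam_step(param, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    param = param - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w = np.array([1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 101):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # w has moved from 1.0 towards the minimum at 0
```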


In [74]:
# Getting the train set error
predictions, error = nn.predict(X_train, Y_train)
error

0.0082142194171153093

In [75]:
# Getting the test set error
predictions, error = nn.predict(X_test, Y_test)
error

0.22839506172839508
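
The error values above are presumably the fraction of misclassified reviews. Since the source of `nn.predict` is not shown, the sketch below is an assumption about how such an error could be computed; the function name `error_rate`, the 0.5 threshold and the sample values are hypothetical.

```python
import numpy as np

def error_rate(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction differs from the label."""
    predictions = (probabilities >= threshold).astype(int)
    return float(np.mean(predictions != labels))

# Hypothetical network outputs and true labels, one column per review.
probs = np.array([[0.9, 0.2, 0.6, 0.4]])
labels = np.array([[1, 0, 0, 1]])
print(error_rate(probs, labels))  # 0.5
```

If `nn.predict` works this way, the figures above correspond to roughly 0.8% of training reviews and 22.8% of test reviews being misclassified. A gap of that size suggests the network fits the training data much more closely than it generalizes.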