# IMDB Movie Sentiment Analysis

In this notebook, we will use the IMDB movie reviews dataset, provided on Kaggle (https://www.kaggle.com/c/word2vec-nlp-tutorial/data), to build a sentiment analysis model using NLP. 

Firstly, let us establish a baseline accuracy for our model using the tutorial provided by Kaggle alongside the dataset, found at https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/description. 

After doing this, we can work on improving the accuracy of our model if we so desire.

## Importing the Data

In [1]:
import numpy as np
import pandas as pd # to read the csv datasets and import them into python

In [2]:
data_path = "data" # path to the data folder in your directory
# forward slash works on every OS and avoids the invalid "\l" escape; quoting=3 is csv.QUOTE_NONE
train_data = pd.read_csv(data_path + "/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
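
To see what `quoting=3` does, here is a minimal sketch using only the standard library `csv` module. The one-row `sample_tsv` string is a made-up stand-in for the dataset, not real data. The value 3 corresponds to `csv.QUOTE_NONE`, which tells the parser to treat double quotes as ordinary characters, so quotes inside a review do not confuse the parser.

```python
import csv
import io

# hypothetical one-row sample written in the same style as the dataset
sample_tsv = 'id\tsentiment\treview\n"5814_8"\t1\t"A "great" film"\n'

# QUOTE_NONE: double quotes are kept as literal characters in each field
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

print(csv.QUOTE_NONE)  # 3
print(rows[1])         # the fields still contain their surrounding quotes
```

This is why the `id` and `review` values in the loaded DataFrame keep their surrounding double quotes.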

In [3]:
train_data

| | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |
| ... | ... | ... | ... |
| 24995 | `"3453_3"` | 0 | `"It seems like more consideration has gone int...` |
| 24996 | `"5064_1"` | 0 | `"I don't believe they made this film. Complete...` |
| 24997 | `"10905_3"` | 0 | `"Guy is a loser. Can't get girls, needs to bui...` |
| 24998 | `"10194_3"` | 0 | `"This 30 minute documentary Buñuel made in the...` |


## Preprocessing

In [4]:
from bs4 import BeautifulSoup # to get rid of the HTML tags in the reviews
import re # to remove punctuation and digits from the reviews

import nltk # to get the list of English stop words
nltk.download("stopwords") # fetch only the stop word corpus instead of opening the interactive downloader
from nltk.corpus import stopwords


In [5]:
# build the stop word set once; a set gives faster membership tests than a list
STOP_WORDS = set(stopwords.words("english"))

def preprocess_review(unclean_review):
    """
    Function that takes a single unclean review from the original dataset
    and returns a cleaned and preprocessed version of it.
    Input: string: an uncleaned review from the dataset
    Output: string: cleaned and preprocessed review
    """
    # removes the HTML tags in the review (an explicit parser avoids a BeautifulSoup warning)
    untagged_review = BeautifulSoup(unclean_review, "html.parser").get_text()
    # replaces everything not in A-Z or a-z with a space
    letter_only_review = re.sub("[^a-zA-Z]", " ", untagged_review)
    # converting everything to lowercase
    letter_only_review = letter_only_review.lower()
    # splitting on whitespace to get a list of words
    words_review = letter_only_review.split()
    # removing all the stop words in the review
    words_review = [w for w in words_review if w not in STOP_WORDS]
    # joining the remaining words back into a single string
    return " ".join(words_review)
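
To make the individual steps easier to follow, here is a small standalone sketch that traces one made-up review through the same pipeline. It is an illustration only: `SAMPLE_STOP_WORDS` is a tiny hypothetical stand-in for the NLTK stop word list, and a simple regular expression stands in for BeautifulSoup so that the sketch runs with the standard library alone.

```python
import re

# hypothetical, deliberately tiny stand-in for stopwords.words("english")
SAMPLE_STOP_WORDS = {"the", "was", "a", "it", "i", "and"}

def trace_preprocessing(raw_review):
    # stand-in for BeautifulSoup: replace anything that looks like a tag with a space
    no_tags = re.sub(r"<[^>]+>", " ", raw_review)
    # keep letters only; punctuation and digits become spaces
    letters_only = re.sub("[^a-zA-Z]", " ", no_tags)
    # lowercase and split on whitespace
    tokens = letters_only.lower().split()
    # drop the stop words and join the rest back together
    kept = [t for t in tokens if t not in SAMPLE_STOP_WORDS]
    return " ".join(kept)

sample_review = "The movie was GREAT!<br /><br />I watched it 3 times."
cleaned_sample = trace_preprocessing(sample_review)
print(cleaned_sample)  # movie great watched times
```

Note that replacing every non-letter with a space also splits contractions, so a word such as "don't" becomes the two tokens "don" and "t".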

In [6]:
total_reviews = len(train_data["review"]) # total num of reviews in our training dataset
cleaned_reviews = []
for i in range(total_reviews):
    clean_review = preprocess_review(train_data["review"][i])
    cleaned_reviews.append(clean_review)
    if i % 5000 == 0:
        print("{} reviews done".format(i))


0 reviews done
5000 reviews done
10000 reviews done
15000 reviews done
20000 reviews done


## Building Feature Vectors

In this part, we will be using the Bag of Words model to create feature vectors for our reviews. Since the total vocabulary of our reviews is quite large, we will restrict ourselves to the 5000 most frequent words in our reviews.
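
As a rough illustration of the idea, here is a minimal pure-Python sketch of a Bag of Words model on three made-up documents. The helper names `build_vocabulary` and `vectorize` are hypothetical, and the sketch simplifies what `CountVectorizer` does (for example, it tokenizes on whitespace only and breaks frequency ties alphabetically), so treat it as a picture of the technique rather than of scikit-learn's implementation.

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    # count every word across all documents
    counts = Counter(word for doc in documents for word in doc.split())
    # most frequent first; ties broken alphabetically so the result is deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # keep the top max_features words, stored in alphabetical order
    return sorted(word for word, _ in ranked[:max_features])

def vectorize(document, vocabulary):
    # one count per vocabulary word, in vocabulary order
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

docs = ["great movie great cast", "boring movie", "great plot boring cast"]
vocab = build_vocabulary(docs, max_features=3)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)    # ['boring', 'cast', 'great']
print(vectors)  # [[0, 1, 2], [1, 0, 0], [1, 1, 1]]
```

Each review becomes a fixed-length vector of word counts, and words outside the kept vocabulary (here "movie" and "plot") are simply ignored.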

In [7]:
from sklearn.feature_extraction.text import CountVectorizer # we will be using sklearn to automate most of our process

In [8]:
# initialize the CountVectorizer object, keeping only the 5000 most frequent words
vectorizer = CountVectorizer(max_features = 5000)

# learn the vocabulary and transform our reviews to their feature vectors
train_features = vectorizer.fit_transform(cleaned_reviews)
train_features = train_features.toarray()

## Training the Model

In this part, we will train our model using a Random Forest classifier with 400 trees, instead of scikit-learn's default of 100. After gauging the accuracy, we can further fine-tune our model's hyperparameters to obtain greater accuracy.

In [9]:
from sklearn.ensemble import RandomForestClassifier 

# initializing our rf with 400 trees
forest = RandomForestClassifier(n_estimators = 400)

# fit our forest to the training data
forest = forest.fit(train_features, train_data["sentiment"])
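
One way to gauge accuracy inside the notebook is to hold back part of the labeled training data for validation, since the predictions on the test set are written to a file rather than scored here. Below is a minimal standard-library sketch of that idea. The helpers `holdout_indices` and `accuracy` are hypothetical names introduced for illustration, and the 20% validation fraction is an assumption, not a value from the tutorial.

```python
import random

def holdout_indices(n_samples, validation_fraction=0.2, seed=0):
    """Shuffle the row indices and split them into training and validation parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = int(n_samples * validation_fraction)
    return indices[n_validation:], indices[:n_validation]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return matches / len(true_labels)

train_idx, val_idx = holdout_indices(n_samples=10, validation_fraction=0.2)
print(len(train_idx), len(val_idx))          # 8 2
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

With the notebook's variables, the forest would be fit on `train_features[train_idx]` with the matching labels and scored by comparing `forest.predict(train_features[val_idx])` against the held-out labels. scikit-learn's `train_test_split` and `accuracy_score` offer the same functionality.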

## Testing the Model

In [10]:
# import the test data to evaluate our model
test_data = pd.read_csv(data_path + "/testData.tsv", header=0, delimiter="\t", quoting=3)

# cleaning up the test data
clean_test_reviews = []
for i in range(len(test_data["review"])):
    clean_review = preprocess_review(test_data["review"][i])
    clean_test_reviews.append(clean_review)
    if i % 5000 == 0:
        print("{} reviews done".format(i))

test_features = vectorizer.transform(clean_test_reviews)
test_features = test_features.toarray()        


0 reviews done
5000 reviews done
10000 reviews done
15000 reviews done
20000 reviews done


In [11]:
# Use the rf to predict the results on the test data
results = forest.predict(test_features)

# create a Pandas DataFrame for our results
output = pd.DataFrame(data={"id":test_data["id"], "sentiment":results})

# output the results in the specified format
output.to_csv("BoW_model.csv", index=False, quoting=3)