# IMDB Movie Sentiment Analysis

In this notebook, we will use the IMDB movie reviews dataset, provided on Kaggle (https://www.kaggle.com/c/word2vec-nlp-tutorial/data), to build a sentiment analysis model using NLP. 

Having established a baseline score for our model using the bag-of-words (BoW) implementation, we will now apply transfer learning by using the Word2Vec algorithm proposed by Google to conduct sentiment analysis. In this approach, we represent each review by the average of the feature vectors of its words and build our model on those averages.
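
To make the averaging idea concrete, here is a minimal sketch on hand-written data. The 3-dimensional `toy_vectors` and the helper `average_vector` are hypothetical and exist only for illustration; real Word2Vec vectors are learned from the corpus. Plain Python lists are used so the sketch runs without any dependencies.

```python
# Toy 3-dimensional "embeddings" (made up for illustration, not learned).
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

def average_vector(words, vectors, num_features):
    """Average the vectors of the known words; return zeros if none are known."""
    total = [0.0] * num_features
    known = 0
    for word in words:
        if word in vectors:  # words outside the vocabulary are skipped
            known += 1
            total = [t + v for t, v in zip(total, vectors[word])]
    if known == 0:
        return total
    return [t / known for t in total]

# "zzz" is not in the toy vocabulary, so only two words contribute.
review_vector = average_vector(["great", "movie", "zzz"], toy_vectors, 3)
print(review_vector)  # [2.0, 1.0, 1.0]
```

Every review, whatever its length, ends up as one fixed-size vector, which is what lets us feed the reviews to a standard classifier.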

## Importing the Data

In [27]:
import numpy as np
import pandas as pd # to read the csv datasets and import them into python

In [28]:
data_path = "data" # path to the data folder in your directory
# import the labeled and unlabeled training data to train our model
label_train = pd.read_csv(data_path + "/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3) 
unlabel_train = pd.read_csv(data_path + "/unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3) 
# import the test data to evaluate our model
test_data = pd.read_csv(data_path + "/testData.tsv", header=0, delimiter="\t", quoting=3)

In [29]:
unlabel_train

| | id | review |
|---|---|---|
| 0 | "9999_0" | "Watching Time Chasers, it obvious that it was... |
| 1 | "45057_0" | "I saw this film about 20 years ago and rememb... |
| 2 | "15561_0" | "Minor Spoilers<br /><br />In New York, Joan B... |
| 3 | "7161_0" | "I went to see this film with a great deal of ... |
| 4 | "43971_0" | "Yes, I agree with everyone on this site this ... |
| ... | ... | ... |
| 49995 | "18984_0" | "The original Man Eater by Joe D'Amato is some... |
| 49996 | "16433_0" | "When Home Box Office was in it's early days m... |
| 49997 | "16006_0" | "Griffin Dunne was born into a cultural family... |
| 49998 | "40155_0" | "Not a bad story, but the low budget rears its... |


## Preprocessing

In [35]:
from bs4 import BeautifulSoup # to get rid of the HTML tags in the reviews
import re # to remove punctuation and digits from the reviews

import nltk.data
from nltk.corpus import stopwords # to remove the stop words in our reviews

# opens the interactive downloader; fetch the "stopwords" and "punkt" packages before continuing
nltk.download()
stop_words = set(stopwords.words("english")) # converted to a set for faster membership tests
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle') # to convert reviews into lists of sentences

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [39]:
def preprocess_review(unclean_review, remove_stopwords=False):
    """ 
    Function that takes a single unclean review from the original dataset
    and returns a cleaned and preprocessed version of it. 
    Input: string: an uncleaned review from the dataset
    Output: list of strings: the lowercase words of the cleaned review
    """
    # removes the HTML tags in the review (the parser is named explicitly to avoid a bs4 warning)
    untagged_review = BeautifulSoup(unclean_review, "html.parser").get_text() 
    # replaces everything not in A-Z or a-z with a space
    letter_only_review = re.sub("[^a-zA-Z]", " ", untagged_review) 
    # converting everything to lowercase
    letter_only_review = letter_only_review.lower() 
    # splitting the review into a list of words
    words_review = letter_only_review.split() 
    if remove_stopwords:
        # removing all the stop words, reusing the stop_words set built once in the cell above
        words_review = [w for w in words_review if w not in stop_words] 
    return words_review
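
As a small illustration of what the letters-only step does, the sketch below applies the same regular expression to a made-up sentence (the `raw` string is hypothetical and not taken from the dataset). It skips the BeautifulSoup step so that it runs with the standard library alone.

```python
import re

raw = "It's a 10/10 film, isn't it?"

# Same pattern as in preprocess_review: every non-letter becomes a space.
letters_only = re.sub("[^a-zA-Z]", " ", raw).lower()
tokens = letters_only.split()
print(tokens)  # ['it', 's', 'a', 'film', 'isn', 't', 'it']
```

Note that digits disappear entirely and contractions are split at the apostrophe, which leaves fragments such as `s` and `t` in the token list.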


In [56]:
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    """
    Function that takes in a review and returns it as a list of its sentences.
    Input: string: a review from the dataset
    Output: list of lists: one list of words per sentence in the review
    """
    # splitting review into sentences
    sentences = tokenizer.tokenize(review.strip())
    list_sentences = []
    for sentence in sentences:
        if len(sentence) > 0:
            # clean up the sentence by preprocessing it
            list_sentences.append(preprocess_review(sentence, remove_stopwords))
    return list_sentences
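
The shape of the result matters because Word2Vec expects a list of sentences, each of which is a list of words. The sketch below shows that shape on a made-up review. It is only an approximation: `toy_review_to_sentences` is a hypothetical helper that uses a naive regular expression in place of the punkt tokenizer, so it will split differently on abbreviations and other edge cases.

```python
import re

def toy_review_to_sentences(review):
    """Split on sentence-ending punctuation, then reduce each sentence to lowercase words."""
    # Naive stand-in for punkt: split after ".", "!" or "?" when followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", review.strip())
    result = []
    for sentence in raw_sentences:
        words = re.sub("[^a-zA-Z]", " ", sentence).lower().split()
        if words:  # skip sentences that contain no letters
            result.append(words)
    return result

toy_sentences = []
# "+=" extends the flat list, so sentences from all reviews end up side by side.
toy_sentences += toy_review_to_sentences("Great cast. Weak plot! Would I rewatch it? No.")
print(toy_sentences)
# [['great', 'cast'], ['weak', 'plot'], ['would', 'i', 'rewatch', 'it'], ['no']]
```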

In [57]:
sentences = []  # Initialize an empty list of sentences

for review in label_train["review"]: 
    sentences += review_to_sentences(review, tokenizer)

for review in unlabel_train["review"]:
    sentences += review_to_sentences(review, tokenizer)



## Training the Model

In this part, we will train our model on the sentences collected above using the Word2Vec algorithm.

In [59]:
# setting the hyperparameters as per the information provided by Google's doc at https://code.google.com/archive/p/word2vec/
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
threads = 6           # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# training the model
from gensim.models import word2vec
# note: `size` is the gensim 3.x keyword; gensim 4.0+ renamed it to `vector_size`
model = word2vec.Word2Vec(sentences, workers=threads, size=num_features, 
                          min_count=min_word_count, window=context, sample=downsampling)
# L2-normalizes the vectors in place to save memory; the model cannot be trained further afterwards
# (init_sims is deprecated in gensim 4.0+)
model.init_sims(replace=True)

# saving the model
model.save("default_w2v")
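
To illustrate what the minimum word count does, the following sketch prunes a vocabulary by frequency on a tiny made-up corpus. The names `toy_sentences` and `toy_min_count` are hypothetical, and the threshold is set to 2 instead of 40 so that the effect is visible on three sentences. It mirrors the documented behaviour of `min_count` (words with a lower total frequency are ignored) and is not gensim's actual implementation.

```python
from collections import Counter

toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["great", "movie"],
]
toy_min_count = 2

# Count every word across all sentences, then keep the frequent ones.
counts = Counter(word for sentence in toy_sentences for word in sentence)
vocab = sorted(word for word, n in counts.items() if n >= toy_min_count)
print(vocab)  # ['great', 'movie', 'the', 'was']
```

Words that fall below the threshold, here `plot` and `thin`, get no vector at all, which is why the averaging step later has to skip words that are missing from the model's vocabulary.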


## Loading the model

In [93]:
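from gensim.models import Word2Vec

# model trained on the 75,000 reviews provided in the Kaggle dataset
# (assumes the file saved above as "default_w2v" has been moved into a models/ folder;
# a forward slash works on every platform and avoids the invalid "\d" escape)
model = Word2Vec.load("models/default_w2v")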

In [94]:
def averageVec(words, model, num_features):
    """
    Averages the vectors of all the words in a review that are in the model's vocabulary.
    Returns a zero vector if none of the words are in the vocabulary.
    """
    featureVec = np.zeros((num_features, ), dtype="float32")
    num_words = 0
    # if a word in the review is in the model's vocab then add its vector to the total
    for word in words:
        if word in model.wv:
            num_words += 1
            featureVec = np.add(featureVec, model.wv[word])
    # get the average vector, guarding against division by zero
    if num_words > 0:
        featureVec = np.divide(featureVec, num_words)
    return featureVec

def batchAvgVecs(reviews, model, num_features):
    """ 
    Given a list of reviews, calculates the average feature vector for each review
    and returns them as a 2D array with one row per review
    """
    result = np.zeros((len(reviews), num_features), dtype="float32")
    for i, review in enumerate(reviews):
        result[i] = averageVec(review, model, num_features)
        if i % 5000 == 0:
            print("{} reviews processed".format(i))
    return result

        

## Using the Model

In [95]:
# clean up the datasets and calculate their average feature vectors

clean_train_data = []
for review in label_train["review"]:
    clean_train_data.append(preprocess_review(review, remove_stopwords=True))
trainAvgVecs = batchAvgVecs(clean_train_data, model, num_features)

clean_test_data = []
for review in test_data["review"]:
    clean_test_data.append(preprocess_review(review, remove_stopwords=True))
testAvgVecs = batchAvgVecs(clean_test_data, model, num_features)

0 reviews processed
5000 reviews processed
10000 reviews processed
15000 reviews processed
20000 reviews processed
0 reviews processed
5000 reviews processed
10000 reviews processed
15000 reviews processed
20000 reviews processed


In [96]:
# train a random forest on the average feature vectors
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(trainAvgVecs, label_train["sentiment"])

# test the model and output the results
result = forest.predict(testAvgVecs)
output = pd.DataFrame(data={"id": test_data["id"], "sentiment": result})
output.to_csv("W2V_AvgVec.csv", index=False, quoting=3)