# IMDB Movie Sentiment Analysis

In this notebook, we will use the IMDB movie reviews dataset, provided on Kaggle (https://www.kaggle.com/c/word2vec-nlp-tutorial/data), to build a sentiment analysis model using NLP. 

Rather than using the Word2Vec model (which provided little to no gain in our score), this time we will use the Doc2Vec model, which also takes the order of the words in a review into account. It gives us a feature vector for the whole review in addition to the feature vectors of the individual words. Through this, we hope to improve our score.

## Importing the Data

In [85]:
import numpy as np
import pandas as pd # to read the csv datasets and import them into python

In [86]:
data_path = "data" # path to the data folder in your directory
# import the labeled and unlabeled training data to train our model
label_train = pd.read_csv(data_path + "/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3) 
unlabel_train = pd.read_csv(data_path + "/unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3) 
# import the test data to evaluate our model
test_data = pd.read_csv(data_path + "/testData.tsv", header=0, delimiter="\t", quoting=3)

In [87]:
all_reviews = np.array([], dtype=str)
datasets = [label_train, unlabel_train, test_data]
for dataset in datasets:
    all_reviews = np.concatenate((all_reviews, dataset.review), axis=0)


## Preprocessing

In [88]:
from bs4 import BeautifulSoup # to get rid of the HTML tags in the reviews
import re # to remove punctuation and digits from the reviews

from nltk.corpus import stopwords # to remove the stop words in our reviews
from nltk.stem import WordNetLemmatizer # to lemmatize our reviews
from nltk import word_tokenize # to split the reviews into word tokens
import nltk
# opens the NLTK downloader; the code below needs the stop word list, WordNet and the tokenizer models
nltk.download()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [89]:
def preprocess_review(unclean_review):
    """
    Function that takes a single unclean review from the original dataset
    and returns a cleaned and preprocessed version of it.
    Input: string: an uncleaned review from the dataset
    Output: list of strings: the cleaned, lemmatized tokens of the review
    """
    # removes the HTML tags in the review
    untagged_review = BeautifulSoup(unclean_review, "html.parser").get_text()
    # replaces everything not in A-Z or a-z with a space
    letter_only_review = re.sub("[^a-zA-Z]", " ", untagged_review)
    # converting everything to lowercase
    letter_only_review = letter_only_review.lower()
    # splitting the review into word tokens
    tokenized_review = word_tokenize(letter_only_review)
    # lemmatizing each token as a noun (the lemmatizer's default part of speech)
    tokens = [lemmatizer.lemmatize(t) for t in tokenized_review]
    # lemmatizing each token again as a verb to normalize verb forms
    lemmatized_tokens = [lemmatizer.lemmatize(t, "v") for t in tokens]
    # removing all the stop words in the review
    result = [t for t in lemmatized_tokens if t not in stop_words]
    return result
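
To make the effect of these cleaning steps concrete, here is a simplified, standard-library-only approximation of the pipeline. It is a sketch under stated assumptions, not the function above: the name `toy_preprocess` and the tiny stop-word set are hypothetical, HTML tags are stripped with a crude regular expression instead of BeautifulSoup, tokens are split on whitespace instead of with NLTK, and lemmatization is skipped entirely.

```python
import re

# Tiny stand-in stop-word set (hypothetical; the notebook uses NLTK's English list).
TOY_STOP_WORDS = {"the", "a", "is", "it", "was", "this", "and", "of"}

def toy_preprocess(text, stop_words=TOY_STOP_WORDS):
    """Approximate the cleaning steps without BeautifulSoup or NLTK."""
    no_tags = re.sub(r"<[^>]+>", " ", text)             # crude HTML tag removal
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)   # keep letters only
    tokens = letters_only.lower().split()               # whitespace tokenization
    return [t for t in tokens if t not in stop_words]   # drop stop words

sample = "This movie was GREAT!<br /><br />10/10, loved it."
print(toy_preprocess(sample))  # ['movie', 'great', 'loved']
```

The result is a list of tokens rather than a single string, which is also the shape the real `preprocess_review` returns.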


In [90]:
# preprocess all the reviews

for i in range(len(all_reviews)):
    all_reviews[i] = preprocess_review(all_reviews[i])


"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

## Training the Model

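In this part, we will train our model. The cells below first detect multi-word phrases and train Word2Vec embeddings on all the reviews with gensim, then fit a bidirectional LSTM classifier with Keras.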

In [93]:
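# all_reviews holds the labeled, unlabeled and test reviews, in that order
x_train = all_reviews[:len(label_train)]
y_train = label_train.sentiment.values
x_test = all_reviews[len(label_train) + len(unlabel_train):]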
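
The test reviews occupy the last block of `all_reviews` because the three datasets were concatenated in a fixed order. A minimal sketch of how the block boundaries follow from the dataset lengths is given here; the helper name `slice_bounds` is hypothetical, and the sizes in the example are assumptions for illustration.

```python
from itertools import accumulate

def slice_bounds(lengths):
    """Return (start, end) index pairs for blocks concatenated in order."""
    ends = list(accumulate(lengths))   # running totals mark where each block ends
    starts = [0] + ends[:-1]           # each block starts where the previous one ended
    return list(zip(starts, ends))

# Assumed sizes for the labeled, unlabeled and test sets.
bounds = slice_bounds([25000, 50000, 25000])
print(bounds)  # [(0, 25000), (25000, 75000), (75000, 100000)]
```

Deriving the indices this way keeps the slices correct if a dataset of a different size is used.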

In [95]:
from gensim.models import Word2Vec, Phrases

# identifying multi-word phrases (such as solar_system) using bigrams and trigrams
bigrams = Phrases(sentences=all_reviews)
trigrams = Phrases(sentences=bigrams[all_reviews])
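
Phrase detection of this kind is usually based on how often two words occur next to each other compared with how often each occurs overall. The sketch below is a simplified collocation score in that spirit, written in plain Python on a toy corpus. The function name `phrase_scores`, the toy sentences and the `min_count=1` setting are assumptions for illustration; gensim's own scorer and its default parameters may differ in detail.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score adjacent pairs: (count(a, b) - min_count) * V / (count(a) * count(b))."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # V: number of distinct words
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# Toy corpus (hypothetical): "solar system" always appears together.
toy = [
    ["solar", "system", "planet"],
    ["solar", "system", "orbit"],
    ["planet", "orbit", "solar", "system"],
]
scores = phrase_scores(toy)
best_pair = max(scores, key=scores.get)
print(best_pair)  # ('solar', 'system')
```

Pairs whose score exceeds a chosen threshold are merged into a single token such as `solar_system`. Running the detector a second time over the merged output is what allows three-word phrases to form.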

In [96]:
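import multiprocessing
embedding_vec_len = 256 # dimensionality of the word vectors
# note: the `size` argument is called `vector_size` in gensim 4.0 and later
model = Word2Vec(sentences=trigrams[bigrams[all_reviews]], size=embedding_vec_len, min_count=3,
                 window=10, workers=multiprocessing.cpu_count())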

In [97]:
model.wv.most_similar('galaxy')

[('solar_system', 0.8394182920455933),
 ('wormhole', 0.8316003084182739),
 ('planet', 0.8314940333366394),
 ('space_station', 0.8305203914642334),
 ('asteroid', 0.8083747029304504),
 ('venus', 0.7937676906585693),
 ('pod', 0.7921104431152344),
 ('orbit', 0.7900744080543518),
 ('federation', 0.7885801196098328),
 ('colony', 0.7865943908691406)]
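
The scores in this output are cosine similarities between word vectors. As a rough illustration of how such a ranking is computed, here is a plain-Python sketch; the two-dimensional vectors and the helper names `cosine_similarity` and `rank_similar` are made up for the example and have nothing to do with the trained model's actual values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]

# Hypothetical toy embeddings.
toy_vectors = {
    "galaxy": [1.0, 0.0],
    "planet": [0.9, 0.1],
    "comedy": [0.0, 1.0],
}
ranking = rank_similar("galaxy", toy_vectors)
print([word for word, _ in ranking])  # ['planet', 'comedy']
```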

In [99]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, LSTM, Embedding, Dropout
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Sequential

max_features = 5000 # we will keep the 5000 most frequent words for our model
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(x_train) # build the vocabulary from the training reviews
train_tokens = tokenizer.texts_to_sequences(x_train)

max_length = 130 # every review is padded or truncated to this many tokens
x_train = pad_sequences(train_tokens, maxlen=max_length)

embedding_size = 128
# named lstm_model so that it does not overwrite the Word2Vec `model` trained above
lstm_model = Sequential()
lstm_model.add(Embedding(max_features, embedding_size))
lstm_model.add(Bidirectional(LSTM(32, return_sequences=True)))
lstm_model.add(GlobalMaxPool1D())
lstm_model.add(Dense(20, activation="relu"))
lstm_model.add(Dropout(0.05))
lstm_model.add(Dense(1, activation="sigmoid"))
lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

batch_size = 100
epochs = 3
lstm_model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.25)


Using TensorFlow backend.
  if training is 1 or training is True:
  elif training is 0 or training is False:
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.


Traceback (most recent call last):
  File "C:\Users\yasoo\Anaconda3\envs\walmart\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "C:\Users\yasoo\Anaconda3\envs\walmart\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "C:\Users\yasoo\Anaconda3\envs\walmart\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "C:\Users\yasoo\Anaconda3\envs\walmart\lib\imp.py", line 242, in load_module
    else:
  File "C:\Users\yasoo\Anaconda3\envs\walmart\lib\imp.py", line 342, in load_dynamic
    name=name, loader=loader, origin=path)
ImportError: Module use of python36.dll conflicts with this version of Python.

During handling of the above exception,

TypeError: can only concatenate str (not "list") to str
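
The training cell in this section pads every tokenized review to a fixed length with `pad_sequences`. With its default settings, Keras pads short sequences at the front with zeros and truncates long sequences from the front, keeping the last tokens. The plain-Python sketch below imitates that default behaviour for illustration only; the helper name `pad_pre` is hypothetical and `maxlen` is assumed to be at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences; keep only the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)  # pad at the front
    return padded

print(pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 8], [3, 4, 5, 6]]
```

One consequence is that for reviews longer than the chosen length, the model only sees the end of the review.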

## Loading the Model

## Using the Model

In [None]:
# cleaning up the datasets and calculating their average feature vectors
# note: batchAvgVecs is not defined in this notebook and must be provided separately;
# `model` must be the Word2Vec model, whose vectors have embedding_vec_len dimensions

clean_train_data = [preprocess_review(review) for review in label_train["review"]]
trainAvgVecs = batchAvgVecs(clean_train_data, model, embedding_vec_len)

clean_test_data = [preprocess_review(review) for review in test_data["review"]]
testAvgVecs = batchAvgVecs(clean_test_data, model, embedding_vec_len)

  index_set = set(model.wv.index2word) # converted to set for faster access


0 reviews processed
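
Averaging feature vectors means representing each review by the mean of the vectors of its known words. The sketch below shows one way such a helper might work. It is not the notebook's `batchAvgVecs`: the function names are hypothetical, and a plain dictionary of toy vectors stands in for the trained Word2Vec model.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of tokens that have an embedding; zeros if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(column) / len(known) for column in zip(*known)]

def batch_average_vectors(reviews, vectors, dim):
    """Apply average_vector to every tokenized review."""
    return [average_vector(tokens, vectors, dim) for tokens in reviews]

# Hypothetical toy embeddings.
toy_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}
reviews = [["good", "film", "unknownword"], ["nothing"]]
print(batch_average_vectors(reviews, toy_vectors, dim=2))
# [[0.5, 0.5], [0.0, 0.0]]
```

Words that are missing from the vocabulary are skipped, so a review made up only of unknown words maps to the zero vector.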


In [None]:
# train a random forest on the average feature vectors
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(trainAvgVecs, label_train["sentiment"])

# test the model and output the results
result = forest.predict(testAvgVecs)
output = pd.DataFrame(data={"id": test_data["id"], "sentiment": result})
output.to_csv("W2V_AvgVec.csv", index=False, quoting=3)