# IMDB reviews sentiment analysis: Word2vec

## Prerequisites

In [1]:
import time

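import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
# CountVectorizer is used in the bag-of-words cells further down.
from sklearn.feature_extraction.text import CountVectorizer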

import config
from parsers import sentence as sentence_parser
import utils
# Alias matches the `sk_classifiers.*` calls in the "Train classifiers" section.
import classifiers as sk_classifiers

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sergiidenysiuk/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Intro

Word2vec, published by Google in 2013, is a neural network implementation that learns distributed representations for words. The original code is in **C**, but it has since been ported to other languages, including **Python**.

Word2Vec does not need labels in order to create meaningful representations. This is useful, since most data in the real world is unlabeled. If the network is given enough training data (tens of billions of words), it produces word vectors with intriguing characteristics. Words with similar meanings appear in clusters, and clusters are spaced such that some word relationships, such as analogies, can be reproduced using vector math. The famous example is that, with highly trained word vectors, "king - man + woman = queen."
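
The vector arithmetic behind the analogy can be illustrated with a minimal sketch. The vectors below are hand-made toy values, not output of a trained Word2Vec model, and the helper names `cosine` and `analogy` are hypothetical. They only show how "king - man + woman" can land nearest to "queen" when distances are measured with cosine similarity.

```python
import math

# Hand-made 3-dimensional toy vectors (NOT learned by Word2Vec); the axes can
# be read loosely as (royalty, maleness, femaleness).
TOY_VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidate answers.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy(positive=["king", "woman"], negative=["man"], vectors=TOY_VECTORS)
print(result)
```

With real embeddings the same query would be run against learned vectors with hundreds of dimensions; the toy values here are chosen only so that the arithmetic is easy to follow by hand.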

Although Word2Vec, unlike many deep learning algorithms, does not require graphics processing units (GPUs), it is compute intensive. Both Google's version and the Python version rely on multi-threading (running multiple threads in parallel to save time). To train a model in a reasonable amount of time, `cython` has to be installed. Word2Vec will run without `cython`, but training will take days instead of minutes.
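
Before a long training run, it may be worth checking which backend is in use. The sketch below assumes the `FAST_VERSION` flag exposed by `gensim.models.word2vec`, where a negative value conventionally indicates the slow pure-Python path. The helper name `describe_word2vec_backend` and the one-thread-per-core suggestion for the `workers` argument are illustrative assumptions, not part of the original notebook.

```python
import os


def describe_word2vec_backend():
    """Report whether gensim's compiled (Cython) Word2Vec routines are available."""
    try:
        # FAST_VERSION is a module-level flag in gensim.models.word2vec;
        # a negative value conventionally means the slow pure-Python fallback.
        from gensim.models.word2vec import FAST_VERSION
    except Exception as error:  # gensim missing, or its compiled extension failed to load
        return "gensim Word2Vec is not usable here: {}".format(error)
    if FAST_VERSION < 0:
        return "slow pure-Python training (FAST_VERSION={})".format(FAST_VERSION)
    return "compiled training routines available (FAST_VERSION={})".format(FAST_VERSION)


# A possible choice for the `workers` argument of Word2Vec: one thread per core.
suggested_workers = os.cpu_count() or 1

backend_description = describe_word2vec_backend()
print(backend_description)
print("Suggested number of worker threads:", suggested_workers)
```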

## Read data

In [None]:
# %load -s read_and_parse utils.py

In [None]:
# %load -s concat_sets utils.py

In [None]:
# %load -s SentencesParser parsers/sentence.py

In [2]:
print("Read, clean and parse train data...")
train_data = utils.concat_sets(
    utils.read_and_parse(config.DATA_TRAINING_POS_REVIEW, sentence_parser.SentencesParser),
    utils.read_and_parse(config.DATA_TRAINING_NEG_REVIEW, sentence_parser.SentencesParser),
    columns=["id", "text", "sentiment"],
    is_join=False, is_shuffle=False)
print("Done.")

print("Read, clean and parse test data...")
test_data = utils.concat_sets(
    utils.read_and_parse(config.DATA_TEST_POS_REVIEW, sentence_parser.SentencesParser),
    utils.read_and_parse(config.DATA_TEST_NEG_REVIEW, sentence_parser.SentencesParser),
    columns=["id", "text", "sentiment"],
    is_join=False, is_shuffle=False)
print("Done.")

Read, clean and parse train data...
Done.
Read, clean and parse test data...
Done.


In [12]:
train_data.head()

Unnamed: 0,id,text,sentiment
0,aclImdb/train/pos/4715_9.txt,"[[for, a, movie, that, gets, no, respect, ther...",True
1,aclImdb/train/pos/12390_8.txt,"[[bizarre, horror, movie, filled, with, famous...",True
2,aclImdb/train/pos/8329_7.txt,"[[a, solid, if, unremarkable, film], [matthau,...",True
3,aclImdb/train/pos/9063_8.txt,"[[it, s, a, strange, feeling, to, sit, alone, ...",True
4,aclImdb/train/pos/3092_10.txt,"[[you, probably, all, already, know, this, by,...",True


In [13]:
test_data.head()

Unnamed: 0,id,text,sentiment
0,aclImdb/test/pos/4715_9.txt,"[[based, on, an, actual, story, john, boorman,...",True
1,aclImdb/test/pos/1930_9.txt,"[[this, is, a, gem], [as, a, film, four, produ...",True
2,aclImdb/test/pos/3205_9.txt,"[[i, really, like, this, show], [it, has, dram...",True
3,aclImdb/test/pos/10186_10.txt,"[[this, is, the, best, d, experience, disney, ...",True
4,aclImdb/test/pos/147_10.txt,"[[of, the, korean, movies, i, ve, seen, only, ...",True


The loaded texts are lists of reviews, where each review is a list of sentences and each sentence is a list of words (a three-level nested list).

In [14]:
train_data["text"]

0        [[for, a, movie, that, gets, no, respect, ther...
1        [[bizarre, horror, movie, filled, with, famous...
2        [[a, solid, if, unremarkable, film], [matthau,...
3        [[it, s, a, strange, feeling, to, sit, alone, ...
4        [[you, probably, all, already, know, this, by,...
5        [[i, saw, the, movie, with, two, grown, childr...
6        [[you, re, using, the, imdb, you, ve, given, s...
7        [[this, was, a, good, film, with, a, powerful,...
8        [[made, after, quartet, was, trio, continued, ...
9        [[for, a, mature, man, to, admit, that, he, sh...
10       [[aileen, gonsalves, my, girlfriend, is, in, t...
11       [[jonathan, demme, s, directorial, debut, for,...
12       [[when, i, rented, this, movie, to, watch, it,...
13       [[it, s, hard, to, say, sometimes, why, exactl...
14       [[yes, this, gets, the, full, ten, stars], [it...
15       [[hello], [this, movie, is, well, okay], [just...
16       [[this, is, a, film, that, was, very, well, do.

Word2Vec expects single sentences, each one as a list of words. In other words, the input format is a list of lists, so convert the list of reviews into a flat list of sentences, each of which is a list of words (a two-level nested list). Example result:

In [8]:
train_sentences = [y for x in train_data["text"] for y in x]
train_sentences

[['for',
  'a',
  'movie',
  'that',
  'gets',
  'no',
  'respect',
  'there',
  'sure',
  'are',
  'a',
  'lot',
  'of',
  'memorable',
  'quotes',
  'listed',
  'for',
  'this',
  'gem'],
 ['imagine',
  'a',
  'movie',
  'where',
  'joe',
  'piscopo',
  'is',
  'actually',
  'funny'],
 ['maureen', 'stapleton', 'is', 'a', 'scene', 'stealer'],
 ['the', 'moroni', 'character', 'is', 'an', 'absolute', 'scream'],
 ['watch',
  'for',
  'alan',
  'the',
  'skipper',
  'hale',
  'jr',
  'as',
  'a',
  'police',
  'sgt'],
 ['bizarre',
  'horror',
  'movie',
  'filled',
  'with',
  'famous',
  'faces',
  'but',
  'stolen',
  'by',
  'cristina',
  'raines',
  'later',
  'of',
  'tv',
  's',
  'flamingo',
  'road',
  'as',
  'a',
  'pretty',
  'but',
  'somewhat',
  'unstable',
  'model',
  'with',
  'a',
  'gummy',
  'smile',
  'who',
  'is',
  'slated',
  'to',
  'pay',
  'for',
  'her',
  'attempted',
  'suicides',
  'by',
  'guarding',
  'the',
  'gateway',
  'to',
  'hell'],
 ['the',
  'scen
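
The flattening step can be checked on a small example. This is a minimal sketch on made-up toy data standing in for `train_data["text"]`; it uses `itertools.chain.from_iterable`, which is equivalent to the list comprehension above, and then verifies that the result is a two-level list of word strings.

```python
from itertools import chain

# Toy stand-in for train_data["text"]: reviews -> sentences -> words.
toy_reviews = [
    [["a", "solid", "film"], ["matthau", "is", "great"]],
    [["this", "is", "a", "gem"]],
]

# Same flattening as the list comprehension, written with itertools.
toy_sentences = list(chain.from_iterable(toy_reviews))

# Every element should now be a flat list of word strings.
is_two_level = all(
    isinstance(sentence, list) and all(isinstance(word, str) for word in sentence)
    for sentence in toy_sentences
)
print(len(toy_sentences), is_two_level)
```

Note that flattening discards the review boundaries, so the sentiment label of each review can no longer be matched to individual sentences. That is fine for unsupervised Word2Vec training, which does not use labels.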

## Prepare text data (convert a collection of text documents to a matrix of token counts)

The IMDB data contains a very large number of reviews, which gives a large vocabulary. To limit the size of the feature vectors, choose a maximum vocabulary size. Below, the $5000$ most frequent words are used (remember that stopwords have already been removed in the previous step).

**Note.** `CountVectorizer` comes with its own options to do preprocessing, tokenization, and stop word removal automatically. For each of these, instead of specifying `None`, it is possible to use a built-in method or a custom function. In this example, however, a custom parser is used for data cleaning.
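
To make the idea of a size-limited vocabulary and a matrix of token counts concrete, here is a minimal pure-Python sketch on toy documents. It is not how `CountVectorizer` is implemented: the helper names `build_vocabulary` and `to_count_vector` are hypothetical, and the tie-breaking among equally frequent words is an assumption chosen to keep the output deterministic. Each document is assumed to be a flat list of words, so the nested sentence lists loaded above would need to be flattened per review first.

```python
from collections import Counter


def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, indexed alphabetically."""
    counts = Counter(word for doc in documents for word in doc)
    # Sort by descending frequency, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    kept = sorted(word for word, _ in ranked[:max_features])
    return {word: index for index, word in enumerate(kept)}


def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in document:
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are ignored
            vector[index] += 1
    return vector


toy_documents = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "fun"],
]
vocabulary = build_vocabulary(toy_documents, max_features=3)
vectors = [to_count_vector(doc, vocabulary) for doc in toy_documents]
print(vocabulary)
print(vectors)
```

Words that fall outside the kept vocabulary ("cast" and "fun" here) contribute nothing to the vectors, which is the trade-off made when capping the vocabulary size.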

In [5]:
print("Creating the Bag Of Words...")
vectorizer = CountVectorizer(analyzer="word",
                             max_features=5000)
print("Done.")

Creating the Bag Of Words...
Done.


In [6]:
print("Learn a vocabulary from documents...")
vectorizer.fit(train_data["text"].tolist())
print("Done.")

Learn a vocabulary from documents...
Done.


In [7]:
vectorizer.get_feature_names()

['abandoned',
 'abc',
 'abilities',
 'ability',
 'able',
 'abraham',
 'absence',
 'absent',
 'absolute',
 'absolutely',
 'absurd',
 'abuse',
 'abusive',
 'abysmal',
 'academy',
 'accent',
 'accents',
 'accept',
 'acceptable',
 'accepted',
 'access',
 'accident',
 'accidentally',
 'accompanied',
 'accomplished',
 'according',
 'account',
 'accuracy',
 'accurate',
 'accused',
 'achieve',
 'achieved',
 'achievement',
 'acid',
 'across',
 'act',
 'acted',
 'acting',
 'action',
 'actions',
 'activities',
 'actor',
 'actors',
 'actress',
 'actresses',
 'acts',
 'actual',
 'actually',
 'ad',
 'adam',
 'adams',
 'adaptation',
 'adaptations',
 'adapted',
 'add',
 'added',
 'adding',
 'addition',
 'adds',
 'adequate',
 'admire',
 'admit',
 'admittedly',
 'adorable',
 'adult',
 'adults',
 'advance',
 'advanced',
 'advantage',
 'adventure',
 'adventures',
 'advertising',
 'advice',
 'advise',
 'affair',
 'affect',
 'affected',
 'afford',
 'aforementioned',
 'afraid',
 'africa',
 'african',
 'after

In [8]:
vectorizer.get_stop_words()

In [9]:
print("Encode each train movie review document to vector...")
train_vectors = vectorizer.transform(train_data["text"].tolist()).toarray()
print("Done.")

print("Encode each test movie review document to vector...")
test_vectors = vectorizer.transform(test_data["text"].tolist()).toarray()
print("Done.")

Encode each train movie review document to vector...
Done.
Encode each test movie review document to vector...
Done.


In [None]:
train_vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
test_vectors

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

## Train classifiers

In [None]:
# %load classifiers/sklearn.py

In [None]:
print("Training the Random Forest...")
test_sentiments_predicted_rf = sk_classifiers.random_forest(
    train_vectors, train_data["sentiment"], test_vectors, n_estimators=100)
print("Done.")

print("Training the Naive Bayes Gaussian...")
test_sentiments_predicted_nbg = sk_classifiers.naive_bayes_gaussian(
    train_vectors, train_data["sentiment"], test_vectors)
print("Done.")

print("Training the Naive Bayes Multinomial...")
test_sentiments_predicted_nbm = sk_classifiers.naive_bayes_multinomial(
    train_vectors, train_data["sentiment"], test_vectors)
print("Done.")

print("Training the Naive Bayes Bernoulli...")
test_sentiments_predicted_nbb = sk_classifiers.naive_bayes_bernoulli(
    train_vectors, train_data["sentiment"], test_vectors)
print("Done.")

print("Training the k-Nearest Neighbors...")
test_sentiments_predicted_knn = sk_classifiers.k_nearest_neighbors(
    train_vectors, train_data["sentiment"], test_vectors, n_neighbors=100)
print("Done.")

Training the Random Forest...


In [None]:
filename_sklearn_rf = 'bag-of-words-sklearn-rf-model.csv'
filename_sklearn_nbg = 'bag-of-words-sklearn-nbg-model.csv'
filename_sklearn_nbm = 'bag-of-words-sklearn-nbm-model.csv'
filename_sklearn_nbb = 'bag-of-words-sklearn-nbb-model.csv'
filename_sklearn_knn = 'bag-of-words-sklearn-knn-model.csv'
filename_summary = 'bag-of-words-summary.txt'

print('Wrote Random Forest results to {filename}'.format(
    filename=filename_sklearn_rf))
utils.write_results_to_csv(
    test_data["id"],
    test_data["sentiment"],
    test_sentiments_predicted_rf,
    filename_sklearn_rf)

# print('Wrote Naive Bayes Gaussian results to {filename}'.format(
#     filename=filename_sklearn_nbg))
# utils.write_results_to_csv(
#     test_ids,
#     test_sentiments,
#     test_sentiments_predicted_nbg,
#     filename_sklearn_nbg)

# print('Wrote Naive Bayes Multinomial results to {filename}'.format(
#     filename=filename_sklearn_nbm))
# utils.write_results_to_csv(
#     test_ids,
#     test_sentiments,
#     test_sentiments_predicted_nbm,
#     filename_sklearn_nbm)

# print('Wrote Naive Bayes Bernoulli results to {filename}'.format(
#     filename=filename_sklearn_nbb))
# utils.write_results_to_csv(
#     test_ids,
#     test_sentiments,
#     test_sentiments_predicted_nbb,
#     filename_sklearn_nbb)

# print('Wrote k-Nearest Neighbors results to {filename}'.format(
#     filename=filename_sklearn_knn))
# utils.write_results_to_csv(
#     test_ids,
#     test_sentiments,
#     test_sentiments_predicted_knn,
#     filename_sklearn_knn)

# print('Wrote summary results to {filename}'.format(
#     filename=filename_summary))

In [None]:
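# Names used below that are not defined earlier in this notebook.
train_ids, test_ids = train_data["id"], test_data["id"]
train_texts = train_data["text"]
test_sentiments = test_data["sentiment"]
n_estimators = 100  # value passed to the Random Forest above
n_neighbors = 100   # value passed to k-Nearest Neighbors above

with open(filename_summary, "w") as file_summary: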
    print('Size of train dataset: {size}'.format(
        size=len(train_ids)), file=file_summary)

    print('Size of test dataset: {size}'.format(
        size=len(test_ids)), file=file_summary)

    print('\n', file=file_summary)

    print('Number of trees in Random Forest: {trees}'.format(
        trees=n_estimators), file=file_summary)

    print('Number of neighbors in KNN: {neighbors}'.format(
        neighbors=n_neighbors), file=file_summary)

    print('\n', file=file_summary)

    print('Accuracy of the Random Forest sklearn: {accuracy}'.format(
        accuracy=utils.calculate_accuracy(
            test_sentiments, test_sentiments_predicted_rf)), file=file_summary)

    print('Accuracy of the Naive Bayes Gaussian sklearn: {accuracy}'.format(
        accuracy=utils.calculate_accuracy(
            test_sentiments, test_sentiments_predicted_nbg)), file=file_summary)

    print('Accuracy of the Naive Bayes Multinomial sklearn: {accuracy}'.format(
        accuracy=utils.calculate_accuracy(
            test_sentiments, test_sentiments_predicted_nbm)), file=file_summary)

    print('Accuracy of the Naive Bayes Bernoulli sklearn: {accuracy}'.format(
        accuracy=utils.calculate_accuracy(
            test_sentiments, test_sentiments_predicted_nbb)), file=file_summary)

    print('Accuracy of the k-Nearest Neighbors sklearn: {accuracy}'.format(
        accuracy=utils.calculate_accuracy(
            test_sentiments, test_sentiments_predicted_knn)), file=file_summary)

    print('\n', file=file_summary)

    print('Count of each word in train dataset: {counts}'.format(
        counts=utils.count_words(vectorizer.get_feature_names(), train_texts)), file=file_summary)


In [None]:
print('Count of each word in train dataset: {counts}'.format(
      counts=utils.count_words(vectorizer.get_feature_names(), train_data["text"])))

# train_texts
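
The summary above relies on `utils.calculate_accuracy`, whose implementation is not shown in this notebook. As a rough sketch of what such a helper might compute, assuming accuracy means the fraction of predictions that match the true labels, a hypothetical stand-in named `accuracy` could look like this:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    true_labels = list(true_labels)
    predicted_labels = list(predicted_labels)
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Toy labels: three of the four predictions agree with the truth.
toy_accuracy = accuracy([True, True, False, False], [True, False, False, False])
print(toy_accuracy)
```

Because the positive and negative review sets loaded here are concatenated without shuffling, the order of labels does not matter for this metric; only the per-position agreement does.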