In [30]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

`gensim` provides a well-documented implementation of `word2vec`:

In [31]:
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec

## Training data
The __Amazon Instant Video__ [dataset](http://jmcauley.ucsd.edu/data/amazon/) consists of 37,126 reviews, including ratings and text content, in JSON format.

In [32]:
df = pd.read_json('data/amazon.json', lines=True)
df = df[['overall', 'reviewText']].sort_values('overall')

## Preprocessing
We first build a vocabulary table, which consists of

- filtering out unique and stopwords
- count-vectorizing

In [70]:
sentences = [TaggedDocument(words=utils.to_unicode(line[1]).split(), tags=[line[0]]) for line in df['reviewText'].iteritems()]#, tags=[line['overall']]) for i, line in df.iterrows()]

In [96]:
model = Doc2Vec(min_count=1, window=10, vector_size=300, sample=1e-4, negative=5, workers=7)
model.build_vocab(sentences)

## Training
We train our model on the __Amazon Instant Video__ reviews dataset. `doc2vec` is really just an  ensemble of (100-dimensional) `word2vec` vectors, which themselves are numeric representations of individual words.

In [72]:
import random

In [105]:
# A hack
class Sentence:
    def __init__(self, sentences):
        self.sentences = sentences
    
    def permute(self):
        random.shuffle(self.sentences)
        return self.sentences

In [106]:
s = Sentence(sentences)
model.train(s.permute(), total_examples=model.corpus_count, epochs=10)

## Classification

Having transformed reviews into vectors, we can now train a classifier to recognize positive and negative reviews. Reviews are quantified in a range of values `0...5` inclusive, with `0` as the worst and `5` as the best.

In [107]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [108]:
X = np.zeros((df.shape[0], 300))
for i in range(df.shape[0]):
    X[i] = model[i]
y = df['overall']

In [109]:
X_train, X_test, y_train, y_test = train_test_split(X, df['overall'])

In [110]:
clf = LogisticRegression(multi_class="ovr")
clf.fit(X_train, y_train);

In [111]:
clf.score(X_test, y_test)

0.5627020038784745