In [2]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

`gensim` provides well-documented implementations of `word2vec` and its document-level extension, `doc2vec`:

In [50]:
from gensim import utils
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec

## Training data
The __Amazon Instant Video__ [dataset](http://jmcauley.ucsd.edu/data/amazon/) consists of 37,126 reviews, including ratings and text content, in JSON format.

In [34]:
df = pd.read_json('data/amazon.json', lines=True)
df = df[['overall', 'reviewText']]
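
`lines=True` tells pandas that the file is in JSON Lines form, that is, one JSON object per line. As a rough sketch of what that parsing amounts to, the snippet below reads two made-up records with the standard library and keeps only the two fields used here. Only the field names `overall` and `reviewText` come from the dataset; the record contents, the `extra` field, and the helper `load_reviews` are invented for illustration.

```python
import io
import json

# Two invented records in JSON Lines form (one JSON object per line).
raw = io.StringIO(
    '{"overall": 5.0, "reviewText": "Great show, loved it", "extra": 1}\n'
    '{"overall": 2.0, "reviewText": "Not for me", "extra": 2}\n'
)


def load_reviews(handle):
    """Parse JSON Lines and keep only the rating and the review text."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append(
            {"overall": record["overall"], "reviewText": record["reviewText"]}
        )
    return rows


reviews = load_reviews(raw)
print(reviews[0])
```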

## Preprocessing
We first build a vocabulary table, which consists of:

- filtering out unique words and stopwords
- count-vectorizing the remaining words
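
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not gensim's implementation: the stopword list, the helper name `build_vocab_table`, and the threshold `min_count=2` are all assumptions chosen for the toy data. Note that the `Doc2Vec` call in this notebook uses `min_count=1`, which keeps every token.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "is", "it", "and"}


def build_vocab_table(documents, min_count=2):
    """Count tokens, then drop stopwords and tokens seen fewer than min_count times."""
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    return {
        token: n
        for token, n in counts.items()
        if n >= min_count and token not in STOPWORDS
    }


docs = [
    "the show is great",
    "great cast and a great plot",
    "the plot is thin",
]
vocab = build_vocab_table(docs)
print(vocab)
```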

In [51]:
# Tag each review with its star rating; the whitespace-split text supplies the words.
sentences = [
    TaggedDocument(utils.to_unicode(row['reviewText']).split(), [row['overall']])
    for _, row in df.iterrows()
]

In [52]:
model = Doc2Vec(min_count=1, window=10, vector_size=100, sample=1e-4, negative=5, workers=7)
model.build_vocab(sentences)

## Training
We train our model on the __Amazon Instant Video__ reviews dataset. `doc2vec` extends `word2vec`: alongside the numeric vector it learns for each individual word, it learns one vector (100-dimensional here, per `vector_size=100`) for each document tag.
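
To build intuition for how word vectors relate to a document-level vector, here is a simple averaging baseline. It is a deliberate simplification and not what `Doc2Vec` does internally, since `Doc2Vec` learns its document vectors during training. The 3-dimensional toy vectors and the helper `average_vectors` are invented for illustration.

```python
def average_vectors(tokens, word_vectors):
    """Average the vectors of the known tokens; return None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Invented toy word vectors (real ones would have 100 dimensions here).
toy_vectors = {
    "great": [1.0, 0.0, 2.0],
    "plot": [3.0, 2.0, 0.0],
}

# "twist" has no vector and is ignored by the average.
doc_vector = average_vectors("great plot twist".split(), toy_vectors)
print(doc_vector)
```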

In [59]:
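# Call train() once; gensim runs all epochs and handles the learning-rate decay internally.
# `epochs` must be passed explicitly, otherwise train() raises:
#   ValueError: You must specify an explict epochs count. The usual value is epochs=model.epochs.
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)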
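
Why a single `train()` call is usually preferred over one call per pass can be illustrated with a simplified model of a linearly decaying learning rate. This is a sketch under stated assumptions, not gensim's actual code: it assumes the rate falls linearly from 0.025 to 0.0001 within each call (gensim's documented defaults for `alpha` and `min_alpha`), and the helpers `linear_alpha` and `schedule` are hypothetical.

```python
def linear_alpha(progress, start=0.025, end=0.0001):
    """Learning rate after a fraction `progress` (0..1) of one train() call."""
    return start - (start - end) * progress


def schedule(calls, epochs_per_call):
    """Learning rate at the start of every epoch across repeated train() calls."""
    rates = []
    for _ in range(calls):
        for epoch in range(epochs_per_call):
            rates.append(linear_alpha(epoch / epochs_per_call))
    return rates


single = schedule(calls=1, epochs_per_call=10)  # one call, ten epochs
looped = schedule(calls=5, epochs_per_call=2)   # five calls, two epochs each

print(single[:3])
print(looped[:4])
```

In this simplified model the single call decays smoothly, while the looped variant jumps back to the starting rate at the beginning of every call.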