# 3. Doc2Vec - Distributed Representations of Sentences and Documents

### References
- [Distributed Representations of Sentences and Documents, Le and Mikolov, 2014](https://arxiv.org/pdf/1405.4053.pdf)

In [1]:
from models import Doc2Vec
import nltk
import string
import itertools
import time
import os
import csv
import pickle
from scipy import spatial
%matplotlib inline

## Utility Methods
Define utility functions for preprocessing the corpus.

In [2]:
def flatten(l):
    return list(itertools.chain.from_iterable(l))

In [3]:
def normalize_corpus(texts, stops):
    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

    # Remove stopwords (a set makes each membership test O(1))
    stop_set = set(stops)
    texts = [' '.join(word for word in x.split() if word not in stop_set) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    
    # Remove empty strings
    texts = [x for x in texts if x != '']

    return texts
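
To make the effect of these steps concrete, the sketch below applies an equivalent pipeline to three made-up reviews. The function `normalize_texts`, the tiny stopword set `STOPS`, and the sample sentences are illustrative assumptions; the notebook itself uses `normalize_corpus` with NLTK's English stopword list.

```python
import string

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
STOPS = {"the", "is", "a", "and", "of", "i"}

# Translation table that deletes punctuation and digits in a single pass.
_DROP = str.maketrans("", "", string.punctuation + string.digits)

def normalize_texts(texts, stops):
    """Lower-case, strip punctuation and digits, drop stopwords and empty results."""
    stops = set(stops)
    out = []
    for text in texts:
        words = text.lower().translate(_DROP).split()
        cleaned = " ".join(w for w in words if w not in stops)
        if cleaned:  # reviews that become empty are dropped
            out.append(cleaned)
    return out

sample = ["The taffy is GREAT, 10/10!", "I bought 2 bags of the dog food.", "!!! 123"]
normalized = normalize_texts(sample, STOPS)
print(normalized)
```

Note that the third sample disappears entirely, so the output list can be shorter than the input list.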

## Load and Preprocess Corpus
We will use the [Amazon Fine Food Reviews corpus](https://www.kaggle.com/snap/amazon-fine-food-reviews). If a normalized corpus has already been saved with pickle, load it. Otherwise, load the raw data, normalize it, and save the result. This takes some time because the preprocessing code is not optimized.

In [4]:
DATA_DIR = os.path.join("data", "amazon_food")

In [5]:
if os.path.exists(os.path.join(DATA_DIR,"amazon_corpus.pkl")):
    print("Found amazon_corpus.pkl. Loading Corpus..")
    with open(os.path.join(DATA_DIR,"amazon_corpus.pkl"), "rb") as f:
        corpus = pickle.load(f)
        
else:
    print("No amazon_corpus.pkl. Preprocessing Corpus..")
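    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(os.path.join(DATA_DIR, "Reviews.csv"), 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        corpus = [line[-1] for line in reader]  # the last column holds the review text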

    stops = nltk.corpus.stopwords.words('english')
    corpus = normalize_corpus(corpus, stops)

    corpus = [line.split(' ') for line in corpus]
    
    with open(os.path.join(DATA_DIR, "amazon_corpus.pkl"), "wb") as f:
        pickle.dump(corpus, f)

print("DONE!")

Found amazon_corpus.pkl. Loading Corpus..
DONE!
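
The load-or-preprocess logic above is a general caching pattern. A minimal sketch of it as a reusable helper follows; `load_or_build`, `build_corpus`, and the temporary cache path are hypothetical names for illustration and are not part of the notebook's code.

```python
import os
import pickle
import tempfile

def load_or_build(path, build_fn):
    """Return the pickled object at `path`, building and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    # Create the target directory if needed before writing the cache file.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

# Stand-in for the expensive preprocessing step; records how often it runs.
calls = []
def build_corpus():
    calls.append(1)
    return [["good", "dog", "food"], ["soft", "taffy"]]

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "corpus.pkl")
    first = load_or_build(cache, build_corpus)   # builds and saves
    second = load_or_build(cache, build_corpus)  # loads from disk

print(first == second, len(calls))
```

The second call returns the same data without running the build step again, which is the behavior the cell above relies on.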


In [6]:
len(flatten(corpus[:10000]))

387867

## Fit and Train Doc2Vec

In [7]:
doc2vec = Doc2Vec.Doc2VecModel(100, 100, 5, max_vocab_size=100000, learning_rate=0.01)

In [8]:
doc2vec.fit_to_corpus(corpus[:10000])


In [9]:
doc2vec.train(20, log_dir="log/03_doc2vec", save_dir="save/03_doc2vec", print_every=1000)

Writing TensorBoard summaries to log/doc2vec
Saving TensorFlow models to save/doc2vec
--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 8092463
Total number of batches: 758
--------------------------------------------------------------------------------
step: 1000, epoch:2, time/batch: 0.005252, avg_loss: 35.83
step: 2000, epoch:3, time/batch: 0.005116, avg_loss: 30.03
step: 3000, epoch:4, time/batch: 0.005247, avg_loss: 25.68
step: 4000, epoch:6, time/batch: 0.005416, avg_loss: 22.43
step: 5000, epoch:7, time/batch: 0.00544, avg_loss: 19.94
Saved summaries at step 5000
Saved a model at step 5000
step: 6000, epoch:8, time/batch: 0.004783, avg_loss: 19.33
step: 7000, epoch:10, time/batch: 0.00462, avg_loss: 17.39
step: 8000, epoch:11, time/batch: 0.004677, avg_loss: 15.63
step: 9000, epoch:12, time/batch: 0.005184, avg_loss: 15.8
step: 10000, epoch:14, time/batch: 0.005299, avg_loss: 14.92
Saved summaries at step 

## Test

Load the original, unnormalized corpus so that the test phase can display readable review text.

In [10]:
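# newline='' lets the csv module handle line breaks inside quoted fields
with open(os.path.join(DATA_DIR, "Reviews.csv"), 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    corpus = [line[-1] for line in reader]  # raw review text, one entry per row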
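
One caveat, assuming the model was fit on the normalized corpus from the preprocessing cell: `normalize_corpus` drops reviews that become empty, so from the first dropped review onward the model's document indices may no longer match the row indices of the raw file. The sketch below shows one possible way to keep raw and normalized texts paired. `pair_raw_with_normalized`, `normalize_one`, and the sample data are hypothetical and only illustrate the idea.

```python
def pair_raw_with_normalized(raw_texts, normalize_one):
    """Return (raw, normalized) pairs, skipping texts that normalize to ''."""
    pairs = []
    for raw in raw_texts:
        norm = normalize_one(raw)
        if norm:
            pairs.append((raw, norm))
    return pairs

def normalize_one(text):
    # Hypothetical stand-in normalizer: keep lower-cased alphabetic tokens only.
    return " ".join(w for w in text.lower().split() if w.isalpha())

raw = ["Great taffy", "12345 !!!", "Good dog food"]
pairs = pair_raw_with_normalized(raw, normalize_one)

# Raw texts aligned with the documents the model would actually see.
raw_kept = [r for r, _ in pairs]
print(raw_kept)
```

Here index 1 of the aligned list is "Good dog food", whereas index 1 of the raw list is the review that was dropped.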

I'm not sure whether the results below are acceptable. They would probably improve with hyperparameter tuning and more training data. Since the main goal here is to implement the models in a consistent way rather than to get the best results, I leave that tuning to the reader.

In [11]:
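def doc_similarity(target, model, corpus):
    """Print the target document and return its 5 most similar documents by cosine similarity."""
    print("TARGET SENTENCE: {}".format(corpus[target]))
    target_V = model.embedding_for(target)

    similarities = []
    for doc in range(model.corpus_size):
        # Skip the target document itself
        if doc == target:
            continue

        vector = model.embedding_for(doc)
        # scipy returns the cosine distance, so similarity = 1 - distance
        cosine_sim = 1 - spatial.distance.cosine(target_V, vector)
        similarities.append([corpus[doc].strip(), cosine_sim])

    # Highest similarity first; keep the top 5
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:5]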
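
The loop above computes one cosine similarity per document. If all document vectors are available as a single matrix, the same ranking can be computed in one step with NumPy. This is a sketch under that assumption: `top_k_similar` and the toy `emb` matrix are hypothetical, and it does not use the notebook's `Doc2VecModel` API.

```python
import numpy as np

def top_k_similar(embeddings, target, k=5):
    """Return [(doc_index, cosine_similarity)] for the k documents nearest to `target`."""
    # Normalize each row to unit length; the clip guards against zero vectors.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Dot products of unit vectors are cosine similarities.
    sims = unit @ unit[target]
    sims[target] = -np.inf  # exclude the target itself
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Toy 2-dimensional "document embeddings" for illustration only.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
result = top_k_similar(emb, target=0, k=2)
print(result)
```

For the toy matrix, document 1 points in almost the same direction as document 0 and ranks first, followed by the orthogonal document 2.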

In [12]:
doc_similarity(0, doc2vec, corpus)

TARGET SENTENCE: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


[['This is a very healthy dog food. Good for their digestion. Also good for small puppies. My dog eats her required amount at every feeding.',
  0.49768173694610596],
 ["My cats have been happily eating Felidae Platinum for more than two years. I just got a new bag and the shape of the food is different. They tried the new food when I first put it in their bowls and now the bowls sit full and the kitties will not touch the food. I've noticed similar reviews related to formula changes in the past. Unfortunately, I now need to find a new food that my cats will eat.",
  0.4471796452999115],
 ['I am very satisfied with my Twizzler purchase.  I shared these with others and we have all enjoyed them.  I will definitely be ordering more.',
  0.4460362493991852],
 ['This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!',
  0.4460066854953766],
 ["One of my boys needed to lose some weight and the other didn't.

In [13]:
doc2vec.generate_tsne('log/03_doc2vec/fig.png', size=(30,20), doc_count=50, corpus=corpus)