# 3. Doc2Vec - Distributed Representations of Sentences and Documents

### References
- [Distributed Representations of Sentences and Documents, Le and Mikolov, 2014](https://arxiv.org/pdf/1405.4053.pdf)

In [1]:
from models import Doc2Vec
import nltk
import string
import itertools
import time
import os
import csv
import pickle
from scipy import spatial
%matplotlib inline

## Utility Methods
Define utility functions for preprocessing the corpus.

In [2]:
def flatten(l):
    return list(itertools.chain.from_iterable(l))

In [3]:
def normalize_corpus(texts, stops):
    """Lowercase, strip punctuation, digits and stopwords, and drop empty documents."""
    stops = set(stops)  # set membership is faster than scanning a list

    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

    # Remove stopwords
    texts = [' '.join(word for word in x.split() if word not in stops) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]

    # Remove empty strings (this shifts the indices of all later documents)
    texts = [x for x in texts if x != '']

    return texts
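
As a quick illustration of what the normalization does, here is a minimal sketch on a toy input. The stopword list and the example sentences are made up for this demonstration, and the function is a condensed copy of `normalize_corpus`, repeated only so the snippet runs on its own.

```python
import string

def normalize_corpus(texts, stops):
    """Condensed copy of the function above, so this snippet is self-contained."""
    drop = set(string.punctuation) | set('0123456789')
    stops = set(stops)
    out = []
    for text in texts:
        # Lower case, then strip punctuation and digits
        text = ''.join(c for c in text.lower() if c not in drop)
        # Remove stopwords; split/join also collapses extra whitespace
        text = ' '.join(w for w in text.split() if w not in stops)
        # Drop documents that end up empty
        if text != '':
            out.append(text)
    return out

toy_stops = ['the', 'is', 'a', 'it']  # hypothetical stopword list
toy_texts = ["The taffy is GREAT!!", "It is a 10.", "Soft, chewy; 5 stars"]

result = normalize_corpus(toy_texts, toy_stops)
print(result)  # ['taffy great', 'soft chewy stars']
```

The second toy review consists only of stopwords, digits and punctuation, so it disappears entirely. The output is therefore shorter than the input, and positions in the normalized corpus no longer line up one-to-one with positions in the raw text whenever that happens.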

## Load and Normalize Corpus
We use the [Amazon Fine Food Reviews corpus](https://www.kaggle.com/snap/amazon-fine-food-reviews). If a normalized corpus has already been saved with pickle, load it. Otherwise, load the raw data, normalize it, and save the result. This takes some time because the preprocessing code is not optimized.

In [4]:
if os.path.exists(os.path.join("data","amazon_corpus.pkl")):
    print("Found amazon_corpus.pkl. Loading Corpus..")
    with open(os.path.join("data","amazon_corpus.pkl"), "rb") as f:
        corpus = pickle.load(f)
        
else:
    print("No amazon_corpus.pkl. Preprocessing Corpus..")
    with open(os.path.join("data", "Reviews.csv"), 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)
        corpus = [line[-1] for line in reader]

    stops = nltk.corpus.stopwords.words('english')
    corpus = normalize_corpus(corpus, stops)

    corpus = [line.split(' ') for line in corpus]
    
    with open(os.path.join("data", "amazon_corpus.pkl"), "wb") as f:
        pickle.dump(corpus, f)

print("DONE!")

Found amazon_corpus.pkl. Loading Corpus..
DONE!


In [5]:
len(flatten(corpus[:10000]))

387867

## Fit and Train Doc2Vec

In [6]:
doc2vec = Doc2Vec.Doc2VecModel(100, 100, 5, max_vocab_size=100000, learning_rate=0.01)

In [7]:
doc2vec.fit_to_corpus(corpus[:10000])


In [8]:
doc2vec.train(20, log_dir="log/doc2vec", save_dir="save/doc2vec", print_every=1000)

Writing TensorBoard summaries to log/doc2vec
Saving TensorFlow models to save/doc2vec
--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 9448362
Total number of batches: 758
--------------------------------------------------------------------------------
step: 1000, epoch:2, time/batch: 0.005517543077468872, avg_loss: 35.93056869506836
step: 2000, epoch:3, time/batch: 0.005339736461639404, avg_loss: 29.108983993530273
step: 3000, epoch:4, time/batch: 0.005398691892623902, avg_loss: 25.362348556518555
step: 4000, epoch:6, time/batch: 0.005444626331329346, avg_loss: 20.517866134643555
step: 5000, epoch:7, time/batch: 0.0053118314743042, avg_loss: 19.599010467529297
Saved summaries at step 5000
Saved a model at step 5000
step: 6000, epoch:8, time/batch: 0.0050750067234039305, avg_loss: 19.227130889892578
step: 7000, epoch:10, time/batch: 0.004860623836517334, avg_loss: 16.350727081298828
step: 8000, epoch:11, time/ba
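
To help read the log above, the relationship between `step` and `epoch` can be checked with a little arithmetic. This is a sketch based on inference from the printed values, not on the model's source: it assumes that "Total number of batches" means batches per epoch, that the first argument to `train` is the number of epochs, and that the epoch counter is 1-based.

```python
batches_per_epoch = 758  # "Total number of batches" in the log (assumed to be per epoch)
num_epochs = 20          # first argument to doc2vec.train(...) (assumed to be the epoch count)

def epoch_at(step):
    # Assumed 1-based epoch counter
    return step // batches_per_epoch + 1

# (step, epoch) pairs copied from the log output
logged = {1000: 2, 2000: 3, 3000: 4, 4000: 6, 5000: 7, 6000: 8, 7000: 10, 8000: 11}

total_steps = batches_per_epoch * num_epochs
consistent = all(epoch_at(step) == epoch for step, epoch in logged.items())

print(total_steps)  # 15160
print(consistent)   # True
```

Under these assumptions a full run takes 15160 steps, so the log shown here covers roughly the first half of training.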

## Test

Load the original, unnormalized corpus for the test phase so that results can be displayed as readable reviews.

In [9]:
with open(os.path.join("data", "Reviews.csv"), 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader, None)
    corpus = [line[-1] for line in reader]

I'm not sure whether the results in this section are acceptable. Better results would probably require hyperparameter tuning and more data. Since the main goal here is to implement the models in a consistent way rather than to get the best possible results, I leave that tuning to the reader.

In [10]:
def doc_similarity(target, model, corpus):
    """Print the target document and return the 5 documents most similar to it."""
    print("TARGET SENTENCE: {}".format(corpus[target]))
    target_V = model.embedding_for(target)

    similarities = []
    for doc in range(model.corpus_length):
        # Skip the target document itself
        if doc == target:
            continue

        vector = model.embedding_for(doc)
        # Cosine similarity = 1 - cosine distance
        cosine_sim = 1 - spatial.distance.cosine(target_V, vector)
        similarities.append([corpus[doc].strip(), cosine_sim])

    # Highest similarity first; keep the top 5
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:5]
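
The score used above, `1 - spatial.distance.cosine(u, v)`, is the cosine similarity u·v / (‖u‖ ‖v‖). The following is a minimal standard-library sketch of the same ranking on toy vectors. The helper names `cosine_similarity` and `top_k_similar` and the toy data are made up for illustration and are not part of the `Doc2Vec` model. It uses `heapq.nlargest` instead of sorting every score, and it assumes no vector is all zeros.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Cosine similarity u.v / (|u| |v|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_similar(vectors, target, k=5):
    """Return (index, similarity) for the k vectors most similar to vectors[target]."""
    scored = ((cosine_similarity(vectors[target], vec), idx)
              for idx, vec in enumerate(vectors) if idx != target)
    return [(idx, sim) for sim, idx in heapq.nlargest(k, scored)]

# Toy "document embeddings" (hypothetical values)
toy_vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

ranked = top_k_similar(toy_vectors, target=0, k=2)
print(ranked)
```

Vector 1 points in the same direction as the target and so ranks first with similarity 1.0, even though its length differs. Cosine similarity ignores magnitude and compares direction only.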

In [11]:
doc_similarity(0, doc2vec, corpus)

TARGET SENTENCE: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


[["I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service!",
  0.5692881345748901],
 ['This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!',
  0.5344139337539673],
 ['I am very satisfied 

In [12]:
doc2vec.generate_tsne('log/doc2vec/fig.png', size=(30,20), doc_count=50, corpus=corpus)