# 1. Word2Vec

### References
- [Distributed Representations of Words and Phrases and their Compositionality - Mikolov et al. 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [TensorFlow Tutorial - Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)

In [1]:
from models import Word2Vec
import nltk
import string
import itertools
import time
import os
import csv
import pickle
from scipy import spatial
%matplotlib inline

## Utility Methods
Define utility methods for preprocessing the corpus.

In [2]:
def flatten(l):
    return list(itertools.chain.from_iterable(l))

In [3]:
def normalize_corpus(texts, stops):
    # Use a set so each stopword lookup is O(1)
    stops = set(stops)

    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in string.digits) for x in texts]

    # Remove stopwords
    texts = [' '.join(word for word in x.split() if word not in stops) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]

    # Remove empty strings
    texts = [x for x in texts if x != '']

    return texts
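The steps in `normalize_corpus` can also be done in a single pass per review. The following is a minimal sketch, not part of the original notebook: `normalize_text` and `normalize_corpus_fast` are hypothetical names, and the sketch assumes the same order of operations as above (lower-case, strip punctuation and digits, drop stopwords, collapse whitespace, drop empty results).

```python
import string

# Translation table that deletes punctuation and digits in one pass.
_DELETE_TABLE = str.maketrans('', '', string.punctuation + string.digits)


def normalize_text(text, stops):
    """Lower-case, strip punctuation/digits, drop stopwords, collapse whitespace."""
    words = text.lower().translate(_DELETE_TABLE).split()
    return ' '.join(w for w in words if w not in stops)


def normalize_corpus_fast(texts, stops):
    """Hypothetical single-pass variant of normalize_corpus."""
    stops = set(stops)  # set membership is O(1); a list is scanned linearly
    normalized = (normalize_text(t, stops) for t in texts)
    return [t for t in normalized if t]  # drop reviews that became empty


sample = ["I LOVE this tea!!! 10/10", "The", "  Great   flavor, 2 bags.  "]
print(normalize_corpus_fast(sample, ["i", "this", "the"]))
# → ['love tea', 'great flavor bags']
```

Using `str.translate` avoids rebuilding each string character by character in Python, which is usually where most of the time goes on a corpus of this size.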

## Load and Preprocess Corpus
We will use the [Amazon Fine Food Reviews corpus](https://www.kaggle.com/snap/amazon-fine-food-reviews). If a normalized corpus has already been saved with pickle, we load it. Otherwise, we load the raw data, normalize it, and save the result. This takes some time, since the preprocessing code is not optimized.

In [4]:
DATA_DIR = os.path.join("data", "amazon_food")

In [5]:
if os.path.exists(os.path.join(DATA_DIR,"amazon_corpus.pkl")):
    print("Found amazon_corpus.pkl. Loading Corpus..")
    with open(os.path.join(DATA_DIR, "amazon_corpus.pkl"), "rb") as f:
        corpus = pickle.load(f)
        
else:
    print("No amazon_corpus.pkl. Preprocessing Corpus..")
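    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(os.path.join(DATA_DIR, "Reviews.csv"), 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        corpus = [line[-1] for line in reader]  # last column holds the review text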

    stops = nltk.corpus.stopwords.words('english')
    corpus = normalize_corpus(corpus, stops)

    corpus = [line.split(' ') for line in corpus]
    
    with open(os.path.join(DATA_DIR, "amazon_corpus.pkl"), "wb") as f:
        pickle.dump(corpus, f)

print("DONE!")

Found amazon_corpus.pkl. Loading Corpus..
DONE!


In [6]:
len(flatten(corpus[:10000]))

387867

## Fit and Train Word2Vec

In [7]:
word2vec = Word2Vec.Word2VecModel(150,5, subsampling_threshold=1e-6, max_vocab_size=100000, negative_sample_size=10,
                                  learning_rate=0.01)
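For context on `subsampling_threshold`: Mikolov et al. (2013) discard each occurrence of a word $w$ with probability $1 - \sqrt{t / f(w)}$, where $f(w)$ is the word's relative frequency and $t$ is the threshold. The sketch below illustrates that formula only. Whether `Word2VecModel` applies exactly this rule is an assumption, and `discard_probabilities` is a hypothetical helper. The threshold of 0.25 is chosen only to make the toy numbers readable.

```python
import math
from collections import Counter


def discard_probabilities(tokens, threshold):
    """Per-word discard probability from Mikolov et al. (2013): 1 - sqrt(t / f(w))."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        word: max(0.0, 1.0 - math.sqrt(threshold / (count / total)))
        for word, count in counts.items()
    }


# Toy corpus: "tea" makes up 90% of the tokens, "oolong" 10%.
tokens = ["tea"] * 90 + ["oolong"] * 10
probs = discard_probabilities(tokens, threshold=0.25)
for word, p in probs.items():
    print(word, round(p, 3))
```

Words whose relative frequency is below the threshold are never discarded, so only very frequent words are thinned out.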

In [8]:
word2vec.fit_to_corpus(corpus[:10000])

In [9]:
word2vec.train(10, log_dir="log/01_word2vec", save_dir="save/01_word2vec", print_every=1000)

Writing TensorBoard summaries to log/word2vec
Saving TensorFlow models to save/word2vec
--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 7092162
Total number of batches: 6990
--------------------------------------------------------------------------------
step: 1000, epoch:1, time/batch: 0.005406327724456787, avg_loss: 55.443824768066406
step: 2000, epoch:1, time/batch: 0.004623481512069702, avg_loss: 39.67242431640625
step: 3000, epoch:1, time/batch: 0.004372473478317261, avg_loss: 33.468414306640625
step: 4000, epoch:1, time/batch: 0.004261737585067749, avg_loss: 28.45407485961914
step: 5000, epoch:1, time/batch: 0.004471955060958863, avg_loss: 26.627120971679688
Saved summaries at step 5000
Saved a model at step 5000
step: 6000, epoch:1, time/batch: 0.00408284330368042, avg_loss: 24.711214065551758
step: 7000, epoch:2, time/batch: 0.003792802333831787, avg_loss: 21.486061096191406
step: 8000, epoch:2, time/ba

## Test

In [10]:
def word_similarity(target, model):
    # Return the 10 words whose embeddings are closest to `target` by cosine similarity
    target_V = model.embedding_for(target)

    similarities = []
    for word in model.words:
        if word == target:
            continue

        vector = model.embedding_for(word)
        # scipy returns cosine distance, so similarity is 1 - distance
        cosine_sim = 1 - spatial.distance.cosine(target_V, vector)
        similarities.append([word, cosine_sim])

    return sorted(similarities, key=lambda x: x[1], reverse=True)[:10]
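The loop above computes one cosine similarity per vocabulary word. If the embeddings are available as a single matrix, the same ranking can be computed with one matrix-vector product. This is a minimal sketch with NumPy: `top_k_cosine`, `words`, and `matrix` are hypothetical names, and the sketch assumes no embedding is the zero vector.

```python
import numpy as np


def top_k_cosine(target_vec, words, matrix, k=10):
    """Return the k (word, cosine similarity) pairs closest to target_vec.

    `words` is a list of tokens and `matrix` a (len(words), dim) array whose
    rows are the matching embeddings; both are hypothetical inputs.
    """
    matrix = np.asarray(matrix, dtype=float)
    target_vec = np.asarray(target_vec, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target_vec)
    sims = matrix @ target_vec / norms
    order = np.argsort(-sims)[:k]  # indices of the k largest similarities
    return [(words[i], float(sims[i])) for i in order]


words = ["good", "bad", "tasty"]
matrix = np.array([[1.0, 0.0], [-1.0, 0.0], [1.0, 1.0]])
result = top_k_cosine(np.array([1.0, 0.0]), words, matrix, k=2)
print(result)
```

Unlike `word_similarity`, this sketch does not skip the target word itself, so the caller would need to drop it from `words` and `matrix` first.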

In [11]:
word_similarity('great', word2vec)

[['good', 0.7600875496864319],
 ['really', 0.7369396686553955],
 ['taste', 0.7281869053840637],
 ['flavor', 0.7144145369529724],
 ['br', 0.6923125386238098],
 ['recommend', 0.6900876760482788],
 ['would', 0.6897999048233032],
 ['product', 0.6886175870895386],
 ['one', 0.6810516119003296],
 ['drink', 0.6701462268829346]]
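The `br` entry in this list is most likely residue from HTML `<br />` tags in the raw reviews: removing punctuation deletes the angle brackets and slash but leaves the letters. A minimal sketch of stripping tags before punctuation removal follows. `strip_html` and `remove_punctuation` are hypothetical helpers, and the regular expression is a simple approximation, not a full HTML parser.

```python
import re
import string

# Matches simple HTML tags such as <br /> or <p>.
TAG_RE = re.compile(r'<[^>]+>')


def strip_html(text):
    """Replace each HTML tag with a space so neighbouring words stay separate."""
    return TAG_RE.sub(' ', text)


def remove_punctuation(text):
    return ''.join(c for c in text if c not in string.punctuation)


raw = "Great tea.<br /><br />Would buy again!"

without_strip = remove_punctuation(raw).lower().split()
with_strip = remove_punctuation(strip_html(raw)).lower().split()

print(without_strip)  # → ['great', 'teabr', 'br', 'would', 'buy', 'again']
print(with_strip)     # → ['great', 'tea', 'would', 'buy', 'again']
```

Without the extra step, the tag both leaks a `br` token and fuses with the preceding word (`teabr`), which adds noise to the vocabulary.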

In [12]:
word2vec.generate_tsne('log/01_word2vec/fig.png', size=(15,15), word_count=500)