# 1. Word2Vec

### References
- [Distributed Representations of Words and Phrases and their Compositionality - Mikolov et al. 2013](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
- [TensorFlow Tutorial - Vector Representations of Words](https://www.tensorflow.org/tutorials/word2vec)

In [1]:
import Word2Vec
import nltk
import string
import itertools
import time
import os
import csv
import pickle
from scipy import spatial
%matplotlib inline

## Utility Methods
Define utility functions for preprocessing the corpus.

In [2]:
def flatten(l):
    return list(itertools.chain.from_iterable(l))

In [3]:
def normalize_corpus(texts, stops):
    # Use a set so each stopword lookup is O(1)
    stops = set(stops)

    # Lowercase
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove digits
    texts = [''.join(c for c in x if c not in string.digits) for x in texts]

    # Remove stopwords
    texts = [' '.join(word for word in x.split() if word not in stops) for x in texts]

    # Collapse extra whitespace
    texts = [' '.join(x.split()) for x in texts]

    # Remove empty strings
    texts = [x for x in texts if x != '']

    return texts
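
To see what the normalization steps do, here is a minimal check on two made-up reviews. The sample texts and the tiny stopword set are illustrative assumptions, not taken from the Amazon corpus or the NLTK stopword list, and the function is restated compactly only so that the snippet runs on its own.

```python
import string

def normalize_corpus(texts, stops):
    # Same steps as the notebook's function, restated compactly for a standalone run.
    stops = set(stops)
    drop = set(string.punctuation) | set(string.digits)
    texts = [''.join(c for c in x.lower() if c not in drop) for x in texts]
    texts = [' '.join(w for w in x.split() if w not in stops) for x in texts]
    return [x for x in texts if x != '']

# Hypothetical sample data for illustration only.
sample_texts = ["This is the BEST coffee, 100% recommended!!", "12345 ..."]
sample_stops = ["this", "is", "the"]

normalized = normalize_corpus(sample_texts, sample_stops)
print(normalized)  # ['best coffee recommended']
```

The second sample consists only of digits and punctuation, so it becomes an empty string and is dropped from the result.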

## Load and Normalize Corpus
We use the [Amazon Fine Food Reviews corpus](https://www.kaggle.com/snap/amazon-fine-food-reviews). If a normalized corpus has already been saved with pickle, we load it. Otherwise, we load the raw data, normalize it, and save the result. Preprocessing takes some time because this code is not optimized.

In [4]:
if os.path.exists("amazon_corpus.pkl"):
    print("Found amazon_corpus.pkl. Loading Corpus..")
    with open("amazon_corpus.pkl", "rb") as f:
        corpus = pickle.load(f)
        
else:
    print("No amazon_corpus.pkl. Preprocessing Corpus..")
    with open(os.path.join("/home/young/Dropbox/Study/Project/data", "Reviews.csv"), 'r') as f:
        reader = csv.reader(f)
        next(reader, None)
        corpus = [line[-1] for line in reader]

    stops = nltk.corpus.stopwords.words('english')
    corpus = normalize_corpus(corpus, stops)

    corpus = [line.split(' ') for line in corpus]
    
    with open("amazon_corpus.pkl", "wb") as f:
        pickle.dump(corpus, f)

print("DONE!")

Found amazon_corpus.pkl. Loading Corpus..
DONE!


In [11]:
len(flatten(corpus[:10000]))

387867

## Fit and Train Word2Vec

In [6]:
word2vec = Word2Vec.Word2VecModel(150,5, subsampling_threshold=1e-6, max_vocab_size=100000, negative_sample_size=10,
                                  learning_rate=0.01)
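
The `subsampling_threshold` argument presumably plays the role of the threshold `t` in Mikolov et al. (2013), where each occurrence of a word `w` is discarded with probability `1 - sqrt(t / f(w))` and `f(w)` is the word's relative frequency. The sketch below computes that probability for a toy token list. Whether the local `Word2Vec` module applies exactly this formula is an assumption, and `discard_probabilities` is a hypothetical helper name.

```python
from collections import Counter
import math

def discard_probabilities(tokens, threshold):
    # Hypothetical helper: P(discard w) = max(0, 1 - sqrt(threshold / f(w))),
    # where f(w) is the relative frequency of w in the token list.
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: max(0.0, 1.0 - math.sqrt(threshold / (count / total)))
        for word, count in counts.items()
    }

# Toy data: one very frequent word, one moderately frequent, one rare.
toy_tokens = ["the"] * 90 + ["coffee"] * 9 + ["arabica"]
probs = discard_probabilities(toy_tokens, threshold=0.05)

for word, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {p:.3f}")
```

Words whose relative frequency is below the threshold are never discarded, while frequent words are dropped more often. The paper describes thresholds around 1e-5 as typical, so a value of 1e-6 downsamples frequent words quite aggressively.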

In [7]:
word2vec.fit_to_corpus(corpus[:10000])

Instructions for updating:
Use the retry module or similar alternatives.


In [8]:
word2vec.train(10, log_dir="log/word2vec", save_dir="save/word2vec", print_every=1000)

Writing TensorBoard summaries to log
Saving TensorFlow models to save
--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 7092162
Total number of batches: 6990
--------------------------------------------------------------------------------
step: 1000, epoch:1, time/batch: 0.004911087989807129, avg_loss: 42.68687438964844
step: 2000, epoch:1, time/batch: 0.004730724334716797, avg_loss: 34.19025802612305
step: 3000, epoch:1, time/batch: 0.004783396244049072, avg_loss: 29.222558975219727
step: 4000, epoch:1, time/batch: 0.0048897314071655274, avg_loss: 27.20993423461914
step: 5000, epoch:1, time/batch: 0.005037831783294678, avg_loss: 25.04834747314453
Saved summaries at step 5000
Saved a model at step 5000
step: 6000, epoch:1, time/batch: 0.003989023685455322, avg_loss: 22.82687759399414
step: 7000, epoch:2, time/batch: 3.703451156616211e-05, avg_loss: 15.624125480651855
step: 8000, epoch:2, time/batch: 0.00447933125

## Test

In [9]:
def word_similarity(target, model):
    # Return the 10 words whose embeddings are most cosine-similar to the target
    target_V = model.embedding_for(target)

    similarities = []
    for word in model.words:
        if word == target:
            continue

        vector = model.embedding_for(word)
        # scipy returns cosine distance, so similarity = 1 - distance
        cosine_sim = 1 - spatial.distance.cosine(target_V, vector)
        similarities.append([word, cosine_sim])

    return sorted(similarities, key=lambda x: x[1], reverse=True)[:10]
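
The loop above calls scipy once per vocabulary word. If all embeddings are available as a single matrix, the same ranking can be computed in one matrix-vector product. This is a minimal sketch under that assumption: `top_k_similar`, `toy_words`, and `toy_embeddings` are hypothetical names and data, and how to obtain the full embedding matrix from the model is not shown.

```python
import numpy as np

def top_k_similar(target, words, embeddings, k=10):
    # words: list of vocabulary strings (assumed input).
    # embeddings: array of shape (len(words), dim), one row per word (assumed input).
    # Normalize rows to unit length so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    target_idx = words.index(target)
    sims = unit @ unit[target_idx]   # cosine similarity of every word to the target
    order = np.argsort(-sims)        # indices from most to least similar
    return [(words[i], float(sims[i])) for i in order if i != target_idx][:k]

# Toy 2-dimensional embeddings for illustration only.
toy_words = ["great", "good", "bad", "coffee"]
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [-1.0, 0.0],
    [0.0, 1.0],
])

result = top_k_similar("great", toy_words, toy_embeddings, k=2)
print(result)
```

The sketch assumes no embedding row is all zeros, since normalizing such a row would divide by zero.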

In [10]:
word_similarity('great', word2vec)

[['like', 0.7117627859115601],
 ['love', 0.7079678773880005],
 ['taste', 0.6982841491699219],
 ['br', 0.6952022314071655],
 ['make', 0.6903050541877747],
 ['good', 0.6882364153862],
 ['coffee', 0.682274341583252],
 ['would', 0.6748025417327881],
 ['well', 0.6683940887451172],
 ['product', 0.6657009124755859]]
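
The neighbor `br` in this list is most likely residue of HTML `<br />` line breaks in the raw reviews: removing punctuation turns `<br />` into the token `br`. One possible fix, sketched here as an assumption rather than something this notebook does, is to strip tags before normalization. `strip_html_tags` is a hypothetical helper, and the regular expression is a simple heuristic, not a full HTML parser.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <br /> or <a href="...">.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_html_tags(text):
    # Replace each tag with a space so neighboring words are not glued together.
    return TAG_PATTERN.sub(" ", text)

# Made-up review text for illustration only.
raw_review = "Great taste.<br /><br />Would buy again."
cleaned = strip_html_tags(raw_review)
print(cleaned.split())  # ['Great', 'taste.', 'Would', 'buy', 'again.']
```

Applying such a step to each review before `normalize_corpus` would keep `br` out of the vocabulary.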

In [13]:
word2vec.generate_tsne('log/word2vec/fig.png', size=(15,15), word_count=500)