# 2. GloVe - Global Vectors for Word Representation

### References
- [GloVe: Global Vectors for Word Representation, Pennington et al. 2014](https://www.aclweb.org/anthology/D14-1162)
- [GradySimon/tensorflow-glove](https://github.com/GradySimon/tensorflow-glove)

In [1]:
from models import GloVe
import nltk
import string
import itertools
import time
import os
import csv
import pickle
from scipy import spatial
%matplotlib inline

## Utility Methods
Define utility functions for preprocessing the corpus.

In [2]:
def flatten(l):
    return list(itertools.chain.from_iterable(l))

In [3]:
def normalize_corpus(texts, stops):
    # Lower case
    texts = [x.lower() for x in texts]

    # Remove punctuation
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]

    # Remove numbers
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]

    # Remove stopwords (a set makes each membership test constant time)
    stops = set(stops)
    texts = [' '.join(word for word in x.split() if word not in stops) for x in texts]

    # Trim extra whitespace
    texts = [' '.join(x.split()) for x in texts]
    
    # Remove empty strings
    texts = [x for x in texts if x != '']

    return texts
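
To make the effect of these steps concrete, here is a minimal sketch that applies the same sequence (lowercase, strip punctuation and digits, drop stopwords, collapse whitespace) to a single string. `normalize_text` and `TOY_STOPS` are hypothetical names used only for illustration; the notebook itself uses `normalize_corpus` with NLTK's English stopword list.

```python
import string

# Hypothetical, tiny stop list for illustration only.
TOY_STOPS = {"the", "is", "a", "this", "and", "i"}

def normalize_text(text, stops):
    """Apply the same steps as normalize_corpus to one string."""
    text = text.lower()
    # Drop punctuation and digits in a single pass.
    drop = set(string.punctuation) | set(string.digits)
    text = "".join(c for c in text if c not in drop)
    # Remove stopwords; split() also collapses repeated whitespace.
    return " ".join(w for w in text.split() if w not in stops)

sample = "This is the BEST coffee, and I bought 3 bags!!"
result = normalize_text(sample, TOY_STOPS)
print(result)  # best coffee bought bags
```

Note that punctuation is deleted rather than replaced by a space, so a contraction such as "don't" becomes "dont" before the stopword filter runs.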

## Load and Normalize Corpus
We will use the [Amazon Fine Food Reviews corpus](https://www.kaggle.com/snap/amazon-fine-food-reviews). If the corpus has already been normalized and saved with pickle, load it from disk. Otherwise, load the raw data, normalize it, and save the result. The first run takes some time because the preprocessing code is not optimized.

In [None]:
DATA_DIR = os.path.join("data", "amazon_food")

In [4]:
if os.path.exists(os.path.join(DATA_DIR,"amazon_corpus.pkl")):
    print("Found amazon_corpus.pkl. Loading Corpus..")
    with open(os.path.join(DATA_DIR,"amazon_corpus.pkl"), "rb") as f:
        corpus = pickle.load(f)
        
else:
    print("No amazon_corpus.pkl. Preprocessing Corpus..")
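    # newline='' lets the csv module handle line breaks inside quoted fields
    with open(os.path.join(DATA_DIR, "Reviews.csv"), 'r', newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row
        corpus = [line[-1] for line in reader]  # keep the last column of each row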

    stops = nltk.corpus.stopwords.words('english')
    corpus = normalize_corpus(corpus, stops)

    corpus = [line.split(' ') for line in corpus]
    
    with open(os.path.join(DATA_DIR, "amazon_corpus.pkl"), "wb") as f:
        pickle.dump(corpus, f)

print("DONE!")

Found amazon_corpus.pkl. Loading Corpus..
DONE!
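
The load-or-preprocess logic above is an instance of a general cache-or-build pattern. The following is a minimal sketch of that pattern in isolation; `load_or_build` and `build_toy_corpus` are hypothetical helpers, not part of this notebook or of any library, and the toy corpus stands in for the real preprocessing step.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build_fn):
    """Return the cached object if present; otherwise build, cache, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def build_toy_corpus():
    # Stand-in for the expensive normalization step.
    calls.append(1)
    return [["great", "coffee"], ["fast", "shipping"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_corpus.pkl")
    first = load_or_build(path, build_toy_corpus)   # builds and writes the cache
    second = load_or_build(path, build_toy_corpus)  # reads the cache
```

The second call never invokes `build_toy_corpus`, which is why rerunning the notebook prints "Found amazon_corpus.pkl" once the pickle exists.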


In [5]:
len(flatten(corpus[:10000]))

387867

## Fit and Train GloVe
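
For reference, the weighted least-squares objective from Pennington et al. (2014) is reproduced here. Whether the `avg_loss` value logged by the local `models.GloVe` module corresponds exactly to this quantity is an assumption; the formula itself is taken from the paper.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

$$
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, and $b_i$, $\tilde{b}_j$ are their biases. The paper uses $x_{\max} = 100$ and $\alpha = 3/4$.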

In [6]:
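# The positional arguments 150 and 5 are assumed to be the embedding size and the
# context window size, as in the referenced GradySimon/tensorflow-glove code.
glove = GloVe.GloVeModel(150, 5, max_vocab_size=100000, learning_rate=0.01)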

In [7]:
glove.fit_to_corpus(corpus[:10000])
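
Fitting to the corpus means building the co-occurrence statistics that GloVe trains on. In the GloVe paper, a pair of words that are d positions apart within the context window contributes 1/d to the count. The sketch below illustrates that counting scheme on a toy sentence. It is not the code in `models.GloVe`; `count_cooccurrences` is a hypothetical function, and the symmetric window is an assumption.

```python
from collections import defaultdict

def count_cooccurrences(sentences, window=5):
    """Distance-weighted, symmetric co-occurrence counts keyed by (word, context)."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, center in enumerate(tokens):
            # Visit up to `window` words to the left of the center word and
            # update both directions, which makes the counts symmetric.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[(center, tokens[j])] += weight
                counts[(tokens[j], center)] += weight
    return counts

toy = [["great", "coffee", "great", "price"]]
cooc = count_cooccurrences(toy, window=2)
print(cooc[("great", "coffee")])  # 2.0
print(cooc[("coffee", "price")])  # 0.5
```

"great" and "coffee" are adjacent twice, so they accumulate 1 + 1, while "coffee" and "price" are two positions apart and contribute 1/2.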

In [8]:
glove.train(10, log_dir="log/02_glove", save_dir="save/02_glove", print_every=1000)

Writing TensorBoard summaries to log/glove
Saving TensorFlow models to save/glove
--------------------------------------------------------------------------------
Created and Initialized fresh model. Size: 7115724
Total number of batches: 3495
--------------------------------------------------------------------------------
step: 1000, epoch:1, time/batch: 0.0035736639499664305, avg_loss: 164.47706604003906
step: 2000, epoch:1, time/batch: 0.0035365924835205078, avg_loss: 76.98371887207031
step: 3000, epoch:1, time/batch: 0.003534843683242798, avg_loss: 49.760887145996094
step: 4000, epoch:2, time/batch: 0.003539583683013916, avg_loss: 115.78126525878906
step: 5000, epoch:2, time/batch: 0.0035061142444610597, avg_loss: 100.78335571289062
Saved summaries at step 5000
Saved a model at step 5000
step: 6000, epoch:2, time/batch: 0.0034685146808624266, avg_loss: 84.24215698242188
step: 7000, epoch:3, time/batch: 0.0033266327381134032, avg_loss: 117.70085906982422
step: 8000, epoch:3, time/ba

## Test

In [9]:
def word_similarity(target, model):
    target_V = model.embedding_for(target)
    
    similarities = []
    for word in model.words:
        if word == target:
            continue
        
        vector = model.embedding_for(word)
        cosine_sim = 1 - spatial.distance.cosine(target_V, vector)
        similarities.append([word, cosine_sim])
    
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:10]
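
`1 - spatial.distance.cosine(u, v)` is the cosine similarity: the dot product of the two vectors divided by the product of their lengths. The following standard-library sketch spells that out and uses `heapq.nlargest` to keep only the top k results instead of sorting the whole vocabulary. `cosine_similarity`, `nearest_words`, and `toy_embeddings` are hypothetical names, and the 2-dimensional vectors are made up for illustration; they are not outputs of the trained model. The sketch assumes no embedding is the zero vector.

```python
import heapq
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_words(target, embeddings, k=10):
    """Top-k (word, similarity) pairs for `target`, excluding the target itself."""
    target_vec = embeddings[target]
    scored = ((word, cosine_similarity(target_vec, vec))
              for word, vec in embeddings.items() if word != target)
    # nlargest keeps only k candidates rather than sorting every word.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Made-up 2-dimensional embeddings for illustration.
toy_embeddings = {
    "great": [1.0, 0.0],
    "good": [0.9, 0.1],
    "bad": [-1.0, 0.0],
    "coffee": [0.0, 1.0],
}
neighbors = nearest_words("great", toy_embeddings, k=2)
print([word for word, _ in neighbors])  # ['good', 'coffee']
```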

In [10]:
word_similarity('great', glove)

[['product', 0.6368147730827332],
 ['good', 0.5886852145195007],
 ['also', 0.5703610777854919],
 ['like', 0.5643731951713562],
 ['buy', 0.5581755042076111],
 ['really', 0.5555024147033691],
 ['much', 0.5272473692893982],
 ['well', 0.5267638564109802],
 ['love', 0.5197827219963074],
 ['best', 0.5129992365837097]]

In [11]:
glove.generate_tsne('log/02_glove/fig.png', size=(15,15), word_count=500)