# CBOW Implementation on TF
`w266 Final Project: Crosslingual Word Embeddings`   

The code in this notebook is based on the Skip-Gram model in the [TensorFlow tutorial code](https://github.com/tensorflow/tensorflow/blob/r1.2/tensorflow/examples/tutorials/word2vec/word2vec_basic.py). I will first attempt to apply the basic Word2Vec algorithm to a sample of our data (Wikipedia dumps in English). Then I'll examine different ways of visualizing the resulting embeddings. Finally I will explore what it might look like to make [Duong et al.'s modifications](https://arxiv.org/pdf/1606.09403.pdf) to train crosslingual embeddings.

# Embeddings Overview 

__Basic Idea__: start with a 1-hot vector, pass it through a linear activation layer, then into a softmax, and optimize for the probability of nearby words (Skip-gram) or the center word (CBOW). The 'embeddings' are the parameters of the linear activation, which transform the 1-hot vector of size $|V|$ into an embedding of size $N$:
$$\text{Weight Matrix:}\qquad W \in \mathbb{R}^{|V|\times N}$$
$$\text{Bias (?):}\qquad b \in \mathbb{R}^{N}$$
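
To make the "embeddings are the parameters of the linear layer" point concrete, here is a minimal NumPy sketch. The toy sizes `V = 5`, `N = 3` and the random matrix are assumptions for illustration only. It shows that multiplying a 1-hot vector by $W$ just selects one row of $W$, which is why the TensorFlow code can use an embedding lookup instead of a matrix product.

```python
import numpy as np

V, N = 5, 3                                  # toy vocabulary and embedding sizes (assumed)
rng = np.random.RandomState(0)
W = rng.uniform(-1.0, 1.0, size=(V, N))      # weight matrix, one row per word

word_id = 2
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

via_matmul = np.dot(one_hot, W)   # linear layer applied to the 1-hot vector
via_lookup = W[word_id]           # equivalent row lookup

print(np.allclose(via_matmul, via_lookup))
```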

__Key Modifications:__ 
* Duong et al. use a CBOW-style algorithm but substitute a word's translation at training time, so that they learn embeddings for the target-language word based on the source-language context. (see section 4.1)
* As a result, instead of a single weight matrix, they use a concatenation of two (see section 4 intro):
$$\text{Context Matrix:}\qquad W \in \mathbb{R}^{|V|\times N}$$
$$\text{Embedding Matrix:}\qquad U \in \mathbb{R}^{|V|\times N}$$
* Since normalizing the softmax is costly, they instead optimize a _log-pseudo likelihood_ by learning to differentiate data from negative examples drawn from a noise distribution (following Mikolov 2013, see section 3). Note that the TF tutorial demonstrates a closely related technique, 'noise contrastive estimation'.
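
As a rough illustration of the negative-sampling idea in the last bullet, the sketch below computes the loss for a single training example in NumPy: one positive term for the true word plus one term per noise word. The function names, the toy sizes and the uniform noise distribution are all assumptions made here; the paper and TensorFlow's `tf.nn.nce_loss` differ in the details.

```python
import numpy as np

def log_sigmoid(x):
    # numerically stable log(sigmoid(x)) = -log(1 + exp(-x))
    return -np.logaddexp(0.0, -x)

def negative_sampling_loss(h, U, target_id, noise_ids):
    """h: (N,) hidden vector; U: (V, N) output embeddings."""
    pos = log_sigmoid(np.dot(U[target_id], h))            # true word scored high
    neg = np.sum(log_sigmoid(-np.dot(U[noise_ids], h)))   # noise words scored low
    return -(pos + neg)

rng = np.random.RandomState(0)
V, N, k = 10, 4, 3                        # toy sizes (assumed)
U = rng.normal(size=(V, N))
h = rng.normal(size=N)
noise_ids = rng.randint(0, V, size=k)     # uniform noise, a simplification
loss = negative_sampling_loss(h, U, target_id=1, noise_ids=noise_ids)
print(loss)
```

The cost per example depends on the number of noise samples `k` rather than on the vocabulary size, which is where the speed-up over a full softmax comes from.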

# Notebook Setup

In [1]:
# general imports
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import sys  
import math
import random
import sklearn
import numpy as np
import collections
import datetime as dt
import matplotlib
import matplotlib.pyplot as plt
import tensorflow as tf

# tell matplotlib not to open a new window
%matplotlib inline

In [2]:
###  MI - Added for zip file used in example files

import zipfile
from matplotlib import pylab
from six.moves import range
from six.moves.urllib.request import urlretrieve
from sklearn.manifold import TSNE

In [None]:
# filepaths
BASE = '/home/mmillervedam/Data'
FPATH_EN = BASE + '/test/wiki_en_10K.txt' # first 10000 lines from wiki dump
FPATH_ES = BASE + '/test/wiki_es_10K.txt' # first 10000 lines from wiki dump
FULL_EN = BASE + '/en/full.txt'
FULL_ES = BASE + '/es/full.txt'
DPATH = '/home/mmillervedam/ProjectRepo/XlingualEmb/data/dicts/en.es.panlex.all.processed'

In [3]:
# globals
# MI VOCAB_SIZE = 5000
VOCAB_SIZE = 10000

In [4]:
### MI - Added example file download for local testing -- we can delete later

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filename)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename

filename = maybe_download('text8.zip', 31344016)

Found and verified text8.zip


# Data Load & Tokenize

In [None]:
# Helper function
def read_data(filename):
    """
    Extract the file as a list of words.
    NOTE: this is modified from the original function in the TF
    tutorial, which expected a zipped input file.
    """
    with open(filename) as f:
        data = tf.compat.as_str(f.read()).split()
    return data

In [5]:
### MI Added for zip file to run locally - we can delete these later

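def read_data1(filename):
    """Extract the first file in a zip archive as a list of words."""
    # the context manager closes the archive even though we return early
    with zipfile.ZipFile(filename) as f:
        for name in f.namelist():
            # only the first file in the archive is read
            return tf.compat.as_str(f.read(name)).split()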
    
words = read_data1(filename)
print('Data size %d' % len(words))
print(words[:10])

Data size 17005207
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against']


In [None]:
# Tokenizer preserves order (see code in Appendix)
en_raw = read_data(FPATH_EN)
es_raw = read_data(FPATH_ES)

In [None]:
# take a look
print(en_raw[:10])
print(es_raw[:10])

__`NOTE!`__ We'll need to prepend 'en' and 'es' before training crosslingual versions.   
__`QUESTIONS:`__ Do we need to handle special characters and punctuation?
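
A possible way to do the language prefixing mentioned in the note above, sketched under the assumption that tokens are plain strings and that an underscore separator is acceptable. `prefix_tokens` is a hypothetical helper, not part of the tutorial code.

```python
def prefix_tokens(tokens, lang):
    """Tag each token with a language code, e.g. 'casa' -> 'es_casa'."""
    return ['%s_%s' % (lang, tok) for tok in tokens]

en_sample = prefix_tokens(['the', 'house'], 'en')
es_sample = prefix_tokens(['la', 'casa'], 'es')
print(en_sample + es_sample)
```

Prefixing before building the vocabulary keeps words that are spelled the same in both languages from collapsing into a single index.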

# Build Vocabulary

In [6]:
# Helper function
def build_dataset(words, n_words):
    """
    Process raw inputs into a dataset.
    Creates vocabulary from top n words indexed by rank.
    NOTE: this function is directly from TF tutorial
    """
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

In [7]:
### MI Added for test data

data, count, dictionary, reverse_dictionary = build_dataset(words, VOCAB_SIZE)
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10])
print('Length of Data',len(data))  #list
print(len(count)) #list
print(len(dictionary))#dict
print(len(reverse_dictionary))
print(count[:5])
print('----------------------')


Most common words (+UNK) [['UNK', 1737307], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
Length of Data 17005207
10000
10000
10000
[['UNK', 1737307], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764)]
----------------------


In [8]:
print(reverse_dictionary[5239])
print(count[5239])

anarchism
('anarchism', 303)


In [None]:
# Dataset Builder indexes by count (see code in Appendix)
en_data, en_counts, en_dict, en_index = build_dataset(en_raw, VOCAB_SIZE)
es_data, es_counts, es_dict, es_index = build_dataset(es_raw, VOCAB_SIZE)

In [None]:
#del en_raw  # Uncomment to reduce memory.
print("ENGLISH:")
print('Most common words (+UNK):\n', en_counts[:5])
print('Sample data:\n',' '.join(['%s(%s)'%(en_index[i],i) for i in en_data[:10]]))

In [None]:
# del es_raw  # Uncomment to reduce memory.
print("SPANISH:")
print('Most common words (+UNK)\n', es_counts[:5])
print('Sample data\n:',' '.join(['%s(%s)'%(es_index[i],i) for i in es_data[:10]]))

# Generate Batched Data

## Skip-Gram and CBOW ##
First we will implement a skip-gram model:

In [9]:
#################### PARAMETERS ####################
batch_size = 8 # Number of inputs to process at once.
num_skips = 2 # How many times to reuse an input to generate a context.
skip_window = 2 # How many words to consider left and right.
data_index = 0  # -see note below-

In [10]:
#################### PARAMETERS MI ####################
batch_size = 128 # Number of inputs to process at once.
num_skips = 2 # How many times to reuse an input to generate a context.
skip_window = 2 # How many words to consider left and right.
data_index = 0  # -see note below-

In [11]:
# Helper Function
def generate_batch(data, batch_size, num_skips, skip_window):
    """
    Function to generate a training batch for the skip-gram model.
    NOTE: this was modified from the original function in the TF
    tutorial by the adventuresinML tutorial - mostly just renamed.
    """
    
    global data_index
    
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    context = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window input_word skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # input word at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]  # this is the input word
            context[i * num_skips + j, 0] = buffer[target]  # these are the context words
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    # Backtrack a little bit to avoid skipping words in the end of a batch
    data_index = (data_index + len(data) - span) % len(data)
    return batch, context

## CBOW ##

In [12]:
def generate_batch_cbow(data, batch_size, bag_window):
    
    """
    Function to generate a training batch for the CBOW model.
    Modified generate_batch() to produce context batches.
    
    """    

    global data_index
    span = 2 * bag_window + 1 
    batch = np.ndarray(shape=(batch_size, span - 1), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  
    buffer = collections.deque(maxlen=span)
    
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
        
    for i in range(batch_size):

        buffer_list = list(buffer)
        labels[i, 0] = buffer_list.pop(bag_window)
        batch[i] = buffer_list

        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
        
    return batch, labels

__`NOTE:`__ The TF tutorial sets data_index as global inside the generate_batch function. Double check you're getting the expected behavior below b/c we're doubling up on languages. 
> `UPDATE`: OK - it looks like this is because the 'generate batch' function is used dynamically to window over the data. I'll figure out how to handle the global indexer when I get to the tensorflow portion of the code.
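
One way to sidestep the shared global indexer, sketched here as an assumption rather than what the notebook ends up doing: wrap the CBOW windowing logic in a small class so each language keeps its own cursor. `CBOWBatcher` is a hypothetical name, and it returns plain lists instead of NumPy arrays to keep the example short.

```python
import collections

class CBOWBatcher(object):
    """Hypothetical helper: CBOW batches with a per-instance cursor."""

    def __init__(self, data, bag_window):
        self.data = data
        self.bag_window = bag_window
        self.index = 0  # plays the role of the global data_index

    def next_batch(self, batch_size):
        span = 2 * self.bag_window + 1
        buf = collections.deque(maxlen=span)
        for _ in range(span):
            buf.append(self.data[self.index])
            self.index = (self.index + 1) % len(self.data)
        contexts, labels = [], []
        for _ in range(batch_size):
            window = list(buf)
            labels.append(window.pop(self.bag_window))  # center word
            contexts.append(window)                     # surrounding words
            buf.append(self.data[self.index])
            self.index = (self.index + 1) % len(self.data)
        return contexts, labels

# toy id sequences standing in for en_data and es_data
en_batcher = CBOWBatcher(list(range(100, 120)), bag_window=1)
es_batcher = CBOWBatcher(list(range(200, 220)), bag_window=1)
en_ctx, en_lab = en_batcher.next_batch(2)
es_ctx, es_lab = es_batcher.next_batch(2)
print(en_ctx, en_lab)
print(es_ctx, es_lab)
```

Because each batcher owns its index, drawing an English batch does not move the position in the Spanish data.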

In [13]:
### MI Testing generate_batch and generate_batch_cbow

print('data:', [reverse_dictionary[di] for di in data[:8]])

# Skip-gram
print('============= Skip-gram ===============')
num_skips = 2
skip_window = 1
batch, labels = generate_batch(data, batch_size = 8, num_skips = num_skips, skip_window = skip_window)
print('with num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
print('    batch:', [reverse_dictionary[bi] for bi in batch])
print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])

# CBOW
data_index = 0
print()
print('================ CBOW =================')

bag_window = 1
batch1, labels1 = generate_batch_cbow(data, batch_size = 8, bag_window = bag_window)
print('with bag_window = %d:' % (bag_window))
print('    context:', [(reverse_dictionary[bix], reverse_dictionary[bix2])  for (bix, bix2) in batch1])
print('    labels:', [reverse_dictionary[lix] for lix in labels1.reshape(8)])


data: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first']
with num_skips = 2 and skip_window = 1:
    batch: ['originated', 'originated', 'as', 'as', 'a', 'a', 'term', 'term']
    labels: ['anarchism', 'as', 'originated', 'a', 'term', 'as', 'of', 'a']

with bag_window = 1:
    context: [('anarchism', 'as'), ('originated', 'a'), ('as', 'term'), ('a', 'of'), ('term', 'abuse'), ('of', 'first'), ('abuse', 'used'), ('first', 'against')]
    labels: ['originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used']


In [None]:
############## ENGLISH BATCHES & CONTEXT #################
# batch = list of text segments represented by their indices
# contexts = corresponding skip_gram context set indices
en_batch, en_context = generate_batch(en_data, batch_size, 
                                      num_skips, skip_window)

In [None]:
# take a look
print('RAW BATCH:', en_batch)
print('RAW CONTEXT:', en_context.squeeze())
print("Decoded:")
for i in range(8):
    print("   ", en_batch[i], en_index[en_batch[i]],
        '->', en_context[i, 0], en_index[en_context[i, 0]])

In [None]:
############## SPANISH BATCHES & CONTEXT #################
# batch = list of text segments represented by their indices
# contexts = corresponding skip_gram context set indices
es_batch, es_context = generate_batch(es_data, batch_size, 
                                      num_skips, skip_window)

In [None]:
# take a look
print('RAW BATCH:', es_batch)
print('RAW CONTEXT:', es_context.squeeze())
print("Decoded:")
for i in range(8):
    print("   ", es_batch[i], es_index[es_batch[i]],
        '->', es_context[i, 0], es_index[es_context[i, 0]])

__`NOTE:`__ To implement Duong et al.'s work we'd perform the word substitution at this stage, replacing the words in the batch with the index of their translation... In fact we'd probably do so using a dictionary of indices for the vocab. 
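
The substitution step described in this note could look roughly like the sketch below. Everything here is an assumption for illustration: `substitute_translations` and `toy_translations` are hypothetical, the ids are made up, and choosing uniformly among candidate translations is a simplification made here, not the selection rule from the paper.

```python
import random

def substitute_translations(center_ids, index_translations, rng=random):
    """Replace each center-word id with a translation id when one is known."""
    out = []
    for wid in center_ids:
        candidates = index_translations.get(wid)
        # keep the original id when the dictionary has no entry for it
        out.append(rng.choice(candidates) if candidates else wid)
    return out

# hypothetical id-to-id dictionary: source-language id -> target-language ids
toy_translations = {3: [103], 7: [107, 108]}
labels_in = [3, 5, 7]
labels_out = substitute_translations(labels_in, toy_translations, random.Random(0))
print(labels_out)
```

Working with an index-to-index dictionary means the lookup happens once per batch on integers, with no need to decode ids back to strings.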

# TensorFlow Model - Using CBOW (testing 3 models)

__Step 1:__ Set up the model graph.

In [14]:
# recall that we set the vocabulary size at the top of the NB
print(VOCAB_SIZE)

10000


In [15]:
#################### PARAMETERS ####################
batch_size = 32 # Number of inputs to process at once.
embedding_size = 300 # Hidden layer representation size
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a context.

# Validation variables
valid_size = 8     # Random set of words to evaluate similarity on.
valid_window = 50  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)
num_sampled = 64    # Number of negative examples to sample.

In [17]:
# initialize the TF graph
graph = tf.Graph()

In [18]:
##################### DATA PLACEHOLDERS ####################
with graph.as_default():
    
    # Shape of the place holders have been modified for CBOW implementation
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size, bag_window * 2])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    train_one_hot = tf.one_hot(train_labels, VOCAB_SIZE)

In [19]:
#################### INPUT(EMBEDDING)LAYER #################
with graph.as_default():
    embeddings = tf.Variable(tf.random_uniform([VOCAB_SIZE, 
                                                embedding_size],
                                               -1.0, 1.0))
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)
    # average of the context embeddings (used for CBOW only); the
    # window width matches the train_inputs placeholder
    reduced_embed = tf.div(tf.reduce_sum(embed, 1), bag_window * 2)

In [20]:
######################## HIDDEN LAYER ######################
with graph.as_default():

    weights = tf.Variable(tf.truncated_normal([VOCAB_SIZE, embedding_size],
                              stddev=1.0 / math.sqrt(embedding_size)))
    biases = tf.Variable(tf.zeros([VOCAB_SIZE]))
    
    ## Here reduced_embed is used instead of embed for CBOW
    hidden_out = tf.matmul(reduced_embed, tf.transpose(weights)) + biases


We will first implement CBOW with the model used in the TF tutorial - Softmax Cross Entropy
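
For intuition, the forward pass of CBOW with a full softmax can be written in a few lines of NumPy. This is a sketch with assumed toy sizes and zero-initialized parameters, not the graph used in the notebook: average the context embeddings, score every vocabulary word, and take the cross entropy against the center word.

```python
import numpy as np

def cbow_softmax_loss(context_ids, target_id, E, W, b):
    """Cross entropy of one CBOW example under a full softmax."""
    h = E[context_ids].mean(axis=0)        # average the context embeddings
    logits = np.dot(W, h) + b              # one score per vocabulary word
    logits = logits - logits.max()         # stabilise the exponentials
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]

V, N = 6, 4                                # toy sizes (assumed)
E = np.zeros((V, N))
W = np.zeros((V, N))
b = np.zeros(V)
loss = cbow_softmax_loss([0, 2], 1, E, W, b)
print(loss, np.log(V))
```

With all-zero parameters the prediction is uniform, so the loss equals $\log |V|$. For `VOCAB_SIZE = 10000` that is about 9.21, a useful sanity check for the loss at the first training step.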

In [22]:
#################### Softmax Cross Entropy ##################
with graph.as_default():
    
    train_one_hot = tf.one_hot(train_labels, VOCAB_SIZE)
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(logits=hidden_out, 
                                                labels=train_one_hot))

    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(cross_entropy)

Because the full softmax model will be slow, we will test the two sampled loss functions provided in TensorFlow.  The first will be NCE Loss and the second will be Sampled Softmax.

In [23]:
########################  NCE Loss  ########################
with graph.as_default():
    
    nce_loss = tf.reduce_mean(tf.nn.nce_loss(weights = weights, 
                                         biases = biases, 
                                         inputs = tf.reduce_sum(embed, 1), 
                                         labels = train_labels, 
                                         num_sampled= num_sampled, 
                                         num_classes= VOCAB_SIZE,
                                         partition_strategy="div"))

    nce_optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(nce_loss)

__`NOTE:`__ If we're going to set up experiments/comparisons between different embedding training methods (eg. Duongs word2vec modification vs the post training aligned word vectors referenced in the Babylon Repo)... we'll want to fix the embedding size across the multiple models. Maybe even fix the initialization for the weights?-- no in this case the weights are irrelevant across models b/c they'll be optimizing different things. Presumably part of what we're interested in is comparisons of speed to train in concert w/ efficacy on the translation task and random initialization always begs the question of 'did we just get lucky'.
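
If we do decide to control initialization across experiments, one option (an assumption, not something the notebook does yet) is to generate the initial embedding matrix from a seeded NumPy generator and pass that array as the initial value of the TensorFlow variable. `initial_embeddings` is a hypothetical helper.

```python
import numpy as np

def initial_embeddings(vocab_size, embedding_size, seed):
    """Uniform(-1, 1) initial embedding matrix, reproducible for a given seed."""
    rng = np.random.RandomState(seed)
    values = rng.uniform(-1.0, 1.0, size=(vocab_size, embedding_size))
    return values.astype(np.float32)

init_a = initial_embeddings(10000, 300, seed=42)
init_b = initial_embeddings(10000, 300, seed=42)
print(np.array_equal(init_a, init_b))
```

Repeating a run over a handful of seeds would also give a rough answer to the "did we just get lucky" question.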

In [24]:
####################### Sampled Softmax ######################
with graph.as_default():
    sampled_loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(weights = weights,
                                                     biases = biases, 
                                                     inputs = tf.reduce_sum(embed, 1), 
                                                     labels = train_labels, 
                                                     num_sampled= num_sampled, 
                                                     num_classes= VOCAB_SIZE))
    sampled_optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(sampled_loss)

__Step 2:__ Set up the validation set: a randomly chosen set of words used to track our progress as we train. By construction we pick words from the 50 most frequent in the vocabulary (`valid_window = 50`), then use cosine similarity to find their nearest neighbors in the embedding matrix.
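
The nearest-neighbor lookup can be checked outside TensorFlow with a small NumPy sketch. The toy 2-d embeddings and the `nearest_neighbors` helper are assumptions for illustration. Rows are normalized to unit length so that a dot product gives the cosine similarity.

```python
import numpy as np

def nearest_neighbors(embeddings, query_ids, top_k):
    """Ids of the top_k most cosine-similar rows for each query id."""
    norms = np.sqrt((embeddings ** 2).sum(axis=1, keepdims=True))
    normalized = embeddings / norms
    sim = np.dot(normalized[query_ids], normalized.T)   # (len(query_ids), V)
    # the best match is the query word itself, so skip position 0
    return np.argsort(-sim, axis=1)[:, 1:top_k + 1]

toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbors = nearest_neighbors(toy, [0], top_k=2)
print(neighbors)
```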

In [25]:
###################### VALIDATION EXAMPLES #################
valid_size = 8    # Random set of words to evaluate similarity on.
valid_window = 50  # Only pick dev samples in the head of the distribution.
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

with graph.as_default():
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

In [26]:
##################### SIMILARITY CALCULATION ################
with graph.as_default():
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
    

In [27]:
# Variable initializer
with graph.as_default():
    init = tf.global_variables_initializer()

__Step 3:__ Run the model & track progress by examining the matches for words in our validation set.

In [28]:

data_index = 0 # used to track batches

# For testing various models with CBOW

NCE = 0
SMAX_X_ENTROPY = 1
SAMPLED_SMAX = 2


def run_cbow(graph, num_steps, model = SAMPLED_SMAX ):
    """Runner code for the CBOW word2vec TF model.

    `model` selects the loss: NCE, SMAX_X_ENTROPY or SAMPLED_SMAX."""
    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print('Initialized')

        average_loss = 0
        for step in range(num_steps):
            # Fetch the next CBOW batch: context windows and center-word labels
            batch_inputs, batch_labels = generate_batch_cbow(data,
                                                         batch_size, 
                                                         bag_window)
            feed_dict = {train_inputs: batch_inputs, 
                         train_labels: batch_labels}

            # We perform one update step by evaluating the optimizer op 
            
            # We tested various models
            # 1) NCE optimizer
            # 2) Softmax Cross Entropy 
            # 3) Sampled Softmax
                       
            if model == NCE:
                _, loss_val = session.run([nce_optimizer, nce_loss], 
                                          feed_dict=feed_dict)
            elif model == SMAX_X_ENTROPY:
                _, loss_val = session.run([optimizer, cross_entropy], 
                                          feed_dict=feed_dict)
            else:
                _, loss_val = session.run([sampled_optimizer, sampled_loss], 
                                          feed_dict=feed_dict)
                
            average_loss += loss_val

            if step % 2000 == 0:
                if step > 0:
                    average_loss /= 2000
                # The average loss is an estimate of the loss over the last 2000 batches.
                print('Average loss at step ', step, ': ', average_loss)
                average_loss = 0


            # Note that this is expensive (~20% slowdown if computed every 500 steps)
            if step % 10000 == 0:
                sim = similarity.eval()
                for i in range(valid_size):
                    valid_word = reverse_dictionary[valid_examples[i]]
                    top_k = 8  # number of nearest neighbors
                    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                    log_str = 'Nearest to %s:' % valid_word
                    for k in range(top_k):
                        close_word = reverse_dictionary[nearest[k]]
                        log_str = '%s %s,' % (log_str, close_word)
                    print(log_str)

Runner Call

### Model Testing
First, the Softmax Cross Entropy Optimizer is tested.  This model is the baseline from the TF tutorial.

In [29]:
num_steps = 10001
start_time = dt.datetime.now()
run_cbow(graph, num_steps=num_steps, model = SMAX_X_ENTROPY)
end_time = dt.datetime.now()
print("Softmax Cross Entropy method took {} seconds to run 10000 iterations".format((end_time - start_time).total_seconds()))

Initialized
Average loss at step  0 :  9.27666282654
Nearest to zero: chemistry, theft, linguistics, missiles, abstraction, millions, kinds, share,
Nearest to is: maltese, wasn, garbage, ferdinand, ukraine, marks, reconstruction, blues,
Nearest to seven: vol, capacitor, textiles, sources, antony, corps, era, cp,
Nearest to nine: epic, fellow, importantly, snyder, arizona, special, spending, falling,
Nearest to not: civic, beverages, teacher, mitchell, increase, featured, roma, wwii,
Nearest to it: clown, playoffs, proceedings, generalized, contested, extreme, fl, anton,
Nearest to have: itself, simulation, enemy, label, catholicism, arabia, devoted, framework,
Nearest to for: escaped, player, bone, wax, reduces, european, advantages, going,
Average loss at step  2000 :  6.17026983762
Average loss at step  4000 :  5.80681890154
Average loss at step  6000 :  5.63998748465
Average loss at step  8000 :  5.53292568243
Average loss at step  10000 :  5.52537331802
Nearest to zero: nine, five,

In [30]:
num_steps = 10001
start_time = dt.datetime.now()
run_cbow(graph, num_steps=num_steps, model = NCE)
end_time = dt.datetime.now()
print("NCE method took {} seconds to run 10000 iterations".format((end_time - start_time).total_seconds()))

Initialized
Average loss at step  0 :  234.274353027
Nearest to zero: silver, poetry, stationary, pierce, regulation, insulin, ak, besides,
Nearest to is: hung, contexts, flows, astronomical, acquire, signing, krak, manufacture,
Nearest to seven: friendship, sit, neutron, jew, contributors, costume, billion, bits,
Nearest to nine: attained, bet, applied, southeast, saul, serbia, elite, morning,
Nearest to not: read, casting, faculty, relational, muhammad, shut, dean, exposition,
Nearest to it: lanka, anne, arrest, impressed, weber, scots, tunnel, receivers,
Nearest to have: kissinger, scores, yet, send, performers, precision, texts, six,
Nearest to for: catalan, klan, existed, underworld, myth, arc, arrangements, potatoes,
Average loss at step  2000 :  43.2367338331
Average loss at step  4000 :  16.4807954127
Average loss at step  6000 :  869893967876.0
Average loss at step  8000 :  6.3441603282e+17
Average loss at step  10000 :  1.00599554835e+19
Nearest to zero: three, hermes, him, e

__Looks like something is wrong here:__
The average loss falls through step 4000 (234.3, then 43.2, then 16.5) but then diverges, reaching roughly 1e19 by step 10000. Need to investigate. 
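
While investigating, a small check like the one below could flag a blow-up as soon as it happens so the run can be stopped early. `first_divergence` is a hypothetical helper and the factor of 10 is an arbitrary threshold chosen for this sketch. The loss values are the averages printed by the NCE run above.

```python
import math

def first_divergence(losses, factor=10.0):
    """Index of the first loss that is non-finite or > factor * previous, else None."""
    for i in range(1, len(losses)):
        prev, cur = losses[i - 1], losses[i]
        if not math.isfinite(cur) or cur > factor * prev:
            return i
    return None

# average losses logged every 2000 steps in the NCE run
nce_log = [234.274353027, 43.2367338331, 16.4807954127,
           869893967876.0, 6.3441603282e+17, 1.00599554835e+19]
print(first_divergence(nce_log))
```

Here the check points at the fourth logged value (index 3, step 6000), which narrows the window in which the loss started to explode.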

In [31]:
num_steps = 10001
start_time = dt.datetime.now()
run_cbow(graph, num_steps=num_steps, model = SAMPLED_SMAX)
end_time = dt.datetime.now()
print("Sampled Softmax method took {} seconds to run 10000 iterations".format((end_time - start_time).total_seconds()))

Initialized
Average loss at step  0 :  6.75405597687
Nearest to zero: maximum, theatre, woods, curious, favoured, bosnia, restriction, crushed,
Nearest to is: producer, processor, precursor, productivity, disc, asian, hull, na,
Nearest to seven: dublin, marvin, ken, disputes, perfect, office, comparing, helsinki,
Nearest to nine: up, monument, cyberpunk, challenges, trivial, count, acquisition, required,
Nearest to not: summer, or, algorithm, orthography, acupuncture, wonder, standard, testimony,
Nearest to it: bruce, mars, parsons, contribution, assume, regarding, areas, unstable,
Nearest to have: dense, arrested, autism, duck, singers, exposition, horace, athletic,
Nearest to for: joining, dominion, regime, malay, farms, hurt, monster, given,
Average loss at step  2000 :  4.40612527955
Average loss at step  4000 :  3.97505038351
Average loss at step  6000 :  3.854579023
Average loss at step  8000 :  3.50023281932
Average loss at step  10000 :  3.49135156456
Nearest to zero: four, sev

__Results Discussion__:
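Of the two sampled loss functions, sampled softmax produced the better results: its average loss fell steadily to about 3.49 at step 10000, while the NCE run diverged. (Full softmax cross entropy reached about 5.53, though loss values from different objectives are not directly comparable.) We will proceed with sampled softmax for the rest of the implementation.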

In [None]:
num_steps = 10001
start_time = dt.datetime.now()
run_cbow(graph, num_steps=num_steps)  # default model is SAMPLED_SMAX
end_time = dt.datetime.now()
print("Sampled Softmax method took {} seconds to run 10000 iterations".format((end_time - start_time).total_seconds()))

In [None]:
# NOTE: output from ^^ saved to:
path = 'wtv_output/en_smalldata_10Kiter_fullsfmx.txt'
!tail -n 1 {path}

In [None]:
# take a look at loss
!grep 'Average' {path} | tail

In [None]:
# take a look at NN for 'them'
!grep 'Nearest to them:' {path} |tail

__`NOTE:`__ The data ^^ are undoubtedly too small... 'Alabama' shouldn't appear in the top 100 words. However I'll wait to look at larger data with the sampling method which is much more efficient.

# TensorFlow Model using CBOW w/ Sampled Softmax (faster) #

We'll write it as a class this time for ease of calling later.  TODO:

In [None]:
# Helper function
def with_self_graph(function):
    """Decorator-foo borrowed from w266 a4."""
    def wrapper(self, *args, **kwargs):
        with self.graph.as_default():
            return function(self, *args, **kwargs)
    return wrapper

In [None]:
from sklearn.manifold import TSNE
#from __future__ import unicode_literals

class Word2Vec(object):
    """Single Layer Neural Net to Learn Word Embeddings."""
    # This code was adapted from:
    # SOURCE: https://github.com/tensorflow/tensorflow
    #         /blob/r1.2/tensorflow/examples/tutorials
    #         /word2vec/word2vec_basic.py
    
    def __init__(self, graph=None, *args, **kwargs):
        """
        Initialize TensorFlow Neural Net Model.
        Args:
          V: vocabulary size
          H: embedding size
          
        Kwargs:
          softmax_ns = 64  (number of negative samples)
          learning_rate = 1.0  (stored as self.alpha)
          
        Sets self.examples to 10 ids drawn from the top 100 words for validation.
        """
        # Set TensorFlow graph. All TF code will work on this graph.
        self.graph = graph or tf.Graph()
        self.SetParams(*args, **kwargs)
        
    @with_self_graph # TODO : remove this unless we plan to init as tf.const
    def SetParams(self, V, H, softmax_ns=64, learning_rate=1.0):
        # Model structure.
        self.V = V
        self.H = H
        
        # Training hyperparameters
        self.softmax_ns = softmax_ns
        self.alpha = learning_rate
        
        # Words for validation
        self.examples = np.random.choice(100, 10, replace=False)
        
        # Results
        self.epochs_trained = 0
        self.final_embeddings = None
            
    @with_self_graph
    def BuildCoreGraph(self):
        
        batch_size = 128 # TODO : I've hard coded this for now b/c I want to get
                         # the rest of the code running, but eventually this should
                         # be inferred dynamically from the input shape as in a4.
        
        # Data Placeholders
        self.inputs_ = tf.placeholder(tf.int32, shape=[batch_size])
        self.context_ = tf.placeholder(tf.int32, shape=[batch_size, 1])
        
        # Embedding Layer
        with tf.variable_scope("Embedding_Layer"):
            self.embeddings_ = tf.Variable(tf.random_uniform([self.V, self.H], 
                                                             -1.0, 1.0), name = 'Embeddings')
            self.embed_ = tf.nn.embedding_lookup(self.embeddings_, self.inputs_)
            # Normalized Embeddings facilitate cosine similarity calculation
            # .... but don't train on these! they're just for evaluation!
            self.norm_ = tf.sqrt(tf.reduce_sum(tf.square(self.embeddings_), 1, keep_dims=True))
            self.normalized_embeddings_ = self.embeddings_ / self.norm_
            
        # Hidden Layer
        with tf.variable_scope("Hidden_Layer"):
            self.W_ = tf.Variable(tf.truncated_normal([self.V, self.H],
                                  stddev=1.0 / math.sqrt(self.H)), name = 'W')
            self.b_ = tf.Variable(tf.zeros([self.V,], dtype=tf.float32), name = 'b')
            self.logits_ = tf.matmul(self.embed_, tf.transpose(self.W_)) + self.b_
           
    @with_self_graph
    def BuildTrainingGraph(self):
        with tf.variable_scope("Training"):
            nce_args = dict(weights=self.W_, 
                            biases=self.b_, 
                            labels=self.context_, 
                            inputs=self.embed_, 
                            num_sampled=self.softmax_ns, 
                            num_classes=self.V)
            self.nce_loss_ = tf.reduce_mean(tf.nn.nce_loss(**nce_args))
            self.optimizer_ = tf.train.GradientDescentOptimizer(self.alpha)
            self.train_step_ = self.optimizer_.minimize(self.nce_loss_)
        
    @with_self_graph
    def BuildValidationGraph(self):
        self.test_ = tf.constant(self.examples, dtype=tf.int32)
        self.test_embed_ = tf.nn.embedding_lookup(self.normalized_embeddings_, 
                                                  self.test_)
        self.similarity = tf.matmul(self.test_embed_, 
                                    self.normalized_embeddings_, 
                                    transpose_b=True)
        
    def learn_embeddings(self, num_steps, batch_fxn, data, index, verbose = True):
        """
        Runs a specified number of training steps.
        NOTE: right now the batch fxn is hard coded with inputs: 
                  (data,batch_size=128,num_skips=2,skip_window=2)
              It should output two arrays representing the input & 
              context indices for a single batch. 
              TODO: replace this with something less clunky!
        """
        
        with tf.Session(graph=self.graph) as session:
            
            # initialize all variables
            init = tf.global_variables_initializer()
            init.run()
            print('... Model Initialized')
            if verbose:
                for var in tf.trainable_variables():
                    print("\t", var)
        
            # iterate through specified number of training steps
            average_loss = 0
            for step in range(num_steps):
                # Get the next batch of inputs & their skipgram context
                batch_inputs, batch_context = batch_fxn(data, 128, 2, 2)

                # Run the train op
                feed_dict = {self.inputs_: batch_inputs, self.context_: batch_context}
                _, loss_val = session.run([self.train_step_, self.nce_loss_], 
                                          feed_dict=feed_dict)
                
                # Logging Progress
                average_loss += loss_val
                # guard against a zero interval (and a modulo-by-zero) when num_steps is small
                loss_logging_interval = max(1, num_steps // 10)
                sim_logging_interval = max(1, num_steps // 5)
                if not verbose:
                    continue
                if step % loss_logging_interval == 0:
                    if step > 0:
                        average_loss /= loss_logging_interval
                    # Average loss over the last loss_logging_interval batches (a single batch at step 0).
                    print('Average loss at step ', step, ': ', average_loss)
                    average_loss = 0  
                if step % sim_logging_interval == 0:
                    sim = self.similarity.eval()
                    for i in xrange(len(self.examples)):
                        word = index[self.examples[i]]
                        top_k = 8  # number of nearest neighbors
                        nearest = (-sim[i, :]).argsort()[1:top_k + 1]
                        log_str = '   Nearest to %s:' % word
                        for k in xrange(top_k):
                            nbr = index[nearest[k]]
                            log_str = '%s %s,' % (log_str, nbr)
                        print(log_str)
            # results
            self.epochs_trained = num_steps
            self.final_embeddings = self.normalized_embeddings_.eval()
        return self.final_embeddings
    
    def plot_embeddings_in_2D(self, num, index):
        """ 
        Plot 2D representation of embeddings.
        Args: 
            num = int (number of examples to plot, taken from the first rows)
            index = reverse dictionary of word indices
        """
        if self.final_embeddings is None:
            print("You must train the embeddings before plotting.")
        else:
            tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
            low_dim_embs = tsne.fit_transform(self.final_embeddings[:num, :])
            labels = [index[i] for i in xrange(num)]
            plt.figure(figsize=(18, 18))  # in inches
            for i, label in enumerate(labels):
                x, y = low_dim_embs[i, :]
                plt.scatter(x, y)
                plt.annotate(str(label), xy=(x, y), xytext=(5, 2), 
                             textcoords='offset points', ha='right', va='bottom')
            plt.show()
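
The validation graph and the similarity logging in `learn_embeddings` boil down to a cosine-similarity lookup over row-normalized embeddings. Below is a minimal NumPy sketch of that step on a toy matrix. `nearest_neighbors` is a hypothetical helper name (not part of the class above), and it assumes no embedding row is all zeros.

```python
import numpy as np

def nearest_neighbors(embeddings, query_ids, top_k=8):
    """Return the top_k most cosine-similar row indices for each query row."""
    # Normalize rows to unit length so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    sim = normalized[query_ids].dot(normalized.T)
    # Sort by descending similarity and drop position 0, which is assumed
    # to be the query word itself (similarity 1.0).
    order = np.argsort(-sim, axis=1)
    return order[:, 1:top_k + 1]

# Toy 2-D "embeddings": row 1 points almost the same way as row 0.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(nearest_neighbors(toy, [0], top_k=2))  # prints [[1 2]]
```

This mirrors the `(-sim[i, :]).argsort()[1:top_k + 1]` line in the class, just without a TF session.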

__Step 0:__ Data prep (_we did this above, just running a few checks here_)

In [None]:
# We'll be using the shortened version of the English Wikipedia File
print('Corpus Size: %s Words' % (len(en_data)))
print('Vocabulary Size: %s Words' % (len(en_counts)))

In [None]:
# Additional Parameters for batch function
BATCH_SIZE = 128 # Number of inputs to process at once.
EMBEDDING_SIZE = 128 # Hidden layer representation size

In [None]:
# NOTE: the following are hard coded into the class above, just here for reference 
# TODO make a better batch iterator & a data handler so we don't have to do this!
skip_window = 2 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a context.
data_index = 0 # Used to track batches for now, TODO fix batch iterator!
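
To make `skip_window` and `num_skips` concrete, here is a rough sketch of how (input, context) pairs could be drawn from a list of word ids. This is not the notebook's `generate_batch` (which tracks its position through `data_index`); `skipgram_pairs` and its `seed` argument are hypothetical names used only for illustration.

```python
import random

def skipgram_pairs(data, skip_window, num_skips, seed=0):
    """Return a list of (input, context) word-id pairs."""
    # Each input can be reused at most once per position in its window.
    assert num_skips <= 2 * skip_window
    rng = random.Random(seed)
    pairs = []
    # Only centers with a full window on both sides are used in this sketch.
    for center in range(skip_window, len(data) - skip_window):
        # Positions within skip_window of the center, excluding the center.
        window = [center + off
                  for off in range(-skip_window, skip_window + 1)
                  if off != 0]
        # Reuse the same input num_skips times with distinct context words.
        for pos in rng.sample(window, num_skips):
            pairs.append((data[center], data[pos]))
    return pairs

pairs = skipgram_pairs([10, 11, 12, 13, 14, 15], skip_window=2, num_skips=2)
print(pairs)
```

With six tokens and `skip_window=2`, only the tokens at positions 2 and 3 have a full window, so the sketch yields `2 * num_skips = 4` pairs.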

__Step 1:__ Create Model & Initialize TF Graph

In [None]:
model = Word2Vec(V=VOCAB_SIZE, H=EMBEDDING_SIZE)
model.BuildCoreGraph()
model.BuildTrainingGraph()
model.BuildValidationGraph()

> __`Question for Mona & Roseanna:`__ When does it make sense to keep these graph-building methods separate, and when should they all be part of the same class method? In this case we're never going to run 'inference' except in the context of the test/validation exercise, because we don't really care about the ultimate prediction of context words; we care about the embeddings. I've left these as 3 methods for now (following the lead of A4), but I wonder if we could combine two or all of them when we create our Xlingual version of this class.

__Step 2:__ Train the model.   
__`Note:`__ The training function logs the NCE loss (a sampling-based stand-in for the full softmax loss). I'm not sure that's terribly instructive.

In [None]:
NSTEPS = 50000
start_time = dt.datetime.now()
embeddings = model.learn_embeddings(NSTEPS, generate_batch, en_data, en_index)
end_time = dt.datetime.now()
total = (end_time - start_time).total_seconds()
print("NCE method took {} seconds to run {} iterations".format(total, NSTEPS))
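
For intuition about the number being logged, here is a simplified NumPy sketch of a textbook NCE loss for a single (input, target) pair. It is an illustration under stated assumptions, not a reimplementation of `tf.nn.nce_loss`: the candidate sampler and the sharing of negatives across a batch are not reproduced, and `nce_loss_single`, `noise_ids` and `noise_probs` are hypothetical names. `W` and `b` follow the shapes used in the class (`[V, H]` and `[V]`).

```python
import numpy as np

def nce_loss_single(h, W, b, target, noise_ids, noise_probs):
    """Simplified NCE loss for one hidden vector h and one true target id."""
    noise_ids = np.asarray(noise_ids)
    k = len(noise_ids)

    def adjusted_logits(ids):
        # Model score minus the log of the expected noise count k * Q(w).
        return W[ids].dot(h) + b[ids] - np.log(k * noise_probs[ids])

    pos = adjusted_logits(np.array([target]))
    neg = adjusted_logits(noise_ids)
    # Binary cross-entropy: the true word is labeled 1, noise words 0.
    # -log(sigmoid(x)) == log(1 + exp(-x)); -log(1 - sigmoid(x)) == log(1 + exp(x))
    return float(np.logaddexp(0.0, -pos).sum() + np.logaddexp(0.0, neg).sum())

V, H, k = 4, 3, 4
W = np.zeros((V, H))           # untrained output weights
b = np.zeros(V)
h = np.ones(H)                 # stand-in for one row of the embedding matrix
Q = np.full(V, 1.0 / V)        # uniform noise distribution (an assumption)
loss = nce_loss_single(h, W, b, target=1, noise_ids=[0, 2, 3, 0], noise_probs=Q)
print(loss)  # equals (1 + k) * log(2) here, since every adjusted logit is 0
```

In this sketch the loss is a sum over one true word and `k` noise words, so its scale depends on the number of negatives. That is one reason to read the logged value as a relative trend rather than as an absolute quality measure.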

__Step 3:__ Plot the resulting embeddings.  

In [None]:
model.plot_embeddings_in_2D(300, en_index)

__`NOTE:`__ This plotting code clearly needs some work... but cool idea. I think it makes sense to rewrite the method so that it accepts a specific set of input words or indices, not just a number of top words to plot.
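
One possible shape for that rewrite is to select the rows to plot by word first. The sketch below assumes `index` is a dict mapping row id to word (the 'reverse dictionary' described in the method's docstring). `select_embeddings` is a hypothetical helper, not something the class currently has.

```python
import numpy as np

def select_embeddings(embeddings, index, words):
    """Return (labels, rows) for the requested words found in the vocabulary."""
    # index maps row id -> word; invert it so rows can be looked up by word.
    word_to_id = {word: i for i, word in index.items()}
    # Silently skip words that are not in the vocabulary.
    found = [w for w in words if w in word_to_id]
    ids = np.array([word_to_id[w] for w in found], dtype=int)
    return found, embeddings[ids, :]

# Toy example: 4 words with 2-D embeddings.
toy_embeddings = np.arange(8.0).reshape(4, 2)
toy_index = {0: 'UNK', 1: 'the', 2: 'of', 3: 'and'}
labels, rows = select_embeddings(toy_embeddings, toy_index, ['of', 'missing', 'the'])
print(labels)
print(rows)
```

The returned `rows` could then go to `tsne.fit_transform` and `labels` to `plt.annotate`, in place of the `[:num, :]` slice used now.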

__`Also NOTE:`__ matplotlib is going to throw a fit when it encounters non-ASCII characters. [This SO post](https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte) explains that this is a Python 2 problem and suggests `from __future__ import unicode_literals` might help... but it doesn't seem to. Another SO post suggested that `sys.setdefaultencoding('utf-8')` should fix it, but that causes [this print problem](https://stackoverflow.com/questions/25494182/print-not-showing-in-ipython-notebook-python) (print output of Jupyter cells gets redirected to the terminal). The suggested solution (below) seems to work. I may switch to using plotly, in which case this won't matter. Otherwise we should dig into this a bit more.

In [None]:
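# set the default encoding so matplotlib can handle accents (Python 2 only);
# stdout is saved and restored because reload(sys) redirects cell output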
stdout = sys.stdout
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout = stdout

In [None]:
print('please work')

# Word2Vec on Full Spanish Data Set

__Step 0:__ Data Prep

In [None]:
VOCAB_SIZE = 10000
BATCH_SIZE = 128 # Number of inputs to process at once.
EMBEDDING_SIZE = 128 # Hidden layer representation size
data_index = 0 # Used to track batches for now, TODO fix batch iterator!

In [None]:
# read in raw file
es_raw = read_data(FULL_ES)

In [None]:
# parse data into dictionary
es_data, es_counts, es_es_dict, es_index = build_dataset(es_raw, VOCAB_SIZE)

In [None]:
# take a look
del es_raw  # reduce memory.
print('Corpus Size: %s Words' % (len(es_data)))
print('Vocabulary Size: %s Words' % (len(es_counts)))
print('Most common words (+UNK)\n', es_counts[:5])
print('Sample data:\n',' '.join(['%s(%s)'%(es_index[i],i) for i in es_data[:10]]))

__Step 1:__ Initialize Model

In [None]:
model = Word2Vec(V=VOCAB_SIZE, H=EMBEDDING_SIZE)
model.BuildCoreGraph()
model.BuildTrainingGraph()
model.BuildValidationGraph()

__Step 2:__ Train Model.

In [None]:
NSTEPS = 100001
start_time = dt.datetime.now()
embeddings = model.learn_embeddings(NSTEPS, generate_batch, es_data, es_index)
end_time = dt.datetime.now()
total = (end_time - start_time).total_seconds()
print("NCE method took {} seconds to run {} iterations".format(total, NSTEPS))

__Step 3:__ Plot a few of the embeddings.

In [None]:
model.plot_embeddings_in_2D(500, es_index)

__Step 4:__ Save embeddings & index

In [None]:
# confirm dim
model.final_embeddings.shape

In [None]:
# dictionary
len(es_index)

In [None]:
# make them a tuple to pickle
embeddings_tuple = (es_index, model.final_embeddings)

In [None]:
# save to file
import pickle
filename = './wtv_output/es_w2v_100K_embed.pkl'
with open(filename, 'wb') as f:
    pickle.dump(embeddings_tuple, f)

In [None]:
# confirm reload
filename = './wtv_output/es_w2v_100K_embed.pkl'
with open(filename, 'rb') as f:
    test_tuple = pickle.load(f)

In [None]:
test_tuple[1][:10]
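
Once the tuple is reloaded, looking up a word's vector needs a word-to-row mapping. Here is a minimal round-trip sketch on toy data, assuming the pickled tuple has the `(index, embeddings)` layout saved above with `index` as a dict from row id to word. `load_embeddings` and the temporary file path are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

def load_embeddings(path):
    """Load an (index, embeddings) tuple and return (word_to_id, embeddings)."""
    with open(path, 'rb') as f:
        index, embeddings = pickle.load(f)
    # Invert the id -> word index so vectors can be fetched by word.
    word_to_id = {word: i for i, word in index.items()}
    return word_to_id, embeddings

# Toy stand-ins for es_index and model.final_embeddings.
toy_index = {0: 'UNK', 1: 'de', 2: 'la'}
toy_embeddings = np.eye(3)
path = os.path.join(tempfile.mkdtemp(), 'toy_embed.pkl')
with open(path, 'wb') as f:
    pickle.dump((toy_index, toy_embeddings), f)

word_to_id, embeddings = load_embeddings(path)
print(embeddings[word_to_id['la']])
```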