# 15688 Practical Data Science: Student Tutorial_Assignment Checkout

## Word2vec in TensorFlow

Carnegie Mellon University

Gilbert Gao

*bog@andrew.cmu.edu*

## Concepts

#### What is Word Embedding?

>Word embeddings, also called distributed representations, come from vector space models (VSMs), which represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points ('are embedded nearby each other').

[source](https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html)

In Natural Language Processing, the traditional approach is to represent each word as a discrete symbol, for example [cat] -> [id537] and [dog] -> [id142]. The main shortcoming of this approach is that the IDs carry no information about how the two words relate (both are animals). This one-hot representation also makes the word vectors very sparse, so much more training data is needed to obtain a satisfactory model. Word embedding was developed to solve these problems.
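
To make the limitation concrete, here is a minimal sketch. The three-word vocabulary and the dense vectors are made-up values for illustration, not learned embeddings. With one-hot vectors every pair of distinct words has cosine similarity 0, so "cat" is no closer to "dog" than to "car", whereas dense vectors can encode that closeness.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]  # hypothetical toy vocabulary

# One-hot representation: one dimension per word.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Dense representation: hand-picked values, for illustration only.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# Distinct one-hot vectors are orthogonal, so their similarity is zero.
print(cosine(one_hot["cat"], one_hot["dog"]))

# The dense vectors place "cat" closer to "dog" than to "car".
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))
```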

Word embedding (vector representations of words) is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with much lower dimension. [source](https://en.wikipedia.org/wiki/Word_embedding)


A word embedding $W: \mathrm{words} \to \mathbb{R}^n$ is a parameterized function mapping words in some language to high-dimensional vectors (perhaps 200 to 500 dimensions). For example, we might find:

$$W("cat")=(0.2, -0.4, 0.7, ...)$$
$$W("cap")=(0.1, -0.4, 0.7, ...)$$

$$W("mat")=(0.0, 0.6, -0.1, ...)$$
$$W("map")=(0.0, 0.6, -0.1, ...)$$

Typically, the function is a lookup table, parameterized by a matrix, with a row for each word: 
$$W_θ(w_n)=θ_n$$
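
A minimal sketch of this lookup-table view, assuming a toy three-word vocabulary and a randomly initialized matrix `theta` (both hypothetical): applying $W_θ$ to a word just selects a row, which gives the same result as multiplying the word's one-hot vector by the matrix.

```python
import random

random.seed(0)

word_to_id = {"cat": 0, "mat": 1, "map": 2}  # hypothetical vocabulary
embedding_dim = 4

# theta has one row per word; the rows are random here, as before training.
theta = [[random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
         for _ in range(len(word_to_id))]

def embed(word):
    """W_theta(w_n) = theta_n: return the row of theta for this word."""
    return theta[word_to_id[word]]

def embed_via_one_hot(word):
    """Same vector, computed as one_hot(word) times theta."""
    n = word_to_id[word]
    one_hot = [1.0 if i == n else 0.0 for i in range(len(word_to_id))]
    return [sum(one_hot[i] * theta[i][d] for i in range(len(theta)))
            for d in range(embedding_dim)]

print(embed("mat"))
print(embed("mat") == embed_via_one_hot("mat"))
```

The TensorFlow code later in this tutorial performs the same row selection with `tf.nn.embedding_lookup`.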

VSMs have a long, rich history in NLP, but all methods depend in some way or another on the Distributional Hypothesis, which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g. Latent Semantic Analysis), and predictive methods (e.g. neural probabilistic language models).

source: [Distributed Representations of Words and Phrases and Their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)


#### What is TensorFlow?

TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

[source](https://www.tensorflow.org/)

#### What is word2vec?

Word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text. It is an efficient implementation of two closely related architectures for computing vector representations of words, continuous bag-of-words (CBOW) and skip-gram, literally translating words (strings) into vectors (lists of floats). These representations can subsequently be used in many natural language processing applications and for further research.

word2vec organizes words by semantic meaning and turns text into a numerical form that deep learning networks and other machine learning algorithms can in turn use.


#### word2vec is a simple neural network

Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus.

![Figure. word2vec neural network](http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png)


#### Workflow

The word2vec model takes a text corpus as input and produces word vectors as output. It first constructs a vocabulary from the training text and then learns vector representations of the words.

[source](https://code.google.com/archive/p/word2vec/)


#### Two architectures of word2vec

1. Continuous bag of words (CBOW): predicts a missing word in a sentence from the surrounding context.
2. Skip-gram: uses each current word as input to a log-linear classifier to predict the words within a certain range before and after it.

![Figure. CBOW vs. Skip-gram](https://silvrback.s3.amazonaws.com/uploads/60a81cd5-5189-4550-9709-523b3feef3d1/sentiment_01_large.png)

The skip-gram architecture can be viewed as the inverse of the CBOW architecture: CBOW is given the context (surrounding words) and predicts the current word, while skip-gram is given the current word and predicts the context (surrounding words).
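
The difference shows up directly in the training examples each architecture consumes. The sketch below is illustrative only: the helper names `skipgram_pairs` and `cbow_pairs` and the toy sentence are hypothetical, and real implementations add details such as random window sizes and sub-sampling.

```python
def skipgram_pairs(tokens, window):
    """Return (current_word, context_word) pairs: one input, one word to predict."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    """Return (context_words, current_word) examples: many inputs, one word to predict."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram produces one example per (word, neighbor) pair, while CBOW produces one example per position, which is one intuition for why CBOW tends to train faster.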

Comparing the two architectures, CBOW is several times faster to train than skip-gram and has slightly better accuracy for frequent words. Skip-gram works well with a small amount of training data and represents rare words well. Skip-gram is also the more commonly used architecture.

The following performance notes come from Google's word2vec implementation.

The training speed can be significantly improved by using parallel training on a multiple-CPU machine (use the switch '-threads N'). The hyper-parameter choice is crucial for performance (both speed and accuracy), but it varies between applications. The main choices to make are:
- architecture: skip-gram (slower, better for infrequent words) vs CBOW (fast)
- the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
- sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
- dimensionality of the word vectors: usually more is better, but not always
- context (window) size: for skip-gram usually around 10, for CBOW around 5
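
The sub-sampling item in the list above can be sketched as follows. This is a minimal illustration that assumes the discard formula given in the Mikolov et al. paper cited earlier, $P(w) = 1 - \sqrt{t / f(w)}$, where $f(w)$ is the word's relative frequency and $t$ is the threshold. The function names and the toy corpus are hypothetical, and real implementations may use a slightly different expression.

```python
import math
import random
from collections import Counter

def discard_probability(frequency, threshold):
    """Probability of dropping one occurrence of a word with this relative frequency."""
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))

def subsample(tokens, threshold, seed=0):
    """Randomly drop occurrences of frequent words; sufficiently rare words are always kept."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = float(len(tokens))
    return [w for w in tokens
            if rng.random() >= discard_probability(counts[w] / total, threshold)]

# Hypothetical toy corpus: "the" dominates, "fox" is rare.
tokens = ["the"] * 900 + ["fox"] * 1 + ["of"] * 99
kept = subsample(tokens, threshold=1e-3)
print(Counter(kept))
```

With the text8 counts shown later in this tutorial ("the" occurs 1061396 times in 17005207 tokens), a threshold of 1e-5 under this formula would drop roughly 98.7% of the occurrences of "the".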



#### Training

Neural probabilistic language models are traditionally trained using the maximum likelihood (ML) principle to maximize the probability of the next word $w_t$ (for "target") given the previous words $h$ (for "history") in terms of a softmax function.
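
For reference, the softmax form is usually written as below. The notation is an assumption that follows the TensorFlow tutorial cited above: $\text{score}(w_t, h)$ measures the compatibility of word $w_t$ with the context $h$ (commonly a dot product) and $V$ is the vocabulary.

$$P(w_t \mid h) = \text{softmax}(\text{score}(w_t, h)) = \frac{\exp\{\text{score}(w_t, h)\}}{\sum_{w' \in V} \exp\{\text{score}(w', h)\}}$$

Training then maximizes the log-likelihood

$$J_\text{ML} = \log P(w_t \mid h) = \text{score}(w_t, h) - \log\left(\sum_{w' \in V} \exp\{\text{score}(w', h)\}\right)$$

The sum over all of $V$ has to be evaluated at every training step, which is what makes this objective expensive for large vocabularies.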

However, for feature learning in word2vec we do not need a full probabilistic model. The CBOW and skip-gram models are instead trained using a binary classification objective (logistic regression) to discriminate the real target words $w_t$ from $k$ imaginary (noise) words in the same context.
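
The binary objective can be sketched for a single training example as follows. This is a simplified negative-sampling form under stated assumptions, not the exact NCE loss computed by `tf.nn.nce_loss` later in this tutorial: scores are plain dot products, and the vectors are made-up values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(context_vec, target_vec, noise_vecs):
    """Logistic loss with label 1 for the real target and label 0 for each noise word."""
    # Real target: we want sigmoid(score) close to 1.
    loss = -math.log(sigmoid(dot(context_vec, target_vec)))
    # Noise words: we want sigmoid(score) close to 0, i.e. sigmoid(-score) close to 1.
    for noise_vec in noise_vecs:
        loss -= math.log(sigmoid(-dot(context_vec, noise_vec)))
    return loss

# Hypothetical vectors: the target points the same way as the context,
# the k = 2 noise words do not.
context = [1.0, 0.5]
target = [0.8, 0.4]
noise = [[-0.6, -0.2], [0.1, -0.9]]

print(negative_sampling_loss(context, target, noise))
# Swapping the roles of the target and a noise word gives a larger loss.
print(negative_sampling_loss(context, noise[0], [target, noise[1]]))
```

Only $k + 1$ scores are computed per example instead of one score per vocabulary word, which is where the speed-up over the full softmax comes from.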

For details of the softmax function, see this [paper](http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf).


#### Evaluating Embeddings: Analogical Reasoning

Embeddings are useful for a wide variety of prediction tasks in NLP. Short of training a full-blown part-of-speech model or named-entity model, one simple way to evaluate embeddings is to use them directly to predict syntactic and semantic relationships such as "king is to queen as father is to ?". This is called analogical reasoning, and the task was introduced by Mikolov and colleagues. The TensorFlow implementation provides the build_eval_graph() and eval() functions for this in tensorflow/models/embedding/word2vec.py.
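
A common way to answer such a question is vector arithmetic followed by a nearest-neighbor search. The sketch below assumes that approach and uses hand-made three-dimensional vectors, not trained embeddings. The `analogy` helper is hypothetical and is not the `build_eval_graph()` code from the TensorFlow repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three question words themselves are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Made-up vectors: the second coordinate loosely encodes gender,
# the first loosely encodes royalty.
vectors = {
    "king":   [0.9,  0.8, 0.1],
    "queen":  [0.9, -0.8, 0.1],
    "father": [0.1,  0.8, 0.2],
    "mother": [0.1, -0.8, 0.2],
    "apple":  [-0.5, 0.0, 0.9],
}

print(analogy("king", "queen", "father", vectors))
```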



## word2vec Workout

The goal of this part is to implement a simple word2vec in TensorFlow and to train a word2vec skip-gram model on the given data.

Reference: the official TensorFlow word2vec implementation, whose steps are followed here.
[tensorflow/models/embedding/word2vec.py](https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/embedding/word2vec.py)

In [3]:
import os
from six.moves import urllib
from six.moves import xrange  # keeps the xrange loops below working on Python 3
import zipfile
import collections
import math
import random
import numpy as np
import tensorflow as tf

### Download data

[source](http://mattmahoney.net/dc/)

In [4]:
URL = 'http://mattmahoney.net/dc/'
EXPECTED_BYTES = 31344016
FILENAME = 'text8.zip'


def maybe_download(filename, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(URL + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception('Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename
    

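# Read the data into a list of words.
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words."""
    # The context manager closes the archive even though we return inside it.
    with zipfile.ZipFile(filename) as f:
        return tf.compat.as_str(f.read(f.namelist()[0])).split()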

In [5]:
try:
    filename = maybe_download(FILENAME, EXPECTED_BYTES)
    words = read_data(filename)
    print("type(words): ", type(words))
    print("len(words): ", len(words))

except Exception as e:
    print("ERROR: ", e)

('Found and verified', 'text8.zip')
('type(words): ', <type 'list'>)
('len(words): ', 17005207)


### Build the dictionary and replace rare words with UNK token.

In [7]:
VOCABULARY_SIZE = 50000

# words: a list of tokens (17005207 in total for text8)
def build_dataset(words):
    # Word histogram; the UNK count is filled in below.
    count = [['UNK', -1]]

    # Add (word, count) pairs for the most common words, leaving one slot for UNK.
    count.extend(collections.Counter(words).most_common(VOCABULARY_SIZE - 1))

    # word -> id
    dictionary = dict()
    
    for word, _ in count:
        dictionary[word] = len(dictionary) 

    data = list()
    unk_count = 0
    
    # Convert the sequence of words into a sequence of id numbers.
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    
    # id -> word
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

In [8]:
data, count, dictionary, reverse_dictionary = build_dataset(words)
del words  # to reduce memory.

In [36]:
print('Most common words and UNK %s' % count[:10])
print('Sample data %s' % data[:15])
print('len(data) %d' % len(data))

Most common words and UNK [['UNK', 418391], ('the', 1061396), ('of', 593677), ('and', 416629), ('one', 411764), ('in', 372201), ('a', 325873), ('to', 316376), ('zero', 264975), ('nine', 250430)]
Sample data [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156, 128, 742, 477, 10572, 134]
len(data) 17005207


### Function to generate a training batch for the skip-gram model.

In [15]:
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
 
    batch = np.ndarray(shape=(batch_size), dtype=np.int32) # batch (8,)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32) # labels (8,1)
    span = 2 * skip_window + 1 # [ skip_window target skip_window ] # 3
    buffer = collections.deque(maxlen=span)
    
    for _ in range(span): # 0,1,2
        buffer.append(data[data_index]) # 0 -> 1 -> 2
        data_index = (data_index + 1) % len(data) # 1 -> 2 -> 3
        
    for i in range(batch_size // num_skips): # 0,1,2,3
        target = skip_window  # target label at the center of the buffer, 1 
        targets_to_avoid = [ skip_window ] 
        
        # now, target is 1
        for j in range(num_skips): # 0, 1
            while target in targets_to_avoid:
                target = random.randint(0, span - 1) # either of 0,1, or 2
            targets_to_avoid.append(target) 
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]                       
        buffer.append(data[data_index])        
        data_index = (data_index + 1) % len(data)        
    return batch, labels

In [16]:
BATCH_SIZE = 8
NUM_SKIPS = 2 # 4 How many times to reuse an input to generate a label
SKIP_WINDOW = 1 # 2 How many words to consider left and right.

batch, labels = generate_batch(batch_size=BATCH_SIZE, num_skips=NUM_SKIPS, skip_window=SKIP_WINDOW)
for i in range(BATCH_SIZE):
    print(batch[i], '->', labels[i, 0])
for i in data[:10]:
    print(i)

(3084, '->', 12)
(3084, '->', 5239)
(12, '->', 6)
(12, '->', 3084)
(6, '->', 12)
(6, '->', 195)
(195, '->', 2)
(195, '->', 6)
5239
3084
12
6
195
2
3137
46
59
156


### Build and train a skip-gram model

In [37]:
EMBEDDING_SIZE = 128  # Dimension of the embedding vector.
BATCH_SIZE = 128
# pick a random validation set to sample nearest neighbors, but limit the
# validation samples to the words that have a low numeric ID, which by
# construction are also the most frequent.
VALID_SIZE = 16     # Random set of words to evaluate similarity on.
VALID_WINDOW = 100  # Only pick dev samples in the head of the distribution.
NUM_SAMPLED = 64    # Number of negative examples to sample.

valid_examples = np.array(random.sample(range(VALID_WINDOW), VALID_SIZE))
print("valid_examples", valid_examples)

graph = tf.Graph()
with graph.as_default():
    with graph.device("/cpu:0"):
        # Input data.
        train_inputs = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
        train_labels = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
        
        # Construct the variables.
        
        # input embeddings: W_I
        # Embedding matrix, randomly initialized.
        embeddings = tf.Variable(
            tf.random_uniform( [VOCABULARY_SIZE, EMBEDDING_SIZE], -1.0, 1.0)
        )
        
        # output weights: W_O
        # The noise-contrastive (NCE) loss needs a weight vector and a bias for every word.
        nce_weights = tf.Variable(
            tf.truncated_normal(
                [VOCABULARY_SIZE, EMBEDDING_SIZE],
                stddev=1.0 / math.sqrt(EMBEDDING_SIZE)
            )
        )
        
        nce_biases = tf.Variable(tf.zeros([VOCABULARY_SIZE]))
            
        # Look up embeddings(vector) for inputs in batch. v_t = W_I x_t
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)
        
        # Compute the average NCE loss for the batch.
        # tf.nn.nce_loss automatically draws a new sample of the negative labels each
        # time we evaluate the loss.
        # Keyword arguments are used because the positional order of `inputs`
        # and `labels` differs between TensorFlow releases.
        loss = tf.reduce_mean(
            tf.nn.nce_loss(
                weights=nce_weights,  # W_O
                biases=nce_biases,    # b_O
                inputs=embed,         # v_t
                labels=train_labels,
                num_sampled=NUM_SAMPLED,      # number of classes to randomly sample per batch (negative sampling)
                num_classes=VOCABULARY_SIZE,  # number of possible classes
                num_true=1                    # number of target classes per training example
            )
        )
        
        # Now we have the loss node, but we still need to construct the
        # stochastic gradient descent optimizer, using a learning rate of 1.0.
        optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
        
        # Compute the cosine similarity between minibatch examples and all embeddings.
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm # make v_t unit vector
        
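        # Look up the unit vectors of the validation words.
        valid_embeddings = tf.nn.embedding_lookup(
            normalized_embeddings, 
            valid_dataset
        )
        
        # Cosine similarity of each validation word with every word in the
        # vocabulary; for unit vectors this is just the dot product.
        similarity = tf.matmul(
            valid_embeddings, 
            normalized_embeddings, 
            transpose_b = True
        )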

('valid_examples', array([ 3, 99, 11, 73, 84, 30, 38, 63, 52, 96, 67, 98, 28, 50, 61, 89]))


In [18]:
print("valid_embeddings._shape: ", valid_embeddings._shape)
print("normalized_embeddings._shape: ", normalized_embeddings._shape)
print("similarity._shape: ", similarity._shape)

('valid_embeddings._shape: ', TensorShape([Dimension(16), Dimension(128)]))
('normalized_embeddings._shape: ', TensorShape([Dimension(50000), Dimension(128)]))
('similarity._shape: ', TensorShape([Dimension(16), Dimension(50000)]))


### Begin training

In [38]:
NUM_STEPS = 100  # short demo run; increase (e.g. to 100001) for full training
DISPLAY_STEPS = 2000
SIMILARITY_STEPS = 10000
MODEL_PATH = "./model.ckpt"

with tf.Session(graph=graph) as session:
    # must initialize all variables before we use them.
    tf.initialize_all_variables().run()
    print("All Variables Initialized")
    
    average_loss = 0
    
    # Training is simple: in each step, generate a batch and feed it to the
    # placeholders through feed_dict.
    for step in xrange(NUM_STEPS):
        batch_inputs, batch_labels = generate_batch(batch_size=BATCH_SIZE, num_skips=NUM_SKIPS, skip_window=SKIP_WINDOW)
        feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
        
        # We perform one update step by evaluating the optimizer op (including it
        # in the list of returned values for session.run()).
        _, loss_val = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += loss_val
    
        if step % DISPLAY_STEPS == 0:
            if step > 0:
                average_loss = average_loss / DISPLAY_STEPS
    
            # The average loss is an estimate of the loss over the last 2000 batches.
            print("Average loss at step ", step, ": ", average_loss)
            average_loss = 0
    
        # note that this is expensive (~20% slowdown if computed every 500 steps)
        if step % SIMILARITY_STEPS == 0:
            sim = similarity.eval()
            for i in xrange(VALID_SIZE):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8 # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log_str = "Nearest to %s:" % valid_word
            
                for k in xrange(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log_str = "%s %s," % (log_str, close_word)
                print(log_str)
                
    # equivalent to session.run(normalized_embeddings)             
    final_embeddings = normalized_embeddings.eval()     
    
    # save a model
    saver = tf.train.Saver()
    saver.save(session, MODEL_PATH)

All Variables Initialized
('Average loss at step ', 0, ': ', 278.0208740234375)
Nearest to and: allusions, buzzard, temptation, commodities, ordaining, migrate, recourse, obsessed,
Nearest to while: jingles, hellenism, enclosure, troy, sociale, essayists, watches, fished,
Nearest to is: duff, confound, outskirts, subtleties, frege, fantasia, deprivation, effectiveness,
Nearest to b: beers, counterfactual, darling, crystal, bodhisattvas, adp, medicine, appendicitis,
Nearest to war: bleed, tls, pakistani, chiropractors, cultivars, jure, minya, phonemic,
Nearest to his: audition, jammed, hulls, excitation, brahma, oppressive, insurrections, modernist,
Nearest to not: pullback, apsu, nubia, lets, ostensible, bastard, mucous, bryozoans,
Nearest to into: caldera, durst, kellermann, granddaughter, collided, rodgers, mixing, prayed,
Nearest to most: firemen, vigilante, patel, pony, martini, muppet, polg, inhospitable,
Nearest to history: rome, accelerating, holistic, slid, mohawk, balkan, tibe

In [40]:
# restore the model
with tf.Session(graph=graph) as sess:
    saver.restore(sess, MODEL_PATH)