# Lecture 4: Structure your model
## 0. Overview
- Overall structure of a model in TensorFlow
- word2vec
- Name scope
- Embedding visualization

## 1. Overall structure of a model in TensorFlow
### Phase 1: Assemble graph
1. Define placeholders for input and output
2. Define the weights
3. Define the inference model
4. Define loss function
5. Define optimizer

### Phase 2: Compute
![training loop](figures/04_01.png)

## 2. word2vec
### 2.1 about word2vec
[Simple Word Vector Representations](http://web.stanford.edu/class/cs224n/lectures/cs224n-2017-lecture2.pdf)
- main idea: Predict between every word and its context words
- algorithms
    1. Continuous Bag of Words (CBOW): predict target word from bag-of-words context
        - e.g. sentence "The brown fox jumps.": predict "brown" from "the", "fox", "jumps"
        - treats entire context as 1 observation, therefore smoothes over a lot of the distributional info
        - useful for smaller datasets
    2. Skip-grams (SG): predict context words given the target (center) word (position independent)
        - e.g. sentence "The brown fox jumps.": predict "the", "fox", "jumps" from "brown"
        - treats each context-target pair as a new observation
        - does better for larger datasets
- training methods (moderately efficient)  
    0. training aims to: 
        - minimize the cross-entropy loss of our model for every word $w$ in the training set (information-theoretic view)
        - i.e. minimize the negative log likelihood of the correct class (probabilistic interpretation): MLE, or MAP if the regularization term $R(W)$ is included in the full loss function
        - ([explanation](http://cs231n.github.io/linear-classify/#softmax-classifier))
    
    1. hierarchical softmax   
        - softmax-based approach
        - structure the softmax as a binary tree
        - **softmax vs SVM**  
    ![softmaxvssvm](figures/04_02.png)
        - **naive softmax**: the normalization factor is too computationally expensive, since it sums over the whole vocabulary
    2. negative sampling  
        -  sampling-based approach (others: importance sampling, target sampling)
        - a simplified model of **Noise Contrastive Estimation (NCE)**: makes certain assumptions about the number of noise samples to generate ($k$) and the distribution of noise samples ($Q$) (negative sampling assumes that $kQ(w) = 1$)
        - useful for learning word embeddings, but lacks the theoretical guarantee that its derivative tends towards the gradient of the softmax function, which makes it less useful for language modelling
        - [Sebastian Ruder’s “On word embeddings - Part 2: Approximating the Softmax”](http://sebastianruder.com/word-embeddings-softmax/index.html)  
        - [Chris Dyer’s “Notes on Noise Contrastive Estimation and Negative Sampling”](http://demo.clab.cs.cmu.edu/cdyer/nce_notes.pdf)
    3. NCE (used in the example)
    
- [word2vec simple tutorial: skip-gram model](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)
- [google code](https://code.google.com/archive/p/word2vec/)
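
The difference between CBOW and skip-gram described above is easiest to see in the training examples each one produces. The following is a minimal pure-Python sketch; the helper names `skipgram_pairs` and `cbow_examples` and the `window` parameter are illustrative assumptions, not taken from the word2vec reference code.

```python
def skipgram_pairs(tokens, window):
    """One (center, context) pair per context word: many small observations."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """One (context_words, target) example per position: the context is a single bag."""
    examples = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples


tokens = ["the", "brown", "fox", "jumps"]
sg = skipgram_pairs(tokens, window=3)   # window wide enough to cover the sentence
cb = cbow_examples(tokens, window=3)

print([pair for pair in sg if pair[0] == "brown"])
print(cb[1])
```

For the center word "brown", skip-gram yields three separate pairs, while CBOW yields a single example whose context is the whole bag of surrounding words.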

### 2.2 about the dataset
text8: the first 100 MB of cleaned text of the English Wikipedia dump on Mar. 3, 2006 (whose link is no longer available)



## 3. create word2vec, skip-gram model
### Phase 1: Assemble the graph
#### 1. Define placeholders for input and output
- input: center word
- output: target (context) word
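- Instead of using one-hot vectors, use the **index** of each word directly: the center words are a 1-D integer placeholder of shape `[BATCH_SIZE]`, and the targets have shape `[BATCH_SIZE, 1]` because `tf.nn.nce_loss` expects `labels` of shape `[batch_size, num_true]`


    center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE])
    target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1])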
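
Feeding indices instead of one-hot vectors requires a word-to-index mapping built before training. The sketch below is one simple way to do it in plain Python; the helper names, the tiny corpus, and the convention of reserving index 0 for unknown words are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(tokens, vocab_size):
    """Map the most frequent words to 1..vocab_size-1; index 0 is reserved for unknown words."""
    counts = Counter(tokens)
    index = {"UNK": 0}
    for word, _ in counts.most_common(vocab_size - 1):
        index[word] = len(index)
    return index


def to_indices(tokens, index):
    """Replace every word by its index, falling back to 0 for out-of-vocabulary words."""
    return [index.get(word, 0) for word in tokens]


tokens = "the fox saw the dog and the dog saw the fox".split()
index = build_vocab(tokens, vocab_size=4)
ids = to_indices(tokens, index)
print(index)
print(ids)
```

The resulting integer lists are what would be fed into `center_words` and `target_words`.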

#### 2. Define the weight (in this case, embedding matrix)
- each row corresponds to the representation vector of one word (with size `EMBED_SIZE`)
- shape of embedding matrix: `[VOCAB_SIZE, EMBED_SIZE]`
- initialize the embedding matrix to value from a random distribution (e.g. uniform distribution)

    embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0))

#### 3. Inference (compute the forward path of the graph)
- goal: to get the vector representations of words in our dictionary
- to get the representation of all the center words in the batch, we take the corresponding rows of the embedding matrix with `tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None, validate_indices=True, max_norm=None)`
- `tf.nn.embedding_lookup()` is useful when it comes to matrix multiplication with one-hot vectors because it saves us from doing a bunch of unnecessary computation that will return 0 anyway
![matrix multiplication of one-hot](figures/04_03.png)

    embed = tf.nn.embedding_lookup(embed_matrix, center_words)
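
The claim that a lookup replaces a one-hot matrix multiplication can be checked by hand. In this sketch plain Python lists stand in for tensors, and the small `embed_matrix` values are made up for the example; it only illustrates the equivalence, not how TensorFlow implements the op.

```python
def one_hot(index, size):
    """Return a one-hot row vector with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matmul_row(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]


# 4 words in the vocabulary, 3 embedding dimensions (illustrative values)
embed_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
center_words = [2, 0]

# Full multiplication: every row of the matrix is touched, most terms are zero
via_matmul = [matmul_row(one_hot(i, len(embed_matrix)), embed_matrix)
              for i in center_words]
# Lookup: just select the rows
via_lookup = [embed_matrix[i] for i in center_words]

print(via_matmul == via_lookup)
```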

#### 4. Define the loss function (NCE)
- [nce_loss source code](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_impl.py)  
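- **note: in the signature below, `labels` is the third argument and `inputs` is the fourth; older TensorFlow releases took them in the opposite order, so pass both by keyword**  
- the NCE loss needs its own weights and biases (one weight row and one bias per vocabulary word), so define those first, then the loss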

        tf.nn.nce_loss(weights, biases, labels, inputs, num_sampled, num_classes, num_true=1, sampled_values=None, remove_accidental_hits=False, partition_strategy='mod',name='nce_loss')
    


    nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE], stddev=1.0 / EMBED_SIZE ** 0.5))
    nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]))
    loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                         biases=nce_bias,
                                         labels=target_words,
                                         inputs=embed,
                                         num_sampled=NUM_SAMPLED,
                                         num_classes=VOCAB_SIZE))
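
To give a feel for what a sampled loss computes, here is a plain-Python sketch of the negative-sampling objective, the simplified relative of NCE mentioned in section 2.1: $-\log\sigma(u_{true} \cdot v) - \sum_k \log\sigma(-u_k \cdot v)$. It is not what `tf.nn.nce_loss` computes internally, and the vectors and function names are assumptions chosen for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center_vec, true_vec, noise_vecs):
    """Reward a high score for the true word and low scores for the sampled noise words."""
    loss = -math.log(sigmoid(dot(true_vec, center_vec)))
    for noise_vec in noise_vecs:
        loss -= math.log(sigmoid(-dot(noise_vec, center_vec)))
    return loss


center = [0.5, -0.2]
aligned = [1.0, -0.4]    # points the same way as the center vector
opposed = [-1.0, 0.4]    # points the opposite way

good = negative_sampling_loss(center, aligned, [opposed])
bad = negative_sampling_loss(center, opposed, [aligned])
print(good < bad)
```

Each example costs one dot product per sampled word plus one for the true word, instead of one per vocabulary entry as in the full softmax.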

#### 5. Define optimizer (gradient descent)

    optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

### Phase 2: Execute the computation
1. create a session
2. feed inputs and outputs into the placeholders using `feed_dict`
3. run the optimizer to minimize the loss
4. fetch the loss value to report back

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())

        average_loss = 0.0
        # In Python 3, range() is lazy; in Python 2, use xrange() instead,
        # because range() there builds the whole list of indices up front
        for index in range(NUM_TRAIN_STEPS):
            batch = next(batch_gen)
            loss_batch, _ = sess.run([loss, optimizer],
                                     feed_dict={center_words: batch[0], target_words: batch[1]})
            average_loss += loss_batch
            if (index + 1) % 2000 == 0:
                # {:w.pf} -> w = minimum width of the whole number, p = precision
                print('Average loss at step {}: {:5.1f}'.format(index + 1, average_loss / (index + 1)))
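
The training loop above pulls batches from `batch_gen`, which is not defined in these notes. The following is a hypothetical sketch of such a generator in plain Python: it cycles over a list of (center, target) index pairs and yields the targets as a column, matching a `[BATCH_SIZE, 1]` labels placeholder. The function name and the sample pairs are assumptions.

```python
def batch_generator(pairs, batch_size):
    """Yield (centers, targets) forever, wrapping around at the end of `pairs`."""
    position = 0
    while True:
        centers, targets = [], []
        for _ in range(batch_size):
            center, target = pairs[position]
            centers.append(center)
            targets.append([target])   # one target per row -> shape [batch_size, 1]
            position = (position + 1) % len(pairs)
        yield centers, targets


# Hypothetical (center, target) index pairs
pairs = [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4)]
batch_gen = batch_generator(pairs, batch_size=2)

first = next(batch_gen)
second = next(batch_gen)
third = next(batch_gen)    # wraps around to the start of `pairs`
print(first, second, third)
```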

## [code for full implementation](https://github.com/chiphuyen/tf-stanford-tutorials/blob/master/examples/04_word2vec_no_frills.py)

## 4. Name Scope
### 4.1 a look at TensorBoard
![tensorboad_original](figures/04_04.png)

- not very readable: TensorBoard does not know which nodes are similar
- need to group related nodes (using name scope)

    with tf.name_scope(name_of_that_scope):
        # declare op_1
        # declare op_2
        # ...

### 4.2 use name scope to build grouped blocks
the graph can have 3 op blocks:
- "data"
- "embed": has two nodes
    - one for `tf.Variable`
    - one for `tf.random_uniform`
- "loss"

    with tf.name_scope('data'):
        center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
        target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')
    with tf.name_scope('embed'):
        embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), name='embed_matrix')
    with tf.name_scope('loss'):
        embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')
        nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                                    stddev=1.0 / math.sqrt(EMBED_SIZE)),
                                name='nce_weight')
        nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight,
                                            biases=nce_bias,
                                            labels=target_words,
                                            inputs=embed,
                                            num_sampled=NUM_SAMPLED,
                                            num_classes=VOCAB_SIZE),
                              name='loss')

### 4.3 view grouped TensorBoard
![tensorboad_grouped](figures/04_05.png)

## 5. build model as a class
### 5.1 why build class?
- can't dump everything into a giant function
- make the model as easy to use as possible

### 5.2 getting started

    class SkipGramModel:
        """ Build the graph for word2vec model """
        def __init__(self, params):  # constructor
            pass

        def _create_placeholders(self):
            """ Step 1: define the placeholders for input and output """
            pass

        def _create_embedding(self):
            """ Step 2: define weights. In word2vec, it's actually the weights that we care about """
            pass

        def _create_loss(self):
            """ Step 3 + 4: define the inference + the loss function """
            pass

        def _create_optimizer(self):
            """ Step 5: define optimizer """
            pass
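
One way to use such a class is to add a single public method that runs the assembly steps in order. The sketch below shows only that pattern: each step records its name instead of creating TensorFlow ops, and `build_graph`, `_create_embedding`, `steps_run`, and the constructor arguments are hypothetical names introduced for the example.

```python
class SkipGramModel:
    """Skeleton only: each step records its name instead of building graph ops."""

    def __init__(self, vocab_size, embed_size):
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.steps_run = []

    def _create_placeholders(self):
        self.steps_run.append("placeholders")

    def _create_embedding(self):
        self.steps_run.append("embedding")

    def _create_loss(self):
        self.steps_run.append("loss")

    def _create_optimizer(self):
        self.steps_run.append("optimizer")

    def build_graph(self):
        """Run the assembly steps in dependency order (Phase 1)."""
        self._create_placeholders()
        self._create_embedding()
        self._create_loss()
        self._create_optimizer()


model = SkipGramModel(vocab_size=50000, embed_size=128)
model.build_graph()
print(model.steps_run)
```

Keeping the steps private and exposing one entry point means callers cannot assemble the graph in the wrong order.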

## 6. Results
### 6.1 visualized embedding (using t-SNE)
![tSNE](figures/04_06.png)

patterns found in 3D:
- all the numbers (one, two, …, zero) are grouped in a line on the bottom right
- all the months are grouped together
- “do”, “does”, “did” are grouped together
- ...

### 6.2 about t-SNE
- [from-sne-to-tsne-to-largevis](http://bindog.github.io/blog/2016/06/04/from-sne-to-tsne-to-largevis/)
- 2 main stages
    - construct a probability distribution over pairs of high-dimensional objects such that similar objects have a high probability of being picked, whilst dissimilar objects have an extremely small probability of being picked
    - define a similar probability distribution over the points in the low-dimensional map, and minimize the Kullback–Leibler divergence between the two distributions with respect to the locations of the points in the map
- note: whilst the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this should be changed as appropriate
- example: [visualizing MNIST](http://colah.github.io/posts/2014-10-Visualizing-MNIST/)
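
The two stages above can be sketched in a few lines of plain Python. This is a deliberately simplified version: it uses one global `sigma` and normalizes over all pairs, whereas real t-SNE tunes a bandwidth per point from a perplexity target and then optimizes the map by gradient descent. The points and function names are assumptions; the sketch only scores two fixed candidate maps.

```python
import math


def squared_distances(points):
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]


def normalize(weights):
    """Zero the diagonal and scale the pairwise weights so they sum to 1."""
    n = len(weights)
    total = sum(weights[i][j] for i in range(n) for j in range(n) if i != j)
    return [[0.0 if i == j else weights[i][j] / total for j in range(n)]
            for i in range(n)]


def high_dim_probs(points, sigma=1.0):
    """Stage 1: Gaussian similarities between the original points."""
    d2 = squared_distances(points)
    return normalize([[math.exp(-d / (2 * sigma ** 2)) for d in row] for row in d2])


def low_dim_probs(points):
    """Stage 2: Student-t (one degree of freedom) similarities in the map."""
    d2 = squared_distances(points)
    return normalize([[1.0 / (1.0 + d) for d in row] for row in d2])


def kl_divergence(p, q):
    return sum(pij * math.log(pij / qij)
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)


high = [(0, 0, 0), (0, 0, 1), (5, 5, 5), (5, 5, 6)]   # two tight clusters in 3-D
good_map = [(0, 0), (0, 1), (5, 5), (5, 6)]           # keeps the clusters together
bad_map = [(0, 0), (5, 5), (0, 1), (5, 6)]            # splits both clusters

P = high_dim_probs(high)
kl_good = kl_divergence(P, low_dim_probs(good_map))
kl_bad = kl_divergence(P, low_dim_probs(bad_map))
print(kl_good < kl_bad)
```

A map that preserves the neighbourhoods gets the lower divergence, which is the quantity t-SNE minimizes when it moves the low-dimensional points.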

### 6.3 visualized embedding (using PCA)

    from tensorflow.contrib.tensorboard.plugins import projector
    
    # obtain the embedding_matrix after you’ve trained it
    final_embed_matrix = sess.run(model.embed_matrix)
    
    # create a variable to hold your embeddings. It has to be a variable. Constants
    # don’t work. You also can’t just use the embed_matrix we defined earlier for our model. Why
    # is that so? I don’t know. I get the 500 most popular words.
    embedding_var = tf.Variable(final_embed_matrix[:500], name='embedding')
    sess.run(embedding_var.initializer)
    config = projector.ProjectorConfig()
    summary_writer = tf.summary.FileWriter(LOGDIR)
    
    # add embeddings to config
    embedding = config.embeddings.add()
    embedding.tensor_name = embedding_var.name
    
    # link the embeddings to their metadata file. In this case, the file that contains
    # the 500 most popular words in our vocabulary
    embedding.metadata_path = LOGDIR + '/vocab_500.tsv'
    
    # save a configuration file that TensorBoard will read during startup
    projector.visualize_embeddings(summary_writer, config)
    
    # save our embedding
    saver_embed = tf.train.Saver([embedding_var])
    saver_embed.save(sess, LOGDIR + '/skip-gram.ckpt', 1)
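
The snippet above points TensorBoard at `vocab_500.tsv` but does not show how that file is produced. A single-column metadata file has no header row, and line *i* labels row *i* of the embedding. The sketch below writes such a file; the helper name, the sample vocabulary, and the temporary directory standing in for `LOGDIR` are assumptions.

```python
import os
import tempfile


def write_metadata(words, path, limit=500):
    """Write the first `limit` words, one per line, in embedding-row order."""
    with open(path, "w", encoding="utf-8") as handle:
        for word in words[:limit]:
            handle.write(word + "\n")


# Hypothetical vocabulary, already sorted from most to least frequent
vocab = ["the", "of", "and", "one", "in"]
logdir = tempfile.mkdtemp()    # stands in for LOGDIR
metadata_path = os.path.join(logdir, "vocab_500.tsv")
write_metadata(vocab, metadata_path, limit=3)

with open(metadata_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
print(written)
```

The word order must match the row order of the slice saved in `embedding_var`, otherwise the labels in the projector will be attached to the wrong points.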

Now run the model again, then run TensorBoard again. If you go to http://localhost:6006 and click
on the Embeddings tab, you’ll see the visualization.

[more resources](https://www.tensorflow.org/get_started/embedding_viz)