# PART 2 TensorFlow
# 6. Workshop 4-1 :  Natural Language Processing (NLP) - LSTM Algorithm and its Application
2019/08/30

> [ Reference ] :
1. Tom Hope, Yehezkel S. Resheff, and Itay Lieder, "**`Learning TensorFlow : A Guide to Building Deep Learning Systems`**", Chapter 5 & 6, O'Reilly, 2017.
      [ Code ] : https://github.com/giser-yugang/Learning_TensorFlow
2. Victor Zhou, "**`An Introduction to Recurrent Neural Networks for Beginners`**" Towards Data Science, 2019/07/25. https://towardsdatascience.com/an-introduction-to-recurrent-neural-networks-for-beginners-664d717adbd
3. Andrej Karpathy, "**`The Unreasonable Effectiveness of Recurrent Neural Networks`**" Andrej Karpathy blog, 2015/05/21. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
4. Wikipedia, "**`Long short-term memory`**", 2019. https://en.wikipedia.org/wiki/Long_short-term_memory
5. 陳誠, "**人人都能看懂的LSTM**", https://zhuanlan.zhihu.com/p/32085405
6. Wikipedia, "**`Gated recurrent unit`**" https://en.wikipedia.org/wiki/Gated_recurrent_unit

- [Intro to LSTM Model](#Intro)
- [1 One-Hot for Text Sequences](#OneHot)
    - [1.1 Supervised Word Embeddings](#WordEmbeddings)
    - [1.2 LSTM with Sequence Length](#LSTM)
    - [1.3 Training Embeddings for the LSTM Classifier](#Training)
    - [EXERCISE 1 : Stacking Multiple LSTMs (Solution)](#Ex1)
- [2 Word2Vec for Text Sequences](#Word2Vec)
    - [2.1 Skip-Grams](#SkipGram)
    - [2.2 Building the Computation Graph](#BuildGraph)
        - [Embeddings in TensorFlow](#Embeddings)
        - [The Noise-Contrastive Estimation (NCE) Loss Function](#NCE)
        - [Learning Rate Decay](#LRDecay)
    - [2.3 Launching the Computation Graph](#LaunchGraph)
        - [Checking Out the Embeddings](#CheckOut)
    - [2.4 Visualization with TensorBoard](#TensorBoard)

<a id='Intro'></a>
# Intro to LSTM Model

-----------------------
![title](./Fig_1a_RNN_operation.png)

**Figure 1.a RNN model.** (from Ref. 4)

------------------------

+ **Long short-term memory (LSTM)** is *an artificial recurrent neural network (RNN) architecture* used in the field of deep learning. 
+ A common LSTM unit is composed of a `cell`, an `input gate`, an `output gate` and a `forget gate`. 
+ The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.
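
As a rough illustration of how the three gates and the cell interact, here is a minimal single-unit sketch in plain Python. It follows the standard forget-gate formulation described in Ref. 4. The function name `lstm_step`, the layout of the parameter dictionary and the toy weights are assumptions made for this example; none of them come from TensorFlow.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM (scalar input, scalar state).

    p maps each of 'f', 'i', 'o', 'c' to a (w, u, b) triple:
    input weight, recurrent weight and bias (hypothetical layout).
    """
    def pre(name):
        w, u, b = p[name]
        return w * x + u * h_prev + b

    f = sigmoid(pre('f'))          # forget gate: how much of c_prev to keep
    i = sigmoid(pre('i'))          # input gate: how much new content to write
    o = sigmoid(pre('o'))          # output gate: how much of the cell to expose
    c_tilde = math.tanh(pre('c'))  # candidate cell content
    c = f * c_prev + i * c_tilde   # new cell state
    h = o * math.tanh(c)           # new hidden state (the unit's output)
    return h, c

# Toy parameters: all weights and biases are zero, so every gate equals 0.5
params = {name: (0.0, 0.0, 0.0) for name in 'fioc'}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=params)
print(h, c)
```

With all parameters at zero each gate evaluates to 0.5, so exactly half of the previous cell state is kept and nothing new is written.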

-----------------------
![title](./Fig_1b_LSTM_model.png)

**Figure 1.b The LSTM model.** (from Ref. 5)

------------------------

+ The basic ideas behind **LSTM models** :
    + **Markov Chain Model**
    + **Hidden Markov Model (HMM)**

----------------------
![title](./Fig_2a_LSTM_with_a_forget_gate.png)
![title](./Fig_2b_LSTM_Variables_Activation_Funcs.png)

----------------------
**Figure 2 LSTM with a forget gate.** (from Ref. 4)

![title](./Fig_3_Intro_to_RNNs.png)


**Figure 3. Different Recurrent Neural Networks.** (from Ref. 3)

<a id='OneHot'></a>
## 1.  One-Hot for Text Sequences

In [20]:
import numpy as np
import tensorflow as tf

# Show only TensorFlow error messages (hides the deprecation warnings raised by older TF 1.x APIs)
old_v = tf.logging.get_verbosity()          
tf.logging.set_verbosity(tf.logging.ERROR)

batch_size = 128
embedding_dimension = 64
num_classes = 2
hidden_layer_size = 32
times_steps = 6  # one time step corresponds to one word of a sample
element_size = 1

+ Next, we create sentences. We sample random digits and map them to the corresponding
“words” (e.g., 1 is mapped to “One,” 7 to “Seven,” etc.).

In [21]:
digit_to_word_map = {1:"One",2:"Two", 3:"Three", 4:"Four", 5:"Five",
                     6:"Six",7:"Seven",8:"Eight",9:"Nine"}

+ Text sequences typically have variable lengths, which is of course the case for all real natural language data (such as in the sentences appearing on this page).

> + To make our simulated sentences have different lengths, we sample for each sentence a random length between 3 and 6 with **`np.random.choice(range(3, 7))`**—the lower bound is inclusive, and the upper bound is exclusive.


> + Now, to put all our input sentences in one tensor (per batch of data instances), we need them to somehow be of the same size—so we pad sentences with a length shorter than 6 with zeros (or PAD symbols) to make all sentences equally sized (artificially). This pre-processing step is known as **zero-padding**. 
    
The following code accomplishes all of this:

In [22]:
digit_to_word_map[0] = "PAD"  # add the entry 0: "PAD" to the dictionary
even_sentences = []
odd_sentences = []
seqlens = []

for i in range(10000):
    rand_seq_len = np.random.choice(range(3,7))  # each sentence has 3 to 6 words
    seqlens.append(rand_seq_len)
    rand_odd_ints = np.random.choice(range(1,10,2),rand_seq_len)  
    rand_even_ints = np.random.choice(range(2,10,2),rand_seq_len)
    
    # Padding
    if rand_seq_len<6:
        rand_odd_ints = np.append(rand_odd_ints,
                                  [0]*(6-rand_seq_len))
        rand_even_ints = np.append(rand_even_ints,
                                  [0]*(6-rand_seq_len))
    even_sentences.append(" ".join([digit_to_word_map[r] 
                                    for r in rand_even_ints]))  # join the words with single spaces
    odd_sentences.append(" ".join([digit_to_word_map[r] 
                                   for r in rand_odd_ints]))
    
data = even_sentences+odd_sentences
# Same seq lengths for even, odd sentences
seqlens*=2

Let’s take a look at our sentences, each padded to length 6:

In [23]:
even_sentences[0:6]   

['Six Six Two Eight Two Six',
 'Six Six Eight Six Eight PAD',
 'Eight Eight Six PAD PAD PAD',
 'Six Six Eight Six Six Eight',
 'Eight Six Two Four PAD PAD',
 'Two Eight Two Eight PAD PAD']

In [24]:
odd_sentences[0:6]   

['One Seven Five One Nine One',
 'Nine One Five Nine Nine PAD',
 'Three Nine Seven PAD PAD PAD',
 'One Three Five Three Seven Five',
 'Three One Nine Three PAD PAD',
 'Nine Nine Three Five PAD PAD']

> + Notice that we add the **PAD** word (token) to our data and `digit_to_word_map` dictionary, and separately store even and odd sentences and their original lengths (before padding).

Let’s take a look at the original sequence lengths for the sentences we printed:

In [25]:
seqlens[0:6]  # Same seq lengths for even, odd sentences

[6, 5, 3, 6, 4, 4]
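
The sampling and zero-padding steps above can be isolated into a small helper that returns both the padded sequence and its original length. This is only a sketch in plain Python; the names `pad_ids` and `MAX_LEN` are made up for the example, and it works on integer IDs rather than on the word strings used in the notebook.

```python
import random

MAX_LEN = 6  # same maximum sentence length as in the cells above

def pad_ids(ids, max_len=MAX_LEN, pad_id=0):
    """Return (padded_ids, original_length); pad_id 0 stands for 'PAD'."""
    if len(ids) > max_len:
        raise ValueError("sequence longer than max_len")
    return list(ids) + [pad_id] * (max_len - len(ids)), len(ids)

random.seed(0)                                   # only to make the sketch repeatable
length = random.choice(range(3, 7))              # 3, 4, 5 or 6 (7 is excluded)
odd_ids = [random.choice(range(1, 10, 2)) for _ in range(length)]
padded, seqlen = pad_ids(odd_ids)
print(padded, seqlen)
```

Returning the length together with the padded sequence keeps the two values in sync, which matters once the data is shuffled.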

### Q : Why keep the original sentence lengths? 
> + By zero-padding, we solved one technical
problem but created another: if we naively pass these padded sentences through our
RNN model as they are, it will process useless **PAD** symbols. 


> + This would both harm model correctness by processing “*noise*” and increase computation time. We resolve this issue by first storing the original lengths in the seqlens array and then telling TensorFlow’s **`tf.nn.dynamic_rnn()`** where each sentence ends.

+ Our data is simulated—generated by us. In real applications, we would  start off by getting a collection of documents (e.g., one-sentence tweets) and then mapping each word to an integer ID.


+ So, we now **map words to indices**—word `identifiers`—by simply creating a dictionary with words as keys and indices as values. 
+ We also create the **inverse map**.

In [26]:
# Map from words to indices
word2index_map ={}
index=0
for sent in data:
    for word in sent.lower().split():  # lowercase, then split on whitespace
        if word not in word2index_map:
            word2index_map[word] = index
            index+=1
            
# Inverse map
index2word_map = {index: word for word, index in word2index_map.items()}
vocabulary_size = len(index2word_map)
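
To see the two maps working together, the sketch below encodes a sentence into IDs and decodes it back. The helper names `build_vocab`, `encode` and `decode` and the toy sentences are assumptions for illustration only.

```python
def build_vocab(sentences):
    """Assign consecutive integer IDs to words in order of first appearance."""
    word2index = {}
    for sent in sentences:
        for word in sent.lower().split():
            if word not in word2index:
                word2index[word] = len(word2index)
    index2word = {idx: word for word, idx in word2index.items()}
    return word2index, index2word

def encode(sentence, word2index):
    return [word2index[w] for w in sentence.lower().split()]

def decode(ids, index2word):
    return " ".join(index2word[i] for i in ids)

toy_data = ["Six Two PAD", "One Nine Nine"]      # made-up sentences for the sketch
w2i, i2w = build_vocab(toy_data)
ids = encode("Nine Six PAD", w2i)
print(ids, "->", decode(ids, i2w))
```

Because IDs follow the order of first appearance, the `pad` token does not necessarily receive index 0, even though the digit 0 was used for padding when the sentences were generated.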

### This is a supervised classification task—we need an array of labels in the `one-hot` format, train and test sets, a function to generate batches of instances, and placeholders, as usual.

+ First, we create the labels and split the data into train and test sets:

In [27]:
labels = [1]*10000 + [0]*10000
for i in range(len(labels)):
    label = labels[i]
    one_hot_encoding = [0]*2
    one_hot_encoding[label] = 1
    labels[i] = one_hot_encoding
    
data_indices = list(range(len(data)))
np.random.shuffle(data_indices)
data = np.array(data)[data_indices]

labels = np.array(labels)[data_indices]
seqlens = np.array(seqlens)[data_indices]
train_x = data[:10000]
train_y = labels[:10000]
train_seqlens = seqlens[:10000]

test_x = data[10000:]
test_y = labels[10000:]
test_seqlens = seqlens[10000:]

+ Next, we create a function that generates batches of sentences. Each sentence in a
batch is simply a list of integer IDs corresponding to words:

In [28]:
def get_sentence_batch(batch_size,data_x, data_y, data_seqlens):
    instance_indices = list(range(len(data_x)))
    np.random.shuffle(instance_indices)
    batch = instance_indices[:batch_size]
    x = [[word2index_map[word] for word in data_x[i].lower().split()]
         for i in batch]
    y = [data_y[i] for i in batch]
    seqlens = [data_seqlens[i] for i in batch]
    return x,y,seqlens

+ Finally, we create placeholders for data:

In [29]:
_inputs = tf.placeholder(tf.int32, shape=[batch_size,times_steps])
_labels = tf.placeholder(tf.float32, shape=[batch_size, num_classes])

# seqlens for dynamic calculation
_seqlens = tf.placeholder(tf.int32, shape=[batch_size])

<a id='WordEmbeddings'></a>   
## 1.1 Supervised Word Embeddings

+ **Word IDs encoded in `one-hot` (binary) categorical form**

### `tf.nn.embedding_lookup() function`
+ Word embeddings can be viewed as *basic hash tables* or *lookup tables* that map words to dense vectors. These vectors are optimized as part of the training process.
+ **Using the built-in `tf.nn.embedding_lookup()` function**:

In [30]:
with tf.name_scope("embeddings"):
    embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size,
                               embedding_dimension],
                               -1.0, 1.0),name='embedding')
    embed = tf.nn.embedding_lookup(embeddings, _inputs)
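
Conceptually, an embedding lookup just selects rows of the embedding table, which gives the same result as multiplying a one-hot vector by that table. The plain-Python sketch below illustrates this equivalence on a made-up 4 x 3 table. The helper functions are hypothetical and are not how TensorFlow implements the operation.

```python
def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def matvec_rows(vec, matrix):
    """Row vector times matrix: returns vec @ matrix as a list."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

def embedding_lookup(table, ids):
    """Pick the rows of `table` named by `ids` (what a lookup table does)."""
    return [table[i] for i in ids]

# Toy embedding table: vocabulary of 4 words, embedding dimension 3
table = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6],
         [0.7, 0.8, 0.9],
         [1.0, 1.1, 1.2]]

ids = [2, 0, 2]
looked_up = embedding_lookup(table, ids)
via_one_hot = [matvec_rows(one_hot(i, len(table)), table) for i in ids]
print(looked_up)
print(via_one_hot)
```

The lookup avoids building the one-hot vectors at all, which is why it is preferred when the vocabulary is large.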

<a id='LSTM'></a>  
## 1.2 LSTM with Sequence Length

+ A very popular recurrent network is the **`long short-term memory (LSTM) network`**. 
>    + It has some special *memory mechanisms* that enable the recurrent cells to better store information for long periods of time, thus allowing them to capture long-term dependencies better than plain RNN.
    1. **These memory mechanisms simply consist of some more parameters added to each recurrent cell, enabling the RNN to overcome optimization issues and propagate information.** 
    2. **These trainable parameters act as filters that select what information is worth “*remembering*” and passing on, and what is worth “*forgetting*.”**
+ **They are trained in exactly the same way as any other parameter in a network, with gradient-descent algorithms and backpropagation.**

**Create an LSTM cell with `tf.contrib.rnn.BasicLSTMCell()` and feed it to `tf.nn.dynamic_rnn()`:** 

In [31]:
with tf.variable_scope("lstm"):
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(hidden_layer_size,
                                              forget_bias=1.0)
    outputs, states = tf.nn.dynamic_rnn(lstm_cell, embed,
                                        sequence_length = _seqlens,
                                        dtype=tf.float32)
weights = {
'linear_layer': tf.Variable(tf.truncated_normal([hidden_layer_size,
                            num_classes], mean=0,stddev=.01))
}

biases = {
'linear_layer':tf.Variable(tf.truncated_normal([num_classes],
                           mean=0,stddev=.01))
}
## Weights and bias of the final (output) linear layer

# Extract the last relevant output and use in a linear layer
final_output = tf.matmul(states[1],
                         weights["linear_layer"]) + biases["linear_layer"]
softmax = tf.nn.softmax_cross_entropy_with_logits(logits = final_output,
                                                  labels = _labels)
cross_entropy = tf.reduce_mean(softmax)
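
Note that the tensor named `softmax` above holds the per-example cross-entropy, not the softmax probabilities. The following plain-Python sketch shows the quantity being computed for a toy batch. The function name `softmax_xent` and the numbers are assumptions for illustration.

```python
import math

def softmax_xent(logits, labels):
    """Cross-entropy between the labels and softmax(logits) for one example."""
    m = max(logits)                                   # subtract the max for stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_softmax = [z - log_sum for z in logits]
    return -sum(y * ls for y, ls in zip(labels, log_softmax))

# Toy batch: two examples, two classes, one-hot labels
batch_logits = [[2.0, -1.0], [0.0, 0.0]]
batch_labels = [[1.0, 0.0], [0.0, 1.0]]
per_example = [softmax_xent(z, y) for z, y in zip(batch_logits, batch_labels)]
mean_loss = sum(per_example) / len(per_example)       # analogue of tf.reduce_mean
print(per_example, mean_loss)
```

For the second example both logits are equal, so the predicted probability of the true class is 0.5 and the loss is ln 2.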

> **[ NOTE ] :  We take the last valid output vector, which is conveniently available in `states[1]` (the hidden-state component `h` of the state tuple returned by `dynamic_rnn()`), and pass it through a linear layer (and the softmax function), using it as our final prediction.**

<a id='Training'></a>  
## 1.3 Training Embeddings for the LSTM Classifier

In [32]:
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(_labels,1), tf.argmax(final_output,1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32)))*100

In [33]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for step in range(1000):
        x_batch, y_batch,seqlen_batch = get_sentence_batch(batch_size,
                                         train_x,train_y, train_seqlens)
        sess.run(train_step,feed_dict={_inputs:x_batch, _labels:y_batch,
                                       _seqlens:seqlen_batch})
        
        if step % 100 == 0:
            acc = sess.run(accuracy,feed_dict={_inputs:x_batch,
                                               _labels:y_batch,
                                               _seqlens:seqlen_batch})
            print("Accuracy at %d: %.5f" % (step, acc))
        
    for test_batch in range(5):
        x_test, y_test,seqlen_test = get_sentence_batch(batch_size,
                                                        test_x,test_y,
                                                        test_seqlens)
        batch_pred,batch_acc = sess.run([tf.argmax(final_output,1), accuracy],
                                         feed_dict={_inputs:x_test,
                                                    _labels:y_test,
                                                    _seqlens:seqlen_test})
        print("Test batch accuracy %d: %.5f" % (test_batch, batch_acc))
    
    output_example = sess.run([outputs],feed_dict={_inputs:x_test,
                                                   _labels:y_test,
                                                   _seqlens:seqlen_test})
    states_example = sess.run([states[1]],feed_dict={_inputs:x_test,
                                                     _labels:y_test,
                                                     _seqlens:seqlen_test})

Accuracy at 0: 44.53125
Accuracy at 100: 100.00000
Accuracy at 200: 100.00000
Accuracy at 300: 100.00000
Accuracy at 400: 100.00000
Accuracy at 500: 100.00000
Accuracy at 600: 100.00000
Accuracy at 700: 100.00000
Accuracy at 800: 100.00000
Accuracy at 900: 100.00000
Test batch accuracy 0: 100.00000
Test batch accuracy 1: 100.00000
Test batch accuracy 2: 100.00000
Test batch accuracy 3: 100.00000
Test batch accuracy 4: 100.00000


+ Let’s take a look at one example of these outputs, for a sentence that was zero-padded (in your random batch of data you may see different output, of course—look for a sentence whose seqlen was lower than the maximal 6):

In [34]:
seqlen_test[1]

5

In [35]:
output_example[0][1].shape

(6, 32)

+ This output has, as expected, six time steps, each a vector of size 32. 

Let’s take a glimpse at its values (printing only the first few dimensions to avoid clutter):

In [36]:
output_example[0][1][:6,0:3]

array([[ 0.37689742, -0.3723543 ,  0.25342464],
       [ 0.61963177, -0.66885376,  0.52438927],
       [ 0.7476994 , -0.8419353 ,  0.6733225 ],
       [ 0.7789648 , -0.8884601 ,  0.68215954],
       [ 0.7584404 , -0.872957  ,  0.72053736],
       [ 0.        ,  0.        ,  0.        ]], dtype=float32)

+ If a sentence has zero vectors in the last few time steps, it is because of zero-padding.

Finally, we look at the states vector returned by `dynamic_rnn()`:

In [37]:
states_example[0][1][0:3]

array([ 0.7584404 , -0.872957  ,  0.72053736], dtype=float32)

+ **We can see that it conveniently stores for us the last relevant output vector — its values match the last relevant output vector before zero-padding.**
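
The behaviour observed above (zero outputs after the end of the sentence, and a final state equal to the last valid output) can be mimicked in a few lines of plain Python. This is only a sketch: `run_with_seqlen` is a hypothetical helper, and a running sum stands in for the LSTM cell.

```python
def run_with_seqlen(inputs, seqlen, step, init_state=0.0):
    """Run `step` over `inputs`, but only for the first `seqlen` time steps.

    Past `seqlen` the output is zero and the state is carried through
    unchanged, which mirrors the documented behaviour of
    tf.nn.dynamic_rnn(..., sequence_length=...).
    """
    state = init_state
    outputs = []
    for t, x in enumerate(inputs):
        if t < seqlen:
            state = step(x, state)   # a real (non-padded) time step
            outputs.append(state)
        else:
            outputs.append(0.0)      # padded step: zero output, state untouched
    return outputs, state

def running_sum(x, s):
    # Hypothetical toy "cell": accumulates its inputs instead of gating them
    return s + x

padded_sentence = [2.0, 4.0, 6.0, 0.0, 0.0, 0.0]   # three real tokens, then PAD
outputs, final_state = run_with_seqlen(padded_sentence, seqlen=3, step=running_sum)
print(outputs, final_state)
```

The final state equals the output at the last real time step, just as `states[1]` matched row 5 of the output array in the cells above.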

--------------------------------
<a id='Ex1'></a>
## [ EXERCISE 1 ] : Stacking multiple LSTMs

+ [ HINT ]: 
> 1. Use **`MultiRNNCell()`**, which combines multiple RNN cells into one multilayer cell.
> 2. **The code segment in Ref. 1 (Chapter 5) is incorrect. It should be modified as follows:**

In [38]:
# Building 2 LSTM layers
num_LSTM_layers = 2

with tf.variable_scope("lstm"):
    lstm_cell = [tf.contrib.rnn.BasicLSTMCell(hidden_layer_size,forget_bias= 1.0) 
                 for _ in range(num_LSTM_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(cells=lstm_cell,
                                       state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(cell, embed,
                                        sequence_length = _seqlens,
                                        dtype=tf.float32)

+ First, define one LSTM cell per layer, as before, and pass the list of cells to the `tf.contrib.rnn.MultiRNNCell()` wrapper.

+ The network now has two stacked LSTM layers, and `states` holds one state tuple per layer. 
+ To get the final state of the second layer, we simply adapt our indexing a bit:

In [39]:
# Extract the final state and use in a linear layer
final_output = tf.matmul(states[num_LSTM_layers-1][1],
                         weights["linear_layer"]) + biases["linear_layer"]
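
The indexing `states[num_LSTM_layers-1][1]` is easier to read with the nested structure in front of you: one `(c, h)` pair per layer, with the top layer last. The sketch below uses a `namedtuple` as a stand-in for TensorFlow's `LSTMStateTuple`. The numeric values are made up.

```python
from collections import namedtuple

# Stand-in for TensorFlow's LSTMStateTuple, which also has the fields (c, h)
LSTMState = namedtuple("LSTMState", ["c", "h"])

num_LSTM_layers = 2

# Made-up final states, one (c, h) pair per layer, for a single sentence
states = (
    LSTMState(c=[0.3, -0.1], h=[0.2, -0.05]),   # layer 0 (bottom)
    LSTMState(c=[1.2, 0.7], h=[0.8, 0.6]),      # layer 1 (top)
)

top_h = states[num_LSTM_layers - 1][1]          # same indexing as in the cell above
print(top_h)
```

Index `[1]` selects `h` (the hidden state) rather than `c` (the cell state), so the linear layer receives the top layer's last valid output.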

### < Solution >

In [40]:
import numpy as np
import tensorflow as tf

# Show only TensorFlow error messages (hides the deprecation warnings raised by older TF 1.x APIs)
old_v = tf.logging.get_verbosity()          
tf.logging.set_verbosity(tf.logging.ERROR)

batch_size = 128
embedding_dimension = 64
num_classes = 2
hidden_layer_size = 32
times_steps = 6
element_size = 1

# Creating index-to-word mapping
digit_to_word_map = {1:"One",2:"Two", 3:"Three", 4:"Four", 5:"Five",
                     6:"Six",7:"Seven",8:"Eight",9:"Nine"}

digit_to_word_map[0]="PAD"

# Creating sentences for datasets
even_sentences = []
odd_sentences = []
seqlens = []

for i in range(10000):
    rand_seq_len = np.random.choice(range(3,7))
    seqlens.append(rand_seq_len)
    rand_odd_ints = np.random.choice(range(1,10,2),rand_seq_len)
    rand_even_ints = np.random.choice(range(2,10,2),rand_seq_len)
    
    # Padding
    if rand_seq_len<6:
        rand_odd_ints = np.append(rand_odd_ints,
                                  [0]*(6-rand_seq_len))
        rand_even_ints = np.append(rand_even_ints,
                                  [0]*(6-rand_seq_len))
    even_sentences.append(" ".join([digit_to_word_map[r] 
                                    for r in rand_even_ints]))
    odd_sentences.append(" ".join([digit_to_word_map[r] 
                                   for r in rand_odd_ints]))
    
data = even_sentences+odd_sentences

# Same seq lengths for even, odd sentences
seqlens*=2

# Map from words to indices
word2index_map ={}
index=0
for sent in data:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = index
            index+=1
            
# Inverse map
index2word_map = {index: word for word, index in word2index_map.items()}
vocabulary_size = len(index2word_map)

# Arranging the train and test datasets
labels = [1]*10000 + [0]*10000
for i in range(len(labels)):
    label = labels[i]
    one_hot_encoding = [0]*2
    one_hot_encoding[label] = 1
    labels[i] = one_hot_encoding
    
data_indices = list(range(len(data)))
np.random.shuffle(data_indices)
data = np.array(data)[data_indices]

labels = np.array(labels)[data_indices]
seqlens = np.array(seqlens)[data_indices]
train_x = data[:10000]
train_y = labels[:10000]
train_seqlens = seqlens[:10000]

test_x = data[10000:]
test_y = labels[10000:]
test_seqlens = seqlens[10000:]

def get_sentence_batch(batch_size,data_x, data_y, data_seqlens):
    instance_indices = list(range(len(data_x)))
    np.random.shuffle(instance_indices)
    batch = instance_indices[:batch_size]
    x = [[word2index_map[word] for word in data_x[i].lower().split()]
         for i in batch]
    y = [data_y[i] for i in batch]
    seqlens = [data_seqlens[i] for i in batch]
    return x,y,seqlens

In [41]:
## --------------------------------------
## Building up the computation graph...
## --------------------------------------
# Need to clear the computational graph for creating 2 stacked LSTM layers
tf.reset_default_graph()

_inputs = tf.placeholder(tf.int32, shape=[batch_size,times_steps])
_labels = tf.placeholder(tf.float32, shape=[batch_size, num_classes])

# seqlens for dynamic calculation
_seqlens = tf.placeholder(tf.int32, shape=[batch_size])

with tf.name_scope("embeddings"):
    embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size,
                               embedding_dimension],
                               -1.0, 1.0),name='embedding')
    embed = tf.nn.embedding_lookup(embeddings, _inputs)

**Create one LSTM cell per layer with `tf.contrib.rnn.BasicLSTMCell()`, pass the list to `tf.contrib.rnn.MultiRNNCell()`, and run it with `tf.nn.dynamic_rnn()`:** 

In [42]:
# Creating 2 stacked LSTM layers
num_LSTM_layers = 2

with tf.variable_scope("lstm"):
    lstm_cell = [tf.contrib.rnn.BasicLSTMCell(hidden_layer_size,forget_bias= 1.0) 
                 for _ in range(num_LSTM_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(cells=lstm_cell,
                                       state_is_tuple=True)
    outputs, states = tf.nn.dynamic_rnn(cell, embed,
                                        sequence_length = _seqlens,
                                        dtype=tf.float32)  # two stacked LSTM layers
    
weights = {
        'linear_layer':tf.Variable(tf.truncated_normal([hidden_layer_size,num_classes],
                                                       mean=0,stddev=.01))
}
biases = {
        'linear_layer':tf.Variable(tf.truncated_normal([num_classes],mean=0,stddev=.01))
}

## final_output = tf.matmul(states[1], weights['linear_layer']) + biases['linear_layer']
# Extract the final state and use in a linear layer
final_output = tf.matmul(states[num_LSTM_layers-1][1], weights["linear_layer"]) + biases["linear_layer"]

softmax = tf.nn.softmax_cross_entropy_with_logits(logits=final_output,labels=_labels)
cross_entropy = tf.reduce_mean(softmax)

+ #### Training Embeddings and the Stacked LSTM Classifier

In [43]:
train_step = tf.train.RMSPropOptimizer(0.001, 0.9).minimize(cross_entropy)

correct_prediction = tf.equal(tf.argmax(_labels,1), tf.argmax(final_output,1))
accuracy = (tf.reduce_mean(tf.cast(correct_prediction, tf.float32)))*100

In [44]:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    
    for step in range(1000):
        x_batch, y_batch,seqlen_batch = get_sentence_batch(batch_size,
                                         train_x,train_y, train_seqlens)
        sess.run(train_step,feed_dict={_inputs:x_batch, _labels:y_batch,
                                       _seqlens:seqlen_batch})
        
        if step % 100 == 0:
            acc = sess.run(accuracy,feed_dict={_inputs:x_batch,
                                               _labels:y_batch,
                                               _seqlens:seqlen_batch})
            print("Accuracy at %d: %.5f" % (step, acc))
        
    for test_batch in range(5):
        x_test, y_test,seqlen_test = get_sentence_batch(batch_size,
                                                        test_x,test_y,
                                                        test_seqlens)
        batch_pred,batch_acc = sess.run([tf.argmax(final_output,1), accuracy],
                                         feed_dict={_inputs:x_test,
                                                    _labels:y_test,
                                                    _seqlens:seqlen_test})
        print("Test batch accuracy %d: %.5f" % (test_batch, batch_acc))
    
    output_example = sess.run([outputs],feed_dict={_inputs:x_test,
                                                   _labels:y_test,
                                                   _seqlens:seqlen_test})
    states_example = sess.run([states[1]],feed_dict={_inputs:x_test,
                                                     _labels:y_test,
                                                     _seqlens:seqlen_test})

Accuracy at 0: 60.93750
Accuracy at 100: 100.00000
Accuracy at 200: 100.00000
Accuracy at 300: 100.00000
Accuracy at 400: 100.00000
Accuracy at 500: 100.00000
Accuracy at 600: 100.00000
Accuracy at 700: 100.00000
Accuracy at 800: 100.00000
Accuracy at 900: 100.00000
Test batch accuracy 0: 100.00000
Test batch accuracy 1: 100.00000
Test batch accuracy 2: 100.00000
Test batch accuracy 3: 100.00000
Test batch accuracy 4: 100.00000


In [45]:
print(seqlen_test[1])
print('This output has 6 time steps, each with vector size 32 :', output_example[0][1].shape)
print('\nPrinting only the first 3 dimensions :\n', output_example[0][1][:6,:3])
print('\n')

print(len(states_example[0]))
print('The states vector returned by dynamic_rnn() : \n', states_example[0][1][1,:])

4
This output has 6 time steps, each with vector size 32 : (6, 32)

Printing only the first 3 dimensions :
 [[-0.25823805  0.29683825  0.18983856]
 [-0.64487714  0.73287576  0.5071672 ]
 [-0.8181103   0.90600467  0.70263547]
 [-0.86785537  0.94548136  0.7704953 ]
 [ 0.          0.          0.        ]
 [ 0.          0.          0.        ]]


2
The states vector returned by dynamic_rnn() : 
 [-0.86785537  0.94548136  0.7704953  -0.08993607  0.8896145   0.89738095
 -0.90628624 -0.8824731  -0.8189859  -0.5911037   0.8861079  -0.9172841
 -0.75333756 -0.9453625   0.9268746  -0.9108106   0.8645589   0.93709713
  0.92868733 -0.8984062  -0.86088115  0.89710945 -0.93727446 -0.9186008
 -0.925568    0.9135656  -0.7785736   0.895155   -0.9355229  -0.9055909
  0.51267844  0.9289578 ]


<a id='Word2Vec'></a>
# 2. Word2Vec for Text Sequences
> [ Reference ] :
1. Tom Hope, Yehezkel S. Resheff, and Itay Lieder, "**`Learning TensorFlow : A Guide to Building Deep Learning Systems`**", Chapter 6, O'Reilly, 2017.

+ skip-grams 
+ negative sampling 
+ **word embeddings** 

In [80]:
import os
import math
import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

# Define the hyperparameters
batch_size = 64
embedding_dimension=5
negative_samples = 8
LOG_DIR = 'word2vec'

digit_to_word_map = {1: 'One', 2: 'Two', 3: 'Three', 
                     4: 'Four', 5: 'Five', 6: 'Six', 
                     7: 'Seven', 8: 'Eight', 9: 'Nine'}
sentences = []

# Create two kinds of sentences - sequences of odd and even digits
for i in range(10000):
    rand_odd_ints =np.random.choice(range(1,10,2),3)
    sentences.append(' '.join([digit_to_word_map[r] for r in rand_odd_ints]))
    rand_even_ints = np.random.choice(range(2,10,2),3)
    sentences.append(' ' .join([digit_to_word_map[r] for r in rand_even_ints]))

In [81]:
sentences[0:10]

['One Five One',
 'Four Eight Eight',
 'Five Nine One',
 'Four Six Two',
 'Seven Seven One',
 'Six Two Six',
 'One One Nine',
 'Eight Two Four',
 'Seven One Seven',
 'Two Four Six']

In [82]:
# Map words to indices
word2index_map = {}
index=0
for sent in sentences:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = index
            index+=1
index2word_map = {index: word for word, index in word2index_map.items()}
vocabulary_size = len(index2word_map)

<a id='SkipGram'></a>
### 2.1 Skip-Grams 
+ (Ref. 1, Chapter 6)

In [83]:
def get_skipgram_batch(batch_size):
    instance_indices = list(range(len(skip_gram_pairs)))
    np.random.shuffle(instance_indices)
    batch = instance_indices[:batch_size]
    x = [skip_gram_pairs[i][0] for i in batch]
    y = [[skip_gram_pairs[i][1]] for i in batch]
    return x,y

In [84]:
# map words to indices
word2index_map = {}
index = 0
for sent in sentences:
    for word in sent.lower().split():
        if word not in word2index_map:
            word2index_map[word] = index
            index+=1
index2word_map = {index:word for word,index in word2index_map.items()}
vocabulary_size = len(index2word_map)

#create skip-gram pairs
skip_gram_pairs = []
for sent in sentences:
    tokenized_sent = sent.lower().split()
    for i in range(1,len(tokenized_sent)-1):
        word_context_pair = [[word2index_map[tokenized_sent[i-1]],
                             word2index_map[tokenized_sent[i+1]]],
                             word2index_map[tokenized_sent[i]]]
        skip_gram_pairs.append([word_context_pair[1],
                                word_context_pair[0][0]])
        skip_gram_pairs.append([word_context_pair[1],
                                word_context_pair[0][1]])

In [85]:
skip_gram_pairs[0:10]

[[1, 0],
 [1, 0],
 [3, 2],
 [3, 3],
 [4, 1],
 [4, 0],
 [5, 2],
 [5, 6],
 [7, 7],
 [7, 0]]
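
The pair construction above can be sketched on single sentences, using the words themselves rather than their indices. The helper name `skipgram_pairs` is an assumption. Like the notebook code, it only treats interior words as targets and pairs each with its left and right neighbour.

```python
def skipgram_pairs(tokens):
    """(target, context) pairs for every interior word and its two neighbours."""
    pairs = []
    for i in range(1, len(tokens) - 1):
        pairs.append((tokens[i], tokens[i - 1]))   # left neighbour
        pairs.append((tokens[i], tokens[i + 1]))   # right neighbour
    return pairs

print(skipgram_pairs("one five one".split()))
print(skipgram_pairs("four eight eight".split()))
```

With three-word sentences each sentence therefore contributes exactly two pairs, both with the middle word as the target.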

In [86]:
# Batch example
x_batch,y_batch = get_skipgram_batch(8)
x_batch
y_batch
[index2word_map[word] for word in x_batch]
[index2word_map[word[0]] for word in y_batch]

['nine', 'five', 'seven', 'four', 'six', 'four', 'nine', 'eight']

In [87]:
x_batch

[7, 1, 4, 3, 2, 5, 7, 2]

In [88]:
y_batch

[[4], [1], [7], [2], [5], [2], [4], [3]]

In [89]:
[index2word_map[word] for word in x_batch]

['seven', 'five', 'nine', 'eight', 'four', 'six', 'seven', 'four']

In [90]:
[index2word_map[word[0]] for word in y_batch]

['nine', 'five', 'seven', 'four', 'six', 'four', 'nine', 'eight']

<a id='BuildGraph'></a>
### 2.2 Building the Computation Graph

In [91]:
## --------------------------------------
## Building up the computation graph...
## --------------------------------------
# Need to reset the computational graph 
tf.reset_default_graph()

#Input data,labels
train_inputs = tf.placeholder(tf.int32,shape=[batch_size])
train_labels = tf.placeholder(tf.int32,shape=[batch_size,1])

<a id='Embeddings'></a>
> ### Embeddings in TensorFlow

+ Here we use a loss function accounting for the *`unsupervised`* nature of the word-embedding task. 
+ Using the embedding lookup (the built-in **`tf.nn.embedding_lookup() function`**), which efficiently retrieves the vectors for each word in a given sequence of word indices, remains the same:

In [92]:
# Building the lookup table
with tf.name_scope('embeddings'):
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size,embedding_dimension],-1.0,1.0),name='embedding')
    # This is essentially a lookup table
    embed = tf.nn.embedding_lookup(embeddings,train_inputs)

<a id='NCE'></a>
> ### The Noise-Contrastive Estimation (NCE) Loss Function
>    + `tf.nn.nce_loss()` automatically draws negative (“noise”) samples when we evaluate the loss (run it in a session):
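
To make the idea of contrasting a true context word with randomly drawn "noise" words more concrete, here is a plain-Python sketch of a negative-sampling loss for a single training pair. This is a simplified relative of NCE and not what `tf.nn.nce_loss()` computes exactly: it draws noise words uniformly and applies no correction for the sampling probabilities. All names in it are hypothetical.

```python
import math
import random

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z))
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def negative_sampling_loss(embed, out_vectors, target, vocab_size, k, rng):
    """Loss for one (input embedding, true context word) pair.

    The true word is pushed towards a high score and k randomly drawn
    noise words towards a low score.
    """
    noise = [rng.randrange(vocab_size) for _ in range(k)]
    loss = -log_sigmoid(dot(embed, out_vectors[target]))
    for w in noise:
        loss -= log_sigmoid(-dot(embed, out_vectors[w]))
    return loss

rng = random.Random(0)
vocab_size, dim, k = 9, 5, 8                             # sizes as in this notebook
out_vectors = [[0.0] * dim for _ in range(vocab_size)]   # untrained output weights
embed = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
loss_value = negative_sampling_loss(embed, out_vectors, target=3,
                                    vocab_size=vocab_size, k=k, rng=rng)
print(loss_value)
```

With all output vectors at zero every score is 0, so each of the 1 + k terms contributes ln 2. Training lowers the loss by raising the score of true pairs and lowering the score of noise pairs.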

In [93]:
# Create variables for the NCE loss
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size,embedding_dimension],
                                              stddev=1.0/math.sqrt(embedding_dimension)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

loss = tf.reduce_mean(tf.nn.nce_loss(weights = nce_weights,
                                     biases = nce_biases,
                                     inputs = embed,
                                     labels = train_labels,
                                     num_sampled = negative_samples,
                                     num_classes = vocabulary_size))

tf.summary.scalar("NCE_loss", loss)

<tf.Tensor 'NCE_loss_1:0' shape=() dtype=string>

<a id='LRDecay'></a>
> ### Learning Rate Decay
>    + **`tf.train.exponential_decay()`**

In [94]:
# Learning rate decay
global_step = tf.Variable(0,trainable=False)
learningRate = tf.train.exponential_decay(learning_rate=0.1,
                                          global_step = global_step,
                                          decay_steps=1000,
                                          decay_rate=0.95,
                                          staircase=True)
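# Pass global_step so the optimizer increments it; otherwise the learning rate never decays
train_step = tf.train.GradientDescentOptimizer(learningRate).minimize(
    loss, global_step=global_step)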
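
The schedule computed by `tf.train.exponential_decay()` is `learning_rate * decay_rate ** (global_step / decay_steps)`, with the exponent floored when `staircase=True`. The plain-Python function below is a sketch of that formula, not TensorFlow's implementation.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """learning_rate * decay_rate ** (global_step / decay_steps).

    With staircase=True the exponent is floored, so the rate drops in
    discrete steps every `decay_steps` updates.
    """
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps
    return learning_rate * decay_rate ** exponent

for step in (0, 999, 1000, 2500):
    print(step, exponential_decay(0.1, step, 1000, 0.95, staircase=True))
```

With `decay_steps=1000` and `staircase=True`, the rate stays at 0.1 for the first 1000 updates, which is the whole training run used in this notebook.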

<a id='LaunchGraph'></a>
### 2.3 Launching the Computation Graph

In [95]:
# Merge all summary ops
merged = tf.summary.merge_all()

with tf.Session() as sess:
    train_writer = tf.summary.FileWriter(LOG_DIR,graph=tf.get_default_graph())
    saver = tf.train.Saver()
    
    with open(os.path.join(LOG_DIR, 'metadata.tsv'), 'w') as metadata:
        metadata.write('Name\tClass\t\n')
        for k,v in index2word_map.items():
            metadata.write('%s\t%d\t\n' %(v,k))
    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    embedding.tensor_name = embeddings.name

    # Link embedding to its metadata file
    embedding.metadata_path = os.path.join(LOG_DIR,'metadata.tsv')
    projector.visualize_embeddings(train_writer,config)

    tf.global_variables_initializer().run()
    for step in range(1000):
        x_batch, y_batch = get_skipgram_batch(batch_size)
        summary, _ = sess.run([merged, train_step],
                              feed_dict={train_inputs: x_batch,
                                         train_labels: y_batch})
        train_writer.add_summary(summary, step)
        if step % 100 == 0:
            saver.save(sess, os.path.join(LOG_DIR, 'w2v_model.ckpt'), step)
            loss_value = sess.run(loss, feed_dict={train_inputs: x_batch,
                                                   train_labels: y_batch})
            # '%.5d' truncates the loss to an integer and zero-pads it to 5 digits;
            # use '%.5f' instead to see the fractional part
            print('loss as %d: %.5d' % (step, loss_value))

    # Normalize embeddings before using
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings),1,keep_dims=True))
    normalized_embeddings = embeddings / norm
    normalized_embeddings_matrix = sess.run(normalized_embeddings)

loss as 0: 00006
loss as 100: 00003
loss as 200: 00002
loss as 300: 00002
loss as 400: 00002
loss as 500: 00002
loss as 600: 00002
loss as 700: 00002
loss as 800: 00002
loss as 900: 00002


<a id='CheckOut'></a>
> ### Checking Out the Embeddings

In [96]:
ref_word = normalized_embeddings_matrix[word2index_map['one']]
cosine_dists = np.dot(normalized_embeddings_matrix,ref_word)
ff = np.argsort(cosine_dists)[::-1][1:10]
for f in ff:
    print(index2word_map[f])
    print(cosine_dists[f])

three
0.99160933
five
0.9637778
seven
0.9437641
nine
0.8734147
four
0.071836025
two
-0.03422053
eight
-0.036034882
six
-0.13121545



### [ Note ] :
> + The word vectors representing `odd numbers` are similar (in terms of the `dot product`) to `one`.
> + Those representing `even numbers` are not similar to it: their dot product with the `one` vector is close to zero or negative.
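
Because the embeddings were normalized to unit length, the dot products printed above are cosine similarities. The sketch below shows the same computation in plain Python on made-up 2-D vectors, which are not taken from the trained embeddings.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

# Made-up 2-D vectors for illustration
one = [1.0, 0.2]
three = [2.0, 0.5]     # points in nearly the same direction as `one`
two = [-0.2, 1.0]      # orthogonal to `one`
print(cosine_similarity(one, three), cosine_similarity(one, two))
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.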

-----------------

<a id='TensorBoard'></a>
### 2.4 Visualization with TensorBoard

> ####  To run TensorBoard, run the following command in the `Anaconda Prompt` :
`tensorboard --logdir=`_path/to/log-directory_

> + For instance, **`tensorboard --logdir=C:\DL\logs\word2vec`**

> Then open **`http://localhost:6006`** in a browser.


+ In TensorBoard, go to the `Projector` tab. This is a three-dimensional interactive visualization panel, where we can move around the space of our embedded vectors and explore different “angles,” zoom in, and more.


+ **An `Embedding Projector` TensorFlow demo** : http://projector.tensorflow.org/

-----------------