This Notebook is essentially a modification of:
1. [Recurrent Neural Networks in Tensorflow 2](https://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html)
2. [Recurrent Neural Networks in Tensorflow 3](https://r2rt.com/recurrent-neural-networks-in-tensorflow-iii-variable-length-sequences.html)

[Theories behind RNN & LSTM](https://r2rt.com/written-memories-understanding-deriving-and-extending-the-lstm.html)

# Contents
1. Loading data
2. Using the `BasicLSTMCell`
3. Stacking the LSTM cells
4. Adding Dropout
5. Using `GRUCell`
6. Generating words
7. Layer Normalization(skipped)

# 1. Loading data

Load the data and preprocess
* `wiki.dl.txt` represents the content of [the wikipedia article for Deep Learning](https://en.wikipedia.org/wiki/Deep_learning)

In [1]:
# Open and preprocess
import re
with open('wiki.dl.txt') as f:
    # remove special characters
    corpus = re.sub('\[[0-9]*\]','',f.read()) 
    # decapitalize
    corpus = corpus.lower()
    print(corpus[:1000])

# tokenize until word units
from nltk import sent_tokenize, word_tokenize
sentences = sent_tokenize(corpus)
sents_tokenized = [word_tokenize(sent) for sent in sentences]

# Build a vocabulary
from collections import Counter
from itertools import chain
count = Counter(chain(*sents_tokenized))
vocab_size = 5000
count = count.most_common(vocab_size - 3)
# WARNING: indices start from 1(for padding)
vocab = {tup[0]:i + 1 for i,tup in enumerate(count)}
START = '<<'
END = '>>'
UNKNOWN = '_UNK_'
vocab[START] = len(vocab) + 1
vocab[END] = len(vocab) + 1
vocab[UNKNOWN] = len(vocab) + 1
vocab_size = min(vocab_size, len(vocab))

rev_vocab = dict([(v,k) for k,v in vocab.items()])

# Filter words in the training sentences
for sent in sents_tokenized:
    sent.insert(0, START)
    sent.append(END)
    for i in range(len(sent)):
        word = sent[i]
        if word not in vocab:
            sent[i] = UNKNOWN

# Convert the words into indices            
sents2id = []
for sent in sents_tokenized:
    sents2id.append([])
    for word in sent:
        sents2id[-1].append(vocab[word])

print(sents_tokenized[:3])
print(sents2id[:3])

deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. learning can be supervised, partially supervised or unsupervised.

some representations are loosely based on interpretation of information processing and communication patterns in a biological nervous system, such as neural coding that attempts to define a relationship between various stimuli and associated neuronal responses in the brain. research attempts to create efficient systems to learn these representations from large-scale, unlabeled data sets.

deep learning architectures such as deep neural networks, deep belief networks and recurrent neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation and bioinformatics where they produced results 

Training epochs generator

In [2]:
import tensorflow as tf
print('tensorflow version:%s' % tf.__version__)
import numpy as np

# Training batch generator
def gen_samples(num_steps,batch_size):
    '''
    Generate a training example(a sentence) over a single epoch,
    using the sentences data
    
    Args:
        num_steps: the size of the words-window to consider
        batch_size: size of each batches
        
    Yields:
        batch: list of words. The number of examples in the batch is batch_size,
        and the length of each examples is seq_length.
    '''
    x_batch = []
    y_batch = []
    for sent in sents2id:
        sent_length = len(sent)
        # Too short sentences will be ignored
        if sent_length < num_steps + 1:
            break
        for i in range(sent_length-num_steps-1):
            x_batch.append(sent[i:i+num_steps])
            y_batch.append(sent[i+1:i+num_steps+1])
            if len(x_batch) == batch_size:
                yield x_batch,y_batch
                x_batch = []
                y_batch = []
                
def gen_epochs(n,num_steps,batch_size):
    '''
    Generate a whole training epoch
    
    Args:
        n: the number of training epochs
        num_steps: the size of the words-window to consider
        batch_size: size of each batches
    Yields:
        a training set generated with num_steps, batch_size over 
        the whole training corpus
    '''
    for n in range(n):
        yield gen_samples(num_steps,batch_size)
        
def train_network(graph, n_epochs, batch_size, num_steps,
                  verbose=True, save=False):
    tf.set_random_seed(123)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # list that stores losses for each training epochs
        training_losses = []
        for i_epoch,epoch in enumerate(
            gen_epochs(n_epochs, num_steps, batch_size)):
            total_loss = 0
            n_batches = 0
            state = None
            for X, Y in epoch:
                n_batches += 1
                feed_dict = {graph['x']:X, graph['y']:Y}
                batch_loss, batch_state, _ =\
                sess.run([graph['loss'],
                          graph['final_state'],
                          graph['optimizer']],
                         feed_dict)
                total_loss += batch_loss

            avg_epoch_loss = total_loss/n_batches
            training_losses.append(avg_epoch_loss)

            if verbose:
                print('Average loss in Epoch %i:%.4f' %
                      (i_epoch,avg_epoch_loss))

        # Store the training result
        if isinstance(save, str):
            graph['saver'].save(sess, save)

    return training_losses

tensorflow version:1.3.0


# 2. Basic LSTM

Here, we are using distributed representation for each words(i.e. word2vec). To understand what `tf.nn.embedding_lookup` is doing, refer to:
1. See the 'Inputs' section of [TF documentation for RNN](https://www.tensorflow.org/tutorials/recurrent) to understand what the function does
2. And look [here](http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/) for the rationale. Take a look at the section **Adding an embedding layer**
3. If you don't want to use word2vec, [this](https://stackoverflow.com/questions/35056909/input-to-lstm-network-tensorflow) might be helpful.

original source: https://r2rt.com/recurrent-neural-networks-in-tensorflow-ii.html

In [3]:
from tensorflow.contrib import rnn

def build_graph(batch_size, num_steps, state_size,
                num_classes = vocab_size, learning_rate=1e-4):
    
    # wipe out all previously built graphs
    tf.reset_default_graph()
    
    # word ids, without one-hot encoding
    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='x')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='y')

    # vector representation for each words(word2vec)
    word_embeddings = tf.get_variable('embedding_matrix',
                                 [num_classes, state_size])

    # rnn_inputs is a tensor of dim [batch_size,num_steps,state_size]
    rnn_inputs = tf.nn.embedding_lookup(word_embeddings, x)

    cell = rnn.BasicLSTMCell(state_size, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state =\
    tf.nn.dynamic_rnn(cell,rnn_inputs,initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes],
                            initializer=tf.constant_initializer(0.0))

    #reshape rnn_outputs and y so we can get the logits in a single matmul
    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b

    # Make predictions
    predictions = tf.nn.softmax(logits)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=y_reshaped))
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    
    return dict(
        x=x,
        y=y,
        init_state=init_state,
        final_state=final_state,
        loss=loss,
        optimizer=optimizer,
        preds=predictions,
        saver=tf.train.Saver()
    )

batch_size = 32
num_steps = 10
state_size = 500
num_epochs = 100

graph = build_graph(batch_size, num_steps, state_size)
train_network(graph, num_epochs, batch_size, num_steps)

Average loss in Epoch 0:7.4219
Average loss in Epoch 1:7.4095
Average loss in Epoch 2:7.3648
Average loss in Epoch 3:6.8196
Average loss in Epoch 4:6.1925
Average loss in Epoch 5:5.9669
Average loss in Epoch 6:5.8511
Average loss in Epoch 7:5.7761
Average loss in Epoch 8:5.7209
Average loss in Epoch 9:5.6772
Average loss in Epoch 10:5.6410
Average loss in Epoch 11:5.6102
Average loss in Epoch 12:5.5832
Average loss in Epoch 13:5.5592
Average loss in Epoch 14:5.5375
Average loss in Epoch 15:5.5176
Average loss in Epoch 16:5.4992
Average loss in Epoch 17:5.4818
Average loss in Epoch 18:5.4654
Average loss in Epoch 19:5.4497
Average loss in Epoch 20:5.4344
Average loss in Epoch 21:5.4197
Average loss in Epoch 22:5.4051
Average loss in Epoch 23:5.3907
Average loss in Epoch 24:5.3766
Average loss in Epoch 25:5.3626
Average loss in Epoch 26:5.3487
Average loss in Epoch 27:5.3349
Average loss in Epoch 28:5.3213
Average loss in Epoch 29:5.3079
Average loss in Epoch 30:5.2947
Average loss in Ep

[7.4219453811645506,
 7.4095009613037108,
 7.3647918891906734,
 6.8195610618591305,
 6.1925234031677245,
 5.9669471549987794,
 5.8511255264282225,
 5.7760960006713864,
 5.7208665847778324,
 5.6771656990051271,
 5.6410123252868649,
 5.6101623725891114,
 5.5832146263122562,
 5.5592173767089843,
 5.5375051689147945,
 5.5176248741149898,
 5.4991607093811039,
 5.481809883117676,
 5.4654388236999516,
 5.4497043609619142,
 5.4344337272644045,
 5.4196513366699222,
 5.4050824928283694,
 5.3907295799255373,
 5.3765571784973147,
 5.3625544929504398,
 5.3486833381652836,
 5.3349454689025881,
 5.3213220024108887,
 5.3078727912902828,
 5.2946855926513674,
 5.281846790313721,
 5.2690480613708495,
 5.2556065177917484,
 5.2418182945251468,
 5.2277877426147459,
 5.2135727500915525,
 5.1990779113769534,
 5.1841147422790526,
 5.1684820365905759,
 5.1521724700927738,
 5.1354609298706055,
 5.1184075355529783,
 5.1009625816345219,
 5.0831543350219723,
 5.0650280380249022,
 5.046668281555176,
 5.0280907058715

# 3. Stacking the RNN cell

In [4]:
def build_graph(batch_size, num_steps, state_size, num_layers = 5,
                num_classes = vocab_size, learning_rate=1e-4):
    
    # wipe out all previously built graphs
    tf.reset_default_graph()
    
    # word ids, without one-hot encoding
    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='x')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='y')

    # vector representation for each words(word2vec)
    word_embeddings = tf.get_variable('embedding_matrix',
                                 [num_classes, state_size])

    # rnn_inputs is a tensor of dim [batch_size,num_steps,state_size]
    rnn_inputs = tf.nn.embedding_lookup(word_embeddings, x)

    cell = rnn.BasicLSTMCell(state_size, state_is_tuple=True)
    cell = rnn.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state =\
    tf.nn.dynamic_rnn(cell,rnn_inputs,initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes],
                            initializer=tf.constant_initializer(0.0))

    #reshape rnn_outputs and y so we can get the logits in a single matmul
    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b

    # Make predictions
    predictions = tf.nn.softmax(logits)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=y_reshaped))
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    
    return dict(
        x=x,
        y=y,
        init_state=init_state,
        final_state=final_state,
        loss=loss,
        optimizer=optimizer,
        preds=predictions,
        saver=tf.train.Saver()
    )

batch_size = 32
num_steps = 10
state_size = 500
num_epochs = 100

graph = build_graph(batch_size, num_steps, state_size)
train_network(graph, num_epochs, batch_size, num_steps)

Average loss in Epoch 0:7.4218
Average loss in Epoch 1:7.2267
Average loss in Epoch 2:6.2358
Average loss in Epoch 3:5.8123
Average loss in Epoch 4:5.6463
Average loss in Epoch 5:5.5572
Average loss in Epoch 6:5.5023
Average loss in Epoch 7:5.4551
Average loss in Epoch 8:5.4100
Average loss in Epoch 9:5.3771
Average loss in Epoch 10:5.3582
Average loss in Epoch 11:5.3469
Average loss in Epoch 12:5.3400
Average loss in Epoch 13:5.3352
Average loss in Epoch 14:5.3317
Average loss in Epoch 15:5.3289
Average loss in Epoch 16:5.3265
Average loss in Epoch 17:5.3245
Average loss in Epoch 18:5.3228
Average loss in Epoch 19:5.3212
Average loss in Epoch 20:5.3198
Average loss in Epoch 21:5.3185
Average loss in Epoch 22:5.3172
Average loss in Epoch 23:5.3160
Average loss in Epoch 24:5.3148
Average loss in Epoch 25:5.3137
Average loss in Epoch 26:5.3125
Average loss in Epoch 27:5.3113
Average loss in Epoch 28:5.3101
Average loss in Epoch 29:5.3086
Average loss in Epoch 30:5.3084
Average loss in Ep

[7.4218243980407719,
 7.2266994094848629,
 6.2358335304260253,
 5.8122639846801754,
 5.6463211822509765,
 5.5571738243103024,
 5.5023233795166018,
 5.4550895881652828,
 5.4099793624877925,
 5.3771359443664553,
 5.3582220268249507,
 5.3469111251831052,
 5.3399650001525876,
 5.3352075195312496,
 5.3317015647888182,
 5.3288762092590334,
 5.3265036964416508,
 5.3244793891906737,
 5.322770862579346,
 5.3211570167541504,
 5.3197667503356936,
 5.3184642791748047,
 5.3171938705444335,
 5.3159810829162595,
 5.3148066711425779,
 5.3136506080627441,
 5.312477645874023,
 5.3112623023986814,
 5.3100501632690431,
 5.3086326408386233,
 5.3084370803833005,
 5.3059710693359374,
 5.3047337341308598,
 5.3033027648925781,
 5.301822872161865,
 5.3020363235473633,
 5.3008287620544436,
 5.2983656501770016,
 5.2973518753051758,
 5.2961262893676757,
 5.2947085952758792,
 5.2979016304016113,
 5.3007911109924315,
 5.2987985801696773,
 5.29440860748291,
 5.2933845329284672,
 5.2924264335632323,
 5.290903129577636

# 4. Adding Dropout

LSTM with a single layer

In [5]:
def build_graph(batch_size, num_steps, state_size, keep_prob=0.7,
                num_classes = vocab_size, learning_rate=1e-4):
    
    # wipe out all previously built graphs
    tf.reset_default_graph()
    
    # word ids, without one-hot encoding
    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='x')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='y')

    # vector representation for each words(word2vec)
    word_embeddings = tf.get_variable('embedding_matrix',
                                 [num_classes, state_size])

    # rnn_inputs is a tensor of dim [batch_size,num_steps,state_size]
    rnn_inputs = tf.nn.embedding_lookup(word_embeddings, x)
    
    # Add a dropout to the input layer
    rnn_inputs = tf.nn.dropout(rnn_inputs, keep_prob)

    cell = rnn.BasicLSTMCell(state_size, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state =\
    tf.nn.dynamic_rnn(cell,rnn_inputs,initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes],
                            initializer=tf.constant_initializer(0.0))

    #reshape rnn_outputs and y so we can get the logits in a single matmul
    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    
    # Dropout layer
    rnn_inputs = tf.nn.dropout(rnn_outputs, keep_prob)
    
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b

    # Make predictions
    predictions = tf.nn.softmax(logits)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=y_reshaped))
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    
    return dict(
        x=x,
        y=y,
        init_state=init_state,
        final_state=final_state,
        loss=loss,
        optimizer=optimizer,
        preds=predictions,
        saver=tf.train.Saver()
    )

batch_size = 32
num_steps = 10
state_size = 500
num_epochs = 100

graph = build_graph(batch_size, num_steps, state_size)
train_network(graph, num_epochs, batch_size, num_steps)

Average loss in Epoch 0:7.4217
Average loss in Epoch 1:7.4082
Average loss in Epoch 2:7.3474
Average loss in Epoch 3:6.7228
Average loss in Epoch 4:6.1681
Average loss in Epoch 5:5.9628
Average loss in Epoch 6:5.8527
Average loss in Epoch 7:5.7802
Average loss in Epoch 8:5.7259
Average loss in Epoch 9:5.6832
Average loss in Epoch 10:5.6483
Average loss in Epoch 11:5.6173
Average loss in Epoch 12:5.5909
Average loss in Epoch 13:5.5665
Average loss in Epoch 14:5.5451
Average loss in Epoch 15:5.5248
Average loss in Epoch 16:5.5070
Average loss in Epoch 17:5.4892
Average loss in Epoch 18:5.4730
Average loss in Epoch 19:5.4577
Average loss in Epoch 20:5.4428
Average loss in Epoch 21:5.4290
Average loss in Epoch 22:5.4151
Average loss in Epoch 23:5.4019
Average loss in Epoch 24:5.3889
Average loss in Epoch 25:5.3752
Average loss in Epoch 26:5.3618
Average loss in Epoch 27:5.3490
Average loss in Epoch 28:5.3364
Average loss in Epoch 29:5.3238
Average loss in Epoch 30:5.3104
Average loss in Ep

[7.4217072296142579,
 7.4082433509826657,
 7.3474486923217777,
 6.7227872276306151,
 6.1680613708496095,
 5.9627669906616214,
 5.8527391052246092,
 5.7802228355407719,
 5.7259400939941409,
 5.6832261085510254,
 5.6482846450805662,
 5.6172613906860356,
 5.5909370422363285,
 5.5665307807922364,
 5.545142784118652,
 5.5247776794433596,
 5.5070427513122562,
 5.4891741943359378,
 5.4730441474914553,
 5.4576987838745117,
 5.4427736091613772,
 5.4289922714233398,
 5.4151290130615237,
 5.4018707466125484,
 5.3888525199890136,
 5.3751937675476071,
 5.3618081855773925,
 5.3490494155883788,
 5.3364063644409176,
 5.3237851524353026,
 5.3103896903991696,
 5.2969864273071288,
 5.2836200523376462,
 5.2713753128051755,
 5.2582091522216796,
 5.2447212409973147,
 5.2311424827575683,
 5.2174740219116211,
 5.2033799743652347,
 5.1888844871520998,
 5.1746632003784176,
 5.1599514579772947,
 5.1437390327453612,
 5.1292967224121098,
 5.1133275222778316,
 5.0964964103698733,
 5.0801027297973631,
 5.06341068267

Adding dropout for the layers in Stacked LSTM model

In [6]:
def build_graph(batch_size, num_steps, state_size,
                num_classes = vocab_size, keep_prob = 0.7,
                learning_rate=1e-4):
    
    # wipe out all previously built graphs
    tf.reset_default_graph()
    
    # word ids, without one-hot encoding
    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='x')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='y')

    # vector representation for each words(word2vec)
    word_embeddings = tf.get_variable('embedding_matrix',
                                 [num_classes, state_size])

    # rnn_inputs is a tensor of dim [batch_size,num_steps,state_size]
    rnn_inputs = tf.nn.embedding_lookup(word_embeddings, x)

    cell = rnn.BasicLSTMCell(state_size, state_is_tuple=True)
    # Adding Dropout along to each stacked LSTM layers
    cell = rnn.DropoutWrapper(cell,
                              input_keep_prob = keep_prob,
                              output_keep_prob = keep_prob)
    cell = rnn.MultiRNNCell([cell] * 5, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state =\
    tf.nn.dynamic_rnn(cell,rnn_inputs,initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes],
                            initializer=tf.constant_initializer(0.0))

    #reshape rnn_outputs and y so we can get the logits in a single matmul
    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b

    # Make predictions
    predictions = tf.nn.softmax(logits)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=y_reshaped))
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    
    return dict(
        x=x,
        y=y,
        init_state=init_state,
        final_state=final_state,
        loss=loss,
        optimizer=optimizer,
        preds=predictions,
        saver=tf.train.Saver()
    )

batch_size = 32
num_steps = 10
state_size = 500
num_epochs = 1000

graph = build_graph(batch_size,num_steps,state_size)
train_network(graph, num_epochs, batch_size, num_steps,
              save='saves/lstm_lm_1000epochs')

Average loss in Epoch 0:7.4220
Average loss in Epoch 1:7.2523
Average loss in Epoch 2:6.3041
Average loss in Epoch 3:5.8651
Average loss in Epoch 4:5.7038
Average loss in Epoch 5:5.6177
Average loss in Epoch 6:5.5626
Average loss in Epoch 7:5.5043
Average loss in Epoch 8:5.4660
Average loss in Epoch 9:5.4351
Average loss in Epoch 10:5.4032
Average loss in Epoch 11:5.3942
Average loss in Epoch 12:5.3863
Average loss in Epoch 13:5.3781
Average loss in Epoch 14:5.3707
Average loss in Epoch 15:5.3738
Average loss in Epoch 16:5.3687
Average loss in Epoch 17:5.3637
Average loss in Epoch 18:5.3497
Average loss in Epoch 19:5.3517
Average loss in Epoch 20:5.3559
Average loss in Epoch 21:5.3521
Average loss in Epoch 22:5.3474
Average loss in Epoch 23:5.3442
Average loss in Epoch 24:5.3490
Average loss in Epoch 25:5.3452
Average loss in Epoch 26:5.3418
Average loss in Epoch 27:5.3397
Average loss in Epoch 28:5.3367
Average loss in Epoch 29:5.3357
Average loss in Epoch 30:5.3378
Average loss in Ep

Average loss in Epoch 252:2.4861
Average loss in Epoch 253:2.4791
Average loss in Epoch 254:2.4659
Average loss in Epoch 255:2.4520
Average loss in Epoch 256:2.4376
Average loss in Epoch 257:2.4305
Average loss in Epoch 258:2.4268
Average loss in Epoch 259:2.4110
Average loss in Epoch 260:2.4051
Average loss in Epoch 261:2.3940
Average loss in Epoch 262:2.3973
Average loss in Epoch 263:2.3836
Average loss in Epoch 264:2.3646
Average loss in Epoch 265:2.3608
Average loss in Epoch 266:2.3538
Average loss in Epoch 267:2.3327
Average loss in Epoch 268:2.3319
Average loss in Epoch 269:2.3138
Average loss in Epoch 270:2.2977
Average loss in Epoch 271:2.2952
Average loss in Epoch 272:2.2954
Average loss in Epoch 273:2.2808
Average loss in Epoch 274:2.2674
Average loss in Epoch 275:2.2675
Average loss in Epoch 276:2.2629
Average loss in Epoch 277:2.2761
Average loss in Epoch 278:2.2385
Average loss in Epoch 279:2.2579
Average loss in Epoch 280:2.2561
Average loss in Epoch 281:2.2691
Average lo

Average loss in Epoch 501:0.8928
Average loss in Epoch 502:0.8933
Average loss in Epoch 503:0.8867
Average loss in Epoch 504:0.8764
Average loss in Epoch 505:0.8812
Average loss in Epoch 506:0.8733
Average loss in Epoch 507:0.8707
Average loss in Epoch 508:0.8709
Average loss in Epoch 509:0.8584
Average loss in Epoch 510:0.8606
Average loss in Epoch 511:0.8558
Average loss in Epoch 512:0.8575
Average loss in Epoch 513:0.8443
Average loss in Epoch 514:0.8420
Average loss in Epoch 515:0.8494
Average loss in Epoch 516:0.8363
Average loss in Epoch 517:0.8462
Average loss in Epoch 518:0.8428
Average loss in Epoch 519:0.8234
Average loss in Epoch 520:0.8356
Average loss in Epoch 521:0.8243
Average loss in Epoch 522:0.8238
Average loss in Epoch 523:0.8254
Average loss in Epoch 524:0.8194
Average loss in Epoch 525:0.8121
Average loss in Epoch 526:0.8130
Average loss in Epoch 527:0.8083
Average loss in Epoch 528:0.8023
Average loss in Epoch 529:0.8021
Average loss in Epoch 530:0.8081
Average lo

Average loss in Epoch 750:0.4288
Average loss in Epoch 751:0.4188
Average loss in Epoch 752:0.4200
Average loss in Epoch 753:0.4231
Average loss in Epoch 754:0.4162
Average loss in Epoch 755:0.4240
Average loss in Epoch 756:0.4142
Average loss in Epoch 757:0.4252
Average loss in Epoch 758:0.4138
Average loss in Epoch 759:0.4199
Average loss in Epoch 760:0.4182
Average loss in Epoch 761:0.4178
Average loss in Epoch 762:0.4156
Average loss in Epoch 763:0.4046
Average loss in Epoch 764:0.4104
Average loss in Epoch 765:0.4056
Average loss in Epoch 766:0.4162
Average loss in Epoch 767:0.4123
Average loss in Epoch 768:0.4125
Average loss in Epoch 769:0.4045
Average loss in Epoch 770:0.3963
Average loss in Epoch 771:0.4116
Average loss in Epoch 772:0.3996
Average loss in Epoch 773:0.4109
Average loss in Epoch 774:0.4093
Average loss in Epoch 775:0.4103
Average loss in Epoch 776:0.3935
Average loss in Epoch 777:0.4050
Average loss in Epoch 778:0.3953
Average loss in Epoch 779:0.4056
Average lo

Average loss in Epoch 999:0.2878


[7.4219515228271487,
 7.2522882080078128,
 6.3040896224975587,
 5.8651046371459961,
 5.7037962150573733,
 5.6177187156677242,
 5.5626495552062991,
 5.5042684936523436,
 5.4660033988952641,
 5.4350634002685547,
 5.4031640243530275,
 5.3941654968261723,
 5.3863291168212895,
 5.3781476402282715,
 5.3706502914428711,
 5.3738175201416016,
 5.3687067985534664,
 5.363705978393555,
 5.3496961021423344,
 5.3517421722412113,
 5.3559330368041991,
 5.3520964050292967,
 5.3474118423461912,
 5.3442337226867673,
 5.348981075286865,
 5.3451544761657717,
 5.3417949867248531,
 5.3397066879272463,
 5.3367014312744141,
 5.3357400131225585,
 5.3378387641906739,
 5.3361295890808105,
 5.3369355201721191,
 5.3381499671936039,
 5.3326733398437502,
 5.3328661346435551,
 5.3283170890808105,
 5.3248631286621091,
 5.3219636726379393,
 5.3212178421020511,
 5.3222626304626468,
 5.3177423667907711,
 5.3164740753173829,
 5.3191093254089354,
 5.313106460571289,
 5.3230368423461911,
 5.321018161773682,
 5.31815172195434

# 5. Using `GRUCell`

In [7]:
def build_graph(batch_size, num_steps, state_size,
                num_classes = vocab_size, keep_prob = 0.7,
                learning_rate=1e-4):
    
    # wipe out all previously built graphs
    tf.reset_default_graph()
    
    # word ids, without one-hot encoding
    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='x')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='y')

    # vector representation for each words(word2vec)
    word_embeddings = tf.get_variable('embedding_matrix',
                                 [num_classes, state_size])

    # rnn_inputs is a tensor of dim [batch_size,num_steps,state_size]
    rnn_inputs = tf.nn.embedding_lookup(word_embeddings, x)

    cell = rnn.GRUCell(state_size)
    # Adding Dropout along to each stacked GRU layers
    cell = rnn.DropoutWrapper(cell,
                              input_keep_prob = keep_prob,
                              output_keep_prob = keep_prob)
    cell = rnn.MultiRNNCell([cell] * 5, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state =\
    tf.nn.dynamic_rnn(cell,rnn_inputs,initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes],
                            initializer=tf.constant_initializer(0.0))

    #reshape rnn_outputs and y so we can get the logits in a single matmul
    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b

    # Make predictions
    predictions = tf.nn.softmax(logits)

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits, labels=y_reshaped))
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    
    return dict(
        x=x,
        y=y,
        init_state=init_state,
        final_state=final_state,
        loss=loss,
        optimizer=optimizer,
        preds=predictions,
        saver=tf.train.Saver()
    )

batch_size = 32
num_steps = 10
state_size = 500
num_epochs = 1000

graph = build_graph(batch_size,num_steps,state_size)
train_network(graph, num_epochs, batch_size, num_steps,
              save='saves/gru_lm_1000epochs')

Average loss in Epoch 0:7.4222
Average loss in Epoch 1:7.3584
Average loss in Epoch 2:6.5257
Average loss in Epoch 3:5.9450
Average loss in Epoch 4:5.7495
Average loss in Epoch 5:5.6482
Average loss in Epoch 6:5.5778
Average loss in Epoch 7:5.5332
Average loss in Epoch 8:5.5026
Average loss in Epoch 9:5.4657
Average loss in Epoch 10:5.4389
Average loss in Epoch 11:5.4214
Average loss in Epoch 12:5.4063
Average loss in Epoch 13:5.3938
Average loss in Epoch 14:5.3959
Average loss in Epoch 15:5.3843
Average loss in Epoch 16:5.3792
Average loss in Epoch 17:5.3875
Average loss in Epoch 18:5.3808
Average loss in Epoch 19:5.3778
Average loss in Epoch 20:5.3757
Average loss in Epoch 21:5.3726
Average loss in Epoch 22:5.3691
Average loss in Epoch 23:5.3694
Average loss in Epoch 24:5.3714
Average loss in Epoch 25:5.3639
Average loss in Epoch 26:5.3600
Average loss in Epoch 27:5.3609
Average loss in Epoch 28:5.3581
Average loss in Epoch 29:5.3542
Average loss in Epoch 30:5.3578
Average loss in Ep

Average loss in Epoch 252:2.7055
Average loss in Epoch 253:2.6915
Average loss in Epoch 254:2.6676
Average loss in Epoch 255:2.6699
Average loss in Epoch 256:2.6517
Average loss in Epoch 257:2.6301
Average loss in Epoch 258:2.6171
Average loss in Epoch 259:2.6259
Average loss in Epoch 260:2.6084
Average loss in Epoch 261:2.5913
Average loss in Epoch 262:2.5907
Average loss in Epoch 263:2.5806
Average loss in Epoch 264:2.5720
Average loss in Epoch 265:2.5594
Average loss in Epoch 266:2.5328
Average loss in Epoch 267:2.5255
Average loss in Epoch 268:2.5099
Average loss in Epoch 269:2.5010
Average loss in Epoch 270:2.5050
Average loss in Epoch 271:2.4778
Average loss in Epoch 272:2.4631
Average loss in Epoch 273:2.4520
Average loss in Epoch 274:2.4428
Average loss in Epoch 275:2.4298
Average loss in Epoch 276:2.4275
Average loss in Epoch 277:2.4222
Average loss in Epoch 278:2.3966
Average loss in Epoch 279:2.4005
Average loss in Epoch 280:2.3861
Average loss in Epoch 281:2.3685
Average lo

Average loss in Epoch 501:0.9884
Average loss in Epoch 502:0.9963
Average loss in Epoch 503:0.9879
Average loss in Epoch 504:0.9861
Average loss in Epoch 505:0.9824
Average loss in Epoch 506:0.9811
Average loss in Epoch 507:0.9816
Average loss in Epoch 508:0.9830
Average loss in Epoch 509:0.9668
Average loss in Epoch 510:0.9684
Average loss in Epoch 511:0.9658
Average loss in Epoch 512:0.9630
Average loss in Epoch 513:0.9555
Average loss in Epoch 514:0.9584
Average loss in Epoch 515:0.9410
Average loss in Epoch 516:0.9437
Average loss in Epoch 517:0.9436
Average loss in Epoch 518:0.9517
Average loss in Epoch 519:0.9469
Average loss in Epoch 520:0.9323
Average loss in Epoch 521:0.9354
Average loss in Epoch 522:0.9282
Average loss in Epoch 523:0.9356
Average loss in Epoch 524:0.9240
Average loss in Epoch 525:0.9203
Average loss in Epoch 526:0.9200
Average loss in Epoch 527:0.9253
Average loss in Epoch 528:0.9155
Average loss in Epoch 529:0.9259
Average loss in Epoch 530:0.9042
Average lo

Average loss in Epoch 750:0.5135
Average loss in Epoch 751:0.5155
Average loss in Epoch 752:0.5102
Average loss in Epoch 753:0.5111
Average loss in Epoch 754:0.5090
Average loss in Epoch 755:0.5063
Average loss in Epoch 756:0.5114
Average loss in Epoch 757:0.5108
Average loss in Epoch 758:0.5009
Average loss in Epoch 759:0.5182
Average loss in Epoch 760:0.5107
Average loss in Epoch 761:0.5022
Average loss in Epoch 762:0.5092
Average loss in Epoch 763:0.5035
Average loss in Epoch 764:0.5039
Average loss in Epoch 765:0.4927
Average loss in Epoch 766:0.4982
Average loss in Epoch 767:0.4929
Average loss in Epoch 768:0.4926
Average loss in Epoch 769:0.5032
Average loss in Epoch 770:0.4929
Average loss in Epoch 771:0.4967
Average loss in Epoch 772:0.4943
Average loss in Epoch 773:0.4880
Average loss in Epoch 774:0.4905
Average loss in Epoch 775:0.4933
Average loss in Epoch 776:0.4972
Average loss in Epoch 777:0.4823
Average loss in Epoch 778:0.4854
Average loss in Epoch 779:0.4899
Average lo

Average loss in Epoch 999:0.3411


[7.4221527862548831,
 7.3584401702880857,
 6.5256946182250974,
 5.94500732421875,
 5.7494622611999509,
 5.6481802558898924,
 5.577843055725098,
 5.5332091140747073,
 5.5026279830932614,
 5.4657371330261233,
 5.4388532066345219,
 5.4214054489135739,
 5.4063381958007817,
 5.3938037872314455,
 5.395850791931152,
 5.3842817878723146,
 5.3792078781127932,
 5.3875311279296874,
 5.3808303070068355,
 5.3778426170349123,
 5.3757103919982914,
 5.3725952911376957,
 5.369075698852539,
 5.3693513107299804,
 5.3713754653930668,
 5.3639495468139646,
 5.35998836517334,
 5.3608768272399905,
 5.3581439208984376,
 5.3542014694213869,
 5.3577730751037596,
 5.3498951339721676,
 5.3571703529357908,
 5.3597930145263675,
 5.3548560714721676,
 5.3567337989807129,
 5.3582329177856449,
 5.3560940361022951,
 5.3449330902099605,
 5.3538310623168943,
 5.3507927131652835,
 5.3434312820434569,
 5.352578620910645,
 5.3464652633666994,
 5.3442092323303223,
 5.3483645439147951,
 5.3487653350830078,
 5.3453985023498536,


# 5. Generating Words

In [9]:
def generate_words(graph, seed_word, n_words, trained_vars):
    '''
    Args:
        seed_word: the seed word to initiate the generation
        n_words: the number of words to generate
        trained_vars: the path for the saved model(Variables), stored by tf.train.Saver()
    '''
    
    with tf.Session() as sess:
        # restore the variables from the previous training
        graph['saver'].restore(sess, trained_vars)
        
        # beginning of the sentence
        state = None
        current_word = vocab[seed_word]
        i_words = [current_word]

        for i in range(n_words):  
            if state is not None:
                feed_dict = {graph['x']:[[current_word]],
                             graph['init_state']:state}
            else:
                feed_dict = {graph['x']:[[current_word]]}

            preds, state = sess.run([graph['preds'],
                                     graph['final_state']],
                                    feed_dict)

            current_word = np.random.choice(
                a=vocab_size, size=1, p=np.squeeze(preds))[0]
            i_words.append(current_word)

    return ' '.join([rev_vocab[i] for i in i_words])

graph_test = build_graph(1,1,state_size)
generate_words(graph_test, START, 1000, 'saves/gru_lm_1000epochs')

INFO:tensorflow:Restoring parameters from saves/gru_lm_1000epochs


"<< the universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions and . . . . a world may propagate through a layer more than once , the cap depth is potentially unlimited . . . then time more showing hopfield , which demonstrated layer learning with the layers from nonlinear processing units used in used ) and neocognitron introduced by fukushima in in . . a cumulative concepts . . . . . . . each layer in turn as an unsupervised restricted boltzmann in data hand-written algorithm , more abstract days of the nodes by recurrent neural networks . in which a signal may propagate through a layer more than once , the cap depth is potentially unlimited . . time . . a time , pick out which features . useful . . . . . . . representation . . . . hornik . . respectively . . . . . . representation . . . . . . . . . . . . . . . . representation . . . . . . . . . . . a signal . in colleagues in 

# 6. Layer Normalization(skipped)
[Layer Normalization](https://arxiv.org/abs/1607.06450) is the RNN equivalent of Batch Normalization.