For deep learning, this tutorial walks you through building handwritten digit classifiers on the MNIST dataset, arguably the "Hello World" of neural networks. For reinforcement learning, we let a computer learn to play Pong from raw screen inputs. For natural language processing, we start with word embeddings and then describe language modeling and machine translation.

This tutorial includes modularized implementations of all the Google TensorFlow Deep Learning tutorials, so you can read the TensorFlow Deep Learning tutorial at the same time [en] [cn].


For experts: read the source code of InputLayer and DenseLayer and you will understand how TensorLayer works. After that, we recommend reading the code on GitHub directly.

Before we start

The tutorial assumes that you are somewhat familiar with neural networks and TensorFlow (the library which TensorLayer is built on top of). You can learn the basics of neural networks from the Deeplearning Tutorial.

For a more slow-paced introduction to artificial neural networks, we recommend Convolutional Neural Networks for Visual Recognition by Andrej Karpathy et al. and Neural Networks and Deep Learning by Michael Nielsen.

To learn more about TensorFlow, have a look at the TensorFlow tutorial. You will not need all of it, but a basic understanding of how TensorFlow works is required to be able to use TensorLayer. If you're new to TensorFlow, go through that tutorial first.

TensorLayer is simple

The following code shows a simple example of TensorLayer. We provide many simple functions (like fit() and test()); however, if you want to understand the details and become a machine learning expert, we suggest that you train the network by using TensorFlow's own methods directly.

import tensorflow as tf
import tensorlayer as tl

sess = tf.InteractiveSession()

# prepare data
X_train, y_train, X_val, y_val, X_test, y_test = \
                                tl.files.load_mnist_dataset(shape=(-1, 784))

# define placeholder
x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.int64, shape=[None, ], name='y_')

# define the network
network = tl.layers.InputLayer(x, name='input_layer')
network = tl.layers.DropoutLayer(network, keep=0.8, name='drop1')
network = tl.layers.DenseLayer(network, n_units=800,
                                act = tf.nn.relu, name='relu1')
network = tl.layers.DropoutLayer(network, keep=0.5, name='drop2')
network = tl.layers.DenseLayer(network, n_units=800,
                                act = tf.nn.relu, name='relu2')
network = tl.layers.DropoutLayer(network, keep=0.5, name='drop3')
# the softmax is implemented internally in tl.cost.cross_entropy(y, y_, 'cost') to
# speed up computation, so we use identity here.
# see tf.nn.sparse_softmax_cross_entropy_with_logits()
network = tl.layers.DenseLayer(network, n_units=10,
                                act = tf.identity,
                                name='output_layer')

# define cost function and metric.
y = network.outputs
cost = tl.cost.cross_entropy(y, y_, 'cost')
correct_prediction = tf.equal(tf.argmax(y, 1), y_)
acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
y_op = tf.argmax(tf.nn.softmax(y), 1)

# define the optimizer
train_params = network.all_params
train_op = tf.train.AdamOptimizer(learning_rate=0.0001, beta1=0.9, beta2=0.999,
                            epsilon=1e-08, use_locking=False).minimize(cost, var_list=train_params)

# initialize all variables in the session
sess.run(tf.initialize_all_variables())

# print network information
network.print_params()
network.print_layers()

# train the network
tl.utils.fit(sess, network, train_op, cost, X_train, y_train, x, y_,
            acc=acc, batch_size=500, n_epoch=500, print_freq=5,
            X_val=X_val, y_val=y_val, eval_train=False)

# evaluation
tl.utils.test(sess, network, acc, X_test, y_test, x, y_, batch_size=None, cost=cost)

# save the network to .npz file
tl.files.save_npz(network.all_params , name='model.npz')

Run the MNIST example


In the first part of the tutorial, we will just run the MNIST example that's included in the source distribution of TensorLayer. The MNIST dataset contains 60,000 handwritten training digits (plus 10,000 test digits), each of 28x28 pixels, and is commonly used for training various image processing systems.

We assume that you have already run through the installation. If you haven't done so already, get a copy of the source tree of TensorLayer and navigate to the folder in a terminal window. Then run the example script:


If everything is set up correctly, you will get an output like the following:

The example script lets you try different models, including Multi-Layer Perceptron, Dropout, DropConnect, Stacked Denoising Autoencoder and Convolutional Neural Network. Select the model to run under if __name__ == '__main__':.


Understand the MNIST example

Let's now investigate what's needed to make that happen! To follow along, open up the source code.


The first thing you might notice is that besides TensorLayer, we also import numpy and tensorflow:

import tensorflow as tf
import tensorlayer as tl
from tensorlayer.layers import set_keep
import numpy as np
import time

As we know, TensorLayer is built on top of TensorFlow. It is meant as a supplement that helps with some tasks, not as a replacement, so you will always mix TensorLayer with some vanilla TensorFlow code. set_keep is used to access the placeholders of the keeping probabilities when using a Denoising Autoencoder.

Loading data

The first piece of code defines a function load_mnist_dataset(). Its purpose is to download the MNIST dataset (if it hasn't been downloaded yet) and return it in the form of regular numpy arrays. There is no TensorLayer involved at all, so for the purpose of this tutorial, we can regard it as:

X_train, y_train, X_val, y_val, X_test, y_test = \
            tl.files.load_mnist_dataset(shape=(-1, 784))

X_train.shape is (50000, 784), to be interpreted as 50,000 images with 784 pixels each. y_train.shape is simply (50000,), a vector of the same length as X_train that gives an integer class label for each image -- namely, the digit between 0 and 9 depicted in the image (according to the human annotator who drew that digit).

For the Convolutional Neural Network example, MNIST can be loaded in a 4D version as follows:

X_train, y_train, X_val, y_val, X_test, y_test = \
            tl.files.load_mnist_dataset(shape=(-1, 28, 28, 1))

X_train.shape is (50000, 28, 28, 1), which represents 50,000 images with 28 rows, 28 columns and 1 channel each. There is only one channel because the images are grayscale, so every pixel has only one value.

Building the model

This is where TensorLayer steps in. It allows you to define an arbitrarily structured neural network by creating and stacking or merging layers. Since every layer knows its immediate incoming layers, the output layer (or output layers) of a network double as a handle to the network as a whole, so usually this is the only thing we will pass on to the rest of the code.

As mentioned above, the example script supports four types of models, and we implement that via easily exchangeable functions of the same interface. First, we'll define a function that creates a Multi-Layer Perceptron (MLP) of a fixed architecture, explaining all the steps in detail. We'll then implement a Denoising Autoencoder (DAE), after which we will stack several Denoising Autoencoders and fine-tune them with supervision. Finally, we'll show how to create a Convolutional Neural Network (CNN). In addition, the tutorial scripts include a simple example for the MNIST dataset and a CNN example for the CIFAR-10 dataset.

Multi-Layer Perceptron (MLP)

The first script, main_test_layers(), creates an MLP of two hidden layers of 800 units each, followed by a softmax output layer of 10 units. It applies 20% dropout to the input data and 50% dropout to the hidden layers.

To feed data into the network, TensorFlow placeholders need to be defined as follows. The None here means the network will accept input data of arbitrary batch size after compilation. x is used to hold the X_train data and y_ is used to hold the y_train data. If you know the batch size beforehand and do not need this flexibility, you should give the batch size here -- especially for convolutional layers, this can allow TensorFlow to apply some optimizations.

x = tf.placeholder(tf.float32, shape=[None, 784], name='x')
y_ = tf.placeholder(tf.int64, shape=[None, ], name='y_')

The foundation of each neural network in TensorLayer is an InputLayer <tensorlayer.layers.InputLayer> instance representing the input data that will subsequently be fed to the network. Note that the InputLayer is not tied to any specific data yet.

network = tl.layers.InputLayer(x, name='input_layer')

Before adding the first hidden layer, we'll apply 20% dropout to the input data. This is realized via a DropoutLayer <tensorlayer.layers.DropoutLayer> instance:

network = tl.layers.DropoutLayer(network, keep=0.8, name='drop1')

Note that the first constructor argument is the incoming layer and the second argument is the keeping probability for the activation values. Now we'll proceed with the first fully-connected hidden layer of 800 units, which is created by stacking a DenseLayer <tensorlayer.layers.DenseLayer>:

network = tl.layers.DenseLayer(network, n_units=800, act = tf.nn.relu, name='relu1')

Again, the first constructor argument means that we're stacking network on top of network. n_units simply gives the number of units for this fully-connected layer. act takes an activation function, several of which are defined in tensorflow.nn and tensorlayer.activation. Here we've chosen the rectifier, so we'll obtain ReLUs. We'll now add dropout of 50%, another 800-unit dense layer and 50% dropout again:

network = tl.layers.DropoutLayer(network, keep=0.5, name='drop2')
network = tl.layers.DenseLayer(network, n_units=800, act = tf.nn.relu, name='relu2')
network = tl.layers.DropoutLayer(network, keep=0.5, name='drop3')
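To make the keep argument more concrete, here is a minimal plain-Python sketch of what a dropout step does to a list of activations. It is an illustration only, not TensorLayer's implementation: the helper name dropout is hypothetical, and the rescaling of the surviving values by 1/keep follows the "inverted dropout" convention used by tf.nn.dropout.

```python
import random

def dropout(activations, keep, rng):
    """Keep each value with probability `keep` and rescale the survivors.

    Hypothetical helper for illustration; dividing by `keep` leaves the
    expected value of every activation unchanged.
    """
    if not 0.0 < keep <= 1.0:
        raise ValueError("keep must be in (0, 1]")
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000
dropped = dropout(hidden, keep=0.5, rng=rng)

kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(kept_fraction, mean_after)   # roughly 0.5 and roughly 1.0

# With keep=1.0 nothing is dropped and nothing is rescaled.
unchanged = dropout([3.0, 4.0], keep=1.0, rng=rng)
print(unchanged)
```

With keep=0.8 on the input and keep=0.5 on the hidden layers, about 20% of the pixels and 50% of the hidden activations are zeroed in each training step.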

Finally, we'll add the fully-connected output layer, whose n_units equals the number of classes. Note that the softmax is implemented internally in tf.nn.sparse_softmax_cross_entropy_with_logits() to speed up computation, so we use identity in the last layer; see tl.cost.cross_entropy() for more details.

network = tl.layers.DenseLayer(network, n_units=10,
                              act = tf.identity,
                              name='output_layer')

As mentioned above, each layer is linked to its incoming layer(s), so we only need the output layer(s) to access a network in TensorLayer:

y = network.outputs
y_op = tf.argmax(tf.nn.softmax(y), 1)
cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y, labels=y_))

Here, network.outputs holds the 10 identity (linear) outputs of the network, one score per class, and y_op is the integer output that represents the class index. cost is the cross-entropy between the target and predicted labels.
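The relationship between the identity outputs, the softmax and the cost can be checked by hand. The following sketch uses only the Python standard library and made-up logits for a 3-class toy problem; the function names are hypothetical and the code only mirrors the arithmetic, not TensorFlow's actual kernels.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax_cross_entropy(logits, label):
    """Cross-entropy of one example, from raw logits and an integer label."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Toy batch: 2 examples, 3 classes (MNIST would have 10 classes).
batch_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_labels = [0, 2]

losses = [sparse_softmax_cross_entropy(row, t)
          for row, t in zip(batch_logits, batch_labels)]
cost = sum(losses) / len(losses)                      # mean over the batch
predictions = [argmax(softmax(row)) for row in batch_logits]
accuracy = sum(p == t for p, t in zip(predictions, batch_labels)) / len(batch_labels)
print(cost, predictions, accuracy)
```

Because softmax preserves the ordering of its inputs, argmax(softmax(y)) always equals argmax(y); the softmax is only needed when actual probabilities are wanted.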

Denoising Autoencoder (DAE)

An Autoencoder is an unsupervised learning model that is able to extract representative features. It has become widely used for learning generative models of data and for greedy layer-wise pre-training. For the vanilla Autoencoder, see the Deeplearning Tutorial.

The script main_test_denoise_AE() implements a Denoising Autoencoder with a corrosion rate of 50%. The Autoencoder can be defined as follows, where an Autoencoder is represented by a DenseLayer:

network = tl.layers.InputLayer(x, name='input_layer')
network = tl.layers.DropoutLayer(network, keep=0.5, name='denoising1')
network = tl.layers.DenseLayer(network, n_units=200, act=tf.nn.sigmoid, name='sigmoid1')
recon_layer1 = tl.layers.ReconLayer(network, x_recon=x, n_units=784,
                                    act=tf.nn.sigmoid, name='recon_layer1')

To train the DenseLayer, simply run ReconLayer.pretrain(). If you use a Denoising Autoencoder, the name of the corrosion layer (a DropoutLayer) needs to be specified. To save the feature images, set save to True. There are many kinds of pre-training metrics, depending on the architecture and application. For sigmoid activations, the Autoencoder can be implemented by using KL divergence, while for rectifiers, L1 regularization of the activation outputs can make the output sparse. The default behaviour of ReconLayer therefore only provides KLD and cross-entropy for the sigmoid activation function, and L1 of the activation outputs and mean-squared-error for the rectifying activation function. We recommend that you modify ReconLayer to implement your own pre-training metric.


In addition, the script main_test_stacked_denoise_AE() shows how to stack multiple Autoencoders into one network and then fine-tune it.

Convolutional Neural Network (CNN)

Finally, the main_test_cnn_layer() script creates two CNN layers and max pooling stages, a fully-connected hidden layer and a fully-connected output layer. More CNN examples can be found in the tutorial scripts.

At the beginning, we add a Conv2dLayer <tensorlayer.layers.Conv2dLayer> with 32 filters of size 5x5 on top, followed by max-pooling of factor 2 in both dimensions. We then apply another Conv2dLayer with 64 filters of size 5x5, again followed by max-pooling. After that, we flatten the 4D output to a 1D vector per example by using FlattenLayer and apply 50% dropout to the last hidden layer. The ? represents an arbitrary batch_size.

Note: the tutorial scripts also introduce a simplified CNN API for beginners.

network = tl.layers.InputLayer(x, name='input_layer')
network = tl.layers.Conv2dLayer(network,
                        act = tf.nn.relu,
                        shape = [5, 5, 1, 32],  # 32 features for each 5x5 patch
                        strides=[1, 1, 1, 1],
                        name ='cnn_layer1')     # output: (?, 28, 28, 32)
network = tl.layers.PoolLayer(network,
                        ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1],
                        pool = tf.nn.max_pool,
                        name ='pool_layer1',)   # output: (?, 14, 14, 32)
network = tl.layers.Conv2dLayer(network,
                        act = tf.nn.relu,
                        shape = [5, 5, 32, 64], # 64 features for each 5x5 patch
                        strides=[1, 1, 1, 1],
                        name ='cnn_layer2')     # output: (?, 14, 14, 64)
network = tl.layers.PoolLayer(network,
                        ksize=[1, 2, 2, 1],
                        strides=[1, 2, 2, 1],
                        pool = tf.nn.max_pool,
                        name ='pool_layer2',)   # output: (?, 7, 7, 64)
network = tl.layers.FlattenLayer(network, name='flatten_layer')
                                                # output: (?, 3136)
network = tl.layers.DropoutLayer(network, keep=0.5, name='drop1')
                                                # output: (?, 3136)
network = tl.layers.DenseLayer(network, n_units=256, act = tf.nn.relu, name='relu1')
                                                # output: (?, 256)
network = tl.layers.DropoutLayer(network, keep=0.5, name='drop2')
                                                # output: (?, 256)
network = tl.layers.DenseLayer(network, n_units=10,
                act = tf.identity, name='output_layer')
                                                # output: (?, 10)
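The shape comments in the listing above can be reproduced with a little arithmetic. The sketch below assumes 'SAME' padding for every convolution and pooling stage (which is what the commented output shapes imply), so that each stage produces ceil(size / stride) rows and columns; the helper name is hypothetical.

```python
import math

def same_padding_output(size, stride):
    """Spatial output size of a conv/pool stage, assuming 'SAME' padding."""
    return math.ceil(size / stride)

height = width = 28    # MNIST image size
channels = 1           # grayscale input

# (layer name, stride, output channels or None to keep the channel count)
stages = [
    ("cnn_layer1", 1, 32),
    ("pool_layer1", 2, None),
    ("cnn_layer2", 1, 64),
    ("pool_layer2", 2, None),
]

shapes = {}
for name, stride, out_channels in stages:
    height = same_padding_output(height, stride)
    width = same_padding_output(width, stride)
    if out_channels is not None:
        channels = out_channels
    shapes[name] = (height, width, channels)
    print(name, shapes[name])

flattened = height * width * channels   # size seen by the first DenseLayer
print("flatten_layer", flattened)
```

The last value, 7 * 7 * 64 = 3136, is the length of the vector that FlattenLayer hands to the dense layers.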


For experts: Conv2dLayer will create a convolutional layer using tensorflow.nn.conv2d, TensorFlow's default convolution.

Training the model

The remaining part of the script deals with setting up and running a training loop over the MNIST dataset, using cross-entropy as the only cost.

Dataset iteration

TensorLayer provides an iteration function for synchronously iterating over two numpy arrays of input data and targets, respectively, in mini-batches of a given number of items. More iteration functions can be found in tensorlayer.iterate.

tl.iterate.minibatches(inputs, targets, batchsize, shuffle=False)
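As a rough idea of what such an iterator does, here is a plain-Python sketch that works on lists instead of numpy arrays. It is not TensorLayer's code: the extra rng parameter and the choice to drop a final batch smaller than batchsize are assumptions of this sketch.

```python
import random

def minibatches(inputs, targets, batchsize, shuffle=False, rng=None):
    """Yield (inputs, targets) slices of `batchsize` items, kept in step.

    Sketch only: a trailing batch smaller than `batchsize` is dropped.
    """
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    indices = list(range(len(inputs)))
    if shuffle:
        (rng or random).shuffle(indices)
    for start in range(0, len(indices) - batchsize + 1, batchsize):
        batch = indices[start:start + batchsize]
        yield [inputs[i] for i in batch], [targets[i] for i in batch]

X = [[i, i] for i in range(7)]   # 7 toy "images"
y = list(range(7))               # their labels

batches = list(minibatches(X, y, batchsize=3))
for X_a, y_a in batches:
    print(X_a, y_a)
```

Shuffling the index list, rather than the two arrays separately, is what keeps every input paired with its own target.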

Loss and update expressions

Continuing, we create a loss expression to be minimized in training:

y = network.outputs
y_op = tf.argmax(tf.nn.softmax(y), 1)
cost = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=y, labels=y_))

More costs or regularization terms can be applied here. Taking main_test_layers() as an example, to apply max-norm on the weight matrices, we can add one term per weight matrix, for example for the first one:

cost = cost + tl.cost.maxnorm_regularizer(1.0)(network.all_params[0])

Depending on the problem you are solving, you will need different loss functions, see tensorlayer.cost for more.

Having defined the model and the loss function, we create update expressions for training the network. TensorLayer does not provide many optimizers, so we use TensorFlow's optimizers instead:

train_params = network.all_params
train_op = tf.train.AdamOptimizer(learning_rate, beta1=0.9, beta2=0.999,
    epsilon=1e-08, use_locking=False).minimize(cost, var_list=train_params)

To train the network, we feed the data and the keeping probabilities through the feed_dict.

feed_dict = {x: X_train_a, y_: y_train_a}
feed_dict.update( network.all_drop )    # enable dropout during training
sess.run(train_op, feed_dict=feed_dict)

For validation and testing, we use a slightly different approach: all dropout, dropconnect and corrosion layers need to be disabled. tl.utils.dict_to_one sets every keeping probability in network.all_drop to 1.

dp_dict = tl.utils.dict_to_one( network.all_drop )
feed_dict = {x: X_test_a, y_: y_test_a}
feed_dict.update(dp_dict)               # disable dropout for evaluation
err, ac = sess.run([cost, acc], feed_dict=feed_dict)

As an additional monitoring quantity, we create an expression for the classification accuracy:

correct_prediction = tf.equal(tf.argmax(y, 1), y_)
acc = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

What Next?

We also have a more advanced image classification example. Please read the code and notes, and figure out how to generate more training data and what local response normalization is. After that, try to implement a Residual Network (hint: you may want to use Layer.outputs).

Run the Pong Game example

In the second part of the tutorial, we will run the Deep Reinforcement Learning example that is introduced by Karpathy in Deep Reinforcement Learning: Pong from Pixels.


Before running the tutorial code, you need to install the OpenAI gym environment, which is a benchmark for Reinforcement Learning. If everything is set up correctly, you will get an output like the following:

This example lets the computer learn how to play Pong from the screen inputs, just as a human would. After training for 15,000 episodes, the computer can win 20% of the games. The computer wins 35% of the games at 20,000 episodes; we can see that the computer learns faster and faster as it has more winning data to train on. If you run it for 30,000 episodes, it starts to win.

render = False
resume = False

Set render to True if you want to display the game environment. When you run the code again, you can set resume to True; the code will then load the existing model and continue training from it.


Understand Reinforcement learning

Pong Game

To understand Reinforcement Learning, we let the computer learn how to play Pong from the original screen inputs. Before we start, we highly recommend that you go through the famous blog post Deep Reinforcement Learning: Pong from Pixels, which is a minimalistic implementation of Deep Reinforcement Learning using python-numpy and the OpenAI gym environment.


Policy Network

In Deep Reinforcement Learning, the Policy Network is simply a Deep Neural Network; it is our player (or “agent”), which outputs actions that tell us what to do (move UP or DOWN). In Karpathy's code, only 2 actions are defined, UP and DOWN, using a single sigmoid output. To make our tutorial more generic, we define 3 actions, UP, DOWN and STOP (do nothing), using 3 softmax outputs.

# observation for training
states_batch_pl = tf.placeholder(tf.float32, shape=[None, D])

network = tl.layers.InputLayer(states_batch_pl, name='input_layer')
network = tl.layers.DenseLayer(network, n_units=H,
                                act = tf.nn.relu, name='relu1')
network = tl.layers.DenseLayer(network, n_units=3,
                        act = tf.identity, name='output_layer')
probs = network.outputs
sampling_prob = tf.nn.softmax(probs)

Then, when our agent is playing Pong, it calculates the probabilities of the different actions and draws a sample (action) from this distribution. As the actions are represented by 1, 2 and 3, but the softmax outputs start from 0, we calculate the label value by subtracting 1.

prob = sess.run(
    sampling_prob,
    feed_dict={states_batch_pl: x}
)
# action. 1: STOP  2: UP  3: DOWN
action = np.random.choice([1,2,3], p=prob.flatten())
ys.append(action - 1)
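The sampling step can be tried without TensorFlow. The sketch below replaces the network by made-up logits and uses the standard library's random.choices in place of np.random.choice; the names softmax and sample_action are hypothetical helpers for this illustration.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

ACTIONS = [1, 2, 3]   # 1: STOP  2: UP  3: DOWN, as in the listing above

def sample_action(logits, rng):
    """Draw one action according to the policy's softmax probabilities."""
    probs = softmax(logits)
    action = rng.choices(ACTIONS, weights=probs, k=1)[0]
    label = action - 1    # 0-based class index stored as the training label
    return action, label

rng = random.Random(0)
logits = [0.0, 2.0, 0.0]  # made-up network outputs that favour UP
counts = {a: 0 for a in ACTIONS}
for _ in range(10000):
    action, label = sample_action(logits, rng)
    counts[action] += 1
print(counts)             # UP is drawn most often, but not always
```

Sampling instead of always taking the most probable action is what lets the agent explore: actions with low probability are still tried occasionally.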

Policy Gradient

Policy gradient methods are end-to-end algorithms that directly learn policy functions mapping states to actions. An approximate policy could be learned directly by maximizing the expected rewards. The parameters of a policy function (e.g. the parameters of a policy network used in the pong example) could be trained and learned under the guidance of the gradient of expected rewards. In other words, we can gradually tune the policy function via updating its parameters, such that it will generate actions from given states towards higher rewards.

An alternative method to policy gradient is Deep Q-Learning (DQN). It is based on Q-Learning, which tries to learn a value function (called the Q function) mapping states and actions to some value. DQN employs a deep neural network to represent the Q function as a function approximator. The training is done by minimizing temporal-difference errors. A neurobiologically inspired mechanism called “experience replay” is typically used along with DQN to mitigate the instability caused by the use of a non-linear function approximator.

You can check the following papers to gain better understandings about Reinforcement Learning.

The most successful applications of Deep Reinforcement Learning in recent years include DQN with experience replay for playing Atari games and AlphaGo, which for the first time beat world-class professional Go players. AlphaGo used the policy gradient method to train its policy network, similar to the Pong example.

Dataset iteration

In Reinforcement Learning, we consider a final decision as an episode. In Pong, an episode consists of a few dozen games, because the games go up to a score of 21 for either player. The batch size is then the number of episodes we consider before updating the model. In the tutorial, we train a 2-layer policy network with 200 hidden units using RMSProp on batches of 10 episodes.

Loss and update expressions

Continuing, we create a loss expression to be minimized in training:

actions_batch_pl = tf.placeholder(tf.int32, shape=[None])
discount_rewards_batch_pl = tf.placeholder(tf.float32, shape=[None])
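loss = tl.rein.cross_entropy_reward_loss(probs, actions_batch_pl,
                                         discount_rewards_batch_pl)
...
sess.run(train_op,  # train_op: the optimizer step that minimizes loss
    feed_dict={
        states_batch_pl: epx,
        actions_batch_pl: epy,
        discount_rewards_batch_pl: disR})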

The loss in a batch is related to all outputs of the Policy Network, all actions we took and the corresponding discounted rewards in the batch. We first compute the loss of each action by multiplying the discounted reward by the cross-entropy between the network output and the action actually taken. The final loss in a batch is the sum of the losses of all actions.
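The two ingredients of this loss can be sketched in plain Python. Both functions below are hypothetical illustrations rather than the library's code: discount_rewards follows the convention of Karpathy's post (a non-zero reward ends a Pong rally, so the running sum is reset there), and reward_weighted_cross_entropy implements the sum described above on made-up numbers.

```python
import math

def discount_rewards(rewards, gamma=0.99):
    """Discounted return at every step, reset whenever a rally ends."""
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0          # game boundary: do not leak reward across rallies
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

def reward_weighted_cross_entropy(logits, actions, discounted):
    """Sum over steps of cross-entropy(logits, action) * discounted reward."""
    total = 0.0
    for row, action, weight in zip(logits, actions, discounted):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += (log_sum_exp - row[action]) * weight
    return total

rewards = [0.0, 0.0, 1.0, 0.0, -1.0]       # one rally won, one rally lost
disR = discount_rewards(rewards, gamma=0.5)
print(disR)                                 # [0.25, 0.5, 1.0, -0.5, -1.0]

logits = [[0.0, 0.0, 0.0]] * 5              # untrained policy: all actions equal
actions = [0, 1, 2, 1, 0]
loss = reward_weighted_cross_entropy(logits, actions, disR)
print(loss)                                 # 0.25 * log(3)
```

Actions followed by a positive discounted reward pull the loss up until their probability is increased, while actions followed by a negative reward are pushed in the opposite direction.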

What Next?

The tutorial above shows how you can build your own agent, end-to-end. While it has reasonable quality, the default parameters will not give you the best agent model. Here are a few things you can improve.

First of all, instead of a conventional MLP model, we can use CNNs to capture the screen information better, as described in Playing Atari with Deep Reinforcement Learning.

Also, the default parameters of the model are not tuned. You can try changing the learning rate, decay, or initializing the weights of your model in a different way.

Finally, you can try the model on different tasks (games).

Run the Word2Vec example

In this part of the tutorial, we train a matrix for words, where each word can be represented by a unique row vector in the matrix. In the end, similar words will have similar vectors. Then as we plot out the words into a two-dimensional plane, words that are similar end up clustering nearby each other.


If everything is set up correctly, you will get an output in the end.


Understand Word Embedding

Word Embedding

We highly recommend that you read Colah's blog post Word Representations to understand why we want to use a vector representation and how to compute the vectors. (A Chinese translation is also available.) More details about word2vec can be found in Word2vec Parameter Learning Explained.

Basically, training an embedding matrix is unsupervised learning. Every word is represented by a unique ID, which is the row index of the embedding matrix, so a word can be converted into a vector that better represents its meaning. For example, there seems to be a constant male-female difference vector: woman − man = queen − king, which means that one direction in the vector space represents gender.
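The ID lookup and the difference-vector idea can be illustrated with a hand-made toy matrix. The 2-D vectors below are invented for this sketch (real embeddings are learned and have many more dimensions), and lookup, cosine and analogy are hypothetical helper names.

```python
import math

# Each word has an integer ID, which is its row in the embedding matrix.
dictionary = {"man": 0, "woman": 1, "king": 2, "queen": 3, "apple": 4}
reverse_dictionary = {i: w for w, i in dictionary.items()}

# Hand-made toy embeddings: column 0 ~ "royalty", column 1 ~ "gender".
embedding_matrix = [
    [0.0, 1.0],    # man
    [0.0, -1.0],   # woman
    [1.0, 1.0],    # king
    [1.0, -1.0],   # queen
    [-1.0, 0.1],   # apple
]

def lookup(word):
    """Convert a word into its vector via its row index."""
    return embedding_matrix[dictionary[word]]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in zip(lookup(a), lookup(b), lookup(c))]
    candidates = [i for w, i in dictionary.items() if w not in (a, b, c)]
    best = max(candidates, key=lambda i: cosine(embedding_matrix[i], target))
    return reverse_dictionary[best]

gender_1 = [w - m for w, m in zip(lookup("woman"), lookup("man"))]
gender_2 = [q - k for q, k in zip(lookup("queen"), lookup("king"))]
print(gender_1, gender_2)                 # the same difference vector
print(analogy("man", "woman", "king"))    # queen
```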

The model can be created as follows.

# train_inputs is a row vector; an input is the integer id of a single word.
# train_labels is a column vector; a label is the integer id of a single word.
# valid_dataset is a column vector; a validation example is the integer id of a single word.
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

# Look up embeddings for inputs.
emb_net = tl.layers.Word2vecEmbeddingInputlayer(
        inputs = train_inputs,
        train_labels = train_labels,
        vocabulary_size = vocabulary_size,
        embedding_size = embedding_size,
        num_sampled = num_sampled,
        nce_loss_args = {},
        E_init = tf.random_uniform_initializer(minval=-1.0, maxval=1.0),
        E_init_args = {},
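        # assumed: stddev = 1/sqrt(embedding_size), the usual word2vec choice
        nce_W_init = tf.truncated_normal_initializer(
                          stddev=float(1.0 / np.sqrt(embedding_size))),
        nce_W_init_args = {},
        nce_b_init = tf.constant_initializer(value=0.0),
        nce_b_init_args = {},
        name ='word2vec_layer',
    )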

Dataset iteration and loss

Word2vec uses Negative Sampling and the Skip-Gram model for training. The Noise-Contrastive Estimation (NCE) loss helps to reduce the computation of the loss. Skip-Gram inverts context and targets and tries to predict each context word from its target word. We use tl.nlp.generate_skip_gram_batch to generate training data as follows.

# NCE cost expression is provided by Word2vecEmbeddingInputlayer
cost = emb_net.nce_cost
train_params = emb_net.all_params

train_op = tf.train.AdagradOptimizer(learning_rate, initial_accumulator_value=0.1,
          use_locking=False).minimize(cost, var_list=train_params)

data_index = 0
step = 0
while (step < num_steps):
  batch_inputs, batch_labels, data_index = tl.nlp.generate_skip_gram_batch(
                data=data, batch_size=batch_size, num_skips=num_skips,
                skip_window=skip_window, data_index=data_index)
  feed_dict = {train_inputs : batch_inputs, train_labels : batch_labels}
  _, loss_val = sess.run([train_op, cost], feed_dict=feed_dict)
  step += 1
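To see what kind of (input, label) pairs Skip-Gram training consumes, the sketch below enumerates every (target, context) pair inside a window. This is a simplified illustration and not the behaviour of tl.nlp.generate_skip_gram_batch, which draws num_skips context words per target instead of listing them all; the function name skip_gram_pairs is hypothetical.

```python
def skip_gram_pairs(data, skip_window):
    """Return (target, context) id pairs for every word and each neighbour
    within `skip_window` positions on either side."""
    pairs = []
    for i, target in enumerate(data):
        lo = max(0, i - skip_window)
        hi = min(len(data), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, data[j]))
    return pairs

words = "the quick brown fox jumps".split()
dictionary = {w: i for i, w in enumerate(words)}   # word -> integer id
data = [dictionary[w] for w in words]              # [0, 1, 2, 3, 4]

pairs = skip_gram_pairs(data, skip_window=1)
print(pairs)
# [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)]
```

In each pair the first id plays the role of train_inputs and the second the role of train_labels: the model is asked to predict a context word from its target word.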

Restore existing Embedding matrix

At the end of training the embedding matrix, we save the matrix and the corresponding dictionaries. Next time, we can restore the matrix and dictionaries as follows (see main_restore_embedding_layer()).

vocabulary_size = 50000
embedding_size = 128
model_file_name = "model_word2vec_50k_128"
batch_size = None

print("Load existing embedding matrix and dictionaries")
all_var = tl.files.load_npy_to_any(name=model_file_name+'.npy')
data = all_var['data']; count = all_var['count']
dictionary = all_var['dictionary']
reverse_dictionary = all_var['reverse_dictionary']

tl.nlp.save_vocab(count, name='vocab_'+model_file_name+'.txt')

del all_var, data, count

load_params = tl.files.load_npz(name=model_file_name+'.npz')

x = tf.placeholder(tf.int32, shape=[batch_size])
y_ = tf.placeholder(tf.int32, shape=[batch_size, 1])

emb_net = tl.layers.EmbeddingInputlayer(
                inputs = x,
                vocabulary_size = vocabulary_size,
                embedding_size = embedding_size,
                name ='embedding_layer')


tl.files.assign_params(sess, [load_params[0]], emb_net)

Run the PTB example

The Penn TreeBank (PTB) dataset is used in many language modeling papers, including "Empirical Evaluation and Combination of Advanced Language Modeling Techniques" and "Recurrent Neural Network Regularization". It consists of 929k training words, 73k validation words, and 82k test words. It has 10k words in its vocabulary.

The PTB example shows how to train a recurrent neural network on the challenging task of language modeling.

Given the sentence "I am from Imperial College London", the model can learn to predict "Imperial College London" from "from Imperial College". In other words, it predicts the next words in a text given a history of previous words. In this example, num_steps (sequence length) is 3.
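The input/target relationship in this sentence can be written out explicitly. The sketch below slides a window of num_steps words over the sentence and pairs it with the same window shifted by one word; the helper name next_word_examples is hypothetical, and the overlapping windows are for illustration only (the PTB iterator shown later in this tutorial uses non-overlapping windows).

```python
def next_word_examples(words, num_steps):
    """Pair each window of `num_steps` words with the same window shifted
    one word to the right (the words the model should predict)."""
    examples = []
    for start in range(len(words) - num_steps):
        inputs = words[start:start + num_steps]
        targets = words[start + 1:start + num_steps + 1]
        examples.append((inputs, targets))
    return examples

sentence = "I am from Imperial College London".split()
examples = next_word_examples(sentence, num_steps=3)
for inputs, targets in examples:
    print(inputs, "->", targets)
```

The last printed pair is exactly the case described above: the input "from Imperial College" with the target "Imperial College London".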


The script provides three settings (small, medium, large), where a larger model has better performance. You can choose a different setting in:

flags.DEFINE_string(
    "model", "small",
    "A type of model. Possible options are: small, medium, large.")

If you choose the small setting, you will see:

The PTB example shows that an RNN is able to model language, but this example does not do anything practical. However, you should read through this example and “Understand LSTM” in order to understand the basics of RNNs. After that, you will learn how to generate text, how to achieve language translation and how to build a question answering system by using RNNs.

Understand LSTM

Recurrent Neural Network

We personally think Andrej Karpathy's blog is the best material to Understand Recurrent Neural Network. After reading that, Colah's blog can help you to Understand LSTM Network [chinese], which can solve The Problem of Long-Term Dependencies. We do not describe RNNs further here, so please read through these blogs before you go on.


Image by Andrej Karpathy

Synced sequence input and output

The model in the PTB example is a typical case of synced sequence input and output, which was described by Karpathy as "(5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like."

The model is built as follows. Firstly, we convert the words into word vectors by looking up an embedding matrix; in this tutorial, the embedding matrix is not pre-trained. Secondly, we stack two LSTMs together and use dropout between the embedding layer, the LSTM layers and the output layer for regularization. In the last layer, the model provides a sequence of softmax outputs.

The first LSTM layer outputs [batch_size, num_steps, hidden_size] for stacking another LSTM after it. The second LSTM layer outputs [batch_size*num_steps, hidden_size] for stacking DenseLayer after it, then compute the softmax outputs of each example (n_examples = batch_size*num_steps).

To understand the PTB tutorial, you can also read TensorFlow PTB tutorial.

(Note that TensorLayer supports DynamicRNNLayer after v1.1, so you can set the input/output dropouts and the number of RNN layers in one single layer.)

network = tl.layers.EmbeddingInputlayer(
            inputs = x,
            vocabulary_size = vocab_size,
            embedding_size = hidden_size,
            E_init = tf.random_uniform_initializer(-init_scale, init_scale),
            name ='embedding_layer')
if is_training:
    network = tl.layers.DropoutLayer(network, keep=keep_prob, name='drop1')
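network = tl.layers.RNNLayer(network,
            cell_fn=tf.nn.rnn_cell.BasicLSTMCell,
            cell_init_args={'forget_bias': 0.0},
            n_hidden=hidden_size,
            initializer=tf.random_uniform_initializer(-init_scale, init_scale),
            n_steps=num_steps,
            return_last=False,      # 3D output: [batch_size, num_steps, hidden_size]
            name='basic_lstm_layer1')
lstm1 = network
if is_training:
    network = tl.layers.DropoutLayer(network, keep=keep_prob, name='drop2')
network = tl.layers.RNNLayer(network,
            cell_fn=tf.nn.rnn_cell.BasicLSTMCell,
            cell_init_args={'forget_bias': 0.0},
            n_hidden=hidden_size,
            initializer=tf.random_uniform_initializer(-init_scale, init_scale),
            n_steps=num_steps,
            return_last=False,
            return_seq_2d=True,     # 2D output: [batch_size*num_steps, hidden_size]
            name='basic_lstm_layer2')
lstm2 = network
if is_training:
    network = tl.layers.DropoutLayer(network, keep=keep_prob, name='drop3')
network = tl.layers.DenseLayer(network,
            n_units=vocab_size,
            W_init=tf.random_uniform_initializer(-init_scale, init_scale),
            b_init=tf.random_uniform_initializer(-init_scale, init_scale),
            act = tf.identity, name='output_layer')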

Dataset iteration

The batch_size can be seen as the number of concurrent computations. As the following example shows, the first batch learns the sequence information by using 0 to 9, and the second batch learns the sequence information by using 10 to 19, so the information from 9 to 10 is ignored. Only if we set batch_size = 1 will the model consider all the information from 0 to 20.

The meaning of batch_size here is not the same with the batch_size in MNIST example. In MNIST example, batch_size reflects how many examples we consider in each iteration, while in PTB example, batch_size is how many concurrent processes (segments) for speed up computation.

Some Information will be ignored if batch_size > 1, however, if your dataset is "long" enough (a text corpus usually has billions words), the ignored information would not effect the final result.

In PTB tutorial, we set batch_size = 20, so we cut the dataset into 20 segments. At the beginning of each epoch, we initialize (reset) the 20 RNN states for 20 segments, then go through 20 segments separately.

A example of generating training data as follow:

train_data = [i for i in range(20)]
for batch in tl.iterate.ptb_iterator(train_data, batch_size=2, num_steps=3):
    x, y = batch
    print(x, '\n',y)
... [[ 0  1  2] <---x                       1st subset/ iteration
...  [10 11 12]]
... [[ 1  2  3] <---y
...  [11 12 13]]
... [[ 3  4  5]  <--- 1st batch input       2nd subset/ iteration
...  [13 14 15]] <--- 2nd batch input
... [[ 4  5  6]  <--- 1st batch target
...  [14 15 16]] <--- 2nd batch target
... [[ 6  7  8]                             3rd subset/ iteration
...  [16 17 18]]
... [[ 7  8  9]
...  [17 18 19]]
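
The segmenting behind this output can be sketched in plain Python. The function `ptb_batches` below is a hypothetical stand-in written to reproduce the output shown above; it is not the TensorLayer implementation of tl.iterate.ptb_iterator.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs; y is x shifted one position to the right."""
    batch_len = len(raw_data) // batch_size        # length of each segment
    segments = [raw_data[i * batch_len:(i + 1) * batch_len]
                for i in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps      # iterations per epoch
    for i in range(epoch_size):
        lo, hi = i * num_steps, (i + 1) * num_steps
        x = [seg[lo:hi] for seg in segments]           # inputs
        y = [seg[lo + 1:hi + 1] for seg in segments]   # next-word targets
        yield x, y

batches = list(ptb_batches(list(range(20)), batch_size=2, num_steps=3))
for x, y in batches:
    print(x, y)
```

Because the data is first cut into batch_size segments, the pair (9, 10) never appears as an input and its target, which is the ignored information discussed above.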


This example can also be considered as pre-training of the word embedding matrix.

Loss and update expressions

The cost function is the averaged cost of each mini-batch:

# See tensorlayer.cost.cross_entropy_seq() for more details
def loss_fn(outputs, targets, batch_size, num_steps):
    # Returns the cost function of Cross-entropy of two sequences, implement
    # softmax internally.
    # outputs : 2D tensor [batch_size*num_steps, n_units of output layer]
    # targets : 2D tensor [batch_size, num_steps], need to be reshaped.
    # n_examples = batch_size * num_steps, so the returned cost is the
    # averaged cost of each mini-batch (concurrent process).
    loss = tf.nn.seq2seq.sequence_loss_by_example(
        [outputs],
        [tf.reshape(targets, [-1])],
        [tf.ones([batch_size * num_steps])])
    cost = tf.reduce_sum(loss) / batch_size
    return cost

# Cost for Training
cost = loss_fn(network.outputs, targets, batch_size, num_steps)
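
To see what this cost measures, here is a minimal pure-Python sketch of the same computation on lists. The helpers `softmax_cross_entropy` and `sequence_cost` are assumptions made for illustration and only mirror the structure of loss_fn; they are not TensorFlow or TensorLayer functions.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross-entropy of one example; softmax is applied internally."""
    m = max(logits)                                   # for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def sequence_cost(outputs, targets, batch_size):
    """outputs: [batch_size*num_steps][vocab]; targets: [batch_size][num_steps]."""
    flat_targets = [t for row in targets for t in row]   # like reshape(targets, [-1])
    losses = [softmax_cross_entropy(o, t)
              for o, t in zip(outputs, flat_targets)]
    return sum(losses) / batch_size

# Uniform logits over a 4-word vocabulary: every example costs log(4).
batch_size, num_steps, vocab = 2, 3, 4
outputs = [[0.0] * vocab for _ in range(batch_size * num_steps)]
targets = [[0, 1, 2], [3, 0, 1]]
cost = sequence_cost(outputs, targets, batch_size)
print(cost, num_steps * math.log(vocab))
```

Dividing the summed loss by batch_size leaves a cost that still grows with num_steps, which is why the training loop later divides the accumulated cost by the number of steps before computing perplexity.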

For updating, this example decreases the initial learning rate after several epochs (defined by max_epoch) by multiplying it by lr_decay. In addition, truncated backpropagation clips the gradients by the ratio of the sum of their norms, so as to make the learning process tractable.

# Truncated Backpropagation for training
with tf.variable_scope('learning_rate'):
    lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
# max_grad_norm is the clipping threshold (a hyperparameter of the example)
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
                                  max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
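
The clipping rule can be illustrated without TensorFlow. The sketch below follows the documented behaviour of tf.clip_by_global_norm (each gradient is scaled by clip_norm / max(global_norm, clip_norm)); the function is a plain-Python stand-in that works on lists of floats, and the numbers are made up.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """grads: list of gradient vectors (lists of floats)."""
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = clip_norm / max(global_norm, clip_norm)   # 1.0 if already small enough
    return [[g * scale for g in vec] for vec in grads], global_norm

clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
print(norm, clipped)
```

All gradients are scaled by the same factor, so the direction of the update is preserved and only its length is limited.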

If the epoch index is greater than max_epoch, decrease the learning rate by multiplying it by lr_decay.

new_lr_decay = lr_decay ** max(i - max_epoch, 0.0)
sess.run(tf.assign(lr, learning_rate * new_lr_decay))
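
As a quick check of this schedule, the sketch below evaluates the same expression for a few epochs. The helper name `decayed_learning_rate` and the hyperparameter values are assumptions chosen to keep the numbers readable; they are not the settings of the PTB example.

```python
def decayed_learning_rate(epoch, learning_rate, lr_decay, max_epoch):
    """Keep the initial rate until max_epoch, then decay geometrically."""
    return learning_rate * lr_decay ** max(epoch - max_epoch, 0.0)

# Hypothetical settings, for illustration only.
schedule = [decayed_learning_rate(i, learning_rate=1.0, lr_decay=0.5, max_epoch=4)
            for i in range(7)]
print(schedule)
```

The rate stays constant through epoch index max_epoch and is multiplied by lr_decay once for every epoch after that.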

At the beginning of each epoch, all LSTM states need to be reset (initialized) to zero states. After each iteration, the LSTM states are updated, so the new LSTM states (final states) need to be assigned as the initial states of the next iteration:

# set all states to zero states at the beginning of each epoch
state1 = tl.layers.initialize_rnn_state(lstm1.initial_state)
state2 = tl.layers.initialize_rnn_state(lstm2.initial_state)
for step, (x, y) in enumerate(tl.iterate.ptb_iterator(train_data,
                                            batch_size, num_steps)):
    feed_dict = {input_data: x, targets: y,
                lstm1.initial_state: state1,
                lstm2.initial_state: state2,
                }
    # For training, enable dropout
    feed_dict.update( network.all_drop )
    # use the new states as the initial state of next iteration
    _cost, state1, state2, _ = sess.run([cost,
                                    lstm1.final_state,
                                    lstm2.final_state,
                                    train_op],
                                    feed_dict=feed_dict)
    costs += _cost; iters += num_steps
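
Why the final states must be fed back can be seen with a toy recurrence. In the sketch below, `toy_cell` is a made-up stand-in for an RNN cell (it is not an LSTM), used only to show that processing a sequence in chunks of num_steps gives the same result as one long pass when the state is carried over.

```python
def toy_cell(state, x):
    """Stand-in for an RNN cell: any function of (previous state, input)."""
    return 0.5 * state + x

def run_steps(state, inputs):
    for x in inputs:
        state = toy_cell(state, x)
    return state

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
num_steps = 2

# One pass over the whole sequence.
full = run_steps(0.0, data)

# Truncated passes: reset once, then feed each final state back in.
state = 0.0                                  # zero state at the start of the epoch
for i in range(0, len(data), num_steps):
    state = run_steps(state, data[i:i + num_steps])

# Without carrying the state, everything before the last chunk is lost.
no_carry = run_steps(0.0, data[-num_steps:])
print(full, state, no_carry)
```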


After training the model, when we predict the next output, we no longer consider the number of steps (sequence length), i.e. batch_size and num_steps are both 1. Then we can output the next word step by step, instead of predicting a sequence of words from a sequence of words.

input_data_test = tf.placeholder(tf.int32, [1, 1])
targets_test = tf.placeholder(tf.int32, [1, 1])
network_test, lstm1_test, lstm2_test = inference(input_data_test,
                      is_training=False, num_steps=1, reuse=True)
cost_test = loss_fn(network_test.outputs, targets_test, 1, 1)
# Testing
# go through the test set step by step, it will take a while.
start_time = time.time()
costs = 0.0; iters = 0
# reset all states at the beginning
state1 = tl.layers.initialize_rnn_state(lstm1_test.initial_state)
state2 = tl.layers.initialize_rnn_state(lstm2_test.initial_state)
for step, (x, y) in enumerate(tl.iterate.ptb_iterator(test_data,
                                        batch_size=1, num_steps=1)):
    feed_dict = {input_data_test: x, targets_test: y,
                lstm1_test.initial_state: state1,
                lstm2_test.initial_state: state2,
                }
    _cost, state1, state2 = sess.run([cost_test,
                                    lstm1_test.final_state,
                                    lstm2_test.final_state],
                                    feed_dict=feed_dict)
    costs += _cost; iters += 1
test_perplexity = np.exp(costs / iters)
print("Test Perplexity: %.3f took %.2fs" % (test_perplexity, time.time() - start_time))
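
Perplexity is the exponential of the average per-word cross-entropy, as in the last lines above. The sketch below checks the formula on a case with a known answer; the vocabulary size and word count are arbitrary values chosen for illustration.

```python
import math

def perplexity(total_cost, total_steps):
    """exp of the average per-word cross-entropy."""
    return math.exp(total_cost / total_steps)

# A model that assigns probability 1/vocab_size to every word pays
# log(vocab_size) per word, so its perplexity equals vocab_size.
vocab_size, n_words = 10000, 50
costs = n_words * math.log(vocab_size)
ppl = perplexity(costs, n_words)
print(ppl)
```

A perplexity well below the vocabulary size therefore means the model does better than guessing uniformly.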

What Next?

Now you understand Synced sequence input and output. Let's think about Many to one (Sequence input and one output): an LSTM is able to predict the next word "English" from "I am from London, I speak ..".

Please read and understand the code of the text generation example; it shows you how to restore a pre-trained embedding matrix and how to learn text generation from a given context.

Karpathy's blog : "(3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). "

Run the Translation example


This script trains a neural network to translate English to French. If everything is correct, you will see it:

  • Download the WMT English-to-French translation data, which includes training and testing data.
  • Create vocabulary files for English and French from the training data.
  • Create the tokenized training and testing data from the original training and testing data.

Start training using the tokenized bucket data; the training process can only be terminated by stopping the program. When steps_per_checkpoint = 10 you will see:

Create Embedding Attention Seq2seq Model

After training the model for 350000 steps, you can play with the translation by switching main_train() to main_decode(). Type in an English sentence, and the program will output a French sentence.

Reading model parameters from wmt/translate.ckpt-350000
>  Who is the president of the United States?
Qui est le président des États-Unis ?

Understand Translation


The sequence-to-sequence model is commonly used to translate one language into another. It can actually do many things you might not expect: we can translate a long sentence into a short and simple one, for example, going from Shakespeare to modern English. With a CNN, we can also translate a video into a sentence, i.e. video captioning.

If you just want to use Seq2seq rather than design a new algorithm, the only thing you need to consider is the data format, including how to split the words, how to tokenize the words, etc. In this tutorial, we describe the data formatting in detail.


The sequence-to-sequence model is a type of "Many to many", but it differs from the Synced sequence input and output in the PTB tutorial: Seq2seq generates the output sequence after all the sequence inputs have been fed. The following two methods can improve the accuracy:

  • Reversing the inputs
  • Attention mechanism

To speed up the computation, we used:

  • Sampled softmax

Karpathy's blog described Seq2seq as: "(4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French)."


As the above figure shows, the encoder inputs, decoder inputs and targets are:

encoder_input =  A    B    C
decoder_input =  <go> W    X    Y    Z
targets       =  W    X    Y    Z    <eos>

Note: in the code, the size of targets is one smaller than the size
of decoder_input, unlike in this figure. More details will be shown later.
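
The relation between decoder inputs and targets in the figure can be written down in a few lines. The helper `make_decoder_pair` is a hypothetical name used only for this sketch; it follows the figure, not the bucketed code, where targets are one element shorter.

```python
GO, EOS = "<go>", "<eos>"

def make_decoder_pair(target_sentence):
    """Return (decoder_input, targets) for one target-side sentence."""
    decoder_input = [GO] + target_sentence     # what the decoder is fed
    targets = target_sentence + [EOS]          # what the decoder should emit
    return decoder_input, targets

decoder_input, targets = make_decoder_pair(["W", "X", "Y", "Z"])
print(decoder_input)
print(targets)
```

At every position the target is the decoder input of the next position, so the decoder learns to predict the following word.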


The English-to-French example implements a multi-layer recurrent neural network as the encoder and an attention-based decoder. It is the same as the model described in the paper Grammar as a Foreign Language.

The example uses sampled softmax to handle the large output vocabulary size, as target_vocab_size=4000 in this example. For vocabularies smaller than 512, it might be a better idea to just use a standard softmax loss. Sampled softmax is described in Section 3 of the paper On Using Very Large Target Vocabulary for Neural Machine Translation.

Reversing the inputs and multi-layer cells have been used successfully in sequence-to-sequence models for translation, as described in the paper Sequence to Sequence Learning with Neural Networks.

The attention mechanism gives the decoder more direct access to the input; it is described in the paper Neural Machine Translation by Jointly Learning to Align and Translate.

Alternatively, the model can also be implemented as a single-layer version with a bi-directional encoder, as presented in the paper Neural Machine Translation by Jointly Learning to Align and Translate.


Bucketing and Padding

(Note that TensorLayer supports the Dynamic RNN layer after v1.2, so bucketing is no longer necessary in many cases.)

Bucketing is a method to efficiently handle sentences of different lengths. When translating English to French, we will have English sentences of different lengths L1 on input, and French sentences of different lengths L2 on output. In principle, we should create a seq2seq model for every pair (L1, L2+1) of lengths of an English and a French sentence (the French sentence is prefixed by a GO symbol).

To minimize the number of buckets, we find the closest bucket for each pair and then pad every sentence with a special PAD symbol at the end if the bucket is bigger than the sentence.

We use a number of buckets and pad to the closest one for efficiency. In this example, we use 4 buckets as follows.

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

If the input is an English sentence with 3 tokens, and the corresponding output is a French sentence with 6 tokens, then they will be put in the first bucket and padded to length 5 for encoder inputs (English sentence), and length 10 for decoder inputs. If we have an English sentence with 8 tokens and the corresponding French sentence has 18 tokens, then they will be put in the (20, 25) bucket.

In other words, bucket (I, O) is (encoder_input_size, decoder_inputs_size).
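
Choosing the bucket for a sentence pair can be sketched as below. The helper `choose_bucket` is hypothetical, and it rests on an assumption that is consistent with the two examples above: the target gets one extra EOS token, and both sides must be strictly shorter than the bucket sizes so that GO and padding still fit.

```python
buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def choose_bucket(n_source_tokens, n_target_tokens, buckets):
    """Index of the first bucket that fits, or None if the pair is too long."""
    for bucket_id, (source_size, target_size) in enumerate(buckets):
        # +1 accounts for the EOS token appended to the target (assumption).
        if n_source_tokens < source_size and n_target_tokens + 1 < target_size:
            return bucket_id
    return None

print(choose_bucket(3, 6, buckets))     # the 3-token / 6-token pair
print(choose_bucket(8, 18, buckets))    # the 8-token / 18-token pair
```

Pairs that fit no bucket are reported as None here; how the real example handles them is not covered by this sketch.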

Given a pair [["I", "go", "."], ["Je", "vais", "."]] in tokenized format, we fit it into bucket (5, 10). The encoder inputs of the training data are then [PAD PAD "." "go" "I"] and the decoder inputs are [GO "Je" "vais" "." EOS PAD PAD PAD PAD PAD]. The targets are the decoder inputs shifted by one. The target_weights is the mask of the targets.

bucket = (I, O) = (5, 10)
encoder_inputs = [PAD PAD "." "go" "I"]                       <-- 5  x batch_size
decoder_inputs = [GO "Je" "vais" "." EOS PAD PAD PAD PAD PAD] <-- 10 x batch_size
target_weights = [1   1     1     1   0 0 0 0 0 0]            <-- 10 x batch_size
targets        = ["Je" "vais" "." EOS PAD PAD PAD PAD PAD]    <-- 9  x batch_size
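
The padding of a single pair can be sketched as follows. The function `pad_pair` and the token ids 11 to 13 and 21 to 23 are made up for illustration; only PAD_ID, GO_ID and EOS_ID follow the values used in this tutorial, and the sketch is not the wrapper's get_batch implementation.

```python
PAD_ID, GO_ID, EOS_ID = 0, 1, 2

def pad_pair(source_ids, target_ids, bucket):
    """Pad one pair to a bucket; target_ids must already end with EOS_ID."""
    encoder_size, decoder_size = bucket
    # Pad the source, then reverse it so the padding comes first.
    encoder_inputs = list(reversed(
        source_ids + [PAD_ID] * (encoder_size - len(source_ids))))
    # Prefix GO and pad the target up to the decoder size.
    decoder_inputs = ([GO_ID] + target_ids +
                      [PAD_ID] * (decoder_size - len(target_ids) - 1))
    # Targets are the decoder inputs shifted by one.
    targets = decoder_inputs[1:]
    # Weight 1 where a real target exists; the final position has no target.
    target_weights = [1 if t != PAD_ID else 0 for t in targets] + [0]
    return encoder_inputs, decoder_inputs, targets, target_weights

# "I go ." -> 11 12 13 and "Je vais ." -> 21 22 23 (made-up ids).
enc, dec, tgt, w = pad_pair([11, 12, 13], [21, 22, 23, EOS_ID], (5, 10))
print(enc, dec, tgt, w, sep="\n")
```

The weights mask out the padded positions, so padding does not contribute to the loss.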

In this example, one sentence is represented by one column, so assuming batch_size = 3 and bucket = (5, 10), the training data will look like:

encoder_inputs    decoder_inputs    target_weights    targets
0    0    0       1    1    1       1    1    1       87   71   16748
0    0    0       87   71   16748   1    1    1       2    3    14195
0    0    0       2    3    14195   0    1    1       0    2    2
0    0    3233    0    2    2       0    0    0       0    0    0
3    698  4061    0    0    0       0    0    0       0    0    0
                  0    0    0       0    0    0       0    0    0
                  0    0    0       0    0    0       0    0    0
                  0    0    0       0    0    0       0    0    0
                  0    0    0       0    0    0       0    0    0
                  0    0    0       0    0    0

where 0 : _PAD    1 : _GO     2 : _EOS      3 : _UNK
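
The column layout is a transpose of the per-sentence lists. The sketch below rebuilds the encoder_inputs columns of the table from three padded and reversed sentences; the variable names are assumptions made for illustration.

```python
# Three padded, reversed encoder inputs (one list per sentence), bucket size 5.
batch_major = [
    [0, 0, 0, 0, 3],
    [0, 0, 0, 0, 698],
    [0, 0, 0, 3233, 4061],
]

# Transpose so that row t holds token t of every sentence in the batch.
time_major = [list(step) for step in zip(*batch_major)]
for row in time_major:
    print(row)
```

Each printed row corresponds to one line of the encoder_inputs block in the table, i.e. one time step for the whole batch.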

During training, the decoder inputs are the targets, while during prediction, the next decoder input is the last decoder output.

Special vocabulary symbols, punctuations and digits

The special vocabulary symbols in this example are:

_PAD = b"_PAD"
_GO = b"_GO"
_EOS = b"_EOS"
_UNK = b"_UNK"
PAD_ID = 0      <-- index (row number) in vocabulary
GO_ID = 1
EOS_ID = 2
UNK_ID = 3

Symbol   ID   Meaning
_PAD     0    Padding, empty word
_GO      1    1st element of decoder_inputs
_EOS     2    End of Sentence of targets
_UNK     3    Unknown word; words that do not exist in the vocabulary will be marked as 3

For digits, the normalize_digits setting used when creating the vocabularies and the tokenized dataset must be consistent. If normalize_digits=True, all digits will be replaced by 0, e.g. 123 becomes 000, 9 becomes 0 and 1990-05 becomes 0000-00; then 000, 0, 0000-00, etc. will be the words in the vocabulary (see vocab40000.en).

Otherwise, if normalize_digits=False, different digits will be seen in the vocabulary, and the vocabulary size will be very big. The regular expression to find digits is _DIGIT_RE = re.compile(br"\d") (see tl.nlp.create_vocabulary() and tl.nlp.data_to_token_ids()).

For word splitting, the regular expression is _WORD_SPLIT = re.compile(b"([.,!?\"':;)(])"). This means the sentence is split using symbols like [ . , ! ? " ' : ; ) ( ] and spaces; see tl.nlp.basic_tokenizer(), which is the default tokenizer of tl.nlp.create_vocabulary() and tl.nlp.data_to_token_ids().

All punctuation marks, such as . , ) (, are kept in the vocabularies of both English and French.

Sampled softmax

Sampled softmax is a method to reduce the computation of the cost so as to handle a large output vocabulary. Instead of computing the cross-entropy over the large output, we compute the loss from num_samples samples.
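
The idea can be illustrated with a deliberately simplified sketch. It samples the other classes uniformly and omits the sampling-probability correction that real implementations such as tf.nn.sampled_softmax_loss apply, so it is a biased estimate shown only to convey the principle. All function names and numbers here are assumptions.

```python
import math
import random

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def full_softmax_loss(logits, target):
    """Cross-entropy over the whole vocabulary."""
    return log_sum_exp(logits) - logits[target]

def sampled_softmax_loss(logits, target, num_samples, rng):
    """Cross-entropy over the target plus num_samples random other classes."""
    negatives = rng.sample([i for i in range(len(logits)) if i != target],
                           num_samples)
    candidates = [target] + negatives
    return log_sum_exp([logits[i] for i in candidates]) - logits[target]

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # toy vocabulary of 1000
full = full_softmax_loss(logits, target=7)
sampled = sampled_softmax_loss(logits, target=7, num_samples=64, rng=rng)
print(full, sampled)
```

In this toy version all logits are precomputed; in a real model the saving comes from computing the output-layer scores only for the target and the sampled classes.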

Dataset iteration

The iteration is done by EmbeddingAttentionSeq2seqWrapper.get_batch, which gets a random batch of data from the specified bucket and prepares it for step.

Loss and update expressions

The EmbeddingAttentionSeq2seqWrapper has a built-in SGD optimizer.

What Next?

Try other applications.

More info

For more information on what you can do with TensorLayer, just continue reading through readthedocs. Finally, the reference lists and explains the following:

layers (tensorlayer.layers),

activation (tensorlayer.activation),

natural language processing (tensorlayer.nlp),

reinforcement learning (tensorlayer.rein),

cost expressions and regularizers (tensorlayer.cost),

load and save files (tensorlayer.files),

operating system (tensorlayer.ops),

helper functions (tensorlayer.utils),

visualization (tensorlayer.visualize),

iteration functions (tensorlayer.iterate),

preprocessing functions (tensorlayer.prepro),