challenging task of language modeling.

Given a sentence "I am from Imperial College London", the model can learn to
predict "Imperial College London" from "from Imperial College". In other
word, it predicts the next word in a text given a history of previous words.
In the previous example, ``num_steps`` (sequence length) is 3.

.. code-block:: bash

python tutorial_ptb_lstm.py


The script provides three settings (small, medium, large), where a larger model has
better performance. You can choose different settings in:

.. code-block:: python

flags.DEFINE_string(
"model", "small",
"A type of model. Possible options are: small, medium, large.")

If you choose the small setting, you can see:

.. code-block:: text

Epoch: 13 Valid Perplexity: 121.475
Test Perplexity: 116.716

The PTB example shows that an RNN is able to model language, but this example
did not do anything practically interesting. However, you should read through this example
and “Understand LSTM” in order to understand the basics of RNNs.
After that, you will learn how to generate text, how to achieve language translation,
and how to build a question answering system using RNNs.


Understand LSTM
We personally think Andrey Karpathy's blog is the best material to
`Understand Recurrent Neural Network`_ , after reading that, Colah's blog can
help you to `Understand LSTM Network`_ `[chinese] <http://dataunion.org/9331.html>`_
which can solve The Problem of Long-Term
Dependencies. We will not describe the theory of RNNs further here, so please read through these blogs
before you go on.

.. _fig_0601:
Image by Andrey Karpathy
Synced sequence input and output
---------------------------------

The model in the PTB example is a typical type of synced sequence input and output,
which was described by Karpathy as
"(5) Synced sequence input and output (e.g. video classification where we wish
to label each frame of the video). Notice that in every case there are no pre-specified
constraints on the lengths of sequences because the recurrent transformation (green)
is fixed and can be applied as many times as we like."

The model is built as follows. Firstly, we transfer the words into word vectors by
looking up an embedding matrix. In this tutorial, there is no pre-training on the embedding
matrix. Secondly, we stack two LSTMs together using dropout between the embedding
layer, LSTM layers, and the output layer for regularization. In the final layer,
the model provides a sequence of softmax outputs.

The first LSTM layer outputs ``[batch_size, num_steps, hidden_size]`` for stacking
another LSTM after it. The second LSTM layer outputs ``[batch_size*num_steps, hidden_size]``
for stacking a DenseLayer after it. Then the DenseLayer computes the softmax outputs of each example
(``n_examples = batch_size*num_steps``).
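
To see how these shapes fit together, here is a small, self-contained NumPy
sketch (purely illustrative, not part of the tutorial script) of flattening a
``[batch_size, num_steps, hidden_size]`` tensor into
``[batch_size*num_steps, hidden_size]`` rows for the dense softmax layer:

.. code-block:: python

    import numpy as np

    # Sizes of the "medium" setting, used here only for illustration.
    batch_size, num_steps, hidden_size = 20, 35, 650

    # Output of the second LSTM: one hidden vector per word position.
    lstm_out = np.random.randn(batch_size, num_steps, hidden_size)

    # Flatten batch and time so the DenseLayer sees one row per example.
    dense_in = lstm_out.reshape(batch_size * num_steps, hidden_size)
    print(dense_in.shape)  # (700, 650), i.e. n_examples = batch_size*num_steps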

To understand the PTB tutorial, you can also read `TensorFlow PTB tutorial
<https://www.tensorflow.org/versions/r0.9/tutorials/recurrent/index.html#recurrent-neural-networks>`_.

(Note that TensorLayer supports DynamicRNNLayer after v1.1, so you can set the input/output dropouts and the number of RNN layers in one single layer.)


.. code-block:: python
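
    # A rough sketch of the network described above, written in the style of
    # TensorLayer 1.x (see tutorial_ptb_lstm.py for the exact code; layer and
    # argument names may differ slightly between versions).
    import tensorflow as tf
    import tensorlayer as tl

    batch_size, num_steps = 20, 35
    hidden_size, vocab_size, keep_prob = 650, 10000, 0.5

    input_data = tf.placeholder(tf.int32, [batch_size, num_steps])
    targets = tf.placeholder(tf.int32, [batch_size, num_steps])

    # Word IDs -> word vectors (the embedding matrix is learned from scratch).
    network = tl.layers.EmbeddingInputlayer(
        inputs=input_data, vocabulary_size=vocab_size,
        embedding_size=hidden_size, name='embedding')
    network = tl.layers.DropoutLayer(network, keep=keep_prob, name='drop1')

    # First LSTM returns the whole sequence: [batch_size, num_steps, hidden_size].
    network = tl.layers.RNNLayer(network,
        cell_fn=tf.nn.rnn_cell.BasicLSTMCell,
        n_hidden=hidden_size, n_steps=num_steps,
        return_last=False, name='lstm1')
    lstm1 = network
    network = tl.layers.DropoutLayer(network, keep=keep_prob, name='drop2')

    # Second LSTM returns a 2D tensor: [batch_size*num_steps, hidden_size].
    network = tl.layers.RNNLayer(network,
        cell_fn=tf.nn.rnn_cell.BasicLSTMCell,
        n_hidden=hidden_size, n_steps=num_steps,
        return_last=False, return_seq_2d=True, name='lstm2')
    lstm2 = network
    network = tl.layers.DropoutLayer(network, keep=keep_prob, name='drop3')

    # Softmax output layer over the vocabulary for every word position.
    network = tl.layers.DenseLayer(network, n_units=vocab_size,
                                   act=tf.identity, name='output')
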
Dataset iteration
^^^^^^^^^^^^^^^^^

The ``batch_size`` can be seen as the number of concurrent computations we are running.
As the following example shows, the first batch learns the sequence information by using items 0 to 9.
The second batch learns the sequence information by using items 10 to 19.
So it ignores the information from item 9 to item 10!
Only if we set ``batch_size = 1`` will it consider all the information from items 0 to 20.

The meaning of ``batch_size`` here is not the same as the ``batch_size`` in the MNIST example. In the MNIST example,
``batch_size`` reflects how many examples we consider in each iteration, while in the
PTB example, ``batch_size`` is the number of concurrent processes (segments)
for accelerating the computation.

Some information will be ignored if ``batch_size`` > 1. However, if your dataset
is "long" enough (a text corpus usually has billions of words), the ignored
information would not affect the final result.

In the PTB tutorial, we set ``batch_size = 20``, so we divide the dataset into 20 segments.
At the beginning of each epoch, we initialize (reset) the 20 RNN states for the 20
segments to zero, then go through the 20 segments separately.

An example of generating training data is as follows:

.. code-block:: python

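    # An illustrative sketch of how the training batches are generated.  The
    # tutorial script uses TensorLayer's PTB iterator; this is a minimal NumPy
    # re-implementation of the same idea, not the exact tutorial code.
    import numpy as np

    def ptb_batches(data, batch_size, num_steps):
        data = np.array(data)
        batch_len = len(data) // batch_size
        # Cut the corpus into `batch_size` parallel segments.
        segments = data[:batch_size * batch_len].reshape(batch_size, batch_len)
        for i in range((batch_len - 1) // num_steps):
            x = segments[:, i * num_steps:(i + 1) * num_steps]
            y = segments[:, i * num_steps + 1:(i + 1) * num_steps + 1]
            yield x, y  # the targets are the inputs shifted by one word

    train_data = list(range(20))
    for x, y in ptb_batches(train_data, batch_size=2, num_steps=3):
        print(x, y)
    # First batch:  x = [[0 1 2], [10 11 12]]   y = [[1 2 3], [11 12 13]]
    # Second batch: x = [[3 4 5], [13 14 15]]   y = [[4 5 6], [14 15 16]]
    # Third batch:  x = [[6 7 8], [16 17 18]]   y = [[7 8 9], [17 18 19]]
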
Loss and update expressions
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The cost function is the average cost of each mini-batch:

.. code-block:: python

# targets : 2D tensor [batch_size, num_steps], needs to be reshaped.
# n_examples = batch_size * num_steps
# so
# cost is the average cost of each mini-batch (concurrent process).
loss = tf.nn.seq2seq.sequence_loss_by_example(
[outputs],
[tf.reshape(targets, [-1])],
cost = loss_fn(network.outputs, targets, batch_size, num_steps)

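As a side note on the numbers reported earlier: the perplexity printed during
training and testing is simply the exponential of this average cost per word.
A quick illustration (not part of the tutorial script):

.. code-block:: python

    import numpy as np

    # An average cross-entropy cost of about 4.76 nats per word corresponds to
    # a perplexity of exp(4.76) ~= 116.7, the scale of the "Test Perplexity"
    # shown in the sample output above.
    avg_cost_per_word = 4.76
    print(np.exp(avg_cost_per_word))  # ~116.75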

For updating, truncated backpropagation clips the values of the gradients by the ratio of the sum of
their norms, so as to make the learning process tractable.

.. code-block:: python
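
    # A sketch of the usual gradient-clipping setup (the exact tutorial code is
    # omitted in this excerpt; variable names such as `max_grad_norm` and `lr`
    # are assumed).  tf.clip_by_global_norm rescales the gradients whenever
    # their global norm exceeds `max_grad_norm`.
    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
                                      max_grad_norm)
    optimizer = tf.train.GradientDescentOptimizer(lr)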
    train_op = optimizer.apply_gradients(zip(grads, tvars))


In addition, if the epoch index is greater than ``max_epoch``, we decrease the learning rate
by multiplying it by ``lr_decay``.

.. code-block:: python
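
    # A sketch of the decay step (assumed variable names; the exact tutorial
    # code is omitted in this excerpt).  `i` is the epoch index and `lr` is a
    # TensorFlow variable holding the current learning rate.
    new_lr_decay = lr_decay ** max(i + 1 - max_epoch, 0.0)
    sess.run(tf.assign(lr, learning_rate * new_lr_decay))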


At the beginning of each epoch, all the states of the LSTMs need to be reset
(initialized) to zero. Then after each iteration, the LSTMs' states
are updated, so the new LSTM states (final states) need to be assigned as the initial states of the next iteration:

.. code-block:: python

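    # A sketch of the pattern described above, with assumed names in the style
    # of the tutorial script (see tutorial_ptb_lstm.py for the exact code).
    # Reset the states of both LSTM layers to zero at the start of the epoch:
    state1 = tl.layers.initialize_rnn_state(lstm1.initial_state)
    state2 = tl.layers.initialize_rnn_state(lstm2.initial_state)
    for step, (x, y) in enumerate(tl.iterate.ptb_iterator(train_data,
                                                          batch_size, num_steps)):
        feed_dict = {input_data: x, targets: y,
                     lstm1.initial_state: state1,
                     lstm2.initial_state: state2}
        # Run one step and fetch the final states so that they can be fed back
        # as the initial states of the next iteration.
        _cost, state1, state2, _ = sess.run(
            [cost, lstm1.final_state, lstm2.final_state, train_op],
            feed_dict=feed_dict)
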
Predicting
^^^^^^^^^^^^^

After training the model, when we predict the next output, we no longer consider
the number of steps (sequence length), i.e. ``batch_size`` and ``num_steps`` are set to ``1``.
Then we can output the next word one by one, instead of predicting a sequence
of words from a sequence of words.

.. code-block:: python
What Next?
-----------

Now, you have understood Synced sequence input and output. Let's think about
Many to one (sequence input and one output), where an LSTM is able to predict
the next word "English" from "I am from London, I speak ..".

Please read and understand the code of ``tutorial_generate_text.py``.
It shows you how to restore a pre-trained embedding matrix and how to learn text
generation from a given context.

Karpathy's blog: