.. _recurrency:

==========
Recurrency
==========

Also see the slides of `our Interspeech 2020 tutorial about machine learning frameworks including RETURNN <https://www-i6.informatik.rwth-aachen.de/publications/download/1154/Zeyer--2020.pdf>`__,
which also explain the **recurrency** concept.
**Recurrency** :=
anything which is defined by step-by-step execution,
where the current step depends on the previous step, such as an RNN, beam search, etc.

This is all covered by :class:`returnn.tf.layers.rec.RecLayer`,
which is a generic wrapper around ``tf.while_loop``.
It covers:

* Definition of stochastic variables (the output classes themselves, but also latent variables),
  either for beam search or for training (e.g. using the ground truth values)
* Automatic optimizations

The recurrent formula is defined the way it would be used for recognition.
I.e. specifically, you define your output labels as stochastic variables,
together with their probability distribution.
The automatic optimization then makes this efficient for the case of training.

Also see :ref:`recurrent_subnet`
for more about the usage of :class:`returnn.tf.layers.rec.RecLayer`.
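
The core idea can be sketched in plain Python (a toy illustration of the concept only, not RETURNN code): a recurrence threads a state through a loop over the sequence, which is conceptually what ``tf.while_loop`` does inside :class:`returnn.tf.layers.rec.RecLayer`.

```python
def run_recurrence(xs, step, init_state):
    """Run a step function over a sequence, threading the state through the loop."""
    state = init_state
    outputs = []
    for x in xs:  # analogous to tf.while_loop: one iteration per time step
        state, y = step(state, x)
        outputs.append(y)
    return outputs


def ema_step(prev, x, alpha=0.5):
    # Example step: an exponential moving average. The current output
    # depends on the previous state -- the defining property of recurrency.
    new = alpha * x + (1 - alpha) * prev
    return new, new


print(run_recurrence([1.0, 1.0, 1.0], ema_step, init_state=0.0))
```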
.. _recurrency_stochastic_vars:

Stochastic variables
--------------------

The layer to define them is :class:`returnn.tf.layers.rec.ChoiceLayer`.
The default behavior is:

* In training, it will just return the ground truth values.
* With search enabled (in recognition), it will do beam search.
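
A minimal sketch of such a definition inside a recurrent unit (the layer names other than ``output``, and the ``n_out``/``beam_size`` values, are illustrative choices, not fixed names):

```python
# Minimal sketch: one stochastic variable ("output") over the target classes.
# "output_prob" and "s" are hypothetical layer names for this illustration.
rec_unit = {
    "output_prob": {"class": "softmax", "from": "s", "target": "classes"},
    "output": {
        "class": "choice",       # ChoiceLayer: the stochastic variable
        "from": "output_prob",   # its probability distribution
        "target": "classes",     # in training: just the ground truth values
        "beam_size": 8,          # in recognition: beam search with this beam
    },
}
```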

Note that there can be multiple stochastic variables.
Usually the output classes are one stochastic variable,
but there can be additional ones,
e.g. latent variables such as the segment boundaries or the time position in a hard attention model.

For the latent variables, you might want to perform search,
while keeping the output labels fixed to the ground truth ("forced alignment").
You might also want to perform search over the output labels in training;
see :ref:`min_exp_risk_training`.

For details on how beam search is implemented,
see :ref:`beam_search`.
For details about how to use it for recognition or generation,
see :ref:`generation_search`.

.. _recurrency_automatic_optimization:

Automatic optimization
----------------------

The definition of the recurrent formula can have parts
which are actually independent of the loop,
possibly depending on the mode (e.g. in training).
**Automatic optimization** will find the parts of the formula (i.e. sub layers)
which can be calculated independently of the loop,
i.e. outside of the loop.
All layers are implemented such that they perform the same mathematical calculation
whether they are inside the loop or outside of it.

Example:

.. code-block:: python

    network = {
        "input": {"class": "rec", "unit": "nativelstm2", "n_out": 20},  # encoder
        "input_last": {"class": "get_last_hidden_state", "from": "input", "n_out": 40},
        "output": {"class": "rec", "from": [], "target": "classes", "unit": {  # decoder
            "embed": {"class": "linear", "activation": None, "from": "output", "n_out": 10},
            "s": {"class": "rec", "unit": "nativelstm2", "n_out": 20,
                  "from": "prev:embed", "initial_state": "base:input_last"},
            "p": {"class": "softmax", "from": "s", "target": "classes", "loss": "ce"},
            "output": {"class": "choice", "from": "p", "target": "classes", "beam_size": 8},
        }},
    }

In this example, in training:

- ``output`` uses the ground truth values, i.e. it is independent of anything in the loop, and can be moved out.
- ``embed`` depends only on ``output``, which is moved out, so it can also be calculated outside the loop.
- ``s`` depends only on ``embed``, which is moved out, so it can also be calculated outside the loop.
  Note that ``s`` has internal state and in fact needs to be calculated recurrently.
  But because it can be calculated independently of the surrounding loop, it can make use of **very efficient** kernels:
  in this case, our ``NativeLstm2`` implementation.
- ``p`` depends on ``s``, and its loss calculation depends on the ground truth values,
  so it can also be calculated outside the loop.
  This results in a **very efficient** and parallel ``tf.matmul``.
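
The reasoning above rests on a simple equivalence: a computation whose inputs are fully known before the loop starts can be done once for all steps instead of once per iteration, with identical results. A toy illustration in plain Python (illustrative only, not RETURNN internals):

```python
# Toy illustration of "moving a computation out of the loop":
# in training, the embedding input (the ground truth labels) is known up front,
# so computing all embeddings in one batched pass gives the same result as
# computing one embedding per loop iteration.
embedding_table = {0: [0.0, 0.1], 1: [1.0, 1.1], 2: [2.0, 2.1]}
labels = [2, 0, 1, 1]  # ground truth, known before the loop in training

# Variant 1: inside the (conceptual) recurrent loop, one step at a time.
inside_loop = []
for t in range(len(labels)):
    inside_loop.append(embedding_table[labels[t]])

# Variant 2: moved out of the loop, computed for all steps at once.
outside_loop = [embedding_table[y] for y in labels]

assert inside_loop == outside_loop  # same math; variant 2 can be batched
```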

With search enabled, i.e. in recognition:
``output`` depends on the probability distribution ``p``.
Effectively nothing can be moved out, because everything depends on everything else.
This is still **as efficient as it possibly can be**:
the ``output`` :class:`returnn.tf.layers.rec.ChoiceLayer` uses ``tf.nn.top_k`` internally.

This example also shows how one single definition of the network
can be used for both training and recognition,
in a **very efficient** way.

Consider the `Transformer <https://arxiv.org/abs/1706.03762>`__ as another example.
The Transformer can be defined in a similarly straightforward way,
using ``output`` for the output labels with :class:`returnn.tf.layers.rec.ChoiceLayer`.
In training, this naturally results in the standard fully parallel training.
In decoding, it is also as efficient as it possibly can be.

.. _min_exp_risk_training:

Min expected risk training
--------------------------

This also covers:

* Min expected WER training
* Max expected BLEU training
* Reinforcement learning

By default,
:class:`returnn.tf.layers.rec.ChoiceLayer`
returns the ground truth in training.
However, this is flexible.
In *minimum expected risk training*,
you want to perform search also in training.

Example for min expected WER training:

.. code-block:: python

    "encoder": ...,
    "output": {"class": "rec", "unit": {
        ...
        "output_prob": {"class": "softmax", "from": "readout", "target": "classes"},
        "output": {"class": "choice", "target": "classes", "beam_size": 4,
                   "from": "output_prob", "initial_output": 0},
        ...
    }},  # [T|'time:var:extern_data:classes',B], int32, dim 1030, beam 'output', beam size 4
    "min_wer": {
        "class": "copy",
        "from": "extra.search:output",  # currently the syntax to enable search
        "loss": "expected_loss",  # expects beam search results with beam scores
        "target": "classes",
        "loss_opts": {"loss": {"class": "edit_distance"}, "loss_kind": "error"},
    },
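
Conceptually, this ``expected_loss`` criterion weights the risk of each beam hypothesis (here: edit distance errors against the reference) by its renormalized beam score. A rough sketch of that expectation in plain Python (function and variable names here are illustrative, not RETURNN API):

```python
import math


def expected_risk(beam_scores, beam_risks):
    """Expectation of the risk under the renormalized beam posterior.

    beam_scores: log-scores of the beam hypotheses.
    beam_risks: risk per hypothesis, e.g. edit distance to the reference.
    """
    # Softmax over the beam log-scores -> approximate posterior over hypotheses.
    m = max(beam_scores)
    exps = [math.exp(s - m) for s in beam_scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Weighted sum of risks: the quantity that training would minimize.
    return sum(p * r for p, r in zip(probs, beam_risks))


# Hypothetical beam of size 4 with log-scores and per-hypothesis word errors.
print(expected_risk([-1.0, -2.0, -2.5, -3.0], [1, 0, 2, 3]))
```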