# Chapter 10. Sequence Modeling: Recurrent and Recursive Nets

* 손고리즘ML: Part 4 - DML [1]
* 김무성

* 10.1 Unfolding Computational Graphs
* 10.2 Recurrent Neural Networks
    - 10.2.1 Computing the Gradient in a Recurrent Neural Network
    - 10.2.2 Recurrent Networks as Directed Graphical Models
    - 10.2.3 Modeling Sequences Conditioned on Context with RNNs
* 10.3 Bidirectional RNNs
* 10.4 Encoder-Decoder Sequence-to-Sequence Architectures
* 10.5 Deep Recurrent Networks
* 10.6 Recursive Neural Networks
* 10.7 The Challenge of Long-Term Dependencies
* 10.8 Echo State Networks
* 10.9 Skip Connections through Time
* 10.10 Leaky Units and a Spectrum of Different Time Scales
* 10.11 The Long Short-Term Memory and Other Gated RNNs
    - 10.11.1 LSTM
    - 10.11.2 Other Gated RNNs
* 10.12 Optimization for Long-Term Dependencies
    - 10.12.1 Clipping Gradients
* 10.13 Regularizing to Encourage Information Flow
* 10.14 Organizing the State at Multiple Time Scales
* 10.15 Explicit Memory

#### Recurrent neural networks

Recurrent neural networks, or RNNs (Rumelhart et al., 1986a), are a family of neural networks for processing sequential data. In sequential data, each example is a sequence, and different examples may have different lengths.

#### computational graph

This chapter extends the idea of a computational graph, introduced in Sec. 6.5.1, to include cycles. These cycles represent the influence of the present value of a variable on its own value at a future time step. Such computational graphs allow us to define recurrent neural networks. We then describe many different ways to construct, train, and use recurrent neural networks.

# 10.1 Unfolding Computational Graphs

* In this section we explain the idea of unfolding a recursive or recurrent computation into a computational graph that has a repetitive structure, typically corresponding to a chain of events.
* Unfolding this graph results in the sharing of parameters across a deep network structure.

#### classical form of dynamical system

For example, consider the classical form of a dynamical system:

<img src="figures/cap10.1.png" width=600 />

where s(t) is called the state of the system.

#### unfolding

For a finite number of time steps τ, the graph can be unfolded by applying the
definition τ − 1 times. For example, if we unfold Eq. 10.1 for τ = 3 time steps, we obtain

<img src="figures/cap10.2.png" width=600 />

The unfolded computational graph of Eq. 10.1 and Eq. 10.2 is illustrated in Fig. 10.1.

<img src="figures/cap10.3.png" width=600 />
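The unfolding step can be sketched in a few lines of Python. This is only an illustration: the transition function `f` (a simple linear map), the scalar parameter `theta`, and the starting value `s1` are assumptions chosen for the example, not something the text specifies.

```python
def f(s, theta):
    """Hypothetical transition function s(t) = f(s(t-1); theta).

    A linear map is used here purely for illustration.
    """
    return theta * s + 1.0


def unfold(s1, theta, tau):
    """Apply the definition tau - 1 times and return [s(1), ..., s(tau)]."""
    states = [s1]
    for _ in range(tau - 1):
        states.append(f(states[-1], theta))
    return states


theta = 0.5  # assumed parameter value
s1 = 4.0     # assumed initial state s(1)

states = unfold(s1, theta, tau=3)

# For tau = 3 the unfolded expression is s(3) = f(f(s(1); theta); theta).
s3_nested = f(f(s1, theta), theta)
```

The loop and the nested expression compute the same value: the recurrence no longer appears, only repeated applications of `f` with the same `theta`.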

#### external signal

As another example, let us consider a dynamical system driven by an external
signal x(t),

<img src="figures/cap10.4.png" width=600 />

where we see that the state now contains information about the whole past sequence.
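A small sketch can make the last point concrete. It assumes a scalar state, a hypothetical transition `f` built from `tanh`, and made-up parameter values; the only claim it illustrates is that the final state depends on inputs from earlier time steps.

```python
import math


def f(s, x, theta):
    """Hypothetical transition s(t) = f(s(t-1), x(t); theta), with theta = (w, u)."""
    w, u = theta
    return math.tanh(w * s + u * x)


def run(xs, s0, theta):
    """Feed the external signal xs through the system and return the final state."""
    s = s0
    for x in xs:
        s = f(s, x, theta)
    return s


theta = (0.9, 0.5)  # assumed parameter values

# The two input sequences differ only in their first element.
final_a = run([1.0, 0.0, 0.0], 0.0, theta)
final_b = run([-1.0, 0.0, 0.0], 0.0, theta)
```

Although the two sequences agree on the last two steps, `final_a` and `final_b` differ, because the state carries information about the first input forward in time.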

#### hidden state

To indicate that the state is the hidden units of
the network, we now rewrite Eq. 10.3 using the variable h to represent the state:

<img src="figures/cap10.5.png" width=600 />

This recurrence is illustrated in Fig. 10.2. Typical RNNs add extra architectural features, such as output layers that read information out of the state h to make predictions.

#### circuit diagram (recurrent graph) or unfolded computational graph (unrolled graph)

Eq. 10.4 can be drawn in two different ways. 
* One way to draw the RNN is with a diagram containing one node for every component that might exist in a physical implementation of the model, such as a biological neural network.
* The other way to draw the RNN is as an unfolded computational graph, in which each component is represented by many different variables, with one variable per time step, representing the state of the component at that point in time.

<img src="figures/cap10.6.png" width=600 />

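We can represent the unfolded recurrence after t steps with a function g(t):

<img src="figures/cap10.7.png" width=600 />

The function g(t) takes the whole past sequence (x(t), x(t−1), x(t−2), ..., x(2), x(1)) as input and produces the current state, but the unfolded recurrent structure lets us factorize g(t) into repeated application of a function f. The unfolding process therefore offers two major advantages over simply constructing a function of the full sequence x(t) for t = 1,...,τ:

* Regardless of the sequence length, the learned model always has the same input size, because it is specified in terms of the transition from one state to the next rather than a variable-length history of states.
* The same transition function f with <font color="red">the same parameters θ</font> is used at each time step.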
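The following sketch shows g(t) written as a fold of one transition function over the input sequence. The names `f`, `g`, and the dictionary `theta` with its three scalar entries are hypothetical, introduced only for illustration.

```python
from functools import reduce
import math

# Hypothetical shared parameters, reused at every time step.
theta = {"w": 0.8, "u": 0.3, "b": 0.1}


def f(h, x, theta):
    """One transition: h(t) = f(h(t-1), x(t); theta)."""
    return math.tanh(theta["w"] * h + theta["u"] * x + theta["b"])


def g(xs, h0, theta):
    """g(t) over the whole past sequence, computed as repeated application of f."""
    return reduce(lambda h, x: f(h, x, theta), xs, h0)


# The same f and theta handle sequences of different lengths.
short = g([1.0, 2.0], 0.0, theta)
long = g([1.0, 2.0, -1.0, 0.5, 3.0], 0.0, theta)
```

The number of parameters stays at three whether the sequence has two steps or five, which is the parameter sharing the unfolded graph makes explicit.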

#### Both the recurrent graph and the unrolled graph have their uses. 
* The recurrent graph is succinct.
* The unfolded graph provides an explicit description of which computations to perform. 
    - The unfolded graph also helps to illustrate the idea of information flow forward in time (computing outputs and losses) and backward in time (computing gradients) by explicitly showing the path along which this information flows.

# 10.2 Recurrent Neural Networks
   * 10.2.1 Computing the Gradient in a Recurrent Neural Network
   * 10.2.2 Recurrent Networks as Directed Graphical Models
   * 10.2.3 Modeling Sequences Conditioned on Context with RNNs

Some examples of important design patterns for recurrent neural networks include the following:
* Recurrent networks that <font color="red">produce an output at each time step</font> and have <font color="blue">recurrent connections between hidden units</font>, illustrated in Fig. 10.3.

<img src="figures/cap10.8.png" width=600 />

* Recurrent networks that produce an output at each time step and have <font color="red">recurrent connections only from the output at one time step to the hidden units at the next time step</font>, illustrated in Fig. 10.4.

<img src="figures/cap10.10.png" width=600 />

* Recurrent networks with recurrent connections between hidden units, that <font color="red">read an entire sequence and then produce a single output</font>, illustrated in Fig. 10.5.

<img src="figures/cap10.12.png" width=600 />

### forward propagation equations

* We now develop the forward propagation equations for the RNN depicted in Fig. 10.3.
* Here we assume the hyperbolic tangent activation function.
* Here we assume that the output is discrete, as if the RNN is used to predict words or characters.
* A natural way to represent discrete variables is to regard the output o as giving the unnormalized log probabilities of each possible value of the discrete variable.
* We can then apply the softmax operation as a post-processing step to obtain a vector ŷ of normalized probabilities over the output.
* Forward propagation begins with a specification of the initial state h(0).

<img src="figures/cap10.8.png" width=600 />

Then, for each time step from t = 1 to t = τ, we apply the following update equations:

<img src="figures/cap10.9.png" width=600 />

where the parameters are the bias vectors b and c along with the weight matrices U, V and W, for the input-to-hidden, hidden-to-output and hidden-to-hidden connections, respectively.
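As a rough sketch, the forward pass can be written in plain Python. It assumes the update equations take the form a(t) = b + W h(t−1) + U x(t), h(t) = tanh(a(t)), o(t) = c + V h(t), ŷ(t) = softmax(o(t)), which matches the roles of b, c, U, V and W described above. The helper names and the toy weight values are made up for the example.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(o):
    """Turn unnormalized log probabilities into normalized probabilities."""
    m = max(o)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in o]
    z = sum(exps)
    return [e / z for e in exps]


def rnn_forward(xs, h0, U, W, V, b, c):
    """Run the RNN from t = 1 to t = tau; return hidden states and outputs."""
    h = h0
    hs, y_hats = [], []
    for x in xs:
        a = [bi + wi + ui for bi, wi, ui in zip(b, matvec(W, h), matvec(U, x))]
        h = [math.tanh(ai) for ai in a]
        o = [ci + vi for ci, vi in zip(c, matvec(V, h))]
        hs.append(h)
        y_hats.append(softmax(o))
    return hs, y_hats


# Toy sizes (assumed): 2 input units, 3 hidden units, 2 output classes.
U = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]                    # input-to-hidden
W = [[0.0, 0.1, 0.0], [0.2, 0.0, -0.1], [0.0, 0.3, 0.0]]      # hidden-to-hidden
V = [[0.5, -0.5, 0.2], [-0.3, 0.8, 0.1]]                      # hidden-to-output
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hs, y_hats = rnn_forward(xs, [0.0, 0.0, 0.0], U, W, V, b, c)
```

Each entry of `y_hats` is a probability vector over the two output classes, one per time step, so the output sequence has the same length as the input sequence.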

#### total loss

* This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. 
* The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps.

For example, if L(t) is the negative log-likelihood of y(t) given x(1), ..., x(t), then

<img src="figures/cap10.11.png" width=600 />

<img src="figures/cap10.13.png" width=600 />

<img src="figures/cap10.14.png" width=600 />
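The total loss described in this section, a sum of per-step negative log-likelihoods, can be sketched as follows. The probability vectors and targets are made-up values standing in for the softmax outputs ŷ(t) and the labels y(t); `sequence_loss` is a hypothetical helper name.

```python
import math


def sequence_loss(y_hats, ys):
    """Sum over time steps of L(t) = -log y_hat(t)[y(t)]."""
    return sum(-math.log(y_hat[y]) for y_hat, y in zip(y_hats, ys))


# Assumed model outputs for a sequence of three time steps and two classes.
y_hats = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
# Assumed target class index at each time step.
ys = [0, 1, 1]

total = sequence_loss(y_hats, ys)
```

Because the per-step losses are added, the total equals the negative log of the product of the probabilities assigned to the correct targets.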

## 10.2.1 Computing the Gradient in a Recurrent Neural Network

<img src="figures/cap10.15.png" width=600 />

<img src="figures/cap10.16.png" width=600 />

<img src="figures/cap10.17.png" width=600 />

<img src="figures/cap10.18.png" width=600 />

<img src="figures/cap10.19.png" width=600 />

## 10.2.2 Recurrent Networks as Directed Graphical Models

<img src="figures/cap10.20.png" width=600 />

<img src="figures/cap10.21.png" width=600 />

<img src="figures/cap10.22.png" width=600 />

<img src="figures/cap10.23.png" width=600 />

<img src="figures/cap10.24.png" width=600 />

<img src="figures/cap10.25.png" width=600 />

<img src="figures/cap10.26.png" width=600 />

## 10.2.3 Modeling Sequences Conditioned on Context with RNNs

<img src="figures/cap10.27.png" width=600 />

# 10.3 Bidirectional RNNs

<img src="figures/cap10.28.png" width=600 />

<img src="figures/cap10.29.png" width=600 />

<img src="figures/cap10.30.png" width=600 />

# 10.4 Encoder-Decoder Sequence-to-Sequence Architectures

<img src="figures/cap10.31.png" width=600 />

# 10.5 Deep Recurrent Networks

<img src="figures/cap10.32.png" width=600 />

# 10.6 Recursive Neural Networks

<img src="figures/cap10.33.png" width=600 />

# 10.7 The Challenge of Long-Term Dependencies

<img src="figures/cap10.34.png" width=600 />

<img src="figures/cap10.35.png" width=600 />

# 10.8 Echo State Networks

# 10.9 Skip Connections through Time

# 10.10 Leaky Units and a Spectrum of Different Time Scales

# 10.11 The Long Short-Term Memory and Other Gated RNNs
   * 10.11.1 LSTM
   * 10.11.2 Other Gated RNNs

## 10.11.1 LSTM

<img src="figures/cap10.36.png" width=600 />

<img src="figures/cap10.37.png" width=600 />

<img src="figures/cap10.38.png" width=600 />

## 10.11.2 Other Gated RNNs

<img src="figures/cap10.40.png" width=600 />

<img src="figures/cap10.41.png" width=600 />

# 10.12 Optimization for Long-Term Dependencies
   * 10.12.1 Clipping Gradients

<img src="figures/cap10.42.png" width=600 />

## 10.12.1 Clipping Gradients

<img src="figures/cap10.43.png" width=600 />

<img src="figures/cap10.44.png" width=600 />
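A minimal sketch of one common form of gradient clipping, clipping by norm, is given below. It assumes the rule "if the gradient norm exceeds a threshold, rescale the gradient so its norm equals the threshold"; the function name `clip_by_norm` and the threshold value are hypothetical.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)


# A gradient with norm 5 is rescaled to norm 1; its direction is unchanged.
clipped = clip_by_norm([3.0, 4.0], threshold=1.0)

# A gradient already within the threshold is returned as is.
unchanged = clip_by_norm([0.3, 0.4], threshold=1.0)
```

Rescaling keeps the direction of the update and only bounds its size, which is why this variant is often preferred over clipping each component separately.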

# 10.13 Regularizing to Encourage Information Flow

<img src="figures/cap10.45.png" width=600 />
<img src="figures/cap10.46.png" width=600 />

# 10.14 Organizing the State at Multiple Time Scales

# 10.15 Explicit Memory

# References
* [1] Bengio's Deep Learning book, Chapter 10. Sequence Modeling: Recurrent and Recursive Nets - http://www.deeplearningbook.org/contents/rnn.html
* [2] CS231n: Convolutional Neural Networks for Visual Recognition, Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM) - http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
* [3] Probabilistic Graphical Models: Template Models - http://spark-university.s3.amazonaws.com/stanford-pgm/slides/Section-2-Representation-Template-Models.pdf