In [1]:
%matplotlib inline

from matplotlib import pyplot as plt
import numpy as np
from IPython.display import YouTubeVideo
from IPython.display import HTML

In [2]:
from PIL import Image, ImageChops

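def trim(im, percent=36):
    """Crop the uniform border around `im`, then scale it to `percent` of the cropped size."""
    # Treat the top-left pixel as the background colour.
    bg = Image.new(im.mode, im.size, im.getpixel((0, 0)))
    diff = ImageChops.difference(im, bg)
    diff = ImageChops.add(diff, diff, 2.0, -100)
    bbox = diff.getbbox()
    if bbox:
        x = im.crop(bbox)
        # resize() needs integer sizes; Image.LANCZOS replaces the removed Image.ANTIALIAS.
        new_size = (x.size[0] * percent // 100, x.size[1] * percent // 100)
        return x.resize(new_size, Image.LANCZOS)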

def resize(filename, percent=36):
    trim(Image.open(filename + ".png"), percent).save(filename + "_r" + str(percent) + ".png")

# EECS 545:  Machine Learning
## Lecture 20:  Neural Networks: Part 2
* Instructor:  **Junhyuk Oh**
* Date:  April 11, 2016
- Many Slides/Figures/Examples/Ideas from:
  - Nando De Freitas (University of Oxford)
  - Richard Socher (MetaMind)

### Outline

- Motivation
- Basics of Neural Networks
  - Forward Propagation
  - Backward Propagation
- Deep Neural Networks
  - Convolutional Neural Networks
  - **Recurrent Neural Networks** 
- Applications
  - Computer Vision
  - Natural Language Processing
  - Reinforcement Learning

### Outline: Recurrent Neural Networks
- Introduction
- Simple RNN
  - Forward Propagation
  - Backward Propagation
- Long Short-Term Memory (LSTM)

### Recurrent Neural Networks (RNN)
- A special kind of neural network designed for modeling sequential data
  - Can take an arbitrary number of inputs through time
  - Can produce an arbitrary number of outputs through time
- Examples of sequential problems
  - Next word prediction
  - Machine translation
  - Speech recognition
  - Image caption generation

### Outline: Recurrent Neural Networks
- Introduction
- **Simple RNN**
  - **Forward Propagation**
  - Backward Propagation
- Long Short-Term Memory (LSTM)

### Simple RNN
$$ \textbf{h}_t = f(\textbf{W}\textbf{x}_t + \textbf{U}\textbf{h}_{t-1}+\textbf{b}) $$
$$ \hat{\textbf{y}}_t = \textbf{V}\textbf{h}_t $$
- $\textbf{W}$: input weight, $\textbf{U}$: recurrent weight, $\textbf{V}$: output weight, $\textbf{b}$: bias
- $f$ is a non-linear activation function (e.g., ReLU)
- Weights are shared across time: the number of parameters does not depend on the length of the input/output sequence
![](images/simple_rnn.png)
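
The two equations above can be sketched in NumPy as follows. This is a minimal illustration rather than the lecture's own code: the function name `rnn_forward`, the dimensions, the random initialization, and the choice of `tanh` for $f$ (the slide mentions ReLU as one option) are all assumptions.

```python
import numpy as np

def rnn_forward(xs, h0, W, U, V, b, f=np.tanh):
    """Run a simple RNN over a sequence.

    xs: inputs of shape (T, input_dim); h0: initial hidden state (hidden_dim,).
    Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    """
    h = h0
    hs, ys = [], []
    for x in xs:
        h = f(W @ x + U @ h + b)   # h_t = f(W x_t + U h_{t-1} + b)
        hs.append(h)
        ys.append(V @ h)           # y_hat_t = V h_t
    return np.stack(hs), np.stack(ys)

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
W = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
U = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
V = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, np.zeros(hidden_dim), W, U, V, b)
print(hs.shape, ys.shape)
```

Because the same `W`, `U`, `V`, and `b` are reused at every step, the function accepts a sequence of any length without changing the parameter count.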

### Forward Propagation
![](images/simple_rnn_fprop.png)

### Outline: Recurrent Neural Networks
- Introduction
- Simple RNN
  - Forward Propagation
  - **Backward Propagation**
- Long Short-Term Memory (LSTM)

In [3]:
resize("images/simple_rnn", 70)

### Backpropagation Through Time (BPTT)
- Assume the loss function is defined as:
$$ \mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t \left( \textbf{y}_t, \hat{\textbf{y}}_t \right) $$
- Gradient w.r.t. hidden units (given $\frac{\partial \mathcal{L}}{\partial \textbf{h}_{t+1}}$)
$$ \frac{\partial\mathcal{L}}{\partial \textbf{h}_t} = \sum_{\tau=t}^{T}\frac{\partial\mathcal{L}_{\tau}}{\partial \textbf{h}_t} \mbox { } (\because \frac{\partial\mathcal{L}_k}{\partial \textbf{h}_t}=0 \mbox { if } k < t)$$ 
$$ \begin{aligned} \frac{\partial\mathcal{L}}{\partial \textbf{h}_t} &= \frac{\partial \mathcal{L}_t}{\partial \textbf{h}_t} + \frac{\partial \textbf{h}_{t+1}}{\partial \textbf{h}_{t}} \frac{\partial \sum_{\tau=t+1}^{T}\mathcal{L}_{\tau}}{\partial \textbf{h}_{t+1}}  \\
&= \underbrace{\frac{\partial \mathcal{L}_t}{\partial \hat{\textbf{y}}_t}}_{\mbox{easy}}\underbrace{\frac{{\partial \hat{\textbf{y}}_t}}{\partial \textbf{h}_t}}_{\mbox{easy}} + \underbrace{\frac{\partial \textbf{h}_{t+1}}{\partial \textbf{h}_{t}}}_{\mbox{easy}} \underbrace{\frac{\partial \mathcal{L}}{\partial \textbf{h}_{t+1}}}_{\mbox{given}} \end{aligned} $$
![](images/simple_rnn_r70.png)
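
The recursion above can be sketched in NumPy and checked numerically. Several things here are assumptions made for illustration and do not come from the slides: a squared-error loss $\mathcal{L}_t = \frac{1}{2}\lVert \hat{\textbf{y}}_t - \textbf{y}_t \rVert^2$, `tanh` as $f$, the helper names `forward`, `loss`, and `hidden_grads`, and the random sizes. Gradients are stored as column vectors, so the Jacobian $\frac{\partial \textbf{h}_{t+1}}{\partial \textbf{h}_t}$ is applied in transposed form.

```python
import numpy as np

def forward(xs, h0, W, U, V, b):
    """Return hidden states (T, H) and outputs (T, O) of a tanh RNN."""
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        hs.append(h)
    hs = np.stack(hs)
    return hs, hs @ V.T

def loss(xs, ys, h0, W, U, V, b):
    """Assumed loss: sum over time of 0.5 * ||y_hat_t - y_t||^2."""
    _, y_hat = forward(xs, h0, W, U, V, b)
    return 0.5 * np.sum((y_hat - ys) ** 2)

def hidden_grads(xs, ys, h0, W, U, V, b):
    """dL/dh_t for every t, computed backward from t = T."""
    hs, y_hat = forward(xs, h0, W, U, V, b)
    T = len(xs)
    dhs = np.zeros_like(hs)
    for t in reversed(range(T)):
        # Local term: (dL_t/dy_hat_t)(dy_hat_t/dh_t) = V^T (y_hat_t - y_t)
        dhs[t] = V.T @ (y_hat[t] - ys[t])
        if t + 1 < T:
            # Future term: (dh_{t+1}/dh_t)^T dL/dh_{t+1}, with tanh' = 1 - h^2
            dhs[t] += U.T @ ((1.0 - hs[t + 1] ** 2) * dhs[t + 1])
    return dhs

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
T, D, H, O = 6, 3, 4, 2
W = rng.normal(scale=0.5, size=(H, D))
U = rng.normal(scale=0.5, size=(H, H))
V = rng.normal(scale=0.5, size=(O, H))
b = np.zeros(H)
xs, ys = rng.normal(size=(T, D)), rng.normal(size=(T, O))
h0 = 0.1 * rng.normal(size=H)

dhs = hidden_grads(xs, ys, h0, W, U, V, b)
hs, _ = forward(xs, h0, W, U, V, b)
# One more step of the same recursion gives the gradient w.r.t. the initial state.
dh0 = U.T @ ((1.0 - hs[0] ** 2) * dhs[0])

# Central finite differences on h0 as an independent check.
eps = 1e-5
numeric = np.zeros(H)
for k in range(H):
    e = np.zeros(H)
    e[k] = eps
    numeric[k] = (loss(xs, ys, h0 + e, W, U, V, b)
                  - loss(xs, ys, h0 - e, W, U, V, b)) / (2 * eps)
print(np.max(np.abs(dh0 - numeric)))
```

The finite-difference check perturbs the initial state because it is the only hidden vector that can be changed independently of the inputs.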

### Backpropagation Through Time (BPTT)
- Assume the loss function is defined as:
$$ \mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t \left( \textbf{y}_t, \hat{\textbf{y}}_t \right) $$
- Gradient w.r.t. input units (given $\frac{\partial \mathcal{L}}{\partial \textbf{h}_{t}}$)
$$ \frac{\partial\mathcal{L}}{\partial \textbf{x}_t} = \frac{\partial \mathcal{L}}{\partial \textbf{h}_t}\frac{\partial \textbf{h}_t}{\partial \textbf{x}_t} $$
![](images/simple_rnn.png)
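
For the simple RNN defined earlier, this chain rule can be written out explicitly. The pre-activation symbol $\textbf{a}_t$ is introduced here only for illustration (it does not appear in the original slides), and gradients are treated as column vectors, which is why a transpose appears:

$$ \textbf{a}_t = \textbf{W}\textbf{x}_t + \textbf{U}\textbf{h}_{t-1} + \textbf{b}, \qquad \frac{\partial\mathcal{L}}{\partial \textbf{x}_t} = \textbf{W}^{\top}\left( f'(\textbf{a}_t) \odot \frac{\partial \mathcal{L}}{\partial \textbf{h}_t} \right) $$

Here $\odot$ denotes element-wise multiplication and $f'$ is the derivative of the activation applied element-wise.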

### Backward Propagation
![](images/simple_rnn_back2.png)

### Backward Propagation
![](images/simple_rnn_back3.png)

### Backward Propagation
![](images/simple_rnn_back4.png)

### Backward Propagation
![](images/simple_rnn_back5.png)

### Backward Propagation
![](images/simple_rnn_back6.png)

### Backward Propagation
![](images/simple_rnn_back7.png)

In [5]:
resize("images/simple_rnn_back_w", 60)

### Backpropagation Through Time (BPTT)
- Gradient w.r.t. weights
  - Note: the weights are shared through time
  - Recall: we should accumulate gradients through time
<font color='red'>$$ \frac{\partial \mathcal{L}}{\partial \textbf{V}} = \sum_{t=1}^{T}\frac{\partial \mathcal{L}}{\partial \hat{\textbf{y}}_t}\frac{\partial \hat{\textbf{y}}_t}{\partial \textbf{V}} $$ </font>
<font color='blue'>$$ \frac{\partial \mathcal{L}}{\partial \textbf{W}} = \sum_{t=1}^{T}\frac{\partial \mathcal{L}}{\partial \textbf{h}_t}\frac{\partial \textbf{h}_t}{\partial \textbf{W}} $$ </font>
<font color='green'>$$ \frac{\partial \mathcal{L}}{\partial \textbf{U}} = \sum_{t=1}^{T}\frac{\partial \mathcal{L}}{\partial \textbf{h}_t}\frac{\partial \textbf{h}_t}{\partial \textbf{U}} $$ </font>
![](images/simple_rnn_back_w_r60.png)
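
The accumulation over time can be sketched as a single backward loop. As assumptions for illustration only (none of this is from the slides): the loss is squared error, $f$ is `tanh`, and the names `forward`, `loss`, and `bptt` and the random sizes are hypothetical. A finite-difference check on one entry of $\textbf{U}$ is included as a sanity test.

```python
import numpy as np

def forward(xs, h0, W, U, V, b):
    """Return hidden states (T, H) and outputs (T, O) of a tanh RNN."""
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        hs.append(h)
    hs = np.stack(hs)
    return hs, hs @ V.T

def loss(xs, ys, h0, W, U, V, b):
    """Assumed loss: sum over time of 0.5 * ||y_hat_t - y_t||^2."""
    _, y_hat = forward(xs, h0, W, U, V, b)
    return 0.5 * np.sum((y_hat - ys) ** 2)

def bptt(xs, ys, h0, W, U, V, b):
    """Accumulate dL/dW, dL/dU, dL/dV over all time steps."""
    hs, y_hat = forward(xs, h0, W, U, V, b)
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dh_future = np.zeros_like(h0)          # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dy = y_hat[t] - ys[t]              # dL_t/dy_hat_t for squared error
        dV += np.outer(dy, hs[t])          # shared V: sum contributions over t
        dh = V.T @ dy + dh_future          # total dL/dh_t
        da = (1.0 - hs[t] ** 2) * dh       # through tanh: gradient at pre-activation
        h_prev = hs[t - 1] if t > 0 else h0
        dW += np.outer(da, xs[t])          # shared W: sum contributions over t
        dU += np.outer(da, h_prev)         # shared U: sum contributions over t
        dh_future = U.T @ da               # pass gradient back to step t-1
    return dW, dU, dV

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(1)
T, D, H, O = 6, 3, 4, 2
W = rng.normal(scale=0.5, size=(H, D))
U = rng.normal(scale=0.5, size=(H, H))
V = rng.normal(scale=0.5, size=(O, H))
b = np.zeros(H)
xs, ys = rng.normal(size=(T, D)), rng.normal(size=(T, O))
h0 = np.zeros(H)

dW, dU, dV = bptt(xs, ys, h0, W, U, V, b)

# Central finite difference on a single entry of U.
eps = 1e-5
U_plus, U_minus = U.copy(), U.copy()
U_plus[0, 1] += eps
U_minus[0, 1] -= eps
numeric_dU01 = (loss(xs, ys, h0, W, U_plus, V, b)
                - loss(xs, ys, h0, W, U_minus, V, b)) / (2 * eps)
print(dU[0, 1], numeric_dU01)
```

The `+=` updates are where weight sharing shows up in code: each time step contributes one term to the same gradient matrix.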

### Backpropagation Through Time (BPTT)
- BPTT is ordinary backpropagation applied to the network unrolled through time.
- An RNN is not much different from a standard (feedforward) neural network, except that:
  - Input/output are given through time.
  - Weights are extensively shared.

### Outline: Recurrent Neural Networks
- Introduction
- Simple RNN
  - Forward Propagation
  - Backward Propagation
- **Long Short-Term Memory (LSTM)**

### Vanishing Gradient Problem
- An RNN can model arbitrary sequences if properly trained.
- In practice, it is difficult to train one to capture long-term dependencies because of vanishing gradients.
- Intuitively, a hidden unit has little effect on units far in the future, because new inputs keep overwriting the hidden state.
  - Gradients are diffused through time
![](images/vanish_rnn.png)
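
A small numerical sketch can make the effect visible. It tracks the Jacobian $\frac{\partial \textbf{h}_t}{\partial \textbf{h}_0}$ of a `tanh` RNN and prints its spectral norm as $t$ grows. The setup is an assumption chosen for illustration, not something from the slides: the recurrent matrix is rescaled to spectral norm 0.9, and a random vector stands in for $\textbf{W}\textbf{x}_t + \textbf{b}$.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 8, 50

# Recurrent weights rescaled so that the largest singular value is 0.9 (assumption).
U = rng.normal(size=(hidden_dim, hidden_dim))
U *= 0.9 / np.linalg.norm(U, 2)

h = np.zeros(hidden_dim)
jac = np.eye(hidden_dim)      # accumulates d h_t / d h_0
norms = []
for t in range(T):
    drive = rng.normal(size=hidden_dim)        # stands in for W x_t + b
    h = np.tanh(U @ h + drive)
    step_jac = np.diag(1.0 - h ** 2) @ U       # d h_t / d h_{t-1} for tanh
    jac = step_jac @ jac
    norms.append(np.linalg.norm(jac, 2))

for t in (1, 10, 25, 50):
    print(f"t = {t:2d}   ||d h_t / d h_0||_2 = {norms[t - 1]:.3e}")
```

Since the `tanh` derivative is at most 1, each step multiplies the norm by at most 0.9 in this setup, so the influence of $\textbf{h}_0$ on $\textbf{h}_t$ shrinks at least geometrically. With a recurrent matrix whose spectral norm exceeds 1 this bound no longer applies.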

### Long Short-Term Memory (LSTM)
- A special type of RNN for handling the **vanishing gradient** problem.
- $c_t$ is a **memory cell** that preserves information about the history of inputs.
- $h_t$ is the hidden activation, which is given to the output layer.
- $i_t,o_t,f_t$ are the **input gate**, **output gate**, and **forget gate**, respectively.
![](images/lstm.png)

### Long Short-Term Memory (LSTM)
- The gating mechanism controls the following:
  - whether to ignore a new input or not
  - whether to produce an output or not (while preserving the memory cell)
  - whether to erase the memory cell or not
- Gating is controlled by LSTM's weights that are also learned from data.
![](images/vanish_lstm.png)
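
The gating behaviour can be sketched with one commonly used LSTM formulation. The exact equations on the slide are in the figure, so the update below ($c_t = f_t \odot c_{t-1} + i_t \odot g_t$, $h_t = o_t \odot \tanh(c_t)$, with sigmoid gates) is an assumption, as are the names `lstm_step`, `Wx`, `Wh`, the candidate `g`, and the hand-set biases used in the demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step; Wx, Wh, b stack the parameters of i, f, o, and g."""
    H = h_prev.shape[0]
    z = Wx @ x + Wh @ h_prev + b       # shape (4H,)
    i = sigmoid(z[0:H])                # input gate: admit the new input?
    f = sigmoid(z[H:2 * H])            # forget gate: keep or erase the cell?
    o = sigmoid(z[2 * H:3 * H])        # output gate: expose the cell?
    g = np.tanh(z[3 * H:4 * H])        # candidate content from the new input
    c = f * c_prev + i * g             # memory cell update
    h = o * np.tanh(c)                 # hidden activation passed to the output layer
    return h, c

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
D, H, T = 3, 3, 20
Wx = rng.normal(scale=0.1, size=(4 * H, D))
Wh = rng.normal(scale=0.1, size=(4 * H, H))

# Biases set by hand so the gates say: ignore inputs, never forget.
b = np.zeros(4 * H)
b[0:H] = -20.0        # input gate almost closed
b[H:2 * H] = 20.0     # forget gate almost fully open (keep memory)

c0 = np.array([1.0, -2.0, 0.5])
h, c = np.zeros(H), c0.copy()
for _ in range(T):
    h, c = lstm_step(rng.normal(size=D), h, c, Wx, Wh, b)
print(c0, c)
```

With these gate settings the memory cell is carried through all 20 steps almost unchanged, even though a new random input arrives at every step. In a trained LSTM the gates are not set by hand but are produced by learned weights.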

### Long Short-Term Memory (LSTM)
- LSTM has been successfully applied to many sequence modeling problems.