%matplotlib inline

from matplotlib import pyplot as plt
import numpy as np
import importlib  # replaces the deprecated 'imp' module (removed in Python 3.12)
from IPython.display import YouTubeVideo
from IPython.display import HTML

### Outline

- Motivation
- Basics of Neural Networks
  - Forward Propagation
  - Backward Propagation
- Deep Neural Networks
  - Convolutional Neural Networks
  - **Recurrent Neural Networks** 
- Applications
  - Computer Vision
  - Natural Language Processing
  - Reinforcement Learning

### Outline: Recurrent Neural Networks
- Introduction
- Standard RNN
  - Forward Propagation
  - Backward Propagation
- Long Short-Term Memory (LSTM)
- Example: Character-level language modeling

### Recurrent Neural Networks (RNN)
- A special kind of neural network designed for modeling sequential data
  - Can take an arbitrary number of inputs
  - Can produce an arbitrary number of outputs
- Examples of sequential problems
  - Machine translation
  - Speech recognition
  - Image caption generation

### Outline: Recurrent Neural Networks
- Introduction
- **Standard RNN**
  - **Forward Propagation**
  - Backward Propagation
- Long Short-Term Memory (LSTM)
- Example: Character-level language modeling

### Forward Propagation
<img src="images/simple_rnn_fprop.png" width=700px />

### Forward Propagation
$$ \textbf{h}_t = f(\textbf{W}\textbf{x}_t + \textbf{U}\textbf{h}_{t-1}+\textbf{b}) $$
$$ \hat{\textbf{y}}_t = \textbf{V}\textbf{h}_t + \textbf{b}' $$
$$ \mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t \left( \textbf{y}_t, \hat{\textbf{y}}_t \right) $$
- $\textbf{W}$: input weight, $\textbf{U}$: recurrent weight, $\textbf{V}$: output weight, $\textbf{b},\textbf{b}'$: bias, $f$: non-linear activation (e.g., ReLU)
- Weights are shared across time: the number of parameters does not depend on the length of the input/output sequence
<img src="images/simple_rnn.png" width=500px />
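
The recurrence above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the dimensions, the random initialization, and the choice of `tanh` for $f$ are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W, U, V, b, b_out, h0, f=np.tanh):
    """Run the recurrence over a sequence xs of shape (T, input_dim)."""
    h = h0
    hs, y_hats = [], []
    for x_t in xs:                       # the same W, U, V are reused at every step
        h = f(W @ x_t + U @ h + b)       # h_t = f(W x_t + U h_{t-1} + b)
        y_hats.append(V @ h + b_out)     # y_hat_t = V h_t + b'
        hs.append(h)
    return np.array(hs), np.array(y_hats)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2   # illustrative sizes
W = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
U = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
V = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b, b_out = np.zeros(hidden_dim), np.zeros(output_dim)
h0 = np.zeros(hidden_dim)

# The same parameters handle sequences of different lengths.
hs_short, ys_short = rnn_forward(rng.normal(size=(5, input_dim)), W, U, V, b, b_out, h0)
hs_long, ys_long = rnn_forward(rng.normal(size=(50, input_dim)), W, U, V, b, b_out, h0)
print(hs_short.shape, ys_short.shape)
print(hs_long.shape, ys_long.shape)
```

Because the loop reuses one set of weights, the parameter count stays fixed whether the sequence has 5 steps or 50.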

### Outline: Recurrent Neural Networks
- Introduction
- Standard RNN
  - Forward Propagation
  - **Backward Propagation**
- Long Short-Term Memory (LSTM)
- Example: Character-level language modeling

### Backpropagation Through Time (BPTT)
- Gradient w.r.t. hidden units (assuming that $\frac{\partial \mathcal{L}}{\partial \textbf{h}_{t+1}}$ is given)
$$ \frac{\partial\mathcal{L}}{\partial \textbf{h}_t} = \sum_{\tau=t}^{T}\frac{\partial\mathcal{L}_{\tau}}{\partial \textbf{h}_t} \mbox { } (\because \frac{\partial\mathcal{L}_k}{\partial \textbf{h}_t}=0 \mbox { if } k < t)$$ 
$$ \begin{aligned}
\frac{\partial\mathcal{L}}{\partial \textbf{h}_t} &= \frac{\partial \mathcal{L}_t}{\partial \textbf{h}_t} + \frac{\partial \textbf{h}_{t+1}}{\partial \textbf{h}_{t}} \frac{\partial \sum_{\tau=t+1}^{T}\mathcal{L}_{\tau}}{\partial \textbf{h}_{t+1}} \\
&= \underbrace{\frac{\partial \mathcal{L}_t}{\partial \hat{\textbf{y}}_t}}_{\mbox{easy}}\underbrace{\frac{{\partial \hat{\textbf{y}}_t}}{\partial \textbf{h}_t}}_{\mbox{easy}} + \underbrace{\frac{\partial \textbf{h}_{t+1}}{\partial \textbf{h}_{t}}}_{\mbox{easy}} \underbrace{\frac{\partial \mathcal{L}}{\partial \textbf{h}_{t+1}}}_{\mbox{given}}
\end{aligned} $$
<img src="images/simple_rnn_fprop.png" width=500px />
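
For the forward equations given earlier, the "easy" factors can be written out explicitly. The following is a sketch under stated assumptions: gradients are treated as column vectors, $\textbf{a}_{t+1}$ is a symbol introduced here for the pre-activation, and $\odot$ denotes the element-wise product.

$$ \textbf{a}_{t+1} = \textbf{W}\textbf{x}_{t+1} + \textbf{U}\textbf{h}_{t} + \textbf{b}, \qquad \frac{\partial\mathcal{L}}{\partial \textbf{h}_t} = \textbf{V}^{\top}\frac{\partial \mathcal{L}_t}{\partial \hat{\textbf{y}}_t} + \textbf{U}^{\top}\left( f'(\textbf{a}_{t+1}) \odot \frac{\partial \mathcal{L}}{\partial \textbf{h}_{t+1}} \right) $$

At the last step ($t = T$) there is no future term, so only the first summand remains.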

### Backpropagation Through Time (BPTT)
- Gradient w.r.t. input units (given $\frac{\partial \mathcal{L}}{\partial \textbf{h}_{t}}$)
$$ \frac{\partial\mathcal{L}}{\partial \textbf{x}_t} = \frac{\partial \mathcal{L}}{\partial \textbf{h}_t}\frac{\partial \textbf{h}_t}{\partial \textbf{x}_t} $$
<img src="images/simple_rnn_fprop.png" width=500px />

### Backward Propagation
<img src="images/simple_rnn_back2.png" width=700px align="middle" />

### Backward Propagation
<img src="images/simple_rnn_back3.png" width=700px />

### Backward Propagation
<img src="images/simple_rnn_back4.png" width=700px />

### Backward Propagation
<img src="images/simple_rnn_back5.png" width=700px />

### Backward Propagation
<img src="images/simple_rnn_back6.png" width=700px />

### Backward Propagation
<img src="images/simple_rnn_back7.png" width=700px />

### Backpropagation Through Time (BPTT)
- Gradient w.r.t. weights
  - Recall: The weights are shared through time. Gradients of shared weights should be accumulated!
  
<font color='red'>$ \frac{\partial \mathcal{L}}{\partial \textbf{V}} = \sum_{t=1}^{T}\frac{\partial \mathcal{L}}{\partial \hat{\textbf{y}}_t}\frac{\partial \hat{\textbf{y}}_t}{\partial \textbf{V}} $ </font> <font color='blue'>$ \frac{\partial \mathcal{L}}{\partial \textbf{W}} = \sum_{t=1}^{T}\frac{\partial \mathcal{L}}{\partial \textbf{h}_t}\frac{\partial \textbf{h}_t}{\partial \textbf{W}} $ </font> <font color='green'>$ \frac{\partial \mathcal{L}}{\partial \textbf{U}} = \sum_{t=1}^{T-1}\frac{\partial \mathcal{L}}{\partial \textbf{h}_{t+1}}\frac{\partial \textbf{h}_{t+1}}{\partial \textbf{U}} $ </font>
<img src="images/simple_rnn_back_w.png" width=600px/>
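
The accumulation over time can be sketched end to end in NumPy. This is an illustrative sketch, not the lecture's implementation: it assumes `tanh` hidden units, a squared-error loss for each $\mathcal{L}_t$, an initial state $\textbf{h}_0 = \textbf{0}$, and it omits the bias gradients for brevity. The names `forward` and `bptt` are made up for the example. A finite-difference check on one entry of $\textbf{U}$ is included as a sanity test.

```python
import numpy as np

def forward(xs, ys, W, U, V, b, b_out):
    """Forward pass with tanh units and squared-error loss (both assumptions)."""
    T, H = len(xs), U.shape[0]
    hs = np.zeros((T + 1, H))              # hs[0] is the initial state h_0 = 0
    y_hats = np.zeros((T, V.shape[0]))
    for t in range(T):
        hs[t + 1] = np.tanh(W @ xs[t] + U @ hs[t] + b)
        y_hats[t] = V @ hs[t + 1] + b_out
    loss = 0.5 * np.sum((y_hats - ys) ** 2)
    return loss, hs, y_hats

def bptt(xs, ys, W, U, V, b, b_out):
    """Backward pass: accumulate gradients of the shared weights over all steps."""
    loss, hs, y_hats = forward(xs, ys, W, U, V, b, b_out)
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dxs = np.zeros_like(xs)
    dh_future = np.zeros(U.shape[0])       # gradient arriving from the next step
    for t in reversed(range(len(xs))):
        dy = y_hats[t] - ys[t]             # dL_t / dy_hat_t for squared error
        dV += np.outer(dy, hs[t + 1])
        dh = V.T @ dy + dh_future          # dL / dh at this step
        da = (1.0 - hs[t + 1] ** 2) * dh   # back through tanh
        dW += np.outer(da, xs[t])
        dU += np.outer(da, hs[t])          # zero contribution at the first step (h_0 = 0)
        dxs[t] = W.T @ da                  # gradient w.r.t. the input at this step
        dh_future = U.T @ da               # passed on to the previous step
    return loss, dW, dU, dV, dxs

rng = np.random.default_rng(0)
T, D, H, K = 6, 3, 4, 2                    # illustrative sizes
xs, ys = rng.normal(size=(T, D)), rng.normal(size=(T, K))
W = rng.normal(scale=0.5, size=(H, D))
U = rng.normal(scale=0.5, size=(H, H))
V = rng.normal(scale=0.5, size=(K, H))
b, b_out = np.zeros(H), np.zeros(K)

loss, dW, dU, dV, dxs = bptt(xs, ys, W, U, V, b, b_out)

# Finite-difference check on one entry of U.
eps = 1e-5
U_plus, U_minus = U.copy(), U.copy()
U_plus[0, 1] += eps
U_minus[0, 1] -= eps
numeric = (forward(xs, ys, W, U_plus, V, b, b_out)[0]
           - forward(xs, ys, W, U_minus, V, b, b_out)[0]) / (2 * eps)
print(abs(numeric - dU[0, 1]))
```

The `+=` on `dW`, `dU`, and `dV` inside the loop is the accumulation of shared-weight gradients described above.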

### Summary of Standard Recurrent Neural Network
- RNN is actually not much different from a standard (feedforward) neural network except that:
  - Input/output are given through time.
  - Weights are extensively shared.
- RNN can be viewed as a very deep feedforward neural network with shared weights.

### Outline: Recurrent Neural Networks
- Introduction
- Standard RNN
  - Forward Propagation
  - Backward Propagation
- **Long Short-Term Memory (LSTM)**
- Example: Character-level language modeling

### Vanishing Gradient Problem
- RNN can model arbitrary sequences if properly trained.
- In practice, it is difficult to train an RNN to learn long-term dependencies because of the vanishing gradient problem.
- Intuition of vanishing gradient
  - A hidden unit activation is not well-preserved to the long-term future (forward propagation view)
  - Gradients are diffused through time (backward propagation view)
![](images/vanish_rnn.png)
<span style="color:gray; font-size:10px; float:right">(Figure from Alex Graves)</span>
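
The backward-propagation view can be illustrated numerically. The sketch below is a toy setup, not a result from the lecture: it assumes `tanh` units, replaces the input term $\textbf{W}\textbf{x}_t$ with a random vector, and deliberately rescales $\textbf{U}$ so that its largest singular value is 0.5. Under those assumptions each backward step shrinks the gradient norm by at least half.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 8, 30                      # illustrative sizes

U = rng.normal(size=(hidden_dim, hidden_dim))
U *= 0.5 / np.linalg.norm(U, 2)            # force the largest singular value to 0.5

# Forward pass: h_t = tanh(input_t + U h_{t-1}), with random stand-in inputs.
hs = [np.zeros(hidden_dim)]
for t in range(T):
    hs.append(np.tanh(rng.normal(size=hidden_dim) + U @ hs[-1]))

# Backward pass: start from a gradient at the last step and push it back in time.
g = np.ones(hidden_dim)                    # stand-in for dL/dh_T
norms = [np.linalg.norm(g)]
for t in range(T, 1, -1):
    g = U.T @ ((1.0 - hs[t] ** 2) * g)     # now the gradient w.r.t. h_{t-1}
    norms.append(np.linalg.norm(g))

print(norms[0], norms[-1])
```

With a recurrent matrix whose singular values exceed 1 the opposite problem (exploding gradients) can appear instead; that case is not shown here.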

### Long Short-Term Memory (LSTM)
- A special type of RNN that handles the vanishing gradient problem better.
- $i_t,o_t,f_t$: **input gate**, **output gate**, and **forget gate**
- $c_t$: **memory cell** containing information about history of inputs
- $h_t$: output activation
<img src="images/lstm.png" width=700px />
<span style="color:gray; font-size:10px; float:right">(Figure from Alex Graves)</span>

### Long Short-Term Memory (LSTM)
- Gating mechanism
  - Input gate: whether to ignore a new input or not
  - Output gate: whether to produce an output or not (while preserving the memory cell)
  - Forget gate: whether to erase the memory cell or not
- Gating is controlled by LSTM's weights that are also learned from data.
![](images/vanish_lstm.png)
<span style="color:gray; font-size:10px; float:right">(Figure from Alex Graves)</span>
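
One step of the gating mechanism can be sketched as follows. The slides do not give the LSTM equations, so this uses a common textbook formulation without peephole connections as an assumption; the stacked weight layout and the name `lstm_step` are illustrative only. The demo sets the biases by hand so that the input gate stays nearly closed and the forget gate nearly open, which shows the memory cell being preserved over many steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input, forget, output and candidate blocks."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i_t = sigmoid(z[0:H])                  # input gate: admit the new input or not
    f_t = sigmoid(z[H:2 * H])              # forget gate: keep or erase the memory cell
    o_t = sigmoid(z[2 * H:3 * H])          # output gate: expose the cell or not
    g_t = np.tanh(z[3 * H:4 * H])          # candidate values computed from the input
    c_t = f_t * c_prev + i_t * g_t         # memory cell update
    h_t = o_t * np.tanh(c_t)               # output activation
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 4                                # illustrative sizes
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
b[0:H] = -20.0                             # hand-set: input gate nearly closed
b[H:2 * H] = 20.0                          # hand-set: forget gate nearly open

h, c = np.zeros(H), np.ones(H)
c_start = c.copy()
for _ in range(100):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(np.abs(c - c_start).max())
```

In a trained network the gate values come from learned weights rather than hand-set biases; the fixed biases here only make the effect easy to see.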

### Outline: Recurrent Neural Networks
- Introduction
- Standard RNN
  - Forward Propagation
  - Backward Propagation
- Long Short-Term Memory (LSTM)
- **Example: Character-level language modeling**

### Character-level language modeling
- Goal: build a character-level generative model
- A character ($x_t$) is represented as a one-of-k vector (k: #characters).
$$ P(x_1,x_2,...,x_T) = \prod_{t=1}^{T}P(x_{t}|x_{t-1},...,x_1) \approx \prod_{t=1}^{T}P(x_{t}|x_{t-1},...,x_{t-n}) $$
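
The one-of-k representation can be illustrated with a short NumPy sketch. The toy string and the names `vocab`, `char_to_index`, and `one_hot` are assumptions made for the example.

```python
import numpy as np

text = "hello"                                   # toy corpus
vocab = sorted(set(text))                        # the k distinct characters
char_to_index = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a length-k vector with a single 1 at the character's index."""
    v = np.zeros(len(vocab))
    v[char_to_index[ch]] = 1.0
    return v

X = np.stack([one_hot(ch) for ch in text])       # shape (T, k)
print(vocab)
print(X)
```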

### Character-level language modeling
- The RNN is trained to predict the next character given the previous characters.
- Maximizing the likelihood can be formulated as minimizing the sum of cross-entropy losses.
$$ \mathcal{L}(\textbf{x}) = -\sum_{t=1}^{T}\log P(x_{t}|x_{t-1},...,x_{t-n}; \theta) $$
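
The loss above can be computed from the network's output scores as a sum of per-step cross-entropy terms. This is a minimal sketch; the function names and the toy inputs are assumptions, and the scores stand in for the output layer of a real model.

```python
import numpy as np

def log_softmax(scores):
    """Numerically stable log-softmax over the last axis of a (T, k) array."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def sequence_nll(scores, targets):
    """Negative log-likelihood of the true next characters, summed over time.

    scores:  (T, k) unnormalized outputs, one row per time step
    targets: (T,) integer indices of the true next characters
    """
    log_probs = log_softmax(scores)
    return -log_probs[np.arange(len(targets)), targets].sum()

T, k = 4, 3
uniform_scores = np.zeros((T, k))                # a model that predicts uniformly
targets = np.array([0, 2, 1, 1])
loss = sequence_nll(uniform_scores, targets)
print(loss, T * np.log(k))
```

A uniform predictor pays $\log k$ per character, so the two printed numbers should agree.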

### Character-level language modeling
- After training, the network can "sample" characters from the multinomial distribution of characters at the output layer (softmax).
<img src="images/char_rnn.png" width=800px />
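
Sampling from the softmax output can be sketched as below. The scores are hand-picked stand-ins for a trained network's output layer, and `sample_next` is a hypothetical helper name.

```python
import numpy as np

def sample_next(scores, rng):
    """Draw one character index from the softmax distribution over scores."""
    p = np.exp(scores - scores.max())            # stable softmax
    p /= p.sum()
    return rng.choice(len(p), p=p)

rng = np.random.default_rng(0)
scores = np.log(np.array([0.7, 0.2, 0.1]))       # softmax of these is (0.7, 0.2, 0.1)
draws = np.array([sample_next(scores, rng) for _ in range(10000)])
freq = np.bincount(draws, minlength=3) / len(draws)
print(freq)
```

To generate text, the sampled character would be fed back in as the next input and the process repeated.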

### Shakespeare
<img src="images/char_rnn2.png" width=700px />
<span style="color:gray; font-size:10px; float:right">(Figure from Richard Socher)</span>

### Wikipedia
<img src="images/char_rnn3.png" width=900px />
<span style="color:gray; font-size:10px; float:right">(Figure from Richard Socher)</span>

### LaTeX
<img src="images/char_rnn4.png" width=900px />
<span style="color:gray; font-size:10px; float:right">(Figure from Richard Socher)</span>

### C++ Code
<img src="images/char_rnn5.png" width=900px />
<span style="color:gray; font-size:10px; float:right">(Figure from Richard Socher)</span>

### Outline

- Motivation
- Basics of Neural Networks
  - Forward Propagation
  - Backward Propagation
- Deep Neural Networks
  - Convolutional Neural Networks
  - Recurrent Neural Networks
- Applications
  - **Computer Vision** 
  - Natural Language Processing
  - Reinforcement Learning

### Object Detection
- Goal: find bounding boxes that contain objects in an image and predict their classes.
- CNN-based approaches have recently achieved state-of-the-art results on object detection.
- Example: Regions with CNN (R-CNN)
  - Use a low-level region proposal method to generate many candidate bounding boxes
  - Use a pre-trained CNN to classify each region
<img src="images/object_detection.png" width=1000px />
<span style="color:gray; font-size:10px; float:right">(Girshick et al, "Region-based Convolutional Networks for Accurate Object Detection and Semantic Segmentation", PAMI, 2015.)</span>

### Object Detection
<img src="images/object_detection2.png" width=800px />
<span style="color:gray; font-size:10px; float:right">(Figure from Ross Girshick)</span>

### Object Segmentation
- Goal: Segment object regions and predict class labels for each region
- Can be formulated as pixel-wise classification 
- CNN is pre-trained on a large-scale classification dataset (ImageNet)
<img src="images/object_segmentation.png" width=800px />
<span style="color:gray; font-size:10px; float:right">(Long et al, "Fully Convolutional Networks for Semantic Segmentation", CVPR, 2015.)</span>
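
Pixel-wise classification amounts to choosing a class independently at every pixel. In the minimal sketch below, random scores stand in for the output of a segmentation network; the array layout `(num_classes, height, width)` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, height, width = 3, 4, 5             # illustrative sizes
scores = rng.normal(size=(num_classes, height, width))   # stand-in for network output

label_map = scores.argmax(axis=0)                # one class label per pixel
print(label_map.shape)
print(label_map)
```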

### Object Segmentation
<img src="images/object_segmentation3.png" width=650px />
<span style="color:gray; font-size:10px; float:right">(Noh et al, "Learning Deconvolution Network for Semantic Segmentation", ICCV, 2015.)</span>

### Neural Network as Generative Model
- Goal
$$\max_{\theta} \mathbb{E}_{\textbf{x}} \left[\log P\left(\textbf{x} ; \theta \right)\right]$$
- Outcome
$$\textbf{x} \sim P\left(\textbf{x} ; \theta \right)$$

<span style="width:1000px"> </span>

### Image Generation: DRAW Network (Gregor et al.)
- Introduce a differentiable visual attention mechanism
- The network is trained to reconstruct a given image through multiple time steps. 
- At each time-step, the network is forced to read/write only a part of the image (using attention windows).
- An intermediate hidden layer ($z_t$) is encouraged to follow a Gaussian distribution.
- After training, we sample $z_t$ from a Gaussian distribution to generate an image.
<img src="images/draw.png" width=600px />
<span style="color:gray; font-size:10px; float:right">(Gregor et al, "DRAW: A Recurrent Neural Network For Image Generation", ICML, 2015.)</span>

### Image Generation: Generative Adversarial Network (Goodfellow et al.)
- Model : Generator (G), Discriminator (D)
- Generator (G) : Learns to generate realistic images
- Discriminator (D) : Learns to classify whether a given image is real (from data) or not (from model)
- Objective : Fool each other!
<img src="images/gan.png" width=1000px />
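
The "fool each other" objective can be written as the minimax game from Goodfellow et al. The symbols $p_{\text{data}}$ (data distribution) and $p_{\textbf{z}}$ (noise prior fed to the generator) are not defined on the slide and are introduced here for the formula.

$$ \min_{G} \max_{D} \; \mathbb{E}_{\textbf{x} \sim p_{\text{data}}} \left[ \log D(\textbf{x}) \right] + \mathbb{E}_{\textbf{z} \sim p_{\textbf{z}}} \left[ \log \left( 1 - D(G(\textbf{z})) \right) \right] $$

The discriminator pushes $D(\textbf{x})$ toward 1 on real data and $D(G(\textbf{z}))$ toward 0 on generated samples, while the generator pushes $D(G(\textbf{z}))$ toward 1.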

### Image Generation: Generative Adversarial Network (Radford et al.)
<img src="images/gan2.png" width=1000px />
<span style="color:gray; font-size:10px; float:right">(Radford et al, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", ICLR, 2016.)</span>

### Image Generation: Generative Adversarial Network (Radford et al.)
<img src="images/gan3.png" width=1000px />
<span style="color:gray; font-size:10px; float:right">(Radford et al, "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks", ICLR, 2016.)</span>

### Video Prediction (Oh et al.)
- Goal: Predict future frames given previous frames and actions in Atari games.
<img src="images/video_prediction.png" />
<span style="color:gray; font-size:10px; float:right">(Oh et al, "Action-Conditional Video Prediction using Deep Networks in Atari Games", NIPS, 2015.)</span>

### Outline

- Motivation
- Basics of Neural Networks
  - Forward Propagation
  - Backward Propagation
- Deep Neural Networks
  - Convolutional Neural Networks
  - Recurrent Neural Networks
- Applications
  - Computer Vision
  - **Natural Language Processing**
  - Reinforcement Learning

### Sequence-to-Sequence Learning Framework (Sutskever et al.)
- A general RNN framework for sequence-to-sequence prediction
- ex) $x$: English sentence, $y$: French sentence in machine translation
$$ P \left(y_1, y_2, ... , y_{T'} \vert x_1, x_2, ..., x_{T} \right)$$
<img src="images/seq_to_seq.png" />
<span style="color:gray; font-size:10px; float:right">(Sutskever et al, "Sequence to Sequence Learning with Neural Networks", NIPS, 2014.)</span>
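
In Sutskever et al.'s formulation, this conditional probability is factorized over the output sequence. Here $\textbf{v}$ denotes the fixed-dimensional representation of the input sequence produced by the encoder (a symbol not used on the slide).

$$ P \left(y_1, ..., y_{T'} \vert x_1, ..., x_{T} \right) = \prod_{t=1}^{T'} P \left(y_t \vert \textbf{v}, y_1, ..., y_{t-1} \right) $$

Each factor is a softmax over the output vocabulary, which is the same structure as the character-level language model but conditioned on $\textbf{v}$.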

### Seq2Seq: Application to Machine Translation (Sutskever et al.)
- Achieves state-of-the-art results on an English-to-French translation dataset.
<img src="images/seq_to_seq2.png" width=800px />
<span style="color:gray; font-size:10px; float:right">(Sutskever et al, "Sequence to Sequence Learning with Neural Networks", NIPS, 2014.)</span>

### Seq2Seq: Application to Grammar Parsing (Vinyals et al.)
- A parsing tree can be represented as a sequence.
- Seq2Seq framework can be applied (sentence $\rightarrow$ parse tree)
<img src="images/grammar.png" width=800px />
<span style="color:gray; font-size:10px; float:right">(Vinyals et al, "Grammar as a Foreign Language", NIPS, 2015.)</span>

### Seq2Seq: Application to Grammar Parsing (Vinyals et al.)

<img src="images/grammar2.png" width=800px />
<span style="color:gray; font-size:10px; float:right">(Vinyals et al, "Grammar as a Foreign Language", NIPS, 2015.)</span>

### Seq2Seq: Application to Program Execution (Zaremba et al.)
- Input: source code (characters)
- Output: execution result (characters)
<img src="images/python.png" />
<span style="color:gray; font-size:10px; float:right">(Zaremba et al., "Learning to Execute", ICLR, 2015.)</span>

### Seq2Seq: Application to Program Execution (Zaremba et al.)
- The network learns to execute simple programs without using any compiler or interpreter.
<img src="images/python2.png" />
<span style="color:gray; font-size:10px; float:right">(Zaremba et al, "Learning to Execute", ICLR, 2015.)</span>

### Image Caption Generation (Vinyals et al.)
- Goal: Generate a text that describes a given image
- Idea proposed by Vinyals et al.
  - Use a pre-trained CNN to extract image features
  - RNN part is a typical language model, but it is conditioned on the image feature.
<img src="images/image_caption.png" />
<span style="color:gray; font-size:10px; float:right">(Vinyals et al., "Show and Tell: A Neural Image Caption Generator", CVPR, 2015.)</span>

### Image Caption Generation (Vinyals et al.)
<img src="images/image_caption2.png" width=1000px />
<span style="color:gray; font-size:10px; float:right">(Vinyals et al, "Show and Tell: A Neural Image Caption Generator", CVPR, 2015.)</span>

### Image Caption Generation (Xu et al.)
- Another idea proposed by Xu et al. 
  - The model is forced to pay attention to only a part of the image when generating a word at a time.
<img src="images/image_caption3.png" width=700px />
<img src="images/image_caption4.png" width=700px />
<span style="color:gray; font-size:10px; float:right">(Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015.)</span>

### Image Caption Generation (Xu et al.)

<img src="images/image_caption5.png" width=900px />
<span style="color:gray; font-size:10px; float:right">(Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015.)</span>

### Image Caption Generation (Kiros et al.)
- Some examples from Kiros et al.
<img src="images/image_caption6.png" width=800px />
<span style="color:gray; font-size:10px; float:right">(Figures from Ruslan Salakhutdinov)</span>

### Outline

- Motivation
- Basics of Neural Networks
  - Forward Propagation
  - Backward Propagation
- Deep Neural Networks
  - Convolutional Neural Networks
  - Recurrent Neural Networks
- Applications
  - Computer Vision
  - Natural Language Processing
  - **Reinforcement Learning**

### Reinforcement Learning
- An agent observes a state $s_t$, chooses an action $a_t$, receives a reward $r_t$, and goes to the next state $s_{t+1}$.
- The goal is to learn an optimal policy that maximizes the total reward until the episode terminates (episodic task).
<img src="images/rl.png" width=600px />
<span style="color:gray; font-size:10px; float:right">(Figure from "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto)</span>

### Deep Q-Network (Mnih et al.)
- A recent breakthrough from Google DeepMind
- Combines Q-Learning with deep neural networks
<img src="images/dqn.png" width=300px />
<span style="color:gray; font-size:10px; float:right">(Mnih et al. "Human-level control through deep reinforcement learning", Nature, 2015.)</span>

### Brief Summary of Q-Learning
- $Q(s,a)$: expected cumulative future reward when choosing action $a$ at state $s$.
- The agent learns to estimate $Q(s,a)$ based on the Bellman equation.
- During evaluation, it chooses $\mbox{argmax}_a Q(s,a)$ (greedy policy).
- Problem: need to approximate $Q(s,a)$ when the state space is very large.
<img src="images/rl.png" width=600px />
<span style="color:gray; font-size:10px; float:right">(Figure from "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto)</span>
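
A tabular version of Q-learning can be sketched on a toy problem. Everything here is an illustrative assumption: a hypothetical 5-state chain where the agent starts at state 0 and receives reward 1 on reaching the last state, a uniformly random behaviour policy, and the step size and discount values. The update moves $Q(s,a)$ toward the Bellman target $r + \gamma \max_{a'} Q(s', a')$.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.5, gamma=0.9):
    """One Q-learning update toward the Bellman target."""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

n_states, n_actions = 5, 2                 # hypothetical chain; action 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:               # the last state is terminal
        a = rng.integers(n_actions)        # uniformly random exploration
        s_next = max(s - 1, 0) if a == 0 else s + 1
        done = s_next == n_states - 1
        r = 1.0 if done else 0.0
        q_update(Q, s, a, r, s_next, done)
        s = s_next

greedy_policy = Q.argmax(axis=1)           # argmax_a Q(s, a) for every state
print(Q.round(3))
print(greedy_policy[:n_states - 1])
```

A table like this stops being practical when the state space is very large, which is the problem noted in the last bullet above.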

### Deep Q-Network (Mnih et al.)
- Key idea: Use a CNN to approximate Q-values
- Result
    - Outperforms all existing (model-free) controllers
    - Achieves human-level performance on many Atari 2600 games
<img src="images/dqn2.png" width=700px />
<span style="color:gray; font-size:10px; float:right">(Mnih et al. "Human-level control through deep reinforcement learning", Nature, 2015.)</span>

### AlphaGo (Silver et al.)
- Another breakthrough from Google DeepMind
- Combines Monte-Carlo Tree Search (MCTS) with deep neural networks
<img src="images/alphago.jpg" width=300px />
<span style="color:gray; font-size:10px; float:right">(Silver et al., "Mastering the game of Go with deep neural networks and tree search", Nature, 2016.)</span>

### Brief Summary of Monte-Carlo Tree Search (MCTS)
- Idea: Simulate many possible futures and choose the best action
- Problem: search space is too large!
  - Should search over only a reasonable state space (tree policy should be reasonable).
  - Should search up to a certain depth and use a default policy (usually random) to get the outcome.
<img src="images/alphago2.png" width=700px />
<span style="color:gray; font-size:10px; float:right">(Figure from Browne et al.)</span>

### AlphaGo (Silver et al.)
1. Supervised Learning: Train a **policy network** to predict human experts' moves.
2. Reinforcement Learning: Improve the policy network through self-play.
3. Reinforcement Learning: Train a **value network** to predict whether the agent wins at the end or not.
4. Use the learned networks to do MCTS more efficiently!
<img src="images/alphago3.png" width=800px />
<span style="color:gray; font-size:10px; float:right">(Silver et al., "Mastering the game of Go with deep neural networks and tree search", Nature, 2016.)</span>

### AlphaGo (Silver et al.)
- Use the **policy network** as a prior distribution over actions.
  - Allows searching over only reasonable state spaces.
- Use the **value network** to directly predict the outcome at the leaf node.
- Use a shallow but fast policy network as a default policy. (better than random)
<img src="images/alphago2.png" width=700px />
<span style="color:gray; font-size:10px; float:right">(Figure from Browne et al.)</span>

### AlphaGo (Silver et al.)
- Result: AlphaGo beat Lee Sedol, one of the world's top Go players, 4-1 in March 2016.
<img src="images/alphago5.jpg" width=400px />

### Summary 
- Deep Learning: machine learning algorithms based on learning multiple levels of representation.
- Neural networks can implement the idea of deep learning in a very flexible way.
- Deep neural networks (e.g., CNN, RNN) have made remarkable advances in computer vision, natural language processing, and reinforcement learning.