$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$

# CS236605: Deep Learning
# Tutorial 5: Recurrent Neural Networks

## Introduction

In this tutorial, we will cover:

TODO

In [2]:
# Setup
%matplotlib inline
import os
import sys
import torch
import torchvision
import matplotlib.pyplot as plt

plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')

## Theory Reminders

Thus far, our models have been composed of fully connected (linear) layers or convolutional layers.

- Fully connected layers
    - Each layer $l$ operates on the output of the previous layer ($\vec{y}_{l-1}$) and calculates,
        $$
        \vec{y}_l = \varphi\left( \mat{W}_l \vec{y}_{l-1} + \vec{b}_l \right),~
        \mat{W}_l\in\set{R}^{n_{l}\times n_{l-1}},~ \vec{b}_l\in\set{R}^{n_l}.
        $$
    - FC layers have fixed input and output dimensions, determined by the shape of $\mat{W}_l$.
    
    <img src="img/mlp.png" />

- Convolutional layers
    - Each layer operates on an input tensor $\vec{x}$ containing $M$ feature maps. The $k$-th feature map of the output tensor $\vec{y}$ is:
        $$
        \vec{y}^k = \sum_{m=1}^{M} \vec{w}^{km}\ast\vec{x}^m+b^k,\ k\in[1,K]
        $$
      where $\ast$ denotes convolution, and $K$ is the number of output feature maps.
      
      <img src="img/cnn_filters.png" width="500"/>
    - This time, the weight dimensions do not depend on the spatial dimensions of the input (only on the number of feature maps and the kernel size).
    - Weights are shared across the spatial dimensions of the input.
    - The spatial dimensions of the output change with those of the input.
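
The contrast between the two layer types can be checked directly in PyTorch. The following is a minimal sketch; the specific sizes (8, 4, 3, 5, and the 16×16 and 32×32 inputs) are arbitrary values chosen for illustration, not taken from the course material.

```python
import torch
import torch.nn as nn

# Fully connected layer: the weight shape is tied to the input dimension.
fc = nn.Linear(in_features=8, out_features=4)
fc_weight_shape = tuple(fc.weight.shape)      # (n_l, n_{l-1}) = (4, 8)

# Convolutional layer: the weight shape depends only on the number of
# input/output feature maps (M=3, K=5) and on the kernel size.
conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3)
conv_weight_shape = tuple(conv.weight.shape)  # (K, M, 3, 3) = (5, 3, 3, 3)

# The same conv layer accepts inputs of different spatial sizes,
# and its output size changes accordingly.
out_small = conv(torch.randn(1, 3, 16, 16))
out_large = conv(torch.randn(1, 3, 32, 32))

print(fc_weight_shape, conv_weight_shape)
print(out_small.shape, out_large.shape)
```

Feeding the `fc` layer an input with a last dimension other than 8 would raise a shape error, whereas `conv` handles both image sizes with the same weights.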


However,
- Models based on these types of layers lack **persistent state**. 
- The current output is not affected by **previous inputs** (or outputs).

How can we model a dynamical system?
E.g., a linear system such as
$$\vec{y}_t = a_0 + a_1 \vec{y}_{t-1}+\dots+a_P \vec{y}_{t-P} + b_0 \vec{x}_t+\dots+b_{Q}\vec{x}_{t-Q}$$
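
To make the recurrence concrete, here is a small sketch that simulates a scalar version of this system in plain Python. The function name `simulate_linear_system` and the convention that all values before $t=0$ are zero are assumptions made for this example.

```python
def simulate_linear_system(x, a, b, a0=0.0):
    """Simulate y_t = a0 + sum_p a[p-1]*y_{t-p} + sum_q b[q]*x_{t-q}.

    a = [a_1, ..., a_P] and b = [b_0, ..., b_Q] are scalar coefficients.
    Inputs and outputs before t=0 are assumed to be zero.
    """
    y = []
    for t in range(len(x)):
        y_t = a0
        # Contribution of previous outputs (the "memory" of the system).
        for p, a_p in enumerate(a, start=1):
            if t - p >= 0:
                y_t += a_p * y[t - p]
        # Contribution of current and previous inputs.
        for q, b_q in enumerate(b):
            if t - q >= 0:
                y_t += b_q * x[t - q]
        y.append(y_t)
    return y


# Impulse input: a single 1 at t=0, then zeros.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
response = simulate_linear_system(impulse, a=[0.5], b=[1.0])
print(response)  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

The output keeps responding to the impulse long after the input has returned to zero, which is exactly the kind of behaviour a stateless FC or convolutional layer cannot reproduce.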

Many use cases and examples: text comprehension and translation, scene analysis in a video, etc.

## Recurrent layers

An RNN layer is similar to a regular FC layer, but it has two inputs:
- Current sample, $\vec{x}_t \in\set{R}^{d_{i}}$.
- Previous **state**, $\vec{h}_{t-1}\in\set{R}^{d_{h}}$.

and it produces two outputs which depend on both:
- Current layer output, $\vec{y}_t\in\set{R}^{d_o}$.
- Current **state**, $\vec{h}_{t}\in\set{R}^{d_{h}}$.

<img src="img/rnn_cell.png" width="300"/>

Crucially,
- The function $\varphi(\cdot)$ itself is not time-dependent (but is parametrized).
- The same layer (function) is applied at successive time steps, propagating the hidden state.

A basic RNN can be defined as follows.

$$
\begin{align}
\forall t \geq 0:\\
\vec{h}_t &= \varphi_h\left( \mat{W}_{hh} \vec{h}_{t-1} + \mat{W}_{xh} \vec{x}_t + \vec{b}_h\right) \\
\vec{y}_t &= \varphi_y\left(\mat{W}_{hy}\vec{h}_t + \vec{b}_y \right)
\end{align}
$$

where,
- $\vec{x}_t \in\set{R}^{d_{i}}$ is the input at time $t$.
- $\vec{h}_{t-1}\in\set{R}^{d_{h}}$ is the **hidden state** of a fixed dimension.
- $\vec{y}_t\in\set{R}^{d_o}$ is the output at time $t$.
- $\mat{W}_{hh}\in\set{R}^{d_h\times d_h}$, $\mat{W}_{xh}\in\set{R}^{d_h\times d_i}$, $\mat{W}_{hy}\in\set{R}^{d_o\times d_h}$, $\vec{b}_h\in\set{R}^{d_h}$ and $\vec{b}_y\in\set{R}^{d_o}$ are the model weights and biases.
- $\varphi_h$ and $\varphi_y$ are some non-linear functions. In many cases $\varphi_y$ is not used.
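
The two equations above translate almost line by line into a PyTorch module. This is a minimal sketch under a few assumptions: the class name `BasicRNNCell` is made up for this example, $\varphi_h$ is taken to be `tanh`, $\varphi_y$ is omitted, and the sizes are arbitrary.

```python
import torch
import torch.nn as nn


class BasicRNNCell(nn.Module):
    """One time step of the basic RNN defined above (illustrative sketch)."""

    def __init__(self, d_i, d_h, d_o):
        super().__init__()
        self.fc_xh = nn.Linear(d_i, d_h, bias=False)  # W_xh
        self.fc_hh = nn.Linear(d_h, d_h, bias=True)   # W_hh and b_h
        self.fc_hy = nn.Linear(d_h, d_o, bias=True)   # W_hy and b_y

    def forward(self, x_t, h_prev):
        # h_t = phi_h(W_hh h_{t-1} + W_xh x_t + b_h), with phi_h = tanh (assumed)
        h_t = torch.tanh(self.fc_hh(h_prev) + self.fc_xh(x_t))
        # y_t = W_hy h_t + b_y, with phi_y omitted
        y_t = self.fc_hy(h_t)
        return y_t, h_t


torch.manual_seed(0)
cell = BasicRNNCell(d_i=3, d_h=5, d_o=2)
x_t = torch.randn(4, 3)      # batch of 4 samples, d_i = 3
h_prev = torch.zeros(4, 5)   # initial hidden state, d_h = 5
y_t, h_t = cell(x_t, h_prev)
print(y_t.shape, h_t.shape)  # shapes (4, 2) and (4, 5)
```

Note that the module holds no state of its own: the hidden state is passed in and returned explicitly, so the caller decides how to carry it from one time step to the next.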

### Modeling time-dependence

If we imagine **unrolling** a single RNN layer through time,
<img src="img/rnn_unrolled.png" width="800" />

We can see how late outputs can now be influenced by early inputs, through the hidden state.
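
Unrolling can be written as an ordinary loop that applies the same cell at every time step. The sketch below uses PyTorch's built-in `nn.RNNCell` (which computes only the hidden state, with `tanh` by default) and then asks autograd how strongly the last hidden state depends on each input; the sequence length and sizes are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn_cell = nn.RNNCell(input_size=3, hidden_size=5)  # same weights at every step

T, batch = 6, 1
xs = torch.randn(T, batch, 3, requires_grad=True)   # input sequence x_0..x_{T-1}
h = torch.zeros(batch, 5)                           # initial hidden state

# Unroll through time: the hidden state is the only link between steps.
hidden_states = []
for t in range(T):
    h = rnn_cell(xs[t], h)
    hidden_states.append(h)

# Differentiate the last hidden state with respect to every input.
hidden_states[-1].sum().backward()
grad_norms = xs.grad.flatten(1).norm(dim=1)         # one norm per time step
print(grad_norms)
```

A non-zero gradient at $t=0$ confirms that the earliest input still influences the final state, even though it only reaches it through the chain of hidden states.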

How would **backpropagation** work, though?

### Layered RNN

RNN layers can be stacked to build a deep RNN model.

<img src="img/rnn_layered.png" width="800"/>

- As with MLPs, adding depth allows us to model intricate hierarchical features.
- However, now we also have a time dimension which makes the representation time-dependent.
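
In PyTorch, a stack of recurrent layers can be created with the `num_layers` argument of `nn.RNN`. This is a brief sketch with arbitrary sizes; note that `nn.RNN` returns hidden states only, so an output layer such as $\mat{W}_{hy}$ would need to be added separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Two stacked RNN layers; layer 2 receives the hidden states of layer 1 as input.
# The default tensor layout is (sequence, batch, feature).
deep_rnn = nn.RNN(input_size=3, hidden_size=5, num_layers=2)

xs = torch.randn(6, 4, 3)      # T=6 time steps, batch of 4, d_i=3
outputs, h_n = deep_rnn(xs)    # initial hidden states default to zeros

print(outputs.shape)  # hidden states of the top layer at every step: (6, 4, 5)
print(h_n.shape)      # final hidden state of each layer: (2, 4, 5)
```

The last time step of `outputs` coincides with the final hidden state of the top layer, `h_n[-1]`.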

**Image credits**

Images in this tutorial were taken and/or adapted from:

- Fundamentals of Deep Learning, Nikhil Buduma, O'Reilly 2017
- Christopher Olah, https://colah.github.io/