# Presentation For Neural Networks Seminar (Recurrent Neural Networks)

## S.Alireza Mousavizade

## The topics covered in this presentation:

1. Vanishing Gradients & Solutions
   - Activation Functions (Sigmoid, Tanh, ReLU, Leaky ReLU)
   - Weight Initialization (Xavier Initialization)
   - Residual Block Connections (ResNet Architecture)
   - Gated RNNs (GRU, LSTM, ...)
   - Batch Normalization

2. Exploding Gradients & Solutions
   - Gradient Clipping
   - Network Re-designing
   - LSTM Networks
   - Weight Regularization

3. Recurrent Neural Networks
   - Motivation
   - Architecture
   - Advantages & Drawbacks
   - Applications
   - Text Processing & Word Embedding + Code
   - Gated RNNs
     - Motivation
     - GRU & LSTM Architecture
     - Compare SimpleRNN, LSTM & GRU + Code
   - Variants of RNNs
     - Bidirectional RNNs
     - Deep RNNs
   - CNN & RNN Combination
     - 1D Convolutional Layer
   - C-LSTM
   - Some RNNs Applications Architecture

***

# Vanishing and Exploding Gradient

***

# Introduction

Artificial neural networks were first proposed in 1943 to mimic the biological nervous system, with the aim of helping machines learn as humans do.
It was not until around 1975, however, that machines could actually be made to learn and recognize patterns in data: the famous back-propagation algorithm brought new hope for training multi-layered networks.
It allowed researchers to train supervised deep artificial neural networks from scratch, although with limited success. The cause of this low training accuracy was later identified by Sepp Hochreiter in 1991.

# The Problem
1. The Vanishing Gradients
2. The Exploding Gradients

## The Vanishing Gradients

In deep neural networks, **adding more hidden layers** lets the network learn **more complex arbitrary functions** and features, and therefore reach higher accuracy when predicting outcomes or identifying patterns in complex data such as images and speech.

But **adding a layer comes at a cost**, which we refer to as the **vanishing gradient**.
The error that is propagated backwards by the back-propagation algorithm might become so small by the time it reaches the layers near the input that it has very little effect. This phenomenon is called the vanishing gradient problem.
This makes it difficult to know in which direction the parameters/weights should move to improve the cost function, and therefore causes premature convergence to a poor solution.

Below is an example of how back-propagation works mathematically for a network with 4 hidden layers.

<center>
<img src="vae_gradients_images/vanishing_and_exploding_gradients.jpg" alt="Drawing" style="width: 85%;"/>
</center>

Sometimes ∂J/∂b1 becomes (nearly) equal to zero and hence does not contribute to updating the weights, causing a premature end to the learning of the model.
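
As a rough illustration of why the gradient shrinks, the chain rule multiplies one activation derivative per layer. The sketch below assumes a toy chain of sigmoid units in which every weight is 1 and every pre-activation is 0 (both are assumptions made for illustration, not values taken from the figure), so each layer contributes a factor of 0.25, the largest value the sigmoid derivative can take.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backprop_factor(n_layers, z=0.0, w=1.0):
    """Product of the per-layer chain-rule factors w * sigmoid'(z)."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= w * sigmoid_grad(z)
    return factor


# The factor reaching the first layer shrinks geometrically with depth.
for depth in (1, 4, 10, 20):
    print(depth, backprop_factor(depth))
```

Even in this best case for the sigmoid, the factor after 20 layers is below 1e-11, which is why the earliest layers barely move.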

## The Exploding Gradients

Let’s now talk about another scenario that is very common with deep neural nets and that leads to failure of model training.

Sometimes, while the weights are being updated, error gradients accumulate and result in large gradients. These in turn cause large weight updates and make the network unstable; in the worst case, the weight values become NaN.

# Solutions to the Vanishing Gradient Problem

## 1. Activation Function

The simplest solution is to use activation functions such as ReLU (or Leaky ReLU) instead of sigmoid or tanh.
[The Vanishing Gradient Problem Of Sigmoid](https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484)
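
To make the comparison concrete, here is a minimal NumPy sketch of the derivatives of the four activation functions. The sample points in `z` and the Leaky ReLU slope `alpha=0.01` are illustrative assumptions.

```python
import numpy as np


def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # never larger than 0.25


def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # never larger than 1, near 0 when saturated


def relu_grad(z):
    return (z > 0).astype(float)  # exactly 1 for positive inputs, 0 otherwise


def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)  # small non-zero slope for negative inputs


z = np.array([-5.0, -1.0, 0.5, 5.0])
for name, fn in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                 ("relu", relu_grad), ("leaky relu", leaky_relu_grad)]:
    print(name, fn(z))
```

For positive inputs, ReLU and Leaky ReLU pass the gradient through unchanged, whereas sigmoid and tanh shrink it at every layer, most strongly in their saturated regions.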

### Sigmoid:

<center>
<img src="vae_gradients_images/sigmoid_vae_gradients.png" alt="Drawing" style="width: 85%;"/>
</center>


### Tanh:

<center>
<img src="vae_gradients_images/tanh_vae_gradients.png" alt="Drawing" style="width: 85%;"/>
</center>



## 2. Weight Initialization

- How should we choose the starting point for the iterative optimization process?

- The aim of weight initialization is to prevent layer activation outputs from exploding or vanishing during the course of a forward pass.

<center>
<img src="vae_gradients_images/weight_initialization.png" alt="Drawing" style="width: 85%;"/>
</center>

## Ideas:

### Small Random Numbers


<center>
<img src="vae_gradients_images/wi1.png" alt="Drawing" style="width: 85%;"/>
</center>

### Xavier Initialization

#### Steps to Derive:

#### Step 1:

<center>
<img src="vae_gradients_images/wi2.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 2:

<center>
<img src="vae_gradients_images/wi3.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 3:

<center>
<img src="vae_gradients_images/wi4.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 4:

<center>
<img src="vae_gradients_images/wi5.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 5:

<center>
<img src="vae_gradients_images/wi6.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 6:

<center>
<img src="vae_gradients_images/wi7.png" alt="Drawing" style="width: 85%;"/>
</center>

***

#### Step 7:

<center>
<img src="vae_gradients_images/wi8.png" alt="Drawing" style="width: 85%;"/>
</center>
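
The effect of the initialization scale can be checked numerically. The sketch below is not the derivation from the figures: it assumes the commonly used Xavier (Glorot) normal form with Var(W) = 2 / (n_in + n_out), which equals 1 / n for the square layers used here, and compares it with "small random numbers" of standard deviation 0.01. The layer width, depth, batch size and tanh activation are illustrative choices.

```python
import numpy as np


def small_random_std(n_in, n_out):
    return 0.01


def xavier_std(n_in, n_out):
    # Assumed Glorot normal variant: Var(W) = 2 / (n_in + n_out)
    return np.sqrt(2.0 / (n_in + n_out))


def final_activation_std(weight_std_fn, n_layers=10, width=256, seed=0):
    """Push a random batch through n_layers tanh layers; return the output std."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((512, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * weight_std_fn(width, width)
        a = np.tanh(a @ W)
    return a.std()


small_std = final_activation_std(small_random_std)
xavier_out_std = final_activation_std(xavier_std)
print("small random:", small_std)
print("xavier:      ", xavier_out_std)
```

With the small random weights the activations collapse towards zero after a few layers, while the Xavier-scaled weights keep them at a usable magnitude.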



## 3. Residual Connections

The residual connection directly adds the value present at the beginning of the block, x, to the end of the block (F(x) + x). The skip path therefore does not go through the activation functions that “squash” the derivatives, resulting in a higher overall derivative of the block.
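
A small numeric sketch of this argument: the derivative of a block y = F(x) + x is F'(x) + 1, so the identity path always contributes a factor of at least 1. The code assumes a toy scalar block whose body is a single sigmoid evaluated at the same pre-activation `z` in every block, which is a simplification made only for illustration.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def chain_gradient(n_blocks, residual, z=2.0):
    """Gradient through n_blocks stacked blocks, with or without skip connections."""
    grad = 1.0
    for _ in range(n_blocks):
        local = sigmoid_grad(z)  # dF/dx of the block body
        grad *= (local + 1.0) if residual else local
    return grad


plain = chain_gradient(10, residual=False)
with_skip = chain_gradient(10, residual=True)
print(plain, with_skip)
```

Without the skip connections the gradient through ten blocks is vanishingly small; with them it stays above 1 in this toy setting.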

### ResNet Architecture

<center>
<img src="vae_gradients_images/resnet_arch.jpeg" alt="Drawing" style="width: 95%;"/>
</center>

> **Bottleneck Design:** The use of a bottleneck reduces the number of parameters and matrix multiplications. The idea is to make residual blocks as thin as possible to increase depth and have fewer parameters. Bottleneck blocks were introduced as part of the ResNet architecture and are used in deeper ResNets such as ResNet-50 and ResNet-101.

| Residual Block Connection | Bottleneck Design |
| -- | -- |
| ![](vae_gradients_images/residual_block_connection.png) | ![Residual Network](vae_gradients_images/resnet_bottleneck.png) |
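
The parameter saving can be estimated with simple arithmetic. The sketch below assumes the commonly cited 256 → 64 → 256 channel widths for the bottleneck and compares it with two 3×3 convolutions applied directly to 256 channels. The widths are assumptions for illustration, and biases and batch-norm parameters are ignored.

```python
def conv_params(kernel_size, c_in, c_out):
    """Number of weights in a kernel_size x kernel_size convolution (no bias)."""
    return kernel_size * kernel_size * c_in * c_out


# Two 3x3 convolutions operating directly on 256 channels
plain_block = 2 * conv_params(3, 256, 256)

# Bottleneck: 1x1 reduce to 64, 3x3 on 64 channels, 1x1 expand back to 256
bottleneck_block = (
    conv_params(1, 256, 64)
    + conv_params(3, 64, 64)
    + conv_params(1, 64, 256)
)

print(plain_block, bottleneck_block, plain_block / bottleneck_block)
```

Under these assumptions the bottleneck needs roughly 17 times fewer weights for the same input and output width.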

### ResNet Performance

Winning results of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in past years on the top-5 classification task.

<center>
<img src="vae_gradients_images/image_net_results.jpg" alt="Drawing" style="width: 75%;"/>
</center>

### [Current State-of-the-art Models](https://paperswithcode.com/sota/image-classification-on-imagenet)


## 4. GRU and LSTM For RNN Cases

These two topics will be explained later.

## 5. Batch Normalization Layers

Batch Normalization (BN), proposed by Sergey Ioffe and Christian Szegedy in 2015, does not prevent the vanishing or exploding gradient problem in the sense of making it impossible. Rather, it reduces the probability that it occurs. Accordingly, the original paper states:

> In traditional deep networks, too-high learning rate may result in the gradients that explode or vanish, as well as getting stuck in poor local minima. Batch Normalization helps address these issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations in gradients; for instance, it prevents the training from getting stuck in the saturated regimes of nonlinearities.
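
A minimal NumPy sketch of the normalization step in training mode may help here. It normalizes each feature over the batch and then applies a learnable scale `gamma` and shift `beta`. The running statistics used at inference time are omitted, and the input distribution is an arbitrary example.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=8.0, scale=5.0, size=(64, 4))  # badly scaled activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0))  # close to 0 for every feature
print(out.std(axis=0))   # close to 1 for every feature
```

Keeping activations near zero mean and unit variance keeps them away from the saturated regions of sigmoid-like nonlinearities, which is the effect the quote describes.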

***

# Solutions to the Exploding Gradient Problem

1. **Gradient Clipping**: When gradients explode, they can become NaN because of numerical overflow, or we might see irregular oscillations in the training cost when we plot the learning curve. One fix is gradient clipping, which places a predefined threshold on the gradient to prevent it from getting too large. Clipping by norm does not change the direction of the gradient; it only changes its length.

<center>
<img src="vae_gradients_images/gradient-clipping.png" alt="Drawing" style="width: 50%;"/>
</center>


2. **Network Re-designing**: Using a smaller batch size and fewer layers while training might show some improvement in tackling exploding gradients.

3. **LSTM Networks**: Using LSTMs and related gated-type neuron structures is a common best practice for reducing exploding gradients in recurrent networks.

4. **Weight Regularization**: If exploding gradients still occur, another approach is to check the size of the network weights and apply a penalty to the network's loss function for large weight values.
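
Gradient clipping by norm, the first item above, can be sketched in a few lines of NumPy. The function name `clip_by_norm` and the threshold value are illustrative choices.

```python
import numpy as np


def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm is at most threshold; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad


g = np.array([30.0, -40.0])  # L2 norm is 50
clipped = clip_by_norm(g, threshold=5.0)
print(clipped, np.linalg.norm(clipped))
```

In Keras, the optimizers accept `clipnorm` and `clipvalue` arguments that apply this kind of clipping during training.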

***
***

# Recurrent Neural Networks

## S.Alireza Mousavizade

***

# Motivation

- Not all problems can be converted into one with fixed-length inputs and outputs (for example: text, signals, time series, etc.).
- Problems such as speech recognition or time-series prediction require a system to store and use context information.
  - Simple case: Output YES if the number of 1s is odd, else NO. (There is no constraint on the length of the input sequence.)
- In sequential data, the order of the data must also be considered (for example, the price of a stock in a time series).
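
The parity example shows why a carried state is enough to handle inputs of any length. The sketch below is a hand-written recurrence rather than a trained network: the single-bit `state` plays the role of the hidden state and the same update rule is applied at every time step.

```python
def parity_rnn(bits):
    """Return "YES" if the number of 1s in bits is odd, else "NO"."""
    state = 0                  # hidden state: parity of the 1s seen so far
    for x in bits:
        state = state ^ x      # the same update is reused at every time step
    return "YES" if state == 1 else "NO"


print(parity_rnn([1, 0, 1, 1]))
print(parity_rnn([1, 1, 0, 0, 0, 0]))
```

A fixed-size feed-forward network would need a separate input for every position, whereas the recurrence only needs the current bit and the previous state.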

***

# Architecture

Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They are typically as follows:

Input: $x^{<1>}, x^{<2>}, ..., x^{<t>}$ where $x^{<i>}$ can be a number, a vector, a matrix, or even a tensor.


<center>
<img src="rnn_images/rnn_archit.png" alt="Drawing" style="width: 75%;"/>
</center>

- An RNN shares the same weights and bias parameters across several time steps
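
A minimal NumPy sketch of the forward pass of a simple RNN, assuming a tanh hidden state and the hypothetical parameter names `W_xh`, `W_hh` and `b_h`. The same three parameters are reused at every time step, so the same model handles sequences of different lengths.

```python
import numpy as np


def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a simple RNN over xs of shape (T, d); return hidden states of shape (T, h)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in xs:                               # one step per time index
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)  # shared weights at every step
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
d, hidden = 3, 5
W_xh = rng.standard_normal((d, hidden)) * 0.1
W_hh = rng.standard_normal((hidden, hidden)) * 0.1
b_h = np.zeros(hidden)

short_states = rnn_forward(rng.standard_normal((4, d)), W_xh, W_hh, b_h)
long_states = rnn_forward(rng.standard_normal((9, d)), W_xh, W_hh, b_h)
print(short_states.shape, long_states.shape)
```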




***

# The pros and cons

The pros and cons of a typical RNN architecture are summed up in the table below:

| Advantages | Drawbacks |
| -- | -- |
| Possibility of processing input of any length | Computation being slow |
| Model size not increasing with size of input | Difficulty of accessing information from a long time ago |
| Computation takes into account historical information | Cannot consider any future input for the current state |
| Weights are shared across time | |

***

# Applications of RNNs

RNN models are mostly used in the fields of natural language processing and speech recognition. The different applications are summed up in the table below:

<center>
<img src="rnn_images/rnn_applications.png" alt="Drawing" style="width: 75%;"/>
</center>

## One To Many

One-to-many sequence problems are sequence problems where the input data has one time-step, and the output contains a vector of multiple values or multiple time-steps. Thus, we have a single input and a sequence of outputs.

### Music Generation

**OpenAI** created [MuseNet](https://openai.com/blog/musenet/), a deep neural network that can generate 4-minute musical compositions with 10 different instruments, and can combine styles from country to Mozart to the Beatles. MuseNet was not explicitly programmed with our understanding of music, but instead discovered patterns of harmony, rhythm, and style by learning to predict the next token in hundreds of thousands of MIDI files. MuseNet uses the same general-purpose unsupervised technology as GPT-2, a large-scale transformer model trained to predict the next token in a sequence, whether audio or text.

### Image Captioning

Image captioning is the task of describing the content of an image in words. To learn more, see the report ["Generate Meaningful Captions for Images with Attention Models"](https://wandb.ai/authors/image-captioning/reports/Generate-Meaningful-Captions-for-Images-with-Attention-Models--VmlldzoxNzg0ODA) by Rajesh Shreedhar Bhat and Souradip Chakraborty.


<center>
<img src="rnn_images/image_captioning.png" alt="Drawing" style="width: 75%;"/>
</center>

## Many To One

In many-to-one sequence problems, we have a sequence of data as input, and we have to predict a single output. Sentiment analysis or text classification is one such use case.

### Sentiment Analysis

Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs. **Code**

<center>
<img src="rnn_images/sentiment_analysis.jpeg" alt="Drawing" style="width: 75%;"/>
</center>

### Time-Series Forecasting

Time series forecasting occurs when you make scientific predictions based on historical time stamped data. It involves building models through historical analysis and using them to make observations and drive future strategic decision-making.

[Timeseries forecasting for weather prediction On Climate data time-series](https://keras.io/examples/timeseries/timeseries_weather_forecasting/)

## Many To Many

Many-to-Many sequence learning can be used for machine translation where the input sequence is in some language, and the output sequence is in some other language. It can be used for Video Classification as well, where the input sequence is the feature representation of each frame of the video at different time steps.

Encoder-Decoder network is commonly used for many-to-many sequence tasks. Here encoder-decoder is just a fancy name for a neural architecture with two LSTM layers.

### Tx $=$ Ty:

#### Named Entity Recognition

A named entity is a “real-world object” that’s assigned a name – for example, a person, a country, a product or a book title. spaCy can recognize various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn’t always work perfectly and might need some tuning later, depending on your use case.
[spaCy library code](https://spacy.io/usage/linguistic-features/)


<center>
<img src="rnn_images/name_entity.png" alt="Drawing" style="width: 75%;"/>
</center>

### Tx $\neq$ Ty:

- Many to One + One to Many
  - Many to One: Encode input sequence in a single vector.
  - One to Many: Decode output sequence from single input vector.

#### Machine Translation

Machine translation, sometimes referred to by the abbreviation MT (not to be confused with computer-aided translation, machine-aided human translation or interactive translation), is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

- Why reverse input sequence?

<center>
<img src="rnn_images/sequence_to_sequence.png" alt="Drawing" style="width: 75%;"/>
</center>

#### Video Object Detection [Link](https://kth.diva-portal.org/smash/get/diva2:1156631/FULLTEXT01.pdf)

<center>
<img src="rnn_images/video_object_detection.png" alt="Drawing" style="width: 60%;"/>
</center>


***

# Sequential Data: Text

- Text can be understood as either a sequence of characters or a sequence of words.
- Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that deep learning for computer vision is pattern recognition applied to pixels.

- Applications include document classification, sentiment analysis, author identification, and even question-answering (QA).


## Text Processing

Like all other neural networks, deep learning models don't take raw text as input: they only work with numeric tensors.
Vectorizing text:
  - Segment text into words, and transform each word into a vector.
  - Segment text into characters, and transform each character into a vector.
  - Extract n-grams of words or characters and transform each n-gram into a vector.

## Tokenization

The different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization.

- All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens.
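
As a minimal sketch of this two-step process, the code below tokenizes at the word level with a plain whitespace split, assigns each token an integer index, and then builds a one-hot vector for one index. The helper names and the choice to reserve index 0 for padding are assumptions made for illustration.

```python
def tokenize(text):
    """Word-level tokenization: lowercase and split on whitespace."""
    return text.lower().split()


def build_vocab(sentences):
    """Map each distinct token to an integer index (0 is reserved for padding)."""
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def one_hot(index, size):
    vector = [0] * size
    vector[index] = 1
    return vector


sentences = ["The cat sat", "the dog sat down"]
vocab = build_vocab(sentences)
encoded = [[vocab[token] for token in tokenize(s)] for s in sentences]

print(vocab)
print(encoded)
print(one_hot(vocab["cat"], len(vocab) + 1))
```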

## Word Embeddings

A word embedding is a learned representation for text in which words that have similar meanings have similar representations. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

<center>
<img src="rnn_images/word_embedding.png" alt="Drawing" style="width: 75%;"/>
</center>


- **Word embeddings** pack more information into far fewer dimensions than **one-hot encoding**.

- They can be pre-trained on large amounts of text training data.

There are two ways to obtain word embeddings:

- Learn word embeddings jointly with the main task
  - Start with random word vectors.
- Load word embeddings that were pre-trained using a different machine learning task. [Distributed Representations of Words and Phrases and their Compositionality](https://arxiv.org/abs/1310.4546)
  - Called pre-trained word embeddings

## Learning Word Embeddings

- The simplest approach is to associate a random vector with each word.
- The problem with this approach is that the resulting embedding space has no structure.
  - For instance, the words *accurate* and *exact* may end up with completely different embeddings even though they are interchangeable in most sentences.
- The geometric relationships between word vectors should reflect the semantic relationships between these words.
- Word embeddings are meant to map human language into a geometric space.
- In a reasonable embedding space, you would expect synonyms to be embedded into similar word vectors.
- We expect the geometric distance between any two word vectors to relate to the semantic distance between the associated words.
- We may want specific directions in the embedding space to be meaningful.

- A good word embedding space depends heavily on your task.
- It is reasonable to learn a new embedding space with every new task.
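
One common way to measure geometric closeness between word vectors is cosine similarity. The sketch below uses three hand-written, purely hypothetical 3-dimensional vectors (not learned embeddings) to show what "synonyms are embedded into similar vectors" means numerically.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors chosen by hand for illustration only
embeddings = {
    "accurate": [0.9, 0.1, 0.0],
    "exact": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

synonym_sim = cosine_similarity(embeddings["accurate"], embeddings["exact"])
unrelated_sim = cosine_similarity(embeddings["accurate"], embeddings["banana"])
print(synonym_sim, unrelated_sim)
```

In a well-structured space the first value is close to 1 and the second is close to 0. With random initial vectors there is no such pattern until training shapes the space.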

## Embedding Layer (Keras)

An Embedding layer is a dictionary that maps integer indices (which stand for specific words) to dense vectors. The dictionary is like a 2D array of shape (total_number_of_words, embedding_dimensionality).

- Takes as input a 2D tensor of integers of shape (n_samples, sequence_length), where n_samples is the batch size.
- (32, 10): a batch of 32 sequences of length 10. (**length is fixed.**)
- Returns a 3D floating-point tensor of shape (n_samples, sequence_length, embedding_dimensionality)
  -  ```
     from keras.layers import Embedding
     embedding = Embedding(input_dim=1000, output_dim=64)
     ```

- When we instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer.
- During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit.
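
The dictionary-lookup view of the layer can be reproduced with plain NumPy indexing. This sketch is not the Keras implementation: it only mimics the shapes, using a randomly initialized table with the same `input_dim=1000` and `output_dim=64` as the snippet above and a random batch of indices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 1000, 64

# The layer's weights: one row (vector) per word index, random at the start
table = rng.standard_normal((vocab_size, embedding_dim)) * 0.05

# A batch of 32 sequences of length 10, each entry a word index
batch = rng.integers(0, vocab_size, size=(32, 10))

# Looking up every index yields one vector per token
vectors = table[batch]
print(batch.shape, "->", vectors.shape)
```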

## IMDB

In [1]:
# Disable warnings
import tensorflow as tf

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

from tensorflow.keras.datasets import imdb
from tensorflow.keras import preprocessing


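# Vocabulary size: keep only the 10,000 most frequent words
max_features = 10000
# Every review is truncated or padded to this many tokens
maxlen = 40
# Dimensionality of each learned word vector
embedding_dimension = 8

# Load the reviews as lists of integer word indices
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)

# Pad/truncate the lists into 2D integer tensors of shape (n_samples, maxlen)
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)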



In [2]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

model = Sequential()

model.add(Embedding(input_dim=max_features, output_dim=embedding_dimension,  input_length=maxlen))

# Flatten the 3d tensor of embedding into a 2d tensor of shape (n_samples, maxlen * embedding_dimension)
model.add(Flatten())

# Binary classifier: positive ("+") vs. negative ("-") review
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer="rmsprop", loss='binary_crossentropy', metrics=['acc'])

model.summary()

model.fit(x_train, y_train,
          epochs=10,
          batch_size=32,
          validation_split=0.2)

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 40, 8)             80000     
                                                                 
 flatten (Flatten)           (None, 320)               0         
                                                                 
 dense (Dense)               (None, 1)                 321       
                                                                 
Total params: 80,321
Trainable params: 80,321
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10



Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff51dec7d60>

***

# Gated RNN

## Motivation

Recurrent Neural Networks (RNN) are good at processing sequence data for predictions. Therefore, they are extremely useful for deep learning applications like speech recognition, speech synthesis, natural language understanding, etc.

- Recurrent neural networks suffer from **short-term memory**
    - RNNs may leave out important information from the beginning of the sequence.

- Example:

<center>
<img src="rnn_images/vae_gradients_rnn.png" alt="Drawing" style="width: 75%;"/>
</center>


## Solution: Gated RNNs

There are three main types of RNNs: SimpleRNN, Long Short-Term Memory ([LSTM](https://www.researchgate.net/publication/13853244_Long_Short-term_Memory)), and Gated Recurrent Units ([GRU](https://arxiv.org/abs/1412.3555)). SimpleRNNs are good at processing sequence data for predictions but suffer from **short-term memory** (caused by **vanishing/exploding gradients**). LSTMs and GRUs were created to mitigate short-term memory using mechanisms called gates.





### GRU

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. The GRU is like a long short-term memory (LSTM) with a forget gate, but **has fewer parameters** than LSTM, as it lacks an output gate. **GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM. GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets.**

#### Simplified GRU:

#### Rules

$$

\begin{aligned}
    \mathbf{\tilde{c}_t}& = \mathbf{tanh(W_{cc} \, c_{t-1} + W_{xc} \, x_{t} + b_c)}\\
    \hline
    \mathbf{c_t}& = \mathbf{\Gamma_u \, \tilde{c}_t + (1 - \Gamma_u) \, c_{t-1}}\\
    \mathbf{\Gamma_u}& = \mathbf{\sigma(W_{cu} \, c_{t-1} + W_{xu} \, x_t + b_u)}\\
    \hline
    \mathbf{a_t}& = \mathbf{c_t}
\end{aligned}

$$

#### GRU: (+ Relevance Gate)

#### Rules

$$

\begin{aligned}
    \mathbf{\tilde{c}_t}& = \mathbf{tanh(W_{cc} \, (\Gamma_r \, . \, c_{t-1}) + W_{xc} \, x_{t} + b_c)}\\
    \hline
    \mathbf{c_t}& = \mathbf{\Gamma_u \, \tilde{c}_t + (1 - \Gamma_u) \, c_{t-1}}\\
    \mathbf{\Gamma_u}& = \mathbf{\sigma(W_{cu} \, c_{t-1} + W_{xu} \, x_t + b_u)}\\
    \mathbf{\Gamma_r}& = \mathbf{\sigma(W_{cr} \, c_{t-1} + W_{xr} \, x_t + b_r)}\\
    \hline
    \mathbf{a_t}& = \mathbf{c_t}
\end{aligned}

$$
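
The GRU rules translate almost line by line into NumPy. The sketch below reads the state update as Γ_u · c̃_t + (1 − Γ_u) · c_{t−1}, uses the row-vector convention (inputs of shape (n, d) multiplied on the left of the weights), and relies on a hypothetical helper `init_gru_params` with small random weights. None of these names come from a library.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_gru_params(d, h, rng):
    """Hypothetical helper: small random weights and zero biases."""
    p = {}
    for gate in ("u", "r", "c"):
        p["W_x" + gate] = rng.standard_normal((d, h)) * 0.1
        p["W_c" + gate] = rng.standard_normal((h, h)) * 0.1
        p["b_" + gate] = np.zeros((1, h))
    return p


def gru_step(x_t, c_prev, p):
    gamma_u = sigmoid(c_prev @ p["W_cu"] + x_t @ p["W_xu"] + p["b_u"])  # update gate
    gamma_r = sigmoid(c_prev @ p["W_cr"] + x_t @ p["W_xr"] + p["b_r"])  # relevance gate
    c_tilde = np.tanh((gamma_r * c_prev) @ p["W_cc"] + x_t @ p["W_xc"] + p["b_c"])
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c_t  # a_t = c_t


rng = np.random.default_rng(0)
n, d, h = 2, 3, 4
params = init_gru_params(d, h, rng)
x_t = rng.standard_normal((n, d))
c_prev = rng.standard_normal((n, h))
c_t = gru_step(x_t, c_prev, params)
print(c_t.shape)
```

When the update gate is close to 0, the old state is copied through almost unchanged, which is how the cell can carry information across many time steps.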

#### Architecture

<center>
<img src="rnn_images/GRU.png" alt="Drawing" style="width: 50%;"/>
</center>


### LSTM

Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning (DL). Unlike standard feedforward neural networks, LSTM has feedback connections.

#### Rules

$$

\begin{aligned}
    \mathbf{\tilde{c}_t}& = \mathbf{tanh(W_{ac} \, a_{t-1} + W_{xc} \, x_{t} + b_c)}\\
    \hline
    \mathbf{c_t}& = \mathbf{\Gamma_u \, \tilde{c}_t + \Gamma_f \, c_{t-1}}\\
    \mathbf{\Gamma_u}& = \mathbf{\sigma(W_{au} \, a_{t-1} + W_{xu} \, x_t + b_u)}\\
    \mathbf{\Gamma_f}& = \mathbf{\sigma(W_{af} \, a_{t-1} + W_{xf} \, x_t + b_f)}\\
    \mathbf{\Gamma_o}& = \mathbf{\sigma(W_{ao} \, a_{t-1} + W_{xo} \, x_t + b_o)}\\
    \hline
    \mathbf{a_t}& = \mathbf{\Gamma_o \, . \, tanh(c_t)}
\end{aligned}

$$

[More ...](https://d2l.ai/chapter_recurrent-neural-networks/rnn.html#:~:text=A%20neural%20network%20that%20uses,number%20of%20time%20steps%20increases.)
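
A single LSTM step can be sketched in NumPy as follows. The code assumes the standard formulation in which the three gates and the candidate read the previous activation a_{t−1} and the current input, and the output is Γ_o · tanh(c_t). The helper `init_lstm_params` and all parameter names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_lstm_params(d, h, rng):
    """Hypothetical helper: small random weights and zero biases."""
    p = {}
    for name in ("u", "f", "o", "c"):
        p["W_x" + name] = rng.standard_normal((d, h)) * 0.1
        p["W_a" + name] = rng.standard_normal((h, h)) * 0.1
        p["b_" + name] = np.zeros((1, h))
    return p


def lstm_step(x_t, a_prev, c_prev, p):
    gamma_u = sigmoid(a_prev @ p["W_au"] + x_t @ p["W_xu"] + p["b_u"])  # update gate
    gamma_f = sigmoid(a_prev @ p["W_af"] + x_t @ p["W_xf"] + p["b_f"])  # forget gate
    gamma_o = sigmoid(a_prev @ p["W_ao"] + x_t @ p["W_xo"] + p["b_o"])  # output gate
    c_tilde = np.tanh(a_prev @ p["W_ac"] + x_t @ p["W_xc"] + p["b_c"])
    c_t = gamma_u * c_tilde + gamma_f * c_prev   # memory cell
    a_t = gamma_o * np.tanh(c_t)                 # exposed activation
    return a_t, c_t


rng = np.random.default_rng(0)
n, d, h = 2, 3, 4
params = init_lstm_params(d, h, rng)
a_t, c_t = lstm_step(rng.standard_normal((n, d)),
                     np.zeros((n, h)), np.zeros((n, h)), params)
n_params = sum(w.size for w in params.values())
print(a_t.shape, c_t.shape, n_params)
```

The LSTM keeps four weight groups where the GRU keeps three, which is the source of the GRU's smaller parameter count for the same `d` and `h`.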

#### Architecture

<center>
<img src="rnn_images/LSTM.png" alt="Drawing" style="width: 50%;"/>
</center>

### Used Gates

<center>
<img src="rnn_images/gates.png" alt="Drawing" style="width: 75%;"/>
</center>



### Weights and Inputs Shape (Batch Size Consideration)

$$
\begin{aligned}
    \textrm{Input} & : \mathbf{X_t} \in \mathbb{R}^{n \times d}\\[10pt]
    \textrm{Weights} & : \mathbf{W_{x\{*\}}} \in \mathbb{R}^{d \times h}\\
                         & : \mathbf{W_{a\{*\}}} \in \mathbb{R}^{h \times h}\\[5pt]
                         & : \mathbf{W_{c\{*\}}} \in \mathbb{R}^{h \times h}\\[5pt]
                         & : \mathbf{b_{\{*\}}} \in \mathbb{R}^{1 \times h}\\[10pt]
    \textrm{Gates} & : \mathbf{\Gamma_u} \in \mathbb{R}^{n \times h}\\
                         & : \mathbf{\Gamma_r} \in \mathbb{R}^{n \times h}\\
                         & : \mathbf{\Gamma_f} \in \mathbb{R}^{n \times h}\\
                         & : \mathbf{\Gamma_o} \in \mathbb{R}^{n \times h}\\[10pt]
    \textrm{where:}\\
    n & : \textrm{batch-size}\\
    d & : \textrm{input dimension (in text: word embedding dimension)}\\
    h & : \textrm{number of hidden units}
\end{aligned}
$$

## [Sentiment Analysis On Twitter Tweets](https://colab.research.google.com/drive/1V6-kZg3PetL7IJRjy0mLAYdEUzE8DVnM#scrollTo=W65oBdwmkclD) using SimpleRNN, LSTM and GRU

***

# Variants of RNNs

## Bidirectional RNN

### Motivation

- All of the RNNs we have considered up to now have a “causal” structure.
  - The state at time $t$ only captures information from the past, $x_1 , … , x_{t-1}$, and the present input $x_t$.
- In some applications, we want to output a prediction of $y_t$ which may depend on the whole input sequence.


Example:

<center>
<img src="rnn_images/brnn_ex.png" alt="Drawing" style="width: 45%;"/>
</center>

### Architecture

Bidirectional RNNs combine an RNN that moves
forward through time beginning from the start of the
sequence with another RNN that moves backward
through time beginning from the end of the sequence.

<center>
    <img src="rnn_images/brnn_arch.png" alt="Drawing" style="width: 45%;"/>
</center>
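
The idea can be sketched with two independent simple RNNs in NumPy: one reads the sequence left to right, the other right to left, and their hidden states are concatenated per time step. The function names, the tanh cells and the weight scales are illustrative assumptions.

```python
import numpy as np


def rnn_states(xs, W_xh, W_hh):
    """Hidden states of a simple tanh RNN over xs of shape (T, d)."""
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        out.append(h)
    return np.stack(out)


def bidirectional_states(xs, fwd, bwd):
    forward = rnn_states(xs, *fwd)
    backward = rnn_states(xs[::-1], *bwd)[::-1]  # re-align to original time order
    return np.concatenate([forward, backward], axis=1)


rng = np.random.default_rng(0)
T, d, h = 5, 3, 4
fwd = (rng.standard_normal((d, h)) * 0.5, rng.standard_normal((h, h)) * 0.5)
bwd = (rng.standard_normal((d, h)) * 0.5, rng.standard_normal((h, h)) * 0.5)

xs = rng.standard_normal((T, d))
states = bidirectional_states(xs, fwd, bwd)

# Change only the last input and recompute
xs_changed = xs.copy()
xs_changed[-1] += 1.0
states_changed = bidirectional_states(xs_changed, fwd, bwd)
print(states.shape)
```

After the last input is changed, the forward half of the first time step is unaffected while the backward half changes, so the combined state at each step can depend on the whole sequence.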

## Deep Recurrent Networks

For learning very complex functions, it is sometimes useful to stack multiple
layers of RNNs together to build even deeper versions of these models.

### Architecture

<center>
    <img src="rnn_images/drnn_arch.png" alt="Drawing" style="width: 45%;"/>
</center>

- Unlike deep CNNs (D-CNNs), for deep RNNs (D-RNNs) having three layers is already quite a lot.
- D-RNNs have better memory than Single-Layer RNNs because each $x_t$ affects hidden states values in several ways.
- The blocks can be SimpleRNN, GRU, or LSTM

### [IMDB](https://colab.research.google.com/drive/1Ir3_GSETKbj3-O9T64I9H3ebo1s4i6CO?usp=sharing#scrollTo=1iahs0-or5uh)

***

# CNN + RNN

1D convnets can be competitive with RNNs on certain sequence-processing problems, usually at a considerably cheaper computational cost.

- Time can be treated as a spatial dimension, like the height or width of a 2D image.

## 1D Convolution Layer

1D convolution layers can recognize local
patterns in a sequence.

A pattern learned at a certain position in a
sentence can later be recognized at a
different position, making 1D convnets
translation invariant.

<center>
    <img src="rnn_images/1d_convnet.png" alt="Drawing" style="width: 50%;"/>
</center>

- We may use stride and 1D pooling.
- Because 1D convnets process input patches independently, they aren’t sensitive to the order of the timesteps (beyond a local scale, the size of the convolution windows), unlike RNNs.
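
A minimal NumPy sketch of a 1D convolution (implemented as cross-correlation, as is usual in deep learning libraries) shows the translation-invariance property. The kernel and the two toy sequences are made up for illustration: the same three-step pattern is placed at two different positions.

```python
import numpy as np


def conv1d_valid(seq, kernel, stride=1):
    """Slide kernel over seq without padding; one dot product per window."""
    k = len(kernel)
    return np.array([np.dot(seq[i:i + k], kernel)
                     for i in range(0, len(seq) - k + 1, stride)])


kernel = np.array([1.0, -1.0, 1.0])  # detector for the local pattern 1, -1, 1
seq_a = np.array([0, 0, 1, -1, 1, 0, 0, 0, 0, 0], dtype=float)  # pattern at index 2
seq_b = np.array([0, 0, 0, 0, 0, 0, 1, -1, 1, 0], dtype=float)  # pattern at index 6

resp_a = conv1d_valid(seq_a, kernel)
resp_b = conv1d_valid(seq_b, kernel)

print(resp_a.argmax(), resp_b.argmax())  # where the pattern was found
print(resp_a.max(), resp_b.max())        # global max pooling gives the same value
```

The same kernel fires at whichever position the pattern occurs, and pooling over time removes the position entirely. This is also why a convnet alone loses long-range order information.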

## Combining RNN with CNN

One strategy to combine **the speed and lightness of convolutional layers** with the **order-sensitivity of recurrent layers** is to use a 1D convnet as a preprocessing step before an RNN.

<center>
    <img src="rnn_images/cnn+rnn.png" alt="Drawing" style="width: 50%;"/>
</center>

- Especially beneficial when **you’re dealing with sequences that are so long they can’t realistically be processed with RNNs**, such as sequences with thousands of steps.

- 1D convnets offer a faster alternative to RNNs on some problems, in particular natural language processing tasks.

- Because **RNNs are extremely expensive for processing very long sequences, but 1D convnets are cheap**, it can be a good idea to use a 1D convnet as a preprocessing step before an RNN, **shortening the sequence and extracting useful representations for the RNN to process.**

## C-LSTM

<center>
    <img src="rnn_images/clstm.png" alt="Drawing" style="width: 75%;"/>
</center>

> Stollenga, Marijn Frederik. Advances in humanoid control and perception. Diss. Università della Svizzera italiana, 2016.

> F. Xiong, X. Shi, and D. Yeung. "Spatiotemporal modeling for crowd counting in videos." IEEE International Conference on Computer Vision. 2017.


# Applications

## Image Captioning

<center>
    <img src="rnn_images/image_cap_arch.png" alt="Drawing" style="width: 75%;"/>
</center>

> A. Tripathi, S. Srivastava, and R. Kothari. "Deep Neural Network Based Image Captioning." International Conference on Big Data Analytics. Springer, 2018.

## Video Analysis (Lip Reading)

<center>
    <img src="rnn_images/lib_read_arch.png" alt="Drawing" style="width: 75%;"/>
</center>

> Fernandez-Lopez, Adriana, and Federico M. Sukno. "Survey on automatic lip-reading in the era of deep learning." Image and Vision Computing 78 (2018): 53-72.


***
