<a href="https://colab.research.google.com/github/sagar9926/NLP_Specialisation/blob/main/SequenceModelling/Vanilla_RNNs%2C_GRUs_and_the_scan_function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vanilla RNNs, GRUs and the `scan` function

In this notebook, you will learn how to define the forward method for vanilla RNNs and GRUs. Additionally, you will see how to define and use the function `scan` to compute forward propagation for RNNs.

By completing this notebook, you will:

- Be able to define the forward method for vanilla RNNs and GRUs
- Be able to define the `scan` function to perform forward propagation for RNNs
- Understand how forward propagation is implemented for RNNs

In [1]:
import numpy as np
from numpy import random
from time import perf_counter

In [2]:
def sigmoid(x): # Sigmoid function
    return 1.0 / (1.0 + np.exp(-x))

# Part 1: Forward method for vanilla RNNs and GRUs

In this part of the notebook, you'll see the implementation of the forward method for a vanilla RNN and you'll implement that same method for a GRU. For this exercise you'll use a set of random weights and variables with the following dimensions:

- Embedding size (`emb`) : 128
- Hidden state size (`h_dim`) : 16, so each hidden state has shape (16,1)

The weights `w_` and biases `b_` are initialized with dimensions (`h_dim`, `emb + h_dim`) and (`h_dim`, 1). We expect the hidden state `h_t` to be a column vector with size (`h_dim`,1) and the initial hidden state `h_0` is a vector of zeros.
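As a quick sanity check on these dimensions, the following sketch traces the shapes through one concatenate-and-multiply step. It uses the sizes stated above; the seed and the names `x_t`, `h_prev`, `stacked`, and `pre_activation` are assumptions made only for this illustration.

```python
import numpy as np

emb, h_dim = 128, 16                      # same sizes as in this notebook
rng = np.random.default_rng(0)            # assumed seed, for illustration only

x_t = rng.standard_normal((emb, 1))       # one input column vector
h_prev = np.zeros((h_dim, 1))             # initial hidden state (all zeros)
w = rng.standard_normal((h_dim, emb + h_dim))
b = rng.standard_normal((h_dim, 1))

stacked = np.concatenate([h_prev, x_t])   # shape (h_dim + emb, 1) = (144, 1)
pre_activation = w @ stacked + b          # shape (h_dim, 1) = (16, 1)

print(stacked.shape, pre_activation.shape)
```

The weight matrix needs `emb + h_dim` columns precisely because it multiplies the concatenation of the hidden state and the input.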

In [3]:
random.seed(10)                 # Random seed, so your results match ours
emb = 128                       # Embedding size
T = 256                         # Number of time steps in the sequence
h_dim = 16                      # Hidden state dimension
h_0 = np.zeros((h_dim, 1))      # Initial hidden state
# Random initialization of weights and biases
w1 = random.standard_normal((h_dim, emb+h_dim))
w2 = random.standard_normal((h_dim, emb+h_dim))
w3 = random.standard_normal((h_dim, emb+h_dim))
b1 = random.standard_normal((h_dim, 1))
b2 = random.standard_normal((h_dim, 1))
b3 = random.standard_normal((h_dim, 1))
X = random.standard_normal((T, emb, 1))
weights = [w1, w2, w3, b1, b2, b3]

The vanilla RNN cell is quite straightforward. Its most general structure is presented in the next figure:

<img src="https://github.com/amanjeetsahu/Natural-Language-Processing-Specialization/raw/d562105e68a0b85012ad3ebbb29b2af6344ad4e5/Natural%20Language%20Processing%20with%20Sequence%20Models/Week%202/RNN.PNG" width="400"/>

As you saw in the lecture videos, the computations made in a vanilla RNN cell are equivalent to the following equations:


$h^{<t>} = g(W_{h}[h^{<t-1>},x^{<t>}] + b_h)$

$\hat{y}^{<t>}=g(W_{yh}h^{<t>} + b_y)$


where $[h^{<t-1>},x^{<t>}]$ means that $h^{<t-1>}$ and $x^{<t>}$ are concatenated together. In the next cell we provide the implementation of the forward method for a vanilla RNN. 

In [5]:
def forward_V_RNN(inputs, weights): # Forward propagation for a single vanilla RNN cell
    x, h_t = inputs

    # weights.
    wh, _, _, bh, _, _ = weights

    # new hidden state
    h_t = np.dot(wh, np.concatenate([h_t, x])) + bh
    h_t = sigmoid(h_t)

    return h_t, h_t

As you can see, we omitted the computation of $\hat{y}^{<t>}$. This was done for the sake of simplicity, so you can focus on the way that hidden states are updated here and in the GRU cell.
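For completeness, the omitted output step could be sketched as follows. The names `forward_V_RNN_with_output`, `w_yh`, `b_y`, and `out_dim`, as well as the weight layout, are hypothetical and do not appear elsewhere in this notebook. The sigmoid stands in for the generic activation $g$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_V_RNN_with_output(inputs, weights):
    """Vanilla RNN step that also computes the prediction y_hat (illustrative)."""
    x, h_prev = inputs
    w_h, b_h, w_yh, b_y = weights          # hypothetical weight layout

    # New hidden state from the concatenated [h_prev, x]
    h_t = sigmoid(np.dot(w_h, np.concatenate([h_prev, x])) + b_h)
    # Prediction computed from the new hidden state
    y_hat = sigmoid(np.dot(w_yh, h_t) + b_y)
    return y_hat, h_t

rng = np.random.default_rng(0)             # assumed seed
emb, h_dim, out_dim = 128, 16, 4           # out_dim is an assumed output size
params = [rng.standard_normal((h_dim, emb + h_dim)),
          rng.standard_normal((h_dim, 1)),
          rng.standard_normal((out_dim, h_dim)),
          rng.standard_normal((out_dim, 1))]
y_hat, h_1 = forward_V_RNN_with_output(
    [rng.standard_normal((emb, 1)), np.zeros((h_dim, 1))], params)
print(y_hat.shape, h_1.shape)
```

Unlike `forward_V_RNN`, this variant returns two different values, which makes the distinction between the per-step output and the carried hidden state explicit.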

## 1.2 Forward method for GRUs

A GRU cell performs more computations than a vanilla RNN cell. You can see this visually in the following diagram:

<img src="https://github.com/amanjeetsahu/Natural-Language-Processing-Specialization/raw/d562105e68a0b85012ad3ebbb29b2af6344ad4e5/Natural%20Language%20Processing%20with%20Sequence%20Models/Week%202/GRU.PNG" width="400"/>

As you saw in the lecture videos, GRUs have relevance $\Gamma_r$ and update $\Gamma_u$ gates that control how the hidden state $h^{<t>}$ is updated on every time step. With these gates, GRUs are capable of keeping relevant information in the hidden state even for long sequences. The equations needed for the forward method in GRUs are provided below: 


$\Gamma_r=\sigma{(W_r[h^{<t-1>}, x^{<t>}]+b_r)}$


$\Gamma_u=\sigma{(W_u[h^{<t-1>}, x^{<t>}]+b_u)}$

$c^{<t>}=\tanh{(W_h[\Gamma_r*h^{<t-1>},x^{<t>}]+b_h)}$

$h^{<t>}=\Gamma_u*c^{<t>}+(1-\Gamma_u)*h^{<t-1>}$
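
Two limiting cases of the last equation may help build intuition for the update gate. This is only a worked reading of the equations above, not an additional part of the model:

$\Gamma_u = 0 \;\Rightarrow\; h^{<t>} = h^{<t-1>}$ (the previous hidden state is copied unchanged)

$\Gamma_u = 1 \;\Rightarrow\; h^{<t>} = c^{<t>}$ (the hidden state is replaced by the candidate)

Because $\Gamma_u$ is the output of a sigmoid, each of its entries lies between 0 and 1, so every entry of $h^{<t>}$ is a weighted average of the corresponding entries of $c^{<t>}$ and $h^{<t-1>}$.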

In the next cell, please implement the forward method for a GRU cell by computing the update `u` and relevance `r` gates, and the candidate hidden state `c`. 

In [26]:
def forward_GRU(inputs, weights): # Forward propagation for a single GRU cell
    x, h_t = inputs

    # weights.
    wu, wr, wc, bu, br, bc = weights

    # Update gate
    ### START CODE HERE (1-2 LINES) ###
    u = np.dot(wu, np.concatenate([h_t, x])) + bu
    u = sigmoid(u)
    ### END CODE HERE ###
    
    # Relevance gate
    ### START CODE HERE (1-2 LINES) ###
    r = np.dot(wr, np.concatenate([h_t, x])) + br
    r = sigmoid(r)
    ### END CODE HERE ###
    
    # Candidate hidden state 
    ### START CODE HERE (1-2 LINES) ###
    c = np.dot(wc, np.concatenate([r * h_t, x])) + bc
    c = np.tanh(c)
    ### END CODE HERE ###
    
    # New Hidden state h_t
    h_t = u * c + (1 - u) * h_t
    return h_t, h_t

In [28]:
forward_GRU([X[0],h_0], weights)[0]

array([[-1.51274490e-01],
       [-9.99997455e-01],
       [-1.24643746e-06],
       [-9.99970502e-01],
       [-9.35853020e-05],
       [ 9.99746405e-01],
       [ 4.18433337e-03],
       [-9.99999345e-01],
       [ 9.60126258e-04],
       [ 6.49678310e-01],
       [-9.56896518e-01],
       [-3.21332511e-01],
       [ 2.38414515e-02],
       [-9.79230498e-01],
       [ 6.74189998e-02],
       [-1.74317466e-03]])

# Part 2: Implementation of the `scan` function

In the lectures you saw how the `scan` function is used for forward propagation in RNNs. It takes as inputs:

- `fn` : the function to be called recurrently (e.g. `forward_GRU`)
- `elems` : the list of inputs for each time step (`X`)
- `weights` : the parameters needed to compute `fn`
- `h_0` : the initial hidden state

`scan` goes through all the elements `x` in `elems`, calls the function `fn` with arguments ([`x`, `h_t`],`weights`), updates the hidden state `h_t` with the value that `fn` returns, and appends the returned output to a list `ys`. Complete the following cell by calling `fn` with arguments ([`x`, `h_t`],`weights`).

In [32]:
def scan(fn, elems, weights, h_0=None): # Forward propagation for RNNs
    h_t = h_0
    ys = []
    for x in elems:
        ### START CODE HERE (1 LINE) ###
        y, h_t = fn([x, h_t], weights)
        ### END CODE HERE ###
        ys.append(y)
    return ys, h_t
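
To see the mechanics of `scan` in isolation, it can help to run it with a step function that has nothing to do with neural networks. In the sketch below, `running_sum` is a hypothetical toy function that accumulates scaled inputs; `scan` is repeated so the block runs on its own.

```python
def scan(fn, elems, weights, h_0=None):
    h_t = h_0
    ys = []
    for x in elems:
        y, h_t = fn([x, h_t], weights)
        ys.append(y)
    return ys, h_t

def running_sum(inputs, weights):
    """Toy step function (hypothetical): adds a scaled input to the state."""
    x, total = inputs
    scale = weights[0]
    total = total + scale * x
    return total, total   # (output for this step, new state)

ys, final = scan(running_sum, [1, 2, 3, 4], weights=[10], h_0=0)
print(ys)     # [10, 30, 60, 100]
print(final)  # 100
```

The list `ys` holds the state after every step, while the second return value is only the final state, which mirrors how `scan` returns all hidden states plus the last one for an RNN.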

# Part 3: Comparison between vanilla RNNs and GRUs

You have already seen how forward propagation is computed for vanilla RNNs and GRUs. As a quick recap, you need a forward method for the recurrent cell and a function like `scan` to apply that method to every element of a sequence. You saw that GRUs perform more computations than vanilla RNNs, and you can check that they have three times as many parameters. Below, we time forward propagation over a sequence with 256 time steps (`T`) for an RNN and a GRU with the same hidden state size (`h_dim`=16).
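
The parameter comparison can be checked with a little arithmetic. The sketch below assumes, as in this notebook, that the vanilla RNN uses one weight matrix and one bias vector while the GRU uses three of each, all with the same shapes; the variable names are made up for the illustration.

```python
emb, h_dim = 128, 16

# One weight matrix of shape (h_dim, emb + h_dim) plus one bias of shape (h_dim, 1)
params_per_block = h_dim * (emb + h_dim) + h_dim

rnn_params = 1 * params_per_block   # forward_V_RNN uses only w1 and b1
gru_params = 3 * params_per_block   # forward_GRU uses w1..w3 and b1..b3

print(rnn_params, gru_params, gru_params // rnn_params)  # 2320 6960 3
```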

In [33]:
# vanilla RNNs
tic = perf_counter()
ys, h_T = scan(forward_V_RNN, X, weights, h_0)
toc = perf_counter()
RNN_time = (toc - tic) * 1000
print(f"It took {RNN_time:.2f}ms to run the forward method for the vanilla RNN.")

It took 5.16ms to run the forward method for the vanilla RNN.


In [34]:
# GRUs
tic = perf_counter()
ys, h_T = scan(forward_GRU, X, weights, h_0)
toc = perf_counter()
GRU_time = (toc - tic) * 1000
print(f"It took {GRU_time:.2f}ms to run the forward method for the GRU.")

It took 16.53ms to run the forward method for the GRU.


As you were told in the lectures, GRUs take more time to compute. (Occasionally, although rarely, the vanilla RNN run takes longer. Can you figure out what might cause this?) This means that training and prediction take longer for a GRU than for a vanilla RNN. However, GRUs allow you to propagate relevant information even for long sequences, so when selecting an architecture for NLP you should assess the tradeoff between computational time and performance.
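
A single timing can fluctuate from run to run, so one way to get a steadier comparison is to repeat each measurement and report the median. The following is a minimal, self-contained sketch under that assumption: the step functions mirror the ones defined earlier, while `median_ms`, the number of runs, and the use of `np.random.default_rng` are choices made only for this illustration.

```python
import numpy as np
from timeit import repeat

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(inputs, weights):
    x, h = inputs
    wh, _, _, bh, _, _ = weights
    h = sigmoid(np.dot(wh, np.concatenate([h, x])) + bh)
    return h, h

def gru_step(inputs, weights):
    x, h = inputs
    wu, wr, wc, bu, br, bc = weights
    hx = np.concatenate([h, x])
    u = sigmoid(np.dot(wu, hx) + bu)                           # update gate
    r = sigmoid(np.dot(wr, hx) + br)                           # relevance gate
    c = np.tanh(np.dot(wc, np.concatenate([r * h, x])) + bc)   # candidate state
    h = u * c + (1 - u) * h
    return h, h

def scan(fn, elems, weights, h_0):
    h, ys = h_0, []
    for x in elems:
        y, h = fn([x, h], weights)
        ys.append(y)
    return ys, h

rng = np.random.default_rng(10)
emb, T, h_dim = 128, 256, 16
weights = ([rng.standard_normal((h_dim, emb + h_dim)) for _ in range(3)]
           + [rng.standard_normal((h_dim, 1)) for _ in range(3)])
X = rng.standard_normal((T, emb, 1))
h_0 = np.zeros((h_dim, 1))

def median_ms(fn, runs=20):
    """Median wall-clock time, in milliseconds, of one full pass over X."""
    times = repeat(lambda: scan(fn, X, weights, h_0), number=1, repeat=runs)
    return 1000 * float(np.median(times))

rnn_ms, gru_ms = median_ms(rnn_step), median_ms(gru_step)
print(f"vanilla RNN: {rnn_ms:.2f} ms, GRU: {gru_ms:.2f} ms (median of 20 runs)")
```

The absolute numbers depend on the machine, so only the relative ordering is meaningful here.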

# Creating a GRU model using Trax: Ungraded Lecture Notebook

In [36]:
!pip install -q -U trax
import trax
from trax import layers as tl


Trax allows you to define neural network architectures by stacking layers (similarly to other libraries such as Keras). For this, the `Serial()` combinator is often used, as it stacks layers serially using function composition.

Next you can see a simple vanilla NN architecture containing one hidden (dense) layer with 128 units and an output (dense) layer with 10 units, followed by a final log-softmax layer.

In [37]:
mlp = tl.Serial(
  tl.Dense(128),
  tl.Relu(),
  tl.Dense(10),
  tl.LogSoftmax()
)

Each of the layers within the `Serial` combinator layer is considered a sublayer. Notice that, unlike similar libraries, **in Trax the activation functions are considered layers.** To learn more about the `Serial` layer, check the docs [here](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.combinators.Serial).

You can try printing this object:

In [38]:
print(mlp)

Serial[
  Dense_128
  Serial[
    Relu
  ]
  Dense_10
  LogSoftmax
]


Printing the model gives you the exact same information as the model's definition itself.

By just looking at the definition you can clearly see what is going on inside the neural network. Trax is very straightforward in the way a network is defined, which is one of the things that makes it awesome!
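
The function-composition idea behind `Serial` can be sketched in plain Python, without Trax. The `serial` helper and the toy "layers" below are hypothetical stand-ins; the real combinator also handles weights, state, and multiple inputs, which this sketch ignores.

```python
from functools import reduce

def serial(*layers):
    """Hypothetical stand-in for a serial combinator: applies layers left to right."""
    def composed(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return composed

# Toy "layers": plain functions on numbers, not Trax layers
double = lambda x: 2 * x
add_one = lambda x: x + 1
relu = lambda x: max(0, x)

toy_model = serial(double, add_one, relu)
print(toy_model(3))    # 7
print(toy_model(-2))   # 0
```

Treating the activation as just another function in the chain is the same design choice that makes `tl.Relu()` a layer in its own right.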

## GRU MODEL

To create a `GRU` model you will need to be familiar with the following layers (each layer name links to its documentation):
   - [`ShiftRight`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.attention.ShiftRight): Shifts the tensor to the right by padding on axis 1. The `mode` argument refers to the context in which the model is being used. Possible values are `'train'`, `'eval'`, or `'predict'` (`'predict'` is for fast inference). It defaults to `'train'`.
   - [`Embedding`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Embedding): Maps discrete tokens to vectors. Its weight matrix has shape `(vocabulary size, dimension of output vectors)`. The dimension of the output vectors (also called `d_feature`) is the number of elements in the word embedding.
   - [`GRU`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.GRU): The GRU layer. It leverages another Trax layer called [`GRUCell`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.rnn.GRUCell). The number of GRU units should be specified and should match the number of elements in the word embedding. If you want to stack several consecutive GRU layers, you can do so with a Python list comprehension.
   - [`Dense`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.Dense): Vanilla Dense layer.
   - [`LogSoftmax`](https://trax-ml.readthedocs.io/en/latest/trax.layers.html#trax.layers.core.LogSoftmax): Log Softmax function.
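
The right shift described for `ShiftRight` can be illustrated with plain NumPy. This is only a sketch of the described behavior (zero-padding at the start of axis 1 and dropping the last position), not Trax's own implementation; the function name `shift_right` and the sample batch are assumptions.

```python
import numpy as np

def shift_right(x, n_positions=1):
    """Pads zeros at the start of axis 1 and drops the last n_positions entries."""
    pad_widths = [(0, 0)] * x.ndim
    pad_widths[1] = (n_positions, 0)
    padded = np.pad(x, pad_widths, mode='constant')
    return padded[:, :-n_positions]

batch = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8]])   # two sequences of four token ids
print(shift_right(batch))
# [[0 1 2 3]
#  [0 5 6 7]]
```

After the shift, the value at position `t` is the token that was originally at position `t-1`, so a model that predicts the token at `t` only sees earlier tokens.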

Putting everything together the GRU model will look like this:

In [39]:
mode = 'train'
vocab_size = 256
model_dimension = 512
n_layers = 2

GRU = tl.Serial(
      tl.ShiftRight(mode=mode), # Remember to pass the mode parameter when using the model for inference/testing, since the default is 'train'
      tl.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
      [tl.GRU(n_units=model_dimension) for _ in range(n_layers)], # Change n_layers if you want to stack more GRU layers together
      tl.Dense(n_units=vocab_size),
      tl.LogSoftmax()
    )

In [58]:
def show_layers(model, layer_prefix="Serial.sublayers"):
    print(f"Total layers: {len(model.sublayers)}\n")
    for i in range(len(model.sublayers)):
        print('========')
        print(f'{layer_prefix}_{i}: {model.sublayers[i]}\n')
        
show_layers(GRU)

Total layers: 6

Serial.sublayers_0: Serial[
  ShiftRight(1)
]

Serial.sublayers_1: Embedding_256_512

Serial.sublayers_2: GRU_512

Serial.sublayers_3: GRU_512

Serial.sublayers_4: Dense_256

Serial.sublayers_5: LogSoftmax



In [56]:
# example

vocab_size = 5
word_ids = np.array([1, 1, 3, 4], dtype=np.int32)  # word_ids < vocab_size
embedding_layer = tl.Embedding(vocab_size, 10)
embedding_layer.init(trax.shapes.signature(word_ids))
embedded = embedding_layer(word_ids)  # embedded.shape = (4, 10)

In [57]:
embedded

DeviceArray([[-0.51289576, -0.47462976,  0.4763662 ,  0.5279791 ,
               0.2914062 , -0.08452899, -0.11147123, -0.5470346 ,
               0.02848748,  0.38207462],
             [-0.51289576, -0.47462976,  0.4763662 ,  0.5279791 ,
               0.2914062 , -0.08452899, -0.11147123, -0.5470346 ,
               0.02848748,  0.38207462],
             [-0.22234032,  0.5045553 ,  0.23639515,  0.32686418,
               0.40890115, -0.3673048 , -0.42598405,  0.2964807 ,
               0.14386757, -0.320109  ],
             [ 0.01465019,  0.02285525,  0.10677988, -0.20892927,
              -0.10496668, -0.3220369 , -0.442741  , -0.32696432,
               0.28951478, -0.42385626]], dtype=float32)
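
Conceptually, an embedding layer is a lookup into a table with one row per token id, which is why the first two rows above are identical (both come from token id 1). The NumPy sketch below mimics that lookup with a randomly filled table; the seed and the name `table` are assumptions, and the values will not match the Trax output above.

```python
import numpy as np

vocab_size, d_feature = 5, 10
rng = np.random.default_rng(0)                         # assumed seed
table = rng.standard_normal((vocab_size, d_feature))   # one row per token id

word_ids = np.array([1, 1, 3, 4], dtype=np.int32)
looked_up = table[word_ids]                            # row lookup by integer index

print(looked_up.shape)                                 # (4, 10)
print(np.array_equal(looked_up[0], looked_up[1]))      # True: both rows are token 1
```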