# Recurrent Neural Networks
- The same weights are shared by every "word cell" (time step), hence *recurrent*
![image.png](img/rnn.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/TyJuk/recurrent-neural-networks)
<br>
<br>

# Typical RNN Tasks
- 1:n - e.g. give a word and get a sentence
- n:1 - e.g. give a sentence and figure out if it's offensive or not (binary classification)
- n:n - e.g. translate a sentence into another language
<br>
<br>

# Why?
![image.png](img/rnn_q_matrix.png)
<br>
<br>

# Simple RNN
![image.png](img/simple_rnn.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/eaLt6/math-in-simple-rnns)
<br>
<br>

# Formulas
![image-2.png](img/rnn_math.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/eaLt6/math-in-simple-rnns)

<br>
<br>
<img src="img/rnn_element_wise.png" align="left"/>

The circle symbol in the formula stands for the Hadamard (element-wise) product: "*a binary operation that takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands, where each element i, j is the product of elements i, j of the original two matrices. It is to be distinguished from the more common matrix product.*" [(Hadamard product)](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) Also see: [How to do it in Python](https://stackoverflow.com/questions/40034993/how-to-get-element-wise-matrix-multiplication-hadamard-product-in-numpy)
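A minimal NumPy sketch of the difference, using two made-up 2×2 matrices: `*` (or `np.multiply`) gives the Hadamard product, while `@` (or `np.matmul`) gives the ordinary matrix product.

```python
import numpy as np

# Two made-up matrices of the same shape
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[10, 20],
              [30, 40]])

hadamard = a * b   # element-wise product, same as np.multiply(a, b)
matprod = a @ b    # matrix product, same as np.matmul(a, b)

print(hadamard)    # [[ 10  40] [ 90 160]]
print(matprod)     # [[ 70 100] [150 220]]
```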

<br>
<br>

# What are we training?
![image.png](img/rnn_what_to_train.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/eaLt6/math-in-simple-rnns)

<br>
<br>

# Calculate Hidden State Activation `h` in Python
*From h_t_prev to h_t*

In [1]:
import numpy as np

In [2]:
w_hh = np.random.standard_normal((3,2))
w_hx = np.random.standard_normal((3,3))
h_t_prev = np.random.standard_normal((2,1))
x_t = np.random.standard_normal((3,1))

In [3]:
print("w_hh",w_hh,"w_hx",w_hx,"h_t_prev",h_t_prev,"x_t",x_t, sep="\n\n")

w_hh

[[-1.48910364  0.24731636]
 [ 2.00520266 -1.6848382 ]
 [-0.31401995  0.55496468]]

w_hx

[[-0.52032108  0.30678248 -0.37567689]
 [ 0.19514791  0.62584317 -1.36211149]
 [-0.91972935  0.7476726   0.66287266]]

h_t_prev

[[-1.80277885]
 [ 0.52402386]]

x_t

[[ 1.62051612]
 [-0.99302758]
 [-0.54891919]]


In [4]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One bias per row of the pre-activation (w_hh and w_hx both have 3 rows)
bias = np.random.standard_normal((w_hh.shape[0], 1))
bias

array([[ 0.20711845],
       [-0.79272032],
       [-0.76548912]])

In [5]:
h_t = sigmoid(np.matmul(w_hh, h_t_prev) + np.matmul(w_hx, x_t) + bias)
print(h_t)

A = h_t

[[0.88890718]
 [0.00778221]
 [0.07548571]]


In [6]:
# Another way
h_t = sigmoid(np.matmul(np.hstack((w_hh, w_hx)), np.vstack((h_t_prev, x_t))) + bias)
print(h_t)

B = h_t

[[0.88890718]
 [0.00778221]
 [0.07548571]]


In [7]:
np.allclose(A,B)

True

In [8]:
# Another way, too
sigmoid(np.hstack((w_hh, w_hx)) @ np.vstack((h_t_prev, x_t)) + bias)

array([[0.88890718],
       [0.00778221],
       [0.07548571]])

In [9]:
# Another way, too (this one is sexy)
sigmoid(w_hh @ h_t_prev + w_hx @ x_t + bias)

array([[0.88890718],
       [0.00778221],
       [0.07548571]])

In [10]:
#np.concatenate([h_t_prev, x_t])
#w_hh.shape
#sigmoid(np.dot(w_hh, np.concatenate([h_t_prev, x_t])) + bias)

In [11]:
%timeit sigmoid(np.matmul(w_hh, h_t_prev) + np.matmul(w_hx, x_t) + bias)

5.65 µs ± 302 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [12]:
%timeit sigmoid(np.matmul(np.hstack((w_hh, w_hx)), np.vstack((h_t_prev, x_t))) + bias)

14.8 µs ± 710 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [13]:
%timeit sigmoid(np.hstack((w_hh, w_hx)) @ np.vstack((h_t_prev, x_t)) + bias)

17.1 µs ± 1.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [14]:
%timeit sigmoid(w_hh @ h_t_prev + w_hx @ x_t + bias)

5.59 µs ± 69.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


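Let's use `@` for matrix multiplication. It does the same as `np.matmul` but is shorter and easier to remember, and the timings above are about equal (5.59 µs vs. 5.65 µs). The slower variants lose their time in `np.hstack`/`np.vstack`, not in the multiplication. Note that `@` is *not* element-wise: the element-wise (Hadamard) product is `*`.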

# Costs
For a single example, the cost can be calculated like this:

<img src="img/rnn_costs_single.png" align="left"/>

<br>





If several time steps *T* are involved, the cost is averaged over all of them.

<img src="img/rnn_costs_all.png" align="left"/>

[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/KBmVE/cost-function-for-rnns)
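A minimal NumPy sketch of such an averaged cost, assuming the cost is the usual cross-entropy between one-hot targets and predicted probabilities. The function name, the `eps` guard against `log(0)` and the toy data are illustrative assumptions, not taken from the course.

```python
import numpy as np

def cross_entropy_cost(y_true, y_pred, eps=1e-12):
    """Cross-entropy averaged over T time steps.

    y_true: (T, K) one-hot targets
    y_pred: (T, K) predicted class probabilities
    """
    T = y_true.shape[0]
    # Sum over classes and time steps, then divide by the number of steps
    return -np.sum(y_true * np.log(y_pred + eps)) / T

# Hypothetical toy data: T = 2 time steps, K = 3 classes
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])
y_pred = np.array([[0.50, 0.25, 0.25],
                   [0.25, 0.50, 0.25]])

cost = cross_entropy_cost(y_true, y_pred)
print(cost)  # -log(0.5), roughly 0.693
```

Only the probability assigned to the correct class at each step contributes, because all other entries of `y_true` are zero.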

# Scan functions
- Abstract RNNs for fast computation
- Needed for GPU usage & parallel computing

![image.png](img/rnn_scan_functions.png)

[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/rhso8/implementation-note)

An evaluation metric used in this course is perplexity, which does not seem to fit [speech recognition tasks](https://www.researchgate.net/post/What-are-the-performance-measures-in-Speech-recognition) (?).

Some more intuitive explanations of perplexity:
- https://towardsdatascience.com/evaluation-of-language-models-through-perplexity-and-shannon-visualization-method-9148fbe10bd0
- https://www.cs.cmu.edu/~roni/papers/eval-metrics-bntuw-9802.pdf
- https://www.quora.com/How-does-perplexity-function-in-natural-language-processing?share=1
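As a rough sketch, perplexity can be computed from the probabilities a model assigned to the tokens that actually occurred, using the common definition: the exponential of the average negative log-probability. The function name and the probabilities below are made up for illustration.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the true tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    mean_neg_log_prob = -np.mean(np.log(token_probs))
    return np.exp(mean_neg_log_prob)

# Hypothetical models: one guessing uniformly among 4 options, one more confident
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print(uniform, confident)
```

A model that guesses uniformly among 4 options has a perplexity of 4, so perplexity can be read as "how many options the model is effectively choosing between". Lower is better.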

# Gated Recurrent Units (GRU)

- In simple RNNs, the hidden state *h* is overwritten at every step
- Problematic for long sequences (information loss at each step -> vanishing gradients)
- GRUs compute additional formulas (gates) that decide what to keep, so relevant information stays available across many steps

## Vanilla & GRU
![simple_rnn_vs_gru.png](img/simple_rnn_vs_gru.png)
[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/t5L3H/gated-recurrent-units)

The symbol $\Gamma$, which looks like a 'T' without its left 'arm', is the Greek capital letter *gamma*. It denotes the 'gate' equations: the update gate ($\Gamma_u$) and the relevance gate ($\Gamma_r$).
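Written out to match the `forward_GRU` cell further down, one GRU step can be sketched as follows. The notation is an assumption and may differ from the figure: $[\cdot,\cdot]$ is concatenation, $\odot$ the element-wise product, $c^{\langle t\rangle}$ the candidate hidden state, and $W_u, W_r, W_c, b_u, b_r, b_c$ correspond to `wu, wr, wc, bu, br, bc` in the code.

$$
\begin{aligned}
\Gamma_u &= \sigma\left(W_u\,[h^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_u\right) \\
\Gamma_r &= \sigma\left(W_r\,[h^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_r\right) \\
c^{\langle t\rangle} &= \tanh\left(W_c\,[\Gamma_r \odot h^{\langle t-1\rangle}, x^{\langle t\rangle}] + b_c\right) \\
h^{\langle t\rangle} &= \Gamma_u \odot c^{\langle t\rangle} + (1 - \Gamma_u) \odot h^{\langle t-1\rangle}
\end{aligned}
$$

With $\Gamma_u$ close to 0 the old hidden state is carried over almost unchanged, which is how information can survive many steps.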

# Testing simple RNNs & GRUs & Scan functions

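Following [this](https://www.coursera.org/learn/sequence-models-in-nlp/ungradedLab/jJn3o/vanilla-rnns-grus-and-the-scan-function) lab. The hidden state looks like it is calculated differently than before, but it is the same computation: the lab multiplies a single weight matrix by the concatenation of `h_t` and `x`, which equals `w_hh @ h_t_prev + w_hx @ x_t` when that matrix is `w_hh` and `w_hx` stacked side by side (compare the `hstack`/`vstack` variant above).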

In [15]:
from numpy import random

random.seed(10)                 

emb = 128                       # Embedding size
T = 256                         # Number of time steps in the sequence
h_dim = 16                      # Hidden state dimension
h_0 = np.zeros((h_dim, 1))      # Initial hidden state

w1 = random.standard_normal((h_dim, emb+h_dim))
w2 = random.standard_normal((h_dim, emb+h_dim))
w3 = random.standard_normal((h_dim, emb+h_dim))

b1 = random.standard_normal((h_dim, 1))
b2 = random.standard_normal((h_dim, 1))
b3 = random.standard_normal((h_dim, 1))

X = random.standard_normal((T, emb, 1))

weights = [w1, w2, w3, b1, b2, b3]

In [16]:
def forward_V_RNN(inputs, weights):
    x, h_t = inputs
    wh, _, _, bh, _, _ = weights

    # Returning new hidden state only
    h_t = sigmoid(np.dot(wh, np.concatenate([h_t, x])) + bh)
    
    return h_t, h_t

In [17]:
def forward_GRU(inputs, weights): # Forward propagation for a single GRU cell
    x, h_t = inputs
    wu, wr, wc, bu, br, bc = weights

    # Update gate
    u = sigmoid(np.dot(wu, np.concatenate([h_t, x])) + bu)
    
    # Relevance gate
    r = sigmoid(np.dot(wr, np.concatenate([h_t, x])) + br)

    
    # Candidate hidden state 
    c = np.dot(wc, np.concatenate([r * h_t, x])) + bc
    c = np.tanh(c)
    
    # Returning new hidden state only
    h_t = u * c + (1 - u) * h_t
    
    return h_t, h_t

In [18]:
def scan(fn, elems, weights, h_0=None):
    h_t = h_0
    ys = []
    
    for x in elems:
        y, h_t = fn([x, h_t], weights)
        ys.append(y)
    
    return ys, h_t

In [19]:
%timeit scan(forward_V_RNN, X, weights, h_0)

2.05 ms ± 82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [20]:
%timeit scan(forward_GRU, X, weights, h_0)

5.81 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
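To see what `scan` hands back, here is a self-contained sketch with small, made-up sizes. The function `rnn_step` and the toy dimensions are illustrative assumptions. It re-implements the same pattern as above and shows how the two return values map to the task types: all per-step states for n:n, the final state only for n:1.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def rnn_step(inputs, weights):
    # One vanilla RNN step on the concatenation [h_t, x]
    x, h_t = inputs
    w, b = weights
    h_t = sigmoid(w @ np.concatenate([h_t, x]) + b)
    return h_t, h_t

def scan(fn, elems, weights, h_0):
    # Apply fn step by step, carrying the hidden state along
    h_t, ys = h_0, []
    for x in elems:
        y, h_t = fn([x, h_t], weights)
        ys.append(y)
    return ys, h_t

rng = np.random.default_rng(0)
T, emb, h_dim = 5, 4, 3                        # hypothetical toy sizes
X = rng.standard_normal((T, emb, 1))
weights = (rng.standard_normal((h_dim, h_dim + emb)),
           rng.standard_normal((h_dim, 1)))

ys, h_last = scan(rnn_step, X, weights, np.zeros((h_dim, 1)))

per_step = np.stack(ys)   # shape (T, h_dim, 1): one state per time step (n:n)
summary = h_last          # shape (h_dim, 1): final state only (n:1)
print(per_step.shape, summary.shape)
```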


# Bidirectional RNNs

I loved my `____`. Whenever I came back from school, he was jumping around and very happy to see me.

- Information flows through the RNN in both directions (left to right and right to left)
- A gap in one sentence can be filled using context from the sentence that follows



![rnn_bi_0.png](img/rnn_bi_0.png)
![rnn_bi_1.png](img/rnn_bi_1.png)

[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/TBXN7/deep-and-bi-directional-rnns)
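A minimal NumPy sketch of the idea, not the course's implementation: one RNN reads the sequence left to right, a second one with its own weights reads it right to left, and the two hidden states are concatenated at each position. The `tanh` activation, the helper `rnn_states` and all sizes are assumptions for illustration.

```python
import numpy as np

def rnn_states(X, w, b, h_0):
    """Run a tanh RNN over X and return the hidden state at every step."""
    h_t, states = h_0, []
    for x in X:
        h_t = np.tanh(w @ np.concatenate([h_t, x]) + b)
        states.append(h_t)
    return states

rng = np.random.default_rng(1)
T, emb, h_dim = 6, 4, 3                      # hypothetical toy sizes
X = rng.standard_normal((T, emb, 1))
h_0 = np.zeros((h_dim, 1))

# Each direction has its own parameters
w_fwd, b_fwd = rng.standard_normal((h_dim, h_dim + emb)), rng.standard_normal((h_dim, 1))
w_bwd, b_bwd = rng.standard_normal((h_dim, h_dim + emb)), rng.standard_normal((h_dim, 1))

forward = rnn_states(X, w_fwd, b_fwd, h_0)
# Read the sequence in reverse, then flip the states back into time order
backward = rnn_states(X[::-1], w_bwd, b_bwd, h_0)[::-1]

# At position t: left context from `forward`, right context from `backward`
combined = [np.concatenate([fw, bw]) for fw, bw in zip(forward, backward)]
print(len(combined), combined[0].shape)
```

In the gap example above, the state at the blank position would then also carry information from "he was jumping around", which a forward-only RNN has not seen yet at that point.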

In [21]:
import trax
from trax import layers as tl

In [22]:
%%time

mlp = tl.Serial(
  tl.Dense(128),
  tl.Relu(),
  tl.Dense(10),
  tl.LogSoftmax()
)

CPU times: user 773 µs, sys: 2 µs, total: 775 µs
Wall time: 781 µs


In [23]:
print(mlp)

Serial[
  Dense_128
  Serial[
    Relu
  ]
  Dense_10
  LogSoftmax
]


In [24]:
%%time

mode = 'train'
vocab_size = 256
model_dimension = 512
n_layers = 2

GRU = tl.Serial(
      tl.ShiftRight(mode=mode), # Remember to pass the mode parameter when using the model for inference/test; the default is 'train'
      tl.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
      #[tl.GRU(n_units=model_dimension) for _ in range(n_layers)], # Play around with n_layers to stack more GRU layers
      tl.GRU(n_units=model_dimension),
      tl.Dense(n_units=vocab_size),
      tl.LogSoftmax()
    )

CPU times: user 708 µs, sys: 0 ns, total: 708 µs
Wall time: 714 µs


In [25]:
def show_layers(model, layer_prefix="Serial.sublayers"):
    print(f"Total layers: {len(model.sublayers)}\n")
    for i in range(len(model.sublayers)):
        print('========')
        print(f'{layer_prefix}_{i}: {model.sublayers[i]}\n')
        
show_layers(GRU)

Total layers: 5

Serial.sublayers_0: Serial[
  ShiftRight(1)
]

Serial.sublayers_1: Embedding_256_512

Serial.sublayers_2: GRU_512

Serial.sublayers_3: Dense_256

Serial.sublayers_4: LogSoftmax



In [26]:
vocab_size = 5
word_ids = np.array([1, 2, 3, 4], dtype=np.int32)  # word_ids < vocab_size
embedding_layer = tl.Embedding(vocab_size, 32)
embedding_layer.init(trax.shapes.signature(word_ids))
embedded = embedding_layer(word_ids)  # embedded.shape = (4, 32)



In [27]:
embedded

DeviceArray([[-0.24339633, -0.26646966, -0.06507818,  0.13353701,
               0.19828677,  0.2684803 ,  0.04015421,  0.1252992 ,
               0.05146184, -0.28740436, -0.2005695 ,  0.05363747,
               0.09167175, -0.24854366,  0.08769388,  0.281615  ,
              -0.17424588,  0.14047112, -0.02152186,  0.02121285,
              -0.20540592,  0.1334213 , -0.27354348, -0.05790756,
              -0.14452775, -0.11836766,  0.17844093, -0.17286661,
              -0.30193475,  0.21396698,  0.22410244,  0.23549293],
             [-0.16122551,  0.10811774,  0.16578965, -0.20675819,
               0.13521245, -0.13177456, -0.10787377, -0.06189003,
               0.19928242,  0.10119225, -0.26748452,  0.12204586,
               0.2817053 ,  0.21753137, -0.14413384, -0.27695698,
               0.16268931,  0.28909573, -0.00046428,  0.26734638,
              -0.28450522, -0.25055635,  0.06145451,  0.2477021 ,
              -0.16908409, -0.18996347,  0.2621673 , -0.11953808,
         

# Vanishing Gradients in RNNs
- we're using tanh & sigmoid: their outputs lie between -1 & 1 and 0 & 1, and their derivatives are at most 1 (tanh) and 0.25 (sigmoid)
- backpropagating through long sequences (or many layers) multiplies many such small factors, so gradients may shrink towards 0 (vanish)
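A small numeric sketch of the effect. It deliberately ignores the weight matrices, which also enter the product in a real RNN, and only looks at the activation derivatives. Even in the best case for the sigmoid (derivative 0.25 at every step), the factor shrinks quickly with the number of steps.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)          # peaks at 0.25 for x = 0

# Best case for the sigmoid: every pre-activation is 0, so each factor is 0.25
for steps in (1, 5, 10, 20):
    factor = sigmoid_grad(0.0) ** steps
    print(f"{steps:>2} steps: {factor:.3e}")

# tanh derivative is 1 - tanh(x)^2: exactly 1 at x = 0, smaller everywhere else
tanh_grad_at_1 = 1 - np.tanh(1.0) ** 2
print(tanh_grad_at_1)
```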

A few strategies to deal with vanishing gradients:
![rnn_vanish_grad_strategies.png](img/rnn_vanish_grad_strategies.png)

[Source](https://www.coursera.org/learn/sequence-models-in-nlp/supplement/OIXEN/rnns-and-vanishing-gradients)

More about optimization issues: https://blog.paperspace.com/intro-to-optimization-in-deep-learning-gradient-descent/