# RNN (Recurrent Neural Network)

An RNN is a type of neural network suited to processing sequential data. It keeps a hidden state that summarizes the past inputs and applies the same weights at every time step to process the current input. It is trained with backpropagation through time (BPTT).

It is used in natural language processing (NLP), time series prediction, and speech/audio processing.

### Each Element Formulation

<div style="text-align: center;">
    <img src="./files/RNN/RNN_example.png" width="400" height="250">
</div>

<div style="text-align: center;">
    <img src="./files/RNN/RNN_memo.png" width="600" height="250">
</div>

An RNN has an **Input Layer**, a **Hidden State**, and an **Output Layer**.

The hidden state is calculated as:
$$
z_{th} = W_{xh}x_t + W_{hh}h_{t-1} + b_h \\
h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h) = \tanh(z_{th})
$$
where $h_t$ is the hidden state at time $t$, $W_{xh}$ is the weight from the input to the current hidden state, $W_{hh}$ is the weight from the previous hidden state to the current hidden state, and $b_h$ is the hidden-state bias.

The output layer is calculated as:
$$
z_{ty} = W_{hy}h_t + b_y \\
y_t = \mathrm{softmax}(W_{hy}h_t + b_y) = \sigma (W_{hy}h_t + b_y) = \sigma (z_{ty})
$$
where $W_{hy}$ is the weight from the current hidden state to the logits, $b_y$ is its bias, and $\sigma$ denotes the softmax function.

As you can see, none of the weights and biases ($W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$, $b_y$) depends on the time step: the same parameters are shared across all time steps.
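The formulation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's implementation: the function name `rnn_forward`, the sizes, the random initialization, and the choice $h_0 = 0$ are all assumptions made for the example. Note that the same five parameters are reused at every time step.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run the recurrence over a sequence and return every h_t and y_t."""
    h = np.zeros((W_hh.shape[0], 1))  # h_0, assumed to be all zeros
    hs, ys = [], []
    for x_t in xs:  # the same weights are used at every step
        z_th = W_xh @ x_t.reshape(-1, 1) + W_hh @ h + b_h
        h = np.tanh(z_th)
        z_ty = W_hy @ h + b_y
        hs.append(h)
        ys.append(softmax(z_ty))
    return hs, ys

# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 3, 5, 2, 4
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_h = np.zeros((hidden_size, 1))
b_y = np.zeros((output_size, 1))

xs = rng.normal(size=(seq_len, input_size))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(ys[-1].ravel())  # class probabilities at the last time step
```

Each `ys[t]` is a probability vector, and `hs[t]` depends on all inputs up to step `t` through the recurrence.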

### Back Propagation 1

Given an input sequence, the forward pass produces the output predictions. From the error between the predictions and the targets, we can optimize the weights using backpropagation.

First, we compute the gradients with respect to $h_t, b_y, W_{hy}$.

Let the loss be $L = \sum(y_t - y_t^{true})^2$, where $y_t$ is the output value after the softmax and $y_t^{true}$ is the one-hot encoded true value.
$$
L = \sum(y_t - y_t^{true})^2 \\
\frac{\partial L}{\partial y_t} = 2(y_{t} - y_{t}^{true})
$$

Let
$$
y_t = \sigma (W_{hy}h_t + b_y) \\
z_{ty} = W_{hy}h_t + b_y \\
\frac{\partial y_t}{\partial z_{ty}} = \sigma '(z_{ty})
$$
If so,
$$
\begin{align*}
\frac{\partial L}{\partial h_t} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \\
\frac{\partial L}{\partial b_y} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial b_y}\\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot 1 \\
\frac{\partial L}{\partial W_{hy}} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial W_{hy}}\\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot h_t
\end{align*}
$$
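The output-layer gradients can be checked numerically. The sketch below is illustrative only and rests on one assumption worth stating: because softmax couples all output components, $\sigma '(z_{ty})$ is read as the full softmax Jacobian $\mathrm{diag}(y_t) - y_t y_t^T$ rather than an elementwise derivative. The sizes, the random values, and the names `loss`, `delta`, and `max_err` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss(W_hy, b_y, h_t, y_true):
    # Squared error on the softmax output, as in the text.
    y_t = softmax(W_hy @ h_t + b_y)
    return np.sum((y_t - y_true) ** 2)

rng = np.random.default_rng(0)
hidden_size, output_size = 4, 3
W_hy = rng.normal(size=(output_size, hidden_size))
b_y = rng.normal(size=(output_size, 1))
h_t = np.tanh(rng.normal(size=(hidden_size, 1)))
y_true = np.array([[0.0], [1.0], [0.0]])  # one-hot target

# Analytic gradients via the chain rule.
y_t = softmax(W_hy @ h_t + b_y)
dL_dy = 2 * (y_t - y_true)
jacobian = np.diagflat(y_t) - y_t @ y_t.T  # softmax Jacobian (symmetric)
delta = jacobian @ dL_dy                   # dL/dz_ty
grad_b_y = delta
grad_W_hy = delta @ h_t.T
grad_h_t = W_hy.T @ delta

# Central finite differences on W_hy for comparison.
eps = 1e-6
numeric = np.zeros_like(W_hy)
for i, j in np.ndindex(*W_hy.shape):
    W_plus, W_minus = W_hy.copy(), W_hy.copy()
    W_plus[i, j] += eps
    W_minus[i, j] -= eps
    numeric[i, j] = (loss(W_plus, b_y, h_t, y_true)
                     - loss(W_minus, b_y, h_t, y_true)) / (2 * eps)

max_err = np.abs(numeric - grad_W_hy).max()
print(f"max |numeric - analytic| for W_hy: {max_err:.2e}")
```

A small `max_err` indicates that the analytic expression and the finite-difference estimate agree.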

### Back Propagation 2

Now, we compute the gradients with respect to $W_{xh}, W_{hh}, b_h$. Since $h_t = \tanh(z_{th})$, we have $\frac{\partial h_t}{\partial z_{th}} = 1 - h_t^2$.

$$ \begin{align*}
\frac{\partial L}{\partial W_{xh}} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \cdot \frac{\partial h_t}{\partial z_{th}} \cdot \frac{\partial z_{th}}{\partial W_{xh}} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot x_t \\
\frac{\partial L}{\partial W_{hh}} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \cdot \frac{\partial h_t}{\partial z_{th}} \cdot \frac{\partial z_{th}}{\partial W_{hh}} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot h_{t-1} \\
\frac{\partial L}{\partial b_h} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \cdot \frac{\partial h_t}{\partial z_{th}} \cdot \frac{\partial z_{th}}{\partial b_h} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot 1
\end{align*} $$
For compact computation, we can compute $\frac{\partial L}{\partial h_t}$ once and reuse it in all three gradients:
$$ \begin{align*}
\frac{\partial L}{\partial h_t} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \\
&=2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy}
\end{align*} $$
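The hidden-layer gradients for a single time step can be checked the same way. The sketch below is illustrative, not the notebook's implementation. It assumes $\frac{\partial h_t}{\partial z_{th}} = 1 - h_t^2$ (the tanh derivative), reads $\sigma '$ as the softmax Jacobian, and ignores the gradient that later time steps pass back through $h_{t+1}$. The names `step_loss`, `delta_y`, and `delta_h` are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def step_loss(W_xh, W_hh, b_h, W_hy, b_y, x_t, h_prev, y_true):
    """Loss of one time step, together with h_t and y_t."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
    y_t = softmax(W_hy @ h_t + b_y)
    return np.sum((y_t - y_true) ** 2), h_t, y_t

rng = np.random.default_rng(1)
n_in, n_h, n_out = 3, 4, 2
W_xh = rng.normal(size=(n_h, n_in))
W_hh = rng.normal(size=(n_h, n_h))
W_hy = rng.normal(size=(n_out, n_h))
b_h = rng.normal(size=(n_h, 1))
b_y = rng.normal(size=(n_out, 1))
x_t = rng.normal(size=(n_in, 1))
h_prev = np.tanh(rng.normal(size=(n_h, 1)))
y_true = np.array([[1.0], [0.0]])

_, h_t, y_t = step_loss(W_xh, W_hh, b_h, W_hy, b_y, x_t, h_prev, y_true)

# Compute dL/dh_t once and reuse it for all three gradients.
delta_y = (np.diagflat(y_t) - y_t @ y_t.T) @ (2 * (y_t - y_true))  # dL/dz_ty
dL_dh = W_hy.T @ delta_y
delta_h = dL_dh * (1 - h_t ** 2)  # dL/dz_th, using the tanh derivative
grad_W_xh = delta_h @ x_t.T
grad_W_hh = delta_h @ h_prev.T
grad_b_h = delta_h

# Central finite differences on W_hh for comparison.
eps = 1e-6
numeric = np.zeros_like(W_hh)
for i, j in np.ndindex(*W_hh.shape):
    W_plus, W_minus = W_hh.copy(), W_hh.copy()
    W_plus[i, j] += eps
    W_minus[i, j] -= eps
    loss_plus = step_loss(W_xh, W_plus, b_h, W_hy, b_y, x_t, h_prev, y_true)[0]
    loss_minus = step_loss(W_xh, W_minus, b_h, W_hy, b_y, x_t, h_prev, y_true)[0]
    numeric[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_err = np.abs(numeric - grad_W_hh).max()
print(f"max |numeric - analytic| for W_hh: {max_err:.2e}")
```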

### Generalization
The formulas above hold for any time step $t$. Writing all the gradients together:
$$
\begin{align*}
\nabla b_y &= \frac{\partial L}{\partial b_y} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot 1 \\
\nabla W_{hy} &= \frac{\partial L}{\partial W_{hy}} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot h_t \\
\nabla W_{xh} &= \frac{\partial L}{\partial W_{xh}} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot x_t \\
\nabla W_{hh} &= \frac{\partial L}{\partial W_{hh}} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot h_{t-1} \\
\nabla b_h &= \frac{\partial L}{\partial b_h} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot 1
\end{align*}
$$

However, if we don't use the softmax function, the output is linear ($y_t = z_{ty}$), and the term changes:
$$
\frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \\
\Rightarrow \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial h_t}
$$

So the factor $\sigma '(z_{ty})$ is removed.

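By propagating these gradients backward through the sequence of hidden states $h_t$, we can train the RNN model. In BPTT, the gradient reaching $h_t$ also includes the contribution passed back from $h_{t+1}$ through $W_{hh}$.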
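To see the backward pass over a whole sequence, here is a minimal sketch of BPTT for $W_{hh}$ with a linear output (no softmax) and a summed squared-error loss, compared against finite differences. It is an illustration under those assumptions, not the notebook's implementation. The helper names `forward`, `total_loss`, and `bptt_grad_W_hh` are hypothetical, as are the sizes and random data.

```python
import numpy as np

def forward(params, xs):
    W_xh, W_hh, W_hy, b_h, b_y = params
    h = np.zeros((W_hh.shape[0], 1))  # h_0 assumed to be zeros
    hs, ys = [], []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t.reshape(-1, 1) + W_hh @ h + b_h)
        hs.append(h)
        ys.append(W_hy @ h + b_y)  # linear output, so sigma' drops out
    return hs, ys

def total_loss(params, xs, targets):
    _, ys = forward(params, xs)
    return sum(np.sum((y - t.reshape(-1, 1)) ** 2) for y, t in zip(ys, targets))

def bptt_grad_W_hh(params, xs, targets):
    W_xh, W_hh, W_hy, b_h, b_y = params
    hs, ys = forward(params, xs)
    grad = np.zeros_like(W_hh)
    dh_next = np.zeros_like(hs[0])  # gradient arriving from step t+1
    for t in reversed(range(len(xs))):
        dy = 2 * (ys[t] - targets[t].reshape(-1, 1))
        dh = W_hy.T @ dy + dh_next   # own-step term plus future term
        dz = dh * (1 - hs[t] ** 2)   # tanh derivative
        h_prev = hs[t - 1] if t > 0 else np.zeros_like(hs[t])
        grad += dz @ h_prev.T
        dh_next = W_hh.T @ dz        # pass the gradient back to h_{t-1}
    return grad

rng = np.random.default_rng(2)
n_in, n_h, n_out, seq_len = 2, 3, 2, 5
shapes = [(n_h, n_in), (n_h, n_h), (n_out, n_h), (n_h, 1), (n_out, 1)]
params = [rng.normal(scale=0.5, size=s) for s in shapes]
xs = rng.normal(size=(seq_len, n_in))
targets = rng.normal(size=(seq_len, n_out))

analytic = bptt_grad_W_hh(params, xs, targets)

# Central finite differences on W_hh (index 1 in params).
eps = 1e-6
numeric = np.zeros_like(params[1])
for i, j in np.ndindex(*params[1].shape):
    plus = [p.copy() for p in params]
    minus = [p.copy() for p in params]
    plus[1][i, j] += eps
    minus[1][i, j] -= eps
    numeric[i, j] = (total_loss(plus, xs, targets)
                     - total_loss(minus, xs, targets)) / (2 * eps)

max_err = np.abs(numeric - analytic).max()
print(f"max |numeric - analytic| for W_hh: {max_err:.2e}")
```

The `dh_next` term is what makes this backpropagation "through time": without it, each step would only account for its own output error.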

### Implementation

We don't use batch training here, because the samples are not independent: they have a sequential relationship.
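One way to respect that ordering is to split a long sequence into consecutive windows rather than shuffled samples. The helper below is a hypothetical sketch (the name `consecutive_windows` and the toy data are assumptions, not part of the notebook):

```python
import numpy as np

def consecutive_windows(X, y, window):
    """Yield (X_chunk, y_chunk) pairs in their original time order."""
    for start in range(0, len(X), window):
        yield X[start:start + window], y[start:start + window]

# Toy sequence 0..9; the target is a hypothetical "next value" signal
# (np.roll wraps the last target around to the first value).
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.roll(X, -1, axis=0)

chunks = list(consecutive_windows(X, y, window=4))
for X_chunk, y_chunk in chunks:
    print(X_chunk.ravel(), "->", y_chunk.ravel())
```

Each chunk keeps its time steps in order, and the last chunk may be shorter than `window`.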

In [None]:
import numpy as np
from tqdm import tqdm

class RNN:
    def __init__(self, input_size, hidden_size, output_size, learning_rate = 0.01):
        '''
        Initiate the terms in the RNN according to the parameters.
        
        Parameters:
        input_size (int): size of the input
        hidden_size (int): neurons in the hidden layer
        output_size (int): size of the output
        learning_rate (float): learning rate
        
        Attributes:
        self.input_size (int): input size
        self.hidden_size (int): hidden layer size
        self.output_size (int): output size
        self.learning_rate (float): learning rate
        self.Wx (np.array): [#hidden size x #input size] x_t to h_t (z_th) weights
        self.Wh (np.array): [#hidden size x #hidden size] h_t-1 to h_t weights
        self.Wy (np.array): [#output size x #hidden size] h_t to y_t (z_ty) weights
        
        self.bh (np.array): [#hidden size x 1] x_t to h_t (z_th) biases
        self.by (np.array): [#output size x 1] h_t to y_t (z_ty) biases
        '''
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.learning_rate = learning_rate
        #Initialize all the weights and biases.
        self.Wx = np.random.randn(hidden_size, input_size) * 0.01
        self.Wh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.Wy = np.random.randn(output_size, hidden_size) * 0.01
        
        self.bh = np.zeros((hidden_size, 1))
        self.by = np.zeros((output_size, 1))
            
    def forward_propagation(self, X, h_prev=None):
        '''
        Forward propagation for a single time step.
        
        Parameters:
        X (np.array): t-th element of the sequence; [input size x 1]
        h_prev (np.array): previous hidden state; [hidden size x 1].
                           Defaults to zeros (start of a sequence).
        
        Returns:
        output (np.array): [output size x 1] linear output (no softmax)
        h (np.array): [hidden size x 1] current hidden state
        '''
        if h_prev is None:
            h_prev = np.zeros((self.hidden_size, 1))
        z = np.dot(self.Wx, X.reshape(-1, 1)) + np.dot(self.Wh, h_prev) + self.bh
        h = np.tanh(z)
        output = np.dot(self.Wy, h) + self.by
        
        return output, h
    
    def backward_propagation(self, X, y):
        '''
        Backward propagation (BPTT) to get the gradients of every weight and bias.
        
        Parameters:
        X (np.array): input sequence; [sequence length x input size]
        y (np.array): target sequence; [sequence length x output size]
        '''
        dWx, dWh, dWy = np.zeros_like(self.Wx), np.zeros_like(self.Wh), np.zeros_like(self.Wy)
        dbh, dby = np.zeros_like(self.bh), np.zeros_like(self.by)
        
        seq_len = X.shape[0]
        h = np.zeros((seq_len, self.hidden_size, 1)) # save all time step's hidden states
        
        # Forward Pass: carry the hidden state from one step to the next
        y_pred = [] # save all time step's prediction
        h_prev = np.zeros((self.hidden_size, 1))
        for t in range(seq_len):
            y_t, h[t] = self.forward_propagation(X[t], h_prev)
            h_prev = h[t]
            y_pred.append(y_t)
        
        # Backward Pass (BPTT)
        dh_next = np.zeros_like(h[0]) # hidden state gradient coming from the next time step
        for t in reversed(range(seq_len)):
            dy = self.loss_derivative(y_pred[t], np.reshape(y[t], (-1, 1)))
            
            dWy += np.dot(dy, h[t].T)
            dby += dy
            
            dh = np.dot(self.Wy.T, dy) + dh_next
            dz = dh * (1 - h[t] ** 2) # tanh derivative
            
            # h_{t-1} is zero at the first time step
            h_before = h[t - 1] if t > 0 else np.zeros_like(h[t])
            dWx += np.dot(dz, X[t].reshape(1, -1))
            dWh += np.dot(dz, h_before.T)
            dbh += dz
            
            dh_next = np.dot(self.Wh.T, dz)
            
        return dWx, dWh, dWy, dbh, dby

    def compute_loss(self, y_pred, y_true):
        # Mean Squared Error loss function
        return np.mean((y_true - y_pred) ** 2)
    
    def loss_derivative(self, y_pred, y_true):
        # Mean Squared Error loss' derivative function
        return 2 * (y_pred - y_true) # divided by batch_size; in this case, it is 1.
        
    def optimizer(self, dWx, dWh, dWy, dbh, dby):
        # Update weights and biases using the gradients (gradient descent)
        self.Wx -= self.learning_rate * dWx
        self.Wh -= self.learning_rate * dWh
        self.Wy -= self.learning_rate * dWy
        self.bh -= self.learning_rate * dbh
        self.by -= self.learning_rate * dby
                
    def train_mini_batch(self, X_train, y_train, batch_size=32, epoches=1000, print_rate = 10):
        m = X_train.shape[0]
        for epoch in tqdm(range(epoches)):
            # No shuffling: the time steps inside each chunk must stay in order.
            for i in range(0, m, batch_size):  # Repetition in batch_size measure
                # Extraction of one consecutive chunk of the sequence
                X_batch = X_train[i:i+batch_size]
                y_batch = y_train[i:i+batch_size]
                
                dWx, dWh, dWy, dbh, dby = self.backward_propagation(X_batch, y_batch)
                self.optimizer(dWx, dWh, dWy, dbh, dby)
                
            if epoch % print_rate == 0:
                # Run the whole sequence forward, carrying the hidden state
                h = np.zeros((self.hidden_size, 1))
                y_pred = []
                for t in range(m):
                    y_t, h = self.forward_propagation(X_train[t], h)
                    y_pred.append(y_t.ravel())
                # training loss calculation
                loss = self.compute_loss(np.array(y_pred), np.reshape(y_train, (m, -1)))
                print(f'{epoch}-th epoch MSE loss: {loss}')

TypeError: 'list' object cannot be interpreted as an integer