# RNN (Recurrent Neural Network)

An RNN is a type of neural network designed for processing sequential data. It carries a hidden state from one time step to the next, which acts as a memory of past inputs, and it reuses the same weights at every time step. Training uses backpropagation through time (BPTT), i.e. backpropagation applied to the network unrolled over the time steps.

It is used in natural language processing (NLP), time series prediction, and speech/audio processing.

### Each Element Formulation

<div style="text-align: center;">
    <img src="./files/RNN/RNN_example.png" width="400" height="250">
</div>

<div style="text-align: center;">
    <img src="./files/RNN/RNN_memo.png" width="600" height="250">
</div>

An RNN has an **Input Layer**, a **Hidden State**, and an **Output Layer**.

The hidden state is calculated as:
$$
z_{th} = W_{xh}x_t + W_{hh}h_{t-1} + b_h \\
h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b_h) = \tanh(z_{th})
$$
where $h_t$ is the hidden state at time $t$, $W_{xh}$ is the weight from the input to the current hidden state, $W_{hh}$ is the weight from the previous hidden state to the current hidden state, and $b_h$ is the hidden-state bias.
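
As a minimal sketch of this update, the snippet below computes one hidden-state step with NumPy. The sizes (3 inputs, 4 hidden units), the random initialization, and the helper name `hidden_step` are illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def hidden_step(x_t, h_prev, Wxh, Whh, bh):
    """One hidden-state update: h_t = tanh(Wxh x_t + Whh h_{t-1} + b_h)."""
    z_th = Wxh @ x_t + Whh @ h_prev + bh  # pre-activation z_th
    return np.tanh(z_th)

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4  # assumed sizes, for illustration only
Wxh = rng.standard_normal((hidden_size, input_size)) * 0.01
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
bh = np.zeros((hidden_size, 1))

x_t = rng.standard_normal((input_size, 1))  # current input as a column vector
h_prev = np.zeros((hidden_size, 1))         # previous hidden state h_{t-1}
h_t = hidden_step(x_t, h_prev, Wxh, Whh, bh)
print(h_t.shape)  # (4, 1)
```

Because of the `tanh`, every entry of `h_t` lies strictly between -1 and 1.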

The output layer is calculated as:
$$
z_{ty} = W_{hy}h_t + b_y \\
y_t = \mathrm{softmax}(W_{hy}h_t + b_y) = \sigma (W_{hy}h_t + b_y) = \sigma (z_{ty})
$$
where $\sigma$ denotes the softmax function, $W_{hy}$ is the weight from the current hidden state to the logits $z_{ty}$, and $b_y$ is its bias.
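
The output computation can be sketched in the same way. The sizes and the stand-in hidden state below are assumptions for illustration. Subtracting the maximum logit inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax over a column vector of logits."""
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow
    return e / np.sum(e)

rng = np.random.default_rng(0)
hidden_size, output_size = 4, 2  # assumed sizes, for illustration only
Why = rng.standard_normal((output_size, hidden_size)) * 0.01
by = np.zeros((output_size, 1))
h_t = np.tanh(rng.standard_normal((hidden_size, 1)))  # stand-in hidden state

z_ty = Why @ h_t + by  # logits
y_t = softmax(z_ty)    # class probabilities
print(y_t.ravel(), y_t.sum())
```

The entries of `y_t` are positive and sum to 1, so they can be read as class probabilities.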

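As you can see, the input-to-hidden weights and bias ($W_{xh}$, $b_h$), the hidden-to-hidden weights ($W_{hh}$), and the hidden-to-output weights and bias ($W_{hy}$, $b_y$) do not depend on the time step; they are shared across all time steps.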
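
Weight sharing is easiest to see when the forward pass is written as a loop. The following is a sketch under assumed sizes and a zero initial hidden state; `rnn_forward` is a hypothetical helper name introduced here.

```python
import numpy as np

def rnn_forward(X, Wxh, Whh, Why, bh, by):
    """Run a vanilla RNN over X of shape (T, input_size).

    Returns the hidden states (T, hidden_size) and softmax outputs (T, output_size).
    """
    h = np.zeros((Whh.shape[0], 1))  # initial hidden state, assumed to be zero
    hidden_states, outputs = [], []
    for x_t in X:  # the same Wxh, Whh, Why, bh, by are used at every step
        h = np.tanh(Wxh @ x_t.reshape(-1, 1) + Whh @ h + bh)
        z_ty = Why @ h + by
        e = np.exp(z_ty - np.max(z_ty))  # numerically stable softmax
        hidden_states.append(h.ravel())
        outputs.append((e / np.sum(e)).ravel())
    return np.array(hidden_states), np.array(outputs)

rng = np.random.default_rng(0)
T, input_size, hidden_size, output_size = 5, 3, 4, 2  # assumed sizes
Wxh = rng.standard_normal((hidden_size, input_size)) * 0.01
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
Why = rng.standard_normal((output_size, hidden_size)) * 0.01
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))

X = rng.standard_normal((T, input_size))  # one sequence of T time steps
H, Y = rnn_forward(X, Wxh, Whh, Why, bh, by)
print(H.shape, Y.shape)  # (5, 4) (5, 2)
```

Only `h` changes from one iteration to the next; the parameters are created once and reused for all `T` steps.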

### Back Propagation 1

Using the input sequence, we compute the output predictions. From the error between the predictions and the true values, we can then optimize the weights using backpropagation.

First, we are going to get the gradients of the loss with respect to $h_t, b_y, W_{hy}$.

Let the loss be $L = \sum(y_t - y_t^{true})^2$, where $y_t$ is the output value after the softmax and $y_t^{true}$ is the one-hot encoded true value.
$$
L = \sum(y_t - y_t^{true})^2 \\
\frac{\partial L}{\partial y_t} = 2(y_{t} - y_{t}^{true})
$$
Let
$$
y_t = \sigma (W_{hy}h_t + b_y) \\
z_{ty} = W_{hy}h_t + b_y \\
\frac{\partial y_t}{\partial z_{ty}} = \sigma '(z_{ty})
$$
Because $\sigma$ is the softmax, $\sigma '(z_{ty})$ is the Jacobian matrix $\mathrm{diag}(y_t) - y_t y_t^\top$ rather than an element-wise derivative.

Then,
$$
\begin{align*}
\frac{\partial L}{\partial h_t} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \\
\frac{\partial L}{\partial b_y} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial b_y} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot 1 \\
\frac{\partial L}{\partial W_{hy}} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial W_{hy}} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot h_t
\end{align*}
$$
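
To make the chain rule concrete, here is a small numerical sketch of these three gradients for a single time step. The sizes and values are assumptions for illustration. The softmax derivative is written out as its Jacobian $\mathrm{diag}(y_t) - y_t y_t^\top$, and one entry of the $W_{hy}$ gradient is compared against a finite-difference estimate.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def loss_fn(Why, by, h_t, y_true):
    """Squared-error loss for one time step."""
    y_t = softmax(Why @ h_t + by)
    return np.sum((y_t - y_true) ** 2)

rng = np.random.default_rng(0)
hidden_size, output_size = 4, 3  # assumed sizes, for illustration only
Why = rng.standard_normal((output_size, hidden_size))
by = rng.standard_normal((output_size, 1))
h_t = np.tanh(rng.standard_normal((hidden_size, 1)))
y_true = np.array([[0.0], [1.0], [0.0]])  # one-hot target

# Analytic gradients via the chain rule
y_t = softmax(Why @ h_t + by)
dL_dy = 2 * (y_t - y_true)              # dL/dy_t
J = np.diag(y_t.ravel()) - y_t @ y_t.T  # softmax Jacobian dy_t/dz_ty
dL_dz = J.T @ dL_dy                     # dL/dz_ty
dL_dby = dL_dz                          # dz_ty/db_y = 1
dL_dWhy = dL_dz @ h_t.T                 # dz_ty/dW_hy = h_t
dL_dh = Why.T @ dL_dz                   # dz_ty/dh_t = W_hy

# Finite-difference check on one entry of W_hy
eps = 1e-6
Why_plus, Why_minus = Why.copy(), Why.copy()
Why_plus[0, 1] += eps
Why_minus[0, 1] -= eps
numeric = (loss_fn(Why_plus, by, h_t, y_true)
           - loss_fn(Why_minus, by, h_t, y_true)) / (2 * eps)
print(numeric, dL_dWhy[0, 1])  # the two values should agree closely
```

In matrix form the factors appear transposed and reordered (`Why.T @ dL_dz`, `dL_dz @ h_t.T`) so that the shapes line up; the chain of derivatives is the same as in the equations.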

### Back Propagation 2

Now, we are going to calculate the gradients of $W_{xh}, W_{hh}, b_h$. Since $h_t = \tanh(z_{th})$, we have $\frac{\partial h_t}{\partial z_{th}} = 1 - \tanh^2(z_{th}) = 1 - h_t^2$. Treating $h_{t-1}$ as a constant (that is, looking only at the contribution of time step $t$):

$$ \begin{align*}
\frac{\partial L}{\partial W_{xh}} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \cdot \frac{\partial h_t}{\partial z_{th}} \cdot \frac{\partial z_{th}}{\partial W_{xh}} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot x_t \\
\frac{\partial L}{\partial W_{hh}} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \cdot \frac{\partial h_t}{\partial z_{th}} \cdot \frac{\partial z_{th}}{\partial W_{hh}} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot h_{t-1} \\
\frac{\partial L}{\partial b_h} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \cdot \frac{\partial h_t}{\partial z_{th}} \cdot \frac{\partial z_{th}}{\partial b_h} \\
&= 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot 1
\end{align*} $$
To write these more compactly, we can reuse $\frac{\partial L}{\partial h_t}$ from the previous section:
$$ \begin{align*}
\frac{\partial L}{\partial h_t} &= \frac{\partial L}{\partial y_t} \cdot \frac{\partial y_t}{\partial z_{ty}} \cdot \frac{\partial z_{ty}}{\partial h_t} \\
&=2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy}
\end{align*} $$
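
The hidden-layer gradients for one time step can be sketched numerically as follows, reusing $\frac{\partial L}{\partial h_t}$ and the derivative of `tanh`, which is $1 - h_t^2$. The sizes and values are assumptions for illustration, and $h_{t-1}$ is held fixed, so this covers only the single-step contribution. One entry of the $W_{hh}$ gradient is compared against a finite-difference estimate.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def step_loss(Wxh, Whh, bh, Why, by, x_t, h_prev, y_true):
    """Squared-error loss of one time step, with h_{t-1} held fixed."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    y_t = softmax(Why @ h_t + by)
    return np.sum((y_t - y_true) ** 2)

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2  # assumed sizes
Wxh = rng.standard_normal((hidden_size, input_size))
Whh = rng.standard_normal((hidden_size, hidden_size))
Why = rng.standard_normal((output_size, hidden_size))
bh = np.zeros((hidden_size, 1))
by = np.zeros((output_size, 1))
x_t = rng.standard_normal((input_size, 1))
h_prev = np.tanh(rng.standard_normal((hidden_size, 1)))
y_true = np.array([[1.0], [0.0]])  # one-hot target

# Forward pass
h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
y_t = softmax(Why @ h_t + by)

# Backward pass
dL_dy = 2 * (y_t - y_true)
dL_dz_ty = y_t * (dL_dy - np.sum(dL_dy * y_t))  # softmax Jacobian applied to dL/dy_t
dL_dh = Why.T @ dL_dz_ty                        # compact term dL/dh_t
dL_dz_th = dL_dh * (1 - h_t ** 2)               # tanh'(z_th) = 1 - h_t^2
dL_dWxh = dL_dz_th @ x_t.T
dL_dWhh = dL_dz_th @ h_prev.T
dL_dbh = dL_dz_th

# Finite-difference check on one entry of W_hh
eps = 1e-6
Whh_plus, Whh_minus = Whh.copy(), Whh.copy()
Whh_plus[1, 2] += eps
Whh_minus[1, 2] -= eps
numeric = (step_loss(Wxh, Whh_plus, bh, Why, by, x_t, h_prev, y_true)
           - step_loss(Wxh, Whh_minus, bh, Why, by, x_t, h_prev, y_true)) / (2 * eps)
print(numeric, dL_dWhh[1, 2])  # the two values should agree closely
```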

### Generalization
The expressions above already hold for any time step $t$. Collecting all the gradients together:
$$
\begin{align*}
\nabla b_y &= \frac{\partial L}{\partial b_y} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot 1 \\
\nabla W_{hy} &= \frac{\partial L}{\partial W_{hy}} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot h_t \\
\nabla W_{xh} &= \frac{\partial L}{\partial W_{xh}} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot x_t \\
\nabla W_{hh} &= \frac{\partial L}{\partial W_{hh}} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot h_{t-1} \\
\nabla b_h &= \frac{\partial L}{\partial b_h} = 2(y_{t} - y_{t}^{true}) \cdot \sigma '(z_{ty}) \cdot W_{hy} \cdot (1 - h_t^2) \cdot 1
\end{align*}
$$

Here $1 - h_t^2$ is $\frac{\partial h_t}{\partial z_{th}}$, the derivative of $\tanh$. These are the contributions of a single time step $t$; because the weights are shared, the full gradient sums them over all time steps, and in BPTT $\frac{\partial L}{\partial h_t}$ additionally receives the gradient flowing back from $h_{t+1}$ through $W_{hh}$. By processing the $h_t$ sequentially, we can train the RNN model.

### Implementation

We do not use batch training here, because the samples are not independent; they have a sequential relationship.

In [None]:
import numpy as np
from tqdm import tqdm

class RNN:
    def __init__(self, input_size, hidden_size, output_size, learning_rate=0.01):
        '''
        Initialize the parameters of a single-layer (vanilla) RNN.

        Parameters:
        input_size (int): number of input features per time step
        hidden_size (int): number of neurons in the hidden state
        output_size (int): number of output classes
        learning_rate (float): step size for gradient descent

        Attributes:
        self.Wxh (np.array): [hidden_size x input_size] x_t to h_t (z_th) weights
        self.Whh (np.array): [hidden_size x hidden_size] h_t-1 to h_t (z_th) weights
        self.Why (np.array): [output_size x hidden_size] h_t to y_t (z_ty) weights
        self.bh (np.array): [hidden_size x 1] hidden state (z_th) biases
        self.by (np.array): [output_size x 1] output (z_ty) biases
        '''
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.learning_rate = learning_rate
        # Initialize all the weights and biases.
        self.Wxh = np.random.randn(hidden_size, input_size) * 0.01
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.Why = np.random.randn(output_size, hidden_size) * 0.01
        self.bh = np.zeros((hidden_size, 1))
        self.by = np.zeros((output_size, 1))

    def softmax(self, z):
        e = np.exp(z - np.max(z))  # Shift by the max for numerical stability
        return e / np.sum(e)

    def forward_propagation(self, X):
        '''
        Run the RNN over one sequence.

        Parameters:
        X (np.array): [T x input_size] input sequence, one row per time step

        Returns:
        y_pred (np.array): [T x output_size] softmax output for every time step
        '''
        T = X.shape[0]
        self.x = [X[t].reshape(-1, 1) for t in range(T)]
        # self.h[t + 1] holds h_t; self.h[0] is the initial hidden state (zeros)
        self.h = [np.zeros((self.hidden_size, 1))]
        self.y = []
        for t in range(T):
            z_th = self.Wxh @ self.x[t] + self.Whh @ self.h[-1] + self.bh
            self.h.append(np.tanh(z_th))
            z_ty = self.Why @ self.h[-1] + self.by
            self.y.append(self.softmax(z_ty))
        return np.hstack(self.y).T

    def backward_propagation(self, X, y):
        '''
        Backpropagation through time (BPTT) over one sequence.

        Parameters:
        X (np.array): [T x input_size] input sequence
        y (np.array): [T x output_size] one-hot true value for every time step

        Returns:
        grads (dict): gradients for Wxh, Whh, Why, bh, by, summed over time
        '''
        self.seq_len = X.shape[0]  # number of time steps
        self.forward_propagation(X)
        names = ("Wxh", "Whh", "Why", "bh", "by")
        grads = {name: np.zeros_like(getattr(self, name)) for name in names}
        dh_next = np.zeros((self.hidden_size, 1))  # gradient arriving from step t+1

        for t in range(self.seq_len - 1, -1, -1):
            y_t = self.y[t]
            dy = self.loss_derivative(y_t, y[t].reshape(-1, 1))  # d L / d y_t
            dz_ty = y_t * (dy - np.sum(dy * y_t))  # softmax Jacobian applied to dy
            grads["Why"] += dz_ty @ self.h[t + 1].T  # * d z_ty / d W_hy = h_t
            grads["by"] += dz_ty
            dh = self.Why.T @ dz_ty + dh_next  # d L / d h_t, including the future steps
            dz_th = dh * (1 - self.h[t + 1] ** 2)  # tanh'(z_th) = 1 - h_t^2
            grads["Wxh"] += dz_th @ self.x[t].T  # * d z_th / d W_xh = x_t
            grads["Whh"] += dz_th @ self.h[t].T  # * d z_th / d W_hh = h_t-1
            grads["bh"] += dz_th
            dh_next = self.Whh.T @ dz_th  # passed back to step t-1
        return grads

    def compute_loss(self, y_pred, y_true):
        # Squared error summed over classes and averaged over time steps
        return np.sum((y_pred - y_true) ** 2) / y_true.shape[0]

    def loss_derivative(self, y_pred, y_true):
        # Derivative of the loss above with respect to y_t of one time step
        return 2 * (y_pred - y_true) / self.seq_len

    def apply_gradient(self, grads):
        # Update weights and biases using the gradients (gradient descent)
        for name, grad in grads.items():
            param = getattr(self, name)
            param -= self.learning_rate * grad  # in-place update

    def train(self, X, y, epochs=1000, print_rate=10):
        for epoch in tqdm(range(epochs)):
            grads = self.backward_propagation(X, y)
            if (epoch + 1) % print_rate == 0:
                y_pred = self.forward_propagation(X)
                loss = self.compute_loss(y_pred, y)
                print(f"Epoch {epoch+1}, Loss: {loss}")
            self.apply_gradient(grads)

# Example usage:
num_samples = 1000  # Number of time steps in the sequence
input_size = 3  # Number of input features
output_size = 2  # Output layer with 2 neurons

hidden_size = 4  # Hidden state with 4 neurons
learning_rate = 0.01

# Instantiate the model
model = RNN(input_size, hidden_size, output_size, learning_rate)

# Example data generator
def generate_data(num_samples, input_size, output_size):
    # Generate X (from std norm dist); each row is one time step
    X = np.random.randn(num_samples, input_size)

    # Generate labeled y (one-hot encoded)
    y = np.zeros((num_samples, output_size))

    for i in range(num_samples):
        # Randomly assign one of the output classes
        label = np.random.randint(0, output_size)
        y[i, label] = 1

    return X, y

# Example input data (X) and labels (y)
X, y = generate_data(num_samples, input_size, output_size)

# Train the model
model.train(X, y, epochs=50)