# Recurrent Neural Network

Definition: A recurrent neural network (RNN) is a class of artificial neural networks in which connections between nodes form a directed graph along a temporal sequence, which lets the network exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable-length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition and speech recognition.

## RNN

Process:

- Take the input $x_t$ at time step $t$ together with the hidden state $h_{t-1}$ from the previous time step $t-1$. For the first time step, the previous hidden state is usually initialized to all zeros.
- Compute the hidden state $h_t$ at time step $t$.
- Pass $h_t$ on as the previous hidden state for time step $t+1$.
- Repeat this process for all inputs.


Equations of the RNN:

It is a simple neural network with a single hidden layer. The input $x_t$ at time step $t$ is multiplied by the input-to-hidden weight matrix $W_{xh}$ and added to the previous hidden state $h_{t-1}$ multiplied by the hidden-to-hidden weight matrix $W_{hh}$; applying $\tanh$ to this sum gives the new hidden state $h_t$. The hidden state $h_t$ is then mapped to an output $y_t$ by multiplying it by the hidden-to-output weight matrix $W_{hy}$.

$$ h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t) $$
$$ y_t = W_{hy}h_t $$

where 
- $h_t$ is the hidden state at time step $t$
- $x_t$ is the input at time step $t$
- $y_t$ is the output at time step $t$
- $W_{hh}$ is the hidden-to-hidden weight matrix
- $W_{xh}$ is the input-to-hidden weight matrix
- $W_{hy}$ is the hidden-to-output weight matrix
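
The recurrence above can be sketched in a few lines of dependency-free Python. The toy weights, the sizes (2 inputs, 3 hidden units, 1 output), and the helper names `matvec` and `rnn_forward` are illustrative assumptions rather than part of any library; biases are omitted to match the equations as written.

```python
import math

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v (list of floats)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t) and y_t = W_hy h_t to a sequence."""
    h = [0.0] * len(W_hh)            # initial hidden state: all zeros
    ys = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_hh, h), matvec(W_xh, x))]
        h = [math.tanh(p) for p in pre]
        ys.append(matvec(W_hy, h))
    return ys, h

# Hypothetical toy weights: 2 inputs, 3 hidden units, 1 output.
W_xh = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.2]]
W_hh = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1], [0.2, -0.2, 0.1]]
W_hy = [[1.0, -1.0, 0.5]]

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # a sequence of three inputs
ys, h_last = rnn_forward(xs, W_xh, W_hh, W_hy)
print("outputs per step:", ys)
print("final hidden state:", h_last)
```

The same three weight matrices are reused at every time step; only the hidden state changes as the sequence is consumed.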

## LSTM

LSTM (long short-term memory) is a special kind of RNN capable of learning long-term dependencies. It was introduced by Hochreiter & Schmidhuber (1997) and was refined and popularized in subsequent work. LSTMs work well on a wide variety of sequence problems and are now widely used. They are explicitly designed to avoid the long-term dependency problem: remembering information over long spans is close to their default behavior rather than something they struggle to learn. All recurrent neural networks take the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer; in an LSTM, the repeating module instead contains gates that control a separate cell state.
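
The long-term dependency problem can be made concrete with a scalar sketch. For a one-unit RNN $h_t = \tanh(w\,h_{t-1} + x_t)$, the sensitivity of the final state to the initial state is the product of the per-step factors $w\,(1 - h_t^2)$, which shrinks geometrically when every factor is below 1 in magnitude. The weight `w = 0.9`, the constant input, and the function name are assumptions chosen only for illustration.

```python
import math

def sensitivity_to_initial_state(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)    # chain rule factor d h_t / d h_{t-1}
    return grad

for steps in (1, 10, 50):
    g = sensitivity_to_initial_state(w=0.9, xs=[0.5] * steps)
    print(f"{steps:3d} steps: d h_T / d h_0 = {g:.3e}")
```

With these settings each factor is at most 0.9, so the influence of early inputs on late states fades quickly as the sequence grows.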

Two activation functions play different roles inside the LSTM cell:

- `tanh` squashes the candidate cell state into the range $(-1, 1)$, centered on 0, so the values written to the cell state are bounded and symmetric and can either increase or decrease it. The variable `scaled_candidate_cell_state` in the NumPy example near the end of this document shows this squashing on concrete numbers.
- `sigmoid` squashes each gate into the range $(0, 1)$, so a gate acts as a soft switch that decides which information to keep and which to throw away.

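Gates:

- Forget gate: decides how much of the previous cell state $c_{t-1}$ to keep.
- Input gate: decides how much of the candidate cell state to write into the cell state.
- Output gate: decides how much of the cell state to expose as the hidden state $h_t$.

The hidden state $h_t$ acts as short-term memory, and the cell state $c_t$ acts as long-term memory.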

<div align="center">
  <img src="images/lstm_architecture.png" alt="LSTM architecture" width="400" height="200" />
</div>

$$ f_t = \sigma(W_f[h_{t-1},x_t] + b_f) $$
$$ i_t = \sigma(W_i[h_{t-1},x_t] + b_i) $$
$$ o_t = \sigma(W_o[h_{t-1},x_t] + b_o) $$
$$ g_t = \tanh(W_g[h_{t-1},x_t] + b_g) $$
$$ c_t = f_t * c_{t-1} + i_t * g_t $$
$$ h_t = o_t * \tanh(c_t) $$

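where
- $\sigma$ is the sigmoid function, $*$ is element-wise multiplication, and $[h_{t-1},x_t]$ is the concatenation of the previous hidden state and the current input
- $f_t$ is the forget gate
- $i_t$ is the input gate
- $o_t$ is the output gate
- $g_t$ is the candidate cell state
- $c_t$ is the cell state
- $h_t$ is the hidden state
- $W_f$ is the weight matrix of the forget gate
- $W_i$ is the weight matrix of the input gate
- $W_o$ is the weight matrix of the output gate
- $W_g$ is the weight matrix of the candidate cell state
- $b_f$ is the bias of the forget gate
- $b_i$ is the bias of the input gate
- $b_o$ is the bias of the output gate
- $b_g$ is the bias of the candidate cell state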
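
As a minimal sketch, the six LSTM equations above can be written out with plain Python lists. The helper names (`affine`, `lstm_step`), the `params` layout, and the bias values are assumptions made for this illustration. The demo sets all weights to zero, pushes the forget gate towards 1 and the input gate towards 0 through their biases, and shows that the cell state is then carried across many steps almost unchanged.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'g' to (W, b) pairs."""
    concat = h_prev + x                      # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], concat)]
    i = [sigmoid(a) for a in affine(*params["i"], concat)]
    o = [sigmoid(a) for a in affine(*params["o"], concat)]
    g = [math.tanh(a) for a in affine(*params["g"], concat)]
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

hidden, inputs = 2, 1                        # illustrative sizes
zeros_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {
    "f": (zeros_W, [6.0] * hidden),          # forget gate close to 1: keep the cell state
    "i": (zeros_W, [-6.0] * hidden),         # input gate close to 0: write almost nothing
    "o": (zeros_W, [0.0] * hidden),
    "g": (zeros_W, [0.0] * hidden),
}

h, c = [0.0, 0.0], [1.0, -1.0]
for _ in range(20):
    h, c = lstm_step([0.3], h, c, params)
print("cell state after 20 steps:", c)
print("hidden state after 20 steps:", h)
```

Because the cell state is updated additively and scaled only by the forget gate, a forget gate near 1 lets information persist, which is the mechanism behind the long-term memory role of $c_t$.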



## GRU

The gated recurrent unit (GRU) is a gating mechanism for recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. A GRU is similar to a long short-term memory (LSTM) unit with a forget gate but has fewer parameters, because it lacks an output gate. Its performance on certain polyphonic music modeling and speech signal modeling tasks was found to be similar to that of the LSTM, and GRUs have been shown to perform better on certain smaller datasets.
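
The parameter saving can be estimated by counting weight blocks in the single-layer formulation used in this document: the LSTM has four blocks (forget, input, output, candidate), while the GRU has three (update, reset, candidate). The function names and the sizes below are illustrative assumptions; the `bias` flag exists because the GRU equations in this document are written without bias terms. Some libraries keep separate input and recurrent bias vectors, which changes the totals.

```python
def lstm_param_count(input_size, hidden_size):
    """Four blocks, each a hidden x (hidden + input) matrix plus a bias vector."""
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

def gru_param_count(input_size, hidden_size, bias=True):
    """Three blocks, each a hidden x (hidden + input) matrix, optionally with a bias."""
    per_block = hidden_size * (hidden_size + input_size)
    if bias:
        per_block += hidden_size
    return 3 * per_block

n_in, n_hid = 28, 128                        # illustrative sizes
print("LSTM parameters:", lstm_param_count(n_in, n_hid))
print("GRU parameters: ", gru_param_count(n_in, n_hid))
```

With biases included on both sides, the GRU has three quarters of the LSTM's parameters for the same input and hidden sizes.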

<div align="center">
  <img src="images/gru_architecture.png" alt="GRU architecture" width="500" height="400" />
</div>

$$ z_t = \sigma(W_z[h_{t-1},x_t]) $$
$$ r_t = \sigma(W_r[h_{t-1},x_t]) $$
$$ \tilde{h_t} = \tanh(W[r_t * h_{t-1},x_t]) $$
$$ h_t = (1-z_t) * h_{t-1} + z_t * \tilde{h_t} $$

where
- $z_t$ is the update gate
- $r_t$ is the reset gate
- $\tilde{h_t}$ is the candidate hidden state
- $h_t$ is the hidden state
- $W_z$ is the weight matrix of the update gate
- $W_r$ is the weight matrix of the reset gate
- $W$ is the weight matrix of the candidate hidden state
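
A single GRU step can be sketched with plain Python lists as follows. In this sketch the reset gate scales the previous hidden state before it enters the candidate, and the update gate then interpolates between the old state and the candidate. The helper names and the toy weights (hidden size 2, input size 1) are assumptions for illustration only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v (list of floats)."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def gru_step(x, h_prev, W_z, W_r, W):
    """One GRU step without bias terms."""
    concat = h_prev + x                                      # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, concat)]            # update gate
    r = [sigmoid(a) for a in matvec(W_r, concat)]            # reset gate
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x   # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(W, gated)]       # candidate hidden state
    return [(1.0 - z_k) * h_k + z_k * c_k
            for z_k, h_k, c_k in zip(z, h_prev, h_tilde)]

# Hypothetical toy weights, each of shape 2 x 3 (hidden x (hidden + input)).
W_z = [[0.2, -0.1, 0.5], [0.0, 0.3, -0.4]]
W_r = [[-0.3, 0.1, 0.2], [0.4, 0.0, 0.1]]
W = [[0.6, -0.2, 0.3], [0.1, 0.5, -0.7]]

h_new = gru_step([1.0], [0.5, -0.5], W_z, W_r, W)
print("new hidden state:", h_new)
```

Since the new state is a convex combination of the previous state and a tanh output, it stays inside $(-1, 1)$ whenever the previous state does.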

## Implementation

### RNN

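```python
import torch
import torch.nn as nn


class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        self.hidden_size = hidden_size

        # Both layers read the concatenation [x_t, h_{t-1}].
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        # tanh keeps the hidden state bounded, matching h_t = tanh(...) above.
        hidden = torch.tanh(self.i2h(combined))
        # The output is read from [x_t, h_{t-1}] rather than from h_t as in
        # y_t = W_hy h_t, and is then turned into log-probabilities.
        output = self.softmax(self.i2o(combined))
        return output, hidden

    def initHidden(self):
        # Zero hidden state for a batch of size 1.
        return torch.zeros(1, self.hidden_size)
```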

### LSTM

```python
import torch
import torch.nn as nn


class LSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        self.hidden_size = hidden_size

        # One linear layer per gate, each reading [x_t, h_{t-1}].
        self.i2f = nn.Linear(input_size + hidden_size, hidden_size)  # forget gate
        self.i2i = nn.Linear(input_size + hidden_size, hidden_size)  # input gate
        self.i2o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate
        self.i2g = nn.Linear(input_size + hidden_size, hidden_size)  # candidate cell state
        self.i2y = nn.Linear(input_size + hidden_size, output_size)  # readout
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden, cell):
        combined = torch.cat((input, hidden), 1)
        # The gates get their own names so that the input tensor is still
        # available, with its original width, for the readout below.
        forget_gate = torch.sigmoid(self.i2f(combined))
        input_gate = torch.sigmoid(self.i2i(combined))
        output_gate = torch.sigmoid(self.i2o(combined))
        candidate = torch.tanh(self.i2g(combined))
        cell = forget_gate * cell + input_gate * candidate
        hidden = output_gate * torch.tanh(cell)
        # Read out log-probabilities from [x_t, h_t].
        output = self.softmax(self.i2y(torch.cat((input, hidden), 1)))
        return output, hidden, cell

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

    def initCell(self):
        return torch.zeros(1, self.hidden_size)
```

### GRU

```python
import torch
import torch.nn as nn


class GRU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()

        self.hidden_size = hidden_size

        self.i2z = nn.Linear(input_size + hidden_size, hidden_size)  # update gate
        self.i2r = nn.Linear(input_size + hidden_size, hidden_size)  # reset gate
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)  # candidate hidden state
        self.i2y = nn.Linear(input_size + hidden_size, output_size)  # readout
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        update = torch.sigmoid(self.i2z(combined))
        reset = torch.sigmoid(self.i2r(combined))
        # The reset gate scales the previous hidden state before it enters
        # the candidate.
        new = torch.tanh(self.i2h(torch.cat((input, reset * hidden), 1)))
        # Interpolate between the previous hidden state and the candidate.
        hidden = (1 - update) * hidden + update * new
        # Read out log-probabilities from [x_t, h_t].
        output = self.softmax(self.i2y(torch.cat((input, hidden), 1)))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
```


## References

- https://colah.github.io/posts/2015-08-Understanding-LSTMs/ 
- https://www.youtube.com/watch?v=YCzL96nL7j0 

## Examples

### LSTM gate arithmetic in NumPy

```python
import numpy as np

# Sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Simulate forget gate, input gate, and output gate values
forget_gate_value = sigmoid(0.3)  # Example value between 0 and 1
input_gate_value = sigmoid(0.7)  # Example value between 0 and 1
output_gate_value = sigmoid(0.5)  # Example value between 0 and 1
print("Forget Gate Value:", forget_gate_value, "Input Gate Value:", input_gate_value, "Output Gate Value:", output_gate_value)

# Simulate the previous cell state and the candidate values before tanh
prev_cell_state = np.array([0.2, 0.1, 0.4])  # Example previous cell state
candidate_cell_state = np.array([0.6, 0.3, -0.2])  # Example candidate values (pre-activation)
# Apply the tanh function to the candidate cell state
scaled_candidate_cell_state = np.tanh(candidate_cell_state)

print("Candidate Cell State (Before tanh):", candidate_cell_state)
print("Scaled Candidate Cell State (After tanh):", scaled_candidate_cell_state)

# Update the cell state: c_t = f_t * c_{t-1} + i_t * g_t, where g_t is the
# tanh-scaled candidate
updated_cell_state = forget_gate_value * prev_cell_state + input_gate_value * scaled_candidate_cell_state
# Expose part of the cell state as the hidden state: h_t = o_t * tanh(c_t)
hidden_state = output_gate_value * np.tanh(updated_cell_state)

print("Updated Cell State:", updated_cell_state)
print("Hidden State:", hidden_state)
```

Output:

```
Forget Gate Value: 0.574442516811659 Input Gate Value: 0.6681877721681662 Output Gate Value: 0.6224593312018546
Candidate Cell State (Before tanh): [ 0.6  0.3 -0.2]
Scaled Candidate Cell State (After tanh): [ 0.53704957  0.29131261 -0.19737532]
Updated Cell State: [0.47373846 0.25209578 0.09789323]
Hidden State: [0.27463837 0.15367756 0.06074065]
```


### Character-level RNN in NumPy

```python
# Minimal character-level RNN: it reads a plain-text file and learns to
# predict the next character from the characters seen so far.

import numpy as np

# data I/O
with open('input.txt', 'r') as f:  # should be a simple plain-text file
    data = f.read()
chars = list(set(data))  # set() returns the unique characters
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
# e.g. "data has 1115394 characters, 65 unique."

# create two dictionaries to map characters to integers and integers to characters
char_to_ix = {ch: i for i, ch in enumerate(chars)}  # character to index
ix_to_char = {i: ch for i, ch in enumerate(chars)}  # index to character

# hyperparameters
hidden_size = 100  # size of hidden layer of neurons
seq_length = 25  # number of steps to unroll the RNN for
learning_rate = 1e-1  # learning rate for gradient descent

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size) * 0.01  # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size) * 0.01  # hidden to output
bh = np.zeros((hidden_size, 1))  # hidden bias
by = np.zeros((vocab_size, 1))  # output bias


def lossFun(inputs, targets, hprev):
    """
    inputs, targets are both lists of integers.
    hprev is an Hx1 array holding the initial hidden state.
    Returns the loss, gradients on model parameters, and last hidden state.
    """
    # Inputs, hidden states, outputs, and probabilities, keyed by time step.
    # With seq_length = 25, hidden_size = 100 and a 65-character vocabulary,
    # each dict holds 25 column vectors: xs, ys, ps of size 65, hs of size 100.
    # Dicts are used instead of lists because hs needs an entry at index -1
    # to compute the hidden state at step 0.
    xs, hs, ys, ps = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)  # init with previous hidden state
    loss = 0

    # forward pass
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))  # encode in 1-of-k representation
        # NOTE: the source cell breaks off after the line above. The rest of
        # this function is a reconstruction from the docstring and the RNN
        # equations, not the author's original code.
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + bh)  # hidden state
        ys[t] = np.dot(Why, hs[t]) + by  # unnormalized log probabilities
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))  # softmax probabilities
        loss += -np.log(ps[t][targets[t], 0])  # cross-entropy loss

    # backward pass: accumulate gradients from the last step to the first
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dbh, dby = np.zeros_like(bh), np.zeros_like(by)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1  # gradient of softmax + cross-entropy w.r.t. ys[t]
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext  # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh  # backprop through tanh
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t - 1].T)
        dhnext = np.dot(Whh.T, dhraw)
    for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
        np.clip(dparam, -5, 5, out=dparam)  # clip to limit exploding gradients
    return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs) - 1]
```


### LSTM classifier on MNIST (TensorFlow 1.x)

```python
# LSTM classifier that reads each MNIST image as a sequence of 28 rows.
# Written against the TensorFlow 1.x API: tf.contrib and
# tensorflow.examples.tutorials are not available in TensorFlow 2.x.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
# (The line above is a Jupyter magic; remove it when running as a plain script.)

# import mnist data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# define hyperparameters
learning_rate = 0.001
training_iters = 100000
batch_size = 128
display_step = 10

# network parameters
n_input = 28 # MNIST data input (img shape: 28*28), one image row per step
n_steps = 28 # timesteps
n_hidden = 128 # hidden layer num of features
n_classes = 10 # MNIST total classes (0-9 digits)

# tf graph input
x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_classes])

# define weights
weights = {
    'out': tf.Variable(tf.random_normal([n_hidden, n_classes]))
}

biases = {
    'out': tf.Variable(tf.random_normal([n_classes]))
}

# define rnn model
def RNN(x, weights, biases):
    # unstack to get a list of 'n_steps' tensors of shape (batch_size, n_input)
    x = tf.unstack(x, n_steps, 1)

    # define lstm cell
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden, forget_bias=1.0)

    # get lstm cell output
    outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, x, dtype=tf.float32)

    # linear activation, using rnn inner loop last output
    return tf.matmul(outputs[-1], weights['out']) + biases['out']

# construct model
pred = RNN(x, weights, biases)

# define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

# evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))

# calculate accuracy
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# initialize the variables
init = tf.global_variables_initializer()

# launch the graph; the session is closed automatically when the block exits
with tf.Session() as sess:
    sess.run(init)
    step = 1

    # keep training until reach max iterations
    while step * batch_size < training_iters:
        batch_x, batch_y = mnist.train.next_batch(batch_size)

        # reshape data to get 28 seq of 28 elements
        batch_x = batch_x.reshape((batch_size, n_steps, n_input))

        # run optimization op (backprop)
        sess.run(optimizer, feed_dict={x: batch_x, y: batch_y})

        if step % display_step == 0:
            # calculate batch accuracy
            acc = sess.run(accuracy, feed_dict={x: batch_x, y: batch_y})

            # calculate batch loss
            loss = sess.run(cost, feed_dict={x: batch_x, y: batch_y})

            print("Iter " + str(step*batch_size) + ", Minibatch Loss= " + \
                 "{:.6f}".format(loss) + ", Training Accuracy= " + \
                 "{:.5f}".format(acc))
        step += 1
    print("Optimization Finished!")

    # calculate accuracy for 128 mnist test images
    test_len = 128
    test_data = mnist.test.images[:test_len].reshape((-1, n_steps, n_input))
    test_label = mnist.test.labels[:test_len]
    print("Testing Accuracy: ", \
         sess.run(accuracy, feed_dict={x: test_data, y: test_label}))

    # visualize the result (still inside the session, which `pred` needs)
    test_pred = sess.run(pred, feed_dict={x: test_data})
    test_pred = np.argmax(test_pred, axis=1)
    test_label = np.argmax(test_label, axis=1)

    plt.figure(figsize=(12, 8))
    for i in range(15):
        plt.subplot(3, 5, i+1)
        plt.imshow(test_data[i].reshape((28, 28)), cmap='gray')
        plt.title('label={}, pred={}'.format(test_label[i], test_pred[i]))
        plt.axis('off')
    plt.show()
```








### Feedforward network on XOR in NumPy

```python
# Two-layer feedforward network trained on XOR with NumPy.
# It has no recurrent connection, so it serves as a feedforward baseline
# rather than an RNN.

import numpy as np

# define sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of the sigmoid, expressed in terms of the sigmoid OUTPUT:
# if s = sigmoid(z), then d s / d z = s * (1 - s), so pass in s, not z
def sigmoid_derivative(x):
    return x * (1 - x)

# define hyperparameters
learning_rate = 0.1
training_iters = 100000
display_step = 10000

# network parameters
n_input = 2
n_hidden = 2
n_output = 1

# initialize weights
W_hidden = np.random.uniform(size=(n_input, n_hidden))
b_hidden = np.random.uniform(size=(1, n_hidden))
W_output = np.random.uniform(size=(n_hidden, n_output))
b_output = np.random.uniform(size=(1, n_output))

# training dataset (XOR truth table)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# training
for i in range(training_iters):
    # forward propagation
    hidden_layer_input = np.dot(X, W_hidden) + b_hidden
    hidden_layer_output = sigmoid(hidden_layer_input)
    output_layer_input = np.dot(hidden_layer_output, W_output) + b_output
    output_layer_output = sigmoid(output_layer_input)

    # backpropagation
    error = y - output_layer_output
    d_output = error * sigmoid_derivative(output_layer_output)
    error_hidden_layer = d_output.dot(W_output.T)
    d_hidden_layer = error_hidden_layer * sigmoid_derivative(hidden_layer_output)

    # update weights (error is target minus prediction, so the updates are added)
    W_output += hidden_layer_output.T.dot(d_output) * learning_rate
    b_output += np.sum(d_output, axis=0, keepdims=True) * learning_rate
    W_hidden += X.T.dot(d_hidden_layer) * learning_rate
    b_hidden += np.sum(d_hidden_layer, axis=0, keepdims=True) * learning_rate

    # print mean absolute error
    if i % display_step == 0:
        print('Iter {}, Loss: {:.6f}'.format(i, np.mean(np.abs(error))))

# testing
hidden_layer_input = np.dot(X, W_hidden) + b_hidden
hidden_layer_output = sigmoid(hidden_layer_input)
output_layer_input = np.dot(hidden_layer_output, W_output) + b_output
output_layer_output = sigmoid(output_layer_input)
print('Output: ', output_layer_output)
```

