<p style="text-align: center; font-size:50px;">About RNN</p>

<p style="text-align: center; font-size:30px;">Theory</p>

#### RNN is a popular type of neural network, mainly used for language tasks and other sequential problems such as time series. 
#### It stands for Recurrent Neural Network.
#### Unlike a conventional feed-forward deep neural network, in which layers of nodes connected by weights map a fixed set of inputs to outputs, an RNN specializes in sequential data. 
#### It uses hidden states, updated after every input, to 'remember' or 'store' information along the way. 
#### In theory, this is important for sequential tasks such as language modeling, since human conversation takes past words into account with varying degrees of importance. 
#### Of course, the plain RNN is the simplest architecture in this family of networks, and there are more advanced ones such as LSTMs and GRUs, but I will only talk about the RNN in this notebook.
#### The essential formula for an RNN is as follows:

$$\LARGE h_{t+1} = f(w_x \cdot x_t + w_h \cdot h_t + b_h) $$
$$\LARGE y_t = f(w_y \cdot h_t + b_y) $$

#### Here, $h$ represents the hidden state within the RNN and $f$ is an activation function. 
#### In RNNs, the hidden state is both an output of one step and an input to the next.
#### So in one cycle, the previous hidden state ($h_t$) and the input at that time ($x_t$) are fed in. 
#### The cycle outputs a new hidden state ($h_{t+1}$), which is fed into the next cycle and can also be used to calculate the output through a conventional feed-forward layer. 
#### One thing to keep in mind is that the weights and biases used are the **same throughout** all of these cycles. 
#### Training with backpropagation is responsible for finding the optimal values of these weights and biases. 
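
#### As a rough illustration of the two formulas above, here is a minimal NumPy sketch of the recurrence. The layer sizes, the random weights, the choice of `tanh` for $f$ in the hidden update, and the identity for $f$ in the output are all assumptions made for this example; the names `rnn_step` and `readout` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (assumed, not taken from the model below)
input_size, hidden_size, output_size = 4, 3, 4

# One set of parameters, reused at every time step
w_x = rng.normal(size=(hidden_size, input_size))
w_h = rng.normal(size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
w_y = rng.normal(size=(output_size, hidden_size))
b_y = np.zeros(output_size)

def rnn_step(x_t, h_t):
    # h_{t+1} = f(w_x . x_t + w_h . h_t + b_h), with f assumed to be tanh
    return np.tanh(w_x @ x_t + w_h @ h_t + b_h)

def readout(h_t):
    # y_t = f(w_y . h_t + b_y), with f assumed to be the identity here
    return w_y @ h_t + b_y

# Three one-hot inputs fed in order
xs = np.eye(input_size)[[0, 2, 1]]

h = np.zeros(hidden_size)  # initial hidden state
hidden_states = []
for x_t in xs:
    h = rnn_step(x_t, h)   # the same weights are used in every cycle
    hidden_states.append(h)

y = readout(hidden_states[-1])
```

#### Note that the loop never creates new weights: only the hidden state `h` changes from one cycle to the next.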

<p style="text-align: center; font-size:30px;">Code Implementation</p>

In [13]:
import torch 
import torch.nn as nn 
import numpy as np 
import torchinfo
from torchinfo import summary

#### For simplicity, we will make up a small dataset ourselves.

In [3]:
text = ['hey how are you', 'good i am fine', 'have a nice day']

chars = set(''.join(text)) # will create a set of unique characters in the text array above

int2char = dict(enumerate(chars))

char2int = {char: ind for ind, char in int2char.items()}

#### `int2char` is a dictionary with the indices as keys and the characters as values. 
#### `char2int` is the reverse dictionary, with the characters as keys and their respective indices in `int2char` as values. 

In [5]:
print(f"This is what int2char looks like {int2char}\n") 
print(f"This is what char2int looks like {char2int}") 

This is what int2char looks like {0: 'u', 1: ' ', 2: 'r', 3: 'd', 4: 'y', 5: 'g', 6: 'v', 7: 'c', 8: 'o', 9: 'f', 10: 'e', 11: 'n', 12: 'h', 13: 'i', 14: 'm', 15: 'w', 16: 'a'}

This is what char2int looks like {'u': 0, ' ': 1, 'r': 2, 'd': 3, 'y': 4, 'g': 5, 'v': 6, 'c': 7, 'o': 8, 'f': 9, 'e': 10, 'n': 11, 'h': 12, 'i': 13, 'm': 14, 'w': 15, 'a': 16}


#### We will also apply **padding** to our data. 
#### While RNNs are typically able to take in variably sized inputs, we usually want to feed the data in batches and thus need to ensure all sequences are the same size. 
#### Sentences that are too short are filled up with a padding value, and those that are too long are trimmed. 
#### In our case, we will use the length of the longest sentence as the standard and pad the other sentences with space characters.

In [6]:
maxlen = len(max(text, key=len))

for i in range(len(text)):
    while len(text[i]) < maxlen: 
        text[i] += ' '
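
#### The cell above only pads, since the longest sentence sets the standard length. For a fixed target length, where longer sentences must also be trimmed, a small helper along these lines might work. The name `pad_or_trim` and the target length of 14 are assumptions for illustration.

```python
def pad_or_trim(sentence, length, pad_char=' '):
    """Return `sentence` cut or right-padded to exactly `length` characters."""
    # Slicing trims sentences that are too long; ljust pads those that are too short
    return sentence[:length].ljust(length, pad_char)

sentences = ['hey how are you', 'good i am fine', 'have a nice day']
target_len = 14  # assumed fixed length for this example

fixed = [pad_or_trim(s, target_len) for s in sentences]
```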

#### Since this is a sequential prediction task, we need to prepare the raw text data accordingly. 
#### The input sequence is the sentence without its last character, and the target sequence is the sentence without its first character, so that each target character is the 'correct answer' the model should predict for the input character at the same position. 

In [7]:
# Creating lists that will hold our input and target sequences
input_seq = []
target_seq = []

for i in range(len(text)):
    # Remove last character for input sequence
    input_seq.append(text[i][:-1])

    # Remove first character for target sequence
    target_seq.append(text[i][1:])
    print("Input Sequence: {}\nTarget Sequence: {}".format(input_seq[i], target_seq[i]))

Input Sequence: hey how are yo
Target Sequence: ey how are you
Input Sequence: good i am fine
Target Sequence: ood i am fine 
Input Sequence: have a nice da
Target Sequence: ave a nice day


#### The target sequence will always be one time step ahead of the input sequence.
#### Now let us convert the sequences of characters to sequences of integers.

In [8]:
for i in range(len(text)):
    input_seq[i] = [char2int[character] for character in input_seq[i]]
    target_seq[i] = [char2int[character] for character in target_seq[i]]

In [9]:
dict_size = len(char2int) # This will dictate the size of our one-hot vector
seq_len = maxlen - 1
batch_size = len(text)

def one_hot_encode(sequence, dict_size, seq_len, batch_size):
    # Creating a multi-dimensional array of zeros with the desired output shape
    features = np.zeros((batch_size, seq_len, dict_size), dtype=np.float32)
    
    # Replacing the 0 at the relevant character index with a 1 to represent that character
    for i in range(batch_size):
        for u in range(seq_len):
            features[i, u, sequence[i][u]] = 1
    return features
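
#### The nested loops above can also be written as a single indexing operation. The sketch below is an alternative, not the version used in the rest of this notebook; the name `one_hot_encode_vectorized` is hypothetical, and it assumes all sequences have the same length.

```python
import numpy as np

def one_hot_encode_vectorized(sequences, dict_size):
    # Integer indices with shape (batch_size, seq_len)
    indices = np.asarray(sequences, dtype=np.int64)
    # Row k of the identity matrix is the one-hot vector for index k,
    # so indexing with `indices` gives shape (batch_size, seq_len, dict_size)
    return np.eye(dict_size, dtype=np.float32)[indices]

demo = [[0, 2, 1], [3, 3, 0]]
features = one_hot_encode_vectorized(demo, dict_size=4)
```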

In [10]:
input_seq = one_hot_encode(input_seq, dict_size, seq_len, batch_size)

# Making them tensors
input_seq = torch.from_numpy(input_seq)
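# Class indices for CrossEntropyLoss must be integers, so build a long tensor directly
target_seq = torch.tensor(target_seq, dtype=torch.long)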

#### Now that our data are in good shape, we will implement the model.
#### For this model, we'll be using 1 layer of RNN followed by a fully connected layer for the outputs. 

In [40]:
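# Use Apple's MPS backend when available, otherwise fall back to the CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')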

In [41]:
class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()

        self.hidden_dim = hidden_dim
        self.n_layers = n_layers 

        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True) 
        self.fc = nn.Linear(hidden_dim, output_size) 

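    def forward(self, x):
        # x has shape (batch_size, seq_len, input_size) because batch_first=True
        batch_size = x.size(0)

        # Start every forward pass from a zero hidden state
        hidden = self.init_hidden(batch_size)

        # out holds the hidden state at every time step; hidden is the final one
        out, hidden = self.rnn(x, hidden)

        # Flatten to (batch_size * seq_len, hidden_dim) so the linear layer scores every step
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(out)

        return out, hidden

    def init_hidden(self, batch_size):
        # Zero tensor of shape (n_layers, batch_size, hidden_dim) on the model's own device
        param_device = next(self.parameters()).device
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=param_device)
        return hidden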
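
#### To make the tensor shapes in `forward` easier to follow, here is a small standalone sketch that traces them on the CPU. It uses the sizes from this notebook (3 sentences, 14 time steps, 17 characters, hidden size 12) but a dummy all-zero input, so the values themselves are meaningless.

```python
import torch
import torch.nn as nn

batch, seq_len, n_chars, hidden_dim = 3, 14, 17, 12

rnn = nn.RNN(n_chars, hidden_dim, num_layers=1, batch_first=True)
fc = nn.Linear(hidden_dim, n_chars)

# Dummy batch with shape (batch, seq_len, n_chars)
x = torch.zeros(batch, seq_len, n_chars)

# When no initial hidden state is passed, nn.RNN starts from zeros
out, h_n = rnn(x)

# out: hidden state at every step; h_n: hidden state after the last step
flat = out.contiguous().view(-1, hidden_dim)

# One row of character scores per (sentence, time step) pair
logits = fc(flat)

shapes = {
    'out': tuple(out.shape),
    'h_n': tuple(h_n.shape),
    'logits': tuple(logits.shape),
}
```

#### The flattened `logits` have one row per time step of every sentence, which is why the targets are flattened with `view(-1)` before being passed to the loss.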

#### Now let's instantiate the model on the chosen device and define our epochs, learning rate, loss function, and optimizer.

In [42]:
model = Model(dict_size, dict_size, 12, 1).to(device)

n_epochs = 100 
lr = 0.01 

criterion = nn.CrossEntropyLoss() 
optimizer = torch.optim.Adam(model.parameters(), lr = lr) 

#### Let's run our training loop.

In [46]:
# Training Run
for epoch in range(1, n_epochs + 1):
    optimizer.zero_grad() # Clears existing gradients from previous epoch
    output, hidden = model(input_seq.to(device))
    loss = criterion(output, target_seq.to(device).view(-1).long())
    loss.backward() # Does backpropagation and calculates gradients
    optimizer.step() # Updates the weights accordingly
    
    if epoch%10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))

Epoch: 10/100............. Loss: 2.4471
Epoch: 20/100............. Loss: 2.2288
Epoch: 30/100............. Loss: 1.9065
Epoch: 40/100............. Loss: 1.4787
Epoch: 50/100............. Loss: 1.0803
Epoch: 60/100............. Loss: 0.7544
Epoch: 70/100............. Loss: 0.5252
Epoch: 80/100............. Loss: 0.3736
Epoch: 90/100............. Loss: 0.2710
Epoch: 100/100............. Loss: 0.2033


<p style="text-align: center; font-size:30px;">Evaluation</p>

#### Let's see what kind of output we will get with our trained model.

In [50]:
# This function takes in the model and character as arguments and returns the next character prediction and hidden state
def predict(model, character):
    # One-hot encoding our input to fit into the model
    character = np.array([[char2int[c] for c in character]])
    character = one_hot_encode(character, dict_size, character.shape[1], 1)
    character = torch.from_numpy(character)
    character = character.to(device)
    
    out, hidden = model(character)

    prob = nn.functional.softmax(out[-1], dim=0).data
    # Taking the class with the highest probability score from the output
    char_ind = torch.max(prob, dim=0)[1].item()

    return int2char[char_ind], hidden

In [51]:
# This function takes the desired output length and input characters as arguments, returning the produced sentence
def sample(model, out_len, start='hey'):
    model.eval() # eval mode
    start = start.lower()
    # First off, run through the starting characters
    chars = [ch for ch in start]
    size = out_len - len(chars)
    # Now pass in the previous characters and get a new one
    for ii in range(size):
        char, h = predict(model, chars)
        chars.append(char)

    return ''.join(chars)

In [52]:
sample(model, 15, 'good') 

'good i am fine '

#### Guess we're getting somewhere :) 
#### Of course, the model has seen 'good' before, and it is only accurate when we feed in words that it saw during training. 
#### If we feed in words that it never saw, it would probably return inaccurate and awkward text to finish the sentence, as the data we trained on was extremely limited in size and scope.

In [57]:
sample(model, 15, 'breakfast')

KeyError: 'b'

#### As we can see, it won't even be able to process the word, as it does not have the letter b in its dictionary. 
#### This exposes a limitation of the current implementation, so let me talk more about the limitations of this model.

<p style="text-align: center; font-size:30px;">Model Limitations</p>

## Overfitting

#### The data is limited in scope and size, which means the model can only respond with text similar to what it was trained on.
#### If we started to ask about politics instead of greetings, it would have no idea how to respond and would throw gibberish back at us.

## Handling of unseen characters

#### Additionally, as we saw, the model is unable to handle unseen characters.
#### It needs to one-hot encode each character as a vector, and it cannot do so for a character that is missing from its dictionary. 
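
#### One common workaround, which this notebook does not implement, is to reserve an extra 'unknown' entry in the dictionary and map every unseen character to it. The sketch below is only an assumption of how that could look: the `<unk>` token, the `char2int_unk` dictionary, and the `encode` helper are all hypothetical, and the model would have to be retrained with the larger dictionary for this to be useful.

```python
UNK = '<unk>'  # hypothetical placeholder for any unseen character

text = ['hey how are you', 'good i am fine', 'have a nice day']

# Sorted so that the indices are the same on every run
vocab = sorted(set(''.join(text)))
char2int_unk = {ch: i for i, ch in enumerate(vocab)}
char2int_unk[UNK] = len(char2int_unk)  # one extra index for unknowns

def encode(s):
    # dict.get falls back to the unknown index instead of raising KeyError
    unk_id = char2int_unk[UNK]
    return [char2int_unk.get(ch, unk_id) for ch in s.lower()]

ids = encode('breakfast')
```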

## Representation of Textual Data 

#### In this notebook, we used a one-hot encoding representation of textual data. 
#### However, this method is inefficient, as it generates extremely sparse matrices that waste memory as the vocabulary grows. 
#### Additionally, this representation carries no contextual or semantic information. 
#### Modern NLP solutions rely on learned word embeddings such as word2vec.
#### These methods allow the model to learn the meanings of words based on their context. 
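
#### To give a feel for why dense embeddings are cheaper than one-hot vectors, the sketch below shows that multiplying a one-hot matrix by an embedding table is the same as simply looking up rows of that table. This is not word2vec itself: the table `E` is filled with random numbers here (in practice it would be learned), and the embedding size of 4 and the example indices are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 17, 4  # 17 characters as in this notebook; 4 is arbitrary

# Embedding table: one dense row per dictionary entry (random stand-in for learned values)
E = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

ids = np.array([12, 10, 4])  # three arbitrary character indices

# One-hot route: a sparse (3, 17) matrix multiplied by the table
one_hot = np.eye(vocab_size, dtype=np.float32)[ids]
via_matmul = one_hot @ E

# Lookup route: index the rows directly, no sparse matrix needed
via_lookup = E[ids]

same = bool(np.allclose(via_matmul, via_lookup))
```

#### The one-hot route stores 17 numbers per character, almost all of them zero, while the lookup route only needs the integer index and returns a dense vector of 4 numbers.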