# Recurrent Neural Network

### References
- https://github.com/fastai/fastbook/blob/master/12_nlp_dive.ipynb
- https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/  [Mainly LSTM]
- https://deeplearning.cs.cmu.edu/S20/document/recitation/recitation-7.pdf
- http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf

## Resources
#### Video
- https://www.youtube.com/watch?v=6niqTuYFZLQ

#### Article
- http://ethen8181.github.io/machine-learning/deep_learning/rnn/1_pytorch_rnn.html#Recurrent-Neural-Network-(RNN) [Main reference]
- https://medium.com/ecovisioneth/building-deep-multi-layer-recurrent-neural-networks-with-star-cell-2f01acdb73a7 [Multi Layer]
- https://towardsdatascience.com/pytorch-basics-how-to-train-your-neural-net-intro-to-rnn-cb6ebc594677
- https://www.jeremyjordan.me/introduction-to-recurrent-neural-networks/


![](./rnn.png)

- hidden state: The activations that are updated at each step of a recurrent neural network

![](./rnn2.png)

![Internal](./rnn_internal.png)

### My simplified version of the internals

![](./rnn_my_internal.png)
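
The update pictured above can be written out by hand. The following is a minimal sketch, assuming the default `tanh` nonlinearity of `nn.RNN` and a zero initial hidden state; the sizes (5 input features, 3 hidden units, 6 steps) are illustrative choices, and the loop is only meant to mirror what the module computes internally.

```python
import torch
from torch import nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=5, hidden_size=3, batch_first=True)
x = torch.rand(1, 6, 5)      # (batch, seq_len, input_size)
h = torch.zeros(1, 3)        # (batch, hidden_size), initial hidden state

# Parameters of the single layer, as exposed by nn.RNN
W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0   # (3, 5) and (3, 3)
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

manual_steps = []
for t in range(x.size(1)):
    # h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
    h = torch.tanh(x[:, t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    manual_steps.append(h)
manual_out = torch.stack(manual_steps, dim=1)     # (batch, seq_len, hidden_size)

out, h_n = rnn(x)            # h_0 defaults to zeros
print(torch.allclose(manual_out, out, atol=1e-5))
```

The same weights are reused at every step; only the hidden state `h` changes as the sequence is consumed.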

## Multilayer RNN

When stacking multiple RNN cells on top of each other, the hidden state of the lower layer $l-1$ is passed as the input to the next-higher layer $l$.

![](./mRNN.png)
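
One way to check this reading of the diagram is to rebuild a two-layer `nn.RNN` from two single-layer ones. This is a sketch under the assumption that dropout is left at its default of 0; the names `lower` and `upper` and the tensor sizes are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)

stacked = nn.RNN(input_size=4, hidden_size=3, num_layers=2, batch_first=True)

# Two single-layer RNNs that reuse the stacked model's weights
lower = nn.RNN(input_size=4, hidden_size=3, batch_first=True)
upper = nn.RNN(input_size=3, hidden_size=3, batch_first=True)
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(lower, f"{name}_l0").copy_(getattr(stacked, f"{name}_l0"))
        getattr(upper, f"{name}_l0").copy_(getattr(stacked, f"{name}_l1"))

x = torch.rand(2, 5, 4)                 # (batch, seq_len, input_size)
lower_out, lower_h = lower(x)           # hidden states of layer 0 at every step
upper_out, upper_h = upper(lower_out)   # layer 1 consumes them as its input

out, h_n = stacked(x)
print(torch.allclose(upper_out, out, atol=1e-5))
```

`out` only exposes the top layer, while `h_n` holds the final hidden state of every layer, stacked along the first dimension.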

In [1]:
import torch 
from torch import nn, autograd
import torch.nn.functional as F
torch.__version__

'1.7.1'

In [2]:
a_device = "cuda" if torch.cuda.is_available() else "cpu"  # call the function; the bare attribute is always truthy
print(a_device)
device = torch.device(a_device)
device

cuda


device(type='cuda')

## PyTorch docs for the RNN class:
- https://pytorch.org/docs/stable/generated/torch.nn.RNN.html#torch.nn.RNN

#### init args
- input_size – The number of expected features in the input x

- hidden_size – The number of features in the hidden state h

- num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two RNNs together to form a stacked RNN, with the second RNN taking in outputs of the first RNN and computing the final results. Default: 1
- batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False

#### Inputs: input, h_0
`input` of shape (seq_len, batch, input_size): 
- tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.

`h_0` of shape (num_layers * num_directions, batch, hidden_size): 
- tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. If the RNN is bidirectional, num_directions should be 2, else it should be 1.
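
The packed-sequence input mentioned above can be sketched as follows. The two sequence lengths (4 and 2) and the layer sizes are made-up values for illustration; `pack_padded_sequence` and `pad_packed_sequence` are the utilities the docs refer to.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

rnn = nn.RNN(input_size=5, hidden_size=3, batch_first=True)

# Two sequences of lengths 4 and 2, zero-padded to a common length of 4
padded = torch.rand(2, 4, 5)
padded[1, 2:] = 0.0
lengths = torch.tensor([4, 2])

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)        # (batch, max_len, hidden_size)
print(out_lengths)      # the original lengths
```

With packing, `h_n` holds the state at each sequence's last real step rather than at the padded end, and the padded positions of `out` are filled with zeros.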


#### Outputs: output, h_n
`output` of shape (seq_len, batch, num_directions * hidden_size): 
- tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. 

- For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case.

`h_n` of shape (num_layers * num_directions, batch, hidden_size): 
- tensor containing the hidden state for t = seq_len. Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size).
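
The `view` call described above can be tried on a small bidirectional RNN. This is a sketch with illustrative sizes; it assumes the default `batch_first=False` layout so that it matches the shapes quoted from the docs.

```python
import torch
from torch import nn

torch.manual_seed(0)

seq_len, batch, hidden_size = 6, 2, 3
birnn = nn.RNN(input_size=5, hidden_size=hidden_size, bidirectional=True)

x = torch.rand(seq_len, batch, 5)        # (seq_len, batch, input_size)
output, h_n = birnn(x)

# Split the last dimension into (direction, hidden_size)
by_direction = output.view(seq_len, batch, 2, hidden_size)
forward_out = by_direction[:, :, 0]      # direction 0: forward
backward_out = by_direction[:, :, 1]     # direction 1: backward

# The forward pass finishes at t = seq_len - 1, the backward pass at t = 0
print(torch.allclose(forward_out[-1], h_n[0]))
print(torch.allclose(backward_out[0], h_n[1]))
```

Note that the backward direction's final state sits at the first time step of `output`, not the last.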


### Shapes


`Input1`:
- $(L, N, H_{in})$ tensor containing the input features, where $L$ is the sequence length, $N$ is the batch size, and $H_{in}=\text{input_size}$.
- If batch_first is True, the input is provided as (batch, seq, feature) instead.

`Input2`: 
- $(S, N, H_{out})$ tensor containing the initial hidden state for each element in the batch, where $S=\text{num_layers} * \text{num_directions}$ and $H_{out}=\text{hidden_size}$. Defaults to zero if not provided. If the RNN is bidirectional, num_directions is 2; otherwise it is 1.

`Output1` --> `output` : 
- $(L, N, H_{all})$ where $H_{all}=\text{num_directions} * \text{hidden_size}$
- if batch_first is True, then the input and output tensors are provided as (batch, seq, feature).

`Output2` --> `hn`: 
- $(S, N, H_{out})$ tensor containing the next hidden state for each element in the batch
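
The shape rules above can be checked with a small configuration. The values of `N`, `L`, `H_in`, and `H_out` below are arbitrary, chosen only so that every dimension is distinguishable.

```python
import torch
from torch import nn

N, L, H_in, H_out = 4, 6, 5, 3          # batch, seq_len, input_size, hidden_size
num_layers, num_directions = 2, 2

rnn = nn.RNN(input_size=H_in, hidden_size=H_out, num_layers=num_layers,
             bidirectional=True, batch_first=True)

x = torch.rand(N, L, H_in)                                  # batch_first layout
h_0 = torch.zeros(num_layers * num_directions, N, H_out)    # always (S, N, H_out)

output, h_n = rnn(x, h_0)
print(output.shape)   # (N, L, num_directions * H_out)
print(h_n.shape)      # (S, N, H_out)
```

`batch_first=True` only reorders `input` and `output`; the hidden states keep the batch in the second dimension.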



In [3]:
rnn = nn.RNN(input_size=5, hidden_size=3, num_layers=1, batch_first=True)
inputs_ = torch.rand(1, 6, 5) # (batch, sequence, input_size)

# if batch_first false
#     (seq_len, batch, input_size)

hidden = torch.rand(1, 1, 3)  #  (num_layers * num_directions, batch, hidden_size)
out, h = rnn(inputs_, hidden)

In [4]:
out.shape  #  (seq_len, batch, num_directions * hidden_size): 
# if batch first == True :
#  (batch , seq_len, num_directions * hidden_size):

torch.Size([1, 6, 3])

In [5]:
out

tensor([[[ 0.1536, -0.6420, -0.9776],
         [ 0.4521, -0.3503, -0.6143],
         [ 0.4006, -0.6025, -0.8698],
         [ 0.4774, -0.5537, -0.7637],
         [ 0.3686, -0.3401, -0.5982],
         [ 0.4397, -0.2146, -0.6809]]], grad_fn=<TransposeBackward1>)

In [38]:
out.view(out.shape[0], -1)

tensor([[ 0.1536, -0.6420, -0.9776,  0.4521, -0.3503, -0.6143,  0.4006, -0.6025,
         -0.8698,  0.4774, -0.5537, -0.7637,  0.3686, -0.3401, -0.5982,  0.4397,
         -0.2146, -0.6809]], grad_fn=<ViewBackward>)

In [35]:
out.reshape(out.shape[0], -1).shape

torch.Size([1, 18])

In [13]:
flatout = out.data.view(-1).div(0.85).exp()
flatout.shape

torch.Size([18])

In [30]:
flatout

tensor([1.1980, 0.4699, 0.3166, 1.7021, 0.6623, 0.4854, 1.6022, 0.4922, 0.3594,
        1.7536, 0.5213, 0.4072, 1.5428, 0.6702, 0.4947, 1.6775, 0.7769, 0.4489])

In [32]:
torch.multinomial(flatout, 1)[0]

tensor(8)
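
Dividing by 0.85 and exponentiating, as in the cells above, produces unnormalised softmax weights with a temperature of 0.85, and `torch.multinomial` draws an index in proportion to them. The sketch below wraps that idea in a hypothetical helper, `sample_with_temperature`, and applies it to the scores of a single time step (the last row of `out` above) instead of the flattened output. That choice is an assumption about the intended use.

```python
import torch

torch.manual_seed(0)

def sample_with_temperature(scores, temperature=0.85):
    """Sample one index from a 1-D tensor of scores (hypothetical helper)."""
    # Lower temperature sharpens the distribution, higher temperature flattens it.
    probs = torch.softmax(scores / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item(), probs

scores = torch.tensor([0.4397, -0.2146, -0.6809])   # last time step of `out` above
idx, probs = sample_with_temperature(scores)
_, sharp = sample_with_temperature(scores, temperature=0.1)
print(idx, probs, sharp)
```

At a low temperature almost all of the probability mass moves to the highest score, which makes sampling behave like `argmax`.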

In [7]:
h.shape # (num_layers * num_directions, batch, hidden_size)

torch.Size([1, 1, 3])

#### Let's try it on characters
## Teach the RNN to map `hihell` to `ihello`

In [46]:
idx2char = ["h", "i", "e", "l", "o"]

x_data = [0, 1, 0, 2, 3, 3] # hihell

one_hot_lookup = [
        [1, 0, 0, 0, 0], # h 0
        [0, 1, 0, 0, 0], # i 1
        [0, 0, 1, 0, 0], # e 2
        [0, 0, 0, 1, 0], # l 3
        [0, 0, 0, 0, 1], # o 4
]

y_data = [1, 0, 2, 3, 3, 4]  #ihello

In [47]:
hihell = torch.tensor([one_hot_lookup[x] for x in x_data])
print(hihell.shape)
hihell

torch.Size([6, 5])


tensor([[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 1, 0]])
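
The hand-written lookup table can also be produced with `torch.nn.functional.one_hot`. This is a small sketch of the equivalent encoding, with `num_classes=5` matching the five characters in `idx2char`.

```python
import torch
import torch.nn.functional as F

x_data = [0, 1, 0, 2, 3, 3]                        # "hihell" as indices into idx2char
one_hot = F.one_hot(torch.tensor(x_data), num_classes=5)
print(one_hot.shape)    # (seq_len, num_classes)

# nn.RNN expects floats plus a batch dimension: (batch, seq_len, input_size)
rnn_input = one_hot.float().unsqueeze(0)
print(rnn_input.shape)
```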

In [48]:
input_ = hihell.float().unsqueeze(0)
# input_.requires_grad = True
input_.shape  # (batch, sequence, input_size)

torch.Size([1, 6, 5])

In [49]:
labels = torch.tensor(y_data)  # ihello
print(labels.shape)
labels

torch.Size([6])


tensor([1, 0, 2, 3, 3, 4])

In [382]:
# ihello = torch.tensor([one_hot_lookup[y] for y in labels])
# print(ihello.shape)
# ihello

#### Trying the simple RNN on `input_`

In [12]:
# initialize the hidden state
# (num_layers * num_directions, batch, hidden_size)
py_hidden_ = torch.zeros(1, 1, 5, requires_grad=True)
py_hidden_.shape

torch.Size([1, 1, 5])

In [14]:
py_cell = nn.RNN(input_size=5, hidden_size=5, batch_first=True) # batch_first changes the expected input/output layout
py_out, py_hidden = py_cell(input_, py_hidden_)
py_out = py_out.view(-1, 5) #.shape
py_out

tensor([[ 0.5184, -0.2448, -0.4246, -0.4477,  0.2838],
        [ 0.5648, -0.4026,  0.0068, -0.4953,  0.4788],
        [ 0.5070, -0.5535, -0.0761, -0.2318,  0.3634],
        [ 0.2870, -0.3875, -0.2583, -0.6370,  0.7112],
        [ 0.3077, -0.4010, -0.4126, -0.5527,  0.7536],
        [ 0.3060, -0.4228, -0.4507, -0.5139,  0.7937]], grad_fn=<ViewBackward>)

In [15]:
py_out.shape 
# `py_out` was flattened with view(-1, 5), so the batch dimension (1) and the
# 6 time steps are merged: one row per time step, 5 scores (hidden_size) per row.
# Before flattening, with batch_first == True, the shape was
#  (batch, seq_len, num_directions * hidden_size)

torch.Size([6, 5])

In [16]:
py_max, idx = py_out.max(1)
idx.squeeze()

tensor([0, 0, 0, 4, 4, 4])

In [17]:
py_max

tensor([0.5184, 0.5648, 0.5070, 0.7112, 0.7536, 0.7937],
       grad_fn=<MaxBackward0>)

In [18]:
out_w = [idx2char[y.item()] for y in idx.squeeze()]
out_w

['h', 'h', 'h', 'o', 'o', 'o']

In [19]:
idx2char

['h', 'i', 'e', 'l', 'o']

In [20]:
py_hidden.shape

torch.Size([1, 1, 5])

In [21]:
py_out.shape

torch.Size([6, 5])

### Model definition

In [55]:
class Model(nn.Module):
    
    def __init__(self, sequence_length, input_size, hidden_size, num_layers, num_classes, batch_size):
        super(Model, self).__init__()
        
        self.sequence_length = sequence_length
        self.num_layers = num_layers
        self.hidden_size = hidden_size
        self.input_size = input_size
        self.num_classes = num_classes
        self.batch_size = batch_size
        self.h = self.__init_hidden__()
    
        self.rnn = nn.RNN(input_size=self.input_size, 
                          hidden_size=self.hidden_size,
                          num_layers=self.num_layers,
                          batch_first=True)
        """
            this rnn will be followed by a linear layer so,
            that the output from rnn matches our final required num of classes.
        """ 
        self.fc1 = nn.Linear(self.hidden_size, self.num_classes)
        
    def forward(self, x):
        x = x.view(self.batch_size, self.sequence_length, self.input_size)
        
        # x: (batch, sequence, input_size), e.g. torch.Size([1, 6, 5])
        out, hidden = self.rnn(x, self.h)
        # flatten to (batch * seq_len, hidden_size) for the linear layer
        out = self.fc1(out.view(-1, self.hidden_size))
        # keep the values for the next call, but cut them off from this graph
        self.h = hidden.detach()
        return out
    
    def __init_hidden__(self):
        return torch.zeros(self.num_layers, self.batch_size, self.hidden_size) #, requires_grad=True)
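
`forward` stores `hidden.detach()` instead of `hidden` itself. A minimal sketch of what `detach` changes, using a standalone `nn.RNN` with illustrative sizes in place of the `Model` class:

```python
import torch
from torch import nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=5, hidden_size=3, batch_first=True)
x = torch.rand(1, 6, 5)

_, hidden = rnn(x)              # still attached to this call's autograd graph
carried = hidden.detach()       # same values, but cut off from that graph

print(hidden.grad_fn is not None)   # attached
print(carried.grad_fn is None)      # detached

# A later call can start from the carried state; its backward pass stops
# at `carried` and never reaches into the earlier call.
out2, _ = rnn(x, carried)
out2.sum().backward()
```

Carrying the state this way keeps its values across calls while limiting backpropagation to the current call.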

## Creating the model

In [56]:
num_classes = 5
input_size = 5  # one-hot size
hidden_size = 3  # size of the RNN hidden state; set to 5 to predict the one-hot classes directly
batch_size = 1  # one sentence
sequence_length = 6  # len("hihell")
num_layers = 1  # one-layer RNN

In [57]:
model = Model(sequence_length=sequence_length,
              input_size=input_size,
              hidden_size=hidden_size,
              num_layers=num_layers,
              num_classes=num_classes,
              batch_size=batch_size
             )

## Training

In [53]:
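criterion = nn.CrossEntropyLoss()
# Build the optimizer after `model` exists, and rebuild it whenever `model` is
# re-created; otherwise it keeps updating the old model's parameters.
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)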
loss = 0
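
`nn.CrossEntropyLoss` takes raw scores of shape (N, C) and class indices of shape (N,), which is why the model flattens its output to one row per time step. The sketch below uses random scores as stand-ins for model output and compares the loss with its two-step form.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

scores = torch.randn(6, 5)                     # (N, C): one row per time step
targets = torch.tensor([1, 0, 2, 3, 3, 4])     # (N,): class indices for "ihello"

loss = nn.CrossEntropyLoss()(scores, targets)
# Equivalent two-step form: log-softmax over classes, then negative log-likelihood
manual = F.nll_loss(F.log_softmax(scores, dim=1), targets)
print(loss.item(), manual.item())
```

Because the softmax is applied inside the loss, the model should return unnormalised scores and not probabilities.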

In [58]:
for epoch in range(2):
    optimizer.zero_grad()

    # forward pass; the model detaches and stores its own hidden state
    outputs = model(input_)  # (batch * seq_len, num_classes)

    loss = criterion(outputs, labels)
    
    val, idx = outputs.max(1)
    result_str = [idx2char[c] for c in idx.detach().squeeze()]
    print(''.join(result_str))

    print(f"epochs {epoch+1}, loss {loss.item()}")
    
    loss.backward()
    
    optimizer.step()

eoeoee
epochs 1, loss 1.7449554204940796
eeeeee
epochs 2, loss 1.7873259782791138
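
In the run above the loss does not fall over two epochs. Judging by the execution counters, the optimizer cell (`In [53]`) ran before the model was rebuilt (`In [57]`), so the optimizer may have been holding the parameters of an earlier `model` object. That is an inference from the counters, not something the notebook states. The sketch below builds the optimizer after the model and trains for longer. `CharRNN` is a hypothetical compact variant of `Model` that starts from a zero hidden state on every call, and the 100 epochs and `hidden_size=5` are arbitrary choices.

```python
import torch
from torch import nn

torch.manual_seed(0)

idx2char = ["h", "i", "e", "l", "o"]
x_data = [0, 1, 0, 2, 3, 3]           # hihell
y_data = [1, 0, 2, 3, 3, 4]           # ihello

inputs = torch.eye(5)[x_data].unsqueeze(0)    # one-hot, (batch, seq_len, input_size)
labels = torch.tensor(y_data)

class CharRNN(nn.Module):
    """Hypothetical compact variant of the Model class above."""
    def __init__(self, input_size=5, hidden_size=5, num_classes=5):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.rnn(x)                  # fresh zero hidden state per call
        return self.fc(out.reshape(-1, out.size(-1)))

model = CharRNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)   # built after the model

losses = []
for epoch in range(100):
    optimizer.zero_grad()
    outputs = model(inputs)                   # (seq_len, num_classes)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

prediction = "".join(idx2char[i] for i in outputs.argmax(dim=1).tolist())
print(prediction, losses[0], losses[-1])
```

Comparing the first and last entries of `losses` is a quick check that the optimizer is actually updating the model being evaluated.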


In [90]:
# the model keeps its own hidden state, so only the input is passed
outputs = model(input_)

In [91]:
outputs.shape

torch.Size([6, 5])

In [92]:
y_max, idx = outputs.max(1)
idx.squeeze()

tensor([1, 0, 1, 3, 3, 0])

In [93]:
out_w = [idx2char[y.item()] for y in idx.squeeze()]
out_w

['i', 'h', 'i', 'l', 'l', 'h']

## LSTM

### `torch.nn.LSTM`
#### Parameters
- input_size – The number of expected features in the input x

- hidden_size – The number of features in the hidden state h

- num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1

#### `Inputs: input, (h_0, c_0)`
- `input` and `h_0` are the same as for `nn.RNN`
- `c_0` has the same shape as `h_0`; it is the tensor containing the initial cell state for each element in the batch
    - (num_layers * num_directions, batch, hidden_size)
    
#### `Outputs: output, (h_n, c_n)`
- `output` 
    -  (seq_len, batch, num_directions * hidden_size)
    
- `h_n` and `c_n` have the same shapes as the input states `h_0` and `c_0`

In [6]:
lstm_false = nn.LSTM(input_size=5, hidden_size=3, num_layers=1, batch_first=False)
in_false = torch.rand(6,1,5)
h0_false = torch.rand(1, 1, 3)
c0_false = torch.rand(1, 1, 3)

output_f, (hn_f, cn_f) = lstm_false(in_false, (h0_false, c0_false))
output_f.shape

torch.Size([6, 1, 3])
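
The relationship between `output`, `h_n`, and `c_n` can be checked directly. This is a sketch for the single-layer, unidirectional case only, with the same sizes as the cell above and the initial states left at their zero defaults.

```python
import torch
from torch import nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=5, hidden_size=3, num_layers=1)   # batch_first=False
x = torch.rand(6, 1, 5)                                     # (seq_len, batch, input_size)

output, (h_n, c_n) = lstm(x)         # (h_0, c_0) default to zeros when omitted

print(output.shape, h_n.shape, c_n.shape)
# For one unidirectional layer, the last output step is the final hidden state
print(torch.allclose(output[-1], h_n[0]))
```

The cell state `c_n` is returned only for the final step; `output` contains hidden states, never cell states.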

# A more complex text dataset

## Embedding Layer
- https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
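
An embedding layer is a trainable lookup table: row `i` of its weight matrix is the vector for index `i`. The sketch below shows the equivalence with plain indexing and with a one-hot matrix product. The vocabulary size of 9 and embedding size of 6 are illustrative.

```python
import torch
import torch.nn.functional as F
from torch import nn

torch.manual_seed(0)

vocab_size, embedding_size = 9, 6
embedding = nn.Embedding(vocab_size, embedding_size)

word_ids = torch.tensor([0, 1, 2, 3, 4])           # five word indices
vectors = embedding(word_ids)                       # (seq_len, embedding_size)

# The same result as indexing the weight matrix, or as one-hot @ weight
by_index = embedding.weight[word_ids]
by_one_hot = F.one_hot(word_ids, vocab_size).float() @ embedding.weight
print(vectors.shape)
```

The lookup avoids materialising the one-hot vectors, which matters once the vocabulary is large.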

In [65]:
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]

idx_to_tag = ['DET', 'NN', "V"]
tag_to_idx = {"DET":0, "NN":1, "V":2}

In [66]:
word_to_idx = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
word_to_idx

{'The': 0,
 'dog': 1,
 'ate': 2,
 'the': 3,
 'apple': 4,
 'Everybody': 5,
 'read': 6,
 'that': 7,
 'book': 8}

In [69]:
def prepare_sequence(seq, to_idx):
    idxs = [to_idx[s] for s in seq]
#     print(idxs)
    return torch.tensor(idxs)

seq = training_data[0][0]
inputs = prepare_sequence(seq, word_to_idx)
inputs

tensor([0, 1, 2, 3, 4])

In [86]:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_size, vocab_size, hidden_size, batch_size, num_layers, num_tags):
        super(LSTMTagger, self).__init__()
        self.embedding_size = embedding_size
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        
        self.batch_size = batch_size
        self.num_layers = num_layers
        self.num_tags = num_tags
        
        self.h, self.c = self.__init_h0_c0__()
        
        self.embedding_layer = nn.Embedding(vocab_size, self.embedding_size)
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, self.num_layers, batch_first=True)
        # out to tag
        self.linear = nn.Linear(self.hidden_size, self.num_tags)
        
        
    def forward(self,x):
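        emb_x = self.embedding_layer(x)  # (seq_len, embedding_size)
        # add the batch dimension: (batch, seq_len, embedding_size)
        out, (hn, cn) = self.lstm(emb_x.view(self.batch_size, len(x), -1),
                                  (self.h, self.c))
        # carry the states over to the next call, detached from this graph
        self.h = hn.detach()
        self.c = cn.detach()
        
        # flatten to (batch * seq_len, hidden_size) before mapping to tag scores
        out = self.linear(out.view(-1, self.hidden_size))
        
        return out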
        
    
    def __init_h0_c0__(self):
        return (
            torch.zeros(self.num_layers, self.batch_size, self.hidden_size),
            torch.zeros(self.num_layers, self.batch_size, self.hidden_size)
        )


### Initializing the lstm model

In [87]:
vocab_size = len(word_to_idx)
num_layers = 1
num_tags = len(tag_to_idx)
batch_size = 1
EMBEDDING_SIZE = 6
HIDDEN_SIZE = 3

lstm_net = LSTMTagger(EMBEDDING_SIZE, 
                      vocab_size, 
                      HIDDEN_SIZE, 
                      batch_size, 
                      num_layers, 
                      num_tags
)

In [88]:
### Optimizer and loss function

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(lstm_net.parameters(), lr=0.1)
loss = 0

### Training the tag generator

In [89]:
for epoch in range(2):
    for sequence, tags in training_data:
        optimizer.zero_grad()
        inputs_ = prepare_sequence(sequence, word_to_idx)
        labels_ = prepare_sequence(tags, tag_to_idx)
        
        out = lstm_net(inputs_)
        loss = criterion(out, labels_)
        loss.backward()
        optimizer.step()
        
    print(f"epochs {epoch+1}, loss {loss.item()}")

#         break

epochs 1, loss 1.083335518836975
epochs 2, loss 1.0105509757995605


In [90]:
out.shape

torch.Size([4, 3])

In [72]:
val, idxs = out.max(1)
idxs

tensor([1, 2, 0, 1])

In [73]:
' '.join(idx_to_tag[idx] for idx in idxs)

'NN V DET NN'

# On an image dataset
### Loading MNIST Dataset

In [16]:
from torchvision import datasets
from torchvision import transforms
from torch.utils.data import DataLoader

In [51]:
train_set = datasets.MNIST(root="pytorch_dataset/", 
                             train=True,
                             download=True,
                             transform=transforms.ToTensor()
                            )
test_set = datasets.MNIST(root="pytorch_dataset/", 
                            train=False, 
                            download=True,
                            transform=transforms.ToTensor()
                           )


In [52]:
train_loader = DataLoader(dataset=train_set,
                          batch_size=batch_size,
                         shuffle=True)
test_loader = DataLoader(dataset=test_set,
                          batch_size=batch_size,
                         shuffle=True)

### Initializing the model

In [59]:
rnn_model = RNN(input_size, hidden_size, num_layers, num_classes).to(device)
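
The `RNN` class and the hyperparameters used in this cell are not shown in this section. The following is a hypothetical stand-in, not the original definition: it assumes each 28x28 MNIST image is read as a sequence of 28 rows with 28 features per row, and the values of `hidden_size` and `num_layers` are guesses.

```python
import torch
from torch import nn

# Assumed hyperparameters (not shown in the original notebook)
input_size, sequence_length = 28, 28
hidden_size, num_layers, num_classes = 128, 2, 10

class RNN(nn.Module):
    """Hypothetical stand-in for the undefined `RNN` class used above."""
    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.rnn = nn.RNN(input_size, hidden_size, num_layers=num_layers,
                          batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out, _ = self.rnn(x)            # x: (batch, 28, 28); h_0 defaults to zeros
        return self.fc(out[:, -1])      # classify from the last time step

model = RNN(input_size, hidden_size, num_layers, num_classes)
logits = model(torch.rand(4, sequence_length, input_size))
print(logits.shape)    # (batch, num_classes)
```

Using only the last time step gives one prediction per image, which matches the `(batch,)` label tensor that `CrossEntropyLoss` receives in the training loop.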

### Loss and Optimizer

In [60]:
# from torch.functional import
from torch import optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn_model.parameters(), lr=lr)


## Training the model

In [61]:
for epoch in range(num_epochs) :
    for x_train, y_train in train_loader:
        x_train = x_train.to(device)
        y_train = y_train.to(device)
        
        # MNIST batches are (batch, 1, 28, 28); drop the channel dimension
        # so each image becomes a (28, 28) sequence of rows
        x_train = x_train.squeeze(1)
        
        optimizer.zero_grad()
 
        out = rnn_model(x_train)
        
        loss = criterion(out, y_train)
        
        loss.backward()
        
        optimizer.step()
        
    print(f" loss at the end of {epoch} epoch is {loss}")

print("Done Training")

 loss at the end of 0 epoch is 0.1431623101234436
 loss at the end of 1 epoch is 0.3201900124549866
Done Training


In [63]:
def accuracy(test_loader, model,cuda=True):
    n_correct = 0
    n_total = 0
    with torch.no_grad():
        for x_test, y_test in test_loader:
            if cuda:
                x_test = x_test.to(device)
                y_test = y_test.to(device=device)
            x_test = x_test.squeeze(1)
            preds = model(x_test)
            _, preds_idx = preds.max(1)
            n_correct += (preds_idx==y_test).sum().item()
            n_total += x_test.size(0)
    print(f"Out of {n_total} images {n_correct} were correctly classified")
    acc = (n_correct/n_total) * 100
    print(f"Accuracy of the model is {acc:.2f}")            

In [64]:
accuracy(test_loader, rnn_model,cuda=True)

Out of 10000 images 9525 were correctly classified
Accuracy of the model is 95.25
