In [0]:
import torch
import torchtext
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence, PackedSequence
from torch import nn

from torch import Tensor, dot, matmul

import torch.nn.functional as F

## **Basic Example**

In [0]:
seq = torch.FloatTensor([[3, 4, 5]])  

In [0]:
# Defining a basic RNN layer
rnn= nn.RNN(input_size=1, hidden_size=1, num_layers = 1, bias = False, batch_first=True)

`nn.RNN` expects input sequences in a specific layout. By setting `batch_first = True`, we set the input layout to `(batch size, sequence length, number of input features)`. Without it, the default layout is `(sequence length, batch size, number of input features)`.

In [55]:
seq = seq.unsqueeze(2)
print(seq.shape)

print(seq)

torch.Size([1, 3, 1])
tensor([[[3.],
         [4.],
         [5.]]])


With the input in the correct format, we can now pass it to the RNN layer. The layer returns 2 outputs:


1.   The hidden states at every time stamp, for all sequences in the batch
2.   Only the final hidden state of each sequence, for all sequences in the batch



In [0]:
out_all,out_last = rnn(seq)

In [57]:
print(f"Out all shape : {out_all.shape}")

print(f"Out last shape : {out_last.shape}")


Out all shape : torch.Size([1, 3, 1])
Out last shape : torch.Size([1, 1, 1])
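The relationship between the two outputs can be checked directly. The snippet below is a minimal sketch that rebuilds the same single-feature layer under separate `demo_*` names (these names and the seed value are assumptions made for this illustration) and confirms that the last row of the first output equals the second output.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only so the sketch is repeatable

# Separate names (demo_*) so the notebook's own `rnn` and `seq` stay untouched
demo_rnn = nn.RNN(input_size=1, hidden_size=1, num_layers=1, bias=False, batch_first=True)
demo_seq = torch.tensor([[3.0, 4.0, 5.0]]).unsqueeze(2)  # (batch=1, seq_len=3, features=1)

demo_all, demo_last = demo_rnn(demo_seq)

# demo_all  : (batch, seq_len, hidden_size), one hidden state per time stamp
# demo_last : (num_layers, batch, hidden_size), only the final hidden state
same = torch.allclose(demo_all[:, -1, :], demo_last[0])
print(demo_all.shape, demo_last.shape, same)
```

Note that `batch_first=True` changes only the layout of the input and of the first output; the second output keeps the layer dimension first.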


There are 2 ways to access the weights of the RNN layer:

1.   Accessing individual parameters by name: `weight_hh_l0`, `weight_ih_l0` and so on.
2.   Calling the `state_dict()` method to get all weights at once





In [60]:
rnn.weight_hh_l0

Parameter containing:
tensor([[0.9963]], requires_grad=True)

In [61]:
rnn.state_dict()

OrderedDict([('weight_ih_l0', tensor([[0.3445]])),
             ('weight_hh_l0', tensor([[0.9963]]))])

### **Computing the output**

RNN layers essentially take in a sequence and compute outputs for each time point in the input sequence. The weights that are used for computation remain the same for all time points.

The basic equation governing the computation is given by :
$h_t = \text{tanh}(W_{ih} x_t + b_{ih} + W_{hh} h_{(t-1)} + b_{hh})$

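where $h_{t}$ is the hidden state at time $t$, $x_t$ is the input at time $t$, $W_{ih}$ and $W_{hh}$ are the input-to-hidden and hidden-to-hidden weights, and $b_{ih}$ and $b_{hh}$ are the corresponding biases. Since the layer above was created with `bias = False`, both bias terms are zero in this example.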
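As a framework-free cross-check of this recurrence, the sketch below applies it with plain Python floats. It assumes the two weights printed by `state_dict()` above (rounded to four decimals), no bias terms, and $h_0 = 0$. The helper name `rnn_scalar` is made up for this illustration.

```python
import math

def rnn_scalar(xs, w_ih, w_hh, h0=0.0):
    """Apply h_t = tanh(w_ih * x_t + w_hh * h_{t-1}) to a sequence of scalars."""
    h = h0
    states = []
    for x in xs:
        h = math.tanh(w_ih * x + w_hh * h)
        states.append(h)
    return states

# Weights copied (rounded) from the state_dict() output above
scalar_states = rnn_scalar([3.0, 4.0, 5.0], w_ih=0.3445, w_hh=0.9963)
print([round(h, 4) for h in scalar_states])
```

Because the weights are rounded, the printed values should agree with the layer output only to about three decimal places.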








In [65]:
# Output states computed by the RNN layer
out_all

tensor([[[0.7753],
         [0.9733],
         [0.9909]]], grad_fn=<TransposeBackward1>)

#### Hidden State 1

Note: Since this is the very first state (time = 1), there is no hidden state preceding it, so we assume it to be zero. Therefore, $h_{0}$ is taken to be 0.

In [73]:
wih = rnn.weight_ih_l0
whh = rnn.weight_hh_l0

x = seq[0][0] # The first input feature of the first sequence

# Computing the hidden state for time = 1
h1 = torch.tanh(Tensor(x*wih + whh*0))  
h1

tensor([[0.7753]], grad_fn=<TanhBackward>)

#### Hidden State 2

In [75]:
x = seq[0][1] # The second input feature of the first sequence

h2 = torch.tanh(Tensor(x*wih + whh*h1))  
h2

tensor([[0.9733]], grad_fn=<TanhBackward>)

#### Hidden State 3

In [76]:
x = seq[0][2] # The third and last input feature of the first sequence

h3 = torch.tanh(Tensor(x*wih + whh*h2))  
h3

tensor([[0.9909]], grad_fn=<TanhBackward>)

We can observe that:


1.   The RNN applies the same simple computation, with the same weights, at every time stamp of the given sequence
2.   The output at a given time stamp depends on the output at the previous time stamp
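The second observation can be made concrete with a small experiment. The sketch below reuses the scalar recurrence with the rounded weights from the basic example (the helper `rnn_scalar` and the perturbed inputs are assumptions made for this illustration) and compares the hidden states after changing either the first or the last input.

```python
import math

def rnn_scalar(xs, w_ih=0.3445, w_hh=0.9963):
    """Scalar tanh RNN without bias, starting from h_0 = 0."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_ih * x + w_hh * h)
        states.append(h)
    return states

base = rnn_scalar([3.0, 4.0, 5.0])
first_changed = rnn_scalar([0.0, 4.0, 5.0])  # perturb the first input
last_changed = rnn_scalar([3.0, 4.0, 0.0])   # perturb the last input

# Changing the first input alters every hidden state after it;
# changing the last input alters only the last hidden state.
print(base)
print(first_changed)
print(last_changed)
```

Information flows forward only: a hidden state never depends on inputs that come after it.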



## **Adding more features**

We increase the complexity of the RNN computation by increasing the number of features at each time stamp of the sequence. Previously, each time stamp was represented by a single value. Now, each time stamp is represented by a feature vector.

In [116]:
seq = Tensor([[1,1,1],[1,2,1],[2,3,1], [1,3,1]])

seq = seq.unsqueeze(0)

seq.shape

torch.Size([1, 4, 3])

The `seq` variable represents a sequence of length 4, where each element (time-stamp) is represented by a feature vector of length 3.

We next define an RNN layer with `input_size` set to 3. This time, we also set `bias` to True, so that the bias terms are included in our calculations.

In [0]:
# Defining a basic RNN layer
rnn= nn.RNN(input_size=3, hidden_size=1, num_layers = 1, bias = True, batch_first=True)

In [118]:
out_all, out_last = rnn(seq)

print(f"Out all shape : {out_all.shape}")

print(f"Out last shape : {out_last.shape}")

Out all shape : torch.Size([1, 4, 1])
Out last shape : torch.Size([1, 1, 1])


### **Computing outputs**

In [119]:
out_all

tensor([[[0.2149],
         [0.7819],
         [0.9865],
         [0.9831]]], grad_fn=<TransposeBackward1>)

#### Hidden State 1

A minor modification compared to the previous code is that we now use the dot product to multiply $x$ with $W_{ih}$ and $h_{t-1}$ with $W_{hh}$, because $x$ is now a vector rather than a single value.

In [120]:
wih = rnn.weight_ih_l0.squeeze(0)
whh = rnn.weight_hh_l0.squeeze(0)

bih = rnn.bias_ih_l0
bhh = rnn.bias_hh_l0

x = seq[0][0] # The first input feature of the first sequence

# Computing the hidden state for time = 1
h1 = torch.tanh(Tensor(dot(x,wih) + bih  + dot(whh,Tensor([0.0])) + bhh))  
h1

tensor([0.2149], grad_fn=<TanhBackward>)

#### Hidden State 2

In [121]:
x = seq[0][1] # The input at the second time stamp of the first sequence

# Computing the hidden state for time = 2
h2 = torch.tanh(Tensor(dot(x,wih) + bih  + dot(h1,whh) + bhh))  
h2

tensor([0.7819], grad_fn=<TanhBackward>)

#### Computing all states

We automate the manual computation of the hidden states to verify that it matches the RNN layer output.

In [0]:
output = []

h_previous = Tensor([0.0])  # h_0 is zero

for i in range(seq.shape[1]):

  # The same update applies at every time stamp, including the first
  x = seq[0][i]
  h_current = torch.tanh(Tensor(dot(x,wih) + bih  + dot(h_previous,whh) + bhh))
  h_previous = h_current
  output.append(h_current.detach().numpy())



In [133]:
output

[array([0.21494365], dtype=float32),
 array([0.7818633], dtype=float32),
 array([0.98649365], dtype=float32),
 array([0.98313564], dtype=float32)]

## **Increasing Hidden Size**

Until now, the `hidden_size` parameter was fixed at 1. We now increase this value and see how it affects the RNN computation.

In [0]:
# Defining the RNN layer
rnn= nn.RNN(input_size=3, hidden_size=2, num_layers = 1, bias = True, batch_first=True)

In [140]:
out_all, out_last = rnn(seq)

print(f"Out all shape : {out_all.shape}")

print(f"Out last shape : {out_last.shape}")

Out all shape : torch.Size([1, 4, 2])
Out last shape : torch.Size([1, 1, 2])


We can see from the output shape that the size of the hidden states has increased to 2, corresponding to the increase in the `hidden_size` parameter to 2

In [141]:
rnn.state_dict()

OrderedDict([('weight_ih_l0', tensor([[ 0.2020, -0.3842, -0.4155],
                      [ 0.0146, -0.5546,  0.2592]])),
             ('weight_hh_l0', tensor([[ 0.4147, -0.3529],
                      [-0.1547,  0.1795]])),
             ('bias_ih_l0', tensor([ 0.6311, -0.2717])),
             ('bias_hh_l0', tensor([-0.4247, -0.2052]))])

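Similarly, the RNN layer weight shapes have changed in response to the new `hidden_size` value: `weight_ih_l0` is now 2x3 (`hidden_size` x `input_size`), `weight_hh_l0` is 2x2 (`hidden_size` x `hidden_size`) and each bias has length 2.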

### **Computing outputs**

By increasing the `hidden_size` parameter to 2, we increase the size of the hidden state computed at each time stamp. This allows the hidden states to be more expressive and store more information.
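With a larger hidden state, the scalar products in the recurrence become matrix-vector products. The sketch below is a dependency-free illustration that assumes the rounded weights printed by `state_dict()` above and a zero initial state. The helper names `matvec_py` and `rnn_forward_py` are made up for this example.

```python
import math

def matvec_py(m, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_forward_py(xs, w_ih, w_hh, b_ih, b_hh):
    """h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh), with h_0 = 0."""
    h = [0.0] * len(w_hh)
    states = []
    for x in xs:
        a = matvec_py(w_ih, x)   # input contribution, length hidden_size
        b = matvec_py(w_hh, h)   # recurrent contribution, length hidden_size
        h = [math.tanh(ai + bi + ci + di) for ai, bi, ci, di in zip(a, b_ih, b, b_hh)]
        states.append(h)
    return states

# Rounded weights copied from the state_dict() output above
w_ih_py = [[0.2020, -0.3842, -0.4155], [0.0146, -0.5546, 0.2592]]  # (hidden_size, input_size)
w_hh_py = [[0.4147, -0.3529], [-0.1547, 0.1795]]                   # (hidden_size, hidden_size)
b_ih_py = [0.6311, -0.2717]
b_hh_py = [-0.4247, -0.2052]

inputs_py = [[1, 1, 1], [1, 2, 1], [2, 3, 1], [1, 3, 1]]
states_py = rnn_forward_py(inputs_py, w_ih_py, w_hh_py, b_ih_py, b_hh_py)
for h in states_py:
    print([round(v, 4) for v in h])
```

Each row of `w_ih_py` and `w_hh_py` produces one component of the hidden state, which is why both matrices have `hidden_size` rows.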

In [142]:
out_all

tensor([[[-0.3725, -0.6397],
         [-0.6070, -0.8786],
         [-0.7160, -0.9576],
         [-0.8071, -0.9586]]], grad_fn=<TransposeBackward1>)

#### Hidden State 1

In [159]:
wih = rnn.weight_ih_l0
whh = rnn.weight_hh_l0

bih = rnn.bias_ih_l0
bhh = rnn.bias_hh_l0

x = seq[0][0] # The first input feature of the first sequence

# Computing the hidden state for time = 1
h1 = torch.tanh(Tensor(matmul(x,wih.T) + bih  + matmul( torch.zeros([1,2]) , whh.T ) + bhh))  
h1

tensor([[-0.3725, -0.6397]], grad_fn=<TanhBackward>)

#### Computing for all states

In [0]:
output = []

h_previous = torch.zeros([1,2])  # Since the hidden_size parameter is 2, all hidden states will have a shape of [1,2]

for i in range(seq.shape[1]):

  x = seq[0][i]
  h_current = torch.tanh(Tensor(matmul(x,wih.T) + bih  + matmul(h_previous,whh.T) + bhh))
  h_previous = h_current
  output.append(h_current)




In [164]:
output

[tensor([[-0.3725, -0.6397]], grad_fn=<TanhBackward>),
 tensor([[-0.6070, -0.8786]], grad_fn=<TanhBackward>),
 tensor([[-0.7160, -0.9576]], grad_fn=<TanhBackward>),
 tensor([[-0.8071, -0.9586]], grad_fn=<TanhBackward>)]

## **Building a Bi-Directional RNN**

In [0]:
# Defining the RNN layer
rnn= nn.RNN(input_size=3, hidden_size=2, num_layers = 1, bias = True, batch_first=True, bidirectional=True)

In [167]:
out_all, out_last = rnn(seq)

print(f"Out all shape : {out_all.shape}")

print(f"Out last shape : {out_last.shape}")

Out all shape : torch.Size([1, 4, 4])
Out last shape : torch.Size([2, 1, 2])


In [168]:
out_all

tensor([[[ 0.8117, -0.4865,  0.4549,  0.4985],
         [-0.0333, -0.6678,  0.3855,  0.6282],
         [ 0.4834, -0.8639,  0.1292,  0.3069],
         [-0.3773, -0.8821,  0.2727,  0.7177]]], grad_fn=<TransposeBackward1>)

In [169]:
out_last

tensor([[[-0.3773, -0.8821]],

        [[ 0.4549,  0.4985]]], grad_fn=<StackBackward>)

In [170]:
rnn.state_dict()

OrderedDict([('weight_ih_l0', tensor([[ 0.5186, -0.4766,  0.1410],
                      [ 0.2750, -0.6602, -0.6266]])),
             ('weight_hh_l0', tensor([[-0.6757,  0.2885],
                      [ 0.2265, -0.4132]])),
             ('bias_ih_l0', tensor([0.6045, 0.0802])),
             ('bias_hh_l0', tensor([0.3446, 0.4002])),
             ('weight_ih_l0_reverse', tensor([[-0.1216, -0.1432,  0.3163],
                      [-0.6950,  0.2082,  0.1613]])),
             ('weight_hh_l0_reverse', tensor([[-0.3441,  0.0915],
                      [-0.2372,  0.2422]])),
             ('bias_ih_l0_reverse', tensor([0.3067, 0.6804])),
             ('bias_hh_l0_reverse', tensor([0.2079, 0.1318]))])

### **Computing outputs - Forward Direction** 

For a bidirectional RNN layer with a hidden layer size of 2 and an input sequence of length 4, we get an output of size 4x4.

In the output, each row holds the hidden state for a given time stamp. In the previous example, each hidden state was a vector of length 2 (because `hidden_size` = 2). Now, since the layer is bidirectional, each hidden state is a vector of length 4 (2 + 2).


For each time stamp, the first 2 values correspond to the forward run of the RNN and the last 2 values correspond to the backward run. Accordingly, `out_last` holds one final state per direction: the forward state at the last time stamp and the backward state at the first time stamp.

In [193]:
out_all

tensor([[[ 0.8117, -0.4865,  0.4549,  0.4985],
         [-0.0333, -0.6678,  0.3855,  0.6282],
         [ 0.4834, -0.8639,  0.1292,  0.3069],
         [-0.3773, -0.8821,  0.2727,  0.7177]]], grad_fn=<TransposeBackward1>)

#### Hidden State 1 - Forward Direction

In [194]:
wih = rnn.weight_ih_l0
whh = rnn.weight_hh_l0

bih = rnn.bias_ih_l0
bhh = rnn.bias_hh_l0

# We represent all reverse weights using a '_' suffix
wih_ = rnn.weight_ih_l0_reverse
whh_ = rnn.weight_hh_l0_reverse

bih_ = rnn.bias_ih_l0_reverse
bhh_ = rnn.bias_hh_l0_reverse

x = seq[0][0] # The first input feature of the first sequence

# Computing the hidden state for time = 1
h1 = torch.tanh(Tensor(matmul(x,wih.T) + bih  + matmul( torch.zeros([1,2]) , whh.T ) + bhh))  
h1


tensor([[ 0.8117, -0.4865]], grad_fn=<TanhBackward>)

#### Computing all states - Forward Direction

In [174]:
output = []

h_previous = torch.zeros([1,2])  # Since the hidden_size parameter is 2, all hidden states will have a shape of [1,2]

for i in range(seq.shape[1]):

  x = seq[0][i]
  h_current = torch.tanh(Tensor(matmul(x,wih.T) + bih  + matmul(h_previous,whh.T) + bhh))
  h_previous = h_current
  output.append(h_current)


output

[tensor([[ 0.8117, -0.4865]], grad_fn=<TanhBackward>),
 tensor([[-0.0333, -0.6678]], grad_fn=<TanhBackward>),
 tensor([[ 0.4834, -0.8639]], grad_fn=<TanhBackward>),
 tensor([[-0.3773, -0.8821]], grad_fn=<TanhBackward>)]

At this stage, we can compare the computed hidden states with the RNN layer output `out_all`. We can observe that the computed states match the first 2 elements of each row of the RNN layer output.

In [177]:
out_all[:,:,:2]

tensor([[[ 0.8117, -0.4865],
         [-0.0333, -0.6678],
         [ 0.4834, -0.8639],
         [-0.3773, -0.8821]]], grad_fn=<SliceBackward>)

### **Computing Outputs - Backward Direction**

#### Hidden State 1 - Backward direction

In [190]:
x = seq[0][-1] # The very last element of the sequence is now treated as the first element in the backward run

# Computing the hidden state for time = 4
h4_ = torch.tanh(Tensor(matmul(x,wih_.T) + bih_  + matmul( torch.zeros([1,2]) , whh_.T ) + bhh_))  
h4_


tensor([[0.2727, 0.7177]], grad_fn=<TanhBackward>)

#### Hidden State 2 - Backward direction

In [195]:
x = seq[0][-2] 

# Computing the hidden state for time = 3
h3_ = torch.tanh(Tensor(matmul(x,wih_.T) + bih_  + matmul( h4_ , whh_.T ) + bhh_))  
h3_


tensor([[0.1292, 0.3069]], grad_fn=<TanhBackward>)

#### Hidden State 3 - Backward direction

In [196]:
x = seq[0][-3] 

# Computing the hidden state for time = 2
h2_ = torch.tanh(Tensor(matmul(x,wih_.T) + bih_  + matmul( h3_ , whh_.T ) + bhh_))  
h2_


tensor([[0.3855, 0.6282]], grad_fn=<TanhBackward>)

#### Hidden State 4 - Backward direction

In [197]:
x = seq[0][-4] 

# Computing the hidden state for time = 1
h1_ = torch.tanh(Tensor(matmul(x,wih_.T) + bih_  + matmul( h2_ , whh_.T ) + bhh_))  
h1_


tensor([[0.4549, 0.4985]], grad_fn=<TanhBackward>)

In [199]:
output_ = [h1_,h2_,h3_,h4_]
output_

[tensor([[0.4549, 0.4985]], grad_fn=<TanhBackward>),
 tensor([[0.3855, 0.6282]], grad_fn=<TanhBackward>),
 tensor([[0.1292, 0.3069]], grad_fn=<TanhBackward>),
 tensor([[0.2727, 0.7177]], grad_fn=<TanhBackward>)]

In [209]:
out_all[:,:,2:]   #Checking only the 2nd half of the RNN layer output

tensor([[[0.4549, 0.4985],
         [0.3855, 0.6282],
         [0.1292, 0.3069],
         [0.2727, 0.7177]]], grad_fn=<SliceBackward>)

The final RNN layer output is the concatenation of the hidden states from the forward and backward runs. After concatenating them, we can compare our manually computed results with the RNN layer output.

In [0]:
fullOutput = [ torch.cat( (output[i], output_[i]),1)  for i in range(4) ]

In [207]:
fullOutput

[tensor([[ 0.8117, -0.4865,  0.4549,  0.4985]], grad_fn=<CatBackward>),
 tensor([[-0.0333, -0.6678,  0.3855,  0.6282]], grad_fn=<CatBackward>),
 tensor([[ 0.4834, -0.8639,  0.1292,  0.3069]], grad_fn=<CatBackward>),
 tensor([[-0.3773, -0.8821,  0.2727,  0.7177]], grad_fn=<CatBackward>)]

In [208]:
out_all

tensor([[[ 0.8117, -0.4865,  0.4549,  0.4985],
         [-0.0333, -0.6678,  0.3855,  0.6282],
         [ 0.4834, -0.8639,  0.1292,  0.3069],
         [-0.3773, -0.8821,  0.2727,  0.7177]]], grad_fn=<TransposeBackward1>)
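The forward and backward passes walked through above can be condensed into a few lines. The following sketch uses only the standard library and assumes the rounded weights printed by `state_dict()` for the bidirectional layer. The helper `rnn_pass` and the variable names are made up for this illustration.

```python
import math

def rnn_pass(xs, w_ih, w_hh, b_ih, b_hh):
    """One tanh-RNN pass over xs, starting from a zero hidden state."""
    h = [0.0] * len(w_hh)
    states = []
    for x in xs:
        h = [
            math.tanh(
                sum(w * v for w, v in zip(w_ih[j], x)) + b_ih[j]
                + sum(w * v for w, v in zip(w_hh[j], h)) + b_hh[j]
            )
            for j in range(len(w_hh))
        ]
        states.append(h)
    return states

bi_inputs = [[1, 1, 1], [1, 2, 1], [2, 3, 1], [1, 3, 1]]

# Rounded weights copied from the state_dict() output above
fwd = dict(w_ih=[[0.5186, -0.4766, 0.1410], [0.2750, -0.6602, -0.6266]],
           w_hh=[[-0.6757, 0.2885], [0.2265, -0.4132]],
           b_ih=[0.6045, 0.0802], b_hh=[0.3446, 0.4002])
bwd = dict(w_ih=[[-0.1216, -0.1432, 0.3163], [-0.6950, 0.2082, 0.1613]],
           w_hh=[[-0.3441, 0.0915], [-0.2372, 0.2422]],
           b_ih=[0.3067, 0.6804], b_hh=[0.2079, 0.1318])

forward_states = rnn_pass(bi_inputs, **fwd)
# Run the reverse weights over the reversed sequence, then flip the result
# back so that index t again refers to time stamp t.
backward_states = rnn_pass(bi_inputs[::-1], **bwd)[::-1]

# Each output row is [forward state | backward state]
full_output_py = [f + b for f, b in zip(forward_states, backward_states)]
# Final states: the forward run ends at the last time stamp, the backward run at the first
final_states_py = [forward_states[-1], backward_states[0]]

for row in full_output_py:
    print([round(v, 4) for v in row])
```

Here `full_output_py` plays the role of `out_all` and `final_states_py` the role of `out_last`, up to the rounding of the copied weights.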

## **Stacked RNNs**

In [0]:
# Defining the RNN layer
rnn= nn.RNN(input_size=3, hidden_size=3, num_layers = 2, bias = True, batch_first=True, bidirectional=False)

In [230]:
out_all, out_last = rnn(seq)

print(f"Out all shape : {out_all.shape}")

print(f"Out last shape : {out_last.shape}")

Out all shape : torch.Size([1, 4, 3])
Out last shape : torch.Size([2, 1, 3])


In [231]:
out_all

tensor([[[ 0.0348, -0.2825,  0.5068],
         [-0.0356, -0.2683,  0.6636],
         [ 0.0149, -0.2029,  0.6910],
         [ 0.0190, -0.2103,  0.6927]]], grad_fn=<TransposeBackward1>)

In [232]:
out_last

tensor([[[ 0.7414,  0.9353,  0.8858]],

        [[ 0.0190, -0.2103,  0.6927]]], grad_fn=<StackBackward>)

In [233]:
rnn.state_dict()

OrderedDict([('weight_ih_l0', tensor([[ 0.0130,  0.2709, -0.3058],
                      [-0.4475,  0.3193,  0.4943],
                      [ 0.2080,  0.0944, -0.2989]])),
             ('weight_hh_l0', tensor([[ 0.4855,  0.0833, -0.1622],
                      [-0.5323, -0.1938,  0.3005],
                      [ 0.1978,  0.5410,  0.3437]])),
             ('bias_ih_l0', tensor([-0.1454,  0.4145, -0.2442])),
             ('bias_hh_l0', tensor([0.3113, 0.5509, 0.5318])),
             ('weight_ih_l1', tensor([[ 0.0108,  0.3439,  0.4292],
                      [-0.2595, -0.0296,  0.3443],
                      [ 0.4159,  0.0034,  0.0543]])),
             ('weight_hh_l1', tensor([[-0.5193,  0.3936, -0.3125],
                      [-0.5299,  0.5054,  0.1984],
                      [ 0.4348, -0.4416, -0.1205]])),
             ('bias_ih_l1', tensor([-0.1728, -0.0305,  0.5360])),
             ('bias_hh_l1', tensor([-0.2146, -0.2945, -0.0553]))])

### **Computing Outputs - Layer 1**

In [245]:
# Extracting the weights for RNN Layer 1
wih_10 = rnn.weight_ih_l0
whh_10 = rnn.weight_hh_l0

bih_10 = rnn.bias_ih_l0
bhh_10 = rnn.bias_hh_l0

output_1 = []

h_previous = torch.zeros([1,3])  # Since the hidden_size parameter is 3, all hidden states will have a shape of [1,3]

for i in range(seq.shape[1]):

  x = seq[0][i]
  h_current = torch.tanh(Tensor(matmul(x,wih_10.T) + bih_10  + matmul(h_previous,whh_10.T) + bhh_10))
  h_previous = h_current
  output_1.append(h_current)

output_1

[tensor([[0.1430, 0.8696, 0.2832]], grad_fn=<TanhBackward>),
 tensor([[0.4706, 0.9036, 0.7538]], grad_fn=<TanhBackward>),
 tensor([[0.7066, 0.8677, 0.9102]], grad_fn=<TanhBackward>),
 tensor([[0.7414, 0.9353, 0.8858]], grad_fn=<TanhBackward>)]

### **Computing Outputs - Layer 2**

In [252]:
# Extracting the weights for RNN Layer 2
wih_11 = rnn.weight_ih_l1
whh_11 = rnn.weight_hh_l1

bih_11 = rnn.bias_ih_l1
bhh_11 = rnn.bias_hh_l1

output_2 = []

h_previous = torch.zeros([1,3]) # Since the hidden_size parameter is 3, all hidden states will have a shape of [1,3]

for i in range(seq.shape[1]):
  
  # The input to layer 2 at each time stamp is the layer 1 hidden state at that time stamp
  h_current = torch.tanh(Tensor(matmul(output_1[i],wih_11.T) + bih_11  + matmul(h_previous,whh_11.T) + bhh_11))
  h_previous = h_current
  output_2.append(h_current)

output_2

[tensor([[ 0.0348, -0.2825,  0.5068]], grad_fn=<TanhBackward>),
 tensor([[-0.0356, -0.2683,  0.6636]], grad_fn=<TanhBackward>),
 tensor([[ 0.0149, -0.2029,  0.6910]], grad_fn=<TanhBackward>),
 tensor([[ 0.0190, -0.2103,  0.6927]], grad_fn=<TanhBackward>)]

In [238]:
out_all

tensor([[[ 0.0348, -0.2825,  0.5068],
         [-0.0356, -0.2683,  0.6636],
         [ 0.0149, -0.2029,  0.6910],
         [ 0.0190, -0.2103,  0.6927]]], grad_fn=<TransposeBackward1>)

In [239]:
out_last

tensor([[[ 0.7414,  0.9353,  0.8858]],

        [[ 0.0190, -0.2103,  0.6927]]], grad_fn=<StackBackward>)
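The comparison above shows how the two layers relate: layer 2 consumes the hidden states of layer 1, `out_all` contains only the states of the top layer, and `out_last` stacks the final state of each layer. The sketch below reproduces this with plain Python, assuming the rounded weights printed by `state_dict()` above. The helper `rnn_layer` and the `stack_*` names are made up for this illustration.

```python
import math

def rnn_layer(inputs, w_ih, w_hh, b_ih, b_hh):
    """Run one tanh-RNN layer over `inputs`, starting from a zero hidden state."""
    h = [0.0] * len(w_hh)
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w * v for w, v in zip(w_ih[j], x)) + b_ih[j]
                + sum(w * v for w, v in zip(w_hh[j], h)) + b_hh[j]
            )
            for j in range(len(w_hh))
        ]
        states.append(h)
    return states

# Rounded weights copied from the state_dict() output above
layer1 = dict(
    w_ih=[[0.0130, 0.2709, -0.3058], [-0.4475, 0.3193, 0.4943], [0.2080, 0.0944, -0.2989]],
    w_hh=[[0.4855, 0.0833, -0.1622], [-0.5323, -0.1938, 0.3005], [0.1978, 0.5410, 0.3437]],
    b_ih=[-0.1454, 0.4145, -0.2442],
    b_hh=[0.3113, 0.5509, 0.5318],
)
layer2 = dict(
    w_ih=[[0.0108, 0.3439, 0.4292], [-0.2595, -0.0296, 0.3443], [0.4159, 0.0034, 0.0543]],
    w_hh=[[-0.5193, 0.3936, -0.3125], [-0.5299, 0.5054, 0.1984], [0.4348, -0.4416, -0.1205]],
    b_ih=[-0.1728, -0.0305, 0.5360],
    b_hh=[-0.2146, -0.2945, -0.0553],
)

stack_inputs = [[1, 1, 1], [1, 2, 1], [2, 3, 1], [1, 3, 1]]

layer1_states = rnn_layer(stack_inputs, **layer1)   # layer 1 reads the raw inputs
layer2_states = rnn_layer(layer1_states, **layer2)  # layer 2 reads layer 1's states

stack_out_all = layer2_states                            # states of the top layer only
stack_out_last = [layer1_states[-1], layer2_states[-1]]  # final state of each layer

for row in stack_out_all:
    print([round(v, 4) for v in row])
```

Both layers use the same update rule; only the weights and the input sequence differ.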

## **Expanding to GRUs**

In [0]:
# Defining a basic GRU layer
gru = nn.GRU(input_size=3, hidden_size=1, num_layers = 1, bias = True, batch_first=True)

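$$
\begin{array}{ll}
r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\
n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\
h_t = (1 - z_t) * n_t + z_t * h_{(t-1)}
\end{array}
$$

where $r_t$, $z_t$ and $n_t$ are the reset, update and new gates respectively, $\sigma$ is the sigmoid function and $*$ is the Hadamard (element-wise) product.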

In [328]:
gru.state_dict()

OrderedDict([('weight_ih_l0', tensor([[-0.0673, -0.1624,  0.4516],
                      [ 0.2531,  0.1567,  0.2796],
                      [-0.4242, -0.5414, -0.8276]])),
             ('weight_hh_l0', tensor([[ 0.6088],
                      [-0.9962],
                      [-0.7625]])),
             ('bias_ih_l0', tensor([0.8264, 0.8598, 0.6835])),
             ('bias_hh_l0', tensor([-0.4312, -0.8736, -0.8906]))])

In [329]:
out_all, out_last = gru(seq)

print(f"Out all shape : {out_all.shape}")

print(f"Out last shape : {out_last.shape}")

Out all shape : torch.Size([1, 4, 1])
Out last shape : torch.Size([1, 1, 1])


### Extracting layer weights

In [0]:
wir_10 = gru.weight_ih_l0[0,:].squeeze(0)
wiz_10 = gru.weight_ih_l0[1,:].squeeze(0)
win_10 = gru.weight_ih_l0[2,:].squeeze(0)

whr_10 = gru.weight_hh_l0[0]
whz_10 = gru.weight_hh_l0[1]
whn_10 = gru.weight_hh_l0[2]

bir_10 = gru.bias_ih_l0[0]
biz_10 = gru.bias_ih_l0[1]
bin_10 = gru.bias_ih_l0[2]

bhr_10 = gru.bias_hh_l0[0]
bhz_10 = gru.bias_hh_l0[1]
bhn_10 = gru.bias_hh_l0[2]

### **Computing outputs**

#### Hidden State 1

In [337]:
x = seq[0][0]

h_previous = torch.Tensor([0.0])

r = torch.sigmoid(dot(x,wir_10) + bir_10  + dot(h_previous,whr_10) + bhr_10  )
z = torch.sigmoid(dot(x,wiz_10) + biz_10  + dot(h_previous,whz_10) + bhz_10  )
n = torch.tanh(   dot(x,win_10) + bin_10  + r*( dot(h_previous,whn_10) + bhn_10) )
h1 = (1-z)*n + z*h_previous

h1

tensor([-0.3149], grad_fn=<AddBackward0>)

#### Hidden State 2

In [339]:
x = seq[0][1]

h_previous = h1

r = torch.sigmoid(dot(x,wir_10) + bir_10  + dot(h_previous,whr_10) + bhr_10  )
z = torch.sigmoid(dot(x,wiz_10) + biz_10  + dot(h_previous,whz_10) + bhz_10  )
n = torch.tanh(   dot(x,win_10) + bin_10  + r*( dot(h_previous,whn_10) + bhn_10) )
h2 = (1-z)*n + z*h_previous

h2

tensor([-0.4718], grad_fn=<AddBackward0>)

#### Hidden State 3

In [340]:
x = seq[0][2]

h_previous = h2

r = torch.sigmoid(dot(x,wir_10) + bir_10  + dot(h_previous,whr_10) + bhr_10  )
z = torch.sigmoid(dot(x,wiz_10) + biz_10  + dot(h_previous,whz_10) + bhz_10  )
n = torch.tanh(   dot(x,win_10) + bin_10  + r*( dot(h_previous,whn_10) + bhn_10) )
h3 = (1-z)*n + z*h_previous

h3

tensor([-0.5516], grad_fn=<AddBackward0>)

In [333]:
out_all

tensor([[[-0.3149],
         [-0.4718],
         [-0.5516],
         [-0.6281]]], grad_fn=<TransposeBackward1>)
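The three manual steps above can be rolled into a loop that also produces the fourth state. The sketch below is a standard-library illustration that assumes the rounded values printed by `gru.state_dict()`, with the rows of each weight taken in the order reset, update, new. The function name `gru_states` and its helpers are made up for this example.

```python
import math

def sigmoid_py(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot_py(u, v):
    return sum(a * b for a, b in zip(u, v))

# Rounded values copied from gru.state_dict(); rows are ordered r, z, n
W_IR, W_IZ, W_IN = [-0.0673, -0.1624, 0.4516], [0.2531, 0.1567, 0.2796], [-0.4242, -0.5414, -0.8276]
W_HR, W_HZ, W_HN = 0.6088, -0.9962, -0.7625
B_IR, B_IZ, B_IN = 0.8264, 0.8598, 0.6835
B_HR, B_HZ, B_HN = -0.4312, -0.8736, -0.8906

def gru_states(xs, h0=0.0):
    """Hidden states of a GRU with hidden_size = 1 for every time stamp in xs."""
    h = h0
    states = []
    for x in xs:
        r = sigmoid_py(dot_py(W_IR, x) + B_IR + W_HR * h + B_HR)          # reset gate
        z = sigmoid_py(dot_py(W_IZ, x) + B_IZ + W_HZ * h + B_HZ)          # update gate
        n = math.tanh(dot_py(W_IN, x) + B_IN + r * (W_HN * h + B_HN))     # new gate
        h = (1 - z) * n + z * h                                           # blend old and new
        states.append(h)
    return states

gru_out = gru_states([[1, 1, 1], [1, 2, 1], [2, 3, 1], [1, 3, 1]])
print([round(v, 4) for v in gru_out])
```

Up to the rounding of the copied weights, the four values should line up with the rows of `out_all` shown above.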

## **Extending to LSTMs**

In [0]:
# Defining a basic LSTM layer
lstm = nn.LSTM(input_size=3, hidden_size=1, num_layers = 1, bias = True, batch_first=True)

$$
\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
c_t = f_t * c_{(t-1)} + i_t * g_t \\
h_t = o_t * \tanh(c_t)
\end{array}
$$

where $h_t$ is the hidden state at time $t$, $c_t$ is the cell state at time $t$, and $i_t$, $f_t$, $g_t$ and $o_t$ are the input, forget, cell and output gates respectively.

In [343]:
lstm.state_dict()

OrderedDict([('weight_ih_l0', tensor([[-0.6096,  0.3653,  0.9641],
                      [-0.2380, -0.3644,  0.6403],
                      [ 0.3301, -0.4984,  0.1940],
                      [-0.8881,  0.1192, -0.6559]])),
             ('weight_hh_l0', tensor([[-0.2598],
                      [-0.9220],
                      [-0.6546],
                      [-0.6251]])),
             ('bias_ih_l0', tensor([ 0.9738, -0.7896,  0.2528, -0.3227])),
             ('bias_hh_l0', tensor([ 0.5644, -0.0880,  0.0569,  0.5657]))])

In [347]:
out_all, out_last = lstm(seq)

print(f"Out all shape : {out_all.shape}")

print(f"Out last shape : {out_last[0].shape}")

Out all shape : torch.Size([1, 4, 1])
Out last shape : torch.Size([1, 1, 1])


In [345]:
out_last

(tensor([[[-0.1506]]], grad_fn=<StackBackward>),
 tensor([[[-0.5870]]], grad_fn=<StackBackward>))

In [346]:
out_all

tensor([[[ 0.0668],
         [-0.0310],
         [-0.0401],
         [-0.1506]]], grad_fn=<TransposeBackward0>)

### Extracting weights

In [0]:
wii_10 = lstm.weight_ih_l0[0,:].squeeze(0)
wif_10 = lstm.weight_ih_l0[1,:].squeeze(0)
wig_10 = lstm.weight_ih_l0[2,:].squeeze(0)
wio_10 = lstm.weight_ih_l0[3,:].squeeze(0)

whi_10 = lstm.weight_hh_l0[0]
whf_10 = lstm.weight_hh_l0[1]
whg_10 = lstm.weight_hh_l0[2]
who_10 = lstm.weight_hh_l0[3]

bii_10 = lstm.bias_ih_l0[0]
bif_10 = lstm.bias_ih_l0[1]
big_10 = lstm.bias_ih_l0[2]
bio_10 = lstm.bias_ih_l0[3]

bhi_10 = lstm.bias_hh_l0[0]
bhf_10 = lstm.bias_hh_l0[1]
bhg_10 = lstm.bias_hh_l0[2]
bho_10 = lstm.bias_hh_l0[3]

### **Computing outputs**

$$
\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
c_t = f_t * c_{(t-1)} + i_t * g_t \\
h_t = o_t * \tanh(c_t)
\end{array}
$$

#### Hidden State 1

In [350]:
x = seq[0][0]

h_previous = torch.Tensor([0.0])
c_previous = torch.Tensor([0.0])

i = torch.sigmoid(dot(x,wii_10) + bii_10  + dot(h_previous,whi_10) + bhi_10  )
f = torch.sigmoid(dot(x,wif_10) + bif_10  + dot(h_previous,whf_10) + bhf_10  )
g = torch.tanh(   dot(x,wig_10) + big_10  + dot(h_previous,whg_10) + bhg_10  )
o = torch.sigmoid(dot(x,wio_10) + bio_10  + dot(h_previous,who_10) + bho_10  )
c1 = f* c_previous + i*g
h1 = o* torch.tanh(c1)

h1

tensor([0.0668], grad_fn=<MulBackward0>)

#### Hidden State 2

In [352]:
x = seq[0][1]

h_previous = h1
c_previous = c1

i = torch.sigmoid(dot(x,wii_10) + bii_10  + dot(h_previous,whi_10) + bhi_10  )
f = torch.sigmoid(dot(x,wif_10) + bif_10  + dot(h_previous,whf_10) + bhf_10  )
g = torch.tanh(   dot(x,wig_10) + big_10  + dot(h_previous,whg_10) + bhg_10  )
o = torch.sigmoid(dot(x,wio_10) + bio_10  + dot(h_previous,who_10) + bho_10  )
c2 = f* c_previous + i*g
h2 = o* torch.tanh(c2)

h2

tensor([-0.0310], grad_fn=<MulBackward0>)

In [351]:
out_all

tensor([[[ 0.0668],
         [-0.0310],
         [-0.0401],
         [-0.1506]]], grad_fn=<TransposeBackward0>)
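The two manual steps above extend naturally to the whole sequence. The sketch below is a standard-library illustration that assumes the rounded values printed by `lstm.state_dict()`, with rows taken in the order input, forget, cell, output. The function name `lstm_states` and its helpers are made up for this example. It returns every hidden state together with the final cell state, which corresponds to the second element of `out_last`.

```python
import math

def sigmoid_py(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot_py(u, v):
    return sum(a * b for a, b in zip(u, v))

# Rounded values copied from lstm.state_dict(); rows are ordered i, f, g, o
W_I = [[-0.6096, 0.3653, 0.9641], [-0.2380, -0.3644, 0.6403],
       [0.3301, -0.4984, 0.1940], [-0.8881, 0.1192, -0.6559]]
W_H = [-0.2598, -0.9220, -0.6546, -0.6251]
B_I = [0.9738, -0.7896, 0.2528, -0.3227]
B_H = [0.5644, -0.0880, 0.0569, 0.5657]

def lstm_states(xs):
    """Hidden states and final cell state of an LSTM with hidden_size = 1."""
    h, c = 0.0, 0.0  # initial hidden and cell states
    hs = []
    for x in xs:
        # Pre-activations of the four gates, in the order i, f, g, o
        pre = [dot_py(W_I[k], x) + B_I[k] + W_H[k] * h + B_H[k] for k in range(4)]
        i_gate = sigmoid_py(pre[0])
        f_gate = sigmoid_py(pre[1])
        g_gate = math.tanh(pre[2])
        o_gate = sigmoid_py(pre[3])
        c = f_gate * c + i_gate * g_gate   # update the cell state
        h = o_gate * math.tanh(c)          # expose part of it as the hidden state
        hs.append(h)
    return hs, c

lstm_hs, lstm_c_final = lstm_states([[1, 1, 1], [1, 2, 1], [2, 3, 1], [1, 3, 1]])
print([round(v, 4) for v in lstm_hs], round(lstm_c_final, 4))
```

Unlike the plain RNN and the GRU, the LSTM carries two pieces of state between time stamps, which is why `out_last` is a tuple of the final hidden state and the final cell state.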


