In [0]:
import torch
import torch.nn as nn

In PyTorch, an LSTM layer is constructed as `nn.LSTM(input_size, hidden_size)`. Here `hidden_size` is simply the number of units in the layer, also known as hidden units; it is also the size of the layer's output at each time step.
`input_size` is the size of the input vector, i.e. the size of a single time step of a single example.

By default, PyTorch LSTMs expect their input to be 3-dimensional, i.e.
**(seq_len/number of timesteps, batch_size, input_size)**.
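To make the three axes concrete, here is a minimal sketch with a batch larger than one. The sizes and the names `lstm_demo`, `x`, and `out_demo` are made up for illustration; only `nn.LSTM` and its default input layout come from PyTorch.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
seq_len, batch_size, input_size, hidden_size = 6, 4, 5, 3

lstm_demo = nn.LSTM(input_size, hidden_size)

# Default layout (batch_first=False): (seq_len, batch_size, input_size).
x = torch.randn(seq_len, batch_size, input_size)

# When no initial state is passed, it defaults to zeros.
out_demo, (h_n, c_n) = lstm_demo(x)

print(out_demo.shape)  # torch.Size([6, 4, 3])
print(h_n.shape)       # torch.Size([1, 4, 3])
print(c_n.shape)       # torch.Size([1, 4, 3])
```

The output keeps the time and batch axes and replaces the last axis with `hidden_size`.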

In [25]:
lstm = nn.LSTM(5, 3)
inputs = [torch.randn(1,5) for _ in range(6)]
inputs

[tensor([[-1.1272,  0.4237, -0.1290,  0.9384,  0.1659]]),
 tensor([[-1.9339,  1.2873,  0.6601, -2.3737, -0.6493]]),
 tensor([[-0.9119, -0.2122, -0.3635, -0.0915,  0.0601]]),
 tensor([[-0.6782,  1.1880, -0.8136, -0.1848,  0.1561]]),
 tensor([[-1.1382,  0.8900,  1.5628, -1.0369, -0.1432]]),
 tensor([[-0.9382,  1.1817,  1.5930,  1.0027,  0.9704]])]

Hence, the above will be the input to the network: a list of 6 tensors, each of shape (1, 5).

In [0]:
# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))

for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    _, hidden = lstm(i.view(1, 1, -1), hidden)

In [16]:
hidden

(tensor([[[ 0.0992, -0.1988,  0.4391]]], grad_fn=<StackBackward>),
 tensor([[[ 0.2067, -0.4157,  0.8444]]], grad_fn=<StackBackward>))

So here, at each step, we feed the current input together with the state from the previous time step into the LSTM. After the loop, `hidden` is the tuple `(h_n, c_n)`: the hidden state and the cell state after the last time step. Note: these are the cell's output values, not the gate activations.
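Stepping through the sequence by hand should give the same final state as passing the whole sequence in one call, provided both start from the same initial state. The sketch below checks this; the names `lstm_check`, `seq`, `h0`, and `c0` are illustrative and not part of the notebook above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm_check = nn.LSTM(5, 3)
seq = torch.randn(6, 1, 5)   # (seq_len, batch_size, input_size)
h0 = torch.randn(1, 1, 3)
c0 = torch.randn(1, 1, 3)

# One time step per call, carrying the state forward.
state = (h0, c0)
for step in seq:             # each step has shape (1, 5)
    _, state = lstm_check(step.view(1, 1, -1), state)

# The whole sequence in a single call, from the same initial state.
_, state_full = lstm_check(seq, (h0, c0))

same_h = torch.allclose(state[0], state_full[0], atol=1e-5)
same_c = torch.allclose(state[1], state_full[1], atol=1e-5)
print(same_h, same_c)
```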

### Weights of LSTM

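$$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi})$$
$$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf})$$
$$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg})$$
$$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho})$$
$$c_t = f_t \ast c_{t-1} + i_t \ast g_t$$
$$h_t = o_t \ast \tanh(c_t)$$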
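As a rough check of these equations, the sketch below computes one time step by hand and compares it with `nn.LSTM`. It assumes, following the PyTorch documentation, that the four gates are stacked along the first dimension of `weight_ih_l0` and `weight_hh_l0` in the order i, f, g, o. The names `lstm_ref`, `x_t`, `h_prev`, and `c_prev` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size = 5, 3
lstm_ref = nn.LSTM(input_size, hidden_size)

x_t = torch.randn(1, input_size)       # one time step, batch of 1
h_prev = torch.randn(1, hidden_size)
c_prev = torch.randn(1, hidden_size)

# All four gate pre-activations at once, shape (1, 4 * hidden_size).
# Assumption: rows are stacked in the order i, f, g, o.
gates = (x_t @ lstm_ref.weight_ih_l0.T + lstm_ref.bias_ih_l0
         + h_prev @ lstm_ref.weight_hh_l0.T + lstm_ref.bias_hh_l0)
i_t, f_t, g_t, o_t = gates.chunk(4, dim=1)

i_t = torch.sigmoid(i_t)
f_t = torch.sigmoid(f_t)
g_t = torch.tanh(g_t)
o_t = torch.sigmoid(o_t)

c_t = f_t * c_prev + i_t * g_t         # element-wise products
h_t = o_t * torch.tanh(c_t)

# Reference: the same step through the built-in layer.
_, (h_ref, c_ref) = lstm_ref(
    x_t.view(1, 1, -1),
    (h_prev.view(1, 1, -1), c_prev.view(1, 1, -1)),
)
h_matches = torch.allclose(h_t, h_ref[0], atol=1e-5)
c_matches = torch.allclose(c_t, c_ref[0], atol=1e-5)
print(h_matches, c_matches)
```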

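The equations above contain 8 weight matrices: 4 input-to-hidden ($W_{ii}, W_{if}, W_{ig}, W_{io}$) and 4 hidden-to-hidden ($W_{hi}, W_{hf}, W_{hg}, W_{ho}$). PyTorch stacks each group of 4 into a single tensor, so the state_dict holds only 2 weight tensors (plus 2 bias tensors): `weight_ih_l0` of shape (4 * hidden_size, input_size), which in this case should be (12, 5), and `weight_hh_l0` of shape (4 * hidden_size, hidden_size), here (12, 3).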
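The parameter shapes can be listed directly instead of reading them off the printed tensors. This is a small sketch on a fresh layer; `lstm_shapes` and `shapes` are illustrative names.

```python
import torch.nn as nn

lstm_shapes = nn.LSTM(5, 3)   # input_size=5, hidden_size=3

# Map each parameter name to its shape.
shapes = {name: tuple(p.shape) for name, p in lstm_shapes.state_dict().items()}
for name, shape in shapes.items():
    print(name, shape)
# weight_ih_l0 (12, 5)
# weight_hh_l0 (12, 3)
# bias_ih_l0 (12,)
# bias_hh_l0 (12,)
```

Each first dimension is 4 * hidden_size = 12, one block of 3 rows per gate.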

In [17]:
lstm.state_dict()

OrderedDict([('weight_ih_l0',
              tensor([[ 0.0509, -0.5072,  0.2130,  0.0599,  0.4325],
                      [ 0.0322, -0.0359, -0.5053, -0.1728, -0.2117],
                      [ 0.5532, -0.3448,  0.4821,  0.5737, -0.1600],
                      [ 0.4477,  0.4059,  0.1165,  0.0745,  0.0109],
                      [ 0.4278, -0.2320,  0.0084, -0.4489, -0.1401],
                      [ 0.4040,  0.3669, -0.4151, -0.4016,  0.1966],
                      [ 0.0455, -0.5709,  0.3468, -0.0290,  0.1532],
                      [ 0.2783, -0.1945,  0.0117, -0.2531,  0.4458],
                      [ 0.3351,  0.2846,  0.4378,  0.3740, -0.5650],
                      [ 0.2323, -0.0755,  0.0764, -0.0998,  0.0061],
                      [ 0.2153, -0.0778,  0.0090, -0.0450, -0.0455],
                      [-0.4459, -0.2734,  0.5608,  0.0192, -0.5529]])),
             ('weight_hh_l0', tensor([[-0.4535,  0.1470, -0.4320],
                      [-0.0315,  0.1942, -0.1442],
                     

### Output of LSTM

In [26]:
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  

print('Input shape:', inputs.shape)
out, hidden = lstm(inputs, hidden)
print('out shape',out.shape)
print('hidden shape',hidden[0].shape, hidden[1].shape)

Input shape: torch.Size([6, 1, 5])
out shape torch.Size([6, 1, 3])
hidden shape torch.Size([1, 1, 3]) torch.Size([1, 1, 3])


In [27]:
out

tensor([[[-0.2465, -0.1238, -0.0830]],

        [[-0.0680, -0.1076, -0.1673]],

        [[-0.1711, -0.0456, -0.1984]],

        [[-0.2222, -0.2052, -0.0098]],

        [[-0.1096, -0.1382, -0.1539]],

        [[-0.1038, -0.1225, -0.1069]]], grad_fn=<StackBackward>)

In [28]:
hidden

(tensor([[[-0.1038, -0.1225, -0.1069]]], grad_fn=<StackBackward>),
 tensor([[[-0.1823, -0.1720, -0.2417]]], grad_fn=<StackBackward>))

Hence the `out` variable stores the hidden state from every time step; its last entry is the same as `hidden[0]`, as the two outputs above show. These per-step hidden states are useful when we are working with many-to-many sequence data.
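A short sketch of how the two return values relate, assuming a single-layer, unidirectional LSTM like the one used here. The names `lstm_rel`, `all_steps`, `many_to_many`, and `many_to_one` are illustrative.

```python
import torch
import torch.nn as nn

lstm_rel = nn.LSTM(5, 3)
seq = torch.randn(6, 1, 5)             # (seq_len, batch_size, input_size)
all_steps, (h_n, c_n) = lstm_rel(seq)  # initial state defaults to zeros

# The last time step of the output is the final hidden state.
last_matches = torch.allclose(all_steps[-1], h_n[0])
print(last_matches)

many_to_many = all_steps       # (6, 1, 3): one vector per time step
many_to_one = all_steps[-1]    # (1, 3): only the last time step
print(many_to_many.shape, many_to_one.shape)
```

For many-to-one tasks such as sequence classification, the last step alone is usually enough; for many-to-many tasks the full `all_steps` tensor is passed on.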

### Time distributed dense layer on top of LSTM

For seq2seq models, we often need to feed the output of LSTM layers to Linear layers. A linear layer is usually thought of as taking 2D input, (batch_size, features), while the output of an LSTM layer is 3D: it has an extra time axis. In Keras this is handled with the [TimeDistributed Dense layer](https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/). PyTorch has no layer of that name, so instead we will use a for loop and feed the output from each time step one by one. (Note that `nn.Linear` acts on the last dimension of its input, so it can also be applied to the 3D LSTM output directly.)
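A minimal sketch of the loop, under the assumption of a linear layer with 2 output features (a made-up size). It also compares the result with applying `nn.Linear` to the whole 3D tensor. The names `lstm_td`, `linear`, `looped`, and `direct` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm_td = nn.LSTM(5, 3)
linear = nn.Linear(3, 2)     # hypothetical output size of 2

seq = torch.randn(6, 1, 5)   # (seq_len, batch_size, input_size)
lstm_out, _ = lstm_td(seq)   # (6, 1, 3)

# Loop over the time axis: each slice has shape (batch_size, hidden_size).
per_step = [linear(step_out) for step_out in lstm_out]
looped = torch.stack(per_step)   # back to (seq_len, batch_size, 2)

# nn.Linear is applied over the last dimension, so one call covers all steps.
direct = linear(lstm_out)

print(looped.shape)
outputs_match = torch.allclose(looped, direct, atol=1e-5)
print(outputs_match)
```

The same weights are shared across all time steps in both versions, which is what a time-distributed dense layer does.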
