In [1]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

Sequence Models and Long Short-Term Memory Networks
===================================================

At this point, we have seen various feed-forward networks. Such
networks maintain no state between inputs, which might not be
the behavior we want. Sequence models are central to NLP: they are
models where there is some sort of dependence through time between your
inputs. The classical example of a sequence model is the Hidden Markov
Model for part-of-speech tagging. Another example is the conditional
random field.

A recurrent neural network is a network that maintains some kind of
state. For example, its output could be used as part of the next input,
so that information can propagate along as the network passes over the
sequence. In the case of an LSTM, for each element in the sequence,
there is a corresponding *hidden state* $h_t$, which in principle can
contain information from arbitrary points earlier in the sequence. We
can use the hidden state to predict words in a language model,
part-of-speech tags, and a myriad of other things.
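
To make "maintaining state" concrete, here is a minimal sketch of a plain (Elman-style) recurrent update, not the LSTM itself. The parameter names `W_x`, `W_h`, `b` and all sizes are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

input_dim, hidden_dim, seq_len = 3, 4, 5  # illustrative sizes (assumed)

# Randomly initialized parameters of a plain recurrent cell (hypothetical names)
W_x = torch.randn(hidden_dim, input_dim)
W_h = torch.randn(hidden_dim, hidden_dim)
b = torch.zeros(hidden_dim)

xs = torch.randn(seq_len, input_dim)  # one sequence of 5 input vectors
h = torch.zeros(hidden_dim)           # initial state

states = []
for x_t in xs:
    # The new state depends on the current input and on the previous state
    h = torch.tanh(W_x @ x_t + W_h @ h + b)
    states.append(h)

states = torch.stack(states)
print(states.shape)  # torch.Size([5, 4])
```

Because `h` is fed back into the update, the state at step $t$ can in principle carry information from any earlier step.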

LSTMs in PyTorch
----------------

Before getting to the example, note a few things. PyTorch's LSTM
expects all of its inputs to be 3D tensors. The semantics of the axes of
these tensors are important. With the default `batch_first=False`, the
first axis is the sequence itself, the second indexes instances in the
mini-batch, and the third indexes elements of the input. We haven't
discussed mini-batching, so let's just ignore that and assume we will
always have just 1 dimension on the second axis. If we want to run the
sequence model over the sentence "The cow jumped", our input should look like

$$\begin{aligned}
\begin{bmatrix}
\overbrace{q_\text{The}}^\text{row vector} \\
q_\text{cow} \\
q_\text{jumped}
\end{bmatrix}
\end{aligned}$$

Except remember there is an additional 2nd dimension with size 1.

In addition, you could go through the sequence one at a time, in which
case the 1st axis will have size 1 also.
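
As a rough illustration of these shapes, the sketch below builds the input for "The cow jumped" from three random vectors. The vectors `q_the`, `q_cow`, `q_jumped` and the size `embedding_dim = 4` are hypothetical stand-ins, not values used elsewhere in this tutorial.

```python
import torch

torch.manual_seed(0)

embedding_dim = 4  # illustrative size (assumed)

# Hypothetical stand-ins for the row vectors q_The, q_cow, q_jumped
q_the, q_cow, q_jumped = (torch.randn(embedding_dim) for _ in range(3))

# Stack into (seq_len, input_size), then add the mini-batch axis in position 1
sentence = torch.stack([q_the, q_cow, q_jumped]).unsqueeze(1)
print(sentence.shape)  # torch.Size([3, 1, 4])

# A single time step is a slice along the first axis: (1, 1, input_size)
first_step = sentence[0:1]
print(first_step.shape)  # torch.Size([1, 1, 4])
```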

Let's see a quick example.


In [2]:
# Author: Robert Guthrie

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7e9820e0e470>

In [3]:
lstm = nn.LSTM(3, 3)  # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)]  # make a sequence of length 5

# initialize the hidden state.
hidden = (torch.randn(1, 1, 3),
          torch.randn(1, 1, 3))
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden)

# Alternatively, we can process the entire sequence at once.
# The first value returned by the LSTM is all of the hidden states throughout
# the sequence. The second is a tuple (h_n, c_n) holding the most recent
# hidden state and cell state
# (compare the last slice of "out" with the first element of "hidden" below; they are the same).
# The reason for this is that:
# "out" gives you access to all hidden states in the sequence, while
# "hidden" allows you to continue the sequence and backpropagate,
# by passing it as an argument to the lstm at a later time.
# Add the extra 2nd dimension
inputs = torch.cat(inputs).view(len(inputs), 1, -1)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print(out)
print("----")
print(hidden)

tensor([[[-0.0187,  0.1713, -0.2944]],

        [[-0.3521,  0.1026, -0.2971]],

        [[-0.3191,  0.0781, -0.1957]],

        [[-0.1634,  0.0941, -0.1637]],

        [[-0.3368,  0.0959, -0.0538]]], grad_fn=<MkldnnRnnLayerBackward0>)
----
(tensor([[[-0.3368,  0.0959, -0.0538]]], grad_fn=<StackBackward0>), tensor([[[-0.9825,  0.4715, -0.0633]]], grad_fn=<StackBackward0>))
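
The two claims in the comments above can be checked numerically. The following is a small sketch, assuming the same 3-dimensional LSTM but with freshly drawn inputs and initial states: it compares stepping through the sequence one element at a time against a single call, and compares the last slice of the output with the returned hidden state.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lstm = nn.LSTM(3, 3)
inputs = torch.randn(5, 1, 3)  # (seq_len, batch, input_size)
h0 = torch.randn(1, 1, 3)
c0 = torch.randn(1, 1, 3)

# Whole sequence in one call
out_all, (h_n, c_n) = lstm(inputs, (h0, c0))

# One element at a time, feeding the returned state back in
hidden = (h0, c0)
step_outs = []
for x_t in inputs:
    out_t, hidden = lstm(x_t.view(1, 1, -1), hidden)
    step_outs.append(out_t)
out_steps = torch.cat(step_outs)

same_outputs = torch.allclose(out_all, out_steps, atol=1e-5)
last_matches_hidden = torch.allclose(out_all[-1], h_n[0], atol=1e-5)
print(same_outputs, last_matches_hidden)
```

Both comparisons use a small tolerance because the fused and step-by-step code paths may differ by floating-point rounding.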


Example: An LSTM for Part-of-Speech Tagging
===========================================

In this section, we will use an LSTM to get part of speech tags. We will
not use Viterbi or Forward-Backward or anything like that, but as a
(challenging) exercise to the reader, think about how Viterbi could be
used after you have seen what is going on. In this example, we also
refer to embeddings. If you are unfamiliar with embeddings, you can read
up about them
[here](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html).

The model is as follows: let our input sentence be $w_1, \dots, w_M$,
where $w_i \in V$, our vocab. Also, let $T$ be our tag set, and $y_i$
the tag of word $w_i$. Denote our prediction of the tag of word $w_i$ by
$\hat{y}_i$.

This is a structured prediction model, where our output is a sequence
$\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $h_i$. Also, assign each tag a unique index
(like how we had word_to_ix in the word embeddings section). Then our
prediction rule for $\hat{y}_i$ is

$$\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j$$

That is, take the log softmax of the affine map of the hidden state, and
the predicted tag is the tag that has the maximum value in this vector.
Note this implies immediately that the dimensionality of the target
space of $A$ is $|T|$.
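
The prediction rule can be sketched in a few lines. Here `A`, `b`, and `h` are random stand-ins with assumed sizes (they are not trained parameters), and the point is only to show the shapes and the argmax.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

hidden_dim, num_tags, seq_len = 6, 3, 5  # illustrative sizes (assumed)

A = torch.randn(num_tags, hidden_dim)  # maps hidden space to tag space, so it has |T| rows
b = torch.randn(num_tags)
h = torch.randn(seq_len, hidden_dim)   # stand-in for the hidden states h_1, ..., h_M

logits = h @ A.T + b                       # row i holds A h_i + b
log_probs = F.log_softmax(logits, dim=1)   # one log-distribution over tags per word
y_hat = log_probs.argmax(dim=1)            # predicted tag index for each word

# Log softmax is monotonic within a row, so the argmax matches that of the raw scores
print(torch.equal(y_hat, logits.argmax(dim=1)))
```

The log softmax matters for training with a negative log likelihood loss; for prediction alone, taking the argmax of the affine map gives the same tags.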

Prepare data:


In [4]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    # Tags are: DET - determiner; NN - noun; V - verb
    # For example, the word "The" is a determiner
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
# For each words-list (sentence) and tags-list in each tuple of training_data
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:  # word has not been assigned an index yet
            word_to_ix[word] = len(word_to_ix)  # Assign each word with a unique index
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}  # Assign each tag with a unique index
ix_to_tag = ["DET", "NN", "V"]

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
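
As a quick usage sketch, `prepare_sequence` turns both words and tags into index tensors. The block repeats the function and a subset of the dictionaries from above so that it runs on its own.

```python
import torch

def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)

# Subset of the mappings built above
word_to_ix = {"The": 0, "dog": 1, "ate": 2, "the": 3, "apple": 4}
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

sentence_in = prepare_sequence("The dog ate the apple".split(), word_to_ix)
targets = prepare_sequence(["DET", "NN", "V", "DET", "NN"], tag_to_ix)
print(sentence_in)  # tensor([0, 1, 2, 3, 4])
print(targets)      # tensor([0, 1, 2, 0, 1])
```

Note that the lookup is case-sensitive ("The" and "the" get different indices), and a word missing from the dictionary raises a `KeyError`.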


Create the model:

In [5]:
class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
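
To follow the tensor shapes through `forward`, the sketch below runs the same three layers outside the class on a five-word sentence. The layers are freshly initialized here, so only the shapes are meaningful, not the values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim, hidden_dim, tagset_size = 9, 6, 6, 3  # sizes from this example

word_embeddings = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([0, 1, 2, 3, 4])       # word indices, shape (5,)
embeds = word_embeddings(sentence)              # (5, 6)
lstm_in = embeds.view(len(sentence), 1, -1)     # (5, 1, 6): add the mini-batch axis
lstm_out, _ = lstm(lstm_in)                     # (5, 1, 6): one hidden state per word
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))  # (5, 3)
tag_scores = F.log_softmax(tag_space, dim=1)    # (5, 3): each row is a log-distribution

for name, t in [("embeds", embeds), ("lstm_in", lstm_in),
                ("lstm_out", lstm_out), ("tag_scores", tag_scores)]:
    print(name, tuple(t.shape))
```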

Train the model:


In [6]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "The dog ate the apple". Element i,j is the score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1,
    # since 0 is the index of the maximum value of the first row,
    # 1 is the index of the maximum value of the second row, etc.
    # This is DET NN V DET NN, the correct sequence!
    print(tag_scores.shape)
    predicted_ix = tag_scores.argmax(dim=1)

    predicted_sequence = [ix_to_tag[ix] for ix in predicted_ix]
    print(predicted_sequence)

tensor([[-1.1389, -1.2024, -0.9693],
        [-1.1065, -1.2200, -0.9834],
        [-1.1286, -1.2093, -0.9726],
        [-1.1190, -1.1960, -0.9916],
        [-1.0137, -1.2642, -1.0366]])
torch.Size([5, 3])
['DET', 'NN', 'V', 'DET', 'NN']
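
The model returns log-probabilities and the training loop pairs them with `nn.NLLLoss`. As a side check, this combination gives the same value as applying `nn.CrossEntropyLoss` to the raw scores. The sketch below uses random scores as a stand-in for the output of `hidden2tag`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(5, 3)               # stand-in for hidden2tag output: 5 words, 3 tags
targets = torch.tensor([0, 1, 2, 0, 1])  # DET NN V DET NN

nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(nll, ce))
```

Keeping the log softmax inside the model, as this tutorial does, makes the returned scores directly interpretable as log-probabilities.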


In [7]:
import torch
import torch.nn as nn

# Define the LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

# Example input
batch_size = 1
sequence_length = 1
input_size = 10
input_tensor = torch.randn(batch_size, sequence_length, input_size)

# Forward pass
output, (hidden_state, cell_state) = lstm(input_tensor)

In [8]:
input_tensor.shape

torch.Size([1, 1, 10])

In [9]:
hidden_state.shape

torch.Size([1, 1, 20])

In [10]:
cell_state.shape

torch.Size([1, 1, 20])

In [11]:
for weights in lstm.all_weights:
  for w in weights:
    print(w.shape)
  print("----")

torch.Size([80, 10])
torch.Size([80, 20])
torch.Size([80])
torch.Size([80])
----


In [12]:
lstm.weight_hh_l0.shape

torch.Size([80, 20])

In [13]:
lstm.weight_ih_l0.shape

torch.Size([80, 10])

In [14]:
lstm.bias_hh_l0.shape

torch.Size([80])

In [15]:
lstm.bias_ih_l0.shape

torch.Size([80])
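
The leading dimension of 80 in these shapes is `4 * hidden_size`: each parameter stacks the blocks for the input, forget, cell, and output gates along its first axis. A short sketch, assuming the same sizes as above:

```python
import torch
import torch.nn as nn

input_size, hidden_size = 10, 20
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=1, batch_first=True)

# Each parameter stacks four gate blocks (input, forget, cell, output) along dim 0
assert lstm.weight_ih_l0.shape == (4 * hidden_size, input_size)
assert lstm.weight_hh_l0.shape == (4 * hidden_size, hidden_size)

# torch.chunk splits the stacked matrix back into the four per-gate matrices
W_ii, W_if, W_ig, W_io = lstm.weight_ih_l0.chunk(4, dim=0)
print(W_ii.shape)  # torch.Size([20, 10])

# Two weight matrices plus two bias vectors
expected = 4 * hidden_size * (input_size + hidden_size + 2)
actual = sum(p.numel() for p in lstm.parameters())
print(expected, actual)  # 2560 2560
```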

Vanishing gradient problem in RNNs

In [16]:
import torch
import torch.nn as nn

class RNNWithReLU(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNNWithReLU, self).__init__()
        self.hidden_size = hidden_size
        self.rnn_cell = nn.RNNCell(input_size, hidden_size)
        self.output_layer = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x, seq_length):
        h = torch.zeros(x.size(0), self.hidden_size)
        for t in range(seq_length):
            h = self.relu(self.rnn_cell(x[:, t, :], h))
        output = self.output_layer(h)
        return output

# Create a long sequence
seq_length = 1000
input_size = 10
hidden_size = 20
output_size = 1
batch_size = 32

# Initialize the model
model = RNNWithReLU(input_size, hidden_size, output_size)

# Create input and target
x = torch.randn(batch_size, seq_length, input_size)
y = torch.randn(batch_size, output_size)

# Train the model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    output = model(x, seq_length)
    loss = criterion(output, y)
    loss.backward()

    # Check gradients
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
        print(f"Gradient norm: {model.rnn_cell.weight_hh.grad.norm().item():.4f}")

    optimizer.step()


Epoch 0, Loss: 1.3160
Gradient norm: 0.1498
Epoch 10, Loss: 1.2415
Gradient norm: 0.1417
Epoch 20, Loss: 1.1728
Gradient norm: 0.1423
Epoch 30, Loss: 1.1015
Gradient norm: 0.1494
Epoch 40, Loss: 1.0224
Gradient norm: 0.1766
Epoch 50, Loss: 0.9377
Gradient norm: 0.1758
Epoch 60, Loss: 0.8472
Gradient norm: 0.1998
Epoch 70, Loss: 0.7543
Gradient norm: 0.1913
Epoch 80, Loss: 0.6599
Gradient norm: 0.2087
Epoch 90, Loss: 0.5642
Gradient norm: 0.2052


In [17]:
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(LSTMModel, self).__init__()
        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        out = self.fc(h_n[-1])
        return out

# Create a long sequence
seq_length = 1000
input_size = 10
hidden_size = 20
output_size = 1
batch_size = 32

# Initialize the model
model = LSTMModel(input_size, hidden_size, output_size)

# Create input and target
x = torch.randn(batch_size, seq_length, input_size)
y = torch.randn(batch_size, output_size)

# Train the model
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    output = model(x)
    loss = criterion(output, y)
    loss.backward()

    # Check gradients
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.4f}")
        print(f"Gradient norm: {model.lstm.weight_hh_l0.grad.norm().item():.4f}")

    optimizer.step()


Epoch 0, Loss: 0.6817
Gradient norm: 0.0325
Epoch 10, Loss: 0.6407
Gradient norm: 0.0315
Epoch 20, Loss: 0.6006
Gradient norm: 0.0341
Epoch 30, Loss: 0.5582
Gradient norm: 0.0399
Epoch 40, Loss: 0.5105
Gradient norm: 0.0474
Epoch 50, Loss: 0.4558
Gradient norm: 0.0524
Epoch 60, Loss: 0.3960
Gradient norm: 0.0511
Epoch 70, Loss: 0.3339
Gradient norm: 0.0531
Epoch 80, Loss: 0.2665
Gradient norm: 0.0609
Epoch 90, Loss: 0.1937
Gradient norm: 0.0686
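
The loops above print the gradient norm of the recurrent weights, which sums contributions from every time step. A more direct way to see how far gradient signal travels is to measure the gradient of the final hidden state with respect to each input step. The following is a minimal sketch under assumed settings: a shorter sequence, `nn.RNN` with its default tanh in place of the ReLU cell, and freshly initialized, untrained models. The helper `input_grad_norms` is hypothetical, and the printed values will vary with initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Smaller than the settings above so the sketch runs quickly (assumed values)
seq_length, input_size, hidden_size, batch_size = 100, 10, 20, 4

def input_grad_norms(rnn):
    """Norm of d(sum of last hidden state) / d(x_t) for every time step t."""
    x = torch.randn(batch_size, seq_length, input_size, requires_grad=True)
    out, _ = rnn(x)                  # out: (batch, seq, hidden)
    out[:, -1, :].sum().backward()
    # Reduce over the batch and feature axes, leaving one norm per time step
    return x.grad.pow(2).sum(dim=(0, 2)).sqrt()

rnn_norms = input_grad_norms(nn.RNN(input_size, hidden_size, batch_first=True))
lstm_norms = input_grad_norms(nn.LSTM(input_size, hidden_size, batch_first=True))

for name, norms in [("RNN", rnn_norms), ("LSTM", lstm_norms)]:
    print(f"{name}: first step {norms[0].item():.2e}, last step {norms[-1].item():.2e}")
```

A first-step norm that is many orders of magnitude smaller than the last-step norm indicates that little gradient reaches the early inputs.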


Doing the math manually for the LSTM to understand it better

In [18]:
import torch
import torch.nn as nn

# Define the LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

# Example input
batch_size = 1
sequence_length = 1
input_size = 10
hidden_size = 20  # must match the LSTM above; used below to slice the gate weights
input_tensor = torch.randn(batch_size, sequence_length, input_size)

# Forward pass
output, (hidden_state, cell_state) = lstm(input_tensor)

In [19]:
input_tensor

tensor([[[-1.7225, -2.0093, -0.1262,  1.9545, -0.1713,  0.1520, -1.9830,
           2.2576, -1.2769,  0.7515]]])

In [20]:
x_t = input_tensor.squeeze()

In [21]:
W_ii = lstm.weight_ih_l0[:hidden_size]
W_if = lstm.weight_ih_l0[hidden_size:2*hidden_size]
W_ig = lstm.weight_ih_l0[2*hidden_size:3*hidden_size]
W_io = lstm.weight_ih_l0[3*hidden_size:]

assert(W_ii.shape == (hidden_size, input_size))
assert(W_if.shape == (hidden_size, input_size))
assert(W_ig.shape == (hidden_size, input_size))
assert(W_io.shape == (hidden_size, input_size))

W_hi = lstm.weight_hh_l0[:hidden_size]
W_hf = lstm.weight_hh_l0[hidden_size:2*hidden_size]
W_hg = lstm.weight_hh_l0[2*hidden_size:3*hidden_size]
W_ho = lstm.weight_hh_l0[3*hidden_size:]

assert(W_hi.shape == (hidden_size, hidden_size))
assert(W_hf.shape == (hidden_size, hidden_size))
assert(W_hg.shape == (hidden_size, hidden_size))
assert(W_ho.shape == (hidden_size, hidden_size))

b_ii = lstm.bias_ih_l0[:hidden_size]
b_if = lstm.bias_ih_l0[hidden_size:2*hidden_size]
b_ig = lstm.bias_ih_l0[2*hidden_size:3*hidden_size]
b_io = lstm.bias_ih_l0[3*hidden_size:4*hidden_size]

assert(b_ii.shape == (hidden_size, ))
assert(b_if.shape == (hidden_size, ))
assert(b_ig.shape == (hidden_size, ))
assert(b_io.shape == (hidden_size, ))

b_hi = lstm.bias_hh_l0[:hidden_size]
b_hf = lstm.bias_hh_l0[hidden_size:2*hidden_size]
b_hg = lstm.bias_hh_l0[2*hidden_size:3*hidden_size]
b_ho = lstm.bias_hh_l0[3*hidden_size:4*hidden_size]

assert(b_hi.shape == (hidden_size, ))
assert(b_hf.shape == (hidden_size, ))
assert(b_hg.shape == (hidden_size, ))
assert(b_ho.shape == (hidden_size, ))

In [22]:
h_t_1 = torch.zeros((hidden_size,))
c_t_1 = torch.zeros((hidden_size,))

In [23]:
i_t = torch.sigmoid(W_ii @ x_t + b_ii + W_hi @ h_t_1 + b_hi)
f_t = torch.sigmoid(W_if @ x_t + b_if + W_hf @ h_t_1 + b_hf)
g_t = torch.tanh(W_ig @ x_t + b_ig + W_hg @ h_t_1 + b_hg)
o_t = torch.sigmoid(W_io @ x_t + b_io + W_ho @ h_t_1 + b_ho)
c_t = f_t * c_t_1 + i_t * g_t
h_t = o_t * torch.tanh(c_t)

In [24]:
assert(torch.isclose(c_t, cell_state).all().item())
assert(torch.isclose(h_t, hidden_state).all().item())
assert(torch.isclose(output, hidden_state).all().item())
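
The same single step can be written more compactly by computing all four gate pre-activations with one product per weight matrix and then splitting the result. This is a sketch of an equivalent formulation, assuming zero initial states as above; the variable names `i_pre`, `f_pre`, `g_pre`, `o_pre` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size = 10, 20
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=1, batch_first=True)

x_t = torch.randn(input_size)
h_prev = torch.zeros(hidden_size)
c_prev = torch.zeros(hidden_size)

with torch.no_grad():
    # One product per weight matrix yields all four gate pre-activations at once
    gates = (lstm.weight_ih_l0 @ x_t + lstm.bias_ih_l0
             + lstm.weight_hh_l0 @ h_prev + lstm.bias_hh_l0)
    i_pre, f_pre, g_pre, o_pre = gates.chunk(4)  # blocks are stored in the order i, f, g, o

    c_t = torch.sigmoid(f_pre) * c_prev + torch.sigmoid(i_pre) * torch.tanh(g_pre)
    h_t = torch.sigmoid(o_pre) * torch.tanh(c_t)

    # Reference result from nn.LSTM on the same single-step input
    output, (h_n, c_n) = lstm(x_t.view(1, 1, -1))

print(torch.allclose(h_t, h_n.view(-1), atol=1e-5),
      torch.allclose(c_t, c_n.view(-1), atol=1e-5))
```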

For a sequence with multiple time steps

In [25]:
import torch
import torch.nn as nn

# Define the LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

# Example input
batch_size = 1
sequence_length = 4
input_size = 10
input_tensor = torch.randn(batch_size, sequence_length, input_size)

# Forward pass
output, (hidden_state, cell_state) = lstm(input_tensor)

In [26]:
input_tensor.shape

torch.Size([1, 4, 10])

In [35]:
i_t_s = input_tensor.squeeze()

In [34]:
W_ii = lstm.weight_ih_l0[:hidden_size]
W_if = lstm.weight_ih_l0[hidden_size:2*hidden_size]
W_ig = lstm.weight_ih_l0[2*hidden_size:3*hidden_size]
W_io = lstm.weight_ih_l0[3*hidden_size:]

assert(W_ii.shape == (hidden_size, input_size))
assert(W_if.shape == (hidden_size, input_size))
assert(W_ig.shape == (hidden_size, input_size))
assert(W_io.shape == (hidden_size, input_size))

W_hi = lstm.weight_hh_l0[:hidden_size]
W_hf = lstm.weight_hh_l0[hidden_size:2*hidden_size]
W_hg = lstm.weight_hh_l0[2*hidden_size:3*hidden_size]
W_ho = lstm.weight_hh_l0[3*hidden_size:]

assert(W_hi.shape == (hidden_size, hidden_size))
assert(W_hf.shape == (hidden_size, hidden_size))
assert(W_hg.shape == (hidden_size, hidden_size))
assert(W_ho.shape == (hidden_size, hidden_size))

b_ii = lstm.bias_ih_l0[:hidden_size]
b_if = lstm.bias_ih_l0[hidden_size:2*hidden_size]
b_ig = lstm.bias_ih_l0[2*hidden_size:3*hidden_size]
b_io = lstm.bias_ih_l0[3*hidden_size:4*hidden_size]

assert(b_ii.shape == (hidden_size, ))
assert(b_if.shape == (hidden_size, ))
assert(b_ig.shape == (hidden_size, ))
assert(b_io.shape == (hidden_size, ))

b_hi = lstm.bias_hh_l0[:hidden_size]
b_hf = lstm.bias_hh_l0[hidden_size:2*hidden_size]
b_hg = lstm.bias_hh_l0[2*hidden_size:3*hidden_size]
b_ho = lstm.bias_hh_l0[3*hidden_size:4*hidden_size]

assert(b_hi.shape == (hidden_size, ))
assert(b_hf.shape == (hidden_size, ))
assert(b_hg.shape == (hidden_size, ))
assert(b_ho.shape == (hidden_size, ))

In [33]:
for weights in lstm.all_weights:
  for w in weights:
    print(w.shape)
  print("----")

torch.Size([80, 10])
torch.Size([80, 20])
torch.Size([80])
torch.Size([80])
----


In [58]:
hidden = []
h_t_1 = torch.zeros((hidden_size,))
c_t_1 = torch.zeros((hidden_size,))
for x_t in i_t_s:
    i_t = torch.sigmoid(W_ii @ x_t + b_ii + W_hi @ h_t_1 + b_hi)
    f_t = torch.sigmoid(W_if @ x_t + b_if + W_hf @ h_t_1 + b_hf)
    g_t = torch.tanh(W_ig @ x_t + b_ig + W_hg @ h_t_1 + b_hg)
    o_t = torch.sigmoid(W_io @ x_t + b_io + W_ho @ h_t_1 + b_ho)
    c_t = f_t * c_t_1 + i_t * g_t
    h_t = o_t * torch.tanh(c_t)
    h_t_1 = h_t
    c_t_1 = c_t
    hidden.append(h_t.view(1, -1))

In [59]:
final_output = torch.cat(hidden)

In [69]:
assert(torch.isclose(final_output, output, 1e-4).all().item())
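
The hand-written loop above does what `nn.LSTMCell` does for one time step. As a cross-check, the sketch below copies the layer-0 parameters of an `nn.LSTM` into an `nn.LSTMCell` and steps through a sequence, carrying the state forward. The sizes match the cells above; the copying approach is one possible way to set this up, not the only one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

input_size, hidden_size, sequence_length = 10, 20, 4
lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=1, batch_first=True)

# An LSTMCell computes one time step; copy the layer-0 parameters into it
cell = nn.LSTMCell(input_size, hidden_size)
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(1, sequence_length, input_size)  # (batch, seq, input)

with torch.no_grad():
    output, (h_n, c_n) = lstm(x)

    h = torch.zeros(1, hidden_size)
    c = torch.zeros(1, hidden_size)
    steps = []
    for t in range(sequence_length):
        h, c = cell(x[:, t, :], (h, c))  # carry the state to the next step
        steps.append(h)
    cell_output = torch.stack(steps, dim=1)  # (batch, seq, hidden)

print(torch.allclose(cell_output, output, atol=1e-5))
```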

Using a batch of multiple sequences

In [70]:
import torch
import torch.nn as nn

# Define the LSTM
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)

# Example input
batch_size = 5
sequence_length = 4
input_size = 10
input_tensor = torch.randn(batch_size, sequence_length, input_size)

# Forward pass
output, (hidden_state, cell_state) = lstm(input_tensor)

In [71]:
W_ii = lstm.weight_ih_l0[:hidden_size]
W_if = lstm.weight_ih_l0[hidden_size:2*hidden_size]
W_ig = lstm.weight_ih_l0[2*hidden_size:3*hidden_size]
W_io = lstm.weight_ih_l0[3*hidden_size:]

assert(W_ii.shape == (hidden_size, input_size))
assert(W_if.shape == (hidden_size, input_size))
assert(W_ig.shape == (hidden_size, input_size))
assert(W_io.shape == (hidden_size, input_size))

W_hi = lstm.weight_hh_l0[:hidden_size]
W_hf = lstm.weight_hh_l0[hidden_size:2*hidden_size]
W_hg = lstm.weight_hh_l0[2*hidden_size:3*hidden_size]
W_ho = lstm.weight_hh_l0[3*hidden_size:]

assert(W_hi.shape == (hidden_size, hidden_size))
assert(W_hf.shape == (hidden_size, hidden_size))
assert(W_hg.shape == (hidden_size, hidden_size))
assert(W_ho.shape == (hidden_size, hidden_size))

b_ii = lstm.bias_ih_l0[:hidden_size]
b_if = lstm.bias_ih_l0[hidden_size:2*hidden_size]
b_ig = lstm.bias_ih_l0[2*hidden_size:3*hidden_size]
b_io = lstm.bias_ih_l0[3*hidden_size:4*hidden_size]

assert(b_ii.shape == (hidden_size, ))
assert(b_if.shape == (hidden_size, ))
assert(b_ig.shape == (hidden_size, ))
assert(b_io.shape == (hidden_size, ))

b_hi = lstm.bias_hh_l0[:hidden_size]
b_hf = lstm.bias_hh_l0[hidden_size:2*hidden_size]
b_hg = lstm.bias_hh_l0[2*hidden_size:3*hidden_size]
b_ho = lstm.bias_hh_l0[3*hidden_size:4*hidden_size]

assert(b_hi.shape == (hidden_size, ))
assert(b_hf.shape == (hidden_size, ))
assert(b_hg.shape == (hidden_size, ))
assert(b_ho.shape == (hidden_size, ))

In [73]:
input_tensor.shape

torch.Size([5, 4, 10])

In [82]:
W_ii.shape

torch.Size([20, 10])

In [93]:
output_tensor = torch.zeros((batch_size, sequence_length, hidden_size))
h_t_1 = torch.zeros((1, hidden_size))
c_t_1 = torch.zeros((1, hidden_size))
for i in range(sequence_length):
    x_t = input_tensor[:, i, :]  # x_t has shape (batch_size, input_size)
    i_t = torch.sigmoid(x_t @ W_ii.T + b_ii + h_t_1 @ W_hi.T + b_hi)
    f_t = torch.sigmoid(x_t @ W_if.T  + b_if + h_t_1 @ W_hf.T  + b_hf)
    g_t = torch.tanh(x_t @ W_ig.T + b_ig + h_t_1 @ W_hg.T + b_hg)
    o_t = torch.sigmoid(x_t @ W_io.T + b_io + h_t_1 @ W_ho.T  + b_ho)
    c_t = f_t * c_t_1 + i_t * g_t
    h_t = o_t * torch.tanh(c_t)
    h_t_1 = h_t
    c_t_1 = c_t
    output_tensor[:, i, :] = h_t

In [94]:
output.shape

torch.Size([5, 4, 20])

In [95]:
output_tensor.shape

torch.Size([5, 4, 20])

In [99]:
assert(torch.isclose(output, output_tensor, 1e-4).all().item())
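
The batched computation can be wrapped into one reusable function that also returns the final hidden and cell states in the layout `nn.LSTM` uses. The helper `manual_lstm_forward` below is a hypothetical sketch: it assumes a single-layer, unidirectional, `batch_first=True` LSTM with zero initial states, and it is meant for checking understanding rather than for speed.

```python
import torch
import torch.nn as nn

def manual_lstm_forward(lstm, x):
    """Re-compute a single-layer, unidirectional, batch_first nn.LSTM by hand.

    Hypothetical helper for illustration; it assumes zero initial states.
    """
    batch_size, seq_len, _ = x.shape
    hidden_size = lstm.hidden_size
    h = x.new_zeros(batch_size, hidden_size)
    c = x.new_zeros(batch_size, hidden_size)
    outputs = []
    for t in range(seq_len):
        gates = (x[:, t, :] @ lstm.weight_ih_l0.T + lstm.bias_ih_l0
                 + h @ lstm.weight_hh_l0.T + lstm.bias_hh_l0)  # (batch, 4 * hidden)
        i, f, g, o = gates.chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        outputs.append(h)
    # Output is (batch, seq, hidden); final states get a leading num_layers axis
    return torch.stack(outputs, dim=1), (h.unsqueeze(0), c.unsqueeze(0))

torch.manual_seed(0)
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=1, batch_first=True)
x = torch.randn(5, 4, 10)

with torch.no_grad():
    ref_out, (ref_h, ref_c) = lstm(x)
    out, (h_n, c_n) = manual_lstm_forward(lstm, x)

print(out.shape, h_n.shape)  # torch.Size([5, 4, 20]) torch.Size([1, 5, 20])
print(torch.allclose(out, ref_out, atol=1e-5))
```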