### LSTM autocomplete example - a very simple and limited version ###

This trains a simple LSTM to predict the next word given a sequence of words. It is a deliberately minimal example: a real-world autocomplete model would need far more data, data preprocessing (handling of unknown words, punctuation, and so on), and often more complex models and training procedures.

In [2]:
# Import modules

import torch # PyTorch
import torch.nn as nn # PyTorch

In [5]:
# Create synthetic training data

data = "hello world. this is an example to demonstrate lstm. hope you enjoy the example.".split() # tokenize the data by splitting on whitespace (punctuation stays attached to the words)


words = set(data) # build the vocabulary: a set keeps each word only once (its iteration order is not guaranteed to be the same between runs)
word2idx = {word: i for i, word in enumerate(words)} # enumerate yields both the index and the word - create a dictionary where the key is the word and the value is the index
idx2word = {i: word for word, i in word2idx.items()}

# Prepare input and target sequences
sequence_length = 2 # use two consecutive words as input and learn to predict the word that follows them
x_data = [] # input sequence
y_data = [] # target sequence

for i in range(0, len(data) - sequence_length): # iterate over every start position that still leaves room for a target word
    sequence = data[i:i + sequence_length] # take sequence_length consecutive words as features
    target = data[i + sequence_length] # take the word that follows them as the target/label
    x_data.append([word2idx[word] for word in sequence]) # append the index of the words in the sequence
    y_data.append(word2idx[target]) # append the index of the target word


In [6]:
# Let us understand what the above cell has done

print(f"Unique words after tokenizing:\n{words}\n") # set of unique words

print(f"Generating the vocabulary and a numeric representation for each word:\n{word2idx}\n") # dictionary of words and their corresponding index
print(f"Reversing it to retrieve the word once the index is known:\n{idx2word}") # dictionary of index and the corresponding word

print(f"\nCreating the sequence of features and labels from our training data as per sequence length. Features:\n{x_data}\nLabel:\n{y_data}") # sequence of features and labels

Unique words after tokenizing:
{'an', 'is', 'demonstrate', 'world.', 'example', 'hope', 'example.', 'lstm.', 'this', 'the', 'to', 'enjoy', 'hello', 'you'}

Generating the vocabulary and a numeric representation for each word:
{'an': 0, 'is': 1, 'demonstrate': 2, 'world.': 3, 'example': 4, 'hope': 5, 'example.': 6, 'lstm.': 7, 'this': 8, 'the': 9, 'to': 10, 'enjoy': 11, 'hello': 12, 'you': 13}

Reversing it to retrieve the word once the index is known:
{0: 'an', 1: 'is', 2: 'demonstrate', 3: 'world.', 4: 'example', 5: 'hope', 6: 'example.', 7: 'lstm.', 8: 'this', 9: 'the', 10: 'to', 11: 'enjoy', 12: 'hello', 13: 'you'}

Creating the sequence of features and labels from our training data as per sequence length. Features:
[[12, 3], [3, 8], [8, 1], [1, 0], [0, 4], [4, 10], [10, 2], [2, 7], [7, 5], [5, 13], [13, 11], [11, 9]]
Label:
[8, 1, 0, 4, 10, 2, 7, 5, 13, 11, 9, 6]
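
Because the vocabulary is built from a Python set, the index assigned to each word can change from one run to the next, so the numbers printed above are only one possible ordering. A minimal sketch of a reproducible variant, assuming it is acceptable to sort the vocabulary alphabetically, is shown below. `build_vocab` and `make_windows` are hypothetical helper names introduced for illustration; they are not part of the notebook.

```python
def build_vocab(tokens):
    """Map each unique token to an index, in sorted order so the mapping is reproducible."""
    vocab = sorted(set(tokens))
    word2idx = {word: i for i, word in enumerate(vocab)}
    idx2word = {i: word for word, i in word2idx.items()}
    return word2idx, idx2word


def make_windows(tokens, word2idx, sequence_length=2):
    """Slide a window over the tokens: each window is an input, the next word is its target."""
    x_data, y_data = [], []
    for i in range(len(tokens) - sequence_length):
        window = tokens[i:i + sequence_length]
        target = tokens[i + sequence_length]
        x_data.append([word2idx[w] for w in window])
        y_data.append(word2idx[target])
    return x_data, y_data


tokens = "hello world. this is an example to demonstrate lstm. hope you enjoy the example.".split()
word2idx, idx2word = build_vocab(tokens)
x_data, y_data = make_windows(tokens, word2idx, sequence_length=2)

print(len(tokens), len(word2idx), len(x_data))  # prints: 14 14 12
```

With 14 tokens and a window of two words, there are 12 input/target pairs, which matches the 12 feature rows printed above.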


#### The LSTM in PyTorch, specified as `torch.nn.LSTM`, expects the following inputs: ####

1. **input**: The input tensor of shape `(seq_len, batch, input_size)`, where `seq_len` is the length of the sequence, `batch` is the batch size, and `input_size` is the number of features in the input. If the `batch_first` argument is `True`, the input is expected to be of shape `(batch, seq_len, input_size)`.

2. **h_0**: The initial hidden state for each element in the batch, of shape `(num_layers * num_directions, batch, hidden_size)`. This is an optional argument. If not provided, it defaults to a tensor of zeros.

3. **c_0**: The initial cell state for each element in the batch, of shape `(num_layers * num_directions, batch, hidden_size)`. This is also optional. If not provided, it defaults to a tensor of zeros.

The LSTM returns the following outputs:

1. **output**: The output features from the last layer of the LSTM for each time step, returned as a tensor of shape `(seq_len, batch, num_directions * hidden_size)`. If `batch_first` is `True`, the output will be of shape `(batch, seq_len, num_directions * hidden_size)`.

2. **h_n**: The hidden state for the last time step, returned as a tensor of shape `(num_layers * num_directions, batch, hidden_size)`.

3. **c_n**: The cell state for the last time step, returned as a tensor of shape `(num_layers * num_directions, batch, hidden_size)`.

Here, `num_layers` refers to the number of stacked LSTM layers you have (as specified in the LSTM constructor), and `num_directions` is 2 for bidirectional LSTMs and 1 for unidirectional LSTMs.
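
The shapes listed above can be checked directly. The following is a small sketch with made-up sizes (the values of `batch`, `seq_len`, `input_size`, `hidden_size`, and `num_layers` are illustrative assumptions, not the ones used in this notebook):

```python
import torch
import torch.nn as nn

# Illustrative sizes only; these are assumptions, not values from the notebook
batch, seq_len, input_size, hidden_size, num_layers = 3, 5, 8, 16, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
               batch_first=True, bidirectional=True)
num_directions = 2  # because bidirectional=True

x = torch.randn(batch, seq_len, input_size)  # (batch, seq_len, input_size) since batch_first=True
output, (h_n, c_n) = lstm(x)  # h_0 and c_0 are omitted, so both default to zeros

print(output.shape)  # torch.Size([3, 5, 32]) -> (batch, seq_len, num_directions * hidden_size)
print(h_n.shape)     # torch.Size([4, 3, 16]) -> (num_layers * num_directions, batch, hidden_size)
print(c_n.shape)     # torch.Size([4, 3, 16])
```

Note that `batch_first=True` changes the layout of `input` and `output` only; `h_n` and `c_n` keep the batch in the second dimension.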

Note: In many cases, especially when working with sequence data of variable lengths, you will also need to make use of `torch.nn.utils.rnn.pack_padded_sequence` and `torch.nn.utils.rnn.pad_packed_sequence` to pack and unpack sequences.
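
As a rough illustration of the packing utilities mentioned in the note, the sketch below runs an LSTM over two sequences of different lengths. The sizes and the zero padding are assumptions chosen for the example; this notebook itself uses fixed-length inputs and does not need packing.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of true lengths 4 and 2, zero-padded to a common length of 4
input_size, hidden_size = 3, 6  # illustrative sizes (assumptions)
padded = torch.randn(2, 4, input_size)
padded[1, 2:] = 0.0  # padding positions of the shorter sequence
lengths = torch.tensor([4, 2])  # true lengths, kept on the CPU, in descending order

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

# Pack so the LSTM skips the padding, then unpack back to a padded tensor
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, (h_n, c_n) = lstm(packed)
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(output.shape)  # torch.Size([2, 4, 6])
print(out_lengths)   # tensor([4, 2])
```

After unpacking, the output rows at padded positions are zeros, and `h_n` holds each sequence's hidden state at its own last real time step rather than at the last padded position.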

In [7]:
# Build LSTM model

class LSTM(nn.Module): # create a class LSTM that inherits from nn.Module
    def __init__(self, input_size, hidden_size, output_size): # input_size - the vocabulary size (number of distinct word indices), hidden_size - the length of both the embedding vectors and the LSTM hidden state, output_size - the number of words to score as the possible next word
        super(LSTM, self).__init__() # call the constructor of the parent class
        self.hidden_size = hidden_size # store hidden_size on the instance
        self.embedding = nn.Embedding(input_size, hidden_size) # map each word index to a learnable dense vector of length hidden_size
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True) # instantiate the LSTM layer: its input size is the embedding length (hidden_size), its hidden state has hidden_size units, and batch_first=True means the input and output tensors are shaped (batch, seq, feature)
        self.fc = nn.Linear(hidden_size, output_size) # create a fully connected layer that takes in the hidden_size as input and output_size as output

    def forward(self, x): # take in the input tensor x for the forward pass
        x = self.embedding(x) # embed the input tensor
        out, _ = self.lstm(x) # run the LSTM layer on the embedded input tensor
        out = self.fc(out[:, -1, :]) # take the last output from the LSTM layer and pass it through the fully connected layer
        return out # return the output


In [8]:
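# Initialize hyperparameters

input_size = len(words) # the vocabulary size (number of unique words), used as the number of rows in the embedding table
hidden_size = 50 # the length of the embedding vectors and of the LSTM hidden state
output_size = len(words) # one output score per word in the vocabulary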
learning_rate = 0.001 # the learning rate

# Convert data to tensors
x_data = torch.tensor(x_data, dtype=torch.long) # convert x data into a torch tensor of type long
y_data = torch.tensor(y_data, dtype=torch.long) # convert y data into a torch tensor of type long

# Create the model
model = LSTM(input_size, hidden_size, output_size) # instantiate the LSTM model

# Loss and optimizer
criterion = nn.CrossEntropyLoss() # instantiate the CrossEntropyLoss function as the loss function
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate) # instantiate the Adam optimizer with the model parameters and learning rate


In [9]:
# Training
# We never pass an initial hidden state or cell state to the LSTM, so PyTorch uses zeros for both on every forward pass (each call to model(x_data)), not once when the model is instantiated
num_epochs = 2000 # the number of epochs
for epoch in range(num_epochs): # iterate over the number of epochs
    outputs = model(x_data) # feed the input x data into the model to get the outputs
    loss = criterion(outputs, y_data) # calculate the loss by comparing the outputs and label data

    optimizer.zero_grad() # zero the gradients
    loss.backward() # backpropagate the loss
    optimizer.step() # update the parameters

    if (epoch+1) % 100 == 0: # print the loss every 100 epochs
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}') # print the epoch number, number of epochs, and the loss


Epoch [100/2000], Loss: 0.1769
Epoch [200/2000], Loss: 0.0274
Epoch [300/2000], Loss: 0.0119
Epoch [400/2000], Loss: 0.0069
Epoch [500/2000], Loss: 0.0045
Epoch [600/2000], Loss: 0.0033
Epoch [700/2000], Loss: 0.0025
Epoch [800/2000], Loss: 0.0019
Epoch [900/2000], Loss: 0.0016
Epoch [1000/2000], Loss: 0.0013
Epoch [1100/2000], Loss: 0.0011
Epoch [1200/2000], Loss: 0.0009
Epoch [1300/2000], Loss: 0.0008
Epoch [1400/2000], Loss: 0.0007
Epoch [1500/2000], Loss: 0.0006
Epoch [1600/2000], Loss: 0.0005
Epoch [1700/2000], Loss: 0.0005
Epoch [1800/2000], Loss: 0.0004
Epoch [1900/2000], Loss: 0.0004
Epoch [2000/2000], Loss: 0.0004


In [10]:
# The function will take a sequence of words and 
# return the word that the model thinks is the most likely to come next.

def predict(model, sequence): # take the model and the sequence as parameters
    sequence = [word2idx[word] for word in sequence] # convert the sequence into a list of indices
    sequence = torch.tensor(sequence, dtype=torch.long).unsqueeze(0) # convert the sequence into a tensor of type long and add a dimension at the beginning
    with torch.no_grad(): # gradients are not needed for inference
        output = model(sequence) # feed the sequence tensor into the model to get one score (logit) per vocabulary word
    predicted = torch.argmax(output, dim=1) # get the index of the word with the highest score
    return idx2word[predicted.item()] # return the word corresponding to the index

In [11]:
# Feed in two words of the same sequence length as the model was trained on

sequence = ["this", "is"] # two consecutive words from the training text; every word must be in the model's vocabulary, otherwise predict raises a KeyError
print(predict(model, sequence))  # Output next word


an
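
An autocomplete feature usually offers several suggestions rather than a single word. The sketch below shows one way to turn a vector of scores into the top-k candidates with their softmax probabilities. `top_k_words` is a hypothetical helper, and `demo_idx2word` and `demo_logits` are a made-up four-word vocabulary and hand-picked scores; with the model above, the scores would presumably come from `model(sequence)[0]`.

```python
import torch


def top_k_words(logits, idx2word, k=3):
    """Return the k highest-scoring words with their softmax probabilities."""
    probs = torch.softmax(logits, dim=-1)      # convert raw scores to probabilities
    top_probs, top_idx = torch.topk(probs, k)  # k largest probabilities and their indices
    return [(idx2word[i.item()], p.item()) for i, p in zip(top_idx, top_probs)]


# Hypothetical vocabulary and scores, for illustration only
demo_idx2word = {0: "an", 1: "is", 2: "the", 3: "this"}
demo_logits = torch.tensor([2.0, 0.5, 1.0, -1.0])

suggestions = top_k_words(demo_logits, demo_idx2word, k=2)
for word, prob in suggestions:
    print(f"{word}: {prob:.2f}")
```

Here "an" ranks first and "the" second, because they have the two largest scores.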


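The predict function should be given a sequence of the same length as was used for training (sequence_length). A sequence of a different length (with at least one word) will not raise an error, because the LSTM accepts any sequence length, but the model has only been trained on two-word inputs, so its predictions for other lengths are unreliable. All the words in the sequence must also appear in the training data; otherwise they will not be in the word2idx dictionary and you will get a KeyError.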

In a real-world scenario, you'd want to add more sophisticated handling of such cases.
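
One common way to avoid the KeyError is to reserve a placeholder index for out-of-vocabulary words. The following is a minimal sketch under that assumption. The `<unk>` token and the `build_vocab_with_unk` and `encode` helpers are hypothetical names for illustration, not part of the notebook.

```python
UNK = "<unk>"  # hypothetical placeholder token for out-of-vocabulary words


def build_vocab_with_unk(tokens):
    """Build a word-to-index mapping that reserves index 0 for unknown words."""
    vocab = [UNK] + sorted(set(tokens))
    return {word: i for i, word in enumerate(vocab)}


def encode(sequence, word2idx):
    """Convert words to indices, falling back to the <unk> index instead of raising KeyError."""
    unk_idx = word2idx[UNK]
    return [word2idx.get(word, unk_idx) for word in sequence]


demo_word2idx = build_vocab_with_unk("this is an example".split())
print(encode(["this", "is"], demo_word2idx))   # [4, 3]
print(encode(["this", "was"], demo_word2idx))  # [4, 0]
```

With this approach the embedding layer and output layer would need to be sized to the vocabulary including `<unk>`, and the placeholder only becomes useful if it also appears in the training sequences (for example, by replacing rare words with it).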