<a href="https://colab.research.google.com/github/soutrik71/pytorch_classics/blob/main/APTorch7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The focus of this notebook is to implement LSTM networks for sequence generation tasks:
1. Text Generation using LSTM Networks (Character-based RNN)
2. Text Generation using PyTorch LSTM Networks (Character Embeddings)
3. Sequence generation at word level using LSTM Networks (Word Embeddings)
4. Application of pre-trained embedding models for the better representation of words

In [1]:
!pip install portalocker
!pip install torchview
!pip install torcheval
!pip install scikit-plot
!pip install lime

Collecting portalocker
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2
Collecting torchview
  Downloading torchview-0.2.6-py3-none-any.whl (25 kB)
Installing collected packages: torchview
Successfully installed torchview-0.2.6
Collecting torcheval
  Downloading torcheval-0.0.7-py3-none-any.whl (179 kB)
Installing collected packages: torcheval
Successfully installed torcheval-0.0.7
Collecting scikit-plot
  Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Installing collected packages: scikit-plot
Successfully installed scikit-plot-0.3.7
Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
  Preparing metadata (setup.py) ...

In [2]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext
from torch.utils.data import DataLoader, TensorDataset
from torchtext import data
from torchtext import datasets
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import re
from torchtext.data.functional import to_map_style_dataset
from torchsummary import summary
from torchview import draw_graph
import numpy as np
import random
import matplotlib.pyplot as plt
from tqdm import tqdm
from torcheval.metrics import MulticlassAccuracy,BinaryAccuracy
import torch.optim as optim
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import scikitplot as skplt

In [4]:
def set_seed(seed: int = 42) -> None:
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ["PYTHONHASHSEED"] = str(seed)
    print(f"Random seed set as {seed}")

In [5]:
# Set manual seed since nn.Parameter values are randomly initialized
set_seed(42)
# Set device cuda for GPU if it's available otherwise run on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
batch_size = 1024
epochs = 10
lr = 1e-4
embedding = False

Random seed set as 42
cuda


## Basic DataPrep

In this section, we prepare the data for training the neural network using a character-based approach. We use a fixed sequence length (seq_len, set to 35 characters in the code below), and the network's task is to predict the next character given this sequence. Each character is first encoded as an integer index; optionally, an embedding layer then assigns a real-valued vector to each character.

The data preparation procedure is as follows:

* Loading Text Examples and Creating Vocabulary: Iterate through all text examples to construct a vocabulary, mapping each character to a distinct integer index. This vocabulary lets us represent characters in numerical form.

* Organizing Data with a Sliding Window: For every text example, we slide a window of seq_len characters. The characters inside the window serve as input features (X), while the character immediately after the window becomes the target value (Y). The window then shifts one character at a time until the end of the text example.

* Conversion to Integer Indices: Retrieve the integer indices corresponding to the characters in both the data features and the target values, based on the previously constructed vocabulary. This step transforms characters into their numerical representations.

* Embeddings Assignment (optional): Each unique integer index, representing a specific character in the data features, is associated with a real-valued vector known as an embedding. These embeddings provide a continuous representation of characters for the network to compute on. The first model below skips this step and feeds the raw indices to the LSTM.
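
The sliding-window and index-conversion steps above can be sketched on a toy string. This is only an illustration: make_windows and the tiny stoi mapping are hypothetical helpers, not part of the notebook, and a short window is used so the output stays readable.

```python
def make_windows(text, seq_len):
    """Return (input_chars, target_char) pairs from a sliding window over text."""
    pairs = []
    # A text of n characters yields max(0, n - seq_len) windows.
    for i in range(len(text) - seq_len):
        pairs.append((list(text[i:i + seq_len]), text[i + seq_len]))
    return pairs


# Toy vocabulary built from the example itself (hypothetical, for illustration only).
text = "hello world"
stoi = {ch: idx for idx, ch in enumerate(sorted(set(text)))}

pairs = make_windows(text, seq_len=4)
X = [[stoi[ch] for ch in chars] for chars, _ in pairs]  # integer-encoded inputs
y = [stoi[target] for _, target in pairs]               # integer-encoded targets

print(pairs[0])    # (['h', 'e', 'l', 'l'], 'o')
print(len(pairs))  # 7
```

Texts shorter than or equal to the window length contribute no training pairs, which is why very short examples in the dataset are silently skipped.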

### Data loading

In [6]:
train_dataset, valid_dataset, test_dataset = datasets.PennTreebank()

In [7]:
next(iter(train_dataset.shuffle()))

'instead new york city police seized the stolen goods and mr. <unk> avoided jail'

In [8]:
def info(x):
  return len(x)

elem_ls = list(train_dataset.map(info))



In [9]:
print(len(elem_ls)) # total 42k elements
print(max(elem_ls)) # max length of each element
print(min(elem_ls)) # min length of each element

42068
518
2


We construct a vocabulary of unique characters using build_vocab_from_iterator() from torchtext's 'vocab' sub-module. Our custom function build_vocabulary() serves as an iterator, looping through the datasets and their examples to yield lists of characters. Special handling ensures that the '<unk>' marker, which stands in for unknown words in the Penn Treebank text, is counted as a single token rather than as individual characters.
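
As a minimal, torchtext-free sketch of that special handling, the helper below (char_tokens is a hypothetical name used only here) splits a sentence into lowercase characters while keeping each '<unk>' marker whole:

```python
def char_tokens(text, unk="<unk>"):
    """Split text into lowercase characters, keeping each unk marker as one token."""
    pieces = text.split(unk)
    tokens = list(pieces[0].lower())
    for piece in pieces[1:]:
        tokens.append(unk)            # one token for the whole marker
        tokens.extend(piece.lower())  # remaining text, character by character
    return tokens


print(char_tokens("mr. <unk> said"))
# ['m', 'r', '.', ' ', '<unk>', ' ', 's', 'a', 'i', 'd']
```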






In [10]:
def build_vocabulary(datasets):
  for dataset in datasets:
    for text in dataset:

      if "<unk>" in text:
        texts = text.split("<unk>")
        total = list(texts[0].lower())
        for t in texts[1:]:
            total.extend(["<unk>", ] + list(t.lower()))
        yield total

      else:
        yield list(text.lower())

In [11]:
vocab = build_vocab_from_iterator(build_vocabulary([train_dataset, valid_dataset, test_dataset]), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

In [12]:
len(vocab)

47

In [13]:
print(vocab.get_itos()) # character level tokenization

['<unk>', ' ', 'e', 't', 'a', 'n', 'o', 'i', 's', 'r', 'h', 'l', 'd', 'c', 'u', 'm', 'f', 'p', 'g', 'y', 'b', 'w', 'v', 'k', '.', "'", 'x', 'j', '$', '-', 'q', 'z', '&', '0', '1', '9', '3', '#', '2', '8', '5', '\\', '7', '6', '/', '4', '*']


In [14]:
print(vocab.get_stoi()) # dictionary mapping token to indices

{'4': 45, '/': 44, '7': 42, '8': 39, '2': 38, '#': 37, '9': 35, '1': 34, 'z': 31, 'q': 30, '-': 29, '6': 43, '3': 36, 'r': 9, 's': 8, 'd': 12, 'k': 23, 'n': 5, 'h': 10, '*': 46, 'u': 14, '0': 33, 'p': 17, 't': 3, 'i': 7, '\\': 41, '5': 40, 'a': 4, 'e': 2, 'j': 27, '&': 32, 'v': 22, 'o': 6, '<unk>': 0, '.': 24, 'c': 13, 'm': 15, 'f': 16, 'l': 11, 'g': 18, 'y': 19, 'b': 20, 'w': 21, ' ': 1, 'x': 26, "'": 25, '$': 28}


Preparing the sequential data for training with a sliding-window approach and a window size of seq_len = 35 characters

In [126]:
seq_len = 35
train_records_max = 5000
X_train_full, y_train_full = [], []
X_val_full , y_val_full = [], []

In [None]:
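# train data prep: slide a seq_len-character window over each text
# note: slicing splits '<unk>' into single characters; '<' and '>' fall back to the <unk> index
for idex, text in enumerate(train_dataset):
  text = text.lower()
  for i in range(len(text) - seq_len):
    inp_rec = list(text[i:i+seq_len])  # seq_len input characters
    op_rec = text[i+seq_len]           # the character that follows the window

    X_train_full.append(vocab(inp_rec))
    y_train_full.append(vocab[op_rec])

  # stop after the first train_records_max + 2 texts to limit the dataset size
  if idex > train_records_max:
    break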

In [128]:
print(len(X_train_full))
print(len(y_train_full))

423585
423585


In [None]:
# validation dataset prep
for idex, text in enumerate(valid_dataset):
  text = text.lower()
  for i in range(len(text) - seq_len):
    inp_rec = list(text[i:i+seq_len])  # seq_len input characters
    op_rec = text[i+seq_len]           # the character that follows the window

    X_val_full.append(vocab(inp_rec))
    y_val_full.append(vocab[op_rec])

In [130]:
print(len(X_val_full))
print(len(y_val_full))

273982
273982


In [131]:
X_train = torch.tensor(X_train_full, dtype=torch.float32)
y_train = torch.tensor(y_train_full)
print(f"The shape of X_train is {X_train.shape}") # n records with k elements in each
print(f"The shape of Y_train is {y_train.shape}") # n records with 1 element in each

The shape of X_train is torch.Size([423585, 35])
The shape of Y_train is torch.Size([423585])


In [132]:
X_val = torch.tensor(X_val_full, dtype=torch.float32)
y_val = torch.tensor(y_val_full)
print(f"The shape of X_val is {X_val.shape}") # n records with k elements in each
print(f"The shape of y_val is {y_val.shape}") # n records with 1 element in each

The shape of X_val is torch.Size([273982, 35])
The shape of y_val is torch.Size([273982])


In [133]:
if not embedding:
  X_train = X_train.unsqueeze(dim=-1)
  X_val = X_val.unsqueeze(dim=-1)

Dataloader part

In [141]:
vectorized_train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)

vectorized_valid_dataset = TensorDataset(X_val, y_val)
valid_loader = DataLoader(vectorized_valid_dataset, batch_size=1024, shuffle=False)

In [142]:
for x, y in train_loader:
  print(x.shape)
  print(y.shape)
  break

torch.Size([1024, 35, 1])
torch.Size([1024])


## Model Building using a Character-based RNN

The network includes 2 stacked LSTM layers with a hidden size of 256 each, followed by a linear layer. Stacking lets the second LSTM layer learn from the sequence of hidden states produced by the first. The output of the second LSTM layer at the last time step feeds into the linear layer, whose output units match the vocabulary size.

In [143]:
hidden_dim = 256
n_layers=2

class LSTMTextGenerator(nn.Module):
    def __init__(self):
        super(LSTMTextGenerator, self).__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(vocab))

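    def forward(self, X_batch):
        # Random initial hidden and cell states (not weights); if omitted, nn.LSTM defaults to zeros
        hidden = torch.randn(n_layers, len(X_batch), hidden_dim).to(device)
        carry = torch.randn(n_layers, len(X_batch), hidden_dim).to(device)

        output, (hidden, carry) = self.lstm(X_batch, (hidden, carry))
        # keep only the last time step: (batch_size, hidden_dim) -> (batch_size, vocab_len)
        return self.linear(output[:, -1])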

In [144]:
text_generator_lstm = LSTMTextGenerator().to(device)

In [145]:
for layer in text_generator_lstm.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print("\n")

Layer : LSTM(1, 256, num_layers=2, batch_first=True)
Parameters : 
torch.Size([1024, 1])
torch.Size([1024, 256])
torch.Size([1024])
torch.Size([1024])
torch.Size([1024, 256])
torch.Size([1024, 256])
torch.Size([1024])
torch.Size([1024])


Layer : Linear(in_features=256, out_features=47, bias=True)
Parameters : 
torch.Size([47, 256])
torch.Size([47])
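
The shapes printed above follow from how PyTorch lays out LSTM parameters: each layer stores four gate blocks stacked along the first dimension, hence 4 * 256 = 1024 rows. The sketch below reproduces the shapes and the parameter totals in plain Python; lstm_param_shapes and count are hypothetical helpers written for this illustration, and they assume a unidirectional LSTM without projections, as in this notebook.

```python
def lstm_param_shapes(input_size, hidden_size, num_layers):
    """Shapes of weight_ih, weight_hh, bias_ih, bias_hh for each stacked LSTM layer."""
    shapes = []
    for layer in range(num_layers):
        # Only the first layer sees the raw input; later layers see hidden states.
        in_features = input_size if layer == 0 else hidden_size
        # 4 * hidden_size rows: one block per gate (input, forget, cell, output).
        shapes += [
            (4 * hidden_size, in_features),
            (4 * hidden_size, hidden_size),
            (4 * hidden_size,),
            (4 * hidden_size,),
        ]
    return shapes


def count(shapes):
    """Total number of scalar parameters across a list of shapes."""
    total = 0
    for shape in shapes:
        n = 1
        for dim in shape:
            n *= dim
        total += n
    return total


lstm_shapes = lstm_param_shapes(input_size=1, hidden_size=256, num_layers=2)
linear_shapes = [(47, 256), (47,)]  # weight and bias of the output layer

print(lstm_shapes)
print(count(lstm_shapes), count(linear_shapes))
```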




In [146]:
def train_module(model:torch.nn.Module,
                 device:torch.device,
                 train_dataloader:torch.utils.data.DataLoader ,
                 optimizer:torch.optim.Optimizer,
                 criterion:torch.nn.Module,
                 metric,
                 train_losses:list,
                 train_metrics:list):

  # setting model to train mode
  model.train()
  pbar = tqdm(train_dataloader)

  # batch metrics
  train_loss = 0
  processed_batch = 0

  for idx, (data,label) in enumerate(pbar):
    # setting up device
    data = data.to(device)
    label = label.to(device)

    # forward pass output
    preds = model(data)

    # calc loss
    loss = criterion(preds, label)
    train_loss += loss.item()
    # print(f"training loss for batch {idx} is {loss}")

    # backpropagation
    optimizer.zero_grad() # flush out  existing grads
    loss.backward() # back prop of weights wrt loss
    optimizer.step() # optimizer step -> minima

    #updating batch count
    processed_batch += 1

    pbar.set_description(f"Avg Train Loss: {train_loss/processed_batch}")

  # updating epoch metrics
  train_losses.append(train_loss/processed_batch)

  return train_losses


In [147]:
def test_module(model:torch.nn.Module,
                device:torch.device,
                test_dataloader:torch.utils.data.DataLoader,
                criterion:torch.nn.Module,
                metric,
                test_losses,
                test_metrics):
  # setting model to eval mode
  model.eval()
  pbar = tqdm(test_dataloader)

  # batch metrics
  test_loss = 0
  processed_batch = 0

  with torch.inference_mode():
    for idx, (data,label) in enumerate(pbar):
      data , label = data.to(device), label.to(device)
      # predictions
      preds = model(data)
      # print(preds.shape)
      # print(label.shape)

      #loss calc
      loss = criterion(preds, label)
      test_loss += loss.item()

      #updating batch count
      processed_batch += 1

      pbar.set_description(f"Avg Test Loss: {test_loss/processed_batch}")

    # updating epoch metrics
    test_losses.append(test_loss/processed_batch)

  return test_losses

In [148]:
optimizer = optim.Adam(text_generator_lstm.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

In [149]:
%%time

# Place holders----
train_losses = []
test_losses = []

for epoch in range(0,epochs):
  print(f'Epoch {epoch}')
  train_losses = train_module(text_generator_lstm, device, train_loader, optimizer, criterion, None, train_losses, None)
  test_losses = test_module(text_generator_lstm, device, valid_loader, criterion, None, test_losses, None)

Epoch 0


Avg Train Loss: 3.0016534409085334: 100%|██████████| 414/414 [00:33<00:00, 12.31it/s]
Avg Test Loss: 2.8839129342961667: 100%|██████████| 268/268 [00:11<00:00, 23.35it/s]


Epoch 1


Avg Train Loss: 2.848298272072981: 100%|██████████| 414/414 [00:34<00:00, 11.91it/s]
Avg Test Loss: 2.800626163162402: 100%|██████████| 268/268 [00:10<00:00, 24.84it/s]


Epoch 2


Avg Train Loss: 2.7244479212783963: 100%|██████████| 414/414 [00:34<00:00, 11.86it/s]
Avg Test Loss: 2.659426087763772: 100%|██████████| 268/268 [00:10<00:00, 24.84it/s]


Epoch 3


Avg Train Loss: 2.605129315657316: 100%|██████████| 414/414 [00:35<00:00, 11.79it/s]
Avg Test Loss: 2.571557663269897: 100%|██████████| 268/268 [00:10<00:00, 25.70it/s]


Epoch 4


Avg Train Loss: 2.5361479572627856: 100%|██████████| 414/414 [00:35<00:00, 11.71it/s]
Avg Test Loss: 2.512654185295105: 100%|██████████| 268/268 [00:11<00:00, 23.78it/s]


Epoch 5


Avg Train Loss: 2.484039280725562: 100%|██████████| 414/414 [00:35<00:00, 11.80it/s]
Avg Test Loss: 2.463906225873463: 100%|██████████| 268/268 [00:11<00:00, 23.81it/s]


Epoch 6


Avg Train Loss: 2.4395451344153734: 100%|██████████| 414/414 [00:34<00:00, 11.84it/s]
Avg Test Loss: 2.421219460999788: 100%|██████████| 268/268 [00:11<00:00, 23.69it/s]


Epoch 7


Avg Train Loss: 2.4010436287248766: 100%|██████████| 414/414 [00:35<00:00, 11.78it/s]
Avg Test Loss: 2.385412451046616: 100%|██████████| 268/268 [00:11<00:00, 23.49it/s]


Epoch 8


Avg Train Loss: 2.367971516461764: 100%|██████████| 414/414 [00:35<00:00, 11.76it/s]
Avg Test Loss: 2.3539513789895756: 100%|██████████| 268/268 [00:10<00:00, 24.46it/s]


Epoch 9


Avg Train Loss: 2.3384690310644065: 100%|██████████| 414/414 [00:35<00:00, 11.82it/s]
Avg Test Loss: 2.326124023590515: 100%|██████████| 268/268 [00:11<00:00, 23.53it/s]

CPU times: user 7min 28s, sys: 4.09 s, total: 7min 32s
Wall time: 7min 40s
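
One way to read these numbers, as a rough sketch: nn.CrossEntropyLoss with default settings reports the mean negative log-likelihood in nats, so a model that guesses uniformly over the 47-character vocabulary would score ln(47), and exp(loss) gives a per-character perplexity. The value below is copied from the last "Avg Test Loss" line above; since that figure is an average of per-batch means, treat the result as approximate.

```python
import math

vocab_size = 47
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly at random

final_val_loss = 2.326124023590515   # last "Avg Test Loss" reported above
perplexity = math.exp(final_val_loss)  # effective number of equally likely choices

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"validation perplexity: {perplexity:.2f}")
```

A perplexity well below the vocabulary size means the model has learned something about character frequencies, even if the generated text is still poor.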





## Evaluation
The logic starts with the initial randomly selected sequence and makes the next character prediction. It then removes the first character from the sequence and adds a newly predicted character at the end. Then, it makes another prediction and the process repeats for 100 characters.
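
The rolling-window logic can be sketched independently of the model. In this illustration, generate and toy_predict are hypothetical names; toy_predict merely stands in for the trained network followed by an argmax over its output.

```python
def generate(pattern, predict_next, n_steps):
    """Greedy rolling-window generation: predict, append, drop the oldest token."""
    pattern = list(pattern)  # copy so the caller's seed sequence is untouched
    generated = []
    for _ in range(n_steps):
        next_token = predict_next(pattern)
        generated.append(next_token)
        pattern = pattern[1:] + [next_token]  # window length stays constant
    return generated


# Hypothetical stand-in for the trained network: predicts (last token + 1) mod 5.
def toy_predict(pattern):
    return (pattern[-1] + 1) % 5


print(generate([0, 1, 2], toy_predict, n_steps=6))  # [3, 4, 0, 1, 2, 3]
```

Because every prediction is fed back as input, a deterministic predictor that revisits a window it has seen before will loop forever, which is worth keeping in mind when reading the generated text.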

In [153]:
idx = random.randint(0, len(X_train) - 1) # randint is inclusive at both ends
pattern = X_train[idx].numpy().astype(int).flatten().tolist() # list of tokens matched with torch text
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))

Initial Pattern : enging a university faculty member 


In [154]:
generated_text = []
# generate 100 characters, dropping the oldest character after every iteration to keep the sequence length fixed
for i in range(100):
  batch = torch.tensor(pattern, dtype=torch.float32).reshape(1, seq_len, 1).to(device)
  model_op = text_generator_lstm(batch)
  # print(model_op.shape) # 47 is the vocab size
  predicted_index = model_op.argmax(dim=-1).squeeze().cpu().item()
  generated_text.append(predicted_index) ## Add token index to result
  pattern.append(predicted_index) ## Add token index to original pattern
  pattern = pattern[1:] ## Trim pattern back to seq_len


In [155]:
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))

Generated Text : a n a n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n n 


The model produces English characters, but the output collapses into a repeating pattern rather than meaningful text.
Even after multiple training runs, it still produces similar repetitive output.

## Model Building using self-trained embeddings

So far we have used a character-based approach: the network takes a sequence of characters as input and returns the character it thinks should come next. We can also design models that take a sequence of words as input and predict the next word. In this section we stay at the character level but encode the text with character embeddings, which assign a learnable real-valued vector to each token (character).

Network Architecture:

* Embedding Layer: Built with the Embedding() constructor using the vocabulary length and an embedding length of 100. Input (batch_size, seq_length), output (batch_size, seq_length, embed_len).

* LSTM Layers 1 & 2: 256 hidden dimensions each. LSTM Layer 1 processes the embedding output and LSTM Layer 2 processes the output of LSTM Layer 1. Input (batch_size, seq_length, embed_len), output (batch_size, seq_length, 256).

* Linear Layer: Output units match the vocabulary length. Only the last time step of the LSTM 2 output is used, so the input is (batch_size, 256) and the output is (batch_size, vocab_len), representing predictions.

* Initialization & Verification:\
  Initialized network and examined weights/biases.\
  Conducted forward pass with sample data for validation.
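
To check the shapes listed above, here is a minimal standalone sketch. It assumes the same sizes as this notebook (vocabulary of 47, embedding length 100, hidden size 256, sequence length 35). The batch of random token indices is made up for illustration, and unlike the notebook's model it lets nn.LSTM use its default zero initial states.

```python
import torch
import torch.nn as nn

vocab_len, embed_len, hidden_dim, n_layers = 47, 100, 256, 2
batch_size, seq_length = 4, 35

embedding_layer = nn.Embedding(num_embeddings=vocab_len, embedding_dim=embed_len)
lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim,
               num_layers=n_layers, batch_first=True)
linear = nn.Linear(hidden_dim, vocab_len)

# Illustrative batch of random character indices (not real data).
tokens = torch.randint(0, vocab_len, (batch_size, seq_length))

embedded = embedding_layer(tokens)        # (batch_size, seq_length, embed_len)
lstm_out, (h_n, c_n) = lstm(embedded)     # (batch_size, seq_length, hidden_dim)
logits = linear(lstm_out[:, -1])          # last time step -> (batch_size, vocab_len)

print(embedded.shape, lstm_out.shape, logits.shape)
```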

In [168]:
X_train = torch.tensor(X_train_full, dtype=torch.int64)
y_train = torch.tensor(y_train_full)
print(f"The shape of X_train is {X_train.shape}") # n records with k elements in each
print(f"The shape of Y_train is {y_train.shape}") # n records with 1 element in each

The shape of X_train is torch.Size([423585, 35])
The shape of Y_train is torch.Size([423585])


In [169]:
X_val = torch.tensor(X_val_full, dtype=torch.int64)
y_val = torch.tensor(y_val_full)
print(f"The shape of X_val is {X_val.shape}") # n records with k elements in each
print(f"The shape of y_val is {y_val.shape}") # n records with 1 element in each

The shape of X_val is torch.Size([273982, 35])
The shape of y_val is torch.Size([273982])


In [178]:
# new data loader
vectorized_train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)

vectorized_valid_dataset = TensorDataset(X_val, y_val)
valid_loader = DataLoader(vectorized_valid_dataset, batch_size=1024, shuffle=False)

In [179]:
embed_len = 100
hidden_dim = 256
n_layers=2

class LSTMTextGenerator_Embed(nn.Module):
    def __init__(self):
        super(LSTMTextGenerator_Embed, self).__init__()
        self.word_embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_len)
        self.lstm = nn.LSTM(input_size=embed_len, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(vocab))

    def forward(self, X_batch):
        embeddings = self.word_embedding(X_batch)  # (batch_size, seq_length, embed_len)

        # Random initial hidden and cell states; if omitted, nn.LSTM defaults to zeros
        hidden = torch.randn(n_layers, len(X_batch), hidden_dim).to(device)
        carry = torch.randn(n_layers, len(X_batch), hidden_dim).to(device)
        output, (hidden, carry) = self.lstm(embeddings, (hidden, carry))
        return self.linear(output[:, -1])  # logits for the next character

In [180]:
text_generator_lstm_embd = LSTMTextGenerator_Embed().to(device)

In [181]:
optimizer = optim.Adam(text_generator_lstm_embd.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

In [182]:
%%time

# Place holders----
train_losses = []
test_losses = []

for epoch in range(0,epochs):
  print(f'Epoch {epoch}')
  train_losses = train_module(text_generator_lstm_embd, device, train_loader, optimizer, criterion, None, train_losses, None)
  test_losses = test_module(text_generator_lstm_embd, device, valid_loader, criterion, None, test_losses, None)

Epoch 0


Avg Train Loss: 2.8572771076994816: 100%|██████████| 414/414 [00:36<00:00, 11.21it/s]
Avg Test Loss: 2.4506678554549146: 100%|██████████| 268/268 [00:11<00:00, 23.43it/s]


Epoch 1


Avg Train Loss: 2.282755870174095: 100%|██████████| 414/414 [00:42<00:00,  9.77it/s]
Avg Test Loss: 2.171659545667136: 100%|██████████| 268/268 [00:14<00:00, 17.95it/s]


Epoch 2


Avg Train Loss: 2.100640987716435: 100%|██████████| 414/414 [00:38<00:00, 10.73it/s]
Avg Test Loss: 2.045561326083852: 100%|██████████| 268/268 [00:12<00:00, 21.94it/s]


Epoch 3


Avg Train Loss: 1.9968238844387773: 100%|██████████| 414/414 [00:37<00:00, 11.12it/s]
Avg Test Loss: 1.9571161310174572: 100%|██████████| 268/268 [00:11<00:00, 22.66it/s]


Epoch 4


Avg Train Loss: 1.9185841598948419: 100%|██████████| 414/414 [00:38<00:00, 10.84it/s]
Avg Test Loss: 1.887179025963171: 100%|██████████| 268/268 [00:11<00:00, 23.72it/s]


Epoch 5


Avg Train Loss: 1.8544545648754507: 100%|██████████| 414/414 [00:38<00:00, 10.82it/s]
Avg Test Loss: 1.829539271432962: 100%|██████████| 268/268 [00:11<00:00, 23.95it/s]


Epoch 6


Avg Train Loss: 1.7997887341872505: 100%|██████████| 414/414 [00:38<00:00, 10.80it/s]
Avg Test Loss: 1.7806454825757154: 100%|██████████| 268/268 [00:11<00:00, 23.77it/s]


Epoch 7


Avg Train Loss: 1.7519318484453763: 100%|██████████| 414/414 [00:38<00:00, 10.70it/s]
Avg Test Loss: 1.7381872321242717: 100%|██████████| 268/268 [00:11<00:00, 23.72it/s]


Epoch 8


Avg Train Loss: 1.7088922089424685: 100%|██████████| 414/414 [00:38<00:00, 10.88it/s]
Avg Test Loss: 1.7004961682789361: 100%|██████████| 268/268 [00:11<00:00, 23.06it/s]


Epoch 9


Avg Train Loss: 1.6701050587898292: 100%|██████████| 414/414 [00:37<00:00, 10.92it/s]
Avg Test Loss: 1.6672258777404898: 100%|██████████| 268/268 [00:11<00:00, 23.10it/s]

CPU times: user 8min 1s, sys: 4.13 s, total: 8min 5s
Wall time: 8min 23s





In [183]:
idx = random.randint(0, len(X_train) - 1) # randint is inclusive at both ends
pattern = X_train[idx].numpy().astype(int).flatten().tolist() # list of tokens matched with torch text
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))

Initial Pattern : lped much by the announcement trade


In [186]:
generated_text = []
# generate 100 characters, dropping the oldest character after every iteration to keep the sequence length fixed
for i in range(100):
  batch = torch.tensor(pattern, dtype=torch.int64).reshape(1, seq_len).to(device)
  model_op = text_generator_lstm_embd(batch)
  # print(model_op.shape) # 47 is the vocab size
  predicted_index = model_op.argmax(dim=-1).squeeze().cpu().item()
  generated_text.append(predicted_index) ## Add token index to result
  pattern.append(predicted_index) ## Add token index to original pattern
  pattern = pattern[1:] ## Trim pattern back to seq_len


In [187]:
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))

Generated Text : r and the securition and the securition and the securition and the securition and the securition and
