<a href="https://colab.research.google.com/github/soutrik71/pytorch_classics/blob/main/APTorch7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The focus of this notebook is implementing LSTMs for sequence generation tasks:
1. Text Generation using LSTM Networks (Character-based RNN)
2. Text Generation using PyTorch LSTM Networks (Character Embeddings)
3. Focused Natural Language Processing with PyTorch Experimentations

In [1]:
!pip install portalocker
!pip install torchview
!pip install torcheval
!pip install scikit-plot
!pip install lime



In [2]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchtext
from torch.utils.data import DataLoader, TensorDataset
from torchtext import data
from torchtext import datasets
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import re
from torchtext.data.functional import to_map_style_dataset
from torchsummary import summary
from torchview import draw_graph
import numpy as np
import random
import matplotlib.pyplot as plt
from tqdm import tqdm
from torcheval.metrics import MulticlassAccuracy,BinaryAccuracy
import torch.optim as optim
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import scikitplot as skplt

In [4]:
def set_seed(seed: int = 42) -> None:
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ["PYTHONHASHSEED"] = str(seed)
    print(f"Random seed set as {seed}")

In [5]:
# Set manual seed since nn.Parameter values are randomly initialized
set_seed(42)
# Set device cuda for GPU if it's available otherwise run on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
batch_size = 1024
epochs = 10
lr = 1e-4
embedding = False

Random seed set as 42
cuda


## Basic DataPrep

In this section, we prepare data for training the neural network with a character-based approach. We use a fixed sequence length of 25 characters (`seq_len`), and the network's task is to predict the character that follows each sequence. Each character is encoded as an integer index; optionally, an embedding layer can then map each index to a real-valued vector.

The data preparation procedure is as follows:

* Loading Text Examples and Creating Vocabulary: Iterate through all text examples to construct a vocabulary, mapping each character to a distinct integer index. This vocabulary lets us represent characters in a numerical format.

* Organizing Data with a Sliding Window: For every text example, we slide a window of 25 characters over the text. The first 25 characters serve as input features (X), while the 26th character becomes the target value (Y). The window then shifts one character at a time until the end of the text example.

* Conversion to Integer Indices: Retrieve integer indices corresponding to characters in both data features and target values based on the previously constructed vocabulary. This step transforms characters into their corresponding numerical representations.

* Embeddings Assignment: Each unique integer index, representing a specific character in the data features, is associated with a real-valued vector known as an embedding. These embeddings provide a continuous representation of characters for the neural network to compute with. This step is optional and is disabled in this notebook (`embedding = False`), so the integer indices are passed to the LSTM directly as a single float feature per time step.
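
The first three steps above can be sketched on a toy string. This is a minimal illustration, not the notebook's pipeline: `build_char_vocab` and `sliding_windows` are hypothetical helper names, the vocabulary is ordered alphabetically (an assumption; torchtext orders by frequency), and the window is shortened to 4 characters so the result is easy to inspect.

```python
def build_char_vocab(texts):
    """Map each distinct character to an integer index (sorted for determinism)."""
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: idx for idx, ch in enumerate(chars)}


def sliding_windows(text, vocab, window):
    """Return (X, y): every `window`-character slice and the character after it."""
    text = text.lower()
    X, y = [], []
    for i in range(len(text) - window):
        X.append([vocab[ch] for ch in text[i:i + window]])  # input indices
        y.append(vocab[text[i + window]])                   # next-character target
    return X, y


toy_text = "hello world"
toy_vocab = build_char_vocab([toy_text])
X, y = sliding_windows(toy_text, toy_vocab, window=4)
print(len(X), X[0], y[0])
```

A text of length n yields n - window training pairs, which is why short examples contribute few or no records.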

### Data loading

In [6]:
train_dataset, valid_dataset, test_dataset = datasets.PennTreebank()

In [7]:
next(iter(train_dataset.shuffle()))

'instead new york city police seized the stolen goods and mr. <unk> avoided jail'

In [8]:
def info(x):
  return len(x)

elem_ls = list(train_dataset.map(info))



In [9]:
print(len(elem_ls)) # total 42k elements
print(max(elem_ls)) # max length of each element
print(min(elem_ls)) # min length of each element

42068
518
2


We construct a vocabulary of unique characters using build_vocab_from_iterator() from torchtext's 'vocab' sub-module. Our custom function build_vocabulary() serves as an iterator, looping through datasets and examples to yield character lists. Special handling ensures the '<unk>' token, which marks out-of-vocabulary words in Penn Treebank, is counted as a single token rather than as individual characters.
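
The single-token treatment of '<unk>' can be seen in isolation on one sentence. The sketch below uses `re.split` with a capturing group instead of the notebook's own `build_vocabulary()`; `char_tokens` is a hypothetical helper name introduced only for illustration.

```python
import re

UNK = "<unk>"


def char_tokens(text):
    """Split text into characters while keeping each '<unk>' marker as one token."""
    tokens = []
    # A capturing group makes re.split keep the delimiter in the result list
    for piece in re.split(r"(<unk>)", text.lower()):
        if piece == UNK:
            tokens.append(UNK)
        else:
            tokens.extend(piece)  # extend with the individual characters
    return tokens


print(char_tokens("mr. <unk> avoided"))
```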






In [10]:
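def build_vocabulary(datasets):
    """Yield one list of character tokens per example, keeping '<unk>' whole."""
    for dataset in datasets:
        for text in dataset:
            if "<unk>" in text:
                # Split around the marker so it is emitted as a single token
                texts = text.split("<unk>")
                total = list(texts[0].lower())
                for t in texts[1:]:
                    total.extend(["<unk>"] + list(t.lower()))
                yield total
            else:
                yield list(text.lower())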

In [11]:
vocab = build_vocab_from_iterator(build_vocabulary([train_dataset, valid_dataset, test_dataset]), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

In [12]:
len(vocab)

47

In [13]:
print(vocab.get_itos()) # character level tokenization

['<unk>', ' ', 'e', 't', 'a', 'n', 'o', 'i', 's', 'r', 'h', 'l', 'd', 'c', 'u', 'm', 'f', 'p', 'g', 'y', 'b', 'w', 'v', 'k', '.', "'", 'x', 'j', '$', '-', 'q', 'z', '&', '0', '1', '9', '3', '#', '2', '8', '5', '\\', '7', '6', '/', '4', '*']


In [14]:
print(vocab.get_stoi()) # dictionary mapping token to indices

{'4': 45, '/': 44, '7': 42, '8': 39, '2': 38, '#': 37, '9': 35, '1': 34, 'z': 31, 'q': 30, '-': 29, '6': 43, '3': 36, 'r': 9, 's': 8, 'd': 12, 'k': 23, 'n': 5, 'h': 10, '*': 46, 'u': 14, '0': 33, 'p': 17, 't': 3, 'i': 7, '\\': 41, '5': 40, 'a': 4, 'e': 2, 'j': 27, '&': 32, 'v': 22, 'o': 6, '<unk>': 0, '.': 24, 'c': 13, 'm': 15, 'f': 16, 'l': 11, 'g': 18, 'y': 19, 'b': 20, 'w': 21, ' ': 1, 'x': 26, "'": 25, '$': 28}


Preparing the sequential data for training with the sliding window approach and a window size of 25 characters (`seq_len`).

In [15]:
seq_len = 25
train_records_max = 10000
X_train, y_train = [], []
X_val , y_val = [], []

In [16]:
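# train data prep
for idex, text in enumerate(train_dataset):
  print(text)
  print("\n")
  # Slide a seq_len-character window over the raw text, one character at a time
  for i in range(len(text) - seq_len):
    inp_rec = list(text[i:i+seq_len].lower())
    op_rec = text[i+seq_len].lower()

    # Defensive check; the range above already guarantees a target character exists
    if len(op_rec) == 0:
      break

    # Note: "<unk>" is split into characters here; '<' and '>' are not in the
    # vocabulary, so they fall back to the default index
    X_train.append(vocab(inp_rec))
    y_train.append(vocab[op_rec])

  # Stop after roughly train_records_max examples
  if idex > train_records_max:
    break

Streaming output truncated to the last 5000 lines.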


it said cs first boston has consistently been one of the most aggressive firms in merchant banking and that a very significant portion of the firm 's profit in recent years has come from merchant <unk> business


moody 's believes that the uncertain environment for merchant banking could put pressure on cs first boston 's performance the rating concern said citing continued problems from the firm 's exposures to various <unk> firms and to ohio <unk>


these two exposures alone represent a very substantial portion of cs first boston 's equity moody 's said


total merchant banking exposures are in excess of the firm 's equity


quotron systems inc. plans to cut about N or N N of its N employees over the next several months


this action will continue to keep operating expenses in line with revenue said j. david <unk> president and chief executive officer of los angeles-based quotron


the move by the financial informatio

In [17]:
print(len(X_train))
print(len(y_train))

949360
949360


In [18]:
# validation dataset prep (same windowing as the training data, no record cap)
for idex, text in enumerate(valid_dataset):
  print(text)
  print("\n")
  for i in range(len(text) - seq_len):
    inp_rec = list(text[i:i+seq_len].lower())
    op_rec = text[i+seq_len].lower()

    # Defensive check; the range above already guarantees a target character exists
    if len(op_rec) == 0:
      break

    X_val.append(vocab(inp_rec))
    y_val.append(vocab[op_rec])

Streaming output truncated to the last 5000 lines.


when the market went into its free fall friday afternoon the investment firm ordered full pages in the monday editions of half a dozen newspapers


the ads touted fidelity 's automated <unk> beneath the huge headline fidelity is ready for your call


a fidelity spokesman says the <unk> which already was operating but which many clients did n't know about received about double the usual volume of calls over the weekend


a lot of investor confidence comes from the fact that they can speak to us he says


to maintain that dialogue is absolutely crucial


it would have been too late to think about on friday


we had to think about it ahead of time


today 's fidelity ad goes a step further encouraging investors to stay in the market or even to plunge in with fidelity


<unk> the headline diversification it <unk> based on the events of the past week all investors need to know their portfolios are balanced to help protect th

In [19]:
print(len(X_val))
print(len(y_val))

306238
306238


In [20]:
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train)
print(f"The shape of X_train is {X_train.shape}") # n records with k elements in each
print(f"The shape of Y_train is {y_train.shape}") # n records with 1 element in each

The shape of X_train is torch.Size([949360, 25])
The shape of Y_train is torch.Size([949360])


In [21]:
X_val = torch.tensor(X_val, dtype=torch.float32)
y_val = torch.tensor(y_val)
print(f"The shape of X_val is {X_val.shape}") # n records with k elements in each
print(f"The shape of y_val is {y_val.shape}") # n records with 1 element in each

The shape of X_val is torch.Size([306238, 25])
The shape of y_val is torch.Size([306238])


In [22]:
if not embedding:
  X_train = X_train.unsqueeze(dim=-1)
  X_val = X_val.unsqueeze(dim=-1)
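
With `embedding = False`, the extra trailing dimension gives the LSTM one float feature per time step. For comparison, the sketch below contrasts that input layout with an embedding-based one. It is an illustration under assumptions, not the notebook's model: `embed_dim = 16` and `hidden = 32` are arbitrary sizes, and the inputs are random indices rather than Penn Treebank data.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden = 47, 16, 32  # embed_dim and hidden are illustrative
batch, seq = 4, 25

# Without embeddings: indices cast to float, one feature per time step
x_float = torch.randint(0, vocab_size, (batch, seq)).float().unsqueeze(-1)
lstm_raw = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
out_raw, _ = lstm_raw(x_float)

# With embeddings: indices stay integer (long) and index a learned lookup table
x_long = torch.randint(0, vocab_size, (batch, seq))
embed = nn.Embedding(vocab_size, embed_dim)
lstm_emb = nn.LSTM(input_size=embed_dim, hidden_size=hidden, batch_first=True)
out_emb, _ = lstm_emb(embed(x_long))

print(x_float.shape, out_raw.shape)
print(embed(x_long).shape, out_emb.shape)
```

Note that an embedding layer expects integer indices, so the `dtype=torch.float32` conversion used above would have to be dropped if embeddings were enabled.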

### DataLoaders

In [23]:
vectorized_train_dataset = TensorDataset(X_train, y_train)
train_loader = DataLoader(vectorized_train_dataset, batch_size=batch_size, shuffle=False)

vectorized_valid_dataset = TensorDataset(X_val, y_val)
valid_loader = DataLoader(vectorized_valid_dataset, batch_size=batch_size, shuffle=False)

In [24]:
for x, y in train_loader:
  print(x.shape)
  print(y.shape)
  break

torch.Size([1024, 25, 1])
torch.Size([1024])


## Model Building

The network stacks 2 LSTM layers, each with a hidden size of 256, followed by a linear layer. Stacking lets the second layer model patterns in the first layer's outputs. The second LSTM layer's output at the last time step feeds the linear layer, whose output units match the vocabulary size (47).

In [25]:
hidden_dim = 256
n_layers=2

class LSTMTextGenerator(nn.Module):
    def __init__(self):
        super(LSTMTextGenerator, self).__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(vocab))

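    def forward(self, X_batch):
      # Random initial hidden and cell states, redrawn on every call
      # (these are states, not weights)
      hidden = torch.randn(n_layers, len(X_batch), hidden_dim).to(device)
      carry = torch.randn(n_layers, len(X_batch), hidden_dim).to(device)

      output, (hidden, carry) = self.lstm(X_batch, (hidden, carry))
      # Project the last time step's output to vocabulary logits
      return self.linear(output[:,-1])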

In [26]:
text_generator_lstm = LSTMTextGenerator().to(device)

In [27]:
for layer in text_generator_lstm.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print("\n")

Layer : LSTM(1, 256, num_layers=2, batch_first=True)
Parameters : 
torch.Size([1024, 1])
torch.Size([1024, 256])
torch.Size([1024])
torch.Size([1024])
torch.Size([1024, 256])
torch.Size([1024, 256])
torch.Size([1024])
torch.Size([1024])


Layer : Linear(in_features=256, out_features=47, bias=True)
Parameters : 
torch.Size([47, 256])
torch.Size([47])
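
The shapes printed above follow from the LSTM's four gates: each layer stacks four weight blocks of `hidden_size` rows, which gives the 1024 = 4 × 256 leading dimension. As a quick check, the sketch below recomputes the parameter counts from those shapes; `lstm_layer_params` and `linear_params` are hypothetical helper names.

```python
def lstm_layer_params(input_size, hidden_size):
    """One LSTM layer: input weights, recurrent weights, and two bias vectors."""
    rows = 4 * hidden_size  # input, forget, cell, and output gates stacked
    return rows * input_size + rows * hidden_size + 2 * rows


def linear_params(in_features, out_features):
    """Weight matrix plus bias vector of a linear layer."""
    return in_features * out_features + out_features


hidden, vocab_size = 256, 47
layer0 = lstm_layer_params(1, hidden)        # first layer sees 1 input feature
layer1 = lstm_layer_params(hidden, hidden)   # second layer sees the first layer's output
head = linear_params(hidden, vocab_size)
print(layer0, layer1, head, layer0 + layer1 + head)
```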




In [28]:
def train_module(model:torch.nn.Module,
                 device:torch.device,
                 train_dataloader:torch.utils.data.DataLoader ,
                 optimizer:torch.optim.Optimizer,
                 criterion:torch.nn.Module,
                 metric,
                 train_losses:list,
                 train_metrics:list):

  # setting model to train mode
  model.train()
  pbar = tqdm(train_dataloader)

  # batch metrics
  train_loss = 0
  processed_batch = 0

  for idx, (data,label) in enumerate(pbar):
    # setting up device
    data = data.to(device)
    label = label.to(device)

    # forward pass output
    preds = model(data)

    # calc loss
    loss = criterion(preds, label)
    train_loss += loss.item()
    # print(f"training loss for batch {idx} is {loss}")

    # backpropagation
    optimizer.zero_grad() # clear gradients from the previous step
    loss.backward() # compute gradients of the loss w.r.t. the parameters
    optimizer.step() # update the parameters using those gradients

    #updating batch count
    processed_batch += 1

    pbar.set_description(f"Avg Train Loss: {train_loss/processed_batch}")

  # updating epoch metrics
  train_losses.append(train_loss/processed_batch)

  return train_losses


In [29]:
def test_module(model:torch.nn.Module,
                device:torch.device,
                test_dataloader:torch.utils.data.DataLoader,
                criterion:torch.nn.Module,
                metric,
                test_losses,
                test_metrics):
  # setting model to eval mode
  model.eval()
  pbar = tqdm(test_dataloader)

  # batch metrics
  test_loss = 0
  processed_batch = 0

  with torch.inference_mode():
    for idx, (data,label) in enumerate(pbar):
      data , label = data.to(device), label.to(device)
      # predictions
      preds = model(data)
      # print(preds.shape)
      # print(label.shape)

      #loss calc
      loss = criterion(preds, label)
      test_loss += loss.item()

      #updating batch count
      processed_batch += 1

      pbar.set_description(f"Avg Test Loss: {test_loss/processed_batch}")

    # updating epoch metrics
    test_losses.append(test_loss/processed_batch)

  return test_losses

In [30]:
optimizer = optim.Adam(text_generator_lstm.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

In [31]:
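%%time
# %%time runs the cell once; %%timeit would re-run the whole training loop several times

# Placeholders for per-epoch average losses
train_losses = []
test_losses = []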

for epoch in range(0,epochs):
  print(f'Epoch {epoch}')
  train_losses = train_module(text_generator_lstm, device, train_loader, optimizer, criterion, None, train_losses, None)
  test_losses = test_module(text_generator_lstm, device, valid_loader, criterion, None, test_losses, None)

Epoch 0


Avg Train Loss: 2.9196194297280806: 100%|██████████| 928/928 [01:00<00:00, 15.43it/s]
Avg Test Loss: 2.792175707022349: 100%|██████████| 300/300 [00:11<00:00, 26.87it/s]


Epoch 1


Avg Train Loss: 2.643658997940606: 100%|██████████| 928/928 [01:01<00:00, 15.19it/s]
Avg Test Loss: 2.537196484406789: 100%|██████████| 300/300 [00:10<00:00, 28.10it/s]


Epoch 2


Avg Train Loss: 2.481267654690249: 100%|██████████| 928/928 [01:02<00:00, 14.94it/s]
Avg Test Loss: 2.4278574657440184: 100%|██████████| 300/300 [00:10<00:00, 27.48it/s]


Epoch 3


Avg Train Loss: 2.393442423436148: 100%|██████████| 928/928 [01:02<00:00, 14.96it/s]
Avg Test Loss: 2.3521500571568805: 100%|██████████| 300/300 [00:10<00:00, 29.10it/s]


Epoch 4


Avg Train Loss: 2.328999879159804: 100%|██████████| 928/928 [01:01<00:00, 15.01it/s]
Avg Test Loss: 2.2958285256226856: 100%|██████████| 300/300 [00:09<00:00, 30.28it/s]


Epoch 5


Avg Train Loss: 2.2764888470029008: 100%|██████████| 928/928 [01:01<00:00, 14.99it/s]
Avg Test Loss: 2.2467727704842884: 100%|██████████| 300/300 [00:10<00:00, 29.33it/s]


Epoch 6


Avg Train Loss: 2.229790813193239: 100%|██████████| 928/928 [01:01<00:00, 14.99it/s]
Avg Test Loss: 2.202251296043396: 100%|██████████| 300/300 [00:10<00:00, 28.33it/s]


Epoch 7


Avg Train Loss: 2.1873680130931836: 100%|██████████| 928/928 [01:01<00:00, 15.00it/s]
Avg Test Loss: 2.161680989265442: 100%|██████████| 300/300 [00:10<00:00, 28.20it/s]


Epoch 8


Avg Train Loss: 2.1487542410092106: 100%|██████████| 928/928 [01:01<00:00, 15.12it/s]
Avg Test Loss: 2.124908761580785: 100%|██████████| 300/300 [00:10<00:00, 28.87it/s]


Epoch 9


Avg Train Loss: 2.113752003502229: 100%|██████████| 928/928 [01:01<00:00, 15.04it/s]
Avg Test Loss: 2.091447780529658: 100%|██████████| 300/300 [00:10<00:00, 28.95it/s]
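
To put these numbers in context, a model that guesses uniformly over the 47 tokens would have a cross-entropy of ln(47). The sketch below compares that baseline with the final validation loss (logged as "Avg Test Loss") and converts the loss to perplexity. It only reuses values from the log above and assumes the natural-log convention of `nn.CrossEntropyLoss`.

```python
import math

vocab_size = 47
uniform_loss = math.log(vocab_size)   # cross-entropy of a uniform guess over 47 tokens

final_val_loss = 2.091447780529658    # epoch 9 validation loss from the log above
perplexity = math.exp(final_val_loss) # effective number of equally likely choices

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"validation perplexity: {perplexity:.2f}")
```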


## Evaluation
The logic starts with the initial randomly selected sequence and makes the next character prediction. It then removes the first character from the sequence and adds a newly predicted character at the end. Then, it makes another prediction and the process repeats for 100 characters.
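
The rolling-window logic can be sketched independently of the trained network. In the minimal example below, `generate` and `toy_predictor` are hypothetical names; `toy_predictor` is a stand-in that simply echoes the oldest character in the window, where the notebook would instead run the model on the window and pick a character from its output logits.

```python
import random


def generate(seed, predict_next, n_chars=100):
    """Roll a fixed-length window forward, appending one predicted token per step."""
    window = list(seed)
    generated = []
    for _ in range(n_chars):
        nxt = predict_next(window)
        generated.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest token, append the newest
    return generated


def toy_predictor(window):
    """Hypothetical stand-in for the trained model: echoes the oldest character."""
    return window[0]


random.seed(0)
text = "the market went into its free fall friday afternoon"
start = random.randint(0, len(text) - 26)  # random starting position
seed = text[start:start + 25]              # 25-character seed sequence
out = generate(seed, toy_predictor, n_chars=100)
print(seed + "".join(out))
```

Because the window length stays fixed, the model always sees the same input shape it was trained on, however many characters are generated.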