# Language Modeling using Transformer
The Transformer is a deep learning model introduced in the paper "Attention Is All You Need" by Vaswani et al. It is particularly renowned for its effectiveness in natural language processing (NLP) tasks, including language translation, text generation, and language understanding.

In this notebook, we demonstrate a high-level implementation of the Transformer by applying it to language modeling, that is, predicting the next word in a sequence.

## 1. Imports
Let's first import all the libraries that will be required throughout this notebook.

In [2]:
import torch
import math
import os

from typing import Tuple
from torch import nn, Tensor
from tempfile import TemporaryDirectory
from torch.nn import TransformerEncoder, TransformerEncoderLayer
from torch.utils.data import dataset

## 2. Load the dataset
For language modeling, we are going to use the WikiText-2 dataset. We will access this dataset using `torchtext`.

In [3]:
!pip install portalocker
!pip install torchdata

Collecting portalocker
  Downloading portalocker-2.8.2-py3-none-any.whl (17 kB)
Installing collected packages: portalocker
Successfully installed portalocker-2.8.2


In [4]:
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [5]:
train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

Create a function that converts raw text into a flat tensor.

In [6]:
def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))
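To make the flattening step concrete, here is a minimal pure-Python sketch of what `data_process` does. It rests on loud assumptions: `toy_tokenize` and `build_toy_vocab` are hypothetical stand-ins for the `basic_english` tokenizer and `build_vocab_from_iterator` (they are not torchtext's implementation), and plain lists replace tensors.

```python
from collections import Counter

def toy_tokenize(line):
    # Hypothetical stand-in for the tokenizer: lowercase and split on whitespace.
    return line.lower().split()

def build_toy_vocab(lines, specials=("<unk>",)):
    # Specials get the lowest ids; remaining tokens are ordered by frequency,
    # with ties broken alphabetically (a choice made for this sketch).
    counts = Counter(tok for line in lines for tok in toy_tokenize(line))
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def toy_data_process(lines, vocab):
    # Map each line to ids (unknown words fall back to <unk>), skip empty
    # lines, and concatenate everything into one flat list.
    unk = vocab["<unk>"]
    flat = []
    for line in lines:
        ids = [vocab.get(tok, unk) for tok in toy_tokenize(line)]
        if ids:
            flat.extend(ids)
    return flat

train_lines = ["the cat sat", "", "the dog sat down"]
toy_vocab = build_toy_vocab(train_lines)
flat_ids = toy_data_process(["the bird sat", ""], toy_vocab)
print(toy_vocab)
print(flat_ids)
```

The word "bird" never appears in the toy training lines, so it maps to the `<unk>` id, which mirrors the role of `vocab.set_default_index(vocab['<unk>'])` above.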

In [7]:
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

In [8]:
print(train_data.shape)
print(val_data.shape)
print(test_data.shape)

torch.Size([2049990])
torch.Size([214417])
torch.Size([241859])


In [9]:
train_data

tensor([   9, 3849, 3869,  ..., 2442, 4810,    3])

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Let's batch the data. Given a 1-D vector of sequential data, `batchify()` arranges the data into `batch_size` columns. If the data does not divide evenly into `batch_size` columns, it is trimmed to fit. For instance, with the alphabet as the data (total length of 26) and `batch_size=4`, we would divide the alphabet into 4 sequences of length 6, dropping the last two letters.

![Screenshot from 07-02-24 15:07:12](https://github.com/surajkarki66/Generative_AI_for_NLP/assets/50628520/3d26d44c-d63e-49f9-89b8-2dbe80e31b1a)

Batching enables more parallelizable processing. However, batching means that the model treats each column independently; for example, the dependence of G on F cannot be learned in the example above.
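The alphabet example can be reproduced with a small pure-Python sketch. `batchify_list` is a hypothetical helper that mirrors the trim, split, and transpose steps on a plain list instead of a tensor; it is only meant to show the resulting layout.

```python
import string

def batchify_list(seq, bsz):
    # Number of rows each column will hold; leftover items are dropped.
    seq_len = len(seq) // bsz
    trimmed = seq[:seq_len * bsz]
    # Cut the stream into bsz contiguous columns ...
    columns = [trimmed[k * seq_len:(k + 1) * seq_len] for k in range(bsz)]
    # ... and transpose so the result is indexed as [row][column].
    return [list(row) for row in zip(*columns)]

rows = batchify_list(list(string.ascii_uppercase), 4)
for row in rows:
    print(" ".join(row))
```

Reading down a column gives a contiguous stretch of the alphabet. F ends the first column and G starts the second, which is why the link between them is lost.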



In [11]:
def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into ``bsz`` separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Arguments:
        data: Tensor, shape ``[N]``
        bsz: int, batch size

    Returns:
        Tensor of shape ``[N // bsz, bsz]``
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

In [12]:
batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape [seq_len, batch_size]
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

In [13]:
print(train_data)

tensor([[    9,    59,   564,  ..., 11652,  2435,     1],
        [ 3849,    12,   300,  ...,    47,    30,  1990],
        [ 3869,   315,    19,  ...,    97,  7720,     4],
        ...,
        [  587,  4011,    59,  ...,     1,  1439, 12313],
        [ 4987,    29,     4,  ...,  3165, 17106,  2060],
        [    6,     8,     1,  ...,    62,    18,     2]], device='cuda:0')


In [14]:
print(train_data.shape)

torch.Size([102499, 20])


In [15]:
print(train_data[0])

tensor([    9,    59,   564,   223,   443, 13627,     2,   539,  2872,  2464,
            0,   313,  4513,     1,     5,    47,    66, 11652,  2435,     1],
       device='cuda:0')


Build a function to generate the input and target sequences.
![Image](https://pytorch.org/tutorials/_images/transformer_input_target.png)

In [16]:
bptt = 35
def get_batch(source: Tensor, i: int) -> Tuple[Tensor, Tensor]:
    """
    Args:
        source: Tensor, shape ``[full_seq_len, batch_size]``
        i: int

    Returns:
        tuple (data, target), where data has shape ``[seq_len, batch_size]`` and
        target has shape ``[seq_len * batch_size]``
    """
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].reshape(-1)
    return data, target
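To see how the input and target are offset by one position, here is a simplified sketch that applies the same slicing to a single column stored as a plain list. `get_batch_1d` is a hypothetical helper for illustration, and the tiny `bptt=3` is chosen only so the output is easy to read.

```python
def get_batch_1d(source, i, bptt):
    # Same slicing as get_batch, applied to a plain list (one column).
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len]
    return data, target

tokens = list(range(8))  # stand-in for token ids 0..7
chunks = [get_batch_1d(tokens, i, bptt=3) for i in range(0, len(tokens) - 1, 3)]
for data, target in chunks:
    print(data, "->", target)
```

Each target is the input shifted by one token, and the final chunk is shorter because the last token can only serve as a target.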

## 3. Model Building
The famous architecture is shown in the following figure.

![image](https://pytorch.org/tutorials/_images/transformer_architecture.jpg)

This is a high-level implementation of a Transformer model; we are not going to build the Transformer from scratch. We train an `nn.TransformerEncoder` model on a causal language modeling task.

Note that this notebook does not cover the training of `nn.TransformerDecoder`, which is depicted in the right half of the diagram above. The language modeling task is to assign a probability to a given word (or a sequence of words) following a sequence of words.
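One common way to write this down, using notation introduced here for illustration ($w_t$ is the token at position $t$ and $n$ is the sequence length), is the chain-rule factorization of the sequence probability:

$$
P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
$$

Under this view, the model is trained to estimate each factor on the right, that is, the distribution of the next token given the tokens before it.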


**General Idea**
1. A sequence of tokens is first passed to the embedding layer, followed by a positional encoding layer to account for the order of the words.
2. The `nn.TransformerEncoder` is made up of multiple layers of `nn.TransformerEncoderLayer`. Each layer helps the model understand the input data better.
3. As with `nn.TransformerDecoder` (another part of the Transformer model), it's important to prevent the model from looking at future words in the sequence during training. So, we need a special mask (like a filter) to block out any information from future positions.
4. In tasks like language modeling, where we predict the next word in a sentence, we want to make sure the model can't cheat by looking ahead. So, we use an attention mask to hide any words that come after the current position.
5. After the input sequence goes through the `nn.TransformerEncoder`, we pass it through a linear layer. This layer converts the encoded information into predictions for the next word.
6. The output of the linear layer gives us unnormalized scores, also called logits. These scores indicate how likely each word in the vocabulary is to be the next word in the sequence.
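The effect of the mask can be sketched in pure Python. The layout below (0.0 on and below the diagonal, `-inf` above it) follows the square causal mask used in this notebook; `causal_mask` and `softmax` are hypothetical helpers, and the uniform attention scores are made up for illustration.

```python
import math

def causal_mask(size):
    # 0.0 where attention is allowed (column <= row), -inf for future positions.
    return [[0.0 if col <= row else float("-inf") for col in range(size)]
            for row in range(size)]

def softmax(values):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
scores = [[1.0] * 4 for _ in range(4)]  # made-up, uniform attention scores
weights = [softmax([s + m for s, m in zip(score_row, mask_row)])
           for score_row, mask_row in zip(scores, mask)]
for row in weights:
    print([round(w, 2) for w in row])
```

Because the mask is added to the scores before the softmax, every future position ends up with exactly zero attention weight, while the allowed positions share the full weight.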

In [17]:
class TransformerModel(nn.Module):
    def __init__(self, ntoken: int, d_model: int, nhead: int, d_hid: int,
                 nlayers: int, dropout: float = 0.5):
        super().__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.embedding = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.linear = nn.Linear(d_model, ntoken)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: Tensor, src_mask: Tensor = None) -> Tensor:
        """
        Arguments:
            src: Tensor, shape ``[seq_len, batch_size]``
            src_mask: Tensor, shape ``[seq_len, seq_len]``

        Returns:
            output Tensor of shape ``[seq_len, batch_size, ntoken]``
        """
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        if src_mask is None:
            # Generate a square causal mask for the sequence. Masked positions are
            # filled with float('-inf'); unmasked positions with float(0.0).
            src_mask = nn.Transformer.generate_square_subsequent_mask(len(src)).to(src.device)
        output = self.transformer_encoder(src, src_mask)
        output = self.linear(output)
        return output

The `PositionalEncoding` module injects some information about the relative or absolute position of the tokens in the sequence. The positional encodings have the same dimension as the embeddings so that the two can be summed. Here, we use sine and cosine functions of different frequencies.
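For reference, the sinusoidal encoding from "Attention Is All You Need" can be written as follows, where $pos$ is the token position and $i$ indexes pairs of embedding dimensions:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

The `div_term` in the code computes the same frequency factor in exponential form, since

$$
\exp\!\left(2i \cdot \frac{-\ln 10000}{d_{model}}\right) = 10000^{-2i/d_{model}}
$$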

In [18]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: Tensor) -> Tensor:
        """
        Arguments:
            x: Tensor, shape ``[seq_len, batch_size, embedding_dim]``
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)
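To get a feel for the values stored in the `pe` buffer, here is a minimal pure-Python sketch that computes one row at a time. `positional_encoding_row` is a hypothetical helper, and the tiny `d_model=4` is an assumption chosen only to keep the output readable.

```python
import math

def positional_encoding_row(pos, d_model):
    # Mirrors the buffer layout: even indices hold sin, odd indices hold cos.
    row = [0.0] * d_model
    for k in range(0, d_model, 2):
        div_term = math.exp(k * (-math.log(10000.0) / d_model))
        row[k] = math.sin(pos * div_term)
        if k + 1 < d_model:
            row[k + 1] = math.cos(pos * div_term)
    return row

for pos in range(3):
    print(pos, [round(v, 3) for v in positional_encoding_row(pos, d_model=4)])
```

Position 0 always alternates between 0 (sine) and 1 (cosine), and the later dimension pairs change more slowly with position than the earlier ones.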

## 4. Set Up the Model's Parameters and Hyperparameters

In [19]:
ntokens = len(vocab)  # size of vocabulary
emsize = 200  # embedding dimension
d_hid = 200  # dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2  # number of nn.TransformerEncoderLayer  in nn.TransformerEncoder
nhead = 2  # number of heads in nn.MultiheadAttention
dropout = 0.2  # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)



## 5. Train the model
We use `CrossEntropyLoss` with the SGD (stochastic gradient descent) optimizer. The learning rate is initially set to 0.1 and follows a `StepLR` schedule that multiplies it by 0.95 after every epoch. During training, we use `nn.utils.clip_grad_norm_` to prevent gradients from exploding.
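As a rough guide to how quickly the learning rate decays, the step schedule can be sketched in closed form. `step_lr` is a hypothetical helper (not a PyTorch function) that assumes the scheduler is stepped once per epoch, as in this notebook.

```python
def step_lr(initial_lr, gamma, step_size, epoch_index):
    # Learning rate after `epoch_index` calls to scheduler.step().
    return initial_lr * gamma ** (epoch_index // step_size)

for completed_epochs in (0, 1, 10, 89):
    print(completed_epochs, round(step_lr(0.1, 0.95, 1, completed_epochs), 5))
```

With these settings the learning rate falls by roughly two orders of magnitude over 90 epochs, so later epochs make much smaller updates than early ones.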

In [26]:
import time

criterion = nn.CrossEntropyLoss()
lr = 0.1  # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

def train(model: nn.Module) -> None:
    model.train()  # turn on train mode
    total_loss = 0.
    log_interval = 200
    start_time = time.time()

    num_batches = len(train_data) // bptt
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        output = model(data)
        output_flat = output.view(-1, ntokens)
        loss = criterion(output_flat, targets)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()


def evaluate(model: nn.Module, eval_data: Tensor) -> float:
    model.eval()  # turn on evaluation mode
    total_loss = 0.
    with torch.no_grad():
        for i in range(0, eval_data.size(0) - 1, bptt):
            data, targets = get_batch(eval_data, i)
            seq_len = data.size(0)
            output = model(data)
            output_flat = output.view(-1, ntokens)
            total_loss += seq_len * criterion(output_flat, targets).item()
    return total_loss / (len(eval_data) - 1)
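The gradient clipping used in `train` rescales all gradients together when their combined norm is too large. Below is a minimal sketch of that idea on a flat list of floats. `clip_by_global_norm` is a hypothetical helper, and the small epsilon in the denominator is an assumption of this sketch to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # L2 norm over all gradient values taken together.
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale down only when the norm exceeds the threshold.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=0.5)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the gradient is preserved; only its length is capped at about `max_norm`, and gradients that are already small pass through unchanged.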

In [27]:
best_val_loss = float('inf')
epochs = 90

with TemporaryDirectory() as tempdir:
    best_model_params_path = os.path.join(tempdir, "best_model_params.pt")

    for epoch in range(1, epochs + 1):
        epoch_start_time = time.time()
        train(model)
        val_loss = evaluate(model, val_data)
        val_ppl = math.exp(val_loss)
        elapsed = time.time() - epoch_start_time
        print('-' * 89)
        print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
            f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
        print('-' * 89)

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), best_model_params_path)

        scheduler.step()
    model.load_state_dict(torch.load(best_model_params_path)) # load best model states

| epoch   1 |   200/ 2928 batches | lr 0.10 | ms/batch 13.84 | loss  6.52 | ppl   676.86
| epoch   1 |   400/ 2928 batches | lr 0.10 | ms/batch 13.54 | loss  6.51 | ppl   674.43
| epoch   1 |   600/ 2928 batches | lr 0.10 | ms/batch 13.54 | loss  6.47 | ppl   643.14
| epoch   1 |   800/ 2928 batches | lr 0.10 | ms/batch 13.95 | loss  6.50 | ppl   662.95
| epoch   1 |  1000/ 2928 batches | lr 0.10 | ms/batch 13.60 | loss  6.46 | ppl   637.95
| epoch   1 |  1200/ 2928 batches | lr 0.10 | ms/batch 13.68 | loss  6.49 | ppl   657.44
| epoch   1 |  1400/ 2928 batches | lr 0.10 | ms/batch 13.72 | loss  6.45 | ppl   635.84
| epoch   1 |  1600/ 2928 batches | lr 0.10 | ms/batch 14.07 | loss  6.47 | ppl   643.00
| epoch   1 |  1800/ 2928 batches | lr 0.10 | ms/batch 14.12 | loss  6.43 | ppl   617.98
| epoch   1 |  2000/ 2928 batches | lr 0.10 | ms/batch 13.84 | loss  6.44 | ppl   626.46
| epoch   1 |  2200/ 2928 batches | lr 0.10 | ms/batch 13.91 | loss  6.35 | ppl   573.92
| epoch   1 |  2400/ 

## 6. Model Evaluation
Now, evaluate the best model on the test dataset.

In [28]:
test_loss = evaluate(model, test_data)
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)

| End of training | test loss  5.33 | test ppl   205.73
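As a quick sanity check on how the two reported numbers relate, perplexity is the exponential of the mean per-token cross-entropy loss. The `perplexity` helper below is hypothetical and simply wraps `math.exp`.

```python
import math

def perplexity(mean_cross_entropy):
    # Perplexity is the exponential of the mean per-token loss (in nats).
    return math.exp(mean_cross_entropy)

print(round(perplexity(5.33), 2))
print(round(math.log(205.73), 4))
```

The printed loss is rounded to two decimals, so exponentiating 5.33 gives a value slightly different from the reported perplexity, which was computed from the unrounded loss.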


## 7. Conclusion
Because we are training a Transformer model from scratch on our dataset, it will likely take more than 90 epochs to perform better. This is why we are not getting very good results, but the purpose of this notebook is to implement the model, not to improve it.