# Transformers in Action

We will now focus on the key components that make transformers so impactful and work through some hands-on exercises.

## Attention Is All You Need ⚠️
We leveraged a basic RNN-based network to generate text in the previous notebook. For better performance on sequence-to-sequence tasks, a typical Encoder-Decoder architecture is the go-to choice.

<img src="../../assets/module_2/encoder_decoder_nb_2.png">


Let us consider the case of **Machine Translation**, i.e. translation of English to Spanish (or any other language).

In a typical __Encoder-Decoder__ architecture, the Encoder takes the English text as input and prepares a condensed vector representation of the whole input, typically termed bottleneck features. The Decoder then uses these features to generate the translated text in Spanish.

While this architecture and its variants worked wonders, they had issues, such as an inability to handle longer input sequences and cases where there is no one-to-one mapping between the input and output languages.

To handle these issues, __Vaswani et al.__, in their now famously titled paper __Attention Is All You Need__, built upon the concept of attention. The main highlight of this work was the Transformer architecture. Transformers were shown to achieve state-of-the-art results on multiple benchmarks without using any recurrent or convolutional components.


### Attention & Self-Attention
The concept of __Attention__ is a simple yet important one. In layman's terms, it helps the model focus not just on the current input but also on specific pieces of information from the past. This lets models handle long-range dependencies as well as scenarios where there is no one-to-one mapping between inputs and outputs. The following is a sample illustration from the paper showing which words the model attends to when the input word is *making*.

<img src="../../assets/module_2/attention_nb_2.png">

> Source: [Vaswani et al.](https://arxiv.org/pdf/1706.03762.pdf)


__Self-attention__ is a mechanism that allows the transformer model to weigh the importance of different positions (or "tokens") __within__ a sequence when encoding or decoding.
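
To make the weighting idea concrete, here is a minimal sketch of single-head scaled dot-product self-attention. It uses NumPy rather than the PyTorch stack used later in this notebook, and the function names, matrix sizes and random weights are illustrative assumptions, not values from the paper or from GPT-2.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row-wise max for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # similarity of every position with every other position
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    # each output row is a weighted mix of all value vectors
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8               # illustrative sizes
x = rng.normal(size=(seq_len, d_model))       # stand-in for token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row `i` of `weights` tells us how much position `i` attends to every position in the same sequence, which is the "within a sequence" part of the definition above.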

__Multi-head attention__ extends the self-attention mechanism by performing multiple self-attention operations in parallel, each working on a different learned linear projection of the input. Multiple attention heads allow the model to capture different types of relationships and learn more fine-grained representations (e.g. grammar, context, dependency, etc.).

<img src="../../assets/module_2/multihead_attention_nb_2.png">

> Source: [Vaswani et al.](https://arxiv.org/pdf/1706.03762.pdf)
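
The multi-head idea can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed sizes and random weights (the helper names are hypothetical): the model dimension is split across heads, each head runs its own scaled dot-product attention, and the results are concatenated and projected.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); every weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads             # assumes d_model % num_heads == 0

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                    # (heads, seq, d_head)
    # concatenate the heads back into one vector per position
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4        # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

Each of the four slices `weights[h]` is a separate attention pattern over the same sequence, which is what lets different heads specialise.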


### Positional Encoding
Positional encoding is a technique used to incorporate the position of each token in the input sequence. It gives the model explicit information about where each token sits, instead of leaving the order implicit.
This additional signal is required because transformers do not have the natural sequential setup of RNNs. To provide positional context, an encoding scheme should ideally have the following properties:

- It should output a unique encoding for each time-step (a word’s position in a sentence).
- The distance between any two time-steps should be consistent across sentences of different lengths.
- It should let the model generalize to longer sentences without extra effort, and its values should be bounded.
- It must be deterministic.

<img src="../../assets/module_2/positional_emb_nb_2.png">
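
One scheme that satisfies the properties listed above is the sinusoidal encoding from the original Transformer paper. The snippet below is a minimal NumPy sketch of it; the function name and the sizes are illustrative assumptions. GPT-2, which we fine-tune later, learns its position embeddings instead of using this fixed formula.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix; d_model is assumed to be even."""
    positions = np.arange(max_len)[:, None]       # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)
print(pe.min(), pe.max())                         # values stay within [-1, 1]
```

The encoding is a pure function of the position, so it is deterministic and bounded, and it can be computed for positions beyond those seen during training.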



### References
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)


## Hugging Face 🤗
> On a mission to solve NLP, one commit at a time.

As their tagline explains, they are helping solve NLP problems. While the transformer revolution changed things for language-related tasks, using these models was not simple. With the number of parameters running into billions, they were out of reach for most researchers and application developers.

<a target="_blank" href="https://colab.research.google.com/github/raghavbali/llm_workshop_dhs23/blob/main/module_02/solutions/2.transformer_text_generation.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
!pip install transformers

In [None]:
import time
import datetime

In [None]:
import pandas as pd
import numpy as np
import transformers
from numpy import random
from transformers import GPT2Tokenizer, GPT2LMHeadModel, GPT2Config
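from transformers import get_linear_schedule_with_warmup
# the AdamW shipped with transformers is deprecated; use the PyTorch implementation
from torch.optim import AdamW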

In [None]:
print(transformers.__version__)

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader, random_split, RandomSampler, SequentialSampler
torch.manual_seed(42)

In [None]:
!nvidia-smi

## Get Data
We will fine-tune a pre-trained GPT-2 model on the same dataset we used earlier. But wait, what do we mean by pre-trained?

In [None]:
!wget -O sherlock_homes.txt http://www.gutenberg.org/files/1661/1661-0.txt

In [None]:
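filename = "sherlock_homes.txt"
# read the file and close it again once we are done
with open(filename, 'r', encoding='utf-8') as f:
    raw_text = f.read()
# work with a slice of the raw text
text = raw_text[1450:100000]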

## Foundation & Pre-trained Models

**Foundation models** are the models that are trained from scratch on a large corpus of data. In the context of NLP, these models are designed to learn the fundamental patterns, structures, and representations of natural language. Foundation models are typically trained using unsupervised learning objectives, such as language modeling or autoencoding, where the model predicts the next word in a sentence or reconstructs the original sentence from a corrupted version/masked version.
Models such as GPT, BERT, T5, etc. are typical examples of foundation models.


Instances of foundation models that have been trained on specific downstream tasks or datasets are termed **Pre-trained Models**. Pre-trained models leverage the knowledge learned by foundation models and are fine-tuned on task-specific data to perform well on specific NLP tasks, such as text classification, named entity recognition, machine translation, sentiment analysis, etc.
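
The two unsupervised objectives mentioned above can be illustrated with a toy example in plain Python. The helper names and the `[MASK]` placeholder are assumptions made for illustration; real models work on token ids, not whole words.

```python
def next_token_pairs(tokens):
    """Causal language modeling: predict each token from the ones before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mask_token(tokens, position, mask="[MASK]"):
    """Masked language modeling: hide one token and ask the model to recover it."""
    corrupted = list(tokens)
    target = corrupted[position]
    corrupted[position] = mask
    return corrupted, target

tokens = "the game is afoot".split()

# GPT-style objective: (context so far) -> next token
for context, target in next_token_pairs(tokens):
    print(context, "->", target)

# BERT-style objective: corrupted sentence -> original token
print(mask_token(tokens, 2))
```

GPT-2 is trained with the first objective, which is why the training loop later in this notebook can reuse the input ids as labels.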

In [None]:
BOS_TOKEN = '<|sot|>'
EOS_TOKEN = '<|eot|>'
PAD_TOKEN = '<|pad|>'
MODEL_NAME = "raghavbali/gpt2_ft_sherlock_holmes"
# use 'gpt2' instead to start from the base checkpoint

In [None]:
# first, let us get the tokenizer object
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_NAME,
                                          bos_token=BOS_TOKEN,
                                          eos_token=EOS_TOKEN,
                                          pad_token=PAD_TOKEN
                                          )

## Prepare Dataset

In [None]:
class GPT2Dataset(Dataset):

  def __init__(self, txt_list, tokenizer, max_length=768):

    self.tokenizer = tokenizer
    self.input_ids = []
    self.attn_masks = []

    for txt in txt_list:

      encodings_dict = tokenizer(
          BOS_TOKEN + txt + EOS_TOKEN,
          truncation=True,
          max_length=max_length,
          padding="max_length"
          )

      self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
      self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

  def __len__(self):
    return len(self.input_ids)

  def __getitem__(self, idx):
    return self.input_ids[idx], self.attn_masks[idx]
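
To see what `truncation=True` and `padding="max_length"` do to a single example, here is a toy sketch in plain Python. The function name and the token ids are made up for illustration, and it assumes padding is added on the right.

```python
def pad_and_mask(token_ids, max_length, pad_id):
    """Truncate or right-pad a list of ids and build the matching attention mask."""
    ids = token_ids[:max_length]               # truncation
    mask = [1] * len(ids)                      # 1 = real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding   # 0 = padding

ids, mask = pad_and_mask([5, 8, 13], max_length=6, pad_id=0)
print(ids)    # [5, 8, 13, 0, 0, 0]
print(mask)   # [1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside the input ids.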

In [None]:
# keep the batch size small so that training fits on a Colab GPU
BATCH_SIZE = 3

In [None]:
dataset = GPT2Dataset(text.split('\n'),
                      tokenizer, max_length=768)

# Split into training and validation sets
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

In [None]:
# Create the DataLoaders for our training and validation datasets.
train_dataloader = DataLoader(
            train_dataset,
            sampler = RandomSampler(train_dataset),
            batch_size = BATCH_SIZE
        )

validation_dataloader = DataLoader(
            val_dataset,
            sampler = SequentialSampler(val_dataset),
            batch_size = BATCH_SIZE
        )

## Setup Model Object

In [None]:
# Training Params
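epochs = 1  # 3 seems good if you train from the gpt2 checkpoint
learning_rate = 5e-4
# number of steps over which the learning rate ramps up linearly from 0
warmup_steps = 100
# small constant added to the optimizer's denominator for numerical stability
epsilon = 1e-8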

# generate output after N steps
sample_every = 100

In [None]:
# Set Config
configuration = GPT2Config.from_pretrained(MODEL_NAME,
                                           output_hidden_states=False)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained(MODEL_NAME, config=configuration,)

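# NOTE: resize the embeddings because we added custom BOS, EOS and PAD tokens
model.resize_token_embeddings(len(tokenizer))


# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)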

In [None]:
optimizer = AdamW(model.parameters(),
                  lr = learning_rate,
                  eps = epsilon
                )

In [None]:
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = warmup_steps,
                                            num_training_steps = total_steps)
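
The schedule created above scales the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. The sketch below reproduces that factor in plain Python; the function name and the step counts are illustrative assumptions, not the library's code.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # ramp up from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # then decay linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 5e-4
for step in (0, 50, 100, 550, 1000):
    # assumed example: 100 warmup steps out of 1000 total steps
    print(step, base_lr * linear_warmup_multiplier(step, 100, 1000))
```

If `total_steps` is smaller than the number of warmup steps, the learning rate never reaches its peak value, so it is worth checking `len(train_dataloader) * epochs` against `warmup_steps`.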

In [None]:
def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))

In [None]:
total_t0 = time.time()
training_stats = []


for epoch_i in range(0, epochs):

    # Training
    print("*"*25)
    print('>> Epoch {:} / {:} '.format(epoch_i + 1, epochs))
    print("*"*25)

    t0 = time.time()
    total_train_loss = 0

    model.train()
    for step, batch in enumerate(train_dataloader):

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()

        outputs = model(  b_input_ids,
                          labels=b_labels,
                          attention_mask = b_masks,
                          token_type_ids=None
                        )

        loss = outputs[0]

        batch_loss = loss.item()
        total_train_loss += batch_loss

        # Generate a sample every `sample_every` batches.
        if step % sample_every == 0 and step != 0:

            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}. Training Loss: {:>5,}.   Time Taken: {:}.'.format(step,
                                                                                     len(train_dataloader),
                                                                                     batch_loss,
                                                                                     elapsed))

            model.eval()

            sample_outputs = model.generate(
                                    do_sample=True,
                                    top_k=50,
                                    max_length = 200,
                                    top_p=0.95,
                                    num_return_sequences=1,
                                    pad_token_id=tokenizer.eos_token_id
                                )
            for i, sample_output in enumerate(sample_outputs):
                print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

            model.train()

        loss.backward()
        optimizer.step()

        scheduler.step()

    # Average Loss
    avg_train_loss = total_train_loss / len(train_dataloader)

    # training time
    training_time = format_time(time.time() - t0)

    print("Average training loss: {0:.2f}".format(avg_train_loss))
    print("Training epoch time: {:}".format(training_time))

    # Validation
    t0 = time.time()

    model.eval()
    total_eval_loss = 0
    nb_eval_steps = 0

    for batch in validation_dataloader:

        b_input_ids = batch[0].to(device)
        b_labels = batch[0].to(device)
        b_masks = batch[1].to(device)

        with torch.no_grad():

            outputs  = model(b_input_ids,
                             attention_mask = b_masks,
                            labels=b_labels)

            loss = outputs[0]

        batch_loss = loss.item()
        total_eval_loss += batch_loss

    avg_val_loss = total_eval_loss / len(validation_dataloader)

    validation_time = format_time(time.time() - t0)

    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation time: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'train_loss': avg_train_loss,
            'val_loss': avg_val_loss,
            'train_time': training_time,
            'val_time': validation_time
        }
    )

print("Training Completed")
print("Total training time {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))

In [None]:
df_stats = pd.DataFrame(data=training_stats)
df_stats
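
Language models are often compared using perplexity rather than raw loss. Assuming the loss recorded above is the mean per-token cross-entropy in nats, perplexity is simply its exponential. The helper name and the example loss values are illustrative.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity for a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_cross_entropy)

# illustrative loss values, not results from this notebook
for loss in (0.0, 1.0, 3.0):
    print(loss, perplexity(loss))
```

A perplexity of 1 means the model is always certain of the next token; higher values mean it is spreading probability over more candidates.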

## Save the Model

In [None]:
import os

output_dir = './model_save/'

# create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

In [None]:
model.eval()

prompt = "i am writing this prompt"

generated = torch.tensor(tokenizer.encode(BOS_TOKEN+prompt)).unsqueeze(0)
generated = generated.to(device)

sample_outputs = model.generate(
                                generated,
                                do_sample=True,
                                top_k=50,
                                max_length = generated.shape[1] + 50,  # prompt tokens + 50 new ones
                                top_p=0.92,
                                num_return_sequences=3,
                                pad_token_id=tokenizer.eos_token_id,
                                temperature=0.8,
                                )

for i, sample_output in enumerate(sample_outputs):
  print("{}: {}\n\n".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))
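
The `temperature`, `top_k` and `top_p` arguments used above reshape the next-token distribution before a token is sampled. The NumPy sketch below shows the idea on a toy set of five logits; it is an illustrative approximation with an assumed function name, not the library's code.

```python
import numpy as np

def sampling_distribution(logits, temperature=0.8, top_k=50, top_p=0.92):
    """Temperature scaling, then top-k, then nucleus (top-p) filtering."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # top-k: keep only the k most likely tokens
    order = np.argsort(probs)[::-1][:top_k]
    kept = probs[order] / probs[order].sum()

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(np.cumsum(kept), top_p)) + 1
    keep = order[:cutoff]

    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()          # renormalize the survivors

logits = [2.0, 1.0, 0.5, 0.1, -1.0]           # toy scores for a 5-token vocabulary
dist = sampling_distribution(logits, temperature=1.0, top_k=3, top_p=0.8)
rng = np.random.default_rng(0)
token = rng.choice(len(dist), p=dist)         # sample from the filtered distribution
print(dist, token)
```

Lower temperatures sharpen the distribution, while smaller `top_k` or `top_p` values cut off more of the unlikely tail, trading diversity for coherence.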

In [None]:
# compare output to the base GPT-2 checkpoint (no Sherlock Holmes fine-tuning)
pre_trainedtokenizer = GPT2Tokenizer.from_pretrained('gpt2')
pre_trainedmodel = GPT2LMHeadModel.from_pretrained('gpt2')

In [None]:
input_ids = pre_trainedtokenizer.encode(prompt, return_tensors="pt")

# Generate text
output = pre_trainedmodel.generate(
    input_ids,
    bos_token_id=random.randint(1,30000),
    max_length=len(input_ids[0]) + 50,
    num_return_sequences=1,
    pad_token_id=pre_trainedtokenizer.eos_token_id,
    do_sample=True,
    top_p=0.92,  # Adjust the sampling parameters as needed
    temperature=0.8,
)

In [None]:
pre_trainedtokenizer.decode(output[0], skip_special_tokens=True)