## Lab 7a: Text generation with GPT

In [1]:
from IPython.display import HTML, display
colab_button = HTML(
    '<a target="_blank" href="https://colab.research.google.com/github/surrey-nlp/NLP-2025/blob/main/lab07/lab07a-Text_Generation_with_GPT.ipynb">'
    '<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>'
)
display(colab_button)

During this lab, we will further explore the transformer architecture and GPT. The GPT (Generative Pre-trained Transformer) architecture has significantly advanced the field of NLP by enabling the development of powerful and versatile language models. Its transformer-based design, coupled with unsupervised pre-training on large text corpora, has revolutionized tasks such as text generation, summarization, and language understanding.

Even though we are not able to perform a large-scale training in the scope of this lab, we can still explore the capabilities of the model on a smaller scale by training on the `tiny_shakespeare` dataset and utilizing some pre-trained weights.


In [None]:
!pip install torch numpy transformers datasets tiktoken tqdm nltk bert_score torcheval

Collecting datasets
  Downloading datasets-3.4.0-py3-none-any.whl.metadata (19 kB)
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting torcheval
  Downloading torcheval-0.0.7-py3-none-any.whl.metadata (8.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Col

In [None]:
!nvidia-smi

Sun Mar 16 21:12:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   45C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import os
import requests
import tiktoken
import numpy as np
import pickle
import torch
import time
import math
from contextlib import nullcontext

# don't forget to upload model.py into /content
from model import GPTConfig, GPT

### Data Preparation

Download the `tiny_shakespeare` dataset, which consists of numerous Shakespeare plays concatenated into a single text file.

It is encoded with Byte-Pair Encoding (BPE) that builds a vocabulary of subword units to optimally represent the input data. The encoded tokens for each split (train/val) are saved into corresponding binary files.

In [None]:
input_file_path = os.path.join(os.path.abspath(''), 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.abspath(''), 'train.bin'))
val_ids.tofile(os.path.join(os.path.abspath(''), 'val.bin'))

train has 301,966 tokens
val has 36,059 tokens


### Train a small GPT model from scratch

In [None]:
# -----------------------------------------------------------------------------
# default config values designed to train a gpt2 (124M) on OpenWebText
# I/O

init_from = 'scratch'  # 'scratch' or 'resume' or 'gpt2*'
out_dir = 'out-shakespeare'
eval_interval = 100
eval_iters = 100
log_interval = 20
always_save_checkpoint = True  # if True, always save a checkpoint after each eval

# data
dataset = 'shakespeare'
gradient_accumulation_steps = 5 * 8  # used to simulate larger batch sizes
batch_size = 12  # if gradient_accumulation_steps > 1, this is the micro-batch size
block_size = 128

# model
# ------------------------------------------------------------------------------
# play with these parameters! if you have access to the GPU runtime, make the model
# bigger, it has a significant influence on its performance
n_layer = 6
n_head = 4
n_embd = 128
# ------------------------------------------------------------------------------
dropout = 0.0  # for pretraining 0 is good, for finetuning try 0.1+
bias = False  # do we use bias inside LayerNorm and Linear layers?

# adamw optimizer
learning_rate = 6e-4  # max learning rate
max_iters = 500  # total number of training iterations
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0

# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 10  # how many steps to warm up for
lr_decay_iters = 500  # should be ~= max_iters per Chinchilla
min_lr = 6e-5  # minimum learning rate, should be ~= learning_rate/10 per Chinchilla

# system
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks
dtype = 'float16'  # 'float32', 'bfloat16', or 'float16', the latter will auto implement a GradScaler
compile = True  # use PyTorch 2.0 to compile the model to be faster
print(f'[.] {device} chosen as device')

# -----------------------------------------------------------------------------
config_keys = [k for k, v in globals().items() if not k.startswith('_') and isinstance(v, (int, float, bool, str))]
config = {k: globals()[k] for k in config_keys}
# -----------------------------------------------------------------------------


# helps estimate an arbitrarily accurate loss over either split using many batches
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, block_size=block_size)
            with ctx:
                logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


# learning rate decay scheduler (cosine with warmup)
def get_lr(it):
    # 1) linear warmup for warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters

    # 2) if it > lr_decay_iters, return min learning rate
    if it > lr_decay_iters:
        return min_lr

    # 3) in between, use cosine decay down to min learning rate
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # coeff ranges 0..1

    return min_lr + coeff * (learning_rate - min_lr)


def get_batch(split, batch_size=16, block_size=1024):
    # We recreate np.memmap every batch to avoid a memory leak, as per
    # https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122
    if split == 'train':
        data = np.memmap(os.path.join(os.path.abspath(''), 'train.bin'),
                         dtype=np.uint16, mode='r')
    else:
        data = np.memmap(os.path.join(os.path.abspath(''), 'val.bin'),
                         dtype=np.uint16, mode='r')

    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i + block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i + 1:i + 1 + block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y

[.] cuda chosen as device


Various inits, derived attributes, I/O setup

In [None]:
seed_offset = 0
tokens_per_iter = gradient_accumulation_steps * batch_size * block_size
print(f"tokens per iteration will be: {tokens_per_iter:,}")

os.makedirs(out_dir, exist_ok=True)

torch.manual_seed(1337 + seed_offset)
torch.backends.cuda.matmul.allow_tf32 = True  # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True  # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu'  # for later use in torch.autocast

# note: float16 data type will automatically use a GradScaler
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

data_dir = os.path.join('data', dataset)

# init these up here, can override if init_from='resume' (i.e. from a checkpoint)
iter_num = 0
best_val_loss = 1e9

# attempt to derive vocab_size from the dataset
meta_path = os.path.join(data_dir, 'meta.pkl')
meta_vocab_size = None
if os.path.exists(meta_path):
    with open(meta_path, 'rb') as f:
        meta = pickle.load(f)
    meta_vocab_size = meta['vocab_size']
    print(f"found vocab_size = {meta_vocab_size} (inside {meta_path})")

# model init
model_args = dict(n_layer=n_layer, n_head=n_head, n_embd=n_embd, block_size=block_size,
                  bias=bias, vocab_size=None, dropout=dropout)

# init a new model from scratch
print("Initializing a new model from scratch")
# determine the vocab size we'll use for from-scratch training
if meta_vocab_size is None:
    print("defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)")

model_args['vocab_size'] = meta_vocab_size if meta_vocab_size is not None else 50304
gptconf = GPTConfig(**model_args)
model = GPT(gptconf)

# crop down the model block size if desired, using model surgery
if block_size < model.config.block_size:
    model.crop_block_size(block_size)
    model_args['block_size'] = block_size  # so that the checkpoint will have the right value

model.to(device)

# initialize a GradScaler. If enabled=False scaler is a no-op
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# optimizer
optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type)
if init_from == 'resume':
    optimizer.load_state_dict(checkpoint['optimizer'])
checkpoint = None  # free up memory

# compile the model
if compile:
    print("compiling the model... (takes a ~minute)")
    unoptimized_model = model
    model = torch.compile(model)  # requires PyTorch 2.0

tokens per iteration will be: 61,440
Initializing a new model from scratch
defaulting to vocab_size of GPT-2 to 50304 (50257 rounded up for efficiency)
number of parameters: 7.62M
num decayed parameter tensors: 26, with 7,634,944 parameters
num non-decayed parameter tensors: 13, with 1,664 parameters


  scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))


using fused AdamW: True
compiling the model... (takes a ~minute)


In [None]:
# training loop
X, Y = get_batch('train', block_size=block_size)  # fetch the very first batch
t0 = time.time()
local_iter_num = 0  # number of iterations in the lifetime of this process

while True:
    # determine and set the learning rate for this iteration
    lr = get_lr(iter_num) if decay_lr else learning_rate
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    # evaluate the loss on train/val sets and write checkpoints
    if iter_num % eval_interval == 0:
        losses = estimate_loss()
        print(f"step {iter_num}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        if losses['val'] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses['val']
            if iter_num > 0:
                checkpoint = {
                    'model': model.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'model_args': model_args,
                    'iter_num': iter_num,
                    'best_val_loss': best_val_loss,
                    'config': config,
                }
                print(f"saving checkpoint to {out_dir}")
                torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))

    # forward backward update, with optional gradient accumulation to simulate larger batch size
    # and using the GradScaler if data type is float16
    for micro_step in range(gradient_accumulation_steps):
        with ctx:
            logits, loss = model(X, Y)
            loss = loss / gradient_accumulation_steps  # scale the loss to account for gradient accumulation
        # immediately async prefetch next batch while model is doing the forward pass on the GPU
        X, Y = get_batch('train', block_size=block_size)
        # backward pass, with gradient scaling if training in fp16
        scaler.scale(loss).backward()

    # clip the gradient
    if grad_clip != 0.0:
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)

    # step the optimizer and scaler if training in fp16
    scaler.step(optimizer)
    scaler.update()

    # flush the gradients as soon as we can, no need for this memory anymore
    optimizer.zero_grad(set_to_none=True)

    # timing and logging
    t1 = time.time()
    dt = t1 - t0
    t0 = t1
    if iter_num % log_interval == 0:
        # get loss as float. note: this is a CPU-GPU sync point
        # scale up to undo the division above, approximating the true total loss (exact would have been a sum)
        lossf = loss.item() * gradient_accumulation_steps
        print(f"iter {iter_num}: loss {lossf:.4f}, time {dt * 1000:.2f}ms")

    iter_num += 1
    local_iter_num += 1

    # termination conditions
    if iter_num > max_iters:
        break

step 0: train loss 10.8287, val loss 10.8227
iter 0: loss 10.8279, time 20081.17ms
iter 20: loss 9.0177, time 488.34ms
iter 40: loss 7.1812, time 490.91ms
iter 60: loss 6.4597, time 494.55ms
iter 80: loss 5.9980, time 496.67ms
step 100: train loss 5.7278, val loss 5.8724
saving checkpoint to out-shakespeare
iter 100: loss 5.8159, time 1858.18ms
iter 120: loss 5.2991, time 504.68ms
iter 140: loss 5.1239, time 501.93ms
iter 160: loss 5.0640, time 506.78ms
iter 180: loss 4.9745, time 496.50ms
step 200: train loss 4.7684, val loss 5.1917
saving checkpoint to out-shakespeare
iter 200: loss 4.9746, time 1853.07ms
iter 220: loss 4.7539, time 496.29ms
iter 240: loss 4.5990, time 496.75ms
iter 260: loss 4.6500, time 499.99ms
iter 280: loss 4.5809, time 500.66ms
step 300: train loss 4.3920, val loss 5.0102
saving checkpoint to out-shakespeare
iter 300: loss 4.3941, time 1962.26ms
iter 320: loss 4.4137, time 499.32ms
iter 340: loss 4.1362, time 499.07ms
iter 360: loss 4.6038, time 497.47ms
iter 3

> In case you are having trouble with securing a GPU runtime and training the model, download the trained weights from [here](https://drive.google.com/file/d/17gfJ76SyGJVW3jinz3Xv5B7XxTyE9NNA/view?usp=sharing). Create the directory called `out-shakespeare` and place the downloaded weights there.

In [None]:
if not os.path.exists('/content/out-shakespeare'):
  os.makedirs('/content/out-shakespeare')

In [None]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=17gfJ76SyGJVW3jinz3Xv5B7XxTyE9NNA' -O /content/out-shakespeare/ckpt.pt

--2025-03-17 11:05:09--  https://docs.google.com/uc?export=download&id=17gfJ76SyGJVW3jinz3Xv5B7XxTyE9NNA
Resolving docs.google.com (docs.google.com)... 74.125.68.102, 74.125.68.139, 74.125.68.100, ...
Connecting to docs.google.com (docs.google.com)|74.125.68.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=17gfJ76SyGJVW3jinz3Xv5B7XxTyE9NNA&export=download [following]
--2025-03-17 11:05:09--  https://drive.usercontent.google.com/download?id=17gfJ76SyGJVW3jinz3Xv5B7XxTyE9NNA&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 172.217.194.132, 2404:6800:4003:c04::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|172.217.194.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91684437 (87M) [application/octet-stream]
Saving to: ‘/content/out-shakespeare/ckpt.pt’


2025-03-17 11:05:18 (186 MB/s) - ‘/content/out-shake

---
### Sample from a trained model

Initialize the trained model from a directory.

<div class="alert alert-block alert-info"><b>Tip:</b> There is no need to re-initialize the model and load it into memory again if you've just trained it. Run the next cell if the context of the notebook was reset after the training.</div>

In [None]:
# -----------------------------------------------------------------------------
init_from = 'resume'
out_dir = 'out-shakespeare' # ignored if init_from is not 'resume'

seed = 1337
dtype = 'float16' # 'float32' or 'bfloat16' or 'float16'
compile = True # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)


# init from a model saved in a specific directory
ckpt_path = os.path.join(out_dir, 'ckpt.pt')
checkpoint = torch.load(ckpt_path, map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)

model.load_state_dict(state_dict)
model.eval()
model.to(device)

if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

number of parameters: 7.62M


In [None]:
# assume gpt-2 encodings by default
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
decode = lambda l: enc.decode(l)

---
### Sample from the trained model

We can prompt the model by providing a context. Try sampling with a different context and sample length

In [None]:
# encode the beginning of the prompt
context = 'The Universe is vast'
start_ids = encode(context)
num_samples = 5
sample_len = 128

temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200 # retain only the top_k most likely tokens, clamp others to have 0 probability


with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

            probs, y = model.generate(x, sample_len, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('==============================')

The Universe is vast-shented--
I will have we be for so his son
Of no heart in our person, as my heart,
I'll die!

COMINIUS:
To help him.

LADY:
Nay, who thou hast thou duke he would not, I shall my husband, and dead;
Though she is the common kinsman:
Was she be seen a little with a thousand better name call me.

HENRY VI:
The other thing to say I have'st of nothing with war more.

KING RICHARD II:
F
The Universe is vast!

NORTHUMLAND:
Take it is the court is the foe, you,
And, my cousin,
This is the second world of the grave and my dear lord.


JULI am done?

KING EDWARD:
Y:
I am, my lord, madam, good time may say, if he's one, I will be mine.

BUCKINGHAM:
My name as his fortune with the people,
But for you say, as the world, and that; and lay us?

ES:
In it be the queen of
The Universe is vast
To do here in thy heart's very face'd,
And, to the kents?

Nay, even that we swear, though I be a life,
Here
I'll have of the night's time by the way,
Will be a more more than the rest?
I will 

The generated text mostly does not make sense but the model could definitely capture some attributes of the Shakespearean style.

Let's run the model on the validation dataset. Experiment with the context size and see how it influences the generation.

> Note: the context, provided to the model, consists of the first few tokens from each sample. Be shure to exclude is from evaluation as it will always be "right" and skew the metrics

In [None]:
# experiment with sample length, context size and their influence on the evaluation metrics
sample_len = 64
batch_size = 32
start_len = 5

temperature = 0.8
top_k = 200
# -------------------------------------

val_data = np.memmap('./val.bin', dtype=np.uint16, mode='r')
num_batches = len(val_data) // sample_len // batch_size

pred_sent = []
gt_sent = []

pred_tokens = []
gt_tokens = []

pred_probs = []

with torch.no_grad():
    with ctx:
        for batch_i in range(num_batches):
          print(f'batch {batch_i}/{num_batches}')

          X_val, _ = get_batch('val', batch_size, sample_len)

          for k in range(batch_size):
              start_ids = X_val[k, :start_len]
              # print('START:', start_ids, ' - "', decode(list(start_ids)), '"')
              x = start_ids.clone().detach().type(torch.long).to(device)[None, ...]

              probs, pred = model.generate(x, sample_len-start_len, temperature=temperature, top_k=top_k)
              pred_probs.append(torch.cat(probs).cpu())

              # skip the "context" that was provided
              decoded_pred = decode(pred[0, start_len:].tolist())
              pred_sent.append(decoded_pred)

              decoded_gt = decode(X_val[k, start_len:].tolist())
              gt_sent.append(decoded_gt)

              # print('PRED DECODED: ', decoded_pred)
              # print('--------------------------')
              # print('GT DECODED: ', decoded_gt)

              gt_tokens.append([X_val[k, start_len:].cpu().numpy()])
              pred_tokens.append(pred[0, start_len:].cpu().numpy())

              # print('==============================')

batch 0/17
batch 1/17
batch 2/17
batch 3/17
batch 4/17
batch 5/17
batch 6/17
batch 7/17
batch 8/17
batch 9/17
batch 10/17
batch 11/17
batch 12/17
batch 13/17
batch 14/17
batch 15/17
batch 16/17


Before proceeding, we will take a smaller subset of the output due to the computational limitations in Google Colab.

In [None]:
pred_sent = pred_sent[:120]
gt_sent = gt_sent[:120]
gt_tokens = gt_tokens[:120]
pred_tokens = pred_tokens[:120]
pred_probs = pred_probs[:120]

### Model evaluation

It is important to be able to quantitatively evaluate language models. Some of the popular evaluation metrics that use reference text are BLEU score, BERTScore and Perplexity.

<div class="alert alert-block alert-info"><b>Tip:</b> Save the needed predictions while running the model on the validation dataset in the cell above. Computing the metrics on the dataset level is more straightforward.</div>

#### 1. BLEU score

The BLEU (Bilingual Evaluation Understudy) score works by comparing the n-grams (contiguous sequences of n tokens) in the generated text to those in the reference text(s). It calculates a precision score for each n-gram size (typically up to 4-grams) and combines these scores using a weighted geometric mean.

<div class="alert alert-block alert-warning"><b>Challenge 1:</b> Use the NLTK framework to calculate BLEU scores for 1-, 2-, 3-, and 4-grams on the validation dataset.</div>

In [None]:
...

print('BLEU-1: ', bleu1)
print('BLEU-2: ', bleu2)
print('BLEU-3: ', bleu3)
print('BLEU-4: ', bleu4)

#### 2. BERTScore

BERTScore  leverages contextual embeddings from BERT (Bidirectional Encoder Representations from Transformers) to compute similarity between sentences or text spans. Unlike BLEU score that works on a token level, it considers both word overlap and contextual information, providing a more accurate evaluation.

Compared to BLEU score, BERTScore offers several advantages:

- Contextual understanding: BERTScore considers the contextual meaning of words, capturing nuances that BLEU, which relies solely on word overlap, may miss.
- Robustness to word order: BERTScore's contextual embeddings enable it to handle variations in word order, making it more robust to changes in sentence structure or word arrangement.
- Higher correlation with human judgment: BERTScore has been shown to correlate better with human judgment in evaluating text quality, especially in tasks like summarization and text generation.

<div class="alert alert-block alert-warning"><b>Challenge 2:</b> Use the bert_score package to calculate BERTScore on the validation dataset.</div>

In [None]:
...

print(f"BERTScore Precision: {bs_precision:.4f}, Recall: {bs_recall:.4f}, F1: {bs_f1:.4f}")

#### 3. Perplexity

Perplexity is a measurement used in natural language processing (NLP) to assess how well a language model predicts a sample of text. It quantifies the average uncertainty or surprise of the model in predicting the next word or token in a sequence. Lower perplexity values indicate that the model is more confident and accurate in its predictions, while higher values suggest more uncertainty.

<div class="alert alert-block alert-warning"><b>Challenge 3:</b> Use the torcheval package to calculate perplexity on the validation dataset.</div>

In [None]:
...

print('Perplexity: ', perplexity)