## Lab 7b: Text generation with GPT

For the second part of this lab, we will load a pre-trained GPT-2 model and apply it to the same task. We will again use the `tiny_shakespeare` dataset and the metrics from the first part to evaluate the model.


In [1]:
!pip install torch numpy transformers datasets tiktoken tqdm nltk bert_score torcheval

Collecting datasets
  Downloading datasets-3.4.0-py3-none-any.whl.metadata (19 kB)
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting torcheval
  Downloading torcheval-0.0.7-py3-none-any.whl.metadata (8.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Col

In [None]:
!nvidia-smi

Sun Mar 16 21:12:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   45C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [1]:
import os
import requests
import tiktoken
import numpy as np
import pickle
import torch
import time
import math
from contextlib import nullcontext

# *** don't forget to upload model.py into /content ***
from model import GPTConfig, GPT

### Data Preparation

Let's first download the `tiny_shakespeare` dataset with the following:

In [2]:
input_file_path = os.path.join(os.path.abspath(''), 'input.txt')
if not os.path.exists(input_file_path):
    data_url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file_path, 'w', encoding='utf-8') as f:
        f.write(requests.get(data_url).text)

with open(input_file_path, 'r', encoding='utf-8') as f:
    data = f.read()
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.abspath(''), 'train.bin'))
val_ids.tofile(os.path.join(os.path.abspath(''), 'val.bin'))

train has 301,966 tokens
val has 36,059 tokens


Here, we define some variables and a helper function that the subsequent code relies on.

In [3]:
# -----------------------------------------------------------------------------
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1' etc., or try 'mps' on macbooks

seed = 1337
dtype = 'float16' # 'float32' or 'bfloat16' or 'float16'
compile = True # use PyTorch 2.0 to compile the model to be faster
# -----------------------------------------------------------------------------

torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cuda.matmul.allow_tf32 = True # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True # allow tf32 on cudnn
device_type = 'cuda' if 'cuda' in device else 'cpu' # for later use in torch.autocast
ptdtype = {'float32': torch.float32, 'bfloat16': torch.bfloat16, 'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else torch.amp.autocast(device_type=device_type, dtype=ptdtype)

def get_batch(split, batch_size=16, block_size=1024):
    # We recreate np.memmap every batch to avoid a memory leak, as per
    # https://stackoverflow.com/questions/45132940/numpy-memmap-memory-usage-want-to-iterate-once/61472122#61472122
    if split == 'train':
        data = np.memmap(os.path.join(os.path.abspath(''), 'train.bin'),
                         dtype=np.uint16, mode='r')
    else:
        data = np.memmap(os.path.join(os.path.abspath(''), 'val.bin'),
                         dtype=np.uint16, mode='r')

    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([torch.from_numpy((data[i:i + block_size]).astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy((data[i + 1:i + 1 + block_size]).astype(np.int64)) for i in ix])
    if device_type == 'cuda':
        # pin arrays x,y, which allows us to move them to GPU asynchronously (non_blocking=True)
        x, y = x.pin_memory().to(device, non_blocking=True), y.pin_memory().to(device, non_blocking=True)
    else:
        x, y = x.to(device), y.to(device)
    return x, y
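
The pairing of inputs and targets in `get_batch` can be illustrated with a small sketch in plain Python. It assumes nothing beyond the slicing logic shown above; the helper name `sample_windows` and the toy token ids are hypothetical and only serve the illustration. Each target window is the input window shifted by one position, so the model learns to predict token `i + 1` from the tokens up to `i`.

```python
import random

def sample_windows(tokens, batch_size, block_size, rng):
    """Pick random windows and pair each with its one-step-shifted target."""
    # same index range as torch.randint(len(data) - block_size, (batch_size,))
    starts = [rng.randrange(len(tokens) - block_size) for _ in range(batch_size)]
    x = [tokens[i:i + block_size] for i in starts]
    y = [tokens[i + 1:i + 1 + block_size] for i in starts]
    return x, y

tokens = list(range(100, 120))  # hypothetical toy token ids
x, y = sample_windows(tokens, batch_size=2, block_size=4, rng=random.Random(0))
for xi, yi in zip(x, y):
    print(xi, "->", yi)
```

Every row of `y` overlaps with the corresponding row of `x` in all but its last element, which is the one new token the model has to predict at the final position.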

In [4]:
# assume gpt-2 encodings by default
enc = tiktoken.get_encoding("gpt2")
encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
decode = lambda l: enc.decode(l)

---
Now let's load a pre-trained GPT-2 model and see how it performs in terms of the calculated metrics.

In [5]:
init_from = 'gpt2-medium'  # 'gpt2-xl' if you have access to a decent GPU

# init from a given GPT-2 model
model = GPT.from_pretrained(init_from, dict(dropout=0.0))
model.eval()
model.to(device)
if compile:
    model = torch.compile(model) # requires PyTorch 2.0 (optional)

loading weights from pretrained gpt: gpt2-medium
forcing vocab_size=50257, block_size=1024, bias=True
overriding dropout rate to 0.0
number of parameters: 353.77M




Sample from the model with a given context and observe the key differences between our previously trained model and the current one, which was trained on a different dataset.

In [6]:
# encode the beginning of the prompt
context = 'The Universe is vast'
start_ids = encode(context)
num_samples = 5
sample_len = 128
temperature = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k = 200

with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            x = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])

            probs, y = model.generate(x, sample_len, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('==============================')

The Universe is vast. The Universe is infinite. And the Universe is not so much a place that we are in as it is a place that has been created under the guidance of God.

So, what do you do with that?

We call it the Creator-Conserver relationship.

That is, we meet God to worship Him, and our true God is God Himself. And it is actually very difficult to find a place in this world where you cannot find a believer who does not worship God. Most of our friends and neighbors do not worship God. They worship others more readily. But we do.

But what does
The Universe is vast, but we are only one of millions."

This article appeared in print under the headline "The cosmos, 1.8 billion years since the Big Bang"<|endoftext|>Perez is part of a growing cohort of players, including Atlanta United's forward Alvaro Saborio and the University of Arizona's Alex Crognale, who are seeing their careers take a slight turn for the worse.

Part of the problem is the loss of depth as the league's top few te
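
To make the effect of `temperature` and `top_k` concrete, here is a minimal sketch in plain Python. `model.py` is not shown in this notebook, so the sketch assumes the common recipe: divide the logits by the temperature, mask everything outside the `top_k` highest scores, then apply softmax. The function name `sampling_distribution` and the toy logits are hypothetical.

```python
import math

def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Return the distribution a sampler would draw from (assumed recipe)."""
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [l / temperature for l in logits]
    if top_k is not None and top_k < len(scaled):
        # keep only scores at or above the k-th largest one
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    # numerically stable softmax; masked entries get probability 0
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = sampling_distribution(logits, temperature=t, top_k=3)
    print(t, [round(p, 3) for p in probs])
```

With `top_k=3` the lowest-scoring token can never be sampled, and lowering the temperature moves probability mass toward the highest-scoring token.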

<div class="alert alert-block alert-warning"><b>Challenge 4:</b> Reuse the code from above to run the model on the validation dataset and calculate the BLEU score, BERTScore and perplexity. What do you observe? Do the numbers correlate with the qualitative evaluation?</div>

In [7]:
import json

# experiment with sample length, context size and their influence on the evaluation metrics
sample_len = 64
batch_size = 32
start_len = 5

temperature = 0.8
top_k = 200
# -------------------------------------

val_data = np.memmap('./val.bin', dtype=np.uint16, mode='r')
num_batches = len(val_data) // sample_len // batch_size

pred_sent = []
gt_sent = []

pred_tokens = []
gt_tokens = []

pred_probs = []

with torch.no_grad():
    with ctx:
        for batch_i in range(num_batches):
            print(f'batch {batch_i}/{num_batches}')

            if batch_i == 10:
                break

            X_val, _ = get_batch('val', batch_size, sample_len)

            for k in range(batch_size):
                start_ids = X_val[k, :start_len]
                x = start_ids.clone().detach().type(torch.long).to(device)[None, ...]

                probs, pred = model.generate(x, sample_len-start_len, temperature=temperature, top_k=top_k)
                pred_probs.append(torch.cat(probs).cpu())

                # skip the "context" that was provided
                decoded_pred = decode(pred[0, start_len:].tolist())
                pred_sent.append(decoded_pred)

                decoded_gt = decode(X_val[k, start_len:].tolist())
                gt_sent.append(decoded_gt)

                gt_tokens.append([X_val[k, start_len:].cpu().numpy()])
                pred_tokens.append(pred[0, start_len:].cpu().numpy())

pred_sent = pred_sent[:120]
gt_sent = gt_sent[:120]
gt_tokens = gt_tokens[:120]
pred_tokens = pred_tokens[:120]
pred_probs = pred_probs[:120]


batch 0/17
batch 1/17
batch 2/17
batch 3/17
batch 4/17
batch 5/17
batch 6/17
batch 7/17
batch 8/17
batch 9/17
batch 10/17


In [8]:
from nltk.translate.bleu_score import corpus_bleu
print('BLEU-1: ', corpus_bleu(gt_tokens, pred_tokens, weights=(1.0, 0, 0, 0)))
print('BLEU-2: ', corpus_bleu(gt_tokens, pred_tokens, weights=(0, 1.0, 0, 0)))
print('BLEU-3: ', corpus_bleu(gt_tokens, pred_tokens, weights=(0, 0, 1.0, 0)))
print('BLEU-4: ', corpus_bleu(gt_tokens, pred_tokens, weights=(0, 0, 0, 1.0)))

BLEU-1:  0.18022598870056494
BLEU-2:  0.028448275862068963
BLEU-3:  0.007602339181286548
BLEU-4:  0.0005952380952380952
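
Note that weights such as `(0, 1.0, 0, 0)` put all the weight on a single n-gram order, so each number above is an individual n-gram score. The cumulative BLEU-n that is usually reported combines orders 1 to n with uniform weights, for example `(0.5, 0.5, 0, 0)` for BLEU-2. The sketch below illustrates the difference in plain Python; it is a simplified single-reference version without the brevity penalty (both sentences have the same length), not NLTK's implementation, and the example sentences are made up.

```python
from collections import Counter
import math

def ngram_precision(candidate, reference, n):
    """Clipped (modified) n-gram precision for one candidate/reference pair."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # each candidate n-gram is credited at most as often as it occurs in the reference
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()

p = [ngram_precision(candidate, reference, n) for n in (1, 2, 3, 4)]
individual_2 = p[1]
# cumulative BLEU-2: geometric mean of the 1-gram and 2-gram precisions
cumulative_2 = math.exp(0.5 * math.log(p[0]) + 0.5 * math.log(p[1]))
print("individual 2-gram:", individual_2)
print("cumulative BLEU-2:", cumulative_2)
```

Because the cumulative score is a geometric mean, a single order with zero matches drives it to zero, which is why smoothing is often applied to short texts.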


In [9]:
from bert_score import BERTScorer
scorer = BERTScorer(model_type='bert-base-uncased')
P, R, F1 = scorer.score(pred_sent, gt_sent)

print(f"BERTScore Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")

BERTScore Precision: 0.3928, Recall: 0.3860, F1: 0.3891


In [10]:
from torcheval.metrics.functional.text import perplexity
# 3d tensor of token probabilities: (num_samples, num_tokens, vocab size)
perp_probs = torch.tensor(np.array(pred_probs))
print(perp_probs.size())

# 2d tensor of gt tokens: (num_samples, num_tokens)
perp_gt = torch.stack([torch.from_numpy(elem[0]) for elem in gt_tokens])
print(perp_gt.size())

print('Perplexity: ', perplexity(perp_probs, perp_gt).item())

torch.Size([120, 59, 50257])
torch.Size([120, 59])
Perplexity:  49529.6875
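
A perplexity this close to the vocabulary size (50257) deserves a second look. `torcheval`'s `perplexity` expects unnormalized logits and applies a softmax internally. `model.py` is not shown here, but if `model.generate` already returns softmax probabilities, as the comment in the cell suggests, they get normalized a second time. Values in [0, 1] are almost indistinguishable after a softmax, so the result is a nearly uniform distribution and a perplexity near the vocabulary size. The plain-Python sketch below illustrates this with a hypothetical toy vocabulary; the helper names are made up.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def perplexity_from_probs(prob_rows, targets):
    """exp of the mean negative log-probability assigned to the target tokens."""
    nll = [-math.log(row[t]) for row, t in zip(prob_rows, targets)]
    return math.exp(sum(nll) / len(nll))

vocab_size = 1000  # hypothetical toy vocabulary
# a confident prediction: 0.9 on token 0, the rest spread evenly
row = [0.9] + [0.1 / (vocab_size - 1)] * (vocab_size - 1)
prob_rows = [row, row]
targets = [0, 0]

correct = perplexity_from_probs(prob_rows, targets)
# treating probabilities as logits: softmax is applied a second time
double_softmax = perplexity_from_probs([softmax(r) for r in prob_rows], targets)
print("from probabilities:", correct)
print("after a second softmax:", double_softmax)
```

Even though the model in this sketch is confident and correct, the double-softmax perplexity can never fall below `vocab_size / e`. If this is what happens in the cell above, passing logits (or the logarithm of the probabilities) to `perplexity` should give a meaningful value.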


> Do you see drawbacks in metrics that rely on a reference text? Can we provide an adequate reference for unconstrained text generation? Compare the outputs of both models qualitatively. Think about other ways to evaluate text generation models.