# GPT-2 Implementation from Scratch using PyTorch

This notebook implements a GPT-2 (Generative Pre-trained Transformer 2) model from scratch in PyTorch and trains it on the first 40% of the WikiText-103 training split for autoregressive language modeling.



In [1]:
# Install required packages
!pip install torch transformers datasets wandb tqdm numpy


In [2]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.nn import functional as F
import math
from transformers import GPT2Tokenizer
from datasets import load_dataset
import wandb
from tqdm.notebook import tqdm  # progress bars; later cells call tqdm(...) directly
import os
import numpy as np
import glob
import sys
import traceback



In [3]:

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seed for reproducibility
torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

Using device: cuda


In [4]:
# Connect with Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


## GPU Resources

In [5]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Wed Dec 11 04:02:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0              43W / 400W |      5MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## GPT-2 Multi-Head Causal Self-Attention Implementation

This module implements the core attention mechanism for GPT-2, featuring:
- Multi-head scaled dot-product attention
- Causal masking for autoregressive behavior
- Output projection with dropout on the attention weights and on the projected output
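
To make the causal masking step concrete, here is a minimal sketch in plain Python (no PyTorch) of what happens to one head's score matrix: entries above the diagonal are set to negative infinity, then each row is passed through a softmax. The function name `causal_attention_weights` and the toy scores are illustrative assumptions, not part of the notebook's code.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a T x T score matrix."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may only attend to positions 0..i; later ones get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        row_max = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy scores for a sequence of 3 tokens (assumed already scaled by 1/sqrt(head_size)).
scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.7, 0.4]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and the first token can only attend to itself, which is the behavior the `masked_fill` and `softmax` calls produce per head in the module.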


In [3]:
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        # key, query, value projections for all heads
        self.key = nn.Linear(config.n_embd, config.n_embd)
        self.query = nn.Linear(config.n_embd, config.n_embd)
        self.value = nn.Linear(config.n_embd, config.n_embd)
        # regularization
        self.attn_drop = nn.Dropout(config.attn_pdrop)
        self.resid_drop = nn.Dropout(config.resid_pdrop)
        # output projection
        self.proj = nn.Linear(config.n_embd, config.n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(config.block_size, config.block_size))
                                     .view(1, 1, config.block_size, config.block_size))
        self.n_head = config.n_head

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y


## GPT-2 Transformer Block Implementation

This module implements a single transformer block for GPT-2, containing:
- Layer normalization
- Multi-head causal self-attention
- Position-wise feed-forward network
- Residual connections
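
The block applies layer normalization before each sub-layer and adds the sub-layer output back to its input. The following plain-Python sketch shows that pattern on a single vector; `layer_norm` here omits the learned scale and shift, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """Compute x + sublayer(layer_norm(x)), the pattern the block applies twice."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

def toy_sublayer(h):
    # Hypothetical stand-in for attention or the MLP: scale every feature by 0.5.
    return [0.5 * v for v in h]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
out = pre_ln_residual(x, toy_sublayer)
print(normed)
print(out)
```

Because the input `x` is added back unchanged, the sub-layer only has to learn an update to the running representation.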

In [4]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln1 = nn.LayerNorm(config.n_embd)
        self.ln2 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.resid_pdrop),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

## GPT-2 Model Implementation

This module implements the complete GPT-2 architecture, consisting of:
- Token and positional embeddings
- Multiple transformer blocks
- Language modeling head

In [5]:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        # input embedding stem
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
        self.drop = nn.Dropout(config.embd_pdrop)
        # transformer
        self.blocks = nn.Sequential(*[Block(config) for _ in range(config.n_layer)])
        # decoder head
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        self.block_size = config.block_size
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if isinstance(module, nn.Linear) and module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def forward(self, idx, targets=None):
        b, t = idx.size()
        assert t <= self.block_size, "Cannot forward, model block size is exhausted."

        # forward the GPT model
        token_embeddings = self.tok_emb(idx)
        position_embeddings = self.pos_emb[:, :t, :]
        x = self.drop(token_embeddings + position_embeddings)
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

## WikiText Dataset Processor for GPT-2 Training

This module implements a PyTorch Dataset for processing WikiText-103 data into
a format suitable for GPT-2 training. It handles:
- Dataset loading and filtering
- Tokenization using GPT-2 tokenizer
- Sequence chunking with proper context windows
- Input-target pairs creation for language modeling
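
As a small illustration of the chunking and input-target shift, the sketch below applies the same loop to a short list of stand-in token ids. The helper name `make_examples` is an assumption made for this example.

```python
def make_examples(tokens, block_size):
    """Split a token stream into (input, target) pairs of length block_size."""
    examples = []
    for i in range(0, len(tokens) - block_size, block_size):
        chunk = tokens[i:i + block_size + 1]  # one extra token for the shifted target
        if len(chunk) == block_size + 1:
            examples.append((chunk[:-1], chunk[1:]))
    return examples

tokens = list(range(10))  # stand-in token ids 0..9
pairs = make_examples(tokens, block_size=4)
for x, y in pairs:
    print(x, "->", y)
```

Each target is the input shifted one position to the left, and the trailing tokens that cannot fill a complete chunk are dropped.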

In [6]:
class WikiTextDataset(Dataset):
    def __init__(self, split='train', block_size=128):
        # Load the first 40% of the requested split
        dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split=f'{split}[:40%]')
        # Initialize tokenizer
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

        # Tokenize and chunk texts
        tokenized_texts = []
        for item in dataset['text']:
            if item.strip():  # Only process non-empty strings
                # Tokenize each text item
                tokens = self.tokenizer.encode(item, truncation=True, max_length=1024)
                if tokens:  # Only add if we got tokens back
                    tokenized_texts.extend(tokens)
                    # Add EOS token between texts
                    tokenized_texts.append(self.tokenizer.eos_token_id)

        # Convert to tensor
        data = torch.tensor(tokenized_texts, dtype=torch.long)

        # Create sequences of block_size + 1 (extra token for target)
        self.examples = []

        # Ensure we don't create sequences longer than block_size
        for i in range(0, len(data) - block_size, block_size):
            chunk = data[i:i + block_size + 1]
            if len(chunk) == block_size + 1:  # Only keep complete sequences
                self.examples.append(chunk)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        chunk = self.examples[idx]
        x = chunk[:-1]
        y = chunk[1:]
        return x, y

## GPT-2 Model Configuration and Training Pipeline

This module implements:
1. GPT-2 configuration class with model hyperparameters
2. Complete training pipeline with:
  - Weights & Biases integration
  - Checkpoint management
  - Google Drive backup

In [7]:
from tqdm.notebook import tqdm
class GPTConfig:
    def __init__(
        self,
        vocab_size=50257,
        n_embd=768,
        n_head=12,
        n_layer=12,
        block_size=1024,
        dropout=0.1,
        bias=True,
        batch_size=8,
        learning_rate=3e-5,
        embd_pdrop=0.1,
        resid_pdrop=0.1,
        attn_pdrop=0.1
    ):
        self.vocab_size = vocab_size
        self.n_embd = n_embd
        self.n_head = n_head
        self.n_layer = n_layer
        self.block_size = block_size
        self.dropout = dropout
        self.bias = bias
        self.batch_size = batch_size
        self.learning_rate = learning_rate
        self.embd_pdrop = embd_pdrop
        self.resid_pdrop = resid_pdrop
        self.attn_pdrop = attn_pdrop
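
For a sense of scale, the parameter count implied by these defaults can be worked out by hand. The sketch below assumes the layer shapes of the `GPT`, `Block`, and `CausalSelfAttention` classes defined earlier; `count_parameters` is a hypothetical helper written for this example.

```python
def count_parameters(vocab_size=50257, n_embd=768, n_layer=12, block_size=1024):
    """Count trainable parameters for the layer shapes used in this notebook."""
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    layer_norm = 2 * n_embd                     # weight and bias
    attention = 4 * (n_embd * n_embd + n_embd)  # key, query, value, proj
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attention + mlp
    head = n_embd * vocab_size                  # bias=False, separate from tok_emb
    return tok_emb + pos_emb + n_layer * block + layer_norm + head

total = count_parameters()
print(f"{total:,} parameters")
```

The output head is a separate matrix here, so it contributes `768 * 50257` weights on top of the token embedding; the original GPT-2 shares those two matrices.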

In [11]:
def train_gpt2():
    # Initialize wandb
    config = GPTConfig()
    wandb.init(
        project="gpt2-training-2",
        settings=wandb.Settings(init_timeout=300),
        config={
            "learning_rate": config.learning_rate,
            "batch_size": config.batch_size,
            "model_size": config.n_embd,
            "num_layers": config.n_layer,
            "num_heads": config.n_head,
            "block_size": config.block_size,
            "dropout": config.dropout,
            "architecture": "GPT2",
            "dataset": "WikiText-103",
        }
    )

    # Training parameters
    batch_size = config.batch_size
    learning_rate = config.learning_rate
    max_epochs = 15
    grad_norm_clip = 1.0
    max_checkpoints = 15
    checkpoint_dir = "checkpoints"
    drive_checkpoint_dir = "/content/drive/MyDrive/gpt2_checkpoints"
    # Create the local and Drive checkpoint directories
    os.makedirs(checkpoint_dir, exist_ok=True)
    os.makedirs(drive_checkpoint_dir, exist_ok=True)

    # Setup device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Initialize model
    print("Initializing model...")
    model = GPT(config)
    model = model.to(device)

    # Watch model with wandb
    wandb.watch(model, log="all", log_freq=100)

    # Initialize optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    # Load dataset
    print("Loading dataset...")
    try:
        train_dataset = WikiTextDataset(split='train', block_size=config.block_size)
    except Exception as e:
        print(f"Error loading dataset: {e}")
        sys.exit(1)

    train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=2,
        pin_memory=True
    )

    print(f"Dataset size: {len(train_dataset)} sequences")

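    # Load latest checkpoint
    start_epoch = 0
    # Order by modification time; plain string sorting would rank epoch_9 after epoch_15
    checkpoints = sorted(
        glob.glob(os.path.join(drive_checkpoint_dir, 'gpt2_checkpoint_epoch_*.pt')),
        key=os.path.getmtime,
    )
    if checkpoints:
        try:
            print(f"Loading checkpoint {checkpoints[-1]}")
            # The checkpoint contains a pickled GPTConfig, so weights_only must be False
            checkpoint = torch.load(checkpoints[-1], map_location=device, weights_only=False)
            model.load_state_dict(checkpoint['model'])
            optimizer.load_state_dict(checkpoint['optimizer'])
            start_epoch = checkpoint['iter_num'] + 1
            print(f"Resumed from epoch {start_epoch}")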
        except Exception as e:
            print(f"Error loading checkpoint: {e}")
            print("Starting from scratch")

    # Training loop
    print("Starting training...")
    try:
        for epoch in range(start_epoch, max_epochs):
            model.train()
            total_loss = 0
            progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{max_epochs}')

            for batch_idx, (x, y) in enumerate(progress_bar):
                try:
                    # Move batch to device
                    x, y = x.to(device), y.to(device)

                    # Forward pass
                    logits, loss = model(x, y)

                    # Backward pass
                    optimizer.zero_grad()
                    loss.backward()
                    # clip_grad_norm_ returns the total gradient norm measured before clipping
                    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), grad_norm_clip)
                    optimizer.step()

                    # Update progress
                    total_loss += loss.item()
                    avg_loss = total_loss / (batch_idx + 1)
                    progress_bar.set_postfix({'loss': f'{avg_loss:.4f}'})

                    # Log to wandb
                    wandb.log({
                        "batch_loss": loss.item(),
                        "avg_loss": avg_loss,
                        "learning_rate": optimizer.param_groups[0]['lr'],
                        "epoch": epoch,
                        "batch": batch_idx,
                        "grad_norm": grad_norm.item()
                    })

                except RuntimeError as e:
                    if "out of memory" in str(e):
                        if hasattr(torch.cuda, 'empty_cache'):
                            torch.cuda.empty_cache()
                        print(f"\nWARNING: out of memory in batch {batch_idx}. Skipping...")
                        wandb.log({"memory_errors": 1})
                        continue
                    raise e

            # Save checkpoint
            checkpoint = {
                'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'iter_num': epoch,
                'avg_val_loss': avg_loss,  # running average of the training loss; no validation set is used
                'config': config,
            }
            checkpoint_path = os.path.join(checkpoint_dir, f'gpt2_checkpoint_epoch_{epoch+1}.pt')
            print(f"saving checkpoint to {checkpoint_path}")
            torch.save(checkpoint, checkpoint_path)



            # Log checkpoint to wandb
            wandb.save(checkpoint_path)

            # Log epoch metrics
            wandb.log({
                "epoch_avg_loss": avg_loss,
                "epoch": epoch,
            })

            print(f"Epoch {epoch+1} finished. Average loss: {avg_loss:.4f}")

            # The Drive copy is overwritten every epoch under one fixed name, which generate_text loads
            drive_checkpoint_path = os.path.join(drive_checkpoint_dir, 'gpt2_checkpoint_epoch_15.pt')
            print(f"Saving checkpoint to Drive: {drive_checkpoint_path}")
            torch.save(checkpoint, drive_checkpoint_path)
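    except KeyboardInterrupt:
        print("\nTraining interrupted by user")
        # Save the current model and optimizer state before uploading it
        interrupt_path = os.path.join(checkpoint_dir, 'interrupt_checkpoint.pt')
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'config': config,
        }, interrupt_path)
        wandb.save(interrupt_path)
        print(f"Interrupt checkpoint saved to {interrupt_path}")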

    except Exception as e:
        print(f"\nError during training: {e}")
        wandb.log({"training_error": str(e)})
        raise e

    finally:
        wandb.finish()

if __name__ == "__main__":
    train_gpt2()

wandb: Currently logged in as: sk12154 (sk12154-new-york-university). Use `wandb login --relogin` to force relogin


Using device: cuda
Initializing model...
Loading dataset...
Dataset size: 46359 sequences
Loading checkpoint /content/drive/MyDrive/gpt2_checkpoints/gpt2_checkpoint_epoch_12.pt




Resumed from epoch 12
Starting training...


Epoch 13/15:   0%|          | 0/5795 [00:00<?, ?it/s]

saving checkpoint to checkpoints/gpt2_checkpoint_epoch_13.pt
Epoch 13 finished. Average loss: 3.2114
Saving checkpoint to Drive: /content/drive/MyDrive/gpt2_checkpoints/gpt2_checkpoint_epoch_12.pt


Epoch 14/15:   0%|          | 0/5795 [00:00<?, ?it/s]

saving checkpoint to checkpoints/gpt2_checkpoint_epoch_14.pt
Epoch 14 finished. Average loss: 3.1596
Saving checkpoint to Drive: /content/drive/MyDrive/gpt2_checkpoints/gpt2_checkpoint_epoch_12.pt


Epoch 15/15:   0%|          | 0/5795 [00:00<?, ?it/s]

saving checkpoint to checkpoints/gpt2_checkpoint_epoch_15.pt
Epoch 15 finished. Average loss: 3.1115
Saving checkpoint to Drive: /content/drive/MyDrive/gpt2_checkpoints/gpt2_checkpoint_epoch_12.pt


Run history (wandb sparkline charts omitted): avg_loss, batch, batch_loss, epoch, epoch_avg_loss, grad_norm, learning_rate

Run summary:
avg_loss: 3.11149
batch: 5794.0
batch_loss: 3.15218
epoch: 14.0
epoch_avg_loss: 3.11149
grad_norm: 1.0
learning_rate: 3e-05
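
The average losses printed above are cross-entropy values in nats, so they can be converted to perplexity with `exp(loss)`. The short sketch below does this for the three logged epochs; note that these are training-set figures, since the run does not evaluate on held-out data.

```python
import math

# Average training losses printed for epochs 13 to 15 in the log above.
epoch_losses = {13: 3.2114, 14: 3.1596, 15: 3.1115}

perplexities = {epoch: math.exp(loss) for epoch, loss in epoch_losses.items()}
for epoch, ppl in perplexities.items():
    print(f"epoch {epoch}: loss {epoch_losses[epoch]:.4f} -> perplexity {ppl:.2f}")
```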


## GPT-2 Text Generation with WandB Tracking

This module implements text generation using a trained GPT-2 model with
comprehensive logging and visualization through Weights & Biases.

Features:
1. Temperature-controlled text generation
2. Top-k filtering for token selection
3. Detailed generation statistics tracking
4. Per-token probability and entropy logging, with a probability plot
5. Step-by-step token generation monitoring
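
The first two features can be illustrated without a model. The following plain-Python sketch scales toy logits by a temperature, keeps the `k` largest, renormalizes with a softmax, and samples one token. The names `top_k_probs` and `entropy` and the toy logits are assumptions for this example; ties at the cutoff would keep more than `k` entries in this simplified version.

```python
import math
import random

def top_k_probs(logits, k, temperature):
    """Return softmax probabilities restricted to the k largest logits."""
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]  # k-th largest value
    kept = [s if s >= cutoff else float("-inf") for s in scaled]
    top = max(kept)
    exps = [math.exp(s - top) for s in kept]  # filtered entries become 0.0
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # toy next-token logits
probs = top_k_probs(logits, k=3, temperature=0.65)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print([round(p, 3) for p in probs], next_token, round(entropy(probs), 3))
```

Skipping zero-probability entries in the entropy matters: after top-k filtering most probabilities are exactly zero, and `0 * log(0)` would otherwise produce NaN.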

In [27]:
def generate_text(input_text, temperature=0.65, max_length=40):
    # Initialize wandb
    wandb.init(project="gpt2-pretrained-generation",
               config={
                   "max_length": max_length,
                   "temperature": temperature,
                   "top_k": 50
               })
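    model_path = "/content/drive/MyDrive/gpt2_checkpoints/gpt2_checkpoint_epoch_15.pt"
    # Load model and tokenizer
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # The checkpoint contains a pickled GPTConfig, so weights_only must be False
    checkpoint = torch.load(model_path, map_location=device, weights_only=False)

    # Rebuild the model and restore the trained weights
    config = GPTConfig()
    model = GPT(config)
    model.load_state_dict(checkpoint['model'])
    model.to(device)

    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    model.eval()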

    # Track input
    wandb.log({
        "input_text": input_text,
        "input_length": len(input_text.split())
    })

    # Tokenize input
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

    # Generation loop with progress tracking
    generated = input_ids.clone()
    generation_steps = []

    with torch.no_grad():
        for step in tqdm(range(max_length), desc="Generating"):
            # Get output logits; GPT.forward returns a (logits, loss) tuple
            logits, _ = model(generated[:, -config.block_size:])
            next_token_logits = logits[:, -1, :] / temperature

            # Top-k filtering
            top_k = 50
            top_k_logits, top_k_indices = torch.topk(next_token_logits, top_k)
            next_token_logits[0, :] = float('-inf')
            next_token_logits[0, top_k_indices[0]] = top_k_logits[0]

            # Sample from filtered distribution
            probs = F.softmax(next_token_logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)

            # Track generation statistics
            token_prob = probs[0, next_token.item()].item()
            token_text = tokenizer.decode([next_token.item()])

            generation_steps.append({
                "step": step,
                "token": token_text,
                "probability": token_prob,
                # Filtered tokens have probability 0; the small constant avoids 0 * log(0) = NaN
                "entropy": (-probs * (probs + 1e-10).log()).sum().item()
            })

            # Append new token
            generated = torch.cat([generated, next_token], dim=1)

            # Log step information
            wandb.log({
                "generation_step": step,
                "token_probability": token_prob,
                "sequence_length": generated.size(1)
            })

            # Stop if EOS token is generated
            if next_token.item() == tokenizer.eos_token_id:
                break

    # Decode and get final output
    generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)
    epsilon = 1e-10
    # Log final results
    wandb.log({
        "final_text": generated_text,
        "output_length": len(generated_text.split()),
        "generation_steps": wandb.Table(data=[[s["step"], s["token"], s["probability"], s["entropy"]]
                                            for s in generation_steps],
                                      columns=["step", "token", "probability", "entropy"]),
        "log_product_probability": np.sum(np.log([s["probability"] for s in generation_steps])),
        "entropy": (-probs * (probs + epsilon).log()).sum().item()
    })

    # Create visualization
    wandb.log({"generation_plot": wandb.plot.line_series(
        xs=[[s["step"] for s in generation_steps]],
        ys=[[s["probability"] for s in generation_steps]],
        keys=["Token Probability"],
        title="Generation Probabilities",
        xname="Step")
    })

    wandb.finish()
    return generated_text


In [30]:
def run_inference_experiments(qa):
    # generate_text loads the checkpoint trained above, so no model name is selected here

    def generate_qa_responses(qa):
        print("\n=== Context-Based Question Answering ===")

        print(f"\nContext: {qa['context']}")
        print(f"Question: {qa['question']}")

        prompt = f"""Context: {qa['context']}\n\nQuestion: {qa['question']}\n\nAnswer:"""
        generated = generate_text(
            input_text=prompt,
            temperature=0.7,
            max_length=100  # Shorter for focused answers
        )
        print(f"Generated Answer: {generated}")

    # Run different types of generation
    print("Starting inference experiments...")

    print("\n Context-Based QA")
    generate_qa_responses(qa)

In [15]:
qa = {
            "context": """The Apollo 11 spacecraft landed on the Moon on July 20, 1969. Neil Armstrong became
                         the first human to step onto the lunar surface, followed by Buzz Aldrin. They spent
                         about two and a half hours exploring and collecting samples.""",
            "question": "Who was the first person to walk on the Moon?"
        }
run_inference_experiments(qa)

Starting inference experiments...

 Context-Based QA

=== Context-Based Question Answering ===

Context: The Apollo 11 spacecraft landed on the Moon on July 20, 1969. Neil Armstrong became 
                         the first human to step onto the lunar surface, followed by Buzz Aldrin. They spent 
                         about two and a half hours exploring and collecting samples.
Question: Who was the first person to walk on the Moon?


Generating:   0%|          | 0/40 [00:00<?, ?it/s]

Run history (wandb sparkline charts omitted): entropy, generation_step, input_length, log_product_probability, output_length, sequence_length, token_probability

Run summary:
entropy: 0.07134
final_text: Context: The Apollo ...
generation_step: 39
input_length: 53
input_text: Context: The Apollo ...
log_product_probability: -31.96865
output_length: 84
sequence_length: 159
token_probability: 0.99166


Generated Answer: Context: The Apollo 11 spacecraft landed on the Moon on July 20, 1969. Neil Armstrong became 
                         the first human to step onto the lunar surface, followed by Buzz Aldrin. They spent 
                         about two and a half hours exploring and collecting samples.

Question: Who was the first person to walk on the Moon?

Answer: Astronaut Neil Armstrong.

Question: What was the first space flight that was made possible by the Apollo 11 spacecraft?

Answer: This was the first space flight that was made possible by


In [25]:
qa = {
            "context": """Photosynthesis is the process by which plants convert light energy into chemical energy.
                         This process occurs in the chloroplasts, specifically using chlorophyll pigments. The end
                         products are glucose and oxygen, while carbon dioxide and water are the raw materials.""",
            "question": "What are the end products of photosynthesis?"
        }
run_inference_experiments(qa)

Starting inference experiments...

 Context-Based QA

=== Context-Based Question Answering ===

Context: Photosynthesis is the process by which plants convert light energy into chemical energy. 
                         This process occurs in the chloroplasts, specifically using chlorophyll pigments. The end 
                         products are glucose and oxygen, while carbon dioxide and water are the raw materials.
Question: What are the end products of photosynthesis?


Generating:   0%|          | 0/100 [00:00<?, ?it/s]

Run history (wandb sparkline charts omitted): entropy, generation_step, input_length, log_product_probability, output_length, sequence_length, token_probability

Run summary:
entropy: 0.15143
final_text: Context: Photosynthe...
generation_step: 99
input_length: 49
input_text: Context: Photosynthe...
log_product_probability: -68.19963
output_length: 114
sequence_length: 221
token_probability: 0.9732


Generated Answer: Context: Photosynthesis is the process by which plants convert light energy into chemical energy. 
                         This process occurs in the chloroplasts, specifically using chlorophyll pigments. The end 
                         products are glucose and oxygen, while carbon dioxide and water are the raw materials.

Question: What are the end products of photosynthesis?

Answer: The chloroplasts, or phytoplankton, contain the products of photosynthesis. The phytoplankton are a subset of the chloroplasts. The end products of photosynthesis are glucose and oxygen, while carbon dioxide and water are the raw materials. 

Question: What is the process of photosynthesis, and how does it relate to the chloroplasts?

Answer: The process of photosynthesis is a process by which plants convert light energy into


In [26]:
qa = { "context": """The Industrial Revolution began in Britain in the late 18th century. It brought about
                         major changes in agriculture, manufacturing, mining, and transport. While it led to economic
                         growth and technological progress, it also caused environmental pollution and poor working
                         conditions for many laborers.""",
            "question": "What were the negative effects of the Industrial Revolution?"
      }
run_inference_experiments(qa)

Starting inference experiments...

 Context-Based QA

=== Context-Based Question Answering ===

Context: The Industrial Revolution began in Britain in the late 18th century. It brought about 
                         major changes in agriculture, manufacturing, mining, and transport. While it led to economic 
                         growth and technological progress, it also caused environmental pollution and poor working 
                         conditions for many laborers.
Question: What were the negative effects of the Industrial Revolution?


Generating:   0%|          | 0/100 [00:00<?, ?it/s]

Run history (wandb sparkline charts omitted): entropy, generation_step, input_length, log_product_probability, output_length, sequence_length, token_probability

Run summary:
entropy: 1.41719
final_text: Context: The Industr...
generation_step: 99
input_length: 55
input_text: Context: The Industr...
log_product_probability: -160.19094
output_length: 142
sequence_length: 249
token_probability: 0.48218


Generated Answer: Context: The Industrial Revolution began in Britain in the late 18th century. It brought about 
                         major changes in agriculture, manufacturing, mining, and transport. While it led to economic 
                         growth and technological progress, it also caused environmental pollution and poor working 
                         conditions for many laborers.

Question: What were the negative effects of the Industrial Revolution?

Answer: The industrial revolution was a major factor in the decline of the British Empire. In Britain, the industrial revolution transformed British society from a highly successful industrial society to a highly polluted and hazardous one.

Question: Were there any specific negative impacts on the British economy?

Answer: A few specific negative effects have been noted so far, but the biggest is that the British Empire was much more dependent on foreign trade than it is now. British commerce was also more dependent
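
The `log_product_probability` values logged for the three runs are sums over different numbers of tokens, so they are easier to compare after dividing by the token count and exponentiating. The sketch below does that, assuming the number of generated tokens equals the final `generation_step` plus one (40, 100, and 100).

```python
import math

# (log_product_probability, number of generated tokens) from the three run summaries.
runs = {
    "Apollo 11": (-31.96865, 40),
    "Photosynthesis": (-68.19963, 100),
    "Industrial Revolution": (-160.19094, 100),
}

mean_probs = {name: math.exp(log_p / n) for name, (log_p, n) in runs.items()}
for name, p in mean_probs.items():
    print(f"{name}: geometric-mean token probability {p:.3f}")
```

A lower geometric mean indicates that the sampled tokens were, on average, less likely under the filtered distribution, which is consistent with the higher final entropy logged for the Industrial Revolution prompt.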