# 📚 Understanding GPT-2 Model Weights & Files

This notebook walks through the GPT-2 model files: what each one contains, how they are downloaded, and how the weights are loaded and used in practice.

---

## 📁 Overview of GPT-2 Model Files

When you download GPT-2 weights from OpenAI, you get the following files in the `gpt2/124M/` directory:

| File | Size | Purpose |
|------|------|----------|
| `checkpoint` | 77 bytes | Points to the latest checkpoint file |
| `encoder.json` | ~1 MB | Token-to-ID mapping (vocabulary) |
| `vocab.bpe` | ~446 KB | Byte-Pair Encoding merge rules |
| `hparams.json` | 90 bytes | Model hyperparameters/configuration |
| `model.ckpt.data-00000-of-00001` | ~475 MB | **Actual model weights** |
| `model.ckpt.index` | ~5 KB | Index for weight tensors |
| `model.ckpt.meta` | ~461 KB | TensorFlow graph metadata |

Let's explore each file in detail!

---
## 🔧 Setup

In [12]:
import os
import json
import numpy as np
import tensorflow as tf
import requests
from tqdm import tqdm

# Suppress TensorFlow info messages
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

# Path to GPT-2 model files
MODEL_DIR = "gpt2/124M"

# Check if we need to download the model files
if not os.path.exists(MODEL_DIR):
    print("📥 GPT-2 model files not found. Downloading...")
    os.makedirs(MODEL_DIR, exist_ok=True)
    
    base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models/124M"
    filenames = [
        "checkpoint",
        "encoder.json",
        "hparams.json",
        "model.ckpt.data-00000-of-00001",
        "model.ckpt.index",
        "model.ckpt.meta",
        "vocab.bpe"
    ]
    
    for filename in filenames:
        url = f"{base_url}/{filename}"
        filepath = os.path.join(MODEL_DIR, filename)
        
        print(f"  Downloading {filename}...", end=" ")
        response = requests.get(url, stream=True, timeout=60)
        response.raise_for_status()  # Fail early instead of saving an error page to disk
        total_size = int(response.headers.get('content-length', 0))
        
        with open(filepath, 'wb') as f:
            if total_size > 1024*1024:  # Show progress for large files
                with tqdm(total=total_size, unit='B', unit_scale=True, desc=filename) as pbar:
                    for chunk in response.iter_content(chunk_size=8192):
                        f.write(chunk)
                        pbar.update(len(chunk))
            else:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
                print("✅")
    
    print("\n✅ Download complete!")

# Display files in directory
print("\n📂 Files in GPT-2 124M directory:")
print("-" * 50)
for file in sorted(os.listdir(MODEL_DIR)):
    size = os.path.getsize(os.path.join(MODEL_DIR, file))
    if size > 1024*1024:
        size_str = f"{size/(1024*1024):.1f} MB"
    elif size > 1024:
        size_str = f"{size/1024:.1f} KB"
    else:
        size_str = f"{size} bytes"
    print(f"  {file:40} {size_str:>10}")

📥 GPT-2 model files not found. Downloading...
  Downloading checkpoint... ✅
  Downloading encoder.json... ✅
  Downloading hparams.json... ✅
  Downloading model.ckpt.data-00000-of-00001... 

model.ckpt.data-00000-of-00001: 100%|██████████| 498M/498M [00:17<00:00, 29.2MB/s]


  Downloading model.ckpt.index... ✅
  Downloading model.ckpt.meta... ✅
  Downloading vocab.bpe... ✅

✅ Download complete!

📂 Files in GPT-2 124M directory:
--------------------------------------------------
  checkpoint                                 77 bytes
  encoder.json                              1017.9 KB
  hparams.json                               90 bytes
  model.ckpt.data-00000-of-00001             474.7 MB
  model.ckpt.index                             5.1 KB
  model.ckpt.meta                            460.1 KB
  vocab.bpe                                  445.6 KB


---
## 1️⃣ `hparams.json` - Model Configuration

This file contains the **hyperparameters** that define the model architecture. It's the blueprint that tells us how big the model is.

In [13]:
# Load and display hyperparameters
with open(os.path.join(MODEL_DIR, "hparams.json"), "r") as f:
    hparams = json.load(f)

print("🔧 GPT-2 124M Hyperparameters:")
print("-" * 50)
print(json.dumps(hparams, indent=2))

🔧 GPT-2 124M Hyperparameters:
--------------------------------------------------
{
  "n_vocab": 50257,
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_layer": 12
}


### Understanding Each Hyperparameter:

| Parameter | Value | Meaning |
|-----------|-------|----------|
| `n_vocab` | 50257 | Size of vocabulary (number of unique tokens) |
| `n_ctx` | 1024 | Maximum context length (tokens the model can "see") |
| `n_embd` | 768 | Embedding dimension (size of each token's vector) |
| `n_head` | 12 | Number of attention heads in multi-head attention |
| `n_layer` | 12 | Number of transformer blocks (depth of the model) |

### GPT-2 Model Sizes Comparison:

| Model | Parameters | n_layer | n_head | n_embd |
|-------|------------|---------|--------|--------|
| 124M (Small) | 124 Million | 12 | 12 | 768 |
| 355M (Medium) | 355 Million | 24 | 16 | 1024 |
| 774M (Large) | 774 Million | 36 | 20 | 1280 |
| 1558M (XL) | 1.5 Billion | 48 | 25 | 1600 |
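
One pattern in the table is easy to miss: every model size keeps the same width per attention head. The sketch below derives it as `n_embd // n_head`. The `GPT2_SIZES` dictionary is simply the table above retyped for illustration; it is not read from any model file.

```python
# Values copied by hand from the comparison table above (illustrative only)
GPT2_SIZES = {
    "124M":  {"n_layer": 12, "n_head": 12, "n_embd": 768},
    "355M":  {"n_layer": 24, "n_head": 16, "n_embd": 1024},
    "774M":  {"n_layer": 36, "n_head": 20, "n_embd": 1280},
    "1558M": {"n_layer": 48, "n_head": 25, "n_embd": 1600},
}

head_dims = {}
for name, cfg in GPT2_SIZES.items():
    # The embedding dimension has to split evenly across the heads
    assert cfg["n_embd"] % cfg["n_head"] == 0
    head_dims[name] = cfg["n_embd"] // cfg["n_head"]
    print(f"{name:>6}: {cfg['n_head']} heads x {head_dims[name]} dims = {cfg['n_embd']}")
```

So the larger models add heads and layers rather than making each head wider.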

In [14]:
# Calculate approximate number of parameters
n_vocab = hparams['n_vocab']
n_ctx = hparams['n_ctx']
n_embd = hparams['n_embd']
n_head = hparams['n_head']
n_layer = hparams['n_layer']

# Token embeddings: n_vocab * n_embd
token_emb_params = n_vocab * n_embd

# Position embeddings: n_ctx * n_embd
pos_emb_params = n_ctx * n_embd

# Each transformer block has:
# - Attention: 4 * n_embd * n_embd (Q, K, V, Output projections)
# - MLP: 2 * n_embd * (4 * n_embd) = 8 * n_embd^2
# - Layer norms: 4 * n_embd (2 layer norms with weight and bias)
# Biases of the attention and MLP linear layers are ignored, so this is a slight undercount
params_per_block = 4 * n_embd * n_embd + 8 * n_embd * n_embd + 4 * n_embd
transformer_params = n_layer * params_per_block

# Final layer norm
final_ln_params = 2 * n_embd

total_params = token_emb_params + pos_emb_params + transformer_params + final_ln_params

print("📊 Parameter Count Breakdown:")
print("-" * 50)
print(f"Token Embeddings:     {token_emb_params:>15,} params")
print(f"Position Embeddings:  {pos_emb_params:>15,} params")
print(f"Transformer Blocks:   {transformer_params:>15,} params")
print(f"Final Layer Norm:     {final_ln_params:>15,} params")
print("-" * 50)
print(f"Total (approx):       {total_params:>15,} params")
print(f"                      ~{total_params/1e6:.0f} Million parameters")

📊 Parameter Count Breakdown:
--------------------------------------------------
Token Embeddings:          38,597,376 params
Position Embeddings:          786,432 params
Transformer Blocks:        84,971,520 params
Final Layer Norm:               1,536 params
--------------------------------------------------
Total (approx):           124,356,864 params
                      ~124 Million parameters
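
The estimate above counts only the weight matrices and layer norms. As a rough cross-check, the sketch below also adds the bias vectors of the four linear layers in each block. It assumes the layer layout described later in this notebook (`c_attn`, `c_proj`, `c_fc`, `c_proj`, each with a `w` and a `b`) and hard-codes the hyperparameters printed from `hparams.json`.

```python
# Hyperparameters as printed from hparams.json above
n_vocab, n_ctx, n_embd, n_layer = 50257, 1024, 768, 12

embeddings = n_vocab * n_embd + n_ctx * n_embd

# Attention: combined Q/K/V projection plus output projection, weights and biases
attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
# MLP: expand to 4 * n_embd, then project back, weights and biases
mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
# Two layer norms per block, each with a scale and a shift vector
layer_norms = 2 * (2 * n_embd)

per_block = attn + mlp + layer_norms
exact_total = embeddings + n_layer * per_block + 2 * n_embd  # + final layer norm

print(f"Total including biases: {exact_total:,}")
```

The difference from the approximate figure is the bias terms: `12 * (2304 + 768 + 3072 + 768) = 82,944` parameters.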


---
## 2️⃣ `encoder.json` - Vocabulary Mapping

This file contains the **token-to-ID mapping**. It maps each token (word piece) to a unique integer ID that the model uses internally.

GPT-2 uses **Byte-Pair Encoding (BPE)** tokenization, which breaks words into subword units.

In [15]:
# Load the encoder (vocabulary)
with open(os.path.join(MODEL_DIR, "encoder.json"), "r") as f:
    encoder = json.load(f)

print(f"📖 Vocabulary Size: {len(encoder):,} tokens")
print("\n🔤 Sample tokens from vocabulary:")
print("-" * 50)

# Show some interesting tokens
sample_tokens = list(encoder.items())[:20]
for token, idx in sample_tokens:
    # Make whitespace visible
    display_token = repr(token)
    print(f"  Token: {display_token:20} → ID: {idx}")

📖 Vocabulary Size: 50,257 tokens

🔤 Sample tokens from vocabulary:
--------------------------------------------------
  Token: '!'                  → ID: 0
  Token: '"'                  → ID: 1
  Token: '#'                  → ID: 2
  Token: '$'                  → ID: 3
  Token: '%'                  → ID: 4
  Token: '&'                  → ID: 5
  Token: "'"                  → ID: 6
  Token: '('                  → ID: 7
  Token: ')'                  → ID: 8
  Token: '*'                  → ID: 9
  Token: '+'                  → ID: 10
  Token: ','                  → ID: 11
  Token: '-'                  → ID: 12
  Token: '.'                  → ID: 13
  Token: '/'                  → ID: 14
  Token: '0'                  → ID: 15
  Token: '1'                  → ID: 16
  Token: '2'                  → ID: 17
  Token: '3'                  → ID: 18
  Token: '4'                  → ID: 19


In [16]:
# Let's look at some specific tokens
print("\n🔍 Looking up specific tokens:")
print("-" * 50)

words_to_find = ["hello", "Hello", " hello", "world", " the", "the", "Ġthe", "ing", "Ġing"]

for word in words_to_find:
    if word in encoder:
        print(f"  '{word}' → ID: {encoder[word]}")
    else:
        print(f"  '{word}' → NOT FOUND (would be split into subwords)")

print("\n💡 Note: 'Ġ' represents a space before the token (GPT-2's way of encoding spaces)")


🔍 Looking up specific tokens:
--------------------------------------------------
  'hello' → ID: 31373
  'Hello' → ID: 15496
  ' hello' → NOT FOUND (would be split into subwords)
  'world' → ID: 6894
  ' the' → NOT FOUND (would be split into subwords)
  'the' → ID: 1169
  'Ġthe' → ID: 262
  'ing' → ID: 278
  'Ġing' → ID: 5347

💡 Note: 'Ġ' represents a space before the token (GPT-2's way of encoding spaces)
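
The lookups with a literal leading space fail because `encoder.json` never stores raw whitespace bytes as keys: each byte is first mapped to a printable stand-in character, and the space byte becomes `Ġ`. The sketch below is modelled on the `bytes_to_unicode` helper in OpenAI's GPT-2 `encoder.py`; the function body and the `to_vocab_form` helper name are a reconstruction for illustration, so treat the details as an assumption rather than the reference code.

```python
def bytes_to_unicode():
    """Map every byte value 0..255 to a printable unicode character."""
    # Bytes that already have a visible, non-whitespace character keep it
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("¡"), ord("¬") + 1))
            + list(range(ord("®"), ord("ÿ") + 1)))
    mapping = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (controls, space, ...) are moved to code points 256 and up
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping

BYTE_TO_CHAR = bytes_to_unicode()

def to_vocab_form(text):
    """Rewrite text the way it appears as a key in encoder.json (hypothetical helper)."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

print(repr(to_vocab_form(" hello")))
print(repr(to_vocab_form(" the")))
```

With this mapping, `" the"` turns into `"Ġthe"`, which is the key that was found with ID 262 in the lookup above.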


In [17]:
# Create reverse mapping (ID to token)
decoder = {v: k for k, v in encoder.items()}

print("\n🔄 Reverse lookup (ID → Token):")
print("-" * 50)
for idx in [0, 1, 100, 1000, 10000, 50256]:
    token = decoder.get(idx, "NOT FOUND")
    print(f"  ID {idx:>6} → Token: {repr(token)}")

print(f"\n📝 Special token: ID 50256 is '<|endoftext|>' - marks end of text")


🔄 Reverse lookup (ID → Token):
--------------------------------------------------
  ID      0 → Token: '!'
  ID      1 → Token: '"'
  ID    100 → Token: '§'
  ID   1000 → Token: 'ale'
  ID  10000 → Token: 'Ġpocket'
  ID  50256 → Token: '<|endoftext|>'

📝 Special token: ID 50256 is '<|endoftext|>' - marks end of text


---
## 3️⃣ `vocab.bpe` - Byte-Pair Encoding Rules

This file contains the **BPE merge rules**. BPE started out as a compression algorithm; used as a tokenizer, it iteratively merges the most frequent pair of adjacent symbols into a new token.

### How BPE Works:
1. Start with individual bytes (GPT-2 works on UTF-8 bytes, which the files display as printable stand-in characters)
2. Find the most frequent pair of adjacent tokens
3. Merge them into a new token
4. Repeat until vocabulary size is reached
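
The loop above can be sketched on a toy corpus. This is a minimal illustration, not the procedure used to build `vocab.bpe`: it works on characters instead of bytes, the word list is made up, and `learn_bpe` is a hypothetical helper name.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Step 1: every word starts as a tuple of single-character symbols
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 3: replace every occurrence of the best pair with one merged symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus  # Step 4: repeat on the updated corpus
    return merges

toy_merges = learn_bpe(["low", "low", "lower", "lowest", "new", "newer"], num_merges=3)
for rank, (a, b) in enumerate(toy_merges, start=1):
    print(f"Rule {rank}: '{a}' + '{b}' -> '{a}{b}'")
```

On this corpus the first two rules build up `low`, because `l`+`o` and then `lo`+`w` are the most frequent adjacent pairs.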

In [18]:
# Load BPE merge rules
with open(os.path.join(MODEL_DIR, "vocab.bpe"), "r", encoding="utf-8") as f:
    bpe_data = f.read()

# Parse the BPE file
bpe_lines = bpe_data.split('\n')
# First line is version, rest are merge rules
bpe_merges = [tuple(line.split()) for line in bpe_lines[1:] if line.strip()]

print(f"📜 Total BPE merge rules: {len(bpe_merges):,}")
print("\n🔀 First 20 merge rules (most common pairs):")
print("-" * 50)
for i, (a, b) in enumerate(bpe_merges[:20]):
    print(f"  Rule {i+1:>3}: '{a}' + '{b}' → '{a}{b}'")

📜 Total BPE merge rules: 50,000

🔀 First 20 merge rules (most common pairs):
--------------------------------------------------
  Rule   1: 'Ġ' + 't' → 'Ġt'
  Rule   2: 'Ġ' + 'a' → 'Ġa'
  Rule   3: 'h' + 'e' → 'he'
  Rule   4: 'i' + 'n' → 'in'
  Rule   5: 'r' + 'e' → 're'
  Rule   6: 'o' + 'n' → 'on'
  Rule   7: 'Ġt' + 'he' → 'Ġthe'
  Rule   8: 'e' + 'r' → 'er'
  Rule   9: 'Ġ' + 's' → 'Ġs'
  Rule  10: 'a' + 't' → 'at'
  Rule  11: 'Ġ' + 'w' → 'Ġw'
  Rule  12: 'Ġ' + 'o' → 'Ġo'
  Rule  13: 'e' + 'n' → 'en'
  Rule  14: 'Ġ' + 'c' → 'Ġc'
  Rule  15: 'i' + 't' → 'it'
  Rule  16: 'i' + 's' → 'is'
  Rule  17: 'a' + 'n' → 'an'
  Rule  18: 'o' + 'r' → 'or'
  Rule  19: 'e' + 's' → 'es'
  Rule  20: 'Ġ' + 'b' → 'Ġb'


In [19]:
print("\n🔀 Last 10 merge rules (least common pairs):")
print("-" * 50)
for i, (a, b) in enumerate(bpe_merges[-10:]):
    rule_num = len(bpe_merges) - 10 + i + 1
    print(f"  Rule {rule_num:>5}: '{a}' + '{b}' → '{a}{b}'")


🔀 Last 10 merge rules (least common pairs):
--------------------------------------------------
  Rule 49991: 'Comm' + 'ission' → 'Commission'
  Rule 49992: 'Ġ(' + '/' → 'Ġ(/'
  Rule 49993: 'âĢ¦' + '."' → 'âĢ¦."'
  Rule 49994: 'Com' + 'par' → 'Compar'
  Rule 49995: 'Ġampl' + 'ification' → 'Ġamplification'
  Rule 49996: 'om' + 'inated' → 'ominated'
  Rule 49997: 'Ġreg' + 'ress' → 'Ġregress'
  Rule 49998: 'ĠColl' + 'ider' → 'ĠCollider'
  Rule 49999: 'Ġinform' + 'ants' → 'Ġinformants'
  Rule 50000: 'Ġg' + 'azed' → 'Ġgazed'


### Understanding BPE Merge Rules:

The merge rules are listed in the order they were learned, most frequent pair first, and that order is also their priority during tokenization:
- **Early rules** (like `'Ġ' + 't' → 'Ġt'`) cover very common pairs and are applied first
- **Later rules** create longer tokens from merged subwords

This is why GPT-2 can handle any text - unknown words are broken into known subword pieces, and in the worst case into single bytes!
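
To see how the rule order acts as a priority, here is a small sketch of applying merges to one word. The ranks in `toy_ranks` are taken from rules 1, 3 and 7 in the output above (zero-based); `apply_bpe` is a hypothetical helper written for this example, not the tokenizer shipped with GPT-2.

```python
def apply_bpe(symbols, merge_ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Lowest rank = learned earliest = highest priority
        best = min(pairs, key=lambda pair: merge_ranks.get(pair, float("inf")))
        if best not in merge_ranks:
            break  # No known merge left; the remaining pieces are the tokens
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Three rules from the listing above, keyed by (left, right) with zero-based rank
toy_ranks = {("Ġ", "t"): 0, ("h", "e"): 2, ("Ġt", "he"): 6}

print(apply_bpe("Ġthe", toy_ranks))   # all three rules fire in rank order
print(apply_bpe("Ġthz", toy_ranks))   # only the first rule applies
```

The second call shows the fallback behaviour: a string with no matching rules stays split into smaller known pieces instead of failing.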

---
## 4️⃣ `checkpoint` - TensorFlow Checkpoint Pointer

This small file simply points to the latest checkpoint file.

In [None]:
# Read checkpoint file
with open(os.path.join(MODEL_DIR, "checkpoint"), "r") as f:
    checkpoint_content = f.read()

print("📍 Checkpoint file content:")
print("-" * 50)
print(checkpoint_content)

print("\n💡 This tells TensorFlow which checkpoint files to load")
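
Since the cell above has no recorded output, the sketch below parses a hand-typed sample instead. The `sample` string is an assumption about what the file holds (two `key: "value"` lines naming the `model.ckpt` prefix); `parse_checkpoint_text` is a hypothetical helper, and TensorFlow itself reads this file through `tf.train.latest_checkpoint`.

```python
import re

# Assumed content of gpt2/124M/checkpoint, typed in by hand for illustration
sample = (
    'model_checkpoint_path: "model.ckpt"\n'
    'all_model_checkpoint_paths: "model.ckpt"\n'
)

def parse_checkpoint_text(text):
    """Collect key: "value" lines into a dict of lists."""
    entries = {}
    for key, value in re.findall(r'^(\w+):\s*"([^"]*)"', text, flags=re.MULTILINE):
        entries.setdefault(key, []).append(value)
    return entries

parsed = parse_checkpoint_text(sample)
print(parsed)
print(f"Sample size: {len(sample.encode('utf-8'))} bytes")
```

The sample comes to 77 bytes, which is consistent with the size shown in the directory listing earlier.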

---
## 5️⃣ Model Checkpoint Files (The Actual Weights!)

The three `model.ckpt.*` files together contain the **actual neural network weights**:

| File | Purpose |
|------|----------|
| `model.ckpt.data-00000-of-00001` | Binary data containing all weight values |
| `model.ckpt.index` | Index/map to locate tensors in the data file |
| `model.ckpt.meta` | TensorFlow graph structure and metadata |

Let's explore what's inside!

In [None]:
# Get the checkpoint path
ckpt_path = tf.train.latest_checkpoint(MODEL_DIR)
print(f"📂 Checkpoint path: {ckpt_path}")

# List all variables in the checkpoint
print("\n📦 All tensors in the checkpoint:")
print("=" * 70)

variables = tf.train.list_variables(ckpt_path)
total_params = 0

for name, shape in variables:
    num_params = np.prod(shape)
    total_params += num_params
    print(f"{name:50} {str(shape):20} {num_params:>12,} params")

print("=" * 70)
print(f"{'TOTAL':50} {'':<20} {total_params:>12,} params")
print(f"\n🎯 Actual parameter count: {total_params:,} ({total_params/1e6:.1f}M)")

### Understanding the Weight Names:

The naming convention follows this pattern:

```
model/
├── wte          → Word Token Embeddings (vocab_size × emb_dim)
├── wpe          → Word Position Embeddings (context_length × emb_dim)
├── h0/          → Transformer Block 0
│   ├── ln_1/    → Layer Norm 1 (before attention)
│   ├── attn/    → Multi-Head Attention
│   │   ├── c_attn/  → Combined Q, K, V projection
│   │   └── c_proj/  → Output projection
│   ├── ln_2/    → Layer Norm 2 (before MLP)
│   └── mlp/     → Feed-Forward Network
│       ├── c_fc/    → First linear layer (expand)
│       └── c_proj/  → Second linear layer (project back)
├── h1/ ... h11/ → Transformer Blocks 1-11
└── ln_f/        → Final Layer Norm
```
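
The tree can also be turned into a flat list of expected tensor names, which is handy for spotting a missing or misnamed variable. This sketch assumes each layer norm stores `g` and `b` and each linear layer stores `w` and `b`, as in the cells that follow; `expected_variable_names` is a hypothetical helper.

```python
def expected_variable_names(n_layer):
    """Build the tensor names implied by the naming tree above."""
    names = ["model/wte", "model/wpe"]
    for i in range(n_layer):
        prefix = f"model/h{i}"
        for norm in ("ln_1", "ln_2"):
            names += [f"{prefix}/{norm}/g", f"{prefix}/{norm}/b"]
        for layer in ("attn/c_attn", "attn/c_proj", "mlp/c_fc", "mlp/c_proj"):
            names += [f"{prefix}/{layer}/w", f"{prefix}/{layer}/b"]
    names += ["model/ln_f/g", "model/ln_f/b"]
    return names

names = expected_variable_names(n_layer=12)
print(f"{len(names)} tensors expected")
print(names[:4])
```

If the checkpoint is loaded, comparing `set(names)` with the names returned by `tf.train.list_variables(ckpt_path)` gives a quick sanity check.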

In [None]:
# Let's load and examine specific weights
print("🔬 Examining specific weight tensors:")
print("=" * 70)

# Token embeddings
wte = tf.train.load_variable(ckpt_path, "model/wte")
print(f"\n1️⃣ Token Embeddings (wte):")
print(f"   Shape: {wte.shape} = (vocab_size, embedding_dim)")
print(f"   This maps each of {wte.shape[0]:,} tokens to a {wte.shape[1]}-dimensional vector")
print(f"   Sample embedding for token 0: {wte[0, :5]}... (first 5 values)")

# Position embeddings
wpe = tf.train.load_variable(ckpt_path, "model/wpe")
print(f"\n2️⃣ Position Embeddings (wpe):")
print(f"   Shape: {wpe.shape} = (max_positions, embedding_dim)")
print(f"   This encodes position information for up to {wpe.shape[0]} tokens")
print(f"   Sample embedding for position 0: {wpe[0, :5]}... (first 5 values)")
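
These two tables are used together: the input to the first transformer block is the token embedding plus the position embedding for each position. The sketch below shows the lookup and the sum with tiny random stand-in tables (`wte_toy`, `wpe_toy` and `embed` are made-up names, and the sizes are far smaller than the real ones).

```python
import random

random.seed(0)
n_vocab_toy, n_ctx_toy, n_embd_toy = 10, 8, 4  # toy sizes, not GPT-2's

# Stand-ins for wte (vocab x emb) and wpe (positions x emb)
wte_toy = [[random.gauss(0, 0.02) for _ in range(n_embd_toy)] for _ in range(n_vocab_toy)]
wpe_toy = [[random.gauss(0, 0.02) for _ in range(n_embd_toy)] for _ in range(n_ctx_toy)]

def embed(token_ids):
    """Return one vector per token: wte[token] + wpe[position]."""
    if len(token_ids) > len(wpe_toy):
        raise ValueError("sequence is longer than the context length")
    return [
        [t + p for t, p in zip(wte_toy[token], wpe_toy[position])]
        for position, token in enumerate(token_ids)
    ]

x = embed([3, 3, 7])
print(len(x), "vectors of size", len(x[0]))
```

The same token at two positions (token 3 here) gets two different vectors, which is how the model can tell word order apart.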

In [None]:
# Examine attention weights from first transformer block
print("\n3️⃣ Attention Weights (Block 0):")
print("-" * 50)

# c_attn combines Q, K, V projections
c_attn_w = tf.train.load_variable(ckpt_path, "model/h0/attn/c_attn/w")
c_attn_b = tf.train.load_variable(ckpt_path, "model/h0/attn/c_attn/b")
print(f"   c_attn weight: {c_attn_w.shape}")
print(f"   → Combines Q, K, V projections: 768 → 3×768 = 2304")
print(f"   → Each of the 12 heads gets 768/12 = 64 dimensions")

c_proj_w = tf.train.load_variable(ckpt_path, "model/h0/attn/c_proj/w")
print(f"   c_proj weight: {c_proj_w.shape}")
print(f"   → Projects concatenated heads back to embedding dimension")
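
Because `c_attn` produces Q, K and V in one go, its output has to be cut into three equal slices, and each slice into per-head chunks. The sketch below does this for a single token's output vector using toy sizes; the counting numbers stand in for real activations, and `split_heads` is a hypothetical helper.

```python
n_embd_toy, n_head_toy = 8, 2           # toy sizes (GPT-2 124M uses 768 and 12)
head_dim = n_embd_toy // n_head_toy

# Stand-in for one token's c_attn output: length 3 * n_embd, laid out as [Q | K | V]
qkv = list(range(3 * n_embd_toy))

q, k, v = (qkv[i * n_embd_toy:(i + 1) * n_embd_toy] for i in range(3))

def split_heads(vec):
    """Cut one n_embd-sized vector into n_head chunks of head_dim values."""
    return [vec[h * head_dim:(h + 1) * head_dim] for h in range(n_head_toy)]

print("Q heads:", split_heads(q))
print("K heads:", split_heads(k))
print("V heads:", split_heads(v))
```

With the real sizes the same slicing gives three 768-wide vectors, each split into 12 heads of 64 values.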

In [None]:
# Examine MLP weights
print("\n4️⃣ MLP (Feed-Forward) Weights (Block 0):")
print("-" * 50)

c_fc_w = tf.train.load_variable(ckpt_path, "model/h0/mlp/c_fc/w")
c_proj_w = tf.train.load_variable(ckpt_path, "model/h0/mlp/c_proj/w")

print(f"   c_fc weight: {c_fc_w.shape}")
print(f"   → Expands: 768 → 4×768 = 3072 (hidden dimension)")
print(f"   c_proj weight: {c_proj_w.shape}")
print(f"   → Contracts: 3072 → 768 (back to embedding dimension)")
print(f"\n   💡 The 4× expansion is a common design choice in transformers")
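
To make the expand-then-contract shape concrete, here is a small pure-Python sketch of the feed-forward path: a linear layer, a GELU activation (the tanh approximation used by GPT-2), and a second linear layer. The weights are tiny constant stand-ins rather than checkpoint values, and `linear`, `gelu` and `mlp` are illustrative helper names.

```python
import math

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, w, b):
    """y = x @ w + b, with w stored as (input_dim, output_dim)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def mlp(x, c_fc_w, c_fc_b, c_proj_w, c_proj_b):
    hidden = [gelu(h) for h in linear(x, c_fc_w, c_fc_b)]   # expand to 4x
    return linear(hidden, c_proj_w, c_proj_b)               # project back

n_embd_toy = 2
n_hidden_toy = 4 * n_embd_toy
c_fc_w = [[0.1] * n_hidden_toy for _ in range(n_embd_toy)]      # (2, 8)
c_fc_b = [0.0] * n_hidden_toy
c_proj_w = [[0.1] * n_embd_toy for _ in range(n_hidden_toy)]    # (8, 2)
c_proj_b = [0.0] * n_embd_toy

y = mlp([1.0, -1.0], c_fc_w, c_fc_b, c_proj_w, c_proj_b)
print(len(y), "outputs:", y)
```

The output has the same width as the input, so the block can add it back onto the residual stream.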

In [None]:
# Examine Layer Norm parameters
print("\n5️⃣ Layer Normalization Parameters:")
print("-" * 50)

ln1_g = tf.train.load_variable(ckpt_path, "model/h0/ln_1/g")
ln1_b = tf.train.load_variable(ckpt_path, "model/h0/ln_1/b")
print(f"   ln_1 gamma (scale): {ln1_g.shape}")
print(f"   ln_1 beta (shift):  {ln1_b.shape}")

ln_f_g = tf.train.load_variable(ckpt_path, "model/ln_f/g")
ln_f_b = tf.train.load_variable(ckpt_path, "model/ln_f/b")
print(f"   ln_f (final) gamma: {ln_f_g.shape}")
print(f"   ln_f (final) beta:  {ln_f_b.shape}")
print(f"\n   💡 Layer Norm has learnable scale (g) and shift (b) parameters")
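
As a reminder of what `g` and `b` do, the sketch below normalizes one vector to zero mean and unit variance and then applies the learned scale and shift. It is a plain-Python illustration; the `eps` value of `1e-5` is an assumption chosen as a typical default, and `layer_norm` is a hypothetical helper.

```python
import math

def layer_norm(x, g, b, eps=1e-5):
    """Normalize x over its last (only) axis, then scale by g and shift by b."""
    mean = sum(x) / len(x)
    var = sum((value - mean) ** 2 for value in x) / len(x)
    return [g_i * (value - mean) / math.sqrt(var + eps) + b_i
            for value, g_i, b_i in zip(x, g, b)]

x = [1.0, 2.0, 3.0, 4.0]
plain = layer_norm(x, g=[1.0] * 4, b=[0.0] * 4)      # identity scale and shift
shifted = layer_norm(x, g=[2.0] * 4, b=[0.5] * 4)    # learned parameters change the output

print("normalized:", [round(v, 4) for v in plain])
print("scaled and shifted:", [round(v, 4) for v in shifted])
```

With `g = 1` and `b = 0` the output has mean 0 and variance close to 1; the learned parameters let each dimension undo or reshape that normalization.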

---
## 6️⃣ Loading All Weights into a Python Dictionary

Now let's see how to load ALL the weights into a structured Python dictionary that can be used with PyTorch.

In [None]:
def load_gpt2_params_from_tf_ckpt(ckpt_path, hparams):
    """
    Load GPT-2 parameters from TensorFlow checkpoint into a nested dictionary.
    
    The resulting structure:
    {
        'wte': array,           # Token embeddings
        'wpe': array,           # Position embeddings
        'blocks': [
            {                   # Block 0
                'ln_1': {'g': array, 'b': array},
                'attn': {
                    'c_attn': {'w': array, 'b': array},
                    'c_proj': {'w': array, 'b': array}
                },
                'ln_2': {'g': array, 'b': array},
                'mlp': {
                    'c_fc': {'w': array, 'b': array},
                    'c_proj': {'w': array, 'b': array}
                }
            },
            ...                 # Blocks 1-11
        ],
        'ln_f': {'g': array, 'b': array}  # Final layer norm
    }
    """
    # Initialize with empty blocks
    params = {"blocks": [{} for _ in range(hparams["n_layer"])]}
    
    for name, _ in tf.train.list_variables(ckpt_path):
        # Load the variable
        variable_array = np.squeeze(tf.train.load_variable(ckpt_path, name))
        
        # Parse the variable name (skip 'model/' prefix)
        variable_name_parts = name.split("/")[1:]
        
        # Find target dictionary
        target_dict = params
        if variable_name_parts[0].startswith("h"):
            # This is a transformer block parameter
            layer_number = int(variable_name_parts[0][1:])
            target_dict = params["blocks"][layer_number]
            variable_name_parts = variable_name_parts[1:]
        
        # Navigate/create nested dictionaries
        for key in variable_name_parts[:-1]:
            target_dict = target_dict.setdefault(key, {})
        
        # Set the value
        last_key = variable_name_parts[-1]
        target_dict[last_key] = variable_array
    
    return params

# Load all parameters
print("⏳ Loading all GPT-2 weights...")
params = load_gpt2_params_from_tf_ckpt(ckpt_path, hparams)
print("✅ Weights loaded successfully!")
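
The name-parsing logic in `load_gpt2_params_from_tf_ckpt` can be tried without TensorFlow. The sketch below applies the same steps (drop the `model/` prefix, route `hN` names into `blocks[N]`, build nested dicts with `setdefault`) to a handful of names, storing placeholder strings instead of arrays. `insert_by_name` and `toy_params` are names made up for this example.

```python
def insert_by_name(params, name, value):
    """Place `value` in the nested dict according to a checkpoint variable name."""
    parts = name.split("/")[1:]            # skip the 'model/' prefix
    target = params
    if parts[0].startswith("h"):
        # 'h7' -> block index 7
        target = params["blocks"][int(parts[0][1:])]
        parts = parts[1:]
    for key in parts[:-1]:
        target = target.setdefault(key, {})  # create intermediate dicts on demand
    target[parts[-1]] = value

toy_params = {"blocks": [{} for _ in range(2)]}
for name in ["model/wte", "model/h0/attn/c_attn/w", "model/h1/ln_1/g", "model/ln_f/b"]:
    insert_by_name(toy_params, name, f"<{name}>")

print(toy_params)
```

Note that the `startswith("h")` test works only because no top-level name other than the blocks begins with `h`.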

In [None]:
# Explore the structure
print("\n📊 Loaded parameters structure:")
print("=" * 50)

def print_structure(d, indent=0):
    """Recursively print dictionary structure with array shapes"""
    for key, value in d.items():
        if isinstance(value, dict):
            print(" " * indent + f"📁 {key}/")
            print_structure(value, indent + 4)
        elif isinstance(value, list):
            print(" " * indent + f"📁 {key}/ [{len(value)} blocks]")
            # Just show first block structure
            print(" " * (indent + 4) + "📁 [0]/  (showing first block)")
            print_structure(value[0], indent + 8)
        elif isinstance(value, np.ndarray):
            print(" " * indent + f"🔢 {key}: shape={value.shape}")

print_structure(params)

In [None]:
# Verify we can access weights
print("\n✅ Verification - Accessing loaded weights:")
print("-" * 50)
print(f"Token embeddings shape: {params['wte'].shape}")
print(f"Position embeddings shape: {params['wpe'].shape}")
print(f"Number of transformer blocks: {len(params['blocks'])}")
print(f"Block 0 attention c_attn weight shape: {params['blocks'][0]['attn']['c_attn']['w'].shape}")
print(f"Final layer norm gamma shape: {params['ln_f']['g'].shape}")

---
## 7️⃣ How Weights are Downloaded

Let's understand the download process from `07. GPT-2_weights_download.py`:

In [None]:
# The download URLs
base_url = "https://openaipublic.blob.core.windows.net/gpt-2/models"
model_size = "124M"

filenames = [
    "checkpoint",
    "encoder.json", 
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe"
]

print("🌐 GPT-2 Model Download URLs:")
print("=" * 80)
for filename in filenames:
    url = f"{base_url}/{model_size}/{filename}"
    print(f"  {url}")

### Download Process:

```python
# 1. Create directory
os.makedirs("gpt2/124M", exist_ok=True)

# 2. Download each file
for filename in filenames:
    url = f"{base_url}/{model_size}/{filename}"
    response = requests.get(url, stream=True)
    response.raise_for_status()  # Stop on HTTP errors
    
    # Save to disk in chunks (progress bar omitted here for brevity)
    with open(f"gpt2/124M/{filename}", "wb") as f:
        for chunk in response.iter_content(1024):
            f.write(chunk)

# 3. Load the TensorFlow checkpoint
ckpt_path = tf.train.latest_checkpoint("gpt2/124M")

# 4. Extract weights into Python dictionary
params = load_gpt2_params_from_tf_ckpt(ckpt_path, hparams)
```

---
## 8️⃣ Using Weights with PyTorch

Now let's see how these TensorFlow weights are converted and used in a PyTorch GPT-2 model.

In [None]:
import torch
import torch.nn as nn

def assign_weights(pytorch_layer, params_dict, transpose_weights=True):
    """
    Assign NumPy weights to a PyTorch layer.
    
    Note: the GPT-2 TensorFlow checkpoint stores linear weights as
    (input_dim, output_dim), while PyTorch's nn.Linear stores them as
    (output_dim, input_dim), so 2-D weight matrices must be transposed.
    """
    if 'w' in params_dict:
        weight = params_dict['w']
        if transpose_weights and len(weight.shape) == 2:
            weight = weight.T  # Transpose for PyTorch
        # Copy so the tensor does not share memory with the params dictionary
        pytorch_layer.weight.data = torch.from_numpy(weight.copy())
    
    if 'b' in params_dict:
        pytorch_layer.bias.data = torch.from_numpy(params_dict['b'].copy())

print("💡 Key insight: TensorFlow vs PyTorch weight shapes")
print("=" * 50)
print("GPT-2 TF checkpoint: weight shape = (input_dim, output_dim)")
print("PyTorch nn.Linear:   weight shape = (output_dim, input_dim)")
print("\n→ We must TRANSPOSE weights when converting!")
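
The reason a transpose is enough can be checked with a tiny example. The sketch below uses plain Python lists instead of NumPy or PyTorch: one function multiplies with a weight stored as (input_dim, output_dim), the other with a weight stored as (output_dim, input_dim), which is the layout `nn.Linear` keeps. All names and numbers are illustrative.

```python
def apply_in_out(x, w):
    """y = x @ w, with w stored as (input_dim, output_dim)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def apply_out_in(x, w):
    """y = w @ x, with w stored as (output_dim, input_dim)."""
    return [sum(row[i] * x[i] for i in range(len(x))) for row in w]

def transpose(w):
    return [list(column) for column in zip(*w)]

w_tf = [[1, 2, 3],
        [4, 5, 6]]          # (input_dim=2, output_dim=3)
x = [10, 1]

y_tf = apply_in_out(x, w_tf)
y_torch_layout = apply_out_in(x, transpose(w_tf))

print(y_tf, y_torch_layout)
```

Both calls give the same result, so transposing the stored matrix changes only the layout, not the function the layer computes.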

In [None]:
# Example: Loading token embeddings into PyTorch
print("\n📥 Loading token embeddings into PyTorch:")
print("-" * 50)

# Create PyTorch embedding layer
token_embedding = nn.Embedding(hparams['n_vocab'], hparams['n_embd'])

# Load weights (no transpose needed for embeddings)
token_embedding.weight.data = torch.from_numpy(params['wte'].copy())

print(f"Original TF shape: {params['wte'].shape}")
print(f"PyTorch layer shape: {token_embedding.weight.shape}")
print(f"\n✅ Embedding loaded successfully!")

# Test it
test_tokens = torch.tensor([50256, 0, 1])  # <|endoftext|> and first two tokens
embeddings = token_embedding(test_tokens)
print(f"\nTest embedding output shape: {embeddings.shape}")
print(f"Embedding for token 0 (first 5 dims): {embeddings[1, :5]}")

---
## 📝 Summary

### What We Learned:

1. **`hparams.json`** - Contains model architecture configuration (layers, heads, dimensions)

2. **`encoder.json`** - Maps tokens to integer IDs (50,257 tokens)

3. **`vocab.bpe`** - Byte-Pair Encoding merge rules for tokenization

4. **`checkpoint`** - Points to the checkpoint files

5. **`model.ckpt.*`** - The actual neural network weights:
   - Token embeddings: (50257, 768)
   - Position embeddings: (1024, 768)
   - 12 Transformer blocks with:
     - Layer norms
     - Multi-head attention (Q, K, V projections)
     - Feed-forward MLP
   - Final layer norm

### Key Takeaways:

- GPT-2 124M has **124 million parameters**
- Weights are stored in **TensorFlow checkpoint format**
- Linear-layer weights must be **transposed** when loading into PyTorch's `nn.Linear`; embedding tables are loaded as-is
- The architecture follows the standard transformer decoder pattern

---
## 🔗 Next Steps

Now that you understand the GPT-2 weight files, you can:

1. **Load weights into your own GPT-2 implementation** (from notebook 04)
2. **Generate text** using the pre-trained model
3. **Fine-tune** on your own dataset
4. **Analyze** the learned representations

See the next notebook for loading these weights into the GPT-2 model architecture!