# GPT-2 for Next Word/Token Prediction & Text Generation

**Goal:** To use a pre-trained GPT-2 model from the Hugging Face `transformers` library to perform text generation, effectively predicting the next word (or token) in a sequence.

**Contrast with RNN Example:**
*   **Model:** Uses a large, pre-trained Transformer (GPT-2) instead of a simple RNN trained from scratch.
*   **Level:** Operates on words or subword tokens (learned by the tokenizer) instead of individual characters.
*   **Task:** Primarily focused on *generation* by leveraging the model's pre-trained knowledge, rather than training a model for prediction on a small corpus.
*   **Tokenizer:** Uses a sophisticated tokenizer (Byte-Pair Encoding based) provided by Hugging Face.

**Focus:** Understanding how to load and use pre-trained language models for text generation tasks.

## 1. Setup: Installation and Imports

In [1]:
# Install transformers library
!pip install transformers torch -q

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import time

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


## 2. Load Pre-trained GPT-2 Model and Tokenizer

We'll load the standard `gpt2` model and its corresponding tokenizer directly from Hugging Face.

*   `GPT2LMHeadModel`: This is the GPT-2 model with a language modeling head on top, which is necessary for predicting the next token and generating text.
*   `GPT2Tokenizer`: The tokenizer trained for GPT-2. It splits text into tokens (whole words, word pieces, punctuation, etc.) using byte-level Byte-Pair Encoding (BPE), whose vocabulary is built by repeatedly merging the symbol pairs that occur most frequently in the training data.

In [2]:
model_name = 'gpt2' # You can also try 'gpt2-medium', 'gpt2-large', 'gpt2-xl' (if memory allows)

print(f"Loading tokenizer: {model_name}...")
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

print(f"Loading model: {model_name}...")
model = GPT2LMHeadModel.from_pretrained(model_name)

# Move model to the appropriate device
model.to(device)
# Set model to evaluation mode (disables dropout, etc.)
model.eval()

print("Model and tokenizer loaded.")

# GPT-2 tokenizer doesn't have a default pad token, but we can set it to the EOS token for generation purposes
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id
    print(f"Set pad token ID to EOS token ID: {tokenizer.eos_token_id}")

Loading tokenizer: gpt2...


Loading model: gpt2...


Model and tokenizer loaded.
Set pad token ID to EOS token ID: 50256


## 3. Understanding GPT-2 Tokenization

Unlike our simple character mapping, GPT-2's tokenizer breaks text into meaningful units (tokens) based on its training data. This often involves splitting words into common prefixes/suffixes or handling punctuation separately.

In [3]:
sample_text = "Natural language processing helps computers understand text."

# Encode the text
encoded_output = tokenizer(sample_text)
input_ids = encoded_output['input_ids']

# Decode each ID on its own to see the piece of text each token represents
tokens = [tokenizer.decode([token_id]) for token_id in input_ids]

print(f"Original Text: '{sample_text}'")
print(f"Input IDs    : {input_ids}")
print(f"Tokens       : {tokens}")

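# Note: each common word here maps to a single token, the leading space is part of
# the token (e.g. ' language'), and the final period is a separate token.
# Rarer words would be split into several subword tokens.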

Original Text: 'Natural language processing helps computers understand text.'
Input IDs    : [35364, 3303, 7587, 5419, 9061, 1833, 2420, 13]
Tokens       : ['Natural', ' language', ' processing', ' helps', ' computers', ' understand', ' text', '.']
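
To make the idea behind Byte-Pair Encoding more concrete, here is a minimal sketch of how BPE merges could be learned. It is a simplification, not GPT-2's actual tokenizer: it works on characters rather than bytes, and the corpus, its word frequencies, and the helper names (`most_frequent_pair`, `merge_pair`) are hypothetical and chosen only for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first occurrence (dict insertion order)
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency
corpus = {"low": 5, "lower": 2, "lowest": 2, "newer": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("Learned merges :", merges)
print("Segmented words:", {" ".join(symbols): freq for symbols, freq in words.items()})
```

On this toy corpus the frequent word `low` ends up as a single symbol, while rarer words such as `lowest` remain split into smaller pieces. This mirrors the behaviour seen above, where common words map to one token each.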


## 4. Text Generation with `model.generate()`

Hugging Face provides a convenient `generate()` method for text generation. We provide a starting prompt (text), and the method uses the model to predict subsequent tokens.
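
Conceptually, generation is a loop: score the possible next tokens, pick one, append it, and repeat. The sketch below shows that loop with greedy selection. It is only an illustration of the idea and not how `generate()` is implemented; `toy_next_token_scores` is a hypothetical stand-in for the model, and its lookup table is made up.

```python
def toy_next_token_scores(tokens):
    """Hypothetical stand-in for a language model.

    Returns a score for each candidate next token. A real model conditions on
    the whole sequence; this toy version only looks at the last token.
    """
    table = {
        "<start>": {"the": 2.0, "a": 1.0},
        "the": {"cat": 1.5, "dog": 1.2},
        "a": {"cat": 1.0, "dog": 1.1},
        "cat": {"sat": 2.0, "<eos>": 0.5},
        "dog": {"sat": 1.0, "<eos>": 0.7},
        "sat": {"<eos>": 3.0},
    }
    return table[tokens[-1]]

def greedy_generate(prompt_tokens, max_length):
    """Append the highest-scoring token until <eos> or `max_length` total tokens."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:  # the limit counts the prompt tokens too
        scores = toy_next_token_scores(tokens)
        next_token = max(scores, key=scores.get)
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

full = greedy_generate(["<start>"], max_length=10)
capped = greedy_generate(["<start>", "a"], max_length=3)
print(full)    # ['<start>', 'the', 'cat', 'sat']
print(capped)  # ['<start>', 'a', 'dog']
```

The second call stops after a single new token because the two prompt tokens count toward the limit of three.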

**Key Parameters for `generate()`:**
*   `input_ids`: The tokenized starting prompt.
*   `max_length`: The **total** length of the output sequence (prompt + generated text).
*   `num_return_sequences`: How many different sequences to generate.
*   `do_sample=True`: Enables sampling methods (crucial for creative generation). If `False`, it uses greedy decoding (always picks the most likely next token).
*   `temperature`: Controls randomness by rescaling the model's scores before sampling. Values below 1.0 (e.g., 0.7) sharpen the distribution and make output more focused; 1.0 leaves it unchanged; values above 1.0 flatten it and increase randomness.
*   `top_k`: Samples only from the `k` most likely next tokens (e.g., `top_k=50`).
*   `top_p`: (Nucleus Sampling) Samples from the smallest set of tokens whose cumulative probability reaches `p` (e.g., `top_p=0.95`). Often used with or instead of `top_k`.
*   `pad_token_id`: The token ID used to pad sequences that finish before others. GPT-2 has no dedicated pad token, so the EOS token ID is used here.
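
The effect of `temperature`, `top_k`, and `top_p` can be sketched on a small made-up distribution. This is a simplified illustration using plain Python lists, not the library's implementation, which operates on logit tensors and may differ in edge cases. The helper names and the toy logits are assumptions made for this example.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, kept):
    """Zero out every index not in `kept` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def keep_top_k(probs, k):
    """Keep only the k most likely tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(ranked[:k]))

def keep_top_p(probs, p):
    """Keep the smallest set of most likely tokens whose probability mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in ranked:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, kept)

# Hypothetical scores for a five-token vocabulary
toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]

probs_t1 = softmax(toy_logits, temperature=1.0)
probs_t07 = softmax(toy_logits, temperature=0.7)  # sharper: top token gains mass

# Apply top-k first, then top-p on the renormalized result
filtered = keep_top_p(keep_top_k(probs_t1, k=3), p=0.8)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled_index = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

print("T=1.0   :", [round(x, 3) for x in probs_t1])
print("T=0.7   :", [round(x, 3) for x in probs_t07])
print("filtered:", [round(x, 3) for x in filtered])
print("sampled token index:", sampled_index)
```

Lowering the temperature moves probability mass toward the most likely token, and the two filters remove the unlikely tail entirely, so only the remaining tokens can ever be sampled.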

In [4]:
def generate_text_gpt2(prompt, max_len=50, temp=0.7, k=50, p=0.95):
    print(f"\n--- Generating text for prompt: '{prompt}' ---")
    print(f"Parameters: max_length={max_len}, temperature={temp}, top_k={k}, top_p={p}")
    start_gen_time = time.time()

    # 1. Encode the prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
    # No attention mask is passed here. For a single, unpadded prompt an all-ones mask is
    # sufficient, but generate() cannot infer it when the pad and EOS tokens are the same
    # and prints a warning. To pass the mask explicitly, uncomment:
    # attention_mask = torch.ones(input_ids.shape, dtype=torch.long).to(device)

    # 2. Generate text using the model
    # The model was already moved to `device` when it was loaded
    with torch.no_grad(): # No need to track gradients during generation
        output_sequences = model.generate(
            input_ids=input_ids,
            # attention_mask=attention_mask, # Uncomment together with the line above
            max_length=max_len,             # Total max length
            temperature=temp,
            top_k=k,
            top_p=p,
            do_sample=True,                 # Enable sampling
            num_return_sequences=1,         # Generate one sequence
            pad_token_id=tokenizer.eos_token_id # Use EOS token for padding
        )

    # 3. Decode the generated sequence(s)
    generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)

    end_gen_time = time.time()
    print(f"Generation finished in {end_gen_time - start_gen_time:.2f} seconds.")
    print("\nGenerated Text:")
    print(generated_text)
    print("------")
    return generated_text

# --- Generation Examples ---

# Example 1: More focused generation
prompt1 = "Artificial intelligence is"
_ = generate_text_gpt2(prompt1, max_len=60, temp=0.7, k=50, p=0.95)

# Example 2: More creative/random generation
prompt2 = "The unreasonable effectiveness of"
_ = generate_text_gpt2(prompt2, max_len=70, temp=1.0, k=50, p=0.95)

# Example 3: Longer generation
prompt3 = "Recurrent neural networks were once"
_ = generate_text_gpt2(prompt3, max_len=100, temp=0.8, k=50, p=0.95)

# Example 4: Different topic
prompt4 = "The weather today is"
_ = generate_text_gpt2(prompt4, max_len=40, temp=0.6, k=40, p=0.9)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



--- Generating text for prompt: 'Artificial intelligence is' ---
Parameters: max_length=60, temperature=0.7, top_k=50, top_p=0.95
Generation finished in 7.41 seconds.

Generated Text:
Artificial intelligence is a very promising technology, but it's going to take some time before it's ready to meet the demands of people's lives.

The US National Security Agency has been working on an artificial intelligence (AI) system to help protect US citizens from terrorism.

The system
------

--- Generating text for prompt: 'The unreasonable effectiveness of' ---
Parameters: max_length=70, temperature=1.0, top_k=50, top_p=0.95
Generation finished in 7.37 seconds.

Generated Text:
The unreasonable effectiveness of marijuana has been exposed in the media, which has resulted in its overuse as a criminal offense."

The Washington Post reports that the Justice Department is looking into whether the group in question received the approval of the FBI because they may have violated state laws protecting 

## 5. Conclusion

This notebook demonstrated how to use a pre-trained GPT-2 model for text generation. Key takeaways:

*   **Power of Pre-training:** GPT-2 can generate relatively coherent text on various topics without specific fine-tuning because it was trained on a massive dataset.
*   **Tokenization:** Word/subword tokenization (like BPE) is standard for large language models.
*   **`generate()` Method:** Hugging Face provides a powerful and flexible `generate()` method that handles the complexities of decoding strategies (greedy, sampling, top-k, top-p).
*   **Sampling Parameters:** Parameters like `temperature`, `top_k`, and `top_p` significantly influence the style and quality of the generated text.
*   **Contrast with Simple RNN:** Compared to the character RNN trained from scratch on a tiny dataset, GPT-2 exhibits much stronger language understanding and generation capabilities due to its architecture (Transformer) and extensive pre-training.

## 2.5 Visualize Model Architecture and Parameters

Let's examine the structure of the loaded GPT-2 model and get an idea of where the parameters are concentrated. Printing the model object gives a hierarchical view of its layers. We can also iterate through the parameters to count them per major component.

In [6]:
# --- 2.5 Visualize Model Architecture and Parameters ---

print("--- Model Architecture ---")
# Printing the model object gives a detailed layer breakdown
print(model)



--- Model Architecture ---
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
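
The parameter counts can be worked out by hand from the shapes printed above. The short calculation below is a sketch that assumes each `Conv1D` and `LayerNorm` carries a weight and a bias, and that `lm_head` shares its weight matrix with `wte` (weight tying), so the output layer adds no parameters of its own. The variable names are chosen for this example.

```python
# Dimensions read off the printed architecture
d_model, d_ff, vocab_size, n_positions, n_blocks = 768, 3072, 50257, 1024, 12

def dense(n_in, n_out):
    """Parameters of a layer with an (n_in x n_out) weight matrix plus a bias."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # one scale and one shift value per feature

block = (
    layer_norm                      # ln_1
    + dense(d_model, 3 * d_model)   # attn.c_attn (query, key, value in one matrix)
    + dense(d_model, d_model)       # attn.c_proj
    + layer_norm                    # ln_2
    + dense(d_model, d_ff)          # mlp.c_fc
    + dense(d_ff, d_model)          # mlp.c_proj
)

wte = vocab_size * d_model   # token embeddings (assumed shared with lm_head)
wpe = n_positions * d_model  # position embeddings
total = wte + wpe + n_blocks * block + layer_norm  # final term is ln_f

print(f"Per block : {block:,}")   # 7,087,872
print(f"wte       : {wte:,}")     # 38,597,376
print(f"wpe       : {wpe:,}")     # 786,432
print(f"Total     : {total:,}")   # 124,439,808
```

Roughly two thirds of the parameters sit in the twelve Transformer blocks, and most of the remainder is the token embedding matrix.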


In [7]:
print("\n--- Parameter Distribution ---")
total_params = 0
params_per_layer = {}

# Iterate through named parameters to aggregate counts per major component
for name, param in model.named_parameters():
    if not param.requires_grad:
        continue # Skip parameters that are not trainable

    num_params = param.numel()
    total_params += num_params

    # Aggregate parameters based on layer name components
    name_parts = name.split('.')
    component = name_parts[0] # e.g., 'transformer'

    if component == 'transformer':
        if name_parts[1] == 'wte': # Word Token Embeddings
            layer_key = "transformer.wte (Word Embeddings)"
        elif name_parts[1] == 'wpe': # Position Embeddings
            layer_key = "transformer.wpe (Position Embeddings)"
        elif name_parts[1] == 'h': # Transformer Blocks (h)
            block_num = name_parts[2]
            layer_key = f"transformer.h.{block_num} (Block {block_num})"
        elif name_parts[1] == 'ln_f': # Final LayerNorm
            layer_key = "transformer.ln_f (Final LayerNorm)"
        else:
            layer_key = name # Fallback if structure changes
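    # Note: GPT-2 ties lm_head.weight to transformer.wte.weight, and named_parameters()
    # yields a shared tensor only once, so this branch is not expected to trigger here.
    elif component == 'lm_head': # Language Model Head
        layer_key = "lm_head (Output Layer)"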
    else:
        layer_key = name # Fallback

    params_per_layer[layer_key] = params_per_layer.get(layer_key, 0) + num_params

# Print aggregated parameters per layer/component
print("Parameters per major component:")
# Sort keys alphabetically for consistent display (the order is lexicographic,
# so "Block 10" and "Block 11" appear before "Block 2")
sorted_items = sorted(params_per_layer.items(), key=lambda item: item[0])
for layer, count in sorted_items:
    print(f"  {layer:<40}: {count:>12,}") # Align output

print("-" * 60)
print(f"Total Trainable Parameters          : {total_params:>12,}")

# Verify total count
assert total_params == sum(p.numel() for p in model.parameters() if p.requires_grad)
print("(Total count verified)")
print("-" * 60)


--- Parameter Distribution ---
Parameters per major component:
  transformer.h.0 (Block 0)               :    7,087,872
  transformer.h.1 (Block 1)               :    7,087,872
  transformer.h.10 (Block 10)             :    7,087,872
  transformer.h.11 (Block 11)             :    7,087,872
  transformer.h.2 (Block 2)               :    7,087,872
  transformer.h.3 (Block 3)               :    7,087,872
  transformer.h.4 (Block 4)               :    7,087,872
  transformer.h.5 (Block 5)               :    7,087,872
  transformer.h.6 (Block 6)               :    7,087,872
  transformer.h.7 (Block 7)               :    7,087,872
  transformer.h.8 (Block 8)               :    7,087,872
  transformer.h.9 (Block 9)               :    7,087,872
  transformer.ln_f (Final LayerNorm)      :        1,536
  transformer.wpe (Position Embeddings)   :      786,432
  transformer.wte (Word Embeddings)       :   38,597,376
------------------------------------------------------------
Total Trainable Para