In [None]:
# @title Install the Dependencies and Set Everything Up {"display-mode": "form"}


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt
import torch.nn.functional as F
import numpy as np
import seaborn as sns
from google.colab import output

# Suppress verbose output
from transformers import logging
logging.set_verbosity_error()

print("🎸 Libraries imported successfully.")

# Decoders: Generating Text with Models like GPT

Let's talk about how models like GPT-2 are masterfully designed to **generate new text**. We'll uncover how these models write stories, answer questions, and complete your sentences. By the end, you'll understand the architecture and step-by-step process of text generation.

# The Core Idea: Causal Attention (or, Masked Self-Attention)

Imagine you're reading a mystery novel one page at a time—you can see everything that has already happened, but you’re not allowed to peek at future pages.  A **causal attention mask** makes a language model behave the same way.  When the model is generating the next word, the mask hides (“masks out”) all the words that come **after** the current position, so the model can only “attend” to words it has already produced.  This one-way window keeps the model from cheating by looking ahead and ensures each new word depends solely on the past context, just like a reader who hasn’t turned the next page yet.

The most important difference between a text-generation model (decoder) and a text-understanding model (encoder) is how they "see" the input.

* **Encoder (like BERT):** Uses **bidirectional attention**. When analyzing the word "sat" in "the cat sat on the mat," it can look at words both before ("the cat") and after ("on the mat"). This is great for understanding the full context.
* **Decoder (like GPT):** Uses **causal attention**. When *generating* the word "sat," it can only see the words that came before it ("the cat"). It cannot see the future.

Why is this restriction necessary? Because when you're generating text, the future words don't exist yet! The model _must_ predict the next word based only on what it has already written.

**Why it matters:** Preventing a model from seeing future tokens during training teaches it the natural left-to-right flow of language.  At inference time—when it’s writing text for you—the same masking lets it roll forward word by word, using only what it has already written to decide what comes next.


# Visualizing the Causal Attention Mask

Let's see what this looks like in practice. The causal mask is a matrix that prevents the model from attending to future tokens by adding a very large negative number (in effect, `-infinity`) to their attention scores before the softmax step, so their attention weights come out as zero.

In [None]:
# Let's create a dummy sequence of 5 tokens
sequence_length = 5
# The attention mask will be a 5x5 matrix
causal_mask = torch.triu(torch.ones(sequence_length, sequence_length) * -1e9, diagonal=1)

# Let's make a pretty little chart.
plt.figure(figsize=(6, 6))
sns.heatmap(causal_mask, cmap='viridis', annot=True, fmt=".0f", cbar=False,
            xticklabels=np.arange(sequence_length), yticklabels=np.arange(sequence_length))
plt.title("Causal Attention Mask Heatmap")
plt.xlabel("Key Token Index")
plt.ylabel("Query Token Index")
plt.show()

In the mask above, `0` means "pay attention," and the large negative number (`-1e9`, standing in for `-infinity`) means "ignore." This ensures that at each step, the model is only influenced by the past.
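
To make the "ignore" part concrete, here is a minimal sketch that adds the same kind of mask to a score matrix and runs softmax. The random `scores` tensor is only a stand-in for real query-key scores; it does not come from GPT-2.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
sequence_length = 5

# Stand-in for raw query-key attention scores (random, for illustration only).
scores = torch.randn(sequence_length, sequence_length)

# Same construction as the mask cell: 0 on and below the diagonal, -1e9 above it.
causal_mask = torch.triu(torch.ones(sequence_length, sequence_length) * -1e9, diagonal=1)

# Adding the mask before softmax pushes the weight of every future position to zero.
attention_weights = F.softmax(scores + causal_mask, dim=-1)

print(attention_weights)
```

Each row still sums to 1, but everything above the diagonal is zero: token 0 can only attend to itself, token 1 to tokens 0 and 1, and so on.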

# The Decoder Architecture: Stacking the Blocks

A decoder-only model like GPT-2 is essentially a stack of identical "decoder blocks."

Each block has two main components:

1.  **Masked Multi-Head Self-Attention:** This is the causal attention mechanism we just discussed. It allows the model to weigh the importance of previous words.
2.  **Feed-Forward Neural Network:** This processes the output from the attention layer, adding more computational depth to learn complex patterns.

This stack of blocks processes the token embeddings. The output of the final block is fed into one last layer: a **language modeling head**. This is a linear layer that projects the final token representation into a massive vector—one score for every single word in the model's vocabulary. A **softmax** function then converts these scores into probabilities.
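
As a rough sketch of how these pieces fit together, the toy model below stacks two decoder blocks and a language modeling head. The names `TinyDecoderBlock` and `TinyDecoderLM` and all of the sizes are made up for illustration. The residual connections and layer normalization are details not covered above, although GPT-2 does use them.

```python
import torch
import torch.nn as nn


class TinyDecoderBlock(nn.Module):
    """One decoder block: masked self-attention followed by a feed-forward network."""

    def __init__(self, d_model: int = 32, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len = x.size(1)
        # Boolean mask: True marks key positions a query is NOT allowed to see.
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=future, need_weights=False)
        x = x + attn_out                 # residual connection around attention
        x = x + self.ffn(self.ln2(x))    # residual connection around the feed-forward net
        return x


class TinyDecoderLM(nn.Module):
    """Embeddings -> stack of decoder blocks -> language modeling head."""

    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [TinyDecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(input_ids.size(1))
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))  # logits of shape (batch, seq_len, vocab_size)


torch.manual_seed(0)
toy_model = TinyDecoderLM().eval()
toy_ids = torch.randint(0, 100, (1, 6))  # one sequence of six made-up token ids

with torch.no_grad():
    toy_logits = toy_model(toy_ids)
    toy_probs = torch.softmax(toy_logits[:, -1, :], dim=-1)  # next-token distribution

print(toy_logits.shape)
```

The weights here are random, so the probabilities are meaningless; the point is only the shape of the computation. Because of the mask, changing a later token never changes the logits at earlier positions.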

In [None]:
# Load a pre-trained GPT-2 model (the medium-sized checkpoint) and its tokenizer
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Make sure the tokenizer has a padding token for batching
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

output.clear() # Clean up all of those progress bars.

# Let's see the language modeling head
print("Language Modeling Head (Output Layer):")
print(model.lm_head)

# The output size matches the vocabulary size
print(f"Vocabulary size: {tokenizer.vocab_size}")

# The Generation Process: Autoregressive Decoding

**Autoregressive** is a *fancy* term for a simple idea: the model's next prediction depends on its own previous predictions. It's a step-by-step loop.

1.  Start with a prompt.
2.  The model predicts probabilities for the very next token.
3.  A **decoding strategy** is used to select one token from that probability distribution.
4.  The selected token is added to the end of the input sequence.
5.  Repeat from step 2 until a stop token is generated or the maximum length is reached.
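
The five steps above can be sketched as a plain Python loop. Everything here is a toy: `toy_scores` is a hypothetical stand-in for a real model, and step 3 simply takes the highest-scoring token as a placeholder for whatever decoding strategy you prefer.

```python
from typing import Callable, List


def generate(next_token_scores: Callable[[List[int]], List[float]],
             prompt_ids: List[int],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    """Run the autoregressive loop, choosing the highest-scoring token each step."""
    ids = list(prompt_ids)                        # step 1: start with the prompt
    for _ in range(max_new_tokens):               # step 5: stop at the length cap...
        scores = next_token_scores(ids)           # step 2: score every vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)  # step 3: choose a token
        ids.append(next_id)                       # step 4: append it to the sequence
        if next_id == eos_id:                     # step 5: ...or when the stop token appears
            break
    return ids


# Hypothetical toy "model": a vocabulary of 5 ids that always favors (last id + 1).
# Id 4 plays the role of the stop token.
def toy_scores(ids: List[int]) -> List[float]:
    target = min(ids[-1] + 1, 4)
    return [1.0 if i == target else 0.0 for i in range(5)]


print(generate(toy_scores, [0], max_new_tokens=10, eos_id=4))  # [0, 1, 2, 3, 4]
print(generate(toy_scores, [0], max_new_tokens=2, eos_id=4))   # [0, 1, 2]
```

The first call ends because the stop token was produced; the second ends because it hit the length cap first.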

The most interesting part of this process is step 3: how do we choose the next token?

# Decoding Strategy 1: Greedy Search (The Simplest)

Greedy search always picks the single most likely next token. It's fast and predictable, but often leads to boring and repetitive text because it never takes a risk on a slightly less probable but more interesting word.

In [None]:
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text using greedy search
# max_new_tokens is the number of tokens to generate *after* the prompt
greedy_output = model.generate(inputs.input_ids, max_new_tokens=20, num_beams=1, do_sample=False)

print("Greedy Search Output:")
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

# Decoding Strategy 2: Sampling with Temperature

Instead of just picking the top token, we can *sample* from the probability distribution. This introduces randomness. We can control this randomness with a parameter called **temperature**.

  * **Low Temperature (e.g., 0.2):** Makes the distribution "peakier." The model becomes more confident and conservative, similar to greedy search.
  * **High Temperature (e.g., 1.5):** Flattens the distribution. The model takes more risks, leading to more creative or even nonsensical text.
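
Under the hood, temperature just divides every logit by `T` before the softmax. Here is a small sketch on three made-up logits, using only the standard library; `softmax_with_temperature` is an illustrative helper, not a library function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                           # subtract the max to avoid overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens

for t in (0.2, 1.0, 1.5):
    probs_t = softmax_with_temperature(example_logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs_t])
```

At `T=0.2` nearly all of the probability lands on the first token, while at `T=1.5` the three tokens are much closer together.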

Let's visualize how temperature reshapes probabilities.

In [None]:
# Get the model's predicted probabilities for the next word
with torch.no_grad():
    outputs = model(**inputs)
    # Get the logits for the last token in the sequence
    next_token_logits = outputs.logits[:, -1, :]
    # Apply softmax to get probabilities
    probs = F.softmax(next_token_logits, dim=-1).cpu().numpy().flatten()

# Get top 20 probabilities for visualization
topk_probs, topk_indices = torch.topk(torch.from_numpy(probs), 20)
topk_tokens = [tokenizer.decode(i) for i in topk_indices]

# Function to apply temperature
def apply_temperature(logits, temperature):
    return F.softmax(logits / temperature, dim=-1).cpu().numpy().flatten()

# Visualize the effect of temperature
plt.figure(figsize=(15, 5))
temperatures = [0.1, 0.7, 1.5]

for i, temp in enumerate(temperatures):
    plt.subplot(1, len(temperatures), i + 1)
    temp_probs = apply_temperature(next_token_logits, temp)
    plt.bar(topk_tokens, temp_probs[topk_indices.numpy()], color='skyblue')
    plt.title(f"Temperature = {temp}")
    plt.xticks(rotation=90)
    plt.ylabel("Probability")
plt.tight_layout()
plt.show()

# Let's generate with sampling and temperature
sampling_output = model.generate(
    inputs.input_ids,
    max_new_tokens=50,
    do_sample=True, # Enable sampling
    temperature=0.7, # A balanced value for creativity
    top_k=0 # We'll discuss top_k next
)

print("\nSampling with Temperature (0.7) Output:")
print(tokenizer.decode(sampling_output[0], skip_special_tokens=True))

# Decoding Strategy 3: Top-K and Top-p (Nucleus) Sampling

Sampling can sometimes produce very weird words if it randomly picks a token with a very low probability. To prevent this, we can filter the distribution before sampling.

  * **Top-K Sampling:** Only consider the `k` most likely tokens for sampling. For example, with `top_k=50`, the model will only sample from the 50 most probable next words.
  * **Top-p (Nucleus) Sampling:** A more dynamic approach. It considers the smallest set of tokens whose cumulative probability exceeds a threshold `p`. If `p=0.9`, it samples from the top tokens that make up 90% of the probability mass. This adapts to the situation: when the model is very certain, it might only consider a few words; when it's uncertain, it might consider many.
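
Both filters can be sketched in a few lines of plain Python. The helper names and the two example distributions are invented for illustration, and this sketch keeps tokens until the cumulative probability reaches `p`; real libraries may handle the boundary slightly differently.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


confident = [0.90, 0.05, 0.03, 0.02]  # the model is nearly sure
uncertain = [0.30, 0.25, 0.25, 0.20]  # the model is torn

print(top_k_filter(uncertain, k=2))     # always keeps exactly 2 tokens
print(top_p_filter(confident, p=0.85))  # keeps 1 token
print(top_p_filter(uncertain, p=0.85))  # keeps all 4 tokens

# Sampling then happens over the filtered distribution.
random.seed(0)
filtered = top_p_filter(uncertain, p=0.5)
next_id = random.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(next_id)
```

With the same `p`, the confident distribution is cut down to a single candidate while the uncertain one keeps every token, which is the adaptive behavior described above.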

Top-p is a common default for open-ended text generation, often combined with a moderate temperature.

In [None]:
# Generate with Top-p (Nucleus) Sampling
nucleus_output = model.generate(
    inputs.input_ids,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.92, # Use nucleus sampling
    top_k=0 # Make sure top_k is disabled
)

print(tokenizer.decode(nucleus_output[0], skip_special_tokens=True))

# Common Limitations

  * **Context Window:** Models have a finite memory. GPT-2, for example, has a context window of 1024 tokens. Anything that falls outside that window is invisible to the model.
  * **Repetition:** Models can sometimes get stuck in repetitive loops. Parameters like `repetition_penalty` can help mitigate this.
  * **Hallucination:** A model can generate text that sounds plausible but is factually incorrect or nonsensical. It has no true understanding of the world; it's a pattern-matching machine.
  * **Bias:** These models are trained on vast amounts of internet text, and they inherit the societal biases present in that data. Always be critical of the output.
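
To give a feel for how a repetition penalty can work, here is a sketch of one common formulation: the score of every token that has already been generated is made less attractive before the next token is chosen. Treat `penalize_repeats` as an illustration rather than the library's exact code; the penalty of `2.0` is exaggerated to keep the numbers easy to read.

```python
def penalize_repeats(logits, generated_ids, penalty=2.0):
    """Lower the score of every token that already appears in the generated text."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores shrink toward zero; negative scores move further down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


example_logits = [3.0, 1.0, -2.0, 0.5]  # made-up scores for a 4-token vocabulary
already_generated = [0, 2, 0]           # tokens 0 and 2 have been used before

print(penalize_repeats(example_logits, already_generated))  # [1.5, 1.0, -4.0, 0.5]
```

Tokens 1 and 3 keep their scores, so after the adjustment they compete on better terms with the tokens the model keeps returning to.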

# Stopping Mid-Sentence

Models learn to end their output naturally by predicting a special **end-of-sequence (EOS) token**.

During training on vast amounts of text, models like GPT see this token at the end of each training document (GPT-2 writes it as `<|endoftext|>`; `[EOS]` is used here as a generic name). They learn to associate the completion of a text with the prediction of this specific `[EOS]` token.

## The Generation Process

At each step of generation, the model does the following:

1.  It calculates the probability for every possible token in its vocabulary that could come next.
2.  The special `[EOS]` token is included in this vocabulary and is assigned a probability just like any other word.
3.  As the generated sentence starts to form a complete thought, the model's training makes it much more likely to predict the `[EOS]` token as the next logical item.
4.  Once the decoding strategy (like top-p sampling or greedy search) selects the `[EOS]` token, the generation process stops.

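This learned behavior is why the model can wrap up its output naturally. Our previous examples often stopped midway because of the `max_new_tokens` parameter, which acts as a hard cutoff regardless of whether the sentence was finished. Keep in mind that a base model like GPT-2 can still run all the way to that limit, since it only predicts the end-of-sequence token when the text as a whole looks finished.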

In [None]:
# Increase max_new_tokens to allow for a full sentence
natural_stop_output = model.generate(
    inputs.input_ids,
    max_new_tokens=100, # Give the model plenty of room
    do_sample=True,
    top_p=0.92,
    temperature=0.7,
    top_k=0,
    # generate() falls back to the model's configured EOS token ID by default;
    # it is passed explicitly here to make the stopping condition visible
    eos_token_id=tokenizer.eos_token_id
)

print(tokenizer.decode(natural_stop_output[0], skip_special_tokens=True))

# Text Generation versus Question Answering

At their core, models like ChatGPT and Gemini are highly advanced **text-generation systems**. They perform tasks like question-answering *by* generating text that happens to be the correct answer.

They don't switch between different "modes" like the `pipeline` tasks. Instead, their single, powerful text-generation ability has been trained to be so good at recognizing patterns that it can produce the right kind of text for almost any task you give it.

* **When you ask a question**, it generates text that is an answer.
* **When you ask for a summary**, it generates text that is a summary.
* **When you ask it to translate**, it generates text in the target language.

It's all about predicting the most appropriate next token to fulfill your request.

## Extractive vs. Generative Question Answering

* **Extractive QA** 🌭: In the earlier example, where we pulled answers out of a short blurb on the history of the hot dog, the approach was *extractive*. It was limited to finding and pulling the exact answer directly from the context you provided. It couldn't use any outside knowledge.
* **Generative QA** 🤖: ChatGPT, Claude, Gemini, and friends perform *generative* (or abstractive) question-answering. They don't search a provided text; they generate the answer from scratch based on the vast knowledge learned during their training. This is why they can answer questions on almost any topic without needing a specific context document.

## The Secret Sauce: Instruction Fine-Tuning

The reason these models are so good at following instructions—like answering questions—comes from an extra training step after the initial text-generation pre-training.

1.  **Base Model**: They start as a _massive_ decoder model (like `gpt2-medium` but orders of magnitude larger) that's been trained on a huge portion of the internet to just predict the next word.
2.  **Instruction Fine-Tuning ✅**: After this, the model is further trained on a high-quality, curated dataset of `[instruction, response]` pairs that were written by humans. This step specifically teaches the model how to be a helpful assistant—how to answer questions, follow commands, and format its output in a useful way.

So, while the fundamental mechanism is still text generation, this specialized fine-tuning is what turns a general-purpose text predictor into a powerful and helpful conversational AI.

Put differently, instruction fine-tuning teaches the model that the right continuation of a question is an answer, not more of the question itself.

Think of it like this:

  * **A base model** is like a student who has read an entire library. If you give them the start of a sentence, "The definition of photosynthesis is...," they're very good at completing it based on the patterns they've seen.
  * **An instruction-tuned model** is like that same student after they've taken a class where they practiced answering questions. They've been explicitly taught that when they see a phrase ending in a question mark, the correct response is to provide an answer.

## The Fine-Tuning Process

During instruction fine-tuning, the model is shown a large number of high-quality examples that look like this:

```json
{
  "instruction": "What is the capital of France?",
  "response": "The capital of France is Paris."
}
```

By training on this data, the model learns a powerful new pattern: when the input text is a question, the statistically likeliest "next tokens" are the ones that form a coherent answer. This process adjusts the model's internal weights so that generating an answer becomes the highest probability path.

So, the model isn't "deciding" to answer your question. It's simply that its training has made the text of an answer the most probable continuation of the text of your prompt.
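
Here is a rough sketch of how one such pair might be turned into a training example. The `### Instruction:` / `### Response:` template and the whitespace `toy_encode` tokenizer are hypothetical stand-ins, and masking the prompt tokens out of the loss is one common setup rather than the only one. The value `-100` is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

vocab = {}


def toy_encode(text):
    """Hypothetical whitespace tokenizer: assigns each new word the next free id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_example(instruction, response, encode):
    """Concatenate prompt and response; only the response tokens get a training label."""
    prompt_ids = encode(f"### Instruction:\n{instruction}\n### Response:\n")
    response_ids = encode(response)
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels


input_ids, labels = build_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
    toy_encode,
)

print(input_ids)
print(labels)
```

The model still sees the whole sequence as input, but under this setup it is only graded on how well it predicts the answer tokens, which is what nudges "question in, answer out" toward being the most probable continuation.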

# Conclusion

  * **Causal Attention:** The core mechanism that forces a model to only look at the past when generating text.
  * **Decoder Architecture:** How Masked Attention and Feed-Forward layers are stacked to create models like GPT.
  * **Autoregressive Generation:** The step-by-step process of predicting one token at a time.
  * **Decoding Strategies:** The different methods (Greedy, Temperature, Top-K, Top-p) for choosing the next token, trading off between predictability and creativity.

You are now equipped with the fundamental knowledge of how modern text generation models work under the hood. The "magic" of `pipeline("text-generation")` should now be way less magical.