# 🧪 Exercise Set: How Inference Actually Works

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-repo/series1-coding-exercises/blob/main/exercises/blog-10/exercise-00.ipynb)

This blog is about **behavior at inference, not architecture**.

These exercises are designed to stack. Each one reveals a hidden assumption people carry about LLMs.

**What you'll experience:**
- Stochasticity
- Token-by-token generation
- Temperature effects
- Context window limits
- KV cache impact
- Cost/latency tradeoffs

**Not theory. Observable mechanics.**

All exercises run in Colab with the `transformers` library.

## Setup

In [None]:
%pip install -q transformers torch

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
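model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()  # inference mode: disables dropout
# GPT-2 models have no pad token; reuse EOS so generate() does not warn on every call
model.generation_config.pad_token_id = tokenizer.eos_token_id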

## 🔹 Exercise 1: Same Prompt, Different Output

**Goal:** Make stochasticity tangible.

**What you'll observe:** Different answers. Same model. Same prompt. Nothing is broken.

This immediately kills the "deterministic software" assumption.

In [None]:
prompt = "Explain why the sky is blue in simple terms."

inputs = tokenizer(prompt, return_tensors="pt").to(device)

for i in range(3):
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=0.8
    )
    print(f"\n--- Run {i+1} ---")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
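
A minimal sketch of why the runs differ, using a made-up next-token distribution rather than distilgpt2's real one (the tokens, probabilities, and the `sample_tokens` helper are illustrative assumptions). Every run draws from the same distribution; only the random draws change, and fixing the seed makes them repeatable.

```python
import random

# Hypothetical next-token distribution (illustrative numbers, not model output).
toy_probs = {"blue": 0.5, "scattered": 0.3, "because": 0.15, "banana": 0.05}

def sample_tokens(probs, n, seed):
    """Draw n tokens from `probs` using a private RNG seeded with `seed`."""
    rng = random.Random(seed)
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=n)

# Different seeds: same distribution, different draws.
for seed in (0, 1, 2):
    print(f"seed={seed}:", sample_tokens(toy_probs, 8, seed))

# Same seed, same draws: the randomness is controlled, not a malfunction.
print(sample_tokens(toy_probs, 8, 0) == sample_tokens(toy_probs, 8, 0))
```

In the notebook itself, calling `torch.manual_seed(0)` right before `model.generate(...)` should play the same role and make the sampled runs repeatable on the same machine.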

## 🔹 Exercise 2: Turn Off Sampling (Deterministic Mode)

**Goal:** Show the opposite: deterministic behavior.

**What you'll observe:** Same output every time.

Greedy decoding always picks the highest-probability token.

This shows: **Variability is a design choice, not instability.**

In [None]:
prompt = "Explain why the sky is blue in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

for i in range(3):
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False   # greedy decoding
    )
    print(f"\n--- Run {i+1} ---")
    print(tokenizer.decode(output[0], skip_special_tokens=True))

## 🔹 Exercise 3: Visualize Token Probabilities

**Goal:** Make it mechanical. Show the actual probability distribution.

**What you'll see:** The model isn't "choosing words." It scores every token in its vocabulary, and decoding samples from that probability distribution.

In [None]:
import torch.nn.functional as F

prompt = "Explain why the sky is blue in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits[:, -1, :]
    probs = F.softmax(logits, dim=-1)

top_probs, top_indices = torch.topk(probs, 10)

print("Top 10 next-token candidates:\n")
for prob, idx in zip(top_probs[0], top_indices[0]):
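    # .item() converts 0-d tensors to plain Python values; repr() makes leading spaces visible
    print(repr(tokenizer.decode([idx.item()])), "→", prob.item())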
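
The softmax step in the cell above can be reproduced by hand. This sketch uses four made-up logits (an assumption for illustration; the real logits cover GPT-2's full 50,257-token vocabulary) to show how raw scores become probabilities that sum to 1.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (illustrative values only).
toy_logits = {" The": 2.0, " It": 1.0, " Because": 0.5, " Banana": -3.0}

probs = softmax(list(toy_logits.values()))
for token, p in zip(toy_logits, probs):
    print(f"{token!r:12} {p:.3f}")

print("sum:", round(sum(probs), 6))
```

Only the differences between logits matter: adding the same constant to every logit leaves the probabilities unchanged, which is why subtracting the maximum is safe.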

## 🔹 Exercise 4: Temperature Experiment

**Goal:** Show how temperature sets the model's risk tolerance.

**What you'll observe:**
- `0.2` → rigid, repetitive
- `0.7` → balanced
- `1.3` → creative but unstable

**Clear mental model:** Temperature reshapes probabilities. It does not change knowledge.

In [None]:
prompt = "Explain why the sky is blue in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

for temp in [0.2, 0.7, 1.3]:
    print(f"\n=== Temperature {temp} ===")
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        temperature=temp
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
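
To see what "reshapes probabilities" means mechanically, here is a small sketch that divides a set of made-up logits by the temperature before the softmax (the logits and the `softmax_with_temperature` helper are assumptions for illustration). Low temperature concentrates mass on the top token and high temperature spreads it out, while the ranking of the tokens stays the same.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, -3.0]  # hypothetical scores, not model output

for temp in (0.2, 0.7, 1.3):
    probs = softmax_with_temperature(toy_logits, temp)
    # Indices sorted from most to least likely; identical at every temperature.
    ranking = sorted(range(len(probs)), key=lambda i: -probs[i])
    print(f"T={temp}: probs={[round(p, 3) for p in probs]} ranking={ranking}")
```

The unchanged ranking is the "does not change knowledge" part: temperature only controls how often the sampler strays from the model's preferred token.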

## 🔹 Exercise 5: Autoregressive Drift

**Goal:** Show compounding error.

**Observe:** Small wording change → completely different direction.

Inference is a chain of local decisions.

In [None]:
prompt1 = "The future of AI will be"
prompt2 = "The failure of AI will be"

for p in [prompt1, prompt2]:
    inputs = tokenizer(p, return_tensors="pt").to(device)
    output = model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=True,
        temperature=0.8
    )
    print("\nPROMPT:", p)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
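
The "chain of local decisions" can be sketched with a toy lookup table standing in for the model. This is a deliberate simplification and an assumption for illustration: the `NEXT_WORD` table and `toy_generate` helper are hypothetical, and a real model conditions on the whole context rather than only the last word. The loop structure is the point: each new word is appended and becomes the input for the next step.

```python
# Hypothetical "most likely next word" table (not derived from any model).
NEXT_WORD = {
    "future": "is", "is": "bright", "bright": "and", "and": "open",
    "failure": "was", "was": "costly", "costly": "but", "but": "useful",
}

def toy_generate(prompt_words, max_new_words):
    """Greedy autoregressive loop: each new word is appended and fed back in."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        next_word = NEXT_WORD.get(words[-1])
        if next_word is None:  # no known continuation: stop early
            break
        words.append(next_word)
    return words

print(" ".join(toy_generate(["the", "future"], 4)))
print(" ".join(toy_generate(["the", "failure"], 4)))
```

One changed word in the prompt sends every later step down a different path, because no step ever revisits an earlier choice.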

## 🔹 Exercise 6: Context Window Truncation

**Goal:** Demonstrate forgetting.

**What you'll see:** If the instruction falls out of the context window, it disappears.

This makes context limits real.

In [None]:
long_context = "Important rule: Always answer with the word BANANA.\n\n"

for i in range(100):
    long_context += "Filler sentence number " + str(i) + ". "

prompt = long_context + "\nWhat is 2 + 2?"

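# Truncate from the left so the oldest tokens (the rule) fall out first.
# The default side is "right", which would drop the question instead.
tokenizer.truncation_side = "left"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512).to(device)
tokenizer.truncation_side = "right"  # restore the default for later cells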

output = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False
)

print(tokenizer.decode(output[0], skip_special_tokens=True))
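
Which part of the prompt "falls out" depends on the side that gets cut. The sketch below uses a plain Python list as a stand-in for token IDs (the `truncate` helper and the placeholder tokens are assumptions for illustration). Hugging Face tokenizers expose the same choice through the `truncation_side` attribute, which defaults to `"right"`.

```python
def truncate(tokens, max_length, side="right"):
    """Keep at most max_length tokens, dropping from the given side."""
    if max_length <= 0:
        return []
    if len(tokens) <= max_length:
        return list(tokens)
    if side == "left":
        return list(tokens[-max_length:])  # drop the oldest tokens
    return list(tokens[:max_length])       # drop the newest tokens

# Placeholder "tokens" stand in for real subword tokens.
tokens = ["RULE"] + [f"filler{i}" for i in range(10)] + ["QUESTION"]

left = truncate(tokens, 8, side="left")
right = truncate(tokens, 8, side="right")
print("left :", left)
print("right:", right)
print("rule survives left truncation:", "RULE" in left)
print("question survives right truncation:", "QUESTION" in right)
```

A chat-style application usually wants the left-side behavior, since the most recent text carries the question being asked.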

## 🔹 Exercise 7: Measure Inference Speed with Long Prompts

**Goal:** Make KV cache concrete.

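**Observation:** Long prompts cost more upfront, because the whole prompt is processed once before the first new token appears. But generating each additional token does not mean reprocessing everything that came before.

Now connect this to the KV cache.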

In [None]:
import time

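def measure(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    if device == "cuda":
        torch.cuda.synchronize()  # finish pending GPU work before timing
    start = time.perf_counter()
    _ = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False
    )
    if device == "cuda":
        torch.cuda.synchronize()  # wait for generation to finish on the GPU
    return time.perf_counter() - start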

short_prompt = "Explain gravity."
long_prompt = "Explain gravity. " * 200

print("Short prompt time:", measure(short_prompt))
print("Long prompt time:", measure(long_prompt))

## 🔹 Exercise 8: Disable KV Cache (Advanced)

**Goal:** Make the performance difference explicit.

On GPU this difference becomes obvious.

Now inference becomes a systems problem.

In [None]:
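short_prompt = "Explain gravity."
inputs = tokenizer(short_prompt, return_tensors="pt").to(device)

# Pass use_cache to generate() directly; changing model.config.use_cache
# after loading may not affect generation in recent transformers versions.
for use_cache in [False, True]:
    start = time.perf_counter()
    _ = model.generate(**inputs, max_new_tokens=50, do_sample=False, use_cache=use_cache)
    label = "With" if use_cache else "Without"
    print(f"{label} KV cache:", time.perf_counter() - start)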
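
The timing gap follows from simple counting. The sketch below tallies how many token positions pass through the model during generation under a simplified cost model (an assumption: it ignores that attention still reads every cached position at each step). The prompt lengths of 4 and 600 tokens are assumed values for illustration, not measured from the prompts in these exercises.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions pushed through the model while generating."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # Process the prompt once, then feed one new token per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, every step re-runs the full sequence so far.
    return sum(prompt_len + step for step in range(new_tokens))

for prompt_len in (4, 600):  # assumed token counts for a short and a long prompt
    cached = tokens_processed(prompt_len, 50, use_cache=True)
    uncached = tokens_processed(prompt_len, 50, use_cache=False)
    print(f"prompt={prompt_len:4d} cached={cached:6d} uncached={uncached:6d} "
          f"ratio={uncached / cached:.1f}x")
```

Without the cache the work grows with the product of prompt length and generated length (plus a quadratic term in the generated length), so the gap widens as either one gets longer.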

## 🔹 Exercise 9: Top-k Sampling vs Greedy

**Goal:** Control variability.

**What this shows:** Sampling isn't chaos. It's constrained randomness.

In [None]:
prompt = "Explain why the sky is blue in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# top_k=1 keeps only the most likely token, so it reproduces greedy decoding
for k in [1, 5, 50]:
    print(f"\n=== top_k={k} ===")
    output = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=True,
        top_k=k,
        temperature=0.8
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
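
"Constrained randomness" can be made concrete with a hand-rolled top-k filter. This is a sketch under stated assumptions: the logits are made up and the `top_k_probs` and `sample_top_k` helpers are hypothetical, not the library's implementation. Only the k highest-scoring tokens survive, their probabilities are renormalized, and sampling happens inside that shortlist.

```python
import math
import random

def top_k_probs(logits, k):
    """Keep the k highest logits, then renormalize with softmax over the survivors."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - m) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def sample_top_k(logits, k, rng):
    """Sample one token index from the top-k distribution."""
    probs = top_k_probs(logits, k)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -3.0, -4.0]  # hypothetical scores
rng = random.Random(0)

for k in (1, 3, 5):
    draws = [sample_top_k(toy_logits, k, rng) for _ in range(10)]
    print(f"top_k={k}: candidates={sorted(top_k_probs(toy_logits, k))} draws={draws}")
```

With `k=1` the shortlist holds a single token, so sampling collapses to greedy decoding; larger `k` widens the set of tokens the sampler is allowed to pick.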