---
## Setup


In [3]:
# Install required packages (uncomment if needed)
# !pip install transformers datasets torch

from transformers import pipeline, AutoTokenizer
from datasets import Dataset

---
# Section 1: What Is a Language Model?

A language model does one thing: **predict the next token** in a sequence.

Given `"The cat sat on the"`, it assigns probabilities:

| Next Word | Probability |
|-----------|-------------|
| mat       | 0.35        |
| floor     | 0.25        |
| chair     | 0.15        |
| ...       | ...         |

That is it. Every conversation, every essay, every piece of code an LLM generates comes from repeatedly choosing a next token from a probability distribution like this one (often, but not always, the single most likely token).
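
To make "assigns probabilities" concrete, here is a minimal sketch of the final step of generation. The scores in `toy_logits` are invented for illustration (they are not real model outputs), and `softmax`, `pick_greedy`, and `pick_sampled` are hypothetical helper names. It assumes the model has already produced one raw score per candidate token.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def pick_greedy(probs):
    """Always choose the single most likely token."""
    return max(probs, key=probs.get)

def pick_sampled(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Invented scores for words that might follow "The cat sat on the"
toy_logits = {"mat": 2.0, "floor": 1.66, "chair": 1.15, "moon": -1.0}
probs = softmax(toy_logits)

rng = random.Random(0)  # fixed seed so the draw is repeatable
print({tok: round(p, 2) for tok, p in probs.items()})
print("greedy :", pick_greedy(probs))
print("sampled:", pick_sampled(probs, rng))
```

Greedy picking always returns the same token for the same prompt, while sampling can return different tokens on different runs. That is one reason two runs of the same prompt may produce different text.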

Most modern LLMs use the **Transformer** architecture (from *"Attention Is All You Need"*, 2017). The key idea is an **attention mechanism** that lets each token weigh the other tokens in the sequence to determine context. This is how the model knows "bank" means something different in "river bank" vs. "bank loan."
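
The attention idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the two-dimensional token vectors are invented, the same vectors are reused as queries, keys, and values (a real Transformer first applies learned projections), and `attention` is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Invented 2-d vectors standing in for the tokens "river" and "bank"
vectors = [[1.0, 0.0], [0.6, 0.8]]
outputs, weights = attention(vectors, vectors, vectors)

for name, row in zip(["river", "bank"], weights):
    print(name, [round(w, 2) for w in row])
```

Each output vector is a blend of all the value vectors, so the representation of "bank" ends up carrying some information from "river". That blending is what lets context change a word's meaning.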

Let's see this in action.


In [7]:
# Load a small language model
generator = pipeline(
    "text-generation", 
    model="distilgpt2", 
    max_new_tokens=25,
    )

# Watch it predict
prompt = "The cat sat on the"
result = generator(prompt)
result

GPT2LMHeadModel LOAD REPORT from: distilgpt2
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
transformer.h.{0, 1, 2, 3, 4, 5}.attn.bias | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=25) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


[{'generated_text': 'The cat sat on the counter of a shopping centre, sat in the centre of a shopping centre.\n\n\n\n\n\n\n\n\n\n'}]

In [9]:
type(result)

list

In [12]:
type(result[0])

dict

In [13]:
result[0]

{'generated_text': 'The cat sat on the counter of a shopping centre, sat in the centre of a shopping centre.\n\n\n\n\n\n\n\n\n\n'}

In [None]:

print("Prompt:", prompt)
print("Model completed:", result[0]['generated_text'])

### Before You Continue — Predict

What do you think will happen if we ask this model:

> "Our clothing store's return policy allows customers to"

Write down your prediction. Then run the next cell.


---
# Section 2: The Failure

We want to build a chatbot for **BlueSky Clothing Store**. It should answer questions about store hours, return policies, shipping, and sizing.

Let's try.


In [None]:
generator = pipeline("text-generation", model="distilgpt2", max_new_tokens=50)

# Ask the model about our store
questions = [
    "What is BlueSky Clothing's return policy?",
    "What are your store hours?",
    "Do you offer free shipping on orders over $75?"
]

for q in questions:
    result = generator(q)
    print(f"Q: {q}")
    print(f"A: {result[0]['generated_text']}")
    print("-" * 60)

### Stop and Analyze

The model produced *something* for each question. But examine the outputs carefully.

**Questions you must answer before continuing:**

1. Are any of the answers factually correct about BlueSky Clothing?
2. Did the model refuse to answer, or did it generate a confident-sounding response?
3. If a customer received these answers, what would happen?

The model is not broken. It is doing exactly what it was trained to do: predict likely next tokens based on internet text. It has never seen BlueSky Clothing's data.

This is the core problem: **the model is capable, but uninformed.**

Now the question becomes: *How do we give it the right information?*

Think of at least two different approaches before reading on.


---
# Section 3: First Fix — Prompt Engineering

The cheapest possible intervention: **change the input, not the model.**

Prompt engineering means crafting your input to guide the model's behavior. No retraining. No new data pipelines. Just better instructions.

A well-structured prompt has these components:

| Component | Required? | Purpose |
|-----------|-----------|---------|
| **System prompt** | Optional (but powerful) | Sets the model's role and constraints |
| **Context** | Optional | Provides reference information |
| **User prompt** | Required | The actual question |

Most users skip the system prompt. This is a mistake: a clear system prompt often produces materially better results.
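
The three components in the table can be assembled with a small helper. This is a minimal sketch, not a standard API: `build_prompt` is a hypothetical function, and the `Context:` / `Customer question:` / `Assistant response:` labels are an assumed layout chosen to match the style used in this notebook.

```python
def build_prompt(user_prompt, system_prompt=None, context=None):
    """Assemble the optional and required components into one prompt string."""
    parts = []
    if system_prompt:
        parts.append(system_prompt.strip())      # role and constraints
    if context:
        parts.append(f"Context: {context.strip()}")  # reference information
    parts.append(f"Customer question: {user_prompt.strip()}")  # always required
    parts.append("Assistant response:")
    return "\n\n".join(parts)

prompt_text = build_prompt(
    "What is your return policy?",
    system_prompt="You are a customer service assistant for BlueSky Clothing Store.",
)
print(prompt_text)
```

Keeping the components as separate arguments makes it easy to change one of them (for example, swapping the system prompt) while holding the others fixed.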

Let's see if structuring our prompt fixes the failure from Section 2.


In [None]:
# Attempt 1: Add a system prompt
structured_prompt = (
    "You are a customer service assistant for BlueSky Clothing Store.\n"
    "You are helpful, accurate, and concise.\n\n"
    "Customer question: What is your return policy?\n"
    "Assistant response:"
)

result = generator(structured_prompt, max_new_tokens=50)
print("With system prompt:")
print(result[0]['generated_text'])

### Did It Work?

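Examine the output. The prompt now *tells* the model to act as a store assistant (and because `distilgpt2` is a small base model that was not instruction-tuned, it may not follow even that reliably). But does it know the actual return policy?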

Likely not. The system prompt changed the **tone and role**, but it cannot inject **facts the model has never seen**.

Let's try harder — few-shot prompting. We give the model examples of correct answers and hope it follows the pattern.


In [None]:
# Attempt 2: Few-shot prompting
few_shot_prompt = (
    "Customer: What are your store hours?\n"
    "Assistant: We are open Monday-Friday 9am-6pm, Saturday 10am-4pm.\n\n"
    "Customer: Do you offer gift wrapping?\n"
    "Assistant: Yes, complimentary gift wrapping on purchases over $25.\n\n"
    "Customer: What is your return policy?\n"
    "Assistant:"
)

result = generator(few_shot_prompt, max_new_tokens=40)
print("With few-shot examples:")
print(result[0]['generated_text'])
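
Writing few-shot prompts by hand gets error-prone as the number of examples grows. One possible way to generate the same prompt from data is sketched below; `build_few_shot_prompt` is a hypothetical helper, and the `Customer:` / `Assistant:` labels simply mirror the cell above.

```python
def build_few_shot_prompt(examples, question):
    """Render (question, answer) pairs, then the new question left open."""
    blocks = [f"Customer: {q}\nAssistant: {a}" for q, a in examples]
    blocks.append(f"Customer: {question}\nAssistant:")
    return "\n\n".join(blocks)

examples = [
    ("What are your store hours?",
     "We are open Monday-Friday 9am-6pm, Saturday 10am-4pm."),
    ("Do you offer gift wrapping?",
     "Yes, complimentary gift wrapping on purchases over $25."),
]

rebuilt_prompt = build_few_shot_prompt(examples, "What is your return policy?")
print(rebuilt_prompt)
```

With the examples held in a list, experiments such as reordering them or adding contradictory ones become a one-line change.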

### Progress Check

Few-shot prompting is better. The model follows the format and may even generate a plausible-sounding return policy. But here is the critical question:

**Is the return policy it generated actually BlueSky's policy?**

If the real policy is "30 days with receipt, sale items final sale," and the model generates "14 days, no exceptions" — that is worse than no answer at all. The customer receives a confident, authoritative, *wrong* answer.

**What prompt engineering can fix:**
- Tone, format, role, style
- Output structure (JSON, bullet points, etc.)
- Reasoning approach (chain-of-thought)

**What prompt engineering cannot fix:**
- Missing factual knowledge
- Domain-specific details the model was never trained on

We need a different approach for the knowledge problem.

### Try It

1. Write three different system prompts for the same question. How does tone change? Does accuracy change?
2. What happens if your few-shot examples contradict each other? Try it.


---
# Section 4: Second Fix — Retrieval Augmented Generation (RAG)

Instead of hoping the model knows the answer, **give it the answer as part of the prompt.**

RAG works in two steps:
1. **Retrieve** relevant information from your knowledge base
2. **Generate** a response using that information as context

The model's weights never change. You are changing the *input*, not the *model*.


In [None]:
# Step 1: Our knowledge base (in production, this would be a vector database)
knowledge_base = {
    "return_policy": (
        "Items can be returned within 30 days with a receipt for a full refund. "
        "Sale items are final sale. Gift cards are non-refundable."
    ),
    "store_hours": (
        "Monday-Friday 9am-6pm, Saturday 10am-4pm, closed Sunday."
    ),
    "shipping": (
        "Free shipping on orders over $75. Standard shipping: 5-7 business days. "
        "Express 2-day shipping available for $12.99."
    ),
    "sizing": (
        "We carry sizes XS through 3XL. Size chart at bluesky.com/sizes. "
        "In-store fittings available by appointment."
    ),
}

# Step 2: Simple retrieval (keyword matching — production uses embeddings)
def retrieve(question, kb):
    """Return the entry whose key words appear most often in the question."""
    question_lower = question.lower()
    best_key, best_score = None, 0
    for key in kb:
        # "return_policy" -> ["return", "policy"]
        keywords = key.replace("_", " ").split()
        # Substring match, so "return" also matches "returned"
        score = sum(1 for kw in keywords if kw in question_lower)
        if score > best_score:
            best_score = score
            best_key = key
    return kb[best_key] if best_key else "No relevant information found."

# Step 3: Build the RAG prompt
question = "What is your return policy?"
context = retrieve(question, knowledge_base)

rag_prompt = (
    "Use ONLY the following context to answer the question.\n"
    "If the context does not contain the answer, say \"I don't have that information.\"\n\n"
    f"Context: {context}\n\n"
    f"Customer question: {question}\n"
    "Assistant response:"
)

print("Retrieved context:", context)
print()
print("Full prompt sent to model:")
print(rag_prompt)
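
The keyword matcher above only looks at the topic names. As a rough stand-in for embedding-based retrieval, the sketch below scores each document by cosine similarity between word-count vectors and returns the top `k` matches. The assumptions are significant: word counts are not real embeddings, `embed`, `cosine`, and `retrieve_top_k` are hypothetical helper names, and `toy_docs` is a shortened copy of the knowledge base.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': lowercase word counts (real systems use learned vectors)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(question, docs, k=2):
    """Return up to k (score, key) pairs, best match first, dropping zero scores."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(f"{key} {text}")), key) for key, text in docs.items()]
    scored.sort(reverse=True)
    return [(score, key) for score, key in scored[:k] if score > 0]

toy_docs = {
    "return_policy": "Items can be returned within 30 days with a receipt for a full refund.",
    "store_hours": "Monday-Friday 9am-6pm, Saturday 10am-4pm, closed Sunday.",
    "shipping": "Free shipping on orders over $75. Standard shipping: 5-7 business days.",
}

for score, key in retrieve_top_k("How many days does standard shipping take?", toy_docs):
    print(f"{key}: {score:.2f}")
```

Returning several scored matches instead of a single best key lets the prompt include more than one passage, which matters for questions that touch two topics.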

### What Just Happened?

Compare this to every previous attempt:

| Approach | Model weights changed? | Knows BlueSky facts? | Risk of hallucination |
|----------|----------------------|---------------------|-----------------------|
| Bare prompt | No | No | High |
| System prompt | No | No | High |
| Few-shot | No | Only from examples | Medium |
| **RAG** | **No** | **Yes — from retrieved docs** | **Low** |

RAG is powerful because:
- The model gets the **exact** information it needs for each query
- Your knowledge base can be **updated instantly** without retraining
- You can **verify** answers against source documents
- It is usually **much cheaper** than fine-tuning, because no training run is needed

### Try It

1. Add a new topic to `knowledge_base` (e.g., loyalty program). Query it.
2. Ask a question that spans two topics: *"Can I return a shipped item?"* What happens? How would you fix this?
3. What happens if the context is very long — say, a 50-page document? What are the limits?


---
# Section 5: What's Still Missing?

We have solved the knowledge problem with RAG. But look at this output from a hypothetical well-built chatbot:

---

**Customer:** *Can I return the blue silk dress I bought last week? I lost the receipt though.*

**Ideal Response:** *I'm sorry to hear you'd like to return the dress! While our standard policy requires a receipt for full refunds, we can look up your purchase using your credit card or loyalty account. If we find the transaction, we can process a return within our 30-day window. Would you like to visit us in store so we can check? We're open Monday-Friday 9am-6pm.*

---

### Reverse Engineer This Response

That response demonstrates several things at once. Identify which technique produced each:

1. **Accurate policy details** (30-day window, receipt requirement) — Where did this come from?
2. **Empathetic, on-brand tone** — Where did this come from?
3. **Proactive problem-solving** (credit card lookup alternative) — Where did this come from?
4. **Cross-referencing store hours** — Where did this come from?

**The hard question:** Could prompt engineering + RAG produce this response? Or does the model need to *learn* BlueSky's specific communication style?

Think about this carefully. Not every gap requires the same solution.

- If the tone is wrong: better system prompt
- If the facts are wrong: better retrieval
- If the *style* is consistently off despite good prompts: maybe the model needs to learn new behavior

This is where fine-tuning enters the conversation — not as a first resort, but as an answer to the question: **"What problems remain after we have tried everything cheaper?"**


---
# Section 6: Fine-Tuning — Adjusting the Model Itself

### What Fine-Tuning Actually Is

Fine-tuning means taking a **pretrained model** and continuing its training on a **smaller, task-specific dataset**. Unlike prompt engineering and RAG, this changes the model's internal weights.

Analogy:
- **Pretraining** = General university education
- **Fine-tuning** = Professional specialization

A critical point from practice: fine-tuning works best when the model has **some prior exposure** to similar data. You are reminding the model of patterns it partially learned during pretraining — like reinforcing a distant memory. If the model has never encountered anything like your domain, fine-tuning will struggle.


### 6.1 Supervised Fine-Tuning (SFT)

You provide (instruction, expected response) pairs. The model's weights are updated to minimize the gap between its outputs and your expected responses.
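
The "gap" is typically measured with a cross-entropy loss: the lower the probability the model assigns to each token of the expected response, the higher the loss. The sketch below shows only that calculation. The probabilities are invented for illustration, and `sequence_nll` is a hypothetical helper, not part of any training library.

```python
import math

def sequence_nll(token_probs):
    """Average negative log-likelihood of the expected response tokens.

    token_probs: the probability the model gave to each correct token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Invented probabilities for the four tokens of one expected response
before_training = [0.05, 0.10, 0.02, 0.20]
after_training = [0.60, 0.75, 0.40, 0.90]

print(f"loss before: {sequence_nll(before_training):.3f}")
print(f"loss after:  {sequence_nll(after_training):.3f}")
```

Training adjusts the weights so that this number goes down across the whole dataset, which means the expected responses become more likely under the model.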


In [None]:
# What SFT training data looks like
sft_data = [
    {"instruction": "What is your return policy?",
     "response": "Items can be returned within 30 days with receipt for a full refund. "
                 "Sale items are final sale. We're happy to help with exchanges too!"},
    {"instruction": "A customer is upset about a late shipment. Respond helpfully.",
     "response": "I completely understand your frustration, and I sincerely apologize "
                 "for the delay. Let me look into your order right away and see what "
                 "we can do to make this right."},
    {"instruction": "What are your store hours?",
     "response": "We're open Monday through Friday 9am to 6pm, and Saturday 10am to 4pm. "
                 "We'd love to see you! If those times don't work, you can also shop "
                 "online 24/7 at bluesky.com."},
]

# Notice: these responses encode both FACTS and STYLE.
# The facts could come from RAG. The style is what SFT teaches.
for entry in sft_data:
    print(f"Instruction: {entry['instruction']}")
    print(f"Response:    {entry['response'][:80]}...")
    print()

### 6.2 RLHF — Reinforcement Learning from Human Feedback

Instead of one correct answer, the model generates multiple responses and **humans rank them**. A reward model learns these preferences, and the language model is trained to maximize the reward.

You may have seen the data-collection side of this in ChatGPT, which sometimes shows two answers and asks you to pick the better one. Comparisons like that are the kind of human preference data that RLHF relies on.
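
A common way to train the reward model on ranked pairs is a pairwise (Bradley-Terry style) loss: the loss is small when the human-preferred response receives the higher reward score. The sketch below shows that loss for one pair. The reward values are invented, and `pairwise_preference_loss` is a hypothetical name; a real reward model would produce the scores from text.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a stable form."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Invented reward scores for a (preferred, rejected) pair of responses
print(f"agrees with humans:    {pairwise_preference_loss(2.0, 0.0):.3f}")
print(f"no preference learned: {pairwise_preference_loss(0.0, 0.0):.3f}")
print(f"disagrees with humans: {pairwise_preference_loss(0.0, 2.0):.3f}")
```

Once the reward model scores responses the way humans would, the language model is optimized (typically with a reinforcement learning algorithm such as PPO) to produce responses that score highly.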

### 6.3 DPO — Direct Preference Optimization

A newer alternative that skips the separate reward model. It directly optimizes using (preferred response, rejected response) pairs, simplifying the pipeline.
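
For a single preference pair, the DPO objective can be sketched as below. It compares how much the model being trained (the policy) has shifted probability toward the preferred response, relative to a frozen reference copy. The log-probabilities are invented, `dpo_loss` is a hypothetical helper, and `beta=0.1` is an illustrative value rather than a recommendation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (preferred, rejected) pair of responses."""
    # How far the policy moved from the reference on each response
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written in a stable form
    return math.log(1.0 + math.exp(-margin))

# Invented sequence log-probabilities
untrained = dpo_loss(-12.0, -11.0, -12.0, -11.0)  # policy identical to reference
improved = dpo_loss(-9.0, -14.0, -12.0, -11.0)    # policy now favors the preferred reply

print(f"policy == reference: {untrained:.3f}")
print(f"after preferring B:  {improved:.3f}")
```

No reward model appears anywhere in the calculation, which is the simplification the technique is named for.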

### When to Use Which

| Technique | Input Data | Complexity | Best For |
|-----------|-----------|------------|----------|
| **SFT** | Instruction-response pairs | Low | Clear correct answers exist |
| **RLHF** | Human rankings of outputs | High | Subjective quality (tone, helpfulness) |
| **DPO** | Preferred vs. rejected pairs | Medium | RLHF benefits without RL complexity |


### Try It — Reverse Reasoning

Here are two responses to *"A customer wants to return a worn item."*

**Response A:** "Per our policy, worn items cannot be returned."

**Response B:** "I understand this is disappointing. Unfortunately, worn items fall outside our return policy. However, I'd be happy to help you find a replacement or explore our exchange options."

1. Both are factually correct. What makes B better?
2. Which fine-tuning technique would you use to get the model to consistently prefer B's style?
3. Could you achieve this with prompt engineering alone? What would be the tradeoff?


---
# Section 7: The Cost Reality

Fine-tuning is powerful. It is also expensive. Before committing, do the math.

| Cost Factor | What It Means |
|-------------|---------------|
| **Training compute** | GPU hours to run fine-tuning |
| **Data preparation** | Cleaning, formatting, labeling — often the biggest time cost |
| **Per-token training** | Services like OpenAI charge per token to fine-tune |
| **Inference costs** | Running a custom model often costs more per token than standard API calls |
| **Hosting** | Self-hosting requires infrastructure |
| **Maintenance** | Domain changes require retraining |


In [None]:
# Cost estimation exercise
# Training costs (approximate — check current rates)
training_tokens = 1_000_000
cost_per_1k_train = 0.008  # example rate

training_cost = (training_tokens / 1000) * cost_per_1k_train

# Inference costs (ongoing)
daily_queries = 500
tokens_per_query = 200
cost_per_1k_inference = 0.012

monthly_tokens = daily_queries * 30 * tokens_per_query
monthly_inference = (monthly_tokens / 1000) * cost_per_1k_inference
annual_inference = monthly_inference * 12

print(f"One-time training cost:  ${training_cost:.2f}")
print(f"Monthly inference cost:  ${monthly_inference:.2f}")
print(f"Annual inference cost:   ${annual_inference:.2f}")
print(f"Total year-1 cost:       ${training_cost + annual_inference:.2f}")
print()

# Compare: what if RAG + prompt engineering is enough?
rag_cost_per_1k = 0.002  # standard model, no fine-tuning
rag_monthly = (monthly_tokens / 1000) * rag_cost_per_1k
print(f"RAG-only monthly cost:   ${rag_monthly:.2f}")
print(f"RAG-only annual cost:    ${rag_monthly * 12:.2f}")
print(f"\nFine-tuned inference costs {annual_inference / (rag_monthly * 12):.1f}x as much per year as RAG-only inference.")

### Try It

1. Change `daily_queries` to 5,000. How does the cost gap change?
2. At what query volume does annual cost exceed hiring a part-time employee ($25,000/year)?
3. If store policies change quarterly, what is the hidden cost of retraining each time?


---
# Section 8: Decision Framework

### Comparison

| | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| **Cost** | Very low | Low-Medium | High |
| **Setup time** | Minutes | Days | Weeks |
| **Data needed** | None | Documents | Labeled pairs (100s-1000s) |
| **Updates** | Edit prompt | Update doc store | Retrain |
| **Solves** | Format, tone, role | Knowledge gaps | Behavior, style, niche tasks |
| **Limits** | Cannot inject facts | Retrieval quality | Expensive, can overfit |

### The Decision Path

1. **Start with prompt engineering.** It is fast, cheap, and reversible. Does it solve the problem? Stop here.
2. **Add RAG if the model needs domain knowledge.** Facts, documents, policies — retrieve them at query time.
3. **Consider fine-tuning only if** the model needs to learn a fundamentally new behavior, style, or capability that prompting cannot capture consistently.
4. **In practice, combine approaches.** The best production systems often use all three: fine-tuning for style, RAG for knowledge, prompt engineering for structure.
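
The decision path can be written down as a small function. This is a deliberately simplified sketch: `recommend_approaches` and its three yes/no inputs are hypothetical, and real decisions also depend on cost, data availability, and update frequency.

```python
def recommend_approaches(needs_domain_facts, prompt_alone_is_enough, needs_new_behavior):
    """Walk the decision path and return the techniques to combine, cheapest first."""
    approaches = ["prompt engineering"]  # step 1: always the starting point
    if prompt_alone_is_enough:
        return approaches                # problem solved, stop here
    if needs_domain_facts:
        approaches.append("RAG")         # step 2: retrieve facts at query time
    if needs_new_behavior:
        approaches.append("fine-tuning")  # step 3: only for learned behavior or style
    return approaches

# A store chatbot that needs policy facts but no special voice
print(recommend_approaches(needs_domain_facts=True,
                           prompt_alone_is_enough=False,
                           needs_new_behavior=False))
```

The function never returns fine-tuning on its own: prompt engineering is always part of the answer, which matches step 4 of the path.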


In [None]:
# Scenario analysis — test your reasoning
scenarios = [
    ("A startup needs a bot for their 50-page FAQ.",
     "RAG", "Knowledge is in documents, changes often, limited resources."),

    ("A law firm wants briefs written in their specific house style.",
     "Fine-tuning (SFT)", "Consistent style is learned behavior, not easily prompted."),

    ("A developer wants an LLM to always respond in JSON.",
     "Prompt Engineering", "Formatting instruction handled by a clear system prompt."),

    ("A hospital needs an LLM referencing latest drug interaction data.",
     "RAG", "Medical data updates frequently. A fine-tuned model goes out of date quickly."),

    ("A company wants its chatbot to sound casual, not corporate.",
     "Fine-tuning or Prompt Eng.", "Try prompting first. Fine-tune if tone is inconsistent."),
]

print("SCENARIO ANALYSIS")
print("=" * 70)
for scenario, approach, reasoning in scenarios:
    print(f"\nScenario: {scenario}")
    print(f"  Best approach: {approach}")
    print(f"  Reasoning: {reasoning}")

---
# Section 9: Troubleshooting

| Problem | Likely Cause | Solution |
|---------|-------------|----------|
| Model confidently gives wrong facts | Knowledge gap | Add RAG with verified source documents |
| Fine-tuned model is worse than base | Too little data or overfitting | More diverse examples; add validation set |
| Prompt engineering gives inconsistent results | Prompt is vague | Add few-shot examples; tighten constraints |
| RAG retrieves wrong documents | Weak retrieval or bad chunking | Upgrade embedding model; improve chunking |
| Fine-tuning too expensive for the value | Teaching knowledge instead of behavior | Switch to RAG for facts; fine-tune only for style |

### Debugging Mindset — Work Backwards From the Output

When quality is poor:

1. **Start from the output.** What specifically is wrong?
2. **Is it a fact problem?** Check retrieval. Is the right context being found?
3. **Is it a format or tone problem?** Revise the system prompt. Test with a hand-crafted "perfect" prompt.
4. **Is it a data problem?** Review training data quality and coverage.
5. **Is it a model capacity problem?** Try a larger model before adding pipeline complexity.


---
# Section 10 (Bonus): Preparing a Fine-Tuning Dataset with Hugging Face

If you decide fine-tuning is justified, here is how to prepare your data. We will not run actual training here (it is slow without a GPU), but data preparation is often the hardest and most important part.


In [None]:
from datasets import Dataset
from transformers import AutoTokenizer

# Step 1: Raw training data
raw_data = [
    {"instruction": "What are your store hours?",
     "response": "We're open Monday-Friday 9am-6pm, Saturday 10am-4pm."},
    {"instruction": "Do you offer free shipping?",
     "response": "Yes, free shipping on all orders over $75."},
    {"instruction": "What is your return policy?",
     "response": "Items can be returned within 30 days with a receipt."},
    {"instruction": "Do you have a loyalty program?",
     "response": "Yes, earn 1 point per dollar. 100 points = $10 off."},
    {"instruction": "Can I buy gift cards?",
     "response": "Gift cards are available in $25, $50, and $100 amounts."},
]

# Step 2: Format into training template
def format_example(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    }

dataset = Dataset.from_list(raw_data)
formatted = dataset.map(format_example)

print("Formatted example:")
print(formatted[0]['text'])
print()

# Step 3: Tokenize
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token

tokenized = formatted.map(
    lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=128),
    batched=True
)

print(f"Dataset size: {len(tokenized)}")
print(f"Token count (first example): {sum(1 for t in tokenized[0]['input_ids'] if t != tokenizer.pad_token_id)}")

### What Comes Next in Production

After data prep, training uses Hugging Face's `Trainer`:

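```python
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Load the base model that matches the tokenizer used above
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
# With mlm=False the collator builds causal-LM labels from input_ids
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=tokenized, data_collator=data_collator)
trainer.train()
```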

Training at any realistic scale needs a GPU. For parameter-efficient alternatives, look into **LoRA** and **QLoRA**, which can cut training memory and cost substantially by freezing the original weights and training only a small set of added low-rank parameters.
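
To see why LoRA trains so few weights, compare the size of one full weight matrix with the two low-rank matrices LoRA adds alongside it. The sketch below is plain arithmetic, not a training recipe. The matrix sizes and rank are illustrative choices, and `full_params` and `lora_trainable_params` are hypothetical helper names.

```python
def full_params(d_out, d_in):
    """Weights in one dense layer of shape (d_out, d_in)."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Weights in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 768, 768, 8  # illustrative sizes
full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)

print(f"full matrix: {full:,} weights")
print(f"LoRA update: {lora:,} weights ({lora / full:.1%} of the full matrix)")
```

The savings grow with the matrix size, since the full count scales with `d_out * d_in` while the LoRA count scales with `d_out + d_in`.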


---
# Section 11: Capstone Exercise

Pick ONE scenario:

**A.** A legal firm needs a chatbot for their 200-page employee handbook.

**B.** A clothing brand wants an AI assistant that matches their playful, casual voice.

**C.** An internal IT helpdesk wants to automate the 50 most common support questions.

### Answer:

1. **Which approach(es)?** Prompt engineering, RAG, fine-tuning, or combination?
2. **Why?** Justify with cost, complexity, data needs, update frequency.
3. **What data do you need?** Format and approximate size.
4. **What could go wrong?** Two risks and mitigation strategies.
5. **How do you measure success?** Define 2-3 metrics.


---
# Summary

### The Path We Followed

1. A language model **predicts the next token** — nothing more.
2. We watched it **fail** on domain-specific questions — not from incapability, but from missing knowledge.
3. **Prompt engineering** fixed tone and format, but could not inject facts.
4. **RAG** solved the knowledge problem by retrieving information at query time.
5. We asked: **what's still missing?** Only then did fine-tuning become relevant — for behavior and style that prompting cannot capture.
6. We confronted the **cost reality** — fine-tuning is powerful but expensive, and often unnecessary.

### Key Principles

- Most problems are **knowledge gaps**, not capability gaps. Reach for RAG before fine-tuning.
- Fine-tuning is a **last resort**, not a first instinct.
- Fine-tuning reinforces **patterns the model has partially seen** — it is not teaching from scratch.
- The best systems **combine approaches**: RAG for facts, prompting for structure, fine-tuning for style.

### Next Steps

- **Embeddings and vector databases** — Build production RAG (FAISS, ChromaDB, Pinecone)
- **LoRA and QLoRA** — Parameter-efficient fine-tuning
- **Agents** — LLMs combined with tools (search, code execution, APIs)
- **DPO paper** — [arxiv.org/abs/2305.18290](https://arxiv.org/abs/2305.18290)
- **Hugging Face model hub** — [huggingface.co/models](https://huggingface.co/models)
