# Lab 3: Generative AI with Ollama - SOLUTIONS

**Duration:** 90-120 minutes | **Difficulty:** Intermediate to Advanced

## Learning Objectives

By the end of this lab, you will be able to:
1. Connect to and use Ollama for local LLM inference
2. Generate text using the Llama model
3. Apply prompt engineering techniques for better results
4. Control generation with temperature and other parameters
5. Build multi-turn conversations with chat history
6. Implement Retrieval-Augmented Generation (RAG)
7. Understand fine-tuning concepts with LoRA and QLoRA

## Setup

In [None]:
import ollama
import json
import numpy as np
from typing import List, Dict

# Check connection to Ollama
try:
    models = ollama.list()
    print("Connected to Ollama!")
    print("\nAvailable models:")
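    for model in models.get('models', []):
        # Newer ollama-python releases expose the tag as 'model'; older ones used 'name'
        print(f"  - {model.get('model') or model.get('name')}")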
except Exception as e:
    print(f"Error connecting to Ollama: {e}")
    print("\nMake sure Ollama is running: ollama serve")

---
# Part 1: Basic Text Generation

In [None]:
# Basic text generation
response = ollama.generate(
    model='llama3.1:8b-instruct-q4_K_M',
    prompt='What is machine learning in one sentence?'
)
print("Response:")
print(response['response'])

## Exercise 1.1: Generate Your First Response - SOLUTION

In [None]:
"""
Exercise 1.1 Solution: Basic Text Generation

This solution demonstrates the simplest form of LLM interaction:
sending a prompt and receiving a generated response.
"""

# Generate a response using ollama.generate()
# model: specifies which LLM to use (must be installed via 'ollama pull')
# prompt: the text input that the model will respond to
my_response = ollama.generate(
    model='llama3.1:8b-instruct-q4_K_M',
    prompt='What is an API?'
)

# Extract the generated text from the response dictionary
# The response contains metadata (timing, tokens) and the actual text in 'response' key
answer = my_response['response']

print("Answer:", answer)

### Code Explanation: Exercise 1.1

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt='...')` | **Sends prompt to local LLM.** Returns a dictionary with the response and metadata. |
| 2 | `my_response['response']` | **Extracts generated text.** The response dict also contains `model`, `total_duration`, `eval_count`, etc. |

**Response Dictionary Structure:**
```python
{
    'model': 'llama3.1:8b-instruct-q4_K_M',           # Model used
    'response': 'An API is...',    # The generated text
    'total_duration': 2500000000,  # Time in nanoseconds
    'eval_count': 45,              # Tokens generated
    'prompt_eval_count': 8         # Tokens in prompt
}
```
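
The metadata fields can be turned into a rough speed figure. The sketch below works on a sample dictionary shaped like the structure above (the values are illustrative, not measured), and `summarize_timing` is a hypothetical helper name, not part of the Ollama library.

```python
# Sample metadata shaped like the structure above (values are illustrative)
sample_response = {
    'model': 'llama3.1:8b-instruct-q4_K_M',
    'response': 'An API is...',
    'total_duration': 2_500_000_000,  # nanoseconds
    'eval_count': 45,                 # tokens generated
    'prompt_eval_count': 8,           # tokens in prompt
}

def summarize_timing(resp: dict) -> dict:
    """Convert raw counters into seconds and a rough tokens-per-second figure."""
    seconds = resp['total_duration'] / 1e9
    generated = resp['eval_count']
    return {
        'seconds': seconds,
        'generated_tokens': generated,
        # total_duration covers the whole request, not just generation,
        # so this figure understates pure generation speed
        'tokens_per_second': generated / seconds if seconds > 0 else 0.0,
    }

print(summarize_timing(sample_response))
```

With a real call, pass the dictionary returned by `ollama.generate()` instead of `sample_response`.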

**Why local LLMs matter:**
- No API costs or rate limits
- Data stays on your machine (privacy)
- Works offline
- Full control over model parameters

---
# Part 2: Prompt Engineering

## Exercise 2.1: Write a Role-Based Prompt - SOLUTION

In [None]:
"""
Exercise 2.1 Solution: Role-Based Prompting

Assigning a role to the LLM improves response quality and consistency.
The model adopts the persona's expertise and communication style.
"""

# Create a prompt that assigns a specific role/persona
# The role provides context that shapes the response style and content
chef_prompt = """You are a professional chef who specializes in quick, easy meals 
that anyone can make at home with common ingredients.

Suggest a simple dinner recipe that can be made in under 30 minutes.
Include a list of ingredients and brief cooking instructions."""

# Generate response - the model will respond as a professional chef
recipe_response = ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt=chef_prompt)

if recipe_response:
    print(recipe_response['response'])

### Code Explanation: Exercise 2.1

| Component | Purpose |
|-----------|---------|
| `"You are a professional chef..."` | **Sets the persona.** The model adopts this role's expertise, vocabulary, and perspective. |
| `"who specializes in..."` | **Narrows the expertise.** Focuses responses on quick, accessible cooking. |
| `"Suggest a simple dinner..."` | **The actual task.** Clear instruction for what output is needed. |
| `"Include a list..."` | **Output structure.** Specifies what format the response should take. |

**Anatomy of a Good Role Prompt:**
```
1. Role Assignment:  "You are a [profession/expert]..."
2. Specialization:   "...who specializes in [specific area]"
3. Task:             "Please [specific action]"
4. Format:           "Include/Format as [structure]"
```

**Why role-based prompting works:**
- LLMs have learned from text written by various professionals
- Assigning a role activates relevant knowledge patterns
- Responses become more focused and authoritative
- Communication style matches the role's typical tone

## Exercise 2.2: Generate JSON Output - SOLUTION

In [None]:
"""
Exercise 2.2 Solution: Structured JSON Output

This solution demonstrates how to get parseable, structured data from an LLM.
Key techniques: explicit format specification, "ONLY" instruction, try/except parsing.
"""

# Craft a prompt that forces structured JSON output
# Key elements:
# 1. Clear task description
# 2. Exact JSON format with field names and types
# 3. Explicit instruction to output ONLY JSON
movie_prompt = """Generate information about a famous movie.

Respond with ONLY valid JSON in this exact format:
{"title": "...", "director": "...", "year": YYYY, "rating": X.X}

The rating should be out of 10. Do not include any other text, just the JSON."""

# Send to LLM
movie_response = ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt=movie_prompt)

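# Parse JSON response with error handling
# json.loads() tolerates surrounding whitespace, so .strip() is only a tidy-up;
# it does NOT remove extra text or code fences the model may add
# Catch the specific parse error instead of using a bare except
try:
    movie_data = json.loads(movie_response['response'].strip())
except json.JSONDecodeError:
    movie_data = None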

# Display parsed data if successful
if movie_data:
    print("Movie data:")
    for key, value in movie_data.items():
        print(f"  {key}: {value}")

### Code Explanation: Exercise 2.2

| Line | Code | Explanation |
|------|------|-------------|
| 1 | `"""Respond with ONLY valid JSON..."""` | **Format enforcement.** "ONLY" reduces extra text. Showing exact format guides structure. |
| 2 | `{"title": "...", "director": "..."}` | **Template example.** LLMs follow patterns - provide the exact structure you want. |
| 3 | `json.loads(response.strip())` | **Parse JSON.** `json.loads()` already ignores leading/trailing whitespace, so `.strip()` is a harmless tidy-up; it does not remove extra text or code fences. |
| 4 | `try: ... except: movie_data = None` | **Graceful failure.** LLMs can produce malformed JSON; handle it without crashing. |

**Prompt Engineering for JSON:**
```
❌ Bad:  "Tell me about a movie"
✓ Good: "Respond with ONLY valid JSON in this exact format: {...}"
```

**Common JSON Pitfalls:**
1. LLM adds "Here's the JSON:" before the actual JSON
2. Markdown code fences: ` ```json ... ``` `
3. Trailing commas in arrays/objects
4. Single quotes instead of double quotes

**Solutions:**
- Use "ONLY" and "Do not include any other text"
- Use `.strip()` for tidiness only: `json.loads()` already ignores surrounding whitespace, and `.strip()` does not remove extra text
- Consider regex extraction as fallback
- Use low temperature for more reliable formatting
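
The regex fallback mentioned above could look like the following minimal sketch. `extract_json` is a hypothetical helper name, and the approach assumes the reply contains a single JSON object; it will not repair trailing commas or single quotes.

```python
import json
import re

def extract_json(raw: str):
    """Best-effort parse of a JSON object embedded in LLM output; None on failure."""
    text = raw.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fallback: take the span from the first '{' to the last '}'
    match = re.search(r'\{.*\}', text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

noisy = 'Here\'s the JSON:\n{"title": "Jaws", "year": 1975}\nHope that helps!'
print(extract_json(noisy))
```

The direct parse is tried first so that clean replies take the cheap path. The regex only runs when the model wraps the object in extra text.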

---
# Part 3: Generation Parameters

## Exercise 3.1: Experiment with Temperature - SOLUTION

In [None]:
"""
Exercise 3.1 Solution: Temperature Parameter Experimentation

Temperature controls the randomness/creativity of LLM outputs.
- Low temp (0.0-0.3): Deterministic, focused, consistent
- High temp (0.7-1.0): Creative, varied, unpredictable
"""

# Task that benefits from creative responses
creative_prompt = "Suggest a creative and unique name for a coffee shop."

# LOW TEMPERATURE (0.2): Deterministic, predictable output
# The model chooses high-probability tokens more often
# Good for: factual answers, code, structured data
low_temp_response = ollama.generate(
    model='llama3.1:8b-instruct-q4_K_M',
    prompt=creative_prompt,
    options={'temperature': 0.2}  # Conservative, focused
)

# HIGH TEMPERATURE (0.9): Creative, diverse output
# The model samples from a wider distribution of tokens
# Good for: creative writing, brainstorming, variety
high_temp_response = ollama.generate(
    model='llama3.1:8b-instruct-q4_K_M',
    prompt=creative_prompt,
    options={'temperature': 0.9}  # Creative, varied
)

# Compare outputs - run multiple times to see temperature effect
print("Low temp (0.2):")
print(low_temp_response['response'])
print("\nHigh temp (0.9):")
print(high_temp_response['response'])

### Code Explanation: Exercise 3.1

| Parameter | Value | Effect |
|-----------|-------|--------|
| `temperature: 0.2` | Low | Model picks most likely tokens. Consistent, focused output. |
| `temperature: 0.9` | High | Model samples broadly. Creative, varied, surprising output. |

**How Temperature Works (Softmax Scaling):**
```
Low temp:  [0.9, 0.05, 0.05]  → Almost always picks first token
High temp: [0.4, 0.35, 0.25] → Distributes probability more evenly
```
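
The scaling can be reproduced with a few lines of NumPy. This is a sketch of the general technique (divide the scores by the temperature, then apply softmax), not Ollama's internal sampler; the logits are made-up values for three candidate tokens.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, np.round(softmax_with_temperature(logits, t), 3))
```

Lower temperatures concentrate probability on the highest-scoring token, while higher temperatures spread it across the alternatives.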

**Temperature Guidelines:**

| Use Case | Recommended Temp |
|----------|------------------|
| Code generation | 0.0 - 0.2 |
| Factual Q&A | 0.1 - 0.3 |
| JSON/structured output | 0.1 - 0.2 |
| General assistant | 0.5 - 0.7 |
| Creative writing | 0.7 - 0.9 |
| Brainstorming | 0.8 - 1.0 |

**Other Generation Parameters:**
- `top_p` (nucleus sampling): Limits token pool by cumulative probability
- `top_k`: Limits to top K most likely tokens
- `repeat_penalty`: Reduces repetition in output
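
To make `top_k` and `top_p` concrete, here is a simplified sketch of how such filters narrow a probability distribution before sampling. It illustrates the general idea only and is not Ollama's actual sampler. `filter_top_k_top_p` is a hypothetical name, and the sketch assumes `top_k >= 1` and `top_p > 0`.

```python
import numpy as np

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Zero out tokens outside the top-k / nucleus set, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # token indices, most likely first
    keep = np.ones(len(probs), dtype=bool)   # mask over the sorted order
    if top_k is not None:
        keep[top_k:] = False                 # drop everything after the k-th token
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # keep the smallest prefix whose cumulative probability reaches top_p
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[cutoff:] = False
    filtered = np.zeros_like(probs)
    kept = order[keep]
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = [0.5, 0.3, 0.15, 0.05]
print(filter_top_k_top_p(probs, top_k=2))
print(filter_top_k_top_p(probs, top_p=0.9))
```

In Ollama these are passed the same way as temperature, for example `options={'top_k': 40, 'top_p': 0.9}`; the values shown here are illustrative, not recommendations.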

**Pro Tip:** Run the same prompt multiple times with high temperature to see variation, and with low temperature to see consistency.

---
# Part 4: Chat Conversations

## Exercise 4.1: Create a Multi-Turn Conversation - SOLUTION

In [None]:
"""
Exercise 4.1 Solution: Multi-Turn Conversation with Memory

This demonstrates how to maintain conversation context across multiple exchanges.
The chat API is stateless: your code keeps the message list and resends it each turn,
which is what lets the model reference earlier information.
"""

# Initialize conversation with first user message
# The messages list acts as the conversation memory
messages = [
    {'role': 'user', 'content': 'My favorite color is blue.'}
]

# TURN 1: Send initial message
# ollama.chat() differs from generate() - it takes structured messages
response1 = ollama.chat(model='llama3.1:8b-instruct-q4_K_M', messages=messages)

if response1:
    print("Assistant:", response1['message']['content'])
    
    # CRITICAL: Add assistant's response to message history
    # This is what gives the model "memory" of the conversation
    messages.append(response1['message'])
    
    # Add follow-up question that requires memory of earlier info
    messages.append({'role': 'user', 'content': 'What is my favorite color?'})
    
    # TURN 2: Model should remember the color from Turn 1
    response2 = ollama.chat(model='llama3.1:8b-instruct-q4_K_M', messages=messages)
    
    if response2:
        print("\nAssistant:", response2['message']['content'])

### Code Explanation: Exercise 4.1

| Step | Code | Purpose |
|------|------|---------|
| 1 | `messages = [{'role': 'user', 'content': '...'}]` | Initialize conversation history list |
| 2 | `ollama.chat(model='...', messages=messages)` | Send entire history to model |
| 3 | `messages.append(response1['message'])` | **Critical:** Add assistant reply to history |
| 4 | `messages.append({'role': 'user', '...'})` | Add next user message |
| 5 | `ollama.chat(...)` again | Model sees full conversation, has "memory" |

**Message Structure:**
```python
messages = [
    {'role': 'user', 'content': 'My favorite color is blue.'},
    {'role': 'assistant', 'content': 'That\'s nice! Blue is...'},
    {'role': 'user', 'content': 'What is my favorite color?'},
    # Model receives ALL messages and can reference earlier context
]
```

**Roles Explained:**
- `user`: Human messages/questions
- `assistant`: Model's previous responses
- `system`: Instructions that shape behavior (used in next exercise)

**Why append both user AND assistant messages?**
Without the assistant's response in history, the model doesn't know what it said. It would lose context and might contradict itself.
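
The append-both-sides bookkeeping can be wrapped in a small helper. In this sketch, `chat_turn` and `fake_send` are hypothetical names. `fake_send` stands in for the model so the example runs without Ollama.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def chat_turn(messages: List[Message], user_text: str,
              send: Callable[[List[Message]], Message]) -> str:
    """Append the user message, call the model, and record its reply in the history."""
    messages.append({'role': 'user', 'content': user_text})
    reply = send(messages)      # the model sees the full history
    messages.append(reply)      # without this, the next turn forgets the reply
    return reply['content']

# Stand-in for the model so the sketch runs offline
def fake_send(history: List[Message]) -> Message:
    return {'role': 'assistant', 'content': f"I have seen {len(history)} message(s)."}

history: List[Message] = []
print(chat_turn(history, 'My favorite color is blue.', fake_send))
print(chat_turn(history, 'What is my favorite color?', fake_send))
```

To use a real model, pass something like `lambda msgs: ollama.chat(model='llama3.1:8b-instruct-q4_K_M', messages=msgs)['message']` as `send`.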

**generate() vs chat():**
| `ollama.generate()` | `ollama.chat()` |
|---------------------|-----------------|
| Single prompt in, text out | Message list in, message out |
| No conversation structure | Supports roles and turns |
| Good for one-shot tasks | Good for multi-turn dialogue |

## Exercise 4.2: Create a Specialized Chatbot - SOLUTION

In [None]:
"""
Exercise 4.2 Solution: Specialized Chatbot with System Prompt

System prompts define the chatbot's personality, expertise, and behavior rules.
They're sent at the start of every conversation but aren't visible to users.
"""

# SYSTEM PROMPT: Defines the chatbot's persona and behavior
# This shapes ALL responses in the conversation
# Key elements:
# 1. Role definition ("You are a...")
# 2. Expertise area ("Python programming tutor")
# 3. Behavior rules ("explain clearly", "provide examples")
# 4. Tone/personality ("helpful", "patient", "encouraging")
system_prompt = """You are a helpful and patient Python programming tutor. 
You explain concepts clearly using simple language and always provide code examples.
When answering questions, you break down complex topics into easy-to-understand steps.
You encourage learning and provide helpful tips."""

# Initialize messages with system prompt FIRST
# System message sets context before any user interaction
messages = [
    {'role': 'system', 'content': system_prompt},  # Always first!
    {'role': 'user', 'content': 'What is a list comprehension in Python?'}
]

# Get response - model will behave as defined in system prompt
tutor_response = ollama.chat(model='llama3.1:8b-instruct-q4_K_M', messages=messages)

if tutor_response:
    print("Python Tutor:")
    print(tutor_response['message']['content'])

### Code Explanation: Exercise 4.2

| Component | Purpose |
|-----------|---------|
| `'role': 'system'` | Special role that defines chatbot behavior (invisible to users) |
| `"You are a...tutor"` | Sets expertise and identity |
| `"explain clearly"` | Defines communication style |
| `"provide code examples"` | Specifies required output format |
| `"patient", "encouraging"` | Sets emotional tone |

**System Prompt Anatomy:**
```python
system_prompt = """
You are a [ROLE/IDENTITY].           # Who the bot is
You specialize in [EXPERTISE].       # What it knows
You always [BEHAVIOR RULES].         # How it acts
Your tone is [PERSONALITY].          # How it sounds
You never [RESTRICTIONS].            # What it won't do
"""
```

**System vs User Messages:**
| System Message | User Message |
|----------------|--------------|
| Sets behavior/rules | Actual questions/requests |
| Placed once at the start of `messages` (resent with the history on every call) | Added each turn |
| Not shown in UI | Visible in chat |
| Persistent context | Turn-by-turn content |

**Best Practices for System Prompts:**
1. Be specific about expertise ("Python tutor" not just "tutor")
2. Define output format ("always provide code examples")
3. Set guardrails ("never write malicious code")
4. Include personality traits for consistent tone
5. Keep it focused - one role per chatbot

---
# Part 5: Building a Simple Application

## Exercise 5.1: Build a Sentiment Analyzer - SOLUTION

In [None]:
"""
Exercise 5.1 Solution: Sentiment Analysis Application

This function wraps an LLM to perform sentiment analysis on text.
It demonstrates how to build reusable AI-powered utilities.
"""

def analyze_sentiment(text):
    """
    Analyze the sentiment of input text using an LLM.
    
    Args:
        text: The text to analyze
        
    Returns:
        dict with 'sentiment' (positive/negative/neutral) and 'confidence' (high/medium/low)
        or None if parsing fails
    """
    # Construct a prompt that:
    # 1. Clearly states the task
    # 2. Provides the input text
    # 3. Specifies exact JSON output format
    # 4. Instructs to only output JSON (no extra text)
    prompt = f"""Analyze the sentiment of the following text.

Text: {text}

Respond with ONLY valid JSON in this exact format:
{{"sentiment": "positive/negative/neutral", "confidence": "high/medium/low"}}

Do not include any other text."""
    
    # Use low temperature for consistent, deterministic output
    response = ollama.generate(
        model='llama3.1:8b-instruct-q4_K_M',
        prompt=prompt,
        options={'temperature': 0.1}  # Low temp = more predictable
    )
    
    # Parse JSON response; return None if the model produced malformed JSON
    try:
        return json.loads(response['response'].strip())
    except json.JSONDecodeError:
        return None

# Test with examples covering all sentiment types
test_texts = [
    "I absolutely love this product! Best purchase ever!",  # Positive
    "This is the worst experience I've ever had.",          # Negative
    "The weather today is cloudy."                          # Neutral
]

for text in test_texts:
    result = analyze_sentiment(text)
    if result:
        print(f"Text: {text[:50]}...")
        print(f"  Sentiment: {result.get('sentiment')}")
        print(f"  Confidence: {result.get('confidence')}")
        print()

### Code Explanation: Exercise 5.1

| Component | Code | Purpose |
|-----------|------|---------|
| Function wrapper | `def analyze_sentiment(text):` | Creates reusable utility for sentiment analysis |
| Prompt template | `f"""Analyze...Text: {text}..."""` | Injects input text into structured prompt |
| JSON format spec | `{{"sentiment": "...", "confidence": "..."}}` | Ensures parseable output format |
| Low temperature | `options={'temperature': 0.1}` | Reduces randomness for consistent output |
| Error handling | `try: json.loads(...) except: None` | Gracefully handles malformed responses |

**Pattern: LLM as a Function:**
```python
def llm_utility(input_data):
    prompt = f"[Task description]\n\nInput: {input_data}\n\nOutput format: [JSON spec]"
    response = ollama.generate(model='...', prompt=prompt, options={'temperature': 0.1})
    return json.loads(response['response'])
```

**Best Practices for Structured Output:**
1. Specify exact JSON format with example
2. Say "ONLY valid JSON" to prevent extra text
3. Use low temperature (0.1-0.3) for consistency
4. Always wrap `json.loads()` in try/except
5. Treat `.strip()` as a tidy-up only; `json.loads()` already ignores surrounding whitespace
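
Valid JSON is not necessarily valid data: the model may echo the template (`"positive/negative/neutral"`) or invent a label. A minimal validation sketch follows. `validate_sentiment` and the two constant names are hypothetical, and the allowed values are taken from the prompt in this exercise.

```python
ALLOWED_SENTIMENTS = {'positive', 'negative', 'neutral'}
ALLOWED_CONFIDENCE = {'high', 'medium', 'low'}

def validate_sentiment(result):
    """Return a normalized result dict, or None if it does not match the schema."""
    if not isinstance(result, dict):
        return None
    sentiment = str(result.get('sentiment', '')).strip().lower()
    confidence = str(result.get('confidence', '')).strip().lower()
    if sentiment not in ALLOWED_SENTIMENTS or confidence not in ALLOWED_CONFIDENCE:
        return None
    return {'sentiment': sentiment, 'confidence': confidence}

print(validate_sentiment({'sentiment': 'Positive', 'confidence': 'HIGH'}))
print(validate_sentiment({'sentiment': 'positive/negative/neutral', 'confidence': 'high'}))
```

Calling this on the parsed dictionary before returning it keeps malformed labels from reaching downstream code.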

## Exercise 5.2: Build a Q&A Bot - SOLUTION

In [None]:
"""
Exercise 5.2 Solution: Context-Grounded Q&A Bot

This function answers questions using ONLY provided context (not model knowledge).
This is the foundation of RAG - ensuring answers are grounded in your data.
"""

def answer_question(context, question):
    """
    Answer a question based strictly on provided context.
    
    Args:
        context: Text containing information to answer from
        question: The question to answer
        
    Returns:
        str: The answer, or acknowledgment if not found in context
    """
    # Prompt structure for grounded Q&A:
    # 1. Explicit instruction to use ONLY context
    # 2. Fallback behavior when answer isn't present
    # 3. Context block clearly labeled
    # 4. Question clearly separated
    prompt = f"""Answer the question based ONLY on the provided context.
If the answer is not in the context, say "I don't know based on the provided context."

Context:
{context}

Question: {question}

Answer:"""
    
    # Low temperature for factual, consistent answers
    response = ollama.generate(
        model='llama3.1:8b-instruct-q4_K_M',
        prompt=prompt,
        options={'temperature': 0.2}
    )
    
    return response['response'].strip()

# Test context - structured information about Python
context = """
Python was created by Guido van Rossum and first released in 1991. 
It emphasizes code readability and uses significant indentation. 
Python supports multiple programming paradigms including procedural, 
object-oriented, and functional programming. The language is named 
after the British comedy group Monty Python.
"""

# Questions - some answerable from context, one not
questions = [
    "Who created Python?",           # In context
    "When was Python first released?", # In context  
    "What is Python's mascot?"       # NOT in context - tests fallback
]

for q in questions:
    answer = answer_question(context, q)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print()

### Code Explanation: Exercise 5.2

| Component | Code | Purpose |
|-----------|------|---------|
| Grounding instruction | `"based ONLY on the provided context"` | Discourages answers drawn from the model's training data (reduces, but does not eliminate, hallucination) |
| Fallback instruction | `"If the answer is not in the context..."` | Defines behavior when info is missing |
| Context injection | `f"""...Context:\n{context}..."""` | Inserts your data into the prompt |
| Low temperature | `options={'temperature': 0.2}` | Makes answers more consistent across runs |

**Why "ONLY on the provided context" matters:**
Without this constraint, LLMs will answer from their training data, even if wrong or outdated. For example, asking "What is Python?" without context might get general info instead of your specific documentation.

**Q&A Prompt Pattern:**
```
[Instruction: Answer based ONLY on context]
[Fallback: What to say if answer not found]

Context:
[Your data here]

Question: [User's question]

Answer:
```

**Testing Grounding:**
The third question ("What is Python's mascot?") isn't in the context. A well-grounded model should respond "I don't know" rather than saying "snake" from general knowledge.

**This is the core of RAG:**
Instead of relying on model training data:
1. Retrieve relevant documents (next section)
2. Inject as context
3. Ground answer in that context

---
# Part 6: Retrieval-Augmented Generation (RAG)

In [None]:
# Helper functions for RAG
def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_embedding(text):
    response = ollama.embed(model='llama3.1:8b-instruct-q4_K_M', input=text)
    return response['embeddings'][0]

class SimpleRAG:
    def __init__(self, model='llama3.1:8b-instruct-q4_K_M'):
        self.model = model
        self.documents = []
        self.embeddings = []
    
    def add_documents(self, docs: List[str]):
        for doc in docs:
            embedding = get_embedding(doc)
            self.documents.append(doc)
            self.embeddings.append(embedding)
        print(f"Added {len(docs)} documents. Total: {len(self.documents)}")
    
    def retrieve(self, query: str, top_k: int = 2) -> List[str]:
        query_embedding = get_embedding(query)
        similarities = []
        for i, doc_embedding in enumerate(self.embeddings):
            sim = cosine_similarity(query_embedding, doc_embedding)
            similarities.append((sim, i))
        similarities.sort(reverse=True)
        top_indices = [idx for _, idx in similarities[:top_k]]
        return [self.documents[i] for i in top_indices]
    
    def query(self, question: str, top_k: int = 2) -> str:
        relevant_docs = self.retrieve(question, top_k)
        context = "\n\n".join(relevant_docs)
        prompt = f"""Use the following context to answer the question. 
If the answer is not in the context, say "I don't have information about that."

Context:
{context}

Question: {question}

Answer:"""
        response = ollama.generate(model=self.model, prompt=prompt, options={'temperature': 0.3})
        return response['response'].strip()

# Create and populate RAG system
rag = SimpleRAG()
knowledge_base = [
    "The Eiffel Tower is located in Paris, France. It was built in 1889 and stands 330 meters tall.",
    "The Great Wall of China is over 21,000 kilometers long and was built over many centuries.",
    "Python programming language was created by Guido van Rossum and released in 1991.",
    "Machine learning is a subset of AI that enables computers to learn from data.",
    "The Amazon rainforest produces about 20% of the world's oxygen."
]
rag.add_documents(knowledge_base)

## Exercise 6.1: Extend the RAG Knowledge Base - SOLUTION

In [None]:
"""
Exercise 6.1 Solution: Extending RAG Knowledge Base

This demonstrates adding domain-specific documents to an existing RAG system.
Each document is embedded and added to the vector store for semantic search.
"""

# Define new domain-specific documents (sports knowledge)
# Each document should be a self-contained fact or concept
# Keep documents focused - one main idea per document works best
my_documents = [
    "The Olympic Games originated in ancient Greece around 776 BC.",
    "Basketball was invented by James Naismith in 1891 in Springfield, Massachusetts.",
    "The FIFA World Cup is held every four years and is the most watched sporting event.",
    "Tennis uses a scoring system of 15, 30, 40, and game points.",
    "The marathon race is 26.2 miles long, commemorating the legend of Pheidippides."
]

# Add documents to the RAG system
# Internally this:
# 1. Generates embeddings for each document (vector representation)
# 2. Stores documents alongside their embeddings
# 3. Enables semantic search across all documents
rag.add_documents(my_documents)

# Test retrieval with questions
# The RAG system will:
# 1. Embed the question
# 2. Find most similar documents (by cosine similarity)
# 3. Generate answer grounded in retrieved context
my_questions = [
    "Who invented basketball?",      # Should find Naismith document
    "How long is a marathon?",       # Should find marathon document
    "When did the Olympic Games start?"  # Should find Olympics document
]

for q in my_questions:
    print(f"Q: {q}")
    answer = rag.query(q)
    print(f"A: {answer}")
    print()

### Code Explanation: Exercise 6.1

| Step | What Happens | Why It Matters |
|------|--------------|----------------|
| `rag.add_documents(my_documents)` | Each doc gets embedded (converted to vector) | Vectors enable semantic similarity search |
| Internal: `get_embedding(doc)` | Model returns a fixed-length vector (4096 dimensions for Llama 3.1 8B) | Similar meanings → similar vectors |
| `rag.query(question)` | Question embedded, similar docs retrieved | Finds relevant info without keyword matching |
| `ollama.generate(context + question)` | LLM answers using retrieved docs | Grounds response in your specific data |

**RAG Pipeline Flow:**
```
1. Document Ingestion:
   "Basketball was invented..." → [0.12, -0.34, 0.78, ...] → Store

2. Query Time:
   "Who invented basketball?" 
   → [0.11, -0.32, 0.76, ...]  (similar vector!)
   → Find nearest documents
   → Generate answer from retrieved context
```
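
The retrieval step in the flow above can be tried without a model by using hand-made vectors. In this sketch the 3-dimensional vectors are invented stand-ins for real embeddings, and `toy_store` and `retrieve` are hypothetical names. Only the ranking logic mirrors `SimpleRAG.retrieve`.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made 3-d vectors standing in for real embeddings (illustrative only)
toy_store = {
    "Basketball was invented by James Naismith in 1891.": [0.9, 0.1, 0.0],
    "The marathon race is 26.2 miles long.": [0.1, 0.9, 0.1],
    "Tennis uses a scoring system of 15, 30, 40.": [0.2, 0.2, 0.9],
}

def retrieve(query_vector, store, top_k=1):
    """Rank stored documents by cosine similarity to the query vector."""
    scored = sorted(
        ((cosine_similarity(query_vector, vec), doc) for doc, vec in store.items()),
        reverse=True,
    )
    return [doc for _, doc in scored[:top_k]]

# A query vector pointing mostly along the first axis, like the basketball document
print(retrieve([0.8, 0.2, 0.1], toy_store))
```

Cosine similarity compares direction rather than magnitude, which is why a query vector that points the same way as a document vector ranks that document first.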

**Document Design Best Practices:**
- One fact/concept per document for precise retrieval
- Include key terms naturally (helps embedding quality)
- Keep documents similar length (~1-3 sentences)
- Avoid pronouns without antecedents ("It was invented..." - what's "it"?)

**Why Semantic Search > Keyword Search:**
| Query | Keyword Search | Semantic Search |
|-------|----------------|-----------------|
| "hoops game origin" | ❌ No match for "basketball" | ✓ Finds basketball doc |
| "foot race distance" | ❌ No match for "marathon" | ✓ Finds marathon doc |

## Exercise 6.2: Implement Document Chunking - SOLUTION

In [None]:
"""
Exercise 6.2 Solution: Document Chunking for RAG

Long documents must be split into chunks for effective RAG retrieval.
Overlapping chunks ensure important information at boundaries isn't lost.
"""

def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> List[str]:
    """
    Split text into overlapping chunks for RAG ingestion.
    
    Args:
        text: The long document to split
        chunk_size: Number of words per chunk
        overlap: Number of words to overlap between consecutive chunks
        
    Returns:
        List of text chunks with specified overlap
    """
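    # Guard against a non-advancing window: if overlap >= chunk_size,
    # 'start = end - overlap' never moves forward and the loop never ends
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and 0 <= overlap < chunk_size")

    # Split text into words
    words = text.split()
    chunks = []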
    
    # Handle short texts that don't need chunking
    if len(words) <= chunk_size:
        return [text.strip()]
    
    # Sliding window approach with overlap
    start = 0
    while start < len(words):
        # Get chunk_size words starting from current position
        end = min(start + chunk_size, len(words))
        chunk = ' '.join(words[start:end])
        chunks.append(chunk.strip())
        
        # Stop if we've reached the end
        if end >= len(words):
            break
        
        # Move start forward, but keep 'overlap' words for context
        # This creates overlapping windows to preserve boundary context
        start = end - overlap
    
    return chunks

# Test with a longer document
long_document = """
Artificial intelligence has transformed the technology landscape dramatically over the past decade. 
Machine learning algorithms now power everything from recommendation systems to autonomous vehicles.
Deep learning, a subset of machine learning, uses neural networks with many layers to learn complex patterns.
Natural language processing enables computers to understand and generate human language.
Computer vision allows machines to interpret and analyze visual information from the world.
Reinforcement learning teaches agents to make decisions through trial and error.
The field continues to advance rapidly, with new breakthroughs announced regularly.
Ethical considerations around AI bias and fairness have become increasingly important.
Researchers are working on making AI systems more transparent and explainable.
The future of AI holds both tremendous promise and significant challenges for society.
"""

# Create chunks with 50 words each, 10 word overlap
chunks = chunk_text(long_document, chunk_size=50, overlap=10)
print(f"Created {len(chunks)} chunks:\n")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}: {chunk[:80]}...")
    print()

### Code Explanation: Exercise 6.2

| Code | What It Does | Purpose |
|------|--------------|---------|
| `words = text.split()` | Tokenize by whitespace | Simple word-based splitting |
| `if len(words) <= chunk_size:` | Short-circuit | Don't chunk if already small enough |
| `end = min(start + chunk_size, len(words))` | Clamp to end of text | Keeps `end` within the word list (Python slices would not raise on overshoot, but the clamp keeps the bookkeeping exact) |
| `start = end - overlap` | Sliding window | Creates overlapping chunks |

**Why Overlap Matters:**
```
Without overlap:
[Chunk 1: "...uses neural"] [Chunk 2: "networks with many..."]
→ Sentence split! "Neural networks" context lost

With overlap:
[Chunk 1: "...uses neural networks"] [Chunk 2: "neural networks with many..."]
→ Important phrase preserved in both chunks
```

**Chunking Strategy Visualization:**
```
Document: |-----100 words-----|-----100 words-----|-----100 words-----|
Chunk 1:  |===================|
Chunk 2:                  |===================|   (20 word overlap)
Chunk 3:                                    |===================|
```
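
The window positions follow from simple arithmetic: each chunk starts `chunk_size - overlap` words after the previous one. The sketch below computes those start offsets for a given word count. `chunk_starts` is a hypothetical helper that mirrors the stepping logic of `chunk_text` without touching any text.

```python
def chunk_starts(n_words: int, chunk_size: int, overlap: int):
    """Word offsets at which each chunk begins for a sliding window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap   # how far the window advances each step
    starts = []
    start = 0
    while True:
        starts.append(start)
        if start + chunk_size >= n_words:
            break                   # this window already reaches the end
        start += stride
    return starts

print(chunk_starts(130, 50, 10))  # → [0, 40, 80]
```

A larger overlap shrinks the stride, so the same document yields more chunks and more embedding calls.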

**Choosing Chunk Parameters:**

| Parameter | Small Value | Large Value |
|-----------|-------------|-------------|
| `chunk_size` | More precise retrieval | More context per chunk |
| `overlap` | Less redundancy | Better boundary preservation |

**Common Settings:**
- `chunk_size=512`, `overlap=50` for long documents
- `chunk_size=100`, `overlap=20` for precise retrieval
- Consider semantic chunking (by paragraph/section) for structured docs
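
Paragraph-based chunking could be sketched as follows. This is one simple interpretation of "semantic chunking" that assumes paragraphs are separated by blank lines. `chunk_by_paragraph` and `max_words` are hypothetical names, and a paragraph longer than the limit is kept whole rather than split.

```python
import re
from typing import List

def chunk_by_paragraph(text: str, max_words: int = 100) -> List[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and count + n > max_words:
            chunks.append('\n\n'.join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append('\n\n'.join(current))
    return chunks

sample = "First paragraph here.\n\nSecond one.\n\nA third, somewhat longer paragraph."
for chunk in chunk_by_paragraph(sample, max_words=5):
    print(repr(chunk))
```

Because chunk boundaries fall between paragraphs, no sentence is cut in half, at the cost of uneven chunk lengths.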

---
# Part 7: Fine-tuning Concepts with LoRA and QLoRA

In [None]:
# Training data formats demonstration
instruction_format = {
    "instruction": "Summarize the following text in one sentence.",
    "input": "The quick brown fox jumps over the lazy dog. This sentence contains every letter.",
    "output": "A pangram sentence featuring a fox demonstrates all 26 alphabet letters."
}

chat_format = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I reverse a string in Python?"},
        {"role": "assistant", "content": "Use slicing: `reversed_string = original_string[::-1]`"}
    ]
}

print("Instruction Format:")
print(json.dumps(instruction_format, indent=2))
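
Records like these are often stored one JSON object per line (JSON Lines). The sketch below assumes that convention; the file format a given fine-tuning tool expects is not specified in this lab. `to_jsonl`, `from_jsonl`, and `missing_keys` are hypothetical helper names, and the sample record is illustrative.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}

records = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "The quick brown fox jumps over the lazy dog.",
        "output": "A short sentence about a fox and a dog.",
    },
]

def to_jsonl(rows) -> str:
    """Serialize one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)

def from_jsonl(text: str):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def missing_keys(row: dict) -> set:
    """Report which instruction-format fields a record lacks."""
    return REQUIRED_KEYS - row.keys()

serialized = to_jsonl(records)
print(serialized)
print(all(not missing_keys(row) for row in from_jsonl(serialized)))
```

Checking every record for the required fields before training catches incomplete examples early.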

## Exercise 7.1: Create Training Data - SOLUTION

In [None]:
"""
Exercise 7.1 Solution: Creating Fine-Tuning Training Data

This demonstrates creating instruction-format training data for LoRA fine-tuning.
Quality training data is the most important factor in fine-tuning success.
"""

# Training data in instruction format for a Fitness Coach persona
# Each example has three parts:
# - instruction: The task type/category
# - input: The user's actual question
# - output: The ideal response (what we want the model to learn)
my_training_data = [
    {
        "instruction": "Provide fitness advice for a beginner.",
        "input": "I want to start exercising but I've never worked out before. Where do I begin?",
        "output": "Start with 20-30 minutes of walking 3-4 times per week. Once comfortable, add bodyweight exercises like squats, push-ups, and planks. Focus on consistency over intensity - it's better to do a little regularly than to overdo it and quit."
    },
    {
        "instruction": "Suggest a workout modification.",
        "input": "I have bad knees and can't do regular squats. What can I do instead?",
        "output": "Try wall sits or chair-assisted squats to reduce knee strain. You can also do glute bridges, which work similar muscles without stressing your knees. Swimming and cycling are excellent low-impact alternatives for leg strength."
    },
    {
        "instruction": "Provide nutrition advice for fitness.",
        "input": "What should I eat before a morning workout?",
        "output": "Have a light snack 30-60 minutes before: a banana, toast with peanut butter, or a small bowl of oatmeal. Avoid heavy meals. If exercising within 30 minutes of waking, just have water - your body can use stored energy for moderate workouts."
    },
    {
        "instruction": "Create a simple workout plan.",
        "input": "Can you give me a quick 15-minute home workout?",
        "output": "Try this circuit: 1 minute jumping jacks, 10 squats, 10 push-ups (or knee push-ups), 30-second plank, 10 lunges each leg, rest 30 seconds. Repeat twice. This hits all major muscle groups and gets your heart rate up!"
    },
    {
        "instruction": "Motivate someone who wants to give up.",
        "input": "I've been working out for 2 weeks and haven't lost any weight. Should I quit?",
        "output": "Don't quit! Two weeks is too early to see major results - your body is adapting. Focus on non-scale victories: more energy, better sleep, improved mood. Muscle is denser than fat, so your body composition may be changing even if the scale isn't. Take progress photos and measurements instead."
    }
]

def validate_training_data(data):
    """
    Validate that all training examples have the required fields.
    Returns True if valid; otherwise prints the first problem found and returns False.
    """
    required_keys = ['instruction', 'input', 'output']
    for i, example in enumerate(data):
        for key in required_keys:
            if key not in example or example[key] is None:
                print(f"Example {i+1} missing '{key}'")
                return False
    print(f"All {len(data)} examples are valid!")
    return True

# Validate our training data
validate_training_data(my_training_data)

### Code Explanation: Exercise 7.1

| Field | Purpose | Example |
|-------|---------|---------|
| `instruction` | Task category/type | "Provide fitness advice for a beginner" |
| `input` | User's actual question | "I want to start exercising but..." |
| `output` | Model's ideal response | "Start with 20-30 minutes of walking..." |

**Instruction Format Structure:**
```json
{
    "instruction": "What type of task is this?",
    "input": "What is the user asking?",
    "output": "What should the model say?"
}
```
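
Before training, each record is usually flattened into a single text string. The sketch below shows one way to do that. The `### Instruction:` / `### Input:` / `### Response:` headers follow a widely used Alpaca-style layout and are an assumption here; the actual template depends on the training framework and base model. `format_example` is a hypothetical helper.

```python
def format_example(example):
    """Render one instruction-format record as a single training string."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    # The input section is skipped when the record has no input text
    if example.get("input"):
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)


sample = {
    "instruction": "Suggest a workout modification.",
    "input": "I have bad knees and can't do regular squats. What can I do instead?",
    "output": "Try wall sits or chair-assisted squats to reduce knee strain.",
}
print(format_example(sample))
```

Whatever template is chosen, the same one should be used at inference time so the prompt matches what the model saw during training.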

**Training Data Quality Checklist:**
- ✓ Diverse examples covering different scenarios
- ✓ Consistent tone/personality across outputs
- ✓ Responses match desired length and style
- ✓ Instruction categories are meaningful
- ✓ No contradictory information between examples
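
Some of these checks can be partly automated. The sketch below is a hypothetical `quality_report` helper (not part of the lab) that flags duplicate inputs and summarizes output lengths; tone and contradictions still need human review. The two sample records are made up for the demonstration.

```python
from collections import Counter

def quality_report(data):
    """Summarize simple, automatable quality signals for instruction-format data."""
    if not data:
        raise ValueError("data must contain at least one example")
    # Normalize inputs so trivially different duplicates are still caught
    inputs = [ex["input"].strip().lower() for ex in data]
    duplicates = [text for text, n in Counter(inputs).items() if n > 1]
    lengths = [len(ex["output"].split()) for ex in data]
    return {
        "num_examples": len(data),
        "duplicate_inputs": duplicates,
        "min_output_words": min(lengths),
        "max_output_words": max(lengths),
        "distinct_instructions": len({ex["instruction"] for ex in data}),
    }


examples = [
    {"instruction": "Provide fitness advice for a beginner.",
     "input": "Where do I begin?",
     "output": "Start with short walks three times a week."},
    {"instruction": "Suggest a workout modification.",
     "input": "Where do I begin?",
     "output": "Try chair-assisted squats."},
]
print(quality_report(examples))
```

A wide gap between `min_output_words` and `max_output_words` is a hint that response length is inconsistent across the dataset.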

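**Why 5 Examples Isn't Enough:**
Five examples are enough to illustrate the format, but real fine-tuning needs far more. As rough rules of thumb (actual needs vary with the task and base model):
- **Minimum:** 50-100 examples for basic adaptation
- **Good:** 500-1000 examples for reliable behavior
- **Production:** 10,000+ for complex tasks

**Data Quality > Quantity:**
At any dataset size, a smaller set of clean, consistent examples usually beats a larger set of sloppy ones, because the model learns whatever patterns the examples contain, including their mistakes. Quality does not remove the need for volume, so aim for both.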

**Alternative Format (Chat Format):**
```json
{
    "messages": [
        {"role": "system", "content": "You are a fitness coach."},
        {"role": "user", "content": "How do I start exercising?"},
        {"role": "assistant", "content": "Start with..."}
    ]
}
```
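
Converting between the two formats is mostly a matter of reshaping dictionaries. The sketch below assumes the `input` field holds the user's question (as in this lab) and drops the `instruction` category unless `input` is empty; `to_chat_format` and the default system prompt are hypothetical choices for illustration. It also serializes the record as one JSON object per line (JSONL), a layout many fine-tuning tools accept.

```python
import json

def to_chat_format(example, system_prompt="You are a fitness coach."):
    """Convert one instruction-format record into a chat-format record."""
    # Fall back to the instruction text when the record has no input
    user_content = example["input"] or example["instruction"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }


record = {
    "instruction": "Provide nutrition advice for fitness.",
    "input": "What should I eat before a morning workout?",
    "output": "Have a light snack 30-60 minutes before, such as a banana.",
}
chat_record = to_chat_format(record)

# One JSON object per line; json.dumps escapes any newlines inside strings
jsonl_line = json.dumps(chat_record)
print(jsonl_line)
```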

## Exercise 7.2: Design a LoRA Configuration - SOLUTION

In [None]:
"""
Exercise 7.2 Solution: LoRA Configuration Design

LoRA (Low-Rank Adaptation) enables efficient fine-tuning by training only
small adapter matrices instead of the full model weights.
"""

# LoRA CONFIGURATION for fitness coach fine-tuning
# These parameters control how the adapter matrices are structured
my_lora_config = {
    # Rank (r): Dimensionality of the low-rank matrices
    # Higher = more capacity, more memory, potentially better fit
    # Typical values: 4, 8, 16, 32, 64
    "r": 16,                # Medium rank - good balance for instruction following
    
    # Alpha: Scaling factor; the LoRA update is scaled by lora_alpha / r
    # Commonly set to 2x the rank (a convention, not a strict rule)
    "lora_alpha": 32,       # 2x the rank, a common starting point
    
    # Target modules: Which transformer layers to adapt
    # Attention projections are a common, parameter-efficient default
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    
    # Dropout: Regularization to prevent overfitting
    # Small datasets overfit more easily, so they may benefit from a higher value (e.g. 0.1)
    "lora_dropout": 0.05,   # Light dropout for regularization
}

# TRAINING CONFIGURATION
# Controls the optimization process
my_training_config = {
    # Epochs: How many times to iterate through the training data
    # More epochs = more learning, but risk of overfitting
    "num_epochs": 3,        # Enough for small dataset to converge
    
    # Batch size: Examples processed together before weight update
    # Smaller = less memory, noisier gradients
    "batch_size": 4,        # Small batch for limited GPU memory
    
    # Learning rate: Step size for weight updates
    # Too high = unstable, too low = slow/stuck
    "learning_rate": 2e-4,  # Standard LoRA learning rate
    
    # Warmup: Gradually increase LR at start to stabilize training
    "warmup_ratio": 0.05,   # 5% warmup
}

print("Fitness Coach LoRA Config:")
print(json.dumps(my_lora_config, indent=2))
print("\nTraining Config:")
print(json.dumps(my_training_config, indent=2))

### Code Explanation: Exercise 7.2

**LoRA Parameters:**

| Parameter | Value | Purpose |
|-----------|-------|---------|
| `r` (rank) | 16 | Adapter matrix dimensions. Higher = more capacity |
| `lora_alpha` | 32 | Scaling factor (usually 2x rank) |
| `target_modules` | `["q_proj", "k_proj", "v_proj", "o_proj"]` | Which layers to adapt |
| `lora_dropout` | 0.05 | Regularization to prevent overfitting |

**How LoRA Works:**
```
Original: W (large matrix, e.g., 4096×4096)
LoRA:     W + A×B where A is 4096×16, B is 16×4096

Instead of training ~16.8M parameters in W (4096 × 4096),
we train only ~131K parameters in A and B (2 × 4096 × 16)!
```
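
The arithmetic and the forward pass can be checked with NumPy. This is a minimal sketch under stated assumptions: it keeps the `A` (d×r) and `B` (r×d) naming used above, applies the usual `lora_alpha / r` scaling, and uses a tiny 8×8 matrix for the forward pass so it runs instantly. Initializing `B` to zeros (so the adapter starts as a no-op) mirrors common LoRA practice.

```python
import numpy as np

# Parameter counts for one 4096x4096 weight matrix with rank 16
d, r, alpha = 4096, 16, 32
full_params = d * d
lora_params = d * r + r * d
print(f"Full matrix:  {full_params:,} parameters")
print(f"LoRA A and B: {lora_params:,} parameters ({lora_params / full_params:.2%} of full)")

# Tiny forward-pass demo (8x8 instead of 4096x4096)
rng = np.random.default_rng(0)
d_small, r_small = 8, 2
W = rng.normal(size=(d_small, d_small))   # frozen pretrained weights
A = rng.normal(size=(d_small, r_small))   # trainable, random init
B = np.zeros((r_small, d_small))          # trainable, zero init
scale = alpha / r                         # LoRA scaling factor
x = rng.normal(size=d_small)

y_frozen = x @ W
y_adapted = x @ W + scale * (x @ A) @ B
print("Adapter is a no-op at init:", np.allclose(y_frozen, y_adapted))

# Pretend training changed B, then merge the adapter into W for inference
B = rng.normal(size=(r_small, d_small)) * 0.1
W_merged = W + scale * (A @ B)
y_adapted = x @ W + scale * (x @ A) @ B
print("Merged weights match adapter path:", np.allclose(x @ W_merged, y_adapted))
```

Computing `(x @ A) @ B` instead of `x @ (A @ B)` avoids ever materializing the full d×d update during training, which is where the memory savings come from.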

**Rank Selection Guide:**

| Task Complexity | Recommended Rank |
|-----------------|------------------|
| Simple style change | 4-8 |
| Instruction following | 8-16 |
| Domain knowledge | 16-32 |
| Complex reasoning | 32-64 |

**Training Parameters:**

| Parameter | Value | Effect |
|-----------|-------|--------|
| `num_epochs` | 3 | More = better fit, but higher risk of overfitting |
| `batch_size` | 4 | Smaller = less GPU memory needed, noisier gradients |
| `learning_rate` | 2e-4 | Common LoRA default (roughly 10x higher than typical full fine-tuning rates) |
| `warmup_ratio` | 0.05 | Stabilizes early training |
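
To see how these values interact, the sketch below converts them into step counts and a warmup schedule. It assumes no gradient accumulation, rounds warmup steps up, and holds the learning rate constant after warmup (real trainers usually add a decay phase). `training_schedule` and `lr_at_step` are hypothetical helpers.

```python
import math

def training_schedule(num_examples, batch_size=4, num_epochs=3, warmup_ratio=0.05):
    """Return (steps_per_epoch, total_steps, warmup_steps)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps

def lr_at_step(step, warmup_steps, peak_lr=2e-4):
    """Linear warmup to peak_lr, then constant (decay omitted for brevity)."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


for n in (5, 1000):
    per_epoch, total, warmup = training_schedule(n)
    print(f"{n} examples: {per_epoch} steps/epoch, {total} total steps, {warmup} warmup steps")
```

With only 5 examples the whole run is 6 optimizer steps, which is another way to see why such a small dataset cannot teach the model much.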

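**Memory Comparison (approximate, 7B model):**
- Full fine-tuning: ~28GB for the fp32 weights alone; a typical mixed-precision Adam setup needs roughly 16 bytes per parameter (~112GB) for weights, gradients, and optimizer states, before activations
- LoRA (16-bit base model): ~14GB for the frozen weights, plus a small overhead for adapters, their optimizer states, and activations
- QLoRA (4-bit base + LoRA): ~3.5GB for the frozen weights, plus the same kind of overhead, which depends on batch size and sequence length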
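
Most of the gap between these figures comes from how many bytes each stored weight occupies. The back-of-the-envelope sketch below computes weight storage only; gradients, optimizer states, and activations come on top and are not modeled here. `weight_memory_gb` is a hypothetical helper, and 1 GB is taken as 10^9 bytes.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7e9  # a 7B-parameter model
for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>9}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Quantizing the frozen base model to 4 bits is what lets QLoRA fit on a much smaller GPU, since the adapters themselves are tiny by comparison.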

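**Why target attention layers?**
Attention (q, k, v, o projections) controls how the model relates tokens to one another. Adapting these projections is a common default that changes behavior noticeably while training very few parameters; some setups also adapt the MLP projections when more capacity is needed.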
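
To put "minimal parameters" in numbers, the sketch below counts the trainable adapter weights for the whole model. The dimensions are assumptions typical of a 7B Llama-style model (32 layers, hidden size 4096, square attention projections); models that use grouped-query attention have smaller `k_proj` and `v_proj` matrices, so their count would be somewhat lower.

```python
# Assumed model dimensions (hypothetical, typical of a 7B Llama-style model)
num_layers = 32
hidden_size = 4096
base_params = 7e9

# Values from the LoRA config above
r = 16
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Each adapted projection gets two matrices: hidden x r and r x hidden
params_per_module = hidden_size * r + r * hidden_size
trainable = num_layers * len(target_modules) * params_per_module

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"Share of base model: {trainable / base_params:.2%}")
```

Under these assumptions the adapters add up to about 16.8M parameters, a fraction of a percent of the base model.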

---
# Lab Complete!

## Summary

You learned:
- **Basic Generation**: Use `ollama.generate()` for text completion
- **Prompt Engineering**: Role-based prompts, structured output, JSON responses
- **Parameters**: Control creativity with temperature
- **Chat API**: Multi-turn conversations with `ollama.chat()`
- **Applications**: Build summarizers, sentiment analyzers, and Q&A bots
- **RAG**: Implement retrieval-augmented generation with embeddings
- **Fine-tuning**: Understand LoRA/QLoRA for efficient model adaptation