# Lab 3: Generative AI with Ollama

**Duration:** 90-120 minutes | **Difficulty:** Intermediate to Advanced

---

## Overview

This lab teaches you how to work with Large Language Models (LLMs) using Ollama for local inference.

### Lab Structure

| Part | Topic | Key Concepts |
|------|-------|---------------|
| **Part 1** | Basic Generation | ollama.generate(), prompts |
| **Part 2** | Prompt Engineering | System prompts, roles, structured output |
| **Part 3** | Generation Parameters | Temperature, top_p, num_predict |
| **Part 4** | Chat Conversations | Multi-turn chat, message history |
| **Part 5** | Building Applications | Summarization, sentiment, Q&A |
| **Part 6** | RAG | Embeddings, retrieval, augmented generation |
| **Part 7** | Fine-tuning Concepts | LoRA, QLoRA, training data |

### Prerequisites

- Ollama must be installed and running locally
- Pull a model: `ollama pull llama3.1:8b-instruct-q4_K_M`

### Instructions

- Read each markdown cell carefully
- Write your code in the empty code cells
- Run cells with `Shift+Enter`

## Setup

Run the cell below to import libraries and verify Ollama is running.

In [None]:
import ollama
import json
import numpy as np
from typing import List, Dict

# Check connection to Ollama
try:
    models = ollama.list()
    print("Connected to Ollama!")
    print("\nAvailable models:")
    for model in models.get('models', []):
        print(f"  - {model.get('model', model.get('name', 'unknown'))}")
except Exception as e:
    print(f"Error connecting to Ollama: {e}")
    print("\nMake sure Ollama is running: ollama serve")

---
# Part 1: Basic Text Generation

Ollama provides a simple API for generating text with local LLMs.

## 1.1 Simple Generation

Use `ollama.generate()` to generate text:

```python
response = ollama.generate(
    model='llama3.1:8b-instruct-q4_K_M',  # or your available model
    prompt='Your prompt here'
)
print(response['response'])
```

**Your Task:**
1. Call `ollama.generate()` with model 'llama3.1:8b-instruct-q4_K_M' (or your available model)
2. Use the prompt: "What is machine learning in one sentence?"
3. Print the response

**Expected Output:**
```
Machine learning is a type of artificial intelligence that enables computers 
to learn from data and make predictions or decisions without being explicitly 
programmed.
```
(Output will vary slightly each time)

In [None]:
# Your code here


## 1.2 Exploring the Response

The response object contains useful metadata:

```python
response['response']      # The generated text
response['model']         # Model used
response['total_duration'] # Time in nanoseconds
response['eval_count']    # Tokens generated
```
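The timing fields can also be combined into a throughput figure. The sketch below assumes the response carries `eval_duration` (nanoseconds spent generating tokens) alongside `eval_count`; the helper name `tokens_per_second` and the sample values are made up for illustration and are not real measurements.

```python
def tokens_per_second(response: dict) -> float:
    """Return generated tokens per second from an Ollama-style response dict."""
    eval_count = response["eval_count"]            # tokens generated
    eval_duration_ns = response["eval_duration"]   # generation time in ns (assumed present)
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / 1e9)


# Hand-written sample values standing in for a real response
sample = {
    "eval_count": 50,
    "eval_duration": 2_000_000_000,
    "total_duration": 2_450_000_000,
}
print(f"Throughput: {tokens_per_second(sample):.1f} tokens/s")
print(f"Total time: {sample['total_duration'] / 1e9:.2f} s")
```

With a live model you would pass the result of `ollama.generate(...)` instead of `sample`.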

**Your Task:**
1. Generate a response for: "Explain neural networks briefly"
2. Print the response text
3. Print the model name
4. Calculate and print the generation time in seconds (total_duration / 1e9)

**Expected Output:**
```
Response: Neural networks are computing systems inspired by biological neural 
networks. They consist of layers of interconnected nodes that process and 
learn from data...

Model: llama3.1:8b-instruct-q4_K_M
Generation time: 2.45 seconds
```

**Sample Code:**
```python
# Accessing response metadata
response = ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt='Hello')
text = response['response']
model_name = response['model']
duration_sec = response['total_duration'] / 1e9
print(f"Text: {text}")
print(f"Model: {model_name}")
print(f"Time: {duration_sec:.2f}s")
```

In [None]:
# Your code here


---
# Part 2: Prompt Engineering

Better prompts lead to better outputs. Learn techniques to guide the model.

## 2.1 Role-Based Prompting

Give the model a role to improve responses:

```python
prompt = """You are an expert Python programmer.

Question: How do I read a CSV file in Python?

Answer:"""
```
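If you reuse the same layout for several roles, a small helper keeps the prompts consistent. This is a minimal sketch; `build_role_prompt` is a hypothetical name, not part of the Ollama library.

```python
def build_role_prompt(role: str, question: str) -> str:
    """Build a role-based prompt in the Question/Answer layout shown above."""
    return f"You are {role}.\n\nQuestion: {question}\n\nAnswer:"


prompt = build_role_prompt(
    "an expert Python programmer",
    "How do I read a CSV file in Python?",
)
print(prompt)
```

The returned string can be passed directly as the `prompt` argument of `ollama.generate()`.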

**Your Task:**
1. Create a prompt that assigns the role of "expert data scientist"
2. Ask: "What are the top 3 things to check when debugging a machine learning model?"
3. Generate and print the response

**Expected Output:**
```
As an expert data scientist, here are the top 3 things to check:

1. **Data Quality**: Check for missing values, outliers, and data leakage...
2. **Model Overfitting/Underfitting**: Compare train vs validation metrics...
3. **Feature Engineering**: Verify feature scaling and encoding...
```

**Sample Code:**
```python
# Using role-based prompting
prompt = """You are an expert chef specializing in Italian cuisine.

Question: What makes a great pasta dish?

Answer:"""

response = ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt=prompt)
print(response['response'])
```

In [None]:
# Your code here


## 2.2 Structured Output

Request specific output formats in your prompt:

```python
prompt = """List 3 programming languages.

Format your response as a JSON array like this:
["language1", "language2", "language3"]

Only output the JSON, nothing else."""
```

**Your Task:**
1. Create a prompt asking for 3 benefits of AI
2. Request the output in JSON format with keys: benefit1, benefit2, benefit3
3. Generate and print the response
4. Try parsing it with `json.loads()` (models often wrap the JSON in extra text, so you may need to extract just the JSON part first)

**Expected Output:**
```
{
    "benefit1": "Automation of repetitive tasks",
    "benefit2": "Improved decision-making through data analysis",
    "benefit3": "24/7 availability and scalability"
}
```
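Models do not always obey "only output the JSON", so it helps to pull the JSON out of the surrounding text before parsing. The helper below is one possible approach, using the standard library's `json.JSONDecoder.raw_decode`; the name `extract_json` and the sample reply are illustrative assumptions.

```python
import json


def extract_json(text: str):
    """Parse the first JSON object or array embedded in model output.

    Returns the parsed value, or None if nothing parseable is found.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at index i
                value, _ = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue  # not valid JSON here, keep scanning
    return None


# Hand-written example of a chatty model reply
raw = 'Sure! Here is the list:\n["Python", "Rust", "Go"]\nLet me know if you need more.'
print(extract_json(raw))
```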

In [None]:
# Your code here


---
# Part 3: Generation Parameters

Control the creativity and style of outputs with generation parameters.

## 3.1 Temperature

Temperature controls randomness:
- **Low (0.1-0.3)**: More focused, deterministic
- **Medium (0.5-0.7)**: Balanced creativity
- **High (0.8-1.0)**: More random, creative

```python
response = ollama.generate(
    model='llama3.1:8b-instruct-q4_K_M',
    prompt='Write a creative story opening',
    options={'temperature': 0.9}  # High creativity
)
```
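To build intuition for why temperature changes randomness, the sketch below applies the usual softmax-with-temperature formula to a few made-up token scores. It is a conceptual illustration only: the real sampling happens inside the inference engine, and the scores here are invented.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 0.9)
print("T=0.2:", [round(p, 3) for p in low])
print("T=0.9:", [round(p, 3) for p in high])
```

At low temperature almost all probability mass sits on the top-scoring token; at higher temperature the less likely tokens get a larger share, which is why outputs become more varied.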

**Your Task:**
1. Use the prompt: "Write a one-sentence tagline for a coffee shop"
2. Generate with temperature=0.2 (low) and print the result
3. Generate with temperature=0.9 (high) and print the result
4. Compare the differences

**Expected Output:**
```
Temperature 0.2 (Low - More predictable):
"Start your day with the perfect cup."

Temperature 0.9 (High - More creative):
"Where caffeine dreams dance in ceramic clouds."
```

In [None]:
# Your code here


## 3.2 Other Parameters

Additional generation options:

| Parameter | Description | Example |
|-----------|-------------|----------|
| `num_predict` | Max tokens to generate | `100` |
| `top_p` | Nucleus sampling threshold | `0.9` |
| `top_k` | Sample only from the k most likely tokens | `40` |
| `repeat_penalty` | Reduce repetition | `1.1` |

```python
options = {
    'temperature': 0.7,
    'num_predict': 50,
    'top_p': 0.9
}
response = ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt='...', options=options)
```
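The effect of `top_k` and `top_p` can be pictured as filtering the candidate tokens before one is sampled. The sketch below is a simplified illustration under that assumption (apply `top_k`, then `top_p`, then renormalize); the function name and the probabilities are invented, and the actual sampler in the inference engine may differ in its details.

```python
def filter_top_k_top_p(probs, top_k=40, top_p=0.9):
    """Return {token: prob} restricted by top_k, then top_p, and renormalized."""
    # Keep only the top_k most likely tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    # Keep the smallest prefix whose cumulative probability reaches top_p
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}  # invented distribution
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.7)
print(filtered)
```

Here `top_k=3` drops "zebra", and `top_p=0.7` then stops after "the" and "a" because together they already cover at least 70% of the probability mass.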

**Your Task:**
1. Generate a response with `num_predict=30` to limit output length
2. Use prompt: "Explain the internet in simple terms"
3. Print the response

**Expected Output:**
```
The internet is a global network of connected computers that allows people 
to share information and communicate with each other...
(output truncated at ~30 tokens)
```

In [None]:
# Your code here


---
# Part 4: Chat Conversations

Build multi-turn conversations with message history.

## 4.1 Chat API

Use `ollama.chat()` for conversations:

```python
messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': 'Hello!'}
]

response = ollama.chat(model='llama3.1:8b-instruct-q4_K_M', messages=messages)
print(response['message']['content'])
```

Roles:
- `system`: Sets the assistant's behavior
- `user`: User messages
- `assistant`: Previous assistant responses

**Your Task:**
1. Create a messages list with a system message: "You are a friendly cooking assistant"
2. Add a user message: "What's a quick breakfast idea?"
3. Call `ollama.chat()` and print the response

**Expected Output:**
```
Here's a quick and delicious breakfast idea: scrambled eggs with toast!

Ingredients: 2 eggs, butter, salt, pepper, 2 slices of bread
Time: 5 minutes
...
```

In [None]:
# Your code here


## 4.2 Multi-turn Conversation

Continue a conversation by adding messages:

```python
# After getting a response, add it to history
messages.append({'role': 'assistant', 'content': response['message']['content']})

# Add next user message
messages.append({'role': 'user', 'content': 'Tell me more'})

# Get next response
response = ollama.chat(model='llama3.1:8b-instruct-q4_K_M', messages=messages)
```
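The append-then-call pattern above can be wrapped in a small class so the history is never forgotten. This is a sketch, not an Ollama feature: `Conversation` and `fake_chat` are hypothetical names, and the model call is passed in as `chat_fn` so the example runs without a server.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the message history for a multi-turn chat."""

    def __init__(self, system_prompt: str, chat_fn: Callable[[List[Message]], str]):
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]
        self.chat_fn = chat_fn

    def ask(self, user_text: str) -> str:
        # Add the user turn, get a reply, and store the reply in the history
        self.messages.append({"role": "user", "content": user_text})
        reply = self.chat_fn(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_chat(messages: List[Message]) -> str:
    """Stand-in for a real model call so the sketch runs without Ollama."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply to user turn {user_turns})"


chat = Conversation("You are a helpful assistant.", fake_chat)
chat.ask("Hello!")
chat.ask("Tell me more")
for m in chat.messages:
    print(f"{m['role']}: {m['content']}")
```

To talk to a real model, `chat_fn` could wrap the call shown earlier, for example `lambda msgs: ollama.chat(model='llama3.1:8b-instruct-q4_K_M', messages=msgs)['message']['content']`.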

**Your Task:**
1. Start a conversation about Python programming
2. Ask a follow-up question that references the first response
3. Print both responses showing the conversation flow

**Expected Output:**
```
User: What is Python best used for?
Assistant: Python is best used for web development, data science, 
machine learning, automation, and scripting...

User: Can you give me an example of the first one you mentioned?
Assistant: Sure! For web development, you can use frameworks like Django 
or Flask. Here's a simple Flask example...
```

In [None]:
# Your code here


---
# Part 5: Building Applications

Apply LLMs to common NLP tasks.

## 5.1 Text Summarization

Create a summarization function:

```python
def summarize(text, max_sentences=2):
    prompt = f"""Summarize this text in {max_sentences} sentences:

{text}

Summary:"""
    response = ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt=prompt)
    return response['response']
```

**Your Task:**
1. Create a `summarize(text, max_sentences)` function
2. Test it with this text:
   ```
   Machine learning is a subset of artificial intelligence that enables systems to learn from data. 
   Instead of being explicitly programmed, these systems improve through experience. 
   ML is used in many applications including image recognition, natural language processing, and recommendation systems.
   ```
3. Print the summary

**Expected Output:**
```
Summary: Machine learning is a branch of AI that allows systems to learn 
from data rather than being explicitly programmed. It powers applications 
like image recognition and recommendation systems.
```

In [None]:
# Your code here


## 5.2 Sentiment Analysis

Analyze the sentiment of text:

```python
def analyze_sentiment(text):
    prompt = f"""Analyze the sentiment of this text.
Respond with exactly one word: POSITIVE, NEGATIVE, or NEUTRAL.

Text: {text}

Sentiment:"""
    response = ollama.generate(model='llama3.1:8b-instruct-q4_K_M', prompt=prompt)
    return response['response'].strip()
```
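Even with "exactly one word" in the prompt, a model may answer with different casing, punctuation, or a short sentence. One way to make the result safe to use downstream is to normalize it to a fixed label set. The sketch below is an assumption about how you might do that; `normalize_sentiment` and the `UNKNOWN` fallback are not part of the lab's API.

```python
VALID_LABELS = ("POSITIVE", "NEGATIVE", "NEUTRAL")


def normalize_sentiment(raw: str) -> str:
    """Map free-form model output to one of the three labels, else 'UNKNOWN'."""
    upper = raw.upper()
    found = [label for label in VALID_LABELS if label in upper]
    # Accept the answer only when exactly one label appears
    return found[0] if len(found) == 1 else "UNKNOWN"


# Hand-written examples of the kinds of replies a model might give
for raw in ["POSITIVE", " negative.\n", "Sentiment: Neutral", "It could be positive or negative"]:
    print(repr(raw), "->", normalize_sentiment(raw))
```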

**Your Task:**
1. Create an `analyze_sentiment(text)` function
2. Test it with these examples:
   - "I love this product! It's amazing!"
   - "This is the worst experience ever."
   - "The meeting is scheduled for 3pm."
3. Print the sentiment for each

**Expected Output:**
```
Text: "I love this product! It's amazing!"
Sentiment: POSITIVE

Text: "This is the worst experience ever."
Sentiment: NEGATIVE

Text: "The meeting is scheduled for 3pm."
Sentiment: NEUTRAL
```

In [None]:
# Your code here


---
# Part 6: Retrieval-Augmented Generation (RAG)

RAG combines retrieval with generation for knowledge-grounded responses.

## 6.1 Embeddings

Get vector embeddings for semantic search:

```python
response = ollama.embeddings(
    model='llama3.1:8b-instruct-q4_K_M',  # or a dedicated embedding model
    prompt='Your text here'
)
embedding = response['embedding']  # List of floats
print(f"Embedding dimension: {len(embedding)}")
```

**Your Task:**
1. Get embeddings for two similar sentences:
   - "I love programming in Python"
   - "Python is my favorite programming language"
2. Print the dimension of the embeddings
3. Calculate cosine similarity between them using:
   ```python
   def cosine_similarity(a, b):
       return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
   ```

**Expected Output:**
```
Embedding dimension: 4096
Cosine similarity: 0.89 (high similarity for similar sentences)
```
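Before trusting similarity scores from real embeddings, it can help to check the formula on vectors where you already know the answer. The sketch below is a pure-Python equivalent of the NumPy one-liner above, applied to hand-picked toy vectors (not model embeddings).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])                # opposite directions
print(f"same direction: {same_direction:.2f}")
print(f"orthogonal:     {orthogonal:.2f}")
print(f"opposite:       {opposite:.2f}")
```

Scores close to 1 mean the vectors point the same way, scores near 0 mean they are unrelated, and scores near -1 mean they point in opposite directions. Vector length does not matter, only direction.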

In [None]:
# Your code here


## 6.2 Simple RAG Implementation

Build a basic RAG system:

1. **Index documents**: Create embeddings for your knowledge base
2. **Retrieve**: Find relevant documents for a query
3. **Generate**: Use retrieved context to answer

```python
class SimpleRAG:
    def __init__(self, model='llama3.1:8b-instruct-q4_K_M'):
        self.model = model
        self.documents = []
        self.embeddings = []
    
    def add_document(self, text):
        self.documents.append(text)
        emb = ollama.embeddings(model=self.model, prompt=text)['embedding']
        self.embeddings.append(emb)
    
    def query(self, question, top_k=2):
        # Get query embedding
        q_emb = ollama.embeddings(model=self.model, prompt=question)['embedding']
        
        # Find most similar documents
        similarities = [cosine_similarity(q_emb, e) for e in self.embeddings]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        # Build context
        context = "\n".join([self.documents[i] for i in top_indices])
        
        # Generate answer
        prompt = f"""Context:
{context}

Question: {question}

Answer based on the context:"""
        
        response = ollama.generate(model=self.model, prompt=prompt)
        return response['response']
```
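The retrieval step can be tested on its own, without a running model, by swapping in a toy embedding function. In the sketch below, `toy_embed` is a hypothetical keyword counter standing in for `ollama.embeddings`, and `retrieve` is an illustrative helper name; only the ranking logic mirrors the class above.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat zero vectors as unrelated


def retrieve(question: str, documents: List[str],
             embed_fn: Callable[[str], Sequence[float]], top_k: int = 2) -> List[str]:
    """Return the top_k documents most similar to the question."""
    q_emb = embed_fn(question)
    ranked = sorted(documents, key=lambda d: cosine(q_emb, embed_fn(d)), reverse=True)
    return ranked[:top_k]


KEYWORDS = ["python", "javascript", "browser", "data", "created"]


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: counts of a few keywords (illustration only)."""
    lowered = text.lower()
    return [float(lowered.count(word)) for word in KEYWORDS]


docs = [
    "Python was created by Guido van Rossum in 1991.",
    "JavaScript is the language of the web browser.",
    "Machine learning uses algorithms to learn from data.",
]
question = "Who created Python?"
top = retrieve(question, docs, toy_embed, top_k=1)
context = "\n".join(top)
prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer based on the context:"
print(prompt)
```

Keeping the embedding function as a parameter also makes it easy to switch to a dedicated embedding model later without touching the ranking code.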

**Your Task:**
1. Implement the `SimpleRAG` class
2. Add these documents:
   - "Python was created by Guido van Rossum in 1991."
   - "JavaScript is the language of the web browser."
   - "Machine learning uses algorithms to learn from data."
3. Query: "Who created Python?"
4. Print the answer

**Expected Output:**
```
Added 3 documents to RAG system

Query: Who created Python?
Answer: Based on the context, Python was created by Guido van Rossum in 1991.
```

In [None]:
# Your code here


---
# Part 7: Fine-tuning Concepts

This part introduces the concepts behind customizing LLMs for specific tasks.

## 7.1 LoRA (Low-Rank Adaptation)

LoRA is an efficient fine-tuning technique:

- Instead of updating all weights, it freezes them and adds small trainable low-rank matrices
- Reduces memory requirements significantly
- Key parameters:
  - `r` (rank): Rank of the adaptation matrices (typically 4-64)
  - `alpha`: Scaling factor
  - `target_modules`: Which layers to adapt

```python
# Conceptual LoRA configuration (not runnable without training setup)
lora_config = {
    'r': 16,
    'lora_alpha': 32,
    'target_modules': ['q_proj', 'v_proj'],
    'lora_dropout': 0.05
}
```
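A quick calculation shows why the rank matters. For a weight matrix of shape `d_out x d_in`, LoRA trains two matrices of shapes `d_out x r` and `r x d_in` instead of the full matrix. The sketch below counts the parameters in each case; the 4096 x 4096 layer size is an illustrative assumption, and `lora_param_counts` is a hypothetical helper.

```python
def lora_param_counts(d_out: int, d_in: int, r: int):
    """Return (full_matrix_params, lora_adapter_params) for one weight matrix."""
    full = d_out * d_in           # parameters in the frozen weight matrix
    lora = r * (d_in + d_out)     # one r x d_in matrix plus one d_out x r matrix
    return full, lora


full, lora = lora_param_counts(4096, 4096, r=16)  # illustrative layer size
print(f"Full matrix:  {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

Doubling `r` doubles the adapter size, so the rank is a direct trade-off between capacity and memory.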

**Your Task:** (Conceptual: no training setup needed)
1. Create a dictionary `lora_config` with:
   - `r`: 8
   - `lora_alpha`: 16
   - `target_modules`: ['q_proj', 'k_proj', 'v_proj']
   - `lora_dropout`: 0.1
2. Print the configuration

**Expected Output:**
```
LoRA Configuration:
{
    'r': 8,
    'lora_alpha': 16,
    'target_modules': ['q_proj', 'k_proj', 'v_proj'],
    'lora_dropout': 0.1
}
```

In [None]:
# Your code here


## 7.2 QLoRA (Quantized LoRA)

QLoRA combines quantization with LoRA:

- Uses 4-bit quantization to reduce memory even further
- Enables fine-tuning large models on consumer GPUs
- Key concepts:
  - **NF4**: 4-bit NormalFloat quantization
  - **Double quantization**: Quantizes the quantization constants

```python
# Conceptual QLoRA configuration
qlora_config = {
    'load_in_4bit': True,
    'bnb_4bit_quant_type': 'nf4',
    'bnb_4bit_use_double_quant': True,
    'bnb_4bit_compute_dtype': 'float16'
}
```
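The memory saving from 4-bit loading can be estimated with simple arithmetic. The sketch below is a rough estimate for the weights only: it ignores quantization constants, activations, optimizer state, and the LoRA adapter itself, and the 8-billion parameter count is an assumed round number.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 8e9  # assumed round figure for an 8B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"float16 weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:   ~{int4_gb:.0f} GB")
```

The fourfold reduction is what brings models of this size within reach of a single consumer GPU.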

**Your Task:** (Conceptual)
1. Create a dictionary `qlora_config` with the 4-bit configuration shown above
2. Print it and explain (in comments) what each parameter does

**Expected Output:**
```
QLoRA Configuration:
{
    'load_in_4bit': True,           # Load model in 4-bit precision
    'bnb_4bit_quant_type': 'nf4',   # Use NormalFloat4 quantization
    'bnb_4bit_use_double_quant': True,  # Apply double quantization
    'bnb_4bit_compute_dtype': 'float16'  # Compute in float16
}
```

In [None]:
# Your code here


## 7.3 Training Data Format

Fine-tuning requires properly formatted training data:

**Instruction format:**
```json
{
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
}
```

**Chat format:**
```json
{
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2+2 equals 4."}
    ]
}
```
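The two formats carry the same information, so one can be converted into the other. The sketch below shows one possible mapping; joining `instruction` and `input` with a blank line is an assumption (training frameworks differ on this), and `instruction_to_chat` is a hypothetical helper name.

```python
import json


def instruction_to_chat(example: dict) -> dict:
    """Convert an instruction-format example into chat format."""
    user_content = example["instruction"]
    if example.get("input"):
        # Assumed convention: append the optional input after a blank line
        user_content += "\n\n" + example["input"]
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }


example = {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris.",
}
chat_example = instruction_to_chat(example)
print(json.dumps(chat_example, indent=4))
```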

**Your Task:**
1. Create a list of 3 training examples in instruction format for a customer support bot
2. Print the examples as JSON

**Expected Output:**
```json
[
    {
        "instruction": "How do I reset my password?",
        "input": "",
        "output": "To reset your password, click 'Forgot Password' on the login page..."
    },
    {
        "instruction": "What are your business hours?",
        "input": "",
        "output": "Our business hours are Monday-Friday, 9am-5pm EST."
    },
    {
        "instruction": "How do I track my order?",
        "input": "",
        "output": "You can track your order by logging into your account and clicking 'Order History'."
    }
]
```

**Sample Code:**
```python
# Creating training data in instruction format
training_examples = [
    {
        "instruction": "Translate hello to Spanish",
        "input": "",
        "output": "Hello in Spanish is 'Hola'."
    },
    {
        "instruction": "What is the square root of 16?",
        "input": "",
        "output": "The square root of 16 is 4."
    }
]

print(json.dumps(training_examples, indent=2))
```

In [None]:
# Your code here


---
# Lab Complete!

## Summary

You learned:
- **Basic Generation**: ollama.generate() for text completion
- **Prompt Engineering**: Roles, structured output, formatting
- **Parameters**: Temperature, top_p, num_predict
- **Chat**: Multi-turn conversations with message history
- **Applications**: Summarization, sentiment analysis, Q&A
- **RAG**: Embeddings, retrieval, augmented generation
- **Fine-tuning**: LoRA, QLoRA, training data formats