# Notebook 03: LlamaStack Core Features

## 🎯 What is This Notebook About?

Welcome to Notebook 03! In this notebook, we'll explore **LlamaStack's core capabilities** - the building blocks that make powerful agents possible.

**What we'll learn:**
1. **Simple Chat** - Basic LLM interactions
2. **RAG (Retrieval Augmented Generation)** - Enhancing LLMs with external knowledge
3. **MCP (Model Context Protocol)** - External tool integration
4. **Safety** - Content moderation and safety shields
5. **Evaluation** - Measuring AI performance

**Why this matters:**
- Understanding these features helps you build better agents
- Each feature solves a specific problem
- Combining features creates powerful solutions
- This knowledge prepares you for advanced agent development

---

## 📚 Learning Objectives

By the end of this notebook, you will:
- ✅ Understand LlamaStack's core capabilities
- ✅ Know when to use each feature
- ✅ See how features work independently
- ✅ Be ready to combine features in agents (Notebook 04)

---

## ⚙️ Prerequisites

- LlamaStack server running (see Module README)
- Ollama running with the `llama3.2:3b` model
- Python environment with dependencies installed

---

## 🔧 Setup

Let's start by connecting to LlamaStack and verifying everything is working.


In [None]:
# Import required libraries
import os
from llama_stack_client import LlamaStackClient
from termcolor import cprint

# Configuration
llamastack_url = os.getenv("LLAMA_STACK_URL", "http://localhost:8321")
model = os.getenv("LLAMA_MODEL", "ollama/llama3.2:3b")

print(f"📡 LlamaStack URL: {llamastack_url}")
print(f"🤖 Model: {model}")

# Initialize LlamaStack client
client = LlamaStackClient(base_url=llamastack_url)

# Verify connection
try:
    models = client.models.list()
    print(f"\n✅ Connected to LlamaStack")
    print(f"   Available models: {len(models)}")
except Exception as e:
    print(f"\n❌ Cannot connect to LlamaStack: {e}")
    print("   Please ensure LlamaStack is running:")
    print("   python scripts/start_llama_stack.py")
    raise


---

## Part 1: Simple Chat

### What is Chat?

**Chat** is the most basic way to interact with an LLM. It's a conversation where you send messages and receive responses.

**Key Concepts:**
- **Message Types**: System (instructions), User (questions), Assistant (responses)
- **Streaming vs Non-streaming**: Get responses as they're generated or wait for the complete response
- **Conversation Context**: The model is stateless; it "remembers" earlier turns only because the client resends the previous messages with each request

**When to use Chat:**
- Simple Q&A
- Text generation
- Basic reasoning tasks
- When you don't need external knowledge or tools

---

### Hands-on: Basic Chat Completion

Let's start with the simplest example - a single question and answer.


In [None]:
# Example 1: Basic chat completion
print("=" * 60)
print("Example 1: Basic Chat Completion")
print("=" * 60)

# Create a simple chat completion
response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "What is artificial intelligence in one sentence?"
        }
    ]
)

# Extract and display the response
answer = response.choices[0].message.content
print(f"\n📝 Question: What is artificial intelligence in one sentence?")
print(f"\n🤖 Answer:\n{answer}\n")


### System Prompts

**System prompts** are instructions that guide the LLM's behavior. They set the "personality" and "role" of the assistant.

**Why use system prompts:**
- Define the assistant's role (e.g., "You are a helpful IT operations assistant")
- Set behavior guidelines
- Provide context about the domain
- Ensure consistent responses


In [None]:
# Example 2: Chat with system prompt
print("=" * 60)
print("Example 2: Chat with System Prompt")
print("=" * 60)

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful IT operations assistant. You provide clear, concise answers about IT infrastructure and operations."
        },
        {
            "role": "user",
            "content": "What should I check if a web server is not responding?"
        }
    ]
)

answer = response.choices[0].message.content
print(f"\n📝 Question: What should I check if a web server is not responding?")
print(f"\n🤖 Answer (with IT operations context):\n{answer}\n")


### Multi-turn Conversations

**Multi-turn conversations** maintain context across multiple exchanges. The model itself keeps no state between requests, so the client appends each answer to the message list and sends the full history on every turn.

**Why this matters:**
- Natural conversation flow
- Can refer back to earlier topics
- Builds on previous context
- More human-like interaction
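
The history-keeping pattern can be wrapped in a small helper. This is a minimal sketch, not part of LlamaStack: `Conversation`, `send_fn`, and `echo_model` are hypothetical names introduced for illustration. `send_fn` is any callable that takes the message list and returns the assistant's text, so the helper can be tried without a running server.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


class Conversation:
    """Keeps the message history and resends all of it on every turn."""

    def __init__(self, send_fn: Callable[[List[Message]], str],
                 system_prompt: Optional[str] = None) -> None:
        self.send_fn = send_fn
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, question: str) -> str:
        # Add the user turn, send the whole history, then store the reply
        self.messages.append({"role": "user", "content": question})
        answer = self.send_fn(list(self.messages))
        self.messages.append({"role": "assistant", "content": answer})
        return answer


def echo_model(messages: List[Message]) -> str:
    """Stand-in for a real model call: reports how much history it received."""
    return f"received {len(messages)} message(s)"


chat = Conversation(echo_model, system_prompt="You are an IT operations assistant.")
first = chat.ask("I'm setting up a new database server. What should I consider?")
second = chat.ask("What about security specifically?")
print(first)
print(second)
```

To use it against the server, `send_fn` could wrap the same `client.chat.completions.create(model=model, messages=...)` call used elsewhere in this notebook and return `response.choices[0].message.content`.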


In [None]:
# Example 3: Multi-turn conversation
print("=" * 60)
print("Example 3: Multi-turn Conversation")
print("=" * 60)

# First turn
messages = [
    {
        "role": "user",
        "content": "I'm setting up a new database server. What should I consider?"
    }
]

response1 = client.chat.completions.create(
    model=model,
    messages=messages
)

answer1 = response1.choices[0].message.content
print(f"\n📝 Turn 1 - Question: I'm setting up a new database server. What should I consider?")
print(f"\n🤖 Answer:\n{answer1[:200]}...\n")

# Second turn - add previous messages to maintain context
messages.append({
    "role": "assistant",
    "content": answer1
})
messages.append({
    "role": "user",
    "content": "What about security specifically?"
})

response2 = client.chat.completions.create(
    model=model,
    messages=messages
)

answer2 = response2.choices[0].message.content
print(f"\n📝 Turn 2 - Question: What about security specifically?")
print(f"   (Note: The assistant knows we're talking about database servers)\n")
print(f"🤖 Answer:\n{answer2[:200]}...\n")


### Streaming Responses

**Streaming** allows you to receive the response as it's being generated, token by token. This provides:
- Faster perceived response time
- Real-time feedback
- Better user experience

**When to use streaming:**
- Long responses
- Interactive applications
- When you want immediate feedback
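
Stream-handling code can be exercised offline before pointing it at a server. The sketch below assumes chunks expose `choices[0].delta.content`, as in the streaming example in this notebook; `fake_stream` and `collect` are hypothetical helpers, and the trailing empty chunk is an assumption about how a stream might end.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator, Optional


def fake_stream(text: str, size: int = 4) -> Iterator[SimpleNamespace]:
    """Yield chunks shaped like chat-completion deltas (hypothetical stand-in)."""
    for start in range(0, len(text), size):
        delta = SimpleNamespace(content=text[start:start + size])
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    # Assumed final chunk that carries no text
    empty = SimpleNamespace(content=None)
    yield SimpleNamespace(choices=[SimpleNamespace(delta=empty)])


def collect(stream: Iterable[SimpleNamespace]) -> str:
    """Print each piece as it arrives and return the joined text."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        piece: Optional[str] = chunk.choices[0].delta.content
        if piece:
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


full_text = collect(fake_stream("RAG retrieves documents before answering."))
```

Collecting pieces in a list and joining once at the end avoids repeated string concatenation on long responses.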


In [None]:
# Example 4: Streaming response
print("=" * 60)
print("Example 4: Streaming Response")
print("=" * 60)

print(f"\n📝 Question: Explain what RAG (Retrieval Augmented Generation) is.\n")
print("🤖 Answer (streaming):\n")

# Create streaming completion
stream = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Explain what RAG (Retrieval Augmented Generation) is in 2-3 sentences."
        }
    ],
    stream=True  # Enable streaming
)

# Process stream chunk by chunk
full_response = ""
for chunk in stream:
    # Some chunks may carry no choices or no text, so guard before reading
    if chunk.choices and chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        print(content, end="", flush=True)
        full_response += content

print("\n\n✅ Streaming complete!")


---

## Part 2: RAG (Retrieval Augmented Generation)

### What is RAG?

**RAG** enhances LLMs with external knowledge by:
1. **Storing documents** in a vector database (vector store)
2. **Searching** for relevant context when answering questions
3. **Augmenting** the LLM's prompt with retrieved context

**Why RAG matters:**
- LLMs have training data cutoff dates
- Can't access private/internal documents
- RAG provides up-to-date, domain-specific knowledge
- Improves accuracy for specialized topics

**When to use RAG:**
- Need access to specific documents
- Domain-specific knowledge required
- Private/internal information
- Up-to-date information needed

---

### Hands-on: Creating a Vector Store

Let's create a vector store and add some IT operations documentation.


In [None]:
# Example 1: Create a vector store
print("=" * 60)
print("Example 1: Creating a Vector Store")
print("=" * 60)

# Sample IT operations documentation
it_docs = [
    {
        "id": "doc1",
        "content": "To restart a web server, use: systemctl restart nginx. Check status with: systemctl status nginx."
    },
    {
        "id": "doc2",
        "content": "High CPU usage troubleshooting: 1) Check top processes with 'top' or 'htop', 2) Identify CPU-intensive processes, 3) Check for runaway processes or infinite loops."
    },
    {
        "id": "doc3",
        "content": "Database connection issues: Check firewall rules, verify credentials, ensure database service is running, check network connectivity with 'telnet hostname port'."
    },
    {
        "id": "doc4",
        "content": "Disk space issues: Use 'df -h' to check disk usage, find large files with 'du -sh /*', clean logs with 'journalctl --vacuum-time=7d'."
    },
    {
        "id": "doc5",
        "content": "Service monitoring: Use 'systemctl list-units --type=service' to list all services, 'systemctl is-active servicename' to check status, set up monitoring with Prometheus or Nagios."
    }
]

print(f"\n📚 Sample IT Operations Documentation:")
for doc in it_docs:
    print(f"   - {doc['id']}: {doc['content'][:60]}...")

print("\n💡 These documents will be stored in a vector store for retrieval.")


In [None]:
# Create vector store using LlamaStack
print("\n" + "=" * 60)
print("Creating Vector Store")
print("=" * 60)

try:
    # Create a vector store
    # The exact vector store calls depend on your LlamaStack client version;
    # the except branch below handles a mismatch
    vector_store = client.vector_stores.create(
        name="it-operations-docs",
        description="IT operations documentation and troubleshooting guides"
    )
    # Keep the ID so later cells can refer to the store
    vector_store_id = vector_store.id

    print(f"\n✅ Vector store created!")
    print(f"   Name: {vector_store.name}")
    print(f"   ID: {vector_store_id}")

    # Add documents to the vector store
    print(f"\n📝 Adding {len(it_docs)} documents to vector store...")

    for doc in it_docs:
        client.vector_stores.documents.create(
            vector_store_id=vector_store_id,
            content=doc["content"],
            metadata={"doc_id": doc["id"]}
        )

    print(f"✅ All documents added successfully!")

except Exception as e:
    print(f"\n⚠️  Note: Vector store API may vary. Error: {e}")
    print("   This is a demonstration of the concept.")
    print("   In practice, you would use the appropriate LlamaStack vector store API.")

    # Fall back to a placeholder ID so later cells still have one to use
    vector_store_id = "demo_vector_store"
    print(f"\n💡 For this demo, we'll use a simulated vector store ID: {vector_store_id}")


### Searching for Relevant Context

Once documents are in the vector store, we can search for relevant context based on semantic similarity (meaning, not just keywords).

**How it works:**
1. Convert query to embedding (vector representation)
2. Compare with document embeddings
3. Return most similar documents
4. Use retrieved documents as context for LLM
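
The four steps above can be sketched with a toy embedding. This is an illustration under stated assumptions: `embed` uses plain word counts rather than a learned embedding model, and `embed`, `cosine`, `retrieve`, and `sample_docs` are hypothetical names. Similarity is the cosine of the angle between two vectors: their dot product divided by the product of their lengths.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, docs: Dict[str, str], top_k: int = 2) -> List[Tuple[str, float]]:
    """Embed the query, score every document, return the top_k best matches."""
    query_vec = embed(query)
    scored = [(doc_id, cosine(query_vec, embed(text))) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


sample_docs = {
    "restart": "To restart a web server, use systemctl restart nginx.",
    "disk": "Use df -h to check disk usage.",
}
ranked = retrieve("How do I restart a web server?", sample_docs, top_k=1)
print(ranked)
```

Word counts only match exact terms. A learned embedding would also place "reboot" near "restart", which is what makes the search semantic rather than keyword-based.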


In [None]:
# Example 2: Search for relevant context
print("=" * 60)
print("Example 2: Searching Vector Store")
print("=" * 60)

# Simulate vector store search (in practice, use LlamaStack API)
import re

def search_vector_store(query, documents, top_k=2):
    """Simulate semantic search - in practice, this uses embeddings"""
    # Simple keyword matching for demo (real RAG uses semantic similarity).
    # Compare whole words so "a" or "i" cannot match inside other words
    # and trailing punctuation ("server?") does not block a match.
    query_words = set(re.findall(r"\w+", query.lower()))
    results = []

    for doc in documents:
        doc_words = set(re.findall(r"\w+", doc["content"].lower()))
        score = len(query_words & doc_words)
        if score > 0:
            results.append((doc, score))

    # Sort by score and return top_k
    results.sort(key=lambda x: x[1], reverse=True)
    return [doc for doc, score in results[:top_k]]

query = "How do I restart a web server?"
print(f"\n🔍 Query: {query}\n")

# Search for relevant documents
relevant_docs = search_vector_store(query, it_docs, top_k=2)

print("📚 Retrieved Documents:")
for i, doc in enumerate(relevant_docs, 1):
    print(f"\n   {i}. {doc['id']}:")
    print(f"      {doc['content']}")

print("\n💡 These documents will be used as context for the LLM.")


### Using Retrieved Context in Chat

Now let's use the retrieved documents as context for the LLM. This is the "Augmented Generation" part of RAG.


In [None]:
# Example 3: RAG - Using retrieved context in chat
print("=" * 60)
print("Example 3: RAG - Chat with Retrieved Context")
print("=" * 60)

query = "How do I restart a web server?"
relevant_docs = search_vector_store(query, it_docs, top_k=2)

# Build context from retrieved documents
context = "\n\n".join([f"Document {doc['id']}: {doc['content']}" for doc in relevant_docs])

# Create prompt with context
prompt = f"""Use the following IT operations documentation to answer the question.

Documentation:
{context}

Question: {query}

Answer based on the documentation provided:"""

print(f"\n📝 Question: {query}\n")
print(f"📚 Context Retrieved:\n{context}\n")

# Get response with context
response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful IT operations assistant. Answer questions based on the provided documentation."
        },
        {
            "role": "user",
            "content": prompt
        }
    ]
)

answer = response.choices[0].message.content
print(f"🤖 Answer (with RAG context):\n{answer}\n")
print("✅ Notice how the answer uses the specific documentation provided!")


---

## Part 3: MCP (Model Context Protocol)

### What is MCP?

**MCP (Model Context Protocol)** is a protocol for integrating external tools and services with LLMs. It allows agents to:
- **Call external APIs** (e.g., check service status, restart services)
- **Access databases** (e.g., query incident logs)
- **Execute commands** (e.g., run system commands)
- **Integrate with other systems** (e.g., monitoring tools, ticketing systems)

**Why MCP matters:**
- LLMs can't directly interact with systems
- MCP provides a standardized way to connect tools
- Enables agents to take real actions
- Makes agents more powerful and useful

**When to use MCP:**
- Need to interact with external systems
- Want agents to take actions (not just answer questions)
- Need real-time data from APIs
- Want to integrate with existing tools

---

### Hands-on: Exploring Tool Runtime

Let's explore what tools are available and how they work.


In [None]:
# Example 1: List available tools
print("=" * 60)
print("Example 1: Exploring Tool Runtime")
print("=" * 60)

try:
    # List available tool runtimes
    # Resource and attribute names here may differ between LlamaStack client versions
    tool_runtimes = client.tool_runtimes.list()
    print(f"\n✅ Found {len(tool_runtimes)} tool runtime(s)")

    for runtime in tool_runtimes:
        print(f"\n   Runtime: {runtime.name}")
        print(f"   Type: {runtime.type}")

        # List tools in this runtime
        tools = client.tools.list(runtime_id=runtime.id)
        print(f"   Available tools: {len(tools)}")

        for tool in tools[:5]:  # Show first 5 tools
            # A tool may have no description, so fall back to an empty string
            print(f"      - {tool.name}: {(tool.description or '')[:60]}...")

        if len(tools) > 5:
            print(f"      ... and {len(tools) - 5} more")

except Exception as e:
    print(f"\n⚠️  Note: Tool runtime API may vary. Error: {e}")
    print("   This is a demonstration of the concept.")
    print("\n💡 In practice, MCP tools allow agents to:")
    print("   - Call external APIs")
    print("   - Execute system commands")
    print("   - Query databases")
    print("   - Integrate with monitoring systems")


### Understanding Tool Execution

Tools are functions that agents can call. When an agent needs to perform an action, it:
1. **Decides** which tool to use
2. **Calls** the tool with appropriate parameters
3. **Receives** the result
4. **Uses** the result to continue reasoning

**Tool Structure:**
- **Name**: Identifies the tool
- **Description**: Tells the LLM what the tool does
- **Parameters**: What inputs the tool needs
- **Returns**: What the tool outputs
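
The call cycle and tool structure above can be sketched as a small registry. This is an assumption-laden illustration, not the LlamaStack or MCP API: `TOOLS`, `register_tool`, `execute_tool_call`, and the JSON shape of the tool call are hypothetical, and the model's decision (step 1) is hard-coded.

```python
import json
from typing import Any, Callable, Dict

# Registry: tool name -> metadata the LLM would see, plus the function itself
TOOLS: Dict[str, Dict[str, Any]] = {}


def register_tool(name: str, description: str, parameters: Dict[str, str]):
    """Decorator that records a function together with its name, description, and parameters."""
    def wrap(func: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = {
            "description": description,
            "parameters": parameters,
            "func": func,
        }
        return func
    return wrap


@register_tool(
    name="check_service_status",
    description="Check the status of a system service",
    parameters={"service_name": "Name of the service, e.g. 'nginx'"},
)
def check_service_status(service_name: str) -> str:
    # Fixed answer so the example is deterministic
    return f"Service '{service_name}' is running."


def execute_tool_call(tool_call_json: str) -> str:
    """Steps 2 and 3: look up the requested tool, call it, return the result."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return f"Unknown tool: {call['name']}"
    return tool["func"](**call.get("arguments", {}))


# Step 1 would be the model's decision; here it is hard-coded (hypothetical format)
model_decision = '{"name": "check_service_status", "arguments": {"service_name": "nginx"}}'
result = execute_tool_call(model_decision)
print(result)
```

In step 4 the returned string would be appended to the conversation so the model can use it in its next response.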


In [None]:
# Example 2: Create a simple custom tool
print("=" * 60)
print("Example 2: Creating a Custom Tool")
print("=" * 60)

# Define a simple tool function
def check_service_status(service_name: str) -> str:
    """
    Check the status of a system service.
    
    Args:
        service_name: Name of the service to check (e.g., 'nginx', 'mysql')
    
    Returns:
        Status of the service: 'running', 'stopped', or 'not found'
    """
    # Simulate service check (in practice, this would call systemctl or similar)
    import random
    statuses = ['running', 'stopped', 'not found']
    status = random.choice(statuses)
    
    return f"Service '{service_name}' is {status}."

# Test the tool
print("\n🔧 Custom Tool: check_service_status")
print("   Description: Check the status of a system service")
print("   Parameters: service_name (str)")
print("\n📝 Testing tool:")
result = check_service_status("nginx")
print(f"   check_service_status('nginx') → {result}")

print("\n💡 In Notebook 02, we saw how to use tools with agents.")
print("   Tools enable agents to take actions, not just answer questions.")


### Tool Integration Patterns

**Common patterns for tool integration:**
1. **Client-side tools**: Python functions that run in your process
2. **Server-side tools**: Tools registered with LlamaStack server
3. **MCP tools**: Tools accessed via Model Context Protocol
4. **API tools**: Tools that call external REST APIs

**Best practices:**
- Provide clear descriptions so the LLM knows when to use each tool
- Handle errors gracefully
- Return structured data when possible
- Log tool calls for debugging
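
Three of these practices (graceful errors, structured results, logging) can be combined in one decorator. This is a minimal sketch; `safe_tool` and `disk_usage_percent` are hypothetical names, and the `ok`/`result`/`error` dictionary shape is one possible convention rather than a LlamaStack requirement.

```python
import functools
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger("tools")


def safe_tool(func: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
    """Wrap a tool so it logs each call and always returns a structured dict."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        logger.info("tool=%s args=%s kwargs=%s", func.__name__, args, kwargs)
        try:
            return {"ok": True, "result": func(*args, **kwargs), "error": None}
        except Exception as exc:  # report the failure instead of raising
            logger.warning("tool=%s failed: %s", func.__name__, exc)
            return {"ok": False, "result": None, "error": str(exc)}
    return wrapper


@safe_tool
def disk_usage_percent(used_gb: float, total_gb: float) -> float:
    """Hypothetical tool: percentage of disk space in use."""
    return round(100 * used_gb / total_gb, 1)


good = disk_usage_percent(40, 160)
bad = disk_usage_percent(40, 0)
print(good)
print(bad)
```

Returning the error as data lets the agent see what went wrong and decide how to proceed, instead of the whole run stopping on an exception.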


---

## Part 4: Safety

### What is Safety?

**Safety** features protect against harmful or inappropriate content:
- **Content moderation**: Filter inappropriate content
- **Safety shields**: Prevent harmful outputs
- **Safe AI practices**: Guidelines for responsible AI use

**Why safety matters:**
- Prevents harmful outputs
- Protects users and systems
- Ensures responsible AI deployment
- Builds trust in AI systems

**When to use safety:**
- User-facing applications
- Production systems
- When handling sensitive data
- Public-facing agents

---

### Hands-on: Safety Shields

Let's explore how safety features work.


In [None]:
# Example 1: Safety in chat completions
print("=" * 60)
print("Example 1: Safety in Chat")
print("=" * 60)

# Note: Safety shields need to be configured before they screen anything
# The exact API may vary, but the concept is demonstrated here

print("\n💡 Safety features in LlamaStack:")
print("   ✅ Content moderation")
print("   ✅ Safety shields")
print("   ✅ Harmful content detection")
print("   ✅ Safe response generation")

# Example: chat completion on a benign prompt
print("\n📝 Example: chat completion on a benign prompt")
print("   Safety checks run only where a shield is configured")

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "What are best practices for IT security?"
        }
    ],
    # Safety settings (hypothetical parameter; check what your LlamaStack version supports)
    # safety_settings={
    #     "enabled": True,
    #     "moderation_level": "medium"
    # }
)

answer = response.choices[0].message.content
print(f"\n🤖 Response:\n{answer[:200]}...\n")

print("✅ Response generated. Configure a safety shield to have it screened.")


### Content Moderation

**Content moderation** checks inputs and outputs for:
- Inappropriate language
- Harmful content
- Sensitive information
- Policy violations

**Best practices:**
- Enable moderation for user-facing applications
- Configure appropriate moderation levels
- Log moderation events for review
- Provide clear feedback when content is blocked
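
The logging and feedback practices can be sketched with a toy rule-based checker. This is only an illustration: `RULES`, `ModerationResult`, `moderation_log`, and `moderate` are hypothetical, and two regular expressions are no substitute for a real safety model or shield.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical rules for illustration; a real deployment would call a safety model
RULES = {
    "credential": re.compile(r"(password|api[_-]?key)\s*[:=]\s*\S+", re.IGNORECASE),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


@dataclass
class ModerationResult:
    allowed: bool
    violations: List[str] = field(default_factory=list)
    message: str = ""


# Every decision is kept here so it can be reviewed later
moderation_log: List[ModerationResult] = []


def moderate(text: str) -> ModerationResult:
    """Check text against the rules, log the outcome, and explain any block."""
    violations = [name for name, pattern in RULES.items() if pattern.search(text)]
    if violations:
        result = ModerationResult(
            allowed=False,
            violations=violations,
            message="Blocked: message appears to contain " + ", ".join(violations),
        )
    else:
        result = ModerationResult(allowed=True)
    moderation_log.append(result)
    return result


print(moderate("How do I rotate logs on nginx?"))
print(moderate("The db password = hunter2, please store it"))
```

The same function can screen both the user's input before it reaches the model and the model's output before it reaches the user.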


---

## Part 5: Evaluation

### What is Evaluation?

**Evaluation** measures how well your AI system performs. It helps you:
- **Measure performance**: How accurate are responses?
- **Compare models**: Which model works best?
- **Track improvements**: Are changes making things better?
- **Identify issues**: What needs to be fixed?

**Why evaluation matters:**
- Ensures quality before deployment
- Helps choose the right model
- Tracks performance over time
- Builds confidence in AI systems

**When to use evaluation:**
- Before deploying to production
- When comparing different models
- After making changes
- Regular quality checks

---

### Hands-on: Creating an Evaluation Dataset

Let's create a simple evaluation dataset and run evaluations.


In [None]:
# Example 1: Create evaluation dataset
print("=" * 60)
print("Example 1: Creating Evaluation Dataset")
print("=" * 60)

# Sample evaluation dataset for IT operations Q&A
evaluation_dataset = [
    {
        "question": "How do I restart a web server?",
        "expected_topics": ["systemctl", "restart", "nginx", "apache"],
        "category": "troubleshooting"
    },
    {
        "question": "What causes high CPU usage?",
        "expected_topics": ["processes", "top", "htop", "monitoring"],
        "category": "diagnostics"
    },
    {
        "question": "How do I check disk space?",
        "expected_topics": ["df", "du", "disk", "storage"],
        "category": "monitoring"
    }
]

print(f"\n📊 Evaluation Dataset:")
for i, item in enumerate(evaluation_dataset, 1):
    print(f"\n   {i}. Question: {item['question']}")
    print(f"      Category: {item['category']}")
    print(f"      Expected topics: {', '.join(item['expected_topics'])}")

print("\n💡 This dataset can be used to evaluate model performance.")


In [None]:
# Example 2: Run evaluations
import re

print("=" * 60)
print("Example 2: Running Evaluations")
print("=" * 60)

print("\n🔍 Evaluating model responses...\n")

results = []
for item in evaluation_dataset:
    # Get model response
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "user",
                "content": item["question"]
            }
        ]
    )

    answer = response.choices[0].message.content

    # Simple evaluation: Check if expected topics appear in answer as whole words
    # (plain substring checks would let "du" match "during" or "top" match "stop")
    answer_lower = answer.lower()
    found_topics = [
        topic for topic in item["expected_topics"]
        if re.search(rf"\b{re.escape(topic.lower())}\b", answer_lower)
    ]
    score = len(found_topics) / len(item["expected_topics"])

    results.append({
        "question": item["question"],
        "answer": answer[:100] + "...",
        "found_topics": found_topics,
        "score": score
    })

    print(f"📝 Q: {item['question']}")
    print(f"   Found topics: {', '.join(found_topics) if found_topics else 'None'}")
    print(f"   Score: {score:.2%}")
    print()

# Calculate average score
avg_score = sum(r["score"] for r in results) / len(results)
print(f"✅ Average Score: {avg_score:.2%}")
print("\n💡 In practice, you would use more sophisticated evaluation metrics.")


### Understanding Evaluation Metrics

**Common evaluation metrics:**
- **Accuracy**: How often is the answer correct?
- **Relevance**: Does the answer address the question?
- **Completeness**: Does the answer cover all aspects?
- **Latency**: How fast is the response?

**Evaluation workflows:**
1. Create evaluation dataset
2. Run model on dataset
3. Compare outputs to expected results
4. Calculate metrics
5. Analyze results and improve
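
The workflow above can be sketched so that it also records latency and groups scores by category. This is a minimal illustration: `evaluate`, `summarize`, and `stub_model` are hypothetical names, and `stub_model` stands in for a real model call so the example runs offline.

```python
import time
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List


def evaluate(model_fn: Callable[[str], str], dataset: List[dict]) -> List[dict]:
    """Steps 2-4: run the model, compare to expectations, record metrics."""
    rows = []
    for item in dataset:
        start = time.perf_counter()
        answer = model_fn(item["question"])
        latency = time.perf_counter() - start
        # Whole-word comparison against the expected topics
        words = set(answer.lower().split())
        hits = [t for t in item["expected_topics"] if t.lower() in words]
        rows.append({
            "category": item["category"],
            "score": len(hits) / len(item["expected_topics"]),
            "latency_s": latency,
        })
    return rows


def summarize(rows: List[dict]) -> Dict[str, float]:
    """Input for step 5: average score per category."""
    by_category = defaultdict(list)
    for row in rows:
        by_category[row["category"]].append(row["score"])
    return {category: mean(scores) for category, scores in by_category.items()}


def stub_model(question: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "use systemctl restart nginx" if "restart" in question else "not sure"


dataset = [
    {"question": "How do I restart a web server?",
     "expected_topics": ["systemctl", "restart"], "category": "troubleshooting"},
    {"question": "How do I check disk space?",
     "expected_topics": ["df", "du"], "category": "monitoring"},
]
rows = evaluate(stub_model, dataset)
summary = summarize(rows)
print(summary)
```

Grouping by category shows where a model is weak, which a single average score would hide.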


---

## Summary

### When to Use Each Feature

**Simple Chat:**
- ✅ Basic Q&A
- ✅ Text generation
- ✅ Simple reasoning
- ❌ Don't use when you need external knowledge or tools

**RAG:**
- ✅ Need access to specific documents
- ✅ Domain-specific knowledge required
- ✅ Private/internal information
- ❌ Don't use for general knowledge questions

**MCP Tools:**
- ✅ Need to interact with external systems
- ✅ Want agents to take actions
- ✅ Need real-time data
- ❌ Don't use for pure text generation

**Safety:**
- ✅ User-facing applications
- ✅ Production systems
- ✅ Handling sensitive data
- ❌ Not needed for internal/trusted use cases

**Evaluation:**
- ✅ Before deploying to production
- ✅ Comparing different models
- ✅ Tracking performance over time
- ❌ Not needed for one-off experiments

---

### How Features Complement Each Other

**Powerful combinations:**
- **Chat + RAG**: Answer questions with domain knowledge
- **Chat + MCP**: Answer questions and take actions
- **RAG + MCP**: Use knowledge to make informed actions
- **All + Safety**: Production-ready agent with safety checks
- **All + Evaluation**: Measured, safe, powerful agent

---

### Next Steps: Combining in Agents

In **Notebook 04**, we'll combine these features to build:
- **Knowledge-augmented agents** (Chat + RAG)
- **Action-taking agents** (Chat + MCP)
- **Safe agents** (All + Safety)
- **Evaluated agents** (All + Evaluation)

**Ready to build powerful agents?** Let's move to Notebook 04!

---

## 🎓 Key Takeaways

1. **Chat** is the foundation - basic LLM interaction
2. **RAG** adds knowledge - access to documents
3. **MCP** adds actions - interact with systems
4. **Safety** adds protection - responsible AI
5. **Evaluation** adds measurement - ensure quality

**Remember:** Each feature solves a specific problem. Combining them creates powerful solutions!
