# Notebook 03: LlamaStack Core Features

## 🎯 What is This Notebook About?

Welcome! In this notebook, we'll explore **LlamaStack's core capabilities** - the building blocks that make powerful agents possible. Think of these as the agent's "superpowers"!

**What we'll learn:**
1. **Simple Chat** - Basic LLM interactions (the foundation)
2. **RAG (Retrieval Augmented Generation)** - Enhancing LLMs with external knowledge (giving agents access to your docs!)

**Why this matters:**
- Understanding these features helps you build better agents
- Each feature solves a specific problem (chat for Q&A, RAG for knowledge)
- Combining features creates powerful solutions (chat + RAG = smart assistant with your docs!)
- This knowledge prepares you for advanced agent development

---

## ⚙️ Prerequisites

Before starting this notebook, make sure you have:
- ✅ Completed Notebook 02: Building a Simple Agent
- ✅ LlamaStack server running (see Module README for setup)
- ✅ Ollama running with `llama3.2:3b` model
- ✅ Python environment with dependencies installed

**The fun part:** We'll start with simple chat (easy!) and then add RAG so agents can answer questions about your specific IT operations documentation!

---

## 💼 How This Applies to IT Operations

**The problem:** LLMs are trained on general knowledge, but they don't know about YOUR infrastructure, YOUR procedures, YOUR specific configurations. How do you give agents access to your internal knowledge?

**The solution:** RAG (Retrieval Augmented Generation)! You store your IT operations documentation in a vector store, and the agent retrieves relevant context when answering questions. It's like giving the agent access to your internal wiki!

**Real-world impact:**
- **Chat** for general Q&A - "What is a load balancer?" (general knowledge)
- **RAG** for specific knowledge - "How do we restart services in our infrastructure?" (your docs!)
- **Combine both** - Agents can answer general questions AND questions about your specific setup
- **Production-ready** - Give agents access to your runbooks, troubleshooting guides, and documentation

**The fun part:** We'll build a vector store with IT operations documentation and watch the agent retrieve relevant context to answer questions. It's like having an assistant that actually reads your documentation!

---

## 📚 Learning Objectives

By the end of this notebook, you will:
- ✅ Understand LlamaStack's core capabilities (chat and RAG)
- ✅ Know when to use each feature (chat for general Q&A, RAG for domain-specific knowledge)
- ✅ See how features work independently (each solves a different problem)
- ✅ Be ready to explore MCP tools (Notebook 04), Safety (Notebook 05), and Evaluation (Notebook 06)

---

## 🔧 Setup

Let's start by connecting to LlamaStack and verifying everything is working.


In [None]:
# Import required libraries
import os
from llama_stack_client import LlamaStackClient
from termcolor import cprint

# Configuration
llamastack_url = os.getenv("LLAMA_STACK_URL", "http://localhost:8321")
model = os.getenv("LLAMA_MODEL", "ollama/llama3.2:3b")

print(f"📡 LlamaStack URL: {llamastack_url}")
print(f"🤖 Model: {model}")

# Initialize LlamaStack client
client = LlamaStackClient(base_url=llamastack_url)

# Verify connection
try:
    models = client.models.list()
    print("\n✅ Connected to LlamaStack")
    print(f"   Available models: {len(models)}")
except Exception as e:
    print(f"\n❌ Cannot connect to LlamaStack: {e}")
    print("   Please ensure LlamaStack is running:")
    print("   python scripts/start_llama_stack.py")
    raise


---

## Part 1: Simple Chat

**What we're doing:** Learning the basics of LLM interactions - sending messages and getting responses.

**Why:** Chat is the foundation. Everything else builds on this. It's like learning to walk before you run!

### What is Chat?

**Chat** is the most basic way to interact with an LLM. It's a conversation where you send messages and receive responses - simple as that!

**Key Concepts:**
- **Message Types**: System (instructions), User (questions), Assistant (responses)
- **Streaming vs Non-streaming**: Get responses as they're generated (streaming) or wait for complete response (non-streaming)
- **Conversation Context**: The model itself is stateless; it sees earlier turns only because you send the message history with each request (like replaying the conversation so far!)

**When to use Chat:**
- ✅ Simple Q&A ("What is a load balancer?")
- ✅ Text generation (summaries, explanations)
- ✅ Basic reasoning tasks (troubleshooting steps)
- ❌ Don't use when you need external knowledge or tools (use RAG or MCP instead)

---

### Hands-on: Basic Chat Completion

Let's start with the simplest example - a single question and answer.


In [None]:
# Example 1: Basic chat completion
print("=" * 60)
print("Example 1: Basic Chat Completion")
print("=" * 60)

# Create a simple chat completion
response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "What is artificial intelligence in one sentence?"
        }
    ]
)

# Extract and display the response
answer = response.choices[0].message.content
print("\n📝 Question: What is artificial intelligence in one sentence?")
print(f"\n🤖 Answer:\n{answer}\n")


### System Prompts

**System prompts** are instructions that guide the LLM's behavior. They set the "personality" and "role" of the assistant.

**Why use system prompts:**
- Define the assistant's role (e.g., "You are a helpful IT operations assistant")
- Set behavior guidelines
- Provide context about the domain
- Ensure consistent responses
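
As a minimal sketch, the system-prompt pattern can be wrapped in a small helper. The names `build_messages`, `ask`, and the stand-in `fake_client` are hypothetical and exist only for illustration; the one call taken from this notebook is `client.chat.completions.create` and the `response.choices[0].message.content` shape it returns.

```python
from types import SimpleNamespace


def build_messages(question, system_prompt=None):
    """Return a chat message list, with the system prompt first if given."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": question})
    return messages


def ask(client, model, question, system_prompt=None):
    """Send one question and return the assistant's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=build_messages(question, system_prompt),
    )
    return response.choices[0].message.content


# Stand-in client so the sketch runs without a server (hypothetical).
def _fake_create(model, messages):
    reply = f"[{messages[0]['role']}] saw {len(messages)} message(s)"
    message = SimpleNamespace(content=reply)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

print(ask(fake_client, "demo-model", "Is nginx up?", "You are an IT assistant."))
```

With the real `client` and `model` from the Setup cell, `ask(client, model, question, system_prompt)` should behave like the example that follows, just with less repetition.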


In [None]:
# Example 2: Chat with system prompt
print("=" * 60)
print("Example 2: Chat with System Prompt")
print("=" * 60)

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful IT operations assistant. You provide clear, concise answers about IT infrastructure and operations."
        },
        {
            "role": "user",
            "content": "What should I check if a web server is not responding?"
        }
    ]
)

answer = response.choices[0].message.content
print("\n📝 Question: What should I check if a web server is not responding?")
print(f"\n🤖 Answer (with IT operations context):\n{answer}\n")


### Multi-turn Conversations

**Multi-turn conversations** maintain context across multiple exchanges. The model does not store anything between calls: you keep the message list and send it again with each new question, so the LLM can see the earlier turns.

**Why this matters:**
- Natural conversation flow
- Can refer back to earlier topics
- Builds on previous context
- More human-like interaction
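
One way to keep that bookkeeping in one place is a small session object. This is only a sketch: `ChatSession` and `echo_send` are hypothetical names, and `send` stands for any callable that takes the message list and returns the reply text.

```python
class ChatSession:
    """Keep the running message list so every request carries full context."""

    def __init__(self, send, system_prompt=None):
        # `send` is any callable that maps a message list to reply text.
        self.send = send
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, question):
        self.messages.append({"role": "user", "content": question})
        reply = self.send(self.messages)
        # Store the reply so the next turn can refer back to it.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Stand-in for a real model call: reports how many user turns it can see.
def echo_send(messages):
    turns = [m["content"] for m in messages if m["role"] == "user"]
    return f"I can see {len(turns)} user turn(s)"


session = ChatSession(echo_send)
print(session.ask("I'm setting up a new database server."))
print(session.ask("What about security specifically?"))
```

To talk to LlamaStack instead of the stand-in, `send` could be a function that calls `client.chat.completions.create(model=model, messages=messages)` and returns `choices[0].message.content`, as in the cells of this notebook.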


In [None]:
# Example 3: Multi-turn conversation
print("=" * 60)
print("Example 3: Multi-turn Conversation")
print("=" * 60)

# First turn
messages = [
    {
        "role": "user",
        "content": "I'm setting up a new database server. What should I consider?"
    }
]

response1 = client.chat.completions.create(
    model=model,
    messages=messages
)

answer1 = response1.choices[0].message.content
print("\n📝 Turn 1 - Question: I'm setting up a new database server. What should I consider?")
print(f"\n🤖 Answer:\n{answer1[:200]}...\n")

# Second turn - add previous messages to maintain context
messages.append({
    "role": "assistant",
    "content": answer1
})
messages.append({
    "role": "user",
    "content": "What about security specifically?"
})

response2 = client.chat.completions.create(
    model=model,
    messages=messages
)

answer2 = response2.choices[0].message.content
print("\n📝 Turn 2 - Question: What about security specifically?")
print("   (Note: The assistant knows we're talking about database servers)\n")
print(f"🤖 Answer:\n{answer2[:200]}...\n")


### Streaming Responses

**Streaming** allows you to receive the response as it's being generated, token by token. This provides:
- Faster perceived response time
- Real-time feedback
- Better user experience

**When to use streaming:**
- Long responses
- Interactive applications
- When you want immediate feedback
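
To see what consuming a stream involves without a running server, here is a small sketch that joins text deltas from a fake stream. `collect_stream` and `fake_chunk` are hypothetical helpers; the chunk shape (`chunk.choices[0].delta.content`) mirrors the one used in this notebook, and the idea that some chunks arrive without choices or text is an assumption handled defensively.

```python
from types import SimpleNamespace


def collect_stream(stream, on_text=None):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        # Assumption: some chunks may carry no choices or an empty delta.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_text is not None:
                on_text(text)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like the chunks used in this notebook."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("RAG "),
    SimpleNamespace(choices=[]),  # chunk without choices
    fake_chunk(None),             # chunk with an empty delta
    fake_chunk("retrieves context."),
]

full_response = collect_stream(
    fake_stream, on_text=lambda t: print(t, end="", flush=True)
)
print()
```

The `on_text` callback prints each piece as it arrives, while the return value gives you the complete answer for logging or follow-up turns.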


In [None]:
# Example 4: Streaming response
print("=" * 60)
print("Example 4: Streaming Response")
print("=" * 60)

print("\n📝 Question: Explain what RAG (Retrieval Augmented Generation) is.\n")
print("🤖 Answer (streaming):\n")

# Create streaming completion
stream = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Explain what RAG (Retrieval Augmented Generation) is in 2-3 sentences."
        }
    ],
    stream=True  # Enable streaming
)

# Process stream chunk by chunk
full_response = ""
for chunk in stream:
    # Skip chunks without choices or text (defensive; some servers may send them)
    if chunk.choices and chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        print(content, end="", flush=True)
        full_response += content

print("\n\n✅ Streaming complete!")


---

## Part 2: RAG (Retrieval Augmented Generation)

**What we're doing:** Giving agents access to YOUR documentation - runbooks, troubleshooting guides, procedures, anything!

**Why:** LLMs don't know about your specific infrastructure. RAG lets you store your docs in a vector store, and the agent retrieves relevant context when answering questions. It's like giving the agent access to your internal wiki!

### What is RAG?

**RAG** enhances LLMs with external knowledge by:
1. **Storing documents** in a vector database (vector store) - your docs go here!
2. **Searching** for relevant context when answering questions - semantic search finds the right docs
3. **Augmenting** the LLM's prompt with retrieved context - the agent uses YOUR docs to answer!

**Why RAG matters:**
- LLMs have training data cutoff dates (they don't know about your new systems!)
- Can't access private/internal documents (your runbooks aren't on the internet!)
- RAG provides up-to-date, domain-specific knowledge (your specific procedures!)
- Improves accuracy for specialized topics (your infrastructure, your way!)

**When to use RAG:**
- ✅ Need access to specific documents (runbooks, procedures)
- ✅ Domain-specific knowledge required (your infrastructure)
- ✅ Private/internal information (internal docs, configurations)
- ✅ Up-to-date information needed (current procedures, recent changes)

---

### Hands-on: Creating a Vector Store

Let's create a vector store and add some IT operations documentation.


In [None]:
# Example 1: Create a vector store
print("=" * 60)
print("Example 1: Creating a Vector Store")
print("=" * 60)

# Sample IT operations documentation
it_docs = [
    {
        "id": "doc1",
        "content": "To restart a web server, use: systemctl restart nginx. Check status with: systemctl status nginx."
    },
    {
        "id": "doc2",
        "content": "High CPU usage troubleshooting: 1) Check top processes with 'top' or 'htop', 2) Identify CPU-intensive processes, 3) Check for runaway processes or infinite loops."
    },
    {
        "id": "doc3",
        "content": "Database connection issues: Check firewall rules, verify credentials, ensure database service is running, check network connectivity with 'telnet hostname port'."
    },
    {
        "id": "doc4",
        "content": "Disk space issues: Use 'df -h' to check disk usage, find large files with 'du -sh /*', clean logs with 'journalctl --vacuum-time=7d'."
    },
    {
        "id": "doc5",
        "content": "Service monitoring: Use 'systemctl list-units --type=service' to list all services, 'systemctl is-active servicename' to check status, set up monitoring with Prometheus or Nagios."
    }
]

print("\n📚 Sample IT Operations Documentation:")
for doc in it_docs:
    print(f"   - {doc['id']}: {doc['content'][:60]}...")

print("\n💡 These documents will be stored in a vector store for retrieval.")


In [None]:
# Create vector store using LlamaStack
print("\n" + "=" * 60)
print("Creating Vector Store")
print("=" * 60)

from io import BytesIO

# Step 1: Create files from text content
print(f"\n📝 Creating files from {len(it_docs)} documents...")
file_ids = []

for i, doc in enumerate(it_docs, 1):
    # Create a file-like object from the document content
    file_content = BytesIO(doc["content"].encode('utf-8'))
    file_name = f"doc_{i}.txt"
    
    # Upload file to LlamaStack
    # The API expects a tuple: (filename, file_content, content_type)
    file_obj = (file_name, file_content, 'text/plain')
    
    uploaded_file = client.files.create(
        file=file_obj,
        purpose="assistants"
    )
    file_ids.append(uploaded_file.id)
    print(f"   ✅ Uploaded {file_name} (ID: {uploaded_file.id})")

print(f"\n✅ Created {len(file_ids)} files")

# Step 2: Create vector store with files
print("\n📦 Creating vector store...")
vector_store = client.vector_stores.create(
    name="it-operations-docs",
    file_ids=file_ids,
    metadata={"description": "IT operations documentation and troubleshooting guides"}
)

print("\n✅ Vector store created!")
print(f"   Name: {vector_store.name}")
print(f"   ID: {vector_store.id}")
print(f"   Files: {len(file_ids)}")

# Step 3: Wait for files to be processed (vector stores need time to index files)
print("\n⏳ Waiting for files to be processed and indexed...")
import time

max_wait = 30  # Maximum wait time in seconds
wait_interval = 2  # Check every 2 seconds
elapsed = 0

while elapsed < max_wait:
    # Check vector store status
    vs_status = client.vector_stores.retrieve(vector_store.id)

    # Check if files are processed (status might be in file_counts or similar)
    if hasattr(vs_status, 'file_counts'):
        file_counts = vs_status.file_counts
        if hasattr(file_counts, 'in_progress') and file_counts.in_progress == 0:
            print("   ✅ All files processed!")
            break
    elif hasattr(vs_status, 'status'):
        if vs_status.status == 'completed':
            print("   ✅ Vector store ready!")
            break

    # Check file status directly
    vs_files = client.vector_stores.files.list(vector_store.id)
    if hasattr(vs_files, 'data'):
        processed = sum(1 for f in vs_files.data if hasattr(f, 'status') and f.status == 'completed')
        if processed == len(file_ids):
            print(f"   ✅ All {processed} files processed!")
            break

    print(f"   ⏳ Waiting... ({elapsed}s/{max_wait}s)", end='\r')
    time.sleep(wait_interval)
    elapsed += wait_interval

if elapsed >= max_wait:
    print("\n   ⚠️  Timeout waiting for processing. Files may still be indexing.")
    print("   💡 You can proceed, but search results may be incomplete initially.")
else:
    # Only report readiness when the loop ended before the timeout
    print("\n💡 The vector store is ready for semantic search!")
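
The wait loop above is an instance of a general "poll until ready or time out" pattern. Here is a minimal, reusable sketch of it; `wait_until` and the counter-based demo are hypothetical and not part of LlamaStack.

```python
import time


def wait_until(predicate, timeout=30.0, interval=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Demo with a counter in place of a real status check (hypothetical).
checks = {"count": 0}


def ready_on_third_check():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(ready_on_third_check, timeout=1.0, interval=0.01))
```

In this notebook the predicate could wrap the same status check used above, for example a function that calls `client.vector_stores.retrieve(vector_store.id)` and tests `file_counts.in_progress == 0`, assuming your server reports `file_counts`.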


### Searching for Relevant Context

Once documents are in the vector store, we can search for relevant context based on semantic similarity (meaning, not just keywords).

**How it works:**
1. Convert query to embedding (vector representation)
2. Compare with document embeddings
3. Return most similar documents
4. Use retrieved documents as context for LLM
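
To make steps 1-3 concrete, here is a toy version that runs without any server. It is a sketch under a loud assumption: `embed` is a bag-of-words counter standing in for a real embedding model, so it matches shared words rather than meaning. The ranking step (cosine similarity, best score first) is the part that carries over.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, top_k=2):
    """Return the top_k (score, doc_id) pairs, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)[:top_k]


docs = {
    "doc1": "To restart a web server, use: systemctl restart nginx.",
    "doc4": "Disk space issues: Use 'df -h' to check disk usage.",
}

for score, doc_id in search("How do I restart a web server?", docs):
    print(f"{doc_id}: {score:.3f}")
```

A real vector store replaces `embed` with a learned embedding model, which is why it can match "reboot the site" to a document about restarting nginx even when no words overlap.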


In [None]:
# Example 2: Search for relevant context using LlamaStack API
print("=" * 60)
print("Example 2: Searching Vector Store")
print("=" * 60)

query = "How do I restart a web server?"
print(f"\n🔍 Query: {query}\n")

# Search the vector store using LlamaStack API
search_results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query=query,
    max_num_results=2
)

print("📚 Retrieved Documents (from vector store):")
print(f"   Found {len(search_results.data)} results\n")

if len(search_results.data) == 0:
    print("   ⚠️  No results found. This might mean:")
    print("      - Files are still being processed/indexed")
    print("      - Try waiting a few seconds and searching again")
    print("      - Or check if files were added correctly to the vector store")
    print("\n   💡 For demonstration, we'll use the original documents:")
    # Fallback to original documents for demonstration
    for i, doc in enumerate(it_docs[:2], 1):
        if "restart" in doc["content"].lower() or "web server" in doc["content"].lower():
            print(f"\n   {i}. {doc['id']}:")
            print(f"      {doc['content']}")
else:
    for i, result in enumerate(search_results.data, 1):
        print(f"   {i}. ", end="")
        # The result contains the document content and score
        if hasattr(result, 'score'):
            print(f"Score: {result.score:.3f}")
        content = getattr(result, 'content', None)
        if isinstance(content, (list, tuple)):
            # Content may be a list of chunks; join the text of each one
            content = " ".join(getattr(c, 'text', str(c)) for c in content)
        if content:
            print(f"      Content: {content[:150]}...")
        elif hasattr(result, 'text') and result.text:
            print(f"      Text: {result.text[:150]}...")
        elif hasattr(result, 'document') and result.document:
            print(f"      Document: {str(result.document)[:150]}...")
        else:
            # Try to get any text-like attribute
            result_str = str(result)
            print(f"      Result: {result_str[:150]}...")
        print()

print("\n💡 These documents were retrieved using semantic search (embeddings).")
print("   They will be used as context for the LLM.")


### Using Retrieved Context in Chat

Now let's use the retrieved documents as context for the LLM. This is the "Augmented Generation" part of RAG.
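
Before calling the model, it helps to see the augmentation step in isolation. The sketch below turns search results into a prompt; `result_text` and `build_rag_prompt` are hypothetical helpers, and the assumption that a result's `content` can be either a plain string or a list of chunks with a `.text` field is mine, so check the shape your server actually returns.

```python
from types import SimpleNamespace


def result_text(result):
    """Return the text of one search result as a plain string."""
    content = getattr(result, "content", None)
    if isinstance(content, str):
        return content
    if isinstance(content, (list, tuple)):
        # Assumed shape: a list of chunks that each expose a `.text` field.
        return "\n".join(getattr(chunk, "text", str(chunk)) for chunk in content)
    return getattr(result, "text", "") or ""


def build_rag_prompt(query, results):
    """Number the retrieved passages and place them ahead of the question."""
    passages = [
        f"Document {i}:\n{result_text(r)}" for i, r in enumerate(results, 1)
    ]
    context = "\n\n".join(passages)
    return (
        "Use the following IT operations documentation to answer the question.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Answer based on the documentation provided:"
    )


# Stand-in results covering both shapes (hypothetical objects).
results = [
    SimpleNamespace(content=[SimpleNamespace(text="Use: systemctl restart nginx.")]),
    SimpleNamespace(content="Check status with: systemctl status nginx."),
]
print(build_rag_prompt("How do I restart a web server?", results))
```

Printing the prompt like this is a handy debugging habit: if the answer looks wrong, you can tell at a glance whether retrieval or generation is to blame.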


In [None]:
# Example 3: RAG - Using retrieved context in chat
print("=" * 60)
print("Example 3: RAG - Chat with Retrieved Context")
print("=" * 60)

query = "How do I restart a web server?"
print(f"\n📝 Question: {query}\n")

# Search the vector store for relevant context
search_results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query=query,
    max_num_results=2
)

# Build context from retrieved documents
context_parts = []
for i, result in enumerate(search_results.data, 1):
    # Extract content from result
    content = getattr(result, 'content', None) or getattr(result, 'text', None)
    if isinstance(content, (list, tuple)):
        # Content may be a list of chunks; join the text of each one
        content = "\n".join(getattr(c, 'text', str(c)) for c in content)
    if not content:
        # No text available; fall back to a placeholder
        content = f"Document {i} (score: {result.score:.3f})"

    context_parts.append(f"Document {i}:\n{content}")

context = "\n\n".join(context_parts)

# Create prompt with context
prompt = f"""Use the following IT operations documentation to answer the question.

Documentation:
{context}

Question: {query}

Answer based on the documentation provided:"""

print(f"📚 Context Retrieved from Vector Store:\n{context[:300]}...\n")

# Get response with context
response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful IT operations assistant. Answer questions based on the provided documentation."
        },
        {
            "role": "user",
            "content": prompt
        }
    ]
)

answer = response.choices[0].message.content
print(f"🤖 Answer (with RAG context):\n{answer}\n")
print("✅ Notice how the answer uses the specific documentation retrieved from the vector store!")


---

## 🎓 Key Takeaways

**What we learned:**

1. **Simple Chat** is the foundation - basic LLM interactions for Q&A and text generation
2. **RAG (Retrieval Augmented Generation)** gives agents access to YOUR documentation - store docs, retrieve context, answer questions!
3. **System prompts** guide the LLM's behavior - set the role, personality, and domain
4. **Multi-turn conversations** maintain context - you resend the message history so the model can see earlier turns
5. **Streaming** provides real-time feedback - see responses as they're generated

**The big picture:**
- **Chat** for general Q&A - when you need general knowledge
- **RAG** for domain-specific knowledge - when you need YOUR docs
- **Combine both** - agents can answer general questions AND questions about your specific setup

**For IT operations:**
- Use **Chat** for general IT questions ("What is a load balancer?")
- Use **RAG** for your specific procedures ("How do we restart services in our infrastructure?")
- Store your runbooks, troubleshooting guides, and documentation in vector stores
- Give agents access to your internal knowledge base
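
One possible way to "combine both" is to fall back to plain chat when nothing relevant is retrieved. This is only a sketch: `route_question` is a hypothetical helper, the `(score, text)` pairs and their scores are made up for illustration, and the `0.5` threshold is an arbitrary assumption you would tune against your own data.

```python
def route_question(question, results, min_score=0.5):
    """Return chat messages: RAG-style if any result clears `min_score`."""
    passages = [text for score, text in results if score >= min_score]
    if not passages:
        # Nothing relevant enough was retrieved: plain chat.
        return [{"role": "user", "content": question}]
    context = "\n\n".join(passages)
    return [
        {"role": "system", "content": "Answer based on the provided documentation."},
        {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
    ]


general = route_question(
    "What is a load balancer?",
    [(0.12, "Disk space issues: use 'df -h'.")],
)
specific = route_question(
    "How do we restart nginx?",
    [(0.83, "To restart a web server, use: systemctl restart nginx.")],
)
print(len(general), len(specific))
```

The returned list can be passed straight to `client.chat.completions.create(model=model, messages=...)`, so the same call serves both general and documentation-backed questions.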

---

## 🚀 Next Steps

**Ready for more?** In **Notebook 04**, we'll explore:
- **MCP (Model Context Protocol)** - External tool integration (give agents access to APIs, databases, commands!)
- **How to integrate tools** with agents (connect to your monitoring systems, ticketing systems, etc.)
- **Building production-ready agents** that can both answer questions AND take actions

**The fun part:** You'll learn how to give agents access to your IT infrastructure tools - monitoring APIs, service management, databases, anything!

---

**Ready?** Let's move to Notebook 04: MCP Tools! 🚀
