# Notebook 03: LlamaStack Core Features

## 🎯 What is This Notebook About?

Welcome to Notebook 03! This notebook explores **LlamaStack's core capabilities** - Chat and RAG (Retrieval Augmented Generation) - the building blocks that make powerful agents possible.

**What we'll do:**
- Understand Simple Chat - basic LLM interactions for Q&A and text generation
- Explore RAG - giving agents access to your documentation and knowledge bases
- Learn when to use each feature and how they work together
- See practical examples for IT operations use cases

**Why this matters:**
- Chat is the foundation - everything else builds on this
- RAG gives agents access to YOUR documentation (runbooks, procedures, internal docs)
- Understanding these features helps you build better agents
- This knowledge prepares you for advanced agent development (MCP tools, safety, evaluation)

**The big picture:**
- **Chat** = General Q&A using the LLM's training data
- **RAG** = Domain-specific Q&A using YOUR documents
- **Together** = Agents that can answer general questions AND questions about your specific infrastructure

**Real-world impact:**
- **Chat:** Answer general IT questions ("What is a load balancer?")
- **RAG:** Answer questions about YOUR infrastructure ("How do we restart services in our setup?")
- **Combined:** Agents that understand both general IT concepts and your specific procedures

---

## 📚 Key Concepts Explained

### Simple Chat

**Chat** is the most basic way to interact with an LLM - a conversation where you send messages and receive responses.

**What it is:** Direct interaction with an LLM - send a message, get a response. That's it!

**Why it matters:** Chat is the foundation. Everything else (RAG, tools, agents) builds on this basic capability.

**Think of it like:** Talking to a knowledgeable colleague. You ask questions, they answer based on what they know.

**How it works:**
1. You send a message (user message)
2. Optionally set context (system prompt - defines the assistant's role)
3. LLM generates a response based on its training data
4. You can continue the conversation (multi-turn) - each request resends the earlier messages, which is how the LLM keeps context

**When to use:**
- ✅ General Q&A ("What is a load balancer?")
- ✅ Text generation (summaries, explanations)
- ✅ Basic reasoning tasks
- ❌ Don't use when you need external knowledge or tools (use RAG or MCP instead)

### RAG (Retrieval Augmented Generation)

**RAG** enhances LLMs with external knowledge by retrieving relevant documents and using them as context.

**What it is:** A technique that combines document retrieval with text generation. Store your docs in a vector store, search for relevant context, and use that context to generate better answers.

**Why it matters:** LLMs have training data cutoff dates and can't access private/internal documents. RAG gives agents access to YOUR documentation - runbooks, procedures, internal knowledge bases.

**Think of it like:** A librarian who can search your internal wiki and use those documents to answer questions. The LLM doesn't just rely on its training - it uses YOUR docs!

**How it works:**
1. **Store:** Documents are stored in a vector database (vector store) with embeddings
2. **Search:** When you ask a question, the system searches for semantically similar documents
3. **Retrieve:** Relevant documents are retrieved based on meaning (not just keywords)
4. **Augment:** Retrieved documents are added as context to the LLM prompt
5. **Generate:** The LLM generates an answer using both its training data AND your documents
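
The five steps above can be sketched in plain Python. This is only a toy illustration: the `embed` function is a bag-of-words counter standing in for a real embedding model, and `embed`, `cosine`, `retrieve`, and `augment` are hypothetical names, not part of LlamaStack. The final Generate step is left out because it needs a running model.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts. Real systems use a neural embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# 1. Store: keep each document next to its vector
docs = [
    "To restart a web server, use: systemctl restart nginx.",
    "Disk space issues: use 'df -h' to check disk usage.",
]
store = [(doc, embed(doc)) for doc in docs]


def retrieve(question, k=1):
    """2-3. Search and retrieve: rank stored documents by similarity to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def augment(question, context_docs):
    """4. Augment: place the retrieved documents in front of the question."""
    context = "\n".join(f"- {d}" for d in context_docs)
    return f"Documentation:\n{context}\n\nQuestion: {question}"


question = "How do I restart the web server?"
prompt = augment(question, retrieve(question))
print(prompt)  # 5. Generate: this prompt would be sent to the LLM
```

Because the toy embedding only counts words, it behaves like keyword search; the retrieval in the later steps of this notebook relies on real embeddings instead.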

**When to use:**
- ✅ Need access to specific documents (runbooks, procedures)
- ✅ Domain-specific knowledge required (your infrastructure)
- ✅ Private/internal information (internal docs, configurations)
- ✅ Up-to-date information needed (current procedures, recent changes)

### Vector Stores

A **vector store** is a database that stores documents as embeddings (vector representations) for semantic search.

**What it is:** A specialized database that stores documents and their embeddings, enabling semantic search (finding documents by meaning, not just keywords).

**Why it matters:** Vector stores enable RAG - they're where your documents live and how the system finds relevant context.

**Think of it like:** A smart filing cabinet. Instead of searching by filename or keywords, you search by meaning - "find documents about restarting services" finds relevant docs even if they don't contain those exact words.
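
To make the "search by meaning" idea concrete, here is a minimal in-memory sketch. `TinyVectorStore` is a hypothetical class, and the three-dimensional vectors are hand-assigned for illustration (axes: service restarts, disk usage, networking); a real vector store gets its vectors from an embedding model with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Keeps (text, vector) pairs and returns the closest ones to a query vector."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=1):
        scored = [(cosine_similarity(query_vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = TinyVectorStore()
# Made-up vectors: [service restarts, disk usage, networking]
store.add("Bounce nginx with systemctl", [0.9, 0.1, 0.2])
store.add("Free up space with journalctl --vacuum-time=7d", [0.1, 0.9, 0.0])

# Stand-in vector for the query "How do I restart a web server?"
query_vector = [0.8, 0.0, 0.3]
top_score, top_text = store.search(query_vector, k=1)[0]
print(top_text)
```

The query and the top document share no words, yet their vectors point in nearly the same direction, which is the property a keyword index cannot offer.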

### System Prompts

**System prompts** are instructions that guide the LLM's behavior - they set the "personality" and "role" of the assistant.

**What they are:** Special messages that define how the assistant should behave, what role it plays, and what context it should consider.

**Why they matter:** System prompts ensure consistent, domain-appropriate responses. They tell the LLM "you are an IT operations assistant" so it responds accordingly.

**Think of it like:** A job description. The system prompt tells the LLM what job it's doing and how to do it.

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- ✅ Understand what Chat is and when to use it
- ✅ Understand what RAG is and how it works
- ✅ Know when to use Chat vs. RAG
- ✅ Create vector stores and add documents
- ✅ Use RAG to answer questions with your documentation
- ✅ See how Chat and RAG work together in agents

---

## ⚠️ Prerequisites

Before starting this notebook, make sure you have:
- ✅ Completed Notebook 02: Building a Simple Agent (understanding of agents and tools)
- ✅ LlamaStack server running (see Module README)
- ✅ Python environment with dependencies installed (`llama-stack-client`)
- ✅ Basic understanding of how LLMs work (from Notebook 01)

**The fun part:** No MCP servers needed yet! We're exploring LlamaStack's built-in features.


---

## 📋 Step-by-Step Guide

### Step 1: Setup and Configuration

**What we're doing:** Setting up the environment and connecting to LlamaStack.

**Why:** We need to establish connections before we can explore Chat and RAG features.

**What to expect:**
- Import required libraries
- Load configuration from shared config system
- Connect to LlamaStack server
- Verify everything is working


In [None]:
# Import required libraries
import sys
from pathlib import Path
from llama_stack_client import LlamaStackClient
from termcolor import cprint

# Add src directory to path for shared configuration
root_dir = Path("../..").resolve()
sys.path.insert(0, str(root_dir / "src"))

# Import centralized configuration
from config import LLAMA_STACK_URL, MODEL, CONFIG

print("✅ Libraries imported successfully!")
print(f"📡 LlamaStack URL: {LLAMA_STACK_URL}")
print(f"🤖 Model: {MODEL}")

# Verify configuration
if not LLAMA_STACK_URL:
    raise ValueError(
        "LLAMA_STACK_URL is not configured!\n"
        "Please run: ./scripts/setup-env.sh"
    )

# Initialize LlamaStack client
client = LlamaStackClient(base_url=LLAMA_STACK_URL)

# Verify connection
try:
    models = client.models.list()
    model_count = len(models.data) if hasattr(models, 'data') else len(models)
    print(f"\n✅ Connected to LlamaStack")
    print(f"   Available models: {model_count}")
except Exception as e:
    print(f"\n❌ Cannot connect to LlamaStack: {e}")
    raise

# Use MODEL from config
model = MODEL


**What happened:** After running the code, you should see successful connections to LlamaStack. The configuration is loaded from the shared `src/config.py` system, which auto-detects your environment.

**Key takeaway:** The shared configuration system makes it easy to switch between environments (local, OpenShift, etc.) without changing code.


---

### Step 2: Basic Chat Completion

**What we're doing:** Learning the basics of LLM interactions - sending messages and getting responses.

**Why:** Chat is the foundation. Everything else builds on this. It's like learning to walk before you run!

**What to expect:**
- Send a simple question to the LLM
- Receive a response
- See how basic chat works


In [None]:
# Basic chat completion - simplest example
print("=" * 60)
print("Basic Chat Completion")
print("=" * 60)

# Create a simple chat completion
response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "What is artificial intelligence in one sentence?"
        }
    ]
)

# Extract and display the response
answer = response.choices[0].message.content
print(f"\n📝 Question: What is artificial intelligence in one sentence?")
print(f"\n🤖 Answer:\n{answer}\n")


**What happened:** After running the code, you sent a message to the LLM and received a response. This is the most basic form of interaction - just a question and an answer.

**Key takeaway:** Chat is simple - send a message, get a response. The LLM uses its training data to generate the answer. This is the foundation that everything else builds on.


### Step 3: Chat with System Prompt

**What we're doing:** Using system prompts to guide the LLM's behavior and set its role.

**Why:** System prompts define the assistant's role, personality, and domain expertise. They ensure consistent, domain-appropriate responses.

**What to expect:**
- Set a system prompt that defines the assistant as an IT operations expert
- Ask an IT operations question
- See how the system prompt influences the response


In [None]:
# Chat with system prompt
print("=" * 60)
print("Chat with System Prompt")
print("=" * 60)

response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful IT operations assistant. You provide clear, concise answers about IT infrastructure and operations."
        },
        {
            "role": "user",
            "content": "What should I check if a web server is not responding?"
        }
    ]
)

answer = response.choices[0].message.content
print(f"\n📝 Question: What should I check if a web server is not responding?")
print(f"\n🤖 Answer (with IT operations context):\n{answer}\n")


**What happened:** After running the code, you set a system prompt that defines the assistant as an IT operations expert. Notice how the response is tailored to IT operations - the system prompt influenced the LLM's behavior.

**Key takeaway:** System prompts are powerful - they define the assistant's role and ensure consistent, domain-appropriate responses. Think of them as a job description for the LLM.


### Step 4: Multi-turn Conversations

**What we're doing:** Maintaining context across multiple exchanges in a conversation.

**Why:** The LLM keeps no memory between requests. Sending the previous messages along with each new question gives it the context it needs for a natural conversation flow.

**What to expect:**
- Start a conversation with one question
- Continue the conversation with a follow-up question
- See how the LLM remembers the context from earlier messages


In [None]:
# Multi-turn conversation
print("=" * 60)
print("Multi-turn Conversation")
print("=" * 60)

# First turn
messages = [
    {
        "role": "user",
        "content": "I'm setting up a new database server. What should I consider?"
    }
]

response1 = client.chat.completions.create(
    model=model,
    messages=messages
)

answer1 = response1.choices[0].message.content
print(f"\n📝 Turn 1 - Question: I'm setting up a new database server. What should I consider?")
print(f"\n🤖 Answer:\n{answer1[:200]}...\n")

# Second turn - add previous messages to maintain context
messages.append({
    "role": "assistant",
    "content": answer1
})
messages.append({
    "role": "user",
    "content": "What about security specifically?"
})

response2 = client.chat.completions.create(
    model=model,
    messages=messages
)

answer2 = response2.choices[0].message.content
print(f"\n📝 Turn 2 - Question: What about security specifically?")
print(f"   (Note: The assistant knows we're talking about database servers)\n")
print(f"🤖 Answer:\n{answer2[:200]}...\n")


**What happened:** After running the code, you had a multi-turn conversation. Notice how in Turn 2, the assistant knew you were asking about database server security - the Turn 1 question and answer were sent again as part of the request!

**Key takeaway:** Multi-turn conversations maintain context by including previous messages in every request. The LLM itself is stateless; the growing `messages` list is the memory. This is how agents maintain context across multiple interactions.
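
The bookkeeping of appending each question and answer can be wrapped in a small helper. The sketch below is an assumption-laden illustration: `Conversation` and `fake_complete` are hypothetical names, and `fake_complete` stands in for the model so the example runs without a server.

```python
class Conversation:
    """Accumulates chat messages so every request carries the full history."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, question, complete):
        """Append the question, call `complete(messages)`, and record the answer."""
        self.messages.append({"role": "user", "content": question})
        answer = complete(self.messages)
        self.messages.append({"role": "assistant", "content": answer})
        return answer


def fake_complete(messages):
    """Stand-in for the model: reports how many messages it was sent."""
    return f"(model saw {len(messages)} messages)"


chat = Conversation(system_prompt="You are an IT operations assistant.")
first = chat.ask("I'm setting up a new database server. What should I consider?", fake_complete)
second = chat.ask("What about security specifically?", fake_complete)
print(first)
print(second)
```

With the real client, `complete` could be a small function that calls `client.chat.completions.create(model=model, messages=messages)` and returns `response.choices[0].message.content`, exactly as the cell above does by hand.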


### Step 5: Streaming Responses

**What we're doing:** Receiving responses as they're generated, token by token.

**Why:** Streaming provides faster perceived response time and real-time feedback, creating a better user experience.

**What to expect:**
- Enable streaming in the chat completion
- See the response appear token by token as it's generated
- Understand when to use streaming vs. non-streaming


In [None]:
# Streaming response
print("=" * 60)
print("Streaming Response")
print("=" * 60)

print(f"\n📝 Question: Explain what RAG (Retrieval Augmented Generation) is.\n")
print("🤖 Answer (streaming):\n")

# Create streaming completion
stream = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "user",
            "content": "Explain what RAG (Retrieval Augmented Generation) is in 2-3 sentences."
        }
    ],
    stream=True  # Enable streaming
)

# Process stream chunk by chunk
full_response = ""
for chunk in stream:
    # Some chunks may carry no choices or no text, so skip those safely
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
        full_response += content

print("\n\n✅ Streaming complete!")


**What happened:** After running the code, you saw the response appear token by token as it was generated, rather than waiting for the complete response. This provides immediate feedback and feels more interactive.

**Key takeaway:** Streaming is great for long responses and interactive applications. It provides faster perceived response time - users see results immediately rather than waiting for the complete response. Use streaming when you want real-time feedback.
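
If you stream in several places, the chunk loop can be factored into a reusable function. This is a sketch under the assumption that chunks follow the `choices[0].delta.content` shape used in the cell above; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake stream lets the example run without a server.

```python
from types import SimpleNamespace


def collect_stream(stream, on_token=None):
    """Join the text pieces of a chat stream, skipping chunks without text."""
    parts = []
    for chunk in stream:
        choices = getattr(chunk, "choices", None)
        if not choices:
            continue  # chunk without choices, nothing to show
        text = getattr(choices[0].delta, "content", None)
        if text:
            parts.append(text)
            if on_token:
                on_token(text)  # e.g. print each piece as it arrives
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streaming chunk (for the demo only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [
    fake_chunk("RAG "),
    SimpleNamespace(choices=[]),  # empty chunk
    fake_chunk(None),             # chunk with no text
    fake_chunk("retrieves context."),
]
full_response = collect_stream(fake_stream, on_token=lambda t: print(t, end="", flush=True))
```

Passing `on_token` keeps the printing concern outside the loop, so the same function can feed a terminal, a web socket, or a log.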


### Step 6: Creating a Vector Store for RAG

**What we're doing:** Preparing the sample IT operations documentation that will go into the vector store.

**Why:** Vector stores enable RAG - they store your documents as embeddings, allowing semantic search to find relevant context when answering questions.

**What to expect:**
- Define five short IT operations documents (restarts, CPU, database, disk, monitoring)
- Preview each document
- Uploading and indexing the documents happens in Step 7


In [None]:
# Create sample IT operations documentation
print("=" * 60)
print("Creating Vector Store for RAG")
print("=" * 60)

# Sample IT operations documentation
it_docs = [
    {
        "id": "doc1",
        "content": "To restart a web server, use: systemctl restart nginx. Check status with: systemctl status nginx."
    },
    {
        "id": "doc2",
        "content": "High CPU usage troubleshooting: 1) Check top processes with 'top' or 'htop', 2) Identify CPU-intensive processes, 3) Check for runaway processes or infinite loops."
    },
    {
        "id": "doc3",
        "content": "Database connection issues: Check firewall rules, verify credentials, ensure database service is running, check network connectivity with 'telnet hostname port'."
    },
    {
        "id": "doc4",
        "content": "Disk space issues: Use 'df -h' to check disk usage, find large files with 'du -sh /*', clean logs with 'journalctl --vacuum-time=7d'."
    },
    {
        "id": "doc5",
        "content": "Service monitoring: Use 'systemctl list-units --type=service' to list all services, 'systemctl is-active servicename' to check status, set up monitoring with Prometheus or Nagios."
    }
]

print(f"\n📚 Sample IT Operations Documentation:")
for doc in it_docs:
    print(f"   - {doc['id']}: {doc['content'][:60]}...")

print("\n💡 These documents will be stored in a vector store for retrieval.")


**What happened:** After running the code, you created sample IT operations documentation. These documents represent the kind of knowledge you'd store in a vector store - runbooks, troubleshooting guides, procedures.

**Key takeaway:** Vector stores hold YOUR documentation. Instead of relying only on the LLM's training data, you can store your specific procedures, runbooks, and knowledge bases in a vector store for RAG.


### Step 7: Uploading Documents and Creating Vector Store

**What we're doing:** Uploading documents to LlamaStack and creating a vector store.

**Why:** Documents need to be uploaded and indexed before they can be searched. The vector store organizes documents for semantic search.

**What to expect:**
- Upload each document as a file to LlamaStack
- Create a vector store containing all the files
- Wait for documents to be processed and indexed
- Verify the vector store is ready for search


In [None]:
# Create vector store using LlamaStack
print("\n" + "=" * 60)
print("Creating Vector Store")
print("=" * 60)

from io import BytesIO

# Step 1: Create files from text content
print(f"\n📝 Creating files from {len(it_docs)} documents...")
file_ids = []

for i, doc in enumerate(it_docs, 1):
    # Create a file-like object from the document content
    file_content = BytesIO(doc["content"].encode('utf-8'))
    file_name = f"doc_{i}.txt"
    
    # Upload file to LlamaStack
    # The API expects a tuple: (filename, file_content, content_type)
    file_obj = (file_name, file_content, 'text/plain')
    
    uploaded_file = client.files.create(
        file=file_obj,
        purpose="assistants"
    )
    file_ids.append(uploaded_file.id)
    print(f"   ✅ Uploaded {file_name} (ID: {uploaded_file.id})")

print(f"\n✅ Created {len(file_ids)} files")

# Step 2: Create vector store with files
print(f"\n📦 Creating vector store...")
vector_store = client.vector_stores.create(
    name="it-operations-docs",
    file_ids=file_ids,
    metadata={"description": "IT operations documentation and troubleshooting guides"}
)

print(f"\n✅ Vector store created!")
print(f"   Name: {vector_store.name}")
print(f"   ID: {vector_store.id}")
print(f"   Files: {len(file_ids)}")

# Step 3: Wait for files to be processed (vector stores need time to index files)
print(f"\n⏳ Waiting for files to be processed and indexed...")
import time

max_wait = 30  # Maximum wait time in seconds
wait_interval = 2  # Check every 2 seconds
elapsed = 0

while elapsed < max_wait:
    # Check vector store status
    vs_status = client.vector_stores.retrieve(vector_store.id)
    
    # Check if files are processed (status might be in file_counts or similar)
    if hasattr(vs_status, 'file_counts'):
        file_counts = vs_status.file_counts
        if hasattr(file_counts, 'in_progress') and file_counts.in_progress == 0:
            print(f"   ✅ All files processed!")
            break
    elif hasattr(vs_status, 'status'):
        if vs_status.status == 'completed':
            print(f"   ✅ Vector store ready!")
            break
    
    # Check file status directly
    vs_files = client.vector_stores.files.list(vector_store.id)
    if hasattr(vs_files, 'data'):
        processed = sum(1 for f in vs_files.data if hasattr(f, 'status') and f.status == 'completed')
        if processed == len(file_ids):
            print(f"   ✅ All {processed} files processed!")
            break
    
    print(f"   ⏳ Waiting... ({elapsed}s/{max_wait}s)", end='\r')
    time.sleep(wait_interval)
    elapsed += wait_interval

if elapsed >= max_wait:
    print(f"\n   ⚠️  Timeout waiting for processing. Files may still be indexing.")
    print(f"   💡 You can proceed, but search results may be incomplete initially.")

print(f"\n💡 The vector store is ready for semantic search!")
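
The wait loop in the cell above follows a general "poll until ready or time out" pattern. A minimal sketch of that pattern as a reusable function follows; `wait_until` is a hypothetical helper, not a LlamaStack API, and the readiness checks in the demo are fakes.

```python
import time


def wait_until(is_ready, timeout=30.0, interval=2.0):
    """Call `is_ready()` every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a fake check that succeeds on the third call
checks = iter([False, False, True])
ready = wait_until(lambda: next(checks), timeout=1.0, interval=0.01)

# Demo with a check that never succeeds, so the call times out
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(ready, timed_out)
```

In the notebook, `is_ready` could wrap the same condition the cell checks, for example whether `client.vector_stores.retrieve(vector_store.id).file_counts.in_progress` has reached zero. Using `time.monotonic()` keeps the deadline correct even if the system clock changes.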


**What happened:** After running the code, you uploaded documents to LlamaStack and created a vector store. The documents are being processed and indexed - converted to embeddings that enable semantic search.

**Key takeaway:** Vector stores need time to process documents. Once processed, documents are stored as embeddings, enabling semantic search (finding documents by meaning, not just keywords). This is what makes RAG possible.


### Step 8: Searching the Vector Store

**What we're doing:** Searching the vector store for relevant documents using semantic search.

**Why:** Semantic search finds documents by meaning, not just keywords. This is how RAG retrieves relevant context for answering questions.

**What to expect:**
- Search the vector store with a query
- See relevant documents retrieved based on semantic similarity
- Understand how semantic search works differently from keyword search


In [None]:
# Search the vector store
print("=" * 60)
print("Searching Vector Store")
print("=" * 60)

query = "How do I restart a web server?"
print(f"\n🔍 Query: {query}\n")

# Search the vector store using LlamaStack API
search_results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query=query,
    max_num_results=2
)

print("📚 Retrieved Documents (from vector store):")
print(f"   Found {len(search_results.data)} results\n")

if len(search_results.data) == 0:
    print("   ⚠️  No results found. This might mean:")
    print("      - Files are still being processed/indexed")
    print("      - Try waiting a few seconds and searching again")
    print("      - Or check if files were added correctly to the vector store")
    print("\n   💡 For demonstration, we'll use the original documents:")
    # Fallback to original documents for demonstration
    for i, doc in enumerate(it_docs[:2], 1):
        if "restart" in doc["content"].lower() or "web server" in doc["content"].lower():
            print(f"\n   {i}. {doc['id']}:")
            print(f"      {doc['content']}")
else:
    for i, result in enumerate(search_results.data, 1):
        print(f"   {i}. ", end="")
        # The result contains the document content and score
        if hasattr(result, 'score'):
            print(f"Score: {result.score:.3f}")
        if hasattr(result, 'content') and result.content:
            print(f"      Content: {result.content[:150]}...")
        elif hasattr(result, 'text') and result.text:
            print(f"      Text: {result.text[:150]}...")
        elif hasattr(result, 'document') and result.document:
            print(f"      Document: {str(result.document)[:150]}...")
        else:
            # Try to get any text-like attribute
            result_str = str(result)
            print(f"      Result: {result_str[:150]}...")
        print()

print("\n💡 These documents were retrieved using semantic search (embeddings).")
print("   They will be used as context for the LLM.")


**What happened:** After running the code, you searched the vector store and retrieved the most relevant documents. The search ranks documents by embedding similarity rather than by exact keyword matches - this is semantic search!

**Key takeaway:** Semantic search finds documents by meaning, not just keywords. Here the top result happens to share words with the query ("restart", "web server"), but the same document should also rank highly for a differently worded query such as "bring nginx back up". This is the power of embeddings and semantic search.
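
The search cell probes several attribute names because the exact shape of a search result can differ between client versions. One defensive option is a single helper that returns plain text whatever the shape. The sketch below assumes `content` is either a string or a list of parts carrying a `text` field; that shape is an assumption, and `result_text` is a hypothetical helper, so check it against the objects your server actually returns.

```python
from types import SimpleNamespace


def result_text(result):
    """Return the text of a search result, whether content is a string or a list of parts."""
    content = getattr(result, "content", None)
    if isinstance(content, str):
        return content
    if isinstance(content, (list, tuple)):
        pieces = []
        for part in content:
            # Parts may be objects with a .text attribute or plain dicts
            text = part.get("text") if isinstance(part, dict) else getattr(part, "text", None)
            if text:
                pieces.append(text)
        return "\n".join(pieces)
    # Fall back to a top-level .text attribute, or an empty string
    return getattr(result, "text", None) or ""


# Fake results for the demo (shapes are illustrative, not captured from a server)
as_string = SimpleNamespace(content="systemctl restart nginx", score=0.9)
as_parts = SimpleNamespace(
    content=[
        SimpleNamespace(type="text", text="systemctl restart nginx"),
        {"type": "text", "text": "systemctl status nginx"},
    ],
    score=0.9,
)
print(result_text(as_string))
print(result_text(as_parts))
```

Returning an empty string for unknown shapes lets the caller decide whether to skip the result or fall back to the original documents.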


### Step 9: Using RAG - Chat with Retrieved Context

**What we're doing:** Using retrieved documents as context for the LLM. This is the "Augmented Generation" part of RAG.

**Why:** RAG combines retrieval (finding relevant docs) with generation (using those docs to answer). The LLM uses YOUR documentation to answer questions, not just its training data.

**What to expect:**
- Search the vector store for relevant context
- Build a prompt that includes the retrieved documents
- Get a response that uses YOUR documentation to answer the question
- See how RAG provides domain-specific answers


In [None]:
# RAG - Chat with retrieved context
print("=" * 60)
print("RAG - Chat with Retrieved Context")
print("=" * 60)

query = "How do I restart a web server?"
print(f"\n📝 Question: {query}\n")

# Search the vector store for relevant context
search_results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query=query,
    max_num_results=2
)

# Build context from retrieved documents
context_parts = []
for i, result in enumerate(search_results.data, 1):
    # Extract content from result
    if hasattr(result, 'content') and result.content:
        content = result.content
    elif hasattr(result, 'text') and result.text:
        content = result.text
    else:
        # Try to get content from file if available
        content = f"Document {i} (score: {result.score:.3f})"
    
    context_parts.append(f"Document {i}:\n{content}")

context = "\n\n".join(context_parts)

# Create prompt with context
prompt = f"""Use the following IT operations documentation to answer the question.

Documentation:
{context}

Question: {query}

Answer based on the documentation provided:"""

print(f"📚 Context Retrieved from Vector Store:\n{context[:300]}...\n")

# Get response with context
response = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful IT operations assistant. Answer questions based on the provided documentation."
        },
        {
            "role": "user",
            "content": prompt
        }
    ]
)

answer = response.choices[0].message.content
print(f"🤖 Answer (with RAG context):\n{answer}\n")
print("✅ Notice how the answer uses the specific documentation retrieved from the vector store!")


**What happened:** After running the code, you completed a full RAG cycle: searched the vector store, retrieved relevant documents, and used them as context for the LLM. Notice how the answer uses YOUR documentation (the specific commands from your docs) rather than just general knowledge!

**Key takeaway:** This is RAG in action! The LLM didn't just use its training data - it used YOUR documentation to provide a specific, accurate answer. This is how you give agents access to your internal knowledge bases, runbooks, and procedures.
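
The prompt-building part of the cell above can be isolated into a function that also handles two edge cases: no documents retrieved, and more context than you want to send. This is a sketch; `build_rag_prompt` and its `max_chars` budget are hypothetical, and the prompt wording simply mirrors the one used in this notebook.

```python
def build_rag_prompt(question, passages, max_chars=2000):
    """Assemble a RAG prompt from retrieved passages, within a rough character budget."""
    if not passages:
        # Nothing retrieved: tell the model not to guess
        return (
            "No documentation was retrieved for this question.\n\n"
            f"Question: {question}\n\n"
            "Say that the documentation does not cover this instead of guessing."
        )

    blocks, used = [], 0
    for i, passage in enumerate(passages, 1):
        block = f"Document {i}:\n{passage.strip()}"
        # Always keep the first passage; stop once the budget would be exceeded
        if blocks and used + len(block) > max_chars:
            break
        blocks.append(block)
        used += len(block)

    context = "\n\n".join(blocks)
    return (
        "Use the following IT operations documentation to answer the question.\n\n"
        f"Documentation:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer based on the documentation provided:"
    )


rag_prompt = build_rag_prompt(
    "How do I restart a web server?",
    ["To restart a web server, use: systemctl restart nginx."],
)
empty_prompt = build_rag_prompt("How do I restart a web server?", [])
print(rag_prompt)
```

The resulting string would be sent as the user message, with the system prompt unchanged. A character budget is a crude stand-in for a token budget, but it is enough to keep a runaway context out of the request.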


---

## 🎓 Key Takeaways

**What we learned:**

1. **Simple Chat** is the foundation - basic LLM interactions for Q&A and text generation
2. **RAG (Retrieval Augmented Generation)** gives agents access to YOUR documentation - store docs, retrieve context, answer questions!
3. **System prompts** guide the LLM's behavior - set the role, personality, and domain
4. **Multi-turn conversations** maintain context - agents remember what you talked about
5. **Streaming** provides real-time feedback - see responses as they're generated
6. **Vector stores** enable RAG - store documents as embeddings for semantic search

**The big picture:**
- **Chat** for general Q&A - when you need general knowledge
- **RAG** for domain-specific knowledge - when you need YOUR docs
- **Combine both** - agents can answer general questions AND questions about your specific setup

**For IT operations:**
- Use **Chat** for general IT questions ("What is a load balancer?")
- Use **RAG** for your specific procedures ("How do we restart services in our infrastructure?")
- Store your runbooks, troubleshooting guides, and documentation in vector stores
- Give agents access to your internal knowledge base

**When to use each:**
- **Chat:** General Q&A, text generation, basic reasoning
- **RAG:** Domain-specific knowledge, private/internal docs, up-to-date information
- **Together:** Agents that understand both general concepts and your specific infrastructure

---

## 🔗 Next Steps

**Ready for more?** In **Notebook 04**, we'll explore:
- **MCP (Model Context Protocol)** - External tool integration (give agents access to APIs, databases, commands!)
- **How to integrate tools** with agents (connect to your monitoring systems, ticketing systems, etc.)
- **Building production-ready agents** that can both answer questions AND take actions

**The fun part:** You'll learn how to give agents access to your IT infrastructure tools - monitoring APIs, service management, databases, anything!

**Next notebook:** `04_mcp_tools.ipynb` - MCP Tools and External Integrations

**Related concepts:**
- Client-side tools (covered in Notebook 02)
- Agent safety and evaluation (covered in Notebooks 05-06)
