Let's start by installing the Python libraries we need.

https://github.com/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_RAG_Lifecycle.ipynb

In [None]:

!pip install -U llama-stack-client python-dotenv


When running this code in a regular Python application, we would usually read environment variables from a `.env` file. For this lab, we hard-code them in this cell to keep things clear.

In [None]:
import os

# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()

# For our lab, we will just define our variables manually here:
os.environ['LLAMA_STACK_SERVER'] = 'http://localhost:8323'
os.environ['LLAMA_STACK_MODEL'] = 'meta-llama/Llama-3.2-3B-Instruct'
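
Outside the lab, a common alternative to hard-coding is to read each value from the environment and fall back to a default only when it is missing. The sketch below illustrates that pattern with the standard library only; `get_setting` is a hypothetical helper introduced here for illustration, not part of `dotenv` or `llama-stack-client`, and the defaults are simply the lab values from the cell above.

```python
import os
from typing import Optional


def get_setting(name: str, default: Optional[str] = None) -> str:
    """Return the environment variable `name`, or `default` if it is unset or empty.

    Hypothetical helper for illustration; raises if no value is available.
    """
    value = os.environ.get(name) or default
    if value is None:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# The defaults mirror the lab values; a real deployment would set these
# variables in the environment or in a .env file instead.
server = get_setting("LLAMA_STACK_SERVER", "http://localhost:8323")
model = get_setting("LLAMA_STACK_MODEL", "meta-llama/Llama-3.2-3B-Instruct")
print(f"server={server} model={model}")
```

With this shape, the same notebook code can run unchanged against another server as long as the variables are set before it starts.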

As a first step, let's define our client, give it the location of our Llama Stack server, and select the model we would like to work with. Later, we will see that pointing the client at a different Llama Stack server is all we need to do to move to a production environment.

In [None]:
from llama_stack_client import LlamaStackClient

LLAMA_STACK_SERVER = os.getenv("LLAMA_STACK_SERVER")
LLAMA_STACK_MODEL = os.getenv("LLAMA_STACK_MODEL")

client = LlamaStackClient(base_url=LLAMA_STACK_SERVER)

# List available models
models = client.models.list()
print("--- Available models: ---")
for m in models:
    print(f"{m.identifier} - {m.provider_id} - {m.provider_resource_id}")


Now that our client is set up, let's go through some very simple code snippets to get you familiar with the syntax. If you have used other AI frameworks, this will soon feel familiar: Llama Stack follows similar principles and terminology, while providing a standard interface that lets you quickly swap components in and out.

Let's see which vector DB providers our server supports out of the box, and select one of them.

In [None]:
# Get the provider list and print it out
providers = client.providers.list()
for provider in providers:
    print(provider)

# Collect the vector_io providers into a list (reusing the result fetched above)
vector_providers = [provider for provider in providers if provider.api == "vector_io"]

# In this example we only have one vector_io provider, but another server might have several.
# Here, we simply select the first one.
selected_vector_provider = vector_providers[0]
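
Taking the first `vector_io` provider works for this lab, but it fails with a bare `IndexError` on a server that has none, and it picks arbitrarily when there are several. The following is a minimal sketch of a more defensive selection. `ProviderInfo` here is a stand-in dataclass so the example runs without a server, `pick_provider` is a hypothetical helper, and the sample provider IDs are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ProviderInfo:
    """Stand-in for the objects returned by client.providers.list() (assumption)."""
    api: str
    provider_id: str


def pick_provider(
    providers: Iterable[ProviderInfo], api: str, preferred_id: Optional[str] = None
) -> ProviderInfo:
    """Return the preferred provider for `api`, or the first one if no preference is given."""
    candidates = [p for p in providers if p.api == api]
    if not candidates:
        raise LookupError(f"No provider registered for API '{api}'")
    if preferred_id is None:
        return candidates[0]
    for candidate in candidates:
        if candidate.provider_id == preferred_id:
            return candidate
    raise LookupError(f"Provider '{preferred_id}' not found for API '{api}'")


# Made-up sample data standing in for a real server's provider list.
sample_providers = [
    ProviderInfo(api="inference", provider_id="example-inference"),
    ProviderInfo(api="vector_io", provider_id="example-vectors-a"),
    ProviderInfo(api="vector_io", provider_id="example-vectors-b"),
]
print(pick_provider(sample_providers, "vector_io").provider_id)
print(pick_provider(sample_providers, "vector_io", "example-vectors-b").provider_id)
```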


In [None]:
import uuid

vector_db_id = f"test_vector_db_{uuid.uuid4()}"
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id=selected_vector_provider.provider_id,
)


In [None]:
from llama_stack_client.types import Document

# File names of the torchtune tutorials to index; the full URL is built below
urls = [
    "memory_optimizations.rst",
    "chat.rst",
    "llama3.rst",
    "datasets.rst",
    "qat_finetune.rst",
    "lora_finetune.rst",
]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

In [None]:
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)
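
The `chunk_size_in_tokens=512` argument asks the server to split each document into pieces of roughly that size before embedding them. As a rough mental model only, fixed-size chunking can be sketched as below. This is not the server's implementation: whitespace-separated words stand in for real tokens, and the `chunk_words` helper and its `overlap` parameter are assumptions introduced for illustration.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list:
    """Split `text` into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words. Words approximate tokens here.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = "one two three four five"
print(chunk_words(sample, chunk_size=2))
print(chunk_words(sample, chunk_size=3, overlap=1))
```

Smaller chunks tend to give more precise retrieval hits but less context per hit, so the chunk size is worth tuning against the kind of questions you expect.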

In [None]:
from llama_stack_client import Agent

rag_agent = Agent(
    client,
    model=os.environ['LLAMA_STACK_MODEL'],
    instructions="You are a helpful assistant that can answer questions about the Torchtune project. You should always use the RAG tool to answer questions.",
    tools=[{
        "name": "builtin::rag",
        "args": {"vector_db_ids": [vector_db_id]},
    }],
)

In [None]:
# First, let's come up with a couple of examples to test the agent
examples = [
    {
        "input_query": "What precision formats does torchtune support?",
        "expected_answer": "Torchtune supports two data types for precision: fp32 (full-precision) which uses 4 bytes per model and optimizer parameter, and bfloat16 (half-precision) which uses 2 bytes per model and optimizer parameter."
    },
    {
        "input_query": "What does DoRA stand for in torchtune?",
        "expected_answer": "Weight-Decomposed Low-Rank Adaptation"
    },
    {
        "input_query": "How does the CPUOffloadOptimizer reduce GPU memory usage?",
        "expected_answer": "The CPUOffloadOptimizer reduces GPU memory usage by keeping optimizer states on CPU and performing optimizer steps on CPU. It can also optionally offload gradients to CPU by using offload_gradients=True"
    },
    {
        "input_query": "How do I ensure only LoRA parameters are trainable when fine-tuning?",
        "expected_answer": "You can set only LoRA parameters to trainable using torchtune's utility functions: first fetch all LoRA parameters with lora_params = get_adapter_params(lora_model), then set them as trainable with set_trainable_params(lora_model, lora_params). The LoRA recipe handles this automatically."
    }
]

In [None]:
from rich.pretty import pprint
import rich

for example in examples:
    rag_session_id = rag_agent.create_session(session_name=f"rag_session_{uuid.uuid4()}")
    response = rag_agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": example["input_query"]
            }
        ],
        session_id=rag_session_id,
        stream=False
    )
    rich.print(f"[bold cyan]Question:[/bold cyan] {example['input_query']}")
    rich.print(f"[bold yellow]Agent Answer:[/bold yellow] {response.output_message.content}")

In [None]:

# Retrieve the second session created in the loop above (index 1) and inspect its turns
session_response = client.agents.session.retrieve(agent_id=rag_agent.agent_id, session_id=rag_agent.sessions[1])
pprint(session_response.turns)