# Ragas Evaluation with Llama Stack - Demo [remote execution]

This notebook demonstrates how to use the Ragas out-of-tree provider with Llama Stack to evaluate a small RAG dataset, with the evaluation executed remotely.


## 1. Setup and Imports


In [170]:
# Install dev packages if not already installed
# !uv pip install -e ".[distro,dev]"

from datetime import datetime
from rich.pretty import pprint

from llama_stack_client import LlamaStackClient


## 2. Llama Stack Client Setup

- Make sure the Llama Stack server has an inference model registered (model_type='llm')
- Make sure the Llama Stack server has an embedding model registered (model_type='embedding')


In [171]:
# You will need ngrok to enable remote access to your Llama Stack server;
# replace the URL below with your own ngrok URL
client = LlamaStackClient(base_url="https://9ffaa3434eba.ngrok-free.app")
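
The two checks listed at the start of this section can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: it assumes each object returned by `client.models.list()` exposes `identifier` and `model_type` attributes, and the helper names `group_models_by_type` and `missing_model_types` are hypothetical. Stand-in objects are used so the block runs without a server.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_models_by_type(models):
    """Group model identifiers by their model_type attribute."""
    grouped = defaultdict(list)
    for model in models:
        grouped[model.model_type].append(model.identifier)
    return dict(grouped)


def missing_model_types(models, required=("llm", "embedding")):
    """Return the required model types that no registered model provides."""
    available = group_models_by_type(models)
    return [model_type for model_type in required if model_type not in available]


# Stand-in objects (assumption: real entries expose the same two attributes).
fake_models = [
    SimpleNamespace(identifier="granite3.3:2b", model_type="llm"),
    SimpleNamespace(identifier="all-MiniLM-L6-v2", model_type="embedding"),
]

print(group_models_by_type(fake_models))
print(missing_model_types(fake_models))
```

With a live server, passing `client.models.list()` in place of `fake_models` should give the same kind of summary; an empty list from `missing_model_types` means both model types are present.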

## 4. Dataset Preparation

Create a sample RAG evaluation dataset. In a real scenario, you would load your own dataset.


In [172]:
# Sample Ragas evaluation dataset
evaluation_data = [
    {
        "user_input": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "retrieved_contexts": [
            "Paris is the capital and most populous city of France."
        ],
        "reference": "Paris",
    },
    {
        "user_input": "Who invented the telephone?",
        "response": "Alexander Graham Bell invented the telephone in 1876.",
        "retrieved_contexts": [
            "Alexander Graham Bell was a Scottish-American inventor who patented the first practical telephone."
        ],
        "reference": "Alexander Graham Bell",
    },
    {
        "user_input": "What is photosynthesis?",
        "response": "Photosynthesis is the process by which plants convert sunlight into energy.",
        "retrieved_contexts": [
            "Photosynthesis is a process used by plants to convert light energy into chemical energy."
        ],
        "reference": "Photosynthesis is the process by which plants and other organisms convert light energy into chemical energy.",
    },
]
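
When loading your own dataset, it may help to check that every row has the same four fields as the sample rows above before registering it. The following is a minimal sketch under that assumption; `REQUIRED_FIELDS` and `find_row_problems` are hypothetical names, and the provider may accept or require other fields depending on the metrics chosen.

```python
# Fields and types taken from the sample rows in this notebook (assumption:
# the metrics you select need all four of them).
REQUIRED_FIELDS = {
    "user_input": str,
    "response": str,
    "retrieved_contexts": list,
    "reference": str,
}


def find_row_problems(rows):
    """Return (row_index, message) pairs for rows with missing or mistyped fields."""
    problems = []
    for index, row in enumerate(rows):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in row:
                problems.append((index, f"missing field '{field}'"))
            elif not isinstance(row[field], expected_type):
                problems.append((index, f"field '{field}' should be {expected_type.__name__}"))
    return problems


sample_rows = [
    {
        "user_input": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "retrieved_contexts": ["Paris is the capital and most populous city of France."],
        "reference": "Paris",
    },
    # Deliberately incomplete row to show what a problem report looks like.
    {"user_input": "Who invented the telephone?", "response": "Alexander Graham Bell."},
]

print(find_row_problems(sample_rows))
```

Calling `find_row_problems(evaluation_data)` on the dataset defined above should return an empty list.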

## 5. Dataset Registration

Register the dataset with Llama Stack's Datasets API using the direct rows approach.


In [173]:
# Register the dataset
dataset_id = "ragas_demo_dataset_remote"

dataset_response = client.datasets.register(
    dataset_id=dataset_id,
    purpose="eval/question-answer",  # RAG evaluation purpose
    source={"type": "rows", "rows": evaluation_data},
    metadata={
        "provider_id": "localfs",  # set explicitly; there seems to be a bug in the Datasets API
        "description": "Sample RAG evaluation dataset for Ragas demo",
        "size": len(evaluation_data),
        "format": "ragas",
        "created_at": datetime.now().isoformat(),
    },
)
pprint(dataset_response)

INFO:httpx:HTTP Request: POST https://9ffaa3434eba.ngrok-free.app/v1/datasets "HTTP/1.1 200 OK"


## 6. Benchmark Registration

Register a benchmark that defines what metrics to use for evaluation.


In [174]:
benchmark_id = "ragas_demo_benchmark_remote"

ragas_metrics = [
    "answer_relevancy",  # How relevant is the answer to the question?
    # "context_precision",     # How precise are the retrieved contexts?
    # "faithfulness",          # How faithful is the answer to the contexts?
    # "context_recall",        # How much of the ground truth is covered by contexts?
    # "answer_correctness"  # How correct is the answer compared to ground truth?
]

benchmark_response = client.benchmarks.register(
    benchmark_id=benchmark_id,
    dataset_id=dataset_id,
    scoring_functions=ragas_metrics,
    provider_id="trustyai_ragas_remote",
    # metadata={
    #     "provider": "ragas",
    #     "version": "1.0",
    #     "metrics_count": len(ragas_metrics),
    #     "created_at": datetime.now().isoformat()
    # }
)

pprint(benchmark_response)

INFO:httpx:HTTP Request: POST https://9ffaa3434eba.ngrok-free.app/v1/eval/benchmarks "HTTP/1.1 200 OK"


In [175]:
benchmarks = client.benchmarks.list()
pprint(benchmarks[-1:])

INFO:httpx:HTTP Request: GET https://9ffaa3434eba.ngrok-free.app/v1/eval/benchmarks "HTTP/1.1 200 OK"
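
Slicing the last element works only when the benchmark just registered happens to be listed last. A lookup by ID is a little more robust. This is a minimal sketch that assumes each listed benchmark exposes an `identifier` attribute; `find_benchmark` is a hypothetical helper, and stand-in objects are used so the block runs without a server.

```python
from types import SimpleNamespace


def find_benchmark(benchmarks, benchmark_id):
    """Return the first benchmark whose identifier matches, or None if absent."""
    for benchmark in benchmarks:
        if benchmark.identifier == benchmark_id:
            return benchmark
    return None


# Stand-in objects (assumption: real entries expose `identifier`).
fake_benchmarks = [
    SimpleNamespace(identifier="some_other_benchmark", scoring_functions=["faithfulness"]),
    SimpleNamespace(identifier="ragas_demo_benchmark_remote", scoring_functions=["answer_relevancy"]),
]

match = find_benchmark(fake_benchmarks, "ragas_demo_benchmark_remote")
print(match)
```

With a live client, `find_benchmark(client.benchmarks.list(), benchmark_id)` would return `None` if registration did not take effect.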


## 7. Evaluation Execution

Run the evaluation using our Ragas out-of-tree provider.


In [176]:
# Review the settings in distribution/run.yaml. For example, the embedding model
# cannot be set in the benchmark config, so it is set in the distribution's
# run.yaml file (all-MiniLM-L6-v2).

job = client.eval.run_eval(
    benchmark_id=benchmark_id,
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "granite3.3:2b",
            "sampling_params": {"temperature": 0.1, "max_tokens": 100},
        },
        "scoring_params": {},
    },
)
pprint(job)

INFO:httpx:HTTP Request: POST https://9ffaa3434eba.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark_remote/jobs "HTTP/1.1 200 OK"


## 8. Results Display


In [181]:
job = client.eval.jobs.status(benchmark_id=benchmark_id, job_id=job.job_id)
pprint(job)

INFO:llama_stack_client._base_client:Retrying request to /v1/eval/benchmarks/ragas_demo_benchmark_remote/jobs/670d924c-d44f-494d-98f0-b740ff28f399 in 0.450759 seconds
INFO:llama_stack_client._base_client:Retrying request to /v1/eval/benchmarks/ragas_demo_benchmark_remote/jobs/670d924c-d44f-494d-98f0-b740ff28f399 in 0.798969 seconds


APITimeoutError: Request timed out.
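
A status request can time out, as it did above, while the job is still running. One way to handle this is to poll until the job reaches a terminal status and to treat request errors as transient. The sketch below is an illustration, not the provider's recommended approach: `wait_for_job` is a hypothetical helper, the set of terminal status strings is an assumption about the job object's `status` values, and catching every `Exception` is deliberately broad for brevity.

```python
import time
from types import SimpleNamespace

# Assumed terminal values of job.status; adjust if your server reports others.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status, timeout_s=300.0, interval_s=5.0):
    """Poll fetch_status() until the job is in a terminal status or the timeout elapses.

    fetch_status is a zero-argument callable returning an object with a
    `status` attribute. Exceptions it raises are treated as transient and
    the call is retried until the deadline.
    """
    deadline = time.monotonic() + timeout_s
    last_error = None
    while time.monotonic() < deadline:
        try:
            job = fetch_status()
        except Exception as error:  # e.g. a request timeout
            last_error = error
        else:
            if job.status in TERMINAL_STATUSES:
                return job
        time.sleep(interval_s)
    raise TimeoutError(f"job did not finish within {timeout_s} seconds") from last_error


# Demo with a stand-in fetcher: one simulated error, one in-progress poll, then completion.
_responses = iter([RuntimeError("simulated timeout"), "in_progress", "completed"])


def fake_fetch_status():
    item = next(_responses)
    if isinstance(item, Exception):
        raise item
    return SimpleNamespace(status=item)


finished = wait_for_job(fake_fetch_status, timeout_s=5.0, interval_s=0.0)
print(finished.status)
```

Against the live server, the fetcher could be `lambda: client.eval.jobs.status(benchmark_id=benchmark_id, job_id=job_id)`, with `job_id` saved from the `run_eval` response.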

In [178]:
# Wait a bit for the job to complete, then check its status and retrieve the results
job = client.eval.jobs.status(benchmark_id=benchmark_id, job_id=job.job_id)
pprint(job)
results = client.eval.jobs.retrieve(benchmark_id=benchmark_id, job_id=job.job_id)
pprint(results)

INFO:httpx:HTTP Request: GET https://9ffaa3434eba.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark_remote/jobs/670d924c-d44f-494d-98f0-b740ff28f399 "HTTP/1.1 200 OK"


INFO:httpx:HTTP Request: GET https://9ffaa3434eba.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark_remote/jobs/670d924c-d44f-494d-98f0-b740ff28f399/result "HTTP/1.1 200 OK"
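
To get one number per metric from the retrieved results, the per-row scores can be averaged. This is a minimal sketch with several assumptions: that the result exposes a `scores` mapping from metric name to an object with a `score_rows` list, and that each row stores its value under a `"score"` key. `mean_scores` is a hypothetical helper, and the numbers below are made up purely to exercise it; they are not results from this notebook.

```python
from statistics import mean


def mean_scores(score_rows_by_metric, score_key="score"):
    """Average the per-row scores of each metric, skipping rows without a numeric score."""
    summary = {}
    for metric, rows in score_rows_by_metric.items():
        values = [
            row[score_key]
            for row in rows
            if isinstance(row.get(score_key), (int, float))
        ]
        summary[metric] = mean(values) if values else None
    return summary


# Made-up input for illustration only.
example_rows = {
    "answer_relevancy": [{"score": 0.5}, {"score": 1.0}, {"score": None}],
}

print(mean_scores(example_rows))
```

If the assumptions hold, the input could be built from the live result with `{name: result.score_rows for name, result in results.scores.items()}`.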
