# Ragas Evaluation with Llama Stack - Demo [remote execution]

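This notebook demonstrates how to use the Ragas out-of-tree provider for Llama Stack. It runs the same benchmark with the inline and the remote provider and compares the resulting scores.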


## Run your Llama Stack distribution

Nuke any old distro config files you might have lying around (I find these get in the way whenever I change my `.env` variables): 
```bash
ls ~/.llama/distributions/
rm -r ~/.llama/distributions/<name-of-your-distro>
```

Then, run your Llama Stack server with:
```bash
dotenv run uv run llama stack run distribution/run.yaml
```
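Before moving on, it can help to confirm that the server is reachable. The sketch below is one minimal way to do that with the standard library only. It assumes the server listens on `http://localhost:8321` and serves the models listing under `/v1/models` (the path that appears in the request logs later in this notebook). The helper names `models_url` and `server_is_up` are made up for this example.

```python
import urllib.error
import urllib.request


def models_url(base_url: str) -> str:
    """Build the models-listing URL for a Llama Stack server (assumed path: /v1/models)."""
    return base_url.rstrip("/") + "/v1/models"


def server_is_up(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the models endpoint answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response
        return False
```

Calling `server_is_up("http://localhost:8321")` after starting the server should return `True` once it has finished booting. For a remote setup, pass the ngrok URL instead.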




## Setup and Imports


In [2]:
# Install dev packages if not already installed
# !uv pip install -e ".[dev]"

import os
from datetime import datetime

import pandas as pd
from llama_stack_client import LlamaStackClient
from rich.pretty import pprint

from llama_stack_provider_ragas.constants import PROVIDER_ID_INLINE, PROVIDER_ID_REMOTE

## Llama Stack Client Setup

- Make sure we have an inference model (model_type='llm')
- Make sure we have an embedding model (model_type='embedding')


In [3]:
# If using the remote provider, you will need ngrok to enable remote access to your Llama Stack server
# Otherwise, the base_url is just http://localhost:8321
client = LlamaStackClient(
    base_url=os.getenv("KUBEFLOW_LLAMA_STACK_URL", "http://localhost:8321")
)
available_models = client.models.list()
assert any(model.model_type == "llm" for model in available_models)
assert any(model.model_type == "embedding" for model in available_models)

INFO:httpx:HTTP Request: GET https://815a5b807252.ngrok-free.app/v1/models "HTTP/1.1 200 OK"
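A bare `assert` fails without saying which model type is missing. The sketch below reports the missing types by name instead. It relies only on the `model_type` attribute used in the cell above. The helper names are hypothetical, and the stand-in objects exist only so the example runs without a server.

```python
from collections import Counter
from types import SimpleNamespace


def count_model_types(models) -> Counter:
    """Count registered models per model_type (e.g. 'llm', 'embedding')."""
    return Counter(m.model_type for m in models)


def missing_model_types(models, required=("llm", "embedding")) -> list:
    """Return the required model types that have no registered model."""
    counts = count_model_types(models)
    return [t for t in required if counts[t] == 0]


# Stand-in for client.models.list(); a real call returns objects with a model_type attribute
fake_models = [SimpleNamespace(model_type="llm")]
print(missing_model_types(fake_models))  # prints ['embedding']
```

With a live client, `missing_model_types(client.models.list())` should return an empty list before you continue.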


## Dataset Preparation

Create a sample RAG evaluation dataset. In a real scenario, you would load your own dataset.


In [4]:
# Sample Ragas evaluation dataset
evaluation_data = [
    {
        "user_input": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "retrieved_contexts": [
            "Paris is the capital and most populous city of France."
        ],
        "reference": "Paris",
    },
    {
        "user_input": "Who invented the telephone?",
        "response": "Alexander Graham Bell invented the telephone in 1876.",
        "retrieved_contexts": [
            "Alexander Graham Bell was a Scottish-American inventor who patented the first practical telephone."
        ],
        "reference": "Alexander Graham Bell",
    },
    {
        "user_input": "What is photosynthesis?",
        "response": "Photosynthesis is the process by which plants convert sunlight into energy.",
        "retrieved_contexts": [
            "Photosynthesis is a process used by plants to convert light energy into chemical energy."
        ],
        "reference": "Photosynthesis is the process by which plants and other organisms convert light energy into chemical energy.",
    },
]
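When loading your own dataset, a quick shape check before registration can catch missing fields early. The sketch below assumes that every row needs the same four fields as the sample above (`user_input`, `response`, `retrieved_contexts`, `reference`). That requirement is inferred from the sample and is not an authoritative schema. `validate_rows` is a hypothetical helper.

```python
# Field names and types inferred from the sample rows above (an assumption, not an official schema)
REQUIRED_FIELDS = {
    "user_input": str,
    "response": str,
    "retrieved_contexts": list,
    "reference": str,
}


def validate_rows(rows):
    """Return a list of human-readable problems; an empty list means the rows look fine."""
    problems = []
    for i, row in enumerate(rows):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in row:
                problems.append(f"row {i}: missing '{field}'")
            elif not isinstance(row[field], expected_type):
                problems.append(f"row {i}: '{field}' should be {expected_type.__name__}")
        # Each retrieved context must itself be a string
        contexts = row.get("retrieved_contexts")
        if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
            problems.append(f"row {i}: 'retrieved_contexts' must contain only strings")
    return problems
```

Running `validate_rows(evaluation_data)` on the sample dataset should return an empty list.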

## Dataset Registration

Register the dataset with Llama Stack's Datasets API using the direct rows approach.


In [5]:
# De-register the dataset if it already exists
dataset_id = "ragas_demo_dataset"
try:
    client.datasets.unregister(dataset_id)
except Exception:
    pass

INFO:httpx:HTTP Request: DELETE https://815a5b807252.ngrok-free.app/v1/datasets/ragas_demo_dataset "HTTP/1.1 404 Not Found"


In [6]:
dataset_response = client.datasets.register(
    dataset_id=dataset_id,
    purpose="eval/question-answer",  # RAG evaluation purpose
    source={"type": "rows", "rows": evaluation_data},
    metadata={
        "provider_id": "localfs",  # seems there's a bug in datasets
        "description": "Sample RAG evaluation dataset for Ragas demo",
        "size": len(evaluation_data),
        "format": "ragas",
        "created_at": datetime.now().isoformat(),
    },
)
pprint(dataset_response)

INFO:httpx:HTTP Request: POST https://815a5b807252.ngrok-free.app/v1/datasets "HTTP/1.1 200 OK"


## Benchmark Registration

Register a benchmark that defines what metrics to use for evaluation.


In [7]:
# comment out the provider you don't want to run
benchmarks_providers = [
    ("ragas_demo_benchmark__inline", PROVIDER_ID_INLINE),
    ("ragas_demo_benchmark__remote", PROVIDER_ID_REMOTE),
]

for benchmark_id, provider_id in benchmarks_providers:
    benchmark_response = client.benchmarks.register(
        benchmark_id=benchmark_id,
        dataset_id=dataset_id,
        scoring_functions=[
            "answer_relevancy",  # How relevant is the answer to the question?
            # "context_precision",     # How precise are the retrieved contexts?
            # "faithfulness",          # How faithful is the answer to the contexts?
            # "context_recall",        # How much of the ground truth is covered by contexts?
            # "answer_correctness"  # How correct is the answer compared to ground truth?
        ],
        provider_id=provider_id,
        # metadata={
        #     "provider": "ragas",
        #     "version": "1.0",
        #     "metrics_count": len(ragas_metrics),
        #     "created_at": datetime.now().isoformat()
        # }
    )

    pprint(benchmark_response)  # print every registration response, not only the last one

INFO:httpx:HTTP Request: POST https://815a5b807252.ngrok-free.app/v1/eval/benchmarks "HTTP/1.1 200 OK"


INFO:httpx:HTTP Request: POST https://815a5b807252.ngrok-free.app/v1/eval/benchmarks "HTTP/1.1 200 OK"


In [8]:
benchmarks = client.benchmarks.list()
pprint(benchmarks)

INFO:httpx:HTTP Request: GET https://815a5b807252.ngrok-free.app/v1/eval/benchmarks "HTTP/1.1 200 OK"


## Evaluation Execution

Run the evaluation using our Ragas out-of-tree provider.


In [15]:
# Review the settings in distribution/run.yaml. For example, since we can't set
# the embedding model in the benchmark config, it is set in the distribution
# run.yaml file (all-MiniLM-L6-v2).

remote_job = client.eval.run_eval(
    benchmark_id="ragas_demo_benchmark__remote",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "ollama/granite3.3:2b",
            "sampling_params": {"temperature": 0.1, "max_tokens": 100},
        },
        "scoring_params": {},
        # "num_examples": 1,
    },
)
pprint(remote_job)

INFO:httpx:HTTP Request: POST https://815a5b807252.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark__remote/jobs "HTTP/1.1 200 OK"


In [20]:
# Review the settings in distribution/run.yaml. For example, since we can't set
# the embedding model in the benchmark config, it is set in the distribution
# run.yaml file (all-MiniLM-L6-v2).

inline_job = client.eval.run_eval(
    benchmark_id="ragas_demo_benchmark__inline",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "ollama/granite3.3:2b",
            "sampling_params": {"temperature": 0.1, "max_tokens": 100},
        },
        "scoring_params": {},
        # "num_examples": 1,
    },
)
pprint(inline_job)

INFO:httpx:HTTP Request: POST https://815a5b807252.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark__inline/jobs "HTTP/1.1 200 OK"
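The two `run_eval` calls above pass identical benchmark configs. One way to keep them in sync is to build the dict in a single place, as in the sketch below. `make_benchmark_config` is a hypothetical helper, and its defaults are copied from the cells above.

```python
def make_benchmark_config(model, temperature=0.1, max_tokens=100, num_examples=None):
    """Build the benchmark_config dict used in the run_eval calls above."""
    config = {
        "eval_candidate": {
            "type": "model",
            "model": model,
            "sampling_params": {"temperature": temperature, "max_tokens": max_tokens},
        },
        "scoring_params": {},
    }
    # Only include num_examples when explicitly requested (it is commented out above)
    if num_examples is not None:
        config["num_examples"] = num_examples
    return config


shared_config = make_benchmark_config("ollama/granite3.3:2b")
```

The same `shared_config` can then be passed as `benchmark_config` to both the inline and the remote run, so the comparison is not skewed by an accidental config difference.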


## Results Display


In [21]:
# Check the job status; re-run this cell until the job has completed
pprint(
    client.eval.jobs.status(
        benchmark_id="ragas_demo_benchmark__inline", job_id=inline_job.job_id
    )
)

INFO:httpx:HTTP Request: GET https://815a5b807252.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark__inline/jobs/1 "HTTP/1.1 200 OK"


In [24]:
# Check the job status; re-run this cell until the job has completed
pprint(
    client.eval.jobs.status(
        benchmark_id="ragas_demo_benchmark__remote", job_id=remote_job.job_id
    )
)

INFO:httpx:HTTP Request: GET https://815a5b807252.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark__remote/jobs/48ee3d5f-5abe-407e-b6c1-fccb3c782fc9 "HTTP/1.1 200 OK"
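Instead of re-running the status cells by hand, the check can be wrapped in a small polling loop. The sketch below is generic: it takes any zero-argument callable that returns an object with a `status` attribute. The terminal status strings (`completed`, `failed`, `cancelled`) are an assumption; compare them against what the status cells above actually print. `wait_for_job` is a hypothetical helper.

```python
import time
from types import SimpleNamespace


def wait_for_job(
    get_status,
    timeout=300.0,
    interval=5.0,
    terminal=("completed", "failed", "cancelled"),  # assumed status strings
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll get_status() until the job reaches a terminal status or the timeout expires."""
    deadline = clock() + timeout
    while True:
        job = get_status()
        if job.status in terminal:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still '{job.status}' after {timeout} seconds")
        sleep(interval)


# Demo with fake statuses so the sketch runs without a server
fake_states = iter(["scheduled", "in_progress", "completed"])
final = wait_for_job(
    lambda: SimpleNamespace(status=next(fake_states)),
    interval=0,
    sleep=lambda _: None,
)
print(final.status)  # prints completed
```

With a live client, the callable would wrap the status call from the cells above, for example `lambda: client.eval.jobs.status(benchmark_id="ragas_demo_benchmark__remote", job_id=remote_job.job_id)`.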


In [28]:
remote_results = client.eval.jobs.retrieve(
    benchmark_id="ragas_demo_benchmark__remote", job_id=remote_job.job_id
)
pprint(remote_results)

INFO:httpx:HTTP Request: GET https://815a5b807252.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark__remote/jobs/48ee3d5f-5abe-407e-b6c1-fccb3c782fc9/result "HTTP/1.1 200 OK"


In [26]:
inline_results = client.eval.jobs.retrieve(
    benchmark_id="ragas_demo_benchmark__inline", job_id=inline_job.job_id
)
pprint(inline_results)

INFO:httpx:HTTP Request: GET https://815a5b807252.ngrok-free.app/v1/eval/benchmarks/ragas_demo_benchmark__inline/jobs/1/result "HTTP/1.1 200 OK"


## Inline vs Remote Side-by-side

In [30]:
pd.DataFrame.from_dict(
    {
        "inline": [
            r["score"] for r in inline_results.scores["answer_relevancy"].score_rows
        ],
        "remote": [
            r["score"] for r in remote_results.scores["answer_relevancy"].score_rows
        ],
    },
).assign(diff=lambda df: df["remote"] - df["inline"])

|   | inline | remote | diff |
|---|--------|--------|------|
| 0 | 0.962093 | 0.972945 | 0.010852 |
| 1 | 0.962612 | 0.96108 | -0.001532 |
| 2 | 0.787999 | 0.855692 | 0.067693 |
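To put a number on how closely the two providers agree, the per-row differences can be reduced to a few summary statistics. The sketch below uses only the standard library and the scores from the output above. `summarize_gap` is a hypothetical helper, and the choice of statistics (mean and maximum absolute difference) is just one reasonable option.

```python
from statistics import mean

# Scores copied from the side-by-side output above
inline_scores = [0.962093, 0.962612, 0.787999]
remote_scores = [0.972945, 0.96108, 0.855692]


def summarize_gap(inline, remote):
    """Summarize the per-row score gap between two runs of the same benchmark."""
    if len(inline) != len(remote):
        raise ValueError("both runs must score the same number of rows")
    abs_diffs = [abs(r - i) for i, r in zip(inline, remote)]
    worst = max(abs_diffs)
    return {
        "mean_abs_diff": mean(abs_diffs),
        "max_abs_diff": worst,
        "worst_row": abs_diffs.index(worst),  # index of the row with the largest gap
    }


summary = summarize_gap(inline_scores, remote_scores)
print(summary)
```

For these three rows the largest gap is on row 2 (the photosynthesis question), which is also the row with the lowest score in both runs.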
