# MLflow Ragas Evaluation Example

This notebook demonstrates how to use `mlflow.evaluate()` with a custom Ragas evaluator plugin that runs evaluations via Kubeflow Pipelines.

## Prerequisites

- Have a Kubeflow Pipelines endpoint available.
- Set up the required environment variables in a `.env` file.
- Have access to the S3 bucket used for storing evaluation results.
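Before running the cells below, it can help to confirm that the `.env` values are actually visible to the Python process. The following is a minimal sketch using only the standard library. The variable names in `REQUIRED_VARS` are hypothetical placeholders, not names taken from the plugin; replace them with whatever your setup actually reads.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines and add them to os.environ.

    Variables that are already set in the environment are not overwritten.
    Returns a dict of the values found in the file.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes, if any
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def missing_vars(required):
    """Return the names in `required` that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]


# Hypothetical names: replace with the variables your setup actually uses.
REQUIRED_VARS = [
    "KUBEFLOW_PIPELINES_ENDPOINT",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]

load_env_file()
print("Missing variables:", missing_vars(REQUIRED_VARS))
```

If you already use a package such as `python-dotenv`, its loader can replace `load_env_file`; the check in `missing_vars` stays the same.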

## Imports

In [1]:
from rich import print as rprint
import mlflow
import pandas as pd

## MLflow setup
Run MLflow locally with:
```bash
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000
```


In [2]:
mlflow.set_tracking_uri("http://localhost:5000")

Ensure that the entry point declared in `pyproject.toml` gets picked up by MLflow's plugin discovery. You should see `ragas_kubeflow` in the list of available evaluators:

In [3]:
rprint(mlflow.models.list_evaluators())
assert "ragas_kubeflow" in mlflow.models.list_evaluators()
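If the assertion fails, one thing worth checking is whether the plugin package is installed in the active environment at all, since entry points are only visible for installed distributions. The sketch below lists installed entry points with `importlib.metadata`. It assumes the plugin registers under the `mlflow.model_evaluator` group, which is the group MLflow documents for evaluator plugins; confirm this against your `pyproject.toml` and MLflow version.

```python
from importlib.metadata import entry_points

# Assumed group name for MLflow evaluator plugins; check your pyproject.toml.
GROUP = "mlflow.model_evaluator"


def list_evaluator_entry_points(group=GROUP):
    """Return sorted (name, target) pairs for installed entry points in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


for name, target in list_evaluator_entry_points():
    print(f"{name} -> {target}")
```

An empty listing suggests the package has not been installed into this environment (for example with an editable install), rather than a problem inside MLflow itself.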



## Make sure things run: test with a static dataset using MLflow's default evaluator


In [4]:
# Evaluation with static predictions
eval_data_with_predictions = pd.DataFrame(
    {
        "inputs": [
            "What is the capital of France?",
            "How does photosynthesis work?",
            "Explain machine learning in simple terms",
        ],
        "context": [
            "France is a country in Europe with Paris as its capital city.",
            "Photosynthesis is the process by which plants use sunlight to make food.",
            "Machine learning is a subset of AI that learns patterns from data.",
        ],
        "ground_truth": [
            "Paris",
            "Photosynthesis",
            "Machine learning",
        ],
        "predictions": [
            "Paris",
            "Photosynthesis",
            "Machine learning",
        ],
    }
)

with mlflow.start_run() as run:
    results = mlflow.evaluate(
        data=eval_data_with_predictions,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[mlflow.metrics.genai.answer_similarity()],
        evaluators=["default"],
    )

    print("✅ Static dataset evaluation completed!")
    print(f"   Run ID: {run.info.run_id}")

    # Show metrics
    print("\n📊 Metrics:")
    for metric_name, value in results.metrics.items():
        print(f"   {metric_name}: {value:.4f}")

2025/09/05 15:37:47 INFO mlflow.models.evaluation.default_evaluator: Testing metrics on first row...
100%|██████████| 1/1 [00:00<00:00, 32.60it/s]
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean,
  ret = ret.dtype.type(ret / rcount)
100%|██████████| 3/3 [00:00<00:00, 63872.65it/s]

✅ Static dataset evaluation completed!
   Run ID: 964657ffc1374fb985648be01b9bb97d

📊 Metrics:
   answer_similarity/v1/mean: nan
   answer_similarity/v1/variance: nan
🏃 View run rumbling-dove-216 at: http://localhost:5000/#/experiments/0/runs/964657ffc1374fb985648be01b9bb97d
🧪 View experiment at: http://localhost:5000/#/experiments/0
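Note that in the output above both metrics are `nan`, so "completed" does not mean the rows were actually scored. `answer_similarity` is an LLM-judged metric, so a missing or misconfigured judge model is one possible cause, though the output shown here does not confirm it. A small helper like the following sketch (the function name is hypothetical) makes a smoke test fail loudly instead of passing with empty scores:

```python
import math


def nan_metrics(metrics):
    """Return the sorted names of metrics whose value is a float NaN."""
    return sorted(
        name
        for name, value in metrics.items()
        if isinstance(value, float) and math.isnan(value)
    )


# Values copied from the output above; in the notebook, pass results.metrics.
example = {
    "answer_similarity/v1/mean": float("nan"),
    "answer_similarity/v1/variance": float("nan"),
}

bad = nan_metrics(example)
if bad:
    print("Metrics with no usable value:", bad)
```

In the notebook, `assert not nan_metrics(results.metrics)` placed after the evaluation call would turn this into a hard check.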





## Evaluation over a static dataset with our custom pipeline runner on KFP

In [5]:
# this will invoke the Ragas evaluator, so the dataset
# should conform to the expected format for Ragas
evaluation_data = pd.DataFrame.from_records(
    [
        {
            "user_input": "What is the capital of France?",
            "response": "The capital of France is Paris.",
            "retrieved_contexts": [
                "Paris is the capital and most populous city of France."
            ],
            "reference": "Paris",
        },
        {
            "user_input": "What is photosynthesis?",
            "response": "Photosynthesis is the process by which plants convert sunlight into energy.",
            "retrieved_contexts": [
                "Photosynthesis is a process used by plants to convert light energy into chemical energy."
            ],
            "reference": "Photosynthesis is the process by which plants and other organisms convert light energy into chemical energy.",
        },
    ]
)
rprint(evaluation_data)
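Because the pipeline runs remotely, a malformed dataset is cheaper to catch locally before submission. The following sketch validates the raw records before they are turned into a DataFrame. The field list is taken from the example above; which fields Ragas actually requires depends on the metrics you run, so treat `EXPECTED_FIELDS` and the helper name as assumptions.

```python
# Field names copied from the example dataset in this notebook (an assumption
# about what your chosen Ragas metrics need, not a definitive schema).
EXPECTED_FIELDS = {"user_input", "response", "retrieved_contexts", "reference"}


def validate_records(records):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for i, record in enumerate(records):
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            problems.append(f"record {i}: missing {sorted(missing)}")
        contexts = record.get("retrieved_contexts")
        # When present, retrieved_contexts should be a list of strings.
        if contexts is not None and not (
            isinstance(contexts, list) and all(isinstance(c, str) for c in contexts)
        ):
            problems.append(f"record {i}: retrieved_contexts must be a list of strings")
    return problems


records = [
    {
        "user_input": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "retrieved_contexts": ["Paris is the capital and most populous city of France."],
        "reference": "Paris",
    },
    {
        "user_input": "What is photosynthesis?",
        "response": "Photosynthesis converts sunlight into energy.",
        "retrieved_contexts": "a bare string instead of a list",
    },
]

for problem in validate_records(records):
    print(problem)
```

The second record is deliberately broken to show both kinds of message: a missing `reference` field and a `retrieved_contexts` value that is not a list.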

In [None]:
with mlflow.start_run() as run:
    mlflow.evaluate(
        data=evaluation_data,
        model_type="ragas",  # This just needs to pass can_evaluate()
        evaluators=["ragas_kubeflow"],  # This explicitly selects our evaluator
        # evaluator_config={...}, # Can override config here
    )

