# TrustyAI Evaluation Demo

This notebook demonstrates how to use the evaluation functionality of the TrustyAI SDK to evaluate language models with the LM Evaluation Harness, both locally and on Kubernetes.

## Prerequisites

Make sure you have installed TrustyAI with evaluation support:

```bash
pip install ".[eval]"
```

Or for all features:

```bash
pip install ".[all]"
```


## 1. Basic Setup and Imports

First, let's import the necessary modules and check what evaluation providers are available.


In [1]:
from trustyai import Providers
from trustyai.core import DeploymentMode
from trustyai.core.eval import EvaluationProviderConfig

Available provider types:

In [2]:
dir(Providers)

['bias_detection', 'eval', 'evaluation', 'explainability']

Available **evaluation** providers:

In [3]:
dir(Providers.eval)

['LMEvalProvider']

We'll use: `Providers.eval.LMEvalProvider`

The deployment mode set in the config determines whether the evaluation runs _locally_ or on _Kubernetes_.
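
To picture how one provider can serve both targets, the sketch below routes a single `evaluate()` call to a backend chosen from the config. It is illustrative only: `Mode`, `SketchConfig`, and `SketchProvider` are hypothetical stand-ins, not TrustyAI classes, and the real provider may be organised differently.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for a deployment-mode enum."""
    LOCAL = "local"
    KUBERNETES = "kubernetes"


@dataclass
class SketchConfig:
    """Hypothetical config carrying only the fields needed for dispatch."""
    model: str
    deployment_mode: Mode = Mode.LOCAL


class SketchProvider:
    """Routes one evaluate() call to a backend chosen by the config."""

    def evaluate(self, config: SketchConfig) -> str:
        handlers = {
            Mode.LOCAL: self._run_local,
            Mode.KUBERNETES: self._run_kubernetes,
        }
        return handlers[config.deployment_mode](config)

    def _run_local(self, config: SketchConfig) -> str:
        return f"would evaluate {config.model} in this process"

    def _run_kubernetes(self, config: SketchConfig) -> str:
        return f"would submit a cluster job for {config.model}"


provider_sketch = SketchProvider()
local_msg = provider_sketch.evaluate(SketchConfig("google/flan-t5-base"))
k8s_msg = provider_sketch.evaluate(
    SketchConfig("google/flan-t5-base", Mode.KUBERNETES)
)
print(local_msg)
print(k8s_msg)
```

The point of the design is that calling code changes only the config, never the provider it talks to.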

## 2. Initialise the Evaluation Provider

Next, create and initialise the evaluation provider through the `Providers` class.


In [4]:
# Create the evaluation provider
provider = Providers.eval.LMEvalProvider()

In [5]:
# Initialise the provider (this will check if lm-eval is available)
try:
    provider.initialize()
except ImportError as e:
    print(f"\n✗ Error initialising provider: {e}")

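
The comment in the cell above says initialisation checks whether lm-eval is available. If you want to run a similar check yourself before creating the provider, a minimal sketch using only the standard library could look like this. It is not necessarily how `initialize()` performs its check; `is_module_available` is a helper made up for this example.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# `lm_eval` is the import name of the LM Evaluation Harness package.
if is_module_available("lm_eval"):
    print("lm-eval is installed")
else:
    print("lm-eval is missing; install the eval extra first")
```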


In [6]:
provider.__class__.__name__

'LMEvalProvider'

In [7]:
provider.provider_type()

'eval'

Supported deployment modes:

In [8]:
[mode.value for mode in provider.supported_deployment_modes]

['local', 'kubernetes']

## 4. Basic Evaluation Example

Let's run a basic evaluation with a small model and a simple task: `google/flan-t5-base` and the `arc_easy` task.


In [9]:
# Create evaluation configuration
config = EvaluationProviderConfig(
    evaluation_name="arc_easy",
    model="google/flan-t5-base",  # Small model for quick evaluation
    tasks=["arc_easy"],  # ARC-Easy: multiple-choice grade-school science questions
    limit=5,  # Limit to 5 examples for quick demonstration
    metrics=["acc", "acc_norm"],  # Accuracy metrics
    device="cpu",  # Use CPU to avoid GPU requirements
    deployment_mode=DeploymentMode.LOCAL,
    batch_size=1,  # Small batch size for stability
    num_fewshot=0,  # Zero-shot evaluation
)

In [10]:
config

EvaluationProviderConfig(evaluation_name='arc_easy', model='google/flan-t5-base', tasks=['arc_easy'], limit=5, metrics=['acc', 'acc_norm'], device='cpu', deployment_mode=ExecutionMode.LOCAL, additional_params={'batch_size': 1, 'num_fewshot': 0})
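
The repr above shows that `batch_size` and `num_fewshot` were collected under `additional_params` rather than stored as named fields. The sketch below shows one way a config class can behave like that. `SketchEvalConfig`, `from_kwargs`, and the `get_param` signature are assumptions for illustration, not the actual `EvaluationProviderConfig` implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchEvalConfig:
    """Hypothetical config: a few declared fields plus a catch-all dict."""
    evaluation_name: str
    model: str
    tasks: List[str]
    limit: Optional[int] = None
    additional_params: Dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_kwargs(cls, **kwargs: Any) -> "SketchEvalConfig":
        """Split keyword arguments into declared fields and extras."""
        declared = {"evaluation_name", "model", "tasks", "limit"}
        known = {k: v for k, v in kwargs.items() if k in declared}
        extra = {k: v for k, v in kwargs.items() if k not in declared}
        return cls(**known, additional_params=extra)

    def get_param(self, name: str, default: Any = None) -> Any:
        """Look up an extra parameter, falling back to a default."""
        return self.additional_params.get(name, default)


cfg = SketchEvalConfig.from_kwargs(
    evaluation_name="arc_easy",
    model="google/flan-t5-base",
    tasks=["arc_easy"],
    limit=5,
    batch_size=1,
    num_fewshot=0,
)
print(cfg.additional_params)
print(cfg.get_param("batch_size"))
print(cfg.get_param("namespace", "default"))
```

A catch-all dictionary lets backend-specific options pass through without the config class having to declare every one of them.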

Run the evaluation:

In [11]:
results = provider.evaluate(config)

[DEBUG - _parse_args_to_config] Args=1: has namespace? False
Using device: cpu for model evaluation


2025-06-27:23:47:48 INFO     [models.huggingface:137] Using device 'cpu'
2025-06-27:23:47:49 INFO     [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cpu'}
2025-06-27:23:47:53 INFO     [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-06-27:23:47:53 INFO     [evaluator:243] Using pre-initialized model
2025-06-27:23:47:57 INFO     [api.task:434] Building contexts for arc_easy on rank 0...
100%|██████████| 5/5 [00:00<00:00, 2484.48it/s]
2025-06-27:23:47:57 INFO     [evaluator:559] Running loglikelihood requests
Running loglikelihood requests:   0%|          | 0/20 [00:00<?, ?it/s]
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
Running logli

## 5. Local vs Kubernetes Comparison

Run the same configuration locally and on Kubernetes to show that a single provider handles both deployment modes.


In [12]:
# Shared configuration for both deployments
shared_config = {
    "evaluation_name": "comparison_demo",
    "model": "google/flan-t5-base",
    "tasks": ["arc_easy"],
    "limit": 3,  # Small limit for quick comparison
    "metrics": ["acc", "acc_norm"],
    "batch_size": 1,
    "num_fewshot": 0,
}

In [13]:
# Configuration for LOCAL deployment
local_config = EvaluationProviderConfig(
    **shared_config, deployment_mode=DeploymentMode.LOCAL, device="cpu"
)

In [14]:
# Configuration for KUBERNETES deployment
kubernetes_config = EvaluationProviderConfig(
    **shared_config,
    deployment_mode=DeploymentMode.KUBERNETES,
    namespace="test",
    deploy=True,
    wait_for_completion=True,
    timeout=300,
)
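
The Kubernetes config above sets `wait_for_completion=True` and `timeout=300`. As a rough mental model of what waiting with a timeout involves, the sketch below polls a job state until it reaches a terminal value or the time budget runs out. Everything here is hypothetical: `wait_for_job`, the state names, and the fake job are made up for illustration, and the SDK may implement waiting quite differently.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_job(
    get_state: Callable[[], Awaitable[str]],
    timeout: float,
    poll_interval: float = 1.0,
) -> str:
    """Poll until the job reports a terminal state or the timeout elapses."""

    async def poll() -> str:
        while True:
            state = await get_state()
            if state in {"Complete", "Failed"}:  # hypothetical state names
                return state
            await asyncio.sleep(poll_interval)

    # Raises asyncio.TimeoutError if no terminal state is seen in time.
    return await asyncio.wait_for(poll(), timeout=timeout)


async def demo() -> str:
    # Fake job that reports completion on the third status check.
    states = iter(["Scheduled", "Running", "Complete"])

    async def fake_state() -> str:
        return next(states)

    return await wait_for_job(fake_state, timeout=5, poll_interval=0.01)


final_state = asyncio.run(demo())
print(final_state)
```

Inside a notebook, where an event loop is already running, use `await demo()` in place of `asyncio.run(demo())`.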

First, run the local evaluation:

In [15]:
local_results = provider.evaluate(local_config)

2025-06-27:23:48:30 INFO     [models.huggingface:137] Using device 'cpu'


[DEBUG - _parse_args_to_config] Args=1: has namespace? False
Using device: cpu for model evaluation


2025-06-27:23:48:31 INFO     [models.huggingface:382] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cpu'}
2025-06-27:23:48:33 INFO     [evaluator:189] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-06-27:23:48:33 INFO     [evaluator:243] Using pre-initialized model
2025-06-27:23:48:35 INFO     [api.task:434] Building contexts for arc_easy on rank 0...
100%|██████████| 3/3 [00:00<00:00, 2292.39it/s]
2025-06-27:23:48:35 INFO     [evaluator:559] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 12/12 [00:00<00:00, 20.32it/s]


Next, run the same evaluation on Kubernetes. Unlike the local call, this call is awaited:

In [18]:
kubernetes_results = await provider.evaluate(kubernetes_config)

[DEBUG - _parse_args_to_config] Args=1: has namespace? True
[DEBUG - _parse_args_to_config] Namespace value: test
[DEBUG - _evaluate_kubernetes_async] Config keys: ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'additional_params', 'deployment_mode', 'device', 'evaluation_name', 'get_param', 'limit', 'metrics', 'model', 'tasks']
[DEBUG - _evaluate_kubernetes_async] Config namespace: test
[DEBUG] Using namespace for CR: test
[DEBUG] Setting limit in config as string: 3
[DEBUG] Setting namespace in LMEvalJob resource: test
[DEBUG] Setting limit as string: 3
[DEBUG] Deploying LMEvalJob to namespace: test
[DEBUG] API Group: trustyai.opendatahub.io, Version: v1alpha1
[DEBUG] Resourc

Results comparison:

In [19]:
def print_results(title, results):
    """Print every metric for every task in an evaluation result."""
    print(f"\n{title}")
    print("-" * 20)
    if results and "results" in results:
        for task_name, task_results in results["results"].items():
            print(f"✅ Task: {task_name}")
            for metric, value in task_results.items():
                print(f"   {metric}: {value}")


print_results("📊 LOCAL RESULTS:", local_results)
print_results("🚀 KUBERNETES RESULTS:", kubernetes_results)


📊 LOCAL RESULTS:
--------------------
✅ Task: arc_easy
   alias: arc_easy
   acc,none: 0.6666666666666666
   acc_stderr,none: 0.33333333333333337
   acc_norm,none: 0.6666666666666666
   acc_norm_stderr,none: 0.33333333333333337

🚀 KUBERNETES RESULTS:
--------------------
✅ Task: arc_easy
   alias: arc_easy
   acc,none: 0.6666666666666666
   acc_stderr,none: 0.33333333333333337
   acc_norm,none: 0.6666666666666666
   acc_norm_stderr,none: 0.33333333333333337
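
The standard error is large because only three examples were evaluated (`limit=3`). The reported values are consistent with the usual standard error of the mean, shown in the worked example below. The per-example scores are an assumption inferred from the output: two correct answers out of three is the only 0/1 split that gives an accuracy of 2/3.

```python
import math
import statistics

# Assumption: two of the three sampled questions were scored correct,
# which is the only 0/1 split that gives acc = 2/3 with limit=3.
per_example_scores = [1.0, 1.0, 0.0]

n = len(per_example_scores)
accuracy = statistics.mean(per_example_scores)

# Standard error of the mean: sample standard deviation (n - 1 in the
# denominator) divided by sqrt(n).
stderr = statistics.stdev(per_example_scores) / math.sqrt(n)

print(f"acc    = {accuracy:.4f}")
print(f"stderr = {stderr:.4f}")
```

With a standard error of about 0.33 on an accuracy of about 0.67, these numbers confirm that the pipeline runs end to end but say little about model quality. Raise `limit` for a meaningful estimate.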

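Eyeballing two result blocks works for a single task but scales poorly. The sketch below compares the two runs programmatically. It assumes metric keys follow the `metric,filter` pattern seen in the output (for example `acc,none`). The dictionaries are copied from the printed results, and the helper names are made up for this example.

```python
import math
from typing import Any, Dict, List, Tuple

# Values copied from the printed results; both runs reported the same numbers.
local = {
    "arc_easy": {
        "alias": "arc_easy",
        "acc,none": 0.6666666666666666,
        "acc_stderr,none": 0.33333333333333337,
        "acc_norm,none": 0.6666666666666666,
        "acc_norm_stderr,none": 0.33333333333333337,
    }
}
kubernetes = {task: dict(metrics) for task, metrics in local.items()}


def numeric_metrics(task_results: Dict[str, Any]) -> Dict[Tuple[str, str], float]:
    """Split 'metric,filter' keys and keep only numeric values."""
    parsed = {}
    for key, value in task_results.items():
        if isinstance(value, (int, float)) and "," in key:
            metric, filter_name = key.split(",", 1)
            parsed[(metric, filter_name)] = float(value)
    return parsed


def differing_metrics(a, b, tol: float = 1e-9) -> List[Tuple[str, str]]:
    """Return (task, metric) pairs whose values differ by more than tol."""
    diffs = []
    for task in a.keys() & b.keys():
        ma, mb = numeric_metrics(a[task]), numeric_metrics(b[task])
        for name in ma.keys() & mb.keys():
            if not math.isclose(ma[name], mb[name], abs_tol=tol):
                diffs.append((task, name[0]))
    return diffs


print(differing_metrics(local, kubernetes))
```

An empty list means every shared numeric metric matched within the tolerance.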