# TrustyAI Local Evaluation Demo

This notebook demonstrates how to use the TrustyAI SDK's local evaluation functionality to evaluate language models using the LM Evaluation Harness.

## Prerequisites

Make sure you have installed TrustyAI with evaluation support:

```bash
pip install .[eval]
```

Or for all features:

```bash
pip install .[all]
```


## 1. Basic Setup and Imports

First, let's import the necessary modules and check what evaluation providers are available.


In [1]:
import os
import json
from pprint import pprint

from trustyai.core.eval import EvaluationProviderConfig
from trustyai.core import DeploymentMode
from trustyai.providers.eval import LocalLMEvalProvider

## 2. Initialise the Local Evaluation Provider

Let's create and initialise the local evaluation provider.


In [4]:
# Create the local evaluation provider
provider = LocalLMEvalProvider()

# Initialise the provider (this will check if lm-eval is available)
try:
    provider.initialize()
    print("✓ Local evaluation provider initialised successfully!")
except ImportError as e:
    print(f"✗ Error initialising provider: {e}")
    print("Please install evaluation dependencies: pip install .[eval]")


  from .autonotebook import tqdm as notebook_tqdm


✓ Local evaluation provider initialised successfully!


## 3. Explore Available Tasks and Metrics

Let's see what evaluation tasks and metrics are available through the provider.


In [5]:
# List available evaluation datasets/tasks
available_tasks = provider.list_available_datasets()
print(f"Number of available tasks: {len(available_tasks)}")
print("\nFirst 10 available tasks:")
for task in sorted(available_tasks)[:10]:
    print(f"  - {task}")

if len(available_tasks) > 10:
    print(f"  ... and {len(available_tasks) - 10} more tasks")


Number of available tasks: 0

First 10 available tasks:


In [6]:
# List available metrics
available_metrics = provider.list_available_metrics()
print(f"Available metrics ({len(available_metrics)}):")
for metric in available_metrics:
    print(f"  - {metric}")


Available metrics (13):
  - acc
  - acc_norm
  - perplexity
  - bleu
  - rouge
  - exact_match
  - f1
  - precision
  - recall
  - matthews_correlation
  - multiple_choice_grade
  - wer
  - ter


## 4. Basic Evaluation Example

Let's run a basic evaluation using a small model and a simple task. We'll use google/flan-t5-base (a small model) and the HellaSwag task for demonstration.


In [7]:
# Create evaluation configuration
config = EvaluationProviderConfig(
    evaluation_name="arc_easy",
    model="google/flan-t5-base",  # Small model for quick evaluation
    tasks=["arc_easy"],  # Common sense reasoning task
    limit=5,  # Limit to 5 examples for quick demonstration
    metrics=["acc", "acc_norm"],  # Accuracy metrics
    device="cpu",  # Use CPU to avoid GPU requirements
    deployment_mode=DeploymentMode.LOCAL,
    batch_size=1,  # Small batch size for stability
    num_fewshot=0  # Zero-shot evaluation
)

print("Configuration created:")
print(f"  Model: {config.model}")
print(f"  Tasks: {config.tasks}")
print(f"  Metrics: {config.metrics}")
print(f"  Device: {config.device}")
print(f"  Limit: {config.limit} examples")
print(f"  Batch size: {config.get_param('batch_size')}")
print(f"  Few-shot examples: {config.get_param('num_fewshot')}")


Configuration created:
  Model: google/flan-t5-base
  Tasks: ['arc_easy']
  Metrics: ['acc', 'acc_norm']
  Device: cpu
  Limit: 5 examples
  Batch size: 1
  Few-shot examples: 0


In [8]:
# Run the evaluation
print("Running evaluation...")
print("This may take a few minutes as the model needs to be downloaded and loaded.")

try:
    results = provider.evaluate(config)
    print("\n✓ Evaluation completed successfully!")
except Exception as e:
    print(f"\n✗ Evaluation failed: {e}")
    results = None


2025-06-16:01:05:44,249 INFO     [lm_eval.models.huggingface:136] Using device 'cuda'


Running evaluation...
This may take a few minutes as the model needs to be downloaded and loaded.
[DEBUG - _parse_args_to_config] Args=1: has namespace? False
Using device: cuda for model evaluation


2025-06-16:01:05:44,737 INFO     [lm_eval.models.huggingface:376] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2025-06-16:01:05:50,241 INFO     [lm_eval.evaluator:177] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-06-16:01:05:50,241 INFO     [lm_eval.evaluator:230] Using pre-initialized model
2025-06-16:01:05:53,581 INFO     [lm_eval.api.task:420] Building contexts for arc_easy on rank 0...
100%|██████████| 5/5 [00:00<00:00, 2561.56it/s]
2025-06-16:01:05:53,585 INFO     [lm_eval.evaluator:525] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 20/20 [00:01<00:00, 12.38it/s]



✓ Evaluation completed successfully!


In [9]:
# Display results if evaluation was successful
if results:
    print("\nEvaluation Results:")
    print("==================")
    
    # Pretty print the results
    if 'results' in results:
        for task_name, task_results in results['results'].items():
            print(f"\nTask: {task_name}")
            print("-" * (len(task_name) + 6))
            for metric, value in task_results.items():
                if isinstance(value, (int, float)):
                    print(f"  {metric}: {value:.4f}")
                else:
                    print(f"  {metric}: {value}")
    else:
        print("Raw results:")
        pprint(results)



Evaluation Results:

Task: arc_easy
--------------
  alias: arc_easy
  acc,none: 0.6000
  acc_stderr,none: 0.2449
  acc_norm,none: 0.6000
  acc_norm_stderr,none: 0.2449


## 5. Multi-Task Evaluation

Let's run an evaluation on multiple tasks to see how the model performs across different capabilities.


In [10]:
# Configuration for multi-task evaluation
multi_task_config = EvaluationProviderConfig(
    evaluation_name="multi_task_demo",
    model="google/flan-t5-base",
    tasks=[
        "hellaswag",    # Common sense reasoning
        "arc_easy",     # Science questions (easy)
        "winogrande"    # Pronoun resolution
    ],
    limit=3,  # Very small limit for quick demo
    metrics=["acc", "acc_norm"],
    device="cpu",
    deployment_mode=DeploymentMode.LOCAL,
    batch_size=1,
    num_fewshot=0
)

print("Multi-task evaluation configuration:")
print(f"  Tasks: {multi_task_config.tasks}")
print(f"  Limit per task: {multi_task_config.limit} examples")


Multi-task evaluation configuration:
  Tasks: ['hellaswag', 'arc_easy', 'winogrande']
  Limit per task: 3 examples


In [11]:
# Run multi-task evaluation
print("Running multi-task evaluation...")

try:
    multi_results = provider.evaluate(multi_task_config)
    print("\n✓ Multi-task evaluation completed!")
    
    # Display results for each task
    if 'results' in multi_results:
        print("\nResults Summary:")
        print("================")
        
        for task_name, task_results in multi_results['results'].items():
            print(f"\n{task_name.upper()}:")
            for metric, value in task_results.items():
                if isinstance(value, (int, float)):
                    print(f"  {metric}: {value:.4f}")
                    
except Exception as e:
    print(f"\n✗ Multi-task evaluation failed: {e}")


2025-06-16:01:06:44,210 INFO     [lm_eval.models.huggingface:136] Using device 'cuda'


Running multi-task evaluation...
[DEBUG - _parse_args_to_config] Args=1: has namespace? False
Using device: cuda for model evaluation


2025-06-16:01:06:44,480 INFO     [lm_eval.models.huggingface:376] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2025-06-16:01:06:50,014 INFO     [lm_eval.evaluator:177] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-06-16:01:06:50,014 INFO     [lm_eval.evaluator:230] Using pre-initialized model
Downloading data: 100%|██████████| 3.40M/3.40M [00:00<00:00, 5.05MB/s]
Generating train split: 100%|██████████| 40398/40398 [00:00<00:00, 119801.92 examples/s]
Generating test split: 100%|██████████| 1767/1767 [00:00<00:00, 117351.52 examples/s]
Generating validation split: 100%|██████████| 1267/1267 [00:00<00:00, 115648.91 examples/s]
2025-06-16:01:07:01,531 INFO     [lm_eval.api.task:420] Building contexts for winogrande on rank 0...
100%|██████████| 3/3 [00:00<00:00, 42366.71it/s]
2025-06-16:01:07:01,532 INFO     [lm_eval.api.task:420] Building contexts for a


✓ Multi-task evaluation completed!

Results Summary:

ARC_EASY:
  acc,none: 0.6667
  acc_stderr,none: 0.3333
  acc_norm,none: 0.6667
  acc_norm_stderr,none: 0.3333

HELLASWAG:
  acc,none: 0.3333
  acc_stderr,none: 0.3333
  acc_norm,none: 0.3333
  acc_norm_stderr,none: 0.3333

WINOGRANDE:
  acc,none: 0.6667
  acc_stderr,none: 0.3333


## 6. Few-Shot Evaluation

Let's demonstrate few-shot evaluation, where we provide examples to the model before asking it to perform the task.


In [12]:
# Configuration for few-shot evaluation
few_shot_config = EvaluationProviderConfig(
    evaluation_name="few_shot_demo",
    model="google/flan-t5-base",
    tasks=["hellaswag"],
    limit=3,
    metrics=["acc", "acc_norm"],
    device="cpu",
    deployment_mode=DeploymentMode.LOCAL,
    batch_size=1,
    num_fewshot=2  # Provide 2 examples before each test question
)

print("Few-shot evaluation configuration:")
print(f"  Task: {few_shot_config.tasks[0]}")
print(f"  Few-shot examples: {few_shot_config.get_param('num_fewshot')}")
print(f"  Test examples: {few_shot_config.limit}")


Few-shot evaluation configuration:
  Task: hellaswag
  Few-shot examples: 2
  Test examples: 3


In [13]:
# Run few-shot evaluation
print("Running few-shot evaluation...")

try:
    few_shot_results = provider.evaluate(few_shot_config)
    print("\n✓ Few-shot evaluation completed!")
    
    # Display results
    if 'results' in few_shot_results:
        task_name = list(few_shot_results['results'].keys())[0]
        task_results = few_shot_results['results'][task_name]
        
        print(f"\nFew-shot Results for {task_name}:")
        print("=" * (len(task_name) + 23))
        
        for metric, value in task_results.items():
            if isinstance(value, (int, float)):
                print(f"  {metric}: {value:.4f}")
                
except Exception as e:
    print(f"\n✗ Few-shot evaluation failed: {e}")


2025-06-16:01:07:32,786 INFO     [lm_eval.models.huggingface:136] Using device 'cuda'


Running few-shot evaluation...
[DEBUG - _parse_args_to_config] Args=1: has namespace? False
Using device: cuda for model evaluation


2025-06-16:01:07:33,051 INFO     [lm_eval.models.huggingface:376] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2025-06-16:01:07:38,612 INFO     [lm_eval.evaluator:177] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-06-16:01:07:38,613 INFO     [lm_eval.evaluator:230] Using pre-initialized model
2025-06-16:01:07:41,844 INFO     [lm_eval.api.task:420] Building contexts for hellaswag on rank 0...
100%|██████████| 3/3 [00:00<00:00, 1334.21it/s]
2025-06-16:01:07:41,849 INFO     [lm_eval.evaluator:525] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 12/12 [00:01<00:00,  6.31it/s]



✓ Few-shot evaluation completed!

Few-shot Results for hellaswag:
  acc,none: 0.3333
  acc_stderr,none: 0.3333
  acc_norm,none: 0.3333
  acc_norm_stderr,none: 0.3333


## 7. Comparing Models

Let's compare the performance of different models on the same task.


In [15]:
# List of models to compare (using small models for quick evaluation)
models_to_compare = [
    "google/flan-t5-base",
    "google/flan-t5-small"
]

comparison_results = {}

for model in models_to_compare:
    print(f"\nEvaluating {model}...")
    
    config = EvaluationProviderConfig(
        evaluation_name=f"comparison_{model}",
        model=model,
        tasks=["hellaswag"],
        limit=3,
        metrics=["acc", "acc_norm"],
        device="cpu",
        deployment_mode=DeploymentMode.LOCAL,
        batch_size=1,
        num_fewshot=0
    )
    
    try:
        results = provider.evaluate(config)
        if 'results' in results:
            task_results = results['results']['hellaswag']
            comparison_results[model] = task_results
            print(f"  ✓ {model} evaluation completed")
        else:
            print(f"  ✗ {model} evaluation returned unexpected format")
    except Exception as e:
        print(f"  ✗ {model} evaluation failed: {e}")
        comparison_results[model] = None


2025-06-16:01:13:28,193 INFO     [lm_eval.models.huggingface:136] Using device 'cuda'



Evaluating google/flan-t5-base...
[DEBUG - _parse_args_to_config] Args=1: has namespace? False
Using device: cuda for model evaluation


2025-06-16:01:13:28,546 INFO     [lm_eval.models.huggingface:376] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2025-06-16:01:13:34,134 INFO     [lm_eval.evaluator:177] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-06-16:01:13:34,134 INFO     [lm_eval.evaluator:230] Using pre-initialized model
2025-06-16:01:13:37,454 INFO     [lm_eval.api.task:420] Building contexts for hellaswag on rank 0...
100%|██████████| 3/3 [00:00<00:00, 5504.34it/s]
2025-06-16:01:13:37,457 INFO     [lm_eval.evaluator:525] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 12/12 [00:01<00:00, 11.88it/s]
2025-06-16:01:13:38,992 INFO     [lm_eval.models.huggingface:136] Using device 'cuda'


  ✓ google/flan-t5-base evaluation completed

Evaluating google/flan-t5-small...
[DEBUG - _parse_args_to_config] Args=1: has namespace? False
Using device: cuda for model evaluation


2025-06-16:01:13:42,164 INFO     [lm_eval.models.huggingface:376] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
2025-06-16:01:14:07,510 INFO     [lm_eval.evaluator:177] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-06-16:01:14:07,510 INFO     [lm_eval.evaluator:230] Using pre-initialized model
2025-06-16:01:14:10,545 INFO     [lm_eval.api.task:420] Building contexts for hellaswag on rank 0...
100%|██████████| 3/3 [00:00<00:00, 5673.09it/s]
2025-06-16:01:14:10,549 INFO     [lm_eval.evaluator:525] Running loglikelihood requests
Running loglikelihood requests: 100%|██████████| 12/12 [00:00<00:00, 43.61it/s]


  ✓ google/flan-t5-small evaluation completed


In [16]:
# Display comparison results
print("\nModel Comparison Results:")
print("========================")
print(f"{'Model':<15} {'Accuracy':<10} {'Acc (Norm)':<10}")
print("-" * 35)

for model, results in comparison_results.items():
    if results:
        acc = results.get('acc', 'N/A')
        acc_norm = results.get('acc_norm', 'N/A')
        
        acc_str = f"{acc:.4f}" if isinstance(acc, (int, float)) else str(acc)
        acc_norm_str = f"{acc_norm:.4f}" if isinstance(acc_norm, (int, float)) else str(acc_norm)
        
        print(f"{model:<15} {acc_str:<10} {acc_norm_str:<10}")
    else:
        print(f"{model:<15} {'Failed':<10} {'Failed':<10}")



Model Comparison Results:
Model           Accuracy   Acc (Norm)
-----------------------------------
google/flan-t5-base N/A        N/A       
google/flan-t5-small N/A        N/A       
