# Evaluate Your Model on Custom Data with LM-Eval

While LM-Eval comes with 100+ out-of-the-box evaluation tasks, you might want to bring your own custom task to better evaluate a model's knowledge and capabilities. This tutorial walks through an example of running a custom benchmark evaluation on the DK-Bench dataset with the TrustyAI LM-Eval Eval provider on Llama Stack.



## Overview 

This tutorial covers the following steps:

1. Uploading custom data to your OpenShift cluster
2. Registering DK-Bench as a custom benchmark with LM-Eval
3. Running model evaluations on DK-Bench


## Prerequisites

* Create a virtual environment:
`uv venv .llama-venv`

* Activate the virtual environment:
`source .llama-venv/bin/activate`

* Install the required libraries:
`uv pip install -e .`

## 1. Start the Llama Stack Server

**1.1 Configure the Llama Stack Server**

Define the following environment variables:
* `export VLLM_URL=...` - the `v1/completions` endpoint of the deployed model
* `export TRUSTYAI_LM_EVAL_NAMESPACE=...` - the namespace that the model is deployed in
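
Before starting the server, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of Llama Stack or TrustyAI.

```python
import os

# The two variables named in this section.
REQUIRED_VARS = ("VLLM_URL", "TRUSTYAI_LM_EVAL_NAMESPACE")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Set these before starting the server: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

The check only verifies that the variables are non-empty; it does not validate that the URL or namespace is correct.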

**1.2 Start the Llama Stack Server**

From the terminal, start the Llama Stack server in the virtual environment:

`llama stack run run.yaml --image-type venv`
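
Once the server process has started, a quick way to see whether anything is listening on the server port is a plain TCP connection attempt. This is a sketch under the assumption that the server runs on `localhost:8321`, the address used for `BASE_URL` later in this tutorial; `is_port_open` is a hypothetical helper, and a successful connection only shows that the port is open, not that the API is healthy.

```python
import socket


def is_port_open(host="localhost", port=8321, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("server port reachable:", is_port_open())
```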

In [1]:
# import required libraries
import logging
import os
import pprint
import subprocess
import time

**1.3 Initialize the Llama Stack Python Client**

In [2]:
# address of the Llama Stack server, as specified in the run.yaml file
BASE_URL = "http://localhost:8321"

def create_http_client():
    from llama_stack_client import LlamaStackClient
    return LlamaStackClient(base_url=BASE_URL)

# create an HTTP client to interact with the Llama Stack server
client = create_http_client()

## 2. Upload Custom Data to OpenShift

In order to run LM-Eval with a custom task, we need to provide it with a reference to a stored dataset on our cluster. In this tutorial, we will use a PersistentVolumeClaim (PVC) as our storage object.



**2.1 Create a Persistent Volume Claim (PVC) and Pod Object**

The Pod downloads the data and stores it in the PVC.

In [None]:
# Create a PVC to store custom data
!oc apply -f resources/pvc.yaml
# Create a Pod to download the data
!oc apply -f resources/pod.yaml

**2.2 Copy local data to the Pod**

A sample DK-Bench dataset named `example-dk-bench-input-bmo.jsonl` is provided in the `resources/data` folder. Let's upload it to our OpenShift cluster by copying it to the Pod we just created:

In [None]:
!oc cp resources/data/example-dk-bench-input-bmo.jsonl dataset-storage-pod:/data/upload_files/example-dk-bench-input-bmo.jsonl
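
When you substitute your own dataset, a malformed line is easier to catch locally than inside the evaluation job. The sketch below checks only that every non-blank line of a JSON Lines file parses as a JSON object; it makes no assumption about which fields DK-Bench expects, and `check_jsonl` is a hypothetical helper name.

```python
import json


def check_jsonl(path):
    """Return (record_count, errors) for a JSON Lines file.

    `errors` is a list of (line_number, message) tuples for lines that
    are not valid JSON objects. Blank lines are skipped.
    """
    count, errors = 0, []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((line_number, exc.msg))
                continue
            if not isinstance(record, dict):
                errors.append((line_number, "expected a JSON object"))
                continue
            count += 1
    return count, errors
```

For example, `check_jsonl("resources/data/example-dk-bench-input-bmo.jsonl")` returns the number of valid records and a list of any problem lines.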

## 2. Register the Custom Dataset



To run evaluations on a custom dataset, the `metadata` must include, at a minimum:

* The `TrustyAI LM-Eval Tasks` GitHub URL, branch, commit SHA, and path of the custom task
* The location of the custom task file in our PVC

In [None]:
client.benchmarks.register(
    benchmark_id="trustyai_lmeval::dk-bench",
    dataset_id="trustyai_lmeval::dk-bench",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={
        "custom_task": {
            "git": {
                "url": "https://github.com/trustyai-explainability/lm-eval-tasks.git",
                "branch": "main",
                "commit": "8220e2d73c187471acbe71659c98bccecfe77958",
                "path": "tasks/",
            }
        },
        "env": {
            "DK_BENCH_DATASET_PATH": "/opt/app-root/src/hf_home/example-dk-bench-input-bmo.jsonl", # Path of the dataset inside the PVC
            "JUDGE_MODEL_URL": "http://phi-3-predictor:8080/v1/chat/completions",
            "JUDGE_MODEL_NAME": "phi-3",  # For simplicity, we use the same model as the one being evaluated
            "JUDGE_API_KEY": "",
        },
        "tokenized_requests": False,
        "tokenizer": "google/flan-t5-small",
        "input": {"storage": {"pvc": "my-pvc"}}
    },
)

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/eval/benchmarks "HTTP/1.1 200 OK"
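
If you register several custom benchmarks, it may be convenient to assemble the required part of the `metadata` in one place. The following is a sketch only: `build_custom_task_metadata` is a hypothetical helper, and the check for a full 40-character hexadecimal commit SHA is an assumption of this example, since the tutorial does not say whether abbreviated SHAs are accepted. The judge-model variables and tokenizer settings from the cell above are left out and would need to be merged in separately.

```python
import re


def build_custom_task_metadata(git_url, branch, commit, task_path, dataset_path, pvc_name):
    """Assemble a metadata dict shaped like the one passed to benchmarks.register()."""
    # Assumption of this sketch: require a full SHA-1 commit hash.
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError(f"expected a full 40-character commit SHA, got {commit!r}")
    return {
        "custom_task": {
            "git": {
                "url": git_url,
                "branch": branch,
                "commit": commit,
                "path": task_path,
            }
        },
        # Path of the dataset as seen by the evaluation job.
        "env": {"DK_BENCH_DATASET_PATH": dataset_path},
        # Name of the PVC that holds the uploaded dataset.
        "input": {"storage": {"pvc": pvc_name}},
    }
```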


## 3. Run Benchmark Evaluation

**3.1 Initiate an LM-Eval Job**

In [None]:
job = client.eval.run_eval(
    benchmark_id="trustyai_lmeval::dk-bench",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "phi-3",
            "provider_id": "trustyai_lmeval",
            "sampling_params": {
                "temperature": 0.7,
                "top_p": 0.9,
                "max_tokens": 256
            },
        },
        "num_examples": 1000,
    },
)

print(f"Starting job '{job.job_id}'")

INFO:httpx:HTTP Request: POST http://localhost:8321/v1/eval/benchmarks/trustyai_lmeval::dk-bench/jobs "HTTP/1.1 200 OK"


Starting job 'lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff'


**3.2 Iteratively Check the Job's Status for Results**

The job's status must be reported as `completed` before we can retrieve the results of the evaluation. A status of `failed` also ends the polling loop below.

In [None]:
def get_job_status(job_id, benchmark_id):
    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)

while True:
    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench")
    print(job)

    if job.status in ['failed', 'completed']:
        print(f"Job ended with status: {job.status}")
        break

    time.sleep(20)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1/eval/benchmarks/trustyai_lmeval::dk-bench/jobs/lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff "HTTP/1.1 200 OK"


Job(job_id='lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff', status='scheduled')


INFO:httpx:HTTP Request: GET http://localhost:8321/v1/eval/benchmarks/trustyai_lmeval::dk-bench/jobs/lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff "HTTP/1.1 200 OK"


Job(job_id='lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff', status='in_progress')


INFO:httpx:HTTP Request: GET http://localhost:8321/v1/eval/benchmarks/trustyai_lmeval::dk-bench/jobs/lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff "HTTP/1.1 200 OK"


Job(job_id='lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff', status='completed')
Job ended with status: completed
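
The loop above polls indefinitely, so a job that never leaves `scheduled` or `in_progress` would block the notebook. A variant with an upper time limit can be sketched as follows. `wait_for_job` is a hypothetical helper, the one-hour default is an arbitrary choice, and the terminal statuses are the two used in the loop above.

```python
import time


def wait_for_job(fetch_status, timeout=3600.0, interval=20.0,
                 terminal=("completed", "failed"),
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it returns a terminal status or `timeout` seconds pass.

    `fetch_status` is any zero-argument callable returning the status string.
    `sleep` and `clock` are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        # Stop if the next poll would fall after the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)
```

With the helper defined earlier in this section, a call could look like `wait_for_job(lambda: get_job_status(job.job_id, "trustyai_lmeval::dk-bench").status)`.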


**3.3 Get the Results of the Evaluation**

In [9]:
pprint.pprint(client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench").scores)

INFO:httpx:HTTP Request: GET http://localhost:8321/v1/eval/benchmarks/trustyai_lmeval::dk-bench/jobs/lmeval-job-6b0e40a0-7d1d-4208-b9e5-769ff01c9dff/result "HTTP/1.1 200 OK"


{'dk-bench:invalid_score_count': ScoringResult(aggregated_results={'invalid_score_count': 21.0}, score_rows=[{'score': 21.0}]),
 'dk-bench:mean_score': ScoringResult(aggregated_results={'mean_score': 2.020408163265306}, score_rows=[{'score': 2.020408163265306}]),
 'dk-bench:mean_score_stderr': ScoringResult(aggregated_results={'mean_score_stderr': 0.2711931872826892}, score_rows=[{'score': 0.2711931872826892}]),
 'dk-bench:score_1_count': ScoringResult(aggregated_results={'score_1_count': 1.0}, score_rows=[{'score': 1.0}]),
 'dk-bench:score_2_count': ScoringResult(aggregated_results={'score_2_count': 2.0}, score_rows=[{'score': 2.0}]),
 'dk-bench:score_3_count': ScoringResult(aggregated_results={'score_3_count': 9.0}, score_rows=[{'score': 9.0}]),
 'dk-bench:score_4_count': ScoringResult(aggregated_results={'score_4_count': 13.0}, score_rows=[{'score': 13.0}]),
 'dk-bench:score_5_count': ScoringResult(aggregated_results={'score_5_count': 3.0}, score_rows=[{'score': 3.0}])}
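
The aggregate metrics can be cross-checked by hand from the per-score counts. The sketch below copies the numbers from the output above into plain Python values. In this run, the reported `mean_score` equals the sum of scores divided by all 49 responses (28 scored plus 21 invalid), which suggests that invalid responses contribute 0 to the mean. That reading is inferred from the arithmetic here, not from the task's documentation.

```python
# Values copied from the results printed above.
score_counts = {1: 1, 2: 2, 3: 9, 4: 13, 5: 3}
invalid_count = 21
reported_mean = 2.020408163265306

valid_count = sum(score_counts.values())
score_total = sum(score * count for score, count in score_counts.items())

# Mean over responses that received a score from 1 to 5.
mean_over_valid = score_total / valid_count
# Mean over every response, with invalid ones adding nothing to the total.
mean_over_all = score_total / (valid_count + invalid_count)

print(f"valid responses: {valid_count}, invalid: {invalid_count}")
print(f"mean over valid responses only: {mean_over_valid:.4f}")
print(f"mean over all responses:        {mean_over_all:.4f}")
print(f"matches reported mean_score:    {abs(mean_over_all - reported_mean) < 1e-9}")
```

Because 21 of the 49 responses were counted as invalid, the mean over scored responses only is considerably higher than the reported `mean_score`, so both numbers are worth looking at when comparing runs.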
