# Getting Started with LM-Eval

This tutorial shows how to run a benchmark evaluation on the `ARC-Easy` dataset using TrustyAI LM-Eval as the Eval provider on LlamaStack.

## Overview

This tutorial covers the following steps:
1. Running a script to deploy a model on vLLM
2. Connecting to a custom llama-stack server with LM-Eval added as an Eval provider
3. Registering the ARC-Easy benchmark 
4. Running an LM-Eval job to evaluate the model on the ARC-Easy benchmark

## Prerequisites
* Create a virtual environment:
`uv venv .llama-venv`

* Activate the virtual environment:
`source .llama-venv/bin/activate`

* Install the required libraries:
`uv pip install -e .`

## 0. Deploy a model on vLLM
Run the following command to deploy a Phi-3 model on the vLLM Serving Runtime. By default, the script creates a namespace named `model-namespace` and deploys the model into it.

In [None]:
!scripts/deploy_model.sh

## 1. Start the LlamaStack Server

**1.1 Configure the LlamaStack Server**

Define the following environment variables:
* `export VLLM_URL=...` - the `v1/completions` endpoint of the deployed model
* `export TRUSTYAI_LM_EVAL_NAMESPACE=...` - the namespace that the model is deployed in
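
Before starting the server, it can help to confirm that both variables are actually set in the current environment. The following is a minimal sketch, not part of the provider itself: the helper name `missing_env_vars` is hypothetical, and it only inspects the environment of the process it runs in, so it should be run from the same shell session that will launch the server.

```python
import os
from typing import Mapping

# The two variables this tutorial asks you to export.
REQUIRED_VARS = ("VLLM_URL", "TRUSTYAI_LM_EVAL_NAMESPACE")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching the real environment.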

**1.2 Start the Llama Stack Server**

From the terminal, start the Llama Stack server in the virtual environment: 

`llama stack run run.yaml --image-type venv`

In [None]:
# Import the required libraries
import logging
import os
import pprint
import subprocess
import time

**1.3 Instantiate the Llama Stack Python Client**

In [2]:
BASE_URL = "http://localhost:8321"

def create_http_client():
    from llama_stack_client import LlamaStackClient
    return LlamaStackClient(base_url=BASE_URL)

client = create_http_client()

## 2. Register ARC-Easy as a Benchmark

**2.1 Check the current list of available benchmarks**

In [None]:
benchmarks = client.benchmarks.list()

pprint.pprint(f"Available benchmarks: {benchmarks}")

Available benchmarks: []


**2.2 Register ARC-Easy as a Benchmark**

In [None]:
client.benchmarks.register(
    benchmark_id="trustyai_lmeval::arc_easy",
    dataset_id="trustyai_lmeval::arc_easy",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={
        "tokenized_requests": False,
        "tokenizer": "google/flan-t5-small"
    }
)
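
When this cell is re-run, it can be useful to check programmatically whether the benchmark is already present. The sketch below uses a hypothetical helper, `is_registered`, and assumes that `client.benchmarks.list()` returns an iterable of objects exposing an `identifier` attribute, as the `Benchmark` objects printed in this tutorial suggest. Stand-in objects are used here so the example runs without a server.

```python
from types import SimpleNamespace


def is_registered(benchmarks, benchmark_id):
    """Return True if any benchmark entry carries the given identifier."""
    return any(getattr(b, "identifier", None) == benchmark_id for b in benchmarks)


# Stand-in data for illustration only. With a live server, pass the
# result of client.benchmarks.list() instead (assumed to be iterable).
demo_benchmarks = [SimpleNamespace(identifier="trustyai_lmeval::arc_easy")]

print(is_registered(demo_benchmarks, "trustyai_lmeval::arc_easy"))
print(is_registered([], "trustyai_lmeval::arc_easy"))
```

The first call prints `True` and the second prints `False`, since the empty list contains no entry with that identifier.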

**2.3 Sanity Check**

Verify that the list of available benchmarks now includes ARC-Easy.

In [None]:
benchmarks = client.benchmarks.list()

pprint.pprint(f"Available benchmarks: {benchmarks}")

Available benchmarks: Benchmark(dataset_id='trustyai_lmeval::arc_easy',
 identifier='trustyai_lmeval::arc_easy',
 provider_id='trustyai_lmeval', provider_resource_id='string',
 scoring_functions=['string'], type='benchmark')



## 3. Run Benchmark Evaluation

**3.1 Initiate an LM-Eval Job**

In [7]:
job = client.eval.run_eval(
    benchmark_id="trustyai_lmeval::arc_easy",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "phi-3",
            "provider_id": "trustyai_lmeval",
            "sampling_params": {
                "temperature": 0.7,
                "top_p": 0.9,
                "max_tokens": 256
            },
        },
        "num_examples": 1000,
    },
)

print(f"Starting job '{job.job_id}'")

Starting job 'lmeval-job-17e25d1f-1f95-4cee-b250-cf477299b36b'


**3.2 Iteratively Check the Job's Status for Results**

The job must reach the `completed` status before the evaluation results can be retrieved. The loop below polls the status every 20 seconds and also stops if the job reports `failed`.

In [7]:
def get_job_status(job_id, benchmark_id):
    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)

while True:
    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy")
    print(job)

    if job.status in ['failed', 'completed']:
        print(f"Job ended with status: {job.status}")
        break

    time.sleep(20)

Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='scheduled')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progress')
Job(job_id='lmeval-job-9ff0b86e-cba9-4060-8385-bf14e167480b', status='in_progr
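
The loop above runs until the job reaches a terminal status, so a job that never finishes would keep it polling indefinitely. One possible variation is to add a timeout. The sketch below is an assumption-laden illustration rather than part of the client library: `wait_for_job` is a hypothetical helper that takes any zero-argument callable returning the current status string, which keeps it independent of the server and easy to try with fake data.

```python
import time


def wait_for_job(fetch_status, terminal=("completed", "failed"),
                 interval=20.0, timeout=3600.0):
    """Poll fetch_status() until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        time.sleep(interval)


# Demonstration with a fake status sequence instead of a live server.
fake_statuses = iter(["scheduled", "in_progress", "completed"])
print(wait_for_job(lambda: next(fake_statuses), interval=0.0))
```

With the notebook's own helper, the callable could be `lambda: get_job_status(job.job_id, "trustyai_lmeval::arc_easy").status`. The default `interval` mirrors the 20-second sleep used above, while the one-hour `timeout` is an arbitrary choice.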

**3.3 Get the Results of the Evaluation**

In [9]:
pprint.pprint(client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy").scores)

{'arc_easy:acc': ScoringResult(aggregated_results={'acc': 0.715}, score_rows=[{'score': 0.715}]),
 'arc_easy:acc_norm': ScoringResult(aggregated_results={'acc_norm': 0.679}, score_rows=[{'score': 0.679}]),
 'arc_easy:acc_norm_stderr': ScoringResult(aggregated_results={'acc_norm_stderr': 0.014770821817934637}, score_rows=[{'score': 0.014770821817934637}]),
 'arc_easy:acc_stderr': ScoringResult(aggregated_results={'acc_stderr': 0.014282120955200475}, score_rows=[{'score': 0.014282120955200475}])}
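
The scores come in pairs: each metric (`acc`, `acc_norm`) has a matching `_stderr` entry. As a rough aid to reading them, the sketch below pairs each metric with its standard error and computes an approximate 95% interval as the value plus or minus 1.96 standard errors. This rests on a normal approximation and on the assumption that the reported `_stderr` values are standard errors of the metric. The `summarize` helper is hypothetical, and `SimpleNamespace` objects stand in for the `ScoringResult` objects shown above, using the same numbers.

```python
from types import SimpleNamespace


def summarize(scores, z=1.96):
    """Map each metric to (value, lower, upper) using its *_stderr entry if present."""
    flat = {}
    for result in scores.values():
        flat.update(result.aggregated_results)

    summary = {}
    for name, value in flat.items():
        if name.endswith("_stderr"):
            continue
        err = flat.get(f"{name}_stderr")
        if err is None:
            summary[name] = (value, None, None)
        else:
            summary[name] = (value, value - z * err, value + z * err)
    return summary


# Stand-ins for the ScoringResult objects, with the values printed above.
demo_scores = {
    "arc_easy:acc": SimpleNamespace(aggregated_results={"acc": 0.715}),
    "arc_easy:acc_norm": SimpleNamespace(aggregated_results={"acc_norm": 0.679}),
    "arc_easy:acc_norm_stderr": SimpleNamespace(
        aggregated_results={"acc_norm_stderr": 0.014770821817934637}),
    "arc_easy:acc_stderr": SimpleNamespace(
        aggregated_results={"acc_stderr": 0.014282120955200475}),
}

for metric, (value, low, high) in summarize(demo_scores).items():
    print(f"{metric}: {value:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

With a live job, the `scores` mapping retrieved in the previous cell could be passed in place of `demo_scores`, provided its values expose `aggregated_results` as shown in the output above.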
