# PCAI Use Case Demo - Benchmark LoRA Adapter
In this tutorial, we benchmark a LoRA adapter deployed with MLIS. We use the lighteval library from Hugging Face to build and run a custom benchmark on our own dataset.

## What is Lighteval?
Lighteval is an all-in-one toolkit for evaluating LLMs across multiple backends (transformers, tgi, inference providers, vllm, or nanotron) with ease. It saves detailed, sample-by-sample results so you can debug your model's performance and see how your models stack up.

### 0. Prerequisites
**Install Required Libraries**</br>
Before running the demo, install the required libraries in your environment:

In [1]:
!pip install lighteval==0.12.2 "litellm[caching]==1.79.1" nltk==3.9.1



In [2]:
%%capture
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
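
Before moving on, it can help to confirm that the environment actually holds the pinned versions. The following is a minimal sketch using only the standard library; the `check_versions` helper and the `EXPECTED` mapping are illustrative names, not part of lighteval or MLIS.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
EXPECTED = {"lighteval": "0.12.2", "litellm": "1.79.1", "nltk": "3.9.1"}


def check_versions(expected: dict[str, str]) -> dict[str, str]:
    """Return a status per package: 'ok', 'missing', or 'found <version>'."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == pinned else f"found {installed}"
    return report


for name, status in check_versions(EXPECTED).items():
    print(f"{name}: {status}")
```

Any status other than `ok` suggests the kernel is using a different environment than the one `pip` installed into.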

## Create Custom Benchmarks
Lighteval lets you flexibly add new evaluation tasks by creating a task file, defining how data is processed, selecting metrics, configuring the task, registering it, and running it via CLI.

Here is a concise summary of the steps for adding a custom task in Lighteval:</br>
**1. Create Task File:** Add a Python file that implements the custom task.</br>
**2. Define Prompt Function:** Write a function that converts each dataset entry into a Doc object for evaluation.</br>
**3. Choose or Create Metrics:** Use a built-in metric (e.g., Metrics.ACCURACY) or define a custom one using SampleLevelMetric.</br>
**4. Define Task Configuration:** Use LightevalTaskConfig to specify the task's name, prompt function, dataset details, metrics, generation settings, etc.</br>
**5. Register Task:** Add the task to the TASKS_TABLE list, which Lighteval reads to discover tasks for evaluation.</br>
**6. Run the Task:** Launch the evaluation with the registered task name.</br>

Reference: https://huggingface.co/docs/lighteval/en/adding-a-custom-task
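
Step 2 is the part most specific to a dataset, so here is a minimal sketch of it that runs without lighteval installed. `DocSketch` is a hypothetical stand-in that holds only the fields this tutorial passes to Doc, the sample row is made up, and the column names `message_1` and `message_2` are the ones used by the dataset in this demo.

```python
from dataclasses import dataclass


@dataclass
class DocSketch:
    """Hypothetical stand-in for lighteval's Doc, limited to the fields used here."""
    task_name: str
    query: str
    choices: list[str]
    gold_index: int


def prompt_fn_sketch(line: dict, task_name: str) -> DocSketch:
    # The question becomes the query; the single reference answer is the only choice,
    # so the gold index is always 0.
    return DocSketch(
        task_name=task_name,
        query=line["message_1"],
        choices=[line["message_2"]],
        gold_index=0,
    )


row = {"message_1": "What is 2 + 3?", "message_2": "2 + 3 = 5"}  # made-up example row
doc = prompt_fn_sketch(row, "custom_task")
print(doc.choices[doc.gold_index])
```

For a generative metric, the text at `choices[gold_index]` is the reference that the model output is compared against.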

In [3]:
from datasets import load_dataset

org_dataset_path = './org_dataset'

In [4]:
dataset = load_dataset('rhgt1996/camel_math_split')

dataset['train'].to_json(org_dataset_path + '/camel_math_train.json')
dataset['test'].to_json(org_dataset_path + '/camel_math_test.json')
dataset['validation'].to_json(org_dataset_path + '/camel_math_val.json')

8102409
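
To confirm that the export contains the columns the prompt function will read, the first record can be inspected. This sketch assumes the files were written in JSON Lines form (one JSON object per line), which is what `Dataset.to_json` produces by default as far as I know; `peek_jsonl` is an illustrative helper name.

```python
import json
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=1):
    """Return the first n records of a JSON Lines file as dictionaries."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, n) if line.strip()]


val_path = Path("./org_dataset/camel_math_val.json")
if val_path.exists():
    for record in peek_jsonl(val_path):
        # The custom task reads "message_1" (question) and "message_2" (answer).
        print(sorted(record))
```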

In [5]:
script = f"""
from lighteval.tasks.requests import Doc
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig


def prompt_fn(line: dict, task_name: str):
    query = line["message_1"]
    choices = [line["message_2"]]
    return Doc(
        task_name=task_name,
        query=query,
        choices=choices,
        gold_index=0,
    )

custom_task = LightevalTaskConfig(
    name="custom_task",
    prompt_function=prompt_fn,
    hf_repo="{org_dataset_path}",
    hf_subset="",
    evaluation_splits=["validation"],
    few_shots_split='train',
    few_shots_select='random_sampling_from_train',
    generation_size=1024,
    metrics=[Metrics.bleu],
    stop_sequence=[],
    version=0,
)

TASKS_TABLE = [custom_task]
"""

with open('custom_task.py','w') as f:
    f.write(script)
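
Because the task file is generated from an f-string, a stray brace or quote in the template would only surface when lighteval imports the file. A possible sanity check, sketched with the standard `ast` module (the `top_level_names` helper is an illustrative name), parses the file and confirms that `TASKS_TABLE` is defined at module level:

```python
import ast
from pathlib import Path


def top_level_names(source: str) -> set[str]:
    """Parse Python source and return names assigned or defined at module level."""
    names = set()
    for node in ast.parse(source).body:  # raises SyntaxError if the source is invalid
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.Assign):
            names.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return names


task_file = Path("custom_task.py")
if task_file.exists():
    names = top_level_names(task_file.read_text())
    print("TASKS_TABLE defined:", "TASKS_TABLE" in names)
```

This only checks syntax and names; it does not import lighteval or validate the task configuration itself.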

In this demo, we use the model endpoint deployed in <a href="./2.Serving LoRa using MLIS python sdk.ipynb">2.Serving LoRa using MLIS python sdk</a>.

In [6]:
import lighteval
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.endpoints.litellm_model import LiteLLMModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
from lighteval.models.model_input import GenerationParameters

06:54:22 - LiteLLM:DEBUG: http_handler.py:678 - Using AiohttpTransport...
06:54:22 - LiteLLM:DEBUG: http_handler.py:736 - Creating AiohttpTransport...
06:54:22 - LiteLLM:DEBUG: http_handler.py:746 - NEW SESSION: Creating new ClientSession (no shared session provided)
06:54:23 - LiteLLM:DEBUG: litellm_logging.py:186 - [Non-Blocking] Unable to import GenericAPILogger - LiteLLM Enterprise Feature - No module named 'litellm_enterprise'
06:54:23 - LiteLLM:DEBUG: http_handler.py:678 - Using AiohttpTransport...
06:54:23 - LiteLLM:DEBUG: http_handler.py:736 - Creating AiohttpTransport...
06:54:23 - LiteLLM:DEBUG: http_handler.py:746 - NEW SESSION: Creating new ClientSession (no shared session provided)
06:54:23 - LiteLLM:DEBUG: http_handler.py:678 - Using AiohttpTransport...
06:54:23 - LiteLLM:DEBUG: http_handler.py:736 - Creating AiohttpTransport...
06:54:23 - LiteLLM:DEBUG: http_handler.py:746 - NEW SES

In [7]:
%update_token

Token successfully refreshed.


In [8]:
with open('/etc/secrets/ezua/.auth_token','r') as f:
    token = f.read().strip()  # drop any trailing newline so the token is sent cleanly

In [9]:
evaluation_tracker = EvaluationTracker(
    output_dir="./results-custom",
    save_details=True,
)

pipeline_params = PipelineParameters(
    launcher_type=ParallelismManager.NONE,
    custom_tasks_directory='./custom_task.py',  # Path to the custom task file
    max_samples=10,  # Remove this parameter once your configuration is tested
)

In [10]:
# model_name="hosted_vllm/HuggingFaceTB/SmolLM2-360M-Instruct",
model_name="hosted_vllm/math-lora"
isvc_url = "https://smollm2-360m-instruct.project-user-geun-tak-roh.serving.aie01.pcai.tryezmeral.com"

model_config = LiteLLMModelConfig(
    model_name=model_name,
    provider='vllm',
    base_url=isvc_url + "/v1",
    api_key=token,
    generation_parameters=GenerationParameters(
        temperature=0.5,
    ),
)

task = "custom_task|0"
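
Before starting a full evaluation, it may be worth checking that the endpoint really serves the adapter under the name used in `model_name`. The sketch below assumes the endpoint exposes the OpenAI-compatible `/v1/models` route (vLLM's server does) and that the token is accepted as a bearer token; both helper names are illustrative.

```python
import json
import urllib.request


def build_models_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for the OpenAI-compatible model listing."""
    url = base_url.rstrip("/") + "/v1/models"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token.strip()}"})


def served_model_ids(base_url: str, token: str, timeout: float = 10.0) -> list[str]:
    """Return the model ids reported by the endpoint (performs a network call)."""
    with urllib.request.urlopen(build_models_request(base_url, token), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

With the variables from this notebook, `"math-lora" in served_model_ids(isvc_url, token)` should be true if the adapter is loaded. On a cluster with a self-signed certificate, run the check only after the certificate variables are set.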

If your cluster uses a self-signed certificate, set the following environment variables so that certificate validation succeeds:

In [11]:
%env REQUESTS_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
%env SSL_CERT_FILE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
%env CURL_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt

env: REQUESTS_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
env: SSL_CERT_FILE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
env: CURL_CA_BUNDLE=/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt
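
Outside a notebook the `%env` magics are not available. A rough plain-Python equivalent, assuming the same certificate path, sets the variables through `os.environ` and skips the change when the file is missing (`use_ca_bundle` is an illustrative helper name):

```python
import os
from pathlib import Path

CA_ENV_VARS = ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE", "CURL_CA_BUNDLE")


def use_ca_bundle(path: str) -> bool:
    """Point the three variables at `path` if the file exists; return whether they were set."""
    if not Path(path).is_file():
        return False
    for name in CA_ENV_VARS:
        os.environ[name] = path
    return True


if not use_ca_bundle("/etc/ezua-domain-ca-certs/ezua-domain-ca-cert.crt"):
    print("CA bundle not found; leaving the environment unchanged")
```

The variables need to be set before the first HTTPS request is made, which is why this step comes before the pipeline is created.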


In [12]:
pipeline = Pipeline(
    tasks=task,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)

--max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING.
--- INIT SEEDS ---
--- LOADING TASKS ---
Loaded 647 task configs in 1.7 seconds


Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

--- LOADING MODEL ---
[CACHING] Initializing data cache


In [13]:
%%capture
pipeline.evaluate();

--- RUNNING MODEL ---
Running SamplingMethod.GENERATIVE requests
Cache: Starting to process 10/10 samples (not found in cache) for tasks custom_task|0 (a16cfdd03f37b928, GENERATIVE)
You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring.
Cached 10 samples of custom_task|0 (a16cfdd03f37b928, GENERATIVE) at /home/geun-tak-roh/.cache/huggingface/lighteval/hosted_vllm/math-lora/c00b7eda00652e58/custom_task|0/a16cfdd03f37b928/GENERATIVE.parquet.
--- POST-PROCESSING MODEL RESPONSES ---
--- COMPUTING METRICS ---
Bootstrapping compute_corpus's stderr with 1 seeds.


In [14]:
pipeline.show_results()

--- DISPLAYING RESULTS ---


|    Task     |Version|Metric| Value |   |Stderr|
|-------------|-------|------|------:|---|-----:|
|all          |       |bleu  |18.2475|±  |0.1195|
|custom_task:0|       |bleu  |18.2475|±  |0.1195|
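
To help interpret the score: BLEU combines clipped n-gram precisions (typically for n = 1 to 4) with a brevity penalty. The sketch below shows only the clipped precision for a single n on whitespace tokens. It is a simplified illustration, not lighteval's implementation, and it will not reproduce the number in the table.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


print(clipped_precision("the the the", "the cat"))
print(clipped_precision("x = 5", "x = 6", n=2))
```

Clipping is what stops a candidate from scoring well by repeating one correct word: in the first call, only one of the three occurrences of "the" is credited.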



In [15]:
pipeline.save_and_push_results()

--- SAVING AND PUSHING RESULTS ---
Saving experiment tracker
Saving results to /mnt/user/finetuning_kfp/results-custom/results/hosted_vllm/math-lora/results_2025-11-14T06-54-52.745032.json
Saving details to /mnt/user/finetuning_kfp/results-custom/details/hosted_vllm/math-lora/2025-11-14T06-54-52.745032


Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]
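
The saved results can be read back for later comparison between runs. This sketch assumes the results file is JSON with a top-level `results` key mapping task names to metric values, and that the timestamp in the file name sorts chronologically; verify both against your own output before relying on it. `load_latest_results` is an illustrative helper name.

```python
import json
from pathlib import Path


def load_latest_results(output_dir: str) -> dict:
    """Return the 'results' section of the newest results_*.json under output_dir."""
    files = list(Path(output_dir).rglob("results_*.json"))
    if not files:
        raise FileNotFoundError(f"no results_*.json under {output_dir}")
    latest = max(files, key=lambda p: p.name)  # ISO-style timestamps sort lexicographically
    with latest.open(encoding="utf-8") as f:
        return json.load(f).get("results", {})


if Path("./results-custom").exists():
    for task_name, metrics in load_latest_results("./results-custom").items():
        print(task_name, metrics)
```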