# Evaluation

The evaluation module tests models on tasks they were not explicitly trained for, giving insight into how well they generalize to novel situations. PhyAGI wraps this in a small evaluation framework, so you can measure a model's performance and judge its suitability for a given application.

## Loading tokenizer and model

The first step is to load the tokenizer, which must be the same one the model was trained with, because it converts the evaluation data into model inputs. In this example, we will use the `microsoft/phi-2` tokenizer from the Hugging Face hub, as follows:


In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")

After loading the tokenizer, we can load the model. Again, we will use `microsoft/phi-2` from the Hugging Face hub, as follows:

In [2]:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", device_map="cuda", trust_remote_code=True)


## Evaluating the model

PhyAGI provides a **task-centric** API: you select a task object, pass the model and tokenizer, and call `run()`. The task takes care of:

- Downloading or generating the dataset.
- Formatting inputs and expected outputs.
- Computing the chosen metric(s).

Below we measure accuracy on **PIQA** (physical commonsense reasoning) with just a few lines of code.

In [3]:
from phyagi.eval.tasks.piqa import PIQA

results = PIQA.run(model, tokenizer, batch_size=8)
print(results)

Device set to use cuda
100%|██████████| 1838/1838 [00:43<00:00, 42.26it/s]


{'accuracy': 0.7872687704026116, 'accuracy_norm': 0.7921653971708379}
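
The two reported numbers differ only in how the winning choice is picked. The sketch below assumes the built-in task scores choices the same way as the custom task in the Customization section (raw log-likelihood for `accuracy`, log-likelihood divided by target length for `accuracy_norm`); this is an assumption rather than something confirmed for PIQA. `pick_choice` is a hypothetical helper and the numbers are made up.

```python
from typing import Sequence, Tuple


def pick_choice(log_likelihoods: Sequence[float], target_lengths: Sequence[int]) -> Tuple[int, int]:
    """Return (raw_choice, length_normalized_choice) for one multiple-choice example."""
    indices = range(len(log_likelihoods))
    # Raw: the candidate with the highest total log-likelihood wins.
    raw = max(indices, key=lambda i: log_likelihoods[i])
    # Normalized: divide by the target length first, so long answers are not
    # penalized just for containing more terms in the sum.
    norm = max(indices, key=lambda i: log_likelihoods[i] / target_lengths[i])
    return raw, norm


# Toy scores for two candidate answers (illustrative values only).
log_likelihoods = [-6.0, -9.0]
target_lengths = [2, 6]  # unit is whatever the pipeline reports

raw_choice, norm_choice = pick_choice(log_likelihoods, target_lengths)
print(raw_choice, norm_choice)
```

Here the short answer wins on raw score, while the longer answer wins once scores are divided by length, which is why the two accuracies can disagree on individual examples.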


There are dozens of pre-defined tasks covering a variety of applications, such as question answering, common-sense reasoning, and code generation. The full list of tasks can be found [here](https://microsoft.github.io/phyagi-sdk/api/eval).

# Integrations

Prefer a familiar benchmark harness? PhyAGI can act as a thin wrapper around popular suites such as [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness), letting you keep a single workflow while tapping into extra tasks.

## LM-Eval

Before using LM-Eval, install it by running the following command from the root folder of PhyAGI:

```bash
pip install -e ".[eval]"
```

Once installed, the code cell below shows how to launch an LM‑Eval run that internally reuses the same model object loaded earlier.

In [4]:
from phyagi.eval.tasks.lm_evaluation_harness import LMEvaluationHarness

results = LMEvaluationHarness.run(model, tokenizer, tasks="piqa", batch_size=8)
print(results)

100%|██████████| 1838/1838 [00:01<00:00, 1538.31it/s]
Running loglikelihood requests: 100%|██████████| 3676/3676 [00:19<00:00, 188.68it/s]


{'piqa': {'accuracy': 0.7856365614798694, 'accuracy_stderr': 0.009574842136050943, 'accuracy_norm': 0.7927094668117519, 'accuracy_norm_stderr': 0.009457844699952379}}
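
The `*_stderr` fields look consistent with the sample standard error of the mean over per-example 0/1 correctness scores. The sketch below is an assumption about how such a value can be derived, not LM-Eval's own code: `accuracy_stderr` is a hypothetical helper, and 1838 is the example count shown in the progress bar.

```python
import math


def accuracy_stderr(accuracy: float, n_examples: int) -> float:
    """Sample standard error of the mean of n 0/1 correctness scores."""
    # For 0/1 scores with mean p, the sample variance is n * p * (1 - p) / (n - 1).
    # Dividing by n and taking the square root gives the standard error.
    return math.sqrt(accuracy * (1.0 - accuracy) / (n_examples - 1))


stderr = accuracy_stderr(0.7856365614798694, 1838)
print(stderr)
```

Under this assumption the result agrees with the reported `accuracy_stderr` to several decimal places, so a gap of roughly two standard errors (about 0.02 here) is a reasonable rule of thumb before treating two accuracy figures as different.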


# Customization

This section shows how to customize the evaluation module to fit your needs. The only requirement for a custom evaluation task is a `run` function that takes the `model` and `tokenizer` as positional arguments.

In [5]:
from typing import Any, Dict, List, Optional

import numpy as np
import torch
from datasets import load_dataset
from evaluate import load
from tqdm import tqdm
from transformers import PreTrainedTokenizerBase

from phyagi.eval.generation import example_generator
from phyagi.eval.log_likelihood_pipeline import LogLikelihoodPipeline
from phyagi.utils.file_utils import save_json_file


class ARCEasy:
    @staticmethod
    def mapping_fn(example: Dict[str, Any]) -> List[Dict[str, Any]]:
        prompt = "Question: {}\nAnswer:{}"
        targets = [" " + choice for choice in example["choices"]["text"]]

        answer_key_map = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}
        answer_key = answer_key_map.get(example["answerKey"], example["answerKey"])

        return [
            {
                "source": prompt.format(example["question"], target),
                "target": target,
                "label": ["A", "B", "C", "D", "E"].index(answer_key),
            }
            for target in targets
        ]

    @staticmethod
    def run(model: torch.nn.Module, tokenizer: PreTrainedTokenizerBase, output_file_path: Optional[str] = None, **kwargs) -> Dict[str, Any]:
        pipeline = LogLikelihoodPipeline(model, tokenizer)
        dataset = load_dataset("ai2_arc", name="ARC-Easy")["test"]
        metric = {
            "accuracy": load("accuracy"),
            "accuracy_norm": load("accuracy"),
        }
        outputs = []

        for output in tqdm(
            pipeline(example_generator(dataset, mapping_fn=ARCEasy.mapping_fn), **kwargs), total=len(dataset)
        ):
            log_likelihoods = output["log_likelihoods"]
            target_lengths = output["target_lengths"]
            label = output["label"]

            prediction = np.argmax(log_likelihoods)
            prediction_norm = np.argmax(np.array(log_likelihoods) / target_lengths)

            metric["accuracy"].add(predictions=prediction, reference=label)
            metric["accuracy_norm"].add(predictions=prediction_norm, reference=label)

            outputs.append(output)

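        # Persist the raw per-example outputs only when a path is provided
        if output_file_path:
            save_json_file(outputs, output_file_path)

        # Use a distinct loop name so the `metric` dict is not shadowed
        return {key: m.compute()["accuracy"] for key, m in metric.items()}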
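
To see what `mapping_fn` hands to the pipeline, the sketch below copies its logic into a standalone function and applies it to a hand-written record. The record follows the ARC field layout (`question`, `choices`, `answerKey`) but is made up for illustration, not taken from the dataset.

```python
from typing import Any, Dict, List


def mapping_fn(example: Dict[str, Any]) -> List[Dict[str, Any]]:
    # Same logic as ARCEasy.mapping_fn, copied so the snippet runs on its own.
    prompt = "Question: {}\nAnswer:{}"
    targets = [" " + choice for choice in example["choices"]["text"]]

    # Some records use numeric answer keys; normalize them to letters.
    answer_key_map = {"1": "A", "2": "B", "3": "C", "4": "D", "5": "E"}
    answer_key = answer_key_map.get(example["answerKey"], example["answerKey"])

    return [
        {
            "source": prompt.format(example["question"], target),
            "target": target,
            "label": ["A", "B", "C", "D", "E"].index(answer_key),
        }
        for target in targets
    ]


# Hand-written record (illustrative only).
example = {
    "question": "Which object is attracted to a magnet?",
    "choices": {"text": ["a wooden spoon", "an iron nail"], "label": ["1", "2"]},
    "answerKey": "2",
}

records = mapping_fn(example)
for record in records:
    print(repr(record["source"]), repr(record["target"]), record["label"])
```

Each question expands into one record per candidate answer, and every record carries the same gold `label` index, so the scores of all candidates can later be compared against it.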

The core computation of the task is defined by the `LogLikelihoodPipeline`, which extends the `transformers.Pipeline` class and provides a straightforward way to compute the log-likelihood of a given sentence. Because it builds on the Hugging Face API, we do not need to worry about batched inference or multi-GPU usage: the pipeline implements these features under the hood.
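
As a rough illustration of what "log-likelihood of a sentence" means, the sketch below sums per-token log-probabilities obtained from raw logits. It is a plain-Python toy under stated assumptions, not the `LogLikelihoodPipeline` implementation; `log_softmax` and `sequence_log_likelihood` are hypothetical helpers and the logits are made up.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_log_likelihood(step_logits: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Sum of log p(token_t | context) over the target tokens.

    step_logits[t] holds the logits used to predict target_ids[t].
    """
    return sum(log_softmax(row)[tok] for row, tok in zip(step_logits, target_ids))


# Toy vocabulary of 3 tokens and a 2-token target with uniform logits.
step_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
target_ids = [1, 2]

ll = sequence_log_likelihood(step_logits, target_ids)
print(ll)
```

With uniform logits each token has probability 1/3, so the total is twice the log of 1/3. A real pipeline would obtain the logits from the model and handle batching and padding.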

Although using pipelines is not required, PhyAGI provides a set of pipelines that make it easier to implement custom evaluation tasks. Please see the [documentation](https://microsoft.github.io/phyagi-sdk/api/eval.html#utilities) for the pre-defined pipelines.