# Lab 09 - LLM Evaluation using lm-evaluation-harness

In [1]:
from IPython.display import HTML, display
colab_button = HTML(
    '<a target="_blank" href="https://colab.research.google.com/github/surrey-nlp/NLP-2025/blob/main/lab09/lab09_lm_eval_overview.ipynb">'
    '<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>'
)
display(colab_button)

With the vast amount of work in the LLM domain today, it helps to have a tool that people can use easily to share their results and to check the results of others, so that reported numbers can be validated. The LM Evaluation Harness is one such tool that the community has used extensively. We want to continue to support the community, and with that in mind we're excited to announce a major update to the LM Evaluation Harness that furthers our goal of open and accessible AI research.

This library allows you to evaluate pre-trained language models on over 60 standard academic benchmarks, with hundreds of subtasks and variants implemented. Moreover, it integrates directly with Hugging Face's `datasets` and Transformers APIs, making it extremely simple to use open-source models and data for these evaluations.

## Install LM-Eval

In [1]:
# Install LM-Eval
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-bb2o5s2x
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-bb2o5s2x
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit 5a9d5ba0234493e1d67868d8b21ae37bf0e36bc2
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.8)
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.16.0 (from lm_eval==0.4.8)
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting jsonlines (from lm_eval==0.4.8)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pybind11>=2.6.2 (from lm_eval==0.4.8)
  Downloading pybind11-2.13.

In [2]:
from lm_eval import api

## Create new evaluation tasks with config-based tasks

Even within the same task, published work often reports numbers based on different evaluation choices. Some report on the test set, others on the validation set or even a subset of the training set. Others use specialized prompts and verbalizers. LM-Eval uses YAML configs so that users can easily create such variations. The refactored LM-Eval takes the methods of the `Task` object and makes them configurable through attributes in the config file. There, users can specify the HF dataset by name (local datasets are also possible), the dataset splits to use, and much more. Key prompting settings such as `doc_to_text`, previously implemented as a method of the same name, are now configurable with Jinja2, which allows high-level scripting to transform an HF dataset row into the text string given to the model.



A core feature of LM-Eval is configuring tasks with YAML configs. With configs, you can fill in preset fields to set up a task easily.

Here, we will first use the task of hate detection (https://huggingface.co/datasets/dipteshkanojia/implicit_hate) as an example with two well-known, state-of-the-art architectures. This will allow us to evaluate how biased the pre-training of these models is with respect to identifying hate speech.

In [3]:
YAML_imhc_string = '''
task: implicit_hate_classification
dataset_path: dipteshkanojia/implicit_hate
output_type: multiple_choice
fewshot_split: validation
test_split: test
doc_to_text: "Statement: {{text}}"
doc_to_target: label
doc_to_choice: ["neutral", "hate"]
'''
with open('implicit-hate.yaml', 'w') as f:
    f.write(YAML_imhc_string)

### Zero-shot classification on the Implicit Hate Dataset using distilgpt2 model

To start out, we will use DistilGPT2 to perform zero-shot and few-shot classification on the previously mentioned dataset.
Zero-shot refers to evaluating a pre-trained network on a dataset without any prior training on it, while in few-shot set-ups the model is allowed to see a very small number of samples before testing. This ensures that we keep all or most of the knowledge gained during pre-training without biasing the model towards our evaluation benchmark.
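To make the zero-shot/few-shot distinction concrete, here is a minimal sketch of how a k-shot prompt could be assembled for this task. The helper names, the toy statements, the integer `label` field, and the blank-line separator are all illustrative assumptions; this is not the harness's own code, although it follows the same idea of prepending solved examples to the query.

```python
# Illustrative sketch only: format_doc, build_prompt and the toy rows are hypothetical.
CHOICES = ["neutral", "hate"]  # same order as doc_to_choice in the config


def format_doc(doc, with_answer):
    """Render one document like doc_to_text, optionally followed by its label."""
    text = f"Statement: {doc['text']}"
    if with_answer:
        text += " " + CHOICES[doc["label"]]
    return text


def build_prompt(query_doc, fewshot_docs, k):
    """Prepend the first k solved examples to the unsolved query."""
    shots = [format_doc(d, with_answer=True) for d in fewshot_docs[:k]]
    return "\n\n".join(shots + [format_doc(query_doc, with_answer=False)])


fewshot_docs = [
    {"text": "example statement one", "label": 0},
    {"text": "example statement two", "label": 1},
]
query = {"text": "statement to classify", "label": 0}

print(build_prompt(query, fewshot_docs, k=0))  # zero-shot: the query only
print("---")
print(build_prompt(query, fewshot_docs, k=2))  # 2-shot: two solved examples first
```

With `k=0` the model sees only the query, which corresponds to the zero-shot runs; `--num_fewshot 5` corresponds to `k=5`.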

In [None]:
# !lm-eval --tasks list

In [4]:
!lm_eval \
    --model hf \
    --model_args pretrained=distilbert/distilgpt2 \
    --include_path ./ \
    --tasks implicit_hate_classification \
    --limit 10

2025-03-28 12:07:46.684265: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743163666.705767    1735 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743163666.712428    1735 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:07:50 INFO     [__main__:338] Including path: ./
2025-03-28:12:07:53 INFO     [__main__:422] Selected Tasks: ['implicit_hate_classification']
2025-03-28:12:07:53 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:07:53 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretrained': 'distilbe

### Few-shot (5-shot) classification on the Implicit Hate Dataset using distilgpt2 model

In [5]:
!lm_eval \
    --model hf \
    --model_args pretrained=distilbert/distilgpt2 \
    --num_fewshot 5 \
    --include_path ./ \
    --tasks implicit_hate_classification \
    --limit 10

2025-03-28 12:08:13.466666: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743163693.487821    1895 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743163693.494259    1895 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:08:17 INFO     [__main__:338] Including path: ./
2025-03-28:12:08:20 INFO     [__main__:422] Selected Tasks: ['implicit_hate_classification']
2025-03-28:12:08:20 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:08:20 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretrained': 'distilbe

### Zero-shot classification on the Implicit Hate Dataset using GPT-Neo-1.3B model

Let's repeat the process using another well-known LLM, GPT-Neo.

In [6]:
!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-neo-1.3B \
    --include_path ./ \
    --tasks implicit_hate_classification \
    --limit 10

2025-03-28 12:08:33.229390: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743163713.249921    2033 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743163713.255963    2033 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:08:37 INFO     [__main__:338] Including path: ./
2025-03-28:12:08:39 INFO     [__main__:422] Selected Tasks: ['implicit_hate_classification']
2025-03-28:12:08:39 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:08:39 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretrained': 'Eleuther

### Few-shot (5-shot) classification on the Implicit Hate Dataset using GPT-Neo-1.3B model

In [7]:
!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-neo-1.3B \
    --num_fewshot 5 \
    --include_path ./ \
    --tasks implicit_hate_classification \
    --limit 10

2025-03-28 12:09:38.438701: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743163778.459287    2357 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743163778.465393    2357 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:09:42 INFO     [__main__:338] Including path: ./
2025-03-28:12:09:46 INFO     [__main__:422] Selected Tasks: ['implicit_hate_classification']
2025-03-28:12:09:46 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:09:46 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretrained': 'Eleuther

With LM-Eval, you can switch between evaluation benchmarks just by creating a new YAML string for each dataset you want to use. In this case we can also use the toxigen dataset, a similar collection that is also tailored to implicit hate-speech detection.

In [8]:
YAML_toxigen_string = '''
task: toxigen
dataset_path: skg/toxigen-data
dataset_name: annotated
output_type: multiple_choice
training_split: train
test_split: test
doc_to_text: "Is the following statement hateful? Respond with either Yes or No. Statement: '{{text}}'"
doc_to_target: !function utils.doc_to_target
doc_to_choice: ['No', 'Yes']
'''
with open('toxigen.yaml', 'w') as f:
    f.write(YAML_toxigen_string)
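The config above points `doc_to_target` at `utils.doc_to_target`, so a `utils.py` defining that function has to sit next to the YAML file. A minimal sketch of what it might look like follows. It assumes that each row of the annotated subset carries numeric `toxicity_ai` and `toxicity_human` scores and that a combined score above 5.5 counts as hateful; both the column names and the threshold are assumptions to check against the dataset card.

```python
# Hypothetical utils.doc_to_target for the toxigen config.
# Assumptions: each row has numeric "toxicity_ai" and "toxicity_human" fields,
# and a combined score above 5.5 is treated as hateful.
def doc_to_target(doc):
    """Return the index into doc_to_choice: 0 -> 'No', 1 -> 'Yes'."""
    combined = doc["toxicity_ai"] + doc["toxicity_human"]
    return int(combined > 5.5)


# Toy rows, made up for illustration
print(doc_to_target({"toxicity_ai": 1.0, "toxicity_human": 1.0}))  # 0
print(doc_to_target({"toxicity_ai": 4.0, "toxicity_human": 4.5}))  # 1
```

Note that a later cell in this notebook writes its own `utils.py`, so a function like this would need to live in that same file or under a different module name.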

The toxigen dataset requires users to accept its terms and conditions before using it, so we first have to log in to the Hugging Face Hub for access (this step may only work on Google Colab). To complete the process, create a Hugging Face account and follow the steps in the dialog box below.

In [None]:
# !huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

In [9]:
!lm_eval \
    --model hf \
    --model_args pretrained=distilbert/distilgpt2 \
    --include_path ./ \
    --tasks toxigen \
    --limit 10

2025-03-28 12:10:14.163524: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743163814.183921    2565 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743163814.190036    2565 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:10:18 INFO     [__main__:338] Including path: ./
2025-03-28:12:10:22 INFO     [__main__:422] Selected Tasks: ['toxigen']
2025-03-28:12:10:22 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:10:22 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretrained': 'distilbert/distilgpt2'}
2025-

In [10]:
!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-neo-1.3B \
    --include_path ./ \
    --tasks toxigen \
    --limit 10

2025-03-28 12:10:38.797069: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743163838.818085    2727 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743163838.824397    2727 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:10:42 INFO     [__main__:338] Including path: ./
2025-03-28:12:10:45 INFO     [__main__:422] Selected Tasks: ['toxigen']
2025-03-28:12:10:45 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:10:45 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretrained': 'EleutherAI/gpt-neo-1.3B'}
202

The two toxigen cells above run the evaluation by pointing `--include_path` at the directory that holds the config file we've just created.

**Challenge:** Those two cells perform zero-shot evaluation on the toxigen dataset.

Follow a similar approach to run both models (GPT-Neo and distilgpt2) in a few-shot set-up:

In [None]:
# !TODO: Run GPT-neo on a few-shot (5 examples) evaluation task:






In [None]:
# !TODO: Run distil-GPT on a few-shot (5 examples) evaluation task:






## Edit Prompt Templates Quickly

The following is a YAML config made to evaluate the specific subtask `high_school_geography` from MMLU. It uses the standard prompt, where the option letter with the highest likelihood is taken as the model's prediction.

In [20]:
# Raw string: keeps "\n" as a YAML escape sequence instead of a literal line break
YAML_mmlu_geo_string = r'''
task: demo_mmlu_high_school_geography
dataset_path: cais/mmlu
dataset_name: high_school_geography
description: "The following are multiple choice questions (with answers) about high school geography.\n\n"
test_split: test
fewshot_split: dev
fewshot_config:
  sampler: first_n
output_type: multiple_choice
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
'''
with open('mmlu_high_school_geography.yaml', 'w') as f:
    f.write(YAML_mmlu_geo_string)
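The two metrics in `metric_list` differ only in how the winning choice is picked. The sketch below assumes that `acc` takes the choice with the highest loglikelihood and that `acc_norm` first divides each loglikelihood by the character length of its choice string. The helper names, the answer strings, and the loglikelihood values are made up for illustration.

```python
# Illustrative sketch: the numbers are invented, not model output.
def pick(scores):
    """Index of the highest score (ties go to the first)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def acc_and_acc_norm(loglikelihoods, choices, gold):
    """Return (acc, acc_norm) for a single document, each 0.0 or 1.0."""
    normalised = [ll / len(c) for ll, c in zip(loglikelihoods, choices)]
    acc = float(pick(loglikelihoods) == gold)
    acc_norm = float(pick(normalised) == gold)
    return acc, acc_norm


choices = ["Asia", "South America", "Africa", "Europe"]
lls = [-4.0, -9.0, -5.0, -6.0]  # summed token loglikelihoods per choice

# The long correct answer loses on raw loglikelihood but wins after normalisation.
print(acc_and_acc_norm(lls, choices, gold=1))
```

With single-letter choices such as A to D every length is 1, so under this assumption the two metrics coincide; they can diverge once the choices are full answer strings.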


In [21]:
# !accelerate launch --no_python
!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b \
    --include_path ./ \
    --tasks demo_mmlu_high_school_geography \
    --limit 10 \
    --output_path output/mmlu_high_school_geography/ \
    --log_samples

2025-03-28 12:24:16.807668: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743164656.828043    6238 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743164656.834493    6238 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:24:21 INFO     [__main__:338] Including path: ./
2025-03-28:12:24:23 INFO     [__main__:422] Selected Tasks: ['demo_mmlu_high_school_geography']
2025-03-28:12:24:23 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:24:23 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretrained': 'Eleut

We could also evaluate this task in a different way. For example, instead of observing the loglikelihood of the letters, we can evaluate the choices themselves as the continuation. This is done by simply changing `doc_to_choice` from a list of letters to the corresponding `choices` field from the HF dataset. We write `"{{choices}}"` so that the string is interpreted as a Jinja template that takes the list directly from the HF dataset.
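The difference between the two set-ups can be sketched as the list of (context, continuation) pairs that get scored for one document. The helper `build_requests`, the sample row, and the leading space before each continuation are illustrative assumptions rather than the harness's exact behaviour.

```python
# Illustrative sketch: build_requests and the sample row are hypothetical.
def build_requests(context, choices, delimiter=" "):
    """Return one (context, continuation) pair per candidate answer."""
    return [(context, delimiter + choice) for choice in choices]


doc = {  # made-up row with the same fields as the MMLU dataset
    "question": "Which continent is Brazil in?",
    "choices": ["Asia", "South America", "Africa", "Europe"],
    "answer": 1,
}
options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", doc["choices"]))
context = f"{doc['question'].strip()}\n{options}\nAnswer:"

# doc_to_choice given as a list of letters
letter_requests = build_requests(context, ["A", "B", "C", "D"])
# doc_to_choice given as the dataset's own answer strings
text_requests = build_requests(context, doc["choices"])

print([continuation for _, continuation in letter_requests])
print([continuation for _, continuation in text_requests])
```

The context is identical in both cases; only the strings whose loglikelihood is compared change.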

Another convenient feature: since we're only modifying `doc_to_choice` and the rest of the config is the same as in the task above, we can use the above configuration as a template by writing `include: mmlu_high_school_geography.yaml` to load the config from that file. We need to add a unique task name so that it does not collide with the existing YAML config we're including. In this case we simply name it `demo_mmlu_high_school_geography_continuation`. `doc_to_text` is repeated here just for the sake of clarity.

In [22]:
# Raw string: keeps "\n" as a YAML escape sequence instead of a literal line break
YAML_mmlu_geo_string = r'''
include: mmlu_high_school_geography.yaml
task: demo_mmlu_high_school_geography_continuation
doc_to_text: "{{question.strip()}}\nA. {{choices[0]}}\nB. {{choices[1]}}\nC. {{choices[2]}}\nD. {{choices[3]}}\nAnswer:"
doc_to_choice: "{{choices}}"
'''
with open('mmlu_high_school_geography_continuation.yaml', 'w') as f:
    f.write(YAML_mmlu_geo_string)
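Conceptually, `include` behaves like a dictionary merge in which keys in the including file override those loaded from the included file. The following sketch shows that idea with plain dictionaries. `resolve_include` is a hypothetical helper, and the shallow (top-level only) merge is an assumption about how the harness combines the two configs.

```python
# Illustrative sketch: resolve_include is hypothetical, not a harness function.
def resolve_include(base_config, child_config):
    """Start from the included config, then let the child's keys override it."""
    merged = dict(base_config)
    merged.update({k: v for k, v in child_config.items() if k != "include"})
    return merged


base = {  # a few fields from mmlu_high_school_geography.yaml
    "task": "demo_mmlu_high_school_geography",
    "dataset_path": "cais/mmlu",
    "dataset_name": "high_school_geography",
    "doc_to_choice": ["A", "B", "C", "D"],
    "doc_to_target": "answer",
}
child = {
    "include": "mmlu_high_school_geography.yaml",
    "task": "demo_mmlu_high_school_geography_continuation",
    "doc_to_choice": "{{choices}}",
}

merged = resolve_include(base, child)
for key, value in merged.items():
    print(f"{key}: {value}")
```

Fields that the child does not mention, such as `dataset_path` and `doc_to_target`, are inherited unchanged.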


In [23]:
# !accelerate launch --no_python
!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b \
    --include_path ./ \
    --tasks demo_mmlu_high_school_geography_continuation \
    --limit 10 \
    --output_path output/mmlu_high_school_geography_continuation/ \
    --log_samples


2025-03-28 12:28:56.725037: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743164936.745310    7446 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743164936.751452    7446 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:29:00 INFO     [__main__:338] Including path: ./
2025-03-28:12:29:02 INFO     [__main__:422] Selected Tasks: ['demo_mmlu_high_school_geography_continuation']
2025-03-28:12:29:02 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:29:02 INFO     [evaluator:218] Initializing hf model, with arguments: {'pretra

If we take a look at the samples, we can see that it is in fact evaluating the continuation based on the choices rather than the letters.

Go to your output folder in Google Colab and paste the path of the JSONL samples file into the cell below. The file path should look something like
`/content/output/mmlu_high_school_geography_continuation/EleutherAI__pythia-2.8b/samples_demo_mmlu_high_school_geography_continuation_2025-03-27T18-13-18.735917.jsonl`
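Since the timestamp in the file name changes on every run, it can be convenient to locate and read the samples file programmatically. The sketch below assumes only that the file holds one JSON record per line; the `glob` pattern is a guess based on the path shown above, and no particular record keys are assumed.

```python
import glob
import json


def load_samples(path):
    """Read a .jsonl file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Search for the file instead of hard-coding its timestamp.
# The pattern is an assumption based on the example path above.
pattern = "output/mmlu_high_school_geography_continuation/**/samples_*.jsonl"
matches = sorted(glob.glob(pattern, recursive=True))

if matches:
    samples = load_samples(matches[-1])  # latest timestamp sorts last
    print(f"{len(samples)} samples in {matches[-1]}")
    if samples:
        print("keys of the first record:", sorted(samples[0]))
else:
    print("No samples file found; run the evaluation cell with --log_samples first.")
```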

In [None]:
from google.colab import files
files.view("/content/output/mmlu_high_school_geography_continuation/EleutherAI__pythia-2.8b/samples_demo_mmlu_high_school_geography_continuation_2025-03-29T14-20-43.644245.jsonl")


<IPython.core.display.Javascript object>

## Closer Look at YAML Fields

To prepare a task we can simply fill in a YAML config with the relevant information.

`output_type`
The currently provided evaluation types are the following:
1.   `loglikelihood`: Evaluates the loglikelihood of a continuation, conditioned on some input string.
2.   `loglikelihood_rolling`: Evaluates the loglikelihood of producing a string, conditioned on the empty string. (Used for perplexity evaluations.)
3.   `multiple_choice`: Evaluates the loglikelihood of each of a number of choices and takes the most likely one as the model's prediction.
4.   `generate_until`: The model generates text greedily until a stop condition is met (can be configured to use beam search and other generation-related parameters). Older releases called this `greedy_until`.

The core prompt revolves around 3 fields.
1. `doc_to_text`: Denotes the prompt template that will be used as input to the model.
2. `doc_to_choice`: Available choices that will be used as continuation for the model. This is used when the `output_type` is `multiple_choice`, and otherwise can be left as `None`.
3. `doc_to_target`: When `output_type` is `multiple_choice`, this can be an index that corresponds to the correct answer, or the answer string itself (which must be one of the entries in `doc_to_choice`). For other tasks, this is expected to be a string. You can fill this field with a feature name from the HF dataset so long as the resulting feature follows the conditions described.

These three fields can be expressed as strings, column names from the source dataset, or as Jinja2 templates that can use fields from the source dataset as variables.
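The two accepted forms of `doc_to_target` for `multiple_choice` tasks can both be reduced to an index into the choice list. The sketch below illustrates that reduction; `gold_index` is a hypothetical helper, not a harness function.

```python
# Illustrative sketch: gold_index is hypothetical.
def gold_index(target, choices):
    """Map a target given as an index or as the answer string to an index."""
    if isinstance(target, int):
        if not 0 <= target < len(choices):
            raise ValueError(f"index {target} is out of range for {choices}")
        return target
    if target in choices:
        return choices.index(target)
    raise ValueError(f"{target!r} is not one of {choices}")


choices = ["neutral", "hate"]
print(gold_index(1, choices))          # target given as an index
print(gold_index("neutral", choices))  # target given as the answer string
```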


## What if Jinja is not Sufficient?

There can be times where the Jinja2 templating language is not enough to make the prompt we had in mind. There are a few ways to circumvent this limitation:

1. Use the `!function` operator for the prompt-related fields to pass a Python function that takes a dataset row as input and outputs the prompt template component.
2. Perform a transformation on the dataset beforehand.

Below, we show an example of using `!function` to create `doc_to_text` from a python function:

In [24]:
YAML_mmlu_geo_string = '''
include: mmlu_high_school_geography.yaml
task: demo_mmlu_high_school_geography_function_prompt
doc_to_text: !function utils.doc_to_text
doc_to_choice: "{{choices}}"
'''
with open('demo_mmlu_high_school_geography_function_prompt.yaml', 'w') as f:
    f.write(YAML_mmlu_geo_string)

DOC_TO_TEXT = '''
def doc_to_text(x):
    question = x["question"].strip()
    choices = x["choices"]
    option_a = choices[0]
    option_b = choices[1]
    option_c = choices[2]
    option_d = choices[3]
    return f"{question}\\nA. {option_a}\\nB. {option_b}\\nC. {option_c}\\nD. {option_d}\\nAnswer:"
'''
with open('utils.py', 'w') as f:
    f.write(DOC_TO_TEXT)

!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b \
    --include_path ./ \
    --tasks demo_mmlu_high_school_geography_function_prompt \
    --limit 10 \
    --output_path output/demo_mmlu_high_school_geography_function_prompt/ \
    --log_samples


2025-03-28 12:29:26.835691: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743164966.857369    7643 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743164966.863807    7643 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:29:31 INFO     [__main__:338] Including path: ./
2025-03-28:12:29:33 INFO     [__main__:422] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt']
2025-03-28:12:29:33 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:29:33 INFO     [evaluator:218] Initializing hf model, with arguments: {'pre
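A value such as `utils.doc_to_text` names a module file next to the YAML config and a function inside it. The sketch below shows one way such a reference could be resolved with `importlib`. `load_function` and the throwaway `demo_utils` module are hypothetical, and the lookup rule (module path relative to the config's directory) is an assumption about the harness.

```python
import importlib.util
import os
import tempfile


def load_function(reference, config_dir):
    """Resolve 'module.function' to a callable, loading module.py from config_dir."""
    module_name, function_name = reference.rsplit(".", 1)
    module_path = os.path.join(config_dir, module_name + ".py")
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the file so its functions exist
    return getattr(module, function_name)


# Demonstrate with a throwaway module in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo_utils.py"), "w") as f:
        f.write("def doc_to_text(x):\n    return x['question'].strip() + '\\nAnswer:'\n")
    doc_to_text = load_function("demo_utils.doc_to_text", tmp)

print(doc_to_text({"question": "  What is the capital of France?  "}))
```

Because the function is ordinary Python, it can do anything a Jinja2 template cannot, such as branching on several fields or calling helper libraries.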

Next, we'll also show how to do this via preprocessing the dataset as necessary using the `process_docs` config field:

We will write a function that will modify each document in our evaluation dataset's split to add a field that is suitable for us to use in `doc_to_text`.

In [25]:
YAML_mmlu_geo_string = '''
include: mmlu_high_school_geography.yaml
task: demo_mmlu_high_school_geography_function_prompt_2
process_docs: !function utils_process_docs.process_docs
doc_to_text: "{{input}}"
doc_to_choice: "{{choices}}"
'''
with open('demo_mmlu_high_school_geography_process_docs.yaml', 'w') as f:
    f.write(YAML_mmlu_geo_string)

DOC_TO_TEXT = '''
def process_docs(dataset):
    def _process_doc(x):
        question = x["question"].strip()
        choices = x["choices"]
        option_a = choices[0]
        option_b = choices[1]
        option_c = choices[2]
        option_d = choices[3]
        # Create a new doc dictionary with processed fields
        out_doc = {
            "input": f"{question}\\nA. {option_a}\\nB. {option_b}\\nC. {option_c}\\nD. {option_d}\\nAnswer:",
            "choices": ["A", "B", "C", "D"]
        }
        return out_doc
    return dataset.map(_process_doc)
'''

with open('utils_process_docs.py', 'w') as f:
    f.write(DOC_TO_TEXT)

!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-2.8b \
    --include_path ./ \
    --tasks demo_mmlu_high_school_geography_function_prompt_2 \
    --limit 10 \
    --output_path output/demo_mmlu_high_school_geography_function_prompt_2/ \
    --log_samples


2025-03-28 12:29:51.408311: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1743164991.428835    7787 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1743164991.435207    7787 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-28:12:29:56 INFO     [__main__:338] Including path: ./
2025-03-28:12:29:58 INFO     [__main__:422] Selected Tasks: ['demo_mmlu_high_school_geography_function_prompt_2']
2025-03-28:12:29:58 INFO     [evaluator:180] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-03-28:12:29:58 INFO     [evaluator:218] Initializing hf model, with arguments: {'p
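To see what `process_docs` does to a single row without downloading the dataset, the sketch below applies the same per-row function to a made-up document. `map_rows` is a hypothetical stand-in for the dataset's `map` call, assuming that returned keys overwrite existing columns and that untouched columns are kept.

```python
# Illustrative sketch: map_rows and the sample row are hypothetical.
def _process_doc(x):
    """Same transformation as in utils_process_docs.py."""
    question = x["question"].strip()
    option_a, option_b, option_c, option_d = x["choices"]
    return {
        "input": f"{question}\nA. {option_a}\nB. {option_b}\nC. {option_c}\nD. {option_d}\nAnswer:",
        "choices": ["A", "B", "C", "D"],
    }


def map_rows(rows, fn):
    """Stand-in for dataset.map: merge each returned dict into its row."""
    return [{**row, **fn(row)} for row in rows]


rows = [
    {
        "question": " Which continent is Brazil in? ",
        "choices": ["Asia", "South America", "Africa", "Europe"],
        "answer": 1,
    }
]
processed = map_rows(rows, _process_doc)

print(processed[0]["input"])
print(processed[0]["choices"], processed[0]["answer"])
```

The new `input` column is what `doc_to_text: "{{input}}"` reads, while `choices` now holds the letters, so `doc_to_choice: "{{choices}}"` scores letters again rather than the answer strings.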

**Challenge:** The wino bias dataset contains a collection of occupation-related sentences with missing pronouns. A model has to predict which pronoun (him or her) is the biased option and which is the non-biased one (for more information on the dataset visit https://huggingface.co/datasets/sasha/wino_bias_prompt2).

However, this dataset requires pre-processing before it can be used with the LM-Eval harness. Use the techniques learned above to do so.

In [None]:
YAML_winobias_string = '''
task: wino_bias
dataset_path: sasha/wino_bias_prompt2
output_type: multiple_choice
validation_split: validation
test_split: test
doc_to_text: "Statement: {{prompt_phrase}}"
doc_to_choice: ['anti_bias_pronoun', 'bias_pronoun']
'''
with open('wino-bias.yaml', 'w') as f:
    f.write(YAML_winobias_string)

In [None]:
# !TODO: pre-process the wino_bias dataset using the techniques shown above:




Once you have successfully pre-processed the wino bias dataset, you can evaluate the model of your choice by running the code below. We are using the same two models as an example, but feel free to explore other options!

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/gpt-neo-1.3B \
    --include_path ./ \
    --tasks wino_bias \
    --limit 10

In [None]:
!lm_eval \
    --model hf \
    --model_args pretrained=distilbert/distilgpt2 \
    --include_path ./ \
    --tasks wino_bias \
    --limit 10