## How to contribute by creating a new task?

This notebook is a quick walkthrough of how to implement a new task from scratch with `lm-evaluation-harness`.

The guide is based on the [original guide for adding a new task](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md); check out that guide for a more comprehensive tutorial. In this tutorial, we will create a small variant of the `piqa` benchmark in which the candidate solutions are rewritten by a paraphrasing tool (for demonstration purposes only).


Make sure to [first fork the original repository](https://github.com/EleutherAI/lm-evaluation-harness.git) and clone it locally.

In [None]:
!pip install -q accelerate

Here we clone and install the package from the main repository, but when contributing you should clone your own fork instead.

In [None]:
!git clone https://github.com/EleutherAI/lm-evaluation-harness.git
%cd lm-evaluation-harness
!pip install -e ".[dev]" -q
%cd ..

Cloning into 'lm-evaluation-harness'...
remote: Enumerating objects: 49693, done.
remote: Counting objects: 100% (52/52), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 49693 (delta 29), reused 18 (delta 15), pack-reused 49641 (from 2)
Receiving objects: 100% (49693/49693), 29.64 MiB | 9.65 MiB/s, done.
Resolving deltas: 100% (34403/34403), done.
/content/lm-evaluation-harness
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done
  Preparing metadata (setup.py) ... done

Next, we move into the repository and create a new YAML file from the provided template, [as per the documentation](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md#creating-a-yaml-file).

In [None]:
%cd lm-evaluation-harness
!cp -r templates/new_yaml_task lm_eval/tasks/

/content/lm-evaluation-harness


We will reproduce the `piqa` benchmark step by step, with some slight modifications. `piqa` consists of multiple-choice questions that test common-sense reasoning.

Let's start by filling in the `blank_yaml.yaml` file (now under `lm_eval/tasks/new_yaml_task/`) with these arguments:

```yaml
task: piqa_modified
dataset_path: baber/piqa
dataset_name: null
```

- `task`: The name of the task. This is the name you will pass to `--tasks` when running the evaluation.
- `dataset_path`: The name of the dataset on the Hugging Face Hub.
- `dataset_name`: Leave `null` if your dataset does not require a config to be passed. See https://huggingface.co/docs/datasets/load_hub#configurations for more info.
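
As a rough illustration of what the two dataset fields control, the sketch below maps them onto keyword arguments for `datasets.load_dataset`. The helper name `build_load_kwargs` is hypothetical, and the mapping (`dataset_path` as `path`, `dataset_name` as `name`) is our assumption about how the harness forwards these fields; check the harness source for the exact behavior.

```python
def build_load_kwargs(config: dict) -> dict:
    """Hypothetical helper: turn task YAML fields into load_dataset kwargs."""
    kwargs = {"path": config["dataset_path"]}
    # A `null` dataset_name in YAML becomes None in Python: no config is passed.
    if config.get("dataset_name") is not None:
        kwargs["name"] = config["dataset_name"]
    return kwargs


task_config = {
    "task": "piqa_modified",
    "dataset_path": "baber/piqa",
    "dataset_name": None,
}
load_kwargs = build_load_kwargs(task_config)
print(load_kwargs)
# The dataset could then be loaded with: datasets.load_dataset(**load_kwargs)
```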

We will also set:

```yaml
output_type: multiple_choice
```

This tells the harness to treat the task as multiple-choice question answering.

We will also set the names of the training, validation, and test splits. To find the right values, inspect the split names of the dataset on Hugging Face.

```yaml
training_split: train
validation_split: validation
test_split: null
```
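
To make the role of the three split fields concrete, here is a minimal sketch of how an evaluation split could be chosen from them. It assumes the harness evaluates on the test split when one is set and falls back to the validation split otherwise, with the training split reserved for few-shot examples. `pick_eval_split` is a hypothetical name, not a harness function.

```python
from typing import Optional


def pick_eval_split(config: dict) -> Optional[str]:
    """Hypothetical helper: prefer the test split, else the validation split."""
    for key in ("test_split", "validation_split"):
        if config.get(key) is not None:
            return config[key]
    return None  # no split available for evaluation


splits = {
    "training_split": "train",
    "validation_split": "validation",
    "test_split": None,
}
eval_split = pick_eval_split(splits)
print(eval_split)
```

With the values above, evaluation would run on `validation`, since `test_split` is `null`.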




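Next, we define how each document is turned into a prompt (`doc_to_text`), the index of the correct answer (`doc_to_target`), and the list of answer choices (`doc_to_choice`), using the dataset columns `goal`, `sol1`, `sol2`, and `label`:

```yaml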
doc_to_text: "Question: {{goal}}\nAnswer:"
doc_to_target: label
doc_to_choice: "{{[sol1, sol2]}}"
```
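
These fields are resolved against each dataset row. The sketch below mimics their result in plain Python for a single row, so it uses neither Jinja2 nor any harness code. The helper name `render_example` is hypothetical and the sample row is made up for illustration.

```python
def render_example(doc: dict) -> tuple:
    """Hypothetical helper mirroring doc_to_text / doc_to_choice / doc_to_target."""
    prompt = f"Question: {doc['goal']}\nAnswer:"  # doc_to_text
    choices = [doc["sol1"], doc["sol2"]]          # doc_to_choice
    gold = int(doc["label"])                      # doc_to_target: index of the correct choice
    return prompt, choices, gold


# Made-up row with the same columns as the dataset.
row = {
    "goal": "How do I keep two metal parts firmly in place?",
    "sol1": "Weld the metal together.",
    "sol2": "Balance one part on top of the other.",
    "label": 0,
}
prompt, choices, gold = render_example(row)
print(prompt)
print(choices[gold])
```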

Finally, we will add the following block so that lm-eval computes the accuracy (`acc`) and normalized accuracy (`acc_norm`) for this task:

```yaml
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
dataset_kwargs:
  trust_remote_code: true
```

In [None]:
!accelerate launch -m lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM2-135M,dtype=bfloat16 \
    --tasks piqa_modified \
    --batch_size auto \
    --output_path results/

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-04-11 00:47:10.150262: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744332430.472127    3161 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744332430.556385    3161 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-11 00:47:11.215483: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in perf

For `multiple_choice` tasks, `lm-eval` will calculate the log-likelihood of all choices, and consider the answer with the highest log-likelihood as the chosen answer.
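
A minimal sketch of this selection rule, using made-up log-likelihood values. The length normalization shown for `acc_norm` (dividing each log-likelihood by the number of characters in the choice) is our assumption about how the harness normalizes, and `pick_choice` is a hypothetical name.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the highest-scoring choice."""
    scores = list(loglikelihoods)
    if normalize:
        # Assumed acc_norm behavior: divide by the choice length in characters,
        # so longer answers are not penalized simply for being longer.
        scores = [ll / len(choice) for ll, choice in zip(scores, choices)]
    return max(range(len(scores)), key=scores.__getitem__)


choices = ["short", "a much longer answer"]
loglikelihoods = [-4.0, -6.0]  # made-up values for illustration

raw_pred = pick_choice(loglikelihoods, choices)
norm_pred = pick_choice(loglikelihoods, choices, normalize=True)
print(raw_pred, norm_pred)  # 0 1
```

Accuracy is then the mean, over all documents, of whether the predicted index equals the gold label.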

It is also possible to create `generate_until` tasks, in which the model generates free-form text. For more details, look at how other tasks are implemented, [such as `gsm8k`](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/gsm8k/README.md).

Let's now make our piqa benchmark slightly different from the original: we will replace each solution with a paraphrased version. For that, we will use the tool ["T5 Paraphrase Generation"](https://huggingface.co/spaces/AventIQ-AI/t5-paraphrase-generation).

We will use `gradio_client` to call the Space as an API. Let's first test it on a simple input, then on an example from piqa:

```python
from gradio_client import Client

client = Client("AventIQ-AI/t5-paraphrase-generation")
result = client.predict(
    text="Hello!!",
    max_length=50,
    temperature=1,
    api_name="/predict"
)
print(result)
```

In [None]:
! pip install -q gradio_client

In [None]:
from gradio_client import Client

client = Client("AventIQ-AI/t5-paraphrase-generation")
result = client.predict(
    text="Weld the metal together to get it to stay firmly in place",
    max_length=50,
    temperature=1,
    api_name="/predict"
)
print(result)

Loaded as API: https://aventiq-ai-t5-paraphrase-generation.hf.space ✔
['Etymologically, weld the metal together to ensure that it holds its firmly in place.', 'The metal was separated to ensure its stabilization and strong stability.', 'Wel


As you can see, the tool generates several candidate paraphrases. We can create a utility function that replaces each solution with a randomly sampled paraphrase:

```python
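import ast
import random
import datasets

from gradio_client import Client

client = Client("AventIQ-AI/t5-paraphrase-generation")

def paraphrase(text):
    result = client.predict(
        text=text,
        max_length=50,
        temperature=1,
        api_name="/predict"
    )
    # The Space returns the candidates as one string that looks like a Python
    # list, so parse it into a real list before sampling.
    try:
        candidates = ast.literal_eval(result) if isinstance(result, str) else list(result)
    except (ValueError, SyntaxError):
        # Assumption: if the string cannot be parsed, keep the original text.
        return text
    return random.choice(candidates) if candidates else text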

def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        sol1 = paraphrase(doc["sol1"])
        sol2 = paraphrase(doc["sol2"])
        out_doc = {
            "goal": doc["goal"],
            "sol1": sol1,
            "sol2": sol2,
            "label": doc["label"],
        }
        return out_doc

    return dataset.map(_process_doc)
```

We will run the processing offline and push the modified dataset to the Hugging Face Hub as a private dataset (only a fraction of it, for demonstration purposes).

In [None]:
import ast
import random
import datasets

from gradio_client import Client

client = Client("AventIQ-AI/t5-paraphrase-generation")

def paraphrase(text):
    result = client.predict(
        text=text,
        max_length=50,
        temperature=1,
        api_name="/predict"
    )
    # The Space returns the candidates as one string that looks like a Python
    # list, so parse it into a real list before sampling.
    try:
        candidates = ast.literal_eval(result) if isinstance(result, str) else list(result)
    except (ValueError, SyntaxError):
        # Assumption: if the string cannot be parsed, keep the original text.
        return text
    return random.choice(candidates) if candidates else text

def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:
    def _process_doc(doc):
        sol1 = paraphrase(doc["sol1"])
        sol2 = paraphrase(doc["sol2"])
        out_doc = {
            "goal": doc["goal"],
            "sol1": sol1,
            "sol2": sol2,
            "label": doc["label"],
        }
        return out_doc

    return dataset.map(_process_doc)

YOUR_HF_TOKEN = "" # Paste your HF Token here

dataset = datasets.load_dataset("baber/piqa", split='train[:1%]')
dataset = process_docs(dataset)

# push the first slice as the training split
dataset.push_to_hub("piqa_modified", split="train", private=True, token=YOUR_HF_TOKEN)

# we will also create a validation split from the next slice
dataset = datasets.load_dataset("baber/piqa", split='train[1%:2%]')
dataset = process_docs(dataset)

dataset.push_to_hub("piqa_modified", split="validation", private=True, token=YOUR_HF_TOKEN)

Loaded as API: https://aventiq-ai-t5-paraphrase-generation.hf.space ✔


Map:   0%|          | 0/161 [00:00<?, ? examples/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

38383
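
The `Map` progress bar above reports 161 examples for a 1% slice. As a sketch of where that number comes from, assuming the `piqa` training split has 16,113 examples and that `datasets` rounds percent boundaries to the closest index (its default behavior), the slice boundaries can be computed as follows. `pct_slice` is a hypothetical helper, not part of `datasets`.

```python
def pct_slice(start_pct: int, end_pct: int, num_examples: int) -> tuple:
    """Hypothetical helper: convert percent boundaries to absolute indices."""
    start = int(round(start_pct * num_examples / 100))
    end = int(round(end_pct * num_examples / 100))
    return start, end


NUM_TRAIN = 16113  # assumed size of the piqa training split

train_bounds = pct_slice(0, 1, NUM_TRAIN)  # split='train[:1%]'
valid_bounds = pct_slice(1, 2, NUM_TRAIN)  # split='train[1%:2%]'
print(train_bounds)  # (0, 161)
print(valid_bounds)  # (161, 322)
```

Both slices contain 161 examples and do not overlap, so the new training and validation splits are disjoint.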

Once the dataset is pushed, update `dataset_path` in the YAML file so that it points to the dataset under your own namespace. In our case:

```yaml
dataset_path: tiiuae/piqa_modified
dataset_name: null
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: null
```

In [None]:
!HF_TOKEN="YOUR_TOKEN" accelerate launch -m lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM2-135M,dtype=bfloat16 \
    --tasks piqa_modified \
    --batch_size auto \
    --output_path results/

The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
2025-04-11 02:06:43.969039: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744337203.992954   24010 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744337204.000605   24010 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-04-11 02:06:44.023280: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in perf

Once you are happy with your new benchmark, refer to our submission guidelines to submit it (we will detail the submission process soon).

For a more detailed tutorial, carefully read [the guide from lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) to get an extensive understanding of all possible corner cases and integrations.