# 2- How to evaluate a model checkpoint

In this notebook, you will learn how to evaluate a checkpoint with the [`lm-evaluation-harness`](https://github.com/EleutherAI/lm-evaluation-harness) package, including how to install the `lm-eval` package and use it to run evaluations.

## Selecting a task

To see which tasks are available, browse the [task directory](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks) of the repository, which lists every task supported by `lm-eval`.

For demonstration purposes, we will evaluate `HuggingFaceTB/SmolLM2-135M` on the `hellaswag` benchmark.

You can read more about the benchmark in the [original paper](https://arxiv.org/abs/1905.07830).

## Running the evaluation

To run an evaluation, use the command-line template below. To evaluate a different model, change the model ID passed to the `pretrained=` argument.

In [1]:
!accelerate launch -m lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM2-135M,dtype=bfloat16 \
    --tasks hellaswag \
    --batch_size auto \
    --output_path results/

ipex flag is deprecated, will be removed in Accelerate v1.10. From 2.7.0, PyTorch has all needed optimizations for Intel CPU and XPU.
2025-07-11:07:13:53 INFO     [__main__:441] Selected Tasks: ['hellaswag']
2025-07-11:07:13:53 INFO     [evaluator:198] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-07-11:07:13:53 INFO     [evaluator:236] Initializing hf model, with arguments: {'pretrained': 'HuggingFaceTB/SmolLM2-135M', 'dtype': 'bfloat16'}
2025-07-11:07:13:54 INFO     [models.huggingface:138] Using device 'cuda'
config.json: 100%|█████████████████████████████| 704/704 [00:00<00:00, 2.21MB/s]
tokenizer_config.json: 3.66kB [00:00, 7.11MB/s]
vocab.json: 801kB [00:00, 3.72MB/s]
merges.txt: 466kB [00:00, 13.1MB/s]
tokenizer.json: 2.10MB [00:00, 19.0MB/s]
special_tokens_map.json: 100%|█████████████████| 831/831 [00:00<00:00, 3.76MB/s]
2025-07-11:07:13:59 INFO     [models.huggingface:391] Model parallel was

Let's now inspect the results. The `--output_path` parameter sets where they are written. Inside that path you should find a timestamped JSON file corresponding to the moment the model was evaluated.

You can open the JSON file and read the `results` field to get the scores. For `hellaswag`, the reported metrics are `acc` (accuracy) and `acc_norm` (normalized accuracy).
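
To give an intuition for why the two metrics can differ, here is a minimal sketch. It assumes that normalized accuracy is obtained by dividing each answer choice's log-likelihood by the length of that choice before picking the best one, which is our understanding of how the harness treats multiple-choice tasks. The function name `pick_answer` and all the numbers are hypothetical and only serve as an illustration.

```python
def pick_answer(loglikelihoods, choices, normalize=False):
    """Return the index of the highest-scoring answer choice."""
    scores = []
    for ll, choice in zip(loglikelihoods, choices):
        # Assumption: normalization divides by the byte length of the choice
        length = len(choice.encode("utf-8")) if normalize else 1
        scores.append(ll / length)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical log-likelihoods, for illustration only
choices = ["sits.", "keeps running until she reaches the finish line."]
loglikelihoods = [-6.0, -20.0]

raw_pick = pick_answer(loglikelihoods, choices)
norm_pick = pick_answer(loglikelihoods, choices, normalize=True)
print("raw pick:", raw_pick, "| normalized pick:", norm_pick)
```

In this toy case the raw score favors the short choice, while the length-normalized score favors the longer one, which is why `acc` and `acc_norm` can report different values for the same model.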

Note that the results are also printed directly in the terminal.

```json
  "results": {
    "hellaswag": {
      "alias": "hellaswag",
      "acc,none": 0.3545110535749851,
      "acc_stderr,none": 0.004773872456201056,
      "acc_norm,none": 0.4311890061740689,
      "acc_norm_stderr,none": 0.004942302768002102
    }
  }
```
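If you prefer to read the scores programmatically, the sketch below shows one possible way to do it. It assumes the results are stored in a `.json` file somewhere under the output directory (the exact file name and sub-folder layout may vary between `lm-eval` versions), so it simply picks the most recently modified JSON file. The helpers `load_latest_results` and `summarize` are hypothetical names, and the 95% interval relies on a normal approximation around the reported standard error.

```python
import json
from pathlib import Path


def load_latest_results(output_dir):
    """Load the most recently modified JSON file found under output_dir."""
    files = sorted(Path(output_dir).rglob("*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        raise FileNotFoundError(f"No JSON file found under {output_dir}")
    with files[-1].open() as f:
        return json.load(f)


def summarize(task_scores, metric="acc"):
    """Return (value, lower, upper) using a normal-approximation 95% interval."""
    value = task_scores[f"{metric},none"]
    stderr = task_scores[f"{metric}_stderr,none"]
    return value, value - 1.96 * stderr, value + 1.96 * stderr


# Scores copied from the JSON excerpt above
hellaswag_scores = {
    "acc,none": 0.3545110535749851,
    "acc_stderr,none": 0.004773872456201056,
    "acc_norm,none": 0.4311890061740689,
    "acc_norm_stderr,none": 0.004942302768002102,
}

for metric in ("acc", "acc_norm"):
    value, low, high = summarize(hellaswag_scores, metric)
    print(f"{metric}: {value:.3f} (95% interval: {low:.3f} to {high:.3f})")
```

After a real run, `load_latest_results("results/")["results"]["hellaswag"]` should give you a dictionary with the same keys as `hellaswag_scores`.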

To change the number of few-shot examples, pass the `--num_fewshot` parameter. For example, to run `hellaswag` with 25 shots, use the following command:

In [2]:
!accelerate launch -m lm_eval --model hf \
    --model_args pretrained=HuggingFaceTB/SmolLM2-135M,dtype=bfloat16 \
    --tasks hellaswag \
    --batch_size auto \
    --output_path results_25_shots/ \
    --num_fewshot 25

ipex flag is deprecated, will be removed in Accelerate v1.10. From 2.7.0, PyTorch has all needed optimizations for Intel CPU and XPU.
2025-07-11:07:16:34 INFO     [__main__:441] Selected Tasks: ['hellaswag']
2025-07-11:07:16:34 INFO     [evaluator:198] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-07-11:07:16:34 INFO     [evaluator:236] Initializing hf model, with arguments: {'pretrained': 'HuggingFaceTB/SmolLM2-135M', 'dtype': 'bfloat16'}
2025-07-11:07:16:34 INFO     [models.huggingface:138] Using device 'cuda'
2025-07-11:07:16:35 INFO     [models.huggingface:391] Model parallel was set to False, max memory was not set, and device map was set to {'': 'cuda'}
`trust_remote_code` is not supported anymore.
Please check that the Hugging Face dataset 'hellaswag' isn't based on a loading script and remove `trust_remote_code`.
If the dataset is based on a loading script, please ask the dataset author to re
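
If you want to compare several few-shot settings, it can be convenient to generate the commands from a small script. The sketch below only reuses the flags shown in the two commands above; the helper name `build_eval_command`, the shot counts, and the output folder naming are hypothetical choices made for illustration.

```python
import shlex


def build_eval_command(model_id, task, output_path, num_fewshot=None, dtype="bfloat16"):
    """Build the lm_eval command line as a list of arguments."""
    cmd = [
        "accelerate", "launch", "-m", "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_id},dtype={dtype}",
        "--tasks", task,
        "--batch_size", "auto",
        "--output_path", output_path,
    ]
    # Only add the flag when a shot count is requested explicitly
    if num_fewshot is not None:
        cmd += ["--num_fewshot", str(num_fewshot)]
    return cmd


for shots in (0, 5, 25):
    cmd = build_eval_command(
        "HuggingFaceTB/SmolLM2-135M",
        "hellaswag",
        f"results_{shots}_shots/",
        num_fewshot=shots,
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to launch it
    print(shlex.join(cmd))
```

Writing each run to its own output folder keeps the result files of the different settings from being mixed together.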

Since the evaluation framework uses the `accelerate` library, you can also run this notebook in a Kaggle multi-GPU notebook to benefit from multi-GPU inference and speed up the evaluation.

## Going further

This notebook demonstrates how to run a simple evaluation using common parameters, which we believe should be sufficient for the competition. Feel free to check out the [official documentation page](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interface.md) for more details.