# Lesson 6. Model evaluation

The model comparison tool that Sung described in the video is available at https://console.upstage.ai/. Note that you need to create a free account to try it out.

A useful tool for evaluating LLMs is the **LM Evaluation Harness**, built by EleutherAI. You can find more information about the harness in this [GitHub repo](https://github.com/EleutherAI/lm-evaluation-harness).

To install the evaluation harness in your own environment, uncomment and run the line below:

In [1]:
#!pip install -U git+https://github.com/EleutherAI/lm-evaluation-harness

You will evaluate TinySolar-248m-4k on 5 questions from the **TruthfulQA MC2 task**, a multiple-choice question-answering task that tests the model's ability to identify true statements. You can read more about the TruthfulQA benchmark in [this paper](https://arxiv.org/abs/2109.07958), and you can check out the code that implements the tasks in this [GitHub repo](https://github.com/sylinrl/TruthfulQA).
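To make the MC2 score more concrete, here is a minimal sketch of how a per-question score can be computed: the model assigns a log-likelihood to every answer choice, and the score is the share of total probability mass that lands on the true answers. The function name `mc2_score` and the example numbers are illustrative assumptions, not output from TinySolar or code taken from the harness.

```python
import math

def mc2_score(log_likelihoods, labels):
    """Return the normalized probability mass assigned to the true answers.

    log_likelihoods: one log-probability per answer choice.
    labels: 1 for a true answer, 0 for a false one, in the same order.
    """
    probs = [math.exp(ll) for ll in log_likelihoods]
    true_mass = sum(p for p, label in zip(probs, labels) if label == 1)
    return true_mass / sum(probs)

# Hypothetical log-likelihoods for four answer choices (two true, two false).
example_lls = [math.log(0.3), math.log(0.1), math.log(0.4), math.log(0.2)]
example_labels = [1, 1, 0, 0]
score = mc2_score(example_lls, example_labels)
print(f"MC2 score for this question: {score:.2f}")
```

A score near 1 means almost all of the probability mass sits on true statements. The task-level number is the average of this value over the evaluated questions.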

The code below runs only the TruthfulQA MC2 task using the LM Evaluation Harness:

In [2]:
!lm_eval --model hf \
    --model_args pretrained=./models/TinySolar-248m-4k \
    --tasks truthfulqa_mc2 \
    --device cpu \
    --limit 5

2024-09-23:16:15:39,626 INFO     [__main__.py:272] Verbosity set to INFO
2024-09-23:16:15:39,680 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2024-09-23:16:15:46,001 INFO     [__main__.py:376] Selected Tasks: ['truthfulqa_mc2']
2024-09-23:16:15:46,005 INFO     [evaluator.py:158] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-09-23:16:15:46,005 INFO     [evaluator.py:195] Initializing hf model, with arguments: {'pretrained': './models/TinySolar-248m-4k'}
2024-09-23:16:15:46,007 INFO     [huggingface.py:130] Using device 'cpu'
2024-09-23:16:15:46,096 INFO     [huggingface.py:366] Model parallel was set to F
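The same run can also be launched from Python instead of the command line. The sketch below assumes the installed harness version exposes `lm_eval.simple_evaluate` with keyword arguments that mirror the CLI flags. `build_eval_kwargs` and `run_eval` are hypothetical helper names introduced only for this example.

```python
def build_eval_kwargs(model_path, task, limit=None, num_fewshot=None):
    """Collect the settings expressed by the CLI flags used above."""
    kwargs = {
        "model": "hf",                             # --model hf
        "model_args": f"pretrained={model_path}",  # --model_args pretrained=...
        "tasks": [task],                           # --tasks ...
        "device": "cpu",                           # --device cpu
    }
    if limit is not None:
        kwargs["limit"] = limit                    # --limit N
    if num_fewshot is not None:
        kwargs["num_fewshot"] = num_fewshot        # --num_fewshot K
    return kwargs

def run_eval(**kwargs):
    """Run the evaluation; assumes lm-eval is installed (imported lazily)."""
    import lm_eval
    return lm_eval.simple_evaluate(**kwargs)

kwargs = build_eval_kwargs("./models/TinySolar-248m-4k", "truthfulqa_mc2", limit=5)
print(kwargs)

# Uncomment to run the evaluation itself (requires the model files on disk):
# results = run_eval(**kwargs)
# print(results["results"])
```

Keeping the settings in a dictionary makes it easy to log exactly which configuration produced a given score.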

### Evaluation for the Hugging Face Leaderboard
You can use the code below to test your own model against the evaluations required for the [Hugging Face leaderboard](https://huggingface.co/open-llm-leaderboard). 

If you decide to run this evaluation on your own model, don't change the few-shot numbers below; they are set by the rules of the leaderboard.
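For context on what the few-shot number controls: in a k-shot evaluation, k solved examples are placed in front of each test question before the model sees it, and 0-shot means the question is presented alone. The sketch below is a simplified illustration with a made-up prompt format and made-up examples. The harness uses its own task-specific templates and example selection, so treat `build_few_shot_prompt` as a hypothetical helper.

```python
def build_few_shot_prompt(examples, question, k):
    """Prepend the first k solved (question, answer) pairs to a new question."""
    shots = examples[:k]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The test question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Made-up solved examples, for illustration only.
solved = [
    ("What is 2 + 2?", "4"),
    ("What is 3 + 5?", "8"),
    ("What is 7 - 4?", "3"),
]

zero_shot = build_few_shot_prompt(solved, "What is 6 + 1?", k=0)
two_shot = build_few_shot_prompt(solved, "What is 6 + 1?", k=2)
print(two_shot)
```

Because the number of shots changes the prompt the model receives, scores obtained with different few-shot settings are not directly comparable.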

In [None]:
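import subprocess

def h6_open_llm_leaderboard(model_name):
    """Run each leaderboard task on `model_name`, one lm_eval call per task."""
    # (task name, number of few-shot examples) pairs set by the leaderboard rules
    task_and_shot = [
        ('arc_challenge', 25),
        ('hellaswag', 10),
        ('mmlu', 5),
        ('truthfulqa_mc2', 0),
        ('winogrande', 5),
        ('gsm8k', 5)
    ]

    for task, fewshot in task_and_shot:
        # Pass the arguments as a list so that no shell quoting is needed
        eval_cmd = [
            "lm_eval", "--model", "hf",
            "--model_args", f"pretrained={model_name}",
            "--tasks", task,
            "--device", "cpu",
            "--num_fewshot", str(fewshot),
        ]
        subprocess.run(eval_cmd)

# Replace "YOUR_MODEL" with a local model path or a Hugging Face model ID before running
h6_open_llm_leaderboard(model_name="YOUR_MODEL")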

2024-09-23:16:17:00,507 INFO     [__main__.py:272] Verbosity set to INFO
2024-09-23:16:17:00,538 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2024-09-23:16:17:05,010 INFO     [__main__.py:376] Selected Tasks: ['arc_challenge']
2024-09-23:16:17:05,011 INFO     [evaluator.py:158] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-09-23:16:17:05,011 INFO     [evaluator.py:195] Initializing hf model, with arguments: {'pretrained': 'YOUR_MODEL'}
2024-09-23:16:17:05,013 INFO     [huggingface.py:130] Using device 'cpu'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub

2024-09-23:16:17:17,656 INFO     [__main__.py:272] Verbosity set to INFO
2024-09-23:16:17:17,689 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2024-09-23:16:17:22,319 INFO     [__main__.py:376] Selected Tasks: ['mmlu']
2024-09-23:16:17:22,320 INFO     [evaluator.py:158] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-09-23:16:17:22,320 INFO     [evaluator.py:195] Initializing hf model, with arguments: {'pretrained': 'YOUR_MODEL'}
2024-09-23:16:17:22,323 INFO     [huggingface.py:130] Using device 'cpu'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/utils/_e

2024-09-23:16:17:35,571 INFO     [__main__.py:272] Verbosity set to INFO
2024-09-23:16:17:35,606 INFO     [__init__.py:491] `group` and `group_alias` keys in tasks' configs will no longer be used in the next release of lm-eval. `tag` will be used to allow to call a collection of tasks just like `group`. `group` will be removed in order to not cause confusion with the new ConfigurableGroup which will be the offical way to create groups with addition of group-wide configuations.
2024-09-23:16:17:40,138 INFO     [__main__.py:376] Selected Tasks: ['winogrande']
2024-09-23:16:17:40,138 INFO     [evaluator.py:158] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-09-23:16:17:40,138 INFO     [evaluator.py:195] Initializing hf model, with arguments: {'pretrained': 'YOUR_MODEL'}
2024-09-23:16:17:40,140 INFO     [huggingface.py:130] Using device 'cpu'
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/huggingface_hub/ut