# Evaluate Fine-Tuned Models

## What you will learn in this course 🧐🧐

You've trained a model but now what? The problem with LLMs is that they will always provide an answer to a user. The only question is: *Will it be the one you expected?* (when fine-tuning the model). This is why you need to have some kind of an evaluation process. 

In this course, we will teach you a simple technic to evaluate a model thanks to another LLM

## Methods to Evaluate an LLM 

### Metrics in Short

If you need a quick overview before diving into each metric, check-out the table below 👇

| Metric                   | Best For                           | Interpretation                                                                                       |
|--------------------------|------------------------------------|------------------------------------------------------------------------------------------------------|
| Perplexity               | Language modeling                 | Lower perplexity = better fit to natural language patterns                                           |
| BLEU, ROUGE              | Translation, Summarization        | Higher = closer to reference; may penalize creative wording                                          |
| Accuracy (HellaSwag, MMLU) | Reasoning, Knowledge             | Higher accuracy = better at reasoning or factual understanding                                       |
| TruthfulQA Accuracy      | Truthfulness                      | Higher accuracy = avoids common misconceptions, especially in complex or controversial topics        |
| BLEURT, BERTScore        | QA, Text Generation               | Higher = closer semantic meaning to reference, allows for varied wording                             |
| Human Evaluation         | Conversational AI, Open-ended generation | Provides qualitative feedback on coherence, relevance, fluency, and engagement                  |
| LLM-based Evaluation     | Automated, scalable assessments   | Uses another LLM to score relevance, coherence, or factuality; efficient but may reinforce biases    |


There are several ways to evaluate an LLM. Among the most popular are:

### Extended classification

This category aims at evaluating the LLM's capacity to classify, translate or summarize text. These metrics are great when the need for a model's creativity is limited:

- **Perplexity**: Measures how well the model predicts the next word. It's a common metric in language modeling tasks.
   - **Interpretation**: A lower perplexity indicates the model finds the sequence of words more “predictable” and thus has a stronger grasp of natural language patterns. This works well in **language modeling** tasks but can be limiting for **open-ended generation** since it doesn’t directly correlate with meaningfulness or truthfulness.
   
- **BLEU (Bilingual Evaluation Understudy)**: Often used in **machine translation** and tasks with a clear reference output.
   - **Interpretation**: BLEU compares model-generated output to reference output based on word overlaps. Higher BLEU scores indicate closer similarity to the reference but can penalize creative wording and isn't ideal for open-ended tasks.

- **ROUGE (Recall-Oriented Understudy for Gisting Evaluation)**: Primarily used in **summarization** to measure overlap between generated summaries and reference summaries.
   - **Interpretation**: Higher ROUGE scores suggest that the generated text retains important content from the reference. Like BLEU, ROUGE may not fully capture nuanced quality, making it suitable for extractive but not always for abstractive tasks.

- **METEOR (Metric for Evaluation of Translation with Explicit ORdering)**: Extends BLEU with synonyms and stemming, often used in machine translation.
   - **Interpretation**: Slightly better for capturing “semantic” similarity than BLEU, making it useful for **translation** and **summarization** where meaning preservation is essential.

- **BLEURT**: An **evaluation model** trained to score responses for **semantic similarity** with reference answers. Unlike BLEU, it focuses on the *meaning* behind words, making it suitable for **summarization** and **open-ended** tasks.
   - **Interpretation**: Higher BLEURT scores mean the model output is closer in meaning to human-provided references. It’s beneficial for tasks like **QA** and **paraphrasing**.


- **BERTScore**: Computes similarity at the word level using embeddings, which can capture **semantic closeness** rather than exact word matches.
   - **Interpretation**: This is more flexible than BLEU or ROUGE, as it considers synonyms and contextual relevance. High BERTScore indicates semantically similar outputs, making it suitable for **summarization** and **text generation** tasks.

- **Exact Match (EM)**: Common in **QA** tasks, it measures if the answer is exactly correct.
   - **Interpretation**: High EM means the model can precisely identify correct answers. However, it’s strict and may not reflect near-correct answers.


### Task-Specific Benchmarks

Evaluating "how well" a model generates data is quite hard. Therefore the industry came up with metrics that focuses on specific tasks to evaluate a model on. Among them are:

- **HellaSwag**: Tests **common-sense reasoning**. It presents scenarios followed by multiple possible endings, and the model must choose the most plausible one.
   - **Evaluation**: Accuracy is the main metric, as the task is multiple-choice. Higher accuracy indicates a better understanding of common-sense knowledge.
   
- **TruthfulQA**: Evaluates **truthfulness** in responses, specifically targeting the model's propensity to avoid false or misleading information. The dataset contains challenging questions where common misconceptions may lead the model astray.
   - **Evaluation**: Scored with accuracy but with a strong emphasis on "truthfulness." Low scores often indicate the model’s tendency to repeat misconceptions or fabricate plausible-sounding but incorrect information.

- **MMLU (Massive Multitask Language Understanding)**: Measures **general knowledge** and **domain-specific expertise** across fields like history, math, and science.
   - **Evaluation**: Accuracy on MMLU reflects the model’s factual knowledge and understanding across diverse subjects, with higher scores indicating broader and deeper knowledge.

### Open-Ended Text Generation Evaluation

For tasks like story generation, dialogue, and creative writing, classic metrics may fall short. Instead, we rely on qualitative and advanced quantitative measures:

- **Human Evaluation**: Direct human assessment is the most reliable for open-ended generation. Annotators rate output on criteria like **relevance**, **coherence**, **factual accuracy**, and **fluency**.
   - **Interpretation**: While costly, human evaluations are often conducted in tandem with automated metrics to provide a well-rounded assessment.

- **Specific Qualitative Metrics**:
   - **Relevance**: Does the output address the prompt directly?
   - **Coherence**: Does the text flow logically?
   - **Consistency**: Are statements within the response internally consistent?
   - **Fluency**: Is the language natural and free of grammatical errors?

### LLM Based Evaluation 

Now the final evaluation method that is getting really popular is to use an LLM to evaluate another LLM's answer. The idea is often to take a bigger (or at least a complete other model) and trust its capacity to evaluate answer. Even though it is quite hard to say for sure that evaluation is 100% accurate, it is remarkably efficient and scalable!


<Note type="tip" title="Best way to evaluate LLMs">

For LLMs, no single metric tells the full story. Here’s how to approach evaluation based on task needs:

- **For Knowledge and Reasoning Tasks** (like MMLU and HellaSwag):
   - **Accuracy** is key, but **MMLU’s diverse subjects** provide insights into specialized knowledge. **HellaSwag accuracy** also measures the model’s capacity for common-sense reasoning.

- **For Conversational and Creative Tasks**:
   - **Human ratings** are crucial, assessing fluency, relevance, and engagement.
   - **BLEURT** and **BERTScore** give insights into how closely generated responses match expected content while allowing some creativity in wording.

- **For Factuality and Trustworthiness**:
   - **TruthfulQA** accuracy checks the model's adherence to truthful responses.
   - **Human evaluators** often rate output for factual accuracy, especially in complex or nuanced topics.

</Note>


## Demo

For this course we will do a demo of LLM-based evaluation. We'll take back a fine-tuned model and evaluate it by Llama-3B model. 

<Note type="tip" title="Wanna follow along?">

If you want to follow along:

* Open up a LightningAI studio 
* Switch to GPUs 

</Note>


In [1]:
# First let's import the validation data 
# We used json files for our fine-tuning so let's keep using that
import json
with open("data/val.json", "r") as file:
    test_data = json.load(file)

In [2]:
# Then you will need to apply some formatting 
# This helps your evaluation model to do its job 
# Fortunately for us, LitGPT provides a template prompt from Alpaca 
from litgpt.prompts import Alpaca

prompt_style = Alpaca()

print(prompt_style.apply(prompt=test_data[0]["instruction"], **test_data[0]))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Can I work while attending Jedha Bootcamp part-time?

### Response:



In [4]:
# Now we need load our fine-tuned model
# We will generate answer from the validation data 
from litgpt import LLM
from tqdm import trange # This is simply to have a nice looking progress. It doesn't matter if you don't use it

# Load the LLM
llm = LLM.load("results/fine-tuned-llama-3.2-1B/final")

# Here we generate answers from the validation dataset
for i in trange(len(test_data)):
    response = llm.generate(prompt_style.apply(prompt=test_data[i]["instruction"], **test_data[i]))
    test_data[i]["response"] = response

100%|██████████| 4/4 [00:04<00:00,  1.08s/it]


In [5]:
# Now let's load the evaluation LLM
# We will use LLama 3.2 3B Instruct 
# This is a rather small model trained on specialized data 
# This should do the job but if you are in more production-like environment
# The bigger the LLM, the better obviously 
del llm # delete previous `llm` to free up GPU memory
scorer = LLM.load("meta-llama/Llama-3.2-3B-Instruct", access_token="REPLACE_WITH_YOUR_TOKEN")

Setting HF_HUB_ENABLE_HF_TRANSFER=1


config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

Converting checkpoint files to LitGPT format.
{'checkpoint_dir': PosixPath('checkpoints/meta-llama/Llama-3.2-3B-Instruct'),
 'debug_mode': False,
 'dtype': None,
 'model_name': None}


Loading weights: model-00002-of-00002.safetensors: 100%|██████████| 00:09<00:00, 11.01it/s


Saving converted checkpoint to checkpoints/meta-llama/Llama-3.2-3B-Instruct


In [6]:
# Now let's build a function that will provide a system prompt to the evaluation model 
# and ask it to grade each answer from 0 to 100 
from tqdm import tqdm # Again this simply a progression bar. Doesn't matter if you don't use it but it's good looking

def generate_model_scores(data_dict, model, response_field="response", target_field="output"):
    scores = []
    for entry in tqdm(data_dict, desc="Scoring entries"):
        prompt = (
            f"Given the input `{entry}`"
            f"and correct output `{entry[target_field]}`, "
            f"score the model response `{entry[response_field]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = model.generate(prompt, max_new_tokens=50)
        try:
            scores.append(int(score))
        except ValueError:
            continue

    return scores

# And now we can generate scores for the model
scores = generate_model_scores(test_data, model=scorer)
print(f"Number of scores: {len(scores)} of {len(test_data)}")
print(f"Average score: {sum(scores)/len(scores):.2f}\n")

Scoring entries: 100%|██████████| 4/4 [00:00<00:00, 10.73it/s]

Number of scores: 4 of 4
Average score: 67.50






## Resources 📚📚

* [LLM Evaluation](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/evaluation.md)
* [Evaluating Large Language Model (LLM) systems: Metrics, challenges, and best practices](https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5)
* [LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide](https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation)
* [A list of metrics for evaluating LLM-generated content](https://learn.microsoft.com/en-us/ai/playbook/technology-guidance/generative-ai/working-with-llms/evaluation/list-of-eval-metrics)