# Evaluating the fine-tuned model using the evaluation service

After fine-tuning and deploying your model for testing, you can compare the accuracy and faithfulness of its responses with those of other models in RAG-based systems.  This notebook guides you through evaluating your model using the evaluation service.

NOTE: If you'd like to do a custom evaluation using your own comparison models and configurations, use the `local_custom_eval.ipynb` notebook.

To prepare for the evaluation, you will need to have the following:
1. A fine-tuned model deployed and accessible via an API.
2. A set of reference questions and answers in a common format (csv, jsonl, or qna.yaml).
3. A set of context data in PDF format.  These are generally the documents that the model was trained on.

The process involves the following steps:
1. Load the InstructLab tuned model and test it with a simple question to ensure it is working.
2. Generate reference questions and answers from the `reference_answers` directory.
3. Generate sample context data using a Milvus Lite Vector DB and the PDFs in the `data_preparation/document_collection` directory.
4. Create an evaluation using the reference answers and the InstructLab tuned model.
5. Wait for the evaluation to complete.
6. Summarize the results in Excel, Markdown, and HTML files.

By the end of the notebook, you will have a JSON file with the evaluation and a summary of the evaluation results in Excel, Markdown, and HTML files.  By default, these are stored in the `results` directory.
```
.
├── data_preparation
│   ├── document_collection
│   │   └── sample.pdf
│   └── taxonomy
│       └── knowledge
└── eval
    ├── eval_rh_api.ipynb
    ├── reference_answers
    │   ├── sample.csv
    │   ├── sample.jsonl
    │   └── sample.yaml
    └── results
        ├── evaluation.json
        ├── ilab_scores.xlsx
        ├── openai_scores.xlsx
        └── reference_answers.jsonl
```

The `evaluation.json` file will contain two evaluation runs: one for ilab and one for a plain OpenAI generated score.
```
{
    "id": "20250203175239362956",
    "reference_answers": [...],
    "openai_evaluation": {
        "status": "complete",
        "results": [...]
    },
    "ilab_evaluation": {
        "status": "complete",
        "results": [...]
    }
}
```
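
Before building reports from this file, it can help to confirm that both runs have finished. The following is a minimal sketch, assuming the file has the shape shown above. `evaluation_statuses` is a hypothetical helper name and is not part of `eval_utils.py`; the elided `[...]` values are replaced with empty lists purely for illustration.

```python
import json


def evaluation_statuses(evaluation: dict) -> dict:
    """Map each '*_evaluation' key to its reported status (None if absent)."""
    return {
        key: value.get("status")
        for key, value in evaluation.items()
        if key.endswith("_evaluation") and isinstance(value, dict)
    }


# Sample data mirroring the structure shown above (results left empty).
sample = json.loads("""
{
    "id": "20250203175239362956",
    "reference_answers": [],
    "openai_evaluation": {"status": "complete", "results": []},
    "ilab_evaluation": {"status": "complete", "results": []}
}
""")

statuses = evaluation_statuses(sample)
all_complete = all(status == "complete" for status in statuses.values())
print(statuses, all_complete)
```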

The output files will contain the summary of the evaluation results.

#### Summary
| question index   |   lab-tuned-granite |   lab-tuned-granite-rag |   granite-3.0-8b-instruct-rag |   gpt-4-rag |
|:-----------------|--------------------:|------------------------:|------------------------------:|------------:|
| Q1               |                   4 |                       5 |                             5 |     4       |
| Q2               |                   1 |                       5 |                             5 |     5       |
| ...              |                 ... |                     ... |                           ... |   ...       |
| QX               |                   4 |                       5 |                             5 |     5       |
| Sum              |                   9 |                      15 |                            15 |    14       |
| Average          |                   3 |                       5 |                             5 |     4.66667 |
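
The `Sum` and `Average` rows are plain column totals and means over the question rows. As a small worked example, the sketch below recomputes them from the three visible rows of the sample table (Q1, Q2, and QX). It only illustrates the arithmetic and is not how `summarize_results` is necessarily implemented.

```python
# Scores for the three visible question rows (Q1, Q2, QX) in the sample table.
scores = {
    "lab-tuned-granite": [4, 1, 4],
    "lab-tuned-granite-rag": [5, 5, 5],
    "granite-3.0-8b-instruct-rag": [5, 5, 5],
    "gpt-4-rag": [4, 5, 5],
}

# Column total and mean for each model.
summary = {
    model: {"sum": sum(values), "average": sum(values) / len(values)}
    for model, values in scores.items()
}

for model, row in summary.items():
    print(f"{model}: sum={row['sum']}, average={row['average']:.5f}")
```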


#### lab-tuned-granite
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |

#### lab-tuned-granite-rag
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |


### Needed packages and imports

The following packages are needed to run the evaluation service.  If you have not already installed them, you can do so by running the following command:

In [None]:
!pip install -r requirements.txt

### Testing Configuration - Environment Variables

The following environment variables are needed to run the evaluation service.  If you have not already set them, you can do so by creating a `.env` file in the root of the project and adding the following variables:

```
ILAB_TUNED_MODEL_URL=https://finetuned-model-api.server.com/v1
ILAB_TUNED_MODEL_NAME=finetuned_model_name
ILAB_TUNED_MODEL_API_KEY=model-api-key
EVAL_SERVICE_URL=https://eval-service-api.server.com
EVAL_SERVICE_API_KEY=eval-service-api-key
```
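
The check in the next cell raises one generic error when any value is empty. To find out which variable is missing, a helper along these lines may be useful. This is a sketch: `missing_variables` is a hypothetical name, and the list contains the variable names that this notebook's code reads with `os.getenv`.

```python
import os

# Names read by this notebook's os.getenv calls.
REQUIRED_VARS = [
    "EVAL_SERVICE_URL",
    "EVAL_SERVICE_API_KEY",
    "ILAB_TUNED_MODEL_URL",
    "ILAB_TUNED_MODEL_NAME",
    "ILAB_TUNED_MODEL_API_KEY",
]


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "EVAL_SERVICE_URL": "https://eval-service-api.server.com",
    "EVAL_SERVICE_API_KEY": "",
}
print(missing_variables(REQUIRED_VARS, example_env))
```

Calling `missing_variables(REQUIRED_VARS)` with no second argument checks the real process environment after `load_dotenv()` has run.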

In [None]:
import os

from dotenv import load_dotenv

load_dotenv()

output_directory = os.getenv("OUTPUT_DIR", "results")

eval_service_url = os.getenv("EVAL_SERVICE_URL")
eval_service_api_key = os.getenv("EVAL_SERVICE_API_KEY")

ilab_tuned_model_url = os.getenv("ILAB_TUNED_MODEL_URL")
ilab_tuned_model_name = os.getenv("ILAB_TUNED_MODEL_NAME")
ilab_tuned_model_api_key = os.getenv("ILAB_TUNED_MODEL_API_KEY")

if not all([
    output_directory,
    eval_service_url,
    eval_service_api_key,
    ilab_tuned_model_url,
    ilab_tuned_model_name,
    ilab_tuned_model_api_key
]):
    raise ValueError("One or more required variables are empty.  "
                     "Please check your environment settings.")


## Sanity check model

We will first test the InstructLab tuned model to ensure it is working correctly.  We will use a simple question to test the model.  If you're curious about the code, you can find it in the `eval_utils.py` file.


#### Test Requests

In [None]:
from eval_utils import create_llm, chat_request, ILAB_TUNED_MODEL_PROMPT, ILAB_TUNED_MODEL_RAG_PROMPT

llm = create_llm({
    "endpoint_url": ilab_tuned_model_url,
    "model_name": ilab_tuned_model_name,
    "api_key": ilab_tuned_model_api_key
})

question = "Who are you?"
answer = chat_request(llm, ILAB_TUNED_MODEL_PROMPT, question)
print(answer, "\n")

answer = chat_request(llm, ILAB_TUNED_MODEL_RAG_PROMPT, question, context="Pretend to be a human named Bob")
print(answer)

## Generate Reference Data (Questions, Answers, and Context)

### Use qna.yaml, csv, jsonl to create some data

Before creating a set of reference answers in a common `jsonl` format, you must:

1. Put your reference answers in the `reference_answers` directory.
2. Put any relevant source PDF documents in the `data_preparation/document_collection` directory.

The reference answers should be provided as a csv, jsonl, or qna.yaml file.  It's preferable to use questions and reference answers written by human subject matter experts, and CSV and jsonl files are easy formats for them to work with.  A qna.yaml file can also be added as an easy option.

The CSV should be formatted with `user_input` and `reference` fields.
| user_input | reference |
|:-----------|----------:|
| What is ...| It is...  |

The JSONL should be formatted with `user_input` and `reference` fields.
```json lines
{"user_input": "What is ...", "reference": "It is..."}
{"user_input": "What is ...", "reference": "It is..."}
```
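
The CSV and JSONL layouts carry the same two fields, so converting between them is mechanical. The sketch below shows the mapping using only the standard library. `csv_to_jsonl` is a hypothetical name for illustration; in the notebook itself, `get_reference_answers` from `eval_utils.py` is what loads these files.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text with user_input/reference columns to JSON Lines text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Keep only the two fields the evaluation expects.
        record = {"user_input": row["user_input"], "reference": row["reference"]}
        lines.append(json.dumps(record))
    return "\n".join(lines)


csv_text = "user_input,reference\nWhat is ...,It is...\n"
print(csv_to_jsonl(csv_text))
```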

The YAML file should be formatted with `seed_examples` and `questions_and_answers` fields.  This mirrors the normal `qna.yaml` format so that you can reuse the qna.yaml from your taxonomy.
```yaml
seed_examples:
    questions_and_answers:
      - question: >
          relevant question?
        answer: >
          reference answer
      - question: >
          relevant question 2?
        answer: >
          reference answer 2
```
After transforming the data, we add a `retrieved_context` field to each record and write the records to a `jsonl` file.  The context is retrieved from a Milvus Lite Vector DB that is generated from the PDFs in `data_preparation/document_collection`.

At this point you can inspect the `results/reference_answers.jsonl` file to see the data and fix any issues you see, such as manually fixing the `retrieved_context` field before moving on.

In [None]:
from eval_utils import get_reference_answers, get_context, write_jsonl, read_jsonl

reference_answers = get_reference_answers("./reference_answers")
reference_answers = get_context(reference_answers, "../data_preparation/document_collection")
print(f"{len(reference_answers)} reference answers loaded")

os.makedirs(output_directory, exist_ok=True)
write_jsonl(f"{output_directory}/reference_answers.jsonl", reference_answers)

print("user_input:", reference_answers[-1]["user_input"])
print("reference:", reference_answers[-1]["reference"])
print("retrieved_context:", reference_answers[-1]["retrieved_context"][0:100])
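
When inspecting `results/reference_answers.jsonl` by hand, a quick scan for records with a missing or empty field can point to the lines that need attention. The following is a sketch under the assumption that each line is a JSON object with `user_input`, `reference`, and `retrieved_context` keys. `find_incomplete_records` is a hypothetical helper, not part of `eval_utils.py`.

```python
import json

REQUIRED_FIELDS = ("user_input", "reference", "retrieved_context")


def find_incomplete_records(jsonl_text: str) -> list:
    """Return (line_number, missing_fields) for records lacking a required field."""
    problems = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # A field counts as missing when it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            problems.append((line_number, missing))
    return problems


sample = "\n".join([
    json.dumps({"user_input": "What is ...", "reference": "It is...",
                "retrieved_context": "There is ..."}),
    json.dumps({"user_input": "What is ...", "reference": "It is...",
                "retrieved_context": ""}),
])
print(find_incomplete_records(sample))
```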

## Prepare API Request

Now that we have the reference answers, we can create an evaluation request to the evaluation service.  The request needs the InstructLab tuned model information, the reference answers, and the evaluation service API URL and key, which we set as environment variables.


### Create Evaluation

We'll create the evaluation here.  It takes some time to complete, so we'll check its status in the next step using the evaluation id.

In [None]:
import os
import requests
from urllib.parse import urljoin
from eval_utils import read_jsonl


def create_evaluation(reference_answers: list[dict]) -> dict:
    eval_service_url = os.getenv("EVAL_SERVICE_URL")
    eval_service_api_key = os.getenv("EVAL_SERVICE_API_KEY")

    ilab_tuned_model_url = os.getenv("ILAB_TUNED_MODEL_URL")
    ilab_tuned_model_name = os.getenv("ILAB_TUNED_MODEL_NAME")
    ilab_tuned_model_api_key = os.getenv("ILAB_TUNED_MODEL_API_KEY")


    request_data = {
        "endpoint_url": ilab_tuned_model_url,
        "model_name": ilab_tuned_model_name,
        "api_key": ilab_tuned_model_api_key,
        "reference_answers": reference_answers
    }

    post_url = urljoin(eval_service_url, "/api/evaluations")

    response = requests.post(
        post_url,
        json=request_data,
        headers={
            "Content-Type": "application/json",
            "api-key": eval_service_api_key
        })

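    # Fail early on HTTP errors instead of parsing an error body as a result.
    response.raise_for_status()
    return response.json()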


reference_answers = read_jsonl(f"{output_directory}/reference_answers.jsonl")
eval = create_evaluation(reference_answers)
eval_id = eval.get("id")
eval_id
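
One detail of `urljoin` is worth knowing when setting `EVAL_SERVICE_URL`: a second argument that starts with `/` replaces any path already present in the base URL. The example below demonstrates this with the sample host from the `.env` snippet and a hypothetical `https://example.com/prefix/` base; the manual join at the end is just one possible workaround for services hosted under a path prefix.

```python
from urllib.parse import urljoin

# Base URL without a path: the API path is appended as expected.
plain = urljoin("https://eval-service-api.server.com", "/api/evaluations")
print(plain)

# Hypothetical base URL with a path prefix: the leading "/" replaces the prefix.
dropped = urljoin("https://example.com/prefix/", "/api/evaluations")
print(dropped)

# One way to keep the prefix is to join the pieces by hand.
base = "https://example.com/prefix/"
kept = base.rstrip("/") + "/api/evaluations"
print(kept)
```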


### Wait for Evaluation

There are two evaluations that will be run, one for the InstructLab native evaluation and one for a plain OpenAI generated score.  We will wait for both evaluations to complete before moving on to the next step.

In [None]:
import time
import json


def wait_for_evaluation(eval_id, eval_type):
    eval_service_url = os.getenv("EVAL_SERVICE_URL")
    eval_service_api_key = os.getenv("EVAL_SERVICE_API_KEY")

    get_url = urljoin(eval_service_url, f"/api/evaluations/{eval_id}")
    seconds_between_requests = 15

    # every 15 seconds get the evaluation status
    eval_status = "new"
    eval = None
    while eval_status != "complete":
        response = requests.get(
            get_url,
            headers={
                "Content-Type": "application/json",
                "api-key": eval_service_api_key
            })

        eval = response.json()
        eval_status = eval.get(eval_type).get("status")
        print(f'Eval {eval_type} for {eval_id} is "{eval_status}".')
        if eval_status != "complete":
            print('Waiting for evaluation to complete...')
            time.sleep(seconds_between_requests)
    return eval

eval = wait_for_evaluation(eval_id, "ilab_evaluation")
eval = wait_for_evaluation(eval_id, "openai_evaluation")

with open(f"{output_directory}/evaluation.json", 'w') as json_file:
    json.dump(eval, json_file, indent=4)
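
The loop above keeps polling until the status is `"complete"`, so it would never return if an evaluation ended in some other terminal state. A bounded variant is sketched below. The `"failed"` and `"error"` status names, the `"running"` value in the demonstration, and the 30 minute default timeout are assumptions, since the service's full set of statuses is not documented here. `poll_until_done` is a hypothetical helper that takes any zero-argument function returning the current status.

```python
import time


def poll_until_done(fetch_status, interval_seconds=15, timeout_seconds=1800,
                    failure_statuses=("failed", "error"), sleep=time.sleep):
    """Call fetch_status() until it reports "complete", a failure, or a timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status == "complete":
            return status
        if status in failure_statuses:  # assumed terminal failure names
            raise RuntimeError(f"Evaluation ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Evaluation still {status!r} after {timeout_seconds} seconds")
        sleep(interval_seconds)


# Demonstration with canned statuses and a no-op sleep, so no service is needed.
canned = iter(["new", "running", "complete"])
result = poll_until_done(lambda: next(canned), sleep=lambda seconds: None)
print(result)
```

In the notebook, `fetch_status` could wrap the `requests.get` call and return `response.json().get(eval_type).get("status")`.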

## Create the resulting score report (Excel / Markdown / HTML)

Now that the evaluation is complete, we can summarize the results in Excel, Markdown, and HTML files for both the InstructLab evaluation and the OpenAI evaluation.  Feel free to use either.  You can find the files in the `results` directory and inspect the results.  The summary scores are between 1 and 5, with 5 being the best score.  The first table is a summary across all models; it is followed by a detail table for each model that includes all of the data.  If you're worried about the results, these details should help diagnose issues such as subpar context retrieval.

#### Summary
| question index   |   lab-tuned-granite |   lab-tuned-granite-rag |   granite-3.0-8b-instruct-rag | gpt-4-rag |
|:-----------------|--------------------:|------------------------:|------------------------------:|----------:|
| Q1               |                   4 |                       5 |                             5 |         4 |
| Q2               |                   1 |                       5 |                             5 |         5 |
| ...              |                 ... |                     ... |                           ... |       ... |
| QX               |                   4 |                       5 |                             5 |         5 |
| Sum              |                   9 |                      15 |                            15 |        14 |
| Average          |                   3 |                       5 |                             5 |   4.66667 |


#### lab-tuned-granite
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |

#### lab-tuned-granite-rag
| user_input | reference | retrieved_context |  response |   score |     reasoning |
|:-----------|----------:|------------------:|----------:|--------:|--------------:|
| What is ...| It is...  | There is ...      | It is...  |  4      | The answer... |




In [None]:
import json

# Use a context manager so the file handle is closed after loading.
with open(f"{output_directory}/evaluation.json") as json_file:
    eval = json.load(json_file)

In [None]:
from eval_utils import summarize_results, write_excel, write_markdown, write_html

ilab_summary_output_df = summarize_results(eval.get("ilab_evaluation").get("results"))
openai_summary_output_df = summarize_results(eval.get("openai_evaluation").get("results"))

write_excel(
    ilab_summary_output_df,
    eval.get("ilab_evaluation").get("results"),
    f"{output_directory}/ilab_scores.xlsx"
)

write_excel(
    openai_summary_output_df,
    eval.get("openai_evaluation").get("results"),
    f"{output_directory}/openai_scores.xlsx"
)

write_markdown(
    ilab_summary_output_df,
    eval.get("ilab_evaluation").get("results"),
    f"{output_directory}/ilab_scores.md"
)

write_markdown(
    openai_summary_output_df,
    eval.get("openai_evaluation").get("results"),
    f"{output_directory}/openai_scores.md"
)

write_html(
    ilab_summary_output_df,
    eval.get("ilab_evaluation").get("results"),
    f"{output_directory}/ilab_scores.html"
)

write_html(
    openai_summary_output_df,
    eval.get("openai_evaluation").get("results"),
    f"{output_directory}/openai_scores.html"
)
