<h1 align="center">
  <a href="https://www.nlga.niedersachsen.de/startseite">
    <img width="300" src="https://www.nlga.niedersachsen.de/assets/image/246974" alt="NLGA">
  </a>
</h1>

## Experimenting with different LLMs - Aleph Alpha Luminous-supreme-control vs meta-llama-3-8b-instruct

In [42]:
# define configuration for dynamic experiment handling
EXPERIMENT_NAME = "meta-llama-3-8b-instruct-vs-Luminous-Supreme"

**Overview**: In this notebook, we compare two LLMs: Aleph Alpha's Luminous-supreme-control and Meta's Llama 3 8B Instruct. We use around 35 example questions from the [Testfragen](https://secure-confluence.nortal.com/display/NLGAC/Testfragen) dataset and evaluate the responses on several criteria to determine which of the two models performs better.

We have used the following metrics from UpTrain's library:

1. [Response Conciseness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-conciseness): Evaluates how concise the generated response is or if it has any additional irrelevant information for the question asked.

2. [Response Matching](https://docs.uptrain.ai/predefined-evaluations/ground-truth-comparison/response-matching): Evaluates how well the response generated by the LLM aligns with the provided ground truth.

3. [Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy): Evaluates whether the response generated is factually correct and grounded by the provided context.

4. [Context Utilization](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-utilization): Evaluates how complete the generated response is for the question, given the information provided in the context. Also known as Response Completeness with respect to context (`RESPONSE_COMPLETENESS_WRT_CONTEXT`).

5. [Response Relevance](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-relevance): Evaluates how relevant the generated response is to the question asked.

6. **Response Validity**: In some cases, an LLM fails to generate a response, for example because of limited knowledge or an unclear question. The Response Validity score identifies these cases, where the model does not generate an informative response. For example, if the question is "What is the chemical formula of chlorophyll?", a valid response would be "The formula for chlorophyll is C55H72O5N4Mg." An invalid response would be "Sorry, I have no idea about that."

7. [Guideline Adherence](https://github.com/uptrain-ai/uptrain/blob/main/examples/checks/custom/guideline_adherence.ipynb): Evaluates the extent to which the LLM follows a given guideline, rule, or protocol. Given the complexity of LLMs, it is crucial to define such guidelines, whether they concern the structure of the output, constraints on its content, or protocols for the decision-making of LLM agents. For example, for an LLM-powered chatbot agent built to perform appointment booking only, you want to make sure that the LLM follows the guideline: "The agent should redirect all the queries to the human agent, except the ones related to appointment booking."

Each score has a value between 0 and 1. 

The complete list of UpTrain's supported metrics is available [here](https://docs.uptrain.ai/predefined-evaluations/overview).
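
Because every score lies between 0 and 1, a per-model average is a natural way to summarize a metric. The following is a minimal, library-independent sketch of such an average. The function name `mean_score` and the sample values are hypothetical, and missing scores are skipped instead of being counted as 0, which is an assumption about how gaps should be treated.

```python
import math

def mean_score(scores):
    """Average a list of scores in [0, 1], ignoring None and NaN entries."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    if not valid:
        return None  # no usable scores at all
    return sum(valid) / len(valid)

# Hypothetical scores for one metric; None marks a missing evaluation
example_scores = [1.0, 0.5, None, 0.0]
print(mean_score(example_scores))  # 0.5
```

Skipping missing values keeps a failed evaluation from dragging the average down, but it also means the averages of two models may be based on different numbers of questions.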

### Install Dependencies

In [43]:
# %pip install openai uptrain together replicate lazy_loader fsspec pandas polars networkx pydantic_settings aiolimiter

In [44]:
import os
from dotenv import load_dotenv
import polars as pl 
import shutil
from datetime import datetime
import time
from openai import OpenAI
import replicate

### Authentication and Configuration

Let's define the required API keys: mainly the Replicate API token (for generating responses) and the Azure OpenAI API key (for evaluating the responses).
Please also ensure that the dataset path is correctly defined in the configuration.
Do not forget to set `API_KEY` (and `BASE_URL`, if you use the OpenAI-compatible client) for the LLM API endpoint provider.
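
A quick check for unset keys can save a failed run later on. The sketch below is one possible way to do it; the helper `missing_env_vars` is hypothetical, and the variable names are the ones this notebook reads with `os.getenv`.

```python
import os

# Environment variable names read by this notebook
REQUIRED_ENV_VARS = [
    "REPLICATE_API_TOKEN",
    "AZURE_OPENAI_API_KEY",
    "AZURE_API_VERSION",
    "AZURE_API_BASE",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Run it after `load_dotenv()` so that values from the `.env` file are taken into account.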

In [45]:
# Load the environment variables from the .env file
load_dotenv()

CONFIG = {
    # Replicate.com API key
    "API_KEY": os.getenv("REPLICATE_API_TOKEN"),
    # The model name used to generate responses
    "GENERATE_MODEL_NAME": "meta/meta-llama-3-8b-instruct",
    "AA_MODEL_NAME": "luminous-supreme-control",
    # Guideline name used in the Guideline Adherence check
    "GUIDELINE_NAME": "Strict_Context",
    # dataset path
    # "DATASET_PATH": "nlga_dataset_AA_small.jsonl",
    "DATASET_PATH": "./nlga_dataset_AA.jsonl",
    "RESULTS_DIR": "./results/",
    "AZURE_OPENAI_API_KEY": os.getenv("AZURE_OPENAI_API_KEY"),
    "AZURE_API_VERSION": os.getenv("AZURE_API_VERSION"),
    "AZURE_API_BASE": os.getenv("AZURE_API_BASE"),
    # Azure deployments:
    "GPT_35_TURBO_16K": "gpt-35-turbo-16k-deployment",
    "GPT_4": "gpt4",
}

# Azure deployment used to evaluate 
EVAL_MODEL_NAME = "azure/gpt4"
# EVAL_MODEL_NAME = "azure/gpt35-16k"

def get_experiment_file_path(extension):
    filename = f"{EXPERIMENT_NAME.replace(' ', '_').replace('-', '_').lower()}_experiment.{extension}"
    return os.path.join(CONFIG['RESULTS_DIR'], filename)

jsonl_file_path = get_experiment_file_path('jsonl')
csv_file_path = get_experiment_file_path('csv')
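
To see which file names the helper above produces, the same logic can be run in isolation. This is a sketch with explicit arguments instead of the notebook's globals; the function name `experiment_file_path` is hypothetical.

```python
import os

def experiment_file_path(experiment_name, results_dir, extension):
    """Mirror of get_experiment_file_path with explicit arguments."""
    slug = experiment_name.replace(" ", "_").replace("-", "_").lower()
    return os.path.join(results_dir, f"{slug}_experiment.{extension}")

print(experiment_file_path(
    "meta-llama-3-8b-instruct-vs-Luminous-Supreme", "./results/", "jsonl"
))
# ./results/meta_llama_3_8b_instruct_vs_luminous_supreme_experiment.jsonl
```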

In [46]:
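# Utility functions for API and file operations
def initialize_openai_client():
    # BASE_URL is not part of CONFIG by default; .get() avoids a KeyError and
    # lets the client fall back to its default endpoint when it is unset
    return OpenAI(
        api_key=CONFIG["API_KEY"],
        base_url=CONFIG.get("BASE_URL"),
    )

def ensure_directory_exists(path):
    os.makedirs(path, exist_ok=True)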

def read_dataset(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"The specified dataset path does not exist: {path}")
    return pl.read_ndjson(path)
# dataset = pl.read_ndjson(dataset_path).select(pl.col(["question", "ground_truth", "context"]))
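
Before loading, it can help to confirm that every JSONL record carries the fields the rest of the notebook relies on. The sketch below uses only the standard library; the helper `find_incomplete_records` is hypothetical, and the field list is taken from the columns of the dataset preview in this notebook.

```python
import json

# Column names as they appear in the dataset preview of this notebook
REQUIRED_FIELDS = {"idx", "question", "response", "ground_truth", "context"}

def find_incomplete_records(lines):
    """Return (line_number, missing_fields) for every JSONL record lacking a required field."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            problems.append((line_number, sorted(missing)))
    return problems

sample = [
    '{"idx": 1, "question": "q", "response": "r", "ground_truth": "g", "context": "c"}',
    '{"idx": 2, "question": "q"}',
]
print(find_incomplete_records(sample))
# [(2, ['context', 'ground_truth', 'response'])]
```

For a real file, pass an open file handle instead of the `sample` list, since iterating a text file also yields one line at a time.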

### Load the testing dataset

In [47]:
ensure_directory_exists(CONFIG['RESULTS_DIR'])
dataset = read_dataset(CONFIG['DATASET_PATH'])
filtered_dataset = dataset.filter(dataset["idx"] > 100)
filtered_dataset

idx,question,response,ground_truth,context
i64,str,str,str,str
104,"""Welche Erreger werden in Armin erfasst?""","""Armin erfasst keine spezifischen Erreger. Es hand…","""* Gramnegative Bakterien o Escherichia coli o…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…"
106,"""Wann sind Zecken aktiv? Stechen Zecken auch im Wi…","""Zecke sind in Deutschland von März bis Oktober ak…","""Zecken werden bereits ab einer Außentemperatur vo…","""[Doc Nr. 1] Erythema chronicum migrans: Wenn die …"
300,"""Wer ist Cristiano Ronaldo?""","""Cristiano Ronaldo ist ein portugiesischer Fußball…","""Diese Anfrage liegt außerhalb des gesundheitsbezo…","""[Doc Nr. 1] ERKRANKUNG IMPFUNG21Als mögliche Nebe…"
301,"""Wie funktioniert ein Elektromotor?""","""Die Frage nach dem Funktionieren eines Elektromot…","""Diese Anfrage liegt außerhalb des gesundheitsbezo…","""[Doc Nr. 1] Je	größer	die	Temperaturdifferenz zw…"


### Let's define a prompt to generate responses

In [48]:
SYSTEM_PROMPT = """### INSTRUKTIONEN
Generiere bitte eine ANTWORT, die sich strikt an den gegebenen KONTEXT hält und präzise auf die gestellte FRAGE antwortet, ohne eigene Informationen des Modells hinzuzufügen. Falls die benötigte Information nicht im KONTEXT zu finden ist, antworte mit: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Vermeide Bezüge auf vorherige Ausgaben des Modells. Die Antwort soll auf dem bereitgestellten KONTEXT basieren. Sollte die FRAGE nicht direkt einem gesundheitsbezogenen Thema zuzuordnen sein oder nicht klar zu beantworten sein, erkläre kurz, warum die Anfrage nicht beantwortet werden kann und empfehle eine genauere Formulierung oder zusätzliche Informationen.

### KONTEXT
{context}"""

In [49]:
# client = initialize_openai_client()
# 
# def get_response(row, model):
#     question = row['question'][0]
#     context = row['context'][0]
#     
#     response = client.chat.completions.create(
#         model=model,
#         messages=[
#             {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
#             {"role": "user", "content": question},
#             # {"role": "assistant", "content": "Example answer"},
#             # {"role": "user", "content": "First question/message for the model to actually respond to."}
#         ]
#     ).choices[0].message.content
#     
#     return {'question': question, 'context': context, 'response': response, 'ground_truth': row['ground_truth'][0], 'model': model}

In [50]:
REPLICATE_API_TOKEN = os.getenv("REPLICATE_API_TOKEN")
def get_response(row, model_name):
    # https://replicate.com/meta/meta-llama-3-8b-instruct/api/learn-more
    question = row['question'][0]
    context = row['context'][0]
    system_prompt = SYSTEM_PROMPT.format(context=context)
    
    input_data = {
        "prompt": question,
        "prompt_template": f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant in health topics.\n\n{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{{prompt}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "temperature": 0.1,
        "presence_penalty": 0,
        "frequency_penalty": 0
    }
    
    output = replicate.run(
        model_name,
        input=input_data
    )

    response = "".join(output)
    return {'question': question, 'context': context, 'response': response, 'ground_truth': row['ground_truth'][0], 'model': model_name}
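
The `prompt_template` above mixes two kinds of braces: `{system_prompt}` is filled locally by the f-string, while `{{prompt}}` is escaped so that a literal `{prompt}` placeholder survives. The sketch below shows this in isolation, without any network call. The stand-in system prompt is made up, and the final `replace` only imitates the substitution that is expected to happen on the serving side.

```python
# Stand-in for SYSTEM_PROMPT.format(context=...); the text is made up
system_prompt = "### KONTEXT\nBeispielkontext"

# Same structure as the template in get_response: single braces are filled
# by the f-string, doubled braces leave a literal {prompt} placeholder
template = (
    f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    f"You are a helpful assistant in health topics.\n\n{system_prompt}<|eot_id|>"
    f"<|start_header_id|>user<|end_header_id|>\n\n{{prompt}}<|eot_id|>"
    f"<|start_header_id|>assistant<|end_header_id|>\n\n"
)

print("{prompt}" in template)  # True: the placeholder is still there

# Local imitation of the later substitution of the user question
filled = template.replace("{prompt}", "Ist FSME tödlich?")
print(filled)
```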

In [51]:
# row_example = {
#     'question': ["What is the capital of Bishkek?"],
#     'context': ["Germany is a country in Europe known for its rich history and culture."],
#     'ground_truth': ["Paris"]
# }
# 
# model_name = "meta/meta-llama-3-8b-instruct" 
# 
# response = get_response(row_example, model_name)
# print(response)

In [52]:
def format_response(row):
    question = row['question'][0]
    context = row['context'][0]
    response = row['response'][0]
    ground_truth = row['ground_truth'][0]
    model = CONFIG["AA_MODEL_NAME"]

    return {'question': question, 'context': context, 'response': response, 'ground_truth': ground_truth, 'model': model}

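### Generate responses for both models

Only the Llama 3 responses are generated here, via Replicate. The Luminous responses are already stored in the dataset's `response` column and are merely brought into the same format by `format_response`.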

In [53]:
start_time = time.time()
results_cmodel = [get_response(dataset[idx], CONFIG["GENERATE_MODEL_NAME"]) for idx in range(len(dataset))]
results_aleph_alpha = [format_response(dataset[idx]) for idx in range(len(dataset))]

In [54]:
print(f"Total execution time: {time.time() - start_time:.2f} seconds")

Total execution time: 78.41 seconds


In [55]:
pl.Config.set_fmt_str_lengths(50)
df = pl.DataFrame(results_cmodel)
df

question,context,response,ground_truth,model
str,str,str,str,str
"""Wie kann ich mich vor Legionellen schützen?""","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""Um sich vor Legionellen zu schützen, sollten Sie …","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""meta/meta-llama-3-8b-instruct"""
"""Warum heißen Legionellen so?""","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""Die Legionellen heißen so, weil sie ursprünglich …","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""meta/meta-llama-3-8b-instruct"""
"""Was sind Anzeichen für Borreliose?""","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""Nach den bereitgestellten Informationen sind die …","""Wanderröte in etwa 50 % der Fälle: nach einigen…","""meta/meta-llama-3-8b-instruct"""
"""Sind die Lyme-Krankheit und Borreliose das selbe?…","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""Die Lyme-Krankheit und Borreliose sind nicht das …","""Ja. (die durch Bakterien verursachte Borreliose (…","""meta/meta-llama-3-8b-instruct"""
"""Schützt die schnelle Entfernung einer Zecke vor F…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""Laut dem Dokument [Doc Nr. 1] bietet die schnelle…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""meta/meta-llama-3-8b-instruct"""
"""Ist FSME tödlich?""","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""Laut dem Dokument [Doc Nr. 1] kann die Hirnentzün…","""Bei einem großen Teil der Infizierten (> 70%) und…","""meta/meta-llama-3-8b-instruct"""
"""Welche Daten muss eine Bescheinigung über einen d…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""Nach Absatz 2 der Verordnung muss eine Bescheinig…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""meta/meta-llama-3-8b-instruct"""
"""Ich bin 80. Kann ich demnächst geimpft werden?""","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""Laut dem Dokument [Doc Nr. 1] werden zunächst fol…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""meta/meta-llama-3-8b-instruct"""
"""Kann ich mich auf Borkum gegen Corona impfen lass…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""Die Frage kann nicht direkt mit den bereitgestell…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""meta/meta-llama-3-8b-instruct"""
"""Wo erhalte ich weitere Informationen zur Corona-I…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""Nach den bereitgestellten Dokumenten können Sie w…","""Informationen zur Impfung können vorab z.B. unter…","""meta/meta-llama-3-8b-instruct"""


### Evaluating Experiments using UpTrain

UpTrain's EvalLLM provides an `evaluate_experiments` method which takes the input data to be evaluated along with the list of checks to be run and the name of the columns associated with the experiment. The method returns a list of dictionaries containing the results of the evaluation. 
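
In the results of this notebook, every metric column carries a `_model_<name>` suffix, one per value of the experiment column. The sketch below illustrates that layout with plain Python. It is not UpTrain's implementation; the function `pivot_by_experiment` and the sample rows are hypothetical.

```python
def pivot_by_experiment(rows, exp_column="model", key_column="question"):
    """Group per-model rows by question and suffix every other field with the experiment value."""
    merged = {}
    for row in rows:
        suffix = f"_{exp_column}_{row[exp_column]}"
        target = merged.setdefault(row[key_column], {key_column: row[key_column]})
        for name, value in row.items():
            if name not in (exp_column, key_column):
                target[name + suffix] = value
    return list(merged.values())

rows = [
    {"question": "Ist FSME tödlich?", "model": "model-a", "score_valid_response": 1.0},
    {"question": "Ist FSME tödlich?", "model": "model-b", "score_valid_response": 0.0},
]
print(pivot_by_experiment(rows))
# [{'question': 'Ist FSME tödlich?', 'score_valid_response_model_model-a': 1.0,
#   'score_valid_response_model_model-b': 0.0}]
```

One row per question with side-by-side columns makes it easy to compare the two models on the same input.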

In [56]:
from uptrain import EvalLLM, Evals, ResponseMatching, Settings

import nest_asyncio
# Allow nested event loops so that UpTrain's async calls can run inside Jupyter
nest_asyncio.apply()

start_time = time.time()

# Evaluator LLM: the Azure OpenAI deployment selected via EVAL_MODEL_NAME
settings = Settings(
    model=EVAL_MODEL_NAME,
    azure_api_key=CONFIG["AZURE_OPENAI_API_KEY"],
    azure_api_version=CONFIG["AZURE_API_VERSION"],
    azure_api_base=CONFIG["AZURE_API_BASE"],
)
eval_llm = EvalLLM(settings)

res = eval_llm.evaluate_experiments(
    project_name=f"{EXPERIMENT_NAME}-Experiments",
    data=results_cmodel + results_aleph_alpha,
    checks=[
        Evals.RESPONSE_CONCISENESS,
        ResponseMatching(method='llm'),  # Comment this out if you don't have ground truth
        Evals.RESPONSE_COMPLETENESS_WRT_CONTEXT,
        Evals.FACTUAL_ACCURACY,
        Evals.RESPONSE_RELEVANCE,
        Evals.VALID_RESPONSE,
    ],
    exp_columns=['model'],
)

100%|██████████| 70/70 [00:23<00:00,  3.02it/s]
 56%|█████▌    | 78/140 [00:13<00:12,  4.80it/s]
2024-04-18 22:23:23.553 | ERROR    | uptrain.operators.language.llm:async_process_payload:103 - Error when sending request to LLM API: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-03-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 10 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}
2024-04-18 22:23:23.553 | INFO     | uptrain.operators.language.llm:async_process_payload:130 - Going to sleep before retrying for payload 82
2024-04-18 22:23:23.567 | ERROR    | uptrain.operators.language.llm:async_process_payload:103

In [57]:
print(f"Total evaluation time: {time.time() - start_time:.2f} seconds")

Total evaluation time: 503.89 seconds
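
The 429 errors in the log above come from the Azure OpenAI rate limit, and the log shows UpTrain sleeping before it retries. For your own API calls, for example the generation requests, a similar pattern can be sketched as follows. The helper `call_with_backoff` is hypothetical, and catching every `Exception` is a simplification; real code would retry only on rate-limit errors.

```python
import time

def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake call that fails twice; delays are recorded instead of slept
calls = {"n": 0}
delays = []

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated 429")
    return "ok"

print(call_with_backoff(flaky, sleep=delays.append))  # ok
print(delays)  # [1.0, 2.0]
```

The `sleep` parameter exists only to make the helper testable without waiting; in normal use the default `time.sleep` applies.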


In [58]:
res_df = pl.DataFrame(res)
res_df

question,explanation_factual_accuracy_model_meta/meta-llama-3-8b-instruct,explanation_factual_accuracy_model_luminous-supreme-control,score_response_conciseness_model_meta/meta-llama-3-8b-instruct,score_response_conciseness_model_luminous-supreme-control,explanation_response_matching_model_meta/meta-llama-3-8b-instruct,explanation_response_matching_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_meta/meta-llama-3-8b-instruct,explanation_response_completeness_wrt_context_model_luminous-supreme-control,score_factual_accuracy_model_meta/meta-llama-3-8b-instruct,score_factual_accuracy_model_luminous-supreme-control,explanation_response_relevance_model_meta/meta-llama-3-8b-instruct,explanation_response_relevance_model_luminous-supreme-control,context_model_meta/meta-llama-3-8b-instruct,context_model_luminous-supreme-control,score_valid_response_model_meta/meta-llama-3-8b-instruct,score_valid_response_model_luminous-supreme-control,response_model_meta/meta-llama-3-8b-instruct,response_model_luminous-supreme-control,score_response_match_model_meta/meta-llama-3-8b-instruct,score_response_match_model_luminous-supreme-control,score_response_match_precision_model_meta/meta-llama-3-8b-instruct,score_response_match_precision_model_luminous-supreme-control,score_response_relevance_model_meta/meta-llama-3-8b-instruct,score_response_relevance_model_luminous-supreme-control,score_response_match_recall_model_meta/meta-llama-3-8b-instruct,score_response_match_recall_model_luminous-supreme-control,ground_truth_model_meta/meta-llama-3-8b-instruct,ground_truth_model_luminous-supreme-control,explanation_valid_response_model_meta/meta-llama-3-8b-instruct,explanation_valid_response_model_luminous-supreme-control,explanation_response_conciseness_model_meta/meta-llama-3-8b-instruct,explanation_response_conciseness_model_luminous-supreme-control,score_response_completeness_wrt_context_model_meta/meta-llama-3-8b-instruct,score_response_completeness_wrt_context_model_luminous-supreme-control
str,str,str,f64,f64,str,str,str,str,f64,f64,str,str,str,str,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str,f64,f64
"""Wie kann ich mich vor Legionellen schützen?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,0.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The context provides detailed…",1.0,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …",1.0,0.0,"""Um sich vor Legionellen zu schützen, sollten Sie …","""'Ihre Anfrage kann nicht mit den bereitgestellten…",0.857143,0.0,0.6,,1.0,0.0,1.0,0.0,"""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…","""{  ""Reasoning"": ""The response provides compreh…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0
"""Warum heißen Legionellen so?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 0.75{  ""Result"": [  …","""Information Recall: 0.75{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response correctly identi…",1.0,0.6,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…",1.0,1.0,"""Die Legionellen heißen so, weil sie ursprünglich …","""Der Name Legionellen leitet sich von der Legionel…",0.8,0.761905,1.0,0.8,1.0,1.0,0.75,0.75,"""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides a direc…","""{  ""Reasoning"": ""The response accurately expla…",1.0,1.0
"""Was sind Anzeichen für Borreliose?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.4{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response accurately captu…",1.0,1.0,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…",1.0,1.0,"""Nach den bereitgestellten Informationen sind die …","""Die Anzeichen für Lyme-Borreliose können sich in …",0.888889,0.470588,0.666667,1.0,1.0,1.0,1.0,0.4,"""Wanderröte in etwa 50 % der Fälle: nach einigen…","""Wanderröte in etwa 50 % der Fälle: nach einigen…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response provides a compr…",1.0,1.0
"""Sind die Lyme-Krankheit und Borreliose das selbe?…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 1.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response is incorrect. Ac…","""{  ""Reasoning"": ""The response correctly identi…",0.1,0.8,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …",1.0,1.0,"""Die Lyme-Krankheit und Borreliose sind nicht das …","""Die Lyme Disease und die Borreliosis sind das sel…",0.470588,0.727273,0.4,0.4,1.0,1.0,0.5,1.0,"""Ja. (die durch Bakterien verursachte Borreliose (…","""Ja. (die durch Bakterien verursachte Borreliose (…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response accurately answe…","""{  ""Reasoning"": ""The response accurately answe…",0.0,1.0
"""Schützt die schnelle Entfernung einer Zecke vor F…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 0.75{  ""Result"": [  …","""Information Recall: 0.25{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly utiliz…","""{  ""Reasoning"": ""The response accurately captu…",1.0,1.0,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…",1.0,1.0,"""Laut dem Dokument [Doc Nr. 1] bietet die schnelle…","""Die schnelle Entfernung einer angehefteten Zecke …",0.8,0.266667,1.0,0.333333,1.0,1.0,0.75,0.25,"""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response directly answers…",1.0,1.0
"""Ist FSME tödlich?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,1.0,"""Information Recall: 0.375{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response correctly states…",1.0,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…",1.0,0.0,"""Laut dem Dokument [Doc Nr. 1] kann die Hirnentzün…","""Ihre Anfrage kann nicht mit dem bereitgestellten …",0.25,0.0,0.125,,1.0,0.0,0.375,0.0,"""Bei einem großen Teil der Infizierten (> 70%) und…","""Bei einem großen Teil der Infizierten (> 70%) und…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response is not providing…","""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.5
"""Welche Daten muss eine Bescheinigung über einen d…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.8{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately lists…","""{  ""Reasoning"": ""The response correctly identi…",1.0,1.0,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…",1.0,1.0,"""Nach Absatz 2 der Verordnung muss eine Bescheinig…","""Die Bescheinigung über einen durchgeführten Coron…",1.0,0.842105,1.0,1.0,1.0,1.0,1.0,0.8,"""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response provides a detai…",1.0,0.5
"""Ich bin 80. Kann ich demnächst geimpft werden?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides informat…",0.666667,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",1.0,0.0,"""Laut dem Dokument [Doc Nr. 1] werden zunächst fol…","""Ihre Anfrage kann aufgrund fehlender Informatione…",0.444444,0.0,0.166667,,0.666667,0.0,1.0,0.0,"""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0
"""Kann ich mich auf Borkum gegen Corona impfen lass…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,0.0,"""Information Recall: 0.16666666666666666{  ""Res…","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context does not provide …",0.666667,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",1.0,0.0,"""Die Frage kann nicht direkt mit den bereitgestell…","""Ihre Anfrage kann aufgrund fehlender Informatione…",0.0,0.0,0.0,,0.666667,0.0,0.166667,0.0,"""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response is not providing…","""{  ""Reasoning"": ""The response is relevant to t…","""{  ""Reasoning"": ""The response does not provide…",0.0,0.0
"""Wo erhalte ich weitere Informationen zur Corona-I…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,0.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides multiple…",1.0,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…",1.0,0.0,"""Nach den bereitgestellten Dokumenten können Sie w…","""Ihre Anfrage kann aufgrund fehlender Informatione…",0.551724,0.0,0.8,,1.0,0.0,0.5,0.0,"""Informationen zur Impfung können vorab z.B. unter…","""Informationen zur Impfung können vorab z.B. unter…","""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response is not providing…","""{  ""Reasoning"": ""The response provides specifi…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0


### Adding Guideline Adherence evaluations

In [59]:
guideline = "The response must strictly adhere to the provided context and not introduce external information. If the necessary information is absent from the context, respond with: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Should the question fall outside the health-related jurisdiction of the Landesgesundheitsamt Niedersachsen, it means the query is beyond the health-related scope and shouldn't be answered."

In [60]:
data_cmodel_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_cmodel]
data_aleph_alpha_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_aleph_alpha]

In [61]:
from uptrain import GuidelineAdherence

def run_guideline_adherence_eval(data, guideline_name):
    return eval_llm.evaluate(
        data=data,
        checks=[GuidelineAdherence(guideline=guideline, guideline_name=guideline_name)]
    )

res_guideline_cmodel = run_guideline_adherence_eval(data_cmodel_for_guideline_eval, CONFIG["GUIDELINE_NAME"])
res_guideline_aleph_alpha = run_guideline_adherence_eval(data_aleph_alpha_for_guideline_eval, CONFIG["GUIDELINE_NAME"])

100%|██████████| 35/35 [00:13<00:00,  2.55it/s]
2024-04-18 22:31:27.066 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
100%|██████████| 35/35 [00:19<00:00,  1.76it/s]
2024-04-18 22:31:52.563 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!


In [62]:
import math

def update_guidelines(guideline_name, res_guidelines, config_model_name):
    """Rename the guideline score/explanation keys so they carry the model name."""
    DEFAULT_SCORE = float("nan")
    DEFAULT_EXPLANATION = "No data available"
    score_name = 'score_' + guideline_name + '_adherence'
    explanation_name = 'explanation_' + guideline_name + '_adherence'
    score_key = score_name + '_model_' + config_model_name
    explanation_key = explanation_name + '_model_' + config_model_name

    for f in res_guidelines:
        # Move the score under its model-specific key; fall back to NaN if it is missing
        f[score_key] = f.pop(score_name, DEFAULT_SCORE)

        if explanation_name in f:
            f[explanation_key] = f.pop(explanation_name)
        elif isinstance(f[score_key], float) and math.isnan(f[score_key]):
            # NaN never compares equal to itself, so test with math.isnan instead of ==
            f[explanation_key] = DEFAULT_EXPLANATION

    return res_guidelines
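
The NaN default used for missing scores needs some care whenever it is checked later: NaN never compares equal to anything, including itself, so an `==` test against the default cannot detect it. A minimal illustration follows; the helper `is_missing` is hypothetical.

```python
import math

DEFAULT_SCORE = float("nan")

print(DEFAULT_SCORE == DEFAULT_SCORE)  # False: NaN is not equal to itself
print(math.isnan(DEFAULT_SCORE))       # True

def is_missing(score):
    """Treat None and NaN as a missing score."""
    return score is None or (isinstance(score, float) and math.isnan(score))

print(is_missing(DEFAULT_SCORE), is_missing(None), is_missing(0.0))  # True True False
```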

In [63]:
res_guideline_cmodel = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_cmodel, CONFIG["GENERATE_MODEL_NAME"])
res_guideline_aleph_alpha = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_aleph_alpha, CONFIG["AA_MODEL_NAME"])

In [64]:
def merge_lists(base_list, update_list):
    update_dict = {item['question']: item for item in update_list if 'question' in item}
    
    for item in base_list:
        question = item.get('question')
        if question and question in update_dict:
            # print(f"updating with {question}")            
            update_info = {key: val for key, val in update_dict[question].items() if key != 'response'}
            item.update(update_info)
    return base_list

res = merge_lists(res, res_guideline_cmodel)
res = merge_lists(res, res_guideline_aleph_alpha)
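
The effect of `merge_lists` is easiest to see on toy data. The sketch below repeats the function so that it runs on its own; the sample records and key names are made up.

```python
def merge_lists(base_list, update_list):
    """Copy all fields except 'response' from matching update items into the base items."""
    update_dict = {item["question"]: item for item in update_list if "question" in item}
    for item in base_list:
        question = item.get("question")
        if question and question in update_dict:
            item.update({k: v for k, v in update_dict[question].items() if k != "response"})
    return base_list

base = [{"question": "Q1", "score_a": 1.0}, {"question": "Q2", "score_a": 0.5}]
update = [{"question": "Q1", "response": "text", "score_guideline": 1.0}]

merged = merge_lists(base, update)
print(merged)
# [{'question': 'Q1', 'score_a': 1.0, 'score_guideline': 1.0}, {'question': 'Q2', 'score_a': 0.5}]
```

Two properties are worth noting: the base list is modified in place, and questions without a match in the update list are left unchanged. Since the question text is the join key, duplicate questions in the update list would overwrite each other.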

### Creating Dataframe and displaying Average Score

In [65]:
res_df = pl.DataFrame(res)
res_df

question,explanation_factual_accuracy_model_meta/meta-llama-3-8b-instruct,explanation_factual_accuracy_model_luminous-supreme-control,score_response_conciseness_model_meta/meta-llama-3-8b-instruct,score_response_conciseness_model_luminous-supreme-control,explanation_response_matching_model_meta/meta-llama-3-8b-instruct,explanation_response_matching_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_meta/meta-llama-3-8b-instruct,explanation_response_completeness_wrt_context_model_luminous-supreme-control,score_factual_accuracy_model_meta/meta-llama-3-8b-instruct,score_factual_accuracy_model_luminous-supreme-control,explanation_response_relevance_model_meta/meta-llama-3-8b-instruct,explanation_response_relevance_model_luminous-supreme-control,context_model_meta/meta-llama-3-8b-instruct,context_model_luminous-supreme-control,score_valid_response_model_meta/meta-llama-3-8b-instruct,score_valid_response_model_luminous-supreme-control,response_model_meta/meta-llama-3-8b-instruct,response_model_luminous-supreme-control,score_response_match_model_meta/meta-llama-3-8b-instruct,score_response_match_model_luminous-supreme-control,score_response_match_precision_model_meta/meta-llama-3-8b-instruct,score_response_match_precision_model_luminous-supreme-control,score_response_relevance_model_meta/meta-llama-3-8b-instruct,score_response_relevance_model_luminous-supreme-control,score_response_match_recall_model_meta/meta-llama-3-8b-instruct,score_response_match_recall_model_luminous-supreme-control,ground_truth_model_meta/meta-llama-3-8b-instruct,ground_truth_model_luminous-supreme-control,explanation_valid_response_model_meta/meta-llama-3-8b-instruct,explanation_valid_response_model_luminous-supreme-control,explanation_response_conciseness_model_meta/meta-llama-3-8b-instruct,explanation_response_conciseness_model_luminous-supreme-control,score_response_completeness_wrt_context_model_meta/meta-llama-3-8b-instruct,score_response_completeness_wrt_context_model_luminous-supreme-control,score_Strict_Context_adherence_model_meta/meta-llama-3-8b-instruct,explanation_Strict_Context_adherence_model_meta/meta-llama-3-8b-instruct,score_Strict_Context_adherence_model_luminous-supreme-control,explanation_Strict_Context_adherence_model_luminous-supreme-control
str,str,str,f64,f64,str,str,str,str,f64,f64,str,str,str,str,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str,str,str,f64,f64,f64,str,f64,str
"""Wie kann ich mich vor Legionellen schützen?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,0.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The context provides detailed…",1.0,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …",1.0,0.0,"""Um sich vor Legionellen zu schützen, sollten Sie …","""'Ihre Anfrage kann nicht mit den bereitgestellten…",0.857143,0.0,0.6,,1.0,0.0,1.0,0.0,"""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…","""{  ""Reasoning"": ""The response provides compreh…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,1.0,"""{  ""Reasoning"": ""The given response adheres to…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Warum heißen Legionellen so?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 0.75{  ""Result"": [  …","""Information Recall: 0.75{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response correctly identi…",1.0,0.6,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…",1.0,1.0,"""Die Legionellen heißen so, weil sie ursprünglich …","""Der Name Legionellen leitet sich von der Legionel…",0.8,0.761905,1.0,0.8,1.0,1.0,0.75,0.75,"""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides a direc…","""{  ""Reasoning"": ""The response accurately expla…",1.0,1.0,1.0,"""{  ""Reasoning"": ""The response adheres to the g…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Was sind Anzeichen für Borreliose?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.4{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response accurately captu…",1.0,1.0,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…",1.0,1.0,"""Nach den bereitgestellten Informationen sind die …","""Die Anzeichen für Lyme-Borreliose können sich in …",0.888889,0.470588,0.666667,1.0,1.0,1.0,1.0,0.4,"""Wanderröte in etwa 50 % der Fälle: nach einigen…","""Wanderröte in etwa 50 % der Fälle: nach einigen…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response provides a compr…",1.0,1.0,1.0,"""{  ""Reasoning"": ""The given LLM response strict…",1.0,"""{  ""Reasoning"": ""The given response strictly a…"
"""Sind die Lyme-Krankheit und Borreliose das selbe?…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 1.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response is incorrect. Ac…","""{  ""Reasoning"": ""The response correctly identi…",0.1,0.8,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …",1.0,1.0,"""Die Lyme-Krankheit und Borreliose sind nicht das …","""Die Lyme Disease und die Borreliosis sind das sel…",0.470588,0.727273,0.4,0.4,1.0,1.0,0.5,1.0,"""Ja. (die durch Bakterien verursachte Borreliose (…","""Ja. (die durch Bakterien verursachte Borreliose (…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response accurately answe…","""{  ""Reasoning"": ""The response accurately answe…",0.0,1.0,1.0,"""{  ""Reasoning"": ""The given LLM response strict…",1.0,"""{  ""Reasoning"": ""The given LLM response strict…"
"""Schützt die schnelle Entfernung einer Zecke vor F…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 0.75{  ""Result"": [  …","""Information Recall: 0.25{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly utiliz…","""{  ""Reasoning"": ""The response accurately captu…",1.0,1.0,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…",1.0,1.0,"""Laut dem Dokument [Doc Nr. 1] bietet die schnelle…","""Die schnelle Entfernung einer angehefteten Zecke …",0.8,0.266667,1.0,0.333333,1.0,1.0,0.75,0.25,"""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response directly answers…",1.0,1.0,1.0,"""{  ""Reasoning"": ""The given response adheres to…",1.0,"""{  ""Reasoning"": ""The given LLM response strict…"
"""Ist FSME tödlich?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,1.0,"""Information Recall: 0.375{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response correctly states…",1.0,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…",1.0,0.0,"""Laut dem Dokument [Doc Nr. 1] kann die Hirnentzün…","""Ihre Anfrage kann nicht mit dem bereitgestellten …",0.25,0.0,0.125,,1.0,0.0,0.375,0.0,"""Bei einem großen Teil der Infizierten (> 70%) und…","""Bei einem großen Teil der Infizierten (> 70%) und…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response is not providing…","""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.5,1.0,"""{  ""Reasoning"": ""The response adheres to the g…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Welche Daten muss eine Bescheinigung über einen d…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.8{  ""Result"": [  …","""{  ""Reasoning"": ""The response accurately lists…","""{  ""Reasoning"": ""The response correctly identi…",1.0,1.0,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…",1.0,1.0,"""Nach Absatz 2 der Verordnung muss eine Bescheinig…","""Die Bescheinigung über einen durchgeführten Coron…",1.0,0.842105,1.0,1.0,1.0,1.0,1.0,0.8,"""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response provides a detai…",1.0,0.5,1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Ich bin 80. Kann ich demnächst geimpft werden?""","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides informat…",0.666667,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",1.0,0.0,"""Laut dem Dokument [Doc Nr. 1] werden zunächst fol…","""Ihre Anfrage kann aufgrund fehlender Informatione…",0.444444,0.0,0.166667,,0.666667,0.0,1.0,0.0,"""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…","""{  ""Reasoning"": ""The response provides a detai…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,1.0,"""{  ""Reasoning"": ""The response adheres to the g…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Kann ich mich auf Borkum gegen Corona impfen lass…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,0.0,"""Information Recall: 0.16666666666666666{  ""Res…","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context does not provide …",0.666667,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",1.0,0.0,"""Die Frage kann nicht direkt mit den bereitgestell…","""Ihre Anfrage kann aufgrund fehlender Informatione…",0.0,0.0,0.0,,0.666667,0.0,0.166667,0.0,"""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response is not providing…","""{  ""Reasoning"": ""The response is relevant to t…","""{  ""Reasoning"": ""The response does not provide…",0.0,0.0,1.0,"""{  ""Reasoning"": ""The given response adheres to…",1.0,"""{  ""Reasoning"": ""The given LLM response adhere…"
"""Wo erhalte ich weitere Informationen zur Corona-I…","""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,0.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides multiple…",1.0,,"""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…",1.0,0.0,"""Nach den bereitgestellten Dokumenten können Sie w…","""Ihre Anfrage kann aufgrund fehlender Informatione…",0.551724,0.0,0.8,,1.0,0.0,0.5,0.0,"""Informationen zur Impfung können vorab z.B. unter…","""Informationen zur Impfung können vorab z.B. unter…","""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response is not providing…","""{  ""Reasoning"": ""The response provides specifi…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,1.0,"""{  ""Reasoning"": ""The response adheres to the g…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
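The three response-match columns in the rows above appear to be related. In every visible row where both precision and recall are present, `score_response_match` matches an F-beta combination of the two with β² = 3, which weights recall more heavily than precision. This is an observation from the printed values only, not something confirmed against UpTrain's implementation. The helper name `f_beta` is hypothetical, and the tuples are copied from the rows shown.

```python
import math


def f_beta(precision, recall, beta_sq=3.0):
    """F-beta score; beta_sq=3.0 is an assumption inferred from the printed rows."""
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * precision * recall / denominator


# (precision, recall, reported score_response_match) taken from the visible rows
visible_rows = [
    (0.6, 1.0, 0.857143),
    (1.0, 0.75, 0.8),
    (0.666667, 1.0, 0.888889),
    (0.4, 0.5, 0.470588),
    (0.4, 1.0, 0.727273),
    (0.333333, 0.25, 0.266667),
    (0.125, 0.375, 0.25),
    (0.8, 0.5, 0.551724),
]

for precision, recall, reported in visible_rows:
    recomputed = f_beta(precision, recall)
    print(f"P={precision:.3f} R={recall:.3f} reported={reported:.6f} recomputed={recomputed:.6f}")
```

If this relationship holds for the full dataset, a response that covers the ground truth but adds extra material is penalised less than one that is precise but incomplete.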


In [66]:
import os
import shutil
from datetime import datetime


def backup_and_save_df(df, file_path, file_type='csv'):
    """Save a polars DataFrame, first copying any existing file into a 'backups' folder."""
    if file_type not in ('csv', 'jsonl'):
        raise ValueError(f"Unsupported file_type: {file_type!r}")

    # Keep backups next to the target file, in a 'backups' subdirectory.
    backup_dir = os.path.join(os.path.dirname(file_path), 'backups')
    os.makedirs(backup_dir, exist_ok=True)

    # Preserve the previous version under a timestamped name before overwriting it.
    if os.path.exists(file_path):
        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_filename = os.path.basename(file_path) + f".backup-{timestamp}"
        backup_path = os.path.join(backup_dir, backup_filename)
        shutil.copy(file_path, backup_path)

    if file_type == 'csv':
        print(f"Saving DataFrame to CSV at: {file_path}")
        df.write_csv(file_path)
    else:
        print(f"Saving DataFrame to NDJSON at: {file_path}")
        df.write_ndjson(file_path)


backup_and_save_df(res_df, jsonl_file_path, 'jsonl')
backup_and_save_df(res_df, csv_file_path, 'csv')

Saving DataFrame to NDJSON at: ./results/meta_llama_3_8b_instruct_vs_luminous_supreme_experiment.jsonl
Saving DataFrame to CSV at: ./results/meta_llama_3_8b_instruct_vs_luminous_supreme_experiment.csv
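The backup naming used by `backup_and_save_df` can be illustrated without touching the file system. The sketch below is only an illustration: `backup_path_for` is a hypothetical helper, the timestamp is made up, and it uses `posixpath` so the result is the same on every platform, whereas the notebook cell uses `os.path`.

```python
import posixpath
from datetime import datetime


def backup_path_for(file_path, now):
    """Return where an existing file at file_path would be copied before overwriting."""
    backup_dir = posixpath.join(posixpath.dirname(file_path), "backups")
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    return posixpath.join(backup_dir, posixpath.basename(file_path) + f".backup-{timestamp}")


example = backup_path_for("./results/experiment.csv", datetime(2024, 5, 1, 9, 30, 0))
print(example)
```

Because the timestamp has one-second resolution, two saves of the same file within the same second would write to the same backup name.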


In [67]:
def display_average_scores(df):
    """Return one row per (model, metric) with the mean of each score column."""
    score_columns = [col for col in df.columns if col.startswith('score_')]
    data_for_table = []

    for column in score_columns:
        # mean() skips nulls; drop_nans() additionally removes NaN values.
        average = df[column].drop_nans().mean()

        # Column names look like 'score_<metric>_model_<model name>'.
        prefix, model_name = column.split('_model_', 1)
        metric_name = prefix[len('score_'):].replace('_', ' ').capitalize()

        data_for_table.append({
            "Model": model_name,
            "Metric": metric_name,
            "Average Score": average
        })

    return pl.DataFrame(data_for_table)
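The averaging above leaves out missing scores rather than counting them as zero, which matters when one model often declines to answer. The following pure-Python sketch mirrors that behaviour without polars. It assumes the blank cells in the result rows are nulls. `average_scores` is a hypothetical stand-in for the notebook function, and the values are the Luminous factual-accuracy scores from the ten rows shown earlier.

```python
def average_scores(columns):
    """Average every 'score_<metric>_model_<model>' column, skipping missing values."""
    rows = []
    for name, values in columns.items():
        if not name.startswith("score_"):
            continue
        prefix, model_name = name.split("_model_", 1)
        metric_name = prefix[len("score_"):].replace("_", " ").capitalize()
        present = [value for value in values if value is not None]
        average = sum(present) / len(present) if present else None
        rows.append((model_name, metric_name, average))
    return rows


# Luminous factual-accuracy scores from the ten visible rows (None = blank cell)
luminous_factual = [None, 0.6, 1.0, 0.8, 1.0, None, 1.0, None, None, None]

table = average_scores({
    "question": ["..."] * 10,  # non-score columns are ignored
    "score_factual_accuracy_model_luminous-supreme-control": luminous_factual,
})
zero_filled = sum(value or 0.0 for value in luminous_factual) / len(luminous_factual)

print(table)        # average over the five scored rows only
print(zero_filled)  # average if each missing score counted as 0
```

On these ten rows the two conventions give roughly 0.88 and 0.44, so the reported averages should be read together with the valid-response score.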

In [68]:
pl.Config.set_tbl_rows(32)
pl.Config.set_fmt_str_lengths(50)
display_average_scores(res_df)

Model,Metric,Average Score
str,str,f64
"""meta/meta-llama-3-8b-instruct""","""Response conciseness""",0.942857
"""luminous-supreme-control""","""Response conciseness""",0.685714
"""meta/meta-llama-3-8b-instruct""","""Factual accuracy""",0.731845
"""luminous-supreme-control""","""Factual accuracy""",0.477381
"""meta/meta-llama-3-8b-instruct""","""Valid response""",0.8
"""luminous-supreme-control""","""Valid response""",0.6
"""meta/meta-llama-3-8b-instruct""","""Response match""",0.436124
"""luminous-supreme-control""","""Response match""",0.199738
"""meta/meta-llama-3-8b-instruct""","""Response match precision""",0.513393
"""luminous-supreme-control""","""Response match precision""",0.324206
