<h1 align="center">
  <a href="https://www.nlga.niedersachsen.de/startseite">
    <img width="300" src="https://www.nlga.niedersachsen.de/assets/image/246974" alt="NLGA">
  </a>
</h1>

## Experimenting with different LLMs - Aleph Alpha Luminous-supreme-control vs GPT-35-turbo

In [223]:
# define configuration for dynamic experiment handling
EXPERIMENT_NAME = "GPT_35_vs_Luminous_Supreme"

**Overview**: In this notebook, we compare two LLM providers. We use around 35 example questions from the [Testfragen](https://secure-confluence.nortal.com/display/NLGAC/Testfragen) dataset and evaluate the responses against several criteria to determine which of the two models performs better.

We have used the following metrics from UpTrain's library:

1. [Response Conciseness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-conciseness): Evaluates how concise the generated response is or if it has any additional irrelevant information for the question asked.

2. [Response Matching](https://docs.uptrain.ai/predefined-evaluations/ground-truth-comparison/response-matching): Evaluates how well the response generated by the LLM aligns with the provided ground truth.

3. [Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy): Evaluates whether the response generated is factually correct and grounded by the provided context.

4. [Context Utilization](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-utilization): Evaluates how complete the generated response is for the question specified, given the information provided in the context. Also known as Response Completeness with respect to context (`RESPONSE_COMPLETENESS_WRT_CONTEXT`).

5. [Response Relevance](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-relevance): Evaluates how relevant the generated response was to the question specified.

6. **Response Validity**: In some cases, an LLM might fail to generate a response, for example because of limited knowledge or because the question is unclear. The Response Validity score identifies these cases, where the model does not generate an informative response. For example, if the question is "What is the chemical formula of chlorophyll?", a valid response would be "The formula for chlorophyll is C55H72O5N4Mg." An invalid response could be "Sorry, I have no idea about that."

7. [Guideline Adherence](https://github.com/uptrain-ai/uptrain/blob/main/examples/checks/custom/guideline_adherence.ipynb): Evaluates the extent to which the LLM follows a given guideline, rule, or protocol. Given the complexity of LLMs, it is crucial to define such guidelines, whether they concern the structure of the output, constraints on its content, or protocols for the decision-making of LLM agents. For example, for an LLM-powered chatbot agent that should only perform appointment booking tasks, you want to make sure the LLM follows the guideline: "The agent should redirect all the queries to the human agent, except the ones related to appointment booking."

Each score has a value between 0 and 1. 

The complete list of UpTrain's supported metrics is available [here](https://docs.uptrain.ai/predefined-evaluations/overview).

### Install Dependencies

In [224]:
# %pip install openai uptrain together lazy_loader fsspec pandas polars networkx pydantic_settings aiolimiter

In [250]:
import os
from dotenv import load_dotenv
import polars as pl 
import shutil
from datetime import datetime
import time
from openai import OpenAI

### Authentication and Configuration 

Let's define the required API keys, mainly the Azure OpenAI key (used for generating responses).
Please also ensure that the dataset path is correctly defined in the configuration.
Do not forget to set the API_KEY and BASE_URL for the LLM API endpoint provider.

In [226]:
# Load the environment variables from the .env file
load_dotenv()

CONFIG = {
    # The model name used to generate responses
    "GENERATE_MODEL_NAME": "gpt-35-turbo-16k-deployment",
    "AA_MODEL_NAME": "luminous-supreme-control",
    # Guideline name used in the Guideline Adherence check 
    "GUIDELINE_NAME": "Strict_Context",
    # dataset path
    # "DATASET_PATH": "nlga_dataset_AA_small.jsonl",
    "DATASET_PATH": "./nlga_dataset_AA.jsonl",
    "RESULTS_DIR": "./results/",
    "AZURE_OPENAI_API_KEY": os.getenv("AZURE_OPENAI_API_KEY"),
    "AZURE_API_VERSION": os.getenv("AZURE_API_VERSION"),
    "AZURE_API_BASE": os.getenv("AZURE_API_BASE"),
    # Azure deployments:
    "GPT_35_TURBO_16K": "gpt-35-turbo-16k-deployment",
    "GPT_4": "gpt4"
}

# Azure deployment used to evaluate 
EVAL_MODEL_NAME = "azure/gpt4"
# EVAL_MODEL_NAME = "azure/gpt35-16k"

def get_experiment_file_path(extension):
    filename = f"{EXPERIMENT_NAME.replace(' ', '_').replace('-', '_').lower()}_experiment.{extension}"
    return os.path.join(CONFIG['RESULTS_DIR'], filename)

jsonl_file_path = get_experiment_file_path('jsonl')
csv_file_path = get_experiment_file_path('csv')

In [227]:
# Utility functions for API and file operations
from openai import AzureOpenAI

def initialize_azure_openai_client():
    # gets the API Key from environment variable AZURE_OPENAI_API_KEY
    return AzureOpenAI(api_version=CONFIG["AZURE_API_VERSION"], 
                       azure_endpoint=CONFIG["AZURE_API_BASE"])

def ensure_directory_exists(path):
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(path, exist_ok=True)

def read_dataset(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"The specified dataset path does not exist: {path}")
    return pl.read_ndjson(path)
# dataset = pl.read_ndjson(dataset_path).select(pl.col(["question", "ground_truth", "context"]))

### Load the testing dataset

In [228]:
ensure_directory_exists(CONFIG['RESULTS_DIR'])
dataset = read_dataset(CONFIG['DATASET_PATH'])
filtered_dataset = dataset.filter(dataset["idx"] > 100)
filtered_dataset

idx,question,response,ground_truth,context
i64,str,str,str,str
104,"""Welche Erreger werden in Armin erfasst?""","""Armin erfasst keine spezifischen Erreger. Es hand…","""* Gramnegative Bakterien o Escherichia coli o…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…"
106,"""Wann sind Zecken aktiv? Stechen Zecken auch im Wi…","""Zecke sind in Deutschland von März bis Oktober ak…","""Zecken werden bereits ab einer Außentemperatur vo…","""[Doc Nr. 1] Erythema chronicum migrans: Wenn die …"
300,"""Wer ist Cristiano Ronaldo?""","""Cristiano Ronaldo ist ein portugiesischer Fußball…","""Diese Anfrage liegt außerhalb des gesundheitsbezo…","""[Doc Nr. 1] ERKRANKUNG IMPFUNG21Als mögliche Nebe…"
301,"""Wie funktioniert ein Elektromotor?""","""Die Frage nach dem Funktionieren eines Elektromot…","""Diese Anfrage liegt außerhalb des gesundheitsbezo…","""[Doc Nr. 1] Je	größer	die	Temperaturdifferenz zw…"


### Let's define a prompt to generate responses

In [229]:
SYSTEM_PROMPT = """### INSTRUKTIONEN
Generiere bitte eine ANTWORT, die sich strikt an den gegebenen KONTEXT hält und präzise auf die gestellte FRAGE antwortet, ohne eigene Informationen des Modells hinzuzufügen. Falls die benötigte Information nicht im KONTEXT zu finden ist, antworte mit: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Vermeide Bezüge auf vorherige Ausgaben des Modells. Die Antwort soll auf dem bereitgestellten KONTEXT basieren. Sollte die FRAGE nicht direkt einem gesundheitsbezogenen Thema zuzuordnen sein oder nicht klar zu beantworten sein, erkläre kurz, warum die Anfrage nicht beantwortet werden kann und empfehle eine genauere Formulierung oder zusätzliche Informationen.

### KONTEXT
{context}"""

In [230]:
client = initialize_azure_openai_client()

def get_response(row, model):
    question = row['question'][0]
    context = row['context'][0]

    # Only GPT deployments are generated here; fail loudly for anything else
    # instead of reaching the return statement with `response` undefined.
    if "gpt" not in model:
        raise ValueError(f"Unsupported model for response generation: {model}")

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
            {"role": "user", "content": question},
            # {"role": "assistant", "content": "Example answer"},
            # {"role": "user", "content": "First question/message for the model to actually respond to."}
        ]
    ).choices[0].message.content

    return {'question': question, 'context': context, 'response': response, 'ground_truth': row['ground_truth'][0], 'model': model}

In [231]:
def format_response(row):
    question = row['question'][0]
    context = row['context'][0]
    response = row['response'][0]
    ground_truth = row['ground_truth'][0]
    model = CONFIG["AA_MODEL_NAME"]

    return {'question': question, 'context': context, 'response': response, 'ground_truth': ground_truth, 'model': model}

### Generate responses for both the models

In [232]:
start_time = time.time()
results_cmodel = [get_response(dataset[idx], CONFIG["GENERATE_MODEL_NAME"]) for idx in range(len(dataset))]
results_aleph_alpha = [format_response(dataset[idx]) for idx in range(len(dataset))]

In [None]:
print(f"Total execution time: {time.time() - start_time:.2f} seconds")

In [254]:
pl.Config.set_fmt_str_lengths(50)
df = pl.DataFrame(results_cmodel)
df

question,context,response,ground_truth,model
str,str,str,str,str
"""Wie kann ich mich vor Legionellen schützen?""","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""Um sich vor Legionellen zu schützen, sollten Maßn…","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""gpt-35-turbo-16k-deployment"""
"""Warum heißen Legionellen so?""","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""Der Name Legionellen stammt von einem Vorfall im …","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""gpt-35-turbo-16k-deployment"""
"""Was sind Anzeichen für Borreliose?""","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""Die Anzeichen für Borreliose können sein: eine mi…","""Wanderröte in etwa 50 % der Fälle: nach einigen…","""gpt-35-turbo-16k-deployment"""
"""Sind die Lyme-Krankheit und Borreliose das selbe?…","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""Nein, Lyme-Krankheit und Borreliose sind nicht da…","""Ja. (die durch Bakterien verursachte Borreliose (…","""gpt-35-turbo-16k-deployment"""
"""Schützt die schnelle Entfernung einer Zecke vor F…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""Nein, die schnelle Entfernung einer Zecke bietet …","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""gpt-35-turbo-16k-deployment"""
"""Ist FSME tödlich?""","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""Ja, die FSME-Erkrankung kann in schweren Fällen t…","""Bei einem großen Teil der Infizierten (> 70%) und…","""gpt-35-turbo-16k-deployment"""
"""Welche Daten muss eine Bescheinigung über einen d…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""Eine Bescheinigung über einen durchgeführten Coro…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""gpt-35-turbo-16k-deployment"""
"""Ich bin 80. Kann ich demnächst geimpft werden?""","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""Ja, als jemand im Alter von 80 Jahren gehören Sie…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""gpt-35-turbo-16k-deployment"""
"""Kann ich mich auf Borkum gegen Corona impfen lass…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""Ja, Sie können sich auf Borkum gegen Corona impfe…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""gpt-35-turbo-16k-deployment"""
"""Wo erhalte ich weitere Informationen zur Corona-I…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""Für weitere Informationen zur Corona-Impfung könn…","""Informationen zur Impfung können vorab z.B. unter…","""gpt-35-turbo-16k-deployment"""


### Evaluating Experiments using UpTrain

UpTrain's EvalLLM provides an `evaluate_experiments` method which takes the input data to be evaluated along with the list of checks to be run and the name of the columns associated with the experiment. The method returns a list of dictionaries containing the results of the evaluation. 

In [234]:
from uptrain import EvalLLM, Evals, ResponseMatching, Settings

import nest_asyncio
nest_asyncio.apply()

settings = Settings(model=EVAL_MODEL_NAME, azure_api_key=CONFIG["AZURE_OPENAI_API_KEY"], azure_api_version=CONFIG["AZURE_API_VERSION"], azure_api_base=CONFIG["AZURE_API_BASE"])
eval_llm = EvalLLM(settings)

res = eval_llm.evaluate_experiments(
    project_name = f"{EXPERIMENT_NAME}-Experiments",
    data = results_cmodel + results_aleph_alpha,
    checks = [
        Evals.RESPONSE_CONCISENESS,
        ResponseMatching(method='llm'),  # Comment this out if you don't have ground truth
        Evals.RESPONSE_COMPLETENESS_WRT_CONTEXT,
        Evals.FACTUAL_ACCURACY,
        Evals.RESPONSE_RELEVANCE,
        Evals.VALID_RESPONSE
    ],
    exp_columns=['model']
)

100%|██████████| 70/70 [00:24<00:00,  2.80it/s]
 34%|███▎      | 47/140 [00:11<00:13,  6.99it/s]
2024-04-18 15:56:04.066 | ERROR    | uptrain.operators.language.llm:async_process_payload:103 - Error when sending request to LLM API: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-03-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 9 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}
2024-04-18 15:56:04.066 | INFO     | uptrain.operators.language.llm:async_process_payload:130 - Going to sleep before retrying for payload 77
100%|██████████| 140/140 [01:18<00:00,  1.79it/s]
100%|██████████| 140/140 [01:53<00:00,  1.24it/s]
100%|██████████| 70/70 [01:15<00:00,  1.08s/
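
The 429 error in the log above is handled by UpTrain's own retry during evaluation. The generation step in this notebook calls the Azure endpoint directly, so a similar guard may be useful there. The following is a minimal sketch only: `call_with_retry` and its parameters are hypothetical names introduced for illustration, and it catches every exception, whereas a real setup would more likely catch only the rate-limit exception of the client library (for example `openai.RateLimitError`).

```python
import random
import time


def call_with_retry(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait with exponential backoff and try again.

    Hypothetical helper for illustration, not part of openai or uptrain.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # 1s, 2s, 4s, ... plus a little jitter to avoid synchronized retries
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1)
            sleep(delay)
```

Under these assumptions, a single generation call could be wrapped as `call_with_retry(lambda: get_response(dataset[idx], CONFIG["GENERATE_MODEL_NAME"]))`.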

In [None]:
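# start_time was set before response generation, so this covers generation and evaluation
print(f"Total elapsed time (generation + evaluation): {time.time() - start_time:.2f} seconds")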

In [235]:
res_df = pl.DataFrame(res)
res_df

question,ground_truth_model_gpt-35-turbo-16k-deployment,ground_truth_model_luminous-supreme-control,score_factual_accuracy_model_gpt-35-turbo-16k-deployment,score_factual_accuracy_model_luminous-supreme-control,context_model_gpt-35-turbo-16k-deployment,context_model_luminous-supreme-control,explanation_response_conciseness_model_gpt-35-turbo-16k-deployment,explanation_response_conciseness_model_luminous-supreme-control,score_valid_response_model_gpt-35-turbo-16k-deployment,score_valid_response_model_luminous-supreme-control,score_response_conciseness_model_gpt-35-turbo-16k-deployment,score_response_conciseness_model_luminous-supreme-control,explanation_valid_response_model_gpt-35-turbo-16k-deployment,explanation_valid_response_model_luminous-supreme-control,score_response_match_precision_model_gpt-35-turbo-16k-deployment,score_response_match_precision_model_luminous-supreme-control,explanation_factual_accuracy_model_gpt-35-turbo-16k-deployment,explanation_factual_accuracy_model_luminous-supreme-control,explanation_response_relevance_model_gpt-35-turbo-16k-deployment,explanation_response_relevance_model_luminous-supreme-control,score_response_relevance_model_gpt-35-turbo-16k-deployment,score_response_relevance_model_luminous-supreme-control,score_response_match_model_gpt-35-turbo-16k-deployment,score_response_match_model_luminous-supreme-control,score_response_match_recall_model_gpt-35-turbo-16k-deployment,score_response_match_recall_model_luminous-supreme-control,response_model_gpt-35-turbo-16k-deployment,response_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_gpt-35-turbo-16k-deployment,explanation_response_completeness_wrt_context_model_luminous-supreme-control,score_response_completeness_wrt_context_model_gpt-35-turbo-16k-deployment,score_response_completeness_wrt_context_model_luminous-supreme-control,explanation_response_matching_model_gpt-35-turbo-16k-deployment,explanation_response_matching_model_luminous-supreme-control
str,str,str,f64,f64,str,str,str,str,f64,f64,f64,f64,str,str,f64,f64,str,str,str,str,f64,f64,f64,f64,f64,f64,str,str,str,str,f64,f64,str,str
"""Wie kann ich mich vor Legionellen schützen?""","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…",0.9,,"""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""{  ""Reasoning"": ""The response provides relevan…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,0.5,"""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response is asking for mo…",0.3,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.5{  ""Reasoning"": ""The re…",1.0,0.0,0.428571,0.0,0.5,0.0,"""Um sich vor Legionellen zu schützen, sollten Maßn…","""'Ihre Anfrage kann nicht mit den bereitgestellten…","""{  ""Reasoning"": ""The response includes all the…","""{  ""Reasoning"": ""The context provides detailed…",1.0,0.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …"
"""Warum heißen Legionellen so?""","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…",1.0,0.6,"""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response accurately expla…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Der Name Legion…","""{  ""Reasoning"": ""The response 'Der Name Legion…",1.0,0.6,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.689655,0.672,0.625,0.7,"""Der Name Legionellen stammt von einem Vorfall im …","""Der Name Legionellen leitet sich von der Legionel…","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response correctly identi…",1.0,1.0,"""Information Recall: 0.625{  ""Result"": [  …","""Information Recall: 0.7{  ""Result"": [  …"
"""Was sind Anzeichen für Borreliose?""","""Wanderröte in etwa 50 % der Fälle: nach einigen…","""Wanderröte in etwa 50 % der Fälle: nach einigen…",1.0,1.0,"""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""{  ""Reasoning"": ""The response accurately provi…","""{  ""Reasoning"": ""The response provides a detai…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response provides detaile…",0.8,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.246154,0.756757,0.2,0.7,"""Die Anzeichen für Borreliose können sein: eine mi…","""Die Anzeichen für Lyme-Borreliose können sich in …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response accurately captu…",1.0,1.0,"""Information Recall: 0.2{  ""Result"": [  …","""Information Recall: 0.7{  ""Result"": [  …"
"""Sind die Lyme-Krankheit und Borreliose das selbe?…","""Ja. (die durch Bakterien verursachte Borreliose (…","""Ja. (die durch Bakterien verursachte Borreliose (…",0.75,0.8,"""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""{  ""Reasoning"": ""The response accurately answe…","""{  ""Reasoning"": ""The response accurately answe…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Nein, Lyme-Kran…","""{  ""Reasoning"": ""The response provides informa…",0.25,0.4,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.25,0.727273,0.25,1.0,"""Nein, Lyme-Krankheit und Borreliose sind nicht da…","""Die Lyme Disease und die Borreliosis sind das sel…","""{  ""Reasoning"": ""The response is incorrect. Th…","""{  ""Reasoning"": ""The response correctly identi…",0.0,1.0,"""Information Recall: 0.25{  ""Result"": [  …","""Information Recall: 1.0{  ""Result"": [  …"
"""Schützt die schnelle Entfernung einer Zecke vor F…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…",1.0,1.0,"""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response directly answers…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Nein, die schne…","""{  ""Reasoning"": ""The response provides informa…",0.333333,0.333333,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.444444,0.444444,0.5,0.5,"""Nein, die schnelle Entfernung einer Zecke bietet …","""Die schnelle Entfernung einer angehefteten Zecke …","""{  ""Reasoning"": ""The response correctly incorp…","""{  ""Reasoning"": ""The response correctly incorp…",1.0,1.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.5{  ""Result"": [  …"
"""Ist FSME tödlich?""","""Bei einem großen Teil der Infizierten (> 70%) und…","""Bei einem großen Teil der Infizierten (> 70%) und…",1.0,,"""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""{  ""Reasoning"": ""The response provides relevan…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,0.5,"""{  ""Reasoning"": ""The response 'Ja, die FSME-Er…","""{  ""Reasoning"": ""The response is asking for mo…",0.4,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.5{  ""Reasoning"": ""The re…",1.0,0.0,0.470588,0.0,0.5,0.0,"""Ja, die FSME-Erkrankung kann in schweren Fällen t…","""Ihre Anfrage kann nicht mit dem bereitgestellten …","""{  ""Reasoning"": ""The response correctly incorp…","""{  ""Reasoning"": ""The response is incorrect bec…",1.0,0.5,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …"
"""Welche Daten muss eine Bescheinigung über einen d…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…",1.0,1.0,"""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""{  ""Reasoning"": ""The response accurately lists…","""{  ""Reasoning"": ""The response accurately provi…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,1.0,0.842105,1.0,0.8,"""Eine Bescheinigung über einen durchgeführten Coro…","""Die Bescheinigung über einen durchgeführten Coron…","""{  ""Reasoning"": ""The response accurately lists…","""{  ""Reasoning"": ""The response correctly includ…",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.8{  ""Result"": [  …"
"""Ich bin 80. Kann ich demnächst geimpft werden?""","""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…",0.8,,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,0.5,"""{  ""Reasoning"": ""The response 'Ja, als jemand …","""{  ""Reasoning"": ""The response is asking for mo…",0.25,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.5{  ""Reasoning"": ""The re…",1.0,0.0,0.571429,0.0,1.0,0.0,"""Ja, als jemand im Alter von 80 Jahren gehören Sie…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response does not incorpo…",1.0,0.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …"
"""Kann ich mich auf Borkum gegen Corona impfen lass…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…",0.0,,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Ja, Sie können …","""{  ""Reasoning"": ""The response is not providing…",0.0,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,0.0,0.0,0.0,0.166667,0.0,"""Ja, Sie können sich auf Borkum gegen Corona impfe…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""{  ""Reasoning"": ""The context does not provide …","""{  ""Reasoning"": ""The context provided does not…",0.0,0.0,"""Information Recall: 0.16666666666666666{  ""Res…","""Information Recall: 0.0{  ""Result"": [  …"
"""Wo erhalte ich weitere Informationen zur Corona-I…","""Informationen zur Impfung können vorab z.B. unter…","""Informationen zur Impfung können vorab z.B. unter…",1.0,,"""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,1.0,"""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response 'Ihre Anfrage ka…",1.0,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,0.0,0.756757,0.0,0.7,0.0,"""Für weitere Informationen zur Corona-Impfung könn…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides multiple…",1.0,0.0,"""Information Recall: 0.7{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …"


### Adding Guideline Adherence evaluations

In [236]:
guideline = "The response must strictly adhere to the provided context and not introduce external information. If the necessary information is absent from the context, respond with: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Should the question fall outside the health-related jurisdiction of the Landesgesundheitsamt Niedersachsen, it means the query is beyond the health-related scope and shouldn't be answered."

In [237]:
data_cmodel_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_cmodel]
data_aleph_alpha_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_aleph_alpha]

In [238]:
from uptrain import GuidelineAdherence

def run_guideline_adherence_eval(data, guideline_name):
    return eval_llm.evaluate(
        data=data,
        checks=[GuidelineAdherence(guideline=guideline, guideline_name=guideline_name)]
    )

res_guideline_cmodel = run_guideline_adherence_eval(data_cmodel_for_guideline_eval, CONFIG["GUIDELINE_NAME"])
res_guideline_aleph_alpha = run_guideline_adherence_eval(data_aleph_alpha_for_guideline_eval, CONFIG["GUIDELINE_NAME"])

100%|██████████| 35/35 [00:21<00:00,  1.65it/s]
[32m2024-04-18 16:05:13.581[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m367[0m - [1mLocal server not running, start the server to log data and visualize in the dashboard![0m
100%|██████████| 35/35 [00:28<00:00,  1.21it/s]
[32m2024-04-18 16:05:47.808[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m367[0m - [1mLocal server not running, start the server to log data and visualize in the dashboard![0m


In [239]:
import math

def update_guidelines(guideline_name, res_guidelines, config_model_name):
    """Rename the guideline score/explanation keys so they carry the model name."""
    DEFAULT_SCORE = float("nan")
    DEFAULT_EXPLANATION = "No data available"
    score_name = 'score_' + guideline_name + '_adherence'
    explanation_name = 'explanation_' + guideline_name + '_adherence'
    score_key = score_name + '_model_' + config_model_name
    explanation_key = explanation_name + '_model_' + config_model_name

    for f in res_guidelines:
        # Fall back to NaN when the evaluator returned no score for this row
        f[score_key] = f.pop(score_name, DEFAULT_SCORE)

        if explanation_name in f:
            f[explanation_key] = f.pop(explanation_name)
        elif isinstance(f[score_key], float) and math.isnan(f[score_key]):
            # NaN never compares equal to itself, so `== DEFAULT_SCORE` would never match
            f[explanation_key] = DEFAULT_EXPLANATION

    return res_guidelines
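
A note on the NaN sentinel used as the default score: NaN does not compare equal to anything, including itself, so an equality check cannot detect it. The sketch below illustrates this with the standard library; `is_missing_score` is a hypothetical helper name introduced for the example, not part of UpTrain.

```python
import math

DEFAULT_SCORE = float("nan")


def is_missing_score(score):
    """Return True when a score is absent (None) or the NaN sentinel."""
    return score is None or (isinstance(score, float) and math.isnan(score))


print(DEFAULT_SCORE == DEFAULT_SCORE)   # False: NaN is never equal to itself
print(is_missing_score(DEFAULT_SCORE))  # True
print(is_missing_score(0.0))            # False: a genuine score of zero
```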

In [240]:
res_guideline_cmodel = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_cmodel, CONFIG["GENERATE_MODEL_NAME"])
res_guideline_aleph_alpha = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_aleph_alpha, CONFIG["AA_MODEL_NAME"])

In [241]:
def merge_lists(base_list, update_list):
    update_dict = {item['question']: item for item in update_list if 'question' in item}
    
    for item in base_list:
        question = item.get('question')
        if question and question in update_dict:
            # print(f"updating with {question}")            
            update_info = {key: val for key, val in update_dict[question].items() if key != 'response'}
            item.update(update_info)
    return base_list

res = merge_lists(res, res_guideline_cmodel)
res = merge_lists(res, res_guideline_aleph_alpha)

### Creating Dataframe and displaying Average Score
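
A minimal sketch of how per-model average scores could be computed from the merged `res` list, written in plain Python so that it does not depend on a particular dataframe API. It assumes that `res` is a list of dictionaries whose score keys start with `score_` and end with the model name, as in the output columns of this notebook; `average_scores` and the toy rows are hypothetical and only illustrate the idea.

```python
import math
from collections import defaultdict


def average_scores(rows):
    """Average every 'score_*' key across rows, skipping None and NaN values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        for key, value in row.items():
            if not key.startswith("score_"):
                continue
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue  # the evaluator returned no score for this row
            totals[key] += value
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy rows that mimic the key naming of the evaluation results
example_rows = [
    {"question": "q1", "score_factual_accuracy_model_a": 1.0, "score_factual_accuracy_model_b": None},
    {"question": "q2", "score_factual_accuracy_model_a": 0.5, "score_factual_accuracy_model_b": 0.6},
]
for name, mean in sorted(average_scores(example_rows).items()):
    print(f"{name}: {mean:.2f}")
```

Missing scores are skipped rather than counted as zero. This matters for the comparison: a model that often declines to answer gets no factual accuracy score on those rows, so its average reflects only the questions it did answer.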

In [242]:
res_df = pl.DataFrame(res)
res_df

question,ground_truth_model_gpt-35-turbo-16k-deployment,ground_truth_model_luminous-supreme-control,score_factual_accuracy_model_gpt-35-turbo-16k-deployment,score_factual_accuracy_model_luminous-supreme-control,context_model_gpt-35-turbo-16k-deployment,context_model_luminous-supreme-control,explanation_response_conciseness_model_gpt-35-turbo-16k-deployment,explanation_response_conciseness_model_luminous-supreme-control,score_valid_response_model_gpt-35-turbo-16k-deployment,score_valid_response_model_luminous-supreme-control,score_response_conciseness_model_gpt-35-turbo-16k-deployment,score_response_conciseness_model_luminous-supreme-control,explanation_valid_response_model_gpt-35-turbo-16k-deployment,explanation_valid_response_model_luminous-supreme-control,score_response_match_precision_model_gpt-35-turbo-16k-deployment,score_response_match_precision_model_luminous-supreme-control,explanation_factual_accuracy_model_gpt-35-turbo-16k-deployment,explanation_factual_accuracy_model_luminous-supreme-control,explanation_response_relevance_model_gpt-35-turbo-16k-deployment,explanation_response_relevance_model_luminous-supreme-control,score_response_relevance_model_gpt-35-turbo-16k-deployment,score_response_relevance_model_luminous-supreme-control,score_response_match_model_gpt-35-turbo-16k-deployment,score_response_match_model_luminous-supreme-control,score_response_match_recall_model_gpt-35-turbo-16k-deployment,score_response_match_recall_model_luminous-supreme-control,response_model_gpt-35-turbo-16k-deployment,response_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_gpt-35-turbo-16k-deployment,explanation_response_completeness_wrt_context_model_luminous-supreme-control,score_response_completeness_wrt_context_model_gpt-35-turbo-16k-deployment,score_response_completeness_wrt_context_model_luminous-supreme-control,explanation_response_matching_model_gpt-35-turbo-16k-deployment,explanation_response_matching_model_luminous-supreme-control,score_Strict_Context_adherence_model_gpt-35-turbo-16k-deployment,explanation_Strict_Context_adherence_model_gpt-35-turbo-16k-deployment,score_Strict_Context_adherence_model_luminous-supreme-control,explanation_Strict_Context_adherence_model_luminous-supreme-control
str,str,str,f64,f64,str,str,str,str,f64,f64,f64,f64,str,str,f64,f64,str,str,str,str,f64,f64,f64,f64,f64,f64,str,str,str,str,f64,f64,str,str,f64,str,f64,str
"""Wie kann ich mich vor Legionellen schützen?""","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…",0.9,,"""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""{  ""Reasoning"": ""The response provides relevan…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,0.5,"""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response is asking for mo…",0.3,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.5{  ""Reasoning"": ""The re…",1.0,0.0,0.428571,0.0,0.5,0.0,"""Um sich vor Legionellen zu schützen, sollten Maßn…","""'Ihre Anfrage kann nicht mit den bereitgestellten…","""{  ""Reasoning"": ""The response includes all the…","""{  ""Reasoning"": ""The context provides detailed…",1.0,0.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Warum heißen Legionellen so?""","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…",1.0,0.6,"""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response accurately expla…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Der Name Legion…","""{  ""Reasoning"": ""The response 'Der Name Legion…",1.0,0.6,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.689655,0.672,0.625,0.7,"""Der Name Legionellen stammt von einem Vorfall im …","""Der Name Legionellen leitet sich von der Legionel…","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response correctly identi…",1.0,1.0,"""Information Recall: 0.625{  ""Result"": [  …","""Information Recall: 0.7{  ""Result"": [  …",0.0,"""{  ""Reasoning"": ""The response provides informa…",1.0,"""{  ""Reasoning"": ""The response provides informa…"
"""Was sind Anzeichen für Borreliose?""","""Wanderröte in etwa 50 % der Fälle: nach einigen…","""Wanderröte in etwa 50 % der Fälle: nach einigen…",1.0,1.0,"""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""{  ""Reasoning"": ""The response accurately provi…","""{  ""Reasoning"": ""The response provides a detai…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response provides detaile…",0.8,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.246154,0.756757,0.2,0.7,"""Die Anzeichen für Borreliose können sein: eine mi…","""Die Anzeichen für Lyme-Borreliose können sich in …","""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response accurately captu…",1.0,1.0,"""Information Recall: 0.2{  ""Result"": [  …","""Information Recall: 0.7{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Sind die Lyme-Krankheit und Borreliose das selbe?…","""Ja. (die durch Bakterien verursachte Borreliose (…","""Ja. (die durch Bakterien verursachte Borreliose (…",0.75,0.8,"""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""{  ""Reasoning"": ""The response accurately answe…","""{  ""Reasoning"": ""The response accurately answe…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Nein, Lyme-Kran…","""{  ""Reasoning"": ""The response provides informa…",0.25,0.4,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.25,0.727273,0.25,1.0,"""Nein, Lyme-Krankheit und Borreliose sind nicht da…","""Die Lyme Disease und die Borreliosis sind das sel…","""{  ""Reasoning"": ""The response is incorrect. Th…","""{  ""Reasoning"": ""The response correctly identi…",0.0,1.0,"""Information Recall: 0.25{  ""Result"": [  …","""Information Recall: 1.0{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Schützt die schnelle Entfernung einer Zecke vor F…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…",1.0,1.0,"""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response directly answers…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Nein, die schne…","""{  ""Reasoning"": ""The response provides informa…",0.333333,0.333333,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,0.444444,0.444444,0.5,0.5,"""Nein, die schnelle Entfernung einer Zecke bietet …","""Die schnelle Entfernung einer angehefteten Zecke …","""{  ""Reasoning"": ""The response correctly incorp…","""{  ""Reasoning"": ""The response correctly incorp…",1.0,1.0,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.5{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Ist FSME tödlich?""","""Bei einem großen Teil der Infizierten (> 70%) und…","""Bei einem großen Teil der Infizierten (> 70%) und…",1.0,,"""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""{  ""Reasoning"": ""The response provides relevan…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,0.5,"""{  ""Reasoning"": ""The response 'Ja, die FSME-Er…","""{  ""Reasoning"": ""The response is asking for mo…",0.4,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.5{  ""Reasoning"": ""The re…",1.0,0.0,0.470588,0.0,0.5,0.0,"""Ja, die FSME-Erkrankung kann in schweren Fällen t…","""Ihre Anfrage kann nicht mit dem bereitgestellten …","""{  ""Reasoning"": ""The response correctly incorp…","""{  ""Reasoning"": ""The response is incorrect bec…",1.0,0.5,"""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Welche Daten muss eine Bescheinigung über einen d…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…",1.0,1.0,"""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""{  ""Reasoning"": ""The response accurately lists…","""{  ""Reasoning"": ""The response accurately provi…",1.0,1.0,1.0,1.0,"""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,1.0,1.0,0.842105,1.0,0.8,"""Eine Bescheinigung über einen durchgeführten Coro…","""Die Bescheinigung über einen durchgeführten Coron…","""{  ""Reasoning"": ""The response accurately lists…","""{  ""Reasoning"": ""The response correctly includ…",1.0,1.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.8{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Ich bin 80. Kann ich demnächst geimpft werden?""","""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…",0.8,,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,0.5,"""{  ""Reasoning"": ""The response 'Ja, als jemand …","""{  ""Reasoning"": ""The response is asking for mo…",0.25,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.5{  ""Reasoning"": ""The re…",1.0,0.0,0.571429,0.0,1.0,0.0,"""Ja, als jemand im Alter von 80 Jahren gehören Sie…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response does not incorpo…",1.0,0.0,"""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Kann ich mich auf Borkum gegen Corona impfen lass…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…",0.0,,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,1.0,"""{  ""Reasoning"": ""The response 'Ja, Sie können …","""{  ""Reasoning"": ""The response is not providing…",0.0,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,0.0,0.0,0.0,0.166667,0.0,"""Ja, Sie können sich auf Borkum gegen Corona impfe…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""{  ""Reasoning"": ""The context does not provide …","""{  ""Reasoning"": ""The context provided does not…",0.0,0.0,"""Information Recall: 0.16666666666666666{  ""Res…","""Information Recall: 0.0{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Wo erhalte ich weitere Informationen zur Corona-I…","""Informationen zur Impfung können vorab z.B. unter…","""Informationen zur Impfung können vorab z.B. unter…",1.0,,"""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response does not contain…",1.0,0.0,1.0,1.0,"""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response 'Ihre Anfrage ka…",1.0,,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…",1.0,0.0,0.756757,0.0,0.7,0.0,"""Für weitere Informationen zur Corona-Impfung könn…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides multiple…",1.0,0.0,"""Information Recall: 0.7{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,"""{  ""Reasoning"": ""The response provides informa…",0.0,"""{  ""Reasoning"": ""The response is not adhering …"


In [252]:
import os
import shutil
from datetime import datetime


def backup_and_save_df(df, file_path, file_type='csv'):
    """Save a Polars DataFrame, backing up any existing file at file_path first."""
    if file_type not in ('csv', 'jsonl'):
        raise ValueError(f"Unsupported file_type: {file_type!r}")

    # Earlier results are kept as timestamped copies in a sibling 'backups' folder.
    backup_dir = os.path.join(os.path.dirname(file_path), 'backups')
    os.makedirs(backup_dir, exist_ok=True)

    # Copy the existing file aside before it is overwritten.
    if os.path.exists(file_path):
        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_filename = os.path.basename(file_path) + f".backup-{timestamp}"
        backup_path = os.path.join(backup_dir, backup_filename)
        shutil.copy(file_path, backup_path)

    if file_type == 'csv':
        print(f"Saving DataFrame to CSV at: {file_path}")
        df.write_csv(file_path)
    else:
        print(f"Saving DataFrame to NDJSON at: {file_path}")
        df.write_ndjson(file_path)


backup_and_save_df(res_df, jsonl_file_path, 'jsonl')
backup_and_save_df(res_df, csv_file_path, 'csv')

Saving DataFrame to NDJSON at: ./results/gpt_35_vs_luminous_supreme_experiment.jsonl
Saving DataFrame to CSV at: ./results/gpt_35_vs_luminous_supreme_experiment.csv


In [244]:
def display_average_scores(df):
    """Return one row per (model, metric) with the mean of each score column."""
    # Score columns are named 'score_<metric>_model_<model name>'.
    score_columns = [col for col in df.columns if 'score' in col and '_model_' in col]
    data_for_table = []

    for column in score_columns:
        # Drop NaN values before averaging; mean() already skips nulls.
        average = df[column].drop_nans().mean()

        # Split once so the model name is kept whole.
        metric_part, model_name = column.split('_model_', 1)
        metric_name = metric_part.replace('score_', '').replace('_', ' ').capitalize()

        data_for_table.append({
            "Model": model_name,
            "Metric": metric_name,
            "Average Score": average
        })

    return pl.DataFrame(data_for_table)
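
The column-name parsing inside `display_average_scores` can be tried on its own. The following is a minimal standalone sketch, assuming every score column follows the `score_<metric>_model_<model name>` pattern seen in the result table; the helper name `split_score_column` is hypothetical and not part of the notebook.

```python
def split_score_column(column: str) -> tuple[str, str]:
    """Split 'score_<metric>_model_<model>' into (metric label, model name)."""
    # partition() splits at the first '_model_' only, so the model name stays whole.
    prefix, sep, model_name = column.partition("_model_")
    if not sep or not prefix.startswith("score_"):
        raise ValueError(f"unexpected column name: {column!r}")
    # Strip the 'score_' prefix and turn the rest into a readable label.
    metric = prefix[len("score_"):].replace("_", " ").capitalize()
    return metric, model_name


print(split_score_column("score_factual_accuracy_model_gpt-35-turbo-16k-deployment"))
```

Raising on an unexpected name is a design choice of this sketch: it surfaces columns that do not fit the pattern instead of failing later with an index error.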

In [245]:
pl.Config.set_tbl_rows(32)
pl.Config.set_fmt_str_lengths(50)
display_average_scores(res_df)

Model,Metric,Average Score
str,str,f64
"""gpt-35-turbo-16k-deployment""","""Factual accuracy""",0.664506
"""luminous-supreme-control""","""Factual accuracy""",0.482143
"""gpt-35-turbo-16k-deployment""","""Valid response""",0.8
"""luminous-supreme-control""","""Valid response""",0.6
"""gpt-35-turbo-16k-deployment""","""Response conciseness""",0.914286
"""luminous-supreme-control""","""Response conciseness""",0.842857
"""gpt-35-turbo-16k-deployment""","""Response match precision""",0.529938
"""luminous-supreme-control""","""Response match precision""",0.347619
"""gpt-35-turbo-16k-deployment""","""Response relevance""",0.733333
"""luminous-supreme-control""","""Response relevance""",0.538095
