<h1 align="center">
  <a href="https://www.nlga.niedersachsen.de/startseite">
    <img width="300" src="https://www.nlga.niedersachsen.de/assets/image/246974" alt="NLGA">
  </a>
</h1>

## Experimenting with different LLMs - Aleph Alpha Luminous-supreme-control vs Mixtral-8x7B-Instruct-v0.1

In [11]:
# define configuration for dynamic experiment handling
EXPERIMENT_NAME = "Mixtral-8x7B-Instruct-v0.1-vs-Luminous-Supreme"

**Overview**: In this notebook, we compare two LLMs: Mixtral-8x7B-Instruct-v0.1 (served through Together.ai) and Aleph Alpha's luminous-supreme-control. We use around 35 example questions from the [Testfragen](https://secure-confluence.nortal.com/display/NLGAC/Testfragen) dataset and evaluate each response against several criteria to determine which of the two models performs better.

We have used the following metrics from UpTrain's library:

1. [Response Conciseness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-conciseness): Evaluates how concise the generated response is or if it has any additional irrelevant information for the question asked.

2. [Response Matching](https://docs.uptrain.ai/predefined-evaluations/ground-truth-comparison/response-matching): Evaluates how well the response generated by the LLM aligns with the provided ground truth.

3. [Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy): Evaluates whether the response generated is factually correct and grounded by the provided context.

4. [Context Utilization](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-utilization): Evaluates how completely the generated response answers the question, given the information provided in the context. Also known as Response Completeness with respect to context (`RESPONSE_COMPLETENESS_WRT_CONTEXT`).

5. [Response Relevance](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-relevance): Evaluates how relevant the generated response is to the question asked.

6. **Response Validity**: An LLM may fail to generate a response, for example because of limited knowledge or an unclear question. The Response Validity score identifies these cases, where the model does not give an informative response. For example, if the question is "What is the chemical formula of chlorophyll?", a valid response would be "The formula for chlorophyll is C55H72O5N4Mg." An invalid response would be "Sorry, I have no idea about that."

7. **[Guideline Adherence](https://github.com/uptrain-ai/uptrain/blob/main/examples/checks/custom/guideline_adherence.ipynb)**: Measures the extent to which the LLM follows a given guideline, rule, or protocol. Because LLMs are complex, it is important to define guidelines for the structure of the output, the constraints on its content, or the decision-making of LLM agents. For example, for an LLM-powered chatbot agent trained to perform appointment booking tasks only, you want to make sure that the LLM follows the guideline: "The agent should redirect all the queries to the human agent, except the ones related to appointment booking."

Each score has a value between 0 and 1. 

The complete list of UpTrain's supported metrics is available [here](https://docs.uptrain.ai/predefined-evaluations/overview).

### Install Dependencies

In [12]:
# %pip install openai uptrain together lazy_loader fsspec pandas polars networkx pydantic_settings aiolimiter

In [47]:
import os
from dotenv import load_dotenv
import polars as pl 
import shutil
from datetime import datetime
import time
from openai import OpenAI

### Authentication and Configuration

Let's define the required API keys: the Together API key (for generating responses) and the Azure OpenAI API key (for evaluating the responses).
Please also ensure that the dataset path is correctly defined in the configuration.
Do not forget to set `API_KEY` and `BASE_URL` for the LLM API endpoint provider.
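
A missing key usually only surfaces later as an authentication error, so it can help to check the environment up front. The following is a minimal sketch, assuming the four variable names that the configuration cell reads via `os.getenv`; the helper `find_missing_env_vars` is hypothetical and not part of the notebook.

```python
import os

# Variable names mirror the os.getenv() calls in the configuration cell.
REQUIRED_ENV_VARS = [
    "TOGETHER_API_KEY",
    "AZURE_OPENAI_API_KEY",
    "AZURE_API_VERSION",
    "AZURE_API_BASE",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment.

    Hypothetical helper for illustration; `environ` defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Run this after `load_dotenv()` so that values from the `.env` file are already present in the environment.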

In [14]:
# Load the environment variables from the .env file
load_dotenv()

CONFIG = {
    # Together.ai API key and base URL
    "API_KEY": os.getenv("TOGETHER_API_KEY"),
    "BASE_URL": "https://api.together.xyz/v1",
    # The model name used to generate responses
    "GENERATE_MODEL_NAME": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "AA_MODEL_NAME": "luminous-supreme-control",
    # Guideline name used in the Guideline Adherence check
    "GUIDELINE_NAME": "Strict_Context",
    # dataset path
    # "DATASET_PATH": "nlga_dataset_AA_small.jsonl",
    "DATASET_PATH": "./nlga_dataset_AA.jsonl",
    "RESULTS_DIR": "./results/",
    "AZURE_OPENAI_API_KEY": os.getenv("AZURE_OPENAI_API_KEY"),
    "AZURE_API_VERSION": os.getenv("AZURE_API_VERSION"),
    "AZURE_API_BASE": os.getenv("AZURE_API_BASE"),
    # Azure deployments:
    "GPT_35_TURBO_16K": "gpt-35-turbo-16k-deployment",
    "GPT_4": "gpt4",
}

# Azure deployment used to evaluate 
EVAL_MODEL_NAME = "azure/gpt4"
# EVAL_MODEL_NAME = "azure/gpt35-16k"

def get_experiment_file_path(extension):
    filename = f"{EXPERIMENT_NAME.replace(' ', '_').replace('-', '_').lower()}_experiment.{extension}"
    return os.path.join(CONFIG['RESULTS_DIR'], filename)

jsonl_file_path = get_experiment_file_path('jsonl')
csv_file_path = get_experiment_file_path('csv')
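
To make the naming scheme concrete, here is a small self-contained sketch that applies the same normalisation as `get_experiment_file_path` to the experiment name used in this notebook. The function `experiment_file_path` is a hypothetical stand-in that takes its inputs as explicit arguments instead of reading the globals.

```python
import os

# Values copied from the notebook cells above.
EXPERIMENT_NAME = "Mixtral-8x7B-Instruct-v0.1-vs-Luminous-Supreme"
RESULTS_DIR = "./results/"


def experiment_file_path(name, results_dir, extension):
    """Same normalisation as get_experiment_file_path, with explicit arguments."""
    filename = f"{name.replace(' ', '_').replace('-', '_').lower()}_experiment.{extension}"
    return os.path.join(results_dir, filename)


print(experiment_file_path(EXPERIMENT_NAME, RESULTS_DIR, "jsonl"))
# ./results/mixtral_8x7b_instruct_v0.1_vs_luminous_supreme_experiment.jsonl
```

Note that the dot in `v0.1` is kept, so the file name contains a dot in addition to the one before the extension.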

In [15]:
# Utility functions for API and file operations
def initialize_openai_client():
    return OpenAI(
        api_key=CONFIG["API_KEY"],
        base_url=CONFIG["BASE_URL"]
    )

def ensure_directory_exists(path):
    if not os.path.exists(path):
        os.makedirs(path)

def read_dataset(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"The specified dataset path does not exist: {path}")
    return pl.read_ndjson(path)
# dataset = pl.read_ndjson(dataset_path).select(pl.col(["question", "ground_truth", "context"]))

### Load the testing dataset

In [16]:
ensure_directory_exists(CONFIG['RESULTS_DIR'])
dataset = read_dataset(CONFIG['DATASET_PATH'])
filtered_dataset = dataset.filter(dataset["idx"] > 100)
filtered_dataset

idx,question,response,ground_truth,context
i64,str,str,str,str
104,"""Welche Erreger…","""Armin erfasst …","""* Gramnegative…","""[Doc Nr. 1] Je…"
106,"""Wann sind Zeck…","""Zecke sind in …","""Zecken werden …","""[Doc Nr. 1] Er…"
300,"""Wer ist Cristi…","""Cristiano Rona…","""Diese Anfrage …","""[Doc Nr. 1] ER…"
301,"""Wie funktionie…","""Die Frage nach…","""Diese Anfrage …","""[Doc Nr. 1] Je…"
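
The later cells rely on every record carrying the five columns shown above. As a rough sketch, a JSONL file could be checked for missing keys before it is loaded; this uses only the standard library, and `find_invalid_records` as well as the sample records are hypothetical illustrations, not part of the dataset.

```python
import json

# Column names taken from the table above.
REQUIRED_KEYS = {"idx", "question", "response", "ground_truth", "context"}


def find_invalid_records(lines):
    """Return (line_number, missing_keys) for every JSONL line lacking a required key."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((line_number, sorted(missing)))
    return problems


# Illustrative records, not taken from the real dataset.
sample = [
    json.dumps({"idx": 1, "question": "q", "response": "r", "ground_truth": "g", "context": "c"}),
    json.dumps({"idx": 2, "question": "q"}),
]
print(find_invalid_records(sample))
# [(2, ['context', 'ground_truth', 'response'])]
```

For a real file, the lines can be passed in directly with `open(path, encoding="utf-8")`.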


### Let's define a prompt to generate responses

In [17]:
SYSTEM_PROMPT = """### INSTRUKTIONEN
Generiere bitte eine ANTWORT, die sich strikt an den gegebenen KONTEXT hält und präzise auf die gestellte FRAGE antwortet, ohne eigene Informationen des Modells hinzuzufügen. Falls die benötigte Information nicht im KONTEXT zu finden ist, antworte mit: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Vermeide Bezüge auf vorherige Ausgaben des Modells. Die Antwort soll auf dem bereitgestellten KONTEXT basieren. Sollte die FRAGE nicht direkt einem gesundheitsbezogenen Thema zuzuordnen sein oder nicht klar zu beantworten sein, erkläre kurz, warum die Anfrage nicht beantwortet werden kann und empfehle eine genauere Formulierung oder zusätzliche Informationen.

### KONTEXT
{context}"""

In [18]:
client = initialize_openai_client()

def get_response(row, model):
    question = row['question'][0]
    context = row['context'][0]
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
            {"role": "user", "content": question},
            # {"role": "assistant", "content": "Example answer"},
            # {"role": "user", "content": "First question/message for the model to actually respond to."}
        ]
    ).choices[0].message.content
    
    return {'question': question, 'context': context, 'response': response, 'ground_truth': row['ground_truth'][0], 'model': model}

In [19]:
def format_response(row):
    question = row['question'][0]
    context = row['context'][0]
    response = row['response'][0]
    ground_truth = row['ground_truth'][0]
    model = CONFIG["AA_MODEL_NAME"]

    return {'question': question, 'context': context, 'response': response, 'ground_truth': ground_truth, 'model': model}

### Generate responses for both the models

In [20]:
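start_time = time.time()
# Generate Mixtral responses through the Together.ai endpoint
results_cmodel = [get_response(dataset[idx], CONFIG["GENERATE_MODEL_NAME"]) for idx in range(len(dataset))]
# The Luminous responses are already stored in the dataset's `response` column, so they are only reformatted
results_aleph_alpha = [format_response(dataset[idx]) for idx in range(len(dataset))]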

In [None]:
print(f"Total execution time: {time.time() - start_time:.2f} seconds")

In [21]:
pl.Config.set_fmt_str_lengths(50)
df = pl.DataFrame(results_cmodel)
df

question,context,response,ground_truth,model
str,str,str,str,str
"""Wie kann ich mich vor Legionellen schützen?""","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …",""" Um sich vor Legionellen zu schützen, können Sie …","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
"""Warum heißen Legionellen so?""","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…",""" Der Name Legionellen geht auf das Jahr 1976 zurü…","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
"""Was sind Anzeichen für Borreliose?""","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…",""" Die Anzeichen für Borreliose können an der Stich…","""Wanderröte in etwa 50 % der Fälle: nach einigen…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
"""Sind die Lyme-Krankheit und Borreliose das selbe?…","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …",""" Ja, die Lyme-Krankheit und Borreliose sind dasse…","""Ja. (die durch Bakterien verursachte Borreliose (…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
"""Schützt die schnelle Entfernung einer Zecke vor F…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…",""" Nein, die schnelle Entfernung einer Zecke bietet…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
…,…,…,…,…
"""Gibt es eine Impfung gegen Genitalwarzen?""","""[Doc Nr. 1] Bei Infektionen beobachtet man zwei u…",""" Based on the provided context, there is no direc…","""Ja. Humane Papillomviren (HPV) können sowohl Haut…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
"""Was sind Nebenwirkungen einer Polio-Impfung?""","""[Doc Nr. 1] ERKRANKUNG IMPFUNG11Als mögliche Nebe…",""" Laut dem bereitgestellten Kontext können möglich…","""Als mögliche Nebenwirkungen können Reaktionen an …","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
"""Wie wird Borreliose behandelt?""","""[Doc Nr. 1] Insbesondere fortgeschrittene Stadien…",""" Laut dem bereitgestellten Kontext wird Borrelios…","""Die akute Borrelieninfektion kann in vielen Fälle…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""
"""Wer ist Cristiano Ronaldo?""","""[Doc Nr. 1] ERKRANKUNG IMPFUNG21Als mögliche Nebe…",""" Ihre Anfrage kann nicht mit den bereitgestellten…","""Diese Anfrage liegt außerhalb des gesundheitsbezo…","""mistralai/Mixtral-8x7B-Instruct-v0.1"""


### Evaluating Experiments using UpTrain

UpTrain's `EvalLLM` provides an `evaluate_experiments` method, which takes the data to be evaluated, the list of checks to run, and the names of the columns that distinguish the experiments (here, `model`). The method returns a list of dictionaries containing the results of the evaluation.

In [34]:
from uptrain import EvalLLM, Evals, ResponseMatching, Settings

import nest_asyncio
nest_asyncio.apply()

start_time = time.time()

settings = Settings(
    model=EVAL_MODEL_NAME,
    azure_api_key=CONFIG["AZURE_OPENAI_API_KEY"],
    azure_api_version=CONFIG["AZURE_API_VERSION"],
    azure_api_base=CONFIG["AZURE_API_BASE"],
)
eval_llm = EvalLLM(settings)

res = eval_llm.evaluate_experiments(
    project_name=f"{EXPERIMENT_NAME}-Experiments",
    data=results_cmodel + results_aleph_alpha,
    checks=[
        Evals.RESPONSE_CONCISENESS,
        ResponseMatching(method='llm'),  # Comment this out if you don't have ground truth
        Evals.RESPONSE_COMPLETENESS_WRT_CONTEXT,
        Evals.FACTUAL_ACCURACY,
        Evals.RESPONSE_RELEVANCE,
        Evals.VALID_RESPONSE,
    ],
    exp_columns=['model'],
)

 60%|██████    | 42/70 [00:21<00:11,  2.51it/s]
2024-04-18 21:23:28.893 | ERROR    | uptrain.operators.language.llm:async_process_payload:103 - Error when sending request to LLM API: Error code: 429 - {'error': {'code': '429', 'message': 'Requests to the ChatCompletions_Create Operation under Azure OpenAI API version 2024-03-01-preview have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 2 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit.'}}
2024-04-18 21:23:28.893 | INFO     | uptrain.operators.language.llm:async_process_payload:130 - Going to sleep before retrying for payload 49
2024-04-18 21:23:28.915 | ERROR    | uptrain.operators.language.llm:async_process_payload:103 - Error when sending request to LLM API: 

In [None]:
print(f"Total evaluation time: {time.time() - start_time:.2f} seconds")
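
The log above shows HTTP 429 rate-limit errors from the Azure endpoint, which UpTrain handles by sleeping and retrying. The same idea can be applied to your own API calls (for example the generation step). The following is a generic sketch of exponential backoff with jitter; `RateLimitError` and `call_with_backoff` are hypothetical names, and a real client library raises its own exception type for a 429.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 error; real client libraries define their own."""


def call_with_backoff(func, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Wait 2s, 4s, 8s, ... plus up to 1s of jitter to spread out retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies. The 2-second base delay follows the "retry after 2 seconds" hint in the error message and is otherwise an arbitrary choice.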

In [36]:
res_df = pl.DataFrame(res)
res_df

question,score_response_match_recall_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_match_recall_model_luminous-supreme-control,explanation_response_conciseness_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_conciseness_model_luminous-supreme-control,score_response_match_precision_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_match_precision_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_completeness_wrt_context_model_luminous-supreme-control,explanation_response_relevance_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_relevance_model_luminous-supreme-control,explanation_valid_response_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_valid_response_model_luminous-supreme-control,score_valid_response_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_valid_response_model_luminous-supreme-control,score_response_match_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_match_model_luminous-supreme-control,context_model_mistralai/Mixtral-8x7B-Instruct-v0.1,context_model_luminous-supreme-control,score_factual_accuracy_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_factual_accuracy_model_luminous-supreme-control,ground_truth_model_mistralai/Mixtral-8x7B-Instruct-v0.1,ground_truth_model_luminous-supreme-control,response_model_mistralai/Mixtral-8x7B-Instruct-v0.1,response_model_luminous-supreme-control,explanation_response_matching_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_matching_model_luminous-supreme-control,score_response_conciseness_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_conciseness_model_luminous-supreme-control,score_response_relevance_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_relevance_model_luminous-supreme-control,score_response_completeness_wrt_context_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_completeness_wrt_context_model_luminous-supreme-control,explanation_factual_accuracy_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_factual_accuracy_model_luminous-supreme-control
str,f64,f64,str,str,f64,f64,str,str,str,str,str,str,f64,f64,f64,f64,str,str,f64,f64,str,str,str,str,str,str,f64,f64,f64,f64,f64,f64,str,str
"""Wie kann ich mich vor Legionellen schützen?""",1.0,0.0,"""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…",0.3,,"""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The context provides detailed…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.631579,0.0,"""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …",1.0,,"""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…",""" Um sich vor Legionellen zu schützen, können Sie …","""'Ihre Anfrage kann nicht mit den bereitgestellten…","""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,0.0,1.0,0.0,1.0,0.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }"""
"""Warum heißen Legionellen so?""",0.875,0.75,"""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response accurately expla…",0.75,0.8,"""{  ""Reasoning"": ""The response accurately expla…","""{  ""Reasoning"": ""The response correctly identi…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides informa…",1.0,1.0,0.84,0.761905,"""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…",0.6,0.6,"""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…",""" Der Name Legionellen geht auf das Jahr 1976 zurü…","""Der Name Legionellen leitet sich von der Legionel…","""Information Recall: 0.875{  ""Result"": [  …","""Information Recall: 0.75{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…"
"""Was sind Anzeichen für Borreliose?""",0.5,0.5,"""{  ""Reasoning"": ""The response provides a compr…","""{  ""Reasoning"": ""The response provides a detai…",1.0,1.0,"""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The response accurately captu…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,0.571429,0.571429,"""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…",1.0,1.0,"""Wanderröte in etwa 50 % der Fälle: nach einigen…","""Wanderröte in etwa 50 % der Fälle: nach einigen…",""" Die Anzeichen für Borreliose können an der Stich…","""Die Anzeichen für Lyme-Borreliose können sich in …","""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.5{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…"
"""Sind die Lyme-Krankheit und Borreliose das selbe?…",1.0,1.0,"""{  ""Reasoning"": ""The response accurately answe…","""{  ""Reasoning"": ""The response accurately answe…",0.3,0.4,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response correctly identi…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,0.631579,0.727273,"""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …",0.4,0.8,"""Ja. (die durch Bakterien verursachte Borreliose (…","""Ja. (die durch Bakterien verursachte Borreliose (…",""" Ja, die Lyme-Krankheit und Borreliose sind dasse…","""Die Lyme Disease und die Borreliosis sind das sel…","""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 1.0{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…"
"""Schützt die schnelle Entfernung einer Zecke vor F…",0.75,0.5,"""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response directly address…",0.5,0.333333,"""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The response correctly incorp…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides informa…",1.0,1.0,0.666667,0.444444,"""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…",1.0,1.0,"""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…",""" Nein, die schnelle Entfernung einer Zecke bietet…","""Die schnelle Entfernung einer angehefteten Zecke …","""Information Recall: 0.75{  ""Result"": [  …","""Information Recall: 0.5{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…"
"""Ist FSME tödlich?""",0.375,0.0,"""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response does not provide…",0.25,,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response correctly states…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.333333,0.0,"""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…",0.9,,"""Bei einem großen Teil der Infizierten (> 70%) und…","""Bei einem großen Teil der Infizierten (> 70%) und…",""" Ja, FSME (Frühsommer-Meningoenzephalitis) kann t…","""Ihre Anfrage kann nicht mit dem bereitgestellten …","""Information Recall: 0.375{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,1.0,1.0,0.0,1.0,0.5,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }"""
"""Welche Daten muss eine Bescheinigung über einen d…",1.0,0.8,"""{  ""Reasoning"": ""The response provides a compr…","""{  ""Reasoning"": ""The response provides a compr…",1.0,1.0,"""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response correctly includ…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,1.0,0.842105,"""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…",1.0,1.0,"""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…",""" Laut dem Kontext muss eine Bescheinigung über ei…","""Die Bescheinigung über einen durchgeführten Coron…","""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.8{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…"
"""Ich bin 80. Kann ich demnächst geimpft werden?""",0.5,0.0,"""{  ""Reasoning"": ""The response provides relevan…","""{  ""Reasoning"": ""The response does not provide…",0.333333,,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response does not incorpo…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.444444,0.0,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",1.0,,"""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…",""" Gemäß dem bereitgestellten Kontext können Person…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,1.0,1.0,0.0,1.0,0.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }"""
"""Kann ich mich auf Borkum gegen Corona impfen lass…",0.5,0.0,"""{  ""Reasoning"": ""The response is relevant to t…","""{  ""Reasoning"": ""The response does not provide…",0.5,,"""{  ""Reasoning"": ""The response correctly states…","""{  ""Reasoning"": ""The context does not provide …","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response is not providing…",1.0,0.0,0.5,0.0,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",,,"""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…",""" Auf Grundlage des bereitgestellten Kontextes kan…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,0.0,0.666667,0.0,1.0,0.0,"""{  ""Result"": [] }""","""{  ""Result"": [] }"""
"""Wo erhalte ich weitere Informationen zur Corona-I…",0.8,0.0,"""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response does not provide…",1.0,,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides multiple…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.842105,0.0,"""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…",0.625,,"""Informationen zur Impfung können vorab z.B. unter…","""Informationen zur Impfung können vorab z.B. unter…",""" Sie können weitere Informationen zur Corona-Impf…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""Information Recall: 0.8{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,0.0,1.0,0.0,1.0,0.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }"""

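Several score cells in the table above are empty, for example the precision and factual-accuracy scores where Luminous declined to answer. Any per-model summary therefore has to skip missing values. Below is a rough sketch of averaging every `score_*` field over a list of result dictionaries such as `res`; the function `average_scores` and the shortened column names in the sample rows are hypothetical.

```python
import math
from collections import defaultdict


def average_scores(rows):
    """Average every `score_*` field across rows, skipping None and NaN values."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        for key, value in row.items():
            if not key.startswith("score_"):
                continue
            if value is None or (isinstance(value, float) and math.isnan(value)):
                continue
            totals[key][0] += value
            totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Illustrative rows with shortened, made-up column names.
rows = [
    {"question": "q1", "score_relevance_model_a": 1.0, "score_relevance_model_b": 0.0},
    {"question": "q2", "score_relevance_model_a": 0.5, "score_relevance_model_b": None},
]
print(average_scores(rows))
# {'score_relevance_model_a': 0.75, 'score_relevance_model_b': 0.0}
```

Skipping missing values means each average is taken over a different number of questions, which is worth keeping in mind when comparing the two models.
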

### Adding Guideline Adherence evaluations

In [37]:
guideline = "The response must strictly adhere to the provided context and not introduce external information. If the necessary information is absent from the context, respond with: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Should the question fall outside the health-related jurisdiction of the Landesgesundheitsamt Niedersachsen, it means the query is beyond the health-related scope and shouldn't be answered."

In [38]:
data_cmodel_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_cmodel]
data_aleph_alpha_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_aleph_alpha]

In [39]:
from uptrain import GuidelineAdherence

def run_guideline_adherence_eval(data, guideline_name):
    return eval_llm.evaluate(
        data=data,
        checks=[GuidelineAdherence(guideline=guideline, guideline_name=guideline_name)]
    )

res_guideline_cmodel = run_guideline_adherence_eval(data_cmodel_for_guideline_eval, CONFIG["GUIDELINE_NAME"])
res_guideline_aleph_alpha = run_guideline_adherence_eval(data_aleph_alpha_for_guideline_eval, CONFIG["GUIDELINE_NAME"])

100%|██████████| 35/35 [00:16<00:00,  2.11it/s]
2024-04-18 21:37:43.319 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
100%|██████████| 35/35 [00:15<00:00,  2.19it/s]
2024-04-18 21:38:05.169 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!


In [40]:
def update_guidelines(guideline_name, res_guidelines, config_model_name):
    """Rename the guideline score and explanation keys so they carry the model name."""
    DEFAULT_SCORE = float("nan")
    DEFAULT_EXPLANATION = "No data available"
    score_name = 'score_' + guideline_name + '_adherence'
    explanation_name = 'explanation_' + guideline_name + '_adherence'
    score_key = score_name + '_model_' + config_model_name
    explanation_key = explanation_name + '_model_' + config_model_name

    for f in res_guidelines:
        # Remember whether the score was missing: NaN never compares equal to
        # itself, so `f[score_key] == DEFAULT_SCORE` would always be False.
        score_missing = score_name not in f
        f[score_key] = f.pop(score_name, DEFAULT_SCORE)

        if explanation_name in f:
            f[explanation_key] = f.pop(explanation_name)
        elif score_missing:
            f[explanation_key] = DEFAULT_EXPLANATION

    return res_guidelines

In [41]:
res_guideline_cmodel = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_cmodel, CONFIG["GENERATE_MODEL_NAME"])
res_guideline_aleph_alpha = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_aleph_alpha, CONFIG["AA_MODEL_NAME"])

In [42]:
def merge_lists(base_list, update_list):
    update_dict = {item['question']: item for item in update_list if 'question' in item}
    
    for item in base_list:
        question = item.get('question')
        if question and question in update_dict:
            # print(f"updating with {question}")            
            update_info = {key: val for key, val in update_dict[question].items() if key != 'response'}
            item.update(update_info)
    return base_list

res = merge_lists(res, res_guideline_cmodel)
res = merge_lists(res, res_guideline_aleph_alpha)
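
To illustrate what this merge does, here is a small self-contained sketch on toy data. `merge_by_question` is a hypothetical, condensed variant of `merge_lists` with the same behaviour for rows that contain a `question` key; the rows and column names are made up.

```python
def merge_by_question(base_list, update_list):
    """Copy every field except 'response' from update_list into the matching base row."""
    updates = {item["question"]: item for item in update_list if "question" in item}
    for item in base_list:
        extra = updates.get(item.get("question"), {})
        item.update({key: val for key, val in extra.items() if key != "response"})
    return base_list


# Toy rows, not taken from the real results.
base = [{"question": "q1", "score_x": 1.0}, {"question": "q2", "score_x": 0.5}]
update = [{"question": "q1", "response": "r", "score_guideline": 1.0}]
merged = merge_by_question(base, update)
print(merged)
# [{'question': 'q1', 'score_x': 1.0, 'score_guideline': 1.0}, {'question': 'q2', 'score_x': 0.5}]
```

Because rows are matched on the question text, two dataset entries with an identical question would share one update entry; rows without a match are left unchanged.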

### Creating Dataframe and displaying Average Score

In [43]:
res_df = pl.DataFrame(res)
res_df

question,score_response_match_recall_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_match_recall_model_luminous-supreme-control,explanation_response_conciseness_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_conciseness_model_luminous-supreme-control,score_response_match_precision_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_match_precision_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_completeness_wrt_context_model_luminous-supreme-control,explanation_response_relevance_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_relevance_model_luminous-supreme-control,explanation_valid_response_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_valid_response_model_luminous-supreme-control,score_valid_response_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_valid_response_model_luminous-supreme-control,score_response_match_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_match_model_luminous-supreme-control,context_model_mistralai/Mixtral-8x7B-Instruct-v0.1,context_model_luminous-supreme-control,score_factual_accuracy_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_factual_accuracy_model_luminous-supreme-control,ground_truth_model_mistralai/Mixtral-8x7B-Instruct-v0.1,ground_truth_model_luminous-supreme-control,response_model_mistralai/Mixtral-8x7B-Instruct-v0.1,response_model_luminous-supreme-control,explanation_response_matching_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_response_matching_model_luminous-supreme-control,score_response_conciseness_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_conciseness_model_luminous-supreme-control,score_response_relevance_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_relevance_model_luminous-supreme-control,score_response_completeness_wrt_context_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_response_completeness_wrt_context_model_luminous-supreme-control,explanation_factual_accuracy_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_factual_accuracy_model_luminous-supreme-control,score_Strict_Context_adherence_model_mistralai/Mixtral-8x7B-Instruct-v0.1,explanation_Strict_Context_adherence_model_mistralai/Mixtral-8x7B-Instruct-v0.1,score_Strict_Context_adherence_model_luminous-supreme-control,explanation_Strict_Context_adherence_model_luminous-supreme-control
str,f64,f64,str,str,f64,f64,str,str,str,str,str,str,f64,f64,f64,f64,str,str,f64,f64,str,str,str,str,str,str,f64,f64,f64,f64,f64,f64,str,str,f64,str,f64,str
"""Wie kann ich mich vor Legionellen schützen?""",1.0,0.0,"""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…",0.3,,"""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The context provides detailed…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.631579,0.0,"""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …","""[Doc Nr. 1] Legionellen Vorkommen und Vermeidung …",1.0,,"""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…","""Kaltes Wasser muss kalt sein, d.h. < 20 °C Warm…",""" Um sich vor Legionellen zu schützen, können Sie …","""'Ihre Anfrage kann nicht mit den bereitgestellten…","""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,0.0,1.0,0.0,1.0,0.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,"""{  ""Reasoning"": ""The given LLM response adhere…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Warum heißen Legionellen so?""",0.875,0.75,"""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response accurately expla…",0.75,0.8,"""{  ""Reasoning"": ""The response accurately expla…","""{  ""Reasoning"": ""The response correctly identi…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides informa…",1.0,1.0,0.84,0.761905,"""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…","""[Doc Nr. 1] Jedoch sind kranke und ältere Mensche…",0.6,0.6,"""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…","""Im Juni 1976 erkrankten 221 Teilnehmer eines Tref…",""" Der Name Legionellen geht auf das Jahr 1976 zurü…","""Der Name Legionellen leitet sich von der Legionel…","""Information Recall: 0.875{  ""Result"": [  …","""Information Recall: 0.75{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,"""{  ""Reasoning"": ""The given LLM response strict…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Was sind Anzeichen für Borreliose?""",0.5,0.5,"""{  ""Reasoning"": ""The response provides a compr…","""{  ""Reasoning"": ""The response provides a detai…",1.0,1.0,"""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The response accurately captu…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,0.571429,0.571429,"""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…","""[Doc Nr. 1] Risiko Zeckenstich Borreliose Was sin…",1.0,1.0,"""Wanderröte in etwa 50 % der Fälle: nach einigen…","""Wanderröte in etwa 50 % der Fälle: nach einigen…",""" Die Anzeichen für Borreliose können an der Stich…","""Die Anzeichen für Lyme-Borreliose können sich in …","""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.5{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Sind die Lyme-Krankheit und Borreliose das selbe?…",1.0,1.0,"""{  ""Reasoning"": ""The response accurately answe…","""{  ""Reasoning"": ""The response accurately answe…",0.3,0.4,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response correctly identi…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,0.631579,0.727273,"""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …","""[Doc Nr. 1] Dieses Erkrankungsbild besteht in 15 …",0.4,0.8,"""Ja. (die durch Bakterien verursachte Borreliose (…","""Ja. (die durch Bakterien verursachte Borreliose (…",""" Ja, die Lyme-Krankheit und Borreliose sind dasse…","""Die Lyme Disease und die Borreliosis sind das sel…","""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 1.0{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Schützt die schnelle Entfernung einer Zecke vor F…",0.75,0.5,"""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response directly address…",0.5,0.333333,"""{  ""Reasoning"": ""The response accurately incor…","""{  ""Reasoning"": ""The response correctly incorp…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response provides informa…",1.0,1.0,0.666667,0.444444,"""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…","""[Doc Nr. 1] Risiko Zeckenstich FSME Schutz vor FS…",1.0,1.0,"""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…","""Nein/kaum. Im Gegensatz zu der Infektion mit Borr…",""" Nein, die schnelle Entfernung einer Zecke bietet…","""Die schnelle Entfernung einer angehefteten Zecke …","""Information Recall: 0.75{  ""Result"": [  …","""Information Recall: 0.5{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,"""{  ""Reasoning"": ""The given response adheres to…",1.0,"""{  ""Reasoning"": ""The given LLM response adhere…"
"""Ist FSME tödlich?""",0.375,0.0,"""{  ""Reasoning"": ""The response directly answers…","""{  ""Reasoning"": ""The response does not provide…",0.25,,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response correctly states…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.333333,0.0,"""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…","""[Doc Nr. 1] www.nlga.niedersachsen.de FSME-Erkran…",0.9,,"""Bei einem großen Teil der Infizierten (> 70%) und…","""Bei einem großen Teil der Infizierten (> 70%) und…",""" Ja, FSME (Frühsommer-Meningoenzephalitis) kann t…","""Ihre Anfrage kann nicht mit dem bereitgestellten …","""Information Recall: 0.375{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,1.0,1.0,0.0,1.0,0.5,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,"""{  ""Reasoning"": ""The given LLM response strict…",1.0,"""{  ""Reasoning"": ""The given LLM response strict…"
"""Welche Daten muss eine Bescheinigung über einen d…",1.0,0.8,"""{  ""Reasoning"": ""The response provides a compr…","""{  ""Reasoning"": ""The response provides a compr…",1.0,1.0,"""{  ""Reasoning"": ""The response accurately captu…","""{  ""Reasoning"": ""The response correctly includ…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides detaile…","""{  ""Reasoning"": ""The response provides detaile…",1.0,1.0,1.0,0.842105,"""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…","""[Doc Nr. 1] (2) 1Die Person, die den Test gemäß A…",1.0,1.0,"""Die Bescheinigung muss Name, Vorname, Geburtsdatu…","""Die Bescheinigung muss Name, Vorname, Geburtsdatu…",""" Laut dem Kontext muss eine Bescheinigung über ei…","""Die Bescheinigung über einen durchgeführten Coron…","""Information Recall: 1.0{  ""Result"": [  …","""Information Recall: 0.8{  ""Result"": [  …",1.0,1.0,1.0,1.0,1.0,1.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [  {  ""Fact"": ""…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…",1.0,"""{  ""Reasoning"": ""The response strictly adheres…"
"""Ich bin 80. Kann ich demnächst geimpft werden?""",0.5,0.0,"""{  ""Reasoning"": ""The response provides relevan…","""{  ""Reasoning"": ""The response does not provide…",0.333333,,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The response does not incorpo…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.444444,0.0,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",1.0,,"""Ja. Personen im Alter von ? 80 Jahren gehören zum…","""Ja. Personen im Alter von ? 80 Jahren gehören zum…",""" Gemäß dem bereitgestellten Kontext können Person…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,1.0,1.0,0.0,1.0,0.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,"""{  ""Reasoning"": ""The given LLM response adhere…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"
"""Kann ich mich auf Borkum gegen Corona impfen lass…",0.5,0.0,"""{  ""Reasoning"": ""The response is relevant to t…","""{  ""Reasoning"": ""The response does not provide…",0.5,,"""{  ""Reasoning"": ""The response correctly states…","""{  ""Reasoning"": ""The context does not provide …","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides informa…","""{  ""Reasoning"": ""The response is not providing…",1.0,0.0,0.5,0.0,"""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…","""[Doc Nr. 1] 3 1. ALLGEMEINE FRAGEN ZUR IMPFUN…",,,"""Nein. Inselbewohnerinnen und -bewohner müssen sic…","""Nein. Inselbewohnerinnen und -bewohner müssen sic…",""" Auf Grundlage des bereitgestellten Kontextes kan…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""Information Recall: 0.5{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,0.0,0.666667,0.0,1.0,0.0,"""{  ""Result"": [] }""","""{  ""Result"": [] }""",1.0,"""{  ""Reasoning"": ""The response adheres to the g…",1.0,"""{  ""Reasoning"": ""The given LLM response adhere…"
"""Wo erhalte ich weitere Informationen zur Corona-I…",0.8,0.0,"""{  ""Reasoning"": ""The response directly address…","""{  ""Reasoning"": ""The response does not provide…",1.0,,"""{  ""Reasoning"": ""The response correctly identi…","""{  ""Reasoning"": ""The context provides multiple…","""Response Precision: 1.0{  ""Reasoning"": ""The re…","""Response Precision: 0.0{  ""Reasoning"": ""The re…","""{  ""Reasoning"": ""The response provides a list …","""{  ""Reasoning"": ""The response does not provide…",1.0,0.0,0.842105,0.0,"""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…","""[Doc Nr. 1] 9 6. KOMMUNIKATION UND INFORMATION…",0.625,,"""Informationen zur Impfung können vorab z.B. unter…","""Informationen zur Impfung können vorab z.B. unter…",""" Sie können weitere Informationen zur Corona-Impf…","""Ihre Anfrage kann aufgrund fehlender Informatione…","""Information Recall: 0.8{  ""Result"": [  …","""Information Recall: 0.0{  ""Result"": [  …",1.0,0.0,1.0,0.0,1.0,0.0,"""{  ""Result"": [  {  ""Fact"": ""…","""{  ""Result"": [] }""",1.0,"""{  ""Reasoning"": ""The given LLM response adhere…",1.0,"""{  ""Reasoning"": ""The response adheres to the g…"

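The `score_response_match` values in the rows above are numerically consistent with a weighted harmonic mean of precision and recall in which recall carries three times the weight of precision. For example, precision 0.3 and recall 1.0 give 0.631579, and precision 0.75 and recall 0.875 give 0.84. The sketch below reproduces that relationship. It is inferred from the displayed numbers only and is not taken from UpTrain's implementation; the function name `response_match_score`, the `precision_weight` parameter, and the choice to return 0.0 when precision is missing or a score is zero are assumptions made for illustration.

```python
def response_match_score(precision, recall, precision_weight=0.25):
    """Weighted harmonic mean of precision and recall (hypothetical helper).

    Assumption: a missing (None) or zero precision/recall yields 0.0, which
    matches the rows above where recall is 0.0 and precision is empty.
    """
    if precision is None or recall is None or precision <= 0 or recall <= 0:
        return 0.0
    recall_weight = 1.0 - precision_weight
    return 1.0 / (precision_weight / precision + recall_weight / recall)


# Values taken from the first two table rows (Mixtral column).
print(round(response_match_score(0.3, 1.0), 6))
print(round(response_match_score(0.75, 0.875), 6))
```

Because recall dominates the combined score, a response that covers the ground truth but adds extra material is penalised less than one that omits part of the ground truth.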

In [44]:
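import os
import shutil
from datetime import datetime


def backup_and_save_df(df, file_path, file_type='csv'):
    """Save a Polars DataFrame, first copying any existing file to a timestamped backup."""
    # Validate early so an unsupported type is reported instead of silently skipped.
    if file_type not in ('csv', 'jsonl'):
        raise ValueError(f"Unsupported file_type: {file_type!r}")

    # Backups live in a 'backups' folder next to the target file.
    backup_dir = os.path.join(os.path.dirname(file_path), 'backups')
    os.makedirs(backup_dir, exist_ok=True)

    # Keep the previous version before overwriting it.
    if os.path.exists(file_path):
        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_filename = os.path.basename(file_path) + f".backup-{timestamp}"
        backup_path = os.path.join(backup_dir, backup_filename)
        shutil.copy(file_path, backup_path)

    if file_type == 'csv':
        print(f"Saving DataFrame to CSV at: {file_path}")
        df.write_csv(file_path)
    else:
        print(f"Saving DataFrame to NDJSON at: {file_path}")
        df.write_ndjson(file_path)


backup_and_save_df(res_df, jsonl_file_path, 'jsonl')
backup_and_save_df(res_df, csv_file_path, 'csv')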

Saving DataFrame to NDJSON at: ./results/mixtral_8x7b_instruct_v0.1_vs_luminous_supreme_experiment.jsonl
Saving DataFrame to CSV at: ./results/mixtral_8x7b_instruct_v0.1_vs_luminous_supreme_experiment.csv


In [45]:
def display_average_scores(df):
    """Return a table with the average of every score column, per model and metric."""
    # Score columns are named 'score_<metric>_model_<model name>'.
    score_columns = [
        col for col in df.columns
        if col.startswith('score_') and '_model_' in col
    ]
    data_for_table = []

    for column in score_columns:
        # drop_nans() removes NaN values; mean() already skips nulls.
        average = df[column].drop_nans().mean()

        # Split only on the first '_model_' so the model name stays intact.
        metric_part, model_name = column.split('_model_', 1)
        metric_name = metric_part.replace('score_', '').replace('_', ' ').capitalize()

        data_for_table.append({
            "Model": model_name,
            "Metric": metric_name,
            "Average Score": average
        })

    return pl.DataFrame(data_for_table)

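Several score cells in the results table are empty, for instance the factual accuracy of `luminous-supreme-control` on questions where it declined to answer. Such cells are skipped when averaging, so an average only reflects the rows that actually received a score. The following standard-library sketch illustrates the two steps `display_average_scores` performs, without depending on Polars. The helper names `parse_score_column` and `mean_ignoring_missing` are hypothetical and exist only for this illustration.

```python
import math


def parse_score_column(column):
    """Split 'score_<metric>_model_<model>' into (metric label, model name)."""
    metric_part, model_name = column.split("_model_", 1)
    metric_name = metric_part.replace("score_", "").replace("_", " ").capitalize()
    return metric_name, model_name


def mean_ignoring_missing(values):
    """Average the values that are neither None nor NaN; None if nothing is left."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return sum(present) / len(present) if present else None


metric, model = parse_score_column(
    "score_factual_accuracy_model_luminous-supreme-control"
)
# Two of the four entries are missing, so only 1.0 and 0.6 are averaged.
print(metric, model, mean_ignoring_missing([1.0, None, 0.6, float("nan")]))
```

When comparing models, it can help to report how many rows contributed to each average alongside the average itself, since a model that refuses often is scored on fewer questions.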
In [46]:
pl.Config.set_tbl_rows(32)
pl.Config.set_fmt_str_lengths(50)
display_average_scores(res_df)

Model,Metric,Average Score
str,str,f64
"""mistralai/Mixtral-8x7B-Instruct-v0.1""","""Response match recall""",0.536869
"""luminous-supreme-control""","""Response match recall""",0.244697
"""mistralai/Mixtral-8x7B-Instruct-v0.1""","""Response match precision""",0.417892
"""luminous-supreme-control""","""Response match precision""",0.349206
"""mistralai/Mixtral-8x7B-Instruct-v0.1""","""Valid response""",0.971429
"""luminous-supreme-control""","""Valid response""",0.6
"""mistralai/Mixtral-8x7B-Instruct-v0.1""","""Response match""",0.458352
"""luminous-supreme-control""","""Response match""",0.213882
"""mistralai/Mixtral-8x7B-Instruct-v0.1""","""Factual accuracy""",0.701263
"""luminous-supreme-control""","""Factual accuracy""",0.488095

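The averages above are in long format, with one row per model and metric. To read the comparison at a glance, the rows can be regrouped by metric and the difference between the two models computed. The sketch below does this with the standard library, using the averages shown in the output above; the function name `compare_models` and the constants `MIXTRAL` and `LUMINOUS` are hypothetical names introduced for this example.

```python
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
LUMINOUS = "luminous-supreme-control"

# (model, metric, average score), copied from the table above.
averages = [
    (MIXTRAL, "Response match recall", 0.536869),
    (LUMINOUS, "Response match recall", 0.244697),
    (MIXTRAL, "Response match precision", 0.417892),
    (LUMINOUS, "Response match precision", 0.349206),
    (MIXTRAL, "Valid response", 0.971429),
    (LUMINOUS, "Valid response", 0.6),
    (MIXTRAL, "Response match", 0.458352),
    (LUMINOUS, "Response match", 0.213882),
    (MIXTRAL, "Factual accuracy", 0.701263),
    (LUMINOUS, "Factual accuracy", 0.488095),
]


def compare_models(rows, model_a, model_b):
    """Return {metric: score of model_a minus score of model_b}."""
    by_metric = {}
    for model, metric, score in rows:
        by_metric.setdefault(metric, {})[model] = score
    return {
        metric: round(scores[model_a] - scores[model_b], 6)
        for metric, scores in by_metric.items()
        if model_a in scores and model_b in scores
    }


differences = compare_models(averages, MIXTRAL, LUMINOUS)
for metric, diff in differences.items():
    print(f"{metric}: {diff:+.6f}")
```

A positive difference means the Mixtral model scored higher on that metric. On the five metrics listed here, every difference is positive, with the largest gap on valid responses.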