<h1 align="center">
  <a href="https://www.nlga.niedersachsen.de/startseite">
    <img width="300" src="https://www.nlga.niedersachsen.de/assets/image/246974" alt="NLGA">
  </a>
</h1>

## Experimenting with different LLMs - Aleph Alpha Luminous-supreme-control vs Nous-Hermes-2-Mixtral-8x7B-DPO

In [18]:
# define configuration for dynamic experiment handling
EXPERIMENT_NAME = "Nous-Hermes-2-Mixtral-8x7B-DPO-vs-Luminous-Supreme"

**Overview**: In this notebook, we compare two LLMs from different providers. We use around 35 example questions from the [Testfragen](https://secure-confluence.nortal.com/display/NLGAC/Testfragen) dataset and evaluate the responses against several criteria to determine which of the two models performs better.

We have used the following metrics from UpTrain's library:

1. [Response Conciseness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-conciseness): Evaluates how concise the generated response is or if it has any additional irrelevant information for the question asked.

2. [Response Matching](https://docs.uptrain.ai/predefined-evaluations/ground-truth-comparison/response-matching): Evaluates how well the response generated by the LLM aligns with the provided ground truth.

3. [Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy): Evaluates whether the response generated is factually correct and grounded by the provided context.

4. [Context Utilization](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-utilization): Evaluates how complete the generated response is for the question specified, given the information provided in the context. Also known as Response Completeness wrt Context (`RESPONSE_COMPLETENESS_WRT_CONTEXT`).

5. [Response Relevance](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-relevance): Evaluates how relevant the generated response is to the question asked.

6. Response Validity: Identifies cases where the model fails to give an informative response, for example because of limited knowledge or an unclear question. For the question "What is the chemical formula of chlorophyll?", a valid response would be "The formula for chlorophyll is C55H72O5N4Mg." An invalid response would be "Sorry, I have no idea about that."

7. [Guideline Adherence](https://github.com/uptrain-ai/uptrain/blob/main/examples/checks/custom/guideline_adherence.ipynb): Evaluates the extent to which the LLM follows a given guideline, rule, or protocol. Guidelines can constrain the structure of the output, the content of the output, or the decision-making of LLM agents. For example, for an LLM-powered chatbot agent meant to handle appointment booking only, a guideline could be: "The agent should redirect all the queries to the human agent, except the ones related to appointment booking."

Each score has a value between 0 and 1.

The complete list of UpTrain's supported metrics is available [here](https://docs.uptrain.ai/predefined-evaluations/overview).

### Install Dependencies

In [19]:
# %pip install openai uptrain together lazy_loader fsspec pandas polars networkx pydantic_settings aiolimiter

In [20]:
import os
from dotenv import load_dotenv
import polars as pl 
import shutil
from datetime import datetime
import time
from openai import OpenAI

### Authentication and Configuration

Let's define the required API keys: the Together API key (for generating responses) and the Azure OpenAI API key (for evaluating the responses).
Also make sure that the dataset path in the configuration is correct, and set `API_KEY` and `BASE_URL` for the LLM API endpoint provider.
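A quick check for unset variables before the configuration cell runs can save a failed API call later. The following is a minimal sketch: the helper `find_missing_env_vars` is hypothetical, and the variable names are assumed to match the ones read with `os.getenv` in the configuration cell.

```python
import os

# Variable names assumed to match the .env file used by this notebook
REQUIRED_ENV_VARS = [
    "TOGETHER_API_KEY",
    "AZURE_OPENAI_API_KEY",
    "AZURE_API_VERSION",
    "AZURE_API_BASE",
]

def find_missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {"TOGETHER_API_KEY": "abc", "AZURE_API_BASE": ""}
missing = find_missing_env_vars(REQUIRED_ENV_VARS, example_env)
print(missing)
```

Calling `find_missing_env_vars(REQUIRED_ENV_VARS)` without a mapping checks the real process environment, so it should run after `load_dotenv()`.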

In [21]:
# Load the environment variables from the .env file
load_dotenv()

CONFIG = {
    # Together.ai API key and base URL
    "API_KEY": os.getenv("TOGETHER_API_KEY"),
    "BASE_URL": "https://api.together.xyz/v1",
    # The model name used to generate responses
    "GENERATE_MODEL_NAME": "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    "AA_MODEL_NAME": "luminous-supreme-control",
    # Guideline name used in the Guideline Adherence check
    "GUIDELINE_NAME": "Strict_Context",
    # dataset path
    # "DATASET_PATH": "nlga_dataset_AA_small.jsonl",
    "DATASET_PATH": "./nlga_dataset_AA.jsonl",
    "RESULTS_DIR": "./results/",
    "AZURE_OPENAI_API_KEY": os.getenv("AZURE_OPENAI_API_KEY"),
    "AZURE_API_VERSION": os.getenv("AZURE_API_VERSION"),
    "AZURE_API_BASE": os.getenv("AZURE_API_BASE"),
    # Azure deployments:
    "GPT_35_TURBO_16K": "gpt-35-turbo-16k-deployment",
    "GPT_4": "gpt4",
}

# Azure deployment used to evaluate 
EVAL_MODEL_NAME = "azure/gpt4"
# EVAL_MODEL_NAME = "azure/gpt35-16k"

def get_experiment_file_path(extension):
    filename = f"{EXPERIMENT_NAME.replace(' ', '_').replace('-', '_').lower()}_experiment.{extension}"
    return os.path.join(CONFIG['RESULTS_DIR'], filename)

jsonl_file_path = get_experiment_file_path('jsonl')
csv_file_path = get_experiment_file_path('csv')

In [22]:
# Utility functions for API and file operations
def initialize_openai_client():
    return OpenAI(
        api_key=CONFIG["API_KEY"],
        base_url=CONFIG["BASE_URL"]
    )

def ensure_directory_exists(path):
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(path, exist_ok=True)

def read_dataset(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"The specified dataset path does not exist: {path}")
    return pl.read_ndjson(path)
# dataset = pl.read_ndjson(dataset_path).select(pl.col(["question", "ground_truth", "context"]))

### Load the testing dataset

In [23]:
ensure_directory_exists(CONFIG['RESULTS_DIR'])
dataset = read_dataset(CONFIG['DATASET_PATH'])
# Preview the rows with idx > 100; the full dataset is used for the experiment
filtered_dataset = dataset.filter(pl.col("idx") > 100)
filtered_dataset

idx,question,response,ground_truth,context
i64,str,str,str,str
104,"""Welche Erreger…","""Armin erfasst …","""* Gramnegative…","""[Doc Nr. 1] Je…"
106,"""Wann sind Zeck…","""Zecke sind in …","""Zecken werden …","""[Doc Nr. 1] Er…"
300,"""Wer ist Cristi…","""Cristiano Rona…","""Diese Anfrage …","""[Doc Nr. 1] ER…"
301,"""Wie funktionie…","""Die Frage nach…","""Diese Anfrage …","""[Doc Nr. 1] Je…"
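For readers unfamiliar with the NDJSON format behind `pl.read_ndjson`, here is a standard-library sketch of the same load-and-filter step. The two records are made up for illustration and only reuse the column names shown in the table above; the helper `read_ndjson` is hypothetical.

```python
import io
import json

def read_ndjson(stream):
    """Parse newline-delimited JSON: one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]

# Two made-up records with the same keys as the dataset columns
sample = io.StringIO(
    '{"idx": 42, "question": "Q1", "response": "R1", "ground_truth": "G1", "context": "C1"}\n'
    '{"idx": 104, "question": "Q2", "response": "R2", "ground_truth": "G2", "context": "C2"}\n'
)
rows = read_ndjson(sample)
filtered = [row for row in rows if row["idx"] > 100]
print([row["idx"] for row in filtered])  # prints [104]
```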


### Let's define a prompt to generate responses

In [24]:
SYSTEM_PROMPT = """### INSTRUKTIONEN
Generiere bitte eine ANTWORT, die sich strikt an den gegebenen KONTEXT hält und präzise auf die gestellte FRAGE antwortet, ohne eigene Informationen des Modells hinzuzufügen. Falls die benötigte Information nicht im KONTEXT zu finden ist, antworte mit: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Vermeide Bezüge auf vorherige Ausgaben des Modells. Die Antwort soll auf dem bereitgestellten KONTEXT basieren. Sollte die FRAGE nicht direkt einem gesundheitsbezogenen Thema zuzuordnen sein oder nicht klar zu beantworten sein, erkläre kurz, warum die Anfrage nicht beantwortet werden kann und empfehle eine genauere Formulierung oder zusätzliche Informationen.

### KONTEXT
{context}"""
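The prompt template is filled with `str.format`, which only interprets braces in the template itself, not in the inserted context. The sketch below illustrates this with a shortened stand-in for `SYSTEM_PROMPT`; the helper `build_messages` and the sample strings are hypothetical.

```python
def build_messages(system_template, context, question):
    """Fill the {context} placeholder and return chat messages in system/user order."""
    return [
        {"role": "system", "content": system_template.format(context=context)},
        {"role": "user", "content": question},
    ]

# Shortened stand-in for SYSTEM_PROMPT, not the full instructions
template = "### INSTRUKTIONEN\n...\n\n### KONTEXT\n{context}"

# Braces inside the context are passed through unchanged
messages = build_messages(template, "[Doc Nr. 1] {Beispiel}", "Was ist das?")
print(messages[0]["content"])
```

Literal braces added to the template itself would have to be doubled (`{{` and `}}`), otherwise `format` treats them as placeholders.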

In [25]:
client = initialize_openai_client()

def get_response(row, model):
    question = row['question'][0]
    context = row['context'][0]
    
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
            {"role": "user", "content": question},
            # {"role": "assistant", "content": "Example answer"},
            # {"role": "user", "content": "First question/message for the model to actually respond to."}
        ]
    ).choices[0].message.content
    
    return {'question': question, 'context': context, 'response': response, 'ground_truth': row['ground_truth'][0], 'model': model}

In [26]:
def format_response(row):
    question = row['question'][0]
    context = row['context'][0]
    response = row['response'][0]
    ground_truth = row['ground_truth'][0]
    model = CONFIG["AA_MODEL_NAME"]

    return {'question': question, 'context': context, 'response': response, 'ground_truth': ground_truth, 'model': model}

### Generate responses for both models

In [27]:
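start_time = time.time()
# Generate fresh responses with the model hosted on Together
results_cmodel = [get_response(dataset[idx], CONFIG["GENERATE_MODEL_NAME"]) for idx in range(len(dataset))]
# The Luminous responses are read from the dataset's "response" column
results_aleph_alpha = [format_response(dataset[idx]) for idx in range(len(dataset))]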

In [None]:
print(f"Total execution time: {time.time() - start_time:.2f} seconds")

In [28]:
pl.Config.set_fmt_str_lengths(50)
df = pl.DataFrame(results_cmodel)
df

question,context,response,ground_truth,model
str,str,str,str,str
"""Wie kann ich m…","""[Doc Nr. 1] Le…","""Um sich vor Le…","""Kaltes Wasser …","""NousResearch/N…"
"""Warum heißen L…","""[Doc Nr. 1] Je…","""Die Legionelle…","""Im Juni 1976 e…","""NousResearch/N…"
"""Was sind Anzei…","""[Doc Nr. 1] Ri…","""In etwa 50 % d…","""Wanderröte in …","""NousResearch/N…"
"""Sind die Lyme-…","""[Doc Nr. 1] Di…","""Ja, die Lyme-K…","""Ja. (die durch…","""NousResearch/N…"
"""Schützt die sc…","""[Doc Nr. 1] Ri…","""Ihre Anfrage k…","""Nein/kaum. Im …","""NousResearch/N…"
…,…,…,…,…
"""Gibt es eine I…","""[Doc Nr. 1] Be…","""Ja, es gibt ei…","""Ja. Humane Pap…","""NousResearch/N…"
"""Was sind Neben…","""[Doc Nr. 1] ER…","""Als mögliche N…","""Als mögliche N…","""NousResearch/N…"
"""Wie wird Borre…","""[Doc Nr. 1] In…","""Die akute Borr…","""Die akute Borr…","""NousResearch/N…"
"""Wer ist Cristi…","""[Doc Nr. 1] ER…","""Ihre Anfrage k…","""Diese Anfrage …","""NousResearch/N…"


### Evaluating Experiments using UpTrain

UpTrain's `EvalLLM` provides an `evaluate_experiments` method that takes the data to evaluate, the list of checks to run, and the names of the columns that identify the experiment (here, `model`). It returns a list of dictionaries containing the evaluation results.
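The result columns further down follow the pattern `<field>_model_<model name>`, with one row per question. The sketch below reproduces only that naming pattern on toy data. It is not UpTrain's implementation, and the helper `pivot_by_experiment` is hypothetical.

```python
from collections import defaultdict

def pivot_by_experiment(rows, exp_column, key_column="question"):
    """Group rows by question and suffix every other field with the experiment value."""
    merged = defaultdict(dict)
    for row in rows:
        suffix = f"_{exp_column}_{row[exp_column]}"
        target = merged[row[key_column]]
        target[key_column] = row[key_column]
        for name, value in row.items():
            if name not in (key_column, exp_column):
                target[name + suffix] = value
    return list(merged.values())

rows = [
    {"question": "Q1", "response": "A", "model": "model-a"},
    {"question": "Q1", "response": "B", "model": "model-b"},
]
print(pivot_by_experiment(rows, "model"))
```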

In [29]:
from uptrain import EvalLLM, Evals, ResponseMatching, Settings

import nest_asyncio
nest_asyncio.apply()

start_time = time.time()

settings = Settings(
    model=EVAL_MODEL_NAME,
    azure_api_key=CONFIG["AZURE_OPENAI_API_KEY"],
    azure_api_version=CONFIG["AZURE_API_VERSION"],
    azure_api_base=CONFIG["AZURE_API_BASE"],
)
eval_llm = EvalLLM(settings)

res = eval_llm.evaluate_experiments(
    project_name=f"{EXPERIMENT_NAME}-Experiments",
    data=results_cmodel + results_aleph_alpha,
    checks=[
        Evals.RESPONSE_CONCISENESS,
        ResponseMatching(method='llm'),  # Comment this out if you don't have ground truth
        Evals.RESPONSE_COMPLETENESS_WRT_CONTEXT,
        Evals.FACTUAL_ACCURACY,
        Evals.RESPONSE_RELEVANCE,
        Evals.VALID_RESPONSE,
    ],
    exp_columns=['model'],
)

100%|██████████| 70/70 [00:11<00:00,  5.89it/s]
100%|██████████| 140/140 [00:26<00:00,  5.37it/s]
 45%|████▌     | 63/140 [00:06<00:11,  6.50it/s]
2024-04-18 18:57:34.644 | ERROR    | uptrain.operators.language.llm:parse_json:55 - Error when parsing JSON: <string>:5 Unexpected "Z" at column 67
2024-04-18 18:57:34.644 | ERROR    | uptrain.operators.language.llm:run_validation:63 - Error when running validation function: 'Result'
2024-04-18 18:57:34.644 | ERROR    | uptrain.operators.language.llm:async_process_payload:103 - Error when sending request to LLM API: Response doesn't pass the validation func.
Response: {
    "Result": [
        {
            "Fact": "1. Zecken sind in Deutschland von März bis Oktober aktiv.",
            "Reasoning": "The context explicitly states that the "Zeckensaison" is in the 

In [None]:
print(f"Total evaluation time: {time.time() - start_time:.2f} seconds")
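The JSON parsing error logged during the evaluation run comes from unescaped double quotes inside a string value (around "Zeckensaison"). The sketch below reproduces the effect with Python's standard `json` module. The payload text is shortened, and UpTrain's own parser may report the problem differently.

```python
import json

# Unescaped inner quotes, as in the logged evaluator response (text shortened)
broken = '{"Reasoning": "The context mentions the "Zeckensaison" explicitly."}'
# The same payload with the inner quotes escaped
fixed = '{"Reasoning": "The context mentions the \\"Zeckensaison\\" explicitly."}'

def try_parse(text):
    """Return the parsed object, or None when the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(try_parse(broken))              # prints None
print(try_parse(fixed)["Reasoning"])  # the string with its inner quotes intact
```

Rows affected by such failures end up with missing scores, which is why some cells in the result tables are empty.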

In [30]:
res_df = pl.DataFrame(res)
res_df

question,response_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,response_model_luminous-supreme-control,explanation_response_conciseness_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_conciseness_model_luminous-supreme-control,explanation_response_relevance_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_relevance_model_luminous-supreme-control,explanation_valid_response_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_valid_response_model_luminous-supreme-control,score_response_relevance_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_relevance_model_luminous-supreme-control,explanation_response_matching_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_matching_model_luminous-supreme-control,score_valid_response_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_valid_response_model_luminous-supreme-control,score_response_completeness_wrt_context_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_completeness_wrt_context_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_completeness_wrt_context_model_luminous-supreme-control,score_factual_accuracy_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_factual_accuracy_model_luminous-supreme-control,ground_truth_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,ground_truth_model_luminous-supreme-control,score_response_conciseness_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_conciseness_model_luminous-supreme-control,explanation_factual_accuracy_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_factual_accuracy_model_luminous-supreme-control,score_response_match_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_match_model_luminous-supreme-control,context_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,context_model_luminous-supreme-control,score_response_match_recall_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_match_recall_model_luminous-supreme-control,score_response_match_precision_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_match_precision_model_luminous-supreme-control
str,str,str,str,str,str,str,str,str,f64,f64,str,str,f64,f64,f64,f64,str,str,f64,f64,str,str,f64,f64,str,str,f64,f64,str,str,f64,f64,f64,f64
"""Wie kann ich m…","""Um sich vor Le…","""'Ihre Anfrage …","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",0.666667,0.0,"""Information Re…","""Information Re…",1.0,0.0,1.0,0.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,0.875,"""Kaltes Wasser …","""Kaltes Wasser …",0.5,0.0,"""{  ""Result""…","""{  ""Result""…",0.727273,0.0,"""[Doc Nr. 1] Le…","""[Doc Nr. 1] Le…",1.0,0.0,0.4,
"""Warum heißen L…","""Die Legionelle…","""Der Name Legio…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",0.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",0.5,1.0,"""Im Juni 1976 e…","""Im Juni 1976 e…",0.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.75,0.727273,"""[Doc Nr. 1] Je…","""[Doc Nr. 1] Je…",0.75,0.75,0.75,0.666667
"""Was sind Anzei…","""In etwa 50 % d…","""Die Anzeichen …","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…",,1.0,0.666667,"""Information Re…","""Information Re…",1.0,,1.0,0.5,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Wanderröte in …","""Wanderröte in …",1.0,0.5,"""{  ""Result""…","""{  ""Result""…",0.914286,1.0,"""[Doc Nr. 1] Ri…","""[Doc Nr. 1] Ri…",0.888889,1.0,1.0,1.0
"""Sind die Lyme-…","""Ja, die Lyme-K…","""Die Lyme Disea…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Ja. (die durch…","""Ja. (die durch…",1.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.888889,0.941176,"""[Doc Nr. 1] Di…","""[Doc Nr. 1] Di…",1.0,1.0,0.666667,0.8
"""Schützt die sc…","""Ihre Anfrage k…","""Die schnelle E…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,0.666667,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Nein/kaum. Im …","""Nein/kaum. Im …",1.0,0.5,"""{  ""Result""…","""{  ""Result""…",0.727273,0.444444,"""[Doc Nr. 1] Ri…","""[Doc Nr. 1] Ri…",1.0,0.5,0.4,0.333333
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Gibt es eine I…","""Ja, es gibt ei…","""Es gibt keine …","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",0.6,0.5,"""Ja. Humane Pap…","""Ja. Humane Pap…",1.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.444444,0.461538,"""[Doc Nr. 1] Be…","""[Doc Nr. 1] Be…",0.75,0.5,0.2,0.375
"""Was sind Neben…","""Als mögliche N…","""Als Nebenwirku…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,0.8,"""Als mögliche N…","""Als mögliche N…",1.0,0.5,"""{  ""Result""…","""{  ""Result""…",1.0,0.903226,"""[Doc Nr. 1] ER…","""[Doc Nr. 1] ER…",1.0,1.0,1.0,0.7
"""Wie wird Borre…","""Die akute Borr…","""Die Behandlung…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Die akute Borr…","""Die akute Borr…",1.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.5,0.727273,"""[Doc Nr. 1] In…","""[Doc Nr. 1] In…",0.5,1.0,0.5,0.4
"""Wer ist Cristi…","""Ihre Anfrage k…","""Cristiano Rona…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",0.0,0.0,"""Information Re…","""Information Re…",0.0,1.0,1.0,0.0,"""{  ""Reasoni…","""{  ""Reasoni…",0.5,0.0,"""Diese Anfrage …","""Diese Anfrage …",0.0,0.0,"""{  ""Result""…","""{  ""Result""…",0.5,0.0,"""[Doc Nr. 1] ER…","""[Doc Nr. 1] ER…",0.5,,0.5,0.0


### Adding Guideline Adherence evaluations

In [31]:
guideline = "The response must strictly adhere to the provided context and not introduce external information. If the necessary information is absent from the context, respond with: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Should the question fall outside the health-related jurisdiction of the Landesgesundheitsamt Niedersachsen, it means the query is beyond the health-related scope and shouldn't be answered."

In [32]:
data_cmodel_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_cmodel]
data_aleph_alpha_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_aleph_alpha]

In [33]:
from uptrain import GuidelineAdherence

def run_guideline_adherence_eval(data, guideline_name):
    return eval_llm.evaluate(
        data=data,
        checks=[GuidelineAdherence(guideline=guideline, guideline_name=guideline_name)]
    )

res_guideline_cmodel = run_guideline_adherence_eval(data_cmodel_for_guideline_eval, CONFIG["GUIDELINE_NAME"])
res_guideline_aleph_alpha = run_guideline_adherence_eval(data_aleph_alpha_for_guideline_eval, CONFIG["GUIDELINE_NAME"])

100%|██████████| 35/35 [00:05<00:00,  6.30it/s]
2024-04-18 19:05:21.594 | ERROR    | uptrain.operators.language.guideline:evaluate_local:194 - Error when processing payload at index 8: None
2024-04-18 19:05:21.610 | ERROR    | uptrain.operators.language.guideline:evaluate_local:194 - Error when processing payload at index 18: None
2024-04-18 19:05:26.230 | INFO     | uptrain.framework.evalllm:evaluate:367 - Local server not running, start the server to log data and visualize in the dashboard!
100%|██████████| 35/35 [00:02<00:00, 16.72it/s]
2024-04-18 19:05:28.951 | ERROR    | uptrain.operators.language.guideline:evaluate_local:194 - Error when processing payload at index 2: None
2024-04-18 19:05:33.580 | INFO     | uptrain.framewo

In [34]:
import math

def update_guidelines(guideline_name, res_guidelines, config_model_name):
    """Rename the guideline score/explanation keys so they carry the model name."""
    DEFAULT_SCORE = float("nan")
    DEFAULT_EXPLANATION = "No data available"
    score_name = 'score_' + guideline_name + '_adherence'
    explanation_name = 'explanation_' + guideline_name + '_adherence'
    score_key = score_name + '_model_' + config_model_name
    explanation_key = explanation_name + '_model_' + config_model_name

    for f in res_guidelines:
        if score_name in f:
            f[score_key] = f.pop(score_name)
        else:
            f[score_key] = DEFAULT_SCORE

        if explanation_name in f:
            f[explanation_key] = f.pop(explanation_name)
        else:
            # NaN never compares equal to itself, so test for a missing score with math.isnan
            score = f[score_key]
            if score is None or math.isnan(score):
                f[explanation_key] = DEFAULT_EXPLANATION

    return res_guidelines
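A note on the NaN default used for missing scores: NaN never compares equal to itself, so an equality test against `float("nan")` cannot detect it. The short sketch below shows the behaviour; the helper `is_missing` is hypothetical.

```python
import math

nan = float("nan")

def is_missing(score):
    """Treat both None and NaN as a missing score."""
    return score is None or (isinstance(score, float) and math.isnan(score))

print(nan == nan)        # False
print(is_missing(nan))   # True
print(is_missing(None))  # True
print(is_missing(0.0))   # False
```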

In [35]:
res_guideline_cmodel = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_cmodel, CONFIG["GENERATE_MODEL_NAME"])
res_guideline_aleph_alpha = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_aleph_alpha, CONFIG["AA_MODEL_NAME"])

In [36]:
def merge_lists(base_list, update_list):
    update_dict = {item['question']: item for item in update_list if 'question' in item}
    
    for item in base_list:
        question = item.get('question')
        if question and question in update_dict:
            # print(f"updating with {question}")            
            update_info = {key: val for key, val in update_dict[question].items() if key != 'response'}
            item.update(update_info)
    return base_list

res = merge_lists(res, res_guideline_cmodel)
res = merge_lists(res, res_guideline_aleph_alpha)
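The merge is keyed on the question text, so it assumes that every question appears only once per result list. The toy sketch below (hypothetical helper `merge_on_question`, made-up data) shows the same approach and what happens with a duplicate question: the last entry wins.

```python
def merge_on_question(base_rows, update_rows, skip_keys=("response",)):
    """Copy fields from update_rows into base_rows that share the same question."""
    updates = {row["question"]: row for row in update_rows if "question" in row}
    for row in base_rows:
        extra = updates.get(row.get("question"), {})
        row.update({k: v for k, v in extra.items() if k not in skip_keys})
    return base_rows

base = [{"question": "Q1", "score_a": 1.0}, {"question": "Q2", "score_a": 0.5}]
update = [
    {"question": "Q1", "response": "ignored", "score_b": 0.0},
    {"question": "Q1", "response": "ignored", "score_b": 1.0},  # duplicate question
]
merged = merge_on_question(base, update)
print(merged)
```

If the dataset could contain repeated questions, a unique identifier such as the `idx` column would be a safer merge key.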

### Creating the DataFrame and displaying average scores

In [37]:
res_df = pl.DataFrame(res)
res_df

question,response_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,response_model_luminous-supreme-control,explanation_response_conciseness_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_conciseness_model_luminous-supreme-control,explanation_response_relevance_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_relevance_model_luminous-supreme-control,explanation_valid_response_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_valid_response_model_luminous-supreme-control,score_response_relevance_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_relevance_model_luminous-supreme-control,explanation_response_matching_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_matching_model_luminous-supreme-control,score_valid_response_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_valid_response_model_luminous-supreme-control,score_response_completeness_wrt_context_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_completeness_wrt_context_model_luminous-supreme-control,explanation_response_completeness_wrt_context_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_response_completeness_wrt_context_model_luminous-supreme-control,score_factual_accuracy_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_factual_accuracy_model_luminous-supreme-control,ground_truth_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,ground_truth_model_luminous-supreme-control,score_response_conciseness_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_conciseness_model_luminous-supreme-control,explanation_factual_accuracy_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_factual_accuracy_model_luminous-supreme-control,score_response_match_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_match_model_luminous-supreme-control,context_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,context_model_luminous-supreme-control,score_response_match_recall_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_match_recall_model_luminous-supreme-control,score_response_match_precision_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_response_match_precision_model_luminous-supreme-control,score_Strict_Context_adherence_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,explanation_Strict_Context_adherence_model_NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO,score_Strict_Context_adherence_model_luminous-supreme-control,explanation_Strict_Context_adherence_model_luminous-supreme-control
str,str,str,str,str,str,str,str,str,f64,f64,str,str,f64,f64,f64,f64,str,str,f64,f64,str,str,f64,f64,str,str,f64,f64,str,str,f64,f64,f64,f64,f64,str,f64,str
"""Wie kann ich m…","""Um sich vor Le…","""'Ihre Anfrage …","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",0.666667,0.0,"""Information Re…","""Information Re…",1.0,0.0,1.0,0.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,0.875,"""Kaltes Wasser …","""Kaltes Wasser …",0.5,0.0,"""{  ""Result""…","""{  ""Result""…",0.727273,0.0,"""[Doc Nr. 1] Le…","""[Doc Nr. 1] Le…",1.0,0.0,0.4,,1.0,"""{  ""Reasoni…",1.0,"""{  ""Reasoni…"
"""Warum heißen L…","""Die Legionelle…","""Der Name Legio…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",0.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",0.5,1.0,"""Im Juni 1976 e…","""Im Juni 1976 e…",0.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.75,0.727273,"""[Doc Nr. 1] Je…","""[Doc Nr. 1] Je…",0.75,0.75,0.75,0.666667,1.0,"""{  ""Reasoni…",1.0,"""{  ""Reasoni…"
"""Was sind Anzei…","""In etwa 50 % d…","""Die Anzeichen …","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…",,1.0,0.666667,"""Information Re…","""Information Re…",1.0,,1.0,0.5,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Wanderröte in …","""Wanderröte in …",1.0,0.5,"""{  ""Result""…","""{  ""Result""…",0.914286,1.0,"""[Doc Nr. 1] Ri…","""[Doc Nr. 1] Ri…",0.888889,1.0,1.0,1.0,1.0,"""{  ""Reasoni…",,
"""Sind die Lyme-…","""Ja, die Lyme-K…","""Die Lyme Disea…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Ja. (die durch…","""Ja. (die durch…",1.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.888889,0.941176,"""[Doc Nr. 1] Di…","""[Doc Nr. 1] Di…",1.0,1.0,0.666667,0.8,1.0,"""{  ""Reasoni…",1.0,"""{  ""Reasoni…"
"""Schützt die sc…","""Ihre Anfrage k…","""Die schnelle E…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,0.666667,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Nein/kaum. Im …","""Nein/kaum. Im …",1.0,0.5,"""{  ""Result""…","""{  ""Result""…",0.727273,0.444444,"""[Doc Nr. 1] Ri…","""[Doc Nr. 1] Ri…",1.0,0.5,0.4,0.333333,1.0,"""{  ""Reasoni…",1.0,"""{  ""Reasoni…"
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""Gibt es eine I…","""Ja, es gibt ei…","""Es gibt keine …","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",0.6,0.5,"""Ja. Humane Pap…","""Ja. Humane Pap…",1.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.444444,0.461538,"""[Doc Nr. 1] Be…","""[Doc Nr. 1] Be…",0.75,0.5,0.2,0.375,1.0,"""{  ""Reasoni…",1.0,"""{  ""Reasoni…"
"""Was sind Neben…","""Als mögliche N…","""Als Nebenwirku…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,0.8,"""Als mögliche N…","""Als mögliche N…",1.0,0.5,"""{  ""Result""…","""{  ""Result""…",1.0,0.903226,"""[Doc Nr. 1] ER…","""[Doc Nr. 1] ER…",1.0,1.0,1.0,0.7,1.0,"""{  ""Reasoni…",1.0,"""{  ""Reasoni…"
"""Wie wird Borre…","""Die akute Borr…","""Die Behandlung…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Information Re…","""Information Re…",1.0,1.0,1.0,1.0,"""{  ""Reasoni…","""{  ""Reasoni…",1.0,1.0,"""Die akute Borr…","""Die akute Borr…",1.0,1.0,"""{  ""Result""…","""{  ""Result""…",0.5,0.727273,"""[Doc Nr. 1] In…","""[Doc Nr. 1] In…",0.5,1.0,0.5,0.4,1.0,"""{  ""Reasoni…",1.0,"""{  ""Reasoni…"
"""Wer ist Cristi…","""Ihre Anfrage k…","""Cristiano Rona…","""{  ""Reasoni…","""{  ""Reasoni…","""Response Preci…","""Response Preci…","""{  ""Reasoni…","""{  ""Reasoni…",0.0,0.0,"""Information Re…","""Information Re…",0.0,1.0,1.0,0.0,"""{  ""Reasoni…","""{  ""Reasoni…",0.5,0.0,"""Diese Anfrage …","""Diese Anfrage …",0.0,0.0,"""{  ""Result""…","""{  ""Result""…",0.5,0.0,"""[Doc Nr. 1] ER…","""[Doc Nr. 1] ER…",0.5,,0.5,0.0,1.0,"""{  ""Reasoni…",0.0,"""{  ""Reasoni…"


In [38]:
def backup_and_save_df(df, file_path, file_type='csv'):
    """Back up an existing file at file_path, then write df as CSV or NDJSON."""
    backup_dir = os.path.join(os.path.dirname(file_path), 'backups')
    os.makedirs(backup_dir, exist_ok=True)

    # Keep a timestamped copy of the previous results before overwriting them
    if os.path.exists(file_path):
        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_filename = os.path.basename(file_path) + f".backup-{timestamp}"
        backup_path = os.path.join(backup_dir, backup_filename)
        shutil.copy(file_path, backup_path)

    if file_type == 'csv':
        print(f"Saving DataFrame to CSV at: {file_path}")
        df.write_csv(file_path)
    elif file_type == 'jsonl':
        print(f"Saving DataFrame to NDJSON at: {file_path}")
        df.write_ndjson(file_path)
    else:
        raise ValueError(f"Unsupported file type: {file_type}")
    
backup_and_save_df(res_df, jsonl_file_path, 'jsonl')
backup_and_save_df(res_df, csv_file_path, 'csv')

Saving DataFrame to NDJSON at: ./results/nous_hermes_2_mixtral_8x7b_dpo_vs_luminous_supreme_experiment.jsonl
Saving DataFrame to CSV at: ./results/nous_hermes_2_mixtral_8x7b_dpo_vs_luminous_supreme_experiment.csv


In [39]:
def display_average_scores(df):
    """Return a table with the mean of every per-model score column."""
    # Score columns are named 'score_<metric>_model_<model name>'
    score_columns = [col for col in df.columns if col.startswith('score_') and '_model_' in col]
    data_for_table = []

    for column in score_columns:
        # NaN values are dropped explicitly; null values are ignored by mean()
        average = df[column].drop_nans().mean()

        metric_part, model_name = column.split('_model_', 1)
        metric_name = metric_part.replace('score_', '').replace('_', ' ').capitalize()

        data_for_table.append({
            "Model": model_name,
            "Metric": metric_name,
            "Average Score": average
        })

    results_table = pl.DataFrame(data_for_table)
    return results_table
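The averaging in `display_average_scores` relies on the column naming pattern `score_<metric>_model_<model>` and skips missing values, so a metric's average is computed only over the rows that were actually scored. The following standard-library sketch illustrates the idea on made-up rows; `average_scores` is a hypothetical helper, not part of the notebook.

```python
import math

def average_scores(rows):
    """Average every 'score_<metric>_model_<model>' field, ignoring None and NaN."""
    totals = {}
    for row in rows:
        for name, value in row.items():
            if not name.startswith("score_") or "_model_" not in name:
                continue
            if value is None or math.isnan(value):
                continue
            metric, model = name[len("score_"):].split("_model_", 1)
            totals.setdefault((model, metric), []).append(value)
    return {key: sum(vals) / len(vals) for key, vals in totals.items()}

rows = [
    {"score_valid_response_model_a": 1.0, "score_valid_response_model_b": None},
    {"score_valid_response_model_a": 0.0, "score_valid_response_model_b": 1.0},
]
averages = average_scores(rows)
print(averages)
```

Because failed evaluations are skipped rather than counted as 0, two models can be averaged over different numbers of questions, which is worth keeping in mind when comparing the results.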

In [40]:
pl.Config.set_tbl_rows(32)
pl.Config.set_fmt_str_lengths(50)
display_average_scores(res_df)

Model,Metric,Average Score
str,str,f64
"""NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO""","""Response relevance""",0.6
"""luminous-supreme-control""","""Response relevance""",0.514286
"""NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO""","""Valid response""",0.666667
"""luminous-supreme-control""","""Valid response""",0.588235
"""NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO""","""Response completeness wrt context""",0.885714
"""luminous-supreme-control""","""Response completeness wrt context""",0.757143
"""NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO""","""Factual accuracy""",0.821101
"""luminous-supreme-control""","""Factual accuracy""",0.69076
"""NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO""","""Response conciseness""",0.614286
"""luminous-supreme-control""","""Response conciseness""",0.514286
