<h1 align="center">
  <a href="https://www.nlga.niedersachsen.de/startseite">
    <img width="300" src="https://www.nlga.niedersachsen.de/assets/image/246974" alt="NLGA">
  </a>
</h1>

## Experimenting with different LLMs - Aleph Alpha Luminous-supreme-control vs GPT-4

In [None]:
# define configuration for dynamic experiment handling
EXPERIMENT_NAME = "GPT_4_vs_Luminous_Supreme"

**Overview**: This notebook compares two LLMs from different providers: GPT-4 and Aleph Alpha's Luminous-supreme-control. We use around 35 example questions from the [Testfragen](https://secure-confluence.nortal.com/display/NLGAC/Testfragen) dataset and evaluate the responses against several criteria to determine which of the two models performs better.

We have used the following metrics from UpTrain's library:

1. [Response Conciseness](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-conciseness): Evaluates how concise the generated response is or if it has any additional irrelevant information for the question asked.

2. [Response Matching](https://docs.uptrain.ai/predefined-evaluations/ground-truth-comparison/response-matching): Evaluates how well the response generated by the LLM aligns with the provided ground truth.

3. [Factual Accuracy](https://docs.uptrain.ai/predefined-evaluations/context-awareness/factual-accuracy): Evaluates whether the response generated is factually correct and grounded by the provided context.

4. [Context Utilization](https://docs.uptrain.ai/predefined-evaluations/context-awareness/context-utilization): Evaluates how complete the generated response is for the question asked, given the information provided in the context. Also known as Response Completeness with respect to context (`RESPONSE_COMPLETENESS_WRT_CONTEXT`).

5. [Response Relevance](https://docs.uptrain.ai/predefined-evaluations/response-quality/response-relevance): Evaluates how relevant the generated response is to the question asked.

6. **Response Validity**: In some cases, an LLM might fail to generate a response, for example because of limited knowledge or because the question is unclear. The Response Validity score identifies these cases, where the model does not produce an informative response. For example, if the question is "What is the chemical formula of chlorophyll?", a valid response would be "The formula for chlorophyll is C55H72O5N4Mg." An invalid response would be "Sorry, I have no idea about that."

7. **[Guideline Adherence](https://github.com/uptrain-ai/uptrain/blob/main/examples/checks/custom/guideline_adherence.ipynb)**: Measures the extent to which the LLM follows a given guideline, rule, or protocol. Given the complexity of LLMs, it is important to define such guidelines, whether they concern the structure of the output, constraints on its content, or protocols for the decision-making of LLM agents. For example, for an LLM-powered chatbot agent meant to handle appointment booking only, you want to make sure the LLM follows the guideline: "The agent should redirect all the queries to the human agent, except the ones related to appointment booking."

Each score has a value between 0 and 1. 

The complete list of UpTrain's supported metrics is available [here](https://docs.uptrain.ai/predefined-evaluations/overview).

### Install Dependencies

In [None]:
# %pip install openai uptrain together lazy_loader fsspec pandas polars networkx pydantic_settings aiolimiter

In [None]:
import os
from dotenv import load_dotenv
import polars as pl 
import shutil
from datetime import datetime
import time
from openai import OpenAI

### Authentication and Configuration 

Let's define the required API keys, mainly the Azure OpenAI key used for generating responses.
Make sure the dataset path in the configuration is correct.
Do not forget to set the API key and base URL of the LLM API endpoint provider in the `.env` file.
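
Before running the cells below, it can help to confirm that the expected variables are actually set. The following is a minimal sketch: the helper name `find_missing_settings` is hypothetical, and the variable names are the ones read by the configuration cell in this notebook.

```python
import os

# Environment variables read by the CONFIG cell in this notebook.
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "AZURE_API_VERSION", "AZURE_API_BASE"]


def find_missing_settings(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example_env = {"AZURE_OPENAI_API_KEY": "placeholder", "AZURE_API_VERSION": ""}
print(find_missing_settings(REQUIRED_VARS, example_env))
```

Calling `find_missing_settings(REQUIRED_VARS)` without a mapping checks the real environment, so it should run after `load_dotenv()`.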

In [None]:
# Load the environment variables from the .env file
load_dotenv()

CONFIG = {
    # The model name used to generate responses
    "GENERATE_MODEL_NAME": "gpt4",
    "AA_MODEL_NAME": "luminous-supreme-control",
    # Guideline name used in the Guideline Adherence check 
    "GUIDELINE_NAME": "Strict_Context",
    # dataset path
    # "DATASET_PATH": "nlga_dataset_AA_small.jsonl",
    "DATASET_PATH": "./nlga_dataset_AA.jsonl",
    "RESULTS_DIR": "./results/",
    "AZURE_OPENAI_API_KEY": os.getenv("AZURE_OPENAI_API_KEY"),
    "AZURE_API_VERSION": os.getenv("AZURE_API_VERSION"),
    "AZURE_API_BASE": os.getenv("AZURE_API_BASE"),
    # Azure deployments:
    "GPT_35_TURBO_16K": "gpt-35-turbo-16k-deployment",
    "GPT_4": "gpt4"
}

# Azure deployment used to evaluate 
EVAL_MODEL_NAME = "azure/gpt4"
# EVAL_MODEL_NAME = "azure/gpt35-16k"

def get_experiment_file_path(extension):
    filename = f"{EXPERIMENT_NAME.replace(' ', '_').replace('-', '_').lower()}_experiment.{extension}"
    return os.path.join(CONFIG['RESULTS_DIR'], filename)

jsonl_file_path = get_experiment_file_path('jsonl')
csv_file_path = get_experiment_file_path('csv')
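
To make the resulting file names concrete, here is a standalone sketch of the same naming rule. The function `experiment_file_path` is a hypothetical variant that takes the experiment name and results directory as arguments instead of reading the notebook's globals.

```python
import os


def experiment_file_path(experiment_name, results_dir, extension):
    """Build '<results_dir>/<normalized name>_experiment.<extension>'."""
    # Spaces and hyphens become underscores, and the name is lowercased.
    normalized = experiment_name.replace(" ", "_").replace("-", "_").lower()
    return os.path.join(results_dir, f"{normalized}_experiment.{extension}")


path = experiment_file_path("GPT_4_vs_Luminous_Supreme", "./results/", "jsonl")
print(path)
```

With the experiment name used in this notebook, the file name is `gpt_4_vs_luminous_supreme_experiment.jsonl` inside the results directory.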

In [None]:
# Utility functions for API and file operations
from openai import AzureOpenAI

def initialize_azure_openai_client():
    # gets the API Key from environment variable AZURE_OPENAI_API_KEY
    return AzureOpenAI(api_version=CONFIG["AZURE_API_VERSION"], 
                       azure_endpoint=CONFIG["AZURE_API_BASE"])

def ensure_directory_exists(path):
    # exist_ok avoids a separate existence check
    os.makedirs(path, exist_ok=True)

def read_dataset(path):
    if not os.path.exists(path):
        raise FileNotFoundError(f"The specified dataset path does not exist: {path}")
    return pl.read_ndjson(path)
# dataset = pl.read_ndjson(dataset_path).select(pl.col(["question", "ground_truth", "context"]))
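
The later cells read the columns `idx`, `question`, `context`, `ground_truth`, and `response` from each record. As a rough sketch, assuming those five columns are the ones a record needs, the JSONL lines can be checked with the standard library before loading them. The helper `find_incomplete_records` and the sample records are illustrative only.

```python
import json

# Columns the later cells read from each record (assumed to be the required set).
REQUIRED_KEYS = {"idx", "question", "context", "ground_truth", "response"}


def find_incomplete_records(lines):
    """Return (line_number, missing_keys) for every record lacking a required key."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((number, sorted(missing)))
    return problems


# Made-up sample lines: the second record is missing three columns.
sample = [
    json.dumps({"idx": 101, "question": "q", "context": "c",
                "ground_truth": "g", "response": "r"}),
    json.dumps({"idx": 102, "question": "q"}),
]
print(find_incomplete_records(sample))
```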

### Load the testing dataset

In [None]:
ensure_directory_exists(CONFIG['RESULTS_DIR'])
dataset = read_dataset(CONFIG['DATASET_PATH'])
filtered_dataset = dataset.filter(dataset["idx"] > 100)
filtered_dataset

### Let's define a prompt to generate responses

In [None]:
SYSTEM_PROMPT = """### INSTRUKTIONEN
Generiere bitte eine ANTWORT, die sich strikt an den gegebenen KONTEXT hält und präzise auf die gestellte FRAGE antwortet, ohne eigene Informationen des Modells hinzuzufügen. Falls die benötigte Information nicht im KONTEXT zu finden ist, antworte mit: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Vermeide Bezüge auf vorherige Ausgaben des Modells. Die Antwort soll auf dem bereitgestellten KONTEXT basieren. Sollte die FRAGE nicht direkt einem gesundheitsbezogenen Thema zuzuordnen sein oder nicht klar zu beantworten sein, erkläre kurz, warum die Anfrage nicht beantwortet werden kann und empfehle eine genauere Formulierung oder zusätzliche Informationen.

### KONTEXT
{context}"""
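
The prompt contains a single `{context}` placeholder that is filled per question. The sketch below shows how the system and user messages can be assembled from it. `TEMPLATE` is a shortened stand-in for the prompt above, and `build_messages` is a hypothetical helper.

```python
# Shortened stand-in for the notebook's SYSTEM_PROMPT; only the placeholder matters here.
TEMPLATE = """### INSTRUKTIONEN
Antworte nur anhand des KONTEXTS.

### KONTEXT
{context}"""


def build_messages(template, context, question):
    """Return chat messages with the context embedded in the system prompt."""
    return [
        {"role": "system", "content": template.format(context=context)},
        {"role": "user", "content": question},
    ]


messages = build_messages(TEMPLATE, "Example context text.", "Example question?")
print(messages[0]["content"])
```

`str.format` only parses the template, so curly braces inside the context value are inserted verbatim and do not need escaping.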

In [None]:
client = initialize_azure_openai_client()

def get_response(row, model):
    # row is expected to be a single-row DataFrame, so [0] extracts the value
    question = row['question'][0]
    context = row['context'][0]

    # Only GPT deployments are called here; fail early instead of
    # hitting an undefined `response` further down
    if "gpt" not in model:
        raise ValueError(f"Unsupported model for response generation: {model}")

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT.format(context=context)},
            {"role": "user", "content": question},
            # {"role": "assistant", "content": "Example answer"},
            # {"role": "user", "content": "First question/message for the model to actually respond to."}
        ]
    ).choices[0].message.content

    return {'question': question, 'context': context, 'response': response, 'ground_truth': row['ground_truth'][0], 'model': model}

In [None]:
def format_response(row):
    question = row['question'][0]
    context = row['context'][0]
    response = row['response'][0]
    ground_truth = row['ground_truth'][0]
    model = CONFIG["AA_MODEL_NAME"]

    return {'question': question, 'context': context, 'response': response, 'ground_truth': ground_truth, 'model': model}

### Generate responses for both models

In [None]:
start_time = time.time()
results_cmodel = [get_response(dataset[idx], CONFIG["GENERATE_MODEL_NAME"]) for idx in range(len(dataset))]
results_aleph_alpha = [format_response(dataset[idx]) for idx in range(len(dataset))]

In [None]:
print(f"Total execution time: {time.time() - start_time:.2f} seconds")

In [None]:
pl.Config.set_fmt_str_lengths(50)
df = pl.DataFrame(results_cmodel)
df

### Evaluating Experiments using UpTrain

UpTrain's EvalLLM provides an `evaluate_experiments` method which takes the input data to be evaluated along with the list of checks to be run and the name of the columns associated with the experiment. The method returns a list of dictionaries containing the results of the evaluation. 
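
The later cells parse result columns named like `score_<metric>_model_<name>`, which suggests one row per question with one score column per experiment value. The sketch below imitates that arrangement in plain Python. It only illustrates the shape and is not UpTrain's implementation; `pivot_by_experiment` is hypothetical and the scores are made up.

```python
def pivot_by_experiment(rows, exp_column, shared_keys):
    """Group rows by the shared keys and suffix the other keys with the experiment value."""
    pivoted = {}
    for row in rows:
        group_key = tuple(row[key] for key in shared_keys)
        target = pivoted.setdefault(group_key, {key: row[key] for key in shared_keys})
        suffix = f"_{exp_column}_{row[exp_column]}"
        for key, value in row.items():
            if key not in shared_keys and key != exp_column:
                target[key + suffix] = value
    return list(pivoted.values())


# Made-up scores for one question answered by two models.
rows = [
    {"question": "q1", "model": "gpt4", "score_response_relevance": 1.0},
    {"question": "q1", "model": "luminous-supreme-control", "score_response_relevance": 0.5},
]
print(pivot_by_experiment(rows, "model", ["question"]))
```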

In [None]:
from uptrain import EvalLLM, Evals, ResponseMatching, Settings

import nest_asyncio
nest_asyncio.apply()

settings = Settings(model=EVAL_MODEL_NAME, azure_api_key=CONFIG["AZURE_OPENAI_API_KEY"], azure_api_version=CONFIG["AZURE_API_VERSION"], azure_api_base=CONFIG["AZURE_API_BASE"])
eval_llm = EvalLLM(settings)

res = eval_llm.evaluate_experiments(
    project_name = f"{EXPERIMENT_NAME}-Experiments",
    data = results_cmodel + results_aleph_alpha,
    checks = [
        Evals.RESPONSE_CONCISENESS,
        ResponseMatching(method='llm'),  # Comment this if you don't have Ground Truth
        Evals.RESPONSE_COMPLETENESS_WRT_CONTEXT,
        Evals.FACTUAL_ACCURACY,
        Evals.RESPONSE_RELEVANCE,
        Evals.VALID_RESPONSE
    ],
    exp_columns=['model']
)

In [None]:
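# start_time was set before response generation, so this covers generation and evaluation
print(f"Total time for generation and evaluation: {time.time() - start_time:.2f} seconds")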

In [None]:
res_df = pl.DataFrame(res)
res_df

### Adding Guideline Adherence evaluations

In [None]:
guideline = "The response must strictly adhere to the provided context and not introduce external information. If the necessary information is absent from the context, respond with: 'Ihre Anfrage kann nicht mit den bereitgestellten Daten beantwortet werden. Bitte erläutern Sie Ihre Anfrage genauer oder geben Sie weitere Informationen an, falls notwendig.'. Should the question fall outside the health-related jurisdiction of the Landesgesundheitsamt Niedersachsen, it means the query is beyond the health-related scope and shouldn't be answered."

In [None]:
data_cmodel_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_cmodel]
data_aleph_alpha_for_guideline_eval = [{'question': i['question'], 'response': i['response']} for i in results_aleph_alpha]

In [None]:
from uptrain import GuidelineAdherence

def run_guideline_adherence_eval(data, guideline_name):
    return eval_llm.evaluate(
        data=data,
        checks=[GuidelineAdherence(guideline=guideline, guideline_name=guideline_name)]
    )

res_guideline_cmodel = run_guideline_adherence_eval(data_cmodel_for_guideline_eval, CONFIG["GUIDELINE_NAME"])
res_guideline_aleph_alpha = run_guideline_adherence_eval(data_aleph_alpha_for_guideline_eval, CONFIG["GUIDELINE_NAME"])

In [None]:
import math

def update_guidelines(guideline_name, res_guidelines, config_model_name):
    DEFAULT_SCORE = float("nan")
    DEFAULT_EXPLANATION = "No data available"
    score_name = 'score_' + guideline_name + '_adherence'
    explanation_name = 'explanation_' + guideline_name + '_adherence'
    # Suffix the keys with the model name so both models fit into one row
    score_key = score_name + '_model_' + config_model_name
    explanation_key = explanation_name + '_model_' + config_model_name

    for f in res_guidelines:
        f[score_key] = f.pop(score_name, DEFAULT_SCORE)

        if explanation_name in f:
            f[explanation_key] = f.pop(explanation_name)
        elif isinstance(f[score_key], float) and math.isnan(f[score_key]):
            # NaN never compares equal to itself, so test with math.isnan
            f[explanation_key] = DEFAULT_EXPLANATION

    return res_guidelines
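
Because the placeholder for a missing score is NaN, a plain equality test cannot detect it. The short sketch below shows the behaviour and a safe check; `is_missing_score` is a hypothetical helper, not part of the notebook.

```python
import math

DEFAULT_SCORE = float("nan")

# NaN is not equal to anything, including itself.
print(DEFAULT_SCORE == DEFAULT_SCORE)  # False
print(math.isnan(DEFAULT_SCORE))       # True


def is_missing_score(value):
    """Return True when value is the NaN placeholder used for a missing score."""
    return isinstance(value, float) and math.isnan(value)
```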

In [None]:
res_guideline_cmodel = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_cmodel, CONFIG["GENERATE_MODEL_NAME"])
res_guideline_aleph_alpha = update_guidelines(CONFIG["GUIDELINE_NAME"], res_guideline_aleph_alpha, CONFIG["AA_MODEL_NAME"])

In [None]:
def merge_lists(base_list, update_list):
    update_dict = {item['question']: item for item in update_list if 'question' in item}
    
    for item in base_list:
        question = item.get('question')
        if question and question in update_dict:
            # print(f"updating with {question}")            
            update_info = {key: val for key, val in update_dict[question].items() if key != 'response'}
            item.update(update_info)
    return base_list

res=merge_lists(res, res_guideline_cmodel)
res=merge_lists(res, res_guideline_aleph_alpha)
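
To see what the merge does on a small input, here is a compact restatement of the same logic with made-up rows, so that the block runs on its own. `merge_by_question` is a hypothetical name for this standalone version.

```python
def merge_by_question(base_rows, update_rows):
    """Copy every key except 'response' from the update row with the same question."""
    updates = {row["question"]: row for row in update_rows if "question" in row}
    for row in base_rows:
        match = updates.get(row.get("question"))
        if match is not None:
            row.update({k: v for k, v in match.items() if k != "response"})
    return base_rows


# Made-up rows: only "q1" has a guideline result to merge in.
base = [{"question": "q1", "score_a": 1.0}, {"question": "q2", "score_a": 0.0}]
extra = [{"question": "q1", "response": "ignored", "score_b": 0.5}]
print(merge_by_question(base, extra))
```

Rows are matched on the exact question text, so if the same question appears more than once in the update list, the last entry wins.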

### Creating Dataframe and displaying Average Score

In [None]:
res_df = pl.DataFrame(res)
res_df

In [None]:
def backup_and_save_df(df, file_path, file_type='csv'):
    # Fail loudly instead of silently writing nothing for an unknown type
    if file_type not in ('csv', 'jsonl'):
        raise ValueError(f"Unsupported file type: {file_type}")

    backup_dir = os.path.join(os.path.dirname(file_path), 'backups')
    os.makedirs(backup_dir, exist_ok=True)

    # Keep a timestamped copy of any previous result before overwriting it
    if os.path.exists(file_path):
        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        backup_filename = os.path.basename(file_path) + f".backup-{timestamp}"
        backup_path = os.path.join(backup_dir, backup_filename)
        shutil.copy(file_path, backup_path)

    if file_type == 'csv':
        print(f"Saving DataFrame to CSV at: {file_path}")
        df.write_csv(file_path)
    else:
        print(f"Saving DataFrame to NDJSON at: {file_path}")
        df.write_ndjson(file_path)
    
backup_and_save_df(res_df, jsonl_file_path, 'jsonl')
backup_and_save_df(res_df, csv_file_path, 'csv')
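
The backup naming can be checked in isolation. The sketch below computes the backup path for a given file and timestamp without touching the file system; `backup_path_for` and the example path are hypothetical.

```python
import os
from datetime import datetime


def backup_path_for(file_path, now):
    """Return '<dir>/backups/<name>.backup-<YYYYmmdd-HHMMSS>' for a file path."""
    backup_dir = os.path.join(os.path.dirname(file_path), "backups")
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    return os.path.join(backup_dir, f"{os.path.basename(file_path)}.backup-{timestamp}")


example = backup_path_for("results/run_experiment.csv", datetime(2024, 1, 31, 9, 5, 0))
print(example)
```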

In [None]:
def display_average_scores(df):
    # Only per-model score columns follow the 'score_<metric>_model_<name>' pattern;
    # other columns would not split into a metric and a model name
    score_columns = [col for col in df.columns if 'score' in col and '_model_' in col]
    data_for_table = []

    for column in score_columns:
        # Ignore NaN placeholders when averaging
        average = df[column].drop_nans().mean()

        metric_part, model_name = column.split('_model_', 1)
        metric_name = metric_part.replace('score_', '').replace('_', ' ').capitalize()

        data_for_table.append({
            "Model": model_name,
            "Metric": metric_name,
            "Average Score": average
        })

    results_table = pl.DataFrame(data_for_table)
    return results_table
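
The table labels are derived purely from the column names. As a standalone illustration of that parsing step, the sketch below splits a column name into a readable metric and a model name; `parse_score_column` is a hypothetical helper and the third column name is made up to show the case without a model suffix.

```python
def parse_score_column(column):
    """Split 'score_<metric>_model_<name>' into a readable metric and the model name."""
    metric_part, separator, model_name = column.partition("_model_")
    metric_name = metric_part.replace("score_", "").replace("_", " ").capitalize()
    # Without the '_model_' separator there is no model name to report.
    return metric_name, (model_name if separator else None)


print(parse_score_column("score_response_conciseness_model_gpt4"))
print(parse_score_column("score_Strict_Context_adherence_model_luminous-supreme-control"))
print(parse_score_column("score_without_suffix"))
```

Note that `capitalize()` lowercases everything after the first character, so `Strict_Context` is displayed as "Strict context".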

In [None]:
pl.Config.set_tbl_rows(32)
pl.Config.set_fmt_str_lengths(50)
display_average_scores(res_df)

In [None]:
_res_df = read_dataset(jsonl_file_path)
pl.Config.set_tbl_rows(32)
pl.Config.set_fmt_str_lengths(50)
display_average_scores(_res_df)