### Experiment: Establish a Baseline Measurement of Model Hallucination

**Background:** Several metrics are indicative of hallucination.  
Specifically, we will look at _Answer Faithfulness_, since it identifies deviations from the provided context, which are likely to be hallucinated:

* [Answer Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html) - measures the factual consistency of generated answer vs. given context.
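
As a rough illustration of the idea behind this metric (not the Ragas implementation itself): the Ragas documentation describes faithfulness as the fraction of claims in the generated answer that can be inferred from the given context. The sketch below assumes the per-claim verdicts have already been produced by a judge model; the function name `faithfulness_score` and the sample verdicts are hypothetical.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Return the fraction of answer claims supported by the context (0.0 to 1.0)."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    # True counts as 1 and False as 0, so the sum is the number of supported claims.
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: three claims extracted from an answer, two of them supported.
verdicts = [True, True, False]
score = faithfulness_score(verdicts)
print(f"faithfulness = {score:.2f}")
```

A score below 1.0 means at least one claim in the answer could not be traced back to the context, which is the signal this experiment treats as likely hallucination.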

Secondarily, the following metrics indicate how effective the overall RAG system is at generating a correct answer:

* [Answer Similarity](https://docs.ragas.io/en/stable/concepts/metrics/semantic_similarity.html) - (aka Answer Semantic Similarity) the cosine similarity between the embeddings of the generated answer and the ground-truth answer.
* [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html) - measures the accuracy of the generated answer compared with the ground-truth answer.
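
To make the similarity measure concrete, here is a minimal sketch of cosine similarity over two embedding vectors. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper written for illustration, not a Ragas function.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up embeddings for a generated answer and a ground-truth answer.
answer_vec = [1.0, 0.0, 1.0]
truth_vec = [1.0, 1.0, 0.0]
similarity = cosine_similarity(answer_vec, truth_vec)
print(f"answer similarity = {similarity:.2f}")
```

Because the measure depends only on vector direction, two answers that embed to parallel vectors score 1.0 regardless of their length.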

**Test Approach:** A sample of questions will be selected from the QA corpus. Answers will be generated with the "v0" RAG implementation (excluding mitigation and advanced processing). The measures above will then be computed to establish a baseline for comparison with future experiments.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

# Clear the HTTP proxy setting for this session
os.environ["http_proxy"] = ""

In [3]:
# Common import
from deh.assessment import QASetRetriever
from deh import settings
from deh.eval import generate_experiment_dataset

import pandas as pd
import json
import os
from pathlib import Path



#### Test Configuration

In [4]:
num_samples: int = 2
experiment_folder: str = "../../data/evaluation/baseline-v0/"
qa_data_set_file: str = "../../data/qas/squad_qas.tsv"

# Create the experiment folder (no-op if it already exists):
Path(experiment_folder).mkdir(parents=True, exist_ok=True)

#### Sample QA dataset

In [5]:
qa_set = QASetRetriever.get_qasets(
    file_path=qa_data_set_file,
    sample_size=num_samples,
)

print(f"{len(qa_set)} questions sampled from QA corpus ({qa_data_set_file})")

2 questions sampled from QA corpus (../../data/qas/squad_qas.tsv)


In [7]:
from urllib.parse import urlencode

def api_endpoint(**kwargs) -> str:
    """Build the answer-endpoint URL, passing kwargs as query parameters."""
    hyde = False
    evaluation = False

    # URL-encode values so questions containing spaces, "&" or "?" survive intact
    query_params = urlencode({"h": hyde, "e": evaluation, **kwargs})
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}"

def convert(response) -> pd.DataFrame:
    """Converts retrieved JSON response to Pandas DataFrame"""
    response_df = pd.json_normalize(
        data=response["response"], record_path="context", meta=["answer","question", "hyde", ["evaluation", "grade"]]
    )

    # Add reference/evaluation values:
    response_df["reference.ground_truth"] = response["reference"]["ground_truth"]
    response_df["reference.is_impossible"] = response["reference"]["is_impossible"]

    # Add the full JSON response in case it is needed:
    response_df["json"] = json.dumps(response)
    return response_df

exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store the generated response:
exp_df.to_pickle(f"{experiment_folder}/baseline-response-v0.pkl")
exp_df[0:1]


Processing 1 of 2 question/answer pairs.
Processing 2 of 2 question/answer pairs.


|  | id | page_content | type | metadata.source | metadata.similarity_score | answer | question | hyde | evaluation.grade | reference.ground_truth | reference.is_impossible | json | reference_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  | Harvard's academic programs operate on a semes... | Document | /data/contexts/context_640.context | 0.659308 | Shortening the admission event is also referre... | What is another term for shortening the admiss... | False |  | shortening the cutoff | False | {"response": {"question": "What is another ter... | 1 |
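
Since each question is sent to the endpoint as a query parameter, it can be useful to check offline that a question with special characters round-trips through the URL unchanged. The sketch below uses only the standard library; the host `localhost:8000`, the parameter name `q`, and the helper `build_answer_url` are assumptions for illustration, not values taken from `settings` or from `generate_experiment_dataset`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_answer_url(host: str, **params) -> str:
    """Assemble an /answer URL with percent-encoded query parameters."""
    # "h" (HyDE) and "e" (evaluation) mirror the flags used by the notebook's endpoint builder.
    query = urlencode({"h": False, "e": False, **params})
    return f"http://{host}/answer?{query}"


# Hypothetical host and parameter name; the question deliberately contains "&" and "?".
url = build_answer_url("localhost:8000", q="Who wrote Hamlet & when?")
print(url)

# Decode the query string again to confirm the question is recovered intact.
parsed = parse_qs(urlsplit(url).query)
print(parsed["q"][0])
```

Without encoding, the `&` inside the question would be read as a parameter separator and the question would be truncated on the server side.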


#### Generate Measures for Response

##### Evaluation Model Configuration

In [8]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama

embedding = OllamaEmbeddings(
    base_url=settings.OLLAMA_URL,
    model=settings.ASSESSMENT_EMBEDDING_MODEL,
)

llm = Ollama(
    base_url=settings.OLLAMA_URL,
    model=settings.ASSESSMENT_LLM_MODEL,
)

##### Evaluation Responses

In [9]:
from datasets import Dataset
from ragas import evaluate
import ragas.metrics as metrics

In [25]:
# Load the stored responses and collapse them to one row per question,
# gathering each question's retrieved contexts into a list
responses_df = pd.read_pickle(f"{experiment_folder}/baseline-response-v0.pkl")

responses_df = responses_df.groupby("reference_id").agg(
    retrieved_contexts=("page_content", list),
    question=("question", "first"),
    ground_truth=("reference.ground_truth", "first"),
    answer=("answer", "first"),
)

responses_df[0:1]


| reference_id | retrieved_contexts | question | ground_truth | answer |
|---|---|---|---|---|
| 1 | [Harvard's academic programs operate on a seme... | What is another term for shortening the admiss... | shortening the cutoff | Shortening the admission event is also referre... |


In [28]:

responses_ds = Dataset.from_pandas(responses_df)

evaluation_ds = evaluate(
    dataset=responses_ds,
    metrics=[metrics.answer_similarity, metrics.faithfulness, metrics.answer_correctness],
    embeddings=embedding,
    llm=llm,
)


Evaluating:  50%|█████     | 1/2 [00:22<00:22, 22.28s/it]Failed to parse output. Returning None.
Evaluating: 100%|██████████| 2/2 [00:49<00:00, 24.89s/it]


In [29]:
evaluation_ds.to_pandas()

|  | question | contexts | answer | ground_truth | faithfulness |
|---|---|---|---|---|---|
| 0 | What is another term for shortening the admiss... | [Harvard's academic programs operate on a seme... | Shortening the admission event is also referre... | shortening the cutoff |  |
| 1 | How many elements did Aristotle believe the te... | [Aristotle provided a philosophical discussion... | According to the provided context, Aristotle b... | four | 1.0 |
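
One of the two rows above has no faithfulness score, which is consistent with the `Failed to parse output` message emitted during evaluation. When a baseline is aggregated, it may help to report how many rows failed to score alongside the mean of the rows that did, so a missing value is not silently read as a zero. A minimal sketch, assuming missing scores arrive as NaN or `None`; `summarize_scores` is a hypothetical helper and the input mirrors the two rows shown.

```python
import math


def summarize_scores(scores: list) -> dict:
    """Summarize metric scores, counting NaN/None entries as failed evaluations."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    return {
        "n_total": len(scores),
        "n_failed": len(scores) - len(valid),
        # The mean is taken over scored rows only; None if nothing was scored.
        "mean": sum(valid) / len(valid) if valid else None,
    }


# Faithfulness values mirroring the two rows above: one unscored, one scored 1.0.
faithfulness = [float("nan"), 1.0]
summary = summarize_scores(faithfulness)
print(summary)
```

With only two samples, the mean rests on a single scored row, so the failure count matters as much as the average when judging this baseline.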
