### Experiment: Measurement of model hallucination with HYDE enabled

**Background:** Several metrics are indicative of hallucination. Specifically, we look at _Answer Faithfulness_, because it identifies deviations from the provided context, which are likely to be hallucinated:

* [Answer Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html) - measures the factual consistency of the generated answer against the given context.
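
As a rough illustration, faithfulness can be read as the share of claims in the generated answer that the retrieved context supports. The sketch below assumes the per-claim verdicts are already available (in Ragas they come from an LLM judge); the function name and the example verdicts are hypothetical, and returning NaN for an answer with no claims is a choice of this sketch.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    if not claim_supported:
        return float("nan")  # no claims extracted, so the ratio is undefined
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: the answer was split into six claims, five of them supported.
verdicts = [True, True, True, True, True, False]
score = faithfulness_score(verdicts)
print(round(score, 6))
```

A score of 1.0 means every claim is grounded in the context; lower values indicate that part of the answer came from somewhere else.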

Secondarily, the following metrics indicate how effective the RAG system as a whole is at generating a correct answer:

* [Answer Similarity](https://docs.ragas.io/en/stable/concepts/metrics/semantic_similarity.html) - (also known as Answer Semantic Similarity) the cosine similarity between the embeddings of the generated answer and the ground-truth answer.
* [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html) - measures the accuracy of the generated answer when compared to the ground-truth answer.
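
To make the two secondary scores more concrete: Answer Similarity is a cosine similarity between embedding vectors, and Answer Correctness in Ragas combines a factual-overlap F1 score with that similarity as a weighted sum. The sketch below assumes the default weights of 0.75 (factual) and 0.25 (similarity); the vectors and the F1 value are made-up inputs, and the function names are hypothetical rather than Ragas API.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_correctness(factual_f1: float, similarity: float,
                       weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Weighted sum of factual F1 and semantic similarity (weights are assumed defaults)."""
    return weights[0] * factual_f1 + weights[1] * similarity


# Toy vectors pointing in the same direction, so their similarity is close to 1.
similarity = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
correctness = answer_correctness(factual_f1=0.5, similarity=similarity)
print(similarity, correctness)
```

Under these assumed weights, a factual F1 of 0.5 combined with a similarity of 0.881632 gives 0.595408, which is consistent with the first result row reported later in this notebook.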

**Test Approach:** A sample of questions is selected from the QA corpus. Answers are generated via the "v1" RAG implementation __with HYDE document creation enabled__. The resulting measures are compared to the [v0 baseline measurement](./experiment_hallucination_measurement.ipynb).
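
HYDE (Hypothetical Document Embeddings) retrieval first asks an LLM to draft a hypothetical passage that answers the question, then searches the corpus with the embedding of that passage instead of the question itself. How the v1 implementation does this is not shown in this notebook; the following is only a toy sketch, with a bag-of-words stand-in for the embedding model and a canned string in place of the LLM call. All names in it are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fake_llm(question: str) -> str:
    """Stand-in for the LLM call that drafts a hypothetical answer passage."""
    return "Conan O'Brien is a television host and writer who studied at Harvard University"


def hyde_retrieve(question: str, corpus: list[str], llm=fake_llm, k: int = 1) -> list[str]:
    """Rank corpus passages by similarity to the hypothetical passage, not the question."""
    hypothetical_doc = llm(question)
    query_vec = embed(hypothetical_doc)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


corpus = [
    "Harvard University alumni include television host and writer Conan O'Brien",
    "The Persian empire developed a rich cultural diversity",
]
top = hyde_retrieve("What tv host and writer went to Harvard?", corpus)
print(top)
```

The hypothetical passage shares more vocabulary with the relevant context than the short question does, which is the intuition behind the technique. The risk this experiment measures is that a wrong hypothetical passage can pull in unrelated context.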

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
os.environ["http_proxy"] = ''

In [1]:
# Common import
from deh import settings
from deh.assessment import QASetRetriever
from deh.eval import generate_experiment_dataset

import pandas as pd
import json
import os
from pathlib import Path

#### Test Configuration

In [2]:
num_samples: int = 5
experiment_folder: str = "../../data/evaluations/hallucination-measurement-v1-hyde/"
qa_data_set_file: str = "../../data/qas/squad_qas.tsv"

# Create the experiment folder (no-op if it already exists):
Path(experiment_folder).mkdir(parents=True, exist_ok=True)

#### Sample QA dataset

In [5]:
qa_set = QASetRetriever.get_qasets(
    file_path=qa_data_set_file,
    sample_size=num_samples,
)

print(f"{len(qa_set)} questions sampled from QA corpus ({qa_data_set_file})")


In [12]:
from urllib.parse import urlencode


def api_endpoint(**kwargs) -> str:
    """Endpoint for context retrieval."""
    hyde = True  # Enable HYDE for the context retrieval
    evaluation = False

    # URL-encode all parameters so that questions containing spaces, "&" or "?" survive intact.
    query_params = urlencode({"h": hyde, "e": evaluation, **kwargs})
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}"

def convert(response) -> pd.DataFrame:
    """Converts retrieved JSON response to Pandas DataFrame"""
    response_df = pd.json_normalize(
        data=response["response"], record_path="context", meta=["answer","question", "hyde", ["evaluation", "grade"]]
    )

    # Add reference/evaluation values:
    response_df["reference.ground_truth"] = response["reference"]["ground_truth"]
    response_df["reference.is_impossible"] = response["reference"]["is_impossible"]

    # Add full JSON response in case it is needed:
    response_df["json"] = json.dumps(response)
    return response_df

exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store the generated response:
exp_df.to_pickle( f"{experiment_folder}/response-v1.pkl" )
exp_df[0:1]


Processing 1 of 5 question/answer pairs.
Processing 2 of 5 question/answer pairs.
Processing 3 of 5 question/answer pairs.
Processing 4 of 5 question/answer pairs.
Processing 5 of 5 question/answer pairs.


Unnamed: 0,id,page_content,type,metadata.source,metadata.similarity_score,answer,question,hyde,evaluation.grade,reference.ground_truth,reference.is_impossible,json,reference_id
0,,Other: Civil rights leader W. E. B. Du Bois; p...,Document,../data/contexts/context_650.context,0.486736,"Conan O'Brien, a TV host and writer, attended ...",What tv host and writer went to Harvard?,True,,Conan O'Brien,False,"{""response"": {""question"": ""What tv host and wr...",1
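
Because questions contain spaces and punctuation, their values need percent-encoding when they are placed in a query string. A minimal sketch with the standard library follows; the host `localhost:8000` and the parameter name `q` are assumptions for illustration only, not the service's documented interface.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical parameters: "h" and "e" mirror the flags used above, "q" is an assumed name.
params = {"h": True, "e": False, "q": "What tv host and writer went to Harvard?"}
url = f"http://localhost:8000/answer?{urlencode(params)}"
print(url)

# Round trip: decoding the query string recovers the original question.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```

Building the query with `urlencode` keeps a literal `?` or `&` inside a question from being interpreted as a URL delimiter.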


#### Generate Measures for Response

##### Evaluation Model Configuration

In [14]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from enum import Enum


class LLM_PLATFORMS(Enum):
    OPENAI = 1
    OLLAMA = 2

# Either local (Ollama) or remote (OpenAI) evaluation models can be used:
evaluation_model = LLM_PLATFORMS.OPENAI


In [15]:
if evaluation_model == LLM_PLATFORMS.OPENAI:
    print ("Using OpenAI platform.")
    llm = ChatOpenAI(model="gpt-4o-mini")
    embeddings = OpenAIEmbeddings()
else:
    print ("Using OLLAMA platform")
    llm = Ollama(
        base_url=settings.OLLAMA_URL,
        model=settings.ASSESSMENT_LLM_MODEL,
    )
    embeddings = OllamaEmbeddings(
        base_url=settings.OLLAMA_URL,
        model=settings.ASSESSMENT_EMBEDDING_MODEL,
    )

Using OpenAI platform.


##### Evaluation Responses

In [3]:
from datasets import Dataset
from ragas import evaluate
import ragas.metrics as metrics
from ragas.run_config import RunConfig




In [17]:
# Convert to Dataset
responses_df = pd.read_pickle(f"{experiment_folder}/response-v1.pkl")

responses_df = responses_df.groupby("reference_id").agg(
    retrieved_contexts = ('page_content', lambda x: list(x)),
    question = ('question','first'),
    ground_truth = ('reference.ground_truth', 'first'),
    answer = ('answer', 'first')
    )

responses_df[0:1]


reference_id,retrieved_contexts,question,ground_truth,answer
1,[Other: Civil rights leader W. E. B. Du Bois; ...,What tv host and writer went to Harvard?,Conan O'Brien,"Conan O'Brien, a TV host and writer, attended ..."


In [18]:

responses_ds = Dataset.from_pandas( responses_df)

evaluation_ds = evaluate(
    dataset = responses_ds,
    metrics = [metrics.answer_similarity, metrics.faithfulness, metrics.answer_correctness],
    embeddings = embeddings,
    llm = llm,
    run_config=RunConfig(
        max_workers=5
    ),
    raise_exceptions=False
)


Evaluating: 100%|██████████| 15/15 [00:15<00:00,  1.01s/it]


In [19]:
eval_df = evaluation_ds.to_pandas()

# Evaluation metadata
eval_df["evaluation.llm_model"] = "openai.gpt-4o-mini"
eval_df["evaluation.embedding_model"] = "openai.Text-embedding-ada-002-v2"

eval_df[0:2]

Unnamed: 0,question,contexts,answer,ground_truth,answer_similarity,faithfulness,answer_correctness,evaluation.llm_model,evaluation.embedding_model
0,What tv host and writer went to Harvard?,[Other: Civil rights leader W. E. B. Du Bois; ...,"Conan O'Brien, a TV host and writer, attended ...",Conan O'Brien,0.881632,0.833333,0.595408,openai.gpt-4o-mini,openai.Text-embedding-ada-002-v2
1,Where were Persians more successful compared t...,[A rich cultural diversity developed during th...,I don't know where Persians were more successf...,reaching the highest-post in the government,0.736512,0.5,0.184128,openai.gpt-4o-mini,openai.Text-embedding-ada-002-v2


In [20]:
eval_df.to_pickle( f"{experiment_folder}/results-v0-openai.pkl" )

#### Faithful vs. Non-Faithful Responses

In [None]:
eval_df = pd.read_pickle( f"{experiment_folder}/results-v0-openai.pkl" )

In [21]:
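# A response counts as faithful when at least 75% of its claims are supported by the context.
faithfulness_threshold = 0.75
total = len(eval_df)
# Rows with a NaN faithfulness score fail the comparison and are counted as not faithful.
faithful = len(eval_df[eval_df["faithfulness"] >= faithfulness_threshold])

print(f"% of responses indicated as faithful: {faithful / total * 100}%")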

% of responses indicated as faithful: 20.0%
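
Because `evaluate` was called with `raise_exceptions=False`, a metric that fails for a row is reported as NaN instead of raising. NaN never satisfies `>=`, so such rows are silently counted as not faithful in the calculation above. The sketch below shows one way to report them separately; the function name is hypothetical, and only the first two scores come from this notebook, the rest are invented for illustration.

```python
import math


def faithful_share(scores: list[float], threshold: float = 0.75) -> tuple[float, int]:
    """Return (share of scored responses at or above threshold, count of unscored responses)."""
    scored = [s for s in scores if not math.isnan(s)]
    unscored = len(scores) - len(scored)
    if not scored:
        return float("nan"), unscored
    faithful = sum(s >= threshold for s in scored)
    return faithful / len(scored), unscored


# First two values are the scores shown earlier; the remaining three are hypothetical.
scores = [0.833333, 0.5, 0.6, float("nan"), 0.9]
share, unscored = faithful_share(scores)
print(f"{share:.1%} faithful, {unscored} unscored")
```

With only five samples, a single unscored or borderline response shifts the percentage by 20 points or more, so the figure is best read as indicative rather than conclusive.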
