### Experiment: Impact of context on RAG performance metrics

**Background:**
To establish a baseline and test the thesis that context is a key element of hallucination reduction, we measure how RAGAS metrics change when the LLM answers with retrieved context versus without it.

**Test Approach:**
Ask the LLM each question twice, once with context and once without. We expect the RAGAS measures with context to be significantly better than those without.


In [1]:
# Common import
from deh.assessment import QASetRetriever
from deh.assessment import QASetType
from deh import settings
from deh.eval import generate_experiment_dataset

import pandas as pd
import os
from pathlib import Path



#### Test Configuration

In [2]:
num_samples: int = 5
experiment_folder: str = "../../data/evaluation/no-context-prompt-experiment/"
qa_data_set_file: str = "../../data/qas/squad_qas.tsv"

# Create the experiment folder (no-op if it already exists):
Path(experiment_folder).mkdir(parents=True, exist_ok=True)


#### Sample QA dataset

In [3]:
# Only get answerable (possible) questions:
qa_set = QASetRetriever.get_qasets(
    file_path=qa_data_set_file,
    sample_size=num_samples,
    qa_type=QASetType.POSSIBLE_ONLY,
)

print(f"{len(qa_set)} questions sampled from QA corpus ({qa_data_set_file})")

5 questions sampled from QA corpus (../../data/qas/squad_qas.tsv)


#### Get Responses with default prompt (using context)

In [4]:

def convert(response) -> pd.DataFrame:
    """Converts retrieved JSON response to Pandas DataFrame"""
    response_df = pd.json_normalize(
        data=response
    )

    response_df["reference.ground_truth"] = response["reference"]["ground_truth"]
    response_df["reference.is_impossible"] = response["reference"]["is_impossible"]

    return response_df

def api_endpoint(**kwargs) -> str:
    """Endpoint for answer.
    parameters:
    - hyde (h) = False
    - evaluation (e) = False
    - llm prompt selection (lp) = 1
    """
    query_params = "&".join([f"{key}={kwargs[key]}" for key in kwargs])
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}&h=False&e=False&lp=1"

# Collect response:
exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store dataframe:
exp_df.to_pickle( f"{experiment_folder}/prompt_1.pkl" )
exp_df[0:1]


Processing 1 of 5 question/answer pairs.
Processing 2 of 5 question/answer pairs.
Processing 3 of 5 question/answer pairs.
Processing 4 of 5 question/answer pairs.
Processing 5 of 5 question/answer pairs.


Unnamed: 0,response.question,response.hyde,response.answer,response.context,response.evaluation.grade,response.evaluation.description,response.execution_time,system_settings.gpu_enabled,system_settings.llm_model,system_settings.llm_prompt,...,system_settings.text_chunk_size,system_settings.text_chunk_overlap,system_settings.context_similarity_threshold,system_settings.context_docs_retrieved,system_settings.docs_loaded,reference.question,reference.ground_truth,reference.is_impossible,reference.ref_context_id,reference_id
0,What century did the name of the Rhine come from?,False,1st century BC.,"[{'id': None, 'metadata': {'source': '../data/...",,,00:00:02,True,llama3.1:8b-instruct-q3_K_L,rlm/rag-prompt-llama,...,1500,100,1.0,6,1256,What century did the name of the Rhine come from?,1st,False,1073,1
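
The endpoint builder above joins query parameters with plain string formatting, so a question containing characters such as `&`, `?`, or spaces could be misread by the server unless something downstream encodes it. The following is a minimal sketch of a percent-encoding variant using the standard library's `urllib.parse.urlencode`. The names `build_answer_url`, `host`, and the parameter `q` are hypothetical; the keyword arguments that `generate_experiment_dataset` actually passes are not shown in this notebook.

```python
from urllib.parse import urlencode


def build_answer_url(host: str, prompt_id: int, **params) -> str:
    """Build the /answer URL with percent-encoded query parameters.

    `host` stands in for settings.API_ANSWER_ENDPOINT. The fixed flags
    mirror the notebook: hyde (h) off, evaluation (e) off, and the
    prompt selection (lp) chosen by `prompt_id`.
    """
    query = dict(params)
    query.update({"h": False, "e": False, "lp": prompt_id})
    return f"http://{host}/answer?{urlencode(query)}"


# Hypothetical host and parameter name, for illustration only.
url = build_answer_url("localhost:8000", 1, q="Rhine & Danube?")
print(url)
```

Passing `prompt_id` as an argument would also allow one builder to serve both the context (`lp=1`) and no-context (`lp=2`) runs.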


#### Get Responses without context provided

In [5]:
def api_endpoint(**kwargs) -> str:
    """Endpoint for answer.
    parameters:
    - hyde (h) = False
    - evaluation (e) = False
    - llm prompt selection (lp) = 2
    """
    query_params = "&".join([f"{key}={kwargs[key]}" for key in kwargs])
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}&h=False&e=False&lp=2"

# Collect response:
exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store dataframe:
exp_df.to_pickle( f"{experiment_folder}/prompt_2.pkl" )
exp_df[0:1]


Processing 1 of 5 question/answer pairs.
Processing 2 of 5 question/answer pairs.
Processing 3 of 5 question/answer pairs.
Processing 4 of 5 question/answer pairs.
Processing 5 of 5 question/answer pairs.


Unnamed: 0,response.question,response.hyde,response.answer,response.context,response.evaluation.grade,response.evaluation.description,response.execution_time,system_settings.gpu_enabled,system_settings.llm_model,system_settings.llm_prompt,...,system_settings.text_chunk_size,system_settings.text_chunk_overlap,system_settings.context_similarity_threshold,system_settings.context_docs_retrieved,system_settings.docs_loaded,reference.question,reference.ground_truth,reference.is_impossible,reference.ref_context_id,reference_id
0,What century did the name of the Rhine come from?,False,"The name ""Rhine"" comes from a Celtic tribe, th...","[{'id': None, 'metadata': {'source': '../data/...",,,00:00:01,True,llama3.1:8b-instruct-q3_K_L,rlm/rag-prompt-llama,...,1500,100,1.0,6,1256,What century did the name of the Rhine come from?,1st,False,1073,1


#### Generate Measures for Response

##### Evaluation Model Configuration

In [6]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Either local (Ollama) or remote (OpenAI) evaluation models can be used:

embedding = OllamaEmbeddings(
    base_url=settings.OLLAMA_URL,
    model=settings.ASSESSMENT_EMBEDDING_MODEL,
)

llm = Ollama(
    base_url=settings.OLLAMA_URL,
    model=settings.ASSESSMENT_LLM_MODEL,
)

openai_llm = ChatOpenAI(model="gpt-4o-mini")
openai_embedding = OpenAIEmbeddings()

##### Evaluation Responses

In [7]:
from datasets import Dataset
from ragas import evaluate
import ragas.metrics as metrics
from ragas.run_config import RunConfig

In [8]:
comp_data = [
    ("prompt_1.pkl","prompt_1_eval.pkl"),
    ("prompt_2.pkl", "prompt_2_eval.pkl")
]

In [9]:
for input_file, output_file in comp_data:

    # Load Data and format
    responses_df = pd.read_pickle(f"{experiment_folder}/{input_file}")
    responses_df = responses_df.rename(columns={
        "response.question" : "question",
        "response.answer" : "answer",
        "reference.ground_truth" : "ground_truth"
    })[["question", "answer", "ground_truth"]]

    
    # Convert to Dataset
    responses_ds = Dataset.from_pandas( responses_df)

    # Evaluate
    evaluation_ds = evaluate(
        dataset = responses_ds,
        metrics = [metrics.answer_similarity, metrics.answer_correctness],
        embeddings = embedding,
        llm = llm,
        run_config=RunConfig(
            max_workers=5
        ),
        raise_exceptions=False
    )

    eval_df = evaluation_ds.to_pandas()

    # Evaluation metadata
    eval_df["evaluation.llm_model"] = "ollama"
    eval_df["evaluation.embedding_model"] = "ollama"

    # Persist
    eval_df.to_pickle( f"{experiment_folder}/{output_file}" )

    

Evaluating: 100%|██████████| 10/10 [01:22<00:00,  8.25s/it]
Evaluating:  50%|█████     | 5/10 [00:03<00:03,  1.40it/s]
Failed to parse output. Returning None.
Evaluating: 100%|██████████| 10/10 [01:33<00:00,  9.32s/it]
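
Because the evaluation runs with `raise_exceptions=False`, a failure such as the "Failed to parse output" message above does not stop the run; a metric that could not be computed may instead be left as NaN in the result. Since pandas skips NaN when averaging, it can be worth counting missing scores before comparing runs. A minimal sketch, assuming the metric column names used in this notebook; `count_missing_scores` is a hypothetical helper and the demo values are illustrative, not real results.

```python
import numpy as np
import pandas as pd

METRIC_COLUMNS = ["answer_similarity", "answer_correctness"]


def count_missing_scores(eval_df: pd.DataFrame) -> pd.Series:
    """Return the number of missing (NaN/None) scores per metric column."""
    return eval_df[METRIC_COLUMNS].isna().sum()


# Illustrative frame, not real results: one correctness score is missing.
demo_df = pd.DataFrame({
    "answer_similarity": [0.65, 0.73, 0.86],
    "answer_correctness": [0.66, np.nan, 0.81],
})

missing = count_missing_scores(demo_df)
print(missing)
```

In the notebook, the same call could be applied to `eval_df` right after `evaluation_ds.to_pandas()`.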


In [10]:
eval_df[0:2]

Unnamed: 0,question,answer,ground_truth,answer_similarity,answer_correctness,evaluation.llm_model,evaluation.embedding_model
0,What century did the name of the Rhine come from?,"The name ""Rhine"" comes from a Celtic tribe, th...",1st,0.646525,0.661631,ollama,ollama
1,What type of flower is sought on Midsummer's Eve?,"Lily or wildflowers, often associated with Sco...",fern,0.725134,0.181284,ollama,ollama


#### Load and merge Experiment Datasets for comparison

In [11]:
# Load experiment results:
context_retr_df = pd.read_pickle(f"{experiment_folder}/prompt_1_eval.pkl")
no_context_retr_df = pd.read_pickle(f"{experiment_folder}/prompt_2_eval.pkl")

In [12]:
# Merge the two datasets side by side on the row index for comparison:
combined_df = pd.merge( context_retr_df, no_context_retr_df, left_index=True, right_index=True, suffixes=["_context", "_no_context"])
combined_df[0:2]

Unnamed: 0,question_context,answer_context,ground_truth_context,answer_similarity_context,answer_correctness_context,evaluation.llm_model_context,evaluation.embedding_model_context,question_no_context,answer_no_context,ground_truth_no_context,answer_similarity_no_context,answer_correctness_no_context,evaluation.llm_model_no_context,evaluation.embedding_model_no_context
0,What century did the name of the Rhine come from?,1st century BC.,1st,0.772291,0.693073,ollama,ollama,What century did the name of the Rhine come from?,"The name ""Rhine"" comes from a Celtic tribe, th...",1st,0.646525,0.661631,ollama,ollama
1,What type of flower is sought on Midsummer's Eve?,The sought flower is the fern flower.,fern,0.914749,0.728687,ollama,ollama,What type of flower is sought on Midsummer's Eve?,"Lily or wildflowers, often associated with Sco...",fern,0.725134,0.181284,ollama,ollama
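
The merge above pairs rows by index, which only yields a valid comparison if both result files list the same questions in the same order. A minimal sketch of a check for that assumption follows; `assert_aligned` is a hypothetical helper, and the demo frames use placeholder questions and rounded scores for illustration.

```python
import pandas as pd


def assert_aligned(combined: pd.DataFrame) -> None:
    """Raise if any merged row pairs two different questions."""
    mismatched = combined["question_context"] != combined["question_no_context"]
    if mismatched.any():
        raise ValueError(f"{int(mismatched.sum())} row(s) pair different questions")


# Placeholder data for illustration only.
with_ctx = pd.DataFrame({"question": ["Q1", "Q2"], "answer_similarity": [0.77, 0.91]})
without_ctx = pd.DataFrame({"question": ["Q1", "Q2"], "answer_similarity": [0.65, 0.73]})

combined = pd.merge(
    with_ctx,
    without_ctx,
    left_index=True,
    right_index=True,
    suffixes=("_context", "_no_context"),
)
assert_aligned(combined)  # passes: both frames list the questions in the same order
```

If the orders could differ, merging on the question text (or on a shared reference id, if one is kept in the frames) would be a safer key than the row index.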


##### RAGAS comparison (context - no_context)

In [13]:
# Metric comparison:
combined_df["answer_similarity_diff"] = combined_df["answer_similarity_context"] - combined_df["answer_similarity_no_context"]
combined_df["answer_correctness_diff"] = combined_df["answer_correctness_context"] - combined_df["answer_correctness_no_context"]


combined_df[["answer_similarity_diff", "answer_correctness_diff"]][0:2]


| | answer_similarity_diff | answer_correctness_diff |
|---|---|---|
| 0 | 0.125765 | 0.031441 |
| 1 | 0.189615 | 0.547404 |


In [14]:
# For Answer Similarity:
print(f"""
Answer Similarity Difference:
Min: {combined_df["answer_similarity_diff"].min()}
Avg: {combined_df["answer_similarity_diff"].mean()}
Max: {combined_df["answer_similarity_diff"].max()}
""")



Answer Similarity Difference:
Min: -0.16755055549431708
Avg: 0.0537072351784353
Max: 0.18961495380685867



In [15]:
# For Answer Correctness:
print(f"""
Answer Correctness Difference:
Min: {combined_df["answer_correctness_diff"].min()}
Avg: {combined_df["answer_correctness_diff"].mean()}
Max: {combined_df["answer_correctness_diff"].max()}
""")



Answer Correctness Difference:
Min: -0.14188763887357947
Avg: 0.10414109450889449
Max: 0.5474037384517146



In [16]:
combined_df

Unnamed: 0,question_context,answer_context,ground_truth_context,answer_similarity_context,answer_correctness_context,evaluation.llm_model_context,evaluation.embedding_model_context,question_no_context,answer_no_context,ground_truth_no_context,answer_similarity_no_context,answer_correctness_no_context,evaluation.llm_model_no_context,evaluation.embedding_model_no_context,answer_similarity_diff,answer_correctness_diff
0,What century did the name of the Rhine come from?,1st century BC.,1st,0.772291,0.693073,ollama,ollama,What century did the name of the Rhine come from?,"The name ""Rhine"" comes from a Celtic tribe, th...",1st,0.646525,0.661631,ollama,ollama,0.125765,0.031441
1,What type of flower is sought on Midsummer's Eve?,The sought flower is the fern flower.,fern,0.914749,0.728687,ollama,ollama,What type of flower is sought on Midsummer's Eve?,"Lily or wildflowers, often associated with Sco...",fern,0.725134,0.181284,ollama,ollama,0.189615,0.547404
2,What type of organization would need large qua...,Steel and aerospace industries.,hospitals,0.688686,0.672172,ollama,ollama,What type of organization would need large qua...,A hospital or steel manufacturing facility.,hospitals,0.856237,0.814059,ollama,ollama,-0.167551,-0.141888
3,What type of radar was used to classify trees ...,Synthetic Aperture Radar (SAR).,Synthetic aperture,0.817274,0.63289,ollama,ollama,What type of radar was used to classify trees ...,Pulsed LIDAR (Light Detection and Ranging) tec...,Synthetic aperture,0.790144,0.626108,ollama,ollama,0.02713,0.006782
4,What is missing a theory on quantum gravity?,A theory of quantum gravity is still missing.,General relativity,0.815217,0.632376,ollama,ollama,What is missing a theory on quantum gravity?,"A complete, consistent, and experimentally ver...",General relativity,0.72164,0.55541,ollama,ollama,0.093577,0.076966
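
With only five questions, the minimum, mean, and maximum can be dominated by a single row, so it may help to also look at the median and at how many questions actually improved. The sketch below is one way to do that, using the per-question differences copied from the table above (as displayed, rounded to six decimals). It is an illustrative summary, not a significance test.

```python
import pandas as pd

# Per-question differences (context - no_context), copied from the
# notebook's displayed results, rounded to six decimals.
diffs = pd.DataFrame({
    "answer_similarity_diff": [0.125765, 0.189615, -0.167551, 0.027130, 0.093577],
    "answer_correctness_diff": [0.031441, 0.547404, -0.141888, 0.006782, 0.076966],
})

# One row per statistic, one column per metric.
summary = diffs.agg(["min", "median", "mean", "max"])

# Number of questions where the context run scored higher.
improved = (diffs > 0).sum()

print(summary)
print(improved)
```

For both metrics, four of the five questions score higher with context; the exception is row 2, where the no-context answer mentions a hospital and the context answer does not.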


In [36]:
# Optional: export the combined results to CSV
combined_df.to_csv(f"{experiment_folder}/combined_data.csv")