### Experiment: Impact of context on RAG performance metrics

**Background:**
To establish a baseline and validate the thesis that context is a key element of hallucination reduction, we measure the change in RAGAS metrics between an LLM that is given retrieved context and one that is not.

**Test Approach:**
Ask the LLM the same questions with and without context. We expect the RAGAS measures with context to be significantly better than those without.


In [1]:
# Common imports
from deh.assessment import QASetRetriever
from deh.assessment import QASetType
from deh import settings
from deh.eval import generate_experiment_dataset

import pandas as pd
import os
from pathlib import Path



#### Test Configuration

In [2]:
num_samples:int = 5
experiment_folder:str = "../../data/evaluation/no-context-prompt-experiment/"
qa_data_set_file:str = "../../data/qas/squad_qas.tsv"

# Create the experiment folder if it does not already exist:
Path(experiment_folder).mkdir(parents=True, exist_ok=True)


#### Sample QA dataset

In [3]:
# Only sample answerable (possible) questions:
qa_set = QASetRetriever.get_qasets(
    file_path = qa_data_set_file,
    sample_size= num_samples,
    qa_type = QASetType.POSSIBLE_ONLY
)

print(f"{len(qa_set)} questions sampled from QA corpus ({qa_data_set_file})")

5 questions sampled from QA corpus (../../data/qas/squad_qas.tsv)


#### Get Responses with default prompt (using context)

In [6]:

def convert(response) -> pd.DataFrame:
    """Converts retrieved JSON response to Pandas DataFrame"""
    response_df = pd.json_normalize(
        data=response
    )

    response_df["reference.ground_truth"] = response["reference"]["ground_truth"]
    response_df["reference.is_impossible"] = response["reference"]["is_impossible"]

    return response_df

def api_endpoint(**kwargs) -> str:
    """Endpoint for answer.
    parameters:
    - hyde (h) = False
    - evaluation (e) = False
    - llm prompt selection (lp) = 1 (default prompt, uses context)
    """
    query_params = "&".join([f"{key}={kwargs[key]}" for key in kwargs])
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}&h=False&e=False&lp=1"

# Collect response:
exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store dataframe:
exp_df.to_pickle( f"{experiment_folder}/prompt_1.pkl" )
exp_df[0:1]


Processing 1 of 5 question/answer pairs.
Processing 2 of 5 question/answer pairs.
Processing 3 of 5 question/answer pairs.
Processing 4 of 5 question/answer pairs.
Processing 5 of 5 question/answer pairs.


Unnamed: 0,response.question,response.hyde,response.answer,response.context,response.evaluation.grade,response.evaluation.description,response.execution_time,system_settings.gpu_enabled,system_settings.llm_model,system_settings.llm_prompt,...,system_settings.text_chunk_size,system_settings.text_chunk_overlap,system_settings.context_similarity_threshold,system_settings.context_docs_retrieved,system_settings.docs_loaded,reference.question,reference.ground_truth,reference.is_impossible,reference.ref_context_id,reference_id
0,When did the colonization of India occur?,False,The colonization of India occurred in the mid-...,"[{'id': None, 'metadata': {'source': '../data/...",,,00:00:04,True,llama3.1:8b-instruct-q3_K_L,rlm/rag-prompt-llama,...,1500,100,1.0,6,1256,When did the colonization of India occur?,18th century,False,235,1
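
The dotted column names in the output above come from `pd.json_normalize`, which flattens nested dictionaries into one row. A minimal sketch of that behavior follows; the payload is hypothetical and only mirrors the nesting implied by the column names, so the actual API response shape is an assumption.

```python
import pandas as pd

# Hypothetical payload: the nesting mirrors the dotted column names in the
# notebook output, but the exact API response shape is an assumption.
payload = {
    "response": {
        "question": "When did the colonization of India occur?",
        "hyde": False,
    },
    "reference": {
        "ground_truth": "18th century",
        "is_impossible": False,
    },
}

# Nested keys are joined with "." to form the flat column names.
flat_df = pd.json_normalize(data=payload)
print(flat_df.columns.tolist())
```

Each nested key path becomes one column, which is why later cells can rename `response.question` and `reference.ground_truth` directly.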


#### Get Responses without context provided

In [7]:
def api_endpoint(**kwargs) -> str:
    """Endpoint for answer.
    parameters:
    - hyde (h) = False
    - evaluation (e) = False
    - llm prompt selection (lp) = 2 (prompt without context)
    """
    query_params = "&".join([f"{key}={kwargs[key]}" for key in kwargs])
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}&h=False&e=False&lp=2"

# Collect response:
exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store dataframe:
exp_df.to_pickle( f"{experiment_folder}/prompt_2.pkl" )
exp_df[0:1]


Processing 1 of 5 question/answer pairs.
Processing 2 of 5 question/answer pairs.
Processing 3 of 5 question/answer pairs.
Processing 4 of 5 question/answer pairs.
Processing 5 of 5 question/answer pairs.


Unnamed: 0,response.question,response.hyde,response.answer,response.context,response.evaluation.grade,response.evaluation.description,response.execution_time,system_settings.gpu_enabled,system_settings.llm_model,system_settings.llm_prompt,...,system_settings.text_chunk_size,system_settings.text_chunk_overlap,system_settings.context_similarity_threshold,system_settings.context_docs_retrieved,system_settings.docs_loaded,reference.question,reference.ground_truth,reference.is_impossible,reference.ref_context_id,reference_id
0,When did the colonization of India occur?,False,The British East India Company established its...,"[{'id': None, 'metadata': {'source': '../data/...",,,00:00:03,True,llama3.1:8b-instruct-q3_K_L,rlm/rag-prompt-llama,...,1500,100,1.0,6,1256,When did the colonization of India occur?,18th century,False,235,1
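
Both `api_endpoint` variants build the query string by joining raw values, so a question containing spaces or `?` is not percent-encoded by the notebook itself. The sketch below shows one way to encode the parameters with the standard library. The helper name `build_answer_url`, the `question` parameter name, and the host value are assumptions for illustration, not part of the project API.

```python
from urllib.parse import urlencode


def build_answer_url(host: str, prompt_id: int, **params) -> str:
    """Build the /answer URL with percent-encoded query parameters.

    `host` and the keyword names in `params` are hypothetical; the fixed
    flags (h, e, lp) follow the docstrings in the notebook cells.
    """
    query = urlencode({**params, "h": False, "e": False, "lp": prompt_id})
    return f"http://{host}/answer?{query}"


# Hypothetical host and parameter name, used only for illustration.
url = build_answer_url(
    "localhost:8000",
    prompt_id=2,
    question="When did the colonization of India occur?",
)
print(url)
```

Passing the prompt id as an argument also avoids redefining the function for each prompt variant.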


#### Generate Measures for Response

##### Evaluation Model Configuration

In [8]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Either local (Ollama) or remote (OpenAI) evaluation models can be used:

embedding = OllamaEmbeddings(
    base_url=settings.OLLAMA_URL,
    model=settings.ASSESSMENT_EMBEDDING_MODEL,
)

llm = Ollama(
    base_url=settings.OLLAMA_URL,
    model=settings.ASSESSMENT_LLM_MODEL,
)

openai_llm = ChatOpenAI(model="gpt-4o-mini")
openai_embedding = OpenAIEmbeddings()

##### Evaluate Responses

In [9]:
from datasets import Dataset
from ragas import evaluate
import ragas.metrics as metrics
from ragas.run_config import RunConfig

In [25]:
comp_data = [
    ("prompt_1.pkl","prompt_1_eval.pkl"),
    ("prompt_2.pkl", "prompt_2_eval.pkl")
]

In [26]:
for input_file, output_file in comp_data:

    # Load Data and format
    responses_df = pd.read_pickle(f"{experiment_folder}/{input_file}")
    responses_df = responses_df.rename(columns={
        "response.question" : "question",
        "response.answer" : "answer",
        "reference.ground_truth" : "ground_truth"
    })[["question", "answer", "ground_truth"]]

    
    # Convert to Dataset
    responses_ds = Dataset.from_pandas( responses_df)

    # Evaluate
    evaluation_ds = evaluate(
        dataset = responses_ds,
        metrics = [metrics.answer_similarity, metrics.answer_correctness],
        embeddings = embedding,
        llm = llm,
        run_config=RunConfig(
            max_workers=5
        ),
        raise_exceptions=False
    )

    eval_df = evaluation_ds.to_pandas()

    # Evaluation metadata
    eval_df["evaluation.llm_model"] = "ollama"
    eval_df["evaluation.embedding_model"] = "ollama"

    # Persist
    eval_df.to_pickle( f"{experiment_folder}/{output_file}" )

    

Evaluating: 100%|██████████| 10/10 [01:40<00:00, 10.03s/it]
Evaluating: 100%|██████████| 10/10 [01:43<00:00, 10.38s/it]
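
Because the evaluation runs with `raise_exceptions=False`, it is worth checking for missing scores before comparing the two runs. The sketch below assumes that failed metric computations surface as `NaN` in the score columns; the data frame is illustrative only and does not come from the experiment.

```python
import numpy as np
import pandas as pd

metric_cols = ["answer_similarity", "answer_correctness"]

# Illustrative data only: two rows carry a missing score, as might happen
# if a metric computation failed (an assumption about the evaluator).
eval_df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "answer_similarity": [0.8, np.nan, 0.7],
    "answer_correctness": [0.5, 0.4, np.nan],
})

# Count missing scores per metric.
failed_counts = eval_df[metric_cols].isna().sum()
print(failed_counts)

# Keep only the rows where every metric was computed.
complete_rows = eval_df.dropna(subset=metric_cols)
print(f"{len(complete_rows)} of {len(eval_df)} rows have all metrics")
```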


In [27]:
eval_df[0:2]

Unnamed: 0,question,answer,ground_truth,answer_similarity,answer_correctness,evaluation.llm_model,evaluation.embedding_model
0,When did the colonization of India occur?,The British East India Company established its...,18th century,0.768604,0.567151,ollama,ollama
1,What were requests made to British?,"The British received several requests, includi...",continue worshiping in their Roman Catholic tr...,0.642737,0.433412,ollama,ollama


#### Load and merge Experiment Datasets for comparison

In [28]:
# Load experiment results:
context_retr_df = pd.read_pickle(f"{experiment_folder}/prompt_1_eval.pkl")
no_context_retr_df = pd.read_pickle(f"{experiment_folder}/prompt_2_eval.pkl")

In [29]:
# Join the two result sets side by side on the row index for comparison:
combined_df = pd.merge( context_retr_df, no_context_retr_df, left_index=True, right_index=True, suffixes=["_context", "_no_context"])
combined_df[0:2]

Unnamed: 0,question_context,answer_context,ground_truth_context,answer_similarity_context,answer_correctness_context,evaluation.llm_model_context,evaluation.embedding_model_context,question_no_context,answer_no_context,ground_truth_no_context,answer_similarity_no_context,answer_correctness_no_context,evaluation.llm_model_no_context,evaluation.embedding_model_no_context
0,When did the colonization of India occur?,The colonization of India occurred in the mid-...,18th century,0.807099,0.576775,ollama,ollama,When did the colonization of India occur?,The British East India Company established its...,18th century,0.768604,0.567151,ollama,ollama
1,What were requests made to British?,Requests were made to the British for medical ...,continue worshiping in their Roman Catholic tr...,0.651037,0.435486,ollama,ollama,What were requests made to British?,"The British received several requests, includi...",continue worshiping in their Roman Catholic tr...,0.642737,0.433412,ollama,ollama
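
Merging on the row index relies on both files listing the questions in the same order. As a hedged alternative, the sketch below joins on the `question` column and asks pandas to verify a one-to-one match. It uses the first two rows of the output above and assumes question texts are unique within each run.

```python
import pandas as pd

# First two rows of the evaluation output, reduced to one metric.
context_df = pd.DataFrame({
    "question": [
        "When did the colonization of India occur?",
        "What were requests made to British?",
    ],
    "answer_similarity": [0.807099, 0.651037],
})
no_context_df = pd.DataFrame({
    "question": [
        "When did the colonization of India occur?",
        "What were requests made to British?",
    ],
    "answer_similarity": [0.768604, 0.642737],
})

# validate="one_to_one" raises MergeError if a question is duplicated
# on either side, instead of silently multiplying rows.
combined = pd.merge(
    context_df,
    no_context_df,
    on="question",
    suffixes=("_context", "_no_context"),
    validate="one_to_one",
)
print(combined.columns.tolist())
```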


##### RAGAS comparison (context - no_context)

In [30]:
# Metric comparison:
combined_df["answer_similarity_diff"] = combined_df["answer_similarity_context"] - combined_df["answer_similarity_no_context"]
combined_df["answer_correctness_diff"] = combined_df["answer_correctness_context"] - combined_df["answer_correctness_no_context"]


combined_df[["answer_similarity_diff", "answer_correctness_diff"]][0:2]


Unnamed: 0,answer_similarity_diff,answer_correctness_diff
0,0.038495,0.009624
1,0.0083,0.002075


In [33]:
# For Answer Similarity:
print(f"""
Answer Similarity Difference:
Min: {combined_df["answer_similarity_diff"].min()}
Avg: {combined_df["answer_similarity_diff"].mean()}
Max: {combined_df["answer_similarity_diff"].max()}
""")



Answer Similarity Difference:
Min: 0.00829950381471245
Avg: 0.028568275375045738
Max: 0.03859078489428036



In [34]:
# For Answer Correctness:
print(f"""
Answer Correctness Difference:
Min: {combined_df["answer_correctness_diff"].min()}
Avg: {combined_df["answer_correctness_diff"].mean()}
Max: {combined_df["answer_correctness_diff"].max()}
""")



Answer Correctness Difference:
Min: -0.03374665791811937
Avg: -0.001191264489571886
Max: 0.00964769622357009



In [35]:
combined_df

Unnamed: 0,question_context,answer_context,ground_truth_context,answer_similarity_context,answer_correctness_context,evaluation.llm_model_context,evaluation.embedding_model_context,question_no_context,answer_no_context,ground_truth_no_context,answer_similarity_no_context,answer_correctness_no_context,evaluation.llm_model_no_context,evaluation.embedding_model_no_context,answer_similarity_diff,answer_correctness_diff
0,When did the colonization of India occur?,The colonization of India occurred in the mid-...,18th century,0.807099,0.576775,ollama,ollama,When did the colonization of India occur?,The British East India Company established its...,18th century,0.768604,0.567151,ollama,ollama,0.038495,0.009624
1,What were requests made to British?,Requests were made to the British for medical ...,continue worshiping in their Roman Catholic tr...,0.651037,0.435486,ollama,ollama,What were requests made to British?,"The British received several requests, includi...",continue worshiping in their Roman Catholic tr...,0.642737,0.433412,ollama,ollama,0.0083,0.002075
2,How many inhabitants did Betty Meggers believe...,Betty Meggers believed that a population densi...,0.2,0.707551,0.551888,ollama,ollama,How many inhabitants did Betty Meggers believe...,"Betty Meggers, a renowned archaeologist, estim...",0.2,0.681775,0.545444,ollama,ollama,0.025776,0.006444
3,What scientific field's theory has received co...,Thermodynamics is the scientific field whose t...,thermodynamic,0.84872,0.545513,ollama,ollama,What scientific field's theory has received co...,The scientific field that has received contrib...,thermodynamic,0.81704,0.57926,ollama,ollama,0.03168,-0.033747
4,When was the final legislative proposals for a...,The final legislative proposals for a Scottish...,1978,0.727553,0.556888,ollama,ollama,When was the final legislative proposals for a...,The Final Legislative Proposals for a Scottish...,1978,0.688963,0.547241,ollama,ollama,0.038591,0.009648
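
The per-metric min, mean, and max can also be produced in one step, together with the share of questions whose score improved with context. This sketch re-enters the rounded differences from the table above, so its results agree with the printed values only to about six decimal places.

```python
import pandas as pd

# Rounded per-question differences (context - no_context) from the table.
diffs = pd.DataFrame({
    "answer_similarity_diff": [0.038495, 0.0083, 0.025776, 0.03168, 0.038591],
    "answer_correctness_diff": [0.009624, 0.002075, 0.006444, -0.033747, 0.009648],
})

# One table with min / mean / max for both metrics.
summary = diffs.agg(["min", "mean", "max"])
print(summary)

# Fraction of questions where the score with context was higher.
improved_share = (diffs > 0).mean()
print(improved_share)
```

With only five samples these figures describe this run and are too few to support a significance claim.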


In [36]:
# Optional: export the combined results to CSV
combined_df.to_csv(f"{experiment_folder}/combined_data.csv")