### Experiment: Effectiveness of the "no answer" prompt

**Background:**
The LLM prompt includes the instruction 'If you don't know the answer, just say that you don't know.' to help prevent hallucinations when the retrieved context does not contain an answer. This experiment measures how effective that instruction is at preventing fabricated responses when the context provides no answer.

**Test Approach:**
A sample of questions for which no answer is possible is selected from the QA corpus. The LLM is asked each question with the most relevant available context, with the expectation that the context does not contain the actual answer. The assessment measures what percentage of responses correctly indicate that the LLM does not know.


In [2]:
# Common import
from deh.assessment import QASetRetriever
from deh.assessment import QASetType
from deh import settings
from deh.eval import generate_experiment_dataset

import pandas as pd
import os
from pathlib import Path



#### Test Configuration

In [38]:
num_samples:int = 100
experiment_folder:str = "../../data/evaluation/no-context-prompt-experiment/"
qa_data_set_file:str = "../../data/qas/squad_qas.tsv"

# Create experiment folder (no-op if it already exists):
Path(experiment_folder).mkdir(parents=True, exist_ok=True)


#### Sample QA dataset

In [39]:
# Only get impossible to answer questions:
qa_set = QASetRetriever.get_qasets(
    file_path = qa_data_set_file,
    sample_size= num_samples,
    qa_type = QASetType.IMPOSSIBLE_ONLY
)

print(f"{len(qa_set)} questions sampled from QA corpus ({qa_data_set_file})")

100 questions sampled from QA corpus (../../data/qas/squad_qas.tsv)


#### Get responses with default prompt, lp=1 (does not instruct the LLM to say it doesn't know)

In [40]:

def convert(response) -> pd.DataFrame:
    """Converts retrieved JSON response to Pandas DataFrame"""
    return pd.json_normalize(
        data=response
    )

def api_endpoint(**kwargs) -> str:
    """Endpoint for answer.
    parameters:
    - hyde (h) = False
    - evaluation (e) = False
    - llm prompt selection (lp) = 1
    """
    query_params = "&".join([f"{key}={kwargs[key]}" for key in kwargs])
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}&h=False&e=False&lp=1"

# Collect response:
exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store dataframe:
exp_df.to_pickle( f"{experiment_folder}/prompt_1.pkl" )
exp_df[0:1]


Processing 1 of 100 question/answer pairs.
Processing 2 of 100 question/answer pairs.
Processing 3 of 100 question/answer pairs.
Processing 4 of 100 question/answer pairs.
Processing 5 of 100 question/answer pairs.
Processing 6 of 100 question/answer pairs.
Processing 7 of 100 question/answer pairs.
Processing 8 of 100 question/answer pairs.
Processing 9 of 100 question/answer pairs.
Processing 10 of 100 question/answer pairs.
Processing 11 of 100 question/answer pairs.
Processing 12 of 100 question/answer pairs.
Processing 13 of 100 question/answer pairs.
Processing 14 of 100 question/answer pairs.
Processing 15 of 100 question/answer pairs.
Processing 16 of 100 question/answer pairs.
Processing 17 of 100 question/answer pairs.
Processing 18 of 100 question/answer pairs.
Processing 19 of 100 question/answer pairs.
Processing 20 of 100 question/answer pairs.
Processing 21 of 100 question/answer pairs.
Processing 22 of 100 question/answer pairs.
Processing 23 of 100 question/answer pair

Unnamed: 0,response.question,response.hyde,response.answer,response.context,response.evaluation.grade,response.evaluation.description,response.execution_time,system_settings.gpu_enabled,system_settings.llm_model,system_settings.llm_prompt,...,system_settings.text_chunk_size,system_settings.text_chunk_overlap,system_settings.context_similarity_threshold,system_settings.context_docs_retrieved,system_settings.docs_loaded,reference.question,reference.ground_truth,reference.is_impossible,reference.ref_context_id,reference_id
0,Why did Pedro Menendez de Aviles called the St...,False,Pedro Menendez de Aviles did not call the St. ...,"[{'id': None, 'metadata': {'source': '../data/...",,,00:00:05,True,llama3.1:8b-instruct-q3_K_L,rlm/rag-prompt-llama,...,1500,100,1.0,6,1256,Why did Pedro Menendez de Aviles called the St...,,True,287,1
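
The `api_endpoint` helper joins raw values into the query string, so a question containing `&`, `?`, or `#` could be split or truncated unless something downstream encodes it. The sketch below shows one way to build the same kind of URL with percent-encoding from the standard library. The function name `build_answer_url`, the host `localhost:8000`, the `question` parameter name, and the sample question are all hypothetical; whether `generate_experiment_dataset` already encodes values is not known from this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_answer_url(host: str, prompt_id: int, **params) -> str:
    """Build an /answer URL with percent-encoded query parameters (illustrative only)."""
    query = dict(params)
    # Fixed experiment flags mirroring the notebook: no HyDE, no evaluation.
    query.update({"h": False, "e": False, "lp": prompt_id})
    return f"http://{host}/answer?{urlencode(query)}"

# Hypothetical host and question, chosen because the question contains '&' and '?'.
url = build_answer_url("localhost:8000", 1, question="Who founded R&D at AT&T?")
print(url)

# Round-trip check: the question survives parsing intact.
decoded_question = parse_qs(urlsplit(url).query)["question"][0]
print(decoded_question)
```

If the API client already encodes query parameters, this step would be redundant and the original helper can stay as it is.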


#### Get responses with default prompt, lp=0 (instructs the LLM to say it doesn't know)

In [41]:

def convert(response) -> pd.DataFrame:
    """Converts retrieved JSON response to Pandas DataFrame"""
    return pd.json_normalize(
        data=response
    )

def api_endpoint(**kwargs) -> str:
    """Endpoint for answer.
    parameters:
    - hyde (h) = False
    - evaluation (e) = False
    - llm prompt selection (lp) = 0
    """
    query_params = "&".join([f"{key}={kwargs[key]}" for key in kwargs])
    return f"http://{settings.API_ANSWER_ENDPOINT}/answer?{query_params}&h=False&e=False&lp=0"

# Collect response:
exp_df = generate_experiment_dataset(qa_set, convert, api_endpoint)

# Store dataframe:
exp_df.to_pickle( f"{experiment_folder}/prompt_0.pkl" )
exp_df[0:1]


Processing 1 of 100 question/answer pairs.
Processing 2 of 100 question/answer pairs.
Processing 3 of 100 question/answer pairs.
Processing 4 of 100 question/answer pairs.
Processing 5 of 100 question/answer pairs.
Processing 6 of 100 question/answer pairs.
Processing 7 of 100 question/answer pairs.
Processing 8 of 100 question/answer pairs.
Processing 9 of 100 question/answer pairs.
Processing 10 of 100 question/answer pairs.
Processing 11 of 100 question/answer pairs.
Processing 12 of 100 question/answer pairs.
Processing 13 of 100 question/answer pairs.
Processing 14 of 100 question/answer pairs.
Processing 15 of 100 question/answer pairs.
Processing 16 of 100 question/answer pairs.
Processing 17 of 100 question/answer pairs.
Processing 18 of 100 question/answer pairs.
Processing 19 of 100 question/answer pairs.
Processing 20 of 100 question/answer pairs.
Processing 21 of 100 question/answer pairs.
Processing 22 of 100 question/answer pairs.
Processing 23 of 100 question/answer pair

Unnamed: 0,response.question,response.hyde,response.answer,response.context,response.evaluation.grade,response.evaluation.description,response.execution_time,system_settings.gpu_enabled,system_settings.llm_model,system_settings.llm_prompt,...,system_settings.text_chunk_size,system_settings.text_chunk_overlap,system_settings.context_similarity_threshold,system_settings.context_docs_retrieved,system_settings.docs_loaded,reference.question,reference.ground_truth,reference.is_impossible,reference.ref_context_id,reference_id
0,Why did Pedro Menendez de Aviles called the St...,False,Pedro Menendez de Aviles called the St. Johns ...,"[{'id': None, 'metadata': {'source': '../data/...",,,00:00:25,True,llama3.1:8b-instruct-q3_K_L,rlm/rag-prompt-llama,...,1500,100,1.0,6,1256,Why did Pedro Menendez de Aviles called the St...,,True,287,1


#### Load and merge Experiment Datasets for comparison

In [42]:
# Load experiment results:
p_0_retr_df = pd.read_pickle(f"{experiment_folder}/prompt_0.pkl")[["response.question", "response.answer", "response.execution_time"]]
p_0_retr_df = p_0_retr_df.reset_index(drop=True)

p_1_retr_df = pd.read_pickle(f"{experiment_folder}/prompt_1.pkl")[["response.question", "response.answer", "response.execution_time"]]
p_1_retr_df = p_1_retr_df.reset_index(drop=True)

In [43]:
# Merge datasets on row index for side-by-side comparison:
combined_df = pd.merge( p_0_retr_df, p_1_retr_df, left_index=True, right_index=True, suffixes=["_p_0", "_p_1"])
combined_df[0:2]

Unnamed: 0,response.question_p_0,response.answer_p_0,response.execution_time_p_0,response.question_p_1,response.answer_p_1,response.execution_time_p_1
0,Why did Pedro Menendez de Aviles called the St...,Pedro Menendez de Aviles called the St. Johns ...,00:00:25,Why did Pedro Menendez de Aviles called the St...,Pedro Menendez de Aviles did not call the St. ...,00:00:05
1,State educational and economic development whe...,Education is a crucial factor in economic deve...,00:00:18,State educational and economic development whe...,Education has been a crucial factor in economi...,00:00:04
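
Merging on the row index assumes that both runs returned the questions in the same order. A minimal sketch of a sanity check for that assumption is shown here on toy data; the frames `p_0` and `p_1` are illustrative stand-ins for `p_0_retr_df` and `p_1_retr_df`, and their values are made up.

```python
import pandas as pd

# Toy stand-ins for p_0_retr_df and p_1_retr_df (values are illustrative only).
p_0 = pd.DataFrame({
    "response.question": ["q1", "q2"],
    "response.answer": ["I don't know.", "a2"],
})
p_1 = pd.DataFrame({
    "response.question": ["q1", "q2"],
    "response.answer": ["b1", "b2"],
})

merged = pd.merge(p_0, p_1, left_index=True, right_index=True, suffixes=("_p_0", "_p_1"))

# Every row should pair the same question across both prompts.
aligned = bool((merged["response.question_p_0"] == merged["response.question_p_1"]).all())
assert aligned, "Row order differs between runs; merge on the question column instead."
print(f"{len(merged)} rows merged, questions aligned: {aligned}")
```

If the check fails on real data, merging on `response.question` rather than on the index would be the safer choice.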


In [44]:
# Flag answers that contain the phrase "don't know":
combined_df["DNK_p_0"] = combined_df['response.answer_p_0'].str.contains("don't know")
combined_df["DNK_p_1"] = combined_df['response.answer_p_1'].str.contains("don't know")

combined_df[0:2]


Unnamed: 0,response.question_p_0,response.answer_p_0,response.execution_time_p_0,response.question_p_1,response.answer_p_1,response.execution_time_p_1,DNK_p_0,DNK_p_1
0,Why did Pedro Menendez de Aviles called the St...,Pedro Menendez de Aviles called the St. Johns ...,00:00:25,Why did Pedro Menendez de Aviles called the St...,Pedro Menendez de Aviles did not call the St. ...,00:00:05,True,False
1,State educational and economic development whe...,Education is a crucial factor in economic deve...,00:00:18,State educational and economic development whe...,Education has been a crucial factor in economi...,00:00:04,False,False
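
The exact substring match on "don't know" is case-sensitive and would miss variants such as "do not know" or a typographic apostrophe, and it returns `NaN` for missing answers. The sketch below shows a more tolerant match. The pattern and the sample answers are assumptions for illustration, not the rule used to produce the results in this notebook, and applying it to the real data could change the reported percentages.

```python
import pandas as pd

# Assumed refusal phrasings: "don't know", "don’t know", "do not know".
DNK_PATTERN = r"(?:don['’]t|do not) know"

# Made-up answers covering the variants of interest, plus a missing value.
answers = pd.Series([
    "I don't know.",
    "I Do Not Know the answer.",
    "I don’t know based on the context.",
    "Education is a crucial factor.",
    None,
])

# case=False ignores capitalisation; na=False counts missing answers as non-refusals.
is_dnk = answers.str.contains(DNK_PATTERN, case=False, na=False, regex=True)
print(is_dnk.tolist())
```

Even a broader pattern only detects refusals by wording, so a manual review of a few flagged and unflagged answers remains a useful cross-check.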


##### Comparison of hallucinations prevented


In [45]:
# Percent p_0:
pcnt_p_0 = len( combined_df[ combined_df["DNK_p_0"] == True ] ) / len (combined_df) * 100
pcnt_p_0

41.0

In [46]:
# Percent p_1:
pcnt_p_1 = len( combined_df[ combined_df["DNK_p_1"] == True ] ) / len (combined_df) * 100
pcnt_p_1

0.0

Instructing the LLM to respond with "I don't know" when the answer is not available in the context reduces hallucinations by the following number of percentage points:

In [47]:
pcnt_p_0 - pcnt_p_1

41.0
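
With only 100 sampled questions, each percentage carries sampling uncertainty. As a rough illustration, the sketch below computes a Wilson score interval for the two observed rates (41 of 100 and 0 of 100). The helper `wilson_interval` is hypothetical, and the choice of a Wilson interval at roughly 95% is an assumption rather than part of the original analysis.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)

# Observed "don't know" counts from this experiment: prompt 0 and prompt 1.
lo_0, hi_0 = wilson_interval(41, 100)
lo_1, hi_1 = wilson_interval(0, 100)
print(f"prompt 0: {lo_0:.3f} to {hi_0:.3f}")
print(f"prompt 1: {lo_1:.3f} to {hi_1:.3f}")
```

The two intervals do not overlap, which is consistent with the instruction having a real effect, although the size of that effect is only loosely pinned down at this sample size.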

Is there any performance hit from the prompt enhancement?

In [51]:
# Mean execution-time difference in seconds (prompt 0 minus prompt 1):
combined_df["response_diff"] = pd.to_timedelta(combined_df["response.execution_time_p_0"]).dt.total_seconds() - pd.to_timedelta(combined_df["response.execution_time_p_1"]).dt.total_seconds()
combined_df["response_diff"].mean()

9.15
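
A mean can be pulled upward by a handful of slow requests, so looking at the median and quartiles of the per-question difference may give a fuller picture. The sketch below applies the same timedelta conversion to the four execution times visible in the two preview rows shown earlier in this notebook; it is an illustration on two rows, not a recomputation of the full result.

```python
import pandas as pd

# Execution times copied from the two preview rows shown earlier in this notebook.
sample = pd.DataFrame({
    "response.execution_time_p_0": ["00:00:25", "00:00:18"],
    "response.execution_time_p_1": ["00:00:05", "00:00:04"],
})

secs_p_0 = pd.to_timedelta(sample["response.execution_time_p_0"]).dt.total_seconds()
secs_p_1 = pd.to_timedelta(sample["response.execution_time_p_1"]).dt.total_seconds()
diff = secs_p_0 - secs_p_1

# Median and quartiles are less sensitive to a few slow requests than the mean.
print(diff.describe())
print("median difference (s):", diff.median())
```

On the full `combined_df`, the same `describe()` call on `response_diff` would show whether the slowdown is spread evenly or driven by a few long-running responses.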