### **CELL 1: Imports and Paths**
This cell handles all necessary imports and path configurations for the script.  
It also sets the logging level and displays available LLMs and embeddings for reference.


In [107]:
# CELL 1: Imports and Paths
import logging
from modules import RAG_tools, models, preprocess
from modules.vectorstore import VectorstoreHandler
from modules.testset_manager import save_test_set
from giskard.rag import KnowledgeBase, generate_testset, evaluate
from modules.giskard_wrappers import GiskardEmbeddingAdapter, GiskardLLMAdapter
from modules.eval_tools import display_evaluation_results
from modules.models import AVAILABLE_EMBS, AVAILABLE_LLMS

# Set logging level to suppress verbose outputs
logging.getLogger().setLevel(logging.WARNING)

# File paths
processed_data_path = "data/processed/processed_data.pkl"  # Preprocessed dataset path
raw_data_path = "data/raw"                                # Raw dataset path
vsts_dir = "data/vectorstores"                            # Vectorstore directory
reports_dir = "eval_results"                              # Directory for evaluation reports

# Define LLM and embedding configurations
# RAG LLM and embedding (used for production RAG)
rag_model_name = "ChatGPT3.5-turbo"
rag_emb_name = "hf-mpnet-base-v2"

# Evaluation LLM and embedding (used for test set generation and evaluation)
kb_model_name = "ChatGPT4o"
kb_emb_name = "openai-embedding-3-large"

# Display available LLMs and embeddings
print(list(AVAILABLE_LLMS.keys()))
print(list(AVAILABLE_EMBS.keys()))


['ChatGPT4o', 'ChatGPT3.5-turbo', 'Llama3.2-3b']
['openai-ada-002', 'openai-embedding-3-small', 'openai-embedding-3-large', 'hf-mpnet-base-v2', 'hf-minilm-l6-v2', 'hf-multiqa-minilm-l6-v1']
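
Because `rag_model_name`, `kb_emb_name`, and the other settings are plain strings, a typo would only surface once a model is initialized. A minimal sketch of an early check follows, assuming the registries behave like dictionaries keyed by name (as the `.keys()` calls above suggest). The `check_name` helper and the stand-in registries are hypothetical and only mirror some of the names printed above.

```python
def check_name(name, registry, kind):
    """Return `name` if it is a key of `registry`; otherwise raise a KeyError listing valid options."""
    if name not in registry:
        options = ", ".join(sorted(registry))
        raise KeyError(f"Unknown {kind} '{name}'. Available: {options}")
    return name


# Stand-ins for AVAILABLE_LLMS / AVAILABLE_EMBS; the values are placeholders.
llm_registry = {"ChatGPT4o": None, "ChatGPT3.5-turbo": None, "Llama3.2-3b": None}
emb_registry = {"openai-embedding-3-large": None, "hf-mpnet-base-v2": None}

rag_model = check_name("ChatGPT3.5-turbo", llm_registry, "LLM")
rag_emb = check_name("hf-mpnet-base-v2", emb_registry, "embedding")
print(rag_model, rag_emb)
```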


### **CELL 2: Constructing the RAG**
In this cell, we:
1. Prepare the dataset from raw and processed data paths.
2. Initialize the RAG chain components:
   - The RAG LLM (Language Model).
   - The RAG embedding.
3. Build the Vectorstore and retriever.
4. Construct the RAG chain that combines these components for retrieval-augmented generation.

In [108]:

# CELL 2: Constructing the RAG
# Step 1: Prepare the dataset
data = preprocess.prepare_data(processed_data_path, raw_data_path)

# Step 2: Initialize RAG chain components
rag_llm = models.init_llm(rag_model_name)
rag_embedding = models.init_emb(rag_emb_name)

# Step 3: Initialize and build the VectorstoreHandler
vst_handler = VectorstoreHandler(
    persist_directory=vsts_dir,
    embedding=rag_embedding,
    dataset=data,
    force_rebuild=False  # Set to True to force rebuild of the vectorstore
)

# Build vectorstore and retriever
vst = vst_handler.init_vectorstore()
retriever = vst_handler.init_retriever(vst, k=2)

# Build the RAG chain
chain = RAG_tools.build_rag_chain(retriever, rag_llm)


Loading processed data from data/processed/processed_data.pkl...
Loading existing vectorstore from data/vectorstores/sentence-transformers_all-mpnet-base-v2/9a7d1387f14cfe05da8b0b963ad3411f6081192c542dc549d032564dca53d52a...
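
The `k` argument passed to `init_retriever` presumably controls how many documents the retriever returns per query. The toy sketch below illustrates top-k retrieval with bag-of-words cosine similarity in plain Python. It is not how `VectorstoreHandler` works internally (the notebook uses the configured embedding model), and every name in it is hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words vector as a Counter."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query, best first."""
    q = bow(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bow(d)), reverse=True)
    return ranked[:k]


docs = [
    "eligibility rules for postdoctoral fellowship applicants",
    "budget tables and unit costs",
    "how applicants describe expected impact",
]
top = retrieve("which applicants are eligible for the fellowship", docs, k=2)
print(top)
```

A larger `k` gives the generator more context at the cost of longer prompts, which is one of the two settings that differ between the weak and strong configurations later in the notebook.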


### **CELL 3: RAG Chain Demonstration**
Here, we test the RAG chain to verify it functions as expected.  
We provide a text prompt to the chain and display its response.  
This demonstrates the RAG's ability to retrieve relevant information and generate an output.

We also introduce an answer function, a wrapper around the chain (and the `handle_query` function in the `RAG_tools` module) that takes a string and returns a string. It is needed for the evaluation in Cell 6.

In [109]:

# CELL 3: Demonstrating the RAG chain
# Define a prompt for testing
test_prompt = "What are the main topics covered in the dataset?"
response = chain.invoke(test_prompt)  # Invoke the chain with the prompt
print("RAG Response:", response)       # Display the output

# For later use: answer_fn returns only the answer string, without the question or retrieved documents
answer_fn = RAG_tools.create_answer_fn(chain)
answer = answer_fn(test_prompt)
print("RAG Answer: ", answer)


RAG Response: {'question': 'What are the main topics covered in the dataset?', 'answer': 'The main topics covered in the dataset appear to be the expected outcomes and impacts of research projects, the assessment of future expected impacts, contributions to challenges at a European/Global level, embedding projects into overarching goals, and demonstrating knowledge of both MSCA and EU strategies.', 'docs': [Document(id='ac83fd47-b916-4c18-b1d2-e96d93d81910', metadata={}, page_content='MSCA-PF-FORMSET ve  1.00 20230406 Page 1 of 20 Las  saved 07/04/2023 07:42'), Document(id='31975363-4f2e-4d19-8c61-7914c05953c7', metadata={}, page_content='MSCA POSTDOCTORAL FELLOWSHIP HANDBOOK 2024\n➢ Fo  each expec ed ou come, p ovide qua ified i dica o s, whe e possible. Fo  example,\nexpec ed  eve ues f om  ew  ech ologies, size of pa ie  g oups  ha  will be affec ed by a\n ew  ea me ,  umbe  of  ew jobs/po e ial p ojec s/ ca ee  oppo u i ies fo   he s aff  ha \nwill be c ea ed af e  a successful p o
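
The response printed above is a dictionary with `question`, `answer`, and `docs` keys, whereas the evaluation step needs a function that maps a question string to an answer string. A minimal sketch of such a wrapper follows. It is not the actual `RAG_tools.create_answer_fn`, whose source is not shown here; `FakeChain` and `make_answer_fn` are hypothetical stand-ins, and the optional `history` parameter is an assumption included so the function tolerates callers that pass conversation history.

```python
class FakeChain:
    """Stand-in for the RAG chain: returns a dict shaped like the response above."""

    def invoke(self, question):
        return {"question": question, "answer": f"Echo: {question}", "docs": []}


def make_answer_fn(chain):
    """Wrap a chain so that callers get only the answer text."""

    def answer_fn(question, history=None):
        # `history` is accepted but ignored in this sketch.
        result = chain.invoke(question)
        return str(result["answer"])

    return answer_fn


answer_fn_demo = make_answer_fn(FakeChain())
print(answer_fn_demo("What is covered?"))
```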

### **CELL 4: Constructing Components for Test Set Generation**
This cell sets up everything required for generating a test set:
1. Initializes the evaluation LLM and embedding (ideally more powerful than the production RAG configuration).
2. Wraps these components for compatibility with Giskard.
3. Creates a KnowledgeBase object using the dataset, embedding model, and LLM.


In [110]:
# CELL 4: Constructing Components for Test Set Generation
# Step 1: Initialize KnowledgeBase components
kb_llm = models.init_llm(kb_model_name)
kb_embedding = models.init_emb(kb_emb_name)

# Wrap LLM and embedding for compatibility with Giskard
wrapped_llm = GiskardLLMAdapter(kb_llm)
wrapped_embedding = GiskardEmbeddingAdapter(kb_embedding)

# Step 2: Create the KnowledgeBase
kb = KnowledgeBase(
    data=data,
    embedding_model=wrapped_embedding,
    llm_client=wrapped_llm
)
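
Wrapping is an instance of the adapter pattern: the evaluation library and the model objects expose different method names, so a thin class translates between them. The sketch below shows the idea for embeddings only. The method names `embed_documents` (on the wrapped object) and `embed` (on the adapter) are assumptions chosen for illustration, and `ToyEmbedding` is a hypothetical stand-in; the real `GiskardEmbeddingAdapter` may differ.

```python
class ToyEmbedding:
    """Hypothetical model object exposing `embed_documents`."""

    def embed_documents(self, texts):
        # Two toy features per text: character count and word count.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


class EmbeddingAdapter:
    """Expose a wrapped embedding model through a single `embed` method."""

    def __init__(self, model):
        self.model = model

    def embed(self, texts):
        # Accept a single string as well as a sequence of strings.
        if isinstance(texts, str):
            texts = [texts]
        return self.model.embed_documents(list(texts))


adapter = EmbeddingAdapter(ToyEmbedding())
print(adapter.embed(["two words", "three short words"]))
```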


### **CELL 5: Generating the Test Set**
Using the KnowledgeBase constructed earlier, this cell generates a test set.  
The test set consists of 40 questions in English. The agent description tells the generator that the RAG answers only from the provided context and never asks follow-up questions.

In [113]:

# CELL 5: Generating the Test Set
test_set = generate_testset(
    knowledge_base=kb,
    num_questions=40,
    language='en',
    agent_description=(
        # Adjacent string literals are concatenated, so the first one ends with a space
        "We're creating a test set for a RAG. The RAG will use the context to answer questions from applicants that don't understand the different parts of the application process. "
        "The agent should only use information from the context and never ask follow-up questions."
    )
)


Generating questions: 100%|██████████| 40/40 [02:49<00:00,  4.24s/it]


In [114]:
test_set_df = test_set.to_pandas()
for index, row in test_set_df.iterrows():
    print(f"Question: {row['question']}")
    print(f"Reference Answer: {row['reference_answer']}")
    print(f"Reference Context: {row['reference_context'][:100]}...")  # Truncate long contexts
    print("Metadata:")
    
    # Iterate through the 'metadata' dictionary and handle nested structures
    if isinstance(row['metadata'], dict):  # Ensure it's a dictionary
        for key, val in row['metadata'].items():
            if isinstance(val, dict):  # Handle nested dictionary
                print(f"  - {key}:")
                for sub_key, sub_val in val.items():
                    print(f"      * {sub_key}: {sub_val}")
            else:
                print(f"  - {key}: {val}")
    else:
        print("  Metadata is not in dictionary format.")
    
    print("")
    print("="*80)
    print("")


Question: What are the language requirements for PPI contract notices in the EU?
Reference Answer: The PPI contract notices must be published EU-wide in at least English, offers must be accepted and communication with stakeholders must be enabled at all stages in at least English.
Reference Context: Document 176: file_name: General annexes-diverse regler_horizon-2023-2024_en.pdf
page_number: 44
tex...
Metadata:
  - question_type: simple
  - seed_document_id: 176
  - topic: content='"European Research Funding Program"' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 7, 'prompt_tokens': 1873, 'total_tokens': 1880, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_703d4ff298', 'finish_reason': 'stop', 'logprobs': None} id='run-7ff71
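
The `topic` value above appears to be the string form of an entire chat message object (content, token usage, run id) rather than just the topic text, which suggests that the LLM wrapper returns the message object where plain text is expected. One possible remedy, sketched below under that assumption, is to extract the `content` attribute before returning. `message_text` is a hypothetical helper, and `SimpleNamespace` merely imitates a message object.

```python
from types import SimpleNamespace


def message_text(response):
    """Return the text of an LLM response, whether it is a message object or a string."""
    content = getattr(response, "content", None)
    text = content if isinstance(content, str) else str(response)
    # Strip whitespace and surrounding double quotes such as '"European ..."'.
    return text.strip().strip('"')


fake_message = SimpleNamespace(
    content='"European Research Funding Program"',
    response_metadata={"finish_reason": "stop"},
)
print(message_text(fake_message))
print(message_text("plain string"))
```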

### **CELL 6: Evaluating the RAG**
This cell evaluates the RAG chain using the generated test set and KnowledgeBase.  
The evaluation results are displayed to assess the RAG's performance on the test set.

In [115]:

# CELL 6: Evaluating the RAG Using the Test Set
report = evaluate(answer_fn, testset=test_set, knowledge_base=kb)
display_evaluation_results(report)


Asking questions to the agent: 100%|██████████| 40/40 [00:45<00:00,  1.13s/it]
CorrectnessMetric evaluation: 100%|██████████| 40/40 [00:53<00:00,  1.34s/it]



=== RAG Evaluation Results ===

Overall Correctness: 12.50%
Total Test Cases: 40

--- RAG Component Scores ---
GENERATOR: 14.29%
RETRIEVER: 14.29%
REWRITER: 4.76%
ROUTING: 100.00%
KNOWLEDGE_BASE: 94.05%

--- Correctness by Question Type ---
complex: 28.57%
conversational: 0.00%
distracting element: 14.29%
double: 0.00%
simple: 14.29%
situational: 14.29%

--- Correctness by Topic ---
content='"European Research Funding Program"' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 7, 'prompt_tokens': 1873, 'total_tokens': 1880, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-2024-08-06', 'system_fingerprint': 'fp_703d4ff298', 'finish_reason': 'stop', 'logprobs': None} id='run-7ff71476-8c50-4cf4-973e-8a0dbd482bf0-0' usage_metadata={'input_tokens': 1873, 'output_tokens': 7
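
Each percentage in the report is the share of test cases judged correct within a group; for instance, an overall correctness of 12.50% on 40 test cases corresponds to 5 correct answers. The sketch below shows this aggregation in plain Python on a handful of made-up records. The records are illustrative only and are not the notebook's results, and `correctness_by` is a hypothetical helper.

```python
from collections import defaultdict


def correctness_by(records, key):
    """Fraction of correct answers per distinct value of `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        correct[rec[key]] += int(rec["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Made-up records for illustration.
records = [
    {"question_type": "simple", "correct": True},
    {"question_type": "simple", "correct": False},
    {"question_type": "complex", "correct": False},
    {"question_type": "complex", "correct": False},
]
overall = sum(r["correct"] for r in records) / len(records)
print(f"Overall: {overall:.2%}")
for group, score in correctness_by(records, "question_type").items():
    print(f"{group}: {score:.2%}")
```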

## Set up a stronger RAG to compare


In [116]:
strong_model_name = "ChatGPT4o"
strong_emb_name = "openai-embedding-3-large"

strong_llm = models.init_llm(strong_model_name)
strong_embedding = models.init_emb(strong_emb_name)

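# Initialize the VectorstoreHandler for the strong RAG
# NOTE: rag_embedding (not strong_embedding) is passed here, so this handler loads the same
# mpnet vectorstore as the weak RAG (see the log line below). The two RAGs therefore differ
# only in the LLM and in k; pass embedding=strong_embedding to also switch the embedding.
strong_vst_handler = VectorstoreHandler(
    persist_directory=vsts_dir,
    embedding=rag_embedding,
    dataset=data,
    force_rebuild=False  # Set to True to force rebuild of the vectorstore
)

# Build vectorstore and retriever
strong_vst = strong_vst_handler.init_vectorstore()
strong_retriever = strong_vst_handler.init_retriever(strong_vst, k=5)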

# Build the RAG chain
strong_chain = RAG_tools.build_rag_chain(strong_retriever, strong_llm)
strong_answer_fn = RAG_tools.create_answer_fn(strong_chain)

Loading existing vectorstore from data/vectorstores/sentence-transformers_all-mpnet-base-v2/9a7d1387f14cfe05da8b0b963ad3411f6081192c542dc549d032564dca53d52a...


Evaluate the strong RAG, keeping the test set and knowledge base unchanged so that the scores are directly comparable.

In [142]:
strong_report = evaluate(strong_answer_fn, testset=test_set, knowledge_base=kb)

Asking questions to the agent: 100%|██████████| 40/40 [02:14<00:00,  3.36s/it]
CorrectnessMetric evaluation: 100%|██████████| 40/40 [01:00<00:00,  1.52s/it]


In [143]:
print("Weak rag:")
print(report.component_scores())
print()

print("Strong rag:")
print(strong_report.component_scores())
print()


Weak rag:
                   score
RAG Components          
GENERATOR       0.142857
RETRIEVER       0.142857
REWRITER        0.047619
ROUTING         1.000000
KNOWLEDGE_BASE  0.940476

Strong rag:
                   score
RAG Components          
GENERATOR       0.290476
RETRIEVER       0.357143
REWRITER        0.198413
ROUTING         1.000000
KNOWLEDGE_BASE  0.880952
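
To read the two tables together, it helps to compute the per-component change. The sketch below copies the printed scores into plain dictionaries and subtracts them; the `score_deltas` helper is hypothetical.

```python
weak = {
    "GENERATOR": 0.142857, "RETRIEVER": 0.142857, "REWRITER": 0.047619,
    "ROUTING": 1.000000, "KNOWLEDGE_BASE": 0.940476,
}
strong = {
    "GENERATOR": 0.290476, "RETRIEVER": 0.357143, "REWRITER": 0.198413,
    "ROUTING": 1.000000, "KNOWLEDGE_BASE": 0.880952,
}


def score_deltas(before, after):
    """Return after - before for every component present in both mappings."""
    return {name: after[name] - before[name] for name in before if name in after}


for name, delta in score_deltas(weak, strong).items():
    print(f"{name:<15} {delta:+.4f}")
```

In these numbers the generator, retriever, and rewriter scores rise, routing is unchanged, and the knowledge-base score drops slightly.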



In [144]:
def print_side_by_side(title, weak_df, strong_df):
    """Print each shared column of two correctness tables side by side."""
    for column in weak_df.columns:
        if column in strong_df.columns:  # Ensure the column exists in both DataFrames
            print(f"{title}: {column}")
            print(f"{'Original rag':<20} {'Strong rag':<20}")
            # zip pairs rows by position, so both tables must share the same row order
            for val1, val2 in zip(weak_df[column], strong_df[column]):
                print(f"{str(val1):<20} {str(val2):<20}")  # Adjust column width as needed
            print("=" * 40)  # Separator for clarity
        else:
            print(f"Column '{column}' is missing in one of the DataFrames!")


print_side_by_side("Correctness by topic",
                   report.correctness_by_topic(), strong_report.correctness_by_topic())
print_side_by_side("Correctness by question type",
                   report.correctness_by_question_type(), strong_report.correctness_by_question_type())

report.to_html("notebook_reports/report.html")
strong_report.to_html("notebook_reports/strong_report.html")


Correctness by topic: correctness
Original rag         Strong rag          
0.08333333333333333  0.16666666666666666 
0.14285714285714285  0.2857142857142857  
Correctness by question type: correctness
Original rag         Strong rag          
0.2857142857142857   0.42857142857142855 
0.0                  0.0                 
0.14285714285714285  0.42857142857142855 
0.0                  0.16666666666666666 
0.14285714285714285  0.2857142857142857  
0.14285714285714285  0.14285714285714285 
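
The printout above lists values without row labels, because iterating over the column values pairs them by position and drops the index. Matching the six question-type rows against the order in the Cell 6 report (complex, conversational, distracting element, double, simple, situational) gives the labeled view sketched below. That ordering is an assumption based on the weak RAG values lining up with the earlier report, the scores are rounded copies of the printed values, and `side_by_side` is a hypothetical helper that joins two mappings by key.

```python
weak = {"complex": 0.2857, "conversational": 0.0, "distracting element": 0.1429,
        "double": 0.0, "simple": 0.1429, "situational": 0.1429}
strong = {"complex": 0.4286, "conversational": 0.0, "distracting element": 0.4286,
          "double": 0.1667, "simple": 0.2857, "situational": 0.1429}


def side_by_side(left, right):
    """Join two score mappings by key; a key missing on one side yields None."""
    keys = sorted(set(left) | set(right))
    return [(key, left.get(key), right.get(key)) for key in keys]


print(f"{'Question type':<22} {'Weak RAG':<10} {'Strong RAG':<10}")
for key, w, s in side_by_side(weak, strong):
    print(f"{key:<22} {str(w):<10} {str(s):<10}")
```

Joining by key rather than by position also makes a missing category visible as `None` instead of silently shifting the rows.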


In [145]:
df1 = report.failures
df2 = strong_report.failures

import textwrap

# Set a character limit for line wrapping
LINE_WIDTH = 80

def wrapped_print(label, text, indent=2):
    wrapped_text = textwrap.fill(text, width=LINE_WIDTH, subsequent_indent=" " * indent)
    print(f"{label}{wrapped_text}")

for idx, row1 in df1.iterrows():
    # Access the id from the current row in df1
    id1 = row1['question']  # Assuming 'question' is the unique identifier

    # Find the corresponding row in df2 using the same id
    row2 = df2[df2['question'] == id1]
    
    if not row2.empty:  # Ensure that a matching row was found
        row2 = row2.iloc[0]  # Get the first matching row as a Series (if there's only one match)

        print("=" * 80)
        wrapped_print("Question: ", id1)
        print()

        # Weak RAG Output
        print("Weak RAG:")
        wrapped_print("  Answer: ", row1['agent_answer'])
        wrapped_print("  Judgment: ", str(row1['correctness']))
        wrapped_print("  Argument: ", row1['correctness_reason'])
        print()

        # Strong RAG Output
        print("Strong RAG:")
        wrapped_print("  Answer: ", row2['agent_answer'])
        wrapped_print("  Judgment: ", str(row2['correctness']))
        wrapped_print("  Argument: ", row2['correctness_reason'])
        print()

        # Reference Answer and Context
        wrapped_print("Reference Answer: ", row1['reference_answer'])
        print()
        wrapped_print("Reference Context: ", row1['reference_context'][:500])  # Truncate context for brevity
        print()
        wrapped_print("Metadata: ", str(row1['metadata']))
        print("=" * 80)
        print()
    else:
        wrapped_print("No matching row in df2 for question: ", id1)



Question: What are the language requirements for PPI contract notices in the EU?

Weak RAG:
  Answer: There is no information provided in the context regarding language requirements
  for PPI contract notices in the EU.
  Judgment: False
  Argument: The agent stated that there is no information provided in the context regarding
  language requirements, but the reference answer provides specific language
  requirements for PPI contract notices in the EU.

Strong RAG:
  Answer: The context provided does not include information about the language
  requirements for PPI contract notices in the EU.
  Judgment: False
  Argument: The agent stated that the context does not include information about the
  language requirements, but it should have provided the information that PPI
  contract notices must be published EU-wide in at least English, and offers and
  communication must be enabled in at least English.

Reference Answer: The PPI contract notices must be published EU-wide in at least En