# Setting Up the RAG and Evaluation Framework

This notebook initializes the configurations and paths for:
1. **Production RAG**: Configurations for the LLM and embedding model used in the production RAG pipeline.
2. **Evaluation**: Configurations for the LLM and embedding model used to generate test sets and evaluate the production RAG.

### Key Highlights:
- **Flexible Configurations**: Both the production RAG and evaluation LLM/embedding combinations can be freely adjusted.
- **Purpose of Each Component**:
  - **File Paths**: Preprocessed data, raw data, vectorstore directory, and evaluation report locations.
  - **RAG vs. Evaluation Models**: The production RAG (`rag_model_name` and `rag_emb_name`) is typically lightweight, whereas the evaluation models (`kb_model_name` and `kb_emb_name`) should be more robust.
  - **Metadata Columns**: `columns` lists the fields each dataset record carries (`text`, `file_name`, `page_number`).

The code cell below sets up these configurations to be used throughout the rest of the notebook.


In [1]:
import logging
from modules import RAG_tools, models, preprocess
from modules.vectorstore import VectorstoreHandler
from modules.testset_manager import save_test_set
from giskard.rag import KnowledgeBase, generate_testset, evaluate
from modules.giskard_wrappers import GiskardEmbeddingAdapter, GiskardLLMAdapter
from modules.eval_tools import display_evaluation_results

# Set logging level to suppress verbose outputs
logging.getLogger().setLevel(logging.WARNING)

# File paths
processed_data_path = "data/processed/processed_data.pkl"  # Preprocessed dataset path
raw_data_path = "data/raw"                                # Raw dataset path
vsts_dir = "data/vectorstores"                            # Vectorstore directory
reports_dir = "eval_results"                              # Directory for evaluation reports

# Define LLM and embedding configurations
# RAG LLM and embedding (used for production RAG)
rag_model_name = "ChatGPT3.5-turbo"
rag_emb_name = "hf-mpnet-base-v2"

# Evaluation LLM and embedding (used for test set generation and evaluation)
kb_model_name = "ChatGPT4o"
kb_emb_name = "openai-embedding-3-small"

# Note:
# - The evaluation LLM/embedding (kb_model_name/kb_emb_name) should ideally be "stronger" or more powerful
#   than the production RAG configuration to ensure robust evaluations.
# - In this example, lightweight options are used for quicker testing.

# Other configurations
k = 3  # Number of documents to retrieve during retrieval
columns = ["text", "file_name", "page_number"]  # Metadata columns for the dataset



The LLM and embedding choices above, for both the RAG and the knowledge base, are only illustrative. The `models` module contains two dicts listing the models that are currently available. They can likely be extended with further OpenAI and Ollama models just by adding the corresponding key: value pairs to the appropriate dict. Beyond that, any Hugging Face model that fits the description should work, although I am not certain of this.
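A minimal sketch of what adding a key: value pair to such a dict could look like. The real layout of the dicts in `modules.models` is not shown in this notebook, so the structure below (a display name mapped to a provider and a model identifier) and the names `EXAMPLE_LLMS` and `register_llm` are assumptions made purely for illustration.

```python
# Hypothetical registry layout; the actual dicts in modules.models may differ.
EXAMPLE_LLMS = {
    "ChatGPT4o": {"provider": "openai", "model": "gpt-4o"},
    "Llama3.2-3b": {"provider": "ollama", "model": "llama3.2:3b"},
}


def register_llm(registry, name, provider, model):
    """Add a new entry, refusing to silently overwrite an existing one."""
    if name in registry:
        raise KeyError(f"{name!r} is already registered")
    registry[name] = {"provider": provider, "model": model}
    return registry


# Illustrative addition; the display name is made up for this example.
register_llm(EXAMPLE_LLMS, "ChatGPT4o-mini", provider="openai", model="gpt-4o-mini")
print(sorted(EXAMPLE_LLMS))
```

Refusing duplicates is a design choice for the sketch: it avoids accidentally replacing a configuration that other cells already rely on.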


In [6]:
from modules.models import AVAILABLE_EMBS, AVAILABLE_LLMS


print(list(AVAILABLE_LLMS.keys()))
print(list(AVAILABLE_EMBS.keys()))

['ChatGPT4o', 'ChatGPT3.5-turbo', 'Llama3.2-3b']
['openai-ada-002', 'openai-embedding-3-small', 'openai-embedding-3-large', 'hf-mpnet-base-v2', 'hf-minilm-l6-v2', 'hf-multiqa-minilm-l6-v1']
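Since the configuration cell refers to models by string key, a typo only surfaces when the model is initialized. The sketch below shows one way to check the names up front. The key lists are copied from the output above; the helper `check_name` is hypothetical and not part of the `modules` package.

```python
import difflib

# Key lists copied from the output printed above.
LLM_KEYS = ["ChatGPT4o", "ChatGPT3.5-turbo", "Llama3.2-3b"]
EMB_KEYS = [
    "openai-ada-002", "openai-embedding-3-small", "openai-embedding-3-large",
    "hf-mpnet-base-v2", "hf-minilm-l6-v2", "hf-multiqa-minilm-l6-v1",
]


def check_name(name, available, kind):
    """Return `name` if it is an available key, else raise with a suggestion."""
    if name in available:
        return name
    close = difflib.get_close_matches(name, available, n=1)
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"Unknown {kind} {name!r}.{hint}")


# The names used in the configuration cell pass the check.
check_name("ChatGPT3.5-turbo", LLM_KEYS, "LLM")
check_name("hf-mpnet-base-v2", EMB_KEYS, "embedding")
```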


In [2]:
# Prepare the dataset
data = preprocess.prepare_data(processed_data_path, raw_data_path)

# Initialize RAG chain components
rag_llm = models.init_llm(rag_model_name)
rag_embedding = models.init_emb(rag_emb_name)

# Initialize KnowledgeBase components
kb_llm = models.init_llm(kb_model_name)
kb_embedding = models.init_emb(kb_emb_name)
wrapped_llm = GiskardLLMAdapter(kb_llm)
wrapped_embedding = GiskardEmbeddingAdapter(kb_embedding)

# Initialize the VectorstoreHandler
vst_handler = VectorstoreHandler(
    persist_directory=vsts_dir,
    embedding=rag_embedding,
    dataset=data,
    force_rebuild=False  # Set to True to force rebuild of the vectorstore
)

# Build vectorstore and retriever
vst = vst_handler.init_vectorstore()
retriever = vst_handler.init_retriever(vst, k)
chain = RAG_tools.build_rag_chain(retriever, rag_llm)
answer_fn = RAG_tools.create_answer_fn(chain)

Loading processed data from data/processed/processed_data.pkl...
Creating a new vectorstore...
Splitting text into chunks...


Processing documents: 100%|██████████| 178/178 [01:05<00:00,  2.72it/s]


Adding documents to the vectorstore...
Vectorstore initialized and metadata saved successfully in data/vectorstores/sentence-transformers_all-mpnet-base-v2/9a7d1387f14cfe05da8b0b963ad3411f6081192c542dc549d032564dca53d52a.
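The retriever above is built with `k = 3`, meaning it returns the three chunks most similar to the query. As a generic illustration of that idea (not the `VectorstoreHandler` implementation, which is not shown here), the sketch below ranks toy vectors by cosine similarity. The choice of cosine similarity and the 2-d vectors are assumptions for illustration; a real vectorstore may use another metric or an approximate index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k document vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-d "embeddings", made up for illustration only.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.0], docs, k=3))
```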


In [3]:
# Create the KnowledgeBase
kb = KnowledgeBase(
    data=data,
    embedding_model=wrapped_embedding,
    llm_client=wrapped_llm
)

# Generate the test set
test_set = generate_testset(
    knowledge_base=kb,
    num_questions=10,
    language='en',
    agent_description=(
        "This is an agent that uses the following context to answer the question. "
        "It only uses information from the context provided. It does not ask questions."
    )
)

2025-01-12 21:51:58,359 pid:13178 MainThread giskard.rag  INFO     Finding topics in the knowledge base.


OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.


2025-01-12 21:52:07,497 pid:13178 MainThread giskard.rag  INFO     Found 3 topics in the knowledge base.


Generating questions: 100%|██████████| 10/10 [00:32<00:00,  3.25s/it]


In [4]:
report = evaluate(answer_fn=answer_fn, testset=test_set, knowledge_base=kb)
display_evaluation_results(report)

Asking questions to the agent: 100%|██████████| 10/10 [00:13<00:00,  1.31s/it]
CorrectnessMetric evaluation: 100%|██████████| 10/10 [00:11<00:00,  1.14s/it]



=== RAG Evaluation Results ===

Overall Correctness: 20.00%
Total Test Cases: 10

--- RAG Component Scores ---
GENERATOR: 30.00%
RETRIEVER: 0.00%
REWRITER: 33.33%
ROUTING: 100.00%
KNOWLEDGE_BASE: 11.11%

--- Correctness by Question Type ---
complex: 0.00%
conversational: 0.00%
distracting element: 0.00%
double: 100.00%
simple: 0.00%
situational: 50.00%

--- Correctness by Topic ---
content='"European Funding Programmes"' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 7, 'prompt_tokens': 2224, 'total_tokens': 2231, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None} id='run-a6ba0979-a67b-405c-a641-afa4ea4c0ae7-0' usage_metadata={'input_tokens': 2224, 'output_tokens': 7, 'total_tokens':