# Comparison of Agentic RAG vs Reason-ModernColBERT

Notebook author: Danny Williams @ Weaviate

This notebook will compare an 'agentic' RAG solution to dynamically searching a database via breaking down a question with complex reasoning, to a new method for complex reasoning retrieval: [Reason-ModernColBERT](https://huggingface.co/lightonai/Reason-ModernColBERT).

For an overview of Reason-ModernColBERT, I recommend you check out [this recipe](https://github.com/weaviate/recipes/blob/main/weaviate-features/multi-vector/reason_moderncolbert.ipynb).


### Setup

First, let's set up the dependencies for the notebook. For our generative models we will use OpenAI (and tiktoken for counting tokens), we need PyLate and Sentence Transformers to load the Reason-ModernColBERT model, we will of course use Weaviate as the vector database search engine, and to assess the quality of the results we will use DeepEval. We will also use rich for pretty printing. 

In [1]:
%%capture
!pip install weaviate-client==4.14.4
!pip install pylate==1.2.0
!pip install openai==1.84.0
!pip install tiktoken==0.9.0
!pip install deepeval==3.0.1
!pip install rich==13.9.4
!pip install protobuf==6.31.1

Additional requirements include your Weaviate instance being on version 1.29 or later.

Now, we define some helper functions for later which count and reduce the number of tokens in the data so we can truncate the texts if they are too long.

In [2]:
import tiktoken

def tokenize_text(text: str):
    return tiktoken.get_encoding("o200k_base").encode(text)

def reduce_tokens(text: str, max_tokens: int):
    tokens = tokenize_text(text)
    if len(tokens) > max_tokens:
        return tiktoken.get_encoding("o200k_base").decode(tokens[:max_tokens])
    return text

def count_tokens(text: str):
    return len(tokenize_text(text))

And define the embedding function for the target multi vector, using the Reason-ModernColBERT model.

In [3]:
from pylate import models

# Load the ModernColBERT model
model = models.ColBERT(
    model_name_or_path="lightonai/Reason-ModernColBERT",
)

def multi_vec_embed(text: str):
    return model.encode(text, is_query=False)

  from .autonotebook import tqdm as notebook_tqdm


## Data


We are going to use the [BioASQ dataset](https://www.bioasq.org/), as it contains a lot of domain specific knowledge, whose questions require breaking down into individual parts and complex reasoning to obtain good retrieval performance. The BioASQ dataset contains:
- 40.2K text passages
- 4.72K question and answer pairs with corresponding relevant passage ids

For this example notebook, let us consider only a subset of 100 questions. Each question has corresponding `relevant_passage_ids`, which is a list detailing which passages are pertinent to answering the question. We will include all these relevant passage IDs in the subset dataset, as well as a sample of irrelevant passages.

Let us first use pandas to load these datasets.

In [4]:
import pandas as pd

questions_splits = {'train': 'question-answer-passages/train-00000-of-00001.parquet', 'test': 'question-answer-passages/test-00000-of-00001.parquet'}
questions_df = pd.read_parquet("hf://datasets/enelpol/rag-mini-bioasq/" + questions_splits["train"])

texts_splits = {'train': 'text-corpus/train-00000-of-00001.parquet', 'test': 'text-corpus/test-00000-of-00001.parquet'}
texts_df = pd.read_parquet("hf://datasets/enelpol/rag-mini-bioasq/" + texts_splits["train"])

And we can look at a brief snapshot of the data below.

In [5]:
questions_df.head()

Unnamed: 0,question,answer,id,relevant_passage_ids
0,What is the implication of histone lysine meth...,"Aberrant patterns of H3K4, H3K9, and H3K27 his...",1682,"[23179372, 19270706, 23184418]"
1,What is the role of STAG1/STAG2 proteins in di...,STAG1/STAG2 proteins are tumour suppressor pro...,3722,"[26997282, 21589869, 19822671, 29867216, 15361..."
2,What is the association between cell phone use...,The association between cell phone use and inc...,1235,"[20215713, 17851009, 22882019, 12527940, 24348..."
3,What is the applicability of the No Promoter L...,No Promoter Left Behind (NPLB) is an efficient...,2103,[26530723]
4,Does the Oncotype DX test work with paraffin e...,"Yes, the Oncotype DX test works with paraffin ...",1713,"[23074401, 17039265, 18922117, 17463177, 16361..."


In [6]:
texts_df.head()

Unnamed: 0,passage,id
0,New data on viruses isolated from patients wit...,9797
1,We describe an improved method for detecting d...,11906
2,We have studied the effects of curare on respo...,16083
3,Kinetic and electrophoretic properties of 230-...,23188
4,Male Wistar specific-pathogen-free rats aged 2...,23469


### Subset Data

In [7]:
subset_questions_df = questions_df.sample(frac=100/len(questions_df))

relevant_passages = []
for i, row in subset_questions_df.iterrows():
    relevant_passages.extend(row["relevant_passage_ids"])

relevant_passages = list(set(relevant_passages))
irrelevant_passages = texts_df.id.tolist()
irrelevant_passages = [p for p in irrelevant_passages if p not in relevant_passages]

# Get twice as many irrelevant passages
import random
random.seed(42)
irrelevant_passages = random.sample(irrelevant_passages, len(relevant_passages)*2)

subset_texts_df = texts_df[texts_df.id.isin(relevant_passages + irrelevant_passages)]


In [8]:
print(f"Number of relevant passages: {len(relevant_passages)}")
print(f"Number of irrelevant passages: {len(irrelevant_passages)}")
print(f"Subset of {len(subset_texts_df)} passages created")
print(f"Subset of {len(subset_questions_df)} questions created")

Number of relevant passages: 826
Number of irrelevant passages: 1652
Subset of 2478 passages created
Subset of 100 questions created


### Add Data to Weaviate

First, let's connect to the Weaviate client. In this instance we are connecting to Weaviate cloud using API keys stored in the local environment. But you can also use [Weaviate Embedded](https://weaviate.io/developers/weaviate/installation/embedded), [Docker](https://weaviate.io/developers/weaviate/installation/docker-compose), amongst other options.

In [9]:
import os

import weaviate
from weaviate.classes.config import Configure, Property, DataType
from weaviate.classes.init import Auth

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.getenv("WCD_URL"),
    auth_credentials=Auth.api_key(os.getenv("WCD_API_KEY")),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)



In [10]:
print(client.is_ready())

True


Now we will create a collection called `bioASQ_passages`, containing two named vectors - one for the single vector embeddings (using the default OpenAI vectorizer), and one with the multi-vector embeddings using the Reason-ModernColBERT model. For both we will use [scalar quantization](https://weaviate.io/developers/weaviate/concepts/vector-quantization#scalar-quantization) to reduce the memory footprint of the vectors.

In [11]:
collection = client.collections.create(
    "bioASQ_passages",
    vectorizer_config=[
        Configure.NamedVectors.none(
            name="multi_vector",
            vector_index_config=Configure.VectorIndex.hnsw(
                multi_vector=Configure.VectorIndex.MultiVector.multi_vector()
            )
        ),
        Configure.NamedVectors.text2vec_openai(
            name="single_vector",
            vector_index_config=Configure.VectorIndex.hnsw()
        ),  
    ],
    properties=[
        Property(
            name="text",
            data_type=DataType.TEXT,
            vectorize_property_name=False  
        ),
        Property(
            name="docid",
            data_type=DataType.TEXT,
            vectorize_property_name=False  
        ),
    ],
)

/Users/danny/Documents/Work/Other/recipes/.venv/lib/python3.12/site-packages/weaviate/collections/classes/config.py:1963: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
  for cls_field in self.model_fields:


#### Import to Collection


In [12]:
from weaviate.util import generate_uuid5
collection = client.collections.get("bioASQ_passages")
with collection.batch.fixed_size(10) as batch:
    for iter, (_, doc) in enumerate(subset_texts_df.iterrows()):

        uuid = generate_uuid5(doc.id)

        text = doc.passage
        
        if count_tokens(text) > 6_000:
            text = reduce_tokens(text, 6_000) # truncate long documents

        batch.add_object(
            properties={"text": text, "docid": str(doc.id)},
            vector={"multi_vector": multi_vec_embed(text)},
            uuid=uuid
        )

        print(f"\r{iter+1}/{len(subset_texts_df)}", end="", flush=True)

2478/2478

## Search Methods


In [13]:
from openai import OpenAI
openai_client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

In [14]:
from pydantic import BaseModel, Field

answer_question_system_prompt = """
You are an expert at answering biomedical questions using a given context.
You will be given a question, and a list of documents.
You need to answer the question using the context.
Do not answer the question if you do not have enough information.
If the context does not fully answer the question, you must respond with "I do not know" or similar.
Provide the answer only in a concise manner.
Use citations to reference the context in your answer. It should be formatted as e.g. "[1]" at the end of each sentence that references the context.
The contexts are marked with a number in square brackets, e.g. "[1]", at the beginning of each context.
"""

class AnswerQuestionOutput(BaseModel):
    answer: str

def answer_question(question: str, context: list[str]):
    context_str = ""
    for i, c in enumerate(context):
        context_str += f"[{i+1}]: {c}\n\n"

    response = openai_client.beta.chat.completions.parse(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": answer_question_system_prompt},
            {"role": "user", "content": f"Question: {question}\nContext: {context_str}"}
        ],
        response_format=AnswerQuestionOutput
    )
    return response.choices[0].message.parsed.answer

### Agentic RAG with Single Vector Search

The 'Agentic' RAG pipeline we will build is as follows. First, the system prompt loosely defines the dataset and gives instructions to the model to break down the question in terms of its core components. Then the model will extract these components to use in a search engine (Weaviate).

The reasoning field will improve the model performance via chain-of-thought prompting, and allow the model to decide what parts of the question need to be delved into deeper to get more relevant search results.

In [15]:
from pydantic import BaseModel, Field

class ModelOutput(BaseModel):
    reasoning: str
    search_components: list[str] = Field(min_items=3, max_items=3) # ensure the model is forced to provide 3 components (for comparison equity)

extract_reasoning_system_prompt = """
You are an expert at breaking down biomedical questions into their fundamental components used for a retrieval service.
You will be given a question, and you need to first explain your reasoning for breaking down the question into the components.
Then you need to provide the components.
Think carefully about what these components should be - it may not be outright stated in the question.
Use deductive reasoning to think step by step to complete this task.
Each component should be independent as they will be used separately to find relevant documents via a search engine.
You MUST provide 3 components only. No more, no less.
"""

# Use GPT-4.1-mini for the model (cheaper option but will have lower quality)
def extract_reasoning(question: str):
    response = openai_client.beta.chat.completions.parse(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": extract_reasoning_system_prompt},
            {"role": "user", "content": f"Question: {question}"}
        ],
        response_format=ModelOutput
    )
    reasoning = response.choices[0].message.parsed.reasoning
    components = response.choices[0].message.parsed.search_components
    return reasoning, components

def agentic_rag_search(question: str, verbose: bool = False):
    reasoning, components = extract_reasoning(question)
    if verbose:
        print(f"Model Reasoning:\n{reasoning}")
    
    contexts = []
    doc_ids = []
    for search_component in components:
        response = collection.query.hybrid(
            query=search_component,
            target_vector="single_vector", # specify the single vector (not multi-vector)
            limit=5
        )
        if verbose:
            print(f"Search Component: '{search_component}'")
            print(f"Search Results:")
            for i,obj in enumerate(response.objects):
                print(f"  {i+1}. {obj.properties['text'][:50].replace('\n', ' ')}...")
        contexts.extend([obj.properties['text'] for obj in response.objects])
        doc_ids.extend([int(obj.properties['docid']) for obj in response.objects])

    answer = answer_question(question, contexts)
    if verbose:
        print(f"Answer: {answer}")
    return answer, contexts, doc_ids


/Users/danny/Documents/Work/Other/recipes/.venv/lib/python3.12/site-packages/pydantic/fields.py:1058: PydanticDeprecatedSince20: `min_items` is deprecated and will be removed, use `min_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
/Users/danny/Documents/Work/Other/recipes/.venv/lib/python3.12/site-packages/pydantic/fields.py:1064: PydanticDeprecatedSince20: `max_items` is deprecated and will be removed, use `max_length` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/


#### Example

Let's run the agentic model on the first question in the dataset to see an example in action.

In [16]:
agentic_rag_search(subset_questions_df.iloc[0]["question"], verbose=True);

Model Reasoning:
The question asks specifically about thyroid hormone analogs that have been used in human studies. To address this, it is important to identify: 1) the focus on 'thyroid hormone analogs' which implies substances structurally or functionally similar to thyroid hormones, 2) the usage context in 'human studies' rather than animal models or theoretical compounds, and 3) the need to look for specific names or classes of these analogs that are documented in clinical or experimental human research. These components will help retrieve scientific or clinical literature discussing thyroid hormone analogs tested or used in humans.
Search Component: 'Thyroid hormone analogs'
Search Results:
  1. Thyroid hormones [predominantly 3, 5, 3 -I- iodoth...
  2. Thyromimetic agents that can treat dyslipidemia wi...
  3. We previously reported that T3(3,3',5-triiodo-L-th...
  4. We have recently described the proangiogenesis eff...
  5. 3,5,3,'-Triiodothyroacetic acid (Triac) has been u...


### Reason Modern-ColBERT Embedding Search

In [17]:
def reason_moderncolbert_search(question: str, verbose: bool = False):
    response = collection.query.near_vector(
        near_vector=multi_vec_embed(question),
        target_vector="multi_vector", # specify the multi-vector (not single-vector)
        limit=15 # 3*the results per component (which is max 3, so this search has the same number of contexts as agentic search)
    )
    contexts = [obj.properties['text'] for obj in response.objects]
    doc_ids = [int(obj.properties['docid']) for obj in response.objects]
    if verbose:
        print(f"Search Results:")
        for i, context in enumerate(contexts):
            print(f"  {i+1}. {context[:25].replace('\n', ' ')}...")

    # use the same answer_question function as the agentic search
    answer = answer_question(question, contexts)
    if verbose:
        print(f"Answer: {answer}")
    return answer, contexts, doc_ids

#### Example

In [18]:
reason_moderncolbert_search(subset_questions_df.iloc[0]["question"], verbose=True);

Search Results:
  1. Thyroid hormones [predomi...
  2. CONTEXT: Thyronamines are...
  3. 3,5,3,'-Triiodothyroaceti...
  4. We previously reported th...
  5. Thyromimetic agents that ...
  6. The endogenous thyroid ho...
  7. We have recently describe...
  8. BACKGROUND: Tetraiodothyr...
  9. OBJECTIVE: The monocarbox...
  10. The worldwide prevalence ...
  11. A protein that binds tetr...
  12. Diiodothyropropionic acid...
  13. The monocarboxylate trans...
  14. Gross clinical manifestat...
  15. BACKGROUND: The effective...
Answer: The thyroid hormone analogs utilized in human studies include GC1, KB-2115, KB-141, thyronamines (including 3-iodothyronamine, 3-T1AM), 3,5,3'-triiodothyroacetic acid (Triac), 3,5,3'-triiodothyropropionic acid (Triprop), eprotirome, sobetirome, diiodothyropropionic acid (DITPA), tetraiodothyroacetic acid (tetrac), and 3,5-diiodo-L-thyronine (T2). Some of these analogs are under investigation for effects on lipid metabolism, cardiac effects, angiogenesis, a

In [19]:
answer, contexts, doc_ids = reason_moderncolbert_search(subset_questions_df.iloc[0]["question"], verbose=True);

Search Results:
  1. Thyroid hormones [predomi...
  2. CONTEXT: Thyronamines are...
  3. 3,5,3,'-Triiodothyroaceti...
  4. We previously reported th...
  5. Thyromimetic agents that ...
  6. The endogenous thyroid ho...
  7. We have recently describe...
  8. BACKGROUND: Tetraiodothyr...
  9. OBJECTIVE: The monocarbox...
  10. The worldwide prevalence ...
  11. A protein that binds tetr...
  12. Diiodothyropropionic acid...
  13. The monocarboxylate trans...
  14. Gross clinical manifestat...
  15. BACKGROUND: The effective...
Answer: The thyroid hormone analogs utilized in human studies include GC1, KB-2115, KB-141, thyronamines (such as 3-iodothyronamine), 3,5,3'-triiodothyroacetic acid (Triac), 3,5,3'-triiodothyropropionic acid (Triprop), Eprotirome, Sobetirome, diiodothyropropionic acid (DITPA), tetrac, and 3,5-diiodo-L-thyronine (T2). Triac has been used in therapy for resistance to thyroid hormone and studied for transcriptional activation, while DITPA is in phase II clinical tria

## Comparison

The [DeepEval framework](https://deepeval.com/) is a handy tool for measure certain statistics:
- [Contextual Precision](https://deepeval.com/docs/metrics-contextual-precision) evaluates the relevance of the ranking of the retrieved contexts according to the question.
- [Contextual Recall](https://deepeval.com/docs/metrics-contextual-recall) evaluates the quality of the retrieval context in how it aligns with the true answers.
- [Answer Relevancy](https://deepeval.com/docs/metrics-answer-relevancy) evaluates the quality of the final answer output by the LLM.

The DeepEval framework uses an LLM as a judge to determine these scores.

In [None]:
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric, ContextualRecallMetric, AnswerRelevancyMetric
)

precision_metric = ContextualPrecisionMetric(
    threshold=0.7,
    model="gpt-4.1-mini",
    include_reason=False, # keep reasoning off for speed/compute cost
    verbose_mode=False
)

recall_metric = ContextualRecallMetric(
    threshold=0.7,
    model="gpt-4.1-mini",
    include_reason=False,
    verbose_mode=False
)

answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4.1-mini",
    include_reason=False,
    verbose_mode=False
)


/Users/danny/Documents/Work/Other/recipes/.venv/lib/python3.12/site-packages/pydantic/_internal/_config.py:323: PydanticDeprecatedSince20: Support for class-based `config` is deprecated, use ConfigDict instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/Users/danny/Documents/Work/Other/recipes/.venv/lib/python3.12/site-packages/pydantic/fields.py:1089: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'lias'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.

In [22]:
# import random
# num_test_cases = 1

# random.seed(42)
# random_indices = random.sample(range(len(subset_questions_df)), num_test_cases)

# questions_subset = subset_questions_df["question"].iloc[random_indices]
# answers_subset = subset_questions_df["answer"].iloc[random_indices]

Set up the test cases for this subset of the question/answer pairs.

In [None]:
agentic_rag_test_cases = []
reason_moderncolbert_test_cases = []
recalls = pd.DataFrame(index=subset_questions_df.question, columns=["agentic_rag", "reason_moderncolbert"])
for i in range(len(subset_questions_df)):
    question = subset_questions_df.iloc[i]
    answer = subset_questions_df.iloc[i]
    relevant_passage_ids = subset_questions_df.iloc[i]["relevant_passage_ids"]

    agentic_answer, agentic_contexts, agentic_doc_ids = agentic_rag_search(question)
    reason_answer, reason_contexts, reason_doc_ids = reason_moderncolbert_search(question)

    recalls.loc[question, "agentic_rag"] = recall(relevant_passage_ids, agentic_doc_ids)
    recalls.loc[question, "reason_moderncolbert"] = recall(relevant_passage_ids, reason_doc_ids)

    agentic_rag_test_case = LLMTestCase(
        input=question, 
        actual_output=agentic_answer,
        expected_output=answer,
        retrieval_context=agentic_contexts
    )

    reason_moderncolbert_test_case = LLMTestCase(
        input=question, 
        actual_output=reason_answer,
        expected_output=answer,
        retrieval_context=reason_contexts
    )

    agentic_rag_test_cases.append(agentic_rag_test_case)
    reason_moderncolbert_test_cases.append(reason_moderncolbert_test_case)

Evaluate each metric.

In [24]:
from deepeval.evaluate import DisplayConfig
eval_config = DisplayConfig(
    verbose_mode=False,
    print_results=False,
    show_indicator=False
)

agentic_rag_results = evaluate(
    test_cases=agentic_rag_test_cases, 
    metrics=[precision_metric, recall_metric, answer_relevancy_metric],
    display_config=eval_config
)

reason_moderncolbert_results = evaluate(
    test_cases=reason_moderncolbert_test_cases, 
    metrics=[precision_metric, recall_metric, answer_relevancy_metric],
    display_config=eval_config
)



In [25]:
results_data = pd.DataFrame(
    columns=["agentic_rag", "reason_moderncolbert"],
    index=pd.MultiIndex.from_tuples(
        [
            (question, metric)
            for question in subset_questions_df.question
            for metric in ["Contextual Precision", "Contextual Recall", "Answer Relevancy"]
        ],
        names=["question", "metric"]
    )
)


In [44]:
for i, test_result in enumerate(agentic_rag_results.test_results):
    question = questions_subset.iloc[i]
    for metrics_data in test_result.metrics_data:
        metric = metrics_data.name
        results_data.loc[(question, metric), "agentic_rag"] = metrics_data.score

for i, test_result in enumerate(reason_moderncolbert_results.test_results):
    question = questions_subset.iloc[i]
    for metrics_data in test_result.metrics_data:
        metric = metrics_data.name
        results_data.loc[(question, metric), "reason_moderncolbert"] = metrics_data.score


Now we can groupby the index on the `metric` index and see the average results for the DeepEval benchmarks.

In [45]:
results_data.groupby(level=[1]).mean()

Unnamed: 0_level_0,agentic_rag,reason_moderncolbert
metric,Unnamed: 1_level_1,Unnamed: 2_level_1
Answer Relevancy,1.0,0.777778
Contextual Precision,0.332776,0.918366
Contextual Recall,1.0,1.0


In [41]:
results_data.groupby(level=[1]).mean()


Unnamed: 0_level_0,agentic_rag,reason_moderncolbert
metric,Unnamed: 1_level_1,Unnamed: 2_level_1
Answer Relevancy,1.0,0.777778
Contextual Precision,0.332776,0.918366
Contextual Recall,1.0,1.0


In [46]:
results_data.to_csv("results_data.csv")