[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/llm-agent-frameworks/agentic-rag-benchmark/vanilla-rag-vs-agentic-rag.ipynb)

# Vanilla RAG versus Agentic RAG

## === Final Score ===

### `Agentic RAG`: 34 Wins
### `Vanilla RAG`: 11 Wins

## ===============

This notebook compares Vanilla RAG with Agentic RAG on the task of answering questions about Weaviate.

Both systems are connected to a Weaviate Database instance containing chunks of Weaviate's blog posts. These blog posts can help answer questions such as: "How does BM25 work?", "What was released in Weaviate 1.27?", or "What is Retrieval-Augmented Generation?", to give a few examples.

We use an **LLM-as-Judge** to decide, for each question, whether the Vanilla RAG answer or the Agentic RAG answer is better.

Both systems, as well as the LLM judge, use the **GPT-4o** Large Language Model.

### 1. Import Data into Weaviate Cloud

The following code cells illustrate a fairly standard process: load markdown files from disk, split them into chunks of five sentences each, and import the chunks into Weaviate.

In [2]:
import os
weaviate_url = os.environ["WEAVIATE_URL"]
weaviate_api_key = os.environ["WEAVIATE_API_KEY"]
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

In [9]:
import weaviate
import weaviate.classes.config as wvcc
from weaviate.classes.init import Auth

weaviate_client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,
    auth_credentials=Auth.api_key(weaviate_api_key),
    headers={
        "X-OpenAI-Api-Key": OPENAI_API_KEY
    }
)

print(weaviate_client.is_ready())

# Create Schema
if weaviate_client.collections.exists("WeaviateBlogChunk"):
    weaviate_client.collections.delete("WeaviateBlogChunk") 

collection = weaviate_client.collections.create(
    name="WeaviateBlogChunk",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    ),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="author", data_type=wvcc.DataType.TEXT),
    ]
)


True


In [10]:
import os
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def read_and_chunk_index_files(main_folder_path):
    """Read index.md files from subfolders, split into sentences, and chunk every 5 sentences."""
    blog_chunks = []
    for folder_name in os.listdir(main_folder_path):
        subfolder_path = os.path.join(main_folder_path, folder_name)
        if os.path.isdir(subfolder_path):
            index_file_path = os.path.join(subfolder_path, 'index.mdx')
            if os.path.isfile(index_file_path):
                with open(index_file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    sentences = split_into_sentences(content)
                    sentence_chunks = chunk_list(sentences, 5)
                    sentence_chunks = [' '.join(chunk) for chunk in sentence_chunks]
                    blog_chunks.extend(sentence_chunks)
    return blog_chunks

# Example usage
main_folder_path = './blog'
blog_chunks = read_and_chunk_index_files(main_folder_path)

In [11]:
len(blog_chunks)

1874
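
To make the chunking behaviour concrete, here is a small sketch that runs the two helpers from the cell above on a toy string. The helpers are copied as-is; the toy text and the chunk size of 2 are assumptions chosen only to keep the output short (the notebook itself uses 5).

```python
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

# Toy input (hypothetical, not taken from the blog posts).
toy_text = (
    "BM25 is a ranking function. How does it work? It scores terms. "
    "HNSW is a graph index. It has layers."
)

sentences = split_into_sentences(toy_text)
# Group every 2 sentences into one chunk; the last chunk may be shorter.
toy_chunks = [" ".join(group) for group in chunk_list(sentences, 2)]

for chunk in toy_chunks:
    print(repr(chunk))
```

Note that the regular expression only splits after `.` or `?`, so sentences ending in `!` or headings without terminal punctuation stay attached to their neighbours.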

### Chunking Visualization

Each blog post's markdown is split into chunks of five sentences.

To build intuition for what this looks like, here is the first chunk.

In [14]:
print(blog_chunks[0])  # the first chunk (five sentences)

---
title: 'Accelerating Vector Search up to +40% with Intel’s latest Xeon CPU - Emerald Rapids'
slug: intel
authors: [zain, asdine, john]
date: 2024-03-26
image: ./img/hero.png
tags: ['engineering', 'research']
description: 'Boosting Weaviate using SIMD-AVX512, Loop Unrolling and Compiler Optimizations'
---

![HERO image](./img/hero.png)

**Overview of Key Sections:**
- [**Vector Distance Calculations**](#vector-distance-calculations) Different vector distance metrics popularly used in Weaviate. - [**Implementations of Distance Calculations in Weaviate**](#vector-distance-implementations) Improvements under the hood for implementation of Dot product and L2 distance metrics. - [**Intel’s 5th Gen Intel Xeon Processor, Emerald Rapids**](#enter-intel-emerald-rapids)  More on Intel's new 5th Gen Xeon processor. - [**Benchmarking Performance**](#lets-talk-numbers) Performance numbers on microbenchmarks along with simulated real-world usage scenarios. What’s the most important calculation a 
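
Because the chunks are built from sentences, their sizes vary. A rough way to check how large they actually are is to summarize their lengths. The sketch below counts whitespace-separated words, which is only a proxy for model tokens; `chunk_length_stats` and `sample_chunks` are hypothetical names introduced for illustration.

```python
from statistics import mean, median

def chunk_length_stats(chunks):
    """Summarize chunk sizes in whitespace-separated words (a rough proxy for tokens)."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    lengths = [len(chunk.split()) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": round(mean(lengths), 1),
        "max": max(lengths),
    }

# Stand-in data; in the notebook you would pass `blog_chunks` instead.
sample_chunks = ["one two three", "four five", "six seven eight nine"]
print(chunk_length_stats(sample_chunks))
```

In the notebook, `chunk_length_stats(blog_chunks)` would show the spread across all chunks.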

### Import to Weaviate

In [13]:
blogs = weaviate_client.collections.get("WeaviateBlogChunk")

# Insert one chunk at a time; the collection's vectorizer embeds each chunk on insert.
for blog_chunk in blog_chunks:
    blogs.data.insert(
        properties={
            "content": blog_chunk
        }
    )
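
Inserting objects one at a time issues a separate request per chunk. As an alternative sketch, the v4 Python client also offers a batching context manager on the collection. The helper name `import_chunks` is hypothetical, and the snippet assumes a client version where `collection.batch.dynamic()` and `collection.batch.failed_objects` are available; check this against the client version you have installed.

```python
def import_chunks(collection, chunks):
    """Insert chunks through the collection's batching context manager.

    Returns the number of objects the client reported as failed.
    Assumes a Weaviate v4 collection object (not verified against every release).
    """
    with collection.batch.dynamic() as batch:
        for chunk in chunks:
            batch.add_object(properties={"content": chunk})
    # The client collects per-object failures instead of raising on each one.
    return len(collection.batch.failed_objects)
```

In this notebook the call would look like `failed = import_chunks(blogs, blog_chunks)`, followed by a check that `failed` is zero.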

### 2. Build Vanilla RAG and Agentic RAG Systems

The following code cells implement the Vanilla RAG and Agentic RAG systems.

We also reuse the same class to implement the LLM-as-Judge.

There are 4 key things to note here:

1. Basic connection and conventional `generate` API

2. Structured Outputs `generate`

3. Add Weaviate Search as a Tool

4. Function Calling Loop

In [17]:
# Tools model used for OpenAI Function Calling API
from pydantic import BaseModel
from typing import Optional, Literal

class ParameterProperty(BaseModel):
    type: str
    description: str
    enum: Optional[list[str]] = None


class Parameters(BaseModel):
    type: Literal["object"]
    properties: dict[str, ParameterProperty]
    required: Optional[list[str]]


class Function(BaseModel):
    name: str
    description: str
    parameters: Parameters


class Tool(BaseModel):
    type: Literal["function"]
    function: Function

In [42]:
import json

class LM_System():
    def __init__(
            self,
            weaviate_client: weaviate.WeaviateClient,
            model_name: str,
            model_provider: str,
            api_key: str
    ):
        self.weaviate_client = weaviate_client

        self.model_name = model_name
        self.model_provider = model_provider
        self.api_key = api_key
        
        match self.model_provider:
            case "openai":
                from openai import OpenAI
                self.client = OpenAI(api_key=self.api_key)
            case _:
                raise ValueError(f"Unsupported model_provider: {model_provider}")

    def execute_tool(
            self, 
            tool_name: str, 
            tool_arguments: dict
        ) -> str:
        match tool_name:
            case "search_blogs":
                return self.search_blogs(**tool_arguments)
            case _:
                raise ValueError(f"Invalid tool_name: {tool_name}")

    def search_blogs(
            self, 
            search_query: str
        ) -> str:
        search_collection = self.weaviate_client.collections.get("WeaviateBlogChunk")
        results = search_collection.query.hybrid(
            query=search_query,
            limit=5
        )
        stringified_response = ""
        for idx, o in enumerate(results.objects):
            stringified_response += f"Search Result: {idx+1}:\n"
            for prop in o.properties:
                stringified_response += f"{prop}:{o.properties[prop]}"
            stringified_response += "\n"
        
        return stringified_response

    def generate(
            self,
            prompt: str
    ) -> str:
        messages = [
            {
                "role": "system", 
                "content": "You are a helpful assistant. Use the supplied tools to assist the user."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
        
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages
        )
        return response.choices[0].message.content

    def generate_with_output_model(
            self,
            prompt: str,
            output_model: type[BaseModel]
    ) -> str:
        messages = [
            {
                "role": "system", 
                "content": "You are a helpful assistant. Use the supplied tools to assist the user."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
        response = self.client.beta.chat.completions.parse(
            model=self.model_name,
            messages=messages,
            response_format=output_model
        )
        parsed_response = response.choices[0].message.parsed
        # Serialize the parsed Pydantic object to a JSON string (Pydantic v2 API).
        return parsed_response.model_dump_json()
        
    def vanilla_rag(
            self,
            search_query: str
    ) -> str:
        context = self.search_blogs(search_query=search_query)
        vanilla_rag_prompt = f"""Assess the context and answer the question.

        [[ question ]]
        {search_query}

        [[ context ]]
        {context}

        [[ answer ]]"""
        answer = self.generate(prompt=vanilla_rag_prompt)
        return answer

    def generate_with_function_calling_loop(
            self,
            prompt: str,
            tools: list[Tool]
    ) -> str:
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant. Use the supplied tools to assist the user."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
        calls, call_budget = 0, 20
    
        # Initial call to get first response
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            tools=tools
        ).choices[0]

        while calls < call_budget:
            message = response.message
            
            if not message.tool_calls:
                return message.content
            
            # Add assistant message with tool calls
            messages.append({
                "role": "assistant",
                "content": message.content if message.content else None,
                "tool_calls": [
                    {
                        "id": tool_call.id,
                        "type": "function", 
                        "function": {
                            "name": tool_call.function.name,
                            "arguments": tool_call.function.arguments
                        }
                    } for tool_call in message.tool_calls
                ]
            })
            
            # Handle parallel function calls
            for tool_call in message.tool_calls:
                function_response = self.execute_tool(
                    tool_name=tool_call.function.name,
                    tool_arguments=json.loads(tool_call.function.arguments)
                )
                
                messages.append({
                    "role": "tool",
                    "content": function_response,
                    "tool_call_id": tool_call.id
                })
            
            # Get next response
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=messages,
                tools=tools
            ).choices[0]
            
            calls += 1
        
        return "Exceeded maximum number of function calls"
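
The control flow of `generate_with_function_calling_loop` can be exercised without any API calls. The sketch below is a simplified stand-in, not the class method itself: `run_tool_loop`, `fake_model`, and `fake_execute_tool` are hypothetical names, messages are plain dictionaries, and the budget counts every model call rather than only the follow-up calls.

```python
import json

def run_tool_loop(call_model, execute_tool, prompt, call_budget=20):
    """Call the model until it answers without requesting tools, or the budget runs out."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(call_budget):
        reply = call_model(messages)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            # No tool requested: the model has produced its final answer.
            return reply["content"]
        # Record the assistant turn, then one tool message per requested call.
        messages.append(reply)
        for call in tool_calls:
            result = execute_tool(call["name"], json.loads(call["arguments"]))
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
    return "Exceeded maximum number of function calls"

def fake_model(messages):
    """Hypothetical model: search once, then answer from the tool result."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "name": "search_blogs",
                "arguments": json.dumps({"search_query": "HNSW"}),
            }],
        }
    return {"role": "assistant", "content": "Answer based on: " + messages[-1]["content"]}

def fake_execute_tool(name, arguments):
    """Hypothetical tool executor that echoes its inputs."""
    return f"{name} results for {arguments['search_query']}"

print(run_tool_loop(fake_model, fake_execute_tool, "What is HNSW?"))
```

The key difference from Vanilla RAG is visible here: the model decides whether to search, what to search for, and when it has enough context to stop.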


In [43]:
lm_service = LM_System(
    weaviate_client=weaviate_client,
    model_name="gpt-4o",
    model_provider="openai",
    api_key=OPENAI_API_KEY
)

print(lm_service.generate("say hello"))

Hello! How can I assist you today?


### Structured Output Demo

Structured Outputs constrain the model's response to a schema. In this demo, the `greeting` field can only take the value `hello how are you` or `Hello!`.

We use the same mechanism for the LLM-as-Judge, which must pick the system that produced the better answer to a technical question about Weaviate.

In [44]:
class StructuredHello(BaseModel):
    greeting: Literal["hello how are you", "Hello!"]

print(lm_service.generate_with_output_model(
    prompt="say hello",
    output_model=StructuredHello)
)

{"greeting":"Hello!"}



### Search Demo

In [46]:
print(lm_service.search_blogs("Hnsw"))

Search Result: 1:
content:Here are some key differences between Vamana and HNSW:

### Vamana indexing - in short:
* Build a random graph. * Optimize the graph, so it only connects vectors close to each other. * Modify the graph by removing some short connections and adding some long-range edges to speed up the traversal of the graph. ### HNSW indexing - in short:
* Build a hierarchy of layers to speed up the traversal of the nearest neighbor graph. * In this graph, the top layers contain only long-range edges.author:None
Search Result: 2:
content:Furthermore, the most popular library hnswlib only supports snapshotting, but not individual writes to disk. To get to where Weaviate is today, a custom HNSW implementation was needed. It follows the same principles [as outlined in this paper](https://arxiv.org/abs/1603.09320) but extends it with more features. Each write is added to a [write-ahead log](https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html). Additionally, 

### `Vanilla RAG` Quick Test

In [47]:
lm_service.vanilla_rag(
    search_query="What is HNSW?"
)

'HNSW stands for Hierarchical Navigable Small World. It is a data structure and algorithm used for approximate nearest neighbor search in high-dimensional spaces. HNSW builds a hierarchy of graph layers to expedite the traversal and search process. In this structure, the top layers contain only long-range edges, which allow quicker navigation across the graph. HNSW is optimized for in-memory access and can support querying simultaneously with data insertion, although it faces challenges with full CRUD operations. Its primary advantage is the efficient traversal of the nearest neighbor graph, facilitated by its hierarchical representation.'
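
One detail of `vanilla_rag` worth knowing: the triple-quoted f-string is indented inside the method, so every line after the first reaches the model with leading spaces. This may well be harmless, but if you prefer a prompt without the indentation, a small builder like the following sketch avoids it. `build_vanilla_rag_prompt` is a hypothetical helper, not part of the notebook's class.

```python
def build_vanilla_rag_prompt(question, context):
    """Assemble the Vanilla RAG prompt with the same sections and no leading indentation."""
    return "\n\n".join([
        "Assess the context and answer the question.",
        f"[[ question ]]\n{question}",
        f"[[ context ]]\n{context}",
        "[[ answer ]]",
    ])

example_prompt = build_vanilla_rag_prompt(
    question="What is HNSW?",
    context="Search Result: 1:\ncontent:HNSW builds a hierarchy of layers.",
)
print(example_prompt)
```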

### `Agentic RAG` Quick Test

In [48]:
tools = [Tool(
    type="function",
    function=Function(
        name="search_blogs",
        description="Search a Vector Database containing blog posts information about Weaviate.",
        parameters=Parameters(
            type="object",
            properties={
                "search_query": ParameterProperty(
                    type="string",
                    description="The natural language query to search for in the database"
                )
            },
            required=["search_query"]
        )
    )
)]
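
For reference, here is the same tool written as a plain dictionary in the shape that the Chat Completions `tools` parameter accepts. This is a sketch for readability; the variable name `search_blogs_tool` is an assumption, and the notebook itself passes the Pydantic `Tool` object.

```python
import json

# Plain-dict equivalent of the Tool(...) definition above.
search_blogs_tool = {
    "type": "function",
    "function": {
        "name": "search_blogs",
        "description": "Search a Vector Database containing blog posts information about Weaviate.",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {
                    "type": "string",
                    "description": "The natural language query to search for in the database",
                }
            },
            "required": ["search_query"],
        },
    },
}

print(json.dumps(search_blogs_tool, indent=2))
```

The `name` must match a case in `execute_tool`, and the keys under `properties` must match the keyword arguments of the Python function that gets called.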

In [49]:
lm_service.generate_with_function_calling_loop(
    prompt="What is HNSW?",
    tools=tools
)

"HNSW, which stands for Hierarchical Navigable Small World, is a method used for creating a nearest neighbor graph with a hierarchy of layers that speeds up the traversal process. Here's a brief overview based on the search results:\n\n1. **Indexing Method**: HNSW creates a hierarchical graph to facilitate efficient nearest neighbor search. The top layers of this hierarchy contain longer-range edges which help in faster traversal of the graph. This structure accelerates the process of finding the nearest neighbors.\n\n2. **Growth Pattern**: To handle large volumes of data without significant slowdown, the HNSW index grows relatively, meaning its size increases by a fixed percentage or number of objects, whichever is larger.\n\n3. **Mutability**: HNSW supports querying during the insertion process, enabling more dynamic use cases. However, there are limitations in the context of full CRUD operations, as updating indices directly is not well supported.\n\n4. **Performance Considerations*

### 3. Evaluate Agentic RAG vs. Vanilla RAG

In [57]:
from pydantic import BaseModel

class Winner(BaseModel):
    rationale: str
    winner: Literal["vanilla rag", "agentic rag"]

class RAGEvalModel(BaseModel):
    query: str
    response: str
    win: bool

In [58]:
from datasets import load_dataset

ds = load_dataset("weaviate/WeaviateBlogRAG-0-0-0")["train"] # Please leave a heart if you find this dataset useful!

In [65]:
ds[0]["query"]

"What is the role of the Binary Independence Model in the BM25 algorithm used by Weaviate's hybrid search?"

In [63]:
# Judge prompt, scoreboard counters, and the evaluation loop

compare_system_responses = """Assess the responses from two systems and determine which one had the better response:

[[ answer from vanilla rag system ]]
{vanilla_rag_response}

[[ answer from agentic rag system ]]
{agentic_rag_response}

[[ winning system ]]
"""

vanilla_rag_scores, agentic_rag_scores = [], []
vanilla_rag_wins = 0
agentic_rag_wins = 0

for idx, row in enumerate(ds):
    if idx == 0:
        print(f"Logging run {idx+1} of {len(ds)}:")
    query = row["query"]
    if idx == 0:
        print(f"\033[1;32mQuery: {query}\n\033[0m")
    
    vanilla_rag_response = lm_service.vanilla_rag(
        search_query=query
    )
    if idx == 0:
        print("\033[1;32mVanilla RAG Response:\n\033[0m")
        print(vanilla_rag_response)
    
    agentic_rag_response = lm_service.generate_with_function_calling_loop(
        prompt=query,
        tools=tools
    )

    if idx == 0:
        print("\033[1;32m\nAgentic RAG Response:\n\033[0m")
        print(agentic_rag_response)
    
    formatted_compare_system_responses = compare_system_responses.format(
        vanilla_rag_response=vanilla_rag_response,
        agentic_rag_response=agentic_rag_response
    )
    
    winner = lm_service.generate_with_output_model(
        prompt=formatted_compare_system_responses,
        output_model=Winner
    )

    winner = json.loads(winner)

    if idx == 0:
        print("\033[1;32m\nJudged Winner to be:\033[0m")
        print(winner["winner"])
        print("\033[1;32mWith Rationale:\033[0m")
        print(winner["rationale"])

    winner = winner["winner"]
    
    if winner == "vanilla rag":
        vanilla_rag_wins += 1
    else:
        agentic_rag_wins += 1
        
    print("\033[96m\nScoreboard:\033[0m")
    print(f"Vanilla RAG: {vanilla_rag_wins} wins")
    print(f"Agentic RAG: {agentic_rag_wins} wins\n")

    # Save results
    vanilla_rag_scores.append(RAGEvalModel(
        query=query,
        response=vanilla_rag_response,
        win=(winner == "vanilla rag")
    ))
    
    agentic_rag_scores.append(RAGEvalModel(
        query=query,
        response=agentic_rag_response, 
        win=(winner == "agentic rag")
    ))


Logging run 1 of 45:
Query: What is the role of the Binary Independence Model in the BM25 algorithm used by Weaviate's hybrid search?

Vanilla RAG Response:

The Binary Independence Model (BIM) plays a significant role in the BM25 algorithm, which is used in Weaviate's hybrid search. BM25 builds on the Term-Frequency Inverse-Document Frequency (TF-IDF) scoring method by incorporating the Binary Independence Model from the IDF calculation. This model assumes that the presence or absence of a term in a document is independent of the presence or absence of other terms. It is used to weigh the uniqueness of each keyword in the query relative to the collection of texts. Additionally, BM25 adds a normalization penalty that considers a document's length relative to the average length of all documents in the database. This adjustment helps in providing more relevant search results.

Agentic RAG Response:

The role of the Binary Independence Model in the BM25 al

### Final Win Rate

In [64]:
print(f"Agentic RAG win rate: {agentic_rag_wins / len(ds) * 100:.2f}%")

Agentic RAG win rate: 75.56%
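
The same figure can be recomputed from saved per-query records rather than from the running counters. The sketch below uses a small dataclass as a stand-in for `RAGEvalModel` (`EvalRecord` and `win_rate` are hypothetical names) and reproduces the arithmetic for 34 wins out of 45 queries.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """Stand-in for RAGEvalModel with only the fields needed here."""
    query: str
    win: bool

def win_rate(records):
    """Return the percentage of records marked as wins."""
    if not records:
        raise ValueError("records must not be empty")
    return 100 * sum(record.win for record in records) / len(records)

# Synthetic records matching the final score: 34 wins in 45 queries.
records = [EvalRecord(query=f"q{i}", win=(i < 34)) for i in range(45)]
print(f"Agentic RAG win rate: {win_rate(records):.2f}%")
```

In the notebook, `win_rate(agentic_rag_scores)` would give the Agentic RAG rate and `win_rate(vanilla_rag_scores)` the complementary Vanilla RAG rate.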
