# Building a RAG application and evaluating its performance using LLM-as-a-judge and Maxim AI

In this cookbook, we'll create a RAG application using the following tools:
- **MongoDB**: We'll use MongoDB Atlas as a vector database to retrieve relevant information from our knowledge base.
- **OpenAI**: We'll use OpenAI's models to:
    - Create embeddings
    - Generate a response from the retrieved context
    - Create an LLM-as-a-judge evaluator to evaluate the relevance of our context
- **Maxim AI**: We'll use Maxim's suite to:
    - Trace the workflows of our RAG application (input, retrieval and generation)
    - Evaluate the quality of input, context and output.

### Importing Dependencies

In [1]:
import os
import openai
from openai import OpenAI
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from pymongo.operations import SearchIndexModel

import json
from uuid import uuid4

- Follow this [guide to fetch your connection string (URI) from MongoDB.](https://www.mongodb.com/docs/guides/atlas/connection-string/)
- Follow this [guide to generate and export your OpenAI API key.](https://platform.openai.com/docs/quickstart#create-and-export-an-api-key)

In [2]:
mongo_uri = os.getenv("MONGO_URI") 
openAI_key = os.getenv("OPENAI_API_KEY")
openai_client = OpenAI(api_key=openAI_key)
openai.api_key = openAI_key

### Reading File and Creating chunks
We'll start by reading our .txt file and chunking the text.

**Chunking**: breaking large texts into smaller, manageable units (here, fixed-size runs of characters). Key benefits of chunking:
- <u>Handle embedding API token limits</u>: if the text is too long, the API will reject the request
- <u>Be specific</u>: smaller chunks capture local meaning better and stay relevant to the query
- <u>Memory management</u>: large texts can overwhelm system memory

In [3]:
# function to read a txt file and return its entire content as a single string
def read_txt_file(filePath):
    with open(filePath, "r",encoding='utf-8') as file:
        fileText = file.read() # return single string of all the text
    return fileText

# function to split the string into smaller chunks of specified size.
def chunk_txt(text, chunkSize): # based on number of chars
    chunkedText = [text[i:i+chunkSize] for i in range(0, len(text), chunkSize)]
    return chunkedText
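
The fixed-size splitter above can cut a sentence in half at a chunk boundary. One common mitigation is to let consecutive chunks share a few characters. The sketch below is an illustrative variant, not part of the original notebook; the name `chunk_txt_with_overlap` and its `overlap` parameter are assumptions introduced here.

```python
def chunk_txt_with_overlap(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so content that straddles
    a boundary appears intact in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []

    step = chunk_size - overlap
    # Stop early enough that the last chunk is not just a repeat of the
    # overlapping tail of the previous chunk.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


if __name__ == "__main__":
    for chunk in chunk_txt_with_overlap("abcdefghij", chunk_size=4, overlap=2):
        print(repr(chunk))
```

With `overlap=0` this behaves like `chunk_txt`. A larger overlap produces more chunks, and therefore more embedding calls, so it is a trade-off rather than a free improvement.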

Our RAG application will be based on the Harry Potter and the Deathly Hallows book. <.txt file to be uploaded to maximHQ Github and linked here> 

In [4]:
# uncomment following lines of code to read file and split the text into chunks
# textData= read_txt_file("harry-potter-deathly-hallows.txt")
# chunks = chunk_txt(textData, 250)

### Creating Vector Embeddings
**Embeddings** are vector representations of text that capture semantic meaning. We'll use OpenAI's [text-embedding-ada-002](https://platform.openai.com/docs/guides/embeddings#embedding-models) embedding model.

We'll create embeddings for a list of text chunks i.e., the entire content of our book split into smaller parts. This enables us to vectorize the knowledge base for semantic search.

In [5]:
# Embedding function: generates embeddings for input text using OpenAI's embedding model
def get_embedding(text):
    response = openai.embeddings.create(
            model= "text-embedding-ada-002",
            input= [text] )
    return response.data[0].embedding

# creates embeddings of each chunk and return a list of documents containing text and corresponding embedding.
def embed_chunks(chunks):
    documents = []
    for i in range(0, len(chunks)):
        document = {
            "text": chunks[i],
            "embedding": get_embedding(chunks[i]),
        }
        documents.append(document)
    return documents
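
`embed_chunks` makes one API request per chunk, which can be slow for a whole book. A possible alternative is to send several chunks per request. The sketch below shows only the batching logic and takes the embedding call as a parameter; `batched`, `embed_chunks_batched`, and `embed_batch` are hypothetical names introduced for illustration. In this notebook, `embed_batch` could plausibly wrap `openai.embeddings.create(model="text-embedding-ada-002", input=batch)` and return `[item.embedding for item in response.data]`, but that wiring is an assumption and is left out so the block runs without network access.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_chunks_batched(
    chunks: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[Dict[str, object]]:
    """Build {"text", "embedding"} documents, embedding batch_size chunks per call."""
    documents = []
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(list(batch))
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        for text, vector in zip(batch, vectors):
            documents.append({"text": text, "embedding": vector})
    return documents


if __name__ == "__main__":
    # Stand-in embedder for demonstration only: one number per text.
    fake_embed = lambda texts: [[float(len(t))] for t in texts]
    print(embed_chunks_batched(["a", "bb", "ccc"], fake_embed, batch_size=2))
```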

In [6]:
mongo_client = MongoClient(mongo_uri)
db = mongo_client["sample_mflix"] # replace "sample_mflix" with the name of your database
collection = db["hp_embedding"] # replace "hp_embedding" with the name of your collection


# uncomment the following lines of code to store embeddings in your collection

# documents = embed_chunks(chunks)
# collection.insert_many(documents)

### Creating Vector Search Index
Next, we need to create a vector search index in MongoDB for semantic similarity search. A vector index enables efficient retrieval of relevant context based on its similarity to the user's query.

Here, we'll use **cosine similarity** as the index's similarity metric. Cosine similarity is the cosine of the angle between two vectors: 1 means the vectors point in the same direction, 0 means they are orthogonal (no correlation), and -1 means they point in opposite directions.
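
To make the metric concrete, here is a small pure-Python sketch of cosine similarity on plain lists. It is illustrative only: the vector index computes similarity internally, and the helper name `cosine_similarity` is introduced here, not taken from any library used in this notebook.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
    print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the measure depends only on direction, scaling a vector does not change its similarity to another vector.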

In [7]:
# create a vector index for our collection in MongoDB
def create_vector_index(collection, indexName, fieldName):

    search_index_model = SearchIndexModel(
      definition={
        "fields": [
          {
            "type": "vector",
            "numDimensions": 1536,
            "path": fieldName,#name of the field where embeddings are stored, here "embedding"
            "similarity":  "cosine"
          }
        ]
      },
      name= indexName,
      type="vectorSearch",
    )
    collection.create_search_index(model=search_index_model)

In [8]:
# uncomment the following line of code to create a vector index for your collection

# create_vector_index(collection,"vector_index_hp", "embedding")
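
A newly created search index is not usable immediately; it has to finish building first. The sketch below polls until the index reports that it is queryable. It assumes a PyMongo version that provides `Collection.list_search_indexes` and that each returned index document carries a boolean `queryable` field; the helper name `wait_until_queryable` and its timing parameters are introduced here for illustration.

```python
import time


def wait_until_queryable(collection, index_name, poll_seconds=5.0, timeout_seconds=300.0):
    """Poll the collection's search indexes until `index_name` is queryable.

    Returns True once the index reports queryable=True, or False if
    timeout_seconds elapses first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        # Assumption: list_search_indexes(name) yields index description
        # documents, each with a boolean "queryable" field.
        indexes = list(collection.list_search_indexes(index_name))
        if indexes and indexes[0].get("queryable") is True:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

In this notebook it could be called as `wait_until_queryable(collection, "vector_index_hp")` after `create_vector_index`, before running any `$vectorSearch` query.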

### Retrieving context from Vector DB
We'll create a retriever function that queries MongoDB to fetch relevant context from the vector database based on the input query. The vector database uses semantic similarity to find the "k" most relevant text chunks for our input.

In [9]:
# function to  retrieve context from vector database using semantic search.
def retrieve_context(query: str):
    query_vector = get_embedding(query)
    index_name = "vector_index_hp"
    field_name = "embedding"
    
    response = collection.aggregate([
        {
            '$vectorSearch': {
                "index": index_name, #name of the vector index
                "path": field_name, #name of the field where embeddings are stored
                "queryVector": query_vector,
                "numCandidates": 50,
                "limit": 10 #top k chunks to be retrieved
            }
        },
        {
            "$project": {
                'text' : 1,
                "search_score": { "$meta": "vectorSearchScore" }
            }
        }
    ])
    context = [item.get('text', 'N/A') for item in response]
    return context

### Generating LLM Response

We'll use OpenAI's `gpt-4o` model to generate the response. We'll pass the user query and the retrieved context to the LLM so that the response is grounded in the context.

In [10]:
def generate_response(query: str):
    context = retrieve_context(query)
    # Adjacent string literals are concatenated, so each part needs a trailing space
    prompt = (
        f"You are a smart agent. A question will be asked to you along with relevant context. "
        f"Your task is to answer the question using the information provided. "
        f"Question: {query}. Context: {context}"
    )

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    result = response.choices[0].message.content
    return result
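
Interpolating `{context}` directly into the f-string inserts the Python list representation, including brackets, quotes, and escaped `\n` sequences. A possible refinement is to join the chunks into numbered passages first. The helpers `format_context` and `build_prompt` below are hypothetical names introduced for this sketch, and the prompt wording is only an example.

```python
from typing import Sequence


def format_context(chunks: Sequence[str]) -> str:
    """Join retrieved chunks into numbered passages separated by blank lines."""
    return "\n\n".join(
        f"[{number}] {chunk.strip()}" for number, chunk in enumerate(chunks, start=1)
    )


def build_prompt(query: str, chunks: Sequence[str]) -> str:
    """Assemble the user prompt from the question and the formatted context."""
    return (
        "Answer the question using the information provided in the context.\n\n"
        f"Question: {query}\n\n"
        f"Context:\n{format_context(chunks)}"
    )


if __name__ == "__main__":
    print(build_prompt(
        "What are the Deathly Hallows?",
        ["“Those are the Deathly Hallows,” said Xenophilius.", "“The Elder Wand,” he said"],
    ))
```

Numbering the passages also makes it easier to ask the model, or a judge, to say which passage supports an answer.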

In [11]:
query = "What are the Deathly Hallows?"

In [12]:
context = retrieve_context(query)
print(context)

['\n“Sorry?” said Hermione, sounding confused.\n“Those are the Deathly Hallows,” said Xenophilius.\nHe picked up a quill from a packed table at his elbow, and pulled a torn piece of parchment from between more books.\n“The Elder Wand,” he said, and he drew', 'eyebrows.\n“Are you referring to the sign of the Deathly Hallows?”\n\n\x0cChapter 21\nThe Tale of The Three Brothers\n\nHarry turned to look at Ron and Hermione. Neither of them seemed to have understood what Xenophilius had said either.\n“The Deathly Hallows?', 'd come true.\n“And at the heart of our schemes, the Deathly Hallows! How they fascinated him, how they fascinated both of us! The unbeatable wand, the weapon that would lead us to power! The Resurrection Stone — to him, though I pretended not to know ', '“What are you talking about?” asked Harry, startled by Dumbledore’s tone, by the sudden tears in his eyes.\n“The Hallows, the Hallows,” murmured Dumbledore. “A desperate man’s dream!”\n“But they’re real!”\n“Real, and dang

In [13]:
llm_response = generate_response(query)
print(llm_response)

The Deathly Hallows are three magical objects that are said to grant their possessor mastery over death. These objects are:

1. **The Elder Wand** - An unbeatable wand with unparalleled power.
2. **The Resurrection Stone** - A stone that can bring back the dead, albeit not truly resurrecting them.
3. **The Invisibility Cloak** - A cloak that renders the wearer completely invisible.

In the context of the story, they are tied to the Peverell brothers, who were said to have each possessed one of these objects. The legend around them suggests that uniting all three Hallows would make someone the Master of Death.


## Creating LLM-as-a-judge Evaluator
Now we'll create an LLM-as-a-judge evaluator to measure the context relevance of our workflow. We'll use OpenAI's `gpt-4o-mini` model as the judge.

<u>Context relevance</u>: assesses the effectiveness of your RAG pipeline's retriever by determining how relevant the information in the retrieved context is to the given input. In simple terms, Context Relevance = Number of Relevant Statements / Total Number of Statements.
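
As a worked example of the ratio, the sketch below computes context relevance from per-statement relevance flags. Both helper names are introduced here for illustration. The conversion to a five-point score (ratio multiplied by 5) is an assumption that mirrors the example scores used in the prompts later in this cookbook, not a definition taken from any library.

```python
from typing import Sequence


def context_relevance_ratio(relevant_flags: Sequence[bool]) -> float:
    """Return relevant statements divided by total statements."""
    if not relevant_flags:
        raise ValueError("at least one statement is required")
    relevant = sum(1 for flag in relevant_flags if flag)
    return relevant / len(relevant_flags)


def ratio_to_five_point(ratio: float) -> float:
    """Scale a 0..1 ratio to a 0..5 score, rounded to one decimal (assumed mapping)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return round(5 * ratio, 1)


if __name__ == "__main__":
    # Three statements in the context, two of them relevant to the input.
    ratio = context_relevance_ratio([True, True, False])
    print(ratio, ratio_to_five_point(ratio))
```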

In [14]:
# using LLM as a Judge to evaluate Context relevance
def context_relevance(input, context, prompt):
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": prompt.format(input=input, context=context),
            }
        ]
    )
    return response.choices[0].message.content

### #1 Basic scoring
For the first version of our Context Relevance Evaluator, we will directly prompt the model to rate the relevance of the given context on a scale from 1 to 5.

In [15]:
prompt_1 = """
You are a CONTEXTUAL RELEVANCE EVALUATOR who has to score how relevant the context given is, based on the given input:
Rate the submission on a scale of 1 to 5.
Input: {input}
Context: {context}
"""
response_1 = context_relevance(query, context, prompt_1)

In [16]:
response_1

'I would rate the relevance of the context provided to the input "What are the Deathly Hallows?" as a 5. \n\nThe context consists of multiple excerpts discussing the Deathly Hallows directly, which clearly addresses the subject of the input. It includes quotes from characters and details about the Hallows themselves, such as their significance, the legend surrounding them, and their connection to other characters in the narrative. This makes the context highly relevant and informative regarding the query about what the Deathly Hallows are.'

### #2 Added reasoning
We'll further refine our prompt and instruct the model to assign scores based on a set of criteria. We'll also ask the model to explain its score, which gives us insight into the reasoning behind it.

In [17]:
prompt_2 = """
Score the context's relevancy to the input from 1-5 based on:
1. Topic match: Keywords and subject alignment of input and context
2. Quality of the context: The information needed to answer the input is present in the context
Also, provide a reason for the score given.
Example:
Example Context: "AlphaFold won the Nobel Prize in 2024. In the same year, OpenAI released the O3 model with reasoning. There was a cat on the road."
Example Input: "What were the biggest events of 2024?"
Example Score: 3.3
Example Reason: Two out of the three sentences in the context are relevant to the input
Input: {input}
Context: {context}
"""
response_2 = context_relevance(query, context, prompt_2)

In [18]:
response_2

"Score: 4.7\n\nReason: The context is highly relevant to the input as it directly discusses the Deathly Hallows. There are multiple mentions of the Deathly Hallows in different contexts within the provided text, ensuring a strong alignment with the topic. Additionally, the context includes explanations and references that inform the reader about the nature of the Deathly Hallows, their significance, and their connection to characters in the narrative. The only reason it's not a perfect score is that the context may require some prior knowledge of the story for complete clarity, but overall, it provides substantial information."

### #3 Comprehensive evaluation with thought process
In this version, we'll prompt the model to reason step by step (chain of thought) before scoring, so that it considers the full scope of the input and context. For scoring, we'll define criteria that encourage critical thinking and structured reasoning.

We'll also use OpenAI's structured output feature to return the evaluator's result as a typed object whose fields we can access directly.
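
For comparison, without structured output the score has to be pulled out of the judge's free text. The sketch below does that with a regular expression; `extract_score` and the pattern are illustrative assumptions, and the pattern only recognises output that labels the number with the word "score".

```python
import re
from typing import Optional

# Matches e.g. "Score: 4.7", "score = 3", or "Score 5" (case-insensitive).
_SCORE_PATTERN = re.compile(r"score\s*[:=]?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_score(judge_output: str) -> Optional[float]:
    """Return the first labelled score in the text, or None if there is none."""
    match = _SCORE_PATTERN.search(judge_output)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    print(extract_score("Score: 4.7\n\nReason: The context is highly relevant"))
    print(extract_score("I would rate the relevance of the context as a 5."))
```

The second call finds nothing because that response never uses the word "score". This kind of brittleness is what a typed response format avoids.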

In [19]:
from pydantic import BaseModel

In [20]:
# Return the thought and score of the LLM-as-a-judge evaluator in a structured way
class Reasoning(BaseModel):
    thought: str
    score: str

client = OpenAI()

prompt_3 = """
You are an evaluator who analyzes if the context is relevant to the input.
Before scoring, analyze in the thought:
1. What does the input ask for?
2. What information does the context provide?
3. What's missing or irrelevant?
Then score (1-5):
1. Topic match: Keywords and subject alignment of input and context
2. Quality of the context: The information needed to answer the input is present in the context
Explain your reasoning and give the total score.
Example:
Example Input: "What is photosynthesis?"
Example Context: "Photosynthesis is how plants make food using sunlight, water, and carbon dioxide. We have many trees in Bangalore."
Example Thought: "The input asks for photosynthesis. The first statement in the context addresses it, while the second does not."
Example Score: 2.5
Example Reason: Out of the two statements, only one context statement is relevant to the input, hence a score of 2.5
Input: {input}
Context: {context}
"""

response_3 = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": prompt_3.format(input=query, context=context),
        }
    ],
    response_format=Reasoning,
)

In [21]:
structured_output= response_3.choices[0].message.parsed

In [22]:
print("Evaluator score:", structured_output.score)
print("Reason for the score:", structured_output.thought)

Evaluator score: 3.5
Reason for the score: 1. The input asks for information about 'the Deathly Hallows.' Specifically, it seems to inquire about their definition or significance in the context of the Harry Potter universe.  
2. The context provides dialogue from a scene where characters discuss the Deathly Hallows and their history, including mentions of the Elder Wand, Resurrection Stone, and the tale of the three brothers, which are key elements related to the Deathly Hallows. However, it lacks a clear, concise definition or explanation of what the Deathly Hallows are, instead relying on indirect references and dialogue.  
3. The context includes various characters discussing the Deathly Hallows but does not explicitly define them or explain their importance comprehensively. Therefore, while there is relevant material, the clarity and completeness needed to accurately answer the input are missing.


## Using Maxim AI
Maxim is an end-to-end [AI evaluation and observability](https://www.getmaxim.ai/) platform. 

### Getting started
We'll use Maxim's SDK to trace and evaluate our RAG workflow.
- Get started for free by [signing up here](https://app.getmaxim.ai/sign-up).
- Follow this [guide to generate the Maxim API key](https://www.getmaxim.ai/docs/sdk/observability/python/overview#generate-maxim-api-key) and make sure to copy the API key before closing the dialog.

Ref: [Installation steps to get started with Maxim's python SDK](https://www.getmaxim.ai/docs/sdk/observability/python/overview)

In [23]:
from maxim import Config, Maxim
from maxim.logger import Logger, LoggerConfig

from maxim.logger import TraceConfig, SpanConfig, GenerationConfig, RetrievalConfig

In [24]:
maxim_api_key = os.getenv("MAXIM_API_KEY")
maxim_log_key = os.getenv("LOG_REPOSITORY_ID")

maxim = Maxim(Config(api_key=maxim_api_key))
logger = maxim.logger(LoggerConfig(id=maxim_log_key))

### Tracing and evaluating RAG components using Maxim SDK
We'll trace the generation and retrieval using Maxim's SDK. Maxim enables us to attach LLM-as-a-judge evaluators directly to our traces for continuous evaluation. 

Components of Maxim's [logging hierarchy](https://www.getmaxim.ai/docs/observability/concepts#components-of-logs):
- [Session](https://www.getmaxim.ai/docs/observability/concepts#session): The top-level entity that captures all the multi-turn interactions of your system.
- [Trace](https://www.getmaxim.ai/docs/observability/concepts#trace): A trace is the complete processing of a request through a distributed system, including all the actions between the request and the response.
- [Span](https://www.getmaxim.ai/docs/observability/concepts#span): Spans are fundamental building blocks of distributed tracing. A single trace in distributed tracing consists of a series of tagged time intervals known as spans.
- [Generation](https://www.getmaxim.ai/docs/observability/concepts#generation): A Generation represents a single Large Language Model (LLM) call within a trace or span. Multiple generations can exist within a single trace/span.
- [Retrieval](https://www.getmaxim.ai/docs/observability/concepts#retrieval): A Retrieval (commonly used in RAG) represents a query operation to fetch relevant context or information from a knowledge base or vector database within a trace or span. 

Ref: [Tracing your workflow using Maxim](https://www.getmaxim.ai/docs/sdk/observability/python/manual-integration)
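
To visualise how these components nest, here is a plain-Python sketch of the hierarchy using dataclasses. These classes are hypothetical stand-ins written for this illustration; they are not Maxim's SDK classes and do not send any data anywhere.

```python
from dataclasses import dataclass, field
from typing import List, Union
from uuid import uuid4


def _new_id() -> str:
    return str(uuid4())


@dataclass
class Retrieval:
    """One query against a knowledge base or vector database."""
    query: str
    chunks: List[str] = field(default_factory=list)
    id: str = field(default_factory=_new_id)


@dataclass
class Generation:
    """One LLM call."""
    model: str
    prompt: str
    output: str = ""
    id: str = field(default_factory=_new_id)


@dataclass
class Span:
    """A tagged unit of work inside a trace; holds retrievals and generations."""
    name: str
    components: List[Union[Retrieval, Generation]] = field(default_factory=list)
    id: str = field(default_factory=_new_id)


@dataclass
class Trace:
    """The complete processing of one request."""
    name: str
    spans: List[Span] = field(default_factory=list)
    id: str = field(default_factory=_new_id)


@dataclass
class Session:
    """A multi-turn interaction made up of traces."""
    traces: List[Trace] = field(default_factory=list)
    id: str = field(default_factory=_new_id)


if __name__ == "__main__":
    span = Span(name="rag", components=[
        Retrieval(query="What is the Elder Wand?", chunks=["“The Elder Wand,” he said"]),
        Generation(model="gpt-4o", prompt="What is the Elder Wand?"),
    ])
    session = Session(traces=[Trace(name="user-question", spans=[span])])
    print(len(session.traces[0].spans[0].components))
```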

#### Tracing our retriever function
We'll log the input we're passing to our vector index and the context we're fetching.

In [25]:
def retrieve_context_with_trace(query: str, span):
    query_vector = get_embedding(query)
    index_name = "vector_index_hp"
    field_name = "embedding"
    retrieval = span.retrieval(RetrievalConfig(id=str(uuid4())))
    retrieval.input(query) #logging input to retriever i.e user query
    
    response = collection.aggregate([
        {
            '$vectorSearch': {
                "index": index_name, #name of the vector index
                "path": field_name, #name of the field where embeddings are stored
                "queryVector": query_vector,
                "numCandidates": 50,
                "limit": 10 #top k chunks to be retrieved
            }
        },
        {
            "$project": {
                'text' : 1,
                "search_score": { "$meta": "vectorSearchScore" }
            }
        }
    ])
    context = [item.get('text', 'N/A') for item in response]
    
    retrieval.output(context) #logging the output of retrieval action i.e the context

    retrieval.end()
    return context

#### Tracing our LLM generation function and adding evaluators to logs using SDK
We'll log the generated response and model parameters such as cost, tokens, and latency for performance monitoring.

Further, using Maxim, we can attach evaluators to each level of our logging hierarchy (i.e., trace, span, or component within the span). Here, we'll:
- <u>Evaluate trace</u>: to check the **relevance of our retrieved context** with respect to input and output.
- <u>Evaluate LLM generation</u>: to check the **clarity** of our model's response and detect any **bias** in it.

Read more about [adding evals at node level using Maxim](https://www.getmaxim.ai/docs/observability/evaluating-logs/agentic-evaluation#conclusion)

In [33]:
def generate_response_with_trace(query: str):
    trace = logger.trace(TraceConfig(id=str(uuid4().hex)))
    span = trace.span(SpanConfig(id=str(uuid4())))
    
    context = retrieve_context_with_trace(query, span)
    prompt = (
        f"You are a smart agent. A question will be asked to you along with relevant context. "
        f"Your task is to answer the question using the information provided. "
        f"Question: {query}. Context: {context}"
    )

    response = openai_client.chat.completions.create(
        model="gpt-4o",  # You can change this to "gpt-4" if you have access
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )

    generationConfig = GenerationConfig(
        id=str(uuid4()),
        name="llmGeneration",
        provider="openAI",
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": query  # log the input to generation step
        }]
    )

    generation = trace.generation(generationConfig)
    generation.result(response)  # Maxim's generation.result() expects result to be in OpenAI response format.

    llm_response = response.choices[0].message.content

    # evaluating bias and clarity in our LLM's generated response
    generation.evaluate().with_evaluators("clarity", "bias").with_variables({
        "output": llm_response
    })
    
    trace.evaluate().with_evaluators("context relevance")
    trace.evaluate().with_variables(
        { 
            "output": llm_response,
            "input": query,
            "context": context
        }, 
        ["context relevance"] # List of evaluators
    )
    span.end()
    trace.end()  # close the trace once the span has ended

    return llm_response

In [34]:
query_2 = "What is the Elder Wand?"

In [35]:
generate_response_with_trace(query_2)

"The Elder Wand is one of the most powerful wands in the wizarding world, often described as more powerful than any other wand. It is one of the Deathly Hallows, and is also known as the Deathstick or the Wand of Destiny. According to legend, it was created by Death himself and must always win duels for its true owner. The wand is unique in that its allegiance can change if its owner is disarmed or defeated, and it is said that the wand chooses the wizard. The wand's power is most effectively harnessed by its rightful master, the person who has won its allegiance."

![Maxim platform](rag-tracing-and-evaluation-openai.gif)

## Use local dataset to trigger test runs on Maxim

We'll programmatically trigger test runs using Maxim's SDK with custom datasets, flexible output functions, and evaluations for our RAG applications.
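
Before wiring anything to Maxim, it can help to see the shape of a local dataset and an output function. The sketch below is plain Python under stated assumptions: `local_dataset`, `run_on_dataset`, and the `"input"` key are hypothetical names chosen for illustration and are not Maxim's test-run API. In this cookbook, `output_fn` could be a function such as `generate_response`.

```python
from typing import Callable, Dict, List

# A local dataset: one row per test query (the queries come from this cookbook).
local_dataset: List[Dict[str, str]] = [
    {"input": "What are the Deathly Hallows?"},
    {"input": "What is the Elder Wand?"},
]


def run_on_dataset(
    dataset: List[Dict[str, str]],
    output_fn: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Call output_fn on every row's input and collect input/output pairs."""
    results = []
    for row in dataset:
        query = row["input"]
        results.append({"input": query, "output": output_fn(query)})
    return results


if __name__ == "__main__":
    # Stand-in output function so the sketch runs without any API calls.
    echo = lambda query: f"(answer to: {query})"
    for result in run_on_dataset(local_dataset, echo):
        print(result)
```

Keeping the output function separate from the dataset loop means the same dataset can be reused to compare different prompts, models, or retrieval settings.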