# Evaluating Domain Specific RAG Chunking & Embedding Strategies

The first step in building a RAG system, choosing how to load and split your documents, is often overlooked yet critical. Recent research from [ChromaDB](https://trychroma.com) titled [Evaluating Chunking Strategies for Retrieval](https://research.trychroma.com/evaluating-chunking) compares several popular chunking approaches alongside a few novel ones to help guide your choice of chunking strategy.

<img src="./media_2/hero_table.png" width="600">

Their main findings while using `text-embedding-3-large` from OpenAI:
1. The **Cluster Semantic Chunker** with a 200 token chunk size achieves the highest precision, precision with perfect recall, and intersection over union.
2. The **LLM Chunker** achieves the highest recall.
3. The **Recursive Character Text Splitter** with chunk size 200 achieves consistently high metrics and is a good lightweight option.

I've broken down how each of these chunking strategies works [in a prior notebook](https://github.com/ALucek/chunking-strategies) using their [respective repo](https://github.com/brandonstarxel/chunking_evaluation). On top of the chunking implementations, Chroma provides an evaluation framework that lets you run tests on both standard and domain-specific documents to determine which chunking and embedding methods work best for your application. Following the research data is a useful start, but running your own experiments helps you find exactly what works for you.

We'll cover how to:

1. Create Custom Chunking Strategies
2. Evaluate Custom & Existing Chunking Strategies
3. Evaluate Custom & Existing Embedding Strategies
4. Create a Synthetic Dataset for Domain Specific Evaluations

#### Installing the [Chunking Evaluation Repo](https://github.com/brandonstarxel/chunking_evaluation/tree/main)

In [1]:
%%capture
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

**Imports & Dependencies**

In [55]:
from chunking_evaluation import GeneralEvaluation, SyntheticEvaluation, BaseChunker
from chromadb.utils import embedding_functions

import pandas as pd
from IPython.display import display, clear_output
import os
import re
from typing import List

---
# Custom Chunking Strategies

We'll use the `BaseChunker` class to define our own. At its core, `BaseChunker` is very simple:

```python
from abc import ABC, abstractmethod
from typing import List

class BaseChunker(ABC):
    @abstractmethod
    def split_text(self, text: str) -> List[str]:
        pass
```

`BaseChunker` expects only a `split_text` method that takes in a string and returns a list of strings, which are our chunks. The transformation along the way can be defined as creatively as you like.

As an example, we'll define a `SentenceChunker` that uses a simple regex to attempt to split text at a basic sentence level, with one variable that controls how many sentences are included in each chunk.

In [3]:
class SentenceChunker(BaseChunker):
    def __init__(self, sentences_per_chunk: int = 3):
        # Initialize the chunker with the number of sentences per chunk
        self.sentences_per_chunk = sentences_per_chunk

    def split_text(self, text: str) -> List[str]:
        # Handle the case where the input text is empty
        if not text:
            return []

        # Split the input text into sentences using regular expression
        # Regex looks for white space following . ! or ? and makes a split
        sentences = re.split(r'(?<=[.!?])\s+', text)
        chunks = []

        # Group sentences into chunks based on the specified number
        for i in range(0, len(sentences), self.sentences_per_chunk):
            # Combine sentences into a single chunk
            chunk = ' '.join(sentences[i:i + self.sentences_per_chunk])
            chunks.append(chunk)
        
        # Return the list of chunks
        return chunks
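
To make the splitting behavior concrete, here is a minimal standalone sketch of the same logic run on a toy string. The function name `split_into_sentence_chunks` and the sample text are illustrative assumptions, not part of the evaluation repo. The second call shows a limitation of the simple regex: it has no notion of abbreviations.

```python
import re
from typing import List

def split_into_sentence_chunks(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """Standalone version of the SentenceChunker logic, for illustration only."""
    if not text:
        return []
    # Split on whitespace that follows ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # Group consecutive sentences into chunks
    return [
        ' '.join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

toy_text = "Chunking matters. Does size matter too? It does! Try a sweep."
chunks = split_into_sentence_chunks(toy_text, sentences_per_chunk=2)
print(chunks)

# The regex treats the period in "Dr." as a sentence boundary
abbrev_sentences = re.split(r'(?<=[.!?])\s+', "Dr. Smith arrived. He left.")
print(abbrev_sentences)
```

For documents heavy in abbreviations or decimal-style numbering, a dedicated sentence tokenizer may give cleaner boundaries than this regex.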


**Loading Example Document**

We'll be using [NVIDIA's Form 10-K for FY24](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001045810/1cbe8fe7-e08a-46e3-8dcc-b429fc06c1a4.pdf) as an example, converted into a plain text file already.

In [81]:
with open("./domain_specific/nvidia_10k.txt", "r", encoding="utf-8") as f:
    nvidia_10k = f.read()

**Chunking the Document**

In [5]:
# Instantiate the SentenceChunker
sentence_chunker = SentenceChunker(sentences_per_chunk = 10)

# Split the Document
sentence_chunks = sentence_chunker.split_text(nvidia_10k)

**Observing the Chunk Data**

In [7]:
len(sentence_chunks)

175

In [13]:
sentence_chunks[10]

'Headquartered in Santa Clara, California, NVIDIA was incorporated in California in April 1993 and reincorporated in Delaware in April 1998. Our Businesses\nWe report our business results in two segments. The Compute & Networking segment is comprised of our Data Center accelerated computing platforms and end-to-end networking platforms including Quantumfor InfiniBand and Spectrum for Ethernet; our NVIDIA DRIVE automated-driving platform and automotive development agreements; Jetson robotics and other\nembedded platforms; NVIDIA AI Enterprise and other software; and DGX Cloud software and services. The Graphics segment includes GeForce GPUs for gaming and PCs, the GeForce NOW game streaming service and related infrastructure; Quadro/NVIDIARTX GPUs for enterprise workstation graphics; virtual GPU, or vGPU, software for cloud-based visual and virtual computing; automotive platforms for\ninfotainment systems; and Omniverse Enterprise software for building and operating metaverse and 3D int

**Now that we have our text corpus, we can run experiments to determine how well it works. First, we need to define our metrics of interest to measure.**

---
# Metrics Breakdown

The built-in metrics here are slightly different from traditional information retrieval metrics, which usually operate at the document level. These are more concerned with measuring the token-level performance of our chunking and embedding strategies. The motivation for this is that:

*For a given query related to a specific corpus, only a subset of tokens within that corpus will be relevant. Ideally, for both efficiency and accuracy, the retrieval system should retrieve exactly and only the relevant tokens for each query across the entire corpus.*

This is better suited to testing chunking as the retrieval component of a RAG system: we care less about which document was retrieved than about the token-level information the LLM will actually process. The goal is to maximize relevant tokens while excluding irrelevant, redundant, and distracting ones.

<img src="./media_2/recall_precision.png" width=800>

## Variables

- $q$ represents a specific query
- $\mathbf{C}$ represents the chunked corpus (the entire document split into chunks)  
- $t_e$ represents the set of tokens in relevant excerpts/highlights (ground truth)  
- $t_r$ represents the set of tokens in retrieved chunks (what our system returns)  

A **highlight** is a segment of text in the original document that contains the relevant information needed to answer a specific query. Highlights serve as the "ground truth" against which we measure our chunking and retrieval performance.  

For example:

- **Document**: "The Sun is composed primarily of hydrogen and helium. Through nuclear fusion in its core, it converts hydrogen into helium, releasing massive amounts of energy. This energy travels to Earth as sunlight and heat."
- **Query**: "How does the Sun produce energy?"
- **Highlight**: "Through nuclear fusion in its core, it converts hydrogen into helium, releasing massive amounts of energy."

## Recall

$\text{Recall}_q(\mathbf{C}) = \frac{|t_e \cap t_r|}{|t_e|}$

**Calculated by**: length of overlap between retrieved chunks and highlights / total length of highlights

Measures what fraction of the important/relevant text is captured by the retrieved chunks. Ranges from 0 to 1, where 1 means all relevant text was captured. A low recall means the chunking strategy is missing important information.

**Answers**: How much of these important highlighted segments did we capture?

**Example**: If a highlight is 100 tokens and our chunks only capture 70 tokens of it, recall = 0.7

## Precision

$\text{Precision}_q(\mathbf{C}) = \frac{|t_e \cap t_r|}{|t_r|}$

**Calculated by**: length of overlap between retrieved chunks and highlights / total length of retrieved chunks

Measures how much of the retrieved text is actually relevant. Ranges from 0 to 1, where 1 means all retrieved text was relevant. A low precision means the chunks contain a lot of irrelevant text.

**Answers**: How much of what we retrieved matches these highlights?

**Example**: If we retrieve 200 tokens of text but only 70 overlap with highlights, precision = 0.35

## Precision Ω

$\text{Precision}_\Omega(\mathbf{C}) = \frac{|t_e \cap t_r|}{|t_r| + |t_e \setminus t_r|}$

Measures precision in an ideal scenario where all relevant text is captured. Here $t_r$ is the set of tokens in every chunk that contains any part of a highlight, rather than only the top-k retrieved chunks, so this is the best precision a chunking strategy can reach at full recall. A lower precision Ω means chunks are inherently too large or poorly aligned with natural text boundaries.

**Answers**: If we made sure to get all the highlights, how precise could we be?

**Example**: If a chunking strategy always creates chunks twice as large as needed, precision omega would be around 0.5
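
The effect of chunk boundaries on precision Ω can be sketched with plain index ranges. This is a simplified illustration that assumes non-overlapping chunks and a single highlight; the helper names `overlap` and `precision_omega` are hypothetical and this is not the repo's implementation.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) token positions

def overlap(a: Range, b: Range) -> int:
    """Number of positions shared by two half-open ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def precision_omega(chunks: List[Range], highlight: Range) -> float:
    # Take every chunk that touches the highlight, i.e. assume perfect recall
    touching = [c for c in chunks if overlap(c, highlight) > 0]
    retrieved_tokens = sum(end - start for start, end in touching)
    relevant_tokens = sum(overlap(c, highlight) for c in touching)
    return relevant_tokens / retrieved_tokens if retrieved_tokens else 0.0

# Three fixed-size chunks of 200 tokens each
chunks = [(0, 200), (200, 400), (400, 600)]

# A 100-token highlight that sits inside one chunk
aligned = precision_omega(chunks, (200, 300))

# The same size highlight straddling a chunk boundary pulls in two chunks
straddling = precision_omega(chunks, (150, 250))

print(aligned, straddling)
```

The same 100-token highlight scores half as well when it straddles a boundary, which is why boundary placement matters as much as chunk size.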

## Intersection over Union (IoU)

$\text{IoU}_q(\mathbf{C}) = \frac{|t_e \cap t_r|}{|t_e| + |t_r| - |t_e \cap t_r|}$

**Calculated by**: length of overlap / length of union of retrieved chunks and highlights

Balances both precision and recall in a single metric. Ranges from 0 to 1, where 1 is perfect overlap. A low IoU indicates either missing content (poor recall) or retrieving too much irrelevant text (poor precision). IoU penalizes missing important content and including irrelevant content while handling redundant information.

**Answers**: How well do our retrieved chunks overlap with these highlights overall?

**Example**: If we retrieve 200 tokens, the highlight is 100 tokens, and overlap is 70 tokens, IoU = 70/(200+100-70) = 0.304

## Metric Interpretation

These metrics work well together:
- High recall + low precision = retrieving too much text
- Low recall + high precision = missing important content
- High IoU = good balance of both
- Precision Ω helps evaluate the chunking strategy independent of the retrieval step
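
The worked examples above can be reproduced with a few lines of set arithmetic. This is a minimal sketch that treats tokens as integer positions; the function name `retrieval_metrics` is an assumption for illustration and the repo itself works with character ranges rather than Python sets.

```python
def retrieval_metrics(highlight_tokens: set, retrieved_tokens: set) -> dict:
    """Token-level recall, precision, and IoU for one query."""
    overlap = len(highlight_tokens & retrieved_tokens)
    union = len(highlight_tokens | retrieved_tokens)
    return {
        "recall": overlap / len(highlight_tokens) if highlight_tokens else 0.0,
        "precision": overlap / len(retrieved_tokens) if retrieved_tokens else 0.0,
        "iou": overlap / union if union else 0.0,
    }

# Highlight covers token positions 0..99 (100 tokens)
highlight = set(range(0, 100))
# Retrieved chunks cover positions 30..229 (200 tokens), 70 of which overlap
retrieved = set(range(30, 230))

metrics = retrieval_metrics(highlight, retrieved)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```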

---
# Evaluating Chunking Strategies and Embedding Models

Built into the repo is a default evaluation structure of 5 text documents and respective question & highlights data. The text corpus includes a mix of clean and unstructured text documents to simulate various text chunking and retrieval scenarios. These include:

1. [State of the Union 2024](https://www.whitehouse.gov/state-of-the-union-2024/): A clean, well-structured transcript of the 2024 presidential address (10,444 tokens)
2. [Wikitext](https://huggingface.co/datasets/Salesforce/wikitext): A curated collection of high-quality Wikipedia articles from verified Good and Featured sections (26,649 tokens subset)
3. [UltraChat 200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k): A dataset of ChatGPT-generated dialogues with JSON syntax intact to simulate real-world messy data (7,727 tokens subset)
4. [ConvFinQA](https://github.com/czyssrs/ConvFinQA): A conversational Q&A dataset focused on numerical reasoning in financial reports (166,177 tokens subset)
5. [PMC Open Access](https://huggingface.co/datasets/pmc/open_access): Biomedical and life sciences journal literature from the National Library of Medicine's open access collection (117,211 tokens subset)

A standard set of questions and highlights has been generated and filtered as well ([full file viewable here](https://github.com/brandonstarxel/chunking_evaluation/blob/main/chunking_evaluation/evaluation_framework/general_evaluation_data/questions_df.csv)).

<img src="./media_2/standard_eval_corpus.png" width=800>

Each question/reference pair takes the following form, for example:

- **Question**: `What were the values of other indefinite-lived intangible assets at the end of 2011 and 2012?`
- **Reference(s)**: `[{"content": "other indefinite-lived intangible assets were $132 million and $174 million at december 31, 2012 and 2011, respectively, and principally included registered trademarks","start_index": 568963,"end_index": 569130}]`

Each reference gives the relevant excerpt (`content`) and its character position (`start_index`, `end_index`) in the source text.
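
Since the references are stored as a JSON string, they can be parsed with the standard library. The sketch below uses the example reference shown above; the variable names are illustrative.

```python
import json

# The reference string from the example above
references_json = (
    '[{"content": "other indefinite-lived intangible assets were $132 million '
    'and $174 million at december 31, 2012 and 2011, respectively, and '
    'principally included registered trademarks",'
    '"start_index": 568963,"end_index": 569130}]'
)

references = json.loads(references_json)

# Collect (start, end) character spans and their lengths
spans = [(ref["start_index"], ref["end_index"]) for ref in references]
span_lengths = [end - start for start, end in spans]

print(spans)
print(span_lengths, len(references[0]["content"]))
```

In this example the span length equals the length of `content`, which suggests the indices behave like a Python slice with an exclusive end.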

## General Evaluation Process

The general evaluation takes the text and chunks it using the chosen chunker, recording each chunk's start and end index. It then:

- **Calculate Retrieval Performance**:
    1. Embed the evaluation questions using the chosen embedding function
    2. Perform vector similarity search to retrieve top-k most relevant chunks per question
    3. Calculate the regular metrics:
        1. *Recall*: How much of the highlighted segments were captured
        2. *Precision*: How much of the retrieved chunks were actually relevant
        3. *IoU*: Overall balance of precision and recall
- **Calculate Precision Ω Performance**:
  1. Examine ALL chunks in the collection
  2. Identify which chunks contain any part of the highlight segments
  3. Calculate theoretical best precision possible if you retrieved all necessary chunks
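
The retrieval step above can be sketched as a top-k search by cosine similarity. This is an illustration only: the two-dimensional vectors are made up, and in the framework the embeddings come from the chosen embedding function and the search is handled by the vector store.

```python
import math
from typing import List, Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: Sequence[float], chunk_vecs: List[Sequence[float]], k: int = 5) -> List[int]:
    """Return the indices of the k chunks most similar to the query."""
    scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy "embeddings" for four chunks and one question (hypothetical values)
chunk_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
query_vector = [0.9, 0.1]

retrieved_ids = top_k(query_vector, chunk_vectors, k=2)
print(retrieved_ids)
```

The character ranges of the retrieved chunks are then compared against the highlight ranges to compute recall, precision, and IoU.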

**Start General Evaluation**

In [None]:
# Instantiate the General Eval
evaluation = GeneralEvaluation()

# Define Chunking Approach
sentence_chunker = SentenceChunker(sentences_per_chunk = 10)

# Define OpenAI Embedding Model
default_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-large"
)

# Run the Eval With Chunker and Embedding Model
results = evaluation.run(sentence_chunker, default_ef)

**Observing Results**

In [29]:
# Helper Function
def print_metrics(results):
    
    # Grab Summary Metrics    
    metrics = {
        'Recall': (results['recall_mean'], results['recall_std']),
        'Precision': (results['precision_mean'], results['precision_std']),
        'Precision Ω': (results['precision_omega_mean'], results['precision_omega_std']),
        'IoU': (results['iou_mean'], results['iou_std'])
    }
    
    # Print each metric with mean ± std
    for metric, (mean, std) in metrics.items():
        print(f"{metric}: {mean:.4f} ± {std:.4f}")

In [30]:
print_metrics(results)

Recall: 0.8703 ± 0.3216
Precision: 0.0370 ± 0.0303
Precision Ω: 0.1674 ± 0.1084
IoU: 0.0370 ± 0.0303


**Interpretation**: The 10 sentence chunking strategy demonstrated strong recall performance (87.03% ± 32.16%), indicating effective retrieval of relevant information. However, the low precision (3.70% ± 3.03%) and IoU (3.70% ± 3.03%) metrics suggest significant inclusion of irrelevant text within chunks. The precision Ω value of 16.74% ± 10.84% indicates that even under optimal retrieval conditions, the chunking strategy includes substantial extraneous content. When compared to benchmark chunkers (above), while the recall performance aligns with state-of-the-art approaches (83-91%), the precision metrics suggest opportunities for improvement through reduced chunk sizes or refined boundary determination methods.

### Embedding Functions

We've mostly focused on chunking strategies, but the framework also lets you swap different embedding functions in and out.

**Existing Integrations with ChromaDB**

As we demonstrated in the above example, you can easily use already built in [embedding functions](https://github.com/chroma-core/chroma/tree/main/chromadb/utils/embedding_functions) from Chroma's repo.

Let's demonstrate this by using the [Sentence Transformers Embedding Function](https://github.com/chroma-core/chroma/blob/main/chromadb/utils/embedding_functions/sentence_transformer_embedding_function.py) with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.

In [43]:
# Load Embedding Function
st_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Same Chunking Strategy
sentence_chunker = SentenceChunker(sentences_per_chunk = 10)

# Run the Eval With Chunker and Sentence Transformers Embedding Function
st_results = evaluation.run(sentence_chunker, st_ef)

# Display Results
print_metrics(st_results)

Recall: 0.7859 ± 0.3897
Precision: 0.0335 ± 0.0306
Precision Ω: 0.1674 ± 0.1084
IoU: 0.0334 ± 0.0306


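**Interpretation**: Swapping in the lightweight open source model lowered recall (0.7859 vs. 0.8703), precision, and IoU relative to OpenAI's `text-embedding-3-large`, as expected. Precision Ω is unchanged at 0.1674 because it depends only on how the text is chunked, not on the embedding model.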

If you have a custom embedding function, for example a [query only linear adapter](https://github.com/ALucek/linear-adapter-embedding) applied to your embeddings or a custom fine tuned model, you can create your own compatible embedding function with the following outline:

```python
from chromadb import Documents, EmbeddingFunction, Embeddings

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input: Documents) -> Embeddings:
        # embed the documents somehow
        return embeddings
```
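
As a rough illustration of the shape such a function takes, here is a toy callable that maps a list of strings to a list of fixed-length vectors using hashed word counts. It is a dependency-free stand-in: the class name `HashingEmbeddingFunction` and its `dim` parameter are hypothetical, it does not subclass Chroma's `EmbeddingFunction`, and it is not meant to produce useful embeddings. In real use you would subclass `EmbeddingFunction` as outlined above and call your own model inside `__call__`.

```python
import hashlib
import math
from typing import List

class HashingEmbeddingFunction:
    """Toy stand-in showing the list-of-strings in, list-of-vectors out contract."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        embeddings = []
        for doc in input:
            vec = [0.0] * self.dim
            # Count each lowercased word in a bucket chosen by a stable hash
            for token in doc.lower().split():
                digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
                vec[int(digest, 16) % self.dim] += 1.0
            # L2-normalize so cosine similarity reduces to a dot product
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0
            embeddings.append([v / norm for v in vec])
        return embeddings

toy_ef = HashingEmbeddingFunction(dim=16)
vectors = toy_ef(["NVIDIA reports two segments", "nvidia reports two segments"])
print(len(vectors), len(vectors[0]))
```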

## Running Multiple Evaluations At Once

When looking for the optimal configuration for chunking and embedding, you may want to run a hyperparameter sweep to test various setups. This framework makes that much easier. Let's run a simple sweep across multiple `SentenceChunker` configurations and embedding functions.

**Defining our Chunkers and Embedding Models**

*Note: You don't necessarily have to just use one chunker here, you could load this up with multiple kinds of chunkers and configurations*

In [64]:
# Defining our Configurations
chunkers = [
    SentenceChunker(sentences_per_chunk = 5),
    SentenceChunker(sentences_per_chunk = 10),
    SentenceChunker(sentences_per_chunk = 15),
    SentenceChunker(sentences_per_chunk = 20),
]

# Defining our Embedding Functions
embedders = [
    embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2"),
    embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-large"),
]

**Main Sweep**

The loop below runs each embedding function with each chunker configuration and collects the results into a final dataframe.

In [65]:
# Initialize Evaluation and Results Storage
evaluation = GeneralEvaluation()
results = []

# Helper Function
def get_config_name(chunker, ef):
    chunk_size = chunker.sentences_per_chunk if hasattr(chunker, 'sentences_per_chunk') else 0
    ef_name = ef.model_name if hasattr(ef, 'model_name') else ef.__class__.__name__
    return f"{chunker.__class__.__name__}_{chunk_size}_{ef_name}"

# Progress tracking
total_combinations = len(chunkers) * len(embedders)
current_combination = 0

# Run evaluation sweep
for chunker in chunkers:
    for ef in embedders:
        current_combination += 1
        try:
            print(f"Evaluating combination {current_combination}/{total_combinations}:")
            print(f"  Chunker: {chunker.__class__.__name__} (size: {chunker.sentences_per_chunk})")
            print(f"  Embedding: {ef.model_name if hasattr(ef, 'model_name') else ef.__class__.__name__}")
            
            # Run evaluation
            result = evaluation.run(chunker, ef, retrieve=5)
            
            # Clean up and store results
            if 'corpora_scores' in result:
                del result['corpora_scores']
            
            # Add configuration identifiers
            result['chunker'] = chunker.__class__.__name__
            result['chunk_size'] = chunker.sentences_per_chunk
            result['embedding_function'] = ef.model_name if hasattr(ef, 'model_name') else ef.__class__.__name__
            result['config'] = get_config_name(chunker, ef)
            
            results.append(result)
            clear_output(wait=True)

        except Exception as e:
            # Error Handling Just in Case
            print(f"Error in combination {current_combination}: {str(e)}")
            continue

# Create final DataFrame and display
df = pd.DataFrame(results)
print("\nFinal Results:")
display(df)


Final Results:


| | iou_mean | iou_std | recall_mean | recall_std | precision_omega_mean | precision_omega_std | precision_mean | precision_std | chunker | chunk_size | embedding_function | config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.060672 | 0.055114 | 0.751719 | 0.404472 | 0.294542 | 0.176066 | 0.061171 | 0.055477 | SentenceChunker | 5 | SentenceTransformerEmbeddingFunction | SentenceChunker_5_SentenceTransformerEmbedding... |
| 1 | 0.069801 | 0.054571 | 0.871479 | 0.306642 | 0.294542 | 0.176066 | 0.070187 | 0.054913 | SentenceChunker | 5 | OpenAIEmbeddingFunction | SentenceChunker_5_OpenAIEmbeddingFunction |
| 2 | 0.033444 | 0.030564 | 0.785924 | 0.389651 | 0.167426 | 0.108418 | 0.033532 | 0.030637 | SentenceChunker | 10 | SentenceTransformerEmbeddingFunction | SentenceChunker_10_SentenceTransformerEmbeddin... |
| 3 | 0.036992 | 0.030295 | 0.870283 | 0.321624 | 0.167426 | 0.108418 | 0.037043 | 0.030341 | SentenceChunker | 10 | OpenAIEmbeddingFunction | SentenceChunker_10_OpenAIEmbeddingFunction |
| 4 | 0.021538 | 0.019466 | 0.766879 | 0.403846 | 0.120034 | 0.082415 | 0.021577 | 0.019499 | SentenceChunker | 15 | SentenceTransformerEmbeddingFunction | SentenceChunker_15_SentenceTransformerEmbeddin... |
| 5 | 0.024965 | 0.020593 | 0.882058 | 0.314057 | 0.120034 | 0.082415 | 0.024979 | 0.020606 | SentenceChunker | 15 | OpenAIEmbeddingFunction | SentenceChunker_15_OpenAIEmbeddingFunction |
| 6 | 0.015787 | 0.015199 | 0.738888 | 0.428088 | 0.093078 | 0.063231 | 0.0158 | 0.015205 | SentenceChunker | 20 | SentenceTransformerEmbeddingFunction | SentenceChunker_20_SentenceTransformerEmbeddin... |
| 7 | 0.01872 | 0.015482 | 0.862835 | 0.33372 | 0.093078 | 0.063231 | 0.018729 | 0.015486 | SentenceChunker | 20 | OpenAIEmbeddingFunction | SentenceChunker_20_OpenAIEmbeddingFunction |


<img src="./media_2/sweep_graphs.png" width=1200>
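
Once a sweep finishes, picking a configuration comes down to ranking the rows by the metric you care about. The sketch below uses plain dictionaries holding the OpenAI rows from the results table above; the variable names are illustrative, and with the actual dataframe the same ranking could be done by sorting on a column.

```python
# OpenAI rows copied from the sweep results above (subset of columns)
sweep_rows = [
    {"chunk_size": 5, "iou_mean": 0.069801, "recall_mean": 0.871479},
    {"chunk_size": 10, "iou_mean": 0.036992, "recall_mean": 0.870283},
    {"chunk_size": 15, "iou_mean": 0.024965, "recall_mean": 0.882058},
    {"chunk_size": 20, "iou_mean": 0.01872, "recall_mean": 0.862835},
]

# The "best" configuration depends on which metric you optimize for
best_by_iou = max(sweep_rows, key=lambda row: row["iou_mean"])
best_by_recall = max(sweep_rows, key=lambda row: row["recall_mean"])

print("Best IoU:", best_by_iou["chunk_size"], "sentences per chunk")
print("Best recall:", best_by_recall["chunk_size"], "sentences per chunk")
```

Here the two metrics disagree: the 5-sentence chunker has the highest IoU while the 15-sentence chunker has marginally higher recall, so it helps to decide up front which metric matters for your application.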

---
# Domain Specific Evaluation Pipelines

While general evals are great for getting up and running, it's more than likely that you're looking for the best chunking strategy and embedding model combination for your own specific documentation.

Chroma also open sourced their methodology for generating the dataset of questions and chunks from a text corpus automatically in their [synthetic_evaluation](https://github.com/brandonstarxel/chunking_evaluation/blob/main/chunking_evaluation/evaluation_framework/synthetic_evaluation.py) framework, allowing you to generate evaluation datasets tailored to your domain. 

The pipeline works by:
1. Randomly selecting segments (4000 characters) from your input documents
2. Using GPT-4 to generate natural questions based on the content, along with relevant supporting references
3. Identifying and extracting precise text spans that contain the information needed to answer each question
4. Filtering for duplicates and similarity to remove redundant or unrelated questions

Within this there are two reference extraction methods available:
1. **Exact matching**: Finds precise text spans in the source document
2. **Approximate matching**: Pre-chunks text into 100-character segments and allows references to span multiple chunks
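
To see why both modes exist, consider what exact matching amounts to conceptually: locating a quoted excerpt verbatim in the source text. The sketch below uses `str.find` on a toy corpus. It is an assumption-laden illustration with a hypothetical helper `locate_exact`, not the repo's implementation.

```python
from typing import Optional, Tuple

def locate_exact(corpus: str, excerpt: str) -> Optional[Tuple[int, int]]:
    """Return the (start, end) character span of a verbatim excerpt, or None."""
    start = corpus.find(excerpt)
    if start == -1:
        return None
    return start, start + len(excerpt)

corpus = (
    "The Sun is composed primarily of hydrogen and helium. "
    "Through nuclear fusion in its core, it converts hydrogen into helium, "
    "releasing massive amounts of energy."
)

# A verbatim excerpt is found and its span recovers the same text
excerpt = "Through nuclear fusion in its core"
span = locate_exact(corpus, excerpt)

# A reference that differs even slightly (here, in casing) is not found
missing = locate_exact(corpus, "through nuclear fusion")

print(span, missing)
```

A generated reference that paraphrases or reformats the source even slightly fails a verbatim lookup, which is the kind of case a more flexible matching mode is meant to tolerate.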

Let's apply this to our NVIDIA Form 10-K from before.

In [85]:
# Specify the corpora paths, you can have multiple but we'll just use our one file
corpora_paths = [
    './domain_specific/nvidia_10k.txt',
]
csv_path = './domain_specific/generated_queries_and_excerpts.csv'

# Initialize the evaluation
synthetic_pipeline = SyntheticEvaluation(corpora_paths, csv_path, openai_api_key=os.environ["OPENAI_API_KEY"])

In [91]:
synthetic_df = pd.read_csv(csv_path)
synthetic_df.head()

| | question | references | corpus_id |
|---|---|---|---|
| 0 | What is the net income of NVIDIA Corporation f... | `[{"content": "tax expense (benefit) 4,058 (187...` | ./domain_specific/nvidia_10k.txt |
| 1 | What are the implications of failing to comply... | `[{"content": "Administration of China, or CAC....` | ./domain_specific/nvidia_10k.txt |
| 2 | What are the consequences of not adhering to d... | `[{"content": "ct to penalties of up to \u20ac2...` | ./domain_specific/nvidia_10k.txt |
| 3 | What are the steps involved in revenue recogni... | `[{"content": "share.\nRevenue Recognition\nWe ...` | ./domain_specific/nvidia_10k.txt |
| 4 | How are operating lease assets and liabilities... | `[{"content": "lease payments over the lease te...` | ./domain_specific/nvidia_10k.txt |


**Run the Data Generation Pipeline**

The `generate_queries_and_excerpts` method takes the arguments:
1. **approximate_excerpts**: Whether to use the chunked, flexible (approximate) approach instead of exact text matching
2. **num_rounds**: How many times per document to run query generations
3. **queries_per_corpus**: Number of queries to generate per round, `-1` will run indefinitely.

In [88]:
synthetic_pipeline.generate_queries_and_excerpts(approximate_excerpts=True, 
                                         num_rounds=1, 
                                         queries_per_corpus=5)

Trying Query 0
Trying Query 1
Error occurred: Expecting ',' delimiter: line 11 column 488 (char 995)
Trying Query 1
Trying Query 2
Trying Query 3
Trying Query 4


**Filter Poor Excerpts**

This method filters out questions where any of the references aren't sufficiently similar to the question semantically. It embeds the question and reference(s) and compares them by semantic similarity. If the similarity falls below a threshold (`0.36` by default), the row is removed.

In [93]:
synthetic_pipeline.filter_poor_excerpts(threshold=0.36)

Corpus: ./domain_specific/nvidia_10k.txt - Removed 29 .


**Remove Duplicates**

This method then looks at the generated questions themselves, first removing all exact duplicates and then building a similarity matrix that compares every question to every other, again by embedding. It then applies a greedy algorithm to remove similar questions by:
1. Keeping the first question
2. Removing any later questions whose similarity to it is above a certain threshold (`0.78` by default; the call below passes `0.7`)
3. Moving to the next question and repeating
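
The greedy pass described above can be sketched over a precomputed similarity matrix. The function name `greedy_dedup` and the matrix values are hypothetical, and whether the library treats already-removed questions exactly this way is an assumption of this sketch.

```python
from typing import List

def greedy_dedup(similarity: List[List[float]], threshold: float = 0.78) -> List[int]:
    """Return indices of questions kept after greedy near-duplicate removal."""
    kept: List[int] = []
    removed = set()
    for i in range(len(similarity)):
        if i in removed:
            continue  # already dropped as a near-duplicate of an earlier question
        kept.append(i)
        # Drop every later question that is too similar to the one just kept
        for j in range(i + 1, len(similarity)):
            if similarity[i][j] > threshold:
                removed.add(j)
    return kept

# Toy pairwise similarities between four questions (made-up values)
sim = [
    [1.00, 0.90, 0.20, 0.30],
    [0.90, 1.00, 0.25, 0.85],
    [0.20, 0.25, 1.00, 0.10],
    [0.30, 0.85, 0.10, 1.00],
]

kept_indices = greedy_dedup(sim, threshold=0.78)
print(kept_indices)
```

In this toy matrix, question 1 is removed for being too close to question 0. Question 3 is only similar to the already-removed question 1, so it is kept, which shows that the result of a greedy pass depends on question order.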

In [94]:
synthetic_pipeline.filter_duplicates(threshold=0.7)

Corpus: ./domain_specific/nvidia_10k.txt - Removed 10 .


We now have a cleaned dataset of relevant and unique questions as our synthetic evaluation dataset. We initially generated 105, but reduced down to 66 through our filters.

In [95]:
synthetic_df = pd.read_csv(csv_path)
synthetic_df.tail()

| | question | references | corpus_id |
|---|---|---|---|
| 61 | What are the conditions necessary for recogniz... | `[{"content": "benefits during the period.\nWe ...` | ./domain_specific/nvidia_10k.txt |
| 62 | What subsidiaries does NVIDIA Corporation own? | `[{"content": "a significant subsidiary.\nSubsi...` | ./domain_specific/nvidia_10k.txt |
| 63 | What are the significant changes in Other Inco... | `[{"content": "prepayment provided at\nsigning....` | ./domain_specific/nvidia_10k.txt |
| 64 | What are the recent changes in the share repur... | `[{"content": "Shareholders\u2019 Equity\nCapit...` | ./domain_specific/nvidia_10k.txt |
| 65 | What does NVIDIA's full-stack computing infras... | `[{"content": "offerings that are reshaping ind...` | ./domain_specific/nvidia_10k.txt |


## Running Evaluations

We can now use the same techniques as earlier to run evaluations, this time on our newly created dataset.

In [96]:
# Define Chunking Approach
sentence_chunker = SentenceChunker(sentences_per_chunk = 10)

# Define OpenAI Embedding Model
default_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-large"
)

# Run the Eval With Chunker and Embedding Model on our Synthetic Dataset
synth_results = synthetic_pipeline.run(sentence_chunker, default_ef)

# Display our Results
print_metrics(synth_results)

Recall: 0.7668 ± 0.3774
Precision: 0.0455 ± 0.0413
Precision Ω: 0.1769 ± 0.1181
IoU: 0.0453 ± 0.0411


Compared to our earlier results:

```
Prior Recall: 0.8703 ± 0.3216
Prior Precision: 0.0370 ± 0.0303
Prior Precision Ω: 0.1674 ± 0.1084
Prior IoU: 0.0370 ± 0.0303
```
We see a lower recall, but higher precision, precision Ω, and IoU!  
Let's also perform the same sweep and observe the results:

In [99]:
# Defining our Configurations
chunkers = [
    SentenceChunker(sentences_per_chunk = 5),
    SentenceChunker(sentences_per_chunk = 10),
    SentenceChunker(sentences_per_chunk = 15),
    SentenceChunker(sentences_per_chunk = 20),
]

# Defining our Embedding Functions
embedders = [
    embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2"),
    embedding_functions.OpenAIEmbeddingFunction(api_key=os.environ["OPENAI_API_KEY"], model_name="text-embedding-3-large"),
]

# Initialize Results Storage
synth_results = []

# Helper Function
def get_config_name(chunker, ef):
    chunk_size = chunker.sentences_per_chunk if hasattr(chunker, 'sentences_per_chunk') else 0
    ef_name = ef.model_name if hasattr(ef, 'model_name') else ef.__class__.__name__
    return f"{chunker.__class__.__name__}_{chunk_size}_{ef_name}"

# Progress tracking
total_combinations = len(chunkers) * len(embedders)
current_combination = 0

# Run evaluation sweep
for chunker in chunkers:
    for ef in embedders:
        current_combination += 1
        try:
            print(f"Evaluating combination {current_combination}/{total_combinations}:")
            print(f"  Chunker: {chunker.__class__.__name__} (size: {chunker.sentences_per_chunk})")
            print(f"  Embedding: {ef.model_name if hasattr(ef, 'model_name') else ef.__class__.__name__}")
            
            # Run evaluation
            result = synthetic_pipeline.run(chunker, ef, retrieve=5)
            
            # Clean up and store results
            if 'corpora_scores' in result:
                del result['corpora_scores']
            
            # Add configuration identifiers
            result['chunker'] = chunker.__class__.__name__
            result['chunk_size'] = chunker.sentences_per_chunk
            result['embedding_function'] = ef.model_name if hasattr(ef, 'model_name') else ef.__class__.__name__
            result['config'] = get_config_name(chunker, ef)
            
            synth_results.append(result)
            clear_output(wait=True)

        except Exception as e:
            # Error Handling Just in Case
            print(f"Error in combination {current_combination}: {str(e)}")
            continue

# Create final DataFrame and display
synth_df = pd.DataFrame(synth_results)
print("\nFinal Results:")
display(synth_df)


Final Results:


| | iou_mean | iou_std | recall_mean | recall_std | precision_omega_mean | precision_omega_std | precision_mean | precision_std | chunker | chunk_size | embedding_function | config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.065792 | 0.07112 | 0.639653 | 0.410139 | 0.264383 | 0.148847 | 0.067492 | 0.072384 | SentenceChunker | 5 | SentenceTransformerEmbeddingFunction | SentenceChunker_5_SentenceTransformerEmbedding... |
| 1 | 0.077645 | 0.070578 | 0.767041 | 0.332649 | 0.264383 | 0.148847 | 0.079137 | 0.071506 | SentenceChunker | 5 | OpenAIEmbeddingFunction | SentenceChunker_5_OpenAIEmbeddingFunction |
| 2 | 0.035074 | 0.042812 | 0.633313 | 0.443139 | 0.176874 | 0.118116 | 0.035287 | 0.042903 | SentenceChunker | 10 | SentenceTransformerEmbeddingFunction | SentenceChunker_10_SentenceTransformerEmbeddin... |
| 3 | 0.045239 | 0.041125 | 0.766759 | 0.377373 | 0.176874 | 0.118116 | 0.045496 | 0.04127 | SentenceChunker | 10 | OpenAIEmbeddingFunction | SentenceChunker_10_OpenAIEmbeddingFunction |
| 4 | 0.022789 | 0.030471 | 0.525709 | 0.472539 | 0.131036 | 0.089597 | 0.022841 | 0.03049 | SentenceChunker | 15 | SentenceTransformerEmbeddingFunction | SentenceChunker_15_SentenceTransformerEmbeddin... |
| 5 | 0.032013 | 0.030004 | 0.766126 | 0.400795 | 0.131036 | 0.089597 | 0.032084 | 0.030004 | SentenceChunker | 15 | OpenAIEmbeddingFunction | SentenceChunker_15_OpenAIEmbeddingFunction |
| 6 | 0.016428 | 0.019835 | 0.563793 | 0.466783 | 0.10841 | 0.078149 | 0.016456 | 0.019848 | SentenceChunker | 20 | SentenceTransformerEmbeddingFunction | SentenceChunker_20_SentenceTransformerEmbeddin... |
| 7 | 0.02275 | 0.022281 | 0.685653 | 0.433153 | 0.10841 | 0.078149 | 0.022793 | 0.022305 | SentenceChunker | 20 | OpenAIEmbeddingFunction | SentenceChunker_20_OpenAIEmbeddingFunction |


<img src="./media_2/synth_sweep.png" width=1200>

---
# Discussion

Chroma's research framework provides a powerful toolset for evaluating and optimizing RAG systems through careful analysis of chunking and embedding strategies. Through our experiments, we've uncovered several key insights:

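1. **Chunking Strategy Impact**: Our evaluation demonstrated how chunk size can dramatically affect retrieval performance. Smaller chunks (5 sentences) consistently showed the highest precision, precision Ω, and IoU, at the potential cost of fragmenting related content. Larger chunks (15-20 sentences) added noise without a reliable recall benefit: with OpenAI embeddings, recall stayed at roughly 0.86 to 0.88 across all sizes on the general corpus, and on the synthetic dataset it fell from about 0.77 to 0.69 at 20 sentences.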

2. **Embedding Model Comparison**: The experiments clearly showed the performance gap between state-of-the-art models (OpenAI's text-embedding-3-large) and lighter-weight alternatives (SentenceTransformer). This helps quantify the tradeoff between cost/speed and performance when choosing embedding models.

3. **Domain Adaptation**: The synthetic dataset generation pipeline shows how results on a general corpus may not carry over to specific document types. On NVIDIA's 10-K, the 10-sentence OpenAI configuration reached lower recall than on the general evaluation corpus (0.77 vs. 0.87) with slightly higher precision and IoU. The 5-sentence configuration still had the highest IoU on both datasets, and it is domain-specific testing that confirms whether a setting transfers.

4. **Metric Tradeoffs**: Throughout our evaluation, we observed the fundamental tension between precision and recall, with IoU providing a balanced perspective on overall performance. These metrics help guide practical decisions about chunk size and strategy based on specific use case requirements.

This framework provides a systematic approach to developing and testing RAG systems, allowing practitioners to make data-driven decisions about their text processing pipeline. Future work might explore additional chunking strategies optimized for specific document types, or investigate how different preprocessing steps could improve retrieval performance.