# Ad-Hoc Experimentation with Chunking

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/experimentation/chunk_size_adhoc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Chunking is a key parameter for RAG pipelines: the chunk size affects the accuracy of your overall RAG pipeline.

Unlike retrieval and query-time parameters, though, chunking is harder to experiment with, because changing your chunking configuration requires reindexing your data, which can be tedious.

LlamaCloud provides easy ways for you to perform **ad-hoc** experimentation over chunking.
1. First, check whether a given query is answered correctly over an index using our index playground features.
2. If it is not, clone the index with a click of a button and set different chunking/ingestion parameters.
3. Try the same query in the playground again and see if it leads to the right results.

**NOTE**: More structured experimentation capabilities here are coming soon! 

## Set Up a LlamaCloud Index

Download the three ICLR 2024 papers below. Then, create a new LlamaCloud Index in the UI and upload these three files through drag/drop.

In "Transform Settings", select the "Auto" tab and set a chunk size of 512. This will be our starting point: we'll analyze how well different queries perform on this index, then iterate on the indexing parameters.

In [1]:
import nest_asyncio
nest_asyncio.apply()

from IPython.display import Markdown, display

In [3]:
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings

Settings.llm = OpenAI(model="gpt-4o")

In [4]:
# NOTE: insert your own `name`, `project_name`, and `api_key`
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
  name="research_papers_512", 
  project_name="llamacloud_demo",
  # api_key="llx-"
)

## Ad-hoc Test a Question

Sanity-check this index with a question you already know the answer to. In this example, we want to understand the core features of the SWE-Bench dataset described on page 3, as shown in the image below.

![](chunk_size_adhoc_images/source_chunk.png)

In [5]:
# set re-ranking top-n to 3 
query_engine = index.as_query_engine(rerank_top_n=3)

In [6]:
response = query_engine.query("Tell me about the core features of SWE-bench")

RETRIEVING: Tell me about the core features of SWE-bench


In [7]:
print(str(response))

SWE-bench is a benchmark designed to evaluate language models (LMs) in a realistic software engineering setting. Its core features include:

1. **Real-world Software Engineering Tasks**: Each task involves a large and complex codebase along with a detailed issue description, requiring sophisticated skills and knowledge akin to those of experienced software engineers.

2. **Continually Updatable**: The benchmark can be easily extended with new task instances from any Python repository on GitHub, ensuring a continual supply of fresh challenges that were not part of the models' training data.

3. **Diverse Long Inputs**: Issue descriptions are typically lengthy and detailed, and the codebases contain many thousands of files. This requires models to identify the specific lines that need modification among a vast amount of context.

4. **Robust Evaluation**: Each task instance includes at least one fail-to-pass test to verify the solution, with many instances having multiple such tests. Add

### Analyze Results

**NOTE**: Since we know the ground truth, we know this answer isn't quite complete.

Instead of using the notebook, we can also quickly validate this in the chat UI and retrieval UI in the "Playground" section of the index page.
Open the LlamaCloud playground, enter the question above in the chat UI, and look at the response and the set of retrieved nodes.

![](chunk_size_adhoc_images/chat_ui_test.png)

Now enter the same question into the retrieval UI, which lets you not only see the chunks but also the source document for each chunk. 

![](chunk_size_adhoc_images/retrieval_ui_test.png)

Clicking "View in File" on the first chunk lets you see how the source document is parsed and chunked. Since we know the ground truth, we can check whether the ground-truth context is chunked in a cohesive manner. In this case, we can see that the relevant section is cut off, and Node 21 is not in the retrieved set at all.

![](chunk_size_adhoc_images/view_chunks.png)

You'll notice that the relevant paragraph is broken up into two chunks.
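
The same check can be approximated in code. The sketch below is a minimal illustration, not part of the LlamaCloud API: `chunks_containing` is a hypothetical helper, and the `retrieved` strings are made-up placeholders standing in for retrieved chunk text. It reports which ground-truth phrases show up in the retrieved set and which are missing.

```python
from typing import Iterable, List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so parsing line breaks don't matter."""
    return " ".join(text.split()).lower()


def chunks_containing(snippet: str, chunk_texts: Iterable[str]) -> List[int]:
    """Return the positions of retrieved chunks that contain the snippet."""
    needle = normalize(snippet)
    return [i for i, text in enumerate(chunk_texts) if needle in normalize(text)]


# Hypothetical placeholder chunks (NOT real retrieval output): the
# ground-truth paragraph has been cut, so its tail is absent.
retrieved = [
    "SWE-bench offers several advantages: real-world tasks, continually updatable,",
    "An unrelated chunk about evaluation metrics.",
]
ground_truth_phrases = ["continually updatable", "cross-context code editing"]

for phrase in ground_truth_phrases:
    hits = chunks_containing(phrase, retrieved)
    status = f"found in chunks {hits}" if hits else "MISSING"
    print(f"{phrase!r}: {status}")
```

With a real index, the chunk texts could be gathered with something like `[n.get_content() for n in index.as_retriever().retrieve(query)]`; treat that call pattern as an assumption to verify against your installed `llama_index` version.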

## Experimenting with Chunk Sizes

You can tackle the above issue in a variety of ways. For instance, you can keep the chunk sizes fixed and only tune retrieval parameters, like top-k, hybrid search, reranking, etc. There are tradeoffs to only tuning retrieval, though: increasing top-k, for example, can lead to increased latency and cost.

Chunk sizes are a little harder than retrieval parameters to experiment with, since changing them requires retriggering an index run.
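
To get a rough feel for the top-k tradeoff, here is a back-of-the-envelope sketch. It assumes every retrieved chunk is passed to the LLM in a single call and that each chunk is at most `chunk_size` tokens; `estimated_context_tokens` and the `prompt_overhead` value of 200 are hypothetical, chosen only for illustration.

```python
def estimated_context_tokens(top_k: int, chunk_size: int, prompt_overhead: int = 200) -> int:
    """Upper-bound estimate of prompt tokens when top_k chunks go into one LLM call.

    prompt_overhead is an assumed allowance for the question and prompt template.
    """
    return top_k * chunk_size + prompt_overhead


# Doubling top-k roughly doubles the context the LLM has to read.
for top_k in (3, 6, 12):
    print(top_k, estimated_context_tokens(top_k, chunk_size=512))
```

Prompt size is only a proxy: actual latency and cost depend on the model, the response synthesis strategy, and whether reranking trims the set before the LLM call.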

With LlamaCloud, we can easily create a new index with a different chunking configuration and see if the retrieved results change. 

First, on the Index page, click the "Copy" button to duplicate the index and give it a new name (this example uses `research_papers_page`).

Click into the new index, then click "Edit" and change the chunking configuration to page-level chunking:
1. Click into "Manual"
2. Click "Page" segmentation in "Segmentation Configuration" to segment by page at the top-level
3. In "Chunking Configuration", select None for the mode.

![](chunk_size_adhoc_images/transform_config.png)

Click "Save" to set the new index settings and retrigger a run of the pipeline.

Now let's run the same query against the new index.

In [8]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
  name="research_papers_page", 
  project_name="llamacloud_demo",
  # api_key="llx-"
)
query_engine = index.as_query_engine(rerank_top_n=3)

In [9]:
response = query_engine.query("Tell me about the core features of SWE-bench")

RETRIEVING: Tell me about the core features of SWE-bench


In [10]:
print(str(response))

SWE-bench is a benchmark designed to evaluate language models (LMs) in realistic software engineering settings. It features several core attributes:

1. **Real-world Software Engineering Tasks**: It involves large and complex codebases with detailed issue descriptions, requiring sophisticated skills and knowledge akin to those of experienced software engineers.

2. **Continually Updatable**: The collection process can be applied to any Python repository on GitHub with minimal human intervention, allowing for a continual supply of new task instances.

3. **Diverse Long Inputs**: Issue descriptions are typically long and detailed, and codebases contain many thousands of files, necessitating the identification of specific lines that need editing.

4. **Robust Evaluation**: Each task instance includes at least one fail-to-pass test to ensure the model addresses the problem, with additional tests to check for proper maintenance of prior functionality.

5. **Cross-context Code Editing**: Unl

**Result**: It turns out that page-level chunking surfaces the main result we were looking for. This is not unexpected, since page-level chunking preserves context across an entire page.
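
The effect can be reproduced on a toy example. The sketch below is an illustration under simplifying assumptions, not how LlamaCloud chunks documents internally: words stand in for tokens, the "document" is a made-up two-page list of strings, and all function names are hypothetical.

```python
from typing import List


def fixed_size_chunks(pages: List[str], chunk_size: int) -> List[str]:
    """Join all pages, then cut every `chunk_size` words (words stand in for tokens)."""
    words = " ".join(pages).split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def page_chunks(pages: List[str]) -> List[str]:
    """One chunk per page: content on the same page is never separated."""
    return [" ".join(page.split()) for page in pages]


def chunks_with_all(chunks: List[str], phrases: List[str]) -> List[int]:
    """Indices of chunks that contain every phrase."""
    return [i for i, chunk in enumerate(chunks) if all(p in chunk for p in phrases)]


# Toy two-page document; the whole feature list lives on the second page.
pages = [
    "intro " * 10,
    "feature one is realism . filler filler filler filler feature two is long inputs .",
]
phrases = ["feature one", "feature two"]

small = fixed_size_chunks(pages, chunk_size=8)
by_page = page_chunks(pages)

print(chunks_with_all(small, phrases))    # → [] (the list is split across two chunks)
print(chunks_with_all(by_page, phrases))  # → [1] (the second page holds both phrases)
```

With small fixed-size chunks, no single chunk holds the full list, so a retriever has to return every fragment to answer completely. With page-level chunks, retrieving one chunk is enough, at the price of larger chunks.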

## Next Steps

If you are aiming for development velocity, you can keep the page-level chunking as a reasonable baseline and build something that "just works". If you are looking to iteratively improve chunking further, consider running the two LlamaCloud indexes you've defined over a more structured dataset and evaluating the results.
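
A more structured comparison could start from something like the sketch below. It is a minimal, hypothetical harness rather than a LlamaCloud feature: `keyword_coverage` is a crude stand-in metric, and the stub answer functions and dataset are invented placeholders so the code runs offline.

```python
from typing import Callable, Dict, List, Tuple

# (question, phrases a complete answer should mention); placeholder data only.
EvalSet = List[Tuple[str, List[str]]]


def keyword_coverage(answer: str, expected: List[str]) -> float:
    """Fraction of expected phrases present in the answer (case-insensitive)."""
    if not expected:
        return 1.0
    text = answer.lower()
    return sum(p.lower() in text for p in expected) / len(expected)


def evaluate(answer_fns: Dict[str, Callable[[str], str]], dataset: EvalSet) -> Dict[str, float]:
    """Average keyword coverage per index over the whole dataset."""
    scores: Dict[str, float] = {}
    for name, fn in answer_fns.items():
        per_question = [keyword_coverage(fn(q), expected) for q, expected in dataset]
        scores[name] = sum(per_question) / len(per_question) if per_question else 0.0
    return scores


# Stub answer functions (hypothetical, NOT real model output). With real
# indexes, each entry could be: lambda q: str(query_engine.query(q))
stub_fns = {
    "chunk_512": lambda q: "Real-world tasks and robust evaluation.",
    "page": lambda q: "Real-world tasks, robust evaluation, and cross-context code editing.",
}
dataset = [
    ("Core features of SWE-bench?", ["real-world", "robust evaluation", "cross-context"]),
]

scores = evaluate(stub_fns, dataset)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Keyword coverage only checks that expected phrases appear; for a serious evaluation, a larger question set and a stronger correctness metric would be needed.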