# Day 2: Chunking and Intelligent Processing for Data

- https://docs.google.com/document/d/12wVi866gQDFSw09LZdTixYltqY4_P0g7x04IyYDil5o/edit?tab=t.0

### Why We Need to Prepare Large Documents Before Using Them

Large documents create several problems:

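- Token limits: Every LLM has a maximum context window, so a long document may not fit in a single prompt
- Cost: Most hosted LLM APIs bill per token, so longer prompts cost more
- Performance: LLMs tend to answer less reliably when the relevant passage is buried in a very long context
- Relevance: Usually only a few parts of a long document are relevant to a specific question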

So we need to split documents into smaller subdocuments. For AI applications like RAG, this process is referred to as "chunking."


### Loading Data from Day 1

In [1]:
from utils.ingest import read_repo_data

In [2]:
evidently_docs = read_repo_data('evidentlyai', 'docs')
print(f"Evidently documents: {len(evidently_docs)}")

Evidently documents: 95


In [55]:
evidently_docs[0]

{'title': 'Create Plant',
 'openapi': 'POST /plants',
 'content': '',
 'filename': 'docs-main/api-reference/endpoint/create.mdx'}

In [7]:
pydanticai_docs = read_repo_data('pydantic', 'pydantic-ai')
print(f"Pydantic documents: {len(pydanticai_docs)}")

Pydantic documents: 114


In [44]:
pydanticai_docs[1]

 'filename': 'pydantic-ai-main/CLAUDE.md'}

### 1. Simple Chunking with Sliding Window

In [49]:
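def sliding_window(seq, size, step):
    """Split seq into windows of `size` items, starting a new window every `step` items."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        # Record where the chunk starts so it can be traced back to the source
        result.append({'start': i, 'chunk': chunk})
        # Stop once a window reaches the end; later windows would only repeat its tail
        if i + size >= n:
            break

    return result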
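
To see how `size` and `step` interact, here is a small self-contained sketch that repeats the function and runs it on a ten-character string. The toy input and the tiny window are illustrative assumptions; the notebook itself uses 2000 and 1000.

```python
def sliding_window(seq, size, step):
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        result.append({"start": i, "chunk": seq[i:i + size]})
        if i + size >= n:
            break
    return result


# Toy input (assumption): 10 characters, window of 4, advancing by 2
chunks = sliding_window("abcdefghij", size=4, step=2)
for c in chunks:
    print(c["start"], c["chunk"])
# 0 abcd
# 2 cdef
# 4 efgh
# 6 ghij

# An empty input produces no chunks at all
empty_chunks = sliding_window("", size=4, step=2)
print(empty_chunks)  # []
```

With `step` smaller than `size`, consecutive chunks overlap by `size - step` characters, so text cut at one boundary also appears intact in a neighbouring chunk. An empty `content` string yields no chunks, which is why a document such as `evidently_docs[0]` above contributes nothing to the result.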

In [50]:
evidently_chunks = []

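for doc in evidently_docs:
    doc_copy = doc.copy()
    # Separate the text from the metadata (title, filename, ...)
    doc_content = doc_copy.pop('content')
    # 2000-character windows, advancing 1000 characters at a time (50% overlap)
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        # Attach the document metadata to every chunk
        chunk.update(doc_copy)
    evidently_chunks.extend(chunks)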

In [51]:
print(f"Evidently chunks: {len(evidently_chunks)}")

Evidently chunks: 575


In [52]:
evidently_chunks[0]

{'start': 0,
 'chunk': '<Note>\n  If you\'re not looking to build API reference documentation, you can delete\n  this section by removing the api-reference folder.\n</Note>\n\n## Welcome\n\nThere are two ways to build API documentation: [OpenAPI](https://mintlify.com/docs/api-playground/openapi/setup) and [MDX components](https://mintlify.com/docs/api-playground/mdx/configuration). For the starter kit, we are using the following OpenAPI specification.\n\n<Card\n  title="Plant Store Endpoints"\n  icon="leaf"\n  href="https://github.com/mintlify/starter/blob/main/api-reference/openapi.json"\n>\n  View the OpenAPI specification file\n</Card>\n\n## Authentication\n\nAll API endpoints are authenticated using Bearer tokens and picked up from the specification file.\n\n```json\n"security": [\n  {\n    "bearerAuth": []\n  }\n]\n```',
 'title': 'Introduction',
 'description': 'Example section for showcasing API endpoints',
 'filename': 'docs-main/api-reference/introduction.mdx'}

In [53]:
pydanticai_chunks = []

for doc in pydanticai_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        chunk.update(doc_copy)
    pydanticai_chunks.extend(chunks)
    
print(f"Pydantic chunks: {len(pydanticai_chunks)}")

Pydantic chunks: 606



### 2. Splitting by Paragraphs and Sections

#### Paragraphs

Use the `\n\s*\n` regex pattern to split text into paragraphs:

- `\n` matches a newline
- `\s*` matches zero or more whitespace characters (spaces, tabs, and also further newlines)
- `\n` matches another newline
- So `\n\s*\n` matches two newlines with optional whitespace between them
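
As a quick check of the pattern described above, the sketch below splits a hand-written sample string (an assumption made for illustration, not taken from the docs). It includes a "blank" line that contains spaces and a run of several newlines.

```python
import re

# Hand-written sample: a plain blank line, a blank line with spaces,
# and a run of four newlines between paragraphs
sample = (
    "First paragraph.\n\n"
    "Second paragraph.\n   \n"
    "Third paragraph.\n\n\n\n"
    "Fourth paragraph."
)

paragraphs_demo = re.split(r"\n\s*\n", sample.strip())
print(paragraphs_demo)
# ['First paragraph.', 'Second paragraph.', 'Third paragraph.', 'Fourth paragraph.']
```

Because `\s` also matches newlines, a run of several blank lines is consumed as a single separator, so the result contains no empty strings.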


In [23]:
# Splitting by paragraphs
import re
from pprint import pprint

# text = evidently_docs[45]['content']
text = pydanticai_docs[100]['content']
paragraphs = re.split(r'\n\s*\n', text.strip())

In [24]:
print(text)

"Output" refers to the final value returned from [running an agent](agents.md#running-agents). This can be either plain text, [structured data](#structured-output), or the result of a [function](#output-functions) called with arguments provided by the model.

The output is wrapped in [`AgentRunResult`][pydantic_ai.agent.AgentRunResult] or [`StreamedRunResult`][pydantic_ai.result.StreamedRunResult] so that you can access other data, like [usage][pydantic_ai.usage.RunUsage] of the run and [message history](message-history.md#accessing-messages-from-results).

Both `AgentRunResult` and `StreamedRunResult` are generic in the data they wrap, so typing information about the data returned by the agent is preserved.

A run ends when the model responds with one of the structured output types, or, if no output type is specified or `str` is one of the allowed options, when a plain text response is received. A run can also be cancelled if usage limits are exceeded, see [Usage Limits](agents.md#usa

In [26]:
paragraphs[:3]

['"Output" refers to the final value returned from [running an agent](agents.md#running-agents). This can be either plain text, [structured data](#structured-output), or the result of a [function](#output-functions) called with arguments provided by the model.',
 'The output is wrapped in [`AgentRunResult`][pydantic_ai.agent.AgentRunResult] or [`StreamedRunResult`][pydantic_ai.result.StreamedRunResult] so that you can access other data, like [usage][pydantic_ai.usage.RunUsage] of the run and [message history](message-history.md#accessing-messages-from-results).',
 'Both `AgentRunResult` and `StreamedRunResult` are generic in the data they wrap, so typing information about the data returned by the agent is preserved.']

In [25]:
print(f"Paragraphs: {len(paragraphs)}")

Paragraphs: 167


#### Sections

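Markdown documents mark their structure with headers of different levels, so splitting on one header level keeps each section's text together with its heading: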

```text
# Heading 1
## Heading 2  
### Heading 3
```

In [56]:
import re

def split_markdown_by_level(text, level=2):
    """
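    Split markdown text by a specific header level.

    Text before the first matching header is not included in the output,
    and a document with no matching header yields an empty list.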
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """
    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    # Split and keep the headers
    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        # We step by 3 because regex.split() with
        # capturing groups returns:
        # [before_match, group1, group2, after_match, ...]
        # here group1 is "## ", group2 is the header text
        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()

        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections
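
The loop above relies on how `re.split` behaves when the pattern has capturing groups. The following sketch prints the raw `parts` list for a small hand-written document (an illustrative assumption) so the step-by-3 layout is visible.

```python
import re

# Hand-written sample with an intro, two level-2 sections and one level-3 header
doc = "Intro\n\n## A\n\nalpha\n\n### A.1\n\nnested\n\n## B\n\nbeta\n"

pattern = re.compile(r"^(#{2} )(.+)$", re.MULTILINE)
parts = pattern.split(doc)
print(parts)
# ['Intro\n\n', '## ', 'A', '\n\nalpha\n\n### A.1\n\nnested\n\n', '## ', 'B', '\n\nbeta\n']
```

Each match contributes three items: the `"## "` prefix, the header text, and the content that follows. The level-3 header stays inside its parent section because `"### "` does not match `#{2} ` at the start of a line. `parts[0]` holds the text before the first header; the loop starts at index 1, so that text never appears in any section.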

In [57]:
sections = split_markdown_by_level(text, level=2)
print(f"Sections: {len(sections)}")

Sections: 3


In [58]:
sections[0]

'## Output data {#structured-output}\n\nThe [`Agent`][pydantic_ai.Agent] class constructor takes an `output_type` argument that takes one or more types or [output functions](#output-functions). It supports simple scalar types, list and dict types (including `TypedDict`s and [`StructuredDict`s](#structured-dict)), dataclasses and Pydantic models, as well as type unions -- generally everything supported as type hints in a Pydantic model. You can also pass a list of multiple choices.\n\nBy default, Pydantic AI leverages the model\'s tool calling capability to make it return structured data. When multiple output types are specified (in a union or list), each member is registered with the model as a separate output tool in order to reduce the complexity of the schema and maximise the chances a model will respond correctly. This has been shown to work well across a wide range of models. If you\'d like to change the names of the output tools, use a model\'s native structured output feature, o

In [59]:
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)
        
print(f"Evidently sections: {len(evidently_chunks)}")

Evidently sections: 262


In [60]:
evidently_chunks[0]

{'title': 'Introduction',
 'description': 'Example section for showcasing API endpoints',
 'filename': 'docs-main/api-reference/introduction.mdx',
 'section': '## Welcome\n\nThere are two ways to build API documentation: [OpenAPI](https://mintlify.com/docs/api-playground/openapi/setup) and [MDX components](https://mintlify.com/docs/api-playground/mdx/configuration). For the starter kit, we are using the following OpenAPI specification.\n\n<Card\n  title="Plant Store Endpoints"\n  icon="leaf"\n  href="https://github.com/mintlify/starter/blob/main/api-reference/openapi.json"\n>\n  View the OpenAPI specification file\n</Card>'}

In [61]:
pydanticai_chunks = []

for doc in pydanticai_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        pydanticai_chunks.append(section_doc)
        
print(f"Pydantic sections: {len(pydanticai_chunks)}")

Pydantic sections: 264


In [62]:
pydanticai_chunks[0]

{'filename': 'pydantic-ai-main/CLAUDE.md',
 'section': '## Development Commands\n\n### Core Development Tasks\n\n- **Install dependencies**: `make install` (requires uv, pre-commit, and deno)\n- **Run all checks**: `pre-commit run --all-files`\n- **Run tests**: `make test`\n- **Build docs**: `make docs` or `make docs-serve` (local development)\n\n### Single Test Commands\n\n- **Run specific test**: `uv run pytest tests/test_agent.py::test_function_name -v`\n- **Run test file**: `uv run pytest tests/test_agent.py -v`\n- **Run with debug**: `uv run pytest tests/test_agent.py -v -s`'}
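
The header-based splitter returns nothing for a document that has no header at the chosen level, and it drops any text before the first header. If that content matters for retrieval, a variant along the following lines keeps it. `split_with_preamble` is a hypothetical helper written for illustration, not part of the original notebook.

```python
import re


def split_with_preamble(text, level=2):
    """Split on headers of the given level, keeping text before the first header.

    Hypothetical variant: a document without matching headers is returned
    as a single section instead of being dropped.
    """
    pattern = re.compile(r"^#{" + str(level) + r"} .+$", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(text)]

    if not starts:
        stripped = text.strip()
        return [stripped] if stripped else []

    sections = []
    preamble = text[:starts[0]].strip()
    if preamble:
        sections.append(preamble)

    # Each section runs from its header to the next header (or the end of the text)
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        sections.append(text[begin:end].strip())
    return sections


with_headers = split_with_preamble("Intro\n\n## A\n\nalpha\n\n## B\n\nbeta\n")
without_headers = split_with_preamble("Just one paragraph, no headers.")
print(with_headers)     # ['Intro', '## A\n\nalpha', '## B\n\nbeta']
print(without_headers)  # ['Just one paragraph, no headers.']
```

Whether to keep the preamble is a design choice: it often holds a useful summary, but it can also be boilerplate that adds noise to search results.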

### 3. Intelligent Chunking with LLM

In [13]:
# from openai import OpenAI

# openai_client = OpenAI()


# def llm(prompt, model='gpt-4o-mini'):
#     messages = [
#         {"role": "user", "content": prompt}
#     ]

#     response = openai_client.responses.create(
#         model=model,
#         input=messages
#     )

#     return response.output_text

In [16]:
import os
from groq import Groq

groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def llm(prompt, model="openai/gpt-oss-20b"):
    """Send a single user message to the Groq chat API and return the reply text."""
    messages = [
        {
            "role": "user",
            "content": prompt
        }
    ]

    response = groq_client.chat.completions.create(
        messages=messages,
        model=model,
    )

    return response.choices[0].message.content

In [17]:
llm("What is the capital of France?")

'The capital of France is **Paris**.'

In [18]:
prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()


The prompt asks the LLM to:

- Split the document logically (not just by length)
- Make sections self-contained
- Use a specific output format that's easy to parse


In [19]:
def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
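
Splitting on the bare string `'---'` also cuts wherever those three characters appear inside a section, for example in a Markdown table's separator row. A possible refinement, sketched below, splits only on lines that consist solely of `---`. The function name `split_on_separator_lines` and the sample response are assumptions made for illustration; the sample is hand-written, not real model output.

```python
import re

# Matches a line containing only '---', with optional spaces or tabs around it
SEPARATOR = re.compile(r"^[ \t]*---[ \t]*$", re.MULTILINE)


def split_on_separator_lines(response):
    """Split an LLM response on lines that consist solely of '---'."""
    sections = SEPARATOR.split(response)
    return [s.strip() for s in sections if s.strip()]


# Hand-written stand-in for a model response (values are made up)
fake_response = (
    "## Metrics\n\n"
    "| name | value |\n"
    "|---|---|\n"
    "| accuracy | 0.9 |\n\n"
    "---\n\n"
    "## Setup\n\n"
    "Install the package.\n\n"
    "---"
)

naive = [s.strip() for s in fake_response.split("---") if s.strip()]
robust = split_on_separator_lines(fake_response)
print(len(naive), len(robust))  # 4 2
```

The naive split breaks the table into fragments, while the line-anchored pattern returns the two intended sections. This still depends on the model following the requested format, so spot-checking a few outputs remains worthwhile.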

In [None]:
from tqdm.auto import tqdm

evidently_chunks = []

for doc in tqdm(evidently_docs[:10]): # Limiting to first 10 docs for cost control
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    
    if len(doc_content) == 0:
        continue 

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)
        
print(f"Evidently sections: {len(evidently_chunks)}")

Evidently sections: 124


In [27]:
for chunk in evidently_chunks[:10]:
    print("="*80)
    print(chunk['section'][:1000])  # Print first 1000 characters of the section
    print()

## Optional API Reference Folder

If you are not looking to build API reference documentation, you can delete this section by removing the `api-reference` folder.

## Getting Started

There are two ways to build API documentation:

1. **OpenAPI** – use an OpenAPI specification file.  
2. **MDX Components** – use custom MDX components for documentation.

For the starter kit, we are using the following OpenAPI specification.

## API Specification

**Plant Store Endpoints**

- **Specification File:**  
  <https://github.com/mintlify/starter/blob/main/api-reference/openapi.json>

  The OpenAPI specification file defines all of the Plant Store API endpoints, request/response schemas, and metadata. You can view or download it directly from the link above.

## Authentication

All API endpoints are authenticated using Bearer tokens. The security configuration is defined in the OpenAPI specification as follows:

```json
"security": [
  {
    "bearerAuth": []
  }
]
```

Clients must include a va