# Day 2: Chunking and Intelligent Processing for Data

- https://docs.google.com/document/d/12wVi866gQDFSw09LZdTixYltqY4_P0g7x04IyYDil5o/edit?tab=t.0

### Why We Need to Prepare Large Documents Before Using Them

Large documents create several problems:

- Token limits: Most LLMs cap the number of input tokens they accept
- Cost: LLM APIs typically bill per token, so longer prompts cost more
- Performance: Answer quality tends to degrade when the context is very long
- Relevance: Only some parts of a long document are relevant to a specific question

So we need to split documents into smaller subdocuments. For AI applications like RAG, this process is referred to as "chunking."


### Loading Data from Day 1

In [None]:
from utils.ingest import read_repo_data

In [3]:
evidently_docs = read_repo_data('evidentlyai', 'docs')
print(f"Evidently documents: {len(evidently_docs)}")

Evidently documents: 95


### 1. Simple Chunking with Sliding Window

In [4]:
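def sliding_window(seq, size, step):
    """Split seq into chunks of up to `size` items, starting every `step` items."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    n = len(seq)
    result = []
    for i in range(0, n, step):
        chunk = seq[i:i+size]
        result.append({'start': i, 'chunk': chunk})
        # Stop once a chunk reaches the end of seq, so no redundant tail is added
        if i + size >= n:
            break

    return result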
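
To see what the window does, here is a small sketch on a 10-character string. The function is restated in condensed form (without the argument check) only so the block runs on its own; the sample string and the `size=4`, `step=2` values are illustrative assumptions.

```python
import math


def sliding_window(seq, size, step):
    # Condensed restatement of the function above (no argument validation)
    result = []
    for i in range(0, len(seq), step):
        result.append({'start': i, 'chunk': seq[i:i + size]})
        if i + size >= len(seq):
            break
    return result


chunks = sliding_window("abcdefghij", size=4, step=2)
for c in chunks:
    print(c['start'], c['chunk'])

# When step <= size and len(seq) > size, the number of chunks is
# ceil((len(seq) - size) / step) + 1
expected = math.ceil((10 - 4) / 2) + 1
print(len(chunks), expected)
```

Each chunk shares its last two characters with the next one. The early `break` matters: without it, a final window starting at index 8 would add `"ij"`, which is already fully contained in the previous chunk.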

In [5]:
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
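    # 2000-character windows that start every 1000 characters (50% overlap)
    chunks = sliding_window(doc_content, 2000, 1000)
    for chunk in chunks:
        # Attach the document's remaining metadata to each chunk
        chunk.update(doc_copy)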
    evidently_chunks.extend(chunks)

In [6]:
print(f"Evidently chunks: {len(evidently_chunks)}")

Evidently chunks: 575


### 2. Splitting by Paragraphs and Sections

#### Paragraphs

Use the regex pattern `\n\s*\n` to split text into paragraphs:

- `\n` matches a newline
- `\s*` matches zero or more whitespace characters (including further newlines)
- `\n` matches another newline
- So `\n\s*\n` matches one or more blank lines, even if they contain spaces or tabs
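
A quick sketch of the pattern on a made-up string (the sample text is an assumption, not taken from the Evidently docs). A single newline inside a paragraph is left alone, while a run of blank lines, including one that holds only spaces, acts as a single separator:

```python
import re

sample = (
    "First paragraph.\n"
    "\n"
    "Second paragraph,\n"
    "still the second.\n"
    "   \n"
    "\n"
    "\n"
    "Third paragraph."
)

paragraphs = re.split(r'\n\s*\n', sample.strip())
for p in paragraphs:
    print(repr(p))
```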


In [7]:
# Splitting by paragraphs
import re
text = evidently_docs[45]['content']
paragraphs = re.split(r'\n\s*\n', text.strip())

In [8]:
from pprint import pprint
pprint(text)

('In this tutorial, you will learn how to perform regression testing for LLM '
 'outputs.\n'
 '\n'
 'You can compare new and old responses after changing a prompt, model, or '
 'anything else in your system. By re-running the same inputs with new '
 'parameters, you can spot any significant changes. This helps you push '
 'updates with confidence or identify issues to fix.\n'
 '\n'
 '<Info>\n'
 "  **This example uses Evidently Cloud.** You'll run evals in Python and "
 'upload them. You can also skip the upload and view Reports locally. For '
 'self-hosted, replace `CloudWorkspace` with `Workspace`.\n'
 '</Info>\n'
 '\n'
 '# Tutorial scope\n'
 '\n'
 "Here's what we'll do:\n"
 '\n'
 '* **Create a toy dataset**. Build a small Q&A dataset with answers and '
 'reference responses.\n'
 '\n'
 '* **Get new answers**. Imitate generating new answers to the same question.\n'
 '\n'
 '* **Create and run a Report with Tests**. Compare the answers using '
 'LLM-as-a-judge to evaluate length, correct

In [9]:
print(f"Paragraphs: {len(paragraphs)}")

Paragraphs: 153


#### Sections

Markdown documents mark their structure with headers of different levels:

```text
# Heading 1
## Heading 2
### Heading 3
```

In [10]:
import re

def split_markdown_by_level(text, level=2):
    """
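    Split markdown text by a specific header level.

    Text before the first matching header is discarded, and a document
    with no header at this level yields an empty list.

    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings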
    """
    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    # Split and keep the headers
    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        # We step by 3 because pattern.split() with two
        # capturing groups returns:
        # [before_first_match, group1, group2, text_after_match, ...]
        # here group1 is "## ", group2 is the header text
        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()

        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections
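
One property of the function above is worth knowing: the loop starts at `parts[1]`, so any text before the first matching header is dropped, and a document without a header at the chosen level produces no sections at all. If that text should be kept, one possible variant is sketched below. `split_keep_preamble` is a hypothetical helper, not part of the notebook; it splits on a zero-width lookahead, which assumes Python 3.7 or later.

```python
import re


def split_keep_preamble(text, level=2):
    # Hypothetical variant: split *before* each header of the given level,
    # so the header stays attached to its section and the preamble survives.
    pattern = re.compile(r'^(?=#{' + str(level) + r'} )', re.MULTILINE)
    return [part.strip() for part in pattern.split(text) if part.strip()]


doc = (
    "Intro text.\n\n"
    "## Setup\n\nInstall it.\n\n"
    "### Details\n\nMore.\n\n"
    "## Usage\n\nRun it."
)

parts = split_keep_preamble(doc)
for part in parts:
    print(repr(part))

# A document with no level-2 header comes back as a single chunk
print(split_keep_preamble("Just one paragraph, no headers."))
```

Like the original, this sketch treats any line that starts with `## ` as a header, including such lines inside fenced code blocks.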

In [11]:
sections = split_markdown_by_level(text, level=2)
print(f"Sections: {len(sections)}")

Sections: 8


In [12]:
evidently_chunks = []

for doc in evidently_docs:
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')
    sections = split_markdown_by_level(doc_content, level=2)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)
        
print(f"Evidently sections: {len(evidently_chunks)}")

Evidently sections: 262


### 3. Intelligent Chunking with LLM

In [13]:
# Alternative: the same helper using the OpenAI client
# (left commented out; the Groq version in the next cell is used instead)
# from openai import OpenAI

# openai_client = OpenAI()


# def llm(prompt, model='gpt-4o-mini'):
#     messages = [
#         {"role": "user", "content": prompt}
#     ]

#     response = openai_client.responses.create(
#         model=model,
#         input=messages
#     )

#     return response.output_text

In [16]:
import os
from groq import Groq

groq_client = Groq(api_key=os.environ.get("GROQ_API_KEY"))

def llm(prompt, model="openai/gpt-oss-20b"):
    messages = [
        {
            "role": "user", 
            "content": prompt
        }
    ]

    response = groq_client.chat.completions.create(
        messages=messages,
        model=model,
    )

    return response.choices[0].message.content

In [17]:
llm("What is the capital of France?")

'The capital of France is **Paris**.'

In [18]:
prompt_template = """
Split the provided document into logical sections
that make sense for a Q&A system.

Each section should be self-contained and cover
a specific topic or concept.

<DOCUMENT>
{document}
</DOCUMENT>

Use this format:

## Section Name

Section content with all relevant details

---

## Another Section Name

Another section content

---
""".strip()


The prompt asks the LLM to:

- Split the document logically (not just by length)
- Make sections self-contained
- Use a specific output format that's easy to parse
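
Because the template is filled with `str.format`, it helps to know how braces are treated. The sketch below uses a shortened stand-in template (an assumption, not the real `prompt_template`) to show that braces inside the inserted document are passed through untouched, while a literal brace in the template itself would have to be doubled.

```python
# Shortened stand-in for the prompt template, for illustration only
template = "<DOCUMENT>\n{document}\n</DOCUMENT>"

# Documentation pages often contain code with braces
doc = 'config = {"a": 1}'

# Only the template is parsed for placeholders; the value is inserted as-is
prompt = template.format(document=doc)
print(prompt)

# A literal brace in the template must be written as {{ or }}
escaped = "{{literal}} {document}".format(document="x")
print(escaped)
```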


In [19]:
def intelligent_chunking(text):
    prompt = prompt_template.format(document=text)
    response = llm(prompt)
    sections = response.split('---')
    sections = [s.strip() for s in sections if s.strip()]
    return sections
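
Splitting on the bare string `'---'` also cuts wherever those three characters appear inside a section, for example in the header row of a Markdown table. A slightly stricter parser is sketched below; `parse_sections` and `SEPARATOR` are hypothetical names, and the sample response is made up rather than real model output.

```python
import re

# Hypothetical alternative to response.split('---'): only a line that
# consists solely of three or more dashes counts as a separator.
SEPARATOR = re.compile(r'^[ \t]*-{3,}[ \t]*$', re.MULTILINE)


def parse_sections(response):
    return [s.strip() for s in SEPARATOR.split(response) if s.strip()]


response = (
    "## Metrics\n\n"
    "| name | value |\n"
    "|---|---|\n"
    "| accuracy | 0.9 |\n\n"
    "---\n\n"
    "## Usage\n\n"
    "Run the report.\n\n"
    "---\n"
)

naive = [s.strip() for s in response.split('---') if s.strip()]
robust = parse_sections(response)
print(len(naive), len(robust))  # prints: 4 2
```

The naive split breaks the table into fragments, while the line-anchored pattern returns the two intended sections. This sketch would still split on a dashed line used for another purpose, such as a setext heading underline, so the output is worth spot-checking either way.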

In [None]:
from tqdm.auto import tqdm

evidently_chunks = []

for doc in tqdm(evidently_docs[:10]):  # limit to the first 10 docs to control cost
    doc_copy = doc.copy()
    doc_content = doc_copy.pop('content')

    # Skip empty documents so no LLM call is spent on them
    if not doc_content:
        continue

    sections = intelligent_chunking(doc_content)
    for section in sections:
        section_doc = doc_copy.copy()
        section_doc['section'] = section
        evidently_chunks.append(section_doc)
        
print(f"Evidently sections: {len(evidently_chunks)}")

Evidently sections: 124


In [27]:
for chunk in evidently_chunks[:10]:
    print("="*80)
    print(chunk['section'][:1000])  # Print first 1000 characters of the section
    print()

## Optional API Reference Folder

If you are not looking to build API reference documentation, you can delete this section by removing the `api-reference` folder.

## Getting Started

There are two ways to build API documentation:

1. **OpenAPI** – use an OpenAPI specification file.  
2. **MDX Components** – use custom MDX components for documentation.

For the starter kit, we are using the following OpenAPI specification.

## API Specification

**Plant Store Endpoints**

- **Specification File:**  
  <https://github.com/mintlify/starter/blob/main/api-reference/openapi.json>

  The OpenAPI specification file defines all of the Plant Store API endpoints, request/response schemas, and metadata. You can view or download it directly from the link above.

## Authentication

All API endpoints are authenticated using Bearer tokens. The security configuration is defined in the OpenAPI specification as follows:

```json
"security": [
  {
    "bearerAuth": []
  }
]
```

Clients must include a va