# Intelligent Chunking

In the previous lesson, we downloaded data from a GitHub repository and split it with simple character-based chunking. For most applications, that is enough.

But let's take a look at some alternative approaches:
- Token-based chunking: You first tokenize the content (turn it into a sequence of tokens, such as words or word pieces) and then slide a window over the tokens
    - Advantages: More precise control over LLM input size
    - Disadvantages: Doesn't work well for documents with code
- Paragraph splitting: Split by paragraphs
- Section splitting: Split by sections
- AI-powered splitting: Let AI split the text intelligently

We won't cover token-based chunking in detail here, because our documents contain code. It is easy to implement, though: ask ChatGPT for help if you need it for text-only content.
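For reference, here is a minimal sketch of the sliding-window idea for text-only content. It assumes whitespace splitting as a stand-in for a real tokenizer, and the function name `sliding_window_tokens` and its `size` and `step` parameters are illustrative, not part of the lesson's code.

```python
def sliding_window_tokens(text: str, size: int, step: int) -> list[str]:
    """Chunk text with a sliding window over whitespace-separated tokens."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")

    # Assumption: whitespace splitting stands in for a real tokenizer
    tokens = text.split()

    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append(" ".join(window))
        # Stop once the window has reached the end of the text
        if start + size >= len(tokens):
            break
    return chunks


sample = "one two three four five six seven"
token_chunks = sliding_window_tokens(sample, size=4, step=2)
print(token_chunks)
```

With `step` smaller than `size`, consecutive chunks overlap, so content cut at a chunk boundary still appears whole in a neighboring chunk.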

In [2]:
import io
from typing import Iterable, Callable
import zipfile
import traceback
from dataclasses import dataclass

import requests


@dataclass
class RawRepositoryFile:
    filename: str
    content: str


class GithubRepositoryDataReader:
    """
    Downloads and parses markdown and code files from a GitHub repository.
    """

    def __init__(self,
                repo_owner: str,
                repo_name: str,
                allowed_extensions: Iterable[str] | None = None,
                filename_filter: Callable[[str], bool] | None = None
        ):
        """
        Initialize the GitHub repository data reader.
        
        Args:
            repo_owner: The owner/organization of the GitHub repository
            repo_name: The name of the GitHub repository
            allowed_extensions: Optional set of file extensions to include
                    (e.g., {"md", "py"}). If not provided, all file types are included
            filename_filter: Optional callable to filter files by their path
        """
        prefix = "https://codeload.github.com"
        self.url = (
            f"{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main"
        )

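        # Always set the attribute so _should_skip_file can check it safely
        if allowed_extensions is not None:
            self.allowed_extensions = {ext.lower() for ext in allowed_extensions}
        else:
            self.allowed_extensions = None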

        if filename_filter is None:
            self.filename_filter = lambda filepath: True
        else:
            self.filename_filter = filename_filter

    def read(self) -> list[RawRepositoryFile]:
        """
        Download and extract files from the GitHub repository.
        
        Returns:
            List of RawRepositoryFile objects for each processed file
            
        Raises:
            Exception: If the repository download fails
        """
        resp = requests.get(self.url)
        if resp.status_code != 200:
            raise Exception(f"Failed to download repository: {resp.status_code}")

        # The context manager closes the archive even if extraction fails
        with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
            repository_data = self._extract_files(zf)

        return repository_data

    def _extract_files(self, zf: zipfile.ZipFile) -> list[RawRepositoryFile]:
        """
        Extract and process files from the zip archive.
        
        Args:
            zf: ZipFile object containing the repository data

        Returns:
            List of RawRepositoryFile objects for each processed file
        """
        data = []

        for file_info in zf.infolist():
            filepath = self._normalize_filepath(file_info.filename)

            if self._should_skip_file(filepath):
                continue

            try:
                with zf.open(file_info) as f_in:
                    # decode() always returns a str, so no None check is needed
                    content = f_in.read().decode("utf-8", errors="ignore").strip()

                    file = RawRepositoryFile(
                        filename=filepath,
                        content=content
                    )
                    data.append(file)

            except Exception as e:
                print(f"Error processing {file_info.filename}: {e}")
                traceback.print_exc()
                continue

        return data

    def _should_skip_file(self, filepath: str) -> bool:
        """
        Determine whether a file should be skipped during processing.
        
        Args:
            filepath: The file path to check
            
        Returns:
            True if the file should be skipped, False otherwise
        """
        filepath = filepath.lower()

        # directory
        if filepath.endswith("/"):
            return True

        # hidden file
        filename = filepath.split("/")[-1]
        if filename.startswith("."):
            return True

        if self.allowed_extensions:
            ext = self._get_extension(filepath)
            if ext not in self.allowed_extensions:
                return True

        if not self.filename_filter(filepath):
            return True

        return False

    def _get_extension(self, filepath: str) -> str:
        """
        Extract the file extension from a filepath.
        
        Args:
            filepath: The file path to extract extension from
            
        Returns:
            The file extension (without dot) or empty string if no extension
        """
        filename = filepath.lower().split("/")[-1]
        if "." in filename:
            return filename.rsplit(".", maxsplit=1)[-1]
        else:
            return ""

    def _normalize_filepath(self, filepath: str) -> str:
        """
        Removes the top-level directory from the file path inside the zip archive.
        'repo-main/path/to/file.py' -> 'path/to/file.py'
        
        Args:
            filepath: The original filepath from the zip archive
            
        Returns:
            The normalized filepath with top-level directory removed
        """
        parts = filepath.split("/", maxsplit=1)
        if len(parts) > 1:
            return parts[1]
        else:
            return parts[0]

In [4]:
from typing import List

def read_github_data(repo_owner: str, repo_name: str) -> List[RawRepositoryFile]:
    allowed_extensions = {"md", "mdx"}

    reader = GithubRepositoryDataReader(
        repo_owner,
        repo_name,
        allowed_extensions=allowed_extensions,
    )
    
    return reader.read()

In [5]:
github_data = read_github_data('evidentlyai', 'docs')

In [6]:
import frontmatter
from typing import Dict, Any

def parse_data(data_raw: List[RawRepositoryFile]) -> List[Dict[str, Any]]:
    data_parsed = []
    for f in data_raw:
        post = frontmatter.loads(f.content)
        data = post.to_dict()
        data['filename'] = f.filename
        data_parsed.append(data)

    return data_parsed

In [7]:
parsed_data = parse_data(github_data)

## Splitting by Paragraphs

Paragraphs are separated by one or more blank lines, that is, two or more newline characters (`\n`) with at most whitespace between them, so we can split the text with a regular expression:

In [8]:
import re
text = parsed_data[45]['content']
paragraphs = re.split(r"\n\s*\n", text.strip())
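
To see what the pattern does on a small input, here is a sketch with a made-up sample string. Paragraphs are often short, so the sketch also groups them into larger chunks. The helper names `split_paragraphs` and `group_paragraphs` and the `max_chars` limit are assumptions for illustration, not part of the lesson's code.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    # A blank line (possibly containing spaces or tabs) ends a paragraph
    parts = re.split(r"\n\s*\n", text.strip())
    return [p for p in parts if p.strip()]


def group_paragraphs(paragraphs: list[str], max_chars: int) -> list[str]:
    """Merge consecutive paragraphs until a chunk would exceed max_chars."""
    chunks = []
    current = ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk
            chunks.append(current)
            current = p
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph,\nsecond line.\n\n   \n"
    "Second paragraph.\n\n\n\n"
    "Third paragraph."
)
sample_paragraphs = split_paragraphs(sample)
grouped = group_paragraphs(sample_paragraphs, max_chars=50)
print(sample_paragraphs)
print(grouped)
```

A single paragraph longer than `max_chars` is kept whole in this sketch. Whether to split it further depends on the application.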

In [14]:
paragraphs[9]

"<Note>\n  To simplify things, we won't create an actual LLM app, but will simulate generating new outputs.\n</Note>"

## Section Splitting

Markdown documents have this structure:

```md
# Heading 1
## Heading 2  
### Heading 3
```

We can use this structure and split the document by headers.

In [9]:
import re

def split_markdown_by_level(text, level=2):
    """
    Split markdown text by a specific header level.
    
    :param text: Markdown text as a string
    :param level: Header level to split on
    :return: List of sections as strings
    """

    # This regex matches markdown headers
    # For level 2, it matches lines starting with "## "
    header_pattern = r'^(#{' + str(level) + r'} )(.+)$'
    pattern = re.compile(header_pattern, re.MULTILINE)

    # Split and keep the headers
    parts = pattern.split(text)
    
    sections = []
    for i in range(1, len(parts), 3):
        # We step by 3 because regex.split() with
        # capturing groups returns:
        # [before_match, group1, group2, after_match, ...]
        # here group1 is "## ", group2 is the header text
        header = parts[i] + parts[i+1]  # "## " + "Title"
        header = header.strip()

        # Get the content after this header
        content = ""
        if i+2 < len(parts):
            content = parts[i+2].strip()

        if content:
            section = f'{header}\n\n{content}'
        else:
            section = header
        sections.append(section)
    
    return sections


In [10]:
sections = split_markdown_by_level(text, level=2)

In [18]:
sections[3]

'## 4. Get new answers\n\nSuppose you generate new responses using your LLM after changing a prompt. We will imitate it by adding a new column with new responses to the DataFrame:\n\n<Accordion title="New toy data generation" defaultOpen={false}>\n  Run this code to generate a new dataset.\n\n  ```python\n  data = [\n    ["Why is the sky blue?",\n     "The sky is blue because molecules in the air scatter blue light from the sun more than they scatter red light.",\n     "The sky appears blue because air molecules scatter the sun’s blue light more than they scatter other colors."],\n\n    ["How do airplanes stay in the air?",\n     "Airplanes stay in the air because their wings create lift by forcing air to move faster over the top of the wing than underneath, which creates lower pressure on top.",\n     "Airplanes stay airborne because the shape of their wings causes air to move faster over the top than the bottom, generating lift."],\n\n    ["Why do we have seasons?",\n     "We have se
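
The step of 3 in the loop follows from how `re.split` handles capturing groups: every match contributes its groups to the result list. A small demonstration on a made-up string (`sample_md` is an illustrative input, not taken from the docs) shows the layout of `parts`:

```python
import re

sample_md = "Intro text\n## First\nBody one\n## Second\nBody two"

# Same pattern that split_markdown_by_level builds for level=2
pattern = re.compile(r"^(#{2} )(.+)$", re.MULTILINE)
parts = pattern.split(sample_md)
print(parts)

# Index 0 is the text before the first header; after that, each header
# contributes three items: marker, title, and the content that follows
headers = [parts[i] + parts[i + 1] for i in range(1, len(parts), 3)]
print(headers)
```

Because the loop starts at index 1, any text before the first header of the chosen level (`parts[0]`) does not end up in any section.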

## Intelligent Chunking with an LLM

This makes sense when:
- Complex structure: Documents have complex, non-standard structure
- Semantic coherence: You want chunks that are semantically meaningful
- Custom logic: You need domain-specific splitting rules
- Quality over cost: You prioritize quality over processing cost

Note: This costs money. In most cases, we don't need intelligent chunking: simple approaches are sufficient.

Use intelligent chunking only when:
- You have already evaluated simpler methods and confirmed that they produce poor results
- You have complex, unstructured documents
- Quality is more important than cost
- You have the budget for LLM processing



The prompt tells the model how to split the document:

In [22]:
chunking_instructions = """
Split the provided document into logical sections that make sense for a Q&A system.
Each section should be self-contained and cover a specific topic or concept.
Sections should be relatively large (3000-5000 characters).
""".strip()

For structured output, we describe the expected response with pydantic models: a document with a list of sections.

In [19]:
from pydantic import BaseModel

class Section(BaseModel):
    title: str
    markdown: str

class Document(BaseModel):
    title: str
    sections: list[Section]


In [20]:
from openai import OpenAI

openai_client = OpenAI()

def llm_structured(instructions, user_prompt, output_type, model="gpt-4o-mini"):
    messages = [
        { "role": "system", "content": instructions },
        { "role": "user", "content": user_prompt }
    ]

    response = openai_client.responses.parse(
        model=model,
        input=messages,
        text_format=output_type
    )

    return response.output_parsed

In [23]:
result = llm_structured(
    instructions=chunking_instructions,
    user_prompt=text,
    output_type=Document
)
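
The call above returns a `Document` whose sections can then be stored as individual chunks. Below is a minimal sketch of that step. The names `ExampleSection`, `ExampleDocument`, `document_to_chunks`, and the record keys are hypothetical. Plain dataclasses stand in for the pydantic models, so the sketch runs without an API call.

```python
from dataclasses import dataclass


# Stand-ins for the pydantic Section and Document models (assumption:
# only the title, markdown and sections attributes are needed here)
@dataclass
class ExampleSection:
    title: str
    markdown: str


@dataclass
class ExampleDocument:
    title: str
    sections: list[ExampleSection]


def document_to_chunks(doc, filename: str) -> list[dict]:
    """Flatten a document into one record per section."""
    chunks = []
    for section in doc.sections:
        chunks.append({
            "filename": filename,  # keeps the link to the source file
            "document_title": doc.title,
            "section_title": section.title,
            "section": section.markdown,
        })
    return chunks


example_doc = ExampleDocument(
    title="Example",
    sections=[
        ExampleSection(title="Intro", markdown="Some text"),
        ExampleSection(title="Setup", markdown="More text"),
    ],
)
example_chunks = document_to_chunks(example_doc, filename="docs/example.mdx")
print(example_chunks[0])
```

The function only reads `title`, `markdown`, and `sections`, so under that assumption it should work the same way on the object returned by `llm_structured`.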


In [26]:
print(result.title)
print()
for s in result.sections:
    print("------")
    print(s.title)
    print(s.markdown)

Regression Testing for LLM Outputs

------
Introduction to Regression Testing for LLM Outputs
In this tutorial, you will learn how to perform regression testing for outputs generated by Large Language Models (LLMs).

Regression testing is essential for ensuring that changes made to your system—be it a new model, a modified prompt, or any other variation—do not lead to unintended changes in behavior. By comparing new outputs against prior ones, you can identify significant changes, ascertain updates confidently, and correct potential issues.

_info_  
**This example utilizes Evidently Cloud.** We will execute evaluations in Python and have the option to upload results, or alternatively, view reports locally. For self-hosted environments, replace `CloudWorkspace` with `Workspace`._

# Tutorial Overview
This tutorial outlines the following key steps:

1. **Create a toy dataset**: Build a small dataset that includes questions, expected answers, and reference responses.
2. **Generate new an