# PLC Chunking Strategies Comparison

This notebook lets you compare chunking strategies for PLC code files (e.g., Structured Text, Instruction List) and CODESYS documentation. Add your own test files and documents to the placeholder lists below.

In [None]:
# Import required libraries
import re
from typing import List

## 1. Load Test Subjects

Add your PLC code samples and CODESYS documentation excerpts here for testing. Replace the placeholders with your actual data.

In [None]:
# Placeholder for PLC code samples (Structured Text, Instruction List, etc.)
plc_code_samples = [
    # Example: "PROGRAM Main\nVAR\n    MotorStart : BOOL;\nEND_VAR\n...",
]

# Placeholder for CODESYS documentation excerpts
codesys_docs = [
    # Example: "The TON (Timer On-Delay) function block is used to create a time delay before an output is set to TRUE...",
]
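
If your samples live on disk rather than in string literals, they can be read into the same two lists. The following is a minimal sketch: the helper name `load_text_files`, the folder names `samples/plc` and `samples/docs`, and the file extensions are all assumptions for illustration and should be adapted to your project layout.

```python
from pathlib import Path
from typing import Iterable, List


def load_text_files(directory: str, suffixes: Iterable[str] = (".st", ".il", ".txt")) -> List[str]:
    """Read every file under `directory` whose suffix is in `suffixes` (hypothetical helper)."""
    root = Path(directory)
    if not root.is_dir():
        # A missing folder yields an empty list, so the "no samples provided" branch still runs
        return []
    wanted = {s.lower() for s in suffixes}
    texts = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            # errors="replace" keeps the load going if a file is not valid UTF-8
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
    return texts


# Hypothetical folder names; these assignments replace the literal placeholder lists.
plc_code_samples = load_text_files("samples/plc", suffixes=(".st", ".il"))
codesys_docs = load_text_files("samples/docs", suffixes=(".txt", ".md"))
```

Files are returned in sorted path order, so `plc_code_samples[0]` refers to the same file on every run.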

## 2. Define Chunking Strategies

We will implement and compare several chunking strategies:
- Fixed-size chunking (with and without overlap)
- Code-aware chunking (e.g., by function/block for PLC code)
- Recursive character/text splitting (for documentation)
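
To make the effect of overlap concrete: with chunk size `c` and overlap `o`, each chunk starts `c - o` characters after the previous one, so a larger overlap produces more chunks for the same text. The sketch below only computes `(start, end)` offsets; `chunk_spans` is a hypothetical helper name used for illustration.

```python
from typing import List, Tuple


def chunk_spans(text_length: int, chunk_size: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) offsets of fixed-size chunks (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    return [(start, min(start + chunk_size, text_length))
            for start in range(0, text_length, step)]


print(chunk_spans(500, 200, 0))   # [(0, 200), (200, 400), (400, 500)]
print(chunk_spans(500, 200, 50))  # [(0, 200), (150, 350), (300, 500), (450, 500)]
```

In the second call the last span lies entirely inside the previous one. Whether to keep such a tail chunk is a design choice worth keeping in mind when comparing chunk counts across strategies.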

In [None]:
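def fixed_size_chunking(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would give a non-positive step and never terminate
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    # Slicing clamps at the end of the text, so the last chunk may be shorter
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]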

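def code_aware_chunking_st(text: str) -> List[str]:
    """Chunk Structured Text code at FUNCTION_BLOCK, FUNCTION, PROGRAM, or METHOD headers."""
    # Zero-width split (Python 3.7+) before each header keyword at the start of a line.
    # The group is non-capturing so re.split does not return the keyword as a separate
    # item. FUNCTION_BLOCK is listed explicitly because \bFUNCTION\b does not match it
    # ("_" is a word character); for the same reason END_PROGRAM etc. are not split on.
    pattern = r"^(?=[ \t]*(?:FUNCTION_BLOCK|FUNCTION|PROGRAM|METHOD)\b)"
    blocks = re.split(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    # Text before the first header (comments, TYPE declarations) is kept as its own chunk
    return [b.strip() for b in blocks if b.strip()]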

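def recursive_text_split(text: str, max_length: int = 300) -> List[str]:
    """Recursively split text by paragraphs, then sentences, then characters."""
    if not text.strip():
        return []
    if len(text) <= max_length:
        return [text]
    # Try splitting by paragraphs (blank lines), skipping empty pieces
    paras = [p for p in re.split(r'\n\s*\n', text) if p.strip()]
    if len(paras) > 1:
        return [chunk for p in paras for chunk in recursive_text_split(p, max_length)]
    # Try splitting by sentences and greedily packing them up to max_length
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    if len(sentences) > 1:
        chunks = []
        current = ''
        for s in sentences:
            if len(s) > max_length:
                # A single oversized sentence: flush what we have, then split it by characters
                if current:
                    chunks.append(current)
                    current = ''
                chunks.extend(fixed_size_chunking(s, chunk_size=max_length, overlap=0))
            elif current and len(current) + 1 + len(s) > max_length:
                # Adding this sentence (plus a joining space) would exceed the limit
                chunks.append(current)
                current = s
            else:
                current = f"{current} {s}" if current else s
        if current:
            chunks.append(current)
        return chunks
    # Fallback: character split
    return fixed_size_chunking(text.strip(), chunk_size=max_length, overlap=0)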

## 3. Apply Chunking Strategies to Test Subjects

This section applies each chunking method to your test PLC code and documentation. Review the output to compare chunk boundaries and content.

In [None]:
# Example: Apply to first PLC code sample (if available)
if plc_code_samples:
    print("--- Fixed-size chunking (PLC code) ---")
    for i, chunk in enumerate(fixed_size_chunking(plc_code_samples[0])):
        print(f"Chunk {i+1}:\n{chunk}\n---")
    print("\n--- Code-aware chunking (Structured Text) ---")
    for i, chunk in enumerate(code_aware_chunking_st(plc_code_samples[0])):
        print(f"Chunk {i+1}:\n{chunk}\n---")
else:
    print("No PLC code samples provided.")

# Example: Apply to first CODESYS doc excerpt (if available)
if codesys_docs:
    print("\n--- Fixed-size chunking (CODESYS doc) ---")
    for i, chunk in enumerate(fixed_size_chunking(codesys_docs[0])):
        print(f"Chunk {i+1}:\n{chunk}\n---")
    print("\n--- Recursive text split (CODESYS doc) ---")
    for i, chunk in enumerate(recursive_text_split(codesys_docs[0])):
        print(f"Chunk {i+1}:\n{chunk}\n---")
else:
    print("No CODESYS documentation excerpts provided.")

## 4. Compare and Evaluate

- Review the output chunks for each method and test subject.
- Consider: Are code blocks preserved? Are documentation chunks semantically meaningful?
- Add your own notes and observations below.
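
One way to put numbers behind these questions is to summarize chunk lengths and to flag chunks that cut a block in half. The sketch below is a keyword-counting heuristic, not a parser: `chunk_stats`, `unbalanced_chunks`, and `BLOCK_PAIRS` are hypothetical names, the keyword pairs cover only an assumed subset of Structured Text, and keywords inside comments or strings are counted too.

```python
import re
from statistics import mean
from typing import Dict, List

# Assumed subset of Structured Text opener/closer keywords; extend as needed.
BLOCK_PAIRS = {
    "PROGRAM": "END_PROGRAM",
    "FUNCTION_BLOCK": "END_FUNCTION_BLOCK",
    "FUNCTION": "END_FUNCTION",
    "IF": "END_IF",
    "CASE": "END_CASE",
    "FOR": "END_FOR",
    "WHILE": "END_WHILE",
}


def chunk_stats(chunks: List[str]) -> Dict[str, float]:
    """Count, minimum, mean, and maximum chunk length in characters."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "mean": 0.0, "max": 0}
    return {"count": len(lengths), "min": min(lengths),
            "mean": mean(lengths), "max": max(lengths)}


def _count(keyword: str, text: str) -> int:
    # \b treats "_" as a word character, so "IF" is not counted inside "END_IF"
    return len(re.findall(rf"\b{keyword}\b", text, flags=re.IGNORECASE))


def unbalanced_chunks(chunks: List[str]) -> List[int]:
    """Indices of chunks where some opener count differs from its closer count."""
    flagged = []
    for i, chunk in enumerate(chunks):
        if any(_count(o, chunk) != _count(c, chunk) for o, c in BLOCK_PAIRS.items()):
            flagged.append(i)
    return flagged


# Tiny made-up sample, cut in the middle to mimic a poorly placed fixed-size boundary.
sample = "PROGRAM Main\nIF a THEN\n    b := TRUE;\nEND_IF\nEND_PROGRAM"
whole = [sample]
halved = [sample[:25], sample[25:]]
print(chunk_stats(whole), unbalanced_chunks(whole))
print(chunk_stats(halved), unbalanced_chunks(halved))
```

A strategy that flags few chunks as unbalanced and keeps chunk lengths within a narrow range is a reasonable candidate for closer manual review.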