Key Features of the Python AST Chunker:
1. Python-Specific AST Node Types

function_definition - Regular function definitions
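async_function_definition - Async function definitions (the tree-sitter-python grammar generally reports async def as a function_definition carrying an async keyword, so this type may never appear in practice)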
class_definition - Class definitions
decorated_definition - Functions/classes with decorators (@decorator)
import_statement & import_from_statement - Import statements
assignment - Global variable assignments and constants
if_statement, try_statement, with_statement - Control structures
for_statement, while_statement - Loop structures

2. Python-Specific Name Extraction
The extract_node_name() function handles Python patterns:

Function names: def function_name(
Class names: class ClassName
Async functions: async def function_name(
Decorated functions: @decorator def function_name(
Import statements: import module or from module import something
Special handling for if __name__ == "__main__": blocks
Variable assignments: CONSTANT = value
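
As a rough illustration of the patterns listed above, the sketch below maps a node type and its source text to a short name. The helper name `guess_name` and its fallback labels are assumptions made for this example, and it omits the `decorated_` and `async_` prefixes that the script's own extract_node_name() adds.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"

def guess_name(node_type: str, text: str) -> str:
    """Hypothetical helper: derive a short name from a node's source text."""
    if node_type == "if_statement":
        # Special case for the main guard
        is_main = "__name__" in text and "__main__" in text
        return "main_block" if is_main else "if_block"
    if node_type == "import_from_statement":
        m = re.search(r"from\s+([A-Za-z0-9_.]+)", text)
        return f"from_{m.group(1)}" if m else "import"
    if node_type == "import_statement":
        m = re.search(r"import\s+([A-Za-z0-9_.]+)", text)
        return f"import_{m.group(1)}" if m else "import"
    if node_type == "assignment":
        m = re.match(rf"\s*({IDENT})\s*=", text)
        return m.group(1) if m else "assignment"
    # Functions, classes, and decorated definitions: use the first def/class line
    m = re.search(rf"^\s*(?:async\s+)?(?:def|class)\s+({IDENT})", text, re.MULTILINE)
    return m.group(1) if m else node_type
```

Matching on the first `def` or `class` line lets one pattern cover plain, async, and decorated definitions, because decorator lines never start with those keywords.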

3. Semantic Boundaries for Python
The chunker identifies meaningful Python code blocks:

Top-level functions and classes (including async and decorated)
Import statements (grouped together)
Global assignments (constants, configuration variables)
Main execution blocks (if __name__ == "__main__":)
Module-level control structures (substantial if/try/with blocks)
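
To make the idea of module-level boundaries concrete, here is a small sketch that uses the standard-library `ast` module as a stand-in for Tree-sitter (the chunker itself walks a Tree-sitter tree, and its node type names differ from the `ast` class names shown here). The function name `top_level_units` and the sample source are assumptions for illustration.

```python
import ast
from typing import List, Tuple

def top_level_units(source: str) -> List[Tuple[str, str, int, int]]:
    """Return (kind, name, first_line, last_line) for each module-level statement."""
    units = []
    for node in ast.parse(source).body:
        kind = type(node).__name__
        name = getattr(node, "name", kind)  # only defs and classes carry a name
        first = node.lineno
        # Decorators sit above the def/class line, so start the unit there
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            first = min(first, min(d.lineno for d in decorators))
        units.append((kind, name, first, node.end_lineno))
    return units

SAMPLE = '''import os

@staticmethod
def helper():
    return 1

class Box:
    pass

if __name__ == "__main__":
    helper()
'''

for unit in top_level_units(SAMPLE):
    print(unit)
```

Each tuple corresponds to one candidate chunk: the import, the decorated function (starting at its decorator), the class, and the main guard.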

4. Intelligent Grouping

Import consolidation: Groups all imports into a single chunk if under token limit
Small chunk merging: Combines related small chunks to reach optimal size
Complete module detection: If entire file is under token limit, keeps it as one chunk
Hierarchical sub-chunking: Breaks down oversized chunks using line-based splitting
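
The merging and splitting behaviour described above can be sketched as two greedy passes. This is a simplified illustration, not the script's exact logic: `rough_tokens` is a whitespace-based stand-in for the tiktoken counter, and both function names are hypothetical.

```python
from typing import List

def rough_tokens(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())

def group_by_budget(pieces: List[str], target: int) -> List[str]:
    """Greedily merge consecutive pieces while the running total stays within target."""
    groups, current, size = [], [], 0
    for piece in pieces:
        n = rough_tokens(piece)
        if current and size + n > target:
            # Adding this piece would overshoot, so close the current group
            groups.append("\n\n".join(current))
            current, size = [], 0
        current.append(piece)
        size += n
    if current:
        groups.append("\n\n".join(current))
    return groups

def split_lines_by_budget(text: str, limit: int) -> List[str]:
    """Line-based fallback: cut an oversized piece at line boundaries."""
    parts, current, size = [], [], 0
    for line in text.split("\n"):
        n = rough_tokens(line)
        if current and size + n > limit:
            parts.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        parts.append("\n".join(current))
    return parts
```

Both passes only cut between existing units (whole pieces or whole lines), so a single piece or line that exceeds the budget on its own is emitted as-is rather than split further.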

5. Enhanced Interactive Interface

Directory scanning: Finds all .py files recursively
File selection options: Process all, select specific files, or choose by directory
Progress reporting: Shows what semantic units are found in each file
Statistics: Provides token counts, chunk counts, and processing summaries
Markdown export: Saves chunks with YAML frontmatter for documentation
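
For the Markdown export, a minimal sketch of rendering one chunk with YAML frontmatter might look like the following. The function name `render_chunk` and the reduced field set are assumptions; the difference from a plain f-string is that string values are quoted with `json.dumps`, so a chunk name containing quotes still produces valid frontmatter.

```python
import json

FENCE = "`" * 3  # built at runtime so the example can sit inside a fenced block

def render_chunk(name: str, node_type: str, tokens: int, code: str, path: str) -> str:
    """Hypothetical renderer: YAML frontmatter followed by a fenced code block."""
    # json.dumps yields double-quoted, escaped strings, which YAML also accepts
    front = "\n".join([
        "---",
        f"file_path: {json.dumps(path)}",
        f"chunk_type: {json.dumps(node_type)}",
        f"chunk_name: {json.dumps(name)}",
        f"token_count: {tokens}",
        "---",
    ])
    return f"{front}\n\n# {name}\n\n{FENCE}python\n{code}\n{FENCE}\n"

print(render_chunk("Types", "class_definition", 112, "class Types(Enum):\n    ...", "fusion_types.py"))
```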

Questions for You:

Additional Node Types: Are there other Python constructs you'd like to treat as semantic boundaries? (e.g., global variables, module docstrings, exception handlers)
Decorator Handling: Should decorated functions be treated differently, or is the current approach of keeping them as single units appropriate?
Class Method Chunking: Should we chunk individual methods within large classes, or keep entire classes together?
Module Structure: Do you want special handling for common Python patterns like __all__ definitions, module-level constants, or configuration sections?
Testing Integration: Should the chunker have special handling for test files (recognizing test classes, test functions, fixtures, etc.)?

The chunker is ready to use and follows the same robust architecture as the TypeScript version, with Python-specific adaptations for the language's syntax and common patterns.

In [2]:
#!/usr/bin/env python3

import os
import secrets
import string
from pathlib import Path
from typing import List, Dict, Any
import tiktoken

# Tree-sitter setup for Python
try:
    from tree_sitter_language_pack import get_language, get_parser
    python_language = get_language('python')
    python_parser = get_parser('python')
    print("✅ Using Python parser")
except ImportError:
    # The import above comes from the tree-sitter-language-pack distribution
    print("Please install: pip install tree-sitter-language-pack")
    raise SystemExit(1)

# Configuration
MAX_CHUNK_TOKENS = 1000
MAX_RECURSION_DEPTH = 3

# Supported file extensions and their parsers
SUPPORTED_EXTENSIONS = ['.py']
PARSER_MAP = {
    '.py': python_parser
}

# Initialize token encoder
encoder = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens in text"""
    return len(encoder.encode(text))

def get_syntax_highlighting_language(file_extension: str) -> str:
    """Get appropriate syntax highlighting language for markdown output"""
    language_map = {
        '.py': 'python'
    }
    return language_map.get(file_extension, 'text')

class Chunk:
    def __init__(self, start_byte: int, end_byte: int, content: str, node_type: str, name: str, depth: int = 0):
        self.start_byte = start_byte
        self.end_byte = end_byte
        self.content = content
        self.node_type = node_type
        self.name = name
        self.depth = depth
        self.token_count = count_tokens(content)
        self.sub_chunks = []

def extract_node_name(node, source_code: str) -> str:
    """Extract meaningful name from AST node"""
    node_text = source_code[node.start_byte:node.end_byte]
    lines = node_text.split('\n')
    
    import re
    
    # Try different patterns based on node type
    if node.type == 'function_definition':
        # Look for function name: def function_name(
        match = re.search(r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)', node_text)
        if match:
            return match.group(1)
    
    elif node.type == 'class_definition':
        # Look for class name: class ClassName
        match = re.search(r'class\s+([a-zA-Z_][a-zA-Z0-9_]*)', node_text)
        if match:
            return match.group(1)
    
    elif node.type == 'assignment':
        # Look for variable assignment: variable_name = 
        match = re.search(r'^([a-zA-Z_][a-zA-Z0-9_]*)\s*=', node_text.strip())
        if match:
            return match.group(1)
    
    elif node.type == 'import_statement' or node.type == 'import_from_statement':
        # Branch on the node type rather than on a substring test, because
        # "from" can also occur inside a module name (e.g. "import fromage")
        if node.type == 'import_from_statement':
            # from module import something
            match = re.search(r'from\s+([a-zA-Z0-9_.]+)', node_text)
            if match:
                return f"from_{match.group(1)}"
        else:
            # import module
            match = re.search(r'import\s+([a-zA-Z0-9_.]+)', node_text)
            if match:
                return f"import_{match.group(1)}"
    
    elif node.type == 'decorated_definition':
        # Handle decorated functions/classes (with @decorator)
        # Look for the actual function/class name after decorators
        lines_list = node_text.split('\n')
        for line in lines_list:
            if line.strip().startswith('def '):
                match = re.search(r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)', line)
                if match:
                    return f"decorated_{match.group(1)}"
            elif line.strip().startswith('class '):
                match = re.search(r'class\s+([a-zA-Z_][a-zA-Z0-9_]*)', line)
                if match:
                    return f"decorated_{match.group(1)}"
    
    elif node.type == 'async_function_definition':
        # Look for async function name: async def function_name(
        match = re.search(r'async\s+def\s+([a-zA-Z_][a-zA-Z0-9_]*)', node_text)
        if match:
            return f"async_{match.group(1)}"
    
    elif node.type == 'if_statement':
        # Special handling for if __name__ == "__main__"
        if '__name__' in node_text and '__main__' in node_text:
            return "main_block"
        else:
            return "if_block"
    
    elif node.type == 'try_statement':
        return "try_block"
    
    elif node.type == 'with_statement':
        return "with_block"
    
    elif node.type == 'for_statement':
        return "for_loop"
    
    elif node.type == 'while_statement':
        return "while_loop"
    
    # Fallback - try to extract first identifier
    first_line = lines[0][:50].strip()
    simple_match = re.search(r'\b([a-zA-Z_][a-zA-Z0-9_]*)', first_line)
    if simple_match:
        return simple_match.group(1)
    
    return f"{node.type}_{node.start_byte}"

def find_semantic_chunks(tree, source_code: str, file_extension: str) -> List[Dict[str, Any]]:
    """Find semantic chunks - complete, meaningful code blocks"""
    semantic_nodes = []
    
    def traverse(node, parent_types=None):
        if parent_types is None:
            parent_types = []
        
        current_parent_types = parent_types + [node.type]
        node_text = source_code[node.start_byte:node.end_byte]
        
        # Python semantic boundaries
        if file_extension == '.py':
            # Top-level semantic boundaries
            if node.type in ['import_statement', 'import_from_statement'] and len(parent_types) <= 1:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': f"import_{name}", 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': 'import_statement', 'content': node_text
                })
                return
            
            # Function definitions (including async)
            elif node.type in ['function_definition', 'async_function_definition'] and len(parent_types) <= 2:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
            # Class definitions
            elif node.type == 'class_definition' and len(parent_types) <= 1:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
            # Decorated definitions (functions/classes with decorators)
            elif node.type == 'decorated_definition' and len(parent_types) <= 1:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
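            # Global assignments and constants.
            # Tree-sitter wraps a module-level assignment in an expression_statement,
            # so its parent chain is [module, expression_statement], not just [module].
            elif (node.type == 'assignment' and
                  parent_types == ['module', 'expression_statement'] and
                  len(node_text.strip()) > 50):  # Only substantial assignments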
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
            # Control structures at module level
            elif (node.type in ['if_statement', 'try_statement', 'with_statement', 'for_statement', 'while_statement'] 
                  and len(parent_types) <= 1 and len(node_text.strip()) > 100):
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
        
        # Continue traversing children
        for child in node.children:
            traverse(child, current_parent_types)
    
    traverse(tree.root_node)
    
    # Filter out overlapping nodes (keep the largest/most specific)
    filtered_nodes = []
    for node in semantic_nodes:
        is_contained = False
        for other in semantic_nodes:
            if (other != node and 
                other['start_byte'] <= node['start_byte'] and 
                other['end_byte'] >= node['end_byte'] and
                other['end_byte'] - other['start_byte'] > node['end_byte'] - node['start_byte']):
                is_contained = True
                break
        
        if not is_contained:
            filtered_nodes.append(node)
    
    return filtered_nodes

def create_semantic_chunks(semantic_nodes: List[Dict[str, Any]]) -> List[Chunk]:
    """Create chunks from semantic nodes"""
    chunks = []
    
    for node_info in semantic_nodes:
        chunk = Chunk(
            start_byte=node_info['start_byte'],
            end_byte=node_info['end_byte'],
            content=node_info['content'],
            node_type=node_info['type'],
            name=node_info['name'],
            depth=0
        )
        chunks.append(chunk)
    
    return chunks

def group_small_chunks(chunks: List[Chunk], target_tokens: int = 600, file_extension: str = '') -> List[Chunk]:
    """Group small chunks together to reach reasonable size for Python files"""
    if not chunks:
        return chunks
    
    # Separate imports from other chunks
    import_chunks = [c for c in chunks if c.node_type in ['import_statement', 'import_from_statement']]
    other_chunks = [c for c in chunks if c.node_type not in ['import_statement', 'import_from_statement']]
    
    # Check if everything together is under limit
    total_tokens = sum(c.token_count for c in chunks)
    if total_tokens <= MAX_CHUNK_TOKENS:
        # Combine everything into one chunk
        combined_content = '\n\n'.join(c.content for c in chunks)
        combined_chunk = Chunk(
            start_byte=chunks[0].start_byte,
            end_byte=chunks[-1].end_byte,
            content=combined_content,
            node_type='complete_module',
            name=f"complete_module_{len(chunks)}_parts",
            depth=0
        )
        return [combined_chunk]
    
    # If too large, handle imports separately
    if import_chunks:
        total_import_tokens = sum(c.token_count for c in import_chunks)
        if total_import_tokens <= MAX_CHUNK_TOKENS:
            # Combine all imports into one chunk
            combined_imports = '\n'.join(c.content for c in import_chunks)
            imports_chunk = Chunk(
                start_byte=import_chunks[0].start_byte,
                end_byte=import_chunks[-1].end_byte,
                content=combined_imports,
                node_type='imports_group',
                name=f"imports_{len(import_chunks)}_statements",
                depth=0
            )
            import_chunks = [imports_chunk]
    
    # Group other chunks
    grouped_others = group_chunks_by_size(other_chunks, target_tokens, file_extension)
    
    # Combine imports + other chunks
    return import_chunks + grouped_others

def group_chunks_by_size(chunks: List[Chunk], target_tokens: int = 600, file_extension: str = '') -> List[Chunk]:
    """Group chunks by size logic"""
    if not chunks:
        return chunks
    
    # Skip grouping if we already have reasonably sized chunks
    if len(chunks) == 1 or any(c.token_count > target_tokens for c in chunks):
        total_tokens = sum(c.token_count for c in chunks)
        if total_tokens <= MAX_CHUNK_TOKENS:
            # All chunks together are still under limit - combine them
            if len(chunks) > 1:
                combined_content = '\n\n'.join(c.content for c in chunks)
                group_name = f"python_module_{len(chunks)}_definitions"
                
                combined_chunk = Chunk(
                    start_byte=chunks[0].start_byte,
                    end_byte=chunks[-1].end_byte,
                    content=combined_content,
                    node_type='grouped_content',
                    name=group_name,
                    depth=0
                )
                return [combined_chunk]
        return chunks
    
    # Group small chunks
    grouped_chunks = []
    current_group = []
    current_tokens = 0
    
    for chunk in chunks:
        if current_tokens + chunk.token_count > target_tokens and current_group:
            # Finalize current group
            group_content = '\n\n'.join(c.content for c in current_group)
            group_name = f"python_group_{len(current_group)}_definitions"
            
            grouped_chunk = Chunk(
                start_byte=current_group[0].start_byte,
                end_byte=current_group[-1].end_byte,
                content=group_content,
                node_type='grouped_content',
                name=group_name,
                depth=0
            )
            grouped_chunks.append(grouped_chunk)
            
            # Start new group
            current_group = [chunk]
            current_tokens = chunk.token_count
        else:
            current_group.append(chunk)
            current_tokens += chunk.token_count
    
    # Add final group
    if current_group:
        if len(current_group) == 1:
            grouped_chunks.append(current_group[0])
        else:
            group_content = '\n\n'.join(c.content for c in current_group)
            group_name = f"python_group_{len(current_group)}_definitions"
            
            grouped_chunk = Chunk(
                start_byte=current_group[0].start_byte,
                end_byte=current_group[-1].end_byte,
                content=group_content,
                node_type='grouped_content',
                name=group_name,
                depth=0
            )
            grouped_chunks.append(grouped_chunk)
    
    return grouped_chunks

def sub_chunk_by_statements(chunk: Chunk, tree, source_code: str, depth: int = 0) -> List[Chunk]:
    """Sub-chunk by breaking down into logical statements/blocks"""
    if depth >= MAX_RECURSION_DEPTH or chunk.token_count <= MAX_CHUNK_TOKENS:
        return [chunk]
    
    print(f"    Breaking down {chunk.name} ({chunk.token_count} tokens) into smaller pieces...")
    
    # Simple line-based splitting for now. Byte offsets are approximate: they
    # assume chunk.content mirrors the source bytes starting at chunk.start_byte.
    lines = chunk.content.split('\n')
    sub_chunks = []
    current_lines = []
    current_size = 0
    cursor = chunk.start_byte  # Approximate byte offset of the next sub-chunk
    
    for line in lines:
        line_tokens = count_tokens(line)
        
        if current_size + line_tokens > MAX_CHUNK_TOKENS and current_lines:
            # Create sub-chunk
            sub_content = '\n'.join(current_lines)
            sub_bytes = len(sub_content.encode('utf-8'))
            if sub_content.strip():
                sub_chunk = Chunk(
                    start_byte=min(cursor, chunk.end_byte),
                    end_byte=min(cursor + sub_bytes, chunk.end_byte),
                    content=sub_content,
                    node_type=f"{chunk.node_type}_part",
                    name=f"{chunk.name}_part_{len(sub_chunks)+1}",
                    depth=depth + 1
                )
                sub_chunks.append(sub_chunk)
            # Advance past this part plus the newline removed by split()
            cursor += sub_bytes + 1
            
            current_lines = [line]
            current_size = line_tokens
        else:
            current_lines.append(line)
            current_size += line_tokens
    
    # Add remaining lines
    if current_lines:
        sub_content = '\n'.join(current_lines)
        if sub_content.strip():
            sub_chunk = Chunk(
                start_byte=min(cursor, chunk.end_byte),
                end_byte=chunk.end_byte,
                content=sub_content,
                node_type=f"{chunk.node_type}_part",
                name=f"{chunk.name}_part_{len(sub_chunks)+1}",
                depth=depth + 1
            )
            sub_chunks.append(sub_chunk)
    
    chunk.sub_chunks = sub_chunks
    return sub_chunks if len(sub_chunks) > 1 else [chunk]

def process_python_file(file_path: Path) -> List[Chunk]:
    """Process a single .py file and return semantic chunks"""
    print(f"\n=== Processing: {file_path.name} ===")
    
    # Read file
    with open(file_path, 'r', encoding='utf-8') as f:
        source_code = f.read()
    
    print(f"File size: {len(source_code)} characters")
    
    # Parse with Tree-sitter Python parser
    tree = python_parser.parse(source_code.encode('utf-8'))
    
    if tree.root_node.has_error:
        print("⚠️ Parse errors detected")
    
    # Find semantic chunks
    semantic_nodes = find_semantic_chunks(tree, source_code, file_path.suffix)
    print(f"Found {len(semantic_nodes)} semantic units")
    
    # Show what we found
    for node in semantic_nodes:
        preview = node['content'][:100].replace('\n', ' ').strip()
        print(f"  - {node['type']}: {node['name']} ({count_tokens(node['content'])} tokens)")
        print(f"    Preview: {preview}...")
    
    # Create chunks
    base_chunks = create_semantic_chunks(semantic_nodes)
    
    # Group small chunks
    base_chunks = group_small_chunks(base_chunks, target_tokens=600, file_extension=file_path.suffix)
    
    print(f"Created {len(base_chunks)} semantic chunks")
    
    # Apply sub-chunking for oversized chunks
    final_chunks = []
    oversized_count = 0
    
    for chunk in base_chunks:
        if chunk.token_count > MAX_CHUNK_TOKENS:
            print(f"  Sub-chunking {chunk.name} ({chunk.token_count} tokens)")
            sub_chunks = sub_chunk_by_statements(chunk, tree, source_code)
            final_chunks.extend(sub_chunks)
            oversized_count += 1
        else:
            final_chunks.append(chunk)
    
    if oversized_count > 0:
        print(f"Sub-chunked {oversized_count} oversized chunks")
    print(f"Final result: {len(final_chunks)} total chunks")
    
    return final_chunks

def generate_unique_id(length: int = 6) -> str:
    """Generate a random unique ID"""
    alphabet = string.ascii_lowercase + string.digits
    return ''.join(secrets.choice(alphabet) for _ in range(length))

def create_chunk_filename(original_filename: str, chunk_number: int, unique_id: str) -> str:
    """Create chunk filename: script.py_chunk_001_a1s2d3.md"""
    return f"{original_filename}_chunk_{chunk_number:03d}_{unique_id}.md"

def get_markdown_language(file_extension: str) -> str:
    """Get markdown language for code blocks"""
    lang_map = {
        '.py': 'python'
    }
    return lang_map.get(file_extension, 'text')

def create_chunk_markdown(chunk: Chunk, source_file_path: str, file_extension: str) -> str:
    """Create markdown content with YAML frontmatter"""
    language = get_markdown_language(file_extension)
    unique_id = generate_unique_id()
    
    frontmatter = f"""---
file_path: "{source_file_path}"
chunk_id: "{unique_id}"
chunk_type: "{chunk.node_type}"
chunk_name: "{chunk.name}"
start_byte: {chunk.start_byte}
end_byte: {chunk.end_byte}
token_count: {chunk.token_count}
depth: {chunk.depth}
language: "{language}"
---

# {chunk.name}

**Type:** {chunk.node_type}  
**Tokens:** {chunk.token_count}  
**Depth:** {chunk.depth}

```{language}
{chunk.content}
```
"""
    return frontmatter

def print_chunk_summary(chunks: List[Chunk], file_name: str):
    """Print detailed summary of chunks"""
    print(f"\n--- Semantic Chunk Summary for {file_name} ---")
    
    for i, chunk in enumerate(chunks, 1):
        indent = "  " * chunk.depth
        content_lines = len(chunk.content.split('\n'))
        
        print(f"{indent}{i}. {chunk.name}")
        print(f"{indent}   Type: {chunk.node_type}")
        print(f"{indent}   Size: {chunk.token_count} tokens, {content_lines} lines")
        print(f"{indent}   Content preview:")
        
        # Show first few lines of actual content
        content_lines_list = chunk.content.split('\n')
        for j, line in enumerate(content_lines_list[:3]):
            print(f"{indent}     {line.strip()}")
        if len(content_lines_list) > 3:
            print(f"{indent}     ... ({len(content_lines_list) - 3} more lines)")
        print()

def main():
    """Main function to test semantic chunking"""
    print("🚀 Python Semantic Chunking Test")
    print(f"Max chunk tokens: {MAX_CHUNK_TOKENS}")
    print(f"Max recursion depth: {MAX_RECURSION_DEPTH}")
    print(f"Supported extensions: {', '.join(SUPPORTED_EXTENSIONS)}")
    
    # Get directory from user or use current directory
    directory = input("\nEnter directory path (or press Enter for current directory): ").strip()
    if not directory:
        directory = "."
    
    target_dir = Path(directory)
    if not target_dir.exists():
        print(f"❌ Directory not found: {directory}")
        return
    
    # Find all Python files
    all_files = list(target_dir.rglob("*.py"))
    
    if not all_files:
        print(f"❌ No Python files found in {directory}")
        return
    
    print(f"📁 Found {len(all_files)} Python files:")
    
    # Group by directory for summary
    by_dir = {}
    for f in all_files:
        rel_path = f.relative_to(target_dir)
        dir_path = str(rel_path.parent) if rel_path.parent != Path('.') else '.'
        by_dir[dir_path] = by_dir.get(dir_path, []) + [f.name]
    
    for dir_path, files in sorted(by_dir.items()):
        print(f"  📂 {dir_path}: {len(files)} files")
        for file_name in sorted(files)[:3]:  # Show first 3 files
            print(f"    📄 {file_name}")
        if len(files) > 3:
            print(f"    ... and {len(files) - 3} more")
    
    # Interactive file selection
    print(f"\n🔍 Select files to process:")
    print("1. Process all files")
    print("2. Select specific files")
    print("3. Process files in a specific directory")
    
    choice = input("\nEnter your choice (1-3): ").strip()
    
    selected_files = []
    
    if choice == "1":
        selected_files = all_files
        print(f"✅ Processing all {len(selected_files)} files")
    
    elif choice == "2":
        print("\nAvailable files:")
        for i, f in enumerate(all_files, 1):
            rel_path = f.relative_to(target_dir)
            print(f"  {i:2d}. {rel_path}")
        
        selection = input("\nEnter file numbers (comma-separated, e.g., 1,3,5): ").strip()
        try:
            indices = [int(x.strip()) - 1 for x in selection.split(',')]
            selected_files = [all_files[i] for i in indices if 0 <= i < len(all_files)]
            print(f"✅ Selected {len(selected_files)} files")
        except (ValueError, IndexError):
            print("❌ Invalid selection, processing first file only")
            selected_files = [all_files[0]] if all_files else []
    
    elif choice == "3":
        print("\nAvailable directories:")
        unique_dirs = sorted(set(by_dir.keys()))
        for i, dir_path in enumerate(unique_dirs, 1):
            print(f"  {i}. {dir_path} ({len(by_dir[dir_path])} files)")
        
        try:
            dir_choice = int(input("\nEnter directory number: ").strip()) - 1
            if 0 <= dir_choice < len(unique_dirs):
                selected_dir = unique_dirs[dir_choice]
                selected_files = [f for f in all_files 
                                if str(f.relative_to(target_dir).parent) == selected_dir or 
                                   (selected_dir == '.' and f.parent == target_dir)]
                print(f"✅ Selected {len(selected_files)} files from {selected_dir}")
            else:
                print("❌ Invalid directory, processing first file only")
                selected_files = [all_files[0]] if all_files else []
        except ValueError:
            print("❌ Invalid input, processing first file only")
            selected_files = [all_files[0]] if all_files else []
    
    else:
        print("❌ Invalid choice, processing first file only")
        selected_files = [all_files[0]] if all_files else []
    
    if not selected_files:
        print("❌ No files selected for processing")
        return
    
    # Process selected files
    print(f"\n🔄 Processing {len(selected_files)} file(s)...")
    all_chunks = {}
    
    for file_path in selected_files:
        try:
            chunks = process_python_file(file_path)
            all_chunks[file_path] = chunks
            print_chunk_summary(chunks, file_path.name)
            
        except Exception as e:
            print(f"❌ Error processing {file_path}: {e}")
            continue
    
    # Summary
    total_chunks = sum(len(chunks) for chunks in all_chunks.values())
    total_tokens = sum(chunk.token_count for chunks in all_chunks.values() for chunk in chunks)
    
    print(f"\n📊 Final Summary:")
    print(f"   Files processed: {len(all_chunks)}")
    print(f"   Total chunks: {total_chunks}")
    print(f"   Total tokens: {total_tokens:,}")
    print(f"   Average tokens per chunk: {total_tokens/total_chunks:.1f}" if total_chunks > 0 else "   No chunks created")
    
    # Ask about saving chunks
    save_chunks = input(f"\n💾 Save chunks to markdown files? (y/n): ").strip().lower()
    if save_chunks == 'y':
        output_dir = Path("python_chunks")
        output_dir.mkdir(exist_ok=True)
        
        saved_count = 0
        for file_path, chunks in all_chunks.items():
            for i, chunk in enumerate(chunks, 1):
                chunk_filename = create_chunk_filename(file_path.name, i, generate_unique_id())
                chunk_content = create_chunk_markdown(chunk, str(file_path), file_path.suffix)
                
                output_path = output_dir / chunk_filename
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(chunk_content)
                saved_count += 1
        
        print(f"✅ Saved {saved_count} chunk files to {output_dir}")

if __name__ == "__main__":
    main()

✅ Using Python parser
🚀 Python Semantic Chunking Test
Max chunk tokens: 1000
Max recursion depth: 3
Supported extensions: .py
📁 Found 20 Python files:
  📂 py_src/fusion: 18 files
    📄 __init__.py
    📄 __main__.py
    📄 attributes.py
    ... and 15 more
  📂 py_src/fusion/_legacy: 2 files
    📄 __init__.py
    📄 authentication.py

🔍 Select files to process:
1. Process all files
2. Select specific files
3. Process files in a specific directory
✅ Processing all 20 files

🔄 Processing 20 file(s)...

=== Processing: fusion_types.py ===
File size: 385 characters
Found 2 semantic units
  - import_statement: import_from_enum (4 tokens)
    Preview: from enum import Enum...
  - class_definition: Types (112 tokens)
    Preview: class Types(Enum):     """Fusion types.      Args:         Enum (class: `enum.Enum`): Enum inheritan...
Created 1 semantic chunks
Final result: 1 total chunks

--- Semantic Chunk Summary for fusion_types.py ---
1. complete_module_2_parts
   Type: complete_module
   Size:

Key Changes Made:
1. Parallel Output Directory Structure

Creates output directory parallel to source directory: source_dir → source_dir_chunks
Uses target_dir.parent / f"{target_dir.name}_chunks" to ensure parallel placement
Preserves the entire directory structure within the output directory

2. Automatic Processing

Removes all interactive file selection options
Automatically processes all .py files found in the directory tree
No user prompts for file selection or save confirmation

3. Directory Structure Preservation

Calculates relative paths from source directory: rel_path = file_path.relative_to(target_dir)
Creates corresponding subdirectories in output: chunk_dir = output_dir / rel_path.parent
Uses mkdir(parents=True, exist_ok=True) to ensure all parent directories are created
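
The path arithmetic described in sections 1 and 3 can be sketched with `pathlib` alone. The function name `mirrored_output_dir` and the example paths are assumptions for illustration; only the expressions quoted above come from the description.

```python
from pathlib import Path

def mirrored_output_dir(target_dir: Path, file_path: Path) -> Path:
    """Map a source file to its directory under the sibling '<name>_chunks' tree."""
    # Sibling of the source directory: source_dir -> source_dir_chunks
    output_root = target_dir.parent / f"{target_dir.name}_chunks"
    # Position of the file relative to the source root
    rel_path = file_path.relative_to(target_dir)
    return output_root / rel_path.parent

source_root = Path("/work/py_src")
chunk_dir = mirrored_output_dir(source_root, source_root / "fusion" / "_legacy" / "authentication.py")
print(chunk_dir)
# Before writing, the caller would create it:
# chunk_dir.mkdir(parents=True, exist_ok=True)
```

One caveat under this scheme: if target_dir is given as ".", its name is empty and the output directory becomes just "_chunks", so calling target_dir.resolve() first is probably the safer choice.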

4. Consistent Naming Convention

Follows the same naming pattern as TypeScript chunker: filename.py_chunk_001_a1s2d3.md
Uses relative paths in YAML frontmatter for portability
Maintains the same markdown structure with YAML frontmatter

5. Streamlined Output

Automatic markdown output by default
Clear progress reporting during processing
Final summary with directory structure confirmation

In [3]:
#!/usr/bin/env python3

import os
import secrets
import string
from pathlib import Path
from typing import List, Dict, Any
import tiktoken

# Tree-sitter setup for Python
try:
    from tree_sitter_language_pack import get_language, get_parser
    python_language = get_language('python')
    python_parser = get_parser('python')
    print("✅ Using Python parser")
except ImportError:
    # The import above comes from the tree-sitter-language-pack distribution
    print("Please install: pip install tree-sitter-language-pack")
    raise SystemExit(1)

# Configuration
MAX_CHUNK_TOKENS = 1000
MAX_RECURSION_DEPTH = 3

# Supported file extensions and their parsers
SUPPORTED_EXTENSIONS = ['.py']
PARSER_MAP = {
    '.py': python_parser
}

# Initialize token encoder
encoder = tiktoken.get_encoding("cl100k_base")

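def count_tokens(text: str) -> int:
    """Count tokens in text"""
    # disallowed_special=() treats strings such as "<|endoftext|>" as plain text;
    # by default tiktoken raises ValueError when source code contains them
    return len(encoder.encode(text, disallowed_special=()))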

def get_syntax_highlighting_language(file_extension: str) -> str:
    """Get appropriate syntax highlighting language for markdown output"""
    language_map = {
        '.py': 'python'
    }
    return language_map.get(file_extension, 'text')

class Chunk:
    def __init__(self, start_byte: int, end_byte: int, content: str, node_type: str, name: str, depth: int = 0):
        self.start_byte = start_byte
        self.end_byte = end_byte
        self.content = content
        self.node_type = node_type
        self.name = name
        self.depth = depth
        self.token_count = count_tokens(content)
        self.sub_chunks = []

def extract_node_name(node, source_code: str) -> str:
    """Extract meaningful name from AST node"""
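    # Tree-sitter reports byte offsets, so slice the UTF-8 bytes (not the str)
    # to stay correct when the file contains non-ASCII characters
    node_text = source_code.encode('utf-8')[node.start_byte:node.end_byte].decode('utf-8', errors='replace')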
    lines = node_text.split('\n')
    
    import re
    
    # Try different patterns based on node type
    if node.type == 'function_definition':
        # Look for function name: def function_name(
        match = re.search(r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)', node_text)
        if match:
            return match.group(1)
    
    elif node.type == 'class_definition':
        # Look for class name: class ClassName
        match = re.search(r'class\s+([a-zA-Z_][a-zA-Z0-9_]*)', node_text)
        if match:
            return match.group(1)
    
    elif node.type == 'assignment':
        # Look for variable assignment: variable_name = 
        match = re.search(r'^([a-zA-Z_][a-zA-Z0-9_]*)\s*=', node_text.strip())
        if match:
            return match.group(1)
    
    elif node.type == 'import_statement' or node.type == 'import_from_statement':
        # Handle import statements
        if 'import' in node_text:
            # Extract what's being imported
            if 'from' in node_text:
                # from module import something
                match = re.search(r'from\s+([a-zA-Z0-9_.]+)', node_text)
                if match:
                    return f"from_{match.group(1)}"
            else:
                # import module (the caller adds the "import_" prefix,
                # so return the bare module name to avoid "import_import_x")
                match = re.search(r'import\s+([a-zA-Z0-9_.]+)', node_text)
                if match:
                    return match.group(1)
    
    elif node.type == 'decorated_definition':
        # Handle decorated functions/classes (with @decorator)
        # Look for the actual function/class name after decorators
        lines_list = node_text.split('\n')
        for line in lines_list:
            if line.strip().startswith('def '):
                match = re.search(r'def\s+([a-zA-Z_][a-zA-Z0-9_]*)', line)
                if match:
                    return f"decorated_{match.group(1)}"
            elif line.strip().startswith('class '):
                match = re.search(r'class\s+([a-zA-Z_][a-zA-Z0-9_]*)', line)
                if match:
                    return f"decorated_{match.group(1)}"
    
    elif node.type == 'async_function_definition':
        # Look for async function name: async def function_name(
        match = re.search(r'async\s+def\s+([a-zA-Z_][a-zA-Z0-9_]*)', node_text)
        if match:
            return f"async_{match.group(1)}"
    
    elif node.type == 'if_statement':
        # Special handling for if __name__ == "__main__"
        if '__name__' in node_text and '__main__' in node_text:
            return "main_block"
        else:
            return "if_block"
    
    elif node.type == 'try_statement':
        return "try_block"
    
    elif node.type == 'with_statement':
        return "with_block"
    
    elif node.type == 'for_statement':
        return "for_loop"
    
    elif node.type == 'while_statement':
        return "while_loop"
    
    # Fallback - try to extract first identifier
    first_line = lines[0][:50].strip()
    simple_match = re.search(r'\b([a-zA-Z_][a-zA-Z0-9_]*)', first_line)
    if simple_match:
        return simple_match.group(1)
    
    return f"{node.type}_{node.start_byte}"

def find_semantic_chunks(tree, source_code: str, file_extension: str) -> List[Dict[str, Any]]:
    """Find semantic chunks - complete, meaningful code blocks"""
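    semantic_nodes = []
    # Tree-sitter offsets are byte offsets, so slice the encoded source
    source_bytes = source_code.encode('utf-8')
    
    def traverse(node, parent_types=None):
        if parent_types is None:
            parent_types = []
        
        current_parent_types = parent_types + [node.type]
        node_text = source_bytes[node.start_byte:node.end_byte].decode('utf-8', errors='replace')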
        
        # Python semantic boundaries
        if file_extension == '.py':
            # Top-level semantic boundaries
            if node.type in ['import_statement', 'import_from_statement'] and len(parent_types) <= 1:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': f"import_{name}", 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': 'import_statement', 'content': node_text
                })
                return
            
            # Function definitions (including async)
            elif node.type in ['function_definition', 'async_function_definition'] and len(parent_types) <= 2:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
            # Class definitions
            elif node.type == 'class_definition' and len(parent_types) <= 1:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
            # Decorated definitions (functions/classes with decorators)
            elif node.type == 'decorated_definition' and len(parent_types) <= 1:
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
            # Global assignments and constants
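            # Module-level assignments are wrapped in an expression_statement node,
            # so they sit one level deeper than other top-level definitions
            elif (node.type == 'assignment' and len(parent_types) <= 2 and 
                  len(node_text.strip()) > 50):  # Only substantial assignments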
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
            
            # Control structures at module level
            elif (node.type in ['if_statement', 'try_statement', 'with_statement', 'for_statement', 'while_statement'] 
                  and len(parent_types) <= 1 and len(node_text.strip()) > 100):
                name = extract_node_name(node, source_code)
                semantic_nodes.append({
                    'node': node, 'name': name, 'start_byte': node.start_byte,
                    'end_byte': node.end_byte, 'type': node.type, 'content': node_text
                })
                return
        
        # Continue traversing children
        for child in node.children:
            traverse(child, current_parent_types)
    
    traverse(tree.root_node)
    
    # Filter out overlapping nodes (keep the largest/most specific)
    filtered_nodes = []
    for node in semantic_nodes:
        is_contained = False
        for other in semantic_nodes:
            if (other != node and 
                other['start_byte'] <= node['start_byte'] and 
                other['end_byte'] >= node['end_byte'] and
                other['end_byte'] - other['start_byte'] > node['end_byte'] - node['start_byte']):
                is_contained = True
                break
        
        if not is_contained:
            filtered_nodes.append(node)
    
    return filtered_nodes

def create_semantic_chunks(semantic_nodes: List[Dict[str, Any]]) -> List[Chunk]:
    """Create chunks from semantic nodes"""
    chunks = []
    
    for node_info in semantic_nodes:
        chunk = Chunk(
            start_byte=node_info['start_byte'],
            end_byte=node_info['end_byte'],
            content=node_info['content'],
            node_type=node_info['type'],
            name=node_info['name'],
            depth=0
        )
        chunks.append(chunk)
    
    return chunks

def group_small_chunks(chunks: List[Chunk], target_tokens: int = 600, file_extension: str = '') -> List[Chunk]:
    """Group small chunks together to reach reasonable size for Python files"""
    if not chunks:
        return chunks
    
    # Separate imports from other chunks
    import_chunks = [c for c in chunks if c.node_type in ['import_statement', 'import_from_statement']]
    other_chunks = [c for c in chunks if c.node_type not in ['import_statement', 'import_from_statement']]
    
    # Check if everything together is under limit
    total_tokens = sum(c.token_count for c in chunks)
    if total_tokens <= MAX_CHUNK_TOKENS:
        # Combine everything into one chunk
        combined_content = '\n\n'.join(c.content for c in chunks)
        combined_chunk = Chunk(
            start_byte=chunks[0].start_byte,
            end_byte=chunks[-1].end_byte,
            content=combined_content,
            node_type='complete_module',
            name=f"complete_module_{len(chunks)}_parts",
            depth=0
        )
        return [combined_chunk]
    
    # If too large, handle imports separately
    if import_chunks:
        total_import_tokens = sum(c.token_count for c in import_chunks)
        if total_import_tokens <= MAX_CHUNK_TOKENS:
            # Combine all imports into one chunk
            combined_imports = '\n'.join(c.content for c in import_chunks)
            imports_chunk = Chunk(
                start_byte=import_chunks[0].start_byte,
                end_byte=import_chunks[-1].end_byte,
                content=combined_imports,
                node_type='imports_group',
                name=f"imports_{len(import_chunks)}_statements",
                depth=0
            )
            import_chunks = [imports_chunk]
    
    # Group other chunks
    grouped_others = group_chunks_by_size(other_chunks, target_tokens, file_extension)
    
    # Combine imports + other chunks
    return import_chunks + grouped_others

def group_chunks_by_size(chunks: List[Chunk], target_tokens: int = 600, file_extension: str = '') -> List[Chunk]:
    """Group chunks by size logic"""
    if not chunks:
        return chunks
    
    # Skip grouping if we already have reasonably sized chunks
    if len(chunks) == 1 or any(c.token_count > target_tokens for c in chunks):
        total_tokens = sum(c.token_count for c in chunks)
        if total_tokens <= MAX_CHUNK_TOKENS:
            # All chunks together are still under limit - combine them
            if len(chunks) > 1:
                combined_content = '\n\n'.join(c.content for c in chunks)
                group_name = f"python_module_{len(chunks)}_definitions"
                
                combined_chunk = Chunk(
                    start_byte=chunks[0].start_byte,
                    end_byte=chunks[-1].end_byte,
                    content=combined_content,
                    node_type='grouped_content',
                    name=group_name,
                    depth=0
                )
                return [combined_chunk]
        return chunks
    
    # Group small chunks
    grouped_chunks = []
    current_group = []
    current_tokens = 0
    
    for chunk in chunks:
        if current_tokens + chunk.token_count > target_tokens and current_group:
            # Finalize current group
            group_content = '\n\n'.join(c.content for c in current_group)
            group_name = f"python_group_{len(current_group)}_definitions"
            
            grouped_chunk = Chunk(
                start_byte=current_group[0].start_byte,
                end_byte=current_group[-1].end_byte,
                content=group_content,
                node_type='grouped_content',
                name=group_name,
                depth=0
            )
            grouped_chunks.append(grouped_chunk)
            
            # Start new group
            current_group = [chunk]
            current_tokens = chunk.token_count
        else:
            current_group.append(chunk)
            current_tokens += chunk.token_count
    
    # Add final group
    if current_group:
        if len(current_group) == 1:
            grouped_chunks.append(current_group[0])
        else:
            group_content = '\n\n'.join(c.content for c in current_group)
            group_name = f"python_group_{len(current_group)}_definitions"
            
            grouped_chunk = Chunk(
                start_byte=current_group[0].start_byte,
                end_byte=current_group[-1].end_byte,
                content=group_content,
                node_type='grouped_content',
                name=group_name,
                depth=0
            )
            grouped_chunks.append(grouped_chunk)
    
    return grouped_chunks

def sub_chunk_by_statements(chunk: Chunk, tree, source_code: str, depth: int = 0) -> List[Chunk]:
    """Sub-chunk by breaking down into logical statements/blocks"""
    if depth >= MAX_RECURSION_DEPTH or chunk.token_count <= MAX_CHUNK_TOKENS:
        return [chunk]
    
    print(f"    Breaking down {chunk.name} ({chunk.token_count} tokens) into smaller pieces...")
    
    # Simple line-based splitting for now
    lines = chunk.content.split('\n')
    sub_chunks = []
    current_lines = []
    current_size = 0
    # Running byte offset of the next sub-chunk. Exact only when chunk.content
    # is an unmodified slice of the source; approximate for grouped chunks.
    offset = chunk.start_byte
    
    for line in lines:
        line_tokens = count_tokens(line)
        
        if current_size + line_tokens > MAX_CHUNK_TOKENS and current_lines:
            # Create sub-chunk
            sub_content = '\n'.join(current_lines)
            sub_bytes = len(sub_content.encode('utf-8'))
            if sub_content.strip():
                sub_chunk = Chunk(
                    start_byte=offset,
                    end_byte=offset + sub_bytes,
                    content=sub_content,
                    node_type=f"{chunk.node_type}_part",
                    name=f"{chunk.name}_part_{len(sub_chunks)+1}",
                    depth=depth + 1
                )
                sub_chunks.append(sub_chunk)
            offset += sub_bytes + 1  # +1 for the newline removed by split()
            
            current_lines = [line]
            current_size = line_tokens
        else:
            current_lines.append(line)
            current_size += line_tokens
    
    # Add remaining lines
    if current_lines:
        sub_content = '\n'.join(current_lines)
        if sub_content.strip():
            sub_chunk = Chunk(
                start_byte=offset,
                end_byte=chunk.end_byte,
                content=sub_content,
                node_type=f"{chunk.node_type}_part",
                name=f"{chunk.name}_part_{len(sub_chunks)+1}",
                depth=depth + 1
            )
            sub_chunks.append(sub_chunk)
    
    chunk.sub_chunks = sub_chunks
    return sub_chunks if len(sub_chunks) > 1 else [chunk]

def process_python_file(file_path: Path) -> List[Chunk]:
    """Process a single .py file and return semantic chunks"""
    print(f"\n=== Processing: {file_path.name} ===")
    
    # Read file
    with open(file_path, 'r', encoding='utf-8') as f:
        source_code = f.read()
    
    print(f"File size: {len(source_code)} characters")
    
    # Parse with Tree-sitter Python parser
    tree = python_parser.parse(source_code.encode('utf-8'))
    
    if tree.root_node.has_error:
        print("⚠️ Parse errors detected")
    
    # Find semantic chunks
    semantic_nodes = find_semantic_chunks(tree, source_code, file_path.suffix)
    print(f"Found {len(semantic_nodes)} semantic units")
    
    # Show what we found
    for node in semantic_nodes:
        preview = node['content'][:100].replace('\n', ' ').strip()
        print(f"  - {node['type']}: {node['name']} ({count_tokens(node['content'])} tokens)")
        print(f"    Preview: {preview}...")
    
    # Create chunks
    base_chunks = create_semantic_chunks(semantic_nodes)
    
    # Group small chunks
    base_chunks = group_small_chunks(base_chunks, target_tokens=600, file_extension=file_path.suffix)
    
    print(f"Created {len(base_chunks)} semantic chunks")
    
    # Apply sub-chunking for oversized chunks
    final_chunks = []
    oversized_count = 0
    
    for chunk in base_chunks:
        if chunk.token_count > MAX_CHUNK_TOKENS:
            print(f"  Sub-chunking {chunk.name} ({chunk.token_count} tokens)")
            sub_chunks = sub_chunk_by_statements(chunk, tree, source_code)
            final_chunks.extend(sub_chunks)
            oversized_count += 1
        else:
            final_chunks.append(chunk)
    
    if oversized_count > 0:
        print(f"Sub-chunked {oversized_count} oversized chunks")
    print(f"Final result: {len(final_chunks)} total chunks")
    
    return final_chunks

def generate_unique_id(length: int = 6) -> str:
    """Generate a random unique ID"""
    alphabet = string.ascii_lowercase + string.digits
    return ''.join(secrets.choice(alphabet) for _ in range(length))

def create_chunk_filename(original_filename: str, chunk_number: int, unique_id: str) -> str:
    """Create chunk filename: script.py_chunk_001_a1s2d3.md"""
    return f"{original_filename}_chunk_{chunk_number:03d}_{unique_id}.md"

def get_markdown_language(file_extension: str) -> str:
    """Get markdown language for code blocks (same mapping as get_syntax_highlighting_language)"""
    return get_syntax_highlighting_language(file_extension)

def create_chunk_markdown(chunk: Chunk, source_file_path: str, file_extension: str) -> str:
    """Create markdown content with YAML frontmatter"""
    language = get_markdown_language(file_extension)
    unique_id = generate_unique_id()
    
    frontmatter = f"""---
file_path: "{source_file_path}"
chunk_id: "{unique_id}"
chunk_type: "{chunk.node_type}"
chunk_name: "{chunk.name}"
start_byte: {chunk.start_byte}
end_byte: {chunk.end_byte}
token_count: {chunk.token_count}
depth: {chunk.depth}
language: "{language}"
---

# {chunk.name}

**Type:** {chunk.node_type}  
**Tokens:** {chunk.token_count}  
**Depth:** {chunk.depth}

```{language}
{chunk.content}
```
"""
    return frontmatter

def print_chunk_summary(chunks: List[Chunk], file_name: str):
    """Print detailed summary of chunks"""
    print(f"\n--- Semantic Chunk Summary for {file_name} ---")
    
    for i, chunk in enumerate(chunks, 1):
        indent = "  " * chunk.depth
        content_lines = len(chunk.content.split('\n'))
        
        print(f"{indent}{i}. {chunk.name}")
        print(f"{indent}   Type: {chunk.node_type}")
        print(f"{indent}   Size: {chunk.token_count} tokens, {content_lines} lines")
        print(f"{indent}   Content preview:")
        
        # Show first few lines of actual content
        content_lines_list = chunk.content.split('\n')
        for line in content_lines_list[:3]:
            print(f"{indent}     {line.strip()}")
        if len(content_lines_list) > 3:
            print(f"{indent}     ... ({len(content_lines_list) - 3} more lines)")
        print()

def main():
    """Main function for Python semantic chunking"""
    print("🚀 Python Semantic Chunking")
    print(f"Max chunk tokens: {MAX_CHUNK_TOKENS}")
    print(f"Max recursion depth: {MAX_RECURSION_DEPTH}")
    print(f"Supported extensions: {', '.join(SUPPORTED_EXTENSIONS)}")
    
    # Get directory from user or use current directory
    directory = input("\nEnter source directory path (or press Enter for current directory): ").strip()
    if not directory:
        directory = "."
    
    target_dir = Path(directory).resolve()
    # is_dir() also rejects a path that exists but points at a regular file
    if not target_dir.is_dir():
        print(f"❌ Directory not found: {directory}")
        return
    
    # Create output directory parallel to source directory
    output_dir = target_dir.parent / f"{target_dir.name}_chunks"
    output_dir.mkdir(exist_ok=True)
    print(f"📁 Output directory: {output_dir}")
    
    # Find all Python files
    all_files = list(target_dir.rglob("*.py"))
    
    if not all_files:
        print(f"❌ No Python files found in {directory}")
        return
    
    print(f"📁 Found {len(all_files)} Python files:")
    
    # Group by directory for summary
    by_dir = {}
    for f in all_files:
        rel_path = f.relative_to(target_dir)
        dir_path = str(rel_path.parent) if rel_path.parent != Path('.') else '.'
        by_dir[dir_path] = by_dir.get(dir_path, []) + [f.name]
    
    for dir_path, files in sorted(by_dir.items()):
        print(f"  📂 {dir_path}: {len(files)} files")
        for file_name in sorted(files)[:3]:  # Show first 3 files
            print(f"    📄 {file_name}")
        if len(files) > 3:
            print(f"    ... and {len(files) - 3} more")
    
    # Process all files automatically
    print(f"\n🔄 Processing all {len(all_files)} file(s)...")
    all_chunks = {}
    
    for file_path in all_files:
        try:
            chunks = process_python_file(file_path)
            all_chunks[file_path] = chunks
            print_chunk_summary(chunks, file_path.name)
            
        except Exception as e:
            print(f"❌ Error processing {file_path}: {e}")
            continue
    
    # Summary
    total_chunks = sum(len(chunks) for chunks in all_chunks.values())
    total_tokens = sum(chunk.token_count for chunks in all_chunks.values() for chunk in chunks)
    
    print(f"\n📊 Processing Summary:")
    print(f"   Files processed: {len(all_chunks)}")
    print(f"   Total chunks: {total_chunks}")
    print(f"   Total tokens: {total_tokens:,}")
    if total_chunks > 0:
        print(f"   Average tokens per chunk: {total_tokens/total_chunks:.1f}")
    else:
        print("   No chunks created")
    
    # Save chunks automatically with parallel directory structure
    print(f"\n💾 Saving chunks to markdown files...")
    saved_count = 0
    
    for file_path, chunks in all_chunks.items():
        # Calculate relative path from source directory
        rel_path = file_path.relative_to(target_dir)
        
        # Create corresponding directory structure in output directory
        chunk_dir = output_dir / rel_path.parent
        chunk_dir.mkdir(parents=True, exist_ok=True)
        
        for i, chunk in enumerate(chunks, 1):
            chunk_filename = create_chunk_filename(file_path.name, i, generate_unique_id())
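            # as_posix() keeps forward slashes on Windows, so the quoted YAML
            # value contains no backslashes and stays portable
            chunk_content = create_chunk_markdown(chunk, rel_path.as_posix(), file_path.suffix)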
            
            output_path = chunk_dir / chunk_filename
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(chunk_content)
            saved_count += 1
    
    print(f"✅ Saved {saved_count} chunk files to {output_dir}")
    print(f"📁 Directory structure preserved in output")

if __name__ == "__main__":
    main()

✅ Using Python parser
🚀 Python Semantic Chunking
Max chunk tokens: 1000
Max recursion depth: 3
Supported extensions: .py
📁 Output directory: /Users/tiyadiashok/python-projects/code_chunker/rag_chunks/pre_processed/code_sources/python/fusion_latest_chunks
📁 Found 20 Python files:
  📂 py_src/fusion: 18 files
    📄 __init__.py
    📄 __main__.py
    📄 attributes.py
    ... and 15 more
  📂 py_src/fusion/_legacy: 2 files
    📄 __init__.py
    📄 authentication.py

🔄 Processing all 20 file(s)...

=== Processing: fusion_types.py ===
File size: 385 characters
Found 2 semantic units
  - import_statement: import_from_enum (4 tokens)
    Preview: from enum import Enum...
  - class_definition: Types (112 tokens)
    Preview: class Types(Enum):     """Fusion types.      Args:         Enum (class: `enum.Enum`): Enum inheritan...
Created 1 semantic chunks
Final result: 1 total chunks

--- Semantic Chunk Summary for fusion_types.py ---
1. complete_module_2_parts
   Type: complete_module
   Size: 117 tok