# Phase 1: Initialization and Setup - Python Implementation

Configuration Settings (Markdown)
This cell contains all the configuration parameters for the code chunking system. Modify these settings according to your project requirements before running the chunking process.
Key Configuration Areas:

Directory Paths: Set your target React/TypeScript project directory and desired output location
Processing Parameters: Control chunk sizes, token limits, and recursion depth
File Processing: Define supported file types and exclusion patterns
Fallback Settings: Configure the fallback chunking algorithm parameters
Logging: Set up logging verbosity and format

In [16]:
"""
pip install tree-sitter-language-pack
pip install tiktoken
pip install pathspec  # for .gitignore parsing
pip install pyyaml
pip install ipykernel
"""
import os
import json
import yaml
import logging
import hashlib
import secrets
from pathlib import Path
from typing import List, Dict, Optional, Tuple, Union, Any
from dataclasses import dataclass, field
from enum import Enum

# Tree-sitter imports
from tree_sitter_language_pack import get_binding, get_language, get_parser
from tree_sitter import Tree, Node

# Token counting
import tiktoken

# .gitignore parsing
import pathspec

In [17]:
# =============================================================================
# CONFIGURATION SETTINGS - MODIFY THESE FOR YOUR PROJECT
# =============================================================================

# Directory paths - CHANGE THESE TO YOUR PROJECT PATHS
TARGET_DIRECTORY = r"/Users/tiyadiashok/python-projects/code_chunker/rag_sources/code_sources/typescript"  # Your React/TypeScript project
OUTPUT_DIRECTORY = r"/Users/tiyadiashok/python-projects/code_chunker/rag_chunks/pre_processed/code_sources/typescript"       # Where to save chunked files

# Processing parameters
MAX_CHUNK_TOKENS = 1000          # Maximum tokens per chunk (counted with TIKTOKEN_ENCODING)
MIN_CHUNK_CHARS = 50             # Minimum characters per chunk
COALESCE_THRESHOLD = 50          # Minimum size for chunk coalescing
IMPORT_LINES_THRESHOLD = 10      # Threshold for creating separate import chunks
MAX_RECURSION_DEPTH = 3          # Maximum depth for recursive sub-chunking

# File processing
SUPPORTED_EXTENSIONS = ['.tsx', '.ts', '.js', '.jsx', '.html', '.css', '.scss']
ADDITIONAL_EXCLUSIONS = [
    'node_modules',              # Node.js dependencies
    'dist',                      # Distribution/build folder
    'build',                     # Build output
    '.next',                     # Next.js build cache
    '.git',                      # Git repository
    '*.min.js',                  # Minified JavaScript
    '*.min.css',                 # Minified CSS
    'coverage',                  # Test coverage reports
    '.vscode',                   # VS Code settings
    '.idea',                     # JetBrains IDE settings
]

# Fallback chunker settings (Sweep's algorithm)
FALLBACK_MAX_CHARS = 512 * 3     # 1536 characters for fallback chunking
FALLBACK_COALESCE = 50           # Coalescing threshold for fallback

# Token encoding
TIKTOKEN_ENCODING = "cl100k_base"  # GPT-4 / GPT-3.5-turbo encoding; GPT-4o models use "o200k_base"

# Logging configuration
LOG_LEVEL = logging.INFO
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"

print("✅ Configuration loaded successfully")
print(f"📁 Target Directory: {TARGET_DIRECTORY}")
print(f"📁 Output Directory: {OUTPUT_DIRECTORY}")
print(f"🎯 Max Chunk Tokens: {MAX_CHUNK_TOKENS}")
print(f"📝 Supported Extensions: {', '.join(SUPPORTED_EXTENSIONS)}")

✅ Configuration loaded successfully
📁 Target Directory: /Users/tiyadiashok/python-projects/code_chunker/rag_sources/code_sources/typescript
📁 Output Directory: /Users/tiyadiashok/python-projects/code_chunker/rag_chunks/pre_processed/code_sources/typescript
🎯 Max Chunk Tokens: 1000
📝 Supported Extensions: .tsx, .ts, .js, .jsx, .html, .css, .scss
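
To make the roles of FALLBACK_MAX_CHARS and FALLBACK_COALESCE more concrete, here is a minimal sketch of a line-based chunker that works under a character budget. It is a simplified stand-in, not Sweep's algorithm and not necessarily what later phases of this notebook do. The function name `chunk_by_chars` and its merging rule are assumptions made for illustration.

```python
from typing import List

FALLBACK_MAX_CHARS = 512 * 3   # same value as in the configuration cell
FALLBACK_COALESCE = 50         # same value as in the configuration cell


def chunk_by_chars(text: str,
                   max_chars: int = FALLBACK_MAX_CHARS,
                   coalesce: int = FALLBACK_COALESCE) -> List[str]:
    """Hypothetical helper: split text on line boundaries under a character budget."""
    chunks: List[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the budget.
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)

    # Coalesce: merge any chunk with fewer than `coalesce` non-whitespace-trimmed
    # characters into its predecessor (the merged chunk may exceed max_chars).
    merged: List[str] = []
    for chunk in chunks:
        if merged and len(chunk.strip()) < coalesce:
            merged[-1] += chunk
        else:
            merged.append(chunk)
    return merged


sample = "const x = 1;\n" * 200   # 13 characters per line
parts = chunk_by_chars(sample)
print(len(parts), [len(p) for p in parts])
```

Splitting only at line boundaries keeps every chunk readable as source text, at the cost of letting a single very long line form an oversized chunk on its own.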


Core Imports and Dependencies (Markdown)
This cell imports all required libraries and modules for the chunking system. The imports are organized by functionality:
Core Libraries:

Standard Python libraries for file operations, logging, and data structures
pathlib for cross-platform path handling
dataclasses and enum for structured data types

External Dependencies:

tree-sitter-language-pack for AST parsing
tiktoken for token counting with OpenAI encodings
pathspec for .gitignore pattern matching
pyyaml for YAML frontmatter generation

Error Handling:

Graceful handling of missing dependencies with helpful error messages

In [18]:
# =============================================================================
# CORE IMPORTS AND DEPENDENCIES
# =============================================================================

# Standard library imports (PyYAML is third-party and is checked further below)
import os
import json
import logging
import hashlib
import secrets
import time
from pathlib import Path
from typing import List, Dict, Optional, Tuple, Union, Any
from dataclasses import dataclass, field
from enum import Enum

# Check and import external dependencies
try:
    # Tree-sitter imports
    from tree_sitter_language_pack import get_binding, get_language, get_parser
    from tree_sitter import Tree, Node
    print("✅ Tree-sitter dependencies loaded")
except ImportError as e:
    print("❌ Error importing tree-sitter dependencies:")
    print("   Please install: pip install tree-sitter-language-pack")
    raise e

try:
    # Token counting
    import tiktoken
    print("✅ Tiktoken loaded")
except ImportError as e:
    print("❌ Error importing tiktoken:")
    print("   Please install: pip install tiktoken")
    raise e

try:
    # .gitignore parsing
    import pathspec
    print("✅ Pathspec loaded")
except ImportError as e:
    print("❌ Error importing pathspec:")
    print("   Please install: pip install pathspec")
    raise e

# Verify PyYAML is available
try:
    import yaml
    print("✅ PyYAML loaded")
except ImportError as e:
    print("❌ Error importing PyYAML:")
    print("   Please install: pip install pyyaml")
    raise e

print("\n🎉 All dependencies loaded successfully!")

✅ Tree-sitter dependencies loaded
✅ Tiktoken loaded
✅ Pathspec loaded
✅ PyYAML loaded

🎉 All dependencies loaded successfully!
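
The cell above stops at the first missing package. As an alternative, the sketch below reports every missing dependency in one pass using only the standard library. The helper name `find_missing_packages` and the `REQUIRED_PACKAGES` mapping are assumptions for illustration; the package names come from the install notes in the first cell.

```python
import importlib.util
from typing import Dict, List

# Import name -> pip package name, taken from the install notes in the first cell.
REQUIRED_PACKAGES: Dict[str, str] = {
    "tree_sitter_language_pack": "tree-sitter-language-pack",
    "tiktoken": "tiktoken",
    "pathspec": "pathspec",
    "yaml": "pyyaml",
}


def find_missing_packages(required: Dict[str, str]) -> List[str]:
    """Hypothetical helper: return pip names of packages that cannot be imported."""
    missing = []
    for import_name, pip_name in required.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(import_name) is None:
            missing.append(pip_name)
    return missing


missing = find_missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All dependencies are available")
```

Because `find_spec` only locates a module without importing it, this check is cheap and has no import side effects.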


Data Structures and Enums (Markdown)
This cell defines the core data structures used throughout the chunking system. These classes provide type safety and structured data handling:
Enums:

ChunkType: Categorizes chunks as code, imports, or fallback
ChunkMethod: Tracks whether AST parsing or fallback chunking was used

Data Classes:

FileMetadata: Stores information about each source file
ImportInfo: Manages import classification and counting
ChunkSpan: Represents the position and content of a chunk
CodeChunk: Complete chunk information with all metadata
ProcessingStats: Tracks processing results and performance metrics

In [19]:
# =============================================================================
# DATA STRUCTURES AND ENUMS
# =============================================================================

class ChunkType(Enum):
    """Types of chunks that can be created"""
    CODE = "code"           # Regular code chunks
    IMPORTS = "imports"     # Dedicated import chunks
    FALLBACK = "fallback"   # Chunks created using fallback strategy

class ChunkMethod(Enum):
    """Methods used for chunking"""
    AST = "ast"             # AST-based chunking (preferred)
    FALLBACK = "fallback"   # Fallback chunking (when AST fails)

@dataclass
class FileMetadata:
    """Metadata for a source file"""
    file_path: Path
    relative_path: str
    extension: str
    size_bytes: int
    total_lines: int
    import_lines: int
    has_package_json: bool = False
    
@dataclass
class ImportInfo:
    """Information about imports in a file"""
    external_imports: List[str] = field(default_factory=list)
    local_imports: List[str] = field(default_factory=list)
    all_imports: List[str] = field(default_factory=list)
    import_line_count: int = 0

@dataclass
class ChunkSpan:
    """Represents a chunk's position in source code"""
    start_byte: int
    end_byte: int
    start_line: int
    end_line: int
    content: str

@dataclass
class CodeChunk:
    """Complete chunk information with all metadata"""
    source_file: str
    chunk_index: int
    total_chunks: int
    chunk_type: ChunkType
    chunk_method: ChunkMethod
    content: str
    imports_used: List[str]
    structure: str
    summary: str
    file_id: str
    span: ChunkSpan

@dataclass
class ProcessingStats:
    """Statistics for processing results"""
    total_files: int = 0
    successfully_parsed: int = 0
    fallback_chunked: int = 0
    total_chunks: int = 0
    files_with_import_chunks: int = 0
    processing_time: float = 0.0
    error_files: List[str] = field(default_factory=list)

print("✅ Data structures and enums defined")

✅ Data structures and enums defined
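
Since chunk metadata is eventually written out (the imports mention YAML frontmatter), it may help to see how such dataclasses could be turned into plain dictionaries. The sketch below uses a reduced stand-in, `MiniChunk`, rather than the full `CodeChunk`. Both `MiniChunk` and `chunk_to_metadata` are hypothetical names, not part of the notebook.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Any, Dict


class ChunkType(Enum):
    CODE = "code"
    IMPORTS = "imports"


@dataclass
class ChunkSpan:
    start_line: int
    end_line: int


@dataclass
class MiniChunk:
    """Reduced stand-in for CodeChunk, with only a few fields."""
    source_file: str
    chunk_index: int
    chunk_type: ChunkType
    span: ChunkSpan


def chunk_to_metadata(chunk: MiniChunk) -> Dict[str, Any]:
    """Hypothetical helper: convert a chunk into JSON/YAML-friendly metadata."""
    data = asdict(chunk)                          # nested dataclasses become dicts
    data["chunk_type"] = chunk.chunk_type.value   # Enum member -> plain string
    return data


chunk = MiniChunk("src/App.tsx", 1, ChunkType.CODE, ChunkSpan(1, 42))
print(json.dumps(chunk_to_metadata(chunk), indent=2))
```

`asdict` leaves Enum members untouched, so the explicit `.value` conversion is what makes the result serializable by `json.dumps`.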


Logger Setup (Markdown)
This cell configures the logging system for the chunking process. The logger provides:
Features:

Configurable log levels (DEBUG, INFO, WARNING, ERROR)
Timestamped log messages with clear formatting
Console output for real-time monitoring
Proper error tracking and reporting

Log Levels:

INFO: General processing information, progress updates
WARNING: Non-critical issues, fallback usage
ERROR: Critical errors, file processing failures
DEBUG: Detailed debugging information (verbose)

In [20]:
# =============================================================================
# LOGGER SETUP
# =============================================================================

def setup_logging(log_level: int = logging.INFO) -> logging.Logger:
    """
    Set up logging configuration for the chunking system.
    
    Args:
        log_level: Logging level (default: INFO)
        
    Returns:
        Configured logger instance
    """
    # Create logger
    logger = logging.getLogger('code_chunker')
    logger.setLevel(log_level)
    
    # Remove existing handlers to avoid duplicate logs
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    
    # Create console handler
    console_handler = logging.StreamHandler()
    console_handler.setLevel(log_level)
    
    # Create formatter
    formatter = logging.Formatter(LOG_FORMAT)
    console_handler.setFormatter(formatter)
    
    # Add handler to logger
    logger.addHandler(console_handler)
    
    # Prevent propagation to root logger
    logger.propagate = False
    
    return logger

# Initialize the logger
logger = setup_logging(LOG_LEVEL)
logger.info("🔧 Logging system initialized")
logger.info(f"📊 Log level set to: {logging.getLevelName(LOG_LEVEL)}")

print("✅ Logger setup complete")

2025-08-02 19:28:53,910 - INFO - 🔧 Logging system initialized
2025-08-02 19:28:53,911 - INFO - 📊 Log level set to: INFO


✅ Logger setup complete
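
Notebook cells are often re-run, which is why `setup_logging` removes existing handlers first. The sketch below is a condensed copy of that function, using a separate demo logger name (an assumption, chosen so it does not interfere with `code_chunker`), and shows that a second call leaves exactly one handler attached.

```python
import logging

LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"


def setup_logging(log_level: int = logging.INFO) -> logging.Logger:
    """Condensed copy of the notebook's setup_logging, for demonstration."""
    logger = logging.getLogger("code_chunker_demo")  # demo name, an assumption
    logger.setLevel(log_level)
    # Drop handlers left over from an earlier run of the cell.
    for handler in logger.handlers[:]:
        logger.removeHandler(handler)
    handler = logging.StreamHandler()
    handler.setLevel(log_level)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


first = setup_logging(logging.INFO)
second = setup_logging(logging.DEBUG)   # simulates re-running the cell

print(first is second)        # logging.getLogger returns the same object per name
print(len(second.handlers))   # handler count after two setups
```

Without the removal loop, each re-run would add another `StreamHandler` and every message would be printed once per handler.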


Token Counter Initialization (Markdown)
This cell sets up token counting with OpenAI's tiktoken library.
Purpose:

Keeping chunks within the MAX_CHUNK_TOKENS limit (1000 tokens)
Counting tokens the same way an OpenAI tokenizer does
Determining when chunks need sub-chunking or coalescing

Token Encoding:

Uses the cl100k_base encoding (the GPT-4 and GPT-3.5-turbo tokenizer; GPT-4o models use o200k_base)
Counts are exact for models that use this encoding and approximate for others
Handles special characters and Unicode (byte-level BPE)

Functions:

initialize_token_counter(): Sets up the tiktoken encoder
count_tokens(): Counts tokens in any text string

In [21]:
# =============================================================================
# TOKEN COUNTER INITIALIZATION
# =============================================================================

def initialize_token_counter(encoding_name: str = "cl100k_base") -> tiktoken.Encoding:
    """
    Initialize tiktoken encoder for token counting.
    
    Args:
        encoding_name: Name of the tiktoken encoding to use
        
    Returns:
        Initialized tiktoken encoding instance
    """
    try:
        encoder = tiktoken.get_encoding(encoding_name)
        logger.info(f"🎯 Token encoder initialized with {encoding_name}")
        return encoder
    except Exception as e:
        logger.error(f"❌ Failed to initialize token encoder: {e}")
        raise

def count_tokens(text: str, encoder: tiktoken.Encoding) -> int:
    """
    Count tokens in text using tiktoken.
    
    Args:
        text: Text to count tokens for
        encoder: Tiktoken encoder instance
        
    Returns:
        Number of tokens in the text
    """
    try:
        if not text:
            return 0
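        # Source files may contain literal special-token text such as "<|endoftext|>";
        # disallowed_special=() makes encode() treat it as ordinary text instead of raising.
        return len(encoder.encode(text, disallowed_special=()))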
    except Exception as e:
        logger.warning(f"⚠️ Error counting tokens: {e}")
        return 0

# Initialize the global token encoder
token_encoder = initialize_token_counter(TIKTOKEN_ENCODING)

# Test token counting with a sample
test_text = "function helloWorld() { console.log('Hello, World!'); }"
test_tokens = count_tokens(test_text, token_encoder)
logger.info(f"✅ Token counting test: '{test_text}' = {test_tokens} tokens")

print("✅ Token counter initialized and tested")

2025-08-02 19:28:53,920 - INFO - 🎯 Token encoder initialized with cl100k_base
2025-08-02 19:28:53,938 - INFO - ✅ Token counting test: 'function helloWorld() { console.log('Hello, World!'); }' = 14 tokens


✅ Token counter initialized and tested
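
The token count is what later decides whether a chunk must be split further. A minimal sketch of that decision follows, written with an injectable counting function so it runs without tiktoken. `needs_sub_chunking` is a hypothetical helper, and `whitespace_token_count` is a crude stand-in that does not reproduce tiktoken's counts.

```python
from typing import Callable

MAX_CHUNK_TOKENS = 1000   # same value as in the configuration cell


def whitespace_token_count(text: str) -> int:
    """Crude stand-in counter: NOT tiktoken, it only counts whitespace-separated words."""
    return len(text.split())


def needs_sub_chunking(text: str,
                       count_fn: Callable[[str], int],
                       max_tokens: int = MAX_CHUNK_TOKENS) -> bool:
    """Hypothetical helper: True when a chunk exceeds the token budget."""
    return count_fn(text) > max_tokens


small = "function add(a, b) { return a + b; }"
large = "x " * 1500

print(needs_sub_chunking(small, whitespace_token_count))
print(needs_sub_chunking(large, whitespace_token_count))
```

In the notebook itself, `lambda t: count_tokens(t, token_encoder)` could be passed as `count_fn` to use the real encoder.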


Tree-sitter Parser Manager (Markdown)
This cell implements the ParserManager class, which handles all Tree-sitter AST parsing operations. The parser manager:
Responsibilities:

Initializes parsers for different file types (TypeScript, HTML, CSS)
Maps file extensions to appropriate parsers
Handles parsing failures gracefully
Provides a unified interface for AST parsing

Supported Languages:

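TypeScript/TSX: The tsx grammar handles .tsx, .ts, .js, .jsx files
HTML: Handles .html files
CSS: Handles .css and .scss files (SCSS-only syntax may fail to parse with the CSS grammar, in which case the file goes to fallback chunking)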

Error Handling:

Graceful fallback when parsers fail to initialize
Detailed error logging for debugging
Returns None for unparseable files (triggers fallback chunking)

In [22]:
# =============================================================================
# TREE-SITTER PARSER MANAGER
# =============================================================================

class ParserManager:
    """Manages Tree-sitter parsers for different file types"""
    
    def __init__(self):
        self.parsers = {}
        self.languages = {}
        self._initialize_parsers()
    
    def _initialize_parsers(self) -> None:
        """Initialize all required Tree-sitter parsers"""
        # Parser configurations: (language_name, file_extensions)
        parser_configs = [
            ('tsx', ['.tsx', '.ts', '.js', '.jsx']),
            ('html', ['.html']),
            ('css', ['.css', '.scss'])
        ]
        
        for language_name, extensions in parser_configs:
            try:
                # Get language and parser
                language = get_language(language_name)
                parser = get_parser(language_name)
                
                # Store language and parser for each extension
                for ext in extensions:
                    self.languages[ext] = language
                    self.parsers[ext] = parser
                
                logger.info(f"✅ Initialized {language_name} parser for {extensions}")
                
            except Exception as e:
                logger.error(f"❌ Failed to initialize {language_name} parser: {e}")
                # Continue with other parsers even if one fails
                continue
    
    def get_parser(self, file_extension: str) -> Optional[Any]:
        """
        Get appropriate parser for file extension.
        
        Args:
            file_extension: File extension (e.g., '.tsx', '.css')
            
        Returns:
            Tree-sitter parser instance or None if unsupported
        """
        parser = self.parsers.get(file_extension.lower())
        if parser is None:
            logger.warning(f"⚠️ No parser available for extension: {file_extension}")
        return parser
    
    def get_language(self, file_extension: str) -> Optional[Any]:
        """
        Get appropriate language for file extension.
        
        Args:
            file_extension: File extension (e.g., '.tsx', '.css')
            
        Returns:
            Tree-sitter language instance or None if unsupported
        """
        return self.languages.get(file_extension.lower())
    
    def parse_file(self, file_path: Path, content: bytes) -> Optional[Tree]:
        """
        Parse file content using appropriate parser.
        
        Args:
            file_path: Path to the file being parsed
            content: File content as bytes
            
        Returns:
            Parsed Tree object or None if parsing failed
        """
        file_extension = file_path.suffix.lower()
        parser = self.get_parser(file_extension)
        
        if parser is None:
            logger.warning(f"⚠️ No parser for {file_path}")
            return None
        
        try:
            tree = parser.parse(content)
            if tree.root_node.has_error:
                logger.warning(f"⚠️ Parse errors in {file_path}")
                return None
            
            logger.debug(f"✅ Successfully parsed {file_path}")
            return tree
            
        except Exception as e:
            logger.error(f"❌ Failed to parse {file_path}: {e}")
            return None
    
    def list_supported_extensions(self) -> List[str]:
        """
        Get list of supported file extensions.
        
        Returns:
            List of supported file extensions
        """
        return list(self.parsers.keys())

# Initialize the global parser manager
parser_manager = ParserManager()

# Display initialization results
supported_extensions = parser_manager.list_supported_extensions()
logger.info(f"🎯 Parser manager initialized")
logger.info(f"📁 Supported extensions: {', '.join(sorted(supported_extensions))}")

print("✅ Tree-sitter parser manager initialized")
print(f"📁 Supported file types: {', '.join(sorted(supported_extensions))}")

2025-08-02 19:28:53,953 - INFO - ✅ Initialized tsx parser for ['.tsx', '.ts', '.js', '.jsx']
2025-08-02 19:28:53,955 - INFO - ✅ Initialized html parser for ['.html']
2025-08-02 19:28:53,958 - INFO - ✅ Initialized css parser for ['.css', '.scss']
2025-08-02 19:28:53,959 - INFO - 🎯 Parser manager initialized
2025-08-02 19:28:53,959 - INFO - 📁 Supported extensions: .css, .html, .js, .jsx, .scss, .ts, .tsx


✅ Tree-sitter parser manager initialized
📁 Supported file types: .css, .html, .js, .jsx, .scss, .ts, .tsx


AST Node Type Mappings (Markdown)
This cell defines the AST node types that serve as chunk boundaries for different file types. These mappings are based on semantic code structures:
Purpose:

Define which AST nodes should create new chunks
Organize mappings by file type for targeted chunking
Provide sub-chunking strategies for oversized chunks

Chunk Boundary Types:

Functions: Function declarations, arrow functions, methods
Classes: Class declarations and interfaces
Components: JSX elements and React components
Declarations: Variable declarations, type aliases, enums
Styles: CSS rules, at-rules, keyframes

Sub-chunking Strategy:

Used when base chunks exceed token limits
Provides finer-grained boundaries within large constructs
Includes control flow statements and logic blocks

In [23]:
# =============================================================================
# AST NODE TYPE MAPPINGS
# =============================================================================

# Primary chunk boundaries for different file types
AST_CHUNK_BOUNDARIES = {
    '.tsx': [
        'function_declaration', 'arrow_function', 'function_expression',
        'method_definition', 'class_declaration', 'interface_declaration',
        'type_alias_declaration', 'enum_declaration', 'variable_declaration',
        'export_statement', 'jsx_element', 'jsx_fragment', 'namespace_declaration'
    ],
    '.ts': [
        'function_declaration', 'arrow_function', 'function_expression',
        'method_definition', 'class_declaration', 'interface_declaration',
        'type_alias_declaration', 'enum_declaration', 'variable_declaration',
        'export_statement', 'namespace_declaration'
    ],
    '.js': [
        'function_declaration', 'arrow_function', 'function_expression',
        'method_definition', 'class_declaration', 'variable_declaration',
        'export_statement'
    ],
    '.jsx': [
        'function_declaration', 'arrow_function', 'function_expression',
        'method_definition', 'class_declaration', 'variable_declaration',
        'export_statement', 'jsx_element', 'jsx_fragment'
    ],
    '.html': [
        'element', 'script_element', 'style_element'
    ],
    '.css': [
        'rule_set', 'at_rule', 'keyframes_statement'
    ],
    '.scss': [
        'rule_set', 'at_rule', 'keyframes_statement'
    ]
}

# Sub-chunking node types for when chunks are too large
AST_SUB_CHUNK_BOUNDARIES = {
    'common': [
        'for_statement', 'while_statement', 'if_statement', 
        'switch_statement', 'try_statement', 'block'
    ],
    'jsx': ['jsx_element', 'jsx_expression'],
    'css': ['declaration', 'property'],
    'functions': ['parameter', 'argument']
}

# Import-related node types for different languages
AST_IMPORT_NODES = {
    '.tsx': ['import_statement', 'import_declaration'],
    '.ts': ['import_statement', 'import_declaration'],
    '.js': ['import_statement', 'import_declaration'],
    '.jsx': ['import_statement', 'import_declaration'],
    '.css': ['import_statement'],
    '.scss': ['import_statement']
}

def get_chunk_boundaries_for_extension(extension: str) -> List[str]:
    """
    Get chunk boundary node types for a file extension.
    
    Args:
        extension: File extension (e.g., '.tsx')
        
    Returns:
        List of AST node types that should create chunk boundaries
    """
    return AST_CHUNK_BOUNDARIES.get(extension.lower(), [])

def get_sub_chunk_boundaries(extension: str) -> List[str]:
    """
    Get sub-chunk boundary node types for a file extension.
    
    Args:
        extension: File extension
        
    Returns:
        List of AST node types for sub-chunking
    """
    boundaries = AST_SUB_CHUNK_BOUNDARIES['common'].copy()
    
    # Add extension-specific sub-boundaries
    if extension.lower() in ['.tsx', '.jsx']:
        boundaries.extend(AST_SUB_CHUNK_BOUNDARIES['jsx'])
    elif extension.lower() in ['.css', '.scss']:
        boundaries.extend(AST_SUB_CHUNK_BOUNDARIES['css'])
    
    return boundaries

def get_import_nodes_for_extension(extension: str) -> List[str]:
    """
    Get import-related node types for a file extension.
    
    Args:
        extension: File extension
        
    Returns:
        List of AST node types related to imports
    """
    return AST_IMPORT_NODES.get(extension.lower(), [])

# Log the configuration
logger.info("🎯 AST node type mappings configured")
logger.info(f"📁 Configured for {len(AST_CHUNK_BOUNDARIES)} file types")

# Display summary
print("✅ AST node type mappings configured")
print(f"📁 File types configured: {', '.join(AST_CHUNK_BOUNDARIES.keys())}")
print(f"🎯 Example boundaries for .tsx: {', '.join(AST_CHUNK_BOUNDARIES['.tsx'][:5])}...")

2025-08-02 19:28:53,976 - INFO - 🎯 AST node type mappings configured
2025-08-02 19:28:53,977 - INFO - 📁 Configured for 7 file types


✅ AST node type mappings configured
📁 File types configured: .tsx, .ts, .js, .jsx, .html, .css, .scss
🎯 Example boundaries for .tsx: function_declaration, arrow_function, function_expression, method_definition, class_declaration...
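
One way the boundary lists above might be applied is to walk the syntax tree and collect the outermost nodes whose type is a boundary. The sketch below shows that traversal on a hand-built tree. `FakeNode` is a minimal stand-in for `tree_sitter.Node` (only `type` and `children`), and `collect_boundary_nodes` is a hypothetical helper, not the notebook's chunking routine.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FakeNode:
    """Minimal stand-in for tree_sitter.Node: only `type` and `children`."""
    type: str
    children: List["FakeNode"] = field(default_factory=list)


def collect_boundary_nodes(node: FakeNode, boundaries: Set[str]) -> List[FakeNode]:
    """Hypothetical helper: return the outermost nodes whose type is a boundary."""
    if node.type in boundaries:
        # Do not descend further: nested boundaries stay inside this chunk.
        return [node]
    found: List[FakeNode] = []
    for child in node.children:
        found.extend(collect_boundary_nodes(child, boundaries))
    return found


# Hand-built tree resembling: an import, a function, and a class with one method.
tree = FakeNode("program", [
    FakeNode("import_statement"),
    FakeNode("function_declaration", [FakeNode("statement_block")]),
    FakeNode("class_declaration", [
        FakeNode("class_body", [FakeNode("method_definition")]),
    ]),
])

boundaries = {"function_declaration", "class_declaration", "method_definition"}
print([n.type for n in collect_boundary_nodes(tree, boundaries)])
```

Stopping at the outermost match keeps a class and its methods in one chunk; the sub-chunk boundaries would only come into play if that chunk turned out to be too large.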


Initialization Summary and Validation (Markdown)
This final cell provides a comprehensive summary of the initialization process and validates that all components are properly set up.
Validation Checks:

Configuration parameters are valid
All required dependencies are loaded
Parsers are initialized for all supported file types
Token counter is working correctly
Data structures are properly defined

Status Report:

Lists all initialized components
Shows configuration summary
Confirms system readiness for file processing
Displays any warnings or issues

In [24]:
# =============================================================================
# INITIALIZATION SUMMARY AND VALIDATION
# =============================================================================

def validate_initialization() -> bool:
    """
    Validate that all components are properly initialized.
    
    Returns:
        True if all components are ready, False otherwise
    """
    validation_results = []
    
    # Check configuration
    try:
        config_valid = (
            Path(TARGET_DIRECTORY).exists() and
            MAX_CHUNK_TOKENS > 0 and
            MIN_CHUNK_CHARS > 0 and
            len(SUPPORTED_EXTENSIONS) > 0
        )
        validation_results.append(("Configuration", config_valid))
    except Exception:
        validation_results.append(("Configuration", False))
    
    # Check token encoder
    try:
        test_tokens = count_tokens("test", token_encoder)
        token_valid = test_tokens > 0
        validation_results.append(("Token Counter", token_valid))
    except Exception:
        validation_results.append(("Token Counter", False))
    
    # Check parser manager
    try:
        parser_valid = len(parser_manager.list_supported_extensions()) > 0
        validation_results.append(("Parser Manager", parser_valid))
    except Exception:
        validation_results.append(("Parser Manager", False))
    
    # Check data structures
    try:
        test_chunk = CodeChunk(
            source_file="test.tsx",
            chunk_index=1,
            total_chunks=1,
            chunk_type=ChunkType.CODE,
            chunk_method=ChunkMethod.AST,
            content="test",
            imports_used=[],
            structure="",
            summary="test",
            file_id="test123",
            span=ChunkSpan(0, 4, 1, 1, "test")
        )
        data_valid = test_chunk.source_file == "test.tsx"
        validation_results.append(("Data Structures", data_valid))
    except Exception:
        validation_results.append(("Data Structures", False))
    
    # Check AST mappings
    try:
        tsx_boundaries = get_chunk_boundaries_for_extension('.tsx')
        ast_valid = len(tsx_boundaries) > 0
        validation_results.append(("AST Mappings", ast_valid))
    except Exception:
        validation_results.append(("AST Mappings", False))
    
    # Display results
    print("\n" + "="*60)
    print("🔍 INITIALIZATION VALIDATION RESULTS")
    print("="*60)
    
    all_valid = True
    for component, status in validation_results:
        status_icon = "✅" if status else "❌"
        print(f"{status_icon} {component}: {'OK' if status else 'FAILED'}")
        if not status:
            all_valid = False
    
    return all_valid

def display_initialization_summary():
    """Display a comprehensive summary of the initialization."""
    print("\n" + "="*60)
    print("🚀 CODE CHUNKING SYSTEM - INITIALIZATION COMPLETE")
    print("="*60)
    
    print(f"\n📁 DIRECTORIES:")
    print(f"   Target: {TARGET_DIRECTORY}")
    print(f"   Output: {OUTPUT_DIRECTORY}")
    
    print(f"\n⚙️ CONFIGURATION:")
    print(f"   Max Chunk Tokens: {MAX_CHUNK_TOKENS}")
    print(f"   Min Chunk Chars: {MIN_CHUNK_CHARS}")
    print(f"   Import Threshold: {IMPORT_LINES_THRESHOLD} lines")
    print(f"   Max Recursion Depth: {MAX_RECURSION_DEPTH}")
    
    print(f"\n📝 SUPPORTED FILE TYPES:")
    for ext in SUPPORTED_EXTENSIONS:
        parser_status = "✅" if parser_manager.get_parser(ext) else "❌"
        print(f"   {ext}: {parser_status}")
    
    print(f"\n🎯 COMPONENTS INITIALIZED:")
    print(f"   ✅ Logger ({logging.getLevelName(LOG_LEVEL)})")
    print(f"   ✅ Token Counter ({TIKTOKEN_ENCODING})")
    print(f"   ✅ Parser Manager ({len(parser_manager.list_supported_extensions())} parsers)")
    print(f"   ✅ Data Structures")
    print(f"   ✅ AST Mappings")
    
    print(f"\n🔧 EXCLUSIONS:")
    for exclusion in ADDITIONAL_EXCLUSIONS[:5]:  # Show first 5
        print(f"   • {exclusion}")
    if len(ADDITIONAL_EXCLUSIONS) > 5:
        print(f"   ... and {len(ADDITIONAL_EXCLUSIONS) - 5} more")

# Run validation and display summary
validation_passed = validate_initialization()

if validation_passed:
    display_initialization_summary()
    print(f"\n🎉 SYSTEM READY FOR PROCESSING!")
    print("   Next: Run Phase 2 - File Discovery")
else:
    print(f"\n❌ INITIALIZATION FAILED!")
    print("   Please check the error messages above and fix any issues.")

logger.info("🔧 Phase 1 (Initialization) complete")

2025-08-02 19:28:53,991 - INFO - 🔧 Phase 1 (Initialization) complete



🔍 INITIALIZATION VALIDATION RESULTS
✅ Configuration: OK
✅ Token Counter: OK
✅ Parser Manager: OK
✅ Data Structures: OK
✅ AST Mappings: OK

🚀 CODE CHUNKING SYSTEM - INITIALIZATION COMPLETE

📁 DIRECTORIES:
   Target: /Users/tiyadiashok/python-projects/code_chunker/rag_sources/code_sources/typescript
   Output: /Users/tiyadiashok/python-projects/code_chunker/rag_chunks/pre_processed/code_sources/typescript

⚙️ CONFIGURATION:
   Max Chunk Tokens: 1000
   Min Chunk Chars: 50
   Import Threshold: 10 lines
   Max Recursion Depth: 3

📝 SUPPORTED FILE TYPES:
   .tsx: ✅
   .ts: ✅
   .js: ✅
   .jsx: ✅
   .html: ✅
   .css: ✅
   .scss: ✅

🎯 COMPONENTS INITIALIZED:
   ✅ Logger (INFO)
   ✅ Token Counter (cl100k_base)
   ✅ Parser Manager (7 parsers)
   ✅ Data Structures
   ✅ AST Mappings

🔧 EXCLUSIONS:
   • node_modules
   • dist
   • build
   • .next
   • .git
   ... and 5 more

🎉 SYSTEM READY FOR PROCESSING!
   Next: Run Phase 2 - File Discovery


This completes Phase 1: Initialization and Setup. The implementation provides:

Complete configuration system with validation
Robust dependency management with error handling
Comprehensive logging setup for debugging
Token counting system based on tiktoken (cl100k_base encoding)
Tree-sitter parser management for all supported file types
AST node type mappings for semantic chunking
Validation and summary reporting for system readiness

Each cell is preceded by a markdown explanation and can be copied directly into a Jupyter notebook. The code includes error handling, logging, and validation checks.

# Phase 2: File Discovery and Filtering - Python Implementation

Phase 2 Overview (Markdown)
This phase implements the file discovery and filtering system that recursively scans the target directory to identify all processable code files. The system applies multiple filtering layers to ensure only relevant, accessible files are processed.
Key Features:

Recursive Directory Traversal: Scans all subdirectories while respecting exclusion patterns
.gitignore Integration: Parses and applies .gitignore patterns from the project root
Binary File Detection: Uses simple byte-based heuristic to skip binary files
File Size Limits: Enforces 2MB maximum file size to prevent memory issues
Progress Tracking: Provides real-time progress indicators for large codebases
Error Recovery: Logs failures and continues processing remaining files

Processing Flow:

Load and parse .gitignore patterns
Recursively discover all files in target directory
Apply extension filtering (only supported file types)
Apply exclusion patterns (.gitignore + additional exclusions)
Validate file accessibility and size limits
Generate FileMetadata objects for valid files

In [25]:
# =============================================================================
# PHASE 2: FILE DISCOVERY AND FILTERING
# =============================================================================

def load_gitignore_patterns(project_root: Path) -> Optional[pathspec.PathSpec]:
    """
    Load and parse .gitignore patterns from project root.
    
    Args:
        project_root: Root directory of the project
        
    Returns:
        PathSpec object for .gitignore patterns or None if not found
    """
    gitignore_path = project_root / '.gitignore'
    
    if not gitignore_path.exists():
        logger.info("📋 No .gitignore file found in project root")
        return None
    
    try:
        with open(gitignore_path, 'r', encoding='utf-8') as f:
            patterns = f.read().splitlines()
        
        # Filter out empty lines and comments
        filtered_patterns = []
        for pattern in patterns:
            pattern = pattern.strip()
            if pattern and not pattern.startswith('#'):
                filtered_patterns.append(pattern)
        
        if filtered_patterns:
            spec = pathspec.PathSpec.from_lines('gitwildmatch', filtered_patterns)
            logger.info(f"📋 Loaded {len(filtered_patterns)} .gitignore patterns")
            return spec
        else:
            logger.info("📋 .gitignore file found but contains no valid patterns")
            return None
            
    except Exception as e:
        logger.error(f"❌ Error reading .gitignore file: {e}")
        return None

def create_additional_exclusion_spec(additional_exclusions: List[str]) -> pathspec.PathSpec:
    """
    Create PathSpec for additional exclusion patterns.
    
    Args:
        additional_exclusions: List of additional patterns to exclude
        
    Returns:
        PathSpec object for additional exclusions
    """
    # Copy the input so the caller's list is not mutated; an empty list is fine
    exclusion_patterns = list(additional_exclusions or [])
    
    # Ensure node_modules is always excluded at any level, even when no
    # additional exclusions were configured
    if 'node_modules' not in exclusion_patterns:
        exclusion_patterns.append('node_modules')
    
    # Add common exclusion patterns
    exclusion_patterns.extend([
        '**/node_modules',  # node_modules at any level
        '**/node_modules/**',  # anything inside node_modules
    ])
    
    logger.info(f"📋 Created {len(exclusion_patterns)} additional exclusion patterns")
    return pathspec.PathSpec.from_lines('gitwildmatch', exclusion_patterns)

print("✅ .gitignore pattern loading functions defined")

✅ .gitignore pattern loading functions defined
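
The comment and blank-line filtering inside `load_gitignore_patterns` can be tried on its own. The following is a minimal sketch using only the standard library; `clean_gitignore_lines` is a hypothetical helper name that does not exist in the notebook, and the sample `.gitignore` text is made up for illustration.

```python
from typing import List


def clean_gitignore_lines(text: str) -> List[str]:
    """Hypothetical helper: return the non-empty, non-comment lines of a .gitignore body."""
    cleaned = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only lines that carry a pattern
        if line and not line.startswith('#'):
            cleaned.append(line)
    return cleaned


# Illustrative .gitignore content (an assumption, not taken from a real project)
sample = """
# build output
dist/

*.log
!keep.log
"""

patterns = clean_gitignore_lines(sample)
print(patterns)
```

A list like `patterns` is what the cell above hands to `pathspec.PathSpec.from_lines('gitwildmatch', ...)`. Negation lines such as `!keep.log` survive the filtering, so they stay available to the matcher.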


File Validation Functions (Markdown)
This cell implements core file validation functions that check whether discovered files should be processed. The validation includes:
File Accessibility:

Check if file exists and is readable
Handle permission errors gracefully
Skip files that can't be accessed

Binary File Detection:

Simple byte-based heuristic (null bytes first, then a printable-character ratio check)
Efficient early detection to avoid reading entire files
Configurable sample size for detection

File Size Validation:

2MB maximum file size limit
Prevents memory issues with extremely large files
Logs oversized files for reference

Empty File Handling:

Skips files with 0 bytes
Skips files containing only whitespace
Optimizes processing by avoiding empty content
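
The binary detection heuristic can be illustrated on in-memory byte strings. This is a rough sketch that mirrors the null-byte check and the 85% printable-ratio threshold used by `is_binary_file` in the cell that follows; `looks_binary` and the three sample values are hypothetical names introduced here only for illustration.

```python
def looks_binary(sample: bytes, min_printable_ratio: float = 0.85) -> bool:
    """Hypothetical helper: apply the null-byte and printable-ratio checks to a byte sample."""
    if not sample:
        return False  # an empty sample is treated as text
    if b'\x00' in sample:
        return True   # null bytes are a strong binary signal
    # Printable ASCII (32-126) plus tab, line feed, and carriage return
    printable = sum(1 for byte in sample if 32 <= byte <= 126 or byte in (9, 10, 13))
    return printable / len(sample) < min_printable_ratio


source_sample = b"const x = 1;\nexport default x;\n"
png_sample = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"
utf8_sample = "日本語のコメント".encode("utf-8")

print(looks_binary(source_sample))
print(looks_binary(png_sample))
print(looks_binary(utf8_sample))
```

One consequence of counting only ASCII bytes as printable: the third sample, which is valid UTF-8 text, is reported as binary because every byte is above 126. Source files that consist mostly of non-ASCII text may therefore be skipped by this kind of check.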

In [26]:
# =============================================================================
# FILE VALIDATION FUNCTIONS
# =============================================================================

def is_binary_file(file_path: Path, sample_size: int = 8192) -> bool:
    """
    Check if file is binary using simple byte-based heuristic.
    
    Args:
        file_path: Path to the file to check
        sample_size: Number of bytes to sample for detection
        
    Returns:
        True if file appears to be binary, False otherwise
    """
    try:
        with open(file_path, 'rb') as f:
            sample = f.read(sample_size)
            
        # Check for null bytes (common indicator of binary files)
        if b'\x00' in sample:
            return True
            
        # Additional check: high ratio of non-printable characters
        if len(sample) == 0:
            return False
            
        # Count printable characters
        printable_chars = 0
        for byte in sample:
            # ASCII printable range: 32-126, plus common whitespace: 9, 10, 13
            if (32 <= byte <= 126) or byte in (9, 10, 13):
                printable_chars += 1
        
        # If less than 85% printable characters, consider it binary
        printable_ratio = printable_chars / len(sample)
        return printable_ratio < 0.85
        
    except Exception as e:
        logger.warning(f"⚠️ Error checking if {file_path} is binary: {e}")
        return True  # Assume binary if can't read

def validate_file_size(file_path: Path, max_size_mb: int = 2) -> bool:
    """
    Validate file size is within acceptable limits.
    
    Args:
        file_path: Path to the file
        max_size_mb: Maximum file size in megabytes
        
    Returns:
        True if file size is acceptable, False otherwise
    """
    try:
        size_bytes = file_path.stat().st_size
        size_mb = size_bytes / (1024 * 1024)
        
        if size_mb > max_size_mb:
            logger.warning(f"⚠️ File too large ({size_mb:.1f}MB): {file_path}")
            return False
            
        return True
        
    except Exception as e:
        logger.error(f"❌ Error checking file size for {file_path}: {e}")
        return False

def is_empty_or_whitespace(file_path: Path) -> bool:
    """
    Check if file is empty or contains only whitespace.
    
    Args:
        file_path: Path to the file
        
    Returns:
        True if file is empty or whitespace-only, False otherwise
    """
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            content = f.read().strip()
            return len(content) == 0
            
    except Exception as e:
        logger.warning(f"⚠️ Error checking if {file_path} is empty: {e}")
        return True  # Assume empty if can't read

def validate_file_processable(file_path: Path) -> Tuple[bool, str]:
    """
    Comprehensive validation if file can be processed.
    
    Args:
        file_path: Path to the file
        
    Returns:
        Tuple of (is_valid: bool, reason: str)
    """
    # Check if file exists and is readable
    if not file_path.exists():
        return False, "File does not exist"
    
    if not file_path.is_file():
        return False, "Not a regular file"
    
    # Check if it's a symbolic link (skip them)
    if file_path.is_symlink():
        return False, "Symbolic link (skipped)"
    
    # Check file permissions
    try:
        with open(file_path, 'rb') as f:
            pass  # Just try to open
    except PermissionError:
        return False, "Permission denied"
    except Exception as e:
        return False, f"Cannot access file: {e}"
    
    # Check file size
    if not validate_file_size(file_path):
        return False, "File too large (>2MB)"
    
    # Check if binary
    if is_binary_file(file_path):
        return False, "Binary file"
    
    # Check if empty or whitespace only
    if is_empty_or_whitespace(file_path):
        return False, "Empty or whitespace-only file"
    
    return True, "Valid"

print("✅ File validation functions defined")

✅ File validation functions defined


File Exclusion Logic (Markdown)
This cell implements the exclusion logic that determines whether a discovered file should be skipped based on various patterns and rules.
Exclusion Sources:

.gitignore patterns: Loaded from project root
Additional exclusions: User-specified patterns from configuration
Extension filtering: Only process supported file extensions
Special directories: Always exclude node_modules at any level

Pattern Matching:

Uses pathspec library for gitignore-compatible pattern matching
Supports glob patterns, directory patterns, and negation patterns
Applies patterns relative to project root for consistency

Performance Optimization:

Early exclusion checks to avoid unnecessary file operations
Efficient pattern matching using compiled PathSpec objects
Directory-level exclusion to skip entire subtrees
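
The ordering of the cheap checks (extension first, then special directories) can be sketched without `pathspec`. The example below is a simplified stand-in under stated assumptions: `check_exclusion` is a hypothetical function, it works on project-relative POSIX-style paths, and it leaves out the `.gitignore` and additional-pattern matching that the real cell performs.

```python
from pathlib import PurePosixPath
from typing import Iterable, Tuple


def check_exclusion(rel_path: str,
                    supported_extensions: Iterable[str],
                    excluded_dir_names: Iterable[str] = ('node_modules',)) -> Tuple[bool, str]:
    """Hypothetical helper: return (is_excluded, reason) for a project-relative path."""
    path = PurePosixPath(rel_path)

    # Cheapest check first: the file extension
    if path.suffix.lower() not in set(supported_extensions):
        return True, f"Unsupported extension: {path.suffix}"

    # Then look at the directory components only (everything except the file name)
    excluded = set(excluded_dir_names)
    for part in path.parts[:-1]:
        if part in excluded:
            return True, f"Inside {part} directory"

    return False, "Not excluded"


extensions = {'.ts', '.tsx'}
print(check_exclusion('src/App.tsx', extensions))
print(check_exclusion('packages/ui/node_modules/react/index.ts', extensions))
print(check_exclusion('README.md', extensions))
```

Working on the relative path rather than the absolute one keeps directories above the project root out of the decision.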

In [27]:
# =============================================================================
# FILE EXCLUSION LOGIC
# =============================================================================

def is_file_excluded(file_path: Path, 
                    gitignore_spec: Optional[pathspec.PathSpec],
                    additional_spec: pathspec.PathSpec,
                    project_root: Path) -> Tuple[bool, str]:
    """
    Check if file should be excluded from processing.
    
    Args:
        file_path: Path to the file
        gitignore_spec: .gitignore PathSpec object
        additional_spec: Additional exclusions PathSpec object
        project_root: Project root directory
        
    Returns:
        Tuple of (is_excluded: bool, reason: str)
    """
    try:
        # Get relative path from project root
        rel_path = file_path.relative_to(project_root)
        rel_path_str = str(rel_path).replace('\\', '/')  # Use forward slashes
        
        # Check file extension first (most efficient)
        if file_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            return True, f"Unsupported extension: {file_path.suffix}"
        
        # Check additional exclusions (includes node_modules)
        if additional_spec.match_file(rel_path_str):
            return True, "Matches additional exclusion pattern"
        
        # Check if any directory between the project root and the file is node_modules
        # (directories above the project root are deliberately ignored)
        if 'node_modules' in rel_path.parts[:-1]:
            return True, "Inside node_modules directory"
        
        # Check .gitignore patterns
        if gitignore_spec and gitignore_spec.match_file(rel_path_str):
            return True, "Matches .gitignore pattern"
        
        return False, "Not excluded"
        
    except ValueError:
        # File is not under project root
        return True, "File not under project root"
    except Exception as e:
        logger.warning(f"⚠️ Error checking exclusion for {file_path}: {e}")
        return True, f"Error during exclusion check: {e}"

def is_directory_excluded(dir_path: Path, 
                         gitignore_spec: Optional[pathspec.PathSpec],
                         additional_spec: pathspec.PathSpec,
                         project_root: Path) -> Tuple[bool, str]:
    """
    Check if entire directory should be excluded from traversal.
    
    Args:
        dir_path: Path to the directory
        gitignore_spec: .gitignore PathSpec object
        additional_spec: Additional exclusions PathSpec object
        project_root: Project root directory
        
    Returns:
        Tuple of (is_excluded: bool, reason: str)
    """
    try:
        # Always exclude node_modules at any level
        if dir_path.name == 'node_modules':
            return True, "node_modules directory"
        
        # Get relative path from project root; the trailing slash marks the path
        # as a directory so directory-only patterns such as "build/" can match
        rel_path = dir_path.relative_to(project_root)
        rel_path_str = str(rel_path).replace('\\', '/') + '/'
        
        # Check additional exclusions
        if additional_spec.match_file(rel_path_str):
            return True, "Matches additional exclusion pattern"
        
        # Check .gitignore patterns
        if gitignore_spec and gitignore_spec.match_file(rel_path_str):
            return True, "Matches .gitignore pattern"
        
        return False, "Not excluded"
        
    except ValueError:
        # Directory is not under project root
        return True, "Directory not under project root"
    except Exception as e:
        logger.warning(f"⚠️ Error checking directory exclusion for {dir_path}: {e}")
        return True, f"Error during exclusion check: {e}"

print("✅ File exclusion logic defined")

✅ File exclusion logic defined


File Metadata Extraction (Markdown)
This cell implements functions to extract metadata from discovered files. The metadata is essential for the chunking process and provides context about each file.
Extracted Metadata:

File Path Information: Absolute and relative paths
File Properties: Extension and size in bytes
Line Counting: Total lines and import line estimation
Project Context: Relationship to package.json files

Import Line Detection:

Simple regex-based detection for common import patterns
Handles ES6 imports, CommonJS requires, CSS imports
Provides rough estimate for import chunking decisions

Performance Considerations:

Line counting streams each file line by line instead of loading it fully into memory
Total lines and import lines are counted in the same pass
Minimal file I/O operations to maintain speed
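
The regex-based import detection can be demonstrated on a string instead of a file. This is a condensed sketch, assuming a reduced subset of the patterns used in the cell that follows; `IMPORT_PATTERNS`, `count_import_lines`, and the sample source are hypothetical names introduced for illustration.

```python
import re
from typing import Tuple

# A reduced, illustrative subset of import patterns
IMPORT_PATTERNS = [re.compile(pattern) for pattern in (
    r'^\s*import\s+',                            # ES6 imports
    r'^\s*(const|let|var)\s+.*=\s*require\(',    # CommonJS require
    r'^\s*@import\s+',                           # CSS imports
    r'^\s*export\s+.*from\s+',                   # Re-exports
)]


def count_import_lines(source: str) -> Tuple[int, int]:
    """Hypothetical helper: return (total_lines, import_lines) for a source string."""
    total = 0
    imports = 0
    for line in source.splitlines():
        total += 1
        stripped = line.strip()
        # Skip blank lines and lines that start a comment
        if not stripped or stripped.startswith(('//', '/*')):
            continue
        if any(pattern.match(stripped) for pattern in IMPORT_PATTERNS):
            imports += 1
    return total, imports


sample = '''import React from "react";
// import commented from "nowhere";
const fs = require("fs");
export { helper } from "./helper";

function App() { return null; }
'''

print(count_import_lines(sample))
```

The count is an estimate: imports that span several lines are counted once, on their first line, and dynamic `import("...")` calls are not matched because the first pattern requires whitespace after the keyword.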

In [28]:
# =============================================================================
# FILE METADATA EXTRACTION
# =============================================================================

import re
from datetime import datetime

def count_file_lines(file_path: Path) -> Tuple[int, int]:
    """
    Count total lines and estimate import lines in a file.
    
    Args:
        file_path: Path to the file
        
    Returns:
        Tuple of (total_lines: int, import_lines: int)
    """
    # Import patterns for different file types
    import_patterns = [
        r'^\s*import\s+',           # ES6 imports
        r'^\s*from\s+[\'"][^\'"]',  # ES6 from imports  
        r'^\s*const\s+.*=\s*require\(',  # CommonJS require
        r'^\s*let\s+.*=\s*require\(',    # CommonJS require
        r'^\s*var\s+.*=\s*require\(',    # CommonJS require
        r'^\s*@import\s+',          # CSS imports
        r'^\s*@use\s+',             # SCSS use
        r'^\s*export\s+.*from\s+',  # Re-exports
    ]
    
    compiled_patterns = [re.compile(pattern, re.IGNORECASE) for pattern in import_patterns]
    
    try:
        total_lines = 0
        import_lines = 0
        
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            for line in f:
                total_lines += 1
                
                # Check if line matches import patterns
                line_stripped = line.strip()
                if line_stripped and not line_stripped.startswith('//') and not line_stripped.startswith('/*'):
                    for pattern in compiled_patterns:
                        if pattern.match(line_stripped):
                            import_lines += 1
                            break
        
        return total_lines, import_lines
        
    except Exception as e:
        logger.warning(f"⚠️ Error counting lines in {file_path}: {e}")
        return 0, 0

def extract_file_metadata(file_path: Path, project_root: Path) -> FileMetadata:
    """
    Extract comprehensive metadata from a file.
    
    Args:
        file_path: Path to the file
        project_root: Project root directory
        
    Returns:
        FileMetadata object with extracted information
    """
    try:
        # Basic file information
        stat_info = file_path.stat()
        size_bytes = stat_info.st_size
        
        # Relative path calculation
        relative_path = str(file_path.relative_to(project_root)).replace('\\', '/')
        
        # Extension
        extension = file_path.suffix.lower()
        
        # Line counting
        total_lines, import_lines = count_file_lines(file_path)
        
        # Check if there's a package.json in the project root
        package_json_exists = (project_root / 'package.json').exists()
        
        metadata = FileMetadata(
            file_path=file_path,
            relative_path=relative_path,
            extension=extension,
            size_bytes=size_bytes,
            total_lines=total_lines,
            import_lines=import_lines,
            has_package_json=package_json_exists
        )
        
        logger.debug(f"📊 Extracted metadata for {relative_path}: {total_lines} lines, {import_lines} imports")
        return metadata
        
    except Exception as e:
        logger.error(f"❌ Error extracting metadata for {file_path}: {e}")
        # Return minimal metadata on error
        return FileMetadata(
            file_path=file_path,
            relative_path=str(file_path.name),
            extension=file_path.suffix.lower(),
            size_bytes=0,
            total_lines=0,
            import_lines=0,
            has_package_json=False
        )

print("✅ File metadata extraction functions defined")

✅ File metadata extraction functions defined


Main File Discovery Function (Markdown)
This cell implements the main file discovery function, which orchestrates the discovery steps defined in the previous cells. It is the core of Phase 2.
Process Flow:

Initialize Patterns: Load .gitignore and create additional exclusion patterns
Recursive Traversal: Walk through all directories and subdirectories
Progressive Filtering: Apply exclusions at directory and file levels
Validation Pipeline: Check file accessibility, size, and content type
Metadata Generation: Extract comprehensive metadata for valid files
Progress Reporting: Provide real-time feedback for large codebases

Performance Features:

Early Directory Exclusion: Skip entire subtrees when directories are excluded
Batch Progress Updates: Update progress periodically to avoid log spam
Memory Efficient: Process files one at a time without loading all into memory
Error Recovery: Continue processing even when individual files fail

Return Value:

List of FileMetadata objects for all processable files
Updated ProcessingStats with discovery metrics
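
Skipping an excluded subtree depends on pruning directories during traversal rather than filtering files afterwards. The sketch below shows one way to do that with `os.walk`, assuming exclusion by directory name only; `walk_with_pruning` is a hypothetical helper, and the temporary directory layout is made up for the demonstration.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator, Set


def walk_with_pruning(root: Path, excluded_dir_names: Set[str]) -> Iterator[Path]:
    """Hypothetical helper: yield files under root without descending into excluded directories."""
    for current_dir, dir_names, file_names in os.walk(root):
        # Assigning to the slice edits the list that os.walk uses to decide
        # where to descend next, so excluded subtrees are never visited
        dir_names[:] = sorted(d for d in dir_names if d not in excluded_dir_names)
        for name in sorted(file_names):
            yield Path(current_dir) / name


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'src').mkdir()
    (root / 'src' / 'App.tsx').write_text('export {};\n')
    (root / 'node_modules' / 'react').mkdir(parents=True)
    (root / 'node_modules' / 'react' / 'index.js').write_text('module.exports = {};\n')

    found = [p.relative_to(root).as_posix() for p in walk_with_pruning(root, {'node_modules'})]

print(found)
```

With `Path.rglob('*')` there is no equivalent hook: every entry is yielded, including everything below an excluded directory, so pruning has to be expressed through a top-down walk like this one.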

In [29]:
# =============================================================================
# MAIN FILE DISCOVERY FUNCTION
# =============================================================================

import time  # used for discovery timing; harmless if Phase 1 already imported it

def discover_files(target_directory: Path,
                  supported_extensions: List[str],
                  additional_exclusions: List[str],
                  stats: ProcessingStats) -> List[FileMetadata]:
    """
    Discover all processable files in target directory with comprehensive filtering.
    
    Args:
        target_directory: Directory to scan recursively
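        supported_extensions: List of supported file extensions (note: extension
            filtering currently reads the global SUPPORTED_EXTENSIONS inside
            is_file_excluded, so this argument is not consulted here)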
        additional_exclusions: Additional files/patterns to exclude
        stats: ProcessingStats object to update
        
    Returns:
        List of FileMetadata objects for discovered files
    """
    logger.info(f"🔍 Starting file discovery in: {target_directory}")
    start_time = time.time()
    
    # Validate target directory
    if not target_directory.exists():
        logger.error(f"❌ Target directory does not exist: {target_directory}")
        return []
    
    if not target_directory.is_dir():
        logger.error(f"❌ Target path is not a directory: {target_directory}")
        return []
    
    # Load exclusion patterns
    logger.info("📋 Loading exclusion patterns...")
    gitignore_spec = load_gitignore_patterns(target_directory)
    additional_spec = create_additional_exclusion_spec(additional_exclusions)
    
    # Initialize tracking variables
    discovered_files = []
    total_files_found = 0
    excluded_files = 0
    invalid_files = 0
    last_progress_update = 0
    
    # Statistics tracking
    exclusion_reasons = {}
    validation_reasons = {}
    
    logger.info("🚀 Beginning recursive file discovery...")
    
    try:
        # Walk the tree top-down so excluded directories can be pruned before descent
        for current_dir, dir_names, file_names in os.walk(target_directory):
            current_path = Path(current_dir)
            
            # Prune excluded directories in place so os.walk skips their subtrees
            kept_dirs = []
            for dir_name in dir_names:
                total_files_found += 1
                dir_path = current_path / dir_name
                is_excluded, reason = is_directory_excluded(
                    dir_path, gitignore_spec, additional_spec, target_directory
                )
                if is_excluded:
                    logger.debug(f"📁 Excluding directory: {dir_path.relative_to(target_directory)} ({reason})")
                else:
                    kept_dirs.append(dir_name)
            dir_names[:] = kept_dirs
            
            for file_name in file_names:
                root_path = current_path / file_name
                
                # Progress reporting (every 100 items)
                total_files_found += 1
                if total_files_found - last_progress_update >= 100:
                    logger.info(f"📊 Progress: {total_files_found} items scanned, {len(discovered_files)} valid files found")
                    last_progress_update = total_files_found
                
                # Only process regular files
                if not root_path.is_file():
                    continue
                
                # Check file exclusions
                is_excluded, exclusion_reason = is_file_excluded(
                    root_path, gitignore_spec, additional_spec, target_directory
                )
                
                if is_excluded:
                    excluded_files += 1
                    exclusion_reasons[exclusion_reason] = exclusion_reasons.get(exclusion_reason, 0) + 1
                    logger.debug(f"🚫 Excluded: {root_path.relative_to(target_directory)} ({exclusion_reason})")
                    continue
                
                # Validate file is processable
                is_valid, validation_reason = validate_file_processable(root_path)
                
                if not is_valid:
                    invalid_files += 1
                    validation_reasons[validation_reason] = validation_reasons.get(validation_reason, 0) + 1
                    logger.debug(f"❌ Invalid: {root_path.relative_to(target_directory)} ({validation_reason})")
                    continue
                
                # Extract metadata for valid files
                try:
                    metadata = extract_file_metadata(root_path, target_directory)
                    discovered_files.append(metadata)
                    logger.debug(f"✅ Added: {metadata.relative_path}")
                    
                except Exception as e:
                    invalid_files += 1
                    logger.error(f"❌ Error processing {root_path}: {e}")
                    continue
    
    except Exception as e:
        logger.error(f"❌ Error during file discovery: {e}")
        return discovered_files
    
    # Final statistics
    discovery_time = time.time() - start_time
    
    # Update processing stats
    stats.total_files = len(discovered_files)
    
    # Log comprehensive summary
    logger.info(f"\n" + "="*60)
    logger.info(f"📊 FILE DISCOVERY SUMMARY")
    logger.info(f"="*60)
    logger.info(f"📁 Target Directory: {target_directory}")
    logger.info(f"⏱️ Discovery Time: {discovery_time:.2f} seconds")
    logger.info(f"📄 Total Items Scanned: {total_files_found}")
    logger.info(f"✅ Valid Files Found: {len(discovered_files)}")
    logger.info(f"🚫 Files Excluded: {excluded_files}")
    logger.info(f"❌ Invalid Files: {invalid_files}")
    
    # Log exclusion reasons
    if exclusion_reasons:
        logger.info(f"\n📋 EXCLUSION BREAKDOWN:")
        for reason, count in sorted(exclusion_reasons.items(), key=lambda x: x[1], reverse=True):
            logger.info(f"   • {reason}: {count} files")
    
    # Log validation reasons
    if validation_reasons:
        logger.info(f"\n❌ VALIDATION FAILURE BREAKDOWN:")
        for reason, count in sorted(validation_reasons.items(), key=lambda x: x[1], reverse=True):
            logger.info(f"   • {reason}: {count} files")
    
    # Log file type breakdown
    if discovered_files:
        extension_counts = {}
        total_size_mb = 0
        total_lines = 0
        
        for metadata in discovered_files:
            ext = metadata.extension
            extension_counts[ext] = extension_counts.get(ext, 0) + 1
            total_size_mb += metadata.size_bytes / (1024 * 1024)
            total_lines += metadata.total_lines
        
        logger.info(f"\n📝 FILE TYPE BREAKDOWN:")
        for ext, count in sorted(extension_counts.items()):
            logger.info(f"   • {ext}: {count} files")
        
        logger.info(f"\n📊 CONTENT STATISTICS:")
        logger.info(f"   • Total Size: {total_size_mb:.2f} MB")
        logger.info(f"   • Total Lines: {total_lines:,}")
        logger.info(f"   • Average File Size: {(total_size_mb * 1024) / len(discovered_files):.1f} KB")
        logger.info(f"   • Average Lines per File: {total_lines / len(discovered_files):.0f}")
    
    logger.info(f"="*60)
    logger.info(f"🎉 File discovery complete! Found {len(discovered_files)} processable files")
    
    return discovered_files

print("✅ Main file discovery function defined")

✅ Main file discovery function defined


Package.json Discovery and Analysis (Markdown)
This cell implements the package.json discovery and dependency analysis system. Since the requirements specify handling nested package.json files differently, the system covers three areas:
Package.json Discovery:

Primary: Look for package.json in the project root (highest priority)
Secondary: Discover package.json files in subdirectories
Hierarchy: Maintain parent-child relationships between package.json files

Dependency Extraction:

Extract both dependencies and devDependencies
Create lookup tables for import classification
Handle nested dependencies with inheritance rules

Multi-package Support:

Monorepo support with multiple package.json files
Workspace-aware dependency resolution
Fallback to root-level dependencies when local ones are missing
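
The fallback rule in the last bullet can be sketched as a small lookup. This is one possible reading of that rule, not the notebook's implementation; `parse_dependency_names` and `resolve_workspace_dependencies` are hypothetical helpers, and the package names and version strings are illustrative.

```python
import json
from typing import Dict, Set


def parse_dependency_names(package_json_text: str) -> Set[str]:
    """Hypothetical helper: collect dependency and devDependency names from package.json text."""
    data = json.loads(package_json_text)
    names = set(data.get('dependencies', {}))
    names.update(data.get('devDependencies', {}))
    return names


def resolve_workspace_dependencies(workspace_dir: str,
                                   workspaces: Dict[str, Set[str]],
                                   root_dependencies: Set[str]) -> Set[str]:
    """Hypothetical helper: use a workspace's own dependencies, else fall back to the root's."""
    local = workspaces.get(workspace_dir)
    return set(local) if local else set(root_dependencies)


# Illustrative root package.json content (an assumption for this example)
root_text = '{"dependencies": {"react": "^18.0.0"}, "devDependencies": {"typescript": "^5.0.0"}}'
root_deps = parse_dependency_names(root_text)

workspaces = {
    '.': root_deps,
    'packages/ui': {'clsx'},
    'packages/empty': set(),
}

print(resolve_workspace_dependencies('packages/ui', workspaces, root_deps))
print(resolve_workspace_dependencies('packages/empty', workspaces, root_deps))
```

A workspace with no entries of its own, or one that is not in the table at all, resolves to the root-level set.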

In [30]:
# =============================================================================
# PACKAGE.JSON DISCOVERY AND ANALYSIS
# =============================================================================

@dataclass
class PackageInfo:
    """Information about a package.json file"""
    path: Path
    relative_path: str
    dependencies: List[str] = field(default_factory=list)
    dev_dependencies: List[str] = field(default_factory=list)
    all_dependencies: set = field(default_factory=set)
    is_root: bool = False

def find_all_package_json_files(project_root: Path) -> List[PackageInfo]:
    """
    Find all package.json files in the project with hierarchy information.
    
    Args:
        project_root: Root directory to search in
        
    Returns:
        List of PackageInfo objects sorted by hierarchy (root first)
    """
    package_files = []
    
    logger.info("📦 Discovering package.json files...")
    
    try:
        # Search for all package.json files
        for package_path in project_root.rglob('package.json'):
            # Skip node_modules directories
            if 'node_modules' in package_path.parts:
                continue
            
            relative_path = str(package_path.relative_to(project_root)).replace('\\', '/')
            is_root = package_path.parent == project_root
            
            package_info = PackageInfo(
                path=package_path,
                relative_path=relative_path,
                is_root=is_root
            )
            
            package_files.append(package_info)
            logger.debug(f"📦 Found package.json: {relative_path} (root: {is_root})")
    
    except Exception as e:
        logger.error(f"❌ Error discovering package.json files: {e}")
        return []
    
    # Sort by hierarchy: root first, then by depth
    package_files.sort(key=lambda p: (not p.is_root, len(p.path.parts)))
    
    logger.info(f"📦 Found {len(package_files)} package.json files")
    return package_files

def extract_dependencies_from_package(package_path: Path) -> Tuple[List[str], List[str]]:
    """
    Extract dependencies and devDependencies from a package.json file.
    
    Args:
        package_path: Path to package.json file
        
    Returns:
        Tuple of (dependencies: List[str], devDependencies: List[str])
    """
    try:
        with open(package_path, 'r', encoding='utf-8') as f:
            package_data = json.load(f)
        
        # Extract dependencies
        dependencies = list(package_data.get('dependencies', {}).keys())
        dev_dependencies = list(package_data.get('devDependencies', {}).keys())
        
        logger.debug(f"📦 {package_path.name}: {len(dependencies)} deps, {len(dev_dependencies)} devDeps")
        
        return dependencies, dev_dependencies
        
    except json.JSONDecodeError as e:
        logger.error(f"❌ Invalid JSON in {package_path}: {e}")
        return [], []
    except Exception as e:
        logger.error(f"❌ Error reading {package_path}: {e}")
        return [], []

def analyze_all_package_files(package_files: List[PackageInfo]) -> Dict[str, set]:
    """
    Analyze all package.json files and create dependency lookup.
    
    Args:
        package_files: List of PackageInfo objects
        
    Returns:
        Dictionary with combined dependency information
    """
    logger.info("📦 Analyzing package.json dependencies...")
    
    # Extract dependencies from each package file
    for package_info in package_files:
        deps, dev_deps = extract_dependencies_from_package(package_info.path)
        package_info.dependencies = deps
        package_info.dev_dependencies = dev_deps
        package_info.all_dependencies = set(deps + dev_deps)
    
    # Create combined lookup tables
    root_dependencies = set()
    all_dependencies = set()
    workspace_dependencies = {}
    
    for package_info in package_files:
        if package_info.is_root:
            root_dependencies = package_info.all_dependencies.copy()
        
        all_dependencies.update(package_info.all_dependencies)
        
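        # Create workspace-specific lookup, keyed by the package directory relative
        # to the project root ('.' for the root package). relative_path is already
        # project-relative, so this also works when there is no root package.json.
        workspace_dir = str(Path(package_info.relative_path).parent).replace('\\', '/')
        workspace_dependencies[workspace_dir] = package_info.all_dependencies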
    
    # Log results
    logger.info(f"📦 Dependency analysis complete:")
    logger.info(f"   • Root dependencies: {len(root_dependencies)}")
    logger.info(f"   • Total unique dependencies: {len(all_dependencies)}")
    logger.info(f"   • Workspaces with dependencies: {len(workspace_dependencies)}")
    
    return {
        'root': root_dependencies,
        'all': all_dependencies,
        'workspaces': workspace_dependencies
    }

def create_dependency_lookup(dependency_info: Dict[str, set]) -> set:
    """
    Create optimized lookup set for external dependency classification.
    
    Args:
        dependency_info: Dictionary of dependency information
        
    Returns:
        Set of external package names for quick lookup
    """
    # Use all dependencies for comprehensive lookup; copy the set so the
    # update below does not mutate dependency_info['all'] for the caller
    external_deps = set(dependency_info.get('all', set()))
    
    # Add common external packages that might not be in package.json
    common_external = {
        'react', 'react-dom', 'vue', 'angular', 'jquery', 'lodash', 'axios',
        'express', 'next', 'nuxt', 'typescript', 'babel', 'webpack', 'vite'
    }
    
    external_deps.update(common_external)
    
    logger.info(f"🎯 Created dependency lookup with {len(external_deps)} packages")
    return external_deps

print("✅ Package.json discovery and analysis functions defined")

✅ Package.json discovery and analysis functions defined
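
How a lookup set like the one returned by `create_dependency_lookup` might be used to classify an import is sketched below. This is an assumption about later phases, not something shown in this section; `package_name_from_specifier` and `is_external_import` are hypothetical helpers, and the dependency names are illustrative.

```python
from typing import Set


def package_name_from_specifier(specifier: str) -> str:
    """Hypothetical helper: reduce an import specifier to its package name."""
    parts = specifier.split('/')
    # Scoped packages keep two segments, e.g. "@scope/name/sub" -> "@scope/name"
    if specifier.startswith('@') and len(parts) >= 2:
        return '/'.join(parts[:2])
    # Unscoped packages keep one segment, e.g. "lodash/fp" -> "lodash"
    return parts[0]


def is_external_import(specifier: str, external_deps: Set[str]) -> bool:
    """Hypothetical helper: True when the specifier resolves to a known external package."""
    # Relative and absolute paths are always local
    if specifier.startswith(('.', '/')):
        return False
    return package_name_from_specifier(specifier) in external_deps


deps = {'react', 'lodash', '@tanstack/react-query'}

for spec in ('react', 'lodash/fp', '@tanstack/react-query/build', './utils', '@/components/Button'):
    print(spec, is_external_import(spec, deps))
```

Matching on the package name rather than the full specifier lets subpath imports such as `lodash/fp` count as external, while path aliases such as `@/components/Button` fall through as local because `@/components` is not in the set.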


Phase 2 Pipeline Orchestrator (Markdown)

This cell implements the main pipeline orchestrator for Phase 2 that coordinates all file discovery and filtering operations. This function serves as the primary entry point and integrates all the components defined in previous cells.

**Pipeline Stages:**
1. **Setup and Validation**: Validate target directory and configuration
2. **Pattern Loading**: Load .gitignore and additional exclusion patterns
3. **File Discovery**: Recursive discovery with filtering and validation
4. **Package Analysis**: Discover and analyze package.json files
5. **Dependency Lookup**: Create optimized dependency classification system
6. **Results Compilation**: Compile and return comprehensive results

**Error Handling:**
- Graceful handling of missing directories or permission issues
- Detailed error logging with context
- Partial results return even if some operations fail
- Recovery strategies for common failure scenarios

**Integration Points:**
- Updates global `ProcessingStats` object from Phase 1
- Returns data structures needed for Phase 3 (AST Analysis)
- Maintains consistency with Phase 1 configuration and logging

In [31]:
# =============================================================================
# PHASE 2 PIPELINE ORCHESTRATOR
# =============================================================================

def run_file_discovery_pipeline(target_directory: Path,
                               supported_extensions: List[str],
                               additional_exclusions: List[str],
                               stats: ProcessingStats) -> Tuple[List[FileMetadata], set]:
    """
    Execute the complete file discovery and filtering pipeline.
    
    Args:
        target_directory: Directory to scan for files
        supported_extensions: List of file extensions to process
        additional_exclusions: Additional exclusion patterns
        stats: ProcessingStats object to update
        
    Returns:
        Tuple of (discovered_files: List[FileMetadata], external_deps: set)
    """
    logger.info("🚀 Starting Phase 2: File Discovery and Filtering")
    phase_start_time = time.time()
    
    try:
        # Stage 1: Validate target directory
        logger.info("🔍 Stage 1: Validating target directory")
        if not target_directory.exists():
            logger.error(f"❌ Target directory does not exist: {target_directory}")
            return [], set()
        
        if not target_directory.is_dir():
            logger.error(f"❌ Target path is not a directory: {target_directory}")
            return [], set()
        
        logger.info(f"✅ Target directory validated: {target_directory}")
        
        # Stage 2: Discover and analyze package.json files
        logger.info("🔍 Stage 2: Package.json discovery and analysis")
        package_files = find_all_package_json_files(target_directory)
        
        if package_files:
            dependency_info = analyze_all_package_files(package_files)
            external_deps = create_dependency_lookup(dependency_info)
            
            # Log package.json summary
            root_packages = [p for p in package_files if p.is_root]
            workspace_packages = [p for p in package_files if not p.is_root]
            
            logger.info(f"📦 Package.json summary:")
            logger.info(f"   • Root package.json: {'Yes' if root_packages else 'No'}")
            logger.info(f"   • Workspace packages: {len(workspace_packages)}")
            logger.info(f"   • Total external dependencies: {len(external_deps)}")
            
        else:
            logger.warning("⚠️ No package.json files found - all imports will be treated as local")
            external_deps = set()
        
        # Stage 3: File discovery
        logger.info("🔍 Stage 3: Recursive file discovery")
        discovered_files = discover_files(
            target_directory=target_directory,
            supported_extensions=supported_extensions,
            additional_exclusions=additional_exclusions,
            stats=stats
        )
        
        # Stage 4: Final validation and statistics
        logger.info("🔍 Stage 4: Final validation and statistics")
        
        if not discovered_files:
            logger.warning("⚠️ No processable files found in target directory")
            return [], external_deps
        
        # Calculate additional statistics
        files_with_imports = sum(1 for f in discovered_files if f.import_lines > 0)
        large_files = sum(1 for f in discovered_files if f.size_bytes > 1024 * 1024)  # > 1MB
        
        # Update stats
        stats.total_files = len(discovered_files)
        
        # Phase completion summary
        phase_duration = time.time() - phase_start_time
        
        logger.info("\n" + "=" * 60)
        logger.info("🎉 PHASE 2 COMPLETE - FILE DISCOVERY SUCCESS")
        logger.info("=" * 60)
        logger.info(f"⏱️ Phase Duration: {phase_duration:.2f} seconds")
        logger.info(f"📁 Target Directory: {target_directory}")
        logger.info(f"✅ Files Discovered: {len(discovered_files)}")
        logger.info(f"📦 Package.json Files: {len(package_files) if package_files else 0}")
        logger.info(f"🔗 External Dependencies: {len(external_deps)}")
        logger.info(f"📝 Files with Imports: {files_with_imports}")
        logger.info(f"📊 Large Files (>1MB): {large_files}")
        
        # File type summary
        ext_summary = {}
        for file_meta in discovered_files:
            ext = file_meta.extension
            ext_summary[ext] = ext_summary.get(ext, 0) + 1
        
        logger.info(f"📝 File Types:")
        for ext, count in sorted(ext_summary.items()):
            logger.info(f"   • {ext}: {count} files")
        
        logger.info("=" * 60)
        logger.info(f"➡️ Ready for Phase 3: AST Analysis and Structure Extraction")
        
        return discovered_files, external_deps
        
    except Exception as e:
        # Collect whatever was assigned before the failure; names that were
        # never bound fall back to empty containers.
        partial_files = locals().get('discovered_files', [])
        partial_deps = locals().get('external_deps', set())
        
        # exc_info=True adds the traceback so the failing stage can be identified
        logger.error(f"❌ Phase 2 failed with error: {e}", exc_info=True)
        logger.error(f"📊 Partial results: {len(partial_files)} files")
        
        return partial_files, partial_deps

print("✅ Phase 2 pipeline orchestrator defined")

✅ Phase 2 pipeline orchestrator defined
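
The per-extension tally that the orchestrator builds with a plain dictionary can also be expressed with `collections.Counter`. The sketch below is an alternative formulation, not the notebook's code; `FileRecord` is a hypothetical stand-in for `FileMetadata` that carries only the `extension` field used here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class FileRecord:
    """Hypothetical stand-in for FileMetadata with only the field used here."""
    extension: str


def summarize_extensions(files: Iterable[FileRecord]) -> Counter:
    """Count how many files share each extension."""
    return Counter(f.extension for f in files)


sample = [FileRecord(".ts"), FileRecord(".tsx"), FileRecord(".ts")]
summary = summarize_extensions(sample)
for ext, count in sorted(summary.items()):
    print(f"   • {ext}: {count} files")
```

A `Counter` returns 0 for extensions that never occurred, so callers can query any extension without a `.get(ext, 0)` fallback.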


Phase 2 Execution and Results (Markdown)

This final cell executes the Phase 2 pipeline using the configuration from Phase 1 and displays the results. It serves as the integration point between Phase 1 configuration and Phase 2 execution.

**Execution Flow:**
1. **Load Configuration**: Use global variables from Phase 1
2. **Execute Pipeline**: Run the complete file discovery pipeline
3. **Store Results**: Save results in global variables for Phase 3
4. **Display Summary**: Show comprehensive results and next steps
5. **Validation**: Ensure results are ready for Phase 3 processing

**Global Variables Created:**
- `discovered_files`: List of FileMetadata objects for Phase 3
- `external_dependencies`: Set of external package names for import classification
- `phase2_stats`: Updated ProcessingStats with discovery metrics

**Error Handling:**
- Graceful handling of pipeline failures
- Clear error messages with troubleshooting suggestions
- Partial results preservation when possible

In [32]:
# =============================================================================
# PHASE 2 EXECUTION AND RESULTS
# =============================================================================

def execute_phase2():
    """Execute Phase 2 with configuration from Phase 1"""
    global discovered_files, external_dependencies, phase2_stats
    
    logger.info("🎬 Executing Phase 2: File Discovery and Filtering")
    
    # Initialize stats object if not already created
    if 'phase2_stats' not in globals():
        phase2_stats = ProcessingStats()
    
    # Resolve configured paths from Phase 1 (the target is validated inside the pipeline)
    target_path = Path(TARGET_DIRECTORY)
    output_path = Path(OUTPUT_DIRECTORY)
    
    logger.info(f"📁 Using configuration from Phase 1:")
    logger.info(f"   • Target: {target_path}")
    logger.info(f"   • Output: {output_path}")
    logger.info(f"   • Extensions: {SUPPORTED_EXTENSIONS}")
    logger.info(f"   • Exclusions: {len(ADDITIONAL_EXCLUSIONS)} patterns")
    
    # Execute the pipeline
    try:
        files, deps = run_file_discovery_pipeline(
            target_directory=target_path,
            supported_extensions=SUPPORTED_EXTENSIONS,
            additional_exclusions=ADDITIONAL_EXCLUSIONS,
            stats=phase2_stats
        )
        
        # Store results globally for Phase 3
        discovered_files = files
        external_dependencies = deps
        
        # Success summary
        if discovered_files:
            logger.info(f"\n🎯 PHASE 2 RESULTS STORED:")
            logger.info(f"   • discovered_files: {len(discovered_files)} files")
            logger.info(f"   • external_dependencies: {len(external_dependencies)} packages")
            logger.info(f"   • phase2_stats: Updated with discovery metrics")
            
            # Sample files preview
            logger.info(f"\n📝 SAMPLE DISCOVERED FILES (first 5):")
            for i, file_meta in enumerate(discovered_files[:5]):
                size_kb = file_meta.size_bytes / 1024
                logger.info(f"   {i+1}. {file_meta.relative_path} ({size_kb:.1f}KB, {file_meta.total_lines} lines)")
            
            if len(discovered_files) > 5:
                logger.info(f"   ... and {len(discovered_files) - 5} more files")
            
            # Sample dependencies preview
            if external_dependencies:
                logger.info(f"\n📦 SAMPLE EXTERNAL DEPENDENCIES (first 10):")
                deps_list = sorted(external_dependencies)
                for dep in deps_list[:10]:
                    logger.info(f"   • {dep}")
                
                if len(external_dependencies) > 10:
                    logger.info(f"   ... and {len(external_dependencies) - 10} more dependencies")
            
            return True
            
        else:
            logger.warning("⚠️ No files discovered - check target directory and exclusion patterns")
            return False
            
    except Exception as e:
        logger.error(f"❌ Phase 2 execution failed: {e}")
        
        # Initialize empty results on failure
        discovered_files = []
        external_dependencies = set()
        
        return False

def validate_phase2_results():
    """Validate Phase 2 results are ready for Phase 3"""
    logger.info("🔍 Validating Phase 2 results for Phase 3 readiness...")
    
    validation_issues = []
    
    # Check discovered_files
    if 'discovered_files' not in globals():
        validation_issues.append("discovered_files variable not found")
    elif not isinstance(discovered_files, list):
        validation_issues.append("discovered_files is not a list")
    elif len(discovered_files) == 0:
        validation_issues.append("No files were discovered")
    
    # Check external_dependencies
    if 'external_dependencies' not in globals():
        validation_issues.append("external_dependencies variable not found")
    elif not isinstance(external_dependencies, set):
        validation_issues.append("external_dependencies is not a set")
    
    # Check file metadata structure
    if 'discovered_files' in globals() and discovered_files:
        sample_file = discovered_files[0]
        required_attrs = ['file_path', 'relative_path', 'extension', 'size_bytes', 'total_lines', 'import_lines']
        for attr in required_attrs:
            if not hasattr(sample_file, attr):
                validation_issues.append(f"FileMetadata missing attribute: {attr}")
    
    # Report results
    if validation_issues:
        logger.error("❌ Phase 2 validation failed:")
        for issue in validation_issues:
            logger.error(f"   • {issue}")
        return False
    else:
        logger.info("✅ Phase 2 validation passed - ready for Phase 3")
        return True

# Execute Phase 2
print("🚀 Starting Phase 2 execution...")
phase2_success = execute_phase2()

if phase2_success:
    print("✅ Phase 2 completed successfully!")
    
    # Validate results
    validation_success = validate_phase2_results()
    
    if validation_success:
        print("\n" + "="*60)
        print("🎉 PHASE 2 COMPLETE - SYSTEM READY FOR PHASE 3")
        print("="*60)
        print(f"📁 Files Ready for Processing: {len(discovered_files)}")
        print(f"📦 External Dependencies Identified: {len(external_dependencies)}")
        print(f"⏱️ Total Discovery Time: {phase2_stats.processing_time:.2f}s")
        print("➡️ Next Step: Execute Phase 3 - AST Analysis and Structure Extraction")
        print("="*60)
    else:
        print("❌ Phase 2 validation failed - please check the issues above")
        
else:
    print("❌ Phase 2 execution failed - please check the error messages above")
    print("💡 Troubleshooting suggestions:")
    print("   • Verify TARGET_DIRECTORY exists and is accessible")
    print("   • Check file permissions in target directory")
    print("   • Review exclusion patterns for overly restrictive rules")
    print("   • Ensure supported file extensions match your project")

logger.info(f"🔧 Phase 2 (File Discovery and Filtering) finished: {'success' if phase2_success else 'failed'}")

2025-08-02 20:04:55,239 - INFO - 🎬 Executing Phase 2: File Discovery and Filtering
2025-08-02 20:04:55,245 - INFO - 📁 Using configuration from Phase 1:
2025-08-02 20:04:55,247 - INFO -    • Target: /Users/tiyadiashok/python-projects/code_chunker/rag_sources/code_sources/typescript
2025-08-02 20:04:55,249 - INFO -    • Output: /Users/tiyadiashok/python-projects/code_chunker/rag_chunks/pre_processed/code_sources/typescript
2025-08-02 20:04:55,250 - INFO -    • Extensions: ['.tsx', '.ts', '.js', '.jsx', '.html', '.css', '.scss']
2025-08-02 20:04:55,250 - INFO -    • Exclusions: 10 patterns
2025-08-02 20:04:55,251 - INFO - 🚀 Starting Phase 2: File Discovery and Filtering
2025-08-02 20:04:55,252 - INFO - 🔍 Stage 1: Validating target directory
2025-08-02 20:04:55,253 - INFO - ✅ Target directory validated: /Users/tiyadiashok/python-projects/code_chunker/rag_sources/code_sources/typescript
2025-08-02 20:04:55,254 - INFO - 🔍 Stage 2: Package.json discovery and analysis
2025-08-02 20:04:55,255 -

🚀 Starting Phase 2 execution...
❌ Phase 2 execution failed - please check the error messages above
💡 Troubleshooting suggestions:
   • Verify TARGET_DIRECTORY exists and is accessible
   • Check file permissions in target directory
   • Review exclusion patterns for overly restrictive rules
   • Ensure supported file extensions match your project
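
The troubleshooting suggestions above can be checked programmatically before rerunning the phase. The following is a minimal sketch: `preflight_check` is a hypothetical helper that is not part of the notebook, and it deliberately ignores .gitignore and exclusion patterns, so its file count is an upper bound on what discovery would return.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def preflight_check(target: Path, extensions: List[str]) -> Dict[str, object]:
    """Collect basic facts about the target directory (hypothetical helper)."""
    report: Dict[str, object] = {
        "exists": target.exists(),
        "is_dir": target.is_dir(),
        # Listing a directory needs both read and execute permission
        "readable": os.access(target, os.R_OK | os.X_OK),
        "matching_files": 0,
    }
    if report["is_dir"] and report["readable"]:
        wanted = {ext.lower() for ext in extensions}
        report["matching_files"] = sum(
            1 for p in target.rglob("*")
            if p.is_file() and p.suffix.lower() in wanted
        )
    return report


# Demonstration on a throwaway directory instead of a real project path
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "App.tsx").write_text("export {};\n")
    (root / "README.md").write_text("# demo\n")
    report = preflight_check(root, [".ts", ".tsx"])
    print(report)
```

If `matching_files` is 0 for the real target, the extension list or the directory path is the likely culprit; if it is positive but discovery still returns nothing, the exclusion patterns are worth reviewing next.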


This completes Phase 2: File Discovery and Filtering. The implementation provides:

- **Complete .gitignore Integration**: Parses and applies .gitignore patterns correctly
- **Robust File Validation**: Binary detection, size limits, accessibility checks
- **Recursive Directory Traversal**: Efficient scanning with early exclusion optimization
- **Package.json Analysis**: Comprehensive dependency extraction with nested package support
- **Progress Reporting**: Real-time feedback for large codebases
- **Error Recovery**: Graceful handling of failures with detailed logging
- **Performance Optimization**: Memory-efficient processing and batch operations
- **Comprehensive Statistics**: Detailed metrics and breakdowns for analysis

The phase integrates seamlessly with Phase 1 configuration and prepares all necessary data structures for Phase 3 (AST Analysis and Structure Extraction). Each cell includes detailed markdown explanations and can be copied directly into your existing Jupyter notebook.