# PyForge CLI End-to-End Testing - Local Environment

This notebook tests PyForge CLI functionality in a local environment using the pre-built wheel from the dist/ directory.

## Key Differences from Serverless Notebook
- **Installation Source**: Local dist/ directory instead of Unity Catalog Volume
- **Sample Datasets**: Downloads to local filesystem instead of volumes
- **No Databricks Widgets**: Uses variables directly (can be converted to parameters for papermill)
- **No PySpark**: Tests only pandas-based conversions
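
Because the notebook uses plain variables instead of widgets, their values can be supplied from outside the notebook. The sketch below is one possible approach, not something the notebook currently does: it reads each setting from an environment variable and falls back to the notebook's defaults. The `PYFORGE_TEST_*` variable names and the `env_flag` helper are hypothetical.

```python
import os

def env_flag(name, default, environ=os.environ):
    """Read a boolean setting from the environment (hypothetical helper)."""
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}

# The environment variable names below are assumptions for illustration only.
FORCE_CONVERSION = env_flag("PYFORGE_TEST_FORCE", True)
TEST_SMALLEST_FILES_ONLY = env_flag("PYFORGE_TEST_SMALLEST_ONLY", True)
SKIP_PDF_FILES = env_flag("PYFORGE_TEST_SKIP_PDF", True)
SAMPLE_DATASETS_PATH = os.environ.get("PYFORGE_TEST_DATASETS", "sample-datasets")
```

With papermill, the usual alternative is to tag the configuration cell `parameters` and pass overrides on the command line with `-p`.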

## Test Configuration
- **Environment**: Local Python virtual environment
- **Installation Source**: dist/pyforge_cli-*.whl (local build)
- **Sample Data**: Real sample datasets from v1.0.5 release
- **Output Format**: Parquet (every conversion in this notebook runs with `--format parquet`)
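
The configuration cell reads the PyForge version from the wheel's filename. A minimal sketch of that step, using the same regular expression as the notebook, is shown here; the `version_from_wheel` function name and the example filename are illustrative assumptions.

```python
import re

def version_from_wheel(path):
    """Return the version segment of a pyforge_cli wheel filename, or None."""
    match = re.search(r"pyforge_cli-(.+?)-py", path)
    return match.group(1) if match else None

# Hypothetical filename that follows the standard wheel naming scheme.
print(version_from_wheel("dist/pyforge_cli-1.0.8.dev4-py3-none-any.whl"))
# prints: 1.0.8.dev4
```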

## Prerequisites
1. PyForge CLI wheel built in dist/ directory via `python -m build`
2. Python 3.8+ installed locally
3. Write permissions for creating test directories
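
These prerequisites can be checked before running any cell. The following is a sketch under stated assumptions rather than part of the notebook: `check_prerequisites` is a hypothetical helper, and it treats "write permissions" as the ability to create a scratch directory under the project root.

```python
import glob
import os
import sys
import tempfile

def check_prerequisites(project_root="."):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append(f"Python 3.8+ required, found {sys.version.split()[0]}")
    if not glob.glob(os.path.join(project_root, "dist", "pyforge_cli-*.whl")):
        problems.append("No wheel in dist/; run: python -m build --wheel")
    try:
        # Creating and removing a scratch directory shows we can write here.
        with tempfile.TemporaryDirectory(dir=project_root):
            pass
    except OSError as exc:
        problems.append(f"Cannot create directories in {project_root}: {exc}")
    return problems

for problem in check_prerequisites():
    print(f"❌ {problem}")
```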

## How to Use This Notebook
1. Ensure you have built the wheel: `python -m build --wheel`
2. Run all cells in sequence
3. Review the test results and summary report

## Key Features of This Notebook
1. **Consistent Structure**: Mirrors the serverless notebook for easy maintenance
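2. **Comprehensive Testing**: Covers every supported input format (CSV, Excel, XML, PDF, DBF, MDB, ACCDB); by default only the smallest file of each type is converted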
3. **Directory Creation**: Ensures output directories exist before conversion
4. **PDF Handling**: Skips PDF conversions due to known issues
5. **Detailed Observations**: Logs test results for each conversion
6. **Error Handling**: Records failures, timeouts, and unexpected errors per file instead of aborting the run
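
Two of the rules behind these features are small enough to show in isolation: the per-file timeout tiers and the success rate that leaves skipped files out of the denominator. The sketch below restates the logic of the conversion and summary cells; the function names are illustrative and do not appear in the notebook.

```python
def timeout_for_size(size_mb):
    """Seconds to allow for one conversion, using the notebook's size tiers."""
    if size_mb > 100:
        return 300  # large files: 5 minutes
    if size_mb > 10:
        return 120  # medium files: 2 minutes
    return 60       # small files: 1 minute

def success_rate(statuses):
    """Percentage of SUCCESS among attempted files; SKIPPED files are excluded."""
    attempted = [s for s in statuses if s != "SKIPPED"]
    if not attempted:
        return 0
    return round(100 * attempted.count("SUCCESS") / len(attempted), 1)

print(timeout_for_size(0.5), timeout_for_size(50), timeout_for_size(500))
# prints: 60 120 300
print(success_rate(["SUCCESS", "SUCCESS", "FAILED", "SKIPPED"]))
# prints: 66.7
```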

In [ ]:
# Configuration Parameters (equivalent to Databricks widgets)
# =============================================================================
# CONFIGURATION SECTION
# =============================================================================

import os
import sys
from pathlib import Path

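# Navigate to project root (3 levels up from notebook location).
# Skip the move when dist/ is already visible, so re-running this cell does not climb further.
project_root = Path.cwd() if Path("dist").is_dir() else Path.cwd().parent.parent.parent
os.chdir(project_root)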
print(f"Working directory: {os.getcwd()}")

# Configuration parameters (these would be widgets in Databricks)
SAMPLE_DATASETS_PATH = "sample-datasets"
PYFORGE_VERSION = "1.0.8.dev4"  # Fallback; replaced by the version parsed from the wheel filename
FORCE_CONVERSION = True
TEST_SMALLEST_FILES_ONLY = True
SKIP_PDF_FILES = True  # Due to known issues

# Derived paths
VENV_PATH = "test_env"
CONVERTED_OUTPUT_PATH = "test_output"

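# Find latest wheel in dist/
import glob
import re

def _wheel_sort_key(path):
    """Natural sort key (digit runs compare numerically), approximating `sort -V`."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r'(\d+)', os.path.basename(path))]

wheel_files = glob.glob("dist/pyforge_cli-*.whl")
if wheel_files:
    # A plain sorted() is lexicographic and would rank 1.0.9 above 1.0.10
    PYFORGE_WHEEL_PATH = max(wheel_files, key=_wheel_sort_key)
    # Extract version from wheel filename
    version_match = re.search(r'pyforge_cli-(.+?)-py', PYFORGE_WHEEL_PATH)
    if version_match:
        PYFORGE_VERSION = version_match.group(1)
else:
    PYFORGE_WHEEL_PATH = None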

print(f"🔧 Configuration:")
print(f"   PyForge Version: {PYFORGE_VERSION}")
print(f"   PyForge Wheel Path: {PYFORGE_WHEEL_PATH}")
print(f"   Sample Datasets Path: {SAMPLE_DATASETS_PATH}")
print(f"   Output Path: {CONVERTED_OUTPUT_PATH}")
print(f"   Force Conversion: {FORCE_CONVERSION}")
print(f"   Test Smallest Files Only: {TEST_SMALLEST_FILES_ONLY}")
print(f"   Skip PDF Files: {SKIP_PDF_FILES}")

if not PYFORGE_WHEEL_PATH:
    print("❌ ERROR: No wheel found in dist/. Please build first: python -m build --wheel")
    sys.exit(1)

In [ ]:
%%sh
# Environment Setup - Create Virtual Environment
# =============================================================================
# ENVIRONMENT SETUP SECTION
# =============================================================================

echo "🔍 Setting up test environment..."

# Remove existing test environment for clean start
if [ -d "test_env" ]; then
    echo "Removing existing test environment..."
    rm -rf test_env
fi

# Create fresh virtual environment
python3 -m venv test_env

# Verify environment creation before reporting success
if [ -f "test_env/bin/activate" ]; then
    echo "✅ Virtual environment created successfully: test_env/"
else
    echo "❌ Failed to create virtual environment"
    exit 1
fi

In [ ]:
%%sh
# Install PyForge CLI from Local Wheel
# =============================================================================
# INSTALLATION FROM LOCAL WHEEL
# =============================================================================

# Find the latest wheel file
WHEEL_FILE=$(ls dist/pyforge_cli-*.whl | sort -V | tail -1)

echo "📦 Installing PyForge CLI from local wheel..."
echo "   Installing from: $WHEEL_FILE"

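# Activate virtual environment and install
# (use "." rather than "source": %%sh may run a POSIX shell such as dash, which lacks "source")
. test_env/bin/activate

# Upgrade pip first
pip install --upgrade pip --quiet

# Install PyForge from wheel; stop here if the install fails
if pip install "$WHEEL_FILE"; then
    echo "✅ PyForge CLI installed successfully!"
else
    echo "❌ PyForge CLI installation failed"
    exit 1
fi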

# Verify installation
echo ""
echo "🔍 Verifying installation..."
pyforge --version

# List installed packages
echo ""
echo "📋 Key dependencies installed:"
pip list | grep -E "(pandas|pyarrow|openpyxl|PyMuPDF|chardet|requests|dbfread|jaydebeapi)"

In [ ]:
%%sh
# Setup Sample Datasets
# =============================================================================
# SAMPLE DATASETS SETUP
# =============================================================================

. test_env/bin/activate  # "." works in any POSIX sh; "source" is a bash extension

echo "📥 Setting up sample datasets..."

# Remove existing sample datasets for clean start
if [ -d "sample-datasets" ]; then
    echo "Removing existing sample-datasets directory..."
    rm -rf sample-datasets
fi

# Install sample datasets using PyForge CLI
echo "📦 Installing sample datasets using PyForge CLI..."
pyforge install sample-datasets --force

# Check if installation was successful
if [ -d "sample-datasets" ]; then
    echo "✅ Sample datasets installed successfully!"
    
    # Count files by type
    echo ""
    echo "📊 Sample datasets summary:"
    echo "   CSV files: $(find sample-datasets -name '*.csv' 2>/dev/null | wc -l)"
    echo "   XML files: $(find sample-datasets -name '*.xml' 2>/dev/null | wc -l)"
    echo "   Excel files: $(find sample-datasets -name '*.xlsx' -o -name '*.xls' 2>/dev/null | wc -l)"
    echo "   PDF files: $(find sample-datasets -name '*.pdf' 2>/dev/null | wc -l)"
    echo "   Access files: $(find sample-datasets -name '*.mdb' -o -name '*.accdb' 2>/dev/null | wc -l)"
    echo "   DBF files: $(find sample-datasets -name '*.dbf' 2>/dev/null | wc -l)"
else
    echo "⚠️  Sample datasets installation may have failed"
fi

In [ ]:
# Discover and Display Downloaded Files
# =============================================================================
# FILE DISCOVERY AND DETAILED DISPLAY
# =============================================================================

import os
import subprocess
import pandas as pd
from datetime import datetime
import json

def discover_and_display_files():
    """Discover all downloaded files and display them with size information."""
    print("🔍 Discovering all downloaded files in sample datasets...")
    
    all_files = []
    files_by_type = {}
    supported_extensions = {
        '.csv': 'CSV',
        '.xlsx': 'Excel', 
        '.xls': 'Excel',
        '.xml': 'XML',
        '.pdf': 'PDF',
        '.dbf': 'DBF',
        '.mdb': 'MDB',
        '.accdb': 'ACCDB'
    }
    
    if os.path.exists(SAMPLE_DATASETS_PATH):
        for root, dirs, files in os.walk(SAMPLE_DATASETS_PATH):
            # Skip already converted files
            if 'converted' in root or 'parquet' in root:
                continue
                
            for file in files:
                file_path = os.path.join(root, file)
                file_ext = os.path.splitext(file)[1].lower()
                
                if file_ext in supported_extensions:
                    # Get relative path for better display
                    rel_path = os.path.relpath(file_path, SAMPLE_DATASETS_PATH)
                    folder_category = rel_path.split(os.sep)[0] if os.sep in rel_path else 'root'
                    
                    size_bytes = os.path.getsize(file_path)  # stat the file once
                    file_info = {
                        'file_name': file,
                        'file_type': supported_extensions[file_ext],
                        'extension': file_ext,
                        'category': folder_category,
                        'file_path': file_path,
                        'relative_path': rel_path,
                        'size_bytes': size_bytes,
                        'size_mb': round(size_bytes / (1024*1024), 3),
                        'size_readable': format_file_size(size_bytes)
                    }
                    
                    all_files.append(file_info)
                    
                    # Group by file type
                    if file_info['file_type'] not in files_by_type:
                        files_by_type[file_info['file_type']] = []
                    files_by_type[file_info['file_type']].append(file_info)
        
        # Sort files by size within each type
        for file_type in files_by_type:
            files_by_type[file_type].sort(key=lambda x: x['size_bytes'])
            
    return all_files, files_by_type

def format_file_size(size_bytes):
    """Format file size in human-readable format."""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if size_bytes < 1024.0:
            return f"{size_bytes:.2f} {unit}"
        size_bytes /= 1024.0
    return f"{size_bytes:.2f} TB"

# Discover files
all_files, files_by_type = discover_and_display_files()

# Display summary statistics
print(f"\n📊 Downloaded Files Summary:")
print(f"   Total files found: {len(all_files)}")
print(f"   Total size: {format_file_size(sum(f['size_bytes'] for f in all_files))}")
print(f"   File types: {', '.join(sorted(files_by_type.keys()))}")

# Display files by type
print("\n📋 Files by Type (sorted by size):")
for file_type, files in sorted(files_by_type.items()):
    print(f"\n{file_type} Files ({len(files)} files):")
    for i, file_info in enumerate(files[:3]):  # Show first 3 files of each type
        print(f"   {i+1}. {file_info['file_name']} - {file_info['size_readable']} - {file_info['relative_path']}")
    if len(files) > 3:
        print(f"   ... and {len(files) - 3} more {file_type} files")

# Create DataFrame for display
if all_files:
    df_all_files = pd.DataFrame(all_files)
    
    # Summary by file type
    print("\n📊 Detailed Summary by File Type:")
    summary_by_type = df_all_files.groupby('file_type').agg({
        'file_name': 'count',
        'size_mb': ['sum', 'mean', 'min', 'max']
    }).round(3)
    summary_by_type.columns = ['file_count', 'total_size_mb', 'avg_size_mb', 'min_size_mb', 'max_size_mb']
    print(summary_by_type.to_string())
    
    # Show smallest file of each type
    print("\n🎯 Smallest File of Each Type (for testing):")
    smallest_files = []
    for file_type in files_by_type:
        if files_by_type[file_type]:
            smallest = files_by_type[file_type][0]  # Already sorted by size
            smallest_files.append(smallest)
            print(f"   {smallest['file_type']}: {smallest['file_name']} ({smallest['size_readable']})")
    
else:
    print("\n⚠️  No files found in the sample datasets directory.")
    print("   Please check if the sample datasets were downloaded successfully.")

# Store the catalog for later use
files_catalog = all_files
print(f"\n✅ File discovery completed. Found {len(files_catalog)} files ready for testing.")

In [ ]:
# Select Files for Testing
# =============================================================================
# FILE SELECTION FOR TESTING
# =============================================================================

def select_files_for_testing(all_files, files_by_type, test_smallest_only=True):
    """Select files for testing based on configuration."""
    selected_files = []
    
    if test_smallest_only:
        print("🎯 Selecting SMALLEST file of each type for testing...")
        
        # Get smallest file of each type
        for file_type in sorted(files_by_type.keys()):
            if files_by_type[file_type]:
                smallest_file = files_by_type[file_type][0]  # Already sorted by size
                selected_files.append(smallest_file)
                print(f"   {file_type}: {smallest_file['file_name']} ({smallest_file['size_readable']})")
    else:
        print("📋 Selecting ALL files for testing...")
        selected_files = all_files
        print(f"   Total files selected: {len(selected_files)}")
    
    return selected_files

# Select files based on configuration
files_for_testing = select_files_for_testing(all_files, files_by_type, TEST_SMALLEST_FILES_ONLY)

# Display selected files
print(f"\n📊 Files Selected for Testing: {len(files_for_testing)}")
if files_for_testing:
    df_selected = pd.DataFrame(files_for_testing)
    print("\nSelected files:")
    print(df_selected[['file_type', 'file_name', 'size_readable', 'category', 'file_path']].to_string(index=False))
    
    # Calculate total size and estimated time
    total_size_mb = sum(f['size_mb'] for f in files_for_testing)
    estimated_time = len(files_for_testing) * 5  # Assume 5 seconds per file average
    
    print(f"\n📈 Test Estimation:")
    print(f"   Files to process: {len(files_for_testing)}")
    print(f"   Total data size: {format_file_size(total_size_mb * 1024 * 1024)}")
    print(f"   Estimated time: ~{estimated_time} seconds")
else:
    print("⚠️  No files selected for testing!")

# Update files_catalog with selected files
files_catalog = files_for_testing
print(f"\n✅ File selection completed. {len(files_catalog)} files ready for conversion testing.")

In [ ]:
# Comprehensive Conversion Testing
# =============================================================================
# BULK CONVERSION TESTING
# =============================================================================

import time

def run_conversion_test(file_info):
    """Run conversion test for a single file."""
    file_path = file_info['file_path']
    file_type = file_info['file_type']
    file_name = file_info['file_name']
    file_ext = file_info['extension']
    
    # Create output path (splitext keeps dots inside the stem, e.g. "data.v1.csv" -> "data.v1")
    output_name = os.path.splitext(file_name)[0]
    output_dir = os.path.join(CONVERTED_OUTPUT_PATH, file_info['category'])
    output_path = os.path.join(output_dir, f"{output_name}.parquet")
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Skip PDF files if configured
    if SKIP_PDF_FILES and file_ext == '.pdf':
        print(f"   ⚠️  Skipping PDF file - known conversion issues")
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': 'SKIPPED',
            'duration_seconds': 0,
            'error_message': 'PDF conversion temporarily disabled due to known issues',
            'output_path': None,
            'size_mb': file_info.get('size_mb', 0),
            'converter_used': 'N/A',
            'observation': {
                'file': file_name,
                'type': file_type,
                'status': 'SKIPPED',
                'reason': 'PDF conversion issues'
            }
        }
    
    # Build conversion command
    cmd = [
        f'{VENV_PATH}/bin/pyforge', 'convert', file_path, output_path, 
        '--format', 'parquet'
    ]
    
    if FORCE_CONVERSION:
        cmd.append('--force')
    
    print(f"\n🔄 Converting {file_name} ({file_type})...")
    print(f"   File size: {file_info.get('size_readable', 'Unknown')}")
    print(f"   Output dir: {output_dir}")
    print(f"   Command: {' '.join(cmd)}")
    
    # Log test observation
    observation = {
        'file': file_name,
        'type': file_type,
        'size': file_info.get('size_readable', 'Unknown'),
        'start_time': datetime.now().strftime('%H:%M:%S')
    }
    
    try:
        start_time = time.time()
        
        # Set timeout based on file size
        file_size_mb = file_info.get('size_mb', 0)
        if file_size_mb > 100:
            timeout = 300  # 5 minutes for large files
        elif file_size_mb > 10:
            timeout = 120  # 2 minutes for medium files
        else:
            timeout = 60  # 1 minute for small files
        
        print(f"   Timeout: {timeout}s")
        
        # Run conversion
        result = subprocess.run(
            cmd, 
            capture_output=True, 
            text=True, 
            timeout=timeout
        )
        
        end_time = time.time()
        duration = round(end_time - start_time, 2)
        
        if result.returncode == 0:
            status = 'SUCCESS'
            error_message = None
            converter_used = 'Standard'  # Local doesn't have PySpark
            print(f"   ✅ Success ({duration}s)")
            
            # Verify output file exists
            if os.path.exists(output_path):
                print(f"   ✅ Output file verified: {output_path}")
                observation['output_verified'] = True
            else:
                print(f"   ⚠️  Output file not found")
                observation['output_verified'] = False
                
            observation['status'] = 'SUCCESS'
            observation['duration'] = f"{duration}s"
            observation['converter'] = converter_used
        else:
            status = 'FAILED'
            error_message = result.stderr.strip() if result.stderr else result.stdout.strip()
            converter_used = 'Unknown'
            print(f"   ❌ Failed ({duration}s)")
            print(f"   Error: {error_message[:200]}...")
            
            observation['status'] = 'FAILED'
            observation['duration'] = f"{duration}s"
            observation['error'] = error_message[:200]
        
        # Print detailed observation
        print(f"\n📝 Test Observation:")
        for key, value in observation.items():
            print(f"   {key}: {value}")
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': status,
            'duration_seconds': duration,
            'error_message': error_message,
            'output_path': output_path if status == 'SUCCESS' else None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': converter_used,
            'observation': observation
        }
        
    except subprocess.TimeoutExpired:
        observation['status'] = 'TIMEOUT'
        observation['duration'] = f"{timeout}s"
        print(f"   ⏰ Timeout after {timeout}s")
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': 'TIMEOUT',
            'duration_seconds': timeout,
            'error_message': f'Conversion timed out after {timeout} seconds',
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown',
            'observation': observation
        }
    except Exception as e:
        observation['status'] = 'ERROR'
        observation['error'] = str(e)
        print(f"   🚫 Error: {str(e)}")
        
        return {
            'file_name': file_name,
            'file_type': file_type,
            'status': 'ERROR',
            'duration_seconds': 0,
            'error_message': str(e),
            'output_path': None,
            'size_mb': file_size_mb,
            'command': ' '.join(cmd),
            'converter_used': 'Unknown',
            'observation': observation
        }

def run_bulk_tests():
    """Run conversion tests for selected files."""
    print(f"\n🚀 Starting conversion tests...")
    print(f"📁 Output directory: {CONVERTED_OUTPUT_PATH}")
    print(f"📊 Test mode: {'Smallest files only' if TEST_SMALLEST_FILES_ONLY else 'All files'}")
    print(f"🔧 Force conversion: {FORCE_CONVERSION}")
    
    # Ensure base output directory exists
    os.makedirs(CONVERTED_OUTPUT_PATH, exist_ok=True)
    
    test_results = []
    test_observations = []
    total_start_time = time.time()
    
    for i, file_info in enumerate(files_catalog, 1):
        print(f"\n{'='*60}")
        print(f"📝 Test {i}/{len(files_catalog)}")
        result = run_conversion_test(file_info)
        test_results.append(result)
        test_observations.append(result['observation'])
    
    total_end_time = time.time()
    total_duration = round(total_end_time - total_start_time, 2)
    
    # Print test observations summary
    print(f"\n{'='*60}")
    print("📊 TEST OBSERVATIONS SUMMARY:")
    print(f"{'='*60}")
    for obs in test_observations:
        print(f"\n{obs['file']} ({obs['type']}, {obs.get('size', 'Unknown')}):")
        print(f"   Status: {obs['status']}")
        if 'duration' in obs:
            print(f"   Duration: {obs.get('duration', 'N/A')}")
        if 'converter' in obs:
            print(f"   Converter: {obs['converter']}")
        if 'reason' in obs:
            print(f"   Reason: {obs['reason']}")
        if 'error' in obs:
            print(f"   Error: {obs['error'][:100]}...")
    
    return test_results, total_duration

# Run the bulk conversion tests
print("🎯 Executing conversion tests...")
test_results, total_test_duration = run_bulk_tests()

print(f"\n🏁 Conversion testing completed in {total_test_duration} seconds!")

In [ ]:
# Generate Summary Report
# =============================================================================
# SUMMARY REPORT GENERATION
# =============================================================================

def generate_summary_report(test_results, total_duration):
    """Generate comprehensive summary report of conversion tests."""
    
    df_results = pd.DataFrame(test_results)
    
    # Overall statistics
    total_files = len(test_results)
    successful = len(df_results[df_results['status'] == 'SUCCESS']) if len(df_results) > 0 else 0
    failed = len(df_results[df_results['status'] == 'FAILED']) if len(df_results) > 0 else 0
    skipped = len(df_results[df_results['status'] == 'SKIPPED']) if len(df_results) > 0 else 0
    timeout = len(df_results[df_results['status'] == 'TIMEOUT']) if len(df_results) > 0 else 0
    errors = len(df_results[df_results['status'] == 'ERROR']) if len(df_results) > 0 else 0
    
    # Calculate success rate excluding skipped files
    files_attempted = total_files - skipped
    success_rate = round((successful / files_attempted) * 100, 1) if files_attempted > 0 else 0
    
    # Performance statistics
    successful_tests = df_results[df_results['status'] == 'SUCCESS'] if len(df_results) > 0 else pd.DataFrame()
    avg_duration = round(successful_tests['duration_seconds'].mean(), 2) if len(successful_tests) > 0 else 0
    total_conversion_time = round(df_results['duration_seconds'].sum(), 2) if len(df_results) > 0 else 0
    total_size_processed = round(successful_tests['size_mb'].sum(), 2) if len(successful_tests) > 0 else 0
    
    # Summary dictionary
    summary = {
        'test_timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'environment': 'Local Python Environment',
        'pyforge_version': PYFORGE_VERSION,
        'total_files_tested': total_files,
        'files_attempted': files_attempted,
        'successful_conversions': successful,
        'failed_conversions': failed,
        'skipped_files': skipped,
        'timeout_files': timeout,
        'error_files': errors,
        'success_rate_percent': success_rate,
        'total_test_duration_seconds': total_duration,
        'total_conversion_time_seconds': total_conversion_time,
        'average_conversion_time_seconds': avg_duration,
        'total_data_processed_mb': total_size_processed,
        'wheel_path': PYFORGE_WHEEL_PATH,
        'sample_datasets_path': SAMPLE_DATASETS_PATH,
        'output_directory': CONVERTED_OUTPUT_PATH
    }
    
    return summary, df_results

# Generate summary report
summary_report, df_detailed_results = generate_summary_report(test_results, total_test_duration)

# Display summary report
print("=" * 80)
print("🎯 PYFORGE CLI LOCAL TESTING SUMMARY")
print("=" * 80)

print(f"📅 Test Timestamp: {summary_report['test_timestamp']}")
print(f"🏢 Environment: {summary_report['environment']}")
print(f"🔧 PyForge Version: {summary_report['pyforge_version']}")
print(f"📦 Wheel Path: {summary_report['wheel_path']}")

print("\n📊 OVERALL RESULTS:")
print(f"   Total Files: {summary_report['total_files_tested']}")
print(f"   Files Attempted: {summary_report['files_attempted']}")
print(f"   ✅ Successful: {summary_report['successful_conversions']}")
print(f"   ❌ Failed: {summary_report['failed_conversions']}")
print(f"   ⏭️  Skipped: {summary_report['skipped_files']}")
print(f"   ⏰ Timeout: {summary_report['timeout_files']}")
print(f"   🚫 Errors: {summary_report['error_files']}")
print(f"   🎯 Success Rate: {summary_report['success_rate_percent']}% (of attempted files)")

print("\n⏱️  PERFORMANCE METRICS:")
print(f"   Total Test Duration: {summary_report['total_test_duration_seconds']}s")
print(f"   Total Conversion Time: {summary_report['total_conversion_time_seconds']}s")
print(f"   Average Conversion Time: {summary_report['average_conversion_time_seconds']}s")
print(f"   Total Data Processed: {summary_report['total_data_processed_mb']} MB")

print("\n📋 RESULTS BY FILE TYPE:")
if len(df_detailed_results) > 0:
    type_summary = df_detailed_results.groupby('file_type')['status'].value_counts().unstack(fill_value=0)
    print(type_summary.to_string())
    
    print("\n📊 DETAILED RESULTS:")
    print(df_detailed_results[['file_name', 'file_type', 'status', 'duration_seconds', 'size_mb', 'converter_used']].to_string(index=False))
    
    # Show failed conversions details
    failed_tests = df_detailed_results[df_detailed_results['status'].isin(['FAILED', 'ERROR', 'TIMEOUT'])]
    if len(failed_tests) > 0:
        print(f"\n❌ FAILED CONVERSIONS DETAILS ({len(failed_tests)} failures):")
        for _, test in failed_tests.iterrows():
            print(f"\n{test['file_name']} ({test['file_type']}):")
            print(f"   Status: {test['status']}")
            if test['error_message']:
                print(f"   Error: {test['error_message'][:200]}...")
    
    # Show skipped files
    skipped_tests = df_detailed_results[df_detailed_results['status'] == 'SKIPPED']
    if len(skipped_tests) > 0:
        print(f"\n⏭️  SKIPPED FILES ({len(skipped_tests)} files):")
        for _, test in skipped_tests.iterrows():
            print(f"   {test['file_name']}: {test['error_message']}")
else:
    print("   No test results to display")

print("=" * 80)

In [ ]:
# Validate Converted Files
# =============================================================================
# CONVERTED FILE VALIDATION
# =============================================================================

def validate_converted_files():
    """Validate converted Parquet files using pandas."""
    print("🔍 Validating converted Parquet files...")
    
    successful_conversions = df_detailed_results[df_detailed_results['status'] == 'SUCCESS']
    validation_results = []
    
    if len(successful_conversions) == 0:
        print("⚠️  No successful conversions to validate.")
        return
    
    for _, result in successful_conversions.iterrows():
        output_path = result['output_path']
        file_name = result['file_name']
        file_type = result['file_type']
        
        # Skip PDF validations as they're known to have issues
        if file_type == 'PDF':
            print(f"  ⚠️  Skipping validation for PDF file: {file_name}")
            validation_results.append({
                'file_name': file_name,
                'file_type': file_type,
                'status': 'SKIPPED',
                'rows': 0,
                'columns': 0,
                'error': 'PDF validation skipped due to known issues'
            })
            continue
        
        try:
            # Try to read the parquet file with pandas
            df = pd.read_parquet(output_path)
            row_count = len(df)
            col_count = len(df.columns)
            
            validation_results.append({
                'file_name': file_name,
                'file_type': file_type,
                'status': 'VALID',
                'rows': row_count,
                'columns': col_count,
                'error': None
            })
            
            print(f"  ✅ {file_name}: {row_count} rows, {col_count} columns")
            
            # Show a sample of data for small files
            if row_count <= 10 and row_count > 0:
                print(f"     Columns: {list(df.columns)}")
                print(f"     Sample data:")
                print(df.head(3).to_string())
            
        except Exception as e:
            error_msg = str(e)
            status = 'INVALID'
                
            validation_results.append({
                'file_name': file_name,
                'file_type': file_type,
                'status': status,
                'rows': 0,
                'columns': 0,
                'error': error_msg[:200] if len(error_msg) > 200 else error_msg
            })
            print(f"  ❌ {file_name}: Validation failed - {error_msg[:100]}...")
    
    if validation_results:
        print(f"\n📊 Validation Summary:")
        df_validation = pd.DataFrame(validation_results)
        print(df_validation.to_string(index=False))
        
        valid_count = len(df_validation[df_validation['status'] == 'VALID'])
        skipped_count = len(df_validation[df_validation['status'] == 'SKIPPED'])
        total_count = len(df_validation)
        
        print(f"\n✅ Validation Results:")
        print(f"   Valid files: {valid_count}/{total_count}")
        print(f"   Skipped: {skipped_count}")
        
        if valid_count == (total_count - skipped_count):
            print("\n🎉 ALL CONVERTED FILES (EXCEPT SKIPPED) ARE VALID PARQUET FILES!")
            print("✅ PyForge CLI is working correctly in local environment")
            
        # Show breakdown by file type
        print("\n📊 Validation by File Type:")
        type_summary = df_validation.groupby('file_type')['status'].value_counts().unstack(fill_value=0)
        print(type_summary.to_string())

# Run validation
validate_converted_files()

In [ ]:
# Final Test Summary and Recommendations
# =============================================================================
# FINAL SUMMARY
# =============================================================================

print("🎉 PYFORGE CLI LOCAL TESTING COMPLETED!")
print("=" * 70)

print(f"📊 FINAL STATISTICS:")
print(f"   Environment: Local Python {sys.version.split()[0]}")
print(f"   PyForge Version: {summary_report['pyforge_version']}")
print(f"   Installation Source: Local wheel from dist/")
print(f"   Files Processed: {summary_report['total_files_tested']}")
print(f"   Success Rate: {summary_report['success_rate_percent']}%")
print(f"   Total Time: {summary_report['total_test_duration_seconds']}s")
print(f"   Data Processed: {summary_report['total_data_processed_mb']} MB")

print(f"\n📁 LOCAL PATHS:")
print(f"   Source Data: {summary_report['sample_datasets_path']}")
print(f"   Converted Files: {summary_report['output_directory']}")
print(f"   Wheel Location: {summary_report['wheel_path']}")

print(f"\n🚀 FEATURES TESTED:")
print(f"   ✅ Local Installation from Wheel")
print(f"   ✅ Sample Dataset Installation")
print(f"   ✅ File Discovery and Selection")
print(f"   ✅ Directory Creation Before Conversion")
print(f"   ✅ Multiple Format Support")
print(f"   ✅ Error Handling and Timeout Management")

print(f"\n💡 RECOMMENDATIONS:")
if summary_report['success_rate_percent'] >= 90:
    print("   ✅ Excellent performance! PyForge CLI works well locally.")
    print("   🚀 Ready for production use.")
elif summary_report['success_rate_percent'] >= 75:
    print("   ⚠️  Good performance with some issues. Review failed conversions.")
    print("   🔍 Consider fixing format-specific issues.")
else:
    print("   ❌ Performance needs attention. Check failed conversions and error messages.")
    print("   🛠️  Debug required before production deployment.")

print(f"\n📋 KNOWN ISSUES:")
print(f"   ⚠️  PDF conversions may produce invalid Parquet files")
print(f"   ⚠️  Large files may require longer timeouts")
print(f"   ⚠️  Some Excel files with spaces in names may fail")

print(f"\n🎯 NEXT STEPS:")
print(f"   1. Review any failed conversions in detail")
print(f"   2. Test with larger datasets if needed")
print(f"   3. Deploy to Databricks for serverless testing")
print(f"   4. Address known PDF conversion issues")

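print("\n🏁 Local testing run finished.")
if summary_report['success_rate_percent'] >= 90:
    print("✅ PyForge CLI is ready for use!")
else:
    print("⚠️  Review the failed conversions above before relying on this build.")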

In [ ]:
%%sh
# Optional: Cleanup Test Environment
# =============================================================================
# CLEANUP (OPTIONAL)
# =============================================================================

# Uncomment the following lines to clean up after testing:

# echo "🧹 Cleaning up test environment..."

# # Remove test output
# # rm -rf test_output/

# # Remove sample datasets
# # rm -rf sample-datasets/

# # Remove virtual environment
# # rm -rf test_env/

# echo "✅ Cleanup completed!"
echo "💡 To clean up, uncomment the cleanup commands above and run this cell"

In [None]:
%%sh
# Validate converted files and show data samples
# (the working directory is already the project root, set by the configuration cell)
. test_env/bin/activate

echo '=== CONVERSION VALIDATION ==='
echo 'Generated files:'
ls -la test_output/ || echo "No test_output directory found"

echo ''
echo '=== DATA VERIFICATION ==='
if [ -d test_output ]; then
    python3 -c "
import pandas as pd
import os

output_dir = 'test_output'
success_count = 0
total_files = 0

if os.path.exists(output_dir):
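    # Converted files are written to per-category subdirectories, so walk the tree
    found = [os.path.join(r, n) for r, _d, names in os.walk(output_dir) for n in names]
    for file_path in sorted(found):
        if file_path.endswith('.parquet'):
            total_files += 1
            file = os.path.relpath(file_path, output_dir)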
            try:
                df = pd.read_parquet(file_path)
                print(f'✅ {file}: {len(df)} rows, {len(df.columns)} columns')
                if len(df) > 0:
                    print(f'   Sample data: {list(df.columns[:3])}')
                success_count += 1
            except Exception as e:
                print(f'❌ {file}: Failed to read - {str(e)}')

    print(f'\n📊 SUMMARY: {success_count}/{total_files} files successfully converted and readable')
    if success_count == total_files and total_files > 0:
        print('🎉 ALL CONVERSIONS SUCCESSFUL!')
    elif success_count > 0:
        print('⚠️ PARTIAL SUCCESS - some conversions worked')
    else:
        print('❌ NO SUCCESSFUL CONVERSIONS')
else:
    print('❌ No test_output directory found')
"
else
    echo "❌ No test_output directory found"
fi