# Batch Processing for Equine Microbiome Reports

This notebook provides a simple interface for batch processing multiple microbiome CSV files to generate PDF reports.

## Features
- 📁 **Multi-file Processing**: Process all CSV files in the data directory
- 📊 **Progress Tracking**: Real-time progress updates with visual feedback
- ✅ **Quality Control**: Automatic validation of input files
- 📄 **PDF Generation**: Professional reports using the Jinja2 template system
- 📈 **Summary Reports**: Comprehensive statistics and quality metrics

## 1. Setup and Import

In [None]:
# Import required modules
import sys
from pathlib import Path

# Add the parent of the current working directory (the project root) to the import path
sys.path.insert(0, str(Path.cwd().parent))

# Import project modules
from src.batch_processor import BatchProcessor, BatchConfig
from src.progress_tracker import NotebookProgressTracker, generate_quality_report, create_manifest_template
import ipywidgets as widgets
from IPython.display import display, HTML

print("✅ Setup complete! Ready for batch processing.")

## 2. Configure Batch Processing

Adjust these settings based on your needs:

In [None]:
# Configuration
config = BatchConfig(
    data_dir='../data',
    reports_dir='../reports/batch_output',
    language='en',  # Options: 'en', 'pl', 'jp'
    parallel_processing=True,
    max_workers=4
)

# Ensure output directory exists
config.ensure_directories()

# Display configuration
print("📋 Current Configuration:")
print(f"  • Data directory: {config.data_dir.absolute()}")
print(f"  • Output directory: {config.reports_dir.absolute()}")
print(f"  • Language: {config.language}")
print(f"  • Parallel processing: {'Enabled' if config.parallel_processing else 'Disabled'}")
print(f"  • Max workers: {config.max_workers}")

## 3. Check Available Files

In [None]:
# List available CSV files (sorted so the listing order is stable across runs)
csv_files = sorted(config.data_dir.glob("*.csv"))

if not csv_files:
    print("❌ No CSV files found in the data directory!")
    print(f"   Please add CSV files to: {config.data_dir.absolute()}")
else:
    print(f"📂 Found {len(csv_files)} CSV files:")
    for i, csv_file in enumerate(csv_files, 1):
        size_kb = csv_file.stat().st_size / 1024
        print(f"   {i}. {csv_file.name} ({size_kb:.1f} KB)")

## 4. Run Batch Processing

Run the cell below, then click **Start Batch Processing** to process all CSV files:

In [None]:
# Create interactive controls
validate_checkbox = widgets.Checkbox(
    value=True,
    description='Enable quality validation',
    style={'description_width': 'initial'}
)

language_dropdown = widgets.Dropdown(
    options=[('English', 'en'), ('Polish', 'pl'), ('Japanese', 'jp')],
    value='en',
    description='Report Language:',
    style={'description_width': 'initial'}
)

start_button = widgets.Button(
    description='Start Batch Processing',
    button_style='primary',
    icon='play'
)

output_area = widgets.Output()

def on_start_clicked(b):
    # Expose the processor at notebook level so section 5 can read its results
    global processor

    with output_area:
        output_area.clear_output()
        
        if not csv_files:
            print("❌ No CSV files to process!")
            return
            
        # Update configuration
        config.language = language_dropdown.value
        
        # Create processor and tracker
        processor = BatchProcessor(config)
        tracker = NotebookProgressTracker()
        
        print(f"🚀 Starting batch processing of {len(csv_files)} files...")
        print(f"   Language: {config.language}")
        print(f"   Validation: {'Enabled' if validate_checkbox.value else 'Disabled'}")
        print("\n" + "="*50 + "\n")
        
        # Create progress display
        progress_callback = tracker.create_progress_display(
            total=len(csv_files),
            description="Processing"
        )
        
        # Process files
        results = processor.process_directory(
            progress_callback=progress_callback,
            validate=validate_checkbox.value
        )
        
        # Display summary
        print("\n" + "="*50)
        print("✅ BATCH PROCESSING COMPLETE!")
        print("="*50 + "\n")
        
        summary = processor.generate_summary_report()
        print("📊 Summary:")
        print(f"   • Total files: {summary['total_files']}")
        print(f"   • Successful: {summary['successful']} ✅")
        print(f"   • Failed: {summary['failed']} ❌")
        print(f"   • Success rate: {summary['success_rate']:.1f}%")
        print(f"   • Total time: {summary['total_processing_time']:.1f} seconds")
        print(f"   • Average time per file: {summary['average_processing_time']:.1f} seconds")
        
        # Show failed files if any
        if summary['failed'] > 0:
            print("\n❌ Failed files:")
            for result in results:
                if not result['success']:
                    print(f"   • {Path(result['csv_file']).name}: {result['message']}")
        
        # Save results
        report_path = processor.save_results_to_csv()
        print(f"\n📄 Detailed report saved to: {report_path}")

start_button.on_click(on_start_clicked)

# Display controls
display(widgets.VBox([
    widgets.HBox([language_dropdown, validate_checkbox]),
    start_button,
    output_area
]))

## 5. Quality Control Report

Generate a detailed quality control report for the processed files:

In [None]:
# Check if we have results to analyze.
# `processor` exists at notebook level only after the batch run in section 4.
batch_processor = globals().get('processor')

if batch_processor is not None and batch_processor.results:
    print("📊 Generating Quality Control Report...\n")

    # Generate quality report
    df = generate_quality_report(batch_processor.results)

    # Display summary statistics
    print("Summary Statistics:")
    print(f"• Files processed: {len(df)}")
    print(f"• Success rate: {df['success'].mean()*100:.1f}%")

    if 'validation_passed' in df.columns:
        print(f"• Validation pass rate: {df['validation_passed'].mean()*100:.1f}%")

    if 'processing_time' in df.columns:
        print("\nProcessing Time Statistics:")
        print(f"• Mean: {df['processing_time'].mean():.2f} seconds")
        print(f"• Min: {df['processing_time'].min():.2f} seconds")
        print(f"• Max: {df['processing_time'].max():.2f} seconds")

    # Show first few results, keeping only the columns that are present
    preview_columns = [
        col for col in ['csv_file', 'patient_name', 'success', 'processing_time']
        if col in df.columns
    ]
    print("\nFirst 5 results:")
    display(df[preview_columns].head())
else:
    print("ℹ️ No processing results available yet.")
    print("   Run the batch processing in section 4 first!")
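
The summary figures reported in sections 4 and 5 come down to a few aggregates over the per-file result dictionaries. The following is a minimal sketch, assuming each result is a dict with a `success` flag and an optional `processing_time` in seconds (both keys appear in this notebook). The helper name `summarize_results` is hypothetical and is not the project's `generate_summary_report`.

```python
from statistics import mean


def summarize_results(results):
    """Compute basic batch statistics from a list of per-file result dicts.

    Hypothetical helper for illustration; key names are assumed from the
    fields this notebook reads ('success', 'processing_time').
    """
    total = len(results)
    successful = sum(1 for r in results if r.get("success"))
    # Skip entries that carry no timing information
    times = [
        r["processing_time"]
        for r in results
        if r.get("processing_time") is not None
    ]
    return {
        "total_files": total,
        "successful": successful,
        "failed": total - successful,
        # Guard against division by zero for an empty batch
        "success_rate": 100.0 * successful / total if total else 0.0,
        "average_processing_time": mean(times) if times else 0.0,
    }
```

Calling `summarize_results(processor.results)` after a batch run should give numbers comparable to the printed summary, provided the result dicts really use these keys.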

## 6. Process with Custom Patient Information (Optional)

If you need to specify custom patient information for each file, you can use a manifest file:

In [None]:
# Create a manifest template
manifest_path = Path('../manifest_template.csv')

# Get list of CSV files for manifest
csv_filenames = [f.name for f in csv_files] if csv_files else ['sample_1.csv', 'sample_2.csv']

# Create template
manifest_df = create_manifest_template(str(manifest_path), csv_filenames[:5])  # Limit to 5 for demo

print("📄 Manifest template created!")
print(f"   Location: {manifest_path.absolute()}")
print("\n📋 Template structure:")
display(manifest_df)

In [None]:
# Process using manifest (remove the surrounding triple quotes to use)
"""
# Edit the manifest_template.csv file with your patient information, then run:

processor_manifest = BatchProcessor(config)
tracker_manifest = NotebookProgressTracker()

progress_callback = tracker_manifest.create_progress_display(
    total=len(manifest_df),
    description="Processing from manifest"
)

results = processor_manifest.process_from_manifest(
    manifest_path,
    progress_callback=progress_callback
)

print("\n✅ Manifest processing complete!")
"""

print("ℹ️ To use the manifest:")
print("   1. Edit manifest_template.csv with your patient information")
print("   2. Remove the triple quotes around the code above and run the cell")

## 7. View Generated Reports

In [None]:
# List all generated reports
if config.reports_dir.exists():
    pdf_files = sorted(config.reports_dir.glob("*.pdf"))
    
    if pdf_files:
        print(f"📄 Generated Reports ({len(pdf_files)} files):")
        print(f"   Location: {config.reports_dir.absolute()}\n")
        
        for pdf in pdf_files[-10:]:  # Show last 10
            size_mb = pdf.stat().st_size / (1024 * 1024)
            print(f"   • {pdf.name} ({size_mb:.2f} MB)")
            
        if len(pdf_files) > 10:
            print(f"\n   ... and {len(pdf_files) - 10} more files")
    else:
        print("ℹ️ No PDF reports generated yet.")
else:
    print("ℹ️ Reports directory does not exist yet.")

## Tips and Troubleshooting

### 🔧 Common Issues:

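1. **No CSV files found**: Make sure your files use the `.csv` extension and sit directly in the `data/` directory; the check in section 3 does not search subfolders
2. **Validation failures**: Check that your CSV files have the required columns and phyla listed below
3. **Memory issues**: Reduce `max_workers` or disable parallel processing (`parallel_processing=False`) in section 2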

### 📝 CSV File Requirements:

- Must contain the columns `species`, `phylum`, and `genus`, plus one or more `barcode[N]` columns
- Should have at least 10 different species
- Required phyla: Bacillota, Bacteroidota, Pseudomonadota
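
The requirements above can be checked before a batch run with a small pre-flight script. This is a sketch using only the standard library, not the project's own validator. The function name `check_microbiome_csv` is hypothetical, and the barcode pattern (`barcode` followed by digits, for example `barcode59`) is an assumption about how the `barcode[N]` columns are named.

```python
import csv
import re
from pathlib import Path

# Values taken from the requirements listed above
REQUIRED_COLUMNS = {"species", "phylum", "genus"}
REQUIRED_PHYLA = {"Bacillota", "Bacteroidota", "Pseudomonadota"}
MIN_SPECIES = 10
# Assumption: barcode columns are named "barcode" plus digits, e.g. "barcode59"
BARCODE_PATTERN = re.compile(r"^barcode\d+$")


def check_microbiome_csv(path):
    """Return a list of problems found in one CSV file (empty list means OK)."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = set(reader.fieldnames or [])
        rows = list(reader)

    problems = []

    missing = REQUIRED_COLUMNS - columns
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")

    if not any(BARCODE_PATTERN.match(name) for name in columns):
        problems.append("no barcode[N] column found")

    if "species" in columns:
        # Count distinct, non-empty species names
        species = {row["species"].strip() for row in rows if row["species"]}
        if len(species) < MIN_SPECIES:
            problems.append(
                f"only {len(species)} species (need at least {MIN_SPECIES})"
            )

    if "phylum" in columns:
        phyla = {row["phylum"].strip() for row in rows if row["phylum"]}
        absent = REQUIRED_PHYLA - phyla
        if absent:
            problems.append(f"missing phyla: {sorted(absent)}")

    return problems
```

Running `check_microbiome_csv(f)` over each file in `csv_files` and printing any non-empty result gives a quick list of files likely to fail validation. The species count is treated here as a hard failure, although the requirement is phrased as a recommendation.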

### 🚀 Performance Tips:

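- Enable parallel processing (`parallel_processing=True`) to speed up large batches
- Adjust `max_workers` based on your system's CPU cores
- Validation can be switched off with the checkbox in section 4; consider doing so only for datasets that have already passed it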
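
One way to pick `max_workers` from the CPU core count is sketched below. The helper `suggest_max_workers` and its `reserve` parameter are hypothetical and not part of the project; the heuristic (leave one core free, never use more workers than files) is a common rule of thumb rather than a tuned recommendation.

```python
import os


def suggest_max_workers(n_files, cpu_count=None, reserve=1):
    """Suggest a worker count for a batch of n_files.

    Leaves `reserve` cores free for the notebook and OS, and never
    suggests more workers than there are files to process.
    """
    if cpu_count is None:
        # os.cpu_count() can return None when the count is undetermined
        cpu_count = os.cpu_count() or 1
    usable = max(1, cpu_count - reserve)
    return max(1, min(usable, n_files))
```

In section 2 this could be used as `max_workers=suggest_max_workers(len(csv_files))`, once `csv_files` has been listed. If memory rather than CPU is the bottleneck, a smaller value may still be needed.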