# Enhanced FASTQ-to-PDF Pipeline: Single Horse Analysis

**Enhanced pipeline for equine microbiome analysis with statistical validation**

This notebook implements the enhanced FASTQ-to-PDF pipeline, which treats technical replicates as one single-horse analysis instead of producing a separate report per barcode. Based on Gemini feedback, this version includes:

✅ **Statistical validation** of technical replicates (Spearman correlation)  
✅ **Proper normalization** (relative abundance, CPM, or rarefaction)  
✅ **Quality control** metrics and warnings  
✅ **Single comprehensive report** per patient  
✅ **Enhanced error handling** and logging  
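
As a rough illustration of two of the normalization options listed above, the sketch below converts raw read counts to relative abundance and to counts per million (CPM). The function names and the example counts are hypothetical and are not taken from the project's `barcode_aggregator` module; rarefaction is omitted because it requires random subsampling.

```python
def relative_abundance(counts):
    """Convert raw read counts to percentages that sum to 100."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with zero reads")
    return {species: 100.0 * n / total for species, n in counts.items()}


def counts_per_million(counts):
    """Scale raw read counts so the sample total equals 1,000,000."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize a sample with zero reads")
    return {species: 1_000_000 * n / total for species, n in counts.items()}


# Hypothetical read counts for one barcode (illustrative values only).
example_counts = {"species_a": 60, "species_b": 30, "species_c": 10}

rel = relative_abundance(example_counts)
cpm = counts_per_million(example_counts)
print(rel)
print(cpm)
```

Both methods divide by the sample total, so they differ only in scale; either one makes barcodes with different read depths comparable before aggregation.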

## Key Improvements Over Original Pipeline

- **Single Horse Paradigm**: Multiple barcodes = technical replicates of same horse
- **Statistical Rigor**: Spearman correlation between each pair of replicates must meet r ≥ 0.7 and p ≤ 0.05
- **Data Normalization**: Proper abundance normalization before aggregation
- **Quality Metrics**: Comprehensive QC reporting and validation
- **Professional Output**: Single PDF report with combined analysis
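
The replicate check relies on Spearman rank correlation. The following is a minimal standard-library sketch of how that coefficient can be computed, assuming two equal-length abundance vectors; the helper names and example values are hypothetical, and the pipeline's own implementation may differ. It does not compute a p-value; `scipy.stats.spearmanr` returns both the coefficient and the p-value if SciPy is available.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances of five species in two replicates.
replicate_a = [10, 20, 30, 40, 50]
replicate_b = [12, 18, 33, 41, 39]
rho = spearman(replicate_a, replicate_b)
print(f"rho = {rho:.3f}")
```

Because the coefficient depends only on ranks, it is insensitive to differences in sequencing depth between replicates, which is one reason it suits this kind of comparison.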

## Setup and Configuration

In [1]:
# Import required libraries
import sys
import os
import logging
import pandas as pd
from pathlib import Path
from datetime import datetime

# Add src directory to Python path
project_root = Path.cwd().parent
src_path = project_root / 'src'
sys.path.insert(0, str(src_path))

# Import enhanced pipeline components
from enhanced_notebook_interface import (
    run_enhanced_pipeline, 
    PatientInfo, 
    ProcessingMode,
    generate_simple_pdf_report
)
from barcode_aggregator import AggregationConfig

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

print("✅ Enhanced FASTQ-to-PDF Pipeline - Setup Complete")
print(f"📁 Project root: {project_root}")
print(f"📁 Source path: {src_path}")
print(f"⏰ Session started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✅ Enhanced FASTQ-to-PDF Pipeline - Setup Complete
📁 Project root: /home/trentleslie/Insync/projects/equine-microbiome-reporter
📁 Source path: /home/trentleslie/Insync/projects/equine-microbiome-reporter/src
⏰ Session started: 2025-08-04 01:28:23


## Patient Configuration

Configure patient information for the single horse being analyzed.

In [2]:
# Configure patient information
patient = PatientInfo(
    name="Montana",
    species="Horse",
    age="20 years",
    sample_number="004-006_combined",
    barcode_range="barcode04-barcode06",
    date_received="2024-08-04",
    date_analyzed="2024-08-04",
    performed_by="Julia Kończak",
    requested_by="Dr. Alexandra Matusiak",
    notes="Combined analysis from 3 technical replicates with statistical validation"
)

print("👤 Patient Configuration:")
print(f"   Name: {patient.name}")
print(f"   Species: {patient.species}")
print(f"   Age: {patient.age}")
print(f"   Sample Number: {patient.sample_number}")
print(f"   Barcode Range: {patient.barcode_range}")
print(f"   Performed by: {patient.performed_by}")
print(f"   Requested by: {patient.requested_by}")
print(f"   Notes: {patient.notes}")

👤 Patient Configuration:
   Name: Montana
   Species: Horse
   Age: 20 years
   Sample Number: 004-006_combined
   Barcode Range: barcode04-barcode06
   Performed by: Julia Kończak
   Requested by: Dr. Alexandra Matusiak
   Notes: Combined analysis from 3 technical replicates with statistical validation


## Processing Configuration

Configure the enhanced processing parameters for single-horse analysis with statistical validation.

In [3]:
# Configure processing mode for single horse analysis
processing_mode = ProcessingMode(
    mode="single_horse",                    # Single horse analysis (not multiple)
    combine_barcodes=True,                  # Combine technical replicates
    normalization_method="relative_abundance",  # Normalization method
    correlation_threshold=0.7,              # Minimum correlation for valid replicates
    p_value_threshold=0.05                  # Statistical significance threshold
)

# Data directories and barcode configuration
data_dir = "../data"                       # Directory containing FASTQ files
barcode_dirs = ["barcode04", "barcode05", "barcode06"]  # Technical replicates
output_dir = "enhanced_results"            # Output directory

print("⚙️ Processing Configuration:")
print(f"   Mode: {processing_mode.mode}")
print(f"   Combine barcodes: {processing_mode.combine_barcodes}")
print(f"   Normalization: {processing_mode.normalization_method}")
print(f"   Correlation threshold: {processing_mode.correlation_threshold}")
print(f"   P-value threshold: {processing_mode.p_value_threshold}")
print(f"   Data directory: {data_dir}")
print(f"   Barcode directories: {barcode_dirs}")
print(f"   Output directory: {output_dir}")

⚙️ Processing Configuration:
   Mode: single_horse
   Combine barcodes: True
   Normalization: relative_abundance
   Correlation threshold: 0.7
   P-value threshold: 0.05
   Data directory: ../data
   Barcode directories: ['barcode04', 'barcode05', 'barcode06']
   Output directory: enhanced_results
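
Threshold settings like these are easy to mistype, so it can help to validate them when the configuration object is created. The sketch below is a hypothetical stand-in for `ProcessingMode`, not the project's class; the field names mirror the cell above, while the set of accepted normalization names (in particular `"cpm"` and `"rarefaction"`) is an assumption based on the options listed in the introduction.

```python
from dataclasses import dataclass

# Assumed method names; only "relative_abundance" appears in this notebook.
ALLOWED_NORMALIZATIONS = {"relative_abundance", "cpm", "rarefaction"}


@dataclass(frozen=True)
class ProcessingModeSketch:
    """Illustrative stand-in for ProcessingMode with basic validation."""

    mode: str = "single_horse"
    combine_barcodes: bool = True
    normalization_method: str = "relative_abundance"
    correlation_threshold: float = 0.7
    p_value_threshold: float = 0.05

    def __post_init__(self):
        if self.normalization_method not in ALLOWED_NORMALIZATIONS:
            raise ValueError(
                f"unknown normalization method: {self.normalization_method!r}"
            )
        # A correlation coefficient always lies in [-1, 1].
        if not -1.0 <= self.correlation_threshold <= 1.0:
            raise ValueError("correlation_threshold must be within [-1, 1]")
        # A significance level must be a probability greater than zero.
        if not 0.0 < self.p_value_threshold <= 1.0:
            raise ValueError("p_value_threshold must be within (0, 1]")


config = ProcessingModeSketch()
print(config)
```

Failing at construction time means a bad threshold is reported before any FASTQ files are processed.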


## Enhanced Pipeline Execution

Run the complete enhanced pipeline with FASTQ processing, statistical validation, and PDF generation.

In [4]:
# Execute enhanced FASTQ-to-PDF pipeline
print("🚀 Starting Enhanced FASTQ-to-PDF Pipeline")
print("=" * 50)

# Run enhanced pipeline
result = run_enhanced_pipeline(
    data_dir=data_dir,
    barcode_dirs=barcode_dirs,
    patient=patient,
    output_dir=output_dir,
    processing_mode=processing_mode
)

if result.success:
    print("\n✅ Enhanced Pipeline Execution Successful!")
    print(f"   Processing time: {result.total_processing_time:.2f} seconds")
    print(f"   Original species count: {result.species_count}")
    print(f"   Barcode count: {result.barcode_count}")
    print(f"   Combined species count: {result.combined_species_count}")
    print(f"   CSV generated: {result.csv_path}")
    
    if result.combined_csv_path:
        print(f"   Combined CSV: {result.combined_csv_path}")
    
    # Display aggregation results if available
    if result.aggregation_result and result.aggregation_result.success:
        agg = result.aggregation_result
        print(f"\n📊 Statistical Validation Results:")
        print(f"   Combined total reads: {agg.combined_total_reads}")
        print(f"   Normalization method: {agg.normalization_stats.get('method', 'unknown')}")
        
        # Show correlation results
        print(f"\n📈 Barcode Correlations:")
        for pair, corr in agg.validation_result.correlations.items():
            p_val = agg.validation_result.p_values[pair]
            is_valid = (corr >= processing_mode.correlation_threshold
                        and p_val <= processing_mode.p_value_threshold)
            valid = "✅ Valid" if is_valid else "⚠️ Warning"
            print(f"   {valid}: {pair} (r={corr:.3f}, p={p_val:.3f})")
        
        # Show species overlap
        overlap = agg.species_overlap
        print(f"\n🦠 Species Analysis:")
        print(f"   Total unique species: {overlap['total_unique_species']}")
        print(f"   Common to all barcodes: {overlap['common_to_all_barcodes']}")
        
        # Show quality warnings
        if agg.validation_result.warnings:
            print(f"\n⚠️ Quality Warnings:")
            for warning in agg.validation_result.warnings:
                print(f"   • {warning}")
    
else:
    print(f"\n❌ Pipeline execution failed: {result.error}")

2025-08-04 01:28:38,350 - enhanced_notebook_interface - INFO - Starting enhanced FASTQ processing for patient: Montana
2025-08-04 01:28:38,351 - enhanced_notebook_interface - INFO - Processing mode: single_horse
2025-08-04 01:28:38,352 - enhanced_notebook_interface - INFO - Barcode directories: ['barcode04', 'barcode05', 'barcode06']
2025-08-04 01:28:38,352 - enhanced_notebook_interface - INFO - Step 1: Processing FASTQ files with taxonomic classification...
2025-08-04 01:28:38,353 - real_fastq_processor - INFO - Initialized minimal classifier with 12 reference k-mers
2025-08-04 01:28:38,353 - real_fastq_processor - INFO - Processing 3 barcode directories
2025-08-04 01:28:38,353 - real_fastq_processor - INFO - Processing barcode04...
2025-08-04 01:28:38,354 - real_fastq_processor - INFO - Found 171 FASTQ files in ../data/barcode04


🚀 Starting Enhanced FASTQ-to-PDF Pipeline


2025-08-04 01:28:39,024 - real_fastq_processor - INFO -   barcode04: 1110 reads, 101 classified, 12 species
2025-08-04 01:28:39,024 - real_fastq_processor - INFO - Processing barcode05...
2025-08-04 01:28:39,025 - real_fastq_processor - INFO - Found 158 FASTQ files in ../data/barcode05
2025-08-04 01:28:39,907 - real_fastq_processor - INFO -   barcode05: 1009 reads, 126 classified, 11 species
2025-08-04 01:28:39,908 - real_fastq_processor - INFO - Processing barcode06...
2025-08-04 01:28:39,908 - real_fastq_processor - INFO - Found 168 FASTQ files in ../data/barcode06
2025-08-04 01:28:40,859 - real_fastq_processor - INFO -   barcode06: 1209 reads, 166 classified, 12 species
2025-08-04 01:28:40,861 - real_fastq_processor - INFO - Generated abundance table: 12 species, 3 barcodes
2025-08-04 01:28:40,863 - real_fastq_processor - INFO - Successfully generated abundance CSV: enhanced_results/processed_abundance.csv
2025-08-04 01:28:40,864 - real_fastq_processor - INFO -   Species: 12
2025-08


✅ Enhanced Pipeline Execution Successful!
   Processing time: 2.54 seconds
   Original species count: 12
   Barcode count: 3
   Combined species count: 12
   CSV generated: enhanced_results/processed_abundance.csv
   Combined CSV: enhanced_results/combined_abundance_montana.csv

📊 Statistical Validation Results:
   Combined total reads: 299
   Normalization method: relative_abundance

📈 Barcode Correlations:
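   ✅ Valid: barcode04_vs_barcode05 (r=0.822, p=0.001)
   ⚠️ Warning: barcode04_vs_barcode06 (r=0.696, p=0.012)
   ⚠️ Warning: barcode05_vs_barcode06 (r=0.608, p=0.036)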

🦠 Species Analysis:
   Total unique species: 12
   Common to all barcodes: 11

⚠️ Quality Warnings:
   • Low correlation barcode04-barcode06: r=0.696, p=0.012
   • Low correlation barcode05-barcode06: r=0.608, p=0.036
   • Some barcodes show poor correlation - may not represent same sample
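
The processing log above reports both the total and the classified read counts for each barcode, so the classification rate can be worked out directly. The sketch below uses the numbers from the log; the function name is hypothetical and no pass/fail threshold is implied.

```python
# (total reads, classified reads) per barcode, copied from the log above.
barcode_stats = {
    "barcode04": (1110, 101),
    "barcode05": (1009, 126),
    "barcode06": (1209, 166),
}


def classification_rate(total_reads, classified_reads):
    """Return the percentage of reads that received a taxonomic label."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * classified_reads / total_reads


rates = {
    barcode: classification_rate(total, classified)
    for barcode, (total, classified) in barcode_stats.items()
}

for barcode, rate in rates.items():
    print(f"{barcode}: {rate:.1f}% of reads classified")

# Pooled rate across all three replicates.
all_total = sum(total for total, _ in barcode_stats.values())
all_classified = sum(classified for _, classified in barcode_stats.values())
overall_rate = classification_rate(all_total, all_classified)
print(f"overall: {overall_rate:.1f}% of reads classified")
```

Roughly 9% to 14% of reads are classified in each barcode, so the abundance estimates rest on 101 to 166 reads per replicate. That small sample size is worth keeping in mind when reading the correlation values.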


## PDF Report Generation

Generate the final professional PDF report from the combined microbiome data.

In [5]:
# Generate PDF report from combined data
if result.success and result.combined_csv_path:
    print("📄 Generating Professional PDF Report")
    print("=" * 40)
    
    # Define PDF output path
    pdf_filename = f"{patient.name}_enhanced_microbiome_report.pdf"
    pdf_path = Path(output_dir) / pdf_filename
    
    print(f"📄 Report path: {pdf_path}")
    print(f"📊 Data source: {result.combined_csv_path}")
    
    # Generate PDF report using combined data
    pdf_success = generate_simple_pdf_report(
        csv_path=result.combined_csv_path,
        patient_info=patient,
        output_path=str(pdf_path),
        barcode_column="total"  # Use combined total column
    )
    
    if pdf_success and pdf_path.exists():
        file_size = pdf_path.stat().st_size / 1024
        print(f"\n✅ PDF Report Generated Successfully!")
        print(f"   File: {pdf_path}")
        print(f"   Size: {file_size:.1f} KB")
        print(f"   Patient: {patient.name}")
        print(f"   Sample: {patient.sample_number}")
        print(f"   Combined from: {patient.barcode_range}")
    else:
        print(f"❌ PDF generation failed")
        
else:
    print("⚠️ Skipping PDF generation - no combined CSV available")

2025-08-04 01:28:53,273 - enhanced_notebook_interface - INFO - Generating report from combined CSV: combined_abundance_montana.csv
2025-08-04 01:28:53,275 - notebook_pdf_generator - INFO - NotebookPDFGenerator initialized for language: en
2025-08-04 01:28:53,276 - notebook_pdf_generator - INFO - Starting report generation for Montana
2025-08-04 01:28:53,279 - notebook_pdf_generator - INFO - Processed 12 species


📄 Generating Professional PDF Report
📄 Report path: enhanced_results/Montana_enhanced_microbiome_report.pdf
📊 Data source: enhanced_results/combined_abundance_montana.csv


2025-08-04 01:28:53,954 - notebook_pdf_generator - INFO - Generated 2 charts successfully
2025-08-04 01:28:53,955 - notebook_pdf_generator - INFO - Generated 2 charts
2025-08-04 01:28:53,958 - notebook_llm_engine - INFO - Environment variables loaded from /home/trentleslie/Insync/projects/equine-microbiome-reporter/.env (result: True)
2025-08-04 01:29:12,255 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-08-04 01:29:12,261 - notebook_pdf_generator - INFO - Added 5 LLM-powered recommendations
2025-08-04 01:29:12,743 - weasyprint - ERROR - Relative URI reference without a base URI: <img src="assets/dna_stock_photo.jpg">
2025-08-04 01:29:12,743 - weasyprint - ERROR - Relative URI reference without a base URI: <img src="assets/hippovet_logo.png">
2025-08-04 01:29:12,744 - weasyprint - ERROR - Relative URI reference without a base URI: <img src="assets/hippovet_logo.png">
2025-08-04 01:29:12,744 - weasyprint - ERROR - Relative URI refer


✅ PDF Report Generated Successfully!
   File: enhanced_results/Montana_enhanced_microbiome_report.pdf
   Size: 22.6 KB
   Patient: Montana
   Sample: 004-006_combined
   Combined from: barcode04-barcode06
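
The WeasyPrint errors in the log ("Relative URI reference without a base URI") indicate that relative image paths such as `assets/hippovet_logo.png` could not be resolved, so the report was presumably rendered without those images. WeasyPrint's `HTML` class accepts a `base_url` argument, which is usually the simplest fix. How `notebook_pdf_generator` builds its HTML is not shown here, so the sketch below illustrates a standard-library alternative: rewriting relative `src` values to absolute `file://` URIs before rendering. The function name is hypothetical.

```python
import re
from pathlib import Path
from urllib.parse import urljoin


def absolutize_img_sources(html, base_dir):
    """Rewrite relative <img src="..."> values into absolute file:// URIs.

    Only double-quoted src attributes are handled. Sources that are
    already absolute URLs are left unchanged by urljoin.
    """
    # as_uri() needs an absolute path; the trailing slash makes urljoin
    # treat the base as a directory rather than a file.
    base_uri = Path(base_dir).resolve().as_uri() + "/"

    def replace(match):
        prefix, src, closing_quote = match.groups()
        return f"{prefix}{urljoin(base_uri, src)}{closing_quote}"

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)(")', replace, html)


snippet = '<img src="assets/hippovet_logo.png">'
print(absolutize_img_sources(snippet, "."))
```

With absolute URIs in place the renderer no longer depends on a base URI, at the cost of tying the HTML to one machine's directory layout.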


## Data Analysis and Visualization

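Load the combined microbiome data and summarize it by species and phylum. Note that the `total_combined` values below sum to 300%, not 100%: with three barcodes each normalized to 100%, this is consistent with the column being a sum across replicates rather than a mean.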

In [6]:
# Load and analyze combined microbiome data
if result.success and result.combined_csv_path:
    print("📊 Combined Microbiome Data Analysis")
    print("=" * 40)
    
    # Load combined data
    combined_df = pd.read_csv(result.combined_csv_path)
    
    print(f"🦠 Species Summary:")
    print(f"   Total species: {len(combined_df)}")
    print(f"   Total combined abundance: {combined_df['total_combined'].sum():.2f}%")
    print(f"   Average abundance per species: {combined_df['total_combined'].mean():.2f}%")
    
    # Top 10 species
    print(f"\n🏆 Top 10 Most Abundant Species:")
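    # Sort explicitly so the ranking does not depend on the CSV row order
    top_species = combined_df.sort_values('total_combined', ascending=False).head(10)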
    for i, (_, row) in enumerate(top_species.iterrows(), 1):
        print(f"   {i:2d}. {row['species']:<30} {row['total_combined']:>6.2f}%  ({row['phylum']})")
    
    # Phylum distribution
    print(f"\n🧬 Phylum Distribution:")
    phylum_dist = combined_df.groupby('phylum')['total_combined'].sum().sort_values(ascending=False)
    for phylum, abundance in phylum_dist.items():
        print(f"   {phylum:<20} {abundance:>6.2f}%")
    
    # Display dataframe structure
    print(f"\n📋 Data Structure:")
    print(f"   Columns: {list(combined_df.columns)}")
    print(f"   Shape: {combined_df.shape}")
    
    # Show first few rows
    print(f"\n📄 Sample Data (Top 5 Species):")
    display(combined_df.head()[['species', 'phylum', 'genus', 'total_combined']].round(2))
    
else:
    print("⚠️ No combined data available for analysis")

📊 Combined Microbiome Data Analysis
🦠 Species Summary:
   Total species: 12
   Total combined abundance: 300.00%
   Average abundance per species: 25.00%

🏆 Top 10 Most Abundant Species:
    1. Bacteroides fragilis            49.95%  (Bacteroidota)
    2. Lactobacillus acidophilus       45.97%  (Bacillota)
    3. Fibrobacter succinogenes        43.55%  (Fibrobacterota)
    4. Prevotella copri                32.46%  (Bacteroidota)
    5. Salmonella enterica             22.32%  (Pseudomonadota)
    6. Bacillus subtilis               20.27%  (Bacillota)
    7. Escherichia coli                18.89%  (Pseudomonadota)
    8. Enterococcus faecalis           17.94%  (Bacillota)
    9. Streptomyces albidoflavus       14.96%  (Actinomycetota)
   10. Clostridium perfringens         13.55%  (Bacillota)

🧬 Phylum Distribution:
   Bacillota             97.72%
   Bacteroidota          82.41%
   Fibrobacterota        43.55%
   Pseudomonadota        41.20%
   Actinomycetota        35.11%

📋 Data Struc

|   | species | phylum | genus | total_combined |
|---|---------|--------|-------|----------------|
| 0 | Bacteroides fragilis | Bacteroidota | Bacteroides | 49.95 |
| 1 | Lactobacillus acidophilus | Bacillota | Lactobacillus | 45.97 |
| 2 | Fibrobacter succinogenes | Fibrobacterota | Fibrobacter | 43.55 |
| 3 | Prevotella copri | Bacteroidota | Prevotella | 32.46 |
| 4 | Salmonella enterica | Pseudomonadota | Salmonella | 22.32 |
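
The reported total of 300.00% suggests that `total_combined` adds up the relative abundance from each of the three barcodes. Under that assumption, which the source does not confirm, dividing by the number of barcodes gives a mean relative abundance on the usual 0 to 100% scale. The sketch below applies this to the five species shown above; the function name is hypothetical.

```python
N_BARCODES = 3  # barcode04, barcode05, barcode06

# total_combined values for the top five species, copied from the output above.
total_combined = {
    "Bacteroides fragilis": 49.95,
    "Lactobacillus acidophilus": 45.97,
    "Fibrobacter succinogenes": 43.55,
    "Prevotella copri": 32.46,
    "Salmonella enterica": 22.32,
}


def mean_relative_abundance(summed_percent, n_replicates):
    """Convert a sum of per-replicate percentages into their mean."""
    if n_replicates <= 0:
        raise ValueError("n_replicates must be positive")
    return summed_percent / n_replicates


rescaled = {
    species: mean_relative_abundance(value, N_BARCODES)
    for species, value in total_combined.items()
}

for species, value in rescaled.items():
    print(f"{species:<30} {value:>6.2f}%")
```

On this scale the full table would total 100%, which is easier to interpret in a clinical report than percentages that exceed 100% when summed.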


## Quality Control Summary

Display comprehensive quality control metrics and validation results.

In [7]:
# Display quality control summary
if result.success and result.quality_report:
    print("🔍 Quality Control Summary")
    print("=" * 30)
    
    qc = result.quality_report
    
    # Original data quality
    print(f"📊 Original Data Quality:")
    if 'original_data' in qc:
        for barcode, metrics in qc['original_data'].items():
            print(f"   {barcode}:")
            print(f"     Total reads: {metrics['total_reads']:,}")
            print(f"     Species count: {metrics['species_count']}")
            print(f"     Max abundance: {metrics['max_abundance']:.1f}")
            print(f"     Top species: {metrics['top_species']}")
    
    # Aggregation quality
    print(f"\n🔬 Aggregation Quality:")
    if 'aggregation' in qc and qc['aggregation']['success']:
        agg_qc = qc['aggregation']
        print(f"   Status: ✅ Success")
        print(f"   Normalization: {agg_qc['normalization_method']}")
        print(f"   Combined species: {agg_qc['combined_species_count']}")
        print(f"   Combined reads: {agg_qc['combined_total_reads']}")
        
        # Correlation summary
        if 'correlations' in agg_qc:
            print(f"\n📈 Correlation Summary:")
            for pair, corr in agg_qc['correlations'].items():
                p_val = agg_qc['p_values'][pair]
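                # Use the configured thresholds rather than hard-coded values
                meets_thresholds = (corr >= processing_mode.correlation_threshold
                                    and p_val <= processing_mode.p_value_threshold)
                status = "✅" if meets_thresholds else "⚠️"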
                print(f"   {status} {pair}: r={corr:.3f}, p={p_val:.3f}")
    else:
        print(f"   Status: ❌ Failed or not performed")
    
    # Recommendations
    if 'recommendations' in qc and qc['recommendations']:
        print(f"\n💡 Recommendations:")
        for rec in qc['recommendations']:
            print(f"   • {rec}")
    
    print(f"\n⏰ Analysis completed: {qc.get('timestamp', 'Unknown')}")
    
else:
    print("⚠️ No quality control data available")

🔍 Quality Control Summary
📊 Original Data Quality:
   barcode04:
     Total reads: 101
     Species count: 12
     Max abundance: 19.0
     Top species: Lactobacillus acidophilus
   barcode05:
     Total reads: 126
     Species count: 11
     Max abundance: 31.0
     Top species: Bacteroides fragilis
   barcode06:
     Total reads: 166
     Species count: 12
     Max abundance: 24.0
     Top species: Bacteroides fragilis

🔬 Aggregation Quality:
   Status: ✅ Success
   Normalization: relative_abundance
   Combined species: 12
   Combined reads: 299

📈 Correlation Summary:
   ✅ barcode04_vs_barcode05: r=0.822, p=0.001
   ⚠️ barcode04_vs_barcode06: r=0.696, p=0.012
   ⚠️ barcode05_vs_barcode06: r=0.608, p=0.036

💡 Recommendations:

⏰ Analysis completed: 2025-08-04 01:28:40
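
The correlation summary can also be used to see which replicate is involved in the failing pairs. The sketch below applies the configured thresholds to the three reported pairs and counts how often each barcode appears in a pair that misses them. The variable and function names are hypothetical, and a high count is only a hint about where to look, not proof that a barcode is faulty.

```python
from collections import Counter

CORRELATION_THRESHOLD = 0.7
P_VALUE_THRESHOLD = 0.05

# (r, p) for each barcode pair, copied from the correlation summary above.
pair_stats = {
    ("barcode04", "barcode05"): (0.822, 0.001),
    ("barcode04", "barcode06"): (0.696, 0.012),
    ("barcode05", "barcode06"): (0.608, 0.036),
}


def meets_thresholds(r, p):
    """A pair is valid only if it passes both the r and the p threshold."""
    return r >= CORRELATION_THRESHOLD and p <= P_VALUE_THRESHOLD


failing_pairs = [
    pair for pair, (r, p) in pair_stats.items() if not meets_thresholds(r, p)
]

# Count how many failing pairs each barcode takes part in.
failure_counts = Counter(barcode for pair in failing_pairs for barcode in pair)

print(f"Failing pairs: {len(failing_pairs)}/{len(pair_stats)}")
for barcode, count in failure_counts.most_common():
    print(f"  {barcode}: in {count} failing pair(s)")
```

Here all three p-values pass and only the correlation coefficient falls short, with barcode06 present in both failing pairs. The barcode04 versus barcode06 pair misses the 0.7 cutoff by 0.004, so the outcome is sensitive to where the threshold is set.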


## Pipeline Summary

Final summary of the enhanced pipeline execution and results.

In [8]:
# Final pipeline summary
print("🎯 Enhanced Pipeline Summary")
print("=" * 40)

if result.success:
    print(f"✅ Pipeline Status: SUCCESS")
    print(f"📊 Processing Summary:")
    print(f"   Patient: {patient.name}")
    print(f"   Sample: {patient.sample_number}")
    print(f"   Barcodes processed: {result.barcode_count}")
    print(f"   Original species: {result.species_count}")
    print(f"   Combined species: {result.combined_species_count}")
    print(f"   Processing time: {result.total_processing_time:.2f}s")
    
    print(f"\n📁 Output Files:")
    if result.csv_path:
        print(f"   Original CSV: {result.csv_path}")
    if result.combined_csv_path:
        print(f"   Combined CSV: {result.combined_csv_path}")
    
    pdf_path = Path(output_dir) / f"{patient.name}_enhanced_microbiome_report.pdf"
    if pdf_path.exists():
        print(f"   PDF Report: {pdf_path}")
    
    print(f"\n🔬 Enhanced Features:")
    print(f"   ✅ Single horse analysis (not multiple reports)")
    print(f"   ✅ Statistical validation of technical replicates")
    print(f"   ✅ Data normalization ({processing_mode.normalization_method})")
    print(f"   ✅ Quality control metrics and warnings")
    print(f"   ✅ Comprehensive error handling and logging")
    print(f"   ✅ Professional PDF report generation")
    
    # Validation summary
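    if result.aggregation_result and result.aggregation_result.success:
        validation = result.aggregation_result.validation_result
        # A pair counts as valid only if it meets both thresholds,
        # matching the per-pair check in the execution cell
        valid_pairs = sum(
            1 for pair, corr in validation.correlations.items()
            if corr >= processing_mode.correlation_threshold
            and validation.p_values[pair] <= processing_mode.p_value_threshold
        )
        total_pairs = len(validation.correlations)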
        print(f"\n📈 Validation Summary:")
        print(f"   Valid correlations: {valid_pairs}/{total_pairs}")
        print(f"   Correlation threshold: r ≥ {processing_mode.correlation_threshold}")
        print(f"   Significance threshold: p ≤ {processing_mode.p_value_threshold}")
    
else:
    print(f"❌ Pipeline Status: FAILED")
    print(f"   Error: {result.error}")
    print(f"   Processing time: {result.total_processing_time:.2f}s")

print(f"\n⏰ Session completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n" + "=" * 60)
print("🐎 Enhanced FASTQ-to-PDF Pipeline - Complete")
print("Single Horse • Statistical Validation • Professional Reports")
print("=" * 60)

🎯 Enhanced Pipeline Summary
✅ Pipeline Status: SUCCESS
📊 Processing Summary:
   Patient: Montana
   Sample: 004-006_combined
   Barcodes processed: 3
   Original species: 12
   Combined species: 12
   Processing time: 2.54s

📁 Output Files:
   Original CSV: enhanced_results/processed_abundance.csv
   Combined CSV: enhanced_results/combined_abundance_montana.csv
   PDF Report: enhanced_results/Montana_enhanced_microbiome_report.pdf

🔬 Enhanced Features:
   ✅ Single horse analysis (not multiple reports)
   ✅ Statistical validation of technical replicates
   ✅ Data normalization (relative_abundance)
   ✅ Quality control metrics and warnings
   ✅ Comprehensive error handling and logging
   ✅ Professional PDF report generation

📈 Validation Summary:
   Valid correlations: 1/3
   Correlation threshold: r ≥ 0.7
   Significance threshold: p ≤ 0.05

⏰ Session completed: 2025-08-04 01:29:27

🐎 Enhanced FASTQ-to-PDF Pipeline - Complete
Single Horse • Statistical Validation • Professional Reports
