# FASTQ Processing Pipeline for Equine Microbiome Analysis

This notebook provides a simple interface for processing 16S rRNA FASTQ sequencing data and generating professional PDF reports for equine microbiome analysis.

## Features
- **Quality Control**: Analyze sequence quality with visual reports
- **FASTQ to CSV**: Convert sequencing data to abundance tables
- **PDF Reports**: Generate professional veterinary reports

## Quick Start
1. Install dependencies: `poetry install`
2. Place your FASTQ files in a directory the notebook can read
3. Run the cells below

In [None]:
# Import required modules
import sys
import os
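# `__file__` is not defined inside a notebook, so resolve the project root
# as the parent of the current working directory instead.
sys.path.append(os.path.dirname(os.getcwd()))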

from src.pipeline_integrator import MicrobiomePipelineIntegrator
from src.data_models import PatientInfo
import pandas as pd

## 1. Single Sample Processing

Process a single FASTQ file and generate a complete report.

In [None]:
# Initialize the pipeline
pipeline = MicrobiomePipelineIntegrator(output_dir="pipeline_output")

# Define patient information
patient_info = {
    'name': 'Montana',          # Horse name
    'age': '20 years',         # Horse age
    'sample_number': '506',    # Lab sample number
    'performed_by': 'Dr. Smith',
    'requested_by': 'Dr. Johnson'
}

# Process the sample (update with your FASTQ file path)
# results = pipeline.process_sample(
#     fastq_file="path/to/your/sample.fastq.gz",
#     patient_info=patient_info,
#     barcode_column="barcode59",  # Column name in CSV
#     language="en"  # Options: "en", "pl", "jp"
# )

print("Ready to process! Uncomment the code above and provide your FASTQ file path.")

## 2. Batch Processing

Process multiple samples using a manifest file.

In [None]:
# Create a manifest template
MicrobiomePipelineIntegrator.create_manifest_template("sample_manifest.csv")

# View the template
manifest = pd.read_csv("sample_manifest.csv")
manifest

In [None]:
# Process all samples in the manifest
# Update the manifest CSV with your actual file paths and sample information

# manifest = pd.read_csv("sample_manifest.csv")
# results = pipeline.batch_process(manifest)

print("Ready for batch processing! Update the manifest CSV and uncomment the code above.")

## 3. Quality Control Only

Run QC analysis on FASTQ files without full processing.

In [None]:
from src.fastq_qc import FASTQQualityControl

# Run QC on a single file
# qc = FASTQQualityControl("path/to/your/sample.fastq.gz")
# qc_results = qc.run_qc()
# qc.print_summary()
# qc.plot_quality_metrics(save_path="qc_report.png")

print("QC module ready! Provide a FASTQ file path to analyze.")

## 4. Custom FASTQ to CSV Conversion

Convert FASTQ files to CSV format with custom parameters.

In [None]:
from src.fastq_converter import FASTQtoCSVConverter

# Create the converter (quality thresholds are passed per call below)
converter = FASTQtoCSVConverter()

# Process with custom parameters
# df = converter.process_fastq_files(
#     fastq_files=["sample1.fastq", "sample2.fastq"],
#     sample_names=["59", "60"],
#     min_quality=30,  # Q30 threshold (99.9% accuracy)
#     min_length=250   # Minimum 250bp reads
# )
# 
# converter.save_to_csv(df, "custom_abundance_table.csv")

print("Converter ready with customizable quality parameters!")
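
To make the `min_quality` and `min_length` parameters more concrete, here is a minimal sketch of one way such a read filter could work. It is not the implementation inside `FASTQtoCSVConverter`: the helper names `read_fastq`, `mean_phred`, and `passes_filter` are hypothetical, and the sketch assumes Phred+33 encoding and a filter on the arithmetic mean of per-base scores.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each 4-line FASTQ record."""
    # Transparently handle gzip-compressed files.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file (or trailing blank line)
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality


def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean Phred score of a quality string (Phred+33 assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(sequence: str, quality: str,
                  min_quality: float = 30, min_length: int = 250) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    return len(sequence) >= min_length and mean_phred(quality) >= min_quality
```

A read stream could then be filtered with `[r for r in read_fastq(path) if passes_filter(r[1], r[2])]`. Averaging Phred scores directly is a simple heuristic; it tends to be more lenient than averaging the underlying error probabilities.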

## 5. Integration with Existing CSV Data

If you already have CSV abundance data, you can generate reports directly.

In [None]:
from src.report_generator import ReportGenerator
from src.data_models import PatientInfo

# Create patient information
patient = PatientInfo(
    name='Thunder',
    age='15 years',
    sample_number='507',
    performed_by='Dr. Smith',
    requested_by='Dr. Johnson'
)

# Generate report from existing CSV
generator = ReportGenerator(language='en')
# success = generator.generate_report(
#     csv_path='data/sample_1.csv',
#     patient_info=patient,
#     output_path='reports/thunder_report.pdf'
# )

print("Report generator ready! Provide CSV path to generate PDF.")

## Tips and Best Practices

### Quality Thresholds
- **Q20** (99% accuracy): Minimum acceptable quality
- **Q30** (99.9% accuracy): Recommended for high-quality analysis
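
The accuracy figures above follow from the standard Phred definition, Q = -10 · log10(p), where p is the probability that a base call is wrong. A short sketch of the conversion in both directions (the function names are illustrative, not part of this project's API):

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


# Print the error probability and accuracy for the two thresholds above.
for q in (20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.3g}, accuracy {100 * (1 - p):.1f}%")
```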

### File Organization
```
pipeline_output/
├── qc_reports/      # Quality control visualizations
├── csv_files/       # Abundance tables
├── pdf_reports/     # Final PDF reports
└── batch_summary.txt
```

### Troubleshooting
1. **Memory issues**: Process files in smaller batches
2. **Slow processing**: Reduce `sample_size` in QC analysis
3. **Missing dependencies**: Run `poetry install` in project root
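
For the memory tip above, one simple approach is to split the list of input files into fixed-size batches and process one batch at a time. The sketch below uses only the standard library; `iter_batches` and the file names are hypothetical placeholders.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder file names, for illustration only.
fastq_paths = [f"sample{i}.fastq.gz" for i in range(1, 8)]
for batch in iter_batches(fastq_paths, batch_size=3):
    print(batch)
```

The same idea applies to a manifest DataFrame: `manifest.iloc[start:start + batch_size]` selects a block of rows, which could then be handed to `pipeline.batch_process`, assuming it accepts a partial manifest.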

### Next Steps
- For production use, integrate a real taxonomy classifier backed by a reference database such as SILVA or Greengenes
- Consider DADA2 (denoising into amplicon sequence variants) or VSEARCH (OTU clustering) for grouping sequences
- Implement parallel processing for large datasets

In [None]:
# Check pipeline output structure
import os
from pathlib import Path

output_dir = Path("pipeline_output")
if output_dir.exists():
    print("Pipeline output structure:")
    for root, dirs, files in os.walk(output_dir):
        level = root.replace(str(output_dir), '').count(os.sep)
        indent = ' ' * 2 * level
        print(f"{indent}{os.path.basename(root)}/")
        sub_indent = ' ' * 2 * (level + 1)
        for file in files:
            print(f"{sub_indent}{file}")
else:
    print("No output directory yet. Run the pipeline to create it.")