# Microplastics Blind Correction Workflow

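This notebook walks through the complete workflow for processing microplastics particle data with blank and blind correction, where samples and blinds are stored in separate Excel files. The package is modular: loading, processing, and correction are handled by separate classes (`ExcelLoader`, `ParticleProcessor`, `BlankCorrector`, `BlindCorrector`), so particle data from different sources can pass through the same pipeline.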

## Overview
- Load actual sample data from Excel files
- Load corresponding blind sample data from separate Excel files
- Verify data structure consistency
- Apply processing pipeline (filtering, standardization)
- Perform blank and blind corrections
- Visualize results and generate reports
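
To make the correction step in the list above more concrete, here is a minimal sketch of one common way to subtract blind contamination: count particles per category in the sample and in the blind, subtract, and clip at zero. This is an illustrative assumption, not necessarily what `BlankCorrector` or `BlindCorrector` do internally. The function name `subtract_blind_counts` and the choice of grouping keys are hypothetical.

```python
import pandas as pd


def subtract_blind_counts(sample, blind, keys=("polymer_type", "color", "shape")):
    """Subtract per-category blind counts from sample counts (hypothetical helper).

    Assumption: a blind particle offsets one sample particle of the same
    polymer type, color, and shape. The real package may match differently.
    """
    keys = list(keys)
    sample_counts = sample.groupby(keys).size()
    blind_counts = blind.groupby(keys).size()

    # Align on category; categories absent from the blind subtract zero.
    corrected = sample_counts.sub(blind_counts, fill_value=0)

    # Keep only categories seen in the sample and never report negative counts.
    corrected = corrected.reindex(sample_counts.index).clip(lower=0).astype(int)
    return corrected.rename("corrected_count").reset_index()
```

The result is a table of counts per category rather than a corrected particle list, which is enough to see how much of each category the blind would explain.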

## Section 1: Import Required Libraries

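Import pandas, NumPy, the plotting libraries, and the `microplas_blind_corr` package modules. openpyxl is not imported directly; pandas uses it as the engine for reading `.xlsx` files, so it must be installed.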

In [None]:
# Import required libraries for data processing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys

# Add the src directory to Python path for importing our package
sys.path.insert(0, str(Path.cwd() / "src"))

# Import our microplastics processing modules
from microplas_blind_corr import (
    ExcelLoader,
    ParticleProcessor,
    BlankCorrector,
    BlindCorrector,
    ProcessingConfig
)
from microplas_blind_corr.config import EXCEL_COLUMN_MAPPING
from microplas_blind_corr.utils import (
    validate_dataframe_structure,
    calculate_particle_statistics,
    generate_processing_report,
    FileOrganizer
)

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

print("✅ Libraries imported successfully!")
print(f"📊 Using pandas version: {pd.__version__}")
print(f"📈 Using matplotlib version: {plt.matplotlib.__version__}")

## Section 2: Read Actual Sample Data

Load the Excel sheet containing the actual sample data. The cell below goes through the package's `ExcelLoader` instead of calling pandas `read_excel()` directly. This example uses the provided test data, an environmental sample with 18,516 particles.

In [None]:
# Initialize the Excel loader
loader = ExcelLoader(EXCEL_COLUMN_MAPPING)

# Load the actual sample data
sample_file = "test_data/250606_Sterni_500_5_Particle_List.xlsx"
sample_name = "Environmental_Sample_001"

print(f"📁 Loading actual sample data from: {sample_file}")

try:
    # Load the sample data
    actual_sample_data = loader.load_sample(sample_file, sample_name)
    
    print("✅ Successfully loaded sample data!")
    print(f"📊 Sample: {sample_name}")
    print(f"🔬 Particles: {len(actual_sample_data):,}")
    print(f"📋 Columns: {len(actual_sample_data.columns)}")
    
    # Display basic information
    print("\n📈 Data Overview:")
    print(f"   • Unique polymers: {actual_sample_data['polymer_type'].nunique()}")
    print(f"   • Unique colors: {actual_sample_data['color'].nunique()}")
    print(f"   • Unique shapes: {actual_sample_data['shape'].nunique()}")
    print(f"   • Size range: {actual_sample_data['size_1_um'].min():.1f} - {actual_sample_data['size_1_um'].max():.1f} μm")
    
    # Show first few rows
    print("\n📋 First 5 particles:")
    display_cols = ['particle_id', 'polymer_type', 'color', 'shape', 'size_1_um', 'size_2_um']
    print(actual_sample_data[display_cols].head())
    
except Exception as e:
    print(f"❌ Error loading sample data: {e}")
    actual_sample_data = None
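
For readers without the package at hand, the loading step can be approximated with plain pandas. The sketch below assumes the loader reads the first sheet and renames raw Excel headers to the standardized names used above (`particle_id`, `polymer_type`, and so on). The raw header names in `EXAMPLE_COLUMN_MAPPING` and both function names are hypothetical; the real `EXCEL_COLUMN_MAPPING` and `ExcelLoader` may differ.

```python
import pandas as pd

# Hypothetical mapping from raw Excel headers to standardized column names.
# The actual EXCEL_COLUMN_MAPPING in microplas_blind_corr may look different.
EXAMPLE_COLUMN_MAPPING = {
    "Particle ID": "particle_id",
    "Polymer": "polymer_type",
    "Color": "color",
    "Shape": "shape",
    "Size 1 (um)": "size_1_um",
    "Size 2 (um)": "size_2_um",
}


def standardize_particle_table(raw, sample_name, mapping=EXAMPLE_COLUMN_MAPPING):
    """Rename mapped columns, drop unmapped ones, and tag rows with the sample name."""
    missing = [col for col in mapping if col not in raw.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    table = raw.rename(columns=mapping)[list(mapping.values())].copy()
    table["sample_name"] = sample_name
    return table


def load_particle_sheet(path, sample_name, mapping=EXAMPLE_COLUMN_MAPPING):
    """Read the first sheet of an Excel file and standardize its columns."""
    raw = pd.read_excel(path)  # .xlsx files require openpyxl to be installed
    return standardize_particle_table(raw, sample_name, mapping)
```

Failing early on missing columns keeps a malformed export from surfacing later as a `KeyError` in the overview prints.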

## Section 3: Read Blind Sample Data

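Load the separate Excel sheet containing the corresponding blind sample data, which must have the same structure as the actual sample. In a real workflow, each blind sample would come from its own Excel file. For this demonstration, we simulate a blind sample by drawing a random subset of about 1% of the actual sample's particles, so the resulting correction illustrates the workflow only and has no analytical meaning.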

In [None]:
# In a real scenario, you would load blind sample data from separate Excel files like this:
# blind_files = ["data/blinds/blind_001_particles.xlsx", "data/blinds/blind_002_particles.xlsx"]
# blind_data = loader.load_multiple_samples(blind_files, ["Blind_Sample_001", "Blind_Sample_002"])

# For this demonstration, we'll create simulated blind sample data
# by sampling from the actual data to show the workflow

if actual_sample_data is not None:
    print("🎭 Creating simulated blind sample data for demonstration...")

    # Create a simulated blind sample by sampling from the actual data
    # In reality, this would come from separate Excel files
    np.random.seed(42)  # For reproducible results

    # Sample about 1% of particles to simulate a typical blind sample size:
    # at least 50 particles, but never more than the sample contains
    # (sampling without replacement would fail otherwise)
    blind_sample_size = min(len(actual_sample_data), max(50, len(actual_sample_data) // 100))
    blind_indices = np.random.choice(actual_sample_data.index, size=blind_sample_size, replace=False)

    # Create blind sample data with modified sample name
    blind_sample_data = actual_sample_data.loc[blind_indices].copy()
    blind_sample_data['sample_name'] = 'Blind_Sample_001'

    # Modify particle IDs to make them unique
    blind_sample_data['particle_id'] = 'BLIND_' + blind_sample_data['particle_id'].astype(str)

    print("✅ Created simulated blind sample!")
    print("📊 Blind sample: Blind_Sample_001")
    print(f"🔬 Particles: {len(blind_sample_data):,}")
    print(f"📋 Structure matches actual sample: {list(blind_sample_data.columns) == list(actual_sample_data.columns)}")

    # Display blind sample info
    print("\n📈 Blind Sample Overview:")
    print(f"   • Unique polymers: {blind_sample_data['polymer_type'].nunique()}")
    print(f"   • Unique colors: {blind_sample_data['color'].nunique()}")
    print(f"   • Unique shapes: {blind_sample_data['shape'].nunique()}")
    print(f"   • Size range: {blind_sample_data['size_1_um'].min():.1f} - {blind_sample_data['size_1_um'].max():.1f} μm")

    print("\n📋 First 5 blind particles:")
    print(blind_sample_data[display_cols].head())

else:
    print("❌ Cannot create blind sample data - actual sample data not available")
    blind_sample_data = None

## Section 4: Verify Sheet Structure Consistency

Check that both Excel sheets have the same column names, data types, and overall structure. This is crucial for ensuring that the processing pipeline can handle both datasets consistently.
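
As a rough illustration of what such a check involves, the sketch below compares column names, column order, and dtypes of two DataFrames using plain pandas. It is independent of the package's `validate_dataframe_structure`, whose behavior is not shown here; the function name `compare_structure` and the shape of its report are assumptions made for this example.

```python
import pandas as pd


def compare_structure(reference, other):
    """Report structural differences between two DataFrames (hypothetical helper)."""
    ref_cols = list(reference.columns)
    other_cols = list(other.columns)
    shared = [col for col in ref_cols if col in other.columns]
    return {
        # Columns the reference has but the other frame lacks, and vice versa
        "missing_columns": [col for col in ref_cols if col not in other.columns],
        "extra_columns": [col for col in other_cols if col not in reference.columns],
        # True only if both frames list the same columns in the same order
        "same_order": ref_cols == other_cols,
        # Shared columns whose dtypes differ, as (reference, other) pairs
        "dtype_mismatches": {
            col: (str(reference[col].dtype), str(other[col].dtype))
            for col in shared
            if reference[col].dtype != other[col].dtype
        },
    }
```

Calling `compare_structure(actual_sample_data, blind_sample_data)` would return empty lists and an empty dict when the two tables are structurally identical, which is the expected outcome for the simulated blind sample.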