# WSI Feature Extraction Pipeline

This notebook demonstrates how to extract and save patch features from Whole Slide Images (WSIs) using a custom pipeline manager that tracks processing status and only performs missing operations.

## Key Features:
- **Smart State Management**: Tracks which slides have completed segmentation, coordinate extraction, and feature extraction
- **Resume Capability**: Only processes missing operations, avoiding redundant computation
- **Organized Output Structure**: Separate segmentation, coordinates, and features directories under the output directory
- **Error Handling**: With `skip_errors=True`, slides that fail are skipped so the rest of the batch can continue

## 1. Import Required Libraries and Custom Module

First, we'll import the necessary libraries and our custom pipeline manager module.

In [None]:
import os
import torch
from pathlib import Path
import pandas as pd
from typing import Dict, List, Optional

# Import our custom pipeline manager
from trident_pipeline_manager import PipelineManager, PipelineConfig

# Set up device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Display GPU information if available
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

In [None]:
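# Optional sanity check: load one previously processed slide as an AnnData object.
# `adata` is not used by the pipeline cells below.
import scanpy as sc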

adata = sc.read_h5ad('/equilibrium/datasets/TCGA-histological-data/WSI/data/processed/TCGA-BF-A1PU-01Z-00-DX1.CB0A52E3-16A9-46B2-BBE1-149A6CAAB9CF.h5ad.gz/TCGA-BF-A1PU-01Z-00-DX1.CB0A52E3-16A9-46B2-BBE1-149A6CAAB9CF.h5ad.h5')

## 2. Initialize Manager and Set Up Paths

Configure the pipeline parameters and initialize the manager with your WSI directory and output paths.

In [None]:
# Configuration - Update these paths according to your setup
WSI_DIR = "/equilibrium/datasets/TCGA-histological-data/WSI/raw"  # Directory containing your WSI files
OUTPUT_DIR = "/equilibrium/datasets/TCGA-histological-data/WSI/interim"  # Directory to save all outputs
WSI_EXTENSIONS = ['.svs', '.tiff', '.tif', '.ndpi', '.mrxs']  # Supported WSI formats

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Configure pipeline parameters
config = PipelineConfig(
    # Segmentation parameters
    segmenter='hest',           # Options: 'hest', 'grandqc'
    seg_conf_thresh=0.5,        # Confidence threshold for segmentation
    remove_holes=False,         # Whether to remove holes in tissue
    remove_artifacts=False,     # Whether to remove artifacts
    
    # Patching parameters  
    mag=20,                     # Magnification (5, 10, 20, 40, 80)
    patch_size=512,             # Patch size in pixels
    overlap=0,                  # Overlap between patches
    min_tissue_proportion=0.0,  # Minimum tissue proportion to keep patch
    
    # Feature extraction parameters
    patch_encoder='conch_v15',  # Available encoders: conch_v15, uni_v1, uni_v2, etc.
    
    # Processing parameters
    batch_size=32,              # Batch size for processing
    gpu=0,                      # GPU index to use
    skip_errors=True            # Skip errored slides and continue
)

print("Configuration:")
print(f"  WSI Directory: {WSI_DIR}")
print(f"  Output Directory: {OUTPUT_DIR}")
print(f"  Patch Encoder: {config.patch_encoder}")
print(f"  Magnification: {config.mag}x")
print(f"  Patch Size: {config.patch_size}px")
print(f"  Batch Size: {config.batch_size}")
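
The patching parameters interact: `patch_size` is measured at the target magnification `mag`, so the footprint of a patch in level-0 pixels depends on the slide's native magnification, and `overlap` determines the stride between patches. The sketch below works through that arithmetic. It assumes `overlap` is expressed in pixels and that the slide was scanned at 40x (an illustrative value only); `patch_geometry` is a hypothetical helper, not part of `trident_pipeline_manager`.

```python
def patch_geometry(patch_size: int, overlap: int,
                   target_mag: float, native_mag: float) -> dict:
    """Return the stride and the level-0 footprint of a patch.

    Assumption: `overlap` is in pixels at the target magnification.
    """
    if not 0 <= overlap < patch_size:
        raise ValueError("overlap must be in [0, patch_size)")
    # Number of level-0 pixels covered by one pixel at the target magnification
    scale = native_mag / target_mag
    stride = patch_size - overlap
    return {
        "stride_px": stride,
        "patch_size_level0": round(patch_size * scale),
        "stride_level0": round(stride * scale),
    }


# Values from the configuration above, on a hypothetical 40x slide
geom = patch_geometry(patch_size=512, overlap=0, target_mag=20, native_mag=40)
print(geom)
```

Under these assumptions, a 512 px patch at 20x covers a 1024 px square at level 0 of a 40x slide, so halving `mag` quadruples the tissue area each patch sees.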

In [None]:
# Initialize the pipeline manager
pipeline_manager = PipelineManager(
    job_dir=OUTPUT_DIR,
    wsi_dir=WSI_DIR,
    config=config,
    wsi_ext=WSI_EXTENSIONS
)

print("Pipeline Manager initialized successfully!")
print(f"Output directory structure created at: {OUTPUT_DIR}")
print(f"  - Segmentation: {pipeline_manager.seg_dir}")
print(f"  - Coordinates: {pipeline_manager.coords_dir}")
print(f"  - Features: {pipeline_manager.features_dir}")

## 3. Scan Slides and Check Processing Status

Discover all WSI files and check the current processing status for each slide.

In [None]:
# Discover all WSI samples in the directory
sample_ids = pipeline_manager.discover_samples()
print(f"Found {len(sample_ids)} WSI samples:")
for i, sample_id in enumerate(sample_ids[:5]):  # Show first 5 samples
    print(f"  {i+1}. {sample_id}")
if len(sample_ids) > 5:
    print(f"  ... and {len(sample_ids) - 5} more")

# Display current processing status
pipeline_manager.print_summary()
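
For readers without access to `trident_pipeline_manager`, the discovery step can be approximated as a directory scan filtered by extension. This is a minimal sketch, assuming sample IDs are file names without their final extension and that slides sit directly in the WSI directory; `find_slides` is a hypothetical helper and may differ from what `discover_samples()` actually does.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def find_slides(wsi_dir: str, extensions: Iterable[str]) -> List[str]:
    """Return sorted sample IDs (file stems) of files with an allowed extension."""
    allowed = {ext.lower() for ext in extensions}
    return sorted(
        p.stem
        for p in Path(wsi_dir).iterdir()
        if p.is_file() and p.suffix.lower() in allowed  # case-insensitive match
    )


# Demonstration on a throwaway directory with two slides and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("slide_a.svs", "slide_b.TIF", "notes.txt"):
        (Path(tmp) / name).touch()
    found = find_slides(tmp, [".svs", ".tif"])

print(found)
```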

In [None]:
# Check which samples need processing for each task
seg_pending = pipeline_manager.get_pending_samples("segmentation")
coords_pending = pipeline_manager.get_pending_samples("coordinates")
feat_pending = pipeline_manager.get_pending_samples("features")

print("\nDetailed Status:")
print(f"Samples needing segmentation: {len(seg_pending)}")
print(f"Samples needing coordinates: {len(coords_pending)}")
print(f"Samples needing features: {len(feat_pending)}")

# Create a status DataFrame for better visualization
status_data = []
for sample_id in sample_ids:
    sample_state = pipeline_manager.samples[sample_id]
    status_data.append({
        'Sample ID': sample_id,
        'Segmentation': '✓' if sample_state.segmentation_done else '✗',
        'Coordinates': '✓' if sample_state.coordinates_done else '✗',
        'Features': '✓' if sample_state.features_done else '✗',
        'Last Updated': sample_state.last_updated.split('T')[0] if sample_state.last_updated else 'Never'
    })

status_df = pd.DataFrame(status_data)
print("\nSample Status Table:")
print(status_df.head(10))  # Show first 10 samples
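
The pending lists follow directly from the per-sample flags shown in the table. The sketch below illustrates one way that filtering could work. The attribute names (`segmentation_done`, `coordinates_done`, `features_done`, `last_updated`) are taken from the cell above; the `SampleState` dataclass, the `STAGE_FLAGS` mapping, and `pending_samples` are assumptions made for illustration, not the module's actual code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SampleState:
    """Hypothetical per-slide record mirroring the flags used above."""
    segmentation_done: bool = False
    coordinates_done: bool = False
    features_done: bool = False
    last_updated: Optional[str] = None


# Map each stage name to the flag that marks it complete
STAGE_FLAGS = {
    "segmentation": "segmentation_done",
    "coordinates": "coordinates_done",
    "features": "features_done",
}


def pending_samples(samples: Dict[str, SampleState], stage: str) -> List[str]:
    """Return the IDs of samples whose flag for `stage` is still False."""
    flag = STAGE_FLAGS[stage]
    return sorted(sid for sid, state in samples.items() if not getattr(state, flag))


samples = {
    "slide_a": SampleState(True, True, True, "2024-01-01T00:00:00"),
    "slide_b": SampleState(segmentation_done=True),
    "slide_c": SampleState(),
}
for stage in STAGE_FLAGS:
    print(stage, pending_samples(samples, stage))
```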

## 4. Run Segmentation Where Needed

Perform tissue segmentation only on slides that haven't been processed yet.

In [None]:
# Run segmentation for samples that need it
if seg_pending:
    print(f"Running segmentation for {len(seg_pending)} samples...")
    print("This may take several minutes depending on the number of slides and their size.")
    
    try:
        # Run segmentation
        processed_count = pipeline_manager.run_segmentation(seg_pending)
        print(f"Segmentation completed for {processed_count} samples.")
        if processed_count < len(seg_pending):
            print(f"Warning: {len(seg_pending) - processed_count} samples failed to segment. Check logs for details.")
            pipeline_manager.force_refresh_state()

    except Exception as e:
        print(f"An error occurred during segmentation: {e}")
        print("Forcing a state refresh to accurately reflect the pipeline status.")
        pipeline_manager.force_refresh_state()
    finally:
        # Update status
        pipeline_manager.print_summary()
else:
    print("All samples already have segmentation completed! ✓")

## 5. Run Coordinate Extraction Where Needed

Extract patch coordinates for slides that haven't been processed yet.

In [None]:
# Run coordinate extraction for samples that need it
coords_pending = pipeline_manager.get_pending_samples("coordinates")  # Refresh the list

if coords_pending:
    print(f"Running coordinate extraction for {len(coords_pending)} samples...")
    print("This step identifies tissue regions and extracts patch coordinates.")
    
    # Run coordinate extraction
    processed_count = pipeline_manager.run_coordinate_extraction(coords_pending)
    print(f"Coordinate extraction completed for {processed_count} samples.")
    
    # Update status
    pipeline_manager.print_summary()
else:
    print("All samples already have coordinates extracted! ✓")

## 6. Run Patch Feature Extraction Where Needed

Extract patch features using the configured patch encoder for slides that haven't been processed yet.

In [None]:
# Run feature extraction for samples that need it
feat_pending = pipeline_manager.get_pending_samples("features")  # Refresh the list

if feat_pending:
    print(f"Running feature extraction for {len(feat_pending)} samples...")
    print(f"Using patch encoder: {config.patch_encoder}")
    print("This is the most computationally intensive step and may take considerable time.")
    
    # Run feature extraction
    processed_count = pipeline_manager.run_feature_extraction(feat_pending)
    print(f"Feature extraction completed for {processed_count} samples.")
    
    # Update status
    pipeline_manager.print_summary()
else:
    print("All samples already have features extracted! ✓")
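
The coordinate and feature cells call the manager directly, whereas the segmentation cell wraps its call in `try`/`except`/`finally`. One possible way to give all three stages the same handling is a small wrapper like the one sketched here. `run_stage` is a hypothetical helper; it only assumes that the stage function returns the number of processed samples, as the `run_*` methods above appear to do.

```python
from typing import Callable, List


def run_stage(name: str,
              pending: List[str],
              run: Callable[[List[str]], int],
              refresh: Callable[[], None],
              summarize: Callable[[], None]) -> int:
    """Run one stage, refresh state on failure, and always print a summary."""
    if not pending:
        print(f"No samples pending for {name}.")
        return 0
    processed = 0
    try:
        processed = run(pending)
        if processed < len(pending):
            print(f"Warning: {len(pending) - processed} samples failed during {name}.")
            refresh()
    except Exception as exc:
        print(f"An error occurred during {name}: {exc}")
        refresh()
    finally:
        summarize()
    return processed


# Stand-in callables so the sketch runs without the real manager
calls: List[str] = []
done = run_stage(
    "demo",
    ["slide_a", "slide_b"],
    run=lambda ids: len(ids) - 1,              # pretend one slide failed
    refresh=lambda: calls.append("refresh"),
    summarize=lambda: calls.append("summary"),
)
print(done, calls)
```

With the real manager, the call might look like `run_stage("features", feat_pending, pipeline_manager.run_feature_extraction, pipeline_manager.force_refresh_state, pipeline_manager.print_summary)`.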

## Usage Notes

### Key Features:
- **Resume Capability**: The pipeline tracks processing state and only runs missing operations
- **Organized Output**: Separate segmentation, coordinates, and features directories under `OUTPUT_DIR`
- **Error Handling**: With `skip_errors=True`, processing continues even if individual slides fail
- **Flexible Configuration**: All segmentation, patching, and encoder parameters are set in a single `PipelineConfig` object

### Output Files:
- **Segmentation**: `.h5` files containing tissue masks and contours
- **Coordinates**: `.h5` files containing patch coordinates and metadata
- **Features**: `.h5` files containing patch-level feature embeddings
- **State Tracking**: `pipeline_state.json` maintains processing status
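
A state file such as `pipeline_state.json` is only useful for resumption if it is never left half-written. The sketch below shows one common pattern: write to a temporary file, then atomically replace the target. The JSON layout (one entry per sample ID with three boolean flags) is an assumption for illustration; the actual schema used by the pipeline manager is not shown in this notebook, and `save_state` / `load_state` are hypothetical helpers.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path: Path, state: dict) -> None:
    """Write `state` as JSON via a temp file so a crash cannot truncate it."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_name, path)  # atomic when source and target share a filesystem


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict on the first run."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open() as fh:
        return json.load(fh)


# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "pipeline_state.json"
    empty = load_state(state_file)
    save_state(state_file, {
        "slide_a": {"segmentation_done": True,
                    "coordinates_done": False,
                    "features_done": False},
    })
    reloaded = load_state(state_file)

print(reloaded)
```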

### Tips:
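1. **Memory Management**: Lower `batch_size` if you run into GPU out-of-memory errors; raise it when memory allows
2. **Storage**: Ensure sufficient disk space for feature files, whose size grows with the number of patches per slide and the encoder's embedding dimension
3. **Resumption**: You can safely restart the notebook; it will skip completed steps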
4. **Monitoring**: Check the summary output to track progress
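
To get a rough idea of the disk space needed for features, multiply the number of slides, patches per slide, embedding dimension, and bytes per value. The sketch below does this for uncompressed `float32` embeddings. All numbers in the example call are illustrative assumptions, including the embedding dimension of 768; check the actual output dimension of your encoder and the patch counts of your slides.

```python
def feature_storage_gib(n_slides: int,
                        patches_per_slide: int,
                        embed_dim: int,
                        bytes_per_value: int = 4) -> float:
    """Estimate uncompressed feature storage in GiB (float32 by default)."""
    total_bytes = n_slides * patches_per_slide * embed_dim * bytes_per_value
    return total_bytes / 1024**3


# Hypothetical cohort: 1,000 slides, 10,000 patches each, 768-dim embeddings
estimate = feature_storage_gib(n_slides=1000, patches_per_slide=10_000, embed_dim=768)
print(f"Estimated feature storage: {estimate:.1f} GiB")
```

The estimate ignores coordinates, metadata, and any compression, so treat it as an order-of-magnitude figure.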

### Next Steps:
With the extracted features, you can:
- Train slide-level classification models
- Perform clustering analysis
- Generate attention heatmaps
- Build retrieval systems