# WSI Feature Extraction Pipeline

This notebook demonstrates how to extract and save patch features from Whole Slide Images (WSIs) using a custom pipeline manager that tracks processing status and only performs missing operations.

## Key Features:
- **Smart State Management**: Tracks which slides have completed segmentation and feature extraction
- **Resume Capability**: Only processes missing operations, avoiding redundant computation
- **Organized Output Structure**: Logical folder hierarchy for different processing stages
- **Error Handling**: Segmentation failures are caught and reported, and the pipeline state is refreshed afterwards

## 1. Import Required Libraries and Custom Module

First, we'll import the necessary libraries and our custom pipeline manager module.

In [1]:
import os
import torch
from pathlib import Path
import pandas as pd
from typing import Dict, List, Optional

# Import our custom pipeline manager
from trident_pipeline_manager import PipelineManager, PipelineConfig

# Set up device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

# Display GPU information if available
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")


Using device: cuda
GPU: NVIDIA GeForce RTX 2080 Ti
GPU Memory: 10.8 GB


## 2. Initialize Manager and Set Up Paths

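Configure the pipeline parameters and initialize the manager for the Hugging Face (HF) dataset structure. The dataset already contains pre-extracted patches and tissue segmentation masks, so the configuration below points the pipeline at them through the `use_precomputed_patches` and `use_precomputed_segmentation` flags.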

### HF Dataset Structure:
- **wsis/**: H&E stained Whole Slide Images in pyramidal Generic TIFF format
- **patches/**: Pre-extracted 256x256 H&E patches (0.5µm/px) in .h5 format
- **tissue_seg/**: Pre-computed tissue segmentation masks and contours
- **st/**: Spatial transcriptomics expressions in scanpy .h5ad format
- **metadata/**: Sample metadata files
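
Before configuring the pipeline, it can help to confirm that the folders listed above are actually present. The following is a minimal sketch using only the standard library; `find_missing_subdirs` is a hypothetical helper, not part of `trident_pipeline_manager`.

```python
from pathlib import Path

# Subdirectory names taken from the dataset structure listed above.
EXPECTED_SUBDIRS = ("wsis", "patches", "tissue_seg", "st", "metadata")


def find_missing_subdirs(base_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected subdirectory names that do not exist under base_dir."""
    base = Path(base_dir)
    return [name for name in expected if not (base / name).is_dir()]
```

Calling `find_missing_subdirs(BASE_DATA_DIR)` returns an empty list when the layout is complete, so a non-empty result can be reported before any processing starts.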

In [None]:
# Configuration - Update these paths according to your HF dataset setup
BASE_DATA_DIR = "/home/vcivale/WSI-RL-Tiles-Selection_3/data"  # Base directory for HF dataset
WSI_DIR = os.path.join(BASE_DATA_DIR, "wsis")  # Directory containing WSI files from HF dataset
OUTPUT_DIR = "/home/vcivale/WSI-RL-Tiles-Selection_3/data/processed"  # Directory to save processed outputs
WSI_EXTENSIONS = ['.tiff', '.tif', '.svs', '.ndpi', '.mrxs']  # Supported WSI formats

# HF Dataset specific directories
HF_PATCHES_DIR = os.path.join(BASE_DATA_DIR, "patches")  # Pre-extracted patches from HF
HF_TISSUE_SEG_DIR = os.path.join(BASE_DATA_DIR, "tissue_seg")  # Pre-computed tissue segmentation
HF_ST_DIR = os.path.join(BASE_DATA_DIR, "st")  # Spatial transcriptomics data
HF_METADATA_DIR = os.path.join(BASE_DATA_DIR, "metadata")  # Metadata files

# Create output directory if it doesn't exist
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Configure pipeline parameters - adapted for HF dataset
config = PipelineConfig(
    # Segmentation parameters (may use pre-computed from HF)
    segmenter='hest',           # Options: 'hest', 'grandqc', or 'precomputed'
    seg_conf_thresh=0.5,        # Confidence threshold for segmentation
    remove_holes=False,         # Whether to remove holes in tissue
    remove_artifacts=False,     # Whether to remove artifacts
    
    # Patching parameters - HF dataset uses 256x256 patches at 0.5µm/px
    mag=20,                     # Magnification corresponding to 0.5µm/px (typically 20x)
    patch_size=256,             # Patch size in pixels (HF uses 256x256)
    overlap=0,                  # Overlap between patches
    min_tissue_proportion=0.0,  # Minimum tissue proportion to keep patch
    
    # Feature extraction parameters
    patch_encoder='conch_v15',  # Available encoders: conch_v15, uni_v1, uni_v2, etc.
    mpp=0.5,                    # Microns per pixel (HF dataset uses 0.5µm/px)
    
    # Processing parameters
    batch_size=32,              # Batch size for processing
    gpu=0,                      # GPU index to use
    skip_errors=False,          # Skip errored slides and continue
    reader_type='cucim',        # Reader type for WSI files
    
    # HF-specific parameters
    use_precomputed_patches=True,      # Use pre-extracted patches from HF
    use_precomputed_segmentation=True, # Use pre-computed tissue segmentation
    hf_patches_dir=HF_PATCHES_DIR,
    hf_tissue_seg_dir=HF_TISSUE_SEG_DIR,
    hf_st_dir=HF_ST_DIR,
    hf_metadata_dir=HF_METADATA_DIR
)

print("Configuration for HF Dataset:")
print(f"  Base Data Directory: {BASE_DATA_DIR}")
print(f"  WSI Directory: {WSI_DIR}")
print(f"  HF Patches Directory: {HF_PATCHES_DIR}")
print(f"  HF Tissue Segmentation Directory: {HF_TISSUE_SEG_DIR}")
print(f"  HF Spatial Transcriptomics Directory: {HF_ST_DIR}")
print(f"  Output Directory: {OUTPUT_DIR}")
print(f"  Patch Encoder: {config.patch_encoder}")
print(f"  Magnification: {config.mag}x (0.5µm/px)")
print(f"  Patch Size: {config.patch_size}px")
print(f"  Batch Size: {config.batch_size}")
print(f"  WSI Reader: {config.reader_type}")
print(f"  MPP: {config.mpp}")
print(f"  Use Precomputed Patches: {config.use_precomputed_patches}")
print(f"  Use Precomputed Segmentation: {config.use_precomputed_segmentation}")

Configuration:
  WSI Directory: /home/vcivale/WSI-RL-Tiles-Selection_3/data/interim
  Output Directory: /home/vcivale/WSI-RL-Tiles-Selection_3/data/processed
  Patch Encoder: conch_v15
  Magnification: 20x
  Patch Size: 512px
  Batch Size: 32
  WSI Reader: cucim
  MPP: 0.25
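
The `mag` and `mpp` settings describe the same resolution in two ways. Under the common nominal convention that a 40x scan has a pixel spacing of about 0.25 µm/px (an assumption; real scanner calibration varies), the two can be converted as in this sketch. The function names are illustrative and are not part of `trident_pipeline_manager`.

```python
# Assumed reference point: 40x is treated as 0.25 µm/px.
# This is a nominal convention, not a calibrated value for any scanner.
REFERENCE_MAG = 40.0
REFERENCE_MPP = 0.25


def mpp_to_magnification(mpp):
    """Nominal magnification for a pixel spacing given in µm/px."""
    if mpp <= 0:
        raise ValueError("mpp must be positive")
    return REFERENCE_MAG * REFERENCE_MPP / mpp


def magnification_to_mpp(mag):
    """Nominal pixel spacing in µm/px for a given magnification."""
    if mag <= 0:
        raise ValueError("mag must be positive")
    return REFERENCE_MAG * REFERENCE_MPP / mag


def patch_extent_um(patch_size, mpp):
    """Physical side length of a square patch in micrometres."""
    return patch_size * mpp


print(mpp_to_magnification(0.5))   # 20.0
print(magnification_to_mpp(40))    # 0.25
print(patch_extent_um(256, 0.5))   # 128.0
```

Under this convention a 256-pixel patch at 0.5 µm/px covers 128 µm of tissue per side.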


In [None]:
# Specify the WSI reader type for HF dataset (Generic TIFF format)
# Options: 'openslide', 'image' (for standard TIFFs), 'cucim' (recommended for Generic TIFF)
config.reader_type = 'cucim'

print(f"  WSI Reader: {config.reader_type} (optimized for Generic TIFF format)")

  WSI Reader: image


In [4]:
# Initialize the pipeline manager
pipeline_manager = PipelineManager(
    job_dir=OUTPUT_DIR,
    wsi_dir=WSI_DIR,
    config=config,
    wsi_ext=WSI_EXTENSIONS,
)

print("Pipeline Manager initialized successfully!")
print(f"Output directory structure created at: {OUTPUT_DIR}")
print(f"  - Segmentation: {pipeline_manager.seg_dir}")
print(f"  - Coordinates: {pipeline_manager.coords_dir}")
print(f"  - Features: {pipeline_manager.features_dir}")

[PROCESSOR] Found 1 valid slides in /home/vcivale/WSI-RL-Tiles-Selection_3/data/interim.
Pipeline Manager initialized successfully!
Output directory structure created at: /home/vcivale/WSI-RL-Tiles-Selection_3/data/processed
  - Segmentation: /home/vcivale/WSI-RL-Tiles-Selection_3/data/processed/segmentation
  - Coordinates: /home/vcivale/WSI-RL-Tiles-Selection_3/data/processed/coordinates_20x_512px
  - Features: /home/vcivale/WSI-RL-Tiles-Selection_3/data/processed/features_20x_512px_conch_v15
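
The folder names printed above encode the magnification, patch size, and encoder. How `PipelineManager` builds them internally is not shown in this notebook; the sketch below only reproduces the pattern visible in the output, and `output_dirs` is a hypothetical helper.

```python
from pathlib import Path


def output_dirs(job_dir, mag, patch_size, patch_encoder):
    """Build output folder paths following the naming pattern printed above."""
    job = Path(job_dir)
    tag = f"{mag}x_{patch_size}px"
    return {
        "segmentation": job / "segmentation",
        "coordinates": job / f"coordinates_{tag}",
        "features": job / f"features_{tag}_{patch_encoder}",
    }


dirs = output_dirs("data/processed", 20, 512, "conch_v15")
print(dirs["coordinates"].name)  # coordinates_20x_512px
print(dirs["features"].name)     # features_20x_512px_conch_v15
```

Encoding the parameters in the folder name keeps outputs from different configurations from overwriting each other.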


## 3. Scan Slides and Check Processing Status

Discover all WSI files and check the current processing status for each slide.

In [5]:
# Discover all WSI samples in the directory
sample_ids = pipeline_manager.discover_samples()
print(f"Found {len(sample_ids)} WSI samples:")
for i, sample_id in enumerate(sample_ids[:5]):  # Show first 5 samples
    print(f"  {i+1}. {sample_id}")
if len(sample_ids) > 5:
    print(f"  ... and {len(sample_ids) - 5} more")

# Display current processing status
pipeline_manager.print_summary()

Found 1 WSI samples:
  1. TCGA-D3-A2JN-06Z-00-DX1.0AA7684E-2886-4A00-B808-39EA790B825A

PIPELINE SUMMARY
Total samples: 1
Segmentation: 0/1 completed
Coordinates: 0/1 completed
Features: 0/1 completed
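
Sample discovery amounts to listing the files in the WSI directory whose extension is in the supported list. The sketch below illustrates the idea with the standard library; it is an assumption about the general approach, not the actual body of `discover_samples()`.

```python
from pathlib import Path


def discover_sample_ids(wsi_dir, extensions):
    """Return sorted sample IDs (file names without the final extension)."""
    allowed = {ext.lower() for ext in extensions}
    ids = {
        path.stem
        for path in Path(wsi_dir).iterdir()
        if path.is_file() and path.suffix.lower() in allowed
    }
    return sorted(ids)
```

Only the final suffix is stripped, so an ID that itself contains a dot, such as the TCGA identifier above, is kept whole.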


In [6]:
# Check which samples need processing for each task
seg_pending = pipeline_manager.get_pending_samples("segmentation")
coords_pending = pipeline_manager.get_pending_samples("coordinates")
feat_pending = pipeline_manager.get_pending_samples("features")

print("\nDetailed Status:")
print(f"Samples needing segmentation: {len(seg_pending)}")
print(f"Samples needing coordinates: {len(coords_pending)}")
print(f"Samples needing features: {len(feat_pending)}")

# Create a status DataFrame for better visualization
status_data = []
for sample_id in sample_ids:
    sample_state = pipeline_manager.samples[sample_id]
    status_data.append({
        'Sample ID': sample_id,
        'Segmentation': '✓' if sample_state.segmentation_done else '✗',
        'Coordinates': '✓' if sample_state.coordinates_done else '✗',
        'Features': '✓' if sample_state.features_done else '✗',
        'Last Updated': sample_state.last_updated.split('T')[0] if sample_state.last_updated else 'Never'
    })

status_df = pd.DataFrame(status_data)
print("\nSample Status Table:")
print(status_df.head(10))  # Show first 10 samples


Detailed Status:
Samples needing segmentation: 1
Samples needing coordinates: 0
Samples needing features: 0

Sample Status Table:
                                           Sample ID Segmentation Coordinates  \
0  TCGA-D3-A2JN-06Z-00-DX1.0AA7684E-2886-4A00-B80...            ✗           ✗   

  Features Last Updated  
0        ✗   2025-07-06  
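
One simple way to decide which samples are pending for a stage is to check whether the expected output file exists. This sketch assumes one `<sample_id>.h5` file per sample in the stage's folder, which matches the segmentation path used later in this notebook; `pending_samples` is a hypothetical helper and may differ from what `get_pending_samples()` does.

```python
from pathlib import Path


def pending_samples(sample_ids, output_dir, suffix=".h5"):
    """Return the sample IDs that have no output file in output_dir."""
    out = Path(output_dir)
    return [sid for sid in sample_ids if not (out / f"{sid}{suffix}").is_file()]
```

Because the check is based on files on disk, it stays correct even if a recorded state file and the actual outputs drift apart.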


## 4. Run Segmentation Where Needed

Perform tissue segmentation only on slides that haven't been processed yet.

In [7]:
import logging

# Configure logging to capture detailed output from the trident-wsi library
# This will help diagnose issues during segmentation or feature extraction
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

print("Logging configured to display detailed process information.")

Logging configured to display detailed process information.


In [8]:
# Force a refresh to ensure the state is up-to-date before running
print("Forcing state refresh before segmentation...")
pipeline_manager.force_refresh_state()
seg_pending = pipeline_manager.get_pending_samples("segmentation")

if seg_pending:
    print(f"Running segmentation for {len(seg_pending)} samples...")
    print("This may take several minutes depending on the number of slides and their size.")
    
    # Manually delete existing segmentation files to force re-processing
    print("Checking for and removing existing segmentation files to ensure a clean run...")
    for sample_id in seg_pending:
        seg_file_path = os.path.join(pipeline_manager.seg_dir, f"{sample_id}.h5")
        if os.path.exists(seg_file_path):
            try:
                os.remove(seg_file_path)
                print(f"  Removed existing segmentation file: {seg_file_path}")
            except OSError as e:
                print(f"  Error removing file {seg_file_path}: {e}")

    try:
        # Run segmentation (without 'force' argument)
        processed_count = pipeline_manager.run_segmentation(seg_pending)
        print(f"Segmentation completed for {processed_count} samples.")

        # Refresh state immediately after the run to reflect any changes
        print("Forcing state refresh after segmentation...")
        pipeline_manager.force_refresh_state()

        if processed_count < len(seg_pending):
            print(f"Warning: {len(seg_pending) - processed_count} samples failed to segment. Check logs for details.")

    except Exception as e:
        print(f"An error occurred during segmentation: {e}")
        print("Forcing a state refresh to accurately reflect the pipeline status.")
        pipeline_manager.force_refresh_state()
    finally:
        # Update status
        pipeline_manager.print_summary()
else:
    print("All samples already have segmentation completed! ✓")

Forcing state refresh before segmentation...
Force refreshing state by scanning all output files...
Checking common Trident output patterns...
State refresh completed.

PIPELINE SUMMARY
Total samples: 1
Segmentation: 0/1 completed
Coordinates: 0/1 completed
Features: 0/1 completed
Running segmentation for 1 samples...
This may take several minutes depending on the number of slides and their size.
Checking for and removing existing segmentation files to ensure a clean run...
Running segmentation for 1 samples...


Fetching 1 files: 100%|██████████| 1/1 [00:00<00:00, 622.39it/s]
Segmenting tissue: 100%|██████████| 1/1 [00:00<00:00, 711.62it/s, TCGA-D3-A2JN-06Z-00-DX1.0AA7684E-2886-4A00-B808-39EA790B825A already segmented. Skipping...]

Segmentation completed for 0 samples.
Forcing state refresh after segmentation...
Force refreshing state by scanning all output files...
Checking common Trident output patterns...
State refresh completed.

PIPELINE SUMMARY
Total samples: 1
Segmentation: 0/1 completed
Coordinates: 0/1 completed
Features: 0/1 completed

PIPELINE SUMMARY
Total samples: 1
Segmentation: 0/1 completed
Coordinates: 0/1 completed
Features: 0/1 completed

## 5. Run Coordinate Extraction Where Needed

Extract patch coordinates for slides that haven't been processed yet.

In [9]:
# Run coordinate extraction for samples that need it
coords_pending = pipeline_manager.get_pending_samples("coordinates")  # Refresh the list

if coords_pending:
    print(f"Running coordinate extraction for {len(coords_pending)} samples...")
    print("This step identifies tissue regions and extracts patch coordinates.")
    
    # Run coordinate extraction, forcing overwrite
    processed_count = pipeline_manager.run_coordinate_extraction(coords_pending, force=True)
    print(f"Coordinate extraction completed for {processed_count} samples.")
    
    # Update status
    pipeline_manager.print_summary()
else:
    print("All samples already have coordinates extracted! ✓")

All samples already have coordinates extracted! ✓
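
To show how `patch_size` and `overlap` interact, here is a minimal sketch that tiles a rectangular region with a regular grid. It assumes `overlap` is given in pixels and ignores the tissue mask and magnification handling, so it illustrates the parameters only and is not the pipeline's actual coordinate extraction.

```python
def grid_coordinates(width, height, patch_size, overlap=0):
    """Top-left (x, y) corners of patches that fit fully inside the region."""
    if not 0 <= overlap < patch_size:
        raise ValueError("overlap must satisfy 0 <= overlap < patch_size")
    stride = patch_size - overlap
    xs = range(0, width - patch_size + 1, stride)
    ys = range(0, height - patch_size + 1, stride)
    return [(x, y) for y in ys for x in xs]


print(len(grid_coordinates(1024, 512, 256)))               # 8
print(len(grid_coordinates(1024, 512, 256, overlap=128)))  # 21
```

Increasing the overlap shrinks the stride, so the number of patches (and the later feature extraction cost) grows quickly.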


## 6. Run Patch Feature Extraction Where Needed

Extract patch features using the configured patch encoder for slides that haven't been processed yet.

In [10]:
# Run feature extraction for samples that need it
feat_pending = pipeline_manager.get_pending_samples("features")  # Refresh the list

if feat_pending:
    print(f"Running feature extraction for {len(feat_pending)} samples...")
    print(f"Using patch encoder: {config.patch_encoder}")
    print("This is the most computationally intensive step and may take considerable time.")
    
    # Run feature extraction, forcing overwrite
    processed_count = pipeline_manager.run_feature_extraction(feat_pending, force=True)
    print(f"Feature extraction completed for {processed_count} samples.")
    
    # Update status
    pipeline_manager.print_summary()
else:
    print("All samples already have features extracted! ✓")

All samples already have features extracted! ✓
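
Feature extraction works through a slide's patches in groups of `batch_size`. The sketch below shows the batching idea on a plain list; `iter_batches` is a hypothetical helper, and the real encoder loop is not shown in this notebook.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


patch_indices = list(range(70))
print([len(batch) for batch in iter_batches(patch_indices, 32)])  # [32, 32, 6]
```

A smaller `batch_size` lowers peak GPU memory at the cost of more iterations per slide.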


## Usage Notes

### Key Features:
- **Resume Capability**: The pipeline tracks processing state and only runs missing operations
- **Organized Output**: Clean directory structure with logical naming conventions
- **Error Handling**: Can continue past slides that fail when `skip_errors=True`; the configuration above sets it to `False`, so failures are not skipped
- **Flexible Configuration**: Easy to modify parameters for different use cases
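
The `skip_errors` option is described in the configuration as "Skip errored slides and continue". A minimal sketch of that behaviour follows; `process_all` and `process_one` are hypothetical names, and the loop is an assumption about the general pattern rather than the manager's own code.

```python
def process_all(sample_ids, process_one, skip_errors=False):
    """Run process_one on each sample and return (succeeded, failed) lists."""
    succeeded, failed = [], []
    for sample_id in sample_ids:
        try:
            process_one(sample_id)
        except Exception as exc:
            if not skip_errors:
                raise  # stop at the first failure
            failed.append((sample_id, str(exc)))
        else:
            succeeded.append(sample_id)
    return succeeded, failed
```

Returning the failed IDs together with their error messages makes it easy to retry only those slides later.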

### Output Files:
- **Segmentation**: `.h5` files containing tissue masks and contours
- **Coordinates**: `.h5` files containing patch coordinates and metadata
- **Features**: `.h5` files containing patch-level feature embeddings
- **State Tracking**: `pipeline_state.json` maintains processing status
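
The following sketch shows one way a JSON state file like `pipeline_state.json` could be read and updated. The field names `segmentation_done`, `coordinates_done`, `features_done`, and `last_updated` mirror the attributes used earlier in this notebook, but the JSON layout (one object per sample ID) is an assumption, and the helper functions are hypothetical.

```python
import json
from datetime import datetime
from pathlib import Path


def load_state(path):
    """Load the state dictionary, or return an empty one if the file is absent."""
    p = Path(path)
    if not p.is_file():
        return {}
    with p.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_state(path, state):
    """Write the state to a temporary file, then swap it into place."""
    p = Path(path)
    tmp = p.with_name(p.name + ".tmp")
    with tmp.open("w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2, sort_keys=True)
    tmp.replace(p)


def mark_done(state, sample_id, stage):
    """Record that a stage finished for a sample and stamp the update time."""
    entry = state.setdefault(sample_id, {})
    entry[f"{stage}_done"] = True
    entry["last_updated"] = datetime.now().isoformat(timespec="seconds")
    return state
```

Writing to a temporary file first means an interrupted run is less likely to leave a half-written state file behind.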

### Tips:
1. **Memory Management**: Adjust `batch_size` based on your GPU memory
2. **Storage**: Ensure sufficient disk space for feature files (can be large)
3. **Resumption**: You can safely restart the notebook - it will skip completed steps
4. **Monitoring**: Check the summary output to track progress
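
For the storage tip, a rough size estimate can be made from the number of patches and the embedding width. In this sketch the slide count, patches per slide, and the embedding dimension of 768 are placeholder values chosen for illustration, and the estimate ignores HDF5 overhead and any compression.

```python
import shutil


def estimate_feature_bytes(n_slides, patches_per_slide, embedding_dim, bytes_per_value=4):
    """Approximate raw size of float32 patch features across all slides."""
    return n_slides * patches_per_slide * embedding_dim * bytes_per_value


def has_enough_space(path, required_bytes, safety_factor=1.2):
    """Check free space at path against the requirement plus a safety margin."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes * safety_factor


# Placeholder numbers: 100 slides, 10,000 patches each, 768-dim embeddings.
required = estimate_feature_bytes(100, 10_000, 768)
print(f"{required / 1024**3:.2f} GiB")  # 2.86 GiB
```

Running `has_enough_space(OUTPUT_DIR, required)` before feature extraction gives an early warning instead of a failure partway through a long job.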

### Next Steps:
With the extracted features, you can:
- Train slide-level classification models
- Perform clustering analysis
- Generate attention heatmaps
- Build retrieval systems