sl2005/Data-Efficient-SICS


SICS-155: Surgical Phase Segmentation with Vision Foundation Models

Benchmarking self-supervised vision foundation models (DINOv3, V-JEPA2) against supervised models (ResNet-50, I3D) for frame-level feature extraction on the SICS-155 surgical phase segmentation dataset. The MS-TCN++ (Multi-Stage Temporal Convolutional Network) serves as the temporal backbone across all experiments, isolating the impact of spatial representations on phase segmentation performance.


Overview

This project evaluates how different vision encoders affect temporal action segmentation quality in cataract surgery videos. Each feature extractor produces per-frame embeddings that are fed into a shared MS-TCN++ architecture for phase prediction. By keeping the temporal backbone constant, differences in downstream performance can be attributed directly to the quality of spatial (or spatiotemporal) feature representations.

Key research questions:

  1. Do self-supervised foundation models outperform supervised baselines for surgical video understanding?
  2. How does model scale (ViT-B → ViT-L → ViT-7B) affect segmentation quality?
  3. Does domain-specific fine-tuning on cataract videos (CataractFT) improve over generic pretrained features?

Dataset

SICS-155 — 155 annotated small-incision cataract surgery (SICS) videos with frame-level phase annotations across 19 surgical phases.

| Split | Videos |
|---|---|
| Train | ~100 (BL/BN/SD procedure types) |
| Validation | 15 |
| Test | 40 (held out) |

Surgical Phases (19 classes)

| ID | Phase |
|---|---|
| 0 | background |
| 1 | peritomy |
| 2 | cautery |
| 3 | scleral_groove |
| 4 | incision |
| 5 | tunnel |
| 6 | sideport |
| 7 | AB_injection_and_wash |
| 8 | OVD_injection |
| 9 | capsulorrhexis |
| 10 | main_incision_entry |
| 11 | hydroprocedure |
| 12 | nucleus_prolapse |
| 13 | nucleus_delivery |
| 14 | cortical_wash |
| 15 | OVD_IOL_insertion |
| 16 | OVD_wash |
| 17 | stromal_hydration |
| 18 | tunnel_suture |

Videos are stratified by procedure type prefix: BL, BN, SD.
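The prefix-based stratification can be sketched as follows (a minimal illustration, not the generate_k_folds.py implementation; the single train/test split and the test_frac parameter are assumptions — the actual script produces k folds):

```python
import random
from collections import defaultdict

def stratified_split(video_names, test_frac=0.25, seed=0):
    """Group videos by procedure-type prefix (BL/BN/SD), then sample
    the test fraction within each group so strata stay balanced."""
    groups = defaultdict(list)
    for name in video_names:
        groups[name.split("_")[0]].append(name)  # prefix before first "_"
    rng = random.Random(seed)
    train, test = [], []
    for prefix, vids in sorted(groups.items()):
        vids = sorted(vids)
        rng.shuffle(vids)
        n_test = max(1, round(len(vids) * test_frac))
        test += vids[:n_test]
        train += vids[n_test:]
    return train, test
```

Because sampling happens per prefix group, a rare procedure type cannot end up entirely in one split.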

Models

| Model | Type | Params | Feature Dim | Supervision | Folder |
|---|---|---|---|---|---|
| ResNet-50 | CNN | 25M | 2048 | Supervised (ImageNet) | Base/ |
| I3D | 3D CNN | 25M | 2048 | Supervised (Kinetics-400) | I3D/ |
| DINOv3 ViT-B/16 | ViT | 86M | 768 | Self-supervised | DinoV3-ViT-B16/ |
| DINOv3 ViT-L/16 | ViT | 304M | 1024 | Self-supervised | DinoV3-ViT-L16/ |
| DINOv3 ViT-7B/16 | ViT | 7B | 4096 | Self-supervised | DinoV3-ViT-7B16/ |
| V-JEPA2 ViT-L | ViT | ~300M | 1024 | Self-supervised | V-JEPA2-ViT-L/ |
| V-JEPA2 ViT-g | ViT | ~1B | 1408 | Self-supervised | V-JEPA2-ViT-g16-384/ |

CataractFT variants (SSL pretrain → LoRA fine-tune on cataract data):

| Model | Folder |
|---|---|
| DINOv3 ViT-B/16 + CataractFT | DinoV3-ViT-B16-CataractFT/ |
| DINOv3 ViT-L/16 + CataractFT | DinoV3-ViT-L16-CataractFT/ |
| DINOv3 ViT-7B/16 + CataractFT | DinoV3-ViT-7B16-CataractFT/ |
| V-JEPA2 ViT-L + CataractFT | V-JEPA2-ViT-L-CataractFT/ |
| V-JEPA2 ViT-g + CataractFT | V-JEPA2-ViT-g-CataractFT/ |

Project Structure

SICS-155/
├── README.md                       # This file
├── requirements.txt                # Python dependencies
├── mapping.txt                     # Phase ID → phase name mapping (19 classes)
├── generate_k_folds.py             # Generate stratified k-fold splits
├── compute_results.py              # Master cross-model comparison script
├── TODO.txt                        # Experiment log and results summary
│
├── SICS_155_train/                 # Training data
│   └── train/
│       ├── train_annotations.csv
│       ├── groundTruth/            # Per-video phase annotations (.txt)
│       └── videos/                 # Raw surgery videos
│
├── SICS_155_validation/            # Validation data
│   └── val/
│       ├── val_annotations.csv
│       ├── groundTruth/
│       └── videos/
│
├── comparison_results/             # Cross-model comparison outputs
│   ├── kfold_comparison.xlsx
│   ├── kfold_comparison_chart.png
│   ├── kfold_comparison_table.png
│   └── model_comparison.xlsx
│
├── Base/                           # ResNet-50 (ImageNet) — Baseline
├── I3D/                            # I3D (Kinetics-400)
├── DinoV3-ViT-B16/                 # DINOv3 ViT-B/16
├── DinoV3-ViT-L16/                 # DINOv3 ViT-L/16
├── DinoV3-ViT-7B16/                # DINOv3 ViT-7B/16
├── V-JEPA2-ViT-L/                  # V-JEPA2 ViT-L
├── V-JEPA2-ViT-g16-384/            # V-JEPA2 ViT-g
├── DinoV3-ViT-B16-CataractFT/      # DINOv3 ViT-B/16 + CataractFT
├── DinoV3-ViT-L16-CataractFT/      # DINOv3 ViT-L/16 + CataractFT
├── DinoV3-ViT-7B16-CataractFT/     # DINOv3 ViT-7B/16 + CataractFT
├── V-JEPA2-ViT-L-CataractFT/       # V-JEPA2 ViT-L + CataractFT
├── V-JEPA2-ViT-g-CataractFT/       # V-JEPA2 ViT-g + CataractFT
│
└── pretrain/                        # SSL pre-training pipeline
    ├── ssl_config.py                # Model/training configuration
    ├── ssl_dataset.py               # Video dataset and augmentations
    ├── ssl_pretrain_dinov3.py       # DINOv3 self-distillation training
    ├── ssl_pretrain_vjepa2.py       # V-JEPA2 masked prediction training
    ├── extract_ssl_features.py      # Feature extraction from SSL checkpoints
    ├── finetune_lora.py             # LoRA supervised fine-tuning
    ├── launch_ssl_dinov3.sh         # SLURM launch script (DINOv3)
    ├── launch_ssl_vjepa2.sh         # SLURM launch script (V-JEPA2)
    ├── launch_ssl_multinode.sh      # Multi-node SLURM script
    ├── cataracts1k/                 # Cataracts-1K dataset (1000 unlabeled + 56 annotated)
    └── cataract101/                 # Cataract-101 annotated dataset

Per-Model Folder Layout

Each model folder (e.g., Base/, DinoV3-ViT-L16/) contains the same structure:

<Model>/
├── extract_features.py       # Feature extraction using the model's encoder
├── train.py                  # MS-TCN++ training script
├── model.py                  # MS-TCN++ architecture (MS_TCN2)
├── batch_gen.py              # Batched data loading from .npy features
├── evaluation.py             # Per-split evaluation (all metrics)
├── compute_results.py        # Results aggregation and visualization
├── kfold_summary.py          # K-fold aggregate statistics
├── run_k_folds.py            # Automated k-fold CV runner
├── focalloss.py              # Focal loss implementation
├── export_to_excel.py        # Export results to Excel
├── include/
│   └── utils.py              # Shared utilities (edit score, label parsing)
├── data/SICS155/
│   ├── features/             # Extracted .npy features (feature_dim × T)
│   ├── groundTruth/          # Phase annotations for train split
│   ├── groundTruth_val/      # Phase annotations for validation
│   ├── groundTruth_split_*/  # Per-fold ground truth
│   ├── splits/               # Train/test bundle files per fold
│   └── mapping.txt           # Phase mapping
├── models/                   # Saved MS-TCN++ checkpoints
├── results/                  # Prediction outputs per split
├── results.xlsx              # Consolidated results table
├── visualizations/           # Single-split plots
└── visualizations_kfold/     # K-fold comparison plots

Installation

# Clone the repository
git clone <repo-url> SICS-155
cd SICS-155

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt

Key Dependencies

| Package | Purpose |
|---|---|
| torch, torchvision | Deep learning framework |
| transformers | DINOv3 & V-JEPA2 model loading (Hugging Face) |
| timm | Vision model utilities |
| decord | Fast GPU video decoding |
| pytorchvideo | I3D model loading |
| scikit-learn | Metrics (F1, PR-AUC) |
| opencv-python | Video frame extraction |
| matplotlib | Visualization |
| pandas, openpyxl | Results tables and Excel export |
| wandb | Experiment tracking (optional) |

Pipeline

1. Feature Extraction

Each model folder has an extract_features.py that loads the pretrained encoder and produces per-frame feature vectors saved as .npy files (feature_dim × num_frames).

# ResNet-50 (2048-d per frame)
cd Base && python extract_features.py --device cuda

# I3D (2048-d per frame, 16-frame sliding window)
cd I3D && python extract_features.py --device cuda --clip_len 16

# DINOv3 ViT-B/16 (768-d per frame)
cd DinoV3-ViT-B16 && python extract_features.py --device cuda

# V-JEPA2 ViT-L (1024-d per frame, 64-frame clips)
cd V-JEPA2-ViT-L && python extract_features.py --device cuda --stride 4
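Extracted files can be sanity-checked before training; this assumes only the (feature_dim × num_frames) .npy layout described above (the function name check_features is illustrative, not part of the repository):

```python
import numpy as np

def check_features(path, expected_dim):
    """Load a per-video feature file and verify the (feature_dim, T)
    layout expected by the MS-TCN++ batch generator."""
    feats = np.load(path)
    assert feats.ndim == 2 and feats.shape[0] == expected_dim, (
        f"expected ({expected_dim}, T), got {feats.shape}")
    return feats.shape[1]  # number of frames T
```

A transposed array here (T × feature_dim) is a common cause of silent shape errors downstream, so the check is cheap insurance.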

2. Training MS-TCN++

The MS-TCN++ architecture (Prediction Generation + Refinement stages) is shared across all models. Only --features_dim changes.

cd <Model>

# Train on a single split
python train.py --action train --dataset SICS155 --split 1 \
    --features_dim 2048 --num_epochs 100 \
    --num_layers_PG 13 --num_layers_R 13 --num_R 4 \
    --loss_mse 0.35 --adaptive_mse --device cuda:0

# Generate predictions
python train.py --action predict --dataset SICS155 --split 1 \
    --features_dim 2048 --num_epochs 100 \
    --num_layers_PG 13 --num_layers_R 13 --num_R 4 \
    --device cuda:0

MS-TCN++ Configuration:

  • Prediction Generation: 13 dilated causal convolution layers
  • Refinement Stages: 4 stages × 13 layers each
  • Feature maps: 64
  • Loss: Cross-entropy + Adaptive T-MSE (λ=0.35) + optional Focal Loss
  • Optimizer: Adam (lr=5e-4)
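Assuming dilation doubles per layer with kernel size 3 (the scheme from the original MS-TCN++ paper), the receptive field of a 13-layer stage works out as:

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (in frames) of a stack of dilated convolutions
    with dilations 1, 2, 4, ..., 2^(L-1)."""
    rf = 1
    for i in range(num_layers):
        rf += (kernel_size - 1) * 2 ** i
    return rf

# receptive_field(13) → 16383 frames of temporal context per stage
```

This is why 13 layers suffice: a single stage already covers well over ten thousand frames of surgical video.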

3. Evaluation

cd <Model>

# Evaluate a single split
python evaluation.py ./data/SICS155/groundTruth_split_1 ./results/SICS155/split_1 \
    --mapping_path ./data/SICS155/mapping.txt

# Generate visualizations and comparison across models
python compute_results.py
python compute_results.py --compare-kfold 5

4. K-Fold Cross-Validation

Stratified 5-fold CV is used for robust evaluation. Splits are stratified by procedure type (BL/BN/SD).

# Generate k-fold splits (run once, from any model folder)
python generate_k_folds.py ./data/SICS155 5

# Run all folds for a model
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100

# Aggregate k-fold results
python kfold_summary.py --folds 5 --verbose

# Cross-model k-fold comparison (from project root)
python compute_results.py --compare-kfold 5

5. Experiment Modes

This repository supports two evaluation paths on SICS-155:

  1. SSL-only test (no CataractFT):

    • Use base SSL model folders directly:
      • DinoV3-ViT-B16/
      • DinoV3-ViT-L16/
      • V-JEPA2-ViT-L/
    • Run feature extraction + MS-TCN++/k-fold in the same folder.
    • Typical feature dimensions:
      • DINOv3 ViT-B/16: 768
      • DINOv3 ViT-L/16: 1024
      • V-JEPA2 ViT-L: 1024
  2. CataractFT transfer pipeline:

    • Run Stage 1/2 in pretrain/ (SSL pretraining + LoRA fine-tuning).
    • Then evaluate using CataractFT folders:
      • DinoV3-ViT-B16-CataractFT/
      • DinoV3-ViT-L16-CataractFT/
      • DinoV3-ViT-7B16-CataractFT/
      • V-JEPA2-ViT-L-CataractFT/
      • V-JEPA2-ViT-g-CataractFT/
# SSL-only path (example: DINOv3 ViT-L)
cd DinoV3-ViT-L16
python extract_features.py --device cuda --force
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100 --features_dim 1024

# CataractFT path (example: V-JEPA2 ViT-L)
cd pretrain
python ssl_pretrain_vjepa2.py --model vjepa2-vitl-fpc64-256
python finetune_lora.py --model vjepa2-l --ssl_checkpoint ./checkpoints/ssl/vjepa2-vitl-fpc64-256/final.pt
cd ../V-JEPA2-ViT-L-CataractFT
python extract_features.py --device cuda --lora_path ../pretrain/checkpoints/vjepa2-l/best_lora --force
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100 --features_dim 1024

Transfer Learning (CataractFT)

The pretrain/ directory implements a three-stage transfer learning pipeline for domain adaptation:

Stage 1: SSL Pre-training on Cataracts-1K (1000 unlabeled cataract videos)
    ↓
Stage 2: LoRA Fine-tuning on annotated cataract datasets (Cataracts-1K annotated + Cataract-101)
    ↓
Stage 3: Feature extraction on SICS-155 → MS-TCN++ training

Stage 1 — Self-Supervised Pre-training:

  • DINOv3: Self-distillation with multi-crop augmentation (2 global + 8 local crops)
  • V-JEPA2: Masked video prediction (90% patch masking, predictor reconstructs representations)
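The 90% masking ratio can be illustrated with a simplified uniform-random patch mask (the actual V-JEPA2 recipe uses structured multi-block spatiotemporal masks, so this is only a sketch of the ratio, not the real masking strategy):

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.9, seed=0):
    """Boolean mask over patch tokens: True = masked (hidden from the
    encoder; the predictor must reconstruct its representation)."""
    rng = np.random.default_rng(seed)
    n_mask = int(num_patches * mask_ratio)
    idx = rng.permutation(num_patches)[:n_mask]
    mask = np.zeros(num_patches, dtype=bool)
    mask[idx] = True
    return mask
```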

Stage 2 — LoRA Fine-tuning:

  • Parameter-efficient fine-tuning using Low-Rank Adaptation
  • Supervised on annotated cataract surgery datasets
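The low-rank update at the heart of LoRA is small enough to sketch directly (a NumPy illustration of the idea, not the finetune_lora.py code; the alpha value and rank here are hypothetical):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA-adapted linear layer: y = x @ (W + (alpha/r) * A @ B).
    W (d_in, d_out) stays frozen; only A (d_in, r) and B (r, d_out)
    are trained, so trainable parameters scale with the rank r."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B
```

Initializing B to zeros (as in the LoRA paper) makes the adapted layer start out identical to the frozen pretrained layer.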

Stage 3 — SICS-155 Evaluation:

  • Extract features using the fine-tuned encoder
  • Train MS-TCN++ identically to the base models
# Example: SSL pre-train V-JEPA2 on Cataracts-1K
cd pretrain
python ssl_pretrain_vjepa2.py --model vjepa2-vitl-fpc64-256 --batch_size 2 --epochs 100

# LoRA fine-tune on annotated data
python finetune_lora.py --model vjepa2-vitl-fpc64-256 --checkpoint ./checkpoints/ssl/.../final.pt

# Extract features for SICS-155
python extract_ssl_features.py --model vjepa2-vitl-fpc64-256 --checkpoint ./checkpoints/...

See pretrain/README.md for detailed training instructions, SLURM scripts, and GPU requirements.

Results

5-Fold Cross-Validation Summary

| Model | Accuracy | F1 (macro) | Edit Score | PR-AUC | F1@10 | F1@25 | F1@50 | mIoU |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 75.68 ± 1.14 | 67.65 ± 1.43 | 79.30 ± 1.46 | 71.77 ± 1.15 | 79.83 ± 1.14 | 75.33 ± 1.99 | 62.22 ± 2.49 | 53.51 ± 1.65 |
| I3D | 79.82 ± 0.94 | 71.66 ± 1.41 | 82.61 ± 2.46 | 75.74 ± 1.06 | 83.16 ± 2.51 | 79.15 ± 3.16 | 68.88 ± 2.65 | 58.35 ± 1.76 |
| DINOv3 ViT-B/16 | 78.41 ± 0.71 | 71.23 ± 0.87 | 82.55 ± 1.71 | 75.04 ± 1.23 | 83.79 ± 1.08 | 79.89 ± 1.77 | 68.20 ± 1.96 | 57.01 ± 1.96 |
| DINOv3 ViT-L/16 | 82.19 ± 1.28 | 75.80 ± 2.07 | 85.71 ± 1.74 | 78.50 ± 1.31 | 87.25 ± 1.80 | 84.39 ± 2.54 | 73.74 ± 4.44 | 62.40 ± 3.61 |
| **DINOv3 ViT-7B/16** | **83.40 ± 1.37** | **76.48 ± 1.99** | **86.99 ± 1.51** | **79.45 ± 1.67** | **88.02 ± 2.23** | **84.48 ± 2.63** | **74.96 ± 1.82** | **64.07 ± 3.24** |
| V-JEPA2 ViT-L | 77.90 ± 0.97 | 69.92 ± 0.49 | 83.15 ± 1.02 | 74.20 ± 0.62 | 83.01 ± 0.92 | 78.15 ± 1.46 | 66.55 ± 1.73 | 55.58 ± 2.30 |
| V-JEPA2 ViT-g | 76.05 ± 1.04 | 67.54 ± 1.21 | 81.18 ± 2.59 | 72.25 ± 0.74 | 80.57 ± 2.50 | 76.48 ± 2.19 | 63.84 ± 2.59 | 53.23 ± 2.09 |

All values are percentages (mean ± std across 5 folds). Bold indicates best performance.

Comparison Charts

K-Fold Comparison Chart

K-Fold Comparison Table

Key Findings

  1. DINOv3 ViT-7B/16 achieves the best overall performance across all metrics, demonstrating the benefit of scale in self-supervised models for surgical video understanding.
  2. DINOv3 models scale consistently — ViT-B (78.41%) → ViT-L (82.19%) → ViT-7B (83.40%) accuracy, showing clear scaling behavior.
  3. Self-supervised ViT-L+ models surpass supervised baselines — DINOv3 ViT-L/16 outperforms both ResNet-50 and I3D across all metrics.
  4. I3D's temporal modeling provides an edge at smaller scale — I3D (79.82%) beats DINOv3 ViT-B/16 (78.41%) in accuracy, likely because I3D's native 3D convolutions capture short-range temporal context that per-frame ViT-B features lack.
  5. V-JEPA2 underperforms DINOv3 at equivalent scale — V-JEPA2 ViT-L (77.90%) vs DINOv3 ViT-L (82.19%), suggesting DINOv3's discriminative self-distillation produces better features for surgical phase recognition than V-JEPA2's generative masked prediction objective.

Metrics

| Metric | Level | Description |
|---|---|---|
| Accuracy | Frame | Percentage of correctly classified frames |
| F1 Score (macro) | Frame | Macro-averaged F1 across all 19 phases |
| Edit Score | Segment | One minus the normalized Levenshtein distance between predicted and ground-truth segment sequences (higher is better) |
| PR-AUC | Frame | Area under the precision-recall curve (from model probability outputs) |
| F1@{10,25,50} | Segment | Segmental F1 at IoU overlap thresholds of 10%, 25%, 50% |
| mIoU | Frame | Mean Intersection over Union (Jaccard index) across all phases |
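For reference, the segment-level edit score can be sketched in a few lines (a minimal version of the metric; the repository's implementation lives in include/utils.py and may differ in details):

```python
def collapse(labels):
    """Collapse frame-wise labels into the segment-level sequence."""
    return [l for i, l in enumerate(labels) if i == 0 or l != labels[i - 1]]

def edit_score(pred, gt):
    """Segmental edit score: one minus the normalized Levenshtein
    distance between the two segment sequences, as a percentage."""
    p, g = collapse(pred), collapse(gt)
    m, n = len(p), len(g)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + (p[i - 1] != g[j - 1]))
    return 100.0 * (1 - D[m][n] / max(m, n, 1))
```

Because the frame sequences are collapsed first, the score penalizes over-segmentation (spurious short segments) rather than small boundary shifts.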
