Benchmarking self-supervised vision foundation models (DINOv3, V-JEPA2) against supervised models (ResNet-50, I3D) for frame-level feature extraction on the SICS-155 surgical phase segmentation dataset. MS-TCN++ (Multi-Stage Temporal Convolutional Network) serves as the temporal backbone across all experiments, isolating the impact of spatial representations on phase segmentation performance.
- Overview
- Dataset
- Models
- Project Structure
- Installation
- Pipeline
- Transfer Learning (CataractFT)
- Results
- Metrics
## Overview

This project evaluates how different vision encoders affect temporal action segmentation quality in cataract surgery videos. Each feature extractor produces per-frame embeddings that are fed into a shared MS-TCN++ architecture for phase prediction. By keeping the temporal backbone constant, differences in downstream performance can be attributed directly to the quality of spatial (or spatiotemporal) feature representations.
Key research questions:
- Do self-supervised foundation models outperform supervised baselines for surgical video understanding?
- How does model scale (ViT-B → ViT-L → ViT-7B) affect segmentation quality?
- Does domain-specific fine-tuning on cataract videos (CataractFT) improve over generic pretrained features?
## Dataset

SICS-155 — 155 annotated small-incision cataract surgery (SICS) videos with frame-level phase annotations across 19 surgical phases.
| Split | Videos |
|---|---|
| Train | ~100 (BL/BN/SD procedure types) |
| Validation | 15 |
| Test | 40 (held out) |
| ID | Phase |
|---|---|
| 0 | background |
| 1 | peritomy |
| 2 | cautery |
| 3 | scleral_groove |
| 4 | incision |
| 5 | tunnel |
| 6 | sideport |
| 7 | AB_injection_and_wash |
| 8 | OVD_injection |
| 9 | capsulorrhexis |
| 10 | main_incision_entry |
| 11 | hydroprocedure |
| 12 | nucleus_prolapse |
| 13 | nucleus_delivery |
| 14 | cortical_wash |
| 15 | OVD_IOL_insertion |
| 16 | OVD_wash |
| 17 | stromal_hydration |
| 18 | tunnel_suture |
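The table above mirrors the repository's mapping.txt. As an illustrative sketch (assuming the common MS-TCN-style format of one `<id> <name>` pair per line, e.g. `0 background`), the mapping can be loaded like this:

```python
from pathlib import Path

def load_mapping(path):
    """Parse a phase-mapping file into {id: name}.

    Assumes one "<id> <name>" pair per line, e.g. "0 background";
    check mapping.txt if your copy uses a different delimiter.
    """
    mapping = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        idx, name = line.split(maxsplit=1)
        mapping[int(idx)] = name.strip()
    return mapping
```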
Videos are stratified by procedure type prefix: BL, BN, SD.
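A minimal sketch of prefix-stratified fold assignment, illustrating the idea behind generate_k_folds.py rather than its actual implementation:

```python
from collections import defaultdict

def stratified_folds(video_names, k=5):
    """Assign videos to k folds so each fold gets a similar mix of
    procedure types, inferred from the two-letter filename prefix
    (BL/BN/SD). Deterministic round-robin sketch; the real script
    may shuffle within strata."""
    by_type = defaultdict(list)
    for name in sorted(video_names):
        by_type[name[:2]].append(name)   # group by procedure prefix
    folds = [[] for _ in range(k)]
    i = 0
    for group in by_type.values():
        for name in group:
            folds[i % k].append(name)    # round-robin within each stratum
            i += 1
    return folds
```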
## Models

| Model | Type | Params | Feature Dim | Supervision | Folder |
|---|---|---|---|---|---|
| ResNet-50 | CNN | 25M | 2048 | Supervised (ImageNet) | Base/ |
| I3D | 3D CNN | 25M | 2048 | Supervised (Kinetics-400) | I3D/ |
| DINOv3 ViT-B/16 | ViT | 86M | 768 | Self-supervised | DinoV3-ViT-B16/ |
| DINOv3 ViT-L/16 | ViT | 304M | 1024 | Self-supervised | DinoV3-ViT-L16/ |
| DINOv3 ViT-7B/16 | ViT | 7B | 4096 | Self-supervised | DinoV3-ViT-7B16/ |
| V-JEPA2 ViT-L | ViT | ~300M | 1024 | Self-supervised | V-JEPA2-ViT-L/ |
| V-JEPA2 ViT-g | ViT | ~1B | 1408 | Self-supervised | V-JEPA2-ViT-g16-384/ |
CataractFT variants (SSL pretrain → LoRA fine-tune on cataract data):
| Model | Folder |
|---|---|
| DINOv3 ViT-B/16 + CataractFT | DinoV3-ViT-B16-CataractFT/ |
| DINOv3 ViT-L/16 + CataractFT | DinoV3-ViT-L16-CataractFT/ |
| DINOv3 ViT-7B/16 + CataractFT | DinoV3-ViT-7B16-CataractFT/ |
| V-JEPA2 ViT-L + CataractFT | V-JEPA2-ViT-L-CataractFT/ |
| V-JEPA2 ViT-g + CataractFT | V-JEPA2-ViT-g-CataractFT/ |
## Project Structure

```
SICS-155/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── mapping.txt                    # Phase ID → phase name mapping (19 classes)
├── generate_k_folds.py            # Generate stratified k-fold splits
├── compute_results.py             # Master cross-model comparison script
├── TODO.txt                       # Experiment log and results summary
│
├── SICS_155_train/                # Training data
│   └── train/
│       ├── train_annotations.csv
│       ├── groundTruth/           # Per-video phase annotations (.txt)
│       └── videos/                # Raw surgery videos
│
├── SICS_155_validation/           # Validation data
│   └── val/
│       ├── val_annotations.csv
│       ├── groundTruth/
│       └── videos/
│
├── comparison_results/            # Cross-model comparison outputs
│   ├── kfold_comparison.xlsx
│   ├── kfold_comparison_chart.png
│   ├── kfold_comparison_table.png
│   └── model_comparison.xlsx
│
├── Base/                          # ResNet-50 (ImageNet) — Baseline
├── I3D/                           # I3D (Kinetics-400)
├── DinoV3-ViT-B16/                # DINOv3 ViT-B/16
├── DinoV3-ViT-L16/                # DINOv3 ViT-L/16
├── DinoV3-ViT-7B16/               # DINOv3 ViT-7B/16
├── V-JEPA2-ViT-L/                 # V-JEPA2 ViT-L
├── V-JEPA2-ViT-g16-384/           # V-JEPA2 ViT-g
├── DinoV3-ViT-B16-CataractFT/     # DINOv3 ViT-B/16 + CataractFT
├── DinoV3-ViT-L16-CataractFT/     # DINOv3 ViT-L/16 + CataractFT
├── DinoV3-ViT-7B16-CataractFT/    # DINOv3 ViT-7B/16 + CataractFT
├── V-JEPA2-ViT-L-CataractFT/      # V-JEPA2 ViT-L + CataractFT
├── V-JEPA2-ViT-g-CataractFT/      # V-JEPA2 ViT-g + CataractFT
│
└── pretrain/                      # SSL pre-training pipeline
    ├── ssl_config.py              # Model/training configuration
    ├── ssl_dataset.py             # Video dataset and augmentations
    ├── ssl_pretrain_dinov3.py     # DINOv3 self-distillation training
    ├── ssl_pretrain_vjepa2.py     # V-JEPA2 masked prediction training
    ├── extract_ssl_features.py    # Feature extraction from SSL checkpoints
    ├── finetune_lora.py           # LoRA supervised fine-tuning
    ├── launch_ssl_dinov3.sh       # SLURM launch script (DINOv3)
    ├── launch_ssl_vjepa2.sh       # SLURM launch script (V-JEPA2)
    ├── launch_ssl_multinode.sh    # Multi-node SLURM script
    ├── cataracts1k/               # Cataracts-1K dataset (1000 unlabeled + 56 annotated)
    └── cataract101/               # Cataract-101 annotated dataset
```
Each model folder (e.g., Base/, DinoV3-ViT-L16/) contains the same structure:
```
<Model>/
├── extract_features.py            # Feature extraction using the model's encoder
├── train.py                       # MS-TCN++ training script
├── model.py                       # MS-TCN++ architecture (MS_TCN2)
├── batch_gen.py                   # Batched data loading from .npy features
├── evaluation.py                  # Per-split evaluation (all metrics)
├── compute_results.py             # Results aggregation and visualization
├── kfold_summary.py               # K-fold aggregate statistics
├── run_k_folds.py                 # Automated k-fold CV runner
├── focalloss.py                   # Focal loss implementation
├── export_to_excel.py             # Export results to Excel
├── include/
│   └── utils.py                   # Shared utilities (edit score, label parsing)
├── data/SICS155/
│   ├── features/                  # Extracted .npy features (feature_dim × T)
│   ├── groundTruth/               # Phase annotations for train split
│   ├── groundTruth_val/           # Phase annotations for validation
│   ├── groundTruth_split_*/       # Per-fold ground truth
│   ├── splits/                    # Train/test bundle files per fold
│   └── mapping.txt                # Phase mapping
├── models/                        # Saved MS-TCN++ checkpoints
├── results/                       # Prediction outputs per split
├── results.xlsx                   # Consolidated results table
├── visualizations/                # Single-split plots
└── visualizations_kfold/          # K-fold comparison plots
```
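Feature matrices and ground-truth files must stay frame-aligned: each video's `(feature_dim, T)` array in features/ must match the T label lines in its groundTruth/ file. A quick sanity check (hypothetical helper, not a script shipped in the repo):

```python
import numpy as np
from pathlib import Path

def check_alignment(feat_path, gt_path):
    """Verify a (feature_dim, T) feature matrix matches its per-frame
    label file (one phase name per line). Hypothetical helper."""
    feats = np.load(feat_path)  # shape (feature_dim, T)
    labels = [l for l in Path(gt_path).read_text().splitlines() if l.strip()]
    if feats.shape[1] != len(labels):
        raise ValueError(
            f"{feats.shape[1]} feature frames vs {len(labels)} labels")
    return feats.shape[0], len(labels)
```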
## Installation

```bash
# Clone the repository
git clone <repo-url> SICS-155
cd SICS-155

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt
```

| Package | Purpose |
|---|---|
| torch, torchvision | Deep learning framework |
| transformers | DINOv3 & V-JEPA2 model loading (Hugging Face) |
| timm | Vision model utilities |
| decord | Fast GPU video decoding |
| pytorchvideo | I3D model loading |
| scikit-learn | Metrics (F1, PR-AUC) |
| opencv-python | Video frame extraction |
| matplotlib | Visualization |
| pandas, openpyxl | Results tables and Excel export |
| wandb | Experiment tracking (optional) |
## Pipeline

Each model folder has an `extract_features.py` that loads the pretrained encoder and produces per-frame feature vectors saved as `.npy` files (`feature_dim × num_frames`).
```bash
# ResNet-50 (2048-d per frame)
cd Base && python extract_features.py --device cuda

# I3D (2048-d per frame, 16-frame sliding window)
cd I3D && python extract_features.py --device cuda --clip_len 16

# DINOv3 ViT-B/16 (768-d per frame)
cd DinoV3-ViT-B16 && python extract_features.py --device cuda

# V-JEPA2 ViT-L (1024-d per frame, 64-frame clips)
cd V-JEPA2-ViT-L && python extract_features.py --device cuda --stride 4
```

The MS-TCN++ architecture (Prediction Generation + Refinement stages) is shared across all models. Only `--features_dim` changes.
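Because only --features_dim differs between models, the value can be read off any saved feature file rather than typed by hand (a sketch, assuming the feature_dim × num_frames layout described above):

```python
import numpy as np

def infer_features_dim(feature_file):
    """Features are saved as (feature_dim, T), so axis 0 gives the
    value to pass to train.py as --features_dim."""
    return int(np.load(feature_file).shape[0])
```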
```bash
cd <Model>

# Train on a single split
python train.py --action train --dataset SICS155 --split 1 \
    --features_dim 2048 --num_epochs 100 \
    --num_layers_PG 13 --num_layers_R 13 --num_R 4 \
    --loss_mse 0.35 --adaptive_mse --device cuda:0

# Generate predictions
python train.py --action predict --dataset SICS155 --split 1 \
    --features_dim 2048 --num_epochs 100 \
    --num_layers_PG 13 --num_layers_R 13 --num_R 4 \
    --device cuda:0
```

MS-TCN++ Configuration:
- Prediction Generation: 13 dilated causal convolution layers
- Refinement Stages: 4 stages × 13 layers each
- Feature maps: 64
- Loss: Cross-entropy + Adaptive T-MSE (λ=0.35) + optional Focal Loss
- Optimizer: Adam (lr=5e-4)
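For reference, the truncated MSE smoothing term of the loss can be sketched in numpy as below, with the change in log-probabilities clamped at τ = 4 as in the MS-TCN paper so genuine phase transitions are not over-penalized (the adaptive variant enabled by --adaptive_mse is a project-specific extension; see train.py):

```python
import numpy as np

def truncated_mse(log_probs, tau=4.0):
    """Smoothing loss from MS-TCN: mean squared frame-to-frame change
    in per-class log-probabilities, clamped at tau.
    log_probs: array of shape (num_classes, T)."""
    delta = np.abs(log_probs[:, 1:] - log_probs[:, :-1])
    delta = np.minimum(delta, tau)  # truncate large (real) transitions
    return float(np.mean(delta ** 2))
```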
```bash
cd <Model>

# Evaluate a single split
python evaluation.py ./data/SICS155/groundTruth_split_1 ./results/SICS155/split_1 \
    --mapping_path ./data/SICS155/mapping.txt

# Generate visualizations and comparison across models
python compute_results.py
python compute_results.py --compare-kfold 5
```

Stratified 5-fold CV is used for robust evaluation. Splits are stratified by procedure type (BL/BN/SD).
```bash
# Generate k-fold splits (run once, from any model folder)
python generate_k_folds.py ./data/SICS155 5

# Run all folds for a model
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100

# Aggregate k-fold results
python kfold_summary.py --folds 5 --verbose

# Cross-model k-fold comparison (from project root)
python compute_results.py --compare-kfold 5
```

This repository supports two evaluation paths on SICS-155:
- SSL-only test (no CataractFT):
  - Use base SSL model folders directly: `DinoV3-ViT-B16/`, `DinoV3-ViT-L16/`, `V-JEPA2-ViT-L/`
  - Run feature extraction + MS-TCN++/k-fold in the same folder.
  - Typical feature dimensions:
    - DINOv3 ViT-B/16: 768
    - DINOv3 ViT-L/16: 1024
    - V-JEPA2 ViT-L: 1024
- CataractFT transfer pipeline:
  - Run Stage 1/2 in `pretrain/` (SSL pretraining + LoRA fine-tuning).
  - Then evaluate using the CataractFT folders: `DinoV3-ViT-B16-CataractFT/`, `DinoV3-ViT-L16-CataractFT/`, `DinoV3-ViT-7B16-CataractFT/`, `V-JEPA2-ViT-L-CataractFT/`, `V-JEPA2-ViT-g-CataractFT/`
```bash
# SSL-only path (example: DINOv3 ViT-L)
cd DinoV3-ViT-L16
python extract_features.py --device cuda --force
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100 --features_dim 1024

# CataractFT path (example: V-JEPA2 ViT-L)
cd pretrain
python ssl_pretrain_vjepa2.py --model vjepa2-vitl-fpc64-256
python finetune_lora.py --model vjepa2-l --ssl_checkpoint ./checkpoints/ssl/vjepa2-vitl-fpc64-256/final.pt
cd ../V-JEPA2-ViT-L-CataractFT
python extract_features.py --device cuda --lora_path ../pretrain/checkpoints/vjepa2-l/best_lora --force
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100 --features_dim 1024
```

## Transfer Learning (CataractFT)

The `pretrain/` directory implements a three-stage transfer learning pipeline for domain adaptation:
```
Stage 1: SSL Pre-training on Cataracts-1K (1000 unlabeled cataract videos)
        ↓
Stage 2: LoRA Fine-tuning on annotated cataract datasets (Cataracts-1K annotated + Cataract-101)
        ↓
Stage 3: Feature extraction on SICS-155 → MS-TCN++ training
```
Stage 1 — Self-Supervised Pre-training:
- DINOv3: Self-distillation with multi-crop augmentation (2 global + 8 local crops)
- V-JEPA2: Masked video prediction (90% patch masking, predictor reconstructs representations)
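The 90% masking in Stage 1 can be illustrated with a random token mask. This is a simplification: the real V-JEPA2 recipe uses structured multi-block spatiotemporal masks rather than independent random tokens.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.9, seed=0):
    """Boolean mask over patch tokens: True = masked (predicted),
    False = visible (fed to the context encoder). Illustrative only."""
    rng = np.random.default_rng(seed)
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask
```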
Stage 2 — LoRA Fine-tuning:
- Parameter-efficient fine-tuning using Low-Rank Adaptation
- Supervised on annotated cataract surgery datasets
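LoRA keeps the pretrained weight frozen and learns only a low-rank update: the adapted layer computes y = x(W + (α/r)·AB), where r is the adapter rank. A minimal numpy sketch of this generic LoRA math (not the repo's finetune_lora.py):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ (W + (alpha/r) * A @ B): frozen weight W plus a
    trainable rank-r update. Only A (d_in, r) and B (r, d_out) are
    trained, so the number of tuned parameters stays tiny."""
    r = A.shape[1]
    return x @ (W + (alpha / r) * (A @ B))
```

With the standard initialization B = 0, the adapted layer starts out identical to the frozen pretrained layer.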
Stage 3 — SICS-155 Evaluation:
- Extract features using the fine-tuned encoder
- Train MS-TCN++ identically to the base models
```bash
# Example: SSL pre-train V-JEPA2 on Cataracts-1K
cd pretrain
python ssl_pretrain_vjepa2.py --model vjepa2-vitl-fpc64-256 --batch_size 2 --epochs 100

# LoRA fine-tune on annotated data
python finetune_lora.py --model vjepa2-vitl-fpc64-256 --checkpoint ./checkpoints/ssl/.../final.pt

# Extract features for SICS-155
python extract_ssl_features.py --model vjepa2-vitl-fpc64-256 --checkpoint ./checkpoints/...
```

See `pretrain/README.md` for detailed training instructions, SLURM scripts, and GPU requirements.
## Results

| Model | Accuracy | F1 (macro) | Edit Score | PR-AUC | F1@10 | F1@25 | F1@50 | mIoU |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 75.68 ± 1.14 | 67.65 ± 1.43 | 79.30 ± 1.46 | 71.77 ± 1.15 | 79.83 ± 1.14 | 75.33 ± 1.99 | 62.22 ± 2.49 | 53.51 ± 1.65 |
| I3D | 79.82 ± 0.94 | 71.66 ± 1.41 | 82.61 ± 2.46 | 75.74 ± 1.06 | 83.16 ± 2.51 | 79.15 ± 3.16 | 68.88 ± 2.65 | 58.35 ± 1.76 |
| DINOv3 ViT-B/16 | 78.41 ± 0.71 | 71.23 ± 0.87 | 82.55 ± 1.71 | 75.04 ± 1.23 | 83.79 ± 1.08 | 79.89 ± 1.77 | 68.20 ± 1.96 | 57.01 ± 1.96 |
| DINOv3 ViT-L/16 | 82.19 ± 1.28 | 75.80 ± 2.07 | 85.71 ± 1.74 | 78.50 ± 1.31 | 87.25 ± 1.80 | 84.39 ± 2.54 | 73.74 ± 4.44 | 62.40 ± 3.61 |
| DINOv3 ViT-7B/16 | **83.40 ± 1.37** | **76.48 ± 1.99** | **86.99 ± 1.51** | **79.45 ± 1.67** | **88.02 ± 2.23** | **84.48 ± 2.63** | **74.96 ± 1.82** | **64.07 ± 3.24** |
| V-JEPA2 ViT-L | 77.90 ± 0.97 | 69.92 ± 0.49 | 83.15 ± 1.02 | 74.20 ± 0.62 | 83.01 ± 0.92 | 78.15 ± 1.46 | 66.55 ± 1.73 | 55.58 ± 2.30 |
| V-JEPA2 ViT-g | 76.05 ± 1.04 | 67.54 ± 1.21 | 81.18 ± 2.59 | 72.25 ± 0.74 | 80.57 ± 2.50 | 76.48 ± 2.19 | 63.84 ± 2.59 | 53.23 ± 2.09 |
All values are percentages (mean ± std across 5 folds). Bold indicates best performance.
- DINOv3 ViT-7B/16 achieves the best overall performance across all metrics, demonstrating the benefit of scale in self-supervised models for surgical video understanding.
- DINOv3 scales consistently: accuracy climbs from ViT-B (78.41%) through ViT-L (82.19%) to ViT-7B (83.40%).
- Self-supervised ViT-L+ models surpass supervised baselines — DINOv3 ViT-L/16 outperforms both ResNet-50 and I3D across all metrics.
- I3D's temporal modeling provides an edge at smaller scale: I3D (79.82%) beats DINOv3 ViT-B/16 (78.41%), likely because its native 3D temporal convolutions capture motion cues that per-frame ViT-B features miss.
- V-JEPA2 underperforms DINOv3 at equivalent scale — V-JEPA2 ViT-L (77.90%) vs DINOv3 ViT-L (82.19%), suggesting DINOv3's discriminative self-distillation produces better features for surgical phase recognition than V-JEPA2's generative masked prediction objective.
## Metrics

| Metric | Level | Description |
|---|---|---|
| Accuracy | Frame | Percentage of correctly classified frames |
| F1 Score (macro) | Frame | Macro-averaged F1 across all 19 phases |
| Edit Score | Segment | Normalized Levenshtein distance between predicted and ground truth phase sequences |
| PR-AUC | Frame | Area under the Precision-Recall curve (from model probability outputs) |
| F1@{10,25,50} | Segment | Segmental F1 at IoU overlap thresholds of 10%, 25%, 50% |
| mIoU | Frame | Mean Intersection over Union (Jaccard Index) across all phases |
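The segment-level edit score in the table above can be sketched as follows (illustrative; the project's implementation lives in include/utils.py):

```python
def segments(labels):
    """Collapse per-frame labels into the ordered phase sequence."""
    out = []
    for l in labels:
        if not out or out[-1] != l:
            out.append(l)
    return out

def edit_score(pred, gt):
    """100 * (1 - Levenshtein(pred_segments, gt_segments) /
    max(len(pred_segments), len(gt_segments))). Higher is better;
    over-segmentation lengthens the predicted sequence and is
    penalized even when frame accuracy is high."""
    p, g = segments(pred), segments(gt)
    # standard dynamic-programming edit distance over segment labels
    D = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        D[i][0] = i
    for j in range(len(g) + 1):
        D[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + cost)  # substitution
    return 100.0 * (1.0 - D[len(p)][len(g)] / max(len(p), len(g), 1))
```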

