Benchmarking self-supervised vision foundation models (DINOv3, V-JEPA2) against supervised models (ResNet-50, I3D) for frame-level feature extraction on the SICS-155 surgical phase segmentation dataset. MS-TCN++ (Multi-Stage Temporal Convolutional Network) serves as the temporal backbone across all experiments, isolating the impact of spatial representations on phase segmentation performance.
- Overview
- Dataset
- Models
- Project Structure
- Installation
- Pipeline
- Transfer Learning (CataractFT)
- Results
- Metrics
## Overview

This project evaluates how different vision encoders affect temporal action segmentation quality in cataract surgery videos. Each feature extractor produces per-frame embeddings that are fed into a shared MS-TCN++ architecture for phase prediction. By keeping the temporal backbone constant, differences in downstream performance can be attributed directly to the quality of spatial (or spatiotemporal) feature representations.
Key research questions:
- Do self-supervised foundation models outperform supervised baselines for surgical video understanding?
- How does model scale (ViT-B → ViT-L → ViT-7B) affect segmentation quality?
- Does domain-specific fine-tuning on cataract videos (CataractFT) improve over generic pretrained features?
## Dataset

SICS-155 — 155 annotated small-incision cataract surgery (SICS) videos with frame-level phase annotations across 19 surgical phases.
| Split | Videos |
|---|---|
| Train | ~100 (BL/BN/SD procedure types) |
| Validation | 15 |
| Test | 40 (held out) |
| ID | Phase |
|---|---|
| 0 | background |
| 1 | peritomy |
| 2 | cautery |
| 3 | scleral_groove |
| 4 | incision |
| 5 | tunnel |
| 6 | sideport |
| 7 | AB_injection_and_wash |
| 8 | OVD_injection |
| 9 | capsulorrhexis |
| 10 | main_incision_entry |
| 11 | hydroprocedure |
| 12 | nucleus_prolapse |
| 13 | nucleus_delivery |
| 14 | cortical_wash |
| 15 | OVD_IOL_insertion |
| 16 | OVD_wash |
| 17 | stromal_hydration |
| 18 | tunnel_suture |
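The table above mirrors the repository's mapping.txt. As an illustrative sketch (assuming the common MS-TCN-style format of one `<id> <name>` pair per line, e.g. `0 background`), the mapping can be loaded like this:

```python
from pathlib import Path

def load_mapping(path):
    """Parse a phase-mapping file into {id: name}.

    Assumes one "<id> <name>" pair per line, e.g. "0 background";
    check mapping.txt if your copy uses a different delimiter.
    """
    mapping = {}
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        idx, name = line.split(maxsplit=1)
        mapping[int(idx)] = name.strip()
    return mapping
```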
Videos are stratified by procedure type prefix: BL, BN, SD.
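A minimal sketch of prefix-stratified fold assignment, illustrating the idea behind generate_k_folds.py rather than its actual implementation:

```python
from collections import defaultdict

def stratified_folds(video_names, k=5):
    """Assign videos to k folds so each fold gets a similar mix of
    procedure types, inferred from the two-letter filename prefix
    (BL/BN/SD). Deterministic round-robin sketch; the real script
    may shuffle within strata."""
    by_type = defaultdict(list)
    for name in sorted(video_names):
        by_type[name[:2]].append(name)   # group by procedure prefix
    folds = [[] for _ in range(k)]
    i = 0
    for group in by_type.values():
        for name in group:
            folds[i % k].append(name)    # round-robin within each stratum
            i += 1
    return folds
```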
## Models

| Model | Type | Params | Feature Dim | Supervision | Folder |
|---|---|---|---|---|---|
| ResNet-50 | CNN | 25M | 2048 | Supervised (ImageNet) | Base/ |
| I3D | 3D CNN | 25M | 2048 | Supervised (Kinetics-400) | I3D/ |
| DINOv3 ViT-B/16 | ViT | 86M | 768 | Self-supervised | DinoV3-ViT-B16/ |
| DINOv3 ViT-L/16 | ViT | 304M | 1024 | Self-supervised | DinoV3-ViT-L16/ |
| DINOv3 ViT-7B/16 | ViT | 7B | 4096 | Self-supervised | DinoV3-ViT-7B16/ |
| V-JEPA2 ViT-L | ViT | ~300M | 1024 | Self-supervised | V-JEPA2-ViT-L/ |
| V-JEPA2 ViT-g | ViT | ~1B | 1408 | Self-supervised | V-JEPA2-ViT-g16-384/ |
CataractFT variants (SSL pretrain → LoRA fine-tune on cataract data):
| Model | Folder |
|---|---|
| DINOv3 ViT-B/16 + CataractFT | DinoV3-ViT-B16-CataractFT/ |
| DINOv3 ViT-L/16 + CataractFT | DinoV3-ViT-L16-CataractFT/ |
| DINOv3 ViT-7B/16 + CataractFT | DinoV3-ViT-7B16-CataractFT/ |
| V-JEPA2 ViT-L + CataractFT | V-JEPA2-ViT-L-CataractFT/ |
| V-JEPA2 ViT-g + CataractFT | V-JEPA2-ViT-g-CataractFT/ |
## Project Structure

```
SICS-155/
├── README.md                      # This file
├── requirements.txt               # Python dependencies
├── mapping.txt                    # Phase ID → phase name mapping (19 classes)
├── generate_k_folds.py            # Generate stratified k-fold splits
├── compute_results.py             # Master cross-model comparison script
├── TODO.txt                       # Experiment log and results summary
│
├── SICS_155_train/                # Training data
│   └── train/
│       ├── train_annotations.csv
│       ├── groundTruth/           # Per-video phase annotations (.txt)
│       └── videos/                # Raw surgery videos
│
├── SICS_155_validation/           # Validation data
│   └── val/
│       ├── val_annotations.csv
│       ├── groundTruth/
│       └── videos/
│
├── comparison_results/            # Cross-model comparison outputs
│   ├── kfold_comparison.xlsx
│   ├── kfold_comparison_chart.png
│   ├── kfold_comparison_table.png
│   └── model_comparison.xlsx
│
├── Base/                          # ResNet-50 (ImageNet) — Baseline
├── I3D/                           # I3D (Kinetics-400)
├── DinoV3-ViT-B16/                # DINOv3 ViT-B/16
├── DinoV3-ViT-L16/                # DINOv3 ViT-L/16
├── DinoV3-ViT-7B16/               # DINOv3 ViT-7B/16
├── V-JEPA2-ViT-L/                 # V-JEPA2 ViT-L
├── V-JEPA2-ViT-g16-384/           # V-JEPA2 ViT-g
├── DinoV3-ViT-B16-CataractFT/     # DINOv3 ViT-B/16 + CataractFT
├── DinoV3-ViT-L16-CataractFT/     # DINOv3 ViT-L/16 + CataractFT
├── DinoV3-ViT-7B16-CataractFT/    # DINOv3 ViT-7B/16 + CataractFT
├── V-JEPA2-ViT-L-CataractFT/      # V-JEPA2 ViT-L + CataractFT
├── V-JEPA2-ViT-g-CataractFT/      # V-JEPA2 ViT-g + CataractFT
│
└── pretrain/                      # SSL pre-training pipeline
    ├── ssl_config.py              # Model/training configuration
    ├── ssl_dataset.py             # Video dataset and augmentations
    ├── ssl_pretrain_dinov3.py     # DINOv3 self-distillation training
    ├── ssl_pretrain_vjepa2.py     # V-JEPA2 masked prediction training
    ├── extract_ssl_features.py    # Feature extraction from SSL checkpoints
    ├── finetune_lora.py           # LoRA supervised fine-tuning
    ├── launch_ssl_dinov3.sh       # SLURM launch script (DINOv3)
    ├── launch_ssl_vjepa2.sh       # SLURM launch script (V-JEPA2)
    ├── launch_ssl_multinode.sh    # Multi-node SLURM script
    ├── cataracts1k/               # Cataracts-1K dataset (1000 unlabeled + 56 annotated)
    └── cataract101/               # Cataract-101 annotated dataset
```
Each model folder (e.g., Base/, DinoV3-ViT-L16/) contains the same structure:
```
<Model>/
├── extract_features.py            # Feature extraction using the model's encoder
├── train.py                       # MS-TCN++ training script
├── model.py                       # MS-TCN++ architecture (MS_TCN2)
├── batch_gen.py                   # Batched data loading from .npy features
├── evaluation.py                  # Per-split evaluation (all metrics)
├── compute_results.py             # Results aggregation and visualization
├── kfold_summary.py               # K-fold aggregate statistics
├── run_k_folds.py                 # Automated k-fold CV runner
├── focalloss.py                   # Focal loss implementation
├── export_to_excel.py             # Export results to Excel
├── include/
│   └── utils.py                   # Shared utilities (edit score, label parsing)
├── data/SICS155/
│   ├── features/                  # Extracted .npy features (feature_dim × T)
│   ├── groundTruth/               # Phase annotations for train split
│   ├── groundTruth_val/           # Phase annotations for validation
│   ├── groundTruth_split_*/       # Per-fold ground truth
│   ├── splits/                    # Train/test bundle files per fold
│   └── mapping.txt                # Phase mapping
├── models/                        # Saved MS-TCN++ checkpoints
├── results/                       # Prediction outputs per split
├── results.xlsx                   # Consolidated results table
├── visualizations/                # Single-split plots
└── visualizations_kfold/          # K-fold comparison plots
```
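Feature matrices and ground-truth files must stay frame-aligned: each video's `(feature_dim, T)` array in features/ must match the T label lines in its groundTruth/ file. A quick sanity check (hypothetical helper, not a script shipped in the repo):

```python
import numpy as np
from pathlib import Path

def check_alignment(feat_path, gt_path):
    """Verify a (feature_dim, T) feature matrix matches its per-frame
    label file (one phase name per line). Hypothetical helper."""
    feats = np.load(feat_path)  # shape (feature_dim, T)
    labels = [l for l in Path(gt_path).read_text().splitlines() if l.strip()]
    if feats.shape[1] != len(labels):
        raise ValueError(
            f"{feats.shape[1]} feature frames vs {len(labels)} labels")
    return feats.shape[0], len(labels)
```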
## Installation

```bash
# Clone the repository
git clone <repo-url> SICS-155
cd SICS-155

# Create virtual environment
python -m venv .venv
.venv\Scripts\activate        # Windows
# source .venv/bin/activate   # Linux/Mac

# Install dependencies
pip install -r requirements.txt
```

| Package | Purpose |
|---|---|
| torch, torchvision | Deep learning framework |
| transformers | DINOv3 & V-JEPA2 model loading (Hugging Face) |
| timm | Vision model utilities |
| decord | Fast GPU video decoding |
| pytorchvideo | I3D model loading |
| scikit-learn | Metrics (F1, PR-AUC) |
| opencv-python | Video frame extraction |
| matplotlib | Visualization |
| pandas, openpyxl | Results tables and Excel export |
| wandb | Experiment tracking (optional) |
## Pipeline

Each model folder has an `extract_features.py` that loads the pretrained encoder and produces per-frame feature vectors saved as `.npy` files (`feature_dim × num_frames`).
```bash
# ResNet-50 (2048-d per frame)
cd Base && python extract_features.py --device cuda

# I3D (2048-d per frame, 16-frame sliding window)
cd I3D && python extract_features.py --device cuda --clip_len 16

# DINOv3 ViT-B/16 (768-d per frame)
cd DinoV3-ViT-B16 && python extract_features.py --device cuda

# V-JEPA2 ViT-L (1024-d per frame, 64-frame clips)
cd V-JEPA2-ViT-L && python extract_features.py --device cuda --stride 4
```

The MS-TCN++ architecture (Prediction Generation + Refinement stages) is shared across all models. Only `--features_dim` changes.
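Because only --features_dim differs between models, the value can be read off any saved feature file rather than typed by hand (a sketch, assuming the feature_dim × num_frames layout described above):

```python
import numpy as np

def infer_features_dim(feature_file):
    """Features are saved as (feature_dim, T), so axis 0 gives the
    value to pass to train.py as --features_dim."""
    return int(np.load(feature_file).shape[0])
```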
```bash
cd <Model>

# Train on a single split
python train.py --action train --dataset SICS155 --split 1 \
    --features_dim 2048 --num_epochs 100 \
    --num_layers_PG 13 --num_layers_R 13 --num_R 4 \
    --loss_mse 0.35 --adaptive_mse --device cuda:0

# Generate predictions
python train.py --action predict --dataset SICS155 --split 1 \
    --features_dim 2048 --num_epochs 100 \
    --num_layers_PG 13 --num_layers_R 13 --num_R 4 \
    --device cuda:0
```

MS-TCN++ Configuration:
- Prediction Generation: 13 dilated causal convolution layers
- Refinement Stages: 4 stages × 13 layers each
- Feature maps: 64
- Loss: Cross-entropy + Adaptive T-MSE (λ=0.35) + optional Focal Loss
- Optimizer: Adam (lr=5e-4)
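For reference, the truncated MSE smoothing term of the loss can be sketched in numpy as below, with the change in log-probabilities clamped at τ = 4 as in the MS-TCN paper so genuine phase transitions are not over-penalized (the adaptive variant enabled by --adaptive_mse is a project-specific extension; see train.py):

```python
import numpy as np

def truncated_mse(log_probs, tau=4.0):
    """Smoothing loss from MS-TCN: mean squared frame-to-frame change
    in per-class log-probabilities, clamped at tau.
    log_probs: array of shape (num_classes, T)."""
    delta = np.abs(log_probs[:, 1:] - log_probs[:, :-1])
    delta = np.minimum(delta, tau)  # truncate large (real) transitions
    return float(np.mean(delta ** 2))
```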
```bash
cd <Model>

# Evaluate a single split
python evaluation.py ./data/SICS155/groundTruth_split_1 ./results/SICS155/split_1 \
    --mapping_path ./data/SICS155/mapping.txt

# Generate visualizations and comparison across models
python compute_results.py
python compute_results.py --compare-kfold 5
```

Stratified 5-fold CV is used for robust evaluation. Splits are stratified by procedure type (BL/BN/SD).
```bash
# Generate k-fold splits (run once, from any model folder)
python generate_k_folds.py ./data/SICS155 5

# Run all folds for a model
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100

# Aggregate k-fold results
python kfold_summary.py --folds 5 --verbose

# Cross-model k-fold comparison (from project root)
python compute_results.py --compare-kfold 5
```

This repository supports two evaluation paths on SICS-155:
- SSL-only test (no CataractFT):
  - Use base SSL model folders directly: `DinoV3-ViT-B16/`, `DinoV3-ViT-L16/`, `V-JEPA2-ViT-L/`
  - Run feature extraction + MS-TCN++/k-fold in the same folder.
  - Typical feature dimensions:
    - DINOv3 ViT-B/16: 768
    - DINOv3 ViT-L/16: 1024
    - V-JEPA2 ViT-L: 1024
- CataractFT transfer pipeline:
  - Run Stage 1/2 in `pretrain/` (SSL pretraining + LoRA fine-tuning).
  - Then evaluate using the CataractFT folders: `DinoV3-ViT-B16-CataractFT/`, `DinoV3-ViT-L16-CataractFT/`, `DinoV3-ViT-7B16-CataractFT/`, `V-JEPA2-ViT-L-CataractFT/`, `V-JEPA2-ViT-g-CataractFT/`
```bash
# SSL-only path (example: DINOv3 ViT-L)
cd DinoV3-ViT-L16
python extract_features.py --device cuda --force
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100 --features_dim 1024

# CataractFT path (example: V-JEPA2 ViT-L)
cd pretrain
python ssl_pretrain_vjepa2.py --model vjepa2-vitl-fpc64-256
python finetune_lora.py --model vjepa2-l --ssl_checkpoint ./checkpoints/ssl/vjepa2-vitl-fpc64-256/final.pt
cd ../V-JEPA2-ViT-L-CataractFT
python extract_features.py --device cuda --lora_path ../pretrain/checkpoints/vjepa2-l/best_lora --force
python run_k_folds.py --dataset SICS155 --folds 5 --num_epochs 100 --features_dim 1024
```

## Transfer Learning (CataractFT)

The `pretrain/` directory implements a three-stage transfer learning pipeline for domain adaptation:
```
Stage 1: SSL Pre-training on Cataracts-1K (1000 unlabeled cataract videos)
        ↓
Stage 2: LoRA Fine-tuning on annotated cataract datasets (Cataracts-1K annotated + Cataract-101)
        ↓
Stage 3: Feature extraction on SICS-155 → MS-TCN++ training
```
Stage 1 — Self-Supervised Pre-training:
- DINOv3: Self-distillation with multi-crop augmentation (2 global + 8 local crops)
- V-JEPA2: Masked video prediction (90% patch masking, predictor reconstructs representations)
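The 90% masking in Stage 1 can be illustrated with a random token mask. This is a simplification: the real V-JEPA2 recipe uses structured multi-block spatiotemporal masks rather than independent random tokens.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.9, seed=0):
    """Boolean mask over patch tokens: True = masked (predicted),
    False = visible (fed to the context encoder). Illustrative only."""
    rng = np.random.default_rng(seed)
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask
```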
Stage 2 — LoRA Fine-tuning:
- Parameter-efficient fine-tuning using Low-Rank Adaptation
- Supervised on annotated cataract surgery datasets
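LoRA keeps the pretrained weight frozen and learns only a low-rank update: the adapted layer computes y = x(W + (α/r)·AB), where r is the adapter rank. A minimal numpy sketch of this generic LoRA math (not the repo's finetune_lora.py):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """y = x @ (W + (alpha/r) * A @ B): frozen weight W plus a
    trainable rank-r update. Only A (d_in, r) and B (r, d_out) are
    trained, so the number of tuned parameters stays tiny."""
    r = A.shape[1]
    return x @ (W + (alpha / r) * (A @ B))
```

With the standard initialization B = 0, the adapted layer starts out identical to the frozen pretrained layer.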
Stage 3 — SICS-155 Evaluation:
- Extract features using the fine-tuned encoder
- Train MS-TCN++ identically to the base models
```bash
# Example: SSL pre-train V-JEPA2 on Cataracts-1K
cd pretrain
python ssl_pretrain_vjepa2.py --model vjepa2-vitl-fpc64-256 --batch_size 2 --epochs 100

# LoRA fine-tune on annotated data
python finetune_lora.py --model vjepa2-vitl-fpc64-256 --checkpoint ./checkpoints/ssl/.../final.pt

# Extract features for SICS-155
python extract_ssl_features.py --model vjepa2-vitl-fpc64-256 --checkpoint ./checkpoints/...
```

See `pretrain/README.md` for detailed training instructions, SLURM scripts, and GPU requirements.
## Results

| Model | Accuracy | F1 (macro) | Edit Score | PR-AUC | F1@10 | F1@25 | F1@50 | mIoU |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 75.68 ± 1.14 | 67.65 ± 1.43 | 79.30 ± 1.46 | 71.77 ± 1.15 | 79.83 ± 1.14 | 75.33 ± 1.99 | 62.22 ± 2.49 | 53.51 ± 1.65 |
| I3D | 79.82 ± 0.94 | 71.66 ± 1.41 | 82.61 ± 2.46 | 75.74 ± 1.06 | 83.16 ± 2.51 | 79.15 ± 3.16 | 68.88 ± 2.65 | 58.35 ± 1.76 |
| DINOv3 ViT-B/16 | 78.41 ± 0.71 | 71.23 ± 0.87 | 82.55 ± 1.71 | 75.04 ± 1.23 | 83.79 ± 1.08 | 79.89 ± 1.77 | 68.20 ± 1.96 | 57.01 ± 1.96 |
| DINOv3 ViT-L/16 | 82.19 ± 1.28 | 75.80 ± 2.07 | 85.71 ± 1.74 | 78.50 ± 1.31 | 87.25 ± 1.80 | 84.39 ± 2.54 | 73.74 ± 4.44 | 62.40 ± 3.61 |
| DINOv3 ViT-7B/16 | **83.40 ± 1.37** | **76.48 ± 1.99** | **86.99 ± 1.51** | **79.45 ± 1.67** | **88.02 ± 2.23** | **84.48 ± 2.63** | **74.96 ± 1.82** | **64.07 ± 3.24** |
| V-JEPA2 ViT-L | 77.90 ± 0.97 | 69.92 ± 0.49 | 83.15 ± 1.02 | 74.20 ± 0.62 | 83.01 ± 0.92 | 78.15 ± 1.46 | 66.55 ± 1.73 | 55.58 ± 2.30 |
| V-JEPA2 ViT-g | 76.05 ± 1.04 | 67.54 ± 1.21 | 81.18 ± 2.59 | 72.25 ± 0.74 | 80.57 ± 2.50 | 76.48 ± 2.19 | 63.84 ± 2.59 | 53.23 ± 2.09 |
All values are percentages (mean ± std across 5 folds). Bold indicates best performance.
- DINOv3 ViT-7B/16 achieves the best overall performance across all metrics, demonstrating the benefit of scale in self-supervised models for surgical video understanding.
- DINOv3 scales consistently: accuracy climbs from ViT-B (78.41%) through ViT-L (82.19%) to ViT-7B (83.40%).
- Self-supervised ViT-L+ models surpass supervised baselines — DINOv3 ViT-L/16 outperforms both ResNet-50 and I3D across all metrics.
- I3D's temporal modeling provides an edge at smaller scale: I3D (79.82%) beats DINOv3 ViT-B/16 (78.41%), likely because its native 3D temporal convolutions capture motion cues that per-frame ViT-B features miss.
- V-JEPA2 underperforms DINOv3 at equivalent scale — V-JEPA2 ViT-L (77.90%) vs DINOv3 ViT-L (82.19%), suggesting DINOv3's discriminative self-distillation produces better features for surgical phase recognition than V-JEPA2's generative masked prediction objective.
## Metrics

| Metric | Level | Description |
|---|---|---|
| Accuracy | Frame | Percentage of correctly classified frames |
| F1 Score (macro) | Frame | Macro-averaged F1 across all 19 phases |
| Edit Score | Segment | Normalized Levenshtein distance between predicted and ground truth phase sequences |
| PR-AUC | Frame | Area under the Precision-Recall curve (from model probability outputs) |
| F1@{10,25,50} | Segment | Segmental F1 at IoU overlap thresholds of 10%, 25%, 50% |
| mIoU | Frame | Mean Intersection over Union (Jaccard Index) across all phases |
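The segment-level edit score in the table above can be sketched as follows (illustrative; the project's implementation lives in include/utils.py):

```python
def segments(labels):
    """Collapse per-frame labels into the ordered phase sequence."""
    out = []
    for l in labels:
        if not out or out[-1] != l:
            out.append(l)
    return out

def edit_score(pred, gt):
    """100 * (1 - Levenshtein(pred_segments, gt_segments) /
    max(len(pred_segments), len(gt_segments))). Higher is better;
    over-segmentation lengthens the predicted sequence and is
    penalized even when frame accuracy is high."""
    p, g = segments(pred), segments(gt)
    # standard dynamic-programming edit distance over segment labels
    D = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        D[i][0] = i
    for j in range(len(g) + 1):
        D[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + cost)  # substitution
    return 100.0 * (1.0 - D[len(p)][len(g)] / max(len(p), len(g), 1))
```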

