Weijia Dou1*, Wenzhao Zheng2,3*,†, Weiliang Chen2, Yu Zheng2, Jie Zhou2, Jiwen Lu2
(*Equal contribution; †Project leader.)
1Tongji University 2Tsinghua University 3University of California, Berkeley
Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. These failures include geometric warping, incoherent motion, object impermanence, and perspective failures. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics.
To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions.
- Foreground-Background Disentanglement: Our approach first segments dynamic objects using motion object segmentation (MOS) to isolate the static background. Crucially, all subsequent SGC metrics and methods are applied only after this MOS step. As our evaluations show, seemingly improved scores without MOS can be misleading; metrics calculated with MOS more accurately reflect true background geometric inconsistencies.
- Depth-Aware Partitioning: After isolating the static areas, we predict depth for each pixel and partition the remaining static background into spatially coherent sub-regions.
- Composite Variance Scoring: We estimate a local camera pose for each sub-region and compute the divergence among these poses. The overall SGC score aggregates three component measures: local inter-segment pose consistency, global pose consistency, and cross-frame depth consistency error.
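The steps above can be sketched as follows. This is a minimal illustration of the composite scoring idea, not the released implementation: the pose representation (rotation matrix plus translation per sub-region), the geodesic rotation distance, and the equal component weights are all assumptions made for the example.

```python
import numpy as np

def pose_divergence(poses):
    """Mean pairwise divergence among local camera poses.

    Each pose is assumed to be an (R, t) pair: a 3x3 rotation matrix and a
    3-vector translation, one per static background sub-region.
    """
    rot_errs, trans_errs = [], []
    n = len(poses)
    for i in range(n):
        for j in range(i + 1, n):
            R_i, t_i = poses[i]
            R_j, t_j = poses[j]
            # Geodesic rotation distance: angle of the relative rotation.
            cos_angle = (np.trace(R_i.T @ R_j) - 1.0) / 2.0
            rot_errs.append(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
            trans_errs.append(np.linalg.norm(t_i - t_j))
    return float(np.mean(rot_errs)), float(np.mean(trans_errs))

def sgc_score(local_div, global_div, depth_err, weights=(1/3, 1/3, 1/3)):
    """Aggregate the three component measures into one score (lower is better).

    Equal weights are a placeholder, not the paper's calibrated values.
    """
    w1, w2, w3 = weights
    return w1 * local_div + w2 * global_div + w3 * depth_err
```

In this sketch, a perfectly consistent video (all sub-regions agreeing on a single camera pose, zero depth error) yields a score of 0, and divergence among local poses pushes the score up.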
For detailed installation instructions, please see docs/Install.md, which covers setting up the environment, dependencies, and any required submodules.
Our SGC metric can process either video files directly or pre-extracted frames. Organize your custom data in a directory structure like this:
your_dataset/
├── video1.mp4 # Option A: Direct video files
├── experiment_A_video/ # Option B: Pre-extracted frames
│ ├── 00000.jpg
│ ├── 00001.jpg
│ └── ...
└── ...
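A small helper can distinguish the two input options above when walking a dataset directory. This is an illustrative sketch (the function name and the set of recognized video extensions are assumptions), not part of the released code:

```python
from pathlib import Path

# Extensions treated as direct video files (Option A); anything that is a
# directory is assumed to hold pre-extracted frames (Option B).
VIDEO_EXTS = {".mp4", ".avi", ".mov"}

def list_entries(dataset_dir):
    """Yield (name, kind) pairs, where kind is 'video' or 'frames'."""
    for entry in sorted(Path(dataset_dir).iterdir()):
        if entry.is_file() and entry.suffix.lower() in VIDEO_EXTS:
            yield entry.name, "video"
        elif entry.is_dir():
            yield entry.name, "frames"
```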
Because our metrics are calculated strictly on the static background, you must perform motion object segmentation (MOS) before running the SGC calculation.
Step 1: Extract motion masks to isolate the static background
bash scripts/run_seganymo.sh
Step 2: Compute the SGC score on the segmented sub-areas
bash scripts/run_sgc.sh
python sgc/calculatescore.py
The output will be saved as a JSON file containing the overall SGC score and the breakdown of the three component metrics.
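A result file produced this way could be consumed as shown below. The key names used here are illustrative assumptions; check the JSON actually emitted by sgc/calculatescore.py for the exact schema.

```python
import json

def load_sgc_results(path):
    """Load the SGC output JSON and return the overall score plus the
    three component metrics.

    The key names below are assumed for illustration and may differ from
    the real output schema.
    """
    with open(path) as f:
        results = json.load(f)
    return {k: results[k] for k in
            ("sgc_score", "local_consistency",
             "global_consistency", "depth_consistency")}
```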
We curate a comprehensive benchmark of 1,296 videos, comprising 996 generated videos and 300 high-motion real videos. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
| Method | SGC Score (↓) |
|---|---|
| Cosmos | 0.0722 |
| Hotshot | 0.1172 |
| Latte | 0.3226 |
| Lavie | 0.1241 |
| Modelscope | 0.3129 |
| opensora-i | 0.1631 |
| opensora-t | 0.0831 |
| Seine | 0.2837 |
| Videocrafter | 0.0973 |
| Zeroscope | 0.0912 |
| RT-1 (Real) | 0.0639 |
| Nuscenes (Real) | 0.0613 |
| OpenVid (Real) | 0.0530 |
(For full quantitative comparisons across all 10 state-of-the-art models, please refer to Table 1 in our paper.)
This implementation is made possible by several excellent open-source foundational estimators. We sincerely thank the authors of:
