In [4]:
import os 
import os.path as osp
import shutil 
from dataclasses import asdict, dataclass 
from datetime import datetime 
from typing import Annotated 

import numpy as np 
import torch 
import tyro 
from torch.utils.data import DataLoader 

# MegaSAM

Given an unconstrained, continuous video sequence 

$$\mathcal{V} = \{ I_i \in \mathbb{R}^{H \times W} \}^N_{i=1}$$

We want to estimate: 

- Camera Poses: $\hat{\textbf{G}}_i \in SE(3)$
- Focal Length: $\hat{f}$ (if unknown)
- Dense Video Depth Maps: $\mathcal{D} = \{ \hat{D}_i \}^N_{i=1}$
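
To make the shapes of these three outputs concrete, here is a minimal sketch of a container for them. The class name `VideoGeometry`, its field names, and the choice of 4x4 homogeneous matrices for the poses are assumptions made for illustration; they are not taken from the MegaSAM code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class VideoGeometry:
    """Hypothetical container for the quantities estimated from a video."""

    poses: np.ndarray   # (N, 4, 4) homogeneous SE(3) matrices, one per frame
    focal: float        # focal length in pixels, shared by all frames
    depths: np.ndarray  # (N, H, W) dense depth maps, one per frame

    def is_valid(self, atol: float = 1e-6) -> bool:
        """Check shapes and that every pose is a rigid transform."""
        n = self.poses.shape[0]
        if self.poses.shape != (n, 4, 4):
            return False
        if self.depths.ndim != 3 or self.depths.shape[0] != n:
            return False
        R = self.poses[:, :3, :3]
        # Rotation blocks must be orthonormal with determinant +1.
        orthonormal = np.allclose(R @ R.transpose(0, 2, 1), np.eye(3), atol=atol)
        proper = np.allclose(np.linalg.det(R), 1.0, atol=atol)
        # The bottom row of each homogeneous matrix is [0, 0, 0, 1].
        bottom = np.allclose(self.poses[:, 3], [0.0, 0.0, 0.0, 1.0], atol=atol)
        positive = self.focal > 0 and bool(np.all(self.depths > 0))
        return bool(orthonormal and proper and bottom and positive)
```

A video with $N$ frames of size $H \times W$ would then carry `poses` of shape `(N, 4, 4)` and `depths` of shape `(N, H, W)`.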

# System Overview

We separate the problem of camera and scene structure estimation into two stages, in the spirit of a conventional SfM pipeline.

1. Estimate camera poses $\hat{\textbf{G}}_i$, focal length $\hat{f}$, and low-resolution disparity $\hat{\textbf{d}}$ from the input monocular video through differentiable Bundle Adjustment (BA), where we initialize $\hat{\textbf{d}}$ with monocular depth maps predicted by off-the-shelf models, e.g. UniDepth and Depth-Anything.
    
2. Fix the estimated camera parameters and perform first-order optimization over the depth and uncertainty maps by enforcing flow and depth losses induced by pairwise 2D optical flows.

<div align="center">
<img src="assets/flow_diagram.png" alt="Flow Diagram">
</div>


# Objective Function

The objective is a weighted sum of three cost terms:
$$C_{cvd} = w_{\text{flow}}C_{\text{flow}} + w_{\text{temp}}C_{\text{temp}} + w_{\text{prior}}C_{\text{prior}}$$

For each selected pair $(I_i, I_j)$, the flow reprojection loss $C_{\text{flow}}$ is an $\ell_1$ distance, weighted per pixel by the uncertainty map $\hat{M}_i$, between the optical flow $\text{flow}_{i\rightarrow j}$ predicted by RAFT and the correspondences $\textbf{u}_{ij}$ induced by our estimated camera motion and disparity through a multi-view constraint:

$$C^{i \rightarrow j}_{\text{flow}} = \hat{M}_i \left\| \textbf{u}_{ij} - \left( \textbf{p}_i + \text{flow}_{i\rightarrow j} \right) \right\|_1 + \log\left(\frac{1}{\hat{M}_i}\right)$$

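$$\textbf{u}_{ij} = \pi \left(\hat{\textbf{G}}_{ij} \circ \pi^{-1} (\textbf{p}_i, \hat{D}_i, K^{-1}), K \right)$$

Here $\textbf{p}_i$ is a pixel coordinate in frame $i$, $\pi^{-1}$ back-projects it to a 3D point using the depth $\hat{D}_i$ and the inverse intrinsics $K^{-1}$, $\hat{\textbf{G}}_{ij}$ is the relative camera motion from frame $i$ to frame $j$, and $\pi$ is the perspective projection with intrinsics $K$.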
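
The reprojection and the flow loss can be sketched in NumPy as follows. This is an illustrative sketch, not the MegaSAM implementation. It assumes that $\hat{\textbf{G}}_{ij}$ is given as a 4x4 homogeneous matrix, that $\hat{D}_i$ stores depth along the camera z-axis, and that the residual is taken between $\textbf{u}_{ij}$ and the flow-predicted target $\textbf{p}_i + \text{flow}_{i\rightarrow j}$. The function names `pixel_grid`, `reproject`, and `flow_loss` are hypothetical.

```python
import numpy as np


def pixel_grid(h, w):
    """Return pixel coordinates p_i as an (H, W, 2) array of (x, y)."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs, ys], axis=-1).astype(np.float64)


def reproject(p_i, depth_i, K, G_ij):
    """Compute u_ij = pi(G_ij o pi^{-1}(p_i, D_i, K^{-1}), K).

    p_i: (H, W, 2) pixels, depth_i: (H, W), K: (3, 3), G_ij: (4, 4).
    """
    ones = np.ones_like(depth_i)[..., None]
    homog = np.concatenate([p_i, ones], axis=-1)  # (H, W, 3)
    # pi^{-1}: back-project each pixel to a 3D point in camera i.
    X_i = (homog @ np.linalg.inv(K).T) * depth_i[..., None]
    # Apply the rigid transform that maps camera i to camera j.
    X_j = X_i @ G_ij[:3, :3].T + G_ij[:3, 3]
    # pi: perspective projection with intrinsics K.
    proj = X_j @ K.T
    return proj[..., :2] / proj[..., 2:3]


def flow_loss(u_ij, p_i, flow_ij, M_i):
    """Per-pixel weighted l1 flow loss with the log regularizer on M_i."""
    # l1 distance between induced and flow-predicted correspondences.
    residual = np.abs(u_ij - (p_i + flow_ij)).sum(axis=-1)
    return M_i * residual + np.log(1.0 / M_i)
```

Since $\hat{M}_i$ multiplies the residual, the $\log(1/\hat{M}_i)$ term is what keeps the optimizer from driving $\hat{M}_i$ to zero everywhere: for a fixed residual $r > 0$, the per-pixel loss is minimized at $\hat{M}_i = 1/r$.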