<a href="https://colab.research.google.com/github/robosherpa/NeuralRecon/blob/cvrg-review/A_Review_on_NeuralRecon_at_CVRG_Summer_2021_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![title](img/slides0.png)

In this notebook, lets review [NeuralRecon](https://zju3dv.github.io/neuralrecon/) paper.

# Real-time Coherent 3D Reconstruction from Monocular Video

There are four elements in this field of research
### Real-time 
In this 3D reconstruction Real-time means 33 keyframes per second.

### Coherent
One example is shown in the red boxes in Fig. 1, where **the depth-based method struggles to produce coherent depth estimations on the chairs and wall.**
        
Different from depth-based methods that predict depth maps for each key frame separately, the surface geometry within a local fragment window is jointly predicted in NeuralRecon, and thus **locally coherent geometry estimation* can be produced.

Coherent pertains to embeddings of temporal relationship between frames.
No more independent, identically distributed (i.i.d.) assumptions.

\\[
\gamma^2_{xy} = \frac{|S_{xy}(f)^2|}{S_{xx}(f)S_{yy}(f)}
\\]

where,

- $S_{xy}(f)$ is the **cross-spectral density** of the signal and

- $S_{xx}(f)$ and $S_{yy}(f)$ are the **power spectral density** functions of $x(t)$  and $y(t)$ , respectively

- The **cross-spectral density** is defined as Fourier Transforms cross-correlation signals

- The **power spectral density** are defined as the Fourier transforms of the autocorrelation signals 

### 3D (Surface) Reconstruction
    
Mathematically, 
Given a static scene (S) in Ground Truth. The algorithm, under a partial observation of the static world, creates a consistent geometric representation of the World using an objective/loss function.
\\[ p(\mathbf{S}_k|\mathbf{x}_{0:k},m_{0:k}) = \mathcal Expon(\mathbf{S}_{0:k-1},\mathbf{x}_k,m_k) \\]

where,
- $\mathbf{S}$ is the view independent static scene to be reconstructed 
- $\mathbf{x}$ is image frame at each instance in the period [0, $k$] 
- $m$ is the motion model from the camera pose estimates in a SLAM system in the period [0, $k$]
- $\mathcal Expon(\mathbf{x}) = \mathbf{\lambda} \times e^{-\lambda \mathbf{x}}$ for $\mathbf{x} >= 0$ is the space transformation function.

        
### Monocular Vision

$\mathbf{x}$ is a Monocular Image 

\\[\mathbf{x}_{i} = \mathbf{x}_{j} \leftrightarrow i = j \, \forall {i, j} \in [0, k]\\]



# Previous Methods


![title](img/slides3.png)
![title](img/slides4.png)
![title](img/slides5.png)
![title](img/slides6.png)

### Multiview Depth Estimation
- Plane-sweeping Stereo under the assumption of photo-consistency
- MVDepthNet [48] and Neural RGB->D [24] use 2D CNNs to process the 2D depth cost volume constructed from multiview image features. 
-  CNMNet [26] further leverages the **planar structure in indoor scenes** to constrain the surface normals calculated from the predicted depth maps to obtain smooth depth estimation. 

### Depthmaps per Frame
- All the above-mentioned works adopt **single-view depth maps as intermediate representations**.
- SurfaceNet [15, 16] takes a different approach and uses a unified volumetric representation to predict the volume occupancy. Recently, Atlas [30] also proposes a volumetric design and direct predicts TSDF and semantic labels with 3D CNN. 




# Proposed System

Given a sequence of monocular images and their corresponding camera poses estimated by a SLAM system, NeuralRecon incrementally reconstructs **local geometry in a view-independent 3D volume instead of view-dependent depth maps**. Specifically, it **unprojects the image features to form a 3D feature volume** and then uses sparse convolutions to process the feature volume to output a sparse TSDF volume.

With a coarse-to-fine design, the predicted TSDF is gradually refined at each level. By directly reconstructing the implicit surface (TSDF), the network is able to learn the local smoothness and global shape prior of natural 3D surfaces.

To make the current-fragment reconstruction to be globally consistent with the previously reconstructed fragments, a $\mathbf{learning-based\ TSDF\ fusion\ module\ using\ the\ Gated\ Recurrent\ Unit\ (GRU)}$ is proposed. The GRU fusion makes the current-fragment reconstruction conditioned on the previously reconstructed global volume, yielding a joint reconstruction and fusion approach. As a result, the reconstructed mesh is dense, accurate and globally coherent in scale. Furthermore, predicting the volumetric representation also removes the redundant computation in depth-based methods, which allows us to use a larger 3D CNN while maintaining the real-time performance.

![title](img/slides7.png)



### Key Frames Selection

Given a sequence of monocular images ${\mathbf{I}_t}$ and camera pose trajectory ${\xi_t}$ $\in$ $\mathbb{SE}(3)$ provided by a SLAM system, the goal is to reconstruct dense 3D scene geometry accurately in real-time. We denote the global TSDF volume to reconstruct as $\mathbf{S}^g_t$, 

where,
- $t$ represents the current time step.

The system architecture is illustrated in Fig. 2.
![title](img/fig2.png)


### Local Fragment

![title](img/slides8.png)
![title](img/slides11.png)
![title](img/slides12.png)
![title](img/slides13.png)


A new incoming frame is selected as a key frame if 
- its relative translation is greater than $t_{max}$ and 
- the relative rotation angle is greater $R_{max}$. 

A window with $N$ key frames is defined as a local fragment. 

After key frames are selected, 
a cubic-shaped fragment bounding volume (**FBV**) 
that encloses all the key frame view-frustums is computed with a fixed max depth range $d_{max}$ in each view. 
Only the region within the **FBV** is considered during the reconstruction of each fragment.

### Joint Fragment Reconstruction and Fusion

We propose to 

simultaneously reconstruct the TSDF volume of a local fragment $S_t^l$ and
fuse it with global TSDF volume $S^g_t$ with a learning-based approach. 

![title](img/slides14.png)
![title](img/slides16.png)
![title](img/slides17.png)
![title](img/slides18.png)

The joint reconstruction and fusion is carried out in the local coordinate system. 
![title](img/slides19.png)


The definition of the local and global coordinate systems as well as the construction of FBV are illustrated in Fig. 1 of the supplementary material.

![title](img/slides13.png)


#### Image Feature Volume Construction

The N images in the local fragment are first passed through the image backbone to extract the multi-level features.

Similar to previous works on volumetric reconstruction [18, 15, 30], the extracted features are back-projected along each ray into the 3D feature volume. 

The image feature volume $F^l_t$ is obtained by **averaging** the features from different views according to the visibility weight of each voxel. 

The visibility weight is defined as the number of views from which a voxel can be observed in the local fragment. A visualization of this unprojection process can be found in Fig.3 i.

#### Coarse to Fine TSDF Reconstruction

We adopt a coarse-to-fine approach to gradually refine the predicted TSDF volume at each level. 

We use 3D sparse convolution to efficiently process the feature volume $F^l_t$. 

The sparse volumetric representation also naturally integrates with the coarse-to-fine design. 

Specifically, each voxel in the TSDF volume $S^l_t$ contains two values, the occupancy score $o$ and the SDF value $x$. At each level, both $o$ and $x$ are predicted by the
MLP. 

The occupancy score represents the confidence of a voxel being within the TSDF truncation distance $\lambda$. 

The voxel whose occupancy score is lower than the sparsification threshold θ is defined as void space and will be sparsified. 

This representation of sparse TSDF volume is visually illustrated in Fig.3 iii. 

After the sparsification, $S^l_t$ is upsampled by 2× and concatenated with the $F^{l+1}_t$ as the input for the GRU Fusion module (introduced later) in the next level. 

Instead of estimating single-view depth maps for each
key frame, NeuralRecon jointly reconstructs the implicit surface within the bounding volume of the local fragment
window. 

This design guides the network to learn the natural surface prior directly from the training data. 

As a result,the reconstructed surface is locally smooth and coherent in scale. 

Notably, this design also leads to less redundant computation compared to depth-based methods since each area on the 3D surface is estimated only once during the fragment reconstruction.

#### GRU Fusion
To make the reconstruction consistent between fragments, we propose to make the current-fragment reconstruction to be conditioned on the reconstructions in previous fragments. We use a 3D convolutional variant of Gated Recurrent Unit (GRU) [6] module for this purpose. As illustrated in Fig.3 ii, at each level the image feature
volume $F^l_t$ is first passed through the 3D sparse convolution layers to extract 3D geometric features $G^l_t$. The hidden state $H^l_{t−1}$ is extracted from the global hidden state $H^g_{t−1}$ within the fragment bounding volume. GRU fuses $G^l_t$ with hidden state $H^l_{t−1}$ and produces the updated hidden state $H^l_t$, which will be passed through the MLP layers to predict the TSDF volume $S^l_t$ at this level. The hidden state $H^l_t$ will
also be updated to global hidden state $H^g_t$ by directly replacing the corresponding voxels. Formally, denoting $z_t$ as the update gate, $r_t$ as the reset gate, $\sigma$ as the sigmoid function and $W_∗$ as the weight for sparse convolution, GRU fuses $G^l_t$ with hidden state $H^l_{t−1}$ with the following operations:

\\[ 
z_t = \sigma(SparseConv([H^l_{t−1}, G^l_t], W_z)) \\
r_t = \sigma(SparseConv([H^l_{t−1}, G^l_t], W_r)) \\
\tilde{H}^l_t = tanh(SparseConv([r_t \odot H^l_{t−1}, G^l_t], W_h)) \\
H^l_t = (1 − z_t) \odot H^l_{t−1} + z_t \odot \tilde{H}^l_t \\]


Intuitively, in the context of joint reconstruction and fusion of TSDF, the update gate $z_t$ and forget gate $r_t$ in the GRU determine how much information from the previous reconstructions (i.e. hidden state $H^l_{t−1}$) is fused to the current-fragment geometric feature $G^l_t$, as well as how much information from the current-fragment will be fused into the hidden state H^l_t. As a data-driven approach, the GRU serves as a selective attention mechanism that replaces the linear running-average operation in conventional TSDF fusion [31]. By predicting $S^l_t$ after the GRU, the MLP network can leverage the context information accumulated from history fragments to produce consistent surface geometry across local fragments. This is also conceptually analogous to the depth filter in a non-learning-based 3D reconstruction pipeline [38, 34], where the current observation
and the temporally-fused depths are fused with the Bayesian filter. The effectiveness of joint reconstruction and fusion is validated in the ablation study.

#### Integration to the Global TSDF Volume
At the last coarse-to-fine level, $S^3_t$ is predicted and further sparsified to $S^l_t$. Since the fusion between $S^l_t$ and $S^g_t$ has been done in GRU Fusion, $S^l_t$ is integrated into $S^g_t$ by directly replacing the corresponding voxels after being transformed into the global coordinate. At each time step t, Marching Cubes is
performed on $S^g_t$ to reconstruct the mesh.

### Sparse TSDF Volumetric Representation
 Truncated Signed Distance Function 
 
### Supervision / Loss Function

Following [9], two loss functions are used to supervise the network. The occupancy loss is defined
as the binary cross-entropy (BCE) between the predicted occupancy values and the ground-truth occupancy values.
The SDF loss is defined as the $l1$ distance between the predicted SDF values and the ground-truth SDF values. We
log-transform the SDF values of predictions and groundtruth before applying the $l1$ loss. The supervision is applied to all the coarse-to-fine levels.

Mathematically,

\\[
BCE(S) = \sum_{c=1}^{M}( l_1( S_g^t - S_g^*)  log(l_1( S_g^t - S_g^*)))
\\]

where,
- $M$ is number of classes (dog, cat, fish)
- $log$ is the natural log
y - binary indicator (0 or 1) if class label c is the correct classification for observation o
p - predicted probability observation o is of class c

### Implementation Details
We use torchsparse [43] as the implementation of 3D sparse convolution. 

The image backbone is a variant of **MnasNet** [41] and is initialized with the weights pretrained from ImageNet. 

**Feature Pyramid Network** [23] is used in the **backbone** to extract more representative multi-level features. 

The entire network is trained end-to-end with randomly initialized weights except for the image backbone.

The occupancy score $o$ is predicted with a Sigmoid layer.

The voxel size of the last level is 4cm and the TSDF truncation distance $λ$ is set to 12cm.
$d_{max}$ is set to 3m. $R_{max}$ and $t_{max} are set to 15°and 0.1m respectively. 
$θ$ is set to 0.5.
Nearest-neighbor interpolation is used in the upsampling between coarse-to-fine levels.

  ### Experiments
  
  #### Datasets
  
We perform the experiments on two indoor datasets, 
- ScanNet (V2) [8] and
- 7-Scenes [39]. 

The ScanNet dataset contains 1613 indoor scenes with 
- ground-truth camera poses,
- surface reconstructions, and
- semantic segmentation labels.

There are two training/validation splits commonly used in previous works (defined in [30] and [42]) for
the ScanNet dataset. We use the same training and validation data with the corresponding baseline methods to make a fair comparison. 

The 7-Scenes dataset is another challenging RGB-D dataset captured in indoor scenes. Following the baseline method [26], we use the model trained on ScanNet to perform the validation on 7-Scenes.

  #### Metrics
  
We consider F-score as the most suitable metrics to measure 3D reconstruction quality since
both the accuracy and completeness of the reconstruction are considered.

- Accuracy of Reconstruction
- Completeness of Reconstruction

  ![title](img/atlas0.png)

### Results

![title](img/slides21.png)
![title](img/slides22.png)
![title](img/slides23.png)
![title](img/slides24.png)
![title](img/slides25.png)

### Result Tables
![title](img/table1and2.png)
![title](img/table3.png)
![title](img/table4and5.png)

### Claim and Review
- Real-time
- coherent*
- 3D reconstruction
- from **posed** Monocular Videos


References:

[Empirical Evaluation of Rectified Activations in Convolution Network](https://arxiv.org/pdf/1505.00853.pdf)

[Distilling Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
[MnasNet: Platform-Aware Neural Architecture Search for Mobile](https://arxiv.org/pdf/1807.11626.pdf)
[Feature Pyramid Networks for Object Detection](https://arxiv.org/pdf/1612.03144.pdf)

Handy Acronyms List:
mobile neural architecture search (MNAS)

# Code Playground

This section is my fork of the repository, all the notes during this review are in the fork. 
I use local runtime to make use of my PC's GPU to run the solution.

In [1]:
!mkdir -p /tmp/cvrg/src
!cd /tmp/cvrg/src
!git clone https://github.com/robosherpa/NeuralRecon.git
!cd NeuralRecon
!git checkout cvrg-review

fatal: destination path 'NeuralRecon' already exists and is not an empty directory.
M	review/NeuralRecon_cvrg_review_summer_2021.ipynb
D	review/refs/kinectfusion.pdf
Already on 'cvrg-review'
