# Assignment 3: Neural Volume Rendering and Surface Rendering

#### Submitted by: Shahram Najam Syed
#### Andrew-ID: snsyed
#### Date: 11th March, 2025
#### Late days used: 0

## A. Neural Volume Rendering (80 points)

### 0. Transmittance Calculation (10 points)

<img src="./output/figure1.png">

Since
$$
\frac{dT}{dy} = -\sigma(y)T
$$

the base equation for transmittance becomes:
$$
T = e^{-\int \sigma(y) dy}
$$

So,
$$
T(y_1, y_2) = e^{-\int_{y_1}^{y_2} \sigma(y) dy} = e^{-2}
$$

$$
T(y_2, y_4) = e^{-\int_{y_2}^{y_3} \sigma(y) dy} \times e^{-\int_{y_3}^{y_4} \sigma(y) dy} = e^{-30.5}
$$

$$
T(x, y_4) = T(x, y_2) \times T(y_2, y_4) = T(x, y_1) \times T(y_1, y_2) \times T(y_2, y_4) = e^{-32.5}
$$

$$
T(x, y_3) = T(x, y_1) \times T(y_1, y_2) \times T(y_2, y_3) = e^{-2.5}
$$
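
As a quick numerical check, the products above can be reproduced by summing the optical depth of each segment along the ray. The per-segment values below are back-solved from the results stated above (for example, $T(x, y_1) = 1$ follows from $e^{-32.5} = T(x, y_1) \times e^{-2} \times e^{-30.5}$) rather than read from the figure, so they should be treated as assumptions.

```python
import math

# Optical depth (integral of sigma along the ray) for each consecutive segment.
# Assumed values, back-solved from the transmittances stated above.
OPTICAL_DEPTH = {
    ("x", "y1"): 0.0,
    ("y1", "y2"): 2.0,
    ("y2", "y3"): 0.5,
    ("y3", "y4"): 30.0,
}
ORDER = ["x", "y1", "y2", "y3", "y4"]


def transmittance(start, end):
    """T(start, end) = exp(-sum of optical depths of the segments in between)."""
    i, j = ORDER.index(start), ORDER.index(end)
    total = sum(OPTICAL_DEPTH[(ORDER[k], ORDER[k + 1])] for k in range(i, j))
    return math.exp(-total)


for a, b in [("y1", "y2"), ("y2", "y4"), ("x", "y4"), ("x", "y3")]:
    print(f"T({a}, {b}) = {transmittance(a, b):.3e}")
```

Because transmittance is multiplicative over consecutive segments, summing optical depths before exponentiating gives the same result as multiplying the per-segment terms.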


### 1. Differentiable Volume Rendering

#### 1.3. Ray sampling (5 points)

<table>
<tr>
<th>Grid Visualization</th>
<th>Ray Visualization</th>
</tr>
<tr>
<td><img src="./images/1_3_xygrid.png"></td>
<td><img src="./images/1_3_rays.png"></td>
</tr>
</table>

### 1.3. Fitting a mesh (5 points)

<table>
<tr>
<th>Source (reconstructed)</th>
<th>Target (goal)</th>
</tr>
<tr>
<td><img src="./output_loop/q1_3_1.gif"></td>
<td><img src="./output_loop/q1_3_2.gif"></td>
</tr>
</table>

## 2. Reconstructing 3D from single view

### 2.1. Image to voxel grid (20 points)

Voxels proved to be a challenging representation to train, especially when using a pix2vox‐inspired decoder. The **Baseline FC + Deconv Decoder** first passed the 512‑dimensional feature through a fully connected layer, reshaping it into a small 3D grid, which was then upsampled via transposed 3D convolutions to form a full 32×32×32 volume. Despite numerous adjustments—such as modifying the number of layers and initial volume size—this approach often resulted in a high rate of empty meshes (about 29%) and low average F1 scores (around 27).

To address these issues, I experimented with simpler architectures based solely on fully connected layers. A deeper **5‑layer FC network** offered some improvement, reducing the mesh failure rate to roughly 10% and boosting the F1 score to 42.3. Ultimately, the **Simplified FC Network (3 layers)** with an iso threshold of 0.5 delivered the best performance, with 3% failures and an average F1 score of about 62.

A detailed ablation of these networks is shown in the table below:

| Architecture Variant             | Iso Threshold | Mesh Failure Rate | Avg F1 Score @ 0.05 |
|----------------------------------|---------------|-------------------|---------------------|
| Baseline FC + Deconv Decoder     | 0.3           | ~29%              | 27.6                |
| Deep FC Network (5 Layers)       | 0.5           | ~10%              | 42.3                |
| Simplified FC Network (3 Layers) | 0.5           | 3%                | 62.1                |

These results indicate that reducing the architectural complexity—by relying on a more straightforward FC-based design—can lead to more stable training and better reconstruction quality.
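
For reference, here is a minimal sketch of how an F1 score at a distance threshold can be computed between two point sets. It assumes the usual definition (precision and recall from nearest-neighbor distances, reported in percent); the actual evaluation script may sample points from the predicted and ground-truth shapes differently, and `f1_at_threshold` is a hypothetical helper name.

```python
import math


def f1_at_threshold(pred_points, gt_points, threshold=0.05):
    """F1 (in percent) between two point sets at a distance threshold."""

    def fraction_within(src, dst):
        # Fraction of src points whose nearest dst point is closer than the threshold.
        if not src or not dst:
            return 0.0
        hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) < threshold)
        return hits / len(src)

    precision = fraction_within(pred_points, gt_points)
    recall = fraction_within(gt_points, pred_points)
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.01), (0.0, 1.0, 0.0)]
score = f1_at_threshold(pred, gt)  # one of two points matches in each direction
```

Under this definition an empty prediction scores 0, which is one reason a high mesh failure rate pulls the average F1 down.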

<table>
<tr>
<th>Input (RGB)</th>
<th>GT mesh</th>
<th>Predicted 3D voxel grid</th>
</tr>
<tr>
<td><img src="./output/q2_1_1.png" width="256" height="256"></td>
<td><img src="./output/q2_1_2.gif" width="256" height="256"></td>
<td><img src="./output/q2_1_3.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_1_4.png" width="256" height="256"></td>
<td><img src="./output/q2_1_5.gif" width="256" height="256"></td>
<td><img src="./output/q2_1_6.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_1_7.png" width="256" height="256"></td>
<td><img src="./output/q2_1_8.gif" width="256" height="256"></td>
<td><img src="./output/q2_1_9.gif" width="256" height="256"></td>
</tr>
</table>

### 2.2. Image to point cloud (20 points)

Point cloud reconstruction from a single image benefits from a relatively straightforward decoder architecture. The baseline approach employs a simple fully connected (FC) network that regresses the 3D coordinates for each point directly from the 512‑dimensional image encoding. Initially, using a single FC layer with a standard ReLU activation led to some instability, with the predicted point distributions lacking coherence. By introducing a two‐layer FC network with a LeakyReLU followed by a Tanh activation, the network was able to generate more consistent and accurate point clouds. In our ablation studies, the optimized two‐layer variant achieved the highest average F1 score at the evaluation threshold.

| Architecture Variant              | Activation Function         | Avg F1 Score @ 0.05 |
|-----------------------------------|-----------------------------|---------------------|
| Baseline Single FC                | ReLU                        | 35.9                |
| Two-Layer FC with LeakyReLU       | LeakyReLU, Tanh             | 72.2                |
| Optimized Two-Layer FC            | LeakyReLU followed by Tanh  | 83.8                |

The experiments show that a carefully designed FC decoder—using non-linearities that balance expressiveness and stability—can significantly improve point cloud quality.
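
The sketch below illustrates one reading of the two-layer decoder: FC, LeakyReLU, FC, Tanh, then a reshape into N points. It is written in plain Python with toy sizes so it runs without a deep learning framework; the hidden width, the initialization, and the helper names are assumptions, not the trained model.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: `weights` has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]


def init_layer(n_in, n_out, rng):
    # Small uniform weights and zero biases (illustrative initialization only).
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def decode_points(feature, layer1, layer2, n_points):
    """feature -> FC -> LeakyReLU -> FC -> Tanh -> list of n_points (x, y, z) tuples."""
    hidden = leaky_relu(linear(feature, *layer1))
    flat = [math.tanh(v) for v in linear(hidden, *layer2)]
    return [tuple(flat[3 * i: 3 * i + 3]) for i in range(n_points)]


rng = random.Random(0)
feat_dim, hidden_dim, n_points = 16, 32, 8  # toy sizes; the report uses a 512-d feature
layer1 = init_layer(feat_dim, hidden_dim, rng)
layer2 = init_layer(hidden_dim, 3 * n_points, rng)
feature = [rng.gauss(0, 1) for _ in range(feat_dim)]
points = decode_points(feature, layer1, layer2, n_points)
```

The final Tanh keeps every coordinate inside [-1, 1], which may help explain the more coherent point distributions if the target shapes are normalized to that range.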


<table>
<tr>
<th>Input (RGB)</th>
<th>GT mesh</th>
<th>Predicted 3D Points</th>
</tr>
<tr>
<td><img src="./output/q2_2_1.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_2_2.gif" width="256" height="256"></td>
<td><img src="./output_loop/q2_2_3.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_2_4.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_2_5.gif" width="256" height="256"></td>
<td><img src="./output_loop/q2_2_6.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_2_7.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_2_8.gif" width="256" height="256"></td>
<td><img src="./output_loop/q2_2_9.gif" width="256" height="256"></td>
</tr>
</table>

### 2.3 Image to mesh (20 points)

For mesh reconstruction, the challenge lies in accurately deforming an initial template mesh (typically an icosphere) to match the target geometry. The baseline mesh decoder starts with an icosphere generated at a moderate resolution (Level 4) and directly regresses vertex offsets using a single FC layer with a Tanh activation. Although this setup captures the coarse structure, it often restricts the deformation range, leading to suboptimal mesh quality. By increasing the network depth to a two- or three-layer FC decoder, the model can better capture finer geometric details and allow for more extensive vertex adjustments. In our ablation experiments, the three-layer FC mesh decoder provided the best balance, producing high-fidelity reconstructions with minimal deviation from the ground truth.

| Architecture Variant               | Mesh Initialization | Avg F1 Score @ 0.05 |
|------------------------------------|---------------------|---------------------|
| Baseline Single FC Mesh Decoder    | Icosphere Level 4   | 36.4                |
| Two-Layer FC Mesh Decoder          | Icosphere Level 4   | 68.7                |
| Three-Layer FC Mesh Decoder        | Icosphere Level 4   | 81.1                |
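
To make the size of the regression target and the effect of the Tanh concrete, the sketch below counts the vertices of a level-4 icosphere and applies Tanh-bounded offsets to a template vertex. It assumes the offsets are added directly to the template vertices; `icosphere_vertex_count` and `deform` are hypothetical helpers, not the training code.

```python
import math


def icosphere_vertex_count(level):
    """Vertices of an icosahedron subdivided `level` times: 10 * 4**level + 2."""
    return 10 * 4 ** level + 2


def deform(template_vertices, raw_offsets):
    """Add Tanh-bounded offsets to template vertices, one 3-vector per vertex."""
    return [
        tuple(v + math.tanh(o) for v, o in zip(vertex, offset))
        for vertex, offset in zip(template_vertices, raw_offsets)
    ]


n_vertices = icosphere_vertex_count(4)  # 2562 vertices at level 4
decoder_output_dim = 3 * n_vertices     # 7686 values regressed per image

# Even a very large raw offset moves each coordinate by at most 1 unit.
template = [(0.0, 0.0, 1.0)]
moved = deform(template, [(100.0, -100.0, 0.0)])
```

This bound is the restriction on deformation range mentioned above: with a Tanh on the output, no vertex coordinate can move more than one unit away from the template.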

<table>
<tr>
<th>Input (RGB)</th>
<th>GT mesh</th>
<th>Predicted Mesh</th>
</tr>
<tr>
<td><img src="./output/q2_3_1.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_3_2.gif" width="256" height="256"></td>
<td><img src="./output_loop/q2_3_3.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_3_4.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_3_5.gif" width="256" height="256"></td>
<td><img src="./output_loop/q2_3_6.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_3_7.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_3_8.gif" width="256" height="256"></td>
<td><img src="./output_loop/q2_3_9.gif" width="256" height="256"></td>
</tr>
</table>


### 2.4. Quantitative comparisons (10 points)

In my experiments, the three representations ranked consistently:

- **Point Clouds:**  
  This approach achieved the highest F1 score. Its strength lies in its flexibility—points are free to occupy any position in 3D space, allowing the network to capture detailed geometric nuances when enough points are used.

- **Meshes:**  
  Mesh reconstruction, which involves deforming an icosphere, performed moderately well. However, its performance is constrained by the fixed number of vertices in the initial sphere template, limiting the ability to capture fine details compared to the point cloud approach.

- **Voxels:**  
  Voxels, represented by a 32³ grid, delivered the lowest F1 score. The coarse resolution makes it difficult to capture intricate details. Additionally, the binary cross-entropy loss applied in this setting might not offer as informative a signal as the Chamfer distance loss used for point clouds and meshes.

Overall, these results confirm that point cloud representations tend to capture object geometry more accurately, followed by meshes and then voxels.
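
One way to read the remark about loss signals is sketched below: a per-voxel binary cross-entropy gives the same penalty whether a misplaced occupied voxel is near or far from the correct cell, while the Chamfer distance grows with the displacement. Both functions are simplified stand-ins (mean-reduced BCE, squared-distance Chamfer summed over both directions), not the exact losses used in training.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance, both ways."""

    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean BCE over voxels; `probs` are occupancy probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


# A row of 5 voxels: the object occupies the first cell only.
target = [1, 0, 0, 0, 0]
near_miss = [0.1, 0.9, 0.1, 0.1, 0.1]  # predicted one cell away
far_miss = [0.1, 0.1, 0.1, 0.1, 0.9]   # predicted four cells away
bce_near = binary_cross_entropy(near_miss, target)
bce_far = binary_cross_entropy(far_miss, target)

# The same idea with points: a small vs. a large displacement.
gt = [(0.0, 0.0, 0.0)]
cd_near = chamfer_distance([(0.1, 0.0, 0.0)], gt)
cd_far = chamfer_distance([(0.5, 0.0, 0.0)], gt)
```

Here `bce_near` and `bce_far` are equal, whereas `cd_far` is much larger than `cd_near`, so the Chamfer loss tells the network how far off a prediction is.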


<table>
<tr>
<th>Voxel</th>
<th>Point</th>
<th>Mesh</th>
</tr>
<tr>
<td><img src="./output/q2_4_1.png"></td>
<td><img src="./output/q2_4_2.png"></td>
<td><img src="./output/q2_4_3.png"></td>
</tr>
</table>

### 2.5. Analyse effects of hyperparams variations (10 points)

I conducted an ablation study on the smoothing weight (`w_smooth`), while keeping the Chamfer distance weight (`w_chamfer`) constant at 1.0. In this experiment, I evaluated three settings: 0, 0.5, and 10.0.
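
The weighted objective can be sketched as below. The report does not state which smoothness term was used, so this example assumes a uniform Laplacian regularizer (each vertex is penalized by its distance to the centroid of its neighbors); the function names and the toy mesh patch are hypothetical.

```python
import math


def uniform_laplacian_loss(vertices, edges):
    """Mean distance between each vertex and the centroid of its neighbors."""
    neighbors = {i: set() for i in range(len(vertices))}
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    total = 0.0
    for i, v in enumerate(vertices):
        if not neighbors[i]:
            continue  # isolated vertices contribute nothing
        centroid = [
            sum(vertices[j][k] for j in neighbors[i]) / len(neighbors[i])
            for k in range(3)
        ]
        total += math.dist(v, centroid)
    return total / len(vertices)


def total_loss(chamfer, smooth, w_chamfer=1.0, w_smooth=0.5):
    """Weighted sum of the Chamfer term and the smoothness term."""
    return w_chamfer * chamfer + w_smooth * smooth


# Toy patch: four ring vertices around a center vertex that can be raised.
ring = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4), (2, 4), (3, 4)]


def patch(height):
    return [(x, y, 0.0) for x, y in ring] + [(0.0, 0.0, height)]


flat = uniform_laplacian_loss(patch(0.0), edges)
spiky = uniform_laplacian_loss(patch(1.0), edges)
```

The spiky patch has the larger smoothness loss, so increasing `w_smooth` pushes the optimizer toward flatter surfaces at the expense of the Chamfer term.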

#### Qualitative Observations

- **No Smoothing (w_smooth = 0):**  
  The reconstructed mesh maintains its sharp features, which can result in a somewhat pointy appearance. However, without any smoothing, the model may not regularize the surface as effectively.

- **Moderate Smoothing (w_smooth = 0.5):**  
  Introducing a moderate amount of smoothing yields visually smoother surfaces and reduces pointiness. The visual improvement is noticeable, although the F1 score drops from 81.1 to 75.9.

- **Excessive Smoothing (w_smooth = 10.0):**  
  When the smoothing weight is set too high, performance drops further (F1 of 69.4). Over-smoothing appears to wash out critical geometric details, leading to less accurate reconstructions.

#### Quantitative Results

The F1 scores observed under each setting were as follows:

| w_smooth | F1 Score |
|----------|----------|
| 0        | 81.1     |
| 0.5      | 75.9     |
| 10.0     | 69.4     |


<style>
  img {
    width: 256px;
    height: 256px;
    object-fit: cover; /* Ensures uniform scaling */
  }
</style>

<table>
<tr>
<th>Input (RGB)</th>
<th>GT mesh</th>
<th>w_smooth = 0.0</th>
<th>w_smooth = 0.5</th>
<th>w_smooth = 10.0</th>
</tr>
<tr>
<td><img src="./output/q2_5_1.png"></td>
<td><img src="./output_loop/q2_5_2.gif"></td>
<td><img src="./output_loop/q2_5_13.gif"></td>
<td><img src="./output_loop/q2_5_14.gif"></td>
<td><img src="./output_loop/q2_5_15.gif"></td>
</tr>
</table>

### 2.6. Interpret your model (15 points)

To evaluate the robustness of the trained point cloud model, I intentionally introduced artifacts into one of the training images. Specifically, I applied both horizontal and vertical occlusions to simulate challenging, pseudo out-of-distribution scenarios. Remarkably, the network maintained strong reconstruction quality despite these perturbations, demonstrating its resilience to input distortions.
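
A minimal sketch of such an occlusion is shown below, operating on an image stored as a list of rows. The band position, width, and fill value are illustrative assumptions; the exact occluders used for the figures are not specified here, and `occlude` is a hypothetical helper.

```python
def occlude(image, axis, start, width, fill=0):
    """Return a copy of `image` (a list of rows) with one band overwritten.

    axis="horizontal" blanks rows [start, start + width);
    axis="vertical" blanks columns [start, start + width).
    """
    if axis not in ("horizontal", "vertical"):
        raise ValueError(f"unknown axis: {axis}")
    out = [list(row) for row in image]  # copy so the input stays untouched
    for r, row in enumerate(out):
        for c in range(len(row)):
            index = r if axis == "horizontal" else c
            if start <= index < start + width:
                row[c] = fill
    return out


image = [[1] * 4 for _ in range(4)]
h_occluded = occlude(image, "horizontal", start=1, width=2)
v_occluded = occlude(image, "vertical", start=0, width=1)
```

Feeding the occluded copy through the trained model and comparing the output with the prediction for the clean image gives a simple robustness check.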

<table>
<tr>
<th>Input OOD (RGB)</th>
<th>Output Point</th>
</tr>
<tr>
<td><img src="./output/q2_6_1.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_6_2.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_6_3.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_6_4.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_6_5.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_6_6.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q2_6_7.png" width="256" height="256"></td>
<td><img src="./output_loop/q2_6_8.gif" width="256" height="256"></td>
</tr>
</table>

## 3. Exploring other architectures / datasets. (Choose at least one! More than one is extra credit)

### 3.1 Implicit network (10 points)

For the implicit network, I adopted a coordinate-based approach that differs from the fixed-grid voxel decoder. I pre-generate a 32×32×32 grid of 3D coordinates and, for each coordinate (x, y, z), concatenate it with the image feature vector. The combined vector is passed through a multi-layer perceptron (MLP) whose hidden layers progressively reduce in size and use ReLU activations; a final Sigmoid output layer produces an occupancy value between 0 and 1, indicating the presence or absence of the object at that grid point. Training uses the same binary cross-entropy loss as the voxel model. The implicit model converged more slowly than the voxel model and reached an F1 score of 62.7. Because occupancy is predicted per coordinate rather than for a fixed output volume, the decoder is not tied to one grid layout, which makes it an interesting alternative despite the longer training time.
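
The input construction described above can be sketched as follows. The coordinate range of [-1, 1] and the use of voxel centers are assumptions, and the demo uses a toy resolution and feature size to keep the example light; `make_grid` and `build_decoder_inputs` are hypothetical helpers.

```python
def make_grid(resolution=32, lo=-1.0, hi=1.0):
    """Voxel-center coordinates of a resolution^3 grid spanning [lo, hi]^3."""
    step = (hi - lo) / resolution
    axis = [lo + (i + 0.5) * step for i in range(resolution)]
    return [(x, y, z) for x in axis for y in axis for z in axis]


def build_decoder_inputs(feature, coords):
    """One input row per grid point: the image feature followed by (x, y, z)."""
    return [list(feature) + list(c) for c in coords]


coords = make_grid(resolution=4)  # toy resolution; the report uses 32
feature = [0.0] * 8               # stand-in for the 512-d image feature
rows = build_decoder_inputs(feature, coords)
```

Each row would be passed through the MLP to obtain one occupancy value. Since the decoder only sees coordinates, the same weights could in principle be queried on a grid of a different resolution.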

<center>
  <img src="./output/q3_1_10.png">
</center>

<table>
<tr>
<th>Input (RGB)</th>
<th>Output Predicted Voxel</th>
</tr>
<!--tr>
<td><img src="./output/q3_1_1.png" width="256" height="256"></td>
<td><img src="./output_loop/q3_1_3.gif" width="256" height="256"></td>
</tr-->
<!--tr>
<td><img src="./output/q3_1_4.png" width="256" height="256"></td>
<td><img src="./output_loop/q3_1_6.gif" width="256" height="256"></td>
</tr-->
<tr>
<td><img src="./output/q3_1_7.png" width="256" height="256"></td>
<td><img src="./output_loop/q3_1_9.gif" width="256" height="256"></td>
</tr>
</table>


### 3.2 Parametric network (10 points)

In our parametric network, we sample 2000 points from a 2D unit circle and concatenate each (x, y) coordinate with the image feature vector. This combined input is fed into an MLP—several fully connected layers with ReLU activations—that regresses a corresponding 3D point. Training uses the same loss as the point-cloud model. Although convergence is slower and performance slightly lower due to time constraints, the model achieves an F1 score of 65.2 while offering flexibility to handle an arbitrary number of points.
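
A sketch of the sampling step is given below. It assumes the 2D samples are drawn uniformly from the interior of the unit circle (the unit disk); if the points were instead sampled on the circle itself, the radius would simply be fixed to 1. The helper names are hypothetical.

```python
import math
import random


def sample_unit_disk(n, rng):
    """Uniform samples inside the unit disk (the sqrt keeps the area density uniform)."""
    points = []
    for _ in range(n):
        r = math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def build_inputs(feature, samples):
    """One input row per 2D sample: the image feature followed by (x, y)."""
    return [list(feature) + [x, y] for x, y in samples]


rng = random.Random(0)
samples = sample_unit_disk(2000, rng)
feature = [0.0] * 8  # stand-in for the 512-d image feature
inputs = build_inputs(feature, samples)
```

Because the number of samples is chosen at call time, the same decoder can emit more or fewer 3D points at inference without retraining.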

<table>
<tr>
<th>Input (RGB)</th>
<th>Output Predicted Points</th>
</tr>
<tr>
<td><img src="./output/q3_2_1.png" width="256" height="256"></td>
<td><img src="./output_loop/q3_2_2.gif" width="256" height="256"></td>
</tr>
<!--tr>
<td><img src="./output/q3_2_3.png" width="256" height="256"></td>
<td><img src="./output_loop/q3_2_4.gif" width="256" height="256"></td>
</tr>
<tr>
<td><img src="./output/q3_2_5.png" width="256" height="256"></td>
<td><img src="./output_loop/q3_2_6.gif" width="256" height="256"></td>
</tr-->
</table>


### 3.3 Extended dataset for training (10 points)

Extending the dataset improved the model’s ability to generalize for a multiclass problem, although the qualitative results degraded. Training on three classes and testing on just the chair class achieved an F1 score of 95.9%, while training on one class and testing on the chair class yielded 83.8%. When evaluating on all three classes, the F1 score was 84.1% for the model trained on three classes, and surprisingly, 79.8% for the model trained on a single class.

<table>
<tr>
<th> </th>
<th>Training on 3 classes</th>
<th>Training on 1 class</th>
</tr>
<tr>
<td><b>Tested on 1 class</b></td>
<td><img src="./output/q3_3_1.png" width="512" height="512"></td>
<td><img src="./output/q3_3_2.png" width="512" height="512"></td>
</tr>
<tr>
<td><b>Tested on 3 classes</b></td>
<td><img src="./output/q3_3_4.png" width="512" height="512"></td>
<td><img src="./output/q3_3_3.png" width="512" height="512"></td>
</tr>
</table>


<table>
<tr>
<th> </th>
<th>Input RGB</th>
<th>Training on 3 classes</th>
<th>Training on 1 class</th>
</tr>
<tr>
<td><b>Aeroplane class</b></td>
<td><img src="./output/q3_3_5.png" width="512" height="512"></td>
<td><img src="./output_loop/q3_3_6.gif" width="512" height="512"></td>
<td><img src="./output_loop/q3_3_7.gif" width="512" height="512"></td>
</tr>
<tr>
<td><b>Chair class</b></td>
<td><img src="./output/q3_3_8.png" width="512" height="512"></td>
<td><img src="./output_loop/q3_3_9.gif" width="512" height="512"></td>
<td><img src="./output_loop/q3_3_10.gif" width="512" height="512"></td>
</tr>
<tr>
<td><b>Car class</b></td>
<td><img src="./output/q3_3_11.png" width="512" height="512"></td>
<td><img src="./output_loop/q3_3_12.gif" width="512" height="512"></td>
<td><img src="./output_loop/q3_3_13.gif" width="512" height="512"></td>
</tr>
</table>
