<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

# 5. Assessment

Congratulations on going through today's course! Hope it was a fun journey with some new skills as souvenirs. Now it's time to put those skills to the test.

Here's the challenge: Let's say we have a have a classification model that uses LiDAR data to classify spheres and cubes. Compared to RGB cameras, LiDAR sensors are not as easy to come by, so we'd like to convert this model so it can classify RGB images instead. Since we used [NVIDIA Omniverse](https://www.nvidia.com/en-us/omniverse/) to generate LiDAR and RGB data pairs, let's use this data to create a contrastive pre-training model. Since CLIP is already taken, we will call this model `CILP` for "Contrastive Image LiDAR Pre-training". 

## 5.1 Setup

Let's get started. Below are the libraries used in this assessment.

In [1]:
import numpy as np
from PIL import Image

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
import torchvision.transforms as transforms
from torch.utils.data import Dataset, DataLoader

from assessment import assesment_utils
from assessment.assesment_utils import Classifier
import utils

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.cuda.is_available()

True

### 5.1.1 The Model

Next, let's load our classification model and call it `lidar_cnn`. If we take a moment to view the [assement_utils](assessment/assesment_utils.py), we can see the `Classifier` class used to construct the model. Please note the `get_embs` method, which we will be using to construct our cross-modal projector.

In [2]:
lidar_cnn = Classifier(1).to(device)
lidar_cnn.load_state_dict(torch.load("assessment/lidar_cnn.pt", weights_only=True))
# Do not unfreeze. Otherwise, it would be difficult to pass the assessment.
for param in lidar_cnn.parameters():
    lidar_cnn.requires_grad = False
lidar_cnn.eval()

Classifier(
  (embedder): Sequential(
    (0): Conv2d(1, 50, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(50, 100, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(100, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU()
    (8): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (9): Conv2d(200, 200, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (10): ReLU()
    (11): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (12): Flatten(start_dim=1, end_dim=-1)
  )
  (classifier): Sequential(
    (0): Linear(in_features=3200, out_features=100, bias=True)
    (1): ReLU()
    (2): Linear(in_features=100, out_features=1, bias=True)
  )
)

### 5.1.2 The Dataset

Below is the dataset we will be using in this assessment. It is similar to the dataset we used in the first few labs, but please note `self.classes`. Unlike the first lab where we predicted position, in this lab, we will determine whether the RGB or LiDAR we are evaluating contains a `cube` or a `sphere`.

In [3]:
IMG_SIZE = 64
img_transforms = transforms.Compose([
    transforms.Resize(IMG_SIZE),
    transforms.ToTensor(),  # Scales data into [0,1]
])

class MyDataset(Dataset):
    def __init__(self, root_dir, start_idx, stop_idx):
        self.classes = ["cubes", "spheres"]
        self.root_dir = root_dir
        self.rgb = []
        self.lidar = []
        self.class_idxs = []

        for class_idx, class_name in enumerate(self.classes):
            for idx in range(start_idx, stop_idx):
                file_number = "{:04d}".format(idx)
                rbg_img = Image.open(self.root_dir + class_name + "/rgb/" + file_number + ".png")
                rbg_img = img_transforms(rbg_img).to(device)
                self.rgb.append(rbg_img)
    
                lidar_depth = np.load(self.root_dir + class_name + "/lidar/" + file_number + ".npy")
                lidar_depth = torch.from_numpy(lidar_depth[None, :, :]).to(torch.float32).to(device)
                self.lidar.append(lidar_depth)

                self.class_idxs.append(torch.tensor(class_idx, dtype=torch.float32)[None].to(device))

    def __len__(self):
        return len(self.class_idxs)

    def __getitem__(self, idx):
        rbg_img = self.rgb[idx]
        lidar_depth = self.lidar[idx]
        class_idx = self.class_idxs[idx]
        return rbg_img, lidar_depth, class_idx

This data is available in the `/data/assessment` folder. Here is an example of one of the cubes. The images are small, but there is enough detail that our models will be able to tell the difference.

<center><img src="data/assessment/cubes/rgb/0002.png" /></center>

Let's go ahead and load the data into a `DataLoader`. We'll set aside a few batches (`VALID_BATCHES`) for validation. The rest of the data will be used for training. We have `9999` images for each of the cube and sphere categories, so we'll multiply N times 2 to reflect the combined dataset.

In [4]:
BATCH_SIZE = 32
VALID_BATCHES = 10
N = 9999

valid_N = VALID_BATCHES*BATCH_SIZE
train_N = N - valid_N

train_data = MyDataset("data/assessment/", 0, train_N)
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
valid_data = MyDataset("data/assessment/", train_N, N)
valid_dataloader = DataLoader(valid_data, batch_size=BATCH_SIZE, shuffle=False, drop_last=True)

N *= 2
valid_N *= 2
train_N *= 2

## 5.2 Contrastive Pre-training

Before we create a cross-modal projection model, it would be nice to have a way to embed our RGB images as a starting point. Let's be efficient with our data and create a contrastive pre-training model. First, it would help to have a convolutional model. We've prepared a recommended architecture below.

In [5]:
CILP_EMB_SIZE = 200

class Embedder(nn.Module):
    def __init__(self, in_ch, emb_size=CILP_EMB_SIZE):
        super().__init__()
        kernel_size = 3
        stride = 1
        padding = 1

        # Convolution
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 50, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(50, 100, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(100, 200, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(200, 200, kernel_size, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten()
        )

        # Embeddings
        self.dense_emb = nn.Sequential(
            nn.Linear(200 * 4 * 4, 100),
            nn.ReLU(),
            nn.Linear(100, emb_size)
        )

    def forward(self, x):
        conv = self.conv(x)
        emb = self.dense_emb(conv)
        return F.normalize(emb)

The RGB data has `4` channels, and our LiDAR data has `1`. Let's initiate these embedding models respectively.

In [6]:
img_embedder = Embedder(4).to(device)
lidar_embedder = Embedder(1).to(device)

Now that we have our embedding models, let's combine them into a `ContrastivePretraining` model.

**TODO**: The `ContrastivePretraining` class below is almost done, but it has a few `FIXME`s. Please replace the FIXMEs to have a working model. Feel free to review notebook [02b_Contrastive_Pretraining.ipynb](02b_Contrastive_Pretraining.ipynb) for a hint.

In [19]:
class ContrastivePretraining(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_embedder = img_embedder
        self.lidar_embedder = lidar_embedder
        self.cos = nn.CosineSimilarity()

    def forward(self, rgb_imgs, lidar_depths):
        img_emb = self.img_embedder(rgb_imgs)
        lidar_emb = self.lidar_embedder(lidar_depths)

        repeated_img_emb = img_emb.repeat_interleave(len(img_emb), dim=0)
        repeated_lidar_emb = lidar_emb.repeat(len(lidar_emb), 1)

        similarity = self.cos(repeated_img_emb, repeated_lidar_emb)
        similarity = torch.unflatten(similarity, 0, (BATCH_SIZE, BATCH_SIZE))
        similarity = (similarity + 1) / 2

        logits_per_img = similarity
        logits_per_lidar = similarity.T
        return logits_per_img, logits_per_lidar

Before we can train the model, we should define a loss function to guide our model in learning.

**TODO**: The `get_CILP_loss` function below is almost done. Do you remember the formula to calculate the loss? Please replace the `FIXME`s below.

In [23]:
def get_CILP_loss(batch):
    rbg_img, lidar_depth, class_idx = batch
    logits_per_img, logits_per_lidar = CILP_model(rbg_img, lidar_depth)
    total_loss = (loss_img(logits_per_img, ground_truth) + loss_lidar(logits_per_lidar, ground_truth))/2
    return total_loss, logits_per_img

Time to put these models to the test! First, let's initialize the model.

In [24]:
CILP_model = ContrastivePretraining().to(device)
optimizer = Adam(CILP_model.parameters(), lr=0.0001)
loss_img = nn.CrossEntropyLoss()
loss_lidar = nn.CrossEntropyLoss()
ground_truth = torch.arange(BATCH_SIZE, dtype=torch.long).to(device)
epochs = 3

Next, it's time to train. If the above `TODO`s were completed correctly, the loss should be under `3.2`. Are the values along the diagional close to `1`?

In [25]:
for epoch in range(epochs):
    CILP_model.train()
    train_loss = 0
    for step, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        loss, logits_per_img = get_CILP_loss(batch)
        loss.backward()
        train_loss += loss.item()
        optimizer.step()
    assesment_utils.print_CILP_results(epoch, train_loss/step, logits_per_img, is_train=True)

    CILP_model.eval()
    valid_loss = 0
    for step, batch in enumerate(valid_dataloader):
        loss, logits_per_img = get_CILP_loss(batch)
        valid_loss += loss.item()
    assesment_utils.print_CILP_results(epoch, valid_loss/step, logits_per_img, is_train=False)

Epoch 0
Train Loss: 3.0879874265016016 
Similarity:
tensor([[0.9972, 0.0379, 0.4454,  ..., 0.2381, 0.5527, 0.2554],
        [0.0501, 0.9845, 0.4564,  ..., 0.8475, 0.4028, 0.7434],
        [0.3975, 0.5722, 0.9886,  ..., 0.1722, 0.7463, 0.4071],
        ...,
        [0.2195, 0.8055, 0.1573,  ..., 0.9925, 0.2809, 0.8099],
        [0.6167, 0.3427, 0.7876,  ..., 0.2268, 0.9913, 0.6406],
        [0.2538, 0.6974, 0.3776,  ..., 0.8134, 0.6218, 0.9943]],
       device='cuda:0', grad_fn=<DivBackward0>)
Valid Loss: 3.191289500186318 
Similarity:
tensor([[0.9929, 0.8805, 0.4261,  ..., 0.4011, 0.4511, 0.9885],
        [0.8579, 0.9958, 0.2482,  ..., 0.1951, 0.2363, 0.9011],
        [0.3929, 0.2471, 0.9942,  ..., 0.5831, 0.5269, 0.3459],
        ...,
        [0.4149, 0.2071, 0.5989,  ..., 0.9976, 0.9893, 0.3666],
        [0.4750, 0.2561, 0.5403,  ..., 0.9925, 0.9973, 0.4286],
        [0.9905, 0.9180, 0.3655,  ..., 0.3684, 0.4208, 0.9969]],
       device='cuda:0', grad_fn=<DivBackward0>)
Epoch 1
Train

When complete, please freeze the model. We will assess this model with our cross-model projection model, and if this model is altered during cross-model projection training, it may not pass!

In [26]:
for param in CILP_model.parameters():
    CILP_model.requires_grad = False

## 5.3 Cross-Modal Projection

Now that we have a way to embed our image data, let's move on to cross-modal projection. 

**TODO**: Let's jump right in and create the projector. What should be the dimensions into the model, and what should be the dimensions out of the model? A hint to the first `FIXME` can be found in section [#5.2-Contrastive-Pre-training](#5.2-Contrastive-Pre-training) in the `Embedding` class. A hint to the second `FIXME` can be found in the [assessment/assesment_utils.py](assessment/assesment_utils.py) file in the `Classifier` class. The dimensions of the second `FIXME` should be the same size as the output of the `get_embs` function.

In [28]:
CILP_EMB_SIZE

200

In [34]:
projector = nn.Sequential(
    nn.Linear(CILP_EMB_SIZE, 1000),
    nn.ReLU(),
    nn.Linear(1000, 500),
    nn.ReLU(),
    nn.Linear(500, 1)
).to(device)


#FIXME : clip_emb_size[0] and vgg_shape[0]

Next, let's define the loss function for training the `projector`.

**TODO**: What was the loss function for estimating projection embeddings? Please replace the `FIXME` below. Review notebook [03a_Projection.ipynb](03a_Projection.ipynb) section 3.2 for a hint.

In [35]:
def get_projector_loss(model, batch):
    rbg_img, lidar_depth, class_idx = batch
    imb_embs = CILP_model.img_embedder(rbg_img)
    lidar_emb = lidar_cnn.get_embs(lidar_depth)
    pred_lidar_embs = model(imb_embs)
    return nn.MSELoss()(pred_lidar_embs, lidar_emb)

The `projector` will take a little while to train, but at the end of it, should reach a validation loss around 2.

In [31]:
epochs = 40
optimizer = torch.optim.Adam(projector.parameters())
assesment_utils.train_model(projector, optimizer, get_projector_loss, epochs, train_dataloader, valid_dataloader)

  return F.mse_loss(input, target, reduction=self.reduction)


Epoch   0 | Train Loss: 5.5659
Epoch   0 | Valid Loss: 5.6017
Epoch   1 | Train Loss: 5.5664
Epoch   1 | Valid Loss: 5.6006
Epoch   2 | Train Loss: 5.5661
Epoch   2 | Valid Loss: 5.6015
Epoch   3 | Train Loss: 5.5603
Epoch   3 | Valid Loss: 5.5967
Epoch   4 | Train Loss: 5.5621
Epoch   4 | Valid Loss: 5.5976
Epoch   5 | Train Loss: 5.5573
Epoch   5 | Valid Loss: 5.5908
Epoch   6 | Train Loss: 5.5545
Epoch   6 | Valid Loss: 5.5910
Epoch   7 | Train Loss: 5.5544
Epoch   7 | Valid Loss: 5.5878
Epoch   8 | Train Loss: 5.5478
Epoch   8 | Valid Loss: 5.5890
Epoch   9 | Train Loss: 5.5483
Epoch   9 | Valid Loss: 5.5865
Epoch  10 | Train Loss: 5.5464
Epoch  10 | Valid Loss: 5.5891
Epoch  11 | Train Loss: 5.5495
Epoch  11 | Valid Loss: 5.5865
Epoch  12 | Train Loss: 5.5488
Epoch  12 | Valid Loss: 5.5871
Epoch  13 | Train Loss: 5.5482
Epoch  13 | Valid Loss: 5.5852
Epoch  14 | Train Loss: 5.5495
Epoch  14 | Valid Loss: 5.5908
Epoch  15 | Train Loss: 5.5479
Epoch  15 | Valid Loss: 5.5877
Epoch  1

Time to bring it together. Let's create a new model `RGB2LiDARClassifier` where we can use our projector with the pre-trained `lidar_cnn` model.

**TODO**: Please fix the `FIXME`s below. Which `embedder` should we be using from our `CILP_model`?

In [32]:
class RGB2LiDARClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.projector = projector
        self.FIXME = CILP_model.FIXME
        self.shape_classifier = lidar_cnn
    
    def forward(self, imgs):
        img_encodings = self.img_embedder(imgs)
        proj_lidar_embs = self.projector(img_encodings)
        return self.shape_classifier(data_embs=proj_lidar_embs)

In [33]:
my_classifier = RGB2LiDARClassifier()

AttributeError: 'ContrastivePretraining' object has no attribute 'FIXME'

Before we train this model, let's see how it does out of the box. We'll create a function `get_correct` that we can use to calculate the number of classifications that were correct.

In [None]:
def get_correct(output, y):
    zero_tensor = torch.tensor([0]).to(device)
    pred = torch.gt(output, zero_tensor)
    correct = pred.eq(y.view_as(pred)).sum().item()
    return correct

Next, we can make a `get_valid_metrics` function to calculate the model's accuracy with the validation dataset. If done correctly, the accuracy should be above `.70`, or 70%.

In [None]:
def get_valid_metrics():
    my_classifier.eval()
    correct = 0
    batch_correct = 0
    for step, batch in enumerate(valid_dataloader):
        rbg_img, _, class_idx = batch
        output = my_classifier(rbg_img)
        loss = nn.BCEWithLogitsLoss()(output, class_idx)
        batch_correct = get_correct(output, class_idx)
        correct += batch_correct
    print(f"Valid Loss: {loss.item():2.4f} | Accuracy {correct/valid_N:2.4f}")

get_valid_metrics()

Finally, let's fine-tune the completed model. Since `CILP` and `lidar_cnn` are frozen, this should only change the `projector` part of the model. Even so, the model should achieve a validation accuracy of above `.95` or 95%.

In [None]:
epochs = 5
optimizer = torch.optim.Adam(my_classifier.parameters())

my_classifier.train()
for epoch in range(epochs):
    correct = 0
    batch_correct = 0
    for step, batch in enumerate(train_dataloader):
        optimizer.zero_grad()
        rbg_img, _, class_idx = batch
        output = my_classifier(rbg_img)
        loss = nn.BCEWithLogitsLoss()(output, class_idx)
        batch_correct = get_correct(output, class_idx)
        correct += batch_correct
        loss.backward()
        optimizer.step()
    print(f"Train Loss: {loss.item():2.4f} | Accuracy {correct/train_N:2.4f}")
    get_valid_metrics()

## 5.4 Run the Assessment

Moment of truth! To assess your model run the following two cells. There are ten points that are graded:abs

* Confirm CILP has a validation loss of below `3.2` (5 points)
* Confirm the `projector` can be used with `lidar_cnn` to accurately classify images. Five batches of images will be tested if the batch accuracy is above `.95`. (1 point each for 5 points total)

9 out of 10 points are required to pass the assessment. Good luck!

Please pass your `CILP_model` and `projector` below. If the names of these models have changed, please update the below accordingly.

In [None]:
from run_assessment import run_assessment

In [None]:
run_assessment(CILP_model, projector)

## 6.7 Generate a Certificate

If you passed the assessment, please return to the course page and click the "ASSESS TASK" button, which will generate your certificate for the course.

<center><a href="https://www.nvidia.com/en-us/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>