# Fine-Tuning the DPT-DinoV2-Small-KITTI Model on the NYU Depth V2 Dataset

In this Jupyter notebook, we fine-tune a pre-trained Dense Prediction Transformer (DPT) model, `facebook/dpt-dinov2-small-kitti`, for depth estimation. The checkpoint was trained on KITTI (outdoor driving scenes), and we adapt it to the NYU Depth V2 dataset (indoor scenes), so the model has to adjust to a different depth range and scene type. The image processor is taken from the related `facebook/dpt-dinov2-small-nyu` checkpoint.

The notebook covers the full workflow: importing the libraries, defining a Lightning module and a custom dataset, checking input and output shapes, and training with early stopping. Each step is explained in the accompanying text and code comments.

Beyond fine-tuning the model itself, the goal is to understand the DPT architecture and the layer-freezing technique used here, as a starting point for further experiments with Transformer models in depth estimation.

## Imports

In [1]:
# Standard Library Imports
import time

# Third-Party Library Imports
import numpy as np
from PIL import Image

# PyTorch Imports
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision.transforms import Resize

# PyTorch Lightning Imports
import pytorch_lightning as pl
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor, ModelCheckpoint

# Transformers and Datasets Imports
from datasets import load_dataset, load_from_disk
from transformers import AutoImageProcessor, DPTForDepthEstimation, TrainingArguments

## Time Keeping

In [2]:
# Storing current time to measure runtime
start_time = time.time()

## Defining the model

In [3]:
class DPTLightningModule(pl.LightningModule):
    """
    A PyTorch Lightning module for depth estimation using the DPT (Dense Prediction Transformer) model.

    This module is specifically configured for fine-tuning the DPT model on depth estimation tasks. It involves
    freezing the entire model initially and then unfreezing the last two layers of the transformer along with the head
    for training. The module uses Mean Squared Error as the loss function for depth estimation.

    Methods:
        forward: Performs a forward pass through the model.
        common_step: A shared step used for both training and validation.
        configure_optimizers: Sets up the optimizer for training.
        training_step: Performs a training step.
        validation_step: Performs a validation step.
        train_dataloader: Loads the training dataset.
        val_dataloader: Loads the validation dataset.
        test_dataloader: Loads the test dataset.
    """

    def __init__(self):
        """
        Initializes the DPTLightningModule, loads the pre-trained DPT model and sets up layer freezing.
        """
        super().__init__()
        self.model = DPTForDepthEstimation.from_pretrained("facebook/dpt-dinov2-small-kitti")

        # Freeze all parameters of the model initially
        for param in self.model.parameters():
            param.requires_grad = False

        # Unfreeze the last two transformer layers
        for layer in self.model.backbone.encoder.layer[-2:]:
            for param in layer.parameters():
                param.requires_grad = True

        # Unfreeze the head of the model (the layers that produce the depth map)
        for param in self.model.head.parameters():
            param.requires_grad = True

    def forward(self, x):
        """
        Performs a forward pass through the model.
        
        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Predicted depth map.
        """
        return self.model(x).predicted_depth

    def common_step(self, batch, batch_idx):
        """
        A shared step for calculating loss, used in both training and validation steps.

        Args:
            batch (dict): A batch from the dataset.
            batch_idx (int): The index of the batch.

        Returns:
            torch.Tensor: Computed loss for the batch.
        """
        pixel_values = batch['pixel_values']
        labels = batch['labels']
        preds = self(pixel_values)
        loss = torch.nn.functional.mse_loss(preds, labels)

        return loss

    def configure_optimizers(self):
        """
        Configures the optimizer for training.

        Returns:
            torch.optim.Optimizer: The Adam optimizer.
        """
        return torch.optim.Adam(self.parameters())

    def training_step(self, batch, batch_idx):
        """
        Performs a training step.

        Args:
            batch (dict): A batch from the dataset.
            batch_idx (int): The index of the batch.

        Returns:
            torch.Tensor: The loss for the training step.
        """
        loss = self.common_step(batch, batch_idx)
        self.log("training_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        """
        Performs a validation step.

        Args:
            batch (dict): A batch from the dataset.
            batch_idx (int): The index of the batch.

        Returns:
            torch.Tensor: The loss for the validation step.
        """
        loss = self.common_step(batch, batch_idx)
        self.log("validation_loss", loss, on_epoch=True)
        return loss

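    # Note: These methods return module-level variables that must exist before they are called.
    # train_dataloader and val_dataloader are created later in this notebook; test_dataloader is
    # never created here, so calling test_dataloader() would raise a NameError.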
    def train_dataloader(self):
        return train_dataloader

    def val_dataloader(self):
        return val_dataloader

    def test_dataloader(self):
        return test_dataloader
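
The freeze-then-unfreeze pattern used in `__init__` can be tried on a small stand-in model without downloading any weights. The sketch below is illustrative only: `ToyModel`, its layer sizes, and the variable names are hypothetical and are not part of the DPT model. It also shows one optional variation, passing only the trainable parameters to the optimizer, which the notebook itself does not do.

```python
import torch
from torch import nn


class ToyModel(nn.Module):
    """Hypothetical stand-in: four 'backbone' layers followed by a head."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
        self.head = nn.Linear(8, 1)

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return self.head(x)


toy = ToyModel()

# Step 1: freeze every parameter.
for param in toy.parameters():
    param.requires_grad = False

# Step 2: unfreeze the last two layers and the head.
for layer in toy.layers[-2:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in toy.head.parameters():
    param.requires_grad = True

# Count parameters to confirm that only the intended parts are trainable.
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)
total = sum(p.numel() for p in toy.parameters())
print(f"Trainable parameters: {trainable} / {total}")

# Optional variation: give the optimizer only the trainable parameters.
optimizer = torch.optim.Adam(p for p in toy.parameters() if p.requires_grad)
```

Running the same two counting lines on `self.model` is a quick way to cross-check the "Trainable params" figure that Lightning prints at the start of training.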


## Loading the dataset

In [4]:
# Load the NYU Depth Dataset (assuming it's already downloaded and stored locally)
nyu_dataset = load_from_disk("C:/Downloads/nyu_depth_v2.hf")

## Inspecting image dimensions

In [5]:
# Access the first image in the dataset to check its dimensions
first_image = nyu_dataset['train'][0]['image']

# Convert to PIL Image if it's not already
if not isinstance(first_image, Image.Image):
    first_image = Image.fromarray(first_image)

# Print the dimensions of the first image
print(f"First image dimensions: {first_image.size}")  # Outputs (width, height)

# Optionally, to get a more comprehensive view, you can check dimensions of multiple images
image_sizes = set()
for i in range(min(len(nyu_dataset['train']), 100)):  # Check first 100 images
    image = nyu_dataset['train'][i]['image']
    if not isinstance(image, Image.Image):
        image = Image.fromarray(image)
    image_sizes.add(image.size)

print(f"Unique image dimensions in the first 100 images: {image_sizes}")

First image dimensions: (640, 480)
Unique image dimensions in the first 100 images: {(640, 480)}


## Instantiating DPTLightningModule

In [6]:
# Initialize the LightningModule
model = DPTLightningModule()

## Configuring CustomNYUDataset

In [7]:
class CustomNYUDataset(Dataset):
    """
    A custom dataset class for loading and processing the NYU Depth Dataset for depth estimation tasks.

    This dataset class is tailored to preprocess images and corresponding depth maps for training
    a depth estimation model. It includes functionality to resize depth maps to match the output size of the model.

    Attributes:
        dataset (Dataset): The original NYU Depth Dataset.
        processor (AutoImageProcessor): The image processor to preprocess the images.
        output_size (tuple): The target output size (height, width) for resizing depth maps.

    Methods:
        __len__: Returns the size of the dataset.
        __getitem__: Retrieves and preprocesses an item (image and depth map) from the dataset.
    """

    def __init__(self, dataset, processor, output_size):
        """
        Initializes the CustomNYUDataset with the given dataset, processor, and output size.

        Args:
            dataset (Dataset): The original NYU Depth Dataset.
            processor (AutoImageProcessor): The processor to preprocess the images.
            output_size (tuple): The target output size (height, width) for resizing depth maps.
        """
        self.dataset = dataset
        self.processor = processor
        self.output_size = output_size

    def __len__(self):
        """
        Returns the size of the dataset.

        Returns:
            int: The number of items in the dataset.
        """
        return len(self.dataset)

    def __getitem__(self, idx):
        """
        Retrieves and preprocesses an item from the dataset.

        The method processes both the image and its corresponding depth map. The image is processed using
        the provided processor, and the depth map is resized to match the model's output size.

        Args:
            idx (int): The index of the item to retrieve.

        Returns:
            dict: A dictionary with processed pixel values ('pixel_values') and labels ('labels').
        """
        # Retrieve the item at the specified index
        item = self.dataset[idx]

        # Convert image and depth map to PIL Image if they are not already
        image = Image.fromarray(item['image']) if not isinstance(item['image'], Image.Image) else item['image']
        depth_map = Image.fromarray(item['depth_map']) if not isinstance(item['depth_map'], Image.Image) else item['depth_map']

        # Process the image using the provided processor
        processed_images = self.processor(images=[image], return_tensors="pt")
        input_tensor = processed_images["pixel_values"].squeeze()

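        # Resize the depth map so that it matches the shape of the model's prediction.
        # Resize expects (height, width). Note: the size is hard-coded here, so self.output_size is not used.
        depth_map_resized = Resize((576, 736))(depth_map)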
        depth_map_tensor = torch.tensor(np.array(depth_map_resized), dtype=torch.float32)

        return {"pixel_values": input_tensor, "labels": depth_map_tensor}
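
Height and width order is easy to mix up when resizing depth maps: PIL's `Image.size` reports (width, height), whereas PyTorch tensor shapes and resize targets use (height, width). The sketch below illustrates the convention on a random tensor. It uses `torch.nn.functional.interpolate` rather than the `Resize` transform from the class above, purely so that it runs without image data; the tensor and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical depth map at the NYU image resolution: 480 rows (height) x 640 columns (width).
depth = torch.rand(480, 640)

# The target size is given as (height, width).
target_size = (576, 736)

# interpolate expects [batch, channels, height, width], so add two leading
# dimensions before resizing and drop them again afterwards.
resized = F.interpolate(depth[None, None], size=target_size, mode="nearest")[0, 0]

print(resized.shape)
```

Nearest-neighbour interpolation is used in this sketch because it copies existing depth values instead of averaging across object boundaries; whether that is preferable to a smoother mode depends on the task.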

## Loading Image Processor

In [8]:
# Model and Processor Initialization
processor = AutoImageProcessor.from_pretrained("facebook/dpt-dinov2-small-nyu")

## Ascertaining Model Output Size

In [9]:
# Creating a dummy input tensor for the model with dimensions [batch_size, channels, height, width]
# Here, batch_size=1, channels=3 (for RGB images), and the height and width are set to 480 and 640 respectively
dummy_input = torch.randn(1, 3, 480, 640)

# Using torch.no_grad() to disable gradient calculations, as we only need a forward pass
# This reduces memory usage and speeds up the computation
with torch.no_grad():
    # Forward pass: Run the dummy input through the model to get the output
    model_output = model(dummy_input)

# Extracting the output size of the model, which is the spatial dimension of the output (height and width)
# The forward method returns predicted_depth, whose shape is [batch_size, height, width]
# We take the last two values for height and width
model_output_size = model_output.shape[-2:]

# Printing the model's output size to verify its dimensions
print(f"Model Output Size: {model_output_size}")

Model Output Size: torch.Size([544, 736])


## Instantiating Training and Validation Sets

In [10]:
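# Initialize with the model output size measured above for a 480x640 dummy input
# The size is specified as (height, width), matching the torch.Size printed above
# Note: CustomNYUDataset.__getitem__ currently resizes depth maps to a hard-coded (576, 736),
# so this value is stored on the dataset but not applied
model_output_size = (544, 736)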

# Create an instance of the CustomNYUDataset for the training data
# This instance will use the training portion of the nyu_dataset
# It will process images using the specified 'processor' and resize depth maps to 'model_output_size'
train_dataset = CustomNYUDataset(nyu_dataset['train'], processor, model_output_size)

# Similarly, create an instance of the CustomNYUDataset for the validation data
# This instance will use the validation portion of the nyu_dataset
# Images and depth maps in the validation dataset are processed in the same way as the training dataset
val_dataset = CustomNYUDataset(nyu_dataset['validation'], processor, model_output_size)


## Instantiating DataLoaders

In [11]:
# Configuring the batch size for training and evaluation
# A small batch size of 2 is chosen as a conservative setting that helps prevent out-of-memory errors
batch_size = 2
train_batch_size = batch_size  # Setting the training batch size
eval_batch_size = batch_size   # Setting the evaluation (validation) batch size

# Defining a custom collate function for the DataLoader
# This function prepares batches by stacking the individual items' pixel values and labels
def collate_fn(batch):
    # Extracting pixel values (input features) from each item in the batch
    pixel_values = [item['pixel_values'] for item in batch]

    # Extracting labels (target depth maps) from each item in the batch
    labels = [item['labels'] for item in batch]

    # Stacking all pixel values and labels in the batch to create batch tensors
    pixel_values = torch.stack(pixel_values)
    labels = torch.stack(labels)

    # Returning a dictionary with keys 'pixel_values' and 'labels' for the batch
    return {"pixel_values": pixel_values, "labels": labels}

# Creating DataLoaders for training and validation
# DataLoaders are used to load the data in batches during training and validation
train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=collate_fn, batch_size=train_batch_size)
val_dataloader = DataLoader(val_dataset, collate_fn=collate_fn, batch_size=eval_batch_size)

# Placeholder for creating a DataLoader for testing (if needed in the future)
# Currently, testing DataLoader isn't implemented
# test_dataloader = DataLoader(test_ds, collate_fn=collate_fn, batch_size=eval_batch_size)
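
For items that are dictionaries of equally shaped tensors, the `DataLoader`'s default collation already stacks each key along a new batch dimension, which is the same result the custom `collate_fn` above produces. The sketch below checks this on a hypothetical `DummyDepthDataset` that returns small random tensors; the class name and tensor shapes are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class DummyDepthDataset(Dataset):
    """Hypothetical stand-in that returns small random tensors."""

    def __len__(self):
        return 6

    def __getitem__(self, idx):
        return {
            "pixel_values": torch.rand(3, 28, 42),  # [channels, height, width]
            "labels": torch.rand(32, 48),           # [height, width]
        }


# No collate_fn is passed, so the default collation is used.
loader = DataLoader(DummyDepthDataset(), batch_size=2)
dummy_batch = next(iter(loader))

print(dummy_batch["pixel_values"].shape, dummy_batch["labels"].shape)
```

A custom `collate_fn` becomes necessary when items need padding, filtering, or other per-batch processing that simple stacking cannot express.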

## Inspecting a Sample Batch from the DataLoader

In [12]:
# Fetching a single batch of data from the train DataLoader
# This is typically done to inspect or debug the shape and content of the batch data
batch = next(iter(train_dataloader))

# Extracting the labels (i.e., ground truth depth maps) from the fetched batch
# 'labels' contains the depth maps that the model is supposed to predict
labels = batch['labels']

# Printing the size of the labels tensor
# The size will show us the dimensions of the depth maps, including the batch size
# This helps to confirm that the depth maps are correctly loaded and batched
print("Size of ground truth depth maps:", labels.size())

Size of ground truth depth maps: torch.Size([2, 576, 736])
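
Because the loss compares predictions and labels element by element, the label shape printed above has to equal the shape of the model's prediction for the same batch. One way to catch a mismatch early is a small guard around the loss call. The helper below, `checked_mse`, is a hypothetical sketch and is not part of the notebook's `DPTLightningModule`.

```python
import torch
import torch.nn.functional as F


def checked_mse(preds: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: compute MSE only when both shapes are identical."""
    if preds.shape != labels.shape:
        raise ValueError(
            f"Shape mismatch: preds {tuple(preds.shape)} vs labels {tuple(labels.shape)}"
        )
    return F.mse_loss(preds, labels)


# Matching shapes: every element differs by exactly 1, so the mean squared error is 1.
loss = checked_mse(torch.zeros(2, 4, 6), torch.ones(2, 4, 6))
print(loss.item())
```

Calling the helper with tensors of different heights, for example `(2, 4, 6)` and `(2, 5, 6)`, raises the `ValueError` with both shapes in the message.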


## Initializing Trainer and Starting Model Training

In [13]:
# Determine the device type (CUDA if available, else CPU)
# This ensures that the training leverages GPU acceleration if available for faster processing
device_type = 'cuda' if torch.cuda.is_available() else 'cpu'

# Initialize an early stopping callback
# 'patience=3' stops training after 3 consecutive validation checks without improvement in the validation loss
# 'strict=False' means training will not crash if the monitored metric is missing from the logged metrics
# 'verbose=False' limits the amount of logging information during training
# 'mode=min' indicates that training should stop when the monitored quantity (validation loss) stops decreasing
early_stop_callback = EarlyStopping(monitor='validation_loss', patience=3, strict=False, verbose=False, mode='min')

# Initialize the PyTorch Lightning Trainer
# 'max_epochs=-1' indicates an indefinite number of epochs, but early stopping will intervene
# 'accelerator=device_type' specifies the computation device (GPU or CPU)
# The early stopping callback is added to the list of callbacks
trainer = Trainer(max_epochs=-1, accelerator=device_type, callbacks=[early_stop_callback])

# Start the training process
# The model is trained using the specified train and validation data loaders
# The training process will automatically use the device specified earlier and apply early stopping as configured
trainer.fit(model, train_dataloader, val_dataloader)


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name  | Type                  | Params
------------------------------------------------
0 | model | DPTForDepthEstimation | 37.2 M
------------------------------------------------
4.5 M     Trainable params
32.7 M    Non-trainable params
37.2 M    Total params
148.790   Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

c:\Users\regis\.conda\envs\vit2\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
c:\Users\regis\.conda\envs\vit2\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

c:\Users\regis\.conda\envs\vit2\lib\site-packages\pytorch_lightning\trainer\call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...
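
The patience rule configured for the early stopping callback can be illustrated in plain Python. The function below is a simplified sketch of the idea, not Lightning's implementation: it ignores options such as `min_delta`, and the function name and loss values are hypothetical.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the index of the check at which training would stop, or None."""
    best = float("inf")
    wait = 0  # number of consecutive checks without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# The loss improves twice, then fails to beat 0.90 three times in a row.
losses = [1.00, 0.90, 0.95, 0.92, 0.91, 0.80]
print(stopping_epoch(losses))
```

In this example the final value of 0.80 is never reached, because the third non-improving check triggers the stop first. A larger `patience` tolerates longer plateaus at the cost of more training time.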


In [None]:
# Calculate and print the total runtime
print(f'Total runtime: {np.round((time.time()-start_time) / 60, 2)} minutes')