In [77]:
# colab setup
# !pip install torch torchvision torchmetrics torchinfo tqdm 

# Lab 1 & 2 - Deep Learning Basics and Efficient Architectures

Welcome to the first lab(s) of the course! Throughout the course, we will gradually learn about edge AI systems.
In this lab, we will start with the basics of deep learning and the PyTorch framework. We will learn how to create, train and evaluate neural networks. 

In [None]:
import torch
import torchvision
import numpy as np

# The basics of PyTorch

PyTorch is a popular deep learning framework that provides a flexible and efficient way to build and train neural networks. The core of PyTorch is its automatic differentiation engine, which allows us to compute gradients and perform backpropagation. This is essential for training neural networks using gradient descent. <br>

At a high level, PyTorch is a frontend for tensor computations, similar to NumPy, but with additional capabilities for GPU acceleration and automatic differentiation. Its power lies in its simple and intuitive API, which makes it easy to build and experiment with complex neural network architectures using pre-built components written in lower-level languages like C++ or CUDA.

### Tensors - fundamental building blocks of PyTorch

`Tensors` are the basic data structures in PyTorch, similar to NumPy arrays, but with additional capabilities for GPU acceleration and automatic differentiation. In general, they are generalized data containers that can hold data in an arbitrary number of dimensions (even 0D or more than 3D).

![Tensors](https://tinyurl.com/bdnxcym)
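
As a small illustration of the dimensionality described above, the sketch below builds tensors with zero to four dimensions and prints their `ndim` and `shape`. The variable names and the reading of the 4D shape as an image batch are illustrative assumptions.

```python
import torch

# Tensors of increasing dimensionality; names are illustrative only
scalar = torch.tensor(3.14)          # 0D: a single value
vector = torch.tensor([1.0, 2.0])    # 1D: shape (2,)
matrix = torch.zeros((2, 3))         # 2D: shape (2, 3)
cube = torch.zeros((2, 3, 4))        # 3D: shape (2, 3, 4)
batch = torch.zeros((8, 3, 32, 32))  # 4D: e.g. a batch of 8 RGB 32x32 images

for name, t in [("scalar", scalar), ("vector", vector), ("matrix", matrix),
                ("cube", cube), ("batch", batch)]:
    print(f"{name}: ndim={t.ndim}, shape={tuple(t.shape)}")
```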

### Basic properties of tensors

In PyTorch, tensors have several important properties:
- `shape` - the dimensions of the tensor, represented as a tuple of integers.
- `dtype` - the data type of the elements in the tensor, such as `float32`, `int64`, etc.
- `device` - the device on which the tensor is stored, such as `cpu` or `cuda`.

In [None]:
# Creating a 2x2 tensor directly from lists
tensor = torch.tensor([[1, 2], [3, 4]]) 
print("Tensor:\n", tensor)

In [None]:
# Exploring basic tensor properties
print("Shape of tensor:", tensor.shape)
print("Data type of tensor:", tensor.dtype)
print("Device of tensor:", tensor.device)

In [None]:
# Creating an uninitialized 2x3 tensor, which may contain arbitrary values
# which are already present at that memory location during allocation
tensor_empty = torch.empty((2, 3))
print("Empty Tensor:\n", tensor_empty)

In [None]:
# Creating a 3x3 tensor filled with zeros
tensor_zeros = torch.zeros((3, 3))
print("Zeros Tensor:\n", tensor_zeros)

In [None]:
# Creating a 2x4 tensor filled with ones
tensor_ones = torch.ones((2, 4))
print("Ones Tensor:\n", tensor_ones)

In [None]:
# Creating a 3x3 tensor with random values from a uniform distribution over [0, 1)
tensor_random = torch.rand((3, 3))
print("Random Tensor:\n", tensor_random)

In [None]:
# Creating a 2x2 tensor from a NumPy array
np_array = np.array([[5, 6], [7, 8]])
tensor_from_np = torch.from_numpy(np_array)
print("Tensor from NumPy array:\n", tensor_from_np)

In [None]:
# Converting a PyTorch tensor to a NumPy array
tensor_to_np = tensor.numpy()
print("NumPy array from Tensor:\n", tensor_to_np)
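
One detail worth keeping in mind with the two conversions above: for CPU tensors, `torch.from_numpy` and `Tensor.numpy()` share the underlying memory instead of copying it, so an in-place change on one side is visible on the other. A minimal sketch (variable names are illustrative):

```python
import numpy as np
import torch

shared_array = np.array([[5, 6], [7, 8]])
shared_tensor = torch.from_numpy(shared_array)  # no copy: same memory

# An in-place change to the NumPy array shows up in the tensor too
shared_array[0, 0] = 100
print(shared_tensor)

# torch.tensor(...) copies the data, so the copy is independent
copied_tensor = torch.tensor(shared_array)
shared_array[0, 0] = -1
print(copied_tensor[0, 0].item(), shared_tensor[0, 0].item())
```

If an independent copy is needed, `torch.tensor(...)` or `.clone()` avoids this coupling.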

## Tensor access, manipulation and reshaping 

In [None]:
tensor = torch.rand((4, 4))
print("Original Tensor:\n", tensor)

In [None]:
# Accessing elements
print("Element at (0, 0):", tensor[0, 0])
print("First row:", tensor[0, :])
print("Second column:", tensor[:, 1])

In [None]:
# Modifying elements
tensor[0, 0] = 0
print("Modified Tensor:\n", tensor)

In [None]:
# Reshaping tensors
reshaped_tensor = tensor.view(2, 8)
print("Reshaped Tensor:\n", reshaped_tensor)

# Reshaping (other way)
reshaped_tensor_2 = tensor.reshape(2, 8)
print("Reshaped Tensor (other way):\n", reshaped_tensor_2)

#### Question - what is the difference between `view` and `reshape` methods in PyTorch? When would you use one over the other?
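
One experiment that may help with this question: try both methods on a tensor that is not contiguous in memory, for example a transposed one. The sketch below only demonstrates the behaviour; the variable names are illustrative.

```python
import torch

base = torch.arange(16).reshape(4, 4)
transposed = base.t()  # same storage, different strides: not contiguous
print("Contiguous:", transposed.is_contiguous())

# reshape works on any layout and copies the data when it has to
reshaped = transposed.reshape(2, 8)

# view never copies, so it raises an error if the layout does not allow it
try:
    transposed.view(2, 8)
    view_failed = False
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)

# A view of a contiguous tensor shares storage with the original
viewed = base.view(2, 8)
viewed[0, 0] = 99
print("base[0, 0] after writing through the view:", base[0, 0].item())
```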

### Data types and devices

All elements of a tensor share the same data type (e.g. `float32`, `int64`, etc.). This is important, as it influences memory usage and may speed up arithmetic operations. Torch defines several data types, with the most common being `torch.float32` and `torch.int64`. It also supports half-precision (16-bit) and bfloat16 types, which are useful for reducing memory usage and speeding up computations on compatible hardware. <br>

Additionally, tensor objects interface with different hardware devices, such as CPUs and GPUs. By calling specific methods, we can move tensors between memory on different devices. Note that moving tensors between devices can be time-consuming, so it's important to minimize the number of device transfers during training and inference. Operations between tensors on different devices are not allowed, so we need to ensure that all tensors involved in a computation are on the same device.
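
To make the memory argument concrete, the sketch below compares how many bytes the same 1000 elements occupy in several data types, and shows a device-agnostic way of placing tensors. The tensor size and the names are arbitrary choices for illustration.

```python
import torch

# Bytes needed to store 1000 elements in different data types
sizes = {}
for dtype in (torch.float32, torch.float16, torch.bfloat16, torch.int64):
    t = torch.zeros(1000, dtype=dtype)
    sizes[dtype] = t.nelement() * t.element_size()
    print(f"{dtype}: {sizes[dtype]} bytes")

# Pick the device once and create or move every tensor onto it,
# so that all operands of an operation live on the same device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.ones(3, device=device)
b = torch.arange(3, dtype=torch.float32).to(device)
print((a + b).device)
```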

<img src="https://docs.nvidia.com/cuda/cuda-c-programming-guide/_images/gpu-devotes-more-transistors-to-data-processing.png" alt="The GPU devotes more transistors to data processing than the CPU" width="800"/>

In [None]:
# Change the data type of a tensor (float values are truncated toward zero,
# so values drawn from [0, 1) all become 0)
tensor_int = tensor.to(torch.int32)
print("Tensor with int32 data type:\n", tensor_int)
print("Data type of new tensor:", tensor_int.dtype)

# Move tensor to GPU if available
if torch.cuda.is_available():
    tensor_gpu = tensor.to('cuda')
    print("Device of tensor on GPU:", tensor_gpu.device)

    # Move tensor back to CPU
    tensor_cpu = tensor_gpu.to('cpu')
    print("Device of tensor back on CPU:", tensor_cpu.device)

# nn.Module - building blocks for neural networks

PyTorch provides a high-level API for building neural networks using the `nn.Module` class. This class provides a convenient way to define and organize the layers of a neural network, as well as to manage the parameters of the network. In general, network layers are defined by their parameters (weights, biases, convolutional kernels, etc.) and the operations they perform on the input data (linear transformations, non-linear activations, pooling, etc.).
Let's see how to create a simple feedforward neural network using `nn.Module`.

In [None]:
from torch import nn
from torchinfo import summary
class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(10, 5)  
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = SimpleNN()

In [None]:
example_input = torch.rand((1, 10))
output = model(example_input)
print("Model output:\n", output)

In [None]:
# Print model summary
summary(model, (1, 10))

In [None]:
# Inspect the model's weight matrices
print("Weights of fc1 layer:\n", model.fc1.weight)
print("Weights of fc2 layer:\n", model.fc2.weight)
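
The parameters can also be listed programmatically. The sketch below rebuilds a network with the same layer sizes as `SimpleNN` using `nn.Sequential` (an assumption made only to keep the example self-contained) and walks over `named_parameters()` to print each shape and the total count.

```python
import torch
from torch import nn

# Same layer sizes as SimpleNN above, written with nn.Sequential for brevity
tiny_model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))

total_params = 0
for name, param in tiny_model.named_parameters():
    print(f"{name}: shape={tuple(param.shape)}, trainable={param.requires_grad}")
    total_params += param.numel()

# (10*5 + 5) + (5*1 + 1) = 61
print("Total parameters:", total_params)
```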

## Model size, number of parameters and other memory requirements

The size of a neural network model is determined by the number of parameters (weights and biases) it contains. Each parameter requires a certain amount of memory, depending on its data type (e.g., float32, float64, etc.). The total memory required for the parameters can be calculated by multiplying the number of parameters by the size of each parameter in bytes. Even for an already trained model, more than the parameters has to be kept in memory:

- Activations - intermediate outputs of each layer during the forward pass
- Operation instructions (kernels) - the operations that need to be performed during the forward pass (like matrix multiplications, convolutions, etc.)
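
Activation memory can be estimated with forward hooks. The following is a rough sketch, assuming a small fully connected model like the one above; it records the output size of every `nn.Linear` layer for a single input and ignores any temporary buffers the backend may allocate.

```python
import torch
from torch import nn

hook_model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1))
activation_bytes = {}

def make_hook(layer_name):
    # Forward hooks receive (module, inputs, output) after the layer has run
    def hook(module, inputs, output):
        activation_bytes[layer_name] = output.nelement() * output.element_size()
    return hook

handles = [
    layer.register_forward_hook(make_hook(name))
    for name, layer in hook_model.named_modules()
    if isinstance(layer, nn.Linear)
]

with torch.no_grad():
    hook_model(torch.rand((1, 10)))

for handle in handles:
    handle.remove()  # clean up so later forward passes are not affected

print(activation_bytes)
print("Total activation bytes:", sum(activation_bytes.values()))
```

Activation sizes grow with the batch size, while parameter sizes do not, which is one reason the two are tracked separately.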

In [None]:
# Measure parameter size of the model

param_size = 0
for param in model.parameters():
    param_size += param.nelement() * param.element_size()

# Buffers are tensors that are not trainable parameters, but still take space
# and may change during training, e.g. running mean and variance in BatchNorm layers.
buffer_size = 0
for buffer in model.buffers():
    buffer_size += buffer.nelement() * buffer.element_size()

size_all_b = (param_size + buffer_size)
print('model size: {:.0f}B'.format(size_all_b))

## Measuring model inference latency

To measure the inference latency of a neural network model, we can use PyTorch's native `Timer` class from `torch.utils.benchmark`. It provides a simple way to measure the time taken to execute a set of operations, and takes care of details such as warm-up runs and GPU synchronization. 

In [None]:
from torch.utils.benchmark import Timer
timer = Timer(
    stmt='model(example_input)',
    globals={'model': model, 'example_input': example_input}
)
# Run the statement 1000 times. timeit returns a single Measurement whose time
# is the average over those runs, so its median and mean coincide here.
measurement = timer.timeit(1000)
print(f"Average inference time over 1000 runs: {measurement.mean * 1e6:.3f} us")
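
If a distribution of timings is wanted rather than a single average, `Timer.blocked_autorange` collects several measurement blocks. A minimal sketch, assuming a small stand-in model so the example is self-contained:

```python
import torch
from torch import nn
from torch.utils.benchmark import Timer

# Stand-in model and input (illustrative, same sizes as SimpleNN)
bench_model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 1)).eval()
bench_input = torch.rand((1, 10))

bench_timer = Timer(
    stmt="model(x)",
    globals={"model": bench_model, "x": bench_input},
)

# blocked_autorange repeats the statement in blocks for at least
# min_run_time seconds and keeps one timing per block
measurement = bench_timer.blocked_autorange(min_run_time=0.2)
print(f"Blocks measured: {len(measurement.times)}")
print(f"Median time per call: {measurement.median * 1e6:.3f} us")
print(f"Interquartile range: {measurement.iqr * 1e6:.3f} us")
```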


## Datasets and Dataloaders

PyTorch provides a convenient way to handle datasets and data loading using the `torch.utils.data` module. This module provides two main classes: `Dataset` and `DataLoader`. The `Dataset` class is an abstract class that represents a dataset, while the `DataLoader` class is responsible for loading data from a dataset in batches.
In the end, our input data ends up as tensors, which can be directly fed into our neural network. We can also leverage various pre-defined datasets available in the `torchvision.datasets` module.
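
A custom dataset only needs to implement `__len__` and `__getitem__`. The sketch below uses a hypothetical `RandomPointsDataset` holding synthetic data to show how `DataLoader` turns individual samples into batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomPointsDataset(Dataset):
    """Hypothetical dataset of random 2D points labelled by the sign of x + y."""

    def __init__(self, num_samples=100, seed=0):
        generator = torch.Generator().manual_seed(seed)
        self.points = torch.randn((num_samples, 2), generator=generator)
        self.labels = (self.points.sum(dim=1) > 0).long()

    def __len__(self):
        return len(self.points)

    def __getitem__(self, idx):
        return self.points[idx], self.labels[idx]

toy_dataset = RandomPointsDataset()
toy_loader = DataLoader(toy_dataset, batch_size=32, shuffle=True)

# 100 samples with batch_size=32 give three full batches and one of 4 samples
for points, labels in toy_loader:
    print(points.shape, labels.shape)
```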

In [None]:
from torchvision import transforms
from torchvision.datasets import MNIST

# We will use the MNIST dataset of handwritten digits
train_dataset = MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
test_dataset = MNIST(root='./data', train=False, download=True, transform=transforms.ToTensor())


In [None]:
from matplotlib import pyplot as plt
# Visualize some samples from the training dataset
fig, ax = plt.subplots(1, 3, figsize=(10, 4))
for i in range(3):
    image, label = train_dataset[i]
    ax[i].imshow(image.squeeze(), cmap='gray')
    ax[i].set_title(f'Label: {label}')
    ax[i].axis('off')
plt.show()

In [None]:
# Constructing DataLoaders - these will handle batching
from torch.utils.data import DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

In [None]:
# Define simple CNN model for MNIST

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.fc2 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 64 * 7 * 7)
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        return self.fc2(x)
        

In [None]:
# Train the model
from tqdm import tqdm
from torchmetrics import Accuracy

def train(model, device):
    criterion = nn.CrossEntropyLoss().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    acc_metric = Accuracy("multiclass", num_classes=10).to(device)
    num_epochs = 1
    
    
    for epoch in tqdm(range(num_epochs), total=num_epochs, desc="Epochs"):
        model.train()
        epoch_preds = []
        epoch_labels = []
        for batch in tqdm(train_loader, total=len(train_loader)):
            images, labels = batch
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # Detach so the stored predictions do not keep the autograd graph alive
            epoch_preds.append(outputs.detach())
            epoch_labels.append(labels)
        epoch_preds = torch.cat(epoch_preds)
        epoch_labels = torch.cat(epoch_labels)
        train_acc = acc_metric(epoch_preds, epoch_labels)
        print(f"Epoch [{epoch+1}/{num_epochs}], Last-batch loss: {loss.item():.4f}, Training Accuracy: {train_acc:.4f}")

    print("Training complete.")
    
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SimpleCNN().to(device)
train(model, device)

## Increasing the performance of our model

We have used a really simple architecture for our model. Vision models have made huge progress and can now achieve very high accuracy on various tasks. However, we are interested not only in the model's predictive performance, but also in its efficiency. The field of TinyML focuses on creating models that can run on edge devices with limited computational resources. We will explore various techniques to improve the efficiency of our models, but for now we will start with a more advanced architecture that is designed with efficiency in mind.

That being said, we need some concrete metrics to evaluate not only the performance of our model on a given task, but also its computational efficiency. We will use the following metrics, beyond the previously mentioned ones:
- memory footprint - the amount of memory required to store the model parameters and intermediate activations during inference
- number of operations (FLOPs) - the total number of floating-point operations required to perform a forward pass through the model

### FLOPs - Floating Point Operations, a measure of computational complexity
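
As a sanity check for automated counters, FLOPs can be estimated by hand. The sketch below assumes the common convention that one multiply-accumulate counts as two FLOPs and that bias additions and activations are ignored; the helper names are hypothetical. It applies the formulas to the layer shapes of `SimpleCNN` for a single 28x28 input.

```python
def conv2d_flops(c_in, c_out, kernel, h_out, w_out):
    # Each output element needs c_in * kernel * kernel multiply-accumulates
    return 2 * c_in * kernel * kernel * c_out * h_out * w_out

def linear_flops(in_features, out_features):
    # One multiply-accumulate per weight
    return 2 * in_features * out_features

layer_flops = {
    "conv1": conv2d_flops(1, 32, 3, 28, 28),   # padding=1 keeps 28x28
    "conv2": conv2d_flops(32, 64, 3, 14, 14),  # after the first 2x2 pooling
    "fc1": linear_flops(64 * 7 * 7, 128),
    "fc2": linear_flops(128, 10),
}
total_flops = sum(layer_flops.values())

for name, value in layer_flops.items():
    print(f"{name}: {value:,} FLOPs")
print(f"Total: {total_flops:,} FLOPs")
```

Under these assumptions the second convolution accounts for most of the total, which hints at where efficiency-oriented architectures try to save computation.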

In [None]:
# Utility function to count model FLOPs - kudos https://alessiodevoto.github.io/Compute-Flops-with-Pytorch-built-in-flops-counter/

from torch.utils.flop_counter import FlopCounterMode
from typing import Union, Tuple
def get_flops(model, inp: Union[torch.Tensor, Tuple], with_backward=False):
    
    istrain = model.training
    model.eval()
    
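    # Build a random input from a shape tuple if needed, and keep it on the model's device
    inp = inp if isinstance(inp, torch.Tensor) else torch.randn(inp)
    inp = inp.to(next(model.parameters()).device)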

    flop_counter = FlopCounterMode(mods=model, display=False, depth=None)
    with flop_counter:
        if with_backward:
            model(inp).sum().backward()
        else:
            model(inp)
    total_flops =  flop_counter.get_total_flops()
    if istrain:
        model.train()
    return total_flops

In [None]:
flops = get_flops(model, (1, 1, 28, 28))
print(f"FLOPs for one forward pass: {flops:,}")

### Memory footprint - the amount of memory required to store the model parameters and intermediate activations during inference

In [None]:
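import torch.profiler
from torch.profiler import ProfilerActivity

# Profile CUDA activity only when a GPU is actually available
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
with torch.profiler.profile(activities=activities, profile_memory=True) as prof:
    # Profile one full training step: forward pass, backward pass and optimizer update
    output = model(torch.randn((1, 1, 28, 28)).to(device))
    loss = output.sum()
    loss.backward()
    optimizer.step()

sort_key = "self_cuda_memory_usage" if torch.cuda.is_available() else "self_cpu_memory_usage"
print(prof.key_averages().table(sort_by=sort_key, row_limit=10))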

## TODO: fine-tune `MobileNetV3` on the CIFAR-10 dataset and evaluate its performance using the defined metrics.

Now let's use the knowledge we have gained so far and fine-tune `MobileNetV3` on the CIFAR-10 dataset. We will also evaluate its performance using the defined metrics.

1. Load the CIFAR-10 dataset using the `torchvision.datasets` module and create a `DataLoader` for the training and validation sets.
2. Create the training loop.
3. Train and evaluate the model.
4. Measure the model size, number of parameters, FLOPs, total memory footprint and inference latency.

Hint - do we really need to train the whole model from scratch? Maybe we can leverage some pre-trained weights? Do we need to train all the layers? Maybe we can freeze some of them?
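
As a generic illustration of the freezing idea in the hint (not a solution for MobileNetV3), the sketch below uses a hypothetical two-part toy network, disables gradients for its `backbone`, and hands only the remaining trainable parameters to the optimizer.

```python
import torch
from torch import nn

class ToyNet(nn.Module):
    """Hypothetical stand-in for a pre-trained network with a new head."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
        self.head = nn.Linear(8, 10)

    def forward(self, x):
        return self.head(self.backbone(x))

toy_net = ToyNet()

# Freeze the backbone: its parameters will not receive gradients
for param in toy_net.backbone.parameters():
    param.requires_grad = False

trainable = [p for p in toy_net.parameters() if p.requires_grad]
toy_optimizer = torch.optim.Adam(trainable, lr=0.001)

loss = toy_net(torch.randn(4, 16)).sum()
loss.backward()

num_trainable = sum(p.numel() for p in trainable)
num_total = sum(p.numel() for p in toy_net.parameters())
print(f"Trainable parameters: {num_trainable} / {num_total}")
print("Backbone grad is None:", toy_net.backbone[0].weight.grad is None)
```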

### EfficientNet architecture 

The EfficientNet model family and its subsequent versions introduce some key architectural innovations that contribute to their efficiency and performance:
- Compound Scaling - EfficientNet uses a compound scaling method that uniformly scales the network depth, the network width, and the resolution of the input image. This allows the model to achieve better performance with fewer parameters and FLOPs.
- MBConv Blocks - These convolutional blocks work as a kind of inverted residual block, where the input is processed as follows:

1. It is expanded to a higher-dimensional space using a pointwise (1x1) convolution (enlarging the feature representation space by increasing the number of channels)
2. Then, a depthwise convolution is applied to each channel separately to capture spatial information
3. A non-linear activation function is applied (ReLU6 or HardSwish, both computationally efficient)
4. Next, the result is passed through another 1x1 convolution to reduce the number of channels back to the original size
5. Finally, a skip connection adds the block's input to its output to help with gradient flow during training.
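
The five steps above can be sketched as a small module. This is a simplified illustration, not the torchvision implementation: the class name `InvertedResidual`, the expansion factor of 4, the use of batch normalization and the choice of `nn.Hardswish` are assumptions made for the example.

```python
import torch
from torch import nn

class InvertedResidual(nn.Module):
    """Simplified inverted residual block (no squeeze-and-excitation)."""

    def __init__(self, channels, expansion=4, kernel_size=3):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1. Pointwise expansion to a wider representation
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.Hardswish(),
            # 2. Depthwise convolution: groups=hidden filters each channel separately
            nn.Conv2d(hidden, hidden, kernel_size, padding=kernel_size // 2,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            # 3. Non-linear activation
            nn.Hardswish(),
            # 4. Pointwise projection back to the original channel count
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # 5. Skip connection (shapes match: stride is 1 and channels are kept)
        return x + self.block(x)

mbconv = InvertedResidual(channels=16).eval()
mb_out = mbconv(torch.randn(2, 16, 32, 32))
print(mb_out.shape)
```

The depthwise layer holds only one 3x3 filter per channel, which is where most of the parameter savings over a regular convolution come from.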

Sometimes, a squeeze-and-excitation (SE) block is also added after the depthwise convolution. In this block:

1. The input is global-average-pooled to create a channel-wise descriptor, which captures the global information carried by each channel
2. This descriptor is then passed through a small fully connected network to learn channel-wise weights, a kind of attention mechanism
3. Finally, the input is scaled (multiplied) by these weights to recalibrate the channel-wise feature responses.
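
A minimal sketch of the three steps above. The class name `SqueezeExcite`, the reduction factor of 4 and the ReLU/Hardsigmoid pair are assumptions for illustration; real implementations differ in details.

```python
import torch
from torch import nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # 1. global average pooling
        self.fc = nn.Sequential(             # 2. small fully connected network
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Hardsigmoid(),                # channel weights in [0, 1]
        )

    def forward(self, x):
        batch, channels, _, _ = x.shape
        descriptor = self.pool(x).view(batch, channels)
        weights = self.fc(descriptor).view(batch, channels, 1, 1)
        return x * weights                   # 3. rescale each channel

se_block = SqueezeExcite(channels=16)
se_input = torch.randn(2, 16, 8, 8)
se_out = se_block(se_input)
print(se_out.shape)
```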

You can use `torchinfo` to see a list of all layers in the model - take a look and see if you can find those elements in the architecture.

In [None]:
from torchvision.models import mobilenet_v3_small
model = mobilenet_v3_small(num_classes=10).to(device)

In [None]:
# Your code here