# Versatile Object Recognition
***

In [16]:
import torch
from torch import nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

## 1. Introduction

## 2. Preparing Data
For simple models (shallow networks, logistic regression, etc.), `ToTensor()` alone is often sufficient, as it rescales image pixel values to the range from 0 to 1. For deeper architectures, it is generally recommended to also standardise the inputs so that each colour channel has zero mean and unit variance. Many pretrained models were trained on inputs standardised this way, so matching that preprocessing tends to yield better results than rescaling alone. Centring inputs around zero also tends to make training more stable and convergence faster, particularly with zero-centred activation functions such as tanh and with weight initialisation schemes that assume zero-mean inputs.
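
As a minimal sketch of the standardisation described above, the snippet below uses a random tensor as a stand-in for a batch of `ToTensor()` outputs. The tensor, its shape, and the variable names are illustrative assumptions, not the notebook's data. It checks that subtracting the per-channel mean and dividing by the per-channel standard deviation gives roughly zero mean and unit variance in every channel.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a batch of ToTensor() outputs: values in [0, 1],
# shape [batch, channels, height, width].
images = torch.rand(8, 3, 32, 32)

# Per-channel statistics, kept as [1, C, 1, 1] so they broadcast over the batch.
channel_mean = images.mean(dim=(0, 2, 3), keepdim=True)
channel_std = ((images - channel_mean) ** 2).mean(dim=(0, 2, 3), keepdim=True).sqrt()

standardised = (images - channel_mean) / channel_std

print(standardised.mean(dim=(0, 2, 3)))         # close to 0 for every channel
print(standardised.pow(2).mean(dim=(0, 2, 3)))  # close to 1 for every channel
```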

Let:
- $X_{n, c, h, w}$: Pixel value for image $n$, channel $c$, height $h$, and width $w$.
- $N$: Total number of images.
- $C$: Number of channels (RGB = 3).
- $H, W$: Height and width of an image.

For each batch of $B$ images (shape $\left[B, C, H, W\right] = \left[B, 3, 32, 32\right]$):
- Mean per channel:
$$
\mu_{\text{batch, c}} = \dfrac{1}{B \cdot H \cdot W}\sum^{B}_{n=1} \sum^{H}_{h=1} \sum^{W}_{w=1} X_{n, c, h, w}
$$

- Mean of squares per channel:

$$
s_{\text{batch, c}} = \dfrac{1}{B \cdot H \cdot W}\sum^{B}_{n=1} \sum^{H}_{h=1} \sum^{W}_{w=1} X^2_{n, c, h, w}
$$

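- Mean per channel over the dataset (the average of the batch means, which is exact when every batch contains the same number of images):
$$
\mu_c = \dfrac{\sum_{\text{batches}} \mu_{\text{batch, c}}}{n_{\text{batches}}}
$$

Expanding $\text{Var}(X) = E\left[(X - \mu)^2\right]$ gives the identity $\text{Var}(X) = E\left[X^2\right] - (E\left[X\right])^2$, so:
- Standard deviation per channel:

\begin{align*}
\sigma_c &= \sqrt{E\left[X^2\right] - (E\left[X\right])^2} \\
 &= \sqrt{\dfrac{\sum_{\text{batches}} s_{\text{batch, c}}}{n_{\text{batches}}} - \mu_c^2}
\end{align*}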
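
The variance identity used for the standard deviation can be checked numerically. The following is only a sketch on made-up values for a single colour channel (the array `x` and the other names are illustrative assumptions): it compares the square root of "mean of squares minus squared mean" with NumPy's population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pixel values in [0, 1] for a single colour channel.
x = rng.random(10_000)

mean = x.mean()
mean_of_squares = (x ** 2).mean()

# sqrt(E[X^2] - (E[X])^2)
std_from_identity = np.sqrt(mean_of_squares - mean ** 2)

# NumPy's default (ddof=0) is the population standard deviation.
std_direct = x.std()

print(std_from_identity, std_direct)  # the two values should agree closely
```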


In [None]:
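def get_mean_and_std(train_data):
    """Estimate the per-channel mean and standard deviation of an image dataset."""
    train_loader = DataLoader(train_data, batch_size=64, shuffle=False)

    c_sum, c_squared_sum, n_batches = 0, 0, 0
    for data, _ in train_loader:
        # Shape: [batch_size, channel=3, height=32, width=32]
        c_sum += torch.mean(data, dim=[0, 2, 3])  # Per-channel batch mean
        c_squared_sum += torch.mean(data**2, dim=[0, 2, 3])  # Per-channel mean of squares
        n_batches += 1
    # Averaging batch statistics weights every batch equally, so a smaller
    # final batch makes the result a close approximation rather than exact.
    mean = c_sum / n_batches
    std = (c_squared_sum / n_batches - mean**2) ** 0.5

    return mean.tolist(), std.tolist()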
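
Averaging per-batch means is exact only when every batch has the same size. The CIFAR-10 training set has 50,000 images, so with `batch_size=64` the final batch holds just 16 images and is slightly over-weighted. The sketch below is one possible variant, not the method from the referenced sources: it accumulates sums and pixel counts so that every pixel is weighted equally. The function name `get_mean_and_std_weighted`, the `n_channels` parameter, and the random stand-in dataset are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def get_mean_and_std_weighted(dataset, batch_size=64, n_channels=3):
    """Per-channel mean/std with every pixel weighted equally across batches."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)

    pixel_sum = torch.zeros(n_channels, dtype=torch.float64)
    pixel_squared_sum = torch.zeros(n_channels, dtype=torch.float64)
    n_pixels = 0
    for data, _ in loader:
        data = data.double()  # accumulate in float64 to limit rounding error
        # Sum (rather than average) over batch, height and width.
        pixel_sum += data.sum(dim=(0, 2, 3))
        pixel_squared_sum += (data ** 2).sum(dim=(0, 2, 3))
        n_pixels += data.shape[0] * data.shape[2] * data.shape[3]

    mean = pixel_sum / n_pixels
    std = (pixel_squared_sum / n_pixels - mean ** 2).sqrt()
    return mean.tolist(), std.tolist()


# Hypothetical stand-in dataset: 100 random "images", so the last batch holds 36.
torch.manual_seed(0)
fake_images = torch.rand(100, 3, 32, 32)
fake_labels = torch.zeros(100, dtype=torch.long)

w_mean, w_std = get_mean_and_std_weighted(TensorDataset(fake_images, fake_labels))
print(w_mean, w_std)
```

On CIFAR-10 the difference from the unweighted version is expected to be small, since only one of the 782 batches is short.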

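Alternatively, when the whole dataset fits in memory as a NumPy array (as `CIFAR10.data` does), the statistics can be computed in a single step: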

In [None]:
def get_mean_and_std_simple(train_data):
    # .data -> uint8 array of shape [n_images, height, width, colour channels]
    data = train_data.data / 255  # Rescale to [0, 1], as ToTensor() does

    mean = data.mean(axis=(0, 1, 2))  # Per-channel mean over images, height and width
    std = data.std(axis=(0, 1, 2))  # Per-channel (population) standard deviation

    return mean, std
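
The two functions reduce over different axes because the data layouts differ: the raw array is channel-last, while tensors produced by `ToTensor()` are channel-first. The sketch below, using a small random `uint8` array as a hypothetical stand-in for `train_data.data`, illustrates that reducing over axes `(0, 1, 2)` in the first layout matches reducing over axes `(0, 2, 3)` in the second.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for `train_data.data`: uint8 pixels, channel-last layout.
fake_data = rng.integers(0, 256, size=(10, 32, 32, 3), dtype=np.uint8)

scaled = fake_data / 255  # float64 values in [0, 1]

# Channel-last [N, H, W, C]: reduce over images, height and width.
mean_nhwc = scaled.mean(axis=(0, 1, 2))
std_nhwc = scaled.std(axis=(0, 1, 2))

# Channel-first [N, C, H, W]: the same reduction uses axes 0, 2 and 3.
scaled_nchw = scaled.transpose(0, 3, 1, 2)
mean_nchw = scaled_nchw.mean(axis=(0, 2, 3))
std_nchw = scaled_nchw.std(axis=(0, 2, 3))

print(mean_nhwc, mean_nchw)
print(std_nhwc, std_nchw)
```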

In [33]:
raw_train_data = datasets.CIFAR10(
    root="_datasets", train=True, download=True, transform=transforms.ToTensor()
)
mean, std = get_mean_and_std(raw_train_data)
print(f"Mean: {mean}\nStandard Deviation: {std}")

Mean: [0.4914625287055969, 0.48222312331199646, 0.44661077857017517]
Standard Deviation: [0.24703028798103333, 0.24348397552967072, 0.26159432530403137]


## 3. Data Augmentation
The training transform pipeline applies the following preprocessing and augmentation steps, in order:
- **Resize(size=(32, 32))**:
    - Ensures all input images are resized to $32 \times 32$ pixels. CIFAR-10 images already have this size, so this step acts as a safeguard.
- **RandomCrop(32, padding=4)**:
    - Pads each side of the $32 \times 32$ images by $4$ pixels, making it $40 \times 40$, then randomly crops back to $32 \times 32$.
    - Introduces local translations and slight spatial variations.
    - Helps the model learn translation-invariant features.
- **RandomHorizontalFlip(p=0.5)**:
    - Each image is flipped horizontally with probability $0.5$ (50%).
    - Increases data diversity and is appropriate for natural images.
- **ToTensor()**:
    - Converts PIL Images or NumPy arrays to PyTorch tensors in channel-first layout ($C \times H \times W$), scaling `uint8` pixel values to the range from $0$ to $1$.
- **Normalize(mean=mean, std=std)**:
    - Standardises each channel by subtracting the dataset mean and dividing by its standard deviation.
    - Ensures the network's input features are zero-centred and on a comparable scale across channels, which tends to stabilise training.
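
To make the pad-then-crop and flip steps concrete, here is a rough sketch in plain PyTorch. It is not torchvision's implementation; it only mimics the effect of `RandomCrop(32, padding=4)` with its default zero padding, followed by a random horizontal flip, on a hypothetical random image.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

image = torch.rand(3, 32, 32)  # hypothetical [C, H, W] image

# Zero-pad 4 pixels on every side: 32 x 32 -> 40 x 40.
padded = F.pad(image, (4, 4, 4, 4), mode="constant", value=0.0)

# Pick a random top-left corner so that a 32 x 32 window fits in the padded image.
top = int(torch.randint(0, 40 - 32 + 1, (1,)))
left = int(torch.randint(0, 40 - 32 + 1, (1,)))
cropped = padded[:, top : top + 32, left : left + 32]

# Flip left-to-right (along the width axis) with probability 0.5.
if torch.rand(1).item() < 0.5:
    cropped = torch.flip(cropped, dims=[2])

print(padded.shape, cropped.shape)
```

Because the crop offset ranges from 0 to 8 pixels in each direction, the image content shifts by at most 4 pixels relative to its original position.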

In [None]:
train_transform = transforms.Compose(
    [
        transforms.Resize(size=(32, 32)),
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(p=0.5),
        transforms.ToTensor(),
        transforms.Normalize(mean=mean, std=std),
    ]
)


## References

1. Aladdin Persson. (2020). *Pytorch Quick Tip: Calculate Mean and Standard Deviation of Data*. <br>
https://youtu.be/y6IEcEBRZks?si=JmHfPQWezR1ooX5F

2. Stack Overflow. (2021). *How to calculate the mean and the std of cifar10 data*. <br>
https://stackoverflow.com/questions/66678052/how-to-calculate-the-mean-and-the-std-of-cifar10-data