# SUMMARY

This notebook calculates the mean and standard deviation (STD) of the training images. These statistics are useful for normalizing images within the augmentation pipeline. Computing the mean is easy (with equal-sized batches, we can simply average the batch means), but the standard deviation is a bit more tricky: averaging STDs across batches is not the same as calculating the overall STD. Let's do it properly!
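
To see why averaging batch STDs goes wrong, here is a minimal sketch on made-up numbers. The two "batches" below are hypothetical and only use the standard library; they are not part of the actual pipeline.

```python
import statistics

# Two hypothetical "batches" of pixel values with different brightness levels.
batch_a = [0.1, 0.2, 0.3]
batch_b = [0.7, 0.8, 0.9]

# Averaging per-batch means recovers the overall mean (batches have equal size).
mean_of_means = (statistics.fmean(batch_a) + statistics.fmean(batch_b)) / 2
overall_mean = statistics.fmean(batch_a + batch_b)

# Averaging per-batch STDs ignores the spread *between* the batches.
mean_of_stds = (statistics.pstdev(batch_a) + statistics.pstdev(batch_b)) / 2
overall_std = statistics.pstdev(batch_a + batch_b)

print(f"mean of batch means: {mean_of_means:.4f}, overall mean: {overall_mean:.4f}")
print(f"mean of batch STDs:  {mean_of_stds:.4f}, overall STD:  {overall_std:.4f}")
```

Each batch is tightly clustered, so the averaged STD comes out much smaller than the STD of the pooled values. Note also that averaging batch means is only exact when all batches have the same size, which the last batch of a loader often does not.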

Note: the pipeline comes from [this notebook](https://www.kaggle.com/kozodoi/computing-dataset-mean-and-std).


### TL;DR

- mean: `[0.5183 0.4835 0.4457]`
- std: `[0.2681 0.2638 0.2658]`

# PREPARATIONS

First, we import some libraries and specify a few parameters. There is no need for a GPU because no modeling is involved.

In [None]:
##### PACKAGES

import numpy as np
import pandas as pd

import torch
import torchvision
from torch.utils.data import Dataset, DataLoader

import albumentations as A
from albumentations.pytorch import ToTensorV2

import cv2

from tqdm import tqdm

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
##### PARAMS

device      = torch.device('cpu') 
num_workers = 4
batch_size  = 64
image_size  = 380
data_path   = '../input/petfinder-pawpularity-score/'

# DATA PREP

Let's set up a Dataset and a DataLoader.

In [None]:
##### DATA IMPORT

df         = pd.read_csv(data_path + 'train.csv')
df['path'] = df['Id'].apply(lambda x: '{}train/{}.jpg'.format(data_path, x))
df.head()

Note that we divide the original pixel values in `[0, 255]` by 255 to scale them to `[0, 1]`. This affects the mean and STD calculations: the resulting statistics are expressed on the `[0, 1]` scale as well.
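
As a quick illustration of that effect, the sketch below uses a handful of hypothetical 8-bit values (not taken from the dataset) to show that dividing the pixels by 255 divides both the mean and the STD by 255.

```python
import statistics

# Hypothetical raw 8-bit pixel values in [0, 255].
raw = [0, 64, 128, 255]
scaled = [v / 255.0 for v in raw]

raw_mean, raw_std = statistics.fmean(raw), statistics.pstdev(raw)
scaled_mean, scaled_std = statistics.fmean(scaled), statistics.pstdev(scaled)

# Both statistics shrink by the same factor of 255.
print(raw_mean / 255.0, scaled_mean)
print(raw_std / 255.0, scaled_std)
```

So statistics computed on `[0, 1]` data can be converted to the `[0, 255]` scale by multiplying by 255, and vice versa.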

In [None]:
##### DATASET

class ImageData(Dataset):
       
    def __init__(self, 
                 df, 
                 transform = None):
        self.df        = df
        self.transform = transform

        
    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        
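        # read the image from disk (OpenCV loads channels in BGR order)
        path  = self.df.loc[idx, 'path']
        image = cv2.imread(path)
        if image is None:
            raise FileNotFoundError('Could not read image: {}'.format(path))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

        # scale to [0, 1]
        image = image / 255.0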
        
        # augmentations
        if self.transform is not None:
            image = self.transform(image = image)['image']
                                
        # output
        return image    

Our augmentation pipeline only uses `A.Resize()` to resize the images.

In [None]:
##### AUGMENTATIONS

augs = A.Compose([A.Resize(height  = image_size, 
                           width   = image_size),
                  ToTensorV2()])

In [None]:
##### CHECK SAMPLE BATCH

# data loader
image_dataset = ImageData(df = df, transform = augs)
image_loader = DataLoader(image_dataset, 
                          batch_size  = batch_size, 
                          shuffle     = False, 
                          num_workers = num_workers)

# display images
for batch_idx, inputs in enumerate(image_loader):
    fig = plt.figure(figsize = (16, 8))
    for i in range(5):
        ax = fig.add_subplot(2, 5, i + 1, xticks = [], yticks = [])
        ax.imshow(inputs[i].numpy().transpose(1, 2, 0))
    break

# CALCULATIONS

The computation is done in three steps:

1. Define placeholders to store two running totals per channel: the sum of pixel values and the sum of squared pixel values. The former is used to compute the mean, and the latter is needed for the standard deviation.
2. Loop through the batches and add up channel-specific sum and squared sum values.
3. Perform final calculations to obtain data-level mean and standard deviation.
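
The three steps can be sanity-checked on synthetic data. The sketch below is an assumption-laden stand-in: it uses NumPy instead of PyTorch, and a list of small random arrays (`batches`, a hypothetical name) instead of the real DataLoader, then compares the result against a direct computation on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the DataLoader: 3 batches shaped (N, C, H, W).
batches = [rng.random((4, 3, 8, 8)) for _ in range(3)]

# Step 1: placeholders for the per-channel sum and sum of squares.
psum = np.zeros(3)
psum_sq = np.zeros(3)

# Step 2: accumulate over the batch, height and width axes.
for inputs in batches:
    psum += inputs.sum(axis=(0, 2, 3))
    psum_sq += (inputs ** 2).sum(axis=(0, 2, 3))

# Step 3: per-channel pixel count, then mean and STD.
count = sum(b.shape[0] * b.shape[2] * b.shape[3] for b in batches)
mean = psum / count
std = np.sqrt(psum_sq / count - mean ** 2)

# Reference: compute the statistics directly on the concatenated data.
all_data = np.concatenate(batches, axis=0)
ref_mean = all_data.mean(axis=(0, 2, 3))
ref_std = all_data.std(axis=(0, 2, 3))
print(np.allclose(mean, ref_mean), np.allclose(std, ref_std))
```

The direct computation needs all images in memory at once, which is exactly what the running sums avoid.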

In [None]:
##### COMPUTE PIXEL SUM AND SQUARED SUM

# placeholders
psum    = torch.tensor([0.0, 0.0, 0.0])
psum_sq = torch.tensor([0.0, 0.0, 0.0])

# loop through images
for inputs in tqdm(image_loader):
    psum    += inputs.sum(axis        = [0, 2, 3])
    psum_sq += (inputs ** 2).sum(axis = [0, 2, 3])

Finally, we make some further calculations:

- mean: divide the sum of pixel values by the total count, i.e. the number of pixels per channel in the dataset, computed as `len(df) * image_size * image_size`
- standard deviation: take the square root of the variance, which is the mean of squares minus the squared mean: `total_std = sqrt(psum_sq / count - total_mean ** 2)`

The formula for STD relies on the sum of squares, so we never need to revisit the individual pixels once the mean is known. See [this explanation](https://www.thoughtco.com/sum-of-squares-formula-shortcut-3126266) for details.

![variance equation](https://kozodoi.me/images/copied_from_nb/images/fig_variance.jpg)
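
For reference, the shortcut follows from expanding the square in the definition of the population variance. In the derivation below, $n$ corresponds to `count`, $\sum_i x_i$ to `psum`, $\sum_i x_i^2$ to `psum_sq`, and $\mu$ to `total_mean`, all taken per channel; the symbols themselves are notation introduced here for illustration.

$$
\begin{aligned}
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;-\; 2\mu \cdot \frac{1}{n}\sum_{i=1}^{n} x_i \;+\; \mu^2 \\
         &= \frac{1}{n}\sum_{i=1}^{n} x_i^2 \;-\; \mu^2,
\qquad \text{where } \mu = \frac{1}{n}\sum_{i=1}^{n} x_i .
\end{aligned}
$$

This is the population variance (dividing by $n$ rather than $n - 1$), which matches the code in the next cell.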

In [None]:
##### FINAL CALCULATIONS

# pixel count
count = len(df) * image_size * image_size

# mean and STD
total_mean = psum / count
total_var  = (psum_sq / count) - (total_mean ** 2)
total_std  = torch.sqrt(total_var)

# output
print('Training data stats:')
print('- mean:', total_mean.numpy())
print('- std: ', total_std.numpy())
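
Once the statistics are known, they can be used to standardize images. The sketch below is one possible way to do this with plain NumPy, assuming the channel values from the TL;DR and a randomly generated placeholder image (`image` is hypothetical, not a real training sample).

```python
import numpy as np

# Channel statistics reported in the TL;DR (computed on pixels scaled to [0, 1]).
mean = np.array([0.5183, 0.4835, 0.4457])
std = np.array([0.2681, 0.2638, 0.2658])

# Hypothetical RGB image in (H, W, C) layout with 8-bit values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(380, 380, 3), dtype=np.uint8)

# Scale to [0, 1] first, then standardize each channel via broadcasting.
scaled = image.astype(np.float32) / 255.0
normalized = (scaled - mean) / std

print(normalized.shape, normalized.dtype)
```

Within Albumentations, `A.Normalize(mean=..., std=..., max_pixel_value=255.0)` should apply the same scaling and standardization to 8-bit inputs. If the images are already divided by 255 before the transform, as in the `ImageData` class in this notebook, `max_pixel_value` would likely need to be set to `1.0` to avoid scaling twice.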