# Happywhale dataset: image normalization

While playing with this dataset I realized that normalizing the images with the standard ImageNet mean and std doesn't work well.
Water, whales, dolphins: the images are mostly gray and blue.

This notebook demonstrates this and calculates the per-channel pixel mean and std for this dataset.

TLDR: the calculation produced mean=(0.434, 0.487, 0.544) and std=(0.163, 0.166, 0.173). I will try to use these values in future experiments.
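
For reference, normalization maps each channel of a pixel scaled to [0, 1] to (x - mean) / std. The NumPy sketch below is only an illustration (the helper name `normalize_pixel` is made up here, not part of any library). It shows what happens to a pixel that sits exactly at this dataset's mean color under both sets of statistics:

```python
import numpy as np

# Standard ImageNet statistics and the values calculated in this notebook
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])
DATASET_MEAN = np.array([0.434, 0.487, 0.544])
DATASET_STD = np.array([0.163, 0.166, 0.173])

def normalize_pixel(rgb, mean, std):
    """Hypothetical helper: normalize one RGB pixel given in [0, 1]."""
    return (np.asarray(rgb, dtype=np.float64) - mean) / std

# A "typical" pixel: exactly the dataset's mean color
typical = DATASET_MEAN
with_imagenet = normalize_pixel(typical, IMAGENET_MEAN, IMAGENET_STD)
with_dataset = normalize_pixel(typical, DATASET_MEAN, DATASET_STD)

print('ImageNet stats:', with_imagenet)
print('Dataset stats: ', with_dataset)
```

With the ImageNet statistics the red channel of this pixel ends up below zero and the blue channel well above it, so the input is not centered. With the dataset statistics all three channels are zero.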

Hope this is helpful.

In [None]:
import os
import random
from multiprocessing import Pool
import numpy as np
import cv2
import albumentations as A
import matplotlib.pyplot as plt
from tqdm import tqdm

INPUT_PATH = '../input/happy-whale-and-dolphin'

In [None]:
# Create lists of train, test and all images
train_files = os.listdir(os.path.join(INPUT_PATH, 'train_images'))
test_files = os.listdir(os.path.join(INPUT_PATH, 'test_images'))
all_files = [os.path.join(INPUT_PATH, 'train_images', f) for f in train_files] + \
            [os.path.join(INPUT_PATH, 'test_images', f) for f in test_files]
print(f'Train files: {len(train_files)}, test files: {len(test_files)}, all_files: {len(all_files)}')

# now select 5 random images to show
show_images = random.sample(all_files, 5)

Now let's create a function that displays original and normalized images in 2 columns to compare.
Run this function for 5 randomly selected images.

In [None]:
def show_orig_norm_images(img_files, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    nrows, ncols = len(img_files), 2
    # squeeze=False keeps ax two-dimensional even when only one image is passed
    fig, ax = plt.subplots(nrows, ncols, figsize=(20,31), squeeze=False)

    for i in range(len(img_files)):
        img = cv2.imread(img_files[i])
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        norm_img = A.Normalize(mean=mean, std=std)(image=img)['image']
        
        ax[i, 0].grid(False)
        ax[i, 0].axis('off')
        ax[i, 0].title.set_text(f'{os.path.basename(img_files[i])}: original')
        ax[i, 0].imshow(img)
    
        ax[i, 1].grid(False)
        ax[i, 1].axis('off')
        ax[i, 1].title.set_text(f'{os.path.basename(img_files[i])}: normalized')
        # imshow clips float RGB values to [0, 1], so out-of-range values show up as saturated colors
        ax[i, 1].imshow(norm_img)
        
    plt.tight_layout()
    plt.show()
    
show_orig_norm_images(show_images)
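
A note on reading the right-hand column: for float RGB data, `imshow` clips values to [0, 1], so everything the normalization pushes outside that range is displayed as saturated. The sketch below is a rough way to quantify that. The helper `clipped_fraction` is hypothetical, and the input is a synthetic uniform blue-gray image rather than a real photo:

```python
import numpy as np

def clipped_fraction(norm_img):
    """Hypothetical helper: share of values outside [0, 1], i.e. values imshow would clip."""
    norm_img = np.asarray(norm_img, dtype=np.float64)
    return float(np.mean((norm_img < 0.0) | (norm_img > 1.0)))

# Synthetic 4x4 image filled with a blue-gray color (an assumption, not real data)
pixel = np.array([0.434, 0.487, 0.544])
img = np.tile(pixel, (4, 4, 1))

# Normalize with the standard ImageNet statistics
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_std = np.array([0.229, 0.224, 0.225])
norm_img = (img - imagenet_mean) / imagenet_std

print(f'Clipped fraction: {clipped_fraction(norm_img):.3f}')
```

In this synthetic case the whole red channel goes negative and is clipped, while green and blue stay inside the displayable range.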

OK, now let's calculate the pixel mean and std of each color channel for each image.
The per-image means and stds are then averaged across all images in the dataset.

Note: this took a while on Kaggle, so below it's calculated for the first 1000 images only.

Remove the "all_files = all_files[:1000]" line to do the full calculation.

In [None]:
np.set_printoptions(precision=3)

all_files = all_files[:1000] # remove this line to do full calculation

def process_file(fp):
    """Return the per-channel (R, G, B) mean and std of one image, with pixels scaled to [0, 1]."""
    img = cv2.imread(fp)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img / 255
    return np.mean(img, axis=(0,1)), np.std(img, axis=(0,1))

mean, std = np.zeros(3), np.zeros(3)
n, done = len(all_files), 0
    
with Pool(os.cpu_count()) as p:
    pbar = tqdm(p.imap(process_file, all_files), total=n)
    for m, s in pbar:
        done += 1
        mean += m
        std += s
        pbar.set_description(f'{mean/done} {std/done}')
        
mean, std = mean/n, std/n
 
print(f'Calculated mean: {mean}')
print(f'Calculated std: {std}')
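
One caveat about the calculation above: the average of per-image stds is generally not the same as the std taken over all pixels of all images, because it ignores how much the mean brightness varies from image to image. A pooled alternative accumulates per-channel sums and sums of squares. The sketch below uses small synthetic arrays instead of the real files, and `pooled_mean_std` is a hypothetical helper, not what this notebook ran:

```python
import numpy as np

def pooled_mean_std(images):
    """Per-channel mean and std over all pixels of all images (floats in [0, 1])."""
    total = np.zeros(3)
    total_sq = np.zeros(3)
    count = 0
    for img in images:
        px = np.asarray(img, dtype=np.float64).reshape(-1, 3)
        total += px.sum(axis=0)
        total_sq += (px ** 2).sum(axis=0)
        count += px.shape[0]
    mean = total / count
    var = total_sq / count - mean ** 2
    # Guard against tiny negative values caused by floating-point rounding
    return mean, np.sqrt(np.maximum(var, 0.0))

# Two synthetic images: one dark, one bright
rng = np.random.default_rng(0)
dark = rng.uniform(0.0, 0.2, size=(8, 8, 3))
bright = rng.uniform(0.8, 1.0, size=(8, 8, 3))

pooled_mean, pooled_std = pooled_mean_std([dark, bright])
avg_of_stds = (dark.std(axis=(0, 1)) + bright.std(axis=(0, 1))) / 2

print(f'Pooled std:          {pooled_std}')
print(f'Average of the stds: {avg_of_stds}')
```

Here the pooled std is much larger than the averaged one, since the two images differ a lot in brightness. Note also that pooling weights every pixel equally, so larger images count for more, whereas the loop above weights every image equally. Which of the two is preferable for training is a judgment call.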

OK, now let's check the new normalization settings.

In [None]:
show_orig_norm_images(show_images, mean=mean, std=std)