In this notebook, I preprocess the images in the training set. I will:
- Split the data into train/test sets at a 4:1 ratio
- Shrink each image to 256 * 256 pixels
- Reduce the color channels from three (RGB) to one (grayscale)

# Load Libraries

In [None]:
import numpy as np
import pandas as pd
import os

from pathlib import Path
import cv2
import matplotlib.pyplot as plt
import matplotlib.image as img
from tqdm.notebook import tqdm

SEED = 42

# Load the DataFrame

In [None]:
labels = pd.read_csv('/kaggle/input/happy-whale-and-dolphin/train.csv')
labels.head()

In [None]:
print('Size of Training Dataset:', labels.shape[0])
print(labels.shape[1], 'Columns')

# Define the Paths to the Image Files

In [None]:
ROOT_PATH = Path('/kaggle/input/happy-whale-and-dolphin/train_images/')
SAVE_PATH = Path('/kaggle/working/')

# An Example of the Basic Preprocessing Logic

Let's see a preprocessing example first:

In [None]:
ex_path = ROOT_PATH/'00021adfb725ed.jpg'
ex_arry = cv2.imread(str(ex_path), cv2.IMREAD_GRAYSCALE) / 255   # read as grayscale, scale to [0, 1]
ex_arry = cv2.resize(ex_arry, (256,256)).astype(np.float16)      # resize to 256*256, store as float16

print('Type of ex_arry: ', type(ex_arry))
print('Shape of ex_arry:', ex_arry.shape)
plt.imshow(ex_arry, cmap='gray')
plt.show()
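
The example stores the result as `float16` to save space. As a quick sanity check, the sketch below (NumPy only, no image files needed) confirms that `float16` still keeps all 256 scaled gray levels distinct and compares the storage cost of one 256 * 256 array in `float16` and `float64`. Note that after resizing, interpolated pixel values no longer sit exactly on the `k/255` grid, so this only bounds the rounding error rather than describing every stored value.

```python
import numpy as np

# Every possible 8-bit gray level after scaling to [0, 1]
levels = np.arange(256) / 255
levels16 = levels.astype(np.float16)

# How many levels survive the cast, and how large is the rounding error?
n_distinct = len(np.unique(levels16))
max_error = np.abs(levels16.astype(np.float64) - levels).max()

# Storage cost of a single 256 * 256 single-channel image
bytes_f16 = np.zeros((256, 256), dtype=np.float16).nbytes
bytes_f64 = np.zeros((256, 256), dtype=np.float64).nbytes

print('distinct gray levels in float16:', n_distinct)
print('max rounding error             :', max_error)
print('bytes per image (float16)      :', bytes_f16)
print('bytes per image (float64)      :', bytes_f64)
```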

Following the process above, we shrink the image to 256 * 256 and reduce it from 3 RGB channels to a single grayscale channel. Next, I will follow the same logic and write a loop to process all the images. But before that, I am going to split the data into train and test sets.

# Train/Test Split, Standardize and Normalize
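- train/test split (4:1)
- resize to 256 * 256
- convert to grayscale to reduce the complexity
- scale pixel values to [0, 1] by dividing by 255, the maximum pixel value in the provided dataset
- compute the mean and standard deviation of the training set, to be used for standard normalization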

In [None]:
# shuffle the labels dataframe
labels = labels.sample(frac=1, random_state=SEED).reset_index(drop=True)

# split point
split = int(labels.shape[0]*4/5)
print('training size:', split)
print('test size    :', labels.shape[0] - split)

# flag train/test
# col is_train == 1 --> belongs to training set
# col is_train == 0 --> belongs to test set
labels['is_train'] = np.concatenate((np.ones(split), np.zeros(labels.shape[0] - split)), axis=0)
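
To make the position-based flagging concrete, here is a minimal sketch on a toy example. It uses plain NumPy rather than the DataFrame, and `n_rows` and `shuffled_ids` are hypothetical stand-ins for the real rows. Because the rows are shuffled first, marking the first 4/5 positions as training gives a random 4:1 split.

```python
import numpy as np

SEED = 42
n_rows = 10                                    # hypothetical toy dataset size
rng = np.random.default_rng(SEED)
shuffled_ids = rng.permutation(n_rows)         # stands in for the shuffled DataFrame rows

split = int(n_rows * 4 / 5)                    # position of the first test row
is_train = (np.arange(n_rows) < split).astype(int)  # 1 for the first 4/5 positions, else 0

train_ids = shuffled_ids[is_train == 1]
test_ids = shuffled_ids[is_train == 0]
print(len(train_ids), len(test_ids))           # 8 2
```

When the number of rows is not divisible by 5, `int(...)` truncates, so the training set gets the rounded-down share and the test set takes the remainder.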

Now let's loop over and preprocess all the images:

In [None]:
sums = 0
sums_squared = 0

for c, image_id in enumerate(tqdm(labels.image)):
    image_path = ROOT_PATH/image_id                        # create the path to the .jpg file
    image_arry = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE) / 255  # read as grayscale and scale to [0, 1]
    image_arry = cv2.resize(image_arry, (256,256)).astype(np.float16)     # resize to 256*256, store as float16
        
    label = labels.individual_id.iloc[c]                   # retrieve the corresponding label
    
    # 4/5 train split, 1/5 test split
    train_or_test = 'train' if c < split else 'test'
    
    current_save_path = SAVE_PATH/train_or_test/str(label) # define save path and create if necessary
    current_save_path.mkdir(parents=True, exist_ok=True)
    np.save(current_save_path/image_id, image_arry)        # save the array in the corresponding directory
    
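    # accumulate per-image statistics in float64: float16 tops out at 65504,
    # so summing 256*256 pixels in float16 can overflow or lose precision
    if train_or_test == 'train':
        pixels = image_arry.astype(np.float64)
        sums += pixels.mean()                              # per-image mean of x
        sums_squared += np.square(pixels).mean()           # per-image mean of x^2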
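
One thing to watch when reducing `float16` arrays: the largest finite `float16` value is 65504, while a 256 * 256 image has 65,536 pixels. A sum that is returned as `float16` can therefore overflow for very bright images, and it loses precision well before that. The sketch below uses a hypothetical all-white image to compare a `float16` sum with one accumulated in `float64` via the `dtype` argument.

```python
import numpy as np

# Hypothetical worst case: an all-white image stored as float16
bright = np.ones((256, 256), dtype=np.float16)

with np.errstate(over='ignore'):                 # silence a possible overflow warning
    naive_sum = bright.sum()                     # result is returned as float16

safe_sum = bright.sum(dtype=np.float64)          # accumulate in float64 instead
safe_mean = bright.mean(dtype=np.float64)

print('float16 max :', np.finfo(np.float16).max)
print('float16 sum :', naive_sum)
print('float64 sum :', safe_sum)
print('float64 mean:', safe_mean)
```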

In [None]:
mean = sums / split
# Var[x] = E[x^2] - (E[x])^2, so the square root must wrap the whole difference
std = np.sqrt(sums_squared / split - mean**2)
print(f'Mean of Dataset: {mean:.4f}')
print(f'STD of Dataset : {std:.4f}')
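
The standard deviation comes from the identity Var[x] = E[x^2] - (E[x])^2, and because every image has the same size, averaging the per-image means gives the same result as averaging over all pixels. Here is a minimal sketch that checks the running-sum approach against NumPy's direct `mean` and `std`. The random arrays are hypothetical stand-ins for preprocessed images. The last line shows how the two statistics could be applied for standard normalization, which is one plausible next step rather than something this notebook does yet.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical stand-ins for preprocessed training images, values in [0, 1)
images = [rng.random((256, 256)) for _ in range(5)]

sums = 0.0
sums_squared = 0.0
for arr in images:
    pixels = arr.astype(np.float64)
    sums += pixels.mean()                    # per-image mean of x
    sums_squared += np.square(pixels).mean() # per-image mean of x^2

mean = sums / len(images)
std = np.sqrt(sums_squared / len(images) - mean**2)

# Reference values computed over all pixels at once
stacked = np.stack(images)
print('running mean/std:', mean, std)
print('direct  mean/std:', stacked.mean(), stacked.std())

# Standard normalization of one image with the dataset statistics
normalized = (images[0] - mean) / std
```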

🚧 To be continued...