## Summary

I believe this notebook can serve as a comprehensive guide to data preprocessing for this competition. Here we cover:
1. Removing duplicates
2. Label formatting
3. Making stratified folds
4. Data pre-augmentation
5. Making TFRecords

This is also the first step of a **TensorFlow** and **Keras** training pipeline for the **Plant Pathology 2021** competition, optimized for maximum speed.

### Also check out:
1. The **[Training Notebook](https://www.kaggle.com/nickuzmenkov/pp2021-tpu-tf-training)** where we train the EfficientNetB4 ensemble.
2. The **[Inference Notebook](https://www.kaggle.com/nickuzmenkov/pp2021-tpu-tf-inference)**.

### Imports

In [None]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import StratifiedKFold
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
import tensorflow as tf
import albumentations
import pandas as pd
import numpy as np
import shutil
import os

### Configuration
Make changes here to customize the entire notebook.

In [None]:
plt.style.use('fivethirtyeight')
print(f'Using TensorFlow {tf.__version__}')

class CFG:
    
    '''
    keep these
    '''
    root = '../input/plant-pathology-2021-fgvc8/train_images'
    classes = [
        'complex', 
        'frog_eye_leaf_spot', 
        'powdery_mildew', 
        'rust', 
        'scab',
        'healthy']
    strategy = tf.distribute.get_strategy()
    batch_size = 16
    
    '''
    tune these
    '''
    img_size = 600 # image size
    folds = 5 # number of KFold n_splits
    seed = 42 # random seed (only for KFold)
    subfolds = 16 # number of .tfrec files in each fold
    transform = True # whether to apply pre-augmentations or not
    epochs = 5 # (>=5) number of pre-augmented dataset copies to save when transform = True


## 1. Removing duplicates
We use the `duplicates.csv` file, which contains 50 groups of duplicate images found with `image_hash` in my **[other notebook](https://www.kaggle.com/nickuzmenkov/pp2021-duplicates-revealing)**. For each group of duplicates we:
1. Leave only one sample if all duplicates share the same labels
2. Delete all duplicates if at least one of them is labeled differently
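
As a quick illustration of these two rules, here is a minimal sketch on a toy table. The image names, labels, and duplicate groups are made up for the example and do not come from the competition data.

```python
import pandas as pd

# Toy labels table; names and labels are hypothetical.
toy = pd.DataFrame(
    {'labels': ['scab', 'scab', 'rust', 'healthy', 'rust complex']},
    index=pd.Index(['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg'], name='image'))

# Each inner list is one group of images detected as duplicates of each other.
groups = [
    ['a.jpg', 'b.jpg'],  # identical labels -> keep only the first image
    ['c.jpg', 'e.jpg'],  # conflicting labels -> drop the whole group
]

for group in groups:
    if toy.loc[group, 'labels'].nunique() == 1:
        toy = toy.drop(group[1:])
    else:
        toy = toy.drop(group)

remaining = list(toy.index)
print(remaining)
```

Only `a.jpg` (the survivor of the consistent group) and `d.jpg` (never a duplicate) remain.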

In [None]:
df = pd.read_csv('../input/plant-pathology-2021-fgvc8/train.csv', index_col='image')
init_len = len(df)

with open('../input/pp2021-duplicates-revealing/duplicates.csv', 'r') as file:
    duplicates = [x.strip().split(',') for x in file.readlines()]

for row in duplicates:
    unique_labels = df.loc[row].drop_duplicates().values
    if len(unique_labels) == 1:
        df = df.drop(row[1:], axis=0)
    else:
        df = df.drop(row, axis=0)
        
print(f'Dropping {init_len - len(df)} duplicate samples.')

## 2. Label formatting
The original format, a space-separated string of labels, cannot be fed to a model directly. Here we convert it to multi-hot vectors with a `MultiLabelBinarizer` instance.

In [None]:
original_labels = df['labels'].values.copy()

df['labels'] = [x.split(' ') for x in df['labels']]
labels = MultiLabelBinarizer(classes=CFG.classes).fit_transform(df['labels'].values)

df = pd.DataFrame(columns=CFG.classes, data=labels, index=df.index)

df.to_csv('train.csv')
display(df.head())

That's better!
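
To make the conversion concrete, here is a small sketch on three made-up label strings. The class order matches `CFG.classes`; the sample strings are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

classes = ['complex', 'frog_eye_leaf_spot', 'powdery_mildew', 'rust', 'scab', 'healthy']

# Hypothetical raw labels in the competition's space-separated format.
raw = ['scab frog_eye_leaf_spot complex', 'healthy', 'rust']

# Passing `classes` fixes the column order instead of sorting alphabetically.
encoded = MultiLabelBinarizer(classes=classes).fit_transform(
    [x.split(' ') for x in raw])

for text, row in zip(raw, encoded):
    print(f'{text!r:40} -> {row.tolist()}')
```

Each row has one column per class, with a 1 for every label present in the original string.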
## 3. Making stratified folds
We have some pretty rare classes here (e.g. `rust`) which can be completely lost when doing random `KFold` splits or `train_test_split`. So, we need stratification. 

**Sklearn**'s `StratifiedKFold` supports only **multi-class** targets (and the current task is **multi-label**), so we simply treat each original label string, i.e. each unique combination of labels, as a single class when applying `StratifiedKFold`. This stratifies by label combination and thus, implicitly, by individual label.

In [None]:
kfold = StratifiedKFold(n_splits=CFG.folds, shuffle=True, random_state=CFG.seed)
fold = np.zeros((len(df),))

for i, (train_index, val_index) in enumerate(kfold.split(df.index, original_labels)):
    fold[val_index] = i

value_counts = lambda x: pd.Series.value_counts(x, normalize=True)

# share of positive samples per class, overall and in each fold
occurrence = {'origin': df.apply(value_counts).loc[1]}
for i in range(CFG.folds):
    occurrence[f'fold_{i}'] = df[fold == i].apply(value_counts).loc[1]

df_occurrence = pd.DataFrame(occurrence)

bar = df_occurrence.plot.barh(figsize=[15, 5], colormap='plasma')

folds = pd.DataFrame({
    'image': df.index,
    'fold': fold})

folds.to_csv('folds.csv', index=False)

Perfect! Label distributions are almost equal across all folds.
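
The same trick can be checked on a toy example. This is a minimal sketch with made-up label strings, chosen so that every combination divides evenly across five folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical label strings: each unique string is treated as one class.
combos = np.array(['scab'] * 10 + ['healthy'] * 10 + ['rust complex'] * 5)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold = np.zeros(len(combos), dtype=int)

# Only the targets matter for stratification, so X can be a dummy array.
for i, (_, val_index) in enumerate(kfold.split(np.zeros(len(combos)), combos)):
    fold[val_index] = i

for i in range(5):
    names, counts = np.unique(combos[fold == i], return_counts=True)
    print(i, dict(zip(names.tolist(), counts.tolist())))
```

Every fold receives one `rust complex` sample, which a plain random split would not guarantee.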
## 4. Data pre-augmentation
Fancy data augmentation is a big pain for every **TensorFlow** user: most transforms (e.g. `CutOut`, `CLAHE`, and even random rotation (!!!)) are either available only through the **Keras** `ImageDataGenerator` class, which is very slow compared to `tf.data.Dataset`, or not implemented at all.

So we have to choose between fast training on TPU with TFRecords and an optimized `tf.data.Dataset` pipeline on the one hand, and fancy augmentations on the other. You could, of course, hand-code all of them, but just look at how easy it is with the `albumentations` library:

In [None]:
if CFG.transform:
    transform = albumentations.Compose([
       albumentations.RandomResizedCrop(CFG.img_size, CFG.img_size, scale=(0.9, 1), p=1), 
       albumentations.HorizontalFlip(p=0.5),
       albumentations.VerticalFlip(p=0.5),
       albumentations.ShiftScaleRotate(p=0.5),
       albumentations.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=10, val_shift_limit=10, p=0.7),
       albumentations.RandomBrightnessContrast(brightness_limit=(-0.2,0.2), contrast_limit=(-0.2, 0.2), p=0.7),
       albumentations.CLAHE(clip_limit=(1,4), p=0.5),
       albumentations.OneOf([
           albumentations.OpticalDistortion(distort_limit=1.0),
           albumentations.GridDistortion(num_steps=5, distort_limit=1.),
           albumentations.ElasticTransform(alpha=3),
       ], p=0.2),
       albumentations.OneOf([
           albumentations.GaussNoise(var_limit=[10, 50]),
           albumentations.GaussianBlur(),
           albumentations.MotionBlur(),
           albumentations.MedianBlur(),
       ], p=0.2),
      albumentations.Resize(CFG.img_size, CFG.img_size),
      albumentations.OneOf([
          albumentations.JpegCompression(),
          albumentations.Downscale(scale_min=0.1, scale_max=0.15),
      ], p=0.2),
      albumentations.IAAPiecewiseAffine(p=0.2),
      albumentations.IAASharpen(p=0.2),
      albumentations.Cutout(max_h_size=int(CFG.img_size * 0.1), max_w_size=int(CFG.img_size * 0.1), num_holes=5, p=0.5),
    ])
else:
    transform = None

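With `CFG.transform` enabled, all the images below are actually the same image transformed by `albumentations`; otherwise the first 25 training images are shown without augmentation.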

In [None]:
figure, axes = plt.subplots(5, 5, figsize=[15, 15])
axes = axes.reshape(-1,)

if transform is None:
    for i in range(len(axes)):
        image = tf.io.read_file(os.path.join(CFG.root, df.index[i]))
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, [CFG.img_size, CFG.img_size])
        image = tf.cast(image, tf.uint8)
        
        axes[i].imshow(image.numpy())
        axes[i].axis('off')

else:
    image = tf.io.read_file(os.path.join(CFG.root, df.index[CFG.seed]))
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [CFG.img_size, CFG.img_size])
    image = tf.cast(image, tf.uint8)

    for i in range(len(axes)):
        axes[i].imshow(transform(image=image.numpy())['image'])
        axes[i].axis('off')

plt.show()

As mentioned by @calintimbus, we can still use the power of `albumentations` inside `tensorflow` pipelines via `tf.py_function(func=composed_transform, inp=[image], Tout=(tf.float32))`. But the time required for downscaling really large images and applying a relatively long list of augmentations can severely slow down the training process. If you decide to go this way, set the `transform` attribute of the `CFG` class (see the **Configuration** block at the very beginning) to `False` and apply your own augmentations on the fly. Otherwise, if you decide to save the pre-augmented dataset as a one-time payment, set it to `True`.
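
For reference, the on-the-fly route could look roughly like the sketch below. It is not part of this notebook's pipeline: `numpy_augment` is a hypothetical stand-in (a plain horizontal flip) for an `albumentations` pipeline, and `IMG_SIZE` and the random data are made up to keep the example small.

```python
import numpy as np
import tensorflow as tf

IMG_SIZE = 64  # hypothetical small size for the demo


def numpy_augment(image):
    """Stand-in for an albumentations call on a uint8 HxWx3 array."""
    # With albumentations this would be: transform(image=image)['image']
    return np.ascontiguousarray(image[:, ::-1, :])  # horizontal flip


def tf_augment(image, label):
    # Inside py_function the argument is an eager tensor, so .numpy() works.
    augmented = tf.py_function(
        func=lambda x: numpy_augment(x.numpy()), inp=[image], Tout=tf.uint8)
    # py_function loses the static shape, so restore it for later layers.
    augmented.set_shape([IMG_SIZE, IMG_SIZE, 3])
    return augmented, label


images = np.random.randint(0, 256, size=(8, IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
labels = np.zeros((8, 6), dtype=np.float32)

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.map(tf_augment).batch(4)

batch_images, batch_labels = next(iter(dataset))
print(batch_images.shape, batch_images.dtype)
```

The wrapped function runs in Python rather than in the TensorFlow graph, which is the source of the slowdown described above.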
## 5. Making TFRecords
Using `.tfrec` files instead of tensor slices can give a dramatic performance boost. Here we finish by serializing all of the work so far to the `.tfrec` format.
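
Before the full helpers, here is a minimal round-trip sketch of the format: two records with a name and a single label are written to a temporary file and parsed back. The file name, sample names, and the reduced two-field schema are assumptions made for the example.

```python
import os
import tempfile

import tensorflow as tf


def make_example(name, label):
    # One record = a dict of named features wrapped in tf.train.Example.
    feature = {
        'image_name': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[name.encode()])),
        'rust': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


path = os.path.join(tempfile.mkdtemp(), 'demo.tfrec')

with tf.io.TFRecordWriter(path) as writer:
    for name, label in [('a.jpg', 1), ('b.jpg', 0)]:
        writer.write(make_example(name, label))

# The reader needs the same feature names and types that were written.
schema = {
    'image_name': tf.io.FixedLenFeature([], tf.string),
    'rust': tf.io.FixedLenFeature([], tf.int64),
}

parsed = [tf.io.parse_single_example(raw, schema)
          for raw in tf.data.TFRecordDataset(path)]
names = [p['image_name'].numpy().decode() for p in parsed]
labels = [int(p['rust'].numpy()) for p in parsed]
print(names, labels)
```

The helpers below follow the same pattern, with the encoded JPEG bytes and all six class labels as features.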
### Helper functions (serialization)

In [None]:
def _serialize_image(path, transform=None):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, [CFG.img_size, CFG.img_size])
    image = tf.cast(image, tf.uint8)
    
    if transform is not None:
        image = transform(image=image.numpy())['image']
        
    return tf.image.encode_jpeg(image).numpy()


def _serialize_sample(image, image_name, label):
    feature = {
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image])),
        'image_name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_name])),
        'complex': tf.train.Feature(int64_list=tf.train.Int64List(value=[label[0]])),
        'frog_eye_leaf_spot': tf.train.Feature(int64_list=tf.train.Int64List(value=[label[1]])),
        'powdery_mildew': tf.train.Feature(int64_list=tf.train.Int64List(value=[label[2]])),
        'rust': tf.train.Feature(int64_list=tf.train.Int64List(value=[label[3]])),
        'scab': tf.train.Feature(int64_list=tf.train.Int64List(value=[label[4]])),
        'healthy': tf.train.Feature(int64_list=tf.train.Int64List(value=[label[5]]))}
    sample = tf.train.Example(features=tf.train.Features(feature=feature))
    return sample.SerializeToString()


def serialize_fold(fold, name, transform=None, bar=None):
    samples = []
    
    for image_name, labels in fold.iterrows():
        path = os.path.join(CFG.root, image_name)
        image = _serialize_image(path, transform=transform)
        # pass plain Python ints, indexed by position in CFG.classes order
        samples.append(_serialize_sample(image, image_name.encode(), labels.tolist()))
    
    with tf.io.TFRecordWriter(name + '.tfrec') as writer:
        for sample in samples:
            writer.write(sample)
        
    if bar is not None:
        bar.update(1)

In [None]:
total = CFG.folds * CFG.subfolds if transform is None else CFG.folds * CFG.subfolds * CFG.epochs

with tqdm(total=total) as bar:

    for i in range(CFG.folds):

        df_fold = df[fold == i]
        
        folder = f'fold_{i}'
        
        try:
            os.mkdir(folder)
        except FileExistsError:
            shutil.rmtree(folder)
            os.mkdir(folder)
        
        if transform is None:
            for k, subfold in enumerate(np.array_split(df_fold, CFG.subfolds)):
                name=os.path.join(folder, '%.2i-%.3i' % (k, len(subfold)))
                serialize_fold(subfold, name=name, bar=bar)
        else:
            for j in range(CFG.epochs):
                for k, subfold in enumerate(np.array_split(df_fold, CFG.subfolds)):
                    name=os.path.join(folder, '%.2i-%.3i' % (j * CFG.subfolds + k, len(subfold)))
                    serialize_fold(subfold, name=name, transform=transform, bar=bar)

## 6. Run test
Lastly, we test the TFRecords and the preprocessed data by training a small model on one `.tfrec` file for just one epoch. Feel free to replace this part with your own training pipeline.
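
The number of training steps for this test is recovered from the file names, which were written as `<index>-<sample count>.tfrec`. The sketch below shows that convention in isolation; `shard_name`, `count_items`, and the shard sizes are illustrative and not taken from the notebook.

```python
import os


def shard_name(folder, index, n_samples):
    # Same '%.2i-%.3i' pattern that is used when the shards are written.
    return os.path.join(folder, '%.2i-%.3i' % (index, n_samples)) + '.tfrec'


def count_items(filenames):
    # Drop the directory and extension, then read the number after the last '-'.
    stems = (os.path.splitext(os.path.basename(f))[0] for f in filenames)
    return sum(int(stem.split('-')[-1]) for stem in stems)


# Hypothetical shard sizes for one fold.
names = [shard_name('fold_0', k, n) for k, n in enumerate([233, 233, 232])]

total = count_items(names)
steps = total // 16  # batch size of 16, as in CFG.batch_size

print([os.path.basename(n) for n in names], total, steps)
```

Storing the count in the file name avoids a full pass over the records just to learn the dataset size.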
### Helper functions (parsing & testing)

In [None]:
feature_map = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'image_name': tf.io.FixedLenFeature([], tf.string),
    'complex': tf.io.FixedLenFeature([], tf.int64),
    'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
    'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
    'rust': tf.io.FixedLenFeature([], tf.int64),
    'scab': tf.io.FixedLenFeature([], tf.int64),
    'healthy': tf.io.FixedLenFeature([], tf.int64)}


def count_data_items(filenames):
    return np.sum([int(x[:-6].split('-')[-1]) for x in filenames])


def decode_image(image_data):
    image = tf.image.decode_jpeg(image_data, channels=3)
    image = tf.reshape(image, [CFG.img_size, CFG.img_size, 3])
    image = tf.cast(image, tf.float32) / 255.
    return image


def read_tfrecord(example):
    example = tf.io.parse_single_example(example, feature_map)
    image = decode_image(example['image'])
    target = [
        tf.cast(example['complex'], tf.float32),
        tf.cast(example['frog_eye_leaf_spot'], tf.float32),
        tf.cast(example['healthy'], tf.float32),
        tf.cast(example['powdery_mildew'], tf.float32),
        tf.cast(example['rust'], tf.float32),
        tf.cast(example['scab'], tf.float32)]
    return image, target


def get_dataset(filenames):
    auto = tf.data.experimental.AUTOTUNE
    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=auto)
    dataset = dataset.map(read_tfrecord, num_parallel_calls=auto)
    dataset = dataset.batch(CFG.batch_size)
    dataset = dataset.prefetch(auto)
    return CFG.strategy.experimental_distribute_dataset(dataset)


def get_model():
    model = tf.keras.models.Sequential([
        tf.keras.applications.EfficientNetB0(
            include_top=False,
            input_shape=(CFG.img_size, CFG.img_size, 3),
            weights=None,
            pooling='avg'),
        tf.keras.layers.Dense(len(CFG.classes)),  # one output per class
        tf.keras.layers.Activation('sigmoid', dtype='float32')
    ], name='EfficientNetB0')
    
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',  # multi-label targets with sigmoid outputs
        metrics=['accuracy'])

    return model

In [None]:
filenames = tf.io.gfile.glob('./fold_0/*.tfrec')[:1]
dataset = get_dataset(filenames)

steps_per_epoch = count_data_items(filenames) // CFG.batch_size

with CFG.strategy.scope():
    model = get_model()

model.summary()

history = model.fit(
    dataset, 
    steps_per_epoch=steps_per_epoch,
    epochs=1,
    verbose=2)

### Acknowledgements

* The list of `albumentations` transforms is taken from @underwearfitting's **[notebook](https://www.kaggle.com/underwearfitting/single-fold-training-of-resnet200d-lb0-965)** with only minor changes.