### Update

I am trying to add the **Mosaic** augmentation, but it is not complete yet. In `CutMix` or `MixUp` type augmentations, the class label of a mixed sample is built from a mixing weight `beta` drawn from a beta distribution (e.g. with `np.random.beta` or `scipy.stats.beta`). For two labels:


```
label = label_one*beta + (1-beta)*label_two
```
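
As a minimal sketch of the formula above, assuming one-hot labels and a weight drawn with NumPy's beta sampler (the function name `mix_labels` and the value `alpha = 1.0` are illustrative choices, not taken from this notebook):

```python
import numpy as np

def mix_labels(label_one, label_two, lam):
    """Blend two one-hot (or soft) labels with weight lam in [0, 1]."""
    return lam * label_one + (1.0 - lam) * label_two

rng = np.random.default_rng(0)
alpha = 1.0                       # assumed value; Beta(1, 1) is the uniform distribution
lam = rng.beta(alpha, alpha)      # mixing weight given to the first label

label_one = np.array([1.0, 0.0, 0.0, 0.0])
label_two = np.array([0.0, 0.0, 1.0, 0.0])
mixed = mix_labels(label_one, label_two, lam)
print(mixed, mixed.sum())
```

The mixed label stays a valid probability vector because the two weights sum to one.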

But what if we have **more than two** images? In [YOLOv4](https://arxiv.org/abs/2004.10934), the authors tried an interesting augmentation for object detection called **Mosaic Augmentation**. Unlike `CutMix` or `MixUp`, it creates each augmented sample from **4** images. In object detection, we can compute the shift of each instance's coordinates and so recover the proper ground truth ([here][2]). But for plain image classification, how can we do that efficiently? I asked this question on [SO](https://stackoverflow.com/questions/65181294/how-to-create-class-label-for-mosaic-augmentation-in-image-classification); if you know a workaround, please suggest it. :-)
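
One possible answer, offered only as an assumption and not as an established method, is to weight each of the four labels by the fraction of the mosaic area its image occupies, which is the four-image analogue of the area weighting used in `CutMix`. The helper name `mosaic_label` and the quadrant ordering below are hypothetical:

```python
import numpy as np

def mosaic_label(labels, xc, yc, dim):
    """Area-weighted soft label for a 2x2 mosaic split at (xc, yc).

    labels: array of shape (4, num_classes), ordered as
            top-left, top-right, bottom-left, bottom-right.
    """
    areas = np.array([
        xc * yc,                      # top-left
        (dim - xc) * yc,              # top-right
        xc * (dim - yc),              # bottom-left
        (dim - xc) * (dim - yc),      # bottom-right
    ], dtype=np.float32)
    weights = areas / float(dim * dim)    # the four weights sum to 1
    return weights @ np.asarray(labels, dtype=np.float32)

labels = np.eye(4, dtype=np.float32)      # four images, four distinct classes
soft = mosaic_label(labels, xc=128, yc=384, dim=512)
print(soft)
```

Area weighting ignores whether the cropped region actually contains the class-discriminative part of each image, so it should be treated as a heuristic.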

# Advanced Augmentation

Hi, this is a simple EDA and data augmentation pipeline for multi-class image classification with a custom sequence data generator in `tf.keras`. Image samples are used here. Mainly, I will try to show how to use some advanced augmentations inside a custom `tf.keras.utils.Sequence` generator. The advanced augmentations are as follows:

```
- CutMix
- MixUp
- FMix
```

The implementations of `CutMix` and `MixUp` are taken from [Chris Deotte](https://www.kaggle.com/cdeotte/cutmix-and-mixup-on-gpu-tpu) and integrated into a custom [tf.keras.utils.Sequence](https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence) generator with a few modifications. The `FMix` implementation is taken from the original source code, [here](https://github.com/ecs-vlc/FMix). 

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from glob import glob
import albumentations as A 
from pylab import rcParams
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import os, gc, cv2, random, warnings, math, sys, json, pprint

# sklearn
from sklearn.utils import class_weight
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# tf 
import tensorflow as tf
from tensorflow.keras import backend as K

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
warnings.simplefilter('ignore')

In [None]:
# helper function to plot sample 
def plot_imgs(dataset_show, row, col):
    rcParams['figure.figsize'] = 20,10
    for i in range(row):
        f, ax = plt.subplots(1,col)
        for p in range(col):
            idx = np.random.randint(0, len(dataset_show))
            img, label = dataset_show[idx]
            ax[p].grid(False)
            ax[p].imshow(img[0])
            try:
                ax[p].set_title(label[0].numpy())
            except:
                ax[p].set_title(label[0])
    plt.show()
    

def visulize(path, n_images, is_random=True, figsize=(16, 16)):
    plt.figure(figsize=figsize)
    
    w = int(n_images ** .5)
    h = math.ceil(n_images / w)
    
    image_names = os.listdir(path)
    for i in range(n_images):
        image_name = image_names[i]
        if is_random:
            image_name = random.choice(image_names)
            
        img = cv2.imread(os.path.join(path, image_name))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
        plt.subplot(h, w, i + 1)
        plt.imshow(img)
        plt.xticks([])
        plt.yticks([])
    plt.show()

In [None]:
comp_cassava = False
comp_covid19 = True

def hot_to_sparse(row):
    return(row.index[row.apply(lambda x: x==1)][0])

class BaseConfig(object):
    SEED  = 101
    if comp_cassava:
        TRAIN_DF = '../input/cassava-leaf-disease-classification/train.csv'
        TRAIN_IMG_PATH = '../input/cassava-leaf-disease-classification/train_images/'
        TEST_IMG_PATH  = '../input/cassava-leaf-disease-classification/test_images/'
        CLASS_MAP  = '../input/cassava-leaf-disease-classification/label_num_to_disease_map.json'
        NUM_CLASSES = 5
    elif comp_covid19:
        TRAIN_IMG_PATH = '../input/covid19-detection-890pxpng-study/train/'
        NUM_CLASSES = 4
        study_df = pd.read_csv('../input/siim-covid19-detection/train_study_level.csv'); print(study_df.shape)
        study_df['StudyInstanceUID'] = study_df['id'].apply(lambda x: x.replace('_study', ''))
        del study_df['id']

        study_df['diagnosis'] = study_df.apply(lambda row:hot_to_sparse(row), axis=1)
        cls = {
            'Typical Appearance':1,                    
            'Negative for Pneumonia':2,                
            'Indeterminate Appearance':3,                     
            'Atypical Appearance':4,    
        }
        study_df['sparse_gt'] = study_df.diagnosis.map(cls) 

        image_df = pd.read_csv('../input/siim-covid19-detection/train_image_level.csv'); print(image_df.shape)
        df = image_df.merge(study_df, on='StudyInstanceUID')
        df['id'] = df['id'].apply(lambda x: x.replace('_image', ''))
        display(df.head()); print(df.shape)

**Overview**

In [None]:
if comp_cassava:
    df = pd.read_csv(BaseConfig.TRAIN_DF)
    assert df.shape[0] == len(df.image_id.unique()) , "NOT ALL ID UNIQUE"
    print(df.info())
    df.head()

**Significant Class Imbalance**

In [None]:
if comp_cassava:
    with open(os.path.join(BaseConfig.CLASS_MAP)) as file:
        pprint.pprint(json.loads(file.read()))

In [None]:
if comp_cassava:
    temp_df = df.copy()
    temp_df[['CBB', 'CBSD', 'CGM', 'CMD', 'Healthy']] = pd.get_dummies(temp_df["label"])

    fig = go.Figure(data=[go.Pie(labels=temp_df.columns[2:],values=temp_df.iloc[:, 2:].sum().values)])
    fig.show()

    del temp_df

**Displaying Samples**

In [None]:
visulize(BaseConfig.TRAIN_IMG_PATH, 9, is_random=True)

# Augmentation

The `albumentations` library is used here primarily for resizing and normalization. 

In [None]:
# For Training 
def albu_transforms_train(data_resize): 
    return A.Compose([
            A.ToFloat(),
            A.Resize(data_resize, data_resize),
        ], p=1.)

# For Validation 
def albu_transforms_valid(data_resize): 
    return A.Compose([
            A.ToFloat(),
            A.Resize(data_resize, data_resize),
        ], p=1.)

**CutMix** Augmentation

In [None]:
def CutMix(image, label, DIM, PROBABILITY = 1.0):
    # input image - is a batch of images of size [n,dim,dim,3] not a single image of [dim,dim,3]
    # output - a batch of images with cutmix applied
    CLASSES = BaseConfig.NUM_CLASSES
    
    imgs = []; labs = []
    for j in range(len(image)):
        # DO CUTMIX WITH PROBABILITY DEFINED ABOVE
        P = tf.cast( tf.random.uniform([],0,1)<=PROBABILITY, tf.int32)
        
        # CHOOSE RANDOM IMAGE TO CUTMIX WITH
        k = tf.cast( tf.random.uniform([],0,len(image)),tf.int32)
        
        # CHOOSE RANDOM LOCATION
        x = tf.cast( tf.random.uniform([],0,DIM),tf.int32)
        y = tf.cast( tf.random.uniform([],0,DIM),tf.int32)
        
        b = tf.random.uniform([],0,1) # this is beta dist with alpha=1.0
        
        WIDTH = tf.cast( DIM * tf.math.sqrt(1-b),tf.int32) * P
        ya = tf.math.maximum(0,y-WIDTH//2)
        yb = tf.math.minimum(DIM,y+WIDTH//2)
        xa = tf.math.maximum(0,x-WIDTH//2)
        xb = tf.math.minimum(DIM,x+WIDTH//2)
        
        # MAKE CUTMIX IMAGE
        one = image[j,ya:yb,0:xa,:]
        two = image[k,ya:yb,xa:xb,:]
        three = image[j,ya:yb,xb:DIM,:]
        middle = tf.concat([one,two,three],axis=1)
        img = tf.concat([image[j,0:ya,:,:],middle,image[j,yb:DIM,:,:]],axis=0)
        imgs.append(img)
        
        # MAKE CUTMIX LABEL
        a = tf.cast(WIDTH*WIDTH/DIM/DIM,tf.float32)
        labs.append((1-a)*label[j] + a*label[k])
            
    # RESHAPE HACK SO TPU COMPILER KNOWS SHAPE OF OUTPUT TENSOR (maybe use Python typing instead?)
    image2 = tf.reshape(tf.stack(imgs),(len(image),DIM,DIM,3))
    label2 = tf.reshape(tf.stack(labs),(len(image),CLASSES))
    
    return image2,label2
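
The label weight in `CutMix` above is computed from the nominal `WIDTH`, before the box is clipped to the image borders. When the patch centre lies near an edge, the area actually pasted is smaller than that weight suggests. The sketch below, in plain NumPy-style Python with a hypothetical helper `cutmix_box`, compares the two quantities for one such case:

```python
import numpy as np

def cutmix_box(dim, x, y, width):
    """Clip a square patch centred at (x, y) to the image, as CutMix does."""
    ya, yb = max(0, y - width // 2), min(dim, y + width // 2)
    xa, xb = max(0, x - width // 2), min(dim, x + width // 2)
    return ya, yb, xa, xb

dim, width = 320, 160
x, y = 10, 10                                 # centre near the top-left corner
ya, yb, xa, xb = cutmix_box(dim, x, y, width)

nominal = width * width / dim / dim           # weight computed before clipping
actual = (yb - ya) * (xb - xa) / dim / dim    # fraction of pixels really replaced
print(nominal, actual)
```

In this example the label gives weight 0.25 to the pasted image although only about 8% of the pixels come from it. Computing the weight from the clipped box is one way to keep image and label consistent.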

**MixUp** Augmentation

In [None]:
def MixUp(image, label, DIM, PROBABILITY = 1.0):
    # input image - is a batch of images of size [n,dim,dim,3] not a single image of [dim,dim,3]
    # output - a batch of images with mixup applied
    CLASSES = BaseConfig.NUM_CLASSES
    
    imgs = []; labs = []
    for j in range(len(image)):
        # DO MIXUP WITH PROBABILITY DEFINED ABOVE
        P = tf.cast( tf.random.uniform([],0,1)<=PROBABILITY, tf.float32)
                   
        # CHOOSE RANDOM
        k = tf.cast( tf.random.uniform([],0,len(image)),tf.int32)
        a = tf.random.uniform([],0,1)*P # this is beta dist with alpha=1.0
                    
        # MAKE MIXUP IMAGE
        img1 = image[j,]
        img2 = image[k,]
        imgs.append((1-a)*img1 + a*img2)
                    
        # MAKE MIXUP LABEL
        labs.append((1-a)*label[j] + a*label[k])
            
    # RESHAPE HACK SO TPU COMPILER KNOWS SHAPE OF OUTPUT TENSOR (maybe use Python typing instead?)
    image2 = tf.reshape(tf.stack(imgs),(len(image),DIM,DIM,3))
    label2 = tf.reshape(tf.stack(labs),(len(image),CLASSES))
    return image2,label2

**FMix** Augmentation

In [None]:
sys.path.insert(0, "/kaggle/input/pyutils")
from fmix_utils import sample_mask

def FMix(image, label, DIM,  alpha=1, decay_power=3, max_soft=0.0, reformulate=False):
    lam, mask = sample_mask(alpha, decay_power,(DIM, DIM), max_soft, reformulate)
    index = tf.constant(np.random.permutation(int(image.shape[0])))
    mask  = np.expand_dims(mask, -1)
    
    # samples 
    image1 = image * mask
    image2 = tf.gather(image, index) * (1 - mask)
    image3 = image1 + image2

    # labels
    label1 = label * lam 
    label2 = tf.gather(label, index) * (1 - lam)
    label3 = label1 + label2 
    return image3, label3

**Mosaic** Augmentation

In [None]:
def MosaicMix(image, label, DIM, minfrac=0.25, maxfrac=0.75):
    # input image - a batch of images of size [n,dim,dim,3], with n >= 3
    # output - a batch of mosaic images; labels are returned unchanged (WIP)
    xc, yc  = np.random.randint(int(DIM * minfrac), int(DIM * maxfrac), (2,))
    indices = np.random.permutation(int(image.shape[0]))
    final_imgs = []
    
    # Iterate over the full indices 
    for j in range(len(indices)): 
        # Take sample j plus 3 randomly chosen samples to create a mosaic 
        rand4indices = [j] + random.sample(list(indices), 3) 
        
        # Fresh canvas per mosaic, so appended samples do not share one buffer
        mosaic_image = np.zeros((DIM, DIM, 3), dtype=np.float32)
        
        # Make mosaic with 4 samples 
        for i, src in enumerate(rand4indices):
            if i == 0:    # top left
                x1a, y1a, x2a, y2a =  0,  0, xc, yc
                x1b, y1b, x2b, y2b = DIM - xc, DIM - yc, DIM, DIM # from bottom right        
            elif i == 1:  # top right
                x1a, y1a, x2a, y2a = xc, 0, DIM , yc
                x1b, y1b, x2b, y2b = 0, DIM - yc, DIM - xc, DIM # from bottom left
            elif i == 2:  # bottom left
                x1a, y1a, x2a, y2a = 0, yc, xc, DIM
                x1b, y1b, x2b, y2b = DIM - xc, 0, DIM, DIM-yc   # from top right
            elif i == 3:  # bottom right
                x1a, y1a, x2a, y2a = xc, yc,  DIM, DIM
                x1b, y1b, x2b, y2b = 0, 0, DIM-xc, DIM-yc    # from top left
                
            # Copy-Paste from the sampled image (src), not from batch position i
            mosaic_image[y1a:y2a, x1a:x2a] = image[src][y1b:y2b, x1b:x2b]
                   
        # Append the Mosaic sample
        final_imgs.append(mosaic_image)
 
    # Stack into a single [n,dim,dim,3] array, like the other augmentations
    return np.stack(final_imgs), label

# Custom Sequence Data Generator

In [None]:
class SequenceGenerator(tf.keras.utils.Sequence):
    def __init__(self, img_path, data, batch_size, 
                 dim, shuffle=True, transform=None, 
                 use_mixup=False, use_cutmix=False,
                 use_fmix=False, use_mosaicmix=False):
        self.dim  = dim
        self.data = data
        self.shuffle  = shuffle
        self.img_path = img_path
        self.augment  = transform
        self.use_cutmix = use_cutmix
        self.use_mixup  = use_mixup
        self.use_fmix   = use_fmix 
        self.use_mosaicmix = use_mosaicmix
        self.batch_size = batch_size
        self.list_idx   = self.data.index.values
        if comp_cassava:
            self.label = pd.get_dummies(self.data['label'], columns = ['label'])
        elif comp_covid19:
            self.label = pd.get_dummies(self.data['sparse_gt'], columns = ['label'])
        self.on_epoch_end()
        
    def __len__(self):
        return int(np.ceil(float(len(self.data)) / float(self.batch_size)))
    
    def __getitem__(self, index):
        batch_idx = self.indices[index*self.batch_size:(index+1)*self.batch_size]
        idx = [self.list_idx[k] for k in batch_idx]
        
        # size the buffers by the actual batch length; the last batch can be shorter
        Data   = np.empty((len(idx), *self.dim))
        Target = np.empty((len(idx), BaseConfig.NUM_CLASSES), dtype = np.float32)

        for i, k in enumerate(idx):
            # load the image file using cv2
            if comp_cassava:
                image = cv2.imread(self.img_path + self.data['image_id'][k])
            elif comp_covid19:
                image = cv2.imread(self.img_path + self.data['id'][k] + '.png')
            image = cv2.cvtColor(image,cv2.COLOR_BGR2RGB)
            
            res = self.augment(image=image)
            image = res['image']
            
            # assign 
            Data[i,] =  image
            Target[i,] = self.label.loc[k].values  # k is an index label, so use .loc
                
        # cutmix 
        if self.use_cutmix:
            Data, Target = CutMix(Data, Target, self.dim[0])
            
        # mixup 
        if self.use_mixup:
            Data, Target = MixUp(Data, Target, self.dim[0]) 
            
        # fmix 
        if self.use_fmix:
            Data, Target = FMix(Data, Target, self.dim[0])
            
        if self.use_mosaicmix:
            Data, Target = MosaicMix(Data, Target, self.dim[0]) 

        return Data, Target 
    
    def on_epoch_end(self):
        self.indices = np.arange(len(self.list_idx))
        if self.shuffle:
            np.random.shuffle(self.indices)
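
The batching logic in `__len__` and `__getitem__` can be checked in isolation. The sketch below is a dependency-free stand-in (the class `ToyBatches` is hypothetical and does not subclass `tf.keras.utils.Sequence`); it shows that rounding the batch count up makes the final batch shorter than `batch_size` whenever the dataset size is not a multiple of it:

```python
import numpy as np

class ToyBatches:
    """Dependency-free stand-in that mirrors the generator's batching logic."""
    def __init__(self, n_samples, batch_size):
        self.indices = np.arange(n_samples)
        self.batch_size = batch_size

    def __len__(self):
        # ceil so the leftover samples form one extra, shorter batch
        return int(np.ceil(len(self.indices) / self.batch_size))

    def __getitem__(self, index):
        start = index * self.batch_size
        return self.indices[start:start + self.batch_size]

batches = ToyBatches(n_samples=50, batch_size=20)
sizes = [len(batches[i]) for i in range(len(batches))]
print(sizes)
```

Any array allocated per batch should therefore be sized by the number of indices actually drawn, not by `batch_size`.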

## <font color = "seagreen">CutMix Visualization</font>

[paper-work](https://arxiv.org/abs/1905.04899)

In [None]:
check_gens = SequenceGenerator(BaseConfig.TRAIN_IMG_PATH, BaseConfig.df, 20, 
                              (320, 320, 3),shuffle = True, 
                              use_mixup = False, use_cutmix = True, 
                              use_fmix = False, transform = albu_transforms_train(320))

plot_imgs(check_gens, row=5, col=3)

## <font color = "seagreen">MixUp Visualization</font>

[paper-work](https://arxiv.org/abs/1710.09412)

In [None]:
check_gens = SequenceGenerator(BaseConfig.TRAIN_IMG_PATH, BaseConfig.df, 20, 
                              (320, 320, 3),shuffle = True, 
                              use_mixup = True, use_cutmix = False, 
                              use_fmix = False, transform = albu_transforms_train(320))

plot_imgs(check_gens, row=5, col=3)

## <font color = "seagreen">FMix Visualization</font>

[paper-work](https://arxiv.org/abs/2002.12047)

In [None]:
check_gens = SequenceGenerator(BaseConfig.TRAIN_IMG_PATH, BaseConfig.df, 20, 
                              (420, 420, 3),shuffle = True, 
                              use_mixup = False, use_cutmix = False, 
                              use_fmix = True, transform = albu_transforms_train(420))

plot_imgs(check_gens, row=5, col=3)

## <font color = "seagreen">Mosaic Visualization [WIP]</font>

[paper-work](https://arxiv.org/abs/2004.10934)

In [None]:
check_gens = SequenceGenerator(BaseConfig.TRAIN_IMG_PATH, BaseConfig.df, 20, 
                              (512, 512, 3),shuffle = True, 
                              use_mixup = False, use_cutmix = False, 
                              use_fmix = False, use_mosaicmix=True, transform = albu_transforms_train(512))

plot_imgs(check_gens, row=7, col=3)

**Please note**, if you want to use these augmentations in your own data generator, you probably need to choose randomly which augmentation is applied to each batch. This notebook is for **demonstration purposes** only. Also note that, for the best input data pipelines, the `tf.data` API is highly recommended. 
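
A minimal sketch of picking one augmentation per batch at random is given below. The two augmentation functions are placeholders standing in for `CutMix`, `MixUp`, or `FMix`, and the helper name `apply_random_augmentation` and the probabilities are assumptions made for illustration:

```python
import numpy as np

def identity(data, target):
    """Placeholder for 'no augmentation'."""
    return data, target

def flip_lr(data, target):
    """Placeholder standing in for CutMix / MixUp / FMix."""
    return data[:, :, ::-1, :], target

def apply_random_augmentation(data, target, augmentations, probs, rng):
    """Pick one augmentation for the whole batch according to probs."""
    choice = rng.choice(len(augmentations), p=probs)
    new_data, new_target = augmentations[choice](data, target)
    return new_data, new_target, choice

rng = np.random.default_rng(42)
data = rng.random((4, 8, 8, 3)).astype(np.float32)   # toy batch of 4 images
target = np.eye(4, dtype=np.float32)                 # toy one-hot labels

out, lab, choice = apply_random_augmentation(
    data, target, [identity, flip_lr], [0.5, 0.5], rng)
print(choice, out.shape, lab.shape)
```

In the generator, a call like this could replace the chain of `if self.use_...` checks in `__getitem__`, so that each batch receives at most one mixing augmentation.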