# Oversampling for Multi-Label Classification with fastai

This is a simplified approach to oversampling for multilabel classification, inspired by @iafoss and his [notebook on the previous HPA challenge](https://www.kaggle.com/iafoss/pretrained-resnet34-with-rgby-0-460-public-lb).

To summarize, I will:
- count the number of examples for each class
- calculate the oversampling ratio to match the average number of examples per class
- optionally set the oversampling ratio for each class manually (in case I prefer that to the calculated ratio)
- repeat each example in my training dataframe as many times as the oversampling ratio I determined above

From then on, I will follow the regular model training approach. 
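The four steps listed above can be sketched on a toy list of label strings before touching the real data. This is a minimal illustration in plain Python, not the notebook's actual code: `toy_labels` and the other names are made up for the example, and the strings only mimic the `|`-separated format of the `Label` column.

```python
from collections import Counter

# Hypothetical toy data: one '|'-separated label string per example.
toy_labels = ['0', '0', '0|1', '0', '1', '0|2', '0', '0']

# 1. Count the number of examples for each class.
class_counts = Counter(lbl for row in toy_labels for lbl in row.split('|'))
# class 0: 7 examples, class 1: 2 examples, class 2: 1 example

# 2. The target is the (integer) average number of examples per class.
avg = sum(class_counts.values()) // len(class_counts)   # 10 // 3 == 3

# 3. Integer ratio per class: 1 for frequent classes, about avg / count otherwise.
ratios = {lbl: 1 if n >= avg else round(avg / n) for lbl, n in class_counts.items()}

# 4. Repeat each example by the highest ratio among its classes.
oversampled = [row for row in toy_labels
               for _ in range(max(ratios[l] for l in row.split('|')))]

new_counts = Counter(lbl for row in oversampled for lbl in row.split('|'))
```

Note that in this toy run class `0` grows from 7 to 10 examples even though its own ratio is 1, because it co-occurs with the rare classes. With multi-label data the counts after oversampling only approximate the target.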

In [None]:
!pip install iterative_stratification -q

from fastai.vision.all import *
import numpy as np
import pandas as pd
import torch
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import warnings
warnings.filterwarnings('ignore')
torch.set_printoptions(precision=3, sci_mode=False)

sample_size = 1
seed = 42
stats = ([0.07237246, 0.04476176, 0.07661699], [0.17179589, 0.10284516, 0.14199627])
item_tfms = RandomResizedCrop(448, min_scale=0.75, ratio=(1.,1.))
batch_tfms = [*aug_transforms(flip_vert=True, max_warp=0), Normalize.from_stats(*stats)]
bs = 32
lr = 3e-2
epochs = 2
cbs = None

df = pd.read_csv('../input/hpa-single-cell-image-classification/train.csv')
path = Path('../input/hpa-512x512-jpg-images-dataset/512x512jpgs')

labels = [str(i) for i in range(19)]
for x in labels: df[x] = df['Label'].apply(lambda r: int(x in r.split('|')))

dfs = df.sample(frac=sample_size, random_state=seed).reset_index(drop=True)
y = dfs[labels].values
X = dfs['ID'].values
dfs['fold'] = np.nan

mskf = MultilabelStratifiedKFold(n_splits=5)
for i, (_, test_index) in enumerate(mskf.split(X, y)):
    dfs.iloc[test_index, -1] = i
   
dfs['fold'] = dfs['fold'].astype('int')
# mark fold 0 as the validation set (avoids chained assignment, which may silently fail)
dfs['is_valid'] = dfs['fold'] == 0

def get_x(r): return path/f'{r["ID"]}.jpg'
def get_y(r): return list(set(r['Label'].split('|')))

In the snippet below, I count the number of examples for each class in my dataset and compute the average count per class, which is stored alongside the classes under the key `avg`.

In [None]:
full_counts = {}
fsum = 0
for lbl in labels:
    count = 0
    for row_label in dfs['Label']:
        if lbl in row_label.split('|'): count += 1
    full_counts[lbl] = count
    fsum += count
full_counts['avg'] = int(fsum/(len(labels)))

# sort (label, count) pairs by descending count; counts stay integers instead of strings
counts = sorted(full_counts.items(), key=lambda x: -x[1])
counts = pd.DataFrame(counts, columns=['label', 'full_count'])
counts.set_index('label', inplace=True)

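Here I define a function that returns an integer oversampling ratio for a class, so that its count multiplied by the ratio lands close to the average. Classes at or above the average keep a ratio of 1. I apply that function to my `counts` dataframe, and I also show how to set the ratios manually: the manual list needs one entry per row of `counts` (the 19 classes plus the `avg` row), in the same descending-count order.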

In [None]:
def set_sample_ratio(x):
    avg = int(counts['full_count'].loc['avg'])
    x = int(x)
    if x >= avg: return 1
    else: return int(np.round(avg / x))

counts['calculated_oversampling_ratio'] = counts['full_count'].apply(set_sample_ratio)
counts['manual_oversapling_ratio'] = [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  2,  2,  3, 4, 8, 16]
counts.T
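To make the arithmetic concrete, here is a small worked example of the same rule with the average passed in as an argument. The function name `ratio_for` and the counts are invented for illustration; they are not values from the HPA dataset.

```python
def ratio_for(count, avg):
    """Integer number of copies so that count * ratio lands near avg."""
    if count >= avg:
        return 1          # frequent classes are left alone
    return int(round(avg / count))

avg = 1000  # hypothetical average count per class
for count in (5000, 450, 130, 9):
    r = ratio_for(count, avg)
    print(f'count={count:5d}  ratio={r:3d}  after oversampling={count * r}')
```

With these numbers the ratios come out as 1, 2, 8, and 111. A ratio like 111 means the same 9 examples are repeated over and over, which is one reason capping the ratios manually may be preferable for very rare classes. A class with a count of 0 would raise a division error, so this sketch assumes every class appears at least once.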

Here I define a function that, given an example, checks which classes it belongs to and returns the highest ratio among them. Then, we split our dataset into train and valid, iterate through the train rows, and copy each row as many times as this function says. The valid rows are left untouched, so validation metrics are not affected by the duplicates. Finally, we concatenate valid and the oversampled train set again so that we can pass the result to the fastai `DataBlock`.

In [None]:
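def get_sample_ratio(row):
    # `row` is a plain list built from df_train.values; row[1] is expected to be the `Label` column
    ratio = 1
    row_labels = row[1].split('|')
    for l in row_labels:
        # look up the manual ratio of each class and keep the largest one
        r = counts.manual_oversapling_ratio.loc[l]
        if r > ratio: ratio = r
    return ratio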

df_valid = dfs[dfs['is_valid']]
df_train = dfs[~dfs['is_valid']]

rows = df_train.values.tolist()
print(len(rows))
oversampled_rows = [row for row in rows for _ in range(get_sample_ratio(row))]
print(len(oversampled_rows))

df_train_oversampled = pd.DataFrame(oversampled_rows, columns=df_train.columns)

dfs = pd.concat([df_valid, df_train_oversampled], ignore_index=True)
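The row copying above can also be expressed without converting the dataframe to lists, by repeating index entries with `Index.repeat`. The sketch below runs on a made-up three-row frame with hypothetical per-class ratios (`toy_train`, `toy_ratio`, and `per_row` are names invented for the example), so it illustrates the idiom rather than replacing the cell above.

```python
import pandas as pd

# Hypothetical toy training frame and per-class ratios.
toy_train = pd.DataFrame({'ID': ['a', 'b', 'c'], 'Label': ['0', '0|2', '1']})
toy_ratio = {'0': 1, '1': 2, '2': 3}

# Highest ratio among the classes of each row.
per_row = toy_train['Label'].apply(
    lambda s: max(toy_ratio[l] for l in s.split('|')))

# Repeat each index entry `per_row` times, then select those rows.
toy_oversampled = (toy_train
                   .loc[toy_train.index.repeat(per_row)]
                   .reset_index(drop=True))
```

Here row `a` is kept once, row `b` appears three times (because of class `2`), and row `c` appears twice. Accessing the column by name also avoids relying on `Label` being at a fixed position in the row.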

From here on, we can train as usual. 

In [None]:
dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock(vocab=labels)),
                    splitter=ColSplitter(col='is_valid'),
                    get_x=get_x,
                    get_y=get_y,
                    item_tfms=item_tfms,
                    batch_tfms=batch_tfms
                    )
dls = dblock.dataloaders(dfs, bs=bs)

learn = cnn_learner(dls, resnet18, metrics=[accuracy_multi, APScoreMulti()]).to_fp16()
learn.fine_tune(epochs, base_lr=lr, cbs=cbs)

# End.