# Overview
This notebook builds on [Darek Kłeczek](https://www.kaggle.com/thedrcat)'s prototyping [solution](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/221550) with [Fastai](https://docs.fast.ai/). I combined that strategy with Darek's [datasets](https://www.kaggle.com/thedrcat/datasets) and added a training loop over multiple folds. I also provide a small framework for training your models on Kaggle, Colab, or a local setup; in my case, a Windows Dell laptop with a 6GB NVIDIA GPU.

I also modified the original logic so that different folds can be evaluated.

Suggestions and enhancements are welcome. I hope you find this useful.

In [None]:
import pandas as pd
import numpy as np
from fastai.vision.all import *
import pickle
import os

In [None]:
from datetime import datetime
from pytz import timezone
tz = timezone("US/Eastern")
print(datetime.now(tz).strftime('%y%m%d-%H:%M:%S:'))

pd.set_option('display.max_columns', None)

Versions:  
> 210402-11:03:12: initial shared non-run version

In [None]:
import os
from pathlib import Path

if 'COLAB_GPU' in os.environ:
    ENV = 'COLAB'
    from google.colab import drive
    drive.mount('/content/drive')
elif 'KAGGLE_KERNEL_INTEGRATIONS' in os.environ:
    ENV = 'KAGGLE'
elif 'GNOME_SHELL_SESSION_MODE' in os.environ:
    ENV = 'LINUX'
elif os.environ.get('OS') == 'Windows_NT':  # .get() avoids a KeyError where OS is not defined
    ENV = 'WIN'
else:
    ENV = 'UNDEFINED'
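
The same checks can also be wrapped in a function that takes the environment mapping as an argument, which makes the detection easy to try out without touching the real environment. This is only a sketch: `detect_env` is a hypothetical helper that is not part of the notebook, and the variable names it looks for are simply the ones used in the cell above.

```python
import os

def detect_env(environ=os.environ):
    """Return 'COLAB', 'KAGGLE', 'LINUX', 'WIN' or 'UNDEFINED' for a mapping of env vars."""
    if 'COLAB_GPU' in environ:
        return 'COLAB'
    if 'KAGGLE_KERNEL_INTEGRATIONS' in environ:
        return 'KAGGLE'
    if 'GNOME_SHELL_SESSION_MODE' in environ:
        return 'LINUX'
    # .get() returns None instead of raising when OS is not set
    if environ.get('OS') == 'Windows_NT':
        return 'WIN'
    return 'UNDEFINED'

print(detect_env({'OS': 'Windows_NT'}))
print(detect_env({}))
```

Passing a plain dict, as in the two calls above, exercises each branch without changing `os.environ`.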

In [None]:
if ENV == 'KAGGLE': 
    !pip install -q /kaggle/input/iterative-stratification/iterative-stratification-master/
    # Making pretrained weights work without needing to find the default filename
    if not os.path.exists('/root/.cache/torch/hub/checkpoints/'):
            os.makedirs('/root/.cache/torch/hub/checkpoints/')
    !cp '../input/resnet50/resnet50.pth' '/root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth'
    # !cp '../input/resnet34/resnet34.pth' '/root/.cache/torch/hub/checkpoints/resnet34-333f7ec4.pth'
    
    train_path = Path('../input/hpa-cell-tiles-sample-balanced-dataset')
    test_path = Path('../input/hpa-cell-tiles-test-with-enc-dataset')
    sub_path = Path('../input/hpa-single-cell-image-classification')
    gen_path = Path('/kaggle/working')
    model_path = Path('/kaggle/working')
    
    bs=256
    TRAIN_FRAC = .1 # Set to 1 to train on the whole training sample
    
elif ENV == 'COLAB':
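    import fastai
    from packaging import version  # plain string comparison mis-orders versions such as '1.10.0' and '1.7.1'

    if version.parse(torch.__version__) < version.parse('1.7.1'):
        !pip uninstall torch torchvision torchaudio torchtext -y
        !pip install torch torchvision #torchaudio
        #!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html

    if version.parse(fastai.__version__) < version.parse('2.3.0'):
        !pip uninstall fastai -y
        !pip install fastai -q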

    if not (os.path.isdir("/content/bal_ds")):
        print("extracting balanced-dataset files")
        !unzip /content/drive/MyDrive/data/hpa-cell-tiles-sample-balanced-dataset.zip -d /content/bal_ds
    
    if not (os.path.isdir("/content/enc_ds")):
        print("extracting enc-dataset files")
        !unzip /content/drive/MyDrive/data/hpa-cell-tiles-test-with-enc-dataset.zip -d /content/enc_ds

    train_path = Path('/content/bal_ds')
    test_path = Path('/content/enc_ds')
    sub_path = Path('/content/drive/MyDrive/data')
    gen_path = Path('/content/drive/MyDrive/data')
    model_path = Path('/content/drive/MyDrive/data')
    
    bs=256 #14min/epoch on 100% train
    TRAIN_FRAC = .1

elif ENV == 'WIN':
    train_path = Path('D:/data/hpa-2021/hpa-cell-tiles-sample-balanced-dataset')
    test_path = Path('D:/data/hpa-2021/hpa-cell-tiles-test-with-enc-dataset')
    sub_path = Path('D:/data/hpa-2021')
    gen_path = Path('D:/data/hpa-2021/generated')
    model_path = Path('D:/data/hpa-2021/generated')
    
    bs=256 #40min/epoch at 100% train
    TRAIN_FRAC = .05 
    
else: ENV = 'UNDEFINED'

from fastai.vision.all import *
import fastai
print(f"fast.ai version = {fastai.__version__}")
print(f'Environment is {ENV}')

In [None]:
path = train_path
df = pd.read_csv(path/'cell_df.csv')

In [None]:
labels = [str(i) for i in range(19)]
for x in labels: df[x] = df['image_labels'].apply(lambda r: int(x in r.split('|')))
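
To make the encoding step concrete, here is a minimal pure-Python sketch of the same multi-hot logic on a single `image_labels` string. `multi_hot` is a hypothetical helper, not part of the notebook. It also shows why the string is split on `|` first: a plain substring test would match `'1'` inside `'11'`.

```python
def multi_hot(image_labels, n_classes=19):
    """Turn a pipe-separated label string such as '11|16' into a 0/1 list."""
    present = set(image_labels.split('|'))
    return [int(str(i) in present) for i in range(n_classes)]

row = multi_hot('11|16')
print(row)

# Naive substring test for comparison: it also flags classes 1 and 6
naive = [int(str(i) in '11|16') for i in range(19)]
print(naive)
```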

In [None]:
dfs = df.sample(frac=TRAIN_FRAC, random_state=42)
dfs = dfs.reset_index(drop=True)
len(dfs)

| ID | Name                      | ID | Name                   | ID | Name                                     |
|----|:--------------------------|----|:-----------------------|----|:-----------------------------------------|
| 0  | Nucleoplasm               | 6  | Endoplasmic reticulum  | 12 | Centrosome                               |
| 1  | Nuclear membrane          | 7  | Golgi apparatus        | 13 | Plasma membrane                          |
| 2  | Nucleoli                  | 8  | Intermediate filaments | 14 | Mitochondria                             |
| 3  | Nucleoli fibrillar center | 9  | Actin filaments        | 15 | Aggresome                                |
| 4  | Nuclear speckles          | 10 | Microtubules           | 16 | Cytosol                                  |
| 5  | Nuclear bodies            | 11 | Mitotic spindle        | 17 | Vesicles and punctate cytosolic patterns |
|    |                           |    |                        | 18 | Negative                                 |
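
For reading predictions and counts later on, it can help to have the table above as a lookup. The following is a small sketch; `LABEL_NAMES` and `decode_labels` are hypothetical names introduced here for illustration, with the IDs and names copied from the table.

```python
LABEL_NAMES = {
    0: 'Nucleoplasm', 1: 'Nuclear membrane', 2: 'Nucleoli',
    3: 'Nucleoli fibrillar center', 4: 'Nuclear speckles', 5: 'Nuclear bodies',
    6: 'Endoplasmic reticulum', 7: 'Golgi apparatus', 8: 'Intermediate filaments',
    9: 'Actin filaments', 10: 'Microtubules', 11: 'Mitotic spindle',
    12: 'Centrosome', 13: 'Plasma membrane', 14: 'Mitochondria',
    15: 'Aggresome', 16: 'Cytosol',
    17: 'Vesicles and punctate cytosolic patterns', 18: 'Negative',
}

def decode_labels(image_labels):
    """Map a pipe-separated label string such as '0|16' to class names."""
    return [LABEL_NAMES[int(i)] for i in image_labels.split('|')]

print(decode_labels('0|16'))
```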

In [None]:
unique_counts = {}
for lbl in labels:
    unique_counts[lbl] = len(dfs[dfs.image_labels == lbl])

full_counts = {}
for lbl in labels:
    count = 0
    for row_label in dfs['image_labels']:
        if lbl in row_label.split('|'): count += 1
    full_counts[lbl] = count
    
counts = list(zip(full_counts.keys(), full_counts.values(), unique_counts.values()))
counts = np.array(sorted(counts, key=lambda x:-x[1]))
counts = pd.DataFrame(counts, columns=['label', 'full_count', 'unique_count'])
counts.set_index('label').T

In [None]:
len(dfs)

In [None]:
nfold = 5
seed = 42

y = dfs[labels].values
X = dfs[['image_id', 'cell_id']].values

dfs['fold'] = np.nan

#from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
try:
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
except ImportError:
    !pip install git+https://github.com/trent-b/iterative-stratification.git
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

# random_state only has an effect with shuffle=True; recent scikit-learn versions raise an error otherwise
mskf = MultilabelStratifiedKFold(n_splits=nfold, random_state=seed, shuffle=True)
fold_col = dfs.columns.get_loc('fold')  # address the column by name rather than by position -1
for i, (_, test_index) in enumerate(mskf.split(X, y)):
    dfs.iloc[test_index, fold_col] = i
    
dfs['fold'] = dfs['fold'].astype('int')
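
A quick way to sanity-check a stratified split is to count how often each label occurs in each fold; the counts should be roughly equal across folds. Below is a minimal sketch on toy data. `fold_label_counts` and the toy rows are hypothetical and only illustrate the check.

```python
from collections import Counter, defaultdict

def fold_label_counts(rows):
    """rows: iterable of (fold, image_labels) pairs -> {fold: Counter(label -> count)}."""
    counts = defaultdict(Counter)
    for fold, image_labels in rows:
        counts[fold].update(image_labels.split('|'))
    return dict(counts)

# Toy data: (fold, image_labels)
toy = [(0, '0|16'), (0, '14'), (1, '0'), (1, '14|16')]
counts = fold_label_counts(toy)
for fold in sorted(counts):
    print(fold, dict(counts[fold]))
```

On the real dataframe, `dfs.groupby('fold')[labels].sum()` should give the same kind of per-fold summary using the one-hot columns.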

In [None]:
def get_x(r): return path/'cells'/(r['image_id']+'_'+str(r['cell_id'])+'.jpg')
def get_y(r): return r['image_labels'].split('|')

In [None]:
sample_stats = ([0.07237246, 0.04476176, 0.07661699], [0.17179589, 0.10284516, 0.14199627])

In [None]:
item_tfms = RandomResizedCrop(224, min_scale=0.75, ratio=(1.,1.))
batch_tfms = [*aug_transforms(flip_vert=True, size=128, max_warp=0), Normalize.from_stats(*sample_stats)]

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
def run_training(df=dfs, fold=0, net=resnet50, lr=3e-2, epochs=2):
    cbs = [EarlyStoppingCallback(patience=3), SaveModelCallback()]
    df['is_valid'] = df['fold'] == fold  # rows of the held-out fold form the validation set
    
    dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock(vocab=labels)),
                splitter=ColSplitter(col='is_valid'),
                get_x=get_x,
                get_y=get_y,
                item_tfms=item_tfms,
                batch_tfms=batch_tfms
                )
    dls = dblock.dataloaders(df, bs=bs)
    learn = cnn_learner(dls, net, metrics=[accuracy_multi, PrecisionMulti()]).to_fp16()
    learn.model_dir=model_path
    _lr = learn.lr_find(show_plot=False, suggestions=True).lr_steep if lr is None else lr
    print(f'\nStage1 learning rate used {_lr}')
    print(f'Training fold: {fold}')
    learn.fine_tune(epochs,base_lr=_lr,cbs=cbs)
    learn.save(f'{model_path}/fold_{fold}_{net.__name__}')
    learn.recorder.plot_loss()
    return learn, dls, df

In [None]:
for fold in range(nfold):
    learn, dls, dfs = run_training(fold=fold, epochs=2)
    torch.save(dls,f'{gen_path}/dls_{fold}.pkl')
    dfs.to_pickle(f'{gen_path}/dfs_{fold}')

In [None]:
path = test_path
cell_df = pd.read_csv(path/'cell_df.csv')
test_dl = learn.dls.test_dl(cell_df)
test_dl.show_batch()

In [None]:
weights = ['fold_0_resnet50','fold_1_resnet50','fold_2_resnet50','fold_3_resnet50','fold_4_resnet50']
final_preds = []
for w in weights:
    learn.load(model_path/w)
    _preds, _ = learn.tta(dl=test_dl, n=4) #n=4, beta=0.25 (defaults)
    #print(f'Pred {w}: {_preds}')
    final_preds.append(_preds)
preds = torch.mean(torch.stack(final_preds), dim=0)
#print(f'Mean Preds: {preds}')
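
The ensembling step is a plain element-wise mean of the per-fold probabilities. As a sketch of what `torch.mean(torch.stack(final_preds), dim=0)` computes, here is the same operation on nested Python lists. `average_folds` and the toy numbers are hypothetical and for illustration only.

```python
from statistics import fmean

def average_folds(fold_preds):
    """fold_preds: one [n_cells][n_classes] list of probabilities per fold."""
    n_cells = len(fold_preds[0])
    n_classes = len(fold_preds[0][0])
    return [
        [fmean(fp[c][k] for fp in fold_preds) for k in range(n_classes)]
        for c in range(n_cells)
    ]

fold_a = [[0.2, 0.8], [0.5, 0.1]]  # 2 cells x 2 classes
fold_b = [[0.4, 0.6], [0.7, 0.3]]
mean_preds = average_folds([fold_a, fold_b])
print(mean_preds)
```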

# Will score zero on private test set.
See [Darek's solution discussion](https://www.kaggle.com/c/hpa-single-cell-image-classification/discussion/221550). Both the cell tiles and the encoding strings for the public test data are created in a separate notebook: https://www.kaggle.com/thedrcat/hpa-cell-tiles-test-with-enc

As a reminder, this approach will score zero on the private test set as it stands. A final solution based on it must segment the cell tiles and create the encodings for the private test set.

# Inference
The submission logic is adapted from [Darek Kłeczek's](https://www.kaggle.com/thedrcat) [notebook](https://www.kaggle.com/thedrcat/fastai-quick-submission-template/notebook).

In [None]:
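threshold = 0.0
cls_col = []

for i in range(preds.shape[0]):
    # indices of the classes whose probability exceeds the threshold
    p = torch.nonzero(preds[i] > threshold).flatten().tolist()
    # fall back to the single most likely class when nothing passes the threshold
    if len(p) == 0: cls = [(preds[i].argmax().item(), preds[i].max().item())]
    else: cls = [(x, preds[i][x].item()) for x in p]
    cls_col.append(cls)

# one list of (class, confidence) pairs per cell
cell_df['cls'] = cls_col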
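
The selection rule in the cell above is: keep every class whose probability exceeds the threshold, and if none does, keep only the most likely class. Here is a minimal sketch of that rule on a plain list of probabilities; `select_classes` is a hypothetical helper, not part of the notebook.

```python
def select_classes(probs, threshold=0.0):
    """Return (class_index, confidence) pairs above threshold, or the argmax as a fallback."""
    picked = [(k, p) for k, p in enumerate(probs) if p > threshold]
    if not picked:
        best = max(range(len(probs)), key=probs.__getitem__)
        picked = [(best, probs[best])]
    return picked

print(select_classes([0.1, 0.7, 0.4], threshold=0.5))
print(select_classes([0.1, 0.3, 0.2], threshold=0.5))
print(select_classes([0.1, 0.3, 0.2]))
```

With `threshold = 0.0`, as used in this notebook, every class with a non-zero probability is submitted together with its confidence.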

In [None]:
def combine(r):
    cls = r['cls']  # list of (class, confidence) pairs for this cell
    enc = r['enc']  # encoded mask string for this cell
    classes = [str(c[0]) + ' ' + str(c[1]) + ' ' + enc for c in cls]
    return ' '.join(classes)

combine(cell_df[['cls', 'enc']].loc[24]);

In [None]:
cell_df['pred'] = cell_df[['cls', 'enc']].apply(combine, axis=1)
cell_df.head()

In [None]:
subm = cell_df.groupby(['image_id'])['pred'].apply(lambda x: ' '.join(x)).reset_index()
# subm = subm.loc[3:]
subm.head()
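
To show the shape of the resulting prediction string, here is a small sketch of the two steps above (one `class confidence enc` triple per predicted class, then all cells of an image joined with spaces) without pandas. `cell_string`, `image_strings` and the `ENC…` values are hypothetical placeholders, not real mask encodings.

```python
from collections import defaultdict

def cell_string(cls, enc):
    """Join 'class confidence enc' triples for one cell."""
    return ' '.join(f'{k} {conf} {enc}' for k, conf in cls)

def image_strings(cells):
    """cells: iterable of (image_id, cls, enc) -> {image_id: prediction string}."""
    per_image = defaultdict(list)
    for image_id, cls, enc in cells:
        per_image[image_id].append(cell_string(cls, enc))
    return {image_id: ' '.join(parts) for image_id, parts in per_image.items()}

cells = [
    ('img_a', [(0, 0.9), (16, 0.4)], 'ENC1'),
    ('img_a', [(14, 0.8)], 'ENC2'),
    ('img_b', [(18, 0.5)], 'ENC3'),
]
result = image_strings(cells)
print(result['img_a'])
```

Note that the same mask encoding is repeated once for every class predicted for that cell.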

In [None]:
sample_submission = pd.read_csv(sub_path/'sample_submission.csv')
sample_submission.head()

In [None]:
sub = pd.merge(
    sample_submission,
    subm,
    how="left",
    left_on='ID',
    right_on='image_id',
)
sub.head()

In [None]:
# Use the model's prediction where one exists; otherwise keep the sample submission's default
sub['PredictionString'] = sub['pred'].fillna(sub['PredictionString'])

In [None]:
sub = sub[sample_submission.columns]
sub.head()

In [None]:
sub.to_csv(gen_path/'submission.csv', index=False)

# Where are the mistakes?
Set `fold_num` to the fold whose validation performance you want to inspect.

In [None]:
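fold_num = 4
dfs = pd.read_pickle(f'{gen_path}/dfs_{fold_num}')
dls = torch.load(f'{gen_path}/dls_{fold_num}.pkl')
learn.dls = dls  # so learn.dls.valid below is this fold's validation split
learn.load(f'{model_path}/fold_{fold_num}_resnet50')  # weights are saved under model_path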

path = train_path

In [None]:
val_targ = torch.stack([x[1] for x in learn.dls.valid_ds], dim=0).numpy()
val_targ.shape

#val_targ = dfs[labels][dfs.is_valid == True].values

In [None]:
val_targ.shape

In [None]:
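# Assumption: `cm` is scikit-learn's multilabel_confusion_matrix (one 2x2 matrix per class)
from sklearn.metrics import multilabel_confusion_matrix as cm

val_preds_all = learn.get_preds(dl=learn.dls.valid)
full_preds = val_preds_all[0].numpy()  # raw probabilities
val_preds = full_preds > 0.5           # thresholded 0/1 predictions
vis_arr = cm(val_targ, val_preds)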
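
The plotting cells that follow expect `vis_arr` to hold one 2x2 matrix per class, with rows for the true value and columns for the predicted value. As a minimal sketch of what those matrices contain, the helper below counts them by hand. `per_class_confusion` is hypothetical, and the `[[TN, FP], [FN, TP]]` layout follows the convention of scikit-learn's `multilabel_confusion_matrix`.

```python
def per_class_confusion(y_true, y_pred):
    """Return one [[TN, FP], [FN, TP]] matrix per class for 0/1 multi-label data."""
    n_classes = len(y_true[0])
    matrices = [[[0, 0], [0, 0]] for _ in range(n_classes)]
    for t_row, p_row in zip(y_true, y_pred):
        for k, (t, p) in enumerate(zip(t_row, p_row)):
            # row index = true value, column index = predicted value
            matrices[k][int(t)][int(p)] += 1
    return matrices

y_true = [[1, 0], [0, 1], [1, 1]]  # 3 samples x 2 classes
y_pred = [[1, 0], [1, 1], [0, 1]]
matrices = per_class_confusion(y_true, y_pred)
for k, m in enumerate(matrices):
    print(k, m)
```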

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def print_confusion_matrix(confusion_matrix, axes, class_label, class_names, fontsize=14):

    df_cm = pd.DataFrame(
        confusion_matrix, index=class_names, columns=class_names,
    )

    try:
        heatmap = sns.heatmap(df_cm, annot=True, fmt="d", cbar=False, ax=axes)
    except ValueError:
        raise ValueError("Confusion matrix values must be integers.")
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    axes.set_ylabel('True label')
    axes.set_xlabel('Predicted label')
    axes.set_title("Confusion Matrix for the class - " + class_label)

In [None]:
fig, ax = plt.subplots(5, 4, figsize=(12, 16))
    
for axes, cfs_matrix, label in zip(ax.flatten(), vis_arr, labels):
    print_confusion_matrix(cfs_matrix, axes, label, ["0", "1"])

fig.tight_layout()
plt.show()

In [None]:
val = dfs[dfs.is_valid==True]
len(val[val['16'] == 1])

In [None]:
from sklearn.metrics import average_precision_score
# Score with the raw probabilities: average precision ranks samples by score
average_precision = average_precision_score(val_targ, full_preds)
average_precision

In [None]:
from sklearn.metrics import precision_recall_curve

precision = dict()
recall = dict()
average_precision = dict()
for i in range(len(labels)):
    # raw probabilities give the curve more than a single operating point
    precision[i], recall[i], _ = precision_recall_curve(val_targ[:, i], full_preds[:, i])
    average_precision[i] = average_precision_score(val_targ[:, i], full_preds[:, i])

# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(val_targ.ravel(), full_preds.ravel())
average_precision["micro"] = average_precision_score(val_targ, full_preds, average="micro")
print('Average precision score, micro-averaged over all classes: {0:0.2f}'.format(average_precision["micro"]))

In [None]:
average_precision
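
Average precision summarises the precision-recall curve as the sum of (R_n - R_{n-1}) * P_n over the distinct score thresholds, so the result depends on whether it is given probabilities or 0/1 predictions. The sketch below implements that formula directly on toy data; `average_precision_sketch` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
def average_precision_sketch(targets, scores):
    """AP = sum over distinct thresholds of (recall step) * precision at that threshold."""
    n_pos = sum(targets)
    if n_pos == 0:
        return 0.0
    ap, prev_recall = 0.0, 0.0
    for thr in sorted(set(scores), reverse=True):
        # everything scored at or above the threshold counts as predicted positive
        predicted = [t for t, s in zip(targets, scores) if s >= thr]
        tp = sum(predicted)
        precision = tp / len(predicted)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

targets = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
binary = [1.0 if s > 0.5 else 0.0 for s in scores]  # thresholded at 0.5

ap_scores = average_precision_sketch(targets, scores)
ap_binary = average_precision_sketch(targets, binary)
print(ap_scores, ap_binary)
```

In this toy case the thresholded predictions score lower than the probabilities, because thresholding collapses the ranking into a single operating point.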

# Example outputs...

Fold 0:  
```
{0: 0.2167366997440679,
 1: 0.2812805713716858,
 10: 0.4868564932757389,
 11: 0.011422044545973729,
 12: 0.0713877784123358,
 13: 0.08623643632210165,
 14: 0.07024557395773844,
 15: 0.03712164477441462,
 16: 0.09251856082238721,
 17: 0.05653912050256996,
 18: 0.006282124500285551,
 2: 0.1880749913690423,
 3: 0.1505034583412653,
 4: 0.17727102441633766,
 5: 0.0759565962307253,
 6: 0.06420992515555021,
 7: 0.07310108509423187,
 8: 0.14442435010681726,
 9: 0.051970302684180465,
 'micro': 0.12614287998476054}
```

Fold 2:  
```
{0: 0.3043858958182949,
 1: 0.2794941275861746,
 10: 0.5751811762284811,
 11: 0.011422044545973729,
 12: 0.0713877784123358,
 13: 0.11427077693468515,
 14: 0.14484892101919306,
 15: 0.03712164477441462,
 16: 0.16464123336386804,
 17: 0.05653912050256996,
 18: 0.006282124500285551,
 2: 0.28014791174704406,
 3: 0.17177427864349623,
 4: 0.30107426298272905,
 5: 0.0759565962307253,
 6: 0.08618572976279001,
 7: 0.07310108509423187,
 8: 0.1585756992476962,
 9: 0.051970302684180465,
 'micro': 0.15912726712147027}
```