# Recursion Cellular Image Classification - fastai starter

* Many thanks to [this kaggle kernel](https://www.kaggle.com/kernels/scriptcontent/20557703/download)
* This is the ensemble version of the notebook

## Load modules

In [1]:
import os

import numpy as np
import pandas as pd

from fastai.vision import *
from fastai.metrics import *

In [2]:
torch.cuda.is_available()

True

In [3]:
SIZE = 320

In [4]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    

SEED = 0
seed_everything(SEED)

## Loading and formatting data

Here I load the CSV into a DataFrame and add a column holding the path prefix of the corresponding image (`generate_df`).

In [5]:
from pathlib import Path

DATA = Path("/mnt/disk4/cell/")

In [6]:
train_df = pd.read_csv(DATA/'train.csv')
train_df.head(10)

Unnamed: 0,id_code,experiment,plate,well,sirna
0,HEPG2-01_1_B03,HEPG2-01,1,B03,513
1,HEPG2-01_1_B04,HEPG2-01,1,B04,840
2,HEPG2-01_1_B05,HEPG2-01,1,B05,1020
3,HEPG2-01_1_B06,HEPG2-01,1,B06,254
4,HEPG2-01_1_B07,HEPG2-01,1,B07,144
5,HEPG2-01_1_B08,HEPG2-01,1,B08,503
6,HEPG2-01_1_B09,HEPG2-01,1,B09,188
7,HEPG2-01_1_B10,HEPG2-01,1,B10,700
8,HEPG2-01_1_B11,HEPG2-01,1,B11,1100
9,HEPG2-01_1_B12,HEPG2-01,1,B12,611


In [7]:
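def generate_df(train_df,sample_num=1):
    # Path prefix is "<experiment>/Plate<plate>/<well>_s<site>_w";
    # the channel number and ".png" are appended when the image is opened.
    train_df = train_df.copy()  # do not mutate the caller's DataFrame
    train_df['path'] = train_df['experiment'].str.cat(train_df['plate'].astype(str).str.cat(train_df['well'],sep='/'),sep='/Plate') + '_s'+str(sample_num) + '_w'
    train_df = train_df.drop(columns=['id_code','experiment','plate','well']).reindex(columns=['path','sirna'])
    return train_df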
site1_train_df = generate_df(train_df)  
site2_train_df = generate_df(train_df, sample_num=2)

proc_train_df = pd.concat([site1_train_df,site2_train_df],axis=0 )\
.sample(frac = 1.0)\
.reset_index()\
.drop("index",axis=1)
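To make the path convention concrete, here is a minimal sketch in plain Python that builds the six channel file names for one well. The helper name `channel_paths` is hypothetical and not part of the notebook; it only mirrors the string layout that `generate_df` produces, with the channel number and `.png` appended.

```python
def channel_paths(experiment, plate, well, site=1, n_channels=6):
    """Return the relative file names of all channels for one well and site."""
    # Same prefix layout as generate_df: <experiment>/Plate<plate>/<well>_s<site>_w
    prefix = f"{experiment}/Plate{plate}/{well}_s{site}_w"
    # Channels are numbered from 1 and stored as separate PNG files.
    return [f"{prefix}{c}.png" for c in range(1, n_channels + 1)]


paths = channel_paths("HEPG2-01", 1, "B03", site=1)
for p in paths:
    print(p)
```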

Let's look at an example image. These are 6-channel images, but each of the six channels is saved as a separate file. Here, I open just one channel of the image.

In [8]:
proc_train_df.head(10)

Unnamed: 0,path,sirna
0,HUVEC-15/Plate4/G13_s2_w,981
1,RPE-04/Plate4/B19_s1_w,954
2,HEPG2-05/Plate3/H14_s1_w,757
3,RPE-04/Plate4/F05_s2_w,803
4,HUVEC-14/Plate2/B13_s2_w,1004
5,RPE-01/Plate1/B03_s1_w,1084
6,HUVEC-13/Plate1/K19_s2_w,678
7,U2OS-02/Plate2/B17_s2_w,357
8,HUVEC-08/Plate1/O13_s2_w,1100
9,HUVEC-02/Plate1/B22_s2_w,301


In [9]:
import cv2
img = cv2.imread(str(DATA/"train/HEPG2-01/Plate1/B03_s1_w2.png"))
# plt.imshow(img)
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cv2.imread returns BGR, not RGB
# plt.imshow(gray_img)
gray_img.shape

(512, 512)

In fastai, there is a modular data API that allows you to easily load images, add labels, split into train/valid, and add transforms. The base class for loading items is `ItemList`. For image classification tasks, the base class is `ImageList`, which subclasses `ItemList`. Since `ImageList` can only open 3-channel images, we define a new `ImageList` subclass that overrides the loading function:

In [10]:
def open_rcic_image(fn):
    images = []
    for i in range(6):
        file_name = fn+str(i+1)+'.png'
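        im = cv2.imread(file_name)  # loaded as a 3-channel BGR array
        if im is None:  # cv2.imread returns None instead of raising
            raise FileNotFoundError(file_name)
        im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)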
        images.append(im)
    image = np.dstack(images)
    #print(pil2tensor(image, np.float32).shape)#.div_(255).shape)
    return Image(pil2tensor(image, np.float32).div_(255))
  
class MultiChannelImageList(ImageList):
    def open(self, fn):
        return open_rcic_image(fn)
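The stacking step can be sketched with NumPy alone. This is an illustration under stated assumptions: the six arrays are synthetic stand-ins for the grayscale channel images, and the transpose plus division imitates what `pil2tensor(...).div_(255)` does to the array layout and value range.

```python
import numpy as np

# Six synthetic 4x4 grayscale "channels" standing in for the six PNG files.
channels = [np.full((4, 4), 40 * i, dtype=np.uint8) for i in range(6)]

# Stack along the last axis: height x width x channels.
stacked = np.dstack(channels)

# Move channels first (channels x height x width) and scale to [0, 1].
chw = stacked.transpose(2, 0, 1).astype(np.float32) / 255.0

print(stacked.shape, chw.shape)
```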

Because `MultiChannelImageList` subclasses `ImageList`, I can load images with the `ImageList` class method `.from_df`.

In [11]:
il = MultiChannelImageList.from_df(df=proc_train_df,path=DATA/'train/')

We have to redefine the following function to be able to view the image in the notebook. I view just the first 3 channels.

In [12]:
def image2np(image:Tensor)->np.ndarray:
    "Convert from torch style `image` to numpy/matplotlib style."
    res = image.cpu().permute(1,2,0).numpy()
    if res.shape[2]==1:
        return res[...,0]  
    elif res.shape[2]>3:
        #print(res.shape)
        #print(res[...,:3].shape)
        return res[...,:3]
    else:
        return res

vision.image.image2np = image2np

Now let's view an example image:

In [13]:
# il[0]

With the multi-channel `ImageList` defined, we can now create a DataBunch of the train images. Let's first create a stratified split of the dataset into train and validation sets.

In [14]:
from sklearn.model_selection import StratifiedKFold
#train_idx, val_idx = next(iter(StratifiedKFold(n_splits=int(1/0.035),random_state=42).split(proc_train_df, proc_train_df.sirna)))
from sklearn.model_selection import train_test_split
train_df,val_df = train_test_split(proc_train_df,test_size=0.035, stratify = proc_train_df.sirna, random_state=42)
_proc_train_df = pd.concat([train_df,val_df])
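Concatenating the train rows first and the validation rows second means the validation rows occupy the last positions, which is what `getPred` later relies on when it passes a range of positions to `split_by_idx`. A minimal sketch with toy lists (the sizes and names are hypothetical):

```python
# Toy stand-ins for the train and validation rows.
train_rows = ["t0", "t1", "t2", "t3", "t4"]
val_rows = ["v0", "v1"]

# Same ordering as pd.concat([train_df, val_df]): train first, validation last.
all_rows = train_rows + val_rows

# Positions of the validation rows inside the concatenated sequence.
val_idx = list(range(len(train_rows), len(all_rows)))
selected = [all_rows[i] for i in val_idx]

print(val_idx, selected)
```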

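Now we create the `DataBunch`. In this ensemble version it is built inside `getPred` below, so the cell here only keeps a commented-out preview call.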

In [15]:
# data.show_batch()

## Creating and Training a Model

I will use a pretrained EfficientNet. There is code for other models that you can try, but EfficientNet seems to do the best. I now have to adjust the CNN architecture to take 6 input channels instead of the usual 3:

In [16]:
# !pip install efficientnet_pytorch

In [17]:
from efficientnet_pytorch import *

In [18]:
"""Inspired by https://github.com/wdhorton/protein-atlas-fastai/blob/master/resnet.py"""

import torchvision
RESNET_MODELS = {
    18: torchvision.models.resnet18,
    34: torchvision.models.resnet34,
    50: torchvision.models.resnet50,
    101: torchvision.models.resnet101,
    152: torchvision.models.resnet152,
}

def resnet_multichannel(depth=50,pretrained=True,num_classes=1108,num_channels=6):
        model = RESNET_MODELS[depth](pretrained=pretrained)
        w = model.conv1.weight
        model.conv1 = nn.Conv2d(num_channels, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        model.conv1.weight = nn.Parameter(torch.stack([torch.mean(w, 1)]*num_channels, dim=1))
        return model

    
DENSENET_MODELS = {
    121: torchvision.models.densenet121,
    161: torchvision.models.densenet161,
    169: torchvision.models.densenet169,
    201: torchvision.models.densenet201,
}

def densenet_multichannel(depth=121,pretrained=True,num_classes=1108,num_channels=6):
        model = DENSENET_MODELS[depth](pretrained=pretrained)
        w = model.features.conv0.weight
        model.features.conv0 = nn.Conv2d(num_channels, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        model.features.conv0.weight = nn.Parameter(torch.stack([torch.mean(w, 1)]*num_channels, dim=1))
        return model


def efficientnet_multichannel(pretrained=True,name='b3',num_classes=1108,num_channels=6,image_size=360):
    model = EfficientNet.from_pretrained('efficientnet-'+name,num_classes=num_classes)
    #model.load_state_dict(torch.load(EFFICIENTNET_MODELS[name]))
    w = model._conv_stem.weight
    #s = model._conv_stem.static_padding
    model._conv_stem = utils.Conv2dStaticSamePadding(num_channels,32,kernel_size=(3, 3), stride=(2, 2), bias=False, image_size = image_size)
    model._conv_stem.weight = nn.Parameter(torch.stack([torch.mean(w, 1)]*num_channels, dim=1))
    return model
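All three helpers initialise the new first-layer weights the same way: average the pretrained kernel over its 3 input channels, then repeat that average once per new input channel. The sketch below reproduces the shape arithmetic with NumPy as a stand-in for the `torch.mean` and `torch.stack` calls; the random array is a placeholder for real pretrained weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for a pretrained first-layer kernel: (out_ch, in_ch=3, kH, kW).
w = rng.normal(size=(64, 3, 7, 7)).astype(np.float32)
num_channels = 6

# Average over the input-channel axis: (64, 7, 7).
mean_w = w.mean(axis=1)

# Repeat the averaged kernel for every new input channel: (64, 6, 7, 7).
new_w = np.stack([mean_w] * num_channels, axis=1)

print(w.shape, "->", new_w.shape)
```

Every input channel therefore starts from the same averaged filter, and the channels only diverge once training begins.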

In [19]:
def resnet18(pretrained,num_channels=6):
    return resnet_multichannel(depth=18,pretrained=pretrained,num_channels=num_channels)

def _resnet_split(m): return (m[0][6],m[1])

def densenet161(pretrained,num_channels=6):
    return densenet_multichannel(depth=161,pretrained=pretrained,num_channels=num_channels)
  
def _densenet_split(m:nn.Module): return (m[0][0][7],m[1])

def efficientnetbn(name, sz, pretrained=True,num_channels=6):
    return efficientnet_multichannel(pretrained=pretrained,name=name, num_channels=num_channels, image_size = sz)


## Inference and Submission Generation

Let's now load our test csv and process the DataFrame like we did for the training data.

In [20]:
test_df = pd.read_csv(DATA/'test.csv')
site1_test_df = generate_df(test_df.copy())
site2_test_df = generate_df(test_df.copy(),sample_num=2)

Prediction with a single model on the test data of a single site

In [21]:
# !mkdir -p /data/rcic/saved_preds

In [22]:
# Prediction cache: if a saved prediction already exists for the same
# checkpoint name (`load`) and site, reuse it instead of running inference again.
PREDS = Path("/data/rcic/saved_preds")
def predCache(f):
    """
    prediction cache decorator
    """
    def func(**kwargs):
        saved_fn = "%s_site-%s.tsr"%(kwargs["load"].split(".")[0],kwargs["site"])
        save_path = str(PREDS/saved_fn)
        if os.path.exists(save_path):
            print("loading from cache: %s"%(save_path))
            return torch.load(save_path)
        else:
            pred = f(**kwargs)
            torch.save(pred, save_path)
            print("save to cache: %s"%(save_path))
            return pred
    return func
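The caching pattern can be tried without a GPU or model. The sketch below is a stand-in under stated assumptions: it uses `pickle` and a temporary directory instead of `torch.save` and the `PREDS` folder, and `fake_predict` is a hypothetical function that replaces real inference. As in `predCache`, the cache key is derived only from `load` and `site`.

```python
import os
import pickle
import tempfile
from functools import wraps

CACHE_DIR = tempfile.mkdtemp()  # hypothetical cache location for the demo


def file_cache(f):
    """Cache the result of f on disk, keyed by the `load` and `site` kwargs."""
    @wraps(f)
    def wrapper(**kwargs):
        key = "%s_site-%s.pkl" % (kwargs["load"].split(".")[0], kwargs["site"])
        path = os.path.join(CACHE_DIR, key)
        if os.path.exists(path):
            with open(path, "rb") as fh:
                return pickle.load(fh)
        result = f(**kwargs)
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


calls = []  # records how often the wrapped function really runs


@file_cache
def fake_predict(load, site):
    calls.append((load, site))
    return [0.1, 0.9]


first = fake_predict(load="model-a", site=1)
second = fake_predict(load="model-a", site=1)  # served from disk
print(first, second, len(calls))
```

Because other arguments are not part of the key, changing them for an already cached checkpoint and site would still return the stored result.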

In [45]:
@predCache
def getPred(name,sz,bs,load,site):
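    """
    Get test-set predictions for one site from a saved model checkpoint.
    name: str, EfficientNet variant, one of 'b0','b1','b2','b3',...,'b5'
    sz: int, input image size; bs: int, batch size
    load: str, name of the checkpoint to load
    site: 1 or 2
    """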
    site_test_df=site1_test_df if site==1 else site2_test_df
    data_test = MultiChannelImageList.from_df(df = site_test_df,path=DATA/'test/')
    data = (MultiChannelImageList.from_df(df=_proc_train_df,path=DATA/'train/')
        .split_by_idx(list(range(len(train_df),len(_proc_train_df))))
        .label_from_df()
        .transform(get_transforms(),size=sz)
        .databunch(bs=bs,num_workers=4)
        .normalize()
       )
    learn = Learner(data, efficientnetbn(name,sz),metrics=[accuracy]).to_fp16()
    learn.path = Path("./ensemble")
    learn.load(load)
    learn.data.add_test(data_test)
    preds, _ = learn.get_preds(DatasetType.Test)
    return preds

Ensemble the predictions

In [46]:
def ensemble(*args):
    preds = list(args)
    preds_len = len(preds)
    preds_idx = (sum(preds)/preds_len).argmax(dim=-1)
    return preds_idx
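`ensemble` averages the per-class scores of all prediction tensors and then takes the highest-scoring class per row. A small NumPy sketch with made-up scores for two samples and three classes shows how the averaged result can differ from an individual model's choice:

```python
import numpy as np

# Made-up per-class scores from two models (rows: samples, columns: classes).
p1 = np.array([[0.6, 0.3, 0.1],
               [0.2, 0.5, 0.3]])
p2 = np.array([[0.2, 0.7, 0.1],
               [0.1, 0.3, 0.6]])

# Element-wise mean over the models, then the best class per sample.
mean_scores = (p1 + p2) / 2
ensemble_idx = mean_scores.argmax(axis=-1)

print(p1.argmax(axis=-1), p2.argmax(axis=-1), ensemble_idx)
```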

In [47]:
pred_tensors = list()

In [None]:
pred_tensors.append(getPred(name = "b0",sz = 256, bs = 64,load = "rcic-b0-sz256-bs64-s12", site = 1)) #0.272

pred_tensors.append(getPred(name = "b0",sz = 256, bs = 64,load = "rcic-b0-sz256-bs64-s12", site = 2))

pred_tensors.append(getPred(name = "b3",sz = 320, bs = 16,load = "rcic-b3-s320-bs16-s12", site = 1)) # 0.378

pred_tensors.append(getPred(name = "b3",sz = 320, bs = 16,load = "rcic-b3-s320-bs16-s2-e1", site = 2)) # 0.329

pred_tensors.append(getPred(name = "b3",sz = 320, bs = 16,load = "rcic-b3-s320-bs16-s2-e_end", site = 2)) # 0.390

pred_tensors.append(getPred(name = "b4",sz = 380, bs = 16,load = "rcic-b4-sz380-bs64-s1", site = 1)) # 0.325

pred_tensors.append(getPred(name = "b4",sz = 380, bs = 16,load = "rcic-b4-sz380-bs36-s2", site = 2)) # 0.326

save to cache: /data/rcic/saved_preds/rcic-b3-s320-bs16-s2-e1_site-2.tsr
Loaded pretrained weights for efficientnet-b3


In [33]:
preds = ensemble(*pred_tensors)

### Save to submission file

Let's open the sample submission file and fill it with our predictions to create a submission.

In [34]:
submission_df = pd.read_csv(DATA/'sample_submission.csv')

In [35]:
submission_df.sirna = preds.numpy().astype(int)
print(submission_df.head(5))

          id_code  sirna
0  HEPG2-08_1_B03    855
1  HEPG2-08_1_B04    796
2  HEPG2-08_1_B05     19
3  HEPG2-08_1_B06     65
4  HEPG2-08_1_B07    823


In [36]:
submission_df.to_csv('submission.csv',index=False)