# Recursion Cellular Image Classification - fastai starter

* Thanks greatly to [this kaggle kernel](https://www.kaggle.com/kernels/scriptcontent/20557703/download)
* This is the ensemble version of the notebook

## Load modules

In [1]:
import os

import numpy as np
import pandas as pd

from fastai.vision import *

In [2]:
torch.cuda.is_available()

True

In [3]:
SIZE = 320

In [3]:
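import random  # imported explicitly rather than relying on fastai's star import

def seed_everything(seed):
    "Seed the Python, NumPy and PyTorch RNGs so that runs are repeatable."
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True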
    

SEED = 0
seed_everything(SEED)

## Loading and formatting data

Here I load the CSV into a DataFrame and add a column holding the path prefix of the corresponding image (`generate_df`). The prefix stops just before the channel number, which is appended when the image is opened.

In [37]:
from pathlib import Path

DATA = Path("/mnt/disk4/cell/")

In [38]:
train_df = pd.read_csv(DATA/'train.csv')
train_df.head(10)

Unnamed: 0,id_code,experiment,plate,well,sirna
0,HEPG2-01_1_B03,HEPG2-01,1,B03,513
1,HEPG2-01_1_B04,HEPG2-01,1,B04,840
2,HEPG2-01_1_B05,HEPG2-01,1,B05,1020
3,HEPG2-01_1_B06,HEPG2-01,1,B06,254
4,HEPG2-01_1_B07,HEPG2-01,1,B07,144
5,HEPG2-01_1_B08,HEPG2-01,1,B08,503
6,HEPG2-01_1_B09,HEPG2-01,1,B09,188
7,HEPG2-01_1_B10,HEPG2-01,1,B10,700
8,HEPG2-01_1_B11,HEPG2-01,1,B11,1100
9,HEPG2-01_1_B12,HEPG2-01,1,B12,611


In [39]:
def generate_df(train_df, sample_num=1):
    "Build a (path, sirna) frame for one imaging site; `path` omits the channel number and extension."
    df = train_df.copy()  # avoid adding a 'path' column to the caller's DataFrame
    df['path'] = (df['experiment'] + '/Plate' + df['plate'].astype(str) + '/' + df['well']
                  + '_s' + str(sample_num) + '_w')
    return df.reindex(columns=['path', 'sirna'])

site1_train_df = generate_df(train_df)
site2_train_df = generate_df(train_df, sample_num=2)

proc_train_df = pd.concat([site1_train_df,site2_train_df],axis=0 )\
.sample(frac = 1.0)\
.reset_index()\
.drop("index",axis=1)
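To make the path convention concrete, here is a minimal standalone sketch of the string that `generate_df` builds for one row. The helper name `build_path_prefix` is hypothetical and only used for illustration; the example values come from the first row of `train.csv` shown above.

```python
def build_path_prefix(experiment: str, plate: int, well: str, site: int = 1) -> str:
    """Return the per-site path prefix, without the channel number or extension."""
    return f"{experiment}/Plate{plate}/{well}_s{site}_w"


prefix = build_path_prefix("HEPG2-01", 1, "B03", site=1)
print(prefix)

# The six channel files are obtained by appending 1..6 and ".png" to the prefix.
channel_files = [f"{prefix}{c}.png" for c in range(1, 7)]
print(channel_files[1])
```

The second printed path corresponds to the channel-2 file of site 1 for that well.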

Let's look at an example image. These are 6-channel images, but each of the six channels is saved as a separate file. Here, I open just one channel of the image.

In [40]:
proc_train_df.head(10)

Unnamed: 0,path,sirna
0,HUVEC-07/Plate4/C12_s1_w,1040
1,U2OS-01/Plate3/L20_s2_w,450
2,RPE-04/Plate2/G21_s1_w,896
3,RPE-05/Plate2/H18_s1_w,329
4,RPE-01/Plate2/L08_s1_w,429
5,HUVEC-02/Plate2/J02_s1_w,1046
6,RPE-07/Plate2/C09_s1_w,393
7,RPE-07/Plate4/D18_s2_w,286
8,RPE-06/Plate3/I18_s1_w,1027
9,HUVEC-11/Plate3/O07_s1_w,391


In [41]:
import cv2
img = cv2.imread(str(DATA/"train/HEPG2-01/Plate1/B03_s1_w2.png"))
# plt.imshow(img)
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # cv2.imread returns BGR, not RGB
# plt.imshow(gray_img)
gray_img.shape

(512, 512)

In fastai, there is a modular data API that allows you to easily load images, add labels, split into train/valid, and add transforms. The base class for loading the images is an `ItemList`. For image classification tasks, the base class is `ImageList` which in turn subclasses the `ItemList` class. Since `ImageList` can only open 3-channel images, we will define a new `ImageList` class where we redefine the loading function:

In [42]:
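def open_rcic_image(fn):
    "Load the six channel files for one site and stack them into a 6-channel fastai `Image`."
    images = []
    for i in range(6):
        file_name = fn+str(i+1)+'.png'
        im = cv2.imread(file_name)
        if im is None:  # cv2.imread returns None instead of raising on a missing file
            raise FileNotFoundError(file_name)
        im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)  # cv2.imread returns BGR, not RGB
        images.append(im)
    image = np.dstack(images)  # (H, W, 6)
    return Image(pil2tensor(image, np.float32).div_(255))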
  
class MultiChannelImageList(ImageList):
    def open(self, fn):
        return open_rcic_image(fn)
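The array handling inside `open_rcic_image` can be sketched with NumPy alone. This is an illustration on synthetic data, not the notebook's code: the tiny 4x5 planes stand in for the real 512x512 channel files, and the transpose mirrors what I understand `pil2tensor` to do (move channels to the first axis).

```python
import numpy as np

# Six synthetic 8-bit grayscale planes stand in for the six channel files.
height, width = 4, 5
planes = [np.full((height, width), 40 * c, dtype=np.uint8) for c in range(6)]

# np.dstack stacks along a new last axis: six (H, W) planes -> (H, W, 6).
stacked = np.dstack(planes)

# Move channels first and rescale to [0, 1]; (C, H, W) is the layout
# that a PyTorch convolution expects for a single image.
chw = stacked.transpose(2, 0, 1).astype(np.float32) / 255.0

print(stacked.shape, chw.shape)
```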

Because `MultiChannelImageList` subclasses `ImageList`, I can load images with the inherited `.from_df` class method.

In [43]:
il = MultiChannelImageList.from_df(df=proc_train_df,path=DATA/'train/')

We have to redefine the following function to be able to view the image in the notebook. I view just the first 3 channels.

In [44]:
def image2np(image:Tensor)->np.ndarray:
    "Convert from torch style `image` to numpy/matplotlib style."
    res = image.cpu().permute(1,2,0).numpy()
    if res.shape[2]==1:
        return res[...,0]  
    elif res.shape[2]>3:
        #print(res.shape)
        #print(res[...,:3].shape)
        return res[...,:3]
    else:
        return res

vision.image.image2np = image2np

Now let's view an example image:

In [45]:
# il[0]

With the multi-channel `ImageList` defined, we can now create a DataBunch of the train images. Let's first create a stratified split of the dataset. The validation rows are concatenated after the training rows so that they can be selected by index later.

In [46]:
from sklearn.model_selection import StratifiedKFold
#train_idx, val_idx = next(iter(StratifiedKFold(n_splits=int(1/0.035),random_state=42).split(proc_train_df, proc_train_df.sirna)))
from sklearn.model_selection import train_test_split
train_df,val_df = train_test_split(proc_train_df,test_size=0.035, stratify = proc_train_df.sirna, random_state=42)
_proc_train_df = pd.concat([train_df,val_df])
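For readers unfamiliar with stratification, the idea is to sample the validation set separately within each class so that every label keeps roughly the same share in both splits. The sketch below shows that idea with the standard library only. The helper `stratified_split` is hypothetical, and its rounding rule (at least one validation sample per class) is my assumption; scikit-learn's `train_test_split` allocates samples differently in detail.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac, seed=42):
    """Return (train_idx, val_idx), sampling the validation part per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        # Assumed rule: rounded share of the class, but never fewer than one.
        n_val = max(1, round(len(idxs) * test_frac))
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = [0] * 20 + [1] * 20 + [2] * 20
train_idx, val_idx = stratified_split(labels, test_frac=0.1)
print(len(train_idx), len(val_idx))
```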

Now we create the `DataBunch`

In [16]:
# data.show_batch()

## Creating and Training a Model

I will use a pretrained EfficientNet. There is code for other models that you can try, but EfficientNet seems to do the best. I now have to adjust the CNN architecture to take 6 input channels instead of the usual 3:

In [30]:
# !pip install efficientnet_pytorch

In [31]:
from efficientnet_pytorch import *

In [49]:
"""Inspired by https://github.com/wdhorton/protein-atlas-fastai/blob/master/resnet.py"""

import torchvision
RESNET_MODELS = {
    18: torchvision.models.resnet18,
    34: torchvision.models.resnet34,
    50: torchvision.models.resnet50,
    101: torchvision.models.resnet101,
    152: torchvision.models.resnet152,
}

def resnet_multichannel(depth=50,pretrained=True,num_classes=1108,num_channels=6):
    model = RESNET_MODELS[depth](pretrained=pretrained)
    w = model.conv1.weight
    model.conv1 = nn.Conv2d(num_channels, 64, kernel_size=7, stride=2, padding=3,
                            bias=False)
    # Replicate the mean of the pretrained RGB kernels across all input channels
    model.conv1.weight = nn.Parameter(torch.stack([torch.mean(w, 1)]*num_channels, dim=1))
    return model

    
DENSENET_MODELS = {
    121: torchvision.models.densenet121,
    161: torchvision.models.densenet161,
    169: torchvision.models.densenet169,
    201: torchvision.models.densenet201,
}

def densenet_multichannel(depth=121,pretrained=True,num_classes=1108,num_channels=6):
    model = DENSENET_MODELS[depth](pretrained=pretrained)
    w = model.features.conv0.weight
    # Take the output width from the pretrained weights: it is not 64 for every depth
    model.features.conv0 = nn.Conv2d(num_channels, w.shape[0], kernel_size=7, stride=2, padding=3,
                                     bias=False)
    # Replicate the mean of the pretrained RGB kernels across all input channels
    model.features.conv0.weight = nn.Parameter(torch.stack([torch.mean(w, 1)]*num_channels, dim=1))
    return model


def efficientnet_multichannel(pretrained=True,name='b3',num_classes=1108,num_channels=6,image_size=360):
    model = EfficientNet.from_pretrained('efficientnet-'+name,num_classes=num_classes)
    #model.load_state_dict(torch.load(EFFICIENTNET_MODELS[name]))
    w = model._conv_stem.weight
    #s = model._conv_stem.static_padding
    # Take the stem width from the pretrained weights: it varies across EfficientNet variants
    model._conv_stem = utils.Conv2dStaticSamePadding(num_channels,w.shape[0],kernel_size=(3, 3), stride=(2, 2), bias=False, image_size = image_size)
    # Replicate the mean of the pretrained RGB kernels across all input channels
    model._conv_stem.weight = nn.Parameter(torch.stack([torch.mean(w, 1)]*num_channels, dim=1))
    return model
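All three constructors above initialise the new first layer the same way: average the pretrained kernel over its three RGB input channels, then repeat that mean once per new input channel. The following is a NumPy analogue of the `torch.stack([torch.mean(w, 1)]*num_channels, dim=1)` expression, using a random array as a stand-in for real pretrained weights (the shape `(8, 3, 3, 3)` is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a pretrained stem weight: (out_channels, 3, k, k).
w = rng.normal(size=(8, 3, 3, 3)).astype(np.float32)

num_channels = 6
# Average over the RGB input axis, then repeat that mean for each new channel.
mean_kernel = w.mean(axis=1)                              # (8, 3, 3)
w_multi = np.stack([mean_kernel] * num_channels, axis=1)  # (8, 6, 3, 3)

print(w.shape, "->", w_multi.shape)
```

Every input channel therefore starts from an identical kernel, and the channels only diverge once the model is fine-tuned.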

In [50]:
def resnet18(pretrained,num_channels=6):
    return resnet_multichannel(depth=18,pretrained=pretrained,num_channels=num_channels)

def _resnet_split(m): return (m[0][6],m[1])

def densenet161(pretrained,num_channels=6):
    return densenet_multichannel(depth=161,pretrained=pretrained,num_channels=num_channels)
  
def _densenet_split(m:nn.Module): return (m[0][0][7],m[1])

def efficientnetbn(name, sz, pretrained=True,num_channels=6):
    return efficientnet_multichannel(pretrained=pretrained,name=name, num_channels=num_channels, image_size = sz)


Let's create our Learner:

In [21]:
from fastai.metrics import *

Downloading: "http://storage.googleapis.com/public-models/efficientnet/efficientnet-b3-5fb5a3c3.pth" to /home/hadoop/.cache/torch/checkpoints/efficientnet-b3-5fb5a3c3.pth
100%|██████████| 47.1M/47.1M [00:03<00:00, 15.7MB/s]


Loaded pretrained weights for efficientnet-b3


We will now unfreeze and train the entire model.

## Inference and Submission Generation

Let's now load our test csv and process the DataFrame like we did for the training data.

In [17]:
test_df = pd.read_csv(DATA/'test.csv')
site1_test_df = generate_df(test_df.copy())
site2_test_df = generate_df(test_df.copy(),sample_num=2)

We add the data to our DataBunch:

In [61]:
def getPred(name,sz,bs,load):
    """
    get prediction for 2 sites using a pretrained model
    """
    data_test1 = MultiChannelImageList.from_df(df=site1_test_df,path=DATA/'test/')
    data_test2 = MultiChannelImageList.from_df(df=site2_test_df,path=DATA/'test/')
    data = (MultiChannelImageList.from_df(df=_proc_train_df,path=DATA/'train/')
        .split_by_idx(list(range(len(train_df),len(_proc_train_df))))
        .label_from_df()
        .transform(get_transforms(),size=sz)
        .databunch(bs=bs,num_workers=4)
        .normalize()
       )
    learn = Learner(data, efficientnetbn(name,sz),metrics=[accuracy]).to_fp16()
    learn.path = Path("./ensemble")
    learn.load(load)
    learn.data.add_test(data_test1)
    preds1, _ = learn.get_preds(DatasetType.Test)
    learn.data.add_test(data_test2)
    preds2, _ = learn.get_preds(DatasetType.Test)
    return preds1,preds2

In [63]:
preds1, preds2 = getPred(name = "b4",sz = 360,bs = 8,load = "rcic-sz380-bs36")

In [71]:
preds = ((preds1+preds2)/2).argmax(dim=-1)
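The line above averages the per-class scores from the two sites of each well before taking the arg max. A small NumPy sketch with made-up scores (two wells, three classes) shows how the averaged prediction can differ from what either site would predict on its own.

```python
import numpy as np

# Rows are wells, columns are per-class scores; the values are invented.
preds_site1 = np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.5, 0.3]])
preds_site2 = np.array([[0.2, 0.7, 0.1],
                        [0.1, 0.3, 0.6]])

# Average the two sites, then pick the highest-scoring class per well.
mean_preds = (preds_site1 + preds_site2) / 2
labels = mean_preds.argmax(axis=-1)
print(labels)
```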

In [73]:
test_df.head(10)

Unnamed: 0,id_code,experiment,plate,well
0,HEPG2-08_1_B03,HEPG2-08,1,B03
1,HEPG2-08_1_B04,HEPG2-08,1,B04
2,HEPG2-08_1_B05,HEPG2-08,1,B05
3,HEPG2-08_1_B06,HEPG2-08,1,B06
4,HEPG2-08_1_B07,HEPG2-08,1,B07
5,HEPG2-08_1_B08,HEPG2-08,1,B08
6,HEPG2-08_1_B09,HEPG2-08,1,B09
7,HEPG2-08_1_B10,HEPG2-08,1,B10
8,HEPG2-08_1_B11,HEPG2-08,1,B11
9,HEPG2-08_1_B12,HEPG2-08,1,B12


Let's open the sample submission file and load it with our predictions to create a submission.

In [74]:
submission_df = pd.read_csv(DATA/'sample_submission.csv')

In [75]:
submission_df.sirna = preds.numpy().astype(int)
print(submission_df.head(5))

          id_code  sirna
0  HEPG2-08_1_B03    855
1  HEPG2-08_1_B04    414
2  HEPG2-08_1_B05    289
3  HEPG2-08_1_B06    237
4  HEPG2-08_1_B07    959


In [76]:
submission_df.to_csv('submission.csv',index=False)

## Future work:

This is only a simple baseline. There are many different things we can change:
* Combine the two sites differently (right now both sites are pooled for training and their predictions are averaged at test time)
* Model architecture
* Train multiple classifiers for different cell types
* **Metric learning** - This will be the key to successful submissions