# Recursion Cellular Image Classification
### CellSignal: Disentangling biological signal from experimental noise in cellular images

> **Work done by**: Nwachukwu Anthony  
> **Email**: nwachukwuanthony2015@gmail.com  
> **Inspired by**: *Fastai online courses on Deep Learning*  
> **Data from**: Kaggle competition (link below)

The cost of some drugs and medical treatments has risen so high in recent years that many patients are having to go without. You can help with a classification project that could make researchers more efficient.

One of the more surprising reasons behind the cost is how long it takes to bring new treatments to market. Despite improvements in technology and science, research and development continues to lag. In fact, finding new treatments takes, on average, more than 10 years and costs hundreds of millions of dollars.

Recursion Pharmaceuticals, creators of the industry’s largest dataset of biological images, generated entirely in-house, believes AI has the potential to dramatically improve and expedite the drug discovery process. More specifically, your efforts could help them understand how drugs interact with human cells.

In this competition you will disentangle experimental noise from real biological signal. Your entry will classify images of cells under one of 1,108 different genetic perturbations, which can help eliminate the noise introduced by technical execution and environmental variation between experiments.

If successful, you could dramatically improve the industry’s ability to model cellular images according to their relevant biology. In turn, applying AI could greatly decrease the cost of treatments, and ensure these treatments get to patients faster.


You will find the dataset on this website: https://www.kaggle.com/c/recursion-cellular-image-classification/data

### Import Libraries

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import numpy as np 
import pandas as pd
from fastai.metrics import accuracy
from fastai.vision import *
import os
print(os.listdir("../input/recursion-cellular-image-classification"))

### Set the paths and Prepare the data

In [None]:
# Build a CSV that maps one image path per well to its siRNA label
path = '../input/recursion-cellular-image-classification'
dftrain = pd.read_csv(path+'/train.csv')
dftrain = dftrain[['id_code','sirna']]
dic = {}
for fold1 in os.listdir(path+'/train'):                       # experiment, e.g. HEPG2-04
    for fold2 in os.listdir(path+'/train/'+fold1):            # plate, e.g. Plate1
        for image in os.listdir(path+'/train/'+fold1+'/'+fold2):
            # Key is experiment_plate_well; images of the same well overwrite each other
            dic[str(fold1)+'_'+fold2[5:]+'_'+image[0:3]] = str(fold1)+'/'+fold2+'/'+image
df = pd.DataFrame(list(dic.items()), columns=['id_code','Item'])
dftraindf = pd.merge(df, dftrain)  # inner join on id_code
trainData = dftraindf[['Item','sirna']]
trainData.to_csv(r'../working/trainData.csv', index=None, header=True)
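
The loop above keys every image by experiment, plate and well, so all images of one well share a key. The following is a minimal sketch of that behaviour using only the standard library; `make_id_code` is a hypothetical helper, and the file names are illustrative (they follow the `O23_s2_w4.png` pattern used later in this notebook).

```python
# Hypothetical helper mirroring the key built in the loop above.
def make_id_code(experiment: str, plate_dir: str, filename: str) -> str:
    # 'Plate1' -> '1' (drop the 5-character 'Plate' prefix); 'O23_s2_w4.png' -> 'O23'
    return f"{experiment}_{plate_dir[5:]}_{filename[:3]}"

# Illustrative file names for a single well: two sites, two channels each.
files = ["O23_s1_w1.png", "O23_s1_w2.png", "O23_s2_w1.png", "O23_s2_w2.png"]

index = {}
for name in files:
    # The key is identical for every file, so each assignment overwrites the last.
    index[make_id_code("HEPG2-04", "Plate1", name)] = f"HEPG2-04/Plate1/{name}"

print(index)  # {'HEPG2-04_1_O23': 'HEPG2-04/Plate1/O23_s2_w2.png'}
```

Only one image per well survives, and because `os.listdir` returns entries in arbitrary order, which site and channel is kept can vary between runs.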

In [None]:
# Create the modified Test csv file appropriate for the work
dftest = pd.read_csv(path+'/test.csv')
dftest = dftest['id_code']
dictest = {}
for fold1 in os.listdir(path+'/test'):
    for fold2 in os.listdir(path+'/test/'+fold1):
        for image in os.listdir(path+'/test/'+fold1+'/'+fold2):
            dictest[str(fold1)+'_'+fold2[5:]+'_'+image[0:3]] = str(fold1)+'/'+fold2+'/'+image
df = pd.DataFrame(list(dictest.items()), columns=['id_code','foldPath'])
df.to_csv(r'../working/testData.csv', index=None, header=True)

In [None]:
# Set the parameters and create the data for the model
np.random.seed(42)  # fix the seed so the random train/validation split is the same on every run
tfms = get_transforms(flip_vert=True, max_lighting=0.1, max_zoom=1.05, max_warp=0.)
src = (ImageList.from_csv('../', 'working/trainData.csv', folder='input/recursion-cellular-image-classification/train')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim=' '))
data = (src.transform(tfms, size=256)
        .databunch().normalize(imagenet_stats))
arch = models.resnet50
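
The `.normalize(imagenet_stats)` call standardizes each colour channel with the ImageNet mean and standard deviation, which matches what the pretrained ResNet saw during its original training. As a rough illustration of the arithmetic only (not fastai's implementation), a pure-Python sketch for a single pixel could look like this; `normalize_pixel` is a hypothetical name.

```python
# Per-channel ImageNet mean and standard deviation for RGB images scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Standardize one RGB pixel whose channels are already in [0, 1]."""
    return tuple(
        (value - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

pixel = (0.485, 0.456, 0.406)   # a pixel exactly at the ImageNet mean
print(normalize_pixel(pixel))   # (0.0, 0.0, 0.0)
```

Cell microscopy images have different statistics from natural photographs, so these values are a convenience tied to the pretrained weights rather than a property of this dataset.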

### Visualize the Data

In [None]:
# View a single image given its path
img = open_image(path+'/train/HEPG2-04/Plate1/O23_s2_w4.png')
img

In [None]:
# Print the sizes of the training and validation sets, then the number of classes
print('The training and validation sets contain {tran} and {vald} items\n\nThe number of classes is: {clases}'.format(tran=len(data.train_ds),vald=len(data.valid_ds),clases=len(data.classes)))

In [None]:
# View a sample of the dataset
data.show_batch(rows=3, figsize=(7,8))

### Train

In [None]:
# Define the accuracy and error-rate metrics (this accuracy shadows the one imported from fastai.metrics)
def accuracy(input:Tensor, targs:Tensor)->Rank0Tensor:
    "Computes accuracy with `targs` when `input` is bs * n_classes."
    n = targs.shape[0]
    input = input.argmax(dim=-1).view(n,-1)
    targs = targs.view(n,-1)
    return (input==targs.long()).float().mean()

def error_rate(input:Tensor, targs:Tensor)->Rank0Tensor:
    "1 - `accuracy`"
    return 1 - accuracy(input, targs)
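
To make the metric concrete, here is a small pure-Python sketch of the same idea: take the highest-scoring class per row and compare it with the target. The names `argmax` and `accuracy_py` and the toy scores are illustrative only; the notebook itself uses the tensor version above.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_py(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target class."""
    hits = sum(argmax(row) == target for row, target in zip(scores, targets))
    return hits / len(targets)

# Four samples, three classes (made-up scores).
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
targets = [1, 0, 1, 0]

acc = accuracy_py(scores, targets)
print(acc, 1 - acc)  # 0.75 0.25
```

The third sample is predicted as class 2 while its target is class 1, which is the single miss behind the 0.25 error rate.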

In [None]:
# Create a CNN (Convolutional Neural Network) learner from a pretrained ResNet-50, tracking the error rate
learn = cnn_learner(data, arch, metrics=[error_rate])

In [None]:
# Run the learning rate finder
learn.lr_find()

In [None]:
learn.recorder.plot()

In [None]:
# Set the learning rate
lr = 0.01

In [None]:
# Fit the model
learn.fit_one_cycle(5, slice(lr))
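
`fit_one_cycle` follows the one-cycle policy: the learning rate warms up to a peak and then anneals to a much smaller value. The sketch below shows the general shape with cosine interpolation. The constants (30% warm-up, start at `max_lr / 25`, end far below the start) are assumptions about fastai v1's defaults, and `one_cycle_lr` is a hypothetical function, not the library's scheduler.

```python
import math

def cosine_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0):
    """Learning rate at `step`: warm up for pct_start of the run, then anneal."""
    warmup_steps = int(total_steps * pct_start)
    low_lr = max_lr / div_factor
    if step < warmup_steps:
        return cosine_anneal(low_lr, max_lr, step / warmup_steps)
    final_lr = low_lr / 1e4  # assumed: the final rate is far below the starting rate
    remaining = total_steps - warmup_steps
    return cosine_anneal(max_lr, final_lr, (step - warmup_steps) / remaining)

schedule = [one_cycle_lr(s, 100, max_lr=0.01) for s in range(100)]
peak_step = schedule.index(max(schedule))
print(peak_step)  # 30
```

With `lr = 0.01` as above, this sketch starts near 0.0004, peaks at 0.01 after 30% of the steps, and finishes well below its starting value.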

In [None]:
# Save it
learn.save('stage-1-rn50')

### More training

In [None]:

# Unfreeze the model so that all layers, not just the new head, are trained (the pretrained weights are kept as the starting point)
learn.unfreeze()


In [None]:
# Find and plot the learning rate
learn.lr_find()
learn.recorder.plot()

In [None]:
# Fit the model
learn.fit_one_cycle(5, slice(1e-5, lr/5))
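
Passing a `slice` gives different learning rates to different layer groups: early layers change slowly, the head changes fastest. The sketch below reproduces that expansion as I understand fastai v1 to do it (geometric spacing between the two ends, or `stop / 10` for every group but the last when only one value is given). `layer_group_lrs` is a hypothetical name, and three layer groups is an assumption for a ResNet created with `cnn_learner`.

```python
def layer_group_lrs(lr, n_groups=3):
    """Expand a slice into one learning rate per layer group."""
    if lr.start is None:
        # slice(stop): earlier groups get stop / 10, the last group gets stop
        return [lr.stop / 10] * (n_groups - 1) + [lr.stop]
    # slice(start, stop): rates are evenly spaced on a log scale
    step = (lr.stop / lr.start) ** (1 / (n_groups - 1))
    return [lr.start * step ** i for i in range(n_groups)]

lr = 0.01
frozen_lrs = layer_group_lrs(slice(lr))               # first training stage
unfrozen_lrs = layer_group_lrs(slice(1e-5, lr / 5))   # after unfreezing

print(frozen_lrs)
print(unfrozen_lrs)
```

Under these assumptions the unfrozen run uses roughly 1e-5 for the earliest layers, about 1.4e-4 for the middle group, and 2e-3 for the head.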

In [None]:
# Save this latest trained model
learn.save('stage-2-rn50')

In [None]:
# Create a new dataset with image size 500 and a batch size of 6 (bs//4)
bs = 24
tfms = get_transforms(do_flip=True, flip_vert=True, max_lighting=0.2, max_rotate=359, max_zoom=1.05, max_warp=0.2)
src = (ImageList.from_csv('../', 'working/trainData.csv', folder='input/recursion-cellular-image-classification/train')
       .split_by_rand_pct(0.2)
       .label_from_df(label_delim=' '))
data = (src.transform(tfms, size=500)
        .databunch(bs=bs//4).normalize(imagenet_stats))
# Point the learner at the new data
learn.data = data
data.train_ds[0][0].shape

In [None]:
# Freeze and find learning rate
learn.freeze()
learn.lr_find()
learn.recorder.plot()

In [None]:
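# Fit with only the head trainable, then save the model
# (lr is not used by this fit; it sets the upper bound of the final fine-tuning run)
lr = 1e-2/2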
learn.fit_one_cycle(8, max_lr=slice(1e-6,1e-4))
learn.save('stage-1-256-rn50')

In [None]:
# Unfreeze and find the learning rate
learn.unfreeze()
learn.lr_find()
learn.recorder.plot()

In [None]:
# Fit the model using the default maximum learning rate
learn.fit_one_cycle(10)

In [None]:
# Keep the model unfrozen, train more, then save
learn.unfreeze()
learn.fit_one_cycle(10, slice(1e-5, lr/5))
learn.save('stage-3-256-rn50')

### Export the Model

In [None]:
learn.export()

### Test the Model

In [None]:
test = ImageList.from_csv('../', 'working/testData.csv', cols='foldPath', folder='input/recursion-cellular-image-classification/test')
learn = load_learner('../', test=test)

In [None]:
# Get predictions for the test set and map each one to its class label
preds,_ = learn.get_preds(ds_type=DatasetType.Test)
labelled_preds = [learn.data.classes[(pred).tolist().index(max((pred).tolist()))] for pred in preds]
# Alternatively, you can replace the second line with the two lines of code below
#labels = np.argmax(preds, 1)
#labelled_preds = [data.classes[int(x)] for x in labels]
#print(labelled_preds)
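
Both ways of labelling the predictions shown above do the same thing: find the position of the largest score in each row and look it up in the class list. A pure-Python sketch with made-up class names and scores (both hypothetical) shows the equivalence.

```python
classes = ["sirna_a", "sirna_b", "sirna_c"]     # hypothetical class names
preds = [[0.2, 0.5, 0.3], [0.9, 0.05, 0.05]]    # made-up scores, one row per image

# Approach 1: position of the maximum via list.index
via_index = [classes[row.index(max(row))] for row in preds]

# Approach 2: an explicit argmax over positions
via_argmax = [classes[max(range(len(row)), key=row.__getitem__)] for row in preds]

print(via_index)  # ['sirna_b', 'sirna_a']
```

The class list matters because the position of a score is an index into `classes`, which is not guaranteed to equal the label value itself.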

In [None]:
# Create the id_code for submission from the path of the test file
lsttest = []
for item in learn.data.test_ds.items:
    lst = item.split('/')[-3:]
    lsttest.append(str(lst[0])+'_'+lst[1][5:]+'_'+lst[-1].split('_')[0])
df = pd.DataFrame(lsttest, columns=['id_code'])
print(df.head())
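
The loop above rebuilds each `id_code` from the last three components of the image path. As a self-contained sketch of that parsing step, assuming items are `/`-separated strings as the loop does: `id_code_from_item` is a hypothetical helper and the example path is illustrative.

```python
def id_code_from_item(item: str) -> str:
    """Turn '.../<experiment>/Plate<n>/<well>_s<site>_w<channel>.png' into an id_code."""
    experiment, plate_dir, filename = item.split("/")[-3:]
    well = filename.split("_")[0]                   # 'B03_s1_w1.png' -> 'B03'
    return f"{experiment}_{plate_dir[5:]}_{well}"   # 'Plate1' -> '1'

# Illustrative path; the experiment, plate and well names are made up.
example = "../input/recursion-cellular-image-classification/test/HEPG2-08/Plate1/B03_s1_w1.png"
print(id_code_from_item(example))  # HEPG2-08_1_B03
```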

In [None]:
# Merge the predicted labels with the id_codes from test.csv (inner join, in test.csv order), then save the result for submission
path = '../input/recursion-cellular-image-classification'
dftestcsv = pd.read_csv(path+'/test.csv')

tes = OrderedDict([('id_code',lsttest), ('sirna', labelled_preds)] )
df = pd.DataFrame.from_dict(tes)

dftestcsv = pd.DataFrame(list(dftestcsv['id_code']), columns=['id_code'])
dftestdfcsv = pd.merge(dftestcsv, df)
dftestdfcsv.to_csv('../working/submission.csv', index=False)
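
`pd.merge` with its default settings performs an inner join on the shared `id_code` column, so rows of `test.csv` without a prediction are silently dropped. The standard-library sketch below mimics that join with made-up ids and labels, and adds a check for missing rows that may be worth running before submitting.

```python
# Order of ids in test.csv (made-up values).
test_ids = ["HEPG2-08_1_B03", "HEPG2-08_1_B04", "HEPG2-08_1_B05"]

# id_code -> predicted sirna, in whatever order the predictions were produced.
predicted = {"HEPG2-08_1_B05": "12", "HEPG2-08_1_B03": "7"}

# Inner join: keep test.csv order, drop ids that have no prediction.
submission = [(id_code, predicted[id_code]) for id_code in test_ids if id_code in predicted]

# Ids that would be missing from the submission file.
missing = [id_code for id_code in test_ids if id_code not in predicted]

print(submission)  # [('HEPG2-08_1_B03', '7'), ('HEPG2-08_1_B05', '12')]
print(missing)     # ['HEPG2-08_1_B04']
```

In the notebook, comparing `len(dftestdfcsv)` with the number of rows in `test.csv` gives the same check.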

In [None]:
# View the head and tail of the predicted test file
print(dftestdfcsv.head())
print(dftestdfcsv.tail())

Thank you