# About this kernel

This is a rather quick and dirty kernel I created with two ideas in mind: training a "2-headed" network that learns to predict the siRNA using images from both sites at the same time, and splitting the learning process into two stages, namely first training on all data, then training the CNN on data from a single cell line at a time. The second idea comes from [this thread by Phalanx](https://www.kaggle.com/c/recursion-cellular-image-classification/discussion/100414#latest-586901). The data comes from my previous kernel on preprocessing.

Here are the relevant sections:
* **Data Generator**: The `__generate_X` method differs from the usual one, since it loads two images (site 1 and site 2) for every sample. Everything else is standard.
* **Model**: The CNN architecture used here is `EfficientNetB3`. Other variants (B1-B5) might work with the right learning rates and enough training time, although they did not succeed in my runs. The inputs are two images, i.e. from site 1 and site 2. The two images are passed through the same CNN, then global-average-pooled, and added to form a single 1536-dimensional vector, which is ultimately used to perform predictions. Because the weights are shared, the single network is updated with the gradients from both sites at once.
* **Phase 1**: Train the model on all data for 20 epochs, and save the best checkpoint (lowest validation loss) to `model.h5`.
* **Phase 2**: Load `model.h5` and train the model for 10 epochs on data from a single cell line at a time, i.e. *HEPG2, HUVEC, RPE, U2OS*.
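
The shared-backbone idea described in the **Model** bullet can be illustrated with a small NumPy sketch. Here `backbone` is a hypothetical stand-in (a fixed random 1x1 projection followed by ReLU), not EfficientNet, and the image sizes are made up. It only shows the data flow: same weights for both sites, global average pooling, then element-wise addition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the shared CNN: a fixed 1x1 projection + ReLU.
W = rng.normal(size=(3, 8))  # 3 input channels -> 8 feature channels


def backbone(img):
    """Map an (H, W, 3) image to an (H, W, 8) feature map."""
    return np.maximum(img @ W, 0.0)


def global_average_pool(feature_map):
    """Average over the two spatial axes, leaving one value per channel."""
    return feature_map.mean(axis=(0, 1))


def two_site_embedding(site1, site2):
    """Run both sites through the same weights, pool, and add."""
    return (global_average_pool(backbone(site1))
            + global_average_pool(backbone(site2)))


site1 = rng.random((4, 4, 3))
site2 = rng.random((4, 4, 3))

emb_12 = two_site_embedding(site1, site2)
emb_21 = two_site_embedding(site2, site1)
order_invariant = np.allclose(emb_12, emb_21)
```

One consequence of this design, at least in the sketch: since the pooled vectors are added and the weights are shared, the embedding does not depend on which site goes to which input. Swapping the two images at load time should therefore leave the prediction unchanged.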

## Changelog

* V20: Added random flipping.
* Data is taken from https://www.kaggle.com/xhlulu/recursion-2019-load-resize-and-save-images instead of the original data folder.

In [25]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

for dirname, _, filenames in os.walk('../input/recursion-cellular-image-classification-224-jpg/train/train'): #/kaggle/input ../tmp/train_resized
    for filename in filenames:
        print(os.path.join(dirname, filename))
        break

#/kaggle/input/recursion-2019-load-resize-and-save-images/test.zip
#/kaggle/input/efficientnet-keras-weights-b0b5/efficientnet-b2_imagenet_1000_notop.h5
#/kaggle/input/recursion-cellular-image-classification-224-jpg/new_test.csv
#/kaggle/input/recursion-cellular-image-classification-224-jpg/test/test/HUVEC-23_4_F06_s2.jpeg

../input/recursion-cellular-image-classification-224-jpg/train/train/HEPG2-06_2_O08_s1.jpeg


In [1]:
!pip install efficientnet
import efficientnet

Collecting efficientnet
  Downloading https://files.pythonhosted.org/packages/97/82/f3ae07316f0461417dc54affab6e86ab188a5a22f33176d35271628b96e0/efficientnet-1.0.0-py3-none-any.whl
Installing collected packages: efficientnet
Successfully installed efficientnet-1.0.0


In [3]:
!pip install --upgrade scikit-image

Collecting scikit-image
  Downloading https://files.pythonhosted.org/packages/cb/5a/abd74bd5ce791e2ab0b6fd88b144c42dbc88b3b1d963147417d0e163684b/scikit_image-0.16.2-cp37-cp37m-win_amd64.whl (25.7MB)
Collecting networkx>=2.0 (from scikit-image)
  Downloading https://files.pythonhosted.org/packages/41/8f/dd6a8e85946def36e4f2c69c84219af0fa5e832b018c970e92f2ad337e45/networkx-2.4-py3-none-any.whl (1.6MB)
Installing collected packages: networkx, scikit-image
  Found existing installation: networkx 1.11
    Uninstalling networkx-1.11:
      Successfully uninstalled networkx-1.11
  Found existing installation: scikit-image 0.14.1
    Uninstalling scikit-image-0.14.1:
      Successfully uninstalled scikit-image-0.14.1


Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'C:\\Users\\Admin\\AppData\\Local\\Temp\\pip-uninstall-ldjxyx5_\\anaconda3\\lib\\site-packages\\skimage\\_shared\\geometry.cp37-win_amd64.pyd'
Consider using the `--user` option or check the permissions.



In [4]:
import json
import math
import os
import cv2
import tensorflow as tf
from PIL import Image
import numpy as np
import keras
from keras import layers
from keras.applications import MobileNetV2
from keras.callbacks import Callback, ModelCheckpoint
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Model, load_model
from keras.layers import GlobalAveragePooling2D, Dense, Dropout, BatchNormalization, concatenate, Input, add
from keras.optimizers import Adam
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score, accuracy_score
import scipy
from tqdm import tqdm
from efficientnet.keras import EfficientNetB3

In [2]:
!mkdir ../tmp
!unzip -q ../input/recursion-2019-load-resize-and-save-images/train.zip -d ../tmp
!unzip -q ../input/recursion-2019-load-resize-and-save-images/test.zip -d ../tmp 

In [44]:
load_path = 'C:/Temp/recursion2/train'
save_path = 'C:/Temp/recursion2/train_resized'
if not os.path.exists(save_path):
    os.makedirs(save_path)

for code in tqdm(os.listdir(load_path)):
    path = f'{load_path}/{code}'
    
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (300, 300))
    
    cv2.imwrite(f'{save_path}/{code}', img)

100%|███████████████████████████████████████████████████████████████████████████| 73030/73030 [10:38<00:00, 114.38it/s]


# Preprocessing

In [7]:
train_df = pd.read_csv('C:/Temp/recursion2/new_train.csv')
test_df = pd.read_csv('C:/Temp/recursion2/new_test.csv')

train_df['category'] = train_df['experiment'].apply(lambda x: x.split('-')[0])
test_df['category'] = test_df['experiment'].apply(lambda x: x.split('-')[0])

train_target_df = pd.get_dummies(train_df['sirna'])

print(train_df.shape)
print(test_df.shape)
print(train_target_df.shape)

train_df.head()

(73030, 7)
(39794, 6)
(73030, 1108)


Unnamed: 0,id_code,experiment,plate,well,sirna,filename,category
0,HEPG2-01_1_B03,HEPG2-01,1,B03,513,HEPG2-01_1_B03_s1.jpeg,HEPG2
1,HEPG2-01_1_B04,HEPG2-01,1,B04,840,HEPG2-01_1_B04_s1.jpeg,HEPG2
2,HEPG2-01_1_B05,HEPG2-01,1,B05,1020,HEPG2-01_1_B05_s1.jpeg,HEPG2
3,HEPG2-01_1_B06,HEPG2-01,1,B06,254,HEPG2-01_1_B06_s1.jpeg,HEPG2
4,HEPG2-01_1_B07,HEPG2-01,1,B07,144,HEPG2-01_1_B07_s1.jpeg,HEPG2


In [8]:
train_idx, val_idx = train_test_split(
    train_df.index, test_size=0.15, random_state=2019
)

print(train_idx.shape)
print(val_idx.shape)
#(31037,)
#(5478,)

(62075,)
(10955,)
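
One thing that may be worth checking before the split: `train_df` has 73030 rows, the same as the number of image files that were resized, and each row names one site image in its `filename` column. If every `id_code` therefore appears once per site, a row-level random split can put the same well in both the training and validation sets, because the generator loads both sites from `id_code` alone. A minimal sketch of a guard, using made-up `id_code` values in place of the real column, keeps only the first row position of each well:

```python
import numpy as np

# Toy stand-in for train_df['id_code'].values (hypothetical values):
# every well is listed twice, once per site image.
id_codes = np.array([
    'HEPG2-01_1_B03', 'HEPG2-01_1_B04', 'HEPG2-01_1_B05',
    'HEPG2-01_1_B03', 'HEPG2-01_1_B04', 'HEPG2-01_1_B05',
])

# Row position of the first occurrence of every distinct id_code
_, first_positions = np.unique(id_codes, return_index=True)
unique_positions = np.sort(first_positions)

# How many rows would be redundant for a generator that loads both sites
n_duplicates = len(id_codes) - len(unique_positions)
```

Under that assumption, `unique_positions` could be passed to `train_test_split` in place of `train_df.index`, since the generator treats its IDs as row positions.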


# Data Generator

In [53]:
class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, df, target_df=None, mode='fit',
                 base_path = 'C:/Temp/recursion2/train_resized',
                 batch_size=32, dim=(224, 224), n_channels=3, ext='jpeg',
                 rotation_range=0, fill_mode='nearest', swap=False,
                 vertical_flip=False, horizontal_flip=False, rescale=1/255.,
                 n_classes=5, random_state=2019, shuffle=True):
        self.dim = dim
        self.batch_size = batch_size
        self.df = df
        self.mode = mode
        self.base_path = base_path
        self.rotation_range=rotation_range
        self.target_df = target_df
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.shuffle = shuffle
        self.ext = ext
        self.rescale = rescale
        self.vertical_flip = vertical_flip
        self.horizontal_flip = horizontal_flip
        self.random_state = random_state
        self.swap = swap
        
        self.fill_mode = self.__compute_fill_mode(fill_mode)
        
        np.random.seed(self.random_state)
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_batch = [self.list_IDs[k] for k in indexes]
        
        X = self.__generate_X(list_IDs_batch)
        
        if self.mode == 'fit':
            y = self.__generate_y(list_IDs_batch)
            return X, y
        
        elif self.mode == 'predict':
            return X
        else:
            raise AttributeError('The parameter mode should be set to "fit" or "predict".')
        
    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)
    
    def __generate_X(self, list_IDs_batch):
        'Generates data containing batch_size samples'
        # Initialization
        X_1 = np.empty((self.batch_size, *self.dim, self.n_channels))
        X_2 = np.empty((self.batch_size, *self.dim, self.n_channels))
        
        # Generate data
        for i, ID in enumerate(list_IDs_batch):
            code = self.df['id_code'].iloc[ID]
            
            img_path_1 = f"{self.base_path}/{code}_s1.{self.ext}"
            img_path_2 = f"{self.base_path}/{code}_s2.{self.ext}"
            
            img1 = self.__load_image(img_path_1)
            img2 = self.__load_image(img_path_2)
            
            if self.swap and np.random.rand() > 0.5:
                img1, img2 = img2, img1
            
            # Store samples
            X_1[i,] = img1
            X_2[i,] = img2

        return [X_1, X_2]
    
    def __generate_y(self, list_IDs_batch):
        y = np.empty((self.batch_size, self.n_classes), dtype=int)
        
        for i, ID in enumerate(list_IDs_batch):
            sirna = self.target_df.iloc[ID]
            y[i, ] = sirna
        
        return y
    
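    def __load_image(self, img_path):
        img = cv2.imread(img_path)
        if img is None:
            raise FileNotFoundError(f'Could not read image: {img_path}')
        img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
        img = self.rescale * img.astype(np.float32)

        # Apply the configured flips/rotation only when training
        if self.mode == 'fit':
            img = self.__random_transform(img)

        return img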
    
    def __compute_fill_mode(self, fill_mode):
        convert_cv2 = {
            'nearest': cv2.BORDER_REPLICATE,
            'reflect': cv2.BORDER_REFLECT,
            'wrap': cv2.BORDER_WRAP,
            'constant': cv2.BORDER_CONSTANT
        }
        
        return convert_cv2[fill_mode]
    
    def __random_transform(self, img):
        if np.random.rand() > 0.5 and self.vertical_flip:
            img = cv2.flip(img, 0)
        if np.random.rand() > 0.5 and self.horizontal_flip:
            img = cv2.flip(img, 1)
        
        # Random Rotation
        rotation = self.rotation_range * np.random.rand()
        
        rows,cols = img.shape[:2]
        M = cv2.getRotationMatrix2D((cols/2,rows/2),rotation,1)
        img = cv2.warpAffine(img,M,(cols,rows), borderMode=self.fill_mode)
        
        return img
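
A short note on `__len__`: it rounds down, so any samples that do not fill a complete batch are skipped in a given epoch. The sketch below reproduces that arithmetic for the split sizes printed earlier; the helper names are made up for illustration.

```python
def batches_per_epoch(n_samples, batch_size):
    """Full batches only, mirroring the floor division in DataGenerator.__len__."""
    return n_samples // batch_size


def dropped_per_epoch(n_samples, batch_size):
    """Samples left over after the last full batch."""
    return n_samples - batches_per_epoch(n_samples, batch_size) * batch_size


BATCH_SIZE = 32

train_batches = batches_per_epoch(62075, BATCH_SIZE)  # 1939
train_dropped = dropped_per_epoch(62075, BATCH_SIZE)  # 27
val_batches = batches_per_epoch(10955, BATCH_SIZE)    # 342
val_dropped = dropped_per_epoch(10955, BATCH_SIZE)    # 11

# With batch_size=1, as used for the test generator, nothing is dropped.
test_dropped = dropped_per_epoch(39794, 1)            # 0
```

With shuffling enabled, the skipped samples change from epoch to epoch, so every training sample is still seen over time. Using `batch_size=1` for prediction keeps the number of predictions equal to the number of test rows.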

In [54]:
BATCH_SIZE = 32
train_generator = DataGenerator(
    train_idx, 
    df=train_df,
    target_df=train_target_df,
    batch_size=BATCH_SIZE, 
    vertical_flip=True,
    horizontal_flip=True,
    swap=True,
    dim=(300, 300),
    base_path='C:/Temp/recursion2/train_resized',
    rotation_range=15,
    n_classes=train_target_df.shape[1]
)

val_generator = DataGenerator(
    val_idx, 
    df=train_df,
    target_df=train_target_df,
    batch_size=BATCH_SIZE, 
    vertical_flip=True,
    horizontal_flip=True,
    swap=True,
    dim=(300, 300),
    base_path='C:/Temp/recursion2/train_resized',
    rotation_range=15,
    n_classes=train_target_df.shape[1]
)

test_generator = DataGenerator(
    test_df.index, 
    df=test_df,
    batch_size=1, 
    shuffle=False,
    mode='predict',
    n_classes=train_target_df.shape[1],
    dim=(300, 300),
    base_path='C:/Temp/recursion2/test_resized'
)

In [47]:
#train_generator.__getitem__(0)
indexes = train_generator.indexes[0:1*train_generator.batch_size]
indexes
# Find list of IDs
#list_IDs_batch = [self.list_IDs[k] for k in indexes]
#X = self.__generate_X(list_IDs_batch)
list_IDs_batch = [train_generator.list_IDs[k] for k in indexes]
list_IDs_batch[0] #45582
#X = train_generator.testX([45582])
code = train_generator.df['id_code'].iloc[45582]

img_path = f"{train_generator.base_path}/{code}_s1.{train_generator.ext}"
img = cv2.imread(img_path)
img = cv2.cvtColor(img,cv2.COLOR_BGR2RGB)
#img = self.rescale * img.astype(np.float32)

array([36029, 42412, 23803, 60092, 20763, 20669, 53556, 44538, 60681,
       37372, 43207, 20833, 26325, 39182, 32340, 32969, 58059, 36384,
        3301, 13000, 47938, 22335, 61204, 17860, 38108, 11471, 56307,
        7361, 45170, 34302,  6503,  8774])

# Model Build

In [11]:
def build_model(n_classes, input_shape=(224, 224, 3)):
    # Load the EfficientNetB3 backbone (ImageNet weights, no classification head)
    backbone = EfficientNetB3(
        weights='imagenet', 
        include_top=False,
        input_shape=input_shape
    )
    
    im_inp_1 = Input(shape=input_shape)
    im_inp_2 = Input(shape=input_shape)

    x1 = backbone(im_inp_1)
    x2 = backbone(im_inp_2)

    x1 = GlobalAveragePooling2D()(x1)
    x2 = GlobalAveragePooling2D()(x2)

    out = add([x1, x2])
    out = Dropout(0.5)(out)

    out = Dense(n_classes, activation='softmax')(out)

    model = Model(inputs=[im_inp_1, im_inp_2], outputs=out)
    
    model.compile(Adam(0.0001), loss='categorical_crossentropy', metrics=['accuracy'])
    
    return model

In [19]:
model = build_model(
    input_shape=(300, 300, 3),
    n_classes=train_target_df.shape[1]
)
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_6 (InputLayer)            (None, 300, 300, 3)  0                                            
__________________________________________________________________________________________________
input_7 (InputLayer)            (None, 300, 300, 3)  0                                            
__________________________________________________________________________________________________
efficientnet-b3 (Model)         (None, 10, 10, 1536) 10783528    input_6[0][0]                    
                                                                 input_7[0][0]                    
__________________________________________________________________________________________________
global_average_pooling2d_3 (Glo (None, 1536)         0           efficientnet-b3[1][0]      

# Train Model

In [55]:
checkpoint = ModelCheckpoint(
    'model.h5', 
    monitor='val_loss', 
    verbose=1, 
    save_best_only=True, 
    save_weights_only=False,
    mode='auto'
)

history = model.fit_generator(
    train_generator,
    validation_data=val_generator,
    callbacks=[checkpoint],
    use_multiprocessing=False,
    workers=1,
    verbose=1,
    epochs=20
)

Epoch 1/20


KeyboardInterrupt: 

In [None]:
with open('history.json', 'w') as f:
    json.dump(history.history, f)

history_df = pd.DataFrame(history.history)
history_df[['loss', 'val_loss']].plot()
history_df[['acc', 'val_acc']].plot()
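
The accuracy keys in `history.history` depend on the Keras version: older releases record `acc`/`val_acc`, while newer ones record `accuracy`/`val_accuracy` when the metric is given as `'accuracy'`. A small sketch of a version-tolerant lookup follows; the helper name and the sample dictionary are made up for illustration.

```python
def accuracy_keys(history_dict):
    """Return the (train, validation) accuracy keys present in a history dict."""
    for train_key in ('acc', 'accuracy'):
        if train_key in history_dict:
            return train_key, f'val_{train_key}'
    raise KeyError('No accuracy metric found in the training history.')


# Hypothetical history contents, standing in for history.history
example_history = {
    'loss': [6.9, 6.5],
    'val_loss': [6.8, 6.6],
    'accuracy': [0.001, 0.004],
    'val_accuracy': [0.002, 0.003],
}

train_key, val_key = accuracy_keys(example_history)
```

The returned pair could then be used as `history_df[[train_key, val_key]].plot()`.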

# Phase 2: train on each cell line

In [None]:
categories = train_df['category'].unique()
output_df = []

for category in categories:
    # Retrieve desired category
    category_df = train_df[train_df['category'] == category]
    cat_test_df = test_df[test_df['category'] == category].copy()
    
    print('\n' + '=' * 40)
    print("CURRENT CATEGORY:", category)
    print('-' * 40)
    
    train_idx, val_idx = train_test_split(
        category_df.index, 
        random_state=2019,
        test_size=0.15
    )
    
    # Create new generators
    train_generator = DataGenerator(
        train_idx, 
        df=train_df,
        target_df=train_target_df,
        batch_size=BATCH_SIZE, 
        vertical_flip=True,
        horizontal_flip=True,
        swap=True,
        rotation_range=15,
        dim=(300, 300),
        base_path='../tmp/train_resized',
        n_classes=train_target_df.shape[1]
    )

    val_generator = DataGenerator(
        val_idx, 
        df=train_df,
        target_df=train_target_df,
        batch_size=BATCH_SIZE, 
        vertical_flip=True,
        horizontal_flip=True,
        swap=True,
        rotation_range=15,
        dim=(300, 300),
        base_path='../tmp/train_resized',
        n_classes=train_target_df.shape[1]
    )

    test_generator = DataGenerator(
        cat_test_df.index, 
        df=test_df,
        batch_size=1, 
        shuffle=False,
        mode='predict',
        n_classes=train_target_df.shape[1],
        dim=(300, 300),
        base_path='../tmp/test_resized'
    )

    # Restore previously trained model
    model.load_weights('model.h5')
    model.compile(
        Adam(0.0001), 
        loss='categorical_crossentropy', 
        metrics=['accuracy']
    )

    # Train model only on data for specific category
    checkpoint = ModelCheckpoint(
        f'model_{category}.h5', 
        monitor='val_loss', 
        verbose=0, 
        save_best_only=True, 
        save_weights_only=False,
        mode='auto'
    )

    history_category = model.fit_generator(
        train_generator,
        validation_data=val_generator,
        callbacks=[checkpoint],
        use_multiprocessing=False,
        workers=1,
        verbose=2,
        epochs=10
    )

    # Make prediction and add to output dataframe
    y_pred = model.predict_generator(
        test_generator,
        workers=2,
        use_multiprocessing=True,
        verbose=1
    )

    cat_test_df['sirna'] = y_pred.argmax(axis=1)
    output_df.append(cat_test_df[['id_code', 'sirna']])

    # Save history
    with open(f'history_{category}.json', 'w') as f:
        json.dump(history_category.history, f)

# Submission

In [None]:
output_df = pd.concat(output_df)
output_df.to_csv('submission.csv', index=False)
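
A note on the label mapping used in Phase 2: `y_pred.argmax(axis=1)` returns column positions in the one-hot target matrix. These equal the siRNA labels only if the columns of `train_target_df` are exactly 0, 1, ..., 1107 in order, which is likely here but is an assumption. A minimal sketch of an explicit mapping, with made-up labels and probabilities, looks like this:

```python
import numpy as np

# Hypothetical class labels, standing in for train_target_df.columns
class_labels = np.array([3, 7, 42])

# Hypothetical softmax outputs for four test wells
y_pred = np.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
])

# Column position of the highest probability for every row
column_positions = y_pred.argmax(axis=1)

# Translate positions back into the original labels
predicted_sirna = class_labels[column_positions]
```

When the columns already run from 0 to 1107 in order, the lookup is a no-op, so it costs nothing to keep as a safeguard.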