# Build a base model for the Human Protein Atlas - Single Cell Classification Competition using TensorFlow and Keras

## Introduction

Key insights from the previous competition, taken from the sections "Strategies used by the top-ranking solutions" and "Assessing the biological relevance of the winning model with class activation maps (CAMs)" of Ouyang et al., Nature Methods (2019) [1]:

1. Data augmentation such as random cropping, rotation, and flipping might improve model performance.
2. Modifying the loss function might improve model performance.
3. The DenseNet architecture was more effective than ResNet.
4. Medium-sized networks worked better than larger ones (for example, DenseNet121 performed better than DenseNet169).
5. Using larger image sizes might improve scores.
6. Model ensembling and stacking might improve performance.
7. Class activation maps (CAMs) can be used to visualize the model's spatial attention.


Articles: 

[1] Ouyang, W., Winsnes, C.F., Hjelmare, M. et al. Analysis of the Human Protein Atlas Image Classification competition. Nat Methods 16, 1254–1261 (2019). https://doi.org/10.1038/s41592-019-0658-6

Notebooks:

(1) [DenseNet Trained with Old and New Data](https://www.kaggle.com/raimonds1993/aptos19-densenet-trained-with-old-and-new-data) by Federico Raimondi.

(2) [Tutorial on Keras ImageDataGenerator with flow_from_dataframe](https://vijayabhaskar96.medium.com/tutorial-on-keras-imagedatagenerator-with-flow-from-dataframe-8bd5776e45c1) by Vijayabhaskar J. 


Datasets:

(1) [HPA cell tiles sample balanced dataset: individual cells as RGB jpg images for rapid experimentation](https://www.kaggle.com/thedrcat/hpa-cell-tiles-sample-balanced-dataset) by Darek Kłeczek. This is a single-cell image version of the original dataset listed below.

(2) [Human Protein Atlas - Single Cell Classification Dataset](https://www.kaggle.com/c/hpa-single-cell-image-classification/data).

Package documentation:

(1) [Keras DenseNet121](https://keras.io/api/applications/densenet).

(2) [TensorFlow Module: tf.keras.layers.experimental.preprocessing](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/).

(3) [TensorFlow Data augmentation](https://www.tensorflow.org/tutorials/images/data_augmentation).

(4) [TensorFlow Image classification](https://www.tensorflow.org/tutorials/images/classification).

(5) [TensorFlow Image dataset from directory](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image_dataset_from_directory).

(6) [scikit-learn MultiLabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer).


Tasks:

1. Preprocessing:

(1.1) Get unique single-cell image identifiers and multilabels.

(1.2) Train and validation split.

(1.3) Configure dataset for performance.

2. Model definition.

3. Training.

4. Evaluation.

In [None]:
# libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import warnings 
import os
import gc
import cv2
import shutil
import random
from tqdm.notebook import tqdm
from PIL import Image, ImageDraw
from sklearn.preprocessing import MultiLabelBinarizer

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers import RMSprop


%matplotlib inline
warnings.filterwarnings('ignore')

In [None]:
# directories 
CELL_IMG='../input/hpa-cell-tiles-sample-balanced-dataset/cells/'
CELL_DF='../input/hpa-cell-tiles-sample-balanced-dataset/cell_df.csv'

## 1. Pre-processing

### (1.1) Get unique single-cell image identifiers and multilabels

In [None]:
# loads train dataframe
train_df=pd.read_csv(CELL_DF)
train_df.head(n=10)

In [None]:

# split the label column into a list of label strings, e.g. "0|16" -> ["0", "16"]
train_df["image_labels"] = train_df["image_labels"].str.split("|")

# class labels as strings, "0" to "18"
class_labels = [str(i) for i in range(19)]

# binarize each label/class (exact list membership, so "1" does not match "10")
for label in tqdm(class_labels):
    train_df[label] = train_df['image_labels'].map(lambda result: 1 if label in result else 0)

# rename the columns; this assumes the CSV columns arrive in exactly this order,
# followed by the 19 binarized label columns created above
train_df.columns = ['image_id', 'r_mean', 'g_mean', 'b_mean', 'cell_id', 'image_labels', 'size1', 'size2', 'Nucleoplasm', 'Nuclear membrane', 'Nucleoli', 'Nucleoli fibrillar center',
                    'Nuclear speckles', 'Nuclear bodies', 'Endoplasmic reticulum', 'Golgi apparatus', 'Intermediate filaments',
                    'Actin filaments', 'Microtubules', 'Mitotic spindle', 'Centrosome', 'Plasma membrane', 'Mitochondria',
                    'Aggresome', 'Cytosol', 'Vesicles and punctate cytosolic patterns', 'Negative']
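
The order of operations in the cell above matters: the label string is split into a list before the membership test. A minimal sketch with made-up label strings (not taken from the dataset) shows why, and builds the same kind of multi-hot rows in plain Python. The helper name `multi_hot` is hypothetical.

```python
# Toy label strings in the "a|b|c" format of the image_labels column (made-up values).
raw_labels = ["0|16", "10", "1|11|18"]
class_labels = [str(i) for i in range(19)]

def multi_hot(label_string, classes):
    """Return a 0/1 list with one entry per class for a '|'-separated label string."""
    present = set(label_string.split("|"))
    return [1 if c in present else 0 for c in classes]

encoded = [multi_hot(s, class_labels) for s in raw_labels]

# Substring matching on the raw string would wrongly report class "1" for "10".
substring_hit = "1" in "10"               # substring test on the raw string
list_hit = "1" in "10".split("|")         # exact membership test on the split list

print(encoded[1])
print(substring_hit, list_hit)
```

scikit-learn's `MultiLabelBinarizer`, listed under the package documentation, produces the same kind of encoding from lists of labels.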

In [None]:
train_df.head()

In [None]:
# create a unique identifier for each single-cell image: <image_id>_<cell_id>.jpg

multinames = ['id', 'r_mean', 'g_mean', 'b_mean', 'image_labels', 'size1', 'size2', 'Nucleoplasm', 'Nuclear membrane', 'Nucleoli', 'Nucleoli fibrillar center',
                    'Nuclear speckles', 'Nuclear bodies', 'Endoplasmic reticulum', 'Golgi apparatus', 'Intermediate filaments',
                    'Actin filaments', 'Microtubules', 'Mitotic spindle', 'Centrosome', 'Plasma membrane', 'Mitochondria',
                    'Aggresome', 'Cytosol', 'Vesicles and punctate cytosolic patterns', 'Negative']
# work on a copy so that train_df itself is left unchanged
cell_df = train_df.copy()
cell_df["id"] = cell_df['image_id'] + '_' + cell_df['cell_id'].astype(str) + '.jpg'
cell_df = cell_df.drop(columns=['image_id', 'cell_id'])
# put the new id column first
cell_df = cell_df.reindex(columns=multinames)
cell_df.head()
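
Before training, it can be worth checking that every constructed `id` matches a file on disk. The sketch below uses a temporary directory and made-up file names in place of the real `cells/` folder; `missing_files` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile

def missing_files(ids, directory):
    """Return the ids that have no matching file in `directory`."""
    on_disk = set(os.listdir(directory))
    return [i for i in ids if i not in on_disk]

# Made-up identifiers following the <image_id>_<cell_id>.jpg pattern.
ids = ["imgA_1.jpg", "imgA_2.jpg", "imgB_1.jpg"]

with tempfile.TemporaryDirectory() as tmp:
    # Create only two of the three files to simulate a missing tile.
    for name in ids[:2]:
        open(os.path.join(tmp, name), "w").close()
    missing = missing_files(ids, tmp)

print(missing)
```

With the notebook's variables the call would be `missing_files(cell_df['id'], CELL_IMG)`. This matters because `flow_from_dataframe` validates file names by default and ignores rows whose file is not found, and the warning filter set at the top of the notebook may hide the accompanying message.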

In [None]:
# sort rows by id so that they follow the file order in the cells folder
cell_df = cell_df.sort_values('id')
cell_df.head()

In [None]:
# define multilabels for training
multilabels = ['Nucleoplasm', 'Nuclear membrane', 'Nucleoli', 'Nucleoli fibrillar center',
                    'Nuclear speckles', 'Nuclear bodies', 'Endoplasmic reticulum', 'Golgi apparatus', 'Intermediate filaments',
                    'Actin filaments', 'Microtubules', 'Mitotic spindle', 'Centrosome', 'Plasma membrane', 'Mitochondria',
                    'Aggresome', 'Cytosol', 'Vesicles and punctate cytosolic patterns', 'Negative']
print( len(multilabels), '\n')

### (1.2) Train and validation split

Use the Keras `ImageDataGenerator.flow_from_dataframe` method, as in this [notebook](https://www.kaggle.com/minniekabra/code-3may).

In [None]:
# constant parameters
IMG_SIZE = 224
BATCH_SIZE = 32

In [None]:
# image generator; rescaling to [0, 1] is performed in a pre-processing layer of the model below,
# so no rescale factor is set here (setting both would divide the pixel values by 255 twice)
image_generator = image.ImageDataGenerator(
    data_format='channels_last',
    preprocessing_function=None,
    validation_split=0.2
)

In [None]:
# train set data flow from dataframe
train_data = image_generator.flow_from_dataframe(
    cell_df,
    directory=CELL_IMG,
    x_col='id',
    y_col=multilabels,
    class_mode='raw',
    color_mode='rgb',
    target_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    seed=123,
    subset='training'
)

In [None]:
# validation set data flow from dataframe
validation_data = image_generator.flow_from_dataframe(
    cell_df,
    directory=CELL_IMG,
    x_col='id',
    y_col=multilabels,
    class_mode='raw',
    color_mode='rgb',
    target_size=(IMG_SIZE, IMG_SIZE),
    batch_size=BATCH_SIZE,
    seed=123,
    subset='validation'
)
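
The `validation_split` argument leaves the assignment of rows to the two subsets to Keras. If explicit control is wanted, for example to guarantee that all cells cropped from one source image end up in the same subset, the split can be made on image identifiers first. A minimal NumPy sketch with made-up identifiers follows; `split_by_group` is a hypothetical helper and the 0.2 fraction simply mirrors the value used above.

```python
import numpy as np

def split_by_group(groups, val_fraction=0.2, seed=123):
    """Return boolean masks (train, val) such that no group spans both subsets."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    rng = np.random.default_rng(seed)
    rng.shuffle(unique)                      # random order of the groups
    n_val = max(1, int(round(val_fraction * len(unique))))
    val_mask = np.isin(groups, unique[:n_val])
    return ~val_mask, val_mask

# Made-up image ids: each entry is one cell, several cells share an image.
image_ids = ["a", "a", "b", "b", "b", "c", "d", "d", "e", "e"]
train_mask, val_mask = split_by_group(image_ids)

train_ids = set(np.asarray(image_ids)[train_mask])
val_ids = set(np.asarray(image_ids)[val_mask])
print(len(train_ids), len(val_ids))
```

The masks would have to be computed from the `image_id` values before that column is dropped. Each resulting dataframe could then be passed to its own `flow_from_dataframe` call without the `subset` argument.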

## 2. Model definition

In [None]:
# constant parameters for model definition
NUM_CLASSES=19

In [None]:
# DenseNet121 backbone without the built-in classification head;
# that head ends in a softmax, which does not suit multi-label targets
densenet = DenseNet121(
    include_top=False,
    weights=None,
    input_shape=(IMG_SIZE,IMG_SIZE,3),
    input_tensor=None,
    pooling='avg'
)

In [None]:
# model definition: rescaling layer, augmentation layers, backbone, and a sigmoid output layer
model_densenet = Sequential( [
    layers.experimental.preprocessing.Rescaling( 1./255, input_shape=(IMG_SIZE, IMG_SIZE, 3) ),
    layers.experimental.preprocessing.RandomFlip("horizontal"),
    layers.experimental.preprocessing.RandomFlip("vertical"),
    layers.experimental.preprocessing.RandomTranslation(height_factor=0.1, width_factor=0.1),
    layers.experimental.preprocessing.RandomRotation(factor=1.0),
    layers.experimental.preprocessing.RandomZoom(height_factor=0.25, width_factor=0.25),
    densenet,
    # one independent sigmoid per class, matching the binary cross-entropy loss
    layers.Dense(NUM_CLASSES, activation='sigmoid')
] )
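
A cell can carry several labels at once, so the choice of output activation matters. The NumPy sketch below, using made-up logits, illustrates the difference: softmax makes the classes compete for a total probability of 1, whereas a sigmoid scores each class independently.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D array of logits."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits for four classes; the first two are both strongly indicated.
logits = np.array([4.0, 4.0, -3.0, -3.0])

p_softmax = softmax(logits)   # probabilities sum to 1, so two strong classes share the mass
p_sigmoid = sigmoid(logits)   # each class gets its own probability

print(np.round(p_softmax, 3))
print(np.round(p_sigmoid, 3))
```

With softmax, neither of the two strongly indicated classes can exceed 0.5, while the sigmoid assigns both a probability close to 1.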

In [None]:
# symbolic output tensor of the model; its shape should be (None, NUM_CLASSES)
model_densenet.output

In [None]:
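# compile model
# note: categorical_accuracy compares only the arg-max class per sample,
# so it is a rough guide for multi-label targets
learning_rate = 1e-3
model_densenet.compile(optimizer=Adam(learning_rate=learning_rate),
                       loss='binary_crossentropy', metrics=['categorical_accuracy'])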
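
One way to act on the "modifications of the loss function" insight from the Introduction is a focal-style weighting of binary cross-entropy (Lin et al., 2017), which down-weights examples that are already classified well. The NumPy sketch below is an illustration only, not the loss used in this notebook. It omits the class-balancing weight of the original formulation, and `gamma=2.0` is an assumed value.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Element-wise binary cross-entropy."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    return -(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def focal_bce(y_true, y_prob, gamma=2.0, eps=1e-7):
    """Binary cross-entropy scaled by (1 - p_t)^gamma, with p_t the probability of the true label."""
    p = np.clip(y_prob, eps, 1.0 - eps)
    p_t = y_true * p + (1.0 - y_true) * (1.0 - p)
    return (1.0 - p_t) ** gamma * binary_cross_entropy(y_true, y_prob, eps)

# Made-up targets and predictions: easy positive, hard positive, easy negative, hard negative.
y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_prob = np.array([0.95, 0.30, 0.05, 0.70])

bce = binary_cross_entropy(y_true, y_prob)
focal = focal_bce(y_true, y_prob)

# Ratio of focal to plain loss: small for easy examples, larger for hard ones.
print(np.round(focal / bce, 4))
```

Setting `gamma=0` recovers plain binary cross-entropy, so the parameter controls how strongly easy examples are suppressed.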

In [None]:
# model summary
model_densenet.summary()

## 3. Training

In [None]:
# constant training parameters
EPOCHS=10

In [None]:
# callbacks: stop when the validation loss stops improving, save a checkpoint after every epoch,
# and write logs for TensorBoard
model_callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=2, verbose=0),
    tf.keras.callbacks.ModelCheckpoint(filepath='./densenet_model.{epoch:02d}-{val_loss:.2f}.h5'),
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
]
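
Two details of the callbacks can be illustrated without TensorFlow: how the checkpoint file-name template is filled in, and what `patience=2` means. The sketch below is a simplified model of patience-based stopping that ignores options such as `min_delta`; `stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
# File-name template as passed to ModelCheckpoint; it is filled with str.format-style fields.
template = './densenet_model.{epoch:02d}-{val_loss:.2f}.h5'
example_name = template.format(epoch=3, val_loss=0.1234)

def stopping_epoch(losses, patience=2):
    """Return the 1-based epoch at which patience-based early stopping would halt, or None."""
    best = float('inf')
    waited = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            waited = 0
        else:
            waited += 1      # no improvement in this epoch
            if waited >= patience:
                return epoch
    return None

# Made-up loss values: improvement stops after epoch 3.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.39]
print(example_name)
print(stopping_epoch(losses))
```

In this toy run, training would halt before the later improvement at epoch 6 is ever seen, which is the usual trade-off of a small patience value.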

In [None]:
# train the model
history = model_densenet.fit(
    train_data,
    validation_data=validation_data,
    epochs=EPOCHS,
    callbacks=model_callbacks
)

In [None]:
# plot model accuracy
plt.plot(history.history['categorical_accuracy'])
plt.plot(history.history['val_categorical_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()
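
For multi-label targets the accuracy curve should be read with care: `categorical_accuracy` only checks whether the highest-scoring class coincides with the arg-max of the target vector. The NumPy sketch below, with made-up targets and predictions, compares that rule with an element-wise accuracy at an assumed threshold of 0.5.

```python
import numpy as np

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_prob = np.array([[0.9, 0.1, 0.2, 0.1],    # misses the second positive label
                   [0.2, 0.8, 0.1, 0.1],    # fully correct
                   [0.6, 0.7, 0.1, 0.2]])   # both labels found, but the arg-max differs

# Arg-max comparison, the rule behind categorical accuracy.
argmax_acc = np.mean(np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1))

# Element-wise comparison after thresholding each class independently.
elementwise_acc = np.mean((y_prob >= 0.5).astype(int) == y_true)

print(argmax_acc, elementwise_acc)
```

The first sample counts as correct under the arg-max rule although one label is missed, and the third counts as wrong although both labels are found. An element-wise metric such as Keras's `binary_accuracy` may therefore be a more faithful summary.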

In [None]:
# plot model loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

## 4. Evaluation
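
This section is left open in the notebook. As one possible starting point, and only as a sketch, per-class precision and recall can be computed from thresholded predictions. The arrays below are made-up stand-ins for validation targets and for the output of `model_densenet.predict`; the helper name and the 0.5 threshold are assumptions.

```python
import numpy as np

def per_class_precision_recall(y_true, y_prob, threshold=0.5):
    """Return (precision, recall) arrays with one entry per class."""
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1), axis=0)
    fp = np.sum((y_pred == 1) & (y_true == 0), axis=0)
    fn = np.sum((y_pred == 0) & (y_true == 1), axis=0)
    # np.maximum avoids division by zero for classes without predictions or positives.
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return precision, recall

# Made-up targets and predicted probabilities for four cells and three classes.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_prob = np.array([[0.8, 0.2, 0.4],
                   [0.3, 0.9, 0.1],
                   [0.7, 0.6, 0.6],
                   [0.1, 0.2, 0.9]])

precision, recall = per_class_precision_recall(y_true, y_prob)
print(precision, recall)
```

Reporting the values per class rather than as a single average makes it easier to see which localization patterns the model handles poorly.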