## Human Protein Atlas - Single Cell Classification

This notebook is heavily inspired by, and in places copied from, [this](https://www.kaggle.com/allunia/protein-atlas-exploration-and-baseline) and [this](https://www.kaggle.com/dhananjay3/human-protein-atlas-eda-all-you-need-to-know) kernel. I was looking for starter code, and the first kernel is a must-read if you want to learn what you should know about this dataset, since the previous Human Protein Atlas competition was very similar. I'll be sharing parts of [that kernel](https://www.kaggle.com/allunia/protein-atlas-exploration-and-baseline) here as I see fit.

In [None]:
class KernelSettings:
    
    def __init__(self, fit_baseline=False):
        self.fit_baseline = fit_baseline
        
kernelsettings = KernelSettings(fit_baseline=True)

## Reading the data

Before diving into the cell images, I'd rather get comfortable with the data and labels first to figure out what we are dealing with.

In [None]:
import os
import numpy as np
import pandas as pd

ROOT = "../input/hpa-single-cell-image-classification/"
train =  pd.read_csv(ROOT+"train.csv")
train.head()

It looks like each row has an ID, where each ID has multiple images (we'll get to that later) and multiple labels in the Label column. So this is a multi-label classification problem. To simplify it, you can think of it as multiple binary classification problems, one per label.

We should split these labels into separate columns and then start exploring them.
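
As a toy illustration of that split, here is a minimal sketch using `Series.str.get_dummies`. The three rows and their IDs are made up for the example and are not taken from `train.csv`; the notebook itself uses a row-wise `apply` instead.

```python
import pandas as pd

# Hypothetical toy rows in the same "a|b|c" format as the Label column.
toy = pd.DataFrame({"ID": ["img_a", "img_b", "img_c"],
                    "Label": ["0|16", "14", "0|5|16"]})

# One indicator column per label id that occurs in the strings.
multi_hot = toy["Label"].str.get_dummies(sep="|")

# Column names come back as strings; convert to int and sort numerically.
multi_hot.columns = multi_hot.columns.astype(int)
multi_hot = multi_hot.sort_index(axis=1)

print(multi_hot)
```

Only label ids that actually occur get a column here, so on real data you would still need to add any label that never appears.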

In [None]:
label_names = {
0: "Nucleoplasm",
1: "Nuclear membrane",
2: "Nucleoli",
3: "Nucleoli fibrillar center",
4: "Nuclear speckles",
5: "Nuclear bodies",
6: "Endoplasmic reticulum",
7: "Golgi apparatus",
8: "Intermediate filaments",
9: "Actin filaments",
10: "Microtubules",
11: "Mitotic spindle",
12: "Centrosome",
13: "Plasma membrane",
14: "Mitochondria",
15: "Aggresome",
16: "Cytosol",
17: "Vesicles and punctate cytosolic patterns",
18: "Negative",
}

name_labels = dict((v,k) for k,v in label_names.items())

In [None]:
def make_labels_columns(row):
    for label in row.Label.split('|'):
        name = label_names[int(label)]
        row.loc[name] = 1
    return row

for label_name in label_names.values():
    train[label_name] = 0
    
train = train.apply(make_labels_columns, axis=1)
train_labels = train[list(label_names.values())]
train_labels.head()

Now each label has its own column, so let's start exploring.

## What is the frequency of each label?

In [None]:
labels_counts = train_labels.sum(axis=0)
labels_counts.sort_values(ascending=True).plot(kind='barh', figsize=(10, 5), title='Frequency of Labels');

> We can see that some labels, like Mitotic spindle and Aggresome, are barely present in the data. The scarcity of these labels could hurt predictions for them, as a model would not have been exposed to enough examples.

## What is the frequency of multiple targets? Do data points with more than one target dominate, or is the data mostly single-target?

In [None]:
train_labels.sum(axis=1).value_counts(ascending=True).plot(kind='barh', figsize=(10, 3));

> We can see that the majority of the data has one or two targets.

## Target correlation and co-occurrence

**Target correlation is easy to calculate, but co-occurrence is calculated using the method from [this kernel](https://www.kaggle.com/dhananjay3/human-protein-atlas-eda-all-you-need-to-know)**

In [None]:
import matplotlib.pyplot as plt 
import seaborn as sns

# Target correlation matrix
plt.figure(figsize=(10, 7))
sns.heatmap(train_labels[train_labels.sum(axis=1) > 1].corr(), cmap="icefire", vmin=-1, vmax=1);


We can see how this color palette exposes that most correlations between targets are negative, except for Plasma membrane's correlation with Intermediate filaments and Actin filaments.

In [None]:
u = train_labels
v = u.T.dot(u)

plt.figure(figsize=(10, 7))
sns.heatmap((v / np.sum(v, axis=0)).T, cbar=True, annot=False);

We can see that most co-occurrences involve the most frequent targets, Nucleoplasm and Cytosol. So if we rank the labels by frequency, I guess we would see a fading effect.

In [None]:
v = v[labels_counts.sort_values(ascending=False).index]
v = v.reindex(labels_counts.sort_values(ascending=False).index, axis=0)

plt.figure(figsize=(10, 7))
sns.heatmap((v / np.sum(v, axis=0)).T, cbar=True, annot=False);

And we can observe the fading effect as we assumed.
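
To make the co-occurrence computation easier to follow, here is a minimal sketch on a made-up 3-sample, 3-label matrix (the values are hypothetical). It uses the same `u.T.dot(u)` product and the same column-sum normalisation as the cells above, just with plain NumPy arrays.

```python
import numpy as np

# Hypothetical multi-hot matrix: rows are samples, columns are labels A, B, C.
u = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])

# v[i, j] counts the samples that carry both label i and label j.
# The diagonal is simply the frequency of each label.
v = u.T.dot(u)

# Divide every column by its sum, then transpose: row j now holds the
# share of label j's co-occurrence counts that falls on each label.
share = (v / np.sum(v, axis=0)).T

print(v)
print(share.round(2))
```

Each row of `share` sums to 1, which is why a very frequent label shows up as a bright column across the whole heatmap.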

## Helper functions

In [None]:
import matplotlib.image as mpimg

targets = ['Cytosol']
max_rows = 5


# Select the rows that carry at least one of the requested targets
images = train[(train_labels[targets] == 1).any(axis=1)]

fig, axes = plt.subplots(max_rows, 4, figsize=(20, 5*max_rows))
axes = axes.flatten()

color_filter = {'green': lambda x: ' - '.join([label_names[int(label)] for label in x.split('|')]),
                'blue': lambda x: 'Nucleus',
                'red': lambda x:'Microtubules',
                'yellow': lambda x: 'Endoplasmic reticulum'}

ax_id = 0
for row, id_ in enumerate(images.ID):
    if row == max_rows:
        break
        
    for color, filter_ in color_filter.items():
        path = f'{ROOT}train/{id_}_{color}.png'
        img = mpimg.imread(path)
        axes[ax_id].imshow(img)
        axes[ax_id].set_title(filter_(images.loc[images.ID == id_, 'Label'].item()))
        ax_id += 1
    

## Baseline

In [None]:
!pip install /kaggle/input/iterative-stratification/iterative-stratification-master/

In [None]:
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold, MultilabelStratifiedShuffleSplit


N_SPLITS = 10
SEED = 41295
mskf = MultilabelStratifiedKFold(n_splits=N_SPLITS, random_state=SEED, shuffle=True)

partitions = []

for train_idx, val_idx in mskf.split(train, train[list(label_names.values())]):
    partition = {}
    partition["train"] = train.ID.values[train_idx]
    partition["validation"] = train.ID.values[val_idx]
    partitions.append(partition)
    print("TRAIN:", train_idx, "VALIDATION:", val_idx)
    print("TRAIN:", len(train_idx), "VALIDATION:", len(val_idx))
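
One way to sanity-check a multi-label split is to compare the positive rate of each label in the training and validation parts. The sketch below does that on made-up data; `label_rate_gap` is a hypothetical helper written for this example and is not part of `iterstrat`.

```python
import numpy as np

def label_rate_gap(y, train_idx, val_idx):
    """Absolute difference in per-label positive rate between two index sets."""
    y = np.asarray(y)
    return np.abs(y[train_idx].mean(axis=0) - y[val_idx].mean(axis=0))

# Hypothetical multi-hot targets: 8 samples, 2 labels.
y = np.array([[1, 1], [1, 1], [1, 0], [1, 0],
              [0, 0], [0, 0], [0, 0], [0, 0]])

# A split that keeps label rates equal on both sides ...
balanced = label_rate_gap(y, train_idx=[1, 3, 6, 7], val_idx=[0, 2, 4, 5])
# ... and one that puts every positive sample into validation.
skewed = label_rate_gap(y, train_idx=[4, 5, 6, 7], val_idx=[0, 1, 2, 3])

print(balanced, skewed)
```

Inside the loop above, the same helper could be called with the label columns of `train` as a NumPy array together with `train_idx` and `val_idx`; gaps close to zero would suggest the fold preserves the label distribution.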


In [None]:
class ModelParameters:
    """
    Holds parameters shared between dataloader, model and image processor.
    """
    
    def __init__(self, basepath,
                 num_classes=19,
                 image_rows=2048,
                 image_cols=2048,
                 batch_size=200,
                 n_channels=1,
                 row_scale_factor=4,
                 col_scale_factor=4,
                 shuffle=False,
                 n_epochs=1):
        self.basepath = basepath
        self.num_classes = num_classes
        self.image_rows = image_rows
        self.image_cols = image_cols
        self.batch_size = batch_size
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.row_scale_factor = row_scale_factor
        self.col_scale_factor = col_scale_factor
        # np.int was removed from recent NumPy versions; the built-in int does the same job
        self.scaled_row_dim = int(self.image_rows / self.row_scale_factor)
        self.scaled_col_dim = int(self.image_cols / self.col_scale_factor)
        self.n_epochs = n_epochs

In [None]:
# init model parameters class
# the training images live in the train/ subfolder
parameters = ModelParameters(f"{ROOT}train/")

In [None]:
from skimage.transform import resize

class ImagePreprocessor:
    
    def __init__(self, parameters):
        self.parameters = parameters
        self.basepath = self.parameters.basepath
        self.scaled_row_dim = self.parameters.scaled_row_dim
        self.scaled_col_dim = self.parameters.scaled_col_dim
        self.n_channels = self.parameters.n_channels
        
    def preprocess(self, image):
        image = self.resize(image)
        image = self.reshape(image)
        image = self.normalize(image)
        return image
    
    def resize(self, image):
        image = resize(image, (self.scaled_row_dim, self.scaled_col_dim))
        return image
    
    def reshape(self, image):
        image = np.reshape(image, (image.shape[0], image.shape[1], self.n_channels))
        return image
    
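    def normalize(self, image):
        # mpimg.imread already returns PNG data as floats in [0, 1];
        # only rescale when the values look like 8-bit intensities.
        if image.max() > 1:
            image = image / 255
        return image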
    
    
    def load_image(self, image_id):
        # Read the green channel first to get the image shape
        green = mpimg.imread(f'{self.basepath}{image_id}_green.png')
        # Stack the four stains as channels: green, blue, red, yellow
        image = np.zeros(shape=(green.shape[0], green.shape[1], 4))
        image[:,:,0] = green
        image[:,:,1] = mpimg.imread(f'{self.basepath}{image_id}_blue.png')
        image[:,:,2] = mpimg.imread(f'{self.basepath}{image_id}_red.png')
        image[:,:,3] = mpimg.imread(f'{self.basepath}{image_id}_yellow.png')
        # Keep only the first n_channels channels
        return image[:, :, 0:self.n_channels]
    

In [None]:
preprocessor = ImagePreprocessor(parameters)

In [None]:
# example of preprocessing an image
id_ = train.ID[np.random.randint(0, len(train))]
color = 'green'
path = f'{ROOT}train/{id_}_{color}.png'
img = mpimg.imread(path)

pp_img = preprocessor.preprocess(img)

fig, axes = plt.subplots(1, 2, figsize=(20, 10))
axes[0].imshow(img)
axes[0].set_title('Original Image')
axes[1].imshow(pp_img.squeeze())
axes[1].set_title('Preprocessed Image');

In [None]:
import keras

class DataGenerator(keras.utils.Sequence):
    
    def __init__(self, list_IDs, labels, modelparameter, imagepreprocessor):
        self.current_epoch = 0
        self.params = modelparameter
        self.labels = labels
        self.list_IDs = list_IDs
        self.dim = (self.params.scaled_row_dim, self.params.scaled_col_dim)
        self.batch_size = self.params.batch_size
        self.n_channels = self.params.n_channels
        self.num_classes = self.params.num_classes
        self.shuffle = self.params.shuffle
        self.preprocessor = imagepreprocessor
        self.on_epoch_end()
    
    def on_epoch_end(self):
        self.indexes = np.arange(len(self.list_IDs))
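        if self.shuffle:
            # np.random.shuffle takes no random_state argument, so seed a
            # RandomState with the epoch number for a reproducible order
            np.random.RandomState(self.current_epoch).shuffle(self.indexes)
            self.current_epoch += 1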
    
    def get_targets_per_image(self, identifier):
        return self.labels.loc[self.labels.ID==identifier].drop(
                ["ID", "Label"], axis=1).values
            
    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size, self.num_classes), dtype=int)
        # Generate data
        for i, identifier in enumerate(list_IDs_temp):
            # Store sample
            image = self.preprocessor.load_image(identifier)
            image = self.preprocessor.preprocess(image)
            X[i] = image
            # Store class
            y[i] = self.get_targets_per_image(identifier)
        return X, y
    
    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))
    
    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y
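
Because `__len__` rounds down, the generator never serves a final partial batch. A minimal sketch of the index ranges it would produce, with hypothetical sizes (10 samples, batch size 4) and a helper name, `batch_slices`, made up for this example:

```python
def batch_slices(n_samples, batch_size):
    """Index ranges served in one epoch when the batch count is rounded down."""
    n_batches = n_samples // batch_size  # same result as int(np.floor(n / b))
    return [(i * batch_size, (i + 1) * batch_size) for i in range(n_batches)]

slices = batch_slices(n_samples=10, batch_size=4)
dropped = 10 - slices[-1][1]

print(slices)   # the (start, stop) pairs passed to self.indexes[start:stop]
print(dropped)  # samples that are never seen in this epoch
```

With `shuffle=False`, as in the default `ModelParameters`, the dropped samples are the same ones in every epoch. Note also that a batch size larger than the number of IDs yields zero batches.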

In [None]:
class PredictGenerator:
    
    def __init__(self, predict_Ids, imagepreprocessor, predict_path):
        self.preprocessor = imagepreprocessor
        self.preprocessor.basepath = predict_path
        self.identifiers = predict_Ids
    
    def predict(self, model):
        y = np.empty(shape=(len(self.identifiers), self.preprocessor.parameters.num_classes))
        for n in range(len(self.identifiers)):
            image = self.preprocessor.load_image(self.identifiers[n])
            image = self.preprocessor.preprocess(image)
            image = image.reshape((1, *image.shape))
            y[n] = model.predict(image)
        return y

In [None]:
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.losses import binary_crossentropy
from keras.optimizers import Adadelta
from keras.initializers import VarianceScaling


class BaseLineModel:
    
    def __init__(self, modelparameter):
        self.params = modelparameter
        self.num_classes = self.params.num_classes
        self.img_rows = self.params.scaled_row_dim
        self.img_cols = self.params.scaled_col_dim
        self.n_channels = self.params.n_channels
        self.input_shape = (self.img_rows, self.img_cols, self.n_channels)
        self.my_metrics = ['accuracy']
    
    def build_model(self):
        self.model = Sequential()
        self.model.add(Conv2D(16, kernel_size=(3, 3), activation='relu', input_shape=self.input_shape,
                             kernel_initializer=VarianceScaling(seed=0)))
        self.model.add(Conv2D(32, (3, 3), activation='relu',
                             kernel_initializer=VarianceScaling(seed=0)))
        self.model.add(MaxPooling2D(pool_size=(2, 2)))
        self.model.add(Dropout(0.25))
        self.model.add(Flatten())
        self.model.add(Dense(64, activation='relu',
                            kernel_initializer=VarianceScaling(seed=0),))
        self.model.add(Dropout(0.5))
        self.model.add(Dense(self.num_classes, activation='sigmoid'))
    
    def compile_model(self):
        self.model.compile(loss=keras.losses.binary_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=self.my_metrics)
    
    def set_generators(self, train_generator, validation_generator):
        self.training_generator = train_generator
        self.validation_generator = validation_generator
    
    def learn(self):
        return self.model.fit_generator(generator=self.training_generator,
                    validation_data=self.validation_generator,
                    epochs=self.params.n_epochs, 
                    use_multiprocessing=True,
                    workers=8)
    
    def score(self):
        return self.model.evaluate_generator(generator=self.validation_generator,
                                      use_multiprocessing=True, 
                                      workers=8)
    
    def predict(self, predict_generator):
        y = predict_generator.predict(self.model)
        return y
    
    def save(self, modeloutputpath):
        self.model.save(modeloutputpath)
    
    def load(self, modelinputpath):
        self.model = load_model(modelinputpath)
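
To get a feel for how large this baseline is, the sketch below counts its trainable parameters by hand. It assumes the default `ModelParameters` (2048 / 4 = 512 pixels per side, one channel, 19 classes) and the Keras defaults of `'valid'` padding and stride 1 for `Conv2D`; the function names are made up for this example.

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def baseline_param_count(rows=512, cols=512, channels=1, num_classes=19):
    """Parameter count per layer group for the baseline architecture."""
    conv1 = 16 * (3 * 3 * channels + 1)      # 16 filters, 3x3 kernel, plus biases
    conv2 = 32 * (3 * 3 * 16 + 1)            # 32 filters over 16 input channels
    pooled_rows = conv_out(conv_out(rows)) // 2   # two convs, then 2x2 max pooling
    pooled_cols = conv_out(conv_out(cols)) // 2
    flat = pooled_rows * pooled_cols * 32    # size of the Flatten output
    dense = flat * 64 + 64                   # Dense(64)
    out = 64 * num_classes + num_classes     # Dense(num_classes)
    return {"conv": conv1 + conv2, "flatten_size": flat,
            "dense": dense, "output": out,
            "total": conv1 + conv2 + dense + out}

counts = baseline_param_count()
print(counts)
```

Almost all of the parameters sit in the first `Dense` layer, because the flattened feature map is very large at this resolution. Dropout and pooling layers add no parameters.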

In [None]:
# Datasets
partition = partitions[0]
labels = train

print("Number of samples in train: {}".format(len(partition["train"])))
print("Number of samples in validation: {}".format(len(partition["validation"])))

In [None]:
training_generator = DataGenerator(partition['train'], labels, parameters, preprocessor)
validation_generator = DataGenerator(partition['validation'], labels, parameters, preprocessor)

In [None]:
predict_generator = PredictGenerator(partition['validation'], preprocessor, f'{ROOT}train/')

In [None]:
submission = pd.read_csv(f"{ROOT}sample_submission.csv")
test_names = submission.ID.values

test_preprocessor = ImagePreprocessor(parameters)
submission_predict_generator = PredictGenerator(test_names, test_preprocessor, f'{ROOT}test/')

In [None]:
test_labels = pd.DataFrame(data=test_names, columns=["ID"])
for col in train_labels.columns.values:
    if col != "ID":
        test_labels[col] = 0
test_labels.head(1)

In [None]:
if kernelsettings.fit_baseline == True:
    model = BaseLineModel(parameters)
    model.build_model()
    model.compile_model()
    model.set_generators(training_generator, validation_generator)
    history = model.learn()
    
    proba_predictions = model.predict(predict_generator)
    baseline_proba_predictions = pd.DataFrame(index = partition['validation'],
                                              data=proba_predictions,
                                              columns=list(label_names.values()))
    baseline_proba_predictions.to_csv("baseline_predictions.csv")
    baseline_losses = pd.DataFrame(history.history["loss"], columns=["train_loss"])
    baseline_losses["val_loss"] = history.history["val_loss"]
    baseline_losses.to_csv("baseline_losses.csv")
    
    
    submission_proba_predictions = model.predict(submission_predict_generator)
    baseline_labels = test_labels.copy()
    baseline_labels.loc[:, test_labels.drop(["ID"], axis=1).columns.values] = submission_proba_predictions
    baseline_labels.to_csv("baseline_submission_proba.csv")
    
# If you have already fitted the baseline once,
# you can load the predictions from CSV and skip further fitting:
else:
    baseline_proba_predictions = pd.read_csv("../input/protein-atlas-eab-predictions/baseline_predictions.csv", index_col=0)
    baseline_losses = pd.read_csv("../input/protein-atlas-eab-predictions/baseline_losses.csv", index_col=0)
    baseline_labels = pd.read_csv("../input/protein-atlas-eab-predictions/baseline_submission_proba.csv", index_col=0)

Huge thanks to [Dhananjay Raut](https://www.kaggle.com/dhananjay3) and [Laura Fink](https://www.kaggle.com/allunia) for inspiring this kernel. More is yet to come, so stay tuned.