# Mechanisms of Action DNN Deep-Dive

Within this notebook we'll explore a range of techniques and model architectures, some standard and others unconventional. The focus is on deep neural networks and variants thereof, rather than classical tabular models or tree-based ensembles.

We'll start by performing some analysis of the given dataset, followed by defining a range of preprocessing and feature engineering functions to help provide a diverse set of training and testing data for our models.

After this, we'll craft some standard deep neural networks and explore what impact different subsets of data, hyper-parameters and techniques have on the cross-validation performance. With these results established as a baseline, we'll start exploring some more unconventional models. These include some simple convolutional neural networks, and also more complex hybrid tabular & convolutional neural networks.

You might be skeptical as to why we'd ever throw a convolutional neural network at a tabular data challenge such as this, and the answer is simple: because we can, and in fact it's not as bad as you'd think!

**Table of Contents:**

1. [Imports](#imports)
2. [EDA](#EDA)
3. [Data Preparation and Preprocessing](#data-preprocessing)
4. [Model Production and Evaluation](#model-production) 
    - 4.1. [Hold-out Validation with Tuned 3-layer ANN](#standard-3layer)
    - 4.2. [K-Folds Cross-Validation](#kfolds-cross-validation)
    - 4.3. [Evaluation of Feature Engineered Data](#feature-engineered-evaluation)
    - 4.4. [Evaluation of Dimensionally Reduced Features](#dimensionally-reduced-evaluation)
    - 4.5. [Evaluation of Oversampled Data](#oversampled-evaluation)
5. [Composite Neural Networks](#composite-models)
    - 5.1. [Validation of a single 1D ConvNet](#single-Conv1d)
    - 5.2. [Multi-input ConvNet](#multi-input-convnet)
    - 5.3. [Dense and Convolutional Composite Multi-Input Model](#composite-model)
6. [Test Set Predictions](#test-predictions)

<a id="imports"></a>

## 1. Import dependencies and load data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
import seaborn as sns
import tensorflow as tf

import keras
import keras.backend as K
from keras.callbacks import ModelCheckpoint
from keras.initializers import Constant
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense, Embedding, Flatten, LSTM, GRU, \
        SpatialDropout1D, Bidirectional, Conv1D, MaxPooling1D, BatchNormalization
from keras.models import Sequential, load_model
from keras import models
from keras import layers

import pickle

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import log_loss, silhouette_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, \
        cross_validate, cross_val_score, KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from tqdm import tqdm

In [None]:
print(keras.__version__) 
print(tf.__version__)

In [None]:
input_dir = '/kaggle/input/lish-moa'
train_features = pd.read_csv(os.path.join(input_dir, 'train_features.csv'))
train_targets_scored = pd.read_csv(os.path.join(input_dir, 'train_targets_scored.csv'))
train_targets_nonscored = pd.read_csv(os.path.join(input_dir, 'train_targets_nonscored.csv'))
test_features = pd.read_csv(os.path.join(input_dir, 'test_features.csv'))

train_features.shape, train_targets_scored.shape, train_targets_nonscored.shape, test_features.shape

We have our main input features (train_features.csv), which are high-dimensional tabular data containing a mixture of categorical and numerical features. We then have our training targets, which consist of 206 different output classes for each data instance. It's important to note that these output labels are not mutually exclusive, and a single data instance can have several positive labels. Therefore, this is a multi-label classification problem, and not just a multiclass classification problem.

In contrast to a normal binary classification task, this type of multi-label problem becomes much more difficult in terms of producing and fine-tuning a classification model. 
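
To make the multi-label framing concrete, the sketch below scores a small, made-up label matrix by treating every target column as its own binary problem and averaging the per-column log losses. The arrays and the `mean_columnwise_log_loss` helper are illustrative assumptions and are not part of this notebook's pipeline.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss per target column, averaged over all columns."""
    y_true = np.asarray(y_true, dtype=float)
    # clip so that log(0) can never occur
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    per_column = per_entry.mean(axis=0)
    return per_column.mean()

# toy example: 3 samples, 4 targets; the first row has two positive labels at once
y_true = np.array([[1, 0, 1, 0],
                   [0, 0, 0, 0],
                   [0, 1, 0, 0]])
y_pred = np.full(y_true.shape, 0.25)

labels_per_sample = y_true.sum(axis=1)
toy_loss = mean_columnwise_log_loss(y_true, y_pred)
print(labels_per_sample)
print(toy_loss)
```

Because every column has the same number of rows, this is the same number you get from averaging the binary cross-entropy over all entries of the label matrix.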

---

<a id="EDA"></a>
## 2. Basic Exploratory Data Analysis

In [None]:
cat_cols = ['cp_type', 'cp_time', 'cp_dose']

plt.figure(figsize=(14,4))

for idx, col in enumerate(cat_cols):
    plt.subplot(1, 3, idx + 1)
    labels = train_features[col].value_counts().index.values
    vals = train_features[col].value_counts().values
    sns.barplot(x=labels, y=vals)
    plt.xlabel(f'{col}', weight='bold')
    plt.ylabel('Count', weight='bold')
plt.tight_layout()
plt.show()

For 'cp_type', the 'ctl_vehicle' value refers to samples treated with a control perturbation. For control perturbations, our targets are all zero, since they have no Mechanism of Action (MoA).

To deal with this, a good strategy could be to identify samples that are ctl_vehicle (through training a classification model, or simply by using the feature directly, as it is also present in the test data!) and set all of their targets to zero. We can then process the test set accordingly, by first setting all targets to zero for any test instance that is a ctl_vehicle, and then processing all of the others normally using our trained model.

In [None]:
# select all indices when 'cp_type' is 'ctl_vehicle'
ctl_vehicle_idx = (train_features['cp_type'] == 'ctl_vehicle')

# evaluate number of 1s we have in the total train scores when cp_type = ctl_vehicle
train_targets_scored.loc[ctl_vehicle_idx].iloc[:, 1:].sum().sum()

The total sum is zero, which confirms the statement above that all targets are zero when cp_type is ctl_vehicle. The simplest way to handle this is to fill our targets with zeros whenever this is the case.

We could also remove all of these rows from the training set, however there are arguments for and against this in practice. If we remove them, we could be withholding valuable zero-case data from our models, and our models might then struggle to predict these cases on new data. On the other hand, it is a lot of extra data, which could just serve to unnecessarily complicate our model.

In [None]:
# take a copy of all our training sig_ids for reference
train_sig_ids = train_features['sig_id'].copy()

In [None]:
# drop cp_type column since we no longer need it
X = train_features.drop(['sig_id', 'cp_type'], axis=1).copy()
X = X.loc[~ctl_vehicle_idx].copy()

y = train_targets_scored.drop('sig_id', axis=1).copy()
y = y.loc[~ctl_vehicle_idx].copy()

X.shape, y.shape

In [None]:
X.head(3)

The data has already been normalised using quantile normalisation, so the values we see here are not in their raw form.

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(X.iloc[:, 2:].mean())
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(y.mean())
plt.show()

In [None]:
plt.figure(figsize=(8,5))
# drop the non-numeric sig_id column before averaging each target column
sns.distplot(train_targets_nonscored.drop('sig_id', axis=1).mean())
plt.show()

In [None]:
y.sum().sort_values()[:30].plot.bar(figsize=(18,6))
plt.show()

Some output classes only have 1 instance in the entire training set. This is problematic, and is nowhere near enough data if we expect our models to make effective predictions across the whole range of targets. Imbalanced-dataset techniques such as minority class over-sampling may have to be introduced, which may help our models generalise better to new data.
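
As a rough illustration of how such under-represented classes can be flagged, the sketch below counts the positives per label on a made-up label frame and lists the labels that fall under a chosen threshold. The frame, the label names and the `min_positives` threshold are all hypothetical.

```python
import pandas as pd

# made-up label frame: 6 samples, 3 labels
toy_labels = pd.DataFrame({
    'label_a': [1, 0, 0, 0, 0, 0],
    'label_b': [1, 1, 1, 0, 0, 0],
    'label_c': [0, 0, 0, 0, 0, 1],
})

min_positives = 2  # hypothetical threshold for calling a class "rare"

# number of positive instances per label, smallest first
class_counts = toy_labels.sum().sort_values()
rare_classes = sorted(class_counts[class_counts < min_positives].index)
print(class_counts)
print(rare_classes)
```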

#### Plotting all gene / cell features for random samples:

Let's quickly assess how our cell data looks when plotted over all features for random instances:

In [None]:
cat_feats = X.iloc[:, :2].copy()
X_cell_v = X.iloc[:, -100:].copy()
# gene columns sit between the two categorical columns and the final 100 cell columns
X_gene_e = X.iloc[:, 2:-100].copy()

In [None]:
def plot_features(X, y, selected_idx, features_type, figsize=(14,10)):
    x_range = range(1, X.shape[1] + 1)
    
    fig = plt.figure(figsize=figsize)
    
    for i, idx in enumerate(selected_idx):
        ax = fig.add_subplot(selected_idx.shape[0], 1, i + 1)
        vals = X.iloc[idx].values
    
        if (y.iloc[idx] == 1).sum():
            output_labels = list(y.iloc[idx][y.iloc[idx] == 1].index.values)
        
            labels = " ".join(output_labels)
        else:
            labels = "None (all labels zero)"
        
        sns.lineplot(x=list(x_range), y=vals)
        plt.title(f"Row {idx}, Labels: {labels}", weight='bold')
        plt.xlim(0.0, X.shape[1])
        plt.grid()

    plt.xlabel(f"{features_type}", weight='bold', size=14)
    plt.tight_layout()
    plt.show()

In [None]:
# lets plot some random rows from our data
random_idx = np.random.randint(X.shape[0], size=(5,))

plot_features(X_cell_v, y, random_idx, features_type='Cell Features')

Clearly some rows vary substantially in terms of their value range, and therefore it is worth standardising this data prior to training our models.

Now let's do the same for our gene features:

In [None]:
plot_features(X_gene_e, y, random_idx, features_type='Gene Features')

We have some noticeable peaks throughout the features for some of the above instances. It could be worth plotting a range of data instances with the same output labels against one another, and comparing their peaks. If they correlate in one or more areas, this could be insightful for developing further features from our dataset.

Let's now repeat the above, but for data instances with the same output label(s).

In [None]:
# select an output label to plot associated training features
chosen_label = 'btk_inhibitor'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,), replace=False)

In [None]:
plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

Let's also look at the mean and standard deviation of each gene feature across all instances with this label:

In [None]:
def plot_mean_std(dataframe, feature_name, features_type, figsize=(14,6), alpha=0.3):
    """ Plot the per-feature mean and standard deviation over all rows of the given
        dataframe that have the chosen label (relies on the global labels `y`) """
    
    plt.figure(figsize=figsize)
    
    x_range = range(1, dataframe.shape[1] + 1)
    
    chosen_rows = y.loc[y[feature_name] == 1]
    chosen_feats = dataframe.loc[y[feature_name] == 1]
    
    means = chosen_feats.mean()
    stds = chosen_feats.std()
    
    plt.plot(x_range, means, label=feature_name)    
    plt.fill_between(x_range, means - stds, means + stds, 
                         alpha=alpha)

    plt.title(f'{features_type}: {feature_name} - Mean & Standard Deviation', weight='bold')
    
    plt.xlim(0.0, dataframe.shape[1])
    
    plt.show()

In [None]:
plot_mean_std(X_gene_e, 'btk_inhibitor', 'Gene Features')

Let's repeat for some different output labels:

In [None]:
# select an output label to plot associated training features
chosen_label = 'histamine_receptor_antagonist'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,))

plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

In [None]:
plot_mean_std(X_gene_e, 'histamine_receptor_antagonist', 'Gene Features')

In [None]:
# select an output label to plot associated training features
chosen_label = 'free_radical_scavenger'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,))

plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

In [None]:
plot_mean_std(X_gene_e, 'free_radical_scavenger', 'Gene Features')

This analysis highlights the potential for performing advanced feature engineering. We could process the trends of gene and/or cell features like we might with a time-series dataset, and use these features to supplement models that use the features in their standard form.
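
One way this time-series style treatment might look is sketched below: each row's feature profile is smoothed with a rolling mean across neighbouring columns, and the position of the largest smoothed peak is extracted as a candidate feature. The `rolling_profile_features` helper, the window size and the random toy data are assumptions for illustration, and whether the column ordering carries any real meaning for this dataset is untested here.

```python
import numpy as np
import pandas as pd

def rolling_profile_features(features, window=5):
    """Smooth each row's profile with a centred rolling mean across columns."""
    # rolling() operates down the rows, so transpose to roll along the feature axis
    return features.T.rolling(window=window, center=True, min_periods=1).mean().T

# toy stand-in for the gene features: 4 samples, 50 columns
rng = np.random.default_rng(0)
toy_genes = pd.DataFrame(rng.normal(size=(4, 50)),
                         columns=[f"g-{i}" for i in range(50)])

smoothed = rolling_profile_features(toy_genes, window=5)

# column position of the largest absolute smoothed value in each row
peak_position = smoothed.abs().values.argmax(axis=1)
print(smoothed.shape, peak_position)
```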

---

<a id="data-preprocessing"></a>
## 3. Preprocessing and data preparation

This will be relatively simple and will include:
- Standardisation of all numerical features.
- Creation of embeddings or encodings for our categorical variables.
- Removal of unwanted / unnecessary columns.

We'll define a simple class to perform these actions for us on both the training and test data.

### 3.1 Definition of a class for preprocessing:

In [None]:
class MOAPreprocessor:
    
    def __init__(self, cat_features, num_features, remove_cp_type=False):
        self.cat_features = cat_features
        self.num_features = num_features
        self.std_scaler = StandardScaler()
        self.remove_cp_type = remove_cp_type
        
    def preprocess_data(self, X, test=False):
        
        # take a copy of sig ids for reference
        sig_ids = X.loc[:, 'sig_id']
        
        #  remove ctl_vehicle if selected
        if self.remove_cp_type and not test:
            ctl_vehicle_idx = (X['cp_type'] == 'ctl_vehicle')
            data_df = X.loc[~ctl_vehicle_idx].copy()
            data_df.reset_index(inplace=True, drop=True)
        else:
            data_df = X.copy()
        
        # subsets of categorical and numerical
        X_cat = data_df.loc[:, self.cat_features].copy()
        
        # standardise our cp_time column values
        X_cat['cp_time'] = (data_df['cp_time'] / 24.0) - 2
        
        # one-hot encode our categorical features
        X_cat = pd.get_dummies(X_cat)
        
        # select our numerical features
        X_num = data_df.loc[:, self.num_features].copy()
        
        if not test:
            # fit parameters of our scaler and transform train
            X_num[self.num_features] = self.std_scaler.fit_transform(X_num)
            
            # add train sig ids to class instance
            self.train_sig_ids = sig_ids.copy()
            
        else:
            # transform test set
            X_num[self.num_features] = self.std_scaler.transform(X_num)
            
            # add test sig ids to class instance
            self.test_sig_ids = sig_ids.copy()
            
        return pd.concat([X_cat, X_num], axis=1)

In [None]:
#cat_features = ['cp_time', 'cp_dose', 'cp_type']
cat_features = ['cp_time', 'cp_dose']

# define non-numeric cols to form list of numeric cols
non_num_tuple = ('cp_time', 'cp_dose', 'cp_type', 'sig_id')
num_features = [x for x in train_features.columns.values if not x.startswith(non_num_tuple)]

data_processor = MOAPreprocessor(cat_features, num_features, remove_cp_type=True)
X_train_full = data_processor.preprocess_data(train_features)
X_test = data_processor.preprocess_data(test_features, test=True)

# convert our numeric data into float32 prior to tensorflow
X_train_full = X_train_full.astype('float32')
X_test = X_test.astype('float32')

X_train_full.shape, X_test.shape

In [None]:
X_train_full.head(3)

We also need to format our labels accordingly:

In [None]:
def process_labels(X, y):
    """ Format our output labels appropriately """
    # remove sig id from labels
    labels = y.drop('sig_id', axis=1).copy()

    # remove cp_type if selected
    if data_processor.remove_cp_type:
        ctl_vehicle_idx = (X['cp_type'] == 'ctl_vehicle')
        labels = labels.loc[~ctl_vehicle_idx].copy()
        labels.reset_index(inplace=True, drop=True)
    
    return labels

We can use this to process both our scored and non-scored labels:

In [None]:
y = process_labels(train_features, train_targets_scored)
y_nonscored = process_labels(train_features, train_targets_nonscored)
y.shape, y_nonscored.shape

Good, we can now move on to additional feature processing and feature engineering.

### 3.2 Preprocessing of Cell and Gene features

In order to use them as additional features, let's extract the gene and cell features from our training data.

In [None]:
def gene_and_cell_feats(data):
    X_gene = data.iloc[:, 3:775].copy()
    X_cell = data.iloc[:, 775:].copy()
    return X_gene, X_cell

In [None]:
X_train_gene, X_train_cell = gene_and_cell_feats(X_train_full)
X_test_gene, X_test_cell = gene_and_cell_feats(X_test)

#### Preparation of Cell and Gene features for Convolutional Neural Network Input

Before we can feed our data splits into a CNN model, they need to be in a specific shape. In particular, for a 1D ConvNet model our features need to be in the format (num_instances, num_features, 1).

In [None]:
def conv1d_preprocess(data):
    processed = data.astype('float32').values
    processed = processed.reshape(data.shape[0], data.shape[1], 1)
    return processed

# if not done so already - convert to float32 and reshape for ConvNet
X_train_gene, X_train_cell = gene_and_cell_feats(X_train_full)
X_test_gene, X_test_cell = gene_and_cell_feats(X_test)

X_train_gene = conv1d_preprocess(X_train_gene)
X_train_cell = conv1d_preprocess(X_train_cell)
X_test_gene = conv1d_preprocess(X_test_gene)
X_test_cell = conv1d_preprocess(X_test_cell)
X_train_gene.shape, X_train_cell.shape, X_test_gene.shape, X_test_cell.shape

We now have sets of features that can easily be fed into a composite CNN network, alongside any other data we may want to provide our model with.

### 3.3 Feature engineering some common statistics and relationships from our data

In addition to this, let's engineer some common features to use in our model, e.g. the mean and standard deviation of the cell and gene data.

In [None]:
def create_num_feats(num_data, feature_type):
    """ Find a range of numerical statistics across the row
        of each instance in the given numerical data """
    
    col_names = [f"{feature_type}_{x}" for x in ['mean', 'std', 'max', 'min']]
    
    means = num_data.mean(axis=1)
    stds = num_data.std(axis=1)
    maxs = num_data.max(axis=1)
    mins = num_data.min(axis=1)
    
    result = np.c_[means, stds, maxs, mins]
    return pd.DataFrame(result, columns=col_names)

# obtain aux features for both train and test
aux_cell_train = create_num_feats(X_train_cell, feature_type='cell')
aux_gene_train = create_num_feats(X_train_gene, feature_type='gene')
aux_cell_test = create_num_feats(X_test_cell, feature_type='cell')
aux_gene_test = create_num_feats(X_test_gene, feature_type='gene')

# create new dataframes of non-gene and non-cell features plus those above
X_train_misc = X_train_full.iloc[:, :3].copy()
X_train_misc.reset_index(inplace=True, drop=True)
X_train_misc = pd.concat([X_train_misc, aux_cell_train, aux_gene_train], axis=1)

X_test_misc = X_test.iloc[:, :3].copy()
X_test_misc.reset_index(inplace=True, drop=True)
X_test_misc = pd.concat([X_test_misc, aux_cell_test, aux_gene_test], axis=1)
X_train_misc.shape, X_test_misc.shape

Let's also use these features to create extended training & test sets, which include these extra engineered features in addition to the original features:

In [None]:
X_train_ext = pd.concat([X_train_full, aux_cell_train, aux_gene_train], axis=1)
X_test_ext = pd.concat([X_test, aux_cell_test, aux_gene_test], axis=1)
X_train_ext.shape, X_test_ext.shape

Good - we now have a variety of datasets to try out on various model types in the next few sections.

### 3.4 Creation of a function to over-sample our minority classes

We have a huge class imbalance within this challenge. To help counter this, we could apply a range of imbalanced-data techniques, however most of these are extremely difficult to apply due to the nature of the multi-label classification problem. A simple solution, which we'll use, is basic oversampling of our training data.

It's important that we don't perform any form of over-sampling before splitting the training data as part of a cross-validation strategy, otherwise duplicates of the same instance can end up in both the training and validation splits. This in turn will give us an overly optimistic estimate of the generalisation performance of our models.

In [None]:
def oversample_data(X, y, top_minority=200, min_class_count=30):
    """ Oversample the lowest n represented classes from the provided data 
        and labels, ensuring we have a minimum number of counts.
        X and y are expected to share the same index. """
    
    # gather the top n under-represented classes to oversample
    top_n_minority = y.sum().sort_values()[:top_minority].index.values
    extra_features = []
    extra_labels = []
    
    for column in top_n_minority:
        class_count = int(y[column].sum())
        class_count_diff = min_class_count - class_count
    
        if class_count_diff > 1 and class_count:
            
            # find instance idxs where class is 1
            positive_idxs = y[column] == 1
        
            for iteration in range(int(np.ceil(class_count_diff / class_count))):
                
                # sample the positive rows once, so that each duplicated
                # feature row keeps its own matching label row
                sampled_idx = y[positive_idxs].sample(class_count).index
                extra_features.append(X.loc[sampled_idx])
                extra_labels.append(y.loc[sampled_idx])
    
    # pd.concat is used here since DataFrame.append was removed in pandas 2.0
    oversampled_X = pd.concat([X] + extra_features)
    oversampled_y = pd.concat([y] + extra_labels)
    
    return oversampled_X, oversampled_y

Let's test the functionality of the above function:

In [None]:
X_oversamp, y_oversamp = oversample_data(X_train_full, y)

print(f"Original sizes: \n- X training: {X_train_full.shape} \n- y: {y.shape}\n") 
print(f"New oversampled sizes: \n- X training: {X_oversamp.shape} \n- y: {y_oversamp.shape}")

In [None]:
y_oversamp.sum().sort_values()[:30].plot.bar(figsize=(18,6))
plt.ylabel("Class Counts", weight='bold')
plt.title("Count of minority classes", weight='bold')
plt.show()

Good, we can now apply this during training if / when required.
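
To illustrate the point about ordering, here is a minimal sketch of oversampling inside a cross-validation loop, so that only the training fold is ever duplicated. It uses toy data and a simplified, hypothetical `duplicate_rare_rows` helper as a stand-in oversampler, so treat it as an outline of the pattern rather than the exact procedure used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def duplicate_rare_rows(X, y, min_count=3, seed=0):
    """Hypothetical stand-in oversampler: repeat positive rows of rare labels."""
    rng = np.random.default_rng(seed)
    extra_idx = []
    for col in y.columns:
        pos_idx = y.index[(y[col] == 1).to_numpy()].to_numpy()
        shortfall = min_count - len(pos_idx)
        if len(pos_idx) > 0 and shortfall > 0:
            extra_idx.extend(rng.choice(pos_idx, size=shortfall, replace=True).tolist())
    all_idx = list(X.index) + extra_idx
    # the same index list is used for both frames, keeping rows and labels aligned
    return X.loc[all_idx], y.loc[all_idx]

# toy data: 20 rows, one common label and one rare label with two positives
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size=(20, 3)), columns=['f0', 'f1', 'f2'])
y_toy = pd.DataFrame({'common': rng.integers(0, 2, size=20),
                      'rare': [1 if i in (0, 10) else 0 for i in range(20)]})

fold_summaries = []
kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kfold.split(X_toy):
    X_tr, y_tr = X_toy.iloc[train_idx], y_toy.iloc[train_idx]
    X_va, y_va = X_toy.iloc[val_idx], y_toy.iloc[val_idx]

    # oversample the training fold only; the validation fold is left untouched
    X_tr_os, y_tr_os = duplicate_rare_rows(X_tr, y_tr)

    # no duplicated training row should also appear in the validation fold
    leaked = set(X_tr_os.index) & set(X_va.index)
    fold_summaries.append((len(X_tr), len(X_tr_os), len(leaked)))

print(fold_summaries)
```

Each tuple holds the original training fold size, the oversampled size, and the number of rows shared with the validation fold, which should stay at zero.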

---

<a id="model-production"></a>
## 4. Model production and evaluation of a Tuned Deep NN

Let's split our data randomly for an initial validation set. Ideally we'd perform a multi-label stratified split here, but due to issues with limited numbers of class instances and imbalance across the dataset, we'll avoid it for now.

Once we've refined some of the basic hyper-parameters, we'll perform more in-depth optimisation and evaluation using K-Folds cross validation.

In [None]:
# choose a larger subset for this evaluation
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y, test_size=0.2, shuffle=True)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

<a id="standard-3layer"></a>
### 4.1 Model production - ANN with three hidden layers on single hold-out validation set

Let's first create a tuned ANN without any convolutional layers, so that we've got something to compare against.

We'll create a model with three hidden layers, using ELU activations, He normal weight initialisation, batch normalisation, dropout regularisation, and Adam optimisation with a suitable learning rate decay.

In [None]:
def ann_model_1(dropout=True, dropout_val=0.45, lr=2e-3, 
                output_shape=206, input_feat_dim=X_train_full.shape[1]):
    """ Create a basic Deep NN for classification """
    model = models.Sequential()
    
    model.add(layers.Dense(512, activation='elu', input_shape=(input_feat_dim,), 
                           kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
    model.add(layers.Dense(256, activation='elu', kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
    model.add(layers.Dense(256, activation='elu', kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
        
    # output layer
    model.add(layers.Dense(output_shape, activation='sigmoid'))
        
    model.compile(optimizer=keras.optimizers.Adam(lr=lr), 
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

#### Evaluating a suitable learning rate for this model:

Let's create a custom callback for exploring the best learning rates for our models:

In [None]:
class LearningRateComparison(keras.callbacks.Callback):
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        
        # lists to store the current rate and associated loss
        self.lr_rates = []
        self.losses = []
        
    def on_batch_end(self, batch, logs=None):
        # record the rate used for this batch alongside its loss
        current_lr = K.get_value(self.model.optimizer.lr)
        self.lr_rates.append(current_lr)
        self.losses.append(logs["loss"])
        # multiply the rate by a constant factor, ready for the next batch
        K.set_value(self.model.optimizer.lr, current_lr * self.factor)

Train for a handful of epochs (10 here) while the callback increases the learning rate after every batch:

In [None]:
# define custom learning rate scheduler to compare loss across many learning rates
custom_lr = LearningRateComparison(factor=1.0025)

model_1 = ann_model_1(dropout=True, lr=1e-3)

# train model for 10 epochs
history = model_1.fit(X_train, y_train, epochs=10, 
                      batch_size=64, validation_data=(X_val, y_val), 
                      callbacks=[custom_lr])

In [None]:
plt.figure(figsize=(12,5))
sns.lineplot(custom_lr.lr_rates, custom_lr.losses)
plt.gca().set_xscale('log')
plt.hlines(min(custom_lr.losses), min(custom_lr.lr_rates), max(custom_lr.lr_rates), 
           linestyle='dashed')
plt.axis([min(custom_lr.lr_rates), 1.0, 0, custom_lr.losses[0]])
plt.xlabel("Learning rate", weight='bold', size=13)
plt.ylabel("Loss", weight='bold', size=13)
plt.grid()
plt.show()

Our loss begins to increase after around $ 1 \times 10^{-1} $, however this is quite high for an initial learning rate. We'll try a learning rate that is comfortably lower than this, but still high enough to keep training and convergence times reasonable.
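
Because the rate is multiplied by a constant factor after every batch, the rate after $n$ batches is $lr_0 \times factor^{n}$, which determines how long the sweep has to run to cover the range of interest. The sketch below works this out for a starting rate of 1e-3 and a factor of 1.0025; the `batches_per_epoch` value is a hypothetical figure that depends on the size of the training split and the batch size.

```python
import numpy as np

initial_lr = 1e-3         # starting learning rate of the sweep
factor = 1.0025           # per-batch multiplier applied to the learning rate
batches_per_epoch = 275   # hypothetical: roughly n_train_rows / batch_size

def batches_to_reach(target_lr, initial_lr, factor):
    """Smallest n such that initial_lr * factor**n >= target_lr."""
    return int(np.ceil(np.log(target_lr / initial_lr) / np.log(factor)))

n_batches = batches_to_reach(1.0, initial_lr, factor)
epochs_needed = n_batches / batches_per_epoch
print(n_batches, epochs_needed)
```

Under these assumptions, the sweep needs a number of epochs rather than one or two before the rate approaches 1.0, which is why the number of epochs and the factor should be chosen together.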

#### Training our model on the training set and evaluating on the hold-out validation split 

For this we'll make use of learning rate decay with a scheduler, along with an early stop callback so that we can obtain the best model found throughout the training process.

In [None]:
model_1 = ann_model_1(dropout=True, lr=1e-3)

In [None]:
def schedule_lr_rate(epoch, lr):
    """ Use initial learning rate for 20 epochs and then
        decrease it exponentially """
    if epoch < 20:
        return lr
    else:
        return lr * tf.math.exp(-0.1)

# create our lr scheduler - use reduceLRonPlat below - better performance
#lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule_lr_rate)

# create learning rate scheduler
lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, 
                                                   patience=3, verbose=0, 
                                                   min_delta=0.0001, mode='min')

# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# list of callbacks to use
trg_callbacks = [early_stopper, lr_scheduler]

In [None]:
history = model_1.fit(X_train, y_train, epochs=75, 
                      batch_size=64, validation_data=(X_val, y_val), 
                      callbacks=trg_callbacks)

In [None]:
def plot_history_results(history, metric='accuracy', figsize=(16,6)):
    """ Helper function for plotting history from keras model """
    
    # gather desired features
    trg_loss = history.history['loss']
    val_loss = history.history['val_loss']
    trg_acc = history.history[f'{metric}']
    val_acc = history.history[f'val_{metric}']
    epochs = range(1, len(trg_acc) + 1)

    # plot losses and accuracies for training and validation 
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(1, 2, 1)
    plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
    plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
    plt.title("Training / Validation Loss")
    ax.set_ylabel("Loss")
    ax.set_xlabel("Epochs")
    plt.legend(loc='best')

    ax = fig.add_subplot(1, 2, 2)
    plt.plot(epochs, trg_acc, marker='o', label=f'Training {metric}')
    plt.plot(epochs, val_acc, marker='^', label=f'Validation {metric}')
    plt.title(f"Training / Validation {metric}")
    ax.set_ylabel(f"{metric}")
    ax.set_xlabel("Epochs")
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()

In [None]:
plot_history_results(history)

In [None]:
val_preds = model_1.predict(X_val)
loss = keras.losses.BinaryCrossentropy()(y_val, val_preds)
print(f'Validation Log Loss: {loss.numpy():.4f}')

Not too shabby, however we didn't put much (if any!) effort into forming our hold-out validation split, so it's difficult to say for certain how well our model is actually performing. Let's evaluate it on a range of validation splits using K-folds cross validation instead. This is more computationally intensive, but a much better way of gauging how well our model actually performs.

Before we move on, just out of interest, let's see how our score on the validation set is affected when we clip our prediction values:

In [None]:
pred_min = 0.001
pred_max = 0.999

clipped_val_preds = np.clip(val_preds, pred_min, pred_max)
loss = keras.losses.BinaryCrossentropy()(y_val, clipped_val_preds)
print(f'Validation Log Loss (with clipped preds): {loss.numpy():.4f}')

It's actually worse in this case. However, this could be something to keep in mind and try again for different sets of results. Let's also see, just out of curiosity, how our log loss is impacted when we set a threshold and round our predictions to hard class labels:

In [None]:
threshold = 0.5
rounded_preds = np.where(val_preds > threshold, 1.0, 0.0)
loss = keras.losses.BinaryCrossentropy()(y_val, rounded_preds)
print(f'Validation Log Loss (with rounded preds): {loss.numpy():.4f}')
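
The behaviour of both experiments can be reproduced on a tiny hand-made example. The sketch below uses a plain NumPy log loss (the `binary_log_loss` helper and its `eps` value are assumptions, not the Keras implementation) to compare soft, clipped and rounded predictions on made-up values.

```python
import numpy as np

def binary_log_loss(y_true, y_pred, eps=1e-7):
    """Mean binary log loss, with predictions clipped away from exactly 0 and 1."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

# made-up targets and predictions: three true negatives and one missed positive
y_true = np.array([0.0, 0.0, 0.0, 1.0])
soft_preds = np.array([1e-5, 1e-5, 0.02, 0.30])

clipped_preds = np.clip(soft_preds, 0.001, 0.999)
rounded_preds = np.where(soft_preds > 0.5, 1.0, 0.0)

soft_loss = binary_log_loss(y_true, soft_preds)
clipped_loss = binary_log_loss(y_true, clipped_preds)
rounded_loss = binary_log_loss(y_true, rounded_preds)
print(soft_loss, clipped_loss, rounded_loss)
```

Clipping only helps when the model is confidently wrong; when most confident predictions are correct, as with the many true negatives here, it adds a small penalty to each of them. Rounding is far more damaging, since every missed positive is then scored as a fully confident mistake.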

<a id="kfolds-cross-validation"></a>
### 4.2 Improvements on evaluation of our ANN model - K Folds Cross Validation

We'll use 7 folds of cross-validation, and for each fold we'll evaluate a new model trained on the respective training and validation splits. In addition, we can also make test predictions during each fold with each model, which can then be combined into an overall ensemble of predictions. In general, this approach should give us a better performance due to the mix of training data our different models are using.

In [None]:
def perform_cross_validation(X_full, y_full, model_func, nfolds=7, lr=1e-3, epochs=100, 
                             batch_size=64, callbacks=trg_callbacks, test_set=None, 
                             print_scores=True, random_state=12):
    """ Perform cross-validation and return histories, validation losses and averaged 
    test set predictions across all folds. """
    model_histories = []
    model_losses = []
    test_preds = np.zeros((test_features.shape[0], 
                           train_targets_scored.shape[1] - 1))
    
    k_folds = KFold(n_splits=nfolds, shuffle=True, random_state=random_state)

    for train_idx, val_idx in tqdm(k_folds.split(X_full, y_full)):
        train_split = X_full.iloc[train_idx].copy()
        train_labels = y_full.iloc[train_idx].astype(np.float64).copy()
        val_split = X_full.iloc[val_idx].copy()
        val_labels = y_full.iloc[val_idx].astype(np.float64).copy()
    
        fold_model = model_func(lr=lr)
    
        # train model for up to `epochs` epochs (early stopping may end it sooner)
        temp_history = fold_model.fit(train_split, train_labels, 
                                epochs=epochs, batch_size=batch_size, 
                                verbose=0, validation_data=(val_split, val_labels), 
                                callbacks=callbacks)
    
        model_histories.append(temp_history)
        model_val_preds = fold_model.predict(val_split)
        model_log_loss = keras.losses.BinaryCrossentropy()(val_labels, 
                                                       model_val_preds).numpy()
        model_losses.append(model_log_loss)
        if print_scores:
            print(f'Current Fold Validation Loss: {model_log_loss:.4f}')
    
        # test preds if selected
        if test_set:
            temp_test_preds = fold_model.predict(test_set)
            test_preds += (temp_test_preds / nfolds)

    model_losses = np.array(model_losses)
    
    return model_histories, model_losses, test_preds

Lets perform this a few times with different random seeds, and see what the results yield:

In [None]:
random_states = [12, 24, 37, 42]
test_preds = np.zeros((test_features.shape[0], 
                       train_targets_scored.shape[1] - 1))

for iteration, rand_state in enumerate(random_states):
    print(f"Performing Iteration {iteration + 1} of Cross-Validations.\n")
    _, iter_losses, iter_preds = perform_cross_validation(X_train_full, y, ann_model_1, 
                                                          random_state=rand_state)
    
    print(f"Iteration {iteration + 1} mean loss over all folds: "
          f"{iter_losses.mean():.4f} +/- {iter_losses.std():.4f}\n")
    
    test_preds +- (iter_preds / len(random_states))

#### Combining the predictions from cross-validation into an overall averaged set of test predictions

Since we removed all instances with cp_type == ctl_vehicle from our training data, we will need to adjust our test set predictions so that the targets are always zero for these instances.
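
As a toy illustration of this adjustment, the sketch below zeroes the control rows of a small, made-up prediction array with a boolean mask. The arrays are placeholders rather than the real test data, and the optional clipping step is applied before the zeroing so that the control rows stay at exactly zero.

```python
import numpy as np

# made-up stand-ins for the real test metadata and model output
cp_type = np.array(['trt_cp', 'ctl_vehicle', 'trt_cp', 'ctl_vehicle'])
toy_preds = np.array([[0.20, 0.0004],
                      [0.10, 0.3000],
                      [0.9995, 0.05],
                      [0.40, 0.6000]])

is_control = (cp_type == 'ctl_vehicle')   # boolean row mask

# clip first, then zero the control rows, so that clipping
# cannot lift the zeroed rows back up to the lower bound
adjusted = np.clip(toy_preds, 0.001, 0.999)
adjusted[is_control] = 0.0

print(adjusted)
```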

In [None]:
def process_test_preds(test_preds, original_test_feats, clip_preds=False, 
                       pred_max=0.999, pred_min=0.001):
    """ Adjust our test set predictions by replacing all class preds
        with zero when cp_type == ctl_vehicle """
    corrected_preds = test_preds.copy()
    test_sig_ids = original_test_feats['sig_id'].copy()

    # clip before zeroing, and over every target column, so the control
    # rows set to zero below are not lifted back up to pred_min
    if clip_preds:
        corrected_preds = np.clip(corrected_preds, pred_min, pred_max)

    # boolean row mask, independent of the dataframe's index labels
    test_ctl_vehicle_mask = (original_test_feats['cp_type'] == 'ctl_vehicle').values
    corrected_preds[test_ctl_vehicle_mask] = 0
    
    return corrected_preds, test_sig_ids

In [None]:
test_preds, test_sig_ids = process_test_preds(test_preds, test_features, clip_preds=False)

In [None]:
test_submission = pd.DataFrame({'sig_id' : test_sig_ids})
test_preds_df = pd.DataFrame(test_preds, columns=train_targets_scored.columns[1:])
test_submission = pd.concat([test_submission, test_preds_df], axis=1)
test_submission.head(3)

Great, this can now be submitted if desired. Alternatively, we can try to improve on these results using more complex architectures and data processing.

<a id="feature-engineered-evaluation"></a>
### 4.3 Evaluating our model with the additional engineered features

Let's investigate whether our performance improves when we make use of the additional features engineered earlier. This is as simple as swapping out our original full training set (X_train_full) for the extended training set (X_train_ext) in our cross-validation code:

In [None]:
N_FOLDS = 7
k_folds = KFold(n_splits=N_FOLDS, shuffle=True)

In [None]:
model_histories = []
model_losses = []
test_preds_ext = np.zeros((test_features.shape[0], 
                       train_targets_scored.shape[1] - 1))

for train_idx, val_idx in tqdm(k_folds.split(X_train_ext, y)):
    train_split = X_train_ext.iloc[train_idx].copy()
    train_labels = y.iloc[train_idx].astype(np.float64).copy()
    val_split = X_train_ext.iloc[val_idx].copy()
    val_labels = y.iloc[val_idx].astype(np.float64).copy()
    
    temp_model = ann_model_1(dropout=True, lr=1e-3, input_feat_dim=X_train_ext.shape[1])
    
    # train model for up to 75 epochs with early stopping
    temp_history = temp_model.fit(train_split, train_labels, 
                            epochs=75, batch_size=64, verbose=0,
                            validation_data=(val_split, val_labels), 
                            callbacks=trg_callbacks)
    
    model_histories.append(temp_history)
    
    model_val_preds = temp_model.predict(val_split)
    
    model_log_loss = keras.losses.BinaryCrossentropy()(val_labels, model_val_preds).numpy()
    print(f'Current Fold Validation Loss: {model_log_loss:.4f}')
    
    model_losses.append(model_log_loss)
    
    temp_test_preds = temp_model.predict(X_test_ext)
    test_preds_ext += (temp_test_preds / N_FOLDS)

model_losses = np.array(model_losses)

In [None]:
print(f"Mean loss across all folds: {model_losses.mean():.4f} +/- {model_losses.std():.4f}")

It seems that in this case the additional columns actually decrease performance, although with such a small margin this is debatable. We would need to repeat this many times, using different randomised fold splits, to know for certain. Nevertheless, we did not obtain the boost in performance we hoped for, although it would be naive to expect such a simple feature engineering approach to yield anything spectacular.
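
One way to judge whether such a gap is meaningful is to compare it against the fold-to-fold spread. The sketch below is an assumption about how this could be done, not part of the original analysis: `compare_fold_losses` is a hypothetical helper, and the two loss lists are placeholder values rather than results from this notebook. A paired comparison like this is only valid when both runs use identical fold splits (for example, the same `random_state`).

```python
import numpy as np

def compare_fold_losses(losses_a, losses_b):
    """Return the mean paired difference (b - a) and its standard error across folds."""
    diffs = np.asarray(losses_b, dtype=float) - np.asarray(losses_a, dtype=float)
    std_err = diffs.std(ddof=1) / np.sqrt(len(diffs))
    return float(diffs.mean()), float(std_err)

# placeholder values only, chosen to illustrate the calculation
baseline_losses = [0.100, 0.104, 0.098, 0.102, 0.101, 0.099, 0.103]
extended_losses = [0.101, 0.106, 0.096, 0.104, 0.100, 0.101, 0.102]

mean_diff, std_err = compare_fold_losses(baseline_losses, extended_losses)
print(f"mean difference: {mean_diff:+.5f} +/- {std_err:.5f}")
```

When the standard error is comparable to or larger than the mean difference, as with these placeholder numbers, the two feature sets cannot be separated on that evidence alone.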

<a id="dimensionally-reduced-evaluation"></a>
### 4.4 Evaluating performance with dimensionally reduced features

For completeness, alongside the extra features tested above, we'll also evaluate the effect of dimensionality reduction. We'll take our cell and gene features and reduce them with PCA, retaining 95% of the variance in each group. We'll then apply K-Folds cross-validation again and see what impact this has on performance.

In [None]:
# create our pca transformers for each
gene_pca = PCA(n_components=0.95)
cell_pca = PCA(n_components=0.95)

# fit transform training set with 95% variance retained
X_gene_rd = gene_pca.fit_transform(X_train_gene.reshape(X_train_gene.shape[:-1]))
X_cell_rd = cell_pca.fit_transform(X_train_cell.reshape(X_train_cell.shape[:-1])) 
X_train_rd = X_train_full.iloc[:, :3].copy()
X_train_rd = np.concatenate([X_train_rd, X_gene_rd, X_cell_rd], axis=1)

# transform test set using pca transformers from above
test_gene_rd = gene_pca.transform(X_test_gene.reshape(X_test_gene.shape[:-1]))
test_cell_rd = cell_pca.transform(X_test_cell.reshape(X_test_cell.shape[:-1]))
X_test_rd = X_test.iloc[:, :3].copy()
X_test_rd = np.concatenate([X_test_rd, test_gene_rd, test_cell_rd], axis=1)

# verify both feature dimensions match
X_train_rd.shape, X_test_rd.shape
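
Passing a float such as `n_components=0.95` asks PCA to keep just enough components to explain that fraction of the variance. A rough NumPy sketch of that selection rule on synthetic data follows; `n_components_for_variance` is a hypothetical helper written for illustration, not scikit-learn's internal code.

```python
import numpy as np

def n_components_for_variance(X, target=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio exceeds `target`."""
    X_centered = X - X.mean(axis=0)
    # squared singular values of the centred data are proportional
    # to the variance captured by each principal component
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    return int(np.searchsorted(cumulative, target, side='right') + 1)

# synthetic data: 200 samples, 10 features driven by only 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_toy = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

k = n_components_for_variance(X_toy, target=0.95)
print(f"components kept: {k} of {X_toy.shape[1]}")
```

With strongly correlated features, as in this toy example, far fewer components than original columns are needed.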

In [None]:
model_histories = []
model_losses = []
test_preds_rd = np.zeros((test_features.shape[0], 
                       train_targets_scored.shape[1] - 1))

for train_idx, val_idx in tqdm(k_folds.split(X_train_rd, y)):
    train_split = X_train_rd[train_idx].copy()
    train_labels = y.iloc[train_idx].astype(np.float64).copy()
    val_split = X_train_rd[val_idx].copy()
    val_labels = y.iloc[val_idx].astype(np.float64).copy()
    
    temp_model = ann_model_1(dropout=True, lr=1e-3, input_feat_dim=X_train_rd.shape[1])
    
    # train model for up to 75 epochs with early stopping
    temp_history = temp_model.fit(train_split, train_labels, 
                            epochs=75, batch_size=64, verbose=0,
                            validation_data=(val_split, val_labels), 
                            callbacks=trg_callbacks)
    
    model_histories.append(temp_history)
    
    model_val_preds = temp_model.predict(val_split)
    
    model_log_loss = keras.losses.BinaryCrossentropy()(val_labels, model_val_preds).numpy()
    print(f'Current Fold Validation Loss: {model_log_loss:.4f}')
    
    model_losses.append(model_log_loss)
    
    temp_test_preds = temp_model.predict(X_test_rd)
    test_preds_rd += (temp_test_preds / N_FOLDS)

model_losses = np.array(model_losses)

In [None]:
print(f"Mean loss across all folds: {model_losses.mean():.4f} +/- {model_losses.std():.4f}")

Not too bad, although the performance is no improvement over what we had originally with the standard features. Dimensionality reduction techniques tend to work well with complex data problems, particularly when there are a large number of redundant and correlated features. 

If you train a simple linear model or other classical model on this dataset, then it is likely you'll obtain a good improvement in the final performance if you perform dimensionality reduction. However, in the case of a deep NN with a huge number of model parameters, we don't actually obtain an improvement, since the model is more than capable of distilling the complexity contained within the original high-dimensional dataset. All we obtain in this case is a slightly faster training time due to having fewer dimensions.

<a id="oversampled-evaluation"></a>
### 4.5 Evaluating performance when we apply cross validation with oversampling of minority classes

Finally, in addition to testing our ANN on various dataset combinations and parts thereof, we'll also try a kind of oversampling technique on our training data as we train each model on our cross-validation folds. It's unlikely this will be reflected in a better validation score, since the oversampling specifically targets minority classes. However, it could result in better generalisation performance when our model is exposed to new and totally unseen data, such as that in the public test set and hidden (held-out) test set.
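
For readers who want a feel for the idea, here is a minimal sketch of one possible multi-label oversampling scheme. It is an assumption for illustration and not necessarily how the notebook's `oversample_data` works: the hypothetical `oversample_rare_labels` duplicates rows that carry a rare positive label until each label that occurs at all has at least `min_count` positive rows.

```python
import numpy as np

def oversample_rare_labels(X, y, min_count=5, seed=0):
    """Duplicate rows carrying a rare positive label until every label
    that occurs at all has at least `min_count` positive rows."""
    rng = np.random.default_rng(seed)
    extra_rows = []
    for label in range(y.shape[1]):
        positive_rows = np.flatnonzero(y[:, label] == 1)
        shortfall = min_count - len(positive_rows)
        if len(positive_rows) > 0 and shortfall > 0:
            extra_rows.append(rng.choice(positive_rows, size=shortfall, replace=True))
    if not extra_rows:
        return X, y
    extra_rows = np.concatenate(extra_rows)
    return np.vstack([X, X[extra_rows]]), np.vstack([y, y[extra_rows]])

# toy data: label 0 is common, label 1 has a single positive, label 2 has none
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.zeros((10, 3))
y_toy[:6, 0] = 1
y_toy[7, 1] = 1

X_os, y_os = oversample_rare_labels(X_toy, y_toy, min_count=5)
print(X_os.shape, y_os.sum(axis=0))
```

As in the loop below, any such resampling should be applied to the training split only, so that the validation split still reflects the original class balance.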

In [None]:
N_FOLDS = 7
k_folds = KFold(n_splits=N_FOLDS, shuffle=True)

In [None]:
model_histories = []
model_losses = []
test_preds_os = np.zeros((test_features.shape[0], 
                          train_targets_scored.shape[1] - 1))

for train_idx, val_idx in tqdm(k_folds.split(X_train_full, y)):
    train_split = X_train_full.iloc[train_idx].copy()
    train_labels = y.iloc[train_idx].astype(np.float64).copy()
    val_split = X_train_full.iloc[val_idx].copy()
    val_labels = y.iloc[val_idx].astype(np.float64).copy()
    
    # oversample our data for training split
    train_split.reset_index(inplace=True, drop=True)
    train_labels.reset_index(inplace=True, drop=True)
    train_split, train_labels = oversample_data(train_split, 
                                                train_labels)
    
    print(f"Size of training split: {train_split.shape}")
    
    temp_model = ann_model_1(dropout=True, lr=1e-3)
    
    # train model for up to 75 epochs with early stopping
    temp_history = temp_model.fit(train_split, train_labels, 
                            epochs=75, batch_size=64, verbose=0,
                            validation_data=(val_split, val_labels), 
                            callbacks=trg_callbacks)
    
    model_histories.append(temp_history)
    
    model_val_preds = temp_model.predict(val_split)
    
    model_log_loss = keras.losses.BinaryCrossentropy()(val_labels, model_val_preds).numpy()
    print(f'Current Fold Validation Loss: {model_log_loss:.4f}')
    
    model_losses.append(model_log_loss)
    
    temp_test_preds = temp_model.predict(X_test)
    test_preds_os += (temp_test_preds / N_FOLDS)

model_losses = np.array(model_losses)

In [None]:
print(f"Mean loss across oversampled folds: {model_losses.mean():.4f} +/- {model_losses.std():.4f}")

In [None]:
# plot the training history of the final oversampled fold
plot_history_results(model_histories[-1])

We've seen that our performance is relatively consistent above, regardless of whether we include the additional features, conduct over-sampling or apply dimensionality reduction. 

To enhance the individual predictions obtained above, we'll combine them all into one overall ensemble set of predictions.

In [None]:
# parentheses are needed so that the sum, not just the last term, is divided by 4
dnn_test_preds = (test_preds + test_preds_ext + test_preds_rd + test_preds_os) / 4.0

This is a rather crude way of combining our predictions from this section, but it will suffice for the purpose of this work.
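
A slightly less crude alternative is a weighted average, where better-validated models contribute more. The sketch below uses made-up predictions; `blend_predictions` and the example weights are assumptions for illustration, not something tuned in this notebook.

```python
import numpy as np

def blend_predictions(pred_sets, weights=None):
    """Weighted average of equally shaped prediction arrays.
    With weights=None every model contributes equally."""
    stacked = np.stack(pred_sets, axis=0)   # (n_models, n_samples, n_targets)
    return np.average(stacked, axis=0, weights=weights)

# made-up predictions from two models for 2 samples and 2 targets
preds_a = np.array([[0.2, 0.4], [0.6, 0.8]])
preds_b = np.array([[0.4, 0.0], [0.2, 0.4]])

equal_blend = blend_predictions([preds_a, preds_b])
weighted_blend = blend_predictions([preds_a, preds_b], weights=[0.75, 0.25])
print(equal_blend)
print(weighted_blend)
```

`np.average` normalises by the sum of the weights, so the blended values remain valid probabilities as long as the inputs are.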

---

<a id="composite-models"></a>
## 5. Trying something novel - ANN with supplementary 1D Conv inputs

<a id="single-Conv1d"></a>
### 5.1 ConvNet experiment - validate that a single 1D ConvNet can learn and predict using the data

Just to make sure we're not completely wasting time, let's first establish that a basic 1D ConvNet can actually learn from our data and make meaningful predictions. We'll form a simple model that takes our gene numerical features as input, and train it to make predictions on our scored training labels.

In [None]:
def single_conv1d(num_features, dropout_val=0.45, lr=1e-3):
    model = Sequential()
    model.add(layers.Input(shape=(num_features, 1)))
    model.add(layers.Conv1D(filters=64, kernel_size=7))
    model.add(layers.MaxPooling1D(5))
    model.add(layers.Conv1D(filters=32, kernel_size=5))
    model.add(layers.MaxPooling1D(5))
    model.add(layers.Conv1D(filters=16, kernel_size=2))
    model.add(layers.Flatten())
    
    # follow-up dense layers
    model.add(Dense(256, activation='elu', kernel_initializer='he_normal'))
    model.add(BatchNormalization())
    model.add(layers.Dropout(dropout_val))
    
    # output layer
    model.add(Dense(206, activation='sigmoid'))
    
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), 
                  loss=['binary_crossentropy'], 
                  metrics=['accuracy'])
    
    return model
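
As a sanity check on the layer sizes, the width of the flattened output can be traced by hand. This sketch assumes the Keras defaults used above (`'valid'` padding and stride 1 for `Conv1D`, stride equal to the pool size for `MaxPooling1D`), and assumes 772 gene and 100 cell features as in the raw competition data; substitute `X_train_gene.shape[1]` or `X_train_cell.shape[1]` if yours differ. The helper names are hypothetical.

```python
def conv1d_output_length(length, kernel_size):
    """Length after a Conv1D layer with 'valid' padding and stride 1."""
    return length - kernel_size + 1

def maxpool1d_output_length(length, pool_size):
    """Length after MaxPooling1D with 'valid' padding and stride equal to pool_size."""
    return length // pool_size

def flattened_size(num_features):
    """Trace the sequence length through the conv/pool stack defined above."""
    length = conv1d_output_length(num_features, 7)
    length = maxpool1d_output_length(length, 5)
    length = conv1d_output_length(length, 5)
    length = maxpool1d_output_length(length, 5)
    length = conv1d_output_length(length, 2)
    return length * 16   # 16 filters in the final Conv1D layer

gene_flat = flattened_size(772)   # assumed number of gene features
cell_flat = flattened_size(100)   # assumed number of cell features
print(gene_flat, cell_flat)
```

The same stack leaves very little sequence length for a 100-feature input, which is worth bearing in mind if the kernel or pool sizes are increased.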

In [None]:
cell_cnn = single_conv1d(num_features=X_train_gene.shape[1])
cell_cnn.summary()

Let's see if this model even has any chance of working. We'll train for 50 epochs and see how it performs using just the gene features:

In [None]:
# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# create learning rate scheduler
lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, 
                                                   patience=3, verbose=0, 
                                                   min_delta=0.0001, mode='min')

# list of callbacks to use
trg_callbacks = [early_stopper, lr_scheduler]

In [None]:
# train model for up to 50 epochs with early stopping
history = cell_cnn.fit(X_train_gene, y, epochs=50, batch_size=64, verbose=1,
                              validation_split=0.2, callbacks=trg_callbacks)

In [None]:
plot_history_results(history)

Performance might not be the best, and training takes noticeably longer than for a dense network, but this model can clearly learn and make predictions. This is reassuring, and means that there is scope to form more complex models that blend convolutional and dense layers together.

We'll explore this next.

<a id="multi-input-convnet"></a>
### 5.2 Combine gene and cell data using a multi-input Convolutional Neural Network

We'll make a model that takes the gene and cell data in separately and processes them using 1D convolutional layers. Simultaneously, we'll take in the categorical features from the original data, plus some common engineered features from the numerical data.

We can do this using the features we engineered earlier from the numerical data. In addition, we have already processed our gene and cell data into a form amenable to CNNs.
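
For reference, "a form amenable to CNNs" here means a 3D array of shape (samples, steps, channels) with a single channel. A minimal sketch on made-up data, assuming the notebook's arrays follow the same layout:

```python
import numpy as np

# made-up tabular block: 4 samples, 6 numerical features
tabular = np.arange(24, dtype=np.float32).reshape(4, 6)

# Conv1D expects (batch, steps, channels); add a single channel axis
conv_ready = np.expand_dims(tabular, axis=-1)

# dropping the channel axis recovers the 2D array needed by e.g. PCA
restored = conv_ready.reshape(conv_ready.shape[:-1])

print(conv_ready.shape, restored.shape)
```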

In [None]:
def multi_input_cnn(dropout_val=0.45, lr=1e-3):
    """ Composite ANN with multiple inputs, including the cell, gene and
        other general features. """
    # define input layers
    main_input = layers.Input(shape=(X_train_misc.shape[1],))
    gene_input = layers.Input(shape=(X_train_gene.shape[1], 1))
    cell_input = layers.Input(shape=(X_train_cell.shape[1], 1))
    
    # define main input processing layers
    input_batch_norm = BatchNormalization()(main_input)
    hidden_1 = layers.Dense(128, activation='elu', 
                            kernel_initializer='he_normal')(input_batch_norm)
    batch_norm_1 = BatchNormalization()(hidden_1)
    dropout_1 = layers.Dropout(dropout_val)(batch_norm_1)
    
    # define gene convolutional layers
    gene_conv1 = layers.Conv1D(filters=64, kernel_size=7)(gene_input)
    gene_max_pool_1 = layers.MaxPooling1D(5)(gene_conv1)
    gene_conv2 = layers.Conv1D(filters=32, kernel_size=5)(gene_max_pool_1)
    gene_max_pool_2 = layers.MaxPooling1D(5)(gene_conv2)
    gene_conv3 = layers.Conv1D(filters=16, kernel_size=2)(gene_max_pool_2)
    gene_flatten = layers.Flatten()(gene_conv3)
    
    # define cell convolutional layers
    cell_conv1 = layers.Conv1D(filters=64, kernel_size=7)(cell_input)
    cell_max_pool_1 = layers.MaxPooling1D(5)(cell_conv1)
    cell_conv2 = layers.Conv1D(filters=32, kernel_size=5)(cell_max_pool_1)
    cell_max_pool_2 = layers.MaxPooling1D(5)(cell_conv2)
    cell_conv3 = layers.Conv1D(filters=16, kernel_size=2)(cell_max_pool_2)
    cell_flatten = layers.Flatten()(cell_conv3)
    
    # combine our multiple inputs into one
    concat_layer = layers.concatenate([dropout_1, gene_flatten, 
                                       cell_flatten])
    
    # follow-up dense layers
    hidden_2 = layers.Dense(256, activation='elu', 
                            kernel_initializer='he_normal')(concat_layer)
    batch_norm_2 = BatchNormalization()(hidden_2)
    dropout_2 = layers.Dropout(dropout_val)(batch_norm_2)
    
    hidden_3 = layers.Dense(256, activation='elu', 
                            kernel_initializer='he_normal')(dropout_2)
    batch_norm_3 = BatchNormalization()(hidden_3)
    dropout_3 = layers.Dropout(dropout_val)(batch_norm_3)
    
    # main output for our scored labels
    output_1 = layers.Dense(206, activation='sigmoid', 
                            name='scored_output')(dropout_3)
    
    model = keras.Model(inputs=[main_input, gene_input, cell_input], 
                        outputs=[output_1])
    
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), 
                  loss=['binary_crossentropy'], metrics=['accuracy'])
    return model

In [None]:
multifeature_cnn = multi_input_cnn()

# let's visualise our model to ensure it looks correct
keras.utils.plot_model(multifeature_cnn, "multi_input_cnn.png", show_shapes=True)

In [None]:
# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# create learning rate scheduler
lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, 
                                                   patience=3, verbose=0, 
                                                   min_delta=0.0001, mode='min')

# list of callbacks to use
trg_callbacks = [early_stopper, lr_scheduler]

In [None]:
# train model for 50 epochs w/ early stop. Ensure we use three inputs:
history = multifeature_cnn.fit([X_train_misc, X_train_gene, X_train_cell], y, 
                               epochs=50, batch_size=64, verbose=1,
                               validation_split=0.2, callbacks=trg_callbacks)

In [None]:
plot_history_results(history)

The performance is not amazing in this case, but when we consider that we are mostly using the gene and cell data fed into multiple 1D convolutional layers, it's certainly not bad!

<a id="composite-model"></a>
### 5.3 Produce a finalised composite network with both CNN features and full tabular features 

This will be more complicated, and will make use of our full original dataset, along with our gene and cell features being fed into separate 1D convolutional layers. We'll process these three inputs in separate branches, combine them later in the network, and use the combined output to form our final predictions.

In [None]:
def ann_convnet_composite(dropout_val=0.45, lr=1e-3):
    """ Composite ANN with multiple inputs, including the full tabular
        data and the gene and cell features fed into 1D Conv layers """
    # define our input layers
    main_input = layers.Input(shape=(X_train_full.shape[1],))
    gene_input = layers.Input(shape=(X_train_gene.shape[1], 1))
    cell_input = layers.Input(shape=(X_train_cell.shape[1], 1))
    
    # define main input processing layers
    input_batch_norm = BatchNormalization()(main_input)
    hidden_1 = layers.Dense(512, activation='elu', 
                            kernel_initializer='he_normal')(input_batch_norm)
    batch_norm_1 = BatchNormalization()(hidden_1)
    dropout_1 = layers.Dropout(dropout_val)(batch_norm_1)
    
    # define gene convolutional layers
    gene_conv1 = layers.Conv1D(filters=64, kernel_size=7)(gene_input)
    gene_max_pool_1 = layers.MaxPooling1D(5)(gene_conv1)
    gene_conv2 = layers.Conv1D(filters=32, kernel_size=5)(gene_max_pool_1)
    gene_max_pool_2 = layers.MaxPooling1D(5)(gene_conv2)
    gene_conv3 = layers.Conv1D(filters=16, kernel_size=2)(gene_max_pool_2)
    gene_flatten = layers.Flatten()(gene_conv3)
    
    # define cell convolutional layers
    cell_conv1 = layers.Conv1D(filters=64, kernel_size=7)(cell_input)
    cell_max_pool_1 = layers.MaxPooling1D(5)(cell_conv1)
    cell_conv2 = layers.Conv1D(filters=32, kernel_size=5)(cell_max_pool_1)
    cell_max_pool_2 = layers.MaxPooling1D(5)(cell_conv2)
    cell_conv3 = layers.Conv1D(filters=16, kernel_size=2)(cell_max_pool_2)
    cell_flatten = layers.Flatten()(cell_conv3)
    
    # combine our multiple inputs into one
    concat_layer = layers.concatenate([dropout_1, gene_flatten, 
                                       cell_flatten])
    
    # follow-up dense layers
    hidden_2 = layers.Dense(256, activation='elu', 
                            kernel_initializer='he_normal')(concat_layer)
    batch_norm_2 = BatchNormalization()(hidden_2)
    dropout_2 = layers.Dropout(dropout_val)(batch_norm_2)
    
    hidden_3 = layers.Dense(256, activation='elu', 
                            kernel_initializer='he_normal')(dropout_2)
    batch_norm_3 = BatchNormalization()(hidden_3)
    dropout_3 = layers.Dropout(dropout_val)(batch_norm_3)
    
    # main output for our scored labels
    output_1 = layers.Dense(206, activation='sigmoid', 
                            name='scored_output')(dropout_3)
    
    model = keras.Model(inputs=[main_input, gene_input, cell_input], 
                        outputs=[output_1])
    
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), 
                  loss=['binary_crossentropy'], metrics=['accuracy'])
    return model

In [None]:
# let's create our model and visualise its structure
composite_model = ann_convnet_composite(dropout_val=0.45, lr=2e-3)
keras.utils.plot_model(composite_model, "composite_model.png", show_shapes=True)

Train one model to confirm our model works:

In [None]:
# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# create learning rate scheduler
lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1, 
                                                   patience=3, verbose=0, 
                                                   min_delta=0.0001, mode='min')

# list of callbacks to use
trg_callbacks = [early_stopper, lr_scheduler]

In [None]:
composite_model = ann_convnet_composite(dropout_val=0.45, lr=2e-3)

In [None]:
# train model for up to 75 epochs with early stopping
history = composite_model.fit([X_train_full, X_train_gene, X_train_cell], y, 
                              epochs=75, batch_size=64, verbose=1,
                              validation_split=0.2, callbacks=trg_callbacks)

In [None]:
# evaluate predictions on the entire training set, purely to verify the model works
val_preds = composite_model.predict([X_train_full, X_train_gene, X_train_cell])
score = keras.losses.BinaryCrossentropy()(y, val_preds)
print(f'Log Loss on entire training set: {score.numpy():.4f}')

In [None]:
plot_history_results(history)

Similar performance to what we achieved before with just a simple ANN.

Let's evaluate this in greater depth through K-Folds Cross Validation:

In [None]:
model_histories = []
model_losses = []
composite_test_preds = np.zeros((test_features.shape[0], 
                                 train_targets_scored.shape[1] - 1))

for train_idx, val_idx in tqdm(k_folds.split(X_train_full, y)):
    # obtain train and val splits
    train_split = X_train_full.iloc[train_idx].copy()
    train_labels = y.iloc[train_idx].astype(np.float64).copy()
    val_split = X_train_full.iloc[val_idx].copy()
    val_labels = y.iloc[val_idx].astype(np.float64).copy()
    
    # gene and cell training / validation splits
    train_cell = X_train_cell[train_idx].copy()
    train_gene = X_train_gene[train_idx].copy()
    val_cell = X_train_cell[val_idx].copy()
    val_gene = X_train_gene[val_idx].copy()
    
    composite_nn = ann_convnet_composite(dropout_val=0.45, lr=2e-3)
    
    # train on the scored labels for up to 75 epochs with early stopping
    history = composite_nn.fit([train_split, train_gene, train_cell], 
                               train_labels, epochs=75, batch_size=64, verbose=0, 
                               validation_data=([val_split, val_gene, val_cell], 
                                                val_labels), callbacks=trg_callbacks)
    
    model_histories.append(history)

    # find log loss for our main output
    model_val_preds = composite_nn.predict([val_split, val_gene, val_cell])
    model_log_loss = keras.losses.BinaryCrossentropy()(val_labels, model_val_preds).numpy()
    model_losses.append(model_log_loss)
    print(f'Current Fold Scored Validation Loss: {model_log_loss:.4f}')
    
    # make predictions on test set for each fold
    temp_test_preds = composite_nn.predict([X_test, X_test_gene, X_test_cell])
    composite_test_preds += (temp_test_preds / N_FOLDS)

# convert results to np array
model_losses = np.array(model_losses)

In [None]:
print(f"Mean loss across all folds: {model_losses.mean():.4f} +/- {model_losses.std():.4f}")

---

<a id="test-predictions"></a>
## 6. Test Set Predictions - Combining the performance of the best models

Let's make a submission to the competition using the test predictions obtained earlier.

In [None]:
# choose our final predictions to submit
#final_preds = (dnn_test_preds + composite_test_preds) / 2.0
final_preds = test_preds.copy()

Before submitting this, we will need to adjust our test set predictions so that the targets are always zero for instances where cp_type is ctl_vehicle (the control perturbations):

In [None]:
def process_test_preds(test_preds, original_test_feats, clip_preds=False, 
                       pred_max=0.999, pred_min=0.001):
    """ Adjust our test set predictions by replacing all class preds
        with zero when cp_type == ctl_vehicle """
    corrected_preds = test_preds.copy()
    test_sig_ids = original_test_feats['sig_id'].copy()

    # clip before zeroing, and over every target column, so the control
    # rows set to zero below are not lifted back up to pred_min
    if clip_preds:
        corrected_preds = np.clip(corrected_preds, pred_min, pred_max)

    # boolean row mask, independent of the dataframe's index labels
    test_ctl_vehicle_mask = (original_test_feats['cp_type'] == 'ctl_vehicle').values
    corrected_preds[test_ctl_vehicle_mask] = 0
    
    return corrected_preds, test_sig_ids

In [None]:
processed_preds, test_sig_ids = process_test_preds(final_preds, test_features)

In [None]:
test_submission = pd.DataFrame({'sig_id' : test_sig_ids})
test_preds_df = pd.DataFrame(processed_preds, columns=train_targets_scored.columns[1:])
test_submission = pd.concat([test_submission, test_preds_df], axis=1)
test_submission.head(3)

With this in the correct format, we can now save it and make a basic submission for the competition:

In [None]:
test_submission.to_csv('submission.csv', index=False)