# Mechanisms of Action Predictions

This notebook builds several variants of a basic deep neural network and uses them to produce multi-output classification predictions on the test set. The approach is deliberately simple and could easily be improved upon; it is intended only as an introduction to the dataset and the competition.

**Table of Contents:**

1. [Imports](#imports)
2. [EDA](#EDA)
3. [Data Preparation and Preprocessing](#data-preprocessing)
4. [Model Production and Evaluation](#model-production)
5. [Test Set Predictions](#test-predictions)

<a id="imports"></a>
## 1. Import dependencies and load data

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
import seaborn as sns
import tensorflow as tf

import keras
import keras.backend as K
from keras.callbacks import ModelCheckpoint
from keras.initializers import Constant
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense, Embedding, Flatten, LSTM, GRU, \
        SpatialDropout1D, Bidirectional, Conv1D, MaxPooling1D, BatchNormalization
from keras.models import Sequential, load_model
from keras import models
from keras import layers

import pickle

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import log_loss, silhouette_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, \
        cross_validate, cross_val_score, KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier, MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from tqdm import tqdm

In [None]:
input_dir = '/kaggle/input/lish-moa'
train_features = pd.read_csv(os.path.join(input_dir, 'train_features.csv'))
train_targets_scored = pd.read_csv(os.path.join(input_dir, 'train_targets_scored.csv'))
train_targets_nonscored = pd.read_csv(os.path.join(input_dir, 'train_targets_nonscored.csv'))
test_features = pd.read_csv(os.path.join(input_dir, 'test_features.csv'))

train_features.shape, train_targets_scored.shape, train_targets_nonscored.shape, test_features.shape

Our main input features (train_features.csv) are high-dimensional tabular data containing a mixture of categorical and numerical features. Our training targets consist of 206 different output classes for each data instance. It's important to note that these output labels are not mutually exclusive: a single data instance can have several active labels at once. This is therefore a multi-label (multi-output) classification problem, and not just a multiclass classification problem.

Compared with an ordinary binary classification task, this type of multi-label problem is much more difficult in terms of producing and fine-tuning a classification model.
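
To make the multi-label setup concrete, the sketch below scores a small made-up prediction matrix with a binary log loss computed per target column and then averaged over columns. We are assuming this matches the competition metric, so check the evaluation page before relying on it. The function name `mean_columnwise_log_loss` and the toy arrays are purely illustrative.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss per target column, averaged over all columns."""
    y_true = np.asarray(y_true, dtype=float)
    # clip predictions so that log(0) can never occur
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_element = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_element.mean(axis=0).mean()

# toy example: 3 samples, 4 labels; the first sample has two active labels at once
y_true_toy = np.array([[1, 0, 1, 0],
                       [0, 0, 0, 0],
                       [0, 1, 0, 0]])

# an uninformative model that predicts 0.5 for every label
y_pred_toy = np.full(y_true_toy.shape, 0.5)
toy_loss = mean_columnwise_log_loss(y_true_toy, y_pred_toy)
print(f'Toy loss: {toy_loss:.4f}')
```

Because each label is scored independently, a sigmoid output per label (rather than a single softmax over all labels) is the natural choice for the output layer.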

---

<a id="EDA"></a>
## 2. Basic Exploratory Data Analysis

In [None]:
cat_cols = ['cp_type', 'cp_time', 'cp_dose']

plt.figure(figsize=(16,4))

for idx, col in enumerate(cat_cols):
    plt.subplot(1, len(cat_cols), idx + 1)
    counts = train_features[col].value_counts()
    sns.barplot(x=counts.index.values, y=counts.values)
    plt.xlabel(col)
    plt.ylabel('Count')
plt.tight_layout()
plt.show()

For 'cp_type', the value 'ctl_vehicle' refers to samples treated with a control perturbation. For control perturbations our targets are all zero, since they have no Mechanism of Action (MoA).

To deal with this, a good strategy is to identify the samples that are ctl_vehicle (by training a classification model or, more simply, by using the feature directly, since it's also in the test data!) and set all of their targets to zero. We can then process the test set accordingly: first set all targets to zero for ctl_vehicle test instances, then predict all of the others normally using our trained model.

In [None]:
# select all indices when 'cp_type' is 'ctl_vehicle'
ctl_vehicle_idx = (train_features['cp_type'] == 'ctl_vehicle')

# evaluate number of 1s we have in the total train scores when cp_type = ctl_vehicle
train_targets_scored.loc[ctl_vehicle_idx].iloc[:, 1:].sum().sum()

The total sum is zero, which confirms that all targets are zero whenever cp_type is ctl_vehicle. The simplest way to handle this is to fill our targets with zeros in these cases.

We could also remove all of these rows from the training set, although there are arguments for and against this in practice. If we remove them, we could be withholding valuable all-zero examples from our models, which might then struggle to predict such cases on new data. On the other hand, it is a lot of extra data, which could just serve to unnecessarily complicate our model.

In [None]:
# take a copy of all our training sig_ids for reference
train_sig_ids = train_features['sig_id'].copy()

In [None]:
# drop the sig_id and cp_type columns, then remove the control (ctl_vehicle) rows
X = train_features.drop(['sig_id', 'cp_type'], axis=1).copy()
X = X.loc[~ctl_vehicle_idx].copy()

y = train_targets_scored.drop('sig_id', axis=1).copy()
y = y.loc[~ctl_vehicle_idx].copy()

X.shape, y.shape

In [None]:
X.head(3)

The data has already been quantile-normalised, so the values we see are not in their natural form.

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(X.iloc[:, 2:].mean())
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(y.mean())
plt.show()

In [None]:
plt.figure(figsize=(8,5))
# drop the non-numeric sig_id column before averaging
sns.distplot(train_targets_nonscored.drop('sig_id', axis=1).mean())
plt.show()

In [None]:
y.sum().sort_values()[:30].plot.bar(figsize=(18,6))
plt.show()

Some output classes have only 1 instance in the entire training set. This is problematic: it is nowhere near enough data if we expect our models to make effective predictions across the whole range of targets. Techniques for imbalanced datasets, such as minority-class over-sampling, may have to be introduced to help our models generalise better to new data.
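
As a rough illustration of minority-class over-sampling, the sketch below duplicates rows of rare labels until each label has a minimum number of positives. This is a naive approach under our own assumptions: the helper `oversample_rare_labels`, the `min_count` threshold and the toy frames are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

def oversample_rare_labels(X, y, min_count=5, random_state=0):
    """Duplicate rows of rare labels until each has at least min_count positives.

    Assumes X and y share the same index. Labels with no positives are skipped.
    """
    rng = np.random.default_rng(random_state)
    extra_idx = []
    for label in y.columns:
        pos_idx = y.index[y[label] == 1]
        n_missing = min_count - len(pos_idx)
        if len(pos_idx) > 0 and n_missing > 0:
            # sample with replacement from the rows that carry this label
            extra_idx.extend(rng.choice(pos_idx, size=n_missing, replace=True))
    X_out = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
    y_out = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
    return X_out, y_out

# toy data: 'rare' has a single positive row, 'common' has four
X_toy = pd.DataFrame({'g-0': np.arange(6, dtype=float)})
y_toy = pd.DataFrame({'common': [1, 1, 1, 1, 0, 0],
                      'rare':   [0, 0, 0, 0, 0, 1]})

X_bal, y_bal = oversample_rare_labels(X_toy, y_toy, min_count=3)
print(y_bal.sum())
```

If something like this is used, it should only be applied to the training split. Duplicated rows in a validation set would give an over-optimistic score.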

#### Plotting all gene / cell features for random samples:

Let's quickly assess how our cell data looks when plotted over all features for random instances:

In [None]:
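cat_feats = X.iloc[:, :2].copy()
# select each feature group by column prefix; a fixed slice such as 2:772 would
# miss the last two of the 772 gene columns, which occupy positions 2 to 773
X_cell_v = X.loc[:, X.columns.str.startswith('c-')].copy()
X_gene_e = X.loc[:, X.columns.str.startswith('g-')].copy()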

In [None]:
def plot_features(X, y, selected_idx, features_type, figsize=(14,10)):
    """ Plot the feature values of each selected row (by position) on its own subplot """
    x_range = range(1, X.shape[1] + 1)
    
    fig = plt.figure(figsize=figsize)
    
    for i, idx in enumerate(selected_idx):
        ax = fig.add_subplot(selected_idx.shape[0], 1, i + 1)
        vals = X.iloc[idx].values
    
        if (y.iloc[idx] == 1).sum():
            output_labels = list(y.iloc[idx][y.iloc[idx] == 1].index.values)
        
            labels = " ".join(output_labels)
        else:
            labels = "None (all labels zero)"
        
        sns.lineplot(x=list(x_range), y=vals)
        plt.title(f"Row {idx}, Labels: {labels}", weight='bold')
        plt.xlim(0.0, X.shape[1])
        plt.grid()

    plt.xlabel(f"{features_type}", weight='bold', size=14)
    plt.tight_layout()
    plt.show()
    
    
def plot_mean_std(dataframe, feature_name, features_type, figsize=(14,6), alpha=0.3):
    """ Plot the mean and standard deviation of every feature over the rows that
        have the given label. Note: relies on the global labels dataframe y. """
    
    plt.figure(figsize=figsize)
    
    x_range = range(1, dataframe.shape[1] + 1)
    
    # rows of the feature dataframe for which the chosen label is active
    chosen_feats = dataframe.loc[y[feature_name] == 1]
    
    means = chosen_feats.mean()
    stds = chosen_feats.std()
    
    plt.plot(x_range, means, label=feature_name)    
    plt.fill_between(x_range, means - stds, means + stds, 
                         alpha=alpha)

    plt.title(f'{features_type}: {feature_name} - Mean & Standard Deviation', weight='bold')
    
    plt.xlim(0.0, dataframe.shape[1])
    
    plt.show()

In [None]:
# let's plot some random rows from our data
random_idx = np.random.randint(X.shape[0], size=(5,))

plot_features(X_cell_v, y, random_idx, features_type='Cell Features')

Clearly some rows vary substantially in their value range, and therefore it is worth standardising this data prior to training our models.

Now let's do the same for our gene features:

In [None]:
plot_features(X_gene_e, y, random_idx, features_type='Gene Features')

We have some noticeable peaks throughout the features for some of the above instances. It could be worth plotting a range of data instances with the same output labels against one another and comparing their peaks. If they correlate in one or more areas, this could be insightful for developing further features from our dataset.

Let's now repeat the above, but for data instances with the same output label(s).
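
One way to put a number on how similar such profiles are is to average the pairwise Pearson correlation between rows. The sketch below is only an illustration under our own assumptions: `mean_pairwise_correlation` is a hypothetical helper, and the synthetic rows stand in for real gene profiles.

```python
import numpy as np
import pandas as pd

def mean_pairwise_correlation(features):
    """Average Pearson correlation over all distinct pairs of rows."""
    corr = np.corrcoef(features.to_numpy(dtype=float))
    # keep each pair once by taking the strict upper triangle
    upper = np.triu_indices_from(corr, k=1)
    return corr[upper].mean()

rng = np.random.default_rng(0)

# five noisy copies of one shared profile, mimicking rows with the same label
profile = rng.normal(size=50)
same_label_toy = pd.DataFrame(profile + 0.1 * rng.normal(size=(5, 50)))

# five unrelated rows for comparison
unrelated_toy = pd.DataFrame(rng.normal(size=(5, 50)))

same_corr = mean_pairwise_correlation(same_label_toy)
unrelated_corr = mean_pairwise_correlation(unrelated_toy)
print(f'same label: {same_corr:.3f}, unrelated: {unrelated_corr:.3f}')
```

On the real data this could be called as `mean_pairwise_correlation(X_gene_e.loc[y['btk_inhibitor'] == 1])` and compared against a random sample of rows of the same size.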

In [None]:
# select an output label to plot associated training features
chosen_label = 'btk_inhibitor'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,), replace=False)

In [None]:
plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

Let's also look at the mean and standard deviation of the gene features for this label:

In [None]:
plot_mean_std(X_gene_e, 'btk_inhibitor', 'Gene Features')

Let's repeat this process for some different output labels:

In [None]:
# select an output label to plot associated training features
chosen_label = 'histamine_receptor_antagonist'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,))

plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

In [None]:
plot_mean_std(X_gene_e, 'histamine_receptor_antagonist', 'Gene Features')

In [None]:
# select an output label to plot associated training features
chosen_label = 'free_radical_scavenger'
chosen_rows = y.loc[y[chosen_label] == 1]
chosen_feats = X_gene_e.loc[y[chosen_label] == 1]

# select random rows from those available above for the chosen label
random_idx = np.random.choice(range(0, chosen_rows.shape[0]), size=(5,))

plot_features(chosen_feats, chosen_rows, random_idx, features_type='Gene Features')

In [None]:
plot_mean_std(X_gene_e, 'free_radical_scavenger', 'Gene Features')

This analysis highlights the potential for performing advanced feature engineering, such as using the trends of gene and/or cell features as additional features to our models. We could use such features to supplement the existing data in its standard form.
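
As one simple interpretation of this idea, the sketch below appends row-wise summary statistics for a group of columns as new features. The choice of statistics, the helper name `add_group_stats` and the toy frame are our own assumptions; richer trend features would need more thought.

```python
import pandas as pd

def add_group_stats(df, prefix, name):
    """Append row-wise summary statistics over all columns starting with prefix."""
    group = df.loc[:, df.columns.str.startswith(prefix)]
    stats = pd.DataFrame({
        f'{name}_mean': group.mean(axis=1),
        f'{name}_std': group.std(axis=1),
        f'{name}_min': group.min(axis=1),
        f'{name}_max': group.max(axis=1),
    })
    return pd.concat([df, stats], axis=1)

# toy frame using the same 'g-' / 'c-' column prefixes as the MoA features
toy = pd.DataFrame({'g-0': [1.0, 3.0], 'g-1': [3.0, 5.0],
                    'c-0': [0.0, 1.0], 'c-1': [2.0, 1.0]})

toy_feats = add_group_stats(toy, 'g-', 'gene')
toy_feats = add_group_stats(toy_feats, 'c-', 'cell')
print(toy_feats.columns.tolist())
```

The new columns would need the same standardisation as the other numerical features before being passed to a network.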

---

<a id="data-preprocessing"></a>
## 3. Preprocessing and Data Preparation

This will be relatively simple and will include:
- Standardisation of all numerical features.
- Creation of embeddings or encodings for our categorical variables.
- Removal of unwanted / unnecessary columns.

We'll define a simple class to perform these actions for us on both the training and test data.

In [None]:
class MOAPreprocessor:
    """ Data Preprocessing class for the MoA dataset, processing cat and num
        features accordingly. """
    
    def __init__(self, cat_features, num_features, remove_cp_type=False):
        self.cat_features = cat_features
        self.num_features = num_features
        self.std_scaler = StandardScaler()
        self.remove_cp_type = remove_cp_type
        
    def preprocess_data(self, X, test=False):
        """ Preprocess categorical and numerical features """
        
        # take a copy of sig ids for reference
        sig_ids = X.loc[:, 'sig_id']
        
        #  remove ctl_vehicle if selected
        if self.remove_cp_type and not test:
            ctl_vehicle_idx = (X['cp_type'] == 'ctl_vehicle')
            data_df = X.loc[~ctl_vehicle_idx].copy()
        else:
            data_df = X.copy()
        
        # subsets of categorical and numerical (the numerical subset is copied so we can assign to it safely)
        X_cat = data_df.loc[:, self.cat_features].astype(object)
        X_num = data_df.loc[:, self.num_features].copy()
        
        # one-hot encode our categorical features
        X_cat = pd.get_dummies(X_cat)
        
        # if training, fit our transformers
        if not test:
            # fit parameters of our scaler and transform train
            X_num[self.num_features] = self.std_scaler.fit_transform(X_num)
            
            # add train sig ids to class instance
            self.train_sig_ids = sig_ids.copy()
        
        # otherwise, simply transform our data
        else:
            # transform test set
            X_num[self.num_features] = self.std_scaler.transform(X_num)
            
            # add test sig ids to class instance
            self.test_sig_ids = sig_ids.copy()
            
        return pd.concat([X_cat, X_num], axis=1)

In [None]:
#cat_features = ['cp_time', 'cp_dose', 'cp_type']
cat_features = ['cp_time', 'cp_dose']

# define non-numeric cols to form list of numeric cols
non_num_tuple = ('cp_time', 'cp_dose', 'cp_type', 'sig_id')
num_features = [x for x in train_features.columns.values if not x.startswith(non_num_tuple)]

data_processor = MOAPreprocessor(cat_features, num_features)
X_train_full = data_processor.preprocess_data(train_features)
X_test = data_processor.preprocess_data(test_features, test=True)

X_train_full.shape, X_test.shape
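
As an alternative sketch of the same preprocessing steps, scikit-learn's `ColumnTransformer` can fit the one-hot encoding and the scaling on the training data and reuse them on the test data. One potential advantage is that the encoded columns stay identical even if the test set lacks some category values, which calling `pd.get_dummies` separately on each set does not guarantee. The toy frames below are made up for illustration and only mimic the MoA column names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_train = pd.DataFrame({
    'cp_time': [24, 48, 72, 24],
    'cp_dose': ['D1', 'D2', 'D1', 'D2'],
    'g-0': [0.5, -1.0, 2.0, 0.1],
    'c-0': [1.0, 0.0, -1.0, 0.4],
})

# the toy test set only contains one cp_time value and one cp_dose value
toy_test = pd.DataFrame({
    'cp_time': [48, 48],
    'cp_dose': ['D1', 'D1'],
    'g-0': [0.0, 1.0],
    'c-0': [0.2, -0.3],
})

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['cp_time', 'cp_dose']),
        ('num', StandardScaler(), ['g-0', 'c-0']),
    ],
    sparse_threshold=0,  # always return a dense array
)

# fit on the training data only, then apply the same transformation to the test data
toy_train_t = preprocessor.fit_transform(toy_train)
toy_test_t = preprocessor.transform(toy_test)
print(toy_train_t.shape, toy_test_t.shape)
```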

In [None]:
# we also need to format our labels so that they only contain the output labels, and not the sig id
y = train_targets_scored.drop('sig_id', axis=1).copy()

# remove the control-vehicle rows from the labels if they were removed from the features
if data_processor.remove_cp_type:
    ctl_vehicle_idx = (train_features['cp_type'] == 'ctl_vehicle')
    y = y.loc[~ctl_vehicle_idx].copy()

y.shape

---

<a id="model-production"></a>
## 4. Model production and evaluation

Many of our minority classes have an extremely low number of samples in the training set, and some have only one.

We can either remove these entirely from the training set and proceed onwards, or come up with an elaborate way of sampling to overcome this issue.

For this work, we will simply include the single-occurrence samples in the training set and exclude them from the validation set (the only way to have them in both would be to use duplicates, which is of no value). We'll note these rows, remove them prior to splitting into our training and validation data, and then insert them into the training data after we have formed these splits.

Ideally, for a task like this, we should perform a suitable form of multi-label stratified k-fold splitting; however, this will be a task for later rather than for this notebook.
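
For reference, a very simple greedy approximation of multi-label stratification could look like the sketch below. It is not the method used in this notebook and not a substitute for a dedicated implementation: `greedy_multilabel_folds` is a hypothetical helper that handles labels from rarest to most common and always places a row in the fold that currently has the fewest positives for that label.

```python
import numpy as np

def greedy_multilabel_folds(Y, n_folds=5, seed=0):
    """Assign each row of a binary label matrix Y to a fold, rarest labels first."""
    Y = np.asarray(Y).astype(int)
    rng = np.random.default_rng(seed)
    folds = np.full(Y.shape[0], -1)
    fold_sizes = np.zeros(n_folds, dtype=int)
    label_counts = np.zeros((n_folds, Y.shape[1]), dtype=int)

    for label in np.argsort(Y.sum(axis=0)):
        # unassigned rows that are positive for this label
        rows = np.where((Y[:, label] == 1) & (folds == -1))[0]
        rng.shuffle(rows)
        for row in rows:
            # fold with the fewest positives for this label,
            # breaking ties in favour of the smallest fold
            fold = np.lexsort((fold_sizes, label_counts[:, label]))[0]
            folds[row] = fold
            fold_sizes[fold] += 1
            label_counts[fold] += Y[row]

    # rows without any positive label are spread to keep fold sizes even
    rest = np.where(folds == -1)[0]
    rng.shuffle(rest)
    for row in rest:
        fold = np.argmin(fold_sizes)
        folds[row] = fold
        fold_sizes[fold] += 1
    return folds

# toy labels: 50 rows, a rare label (10 positives) and a commoner one (20 positives)
Y_toy = np.zeros((50, 2), dtype=int)
Y_toy[:10, 0] = 1
Y_toy[10:30, 1] = 1

toy_folds = greedy_multilabel_folds(Y_toy, n_folds=5)
for fold in range(5):
    print(fold, Y_toy[toy_folds == fold].sum(axis=0))
```

Labels with fewer positives than folds will still be missing from some folds, so this does not remove the need to treat the single-occurrence classes separately.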

In [None]:
lowest_occurrence = y.sum().sort_values()

# identify classes that occur less than 2 times in the training set
minority_feats = lowest_occurrence[lowest_occurrence < 2].index.values
for label in minority_feats:
    print(label)

We'll remove the rows containing these labels from the dataset, perform our split, and then insert them in again:

In [None]:
# get the row index vals to remove for these features
remove_idx = np.array([], dtype=int)
for feature in minority_feats:
    remove_idx = np.append(remove_idx, y.loc[y[feature]==1].index.values.astype(int))

X_train_full_tmp = X_train_full.drop(index=remove_idx)
y_tmp = y.drop(index=remove_idx)
X_train_full.shape, X_train_full_tmp.shape, y.shape, y_tmp.shape

In [None]:
# get the column index vals for these features
#col_index_map = {}
#for feature in minority_feats:
#    col_index_map[feature] = y.columns.get_loc(feature)
#col_index_map

# temporarily remove these columns and then split our data
#y_tmp.drop(columns=minority_feats, inplace=True)

Let's split our data randomly. Ideally we'd perform a multi-label stratified split here, but due to issues with limited numbers of class instances and imbalance across the dataset, we'll avoid it for now.

In [None]:
# hold out 20% of the remaining data as a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train_full_tmp, y_tmp, test_size=0.2, shuffle=True)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

Now let's add the under-represented instances back into our training set:

In [None]:
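# remove_idx holds index labels, so select with .loc (DataFrame.append was removed in pandas 2.0)
X_train = pd.concat([X_train, X_train_full.loc[remove_idx]], ignore_index=True)
y_train = pd.concat([y_train, y.loc[remove_idx]], ignore_index=True)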
X_train.shape, y_train.shape

### Model 1 - Three hidden layers

In [None]:
def ann_model_1(dropout=False, dropout_val=0.45, batch_norm=False, lr=1e-3):
    """ Create a basic Deep NN for classification """
    model = models.Sequential()
    
    model.add(layers.Dense(512, activation='relu', input_shape=(X_train.shape[1],)))
    if batch_norm:
        model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
    model.add(layers.Dense(256, activation='relu'))
    if batch_norm:
        model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
    model.add(layers.Dense(256, activation='relu'))
    if batch_norm:
        model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
        
    # output layer
    model.add(layers.Dense(206, activation='sigmoid'))
        
    # 'learning_rate' replaces the deprecated 'lr' argument
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=lr), 
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

#### Let's first evaluate the best learning rate for this model:

Let's create a custom callback for exploring the best learning rates for our models:

In [None]:
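class LearningRateComparison(keras.callbacks.Callback):
    """ Multiply the learning rate by a constant factor after every batch,
        recording each rate and its loss so they can be compared afterwards """
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        
        # arrays to store current rate and associated loss
        self.lr_rates = []
        self.losses = []
        
    def on_batch_end(self, batch, logs=None):
        # read the rate as a plain number before scaling it
        current_lr = K.get_value(self.model.optimizer.lr)
        self.lr_rates.append(current_lr)
        self.losses.append(logs["loss"])
        K.set_value(self.model.optimizer.lr, current_lr * self.factor)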

Train for several epochs with our learning rate comparison:

In [None]:
# define custom learning rate scheduler to compare loss across many learning rates
custom_lr = LearningRateComparison(factor=1.0025)

model_1 = ann_model_1(dropout=True, batch_norm=True, lr=1e-3)

# train model for 10 epochs
history = model_1.fit(X_train, y_train, epochs=10, 
                      batch_size=64, validation_data=(X_val, y_val), 
                      callbacks=[custom_lr])

In [None]:
plt.figure(figsize=(12,5))
sns.lineplot(x=custom_lr.lr_rates, y=custom_lr.losses)
plt.gca().set_xscale('log')
plt.hlines(min(custom_lr.losses), min(custom_lr.lr_rates), max(custom_lr.lr_rates), 
           linestyle='dashed')
plt.axis([min(custom_lr.lr_rates), 1.0, 0, custom_lr.losses[0]])
plt.xlabel("Learning rate", weight='bold', size=13)
plt.ylabel("Loss", weight='bold', size=13)
plt.grid()
plt.show()

Our loss begins to increase gradually after a learning rate of around $ 4 \times 10^{-2} $, so we'll use a slightly lower rate as a starting point.
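
Reading the rate off the plot by eye works, but the same choice can be sketched in code. The helper below is a heuristic of our own rather than an established rule: it smooths the recorded losses with a moving average, finds the rate at the minimum and divides it by an assumed safety factor. The synthetic sweep stands in for `custom_lr.lr_rates` and `custom_lr.losses`.

```python
import numpy as np

def suggest_learning_rate(lr_rates, losses, window=5, divisor=2.0):
    """Return the rate at the minimum of the smoothed loss, divided by divisor."""
    lr_rates = np.asarray(lr_rates, dtype=float)
    losses = np.asarray(losses, dtype=float)
    # moving average to damp batch-to-batch noise
    kernel = np.ones(window) / window
    smoothed = np.convolve(losses, kernel, mode='valid')
    # align each smoothed value with the last rate in its window
    aligned_rates = lr_rates[window - 1:]
    return aligned_rates[np.argmin(smoothed)] / divisor

# synthetic sweep: the rate grows by a factor of 1.0025 per batch from 1e-3,
# with a made-up loss curve that bottoms out near 4e-2
sweep_rates = 1e-3 * 1.0025 ** np.arange(2000)
sweep_losses = (np.log10(sweep_rates) - np.log10(4e-2)) ** 2 + 0.1

suggested = suggest_learning_rate(sweep_rates, sweep_losses)
print(f'Suggested starting rate: {suggested:.4f}')
```

With the real callback this would be called as `suggest_learning_rate(custom_lr.lr_rates, custom_lr.losses)`. Real loss curves are much noisier, so a larger `window` may be needed.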

In [None]:
model_1 = ann_model_1(dropout=True, batch_norm=True, lr=2e-2)
model_1.summary()

In [None]:
# set up a check point for our model - save only the best val performance
save_path ="ann_model_1_best.hdf5"

trg_checkpoint = ModelCheckpoint(save_path, monitor='val_loss', 
                                 verbose=1, save_best_only=True, mode='min')

early_stopper = keras.callbacks.EarlyStopping(patience=20)

trg_callbacks = [trg_checkpoint, early_stopper]

In [None]:
history = model_1.fit(X_train, y_train, epochs=50, 
                      batch_size=64, validation_data=(X_val, y_val), 
                      callbacks=trg_callbacks)

In [None]:
# save model as a HDF5 file with weights + architecture
model_1.save('ann_model_1.hdf5')

# save the history of training to a datafile for later retrieval
#with open('history_model_1.pickle', 'wb') as pickle_file:
#    pickle.dump(history.history, pickle_file)
    
loaded_model = False

In [None]:
# load model with best weights as found during training
model_1_best = load_model('ann_model_1_best.hdf5')

In [None]:
# if already trained - import history file and training weights
#model_1 = load_model('models/ann_model_1.hdf5')

# get history of trained model
#with open('models/history_model_1.pickle', 'rb') as handle:
#    history = pickle.load(handle)
    
#loaded_model = True

In [None]:
# if loaded model set history accordingly
if loaded_model:
    trg_hist = history
else:
    trg_hist = history.history

trg_loss = trg_hist['loss']
val_loss = trg_hist['val_loss']

trg_acc = trg_hist['accuracy']
val_acc = trg_hist['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(16,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.tight_layout()
plt.show()

In [None]:
val_preds = model_1.predict(X_val)
score = keras.losses.BinaryCrossentropy()(y_val, val_preds)
print('Baseline score: %.4f' % score.numpy())

In [None]:
# load model with best weights as found during training
model_1_best = load_model('ann_model_1_best.hdf5')

val_preds_best = model_1_best.predict(X_val)
best_score = keras.losses.BinaryCrossentropy()(y_val, val_preds_best)
print('Best Val Loss: %.4f' % best_score.numpy())

For an initial model this is not too bad, although we could definitely benefit from gradually reducing our learning rate as the number of epochs increases. We could therefore experiment with a range of learning rate schedulers and would likely obtain better performance than that achieved above.

In addition, we could benefit from conducting k-fold cross-validation, rather than just using one hold-out validation set.

As a basic submission with a lightly tuned model, this current model does not perform too badly on the final test set, with a score of 0.02027 on the leaderboard.

For future improvement, we'll also attempt to tackle the class imbalance problem by over-sampling instances of our most under-represented classes (scores 0.02060 on the test set).

Remarks on validation performance and tuning:
1. Hidden 1 = 1024, hidden 2 = 512, hidden 3 = 256, batch size 1024, RMSprop optimiser, 0.45 dropout, batch norm, 35 epochs: validation loss = 0.0158.
2. Same as above but with hidden 1 = 512: validation loss = 0.0158 after 45 epochs.
3. Hidden 1 = 512, hidden 2 = 256, hidden 3 = 256, batch size 1024, RMSprop optimiser, 0.45 dropout, batch norm, 59 epochs: validation loss = 0.0157.
4. Same as above, but without batch normalisation: performance decreased, with a validation loss of 0.0167.
5. Hidden 1 = 512, hidden 2 = 256, hidden 3 = 128, batch size 1024, RMSprop optimiser, 0.45 dropout, batch norm, 67 epochs: validation loss = 0.0160.
6. Same as previously, but with the batch size set to 2048: performance dropped slightly, with a validation loss of 0.0165.


In [None]:
# note down best parameters to use for final models
model_1_bs = 1024
model_1_epochs = 59

#### Model 1 improvements - Learning Rate Scheduling and K-Fold Cross-Validation

In [None]:
def schedule_lr_rate(epoch, lr):
    """ Use initial learning rate for 20 epochs and then
        decrease it exponentially """
    if epoch < 20:
        return lr
    else:
        # return a plain Python float, which the scheduler callback expects
        return float(lr * np.exp(-0.1))

# create our learning rate scheduler callback
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule_lr_rate)

# create an early stopper callback
early_stopper = keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True)

# list of callbacks to use
trg_callbacks = [early_stopper, lr_scheduler]
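
To see what this schedule does to the learning rate, the sketch below replays the same rule in plain Python, without Keras. The helper `scheduled_rates` is hypothetical and only mirrors the logic of `schedule_lr_rate`: hold the rate for the first 20 epochs, then multiply it by $ e^{-0.1} $ at the start of every later epoch.

```python
import math

def scheduled_rates(initial_lr, n_epochs, hold_epochs=20, decay=0.1):
    """Replay the schedule: hold the rate, then multiply by exp(-decay) each epoch."""
    rates = []
    lr = initial_lr
    for epoch in range(n_epochs):
        if epoch >= hold_epochs:
            lr = lr * math.exp(-decay)
        rates.append(lr)
    return rates

rates = scheduled_rates(initial_lr=2e-2, n_epochs=50)
for epoch in (0, 19, 20, 29, 49):
    print(f'epoch {epoch:2d}: lr = {rates[epoch]:.6f}')
```

Starting from 2e-2, the rate falls to roughly 1e-3 by the 50th epoch, provided early stopping has not ended training before then.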

In [None]:
N_FOLDS = 5
k_folds = KFold(n_splits=N_FOLDS, shuffle=True)

In [None]:
model_histories = []
model_losses = []
test_preds = np.zeros((test_features.shape[0], 
                       train_targets_scored.shape[1] - 1))

for train_idx, val_idx in tqdm(k_folds.split(X_train_full, y)):
    train_split = X_train_full.iloc[train_idx].copy()
    train_labels = y.iloc[train_idx].astype(np.float64).copy()
    val_split = X_train_full.iloc[val_idx].copy()
    val_labels = y.iloc[val_idx].astype(np.float64).copy()
    
    temp_model = ann_model_1(dropout=True, batch_norm=True, lr=2e-2)
    
    # train model for up to 50 epochs with early stopping
    temp_history = temp_model.fit(train_split, train_labels, 
                            epochs=50, batch_size=64, verbose=0,
                            validation_data=(val_split, val_labels), callbacks=trg_callbacks)
    
    model_histories.append(temp_history)
    
    # find log loss for out of fold val data
    model_val_preds = temp_model.predict(val_split)
    model_log_loss = keras.losses.BinaryCrossentropy()(val_labels, model_val_preds).numpy()
    model_losses.append(model_log_loss)
    print(f'Current Fold Validation Loss: {model_log_loss:.4f}')
    
    # make predictions on test set for each fold
    temp_test_preds = temp_model.predict(X_test)
    test_preds += (temp_test_preds / N_FOLDS)

# convert results to np array
model_losses = np.array(model_losses)

In [None]:
print(f"Mean loss across all folds: {model_losses.mean():.4f} +/- {model_losses.std():.4f}")

Let's visualise the average training and validation loss across all of our folds:

In [None]:
# one dataframe of per-epoch metrics for each fold
fold_dfs = [pd.DataFrame(fold_history.history) for fold_history in model_histories]

# element-wise average; epochs that early stopping cut from any fold become NaN
avg_fold_df = sum(fold_dfs) / len(fold_dfs)

In [None]:
avg_fold_df[['loss', 'val_loss']].plot(figsize=(12,5))
plt.grid()
plt.title("Average Training / Validation Loss across all Folds", weight='bold')
plt.ylabel("Loss", weight='bold')
plt.xlabel("Epochs", weight='bold')
plt.legend(loc='best')
plt.xlim(0.0, avg_fold_df.shape[0])
plt.show()

#### Let's combine each set of predictions from each fold into an overall average set of test predictions

Since instances with cp_type == ctl_vehicle have no MoA (all of their training targets are zero), we will adjust our test set predictions so that the targets are always zero for these instances.

In [None]:
# take a copy of all our test sig_ids for reference
test_sig_ids = test_features['sig_id'].copy()

# boolean mask that is True where 'cp_type' is 'ctl_vehicle'
test_ctl_vehicle_idx = (test_features['cp_type'] == 'ctl_vehicle')

# change all cp_type == ctl_vehicle predictions to zero
test_preds[test_ctl_vehicle_idx.values] = 0

In [None]:
test_submission = pd.DataFrame({'sig_id' : test_sig_ids})
test_preds_df = pd.DataFrame(test_preds, columns=train_targets_scored.columns[1:])
test_submission = pd.concat([test_submission, test_preds_df], axis=1)
test_submission.head(3)

Let's save this and make a submission.

In [None]:
test_submission.to_csv('submission.csv', index=False)

### Model 2 - Reduced complexity with only two hidden layers

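We'll try a shallower model with only two hidden layers, since it appears we are overfitting the data substantially. Note that depth is the only thing reduced here: with a first hidden layer of 2048 units, the model defined below has more parameters overall than Model 1.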
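
As a quick sanity check on model size, the sketch below counts the weights and biases of a stack of dense layers. It ignores batch normalisation parameters, and the input width of 877 is an assumption (872 numerical columns plus 5 one-hot columns); in the notebook the true value is `X_train.shape[1]`. `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_inputs = 877  # assumed input width, see the note above

# Model 1: three hidden layers (512, 256, 256) and 206 outputs
model_1_params = dense_param_count([n_inputs, 512, 256, 256, 206])

# Model 2 as configured in this section: two hidden layers (2048, 256) and 206 outputs
model_2_params = dense_param_count([n_inputs, 2048, 256, 206])

print(f'Model 1: {model_1_params:,} parameters')
print(f'Model 2: {model_2_params:,} parameters')
```

The first layer dominates the count because the input is wide, so a two-layer network with a 2048-unit first layer ends up larger than the three-layer Model 1. A smaller first layer, such as the 256 or 512 units tried in the tuning remarks for this model, would be needed to actually reduce the parameter count.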

In [None]:
# two hidden layers, with optional batch normalisation and dropout

def ann_model_2(dropout=False, dropout_val=0.45, batch_norm=False):
    """ Create a basic Deep NN for classification """
    model = models.Sequential()
    
    model.add(layers.Dense(2048, activation='relu', input_shape=(X_train.shape[1],)))
    if batch_norm:
        model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
    model.add(layers.Dense(256, activation='relu'))
    if batch_norm:
        model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
    
    # output layer
    model.add(layers.Dense(206, activation='sigmoid'))
        
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [None]:
model_2 = ann_model_2(dropout=True, batch_norm=True)

# set up a check point for our model - save only the best val performance
save_path ="ann_model_2_best.hdf5"

trg_checkpoint = ModelCheckpoint(save_path, monitor='val_loss', 
                                 verbose=1, save_best_only=True, mode='min')

trg_callbacks = [trg_checkpoint]

history_2 = model_2.fit(X_train, y_train, epochs=100, 
                      batch_size=1024, validation_data=(X_val, y_val), 
                      callbacks=trg_callbacks)

In [None]:
trg_loss = history_2.history['loss']
val_loss = history_2.history['val_loss']

trg_acc = history_2.history['accuracy']
val_acc = history_2.history['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(16,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.tight_layout()
plt.show()

In [None]:
val_preds = model_2.predict(X_val)
score = keras.losses.BinaryCrossentropy()(y_val, val_preds)
print('Baseline score: %.4f' % score.numpy())

In [None]:
# load model with best weights as found during training
model_2_best = load_model('ann_model_2_best.hdf5')

val_preds_best = model_2_best.predict(X_val)
best_score = keras.losses.BinaryCrossentropy()(y_val, val_preds_best)
print('Best Val Loss: %.4f' % best_score.numpy())

Remarks on tuning performance and history:
- Batch size increase from 256 to 1024 resulted in an improved validation loss (val loss = 0.0158).
- Change from hidden 1=512, hidden 2=256 to hidden 1=512, hidden 2=512 resulted in slightly worse val loss (val loss = 0.0163).
- Change from hidden 1=512, hidden 2=256 to hidden 1=256, hidden 2=256 resulted in val loss of 0.0159.
- Change from dropout of 0.45 to dropout of 0.40 resulted in slightly better performance with val loss of 0.0158.
- Change from dropout of 0.40 to dropout of 0.35 resulted in slightly worse performance with val loss of 0.0160.
- Change from hidden 1=256, hidden 2=256 to hidden 1=1024, hidden 2=256 resulted in val loss of 0.0160.
- Change from hidden 1=256, hidden 2=256 to hidden 1=2048, hidden 2=256 resulted in val loss of 0.0158 after 43 epochs.
- Same as above, but with dropout of 0.30, which resulted in a worse validation loss of 0.0167.
- Same as above, but with dropout returned to 0.45 and Adam optimisation used instead of RMSprop, which resulted in a validation loss of 0.0161 after 97 epochs.

In [None]:
# note down best parameters to use for final models
model_2_bs = 1024
model_2_epochs = 97

### Model 3 - Further simplicity, with just one hidden layer

In [None]:
# one hidden layer, with optional batch normalisation and dropout

def ann_model_3(dropout=False, dropout_val=0.45, batch_norm=False):
    """ Create a basic Deep NN for classification """
    model = models.Sequential()
    
    model.add(layers.Dense(256, activation='selu', input_shape=(X_train.shape[1],)))
    if batch_norm:
        model.add(BatchNormalization())
    if dropout:
        model.add(layers.Dropout(dropout_val))
    
    # output layer
    model.add(layers.Dense(206, activation='sigmoid'))
        
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [None]:
model_3 = ann_model_3(dropout=True, batch_norm=True)

# set up a check point for our model - save only the best val performance
save_path = "ann_model_3_best.hdf5"

trg_checkpoint = ModelCheckpoint(save_path, monitor='val_loss', 
                                 verbose=1, save_best_only=True, mode='min')

trg_callbacks = [trg_checkpoint]

history_3 = model_3.fit(X_train, y_train, epochs=50, 
                      batch_size=1024, validation_data=(X_val, y_val), 
                      callbacks=trg_callbacks)

In [None]:
trg_loss = history_3.history['loss']
val_loss = history_3.history['val_loss']

trg_acc = history_3.history['accuracy']
val_acc = history_3.history['val_accuracy']

epochs = range(1, len(trg_acc) + 1)

# plot losses and accuracies for training and validation 
fig = plt.figure(figsize=(16,6))
ax = fig.add_subplot(1, 2, 1)
plt.plot(epochs, trg_loss, marker='o', label='Training Loss')
plt.plot(epochs, val_loss, marker='x', label='Validation Loss')
plt.title("Training / Validation Loss")
ax.set_ylabel("Loss")
ax.set_xlabel("Epochs")
plt.legend(loc='best')

ax = fig.add_subplot(1, 2, 2)
plt.plot(epochs, trg_acc, marker='o', label='Training Accuracy')
plt.plot(epochs, val_acc, marker='^', label='Validation Accuracy')
plt.title("Training / Validation Accuracy")
ax.set_ylabel("Accuracy")
ax.set_xlabel("Epochs")
plt.legend(loc='best')
plt.tight_layout()
plt.show()

In [None]:
val_preds = model_3.predict(X_val)
score = keras.losses.BinaryCrossentropy()(y_val, val_preds)
print('Baseline score: %.4f' % score.numpy())

In [None]:
# load model with best weights as found during training
model_3_best = load_model('ann_model_3_best.hdf5')

val_preds_best = model_3_best.predict(X_val)
best_score = keras.losses.BinaryCrossentropy()(y_val, val_preds_best)
print('Best Val Loss: %.4f' % best_score.numpy())

Remarks on tuning performance and results:
- Hidden layer size 256, batch size 1024, dropout 0.45, BatchNorm True, 43 epochs: val loss = 0.0161.
- Same as above, but with hidden layer size 512, 30 epochs: val loss = 0.0162.
- Same as previous, but with the hidden layer increased to 1024 and 2048: both result in worse val losses.
- Same as above, but with tanh activation: val loss = 0.0163.
- Same as above, but with SELU activation: val loss = 0.0159.


In [None]:
# note down best parameters to use for final models
model_3_bs = 1024
model_3_epochs = 45

---

<a id="test-predictions"></a>
## 5. Test Set Predictions - Combining the performance of the three best models

### 5.1 Optional exploration - oversampling of minority data prior to making predictions

We'll now oversample the most under-represented classes within our training data. To do this, we iterate through each minority class, select random instances that carry that class label, and simply duplicate them. This is repeated for all classes until each has a chosen minimum number of instances. The approach is crude, and is only being run as an experiment to see how it affects our final model's performance.

In [None]:
# let's gather the 100 most under-represented classes to oversample
top_100_minority = y.sum().sort_values()[:100].index.values

# iterate through each minority class, select random instances that contain that class and duplicate
# repeat this for all classes until we have a chosen minimum number of class instances
min_class_count = 30

extra_features = []
extra_labels = []

for column in top_100_minority:
    class_count = int(y[column].sum())
    class_count_diff = min_class_count - class_count

    # skip classes with no positive instances, since there is nothing to duplicate
    if class_count > 0 and class_count_diff > 1:

        # find instance idxs where class is 1
        positive_idxs = y[column] == 1

        for iteration in range(int(np.ceil(class_count_diff / class_count))):

            # sample the labels once, then take the matching feature rows by index,
            # so that each duplicated feature row stays paired with its own labels
            rand_label = y[positive_idxs].sample(class_count)
            rand_feature = X_train_full.loc[rand_label.index]

            extra_features.append(rand_feature)
            extra_labels.append(rand_label)

# pd.concat is used because DataFrame.append was removed in pandas 2.0
extra_features = pd.concat(extra_features, ignore_index=True)
extra_labels = pd.concat(extra_labels, ignore_index=True)
print(extra_features.shape, extra_labels.shape)

# reset the index so the duplicated rows do not repeat existing index labels
oversampled_X = pd.concat([X_train_full, extra_features], ignore_index=True)
oversampled_y = pd.concat([y, extra_labels], ignore_index=True)
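
The oversampling step only makes sense if every duplicated feature row stays paired with its own label row. As a minimal sketch on toy data (the function name `oversample_minority`, its parameters and the toy frames are hypothetical, and it draws rows with replacement rather than in whole batches as the cell above does), the idea can be checked in isolation:

```python
import numpy as np
import pandas as pd

def oversample_minority(X, y, min_count, random_state=0):
    """Duplicate rows of rare label columns until each has >= min_count positives.

    Rows are drawn by index label, so each duplicated feature row keeps its labels.
    """
    rng = np.random.default_rng(random_state)
    extra_idx = []
    for column in y.columns:
        positives = y.index[y[column] == 1]
        shortfall = min_count - len(positives)
        if len(positives) > 0 and shortfall > 0:
            extra_idx.extend(rng.choice(positives, size=shortfall, replace=True))
    X_out = pd.concat([X, X.loc[extra_idx]], ignore_index=True)
    y_out = pd.concat([y, y.loc[extra_idx]], ignore_index=True)
    return X_out, y_out

# toy data: label "a" has a single positive row, label "b" has four
X_toy = pd.DataFrame({"g1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
y_toy = pd.DataFrame({"a": [1, 0, 0, 0, 0, 0], "b": [0, 1, 1, 1, 1, 0]})

X_big, y_big = oversample_minority(X_toy, y_toy, min_count=3)
print(X_big.shape, y_big.sum().to_dict())
```

Because the same index labels are used to select from both frames, every row where `a == 1` in the output still carries the feature value of the original positive row.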

### 5.2 Production of our three models and obtaining predictions for each

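We'll train each of our three models on the entire training set and make a set of predictions on the test set. The cells below use the original data (`X_train_full`, `y`); the oversampled variant (`oversampled_X`, `oversampled_y`) can be substituted into the `fit` calls to compare.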

In [None]:
# scores 0.02027 on test set
model_1 = ann_model_1(dropout=True)
history_1 = model_1.fit(X_train_full, y, epochs=model_1_epochs, batch_size=model_1_bs)

In [None]:
test_preds_1 = model_1.predict(X_test)

In [None]:
model_2 = ann_model_2(dropout=True)
history_2 = model_2.fit(X_train_full, y, epochs=model_2_epochs, batch_size=model_2_bs)

In [None]:
test_preds_2 = model_2.predict(X_test)

In [None]:
model_3 = ann_model_3(dropout=True)
history_3 = model_3.fit(X_train_full, y, epochs=model_3_epochs, batch_size=model_3_bs)

In [None]:
test_preds_3 = model_3.predict(X_test)

### 5.3 Final predictions and submission

In [None]:
# combine our model predictions into an overall average
test_preds = (test_preds_1 + test_preds_2 + test_preds_3) / 3.0
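
The simple average above treats all three models equally. As a hedged sketch, assuming every model returns an array of the same shape (the helper `average_predictions`, its `weights` argument and the toy arrays are illustrative and not part of the original notebook), the same blend can be written for any number of models, with optional per-model weights:

```python
import numpy as np

def average_predictions(pred_list, weights=None):
    """Average same-shaped prediction arrays, optionally with per-model weights."""
    stacked = np.stack(pred_list, axis=0)  # shape: (n_models, n_samples, n_labels)
    return np.average(stacked, axis=0, weights=weights)

# toy predictions for 2 samples and 2 labels from three models
p1 = np.array([[0.2, 0.8], [0.1, 0.3]])
p2 = np.array([[0.4, 0.6], [0.3, 0.3]])
p3 = np.array([[0.6, 0.4], [0.2, 0.3]])

blend = average_predictions([p1, p2, p3])
weighted = average_predictions([p1, p2], weights=[3, 1])
print(blend)
```

With `weights=None` this reproduces the unweighted mean; whether unequal weights help would need to be checked against validation loss.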

Let's check how much probability we incorrectly assigned to instances that should in fact be all zero because `cp_type == ctl_vehicle`:

In [None]:
# take a copy of all our test sig_ids for reference
test_sig_ids = test_features['sig_id'].copy()

# select all indices when 'cp_type' is 'ctl_vehicle'
test_ctl_vehicle_idx = (test_features['cp_type'] == 'ctl_vehicle')

# find total sum of predictions for these instances in test preds
test_preds[test_sig_ids[test_ctl_vehicle_idx].index.values].sum()

The total is only 6, so not too bad. Nevertheless, we'll still set these test set predictions to zero, since we know they are incorrect.

In [None]:
# change all cp_type == ctl_vehicle predictions to zero
test_preds[test_sig_ids[test_ctl_vehicle_idx].index.values] = 0

# confirm all values now sum to zero for these instances
test_preds[test_sig_ids[test_ctl_vehicle_idx].index.values].sum()
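
The indexing above selects rows through the index labels of `test_sig_ids`, which lines up with array positions only when the frame has its default integer index. As a small sketch on toy data (the frame, the id values and the prediction array here are made up for illustration), a positional boolean mask does the same job without that assumption:

```python
import numpy as np
import pandas as pd

# toy stand-ins for `test_features` and `test_preds`
toy_features = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp", "ctl_vehicle"],
})
toy_preds = np.full((4, 3), 0.05)

# boolean mask as a plain NumPy array, one entry per prediction row
ctl_mask = (toy_features["cp_type"] == "ctl_vehicle").to_numpy()

# zero out every label for the control rows, leave the others untouched
toy_preds[ctl_mask] = 0.0
print(toy_preds[ctl_mask].sum(), toy_preds[~ctl_mask].sum())
```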

Good, now we can make our final submission using this basic DNN classifier.

In [None]:
test_submission = pd.DataFrame({'sig_id' : test_sig_ids})
test_submission[train_targets_scored.columns[1:]] = test_preds
test_submission.head(3)
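
Before saving, it can be worth sanity-checking the submission frame. The sketch below is one possible set of checks, not a competition requirement taken from the source: the helper `check_submission`, the column names `moa_1` and `moa_2`, and the checks themselves are assumptions about what a well-formed submission looks like.

```python
import numpy as np
import pandas as pd

def check_submission(submission, target_columns, n_rows, id_column="sig_id"):
    """Return a list of problems found in a submission frame (empty if none)."""
    problems = []
    if list(submission.columns) != [id_column] + list(target_columns):
        problems.append("unexpected columns or column order")
    if len(submission) != n_rows:
        problems.append("unexpected number of rows")
    values = submission.drop(columns=[id_column], errors="ignore").to_numpy(dtype=float)
    if not np.isfinite(values).all() or values.min() < 0 or values.max() > 1:
        problems.append("predictions outside [0, 1] or not finite")
    return problems

# toy submission with two hypothetical target columns
toy_sub = pd.DataFrame({
    "sig_id": ["id_a", "id_b"],
    "moa_1": [0.1, 0.0],
    "moa_2": [0.9, 0.2],
})
ok_report = check_submission(toy_sub, ["moa_1", "moa_2"], n_rows=2)
bad_report = check_submission(toy_sub.assign(moa_2=[1.5, 0.2]), ["moa_1", "moa_2"], n_rows=2)
print(ok_report, bad_report)
```

In the notebook, the equivalent call would pass `train_targets_scored.columns[1:]` as the target columns and `len(test_features)` as the expected row count.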

With this in the correct format, we can now save it and make a basic submission for the competition:

In [None]:
#test_submission.to_csv('submission.csv', index=False)