# Hi everyone!

This is my first public work here, as well as my first big competition on Kaggle :)

What do we have here? The competition is about the mechanisms of action (MoA) of drugs. The first steps, such as basic info and EDA, can be found here:
* [Insights](http://www.kaggle.com/c/lish-moa/discussion/184005)
* [EDA](https://www.kaggle.com/isaienkov/mechanisms-of-action-moa-prediction-eda)

Both of those links might be extremely useful for you.

## Briefly about my path in the MoA competition

I guess everyone starts with a first solution based on some basic ML algorithm, and so did I. My first submission was based on logistic regression: I trained a separate model for every label and then concatenated the predictions to get the submission dataframe. Unfortunately, the results were frustrating.

The next step was to replace the LR model with something more complicated. I chose XGBClassifier. It did much better, but tuning its hyperparameters took a couple of hours of my life, and the results weren't worth it.

Finally, I was told that all the best results are based on neural networks. Until that moment I had only lived with the kinda small NN inside my head and had never built one myself. So all I had was Google and motivation.

I decided to share my results here to maybe help someone else who is also a newbie, and perhaps to get some useful advice from someone :)

Good luck with the MoA competition! If you find this notebook useful, please give it a thumbs up :)

# Code Time

The only library that might confuse you is iterstrat. You can find it on Kaggle or download it [here](https://github.com/trent-b/iterative-stratification). It provides multilabel stratified k-fold. You can't use StratifiedKFold from sklearn because this is a multilabel task (there are 206 targets, and StratifiedKFold supports only 1d target arrays).

In [None]:
import sys
sys.path.append('../input/iterativestratification/iterative-stratification-master/')
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

import numpy as np
import pandas as pd

from sklearn.metrics import log_loss

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import regularizers
import tensorflow_addons as tfa

import random

## Preparing data

Reading train features, train targets, test features and the sample submission.

In [None]:
X_train = pd.read_csv('/kaggle/input/lish-moa/train_features.csv')
X_test = pd.read_csv('/kaggle/input/lish-moa/test_features.csv')
y_train = pd.read_csv('/kaggle/input/lish-moa/train_targets_scored.csv')
submission = pd.read_csv('/kaggle/input/lish-moa/sample_submission.csv')

First of all, let's make the $sig\_id$ column the index. That way we can always get the matching row from every dataframe.

In [None]:
X_train.set_index('sig_id', inplace=True)
X_test.set_index('sig_id', inplace=True)
y_train.set_index('sig_id', inplace=True)

Also, let's transform $cp\_time$ and $cp\_dose$. $cp\_time$ can be divided by 24 to get the duration in days, and $cp\_dose$ becomes a categorical feature with two possible integer values.

In [None]:
X_train.cp_time = X_train.cp_time // 24
X_train.cp_dose = X_train.cp_dose.map({'D1': 0, 'D2': 1})
X_test.cp_time = X_test.cp_time // 24
X_test.cp_dose = X_test.cp_dose.map({'D1': 0, 'D2': 1})
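
A quick sanity check of this transformation on a toy frame might look like the sketch below. The `toy` dataframe and the `unmapped` counter are made up for illustration. The check is worth having because `.map` silently turns any label missing from the dict into NaN.

```python
import pandas as pd

# Toy stand-in for the real features (hypothetical values)
toy = pd.DataFrame({'cp_time': [24, 48, 72], 'cp_dose': ['D1', 'D2', 'D1']})

# Same transformation as above: hours -> days, dose label -> integer code
toy['cp_time'] = toy['cp_time'] // 24
toy['cp_dose'] = toy['cp_dose'].map({'D1': 0, 'D2': 1})

# Labels that are not in the mapping become NaN, so count them
unmapped = int(toy['cp_dose'].isna().sum())
print(toy)
print('unmapped doses:', unmapped)
```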

Finally, we should delete the control rows from both train and test data, because they have no mechanisms of action and might confuse our neural network. After that we can also drop the $cp\_type$ column, because it is no longer informative.

In [None]:
X_train_moa = X_train[X_train.cp_type != 'ctl_vehicle'].drop(columns=['cp_type'])
X_test_moa = X_test[X_test.cp_type != 'ctl_vehicle'].drop(columns=['cp_type'])

# Don't forget to keep only proper rows in y_train
y_train_moa = y_train.loc[X_train_moa.index]
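
To see why the shared $sig\_id$ index matters here, the same filtering can be tried on two tiny frames. Everything in this sketch (`features`, `targets`, the ids and the `moa_1` column) is invented for illustration; only the `cp_type` values mirror the real data.

```python
import pandas as pd

# Toy features and targets that share a sig_id index (hypothetical rows)
ids = pd.Index(['id_a', 'id_b', 'id_c'], name='sig_id')
features = pd.DataFrame(
    {'cp_type': ['trt_cp', 'ctl_vehicle', 'trt_cp'], 'g-0': [0.1, -0.3, 1.2]},
    index=ids)
targets = pd.DataFrame({'moa_1': [1, 0, 0]}, index=ids)

# Control rows should carry no positive labels at all
control_positives = int(targets.loc[features.cp_type == 'ctl_vehicle'].to_numpy().sum())

# Drop the controls, then select the matching target rows by index
features_moa = features[features.cp_type != 'ctl_vehicle'].drop(columns=['cp_type'])
targets_moa = targets.loc[features_moa.index]

print(control_positives)
print(targets_moa)
```

Selecting with `.loc` by index keeps features and targets aligned even if the two files were stored in a different row order.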

Now the dataframe looks like this:

In [None]:
X_train_moa.head()

## NN

The idea is simple. Let's do cross-validation, but also save every model we train. Then we can apply all of them to the test set and average their predictions - that should give something with low variance.

We need a couple of new functions. The first applies the given models to a dataframe and returns the mean of their predictions. The second computes the log loss. It also returns the per-column losses, which are useful if you want to know which column is the easiest to predict and which is the hardest.

In [None]:
def mean_predictions(models_dense, X):
    y_pred_dense = [model.predict(X) for model in models_dense]
    return np.mean(y_pred_dense, axis=0)

def macro_log_loss(y_true, y_pred):
    # Single target: return the loss both as the score and as a one-element list
    if len(y_true.shape) == 1:
        loss = log_loss(y_true, y_pred, labels=[0, 1])
        return loss, [loss]
    # Keep probabilities strictly inside (0, 1) so the logarithm stays finite
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)
    losses = [log_loss(y_true[:, i], y_pred[:, i], labels=[0, 1]) for i in range(y_true.shape[1])]
    return np.mean(losses), losses
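
If you want to see what the column-wise loss computes without sklearn, here is a NumPy-only sketch. `columnwise_log_loss` and the toy arrays are my own illustration, not part of the notebook's pipeline. It applies the usual binary log loss formula to each column and then averages over columns.

```python
import numpy as np

def columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss for every column, plus the mean over columns."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_column = -np.mean(
        y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred), axis=0)
    return per_column.mean(), per_column

# Four samples, two labels (hypothetical values)
y_true = np.array([[1, 0], [0, 0], [1, 0], [0, 1]])
y_pred = np.array([[0.5, 0.2]] * 4)

score, per_column = columnwise_log_loss(y_true, y_pred)
worst_column = int(np.argmax(per_column))  # column that is hardest to predict
print(score, per_column, worst_column)
```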

Now we can write a function to create a simple dense neural network. It consists of three layer triplets (BatchNorm-Dropout-Dense with weight norm); the last one is the output. I've tried different numbers of units, dropout rates, activation functions and optimizers, and these are the best params I've found so far.

In [None]:
def create_dense_model(input_shape, output_shape):
    inputs = keras.Input(shape=(input_shape,), name='drug')
    x = layers.BatchNormalization()(inputs)
    x = layers.Dropout(0.2)(x)
    x = tfa.layers.WeightNormalization(layers.Dense(units=256, activation='relu', name='dense_1'))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    x = tfa.layers.WeightNormalization(layers.Dense(units=256, activation='relu', name='dense_2'))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    output = tfa.layers.WeightNormalization(layers.Dense(output_shape, activation='sigmoid', kernel_regularizer=keras.regularizers.l2(l2=1e-5), 
                                                         name='predictions'))(x)
    model = keras.Model(inputs=inputs, outputs=output)
    opt = tfa.optimizers.AdamW(weight_decay=1e-5, learning_rate=1e-2)
    model.compile(optimizer=opt, loss='binary_crossentropy')
    return model

Let's set up some callbacks for fitting our model. The first is $EarlyStopping$: we want to stop when the validation loss hasn't improved for $patience$ epochs. The second reduces the learning rate when the validation loss stops improving.

In [None]:
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor='val_loss',
        min_delta=1e-5,
        patience=3,
        verbose=0),
    keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss",
        factor=0.1,
        patience=2)
]
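
To get a feeling for how `patience` and `min_delta` interact, here is a simplified pure-Python sketch of the early stopping rule. It's my own approximation of the idea, not the Keras implementation, and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=1e-5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement by more than min_delta: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: improvement stalls after epoch 3
losses = [0.030, 0.022, 0.019, 0.0191, 0.0190, 0.0192, 0.018]
stop_epoch = early_stopping_epoch(losses)
print(stop_epoch)
```

With the settings above, the learning rate reduction (patience 2) should kick in one epoch before early stopping (patience 3) would, so the model gets one try at a smaller learning rate before training ends.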


Now let's train our models with 5 splits. We save the loss of every single model so we can average them and see the mean loss.

In [None]:
x_train_dense = X_train_moa.to_numpy()
y_train_dense = y_train_moa.to_numpy()

kf = MultilabelStratifiedKFold(n_splits=5, shuffle=True)
mlls = []
models = []
for fold, (train, test) in enumerate(kf.split(x_train_dense, y_train_dense), start=1):
    print(f'FOLD {fold}.', end=' ')

    # Seed NumPy and TensorFlow for this fold, using the first training index as the seed
    np.random.seed(int(train[0]))
    tf.random.set_seed(int(train[0]))

    dense = create_dense_model(x_train_dense.shape[1], y_train_dense.shape[1])
    history_dense = dense.fit(x_train_dense[train], y_train_dense[train],
                     batch_size=64,
                     epochs=100,
                     callbacks=callbacks,
                     validation_data=(x_train_dense[test], y_train_dense[test]),
                     shuffle=True, verbose=0)
    mll, _ = macro_log_loss(y_train_dense[test], dense.predict(x_train_dense[test]))
    models.append(dense)
    mlls.append(mll)
    print('Dense macro log loss:', mll)
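
The "train on each fold, keep every model, average at the end" pattern can be tried without TensorFlow. In the sketch below `MeanModel` is a hypothetical stand-in for the network (it just predicts each label's frequency in its training fold), the data is random, and the folds are plain shuffled folds rather than multilabel-stratified ones.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = (rng.random(size=(20, 3)) < 0.3).astype(float)

class MeanModel:
    """Stand-in for the network: predicts each label's training frequency."""
    def fit(self, X, y):
        self.rates = y.mean(axis=0)
        return self

    def predict(self, X):
        return np.tile(self.rates, (len(X), 1))

# Plain shuffled 5-fold split (not stratified, unlike the notebook)
folds = np.array_split(rng.permutation(len(X)), 5)

fold_models = []
oof = np.zeros_like(y)  # out-of-fold predictions, one row per training sample
for k, valid_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    model = MeanModel().fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict(X[valid_idx])
    fold_models.append(model)

# Average the saved fold models on unseen data
X_new = rng.normal(size=(5, 4))
ensemble = np.mean([m.predict(X_new) for m in fold_models], axis=0)
print(ensemble.shape)
```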

And let's print our mean loss. It's definitely not so bad!

In [None]:
print(np.mean(mlls))

## Test predictions

Finally, let's predict on the test set and make a submission.

In [None]:
preds = mean_predictions(models, X_test_moa)

Let's make a dict that maps test dataframe indexes to rows of the predictions array, so we can look up the prediction for a submission row by $sig\_id$.

In [None]:
indexes = dict(zip(X_test_moa.index, range(X_test_moa.shape[0])))

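Finally, let's fill in the submission row by row. If a $sig\_id$ is not in the test indexes, it belongs to the control group and should have no MoA. We also clip our predictions to reduce the cost of mistakes: a confidently wrong prediction clipped to 1e-3 costs about 6.9 in log loss, instead of the roughly 34.5 it could cost at the 1e-15 floor used in the validation loss above.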

In [None]:
for idx, row in submission.iterrows():
    sig_id = row.sig_id
    if sig_id in indexes.keys():
        submission.loc[idx] = [sig_id] + [np.maximum(np.minimum(pred, 1 - 1e-3), 1e-3) for pred in preds[indexes[sig_id]]]
    else:
        submission.loc[idx] = [sig_id] + [0] * preds.shape[1]
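
The row-by-row loop works, but it can be slow on a large submission. A vectorized alternative might look like the sketch below. It's shown on toy data, and all names in it (`toy_submission`, `treated_ids`, `toy_preds`, `target_cols`) are hypothetical; I haven't timed it against the loop.

```python
import numpy as np
import pandas as pd

target_cols = ['moa_1', 'moa_2']
toy_submission = pd.DataFrame(
    {'sig_id': ['id_a', 'id_b', 'id_c'], 'moa_1': 0.5, 'moa_2': 0.5})

# Predictions exist only for treated rows; id_b plays the control here
treated_ids = pd.Index(['id_a', 'id_c'], name='sig_id')
toy_preds = np.array([[0.9999, 0.2], [0.00001, 0.7]])

# Clip once, align to the submission order, and fill missing (control) rows with 0
clipped = pd.DataFrame(
    np.clip(toy_preds, 1e-3, 1 - 1e-3), index=treated_ids, columns=target_cols)
filled = clipped.reindex(index=toy_submission['sig_id'].to_numpy()).fillna(0.0)
toy_submission[target_cols] = filled.to_numpy()
print(toy_submission)
```

`reindex` does the same job as the `indexes` dict: it looks up every submission $sig\_id$ and leaves NaN where there is no prediction, which `fillna(0.0)` then turns into the "no MoA" rows.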

And the final step is to write our submission to a csv file.

In [None]:
submission.to_csv('submission.csv', index=False)

## Thanks for reading!
If you have any ideas on how to make this solution better, or just want to share your opinion, please let me know in the comments :)