In [None]:
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)

import os
import random

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from scipy.stats import skew, kurtosis

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import roc_auc_score, mean_squared_error
from sklearn.decomposition import PCA

import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense, Dropout, GaussianDropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping
from tensorflow.keras.regularizers import L1, L2

import lightgbm as lgb

import sys
sys.path.append('../input/moa-models-and-packages/site-packages')
from gauss_rank_scaler import GaussRankScaler

SEED = 721991

def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)    
    tf.random.set_seed(seed)

## **1. Introduction**

### **1.1. Mechanisms of Action**

**[("Mechanism of action", n.d.)](https://en.wikipedia.org/wiki/Mechanism_of_action)**

In pharmacology, the term **Mechanism of Action** (MoA) refers to the specific biochemical interaction through which a drug substance produces its pharmacological effect. A mechanism of action usually includes mention of the specific molecular targets to which the drug binds, such as an enzyme or receptor. Receptor sites have specific affinities for drugs based on the chemical structure of the drug, as well as the specific action that occurs there.

In this competition, the task is to predict multiple Mechanism of Action (MoA) targets for different samples. Samples are drugs profiled at different time points and doses, and there are more than 5,000 drugs in the dataset. The dataset consists of several groups of features, and there are more than two hundred enzyme and receptor targets.

In [None]:
df_train = pd.read_csv('../input/lish-moa/train_features.csv')
df_test = pd.read_csv('../input/lish-moa/test_features.csv')

df_train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
df_train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
df_test_targets = pd.read_csv('../input/lish-moa/sample_submission.csv')

target_features_scored = list(df_train_targets_scored.columns[1:])
target_features_nonscored = list(df_train_targets_nonscored.columns[1:])
df_train_targets_scored[target_features_scored] = df_train_targets_scored[target_features_scored].astype(np.uint8)
df_train_targets_nonscored[target_features_nonscored] = df_train_targets_nonscored[target_features_nonscored].astype(np.uint8)
df_test_targets[target_features_scored] = df_test_targets[target_features_scored].astype(np.float32)

df_train = df_train.merge(df_train_targets_scored, on='sig_id', how='left')
df_train = df_train.merge(df_train_targets_nonscored, on='sig_id', how='left')
df_test = df_test.merge(df_test_targets, on='sig_id', how='left')

del df_train_targets_scored, df_train_targets_nonscored, df_test_targets

print(f'Training Set Shape = {df_train.shape}')
print(f'Training Set Memory Usage = {df_train.memory_usage().sum() / 1024 ** 2:.2f} MB')
print(f'Test Set Shape = {df_test.shape}')
print(f'Test Set Memory Usage = {df_test.memory_usage().sum() / 1024 ** 2:.2f} MB')

### **1.2. Features**

* `sig_id` is the unique sample id
* Features with `g-` prefix are gene expression features and there are 772 of them (from `g-0` to `g-771`)
* Features with `c-` prefix are cell viability features and there are 100 of them (from `c-0` to `c-99`)
* `cp_type` is a binary categorical feature which indicates the samples are treated with a compound or with a control perturbation (`trt_cp` or `ctl_vehicle`)
* `cp_time` is a categorical feature which indicates the treatment duration (`24`, `48` or `72` hours)
* `cp_dose` is a binary categorical feature which indicates the dose is low or high (`D1` or `D2`)

In [None]:
g_features = [feature for feature in df_train.columns if feature.startswith('g-')]
c_features = [feature for feature in df_train.columns if feature.startswith('c-')]
other_features = [feature for feature in df_train.columns if feature not in g_features and 
                                                             feature not in c_features and 
                                                             feature not in target_features_scored and
                                                             feature not in target_features_nonscored]

print(f'Number of g- Features: {len(g_features)}')
print(f'Number of c- Features: {len(c_features)}')
print(f'Number of Other Features: {len(other_features)} ({other_features})')

There are two groups of target features: scored and non-scored. Both groups consist of binary MoA targets, but only the first group is used for scoring. Because every sample has many binary targets, this is a multi-label classification problem. Even though non-scored target features don't affect the score, they can still be useful.

In [None]:
print(f'Number of Scored Target Features: {len(target_features_scored)}')
print(f'Number of Non-scored Target Features: {len(target_features_nonscored)}')

## **2. Objective and Metric**

This is a multi-label binary classification problem, and the metric used for evaluation is mean columnwise log loss. For every row, the probability that the sample had a positive response has to be predicted for each target. For $N$ rows and $M$ targets, there will be $N \times M$ predictions. Submissions are scored by the log loss:

$\Large \text{log loss} = - \frac{1}{M}\sum_{m=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \left[ y_{i,m} \log(\hat{y}_{i,m}) + (1 - y_{i,m}) \log(1 - \hat{y}_{i,m})\right]$

* $N$ is the number of rows ($i=1,\dots,N$)
* $M$ is the number of targets ($m=1,\dots,M$)
* $\large \hat{y}_{i,m}$ is the predicted probability for the $i$-th row and $m$-th target
* $\large y_{i,m}$ is the ground truth for the $i$-th row and $m$-th target (1 for a positive response, 0 otherwise)
* $\log()$ is the natural logarithm

Submitted predicted probabilities are replaced with $\max(\min(p, 1-10^{-15}), 10^{-15})$. This means that probabilities greater than 0.999999999999999 and probabilities less than 0.000000000000001 are clipped. With those details added to the equation, the function defined below can be used as an evaluation metric. It operates on NumPy arrays, so using it as a training loss would require an equivalent written with the framework's own tensor operations. Optimizing the competition metric directly should be safer than optimizing a proxy.
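
To get a feel for what the clipping does, the sketch below evaluates the formula on a tiny hand-made example. The helper name `clipped_log_loss` and the toy arrays are made up for illustration and are not part of the competition data; only NumPy is assumed.

```python
import numpy as np

def clipped_log_loss(y_true, y_pred, eps=1e-15):
    """Mean columnwise log loss with probabilities clipped to [eps, 1 - eps]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Log loss of every (row, target) cell
    cell_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Average over rows for each target, then average over targets
    return cell_loss.mean(axis=0).mean()

# Toy ground truth: 2 rows, 2 targets (illustrative values only)
y_true = np.array([[1.0, 0.0],
                   [0.0, 0.0]])

# A hard 0 on a positive label would be log(0) without clipping
overconfident = np.zeros_like(y_true)
# Predicting 0.5 everywhere costs log(2) per cell
hedged = np.full_like(y_true, 0.5)

overconfident_score = clipped_log_loss(y_true, overconfident)
hedged_score = clipped_log_loss(y_true, hedged)
print(f'Overconfident: {overconfident_score:.4f} - Hedged: {hedged_score:.4f}')
```

A single confident miss costs $-\log(10^{-15})$ for that cell, so one wrong hard 0 among four cells already scores far worse than predicting 0.5 everywhere.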

In [None]:
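def mean_columnwise_logloss(y_true, y_pred):
    # Clip probabilities exactly as the competition metric does
    y_pred = np.clip(y_pred, 1e-15, (1 - 1e-15))
    # Average over rows (axis=0) for every target, then average over targets
    score = - np.mean(np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred), axis=0))
    return score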

## **3. Target Features**

Target features are split into two groups, scored and non-scored, and the features in both groups are binary. The competition score is based on the scored target features, but the non-scored group can still be used for model evaluation, data analysis and feature engineering.

This is a multi-label classification problem, so one sample can be positive for multiple targets or for none of them. Most samples have 0 or 1 positive targets, but a small part of the training set samples have 2, 3, 4, 5 or 7 positive targets at the same time. The distributions of positive targets per sample differ between scored and non-scored targets, because the balance between samples with 0 positive targets and samples with 1 positive target is very different in the two groups.

In [None]:
scored_targets_classified = df_train[target_features_scored].sum(axis=1)
nonscored_targets_classified = df_train[target_features_nonscored].sum(axis=1)

fig, axes = plt.subplots(figsize=(32, 8), ncols=2)

# Pass the data as `x=`; recent seaborn versions no longer accept it positionally
sns.countplot(x=scored_targets_classified, ax=axes[0])
sns.countplot(x=nonscored_targets_classified, ax=axes[1])

for i in range(2):
    axes[i].tick_params(axis='x', labelsize=20)
    axes[i].tick_params(axis='y', labelsize=20)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    
axes[0].set_title('Training Set Unique Scored Targets per Sample', size=22, pad=22)
axes[1].set_title('Training Set Unique Non-scored Targets per Sample', size=22, pad=22)

plt.show()

### **3.1. Scored Target Features**

The most commonly classified scored targets are nfkb inhibitor, proteasome inhibitor, cyclooxygenase inhibitor, dopamine receptor antagonist, serotonin receptor antagonist and dna inhibitor, and more than 400 samples are positive for each of them. The rarest scored targets are atp-sensitive potassium channel antagonist and erbb2 inhibitor, each with only one positive sample. A similar distribution is expected in the test set.

Many scored targets have exactly the same number of positive samples, which suggests there might be a relationship between them.
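
Equal counts are only a hint, so it helps to check whether such targets are actually positive on the same samples. The following is a minimal sketch of that check on a toy label matrix; the column names and values are hypothetical and do not come from the dataset.

```python
from collections import defaultdict

import pandas as pd

# Toy binary label matrix; column names are hypothetical
labels = pd.DataFrame({
    'target_a': [1, 1, 0, 0, 0, 0],
    'target_b': [1, 1, 0, 0, 0, 0],
    'target_c': [0, 0, 1, 1, 0, 0],
    'target_d': [0, 0, 0, 0, 1, 0],
})

# Group target names by their number of positive samples
groups = defaultdict(list)
for name, count in labels.sum(axis=0).items():
    groups[int(count)].append(name)

# Keep only the counts that are shared by two or more targets
same_count_groups = {count: names for count, names in groups.items() if len(names) > 1}

# Equal counts alone prove nothing; check whether the columns are identical row by row
identical_pairs = [
    (a, b)
    for names in same_count_groups.values()
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if (labels[a] == labels[b]).all()
]

print(same_count_groups)
print(identical_pairs)
```

In this toy example three targets share the same count, but only one pair is positive on exactly the same rows, which is the kind of relationship worth investigating.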

In [None]:
fig = plt.figure(figsize=(12, 60))

sns.barplot(x=df_train[target_features_scored].sum(axis=0).sort_values(ascending=False).values,
            y=df_train[target_features_scored].sum(axis=0).sort_values(ascending=False).index)

plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.xlabel('')
plt.ylabel('')
plt.title('Training Set Scored Targets Classification Counts', size=18, pad=18)

plt.show()

### **3.2. Non-scored Target Features**

The most commonly classified non-scored targets are ace inhibitor, purinergic receptor antagonist, map kinase inhibitor and sterol demethylase inhibitor, and more than 70 samples are positive for each of them. There are 71 non-scored targets with no positive samples at all. The counts of non-scored targets are very different from those of scored targets, because most of the training samples are not positive for any of them.

Targets sharing the same number of positive samples are even more common among non-scored targets, so relationships between them are more likely.

In [None]:
fig = plt.figure(figsize=(12, 110))

sns.barplot(x=df_train[target_features_nonscored].sum(axis=0).sort_values(ascending=False).values,
            y=df_train[target_features_nonscored].sum(axis=0).sort_values(ascending=False).index)

plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.xlabel('')
plt.ylabel('')
plt.title('Training Set Non-scored Targets Classification Counts', size=18, pad=18)

plt.show()

### **3.3. Target Features Correlations**

Both scored and non-scored target correlations are very close to 0, but there are a few bright red and light red dots that are worth exploring. Those red dots may reveal relationships between targets. Horizontal and vertical white lines in the non-scored target correlations are NaNs: a target with no positive samples has zero variance, so its correlation with every other target is undefined.
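
The heatmaps are hard to read at this size, so the interesting pairs can also be listed directly. The sketch below uses a toy target matrix with hypothetical column names to show where the NaNs come from and how the most correlated pairs could be ranked; it is an illustration, not the analysis run in this notebook.

```python
import numpy as np
import pandas as pd

# Toy binary targets; 'never_positive' has no positive sample (hypothetical names)
targets = pd.DataFrame({
    'target_a': [1, 1, 0, 0, 1, 0],
    'target_b': [1, 1, 0, 0, 0, 0],
    'target_c': [0, 0, 1, 0, 0, 1],
    'never_positive': [0, 0, 0, 0, 0, 0],
})

corr = targets.corr()

# A constant column has zero variance, so its Pearson correlation is undefined (NaN)
nan_targets = corr.columns[corr.isna().all(axis=0)].tolist()

# Keep the strict upper triangle so every pair appears once, then rank by correlation
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
top_pairs = upper.stack().dropna().sort_values(ascending=False)

print(nan_targets)
print(top_pairs.head())
```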

In [None]:
fig, axes = plt.subplots(figsize=(32, 15), ncols=2, dpi=150)

sns.heatmap(df_train[target_features_scored].corr(),
            annot=False,
            square=True,
            cmap='coolwarm',
            cbar=False,
            yticklabels=False,
            xticklabels=False,
            ax=axes[0])  

sns.heatmap(df_train[target_features_nonscored].corr(),
            annot=False,
            square=True,
            cmap='coolwarm',
            cbar=False,
            yticklabels=False,
            xticklabels=False,
            ax=axes[1])   

axes[0].set_title('Training Set Scored Target Correlations', size=25, pad=25)
axes[1].set_title('Training Set Non-scored Target Correlations', size=25, pad=25)

plt.show()

## **4. Categorical Features**

There are three categorical features: `cp_type`, `cp_time` and `cp_dose`. Two of them are binary and one has three unique values, so the cardinality of those features is very low.
All of the categorical features have almost identical distributions in the training and public test sets, which suggests that both sets are drawn from the same population. Samples were probably stratified on those features when the training and public test sets were split, and the same distributions are expected in the private test set.
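
Besides comparing the plots by eye, the category shares can be put side by side. This is a small sketch on toy frames; `share_table` is a hypothetical helper and the values are invented.

```python
import pandas as pd

# Toy frames standing in for the training and public test sets
toy_train = pd.DataFrame({'cp_dose': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2', 'D1', 'D2']})
toy_test = pd.DataFrame({'cp_dose': ['D1', 'D2', 'D2', 'D1']})

def share_table(train, test, feature):
    """Side-by-side share of each category in both sets."""
    table = pd.concat(
        [train[feature].value_counts(normalize=True).rename('train_share'),
         test[feature].value_counts(normalize=True).rename('test_share')],
        axis=1,
    ).fillna(0.0)
    # Absolute difference between the shares; values near 0 mean similar distributions
    table['abs_diff'] = (table['train_share'] - table['test_share']).abs()
    return table.sort_index()

dose_shares = share_table(toy_train, toy_test, 'cp_dose')
print(dose_shares)
```

The same helper could be called with `cp_type` or `cp_time` on the real frames.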

In [None]:
fig, axes = plt.subplots(figsize=(24, 24), nrows=3, ncols=2)

# Pass the data as `x=`; recent seaborn versions no longer accept it positionally
sns.countplot(x=df_train['cp_type'], ax=axes[0][0])
sns.countplot(x=df_test['cp_type'], ax=axes[0][1])

sns.countplot(x=df_train['cp_time'], ax=axes[1][0])
sns.countplot(x=df_test['cp_time'], ax=axes[1][1])

sns.countplot(x=df_train['cp_dose'], ax=axes[2][0])
sns.countplot(x=df_test['cp_dose'], ax=axes[2][1])

for i in range(3):
    for j in range(2):
        axes[i][j].tick_params(axis='x', labelsize=15)
        axes[i][j].tick_params(axis='y', labelsize=15)
        axes[i][j].set_xlabel('')
        axes[i][j].set_ylabel('')
        
for i, feature in enumerate(['cp_type', 'cp_time', 'cp_dose']):
    for j, dataset in enumerate(['Training', 'Test']):
        axes[i][j].set_title(f'{dataset} Set {feature} Distribution', size=18, pad=18)

plt.show()

### **4.1. cp_type**

`cp_type` is the first categorical feature in the dataset and it is binary. It indicates whether a sample is treated with a compound (`trt_cp`) or with a control perturbation (`ctl_vehicle`). Samples treated with control perturbations have no MoAs, so all of their scored and non-scored target labels are zeros. However, not every all-zero sample is a control: more than 1/3 of the compound samples are also labeled with all zeros.

In [None]:
df_control = df_train[df_train['cp_type'] == 'ctl_vehicle']
df_compound = df_train[df_train['cp_type'] == 'trt_cp']

print(f'{len(df_control)}/{len(df_train)} samples are treated with a control perturbation and {len(df_control[df_control[target_features_scored].sum(axis=1) == 0])}/{len(df_control)} of those samples have all zero targets')
print(f'{len(df_compound)}/{len(df_train)} samples are treated with a compound and {len(df_compound[df_compound[target_features_scored].sum(axis=1) == 0])}/{len(df_compound)} of those samples have all zero targets')

### **4.2. cp_time**

`cp_time` is the second categorical feature in the dataset and it has three unique values; `24`, `48` and `72` hours. It indicates the treatment durations of the samples. Sample counts of different `cp_time` values are very consistent and close to each other in different targets. Sample counts are either extremely close to each other or `48` is slightly higher than the others.

All targets have samples with all three durations except erbb2 inhibitor and atp-sensitive potassium channel antagonist, because each of those two targets is positive in only one sample.
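
The same per-duration counts can be obtained without explicit loops. The following is a compact sketch on a toy frame with two hypothetical targets, shown as an alternative way to build the table rather than the code used for the plot below.

```python
import pandas as pd

# Toy frame with two hypothetical binary targets
toy = pd.DataFrame({
    'cp_time': [24, 24, 48, 48, 72, 72],
    'target_a': [1, 0, 1, 1, 0, 1],
    'target_b': [0, 0, 1, 0, 0, 0],
})
toy_targets = ['target_a', 'target_b']

# Summing binary labels within each cp_time group counts the positive samples
counts_by_time = toy.groupby('cp_time')[toy_targets].sum().T

# Targets that are missing at least one treatment duration
incomplete = counts_by_time.index[(counts_by_time == 0).any(axis=1)].tolist()

print(counts_by_time)
print(incomplete)
```

A target with a single positive sample necessarily shows up in `incomplete`, which matches the two exceptions noted above.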

In [None]:
rows = []

for target_feature in target_features_scored:
    for cp_time in [24, 48, 72]:
        count = len(df_train[(df_train['cp_time'] == cp_time) & (df_train[target_feature] == 1)])
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'target': target_feature, 'cp_time': cp_time, 'count': count})

df_target_counts_by_cp_time = pd.DataFrame(rows, columns=['target', 'cp_time', 'count'])
        
df_target_counts_by_cp_time['total_count'] = df_target_counts_by_cp_time.groupby('target')['count'].transform('sum')
df_target_counts_by_cp_time.sort_values(by=['total_count', 'target'],ascending=False, inplace=True)

fig = plt.figure(figsize=(15, 75), dpi=100)

sns.barplot(x=df_target_counts_by_cp_time['count'],
            y=df_target_counts_by_cp_time['target'],
            hue=df_target_counts_by_cp_time['cp_time'])

plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0, prop={'size': 20})
plt.title('Training Set cp_time Distribution in Scored Targets', size=18, pad=18)

plt.show()

del df_target_counts_by_cp_time

### **4.3. cp_dose**

`cp_dose` is the final categorical feature in the dataset and it is also binary. It indicates whether the dose of a sample is low (`D1`) or high (`D2`). Sample counts of different `cp_dose` values are very consistent and close to each other across targets. They are even closer to each other than the `cp_time` counts, and many targets have equal sample counts for both doses.

All targets have samples with both doses except erbb2 inhibitor and atp-sensitive potassium channel antagonist, because each of those two targets is positive in only one sample.

In [None]:
rows = []

for target_feature in target_features_scored:
    for cp_dose in ['D1', 'D2']:
        count = len(df_train[(df_train['cp_dose'] == cp_dose) & (df_train[target_feature] == 1)])
        # Collect rows in a list; DataFrame.append was removed in pandas 2.0
        rows.append({'target': target_feature, 'cp_dose': cp_dose, 'count': count})

df_target_counts_by_cp_dose = pd.DataFrame(rows, columns=['target', 'cp_dose', 'count'])
        
df_target_counts_by_cp_dose['total_count'] = df_target_counts_by_cp_dose.groupby('target')['count'].transform('sum')
df_target_counts_by_cp_dose.sort_values(by=['total_count', 'target'],ascending=False, inplace=True)

fig = plt.figure(figsize=(15, 75), dpi=100)

sns.barplot(x=df_target_counts_by_cp_dose['count'],
            y=df_target_counts_by_cp_dose['target'],
            hue=df_target_counts_by_cp_dose['cp_dose'])

plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0, prop={'size': 20})
plt.title('Training Set cp_dose Distribution in Scored Targets', size=18, pad=18)

plt.show()

del df_target_counts_by_cp_dose

## **5. Cell Viability**

### **5.1. Cell Viability Features**

Cell viability is a measure of the proportion of live, healthy cells within a population. Cell viability assays are used to determine the overall health of cells, optimize culture or experimental conditions, and to measure cell survival following treatment with compounds, such as during a drug screen.

Cell-viability assessment is based on PRISM technology. PRISM is a high-throughput screen for assessing cell viability in which cell lines that have each been labelled with a unique 24-nucleotide barcode are pooled and treated with the experimental condition, and surviving cells are “counted” through identification of the cognate barcode. PRISM is an acronym for Profiling Relative Inhibition Simultaneously in Mixture. 

There are 100 cell viability features and they have the `c-` prefix (`c-0` to `c-99`). Each cell viability feature represents the viability of one particular cell line, and all experiments are based on a set of similar cells. These are mostly cancer cells.

In [None]:
df_train[(['sig_id'] + c_features)]

### **5.2. Cell Viability and Target Interactions**

Since there are 206 targets, each cell viability feature is plotted against a single randomly chosen target. In most cases there is a positive relationship between the cell viability feature and the target. However, some features have no relationship or a negative relationship with the target. This could be because most of the cell lines are cancer cells while some are not. Another visible pattern is that positive target values are clustered around the zero mean of the feature in most cases.

In [None]:
def scatterplot_cfeature(feature_group, seeds):
        
    fig, axes = plt.subplots(ncols=5, figsize=(36, 5), dpi=100, constrained_layout=True)
    title_size = 25
    label_size = 25

    for i, feature in enumerate(feature_group):
        
        np.random.seed(seeds[i])
        target = np.random.choice(target_features_scored)
        if len(target) > 25:
            target_title = target[:25]
        else:
            target_title = target
            
        # Keyword arguments are required for x and y in recent seaborn versions
        sns.scatterplot(x=df_train[feature], y=df_train[target], s=100, ax=axes[i])
        axes[i].set_xlabel('')
        axes[i].set_ylabel('')
        axes[i].tick_params(axis='x', labelsize=label_size)
        axes[i].tick_params(axis='y', labelsize=label_size)
        
        for label in axes[i].get_yticklabels():
            if i % 5 == 0:
                label.set_visible(True)
            else:
                label.set_visible(False)
                
        axes[i].set_title(f'{feature} vs {target_title}', size=title_size, pad=title_size)
    
    plt.show()
    
for i, feature_group in enumerate(np.array_split(c_features, len(c_features) // 5), 2):
    scatterplot_cfeature(feature_group, seeds=np.arange(1, 6) * i)

### **5.3. Cell Viability Feature Distributions**

Cell viability feature distributions are very similar in the training and test sets because the quantile normalization and z-scoring steps are done once at the plate level, not separately for the training and test sets. A typical plate contains 1,200 signatures, and those signatures can end up in either the training or the test set. Plate information is not provided in the dataset.

The most common cell viability feature distribution looks like a left-skewed bell curve with a mean close to 0.5. The features with the most distinct distributions are `c-37`, `c-58`, `c-69`, `c-74` and `c-76` because of their shorter tails. They have higher overall cell viability. Minimum values in the other features are clipped at -10, but clipping wasn't needed for these five features.
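
Features that never reach the clipping floor can also be flagged programmatically. Here is a minimal sketch on two invented columns, assuming the floor of -10 mentioned above; the column names and values are hypothetical.

```python
import pandas as pd

FLOOR = -10  # clipping floor described in the text

# Invented stand-ins for a clipped feature and a short-tailed feature
toy = pd.DataFrame({
    'c-clipped': [-10.0, -10.0, -3.0, 0.0, 0.5, 1.0, 1.5],
    'c-short-tail': [-1.0, 0.0, 0.3, 0.5, 0.7, 1.0, 1.2],
})

summary = pd.DataFrame({
    'min': toy.min(),
    'skew': toy.skew(),                       # negative values indicate a longer left tail
    'share_at_floor': (toy <= FLOOR).mean(),  # fraction of samples sitting on the floor
})

floored = summary.index[summary['share_at_floor'] > 0].tolist()
not_floored = summary.index[summary['share_at_floor'] == 0].tolist()

print(summary)
```

On the real data, the columns ending up in `not_floored` would be the candidates for the short-tailed group.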

In [None]:
def distplot_cfeature(feature_group):
        
    fig, axes = plt.subplots(ncols=5, figsize=(36, 5), dpi=100, constrained_layout=True)
    title_size = 25
    label_size = 25

    for i, feature in enumerate(feature_group):
        sns.distplot(df_train[feature], label='Training', ax=axes[i], hist_kws={'alpha': 0.25})
        sns.distplot(df_test[feature], label='Test', ax=axes[i], hist_kws={'alpha': 0.25})
        axes[i].set_xlabel('')
        axes[i].tick_params(axis='x', labelsize=label_size)
        axes[i].tick_params(axis='y', labelsize=label_size)
        
        if i % 5 == 0:
            axes[i].legend(prop={'size': 25})
            
        axes[i].set_title(f'{feature} Distribution', size=title_size, pad=title_size)
    
    plt.show()
    
for feature_group in np.array_split(c_features, len(c_features) // 5):
    
    for c_feature in feature_group: 
        train_mean = df_train[c_feature].mean()
        train_median = df_train[c_feature].median()
        train_std = df_train[c_feature].std()
        train_min = df_train[c_feature].min()
        train_max = df_train[c_feature].max()
        train_skew = skew(df_train[c_feature])
        train_kurt = kurtosis(df_train[c_feature])
        train_var = np.var(df_train[c_feature])

        test_mean = df_test[c_feature].mean()
        test_median = df_test[c_feature].median()
        test_std = df_test[c_feature].std()
        test_min = df_test[c_feature].min()
        test_max = df_test[c_feature].max()
        test_skew = skew(df_test[c_feature])
        test_kurt = kurtosis(df_test[c_feature])
        test_var = np.var(df_test[c_feature])

        print(f'{c_feature} Train - Mean: {train_mean:.4} - Median: {train_median:.4} - Std: {train_std:.4} - Min: {train_min:.4} - Max: {train_max:.4} - Skew {train_skew:.4} - Kurt {train_kurt:.4} - Var {train_var:.4}')
        print(f'{c_feature} Test - Mean: {test_mean:.4} - Median: {test_median:.4} - Std: {test_std:.4} - Min: {test_min:.4} - Max: {test_max:.4} - Skew {test_skew:.4} - Kurt {test_kurt:.4} - Var {test_var:.4}\n')

    distplot_cfeature(feature_group)


### **5.4. Cell Viability Feature Correlations**

Cell viability feature correlations are extremely high. This may be because drug effects are similar on most cell lines. The lowest correlations (0.65) appear as the black lines, which belong to the `c-37`, `c-58`, `c-69`, `c-74` and `c-76` features due to their unusual distributions.

In [None]:
fig = plt.figure(figsize=(20, 20))

ax = sns.heatmap(df_train[c_features].corr(),
                 annot=False,
                 square=True)

ax.tick_params(axis='x', labelsize=20, rotation=0, pad=20)
ax.tick_params(axis='y', labelsize=20, rotation=0, pad=20)

for idx, label in enumerate(ax.get_xticklabels()):
    if idx % 5 == 0:
        label.set_visible(True)
    else:
        label.set_visible(False)
        
for idx, label in enumerate(ax.get_yticklabels()):
    if idx % 5 == 0:
        label.set_visible(True)
    else:
        label.set_visible(False)
        
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=30, pad=20)

plt.title('Cell Viability Features Correlations', size=30, pad=30)
plt.show()

## **6. Gene Expression**

### **6.1. Gene Expression Features**

Gene expression is the amount and type of proteins that are expressed in a cell at any given point in time. Gene expression level is based on a protocol similar to L1000 which is a high-throughput gene expression assay that measures the mRNA transcript abundance of 978 "landmark" genes from human cells. (The "L" in L1000 refers to the Landmark genes measured in the assay.)

There are 772 gene expression features and they have the `g-` prefix (`g-0` to `g-771`). Each gene expression feature represents the expression of one particular gene, so 772 individual genes are monitored in this assay.

In [None]:
df_train[(['sig_id'] + g_features)]

### **6.2. Gene Expression and Target Interactions**

Random gene expression features are plotted against a random target feature because it is not possible to visualize all 772x206 feature interactions. Gene expression features have weaker relationships with targets than cell viability features do, because the data points with positive target values are more spread out along the x axis.

All gene expression features have one thing in common: positive target values are clustered around the zero mean, just like in cell viability features. High absolute values in gene expression features (>2 or <-2) indicate that the drug or perturbation had a significant effect on the given cell, whereas values close to zero mean that the effect on that cell was not measurable.
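
The threshold of 2 can be turned into a simple per-sample summary. This is a sketch on invented z-scored values for four hypothetical genes, not a feature used elsewhere in this notebook.

```python
import pandas as pd

# Invented z-scored expression values for three samples and four hypothetical genes
toy = pd.DataFrame(
    [[0.1, -0.3, 0.2, 0.0],
     [2.5, -3.1, 0.4, 1.9],
     [-2.2, 0.1, 0.0, 0.3]],
    columns=['g-0', 'g-1', 'g-2', 'g-3'],
)

THRESHOLD = 2.0  # absolute value above which an effect is considered significant

# Boolean mask of strongly affected genes, counted per sample
strong = toy.abs() > THRESHOLD
strong_genes_per_sample = strong.sum(axis=1)

print(strong_genes_per_sample.tolist())
```

A sample with no gene above the threshold behaves like an unperturbed one, while higher counts point to a broad response.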

In [None]:
def scatterplot_gfeature(feature_group, seeds):
        
    fig, axes = plt.subplots(ncols=4, figsize=(36, 5), dpi=100, constrained_layout=True)
    title_size = 25
    label_size = 25

    for i, feature in enumerate(feature_group):
                
        np.random.seed(seeds[i])
        target = np.random.choice(target_features_scored)
        if len(target) > 25:
            target_title = target[:25]
        else:
            target_title = target
            
        # Keyword arguments are required for x and y in recent seaborn versions
        sns.scatterplot(x=df_train[feature], y=df_train[target], s=100, ax=axes[i])
        axes[i].set_xlabel('')
        axes[i].set_ylabel('')
        axes[i].tick_params(axis='x', labelsize=label_size)
        axes[i].tick_params(axis='y', labelsize=label_size)
        
        for label in axes[i].get_yticklabels():
            if i % 5 == 0:
                label.set_visible(True)
            else:
                label.set_visible(False)
                
        axes[i].set_title(f'{feature} vs {target_title}', size=title_size, pad=title_size)
    
    plt.show()
    
shuffled_g_features = np.copy(g_features)
np.random.shuffle(shuffled_g_features)
for i, feature_group in enumerate(np.array_split(shuffled_g_features, len(shuffled_g_features) // 4)[:10], 1):
    scatterplot_gfeature(feature_group, seeds=np.arange(1, 6) * i)

### **6.3. Gene Expression Feature Distributions**

Gene expression feature distributions are also very similar in the training and test sets, because the same quantile normalization and z-scoring steps used on cell viability features are also applied to them at the plate level.

Gene expression feature distributions are more diverse than cell viability feature distributions. Both left- and right-tailed distributions exist, with both long and short tails.

In [None]:
def distplot_gfeature(feature_group):
        
    fig, axes = plt.subplots(ncols=4, figsize=(36, 5), dpi=100, constrained_layout=True)
    title_size = 25
    label_size = 25

    for i, feature in enumerate(feature_group):
        sns.distplot(df_train[feature], label='Training', ax=axes[i], hist_kws={'alpha': 0.25})
        sns.distplot(df_test[feature], label='Test', ax=axes[i], hist_kws={'alpha': 0.25})
        axes[i].set_xlabel('')
        axes[i].tick_params(axis='x', labelsize=label_size)
        axes[i].tick_params(axis='y', labelsize=label_size)
        
        if i % 5 == 0:
            axes[i].legend(prop={'size': 25})
            
        axes[i].set_title(f'{feature} Distribution', size=title_size, pad=title_size)
    
    plt.show()

shuffled_g_features = np.copy(g_features)
np.random.shuffle(shuffled_g_features)
for feature_group in np.array_split(shuffled_g_features, len(shuffled_g_features) // 4)[:10]:
    
    for g_feature in feature_group:
        train_mean = df_train[g_feature].mean()
        train_median = df_train[g_feature].median()
        train_std = df_train[g_feature].std()
        train_min = df_train[g_feature].min()
        train_max = df_train[g_feature].max()
        train_skew = skew(df_train[g_feature])
        train_kurt = kurtosis(df_train[g_feature])
        train_var = np.var(df_train[g_feature])

        test_mean = df_test[g_feature].mean()
        test_median = df_test[g_feature].median()
        test_std = df_test[g_feature].std()
        test_min = df_test[g_feature].min()
        test_max = df_test[g_feature].max()
        test_skew = skew(df_test[g_feature])
        test_kurt = kurtosis(df_test[g_feature])
        test_var = np.var(df_test[g_feature])

        print(f'{g_feature} Train - Mean: {train_mean:.4} - Median: {train_median:.4} - Std: {train_std:.4} - Min: {train_min:.4} - Max: {train_max:.4} - Skew {train_skew:.4} - Kurt {train_kurt:.4} - Var {train_var:.4}')
        print(f'{g_feature} Test - Mean: {test_mean:.4} - Median: {test_median:.4} - Std: {test_std:.4} - Min: {test_min:.4} - Max: {test_max:.4} - Skew {test_skew:.4} - Kurt {test_kurt:.4} - Var {test_var:.4}\n')

    distplot_gfeature(feature_group)


### **6.4. Gene Expression Feature Correlations**

Gene expression feature correlations are more diverse compared to cell viability feature correlations. Correlation coefficients vary from -0.75 to 1, because of different distribution types.

In [None]:
fig = plt.figure(figsize=(20, 20))

ax = sns.heatmap(df_train[g_features].corr(),
                 annot=False,
                 square=True)

ax.tick_params(axis='x', labelsize=20, rotation=90, pad=20)
ax.tick_params(axis='y', labelsize=20, rotation=0, pad=20)

for idx, label in enumerate(ax.get_xticklabels()):
    if idx % 5 == 0:
        label.set_visible(True)
    else:
        label.set_visible(False)
        
for idx, label in enumerate(ax.get_yticklabels()):
    if idx % 5 == 0:
        label.set_visible(True)
    else:
        label.set_visible(False)
        
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=30, pad=20)

plt.title('Gene Expression Features Correlations', size=30, pad=30)
plt.show()

## **7. Feature Selection**

### **7.1. Adversarial Validation**

The categorical features `cp_type`, `cp_time` and `cp_dose` are omitted from adversarial validation; only gene expression and cell viability features are used. A ROC AUC score between **0.5** and **0.55** is expected from the adversarial validation model because gene expression and cell viability features were normalized before the training and test sets were split.
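
The reasoning behind that expectation can be checked on synthetic data. The sketch below swaps in a scikit-learn `LogisticRegression` as a stand-in classifier (it is not the LightGBM model used in this notebook) and uses randomly generated features, so every array and variable name here is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)

# Both "sets" are drawn from the same distribution, so the origin label is unlearnable
toy_train = rng.normal(size=(2000, 5))
toy_test = rng.normal(size=(500, 5))

X_toy = np.vstack([toy_train, toy_test])
y_toy = np.concatenate([np.zeros(len(toy_train), dtype=int), np.ones(len(toy_test), dtype=int)])

def adversarial_auc(features, origin):
    """Out-of-fold ROC AUC of a classifier that predicts which set a row came from."""
    oof = cross_val_predict(LogisticRegression(max_iter=1000), features, origin,
                            cv=5, method='predict_proba')[:, 1]
    return roc_auc_score(origin, oof)

same_dist_auc = adversarial_auc(X_toy, y_toy)

# Shifting one feature of the "test" rows makes the two sets separable
X_shifted = X_toy.copy()
X_shifted[len(toy_train):, 0] += 2.0
shifted_auc = adversarial_auc(X_shifted, y_toy)

print(f'Same distribution AUC: {same_dist_auc:.4f} - Shifted AUC: {shifted_auc:.4f}')
```

A score close to 0.5 means the classifier cannot tell the sets apart, while a score well above 0.5 signals a distribution shift in at least one feature.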

In [None]:
df_train['target'] = 0
df_test['target'] = 1

X = pd.concat([df_train.loc[:, g_features + c_features], df_test.loc[:, g_features + c_features]]).reset_index(drop=True)
y = pd.concat([df_train.loc[:, 'target'], df_test.loc[:, 'target']]).reset_index(drop=True)

The adversarial validation model yields a ROC AUC score of **0.5275**, which suggests that the training and public test sets are similar. Features at the top of the importance plot have higher gain because their means differ between the training and public test sets due to extreme values in the distribution tails. This could be related to the small sample size of the public test set, and it shouldn't necessarily be expected in the private test set.

In [None]:
K = 5
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=SEED)

scores = []
oof_predictions = pd.DataFrame(np.zeros((X.shape[0], 1)), columns=['target'])
feature_importance = pd.DataFrame(np.zeros((X.shape[1], K)), columns=[f'Fold_{i}_Importance' for i in range(1, K + 1)], index=X.columns)

parameters = {
    'num_iterations': 500,
    'early_stopping_round': 50,
    'num_leaves': 2 ** 5, 
    'learning_rate': 0.05,
    'bagging_fraction': 0.9,
    'bagging_freq': 1,
    'feature_fraction': 0.9,
    'feature_fraction_bynode': 0.9,
    'lambda_l1': 0,
    'lambda_l2': 0,
    'max_depth': -1,
    'objective': 'binary',
    'seed': SEED,
    'feature_fraction_seed': SEED,
    'bagging_seed': SEED,
    'drop_seed': SEED,
    'data_random_seed': SEED,
    'boosting_type': 'gbdt',
    'verbose': 1,
    'metric': 'auc',
    'n_jobs': -1,   
}

print('Running LightGBM Adversarial Validation Model\n' + ('-' * 45) + '\n')

for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y), 1):

    trn_data = lgb.Dataset(X.iloc[trn_idx, :], label=y.iloc[trn_idx])
    val_data = lgb.Dataset(X.iloc[val_idx, :], label=y.iloc[val_idx])      
    model = lgb.train(parameters, trn_data, valid_sets=[trn_data, val_data], verbose_eval=50)
    feature_importance.iloc[:, fold - 1] = model.feature_importance(importance_type='gain')

    predictions = model.predict(X.iloc[val_idx, :], num_iteration=model.best_iteration)
    oof_predictions.loc[val_idx, 'target'] = predictions
    
    score = roc_auc_score(y.iloc[val_idx], predictions)
    scores.append(score)            
    print(f'\nFold {fold} - ROC AUC Score {score:.6}\n')
    
oof_score = roc_auc_score(y, oof_predictions)
print(f'\n{"-" * 30}\nLightGBM Adversarial Validation Model Mean ROC AUC Score {np.mean(scores):.6} [STD:{np.std(scores):.6}]')
print(f'LightGBM Adversarial Validation Model OOF ROC AUC Score: {oof_score:.6}\n{"-" * 30}')


plt.figure(figsize=(20, 20))
feature_importance['Mean_Importance'] = feature_importance.sum(axis=1) / K
feature_importance.sort_values(by='Mean_Importance', inplace=True, ascending=False)
sns.barplot(x='Mean_Importance', y=feature_importance.index[:50], data=feature_importance[:50])

plt.xlabel('')
plt.tick_params(axis='x', labelsize=18)
plt.tick_params(axis='y', labelsize=18)
plt.title('LightGBM Adversarial Validation Model Top 50 Feature Importance (Gain)', size=20, pad=20)

plt.show()

del X, y, oof_predictions, feature_importance, parameters, scores, oof_score

### **7.2. Variance Threshold**

Another approach to feature selection is eliminating features whose variance is lower than a predefined threshold. Variance is the average of the squared differences from the mean, so it measures how far the data points are spread out from the mean. If the variance is low or close to zero, the feature is approximately constant and will not improve model performance. In that case, it should be removed.

Variances of continuous features are calculated on the training and test sets together because they could yield different results if calculated separately. As the table below shows, many features have low variance, but none of the variances are close to zero.
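
The definition above can be sketched in a few lines of NumPy. The matrix `X`, the feature names and the `threshold` value are assumptions made only for this illustration. Note that this sketch uses the population variance (division by `n`), while pandas `var()` defaults to the sample variance (division by `n - 1`); the difference is negligible for large samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix with columns of very different spread.
X = np.column_stack([
    rng.normal(0.0, 1.0, size=1000),    # wide spread
    rng.normal(0.0, 0.01, size=1000),   # nearly constant
    np.full(1000, 3.0),                 # exactly constant
])
feature_names = ['f_wide', 'f_narrow', 'f_constant']

# Variance is the mean squared deviation from the column mean.
variances = ((X - X.mean(axis=0)) ** 2).mean(axis=0)

threshold = 0.05  # assumed cut-off, chosen only for this illustration
kept = [name for name, var in zip(feature_names, variances) if var >= threshold]
print(dict(zip(feature_names, variances.round(4))), kept)
```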

In [None]:
df_all = pd.concat([df_train[c_features + g_features], df_test[c_features + g_features]], axis=0, ignore_index=True).astype(np.float32)

variances = {}

for feature in df_all.columns:
    variances[feature] = df_all[feature].var()
    
del df_all

df_variances = pd.DataFrame.from_dict(variances, orient='index', columns=['Variance'])
df_variances['Variance'] = df_variances['Variance'].astype(np.float32)
df_variances.sort_values(by='Variance', ascending=True, inplace=True)
df_variances.head(10)

Feature distributions are visualized from low to high variance. Features with variance between 0.3 and 0.4 can be removed since they contain the least information. For other ranges, feature removal should be done with trial and error based on model performances.

In [None]:
def distplot_variances(feature_group):
        
    fig, axes = plt.subplots(ncols=4, figsize=(36, 5), dpi=100, constrained_layout=True)
    title_size = 30
    label_size = 25

    for i, feature in enumerate(feature_group):
        sns.distplot(df_train[feature], label='Training', ax=axes[i], hist_kws={'alpha': 0.25})
        sns.distplot(df_test[feature], label='Test', ax=axes[i], hist_kws={'alpha': 0.25})
        axes[i].set_xlabel('')
        axes[i].tick_params(axis='x', labelsize=label_size)
        axes[i].tick_params(axis='y', labelsize=label_size)
        
        if i % 4 == 0:
            axes[i].legend(prop={'size': 25})
            
        axes[i].set_title(f'{feature} - Variance: {df_variances.loc[feature, "Variance"]:.6}', size=title_size, pad=title_size)
    
    plt.show()
    
for i in np.arange(0.3, 1.5, 0.1):
    feature_group = list(df_variances.loc[(df_variances['Variance'] >= i) & (df_variances['Variance'] < (i + 0.1)), 'Variance'].index)[:4]
    distplot_variances(feature_group)

## **8. Feature Scaling**

It can be seen from the prior visualizations that continuous features lie in different ranges. Feature scaling can both improve model performance and speed up convergence. In order to observe the effects of different scalers, 25 cell viability and 25 gene expression features are randomly selected and scaled with `StandardScaler`, `MinMaxScaler`, and `GaussRankScaler`.

In [None]:
df_all = pd.concat([df_train[c_features + g_features], df_test[c_features + g_features]], axis=0, ignore_index=True).astype(np.float32)

np.random.seed(SEED)
scale_features = list(np.random.choice(c_features, 25, replace=False)) + list(np.random.choice(g_features, 25, replace=False))

df_scaled = df_all.loc[:, scale_features].copy()
del df_all

mms = MinMaxScaler()
ss = StandardScaler()
grs = GaussRankScaler()

for feature in scale_features:
    df_scaled[f'{feature}_mms'] = mms.fit_transform(df_scaled[feature].values.reshape(-1, 1))
    df_scaled[f'{feature}_ss'] = ss.fit_transform(df_scaled[feature].values.reshape(-1, 1))
    df_scaled[f'{feature}_grs'] = grs.fit_transform(df_scaled[feature].values.reshape(-1, 1))

df_scaled = df_scaled.astype(np.float32)
df_scaled.head(10)

Different scalers map features into different ranges. Min-max scaled data lie between 0 and 1, while standard scaled and Gaussian rank scaled data are centered at zero. Zero-centered data perform better in algorithms like PCA, whereas min-max scaled data may perform better in neural networks.

The min-max scaler and the standard scaler are heavily affected by outliers, while the Gaussian rank scaler yields more robust results. Unlike the others, Gaussian rank scaled data have symmetrical tails.
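
The outlier sensitivity described above can be reproduced on a tiny array. The three helper functions below are hypothetical NumPy sketches; in particular, `rank_gauss_scale` maps ranks to standard normal quantiles and is only an approximation of the idea, not the implementation inside the `GaussRankScaler` package.

```python
import numpy as np
from statistics import NormalDist

def min_max_scale(x):
    return (x - x.min()) / (x.max() - x.min())

def standard_scale(x):
    return (x - x.mean()) / x.std()

def rank_gauss_scale(x):
    """Map ranks to standard normal quantiles (a sketch, not the GaussRankScaler code)."""
    ranks = np.argsort(np.argsort(x))     # 0 .. n-1, ties broken by position
    quantiles = (ranks + 0.5) / len(x)    # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])

# The last value is a deliberate outlier.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 50.0])

x_mms = min_max_scale(x)
x_ss = standard_scale(x)
x_grs = rank_gauss_scale(x)

print('Min max :', x_mms.round(3))
print('Standard:', x_ss.round(3))
print('Gaussian:', x_grs.round(3))
```

With min-max and standard scaling the eight regular values are squeezed into a narrow band by the single outlier, while the rank-based transform depends only on the ordering and therefore stays symmetric around zero.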

In [None]:
def distplot_scaled_features(feature):
        
    fig, axes = plt.subplots(ncols=4, figsize=(36, 5), dpi=100, constrained_layout=True)
    title_size = 30
    label_size = 25
    
    n_train = len(df_train)

    # Label-based .loc slices are inclusive, so positional .iloc slicing is used
    # to keep the first test row out of the training split
    for ax, suffix in zip(axes, ['', '_mms', '_ss', '_grs']):
        sns.distplot(df_scaled[f'{feature}{suffix}'].iloc[:n_train], label='Training', ax=ax, hist_kws={'alpha': 0.25})
        sns.distplot(df_scaled[f'{feature}{suffix}'].iloc[n_train:], label='Test', ax=ax, hist_kws={'alpha': 0.25})

    for i in range(4):

        axes[i].set_xlabel('')
        axes[i].tick_params(axis='x', labelsize=label_size)
        axes[i].tick_params(axis='y', labelsize=label_size)
        
        if i % 4 == 0:
            axes[i].legend(prop={'size': 25})
            
    axes[0].set_title(f'{feature} Raw', size=title_size, pad=title_size)
    axes[1].set_title(f'{feature} Min Max Scaled', size=title_size, pad=title_size)
    axes[2].set_title(f'{feature} Standard Scaled', size=title_size, pad=title_size)
    axes[3].set_title(f'{feature} Gaussian Rank Scaled', size=title_size, pad=title_size)
    
    plt.show()
    
for feature in scale_features:
    distplot_scaled_features(feature)

## **9. Dimensionality Reduction**

Two common dimensionality reduction techniques are PCA and auto-encoders. Those techniques are sensitive to scale, so it is important to standardize the data and make it unitless. For this purpose, cell viability and gene expression features are standardized with standard scaler. For evaluating information loss in different dimensionality reduction techniques, latent space dimensions are set to half of cell viability and gene expression dimensions.

In [None]:
df_cell_viability = pd.concat([df_train.loc[:, c_features], df_test.loc[:, c_features]], axis=0, ignore_index=True)
df_cell_viability.loc[:, :] = np.float32(StandardScaler().fit_transform(df_cell_viability))

df_gene_expression = pd.concat([df_train.loc[:, g_features], df_test.loc[:, g_features]], axis=0, ignore_index=True)
df_gene_expression.loc[:, :] = np.float32(StandardScaler().fit_transform(df_gene_expression))

CELL_N_DIMS = len(c_features) // 2
GENE_N_DIMS = len(g_features) // 2

### **9.1. PCA**

PCA is a linear transformation that projects the data into another space, where vectors of projections are defined by variance of the data. PCA results can be evaluated with reconstruction error and cumulative percent variance.

Ideally, latent variables should explain around 75% of the total variance. For cell viability features, the CPV curve resembles linear growth. The first 10 components contain approximately 89% of the variance, while 50 components are needed to explain close to 100% of the variance. For gene expression features, the CPV curve is closer to logarithmic growth. The first 10 components contain approximately 50% of the variance, while 400 components are needed to explain close to 100% of the variance. Optimal dimensions for cell viability and gene expression features can be found by evaluating the CPV curves.
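
The two evaluation measures are closely linked. For data standardized with the population standard deviation (as `StandardScaler` does), the reconstruction mean squared error with `k` components equals `1 - CPV(k)`. The sketch below checks this on synthetic data with an SVD-based PCA in NumPy; the data, the variable names and the choice `k = 2` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6 observed features driven by 2 latent factors plus noise.
latent = rng.normal(size=(1000, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(1000, 6))

# Standardize with the population standard deviation.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA through the singular value decomposition of the centered data.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
explained_ratio = S ** 2 / np.sum(S ** 2)
cpv = np.cumsum(explained_ratio)

k = 2
X_reduced = X @ Vt[:k].T          # project onto the first k components
X_restored = X_reduced @ Vt[:k]   # map back to the original feature space
mse = np.mean((X - X_restored) ** 2)

print(f'CPV with {k} components: {cpv[k - 1]:.4f} - Reconstruction MSE: {mse:.4f}')
```

Because of this identity, a target such as "around 75% of the total variance" corresponds to a reconstruction error of about 0.25 on standardized features.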

In [None]:
seed_everything(SEED)

fig, axes = plt.subplots(figsize=(32, 8), ncols=2)

pca = PCA(n_components=CELL_N_DIMS)
pca.fit(df_cell_viability)
pca_cell_viability_error = mean_squared_error(df_cell_viability, pca.inverse_transform(pca.transform(df_cell_viability)))
# explained_variance_ratio_ is already normalized by the total variance
cell_viability_explained_variance = np.sum(pca.explained_variance_ratio_) * 100
print(f'PCA Cell Viability Reconstruction Error {pca_cell_viability_error:.6} - Explained Variance: {cell_viability_explained_variance:.6}%')

axes[0].plot(np.cumsum(pca.explained_variance_ratio_))
axes[0].set_xlabel('Number of Components', size=20, labelpad=20)
axes[0].set_ylabel('CPV', size=20, labelpad=20)
axes[0].set_title('Cell Viability Features CPV Curve', size=22, pad=22)

pca = PCA(n_components=GENE_N_DIMS)
pca.fit(df_gene_expression)
pca_gene_expression_error = mean_squared_error(df_gene_expression, pca.inverse_transform(pca.transform(df_gene_expression)))
# explained_variance_ratio_ is already normalized by the total variance
gene_expression_explained_variance = np.sum(pca.explained_variance_ratio_) * 100
print(f'PCA Gene Expression Reconstruction Error {pca_gene_expression_error:.6} - Explained Variance: {gene_expression_explained_variance:.6}%')

axes[1].plot(np.cumsum(pca.explained_variance_ratio_))
axes[1].set_xlabel('Number of Components', size=20, labelpad=20)
axes[1].set_ylabel('CPV', size=20, labelpad=20)
axes[1].set_title('Gene Expression Features CPV Curve', size=22, pad=22)

for i in range(2):
    axes[i].tick_params(axis='x', labelsize=20)
    axes[i].tick_params(axis='y', labelsize=20)
    
plt.show()

### **9.2. Auto-encoder**

Auto-encoders are neural networks used for reducing data into a low dimensional latent space. The latent space has to retain the most important information because the decoder must reconstruct the input from it. Auto-encoders are slower and more computationally expensive than PCA, and they are also prone to overfitting.

For cell viability features, reconstruction error is close to PCA reconstruction error, but there are lots of spikes on learning curve. Features created in the cell viability latent space might be useful for supervised learning models.
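
The encode-then-decode idea can be sketched without a deep learning framework. The example below is a purely linear auto-encoder trained with plain gradient descent in NumPy on synthetic data. It is not the Keras model defined in this notebook: there are no activations, dropout or batch normalization, and every name and hyperparameter in it is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized data: 8 features generated from 3 latent factors.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.1 * rng.normal(size=(500, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_samples, n_inputs = X.shape
encoding_dim = n_inputs // 2   # half of the input dimension, as in the setup above

# Linear encoder and decoder weights with a small random initialization.
W_enc = 0.1 * rng.normal(size=(n_inputs, encoding_dim))
W_dec = 0.1 * rng.normal(size=(encoding_dim, n_inputs))

learning_rate = 0.05
losses = []
for epoch in range(500):
    encoded = X @ W_enc
    decoded = encoded @ W_dec
    residual = decoded - X
    losses.append(np.mean(residual ** 2))

    # Gradients of the mean squared error with respect to both weight matrices.
    grad_out = 2.0 * residual / residual.size
    grad_dec = encoded.T @ grad_out
    grad_enc = X.T @ (grad_out @ W_dec.T)

    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

print(f'Initial MSE: {losses[0]:.4f} - Final MSE: {losses[-1]:.4f}')
```

A linear auto-encoder like this one can at best match the PCA reconstruction error for the same latent dimension, which is why PCA is a natural baseline for the non-linear models.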

In [None]:
class CellViabilityAutoencoder(Model):
    
    def __init__(self, n_inputs, encoding_dim):
        
        super(CellViabilityAutoencoder, self).__init__()
        self.encoder = Sequential([
            BatchNormalization(),
            Dropout(0.1),
            Dense(2 ** 9, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(2 ** 8, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(encoding_dim, activation='swish'),
        ])
        self.decoder = Sequential([
            Dense(2 ** 8, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(2 ** 9, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(n_inputs)
        ])
        
    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

seed_everything(SEED)

cell_viability_ae = CellViabilityAutoencoder(len(c_features), CELL_N_DIMS)
cell_viability_ae.compile(optimizer=Adam(learning_rate=0.0005), loss='mse')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.00001, verbose=0)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=100, restore_best_weights=True, verbose=0)

history = cell_viability_ae.fit(df_cell_viability,
                                df_cell_viability,
                                batch_size=128,
                                verbose=0,
                                validation_data=(df_cell_viability, df_cell_viability),
                                epochs=100,
                                shuffle=True,
                                callbacks=[reduce_lr, early_stopping])

cell_viability_encoded = cell_viability_ae.encoder(df_cell_viability.values).numpy()
cell_viability_decoded = cell_viability_ae.decoder(cell_viability_encoded).numpy()
cell_viability_error = mean_squared_error(df_cell_viability.values, cell_viability_decoded)
print(f'Auto-encoder Cell Viability Reconstruction Error {cell_viability_error:.6}')

plt.figure(figsize=(32, 8))

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')

plt.xlabel('Epochs', size=20, labelpad=20)
plt.ylabel('Mean Squared Error', size=20, labelpad=20)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.legend(prop={'size': 20})
plt.title('Cell Viability Features Auto-encoder Loss', size=22, pad=22)

plt.show()

For gene expression features, reconstruction error is worse than PCA reconstruction error, but features created in the gene expression latent space might still be useful for supervised learning models.

In [None]:
class GeneExpressionAutoencoder(Model):
    
    def __init__(self, n_inputs, encoding_dim):
        
        super(GeneExpressionAutoencoder, self).__init__()
        self.encoder = Sequential([
            BatchNormalization(),
            Dropout(0.1),
            Dense(2 ** 9, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(2 ** 8, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(encoding_dim, activation='swish'),
        ])
        self.decoder = Sequential([
            Dense(2 ** 8, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(2 ** 9, activation='swish', activity_regularizer=L2(0.001)),
            BatchNormalization(),
            Dropout(0.1),
            Dense(n_inputs)
        ])
        
    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

seed_everything(SEED)

gene_expression_ae = GeneExpressionAutoencoder(len(g_features), GENE_N_DIMS)
gene_expression_ae.compile(optimizer=Adam(learning_rate=0.0005), loss='mse')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.00001, verbose=0)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0.00001, patience=100, restore_best_weights=True, verbose=0)

history = gene_expression_ae.fit(df_gene_expression,
                                 df_gene_expression,
                                 batch_size=128,
                                 verbose=0,
                                 validation_data=(df_gene_expression, df_gene_expression),
                                 epochs=100,
                                 shuffle=True,
                                 callbacks=[reduce_lr, early_stopping])

gene_expression_encoded = gene_expression_ae.encoder(df_gene_expression.values).numpy()
gene_expression_decoded = gene_expression_ae.decoder(gene_expression_encoded).numpy()
gene_expression_error = mean_squared_error(df_gene_expression.values, gene_expression_decoded)
print(f'Auto-encoder Gene Expression Reconstruction Error {gene_expression_error:.6}')

plt.figure(figsize=(32, 8))

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')

plt.xlabel('Epochs', size=20, labelpad=20)
plt.ylabel('Mean Squared Error', size=20, labelpad=20)
plt.tick_params(axis='x', labelsize=20)
plt.tick_params(axis='y', labelsize=20)
plt.legend(prop={'size': 20})
plt.title('Gene Expression Features Auto-encoder Loss', size=22, pad=22)

plt.show()

## **10. Drugs**

Training set drug ids have now been added to the competition dataset. `train_drug.csv` contains the anonymous drug id of every signature id in the training set. There are **3289** unique drugs and **23814** unique signatures. Since 3289 x 6 = 19734 is less than 23814, some drugs must be used more than 6 times (2 different doses x 3 different durations). This data can be useful for cross-validation and outlier detection.
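
One way drug ids might be used for cross-validation is to keep all signatures of a drug in the same fold, so that a drug never appears on both sides of a split. The function `assign_drug_folds` below is a hypothetical greedy sketch on toy data, not a scheme taken from this notebook.

```python
from collections import Counter

def assign_drug_folds(drug_ids, n_folds):
    """Greedy grouping: each drug goes, largest first, to the currently smallest fold."""
    counts = Counter(drug_ids)
    fold_sizes = [0] * n_folds
    drug_to_fold = {}
    for drug, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        fold = fold_sizes.index(min(fold_sizes))
        drug_to_fold[drug] = fold
        fold_sizes[fold] += count
    return [drug_to_fold[drug] for drug in drug_ids]

# Hypothetical drug ids, one per signature.
drug_ids = ['a'] * 6 + ['b'] * 6 + ['c'] * 7 + ['d'] * 1 + ['e'] * 12
folds = assign_drug_folds(drug_ids, n_folds=2)

print(Counter(folds))
```

Assigning the largest drugs first keeps fold sizes roughly balanced even when a few drugs have far more signatures than the usual 6.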

In [None]:
df_train_drug = pd.read_csv('../input/lish-moa/train_drug.csv')
print(f'Training Drugs Shape = {df_train_drug.shape} - Unique Drugs {df_train_drug["drug_id"].nunique()}')
print(f'Training Drugs Memory Usage = {df_train_drug.memory_usage().sum() / 1024 ** 2:.2f} MB')

df_train = df_train.merge(df_train_drug, how='left', on='sig_id')
cols = list(df_train.columns)
cols = [cols[-1]] + cols[:-1]
df_train = df_train[cols]

df_train_drug.head(10)

Drug frequencies were expected to be diverse but somewhat balanced. That appears to be the case, although there are many unique drugs. The most common drug is `cacb2b860`, with 1866 occurrences. The next most common drugs are `87d714366`, `9f80f3f77`, `8b87a7a83`, `5628cb3ee`, `d08af5d4b`, `292ab2c28`, `d50f18348`, and `d1b47f29d`, in this order.

In [None]:
df_drug_frequencies = df_train['drug_id'].value_counts()

fig = plt.figure(figsize=(15, 75))

sns.barplot(x=df_drug_frequencies.head(200).values,
            y=df_drug_frequencies.head(200).index)

plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.xlabel('')
plt.ylabel('')
plt.title('Training Set Drug Frequencies', size=22, pad=22)

plt.show()

del df_drug_frequencies

There are very few distinct occurrence counts. The most common count is 6 because there are 2 different doses and 3 different durations to begin with; 2774 out of 3289 unique drugs are in this group. 196 unique drugs are seen 7 times and 129 drugs are seen only once.

The drug with the highest number of occurrences is `cacb2b860` (1866), and it is seen only in the control group (`ctl_vehicle`), which explains its unusual count. This drug can be omitted from cross-validation schemes. The other large counts are also very unusual, and those drugs are not in the control group.
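
The "count of counts" idea behind these numbers can be shown with the standard library alone. The drug ids below are hypothetical toy values; the first `Counter` plays the role of the first `value_counts()` call and the second `Counter` plays the role of the second one.

```python
from collections import Counter

# Hypothetical drug ids, one per signature (2 doses x 3 durations = 6 per full profile).
drug_ids = ['a'] * 6 + ['b'] * 6 + ['c'] * 7 + ['d'] * 1 + ['ctl'] * 20

signatures_per_drug = Counter(drug_ids)                   # drug -> number of signatures
drugs_per_count = Counter(signatures_per_drug.values())   # count -> number of drugs

print(drugs_per_count.most_common())
```

Summing `count * number_of_drugs` over the second counter recovers the total number of signatures, which is a quick sanity check for this kind of table.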

In [None]:
df_drug_occurences = df_train['drug_id'].value_counts().value_counts()
df_drug_occurences.sort_values(ascending=False, inplace=True)

fig = plt.figure(figsize=(32, 6))

sns.barplot(x=df_drug_occurences.index,
            y=df_drug_occurences.values,
            order=df_drug_occurences.index)

plt.tick_params(axis='x', labelsize=18)
plt.tick_params(axis='y', labelsize=18)
plt.xlabel('')
plt.ylabel('')
plt.title('Training Set Drug Occurrence Frequencies', size=22, pad=22)

plt.show()

del df_drug_occurences

Duration counts are consistent between different drugs, just like they were between different targets. The same behavior is expected in the test set because some level of stratification is applied while splitting the training and test sets.

In [None]:
df_drug_counts_by_cp_time = pd.DataFrame(df_train.groupby(['drug_id', 'cp_time'])['drug_id'].count())
df_drug_counts_by_cp_time.rename(columns={'drug_id': 'count'},inplace=True)
df_drug_counts_by_cp_time.reset_index(inplace=True)
df_drug_counts_by_cp_time['count_max'] = df_drug_counts_by_cp_time.groupby('drug_id')['count'].transform('max')
df_drug_counts_by_cp_time.sort_values(by=['count_max', 'drug_id'], ascending=False, inplace=True)

fig = plt.figure(figsize=(15, 75), dpi=100)

sns.barplot(x=df_drug_counts_by_cp_time['count'].values[:300],
            y=df_drug_counts_by_cp_time['drug_id'].values[:300],
            hue=df_drug_counts_by_cp_time['cp_time'].values[:300])

plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0, prop={'size': 20})
plt.title('Training Set cp_time Distribution in Drugs', size=22, pad=22)

plt.show()

del df_drug_counts_by_cp_time

Dose counts are also consistent between different drugs, just like they were between different targets. The same behavior is expected in the test set because some level of stratification is applied while splitting the training and test sets.

In [None]:
df_drug_counts_by_cp_dose = pd.DataFrame(df_train.groupby(['drug_id', 'cp_dose'])['drug_id'].count())
df_drug_counts_by_cp_dose.rename(columns={'drug_id': 'count'},inplace=True)
df_drug_counts_by_cp_dose.reset_index(inplace=True)
df_drug_counts_by_cp_dose['count_max'] = df_drug_counts_by_cp_dose.groupby('drug_id')['count'].transform('max')
df_drug_counts_by_cp_dose.sort_values(by=['count_max', 'drug_id'], ascending=False, inplace=True)

fig = plt.figure(figsize=(15, 75), dpi=100)

sns.barplot(x=df_drug_counts_by_cp_dose['count'].values[:300],
            y=df_drug_counts_by_cp_dose['drug_id'].values[:300],
            hue=df_drug_counts_by_cp_dose['cp_dose'].values[:300])

plt.xlabel('')
plt.ylabel('')
plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=15)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0, prop={'size': 20})
plt.title('Training Set cp_dose Distribution in Drugs', size=22, pad=22)

plt.show()

del df_drug_counts_by_cp_dose

## **11. Signatures**

Finally, individual samples (signatures) should be analyzed to find feature engineering ideas and detect outliers. Even though gene expression and cell viability features are independent, each signature can be visualized as a sequence. For this analysis, 14 random signatures are selected. 7 of them are classified to 0, 1, 2, 3, 4, 5 or 7 scored targets, and the other 7 are classified to the same numbers of not scored targets.

In [None]:
def plot_signature(sig_id):
    
    if sig_id in df_train['sig_id'].values:
        g = df_train.loc[df_train['sig_id'] == sig_id, g_features].T
        c = df_train.loc[df_train['sig_id'] == sig_id, c_features].T
        t_scored = df_train.loc[df_train['sig_id'] == sig_id, target_features_scored].T
        t_notscored = df_train.loc[df_train['sig_id'] == sig_id, target_features_nonscored].T
        
        print(f'{sig_id}\n{"-" * (len(sig_id))}')
        print(f'Gene Expression Mean: {np.mean(g.values):.4} - Median: {np.median(g.values):.4} - Std: {np.std(g.values):.4} - Min: {np.min(g.values):.4} - Max: {np.max(g.values):.4} - Skew {skew(g.values)[0]:.4} - Kurt {kurtosis(g.values)[0]:.4} - Var {np.var(g.values):.4}')
        print(f'Cell Viability Mean: {np.mean(c.values):.4} - Median: {np.median(c.values):.4} - Std: {np.std(c.values):.4} - Min: {np.min(c.values):.4} - Max: {np.max(c.values):.4} - Skew {skew(c.values)[0]:.4} - Kurt {kurtosis(c.values)[0]:.4} - Var {np.var(c.values):.4}')
        print(f'Target Features Scored Mean {np.mean(t_scored.values):.4} - Target Features Not Scored Mean {np.mean(t_notscored.values):.4}')

        fig, axes = plt.subplots(nrows=2, figsize=(32, 12), dpi=100)
        plt.subplots_adjust(left=None, bottom=None, right=None, top=1.50, wspace=None, hspace=None)
        
        axes[0].plot(g, label='Gene Expression')
        axes[0].plot(c, label='Cell Viability')
        axes[1].plot(t_scored, label='Target Features Scored')
        axes[1].plot(t_notscored, label='Target Features Not Scored')
        
        axes[0].tick_params(axis='x', labelsize=20, rotation=0, pad=15, length=0)
        axes[1].tick_params(axis='x', labelsize=20, rotation=90, pad=15, length=0)
        for i in range(2):            
            axes[i].tick_params(axis='y', labelsize=20, rotation=0, pad=15, length=0)
            axes[i].legend(prop={'size': 20})

        for idx, label in enumerate(axes[0].get_xticklabels()):
            if idx % 50 == 0:
                label.set_visible(True)
            else:
                label.set_visible(False)
                
        for idx, label in enumerate(axes[1].get_xticklabels()):
            if df_train.loc[df_train['sig_id'] == sig_id, (target_features_scored + target_features_nonscored)[idx]].values[0] == 1:
                label.set_visible(True)
            else:
                label.set_visible(False)
                
        axes[0].set_title(f'Signature {sig_id} Gene Expression and Cell Viability Sequence', size=25, pad=15)
        axes[1].set_title(f'Signature {sig_id} Target Features Sequence', size=25, pad=15)
    else:
        g = df_test.loc[df_test['sig_id'] == sig_id, g_features].T
        c = df_test.loc[df_test['sig_id'] == sig_id, c_features].T
        
        fig = plt.figure(figsize=(32, 8), dpi=100)
        
        plt.plot(g, label='Gene Expression')
        plt.plot(c, label='Cell Viability')
            
    plt.show()


### **11.1. Signatures with 7 Targets**

There are 6 unique signatures that are classified to **7** scored targets. All of them belong to drug `91dc8bab8`. Those 6 signatures are classified to 3 not scored targets. The most unusual thing about this drug is that cell viability and gene expression values are on completely different scales. Gene expression features have extremely high standard deviation and variance. Those two statistical measures can be used for outlier detection because ideally cell viability and gene expression values should be on the same scale.

In [None]:
plot_signature('id_344ef5019')

There are 6 unique signatures that are classified to **7** not scored targets. All of them belong to drug `838575b1d`. Those 6 signatures are classified to 4 scored targets. Unlike the previous signature, this one doesn't have any anomalies.

In [None]:
plot_signature('id_063e8e7d6')

### **11.2. Signatures with 5 Targets**

There are 13 unique signatures that are classified to **5** scored targets, and those targets form very similar groups which are mainly inhibitors. The signatures belong to drugs `3f9dd627f`, `228f08c3d`, and `809527b9d`, and this signature belongs to drug `3f9dd627f`. Those 13 signatures are classified to 1 not scored target. Apart from that, there is nothing unusual about the sequence.

In [None]:
plot_signature('id_8c4b726ff')

There are 6 unique signatures that are classified to **5** not scored targets, and it is a different target group compared to previous signature. All of them belong to drug `91dc8bab8`, and those 6 signatures are classified to 2 scored targets.

In [None]:
plot_signature('id_0c830d384')

### **11.3. Signatures with 4 Targets**

There are 55 unique signatures that are classified to **4** scored targets. They belong to 8 unique drugs, and this signature belongs to drug `3cda750b5`. There is no clear anomaly in this signature, and it is very clean.

In [None]:
plot_signature('id_003603254')

There are 13 unique signatures that are classified to **4** not scored targets. They belong to 3 unique drugs, and this signature belongs to drug `7cf4a548c`. There is no clear anomaly in this signature, and it is very clean.

In [None]:
plot_signature('id_0da23653d')

### **11.4. Signatures with 3 Targets**

There are 303 unique signatures that are classified to **3** scored targets. They belong to 21 unique drugs, and this signature belongs to drug `18bb41b2c`. There is no clear anomaly in this signature, but cell viability variance and standard deviation are lower than gene expression variance and standard deviation.

In [None]:
plot_signature('id_000a6266a')

There are 36 unique signatures that are classified to **3** not scored targets. They belong to 5 unique drugs, and this signature belongs to drug `fe8625cad`. There is no clear anomaly in this signature, but cell viability variance and standard deviation are lower than gene expression variance and standard deviation, just like in the previous signature.

In [None]:
plot_signature('id_163bd7f3e')

### **11.5. Signatures with 2 Targets**

There are 1538 unique signatures that are classified to **2** scored targets. They belong to 139 unique drugs, and this signature belongs to drug `b877aa048`. There is no clear anomaly in this signature, and it is very clean.

In [None]:
plot_signature('id_0024bcd70')

There are 247 unique signatures that are classified to **2** not scored targets. They belong to 41 unique drugs, and this signature belongs to drug `746ca1f5a`. Cell viability/gene expression variance and standard deviation mismatch exists in this signature as well, and the gap is huge.

In [None]:
plot_signature('id_001bd861f')

### **11.6. Signatures with 1 Target**

1 target signatures are the most common group for scored targets in training set. There are 12532 unique signatures that are classified to **1** scored target. They belong to 1904 unique drugs, and this signature belongs to drug `b68db1d53`. There is no clear anomaly in this signature, and it is very clean.

In [None]:
plot_signature('id_001626bd3')

There are 4282 unique signatures that are classified to **1** not scored target. They belong to 704 unique drugs, and this signature belongs to drug `e0cd5c091`. In this case, cell viability values have higher standard deviation and variance, and they are on different scales.

In [None]:
plot_signature('id_002d31e2c')

### **11.7. Signatures with 0 Targets**

There are 9367 unique signatures that are classified to **0** scored targets. They belong to 1213 unique drugs, and this signature belongs to drug `df89a8e5a`. There is no clear anomaly in this signature, but cell viability variance and standard deviation are lower than gene expression variance and standard deviation.

In [None]:
plot_signature('id_000779bfc')

0 target signatures are the most common group for not scored targets in the training set. There are 19224 unique signatures that are classified to **0** not scored targets. They belong to 2534 unique drugs, and this signature belongs to drug `b68db1d53`. There is no clear anomaly in this signature, but cell viability variance and standard deviation are lower than gene expression variance and standard deviation, just like in the previous signature.

In [None]:
plot_signature('id_000644bb2')

To sum up, statistical features calculated on signature sequences may yield hidden information about targets. There might be connections between cell viability/gene expression difference and number of classified targets. Those types of engineered features are definitely worth trying.
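
As a starting point, such features could be computed row-wise over each signature. The sketch below is an assumption-laden illustration: the arrays `g_values` and `c_values` are random stand-ins with arbitrary shapes, `signature_statistics` is a hypothetical helper, and the skewness and kurtosis formulas mirror the default conventions of `scipy.stats.skew` and `scipy.stats.kurtosis` (population moments, excess kurtosis).

```python
import numpy as np

def signature_statistics(values):
    """Row-wise summary statistics; skewness and excess kurtosis use population moments."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=1)
    std = values.std(axis=1)
    centered = values - mean[:, None]
    m2 = (centered ** 2).mean(axis=1)
    m3 = (centered ** 3).mean(axis=1)
    m4 = (centered ** 4).mean(axis=1)
    return {
        'mean': mean,
        'std': std,
        'skew': m3 / m2 ** 1.5,
        'kurt': m4 / m2 ** 2 - 3.0,
    }

rng = np.random.default_rng(0)

# Hypothetical stand-ins for gene expression and cell viability values of 5 signatures.
g_values = rng.normal(0.0, 1.0, size=(5, 200))
c_values = rng.normal(0.0, 0.5, size=(5, 100))

g_stats = signature_statistics(g_values)
c_stats = signature_statistics(c_values)

# Difference between the two groups, one value per signature.
std_gap = g_stats['std'] - c_stats['std']
print(std_gap.round(3))
```

A per-signature gap like `std_gap` is one possible encoding of the cell viability/gene expression mismatch noted in several of the signatures above.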