### <center> Mechanisms of Action (MoA) Prediction </center>

### <center> Exploratory Data Analysis </center>

The aim of the challenge is described on the competition page, so I will cut to the chase.

We are given 2 dataframes for the training set: a feature set and a target set. Let's explore the data in a Q&A style and check some basic statistics along the way.

In [None]:
import os 
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import seaborn as sns

In [None]:
BASE_DIR = '../input/lish-moa/'
train_features = pd.read_csv(BASE_DIR + 'train_features.csv')
train_targets_scored = pd.read_csv(BASE_DIR + 'train_targets_scored.csv')
train_targets_nonscored = pd.read_csv(BASE_DIR + 'train_targets_nonscored.csv')

test_features = pd.read_csv(BASE_DIR + 'test_features.csv')
sample_submission = pd.read_csv(BASE_DIR + 'sample_submission.csv')

# TRAIN FEATURES
INDEX = 'sig_id'
g_cols = [col for col in train_features.columns if col.startswith('g-')]
c_cols = [col for col in train_features.columns if col.startswith('c-')]

other_cols = ['cp_type', 'cp_time', 'cp_dose']  # Categoricals

### 1. Features

* **`cp_type (categorical):`**  Samples treated with a compound or with a control perturbation. Categories include "trt_cp" and "ctl_vehicle", respectively.

* **`cp_time (categorical):`** Treatment duration in hours. Categories include 24, 48, 72.

* **`cp_dose (categorical):`** Drug dose. Categories are "D1" and "D2" for low and high dose, respectively.

* **`g-[0-771] (continuous):`** Gene expression data - a measure of activation in a given gene after the drug is applied.

* **`c-[0-99] (continuous):`** Cell viability. Basically, a count of live cells after the drug is applied.

In [None]:
train_features.head()

In [None]:
print("Q: Does the features dataframe have any null entries?")
if not train_features.isnull().values.any():
    print('A: Nope, none!')

In [None]:
print('A full list of categorical features')
print('----'*10)
for col in other_cols:
    print('Number of unique values in "%s": %d' % (col, train_features[col].nunique()))
    print('Values: ', train_features[col].unique())
    print('')

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(10, 3))
sns.countplot(data=train_features, x=other_cols[0], ax=ax[0])
sns.countplot(data=train_features, x=other_cols[1], ax=ax[1])
sns.countplot(data=train_features, x=other_cols[2], ax=ax[2])
plt.tight_layout()
fig.suptitle('Distribution of the categorical variables', y=1.1)
plt.show()

Only a small portion of the data belongs to the **ctl_vehicle** group. These samples form the control group, meaning that no drug is applied, so their MoA labels should all be 0. We will check this in the Targets section.

The treatment times and doses are evenly distributed across their categories.
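The claim that control samples carry no MoA labels can be checked directly against the targets. Below is a minimal sketch of such a check. The helper name `control_rows_have_no_moa` and the toy frames are hypothetical; they only mimic the competition schema, and the helper assumes `sig_id` is still a regular column in both frames.

```python
import pandas as pd

def control_rows_have_no_moa(features, targets, id_col="sig_id"):
    """Return True if every ctl_vehicle row has all-zero target labels."""
    # Ids of the control samples in the feature set
    control_ids = features.loc[features["cp_type"] == "ctl_vehicle", id_col]
    # Target rows that belong to those control samples
    control_targets = targets.set_index(id_col).loc[control_ids]
    # A control sample is clean when its labels sum to zero
    return bool((control_targets.sum(axis=1) == 0).all())

# Tiny made-up frames that mimic the competition schema (not real data)
toy_features = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
})
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "moa_1": [1, 0, 0],
    "moa_2": [0, 0, 1],
})

result = control_rows_have_no_moa(toy_features, toy_targets)
print(result)
```

On the real data, the same call would presumably be `control_rows_have_no_moa(train_features, train_targets_scored)`, run before `sig_id` is moved into the index.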

In [None]:
print('Q: Does the features dataframe have any duplicated rows?')
if not train_features.duplicated().values.any():
    print('A: Nope!')

In [None]:
print('Q: Are there any duplicated rows associated with different sig_ids?')
if not train_features.drop(columns=INDEX).duplicated().values.any():
    print('A: Nope!')

In [None]:
print('Feature set includes a series of c- and g- columns.')
print('Number of c_cols: %d' % (len(c_cols)))
print('Number of g_cols: %d' % (len(g_cols)))

#### c- columns

The plots below show that the c- columns follow a Gaussian-like distribution with a low-end tail. Moreover, the values appear to be quantile-scaled, as seen from their zero means and tails around +/- 2.5.

**These tails are important: if a drug is responsible for an MoA, one or more of the c- measurements diverge from the normal distribution. Our models will therefore try to learn the critical values in these distributions where the tails start.**
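The visual impression of zero means and a low-end tail can also be expressed in numbers. The sketch below summarizes each column with its mean, standard deviation, range and skewness; a negative skew points to a low-end tail. The helper `summarize_columns` and the toy data are assumptions made for illustration, not part of the competition data.

```python
import numpy as np
import pandas as pd

def summarize_columns(df, cols):
    """Per-column mean, std, min, max and skewness, one row per column."""
    return df[cols].agg(["mean", "std", "min", "max", "skew"]).T

# Synthetic Gaussian columns standing in for c- features (not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["c-0", "c-1", "c-2"])
toy.iloc[:20, 0] = -10.0  # inject an artificial low-end tail into c-0

summary = summarize_columns(toy, ["c-0", "c-1", "c-2"])
print(summary.round(2))
```

On the real data, `summarize_columns(train_features, c_cols)` would give one row per c- column.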

In [None]:
def display_distributions(cols):
    """Plot a histogram for each of up to 25 columns on a 5x5 grid."""
    fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(12, 10))
    for i, col in enumerate(cols):
        # histplot replaces the deprecated distplot (requires seaborn >= 0.11)
        sns.histplot(train_features[col], ax=axs[i // 5, i % 5])
    plt.tight_layout()
    plt.show()

display_distributions(c_cols[:25])
display_distributions(c_cols[25:50])
display_distributions(c_cols[50:75])
display_distributions(c_cols[75:100])

#### g- columns

It seems that the g- columns are also quantile-scaled, similar to the c- columns. Unlike the c- columns, the g- columns exhibit both low-end and high-end tails. **Again, the information in these tails is the signature of the MoAs.**
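One way to compare the two tails is to measure how much of each column lies beyond a symmetric cutoff. This is only a sketch: the helper `tail_fractions` is hypothetical, the threshold of 2.0 is an arbitrary assumption rather than a value taken from the data, and the frame below is made up.

```python
import pandas as pd

def tail_fractions(df, cols, threshold=2.0):
    """Fraction of values below -threshold and above +threshold per column."""
    values = df[cols]
    return pd.DataFrame({
        "low_tail": (values < -threshold).mean(),
        "high_tail": (values > threshold).mean(),
    })

# Made-up values: g-0 has one extreme value on each side, g-1 has none
toy = pd.DataFrame({
    "g-0": [-5.0, -0.3, 0.1, 0.4, 6.0],
    "g-1": [-0.5, -0.2, 0.0, 0.3, 0.7],
})

fractions = tail_fractions(toy, ["g-0", "g-1"], threshold=2.0)
print(fractions)
```

Columns with a two-sided tail show non-zero entries in both `low_tail` and `high_tail`, while a one-sided tail shows up in only one of them.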



In [None]:
display_distributions(g_cols[:25])
display_distributions(g_cols[25:50])
display_distributions(g_cols[50:75])
display_distributions(g_cols[75:100])

### 2. Targets

The targets dataset has 206 binary target columns, one per MoA, with one row for each sample in the features set. Each drug can activate more than one MoA at the same time, so this competition poses a **multilabel** prediction problem, not a multiclass one.
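The multilabel nature of the targets can be made concrete by counting how many labels are active per sample. The following is a minimal sketch on a made-up frame; `labels_per_sample` is a hypothetical helper and assumes `sig_id` is still a regular column.

```python
import pandas as pd

def labels_per_sample(targets, id_col="sig_id"):
    """Distribution of the number of active labels per row."""
    label_counts = targets.drop(columns=[id_col]).sum(axis=1)
    return label_counts.value_counts().sort_index()

# Made-up targets: one row with no label, two with one label, one with two
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "moa_1": [0, 1, 1, 0],
    "moa_2": [0, 0, 1, 1],
})

distribution = labels_per_sample(toy_targets)
print(distribution)
```

Any row count at 2 or above confirms that a single sample can carry several MoAs at once, which a multiclass setup could not represent.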

In [None]:
train_targets_scored.head()

In [None]:
target_cols = train_targets_scored.columns[1:]  # 1 removes the sig_id
print('Number of target columns: %d' % (len(target_cols)))
print('We are predicting 206 columns for each sig_id')

In [None]:
# Check each column individually: an average of 2 unique values could hide
# a mix of columns with 1 and 3 unique values.
n_unique = train_targets_scored[target_cols].nunique()

if (n_unique == 2).all():
    print('All of the target columns are binary.')

In [None]:
train_targets_scored.set_index('sig_id', inplace=True)

### Class Distribution

The least frequent class is observed only once, while the most frequent class is observed 832 times, so the train targets are highly imbalanced.
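Raw counts are easier to compare across datasets when expressed as prevalence, that is, the fraction of samples that are positive for each class. A short sketch, assuming a frame that holds only the binary label columns (as `train_targets_scored` does once `sig_id` is the index); `class_prevalence` and the toy frame are hypothetical.

```python
import pandas as pd

def class_prevalence(label_frame):
    """Fraction of rows labelled positive for each class, rarest first."""
    return label_frame.mean(axis=0).sort_values()

# Made-up labels: moa_a is rare, moa_b is common
toy_labels = pd.DataFrame({
    "moa_a": [1, 0, 0, 0],
    "moa_b": [1, 1, 0, 1],
})

prevalence = class_prevalence(toy_labels)
print(prevalence)
```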

In [None]:
target_freq = train_targets_scored.sum(axis=0).to_frame('Counts').sort_values('Counts').reset_index()

fig, ax = plt.subplots(1, 2, figsize=(8, 4))
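# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=target_freq.index, y=target_freq.Counts, ax=ax[0]).set(xticklabels=[])
ax[0].set_xlabel('Classes')

# Draw explicitly on the second axes; histplot replaces the deprecated distplot
sns.histplot(target_freq.Counts, ax=ax[1])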
ax[1].set_ylabel('Counts of Counts')
fig.suptitle('Class Distribution', y=1.1)
plt.tight_layout()
plt.show()

Class names are omitted to avoid cluttering the plot.

### The Least Frequent Classes

There are 2 least frequent classes, both observed only once. The next least frequent classes are observed 6 times. This means that a model trained with 5-fold CV will likely not learn any of these classes unless they are augmented.
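The problem with a class that has a single positive sample can be illustrated with a plain shuffled 5-fold split. The sketch below uses only NumPy and a toy label vector; `positives_per_fold` is a hypothetical helper, and the split is a simple stand-in, not necessarily the CV scheme one would use in the competition.

```python
import numpy as np

def positives_per_fold(labels, n_splits=5, seed=0):
    """Count positive samples landing in each fold of a shuffled K-fold split."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(labels))
    folds = np.array_split(shuffled, n_splits)
    return [int(labels[fold].sum()) for fold in folds]

# Toy label vector: 100 samples with exactly one positive (not real data)
rare = np.zeros(100, dtype=int)
rare[17] = 1

counts = positives_per_fold(rare)
# Positives left for training when each fold in turn is held out for validation
train_positives = [sum(counts) - c for c in counts]
print(counts, train_positives)
```

Whenever the fold holding the only positive is used for validation, the training folds contain no positive example of that class at all, so the model for that split cannot learn it.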

In [None]:
target_freq[:5]

In [None]:
target_freq[-5:]

# TO BE CONTINUED...