### Used https://www.kaggle.com/headsortails/explorations-of-action-moa-eda as a reference.

This is a multi-label classification competition evaluated with log loss. The goal is to classify drugs by their biological activity, that is, their mechanism of action (MoA). The dataset measures the response of human cells to the drugs in a pool of 100 different cell types. Each response pattern is then assigned to the MoA classes that may apply to it.
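To make the evaluation metric concrete, here is a minimal sketch of a log loss averaged over every sample and every target column. The function name `mean_columnwise_log_loss`, the clipping constant `eps` and the toy arrays are assumptions made for illustration; they are not taken from the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, label) cell."""
    y_true = np.asarray(y_true, dtype=float)
    # clip predictions away from 0 and 1 so the logarithm stays finite
    # (the value of eps is an assumption)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    cell_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return cell_loss.mean()

# toy example with made-up values: 3 samples, 2 labels
y_true = np.array([[1, 0], [0, 0], [0, 1]])
y_pred = np.full(y_true.shape, 0.5)  # an uninformative prediction everywhere
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))  # 0.6931, i.e. ln(2)
```

Predicting 0.5 everywhere gives ln(2) regardless of the labels, which is a handy sanity check for any implementation of this metric.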

The data comes as train and test files. The training data is split into two files: the predictors (train_features.csv) and the targets (train_targets_scored.csv). Each row of these files corresponds to a specific treatment.
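Because predictors and targets live in separate files, they can be lined up row by row through the shared `sig_id` key. The sketch below uses two tiny hypothetical frames in place of the real CSV files; the column `moa_x` and all values are made up.

```python
import pandas as pd

# hypothetical miniature stand-ins for train_features.csv and train_targets_scored.csv
features = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "cp_time": [24, 48, 72],
    "g-0": [0.1, -0.3, 1.2],
})
targets = pd.DataFrame({
    "sig_id": ["id_c", "id_a", "id_b"],  # deliberately in a different order
    "moa_x": [1, 0, 0],
})

# validate="one_to_one" raises if a sig_id is duplicated on either side
merged = features.merge(targets, on="sig_id", how="inner", validate="one_to_one")
print(merged)
```

Merging on the key rather than relying on row order guards against the two files being sorted differently.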

In [None]:
# import the required libraries
import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
from sklearn import preprocessing

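# the 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6
style.use('seaborn-v0_8' if 'seaborn-v0_8' in style.available else 'seaborn')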

%matplotlib inline

In [None]:
DATA_DIR = "/kaggle/input/lish-moa"

In [None]:
train_df = pd.read_csv(os.path.join(DATA_DIR, "train_features.csv"))
train_df['FROM'] = "train"
train_targets_df = pd.read_csv(os.path.join(DATA_DIR, "train_targets_scored.csv"))
test_df = pd.read_csv(os.path.join(DATA_DIR, "test_features.csv"))
test_df['FROM'] = 'test'
# combine train_df and test_df so that we can make some combined visualizations
train_test_df = pd.concat([train_df, test_df])
sample_sub_df = pd.read_csv(os.path.join(DATA_DIR, "sample_submission.csv"))

In [None]:
print(train_df.shape)
train_df.head()

In [None]:
test_df.head()

## Train/Test Features

In [None]:
# subtract 2 to exclude the 'sig_id' key and the helper 'FROM' column added above
f"There are {len(train_df.columns)-2} features, of which {sum(s.startswith('g-') for s in train_df)} start with 'g-' and encode the gene expression data, and {sum(s.startswith('c-') for s in train_df)} start with 'c-' and encode the cell viability data."

The remaining 3 are the "cp_" features:
- "cp_type" indicates the sample treatment: 'trt_cp' (treated with the compound) or 'ctl_vehicle' (the control, which has no MoAs).
- "cp_time" indicates the treatment duration, which can be 24, 48 or 72 hours.
- "cp_dose" indicates the dosage, which can be high or low (D1 or D2).

The "sig_id" column is just the unique primary key of the sample.
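Since the three "cp_" columns are categorical, they usually need a numeric encoding before modelling. The following is a minimal sketch on a made-up frame. The encoding scheme, the `is_control` column name and the control label `ctl_vehicle` are assumptions for illustration, not steps taken elsewhere in this notebook.

```python
import pandas as pd

# made-up rows that mimic the layout of the "cp_" columns
toy = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp", "trt_cp"],
    "cp_time": [24, 48, 72, 48],
    "cp_dose": ["D1", "D2", "D2", "D1"],
})

encoded = toy.copy()
# flag control samples (assumed label: 'ctl_vehicle')
encoded["is_control"] = (encoded["cp_type"] == "ctl_vehicle").astype(int)
# map the two dose levels to 0/1
encoded["cp_dose"] = encoded["cp_dose"].map({"D1": 0, "D2": 1})
# express the duration in days: 24/48/72 hours -> 1/2/3
encoded["cp_time"] = encoded["cp_time"] // 24
encoded = encoded.drop(columns=["cp_type"])
print(encoded)
```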

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 6))

sns.countplot(x='cp_type', hue='FROM', data=train_test_df, ax=ax1).set_title("Sample Treatment")
sns.countplot(x='cp_dose', hue='FROM', data=train_test_df, ax=ax2).set_title("Treatment Dose")
sns.countplot(x='cp_time', hue='FROM', data=train_test_df, ax=ax3).set_title("Treatment Duration")

In [None]:
print(train_targets_df.shape)
train_targets_df.head()

In [None]:
f"There are {'no' if not train_df.isnull().values.any() else train_df.isna().sum()} null values in the train_df and {'no' if not train_targets_df.isnull().values.any() else train_targets_df.isna().sum()} null values in the train_targets_df. Similarly, there are {'no' if not test_df.isnull().values.any() else test_df.isna().sum()} null values in the test_df."

In [None]:
plt.figure(figsize=(15,15))
plt.suptitle("Distributions of Gene Expression Features")
# distributions of the gene expressions
for column in [s for s in train_df.columns if s.startswith('g-')]:
    sns.kdeplot(train_df[column], legend=False)

Most of the g- features look approximately normally distributed, with a few exceptions.
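One way to put a number on "approximately normal" is to compute the skewness and excess kurtosis of each column and flag the outliers. The sketch below runs on synthetic data; the column values and the cut-off of 1.0 are arbitrary assumptions chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-in for the g- columns
toy = pd.DataFrame({
    "g-0": rng.normal(size=5000),
    "g-1": rng.normal(size=5000),
    "g-2": rng.exponential(size=5000),  # deliberately skewed
})

g_cols = [c for c in toy.columns if c.startswith("g-")]
shape_stats = pd.DataFrame({
    "skew": toy[g_cols].skew(),
    "excess_kurtosis": toy[g_cols].kurt(),  # 0 for a normal distribution
})

# the threshold of 1.0 is an arbitrary illustrative cut-off
non_normal = shape_stats[
    (shape_stats["skew"].abs() > 1) | (shape_stats["excess_kurtosis"].abs() > 1)
]
print(non_normal)
```

On the real data the same pattern would apply to `train_df` with its own list of g- columns.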

In [None]:
plt.figure(figsize=(15,15))
plt.suptitle("Distributions of all Cell Viability Features")
# distributions of the cell viability features
for column in [s for s in train_df.columns if s.startswith('c-')]:
    sns.kdeplot(train_df[column], legend=False)

The distributions look roughly normal, but there is a bump near -10.0, which suggests that some of the c- features are bimodal.
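To check how much mass sits in that bump, the share of values near -10 can be computed per column. This is a sketch on synthetic data. The threshold of -9.5 and the proportions are assumptions for illustration, not measurements from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-in: mostly normal values plus a spike at -10
toy = pd.DataFrame({
    "c-0": np.concatenate([rng.normal(size=950), np.full(50, -10.0)]),
    "c-1": np.concatenate([rng.normal(size=900), np.full(100, -10.0)]),
})

c_cols = [c for c in toy.columns if c.startswith("c-")]
# fraction of samples per column that fall in the low-end bump
# (the -9.5 threshold is an assumption)
floor_share = (toy[c_cols] <= -9.5).mean()
print(floor_share)
```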

## Meta Statistics for Gene Distribution & Cell Viability

In [None]:
# meta statistics for gene expression
g_stats = train_df[[col for col in train_df if col.startswith('g-')]].describe().T
g_stats[g_stats.columns] = preprocessing.scale(g_stats)

fig, ax = plt.subplots(2, 2, figsize=(15, 7))
fig.suptitle("Meta Statistics for Gene Distribution")

sns.distplot(g_stats['max'], ax=ax[0,0])
sns.distplot(g_stats['mean'], ax=ax[0,1])
sns.distplot(g_stats['min'], ax=ax[1,0])
sns.distplot(g_stats['std'], ax=ax[1,1])

In [None]:
# meta statistics for cell viability
c_stats = train_df[[col for col in train_df.columns if col.startswith('c-')]].describe().T
c_stats[c_stats.columns] = preprocessing.scale(c_stats)

fig, ax = plt.subplots(2, 2, figsize=(15, 7))
fig.suptitle("Meta Statistics for Cell Viability")

sns.distplot(c_stats['max'], ax=ax[0,0])
sns.distplot(c_stats['mean'], ax=ax[0,1])
sns.distplot(c_stats['50%'], ax=ax[1,0])
sns.distplot(c_stats['std'], ax=ax[1,1])

In [None]:
# distribution of the number of positive samples per MoA class
sns.distplot(train_targets_df.loc[:, train_targets_df.columns != 'sig_id'].sum())

In [None]:
plt.figure(figsize=(15, 7))
plt.title("MoA classes with the most positive samples")
train_targets_df.loc[:, train_targets_df.columns != 'sig_id'].sum().sort_values(ascending=False).head(7).plot.barh().invert_yaxis()

In [None]:
plt.figure(figsize=(15, 7))
plt.title("MoA classes with the fewest positive samples")
train_targets_df.loc[:, train_targets_df.columns != 'sig_id'].sum().sort_values(ascending=False).tail(7).plot.barh().invert_yaxis()

In [None]:
# count the 10 most common last words of the target column names (sig_id excluded)
plt.figure(figsize=(13, 6))
plt.title("Class Name Endings Frequency")
pd.Series([s.split('_')[-1] for s in train_targets_df.columns[1:]]).value_counts().head(10).plot.barh().invert_yaxis()

The most common class name endings are "inhibitor", "antagonist", "agonist", etc.

## Correlations among 'g-' and 'c-' features

In [None]:
# correlation matrix for the first 15 columns starting with 'g-'
corr = train_df[[s for s in train_df.columns if s.startswith('g-')][:15]].corr()

plt.figure(figsize=(15,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# np.bool was removed in NumPy 1.24; the builtin bool is equivalent here
sns.heatmap(corr, mask=np.triu(np.ones_like(corr, dtype=bool)), cmap=cmap).set_title("Correlation among first 15 'g-' features")

In [None]:
# correlation matrix for the first 15 columns starting with 'c-'
corr = train_df[[s for s in train_df.columns if s.startswith('c-')][:15]].corr()

plt.figure(figsize=(15,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# np.bool was removed in NumPy 1.24; the builtin bool is equivalent here
sns.heatmap(corr, mask=np.triu(np.ones_like(corr, dtype=bool)), cmap=cmap).set_title("Correlation among first 15 'c-' features")

## Work in Progress