Used https://www.kaggle.com/headsortails/explorations-of-action-moa-eda as a reference.

We are classifying drugs by their biological activity. In drug discovery, the aim is to identify the proteins associated with a specific disease and to develop molecules that target those proteins. The mechanism of action (MoA) of a molecule describes its biological activity. This dataset describes the responses of 100 different types of human cells to various drugs, and those response patterns will be used to predict the MoA labels. Note that a drug can have multiple MoA annotations.

This is a multi-label classification problem with a log loss evaluation metric.
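
To make the metric concrete, here is a minimal sketch of a column-wise log loss, assuming the score is the binary log loss computed per target column and then averaged over all columns. The function name `mean_columnwise_log_loss`, the clipping constant `eps`, and the toy arrays are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss per target column, averaged over columns (assumed metric)."""
    y_true = np.asarray(y_true, dtype=float)
    # clip predictions away from 0 and 1 so the logarithm stays finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # mean over samples (axis 0) gives one loss per target; then average the targets
    return per_entry.mean(axis=0).mean()

# toy example: 3 samples, 2 targets (made-up values)
y_true = np.array([[1, 0], [0, 0], [1, 1]])
y_pred = np.full(y_true.shape, 0.5)
toy_loss = mean_columnwise_log_loss(y_true, y_pred)
print(toy_loss)
```

Predicting 0.5 everywhere gives a loss of ln 2 regardless of the labels, which is a handy baseline to compare any model against.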

The data comes as train and test files. The training predictors (train_features.csv) and the targets (train_targets_scored.csv) are stored in separate files. Each row of these files corresponds to a specific treatment.

In [None]:
# import the required libraries
import os, sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
from sklearn import preprocessing

style.use('seaborn')

%matplotlib inline

In [None]:
DATA_DIR = "/kaggle/input/lish-moa"

In [None]:
train_df = pd.read_csv(os.path.join(DATA_DIR, "train_features.csv"))
train_targets_df = pd.read_csv(os.path.join(DATA_DIR, "train_targets_scored.csv"))
test_df = pd.read_csv(os.path.join(DATA_DIR, "test_features.csv"))
sample_sub_df = pd.read_csv(os.path.join(DATA_DIR, "sample_submission.csv"))

In [None]:
print(train_df.shape)
train_df.head()

In [None]:
f"There are {sum('g-' in s for s in train_df.columns)} columns starting with 'g-' that encode the gene expression data and {sum('c-' in s for s in train_df.columns)} starting with 'c-' that encode the cell viability data"



There are also three "cp_" features: "cp_type" indicates the sample treatment, while "cp_time" and "cp_dose" encode the duration and the dosage of the treatment.
The sig_id column is the unique primary key of each sample.
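
As a quick illustration of how these columns can be inspected, the sketch below builds a tiny made-up frame and lists the distinct levels of each "cp_" column, then cross-tabulates duration against dose. The frame `toy` and all of its values are assumptions for illustration; on the real data the same calls would be applied to `train_df`.

```python
import pandas as pd

# toy rows mimicking the key and the three "cp_" columns (values are assumptions)
toy = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp", "trt_cp"],
    "cp_time": [24, 24, 48, 72],
    "cp_dose": ["D1", "D2", "D1", "D2"],
})

# distinct levels of every treatment column
cp_cols = [c for c in toy.columns if c.startswith("cp_")]
levels = {c: sorted(toy[c].unique()) for c in cp_cols}

# a primary key must not contain duplicates
key_is_unique = toy["sig_id"].is_unique

# how often each duration/dose combination occurs
dose_by_time = pd.crosstab(toy["cp_time"], toy["cp_dose"])
print(levels)
print(dose_by_time)
```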

In [None]:
print(train_targets_df.shape)
train_targets_df.head()

In [None]:
test_df.head()

In [None]:
# train_df null check
print(train_df.isnull().values.any())
train_targets_df.isnull().values.any()

In [None]:
# test_df null check
test_df.isnull().values.any()

In [None]:
# sanity check, check if the number of sig_id in train_df is equal to the number of sig_id in the train_targets_df
train_df.sig_id.nunique() == train_targets_df.sig_id.nunique()
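
The check above compares only the number of unique ids, so it would still pass if the two files held different ids or listed them in a different order. A stricter variant might look like the sketch below; the frames `features` and `targets` and their contents are made-up stand-ins for `train_df` and `train_targets_df`.

```python
import pandas as pd

# toy stand-ins for the feature and target files (values are assumptions)
features = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "g-0": [0.1, -0.3, 1.2]})
targets = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "moa_x": [0, 1, 0]})

# same set of ids in both files
same_ids = set(features["sig_id"]) == set(targets["sig_id"])
# same ids in the same row order, so rows can be matched by position
same_order = features["sig_id"].equals(targets["sig_id"])
# no id appears twice in either file
no_duplicates = features["sig_id"].is_unique and targets["sig_id"].is_unique

print(same_ids, same_order, no_duplicates)
```

If `same_ids` holds but `same_order` does not, merging the two frames on `sig_id` is safer than relying on row position.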

## Treatment Features

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 6))

ax1.set_title("Sample Treatment")
train_df['cp_type'].value_counts().plot.bar(ax=ax1)
ax2.set_title("Treatment Dose")
train_df['cp_dose'].value_counts().plot.bar(ax=ax2)
ax3.set_title("Treatment Duration")
train_df['cp_time'].value_counts().plot.bar(ax=ax3)

In [None]:
plt.figure(figsize=(15,15))
# distributions of the gene expressions
for column in [s for s in train_df.columns if s.startswith('g-')]:
    sns.kdeplot(train_df[column], legend=False)

In [None]:
plt.figure(figsize=(15,15))
# distributions of the cell viability features
for column in [s for s in train_df.columns if s.startswith('c-')]:
    sns.kdeplot(train_df[column], legend=False)

## Meta Statistics

In [None]:
# meta statistics for gene expression
g_stats = train_df[[col for col in train_df if col.startswith('g-')]].describe().T
g_stats[g_stats.columns] = preprocessing.scale(g_stats)

fig, ax = plt.subplots(2, 2, figsize=(15, 7))
fig.suptitle("Meta Statistics for Gene Distribution")

sns.distplot(g_stats['max'], ax=ax[0,0])
sns.distplot(g_stats['mean'], ax=ax[0,1])
sns.distplot(g_stats['min'], ax=ax[1,0])
sns.distplot(g_stats['std'], ax=ax[1,1])

In [None]:
# meta statistics for cell viability
c_stats = train_df[[col for col in train_df.columns if col.startswith('c-')]].describe().T
c_stats[c_stats.columns] = preprocessing.scale(c_stats)

fig, ax = plt.subplots(2, 2, figsize=(15, 7))
fig.suptitle("Meta Statistics for Cell Viability")

sns.distplot(c_stats['max'], ax=ax[0,0])
sns.distplot(c_stats['mean'], ax=ax[0,1])
sns.distplot(c_stats['50%'], ax=ax[1,0])
sns.distplot(c_stats['std'], ax=ax[1,1])

In [None]:
# distribution of the number of positive samples per MoA class
sns.distplot(train_targets_df.loc[:, train_targets_df.columns != 'sig_id'].sum())
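
The column sums above count samples per MoA class. Since the introduction notes that a drug can have multiple MoA annotations, the complementary view is the number of labels per sample, obtained from the row sums. The sketch below uses a made-up frame `toy_targets` with hypothetical class names; on the real data the same steps would apply to `train_targets_df`.

```python
import pandas as pd

# toy target table: one row per sample, one binary column per MoA (made-up values)
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "moa_x": [0, 1, 0, 1],
    "moa_y": [0, 1, 0, 0],
    "moa_z": [0, 0, 1, 0],
})

# every column except the key is a label
label_cols = toy_targets.columns.drop("sig_id")

# row sums: how many MoA labels each sample carries
labels_per_sample = toy_targets[label_cols].sum(axis=1)

# how many samples have 0, 1, 2, ... labels
distribution = labels_per_sample.value_counts().sort_index()
print(distribution)
```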

In [None]:
plt.figure(figsize=(15, 7))
plt.title("MoAs with the most positive samples")
train_targets_df.loc[:, train_targets_df.columns != 'sig_id'].sum().sort_values(ascending=False).head(7).plot.barh().invert_yaxis()

In [None]:
plt.figure(figsize=(15, 7))
plt.title("MoAs with the fewest positive samples")
train_targets_df.loc[:, train_targets_df.columns != 'sig_id'].sum().sort_values(ascending=False).tail(7).plot.barh().invert_yaxis()

In [None]:
# top 10 name endings of the target columns (sig_id excluded)
plt.figure(figsize=(13, 6))
plt.title("Class Name Endings Frequency")
pd.Series([s.split('_')[-1] for s in train_targets_df.columns[1:]]).value_counts().head(10).plot.barh().invert_yaxis()

## Correlations

In [None]:
# correlation matrix for the first 15 columns starting with 'g-'
corr = train_df[[s for s in train_df.columns if s.startswith('g-')][:15]].corr()

plt.figure(figsize=(15,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# np.bool was removed in NumPy 1.24; the builtin bool works on every version
sns.heatmap(corr, mask=np.triu(np.ones_like(corr, dtype=bool)), cmap=cmap)

In [None]:
# correlation matrix for the first 15 columns starting with 'c-'
corr = train_df[[s for s in train_df.columns if s.startswith('c-')][:15]].corr()

plt.figure(figsize=(15,10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# np.bool was removed in NumPy 1.24; the builtin bool works on every version
sns.heatmap(corr, mask=np.triu(np.ones_like(corr, dtype=bool)), cmap=cmap)
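
A heatmap becomes hard to read once more than a few dozen columns are included. One possible complement is to rank the column pairs by absolute correlation, as in the sketch below. The frame `toy` is synthetic data generated for illustration, with column names that only mimic the 'c-' features.

```python
import numpy as np
import pandas as pd

# synthetic stand-in for a few feature columns (not the real data)
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "c-0": base,
    "c-1": base + rng.normal(scale=0.1, size=200),  # nearly a copy of c-0
    "c-2": rng.normal(size=200),                     # independent noise
})

corr = toy.corr()

# keep only the strict upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# flatten to (column, column) -> correlation and rank by absolute value
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
top_pair = pairs.index[0]
print(pairs.head())
```

Ranking by absolute value keeps strongly negative correlations near the top as well, which a plain descending sort would push to the bottom.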

## Work in Progress