# Drug Mechanisms of Action (MoA) Prediction: Exploratory Data Analysis

# Introduction

This is a simple EDA for the MoA Prediction competition data, which includes:
- training data overview
- numerical and categorical feature distributions
- basic feature interactions
- targets distribution
- feature-feature interactions
- feature-target interactions
- exploring outliers

In [None]:
%matplotlib inline

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['font.size'] = 16

# Train data overview

The training data is provided in CSV format.

In [None]:
df = pd.read_csv('/kaggle/input/lish-moa/train_features.csv')
df_test = pd.read_csv('/kaggle/input/lish-moa/test_features.csv')

`train_features.csv` and `test_features.csv` have identical sets of columns:
- `sig_id` - unique entry id
- `cp_type` - type of perturbation, either `trt_cp` (compound treatment) or `ctl_vehicle` (control)
- `cp_time` - treatment duration, 24, 48 or 72 hours
- `cp_dose` - treatment dose, either `D1` (low) or `D2` (high)
- `g-0` - `g-771` - gene expression data, real numbers in range [-10, 10]
- `c-0` - `c-99` - cell viability data
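
Since the column roles follow directly from the name prefixes, they can be grouped programmatically. The snippet below is a minimal sketch on a made-up toy frame; `split_columns` is a hypothetical helper, not part of the competition tooling.

```python
import pandas as pd

# Toy frame with the same naming scheme as train_features.csv (values are made up).
toy = pd.DataFrame({
    'sig_id': ['id_a', 'id_b'],
    'cp_type': ['trt_cp', 'ctl_vehicle'],
    'cp_time': [24, 72],
    'cp_dose': ['D1', 'D2'],
    'g-0': [0.5, -1.2],
    'g-1': [1.1, 0.3],
    'c-0': [-0.4, 0.9],
})

def split_columns(frame):
    """Group column names by role, based only on their prefixes (hypothetical helper)."""
    gene = [c for c in frame.columns if c.startswith('g-')]
    cell = [c for c in frame.columns if c.startswith('c-')]
    categorical = [c for c in ('cp_type', 'cp_time', 'cp_dose') if c in frame.columns]
    return {'gene': gene, 'cell': cell, 'categorical': categorical}

groups = split_columns(toy)
print({name: len(cols) for name, cols in groups.items()})
```

On the real data the same call would be expected to return 772 gene columns and 100 cell viability columns.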

The dataset does not have any missing values, which is great.

In [None]:
print('# of NaNs in train dataset:', df.isna().values.sum())
print('# of NaNs in test dataset:', df_test.isna().values.sum())

# Categorical columns distributions

In [None]:
BAR_WIDTH = 0.2

In [None]:
# sort_index() keeps train and test bars in the same category order
percentage = df.cp_type.value_counts().sort_index() / len(df) * 100
plt.bar(np.arange(len(percentage)), percentage.values, width=BAR_WIDTH, label='Train')

percentage = df_test.cp_type.value_counts().sort_index() / len(df_test) * 100
plt.bar(np.arange(len(percentage)) + BAR_WIDTH, percentage.values, width=BAR_WIDTH, label='Test')

plt.xticks(np.arange(len(percentage)) + BAR_WIDTH / 2, percentage.index)
plt.legend()
plt.xlabel('Perturbation (cp_type) column value')
plt.ylabel('% of values')
plt.title('cp_type')
plt.show()

In [None]:
# sort_index() keeps train and test bars in the same category order (24, 48, 72)
percentage = df.cp_time.value_counts().sort_index() / len(df) * 100
plt.bar(np.arange(len(percentage)), percentage.values, width=BAR_WIDTH, label='Train')

percentage = df_test.cp_time.value_counts().sort_index() / len(df_test) * 100
plt.bar(np.arange(len(percentage)) + BAR_WIDTH, percentage.values, width=BAR_WIDTH, label='Test')

plt.xticks(np.arange(len(percentage)) + BAR_WIDTH / 2, percentage.index.astype(str) + ' hours')
plt.xlim(-0.5, 3.3)
plt.legend()
plt.xlabel('Treatment duration (cp_time) column value')
plt.ylabel('% of values')
plt.title('cp_time')
plt.show()

In [None]:
# sort_index() keeps train and test bars in the same category order (D1, D2)
percentage = df.cp_dose.value_counts().sort_index() / len(df) * 100
plt.bar(np.arange(len(percentage)), percentage.values, width=BAR_WIDTH, label='Train')

percentage = df_test.cp_dose.value_counts().sort_index() / len(df_test) * 100
plt.bar(np.arange(len(percentage)) + BAR_WIDTH, percentage.values, width=BAR_WIDTH, label='Test')

# doses are categories (D1 / D2), so no ' hours' suffix on the tick labels
plt.xticks(np.arange(len(percentage)) + BAR_WIDTH / 2, percentage.index)
plt.xlim(-0.5, 2)
plt.legend()
plt.xlabel('Treatment dose (cp_dose) column value')
plt.ylabel('% of values')
plt.title('cp_dose')
plt.show()

> ## Insights 💡
> We can see that the categorical column distributions in the train and test datasets are quite similar. Also, the distributions of the `cp_time` and `cp_dose` features are very even (balanced).
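
To put numbers on "quite similar", the per-category shares can be placed side by side in one table. This is a minimal sketch on made-up toy series; `compare_shares` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

def compare_shares(train_col, test_col):
    """Per-category percentages for train and test, aligned on the category value."""
    shares = pd.concat(
        [train_col.value_counts(normalize=True), test_col.value_counts(normalize=True)],
        axis=1, keys=['train_%', 'test_%'],
    ).fillna(0) * 100
    shares['abs_diff'] = (shares['train_%'] - shares['test_%']).abs()
    return shares.sort_index()

# Toy data standing in for df.cp_time and df_test.cp_time.
toy_train = pd.Series([24, 24, 48, 72, 72, 72], name='cp_time')
toy_test = pd.Series([24, 48, 48, 72], name='cp_time')

shares_table = compare_shares(toy_train, toy_test)
print(shares_table)
```

With the real frames, the call would be something like `compare_shares(df.cp_time, df_test.cp_time)`; small values in `abs_diff` would back up the visual impression from the bar charts.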

# Gene features

There are 772 gene features (`g-0` - `g-771`), each is represented by a real number in range [-10, 10].

In [None]:
gene_cols = [f'g-{i}' for i in range(772)]
df[gene_cols].values.min(), df[gene_cols].values.max()

We'll randomly sample 9 gene features and look at their distributions in train and test datasets.

In [None]:
gene_columns_sample = df[gene_cols].sample(9, axis=1, random_state=42)
gene_columns_sample.describe()

In [None]:
fig, axs = plt.subplots(3, 3, figsize=(15, 6), constrained_layout=True)
for i, col in enumerate(gene_columns_sample):
    sns.distplot(df[col], ax=axs[i // 3, i % 3], label='Train')
    sns.distplot(df_test[col], ax=axs[i // 3, i % 3], label='Test')
    axs[i // 3, i % 3].set_title(col)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 10.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
plt.legend()
plt.show()

> ## Insights 💡
> Gene feature distributions are far from normal: most of them have a non-zero mean and heavy tails, and some even have a small second peak (e.g. `g-417`).
>
> However, train and test distributions match almost exactly.

Now we'll see how means of the gene features are distributed.

In [None]:
sns.distplot(df[gene_cols].mean(), kde=False, bins=75)
plt.title('Gene features mean distribution')
plt.show()

Visualize gene features with the highest mean (which are `g-100`, `g-707` and `g-744`), and the lowest mean (which are `g-37`, `g-370` and `g-508`).

In [None]:
gene_cols_with_high_mean = np.argsort(df[gene_cols].mean())[-3:]
gene_cols_with_low_mean = np.argsort(df[gene_cols].mean())[:3]

fig, axs = plt.subplots(2, 3, figsize=(15, 5), constrained_layout=True)
for i, col_number in enumerate(gene_cols_with_high_mean):
    col_name = f'g-{col_number}'
    sns.distplot(df[col_name], ax=axs[0, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[0, i], label='Test')
    axs[0, i].set_title(col_name)
for i, col_number in enumerate(gene_cols_with_low_mean):
    col_name = f'g-{col_number}'
    sns.distplot(df[col_name], ax=axs[1, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[1, i], label='Test')
    axs[1, i].set_title(col_name)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 10.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Gene features with the highest (first row) and the lowest (second row) mean')
plt.legend()
plt.show()

> **If you look at these plots, the huge impact of heavy distribution tails becomes even more obvious. All the distributions have their main peak at zero.**

> Also, we can see that some features have a second small peak, either at the edge values (-10 or 10) or at -5 or 5.

Note that gene features also have very different variances. Let's look at the distribution of their standard deviations.

In [None]:
sns.distplot(df[gene_cols].std(), kde=False, bins=75)
plt.title('Gene features std distribution')
plt.show()

In [None]:
gene_cols_with_high_std = np.argsort(df[gene_cols].std())[-3:]
gene_cols_with_low_std = np.argsort(df[gene_cols].std())[:3]

fig, axs = plt.subplots(2, 3, figsize=(15, 5), constrained_layout=True)
for i, col_number in enumerate(gene_cols_with_high_std):
    col_name = f'g-{col_number}'
    sns.distplot(df[col_name], ax=axs[0, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[0, i], label='Test')
    axs[0, i].set_title(col_name)
for i, col_number in enumerate(gene_cols_with_low_std):
    col_name = f'g-{col_number}'
    sns.distplot(df[col_name], ax=axs[1, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[1, i], label='Test')
    axs[1, i].set_title(col_name)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 10.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Gene features with the highest (first row) and the lowest (second row) std')
plt.legend()
plt.show()

> ## Insights 💡
> Different gene features are distributed differently and have heavy distribution tails. Be careful if your model assumes that input features are distributed similarly or that they come from a standard normal distribution.
>
> Also, there is some bimodality in gene feature distributions: some of them have a second small peak at the edge (-10 or 10). Maybe outliers? We will need to explore this in more detail.
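
One way to quantify "heavy tails" and asymmetry without reading it off the plots is to compute skewness and excess kurtosis per column. The sketch below uses synthetic columns (a normal sample and a Student-t sample) purely as stand-ins; on the real data the same two calls would be applied to `df[gene_cols]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one roughly normal column and one heavy-tailed column.
toy = pd.DataFrame({
    'normal_like': rng.normal(size=5000),
    'heavy_tailed': rng.standard_t(df=3, size=5000),
})

# pandas reports excess kurtosis (Fisher definition), so a normal distribution is close to 0.
shape_stats = pd.DataFrame({
    'skew': toy.skew(),
    'excess_kurtosis': toy.kurt(),
})
print(shape_stats)
```

Columns with a large positive excess kurtosis are the ones where heavy tails are most likely to matter for scaling or for models that assume roughly Gaussian inputs.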

## Correlations
Now we will look at the pairwise correlation between gene features.

In [None]:
plt.figure(figsize=(8, 7))
sns.heatmap(df[gene_cols[:50]].corr())
plt.title('Pairwise correlations of the first 50 gene features')
plt.show()

You can see some very bright and dark lines / rectangles on this plot. It means that there are some groups of features that strongly correlate with each other, as well as features that are highly anti-correlated with others.

Now let's look at the correlation heatmap for all gene features, not just the first 50.

In [None]:
correlations = df[gene_cols].corr()
plt.figure(figsize=(8, 7))
sns.heatmap(correlations)
plt.title('Pairwise correlations of gene features')
plt.show()

Here we observe the same patterns.

In [None]:
sns.distplot(correlations.abs().values.flatten(), kde=False)
plt.title('Gene features pairwise (absolute) correlation coefficients distribution')
plt.show()

In [None]:
correlations_np = correlations.to_numpy(copy=True)  # work on a copy so the DataFrame stays intact
np.fill_diagonal(correlations_np, np.nan)  # mask the diagonal (self-correlations)

In [None]:
max_corr = np.nanmax(correlations_np)
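# The matrix is symmetric, so the maximum appears at (i, j) and (j, i):
# the row indices returned by np.where are exactly the two feature numbers.
i, j = np.where(correlations_np == max_corr)[0]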
i, j = f'g-{i}', f'g-{j}'
print(f'Two features with the highest pairwise correlation in the train dataset: {i}, {j}')
print(f'Correlation coefficient on train data:', max_corr)
print(f'Correlation coefficient on test data:', df_test[[i, j]].corr().values[0][1])
plt.scatter(df[i], df[j], alpha=0.05)
plt.xlabel(i)
plt.ylabel(j)
plt.title('Two features with the highest correlation')
plt.show()

In [None]:
min_corr = np.nanmin(correlations_np)
# Same symmetry trick as above: the two row indices are the two feature numbers.
i, j = np.where(correlations_np == min_corr)[0]
i, j = f'g-{i}', f'g-{j}'
print(f'Two features with the lowest pairwise correlation in the train dataset: {i}, {j}')
print(f'Correlation coefficient on train data:', min_corr)
print(f'Correlation coefficient on test data:', df_test[[i, j]].corr().values[0][1])
plt.scatter(df[i], df[j], alpha=0.05)
plt.xlabel(i)
plt.ylabel(j)
plt.title('Two features with the lowest correlation')
plt.show()

> ## Insights 💡
> There are features that are highly correlated. So be careful when choosing the model and preprocessing the data, as correlated features may lead to poor performance for some ML models.
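
To go beyond the single strongest pair, the upper triangle of the correlation matrix can be flattened into a ranked list of pairs. This is a minimal sketch on synthetic columns; `top_correlated_pairs` is a hypothetical helper, and ranking by absolute value is a choice made here so that strong anti-correlations show up too.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, top_k=3):
    """Return the top_k column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Indices of the strict upper triangle: each unordered pair once, no diagonal.
    rows, cols = np.triu_indices(len(corr), k=1)
    pairs = pd.DataFrame({
        'feature_a': corr.index[rows],
        'feature_b': corr.columns[cols],
        'corr': corr.to_numpy()[rows, cols],
    })
    order = pairs['corr'].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top_k).reset_index(drop=True)

rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'g-0': base,
    'g-1': base + 0.1 * rng.normal(size=500),   # strongly correlated with g-0
    'g-2': -base + 0.5 * rng.normal(size=500),  # anti-correlated with g-0
    'g-3': rng.normal(size=500),                # independent noise
})

top_pairs = top_correlated_pairs(toy, top_k=2)
print(top_pairs)
```

On the real data, `top_correlated_pairs(df[gene_cols], top_k=20)` would give a shortlist of candidate redundant features.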

# Cell viability features

There are 100 cell viability features (`c-0` - `c-99`), each is represented by a real number in range from -10 to approximately 6.5.


We'll perform the same kind of analysis for them as for the gene features.

In [None]:
cell_viability_cols = [f'c-{i}' for i in range(100)]

Randomly sample 9 cell viability features and look at their distributions in train and test datasets.

In [None]:
viability_columns_sample = df[cell_viability_cols].sample(9, axis=1, random_state=42)  # fixed seed for a reproducible sample
viability_columns_sample.describe()

In [None]:
fig, axs = plt.subplots(3, 3, figsize=(15, 6), constrained_layout=True)
for i, col in enumerate(viability_columns_sample):
    sns.distplot(df[col], ax=axs[i // 3, i % 3], label='Train')
    sns.distplot(df_test[col], ax=axs[i // 3, i % 3], label='Test')
    axs[i // 3, i % 3].set_title(col)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 6.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
plt.legend()
plt.show()

In [None]:
sns.distplot(df[cell_viability_cols].mean(), kde=False, bins=15)
plt.title('Cell viability features mean distribution')
plt.show()

In [None]:
gene_cols_with_high_mean = np.argsort(df[cell_viability_cols].mean())[-3:]
gene_cols_with_low_mean = np.argsort(df[cell_viability_cols].mean())[:3]

fig, axs = plt.subplots(2, 3, figsize=(15, 5), constrained_layout=True)
for i, col_number in enumerate(gene_cols_with_high_mean):
    col_name = f'c-{col_number}'
    sns.distplot(df[col_name], ax=axs[0, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[0, i], label='Test')
    axs[0, i].set_title(col_name)
for i, col_number in enumerate(gene_cols_with_low_mean):
    col_name = f'c-{col_number}'
    sns.distplot(df[col_name], ax=axs[1, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[1, i], label='Test')
    axs[1, i].set_title(col_name)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 6.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Cell viability features with the highest (first row) and the lowest (second row) mean')
plt.legend()
plt.show()

In [None]:
sns.distplot(df[cell_viability_cols].std(), kde=False, bins=15)
plt.title('Cell viability features std distribution')
plt.show()

In [None]:
gene_cols_with_high_std = np.argsort(df[cell_viability_cols].std())[-3:]
gene_cols_with_low_std = np.argsort(df[cell_viability_cols].std())[:3]

fig, axs = plt.subplots(2, 3, figsize=(15, 5), constrained_layout=True)
for i, col_number in enumerate(gene_cols_with_high_std):
    col_name = f'c-{col_number}'
    sns.distplot(df[col_name], ax=axs[0, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[0, i], label='Test')
    axs[0, i].set_title(col_name)
for i, col_number in enumerate(gene_cols_with_low_std):
    col_name = f'c-{col_number}'
    sns.distplot(df[col_name], ax=axs[1, i], label='Train')
    sns.distplot(df_test[col_name], ax=axs[1, i], label='Test')
    axs[1, i].set_title(col_name)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 6.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Cell viability features with the highest (first row) and the lowest (second row) std')
plt.legend()
plt.show()

> ## Insights 💡
> Cell viability features are distributed much more similarly (to each other) than gene features. Still, they are also asymmetrical, have heavy distribution tails (usually on the left side) and sometimes a second peak at -10.

In [None]:
correlations = df[cell_viability_cols].corr()
plt.figure(figsize=(8, 7))
sns.heatmap(correlations)
plt.title('Pairwise correlations of cell viability features')
plt.show()

> ## Insights 💡
> **Note the scale on the right side of the plot: the lowest value of the pairwise correlation coefficient is around 0.6. It means that cell viability features are highly dependent. However, this may be caused by the peaks around -10. We'll study them in a bit more detail later in this notebook.**

In [None]:
sns.distplot(correlations.abs().values.flatten(), kde=False)
plt.title('Cell viability features pairwise (absolute) correlation coefficients distribution')
plt.show()

As an example of feature dependency let's pick one feature (`c-0`) and look at how other features are distributed when `c-0` value is close to the lower bound (-10).

In [None]:
col = 'c-0'
print(df[df[col] < -9][cell_viability_cols[1:]].values.mean(), df[df[col] >= -9][cell_viability_cols[1:]].values.mean())
sns.distplot(df[df[col] < -9][cell_viability_cols[1:]].values, label='c-1 - c-99 distribution when c-0 < -9')
sns.distplot(df[df[col] >= -9][cell_viability_cols[1:]].values, label='c-1 - c-99 distribution when c-0 >= -9')
plt.title('Cell viability distribution conditioned on c-0')
plt.legend()
plt.show()

Clearly, the other features are distributed very differently depending on whether `c-0` is close to the lower bound.

Now let's have a look at feature distributions in rows where any of the features is close to the lower bound (-10).

In [None]:
print(df.loc[(df[cell_viability_cols] < -9).any(axis=1), cell_viability_cols].values.mean(), 
      df.loc[(df[cell_viability_cols] >= -9).all(axis=1), cell_viability_cols].values.mean())
sns.distplot(df.loc[(df[cell_viability_cols] < -9).any(axis=1), cell_viability_cols], label='any of cell viability features < -9')
sns.distplot(df.loc[(df[cell_viability_cols] >= -9).all(axis=1), cell_viability_cols], label='all of cell viability features >= -9')
plt.title('Conditioned cell viability features distribution')
plt.legend()
plt.show()

> ## Insights 💡
> This is interesting behavior: it seems that some rows have all their values around zero, while others have a lot of very low values.

Rows with low values make up about 6% of the train dataset.

In [None]:
print('Percentage of rows containing a value < -9:', (df[cell_viability_cols] < -9).any(axis=1).sum() / len(df) * 100)
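
The "any value < -9" flag treats a row with a single low value the same as a row where nearly every feature is low. A minimal sketch of a finer view, on made-up toy values, counts the low features per row; `LOW_THRESHOLD` and `cell_cols` are names introduced only for this example.

```python
import pandas as pd

LOW_THRESHOLD = -9  # same cut-off as used above

# Toy rows: one fully "low" row, one row with a single low value, two ordinary rows.
toy = pd.DataFrame({
    'c-0': [0.1, -9.8, -0.3, -10.0],
    'c-1': [0.4, -9.9, 0.2, 0.5],
    'c-2': [-0.2, -10.0, 0.1, 0.3],
})
cell_cols = ['c-0', 'c-1', 'c-2']

# Number of cell viability features below the threshold in each row.
low_counts = (toy[cell_cols] < LOW_THRESHOLD).sum(axis=1)

# How many rows have 0, 1, 2, ... low features.
summary = low_counts.value_counts().sort_index()
print(summary)
```

On the real data, replacing `toy[cell_cols]` with `df[cell_viability_cols]` would show whether the flagged rows are mostly "all low" or mostly "a few low".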

For now we will consider them outliers, exclude them from the dataset and look at the correlation heatmap once again:

In [None]:
correlations = df.loc[(df[cell_viability_cols] >= -9).all(axis=1), cell_viability_cols].corr()
plt.figure(figsize=(8, 7))
sns.heatmap(correlations)
plt.title('Pairwise correlations of cell viability features, outliers excluded')
plt.show()

> ## Insights 💡
> Even with "outliers" excluded, cell viability features still look very inter-dependent.
>
> However, there are some features that do not correlate with others (dark vertical and horizontal lines).
>
> Also worth mentioning: we do not have any anti-correlations here!

# Targets

We have 206 different binary targets in this competition.

In [None]:
targets = pd.read_csv('/kaggle/input/lish-moa/train_targets_scored.csv')

In [None]:
targets.head()

In [None]:
targets.shape

In [None]:
target_cols = targets.columns[1:]

In [None]:
(targets[target_cols].mean() * 100).plot.hist(bins=50)
plt.title('% of true labels in target distribution')
plt.xlabel('% of true labels')
plt.show()

> ## Insights 💡
> All the targets are highly imbalanced. The most frequent target has only 3.5% of true labels! And most of the rest have less than 0.5%.

Let's have a look at the 20 most frequent targets.

In [None]:
frequent_targets = (targets[target_cols].mean() * 100).sort_values()[-20:].index

In [None]:
(targets[frequent_targets].mean() * 100).sort_values().plot.bar()
plt.title('Most frequent targets')
plt.ylabel('% of true labels')
plt.show()

Now we want to know how many true labels each row can have.

In [None]:
vc = targets[target_cols].sum(axis=1).value_counts()
plt.title('# of true labels per row distribution')
plt.ylabel('# of rows')
plt.xlabel('# of true targets per row')
plt.bar(vc.index, vc.values)
plt.show()

> ## Insights 💡
> A large portion of train samples does not have any MoA labels at all. Most of the other rows have exactly one label, but some can have up to 7.

We will take these top-20 frequent targets and see how they interact with each other - how often they go together.

In [None]:
counts = np.zeros((len(frequent_targets), len(frequent_targets)))
for i, col1 in enumerate(frequent_targets):
    for j, col2 in enumerate(frequent_targets):
        if i != j:
            counts[i, j] = len(targets[(targets[col1] == 1) & (targets[col2] == 1)]) / len(targets[targets[col1] == 1]) 

In [None]:
plt.figure(figsize=(12, 10))
sns.heatmap(counts, annot=True, fmt=".1f", annot_kws={"size": 14})
plt.xticks(np.arange(len(frequent_targets)), frequent_targets, rotation=90)
plt.yticks(np.arange(len(frequent_targets)), frequent_targets, rotation=0)
plt.show()

> ## Insights 💡
> It seems like most of the targets are mutually exclusive. However, there are a few that look strongly coupled. For example, `kit_inhibitor` and `flt3_inhibitor`: if a sample has a true label for `kit_inhibitor`, there is an 80% chance that it will also have `flt3_inhibitor` among its targets, and vice versa.
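
The conditional co-occurrence matrix computed by the double loop above can also be obtained with a single matrix product. This is a minimal sketch on a made-up toy target table, assuming every target has at least one positive row (otherwise the division would hit zero).

```python
import numpy as np
import pandas as pd

# Toy binary targets; column names are placeholders, not real MoA names.
toy_targets = pd.DataFrame({
    'a': [1, 1, 0, 0, 1],
    'b': [1, 1, 0, 0, 0],
    'c': [0, 0, 1, 0, 0],
})

y = toy_targets.to_numpy(dtype=float)
joint = y.T @ y                      # joint[i, j] = number of rows where both i and j are 1
positives = np.diag(joint).copy()    # the diagonal holds per-target positive counts
cond = joint / positives[:, None]    # cond[i, j] = share of rows with i == 1 that also have j == 1
np.fill_diagonal(cond, 0.0)          # match the loop above, which skips i == j

cooccurrence = pd.DataFrame(cond, index=toy_targets.columns, columns=toy_targets.columns)
print(cooccurrence)
```

Note that the matrix is not symmetric: here every `b` row is also an `a` row, but only two of the three `a` rows are `b` rows.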

# Feature interactions

Now we will explore feature interactions a bit.

Let's see how gene columns are distributed with different `cp_time` (treatment duration) and `cp_dose` (treatment dose).

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(14, 6), constrained_layout=True)
for i, col in enumerate(gene_columns_sample.columns[:4]):
    sns.distplot(df[df.cp_time == 24][col], label='24 hours', kde=False, ax=axs[i // 2, i % 2])
    sns.distplot(df[df.cp_time == 48][col], label='48 hours', kde=False, ax=axs[i // 2, i % 2])
    sns.distplot(df[df.cp_time == 72][col], label='72 hours', kde=False, ax=axs[i // 2, i % 2])
    axs[i // 2, i % 2].set_title(col)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 10.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Gene features distributions conditioned on cp_time')
plt.legend()
plt.show()

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(14, 6), constrained_layout=True)
for i, col in enumerate(gene_columns_sample.columns[:4]):
    sns.distplot(df[df.cp_dose == 'D1'][col], label='Low dose', kde=False, ax=axs[i // 2, i % 2])
    sns.distplot(df[df.cp_dose == 'D2'][col], label='High dose', kde=False, ax=axs[i // 2, i % 2])
    axs[i // 2, i % 2].set_title(col)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 10.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Gene features distributions conditioned on cp_dose')
plt.legend()
plt.show()

Same for cell viability features:

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(14, 6), constrained_layout=True)
for i, col in enumerate(viability_columns_sample.columns[:4]):
    sns.distplot(df[df.cp_time == 24][col], label='24 hours', kde=False, ax=axs[i // 2, i % 2])
    sns.distplot(df[df.cp_time == 48][col], label='48 hours', kde=False, ax=axs[i // 2, i % 2])
    sns.distplot(df[df.cp_time == 72][col], label='72 hours', kde=False, ax=axs[i // 2, i % 2])
    axs[i // 2, i % 2].set_title(col)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 10.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Cell viability features distributions conditioned on cp_time')
plt.legend()
plt.show()

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(14, 6), constrained_layout=True)
for i, col in enumerate(viability_columns_sample.columns[:4]):
    sns.distplot(df[df.cp_dose == 'D1'][col], label='Low dose', kde=False, ax=axs[i // 2, i % 2])
    sns.distplot(df[df.cp_dose == 'D2'][col], label='High dose', kde=False, ax=axs[i // 2, i % 2])
    axs[i // 2, i % 2].set_title(col)
for ax in axs.flat:
    ax.set(xlabel='', ylabel='')
    ax.set_xlim(-10.5, 10.5)
    ax.label_outer()
    ax.title.set_fontsize(12)
fig.suptitle('Cell viability features distributions conditioned on cp_dose')
plt.legend()
plt.show()

> ## Insights 💡
> Cell viability feature distributions with different `cp_time` and `cp_dose` match almost exactly. For gene features, these distributions are not identical, but still very close.
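
The visual comparison above can be complemented with per-group summary statistics. The sketch below uses synthetic data in which the features are independent of `cp_time` and `cp_dose` by construction, so it only illustrates the mechanics; `group_stats` and `mean_spread` are names introduced for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 600

# Synthetic stand-in for the train frame: two categorical and two numerical columns.
toy = pd.DataFrame({
    'cp_time': rng.choice([24, 48, 72], size=n),
    'cp_dose': rng.choice(['D1', 'D2'], size=n),
    'g-0': rng.normal(size=n),
    'c-0': rng.normal(size=n),
})

# Mean and std of each feature within every (cp_time, cp_dose) combination.
group_stats = toy.groupby(['cp_time', 'cp_dose'])[['g-0', 'c-0']].agg(['mean', 'std'])

# Spread of the group means per feature: a small spread means the groups look alike.
group_means = group_stats.xs('mean', axis=1, level=1)
mean_spread = group_means.max() - group_means.min()
print(group_stats)
print(mean_spread)
```

On the real data, features with a large `mean_spread` relative to their standard deviation would be the ones whose distribution actually shifts with treatment time or dose.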

## Gene - cell viability correlations
We will look at how gene and cell viability features correlate with each other.

In [None]:
correlations = df[gene_cols + cell_viability_cols].corr()

In [None]:
plt.figure(figsize=(16, 7))
sns.heatmap(correlations.loc[cell_viability_cols, gene_cols])
plt.title('Pairwise correlations of gene and cell viability features')
plt.show()

In [None]:
plt.figure(figsize=(8, 7))
sns.heatmap(correlations.loc[cell_viability_cols[:50], gene_cols[:50]])
plt.title('Closer look: pairwise correlations of gene and cell viability features')
plt.show()

In [None]:
sns.distplot(correlations.loc[cell_viability_cols, gene_cols].abs().values.flatten(), kde=False, bins=200)
plt.title('Gene - cell viability (absolute) correlation coefficients distribution')
plt.show()

> ## Insights 💡
> Gene features and cell viability features are very inter-dependent.
>
> The presence of these distinct light (or dark) vertical lines on the plot means that there are gene features that are highly correlated (or anti-correlated) with **ALL** the cell viability features.
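
Calling `corr()` on the concatenated columns also computes the gene-gene and cell-cell blocks, which are not used here. A possible alternative is to compute only the cross block from standardized columns. This is a sketch under the assumptions that there are no missing values and no constant columns; `cross_corr` is a hypothetical helper, and the toy data is synthetic.

```python
import numpy as np
import pandas as pd

def cross_corr(left, right):
    """Pearson correlations between every column of `left` and every column of `right`."""
    # Standardize with the population std so that the mean of products equals Pearson's r.
    lz = (left - left.mean()) / left.std(ddof=0)
    rz = (right - right.mean()) / right.std(ddof=0)
    block = lz.to_numpy().T @ rz.to_numpy() / len(left)
    return pd.DataFrame(block, index=left.columns, columns=right.columns)

rng = np.random.default_rng(0)
toy_genes = pd.DataFrame(rng.normal(size=(200, 3)), columns=['g-0', 'g-1', 'g-2'])
toy_cells = pd.DataFrame({
    'c-0': toy_genes['g-0'] + 0.5 * rng.normal(size=200),  # depends on g-0
    'c-1': rng.normal(size=200),                           # independent noise
})

gene_cell_block = cross_corr(toy_genes, toy_cells)

# Reference: the same block sliced out of the full correlation matrix.
reference = pd.concat([toy_genes, toy_cells], axis=1).corr().loc[toy_genes.columns, toy_cells.columns]
print(gene_cell_block)
```

The result has genes as rows and cell viability features as columns, so it would need a transpose to match the orientation of the heatmaps above.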

Now we will try to exclude our "outliers" (samples in which any cell viability feature is below -9, see the Cell viability features section) and recompute the correlation matrix.

In [None]:
correlations = df.loc[(df[cell_viability_cols] >= -9).all(axis=1), gene_cols + cell_viability_cols].corr()

In [None]:
plt.figure(figsize=(16, 7))
sns.heatmap(correlations.loc[cell_viability_cols, gene_cols])
plt.title('Pairwise correlations of gene and cell viability features, outliers excluded')
plt.show()

In [None]:
plt.figure(figsize=(8, 7))
sns.heatmap(correlations.loc[cell_viability_cols[:50], gene_cols[:50]])
plt.title('Closer look: pairwise correlations of gene and cell viability features, outliers excluded')
plt.show()

In [None]:
sns.distplot(correlations.loc[cell_viability_cols, gene_cols].abs().values.flatten(), kde=False, bins=200)
plt.title('Gene - cell viability (absolute) correlation coefficients distribution, outliers excluded')
plt.show()

So now this looks much better. Although there are still some gene features with strong (anti-) correlation with all the cell viability features, on average they all look quite independent.

# Feature-target interactions

Let's validate that samples treated with a control perturbation do not have MoAs (i.e. rows with `cp_type == 'ctl_vehicle'` have all targets equal to 0), as stated in the competition data description.

In [None]:
targets.loc[df.cp_type == 'ctl_vehicle', target_cols].sum().sum()

**Thus we will exclude these rows from further analysis.**

So now we have 2 categorical features (`cp_dose` and `cp_time`) and we want to know if any of them strongly affects any of our targets.

In order to achieve this we can do the following:

1. For every unique value of the feature and each target, compute the target's mean score over the train samples with this value.
2. For each pair of unique feature values, compare their mean scores for each target using the relative difference:

    $$\text{Relative difference }(\text{value i}, \text{value j}, \text{target k}) = \frac{|\bar{y}_{i, k} - \bar{y}_{j, k}|}{\max(\bar{y}_{i, k}, \bar{y}_{j, k})}$$
3. Output the feature values and targets with a high relative difference.


See the cell below for code.


In [None]:
EPS = 10 ** -5
REL_ERROR_THRESHOLD = 0.3
columns = ['cp_dose', 'cp_time']

t = targets[target_cols].sum()
target_cols_more_than_once = t[t > 1].index  # we don't want to look at targets with only one true label

for col in columns:
    col_unique_values = df[col].unique()
    target_averages = []
    
    # Step 1: compute mean targets scores
    for val in col_unique_values:
        target_averages.append(targets.loc[df[col] == val, target_cols_more_than_once].mean())
     
    # Step 2: compute relative differences
    for i in range(len(col_unique_values)):
        for j in range(i + 1, len(col_unique_values)):
            rel_diff = abs(target_averages[i] - target_averages[j]) / (pd.concat([target_averages[i], target_averages[j]], axis=1).max(axis=1) + EPS)
            if rel_diff.max() < REL_ERROR_THRESHOLD:
                continue
            print(col_unique_values[i], col_unique_values[j])  # Step 3: output feature values and targets with high relative difference
            for target in rel_diff[rel_diff >= REL_ERROR_THRESHOLD].index:
                print(target, target_averages[i][target], target_averages[j][target])  # mean score for value i vs. value j

The cell did not output anything, so none of the values of these features has a particular effect on any target on its own (which does not mean that these features are not important for prediction!).
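
For intuition, here is the relative difference from the steps above worked through on a tiny made-up example with two doses and two targets. It is only a sketch: the frames and target names are invented, and `groupby` is used in place of the explicit loops.

```python
import numpy as np
import pandas as pd

EPS = 1e-5  # same small constant as in the cell above, avoids division by zero

toy_features = pd.DataFrame({'cp_dose': ['D1', 'D1', 'D1', 'D1', 'D2', 'D2', 'D2', 'D2']})
toy_targets = pd.DataFrame({
    'target_a': [1, 1, 0, 0, 1, 0, 0, 0],  # mean 0.50 for D1, 0.25 for D2
    'target_b': [1, 0, 0, 0, 1, 0, 0, 0],  # mean 0.25 for both doses
})

# Step 1: mean target score per feature value.
means = toy_targets.groupby(toy_features['cp_dose']).mean()

# Step 2: relative difference between the two feature values, per target.
d1, d2 = means.loc['D1'], means.loc['D2']
rel_diff = (d1 - d2).abs() / (np.maximum(d1, d2) + EPS)
print(rel_diff)
```

Here `target_a` ends up with a relative difference of about 0.5, which would exceed a threshold of 0.3, while `target_b` gets exactly 0.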

Now let's see how numerical features correlate with targets.

In [None]:
correlations = pd.concat([df[gene_cols + cell_viability_cols], targets[target_cols]], axis=1).corr()

In [None]:
plt.figure(figsize=(16, 7))
sns.heatmap(correlations.loc[target_cols, gene_cols])
plt.title('Gene features correlations with targets')
plt.show()

In [None]:
plt.figure(figsize=(16, 7))
sns.heatmap(correlations.loc[target_cols, cell_viability_cols])
plt.title('Cell viability features correlations with targets')
plt.show()

In general, individual numerical features do not correlate with targets. However, there are two distinct horizontal lines on both plots. Let's have a look at these targets.

These two targets have strong anti-correlation with all the cell viability features:

In [None]:
correlations.loc[target_cols, cell_viability_cols].mean(axis=1).sort_values()

These are `proteasome_inhibitor` and `nfkb_inhibitor` which on average have much lower mean correlation coefficients than all the other targets.

Surprisingly, these are exactly the two targets with the highest variance in their correlation coefficients with gene features!

In [None]:
correlations.loc[target_cols, gene_cols].std(axis=1).sort_values(ascending=False)

Remember "our outliers" from Cell viability features section? Let's see how these two targets are distributed for them:

In [None]:
for target in ['proteasome_inhibitor', 'nfkb_inhibitor']:
    print(f'Percentage of true labels for {target} target among outliers %.2f' % (targets.loc[(df[cell_viability_cols] < -9).any(axis=1), target].mean() * 100))
    print(f'Percentage of true labels for {target} target among non-outliers %.2f' % (targets.loc[(df[cell_viability_cols] >= -9).all(axis=1), target].mean() * 100))
    print('-' * 20)

As a reminder: `proteasome_inhibitor` and `nfkb_inhibitor` are also the two most frequent targets in the train dataset.

**To be continued...**