In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
def plot_features(ax, features, color='blue', title=''):
    """Scatter one sample's feature values against their feature index."""
    ax.scatter(np.arange(len(features)), features, c=color)
    ax.set_ylabel('Feature Value')
    ax.set_xlabel('Feature #')
    ax.set_title(title)


def plot_one_vs_class(ax, idx, class_features, title=''):
    """Plot sample `idx` in blue on top of the rest of its class in grey."""
    for i, feat in enumerate(class_features):
        if i == idx:
            continue
        plot_features(ax, feat, 'grey')

    plot_features(ax, class_features[idx], title=title)

## Read in data

In [None]:
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
train_targets = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
train_targets_ns = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
test_features = pd.read_csv('../input/lish-moa/test_features.csv')
sample_submission = pd.read_csv('../input/lish-moa/sample_submission.csv')

## Take a look at some examples

First, we will look at the `radiopaque_medium` target. Notice that this MoA is not cytotoxic (it does not kill cells): the cell viability values, which are the last 100 features in each plot, stay roughly between -2 and 2.

In [None]:
target = 'radiopaque_medium'
target_df = train_features[train_targets[target] == 1].reset_index(drop=True)
target_features = target_df.values[:, 4:]
fig, ax = plt.subplots(2, 1, figsize=(20, 10))
plot_one_vs_class(ax[0], 0, target_features)
plot_one_vs_class(ax[1], 1, target_features)

And next we'll look at the `igf-1_inhibitor` target. Unlike the previous MoA, this target is often cytotoxic and we will see some clear effects in the cell viability data with increasing dose/time.

In [None]:
target = 'igf-1_inhibitor'
target_df = train_features[train_targets[target] == 1].reset_index(drop=True)
target_features = target_df.values[:, 4:]
fig, ax = plt.subplots(2, 1, figsize=(20, 10))
plot_one_vs_class(ax[0], 6, target_features)
plot_one_vs_class(ax[1], 2, target_features)

## Take away an insight

You may also pick up on something else from this graph: **the larger the decrease in cell viability, the larger the gene data variability**. This makes perfect sense from a biologist's point of view. If the compound induces cell death, it will likely also cause some large changes to the production of mRNA.

So, next let's take a look at these correlations across the whole dataset. On the x axis we will put the average cell viability values. And on the y axis we will put the standard deviation of the gene features.

In [None]:
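# .values on the mixed-type frame is an object array, so cast to float.
x = np.mean(train_features.values[:, -100:].astype(float), axis=1)  # last 100 columns: cell viability
y = np.std(train_features.values[:, 4:-100].astype(float), axis=1)  # columns 4..-100: gene expression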

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(12, 12))
ax.scatter(x, y, c='blue')
ax.set_xlabel('Average Cell Viability')
ax.set_ylabel('Standard Deviation of Gene Expression')
plt.show()

## Dive deeper

We can see that there is indeed a strong correlation between these two variables: the lower the average cell viability, the higher the variability of the gene features. We can also see that there are very few positive average cell viability values.
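
To put a number on that visual impression, one option is a correlation coefficient. The sketch below is an illustration and was not part of the original analysis: `viability_vs_variability` is a hypothetical helper, and it assumes the same column layout as the slicing above (4 metadata columns first, cell viability features in the last 100 columns, gene features in between). Since low viability goes with high variability in the scatter plot, the coefficient should come out negative.

```python
import numpy as np
import pandas as pd


def viability_vs_variability(features: pd.DataFrame,
                             n_meta: int = 4, n_cell: int = 100) -> float:
    """Pearson correlation between mean cell viability and gene-expression std.

    Assumption: the first `n_meta` columns are metadata and the last
    `n_cell` columns are cell viability features.
    """
    cell = features.iloc[:, -n_cell:].to_numpy(dtype=float)
    gene = features.iloc[:, n_meta:-n_cell].to_numpy(dtype=float)
    x = cell.mean(axis=1)  # average cell viability per row
    y = gene.std(axis=1)   # gene-expression spread per row
    return float(np.corrcoef(x, y)[0, 1])
```

With the data loaded above, this would be called as `viability_vs_variability(train_features)`.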

There's also a single outlier in the top-left, an example with very high variability and low cell viability. Let's have a look at it:

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(20, 5))
plot_features(ax, train_features[y > 7].values[0, 4:])

Wow! So we can see this drug killed pretty much all the cells and had a drastic effect on the gene output. Although the drug killed all the cells, there appears to be a marked increase in the expression of some genes. Maybe these features are reversed?

So what kind of drug is this, anyways?

In [None]:
# Keep the names of the target columns that are set to 1 for the outlier row.
targets = train_targets.columns[1:][train_targets[y > 7].values[0, 1:].astype(bool)]
print('\n'.join(targets))

`proteasome_inhibitor` is a common target in our dataset with a high tendency towards cytotoxicity. Let's take a look at the average cell viability values for this target.

In [None]:
target = 'proteasome_inhibitor'
data = x[train_targets[target] == 1]
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.hist(data, 20, color='blue')
ax.set_xlabel('Average Cell Viability')
ax.set_ylabel('Count')
ax.set_xlim(-10, 10)
ax.set_title(target)
plt.show()

## Experiment and iterate

One takeaway you might draw from this plot is that for `proteasome_inhibitor` you could come up with a rule like:
 - If avg. cell viability is less than -2.5, then it could *possibly* be `proteasome_inhibitor`.
 - If avg. cell viability is greater than -2.5, then it can't be `proteasome_inhibitor`.
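
One way to sanity-check a threshold rule like this before submitting is to count how many positive training rows it would wrongly rule out. The following is a minimal sketch, not the author's code: `count_rule_violations` is a hypothetical helper that takes the per-row average cell viability (such as `x` above) and a 0/1 label column.

```python
import numpy as np


def count_rule_violations(avg_viability, labels, threshold=-2.5):
    """Return (violations, positives) for the rule
    "average viability above `threshold` means never this target".

    A non-zero violation count means at least one labelled positive
    contradicts the rule.
    """
    avg_viability = np.asarray(avg_viability, dtype=float)
    positive = np.asarray(labels).astype(bool)
    violations = positive & (avg_viability > threshold)
    return int(violations.sum()), int(positive.sum())
```

For the data in this notebook the call would look like `count_rule_violations(x, train_targets['proteasome_inhibitor'])`.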

However, when I put this rule to the test, it turned out not to hold. I made an experimental submission where, if a row had avg. cell viability greater than 0.0 (to be conservative), I set its probability to 0.0 for the `proteasome_inhibitor` target. Surprisingly, it resulted in a small increase in the reported loss.
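
The experiment described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact submission code: `apply_viability_rule` is a hypothetical name, the predictions frame is assumed to be row-aligned with the features frame, and the last 100 feature columns are assumed to be the cell viability features, as in the slicing used earlier.

```python
import pandas as pd


def apply_viability_rule(features: pd.DataFrame, preds: pd.DataFrame,
                         target: str = 'proteasome_inhibitor',
                         threshold: float = 0.0,
                         n_cell: int = 100) -> pd.DataFrame:
    """Return a copy of `preds` with `target` set to 0.0 wherever the row's
    average cell viability exceeds `threshold`.

    Assumption: `features` and `preds` list their rows in the same order.
    """
    cell = features.iloc[:, -n_cell:].to_numpy(dtype=float)
    above = cell.mean(axis=1) > threshold  # boolean mask, one entry per row
    out = preds.copy()
    out.loc[above, target] = 0.0
    return out
```

Here `preds` stands for whatever frame of model predictions is being submitted; it would be passed together with `test_features`.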