# Mechanisms of Action (MoA) Prediction
Can you improve the algorithm that classifies drugs based on their biological activity?

![Mechanisms of Action](https://www.urmc.rochester.edu/MediaLibraries/URMCMedia/education/graduate/phd/pharmacology-and-physiology/images/shutterstock_647026912.jpg)

In this notebook we will explore the data provided for the competition, understand the metric, and create a baseline model.

In [None]:
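import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from itertools import cycle

# Newer pandas versions require the full option name
pd.set_option('display.max_columns', 50)
# 'seaborn-dark' was renamed to 'seaborn-v0_8-dark' in matplotlib 3.6
style = 'seaborn-v0_8-dark' if 'seaborn-v0_8-dark' in plt.style.available else 'seaborn-dark'
plt.style.use(style)
color_pal = plt.rcParams['axes.prop_cycle'].by_key()['color']
color_cycle = cycle(color_pal)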

First we can look at the data format. Everything is stored as a CSV, and the largest file is only 150MB.

In [None]:
!ls -GFlash --color ../input/lish-moa/

In [None]:
ss = pd.read_csv('../input/lish-moa/sample_submission.csv')
test_features = pd.read_csv('../input/lish-moa/test_features.csv')
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')

# What is the target we are trying to predict?
In this competition, you predict multiple Mechanism of Action (MoA) targets for each sample (`sig_id`), given inputs such as gene expression data and cell viability data.

We are provided with "scored" and "nonscored" targets for the train set. We focus first on the "scored" dataset, since these are the binary MoA targets used for scoring.

In [None]:
ax = train_targets_scored.drop('sig_id', axis=1) \
    .sum(axis=1) \
    .value_counts() \
    .sort_index() \
    .plot(kind='bar',
         figsize=(15, 5),
          color=next(color_cycle)
         )
ax.set_title('Number of Positive Targets per Observation (train)', fontsize=20)
plt.show()

In [None]:
ax = train_targets_scored.drop('sig_id', axis=1) \
    .sum() \
    .sort_values(ascending=False) \
    .head(30) \
    .sort_values() \
    .plot(kind='barh',
         figsize=(15, 10),
          color=next(color_cycle)
         )
ax.set_title('Top 30 Scored Targets in Train Set', fontsize=20)
plt.show()

# Features
Column names indicate the type of feature:
- Columns starting with `g-` hold gene expression data.
- Columns starting with `c-` hold cell viability data.
- `cp_type` indicates whether a sample was treated with a compound (`trt_cp`) or with a control perturbation (`ctl_vehicle`); control perturbations have no MoAs.
- `cp_time` and `cp_dose` indicate treatment duration (24, 48, or 72 hours) and dose (high or low).
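
Because control perturbations have no MoAs, one possible post-processing step is to force the predictions for control rows to zero. The sketch below illustrates the idea on a toy array. The helper `zero_control_rows` and the toy values are assumptions for illustration, and the control label is passed in explicitly so it can be checked against the values actually present in `cp_type`.

```python
import numpy as np

def zero_control_rows(preds, cp_type, control_label):
    """Return a copy of preds with all-zero rows wherever cp_type == control_label.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    preds = np.array(preds, dtype=float)  # np.array copies, so the input is untouched
    is_control = np.asarray(cp_type) == control_label
    preds[is_control] = 0.0
    return preds

# Toy data: three samples, two targets; the second sample is a control.
toy_preds = np.array([[0.2, 0.7],
                      [0.1, 0.3],
                      [0.5, 0.5]])
toy_cp_type = np.array(["trt_cp", "ctl_vehicle", "trt_cp"])  # assumed label values

cleaned = zero_control_rows(toy_preds, toy_cp_type, control_label="ctl_vehicle")
print(cleaned)
```

With the real data, the same call could take `pred_test` and `test_features['cp_type'].values` once predictions exist.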

In [None]:
GENE_COLS = [c for c in train_features.columns if c[:2] == 'g-']
CELL_COLS = [c for c in train_features.columns if c[:2] == 'c-']
print('Number of gene columns:', len(GENE_COLS))
print('Number of cell columns:', len(CELL_COLS))

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(15, 5))
ax = train_features['cp_type'] \
    .value_counts() \
    .plot(kind='barh',
          figsize=(15, 3),
          color=next(color_cycle),
         ax=axs[0])
ax.set_title('training set', fontsize=15)

ax = test_features['cp_type'] \
    .value_counts() \
    .plot(kind='barh',
          figsize=(15, 3),
          color=next(color_cycle),
         ax=axs[1])
ax.set_title('public test set', fontsize=15)
fig.suptitle('"cp_type"', fontsize=20)
plt.show()

fig, axs = plt.subplots(1, 2, figsize=(15, 5))
ax = train_features['cp_dose'] \
    .value_counts() \
    .plot(kind='barh',
          figsize=(15, 3),
          color=next(color_cycle),
         ax=axs[0])
ax.set_title('training set', fontsize=15)

ax = test_features['cp_dose'] \
    .value_counts() \
    .plot(kind='barh',
          figsize=(15, 3),
          color=next(color_cycle),
         ax=axs[1])
ax.set_title('public test set', fontsize=15)
fig.suptitle('treatment dose', fontsize=20)
plt.show()

fig, axs = plt.subplots(1, 2, figsize=(15, 5))
ax = train_features['cp_time'] \
    .value_counts() \
    .plot(kind='barh',
          figsize=(15, 3),
          color=next(color_cycle),
         ax=axs[0])
ax.set_title('training set', fontsize=15)

ax = test_features['cp_time'] \
    .value_counts() \
    .plot(kind='barh',
          figsize=(15, 3),
          color=next(color_cycle),
         ax=axs[1])
ax.set_title('public test set', fontsize=15)
fig.suptitle('treatment duration (hours)', fontsize=20)
plt.show()

## Cell Features

In [None]:
fig, ax = plt.subplots(10, 10, figsize=(15, 15),
                       sharex=True,
                       sharey=True)
ax = ax.flatten()

for i, c in enumerate(CELL_COLS):
    train_features[c].plot(kind='hist',
                           ax=ax[i],
                           title=c,
                           bins=20,
                          color=next(color_cycle)
                          )
#     break
plt.suptitle('Distribution of 100 Cell Features', fontsize=20, y=1)
plt.tight_layout()
plt.show()

## Gene Features

In [None]:
ax = train_features.set_index('sig_id') \
    .sample(10)[GENE_COLS] \
    .T.plot(figsize=(15, 5))
plt.suptitle('Gene Features for 10 Random Samples', fontsize=20)
ax.get_legend().remove()
plt.show()

# Evaluation Criteria

For every `sig_id` you will predict the probability that the sample had a positive response for each MoA target. For $N$ `sig_id` rows and $M$ MoA targets, you will make $N \times M$ predictions. Submissions are scored by the log loss:


$$ \text{score} = - \frac{1}{M}\sum_{m=1}^{M} \frac{1}{N} \sum_{i=1}^{N} \left[ y_{i,m} \log(\hat{y}_{i,m}) + (1 - y_{i,m}) \log(1 - \hat{y}_{i,m})\right] $$

- $N$ is the number of `sig_id` observations in the test data ($i = 1, \dots, N$)
- $M$ is the number of scored MoA targets ($m = 1, \dots, M$)
- $\hat{y}_{i,m}$ is the predicted probability of a positive MoA response for a `sig_id`
- $y_{i,m}$ is the ground truth, 1 for a positive response, 0 otherwise
- $\log()$ is the natural (base $e$) logarithm

Note: the submitted predicted probabilities are replaced with $\max(\min(p, 1 - 10^{-15}), 10^{-15})$. A smaller log loss is better.
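
To make the formula concrete, here is a minimal numpy sketch of the score, including the clipping step from the note. The function name `mean_columnwise_log_loss` and the toy arrays are assumptions for illustration; this is not the official scoring code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average the per-target log loss over all targets (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to [eps, 1 - eps] so that log() never sees 0 or 1.
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_element = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    # Inner mean over samples (axis=0), outer mean over targets.
    return per_element.mean(axis=0).mean()

# Toy example: N = 2 samples, M = 2 targets.
y_true = np.array([[1, 0],
                   [0, 0]])
y_pred = np.array([[0.9, 0.1],
                   [0.2, 0.0]])
score = mean_columnwise_log_loss(y_true, y_pred)
print(f'toy score: {score:0.4f}')

# A confidently wrong prediction is heavily penalized, but stays finite
# thanks to the clipping.
worst_case = mean_columnwise_log_loss([[1]], [[0.0]])
print(f'worst case for a single prediction: {worst_case:0.2f}')
```

Since every target has the same number of rows, averaging over samples first and then over targets equals the mean over all $N \times M$ elements.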


    

In [None]:
from sklearn.metrics import log_loss
def kaggle_metric_np(targets, preds):
    """
    Kaggle metric for MoA competition targets and preds
    in numpy format.
    """
    assert targets.shape[1] == 206
    assert preds.shape[1] == 206
    metrics = []
    for t in range(206):
        metrics.append(log_loss(targets[:, t], preds[:, t], labels=[0, 1]))
    return np.mean(metrics)

# Multilabel Model

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
# from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import LinearSVC 
from sklearn.metrics import log_loss
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

LABEL_ENCODE_COLS = ['cp_type','cp_time','cp_dose']
for l in LABEL_ENCODE_COLS:
    le = LabelEncoder()
    train_features[f'{l}_le'] = le.fit_transform(train_features[l])
    test_features[f'{l}_le'] = le.transform(test_features[l])

FEATURES = GENE_COLS + CELL_COLS + ['cp_type_le','cp_time_le','cp_dose_le']
TARGETS = [t for t in train_targets_scored.columns if t != 'sig_id']
X = train_features[FEATURES].values
X_test = test_features[FEATURES].values
y = train_targets_scored[TARGETS].values

# Fix the seed so the train/validation split is reproducible
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=42)
X_full = np.concatenate([X, X_test])

# Standard Scale
scale = StandardScaler()
scale.fit(X_full)
X_train = scale.transform(X_train)
X_val = scale.transform(X_val)
X_test = scale.transform(X_test)

# Apply PCA
# pca = PCA(n_components=100, svd_solver='full')
# pca.fit(X_full)
# X_train = pca.transform(X_train)
# X_val = pca.transform(X_val)
# X_test = pca.transform(X_test)
print(X_train.shape, X_val.shape, X_test.shape)
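
The cell above fits the scaler on the training and test features together. If using test-set statistics in preprocessing is a concern, an alternative is to estimate the mean and standard deviation on the training split only. A minimal numpy sketch of that variant follows; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, not scikit-learn functions, and the toy arrays stand in for the real feature matrices.

```python
import numpy as np

def fit_standardizer(X_fit):
    """Estimate per-column mean and standard deviation from X_fit only."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    # Constant columns have std == 0; use 1.0 there to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale X with previously estimated statistics."""
    return (X - mean) / std

# Toy data: the second column is constant in the training split.
toy_train = np.array([[1.0, 10.0],
                      [3.0, 10.0],
                      [5.0, 10.0]])
toy_val = np.array([[3.0, 12.0]])

mean, std = fit_standardizer(toy_train)
toy_train_scaled = apply_standardizer(toy_train, mean, std)
toy_val_scaled = apply_standardizer(toy_val, mean, std)
print(toy_train_scaled)
print(toy_val_scaled)
```

The validation and test splits are then scaled with the training statistics, so no information from held-out rows enters the fit.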

In [None]:
import warnings
warnings.simplefilter("ignore")

clf = OneVsRestClassifier(SVC(probability=True))
clf.fit(X_train, y_train)
pred_train = clf.predict_proba(X_train)
pred_val = clf.predict_proba(X_val)
pred_test = clf.predict_proba(X_test)

In [None]:
train_score = kaggle_metric_np(y_train, pred_train)
val_score = kaggle_metric_np(y_val, pred_val)
print(f'train score {train_score:0.4f}, val score {val_score:0.4f}')
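
To put these scores in context, it can help to compare them against a constant predictor that outputs each target's training-set positive rate for every row. The sketch below shows the idea on toy arrays with numpy; `prevalence_baseline` is a hypothetical helper and the toy labels are made up for illustration.

```python
import numpy as np

def prevalence_baseline(y_fit, n_rows):
    """Repeat each target's positive rate in y_fit for n_rows predictions."""
    rates = np.asarray(y_fit, dtype=float).mean(axis=0)
    return np.tile(rates, (n_rows, 1))

# Toy labels: 4 training rows, 2 validation rows, 2 targets.
toy_y_train = np.array([[1, 0],
                        [0, 0],
                        [0, 0],
                        [1, 0]])
toy_y_val = np.array([[1, 0],
                      [0, 0]])

baseline_val = prevalence_baseline(toy_y_train, n_rows=len(toy_y_val))

# Score the baseline with the same clipped log loss used in the competition.
p = np.clip(baseline_val, 1e-15, 1 - 1e-15)
baseline_loss = -(toy_y_val * np.log(p)
                  + (1 - toy_y_val) * np.log(1 - p)).mean(axis=0).mean()
print(f'baseline val score {baseline_loss:0.4f}')
```

On the real data, `prevalence_baseline(y_train, len(y_val))` could be passed to `kaggle_metric_np` together with `y_val`; a model that does not beat this number has learned little beyond the label frequencies.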

# Make Submission

In [None]:
sub = pd.DataFrame(pred_test, columns=TARGETS)
# Put sig_id first so the column order matches sample_submission.csv
sub.insert(0, 'sig_id', test_features['sig_id'].values)

In [None]:
sub.shape, ss.shape

In [None]:
sub.to_csv('submission.csv', index=False)