<img src="https://www.yourgenome.org/sites/default/files/illustrations/diagram/gene_expression_transcription_yourgenome.png">

<h6><center>Image credit: Genome Research Limited</center></h6>
<h1><center>Mechanism of Action (MoA) Prediction</center></h1>

# Introduction

This notebook is an introduction to generating features for the [Mechanisms of Action (MoA)](https://www.kaggle.com/c/lish-moa) competition in Python. We will first go through simple feature generation and then through advanced feature generation and preprocessing.

If you like it, feel free to upvote :)

Let’s get started!

**Importing libraries and reading the dataset**

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Understanding the data

In this competition, our aim is to predict the Mechanism of Action (MoA) response(s) of different samples (`sig_id`) in `test_features.csv`, given information about the responses of 100 different types of human cells to various drugs in `train_features.csv`.

For the samples (`sig_id`) in `train_features.csv` we are also given additional (optional) binary MoA responses that we don’t need to predict, in `train_targets_nonscored.csv`.

In [None]:
# Load data
test_features = pd.read_csv('../input/lish-moa/test_features.csv')
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
sample_submission = pd.read_csv('../input/lish-moa/sample_submission.csv')

In [None]:
# copy the original dataframe for feature generation
test_features1 = test_features.copy()
train_features1 = train_features.copy()
train_targets_nonscored1 = train_targets_nonscored.copy()
train_targets_scored1 = train_targets_scored.copy()
sample_submission1 = sample_submission.copy()

In [None]:
# take a look at train_features.csv before digging deeper
train_features.head(5)

### train_features.csv - what does it tell us

> **Observations**:
* we have 23814 observations and 876 columns (`sig_id`, `cp_type`, `cp_time`, `cp_dose`, 772 `g-` columns and 100 `c-` columns)
* `sig_id`: unique identifier of each sample.
* `cp_type`: tells us whether a sample was treated with a compound (`trt_cp`) or with a control perturbation (`ctl_vehicle`).
* `cp_time`: treatment duration in hours (24, 48 or 72).
* `cp_dose`: indicates whether the sample was treated with a high or low dose.

*PS: columns starting with `g-` signify gene expression data, and columns starting with `c-` signify cell viability data.*

> **Hypothesis**:
* Since we are given different feature types, we will first use common feature generation methods to create some new features, and then move on to advanced feature generation using domain knowledge about the features and the task.


# Feature Engineering

Feature generation is the process of creating new features. It helps by making model training simpler and more effective. Sometimes we can engineer these features just by looking at the data types; otherwise we need domain knowledge to create new features.

Feature engineering also depends on the model you are going to use: CART (Classification and Regression Trees), neural networks or gradient boosting decision trees. Make sure to preprocess the features in a way that the model can understand them.

In [None]:
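# select the floating-point columns (gene expression and cell viability)
# note: the np.float alias was removed from NumPy, so pass 'float' instead
float_cols = train_features.select_dtypes(include='float').columns
# exponentiate every float column in place
# the same transform is applied to test so both frames stay consistent
train_features[float_cols] = np.exp(train_features[float_cols])
test_features[float_cols] = np.exp(test_features[float_cols])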
    
train_features.head()

In [None]:
# number of whole days covered by cp_time (cp_time is given in hours)
# note: the np.int alias was removed from NumPy, so use the built-in int
train_features['days'] = (train_features['cp_time'] // 24).astype(int)
test_features['days'] = (test_features['cp_time'] // 24).astype(int)
train_features.head()

Similarly, we can create features expressing the time in `minutes` and `seconds`.

In [None]:
train_features['minutes'] = train_features['cp_time']*60
test_features['minutes'] = test_features['cp_time']*60
train_features['seconds'] = train_features['cp_time']*3600
test_features['seconds'] = test_features['cp_time']*3600
train_features.head()

For gene expression data and cell viability data, we can take the fractional part of each floating-point value to show our model a much clearer difference between values.

In [None]:
# select the floating-point columns (gene expression and cell viability)
float_cols = train_features.select_dtypes(include='float').columns
# for each column, add a new column holding its fractional part
for col in float_cols:
    train_features['frac_'+col] = train_features[col] % 1
    test_features['frac_'+col] = test_features[col] % 1
    
train_features.head()
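A small aside on what the `% 1` operation returns, since the gene and cell columns contain negative values: Python's modulo follows the sign of the divisor, so the result always lies in [0, 1). The sketch below uses made-up values (not competition data) and contrasts this with `np.modf`, which keeps the sign of the input. Which variant suits a given model better is an open question worth testing rather than a settled choice.

```python
import numpy as np
import pandas as pd

# toy values, not taken from the competition data
s = pd.Series([2.5, -1.25, 0.75, -0.5])

# floor-based remainder: always in [0, 1), even for negative inputs
frac_mod = s % 1

# np.modf returns (fractional part, integer part); the fraction keeps the sign
frac_signed = np.modf(s.to_numpy())[0]

print(frac_mod.tolist())     # [0.5, 0.75, 0.75, 0.5]
print(frac_signed.tolist())  # [0.5, -0.25, 0.75, -0.5]
```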

For categorical features, we can generate features based on interactions between different feature types. For example, we can combine the dose with the treatment time to mark samples that were treated with a high dose for 24, 48 or 72 hours.

In [None]:
# feature interaction between cp_time and cp_dose
train_features['cp_time_dose'] = train_features['cp_time'].astype(str)+train_features['cp_dose']
test_features['cp_time_dose'] = test_features['cp_time'].astype(str)+test_features['cp_dose']

train_features.head()
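The interaction column is a string, so models without native categorical support need it encoded first. Here is a minimal sketch on a toy frame, assuming one-hot encoding is acceptable for the model. The `_` separator and the `td` prefix are my own illustrative choices and are not used elsewhere in this notebook.

```python
import pandas as pd

# toy frame that reuses the competition's column names
df = pd.DataFrame({
    "cp_time": [24, 48, 72, 24],
    "cp_dose": ["D1", "D2", "D1", "D2"],
})

# combine both columns into one categorical interaction feature
df["cp_time_dose"] = df["cp_time"].astype(str) + "_" + df["cp_dose"]

# one-hot encode the interaction: one indicator column per combination
dummies = pd.get_dummies(df["cp_time_dose"], prefix="td")

print(df["cp_time_dose"].tolist())  # ['24_D1', '48_D2', '72_D1', '24_D2']
print(sorted(dummies.columns))
```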

## Advanced Feature Generation

To understand the data and explore it more deeply, we need to do EDA. Since @artgor, @headsortails and @isaienkov have already performed EDA, I will only discuss the plots that help us generate new features and connect them to domain knowledge, so that we can follow the process of generating features based on EDA. Please go through these links for the important parts of the EDA process.

**Useful Links:**
* [Explorations of Action - MoA EDA](https://www.kaggle.com/headsortails/explorations-of-action-moa-eda)
* [Mechanisms of Action (MoA) Prediction. EDA](https://www.kaggle.com/isaienkov/mechanisms-of-action-moa-prediction-eda)

**Let's separate gene expression data and cell viability data and study them individually.**

In [None]:
GENES = [col for col in train_features1.columns if col.startswith('g-')]
CELLS = [col for col in train_features1.columns if col.startswith('c-')]

In [None]:
# study gene expression data
g_train_features = train_features1[GENES]
g_test_features = test_features1[GENES]
g_train_features.head()

In [None]:
# study cell viability data
c_train_features = train_features1[CELLS]
c_test_features = test_features1[CELLS]
c_train_features.head()

In [None]:
plt.plot(g_train_features.iloc[0].sort_values(), '.');

OK, the gene expression data is not normally distributed. It looks as if the feature values are logits.

A simple data transformation can help here. This is one of the awesome things you can learn from statistics books: for logits, an exp transformation usually works well.

However, the competition's [Useful Links](https://www.kaggle.com/c/lish-moa/overview/useful-links) page points to [Corsello et al. “Discovering the anticancer potential of non-oncology drugs by systematic viability profiling,” Nature Cancer](https://doi.org/10.1038/s43018-019-0018-6), whose extended data section states:
>  Median Fluorescence Intensity (MFI) values are calculated from fluorescence values for each replicate-condition-cell line combination and are log2-transformed. 

So a more advanced reverse-engineering technique, one that inverts the log2 transform, can be applied.
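To make the difference concrete, here is a minimal sketch on made-up numbers. It assumes the values really were log2-transformed, as the quote above describes. Whether that holds for the competition features is not confirmed here. Under that assumption the exact inverse is `2 ** x` (`np.exp2`), while `np.exp` inverts the natural logarithm instead.

```python
import numpy as np

# made-up intensities standing in for raw measurements
raw = np.array([0.5, 1.0, 4.0, 10.0])
logged = np.log2(raw)

recovered = np.exp2(logged)    # 2 ** x, the inverse of log2
not_inverse = np.exp(logged)   # e ** x, the inverse of the natural log

print(np.allclose(recovered, raw))    # True
print(np.allclose(not_inverse, raw))  # False
```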

<img src="https://raw.githubusercontent.com/gauravchopracg/share_code/master/HeLa%20cell%20line%20dose%E2%80%93response%20curves%20with%20PDE3A%20genetic%20loss.png">

*(work in progress)*

In [None]:
# exponentiate every gene expression column (vectorized, numpy is already imported above)
g_exp_train = np.exp(g_train_features)
g_exp_test = np.exp(g_test_features)

In [None]:
plt.plot(g_exp_train.iloc[0].sort_values(), '.');

# Feature Preprocessing

Feature preprocessing depends heavily on the model you are going to use and on how the features relate to the target variable. For example, with [Catboost](https://catboost.ai/) all we need to do is generate new features, and its built-in functions will try to find feature interactions based on the target variable we have to predict. With a neural network, we first need to explore the dependencies between the features and the target variable, decide whether they are linear or non-linear, and then preprocess the features accordingly.

This part of the notebook is taken from [Feature Engineering Techniques](https://www.kaggle.com/c/ieee-fraud-detection/discussion/108575). Please upvote that discussion, as the ideas were taken from there.

## Frequency Encoding

Frequency encoding is a powerful technique that allows different models to see whether column values are rare or common.

In [None]:
# copies of the dataframes to apply frequency encoding
train_features_fe = train_features1.copy()
test_features_fe = test_features1.copy()

temp = train_features_fe['cp_dose'].value_counts().to_dict()
train_features_fe['cp_dose_counts'] = train_features_fe['cp_dose'].map(temp)
train_features_fe.head()
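The cell above encodes only the training frame with raw counts. As a hedged sketch on toy data, the same idea can be applied to a test frame by reusing the statistics learned on train, here as relative frequencies. The `D3` value is a hypothetical unseen category added only to show how missing keys can be handled. Filling with 0 is one possible choice, not the only one.

```python
import pandas as pd

# toy stand-ins for the train and test frames
train = pd.DataFrame({"cp_dose": ["D1", "D1", "D2", "D1"]})
test = pd.DataFrame({"cp_dose": ["D2", "D1", "D3"]})  # "D3" is hypothetical

# relative frequency of each category, learned from the training rows only
freq = train["cp_dose"].value_counts(normalize=True)

train["cp_dose_freq"] = train["cp_dose"].map(freq)
# categories never seen in training map to NaN; treat them as frequency 0
test["cp_dose_freq"] = test["cp_dose"].map(freq).fillna(0.0)

print(train["cp_dose_freq"].tolist())  # [0.75, 0.75, 0.25, 0.75]
print(test["cp_dose_freq"].tolist())   # [0.25, 0.75, 0.0]
```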

## Aggregations / Group Statistics

Providing models with group statistics allows them to determine whether a value is common or rare for a particular group. You calculate group statistics by giving pandas three things: the group, the variable of interest, and the type of statistic. For example, the next cell adds the mean of `g-0` within each `cp_dose` group:

In [None]:
temp = train_features_fe.groupby('cp_dose')['g-0'].agg(['mean']).rename({'mean':'g-0_cp_dose_mean'},axis=1)
train_features_fe = pd.merge(train_features_fe,temp,on='cp_dose',how='left')
train_features_fe.head()
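An alternative to the `agg` plus `merge` pattern is `groupby(...).transform`, which returns one value per row and skips the merge. The sketch below uses a toy frame. The `_std` and `_diff` column names are illustrative additions of mine, not features used in this notebook.

```python
import pandas as pd

# toy frame that reuses the competition's column names
df = pd.DataFrame({
    "cp_dose": ["D1", "D1", "D2", "D2"],
    "g-0": [1.0, 3.0, 10.0, 20.0],
})

# transform returns a result aligned with the original index, one value per row
grp = df.groupby("cp_dose")["g-0"]
df["g-0_cp_dose_mean"] = grp.transform("mean")
df["g-0_cp_dose_std"] = grp.transform("std")

# how far each row sits from the mean of its own group
df["g-0_cp_dose_diff"] = df["g-0"] - df["g-0_cp_dose_mean"]

print(df["g-0_cp_dose_mean"].tolist())  # [2.0, 2.0, 15.0, 15.0]
print(df["g-0_cp_dose_diff"].tolist())  # [-1.0, 1.0, -5.0, 5.0]
```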

**These are all the features I generated and experimented with. Use your imagination to create as many features as you can.**

## Feature Selection

Feature selection is not necessary most of the time. When I experimented with a CatBoost classifier and a split neural network, removing features made both the CV and LB scores worse in both cases, but I haven't tested all models. So, make sure to generate feature interactions in the data, then preprocess your features based on the model, and after that check whether feature selection helps.


The code cell below is taken from the [amazing notebook by simakov - keras Multilabel Neural Network v1.2](https://www.kaggle.com/simakov/keras-multilabel-neural-network-v1-2).

In [None]:
# Source: https://www.kaggle.com/simakov/keras-multilabel-neural-network-v1-2
# add seed and change create_model

'''
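from typing import Tuple, List, Callable, Any

# imports the code below relies on (the original cell assumed they were already in scope)
import numpy as np
from tqdm import tqdm
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint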

from sklearn.utils import check_random_state  # type: ignore

### from eli5
def iter_shuffled(X, columns_to_shuffle=None, pre_shuffle=False,
                  random_state=None):
    rng = check_random_state(random_state)

    if columns_to_shuffle is None:
        columns_to_shuffle = range(X.shape[1])

    if pre_shuffle:
        X_shuffled = X.copy()
        rng.shuffle(X_shuffled)

    X_res = X.copy()
    for columns in tqdm(columns_to_shuffle):
        if pre_shuffle:
            X_res[:, columns] = X_shuffled[:, columns]
        else:
            rng.shuffle(X_res[:, columns])
        yield X_res
        X_res[:, columns] = X[:, columns]



def get_score_importances(
        score_func,  # type: Callable[[Any, Any], float]
        X,
        y,
        n_iter=5,  # type: int
        columns_to_shuffle=None,
        random_state=None
    ):
    rng = check_random_state(random_state)
    base_score = score_func(X, y)
    scores_decreases = []
    for i in range(n_iter):
        scores_shuffled = _get_scores_shufled(
            score_func, X, y, columns_to_shuffle=columns_to_shuffle,
            random_state=rng, base_score=base_score
        )
        scores_decreases.append(scores_shuffled)

    return base_score, scores_decreases



def _get_scores_shufled(score_func, X, y, base_score, columns_to_shuffle=None,
                        random_state=None):
    Xs = iter_shuffled(X, columns_to_shuffle, random_state=random_state)
    res = []
    for X_shuffled in Xs:
        res.append(-score_func(X_shuffled, y) + base_score)
    return res

def metric(y_true, y_pred):
    metrics = []
    for i in range(y_pred.shape[1]):
        if y_true[:, i].sum() > 1:
            metrics.append(log_loss(y_true[:, i], y_pred[:, i].astype(float)))
    return np.mean(metrics)   

perm_imp = np.zeros(train.shape[1])
all_res = []
for n, (tr, te) in enumerate(KFold(n_splits=7, random_state=0, shuffle=True).split(train_targets)):
    print(f'Fold {n}')

    model = create_model(len(train.columns))
    checkpoint_path = f'repeat:{seed}_Fold:{n}.hdf5'
    # min_delta replaces the deprecated epsilon argument
    reduce_lr_loss = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, verbose=1, min_delta=1e-4, mode='min')
    cb_checkpt = ModelCheckpoint(checkpoint_path, monitor = 'val_loss', verbose = 0, save_best_only = True,
                                     save_weights_only = True, mode = 'min')
    model.fit(train.values[tr],
                  train_targets.values[tr],
                  validation_data=(train.values[te], train_targets.values[te]),
                  epochs=35, batch_size=128,
                  callbacks=[reduce_lr_loss, cb_checkpt], verbose=2
                 )
        
    model.load_weights(checkpoint_path)
        
    def _score(X, y):
        pred = model.predict(X)
        return metric(y, pred)

    base_score, local_imp = get_score_importances(_score, train.values[te], train_targets.values[te], n_iter=1, random_state=0)
    all_res.append(local_imp)
    perm_imp += np.mean(local_imp, axis=0)
    print('')
    
top_feats = np.argwhere(perm_imp < 0).flatten()
top_feats
'''

# print(top_feats)
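Because the cell above is disabled and needs a trained Keras model, here is a small NumPy-only sketch of the same permutation-importance idea. Everything in it is a toy assumption: the data is random, and `predict` is a hypothetical fixed scoring function standing in for a trained model. It keeps the sign convention used above (base score minus shuffled score, with log loss as the score), so important features end up negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(float)  # only column 0 carries signal

def predict(X):
    # hypothetical fixed "model": a sigmoid on column 0
    return 1.0 / (1.0 + np.exp(-5.0 * X[:, 0]))

def log_loss(y_true, p, eps=1e-15):
    # binary cross-entropy with clipping to avoid log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

base = log_loss(y, predict(X))
perm_imp = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    X_perm = X.copy()
    # shuffle one column to break its link with the target
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    # base score minus shuffled score: negative means the loss got worse
    perm_imp[j] = base - log_loss(y, predict(X_perm))

top_feats = np.argwhere(perm_imp < 0).flatten()
print(top_feats)  # [0]
```

Columns 1 and 2 are ignored by the toy scoring function, so shuffling them leaves the loss unchanged and only column 0 is selected.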

Thanks, and I hope you discovered something new while reading this. In this competition, the Laboratory for Innovation Science at Harvard presented us with a dataset to develop an algorithm that predicts a compound’s MoA given its cellular signature, which will help scientists advance the drug discovery process. I hope to contribute more. As my exams are still going on, I'll be updating this notebook whenever possible. Happy Learning!