# Laboratory for Innovation Science at Harvard: Mechanisms of Action (MoA) Prediction

Welcome to the Mechanisms of Action (MoA) Prediction competition. The aim here is to predict how cells respond to various drugs. The way a drug produces this response is called its Mechanism of Action (MoA).

## What is the Mechanism of Action (MoA) of a drug? 

According to Wikipedia:
> In pharmacology, the term mechanism of action (MOA) refers to the specific biochemical interaction through which a drug substance produces its pharmacological effect. A mechanism of action usually includes mention of the specific molecular targets to which the drug binds, such as an enzyme or receptor. Receptor sites have specific affinities for drugs based on the chemical structure of the drug, as well as the specific action that occurs there.
>
> Drugs that do not bind to receptors produce their corresponding therapeutic effect by simply interacting with chemical or physical properties in the body. Common examples of drugs that work in this way are antacids and laxatives.
>
> In contrast, a mode of action (MoA) describes functional or anatomical changes, at the cellular level, resulting from the exposure of a living organism to a substance. 

## And why is it important?

Elucidating the mechanism of action of novel drugs and medications is important for several reasons:

* In the case of anti-infective drug development, the information permits anticipation of problems relating to clinical safety. Drugs disrupting the cytoplasmic membrane or electron transport chain, for example, are more likely to cause toxicity problems than those targeting components of the cell wall (peptidoglycan or β-glucans) or 70S ribosome, structures which are absent in human cells.
* By knowing the interaction between a certain site of a drug and a receptor, other drugs can be formulated in a way that replicates this interaction, thus producing the same therapeutic effects. Indeed, this method is used to create new drugs.
* It can help identify which patients are most likely to respond to treatment. Because the breast cancer medication trastuzumab is known to target protein HER2, for example, tumors can be screened for the presence of this molecule to determine whether or not the patient will benefit from trastuzumab therapy.
* It can enable better dosing because the drug's effects on the target pathway can be monitored in the patient. Statin dosage, for example, is usually determined by measuring the patient's blood cholesterol levels.
* It allows drugs to be combined in such a way that the likelihood of drug resistance emerging is reduced. By knowing what cellular structure an anti-infective or anticancer drug acts upon, it is possible to administer a cocktail that inhibits multiple targets simultaneously, thereby reducing the risk that a single mutation in microbial or tumor DNA will lead to drug resistance and treatment failure.
* It may allow other indications for the drug to be identified. Discovery that sildenafil inhibits phosphodiesterase-5 (PDE-5) proteins, for example, enabled this drug to be repurposed for pulmonary arterial hypertension treatment, since PDE-5 is expressed in pulmonary hypertensive lungs.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
%matplotlib inline

In [None]:
df_test_features = pd.read_csv('../input/lish-moa/test_features.csv')
df_train_features = pd.read_csv('../input/lish-moa/train_features.csv')
df_train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
sample_submission = pd.read_csv('../input/lish-moa/sample_submission.csv')
df_train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')


Let's have a look at the size and shape of the data files.

In [None]:
print(f'Train features:\t\t\t{df_train_features.shape}\nTest features:\t\t\t{df_test_features.shape}\nTrain targets (scored):\t\t{df_train_targets_scored.shape}\nTrain targets (non-scored):\t{df_train_targets_nonscored.shape}')


In [None]:
df_train_features.head(10)

The train_features table contains 23814 rows and 876 columns. The columns are as follows:
* sig_id - the sample ID
* cp_type - indicates samples treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle)
* cp_time - indicates treatment duration (24, 48 or 72 hours)
* cp_dose - indicates treatment dose, high or low (encoded as D1 or D2)
* g-0:771 - gene expression data
* c-0:99 - cell viability data
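
The column groups listed above can be recovered from the column names themselves. The following is a minimal sketch, assuming the `g-` and `c-` prefixes are the only naming convention in play. `split_feature_columns` is a hypothetical helper, and the toy frame with made-up values stands in for `df_train_features`.

```python
import pandas as pd

def split_feature_columns(df):
    """Group column names by the 'g-' (gene) and 'c-' (cell) prefixes."""
    gene_cols = [c for c in df.columns if c.startswith('g-')]
    cell_cols = [c for c in df.columns if c.startswith('c-')]
    # Everything else is treated as metadata (sig_id, cp_type, ...).
    meta_cols = [c for c in df.columns if c not in gene_cols and c not in cell_cols]
    return meta_cols, gene_cols, cell_cols

# Toy frame using the same naming scheme as train_features (values are made up).
toy = pd.DataFrame({
    'sig_id': ['id_a', 'id_b'],
    'cp_type': ['trt_cp', 'ctl_vehicle'],
    'cp_time': [24, 48],
    'cp_dose': ['D1', 'D2'],
    'g-0': [0.1, -0.2],
    'g-1': [1.3, 0.4],
    'c-0': [-0.5, 0.7],
})

meta_cols, gene_cols, cell_cols = split_feature_columns(toy)
print(len(meta_cols), len(gene_cols), len(cell_cols))
```

Applied to the real `df_train_features`, the description above implies 4 metadata, 772 gene and 100 cell columns.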

In [None]:
df_train_targets_scored.head(10)

In [None]:
df_train_targets_nonscored.head(10)

At this point it might be a good idea to check if the sample ids match exactly between the three train dataframes.

In [None]:
cols_match_1 = df_train_targets_scored['sig_id']==df_train_features['sig_id']
cols_match_2 = df_train_targets_nonscored['sig_id']==df_train_features['sig_id']
cols_match_1.value_counts(),cols_match_2.value_counts()

Looks good!
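
Note that an element-wise `==` between two Series relies on their indexes being identical. A slightly stricter variant, sketched here on toy frames standing in for the real ones (`same_ids` is a hypothetical helper), compares the raw values and returns a single boolean:

```python
import numpy as np
import pandas as pd

def same_ids(left, right, key='sig_id'):
    """True if both frames hold the same ids in the same order."""
    # np.array_equal also returns False when the lengths differ.
    return np.array_equal(left[key].to_numpy(), right[key].to_numpy())

# Toy frames (ids are made up).
features = pd.DataFrame({'sig_id': ['id_a', 'id_b', 'id_c']})
targets = pd.DataFrame({'sig_id': ['id_a', 'id_b', 'id_c']})
shuffled = pd.DataFrame({'sig_id': ['id_b', 'id_a', 'id_c']})

print(same_ids(features, targets))
print(same_ids(features, shuffled))
```
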
The two train_targets sets contain a large number of seemingly esoteric columns. These will be examined later.
Let's have a look at the distributions of the gene and cell columns.
There is a large number of columns in this set, so for now we'll just look at the first 10 of each type.

In [None]:
df_train_features[['g-0','g-1','g-2','g-3','g-4','g-5','g-6','g-7','g-8','g-9']].describe()

In [None]:
df_train_features[['c-0','c-1','c-2','c-3','c-4','c-5','c-6','c-7','c-8','c-9']].describe()

It looks like the data in these columns is bounded between -10 and 10.
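
The ten-column summaries only cover a small part of the table. Below is a minimal sketch of checking the bounds over every gene and cell column at once, assuming the `g-` and `c-` prefixes identify them. `value_range` is a hypothetical helper, shown here on made-up data.

```python
import pandas as pd

def value_range(df, prefixes=('g-', 'c-')):
    """Overall (min, max) across all columns whose name starts with one of the prefixes."""
    cols = [c for c in df.columns if c.startswith(prefixes)]
    values = df[cols]
    # First reduce per column, then across columns.
    return values.min().min(), values.max().max()

# Toy frame (values are made up).
toy = pd.DataFrame({
    'sig_id': ['id_a', 'id_b', 'id_c'],
    'g-0': [0.5, -1.2, 9.7],
    'c-0': [-10.0, 0.3, 1.1],
})

lo, hi = value_range(toy)
print(lo, hi)
```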

For the cell columns, the mean is about 0.25-0.5 lower than the median, indicating that these values are left-skewed.

The gene columns do not seem to exhibit any consistent skewness one way or the other.
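
The mean-versus-median comparison can be complemented with the sample skewness, which pandas exposes as `DataFrame.skew()`. Negative values indicate a longer left tail. A small sketch on synthetic data follows; the column names and values are illustrative only and are not taken from the competition files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

toy = pd.DataFrame({
    # Symmetric column: mean and median should nearly coincide.
    'g-0': rng.normal(0, 1, 1000),
    # Left-skewed column: most values just below zero, long tail toward negative values.
    'c-0': -rng.exponential(1.0, 1000),
})

summary = pd.DataFrame({
    'mean_minus_median': toy.mean() - toy.median(),
    'skew': toy.skew(),
})
print(summary)
```

On the real data, the same summary could be built from the gene and cell columns of `df_train_features` instead of `toy`.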

# Visualisation

First we'll visualise the distribution of the cell columns in the features set.

In [None]:
# The last 100 columns are the cell viability columns (c-0 to c-99).
cell_cols = df_train_features.iloc[:,-100:]

def draw_histograms(df, variables, n_rows, n_cols):
    fig=plt.figure(figsize=(18,30), dpi= 100, facecolor='w', edgecolor='k')
    for i, var_name in enumerate(variables):
        ax=fig.add_subplot(n_rows,n_cols,i+1)
        df[var_name].hist(bins=20,ax=ax)
        ax.set_title(var_name)
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()

draw_histograms(cell_cols, cell_cols.columns, 20, 5)

In [None]:
# Columns 4 to 103 are the first 100 gene expression columns (g-0 to g-99).
gene_cols = df_train_features.iloc[:,4:104]

# Reuse draw_histograms, which was defined in the previous cell.
draw_histograms(gene_cols, gene_cols.columns, 20, 5)

The cell data appears relatively homogeneous, apart from some accumulations around -10.

The gene data shows a bit more variance, with some columns being right or left skewed and having different ranges.
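
To put a number on the accumulation near -10, one option is to compute, per column, the share of samples at or below a threshold. The sketch below uses a threshold of -9.5, which is an arbitrary choice, and made-up data. `share_below` is a hypothetical helper.

```python
import pandas as pd

def share_below(df, threshold=-9.5):
    """Fraction of rows in each column with a value at or below the threshold."""
    return (df <= threshold).mean()

# Toy cell viability columns (values are made up).
toy_cells = pd.DataFrame({
    'c-0': [-10.0, -9.8, 0.2, 0.5],
    'c-1': [0.1, -0.3, 0.4, -10.0],
})

shares = share_below(toy_cells)
print(shares)
```

With the notebook's data, `share_below(cell_cols)` would apply the same calculation to the 100 cell columns selected above.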

To be continued...