# Mechanisms of Action (MoA) Prediction

The Connectivity Map, a project within the Broad Institute of MIT and Harvard, together with the Laboratory for Innovation Science at Harvard (LISH), presents this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.

What is the Mechanism of Action (MoA) of a drug? And why is it important?

In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.

How do we determine the MoAs of a new drug?

One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs.

In this competition, you will have access to a unique dataset that combines gene expression and cell viability data. The data is based on a new technology that simultaneously measures, within the same samples, human cells’ responses to drugs in a pool of 100 different cell types. This removes the need to identify ex ante which cell types are best suited to a given drug. In addition, you will have access to MoA annotations for more than 5,000 drugs in this dataset.

As is customary, the dataset has been split into training and testing subsets. Your task is to use the training dataset to develop an algorithm that automatically assigns one or more MoA classes to each case in the test set. Because a drug can have multiple MoA annotations, the task is formally a multi-label classification problem.
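
To make the multi-label framing concrete, here is a minimal sketch of how such targets can be encoded as a binary matrix with one column per MoA. The drug names and MoA labels (`drug_1`, `moa_a`, and so on) are hypothetical placeholders, not values from the competition data.

```python
import pandas as pd

# Hypothetical annotations: each drug maps to zero or more MoA labels
annotations = {
    "drug_1": ["moa_a"],
    "drug_2": ["moa_a", "moa_c"],
    "drug_3": [],
}

# One column per distinct label, sorted for a stable column order
labels = sorted({moa for moas in annotations.values() for moa in moas})

# Multi-hot encoding: 1 if the drug carries the label, 0 otherwise
targets = pd.DataFrame(
    [[int(label in moas) for label in labels] for moas in annotations.values()],
    index=list(annotations.keys()),
    columns=labels,
)
print(targets)
```

Unlike multi-class classification, a row here may contain several ones (`drug_2`) or none at all (`drug_3`).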

How is the accuracy of a solution evaluated?

Solutions are scored against the MoA annotations using the logarithmic loss function: the loss is computed for each drug-MoA annotation pair and then averaged over all pairs. Lower values are better.
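
The averaged logarithmic loss can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the official scoring code: the function name `mean_log_loss` and the clipping bound `eps` are choices made here to keep the logarithm finite.

```python
import numpy as np

def mean_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss averaged over every (sample, MoA) pair."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs (eps is an assumption)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return losses.mean()

# Two samples, two MoA targets
y_true = np.array([[1, 0], [0, 1]])
uninformed = np.full((2, 2), 0.5)   # always predicts 0.5
score_uninformed = mean_log_loss(y_true, uninformed)
score_perfect = mean_log_loss(y_true, y_true)

print("Uninformed:", score_uninformed)
print("Perfect:   ", score_perfect)
```

Predicting 0.5 everywhere yields a loss of ln 2 (about 0.693), while perfect predictions yield a loss close to zero.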

If successful, you’ll help to develop an algorithm to predict a compound’s MoA given its cellular signature, thus helping scientists advance the drug discovery process.

![](https://images.financialexpress.com/2020/01/drugs-1.jpg?w=1200&h=800&imflag=true)

**Data Description**

In this competition, you will predict multiple Mechanism of Action (MoA) targets for each sample (sig_id), given inputs such as gene expression data and cell viability data.

***Two notes:***

* the training data has an additional (optional) set of MoA labels that are not included in the test data and not used for scoring.
* the re-run dataset has approximately 4x the number of examples seen in the Public test.

**Files**
* **train_features.csv** - Features for the training set. Features prefixed with g- are gene expression data, and features prefixed with c- are cell viability data. cp_type indicates whether a sample was treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle); control perturbations have no MoAs. cp_time and cp_dose indicate treatment duration (24, 48, or 72 hours) and dose (high or low).
* **train_targets_scored.csv** - The binary MoA targets that are scored.
* **train_targets_nonscored.csv** - Additional (optional) binary MoA responses for the training data. These are neither predicted nor scored.
* **test_features.csv** - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.
* **sample_submission.csv** - A submission file in the correct format.
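
Because the feature groups are identified by column-name prefix, they can be separated with a simple filter. The sketch below uses a tiny made-up frame that only mimics the naming scheme described above. The values and the exact column names (`g-0`, `c-0`, and so on) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for train_features.csv (values are made up)
toy = pd.DataFrame({
    "sig_id": ["id_a", "id_b"],
    "cp_time": [24, 72],
    "g-0": [0.5, -1.2],
    "g-1": [1.1, 0.3],
    "c-0": [-0.4, 0.9],
})

# Split the columns by prefix into gene expression and cell viability groups
gene_cols = [col for col in toy.columns if col.startswith("g-")]
cell_cols = [col for col in toy.columns if col.startswith("c-")]

print("Gene expression columns:", gene_cols)
print("Cell viability columns: ", cell_cols)
```

The same two list comprehensions can be applied to the real training frame once it is loaded.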

# Load the Packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load the Data

In [None]:
df_Train = pd.read_csv('../input/lish-moa/train_features.csv')
df_Test = pd.read_csv('../input/lish-moa/test_features.csv')
df_Train_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
df_Train_unscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')

# Exploratory Data Analysis

In [None]:
# Preview the first five rows of the training features
df_Train.head()

In [None]:
df_Train_scored.head()

In [None]:
# Column names of the training features
df_Train.columns

In [None]:
# Column names of the scored targets
df_Train_scored.columns

In [None]:
# Summary statistics for the numeric columns of the training features
df_Train.describe()

In [None]:
# Column dtypes, non-null counts, and memory usage of the training features
df_Train.info()

In [None]:
# Column dtypes, non-null counts, and memory usage of the scored targets
df_Train_scored.info()

In [None]:
#Shape of the Train and Test Data
print('Shape of Train Data: ', df_Train.shape)
print('Shape of Train Scored Data: ', df_Train_scored.shape)
print('Shape of Train Unscored Data: ', df_Train_unscored.shape)
print('Shape of Test Data: ', df_Test.shape)

### Missing Values

In [None]:
#Null values in the Train Dataset
print('Null values in Train Data: \n', df_Train.isnull().sum())

In [None]:
#Null Values in the Test Dataset
print('Null Values in Test Data: \n', df_Test.isnull().sum())
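
With several hundred columns, the per-column printout above is truncated by pandas, so a compact summary can be easier to read. The following is a minimal sketch on a made-up frame. The variable names are illustrative, and the same calls can be applied to `df_Train` or `df_Test`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value (values are made up)
toy = pd.DataFrame({
    "g-0": [0.1, np.nan, 0.3],
    "c-0": [1.0, 2.0, 3.0],
})

null_counts = toy.isnull().sum()                 # missing values per column
cols_with_nulls = null_counts[null_counts > 0]   # keep only affected columns
total_nulls = int(null_counts.sum())             # grand total across the frame

print("Total missing values:", total_nulls)
print(cols_with_nulls)
```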

# Data Insight and Visualization

## Correlation

The heatmaps below show the pairwise correlations between the numeric columns within each table: the training features, the scored targets, and the non-scored targets.

Correlation, or dependence, is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although the term covers a broad class of statistical relationships, in common usage it most often refers to how close two variables are to having a linear relationship with each other.
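
As a small worked example of the linear sense of the term, the sketch below computes the Pearson correlation coefficient by hand. The helper name `pearson_r` and the sample values are illustrative. It shows that a perfectly linear relationship gives a coefficient of 1, while a clear but non-linear dependence can still give 0.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_linear = pearson_r(x, 3 * x + 1)    # exact linear relationship
r_quadratic = pearson_r(x, x ** 2)    # dependent, but not linearly

print("Linear:   ", r_linear)
print("Quadratic:", r_quadratic)
```

A near-zero cell in a correlation heatmap therefore rules out a linear relationship only, not dependence in general.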

In [None]:
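# Correlate numeric columns only; sig_id, cp_type and cp_dose are strings
corrmat = df_Train.select_dtypes(include='number').corr()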
f, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corrmat, square=True, vmax=.8)

In [None]:
# Correlate numeric columns only; sig_id is a string identifier
corrmat = df_Train_scored.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corrmat, square=True, vmax=.8)

In [None]:
# Correlate numeric columns only; sig_id is a string identifier
corrmat = df_Train_unscored.select_dtypes(include='number').corr()
f, ax = plt.subplots(figsize=(14,14))
sns.heatmap(corrmat, square=True, vmax=.8)
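
A heatmap over hundreds of columns is hard to read cell by cell, so it can help to list the most strongly correlated pairs directly. The sketch below is one possible approach, shown on a made-up frame. The helper `top_correlated_pairs` is hypothetical and not part of pandas or seaborn.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=3):
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data: b is nearly proportional to a, c is only weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
top_pairs = top_correlated_pairs(toy, n=3)
print(top_pairs)
```

Applied to `df_Train`, the same helper would rank feature pairs by absolute correlation, although computing the full matrix for all columns may take some time.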