# Mechanisms of Action (MoA) Prediction

The goal of this competition is to advance drug development by improving MoA prediction algorithms.
Scientists seek to identify a protein target associated with a disease and to develop a molecule that can modulate that target. A drug's mechanism of action (MoA) is the set of biochemical interactions through which it produces its pharmacological effect.
The data set combines gene expression and cell viability data.
The task is to use the training data to build a model that assigns one or more MoA classes to each case in the test set, which makes it a multi-label classification problem.
Models are evaluated with the log loss, which scores a classifier whose outputs are probabilities between 0 and 1. Log loss increases as the predicted probability diverges from the actual label.
In this project, after an exploratory data analysis and one-hot encoding into k-1 dummy variables, the data set is reduced with an autoencoder from 875 features to 28. I then apply t-SNE to the encoded features, colored by a few target labels, to look for visual patterns.
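
The log loss described above can be made concrete with a small example. This is a minimal sketch, assuming the score is the binary log loss averaged over all samples and all target columns; the function name `mean_columnwise_log_loss`, the clipping constant `eps`, and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss averaged over every sample and every target column."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() stays finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return losses.mean()

# Toy example (made-up values): 3 samples, 2 MoA labels
y_true = np.array([[1, 0], [0, 0], [0, 1]])
confident = np.array([[0.9, 0.1], [0.1, 0.1], [0.1, 0.9]])
hesitant = np.array([[0.6, 0.4], [0.4, 0.4], [0.4, 0.6]])

loss_confident = mean_columnwise_log_loss(y_true, confident)
loss_hesitant = mean_columnwise_log_loss(y_true, hesitant)
print(loss_confident, loss_hesitant)
```

Both sets of predictions fall on the correct side of 0.5, but the ones closer to the true labels get the lower loss.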



![](https://image.slidesharecdn.com/mechanismofdrugaction-131104071748-phpapp01/95/mechanism-of-drug-action-3-638.jpg?cb=1383549622)

# Notes

In [None]:
# Reference notebooks:
# https://www.kaggle.com/sinamhd9/mechanisms-of-action-moa-tutorial
# https://www.kaggle.com/fchmiel/xgboost-baseline-multilabel-classification

# Prepare Workspace

In [None]:
# Import libraries
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
#% matplotlib inline
import seaborn as sns

import statistics as st 
import scipy.stats as stats

from sklearn.preprocessing import MinMaxScaler

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from keras.layers import Input,Dense
from keras.models import Model

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Import Data

In [None]:
# Load the data sets
train = pd.read_csv('/kaggle/input/lish-moa/train_features.csv')
test = pd.read_csv('/kaggle/input/lish-moa/test_features.csv')
targets = pd.read_csv('/kaggle/input/lish-moa/train_targets_scored.csv')
targets_ns = pd.read_csv('/kaggle/input/lish-moa/train_targets_nonscored.csv')
submission = pd.read_csv('/kaggle/input/lish-moa/sample_submission.csv')

# Take a Peek at the Data

In [None]:
train.shape

In [None]:
test.shape

In [None]:
# Stack the train and test rows into a single frame with a fresh index
df = pd.concat([train, test], ignore_index=True)

In [None]:
# Look at dimension of data set and types of each attribute
df.info()

In [None]:
# train
# ['sig_id', 'cp_type', 'cp_dose'] == object
# ['cp_time'] == int

In [None]:
# Summarize attribute distributions of the data frame
df.describe(include = 'all').T

In [None]:
# Take a peek at the first rows of the data
df.head(10)

In [None]:
# Look at dimension of data set and types of each attribute
targets.info()

In [None]:
# Summarize attribute distributions of the data frame
targets.describe(include = 'all').T

In [None]:
# Take a peek at the first rows of the data
targets.head(10)

In [None]:
# Look at dimension of data set and types of each attribute
targets_ns.info()

In [None]:
# Summarize attribute distributions of the data frame
targets_ns.describe(include = 'all').T

In [None]:
# Take a peek at the first rows of the data
targets_ns.head(10)

### Handling missing values

In [None]:
# Check for missing values in both numeric and categorical features
feat_missing = []

for f in df.columns:
    missings = df[f].isnull().sum()
    if missings > 0:
        feat_missing.append(f)
        missings_perc = missings/df.shape[0]
        
        # printing summary of missing values
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

# How many variables have missing values?
print()
print('In total, there are {} variables with missing values'.format(len(feat_missing)))

# Target Variable Analysis

In [None]:
targets_df = targets.copy()
targets_df = targets_df.drop(['sig_id'], axis=1)

In [None]:
# Summarize the class distribution 
col_name = targets_df.columns
for var in col_name:
    count = pd.crosstab(index = targets_df[var], columns="count")
    percentage = pd.crosstab(index = targets_df[var], columns="frequency")/pd.crosstab(index = targets_df[var], columns="frequency").sum()
    print('\n',pd.concat([count, percentage], axis=1))
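
Printing one table per target produces a very long output. As a compact alternative, the mean of a binary column equals its share of positive cases, so the class balance of all targets can be summarized in one frame. This is a minimal sketch on a made-up toy frame; `toy_targets` and `summary` are illustrative names, and the values are not taken from the competition data.

```python
import pandas as pd

# Made-up labels for three of the target columns used later in this notebook
toy_targets = pd.DataFrame({
    '5-alpha_reductase_inhibitor': [0, 0, 0, 1],
    'antiprotozoal': [0, 1, 0, 1],
    'adenylyl_cyclase_activator': [0, 0, 0, 0],
})

# One row per target: number of positives and share of positives
summary = pd.DataFrame({
    'positives': toy_targets.sum(),
    'positive_rate': toy_targets.mean(),
}).sort_values('positive_rate', ascending=False)
print(summary)
```

On the real data the same two lines could be applied to `targets_df` to rank the targets from most to least frequent.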

In [None]:
# Plot the class counts of each target, four panels per figure
figures_per_time = 4
count = 0 
for var in targets_df.columns:
    x = targets_df[var]
    plt.figure(count//figures_per_time,figsize=(25,5))
    plt.subplot(1,figures_per_time,np.mod(count,figures_per_time)+1)
    # Pass the column by keyword so seaborn treats it as the x variable
    sns.countplot(x=x, color='g',linewidth=2)
    plt.xticks(rotation=45)
    count+=1

# Numerical Features Analysis

In [None]:
# gene features
gene_features = [cols for cols in df.columns if cols.startswith('g-')]

In [None]:
# Univariate analysis looking at Mean, Standard Deviation, Skewness and Kurtosis
for col in gene_features:
    print(col,
        '\nMean :', np.mean(df[col]),  
        '\nVariance :', np.var(df[col]),
        '\nStandard Deviation :', st.stdev(df[col]), 
        '\nSkewness :', stats.skew(df[col]), 
        '\nKurtosis :', stats.kurtosis(df[col]))

In [None]:
# Univariate analysis with density plots 
figures_per_time = 4
count = 0 
for var in gene_features:
    x = df[var]
    plt.figure(count//figures_per_time,figsize=(25,5))
    plt.subplot(1,figures_per_time,np.mod(count,4)+1)
    sns.kdeplot(x, color='g',linewidth=2)
    plt.xticks(rotation=45)
    count+=1

In [None]:
# Univariate analysis with histogram plots 
figures_per_time = 4
count = 0 
for var in gene_features:
    x = df[var]
    plt.figure(count//figures_per_time,figsize=(25,5))
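    plt.subplot(1,figures_per_time,np.mod(count,figures_per_time)+1)
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(x, bins=10, color='r', kde=True, stat='density')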
    plt.xticks(rotation=45)
    count+=1

In [None]:
# Univariate analysis with box plots 
figures_per_time = 4
count = 0 
for var in gene_features:
    x = df[var]
    plt.figure(count//figures_per_time,figsize=(25,5))
    plt.subplot(1,figures_per_time,np.mod(count,figures_per_time)+1)
    # Pass the column by keyword to get a horizontal box plot
    sns.boxplot(x=x, color='b')
    plt.xticks(rotation=45)
    count+=1

In [None]:
# cell features
cell_features = [cols for cols in df.columns if cols.startswith('c-')]

In [None]:
# Univariate analysis looking at Mean, Standard Deviation, Skewness and Kurtosis
for col in cell_features:
    print(col,
        '\nMean :', np.mean(df[col]),  
        '\nVariance :', np.var(df[col]),
        '\nStandard Deviation :', st.stdev(df[col]), 
        '\nSkewness :', stats.skew(df[col]), 
        '\nKurtosis :', stats.kurtosis(df[col]))

In [None]:
# Univariate analysis with density plots 
figures_per_time = 4
count = 0 
for var in cell_features:
    x = df[var]
    plt.figure(count//figures_per_time,figsize=(25,5))
    plt.subplot(1,figures_per_time,np.mod(count,4)+1)
    sns.kdeplot(x, color='g',linewidth=2)
    plt.xticks(rotation=45)
    count+=1

In [None]:
# Univariate analysis with histogram plots 
figures_per_time = 4
count = 0 
for var in cell_features:
    x = df[var]
    plt.figure(count//figures_per_time,figsize=(25,5))
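    plt.subplot(1,figures_per_time,np.mod(count,figures_per_time)+1)
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(x, bins=10, color='r', kde=True, stat='density')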
    plt.xticks(rotation=45)
    count+=1

In [None]:
# Univariate analysis with box plots 
figures_per_time = 4
count = 0 
for var in cell_features:
    x = df[var]
    plt.figure(count//figures_per_time,figsize=(25,5))
    plt.subplot(1,figures_per_time,np.mod(count,figures_per_time)+1)
    # Pass the column by keyword to get a horizontal box plot
    sns.boxplot(x=x, color='b')
    plt.xticks(rotation=45)
    count+=1

# Categorical Features Analysis

In [None]:
fcat = ['cp_type', 'cp_dose']
for col in fcat:
    count = pd.crosstab(index = df[col], columns="count")
    percentage = pd.crosstab(index = df[col], columns="frequency")/pd.crosstab(index = df[col], columns="frequency").sum()
    tab = pd.concat([count, percentage], axis=1)
    plt.figure(figsize=(5,5))
    sns.countplot(x=df[col], data=df, palette="Set1")
    plt.xticks(rotation=45)
    print(tab)
    plt.show()

# Feature Engineering

In [None]:
# One-Hot Encoding into k-1 dummy Variables
dummy_df = pd.concat([pd.get_dummies(df.cp_type, drop_first=True), 
                         pd.get_dummies(df.cp_dose, drop_first=True)], axis=1)
dummy_df = dummy_df.astype(int)
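
To see what the k-1 encoding produces, here is a minimal sketch on a toy frame. The category values (`trt_cp`, `ctl_vehicle`, `D1`, `D2`) are assumed for illustration; `toy` and `toy_dummies` are hypothetical names.

```python
import pandas as pd

# Toy frame with assumed category values for the two categorical columns
toy = pd.DataFrame({
    'cp_type': ['trt_cp', 'ctl_vehicle', 'trt_cp'],
    'cp_dose': ['D1', 'D2', 'D1'],
})

# drop_first=True drops the first category (in sorted order) of each column,
# so a two-level column becomes a single 0/1 indicator
toy_dummies = pd.concat(
    [pd.get_dummies(toy['cp_type'], drop_first=True),
     pd.get_dummies(toy['cp_dose'], drop_first=True)],
    axis=1,
).astype(int)
print(toy_dummies)
```

Each two-level column is reduced to one indicator, which avoids carrying two perfectly collinear dummy columns per feature.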

# Handling Data Set

In [None]:
# Whole data set
df_new = pd.concat([df, dummy_df], axis=1)

In [None]:
X_all = df_new.copy()

In [None]:
# Drop the identifier and the original categorical columns (replaced by the dummies)
X_all = X_all.drop(['sig_id', 'cp_type', 'cp_dose'], axis=1)

# Scaling Data

In [None]:
# Normalization of data
scaling = MinMaxScaler()
X_all_sc = scaling.fit_transform(X_all)
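
With its default range of 0 to 1, `MinMaxScaler` rescales each column as (x - min) / (max - min). The following is a minimal NumPy sketch of that formula on made-up values. It assumes no column is constant, a case the sketch does not handle.

```python
import numpy as np

# Made-up matrix: 3 rows, 2 columns on very different scales
X = np.array([[24.0, -1.0],
              [48.0,  0.5],
              [72.0,  2.0]])

# Column-wise minimum and maximum
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Rescale every column to the [0, 1] interval
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled)
```

After scaling, both columns share the same range, so no feature dominates the reconstruction loss of the autoencoder just because of its units.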


# Dimensionality Reduction

In [None]:
# Use the number of PCA components that explain 70% of the variance as the number of autoencoder features
pca = PCA(0.70, random_state=0)
pca_X_all_sc = pca.fit_transform(X_all_sc)
pca.n_components_
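
Passing a float such as `0.70` to `PCA` asks for the smallest number of components whose cumulative explained variance exceeds that fraction. The idea can be sketched with NumPy alone. The random toy data, the column scales, and the names `explained_ratio`, `cumulative` and `n_keep` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 rows, 5 columns with very different variances
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

# Center the data; the squared singular values are proportional
# to the variance captured by each principal component
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio exceeds the threshold
threshold = 0.70
n_keep = int(np.searchsorted(cumulative, threshold, side='right') + 1)
print(cumulative, n_keep)
```

In the notebook, the same quantity is read from `pca.n_components_` and reused as the size of the autoencoder bottleneck.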

In [None]:
# Single fully-connected layer as encoder and as decoder
# Size of the encoded representation
encoding_dim = int(pca.n_components_)

# Input layer: one unit per feature/column (875 in this data set)
n_features = X_all_sc.shape[1]
input_data = Input(shape=(n_features,))
# "encoded" is the encoded representation of the input
encoded = Dense(encoding_dim, activation='relu')(input_data)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(n_features, activation='relu')(encoded)

# This model maps an input to its reconstruction
autoencoder = Model(input_data, decoded)

In [None]:
# create a separate encoder model:
# This model maps an input to its encoded representation
encoder = Model(input_data, encoded)
# create a separate decoder model:
# This is encoded input
encoded_input = Input(shape=(encoding_dim,))
# Retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# Create the decoder model
decoder = Model(encoded_input, decoder_layer(encoded_input))

In [None]:
# Configure the model
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
# train the model
np.random.seed(0)
autoencoder.fit(X_all_sc,
                X_all_sc,
                epochs=50,
                shuffle=True)

autoencoder.summary()

In [None]:
# Encode the data with the trained encoder
encoded_data = encoder.predict(X_all_sc)

In [None]:
# New data set with one column per encoded feature
autoencoder_df = pd.DataFrame(data = encoded_data, columns=['autoencoder'+str(i) for i in range(encoded_data.shape[1])])
autoencoder_df

# t-SNE Visualization of the Reduced Data

In [None]:
tsne_df = TSNE(n_components=2, random_state=0).fit_transform(autoencoder_df)

In [None]:
tsne_data = pd.DataFrame(tsne_df, columns=['x', 'y'], index=autoencoder_df.index)

In [None]:
# Only the training rows have labels, so keep the rows present in both frames
dff = pd.concat([targets_df['5-alpha_reductase_inhibitor'], tsne_data], axis=1, join='inner')

# Show the diagram
fig, ax = plt.subplots(figsize=(10, 10))

with sns.plotting_context("notebook", font_scale=1.0):
     sns.scatterplot(x='x',
                        y='y',
                        hue='5-alpha_reductase_inhibitor',
                        palette=sns.color_palette("husl", 2),
                        data=dff,
                        ax=ax)

ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')

plt.show()

In [None]:
# Only the training rows have labels, so keep the rows present in both frames
dff = pd.concat([targets_df['antiprotozoal'], tsne_data], axis=1, join='inner')

# Show the diagram
fig, ax = plt.subplots(figsize=(10, 10))

with sns.plotting_context("notebook", font_scale=1.0):
     sns.scatterplot(x='x',
                        y='y',
                        hue='antiprotozoal',
                        palette=sns.color_palette("Set2", 2),
                        data=dff,
                        ax=ax)

ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')

plt.show()

In [None]:
# Only the training rows have labels, so keep the rows present in both frames
dff = pd.concat([targets_df['adenylyl_cyclase_activator'], tsne_data], axis=1, join='inner')

# Show the diagram
fig, ax = plt.subplots(figsize=(10, 10))

with sns.plotting_context("notebook", font_scale=1.0):
     sns.scatterplot(x='x',
                        y='y',
                        hue='adenylyl_cyclase_activator',
                        palette=sns.color_palette("Set1", 2),
                        data=dff,
                        ax=ax)

ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')

plt.show()