# <h1 align='center'><font color = 'red'> 💊Mechanisms of Action (MoA) Prediction💊 </font></h1>
![](https://www.uicc.org/sites/main/files/styles/uicc_news_main_image/public/iStock_PIlls_1024x.jpg?itok=0LOJZO4I)

# _<h1 align='center'><font color='orange'>1. Introduction</font></h1>_

## _Quick Navigation_

* [1. Introduction](#1)
  * [1.1 About the Competition](#2)
  * [1.2 Importing relevant packages](#3)
* [2. EDA](#4)
  * [2.1 Basic Data Overview](#5)
  * [2.2 Categories Data Insight](#6)
    * [2.2.1 cp_type](#7)
    * [2.2.2 cp_time](#8)
    * [2.2.3 cp_dose](#9)
  * [2.3 Gene Expression and Cell Viability features](#10)
  * [2.4 Multivariate Analysis](#11)
    * [2.4.1 cp_type / cp_time](#12)
    * [2.4.2 cp_type / cp_dose](#13)
    * [2.4.3 cp_time / cp_dose](#14)
  * [2.5 Correlations](#15)
    * [2.5.1 Correlation b/w g- features](#16)
    * [2.5.2 Correlation b/w c- features](#17)
    * [2.5.3 High Correlation features](#18)
  * [2.6 Target Data Analysis](#19)
* [3. Baseline Model](#20)
  * [3.1 Import data](#21)
  * [3.2 Train model](#22)
  * [3.3 Compute log loss](#23)
  * [3.4 Submission](#24)

<a id="2"></a>
## _1.1 About the Competition_


**<font color='red'>Q. What is the Mechanism of Action (MoA) of a drug, and why is it important?</font>**<br><br>
Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.

**<font color='red'>Q. How do we determine the MoAs of a new drug?</font>**<br><br>
In this competition, you will have access to a unique dataset that combines gene expression and cell viability data. The data is based on a new technology that measures simultaneously (within the same samples) human cells’ responses to drugs in a pool of 100 different cell types (thus solving the problem of identifying ex-ante, which cell types are better suited for a given drug). In addition, you will have access to MoA annotations for more than 5,000 drugs in this dataset.

**<font color='red'>Q. How is the accuracy of a solution evaluated?</font>**<br><br>
Based on the MoA annotations, the accuracy of solutions will be evaluated on the average value of the logarithmic loss function applied to each drug-MoA annotation pair.

To learn more about the evaluation metric, [click here](https://www.kaggle.com/c/lish-moa/overview/evaluation).
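
The metric can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the score is the binary log loss averaged over every sample/MoA pair, with predictions clipped away from 0 and 1. The function name `mean_columnwise_log_loss`, the `eps` value and the toy arrays are assumptions made for the example, not taken from the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, MoA) pair."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so that log() stays finite.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    pair_losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Mean over samples for each target, then mean over targets.
    return pair_losses.mean(axis=0).mean()

# Toy example: 3 samples, 2 hypothetical MoA targets.
y_true = np.array([[1, 0], [0, 0], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.2, 0.1], [0.1, 0.6]])
score = mean_columnwise_log_loss(y_true, y_pred)
print(round(score, 4))
```

Confident wrong predictions are punished heavily by the logarithm, which is why predictions are usually clipped before scoring.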

<a id="3"></a>
## _1.2 Importing relevant packages_

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.metrics import log_loss
from category_encoders import CountEncoder
from sklearn.model_selection import KFold
from xgboost import XGBClassifier
from sklearn.multioutput import MultiOutputClassifier
import random

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

<a id="4"></a>
# _<h1 align='center'><font color='orange'>2. EDA</font></h1>_

<a id="5"></a>
## _2.1 Basic Data Overview_
First, let's see which files our working directory contains, along with the data description.

In [None]:
ROOT = '../input/lish-moa'
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Files -
- **train_features.csv** - Features for the training set. Features **g- signify gene expression** data, and **c- signify cell viability data**. cp_type indicates samples treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle); control perturbations have no MoAs; cp_time and cp_dose indicate treatment duration (24, 48, 72 hours) and dose (high or low).
- **train_targets_scored.csv** - The binary MoA targets that are scored.
- **train_targets_nonscored.csv** - Additional (optional) binary MoA responses for the training data. These are not predicted nor scored.
- **test_features.csv** - Features for the test data. You must predict the probability of each scored MoA for each row in the test data.
- **sample_submission.csv** - A submission file in the correct format.
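
Two properties described above can be checked directly once the files are loaded: that features and targets list the same `sig_id` values in the same order, and that control perturbations carry no MoA label. The sketch below uses small made-up frames in place of the real CSV files; the target names `moa_x` and `moa_y` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for train_features.csv and train_targets_scored.csv.
features = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "cp_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
    "g-0": [0.5, -0.1, 1.2],
})
targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "moa_x": [1, 0, 0],   # hypothetical MoA column
    "moa_y": [0, 0, 1],   # hypothetical MoA column
})

# Rows are aligned when both files list the same sig_id values in the same order.
rows_aligned = features["sig_id"].equals(targets["sig_id"])

# Control perturbations should carry no MoA label at all.
is_control = features["cp_type"] == "ctl_vehicle"
control_label_total = int(targets.loc[is_control, ["moa_x", "moa_y"]].to_numpy().sum())

print(rows_aligned, control_label_total)
```

The same two checks can be run on the real `train` and `target` frames after they are read in.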

In [None]:
# training data
train = pd.read_csv(f'{ROOT}/train_features.csv')

# training data targets
target = pd.read_csv(f'{ROOT}/train_targets_scored.csv')

# testing data
test = pd.read_csv(f'{ROOT}/test_features.csv')

In [None]:
train.head()

In [None]:
target.head()

In [None]:
print("No. of rows in training set : {}".format(train.shape[0]))
print('No. of columns in training set : {}'.format(train.shape[1]))
print('No. of rows in target set : {}'.format(target.shape[0]))
print('No. of columns in target set : {}'.format(target.shape[1]))

In [None]:
cols = train.columns
g_cols = [x for x in cols if x.startswith('g-')]
c_cols = [x for x in cols if x.startswith('c-')]
print(f"There are {train.shape[1]} columns in training and test set out of which :- ")
print(f"There are {len(g_cols)} columns starting with 'g-' i.e. gene expression features.")
print(f"There are {len(c_cols)} columns starting with 'c-' i.e. cell viability features.")
print("'sig_id', 'cp_type', 'cp_time', 'cp_dose' account for the remaining 4 columns.")

In [None]:
labels = ['g-','c-','sig_id','cp_type', 'cp_time', 'cp_dose']
values = [len(g_cols), len(c_cols), 1, 1, 1, 1]

# Use `hole` to create a donut-like pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values, hole=.5)])

fig.update_layout(title_text="Distribution of columns in train and test features.")
fig.show()

In [None]:
# Unique values for features
print("There are {} unique values in 'sig_id' column. So, there are no duplicate rows".format(train.sig_id.nunique()))
print("There are {} unique values in 'cp_type' column having values : {}".format(train.cp_type.nunique(), train.cp_type.unique()))
print("There are {} unique values in 'cp_time' column having values : {}".format(train.cp_time.nunique(), train.cp_time.unique()))
print("There are {} unique values in 'cp_dose' column having values : {}".format(train.cp_dose.nunique(), train.cp_dose.unique()))

<a id="6"></a>
## _2.2 Categories Data Insight_
<a id="7"></a>
### _2.2.1 cp_type_

In [None]:
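# look the counts up by label: positional access depends on which value is most frequent
cp_type_counts = train.cp_type.value_counts()
trt_cp_count = cp_type_counts['trt_cp']
ctl_vehicle_count = cp_type_counts['ctl_vehicle']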
print(f"In 'cp_type' feature there are {trt_cp_count} records with value 'trt_cp' and {ctl_vehicle_count} records " 
      + "with value 'ctl_vehicle'")

In [None]:
# fix the category order explicitly so that labels and counts always line up
x = ['trt_cp', 'ctl_vehicle']
y_train = [(train.cp_type == v).sum() for v in x]
y_test = [(test.cp_type == v).sum() for v in x]

fig = go.Figure(data=[
    go.Bar(name='test', x=x, y=y_test),
    go.Bar(name='train', x=x, y=y_train)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

<a id="8"></a>
### _2.2.2 cp_time_

In [None]:
print("There are {} distinct values for 'cp_time' : {}".format(train.cp_time.nunique(), train.cp_time.unique()))
print('No. of records where cp_time=24 : {}'.format(train[train.cp_time == 24].shape[0]))
print('No. of records where cp_time=48 : {}'.format(train[train.cp_time == 48].shape[0]))
print('No. of records where cp_time=72 : {}'.format(train[train.cp_time == 72].shape[0]))

In [None]:
# sort the time points so that labels and counts always line up
x = sorted(train.cp_time.unique())
y_train = [(train.cp_time == t).sum() for t in x]
y_test = [(test.cp_time == t).sum() for t in x]
fig = go.Figure(data=[
    go.Bar(name='test', x=x, y=y_test),
    go.Bar(name='train', x=x, y=y_train)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

<a id="9"></a>
### _2.2.3 cp_dose_

In [None]:
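# look the counts up by label: positional access depends on which value is most frequent
cp_dose_counts = train.cp_dose.value_counts()
d1_count = cp_dose_counts['D1']
d2_count = cp_dose_counts['D2']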
print(f"In 'cp_dose' feature there are {d1_count} records with value 'D1' and {d2_count} records " 
      + "with value 'D2'")

In [None]:
# fix the category order explicitly so that labels and counts always line up
x = ['D1', 'D2']
y_train = [(train.cp_dose == v).sum() for v in x]
y_test = [(test.cp_dose == v).sum() for v in x]

fig = go.Figure(data=[
    go.Bar(name='test', x=x, y=y_test),
    go.Bar(name='train', x=x, y=y_train)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()

<a id="10"></a>
## _2.3 Gene Expression and Cell Viability features_

In this section, we randomly pick a few gene expression (g-) and cell viability (c-) features and look at their distributions.

In [None]:
# distributions of few random g- features
random_g_cols = random.sample(g_cols, 10)

fig = make_subplots(rows=5, cols=2, subplot_titles=random_g_cols)

# place the 10 histograms on the 5x2 grid, filling it row by row
for i, feature in enumerate(random_g_cols):
    fig.add_trace(go.Histogram(x=train[feature], name=feature), row=i // 2 + 1, col=i % 2 + 1)

fig.update_layout(
    title_text='Distribution of a few random Gene Expression features',
    height = 1200,
    width=675
)

fig.show()

In [None]:
# distributions of few random c- features
random_c_cols = random.sample(c_cols, 10)

fig = make_subplots(rows=5, cols=2, subplot_titles=random_c_cols)

# place the 10 histograms on the 5x2 grid, filling it row by row
for i, feature in enumerate(random_c_cols):
    fig.add_trace(go.Histogram(x=train[feature], name=feature), row=i // 2 + 1, col=i % 2 + 1)

fig.update_layout(
    title_text='Distribution of a few random Cell Viability features',
    height = 1200,
    width=675
)

fig.show()

### ***It is good to see that both g- and c- features look roughly normally distributed, although some of them are skewed.<br>(OUTLIERS ALERT!)***
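
The visual impression can be backed by numbers. Below is a minimal sketch that computes skewness and excess kurtosis per column with pandas and flags strongly skewed features. The data is synthetic, the column names `g-sym` and `g-skewed` are made up, and the cut-off of 1 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins: one symmetric feature and one right-skewed feature.
toy = pd.DataFrame({
    "g-sym": rng.normal(size=5000),
    "g-skewed": rng.exponential(size=5000),
})

skewness = toy.skew()          # close to 0 for a symmetric distribution
excess_kurtosis = toy.kurt()   # close to 0 for a normal distribution

# Flag features whose absolute skewness exceeds an arbitrary cut-off of 1.
skewed_features = skewness[skewness.abs() > 1].index.tolist()
print(skewed_features)
```

On the real data the same calls would be applied to `train[g_cols]` and `train[c_cols]`.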

<a id="11"></a>
## _2.4 Multivariate Analysis_
#### *Next, we analyse the following pairs of features:*
- *cp_type / cp_time*
- *cp_type / cp_dose*
- *cp_time / cp_dose*

<a id="12"></a>
### _2.4.1 cp_type / cp_time_

In [None]:
trt_cp_24 = train[(train.cp_type == 'trt_cp') & (train.cp_time == 24)].shape[0]
trt_cp_48 = train[(train.cp_type == 'trt_cp') & (train.cp_time == 48)].shape[0]
trt_cp_72 = train[(train.cp_type == 'trt_cp') & (train.cp_time == 72)].shape[0]
ctl_veh_24 = train[(train.cp_type == 'ctl_vehicle') & (train.cp_time == 24)].shape[0]
ctl_veh_48 = train[(train.cp_type == 'ctl_vehicle') & (train.cp_time == 48)].shape[0]
ctl_veh_72 = train[(train.cp_type == 'ctl_vehicle') & (train.cp_time == 72)].shape[0]

rel_df = pd.DataFrame([['trt_cp', trt_cp_24, trt_cp_48, trt_cp_72],
                       ['ctl_vehicle', ctl_veh_24, ctl_veh_48, ctl_veh_72]], 
                      columns = ['cp_type', '24', '48', '72'])

fig = px.bar(rel_df, x="cp_type", y=["24", "48", "72"], title="Cp_type V/S Cp_time")
fig.show()

<a id="13"></a>
### _2.4.2 cp_type / cp_dose_

In [None]:
trt_cp_d1 = train[(train.cp_type == 'trt_cp') & (train.cp_dose == 'D1')].shape[0]
trt_cp_d2 = train[(train.cp_type == 'trt_cp') & (train.cp_dose == 'D2')].shape[0]
ctl_veh_d1 = train[(train.cp_type == 'ctl_vehicle') & (train.cp_dose == 'D1')].shape[0]
ctl_veh_d2 = train[(train.cp_type == 'ctl_vehicle') & (train.cp_dose == 'D2')].shape[0]

rel_df = pd.DataFrame([['trt_cp', trt_cp_d1, trt_cp_d2],
                       ['ctl_vehicle', ctl_veh_d1, ctl_veh_d2]], 
                      columns = ['cp_type', 'D1', 'D2'])

fig = px.bar(rel_df, x="cp_type", y=['D1', 'D2'], title="Cp_type V/S Cp_dose")
fig.show()

<a id="14"></a>
### _2.4.3 cp_time / cp_dose_

In [None]:
d1_24 = train[(train.cp_dose == 'D1') & (train.cp_time == 24)].shape[0]
d1_48 = train[(train.cp_dose == 'D1') & (train.cp_time == 48)].shape[0]
d1_72 = train[(train.cp_dose == 'D1') & (train.cp_time == 72)].shape[0]
d2_24 = train[(train.cp_dose == 'D2') & (train.cp_time == 24)].shape[0]
d2_48 = train[(train.cp_dose == 'D2') & (train.cp_time == 48)].shape[0]
d2_72 = train[(train.cp_dose == 'D2') & (train.cp_time == 72)].shape[0]

rel_df = pd.DataFrame([['D1', d1_24, d1_48, d1_72],
                       ['D2', d2_24, d2_48, d2_72]], 
                      columns = ['cp_dose', '24', '48', '72'])

fig = px.bar(rel_df, x="cp_dose", y=["24", "48", "72"], title="Cp_dose V/S Cp_time")
fig.show()

<a id="15"></a>
## _2.5 Correlations_
In this section, we look at feature correlations and what they tell us :)

<a id="16"></a>
### _2.5.1 Correlation b/w 'g-' features_

In [None]:
# Correlation b/w random 40 g- features

g_df = train[random.sample(g_cols, 40)]
f = plt.figure(figsize=(19, 15))
plt.matshow(g_df.corr(), fignum=f.number)
plt.xticks(range(g_df.shape[1]), g_df.columns, fontsize=14, rotation=50)
plt.yticks(range(g_df.shape[1]), g_df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

<a id="17"></a>
### _2.5.2 Correlation b/w 'c-' features_

In [None]:
# Correlation b/w random 40 c- features

c_df = train[random.sample(c_cols, 40)]
f = plt.figure(figsize=(19, 15))
plt.matshow(c_df.corr(), fignum=f.number)
plt.xticks(range(c_df.shape[1]), c_df.columns, fontsize=14, rotation=50)
plt.yticks(range(c_df.shape[1]), c_df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

#### Let's now find pairs of features with high correlation.

<a id="18"></a>
### _2.5.3 Features with high correlations_
Here we use 0.9 as the threshold above which an absolute correlation counts as high.
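
To see which features are paired with which, the correlation matrix can be masked to its strict upper triangle so that every pair appears exactly once. This is a sketch on synthetic data, not the notebook's own method; the toy columns and the noise level are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
# Toy features: "c-1" is almost a copy of "c-0", "g-0" is independent noise.
toy = pd.DataFrame({
    "c-0": base,
    "c-1": base + 0.1 * rng.normal(size=1000),
    "g-0": rng.normal(size=1000),
})

threshold = 0.9
corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once
# and self-correlations are dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

The result is a Series indexed by feature pairs, which is convenient for sorting or for deciding which member of a pair to drop.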

In [None]:
cols = ['cp_time'] + g_cols + c_cols
# compute all pairwise correlations in one pass instead of one pair at a time
corr = train[cols].corr().abs().to_numpy()
# ignore the diagonal: every column correlates perfectly with itself
off_diagonal = ~np.eye(len(cols), dtype=bool)
is_high = ((corr > 0.9) & off_diagonal).any(axis=0)
all_columns = [col for col, high in zip(cols, is_high) if high]
print('Number of columns: ', len(all_columns))

In [None]:
all_cols_df = train[all_columns]
f = plt.figure(figsize=(19, 15))
plt.matshow(all_cols_df.corr(), fignum=f.number)
plt.xticks(range(all_cols_df.shape[1]), all_cols_df.columns, fontsize=14, rotation=50)
plt.yticks(range(all_cols_df.shape[1]), all_cols_df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)

<a id="19"></a>
## _2.6 Target Data Analysis_

In [None]:
target.head()

In [None]:
print("No. of rows : {}".format(target.shape[0]))
print("No. of columns : {}".format(target.shape[1]))

The target dataframe is sparse: it contains mostly zeros and only a few ones.<br><br>
So let's count the number of ones.

### _2.6.1 Check rows and columns for all same values_

In [None]:
target_copy = target.copy()

In [None]:
# Checking columns for all same values

same_value_cols = []
colwise_sum = ['colwise_sum']
moa_df = target_copy.iloc[:, 1:]
for col in moa_df.columns:
    colwise_sum.append(moa_df[col].sum())
    if moa_df[col].sum() == 0:
        same_value_cols.append(col)
        
print(f"There are {len(same_value_cols)} column(s) with all values the same (all zeros).")

# Append the colwise_sum list as last row to our target dataframe. We will use this row later.
target_copy.loc[len(target_copy)] = colwise_sum

In [None]:
# Checking rows for all same values

moa_df = target_copy.iloc[:-1,1:]
rowwise_sum = moa_df.sum(axis=1)
rowsum_zero_idx = []
# use row_sum as the loop variable so the built-in sum() is not shadowed
for i, row_sum in enumerate(rowwise_sum):
    if row_sum == 0:
        rowsum_zero_idx.append(i)

print(f"There are {len(rowsum_zero_idx)} drug samples whose values are all the same, i.e. all-zero MoA labels.")

# Append this rowwise sum to target dataframe. We will use this column later.
target_copy['rowwise_sum'] = rowwise_sum

### _2.6.2 Non-zero elements_

In [None]:
# counting the number of non-zeros in the MoA label columns only
# (sig_id is excluded: its non-empty strings would all count as non-zero)
moa_labels = target.iloc[:, 1:]
print("Total number of elements in target set : {}".format(moa_labels.size))
print("No. of non-zero elements in target set : {}".format(np.count_nonzero(moa_labels)))

#### A non-zero count larger than the number of rows (23814) would mean that some rows hold more than one non-zero value. The row-wise sums computed above are the more direct check: any drug sample whose sum exceeds 1 has more than one MoA label.<br><br>

#### Let's look at:
- The 50 drug samples with the most MoA labels.

### _2.6.3 Top 50 - Drug samples with highest number of MoA labels under them_

In [None]:
target_copy = target_copy.sort_values('rowwise_sum', ascending=False)
temp_df = target_copy.iloc[:50,:]
fig = px.bar(temp_df, x='sig_id', y='rowwise_sum')
fig.show()
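
Beyond the top 50, the full distribution of labels per sample can be tabulated by counting how often each row-wise sum occurs. This is a minimal sketch on a made-up target table; the MoA column names are hypothetical.

```python
import pandas as pd

# Toy target table; the MoA column names are made up for illustration.
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "moa_x": [1, 0, 0, 1],
    "moa_y": [1, 0, 0, 0],
    "moa_z": [0, 0, 1, 0],
})

# Sum only the label columns; sig_id is an identifier, not a label.
labels_per_sample = toy_targets.drop(columns="sig_id").sum(axis=1)

# How many samples carry 0, 1, 2, ... labels.
activation_counts = labels_per_sample.value_counts().sort_index()
print(activation_counts)
```

Applied to the real `target` frame, any count at an index above 1 corresponds to samples with several MoA labels.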

In [None]:
# Number of activations in targets for every sample

# Work in progress...

<a id="20"></a>
# _<h1 align='center'><font color='orange'>3. Baseline Model</font></h1>_

In [None]:
# Initialize variables
SEED = 42
NFOLDS = 5
np.random.seed(SEED)
ROOT = '../input/lish-moa/'

<a id="21"></a>
## _3.1 Import data_

In [None]:
train = pd.read_csv(ROOT + 'train_features.csv')
targets = pd.read_csv(ROOT + 'train_targets_scored.csv')

test = pd.read_csv(ROOT + 'test_features.csv')
sub = pd.read_csv(ROOT + 'sample_submission.csv')

# drop id col
X = train.iloc[:,1:].to_numpy()
X_test = test.iloc[:,1:].to_numpy()
y = targets.iloc[:,1:].to_numpy() 

In [None]:
# Build the ML Pipeline

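# 'gpu_hist' needs a GPU runtime; XGBoost 2.0+ deprecates it in favour of
# tree_method='hist' together with device='cuda'
classifier = MultiOutputClassifier(XGBClassifier(tree_method='gpu_hist'))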

# columns 0 and 2 of X are cp_type and cp_dose (sig_id was dropped above)
clf = Pipeline([('encode', CountEncoder(cols=[0, 2])),
                ('classify', classifier)
               ])
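
Count encoding replaces each category with the number of rows in which it occurs. The sketch below reproduces that idea with plain pandas on a toy frame. It illustrates the concept only and is not the `category_encoders` implementation; it also assumes that positions 0 and 2 of `X` correspond to `cp_type` and `cp_dose`.

```python
import pandas as pd

# Toy frame with the two categorical columns the pipeline encodes.
toy = pd.DataFrame({
    "cp_type": ["trt_cp", "trt_cp", "ctl_vehicle", "trt_cp"],
    "cp_dose": ["D1", "D2", "D1", "D1"],
})

encoded = toy.copy()
for col in ["cp_type", "cp_dose"]:
    # Replace each category with the number of rows that share it.
    encoded[col] = toy[col].map(toy[col].value_counts())
print(encoded)
```

Tree models only need the categories to be distinguishable as numbers, so a frequency-based encoding is a cheap choice for two low-cardinality columns.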

In [None]:
# set parameters for Pipeline classifier

params = {'classify__estimator__colsample_bytree': 0.6522,
          'classify__estimator__gamma': 3.6975,
          'classify__estimator__learning_rate': 0.0503,
          'classify__estimator__max_delta_step': 2.0706,
          'classify__estimator__max_depth': 10,
          'classify__estimator__min_child_weight': 31.5800,
          'classify__estimator__n_estimators': 166,
          'classify__estimator__subsample': 0.8639
         }

_ = clf.set_params(**params)

<a id="22"></a>
## _3.2 Train model_

In [None]:
oof_preds = np.zeros(y.shape)
test_preds = np.zeros((test.shape[0], y.shape[1]))
oof_losses = []
kf = KFold(n_splits=NFOLDS)
for fn, (trn_idx, val_idx) in enumerate(kf.split(X, y)):
    print('Starting fold: ', fn)
    X_train, X_val = X[trn_idx], X[val_idx]
    y_train, y_val = y[trn_idx], y[val_idx]
    
    # drop where cp_type==ctl_vehicle (baseline)
    ctl_mask = X_train[:,0]=='ctl_vehicle'
    X_train = X_train[~ctl_mask,:]
    y_train = y_train[~ctl_mask]
    
    clf.fit(X_train, y_train)
    val_preds = clf.predict_proba(X_val) # list with one (n_samples, 2) array per target
    val_preds = np.array(val_preds)[:,:,1].T # keep P(class 1), shape (n_samples, n_targets)
    oof_preds[val_idx] = val_preds
    
    loss = log_loss(np.ravel(y_val), np.ravel(val_preds))
    oof_losses.append(loss)
    preds = clf.predict_proba(X_test)
    preds = np.array(preds)[:,:,1].T # take the positive class
    test_preds += preds / NFOLDS
    
print(oof_losses)
print('Mean OOF loss across folds', np.mean(oof_losses))
print('STD OOF loss across folds', np.std(oof_losses))
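
The `np.array(...)[:,:,1].T` step in the loop is easier to follow on a small example. The sketch below mimics the list that a multi-output classifier returns from `predict_proba` and shows how it becomes a samples-by-targets matrix. The sizes and random values are made up, and it assumes every per-target array has two columns, which may not hold if a target has only one class in a training fold.

```python
import numpy as np

n_samples, n_targets = 4, 3
rng = np.random.default_rng(0)

# Mimic the list returned by predict_proba for a multi-output classifier:
# one (n_samples, 2) array per target, columns = [P(class 0), P(class 1)].
positive = rng.uniform(size=(n_targets, n_samples))
proba_list = [np.column_stack([1 - p, p]) for p in positive]

stacked = np.array(proba_list)         # shape (n_targets, n_samples, 2)
positive_probs = stacked[:, :, 1].T    # shape (n_samples, n_targets)
print(positive_probs.shape)
```

After the transpose, row `i` holds the predicted probabilities of all targets for sample `i`, which matches the layout of `y` and of the submission file.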

<a id="23"></a>
## _3.3 Compute log loss_

In [None]:
# set control train preds to 0
control_mask = train['cp_type']=='ctl_vehicle'
oof_preds[control_mask] = 0

print('OOF log loss: ', log_loss(np.ravel(y), np.ravel(oof_preds)))

<a id="24"></a>
## _3.4 Submission_

In [None]:
# set control test preds to 0
control_mask = test['cp_type']=='ctl_vehicle'
test_preds[control_mask] = 0

In [None]:
# create the submission file
sub.iloc[:,1:] = test_preds
sub.to_csv('submission.csv', index=False)

### <font color='blue'>Thanks for reading this kernel. I hope you gained as much insight from reading it as I did from writing it. If you liked it, an UPVOTE is highly appreciated. If you are interested in more such content, feel free to follow me! ;)</font>