https://www.kaggle.com/c/lish-moa/overview

The Connectivity Map, a project within the Broad Institute of MIT and Harvard, together with the Laboratory for Innovation Science at Harvard (LISH), presents this challenge with the goal of advancing drug development through improvements to MoA prediction algorithms.

What is the Mechanism of Action (MoA) of a drug? And why is it important?

In the past, scientists derived drugs from natural products or were inspired by traditional remedies. Very common drugs, such as paracetamol, known in the US as acetaminophen, were put into clinical use decades before the biological mechanisms driving their pharmacological activities were understood. Today, with the advent of more powerful technologies, drug discovery has changed from the serendipitous approaches of the past to a more targeted model based on an understanding of the underlying biological mechanism of a disease. In this new framework, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism-of-action or MoA for short.

How do we determine the MoAs of a new drug?

One approach is to treat a sample of human cells with the drug and then analyze the cellular responses with algorithms that search for similarity to known patterns in large genomic databases, such as libraries of gene expression or cell viability patterns of drugs with known MoAs.

In this competition, you will have access to a unique dataset that combines gene expression and cell viability data. The data is based on a new technology that measures simultaneously (within the same samples) human cells’ responses to drugs in a pool of 100 different cell types (thus solving the problem of identifying ex-ante, which cell types are better suited for a given drug). In addition, you will have access to MoA annotations for more than 5,000 drugs in this dataset.

As is customary, the dataset has been split into testing and training subsets. Hence, your task is to use the training dataset to develop an algorithm that automatically labels each case in the test set as one or more MoA classes. Note that since drugs can have multiple MoA annotations, the task is formally a multi-label classification problem.

How to evaluate the accuracy of a solution?

Based on the MoA annotations, the accuracy of solutions will be evaluated on the average value of the logarithmic loss function applied to each drug-MoA annotation pair.
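
As a rough illustration of this metric, the sketch below computes the logarithmic loss for every sample-MoA pair and averages over all pairs. The function name `mean_columnwise_log_loss` and the clipping value `eps` are assumptions made here for illustration; they are not taken from the competition code.

```python
import numpy as np

def mean_columnwise_log_loss(y_true, y_pred, eps=1e-15):
    """Average binary log loss over every (sample, MoA) pair."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip the predictions so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    pair_loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return pair_loss.mean()

# Toy example: 2 samples, 3 MoA columns, every prediction equal to 0.5.
y_true = np.array([[1, 0, 0], [0, 1, 1]])
y_pred = np.full((2, 3), 0.5)
loss = mean_columnwise_log_loss(y_true, y_pred)
print(loss)  # equals ln(2) because every prediction is 0.5
```

A constant prediction of 0.5 gives a loss of ln(2) regardless of the labels, which is a convenient sanity check for any implementation of the metric.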

If successful, you’ll help to develop an algorithm to predict a compound’s MoA given its cellular signature, thus helping scientists advance the drug discovery process.

In [None]:
import gc
import time
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.decomposition import PCA

import lightgbm as lgb

#Path to save the output
path_analysis = 'C:\\Users\\maxwi\\Python\\Kaggle\\Mechanism of action\\data_analysis\\'

In [None]:
# Load the Data
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
test_features = pd.read_csv('../input/lish-moa/test_features.csv')

In [None]:
#Function to reduce memory usage.
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
train_features = reduce_mem_usage(train_features)
train_targets_scored = reduce_mem_usage(train_targets_scored)
test_features = reduce_mem_usage(test_features)

In [None]:
display(train_features)
print()

#Verify missing.
train_features_missing = pd.DataFrame(train_features.isnull().sum())
print('Quantity of missing values:', train_features_missing[0].sum())

The good news is that we don't have any missing values.\
On the other hand, we have fewer than 25 thousand observations to work with.

In [None]:
display(train_targets_scored)
print()

#Verify missing.
train_targets_scored_missing = pd.DataFrame(train_targets_scored.isnull().sum())
print('Quantity of missing values:', train_targets_scored_missing[0].sum())

Again, the good news is that we don't have any missing values, so we do not need to decide whether a missing value means no information or just another '0'.

In [None]:
display(test_features)
print()

#Verify missing.
test_features_missing = pd.DataFrame(test_features.isnull().sum())
print('Quantity of missing values:', test_features_missing[0].sum())

Our train and test files are in the same format with the same variables, so there is no problem here.

Our train_features and train_targets_scored are already in the same order, so we won't have any problem when modeling. We do not need to lose time joining the tables.
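
The claim that both tables share the same row order can be checked directly on the `sig_id` column. The sketch below uses small toy frames in place of the real files; the helper name `rows_are_aligned` is an assumption made for illustration.

```python
import pandas as pd

def rows_are_aligned(features, targets, key='sig_id'):
    """Return True when both frames list the same keys in the same order."""
    if len(features) != len(targets):
        return False
    return bool((features[key].to_numpy() == targets[key].to_numpy()).all())

# Toy frames standing in for train_features and train_targets_scored.
toy_features = pd.DataFrame({'sig_id': ['id_a', 'id_b', 'id_c'],
                             'g-0': [0.1, 0.2, 0.3]})
toy_targets = pd.DataFrame({'sig_id': ['id_a', 'id_b', 'id_c'],
                            'moa_x': [0, 1, 0]})
print(rows_are_aligned(toy_features, toy_targets))
```

If the check ever returned False, a merge on `sig_id` would be needed before splitting the data.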

We can see that we have 206 different Mechanisms of Action. One approach is to train 206 different LightGBM models, one per MoA.\
The downside of this approach is time. However, in the 'lightgbm test' notebook, we could see that this approach can be done in less than 9 hours (our limit to run this model in a Kaggle notebook).

If we have time, we will also try a feedforward neural network.

In [None]:
'''
Before we start studying the variables, let's separate train_features and train_targets_scored into train_set and validation_set.
This is important to avoid overfitting.
'''
train_features_set, validation_features_set = train_test_split(train_features, test_size = 0.2, random_state = 1)

#The row labels kept by the split equal the original row positions, so they select the matching target rows.
train_labels_set = train_targets_scored.iloc[train_features_set.index].copy()
validation_labels_set = train_targets_scored.iloc[validation_features_set.index].copy()

#Reset the index. We don't really need a numeric index. We can identify our observations by the column 'sig_id'.
train_features_set = train_features_set.reset_index(drop = True)
validation_features_set = validation_features_set.reset_index(drop = True)
train_labels_set = train_labels_set.reset_index(drop = True)
validation_labels_set = validation_labels_set.reset_index(drop = True)

In [None]:
#It looks like we have a lot of variables g-x and c-x.
#Lets first verify the names of our variables.
features_df = pd.DataFrame(train_features_set.columns)
features_df['first_2_letter'] = features_df[0].str.extract(r"(.{2})", expand = True)
features_df['first_2_letter'].value_counts()

As we can see, we have 772 gene expression variables (g) and 100 cell viability variables (c).\
The 3 'cp' variables are cp_type, cp_time, and cp_dose.

cp_type indicates whether the sample was treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle). Samples in the control group (ctl_vehicle) do not have any Mechanism of Action.

cp_time indicates the treatment duration (24, 48, or 72 hours).

cp_dose indicates the dose (high or low, D1 or D2).


It would be very helpful to have information about how the dose and the treatment duration affect the 'g' and 'c' variables and the mechanism of action. For example, given that we already have the 'g' and 'c' variables, do the dosage and the duration really matter? Does a high dosage imply that the 'g' and 'c' variables will have a bigger impact on the cells, so that the mechanism of action will be different?\
To address this we can design new variables, e.g., multiply all 'g' and 'c' variables by 2 when we have a high dosage, or build different models by dosage and treatment duration.
Given that we do not even know whether the high dosage is D1, as the variable description suggests, or D2, as one could infer if D1 means one dose and D2 means two doses, it is preferable to build different models. Nonetheless, we only have 24k observations, so we won't have a good number of observations for so many models.
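
To see how thin the data becomes with one model per dosage and duration, the rows can be split by (cp_dose, cp_time) and counted. The sketch below uses a toy frame in place of the training data; the names `segments` and `segment_sizes` are assumptions made for illustration.

```python
import pandas as pd

# Toy data standing in for the training rows.
toy = pd.DataFrame({
    'cp_time': [24, 24, 48, 48, 72, 72, 24, 48],
    'cp_dose': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2', 'D1', 'D1'],
    'g-0': [0.1, -0.3, 0.5, 0.2, -0.1, 0.4, 0.0, 0.3],
})

# One sub-frame per (dose, duration) segment; each would feed its own model.
segments = {key: part for key, part in toy.groupby(['cp_dose', 'cp_time'])}
segment_sizes = {key: len(part) for key, part in segments.items()}
print(segment_sizes)
```

With 2 doses and 3 durations there are 6 segments, so each model would see roughly one sixth of the observations.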

In [None]:
#Lets see the variable cp_type
train_features_cp_type_bar_graph = pd.DataFrame(train_features_set['cp_type'].value_counts())

ax = train_features_cp_type_bar_graph.plot(kind='bar', figsize=(15,10), width = 0.58, rot = 0,
                                           align='center', color = 'LightGray', edgecolor = None)

total = 0
for bars in ax.patches:
    total += bars.get_height()

for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height/total:.2%}', (x + width/2, y + 100 + height), ha = 'center')

Almost 8% of our observations are from the control group. 
We should remove these observations from the data used to train our model. However, we must add a flag to our pipeline so that every MoA is set to zero when classifying a control observation. 
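
The flag described above can be applied as a post-processing step on the predicted probabilities. This is a minimal sketch on toy data; the helper name `zero_out_controls` and the column names `moa_x` and `moa_y` are hypothetical.

```python
import pandas as pd

def zero_out_controls(pred, cp_type, control_label='ctl_vehicle'):
    """Set every MoA probability to 0 for rows flagged as control."""
    pred = pred.copy()
    is_control = (cp_type == control_label).to_numpy()
    moa_cols = [c for c in pred.columns if c != 'sig_id']
    pred.loc[is_control, moa_cols] = 0.0
    return pred

# Toy predictions: the second row is a control observation.
toy_pred = pd.DataFrame({'sig_id': ['id_a', 'id_b', 'id_c'],
                         'moa_x': [0.2, 0.7, 0.1],
                         'moa_y': [0.05, 0.3, 0.9]})
toy_cp_type = pd.Series(['trt_cp', 'ctl_vehicle', 'trt_cp'])
cleaned = zero_out_controls(toy_pred, toy_cp_type)
print(cleaned)
```

The function works on a copy, so the raw model output stays available for inspection.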

In [None]:
#Nonetheless, let's verify that the control group really has no MoA.
#First, let's count how many MoAs each observation has.
train_labels_set['sum_MoA'] = train_labels_set.drop(['sig_id'], axis = 1).astype(bool).sum(axis = 1)
#Now, we sum the total number of MoAs over the control observations.
train_labels_set[train_features_set['cp_type'] == 'ctl_vehicle']['sum_MoA'].sum()

In [None]:
#Lets see if, despite having no MoA, the control group has value for the other variables.
train_features_set[train_features_set['cp_type'] == 'ctl_vehicle']

This shows us that it may be important to remove these observations from our training data. \
If we keep them, we only gain observations with no information or, even worse, many observations with misleading information, since we do not know whether their 'g' and 'c' values represent a drug with no effect on the cells or the original drug.\
However, to be sure about our decision we must study the distributions of the variables in the control and treatment groups.\
If there is no change in the distribution of the variables between control and treatment, it is crucial to remove the control group from our training dataset.

In [None]:
# Remove control observations.
train_features_without_control = train_features_set[train_features_set['cp_type'] != 'ctl_vehicle'].copy()
train_labels_without_control = train_labels_set.iloc[train_features_without_control.index].copy()
print('Observations left to train our model:', len(train_features_without_control))

# Only control observations.
train_features_only_control = train_features_set[train_features_set['cp_type'] == 'ctl_vehicle'].copy()
train_labels_only_control = train_labels_set.iloc[train_features_only_control.index].copy()

#Reset the index of all four data frames.
train_features_without_control = train_features_without_control.reset_index(drop = True)
train_labels_without_control = train_labels_without_control.reset_index(drop = True)

train_features_only_control = train_features_only_control.reset_index(drop = True)
train_labels_only_control = train_labels_only_control.reset_index(drop = True)

In [None]:
'''
For the validation and test, we will remove just the column 'cp_type'.
However, we will not delete the observations. 
Instead, we will create a separate data frame with 'cp_type' so we can set all MoA equal to zero for these observations.
'''

validation_features_without_control = validation_features_set.drop(['cp_type'], axis = 1).copy()
validation_features_control = validation_features_set[['sig_id', 'cp_type']].copy()

test_features_without_control = test_features.drop(['cp_type'], axis = 1).copy()
test_features_control = test_features[['sig_id', 'cp_type']].copy()

For now, to continue our data study and train our model, we should use \
train_features_without_control, and\
train_labels_without_control

For hyperparameter tuning and the holdout evaluation, we should use\
validation_features_without_control and validation_features_control, and\
test_features_without_control and test_features_control

In [None]:
#Now, lets study the variable cp_time
train_features_cp_time_bar_graph = pd.DataFrame(train_features_without_control['cp_time'].value_counts()).sort_index()

ax = train_features_cp_time_bar_graph.plot(kind='bar', figsize=(15,10), width = 0.58, rot = 0,
                                           align = 'center', color = 'LightGray', edgecolor = None)

total = 0
for bars in ax.patches:
    total += bars.get_height()

for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height/total:.2%}', (x + width/2, y + 50 + height), ha = 'center')

The duration of each treatment is well distributed in our data.
Let's verify whether treatments with longer durations present more MoAs.

In [None]:
#Compute value_counts just to see how many bins we should plot at the histogram.
train_labels_without_control_cp_time_hist_24 = train_labels_without_control[train_features_without_control['cp_time'] == 24]['sum_MoA'].value_counts().sort_index()
train_labels_without_control_cp_time_hist_24

In [None]:
train_labels_without_control_cp_time_hist_48 = train_labels_without_control[train_features_without_control['cp_time'] == 48]['sum_MoA'].value_counts().sort_index()
train_labels_without_control_cp_time_hist_48

In [None]:
train_labels_without_control_cp_time_hist_72 = train_labels_without_control[train_features_without_control['cp_time'] == 72]['sum_MoA'].value_counts().sort_index()
train_labels_without_control_cp_time_hist_72

A drug can have from zero to 7 different mechanisms of action.
For some reason, we do not have any observation with 6 MoAs.
Later, when we train our model, we can consider that drugs have a limited number of mechanisms of action and, for example, keep only the 10 most probable mechanisms.
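
Keeping only the most probable mechanisms per observation could look like the sketch below. The helper name `keep_top_k` and the `floor` value assigned to the discarded entries are assumptions made for illustration; whether this improves the log loss would have to be checked on the validation set.

```python
import numpy as np

def keep_top_k(prob, k=10, floor=0.0):
    """Keep the k largest probabilities in each row; set the rest to `floor`."""
    prob = np.asarray(prob, dtype=float)
    k = min(k, prob.shape[1])
    # Column indices of the k largest entries per row (unordered within the k).
    top_idx = np.argpartition(prob, -k, axis=1)[:, -k:]
    out = np.full_like(prob, floor)
    np.put_along_axis(out, top_idx, np.take_along_axis(prob, top_idx, axis=1), axis=1)
    return out

# Toy probabilities: 2 observations, 4 MoA columns, keep the top 2 per row.
toy_prob = np.array([[0.10, 0.60, 0.05, 0.30],
                     [0.70, 0.02, 0.20, 0.01]])
trimmed = keep_top_k(toy_prob, k=2)
print(trimmed)
```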

In [None]:
#Let's turn these tables into histograms to make them easier on the eyes.
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(3,1,1)
ax1 = train_labels_without_control[train_features_without_control['cp_time'] == 24]['sum_MoA'].hist(density = True,
                histtype = 'bar', bins = 7, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('cp_time = 24')
plt.ylim(top=0.6)

ax2 = fig.add_subplot(3,1,2)
ax2 = train_labels_without_control[train_features_without_control['cp_time'] == 48]['sum_MoA'].hist(density = True,
                histtype = 'bar', bins = 7, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('cp_time = 48')
plt.ylim(top=0.6)

#Third panel: treatments lasting 72 hours.
ax3 = fig.add_subplot(3,1,3)
ax3 = train_labels_without_control[train_features_without_control['cp_time'] == 72]['sum_MoA'].hist(density = True,
                histtype = 'bar', bins = 7, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('cp_time = 72')
plt.ylim(top=0.6)

As we can see, the number of mechanisms of action does not change according to the duration of the treatment.

In [None]:
#Study of the variable cp_dose
train_features_cp_dose_bar_graph = pd.DataFrame(train_features_without_control['cp_dose'].value_counts()).sort_index()

ax = train_features_cp_dose_bar_graph.plot(kind='bar', figsize=(15,10), width = 0.58, rot = 0,
                                           align = 'center', color = 'LightGray', edgecolor = None)

total = 0
for bars in ax.patches:
    total += bars.get_height()

for p in ax.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy() 
    ax.annotate(f'{height/total:.2%}', (x + width/2, y + 50 + height), ha = 'center')

In [None]:
#Our data is well balanced with respect to drug dosage.
#Let's see if the number of MoAs is balanced too.
print('Histogram:')
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(2,1,1)
ax1 = train_labels_without_control[train_features_without_control['cp_dose'] == 'D1']['sum_MoA'].hist(density = True,
                histtype = 'bar', bins = 7, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('cp_dose = D1')
plt.ylim(top=0.6)

ax2 = fig.add_subplot(2,1,2)
ax2 = train_labels_without_control[train_features_without_control['cp_dose'] == 'D2']['sum_MoA'].hist(density = True,
                histtype = 'bar', bins = 7, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('cp_dose = D2')
plt.ylim(top=0.6)

In [None]:
#Let's see if the relation between cp_time and cp_dose is balanced too.
train_cp_time_cp_dose_sunburst = train_features_without_control.groupby(['cp_time', 'cp_dose'])['sig_id'].count().reset_index()
train_cp_time_cp_dose_sunburst.columns = ['cp_time', 'cp_dose', 'count']

fig = px.sunburst(
    train_cp_time_cp_dose_sunburst,
    path = ['cp_time','cp_dose'],
    values = 'count',
    title = 'Sunburst chart: cp_time, cp_dose',
    color = 'count', color_continuous_scale = 'Blues',
)

fig.update_traces(go.Sunburst(textinfo= 'label + percent entry'))
    
fig.show()


In [None]:
#Another way to see the same information is with a tree map
fig =px.treemap(
    train_cp_time_cp_dose_sunburst,
    path = ['cp_dose', 'cp_time'],
    values = 'count',
    title = 'Tree map: cp_time, cp_dose',
    color = 'count', color_continuous_scale = 'Blues',
)

fig.update_layout(template = 'seaborn')
fig.data[0].textinfo = 'label + percent entry'
fig.show()

We can notice that all combinations of treatment duration and drug dosage have roughly the same proportion in our database.
The segment (D1, 48) has a slightly larger share, but a difference of only 2% does not seem relevant enough to treat it differently from the other segments.\
If we decide to build one model per dosage and duration, we will have only about 2812 observations (17576*0.16) to work with in each model.

All the other variables (772 gene expression variables and 100 cell viability variables) are continuous, and there isn't much to analyze. We already know that there are no missing values in our database, but we can also check whether any of these variables is actually a constant (which would not be useful for modeling).\
It is also interesting to look at the correlations among the 'g' variables and among the 'c' variables. If they are highly correlated, we can perform a principal component analysis. This will allow our model to be more efficient, especially considering the small number of observations that we have.
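
Both checks can be sketched on toy data: a scan for constant columns, and a count of how many principal components of the correlation matrix are needed to explain a given share of the variance. The helper names and the 90% target are assumptions made for illustration, and the eigenvalue computation uses plain NumPy rather than the `PCA` class imported above.

```python
import numpy as np
import pandas as pd

def constant_columns(df):
    """Names of the columns that take a single value."""
    return [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

def components_needed(df, target=0.90):
    """Principal components of the correlation matrix needed to reach `target` explained variance."""
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # largest eigenvalue first
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratio, target) + 1)

# Toy data: three almost identical columns plus one constant column.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    'c-0': base,
    'c-1': base + 0.01 * rng.normal(size=200),
    'c-2': base + 0.01 * rng.normal(size=200),
    'const': np.ones(200),
})
constants = constant_columns(toy)
n_components = components_needed(toy.drop(columns=constants))
print(constants, n_components)
```

When a few components already cover most of the variance, as in this toy case, replacing the raw columns by their leading components loses little information.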

In [None]:
#Computing the standard deviation for 'g_' and 'c_' variables.
describe_g_c_variables = train_features_without_control.drop(['cp_time'], axis = 1).describe()
std_g_c = describe_g_c_variables.iloc[2]
print('Describe the standard deviation of g and c variables:')
display(std_g_c.describe())
print()

#Histogram
print('Histogram:')
fig = plt.figure(figsize=(15,10))

ax = std_g_c.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g and c variables')
plt.ylim(top = 2)

We can see that some variables have much more variance than others. Probably, these variables will be more important in the model. However, we do not have any variable without variance.

Let's repeat this study, separating the g variables from the c ones.

In [None]:
#Computing the standard deviation for 'g_' variables.
mask_g = train_features_without_control.columns.str.contains('^g-')

describe_g_variables = train_features_without_control.loc[:,mask_g].describe()
std_g = describe_g_variables.iloc[2]
print('Describe the standard deviation of g variables:')
display(std_g.describe())
print()

#Histogram
print('Histogram:')
fig = plt.figure(figsize=(15,10))

ax = std_g.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

In [None]:
#Computing the standard deviation for 'c_' variables.
mask_c = train_features_without_control.columns.str.contains(r'^c-\d')

describe_c_variables = train_features_without_control.loc[:,mask_c].describe()
std_c = describe_c_variables.iloc[2]
print('Describe the standard deviation of c variables:')
display(std_c.describe())
print()

#Histogram
print('Histogram:')
fig = plt.figure(figsize=(15,10))

ax = std_c.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

In [None]:
#Summarizing the comparison between g and c variables with a box plot.
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(2,2,1)
ax1 = sns.boxplot(data = std_g, color = 'LightGray')
plt.xlabel('std of g variables')
plt.ylim(bottom = 0.5, top = 2.5)


ax2 = fig.add_subplot(2, 2, 2)
ax2 = sns.boxplot(data = std_c, color = 'LightGray')
plt.xlabel('std of c variables')
plt.ylim(bottom = 0.5, top = 2.5)


From the box plot above, it is clear that the cell viability variables have a higher variability than the gene expression variables.
Let's compare the distribution of these gene and cell variables with the data containing only control observations.

In [None]:
describe_g_variables_without_control = train_features_without_control.loc[:,mask_g].describe()
std_g_without_control = describe_g_variables_without_control.iloc[2]
print('Describe the standard deviation of g variables without control:')
display(std_g_without_control.describe())
print()
describe_g_variables_only_control = train_features_only_control.loc[:,mask_g].describe()
std_g_only_control = describe_g_variables_only_control.iloc[2]
print('Describe the standard deviation of g variables with only control:')
display(std_g_only_control.describe())
print()

print('Histogram:')
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(2,1,1)
ax1 = std_g_without_control.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables without control')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax2 = fig.add_subplot(2,1,2)
ax2 = std_g_only_control.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables with only control')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

In [None]:
describe_c_variables_without_control = train_features_without_control.loc[:,mask_c].describe()
std_c_without_control = describe_c_variables_without_control.iloc[2]
print('Describe the standard deviation of c variables without control:')
display(std_c_without_control.describe())
print()
describe_c_variables_only_control = train_features_only_control.loc[:,mask_c].describe()
std_c_only_control = describe_c_variables_only_control.iloc[2]
print('Describe the standard deviation of c variables with only control:')
display(std_c_only_control.describe())
print()

print('Histogram:')
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(2,1,1)
ax1 = std_c_without_control.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables without control')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax2 = fig.add_subplot(2,1,2)
ax2 = std_c_only_control.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables with only control')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

We can see that the distributions of the gene and cell variables are different for the control and treatment groups, especially for the cell variables.\
Let's repeat the same study considering the difference in the dosage variable.

In [None]:
train_g_variables = train_features_without_control.loc[:,mask_g]
describe_g_variables_D1 = train_g_variables[np.in1d(train_features_without_control['cp_dose'], 'D1')].describe()
std_g_D1 = describe_g_variables_D1.iloc[2]
print('Describe the standard deviation of g variables, dosage = "D1":')
display(std_g_D1.describe())
print()
std_g_D2 = train_g_variables[np.in1d(train_features_without_control['cp_dose'], 'D2')].describe()
std_g_D2 = std_g_D2.iloc[2]
print('Describe the standard deviation of g variables, dosage = "D2":')
display(std_g_D2.describe())
print()

print('Histogram:')
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(2,1,1)
ax1 = std_g_D1.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables, dosage = "D1":')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax2 = fig.add_subplot(2,1,2)
ax2 = std_g_D2.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables, dosage = "D2":')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

In [None]:
train_c_variables = train_features_without_control.loc[:,mask_c]
describe_c_variables_D1 = train_c_variables[np.in1d(train_features_without_control['cp_dose'], 'D1')].describe()
std_c_D1 = describe_c_variables_D1.iloc[2]
print('Describe the standard deviation of c variables, dosage = "D1":')
display(std_c_D1.describe())
print()
std_c_D2 = train_c_variables[np.in1d(train_features_without_control['cp_dose'], 'D2')].describe()
std_c_D2 = std_c_D2.iloc[2]
print('Describe the standard deviation of c variables, dosage = "D2":')
display(std_c_D2.describe())
print()

print('Histogram:')
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(2,1,1)
ax1 = std_c_D1.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables, dosage = "D1":')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax2 = fig.add_subplot(2,1,2)
ax2 = std_c_D2.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables, dosage = "D2":')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

There is only a small change in the distribution of the gene and cell variables when considering the difference in the dosage. This may imply that the dosage variable is not so important for our model.\
Let's repeat the study considering the duration of the treatment.

In [None]:
train_g_variables = train_features_without_control.loc[:,mask_g]
std_g_24 = train_g_variables[np.in1d(train_features_without_control['cp_time'], 24)].describe()
std_g_24 = std_g_24.iloc[2]
print('Describe the standard deviation of g variables, duration = 24:')
display(std_g_24.describe())
print()
std_g_48 = train_g_variables[np.in1d(train_features_without_control['cp_time'], 48)].describe()
std_g_48 = std_g_48.iloc[2]
print('Describe the standard deviation of g variables, duration = 48:')
display(std_g_48.describe())
print()
std_g_72 = train_g_variables[np.in1d(train_features_without_control['cp_time'], 72)].describe()
std_g_72 = std_g_72.iloc[2]
print('Describe the standard deviation of g variables, duration = 72:')
display(std_g_72.describe())
print()

print('Histogram:')
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(3,1,1)
ax1 = std_g_24.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables, duration = 24:')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax2 = fig.add_subplot(3,1,2)
ax2 = std_g_48.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables, duration = 48:')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax3 = fig.add_subplot(3,1,3)
ax3 = std_g_72.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of g variables, duration = 72:')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

In [None]:
train_c_variables = train_features_without_control.loc[:,mask_c]
std_c_24 = train_c_variables[np.in1d(train_features_without_control['cp_time'], 24)].describe()
std_c_24 = std_c_24.iloc[2]
print('Describe the standard deviation of c variables, duration = 24:')
display(std_c_24.describe())
print()
std_c_48 = train_c_variables[np.in1d(train_features_without_control['cp_time'], 48)].describe()
std_c_48 = std_c_48.iloc[2]
print('Describe the standard deviation of c variables, duration = 48:')
display(std_c_48.describe())
print()
std_c_72 = train_c_variables[np.in1d(train_features_without_control['cp_time'], 72)].describe()
std_c_72 = std_c_72.iloc[2]
print('Describe the standard deviation of c variables, duration = 72:')
display(std_c_72.describe())
print()

print('Histogram:')
fig = plt.figure(figsize=(15,25))

ax1 = fig.add_subplot(3,1,1)
ax1 = std_c_24.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables, duration = 24:')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax2 = fig.add_subplot(3,1,2)
ax2 = std_c_48.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables, duration = 48:')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

ax3 = fig.add_subplot(3,1,3)
ax3 = std_c_72.hist(density = True,
                histtype = 'bar', bins = 10, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('std of c variables, duration = 72:')
plt.xlim(left = 0.5, right = 2.5)
plt.ylim(top = 2.5)

For the gene variables, the distribution of the standard deviations shows no clear difference across treatment durations. For the cell variables, however, there is a visible difference.
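The same per-duration comparison can be sketched more compactly with a `groupby`. This is only an illustration on synthetic data: the `toy` frame and its three `c-` columns are made up, and only the `cp_time` column name comes from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training features (hypothetical data, not the competition file).
rng = np.random.default_rng(0)
n_rows = 600
toy = pd.DataFrame({
    'cp_time': rng.choice([24, 48, 72], size=n_rows),
    'c-0': rng.normal(size=n_rows),
    'c-1': rng.normal(size=n_rows),
    'c-2': rng.normal(size=n_rows),
})

# One row per duration, one column per feature:
# the standard deviation of each feature within each treatment duration.
std_by_time = toy.groupby('cp_time').std()

# Summarise the distribution of those standard deviations for each duration.
std_summary = std_by_time.T.describe()
print(std_summary)
```

Each column of `std_summary` plays the role of one of the `std_c_24`, `std_c_48` and `std_c_72` summaries above, so the three durations can be read side by side.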

Now, let's look at the correlations.

In [None]:
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def correlation_variables(db, variable_type, n, correlation_cut):
    
    # Select the columns by prefix: 'c-' for cell viability, 'g-' for gene expression
    if variable_type == 'c':
        mask_type = db.columns.str.contains('^c-')
    elif variable_type == 'g':
        mask_type = db.columns.str.contains('^g-')
        
    db_variables = db.loc[:,mask_type]

    #Correlation
    corr_matrix = db_variables.corr()
    corr_table = corr_matrix.abs().unstack()
    labels_to_drop = get_redundant_pairs(db_variables)
    corr_table = corr_table.drop(labels = labels_to_drop).sort_values(ascending = False)

    print('Variable type: ' + variable_type)
    print()
    print("Top Absolute Correlations")
    print(corr_table[: n])
    print()
    print("Least Absolute Correlations")
    print(corr_table[len(corr_table) - n: ])
    print()
    print("Pairs with absolute correlation higher than", correlation_cut, ':', len(corr_table[corr_table > correlation_cut]))
    print("Percent:", round(len(corr_table[corr_table > correlation_cut]) / len(corr_table), 4))
    print()

    size_x = 20     #This is a good size to visualise the heatmap saved as .png
    size_y = 20
    plt.figure(figsize = (size_x, size_y))
    sns.set(font_scale = 1.5)

    ax = sns.heatmap(corr_matrix, annot = False, linewidth = 0.2, cmap='coolwarm')
    bottom, top = ax.get_ylim()
    ax.set_ylim(bottom + 0.5, top - 0.5)

    plt.tight_layout()
    plt.savefig(path_analysis + 'variable_type_' + variable_type + '.png')

    plt.show() 

In [None]:
correlation_variables(train_features_without_control, 'g', 10, 0.5)

For the 'g' variables, we can see that they are not highly correlated: less than 5% of the pairs have a correlation higher than 0.5 or lower than -0.5.
This suggests that PCA will not reduce the number of variables by much.

In [None]:
correlation_variables(train_features_without_control, 'c', 10, 0.8)

On the other hand, for the 'c' variables, most pairs (more than 85%) have a correlation higher than 0.8.\
Even the least correlated pairs have a correlation higher than 0.6. Therefore, we can apply PCA to work with fewer variables, which may improve the performance of our model.
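The share of pairs above a cut can also be computed without building the set of redundant pairs by hand. The sketch below masks everything except the upper triangle of the correlation matrix. The helper name `upper_triangle_abs_corr` and the synthetic `toy` data are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def upper_triangle_abs_corr(df):
    '''Absolute correlation of each distinct pair of columns, sorted descending.'''
    corr = df.corr().abs()
    # Boolean mask that is True strictly above the diagonal (k=1 excludes the diagonal)
    keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # Masked cells become NaN and are dropped after stacking
    return corr.where(keep).stack().dropna().sort_values(ascending=False)

# Synthetic example: 'c-0' and 'c-1' share a common signal, 'c-2' is independent noise.
rng = np.random.default_rng(1)
base = rng.normal(size=500)
toy = pd.DataFrame({
    'c-0': base + 0.1 * rng.normal(size=500),
    'c-1': base + 0.1 * rng.normal(size=500),
    'c-2': rng.normal(size=500),
})

pairs = upper_triangle_abs_corr(toy)
share_above = (pairs > 0.8).mean()   # fraction of pairs above the cut
print(pairs)
print('Share of pairs above 0.8:', round(share_above, 4))
```

With a few hundred columns this avoids the Python-level double loop, which is the main reason to prefer the mask.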

In [None]:
# PCA
def pca_variabels(db, variable_type, interval_explained_variance_low, interval_explained_variance_high):
    '''
    pca_variabels(train_features_without_control, 'g', 340, 350)
    '''
    # Select the columns by prefix: 'c-' for cell viability, 'g-' for gene expression
    if variable_type == 'c':
        mask_type = db.columns.str.contains('^c-')
    elif variable_type == 'g':
        mask_type = db.columns.str.contains('^g-')

    db_variables = db.loc[:,mask_type]
    pca = PCA(n_components = len(db_variables.columns),random_state = 4)
    pca.fit(db_variables)

    explained_variance = []

    sum_variance = 0
    num_variables = 0
    for variance in pca.explained_variance_ratio_:
        sum_variance += variance
        num_variables += 1
        explained_variance.append([num_variables, sum_variance])

    print("Explained variance:")
    display(explained_variance[interval_explained_variance_low - 1: interval_explained_variance_high])
    
    return pd.DataFrame(data = explained_variance)

In [None]:
# Plot PCA: number of components vs. cumulative explained variance
def graph_PCA(db, rotation_x, pace):
    '''
    graph_PCA(db = explained_variance_g, rotation_x = 45, pace = 50)
    '''
    fig, ax = plt.subplots(figsize=(20, 15))
    x = db[0]
    y = db[1]

    ax.xaxis.set_ticks(x)
    ax.plot(x, y, color='gray', zorder = 1, linewidth = 0.5, linestyle = 'solid')
    # add two layers of points to create an illusion of a discontinuous line. "zorder" specifies plotting order
    ax.scatter(x, y, s = 64, color = 'white', zorder = 2)
    ax.scatter(x, y, s = 8, color = 'gray', zorder = 3)

    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)

    ax.xaxis.set_ticks(list(range(1, len(x), pace)))
    ax.set_facecolor('xkcd:white')
    
    plt.xticks(rotation = rotation_x)

In [None]:
pca_variabels(train_features_without_control, 'g', 340, 350)

In [None]:
pca_variabels(train_features_without_control, 'g', 160, 170)

In [None]:
explained_variance_g = pca_variabels(train_features_without_control, 'g', 65, 75)

In [None]:
graph_PCA(explained_variance_g, 45, 50)

Although the correlation heatmap does not show high correlations between variables, instead of using the 772 'g' variables we can use only 343 components and still explain a little more than 90% of the variance in our data.\
If we accept 80% of explained variance, we can use only 166 components.\
If we really want to cut the number of variables, we can use only 72 components and still explain 70% of the variance in the 'g' variables.
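Rather than scanning the cumulative table by eye, the number of components needed for a given threshold can be looked up directly. The following is a minimal sketch on synthetic data. The helper `components_for_variance` is a hypothetical name, and the `toy` matrix is built from two latent factors only to make the result easy to check.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(data, threshold):
    '''Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold` (between 0 and 1).'''
    pca = PCA().fit(data)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted returns the first position where cumulative >= threshold
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against thresholds that are never reached because of rounding
    return min(n_components, len(cumulative))

# Synthetic data: 10 observed variables driven by 2 latent factors plus small noise.
rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
toy = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

for threshold in (0.7, 0.8, 0.9):
    print(threshold, components_for_variance(toy, threshold))
```

On the real data, the same call would be applied to the 'g' or 'c' columns selected with the masks used in `pca_variabels`.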

In [None]:
explained_variance_c = pca_variabels(train_features_without_control, 'c', 5, 15)

In [None]:
graph_PCA(explained_variance_c, 45, 10)

With the 'c' variables, only 10 components are enough to explain more than 90% of their variance.

In [None]:
#Now that we have studied our exogenous variables, let's look at our endogenous ones.
train_labels_without_control

We can see that we have 206 different mechanism-of-action classes.
We already know that one drug may have more than one MoA, but let's plot again the histogram of the number of mechanisms of action per drug, this time without grouping by variable.

In [None]:
print('Describe the endogenous variables:')
display(train_labels_without_control['sum_MoA'].describe())
print()

#Histogram
print('Histogram:')
fig = plt.figure(figsize=(15,10))
ax = train_labels_without_control['sum_MoA'].hist(density = True,
                histtype = 'bar', bins = 7, align = 'mid', orientation = 'vertical',
                color = 'LightGray', edgecolor = None)
plt.grid(False)
plt.xlabel('Mechanism of action')
plt.ylim(top=0.6)


In [None]:
#Box plot
print('Box plot:')
fig = plt.figure(figsize=(10,10))

ax = sns.boxplot(data = train_labels_without_control['sum_MoA'], color = 'LightGray')
plt.xlabel('Mechanism of action')

As we can see, even without the control group, about 35% of the drugs in our database do not have any mechanism of action.
It would be helpful to have more information about our database, or to talk with someone from the field, to understand whether it is normal to have so many drugs without any MoA.

We have different types of mechanisms of action, so let's try to understand them a little.

In [None]:
#Types of MoA
mechanism_of_action_type_aux = train_labels_without_control.drop(['sig_id', 'sum_MoA'], axis = 1).columns
mechanism_of_action_type_aux_split = mechanism_of_action_type_aux.str.split('_')

mechanism_of_action_type = []
for i in range(len(mechanism_of_action_type_aux_split)):
    #Words that appear only once
    if mechanism_of_action_type_aux_split[i][-1] in ['b', 'stimulant', 'medium', 'local', 'donor', 'sensitizer',
                                                    'laxative','anticonvulsant','secretagogue', 'antibiotic',
                                                    'antiprotozoal','antifolate','antimalarial', 'immunosuppressant',
                                                    'antioxidant','antifungal','scavenger', 'antiviral',
                                                    'steroid','antihistamine','diuretic', 'analgesic',
                                                    'antiarrhythmic','anti-inflammatory']:
        mechanism_of_action_type.append(mechanism_of_action_type_aux[i])
    else:
        mechanism_of_action_type.append(mechanism_of_action_type_aux_split[i][-1])


mechanism_of_action_type_df = pd.DataFrame(data = mechanism_of_action_type)
mechanism_of_action_type_df_barh_graph = mechanism_of_action_type_df[0].value_counts().sort_values()
mechanism_of_action_type_df_barh_graph.plot(kind='barh', figsize=(15,13), width = 0.58, rot = 0,
                                           align = 'center', color = 'LightGray', edgecolor = None)

Most of our mechanisms of action are inhibitors.

Considering that one drug can have more than one MoA, it would again be helpful to talk with someone from the field to know, for example, whether one drug can be an inhibitor and an activator at the same time.

Let's now check whether our database is imbalanced by counting how many drugs we have for each MoA.

In [None]:
sum_MoA = pd.DataFrame(data = train_labels_without_control.drop(['sig_id', 'sum_MoA'], axis = 1).sum().sort_values(ascending = False), columns = ['Count'])
sum_MoA['Percent'] = round(sum_MoA['Count'] / len(train_labels_without_control), 4)
sum_MoA.head(10)

Unfortunately, as we can see, most of our MoAs have less than 4% positive labels.\
This imbalance may be a problem for modeling.
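To put a number on "most", we can compute the positive rate of every label and then the share of labels below the 4% cut. The sketch below uses synthetic labels. The `moa_*` column names and the rates used to generate them are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic binary labels: one frequent label and three rare ones (hypothetical rates).
rng = np.random.default_rng(2)
rates = [0.30, 0.02, 0.01, 0.005]
toy_labels = pd.DataFrame(
    {f'moa_{i}': (rng.random(2000) < p).astype(int) for i, p in enumerate(rates)}
)

# The mean of a 0/1 column is its share of positive labels.
positive_rate = toy_labels.mean().sort_values(ascending=False)

# Share of labels whose positive rate is below 4%.
share_rare = (positive_rate < 0.04).mean()
print(positive_rate)
print('Share of labels below 4% positives:', share_rare)
```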

In [None]:
#Correlation between MoAs.
n = 10
db_variables = train_labels_without_control.drop(['sig_id', 'sum_MoA'], axis = 1).copy()
corr_matrix = db_variables.corr()
corr_table = corr_matrix.abs().unstack()
labels_to_drop = get_redundant_pairs(db_variables)
corr_table = corr_table.drop(labels = labels_to_drop).sort_values(ascending = False)
print("Top Absolute Correlations:")
print(corr_table[: n])
print()
print("Least Absolute Correlations:")
print(corr_table[len(corr_table) - n: ])
print()

size_x = 20     #This is a good size to visualise the heatmap saved as .png
size_y = 20
plt.figure(figsize = (size_x, size_y))
sns.set(font_scale = 1.5)

ax = sns.heatmap(corr_matrix, annot = False, linewidth = 0.2, cmap='coolwarm')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)

plt.tight_layout()
plt.savefig(path_analysis + 'MoA_correlation.png')

plt.show() 

Few mechanisms of action are correlated with one another, which supports a strategy of one model per MoA.\
However, some correlation remains, so it would be better to build a single model that considers all MoAs at the same time.\
The difference is that one model per MoA can easily be built with LightGBM or any other model such as a logit, whereas a multi-label model, in which one observation can take more than one class, requires a more elaborate model such as a feedforward neural network.
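The two strategies can be contrasted on a small synthetic problem. This is only a sketch under stated assumptions: scikit-learn's `LogisticRegression` wrapped in `OneVsRestClassifier` stands in for the "one model per MoA" option (instead of LightGBM), and `MLPClassifier` stands in for the feedforward network. The data comes from `make_multilabel_classification`, not from the competition files.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic multi-label data: each row may carry zero, one or several labels.
X, Y = make_multilabel_classification(n_samples=300, n_features=20,
                                      n_classes=4, random_state=0)

# Strategy 1: one independent binary model per label.
per_label = OneVsRestClassifier(LogisticRegression(max_iter=1000))
per_label.fit(X, Y)
proba_per_label = per_label.predict_proba(X)

# Strategy 2: a single network with one output per label, trained on all labels jointly.
joint = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
joint.fit(X, Y)
proba_joint = joint.predict_proba(X)

# Both return one probability per (row, label) pair.
print(proba_per_label.shape, proba_joint.shape)
```

In the first strategy each label sees no information from the others, while the shared hidden layer in the second lets correlated labels inform each other.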