# Mechanisms of Action (MoA) Prediction: EDA

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

> Read the datasets

In [None]:
a= pd.read_csv('/kaggle/input/lish-moa/train_features.csv')
b= pd.read_csv('/kaggle/input/lish-moa/test_features.csv')
c=pd.read_csv('/kaggle/input/lish-moa/train_targets_nonscored.csv')
d=pd.read_csv('/kaggle/input/lish-moa/train_targets_scored.csv')

merged=pd.concat([a,b])

#Datasets for treated and control experiments
treated= a[a['cp_type']=='trt_cp']
control= a[a['cp_type']=='ctl_vehicle']

#Treatment time datasets
cp24= a[a['cp_time']== 24]
cp48= a[a['cp_time']== 48]
cp72= a[a['cp_time']== 72]

#Merge scored and nonscored labels
all_drugs= pd.merge(d, c, on='sig_id', how='inner')

#Treated drugs without control
treated_list = treated['sig_id'].to_list()
drugs_tr= d[d['sig_id'].isin(treated_list)]

#adt= All Drugs Treated
adt= all_drugs[all_drugs['sig_id'].isin(treated_list)]

> Helper functions

In [None]:
def plotd(f1):
    #The 'seaborn' style name was removed from newer matplotlib; sns.set applies the seaborn theme directly
    sns.set(style='whitegrid')
    fig = plt.figure(figsize=(15,5))
    #1 rows 2 cols
    #first row, first col
    ax1 = plt.subplot2grid((1,2),(0,0))
    plt.hist(control[f1], bins=4, color='mediumpurple',alpha=0.5)
    plt.title(f'control: {f1}',weight='bold', fontsize=18)
    #first row sec col
    ax1 = plt.subplot2grid((1,2),(0,1))
    plt.hist(treated[f1], bins=4, color='darkcyan',alpha=0.5)
    plt.title(f'Treated with compound: {f1}',weight='bold', fontsize=18)
    plt.show()
    
def plott(f1):
    #The 'seaborn' style name was removed from newer matplotlib; sns.set applies the seaborn theme directly
    sns.set(style='whitegrid')
    fig = plt.figure(figsize=(15,5))
    #1 rows 2 cols
    #first row, first col
    ax1 = plt.subplot2grid((1,3),(0,0))
    plt.hist(cp24[f1], bins=3, color='deepskyblue',alpha=0.5)
    plt.title(f'Treatment duration 24h: {f1}',weight='bold', fontsize=14)
    #first row sec col
    ax1 = plt.subplot2grid((1,3),(0,1))
    plt.hist(cp48[f1], bins=3, color='lightgreen',alpha=0.5)
    plt.title(f'Treatment duration 48h: {f1}',weight='bold', fontsize=14)
    #first row 3rd column
    ax1 = plt.subplot2grid((1,3),(0,2))
    plt.hist(cp72[f1], bins=3, color='gold',alpha=0.5)
    plt.title(f'Treatment duration 72h: {f1}',weight='bold', fontsize=14)
    plt.show()

def plotf(f1, f2, f3, f4):
    #The 'seaborn' style name was removed from newer matplotlib; sns.set applies the seaborn theme directly
    sns.set(style='whitegrid')

    fig= plt.figure(figsize=(15,10))
    #2 rows 2 cols
    #first row, first col
    ax1 = plt.subplot2grid((2,2),(0,0))
    plt.hist(a[f1], bins=3, color='orange', alpha=0.7)
    plt.title(f1,weight='bold', fontsize=18)
    plt.yticks(weight='bold')
    plt.xticks(weight='bold')
    #first row sec col
    ax1 = plt.subplot2grid((2,2), (0, 1))
    plt.hist(a[f2], bins=3, alpha=0.7)
    plt.title(f2,weight='bold', fontsize=18)
    plt.yticks(weight='bold')
    plt.xticks(weight='bold')
    #Second row first column
    ax1 = plt.subplot2grid((2,2), (1, 0))
    plt.hist(a[f3], bins=3, color='red', alpha=0.7)
    plt.title(f3,weight='bold', fontsize=18)
    plt.yticks(weight='bold')
    plt.xticks(weight='bold')
    #second row second column
    ax1 = plt.subplot2grid((2,2), (1, 1))
    plt.hist(a[f4], bins=3, color='green', alpha=0.7)
    plt.title(f4,weight='bold', fontsize=18)
    plt.yticks(weight='bold')
    plt.xticks(weight='bold')

    return plt.show()

def ploth(data, w=15, h=9):
    plt.figure(figsize=(w,h))
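    #Keep only numeric columns: sig_id is a string and breaks corr() in recent pandas
    sns.heatmap(data.select_dtypes('number').corr(), cmap='hot')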
    plt.title('Correlation between the drugs', fontsize=18, weight='bold')
    return plt.show()

**First glimpse: 876 columns, including:**
* Features `g-` signify gene expression data.
* Features `c-` signify cell viability data.
* `cp_type` indicates samples treated with a compound (`trt_cp`) or with a control perturbation (`ctl_vehicle`); control perturbations have no MoAs.
* `cp_time` and `cp_dose` indicate treatment duration (24, 48, 72 hours) and dose (high or low).

In [None]:
a.head()

# 1-Overview: Features

In [None]:
#The 'seaborn' style name was removed from newer matplotlib; sns.set applies the seaborn theme directly
sns.set(style='whitegrid')
fig = plt.figure(figsize=(15,5))
#1 rows 2 cols
#first row, first col
ax1 = plt.subplot2grid((1,2),(0,0))
sns.countplot(x='cp_type', data=a, palette='pastel')
plt.title('Train: Control and treated samples', fontsize=15, weight='bold')
#first row sec col
ax1 = plt.subplot2grid((1,2),(0,1))
sns.countplot(x='cp_dose', data=a, palette='Purples')
plt.title('Train: Treatment Doses: Low and High',weight='bold', fontsize=18)
plt.show()

In [None]:
plt.figure(figsize=(15,5))
sns.distplot( a['cp_time'], color='red', bins=5)
plt.title("Train: Treatment duration ", fontsize=15, weight='bold')
plt.show()

* **Few control samples.**
* **The low and high doses were applied equally.**
* **3 treatment durations: 24h, 48h and 72h.**
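The balance between doses and durations can also be checked jointly with a contingency table. This is a minimal sketch on a hypothetical stand-in frame (`toy` is not part of the dataset); in this notebook the same call would be applied to `a`.

```python
import pandas as pd

# Hypothetical stand-in for the train features (`a` in this notebook);
# D1/D2 are the dose codes used in the cp_dose column.
toy = pd.DataFrame({
    'cp_dose': ['D1', 'D2', 'D1', 'D2', 'D1', 'D2'],
    'cp_time': [24, 24, 48, 48, 72, 72],
})

# Rows: treatment duration, columns: dose, cells: number of samples
balance = pd.crosstab(toy['cp_time'], toy['cp_dose'])
print(balance)
```

A table with similar counts in every cell would support the claim that doses and durations were applied evenly.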

# 2- `c-` Features are related to cell viability. What is cell viability? 

A viability assay is an assay that is created to determine the ability of organs, cells or tissues to maintain or recover a state of survival. Viability can be distinguished from the all-or-nothing states of life and death by the use of a quantifiable index that ranges between the integers of 0 and 1 or, if more easily understood, the range of 0% and 100%. Viability can be observed through the physical properties of cells, tissues, and organs. Some of these include mechanical activity, motility, such as with spermatozoa and granulocytes, the contraction of muscle tissue or cells, mitotic activity in cellular functions, and more. Viability assays provide a more precise basis for measurement of an organism's level of vitality.[1]

In [None]:
plotf('c-10', 'c-50', 'c-70', 'c-90')

***First observation in this EDA:***

As mentioned in the definition, cell viability should range between 0 and 1. Here, we have values between -10 and 6. I need to investigate more and learn whether those are just failed experiments or whether those numbers are due to a different method used in this study. *(I will update this section in the future).*
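One way to quantify this observation is to compute the minimum, the maximum and the share of values outside [0, 1] for every `c-` column. The sketch below uses random stand-in data (`toy` is hypothetical), so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in with three viability columns
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['c-0', 'c-1', 'c-2'])
toy['cp_time'] = 24  # a non-viability column that must be ignored

c_cols = [col for col in toy.columns if col.startswith('c-')]

# One row per c- column: min, max, and fraction of values outside [0, 1]
summary = toy[c_cols].agg(['min', 'max']).T
summary['outside_0_1'] = ((toy[c_cols] < 0) | (toy[c_cols] > 1)).mean()
print(summary)
```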

Anyway, let's see the difference in cell viability between control and treated samples.

In [None]:
plotd("c-30")

This is just a first impression after checking multiple cell lines *(We need to check all the cell lines to draw conclusions).* It seems that the treated samples have higher cell viability than the control samples.
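Instead of inspecting one histogram at a time, all `c-` columns can be compared at once by taking the mean per `cp_type` group. This is a sketch with made-up numbers (`toy` is hypothetical and its values are arbitrary); in this notebook the same `groupby` would run on `a`.

```python
import pandas as pd

# Hypothetical stand-in; the values are arbitrary and only show the mechanics
toy = pd.DataFrame({
    'cp_type': ['trt_cp', 'trt_cp', 'ctl_vehicle', 'ctl_vehicle'],
    'c-0': [-1.0, -3.0, 0.5, 0.3],
    'c-1': [1.0, 1.0, 0.2, 0.4],
})
c_cols = [col for col in toy.columns if col.startswith('c-')]

# Mean of every c- column within each group, then treated minus control
group_means = toy.groupby('cp_type')[c_cols].mean()
diff = group_means.loc['trt_cp'] - group_means.loc['ctl_vehicle']
print(diff)
```

The sign of each entry in `diff` shows, per column, whether the treated mean lies above or below the control mean.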

> **Next, let's see the impact of the treatment time on the cell viability.**

In [None]:
plott('c-30')

*(work in progress)*

> **Next let's see the correlation between cell viability features (in the treated samples, no control).**

In [None]:
#Select the columns c-
c_cols = [col for col in a.columns if 'c-' in col]
#Filter the columns c-
cells=treated[c_cols]
#Plot heatmap
plt.figure(figsize=(12,6))
sns.heatmap(cells.corr(), cmap='coolwarm', alpha=0.9)
plt.title('Correlation: Cell viability', fontsize=15, weight='bold')
plt.xticks(weight='bold')
plt.yticks(weight='bold')
plt.show()

* **Many high correlations between c- features. This is something to be taken into consideration in feature engineering.**
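To turn the heatmap into a list that can feed feature engineering, the strongly correlated pairs can be extracted from the upper triangle of the correlation matrix. This is a sketch on synthetic data; the 0.9 threshold and the `toy` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Hypothetical stand-in for a few viability columns
toy = pd.DataFrame({
    'c-0': base,
    'c-1': base + rng.normal(scale=0.1, size=200),  # nearly a copy of c-0
    'c-2': rng.normal(size=200),                     # independent of the others
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
high = pairs[pairs > 0.9]
print(high)
```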


***



# 3- `g-` features are related to gene expression. What is gene expression?

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins. You can refer to my notebook [COVID_19: Viral proteins identification](https://www.kaggle.com/amiiiney/covid-19-proteins-identification-with-biopython) to learn more about how gene expression works.

In short, the mechanism of action of the 206 drugs in this study will activate some genes, gene expression will take place and gene products (proteins) will be synthesized.

In [None]:
plotf('g-10','g-100','g-200','g-400')

***How to interpret those values?***

Gene expression levels are calculated as the ratio between the expression of the target gene (i.e., the gene of interest) and the expression of one or more reference genes (often housekeeping genes). [3]

> To understand this better, let's compare the samples treated with the drugs and the control samples.

In [None]:
plotd('g-510')

Again, we have to check all the genes to draw conclusions, but after checking multiple genes, it seems that the samples treated with the compounds have higher gene expression levels.

**Next, let's check the impact of the treatment time on the gene expression.**

In [None]:
plott('g-510')

*(work in progress)*

> **Next, let's see the correlation between gene expression features. (in the treated samples, no control)**

In [None]:
#Select the columns g-
g_cols = [col for col in a.columns if 'g-' in col]
#Filter the columns g-
genes=treated[g_cols]
#Plot heatmap
plt.figure(figsize=(15,7))
sns.heatmap(genes.corr(), cmap='coolwarm', alpha=0.9)
plt.title('Gene expression: Correlation', fontsize=15, weight='bold')
plt.show()

* **We have both negative and positive correlations here between genes. Interesting!** *(The control samples were not included, so the negative correlation between some genes is not related to the control/treated samples).*
***
# 4- Drugs *(Targets)*:
***
> ## 4-1 Scored targets:

This is a multi-label classification problem: we have 206 drugs *(the label columns of `train_targets_scored`)* and we have to find out the efficiency of these drugs on the cell lines. A single sample can be a target of many drugs, so we have to predict the mechanism of action of each drug.

*We will filter the **train_targets_scored** dataset and keep just the treated rows (we discard the control rows because they are not treated with the drugs).*


In [None]:
#Count unique values per column
cols = drugs_tr.columns.to_list() # specify the columns whose unique values you want here
uniques = {col: drugs_tr[col].nunique() for col in cols}
uniques=pd.DataFrame(uniques, index=[0]).T
uniques=uniques.rename(columns={0:'count'})
uniques= uniques.drop('sig_id', axis=0)


#Calculate the mean values (sig_id is dropped because it is not numeric)
average=d.drop(columns='sig_id').mean()
average=pd.DataFrame(average)
average=average.rename(columns={ 0: 'mean'})
average['percentage']= average['mean']*100
#Filter just the drugs with mean >0.01
average_filtered= average[average['mean'] > 0.01]
average_filtered= average_filtered.reset_index()
average_filtered= average_filtered.rename(columns={'index': 'drug'})

In [None]:
#The 'seaborn' style name was removed from newer matplotlib; sns.set applies the seaborn theme directly
sns.set(style='whitegrid')
fig = plt.figure(figsize=(15,5))
#1 rows 2 cols
#first row, first col
ax1 = plt.subplot2grid((1,2),(0,0))
#Pass the column by keyword: newer seaborn reads the first positional argument as `data`
sns.countplot(x=uniques['count'], color='deepskyblue', alpha=0.75)
plt.title('Unique elements per drug [0,1]', fontsize=15, weight='bold')
#first row sec col
ax1 = plt.subplot2grid((1,2),(0,1))
sns.distplot(average['percentage'], color='orange', bins=40)
plt.title("The drugs mean distribution", fontsize=15, weight='bold')
plt.show()

* All the drugs are present in at least one sample.
* The presence of the drugs is very low in the samples (Mostly less than 0.5%).
* Some drugs *(outliers)* have a higher presence in comparison with the rest of drugs with a percentage in the range (1.5%, 3.5%).

> **Let's have a look over some of these drugs**

In [None]:
plt.figure(figsize=(7,7))
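#Sort the rows as a whole so each drug name stays paired with its own percentage
avg_sorted = average_filtered.sort_values('percentage')
plt.scatter(avg_sorted['percentage'], avg_sorted['drug'], color=sns.color_palette('Reds', len(avg_sorted)))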
plt.title('Drugs with higher presence in train samples', weight='bold', fontsize=15)
plt.xticks(weight='bold')
plt.yticks(weight='bold')
plt.xlabel('Percentage', fontsize=13)
plt.show()

* It seems we have 2 outliers here: **tubulin_inhibitor** and **sodium_channel_inhibitor** *(worth double-checking against the sorted percentages in `average_filtered`)*. Next, let's check the correlation between the drugs.
***
### Correlation between drugs:

In [None]:
ploth(drugs_tr)

Most of the drugs have 0 correlation. It is worth recalling that the presence of active drugs in the samples is very low (mainly 1 or 2 drugs per sample).
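The number of active drugs per sample can be counted directly by summing each target row. This is a sketch on a hypothetical target table (`toy` and its `moa_*` column names are made up); in this notebook the same lines would run on `drugs_tr`.

```python
import pandas as pd

# Hypothetical stand-in for a binary target table with an id column
toy = pd.DataFrame({
    'sig_id': ['id_0', 'id_1', 'id_2', 'id_3'],
    'moa_a': [1, 0, 0, 1],
    'moa_b': [0, 0, 1, 1],
    'moa_c': [0, 0, 0, 0],
})

# Row sums give the number of active labels per sample
labels_per_sample = toy.drop(columns='sig_id').sum(axis=1)
# How many samples have 0, 1, 2, ... active labels
counts = labels_per_sample.value_counts().sort_index()
print(counts)
```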

However, we notice some yellow dots *(high correlation)* between some drugs. Let's have a closer look over those drugs.

#### Drugs with the highest MoA correlation

In [None]:
#Correlation between drugs (sig_id is dropped because it is not numeric)
corre= drugs_tr.drop(columns='sig_id').corr()
#Unstack the dataframe
s = corre.unstack()
so = s.sort_values(kind="quicksort", ascending=False)
#Create new dataframe
so2= pd.DataFrame(so).reset_index()
so2= so2.rename(columns={0: 'correlation', 'level_0':'Drug 1', 'level_1': 'Drug2'})
#Filter out the coef 1 correlation between the same drugs
so2= so2[so2['correlation'] != 1]
#Drop pair duplicates
so2= so2.reset_index()
pos = [1,3,5,7,9]
so2= so2.drop(so2.index[pos])
so2= so2.round(decimals=4)
so3=so2.head(4)
#Show the top 5 correlations
cm = sns.light_palette("Red", as_cmap=True)
s = so2.head().style.background_gradient(cmap=cm)
s

* **2 drug pairs have a correlation above 0.9:** 

Those drug pairs must occur in the few samples that have more than one active drug. The function and distribution of these drugs should be taken into consideration because this is **a multi-label classification problem**, where the correlation between the labels also matters. Model selection should therefore account for label correlation: select a model that finds patterns not just in the train data but also in the **multi-label target data.**
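The mirrored entries of a correlation matrix (A-B and B-A) can also be removed without relying on fixed row positions, by building an order-independent key for each pair. This is a sketch on a hypothetical target table (`toy` and the `moa_*` names are made up); the `Drug 1` and `Drug2` column names follow the ones used above.

```python
import pandas as pd

# Hypothetical binary targets
toy = pd.DataFrame({
    'moa_a': [1, 1, 0, 0, 1, 0],
    'moa_b': [1, 1, 0, 0, 0, 0],
    'moa_c': [0, 0, 1, 0, 0, 1],
})

corr = toy.corr()
pairs = (corr.unstack()
             .rename_axis(['Drug 1', 'Drug2'])
             .reset_index(name='correlation'))
# Drop self-pairs by name rather than by testing correlation != 1
pairs = pairs[pairs['Drug 1'] != pairs['Drug2']].copy()
# Same key for (A, B) and (B, A), so duplicates can be dropped safely
pairs['pair'] = [tuple(sorted(p)) for p in zip(pairs['Drug 1'], pairs['Drug2'])]
pairs = (pairs.sort_values('correlation', ascending=False)
              .drop_duplicates('pair')
              .reset_index(drop=True))
print(pairs[['Drug 1', 'Drug2', 'correlation']])
```

Filtering by name also keeps distinct drugs whose correlation happens to be exactly 1, which a value-based filter would discard.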

Below, we will try to connect the dots and try to match the high correlated drugs with the drugs with the most presence in the samples.

In [None]:
plt.figure(figsize=(8,10))
the_table =plt.table(cellText=so3.values,colWidths = [0.35]*len(so3.columns),
          rowLabels=so3.index,
          colLabels=so3.columns
          ,cellLoc = 'center', rowLoc = 'center',
          loc='left', edges='closed', bbox=(1,0, 1, 1)
         ,rowColours=sns.color_palette('Reds',10))
the_table.auto_set_font_size(False)
the_table.set_fontsize(10.5)
the_table.scale(2, 2)
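#Sort the rows as a whole so each drug name stays paired with its own percentage
avg_sorted = average_filtered.sort_values('percentage')
plt.scatter(avg_sorted['percentage'], avg_sorted['drug'], color=sns.color_palette('Reds', len(avg_sorted)))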
plt.title('Drugs with higher presence in train samples', weight='bold', fontsize=15)
plt.xlabel('Percentage', weight='bold')
plt.xticks(weight='bold')
plt.yticks(weight='bold')
plt.axhline(y=13.5, color='black', linestyle='-')
plt.axhline(y=18.5, color='black', linestyle='-')
plt.show()

* **Observations:**

>1. **Two drug pairs** have a correlation above 0.9 and are highly present in the samples (within the black boundaries).
2.  **kit_inhibitor** is highly correlated with 2 drugs: **pdgfr_inhibitor and flt3_inhibitor**.
3. flt3_inhibitor has slightly less presence in the samples than the other highly correlated drugs.
4. The *outliers* **sodium_channel_inhibitor and tubulin_inhibitor** are highly present in the samples but have low correlation with the rest of the drugs. *(This needs further investigation)*

The samples having +2 active drugs most probably include those 4 drug pairs.
***
> ## 4-2 Nonscored targets:

In this section, we will have a look over the dataset provided that will not be used in the score. This dataset has 402 drugs *(more than the 206 drugs in the targets_scored dataset that will be used in the score).* 

Let's first merge the scored and nonscored target datasets and try to find patterns and relationships between the drugs in both datasets. 

In [None]:
#Correlation between drugs (sig_id is dropped because it is not numeric)
corre= adt.drop(columns='sig_id').corr()
#Unstack the dataframe
s = corre.unstack()
so = s.sort_values(kind="quicksort", ascending=False)
#Create new dataframe
so2= pd.DataFrame(so).reset_index()
so2= so2.rename(columns={0: 'correlation', 'level_0':'Drug 1', 'level_1': 'Drug2'})
#Filter out the coef 1 correlation between the same drugs
so2= so2[so2['correlation'] != 1]
#Drop pair duplicates
so2= so2.reset_index()
pos = [1,3,5,7,9, 11,13,15,17,19,21,23, 25, 27, 29, 31,33,35, 37, 39, 41, 43, 45]
so2= so2.drop(so2.index[pos])
#so2= so2.round(decimals=4)
so3=so2.head()
#Show the top 16 correlations
cm = sns.light_palette("Red", as_cmap=True)
s = so2.head(16).style.background_gradient(cmap=cm)
s

**Great!** This nonscored dataset seems promising. 

If in the `scored_target` dataset we had just 4 drug pairs with +0.7 correlation, by merging both datasets we have more than 15 drug pairs highly correlated.

### Heatmap of the 31 drugs with high correlation

In [None]:
#High correlation adt 22 pairs
adt15= so2.head(22)
#Filter the drug names
adt_1=adt15['Drug 1'].values.tolist()
adt_2=adt15['Drug2'].values.tolist()
#Join the 2 lists
adt3= adt_1 + adt_2
#Keep unique elements and drop duplicates
adt4= list(dict.fromkeys(adt3))
#Filter out the selected drugs from the "all drugs treated" adt dataset
adt5= adt[adt4]

In [None]:
ploth(adt5)

**Interesting!** Even though this is just a visual representation of the table above, here we can clearly see that many drugs from the `scored_target` dataset have high correlation with several drugs from the `nonscored_targets`.
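To isolate the pairs that link the two target files, the pair table can be filtered so that one drug comes from the scored set and the other from the nonscored set. This is a sketch on made-up names and values (`pairs`, `scored_cols` and `nonscored_cols` are hypothetical); in this notebook the two sets would come from the columns of `d` and `c` without `sig_id`.

```python
import pandas as pd

scored_cols = {'moa_a', 'moa_b'}      # hypothetical scored label names
nonscored_cols = {'moa_x', 'moa_y'}   # hypothetical nonscored label names

# Hypothetical, already de-duplicated pair table
pairs = pd.DataFrame({
    'Drug 1': ['moa_a', 'moa_a', 'moa_x'],
    'Drug2': ['moa_b', 'moa_x', 'moa_y'],
    'correlation': [0.9, 0.8, 0.7],
})

# Keep rows where exactly one side is scored and the other is nonscored
first_scored = pairs['Drug 1'].isin(scored_cols) & pairs['Drug2'].isin(nonscored_cols)
first_nonscored = pairs['Drug 1'].isin(nonscored_cols) & pairs['Drug2'].isin(scored_cols)
cross = pairs[first_scored | first_nonscored]
print(cross)
```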
***

# 5- Test set:
After understanding the relationship between the features and the labels, we move on to the test set to understand the features and their relationship with the train features.
> ## 5-1 Features:

In [None]:
#The 'seaborn' style name was removed from newer matplotlib; sns.set applies the seaborn theme directly
sns.set(style='whitegrid')
fig = plt.figure(figsize=(15,5))
#1 rows 2 cols
#first row, first col
ax1 = plt.subplot2grid((1,2),(0,0))
sns.countplot(x='cp_type', data=b, palette='rainbow')
plt.title('Test: Control and treated samples', fontsize=15, weight='bold')
#first row sec col
ax1 = plt.subplot2grid((1,2),(0,1))
sns.countplot(x='cp_dose', data=b, palette='PiYG')
plt.title('Test: Treatment Doses: Low and High',weight='bold', fontsize=18)
plt.show()

In [None]:
plt.figure(figsize=(15,5))
#Plot the test set (b), not the train set (a)
sns.distplot( b['cp_time'], color='gold', bins=5)
plt.title("Test: Treatment duration ", fontsize=15, weight='bold')
plt.show()

Everything seems similar to the train set:
* The doses are equally applied.
* Very few control samples.
* Same treatment duration 24h, 48h and 72h.

**Good news!** It seems that both train and test datasets are similar in terms of *experimental conditions.* The variation would be in the gene expression and cell viability since the samples used in the test set are different from those in the train set.
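The similarity of experimental conditions can be checked numerically by placing the train and test proportions side by side. This is a sketch with hypothetical frames (`train_toy`, `test_toy`) and a hypothetical helper (`share_table`); in this notebook the helper would be called with `a` and `b`.

```python
import pandas as pd

# Hypothetical stand-ins for the train and test feature tables
train_toy = pd.DataFrame({'cp_dose': ['D1', 'D2', 'D1', 'D2'],
                          'cp_time': [24, 48, 72, 24]})
test_toy = pd.DataFrame({'cp_dose': ['D1', 'D2'],
                         'cp_time': [24, 72]})

def share_table(train, test, col):
    """Return the share of each category of `col` in train and test."""
    table = pd.concat({
        'train': train[col].value_counts(normalize=True),
        'test': test[col].value_counts(normalize=True),
    }, axis=1)
    # A category missing from one set gets a share of 0
    return table.fillna(0).sort_index()

time_shares = share_table(train_toy, test_toy, 'cp_time')
print(time_shares)
```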

Let's see how different those samples are!

> ## 5-2 Gene expression:

In [None]:
#Filter out just the treated samples
treated2= b[b['cp_type']=='trt_cp']

#Select the columns g-
g_cols2 = [col for col in treated2.columns if 'g-' in col]
#Filter the columns g-
genes2=treated2[g_cols2]
#Plot heatmap
plt.figure(figsize=(15,6))
sns.heatmap(genes2.corr(), cmap='coolwarm', alpha=0.9)
plt.title('Test: Correlation gene expression', fontsize=15, weight='bold')
plt.xticks(weight='bold')
plt.yticks(weight='bold')
plt.show()

We can see several high positive and negative correlations between some genes, same as in the train set. However more investigation is needed to find some patterns and differences in gene expression in the test set.

> ## 5-3 Cell viability:

In [None]:
#Select the columns c-
c_cols3 = [col for col in b.columns if 'c-' in col]
#Filter the columns c- from the treated test samples (treated2), not the train ones
cells3=treated2[c_cols3]
#Plot heatmap
plt.figure(figsize=(12,6))
sns.heatmap(cells3.corr(), cmap='coolwarm', alpha=0.9)
plt.title('Test: Correlation cell viability', fontsize=15, weight='bold')
plt.xticks(weight='bold')
plt.yticks(weight='bold')
plt.show()

Again, the first impression is that all the cell viability features are highly correlated, same as in the train set, with some specific cell types having lower correlation than others *(c-24, c-22, c-73...)*.

### [WORK IN PROGRESS]

***
# References:

[1] Viability assay https://en.wikipedia.org/wiki/Viability_assay

[2] COVID-19 viral proteins identification https://www.kaggle.com/amiiiney/covid-19-proteins-identification-with-biopython

[3] Gene expression level https://www.sciencedirect.com/topics/biochemistry-genetics-and-molecular-biology/gene-expression-level