In [None]:
import numpy as np
import pandas as pd

In [None]:
train_features = pd.read_csv('../input/lish-moa/train_features.csv')
train_targets_scored = pd.read_csv('../input/lish-moa/train_targets_scored.csv')
train_targets_nonscored = pd.read_csv('../input/lish-moa/train_targets_nonscored.csv')
test_features = pd.read_csv('../input/lish-moa/test_features.csv')
train_drug = pd.read_csv("../input/lish-moa/train_drug.csv")

# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
data = pd.concat([train_features, test_features])

ss = pd.read_csv('../input/lish-moa/sample_submission.csv')

In [None]:
train_features

In [None]:
train_features.columns.tolist();

In [None]:
train_drug.head()

In [None]:
train_drug.info()

In [None]:
train_drug.nunique()

In [None]:
train_features.describe()

In [None]:
sample_data = train_features[["sig_id","cp_type","cp_time","cp_dose","g-0","g-771", "c-0","c-99"]]
sample_data

In [None]:
sample_data.info()

In [None]:
sample_data.describe()

In [None]:
for col in list(sample_data.columns):
    print("NAME:", col, " TYPE:", sample_data[col].dtype, " NUNIQUE:", sample_data[col].nunique(), " UNIQUES:")
    if (sample_data[col].nunique()<10):
        print(sample_data[col].unique())
    print("------------------------------")

In [None]:
train_targets_scored

In [None]:
train_targets_scored.info()

In [None]:
train_targets_nonscored

In [None]:
train_features=train_features.merge(train_drug, on = "sig_id")

In [None]:
ss

In [None]:
train_features_scored=pd.merge(train_features,train_targets_scored,how='inner')
train_features_scored

# Overview of Datasets
### train_features
**sig_id** - Refers to a unique sample in the dataset. Primary key for the dataset which links it to train_targets_scored dataset

**Gene Expression** - Variables with the g- prefix store gene expression information. There are 772 of them (g-0 to g-771). They generally have a mean around zero and range from roughly -10 to 10

**Cell Viability** - Variables with the c- prefix store cell viability information. There are 100 of them (c-0 to c-99). They also tend to have a mean around zero, with a range of roughly -10 to 5. The negative values are surprising for a viability measure; we investigate this later

**Drug Dosage** (cp_dose) - Takes two distinct values, D1 and D2, signifying low and high dosage

**CP Time** (cp_time) - Takes three distinct values: 24, 48 and 72 hours

**CP Type** (cp_type) - Indicates whether a sample was treated with a compound (trt_cp) or with a control perturbation (ctl_vehicle)
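The column groups described above can be checked programmatically rather than read off `describe()`. The following is a minimal sketch on a small made-up frame; `toy` and `summarize_feature_groups` are hypothetical names, and the values are random, not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_features (random values, not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c", "id_d"],
    "cp_type": ["trt_cp", "trt_cp", "ctl_vehicle", "trt_cp"],
    "cp_time": [24, 48, 72, 24],
    "cp_dose": ["D1", "D2", "D1", "D2"],
    "g-0": rng.normal(size=4),
    "g-1": rng.normal(size=4),
    "c-0": rng.normal(size=4),
})


def summarize_feature_groups(df):
    """Count the g- and c- columns and report the overall min/max of each group."""
    rows = {}
    for prefix in ("g-", "c-"):
        cols = [c for c in df.columns if c.startswith(prefix)]
        rows[prefix] = {
            "n_columns": len(cols),
            "min": df[cols].min().min(),
            "max": df[cols].max().max(),
        }
    return pd.DataFrame(rows).T


summary = summarize_feature_groups(toy)
print(summary)

# Number of distinct values in each categorical-like column
print(toy[["cp_type", "cp_time", "cp_dose"]].nunique())
```

Calling `summarize_feature_groups(train_features)` on the real frame should give the actual column counts and value ranges.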

### train_targets_scored
Dataset containing the output variables for the training data. There are 206 binary output variables showing which MoAs are triggered for a specific sig_id. This dataset is linked to the train_features dataset through the sig_id column
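Because the two tables are linked only through sig_id, it can help to make the join key explicit and let pandas verify that the key is unique on both sides. This is a small sketch on made-up rows; `features`, `targets` and the `moa_a` column are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for train_features and train_targets_scored
features = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "g-0": [0.1, -0.2, 0.3]})
targets = pd.DataFrame({"sig_id": ["id_a", "id_b", "id_c"], "moa_a": [1, 0, 0]})

# validate="one_to_one" raises pandas.errors.MergeError if sig_id repeats on either side
merged = features.merge(targets, on="sig_id", how="inner", validate="one_to_one")
print(merged)
```

An inner join silently drops unmatched rows, so comparing `len(merged)` with `len(features)` is a cheap sanity check.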

### test_features
Test data on which our trained models have to make a prediction.

Let's now take a deeper look at the distribution of the Gene Expression & Cell Viability variables

# Distribution of Gene Expression & Cell Viability Variables

In [None]:
import random
from plotly.subplots import make_subplots
import plotly.graph_objects as go #Plotly for Viz
import plotly.express as px # Plotly express

#Plotting Histograms for Randomly Selected Gene Expression Variables
fig = make_subplots(rows=5, cols=4,shared_yaxes=True)
j=1
k=1

for i in range(1,21):
    rand=random.randint(0, 771)  # randint is inclusive, so this covers g-0 through g-771
    col="g-"+str(rand)
    fig.add_trace(
    go.Histogram(x=train_features[col],name=col),
    row=k, col=j
    )
   # print(k,j)
    j=j+1
    if(j>4):
        j=1
    if(i%4==0):
        k=k+1

        
fig.update_layout(title_text="Distribution for Randomly Selected Gene Expression Variables")
fig.show()

In [None]:
#Plotting Histograms for Randomly Selected Cell Viability Rate Variables
fig = make_subplots(rows=5, cols=4,shared_yaxes=True)
j=1
k=1

for i in range(1,21):
    rand=random.randint(0, 99)
    col="c-"+str(rand)
    fig.add_trace(
    go.Histogram(x=train_features[col],name=col),
    row=k, col=j
    )
   # print(k,j)
    j=j+1
    if(j>4):
        j=1
    if(i%4==0):
        k=k+1

        
fig.update_layout(title_text="Distribution for Randomly Selected Cell Viability Variables")
fig.show()

**Some observations from the plots above -**
- Both the Gene Expression & Cell Viability variables generally tend to follow a roughly normal distribution centered around zero

- Gene Expression variables tend to have a long left tail and a long right tail, indicating the presence of outliers

- Cell Viability variables show no long right tail but generally have a long left tail, which means negative values. If cell viability were a raw percentage of live cells, it could not be below zero. A likely explanation is that the data has already been transformed (for example normalized and centered around zero), so these values are transformed scores rather than raw viability rates.
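The tail behaviour noted above can also be quantified instead of judged by eye. The sketch below uses synthetic columns (a symmetric one and one with an artificial long left tail), not the competition data; `g-toy` and `c-toy` are hypothetical names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 5000

# Synthetic stand-ins: a symmetric column and one with an added long left tail
symmetric = rng.normal(size=n)
left_tailed = rng.normal(size=n)
left_tailed[:250] -= rng.exponential(scale=4.0, size=250)
toy = pd.DataFrame({"g-toy": symmetric, "c-toy": left_tailed})

# Negative skew indicates a longer left tail; high excess kurtosis indicates heavy tails
shape = pd.DataFrame({
    "skew": toy.skew(),
    "excess_kurtosis": toy.kurt(),
    "q01": toy.quantile(0.01),
    "q99": toy.quantile(0.99),
})
print(shape)
```

Applying the same summary to the real g- and c- columns would show whether the long left tail holds for all Cell Viability variables or only for the randomly sampled ones.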



Let's now look at the correlation among the Cell Viability variables

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#Correlation matrix for Variables
cell=train_features_scored.loc[:, train_features_scored.columns.str.startswith('c-')]
corr = cell.corr(method='pearson')

# corr
f, ax = plt.subplots(figsize=(25, 25))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.8, cbar_kws={"shrink": .5})

ax = sns.heatmap(corr,linewidths=0.8,cmap=cmap)

The heavy presence of red in the plot above suggests that the cell viability variables are highly correlated with each other. This is useful for dimensionality reduction, because highly correlated variables carry overlapping information and can be combined. We will return to this in the PCA section of the notebook.
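A heatmap with 100 variables is hard to read pair by pair, so listing the pairs above a cutoff can complement it. This is a sketch on synthetic columns; `correlated_pairs` and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)

# Synthetic stand-ins: c-0 and c-1 share a common signal, c-2 is independent noise
toy = pd.DataFrame({
    "c-0": base + 0.1 * rng.normal(size=n),
    "c-1": base + 0.1 * rng.normal(size=n),
    "c-2": rng.normal(size=n),
})


def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation is at least threshold."""
    corr = df.corr(method="pearson")
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(ascending=False)


pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, `correlated_pairs(cell)` would give a ranked list that can be compared with what the heatmap suggests.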

In [None]:
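# New column: number of scored MoAs active for each sample.
# Select the targets by name: a positional slice of the last 207 columns
# would also pick up the non-numeric drug_id column added by the earlier merge.
target_cols = train_targets_scored.columns.drop('sig_id')
train_features_scored['sum_actions']=train_features_scored[target_cols].sum(axis=1)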

fig = make_subplots(rows=3, cols=1,subplot_titles=('Sum of Drug Actions with different Dosages','Sum of Drug Actions with different Treatment Durations','Sum of Drug Actions with different Treatment Types'))
#fig = go.Figure()

fig.add_trace(go.Histogram(x=train_features_scored.loc[train_features_scored['cp_dose']=='D1','sum_actions'],name='Drug Dosage - D1'),row=1,col=1)
fig.add_trace(go.Histogram(x=train_features_scored.loc[train_features_scored['cp_dose']=='D2','sum_actions'],name='Drug Dosage - D2'),row=1,col=1)
#fig.update_layout(title_text='Sum of Drug Actions with different Dosages',xaxis_title_text='Sum of Drug Actions',yaxis_title_text='Count of Samples')
#fig.show()

#fig1 = go.Figure()
fig.add_trace(go.Histogram(x=train_features_scored.loc[train_features_scored['cp_time']==24,'sum_actions'],name='Treatment Duration - 24h'),row=2,col=1)
fig.add_trace(go.Histogram(x=train_features_scored.loc[train_features_scored['cp_time']==48,'sum_actions'],name='Treatment Duration - 48h'),row=2,col=1)
fig.add_trace(go.Histogram(x=train_features_scored.loc[train_features_scored['cp_time']==72,'sum_actions'],name='Treatment Duration - 72h'),row=2,col=1)
#fig.update_layout(title_text='Sum of Drug Actions with different Dosage Times',xaxis_title_text='Sum of Drug Actions',yaxis_title_text='Count of Samples')
#fig.show()

#fig2 = go.Figure()
fig.add_trace(go.Histogram(x=train_features_scored.loc[train_features_scored['cp_type']=='trt_cp','sum_actions'],name='Drug Type - trt_cp'),row=3,col=1)
fig.add_trace(go.Histogram(x=train_features_scored.loc[train_features_scored['cp_type']=='ctl_vehicle','sum_actions'],name='Drug Type - ctl_vehicle'),row=3,col=1)
#fig2.update_layout(title_text='Distribution ofActions with different Dosage Types',xaxis_title_text='Sum of Drug Actions',yaxis_title_text='Count of Samples')
#fig2.show()

One thing that stands out is that for none of the three variables we looked at (Drug Dosage, Drug Type and Treatment Duration) does the distribution of the sum of MoAs differ visibly between the variable's values. It is therefore unlikely that these features, on their own, will play an important role in model development.

One thing that was mentioned above and is confirmed by the output is that records with ctl_vehicle as cp_type do not have any MoAs.

Another observation is that the majority of samples have 0 or 1 MoA associated with them. Very few samples have 2 or more MoAs, which is why models can struggle to predict multiple MoAs for one sample.
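The observations above can be backed by counts rather than histograms alone. The following is a sketch on a made-up frame; `moa_a` and `moa_b` are hypothetical target columns.

```python
import pandas as pd

# Hypothetical toy frame with two binary MoA targets
toy = pd.DataFrame({
    "cp_type": ["trt_cp", "trt_cp", "ctl_vehicle", "trt_cp", "ctl_vehicle"],
    "moa_a": [1, 0, 0, 1, 0],
    "moa_b": [0, 0, 0, 1, 0],
})
target_cols = ["moa_a", "moa_b"]
toy["sum_actions"] = toy[target_cols].sum(axis=1)

# How many samples have 0, 1, 2, ... active MoAs
counts = toy["sum_actions"].value_counts().sort_index()
print(counts)

# Largest number of active MoAs per cp_type; 0 for controls means no MoA at all
max_by_type = toy.groupby("cp_type")["sum_actions"].max()
print(max_by_type)
```

On the real data, a maximum of 0 for ctl_vehicle would confirm that control samples never carry an MoA label.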

### Dependent Variables - Mechanism of Action 

Since this is a multilabel classification problem, we have 206 binary dependent variables in the data, with an individual column for each MoA. We have to predict whether a particular gene expression & cell viability sample corresponds to each specific MoA. A single drug sample can have multiple mechanisms of action.

For the charts below, we summed up the values of individual MoAs to understand which MoAs are most & least commonly triggered in the dataset.

In [None]:
depsum=train_features_scored.iloc[:,-207:-1].sum(axis=0)
depsum=depsum.to_frame()
depsum.columns=['sum_actions_vert']
depsum['action']=depsum.index
depsum=depsum.reset_index(drop=True)
depsum=depsum.sort_values(by='sum_actions_vert',ascending=False)
depsum_top=depsum.head(10)
depsum_tail=depsum.tail(10)

import plotly.express as px
# df = px.data.tips()
fig2 = px.histogram(depsum, x="sum_actions_vert",opacity=0.6, title='Histogram of Sum of Actions across MoAs')
fig2.show()

In [None]:
fig=px.bar(depsum_top,y='sum_actions_vert',x='action',title='Most Common MoAs')
#fig.update_layout(height=350,width=800)
fig.show()

fig1=px.bar(depsum_tail,y='sum_actions_vert',x='action',title='Least Common MoAs')
#fig1.update_layout(height=350,width=800)

fig1.show()

Here are some of the observations from the plots above -

- The histogram for the sum of actions across MoAs is right skewed: most MoAs are triggered in fewer than 100 samples, with a long tail of a few frequent ones. This is quite low given that we have almost 24k samples in our training dataset.
- nfkb_inhibitor & proteasome_inhibitor are the most common MoAs, with counts of 800+ and 700+ respectively
- Other common MoAs such as cyclooxygenase_inhibitor, dopamine_receptor_antagonist and dna_inhibitor have counts of 400+ records each
- The least common MoAs, such as steroid, elastase_inhibitor and laxative, have fewer than 10 records each
- Each MoA has been triggered at least once in the data
- Models may struggle with rarely triggered MoAs.
- Given the low number of triggers for some MoAs, it may be an interesting approach to set their predictions manually to zero and exclude them from modelling.
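The idea of zeroing predictions for rarely triggered MoAs could be sketched as follows. The data is made up, and `min_positives` is a hypothetical cutoff; whether this helps depends on the evaluation metric, so treat it as an experiment rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical binary targets: one common, one rare and one mid-frequency MoA
targets = pd.DataFrame({
    "moa_common": [1, 1, 0, 1, 1, 0],
    "moa_rare": [0, 0, 0, 0, 1, 0],
    "moa_mid": [1, 0, 1, 0, 0, 0],
})

min_positives = 2  # assumed cutoff, to be tuned on validation data
positives = targets.sum(axis=0)
rare_cols = positives[positives < min_positives].index.tolist()
print("Rare MoAs:", rare_cols)

# Hypothetical model output: constant 0.2 for every sample and MoA
preds = pd.DataFrame(np.full((3, 3), 0.2), columns=targets.columns)

# Override the predictions for rare MoAs with zero
preds[rare_cols] = 0.0
print(preds)
```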

Next, we look at the correlation between MoAs to understand whether a drug sample is likely to trigger multiple MoAs that are correlated with each other.

In [None]:
# Last 100 MoA columns; stop before the final column, which is the derived sum_actions
actions=train_features_scored.iloc[:,-101:-1]
corr = actions.corr(method='pearson')
# corr
f, ax = plt.subplots(figsize=(25, 25))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
mask = np.triu(np.ones_like(corr, dtype=bool))  # np.bool was removed in NumPy 1.24

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.8, cbar_kws={"shrink": .5})

ax = sns.heatmap(corr,linewidths=0.8,cmap=cmap)

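The above correlation plot shows little pairwise correlation for most of the MoAs plotted, suggesting that triggering one MoA generally says little about whether another is triggered. Two caveats apply: only a subset of the MoA columns is plotted here, and near-zero Pearson correlation between sparse binary labels does not by itself establish independence.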
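For sparse binary labels, co-occurrence counts can be easier to interpret than Pearson correlation. The sketch below uses made-up targets; `moa_a`, `moa_b` and `moa_c` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical binary targets for five samples
targets = pd.DataFrame({
    "moa_a": [1, 1, 0, 0, 1],
    "moa_b": [1, 1, 0, 0, 0],
    "moa_c": [0, 0, 1, 0, 0],
})

# cooc.loc[i, j] counts samples where MoA i and MoA j are both active;
# the diagonal holds each MoA's own count
cooc = targets.T.dot(targets)
print(cooc)

# cond.loc[i, j] is the share of samples with MoA i that also have MoA j
counts = pd.Series(np.diag(cooc), index=cooc.index)
cond = cooc.div(counts, axis=0)
print(cond)
```

Large off-diagonal entries in `cond` would point to MoAs that tend to be triggered together, even when the correlation heatmap looks mostly flat.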