# 1.0. Introduction
The objective of this notebook is to perform an exploratory analysis of the [Mechanism of Action (MoA) Prediction](https://www.kaggle.com/c/lish-moa/overview/description) data, using Matplotlib and Seaborn.

## 1.1. About the Challenge
To understand the biological mechanism of a disease, scientists seek to identify the protein target associated with it and develop a molecule that can modulate that target. To describe the biological activity of a given molecule, scientists assign it a label referred to as its mechanism of action, or MoA.

In [None]:
!pip install ppscore

In [None]:
# importing the libraries
import os
import gc
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import ipywidgets as widgets
import ppscore as pps


# setting the notebook parameters
root_dir = "/kaggle/input/lish-moa"
plt.rcParams["figure.figsize"] = (16, 8)
sns.set_style("darkgrid")
pd.set_option("display.max_rows", 20, "display.max_columns", None)

In [None]:
##########################
#### Helper Functions ####
##########################

def info_df(df):
    """
    returns a dataframe with number of unique values and nulls
    
    args: dataframe
    returns: dataframe
    """
    info = pd.DataFrame({
        "nuniques": df.nunique(),
        "% nuniques": round((df.nunique() / len(df) * 100), 2),
        "nulls": df.isnull().sum(),
        "% nulls": round((df.isnull().sum() / len(df) * 100), 2)
    })
    
    return info.T

## 1.2. Dataset
In this competition, we have a dataset that combines gene expression and cell viability data. The data is based on a technology that simultaneously measures human cell responses to drugs in a pool of 100 different cell types. Drugs can have multiple MoA annotations, which describe binary responses from different cell types in different ways. The problem here is therefore one of multi-label classification.
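
To make the multi-label framing concrete, here is a minimal sketch on a toy target table. The sample ids and the two label names are made up for illustration and are not taken from the competition files.

```python
import pandas as pd

# Toy target table: one row per sample, one binary column per MoA label.
# The ids and label names below are hypothetical.
toy_targets = pd.DataFrame({
    "sig_id": ["id_a", "id_b", "id_c"],
    "moa_x_inhibitor": [1, 0, 0],
    "moa_y_agonist": [1, 1, 0],
})

# In a multi-label problem a sample can carry zero, one, or several labels.
labels_per_sample = toy_targets.drop(columns=["sig_id"]).sum(axis=1)
print(labels_per_sample.tolist())
```

A row sum above 1 is what separates this setting from ordinary multi-class classification, where each sample has exactly one label.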

In [None]:
# importing the datasets

train_df = pd.read_csv(os.path.join(root_dir, "train_features.csv"))
train_targets = pd.read_csv(os.path.join(root_dir, "train_targets_scored.csv"))
train_targets_nonscored = pd.read_csv(os.path.join(root_dir, "train_targets_nonscored.csv"))

#### Trainset
The trainset contains 876 columns, which can be divided into four broad categories:
1. Gene Expression
2. Cell Viability
3. Whether the sample is treated with compound or not
4. Identifier (`sig_id`)
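
As a rough check on how the 876 columns split into these categories, the sketch below groups a column list by prefix. The list is rebuilt by hand from the counts quoted in this notebook (772 `g-` columns, 100 `c-` columns, three `cp_` columns and `sig_id`), so it is a stand-in for `train_df.columns`, not the real index.

```python
# Hypothetical column list mirroring the layout described in this notebook:
# 1 identifier, 3 cp_* columns, 772 "g-" columns and 100 "c-" columns.
columns = (
    ["sig_id", "cp_type", "cp_time", "cp_dose"]
    + [f"g-{i}" for i in range(772)]
    + [f"c-{i}" for i in range(100)]
)

# Group the columns by their naming convention.
groups = {
    "identifier": [c for c in columns if c == "sig_id"],
    "treatment": [c for c in columns if c.startswith("cp_")],
    "gene_expression": [c for c in columns if c.startswith("g-")],
    "cell_viability": [c for c in columns if c.startswith("c-")],
}

group_sizes = {name: len(cols) for name, cols in groups.items()}
print(group_sizes, sum(group_sizes.values()))
```

Swapping `columns` for `list(train_df.columns)` should apply the same grouping to the loaded data.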

In [None]:
train_df.head()

#### Train Targets (scored and non-scored)
`train_targets` contains the binary MoA targets. The targets are divided into two categories, scored and non-scored. The following table shows the scored targets.

In [None]:
train_targets.head()

Checking the number of nulls and uniques in each predictor of the `train_df`.

In [None]:
info_df(train_df)

Similarly, checking the number of nulls and uniques in `train_targets` (scored).

In [None]:
info_df(train_targets)

In `train_targets`, every target has two classes, 0 and 1. To check the sparsity of the data, we will calculate the percentage of the non-zero class in each target column.

In [None]:
pd.DataFrame({
    "% non-zero class - scored": round(train_targets.drop(columns = ["sig_id"]).sum() / len(train_targets) * 100, 2)
}).T

---
# 2.0. Exploratory Data Analysis
Now that we have a brief overview of the data, we can delve deeper into the data exploration. Section 2 visualizes different predictors, alone and against other variables. The univariate analysis will help us understand the individual predictors, while the multivariate analysis will help us identify associations and other relationships among them.
## 2.1. Univariate Analysis
In this section, we will visualize the predictors one by one, to understand their distribution, describe statistical summaries, and find patterns in the data. 

### 2.1.1. CP Variables
These predictors indicate whether a given sample was treated with a compound, for how long, and at which dose.

In [None]:
plt.subplot(1, 3, 1)
ax = sns.countplot(x = "cp_type", data = train_df)
for p in ax.patches:
    ax.annotate("{}".format(p.get_height()), (p.get_x() + 0.25, p.get_height() + 150))   # for annotation of counts
plt.xlabel("CP Type")

plt.subplot(1, 3, 2)
ax = sns.countplot(x = "cp_time", data = train_df)
for p in ax.patches:
    ax.annotate("{}".format(p.get_height()), (p.get_x() + 0.25, p.get_height() + 50))
plt.xlabel("CP Time")

plt.subplot(1, 3, 3)
ax = sns.countplot(x = "cp_dose", data = train_df)
for p in ax.patches:
    ax.annotate("{}".format(p.get_height()), (p.get_x() + 0.25, p.get_height() + 100))
plt.xlabel("CP Dose")

plt.suptitle("CP - Features", fontsize = 20)
plt.show()

#### Insights
1. The `cp_type` indicates whether a sample was treated with a compound (`trt_cp`) or is a control perturbation (`ctl_vehicle`). The control perturbations have no MoAs. In the training set, the majority of the samples are treated with a compound, about 11 times the number of control perturbations.
2. The `cp_time` indicates the treatment duration and contains three distinct values, 24, 48 and 72 hours, with observations (almost) evenly distributed among these durations.
3. The `cp_dose` indicates whether the dose is high or low. Like `cp_time`, it is also evenly distributed.
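
The ratio quoted in the first insight can be computed with `value_counts`. The sketch below uses a toy series with made-up counts chosen to give a ratio of 11; the real counts are the ones annotated on the countplot.

```python
import pandas as pd

# Toy column with made-up counts (22 treated vs 2 control samples).
toy_cp_type = pd.Series(["trt_cp"] * 22 + ["ctl_vehicle"] * 2, name="cp_type")

counts = toy_cp_type.value_counts()                # absolute counts per class
shares = toy_cp_type.value_counts(normalize=True)  # fraction of rows per class
ratio = counts["trt_cp"] / counts["ctl_vehicle"]
print(counts.to_dict(), round(ratio, 1))
```

On the notebook's data the same calls would be applied to `train_df["cp_type"]`.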

### 2.1.2. Gene Expression Data
All the columns that start with `g-` hold the *gene expression data*. Because there are 772 gene-expression predictors, it isn't practical to show each one as a separate plot, so we will use a **Dropdown Widget**.

In [None]:
# list of gene expression columns
ge_list = [i for i in train_df.columns if i.startswith("g-")]

# feeding the above list to the dropdown
dropdown_ge_cols = widgets.Dropdown(options = ge_list)
ge_plot = widgets.Output()

def dropdown_ge_eventhandler(change):
    ge_plot.clear_output()
    with ge_plot:
        display(sns.distplot(train_df[change.new], kde = True, color = "g", label = change.new))
        plt.legend()
        plt.title("Density plot of {}".format(change.new))
        plt.show()
        
dropdown_ge_cols.observe(dropdown_ge_eventhandler, names = "value")

**Note**: Jupyter widgets do not work after the notebook is committed; I don't know why that is the case. They work fairly well in edit mode, so it is advisable to fork the notebook and open it in edit mode to access the dropdown widget. To use the widget, uncomment the following two cells.

In [None]:
# display(dropdown_ge_cols)

In [None]:
# display(ge_plot)

In [None]:
sns.distplot(train_df["g-2"], kde = True, color = "g", label = "g-2")
plt.legend()
plt.title("Density plot of g-2")
plt.show()

#### Insights
1. The probability distributions of the `g-` predictors are bell-shaped, with means all around 0.
2. Skewness is also visible in many of the gene-expression predictors.
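
Skewness can also be quantified rather than read off the plots. The following is a minimal sketch on synthetic data; the two column names and both distributions are assumptions chosen only to contrast a symmetric and a skewed case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a symmetric bell curve and a right-skewed distribution.
toy = pd.DataFrame({
    "symmetric": rng.normal(size=5000),
    "right_skewed": rng.exponential(size=5000),
})

# DataFrame.skew returns one sample skewness value per column:
# close to 0 for symmetric data, positive for a long right tail.
skewness = toy.skew()
print(skewness.round(2))
```

On the notebook's data, `train_df[ge_list].skew()` would give one value per gene-expression column, which could then be sorted to find the most skewed predictors.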

### 2.1.3. Cell Viability Data
All the columns that start with `c-` hold the cell viability data. Like the gene-expression predictors, there are quite a few of them, 100 to be precise. So we will use the dropdown widget here too, to cover all the cell-viability predictors.

In [None]:
# list of cell viability columns
cv_list = [i for i in train_df.columns if i.startswith("c-")]

# feeding the above list to the dropdown
dropdown_cv_cols = widgets.Dropdown(options = cv_list)
cv_plot = widgets.Output()

def dropdown_cv_eventhandler(change):
    cv_plot.clear_output()
    with cv_plot:
        display(sns.distplot(train_df[change.new], kde = True, color = "y", label = change.new))
        plt.legend()
        plt.title("Density plot of {}".format(change.new))
        plt.show()
        
dropdown_cv_cols.observe(dropdown_cv_eventhandler, names = "value")

For the same reason as above, the widgets do not work after committing. To use the widget, uncomment the following two cells.

In [None]:
# display(dropdown_cv_cols)

In [None]:
# display(cv_plot)

In [None]:
sns.distplot(train_df["c-2"], kde = True, color = "y", label = "c-2")
plt.legend()
plt.title("Density plot of c-2")
plt.show()

#### Analyze fat-tail using Q-Q Plot
Since we see quite fat tails in the cell viability predictors, we will analyze them a bit further using Q-Q plots. A Q-Q (quantile-quantile) plot compares the quantiles of the observed data against the quantiles of a theoretical distribution, here the normal distribution.
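
The quantile comparison behind a Q-Q plot can also be inspected numerically. The sketch below calls `scipy.stats.probplot` without a `plot` argument on two synthetic samples; the Student's t sample is only a stand-in for a heavy-tailed predictor and is not the competition data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=2000)
heavy_sample = rng.standard_t(df=2, size=2000)  # heavy-tailed stand-in

# Without `plot`, probplot returns the theoretical quantiles, the ordered
# sample values, and a least-squares fit (slope, intercept, r) through them.
(theo_q, sample_q), (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_heavy) = stats.probplot(heavy_sample, dist="norm")

# r close to 1 means the points lie near a straight line (near-normal data).
print(round(r_normal, 4), round(r_heavy, 4))
```

The further `r` falls below 1, the more the sample quantiles bend away from the reference line, which is what fat tails look like on the plot.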

In [None]:
plt.subplots_adjust(hspace = 0.5)

plt.subplot(2, 2, 1)
sns.distplot(train_df["c-0"], kde = True)
plt.title("Distribution of C-0")

plt.subplot(2, 2, 2)
stats.probplot(train_df["c-0"], dist = "norm", plot = plt)

plt.subplot(2, 2, 3)
sns.distplot(train_df["c-1"], kde = True)
plt.title("Distribution of C-1")

plt.subplot(2, 2, 4)
stats.probplot(train_df["c-1"], dist = "norm", plot = plt)

plt.suptitle("Q-Q Plots of Cell Viability predictors")
plt.show()

#### Insights
1. All the `c-` predictors have fat tails, with a peak around `-10`. These tails are all on the left side. **These tails indicate a high likelihood of improbable events** and could be hard for a machine learning model to grasp. The risk associated with imperfectly modelling these tails is called tail risk. In a competition like this, fat tails could carry a disproportionately high impact.
2. Though the distributions look like normal distributions, the tails cause quite a deviation, as seen in the Q-Q plots. Because the tails in both graphs sit on the left, the distributions are skewed to the left.
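
A left-hand fat tail shows up as negative skewness and positive excess kurtosis. The sketch below checks this on a synthetic mixture: a standard normal bulk plus a small cluster around `-10`. The 95/5 split and the cluster width are assumptions made for illustration, not values estimated from the data.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)

# Synthetic stand-in for a c- predictor: bulk near 0, small cluster near -10.
bulk = rng.normal(loc=0.0, scale=1.0, size=9500)
left_cluster = rng.normal(loc=-10.0, scale=0.5, size=500)
toy_c = np.concatenate([bulk, left_cluster])

skewness = stats.skew(toy_c)             # negative for a tail on the left
excess_kurtosis = stats.kurtosis(toy_c)  # Fisher definition: 0 for a normal
print(round(skewness, 2), round(excess_kurtosis, 2))
```

Running `stats.skew` and `stats.kurtosis` on a real column such as `train_df["c-0"]` would show whether the same pattern holds there.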

### 2.1.4. Target Features

There are a total of 206 target features, which belong to categories such as inhibitors, activators, agonists, etc. In the following visualization, we plot the frequency of the different target types.

In [None]:
# types of target variables
target_types = ["_inhibitor", "_agonist", "_antagonist", "_activator", "_blocker"]
col_counts = {} # key => the target type; val => number of columns of that type
count = 0       # running count of how many columns are currently considered
for i in target_types:
    col_counts[i[1:]] = len([j for j in train_targets.columns if j.endswith(i)])
    count += col_counts[i[1:]]
col_counts["others"] = train_targets.shape[1] - count - 1 # -1 for sig_id column

# plot
sns.set_palette("hls")
bar_colors = ["r", "g", "b", "y", "m", "c"]
plt.bar(*zip(*col_counts.items()), color = bar_colors)
plt.title("Frequency plot of Targets types")
plt.show()

#### Insights
1. Most of the target variables are inhibitors.
2. The number of agonist targets is not equal to the number of antagonist targets. Since, by definition, an agonist causes an action and an antagonist inhibits it, my initial expectation was that the two counts **might** be equal.

## 2.2. Multivariate Analysis
After looking into the distributions and frequency plots (whichever was applicable) and developing an understanding of the individual predictors, we will look at how these predictors interact among themselves.

### 2.2.1. CP Variables

In [None]:
sns.catplot(x = "cp_type", col = "cp_time", hue = "cp_dose", data = train_df, kind = "count")
plt.suptitle("Interaction among the CP Variables")
plt.tight_layout(rect = [0, 0.03, 1, 0.95])
plt.show()

#### Insights
1. The distributions seem very similar across the different treatment durations.
2. There is some difference between the D1 and D2 counts at the 48-hour duration for samples treated with the compound.
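
The counts behind a grouped countplot can be tabulated directly with `pd.crosstab`. The sketch below does so on a six-row toy frame; the rows are invented, and only the column names and category values follow the notebook.

```python
import pandas as pd

# Invented rows that reuse the notebook's column names and category values.
toy = pd.DataFrame({
    "cp_type": ["trt_cp", "trt_cp", "trt_cp", "ctl_vehicle", "trt_cp", "ctl_vehicle"],
    "cp_time": [24, 24, 48, 48, 72, 72],
    "cp_dose": ["D1", "D2", "D1", "D2", "D1", "D1"],
})

# Rows: (cp_type, cp_time) combinations; columns: cp_dose; cells: sample counts.
table = pd.crosstab([toy["cp_type"], toy["cp_time"]], toy["cp_dose"])
print(table)
```

Applied to `train_df`, such a table would give exact numbers for the D1 versus D2 difference at each treatment duration.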

### 2.2.2. Gene Expression
To identify the relationship between two predictors in the gene expression data, we plot a jointplot with a regression line. Because the number of variables in the gene-expression family is far too high, it isn't feasible to plot every pair in this notebook.

In [None]:
sns.set_palette("tab10")  # color_palette() only returns a palette; set_palette() applies it

g1 = sns.jointplot(x = "g-0", y = "g-1", data = train_df, kind = "reg", marker = "+")
g1.plot_marginals(sns.rugplot, height = -.15, clip_on = False)

plt.show()

In [None]:
g1 = sns.jointplot(x = "g-0", y = "g-1", data = train_df, kind = "kde", color = "b")
g1.plot_marginals(sns.rugplot, color = "b", height = -.15, clip_on = False)

plt.show()

#### Insights
1. The scatter between the two gene-expression predictors (here, `g-0` and `g-1`) is wide. The fitted regression line is more or less parallel to the x-axis, hence the predictors are weakly correlated.
2. Though the variance is high, it is clear from the density plot above that the data is concentrated in the range `(-2, 2)`.
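
The link between a flat regression line and weak correlation can be checked numerically. The sketch below uses two independent synthetic columns (the names `g_a` and `g_b` are hypothetical) and computes both the Pearson coefficient and the slope of the least-squares line.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two independent synthetic predictors, so the true correlation is 0.
toy = pd.DataFrame({
    "g_a": rng.normal(size=5000),
    "g_b": rng.normal(size=5000),
})

r = toy["g_a"].corr(toy["g_b"])  # Pearson correlation coefficient
slope, intercept = np.polyfit(toy["g_a"], toy["g_b"], deg=1)  # fitted line
print(round(r, 3), round(slope, 3))
```

For a simple linear fit the slope equals `r` scaled by the ratio of the two standard deviations, so a line nearly parallel to the x-axis and a coefficient near 0 are two views of the same fact.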

#### Predictive Power score of gene-expression predictors
In this section, we explore the predictive power score (PPS) among the variables in the gene-expression family. While looking for alternatives to correlation (which can be quite misleading), I found this [blog](https://towardsdatascience.com/rip-correlation-introducing-the-predictive-power-score-3d90808b9598) and decided to use the score in this analysis.
One of the biggest disadvantages of the predictive power score is speed: it is much slower to compute than correlation. So, I select 10 random variables below and calculate the PPScore for them.
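
To illustrate why correlation can mislead, the sketch below builds a variable that is fully determined by another yet has a Pearson correlation near 0. It then computes a PPS-style score of the form `1 - MAE_model / MAE_naive`, which is how I understand `ppscore` scores regression targets. The binned-median predictor used here is a deliberately simple stand-in evaluated in-sample; it is not the model or the validation scheme that `ppscore` uses.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=2000)
y = x ** 2  # fully determined by x, but not linearly

pearson_r = np.corrcoef(x, y)[0, 1]

# Naive baseline: always predict the median of y.
mae_naive = np.mean(np.abs(y - np.median(y)))

# Stand-in model (an assumption, not ppscore's): median of y within
# 20 equal-width bins of x.
n_bins = 20
edges = np.linspace(-3, 3, n_bins + 1)
bin_idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
bin_medians = np.array([np.median(y[bin_idx == b]) for b in range(n_bins)])
mae_model = np.mean(np.abs(y - bin_medians[bin_idx]))

# Score in [0, 1]: 0 means no better than the baseline, 1 means perfect.
score = max(0.0, 1 - mae_model / mae_naive)
print(round(pearson_r, 3), round(score, 3))
```

The correlation misses the relationship entirely while the error-based score picks it up, which is the motivation for looking beyond correlation here.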

In [None]:
# the ge_list used in the following line is defined in section 2.1.2
ppdf = pps.matrix(train_df[random.sample(ge_list, 10)])
ppdf = ppdf[["x", "y", "ppscore"]].pivot(columns = "x", index = "y", values = "ppscore")
sns.heatmap(ppdf, vmin = 0, vmax = 1, cmap = "Blues", linewidths = 0.5, annot = True)
plt.show()

del ppdf

#### Insights
1. There isn't much predictive power among the gene expression variables. This is backed by the jointplots above, where the two variables were weakly correlated.
2. This suggests that there might not be much redundant information coming from these variables.

### 2.2.3. Cell Viability
The approach to exploring the relationships among the cell-viability predictors remains the same as for the gene-expression predictors. We will first explore the predictors with a regression plot and a KDE plot, and then look into the predictive power among these variables.

In [None]:
g1 = sns.jointplot(x = "c-0", y = "c-1", data = train_df, kind = "reg", marker = "+")
g1.plot_marginals(sns.rugplot, height = -.15, clip_on = False)

plt.show()

In [None]:
g1 = sns.jointplot(x = "c-0", y = "c-1", data = train_df, kind = "kde", color = "b")
g1.plot_marginals(sns.rugplot, color = "b", height = -.15, clip_on = False)

plt.show()

#### Insights
1. Unlike the gene-expression predictors, the cell-viability predictors do have significant correlation among themselves. This is evident in the above graphs.
2. The observations here are also concentrated in the range `(-2, 2)`, but with some concentration around the negative tail too. *My guess is that this tail is the reason we see positive correlations here*.

#### Predictive power score of Cell Viability predictors

In [None]:
# the cv_list used in the following line is defined in section 2.1.3
ppdf = pps.matrix(train_df[random.sample(cv_list, 10)])
ppdf = ppdf[["x", "y", "ppscore"]].pivot(columns = "x", index = "y", values = "ppscore")
sns.heatmap(ppdf, vmin = 0, vmax = 1, cmap = "Blues", linewidths = 0.5, annot = True)
plt.show()

del ppdf

#### Insights
1. Unlike the gene-expression predictors, the cell-viability predictors do have some predictive power among themselves.
2. The predictive scores aren't very high here, which backs our hypothesis that **the high positive correlation is due to two primary areas of concentration, one in the range `(-2, 2)` and the other around `-10`**. In section 2.1.3, we did a fat-tail analysis that shed light on how, despite looking like a bell curve, the distributions of the cell-viability predictors deviate largely from the *theoretical normal distribution*.
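
The hypothesis about two areas of concentration can be sanity-checked on synthetic data. In the sketch below, two variables are independent within each cluster but share a small cluster around `-10`; the cluster sizes and widths are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bulk, n_tail = 4750, 250

# Each variable: independent noise in the bulk, plus a shared cluster near -10
# (the same rows fall into the tail for both variables).
c_a = np.concatenate([rng.normal(0, 1, n_bulk), rng.normal(-10, 0.5, n_tail)])
c_b = np.concatenate([rng.normal(0, 1, n_bulk), rng.normal(-10, 0.5, n_tail)])

r_all = np.corrcoef(c_a, c_b)[0, 1]                     # all rows
r_bulk = np.corrcoef(c_a[:n_bulk], c_b[:n_bulk])[0, 1]  # bulk rows only
print(round(r_all, 3), round(r_bulk, 3))
```

A high overall coefficient alongside a near-zero coefficient within the bulk is consistent with the idea that a shared tail cluster alone can produce a strong positive correlation. It does not by itself show that this is what happens in the real data.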