<h1><center>Mechanisms of Action (MoA) - EDA - a perspective from a computational biologist</center></h1>
<h2><center>2020-09-13</center></h2>


This notebook does not intend to replicate the EDA already done by other contributors. I assume readers have gone through the following general EDAs, which I found really useful, before diving into this domain-specific EDA:
 1. https://www.kaggle.com/isaienkov/mechanisms-of-action-moa-prediction-eda (contributed by isaienkov)
 2. https://www.kaggle.com/headsortails/explorations-of-action-moa-eda (contributed by headsortails)

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>Quick navigation</center></h2>

* [1. Package installations + import libraries + data loading](#1)
* [2. Principal Component Analysis (Gene basis)](#2)
* [3. Leiden clustering on tsne/umap dimensional reductions (Gene basis)](#3)
* [4. Differentially expressed genes defining each cluster (Gene basis)](#4)
* [5. Principal Component Analysis (Cell viability basis)](#5)
* [6. Leiden clustering on tsne/umap dimensional reductions (Cell viability basis)](#6)
* [7. Differentially expressed viabilities defining each cluster (Cell viability basis)](#7)
    
##### Although you can jump straight to the topic of interest, it is recommended to read from top to bottom first.

<a id="1"></a>
<h2 style='background:green; border:0; color:white'><center>1. Package installations + import libraries + data loading</center></h2>

### Packages
If you wish to replicate this work, you'll need to execute the following cell to install the quanp package (https://quanp.readthedocs.io/en/latest/installation.html), which should pull in all the packages/libraries required to run the code in this notebook.

In [None]:
# Install libraries/packages
!conda install seaborn scikit-learn statsmodels numba pytables -y
!conda install -c conda-forge python-igraph leidenalg -y
!pip install quanp
!pip install MulticoreTSNE

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as pl
import quanp as qp

from IPython.display import display
from matplotlib import rcParams

# setting visualization/logging parameters
pd.set_option('display.max_columns', None)
qp.set_figure_params(dpi=100, color_map = 'viridis_r')
qp.settings.verbosity = 1
qp.logging.print_versions()

In [None]:
# Load the training features as a pandas DataFrame (sig_id as the index)
train_features = pd.read_csv('/kaggle/input/lish-moa/train_features.csv', index_col=0)

# Get lists of genes and cell viabilities, respectively
train_genes = [s for s in train_features.columns if "g-" in s]
train_cellvia = [s for s in train_features.columns if "c-" in s]

In [None]:
# Loading pandas dataframe as anndata 
adata_genes = qp.AnnData(train_features[train_genes])
adata_cellvia = qp.AnnData(train_features[train_cellvia])

# add a new `.obs` column for additional categorical features
adata_genes.obs['cp_type'] = train_features['cp_type']
adata_genes.obs['cp_time'] = train_features['cp_time']
adata_genes.obs['cp_dose'] = train_features['cp_dose']
adata_cellvia.obs['cp_type'] = train_features['cp_type']
adata_cellvia.obs['cp_time'] = train_features['cp_time']
adata_cellvia.obs['cp_dose'] = train_features['cp_dose']

<a id="2"></a>
<h2 style='background:green; border:0; color:white'><center>2. Principal Component Analysis (Gene basis)</center></h2>

#### We reduce the dimensionality of the data by running PCA, which reveals the main axes of variation and denoises the data.
    
#### Note: I didn't apply any data transformation or scaling to the feature data. Usually, it works best to apply a log2(x+1) transformation and standardization scaling to the features before PCA. Judging from the gene data distributions shown by isaienkov (https://www.kaggle.com/isaienkov/mechanisms-of-action-moa-prediction-eda), the organizers may already have done that for us. In addition, based on the feature descriptions below, winsorization or some kind of normalization (such as quantile normalization) may have been applied before the standardization scaling or log transformation (i.e. the maximum value is exactly 10.0000 in many features).
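
To make the note above concrete, here is a minimal sketch of what a log2(x+1) transformation followed by standardization looks like, plus a quick check for values sitting exactly at a cap (a hint of winsorization/clipping). The helper names `log2_standardize` and `fraction_at_cap` and the toy arrays are my own illustrations; they are not part of quanp or of the competition data.

```python
import numpy as np

def log2_standardize(x):
    """Apply log2(x + 1), then z-score each column (feature)."""
    logged = np.log2(np.asarray(x, dtype=float) + 1.0)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (logged - mean) / std

def fraction_at_cap(x, cap=10.0, tol=1e-6):
    """Per-column fraction of values at (or above) the assumed cap."""
    x = np.asarray(x, dtype=float)
    return (x >= cap - tol).mean(axis=0)

# Toy raw values: 4 samples x 2 features (hypothetical numbers)
raw = np.array([[0.0, 3.0], [1.0, 7.0], [3.0, 15.0], [7.0, 31.0]])
scaled = log2_standardize(raw)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))

# Toy already-processed values where some entries hit the cap of 10
processed = np.array([[0.2, 10.0], [-1.3, 10.0], [0.7, -0.4], [10.0, 0.1]])
print(fraction_at_cap(processed))
```

If many features share exactly the same maximum, as the describe() table suggests, a check like `fraction_at_cap` is one quick way to see how widespread the clipping might be.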

In [None]:
train_features[train_genes].describe()

In [None]:
rcParams['figure.figsize'] = 12, 8
qp.tl.pca(adata_genes, svd_solver='auto');
qp.pl.pca(adata_genes, 
          color=['cp_type', 'cp_time', 'cp_dose'], 
          size=50, 
          ncols=2);

Here, PC1 seems useful for separating the treatment (trt) and control (ctl) groups, while PC1 and PC2 together can potentially be used to separate the treatment times (24, 48, and 72 hrs).

Let us inspect the contribution of individual PCs to the total variance in the data. This tells us how many PCs we should consider when computing the neighborhood relations or further dimensional reductions of the cells, e.g. in the clustering function qp.tl.leiden() or the tSNE function qp.tl.tsne(). In our experience, a rough estimate of the number of PCs often does fine. The 'elbow' point suggests that PCs up to at least PC3 will be useful for characterizing the cells/subjects. However, we are going to base the further dimensional reduction on the first 10 PCs later.
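
For readers who want to see where a variance-ratio curve comes from, here is a minimal NumPy sketch that computes the per-PC share of variance from a singular value decomposition. The synthetic data and the helper name `pca_variance_ratio` are assumptions for illustration only; I have not checked that quanp computes its plot in exactly this way.

```python
import numpy as np

def pca_variance_ratio(x):
    """Fraction of total variance explained by each principal component."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    # Singular values come back sorted from largest to smallest
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2 / (x.shape[0] - 1)
    return variances / variances.sum()

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 6 features driven mostly by 2 latent factors
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
loadings = rng.normal(size=(2, 6))
data = latent @ loadings + 0.1 * rng.normal(size=(200, 6))

ratio = pca_variance_ratio(data)
print(np.round(ratio, 3))             # per-PC share of variance
print(np.round(np.cumsum(ratio), 3))  # cumulative share; the 'elbow' is where this flattens
```

In this toy case the curve flattens after two PCs because only two latent factors drive the data; on real data the elbow is usually less sharp, which is why a rough estimate is used.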

In [None]:
qp.pl.pca_variance_ratio(adata_genes, n_pcs=20)

<a id="3"></a>
<h2 style='background:green; border:0; color:white'><center>3. Leiden clustering on tsne/umap dimensional reductions (Gene basis)</center></h2>

#### We use the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018) to cluster the neighborhood graph of samples. We first compute the neighborhood graph of the cells/samples using the PCA representation of the data matrix. This gives the distances and connectivities between each sample and its neighbors. Here, we consider 30 nearest neighbors with 10 PCs derived from the PCA. Afterward, we compute the Leiden clustering.
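
As a rough illustration of the neighborhood-graph step, the sketch below finds the k nearest neighbors of each sample by brute force on a toy set of PCA coordinates. The function name `knn_graph` and the toy points are hypothetical. It covers only the distance/neighbor part, not the connectivities, and the brute-force approach would be too memory-hungry for the full training set.

```python
import numpy as np

def knn_graph(coords, n_neighbors):
    """Return (indices, distances) of the n_neighbors nearest points for each row."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances, shape (n_samples, n_samples)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    return idx, np.take_along_axis(dist, idx, axis=1)

# Toy "PCA coordinates": two well-separated groups of three samples each
toy_pcs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
neighbor_idx, neighbor_dist = knn_graph(toy_pcs, n_neighbors=2)
print(neighbor_idx)   # each row lists the 2 nearest samples
print(neighbor_dist)  # and the corresponding distances
```

Every sample's neighbors fall inside its own group here, which is the structure a graph-clustering method then picks up as communities.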

In [None]:
qp.pp.neighbors(adata_genes, n_neighbors=30, n_pcs=10); # 30 nearest neighbors and only consider the first 10 pcs
qp.tl.leiden(adata_genes);

### Computing the t-distributed Stochastic Neighbor Embedding (tSNE)
Let us further reduce the dimensionality of the significant PCs identified above into 2 dimensions using tSNE. Here, the 'leiden' clusters, 'cp_type', 'cp_time', and 'cp_dose' are mapped onto the tSNE. We obtained 17 Leiden clusters (0-16). You can think of these clusters as different cell types from the same tissue or from different tissues.

In [None]:
rcParams['figure.figsize'] = 12, 8
qp.tl.tsne(adata_genes, n_pcs=10); # only consider the first 10 pcs

qp.pl.tsne(adata_genes, color=['leiden', 'cp_type', 'cp_time', 'cp_dose'], 
           legend_loc='on data', ncols=2)

### Uniform Manifold Approximation and Projection (UMAP) - Embedding the neighborhood graph
We can also embed the neighborhood graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE. Before running UMAP, we compute the cluster-level connectivity graph with qp.tl.paga() and use its layout as the initial positions for UMAP.

In [None]:
rcParams['figure.figsize'] = 8,6
qp.tl.paga(adata_genes)
qp.pl.paga(adata_genes, plot=True)

In [None]:
rcParams['figure.figsize'] = 10, 6
qp.tl.umap(adata_genes, init_pos='paga')
qp.pl.umap(adata_genes, color=['leiden', 'cp_type', 'cp_time', 'cp_dose'], 
           legend_loc='on data', ncols=2)


<a id="4"></a>
<h2 style='background:green; border:0; color:white'><center>4. Differentially expressed genes defining each cluster (Gene basis)</center></h2>
    
### Visualizing the differentially expressed genes/features defining each cluster
We can identify genes/features that are differentially expressed in each cluster or group. Here, we take each group of cells and compare the distribution of each gene/feature in that group against its distribution in all other cells not in the group. We list the top 30 genes/features defining each cluster (a feature is significant at P < 0.05 when the absolute value of its score exceeds 1.96).
    
Fact to know: A biologist typically uses these differentially expressed genes (either significantly up-regulated or down-regulated) to define cell types/tissues.
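
To show how a score threshold of 1.96 relates to P < 0.05, here is a minimal sketch of a Wilcoxon rank-sum z-score under the normal approximation, comparing one group against the rest. It assumes no tied values and uses toy numbers; `rank_sum_z` and `two_sided_p` are illustrative helpers of mine, not quanp functions, and the library may apply corrections that this sketch omits.

```python
import math
import numpy as np

def rank_sum_z(group, rest):
    """Wilcoxon rank-sum z-score of `group` versus `rest` (assumes no ties)."""
    group = np.asarray(group, dtype=float)
    rest = np.asarray(rest, dtype=float)
    n1, n2 = len(group), len(rest)
    combined = np.concatenate([group, rest])
    ranks = np.empty(n1 + n2)
    ranks[np.argsort(combined)] = np.arange(1, n1 + n2 + 1)  # rank 1 = smallest
    rank_sum = ranks[:n1].sum()
    expected = n1 * (n1 + n2 + 1) / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (rank_sum - expected) / std

def two_sided_p(z):
    """Two-sided P-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Toy feature values: cells inside one cluster versus all other cells
in_cluster = [5.1, 6.2, 7.3, 8.4, 9.5]
other_cells = [0.1, 1.2, 2.3, 3.4, 4.5]
z = rank_sum_z(in_cluster, other_cells)
print(round(z, 3), round(two_sided_p(z), 4))
```

A positive score means the feature tends to be higher inside the cluster and a negative score means lower, so both tails count when reading the ranking plots.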


In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning);

rcParams['figure.figsize'] = 6,8;
qp.tl.rank_features_groups(adata_genes, 'leiden', method='wilcoxon');
qp.pl.rank_features_groups(adata_genes, n_features=30, sharey=False)

In [None]:
qp.tl.dendrogram(adata_genes, 'leiden', var_names=adata_genes.var_names);
qp.pl.rank_features_groups_matrixplot(adata_genes, n_features=5, use_raw=False, 
                                      cmap='bwr'); # choose only top 5 features

From the matrixplot above, for example, we can see that the top 5 significantly differentially expressed genes for Cluster 11 were 'g-744', 'g-243', 'g-712', 'g-417', and 'g-731', while those for Cluster 13 were 'g-166', 'g-167', 'g-708', 'g-168', and 'g-456'. Now we can visualize these marker genes on the UMAP dimensions, as follows.

In [None]:
rcParams['figure.figsize'] = 8,8;
qp.pl.umap(adata_genes, color=['leiden', 'g-744', 'g-243', 'g-712', 'g-417', 'g-731', 'g-166', 'g-167', 'g-708', 'g-168', 'g-456'], 
           legend_loc='on data', ncols=3, cmap='bwr')

### Visualize marker genes using heatmap
#### Here, the columns represent genes/features; the rows represent cells/subjects in clusters.

In [None]:
rcParams['figure.figsize'] = 8,8
qp.pl.rank_features_groups_heatmap(adata_genes, n_features=5, use_raw=False, 
                                   vmin=-5, vmax=5, cmap='bwr')

<a id="5"></a>
<h2 style='background:green; border:0; color:white'><center>5. Principal Component Analysis (Cell viability basis)</center></h2>

#### We reduce the dimensionality of the data by running PCA, which reveals the main axes of variation and denoises the data.
    
#### Note: I didn't apply any data transformation or scaling to the feature data. Usually, it works best to apply a log2(x+1) transformation and standardization scaling to the features before PCA. Judging from the gene data distributions shown by isaienkov (https://www.kaggle.com/isaienkov/mechanisms-of-action-moa-prediction-eda), the organizers may already have done that for us. In addition, based on the feature descriptions below, winsorization may have been applied before the standardization scaling or log transformation (i.e. the maximum value is exactly 10.0000 in many features).

In [None]:
train_features[train_cellvia].describe()

In [None]:
rcParams['figure.figsize'] = 8, 5
qp.tl.pca(adata_cellvia, svd_solver='auto');
qp.pl.pca(adata_cellvia, color=['cp_type', 'cp_time', 'cp_dose'], size=50);

Here, similar to the gene-basis PCA, PC1 from the cell-viability basis also seems useful for separating the treatment (trt) and control (ctl) groups, while PC1 and PC2 together can potentially be used to separate the treatment times (24, 48, and 72 hrs).

Let us inspect the contribution of individual PCs to the total variance in the data. This tells us how many PCs we should consider when computing the neighborhood relations or further dimensional reductions of the cells, e.g. in the clustering function qp.tl.leiden() or the tSNE function qp.tl.tsne(). In our experience, a rough estimate of the number of PCs often does fine. The 'elbow' point suggests that only PC1 will be useful for characterizing the cells. However, we are going to base the further dimensional reduction on the first 5 PCs later.

In [None]:
qp.pl.pca_variance_ratio(adata_cellvia, n_pcs=20)

<a id="6"></a>
<h2 style='background:green; border:0; color:white'><center>6. Leiden clustering on tsne/umap dimensional reductions (Cell viability basis)</center></h2>

#### We use the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018) to cluster the neighborhood graph of samples. We first compute the neighborhood graph of the cells/samples using the PCA representation of the data matrix. This gives the distances and connectivities between each sample and its neighbors. Here, we consider 100 nearest neighbors with 5 PCs derived from the PCA. Afterward, we compute the Leiden clustering.
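
Since Leiden is described above as optimizing modularity, a small sketch of the modularity score itself may help. For an undirected graph with adjacency matrix A, degrees k and m edges, Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] over pairs in the same community. The toy graph (two triangles joined by one edge) and the function name `modularity` are my own illustration. This only scores a given partition; it is not the Leiden algorithm, and the library's quality function may differ (for example through the `resolution` parameter).

```python
import numpy as np

def modularity(adjacency, labels):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    a = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = a.sum(axis=1)
    two_m = degrees.sum()                          # twice the number of edges
    same = labels[:, None] == labels[None, :]      # True where i and j share a community
    expected = np.outer(degrees, degrees) / two_m  # expected edge weight under random wiring
    return float(((a - expected) * same).sum() / two_m)

# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

q_split = modularity(adj, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(adj, [0, 0, 0, 0, 0, 0])  # everything in one community
print(round(q_split, 4), round(q_single, 4))
```

The split into two triangles scores higher than lumping everything together, which is the kind of partition a modularity-optimizing method favors.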

In [None]:
qp.pp.neighbors(adata_cellvia, n_neighbors=100, n_pcs=5); # only consider the first 5 pcs
qp.tl.leiden(adata_cellvia, resolution=0.5);

### Computing the t-distributed Stochastic Neighbor Embedding (tSNE)
Let us further reduce the dimensionality of the significant PCs identified above into 2 dimensions using tSNE. Here, the 'leiden' clusters, 'cp_type', 'cp_time', and 'cp_dose' are mapped onto the tSNE. We obtained 8 Leiden clusters. You can think of these clusters as different cell types from the same tissue or from different tissues.

In [None]:
rcParams['figure.figsize'] = 12, 8
qp.tl.tsne(adata_cellvia, n_pcs=5); # only consider the first 5 pcs
qp.pl.tsne(adata_cellvia, color=['leiden', 'cp_type', 'cp_time', 'cp_dose'], 
           legend_loc='on data', ncols=2)

### Embedding the neighborhood graph (UMAP)
We can also embed the neighborhood graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE. Before running UMAP, we compute the cluster-level connectivity graph with qp.tl.paga() and use its layout as the initial positions for UMAP.

In [None]:
rcParams['figure.figsize'] = 8,6
qp.tl.paga(adata_cellvia)
qp.pl.paga(adata_cellvia, plot=True)

In [None]:
rcParams['figure.figsize'] = 10, 6
qp.tl.umap(adata_cellvia, init_pos='paga')
qp.pl.umap(adata_cellvia, color=['leiden', 'cp_type', 'cp_time', 'cp_dose'], 
           legend_loc='on data', ncols=2)

<a id="7"></a>
<h2 style='background:green; border:0; color:white'><center>7. Differentially expressed viabilities defining each cluster (Cell viability basis)</center></h2>
    
### Visualizing the differentially expressed viabilities/features defining each cluster
We can identify viabilities/features that are differentially expressed in each cluster or group. Here, we take each group of cells and compare the distribution of each viability/feature in that group against its distribution in all other cells not in the group. We list the top 30 viabilities/features defining each cluster (a feature is significant at P < 0.05 when the absolute value of its score exceeds 1.96).


In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

rcParams['figure.figsize'] = 6,8;
qp.tl.rank_features_groups(adata_cellvia, 'leiden', method='wilcoxon');
qp.pl.rank_features_groups(adata_cellvia, n_features=30, sharey=False)

In [None]:
qp.tl.dendrogram(adata_cellvia, 'leiden', var_names=adata_cellvia.var_names);
qp.pl.matrixplot(adata_cellvia, adata_cellvia.var_names, 'leiden', dendrogram=True, cmap='RdBu_r')

The figure above shows the matrixplot of all 100 cell viabilities for the 8 clusters identified. Almost all cell viabilities change gradually along the UMAP dimensions, going through Clusters 5, 1, 0, 2, 3, 4, 7, and lastly 6. Importantly, almost all of them are highly correlated with one another (multicollinearity), as if driven by a single underlying factor! Let's plot a few of these cell viabilities on the UMAP alongside the Leiden clustering below.
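
One simple way to quantify the "one factor" impression is to look at the correlation matrix and at how much of its total variance the largest eigenvalue carries. The sketch below does this on synthetic features generated from a single latent factor; the data, loadings, and noise level are assumptions for illustration, not the actual c- columns.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for a few viability features: one shared factor plus small noise
factor = rng.normal(size=500)
features = factor[:, None] * np.array([1.0, 0.9, 1.1, 0.8]) \
    + 0.2 * rng.normal(size=(500, 4))

corr = np.corrcoef(features, rowvar=False)       # feature-by-feature correlations
eigvals = np.linalg.eigvalsh(corr)[::-1]         # eigenvalues, largest first
top_share = eigvals[0] / eigvals.sum()           # variance share of the leading factor
off_diag = corr[~np.eye(4, dtype=bool)]          # pairwise correlations only

print(np.round(corr, 2))
print(round(float(top_share), 3), round(float(off_diag.min()), 3))
```

Running the same two lines (`np.corrcoef` and `np.linalg.eigvalsh`) on `train_features[train_cellvia]` would be one way to check how dominant the leading factor really is in the viability data.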

In [None]:
rcParams['figure.figsize'] = 8,8;
qp.pl.umap(adata_cellvia, color=['leiden', 'c-0', 'c-1', 'c-2', 'c-3', 'c-22'], 
           legend_loc='on data', ncols=3, cmap='bwr')

# To be continued...


# References:
 1. https://www.kaggle.com/isaienkov/mechanisms-of-action-moa-prediction-eda (contributed by isaienkov)
 2. https://www.kaggle.com/headsortails/explorations-of-action-moa-eda (contributed by headsortails)
 3. Tabachnick & Fidell. Using Multivariate Statistics, Sixth Edition. PEARSON 2013; ISBN-13:9780205956227.
 4. Traag et al., From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233.
 5. McInnes & Healy, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, arXiv. 2018.
 6. https://quanp.readthedocs.io/en/latest/