<h1><center>Jane Street Market Prediction</center></h1>
<h2><center>Advanced EDA - from perspectives of features-based clustering to features comparison between clusters</center></h2>

<h3><center>2020-11-29</center></h3>

<h4><center>We have a fun interesting lung-shape UMAP visualization! And it looks like coronavirus infections at left and right lungs ;)</center></h4>

<div class="list-group" id="list-tab" role="tablist">
<h2 class="list-group-item list-group-item-action active" data-toggle="list" style='background:green; border:0; color:white' role="tab" aria-controls="home"><center>Quick navigation</center></h2>

* [1. Package installations + import libraries + data loading](#1)
* [2. Principal Component Analysis (features basis, disregard time-series)](#2)
* [3. Leiden clustering on tsne/umap dimensional reductions](#3)
* [4. Differentially expressed features defining each cluster](#4)
    
##### Although you can navigate to the topic of interest, it is recommended to first read from top-down before navigating around.

<a id="1"></a>
<h2 style='background:green; border:0; color:white'><center>1. Package installations + import libraries + data loading</center></h2>

### Packages
If you wish to replicate this work, execute the following cell to install the quanp package (https://quanp.readthedocs.io/en/latest/installation.html) together with the other packages/libraries that the code in this tutorial depends on.

In [None]:
# Install libraries/packages
!conda install seaborn scikit-learn statsmodels numba pytables -y
!conda install -c conda-forge python-igraph leidenalg -y
!pip install quanp
!pip install MulticoreTSNE

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import quanp as qp
import gc

from IPython.display import display
from matplotlib import rcParams

# setting visualization/logging parameters
pd.set_option('display.max_columns', None)
qp.set_figure_params(dpi=100, color_map = 'viridis_r')
qp.settings.verbosity = 1
qp.logging.print_versions()

In [None]:
# Load the training CSV into a pandas dataframe (it is converted to AnnData later)
train = pd.read_csv('/kaggle/input/jane-street-market-prediction/train.csv')

# We disregard the timeseries perspective and drop all the rows with nan value in any columns. 
train = train.dropna()

In [None]:
# Subsample the subjects at a fraction of 0.05, due to the current notebook runtime memory restriction
train = train.sample(frac=0.05, random_state=0)

### Features description
I only briefly check the usual min, max, mean, median, and 25th and 75th percentiles to get a rough idea of the distribution of the 130 features. This helps me decide whether normalization or scaling is needed later.

For a detailed visualization of each feature, I recommend visiting https://www.kaggle.com/blurredmachine/jane-street-market-eda-viz-prediction by blurredmachine. ;)

In [None]:
# Get lists of features
train_features = [s for s in train.columns if "feature_" in s]
train[train_features].describe()

In [None]:
def plot_outlier_graph_set(df):    

    for f in df.columns:
        fig, axs = plt.subplots(1, 4, figsize=(16, 5))

        # plot 1
        sns.boxplot(y=f, data=df, ax=axs[0])

        # plot 2
        sns.boxenplot(y=f, data=df, ax=axs[1])

        # plot 3
        sns.violinplot(y=f, data=df, ax=axs[2]) 

        # plot 4
        sns.stripplot(y=f, data=df, size=4, color=".3", linewidth=0, ax=axs[3])


        fig.suptitle(f, fontsize=15, y=1.1)
        axs[0].set_title('Box Plot')
        axs[1].set_title('Boxen Plot')
        axs[2].set_title('Violin Plot')
        axs[3].set_title('Strip Plot')

        plt.tight_layout()
        plt.show()

## Checking extreme outlier values in the Y values
The Y columns are weight, resp, resp_1, resp_2, resp_3 and resp_4. Knowing their extreme values is useful for setting the min and max of the color scale in the plots later.

In [None]:
y_cols = [c for c in train.columns if 'resp' in c or 'weight' in c]
plot_outlier_graph_set(train[y_cols])  
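
As a complement to reading the limits off the plots, the color-scale bounds could also be derived from percentiles. The following is only a minimal sketch on synthetic data: the helper `robust_limits` and the 1st/99th percentile cut-offs are assumptions for illustration and are not what this notebook uses (the limits used later were chosen by eye).

```python
import numpy as np

def robust_limits(values, lower_pct=1.0, upper_pct=99.0):
    """Return (vmin, vmax) taken from percentiles, ignoring NaNs."""
    values = np.asarray(values, dtype=float)
    vmin, vmax = np.nanpercentile(values, [lower_pct, upper_pct])
    return float(vmin), float(vmax)

# Synthetic heavy-tailed stand-in for a column such as `resp` (not the real data).
rng = np.random.default_rng(0)
fake_resp = rng.standard_t(df=3, size=10_000) * 0.02

vmin, vmax = robust_limits(fake_resp)
# A symmetric limit keeps a diverging colormap such as 'bwr' centred on zero.
sym = max(abs(vmin), abs(vmax))
print(f"vmin={vmin:.3f}, vmax={vmax:.3f}, symmetric limit={sym:.3f}")
```

The symmetric value could then be passed as `vmax=sym, vmin=-sym` to a plotting call.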


In [None]:
# Loading pandas dataframe as anndata 
adata = qp.AnnData(train[train_features])

# add new `.obs` columns holding the weight and response values of each subject
adata.obs['weight'] = train['weight'].values
adata.obs['resp_1'] = train['resp_1'].values
adata.obs['resp_2'] = train['resp_2'].values
adata.obs['resp_3'] = train['resp_3'].values
adata.obs['resp_4'] = train['resp_4'].values
adata.obs['resp'] = train['resp'].values

# housekeeping
del train
gc.collect()

<a id="2"></a>
<h2 style='background:green; border:0; color:white'><center>2. Principal Component Analysis (features basis, disregard time-series)</center></h2>

#### We reduce the dimensionality of the data by running PCA, which reveals the main axes of variation and denoises the data.
    
#### Note: I didn't apply any transformation or scaling to the feature data. Usually, it works best to subject the features to min-max normalization, log2(x+1) transformation and standardization scaling before proceeding to PCA. Judging from the feature description above, the organizer may already have done that for us.
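
For reference, standardization scaling of the kind mentioned in the note can be sketched in a few lines of NumPy. This is an illustration on synthetic data only; the helper name `standardize` is an assumption and the step is not applied anywhere in this notebook.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the mean, divide by the standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant columns become all zeros instead of dividing by zero
    return (X - mean) / std

# Synthetic features on very different scales (not the competition data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 10.0, 0.1], size=(1000, 3))

print("std before:", X_demo.std(axis=0).round(2))
X_scaled = standardize(X_demo)
print("std after: ", X_scaled.std(axis=0).round(2))
```

Without such scaling, PCA is dominated by whichever features have the largest variance, which is why it matters whether the features are already on comparable scales.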

In [None]:
rcParams['figure.figsize'] = 12, 8
qp.tl.pca(adata, svd_solver='arpack', random_state=0);
qp.pl.pca(adata, 
          color=['weight', 'resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4'], 
          size=20, cmap='bwr', 
          ncols=2);

Here, PC1 and PC2 seem to be somewhat useful for separating high and low values of the Y variables we are looking at.

Let us inspect the contribution of individual PCs to the total variance in the data. This tells us how many PCs to consider when computing the neighborhood relations or further dimension reductions of the subjects, e.g. for the clustering function qp.tl.leiden() or tSNE qp.tl.tsne(). In our experience, a rough estimate of the number of PCs often does fine. The 'elbow' point suggests that PCs up to at least PC5 are useful for characterizing the subjects, so the neighborhood graph later is built on the first 5 PCs.

In [None]:
qp.pl.pca_variance_ratio(adata, n_pcs=20)
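
To make the quantity behind this plot concrete, the variance ratio of each PC can be computed from the singular values of the centred data matrix. The sketch below uses plain NumPy on synthetic data; the helper name and the 90% cumulative-variance cut-off are assumptions for illustration, not the elbow rule used above.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                   # PCA works on centred data
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, sorted descending
    var = s ** 2                              # proportional to per-PC variance
    return var / var.sum()

# Synthetic data: 3 latent factors mixed into 20 noisy features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

ratio = explained_variance_ratio(X_demo)
cumulative = np.cumsum(ratio)
# Smallest number of PCs whose cumulative variance reaches 90%.
n_pcs_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print("first 5 ratios:", ratio[:5].round(3))
print("PCs needed for 90% of the variance:", n_pcs_90)
```

On real data the curve rarely drops this sharply, so a cumulative threshold and the visual elbow may suggest different numbers of PCs.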

<a id="3"></a>
<h2 style='background:green; border:0; color:white'><center>3. Leiden clustering on tsne/umap dimensional reductions</center></h2>

#### We use the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2019) to cluster the neighborhood graph of subjects. We first compute the neighborhood graph of subjects using the PCA representation of the data matrix. This gives, for each subject, its distances and connectivities to other subjects. Here, we consider 30 nearest neighbors with 5 PCs derived from the PCA. Afterward, we compute the Leiden clustering.

In [None]:
qp.pp.neighbors(adata, n_neighbors=30, n_pcs=5, random_state=0); # 30 nearest neighbors and only consider the first 5 pcs
qp.tl.leiden(adata, random_state=0);
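
To illustrate what a nearest-neighbor graph on the PCA representation contains, here is a brute-force sketch in NumPy. It is an assumption-laden toy: the helper `knn_indices` is hypothetical, it uses exact Euclidean distances on synthetic data, and it does not reproduce the internals of `qp.pp.neighbors` or the Leiden step itself.

```python
import numpy as np

def knn_indices(X, n_neighbors):
    """Exact k-nearest neighbours by Euclidean distance (O(n^2) memory)."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    nearest = np.take_along_axis(d2, idx, axis=1)
    dist = np.sqrt(np.clip(nearest, 0.0, None))  # clip tiny negative rounding errors
    return idx, dist

# Two well-separated synthetic blobs in a 5-dimensional "PC space".
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, size=(50, 5))
blob_b = rng.normal(loc=10.0, size=(50, 5))
X_demo = np.vstack([blob_a, blob_b])
labels = np.repeat([0, 1], 50)

idx, dist = knn_indices(X_demo, n_neighbors=5)
same_blob = (labels[idx] == labels[:, None]).mean()
print("fraction of neighbours in the same blob:", same_blob)
```

A graph-clustering method then groups subjects that are densely connected in this graph; the quadratic memory cost of the brute-force approach is one reason the notebook subsamples the data first.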

### Computing the T-distributed Stochastic Neighbor Embedding (tSNE)
Let us further reduce the leading PCs identified above to 2 dimensions using tSNE. Note that the cell below passes the first 10 PCs to tSNE, whereas the neighborhood graph above used 5. Here, the 'leiden' clusters, 'weight', 'resp', 'resp_1', 'resp_2', 'resp_3' and 'resp_4' are mapped onto the tSNE embedding. We obtained 8 Leiden clusters.

In [None]:
rcParams['figure.figsize'] = 12, 8
qp.tl.tsne(adata, n_pcs=10, random_state=0); # only consider the first 10 pcs

qp.pl.tsne(adata, color=['leiden', 'weight', 'resp', 'resp_1', 'resp_2', 'resp_3', 'resp_4'], 
           size=20, cmap='bwr',
           legend_loc='on data', ncols=2)

### Uniform Manifold Approximation and Projection (UMAP) - Embedding the neighborhood graph
We can also embed the neighborhood graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE. Before running UMAP, we run PAGA (qp.tl.paga) on the Leiden clusters to estimate how the clusters connect to each other, and use the resulting layout as the initial positions for UMAP (init_pos='paga').

In [None]:
rcParams['figure.figsize'] = 8,6
qp.tl.paga(adata)
qp.pl.paga(adata, plot=True)

In [None]:
rcParams['figure.figsize'] = 10, 6
qp.tl.umap(adata, init_pos='paga', random_state=0)
qp.pl.umap(adata, color=['leiden', 'weight'],            
           size=20, cmap='bwr', vmax = 75,
           legend_loc='on data', ncols=2)

Great! We have a lung-shaped UMAP plot here! Don't you think it looks like a CT scan of a coronavirus infection, with both the left and right lungs infected? haha

Anyway, here we set the max value for the weight to 75 in the plot, according to the extreme outlier plots above.
We can roughly see that most of the high-weight subjects sit within or near the middle clusters of the 2 big clusters.


The plots of resp_1, resp_2, resp_3, resp_4 and resp below show that high resp values mostly appear in certain parts of one of the lungs. Interestingly, the part connecting the two lung lobes shows both high and low resp values. Fun!

In [None]:
qp.pl.umap(adata, color=['leiden', 'resp_1', 'resp_2', 'resp_3'],            
           size=20, cmap='bwr', vmax = 0.1, vmin= -0.1,
           legend_loc='on data', ncols=2)

In [None]:
qp.pl.umap(adata, color=['resp_4', 'resp'],            
           size=20, cmap='bwr', vmax = 0.2, vmin= -0.2,
           legend_loc='on data', ncols=2)

<a id="4"></a>
<h2 style='background:green; border:0; color:white'><center>4. Differentially expressed features defining each cluster</center></h2>
    
### Visualizing the differential expressing features defining each cluster
We can identify features that are differentially expressed in each cluster or group. Here, we take each group of subjects and compare the distribution of each feature in that group against its distribution in all other subjects not in the group. We list the top 30 features defining each cluster; a score above 1.96 or below -1.96 corresponds to a two-sided P-value < 0.05.


In [None]:
# Before performing the group comparisons to get the differentiating features, we check whether Euclidean distance
# alone already hints at feature patterns shared between clades of clusters.
qp.tl.dendrogram(adata, 'leiden', var_names=adata.var_names);
qp.pl.matrixplot(adata, adata.var_names, 'leiden',
                 standard_scale='var',
                 dendrogram=True, cmap='RdBu_r')

From the above, we can see that features 0, 17, 18, 19, 20, 21, 22, 23, ..., 31, 32, 33, 37, 38, 39, 40 can potentially split the subjects into 2 main clades.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning);

rcParams['figure.figsize'] = 6,8;
qp.tl.rank_features_groups(adata, 'leiden', method='wilcoxon');
qp.pl.rank_features_groups(adata, n_features=30, sharey=False);
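
To show how a score relates to the 1.96 threshold mentioned above, here is a minimal sketch of a Wilcoxon rank-sum z-score for one feature and one group versus the rest. It assumes the reported score is a z-score under the normal approximation, as the threshold in the text implies; the helper names are hypothetical, ties are not corrected for, and the data are synthetic.

```python
import math
import numpy as np

def rank_sum_z(group, rest):
    """Wilcoxon rank-sum z-score (normal approximation, no tie correction)."""
    group = np.asarray(group, dtype=float)
    rest = np.asarray(rest, dtype=float)
    n1, n2 = len(group), len(rest)
    combined = np.concatenate([group, rest])
    ranks = np.empty(n1 + n2)
    ranks[np.argsort(combined)] = np.arange(1, n1 + n2 + 1)  # ranks start at 1
    rank_sum = ranks[:n1].sum()                 # sum of ranks held by the group
    mean = n1 * (n1 + n2 + 1) / 2.0             # expected rank sum under the null
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (rank_sum - mean) / std

def two_sided_p(z):
    """Two-sided P-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Synthetic feature: the cluster is shifted upwards relative to everyone else.
rng = np.random.default_rng(0)
in_cluster = rng.normal(loc=0.5, size=200)
everyone_else = rng.normal(loc=0.0, size=800)

z = rank_sum_z(in_cluster, everyone_else)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.2e}")
print(f"p at z = 1.96: {two_sided_p(1.96):.4f}")
```

A positive score means the feature tends to be higher inside the cluster than outside, a negative score the opposite. With 130 features tested per cluster, some scores will cross 1.96 by chance alone.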

We can visualize the top 5 features highly expressed by each cluster below.

In [None]:
qp.pl.rank_features_groups_matrixplot(adata, n_features=5, 
                                      use_raw=False, 
                                      standard_scale='var',
                                      dendrogram=True, cmap='RdBu_r')

# References:

 1. https://www.kaggle.com/blurredmachine/jane-street-market-eda-viz-prediction
 2. Tabachnick & Fidell. Using Multivariate Statistics, Sixth Edition. PEARSON 2013; ISBN-13:9780205956227.
 3. Traag et al., From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep. 2019;9(1):5233.
 4. McInnes, Healy & Melville, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, arXiv. 2018.
 5. https://quanp.readthedocs.io/en/latest/