# 1. Characterising companies based on financial metrics during COVID-19

## Instructions
In August 2020, the author started this tutorial as a demonstration of how Single-Cell Genomic (SCG) analyses can be adopted in the business and finance world. SCG data have a high number of features (up to 56k genes/features) and are frequently known to have low signal against a high background. Here, the author demonstrates the correlation of price performance with common company financial metrics during the COVID-19 period.

## Packages
You'll need to install the quanp package (https://quanp.readthedocs.io/en/latest/installation.html), which should install all the packages/libraries required to run the code in this tutorial. Please create and use a virtualenv with Python 3.6 to avoid dependency problems.

### Install Packages

In [3]:
import sys
!conda install seaborn scikit-learn statsmodels numba pytables
!conda install -c conda-forge python-igraph leidenalg
!{sys.executable} -m pip install quanp

### Import Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import quanp as qp

from IPython.display import display
from matplotlib import rcParams

# setting visualization/logging parameters
pd.set_option('display.max_columns', None)
qp.set_figure_params(dpi=100, color_map = 'viridis_r')
qp.settings.verbosity = 1
qp.logging.print_versions()

### Download data

In [None]:
# S&P 500 metadata
df_metadata = qp.datasets.get_wiki_sp500_metadata()

# S&P 500 fundamentals
df_fundamental = qp.datasets.download_tickers_fundamental()

Downloading fundamentals...:   0%|          | 0/505 [00:00<?, ?ticker/s]

The metadata file has been initialized and saved as sp500_metadata.csv
Total tickers to download: 505


Downloading fundamentals...:  15%|█▌        | 77/505 [02:05<11:14,  1.58s/ticker]

### Loading data

In [None]:
# Optional: the data retrieved in the cell above were saved as a CSV file. You may run this cell to avoid
# rerunning the download cell above.
df_fundamental = pd.read_csv('data/metadata/sp500_metadata_fundamentalAdded.csv', index_col=0)
print(df_fundamental.columns)

In [None]:
ls_fundamental_target = [
    'beta', 'bookValuePerShare', 'currentRatio', 'dividendYield',
    'epsChangePercentTTM', 'epsChangeYear', 'epsTTM',
    'grossMarginMRQ', 'grossMarginTTM', 'interestCoverage',
    'ltDebtToEquity', 'marketCap', 'marketCapFloat',
    'netProfitMarginMRQ', 'netProfitMarginTTM',
    'operatingMarginMRQ', 'operatingMarginTTM', 'pbRatio', 'pcfRatio',
    'peRatio', 'pegRatio', 'prRatio', 'quickRatio', 'returnOnAssets',
    'returnOnEquity', 'returnOnInvestment', 'revChangeIn', 'revChangeTTM',
    'revChangeYear', 'sharesOutstanding', 'shortIntDayToCover',
    'shortIntToFloat', 'totalDebtToCapital', 'totalDebtToEquity',
    'vol10DayAvg', 'vol1DayAvg', 'vol3MonthAvg',
]

In [None]:
# Loading pandas dataframe as anndata 
adata = qp.AnnData(df_fundamental[ls_fundamental_target])

# log(x+1) transformation for all data
qp.pp.log1p(adata)

# Standardization scaling per feature
qp.pp.scale(adata)
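
To make the two preprocessing steps above concrete, here is a minimal NumPy sketch of a log(x+1) transform followed by per-feature standardization. It is an illustration only, not quanp's implementation: the helper name `log1p_then_standardize`, the toy matrix, and the use of the sample standard deviation (`ddof=1`) are assumptions. Note that log(x+1) is only defined for x > -1, so any metric that falls at or below -1 would not yield a finite value.

```python
import numpy as np

def log1p_then_standardize(X):
    """Apply log(x + 1), then scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    X_log = np.log1p(X)                  # only defined for x > -1
    mean = X_log.mean(axis=0)            # per-feature mean
    std = X_log.std(axis=0, ddof=1)      # per-feature sample std (assumption)
    std[std == 0] = 1.0                  # constant columns stay at zero
    return (X_log - mean) / std

# Hypothetical toy matrix: 4 companies x 2 metrics on very different scales
toy = np.array([[1.0, 1e9],
                [2.0, 5e9],
                [4.0, 2e10],
                [8.0, 1e11]])
Z = log1p_then_standardize(toy)
print(Z)
```

After this step every metric contributes on a comparable scale, which matters because PCA is driven by variance.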

In [None]:
# add a new `.obs` column called `GICS_Sector` for all companies
adata.obs['GICS_Sector'] = df_fundamental['GICS Sector']
adata

### Principal component analysis (PCA)
Reduce the dimensionality of the data by running PCA, which reveals the main axes of variation and denoises the data.

In [None]:
qp.tl.pca(adata, svd_solver='auto')
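
As a rough illustration of what PCA computes, the sketch below derives principal component scores and explained-variance ratios from a singular value decomposition of the centered data. The function `pca_svd` and the random toy data are hypothetical; the solver that `qp.tl.pca` actually uses may differ.

```python
import numpy as np

def pca_svd(X, n_components):
    """PCA via SVD: returns scores, loadings, and variance ratios of all PCs."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                           # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # coordinates on the PCs
    explained_variance = S**2 / (X.shape[0] - 1)
    variance_ratio = explained_variance / explained_variance.sum()
    return scores, Vt[:n_components], variance_ratio

# Hypothetical toy data: 50 observations, 5 features, two of them correlated
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 5))
toy[:, 1] = 3.0 * toy[:, 0] + 0.1 * toy[:, 1]

scores, loadings, ratio = pca_svd(toy, n_components=2)
print(scores.shape, ratio.round(3))
```

The first PC captures the direction shared by the two correlated features, which is the sense in which PCA reveals the main axes of variation.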

We can make a scatter plot using the coordinates of the first 2 principal components (PCs) and see whether these 2 PCs separate the GICS sectors well.

In [None]:
qp.pl.pca(adata, color=['GICS_Sector'], size=50);

For instance, it seems that the Information Technology, Financials, and Energy sectors can be separated from low to high PC1.

In [None]:
qp.pl.pca(adata, color=['GICS_Sector'], size=50, groups=['Financials', 'Energy', 'Information Technology']);

Let us inspect the contribution of single PCs to the total variance in the data. This tells us how many PCs we should consider when computing the neighborhood relations of companies, e.g. as used in the clustering functions ```qp.tl.leiden()``` and ```qp.tl.louvain()```, or in the tSNE function ```qp.tl.tsne()```. In our experience, a rough estimate of the number of PCs often does fine. The 'elbow' point seems to suggest that PCs up to at least PC8 will be useful to characterize the companies. We are going to do further dimensionality reduction based on the first 8 PCs later.

In [None]:
qp.pl.pca_variance_ratio(adata, n_pcs=len(adata.var_names))
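
The elbow is read off the plot by eye. As a complementary and purely illustrative heuristic (not the method used in this tutorial), one can count how many PCs are needed to reach a chosen share of the total variance. The function name, the threshold, and the example ratios below are hypothetical; in practice the ratios would come from the PCA result (if quanp follows Scanpy's layout they may be stored under `adata.uns['pca']['variance_ratio']`, which is an unverified assumption).

```python
import numpy as np

def n_pcs_for_variance(variance_ratio, threshold):
    """Smallest number of leading PCs whose cumulative variance ratio reaches threshold."""
    cumulative = np.cumsum(variance_ratio)
    # index of the first cumulative value >= threshold, converted to a count
    n = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n, len(cumulative))       # cap if the threshold is never reached

# Hypothetical variance ratios for four PCs
example_ratio = np.array([0.5, 0.3, 0.1, 0.1])
n_for_75 = n_pcs_for_variance(example_ratio, threshold=0.75)
n_for_95 = n_pcs_for_variance(example_ratio, threshold=0.95)
print(n_for_75, n_for_95)
```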

In [None]:
# Optional: save the anndata in h5ad for fast loading later
adata.write('sp500_metadata_fundamental.h5ad')
adata

### Computing the T-distributed Stochastic Neighbor Embedding (tSNE)

Let us further reduce the dimensionality of the significant PCs identified above into 2 dimensions using the tSNE tool implemented as ```qp.tl.tsne(adata)```.

In [None]:
qp.tl.tsne(adata, n_pcs=8) # only consider the first 8 pcs

### Computing the neighborhood graph

Before we view the tSNE plots with sector annotations, let us compute the neighborhood graph of companies using the PCA representation of the data matrix. This gives the distances and connectivities between companies. Here, we consider 10 nearest neighbors with 8 PCs derived from the PCA.

In [None]:
qp.pp.neighbors(adata, n_neighbors=10, n_pcs=8);
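
The core idea of a neighborhood graph can be sketched with a brute-force k-nearest-neighbor search on PC coordinates. This is a simplified illustration under stated assumptions: `knn_indices` and the random coordinates are hypothetical, Euclidean distance is assumed, each point is excluded from its own neighbor list, and the connectivity weighting that `qp.pp.neighbors` computes is not reproduced.

```python
import numpy as np

def knn_indices(coords, n_neighbors):
    """Brute-force k nearest neighbors (Euclidean), excluding the point itself."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))        # pairwise distance matrix
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :n_neighbors]
    return idx, np.take_along_axis(dist, idx, axis=1)

# Hypothetical stand-in for 30 companies described by their first 8 PCs
rng = np.random.default_rng(0)
toy_pcs = rng.normal(size=(30, 8))
neighbors, distances = knn_indices(toy_pcs, n_neighbors=10)
print(neighbors[0], distances[0].round(2))
```

Each row of `neighbors` lists the 10 companies closest to the company in that row, ordered from nearest to farthest.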

### Clustering the neighborhood graph

Here, we use Leiden graph-clustering method (community detection based on optimizing modularity) by Traag *et al.* (2018) to cluster the neighborhood graph of companies, which we already computed in the previous section.

In [None]:
qp.tl.leiden(adata)
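
To illustrate the quantity that modularity-based community detection tries to maximize, the sketch below computes the standard (Newman) modularity of a given partition on a small undirected graph. It does not implement the Leiden algorithm itself, and the function name and toy graph are hypothetical.

```python
import numpy as np

def modularity(adjacency, labels):
    """Newman modularity of a partition of an undirected graph."""
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)
    two_m = degrees.sum()                          # twice the number of edges
    same = labels[:, None] == labels[None, :]      # pairs in the same community
    expected = np.outer(degrees, degrees) / two_m  # expected weight under a random graph
    return float(((A - expected) * same).sum() / two_m)

# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_two_clusters = modularity(A, [0, 0, 0, 1, 1, 1])
q_one_cluster = modularity(A, [0, 0, 0, 0, 0, 0])
print(q_two_clusters, q_one_cluster)
```

Splitting the graph into its two triangles scores higher than lumping all nodes together, which is the kind of partition a modularity optimizer would prefer.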

We can now map and view the annotations of Leiden clustering, GICS_Sector, or any financial metrics/features on the tSNE plots. We see that Leiden Cluster 3 seems to correspond well to the Financials sector, and it is characterized by a low currentRatio.

In [None]:
rcParams['figure.figsize'] = 8,8
qp.pl.tsne(adata, color=['leiden', 'GICS_Sector', 'currentRatio'], legend_loc='on data')

### Embedding the neighborhood graph

We can also embed the graph in 2 dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE. Before running UMAP, we run ```qp.tl.paga``` to quantify the connectivity between clusters, which is then used to initialize the positions for the UMAP.

In [None]:
qp.tl.paga(adata)
qp.pl.paga(adata, plot=True)

In [None]:
qp.tl.umap(adata, init_pos='paga')


We can now map and view the annotations of Leiden clustering, GICS_Sector, or any financial metrics/features on the UMAP plots. We see that Leiden Cluster 3 seems to correspond well to the Financials sector, and it is characterized by a low currentRatio.

In [None]:
rcParams['figure.figsize'] = 8,8

qp.pl.umap(adata, color=['leiden', 'GICS_Sector'] + list(adata.var_names), legend_loc='on data', frameon=False, ncols=3)

We run `qp.tl.dendrogram` to compute hierarchical clustering. Multiple visualizations can
then include the dendrogram: `qp.pl.matrixplot`, `qp.pl.heatmap`, `qp.pl.dotplot` and `qp.pl.stacked_violin`.

In [None]:
qp.tl.dendrogram(adata, 'leiden', var_names=adata.var_names)

In [None]:
qp.pl.heatmap(adata, adata.var_names, 'leiden', dendrogram=True)

In [None]:
qp.pl.matrixplot(adata, adata.var_names, 'leiden', dendrogram=True, cmap='RdBu_r')

In [None]:
qp.pl.stacked_violin(adata, adata.var_names, 'leiden', dendrogram=True, cmap='RdBu_r')

In [None]:
qp.pl.tracksplot(adata, adata.var_names, 'leiden', dendrogram=True)

## Visualizing the important features defining each cluster
Instead of looking at all features of the clusters as above, we can identify the features/metrics that differentially characterize each cluster. Here, we can see that Cluster 3 (mostly consisting of the Financials sector) is associated with significantly higher dividendYield, operatingMarginTTM, netProfitMarginTTM, epsTTM, and netProfitMarginMRQ.

In [None]:
qp.tl.rank_features_groups(adata, 'leiden', method='wilcoxon')
qp.pl.rank_features_groups(adata, n_features=40, sharey=False)
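
The `method='wilcoxon'` option refers to the Wilcoxon rank-sum test, which compares a feature's values inside one cluster against the rest. The sketch below computes the normal-approximation z-score of that test for a single feature. It is a simplified illustration: the function name and toy values are hypothetical, tied values are not handled, and how quanp scores and ranks features internally may differ.

```python
import numpy as np

def rank_sum_z(in_group, rest):
    """Wilcoxon rank-sum z-score (normal approximation, assumes no tied values)."""
    x = np.asarray(in_group, dtype=float)
    y = np.asarray(rest, dtype=float)
    n1, n2 = len(x), len(y)
    combined = np.concatenate([x, y])
    ranks = np.empty(n1 + n2)
    ranks[np.argsort(combined)] = np.arange(1, n1 + n2 + 1)  # ranks 1..N
    w = ranks[:n1].sum()                                     # rank sum of the group
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return float((w - mean_w) / sd_w)

# Hypothetical scaled values of one metric inside a cluster vs. all other companies
z_high = rank_sum_z([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0])
z_low = rank_sum_z([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0])
print(z_high, z_low)
```

A large positive score means the metric tends to be higher inside the cluster than outside it; a large negative score means the opposite.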

### Visualizing the top 5 positively enriched features using matrixplot

In [None]:
qp.pl.rank_features_groups_matrixplot(adata, n_features=5, use_raw=False, vmin=-3, vmax=3, cmap='bwr')

In [None]:
qp.pl.rank_features_groups_heatmap(adata, n_features=5, use_raw=False, vmin=-3, vmax=3, cmap='bwr')

Lastly, we can cross-tabulate the Leiden clusters against GICS_Sector to check how well the Leiden clusters correspond to the sectors.

In [None]:
df_crosstab_sectorVSleiden = pd.merge(pd.DataFrame(adata.obs['leiden']), pd.DataFrame(adata.obs['GICS_Sector']), 
                                      how='inner', left_index=True, right_index=True)
pd.crosstab(df_crosstab_sectorVSleiden['leiden'], df_crosstab_sectorVSleiden['GICS_Sector'])
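
Raw counts can be hard to compare when clusters differ in size. As a small illustrative extension, the sketch below normalizes a cross-tabulation by row so that each cluster's sector shares sum to 1, and then picks the dominant sector per cluster. The toy `obs` table is hypothetical and stands in for `adata.obs`.

```python
import pandas as pd

# Hypothetical stand-in for adata.obs with cluster and sector labels
obs = pd.DataFrame({
    'leiden': ['0', '0', '1', '1', '1', '2'],
    'GICS_Sector': ['Energy', 'Energy', 'Financials',
                    'Financials', 'Energy', 'Financials'],
})

# Raw counts of companies per (cluster, sector) pair
counts = pd.crosstab(obs['leiden'], obs['GICS_Sector'])

# Share of each sector within every cluster (rows sum to 1)
share = pd.crosstab(obs['leiden'], obs['GICS_Sector'], normalize='index')

# Sector with the largest share in each cluster
dominant = share.idxmax(axis=1)

print(counts)
print(share.round(2))
print(dominant)
```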