# Select Top Features

Technically, any observation-by-feature matrix is acceptable for the method we developed, and users are
encouraged to explore the usability of our method with other types of data, even outside a biological context.
However, single-cell transcriptomics data, as provided, is usually high-dimensional and contains technical and
biological noise. After testing different approaches to reducing dimensionality and noise, we recommend that users
select a number of top differentially expressed genes (DEGs) for each cluster (or group of clusters) that a vertex represents.

We implemented a fast Wilcoxon rank-sum test, which can be invoked with the function [`select_top_features`](../generated/CytoSimplex.select_top_features).
The test is performed in a one-group-versus-all-other-groups manner. Here, we will choose the top DEGs for Osteoblast cells
(shortened as `"OS"`), Reticular cells (`"RE"`) and Chondrocytes (`"CH"`), as also shown in the previously mentioned
publication. The number of top DEGs for each cluster is set to 30 (`n_top=30`), so at most 90 unique genes are expected to be returned.

In [1]:
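import CytoSimplex as csx
import scanpy as sc

# Load the example dataset; scanpy downloads it from backup_url if 'test.h5ad' is not found locally.
adata = sc.read(filename='test.h5ad',
                backup_url="https://figshare.com/ndownloader/files/41034857")
# Map each vertex name to the cluster it represents.
vertices = {"OS": "Osteoblast_1",
            "RE": "Reticular_1",
            "CH": "Chondrocyte_1"}
# Select the top 30 DEGs per vertex, then preview the first 10 selected genes.
selected_genes = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, n_top=30)
selected_genes[:10]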


['Steap1',
 'Smim5',
 'H2-DMa',
 'Zcchc5',
 'Lims2',
 'Fam89a',
 'Ninj2',
 'Scin',
 'Pygl',
 'Slc2a5']
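
To make the one-versus-rest scheme concrete, the following is a minimal sketch on toy data. It is not the CytoSimplex implementation: it uses `scipy.stats.mannwhitneyu` in place of the package's fast test, the helper `top_features_one_vs_rest` and the toy matrix are hypothetical, and ranking genes by AUC is an assumption made for illustration only. It also shows why the result holds at most `n_top` times the number of vertices unique genes: the per-group lists are merged and duplicates are dropped.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Toy data (hypothetical): 60 cells x 5 genes, three clusters of 20 cells each.
genes = np.array(["g0", "g1", "g2", "g3", "g4"])
labels = np.repeat(["A", "B", "C"], 20)
expr = rng.poisson(1.0, size=(60, 5)).astype(float)
expr[labels == "A", 0] += 5.0  # g0 is a marker of cluster A
expr[labels == "B", 1] += 5.0  # g1 is a marker of cluster B
expr[labels == "C", 2] += 5.0  # g2 is a marker of cluster C


def top_features_one_vs_rest(expr, labels, genes, groups, n_top=1):
    """Return the union of the top `n_top` genes of each group (illustrative)."""
    selected = []
    for group in groups:
        in_group = labels == group
        # Rank-sum test per gene: cells of this group vs. all other cells.
        res = mannwhitneyu(expr[in_group], expr[~in_group],
                           alternative="greater", axis=0)
        # AUC = U / (n_in * n_out); 0.5 means the gene does not separate the groups.
        auc = res.statistic / (in_group.sum() * (~in_group).sum())
        # Assumption: rank genes by AUC, highest first.
        order = np.argsort(-auc)
        for gene in genes[order[:n_top]].tolist():
            if gene not in selected:  # keep the merged list unique
                selected.append(gene)
    return selected


top = top_features_one_vs_rest(expr, labels, genes, ["A", "B", "C"], n_top=1)
print(top)
```

With `n_top=1` and three groups, the returned list can contain at most three genes, mirroring the upper bound of 90 for `n_top=30` and three vertices.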

Alternatively, users can set `return_stats=True` to obtain a table of the full Wilcoxon rank-sum test statistics,
which includes the results for all clusters rather than only the selected vertices.

In [2]:
stats = csx.select_top_features(adata, cluster_var="cluster", vertices=vertices, return_stats=True)
stats


| | group | avgExpr | logFC | ustat | auc | pval | padj | pct_in | pct_out | feature |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CH | 0.000000 | 0.000000 | 3010.5 | 0.500000 | NaN | NaN | 0.000000 | 0.000000 | Rp1 |
| 1 | CH | 0.000000 | 0.000000 | 3010.5 | 0.500000 | NaN | NaN | 0.000000 | 0.000000 | Sox17 |
| 2 | CH | 10.610097 | 7.637897 | 4547.5 | 0.755273 | 4.666025e-08 | 4.881361e-07 | 81.481481 | 21.524664 | Mrpl15 |
| 3 | CH | 6.077630 | 3.874176 | 3826.0 | 0.635443 | 9.406324e-04 | 3.637987e-03 | 48.148148 | 16.143498 | Lypla1 |
| 4 | CH | 0.000000 | -0.057590 | 2997.0 | 0.497758 | 7.669166e-01 | 1.000000e+00 | 0.000000 | 0.448430 | Gm37988 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 242911 | Reticular_2 | 1.582342 | -0.496039 | 1048.0 | 0.483172 | 7.931468e-01 | 1.000000e+00 | 11.111111 | 14.937759 | PISD |
| 242912 | Reticular_2 | 0.000000 | -2.912214 | 846.0 | 0.390041 | 1.202520e-01 | 1.000000e+00 | 0.000000 | 21.991701 | DHRSX |
| 242913 | Reticular_2 | 0.000000 | -0.154048 | 1071.0 | 0.493776 | 7.746696e-01 | 1.000000e+00 | 0.000000 | 1.244813 | CAAA01147332.1 |
| 242914 | Reticular_2 | 16.142743 | 2.744758 | 1350.0 | 0.622407 | 2.151089e-01 | 1.000000e+00 | 100.000000 | 83.402490 | tdT-WPRE_trans |


The returned table can be read as the concatenation of the tables from all tests, with the column `group` indicating
which cluster was tested against all others. For example, the third row (index 2) is the result of the test for gene "Mrpl15" in group "CH",
which represents the original cluster "Chondrocyte_1", against all other clusters.
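
Because the statistics come back as a single long table, a custom selection can be made with ordinary pandas filtering. The sketch below is only an illustration: `stats_demo` is a small stand-in built from the five "CH" rows shown above (missing p-values entered as NaN), the helper `significant_features` is hypothetical, and the thresholds `padj_max=0.05` and `min_logfc=0.0` are arbitrary assumptions rather than values recommended by CytoSimplex.

```python
import numpy as np
import pandas as pd

# Stand-in for the returned table: the "CH" rows displayed above, with a subset of columns.
stats_demo = pd.DataFrame({
    "group": ["CH", "CH", "CH", "CH", "CH"],
    "feature": ["Rp1", "Sox17", "Mrpl15", "Lypla1", "Gm37988"],
    "logFC": [0.0, 0.0, 7.637897, 3.874176, -0.057590],
    "auc": [0.5, 0.5, 0.755273, 0.635443, 0.497758],
    "padj": [np.nan, np.nan, 4.881361e-07, 3.637987e-03, 1.0],
})


def significant_features(stats, group, padj_max=0.05, min_logfc=0.0):
    """List up-regulated features of `group`, strongest log fold change first."""
    keep = (
        (stats["group"] == group)
        & (stats["padj"] < padj_max)   # rows with NaN padj compare as False and are dropped
        & (stats["logFC"] > min_logfc)
    )
    hits = stats[keep].sort_values("logFC", ascending=False)
    return hits["feature"].tolist()


ch_genes = significant_features(stats_demo, "CH")
print(ch_genes)  # ['Mrpl15', 'Lypla1']
```

The same helper could be applied to the full `stats` table returned above, one group at a time, assuming the column names match those displayed.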