# Marker Selection 

The goal of this step is to select a small subset of markers gene that are discriminative to the cell type/ state of interest. 

To plot the figure 2 of our book chapter  we run 3 different marker selection method on the data scRNA-seq data from our recent study (Curras-Alonso, et al. An interactive murine single-cell atlas of the lung responses to radiation injury. Nat Commun 14, 2445 (2023).https://doi.org/10.1038/s41467-023-38134-z).

For the goal of this notebook we use a small subset of data that can be download here :  https://cloud.minesparis.psl.eu/index.php/s/ZXkeiJTzD6KrjRX


In [23]:
import numpy as np
import scanpy as sc
import pandas as pd
from pathlib import Path

import nsforest as ns
from nsforest import NSForest


from scGeneFit.functions import get_markers, load_example_data, plot_marker_selection

from RankCorr_local.picturedRocks import Rocks



In [8]:
#### load a reduce dataset
anndata = sc.read("./test_set/test_anndata.h5ad")


# NS forest

Aevermann B et al. A machine learning method for the discovery of minimum marker gene combinations for cell type identification from single-cell RNA sequencing, Genome Res. 2021   

The official tutorial to use this method can be found here https://jcventerinstitute.github.io/celligrate/tutorials/NS-Forest_tutorial.html

In [None]:
from nsforest import NSForest

path_to_save_ns_result = "./ns"
Path(path_to_save_ns_result).mkdir(exist_ok = True, parents = True)
list_cell_type = list(np.array(anndata.obs.cell_ID))

ns.NSForest(anndata, cluster_header="cell_ID",
            cluster_list=list_cell_type,
            n_trees=1000, n_genes_eval=10,
            beta=0.5, output_folder=path_to_save_ns_result)

Preparing data...
--- 0.0016837120056152344 seconds ---
Calculating medians...
--- 0.17231512069702148 seconds ---
Number of clusters to evaluate: 4000
1 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
3 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
4 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
5 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
6 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
7 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
8 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
9 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
10 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
11 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
12 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
13 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
14 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
15 out of 4000:
	AM
	['Atp6v0d2', 'Chi

	['Cd79a']
	0.9112149532710281
132 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
133 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
134 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
135 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
136 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
137 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
138 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
139 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
140 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
141 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
142 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
143 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
144 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
145 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
146 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
147 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281

	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
268 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
269 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
270 out of 4000:
	Mesotheliocytes
	['Nkain4', 'Wt1']
	0.980392156862745
271 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
272 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
273 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
274 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
275 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
276 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
277 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
278 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
279 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
280 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
281 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
282 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
283 out of 40

	['Atp6v0d2', 'Chil3']
	0.9518348623853211
399 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
400 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
401 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
402 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
403 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
404 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
405 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
406 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
407 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
408 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
409 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
410 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
411 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
412 out of 4000:
	AT1
	['Rtkn2']
	0.8214285714285715
413 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
414 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
415 out o

	['Cd3g', 'Cd3d']
	0.7610554605423894
534 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
535 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
536 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
537 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
538 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
539 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
540 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
541 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
542 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
543 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
544 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
545 out of 4000:
	Platelets
	Only 5 out of 15 top Random Forest features with median > 0 will be further evaluated.
	['Ccl21a']
	0.8394160583941607
546 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
547 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.936768149882903

	['Inmt']
	0.9461426491994178
666 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
667 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
668 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
669 out of 4000:
	Fibroblasts
	['Inmt']
	0.9461426491994178
670 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
671 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
672 out of 4000:
	Platelets
	Only 5 out of 15 top Random Forest features with median > 0 will be further evaluated.
	['Ccl21a']
	0.8394160583941607
673 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
674 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
675 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
676 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
677 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
678 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
679 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
680 out of 4000:
	EC
	['Clec

	['Sftpc']
	0.9610472541507024
796 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
797 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
798 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
799 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
800 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
801 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
802 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
803 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
804 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
805 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
806 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
807 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
808 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
809 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
810 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
811 out of 4000:
	NK_T_cells
	['Cc

	['Cd3g', 'Cd3d']
	0.7610554605423894
932 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
933 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
934 out of 4000:
	Fibroblasts
	['Inmt']
	0.9461426491994178
935 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
936 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
937 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
938 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
939 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
940 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
941 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
942 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
943 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
944 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
945 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
946 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
947 out of 4000:
	AT2
	['Sftpc']
	0.961

	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1065 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1066 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1067 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1068 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1069 out of 4000:
	Platelets
	Only 5 out of 15 top Random Forest features with median > 0 will be further evaluated.
	['Ccl21a']
	0.8394160583941607
1070 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
1071 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1072 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1073 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1074 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1075 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1076 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1077 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1078 out of 4000:
	T_cells
	['Cd3g', 'Cd3

	['Tagln']
	0.8914728682170542
1196 out of 4000:
	AT1
	['Rtkn2']
	0.8214285714285715
1197 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1198 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
1199 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1200 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1201 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1202 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1203 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
1204 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
1205 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1206 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
1207 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
1208 out of 4000:
	SMC
	['Tagln']
	0.8914728682170542
1209 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1210 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1211 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605

	['Plac8', 'Lst1']
	0.8156606851549756
1329 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1330 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1331 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1332 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1333 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1334 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1335 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1336 out of 4000:
	Fibroblasts
	['Inmt']
	0.9461426491994178
1337 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
1338 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1339 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
1340 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1341 out of 4000:
	AT1
	['Rtkn2']
	0.8214285714285715
1342 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1343 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1344 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423

	['Cd3g', 'Cd3d']
	0.7610554605423894
1460 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
1461 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
1462 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1463 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
1464 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
1465 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1466 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1467 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1468 out of 4000:
	AT1
	['Rtkn2']
	0.8214285714285715
1469 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1470 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1471 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1472 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
1473 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1474 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1475

	['Cd79a']
	0.9112149532710281
1592 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1593 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1594 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
1595 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1596 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
1597 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
1598 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1599 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1600 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1601 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1602 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1603 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
1604 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1605 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
1606 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
1607 out of 4000:
	EC
	['Clec

	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1724 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1725 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1726 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
1727 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1728 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
1729 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1730 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1731 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1732 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1733 out of 4000:
	SMC
	['Tagln']
	0.8914728682170542
1734 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1735 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1736 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1737 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1738 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1739 out of 4000:
	AM
	['Atp6v0d2', 'Chil3'

	['Cd79a']
	0.9112149532710281
1855 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1856 out of 4000:
	Club
	['Scgb3a2', 'Scgb1a1']
	0.9367681498829039
1857 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1858 out of 4000:
	Fibroblasts
	['Inmt']
	0.9461426491994178
1859 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1860 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1861 out of 4000:
	Basophils
	['Il4', 'Cyp11a1']
	0.9821428571428572
1862 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1863 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1864 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1865 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1866 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
1867 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1868 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1869 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1870 out of 4000:
	IM
	[

	['Ccdc153']
	0.8664772727272728
1985 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1986 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
1987 out of 4000:
	DC
	['Cst3', 'H2-Eb1']
	0.8497536945812809
1988 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1989 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
1990 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
1991 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1992 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1993 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
1994 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
1995 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
1996 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
1997 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
1998 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
1999 out of 4000:
	AT1
	['Rtkn2']
	0.8214285714285715
2000 out of 4000:
	EC
	['Clec14a', 'Eng']


	['Gzma']
	0.8475783475783477
2119 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2120 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2121 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2122 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
2123 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
2124 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
2125 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
2126 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
2127 out of 4000:
	Neutrophils
	['S100a9']
	0.9512683578104139
2128 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2129 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
2130 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2131 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2132 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2133 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2134 out of 4000:
	B_cell

	['Clec14a', 'Eng']
	0.8882521489971348
2252 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2253 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2254 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2255 out of 4000:
	Fibroblasts
	['Inmt']
	0.9461426491994178
2256 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
2257 out of 4000:
	NK_T_cells
	['Ccl5', 'Cd3d']
	0.7727797001153401
2258 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
2259 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
2260 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
2261 out of 4000:
	NK_cells
	['Gzma']
	0.8475783475783477
2262 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2263 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
2264 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2265 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2266 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2267 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.950744

	['C1qc', 'C1qb']
	0.9507445589919817
2385 out of 4000:
	EC
	['Clec14a', 'Eng']
	0.8882521489971348
2386 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024
2387 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
2388 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2389 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2390 out of 4000:
	T_cells
	['Cd3g', 'Cd3d']
	0.7610554605423894
2391 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2392 out of 4000:
	AM
	['Atp6v0d2', 'Chil3']
	0.9518348623853211
2393 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2394 out of 4000:
	Monocytes
	['Plac8', 'Lst1']
	0.8156606851549756
2395 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2396 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2397 out of 4000:
	B_cells
	['Cd79a']
	0.9112149532710281
2398 out of 4000:
	IM
	['C1qc', 'C1qb']
	0.9507445589919817
2399 out of 4000:
	Ciliated
	['Ccdc153']
	0.8664772727272728
2400 out of 4000:
	AT2
	['Sftpc']
	0.9610472541507024


Of note: NS forest computes also the "information gain" of markers to classify a class. This infomation gain is computed while training the random forest model as it is used as criteria for building decision tree. We use this metric to incrementaly increse the number of marker from one marker per class to five markers per class (Figure 2B of the book chapter) by keeping first the marker with the higher information gain.

# scGeneFit 
Dumitrascu, B., Villar, S., Mixon, D.G. et al. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat Commun 12, 1186 (2021). https://doi.org/10.1038/s41467-021-21453-4  


The official repository of the code can be found here  : https://github.com/solevillar/scGeneFit-python

In [22]:

path_to_save_ns_result = "./scGenefit_result"
Path(path_to_save_ns_result).mkdir(exist_ok = True, parents = True)



column_name = "cell_ID"
num_markers = 20
method = 'centers'
redundancy = 0.25
data = anndata.X
labels = anndata.obs[column_name].values

markers_index = get_markers(data, labels, num_markers, method=method, redundancy=redundancy)
markers_names = anndata.var_names[markers_index]

df = pd.DataFrame()
df['gene'] = markers_names
df.to_csv(path_to_save_ns_result + "/scGeneFit_10.csv")

dico_result = {num_markers: [markers_index, markers_names]}
np.save(path_to_save_ns_result + '/dico_result' + str(num_markers), dico_result)
df

Solving a linear program with 3000 variables and 100 constraints
Time elapsed: 0.49406862258911133 seconds


Unnamed: 0,gene
0,Tmsb4x
1,Ccl21a
2,Jun
3,Cd52
4,Gng11
5,Cd9
6,Mgp
7,Klf2
8,Rps24
9,Ccl5


# RankCorr

Vargo, A.H.S., Gilbert, A.C. A rank-based marker selection method for high throughput scRNA-seq data. BMC Bioinformatics 21, 477 (2020). https://doi.org/10.1186/s12859-020-03641-z  

The official code can be found here :  https://github.com/ahsv/RankCorr


In [None]:
path_to_save_ns_result = "./rankcorr"
Path(path_to_save_ns_result).mkdir(exist_ok = True, parents = True)

column_name = "cell_ID"



anndata.obs[column_name] = anndata.obs[column_name].astype('category')

lookup = list(anndata.obs[column_name].cat.categories)

yVec = np.array([lookup.index(anndata.obs[column_name][i]) for i in
                 range(anndata.obs[column_name].shape[0])])

data = Rocks(anndata.X, yVec)

lamb = 3
markers, dico_cluster_gene = data.CSrankMarkers(lamb=lamb, writeOut=False, keepZeros=False, onlyNonZero=False)

geneNames = np.array(anndata.var.index)
data.genes = geneNames
marker_genes = data.markers_to_genes(markers)

dico_cell_type = {}
list_gene = []
list_cluster = []
for c in dico_cluster_gene:
    list_gene += data.markers_to_genes(dico_cluster_gene[c])
    list_cluster.append([lookup[c-1]] * len(data.markers_to_genes(dico_cluster_gene[c])))

## create dataframe
df = pd.DataFrame()
df['gene'] = list_gene
df['cluster'] = np.concatenate(list_cluster)
df.to_csv(path_to_save_ns_result+ "/markers.csv")

for i in range(len(lookup)):
    genes = anndata.var_names[dico_cluster_gene[list(dico_cluster_gene.keys())[0]][-1][i + 1]]
    dico_cell_type[lookup[i]] = list(genes)

df