# YADA Deconvolution

---
In this notebook, we demonstrate running YADA with a reference matrix that contains lists of marker genes together with their expression values, i.e., RNA counts for the relevant cell types.
We recommend cloning the repository and installing its requirements before running the notebook in Jupyter (the `!` prefix runs a shell command from a notebook cell):
```
git clone https://github.com/zurkin1/Yada.git
!pip install -r ../requirements.txt
```

## 1 - Import Prerequisites

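```python
%load_ext autoreload
%autoreload 2

import pandas as pd  # explicit import; `pd` may also be exposed by the wildcard import below
from YADA import *

# Widen the pandas display so that all cell type columns are shown.
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)
```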



## 2 - Configure Input Files

Example input files are located in the repository's `data/` folder. We demonstrate with input files from the GEO RNA-seq series GSE107019, Monaco et al. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107019).


```python
# Reference (pure) matrix. It should be normalized in the same way as the mixture data.
pure_file_path = '../data/Challenge/pure-107019_RNASeq.csv'

# Mixture file. Columns are mixtures (mix1, mix2, ...); rows are gene names.
mix_File_path = '../data/Challenge/mix-107019_RNASeq.csv'

# True cell type proportions. Optional: if available, they can be used to evaluate the results.
labels_file_path = 'Challenge'
```
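
As an illustration of the mixture layout described in the comments above (genes in rows, one column per mixture), the following sketch builds a tiny, made-up table and checks its shape. The gene values, the `mix_example` name, and the use of `index_col=0` are assumptions for illustration only; they are not taken from the YADA code.

```python
import io

import pandas as pd

# Hypothetical miniature mixture file: gene names in rows, mixtures in columns.
mix_csv = io.StringIO(
    "gene,mix1,mix2\n"
    "cd1e,5.0,0.0\n"
    "fcgr3b,1.5,7.2\n"
    "klrf1,0.3,2.1\n"
)

# Assumption: the first column holds gene names and becomes the row index.
mix_example = pd.read_csv(mix_csv, index_col=0)

# Basic layout checks that are worth running before deconvolution.
assert mix_example.index.is_unique, "gene names should not repeat"
assert all(pd.api.types.is_numeric_dtype(t) for t in mix_example.dtypes)

print(mix_example.shape)  # (number of genes, number of mixtures)
```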

## 3 - Data Preprocessing
YADA implements the following data preprocessing steps:

- Missing values are imputed with 0.
- If the maximum expression value across all genes is less than 20, a power transformation (raising values to the power of 2) is applied.
- Only genes common to both the marker gene list and the mixture dataset are considered for deconvolution.
- Standardization is performed column-wise (i.e., per cell type) by subtracting the minimum value and dividing by the mean.
- A gene differentiation algorithm selects the genes that best distinguish each cell type.
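
The imputation, gene-matching, and standardization steps listed above can be sketched in pandas as follows. This is a minimal illustration, not YADA's `preprocess` implementation: the function name `preprocess_sketch` is hypothetical, the power transformation and marker selection are omitted, and dividing by the mean of the original (unshifted) column is an assumption about how the standardization is meant.

```python
import numpy as np
import pandas as pd


def preprocess_sketch(pure: pd.DataFrame, mix: pd.DataFrame):
    """Illustrative imputation, gene matching and column-wise scaling."""
    # Impute missing values with 0.
    pure, mix = pure.fillna(0), mix.fillna(0)

    # Keep only the genes present in both tables.
    shared = pure.index.intersection(mix.index)
    pure, mix = pure.loc[shared], mix.loc[shared]

    # Column-wise scaling: subtract the column minimum, divide by the column mean.
    # A column whose mean is 0 would produce NaN or inf; this is not handled here.
    def scale(df: pd.DataFrame) -> pd.DataFrame:
        return (df - df.min()) / df.mean()

    return scale(pure), scale(mix)


# Made-up example data: rows are genes, columns are cell types or mixtures.
pure_demo = pd.DataFrame(
    {"A": [1.0, 2.0, 4.0], "B": [3.0, np.nan, 6.0]}, index=["g1", "g2", "g3"]
)
mix_demo = pd.DataFrame({"mix1": [2.0, 4.0, 9.0]}, index=["g2", "g3", "g4"])

pure_scaled, mix_scaled = preprocess_sketch(pure_demo, mix_demo)
print(pure_scaled)
print(mix_scaled)
```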

```python
pure, mix = preprocess(pure_file_path, mix_File_path)
```

INFO:root:Preprocessing files...
INFO:root:Dropping genes that are not shared by mix and pure...
INFO:root:Standardizing data...
INFO:root:Gene list differential expression...


Since a complete gene expression matrix for purified cell populations is provided, YADA performs gene differentiation to identify an optimal marker gene set.
The gene differentiation algorithm selects a subset of genes that best discriminates between the provided cell types. It involves the following steps:

- Calculate the difference between the maximum and minimum expression values for each gene across all cell types.
- Rank genes in descending order based on the calculated difference.
- Select the top N genes as the marker gene set, where N is a user-defined parameter.

By executing this procedure, YADA derives a marker gene signature directly from the input expression data, which can improve the accuracy of the subsequent deconvolution.
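
The three steps listed above can be sketched in a few lines of pandas. This is an illustration of the described ranking only; it is not the `gene_diff` implementation, and the function name `select_marker_genes`, the parameter `n_genes`, and the example values are hypothetical.

```python
import pandas as pd


def select_marker_genes(pure: pd.DataFrame, n_genes: int) -> pd.Index:
    """Return the n_genes genes with the largest max-min spread across cell types."""
    # Step 1: per-gene difference between the maximum and minimum expression.
    spread = pure.max(axis=1) - pure.min(axis=1)
    # Steps 2 and 3: rank in descending order and keep the top n_genes.
    return spread.sort_values(ascending=False).head(n_genes).index


# Made-up reference matrix: rows are genes, columns are cell types.
pure_example = pd.DataFrame(
    {"B.cells": [9.0, 1.0, 5.0, 2.0], "NK.cells": [1.0, 1.2, 0.5, 8.0]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

markers = select_marker_genes(pure_example, n_genes=2)
print(list(markers))
```

Here `geneA` (spread 8.0) and `geneD` (spread 6.0) are selected, while `geneB`, which is expressed almost equally in both cell types, is dropped.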

YADA does not require a complete gene expression matrix for purified cell populations as input; a list of marker genes for each cell type is sufficient. Although YADA can derive marker gene sets from a reference expression matrix using the `gene_diff` function, in most cases only predefined marker gene lists are available.

```python
pure
```

| | naive.B.cells | memory.B.cells | naive.CD4.T.cells | naive.CD8.T.cells | memory.CD8.T.cells | regulatory.T.cells | monocytes | NK.cells | myeloid.dendritic.cells | neutrophils |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | linc01013 | rp4-809f18.1 | fst | hsfy2 | rapgef4-as1 | mtnd1p23 | rp11-73m18.2 | trdj1 | cd1e | fcgr3b |
| 1 | kcnh8 | mir568 | phf2p2 | rp11-20p5.2 | rp11-347c18.1 | krt1 | rna5sp154 | spon2 | bx255923.2 | tnfrsf10c |
| 2 | bmp3 | ac007003.1 | ctd-2358c21.4 | frg2b | rp13-1032i1.10 | rp11-47i22.4 | rp11-747h12.5 | klrf1 | fcer1a | kcnj15 |
| 3 | mybpc2 | borcs7-asmt | pin4p1 | nr1i2 | glra2 | golga6l6 | ch17-125a10.1 | sh2d1b | rp11-290h9.2 | mme |
| 4 | snord84 | rn7sl152p | cnn2p8 | rp11-677m14.6 | ac015849.2 | kynup3 | adamts5 | s1pr5 | znf366 | cmtm2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | st6galnac4p1 | rp11-512m8.11 | linc00933 | ctd-2036p10.6 | rp1-95l4.3 | znf75bp | rps3ap43 | nuak1 | slc2a12 | cdh2 |
| 76 | prelid3bp6 | prdx2p1 | tmem256-plscr3 | rp11-89n17.2 | pgam4 | ccr8 | rp11-280o1.2 | ttc38 | ppargc1a | gp1bb |
| 77 | ctb-179i1.1 | rps10p14 | ranp8 | fcf1p1 | rp11-112l6.3 | tnfrsf4 | smarce1p6 | copz2 | wnt5a | cxcr2 |
| 78 | rpl3p1 | kynup2 | or7e36p | hmgn2p17 | znf536 | rnu6-1091p | cd300e | rnf165 | spns3 | kcnh7 |


## 4 - Run Deconvolution

```python
result = run_yada(pure, mix)
result
```


| | naive.B.cells | memory.B.cells | naive.CD4.T.cells | naive.CD8.T.cells | memory.CD8.T.cells | regulatory.T.cells | monocytes | NK.cells | myeloid.dendritic.cells | neutrophils |
|---|---|---|---|---|---|---|---|---|---|---|
| mix0 | 0.06517698 | 0.140452 | 0.024101 | 0.087757 | 0.058971 | 0.023689 | 0.181621 | 0.01351502 | 0.121427 | 0.231759 |
| mix1 | 0.1317191 | 0.055823 | 0.027529 | 0.012976 | 0.022421 | 0.129572 | 0.262003 | 0.1036697 | 0.233751 | 0.004206 |
| mix2 | 0.03596097 | 0.150221 | 0.066293 | 0.070305 | 0.202975 | 0.1212 | 0.160212 | 0.05684701 | 0.015761 | 0.136302 |
| mix3 | 0.2206043 | 0.000634 | 0.084577 | 0.015541 | 0.106293 | 0.064719 | 0.182126 | 0.1210489 | 0.034534 | 0.150038 |
| mix4 | 0.06417525 | 0.002218 | 0.000547 | 0.018532 | 0.196893 | 0.039499 | 0.333077 | 0.3263802 | 0.008178 | 0.000156 |
| mix5 | 0.2024363 | 0.077277 | 0.076357 | 0.103465 | 0.005173 | 0.18863 | 0.13213 | 0.03559901 | 0.114801 | 0.041931 |
| mix6 | 0.03183925 | 0.030332 | 0.030846 | 0.0081 | 0.10069 | 0.044562 | 0.389136 | 0.3152723 | 0.021079 | 0.031102 |
| mix7 | 0.2230813 | 0.130373 | 0.299677 | 0.00706 | 0.006296 | 0.178799 | 0.0534 | 0.01289321 | 0.035879 | 0.075095 |
| mix8 | 3.832942e-18 | 0.220795 | 0.077328 | 0.352873 | 0.001402 | 0.147904 | 0.022906 | 0.05856875 | 0.076448 | 0.035555 |
| mix9 | 0.08656632 | 0.00842 | 0.179878 | 0.086175 | 0.090018 | 0.039058 | 0.201019 | 0.007913075 | 0.257632 | 0.053455 |


## 5 - Evaluate Results

If the true cell type proportions are available, you can evaluate the deconvolution results using the following method:

```python
res = calc_corr(labels_file_path, result)  # columns: dataset, celltype, pearson, spearman, p
res
```

| | Challenge | celltype | Pearson | Spearman | p |
|---|---|---|---|---|---|
| 0 | Challenge | naive.B.cells | 0.991234 | 0.968421 | 2.644983e-12 |
| 1 | Challenge | memory.B.cells | 0.995892 | 0.986466 | 1.377568e-15 |
| 2 | Challenge | naive.CD4.T.cells | 0.990821 | 0.980451 | 3.689359e-14 |
| 3 | Challenge | naive.CD8.T.cells | 0.996829 | 0.980451 | 3.689359e-14 |
| 4 | Challenge | memory.CD8.T.cells | 0.995607 | 0.990977 | 3.642322e-17 |
| 5 | Challenge | regulatory.T.cells | 0.991307 | 0.98797 | 4.798484e-16 |
| 6 | Challenge | monocytes | 0.984621 | 0.986466 | 1.377568e-15 |
| 7 | Challenge | NK.cells | 0.99879 | 0.996992 | 1.891102e-21 |
| 8 | Challenge | myeloid.dendritic.cells | 0.990423 | 0.992481 | 7.097516e-18 |
| 9 | Challenge | neutrophils | 0.995647 | 0.981955 | 1.804935e-14 |
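
For readers who want to reproduce this kind of per-cell-type comparison on their own data, the sketch below computes Pearson and Spearman correlations between estimated and true proportions using pandas only. It is not the `calc_corr` implementation: the function name `correlation_table`, the column names, and the example values are assumptions, and no p-values are computed.

```python
import pandas as pd


def correlation_table(estimated: pd.DataFrame, truth: pd.DataFrame) -> pd.DataFrame:
    """Per-cell-type Pearson and Spearman correlation across mixtures."""
    rows = []
    for cell_type in estimated.columns.intersection(truth.columns):
        est, ref = estimated[cell_type], truth[cell_type]
        rows.append(
            {
                "celltype": cell_type,
                # Series.corr defaults to Pearson and aligns on the mixture index.
                "pearson": est.corr(ref),
                # Spearman is the Pearson correlation of the (average) ranks.
                "spearman": est.rank().corr(ref.rank()),
            }
        )
    return pd.DataFrame(rows)


# Made-up proportions: rows are mixtures, columns are cell types.
mixtures = ["mix0", "mix1", "mix2", "mix3"]
estimated_demo = pd.DataFrame(
    {"monocytes": [0.1, 0.2, 0.3, 0.4], "NK.cells": [0.4, 0.3, 0.2, 0.1]},
    index=mixtures,
)
truth_demo = pd.DataFrame(
    {"monocytes": [0.2, 0.4, 0.6, 0.8], "NK.cells": [0.1, 0.2, 0.3, 0.4]},
    index=mixtures,
)

table = correlation_table(estimated_demo, truth_demo)
print(table)
```

In this toy input the `monocytes` estimates are perfectly correlated with the truth and the `NK.cells` estimates are perfectly anti-correlated, which makes the two extremes of the score easy to see.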
