# YADA Deconvolution

---



Run the following cells for deconvolution using YADA.

## 1 - Import Prerequisites.

In [1]:
%load_ext autoreload
%autoreload 2

from IPython.display import FileLink, FileLinks
import pandas as pd
from YADA import *

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)

## 2 - Configure Input Files.

Example input files are in the ../data/ folder. This walkthrough uses input files derived from RNA-seq data.


In [8]:
# Path to the reference (pure) matrix. It should be normalized the same way as the mixture data.
pure_file_path = '../data/Challenge/pure-107019_RNASeq.csv'

# Path to the mixture file. Format: one column per mixture (mix1, mix2, ...), one row per gene name.
mix_File_path = '../data/Challenge/mix-107019_RNASeq.csv'

#True cell type proportions file.
labels_file_path = 'Challenge'
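
To make the mixture layout described in the comment above concrete, here is a minimal sketch that builds a toy mixture table in memory and reads it with pandas. The expression values are invented for illustration, and the presence of a header for the gene-name column is an assumption; the actual Challenge files may differ.

```python
import io

import pandas as pd

# Toy CSV in the documented layout: rows are gene names, columns are mixtures.
# The numbers are made up for illustration only.
toy_mix_csv = io.StringIO(
    "gene,mix1,mix2\n"
    "cd300e,5.0,1.0\n"
    "klrf1,0.5,7.5\n"
)

# index_col=0 turns the first column (gene names) into the row index.
toy_mix = pd.read_csv(toy_mix_csv, index_col=0)
print(toy_mix)
```
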

## 3 - Data Preprocessing
YADA implements the following data preprocessing steps:

- Missing values are imputed with 0.
- If the maximum expression value across all genes is less than 20, a power transformation (raising values to the power of 2) is applied.
- Only genes common to both the marker gene list and the mixture dataset are considered for deconvolution.
- Standardization is performed column-wise (i.e., per cell type) by subtracting the minimum value and dividing by the mean.

In [9]:
pure, mix = preprocess(pure_file_path, mix_File_path)
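
For intuition, the preprocessing steps listed above can be sketched in plain pandas. This is not YADA's implementation: the function name preprocess_sketch is hypothetical, the power transformation is read literally as squaring (if the intent is to undo a log2 transform, 2 ** x would be used instead), the genes of the pure matrix stand in for the marker gene list, and the mean is taken after subtracting the minimum. All of these are assumptions.

```python
import pandas as pd


def preprocess_sketch(pure, mix):
    """Illustrative version of the documented steps (not YADA's own code)."""
    # 1. Impute missing values with 0.
    pure = pure.fillna(0)
    mix = mix.fillna(0)

    # 2. If the maximum expression value is below 20, apply the power
    #    transformation (assumption: squaring, applied to each matrix separately).
    if pure.max().max() < 20:
        pure = pure ** 2
    if mix.max().max() < 20:
        mix = mix ** 2

    # 3. Keep only genes present in both tables (assumption: the pure matrix
    #    rows play the role of the marker gene list).
    common = pure.index.intersection(mix.index)
    pure, mix = pure.loc[common], mix.loc[common]

    # 4. Column-wise standardization: subtract the minimum, divide by the mean
    #    (assumption: the mean of the shifted column).
    def standardize(df):
        shifted = df - df.min()
        return shifted / shifted.mean()

    return standardize(pure), standardize(mix)
```

A column with a single constant value would produce a division by zero in step 4, so real data would need a guard for that case.
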

## 4 - Gene Differentiation (Optional)
If a complete gene expression matrix for purified cell populations is provided, YADA can perform gene differentiation to identify an optimal marker gene set. This step is optional and can be skipped if a predefined marker list is already available.

The gene differentiation algorithm aims to select a subset of genes that maximally differentiate between the provided cell types. It involves the following steps:

1. Calculate the difference between the maximum and minimum expression values for each gene across all cell types.
2. Rank genes in descending order based on the calculated difference.
3. Select the top N genes as the marker gene set, where N is a user-defined parameter.

By executing this procedure, YADA can automatically derive a marker gene signature from the input expression data, potentially enhancing the accuracy of subsequent deconvolution.

In [10]:
# Gene differentiation algorithm.
gene_list_df = gene_diff(pure, mix)
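
The ranking steps described in this section can be sketched as follows. This is a minimal illustration, not the body of gene_diff: the function name top_range_genes and its n_genes parameter are hypothetical, and the sketch returns a single ranked gene list, whereas gene_list_df has one column per cell type. How gene_diff assigns genes to cell types is not described here.

```python
import pandas as pd


def top_range_genes(pure, n_genes):
    """Rank genes by max-min expression across cell types and keep the top N."""
    # Step 1: per-gene difference between the highest and lowest cell type value.
    spread = pure.max(axis=1) - pure.min(axis=1)
    # Step 2: rank genes in descending order of that difference.
    ranked = spread.sort_values(ascending=False)
    # Step 3: keep the top N gene names.
    return ranked.index[:n_genes].tolist()


# Toy reference matrix: rows are genes, columns are cell types (invented values).
toy_pure = pd.DataFrame(
    {"B.cells": [1, 5, 2], "T.cells": [9, 5, 3]},
    index=["g1", "g2", "g3"],
)
print(top_range_genes(toy_pure, n_genes=2))
```
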

YADA does not require a complete gene expression matrix for purified cell populations as input. Instead, it only needs a list of marker genes for each cell type. While YADA can deduce marker gene sets from a reference expression matrix using the gene_diff function, in most cases, only pre-defined marker gene lists are available. These lists can be provided to YADA by creating a gene_list_df dataframe.
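
As a sketch of how a predefined marker list might be turned into such a dataframe, the example below builds one column per cell type from a dictionary. The gene names are taken from the gene_list_df output in this notebook; the variable names are hypothetical, and whether YADA accepts columns of unequal length (padded with NaN here) is an assumption that should be checked against the library.

```python
import pandas as pd

# Hypothetical marker lists: cell type -> marker genes.
markers = {
    "monocytes": ["cd300e", "adamts5"],
    "NK.cells": ["klrf1", "spon2", "s1pr5"],
    "neutrophils": ["fcgr3b", "cxcr2", "mme"],
}

# One column per cell type; wrapping each list in a Series lets pandas
# pad shorter lists with NaN instead of raising a length error.
custom_gene_list_df = pd.DataFrame(
    {cell_type: pd.Series(genes) for cell_type, genes in markers.items()}
)
print(custom_gene_list_df)
```
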

In [8]:
gene_list_df

| | naive.B.cells | memory.B.cells | naive.CD4.T.cells | naive.CD8.T.cells | memory.CD8.T.cells | regulatory.T.cells | monocytes | NK.cells | myeloid.dendritic.cells | neutrophils |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | linc01013 | rp4-809f18.1 | fst | hsfy2 | rapgef4-as1 | mtnd1p23 | rp11-73m18.2 | trdj1 | cd1e | fcgr3b |
| 1 | kcnh8 | mir568 | phf2p2 | rp11-20p5.2 | rp11-347c18.1 | krt1 | rna5sp154 | spon2 | bx255923.2 | tnfrsf10c |
| 2 | bmp3 | ac007003.1 | ctd-2358c21.4 | frg2b | rp13-1032i1.10 | rp11-47i22.4 | rp11-747h12.5 | klrf1 | fcer1a | kcnj15 |
| 3 | mybpc2 | borcs7-asmt | pin4p1 | nr1i2 | glra2 | golga6l6 | ch17-125a10.1 | sh2d1b | rp11-290h9.2 | mme |
| 4 | snord84 | rn7sl152p | cnn2p8 | rp11-677m14.6 | ac015849.2 | kynup3 | adamts5 | s1pr5 | znf366 | cmtm2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 75 | st6galnac4p1 | rp11-512m8.11 | linc00933 | ctd-2036p10.6 | rp1-95l4.3 | znf75bp | rps3ap43 | nuak1 | slc2a12 | cdh2 |
| 76 | prelid3bp6 | prdx2p1 | tmem256-plscr3 | rp11-89n17.2 | pgam4 | ccr8 | rp11-280o1.2 | ttc38 | ppargc1a | gp1bb |
| 77 | ctb-179i1.1 | rps10p14 | ranp8 | fcf1p1 | rp11-112l6.3 | tnfrsf4 | smarce1p6 | copz2 | wnt5a | cxcr2 |
| 78 | rpl3p1 | kynup2 | or7e36p | hmgn2p17 | znf536 | rnu6-1091p | cd300e | rnf165 | spns3 | kcnh7 |


## 5 - Run Deconvolution.

In [12]:
result = run_yada(pure, mix, gene_list_df)
result

| | naive.B.cells | memory.B.cells | naive.CD4.T.cells | naive.CD8.T.cells | memory.CD8.T.cells | regulatory.T.cells | monocytes | NK.cells | myeloid.dendritic.cells | neutrophils |
|---|---|---|---|---|---|---|---|---|---|---|
| mix0 | 0.06510932 | 0.139998 | 0.02363 | 0.088001 | 0.05976 | 0.02386 | 0.181402 | 0.0133618 | 0.122602 | 0.232836 |
| mix1 | 0.1328024 | 0.056148 | 0.027403 | 0.013681 | 0.022919 | 0.130835 | 0.262066 | 0.102278 | 0.234297 | 0.003972 |
| mix2 | 0.03685177 | 0.149513 | 0.065553 | 0.06994 | 0.204118 | 0.120903 | 0.160237 | 0.05685653 | 0.016228 | 0.13404 |
| mix3 | 0.2184108 | 0.00064 | 0.084474 | 0.01601 | 0.107536 | 0.064281 | 0.18298 | 0.1196925 | 0.034109 | 0.148653 |
| mix4 | 0.06381383 | 0.002332 | 0.000823 | 0.019264 | 0.199137 | 0.03994 | 0.333606 | 0.3261569 | 0.008473 | 0.000216 |
| mix5 | 0.2044775 | 0.076677 | 0.074489 | 0.104568 | 0.005324 | 0.187129 | 0.131919 | 0.03549868 | 0.114335 | 0.041387 |
| mix6 | 0.03139101 | 0.030452 | 0.030898 | 0.009147 | 0.102076 | 0.044835 | 0.388953 | 0.3115957 | 0.021546 | 0.030609 |
| mix7 | 0.2277614 | 0.129444 | 0.298505 | 0.007775 | 0.005989 | 0.178833 | 0.053525 | 0.01279926 | 0.035252 | 0.074192 |
| mix8 | 3.454594e-18 | 0.218877 | 0.075901 | 0.358503 | 0.001309 | 0.148091 | 0.022591 | 0.05792784 | 0.077045 | 0.034962 |
| mix9 | 0.0882669 | 0.008486 | 0.177011 | 0.085566 | 0.091452 | 0.03945 | 0.201882 | 0.007738412 | 0.253404 | 0.05334 |


## 6 - Download Results.

To download the deconvolution results, uncomment the relevant lines in the following cell. FileLink displays a link to the results file in the cell output; right-click the link and choose "Copy Link Address" (or the equivalent option in your environment), then use the copied URL to download the file. In Google Colab, use files.download instead.

The file must exist on disk before it can be linked, so first save the dataframe holding the deconvolution output (result in this notebook) to data/results.csv.

In [None]:
# Save the results first, e.g. result.to_csv('data/results.csv'), then uncomment one option below.
# Jupyter: display a link to the results file.
#FileLink('data/results.csv')
# Google Colab: download the results file directly.
#from google.colab import files
#files.download('data/results.csv')

## 7 - Evaluate Results.

If the true cell type proportions are available, you can evaluate the deconvolution results using the following method:

In [15]:
res = calc_corr(labels_file_path, result)  # returned columns: dataset, celltype, pearson, spearman, p
res

| | Challenge | celltype | Pearson | Spearman | p |
|---|---|---|---|---|---|
| 0 | Challenge | naive.B.cells | 0.989535 | 0.969925 | 1.714356e-12 |
| 1 | Challenge | memory.B.cells | 0.995704 | 0.986466 | 1.377568e-15 |
| 2 | Challenge | naive.CD4.T.cells | 0.991059 | 0.980451 | 3.689359e-14 |
| 3 | Challenge | naive.CD8.T.cells | 0.996958 | 0.980451 | 3.689359e-14 |
| 4 | Challenge | memory.CD8.T.cells | 0.995444 | 0.990977 | 3.642322e-17 |
| 5 | Challenge | regulatory.T.cells | 0.992295 | 0.986466 | 1.377568e-15 |
| 6 | Challenge | monocytes | 0.984819 | 0.986466 | 1.377568e-15 |
| 7 | Challenge | NK.cells | 0.998679 | 0.995489 | 7.230721e-20 |
| 8 | Challenge | myeloid.dendritic.cells | 0.990618 | 0.989474 | 1.45057e-16 |
| 9 | Challenge | neutrophils | 0.996379 | 0.978947 | 7.148944e-14 |
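
The internals of calc_corr are not shown in this notebook. As a rough sketch of how a per-cell-type comparison like the one above could be computed, the example below uses scipy.stats.pearsonr and scipy.stats.spearmanr. The function name correlation_table, the assumed input layout (rows are mixtures, columns are cell types), and the choice to report the Spearman p-value are all assumptions; the last one is at least consistent with the table, where rows with equal Spearman values share the same p.

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr


def correlation_table(truth, predicted, dataset="toy"):
    """Per-cell-type Pearson and Spearman correlation (illustrative only)."""
    # Align on shared mixtures (rows) and shared cell types (columns).
    samples = truth.index.intersection(predicted.index)
    rows = []
    for celltype in truth.columns.intersection(predicted.columns):
        t = truth.loc[samples, celltype]
        p = predicted.loc[samples, celltype]
        pearson, _ = pearsonr(t, p)
        spearman, p_value = spearmanr(t, p)
        rows.append(
            {
                "dataset": dataset,
                "celltype": celltype,
                "pearson": pearson,
                "spearman": spearman,
                "p": p_value,  # assumption: p-value of the Spearman test
            }
        )
    return pd.DataFrame(rows)


# Toy proportions (invented): predictions are exactly twice the truth.
toy_truth = pd.DataFrame({"NK.cells": [0.1, 0.2, 0.3, 0.4]},
                         index=["mix0", "mix1", "mix2", "mix3"])
toy_pred = toy_truth * 2
toy_res = correlation_table(toy_truth, toy_pred)
print(toy_res)
```
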
