# Differential gene expression analysis

Differential gene expression analysis (DEA) can be performed at two levels:
- Single-cell level
- Pseudobulk level

The `dotools_py` package includes two functions that automatically run DEA between two conditions
for every cell type defined in the object, as well as a consensus function that runs both approaches.

## Environment setup

In [1]:
import anndata as ad
import dotools_py as do
import session_info


adata = ad.read_h5ad("/Users/david/Downloads/Data10x/adata.h5ad")
adata

2025-10-22 16:33:19,274 - Jupyter enviroment detected. Using "inline" backend


AnnData object with n_obs × n_vars = 2783 × 18517
    obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap'
    obsm: 'X_CCA', 'X_pca', 'X_umap'
    layers: 'counts', 'logcounts'
    obsp: 'connectivities', 'distances'

## DEA at the single-cell level

The available tests are: Wilcoxon, t-test, logistic regression, t-test with overestimated variance, and MAST.
The MAST test can be run directly with `do.tl.MastTest`, and the other tests with `do.tl.rank_genes_groups`. Alternatively, `do.tl.rank_genes_condition` accepts any of these tests and automatically runs it for every cell type.

To reduce the computation time, we only use the NK cells here.

In [2]:
nk = adata[adata.obs.annotation == "NK"].copy()

df = do.tl.rank_genes_condition(
    nk,
    groupby="condition",
    subset_by="annotation",
    reference="healthy",
    groups=["disease"],
    method="mast",
    get_results=True,
)

2025-10-22 16:34:01,821 - Running DGEs for NK.
2025-10-22 16:34:01,824 - Running MAST test
2025-10-22 16:34:12,968 - Reference level is set to healthy


In [3]:
df.head(10)

| | GeneName | log2fc | pvals | padj | pts_group | pts_ref | group | annotation |
|---|---|---|---|---|---|---|---|---|
| 0 | A1BG | 2.604576 | 0.001125 | 0.021286 | 0.2 | 0.042553 | disease | NK |
| 1 | A1BG-AS1 | -22.228531 | 0.781334 | 1.0 | 0.0 | 0.00266 | disease | NK |
| 2 | A2M | -24.127653 | 0.639331 | 1.0 | 0.0 | 0.005319 | disease | NK |
| 3 | A2M-AS1 | -0.21674 | 0.636112 | 1.0 | 0.133333 | 0.170213 | disease | NK |
| 4 | A4GALT | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | disease | NK |
| 5 | AAAS | -1.100035 | 0.69498 | 1.0 | 0.022222 | 0.047872 | disease | NK |
| 6 | AACS | -25.872604 | 0.328578 | 0.901818 | 0.0 | 0.015957 | disease | NK |
| 7 | AAED1 | 0.645098 | 0.001607 | 0.028256 | 0.044444 | 0.090426 | disease | NK |
| 8 | AAGAB | 0.570027 | 0.643581 | 1.0 | 0.088889 | 0.066489 | disease | NK |
| 9 | AAK1 | -0.717432 | 0.129302 | 0.568715 | 0.333333 | 0.465426 | disease | NK |
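
A typical next step is to keep only the genes that pass a significance and effect-size cutoff. The sketch below rebuilds a few rows of the table above by hand and filters them with plain pandas. The thresholds (`padj < 0.05`, `|log2fc| > 1`) are assumptions chosen for illustration, not defaults of `dotools_py`. The extra `pts_group > 0` filter assumes that `pts_group` is the fraction of cells expressing the gene in the tested group (as in scanpy); it is included because genes that are not detected in one group, such as A2M above, can receive extreme fold changes.

```python
import pandas as pd

# A few rows copied from the MAST output above.
df_demo = pd.DataFrame(
    {
        "GeneName": ["A1BG", "A1BG-AS1", "A2M", "AAED1", "AAK1"],
        "log2fc": [2.604576, -22.228531, -24.127653, 0.645098, -0.717432],
        "padj": [0.021286, 1.0, 1.0, 0.028256, 0.568715],
        "pts_group": [0.2, 0.0, 0.0, 0.044444, 0.333333],
    }
)

# Illustrative cutoffs (assumptions, adjust them to the dataset).
PADJ_CUTOFF = 0.05
LOG2FC_CUTOFF = 1.0

# Keep genes that are significant, change strongly, and are detected
# in at least some cells of the tested group.
significant = df_demo[
    (df_demo["padj"] < PADJ_CUTOFF)
    & (df_demo["log2fc"].abs() > LOG2FC_CUTOFF)
    & (df_demo["pts_group"] > 0)
]
sig_genes = significant["GeneName"].tolist()
print(sig_genes)  # ['A1BG']
```

The same boolean mask can be applied to the full `df` returned by `do.tl.rank_genes_condition`, since it carries the `padj`, `log2fc` and `pts_group` columns shown above.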


## DEA at the pseudobulk level

To perform differential gene expression with a pseudobulk approach we can use `do.tl.rank_genes_pseudobulk`, which tests between two conditions for each cell type. Either `DESeq2` or `edgeR` can be used. In this case we need to generate pseudo-replicates, since we only have one sample per condition.

In [4]:
df = do.tl.rank_genes_pseudobulk(
    adata,
    ctrl_cond="healthy",
    disease_cond="disease",
    cluster_key="annotation",
    batch_key="batch",
    condition_key="condition",
    design="~condition",
    min_cells=30,
    min_counts=10,
    method="deseq2",
    pseudobulk_approach="sum",
    technical_replicates=2,
    get_results=True,
)

2025-10-22 16:36:05,602 - Generating Pseudo-bulk data


Pseudo-bulked groups: 100%|██████████| 10/10 [00:00<00:00, 54.47it/s]


2025-10-22 16:36:23,693 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:23,696 - Run DESeq2
Using None as control genes, passed at DeseqDataSet initialization


Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 0.34 seconds.

Fitting dispersion trend curve...
... done in 0.07 seconds.

Fitting MAP dispersions...
... done in 0.48 seconds.

Fitting LFCs...
... done in 0.40 seconds.

Calculating cook's distance...
... done in 0.00 seconds.

Replacing 0 outlier genes.

Running Wald tests...
... done in 0.20 seconds.

Fitting size factors...
... done in 0.00 seconds.



Log2 fold change & Wald test p-value: condition disease vs healthy
          baseMean  log2FoldChange     lfcSE      stat    pvalue  padj
A1BG      5.669955        0.403825  1.067884  0.378154  0.705316   NaN
A1BG-AS1  0.738919        2.278757  3.406331  0.668977  0.503510   NaN
A2M-AS1   0.157668        0.021039  5.332329  0.003946  0.996852   NaN
A4GALT    2.649744        4.131701  2.716025  1.521231  0.128202   NaN
AAAS      1.998178       -0.768574  1.708237 -0.449923  0.652766   NaN
...            ...             ...       ...       ...       ...   ...
ZXDB      1.219480       -4.522738  3.132528 -1.443798  0.148796   NaN
ZXDC      3.052221       -0.368758  1.482192 -0.248793  0.803521   NaN
ZYG11B    2.702957        1.185403  1.593806  0.743756  0.457024   NaN
ZYX       4.646288        0.371801  1.177736  0.315692  0.752237   NaN
ZZEF1     2.725794       -0.534042  1.487064 -0.359125  0.719501   NaN

[12711 rows x 6 columns]
Using None as control genes, passed at DeseqDataSet ini

Fitting dispersions...
... done in 0.35 seconds.

Fitting dispersion trend curve...
... done in 0.07 seconds.

Fitting MAP dispersions...
... done in 0.46 seconds.

Fitting LFCs...


2025-10-22 16:36:27,369 - Test could not be computed for Monocytes due to You specified a non-existant category for condition. Possible categories: healthy


... done in 0.42 seconds.

Calculating cook's distance...
... done in 0.00 seconds.

Replacing 0 outlier genes.

Fitting size factors...
... done in 0.00 seconds.



Using None as control genes, passed at DeseqDataSet initialization


Fitting dispersions...
... done in 0.42 seconds.

Fitting dispersion trend curve...
... done in 0.07 seconds.

Fitting MAP dispersions...
... done in 0.61 seconds.

Fitting LFCs...
... done in 0.49 seconds.

Calculating cook's distance...
... done in 0.00 seconds.

Replacing 0 outlier genes.

Running Wald tests...
... done in 0.19 seconds.



Log2 fold change & Wald test p-value: condition disease vs healthy
           baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
A1BG       9.584556        2.614643  1.247043  2.096674  0.036022  0.102957
A1BG-AS1   0.083831        1.607201  5.343868  0.300756  0.763600       NaN
A2M-AS1   10.868651       -0.295972  0.595112 -0.497339  0.618950  0.740909
A4GALT     0.000000             NaN       NaN       NaN       NaN       NaN
AAAS       2.234670       -1.123438  1.523501 -0.737405  0.460876       NaN
...             ...             ...       ...       ...       ...       ...
ZXDB       0.337953       -0.396078  2.704777 -0.146436  0.883577       NaN
ZXDC       1.524730       -2.568738  2.527960 -1.016131  0.309567       NaN
ZYG11B     0.252808        0.021401  2.999455  0.007135  0.994307       NaN
ZYX       11.933621       -1.723495  0.810586 -2.126234  0.033484  0.098208
ZZEF1      3.545880       -3.787932  2.403851 -1.575777  0.115077       NaN

[12711 rows x 6 colu

Fitting size factors...
... done in 0.00 seconds.

Fitting dispersions...
... done in 0.40 seconds.

Fitting dispersion trend curve...
... done in 0.08 seconds.

Fitting MAP dispersions...
... done in 0.51 seconds.

Fitting LFCs...
... done in 0.35 seconds.

Calculating cook's distance...
... done in 0.00 seconds.

Replacing 0 outlier genes.

Running Wald tests...


Log2 fold change & Wald test p-value: condition disease vs healthy
           baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
A1BG      41.778608       -0.274318  0.380225 -0.721461  0.470626  0.670616
A1BG-AS1   4.012429       -0.926947  1.281293 -0.723446  0.469406  0.669769
A2M-AS1    4.688797       -1.240566  1.236603 -1.003204  0.315762  0.530839
A4GALT     0.000000             NaN       NaN       NaN       NaN       NaN
AAAS      15.380437       -1.665547  0.709661 -2.346961  0.018927  0.073339
...             ...             ...       ...       ...       ...       ...
ZXDB       5.162987        1.357682  1.049996  1.293036  0.195999  0.389684
ZXDC      18.622773       -0.592434  0.604882 -0.979422  0.327372  0.542457
ZYG11B    18.725857       -0.319720  0.563125 -0.567761  0.570197  0.748763
ZYX       59.401751       -0.652206  0.324146 -2.012075  0.044212  0.138577
ZZEF1     23.713091       -0.187387  0.493061 -0.380049  0.703909  0.841674

[12711 rows x 6 colu

... done in 0.31 seconds.



In [5]:
df.head(10)

| | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | group |
|---|---|---|---|---|---|---|---|
| A1BG | 5.669955 | 0.403825 | 1.067884 | 0.378154 | 0.705316 | 1.0 | B_cells |
| A1BG-AS1 | 0.738919 | 2.278757 | 3.406331 | 0.668977 | 0.50351 | 1.0 | B_cells |
| A2M-AS1 | 0.157668 | 0.021039 | 5.332329 | 0.003946 | 0.996852 | 1.0 | B_cells |
| A4GALT | 2.649744 | 4.131701 | 2.716025 | 1.521231 | 0.128202 | 1.0 | B_cells |
| AAAS | 1.998178 | -0.768574 | 1.708237 | -0.449923 | 0.652766 | 1.0 | B_cells |
| AACS | 1.038974 | 0.518157 | 2.502786 | 0.207032 | 0.835985 | 1.0 | B_cells |
| AAED1 | 9.829811 | 1.880914 | 0.951986 | 1.97578 | 0.04818 | 0.163028 | B_cells |
| AAGAB | 2.013459 | 0.515767 | 1.7587 | 0.293266 | 0.769319 | 1.0 | B_cells |
| AAK1 | 0.816817 | 0.059553 | 2.702106 | 0.022039 | 0.982417 | 1.0 | B_cells |
| AAMDC | 1.314319 | -0.937528 | 2.182265 | -0.429613 | 0.667478 | 1.0 | B_cells |
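
With a single sample per condition, pseudobulk tools have no replicates to estimate dispersion from, which is why pseudo-replicates are generated in this section. One common way to build them is to split the cells of each sample at random into `k` groups and sum the raw counts within each group. The sketch below shows that idea with NumPy on a toy count matrix. It illustrates the general technique under that assumption and is not necessarily what happens behind `technical_replicates=2`; the function name `make_pseudoreplicates` is hypothetical.

```python
import numpy as np


def make_pseudoreplicates(counts, n_replicates=2, seed=0):
    """Randomly split cells into groups and sum raw counts per group.

    counts: array of shape (n_cells, n_genes) with raw counts.
    Returns an array of shape (n_replicates, n_genes).
    """
    rng = np.random.default_rng(seed)
    # Shuffle the cell indices, then cut them into nearly equal chunks.
    shuffled = rng.permutation(counts.shape[0])
    chunks = np.array_split(shuffled, n_replicates)
    # Sum the counts of the cells in each chunk (hypothetical "sum" pseudobulk).
    return np.vstack([counts[idx].sum(axis=0) for idx in chunks])


# Toy example: 6 cells x 3 genes of raw counts.
toy_counts = np.array(
    [
        [1, 0, 3],
        [0, 2, 1],
        [4, 0, 0],
        [2, 1, 1],
        [0, 0, 5],
        [3, 1, 0],
    ]
)
pseudo = make_pseudoreplicates(toy_counts, n_replicates=2)
print(pseudo.shape)  # (2, 3)
```

Pseudo-replicates drawn from the same sample capture sampling noise but not biological variability between individuals, so the resulting p-values tend to be optimistic.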


## DEA consensus

Additionally, `do.tl.rank_genes_consensus` performs both single-cell and pseudobulk DEA and generates a DataFrame that summarises the results.

In [6]:
df = do.tl.rank_genes_consensus(
    adata,
    ctrl_cond="healthy",
    disease_cond="disease",
    cluster_key="annotation",
    batch_key="batch",
    condition_key="condition",
    min_cells=30,
    min_counts=10,
    pseudobulk_approach="sum",
    technical_replicates=2,
    get_results=True,
    test_pseudobulk="edger",
    test="wilcoxon",
)

2025-10-22 16:36:53,084 - Running wilcoxon
2025-10-22 16:36:53,151 - Running DGEs for B_cells.
2025-10-22 16:36:53,153 - Running wilcoxon test.
2025-10-22 16:36:53,480 - Running DGEs for Monocytes.
2025-10-22 16:36:53,482 - Running wilcoxon test.
2025-10-22 16:36:53,565 - Running DGEs for NK.
2025-10-22 16:36:53,566 - Running wilcoxon test.
2025-10-22 16:36:53,677 - Running DGEs for T_cells.
2025-10-22 16:36:53,680 - Running wilcoxon test.
2025-10-22 16:36:54,161 - Running DGEs for pDC.
2025-10-22 16:36:54,163 - Running wilcoxon test.
2025-10-22 16:36:54,197 - Running edger
2025-10-22 16:36:54,197 - Generating Pseudo-bulk data


Pseudo-bulked groups: 100%|██████████| 10/10 [00:00<00:00, 33554.43it/s]


2025-10-22 16:36:57,684 - Removed 5806 genes for having less than 10 total counts
2025-10-22 16:36:57,699 - Run edgeR
2025-10-22 16:36:57,703 - Running DEA for B_cells


Reading AnnData in R
Running edgeR Test


2025-10-22 16:37:08,520 - Running DEA for Monocytes


Generating DGE Table to pass to Python
Reading AnnData in R


2025-10-22 16:37:18,426 - Test could not be computed for Monocytes due to [Errno 2] No such file or directory: '/tmp/EdgeR_Test_eb8e58c43d00401db58be2acaae8a194/dge_Monocytes_edgeR.csv'
2025-10-22 16:37:18,427 - Running DEA for NK


Running edgeR Test
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
Calls: fit_model ... model.matrix -> model.matrix.default -> contrasts<-
Execution halted
Reading AnnData in R
Running edgeR Test


2025-10-22 16:37:28,661 - Running DEA for T_cells


Generating DGE Table to pass to Python
Reading AnnData in R
Running edgeR Test


2025-10-22 16:37:38,776 - Generating consensus DataFrame


Generating DGE Table to pass to Python


In [7]:
df.head(10)

| | GeneName | wilcox_score | log2fc | pvals | padj | pts_group | pts_ref | group | annotation | log2fc_edger | stat_edger | pval_edger | padj_edger | sc_signicant | psc_signicant | consensus_significant | MeanExpr_disease | MeanExpr_healthy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JUND | 10.58798 | 2.16752 | 3.388231e-26 | 1.254797e-22 | 0.982558 | 0.941176 | disease | B_cells | 2.150242 | 266.374184 | 1.508208e-09 | 5.45423e-07 | Yes | Yes | Yes | 4.278455 | 2.823328 |
| 1 | CD83 | 10.266931 | 4.102563 | 9.931388e-25 | 3.064992e-21 | 0.895349 | 0.176471 | disease | B_cells | 3.985247 | 289.033574 | 9.5653e-10 | 3.818415e-07 | Yes | Yes | Yes | 2.662119 | 0.574226 |
| 2 | RPS4Y1 | 10.197039 | 32.221443 | 2.044093e-24 | 5.407211e-21 | 0.813953 | 0.0 | disease | B_cells | 9.754567 | 211.691989 | 1.603596e-08 | 3.189553e-06 | Yes | Yes | Yes | 1.793008 | 0.0 |
| 3 | FOS | 10.12042 | 4.805681 | 4.4849e-24 | 1.038086e-20 | 0.912791 | 0.308824 | disease | B_cells | 4.670979 | 681.658744 | 6.30214e-12 | 1.253496e-08 | Yes | Yes | Yes | 3.868105 | 0.98403 |
| 4 | HSP90AA1 | 9.908383 | 2.990328 | 3.827898e-23 | 7.875687e-20 | 0.959302 | 0.632353 | disease | B_cells | 3.064642 | 316.093902 | 5.622099e-10 | 3.6216e-07 | Yes | Yes | Yes | 3.364896 | 1.50736 |
| 5 | CREM | 9.665054 | 5.431766 | 4.243946e-22 | 7.144105e-19 | 0.802326 | 0.058824 | disease | B_cells | 5.363856 | 299.596476 | 7.086365e-10 | 3.6216e-07 | Yes | Yes | Yes | 2.367836 | 0.202228 |
| 6 | YPEL5 | 9.301365 | 3.094231 | 1.386545e-20 | 1.974973e-17 | 0.889535 | 0.323529 | disease | B_cells | 2.97651 | 147.04032 | 4.389385e-08 | 6.073501e-06 | Yes | Yes | Yes | 2.494147 | 0.833389 |
| 7 | SRGN | 9.120118 | 4.315721 | 7.504225e-20 | 8.684733e-17 | 0.802326 | 0.161765 | disease | B_cells | 4.067877 | 206.063003 | 6.64686e-09 | 1.652575e-06 | Yes | Yes | Yes | 2.626802 | 0.497278 |
| 8 | TUBB4B | 8.940316 | 3.670844 | 3.880572e-19 | 3.992031e-16 | 0.784884 | 0.147059 | disease | B_cells | 3.703473 | 165.803423 | 2.310259e-08 | 4.177369e-06 | Yes | Yes | Yes | 1.940737 | 0.384084 |
| 9 | RGS1 | 8.825177 | 7.311319 | 1.092864e-18 | 9.233199e-16 | 0.697674 | 0.029412 | disease | B_cells | 7.306141 | 567.203038 | 4.569613e-12 | 1.253496e-08 | Yes | Yes | Yes | 2.666053 | 0.080899 |
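
Conceptually, a consensus table like the one above can be obtained by joining the single-cell and pseudobulk results on gene and cell type, then flagging the genes that are significant in both. The sketch below illustrates this with pandas on hand-made values. The 0.05 cutoff, the toy numbers, the gene names and the column names `sc_significant` and `psb_significant` are all assumptions for illustration and may not match what `do.tl.rank_genes_consensus` does internally.

```python
import pandas as pd

# Toy single-cell results (values are made up for illustration).
sc_df = pd.DataFrame(
    {
        "GeneName": ["GENE_A", "GENE_B", "GENE_C"],
        "annotation": ["B_cells", "B_cells", "B_cells"],
        "padj": [1e-10, 0.03, 0.60],
    }
)
# Toy pseudobulk results for the same genes.
psb_df = pd.DataFrame(
    {
        "GeneName": ["GENE_A", "GENE_B", "GENE_C"],
        "annotation": ["B_cells", "B_cells", "B_cells"],
        "padj_edger": [1e-6, 0.40, 0.70],
    }
)

ALPHA = 0.05  # assumed significance cutoff

# Join both result tables on gene and cell type.
merged = sc_df.merge(psb_df, on=["GeneName", "annotation"], how="inner")

# Flag significance per approach, then require agreement of both.
merged["sc_significant"] = merged["padj"] < ALPHA
merged["psb_significant"] = merged["padj_edger"] < ALPHA
merged["consensus_significant"] = (
    merged["sc_significant"] & merged["psb_significant"]
)

consensus_genes = merged.loc[merged["consensus_significant"], "GeneName"].tolist()
print(consensus_genes)  # ['GENE_A']
```

Requiring agreement between the two approaches is conservative: `GENE_B` is significant at the single-cell level only and is therefore excluded from the consensus set.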


In [8]:
adata

AnnData object with n_obs × n_vars = 2783 × 18517
    obs: 'batch', 'condition', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'n_counts', 'doublet_class', 'doublet_score', 'leiden', 'autoAnnot', 'celltypist_conf_score', 'annotation', 'annotation_recluster'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'annotation_recluster_colors', 'hvg', 'leiden', 'log1p', 'neighbors', 'umap', 'rank_genes_pseudobulk', 'rank_genes_condition', 'rank_genes_consensus'
    obsm: 'X_CCA', 'X_pca', 'X_umap'
    layers: 'counts', 'logcounts'
    obsp: 'connectivities', 'distances'

As we can see, the DEA results are saved in the `uns` attribute, under the keys `rank_genes_pseudobulk`, `rank_genes_condition` and `rank_genes_consensus`.

In [9]:

session_info.show(na=False, cpu=True, excludes=["backports"], std_lib=True, dependencies=True, html=True)