# Negative binomial models with dispersion estimates

`delnx` implements a negative binomial model, together with size factor and dispersion estimation, for differential expression analysis. The implementation is heavily inspired by [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html) / [PyDESeq2](https://pydeseq2.readthedocs.io/en/stable/) and [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html). It is not an exact reimplementation of these methods, but it achieves very similar results and runs considerably faster (especially on GPUs) thanks to [JAX](https://jax.readthedocs.io/en/latest/). Here's a quick example of the basic workflow:

In [None]:
import delnx as dx
import scanpy as sc

# Load example data
adata = sc.read_h5ad("data/GLI3_KO_45d_pseudobulk.h5ad")

# Use DESeq2-style median-of-ratios to compute size factors
dx.pp.size_factors(adata, method="ratio")

# Estimate dispersion using DESeq2-inspired shrinkage
dx.pp.dispersion(adata, size_factor_key="size_factor", method="deseq2")

print(adata)

AnnData object with n_obs × n_vars = 28 × 16199
    obs: 'psbulk_replicate', 'cell_type', 'organoid', 'GLI3_KO', 'psbulk_cells', 'psbulk_counts', 'size_factor'
    var: 'dispersion'
    layers: 'psbulk_props'


Size factors for each observation (here, a pseudobulk sample) and dispersion estimates for each gene are now stored in `adata.obs['size_factor']` and `adata.var['dispersion']`, respectively. We can use these to perform differential expression analysis with a negative binomial model.
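To make the size factor step more concrete, here is a minimal NumPy sketch of the DESeq2-style median-of-ratios idea. It is not the `delnx` implementation: the function name `median_of_ratios` and the toy count matrix are assumptions for illustration only, and `delnx` may handle zeros or other edge cases differently.

```python
import numpy as np


def median_of_ratios(counts):
    """Hypothetical helper: size factors for a (n_samples, n_genes) count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with nonzero counts in every sample, so all logs are finite
    expressed = np.all(counts > 0, axis=0)
    log_counts = np.log(counts[:, expressed])
    # Per-gene log geometric mean across samples serves as the reference
    log_geo_mean = log_counts.mean(axis=0)
    # Size factor of a sample: median over genes of its ratio to the reference
    return np.exp(np.median(log_counts - log_geo_mean, axis=1))


# Toy data (made up): three samples that differ only by sequencing depth
base = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
counts = np.vstack([base, 2 * base, 4 * base])

size_factors = median_of_ratios(counts)
print(size_factors)
```

In this toy example the samples differ only by a constant scaling, so dividing each sample's counts by its size factor puts all three on the same scale.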

In [None]:
# Run differential expression analysis with negative binomial model
de_results = dx.tl.de(
    adata,
    method="negbinom",  # Use negative binomial model for DE analysis
    condition_key="GLI3_KO",  # Key for condition variable
    group_key="cell_type",  # Key for grouping variable
    size_factor_key="size_factor",  # Key for size factors
    dispersion_key="dispersion",  # Key for dispersion estimates
)

print(de_results)

Inferred data type: counts
14462 features passed log2fc threshold of 0.0
Inferred data type: counts
14906 features passed log2fc threshold of 0.0
Inferred data type: counts
15320 features passed log2fc threshold of 0.0
Inferred data type: counts
15443 features passed log2fc threshold of 0.0

      feature  test_condition  ref_condition     group     log2fc     auroc  \
0      TGFBR2           False           True     ge_in  10.000000  0.666667   
1         CCK           False           True     ge_in -10.000000  0.000000   
2         OTP           False           True     ge_in -10.000000  0.000000   
3      NKX2-1           False           True    ge_npc -10.000000  0.000000   
4        LHX8           False           True    ge_npc -10.000000  0.000000   
...       ...             ...            ...       ...        ...       ...   
60126  GPRC5A           False           True    ge_npc  10.000000  0.666667   
60127   HOXC8           False           True  mesen_ex  10.000000  0.666667   
60128  ADRA1A           False           True     ge_in  10.000000  0.833333   
60129   APOC3           False           True     ge_in  10.000000  0.666667   
60130   PGBD1           False           True  mesen_ex  -0.000167  0.583333   

            coef          pval          padj  
0   
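The result printed above is a table with one row per feature and group, so a common next step is to filter it for significant hits. The sketch below uses a small stand-in DataFrame with a subset of the columns shown in the output (`feature`, `group`, `log2fc`, `padj`). All values, gene names, and the thresholds (`padj < 0.05`, `|log2fc| > 1`) are made up for illustration and are not taken from the analysis.

```python
import pandas as pd

# Toy stand-in for `de_results`; every value here is invented for illustration
toy_results = pd.DataFrame(
    {
        "feature": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
        "group": ["g1", "g1", "g2", "g2"],
        "log2fc": [2.5, -0.2, -1.8, 0.9],
        "padj": [0.001, 0.60, 0.02, 0.20],
    }
)

# Keep features that pass both an adjusted p-value and an effect size cutoff
# (both cutoffs are assumptions; pick values appropriate for your analysis)
significant = (
    toy_results[(toy_results["padj"] < 0.05) & (toy_results["log2fc"].abs() > 1.0)]
    .sort_values("padj")
    .reset_index(drop=True)
)

# Count significant features per group
n_per_group = significant.groupby("group").size()

print(significant)
print(n_per_group)
```

The same filtering pattern should apply to the real `de_results` table, assuming it is a pandas DataFrame with these column names, as the printed output suggests.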


