# Overview

`delnx` is a Python package for differential expression (DE) analysis of single-cell RNA-seq data. For the most part, it's just one function, {func}`~delnx.tl.de`, that provides a unified interface to several ways of performing differential expression analysis with (generalized) linear models. These include established packages such as [statsmodels](https://www.statsmodels.org/stable/index.html) and [PyDESeq2](https://pydeseq2.readthedocs.io/en/stable/), but we also implemented our own linear models in [JAX](https://jax.readthedocs.io/en/latest/) to enable fast DE testing on GPUs. To get you started, here's a basic example of how to use `delnx` for differential expression analysis:

In [12]:
import delnx as dx
import scanpy as sc


# Load example data
adata = sc.read_h5ad("data/GLI3_KO_45d.h5ad")
adata.layers["counts"] = adata.X.copy()  # Store raw counts in a separate layer

# Some basic preprocessing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

print(adata)

AnnData object with n_obs × n_vars = 22410 × 18653
    obs: 'organoid', 'GLI3_KO', 'cell_type'
    uns: 'log1p'
    obsm: 'pca', 'umap'
    layers: 'counts'


In [None]:
# Run differential expression analysis between conditions (here knockout vs. control)
de_results = dx.tl.de(
    adata,
    method="lr",  # DE method to use: "lr" for logistic regression
    condition_key="GLI3_KO",  # Condition key for DE analysis
)

print(de_results)

Inferred data type: lognorm
16199 features passed log2fc threshold of 0.0


100%|██████████| 8/8 [00:19<00:00,  2.38s/it]


       feature  test_condition  ref_condition     log2fc     auroc       coef  \
0          EN1           False           True -10.000000  0.489666 -16.067822   
1       NKX2-1           False           True  -9.768157  0.406106  -5.291508   
2        SFTA3           False           True  -8.790968  0.427252  -4.085033   
3         LHX8           False           True  -9.115255  0.456056  -3.926080   
4       NKX6-2           False           True  -5.469905  0.490169  -3.553129   
...        ...             ...            ...        ...       ...        ...   
16194    EPHA1           False           True   0.053959  0.500040   0.152943   
16195   AMDHD1           False           True   0.209194  0.500001   0.180583   
16196  PABPC4L           False           True   0.260047  0.500040   0.289223   
16197   ANXA2R           False           True   0.942715  0.499981   0.342610   
16198    NRSN1           False           True  -0.129153  0.501732   1.043447   

               pval        

What you get back from {func}`~delnx.tl.de` is a {class}`~pandas.DataFrame` with the results of the differential expression analysis. The rows are genes and the columns are the results of the DE testing, such as p-values, log fold changes, etc. 
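
Because the result is an ordinary DataFrame, it can be filtered and sorted with regular pandas operations. The sketch below uses a small toy table as a stand-in for the real output: the column names `feature`, `log2fc` and `pval` follow the printed result above, while the gene names, values and thresholds are made up for illustration. The printed output above is truncated, so the real result may contain further columns that are not shown here.

```python
import pandas as pd

# Toy stand-in for a DE result table (made-up values, not real results).
de_results = pd.DataFrame(
    {
        "feature": ["geneA", "geneB", "geneC", "geneD"],
        "log2fc": [-3.2, 0.1, 1.8, -0.4],
        "pval": [1e-8, 0.6, 2e-4, 0.03],
    }
)

# Keep features with a small p-value and a large absolute log2 fold change.
# Both cutoffs are arbitrary example values.
is_hit = (de_results["pval"] < 0.01) & (de_results["log2fc"].abs() > 1.0)
significant = de_results[is_hit].sort_values("pval")

print(significant)
```

Note that this filters on raw p-values; with thousands of features tested, a multiple-testing correction is usually appropriate before calling hits.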

## Grouping
Often, we want to split the dataset into groups, to perform testing e.g. within each cell type. `delnx` allows you to do this with the `group_key` argument, pointing to the column in the `adata.obs` to group by.

In [None]:
# Run differential expression analysis within groups
de_results = dx.tl.de(
    adata,
    method="lr",  # DE method to use: "lr" for logistic regression
    condition_key="GLI3_KO",  # Condition key for DE analysis
    group_key="cell_type",  # Group by cell type
)

print(de_results)

Inferred data type: lognorm
15443 features passed log2fc threshold of 0.0


100%|██████████| 8/8 [00:03<00:00,  2.02it/s]


Inferred data type: lognorm
15320 features passed log2fc threshold of 0.0


100%|██████████| 8/8 [00:02<00:00,  3.52it/s]


Inferred data type: lognorm
14080 features passed log2fc threshold of 0.0


100%|██████████| 7/7 [00:00<00:00,  8.46it/s]


Inferred data type: lognorm
14462 features passed log2fc threshold of 0.0


100%|██████████| 8/8 [00:00<00:00, 10.44it/s]


Inferred data type: lognorm
14906 features passed log2fc threshold of 0.0


100%|██████████| 8/8 [00:00<00:00, 10.99it/s]


Inferred data type: lognorm
12147 features passed log2fc threshold of 0.0


100%|██████████| 6/6 [00:00<00:00, 16.85it/s]


Inferred data type: lognorm
13129 features passed log2fc threshold of 0.0


100%|██████████| 7/7 [00:00<00:00, 13.64it/s]

      feature  test_condition  ref_condition      group     log2fc     auroc  \
0      NKX2-1           False           True     ge_npc -10.000000  0.296055   
1         EN1           False           True   mesen_ex -10.000000  0.463525   
2      NKX2-1           False           True  mesen_npc  -8.316442  0.469342   
3       SFTA3           False           True     ge_npc  -9.799723  0.302916   
4      SHISA3           False           True  mesen_npc  -6.515883  0.460307   
...       ...             ...            ...        ...        ...       ...   
99482   SPRY2           False           True    ctx_npc  -1.618940  0.495519   
99483   FBXL4           False           True    ctx_npc  -1.224567  0.493761   
99484    GANC           False           True     ctx_ex  -2.922253  0.477551   
99485   VPS50           False           True    ctx_npc  -0.221592  0.500210   
99486   AHDC1           False           True    ctx_npc  -0.200196  0.496689   

            coef          pval         




Now the result has one additional column, `group`, which indicates the group the gene was tested in. 
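
With the `group` column in place, per-group summaries are a pandas `groupby` away. The following sketch again uses a toy table (made-up gene names and values, with group labels borrowed from the output above) to count hits per group and pick the strongest hit in each group; the 0.01 cutoff is an arbitrary example value.

```python
import pandas as pd

# Toy stand-in for a grouped DE result table (made-up values).
de_results = pd.DataFrame(
    {
        "feature": ["geneA", "geneB", "geneA", "geneC", "geneB"],
        "group": ["ge_npc", "ge_npc", "ctx_npc", "ctx_npc", "ctx_ex"],
        "pval": [1e-6, 0.2, 0.004, 1e-3, 0.5],
    }
)

# Number of features below the p-value cutoff in each group.
n_sig = de_results[de_results["pval"] < 0.01].groupby("group").size()

# Feature with the smallest p-value in each group.
top_per_group = de_results.sort_values("pval").groupby("group").head(1)

print(n_sig)
print(top_per_group)
```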

## Pseudo-bulking
It is often advisable not to test at the single-cell level but to first aggregate the data to a (pseudo-)bulk level, which better accounts for variation between actual biological replicates. `delnx` provides a thin wrapper around [decoupler](https://decoupler.readthedocs.io/en/latest)'s pseudobulking function to do this:

In [15]:
adata_pb = dx.pp.pseudobulk(
    adata,
    sample_key="organoid",  # Sample key for pseudobulk aggregation (the biological replicate)
    group_key="cell_type",  # Group key for pseudobulk aggregation
    n_pseudoreps=2,  # Optionally, the data can be split into multiple pseudoreplicates. This can be useful if the number of actual biological replicates is low.
    layer="counts",  # Layer to use for pseudobulk aggregation, e.g. "counts" or None for adata.X
    mode="sum",
)

print(adata_pb)

AnnData object with n_obs × n_vars = 81 × 16199
    obs: 'psbulk_replicate', 'cell_type', 'organoid', 'GLI3_KO', 'psbulk_cells', 'psbulk_counts'
    layers: 'psbulk_props'
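
Conceptually, sum-mode pseudobulking adds up the raw counts of all cells that share the same sample and group. The pandas sketch below illustrates that idea on a toy count matrix. It is not how `delnx` or decoupler implement it: the data are made up, and pseudoreplicate splitting and any filtering are left out.

```python
import numpy as np
import pandas as pd

# Toy raw counts: 6 cells x 3 genes (made-up values).
genes = ["g1", "g2", "g3"]
counts = pd.DataFrame(
    np.array(
        [
            [1, 0, 3],
            [2, 1, 0],
            [0, 4, 1],
            [5, 0, 2],
            [1, 1, 1],
            [0, 2, 6],
        ]
    ),
    columns=genes,
)

# Per-cell annotations: biological replicate and cell type.
obs = pd.DataFrame(
    {
        "organoid": ["A", "A", "A", "B", "B", "B"],
        "cell_type": ["npc", "npc", "ex", "npc", "ex", "ex"],
    }
)

# Sum counts over all cells sharing the same (organoid, cell_type) pair.
cells = pd.concat([obs, counts], axis=1)
pseudobulk = cells.groupby(["organoid", "cell_type"])[genes].sum()

# Number of cells that went into each pseudobulk profile.
n_cells = cells.groupby(["organoid", "cell_type"]).size()

print(pseudobulk)
print(n_cells)
```

Each row of `pseudobulk` then plays the role of one bulk sample, which is the kind of input count-based methods expect.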


## Available Methods
`delnx` provides several methods for differential expression analysis, which you select with the `method` argument. The default is `"lr"` (logistic regression). Here's an overview of the available methods:

- `"lr"`: Constructs a logistic regression model predicting group membership based on each feature individually and compares this to a null model with a likelihood ratio test. Recommended for log-normalized single-cell data.

- `"deseq2"`: DESeq2 method (through [PyDESeq2](https://pydeseq2.readthedocs.io/en/stable/)) based on a negative binomial model. Recommended for (pseudo-)bulk RNA-seq count data.

- `"negbinom"`: Wald test based on a negative binomial regression model. Conceptually very similar to DESeq2 but implemented with [JAX](https://jax.readthedocs.io/en/latest/) (or [statsmodels](https://www.statsmodels.org/stable/index.html)). Recommended for single-cell and bulk RNA-seq count data. See {doc}`nb` for more details.

- `"anova"`: ANOVA based on linear model. Recommended for log-normalized or scaled single-cell data.

- `"anova_residual"`: Linear model with residual F-test. Recommended for log-normalized or scaled single-cell data.

- `"binomial"`: Likelihood ratio test based on a binomial regression model. Recommended for binary data such as single-cell and bulk ATAC-seq.
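
To make the `"lr"` description concrete, here is a minimal NumPy sketch of a likelihood ratio test for a single feature: a logistic regression predicting the condition from the feature is compared with an intercept-only null model. This is not `delnx`'s implementation. The helper names `fit_logistic` and `lr_test`, the Newton solver and the simulated data are all assumptions made for illustration.

```python
import math

import numpy as np


def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Fit a logistic regression by Newton's method.

    Returns the coefficients and the maximized log-likelihood.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ beta))), 1e-12, 1.0 - 1e-12)
    loglik = float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
    return beta, loglik


def lr_test(expr, condition):
    """Likelihood ratio test of one feature against an intercept-only model."""
    n = len(condition)
    X_null = np.ones((n, 1))
    X_full = np.column_stack([np.ones(n), expr])
    _, ll_null = fit_logistic(X_null, condition)
    beta, ll_full = fit_logistic(X_full, condition)
    stat = max(2.0 * (ll_full - ll_null), 0.0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    pval = math.erfc(math.sqrt(stat / 2.0))
    return beta[1], stat, pval


# Simulated data: 100 control and 100 test cells.
rng = np.random.default_rng(0)
condition = np.repeat([0.0, 1.0], 100)
gene_de = rng.normal(loc=1.5 * condition, scale=1.0)  # shifted in the test group
gene_null = rng.normal(size=200)  # no difference between groups

coef_de, stat_de, p_de = lr_test(gene_de, condition)
coef_null, stat_null, p_null = lr_test(gene_null, condition)
print(f"shifted gene: coef={coef_de:.2f}, p={p_de:.2e}")
print(f"null gene:    coef={coef_null:.2f}, p={p_null:.2f}")
```

Running one such test per feature is what makes batched, vectorized execution attractive for datasets with thousands of genes.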

In addition to different methods, `delnx` also provides several backends for executing these methods. The default is `"jax"`, which relies on regression models and statistical tests implemented in [JAX](https://jax.readthedocs.io/en/latest/). This is usually the fastest option, especially on GPUs. However, most methods are also available with [`"statsmodels"`](https://www.statsmodels.org/stable/index.html) as the backend. The exception is the `"deseq2"` method, which simply calls [PyDESeq2](https://pydeseq2.readthedocs.io/en/stable/).