# Kompot Tutorial: Differential Analysis with AnnData

This tutorial demonstrates how to use the Kompot package to perform differential abundance and differential gene expression analysis on single-cell RNA-seq data stored in AnnData format. 

## What is Kompot?

Kompot is a computational tool that provides a novel approach to differential analysis between conditions in single-cell data. It has several key features:

- Leverages **Mahalanobis distance** for sensitive detection of gene expression differences between conditions
- Utilizes **JAX** for efficient computation, enabling analysis of large datasets
- Works seamlessly with **AnnData** objects, making it compatible with the Scanpy ecosystem
- Performs both **differential abundance** and **differential gene expression** analyses
- Provides informative visualizations to interpret results

## When to Use Kompot

Kompot is particularly useful when:

1. You want to identify cell states that change in abundance between conditions
2. You need to detect genes with altered expression patterns across conditions
3. You want to account for the continuous nature of cell states rather than relying on discrete categories
4. Traditional differential methods miss subtle changes along cell trajectories

In this tutorial, we'll analyze how aging affects hematopoietic stem cells and their derivatives by comparing young and old samples.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Import necessary libraries
import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import palantir
import pandas as pd
import scanpy as sc
import seaborn as sns

import kompot

# Set plotting style
#sc.settings.set_figure_params(dpi=150, figsize=[5, 5])
plt.rcParams["axes.spines.right"] = False
plt.rcParams["axes.spines.top"] = False
plt.rcParams["image.cmap"] = "Spectral_r"



## Parameters and Configuration

First, let's define the parameters we'll use for the analysis. These parameters control how Kompot will process the data:

- **GROUPING_COLUMN**: The column in `adata.obs` that contains the condition labels (e.g., age, treatment, disease state)
- **CONDITIONS**: The specific condition values to compare - the first is used as reference
- **CELL_TYPE_COLUMN**: The column in `adata.obs` with cell type annotations
- **DIMENSIONALITY_REDUCTION**: Key in `adata.obsm` for cell state representation (e.g., PCA, diffusion maps)
- **LAYER_FOR_EXPRESSION**: Key in `adata.layers` for the expression values (`None` uses `adata.X`)

Modifying these values allows you to adapt the tutorial to your own data. When working with your own dataset, consider:

1. Which conditions you want to compare (e.g., control vs. treatment, young vs. old)
2. What representation of cell states to use (PCA, UMAP, diffusion components)
3. Which normalization layer to analyze for expression data (raw counts, normalized counts, etc.)

In [3]:
# Data path - replace with your own AnnData file path
# For reproducibility, you can download the example file from:
# https://zenodo.org/records/10153433
DATA_PATH = "../data/processed_filtered_HSPCandMature_withcorrection_withregression_postcelltype.h5ad"

# Analysis parameters
GROUPING_COLUMN = "Age"  # Column in adata.obs with condition labels
CONDITIONS = ["Young", "Old"]  # Conditions to compare (first is reference)
CELL_TYPE_COLUMN = "highres_celltype"  # Column in adata.obs with cell type annotations
DIMENSIONALITY_REDUCTION = (
    "DM_EigenVectors"  # Key in adata.obsm for cell state representation
)
LAYER_FOR_EXPRESSION = (
    "logged_counts"  # Layer in adata.layers for expression data (None uses adata.X)
)

In [4]:
adata = ad.read_h5ad(DATA_PATH)
adata

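Before running the analysis, it can help to confirm that the loaded object actually contains the keys named in the configuration. The following is a minimal sketch, not part of Kompot: the helper `find_config_problems` is a hypothetical name introduced here, and the toy table stands in for `adata.obs`.

```python
import pandas as pd


def find_config_problems(obs, obsm_keys, layer_keys, grouping_column,
                         conditions, cell_type_column, obsm_key, layer):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for column in (grouping_column, cell_type_column):
        if column not in obs.columns:
            problems.append(f"obs column not found: {column}")
    if grouping_column in obs.columns:
        present = set(obs[grouping_column].unique())
        for condition in conditions:
            if condition not in present:
                problems.append(
                    f"condition not found in {grouping_column}: {condition}"
                )
    if obsm_key not in obsm_keys:
        problems.append(f"obsm key not found: {obsm_key}")
    if layer is not None and layer not in layer_keys:
        problems.append(f"layer not found: {layer}")
    return problems


# Toy stand-in for adata.obs (hypothetical values, for illustration only)
toy_obs = pd.DataFrame({
    "Age": ["Young", "Young", "Old"],
    "highres_celltype": ["HSC", "B cell", "HSC"],
})
problems = find_config_problems(
    toy_obs,
    obsm_keys=["X_pca_harmony"],  # the cell state representation is deliberately absent
    layer_keys=["logged_counts"],
    grouping_column="Age",
    conditions=["Young", "Old"],
    cell_type_column="highres_celltype",
    obsm_key="DM_EigenVectors",
    layer="logged_counts",
)
print(problems)
```

With a real dataset, the same helper could be called with `adata.obs`, `adata.obsm.keys()`, and `adata.layers.keys()`.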

## Exploring the Data

Before running any analysis, it's important to understand the structure of your data. Let's visualize the dataset to examine:

1. **Cell Type Distribution**: How many cells of each type are present in our dataset?
2. **Condition Distribution**: How are our conditions (Young vs. Old) balanced?
3. **Cell Type by Condition**: Are certain cell types associated with particular conditions?

These exploratory analyses will help us interpret the differential results later and identify potential biases in the data.

In [None]:
sc.pl.umap(adata, color=CELL_TYPE_COLUMN)

In [None]:
cell_type_counts = adata.obs[CELL_TYPE_COLUMN].value_counts()
plt.figure(figsize=(10, 5))
sns.barplot(x=cell_type_counts.index, y=cell_type_counts.values)
plt.xticks(rotation=90)
plt.title("Cell Type Distribution")
plt.ylabel("Number of Cells")
plt.tight_layout()
plt.show()

In [None]:
sc.pl.umap(adata, color=GROUPING_COLUMN, title="Conditions")

In [None]:
condition_counts = adata.obs[GROUPING_COLUMN].value_counts()
plt.figure(figsize=(3, 3))
sns.barplot(x=condition_counts.index, y=condition_counts.values)
plt.title("Condition Distribution")
plt.ylabel("Number of Cells")
plt.tight_layout()
plt.show()

In [None]:
crosstab = (
    pd.crosstab(
        adata.obs[CELL_TYPE_COLUMN], adata.obs[GROUPING_COLUMN], normalize="index"
    )
    * 100
)

# Plot the distribution
ax = crosstab.plot(kind="bar", stacked=False, figsize=(12, 8))
ax.grid(False)
plt.xlabel("Cell Type")
plt.ylabel("Percentage (%)")
plt.title("Cell Type Distribution by Condition")
plt.legend(title=GROUPING_COLUMN)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Print extremes (cell types with high bias toward one condition)
bias_threshold = 75  # Percentage threshold for considering a cell type biased
biased_types = crosstab[(crosstab > bias_threshold).any(axis=1)]

if not biased_types.empty:
    print(f"Cell types with >{bias_threshold}% bias toward one condition:")
    print(biased_types)
    print("\nThese cell types might show disproportionate changes between conditions.")

## Preprocessing with Diffusion Maps

For Kompot to perform effectively, we need a good representation of cell state. Diffusion maps are an excellent choice because they:

1. Capture the intrinsic geometry of high-dimensional single-cell data
2. Preserve the continuous nature of cellular differentiation trajectories
3. Reduce technical noise while maintaining biological signal
4. Focus on global structure rather than local variations

We'll use Palantir's implementation of diffusion maps, which works directly with AnnData objects and provides a robust low-dimensional representation of cell states.

In [None]:
palantir.utils.run_diffusion_maps(adata, pca_key="X_pca_harmony", n_components=40);
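To build intuition for what a diffusion map computes, here is a heavily simplified sketch in plain NumPy. It is an assumption-laden illustration, not Palantir's implementation: it uses a dense, fixed-bandwidth Gaussian kernel, and the function name `toy_diffusion_map` and the `bandwidth` value are hypothetical.

```python
import numpy as np


def toy_diffusion_map(points, n_components=2, bandwidth=0.1):
    """Fixed-bandwidth diffusion map on a dense kernel (illustration only)."""
    sq_dists = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-sq_dists / (2.0 * bandwidth**2))
    degree = kernel.sum(axis=1)
    # D^-1/2 K D^-1/2 is symmetric and shares its eigenvalues with the
    # row-normalized Markov matrix D^-1 K.
    symmetric = kernel / np.sqrt(np.outer(degree, degree))
    eigvals, eigvecs = np.linalg.eigh(symmetric)
    order = np.argsort(eigvals)[::-1][: n_components + 1]
    eigvals = eigvals[order]
    # Rescale to right eigenvectors of the Markov matrix.
    components = eigvecs[:, order] / np.sqrt(degree)[:, None]
    # Drop the first (constant) component, whose eigenvalue is 1.
    return eigvals[1:], components[:, 1:]


# Toy "trajectory": 60 points on a straight line embedded in 3 dimensions
position = np.linspace(0.0, 1.0, 60)
points = np.outer(position, [1.0, 2.0, -1.0])

eigvals, components = toy_diffusion_map(points)
correlation = np.corrcoef(position, components[:, 0])[0, 1]
print("correlation of first component with position:", correlation)
```

On this toy trajectory the first diffusion component orders the points along the line, which is the property that makes such components a useful continuous cell state representation.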

## Running Differential Analysis with Kompot

Now we'll run the differential analysis using Kompot. We'll perform two types of analysis:

1. **Differential Abundance (DA)**: Identifies cell states that change in frequency between conditions
2. **Differential Expression (DE)**: Identifies genes that change in expression between conditions

### Understanding Kompot Parameters

For Differential Abundance:
- `groupby`: Column in adata.obs containing condition labels
- `condition1`/`condition2`: The conditions to compare (condition1 is the reference)
- `obsm_key`: The dimensionality reduction to use as cell state representation
- `log_fold_change_threshold`: The minimum log fold change to consider significant

For Differential Expression:
- `layer`: The expression data to use (normalized counts, log counts, etc.)
- `differential_abundance_key`: Optional key to use DA results for weighted fold changes
- `batch_size`: Can be set to process data in batches to reduce memory usage (0 means no batching)

### How Kompot Works

Kompot uses Gaussian processes to model the density of cells in each condition and then computes the log-fold change between these densities. For differential expression, it implements a statistical test based on the Mahalanobis distance, which is particularly sensitive to differences along continuous cell states and accounts for the covariance structure of the data.
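The idea of a log-fold change between two densities can be illustrated with a much simpler density estimator. The sketch below swaps in a fixed-bandwidth Gaussian kernel density estimate instead of the Gaussian-process model described above, so it shows the concept only; the function name, the bandwidth, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kde_log_density(train, query, bandwidth=0.5):
    """Log density of a fixed-bandwidth Gaussian KDE fitted on `train`."""
    n_train, n_dims = train.shape
    sq_dists = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -0.5 * sq_dists / bandwidth**2
    log_norm = np.log(n_train) + 0.5 * n_dims * np.log(2.0 * np.pi * bandwidth**2)
    # Log-sum-exp over training points for numerical stability
    peak = log_kernel.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(log_kernel - peak).sum(axis=1)) - log_norm


rng = np.random.default_rng(0)
young = rng.normal(loc=0.0, scale=1.0, size=(500, 2))  # reference condition
old = rng.normal(loc=1.5, scale=1.0, size=(500, 2))    # shifted cell states

# Two query states: one typical for "young", one typical for "old"
query = np.array([[-1.0, -1.0], [2.5, 2.5]])

# Comparison condition minus reference condition
log_fold_change = (
    gaussian_kde_log_density(old, query) - gaussian_kde_log_density(young, query)
)
print(log_fold_change)
```

The first query state yields a negative value (depleted in the comparison condition) and the second a positive value (enriched), matching the sign convention used in this tutorial.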

In [None]:
# First, let's compute differential abundance between conditions
da_results = kompot.compute_differential_abundance(
    adata,  # AnnData object
    groupby=GROUPING_COLUMN,  # Column with condition labels
    condition1=CONDITIONS[0],  # Reference condition
    condition2=CONDITIONS[1],  # Comparison condition
    obsm_key=DIMENSIONALITY_REDUCTION,  # Cell state representation
)

In [None]:
# Now, compute differential expression between the conditions
de_results = kompot.compute_differential_expression(
    adata,  # AnnData object
    groupby=GROUPING_COLUMN,  # Column with condition labels
    condition1=CONDITIONS[0],  # Reference condition
    condition2=CONDITIONS[1],  # Comparison condition
    layer=LAYER_FOR_EXPRESSION,  # Expression data layer
    obsm_key=DIMENSIONALITY_REDUCTION,  # Cell state representation
    differential_abundance_key="kompot_da",  # DA results for weighted log-fold change (optional)
    batch_size=0,  # set to, e.g., 100 to batch cells and genes for lower memory demand
)
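The memory benefit of batching comes from never materializing the full intermediate result at once. As a generic sketch (not Kompot's internal code; `batched_apply` is a hypothetical helper), processing rows in blocks gives the same answer as a single pass, with `batch_size=0` meaning no batching as in the call above:

```python
import numpy as np


def batched_apply(matrix, func, batch_size=0):
    """Apply func to row blocks of matrix; batch_size=0 processes all rows at once."""
    if batch_size <= 0:
        return func(matrix)
    blocks = [
        func(matrix[start:start + batch_size])
        for start in range(0, matrix.shape[0], batch_size)
    ]
    return np.concatenate(blocks, axis=0)


rng = np.random.default_rng(0)
cell_states = rng.normal(size=(1000, 10))  # toy cells x components
weights = rng.normal(size=(10, 3))         # toy linear map


def project(block):
    """Row-wise operation whose result does not depend on other rows."""
    return block @ weights


full = batched_apply(cell_states, project)
chunked = batched_apply(cell_states, project, batch_size=100)
print(np.allclose(full, chunked))
```

Smaller batches lower peak memory at the cost of more iterations, so the batch size is usually a trade-off between memory and speed.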

## Visualizing Differential Abundance Results

Now that we've computed differential abundance between conditions, let's visualize the results.

### What to Look For in the Visualizations

1. **Direction of Change**: "up" means a cell state is more abundant in the second condition (Old) compared to the reference (Young), while "down" means less abundant.

2. **Log Fold Change Scale**: 
   - Values near 0 indicate similar abundance across conditions
   - Positive values show enrichment in the second condition (Old)  
   - Negative values show depletion in the second condition (Old)

3. **Spatial Patterns**: Look for regions or clusters in the UMAP showing consistent changes - these represent biologically meaningful shifts in cell populations.

4. **Cell Type Correlations**: Check if specific cell types are consistently changing in abundance, which can reveal lineage-specific effects.

The volcano plot displays statistical significance versus effect size, with points above the horizontal line and outside the vertical lines indicating significant changes. Points are colored by cell type to facilitate interpretation of which cell populations are changing.
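The direction labels can be thought of as a thresholded version of the log-fold change. The sketch below is a simplified assumption: it uses only an effect-size threshold (compare the `log_fold_change_threshold` parameter), whereas the actual rule may include further criteria. The `"neutral"` label and the threshold value are hypothetical.

```python
import numpy as np


def label_direction(log_fold_change, threshold=1.0):
    """Label each value 'up', 'down', or 'neutral' by a symmetric threshold."""
    lfc = np.asarray(log_fold_change, dtype=float)
    return np.where(
        lfc >= threshold,
        "up",
        np.where(lfc <= -threshold, "down", "neutral"),
    )


labels = label_direction([-2.3, -0.4, 0.0, 0.7, 1.8])
print(labels.tolist())  # ['down', 'neutral', 'neutral', 'neutral', 'up']
```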

In [None]:
sc.pl.embedding(
    adata,
    "umap",
    color=[
        "kompot_da_log_fold_change_direction_Young_vs_Old",
        "kompot_da_log_fold_change_Young_vs_Old",
    ],
    title=["Abundance Changes From Young to Old", "Log-Fold Changes From Young to Old"],
    color_map="RdBu_r",
    vcenter=0,
)

We can use the Kompot wrapper of `scanpy.pl.embedding` to color and subset by different criteria. Here we subset to cell states that are differentially abundant by passing the criterion `{"kompot_da_log_fold_change_direction_Young_vs_Old":["up", "down"]}` to the `groups` parameter.

In [None]:
kompot.plot.embedding(
    adata,
    "umap",
    color=[
        "kompot_da_log_fold_change_direction_Young_vs_Old",
        "kompot_da_log_fold_change_Young_vs_Old",
        CELL_TYPE_COLUMN
    ],
    title=["Abundance Changes From Young to Old", "Log-Fold Changes From Young to Old", CELL_TYPE_COLUMN],
    color_map="RdBu_r",
    vcenter=0,
    groups={"kompot_da_log_fold_change_direction_Young_vs_Old":["up", "down"]},
)

In [None]:
kompot.plot.volcano_da(adata, color=CELL_TYPE_COLUMN)

In [None]:
kompot.plot.direction_barplot(adata, category_column=CELL_TYPE_COLUMN)
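The numbers behind such a bar plot can also be tabulated directly with pandas. This is a sketch on a toy table standing in for `adata.obs`; the cell assignments are made up for illustration, and the `"neutral"` label is an assumption about how unchanged states are named.

```python
import pandas as pd

direction_column = "kompot_da_log_fold_change_direction_Young_vs_Old"

# Toy stand-in for adata.obs (hypothetical values)
toy_obs = pd.DataFrame({
    "highres_celltype": [
        "HSC", "HSC", "HSC", "B cell", "B cell", "CD8 T", "CD8 T", "CD8 T",
    ],
    direction_column: [
        "up", "up", "neutral", "neutral", "neutral", "down", "down", "neutral",
    ],
})

# Percentage of cells per direction within each cell type
direction_share = (
    pd.crosstab(
        toy_obs["highres_celltype"], toy_obs[direction_column], normalize="index"
    )
    * 100
)
print(direction_share.round(1))
```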

## Visualizing Differential Expression Results

Now let's examine the differential expression results to identify genes that change between conditions.

### Understanding the Results

The Kompot differential expression analysis produces several key metrics:

1. **Weighted Log Fold Change**: The average log fold change between conditions, weighted by cell density
2. **Mean Log Fold Change**: The simple average log fold change across all cells
3. **Mahalanobis Distance**: A statistical measure that accounts for variance and covariance in the data
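The difference between the mean and the weighted log fold change can be seen in a small numeric example. The per-cell weights below are hypothetical; how Kompot derives its weights from the differential abundance results is not shown in this tutorial.

```python
import numpy as np

# Per-cell log fold changes of one gene (toy values)
gene_lfc = np.array([0.1, 0.2, 2.0, 2.2])

# Hypothetical per-cell weights, e.g. larger where cell density is higher
cell_weights = np.array([1.0, 1.0, 4.0, 4.0])

mean_lfc = gene_lfc.mean()
weighted_lfc = np.average(gene_lfc, weights=cell_weights)

print(f"mean: {mean_lfc:.3f}, weighted: {weighted_lfc:.3f}")
# mean: 1.125, weighted: 1.710
```

Because the strongly changing cells carry more weight here, the weighted value is pulled toward their fold change.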

The Mahalanobis distance is particularly powerful because it:
- Accounts for the covariance structure of the data
- Is sensitive to changes in gene expression patterns
- Provides a robust measure of significance even for genes with complex expression patterns
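For a difference vector $\delta$ and covariance matrix $\Sigma$, the Mahalanobis distance is $\sqrt{\delta^\top \Sigma^{-1} \delta}$. The sketch below shows why accounting for covariance matters; the covariance matrix is a made-up example, and how Kompot constructs its covariance is not covered here.

```python
import numpy as np


def mahalanobis_distance(difference, covariance):
    """Compute sqrt(d^T Sigma^-1 d) without forming the explicit inverse."""
    difference = np.asarray(difference, dtype=float)
    solved = np.linalg.solve(covariance, difference)
    return float(np.sqrt(difference @ solved))


# Two strongly correlated dimensions (hypothetical covariance)
covariance = np.array([[1.0, 0.9],
                       [0.9, 1.0]])

# Same Euclidean length, different directions
along = mahalanobis_distance([1.0, 1.0], covariance)     # follows the correlation
against = mahalanobis_distance([1.0, -1.0], covariance)  # opposes the correlation

print(f"along: {along:.3f}, against: {against:.3f}")
# along: 1.026, against: 4.472
```

Both difference vectors have the same Euclidean norm, but the one that runs against the expected correlation is far more surprising and receives a much larger distance.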

### Volcano Plots

In a volcano plot:
- The x-axis shows the log fold change (effect size)
- The y-axis shows the Mahalanobis distance (statistical significance)
- Points in the upper right indicate genes upregulated in the second condition (Old)
- Points in the upper left indicate genes downregulated in the second condition (Old)

Let's first look at the top differentially expressed genes:

In [None]:
kompot.plot.volcano_de(adata, n_top_genes=20)

In [None]:
adata.var[
    ["kompot_de_weighted_lfc_Young_vs_Old", "kompot_de_mahalanobis_Young_vs_Old"]
].sort_values("kompot_de_mahalanobis_Young_vs_Old", ascending=False).head(20)
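The same columns can be used to split candidate genes by direction. The following sketch works on a toy table standing in for `adata.var`; the gene names, values, and both cutoffs are hypothetical and should be chosen for your own data.

```python
import pandas as pd

lfc_column = "kompot_de_weighted_lfc_Young_vs_Old"
distance_column = "kompot_de_mahalanobis_Young_vs_Old"

# Toy stand-in for adata.var (made-up values)
toy_var = pd.DataFrame(
    {
        lfc_column: [1.8, -0.1, -1.2, 0.3],
        distance_column: [9.5, 8.0, 6.1, 0.7],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)

min_abs_lfc = 1.0    # hypothetical effect-size cutoff
min_distance = 5.0   # hypothetical Mahalanobis cutoff

selected = toy_var[
    (toy_var[lfc_column].abs() >= min_abs_lfc)
    & (toy_var[distance_column] >= min_distance)
]
up_in_old = selected.index[selected[lfc_column] > 0].tolist()
down_in_old = selected.index[selected[lfc_column] < 0].tolist()

print("up in Old:", up_in_old)      # ['gene_a']
print("down in Old:", down_in_old)  # ['gene_c']
```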

### Expression Plots

This plot helps to inspect how the expression was imputed for each condition and to visualize the fold change. Note that all these results are stored as layers in the AnnData object and can also be plotted individually, e.g., with
```Python
scanpy.pl.embedding(adata, basis="umap", color="Igkc", layer="logged_counts")
scanpy.pl.embedding(adata, basis="umap", color="Igkc", layer="kompot_de_imputed_Young")
scanpy.pl.embedding(adata, basis="umap", color="Igkc", layer="kompot_de_imputed_Old")
scanpy.pl.embedding(adata, basis="umap", color="Igkc", layer="kompot_de_fold_change_Young_vs_Old")
```
Or, for convenience, use the Kompot plotting function, which uses scanpy internally.

In [None]:
kompot.plot.plot_gene_expression(adata, gene="Igkc", vmin="p2", vmax="p98")

In [None]:
# Let's examine one key gene in more detail
# H2-Q7 is a major histocompatibility complex (MHC) class I gene
# that showed the strongest differential expression
kompot.plot.plot_gene_expression(adata, gene="H2-Q7", vmax="p98")

### Heatmaps

To visualize expression changes of top genes across multiple cell types, you can use the split heatmap function. This plot is largely independent of the Kompot results: it displays the average **unimputed expression** per group and condition and uses the Mahalanobis distance only to choose the top `n` genes. If you specify the `layer`, `genes`, `condition_column`, `condition1`, and `condition2` parameters, the function can be used without requiring any Kompot results.

In this example, we exclude plasma cells, as they are not represented in the `Young` age group.

In [None]:
kompot.plot.heatmap(
    adata,
    n_top_genes=20, # show only the top most differentially expressed genes
    groupby=CELL_TYPE_COLUMN, # the x-axis of the heatmap
    exclude_groups="Plasma cell", # excluded from the x-axis
    vmin="p1", # clip values at 1st percentile from below for better contrast
    vmax="p99", # clip values at 99th percentile from above for better contrast
)
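The quantity shown in each heatmap tile, the average expression per group and condition, can be reproduced with a pandas group-by. This is a sketch on a toy table with made-up expression values, not the plotting function's actual code.

```python
import pandas as pd

# Toy per-cell table: annotations plus the expression of one gene
toy_cells = pd.DataFrame({
    "highres_celltype": ["HSC", "HSC", "HSC", "HSC", "B cell", "B cell", "B cell"],
    "Age": ["Young", "Young", "Old", "Old", "Young", "Old", "Old"],
    "gene_a": [1.0, 3.0, 4.0, 6.0, 0.5, 1.0, 2.0],
})

# Rows: cell types; columns: conditions; values: mean expression
mean_expression = (
    toy_cells.groupby(["highres_celltype", "Age"])["gene_a"].mean().unstack("Age")
)
print(mean_expression)
```

A cell type that is missing from one condition would produce a `NaN` entry in this table, which is one reason to exclude such groups from the plot.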

You can customize the heatmap visualization by adjusting parameters. For example, to display raw expression values without z-scoring, set `standard_scale=None`:

In [None]:
kompot.plot.heatmap(
    adata,
    n_top_genes=20,
    groupby=CELL_TYPE_COLUMN,
    exclude_groups="Plasma cell",
    standard_scale=None, # disable z-scoring
    vmax="p99", # clip values at 99th percentile from above for better contrast
)
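Z-scoring and percentile clipping are both simple array operations. The sketch below shows them in NumPy under stated assumptions: z-scoring is done per gene (row), and `"p1"`/`"p99"` are read as the 1st and 99th percentiles of the displayed values. The exact scaling axis and order of operations inside the heatmap function may differ.

```python
import numpy as np

# Toy matrix: rows are genes, columns are group/condition means
values = np.array([
    [0.0, 1.0, 2.0, 3.0, 100.0],    # one extreme value dominates the color scale
    [10.0, 11.0, 12.0, 13.0, 14.0],
])

# Per-gene z-score: zero mean and unit standard deviation in every row
z_scored = (values - values.mean(axis=1, keepdims=True)) / values.std(
    axis=1, keepdims=True
)

# Percentile clipping: limit the color range to the 1st..99th percentile
low, high = np.percentile(values, [1, 99])
clipped = np.clip(values, low, high)

print("clip range:", low, high)
```

Z-scoring makes genes with different baseline levels comparable, while clipping keeps a few extreme values from washing out the contrast among the rest.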

## Biological Interpretation of Results

Now that we've identified differential abundance and expression patterns, let's interpret what these changes mean biologically.

### Key Findings

1. **Cell Type Changes with Age**:
   - HSCs (hematopoietic stem cells) show significantly increased abundance in Old vs. Young mice
   - Naive CD8 T cells are predominantly found in Young mice
   - These findings align with previous studies showing HSC expansion but functional decline with age

2. **Gene Expression Changes**:
   - MHC genes, both class I (H2-Q7, H2-Q6) and class II-associated (Cd74, H2-Aa, H2-Ab1), are upregulated in Old mice, suggesting increased antigen presentation and immune activation
   - Inflammatory markers (S100a8, S100a9) are higher in Young mice
   - Interferon-stimulated genes (Ifitm family) show age-related changes, indicating altered immune response

### Functional Implications

These findings suggest that aging:
- Alters the composition of the hematopoietic system
- Changes immune surveillance and response mechanisms
- May affect stem cell functionality despite increased numbers
- May contribute to chronic low-grade inflammation (inflammaging)

### Next Steps

To further validate and extend these findings, consider:
1. Functional assays to test HSC capacity in different age groups
2. Flow cytometry validation of key markers
3. Pathway analysis of the differentially expressed genes
4. Single-cell trajectory analysis to understand developmental changes
5. Integration with epigenomic data to understand regulatory mechanisms

## Conclusion

Kompot has allowed us to identify both cell state abundance changes and gene expression differences between Young and Old mice. The Mahalanobis distance-based approach provides sensitive detection of changes along continuous cell states, capturing biological signals that might be missed by traditional discrete methods.

For more details on Kompot parameters, methods, and applications, refer to the [complete documentation](https://kompot.readthedocs.io).