# Kompot Applications

Kompot is a Python package for differential abundance and differential expression analysis of biological data. It is built on top of the Mellon library, which provides Gaussian Process (GP) models for density and function estimation.

This notebook demonstrates how to use Kompot for various applications in computational biology, particularly for single-cell RNA-seq data analysis.

In [None]:
# Import dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import umap
from sklearn.datasets import make_swiss_roll
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Import Mellon for direct access to density and function estimators
import mellon

# Import Kompot
import sys
import os
sys.path.insert(0, os.path.abspath('..'))  # Add the parent directory (the Kompot checkout) to the import path
import kompot

# Configure plots
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 8]
plt.rcParams['figure.dpi'] = 100
sns.set_style('whitegrid')

## 1. Synthetic Data Analysis

Let's start by generating synthetic data to demonstrate Kompot's capabilities. We'll create:

1. Two conditions with different cell state distributions
2. Gene expression patterns that vary between these conditions
3. Specific differentially expressed genes

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

def generate_synthetic_data(n_cells=1000, n_genes=100, n_diff_genes=20):
    """Generate synthetic data with two conditions and differentially expressed genes."""
    # Create Swiss roll data for condition 1
    X1, _ = make_swiss_roll(n_samples=n_cells, noise=0.1)
    X1 = X1[:, [0, 2]]  # Take only 2 dimensions for simplicity
    
    # Create Swiss roll data for condition 2, slightly shifted and scaled
    X2, _ = make_swiss_roll(n_samples=n_cells, noise=0.15)
    X2 = X2[:, [0, 2]]
    X2 = X2 * 1.2 + np.array([2.0, 0.0])  # Scale and shift
    
    # Generate gene expression data
    # For condition 1
    y1 = np.zeros((n_cells, n_genes))
    for g in range(n_genes):
        # Each gene expression is a function of cell state coordinates
        if g % 4 == 0:  # Different patterns based on gene index
            y1[:, g] = np.sin(X1[:, 0] * 0.5) + 0.1 * np.random.randn(n_cells)
        elif g % 4 == 1:
            y1[:, g] = np.cos(X1[:, 1] * 0.5) + 0.1 * np.random.randn(n_cells)
        elif g % 4 == 2:
            y1[:, g] = np.sin(X1[:, 0] * 0.3) * np.cos(X1[:, 1] * 0.3) + 0.1 * np.random.randn(n_cells)
        else:
            y1[:, g] = np.exp(-(X1[:, 0]**2 + X1[:, 1]**2) / 10) + 0.1 * np.random.randn(n_cells)
    
    # For condition 2 (similar patterns but with differences)
    y2 = np.zeros((n_cells, n_genes))
    for g in range(n_genes):
        if g % 4 == 0:
            y2[:, g] = np.sin(X2[:, 0] * 0.5) + 0.1 * np.random.randn(n_cells)
        elif g % 4 == 1:
            y2[:, g] = np.cos(X2[:, 1] * 0.5) + 0.1 * np.random.randn(n_cells)
        elif g % 4 == 2:
            y2[:, g] = np.sin(X2[:, 0] * 0.3) * np.cos(X2[:, 1] * 0.3) + 0.1 * np.random.randn(n_cells)
        else:
            y2[:, g] = np.exp(-(X2[:, 0]**2 + X2[:, 1]**2) / 10) + 0.1 * np.random.randn(n_cells)
    
    # Ensure some differentially expressed genes between conditions
    diff_genes = np.random.choice(n_genes, n_diff_genes, replace=False)
    for g in diff_genes:
        # Apply different transformations for differential expression
        if g % 3 == 0:
            # Upregulation in condition 2
            y2[:, g] += 1.5
        elif g % 3 == 1:
            # Downregulation in condition 2
            y2[:, g] -= 1.0
        else:
            # Different pattern in condition 2
            y2[:, g] = -y2[:, g] + 0.5
    
    # Create gene names for better interpretability
    gene_names = [f"Gene_{i+1}" for i in range(n_genes)]
    diff_gene_names = [gene_names[i] for i in diff_genes]
    
    return X1, y1, X2, y2, gene_names, diff_genes, diff_gene_names

# Generate our synthetic dataset
X_cond1, y_cond1, X_cond2, y_cond2, gene_names, diff_genes, diff_gene_names = generate_synthetic_data()

# Standardize the cell states for better GP performance
scaler = StandardScaler()
X_cond1_scaled = scaler.fit_transform(X_cond1)
X_cond2_scaled = scaler.transform(X_cond2)

print(f"Condition 1 data shape: {X_cond1.shape}, {y_cond1.shape}")
print(f"Condition 2 data shape: {X_cond2.shape}, {y_cond2.shape}")
print(f"Number of differentially expressed genes: {len(diff_genes)}")
print(f"Example differentially expressed genes: {diff_gene_names[:5]}")

### Visualize the data distributions

In [None]:
plt.figure(figsize=(12, 5))

# Plot condition 1
plt.subplot(121)
plt.scatter(X_cond1[:, 0], X_cond1[:, 1], alpha=0.6, s=10)
plt.title("Condition 1 - Cell States")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")

# Plot condition 2
plt.subplot(122)
plt.scatter(X_cond2[:, 0], X_cond2[:, 1], alpha=0.6, s=10, color='orange')
plt.title("Condition 2 - Cell States")
plt.xlabel("Dimension 1")
plt.ylabel("Dimension 2")

plt.tight_layout()
plt.show()

# Create a combined UMAP embedding for better visualization
X_combined = np.vstack([X_cond1, X_cond2])
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_combined)

# Plot UMAP embedding with condition labels
plt.figure(figsize=(10, 8))
n_cond1 = len(X_cond1)  # Rows of X_umap before this index belong to condition 1
plt.scatter(X_umap[:n_cond1, 0], X_umap[:n_cond1, 1], alpha=0.6, s=10, label="Condition 1")
plt.scatter(X_umap[n_cond1:, 0], X_umap[n_cond1:, 1], alpha=0.6, s=10, color='orange', label="Condition 2")
plt.title("UMAP Embedding of Cell States from Both Conditions")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.legend()
plt.tight_layout()
plt.show()

### Visualize gene expression patterns

Let's look at a few of the differentially expressed genes to see how their patterns differ between conditions.

In [None]:
def plot_gene_expression(gene_idx, gene_name=None):
    """Plot a gene's expression per condition in the original space and in the combined UMAP."""
    if gene_name is None:
        gene_name = gene_names[gene_idx]
    
    # Combine gene expression values
    gene_expr = np.concatenate([y_cond1[:, gene_idx], y_cond2[:, gene_idx]])
    
    plt.figure(figsize=(15, 5))
    
    # Plot expression in original space for condition 1
    plt.subplot(131)
    sc = plt.scatter(X_cond1[:, 0], X_cond1[:, 1], c=y_cond1[:, gene_idx], 
                     cmap='viridis', alpha=0.8, s=15)
    plt.colorbar(sc, label='Expression')
    plt.title(f"{gene_name} - Condition 1")
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 2")
    
    # Plot expression in original space for condition 2
    plt.subplot(132)
    sc = plt.scatter(X_cond2[:, 0], X_cond2[:, 1], c=y_cond2[:, gene_idx], 
                     cmap='viridis', alpha=0.8, s=15)
    plt.colorbar(sc, label='Expression')
    plt.title(f"{gene_name} - Condition 2")
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 2")
    
    # Plot expression in UMAP for both conditions
    plt.subplot(133)
    sc = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=gene_expr, 
                     cmap='viridis', alpha=0.8, s=15)
    plt.colorbar(sc, label='Expression')
    plt.title(f"{gene_name} - Combined UMAP")
    plt.xlabel("UMAP 1")
    plt.ylabel("UMAP 2")
    
    plt.tight_layout()
    plt.show()

# Plot a few differentially expressed genes
for i in range(min(3, len(diff_genes))):
    plot_gene_expression(diff_genes[i])

## 2. Differential Abundance Analysis

First, we'll analyze differential abundance between the two conditions. This helps us understand how the cell state distribution changes, regardless of gene expression.
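
To make the idea of a density log fold change concrete, here is a minimal sketch that uses a plain Gaussian kernel density estimate as a stand-in for the GP density models. It is not Kompot's or Mellon's method; the helper `gaussian_kde_log_density`, the bandwidth, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_log_density(train, query, bandwidth=0.3):
    """Log density at `query` from an isotropic Gaussian KDE built on `train`.

    Hypothetical helper for illustration only; Kompot relies on Mellon's
    GP-based density estimators rather than a KDE.
    """
    n, d = train.shape
    # Squared Euclidean distance between every query point and every training point
    sq_dist = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -0.5 * sq_dist / bandwidth**2
    # Log-sum-exp over training points for numerical stability
    max_log = log_kernel.max(axis=1, keepdims=True)
    log_sum = max_log[:, 0] + np.log(np.exp(log_kernel - max_log).sum(axis=1))
    # Normalization constant of the averaged Gaussian kernels
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    return log_sum - log_norm

rng = np.random.default_rng(0)
cond1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(500, 2))  # toy condition 1
cond2 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(500, 2))  # toy condition 2, shifted right

# Evaluate both densities at the same cell states and subtract the logs
query = np.array([[-1.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
log_fold_change = (gaussian_kde_log_density(cond2, query)
                   - gaussian_kde_log_density(cond1, query))
print(log_fold_change)
```

A negative value marks a state that is depleted in condition 2, a positive value one that is enriched, and values near zero indicate similar abundance in both conditions.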

In [None]:
# Create a differential abundance analyzer
print("Running Differential Abundance Analysis...")
diff_abundance = kompot.DifferentialAbundance(n_landmarks=200)

# Fit the analyzer to our data
diff_abundance.fit(X_cond1_scaled, X_cond2_scaled)

### Visualization of Differential Abundance Results

In [None]:
# Visualize the log fold changes and z-scores in UMAP space
plt.figure(figsize=(15, 6))

# Plot log fold changes
plt.subplot(121)
sc = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=diff_abundance.log_fold_change, 
                 cmap='RdBu_r', alpha=0.8, s=15, vmin=-3, vmax=3)
plt.colorbar(sc, label='Log Fold Change')
plt.title("Differential Abundance: Log Fold Change")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")

# Plot z-scores
plt.subplot(122)
sc = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=diff_abundance.log_fold_change_zscore, 
                 cmap='RdBu_r', alpha=0.8, s=15, vmin=-5, vmax=5)
plt.colorbar(sc, label='Z-score')
plt.title("Differential Abundance: Z-scores")
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")

plt.tight_layout()
plt.show()

### Alternative: Using Pre-computed Density Predictors

Now let's demonstrate how to use pre-computed Mellon density estimators. This approach allows for more flexibility and reuse of estimators.

In [None]:
# First create and fit density estimators for each condition
print("Creating density estimators using Mellon...")
density_estimator1 = mellon.DensityEstimator(n_landmarks=200, d_method='fractal', predictor_with_uncertainty=True)
density_estimator1.fit(X_cond1_scaled)

density_estimator2 = mellon.DensityEstimator(n_landmarks=200, d_method='fractal', predictor_with_uncertainty=True)
density_estimator2.fit(X_cond2_scaled)

# Now create the differential abundance object using the pre-computed predictors
print("Creating differential abundance analyzer with pre-computed predictors...")
diff_abundance_with_predictors = kompot.DifferentialAbundance(
    n_landmarks=200,
    density_predictor1=density_estimator1.predict,
    density_predictor2=density_estimator2.predict
)

# Fit the analyzer to our data
diff_abundance_with_predictors.fit(X_cond1_scaled, X_cond2_scaled)

# Compare results
print("\nResults comparison:")
print(f"Method 1 (from scratch) - Mean log fold change: {np.mean(diff_abundance.log_fold_change):.4f}")
print(f"Method 2 (with predictors) - Mean log fold change: {np.mean(diff_abundance_with_predictors.log_fold_change):.4f}")
print(f"Correlation between methods: {np.corrcoef(diff_abundance.log_fold_change, diff_abundance_with_predictors.log_fold_change)[0, 1]:.4f}")

## 3. Differential Expression Analysis

Now we'll perform differential expression analysis to identify genes that are differentially expressed between the two conditions. We'll show multiple methods to achieve this:

In [None]:
# Method 1: Using the differential abundance object we already computed
print("Method 1: Using pre-computed differential abundance...")
diff_expression = kompot.DifferentialExpression(
    n_landmarks=200,
    differential_abundance=diff_abundance  # Reuse the differential abundance object
)

diff_expression.fit(
    X_cond1_scaled, y_cond1,
    X_cond2_scaled, y_cond2
)

### Alternative Methods for Differential Expression

In [None]:
# Method 2: Computing everything from scratch
print("Method 2: Computing everything from scratch...")
diff_expression_scratch = kompot.DifferentialExpression(n_landmarks=200)
diff_expression_scratch.fit(
    X_cond1_scaled, y_cond1,
    X_cond2_scaled, y_cond2
)

# Method 3: Using pre-computed Mellon function predictors
print("Method 3: Using pre-computed function predictors...")

# Create and fit function estimators for both conditions
function_estimator1 = mellon.FunctionEstimator(n_landmarks=200, sigma=0.1, predictor_with_uncertainty=True)
function_estimator1.fit(X_cond1_scaled, y_cond1)

function_estimator2 = mellon.FunctionEstimator(n_landmarks=200, sigma=0.1, predictor_with_uncertainty=True)
function_estimator2.fit(X_cond2_scaled, y_cond2)

# Create differential expression analyzer with pre-computed predictors
diff_expression_with_predictors = kompot.DifferentialExpression(
    n_landmarks=200,
    function_predictor1=function_estimator1.predict,
    function_predictor2=function_estimator2.predict,
    density_predictor1=density_estimator1.predict,  # From previous cell
    density_predictor2=density_estimator2.predict   # From previous cell
)

diff_expression_with_predictors.fit(
    X_cond1_scaled, y_cond1,
    X_cond2_scaled, y_cond2
)

### Evaluation of Differential Expression Results

Let's evaluate the differential expression results to see how well they identify our known differentially expressed genes.
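
The evaluation ranks genes by Mahalanobis distance. As a reminder of what that quantity measures, the sketch below computes the generic definition $\sqrt{\delta^\top \Sigma^{-1} \delta}$ with NumPy. How Kompot builds the difference vector $\delta$ and the covariance $\Sigma$ for each gene is not described in this notebook, so the inputs and the `mahalanobis_distance` helper here are illustrative assumptions.

```python
import numpy as np

def mahalanobis_distance(diff, cov, jitter=1e-8):
    """Return sqrt(diff^T cov^{-1} diff) using a Cholesky factorization.

    Hypothetical helper; `jitter` is a small diagonal term added for
    numerical stability and is an assumption of this sketch.
    """
    diff = np.asarray(diff, dtype=float)
    cov = np.asarray(cov, dtype=float) + jitter * np.eye(len(diff))
    chol = np.linalg.cholesky(cov)          # cov = chol @ chol.T
    whitened = np.linalg.solve(chol, diff)  # solves chol @ z = diff
    return float(np.sqrt(whitened @ whitened))

# The same expression difference at three cell states...
diff = np.array([1.0, 1.0, 1.0])
# ...under low and high uncertainty about that difference
confident = np.diag([0.25, 0.25, 0.25])
uncertain = np.diag([4.0, 4.0, 4.0])

d_confident = mahalanobis_distance(diff, confident)
d_uncertain = mahalanobis_distance(diff, uncertain)
print(d_confident, d_uncertain)
```

The same raw difference yields a larger distance when the uncertainty is small, which is why this score is a more informative ranking criterion than the raw fold change alone.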

In [None]:
def evaluate_differential_expression(model, method_name):
    """Evaluate differential expression results."""
    # Get top genes by Mahalanobis distance
    top_n = len(diff_genes)
    top_genes_idx = np.argsort(model.mahalanobis_distances)[-top_n:]
    
    # Calculate metrics
    true_positives = set(top_genes_idx).intersection(set(diff_genes))
    precision = len(true_positives) / top_n
    recall = len(true_positives) / len(diff_genes)
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    # Print results
    print(f"\n{method_name} Results:")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    
    # Return metrics for comparison
    return {
        'method': method_name,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'top_genes': top_genes_idx
    }

# Evaluate all methods
results = []
results.append(evaluate_differential_expression(diff_expression, "Method 1: With pre-computed differential abundance"))
results.append(evaluate_differential_expression(diff_expression_scratch, "Method 2: Computed from scratch"))
results.append(evaluate_differential_expression(diff_expression_with_predictors, "Method 3: With pre-computed predictors"))

# Compare results in a table
results_df = pd.DataFrame(results)
results_df[['method', 'precision', 'recall', 'f1']]
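
Because `top_n` is set to the number of known differential genes, precision and recall share the same numerator and denominator and therefore always coincide in the table above. The sketch below, with a hypothetical helper `precision_recall_at_k` and made-up scores, shows how the two metrics separate once the cutoff `k` differs from the number of true positives.

```python
def precision_recall_at_k(scores, true_indices, k):
    """Precision and recall of the k highest-scoring items (illustrative helper)."""
    # Rank item indices from highest to lowest score
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = set(ranked[:k])
    hits = len(top_k & set(true_indices))
    return hits / k, hits / len(true_indices)

# Toy scores for six genes; genes 0, 2 and 3 are the true differentials
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]
true_indices = [0, 2, 3]

for k in (2, 3, 5):
    precision, recall = precision_recall_at_k(scores, true_indices, k)
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
```

Sweeping `k` in this way gives a fuller picture of ranking quality than the single cutoff used in the comparison table.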

### Visualization of Differential Expression Results

In [None]:
# Plot Mahalanobis distances and highlight true differential genes
def plot_mahalanobis_distances(model):
    plt.figure(figsize=(12, 6))
    
    # Sort genes by Mahalanobis distance
    sorted_indices = np.argsort(model.mahalanobis_distances)
    sorted_distances = model.mahalanobis_distances[sorted_indices]
    
    # Create colors: red for true differentials, blue for non-differentials
    colors = ['blue' if i not in diff_genes else 'red' for i in sorted_indices]
    
    plt.bar(range(len(sorted_distances)), sorted_distances, color=colors)
    plt.xlabel('Gene Index (sorted)')
    plt.ylabel('Mahalanobis Distance')
    plt.title('Mahalanobis Distances for All Genes')
    
    # Add a legend
    from matplotlib.patches import Patch
    legend_elements = [
        Patch(facecolor='red', label='True Differential'),
        Patch(facecolor='blue', label='Non-Differential')
    ]
    plt.legend(handles=legend_elements)
    
    plt.tight_layout()
    plt.show()
    
    # Return true differentials in top 20
    top_20 = sorted_indices[-20:]
    true_in_top = [gene_names[i] for i in top_20 if i in diff_genes]
    return true_in_top

true_diff_in_top = plot_mahalanobis_distances(diff_expression)
print(f"True differential genes in top 20: {true_diff_in_top}")

In [None]:
# Plot expression and fold change for a specific differential gene
def plot_differential_gene(gene_idx, gene_name=None):
    if gene_name is None:
        gene_name = gene_names[gene_idx]
    
    plt.figure(figsize=(15, 5))
    
    # Plot imputed expression for condition 1
    plt.subplot(131)
    sc = plt.scatter(X_umap[:, 0], X_umap[:, 1], 
                     c=diff_expression.condition1_imputed[:, gene_idx], 
                     cmap='viridis', alpha=0.8, s=15)
    plt.colorbar(sc, label='Expression')
    plt.title(f"{gene_name} - Condition 1 (Imputed)")
    plt.xlabel("UMAP 1")
    plt.ylabel("UMAP 2")
    
    # Plot imputed expression for condition 2
    plt.subplot(132)
    sc = plt.scatter(X_umap[:, 0], X_umap[:, 1], 
                     c=diff_expression.condition2_imputed[:, gene_idx], 
                     cmap='viridis', alpha=0.8, s=15)
    plt.colorbar(sc, label='Expression')
    plt.title(f"{gene_name} - Condition 2 (Imputed)")
    plt.xlabel("UMAP 1")
    plt.ylabel("UMAP 2")
    
    # Plot fold change
    plt.subplot(133)
    sc = plt.scatter(X_umap[:, 0], X_umap[:, 1], 
                     c=diff_expression.fold_change[:, gene_idx], 
                     cmap='RdBu_r', alpha=0.8, s=15, vmin=-3, vmax=3)
    plt.colorbar(sc, label='Log Fold Change')
    plt.title(f"{gene_name} - Log Fold Change")
    plt.xlabel("UMAP 1")
    plt.ylabel("UMAP 2")
    
    # Show the gene's Mahalanobis distance in the figure title
    mahala_dist = diff_expression.mahalanobis_distances[gene_idx]
    plt.suptitle(f"{gene_name} - Mahalanobis Distance: {mahala_dist:.4f}", fontsize=16)
    
    plt.tight_layout(rect=[0, 0, 1, 0.95])
    plt.show()

# Plot the top differential gene
top_gene_idx = np.argmax(diff_expression.mahalanobis_distances)
plot_differential_gene(top_gene_idx, gene_names[top_gene_idx])

# Plot a known differential gene
known_diff_idx = diff_genes[0]
plot_differential_gene(known_diff_idx, gene_names[known_diff_idx])

## 4. Combining Differential Abundance and Expression

The weighted fold change in Kompot combines information from both differential abundance and differential expression analyses. Let's explore this.
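
As a rough illustration of why weighting matters, the sketch below compares a simple mean with a weighted mean of per-cell log fold changes. It assumes the weights are non-negative, density-derived values per cell; the exact weighting scheme Kompot uses is not specified in this notebook, and all numbers are made up.

```python
import numpy as np

# Per-cell log fold changes for two hypothetical genes (rows: cells, columns: genes)
fold_change = np.array([
    [3.0, 0.0],
    [3.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
])

# Assumed per-cell weights: the first two cells sit in a sparsely populated region
density_weight = np.array([0.05, 0.05, 1.0, 1.0, 1.0, 1.0])

simple_mean = fold_change.mean(axis=0)
weighted_mean = np.average(fold_change, axis=0, weights=density_weight)

print("simple mean:  ", simple_mean)
print("weighted mean:", weighted_mean)
```

The first gene changes strongly, but only in the sparse region, so its weighted mean shrinks toward zero. The second gene changes modestly where most of the weight lies, so its weighted mean rises above the simple mean.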

In [None]:
# Compare weighted and unweighted fold changes
plt.figure(figsize=(12, 6))

# Get mean fold changes for each gene
mean_fold_changes = np.mean(diff_expression.fold_change, axis=0)
weighted_fold_changes = diff_expression.weighted_mean_log_fold_change

# Sort genes by Mahalanobis distance
sorted_indices = np.argsort(diff_expression.mahalanobis_distances)
sorted_mean_fc = mean_fold_changes[sorted_indices]
sorted_weighted_fc = weighted_fold_changes[sorted_indices]

# Plot both fold changes
plt.subplot(121)
plt.scatter(range(len(sorted_mean_fc)), sorted_mean_fc, alpha=0.7)
plt.axhline(y=0, color='r', linestyle='-', alpha=0.3)
plt.xlabel('Gene Index (sorted by Mahalanobis distance)')
plt.ylabel('Mean Log Fold Change')
plt.title('Simple Mean Log Fold Change')

plt.subplot(122)
plt.scatter(range(len(sorted_weighted_fc)), sorted_weighted_fc, alpha=0.7, color='orange')
plt.axhline(y=0, color='r', linestyle='-', alpha=0.3)
plt.xlabel('Gene Index (sorted by Mahalanobis distance)')
plt.ylabel('Weighted Mean Log Fold Change')
plt.title('Density-Weighted Log Fold Change')

plt.tight_layout()
plt.show()

# Plot correlation between the two measures
plt.figure(figsize=(8, 8))
plt.scatter(mean_fold_changes, weighted_fold_changes, alpha=0.7)
plt.xlabel('Simple Mean Log Fold Change')
plt.ylabel('Weighted Mean Log Fold Change')
plt.title('Correlation between Fold Change Measures')
plt.grid(alpha=0.3)

# Add a diagonal line for reference
min_val = min(np.min(mean_fold_changes), np.min(weighted_fold_changes))
max_val = max(np.max(mean_fold_changes), np.max(weighted_fold_changes))
plt.plot([min_val, max_val], [min_val, max_val], 'r--', alpha=0.5)

# Color true differential genes
plt.scatter(mean_fold_changes[diff_genes], weighted_fold_changes[diff_genes], 
            color='red', alpha=0.8, s=100, label='True Differentials')
plt.legend()

plt.tight_layout()
plt.show()

## 5. Generating an Interactive HTML Report

Kompot provides functionality to generate an interactive HTML report for differential expression results.

In [None]:
# Generate an HTML report
report_path = kompot.generate_report(
    diff_expression,
    output_dir="kompot_report_synthetic",
    condition1_name="Condition A",
    condition2_name="Condition B",
    gene_names=gene_names,
    top_n=30,
    open_browser=True  # Set to False to not automatically open the browser
)

print(f"Report generated at: {report_path}")

## 6. Working with Real Single-Cell RNA-seq Data

In practice, you would typically work with data loaded from a single-cell RNA-seq experiment, often in AnnData format. Here's how you might use Kompot with such data:

In [None]:
# This is pseudocode that outlines the typical workflow with real data
"""
import scanpy as sc
import anndata as ad

# Load data
adata = sc.read_h5ad('path/to/your_data.h5ad')

# Preprocessing (if not already done)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)  # Cluster labels, used below as `groupby='leiden'` in the report

# Split data by condition
condition_key = 'condition'  # The column in adata.obs that contains condition labels
condition1 = 'control'
condition2 = 'treatment'

mask_condition1 = adata.obs[condition_key] == condition1
mask_condition2 = adata.obs[condition_key] == condition2

# Extract cell states (e.g., PCA coordinates) and expression values
X_condition1 = adata[mask_condition1].obsm['X_pca']
X_condition2 = adata[mask_condition2].obsm['X_pca']

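# Use the normalized, log-transformed expression of the highly variable genes.
# Subset through AnnData, then densify only if X is stored as a sparse matrix.
hvg_mask = adata.var['highly_variable'].to_numpy()
expr1 = adata[mask_condition1.to_numpy(), hvg_mask].X
expr2 = adata[mask_condition2.to_numpy(), hvg_mask].X
y_condition1 = expr1.toarray() if hasattr(expr1, 'toarray') else np.asarray(expr1)
y_condition2 = expr2.toarray() if hasattr(expr2, 'toarray') else np.asarray(expr2)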

# Run differential abundance analysis
diff_abundance = kompot.DifferentialAbundance(n_landmarks=300)
diff_abundance.fit(X_condition1, X_condition2)

# Run differential expression analysis
diff_expression = kompot.DifferentialExpression(
    n_landmarks=300, 
    differential_abundance=diff_abundance
)
diff_expression.fit(
    X_condition1, y_condition1,
    X_condition2, y_condition2
)

# Get gene names for the highly variable genes
gene_names = adata.var_names[adata.var['highly_variable']].tolist()

# Generate an HTML report
report_path = kompot.generate_report(
    diff_expression,
    output_dir="kompot_report_real_data",
    condition1_name=condition1,
    condition2_name=condition2,
    gene_names=gene_names,
    adata=adata,  # Pass the full AnnData object for additional visualizations
    groupby='leiden',  # Cell type annotations
    embedding_key='X_umap'  # Embedding for visualization
)
"""

## 7. Conclusion

In this notebook, we've demonstrated Kompot's capabilities for differential abundance and expression analysis:

1. **Differential Abundance Analysis**: Identifying regions of the cell state space that differ in density between conditions
2. **Differential Expression Analysis**: Finding genes with significant expression changes between conditions
3. **Multiple Integration Methods**: Using pre-computed predictors from Mellon for flexibility and efficiency
4. **Weighted Fold Change**: Combining density and expression information for more robust analysis
5. **Interactive HTML Reporting**: Generating comprehensive reports of differential expression results

Kompot provides a powerful framework for differential analysis in single-cell data, leveraging Gaussian Process models for robust imputation and uncertainty quantification.