# CellOracle Analysis: CTR9 Knockout Mouse scRNA-seq Data

This notebook walks through running CellOracle on mouse scRNA-seq data comparing **wild-type (WT)** and **CTR9 knockout (KO)** genotypes.

## Workflow Overview

1. **Data Loading**: Load WT and KO h5 files separately and tag each with its genotype
2. **Integration**: Concatenate the WT and KO datasets into one AnnData object
3. **Preprocessing**: QC filtering followed by standard scanpy preprocessing
4. **CellOracle Setup**: Load base GRN and create Oracle object
5. **GRN Construction**: Build cell-type-specific GRN models
6. **Perturbation Simulation**: Compare WT vs KO or simulate CTR9 KO effects

## Important Notes
- CTR9 is a component of the PAF1 complex, involved in transcription elongation
- Since you have experimental KO data, you can either:
  - Analyze WT and KO separately to compare GRN differences
  - Use combined data and compare network structures by genotype
  - Use WT data to simulate CTR9 KO and validate against actual KO data (only possible if Ctr9 is a TF in the base GRN; this is checked under Option B)

# 0. Setup and Imports

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
import anndata as ad

# CellOracle
import celloracle as co
print(f"CellOracle version: {co.__version__}")

# Visualization settings
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = [6, 4.5]
plt.rcParams['savefig.dpi'] = 300

# Scanpy settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=100, facecolor='white')

In [None]:
# Create output directories
save_folder = "figures"
data_folder = "processed_data"
os.makedirs(save_folder, exist_ok=True)
os.makedirs(data_folder, exist_ok=True)

# 1. Load Data

Load the WT and KO h5 files from their respective directories.

In [None]:
# Define paths to your data
# MODIFY THESE PATHS TO MATCH YOUR DIRECTORY STRUCTURE
data_dir = "/path/to/CTR9_snRNASeq"  # Change this!

wt_path = os.path.join(data_dir, "WT", "filtered_feature_bc_matrix.h5")
ko_path = os.path.join(data_dir, "KO", "filtered_feature_bc_matrix.h5")

print(f"WT file exists: {os.path.exists(wt_path)}")
print(f"KO file exists: {os.path.exists(ko_path)}")

In [None]:
# Load WT data
adata_wt = sc.read_10x_h5(wt_path)
adata_wt.var_names_make_unique()
adata_wt.obs['genotype'] = 'WT'
adata_wt.obs['sample'] = 'WT'
print(f"WT data shape: {adata_wt.shape}")
adata_wt

In [None]:
# Load KO data
adata_ko = sc.read_10x_h5(ko_path)
adata_ko.var_names_make_unique()
adata_ko.obs['genotype'] = 'KO'
adata_ko.obs['sample'] = 'CTR9_KO'
print(f"KO data shape: {adata_ko.shape}")
adata_ko

In [None]:
# Make cell barcodes unique before concatenation
adata_wt.obs_names = [f"WT_{bc}" for bc in adata_wt.obs_names]
adata_ko.obs_names = [f"KO_{bc}" for bc in adata_ko.obs_names]
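
Why the prefix matters: 10x barcodes are drawn from a shared whitelist, so two independently processed samples can contain identical barcode strings. The sketch below uses made-up barcodes (not taken from this dataset) to show the collision and how a sample prefix removes it.

```python
import pandas as pd

# Hypothetical barcodes for illustration only.
wt_barcodes = pd.Index(["AAACCCAAGA-1", "AAACGAAGTC-1", "AAAGGTATCG-1"])
ko_barcodes = pd.Index(["AAACCCAAGA-1", "AAAGTCCTCA-1"])

# Barcodes present in both samples would collide after concatenation.
shared = wt_barcodes.intersection(ko_barcodes)

# Prefixing with the sample label makes every cell name unique.
prefixed = pd.Index(
    [f"WT_{bc}" for bc in wt_barcodes] + [f"KO_{bc}" for bc in ko_barcodes]
)

print(f"Shared raw barcodes: {list(shared)}")
print(f"Unique after prefixing: {prefixed.is_unique}")
```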

In [None]:
# Concatenate datasets
adata = ad.concat([adata_wt, adata_ko], join='outer')
adata.obs_names_make_unique()
print(f"Combined data shape: {adata.shape}")
print(f"\nGenotype distribution:")
print(adata.obs['genotype'].value_counts())

# 2. Quality Control and Filtering

In [None]:
# Calculate QC metrics
# For mouse, mitochondrial genes start with 'mt-' (lowercase)
adata.var['mt'] = adata.var_names.str.startswith('mt-')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

# Visualize QC metrics
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
sc.pl.violin(adata, 'n_genes_by_counts', groupby='genotype', ax=axes[0], show=False)
sc.pl.violin(adata, 'total_counts', groupby='genotype', ax=axes[1], show=False)
sc.pl.violin(adata, 'pct_counts_mt', groupby='genotype', ax=axes[2], show=False)

# Scatter plot
axes[3].scatter(adata.obs['total_counts'], adata.obs['n_genes_by_counts'], 
                c=adata.obs['pct_counts_mt'], cmap='viridis', s=1, alpha=0.5)
axes[3].set_xlabel('Total counts')
axes[3].set_ylabel('N genes')
plt.colorbar(axes[3].collections[0], ax=axes[3], label='% MT')

plt.tight_layout()
plt.savefig(f"{save_folder}/QC_metrics.png", dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# Filter cells - ADJUST THESE THRESHOLDS BASED ON YOUR QC PLOTS
print(f"Cells before filtering: {adata.n_obs}")

# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_cells(adata, max_genes=6000)  # Adjust based on your data
sc.pp.filter_genes(adata, min_cells=3)

# Filter by mitochondrial content
adata = adata[adata.obs['pct_counts_mt'] < 20, :].copy()  # Adjust threshold

print(f"Cells after filtering: {adata.n_obs}")
print(f"Genes after filtering: {adata.n_vars}")
print(f"\nGenotype distribution after filtering:")
print(adata.obs['genotype'].value_counts())
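
If fixed cutoffs are hard to choose from the plots, one common alternative is to flag cells that lie several median absolute deviations (MADs) from the median of a QC metric. This is a minimal sketch of that rule, not what the cell above does; the function name `mad_outliers` and the default of 5 MADs are assumptions for illustration.

```python
import numpy as np

def mad_outliers(values, n_mads=5.0):
    """Return a boolean mask marking values more than n_mads MADs from the median."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return np.abs(values - median) > n_mads * mad

# Invented per-cell total counts with one obvious outlier.
toy_total_counts = [1000, 1100, 900, 1050, 950, 20000]
outlier_mask = mad_outliers(toy_total_counts)
print(outlier_mask)
```

Applied to `adata.obs['total_counts']` or `adata.obs['n_genes_by_counts']`, the returned mask could be used to subset `adata` in place of the hard thresholds.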

# 3. Preprocessing for CellOracle

**Important**: CellOracle requires:
1. Raw counts stored in a layer (for GRN inference)
2. Normalized/log-transformed data in `.X` (for visualization)
3. Variable gene selection (2000-3000 genes recommended)
4. Clustering and embedding

In [None]:
# Store raw counts BEFORE normalization - CellOracle needs this!
adata.layers['raw_count'] = adata.X.copy()

In [None]:
# Normalize and log transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
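
For intuition, the two calls above amount to rescaling each cell so that its counts sum to `target_sum`, then applying `log(1 + x)`. The toy matrix below is invented, and the NumPy arithmetic is a sketch of that default behaviour rather than scanpy's actual implementation.

```python
import numpy as np

# Invented counts: 2 cells x 3 genes.
counts = np.array([[10.0, 0.0, 30.0],
                   [1.0, 1.0, 2.0]])
target_sum = 1e4

# Per-cell scaling so every row sums to target_sum.
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# Natural log of (1 + x); zeros stay zero.
logged = np.log1p(normalized)

print(normalized.sum(axis=1))
print(logged.round(3))
```

Note that the two cells end up with the same normalized value for the first gene even though their raw counts differ tenfold, which is the point of depth normalization.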

In [None]:
# Variable gene selection
# CellOracle works best with 2000-3000 highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2500, batch_key='genotype')
print(f"Number of highly variable genes: {adata.var['highly_variable'].sum()}")

# Plot highly variable genes
sc.pl.highly_variable_genes(adata)

In [None]:
# Subset to highly variable genes
adata = adata[:, adata.var['highly_variable']].copy()
print(f"Data shape after HVG selection: {adata.shape}")

In [None]:
# Scale data (for PCA/clustering, not for CellOracle GRN inference)
sc.pp.scale(adata, max_value=10)

In [None]:
# PCA
sc.tl.pca(adata, n_comps=50)
sc.pl.pca_variance_ratio(adata, n_pcs=50)

In [None]:
# Batch correction with Harmony (optional)
# Here genotype and batch are confounded, so correcting on 'genotype' can also dampen real KO effects
# Uncomment if you want to integrate the datasets

# import scanpy.external as sce
# sce.pp.harmony_integrate(adata, key='genotype')
# use_rep = 'X_pca_harmony'

use_rep = 'X_pca'  # Use this if not doing integration

In [None]:
# Compute neighbors and UMAP
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30, use_rep=use_rep)
sc.tl.umap(adata)

In [None]:
# Clustering
sc.tl.leiden(adata, resolution=0.8)  # Adjust resolution as needed

In [None]:
# Visualize clustering and genotype
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

sc.pl.umap(adata, color='leiden', ax=axes[0], show=False, legend_loc='on data', 
           title='Leiden Clusters', frameon=False)
sc.pl.umap(adata, color='genotype', ax=axes[1], show=False, 
           title='Genotype', frameon=False)
sc.pl.umap(adata, color='n_genes_by_counts', ax=axes[2], show=False, 
           title='N Genes', frameon=False)

plt.tight_layout()
plt.savefig(f"{save_folder}/UMAP_overview.png", dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# OPTIONAL: Draw force-directed graph (alternative embedding)
# CellOracle can work with either UMAP or force-atlas embedding
# sc.tl.draw_graph(adata)

# 4. Cell Type Annotation

You'll need to annotate your cell types. Here are some approaches:
1. Use known marker genes for your tissue
2. Use automated annotation tools (scType, SingleR, etc.)
3. Transfer labels from a reference dataset

In [None]:
# Example: Check expression of common marker genes
# MODIFY THESE BASED ON YOUR TISSUE TYPE
marker_genes = {
    'Stem/Progenitor': ['Cd34', 'Kit'],
    'T cells': ['Cd3e', 'Cd3d', 'Cd4', 'Cd8a'],
    'B cells': ['Cd19', 'Ms4a1', 'Cd79a'],
    'Macrophages': ['Cd68', 'Adgre1'],
    'NK cells': ['Ncr1', 'Klrb1c'],
    'Fibroblasts': ['Col1a1', 'Dcn'],
    'Epithelial': ['Epcam', 'Krt8'],
}

# Check which markers are in the dataset
available_markers = {}
for cell_type, genes in marker_genes.items():
    available = [g for g in genes if g in adata.var_names]
    if available:
        available_markers[cell_type] = available
        print(f"{cell_type}: {available}")

In [None]:
# Visualize marker genes
all_markers = [g for genes in available_markers.values() for g in genes]
if all_markers:
    sc.pl.dotplot(adata, all_markers, groupby='leiden', dendrogram=True)

In [None]:
# Create cell type annotation column
# MODIFY THIS BASED ON YOUR MARKER ANALYSIS
# This is a placeholder - you need to assign cell types based on your data

# Example mapping (REPLACE WITH YOUR OWN)
cluster_to_celltype = {
    '0': 'CellType_A',
    '1': 'CellType_B',
    '2': 'CellType_C',
    # Add more as needed...
}

# If you haven't annotated yet, use cluster IDs temporarily
# (kept as a categorical column, which scanpy grouping and plotting expect)
adata.obs['cell_type'] = adata.obs['leiden'].astype('category')

# Or apply your mapping:
# adata.obs['cell_type'] = adata.obs['leiden'].map(cluster_to_celltype).astype('category')
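
One pitfall with the mapping approach: `Series.map` returns `NaN` for any cluster missing from the dictionary, and those cells would then have no cell type. A small check like the following (with toy cluster labels and a hypothetical `Unassigned` fallback label) makes the gap visible before it reaches CellOracle.

```python
import pandas as pd

# Toy cluster labels; cluster "3" is deliberately missing from the mapping.
toy_leiden = pd.Series(["0", "1", "2", "3", "0"])
toy_mapping = {"0": "CellType_A", "1": "CellType_B", "2": "CellType_C"}

mapped = toy_leiden.map(toy_mapping)

# Clusters that did not receive a label.
unmapped_clusters = sorted(set(toy_leiden[mapped.isna()]))
print(f"Clusters without a cell type: {unmapped_clusters}")

# Hypothetical fallback label so that no cell is left as NaN.
toy_cell_type = mapped.fillna("Unassigned").astype("category")
print(toy_cell_type.value_counts())
```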

In [None]:
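# Check Ctr9 expression in WT vs KO
# adata was subset to highly variable genes above, so Ctr9 may have been dropped.
# The lookup is case-insensitive to catch alternate capitalization.
ctr9_matches = [g for g in adata.var_names if g.lower() == 'ctr9']
if ctr9_matches:
    ctr9_name = ctr9_matches[0]
    # .X holds scaled values at this point; sc.pl.violin has no `title` argument
    sc.pl.violin(adata, ctr9_name, groupby='genotype')
    sc.pl.umap(adata, color=ctr9_name, title='Ctr9 expression')
else:
    print("Ctr9 is not among the retained highly variable genes; "
          "check it in the data before HVG selection to confirm the knockout.")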
# 5. Save Preprocessed Data

In [None]:
# Verify raw counts are stored
print("Layers:", list(adata.layers.keys()))
print("Obs columns:", list(adata.obs.columns))
print("Obsm keys:", list(adata.obsm.keys()))

In [None]:
# Save preprocessed data
adata.write_h5ad(f"{data_folder}/CTR9_WT_KO_preprocessed.h5ad")
print(f"Saved to {data_folder}/CTR9_WT_KO_preprocessed.h5ad")

---
# 6. CellOracle Analysis

Now we'll set up CellOracle for GRN inference and analysis.

In [None]:
# Load preprocessed data (or continue with adata from above)
# adata = sc.read_h5ad(f"{data_folder}/CTR9_WT_KO_preprocessed.h5ad")

## 6.1. Load Base GRN

For mouse data, use the mouse scATAC-seq atlas base GRN.

In [None]:
# Load mouse base GRN from scATAC-seq atlas
base_GRN = co.data.load_mouse_scATAC_atlas_base_GRN()
print(f"Base GRN shape: {base_GRN.shape}")
base_GRN.head()

In [None]:
# Alternative: Use promoter-based base GRN
# base_GRN = co.data.load_mouse_promoter_base_GRN()

# Or if you have your own scATAC-seq data, you can build a custom base GRN
# See CellOracle documentation for details

## 6.2. Create Oracle Object

In [None]:
# Prepare adata for CellOracle
# CellOracle needs raw counts in .X
adata_oracle = adata.copy()
adata_oracle.X = adata_oracle.layers['raw_count'].copy()

print("Data prepared for CellOracle")
print(f"Shape: {adata_oracle.shape}")

In [None]:
# Check available embeddings and clustering columns
print("Clustering columns (obs):", [c for c in adata_oracle.obs.columns 
                                     if 'leiden' in c.lower() or 'louvain' in c.lower() or 'cell' in c.lower()])
print("Embeddings (obsm):", list(adata_oracle.obsm.keys()))

In [None]:
# Instantiate Oracle object
oracle = co.Oracle()

# Import data - MODIFY cluster_column_name and embedding_name as needed
oracle.import_anndata_as_raw_count(
    adata=adata_oracle,
    cluster_column_name='cell_type',  # or 'leiden' if you haven't annotated
    embedding_name='X_umap'  # or 'X_draw_graph_fa' if you computed it
)

print("Data imported into Oracle object")

In [None]:
# Add genotype information to Oracle object
oracle.adata.obs['genotype'] = adata.obs['genotype'].values

In [None]:
# Import base GRN
oracle.import_TF_data(TF_info_matrix=base_GRN)
print("Base GRN imported")

## 6.3. Preprocessing for GRN Inference

In [None]:
# Perform PCA on the normalized counts (KNN imputation runs afterwards and uses these PCs)
oracle.perform_PCA()

# Plot variance explained to select number of PCs
plt.figure(figsize=(6, 4))
plt.plot(np.cumsum(oracle.pca.explained_variance_ratio_)[:100])
n_comps = np.where(np.diff(np.diff(np.cumsum(oracle.pca.explained_variance_ratio_)) > 0.002))[0][0]
plt.axvline(n_comps, c='red', linestyle='--')
plt.xlabel('Number of PCs')
plt.ylabel('Cumulative variance explained')
plt.title(f'PCA - Auto-selected {n_comps} PCs')
plt.show()

n_comps = min(n_comps, 50)
print(f"Using {n_comps} PCs")
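
The one-line `n_comps` expression above is compact but hard to read. Unpacked, it looks for the first position where the per-component variance gain crosses the 0.002 threshold. The sketch below reproduces that logic on made-up variance ratios; the helper name `select_n_pcs` and its `fallback` argument are assumptions added here so that the no-crossing case does not raise an `IndexError`.

```python
import numpy as np

def select_n_pcs(variance_ratio, threshold=0.002, fallback=50):
    """Index of the first flip in 'gain > threshold' along the components."""
    gains = np.diff(np.cumsum(variance_ratio))  # variance added by each further PC
    above = gains > threshold                   # boolean mask per PC
    crossings = np.where(np.diff(above))[0]     # np.diff on booleans marks flips
    return int(crossings[0]) if crossings.size else fallback

# Made-up variance ratios: a few informative PCs followed by a flat tail.
toy_ratio = np.array([0.30, 0.20, 0.10, 0.05, 0.001, 0.001, 0.001])
n_pcs_toy = select_n_pcs(toy_ratio)
print(f"Selected index: {n_pcs_toy}")
```

Returning a fallback instead of indexing into an empty array is a design choice for robustness; the original expression assumes a crossing always exists.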

In [None]:
# Heuristic k for KNN imputation: 2.5% of the number of cells
n_cell = oracle.adata.shape[0]
k = int(0.025 * n_cell)
print(f"Number of cells: {n_cell}")
print(f"Auto-selected k for KNN imputation: {k}")

In [None]:
# Perform KNN imputation
oracle.knn_imputation(n_pca_dims=n_comps, k=k, balanced=True, 
                      b_sight=k*8, b_maxl=k*4, n_jobs=4)
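
Conceptually, KNN imputation replaces each cell's expression with an average over its nearest neighbours, which reduces dropout noise before regression. The NumPy sketch below shows plain (unbalanced) KNN averaging on four invented cells; it is a simplification for intuition, not CellOracle's balanced KNN, and `knn_smooth` is a hypothetical helper.

```python
import numpy as np

def knn_smooth(expr, coords, k):
    """Average each cell's expression over its k nearest cells (itself included)."""
    # Pairwise Euclidean distances between cells in the reduced space.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    # Indices of the k closest cells for every cell.
    neighbours = np.argsort(dists, axis=1)[:, :k]
    return expr[neighbours].mean(axis=1)

# Invented data: two tight pairs of cells in a 1-D "PCA" space, one gene.
toy_coords = np.array([[0.0], [0.1], [5.0], [5.1]])
toy_expr = np.array([[0.0], [2.0], [10.0], [14.0]])

smoothed = knn_smooth(toy_expr, toy_coords, k=2)
print(smoothed.ravel())
```

Larger `k` smooths more aggressively, which is why the notebook ties `k` to the number of cells rather than fixing it.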

## 6.4. GRN Inference

Infer cluster-specific GRNs using regularized linear regression.
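
As a rough illustration of the idea, each target gene's expression is regressed on its candidate TFs with an L2 penalty, and the fitted coefficients become edge weights. The closed-form ridge solution below is a simplified stand-in on simulated data; CellOracle's own models and the exact meaning of its `alpha` may differ, and `ridge_coefficients` is a hypothetical helper.

```python
import numpy as np

def ridge_coefficients(tf_expr, target_expr, alpha=10.0):
    """Closed-form ridge regression of one target gene on candidate TFs."""
    X = tf_expr - tf_expr.mean(axis=0)       # center predictors
    y = target_expr - target_expr.mean()     # center response
    n_tfs = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_tfs), X.T @ y)

# Simulated data: TF 0 activates the target, TF 1 represses it, TF 2 is unrelated.
rng = np.random.default_rng(0)
toy_tf_expr = rng.normal(size=(500, 3))
toy_target = (2.0 * toy_tf_expr[:, 0]
              - 1.0 * toy_tf_expr[:, 1]
              + rng.normal(scale=0.1, size=500))

coefs = ridge_coefficients(toy_tf_expr, toy_target)
print(coefs.round(2))
```

The sign of each coefficient indicates activation or repression, and the penalty shrinks weakly supported edges toward zero.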

In [None]:
# Get cluster information
print("Clusters in data:")
print(oracle.adata.obs['cell_type'].value_counts())

In [None]:
# GRN inference
# This step takes a while depending on data size
links = oracle.get_links(cluster_name_for_GRN_unit='cell_type', 
                         alpha=10, 
                         verbose_level=10)

In [None]:
# Check links object
links

In [None]:
# Filter links: apply a p-value cutoff of 0.001, then retain the 2000 strongest edges ranked by coef_abs
links.filter_links(p=0.001, weight="coef_abs", threshold_number=2000)
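
In effect the filter is a two-step operation on an edge table: drop edges above the p-value cutoff, then keep the strongest remaining edges. The pandas sketch below mimics that on an invented table; the column names `p` and `coef_abs` follow the arguments used above but are assumptions about the table layout, and `filter_edges` is a hypothetical helper.

```python
import pandas as pd

def filter_edges(edges, p_max=0.001, weight="coef_abs", top_n=2000):
    """Keep significant edges, then the top_n by the chosen weight column."""
    significant = edges[edges["p"] <= p_max]
    return significant.sort_values(weight, ascending=False).head(top_n)

# Invented edge table for illustration.
toy_edges = pd.DataFrame({
    "source": ["A", "A", "B", "C"],
    "target": ["X", "Y", "X", "Y"],
    "coef_abs": [0.9, 0.5, 0.7, 0.8],
    "p": [1e-5, 1e-4, 0.05, 1e-6],
})

kept = filter_edges(toy_edges, p_max=0.001, top_n=2)
print(kept)
```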

## 6.5. Network Visualization and Analysis

In [None]:
# Plot network degree distribution
plt.subplots_adjust(left=0.15, bottom=0.3)
links.plot_degree_distributions(plot_model=True, 
                                 save=f"{save_folder}/degree_distribution")

In [None]:
# Network scores
links.get_network_score()

In [None]:
# Plot network entropy
links.plot_network_entropy_distributions(save=f"{save_folder}/network_entropy")

## 6.6. Save Results

In [None]:
# Save Oracle object
oracle.to_hdf5(f"{data_folder}/CTR9_oracle.celloracle.oracle")
print("Oracle object saved")

In [None]:
# Save Links object
links.to_hdf5(f"{data_folder}/CTR9_links.celloracle.links")
print("Links object saved")

---
# 7. Analysis Strategies for WT vs KO Comparison

With your experimental setup, you have several analysis options:

## Option A: Run CellOracle Separately on WT and KO

This allows you to compare GRN structures between genotypes.

In [None]:
# Subset data by genotype
adata_wt_only = adata[adata.obs['genotype'] == 'WT'].copy()
adata_ko_only = adata[adata.obs['genotype'] == 'KO'].copy()

print(f"WT cells: {adata_wt_only.n_obs}")
print(f"KO cells: {adata_ko_only.n_obs}")

In [None]:
# Example: Create Oracle for WT only
# You would repeat this for KO

# oracle_wt = co.Oracle()
# adata_wt_oracle = adata_wt_only.copy()
# adata_wt_oracle.X = adata_wt_oracle.layers['raw_count'].copy()
# oracle_wt.import_anndata_as_raw_count(adata=adata_wt_oracle,
#                                       cluster_column_name='cell_type',
#                                       embedding_name='X_umap')
# oracle_wt.import_TF_data(TF_info_matrix=base_GRN)
# # ... continue with analysis
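
Once both genotypes have their own filtered GRNs, one simple comparison is to align edges by (source, target) and rank them by the change in coefficient. The sketch below does this for two invented edge tables; the `coef_mean` column name and the `compare_edges` helper are assumptions, and edges absent in one genotype are treated as having weight 0.

```python
import pandas as pd

def compare_edges(wt_edges, ko_edges):
    """Align two edge tables and rank edges by absolute coefficient change."""
    merged = wt_edges.merge(
        ko_edges, on=["source", "target"], how="outer", suffixes=("_wt", "_ko")
    )
    coef_cols = ["coef_mean_wt", "coef_mean_ko"]
    merged[coef_cols] = merged[coef_cols].fillna(0.0)  # missing edge = weight 0
    merged["delta"] = merged["coef_mean_ko"] - merged["coef_mean_wt"]
    return merged.sort_values("delta", key=abs, ascending=False).reset_index(drop=True)

# Invented edges for illustration.
toy_wt = pd.DataFrame({"source": ["A", "A"], "target": ["B", "C"],
                       "coef_mean": [0.5, 0.2]})
toy_ko = pd.DataFrame({"source": ["A", "D"], "target": ["B", "C"],
                       "coef_mean": [0.1, 0.3]})

edge_changes = compare_edges(toy_wt, toy_ko)
print(edge_changes)
```

Edges with large negative `delta` are weakened or lost in the KO, while large positive values mark edges that appear or strengthen.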

## Option B: Simulate CTR9 KO in WT Data

Use CellOracle's perturbation simulation to predict CTR9 KO effects in WT cells, then compare to actual KO data.

In [None]:
# Load saved oracle object (if starting fresh)
# oracle = co.load_hdf5(f"{data_folder}/CTR9_oracle.celloracle.oracle")

In [None]:
# Check if CTR9 is in the gene list
# Note: Gene names might be case-sensitive
gene_name = 'Ctr9'  # or 'CTR9' depending on your data

if gene_name in oracle.adata.var_names:
    print(f"{gene_name} found in dataset")
else:
    print(f"{gene_name} not found. Available genes starting with 'Ctr':")
    print([g for g in oracle.adata.var_names if g.lower().startswith('ctr')])

In [None]:
# Check whether an in silico CTR9 knockout is possible
# CellOracle can only perturb TFs that have regulatory edges in the GRN

# First, check if CTR9 is a TF in the base GRN (TFs are columns of the base GRN table)
if gene_name in base_GRN.columns:
    print(f"{gene_name} is a TF in the base GRN - can simulate knockout")
else:
    print(f"{gene_name} is NOT a TF in the base GRN - its knockout cannot be simulated; compare WT and KO GRNs instead (Option A)")

## Perturbation Simulation (Next Notebook)

After building the GRN, you'll want to run perturbation simulations. This is typically done in a separate notebook following the CellOracle tutorial:
https://morris-lab.github.io/CellOracle.documentation/tutorials/simulation.html

In [None]:
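# Example perturbation simulation setup (kept commented out)
# This would be expanded in a follow-up analysis
# Requires gene_name to be a TF with edges in the filtered GRN

# # Fit the simulation model on the filtered cluster-specific GRNs
# oracle.get_cluster_specific_TFdict_from_Links(links_object=links)
# oracle.fit_GRN_for_simulation(alpha=10, use_cluster_specific_TFdict=True)

# # Simulate CTR9 knockout (set expression to 0)
# oracle.simulate_shift(perturb_condition={gene_name: 0.0},
#                       n_propagation=3)

# # Project the simulated shift onto the embedding
# oracle.estimate_transition_prob(n_neighbors=200, knn_random=True, sampled_fraction=1)
# oracle.calculate_embedding_shift(sigma_corr=0.05)

# # Visualize predicted cell state shifts, e.g. with oracle.plot_quiver(...)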
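
For the comparison against the real KO cells, one simple score is the cosine similarity between the predicted per-gene shift and the observed difference in mean expression (KO minus WT). The sketch below uses invented arrays; `shift_agreement` is a hypothetical helper, and how the predicted shift is extracted from the Oracle object is left open.

```python
import numpy as np

def shift_agreement(predicted_shift, wt_expr, ko_expr):
    """Cosine similarity between a predicted shift and the observed KO - WT shift."""
    observed_shift = ko_expr.mean(axis=0) - wt_expr.mean(axis=0)
    denom = np.linalg.norm(predicted_shift) * np.linalg.norm(observed_shift)
    if denom == 0:
        return float("nan")  # undefined when either vector is all zeros
    return float(predicted_shift @ observed_shift / denom)

# Invented data: 2 cells x 2 genes per genotype; gene 0 goes up in the KO.
toy_wt_expr = np.array([[1.0, 1.0], [1.0, 1.0]])
toy_ko_expr = np.array([[2.0, 1.0], [2.0, 1.0]])
toy_predicted = np.array([0.5, 0.0])

score = shift_agreement(toy_predicted, toy_wt_expr, toy_ko_expr)
print(f"Agreement: {score:.2f}")
```

A value near 1 means the simulation points in the same direction as the measured change, near -1 the opposite direction, and near 0 no relationship.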

---
# Summary

This notebook covered:
1. ✅ Loading WT and KO h5 files
2. ✅ Preprocessing and quality control
3. ✅ Combining datasets with genotype annotation
4. ✅ Setting up CellOracle with mouse base GRN
5. ✅ GRN inference

## Next Steps
1. **Annotate cell types** based on marker gene expression
2. **Run perturbation simulation** to model CTR9 KO effects
3. **Compare predicted vs actual KO** phenotypes
4. **Analyze GRN differences** between WT and KO conditions

## Key Resources
- CellOracle documentation: https://morris-lab.github.io/CellOracle.documentation/
- Perturbation tutorial: https://morris-lab.github.io/CellOracle.documentation/tutorials/simulation.html