# Tutorial 09: scATAC-seq Archetypal Analysis

This tutorial walks through end-to-end archetypal analysis on scATAC-seq data.

**What you'll learn:**
1. TF-IDF + LSI preprocessing for chromatin accessibility data
2. Hyperparameter search with LSI embeddings
3. Training and evaluating the final model
4. Archetype distances, positions, and assignments
5. Peak associations and characterization

**Key difference from scRNA-seq:** scATAC-seq uses TF-IDF + LSI (truncated SVD)
instead of log-normalization + PCA. PEACH reads its input embedding from the
`adata.obsm` key you name, so set `pca_key='X_lsi'` when training.

**Requirements:**
- `peach` (v0.4.0+)
- `scanpy`
- Data: `data/ovary_ATAC.h5ad` (18K cells x 19K peaks)

In [2]:
import scanpy as sc
import peach as pc
import numpy as np
from pathlib import Path

print(f"PEACH version: {pc.__version__}")

PEACH version: 0.4.0


## Step 1: Load scATAC-seq Data

scATAC-seq data is stored as a sparse cell-by-peak matrix, where each entry
counts the Tn5 insertions that fall in a genomic peak for a given cell. These
matrices are typically very sparse (95-99% zeros) and high-dimensional.

In [None]:
data_path = Path("data/ovary_ATAC.h5ad")
adata = sc.read_h5ad(data_path)

print(f"Shape: {adata.n_obs:,} cells x {adata.n_vars:,} peaks")
print(f"Existing obsm keys: {list(adata.obsm.keys())}")
print(f"Existing obs columns: {list(adata.obs.columns[:10])}")

# Check sparsity
import scipy.sparse as sp
if sp.issparse(adata.X):
    density = adata.X.nnz / (adata.n_obs * adata.n_vars)
    print(f"Matrix density: {density:.1%} non-zero")


## Step 2: TF-IDF + LSI Preprocessing

`pc.pp.prepare_atacseq()` performs:
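1. **TF-IDF normalization**: Term frequency scales each cell's counts by that
   cell's total, and inverse document frequency up-weights peaks that are
   accessible in fewer cells, so each peak is weighted by how informative it
   is across the dataset.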
2. **LSI (Latent Semantic Indexing)**: Truncated SVD dimensionality reduction.
   The first component is dropped by default because it typically captures
   sequencing depth rather than biological variation.

The result is stored in `adata.obsm['X_lsi']`, analogous to `X_pca` for RNA.
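
To make the two steps concrete, here is a minimal NumPy sketch of TF-IDF followed by truncated SVD on a synthetic matrix. It illustrates the general technique and is not PEACH's implementation: the IDF formula (`log(1 + N / n_cells_with_peak)`) is one common variant, and the matrix sizes and variable names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary cells x peaks matrix (assumed sizes: 200 cells, 500 peaks, ~5% open)
counts = (rng.random((200, 500)) < 0.05).astype(float)
# Guarantee that no cell and no peak is completely empty, so the divisions are safe
counts[np.arange(200), rng.integers(0, 500, 200)] = 1.0
counts[rng.integers(0, 200, 500), np.arange(500)] = 1.0

# Term frequency: each cell's counts divided by that cell's total
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse document frequency: peaks open in fewer cells get a larger weight
n_cells = counts.shape[0]
cells_per_peak = (counts > 0).sum(axis=0)
idf = np.log1p(n_cells / cells_per_peak)

tfidf = tf * idf

# LSI: truncated SVD of the TF-IDF matrix, keeping the leading components
n_components = 10
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
embedding = U[:, :n_components] * S[:n_components]

# Drop the first component, which often tracks sequencing depth
X_lsi = embedding[:, 1:]
print(X_lsi.shape)
```

A full SVD is fine at this toy size; for real data with thousands of peaks, a sparse truncated solver is the usual choice.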

In [None]:
# If pre-computed LSI exists, we can use it directly
# To force recomputation: del adata.obsm['X_lsi']

if "X_lsi" not in adata.obsm:
    pc.pp.prepare_atacseq(
        adata,
        n_components=50,   # 30-50 standard for scATAC
        drop_first=True,   # Drop depth component
    )
else:
    print(f"Using pre-computed X_lsi: {adata.obsm['X_lsi'].shape}")

print(f"\nLSI embeddings: {adata.obsm['X_lsi'].shape}")
if 'lsi' in adata.uns:
    var_ratio = adata.uns['lsi']['variance_ratio']
    print(f"Variance explained: {var_ratio.sum()*100:.1f}% total")
    print(f"  Top 5 components: {var_ratio[:5]*100}")


## Step 3: Hyperparameter Search

The hyperparameter search is the same as for scRNA-seq, except that it points to
`X_lsi` instead of `X_pca`. It evaluates each candidate number of archetypes with
cross-validation.

In [None]:
cv_summary = pc.tl.hyperparameter_search(
    adata,
    n_archetypes_range=[3, 4, 5, 6, 7],
    cv_folds=3,
    max_epochs_cv=15,
    pca_key="X_lsi",  # The only difference from RNA workflows
    device="cpu",
)

# Ranked results
ranked = cv_summary.ranked_configs
for i, config in enumerate(ranked[:5]):
    print(f"  #{i+1}: n_archetypes={config['hyperparameters']['n_archetypes']}, "
          f"R²={config['metric_value']:.4f}")

In [None]:
# Elbow plot to select optimal n_archetypes
pc.pl.elbow_curve(cv_summary)
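
If you want a numeric rule to go with the plot, one simple heuristic is to pick the first number of archetypes after which the gain in mean CV R² drops below a tolerance. The sketch below is an assumption-laden illustration: `pick_elbow` and `min_gain` are hypothetical names, and the scores are made-up placeholder values, not results from this dataset.

```python
import numpy as np

# Made-up mean CV R² values, for illustration only (not from the ovary dataset)
n_archetypes_tested = np.array([3, 4, 5, 6, 7])
mean_r2 = np.array([0.60, 0.72, 0.80, 0.81, 0.815])

def pick_elbow(ks, scores, min_gain=0.02):
    """Return the first k whose successor improves the score by less than min_gain."""
    gains = np.diff(scores)                  # improvement from ks[i] to ks[i + 1]
    flat = np.where(gains < min_gain)[0]
    # If the curve never flattens, fall back to the largest k tested
    return int(ks[flat[0]]) if flat.size else int(ks[-1])

chosen_k = pick_elbow(n_archetypes_tested, mean_r2)
print(chosen_k)
```

The tolerance is a judgment call, so treat the result as a starting point and confirm it against the elbow plot.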

## Step 4: Train Final Model

Train with the selected number of archetypes. Remember to set `pca_key='X_lsi'`.

In [None]:
n_archetypes = 5  # Adjust based on elbow plot / CV results

results = pc.tl.train_archetypal(
    adata,
    n_archetypes=n_archetypes,
    n_epochs=50,
    hidden_dims=[256, 128, 64],
    pca_key="X_lsi",
    device="cpu",
)

print(f"Final archetype R²: {results.get('final_archetype_r2', 'N/A')}")

In [None]:
# Training metrics
pc.pl.training_metrics(results["history"])

## Step 5: Archetype Distances, Positions & Assignments

Compute distances from each cell to each archetype in LSI space,
then assign cells to their nearest archetype.

In [None]:
# Compute distances in LSI space
pc.tl.archetypal_coordinates(adata, pca_key="X_lsi")

print(f"Archetype positions: {adata.uns['archetype_coordinates'].shape}")
print(f"Distance matrix: {adata.obsm['archetype_distances'].shape}")

In [None]:
# Assign cells to archetypes (top 10% closest per archetype)
pc.tl.assign_archetypes(adata, percentage_per_archetype=0.1)
adata.obs["archetypes"].value_counts()
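
The distance and assignment steps can be sketched in plain NumPy as follows. This is a guess at the general logic rather than PEACH's actual code: it assumes Euclidean distance, takes the closest 10% of cells per archetype, leaves the remaining cells unassigned (label `-1`), and resolves overlaps in favor of the nearer archetype. All data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
cells = rng.normal(size=(100, 5))       # synthetic cells in a 5-D embedding
archetypes = rng.normal(size=(3, 5))    # synthetic archetype positions
n_cells, n_arch = cells.shape[0], archetypes.shape[0]

# Pairwise Euclidean distances, shape (n_cells, n_archetypes)
dist = np.linalg.norm(cells[:, None, :] - archetypes[None, :, :], axis=2)

fraction = 0.1
n_top = max(1, int(round(fraction * n_cells)))

# Start with every cell unassigned (-1), then fill in each archetype's closest cells
labels = np.full(n_cells, -1)
best = np.full(n_cells, np.inf)
for k in range(n_arch):
    closest = np.argsort(dist[:, k])[:n_top]
    # A cell claimed by two archetypes stays with the nearer one
    better = dist[closest, k] < best[closest]
    winners = closest[better]
    labels[winners] = k
    best[winners] = dist[winners, k]

counts = np.bincount(labels[labels >= 0], minlength=n_arch)
print(dist.shape, counts)
```

Under this scheme most cells stay unassigned, which is why the value counts in the cell above are expected to include a large unassigned group.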

In [None]:
# Extract barycentric weight matrix (A matrix: rows sum to 1)
weights = pc.tl.extract_archetype_weights(adata, pca_key="X_lsi")
print(f"Weight matrix: {weights.shape}")
print(f"Row sums: min={weights.sum(axis=1).min():.4f}, max={weights.sum(axis=1).max():.4f}")
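
For intuition about why the rows sum to 1, the sketch below builds a barycentric weight matrix from unconstrained scores with a softmax and reconstructs each cell as a convex combination of archetypes. Using a softmax is an assumption made for illustration (it is a common choice in neural archetypal models), and the names `logits`, `A`, and `Z` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(4, 6))             # 4 synthetic archetypes in a 6-D embedding
logits = rng.normal(size=(50, 4))       # unconstrained per-cell scores

# Softmax maps each row to non-negative weights that sum to 1
shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
A = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Each cell is then a convex combination of the archetypes
X_hat = A @ Z

print(A.shape, X_hat.shape)
print(f"Row sums: min={A.sum(axis=1).min():.4f}, max={A.sum(axis=1).max():.4f}")
```

Because the weights are non-negative and sum to 1, every reconstructed cell lies inside the convex hull of the archetypes.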

In [None]:
# Visualize archetype positions
pc.pl.archetype_positions(adata)

In [None]:
pc.pl.archetype_positions_3d(adata)

In [None]:
# Archetype usage statistics
pc.pl.archetype_statistics(adata)

## Step 6: Characterization

### Peak Associations
Test which peaks are differentially accessible per archetype. In scATAC-seq,
peaks represent open chromatin regions, so archetype-associated peaks reveal
the regulatory programs driving each extreme cell state.

In [None]:
# With ATAC input the features are peaks, so the "gene" column holds peak names
gene_results = pc.tl.gene_associations(adata, use_layer=None)

sig = gene_results[gene_results["significant"]]
print(f"Total significant peak-archetype associations: {len(sig)}")
print("\nBreakdown by archetype:")
print(sig["archetype"].value_counts())

In [None]:
# Top differentially accessible peaks per archetype
for arch in sorted(sig["archetype"].unique()):
    top = sig[sig["archetype"] == arch].nlargest(5, "log_fold_change")
    print(f"\n{arch}:")
    for _, row in top.iterrows():
        print(f"  {row['gene']:30s} LFC={row['log_fold_change']:.3f} FDR={row['fdr_pvalue']:.2e}")
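
The `fdr_pvalue` column reflects a multiple-testing correction across many peak-archetype tests. As a reference point, here is a small sketch of the Benjamini-Hochberg procedure, one standard way to control the false discovery rate. Whether PEACH uses exactly this procedure is an assumption, and the p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out

# Made-up p-values for five peaks (illustration only)
pvals = [0.001, 0.06, 0.02, 0.20, 0.008]
fdr = benjamini_hochberg(pvals)
significant = fdr < 0.05
print(significant)
```

With tens of thousands of peaks tested per archetype, filtering on the adjusted value rather than the raw p-value keeps the expected share of false positives among the reported peaks bounded.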

### Cell Type Associations (if annotations available)

Test whether specific cell types are enriched or depleted in each archetype.

In [None]:
# Check for cell type annotations
ct_cols = [c for c in adata.obs.columns if 'type' in c.lower() or 'cluster' in c.lower()]
print(f"Candidate annotation columns: {ct_cols}")

if ct_cols:
    ct_col = ct_cols[0]
    print(f"\nUsing '{ct_col}' for conditional associations")
    cond_results = pc.tl.conditional_associations(adata, obs_column=ct_col)
    sig_cond = cond_results[cond_results["significant"]]
    print(f"{len(sig_cond)} significant associations")
    
    for _, row in sig_cond.nlargest(10, "odds_ratio").iterrows():
        direction = "enriched" if row["odds_ratio"] > 1 else "depleted"
        print(f"  {row['archetype']} x {row['condition']}: OR={row['odds_ratio']:.2f} ({direction})")
else:
    print("No cell type annotations found; skipping conditional associations")
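
To help read the `odds_ratio` column, the sketch below computes an odds ratio by hand from a 2x2 table for a single archetype and cell-type pair. The counts are made up for illustration, and the statistical test PEACH applies on top of the odds ratio is not shown here.

```python
# Made-up counts for one archetype / cell-type pair (illustration only)
in_arch_is_type = 40      # assigned to the archetype AND of this cell type
in_arch_other = 60        # assigned to the archetype, any other cell type
out_arch_is_type = 100    # not assigned to the archetype, this cell type
out_arch_other = 800      # not assigned to the archetype, any other cell type

# Odds of being this cell type inside vs. outside the archetype
odds_in = in_arch_is_type / in_arch_other
odds_out = out_arch_is_type / out_arch_other
odds_ratio = odds_in / odds_out

direction = "enriched" if odds_ratio > 1 else "depleted"
print(f"OR={odds_ratio:.2f} ({direction})")   # prints: OR=5.33 (enriched)
```

An odds ratio above 1 means the cell type is over-represented among the archetype's cells, and a value below 1 means it is under-represented.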

## Summary

**AnnData keys created in this tutorial:**

| Key | Description |
|-----|-------------|
| `adata.obsm['X_lsi']` | LSI embeddings (TF-IDF + TruncatedSVD) |
| `adata.uns['lsi']` | Variance ratio and component loadings |
| `adata.uns['archetype_coordinates']` | Archetype positions in LSI space |
| `adata.obsm['archetype_distances']` | Cell-to-archetype distance matrix |
| `adata.obs['archetypes']` | Categorical archetype assignments |
| `adata.obsm['cell_archetype_weights']` | Barycentric coordinates (A matrix) |

**Key takeaway:** scATAC-seq archetypal analysis follows the same workflow as
scRNA-seq except for the preprocessing step. Replace `log-normalize + PCA` with
`TF-IDF + LSI`, then pass `pca_key='X_lsi'` to every function that takes an
embedding key.