# Tutorial 09: scATAC-seq Archetypal Analysis

This tutorial walks through end-to-end archetypal analysis on scATAC-seq data.

**What you'll learn:**
1. TF-IDF + LSI preprocessing for chromatin accessibility data
2. Hyperparameter search with LSI embeddings
3. Training and evaluating the final model
4. Archetype distances, positions, and assignments
5. Peak associations and characterization

**Key difference from scRNA-seq:** scATAC-seq uses TF-IDF + LSI (Truncated SVD)
instead of log-normalization + PCA. PEACH accepts the LSI embedding in place of
PCA coordinates: set `pca_key='X_lsi'` when training.

**Requirements:**
- `peach` (v0.4.0+)
- `scanpy`
- Data: `data/ovary_ATAC.h5ad` (18K cells x 19K peaks)

**WARNING:** The results in this tutorial are not biologically interpretable: the model does not converge and the archetypal $R^2$ is low (the PCHA initialization in Step 4 reports 0.1430). Treat this as a demonstration of the workflow only. Future work will test archetypal analysis starting from raw ATAC peaks, to preserve rank structure in the LSI dimensionality reduction.

In [2]:
import scanpy as sc
import peach as pc
import numpy as np
from pathlib import Path

print(f"PEACH version: {pc.__version__}")

PEACH version: 0.4.0


## Step 1: Load scATAC-seq Data

scATAC-seq data is stored in `adata.X` as a sparse cell-by-peak matrix where each
entry represents the number of Tn5 insertions in a genomic peak for a given cell.
Such matrices are typically very sparse (95-99% zeros) and high-dimensional; this
particular dataset is denser, at 24.9% non-zero entries.

In [10]:
data_path = Path("data/ovary_ATAC.h5ad")  # path relative to the working directory
adata = sc.read_h5ad(data_path)

print(f"Shape: {adata.n_obs:,} cells x {adata.n_vars:,} peaks")
print(f"Existing obsm keys: {list(adata.obsm.keys())}")
print(f"Existing obs columns: {list(adata.obs.columns[:10])}")

# Check sparsity
import scipy.sparse as sp
if sp.issparse(adata.X):
    density = adata.X.nnz / (adata.n_obs * adata.n_vars)
    print(f"Matrix density: {density:.1%} non-zero")

Shape: 18,315 cells x 19,281 peaks
Existing obsm keys: ['X_harmony', 'X_lsi', 'X_umap']
Existing obs columns: ['mapped_reference_assembly', 'alignment_software', 'donor_id', 'self_reported_ethnicity_ontology_term_id', 'donor_living_at_sample_collection', 'donor_menopausal_status', 'sample_uuid', 'sample_preservation_method', 'tissue_ontology_term_id', 'development_stage_ontology_term_id']
Matrix density: 24.9% non-zero


## Step 2: TF-IDF + LSI Preprocessing

`pc.pp.prepare_atacseq()` performs:
1. **TF-IDF normalization**: Term frequency × inverse document frequency, which weights
   each peak by how informative it is across the dataset.
2. **LSI (Latent Semantic Indexing)**: Truncated SVD dimensionality reduction.
   The first component is dropped by default because it typically captures
   sequencing depth rather than biological variation.

The result is stored in `adata.obsm['X_lsi']`, analogous to `X_pca` for RNA.
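
To make the two steps concrete, here is a minimal NumPy sketch of TF-IDF followed by a truncated SVD on a small dense toy matrix. It is not PEACH's implementation: the function name `tfidf_lsi`, the particular IDF formula (`log1p(n_cells / cells_per_peak)`), and the toy data are assumptions chosen for illustration, and a real scATAC matrix would call for sparse routines.

```python
import numpy as np

def tfidf_lsi(counts, n_components=10, drop_first=True):
    """Illustrative TF-IDF + LSI on a dense cells x peaks count matrix (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)

    # Term frequency: each cell's counts divided by that cell's total count
    cell_totals = counts.sum(axis=1, keepdims=True)
    tf = counts / np.maximum(cell_totals, 1.0)

    # Inverse document frequency: peaks open in few cells get larger weights
    n_cells = counts.shape[0]
    cells_per_peak = (counts > 0).sum(axis=0)
    idf = np.log1p(n_cells / np.maximum(cells_per_peak, 1))
    tfidf = tf * idf

    # Truncated SVD: keep the leading singular vectors (no mean-centering)
    k = n_components + 1 if drop_first else n_components
    u, s, _ = np.linalg.svd(tfidf, full_matrices=False)
    embedding = u[:, :k] * s[:k]

    # Optionally discard the first component, which often tracks sequencing depth
    if drop_first:
        embedding = embedding[:, 1:]
    return embedding

rng = np.random.default_rng(0)
toy_counts = rng.poisson(0.3, size=(200, 500))   # 200 toy cells x 500 toy peaks
toy_lsi = tfidf_lsi(toy_counts, n_components=10)
print(toy_lsi.shape)
```

The extra component is computed and then removed so that the returned embedding still has `n_components` columns after the depth-related component is dropped.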

In [11]:
# If pre-computed LSI exists, we can use it directly
# To force recomputation: del adata.obsm['X_lsi']

if "X_lsi" not in adata.obsm:
    pc.pp.prepare_atacseq(
        adata,
        n_components=50,   # 30-50 standard for scATAC
        drop_first=True,   # Drop depth component
    )
else:
    print(f"Using pre-computed X_lsi: {adata.obsm['X_lsi'].shape}")

print(f"\nLSI embeddings: {adata.obsm['X_lsi'].shape}")
if 'lsi' in adata.uns:
    var_ratio = adata.uns['lsi']['variance_ratio']
    print(f"Variance explained: {var_ratio.sum()*100:.1f}% total")
    print(f"  Top 5 components: {var_ratio[:5]*100}")

Using pre-computed X_lsi: (18315, 50)

LSI embeddings: (18315, 50)


## Step 3: Hyperparameter Search

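This is the same hyperparameter search as for scRNA-seq, but pointing to `X_lsi` instead of `X_pca`.
The search cross-validates each candidate number of archetypes (3 to 7 here). As the log
shows, it runs on a 50% subsample of the cells and covers 20 configurations (5 × 4), so
3 folds give 60 training runs.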
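
The fold sizes and run counts reported in the search log can be reproduced with a few lines of arithmetic. The sketch below assumes a 50% subsample followed by a plain 3-way split with `np.array_split`; PEACH's actual splitting logic may differ, and the variable names are illustrative only.

```python
import numpy as np

n_cells = 18315
subsample_fraction = 0.5   # assumption, read off the "50.0%" line in the log
cv_folds = 3

# Subsample the cells used for cross-validation
n_cv_cells = int(n_cells * subsample_fraction)
rng = np.random.default_rng(0)
cv_indices = rng.permutation(n_cells)[:n_cv_cells]

# Each fold serves once as the validation set; the rest is used for training
folds = np.array_split(cv_indices, cv_folds)
fold_sizes = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {i + 1}: {len(train_idx):,} train, {len(val_idx):,} val")

# 5 archetype counts x 4 other settings, each trained once per fold
n_configs = 5 * 4
total_runs = n_configs * cv_folds
print(f"Total training runs: {total_runs}")
```

With these assumptions the split sizes match the log (6,104/3,053, then 6,105/3,052 twice), which is a useful sanity check when changing `cv_folds` or the size of `n_archetypes_range`.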

In [12]:
cv_summary = pc.tl.hyperparameter_search(
    adata,
    n_archetypes_range=[3, 4, 5, 6, 7],
    cv_folds=3,
    max_epochs_cv=15,
    pca_key="X_lsi",  # The only difference from RNA workflows
    device="cpu",
)

# Ranked results
ranked = cv_summary.ranked_configs
for i, config in enumerate(ranked[:5]):
    print(f"  #{i+1}: n_archetypes={config['hyperparameters']['n_archetypes']}, "
          f"R²={config['metric_value']:.4f}")

[OK] Using specified PCA coordinates: adata.obsm['X_lsi'] (18315, 50)
[STATS] DataLoader created: 18315 cells × 50 PCA components
   Config: batch_size=128, workers=0 (Apple Silicon)
 Starting Archetypal Hyperparameter Grid Search
   Search space: 5 × 4 = 20 combinations
   Using fixed inflation: 1.5
   CV folds: 3
   Total training runs: 60
[STATS] Dataset info: 18,315 cells, 50 features
 Subsampled to 9,157 cells (50.0%) for CV
   Fold 1: 6,104 train, 3,053 val
   Fold 2: 6,105 train, 3,052 val
   Fold 3: 6,105 train, 3,052 val

🧪 Configuration 1/20: {'n_archetypes': 3, 'hidden_dims': [128, 64], 'inflation_factor': 1.5, 'use_pcha_init': True, 'use_inflation': True}
     Fold 1/3
       Model input_dim: 50 (auto-detected from data)
Archetypes parameter registered: True
Archetypes requires_grad: True
Deep_AA (Deep Archetypal Analysis) initialized:
  - Single-stage architecture (like Deep_2)
  - Inflation factor: 1.5
  - Direct archetypal coordinates (no bottleneck)

 Consolidated Arche

In [13]:
# Elbow plot to select optimal n_archetypes
pc.pl.elbow_curve(cv_summary)

## Step 4: Train Final Model

Train with the selected number of archetypes. Remember to set `pca_key='X_lsi'`.
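
As a reading aid for the archetype R² values printed during training, the sketch below computes a reconstruction R² under one common definition: one minus the residual sum of squares of `weights @ archetypes` divided by the total sum of squares around the mean cell. This definition, the helper name `archetype_r2`, and the toy data are assumptions; PEACH may compute its metric differently.

```python
import numpy as np

def archetype_r2(X, weights, archetypes):
    """R^2 of the reconstruction X_hat = weights @ archetypes (assumed definition)."""
    X_hat = weights @ archetypes
    sse = np.sum((X - X_hat) ** 2)            # residual sum of squares
    sst = np.sum((X - X.mean(axis=0)) ** 2)   # total sum of squares around the mean cell
    return 1.0 - sse / sst

rng = np.random.default_rng(0)
archetypes = rng.normal(size=(3, 5))             # 3 toy archetypes in 5 dimensions
weights = rng.dirichlet(np.ones(3), size=100)    # rows are non-negative and sum to 1
X = weights @ archetypes + rng.normal(scale=0.01, size=(100, 5))

# Informative weights reconstruct the data well
r2_good = archetype_r2(X, weights, archetypes)

# Uniform weights map every cell to the same point, so R^2 cannot exceed 0
uniform = np.full_like(weights, 1.0 / 3.0)
r2_uniform = archetype_r2(X, uniform, archetypes)

print(f"informative weights: R2={r2_good:.3f}")
print(f"uniform weights:     R2={r2_uniform:.3f}")
```

The uniform case is the relevant comparison for this tutorial: when every cell receives nearly equal weight on all archetypes, the reconstruction collapses toward a single point and R² stays low.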

In [14]:
n_archetypes = 5  # Adjust based on elbow plot / CV results

results = pc.tl.train_archetypal(
    adata,
    n_archetypes=n_archetypes,
    n_epochs=50,
    hidden_dims=[256, 128, 64],
    pca_key="X_lsi",
    device="cpu",
)

print(f"Final archetype R²: {results.get('final_archetype_r2', 'N/A')}")

[OK] Using specified PCA coordinates: adata.obsm['X_lsi'] (18315, 50)
[STATS] DataLoader created: 18315 cells × 50 PCA components
   Config: batch_size=128, workers=0 (Apple Silicon)
Archetypes parameter registered: True
Archetypes requires_grad: True
Deep_AA (Deep Archetypal Analysis) initialized:
  - Single-stage architecture (like Deep_2)
  - Inflation factor: 1.5
  - Direct archetypal coordinates (no bottleneck)
 Initializing with PCHA + inflation_factor=1.5...

 Consolidated Archetype Initialization
   PCHA: True, Inflation: True (factor: 1.5)
   Test inflation: False
Running PCHA initialization...
  Input shape: (1000, 50)
  Target archetypes: 5
Running PCHA with 5 archetypes...
Data shape for PCHA: (50, 1000)
PCHA Results:
  Archetypes shape: (5, 50)
  Archetype R²: 0.1430
  SSE: 42919.9264
  PCHA archetype R²: 0.1430
  Archetype shape: (5, 50)
[OK] Initialized 5 archetypes using PCHA

 Scalar Archetypal Inflation (factor: 1.50)
   [OK] Inflation complete
      [STATS] Positioni

In [15]:
# Training metrics
pc.pl.training_metrics(results["history"])

## Step 5: Archetype Distances, Positions & Assignments

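Compute distances from each cell to each archetype in LSI space,
then label the cells closest to each archetype. `pc.tl.assign_archetypes` labels
only the top 10% of cells per archetype rather than every cell. Its output also
includes a central `archetype_0` bin (generalist cells), so the five learned
archetypes appear as `archetype_1` to `archetype_5` and all remaining cells as
`no_archetype`.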
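
The distance and selection steps can be sketched in plain NumPy. This is an illustration on random toy data, assuming Euclidean distances and a simple "closest 10% per archetype" rule; it does not reproduce PEACH's handling of the central bin or of cells that fall into several bins.

```python
import numpy as np

rng = np.random.default_rng(0)
cells = rng.normal(size=(1000, 50))      # toy cells in a 50-dimensional embedding
archetypes = rng.normal(size=(5, 50))    # toy archetype positions in the same space

# Pairwise Euclidean distances, shape (n_cells, n_archetypes)
diff = cells[:, None, :] - archetypes[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=2))

# Nearest archetype for every cell
nearest = distances.argmin(axis=1)

# Closest 10% of cells for each archetype (a cell can appear in several bins)
n_top = cells.shape[0] // 10
top_cells = {
    k: np.argsort(distances[:, k])[:n_top]
    for k in range(archetypes.shape[0])
}

print(distances.shape)
print({k: len(idx) for k, idx in top_cells.items()})
```

The two views answer different questions: `nearest` partitions all cells, while the per-archetype top fraction yields equally sized bins that leave many cells unlabeled.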

In [16]:
# Compute distances in LSI space
pc.tl.archetypal_coordinates(adata, pca_key="X_lsi")

print(f"Archetype positions: {adata.uns['archetype_coordinates'].shape}")
print(f"Distance matrix: {adata.obsm['archetype_distances'].shape}")

 Computing archetype distances in PCA space...
   Canonical reference: adata.obs.index (18315 cells)
   Found PCA coordinates: X_lsi (18315, 50)
   Found archetype coordinates: archetype_coordinates (5, 50)
 Computing pairwise distances in PCA space...
[OK] Distance computation complete
   Distance matrix shape: (18315, 5)
[OK] Stored in AnnData:
   adata.obsm['archetype_distances']: (18315, 5) distance matrix
   adata.uns['archetype_positions']: (5, 50) archetype positions
   adata.uns['archetype_distance_info']: distance computation metadata

[STATS] Distance Statistics:
   Nearest archetype distribution:
      Archetype 0: 2795 cells (15.3%), mean distance: 33.5142
      Archetype 1: 451 cells (2.5%), mean distance: 34.2035
      Archetype 2: 5508 cells (30.1%), mean distance: 33.3467
      Archetype 3: 9552 cells (52.2%), mean distance: 33.0764
      Archetype 4: 9 cells (0.0%), mean distance: 34.7673
   Overall statistics:
      Mean nearest distance: 33.2531
      Distance range:

In [17]:
# Assign cells to archetypes (top 10% closest per archetype)
pc.tl.assign_archetypes(adata, percentage_per_archetype=0.1)
adata.obs["archetypes"].value_counts()

 AnnData-centric archetype binning...
   Distance matrix: (18315, 5) (from adata.obsm['archetype_distances'])
   Canonical cell reference: adata.obs.index (18315 cells)
   Selecting top 1831 cells (10.0%) per archetype
   INCLUDING central archetype_0 (generalist cells)
   Archetype 0 (central): 1831 cells, centroid distance range: [34.8888, 34.9877], mean: 34.9611
   Archetype 1: 1831 cells, distance range: [30.8723, 33.4236], mean: 33.0499
   Archetype 2: 1831 cells, distance range: [31.7376, 35.0504], mean: 34.4331
   Archetype 3: 1831 cells, distance range: [30.5359, 33.0230], mean: 32.6955
   Archetype 4: 1831 cells, distance range: [28.8842, 32.4825], mean: 31.8148
   Archetype 5: 1831 cells, distance range: [34.3878, 36.4911], mean: 36.1245

[STATS] Assignment Summary:
   Total cells: 18315
   Archetype 0 (central): 1831 cells (10.0%)
   Archetype 1: 1831 cells (10.0%)
   Archetype 2: 1831 cells (10.0%)
   Archetype 3: 1831 cells (10.0%)
   Archetype 4: 1831 cells (10.0%)
   Arc

archetypes
no_archetype    8816
archetype_1     1671
archetype_2     1665
archetype_4     1643
archetype_5     1615
archetype_3     1541
archetype_0     1364
Name: count, dtype: int64

In [18]:
# Extract barycentric weight matrix (A matrix: rows sum to 1)
weights = pc.tl.extract_archetype_weights(adata, pca_key="X_lsi")
print(f"Weight matrix: {weights.shape}")
print(f"Row sums: min={weights.sum(axis=1).min():.4f}, max={weights.sum(axis=1).max():.4f}")

[STATS] Using model from adata.uns['trained_model']
[STATS] Extracting weights for 18315 cells...
   PCA shape: (18315, 50)
   Device: cpu
   Batch size: 256
   Processed 18315/18315 cells...
[OK] Stored cell weights in adata.obsm['cell_archetype_weights']
   Shape: (18315, 5)
   Range: [0.059, 0.471]
   Mean sum: 1.0000

[STATS] Archetype weight statistics:
   Archetype 0: mean=0.200, std=0.022, max=0.372, dominant in 0 cells
   Archetype 1: mean=0.200, std=0.025, max=0.340, dominant in 0 cells
   Archetype 2: mean=0.200, std=0.023, max=0.471, dominant in 0 cells
   Archetype 3: mean=0.200, std=0.024, max=0.410, dominant in 0 cells
   Archetype 4: mean=0.200, std=0.024, max=0.284, dominant in 0 cells
Weight matrix: (18315, 5)
Row sums: min=1.0000, max=1.0000


In [19]:
# Visualize archetype positions
pc.pl.archetype_positions(adata)

<Figure size 1500x600 with 3 Axes>

In [20]:
pc.pl.archetype_positions_3d(adata)

<Figure size 1200x1000 with 2 Axes>

In [21]:
# Archetype usage statistics
pc.pl.archetype_statistics(adata)

[STATS] Archetype Statistics
Number of archetypes: 5
Embedding dimensions: 50

Distance statistics:
  Mean distance: 55.0217
  Std distance:  2.5060
  Min distance:  51.7099
  Max distance:  61.1572
  Range:         9.4474

Nearest archetypes:  A2 - A4
Farthest archetypes: A2 - A5


{'n_archetypes': 5,
 'n_dimensions': 50,
 'mean_distance': 55.02172860349238,
 'std_distance': 2.5060175772077207,
 'min_distance': 51.70987714135885,
 'max_distance': 61.15724285281022,
 'distance_range': 9.447365711451368,
 'distance_matrix': array([[ 0.        , 54.91483488, 53.69872152, 53.00294838, 56.02309706],
        [54.91483488,  0.        , 55.20350812, 51.70987714, 61.15724285],
        [53.69872152, 55.20350812,  0.        , 53.0033917 , 54.70887264],
        [53.00294838, 51.70987714, 53.0033917 ,  0.        , 56.79479174],
        [56.02309706, 61.15724285, 54.70887264, 56.79479174,  0.        ]]),
 'nearest_pair': (1, 3),
 'farthest_pair': (1, 4),
 'hull_volume': None,
 'hull_area': None}

## Step 6: Characterization

### Peak Associations
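Test which peaks are differentially accessible per archetype. In scATAC-seq,
peaks represent open chromatin regions, so archetype-associated peaks reveal
the regulatory programs driving each extreme cell state. The function is shared
with the RNA workflow, so its log and its result columns (for example `gene`)
refer to features as genes.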
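
Because one test is run per peak and archetype, the raw p-values need multiple-testing correction; the log names Benjamini-Hochberg for this. Below is a minimal NumPy sketch of that adjustment. The helper name `benjamini_hochberg` and the toy p-values are illustrative, and in practice the p-values would come from a rank test such as Mann-Whitney U.

```python
import numpy as np

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)

    # Scale each sorted p-value by n / rank
    scaled = p[order] * n / np.arange(1, n + 1)

    # Enforce monotonicity, working from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]

    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

toy_pvalues = [0.01, 0.04, 0.03, 0.005]
toy_fdr = benjamini_hochberg(toy_pvalues)
print(toy_fdr)
```

An association would then be called significant when its adjusted value falls below the chosen FDR threshold, for example 0.05.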

In [22]:
gene_results = pc.tl.gene_associations(adata, use_layer=None)

sig = gene_results[gene_results["significant"]]
print(f"Total significant peak-archetype associations: {len(sig)}")
print(f"\nBreakdown by archetype:")
print(sig["archetype"].value_counts())

🧪 Testing archetype-gene associations (AnnData-centric)...
   Method: mannwhitneyu
   FDR correction: benjamini_hochberg (global scope)
   Test direction: two-sided
   Bin proportion: 0.1 (closest cells to each archetype)
   Minimum cells per archetype: 10
   Comparison group: all
   [OK] AnnData validation passed:
      Distance matrix: (18315, 5) from adata.obsm['archetype_distances']
      Archetype assignments: 18315 cells from adata.obs['archetypes']
      Assignment categories: ['archetype_0', 'archetype_1', 'archetype_2', 'archetype_3', 'archetype_4', 'archetype_5', 'no_archetype']
   Using adata.X
   Expression data: 18315 cells × 19281 genes
   Sparse matrix: 75.1% zeros
   Original format: csr
    Using AnnData archetype assignments for binning...
      Found 6 archetype categories: ['archetype_0', 'archetype_1', 'archetype_2', 'archetype_3', 'archetype_4', 'archetype_5']
         no_archetype: 8816 cells (48.1%)
         archetype_1: 1671 cells (9.1%)
         archetype_2: 1

In [23]:
# Top differentially accessible peaks per archetype
for arch in sorted(sig["archetype"].unique()):
    top = sig[sig["archetype"] == arch].nlargest(5, "log_fold_change")
    print(f"\n{arch}:")
    for _, row in top.iterrows():
        print(f"  {row['gene']:30s} LFC={row['log_fold_change']:.3f} FDR={row['fdr_pvalue']:.2e}")


archetype_0:
  ENSG00000173068                LFC=0.218 FDR=3.77e-11
  ENSG00000157827                LFC=0.207 FDR=2.91e-18
  ENSG00000196189                LFC=0.173 FDR=3.75e-08
  ENSG00000166250                LFC=0.172 FDR=1.32e-10
  ENSG00000007237                LFC=0.149 FDR=2.31e-08

archetype_1:
  ENSG00000148848                LFC=0.373 FDR=1.46e-45
  ENSG00000173068                LFC=0.288 FDR=1.99e-27
  ENSG00000104490                LFC=0.286 FDR=2.33e-31
  ENSG00000145685                LFC=0.269 FDR=6.24e-33
  ENSG00000184719                LFC=0.243 FDR=7.66e-25

archetype_2:
  ENSG00000109906                LFC=0.365 FDR=6.53e-52
  ENSG00000142611                LFC=0.352 FDR=9.31e-37
  ENSG00000058453                LFC=0.336 FDR=3.38e-43
  ENSG00000173175                LFC=0.331 FDR=9.14e-40
  ENSG00000079432                LFC=0.318 FDR=4.43e-38

archetype_3:
  ENSG00000042832                LFC=0.247 FDR=9.24e-18
  ENSG00000124788                LFC=0.233 FDR=8

### Cell Type Associations (if annotations available)

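Test whether specific cell types are enriched or depleted in each archetype. The cell below picks the first column whose name contains `type` or `cluster`. In this run that is `suspension_type`, which has a single value (`nucleus`), so no association can be significant. Set `ct_col` to an informative column such as `cell_type` for a meaningful test.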
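
For intuition about the enrichment test, the sketch below computes a one-sided hypergeometric p-value and an odds ratio from the four counts of a 2 × 2 table, using only the standard library. The helper names and toy counts are hypothetical, and PEACH's own test may differ in sidedness and correction.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_condition, n_archetype, n_overlap):
    """P(overlap >= n_overlap) when n_archetype cells are drawn without replacement."""
    upper = min(n_condition, n_archetype)
    tail = sum(
        comb(n_condition, i) * comb(n_total - n_condition, n_archetype - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_archetype)

def odds_ratio(n_total, n_condition, n_archetype, n_overlap):
    """Odds ratio of the 2 x 2 table (archetype membership x condition membership)."""
    a = n_overlap                                         # in archetype, in condition
    b = n_archetype - n_overlap                           # in archetype, not in condition
    c = n_condition - n_overlap                           # not in archetype, in condition
    d = n_total - n_condition - n_archetype + n_overlap   # in neither
    return (a * d) / (b * c)

# Toy example: 100 cells, 20 of one cell type, 10 cells in the archetype bin, 6 shared
p_enriched = hypergeom_enrichment_p(100, 20, 10, 6)
or_enriched = odds_ratio(100, 20, 10, 6)
print(f"p={p_enriched:.4f}, OR={or_enriched:.2f}")

# Degenerate case: every cell carries the same label, so the overlap is always complete
p_single_label = hypergeom_enrichment_p(100, 100, 10, 10)
print(f"single-label p={p_single_label:.1f}")
```

The degenerate case shows why an annotation column with a single category cannot produce a significant result: the observed overlap is the only possible outcome.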

In [24]:
# Check for cell type annotations
ct_cols = [c for c in adata.obs.columns if 'type' in c.lower() or 'cluster' in c.lower()]
print(f"Candidate annotation columns: {ct_cols}")

if ct_cols:
    ct_col = ct_cols[0]
    print(f"\nUsing '{ct_col}' for conditional associations")
    cond_results = pc.tl.conditional_associations(adata, obs_column=ct_col)
    sig_cond = cond_results[cond_results["significant"]]
    print(f"{len(sig_cond)} significant associations")
    
    for _, row in sig_cond.nlargest(10, "odds_ratio").iterrows():
        direction = "enriched" if row["odds_ratio"] > 1 else "depleted"
        print(f"  {row['archetype']} x {row['condition']}: OR={row['odds_ratio']:.2f} ({direction})")
else:
    print("No cell type annotations found — skip conditional associations")

Candidate annotation columns: ['suspension_type', 'cell_type_ontology_term_id', 'author_cell_type', 'sub_celltype', 'tissue_type', 'cell_type', 'archetypes']

Using 'suspension_type' for conditional associations
🧪 Testing archetype-conditional associations...
   Condition column: suspension_type
   Method: hypergeometric (boolean matrix approach)
   Testing 6 archetypes × 1 conditions
   Total cells in analysis: 9499
   Condition distribution:
      nucleus: 9499 cells
   Boolean matrices created:
      Archetype matrix: (9499, 6) (cells × archetypes)
      Condition matrix: (9499, 1) (cells × conditions)
[OK] Conditional association testing completed!
   Total tests: 6
   Significant associations: 0 (0.0%)
   Efficiency: 6/6 tests performed (filtered by min_cells)
0 significant associations


## Summary

**AnnData keys created in this tutorial:**

| Key | Description |
|-----|-------------|
| `adata.obsm['X_lsi']` | LSI embeddings (TF-IDF + TruncatedSVD) |
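| `adata.uns['lsi']` | Variance ratio and component loadings (written by `pc.pp.prepare_atacseq()`; absent when a pre-computed `X_lsi` is reused, as in this run) |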
| `adata.uns['archetype_coordinates']` | Archetype positions in LSI space |
| `adata.obsm['archetype_distances']` | Cell-to-archetype distance matrix |
| `adata.obs['archetypes']` | Categorical archetype assignments |
| `adata.obsm['cell_archetype_weights']` | Barycentric coordinates (A matrix) |

**Key takeaway:** scATAC-seq archetypal analysis is identical to scRNA-seq except
for the preprocessing step. Replace `log-normalize + PCA` with `TF-IDF + LSI`,
then set `pca_key='X_lsi'` everywhere.

**WARNING:** The results in this tutorial are not biologically interpretable: the model does not converge and the archetypal $R^2$ is low (the PCHA initialization in Step 4 reports 0.1430). Treat this as a demonstration of the workflow only. Future work will test archetypal analysis starting from raw ATAC peaks, to preserve rank structure in the LSI dimensionality reduction.