### Overview

This notebook demonstrates the ARUNA inference pipeline for upscaling sparse, sequencing-based DNA methylation data to genome-wide resolution using a pretrained model.

Inference assumes that the input data (either simulated RRBS-like data or real RRBS data) have already been converted into the **patch-centric representation** expected by the model. In particular, input patches must match the patch size used during training (here, 128 CpGs per patch). For details, see `data_prep.ipynb`.
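To make the patch-size constraint concrete, here is a minimal sketch of cutting a per-chromosome vector of beta values into fixed-size patches of 128 CpGs. It is not ARUNA's patching code (that lives in `data_prep.ipynb`); the helper name `to_patches`, the use of non-overlapping windows, and NaN padding of the final patch are all assumptions made for illustration.

```python
import numpy as np

PATCH_SIZE = 128  # must match the patch size used during training


def to_patches(betas, patch_size=PATCH_SIZE):
    """Split a 1-D array of beta values into non-overlapping patches.

    Hypothetical helper: pads the tail with NaN so the length becomes a
    multiple of patch_size, then reshapes to (n_patches, patch_size).
    """
    betas = np.asarray(betas, dtype=float)
    n_pad = (-len(betas)) % patch_size  # 0 when the length already divides evenly
    padded = np.concatenate([betas, np.full(n_pad, np.nan)])
    return padded.reshape(-1, patch_size)


# 300 CpGs -> 3 patches of 128, with the last 84 slots padded by NaN
patches = to_patches(np.linspace(0.0, 1.0, 300))
print(patches.shape)  # (3, 128)
```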

While a pretrained ARUNA model can, in principle, be applied across a wide range of sparsity levels, the provided model and hyperparameters were specifically tuned for extreme sparsity regimes, including:
- MCAR missingness ≥ 90%
- RRBS-like missing-not-at-random patterns
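For intuition about the first regime, the sketch below simulates MCAR (missing completely at random) sparsity by dropping each CpG independently with a fixed probability. It is only an illustration of what "MCAR missingness ≥ 90%" means, not the simulation used to prepare ARUNA inputs; the function name `simulate_mcar` and its parameters are hypothetical.

```python
import numpy as np


def simulate_mcar(betas, missing_rate=0.9, seed=0):
    """Hide each value independently with probability `missing_rate`.

    Returns the sparse copy (NaN where hidden) and the Boolean drop mask.
    """
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    drop = rng.random(betas.shape) < missing_rate  # True = simulated missing
    sparse = betas.copy()
    sparse[drop] = np.nan
    return sparse, drop


true_betas = np.random.default_rng(1).random(10_000)  # toy beta values in [0, 1)
sparse, drop = simulate_mcar(true_betas, missing_rate=0.9)
print(f"fraction missing: {drop.mean():.3f}")
```

An RRBS-like pattern differs in that missingness depends on genomic context, so it cannot be reproduced by an independent coin flip per CpG.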

---

### Requirements

> **Note:** This notebook assumes that all preprocessing steps in `data_prep.ipynb` have already been completed.

The following artifacts must be available:

- **Pretrained ARUNA model**  
  A PyTorch checkpoint (`.pth`) located under:
  `ARUNA/checkpoints/`

- **Patchified sparse methylation data**  
  Patch-centric data generated by `data_prep.ipynb`, stored under:
  `ARUNA/data/<dataset>/`
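A quick pre-flight check can catch a missing artifact before inference starts. The sketch below assumes only the two locations listed above, relative to the repository root; the helper name `missing_artifacts` is hypothetical and is not part of the ARUNA package.

```python
from pathlib import Path


def missing_artifacts(root, dataset):
    """Return human-readable problems; an empty list means both artifacts were found."""
    root = Path(root)
    problems = []

    # At least one PyTorch checkpoint is expected under <root>/checkpoints/
    ckpt_dir = root / "checkpoints"
    if not (ckpt_dir.is_dir() and any(ckpt_dir.glob("*.pth"))):
        problems.append(f"no .pth checkpoint under {ckpt_dir}")

    # Patchified data is expected under <root>/data/<dataset>/
    data_dir = root / "data" / dataset
    if not data_dir.is_dir():
        problems.append(f"missing data directory {data_dir}")

    return problems


# Run from the repository root; "gtex" is the dataset name used in this notebook
print(missing_artifacts(".", "gtex"))
```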

---

### Outputs

Running this notebook will produce:
- Chromosome-wide predicted methylomes and associated evaluation masks written to: `ARUNA/results/`
- Example output format:
```text
results/
├── eval_mask
│   └── chr21.csv
└── pred_betas
    └── chr21.csv
```
- `eval_mask`: a Boolean mask over every CpG on the chromosome. `True` marks a CpG that was observed in the (known or matched) ground truth and was either simulated as missing or is missing in the corresponding RRBS sample.
- `pred_betas`: the predicted beta value at every CpG on the chromosome.


These outputs can be directly used for downstream evaluation, visualization, or comparison with baseline imputation methods.
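As one example of downstream evaluation, the sketch below computes the mean absolute error only at positions where the evaluation mask is `True`. It uses small in-memory tables rather than the real CSVs; the layout (rows are CpGs, columns are samples), the helper name `masked_mae`, and the sample name `s1` are assumptions, so check them against the files in `ARUNA/results/` before adapting it.

```python
import numpy as np
import pandas as pd


def masked_mae(pred, truth, eval_mask):
    """Mean absolute error restricted to entries where eval_mask is True."""
    mask = eval_mask.to_numpy(dtype=bool)
    err = np.abs(pred.to_numpy(dtype=float) - truth.to_numpy(dtype=float))
    return float(err[mask].mean())


# Toy tables standing in for ground truth, pred_betas and eval_mask
idx = ["cpg1", "cpg2", "cpg3"]
truth = pd.DataFrame({"s1": [0.0, 0.5, 1.0]}, index=idx)
pred = pd.DataFrame({"s1": [0.25, 0.5, 0.5]}, index=idx)
mask = pd.DataFrame({"s1": [True, False, True]}, index=idx)

print(masked_mae(pred, truth, mask))  # 0.375 (errors 0.25 and 0.5 at the masked CpGs)
```

Restricting the metric to the mask keeps CpGs that the model already saw as input from inflating the score.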


In [1]:
%cd ..

/home/js228/ARUNA


In [2]:
import os
from pathlib import Path
from aruna.process_dataset import get_cc_gt
from scripts.inference import get_cpgmask, save_aruna_preds, run_mslice_inference

%load_ext autoreload
%autoreload 2

In [3]:
CWD = os.getcwd()
model_fpath = os.path.join(CWD, "checkpoints", "trained_model.pth")
config_fpath = os.path.join(CWD, "configs", "example_config.yaml")
res_dir = os.path.join(CWD, "results") # save path for preds and evalMask

In [4]:
test_data = "gtex"
chrom = "chr21"
test_regime = "rrbs_sim"
save_path = Path(CWD) / "results"
save_path.mkdir(parents=True, exist_ok=True)

cpgMask_map = get_cpgmask([chrom], test_data) # cpgs missing in ground truth

cc_gt_df, _ = get_cc_gt(test_data, chrom)
canonical_index = cc_gt_df.index # canonical set of hg38 CpGs

# get sample names that need to be imputed, in our case all samples in original dir
base_dir = Path(CWD) / "data" / "gtex_subset"
sample_names = sorted(d.name for d in base_dir.iterdir() if d.is_dir())

Looking for Ground-Truth FractionalMethylation and Coverage files...
All files found!
FM at: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm
COV at: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov
Looking for Ground-Truth FractionalMethylation and Coverage files...
All files found!
FM at: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm
COV at: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov


In [None]:
_, preds_map, evalMask_map = run_mslice_inference(test_data, model_fpath, 
                                                  config_fpath, sample_names, 
                                                  test_regime, cpgMask_map)

Running Aruna inference...
Loading Testing Data...
Getting Patch Data for:
Dataset: gtex
Chr(s): chr21
Patch Type: mpatch
NR: rrbs_sim
#Samples: 16
Current chromosome:  chr21
Looking for Ground-Truth Patchified FM files...
/home/js228/ARUNA/data/gtex/patch_centric/numCpg128/true/FractionalMethylation/chr21_patches.fm.pkl
Patchified data found at: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/true/FractionalMethylation/chr21_patches.fm.pkl
Curent chromosome:  chr21
Looking for Patchified Noise-Simulated FractionalMethylation and Coverage files...
All files exist!
FM at: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/rrbs_sim/FractionalMethylation/chr21_patches.mask.fm.pkl
MASK at: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/rrbs_sim/SimulatedMask/chr21_patches.mask.pkl
Loaded Validation Data with 16 samples and 288960 patches.
#Batches in Validation (batch_dim=256): 1129
Current chromosome:  chr21
#CpG stats under eval: 462299.0 +/- 0.0
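The batch count in the log follows from the patch count and batch size it reports. The short check below reproduces it, assuming the data loader keeps the final partial batch (an assumption, though it is consistent with the 1129 batches shown).

```python
import math

n_patches = 288_960  # "288960 patches" from the log above
batch_size = 256     # "batch_dim=256" from the log above

n_batches = math.ceil(n_patches / batch_size)
last_batch = n_patches - (n_batches - 1) * batch_size  # size of the final, partial batch

print(n_batches, last_batch)  # 1129 192
```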


In [6]:
# save predictions
save_aruna_preds(preds_map = preds_map, 
                 canonical_index = canonical_index, 
                 out_dir = os.path.join(res_dir, "pred_betas"),
                 chrom = chrom)

Saved: /home/js228/ARUNA/results/pred_betas/chr21.csv


In [7]:
# save the evaluation mask
# for simulation studies: evalMask = True for CpGs that are observed in ground truth but simulated missing
# for true RRBS: evalMask = True for CpGs that are observed in matched WGBS and missing in the ground truth RRBS
save_aruna_preds(preds_map = evalMask_map, 
                 canonical_index = canonical_index, 
                 out_dir = os.path.join(res_dir, "eval_mask"),
                 chrom = chrom)

Saved: /home/js228/ARUNA/results/eval_mask/chr21.csv
