### Overview

This notebook demonstrates the preprocessing (“patchification”) pipeline. Precomputing patchified datasets (instead of patchifying on-the-fly) reduces runtime by avoiding repeated merges/joins against the canonical CpG reference.

#### Goal
We start from 16 GTEx samples (subset to chr21) and generate:
1. **MCAR-90** (mcar_90): 90% simulated missingness (training example data).
2. **RRBS-like** (rrbs_sim): RRBS-style missingness pattern (inference demo with the provided checkpoint).
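
As a rough illustration of what "MCAR-90" means, the sketch below hides a random 90% of entries in a toy beta matrix, independent of their values. It is not the aruna.noise_simulators implementation: the helper name `simulate_mcar`, the toy data, and the choice to draw the mask over all CpGs (rather than only observed ones) are assumptions made for illustration.

```python
import numpy as np

def simulate_mcar(betas, missing_rate=0.9, seed=0):
    """Hypothetical helper: hide a random fraction of entries, regardless of value."""
    rng = np.random.default_rng(seed)
    # True marks an entry that is simulated as missing.
    sim_mask = rng.random(betas.shape) < missing_rate
    noisy = betas.astype(float)  # astype returns a copy, so the input is untouched
    noisy[sim_mask] = np.nan
    return noisy, sim_mask

# Toy matrix: 10,000 CpGs (rows) x 4 samples (columns).
toy_betas = np.random.default_rng(1).random((10_000, 4))
noisy_betas, sim_mask = simulate_mcar(toy_betas)
print(f"simulated missing rate: {sim_mask.mean():.3f}")
```

The returned pair mirrors the two outputs described in this notebook: a beta matrix with NaNs at simulated-missing CpGs, and a boolean mask recording which CpGs were hidden.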

#### Requirements
- Input directory containing BED-like per-sample outputs (one subdirectory per sample). Example: /ARUNA/data/gtex_subset.
- Reference genome (hg38) metadata listing the 0-indexed start position of every CpG on each chromosome. Example: /ARUNA/data/metadata/hg38_cpg_py0idx.csv.
- [Optional; required for rrbs_sim] Per-chromosome, per-CpG observation probabilities computed from an RRBS data compendium.
    - Example for chr21: /ARUNA/data/metadata/rrbs_chr_pobs/rrbs_chr21_pobs.tsv
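
One plausible way to turn per-CpG observation probabilities into an RRBS-like mask is to draw an independent Bernoulli outcome for every CpG and sample. The sketch below does exactly that on toy data. Whether aruna.noise_simulators samples this way is an assumption, and the helper name `simulate_rrbs_like` and the toy probabilities are hypothetical.

```python
import numpy as np

def simulate_rrbs_like(betas, p_obs, seed=0):
    """Hypothetical helper: keep each CpG with its own observation probability."""
    rng = np.random.default_rng(seed)
    # Broadcast the per-CpG probabilities (rows) across samples (columns).
    observed = rng.random(betas.shape) < p_obs[:, None]
    sim_mask = ~observed  # True marks a CpG simulated as missing
    noisy = betas.astype(float)  # astype returns a copy
    noisy[sim_mask] = np.nan
    return noisy, sim_mask

# Toy inputs: 4 CpGs x 1,000 samples, with made-up observation probabilities.
p_obs = np.array([0.0, 1.0, 0.5, 0.05])
toy_betas = np.full((4, 1000), 0.5)
noisy, sim_mask = simulate_rrbs_like(toy_betas, p_obs)
print(sim_mask.mean(axis=1))  # per-CpG fraction simulated as missing
```

Unlike MCAR, the missingness here is position-dependent: CpGs with low observation probability are hidden in almost every sample.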

#### Outputs
Running the code blocks below writes to `./data/<dataset>/...` (where `<dataset>` is the user-specified name, e.g., gtex):

- **chrom_centric/:** per-chromosome TSVs with rows = canonical CpG start positions and columns = samples (.fm for beta values, .cov for read depth).
- **patch_centric/:** pickled dicts mapping sample → list of (patch_id, patch_vector, chrom) tuples.
- **true/:** ground-truth data (typically true WGBS for simulations, or matched WGBS from replicates).
- **mcar_90/** and **rrbs_sim/:** simulated-missingness outputs:
    - FractionalMethylation/ contains beta matrices with NaNs at simulated-missing CpGs.
    - SimulatedMask/ contains boolean masks indicating CpGs simulated as missing (distinct from CpGs missing in the original ground truth).
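
To make the patch-centric layout concrete, the following sketch builds a dict of the shape described above (sample → list of (patch_id, patch_vector, chrom)) from toy data and round-trips it through pickle. The windowing rule is an assumption: consecutive, non-overlapping windows of `num_cpg` CpGs, integer patch ids, and a dropped trailing partial window may all differ from what aruna.process_dataset does. `patchify` is a hypothetical helper.

```python
import io
import pickle
import numpy as np

def patchify(betas, chrom, num_cpg=128):
    """Hypothetical helper: split one sample's beta vector into fixed-size patches."""
    n_patches = len(betas) // num_cpg  # assumption: drop the trailing partial patch
    return [
        (patch_id, betas[patch_id * num_cpg:(patch_id + 1) * num_cpg], chrom)
        for patch_id in range(n_patches)
    ]

# Toy data: 2 samples with 1,000 CpGs each on chr21.
rng = np.random.default_rng(0)
pc_map = {f"sample_{i}": patchify(rng.random(1000), "chr21") for i in range(2)}

# Round-trip through pickle (in memory here; the pipeline writes .pkl files to disk).
buffer = io.BytesIO()
pickle.dump(pc_map, buffer)
buffer.seek(0)
restored = pickle.load(buffer)

patch_id, patch_vector, chrom = restored["sample_0"][0]
print(len(restored["sample_0"]), patch_vector.shape, chrom)
```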

NOTE: aruna.process_dataset assumes reference metadata lives at ./data/metadata relative to the current working directory. If your layout differs, update the metadata paths in aruna.process_dataset / aruna.noise_simulators.
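
A quick check along these lines can confirm the working directory before running the pipeline. It only tests for the ./data/metadata directory named in the note; `metadata_dir` is a hypothetical helper, not part of aruna.

```python
import os

def metadata_dir(cwd=None):
    """Hypothetical helper: path where reference metadata is expected, <cwd>/data/metadata."""
    return os.path.join(cwd or os.getcwd(), "data", "metadata")

path = metadata_dir()
if not os.path.isdir(path):
    print(f"Missing metadata directory: {path}")
```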

After running the code blocks below, the data directory should contain the following structure:
<pre>
gtex
├── chrom_centric
│   ├── mcar_90
│   │   ├── FractionalMethylation
│   │   │   └── chr21.mask.fm
│   │   └── SimulatedMask
│   │       └── chr21.mask
│   ├── rrbs_sim
│   │   ├── FractionalMethylation
│   │   │   └── chr21.mask.fm
│   │   └── SimulatedMask
│   │       └── chr21.mask
│   └── true
│       ├── FractionalMethylation
│       │   └── chr21.fm
│       └── ReadDepth
│           └── chr21.cov
└── patch_centric
    └── numCpg128
        ├── mcar_90
        │   ├── FractionalMethylation
        │   │   └── chr21_patches.mask.fm.pkl
        │   └── SimulatedMask
        │       └── chr21_patches.mask.pkl
        └── rrbs_sim
            ├── FractionalMethylation
            │   └── chr21_patches.mask.fm.pkl
            └── SimulatedMask
                └── chr21_patches.mask.pkl
</pre>
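
The chrom-centric files can be inspected with pandas. The sketch below parses a toy TSV with rows = CpG start positions and columns = samples, then computes a per-sample observation rate of the kind the pipeline logs. The exact on-disk layout of the .fm files (a header row, the first column used as the index, missing values written as NaN) is an assumption, and the toy values are made up.

```python
import io
import pandas as pd

# Toy stand-in for a chrom-centric .fm file (tab-separated).
toy_fm = (
    "start\tsample_a\tsample_b\n"
    "100\t0.9\tNaN\n"
    "250\t0.1\t0.2\n"
    "300\tNaN\tNaN\n"
    "420\t0.5\t0.7\n"
)

# Rows are indexed by CpG start position; each column is one sample.
fm = pd.read_csv(io.StringIO(toy_fm), sep="\t", index_col=0)

# Percentage of CpGs with an observed beta value, per sample.
obs_rate = fm.notna().mean() * 100
print(obs_rate)
```

For a real file, replacing `io.StringIO(toy_fm)` with the path to a chr21.fm file should work if the layout assumptions hold.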

In [1]:
%cd ..

/home/js228/ARUNA


In [2]:
import os
import pandas as pd
from aruna.process_dataset import get_cc_gt, get_cc_noisy
from aruna.process_dataset import get_pc_gt, get_pc_noisy

In [3]:
CWD = os.getcwd()
DATA_DIR = os.path.join(CWD, "data", "gtex_subset") # directory containing the BED-like files for all samples

In [4]:
dataset = "gtex"
chrom = "chr21"
num_cpg = 128

In [8]:
# creating ground truth data for later training use
pc_map = get_pc_gt(dataset = dataset, chrom = chrom,
                   num_cpg = num_cpg,
                   data_dir = DATA_DIR)

Current chromosome:  chr21
Looking for Ground-Truth Patchified FM files...
Patchified GT data not found!
Processing for gtex - chr21 and 128 CpGs/Patch...
PatchID-CpG reference maps complete!
Looking for Ground-Truth FractionalMethylation and Coverage files...
All files found!
FM at: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm
COV at: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov
Creating mappings such as sample: [patch_id: patch_betas] with 8 workers...


100%|██████████| 16/16 [00:13<00:00,  1.18it/s]


Patchification complete! Saving FM Data for all samples to: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/true/FractionalMethylation/chr21_patches.fm.pkl...
All files saved to disk!




In [5]:
# creating mcar_90 data for later training use
pc_nrfm_map, pc_nrmask_map = get_pc_noisy(dataset = dataset, chrom = chrom, 
                                          num_cpg = num_cpg, 
                                          nr = "mcar_90", data_dir = DATA_DIR)

Curent chromosome:  chr21
Looking for Patchified Noise-Simulated FractionalMethylation and Coverage files...
One or both files not found!
Processing for gtex - chr21 - mcar_90 and 128 CpGs/Patch...
PatchID-CpG reference maps complete!
Looking for Noise-Simulated FractionalMethylation and Coverage files...
One or both files not found. Creating...
Looking for Ground-Truth FractionalMethylation and Coverage files...
One or both files not found.
Processing for gtex and chr21...
cpgMerged.CpG_report.merged_CpG_evidence.cov files for 16 samples found!
Creating FM and COV files with 8 workers...


100%|██████████| 16/16 [00:01<00:00,  8.42it/s]


CpG Observation rates for gtex - chr21 over ALL samples are: 
Mean:  84.549
Std:  0.931
For gtex - chr21, saving Fractional Methylation Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm...
For gtex - chr21, saving Read Coverage Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov...
All files saved to disk!
Processing for gtex - chr21 - mcar_90...
Creating Noise Simulation files...
Computing metrics for sanity checks...
Avg: 15.451 Std: 0.931 of Originally Missing CpG rates.

Avg: 90.021 Std: 0.044 of Simulated Missing CpG rates.

Avg: 91.561 Std: 0.096 of After-Simulation Missing CpG rates.

For gtex - chr21, saving Fractional Methylation Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/mcar_90/FractionalMethylation/chr21.mask.fm...
Saved!
For gtex - chr21, saving Simulated Mask Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/mcar_90/SimulatedMask/chr21.mask

100%|██████████| 16/16 [00:15<00:00,  1.03it/s]


Patchification complete! Saving FM Data for all samples to: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/mcar_90/FractionalMethylation/chr21_patches.mask.fm.pkl...
Saved!


100%|██████████| 16/16 [00:13<00:00,  1.17it/s]


Patchification complete! Saving SimMask Data for all samples to: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/mcar_90/SimulatedMask/chr21_patches.mask.pkl ...
Saved!
All files saved to disk!
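
The three sanity-check rates in the log above are consistent with a mask drawn independently of the original missingness: a CpG stays observed only if it was observed originally and was not masked. The short calculation below plugs in the logged averages. Treating the two events as independent is an assumption about the MCAR simulator, and working with per-sample averages makes the match approximate.

```python
# Logged averages for gtex - chr21 - mcar_90, converted to fractions.
orig_missing = 15.451 / 100   # originally missing CpG rate
sim_missing = 90.021 / 100    # simulated missing CpG rate

# Under independence, a CpG remains observed only if both events spare it.
expected_after = 1 - (1 - orig_missing) * (1 - sim_missing)
print(f"{100 * expected_after:.2f}%")  # prints 91.56%, close to the logged 91.561
```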




In [6]:
# creating rrbs_sim data for the inference demo
pc_nrfm_map, pc_nrmask_map = get_pc_noisy(dataset = dataset, chrom = chrom,
                                          num_cpg = num_cpg,
                                          nr = "rrbs_sim", data_dir = DATA_DIR)

Curent chromosome:  chr21
Looking for Patchified Noise-Simulated FractionalMethylation and Coverage files...
One or both files not found!
Processing for gtex - chr21 - rrbs_sim and 128 CpGs/Patch...
PatchID-CpG reference maps complete!
Looking for Noise-Simulated FractionalMethylation and Coverage files...
One or both files not found. Creating...
Looking for Ground-Truth FractionalMethylation and Coverage files...
All files found!
FM at: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm
COV at: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov
Processing for gtex - chr21 - rrbs_sim...
Creating Noise Simulation files...
Computing metrics for sanity checks...
Avg: 15.451 Std: 0.931 of Originally Missing CpG rates.

Avg: 93.415 Std: 0.024 of Simulated Missing CpG rates.

Avg: 93.883 Std: 0.177 of After-Simulation Missing CpG rates.

For gtex - chr21, saving Fractional Methylation Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centri

100%|██████████| 16/16 [00:14<00:00,  1.09it/s]


Patchification complete! Saving FM Data for all samples to: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/rrbs_sim/FractionalMethylation/chr21_patches.mask.fm.pkl...
Saved!


100%|██████████| 16/16 [00:13<00:00,  1.16it/s]


Patchification complete! Saving SimMask Data for all samples to: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/rrbs_sim/SimulatedMask/chr21_patches.mask.pkl ...
Saved!
All files saved to disk!


