### Overview

- Walks through the steps of the initial "patchification" process.
- Pre-patchifying the dataset, instead of patchifying on the fly, considerably reduces runtime by avoiding expensive merges/table joins with the reference genome loci.
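
To make the idea of a "patch" concrete, here is a minimal sketch that splits a 1-D vector of per-CpG values into fixed-size patches. It is an illustration only, not the ARUNA implementation: the helper name `patchify` and the NaN padding of the last patch are assumptions, and the real pipeline builds its patches from PatchID-CpG reference maps.

```python
import numpy as np

def patchify(betas, num_cpg=128):
    """Split per-CpG values into rows of exactly num_cpg entries.

    Hypothetical helper: the tail is padded with NaN (an assumption)
    so that the last patch has the same width as the others.
    """
    betas = np.asarray(betas, dtype=float)
    n_patches = -(-betas.size // num_cpg)  # ceiling division
    padded = np.full(n_patches * num_cpg, np.nan)
    padded[:betas.size] = betas
    return padded.reshape(n_patches, num_cpg)

rng = np.random.default_rng(0)
toy_betas = rng.random(300)          # 300 toy methylation fractions
patches = patchify(toy_betas, num_cpg=128)
print(patches.shape)                 # (3, 128)
```

Doing this once and storing the result means later steps can index patches directly instead of re-deriving them for every sample.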

The 16 example GTEx samples (pre-reduced to contain only chr21) undergo two transformations:
1. Simulated RRBS-like missingness using the "rrbs_sim" noise regime.
    - This showcases model inference with the supplied pre-trained model in checkpoints/.
2. Simulated MCAR (missing completely at random) missingness at a 0.9 missing ratio.
    - This supplies example data for training a new model.
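
As a rough illustration of the MCAR regime, the sketch below hides each value independently with probability 0.9. The function name `simulate_mcar`, the use of NaN for hidden values, and the boolean mask layout are assumptions made for this example; they are not taken from `aruna.process_dataset`.

```python
import numpy as np

def simulate_mcar(betas, missing_ratio=0.9, seed=0):
    """Hide each entry independently with probability missing_ratio.

    Hypothetical helper. Returns the noisy copy and a boolean mask
    where True marks an entry hidden by the simulation.
    """
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    sim_mask = rng.random(betas.shape) < missing_ratio
    noisy = betas.copy()
    noisy[sim_mask] = np.nan  # hidden values become NaN
    return noisy, sim_mask

toy = np.random.default_rng(1).random(100_000)
noisy, sim_mask = simulate_mcar(toy, missing_ratio=0.9)
print(f"simulated missing rate: {sim_mask.mean():.3f}")
```

Because the mask ignores both position and value, the hidden entries are missing completely at random, which is what distinguishes this regime from the structured RRBS-like one.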

* NOTE: This code will create, process, and store additional data in the CWD!
* NOTE: `process_dataset` assumes that the pre-computed reference genome metadata is in CWD/data/metadata. If it is not, please change the path in `aruna.process_dataset`.
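
A quick way to confirm the expected layout before running anything is to check for the metadata directory. This is a small sketch using only the standard library; the helper name `metadata_dir_exists` is hypothetical and is not part of ARUNA.

```python
import os

def metadata_dir_exists(cwd=None):
    """Return True if <cwd>/data/metadata is an existing directory."""
    cwd = os.getcwd() if cwd is None else cwd
    return os.path.isdir(os.path.join(cwd, "data", "metadata"))

if not metadata_dir_exists():
    print("data/metadata not found under the CWD; adjust the path first.")
```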

In [1]:
%cd ..

/home/js228/ARUNA


In [6]:
import os
import pandas as pd
from aruna.process_dataset import get_cc_gt, get_cc_noisy
from aruna.process_dataset import get_pc_gt, get_pc_noisy

In [7]:
CWD = os.getcwd()
DATA_DIR = os.path.join(CWD, "data", "gtex_subset") # dir to all sample BED-like files

In [None]:
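dataset = "gtex"   # dataset name
chrom = "chr21"    # chromosome to process
num_cpg = 128      # number of CpGs per patch
nr = "mcar_90"     # noise regime: MCAR at a 0.9 missing ratio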

In [10]:
pc_nrfm_map, pc_nrmask_map = get_pc_noisy(dataset=dataset, chrom=chrom,
                                          num_cpg=num_cpg,
                                          nr=nr, data_dir=DATA_DIR)

Curent chromosome:  chr21
Looking for Patchified Noise-Simulated FractionalMethylation and Coverage files...
One or both files not found!
Processing for gtex - chr21 - mcar_90 and 128 CpGs/Patch...
PatchID-CpG reference maps complete!
Looking for Noise-Simulated FractionalMethylation and Coverage files...
One or both files not found. Creating...
Looking for Ground-Truth FractionalMethylation and Coverage files...
One or both files not found.
Processing for gtex and chr21...
cpgMerged.CpG_report.merged_CpG_evidence.cov files for 16 samples found!
Creating FM and COV files with 8 workers...


100%|██████████| 16/16 [00:01<00:00, 13.10it/s]


CpG Observation rates for gtex - chr21 over ALL samples are: 
Mean:  84.549
Std:  0.931
For gtex - chr21, saving Fractional Methylation Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm...
For gtex - chr21, saving Read Coverage Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov...
All files saved to disk!
Processing for gtex - chr21 - mcar_90...
Creating Noise Simulation files...
Computing metrics for sanity checks...
Avg: 15.451 Std: 0.931 of Originally Missing CpG rates.

Avg: 89.994 Std: 0.046 of Simulated Missing CpG rates.

Avg: 91.546 Std: 0.106 of After-Simulation Missing CpG rates.

For gtex - chr21, saving Fractional Methylation Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/mcar_90/FractionalMethylation/chr21.mask.fm...




Saved!
For gtex - chr21, saving Simulated Mask Data for all samples to: /home/js228/ARUNA/data/gtex/chrom_centric/mcar_90/SimulatedMask/chr21.mask...
Saved!

All files saved to disk!
Creating mappings such as sample: [patch_id: patch_betas] with 8 workers...


100%|██████████| 16/16 [00:08<00:00,  1.85it/s]


Patchification complete! Saving FM Data for all samples to: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/mcar_90/FractionalMethylation/chr21_patches.mask.fm.pkl...
Saved!


100%|██████████| 16/16 [00:08<00:00,  1.94it/s]


Patchification complete! Saving SimMask Data for all samples to: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/mcar_90/SimulatedMask/chr21_patches.mask.pkl ...
Saved!
All files saved to disk!
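
The three sanity-check rates in the log can be related with a short calculation. The sketch assumes that the simulated mask is independent of which CpGs were already missing, so a CpG remains observed only if it was observed originally and was not hidden by the simulation. The variable names are illustrative; the numbers are the averages printed in the log.

```python
observed = 84.549 / 100           # mean CpG observation rate from the log
simulated_missing = 89.994 / 100  # mean simulated missing rate from the log

# Under the independence assumption, a CpG survives only if it was
# observed originally AND was not hidden by the simulated mask.
expected_after = 1 - observed * (1 - simulated_missing)
print(f"expected after-simulation missing rate: {100 * expected_after:.2f}%")
```

The result is within about 0.01 percentage points of the logged after-simulation average of 91.546; a small gap is expected because the log averages per-sample rates.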


