### Overview

This notebook demonstrates the inference pipeline. It assumes that a pretrained model and either simulated or true RRBS data are available, in either case in the patch format expected by the model. Patches must be the same size as those used during model training (here, 128 CpGs). In general, a pretrained model can be applied at any level of sparsity; however, the model hyperparameters were tuned specifically for extreme sparsity (>90% missing, or RRBS-like missing-not-at-random patterns).
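
To make the patch-size constraint concrete, here is a minimal sketch of splitting one sample's vector of CpG values into fixed-size patches. It is illustrative only: `patchify` is a hypothetical helper, not the function used in data_prep.ipynb, and padding the final partial patch with NaN is an assumption.

```python
import numpy as np

PATCH_SIZE = 128  # must match the patch size used during model training


def patchify(values, patch_size=PATCH_SIZE):
    """Split a 1-D array of CpG methylation values into rows of `patch_size`.

    Hypothetical helper: the trailing partial patch is padded with NaN,
    which is an assumption and not necessarily what ARUNA does.
    """
    values = np.asarray(values, dtype=float)
    n_patches = -(-values.size // patch_size)  # ceiling division
    padded = np.full(n_patches * patch_size, np.nan)
    padded[: values.size] = values
    return padded.reshape(n_patches, patch_size)


# 300 CpGs -> 3 patches of 128; the last 84 positions of patch 3 are padding
patches = patchify(np.linspace(0.0, 1.0, 300))
```

A model trained on 128-CpG patches expects inputs whose last dimension is 128, so data patchified with any other size would need to be regenerated before inference.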

#### Requirements
NOTE: Assumes steps in data_prep.ipynb have already been performed.

- Pretrained model. Available under /ARUNA/checkpoints as a PyTorch .pth file.
- Patchified sparse data. Available under /ARUNA/data/`<dataset>` after following the steps in data_prep.ipynb.

#### Outputs
- Predicted methylomes in /ARUNA/results/.



In [1]:
%cd ..

/home/js228/ARUNA


In [2]:
import os
from pathlib import Path
from aruna.process_dataset import get_cc_gt
from scripts.inference import get_cpgmask, save_aruna_preds, run_mslice_inference

%load_ext autoreload
%autoreload 2

In [3]:
CWD = os.getcwd()
model_fpath = os.path.join(CWD, "checkpoints", "trained_model.pth")
config_fpath = os.path.join(CWD, "configs", "example_config.yaml")
res_dir = os.path.join(CWD, "results") # save path for preds and evalMask

In [4]:
test_data = "gtex"
chrom = "chr21"
test_regime = "rrbs_sim"
save_path = Path(CWD) / "results"
save_path.mkdir(parents=True, exist_ok=True)

cpgMask_map = get_cpgmask([chrom], test_data) # cpgs missing in ground truth

cc_gt_df, _ = get_cc_gt(test_data, chrom)
canonical_index = cc_gt_df.index # canonical set of hg38 CpGs

# get sample names that need to be imputed, in our case all samples in original dir
base_dir = Path(CWD) / "data" / "gtex_subset"
sample_names = sorted(d.name for d in base_dir.iterdir() if d.is_dir())

Looking for Ground-Truth FractionalMethylation and Coverage files...
All files found!
FM at: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm
COV at: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov
Looking for Ground-Truth FractionalMethylation and Coverage files...
All files found!
FM at: /home/js228/ARUNA/data/gtex/chrom_centric/true/FractionalMethylation/chr21.fm
COV at: /home/js228/ARUNA/data/gtex/chrom_centric/true/ReadDepth/chr21.cov


In [5]:
_, preds_map, evalMask_map = run_mslice_inference(
    test_data, model_fpath, config_fpath,
    sample_names, test_regime, cpgMask_map,
)

Running Aruna inference...
Loading Testing Data...
Getting Patch Data for:
Dataset: gtex
Chr(s): chr21
Patch Type: mpatch
NR: rrbs_sim
#Samples: 16
Current chromosome:  chr21
Looking for Ground-Truth Patchified FM files...
/home/js228/ARUNA/data/gtex/patch_centric/numCpg128/true/FractionalMethylation/chr21_patches.fm.pkl
Patchified data found at: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/true/FractionalMethylation/chr21_patches.fm.pkl
Curent chromosome:  chr21
Looking for Patchified Noise-Simulated FractionalMethylation and Coverage files...
All files exist!
FM at: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/rrbs_sim/FractionalMethylation/chr21_patches.mask.fm.pkl
MASK at: /home/js228/ARUNA/data/gtex/patch_centric/numCpg128/rrbs_sim/SimulatedMask/chr21_patches.mask.pkl
Loaded Validation Data with 16 samples and 288960 patches.
#Batches in Validation (batch_dim=256): 1129
Current chromosome:  chr21
#CpG stats under eval: 462299.0 +/- 0.0
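
The batch count in the log above follows from the patch count and the batch size. The quick check below assumes the final partial batch is kept rather than dropped (the default behavior of a PyTorch `DataLoader`); whether the pipeline uses that exact mechanism is an assumption.

```python
import math

n_patches = 288960  # from "Loaded Validation Data with 16 samples and 288960 patches."
batch_size = 256    # from "batch_dim=256"

n_batches = math.ceil(n_patches / batch_size)
last_batch = n_patches - (n_batches - 1) * batch_size
print(n_batches, last_batch)  # 1129 192
```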


In [6]:
# save predictions
save_aruna_preds(preds_map = preds_map, 
                 canonical_index = canonical_index, 
                 out_dir = os.path.join(res_dir, "pred_betas"),
                 chrom = chrom)

Saved: /home/js228/ARUNA/results/pred_betas/chr21.csv


In [7]:
# save the evaluation mask
# for simulation studies: evalMask = True for CpGs that are observed in the ground truth but simulated as missing
# for true RRBS: evalMask = True for CpGs that are observed in matched WGBS and missing in the ground-truth RRBS
save_aruna_preds(preds_map = evalMask_map, 
                 canonical_index = canonical_index, 
                 out_dir = os.path.join(res_dir, "eval_mask"),
                 chrom = chrom)

Saved: /home/js228/ARUNA/results/eval_mask/chr21.csv
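
The simulation-study definition of the evaluation mask given in the comments above can be illustrated on toy data. This is a sketch of the definition only, not the code inside `run_mslice_inference`; the array names and values are made up for illustration.

```python
import numpy as np

# Toy ground truth for 6 CpGs in one sample; NaN = not observed in ground truth
ground_truth = np.array([0.9, np.nan, 0.1, 0.5, np.nan, 0.7])
# Toy simulated-missingness mask; True = hidden from the model input
sim_missing = np.array([True, True, False, True, False, False])

# Evaluate only where a true value exists AND the model had to impute it
observed_in_gt = ~np.isnan(ground_truth)
eval_mask = observed_in_gt & sim_missing
print(eval_mask)  # [ True False False  True False False]
```

CpG 2 is simulated as missing but has no ground-truth value, so it is excluded: there is nothing to compare a prediction against.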
