# **Tutorial 1.5. Preprocessing Data for gReLU Models**

<a href="https://github.com/Genentech/gReLU/blob/main/README.md" target="_blank">gReLU</a> is "a Python library to train, interpret, and apply deep learning models to DNA sequences". As explained in the "Software libraries for model building" section, the gReLU library contains a model zoo that gives easy access to several models, such as Borzoi, Enformer, and a dilated convolutional model based on the BPNet architecture. Borzoi and Enformer can be imported pre-trained. Simpler base models and convolutional neural networks are also available. <br>

This tutorial explains the pre-processing steps applied to the data used to train gReLU models in Tutorial 2 (Training Models with gRelu and Examining Pitfalls). While the pre-processing steps in Tutorial 1 were written for a more general, hypothetical model, processing data for gReLU models requires slightly different naming conventions and dataset objects.

To explain this process, this tutorial incorporates data pre-processing steps from Tutorial 1 (Loading and Pre-Processing Data from bigWigs) with <a href="https://genentech.github.io/gReLU/tutorials/3_train.html#" target="_blank">gReLU's tutorial</a> on training their models. 

- gRelu's tutorial trains a single-task convolutional regression model to predict total coverage using ATAC-Seq data.
- Models in Tutorial 2 also use a single-task convolutional regression model to predict total coverage.

**Differences in approach:**

| gRelu's Tutorial | Tutorial 2 |
|----------|----------|
| Uses ATAC-Seq Data as input   | Uses ChIP-Seq Data as input  |
| Chromosomes 1-22    | Chromosomes 1-5**  |
| Model trains on Centered Peaks    | Model trains on Thresholded Peaks (big difference in pre-processing)   |
| Input Window: 2114bp    | Input Window: 2048bp  |
| Prediction Resolution: 1000bp    | Prediction Resolution: 512bp  |
| Aim: Understand how to train gRelu models    | Aim: Evaluating training decisions and examining pitfalls    |

<br>
** To discuss distributional differences, Tutorial 2 also uses two models trained on chromosomes 1-22. For these, I repeated the pre-processing in this tutorial and Tutorial 1 using all chromosomes. The exact code and logic for every dataset are contained in several scripts named "get_XYZ_dataset". <br>
<br>For continuity, we use data from the same ChIP-seq experiment from the ENCODE project as in the previous tutorial, <a href="https://www.encodeproject.org/experiments/ENCSR817LUF/" target="_blank">ENCSR817LUF</a>.


In [None]:
import numpy as np
import pandas as pd
import os
import grelu.data.preprocess
import grelu.data.dataset
import grelu.lightning
import grelu.visualize
import pickle
import pyBigWig

<br>To reduce computation time in this tutorial, the first few steps from Tutorial 1 are skipped. These include creating a dataframe of 2048bp regions from chromosomes 1 through 5 and appending arcsinh-transformed coverage values from the bigWig file, averaged per base pair at 512bp resolution. The resulting dataframe was saved as '512bpResolution_p_values.csv.gz'. It is available in the 'raw_data' folder (if downloaded locally) or can be downloaded on Google Colab with the code below.<br><br>
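As a rough reminder of what the skipped steps produce, the sketch below tiles a chromosome into 2048bp regions and reduces one region's per-base coverage to a single arcsinh-transformed value. It is an illustration only: the non-overlapping tiling, the choice of the central 512bp, the helper names `tile_chromosome` and `central_mean_arcsinh`, and the toy coverage values are all assumptions, not the code from Tutorial 1.

```python
import math

INPUT_WINDOW = 2048
OUTPUT_WINDOW = 512


def tile_chromosome(chrom, chrom_len, window=INPUT_WINDOW):
    """Yield non-overlapping (chrom, start, end) regions; a trailing partial window is dropped."""
    for start in range(0, chrom_len - window + 1, window):
        yield chrom, start, start + window


def central_mean_arcsinh(coverage, output_window=OUTPUT_WINDOW):
    """Average per-base coverage over the central output window, then apply arcsinh."""
    buffer_bp = (len(coverage) - output_window) // 2
    central = coverage[buffer_bp:buffer_bp + output_window]
    return math.asinh(sum(central) / len(central))


# Toy chromosome of 10,000 bp (made-up length)
regions = list(tile_chromosome("chr1", 10_000))

# Toy coverage: signal of 3.0 in the central 512 bp, zero in the flanks
toy_coverage = [0.0] * 768 + [3.0] * 512 + [0.0] * 768
value = central_mean_arcsinh(toy_coverage)

print(regions[:2], len(regions), round(value, 3))
```

In the real workflow the coverage values would come from the bigWig file rather than a hand-written list.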

In [None]:
'''
!wget -O raw_data.zip 'https://www.dropbox.com/scl/fo/lcg4akvwi4ib8e9vey1re/ADgF_XwTy18Rend8ZB8YAbs?rlkey=0qgf8yt4exrgt4tu8qu16315d&st=p1dp3rx9&dl=0'

import zipfile

with zipfile.ZipFile('raw_data.zip', 'r') as zip_ref:
    zip_ref.extractall('raw_data')
'''

In [None]:
!wget https://www.encodeproject.org/files/ENCFF601VTB/@@download/ENCFF601VTB.bigWig -O raw_data/ENCFF601VTB.bigWig

Model Constants
-

The models used in Tutorial 2 have an input window of 2048bp and an output window / prediction resolution of 512bp. This is a coarser resolution for training and prediction than that of the hypothetical model in Tutorial 1 (32bp prediction resolution). <br>
<br>Note: gReLU's tutorial uses a similar input window but a larger prediction resolution (1000bp).

In [None]:
# Model constants
INPUT_WINDOW = 2048
OUTPUT_WINDOW = 512
PRED_RES = 512
buffer_bp = (INPUT_WINDOW-OUTPUT_WINDOW)//2
val_chroms = "chr3"
test_chroms = "chr2"
genome = "hg38"
bw_file =  'raw_data/ENCFF601VTB.bigWig' #p-value 


# Model will train and predict on chromosomes 1-5
CHROMOSOMES =np.array(['chr1', 'chr2', 'chr3', 'chr4', 'chr5'])


# hg38 chrom lengths
# Human Genome Assembly GRCh38.p14
CHROM_LEN =np.array([248_956_422, 242_193_529, 198_295_559, 190_214_555, 181_538_259, 170_805_979, 
                     159_345_973, 145_138_636, 138_394_717, 133_797_422, 135_086_622, 133_275_309,
                     114_364_328, 107_043_718, 101_991_189, 90_338_345, 83_257_441, 80_373_285,
                     58_617_616, 64_444_167, 46_709_983, 50_818_468])
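To make the relationship between these constants concrete, the sketch below shows where the 512bp output window sits inside a 2048bp input region. The helper `output_interval` and the example coordinates are hypothetical and only illustrate the arithmetic behind `buffer_bp`.

```python
INPUT_WINDOW = 2048
OUTPUT_WINDOW = 512
buffer_bp = (INPUT_WINDOW - OUTPUT_WINDOW) // 2  # flanking context on each side


def output_interval(start, end, buffer_bp=buffer_bp):
    """Return the central interval that the label is computed over (hypothetical helper)."""
    return start + buffer_bp, end - buffer_bp


# Made-up 2048 bp input region
region = ("chr1", 1_000_000, 1_000_000 + INPUT_WINDOW)
out_start, out_end = output_interval(region[1], region[2])

print(buffer_bp, out_start, out_end, out_end - out_start)
```

The model sees all 2048bp of sequence, but its target summarises only the central 512bp; the 768bp on either side act as context.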



In [None]:
# Load our DNA bins filled with our target values
dna_bins = pd.read_csv('raw_data/512bpResolution_p_values.csv.gz')  # Intervals and arcsinh-transformed coverage from chromosomes 1-5 at 512bp resolution (averaged)

# Set a threshold to filter our data
threshold = 2 # A threshold of 2 coming from research explained in Tutorial 1
thresholdarc = np.arcsinh(threshold) # The threshold has to undergo an arcsinh transformation as well

filt_dna_bins = dna_bins.loc[dna_bins['cov']>thresholdarc].reset_index(drop=True) # Apply the threshold to all chromosomes
print(f"{filt_dna_bins.shape[0]} training/validation/test positions.\n")
print(filt_dna_bins)
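The threshold can be moved into arcsinh space because arcsinh is strictly increasing, so comparing transformed coverage with arcsinh(2) selects the same regions as comparing raw coverage with 2. A minimal check of this, using made-up coverage values:

```python
import math

threshold = 2
thresholdarc = math.asinh(threshold)

raw_cov = [0.0, 0.5, 1.9, 2.0, 2.1, 10.0, 250.0]  # made-up coverage values
arc_cov = [math.asinh(v) for v in raw_cov]

keep_raw = [v > threshold for v in raw_cov]      # threshold on the raw scale
keep_arc = [a > thresholdarc for a in arc_cov]   # threshold on the arcsinh scale

print(round(thresholdarc, 4), keep_raw == keep_arc)
```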


Recap
-

'filt_dna_bins' now contains the 2048bp regions from chromosomes 1 through 5 whose coverage values are above our threshold. Although the dataframe contains the actual coverage values, gReLU's pre-processing functions retrieve them from the bigWig when creating the final training/validation/test sets. What matters here are the **regions** that we have thresholded as peaks, i.e. the thresholded peaks; gReLU's functions will retrieve the coverage values later in the process. This is why the next step keeps only the **thresholded peak regions**. <br>

In gRelu's tutorial, peak regions are retrieved from a narrowpeak file following peak calling with MACS2.<br>
<br><img src="narrowpeak.png" alt="Alt Text" width="600"/>
<br><br>
As you can see, these peak regions vary in size (206bp, 182bp, 150bp). The next step in the gReLU tutorial is to create **centered peak regions**. gReLU's 'grelu.data.preprocess.extend_from_coord' function returns a dataframe of fixed-length regions centered on the summit of each peak (in this case their input window, 2114bp). <br>
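The idea behind centered peaks can be sketched in a few lines. This is not gReLU's implementation of 'extend_from_coord'; the helper `center_on_summit` and the peak coordinates are made up for illustration. It relies only on the narrowPeak convention that the summit is stored as an offset from the peak start.

```python
def center_on_summit(chrom, peak_start, summit_offset, seq_len):
    """Return a fixed-length region centered on the peak summit (hypothetical helper)."""
    summit = peak_start + summit_offset
    start = summit - seq_len // 2
    return chrom, start, start + seq_len


# Made-up peaks as (chrom, start, summit offset from start)
peaks = [("chr1", 10_000, 103), ("chr1", 50_500, 91), ("chr2", 7_200, 75)]
centered = [center_on_summit(c, s, off, seq_len=2114) for c, s, off in peaks]

for region in centered:
    print(region, region[2] - region[1])
```

Every region has the same width regardless of the original peak size, and the summit always sits in the middle. Thresholded peaks, by contrast, keep whatever position the signal happens to have within the tiled window.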

These are the two main methods of determining which regions to use as peak regions. Either approach is valid. However, as explained in the "Training Tricks" section of the markdown book, <a href="https://www.nature.com/articles/s42256-022-00570-9" target="_blank">research</a> has found that thresholded peaks benefit from randomly centered profiles, making convolutional models trained on them more robust without the need for sequence shift augmentations. (Tutorial 2 will explore reverse complement augmentation on the thresholded peak models.)

In [None]:
# Keep only the 'chrom', 'start', and 'end' columns, i.e. our thresholded peak regions
peaks = filt_dna_bins[['chrom', 'start', 'end']]
print(peaks)

<br>
Here we use gReLU's 'filter_blacklist' function, which filters out regions that are within 50bp of a blacklist region. Blacklist regions are regions of the genome that consistently show high signal irrespective of the experiment conducted, leading to an increase in false-positive peaks.
<br><br>

In [None]:
peaks = grelu.data.preprocess.filter_blacklist(
        peaks,
        genome=genome, #hg38
        window=50 # Remove peaks if they are within 50 bp of a blacklist region
    )
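Conceptually, this kind of filter is an interval-overlap test in which each blacklist region is padded by the window size. The sketch below shows that logic on made-up coordinates. The helper `near_blacklist` is hypothetical, and how gReLU treats the exact 50bp boundary is an assumption here.

```python
def near_blacklist(region, blacklist, window=50):
    """True if the region overlaps a blacklist interval padded by `window` bp on each side.

    Intervals are treated as half-open (start inclusive, end exclusive).
    """
    chrom, start, end = region
    for b_chrom, b_start, b_end in blacklist:
        if chrom == b_chrom and start < b_end + window and end > b_start - window:
            return True
    return False


blacklist = [("chr1", 5_000, 6_000)]  # made-up blacklist interval
regions = [
    ("chr1", 2_000, 4_048),  # far upstream: kept
    ("chr1", 3_000, 5_048),  # overlaps the blacklist: dropped
    ("chr1", 6_030, 8_078),  # starts 30 bp after the blacklist: dropped
    ("chr2", 5_000, 7_048),  # different chromosome: kept
]

kept = [r for r in regions if not near_blacklist(r, blacklist)]
print(kept)
```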


Appending Non-Peak Regions to our Data
-

In Tutorial 1, after pre-processing our bigWig data, we were left with thresholded peaks (both regions and coverage). If we trained our model solely on these datapoints, we would fall into one of the pitfalls: unbalanced classes. We need to provide our model with examples of non-peak regions. gReLU's 'grelu.data.preprocess.get_gc_matched_intervals' not only provides non-peak regions but also ensures that they have a GC content (the proportion of guanine (G) and cytosine (C) nucleotides) similar to that of our peaks. GC content bias can arise in high-throughput sequencing experiments such as ATAC-Seq and ChIP-Seq. A <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3639258/" target="_blank">research paper</a> explains that "sequencing data is considered as GC biased if more (or less) reads tend to come from a region with a higher GC content." Keeping the GC content of our non-peak regions similar to that of our peaks guards against this bias.

In [None]:
negatives = grelu.data.preprocess.get_gc_matched_intervals(
    peaks,
    binwidth=0.02, # resolution of measuring GC content
    genome=genome,
    chroms=['chr1', 'chr2', 'chr3', 'chr4', 'chr5'], 
    blacklist=genome, # negative regions overlapping the blacklist will be dropped
    seed=0,
    )
negatives.head(3)
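To see what the `binwidth` argument refers to, the sketch below computes the GC fraction of a few toy sequences and assigns each to a GC bin of width 0.02. A background sequence counts as a candidate match when it falls in the same bin as the peak. The helpers `gc_fraction` and `gc_bin` and the sequences are made up, and gReLU's actual matching procedure is not reproduced here.

```python
def gc_fraction(seq):
    """Proportion of G and C nucleotides in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_bin(seq, binwidth=0.02):
    """Index of the GC-content bin that the sequence falls into (hypothetical helper)."""
    return int(gc_fraction(seq) / binwidth)


# Toy 20 bp sequences (real regions are 2048 bp)
peak_seq = "G" * 5 + "C" * 4 + "A" * 11          # GC fraction 0.45
candidates = {
    "bg1": "A" * 11 + "C" * 5 + "G" * 4,         # GC fraction 0.45
    "bg2": "G" * 6 + "C" * 5 + "T" * 9,          # GC fraction 0.55
    "bg3": "G" * 2 + "C" * 1 + "T" * 17,         # GC fraction 0.15
}

matched = [name for name, seq in candidates.items() if gc_bin(seq) == gc_bin(peak_seq)]
print(gc_fraction(peak_seq), matched)
```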

In [None]:
import grelu.visualize
grelu.visualize.plot_gc_match(
        positives=peaks, negatives=negatives, binwidth=0.02, genome="hg38", figsize=(4, 3)
    )

<br>
This visualisation shows the GC content in both the peaks and non-peak regions so we can ensure they are similar.
<br><br>

In [None]:
# Combining our peak and non-peak data
regions = pd.concat([peaks, negatives]).reset_index(drop=True)
len(regions)

In [None]:
print(regions)

Resampling
-

The models trained on chromosomes 1-5 in Tutorial 2 are used to demonstrate concepts, so they are trained on a small dataset (15,000 training regions) to limit computation and training time. Sklearn's 'resample' is used to downsample the training, validation and test sets after the regions are split by chromosome.

In [None]:
from sklearn.utils import resample

train_size = 15000
valid_size = 1500
test_size = 1500

# 1. Filter the data by chromosome
chr2_data = regions[regions['chrom'] == test_chroms]
chr3_data = regions[regions['chrom'] == val_chroms]
train_data = regions[~regions['chrom'].isin([test_chroms, val_chroms])]

# 2. Downsample chr2 and chr3 data if necessary
# Note: resample() draws with replacement by default (replace=True), so rows can repeat
if len(chr2_data) > test_size:
    chr2_data = resample(chr2_data, n_samples=test_size, random_state=1)

if len(chr3_data) > valid_size:
    chr3_data = resample(chr3_data, n_samples=valid_size, random_state=1)

# 3. Downsample the training data to train_size (15,000) if necessary
if len(train_data) > train_size:
    train_data = resample(train_data, n_samples=train_size, random_state=1)

# 4. Combine the final training, validation, and test sets
train = train_data
val = chr3_data
test = chr2_data

# 5. Print the sizes of each split to verify
print("Training set size:", len(train))
print("Validation set size:", len(val))
print("Test set size:", len(test))
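One detail worth knowing is that `sklearn.utils.resample` samples with replacement unless `replace=False` is passed, so a downsampled set can contain repeated rows. The sketch below illustrates the difference with the standard library's `random` module as a stand-in for sklearn; the population size is made up.

```python
import random

rng = random.Random(1)
population = list(range(20_000))  # stand-in for row indices of the training regions

# Sampling with replacement (analogous to resample's default, replace=True)
with_replacement = rng.choices(population, k=15_000)

# Sampling without replacement (analogous to resample(..., replace=False))
without_replacement = rng.sample(population, k=15_000)

print("unique rows with replacement:   ", len(set(with_replacement)))
print("unique rows without replacement:", len(set(without_replacement)))
```

If every row should appear at most once in a split, sampling without replacement is the safer choice.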

Creating our gRelu datasets
-
Here we use gReLU's "grelu.data.dataset.BigWigSeqDataset" class to create our final training, validation and test sets. These datasets are custom objects made for gReLU models. We pass our regions (e.g. 'train', which holds thresholded peaks and GC-matched non-peak regions) as the intervals. gReLU retrieves the labels from the bw_file ('ENCFF601VTB.bigWig'), aggregating coverage over the central 512bp (PRED_RES). Instead of averaging the central 512bp coverage values, models in Tutorial 2 are trained on the summed coverage values, which are then arcsinh-transformed (label_transform_func).
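The label computation described above can be sketched with NumPy on synthetic data. This is only an illustration of "crop the central window, aggregate, transform", assuming the crop is symmetric; it does not call gReLU, and the coverage values are randomly generated.

```python
import numpy as np

INPUT_WINDOW, PRED_RES = 2048, 512

# Synthetic per-base coverage for one 2048 bp region
rng = np.random.default_rng(0)
coverage = rng.gamma(shape=2.0, scale=1.5, size=INPUT_WINDOW)

# Keep only the central PRED_RES positions
buffer_bp = (INPUT_WINDOW - PRED_RES) // 2
central = coverage[buffer_bp:buffer_bp + PRED_RES]

# Summed label (as used in Tutorial 2) versus averaged label, both arcsinh-transformed
label_sum = np.arcsinh(central.sum())
label_mean = np.arcsinh(central.mean())

print(f"arcsinh(sum) = {label_sum:.3f}, arcsinh(mean) = {label_mean:.3f}")
```

The two labels carry the same information, since the sum is just the mean multiplied by 512, but they live on different scales after the arcsinh transform.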

In [None]:
import grelu.data.dataset
    
train_ds = grelu.data.dataset.BigWigSeqDataset(
    intervals = train,
    bw_files=[bw_file],
    label_len=PRED_RES,
    label_aggfunc="sum",
    #rc=True, # reverse complement
    #max_seq_shift=2, # Shift the sequence
    #augment_mode="random",
    seed=0,
    genome=genome,
    label_transform_func=np.arcsinh
)

In [None]:
val_ds = grelu.data.dataset.BigWigSeqDataset(
    intervals = val,
    bw_files=[bw_file],
    label_len=PRED_RES,
    label_aggfunc="sum", 
    genome=genome,
    label_transform_func=np.arcsinh
)

test_ds = grelu.data.dataset.BigWigSeqDataset(
    intervals = test,
    bw_files=[bw_file],
    label_len=PRED_RES,
    label_aggfunc="sum",
    genome=genome,
    label_transform_func=np.arcsinh
)

len(train_ds), len(val_ds), len(test_ds)

In [None]:
print(type(train_ds))

In [None]:
print(train_ds[0])

Distribution of peaks in our datasets vs chromosomes
-

By keeping only thresholded peaks and adding a matched set of non-peak regions, we have effectively downsampled the majority class (non-peaks).

In [None]:
# Distribution of train_ds vs chromosomes 1, 4, 5 distribution.
labels = np.array(train_ds.get_labels()).reshape(-1)
# Reversing transformations of train_ds labels + thresholding
# i.e. reverse arcsinh, divide by PRED_RES (we thresholded peaks by averaging, not summing)
thresholded_labels = np.where(np.arcsinh(np.sinh(labels) / PRED_RES) >= thresholdarc, 1, 0)
num_ones = np.sum(thresholded_labels)

print(f"% of peaks in train_ds : {num_ones*100/(len(train_ds)):.1f}%")


# Apply the threshold to unfiltered training data
training_chroms = dna_bins[(dna_bins['chrom'] != test_chroms) & (dna_bins['chrom'] != val_chroms)]
thresh_label = np.where(training_chroms['cov'] >= thresholdarc, 1, 0)  # 'cov' is arcsinh-transformed
# Calculate the number of peaks in training chromosomes
num_peaks_train_chroms = np.sum(thresh_label)
percentage_peaks_chroms = num_peaks_train_chroms * 100 / len(training_chroms)

print(f"% of peaks in real distribution training chroms (1, 4, 5): {percentage_peaks_chroms:.1f}%")
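The label check above can be written as a small standalone function. The sketch uses true division to recover the mean coverage; floor division would give the same calls only when the threshold is a whole number. The helper `is_peak` and the summed coverage values are made up for illustration.

```python
import math

PRED_RES = 512
threshold = 2
thresholdarc = math.asinh(threshold)


def is_peak(label, pred_res=PRED_RES, cutoff=thresholdarc):
    """label = arcsinh(summed coverage); compare arcsinh(mean coverage) with the cutoff."""
    mean_cov = math.sinh(label) / pred_res  # undo arcsinh, convert sum to mean
    return math.asinh(mean_cov) >= cutoff


# Made-up summed coverage over 512 bp: means of 0, 1, about 2.93 and 10
summed = (0.0, 512.0, 1500.0, 5120.0)
labels = [math.asinh(s) for s in summed]
calls = [is_peak(label) for label in labels]

print(calls)
```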



In [None]:
# Distribution of val_ds vs chromosome 3's distribution
labels = np.array(val_ds.get_labels()).reshape(-1)
thresholded_labels = np.where(np.arcsinh(np.sinh(labels) / PRED_RES) >= thresholdarc, 1, 0)
num_ones = np.sum(thresholded_labels)

print(f"% of peaks in val_ds : {num_ones*100/(len(val_ds)):.1f}%")


# Apply the threshold to unfiltered chrom3 data
chrom3_data = dna_bins[dna_bins['chrom'] == val_chroms]
thresh_label = np.where(chrom3_data['cov'] >= thresholdarc, 1, 0)  # 'cov' is arcsinh-transformed
# Calculate the number of peaks in chrom3 data
num_peaks_chrom3 = np.sum(thresh_label)
percentage_peaks_chrom3 = num_peaks_chrom3 * 100 / len(chrom3_data)

print(f"% of actual peaks in val chrom (3): {percentage_peaks_chrom3:.1f}%")

In [None]:
# Distribution of test_ds vs chromosome 2's distribution
labels = np.array(test_ds.get_labels()).reshape(-1)
thresholded_labels = np.where(np.arcsinh(np.sinh(labels) / PRED_RES) >= thresholdarc, 1, 0)
num_ones = np.sum(thresholded_labels)

print(f"% of peaks in test_ds : {num_ones*100/(len(test_ds)):.1f}%")


# Apply the threshold to chrom2_data
chrom2_data = dna_bins[dna_bins['chrom'] == test_chroms]
thresh_label = np.where(chrom2_data['cov'] >= thresholdarc, 1, 0)  # 'cov' is arcsinh-transformed
# Calculate the number of peaks in chrom2_data
num_peaks_chrom2 = np.sum(thresh_label)
percentage_peaks_chrom2 = num_peaks_chrom2 * 100 / len(chrom2_data)

print(f"% of actual peaks in test chrom (2): {percentage_peaks_chrom2:.1f}%")


Visualising our datasets
-

In [None]:
import matplotlib.pyplot as plt

def plot_distribution(labels, title):
    labels_flat = labels.flatten()  # Flatten the labels to 1D
    plt.figure(figsize=(10, 6))
    plt.hist(labels_flat, bins=50, color='skyblue', edgecolor='black')
    plt.title(f'Distribution of {title}')
    plt.xlabel('Target Value')
    plt.ylabel('Frequency')
    plt.show()


# Print basic statistics
def print_statistics(labels, title):
    labels_flat = labels.flatten()  # Flatten the labels to 1D
    print(f"Statistics for {title}:")
    print(f"Mean: {np.mean(labels_flat):.2f}")
    print(f"Median: {np.median(labels_flat):.2f}")
    print(f"Standard Deviation: {np.std(labels_flat):.2f}")
    print(f"Min: {np.min(labels_flat):.2f}")
    print(f"Max: {np.max(labels_flat):.2f}\n")


In [None]:
train_labels = np.array(train_ds.get_labels())
plot_distribution(train_labels, 'Training Dataset')
print_statistics(train_labels, 'Training Dataset')

In [None]:
val_labels = np.array(val_ds.get_labels())
plot_distribution(val_labels, 'Validation Dataset')
print_statistics(val_labels, 'Validation Dataset')

In [None]:
test_labels = np.array(test_ds.get_labels())
plot_distribution(test_labels, 'Test Dataset')
print_statistics(test_labels, 'Test Dataset')

<br>
The distributions are skewed towards peaks because we have effectively downsampled the majority class (non-peaks) so that our model has more examples of peaks to train on. gReLU's tutorial ends with these 50-50 peak/non-peak datasets being used to train a convolutional regression model. Tutorial 2 will explore the effects of training a model in this way.
<br><br>

In [None]:
# Saving our datasets as pickle files
if not os.path.exists('datasets'):
    os.makedirs('datasets')

with open('datasets/train_ds_balanced.pkl', 'wb') as f:
    pickle.dump(train_ds, f)


with open('datasets/val_ds_balanced.pkl', 'wb') as f:
    pickle.dump(val_ds, f)


with open('datasets/test_ds_balanced.pkl', 'wb') as f:
    pickle.dump(test_ds, f)
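Loading a saved dataset back is the mirror image of the code above. The sketch below round-trips a small stand-in object through a temporary file. With the real datasets, the file would be one of the 'datasets/*.pkl' files, and gReLU must be installed in the loading environment so that pickle can find the dataset class.

```python
import os
import pickle
import tempfile

# Stand-in for a dataset object (the real objects are gReLU datasets)
toy_ds = {"intervals": [("chr1", 0, 2048)], "labels": [1.44]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_ds.pkl")

    # Save
    with open(path, "wb") as f:
        pickle.dump(toy_ds, f)

    # Load
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_ds)
```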


<br>
To explain several pitfalls in Tutorial 2, datasets were made using these functions and logic using the following scripts:
<br><br>
"get_allpeaks_datasets.py" <br>
"get_real_distribution_datasets.py" <br>
"get_balanced_datasets.py" (this Tutorial) <br>
"get_unsplit_datasets.py" <br>
"get_compare_datasets.py" <br>
"get_unsplit_all_chroms_datasets.py" <br>
"get_allchroms_compare_dataset.py" <br>
<br>
For reproducibility, these scripts are included in this tutorial's repository (if downloaded locally) or can be downloaded with the code below.
<br><br>

In [None]:
'''
!wget -O dataset_scripts.zip 'https://www.dropbox.com/scl/fo/hii7mrxhopz27f9c4z4oh/AFp_kbskmR7uiGzNMPRdkqc?rlkey=78s482a0o62x42uv1kmj6i3v9&st=8eua7xwc&dl=0'

import zipfile

with zipfile.ZipFile('dataset_scripts.zip', 'r') as zip_ref:
    zip_ref.extractall('dataset_scripts')
'''

In [None]:
#!python3 dataset_scripts/get_allpeaks_datasets.py