Data: https://zenodo.org/records/10946767

The goal of the project is to create a deep learning network to predict DNA accessibility across multiple Arabidopsis experiments. The data is available on Zenodo. It contains both raw read coverage files (BigWig files) and peak files in BED format, like the files you used in Assignment 3.  For the project, treat the problem as a regression problem, i.e. your task is to predict read coverage rather than the presence of peaks.

The Zenodo repository also includes a metadata spreadsheet that indicates the source of the biological samples (the project ID and Accession number columns) as well as the plant tissue that was used to generate the samples.

In training and evaluating your models, we suggest you use chromosomes 1-4 for training and validation and chromosome 5 for testing.

In your project we expect you to evaluate different architectures (e.g. purely convolutional vs transformer), explore them in terms of depth and other aspects of the design (e.g. regularization and other features such as layer normalization), and perform an analysis of the filters learned by the network.  The objective is for you to develop some intuition of what works or doesn't work in this domain.

In designing your approach we recommend carefully studying the approach used in the Basenji paper that will be discussed in class.  The following paper is another useful resource:

Toneyan, S., Tang, Z. & Koo, P.K. Evaluating deep learning for predicting epigenomic profiles. Nature Machine Intelligence 4, 1088–1100 (2022).  https://doi.org/10.1038/s42256-022-00570-9


Goal
Predict DNA accessibility sites across different Arabidopsis experiments.

Problem Framing:
Given input DNA sequence, predict read coverage as a continuous, quantitative response variable.

Data
Raw Data
Raw read coverage local filepaths (similar to those from Basenji)
Local filepaths of (BED) peaks (like those from Assignment 3)
Arabidopsis genome

Metadata
Source of biological samples (project ID and Accession)
Plant tissue identifier

Training Data
Chromosomes 1-4 will be randomly split 80/20 into train and validation data. Chromosome 5 will be held out as a test set.
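
A minimal sketch of this split, assuming each training example is identified by a (chromosome, start, end) window. The function name `split_windows`, its parameters, and the toy window list are hypothetical and only illustrate the idea.

```python
import random

def split_windows(windows, test_chrom="Chr5", val_fraction=0.2, seed=42):
    """Split (chrom, start, end) windows into train, validation, and test lists."""
    # Everything on the held-out chromosome goes to the test set
    test = [w for w in windows if w[0] == test_chrom]
    train_val = [w for w in windows if w[0] != test_chrom]
    # Shuffle the remaining windows with a fixed seed, then cut off the validation part
    rng = random.Random(seed)
    rng.shuffle(train_val)
    n_val = int(round(len(train_val) * val_fraction))
    return train_val[n_val:], train_val[:n_val], test

# Toy example: ten non-overlapping 2.5 kb windows on each of Chr1-Chr5
windows = [(f"Chr{c}", i * 2500, (i + 1) * 2500)
           for c in range(1, 6) for i in range(10)]
train, val, test = split_windows(windows)
print(len(train), len(val), len(test))
```

One caveat with a purely random split: if windows overlap (for example a 2.5 kb window moved in 1.25 kb steps), neighboring windows that share sequence can land on both sides of the train/validation boundary, so validation scores may look better than they should. Splitting by contiguous region would avoid this.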

Data Loading
Load the read coverage data from the multiple Arabidopsis experiments, along with the Arabidopsis genome
Create generators for the training, validation, and testing sets
The generators yield one-hot encoded sequences, so that the order of the examples can be re-randomized for each run

Output: List of sequences, and coverage map of those sequences
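
The generator idea above could look roughly like the following. This is a sketch in plain Python rather than the project's actual loader; `one_hot` and `example_generator` are hypothetical names, and encoding N as 0.25 in every row is an assumption borrowed from the padding convention used later in this notebook.

```python
import random

BASES = "ACGT"

def one_hot(sequence):
    """Return a 4 x L nested list; N (or any other symbol) becomes 0.25 in every row."""
    seq = sequence.upper()
    return [[1.0 if b == base else (0.0 if b in BASES else 0.25) for b in seq]
            for base in BASES]

def example_generator(sequences, coverages, shuffle=True, seed=None):
    """Yield (one_hot, coverage) pairs, in a new random order when shuffle is True."""
    order = list(range(len(sequences)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for i in order:
        # Encoding happens lazily, one example at a time
        yield one_hot(sequences[i]), coverages[i]

# Toy data: three short sequences with one coverage value each
seqs = ["ACGT", "NNAC", "GGTT"]
covs = [1.5, 0.0, 3.2]
for x, y in example_generator(seqs, covs, seed=0):
    print(y, x[0])  # coverage and the 'A' row of the encoding
```

Calling `example_generator` again without a seed gives a different order, which is what allows each run or epoch to see the examples in a fresh sequence.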

Biological Datasets
Biologically relevant parts of the genome will be curated into datasets to explore how the models perform with known biological functions using annotations. All of these annotations will hopefully be available on ENCODE or other online resources.

- Promoter dataset
- Enhancer dataset
- CTCF dataset 


Architecture
We are proposing 3 different model architectures based on what we have discussed in class:

- Basset model
Small input sequence length
3 convolutional layers

- Basenji model
Large input sequence length (10s of kb)
4 convolutional layers + 5 dilated convolutional layers (fewer than in the original, since the Arabidopsis genome is roughly 25-30x smaller than the human genome)

- Basenji model with transformers
Use positional encoding + multi-head attention layer
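
To get a feel for how much sequence context each design can see, the receptive field of a stack of 1D convolutions can be worked out with a few lines of arithmetic. The sketch below uses the standard recurrence for stacked layers; the kernel sizes, dilation rates, and pooling sizes in the examples are hypothetical placeholders, not values taken from the Basset or Basenji papers.

```python
def receptive_field(layers):
    """Receptive field (in input bases) of a stack of 1D conv/pool layers.

    layers: list of (kernel_size, dilation, stride) tuples in input-to-output order.
    A pooling layer of width p can be written as (p, 1, p).
    """
    rf, jump = 1, 1
    for kernel, dilation, stride in layers:
        # Each layer widens the field by (kernel - 1) * dilation steps,
        # where one step spans `jump` input bases
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Five ordinary convolutions with kernel size 3
plain = [(3, 1, 1)] * 5
# Five dilated convolutions with kernel size 3 and doubling dilation rates
dilated = [(3, d, 1) for d in (2, 4, 8, 16, 32)]

print(receptive_field(plain))
print(receptive_field(dilated))
```

With the same number of layers and parameters, the dilated stack covers a far wider stretch of sequence, which is the motivation for dilated layers when the input is tens of kb long.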

Hyperparameters
There are various hyperparameters that we aim to experiment with. Since we are predicting a continuous variable (a regression task), we will use a Poisson regression loss function, as done in the Basenji model. We may look into GPyOpt (https://github.com/SheffieldML/GPyOpt) for hyperparameter optimization, but will likely just tune the hyperparameters manually.
Hyperparameters to test (not all hyperparameters apply to all networks):
Learning rate
Number of layers
Batch size
Convolutional filter size
Number of convolutional filters
Input dropout rate (to inform performance on noisy data)
Dropout rate
Num. attention heads
Input layer size
Read length
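
For reference, the Poisson regression loss mentioned above can be written out in a few lines. This is a plain-Python sketch of the mean Poisson negative log-likelihood with the constant log(observed!) term dropped; the function name `poisson_nll` and the small `eps` stabilizer are assumptions made for illustration. It assumes the predictions are positive rates rather than log rates.

```python
import math

def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood, without the log(observed!) term."""
    total = 0.0
    for pred, obs in zip(predicted, observed):
        # For a Poisson rate `pred`, -log P(obs) = pred - obs * log(pred) + log(obs!)
        total += pred - obs * math.log(pred + eps)
    return total / len(predicted)

observed = [2.0, 5.0]
print(poisson_nll([2.0, 5.0], observed))  # predictions equal to the targets
print(poisson_nll([3.0, 4.0], observed))  # predictions off by one
```

The loss is lowest when the predicted rate matches the observed coverage. In PyTorch, `torch.nn.PoissonNLLLoss` provides this loss; its `log_input` argument controls whether the model output is treated as a log rate, so the final activation of the network needs to match that setting.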

Prediction
Our prediction is that the Basenji model with transformers will be the best performing, followed by the basic Basenji model, followed by the Basset model. Since we have already implemented something similar to the Basset model, this will serve as a useful benchmark.

Biological Interpretation
We aim to provide a rigorous interpretation of the biological significance of our model results, taking into account performance across different tissues, optimal read length, optimal convolutional filter size and number of filters, and other relevant aspects of model architecture.


Human introns: 1500bp
Arabidopsis: 150bp

Basset: 600bp
Basenji: 130kb
Window length: 2.5kb

Human genome: 3B bases
Arabidopsis genome: 100 Mb

36 outputs (for each bigwig file)

Predict single value for each 2.5kb segment

Since peaks can be very high

Each label is a 36 dimensional vector
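
A sketch of how the per-window label described in these notes could be built: one summary value per bigWig track, giving one vector per 2.5 kb window. Using the mean as the summary and counting missing (NaN) positions as zero coverage are both assumptions, and `window_label` is a hypothetical helper name.

```python
import math

def window_label(per_base_tracks):
    """Collapse per-base coverage into one mean value per track.

    per_base_tracks: list of lists, one inner list of per-base coverage
    values per bigWig track, all covering the same window.
    NaN entries (positions without data) are counted as zero coverage.
    """
    label = []
    for values in per_base_tracks:
        clean = [0.0 if math.isnan(v) else v for v in values]
        label.append(sum(clean) / len(clean))
    return label

# Toy example: 36 tracks, each with per-base coverage for a 4 bp window
tracks = [[float(t), float(t), float("nan"), float(t)] for t in range(36)]
label = window_label(tracks)
print(len(label), label[:3])
```

A scalar produced this way could also be appended to the FASTA header as a fourth comma-separated field, which is the format the loading code further down reads.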

In [2]:
import collections
import glob
import gzip
import math
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pyBigWig
import random
import scipy.signal

from sklearn import metrics
from sklearn.model_selection import train_test_split

import torch,torchvision
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms import ToTensor
torch.manual_seed(42);

from Bio import SeqIO

device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")
#device = "cpu"

Using mps device


In [20]:
# Create a FASTA file of sequences, plus a matching ".faste" file of per-base
# coverage, from a bigWig file.
# Generates windows of length seq_length, moving the window start by interval
# bases across each chromosome.
# Assumes that the global dict chr_fnames (chromosome ID -> gzipped FASTA path)
# already exists.
def generate_input_files_from_bw(bw_fname,
                                 output_fasta,
                                 output_faste,
                                 seq_length=2500,
                                 interval=1250):
    chrs = ['Chr1', 'Chr2', 'Chr3', 'Chr4', 'Chr5']
    bw = pyBigWig.open(bw_fname)
    # Context managers make sure both output files are flushed and closed
    with open(output_fasta, "w") as fasta_out, open(output_faste, "w") as faste_out:
        for chr_id in chrs:
            chr_len = bw.chroms(chr_id)
            with gzip.open(chr_fnames[chr_id], "rt") as handle:
                for record in SeqIO.parse(handle, "fasta"):
                    chr_seq = str(record.seq)
                    bw_idx = 0
                    while bw_idx + seq_length < chr_len:
                        # Per-base coverage; positions without data are returned as nan
                        coverage = ",".join(map(str, bw.values(chr_id, bw_idx, bw_idx + seq_length)))
                        seq = chr_seq[bw_idx:bw_idx + seq_length]
                        # Header format: chromosome,start,end
                        seq_id = ",".join([chr_id, str(bw_idx), str(bw_idx + seq_length)])
                        fasta_out.write(">" + seq_id + "\n" + seq + "\n")
                        faste_out.write(">" + seq_id + "\n" + coverage + "\n")
                        bw_idx += interval
    bw.close()
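
To make the windowing in the cell above concrete, the following sketch reproduces only the start-coordinate logic of its `while` loop, without touching any files. The helper name `window_starts` and the 10 kb toy chromosome length are hypothetical.

```python
def window_starts(chr_len, seq_length=2500, interval=1250):
    """Start positions visited by the loop `while start + seq_length < chr_len`."""
    starts = []
    start = 0
    while start + seq_length < chr_len:
        starts.append(start)
        start += interval
    return starts

# Toy chromosome of 10,000 bp with the default 2.5 kb window and 1.25 kb step
starts = window_starts(10_000)
print(len(starts), starts[0], starts[-1])
```

With these defaults consecutive windows overlap by half their length, and because the loop condition uses a strict `<`, a window that would end exactly at the chromosome end is skipped.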

In [21]:
base_dir = '../project/chromatin_cs425'
chr_fnames = {'Chr1': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.1.fa.gz',
              'Chr2': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.2.fa.gz',
              'Chr3': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.3.fa.gz',
              'Chr4': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.4.fa.gz',
              'Chr5': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.5.fa.gz'}

input_dirs = [os.path.join(base_dir, 'SRP034156'), os.path.join(base_dir, 'SRP300093')]

bw_fname = os.path.join(base_dir, 'SRP034156', 'SRX1096548_Rep0.rpgc.bw')

generate_input_files_from_bw(bw_fname,
                             '../project/chromatin_cs425/SRP034156/fasta/SRP034156.fasta',
                             '../project/chromatin_cs425/SRP034156/fasta/SRP034156.faste')

In [16]:
# Work in progress: create tissue-level fasta files by aggregating data from bigwig files
# Uses Metadata.csv to determine identification of each bigwig file

metadata = pd.read_csv(os.path.join(base_dir, 'Metadata.csv'))
tissues = metadata['Tissue'].unique()
for tissue in tissues:
    t_fnames = []
    t_metadata = metadata[metadata['Tissue'] == tissue]
    output_train_file = os.path.join(base_dir, 'input_data', tissue, "train_val.fasta")
    output_test_file = os.path.join(base_dir, 'input_data', tissue, "test.fasta")
    for index, row in t_metadata.iterrows():
        project = row.iloc[0]
        accession = row.iloc[1]
        fname = accession + "_Rep0.rpgc.bw"
        fname = os.path.join(base_dir, project, fname)
        t_fnames.append(fname)
    # TODO: 
    print(tissue)
    print(t_fnames)

Leaf
['../project/chromatin_cs425/SRP300093/SRX9770773_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770774_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770775_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770776_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770777_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770778_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770779_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770780_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770781_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770782_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770783_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770784_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770785_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770786_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300093/SRX9770787_Rep0.rpgc.bw', '../project/chromatin_cs425/SRP300

In [None]:
# datasets can be obtained e.g. from:
# https://github.com/MedChaabane/deepRAM/tree/master/datasets/ChIP-seq

# convert sequence to a one-hot encoding
# and pad with a uniform distribution
def seqtopad(sequence, motif_len):
    rows=len(sequence)+2*motif_len-2
    S=np.empty([rows,4])
    base=['A', 'C', 'G', 'T']
    for i in range(rows):
        for j in range(4):
            if (i-motif_len+1<len(sequence) and sequence[i-motif_len+1]=='N' 
                or i<motif_len-1 or i>len(sequence)+motif_len-2):
                S[i,j]=np.float32(0.25)
            elif sequence[i-motif_len+1]==base[j]:
                S[i,j]=np.float32(1)
            else:
                S[i,j]=np.float32(0)
    return np.transpose(S)

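# TODO: determine appropriate motif_len
# Expects FASTA headers of the form "chr,start,end,coverage", where the fourth
# field is a single coverage value for the window. The headers written by
# generate_input_files_from_bw hold only "chr,start,end", so a coverage value
# has to be appended to them before those files can be loaded here.
def load_file(path, motif_len=24):
    dataset=[]
    with open(path, "rt") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            coverage = float(record.id.split(',')[3])
            sequence = str(record.seq)
            dataset.append([seqtopad(sequence, motif_len),[coverage]])
    return dataset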

class chromatin_dataset(Dataset):
    def __init__(self, xy):
        self.x_data=np.array([el[0] for el in xy],dtype=np.float32)
        self.y_data =np.array([el[1] for el in xy ],dtype=np.float32)
        self.x_data = torch.from_numpy(self.x_data)
        self.y_data = torch.from_numpy(self.y_data)
        self.length=len(self.x_data)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.length

def get_train_valid_test_loader(train_fname, test_fname, batch_size=64):
    train_val_dataset=load_file(train_fname)
    # 80/20 train/validation split, as described under Training Data
    train_data, valid_data = train_test_split(train_val_dataset, test_size=0.2)
    test_data = load_file(test_fname)
    print(len(train_data),len(valid_data),len(test_data))
    
    train_dataset=chromatin_dataset(train_data)
    valid_dataset=chromatin_dataset(valid_data)
    test_dataset=chromatin_dataset(test_data)
    
    # Only the training set needs shuffling; a fixed order for validation
    # and test keeps evaluation reproducible
    train_loader = DataLoader(dataset=train_dataset,
                              batch_size=batch_size,shuffle=True)
    valid_loader = DataLoader(dataset=valid_dataset,
                              batch_size=batch_size,shuffle=False)
    test_loader = DataLoader(dataset=test_dataset,
                              batch_size=batch_size,shuffle=False)
    return train_loader, valid_loader, test_loader

train_loader, valid_loader, test_loader=get_train_valid_test_loader(train_fname, test_fname)