Data: https://zenodo.org/records/10946767

The goal of the project is to create a deep learning network that predicts DNA accessibility across multiple Arabidopsis experiments. The data is available on Zenodo. It contains both raw read coverage files (BigWig files) and peak files in BED format, like the files you used in Assignment 3. For the project, treat the problem as a regression problem, i.e. your task is to predict read coverage rather than the presence of peaks.

The Zenodo repository also includes a metadata spreadsheet that gives the source of each biological sample (the project ID and accession number columns) as well as the plant tissue that was used to generate the sample.

In training and evaluating your models, we suggest you use chromosomes 1-4 for training and validation and chromosome 5 for testing.

In your project we expect you to evaluate different architectures (e.g. purely convolutional vs transformer), explore them in terms of depth and other aspects of the design (e.g. regularization and other features such as layer normalization), and perform an analysis of the filters learned by the network. The objective is for you to develop some intuition of what works or doesn't work in this domain.

In designing your approach we recommend carefully studying the approach used in the Basenji paper that will be discussed in class.  The following paper is another useful resource:

Toneyan, S., Tang, Z. & Koo, P.K. Evaluating deep learning for predicting epigenomic profiles. Nature Machine Intelligence 4, 1088–1100 (2022).  https://doi.org/10.1038/s42256-022-00570-9


Goal
Predict DNA accessibility sites across different Arabidopsis experiments.

Problem Framing:
Given input DNA sequence, predict read coverage as a continuous, quantitative response variable.
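
One way to make the regression target concrete is to summarize per-base coverage in fixed-width bins, similar in spirit to the binned targets used by Basenji. The sketch below is a minimal pure-Python illustration; the function name `bin_coverage`, the choice to sum within each bin, and the bin width are assumptions made for illustration, not part of the project specification.

```python
def bin_coverage(coverage, bin_size):
    """Sum per-base coverage into consecutive, non-overlapping bins.

    `coverage` is a sequence of per-base read counts. Trailing positions
    that do not fill a complete bin are dropped (an assumption of this sketch).
    """
    n_bins = len(coverage) // bin_size
    return [
        sum(coverage[i * bin_size:(i + 1) * bin_size])
        for i in range(n_bins)
    ]


# Toy example: 10 positions, bins of 4 bases (the last 2 positions are dropped).
per_base = [0, 1, 3, 2, 5, 5, 4, 0, 1, 1]
targets = bin_coverage(per_base, bin_size=4)
print(targets)
```

Predicting one value per bin rather than per base shortens the output vector by a factor of `bin_size`, at the cost of resolution.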

Data
Raw Data
- Local file paths of the raw read coverage (BigWig) files, similar to those used by Basenji
- Local file paths of the BED peak files, like those from Assignment 3
- Arabidopsis genome

Metadata
- Source of biological samples (project ID and accession number)
- Plant tissue identifier

Training Data
Chromosomes 1-4 will be randomly split 80/20 into train and validation data. Chromosome 5 will be held out as a test set.
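
The split described above can be sketched as follows. This is a minimal illustration, assuming each example is identified by a `(chromosome, start, end)` tuple; the function name `split_windows` and its parameters are hypothetical.

```python
import random


def split_windows(windows, val_fraction=0.2, test_chrom="Chr5", seed=42):
    """Split (chrom, start, end) windows into train, validation, and test lists.

    Windows on `test_chrom` are held out for testing; the remaining windows
    are shuffled and split into train and validation sets.
    """
    test = [w for w in windows if w[0] == test_chrom]
    rest = [w for w in windows if w[0] != test_chrom]
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(rest)
    n_val = int(round(len(rest) * val_fraction))
    return rest[n_val:], rest[:n_val], test


# Toy windows: 10 non-overlapping 1 kb windows per chromosome.
windows = [(f"Chr{c}", i * 1000, (i + 1) * 1000)
           for c in range(1, 6) for i in range(10)]
train, val, test = split_windows(windows)
print(len(train), len(val), len(test))
```

If the windows overlap, a purely random split can place nearly identical sequences in both the training and validation sets, so validation scores may look better than they should. Holding out contiguous regions for validation is one way to avoid this.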

Data Loading
- Load the read coverage data for the multiple Arabidopsis experiments, together with the Arabidopsis genome
- Create training, validation, and testing set generators
- Use generator objects that yield the one-hot encoded training, validation, and testing datasets, so that the datasets can be reshuffled for each run

Output: List of sequences, and coverage map of those sequences
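
The loading steps above can be sketched with a one-hot encoder and a shuffling batch generator. This is a pure-Python illustration under stated assumptions: the channel order A, C, G, T, the all-zero encoding of unknown bases, and the names `one_hot` and `batch_generator` are all choices made for this sketch. A real implementation would more likely return tensors through a PyTorch `Dataset` and `DataLoader`.

```python
import random

# Channel order A, C, G, T is an assumption of this sketch.
ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 floats; unknown bases (e.g. N) become all zeros."""
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def batch_generator(seqs, coverage, batch_size, shuffle=True, seed=None):
    """Yield (encoded_sequences, coverage_targets) batches.

    With shuffle=True and seed=None, the order differs on every call,
    so each training run or epoch sees the examples in a new order.
    """
    order = list(range(len(seqs)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for i in range(0, len(order), batch_size):
        idx = order[i:i + batch_size]
        yield [one_hot(seqs[j]) for j in idx], [coverage[j] for j in idx]


# Toy data: three 4-base sequences, each with two coverage values.
seqs = ["ACGT", "NNAC", "TTGA"]
coverage = [[1, 0], [0, 2], [3, 3]]
for x, y in batch_generator(seqs, coverage, batch_size=2, seed=0):
    print(len(x), y)
```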

Biological Datasets
Biologically relevant parts of the genome will be curated into datasets to explore how the models perform with known biological functions using annotations. All of these annotations will hopefully be available on ENCODE or other online resources.

- Promoter dataset
- Enhancer dataset
- CTCF dataset 


Architecture
We are proposing 3 different model architectures based on what we have discussed in class:

- Basset model
  - Small input sequence length
  - 3 convolutional layers

- Basenji model
  - Large input sequence length (tens of kb)
  - 4 convolutional layers + 5 dilated convolutional layers (the Arabidopsis genome is roughly 20x smaller than the human genome, about 135 Mb versus about 3 Gb)

- Basenji model with transformers
  - Use positional encoding + multi-head attention layer
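
To see why dilated convolutions help with long input sequences, it can be useful to compute the receptive field of a layer stack. The sketch below is a minimal calculation; the kernel sizes, pool widths, and dilation rates are hypothetical values chosen only to illustrate the arithmetic, not the settings of Basset or Basenji.

```python
def receptive_field(layers):
    """Return the receptive field (in input bases) of a stack of 1D layers.

    Each layer is a (kernel_size, dilation, stride) tuple. Pooling layers
    can be described the same way, with stride equal to the pool width.
    """
    rf = 1    # a single output position initially sees one input base
    jump = 1  # distance in input bases between adjacent positions at this depth
    for kernel_size, dilation, stride in layers:
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Hypothetical stack: 4 x (conv of width 5, then max-pool of width 2).
conv_stack = [(5, 1, 1), (2, 1, 2)] * 4
# Hypothetical 5 dilated convs of width 3, with the dilation doubling each layer.
dilated_stack = [(3, 2 ** i, 1) for i in range(5)]

print(receptive_field(conv_stack))
print(receptive_field(conv_stack + dilated_stack))
```

Because each dilated layer doubles its dilation rate, the receptive field grows exponentially with depth, while the number of parameters grows only linearly.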

Hyperparameters
There are various hyperparameters that we aim to experiment with. Since we are predicting a continuous, count-like variable, we will use a Poisson regression loss function, as done in the Basenji model. We may look into GPyOpt (https://github.com/SheffieldML/GPyOpt) for hyperparameter optimization, but will likely just tune the hyperparameters manually.
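
As a reference point, the Poisson loss can be written out directly. The sketch below is a pure-Python illustration, with the function name `poisson_nll` chosen for this example. It computes the mean of `rate - count * log(rate + eps)`, which is the form PyTorch's `torch.nn.PoissonNLLLoss` uses when `log_input=False`.

```python
import math


def poisson_nll(predicted, observed, eps=1e-8):
    """Mean Poisson negative log-likelihood, omitting the log(observed!) term.

    `predicted` holds positive predicted rates and `observed` holds read
    counts. The omitted term does not depend on the prediction, so it has
    no effect on the gradients.
    """
    total = 0.0
    for rate, count in zip(predicted, observed):
        total += rate - count * math.log(rate + eps)
    return total / len(observed)


# Toy comparison: predictions close to the observed counts score lower.
observed = [0, 2, 5, 1]
close = poisson_nll([0.01, 2.0, 5.0, 1.0], observed)
far = poisson_nll([3.0, 0.5, 1.0, 4.0], observed)
print(close < far)
```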
Hyperparameters to test (not all hyperparameters apply to all networks):
- Learning rate
- Number of layers
- Batch size
- Convolutional filter size
- Number of convolutional filters
- Input dropout rate (to inform performance on noisy data)
- Dropout rate
- Number of attention heads
- Input layer size
- Read length
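
For manual experimentation, it can help to enumerate the configurations up front so that each run is logged against an explicit setting. The sketch below is one simple way to do this; the hyperparameter names and values in `search_space` are placeholders, not tuned recommendations.

```python
import itertools

# Hypothetical search space; the values are placeholders for illustration.
search_space = {
    "learning_rate": [1e-3, 1e-4],
    "num_layers": [3, 5],
    "dropout": [0.1, 0.3],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))
print(configs[0])
```

A full grid grows multiplicatively with each added hyperparameter, so in practice it may be necessary to vary only a few at a time or to sample configurations at random.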

Prediction
Our prediction is that the Basenji model with transformers will perform best, followed by the basic Basenji model, followed by the Basset model. Since we have already implemented something similar to the Basset model, it will serve as a useful benchmark.
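
Ranking the models requires a common metric on the held-out chromosome. The proposal does not name one; the sketch below assumes the Pearson correlation between predicted and observed coverage, which is a common choice for coverage regression. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers.

    Returns NaN when either input is constant, since the correlation
    is undefined in that case.
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Toy example: predicted versus observed coverage for five bins.
observed = [0.0, 2.0, 5.0, 1.0, 3.0]
predicted = [0.5, 1.5, 4.0, 1.0, 2.5]
print(round(pearson_r(predicted, observed), 3))
```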

Biological Interpretation
We aim to provide a rigorous interpretation of the biological significance of our model results, taking into account performance across different plant tissues, optimal read length, optimal convolutional filter size and number of filters, and other relevant aspects of model architecture.


In [2]:
import collections
import gzip
import math
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import scipy.signal

from sklearn import metrics
from sklearn.model_selection import train_test_split

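import torch
import torchvision
# Variable is deprecated since PyTorch 0.4; plain tensors can be used instead.
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision.transforms import ToTensor

from Bio import SeqIO

# Fix the random seed so that runs are reproducible.
torch.manual_seed(42)

# Use Apple's Metal backend (MPS) when available; otherwise fall back to the CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using {device} device")
# device = "cpu"  # uncomment to force the CPU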

Using mps device


In [18]:
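def generate_fasta(bed_fname, output_train_fname, output_test_fname, seq_len):
    """Write seq_len-bp windows centered on BED peak midpoints to FASTA files.

    Uses the module-level chr_fnames dict (chromosome ID -> gzipped FASTA path).
    Chromosomes 1-4 go to the train/validation file; Chr5 goes to the test file.
    """
    # Collect the midpoint of every peak, grouped by chromosome.
    bed_peaks = {chr_id: [] for chr_id in chr_fnames}
    with gzip.open(bed_fname, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines and contigs without a chromosome FASTA file.
            if len(fields) < 3 or fields[0] not in bed_peaks:
                continue
            chr_start, chr_end = int(fields[1]), int(fields[2])
            bed_peaks[fields[0]].append((chr_start + chr_end) // 2)

    # Extract one window per peak; dict keys de-duplicate identical windows.
    seqs = {}
    for chr_id, chr_fname in chr_fnames.items():
        with gzip.open(chr_fname, "rt") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                chr_seq = str(record.seq)
                for p in bed_peaks[chr_id]:
                    # Shift the window so that it stays inside the chromosome.
                    start = max(p - seq_len // 2, 0)
                    end = min(start + seq_len, len(chr_seq))
                    start = max(end - seq_len, 0)
                    key = ",".join([chr_id, str(start), str(end), str(end - start)])
                    seqs[key] = chr_seq[start:end]

    # The with-block closes both output files, even if an error occurs.
    with open(output_train_fname, "w") as train_file, \
         open(output_test_fname, "w") as test_file:
        for seq_id, seq in seqs.items():
            chr_id = seq_id.split(",")[0]
            # Chromosomes 1-4 will be used for training and validation.
            out_file = test_file if chr_id == "Chr5" else train_file
            out_file.write(">" + seq_id + "\n" + seq + "\n")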

# TODO: implement
def read_bigwig(bw_fname):
    print(bw_fname)


In [17]:
chr_fnames = {'Chr1': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.1.fa.gz',
              'Chr2': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.2.fa.gz',
              'Chr3': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.3.fa.gz',
              'Chr4': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.4.fa.gz',
              'Chr5': '../project/Arabidopsis_thaliana.TAIR10.dna.chromosome.5.fa.gz'}

#bed_fname = '../project/chromatin_cs425/SRP034156/SRX391990.target.all.bed.gz'

#generate_fasta(bed_fname,
#               '../project/chromatin_cs425/SRP034156/fasta/train_val.fasta',
#               '../project/chromatin_cs425/SRP034156/fasta/test.fasta',
#               1000)