Modeling Enhancer-Promoter Interactions with Attention-Based Neural Networks

EPIANN

Inspired by machine translation models, we develop an attention-based neural network model, EPIANN.

Schematic overview of EPIANN (see EPIANN.png).

Data Augmentation

There are 6 cell lines (celline = GM12878, HUVEC, HeLa-S3, IMR90, K562, or NHEK), and each comes with its own folder. Within each folder, there is a single file, celline.csv, which is a renamed copy of

https://github.com/shwhalen/targetfinder/tree/master/paper/targetfinder/celline/output-ep/pairs.csv

Before we train our neural network model, we need to generate input data from the genomic coordinates (hg19) of enhancers and promoters, along with the EPI indicators recorded in celline.csv. Data_Augmentation.R implements an automatic data augmentation pipeline with several parameters, specified in the following table.

| Parameter | Explanation |
| --- | --- |
| celline | one of the 6 cell lines; default = "IMR90" |
| folder | the name of the folder that holds all output files; default = "aug_50" |
| shift_distance | the step size used to slide the extended region around the enhancer and promoter; default = 50 |
| enhancer_target_length | the length of the extended enhancer; default = 3000 |
| promoter_target_length | the length of the extended promoter; default = 2000 |
| positive_scalar | the augmentation ratio; default = 20 |
| test_percent | the fraction of all data held out as test data; default = 0.1 |
| random_seed | the random seed used to sample the test data; default = 1 |
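
To make the roles of shift_distance and enhancer_target_length more concrete, here is a minimal Python sketch of sliding a fixed-length window around a region. It is not taken from Data_Augmentation.R: the function name `sliding_windows`, the half-open coordinate convention, and the example coordinates are assumptions made for illustration, and the sketch does not cover how the R script selects windows to reach positive_scalar.

```python
def sliding_windows(start, end, target_length, shift_distance):
    """Return (window_start, window_end) pairs of width target_length that
    fully contain the half-open interval [start, end), stepping by
    shift_distance. Hypothetical helper, not part of the repository."""
    if end - start > target_length:
        raise ValueError("region is longer than target_length")
    # The leftmost window ends exactly at `end`;
    # the rightmost window starts exactly at `start`.
    first_start = end - target_length
    windows = []
    for window_start in range(first_start, start + 1, shift_distance):
        windows.append((window_start, window_start + target_length))
    return windows


# Made-up 200 bp enhancer; the coordinates are for illustration only.
windows = sliding_windows(10000, 10200, target_length=3000, shift_distance=50)
print(len(windows), windows[0], windows[-1])
```

With these made-up coordinates, every window is 3000 bp long and still contains the original 200 bp region, which is one plausible way a single positive pair could yield many shifted training examples.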

You can find the output files generated with the default parameters under the directory IMR90/aug_50/. The following files are currently not available in the GitHub repository because of the size limit (work in progress). They are available in the repository.

IMR90/aug_50/IMR90_enhancer.fasta
IMR90/aug_50/IMR90_promoter.fasta
IMR90/aug_50/imbalanced/IMR90_enhancer.fasta
IMR90/aug_50/imbalanced/IMR90_promoter.fasta
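
Once these FASTA files are available locally, they can be inspected before training. Biopython is listed among the dependencies and would normally be the tool for this; the stand-alone parser below is only a minimal sketch that avoids any third-party import. The function name `read_fasta` and the example records are made up for illustration and do not reflect the actual headers in the files.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)


# Made-up records standing in for a file such as IMR90_enhancer.fasta.
example = io.StringIO(u">pair_1\nACGT\nNNAC\n>pair_2\nGGCA\n")
records = list(read_fasta(example))
print(records)
```

For a real file, the same generator can be applied to an opened file handle, for example to check that every sequence has the expected length (3000 for enhancers and 2000 for promoters with the default settings).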

Train Neural Network Model

Under the directory IMR90/, you can find an example Python script, IMR90_EPIANN.py, with the default settings. The parameters that control its inputs and outputs are explained in the following table.

| Parameter | Explanation |
| --- | --- |
| celline | one of the 6 cell lines; default = 'IMR90' |
| file_pre | the folder containing the augmented data; default = 'aug_50/IMR90' |
| out_dir | the folder that contains the output; default = 'output/IMR90_EPIANN' |
| script_id | the name of the current Python script, used to distinguish the outputs of multiple runs; default = 'IMR90_EPIANN' |
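
The following sketch shows how these four settings might fit together. It is an illustration, not an excerpt from IMR90_EPIANN.py: the way file_pre is combined with a suffix is inferred from the file names listed earlier (IMR90/aug_50/IMR90_enhancer.fasta), and the log file name is hypothetical.

```python
import os

# Defaults taken from the parameter table.
celline = 'IMR90'
file_pre = 'aug_50/IMR90'
out_dir = 'output/IMR90_EPIANN'
script_id = 'IMR90_EPIANN'

# Assumption: input files are named by appending a suffix to file_pre, which
# matches IMR90/aug_50/IMR90_enhancer.fasta when the script runs from IMR90/.
enhancer_fasta = file_pre + '_enhancer.fasta'
promoter_fasta = file_pre + '_promoter.fasta'

# Hypothetical output name; the real script may name its outputs differently.
log_path = os.path.join(out_dir, script_id + '.log')

print(enhancer_fasta)
print(promoter_fasta)
print(log_path)
```

To train on another cell line under the same assumptions, celline, file_pre, out_dir, and script_id would all be changed together, for example to 'K562', 'aug_50/K562', 'output/K562_EPIANN', and 'K562_EPIANN'.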

The computational graph for the neural network is programmed using TensorFlow. On our setup, we use a single NVIDIA GTX 1080 or NVIDIA TITAN X with 5 CPU threads. A single batch takes about 6 seconds to train. All neural network parameters can be altered in the script.

| Neural Network Parameter | Explanation |
| --- | --- |
| enhancer_length | the length of input enhancers; default = 3000 |
| promoter_length | the length of input promoters; default = 2000 |
| BATCH_SIZE | half of the actual batch size; default = 32 |
| num_filters | the number of convolution filters; default = 256 |
| e_conv_width | the convolutional filter width; default = 15 |
| dropout_rate_cnn | the dropout rate for the convolution layer; default = 0.2 |
| dropout_rate | the dropout rate for all layers except the convolution layer; default = 0.2 |
| pool_width | the max pooling size; default = 30 |
| atten_hyper | the dimension of the attention-related parameters; default = 32 |
| dense_neuron_coor | the dimensions of the fully connected layers for coordinate prediction; default = [128, 64] |
| inter_dim | the dimension of the parameters used to quantify the interaction; default = 1 |
| topk | the top-k pooling size; default = 32 |
| dense_neuron | the dimension of the fully connected layers; default = 32 |
| lamb | the hyperparameter that balances the cross-entropy error against the regression error; default = 10 |
| num_of_epoch | the number of epochs; default = 90 |
| output_step | the step size for reporting performance on the test dataset; default = 500 batches |
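
The lamb parameter suggests a two-part objective: a classification term for the interaction label and a regression term for coordinate prediction. The sketch below shows one common way to combine two such terms. It is an assumption, not the loss defined in the script: which term lamb multiplies, the use of mean squared error for the regression part, and the function names are all illustrative choices, and the code uses plain Python rather than TensorFlow.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def mean_squared_error(targets, preds):
    """Mean squared error between target and predicted values."""
    return sum((t - q) ** 2 for t, q in zip(targets, preds)) / len(targets)


def combined_loss(labels, probs, coords_true, coords_pred, lamb=10.0):
    """Assumed form: cross-entropy + lamb * regression error."""
    return (binary_cross_entropy(labels, probs)
            + lamb * mean_squared_error(coords_true, coords_pred))


# Made-up values for two enhancer-promoter pairs.
loss = combined_loss(labels=[1, 0], probs=[0.9, 0.2],
                     coords_true=[0.5, 0.25], coords_pred=[0.4, 0.35])
print(loss)
```

Under this assumed form, a larger lamb makes the regression error count more heavily relative to the classification error; the script itself is the reference for the actual definition.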

Required Pre-installed Packages

R (3.4.2) Library dependencies

GenomicRanges 1.28.2
BSgenome.Hsapiens.UCSC.hg19.masked 1.3.99

Python (2.7.6) Module dependencies

Sklearn 0.18.1

os
pickle
time
tensorflow 1.3.0
numpy 1.13.3
Sklearn 0.19.1
Biopython 1.67