Skip to content
/ EPIANN Public

Modeling Enhancer-Promoter Interactions with Attention-Based Neural Networks

Notifications You must be signed in to change notification settings

wgmao/EPIANN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

53 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EPIANN

Inspired by machine translation models we develp an attention-based nerual network model, EPIANN. Schematic overview of EPIANN

Data Augmentation

There are 6 cell lines. which are celline=GM12878, HUVEC, HeLa-S3, IMR90, K562 and NHEK, and each comes with its own folder. Within each folder, there is a single file: celline.csv. celline.csv is a renamed copy of

https://github.com/shwhalen/targetfinder/tree/master/paper/targetfinder/celline/output-ep/pairs.csv

Before we actually train oorneural network model, we need to generate input data from genomic coordinates(hg19) of enhancers and promoters, along with the indicators of EPIs recorded in celline.csv. Data_Augmentation.R encoded an automatic data augmentation pipeline with several parameters specified in the following table.

Parameters Explanation
celline change it to one of the 6 cell lines with default = "IMR90"
folder the name of the folder to hold all output files with default = "aug_50"
shift_distance the step size to slide extended region around the enhancer and promoter with default = 50
enhancer_target_length the length of extended enhancer with default = 3000
promoter_target_length the length of extended promoter with default = 2000
positive_scalar the augmentation ratio with default = 20
test_percent the percent of test data among all with default = 0.1
random_seed the random seed to sample test data with default = 1

You can find the output files with default parameters under the directory IMR90/aug_50/. The following files are currently not avaiable in the github repository because of the size limit (Work In Progress). They are avaiable in the repository.

IMR90/aug_50/IMR90_enhancer.fasta
IMR90/aug_50/IMR90_promoter.fasta
IMR90/aug_50/imbalanced/IMR90_enhancer.fasta
IMR90/aug_50/imbalanced/IMR90_promoter.fasta

Train Neural Netork Model

Under the directory IMR90/, you can find an example python script IMR90_EPIANN.py with the default setting. The parameters regarding inputs are explained in the following table.

Parameters Explanation
celline chanage it to one of the 6 cell lines with default = 'IMR90'
file_pre change it to be the folder containing augmented data with default = 'aug_50/IMR90'
out_dir change it to be the folder that contains the output with dedault = 'output/IMR90_EPIANN'
script_id change it to be the current python script name in order to distinguish the outputs from multiple runs with default = 'IMR90_EPIANN'

The computational grpaph for the neural network is programmed using Tensorflow. On our setup, we use a single NVIDIA GTX 1080 or NVIDIA TITAN X with 5 CPU threads. A single batch takes about 6 seconds to train. All neural neural parameters can be altered in the script.

Neural Network Parameters Explanation
enhancer_length the length of input enhancers with default = 3000
promoter_length the length of input promoters with default = 2000
BATCH_SIZE the half of exact batch size with default = 32
num_filters the number of convolution filters with default = 256
e_conv_width the convolutional filter width with default = 15
dropout_rate_cnn the dropout rate for the convolution layer with default = 0.2
dropout_rate the dropout rate for all layers except the convolution layer with default = 0.2
pool_width the max pooling size with default = 30
atten_hyper the dimension of the attention-related parameters with default = 32
dense_neuron_coor the dimension of the fully connected layers for coordinate prediction with default = [128, 64]
inter_dim the dimension of the interaction quantification related parameters with default = 1
topk the top-k pooling size with default = 32
dense_neuron the dimension of the fully connected layers with default = 32
lamb the hyperparameter which mediates the cross-entropy error and the regression error with default = 10
num_of_epoch the number of epochs with default = 90
output_step the step size to report performance on test dataset with default = 500 batches

Required Pre-installed Packages

R (3.4.2) Library dependencies

GenomicRanges 1.28.2
BSgenome.Hsapiens.UCSC.hg19.masked 1.3.99

Python (2.7.6) Module dependencies

Sklearn 0.18.1

os
pickle
time
tensorflow 1.3.0
numpy 1.13.3
Sklearn 0.19.1
Biopython 1.67

About

Modeling Enhancer-Promoter Interactions with Attention-Based Neural Networks

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published