Inspired by machine translation models, we develop an attention-based neural network model, EPIANN.
There are 6 cell lines (celline = GM12878, HUVEC, HeLa-S3, IMR90, K562 and NHEK), and each comes with its own folder. Within each folder, there is a single file: celline.csv. celline.csv is a renamed copy of
Before training our neural network model, we need to generate input data from the genomic coordinates (hg19) of enhancers and promoters, along with the indicators of EPIs recorded in celline.csv. Data_Augmentation.R implements an automatic data augmentation pipeline with several parameters, specified in the following table.
|Data Augmentation Parameters||Explanation|
|celline||change it to one of the 6 cell lines with default = "IMR90"|
|folder||the name of the folder to hold all output files with default = "aug_50"|
|shift_distance||the step size to slide extended region around the enhancer and promoter with default = 50|
|enhancer_target_length||the length of extended enhancer with default = 3000|
|promoter_target_length||the length of extended promoter with default = 2000|
|positive_scalar||the augmentation ratio with default = 20|
|test_percent||the percent of test data among all with default = 0.1|
|random_seed||the random seed to sample test data with default = 1|
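As a rough illustration of how shift_distance and the target lengths in the table above could interact, here is a minimal Python sketch of a sliding-window augmentation. It is not taken from Data_Augmentation.R: the function name `sliding_windows` and the rule that every window must fully contain the original region are assumptions made for this example, and windows are not clipped at chromosome boundaries.

```python
def sliding_windows(start, end, target_length=3000, shift_distance=50):
    """Return (window_start, window_end) pairs of width target_length.

    Assumption for this sketch: each window must fully contain the
    original region [start, end), and consecutive windows differ by
    shift_distance base pairs.
    """
    region_length = end - start
    if region_length > target_length:
        raise ValueError("region is longer than target_length")
    # Free room available for sliding the window around the region.
    slack = target_length - region_length
    # Leftmost window: its right edge coincides with the region's end.
    first_start = end - target_length
    windows = []
    for offset in range(0, slack + 1, shift_distance):
        window_start = first_start + offset
        windows.append((window_start, window_start + target_length))
    return windows


# A 2000 bp enhancer extended to 3000 bp with a step of 50 bp.
example = sliding_windows(10000, 12000)
```

Under these assumptions, a 2000 bp enhancer leaves 1000 bp of slack, so a step of 50 gives 21 candidate windows. How the actual pipeline selects among such windows (for example to reach the ratio set by positive_scalar) is not shown here.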
You can find the output files with default parameters under the directory IMR90/aug_50/.
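The test_percent and random_seed parameters suggest a seeded hold-out split. The sketch below shows one way such a split could be done in Python; it is an assumption for illustration, not the R code used by the pipeline, and `split_test_indices` is a hypothetical helper name.

```python
import random


def split_test_indices(n_pairs, test_percent=0.1, random_seed=1):
    """Split indices 0..n_pairs-1 into train and test lists.

    Hypothetical helper: the fixed seed makes the sampled test set
    reproducible across runs.
    """
    rng = random.Random(random_seed)
    n_test = int(round(n_pairs * test_percent))
    test_idx = sorted(rng.sample(range(n_pairs), n_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(n_pairs) if i not in test_set]
    return train_idx, test_idx


train_idx, test_idx = split_test_indices(100)
```

Calling the function again with the same seed returns the same test indices, which is the point of exposing random_seed as a parameter.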
The following files are currently not available in the GitHub repository because of the size limit (work in progress). They are available in the repository.
IMR90/aug_50/IMR90_enhancer.fasta
IMR90/aug_50/IMR90_promoter.fasta
IMR90/aug_50/imbalanced/IMR90_enhancer.fasta
IMR90/aug_50/imbalanced/IMR90_promoter.fasta
Train Neural Network Model
Under the directory IMR90/, you can find an example Python script, IMR90_EPIANN.py, with the default settings. The parameters regarding inputs are explained in the following table.
|Input Parameters||Explanation|
|celline||change it to one of the 6 cell lines with default = 'IMR90'|
|file_pre||change it to be the folder containing augmented data with default = 'aug_50/IMR90'|
|out_dir||change it to be the folder that contains the output with default = 'output/IMR90_EPIANN'|
|script_id||change it to be the current python script name in order to distinguish the outputs from multiple runs with default = 'IMR90_EPIANN'|
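To adapt the example script to another cell line, the four input parameters above would typically change together. The following sketch derives them from the cell line name, following the pattern of the documented defaults; the helper `input_settings` is hypothetical and is not part of IMR90_EPIANN.py.

```python
def input_settings(celline='IMR90', folder='aug_50'):
    """Build the input parameters for one cell line.

    Hypothetical helper that mirrors the documented defaults
    ('aug_50/IMR90', 'output/IMR90_EPIANN', 'IMR90_EPIANN').
    """
    script_id = '{}_EPIANN'.format(celline)
    return {
        'celline': celline,
        'file_pre': '{}/{}'.format(folder, celline),
        'out_dir': 'output/{}'.format(script_id),
        'script_id': script_id,
    }


settings = input_settings('K562')
```

With the default arguments the helper reproduces the values in the table, so it can serve as a quick consistency check before editing the script by hand.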
The computational graph for the neural network is programmed using TensorFlow. On our setup, we use a single NVIDIA GTX 1080 or NVIDIA TITAN X with 5 CPU threads. A single batch takes about 6 seconds to train. All neural network parameters can be altered in the script.
|Neural Network Parameters||Explanation|
|enhancer_length||the length of input enhancers with default = 3000|
|promoter_length||the length of input promoters with default = 2000|
|BATCH_SIZE||half of the actual batch size with default = 32|
|num_filters||the number of convolution filters with default = 256|
|e_conv_width||the convolutional filter width with default = 15|
|dropout_rate_cnn||the dropout rate for the convolution layer with default = 0.2|
|dropout_rate||the dropout rate for all layers except the convolution layer with default = 0.2|
|pool_width||the max pooling size with default = 30|
|atten_hyper||the dimension of the attention-related parameters with default = 32|
|dense_neuron_coor||the dimension of the fully connected layers for coordinate prediction with default = [128, 64]|
|inter_dim||the dimension of the interaction quantification related parameters with default = 1|
|topk||the top-k pooling size with default = 32|
|dense_neuron||the dimension of the fully connected layers with default = 32|
|lamb||the hyperparameter which mediates the cross-entropy error and the regression error with default = 10|
|num_of_epoch||the number of epochs with default = 90|
|output_step||the step size to report performance on test dataset with default = 500 batches|
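The lamb parameter weighs a classification term against a regression term. The sketch below shows one plausible form of such a combined objective in plain Python. It rests on two assumptions that the table does not confirm: that lamb multiplies the regression term, and that the regression error is a mean squared error on the predicted coordinates. The function name `combined_loss` is hypothetical.

```python
import math


def combined_loss(labels, probs, coord_true, coord_pred, lamb=10.0):
    """Binary cross-entropy plus lamb times a mean squared error.

    Assumed form for illustration only; the script may weight or
    define the two terms differently.
    """
    eps = 1e-7  # keeps log() away from 0
    ce_terms = []
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        ce_terms.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    cross_entropy = sum(ce_terms) / len(ce_terms)
    squared_errors = [(t - q) ** 2 for t, q in zip(coord_true, coord_pred)]
    regression = sum(squared_errors) / len(squared_errors)
    return cross_entropy + lamb * regression


# Two pairs with uninformative predictions and a coordinate error of 0.1.
loss = combined_loss([1, 0], [0.5, 0.5], [0.2, 0.4], [0.3, 0.3])
```

In this form, a larger lamb makes the optimizer pay more attention to the coordinate prediction relative to the interaction label.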
Required Pre-installed Packages
R (3.4.2) Library dependencies
GenomicRanges 1.28.2
BSgenome.Hsapiens.UCSC.hg19.masked 1.3.99
Python (2.7.6) Module dependencies
os
pickle
time
tensorflow 1.3.0
numpy 1.13.3
Sklearn 0.19.1
Biopython 1.67