# ULMFiT for Genomics Classification

Karl Heyer

## Abstract

Classifying functional genomic regions such as promoters or enhancers from sequence data alone is a notoriously difficult challenge. Many bioinformatics algorithms exist for genomics classification, but these tend to rely on hand-engineered features and are limited to recognizing known structural motifs. More recently, deep learning approaches developed in computer vision (CV) and natural language processing (NLP) have been applied to genomics problems with promising results [26]. However, these methods tend to be limited by the amount of labeled data available. We present Genomic-ULMFiT, an application of Universal Language Model Fine-tuning (ULMFiT) [1] to genomics classification problems. ULMFiT is an effective NLP algorithm for inductive transfer learning that utilizes pre-training on large unlabeled corpuses. This is particularly advantageous in genomics, where unlabeled sequence data is abundant. Genomic-ULMFiT shows improved results over existing methods for classification of promoters, enhancers, taxonomy, and lncRNA using only sequence data.

## 1. Introduction

Advancements in genomics research such as high-throughput sequencing have driven an explosion of genomic data. This volume of data demands robust algorithms for analysis and interpretation. Genomic classification, the task of assigning genomic sequences a label based on function or some other property, is fundamental to processing raw genomics data. Algorithms exist to classify sequences based on hand-engineered features, but these approaches struggle with the structural heterogeneity of genomic sequences [6]. Structures of genomic regulatory features such as promoters are notoriously complex and diverse [32].

Recently, deep learning approaches have tried to leverage large genomics datasets to train end-to-end models to resolve the complex structural diversity of genomic sequences. In the context of classification problems, deep learning has been applied to promoter classification [5,6], enhancer classification [3,4,7,12], enhancer-promoter interactions [8], CRISPR guide scoring [2], transcription factor binding sites [9], metagenomics classification [10], deleterious mutation classification [11], and long noncoding RNA classification [27]. One limitation of these methods as implemented is that they all rely on labeled data. Even when techniques like pre-training and transfer learning are used, such steps are applied only to labeled classification corpuses [8,9,12,28,29,30]. This poses a problem, as labeled genomics datasets tend to be small and deep learning models are prone to overfitting on small datasets [31].

ULMFiT is an NLP algorithm for training text classification models using transfer learning. ULMFiT uses unsupervised pre-training on large unlabeled domain corpuses to improve classification on smaller labeled datasets. This approach has been shown to reduce error rate on common text classification tasks by 18-24% over training from scratch, while also reducing data requirements by up to 100x [1]. In the context of genomic data, ULMFiT allows us to pre-train models on large amounts of unlabeled sequence data, then use the pre-trained models for classification on smaller corpuses of labeled data. This helps address the limitations caused by the scarcity of labeled sequences.

## Contributions

We adapt ULMFiT to genomics data with k-mer based tokenization. We show that Genomic-ULMFiT produces superior classification results in several genomic contexts, improving metrics such as accuracy, precision and recall by up to 18%. We provide a roadmap for approaching genomics classification problems in any genomic context using unsupervised pre-training.

## 2. Related Work


### Deep Learning in Genomics

Deep learning has been applied to many areas of genomics for some time now [26]. Deep classification models have been applied to a variety of genomics classification problems, including promoter classification [5,6], enhancer classification [3,4,7,12], enhancer-promoter interactions [8], CRISPR guide scoring [2], transcription factor binding sites [9], metagenomics classification [10], deleterious mutation classification [11], and long noncoding RNA classification [27]. Deep models for genomics typically use CNN-based architectures [2,3,5,6,7,10] or GRU cell architectures [4,9,27].

### Text Representation of Genomic Sequences

An important design decision in training deep genomics models is choosing how to represent genomic sequences as a numerical input to the model. Genomic sequences are typically tokenized on the nucleotide level [2,3,4,5,6,7,27]. Some approaches tokenize on the k-mer level [8,9,10]. Tokenized sequences are then represented either as one hot encoded vectors [2,3,4,5,6,7,10] or passed through an embedding layer [8,9,27].


### Language Modeling

Language modeling is the task of taking in as input a sequence of tokens and predicting the next token in the sequence. Language modeling in NLP has proven to be a powerful pre-training tool for transfer learning. Language modeling as a task trains a model to learn long term dependencies, hierarchical relations and sentiment [1]. Language models have shown the ability to be fine-tuned to many different classification tasks related to the language domain the language model was trained on [1].


### ULMFiT

ULMFiT is an inductive transfer learning technique in NLP [1]. ULMFiT trains a text classification model in three steps. The first step is general domain language modeling. A language model is trained on a large, general, unlabeled corpus that shares domain similarities with the classification task. The second step is to fine-tune the general language model on the classification corpus. The third step is to use the pre-trained weights of the fine-tuned language model to train the final classification model on the classification corpus. The ULMFiT approach has been shown to train better performing classification models requiring smaller amounts of labeled data compared to training from scratch [1].

### Regularization of Language Models

Language models can be prone to overfitting. [13] showed that using a variety of effective regularization techniques is crucial for training high quality language models that do not overfit. Dropout [17] is applied in four different ways. Embedding dropout is the equivalent of applying dropout on the word level, broadcasting dropout along the entire embedding vector for a word with a given probability. Weight dropout, similar to DropConnect [19], is applied to the hidden-to-hidden matrices of the LSTM layers of the model. For weight dropout, a constant dropout mask is applied at all timesteps to avoid issues raised by [18]. Variational dropout is applied to the activations of the hidden layers, also using a constant dropout mask at all timesteps. Standard dropout is used on linear layers after LSTM layers. Two forms of activation regularization are also applied. Activation Regularization (AR) applies an L2 scale penalty to the activations in the LSTM layers of the model. Temporal Activation Regularization (TAR) [21] applies an L2 penalty to the change in activation scale between timesteps.
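For reference, the two activation penalties can be written compactly in the notation of [13,21]. Here $h_t$ denotes the LSTM output at timestep $t$, $m$ the dropout mask applied to it, and $\alpha$ and $\beta$ scaling coefficients for the two terms; the subscripted loss names are shorthand introduced here for illustration.

$$
\mathcal{L}_{\mathrm{AR}} = \alpha \, L_2(m \odot h_t), \qquad \mathcal{L}_{\mathrm{TAR}} = \beta \, L_2(h_t - h_{t+1})
$$

Both terms are added to the language modeling loss: the first discourages large activations, the second discourages abrupt changes in the hidden state between consecutive timesteps.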

## 3. Genomic ULMFiT

We propose Genomic ULMFiT, an adaptation of the ULMFiT training process to genomic data. Following the ULMFiT process, we first train a general domain genomic language model. This genomic language model takes in a sequence of genomic tokens and predicts the next token in the sequence. The genomic language model is then fine-tuned on a target classification corpus to create a fine-tuned genomic language model. The weights of the fine-tuned genomic language model are then transferred to a classification model that is trained on the target classification corpus.

### Datasets

Two types of datasets are used in the ULMFiT process. The first is the dataset used for training the general domain genomic language model. The second is the dataset used for the target classification task.

Training the general domain language model is a self-supervised process. The model takes in a sequence of tokens and predicts the next token. This training stage requires no labeled data, which allows us to take advantage of the large amounts of unlabeled genomic data available. To create datasets for training the general domain genomic language model, full genomes are downloaded from NCBI. This makes it easy to generate large genomic datasets. The dataset used for pre-training should be phylogenetically similar to the target classification task the model will be transferred to. If, for example, the target task is classifying human promoter sequences, the general domain genomic language model should be trained on human genomic data or genomic data from closely related species.
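The next-token objective needs no labels because the targets are simply the inputs shifted by one position. The following is a minimal sketch in Python, assuming the sequence has already been split into tokens by some tokenizer. The helper names `build_vocab` and `language_model_pairs` are illustrative and are not taken from the Genomic-ULMFiT code.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def language_model_pairs(tokens, vocab):
    """Return (inputs, targets) where targets[i] is the token following inputs[i]."""
    ids = [vocab[token] for token in tokens]
    return ids[:-1], ids[1:]


# Illustrative tokens only; any tokenized genome would be handled the same way.
tokens = ["ATCG", "CGCG", "CGTA", "TACG", "CGCG"]
vocab = build_vocab(tokens)
inputs, targets = language_model_pairs(tokens, vocab)
print(inputs)
print(targets)
```

Every position in an unlabeled genome supplies one training example in this scheme, which is why full genomes make large pre-training corpuses.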

Datasets used for classification tasks are taken from existing publications for comparison. This is discussed more in the results section. In short, classification datasets are taken from [5,6,7,10,27].

### Data Representation

Genomic data must be pre-processed and tokenized before being input to the model. Tokenization on the k-mer level following [8,9,10] typically leads to better performance than the nucleotide level tokenization used by [2,3,4,5,6,7,27]. k-mer level tokenization allows the language model to learn a more nuanced understanding of different k-mers compared to nucleotide level tokenization.

Tokenization is also done with a set stride between tokens that may be less than the k-mer value. For example consider tokenizing the sequence `ATCGCGTACGATCCG` with a k-mer size of 4 and the following stride values:
 * Stride 1: `ATCG TCGC CGCG GCGT CGTA GTAC TACG ACGA CGAT GATC ATCC TCCG`
 * Stride 2: `ATCG CGCG CGTA TACG CGAT ATCC`
 * Stride 3: `ATCG GCGT TACG GATC`
 * Stride 4: `ATCG CGTA CGAT`
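The tokenization above can be reproduced with a few lines of Python. This is a minimal sketch rather than the tokenizer used for the experiments: the function name `kmer_tokenize` is hypothetical, and it assumes that trailing nucleotides which cannot fill a complete k-mer are dropped, as in the listed examples.

```python
def kmer_tokenize(sequence, k, stride):
    """Split a sequence into k-mers whose start positions are `stride` apart.

    Trailing positions that cannot fill a complete k-mer are dropped.
    """
    if k < 1 or stride < 1:
        raise ValueError("k and stride must be positive integers")
    last_start = len(sequence) - k
    return [sequence[i:i + k] for i in range(0, last_start + 1, stride)]


sequence = "ATCGCGTACGATCCG"
for stride in (1, 2, 3, 4):
    print(f"Stride {stride}:", " ".join(kmer_tokenize(sequence, 4, stride)))
```

With a stride smaller than k, consecutive tokens overlap by `k - stride` nucleotides, so part of each token is already visible in the one before it.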

Decreasing the stride between k-mers can be thought of as applying a prior to the language model prediction process where information about the next k-mer is included in the previous k-mer. This allows the language model to converge more quickly to a lower loss value, but this does not always translate to better classification performance.

In practice, tokenization parameters for the length of k-mers used and the stride between k-mers are hyperparameters that must be tuned for different datasets. Some datasets respond strongly to the choice of k-mer and stride, while other datasets show no performance change for a range of k-mer and stride values.

### General Domain Genomic Language Model

The general domain genomic language model is based on the AWD-LSTM model [13], which is the standard model for ULMFiT [1]. This model utilizes LSTM cells for processing sequences [14]. The model is regularized following [13]. All forms of dropout described in [13] are used: variational dropout, weight dropout, embedding dropout and standard dropout. Activation regularization, temporal activation regularization and weight decay are also used. Weight decay is implemented following [20], where the weight decay penalty is added in the weight update step rather than in the loss calculation. The model uses weight tying between the embedding layer and the output layer, motivated by [15,16].
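The distinction drawn by [20] can be illustrated with a simplified update rule. With learning rate $\eta$, weight decay coefficient $\lambda$, and $\hat{m}_t$, $\hat{v}_t$ the bias-corrected Adam moment estimates of the loss gradient (symbols introduced here for illustration, and omitting the schedule multiplier used in [20]), decoupled weight decay updates the weights $\theta$ as:

$$
\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \, \theta_t \right)
$$

In contrast, adding an L2 penalty to the loss would fold $\lambda \theta_t$ into the gradient before the moment estimates are computed, so the penalty would be rescaled by the adaptive denominator along with the rest of the gradient.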

The general domain genomic language model is trained in a self-supervised fashion to predict the next genomic token in a series of genomic tokens. Typical datasets include an organism's genome, or an ensemble of genomes from closely related organisms. The model is trained following the One Cycle learning rate policy from [22]. Learning rates are selected following the methods of [23,24]. The model is optimized using the Adam optimizer with parameters $\beta_{1} = 0.9, \beta_{2} = 0.99$.
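As an illustration of the schedule shape, the sketch below implements a simplified, piecewise-linear one-cycle learning rate: a linear warm-up to the peak rate followed by a linear decay to a much smaller value. The warm-up fraction, the two divisors and the function name `one_cycle_lr` are assumptions made for this example, not values reported for Genomic-ULMFiT, and the accompanying momentum cycling described in [22] is omitted.

```python
def one_cycle_lr(step, total_steps, lr_max, warmup_frac=0.25, div=25.0, final_div=1e4):
    """Learning rate at `step` (0-indexed) of a piecewise-linear one-cycle schedule.

    The rate rises linearly from lr_max / div to lr_max during warm-up,
    then falls linearly to lr_max / final_div at the last step.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    lr_start = lr_max / div
    lr_end = lr_max / final_div
    # Clamp so that both phases contain at least one step.
    warmup_steps = min(max(1, round(total_steps * warmup_frac)), total_steps - 1)
    if step < warmup_steps:
        frac = step / warmup_steps
        return lr_start + frac * (lr_max - lr_start)
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    frac = (step - warmup_steps) / decay_steps
    return lr_max + frac * (lr_end - lr_max)


schedule = [one_cycle_lr(step, 101, lr_max=1e-2) for step in range(101)]
print(f"start={schedule[0]:.2e} peak={max(schedule):.2e} end={schedule[-1]:.2e}")
```

The peak rate `lr_max` is the quantity that would be chosen with the learning rate selection methods of [23,24].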

### Task Specific Genomic Language Model

No matter how good the general domain genomic language model is, the classification dataset likely comes from a different distribution [1]. This motivates fine-tuning the genomic language model on the classification corpus. Following [1,25], the genomic language model is fine-tuned using discriminative fine-tuning where different layers in the model receive different learning rates and are fine-tuned to different extents. In this stage, regularization hyperparameters may need to be changed. The model is still trained following the One Cycle learning rate policy.
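A minimal sketch of how discriminative learning rates might be assigned is shown below. It uses the rule suggested in [1], where each lower layer group receives the learning rate of the group above it divided by 2.6. Whether the Genomic-ULMFiT experiments used this exact factor is not stated here, so the default `factor` and the function name `discriminative_lrs` should be read as assumptions.

```python
def discriminative_lrs(top_lr, n_groups, factor=2.6):
    """Learning rates ordered from the lowest layer group to the top one.

    The top group trains at `top_lr`; each group below it trains at the
    rate of the group above divided by `factor`.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]


# Four hypothetical groups: embedding plus three LSTM layers.
for depth, lr in enumerate(discriminative_lrs(1e-3, 4)):
    print(f"group {depth}: lr = {lr:.2e}")
```

Lower layers, which hold the most general features, are therefore changed the least during fine-tuning.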

### Task Specific Classification Model

Weights from the fine-tuned genomic language model are then transferred to the classification model. The classification model adds two new linear layers on top of the genomic language model. Following [1], the linear layers use batch normalization, dropout and ReLU activations. The input to the linear layers is produced with the concat pooling method described in [1].
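Concat pooling, as described in [1], concatenates the hidden state at the last timestep with the max-pooled and mean-pooled hidden states over all timesteps. The dependency-free sketch below illustrates the operation on plain Python lists; the function name `concat_pool` is hypothetical, and a real model would apply the same operation to batched tensors.

```python
def concat_pool(hidden_states):
    """Concatenate the last hidden state with max- and mean-pooled states.

    `hidden_states` is a list of T vectors (lists of floats), one per timestep,
    all of the same length D. The result has length 3 * D.
    """
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one timestep")
    last = list(hidden_states[-1])
    columns = list(zip(*hidden_states))  # one tuple per hidden dimension
    max_pool = [max(column) for column in columns]
    mean_pool = [sum(column) / len(column) for column in columns]
    return last + max_pool + mean_pool


states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
print(concat_pool(states))
```

Pooling over all timesteps lets the classifier use signal from anywhere in the sequence rather than only from the final hidden state.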

The classification model is trained using gradual unfreezing. Training all layers of the classification model at once could lead to catastrophic forgetting. Training all layers also risks allowing a bad gradient update in the new, untrained linear layers to dramatically change the pre-trained weights in the lower layers of the model. With gradual unfreezing, we first train only the new linear layers at the end of the model. Then we unfreeze one layer at a time over several training steps.
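The unfreezing order can be sketched as a simple schedule. The layer group names below are hypothetical and only illustrate the ordering; how many epochs are spent at each stage is left to the practitioner.

```python
def gradual_unfreezing_schedule(layer_groups):
    """Yield the trainable layer groups for each successive training stage.

    `layer_groups` is ordered from the lowest group (embedding) to the top
    group (the new linear head). Stage 1 trains only the top group, and each
    later stage unfreezes one more group below it.
    """
    for n_unfrozen in range(1, len(layer_groups) + 1):
        yield layer_groups[-n_unfrozen:]


groups = ["embedding", "lstm_1", "lstm_2", "lstm_3", "linear_head"]
for stage, trainable in enumerate(gradual_unfreezing_schedule(groups), start=1):
    print(f"stage {stage}: train {trainable}")
```

In the final stage every group is trainable, at which point the whole model is fine-tuned together.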

### Hyperparameters

Models typically include an embedding of size 400, three LSTM layers with 1150 hidden activations per layer, and a backpropagation through time (BPTT) length of 70. Typical dropout parameters are 0.1 for embeddings, 0.5 for the LSTM hidden-to-hidden matrix, 0.2 for the activations of the LSTM layers and 0.4 for the output linear layers. Activation regularization and temporal activation regularization are implemented with parameters $\alpha = 2$, $\beta = 1$. The weight decay coefficient is typically 0.01. Dropout and weight decay are tuned on a per-dataset basis. Typically, dropout parameters are tuned by applying a constant scaling coefficient, maintaining the ratios between dropout parameters.
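Scaling all dropout values by one coefficient reduces dropout tuning to a single hyperparameter. A minimal sketch, using the typical values quoted above; the dictionary keys and the function name `scale_dropout` are illustrative choices for this example.

```python
BASE_DROPOUT = {
    "embedding": 0.1,  # dropout on embeddings
    "weight": 0.5,     # dropout on the LSTM hidden-to-hidden matrix
    "hidden": 0.2,     # dropout on LSTM layer activations
    "output": 0.4,     # dropout on the output linear layers
}


def scale_dropout(base, multiplier):
    """Scale every dropout probability by `multiplier`, preserving their ratios."""
    scaled = {name: p * multiplier for name, p in base.items()}
    for name, p in scaled.items():
        if not 0.0 <= p < 1.0:
            raise ValueError(f"scaled dropout for {name!r} is out of range: {p}")
    return scaled


for multiplier in (0.5, 1.0, 1.5):
    print(multiplier, scale_dropout(BASE_DROPOUT, multiplier))
```

A multiplier below 1 weakens regularization uniformly, which may suit larger datasets, while a multiplier above 1 strengthens it.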

## 4. Results

### E. coli Baseline

This is a baseline comparison to show the effect of pre-training and to validate that the Genomic-ULMFiT approach improves results over training from scratch. Here the Naive model is trained from scratch. The E. coli Genome Pre-Training model is pre-trained on only the E. coli genome. The Genomic Ensemble Pre-Training model is pre-trained on an ensemble of roughly a dozen bacterial genomes. Pre-training has a clear impact on model performance, and pre-training on more data shows improvements over pre-training on less data. In general, the quality of the pre-trained language model has a direct impact on classification performance.

  | Model                        	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	|
  |------------------------------	|:--------:	|:---------:	|:------:	|:-----------------------:	|
  | Naive                        	|   0.834  	|   0.847   	|  0.816 	|          0.670          	|
  | E. coli Genome Pre-Training   	|   0.919  	|   0.941   	|  0.893 	|          0.839          	|
  | Genomic Ensemble Pre-Training 	|   0.973  	|   0.980   	|  0.966 	|          0.947          	|


### Human Promoters, Short Sequences

This data shows a direct comparison to [5] for classification of human promoters from short (250 bp) sequences, taken -200/50 relative to the TSS. The same dataset from [5] was used to generate these results.


| Model                            	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	| Specificity 	|
|----------------------------------	|----------	|-----------	|--------	|-------------------------	|-------------	|
| Umarov and Solovyev [5]          	|     -    	|     -     	|   0.9  	|           0.89          	|     0.98    	|
| Genomic-ULMFiT 	|   .995   	|    .992   	|  __.996__  	|           __.991__          	|     __.994__    	|



### Human Promoters, Long Sequences

These results show a direct comparison to [6]. The dataset for [6] was not publicly available, but the same methodology was used to generate a dataset. Positive sequences were taken as the region -500/500 relative to TSS locations in the [EPDnew Database](https://epd.epfl.ch//EPDnew_database.php). Negative sequences were randomly selected from regions in the genome not overlapping with regions taken for promoter sequences. The [NCBI Human Genome](https://www.ncbi.nlm.nih.gov/genome/51) is used as a reference template.
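The window selection described above can be sketched as follows. This is an illustrative reconstruction and not the code used to build the dataset: it treats a chromosome as a single coordinate range, uses half-open intervals, ignores strand and chromosome boundaries, and allows negative windows to overlap one another. All function names are hypothetical.

```python
import random


def promoter_window(tss, upstream=500, downstream=500):
    """Half-open interval [tss - upstream, tss + downstream) around a TSS."""
    return (tss - upstream, tss + downstream)


def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def sample_negative_windows(genome_length, positives, n, width=1000,
                            seed=0, max_attempts=100_000):
    """Randomly draw n windows that do not overlap any positive window."""
    rng = random.Random(seed)
    negatives = []
    for _ in range(max_attempts):
        if len(negatives) == n:
            break
        start = rng.randrange(0, genome_length - width + 1)
        candidate = (start, start + width)
        if not any(overlaps(candidate, positive) for positive in positives):
            negatives.append(candidate)
    if len(negatives) < n:
        raise RuntimeError("could not place the requested number of windows")
    return negatives


# Made-up TSS positions on a made-up 100 kb region.
positives = [promoter_window(tss) for tss in (5_000, 42_000, 77_500)]
negatives = sample_negative_windows(100_000, positives, n=3)
print(positives)
print(negatives)
```

Sequences for each window would then be extracted from the reference genome at the selected coordinates.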

| Model                                   	| DNA Size  	| Models           	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	|
|-----------------------------------------	|-----------	|------------------	|----------	|-----------	|--------	|-------------------------	|
| Umarov et al.                           	| -1000/500 	| 2 Model Ensemble 	|     -    	|   0.636   	|  0.802 	|          0.714          	|
| Umarov et al.                           	|  -200/400 	| 2 Model Ensemble 	|     -    	|   0.769   	|  0.755 	|          0.762          	|
| Genomic-ULMFiT	|  -500/500 	|   Single Model   	|   0.894  	|   __0.900__   	|  __0.844__ 	|          __0.784__          	|


### Bacterial Promoters

These results show comparisons to performance on another dataset from [5] containing promoter sequences from E. coli and B. subtilis. Compared to the CNN-based method used by [5], Genomic-ULMFiT performed similarly on E. coli promoters, but worse on B. subtilis promoters, likely due to the amount of data available (2936 examples for E. coli, 1050 for B. subtilis). This suggests that in extremely low data regimes, CNN models may perform better.


| Method         	| Organism    	| Training Examples 	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	| Specificity 	|
|----------------	|-------------	|-------------------	|----------	|-----------	|--------	|-------------------------	|-------------	|
| Umarov and Solovyev [5] 	| E. coli     	|        2936       	|     -    	|     -     	|  __0.90__  	|           0.84          	|     0.96    	|
| Genomic-ULMFiT 	| E. coli     	|        2936       	|   0.956  	|   0.917   	|  0.880 	|          __0.871__          	|    __0.977__    	|
| Umarov and Solovyev [5] 	| B. subtilis 	|        1050       	|     -    	|     -     	|  __0.91__  	|           __0.86__          	|     0.95    	|
| Genomic-ULMFiT 	| B. subtilis 	|        1050       	|   0.905  	|   0.857   	|  0.789 	|          0.759          	|     0.95    	|


### Metagenomics Classification

These results show a direct comparison to [10], using the same datasets for classification. Two datasets are used: one for amplicon sequencing data and another for shotgun sequencing data. Datasets are generated synthetically based on sequencing of 16S regions of bacterial genomes.


| Amplicon Data   	| Accuracy 	| Precision 	| Recall 	| F1    	|
|-----------------	|----------	|-----------	|--------	|-------	|
| Fiannaca et al. 	|   .9137  	|   .9162   	|  .9137 	| .9126 	|
| Genomic-ULMFiT  	|   __.9239__  	|   __.9402__   	|  __.9332__ 	| __.9306__ 	|

| Shotgun Data    	| Accuracy 	| Precision 	| Recall 	| F1    	|
|-----------------	|----------	|-----------	|--------	|-------	|
| Fiannaca et al. 	|   .8550  	|   .8570   	|  .8520 	| .8511 	|
| Genomic-ULMFiT  	|   __.8797__  	|   __.8824__   	|  __.8769__ 	| __.8758__ 	|


### Enhancer Classification

These results show a direct comparison to [7], using the same dataset. Results here are compared using ROC-AUC as this was the metric used by [7]. Positive examples are 500 bp sequences defined as having active enhancer marks (H3K27ac) in the liver. Negative examples are genomic regions showing no H3K27ac marks.

The results from [7] on this dataset are not presented in the paper itself, but in the supplementary section, available [here](https://www.biorxiv.org/content/biorxiv/suppl/2018/02/14/264200.DC2/264200-1.pdf). The results below are compared to the authors' results in supplementary Figure 3. This dataset was used because the main dataset from [7], used to generate the figures in the main paper, was not made available on their GitHub repo.

| Model/ROC-AUC                 	| Human 	| Mouse 	|  Dog  	| Opossum 	|
|-------------------------------	|:-----:	|:-----:	|:-----:	|:-------:	|
| Cohn et al.                   	|  0.80 	|  0.78 	|  0.77 	|   0.72  	|
| Genomic-ULMFiT 	| __0.819__ 	| __0.875__ 	| __0.788__ 	|  __0.798__  	|



### mRNA/lncRNA Classification

These results show a direct comparison to [27] using data from the paper. The classification dataset consists of DNA sequences corresponding to mRNA and lncRNA sequences. The dataset contains two test sets - a standard test set and a challenge test set. In the table below, results from a single Genomic-ULMFiT model are compared to an ensemble of GRU models used by [27].


| Model                          	| Test Set           	| Accuracy 	| Specificity 	| Sensitivity 	| Precision 	| MCC   	|
|--------------------------------	|--------------------	|----------	|-------------	|-------------	|-----------	|-------	|
| GRU Ensemble (Hill et al.)*    	| Standard Test Set  	|   0.96   	|     __0.97__    	|     0.95    	|    __0.97__   	|  0.92 	|
| Genomic-ULMFiT 	| Standard Test Set  	|   __0.963__  	|    0.952    	|    __0.974__    	|   0.953   	| __0.926__ 	|
| GRU Ensemble (Hill et al.)*    	| Challenge Test Set 	|   0.875  	|     __0.95__    	|     0.80    	|    __0.95__   	|  0.75 	|
| Genomic-ULMFiT 	| Challenge Test Set 	|   __0.90__   	|    0.944    	|    __0.871__    	|   0.939   	| __0.817__ 	|

(*) [27] presented their results as a plot rather than as a data table. Values in the above table are estimated by reading off the plot.

## 5. Analysis

### Effect of k-mer and Stride Tokenization Parameters

Tokenizing genomic sequences with different k-mer and stride values impacts classification performance. Different datasets respond differently to changes in k-mer and stride. For example, with the metagenomics dataset from [10]:

| Amplicon Data   	| kmer/stride 	| Accuracy 	| Precision 	| Recall 	| F1    	|
|-----------------	|-------------	|----------	|-----------	|--------	|-------	|
| Genomic-ULMFiT  	|     5/2     	|   .9144  	|   .9369   	|  .9250 	| .9214 	|
| Genomic-ULMFiT  	|     5/1     	|   .9150  	|   .9309   	|  .9263 	| .9230 	|
| Genomic-ULMFiT  	|     3/1     	|   __.9239__  	|   __.9402__   	|  __.9332__ 	| __.9306__ 	|

| Shotgun Data    	| kmer/stride 	| Accuracy 	| Precision 	| Recall 	| F1    	|
|-----------------	|-------------	|----------	|-----------	|--------	|-------	|
| Genomic-ULMFiT  	|     5/2     	|   .8075  	|   .8102   	|  .8054 	| .8044 	|
| Genomic-ULMFiT  	|     5/1     	|   .8528  	|   .8631   	|  .8566 	| .8569 	|
| Genomic-ULMFiT  	|     3/1     	|   __.8797__  	|   __.8824__   	|  __.8769__ 	| __.8758__ 	|

For both the shotgun and amplicon datasets, performance improved by using smaller k-mer and stride values. However, the size of the effect differed: moving from 5/2 to 3/1, accuracy improved by ~1% on the amplicon dataset and by ~7% on the shotgun dataset.

For the short promoter dataset from [5]:

| kmer/stride 	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	| Specificity 	|
|-------------	|----------	|-----------	|--------	|-------------------------	|-------------	|
|     5/2     	|   .977   	|    .959   	|  .989  	|           .955          	|     .969    	|
|     5/1     	|   .990   	|    .983   	|  .995  	|           .981          	|     .987    	|
|     3/1     	|   __.995__   	|    __.992__   	|  __.996__  	|           __.991__          	|     __.994__    	|

This dataset shows a slight improvement using a 3/1 k-mer/stride scheme over 5/1.

For the long promoter dataset from [6]:

| kmer/stride 	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	|
|-------------	|----------	|-----------	|--------	|-------------------------	|
|     5/2     	|   0.889  	|   0.886   	|  0.846 	|          0.772          	|
|     4/2     	|   0.892  	|   0.877   	|  __0.865__ 	|          0.778          	|
|     8/3     	|   0.874  	|   0.889   	|  0.802 	|          0.742          	|
|     1/1     	|   __0.894__  	|   __0.900__   	|  0.844 	|          __0.784__          	|

This dataset shows a slight accuracy improvement with a 1/1 scheme over a 4/2 scheme, but not by much. Precision and recall for the 4/2 and 1/1 models show tradeoffs: the 1/1 model has higher precision, while the 4/2 model has higher recall.

## References

[1] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. arXiv preprint arXiv:1801.06146

[2] Chuai G, Ma H, Yan J, et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018;19(1):80. Published 2018 Jun 26. doi:10.1186/s13059-018-1459-4

[3] Min X, Zeng W, Chen S, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics. 2017;18(Suppl 13):478. Published 2017 Dec 1. doi:10.1186/s12859-017-1878-3

[4] Bite Yang, Feng Liu, Chao Ren, Zhangyi Ouyang, Ziwei Xie, Xiaochen Bo, Wenjie Shu, BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone, Bioinformatics, Volume 33, Issue 13, 1 July 2017, Pages 1930–1936, https://doi.org/10.1093/bioinformatics/btx105

[5] Umarov RK, Solovyev VV (2017) Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS ONE 12(2): e0171410. https://doi.org/10.1371/journal.pone.0171410

[6] Umarov RK, et al. 2018. PromID: human promoter prediction by deep learning. arXiv preprint arXiv:1810.01414

[7] Cohn D. et al. 2018. Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences. bioRxiv doi:https://doi.org/10.1101/264200

[8] Zeng W, Wu M, Jiang R. Prediction of enhancer-promoter interactions via natural language processing. BMC Genomics. 2018;19(Suppl 2):84. Published 2018 May 9. doi:10.1186/s12864-018-4459-6

[9] Shen Z, Bao W, Huang DS. Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep. 2018;8(1):15270. Published 2018 Oct 15. doi:10.1038/s41598-018-33321-1

[10] Fiannaca A, La Paglia L, La Rosa M, et al. Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinformatics. 2018;19(Suppl 7):198. Published 2018 Jul 9. doi:10.1186/s12859-018-2182-6

[11] Plekhanova, E., Nuzhdin, S. V., Utkin, L. V., & Samsonova, M. G. (2018). Prediction of deleterious mutations in coding regions of mammals with transfer learning. Evolutionary Applications, 12, 18–28. https://doi.org/10.1111/eva.12607

[12] Liu F, Li H, Ren C, Bo X, Shu W. 2016. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. bioRxiv. doi:10.1101/036129

[13] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182

[14] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term memory. Neural computation, 9(8):1735–1780, 1997.

[15] Inan, H., Khosravi, K., and Socher, R. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. arXiv preprint arXiv:1611.01462, 2016.

[16] Press, O. and Wolf, L. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.

[17] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[18] Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, 2016.

[19] Wan, L., Zeiler, M., Zhang, S., LeCun, Y, and Fergus, R. Regularization of neural networks using dropconnect. In Proceedings of the 30th international conference on machine learning (ICML-13), pp. 1058–1066, 2013.

[20] Ilya Loshchilov, Frank Hutter. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101, 2017.

[21] Merity, S., McCann, B., and Socher, R. Revisiting activation regularization for language rnns. arXiv preprint arXiv:1708.01009, 2017.

[22] Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018

[23] Leslie N. Smith, Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv preprint arXiv:1708.07120, 2017

[24] Leslie N Smith. Cyclical Learning Rates for Training Neural Networks. arXiv preprint arXiv:1506.01186, 2015

[25] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems. pages 3320–3328.

[26] Tianwei Yue, Haohan Wang. Deep Learning for Genomics: A Concise Overview. arXiv preprint arXiv:1802.00810, 2018

[27] Hill S.T., Kuintzle R., Teegarden A., Merrill E., 3rd, Danaee P., Hendrix D.A. A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential. Nucleic Acids Res. 2018;46:8105–8113. doi: 10.1093/nar/gky567.

[28] Weihua Guo, You Xu, Xueyang Feng. DeepMetabolism: A Deep Learning System to Predict Phenotype from Genome Sequencing. arXiv preprint arXiv:1705.03094, 2017.

[29] Seonwoo Min, Byunghan Lee, Sungroh Yoon. Deep Learning in Bioinformatics. arXiv preprint arXiv:1603.06430, 2016.

[30] Young J.D., Cai C., Lu X. Unsupervised deep learning reveals prognostically relevant subtypes of glioblastoma. BMC Bioinform. 2017;18:381. doi: 10.1186/s12859-017-1798-2

[31] Yu Li et al. Deep learning in bioinformatics: introduction, application, and perspective in big data era. arXiv preprint arXiv:1903.00342 2019

[32] Roy AL, Singer DS. Core promoters in transcription: old problem, new insights. Trends Biochem. Sci. 2015;40:165–171. doi: 10.1016/j.tibs.2015.01.007.