# Genomic-ULMFiT Methods 0: Data Representation, Model Architecture, Regularization and Training

Karl Heyer

## Introduction

Genomic-ULMFiT is a method for training deep genomic sequence classification models that shows competitive or improved performance over previously published results. This technique allows us to solve problems like:
 * Does this genomic sequence contain a promoter?
 * Is this RNA sequence coding RNA or non-coding RNA?
 * What genus does this sequencing read belong to?

This method is based on ULMFiT [1] - a transfer learning method for NLP tasks. Transfer learning is the process of taking a model trained to solve one task as the basis for solving another task. This means that rather than training a model from scratch for the second task, we initialize the model with weights learned from the initial task, then *fine tune* the weights towards the second task. Transfer learning has been extensively applied in the field of computer vision. For example, one might train a classification model on ImageNet data, then fine tune that model for classification of satellite imagery. 

Transfer learning in NLP domains has historically been restricted to embeddings. NLP models would have embedding layers initialized with pretrained weights from word2vec, GloVe, or some other source. ULMFiT extends this to transferring learned weights for multiple layers, to great effect. Importantly, the initial model is trained in an unsupervised fashion before being fine tuned for supervised classification on a labeled dataset. This means that our model performance is not restricted by the availability of labeled data. From a genomics perspective, this allows us to train the initial model on large amounts of unlabeled sequence data, then transfer the learned weights to a classification model that is fine tuned on a smaller dataset. This allows Genomic-ULMFiT to leverage the huge amount of unlabeled sequence data available to produce accurate results on small labeled datasets using __only genomic sequences as input__. The initial model is also general and reusable. It can be fine tuned towards any number of classification tasks and datasets.

This document covers the theory behind Genomic-ULMFiT and practical considerations for structuring data and training models. This document is written with the following goals in mind:
   * Cover the theory of the ULMFiT process and considerations taken applying it to genomic data
   * Explain the model architectures used, the theory behind them, and important hyperparameters
   * Describe practical methods for training models quickly and achieving high performance, including regularization and learning rate scheduling
    
This document is not intended to be a code walkthrough. Relevant code is shown in the notebooks in the [E. coli](https://github.com/tejasvi/DNAish/tree/master/Bacteria/E.%20Coli) directory.

This document is structured in the following sections:
   * Genomic Sequence Data Representation details preprocessing steps of preparing genomic data before sending it as input to a model
   * ULMFiT Overview describes the overall method and the template for what we want to achieve
   * Model Architecture covers Genomic Language Models, Genomic Classification Models, the layers that comprise them, and important hyperparameter choices in model design
   * Regularization details the many types of regularization at play in the model and how to tune them effectively
   * Training covers the ULMFiT process in detail, as well as learning rate schedules and training phases


## Table of Contents
1. Genomic Sequence Data Representation
    * 1.1 Genomic Tokenization
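    * 1.2 Genomic Numericalization
    * 1.3 Practical Tokenization
2. ULMFiT Overview
    * 2.1 General-Domain Language Model Training
    * 2.2 Task Specific Language Model
    * 2.3 Task Specific Classification Model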
3. Model Architecture
    * 3.1 High Level Overview - Language Models and Classification Models
    * 3.2 Embeddings
    * 3.3 LSTM Encoder
    * 3.4 Linear Head
        * 3.4.1 Language Model Head
        * 3.4.2 Classification Model Head
    * 3.5 Practical Model Parameters
4. Regularization
    * 4.1 Dropout
    * 4.2 Weight Decay
    * 4.3 Activation Regularization
    * 4.4 Practical Regularization Parameters
5. Training
    * 5.1 Optimizer - AdamW
    * 5.2 One Cycle Policy
    * 5.3 Selecting a Learning Rate
    * 5.4 Discriminative Learning Rates
    * 5.5 Gradual Unfreezing
    * 5.6 FP16 Training
    * 5.7 Language Model Training
    * 5.8 Language Model Fine-Tuning
    * 5.9 Classification Model Training
6. Comparison to Recent Literature
    * 6.1 Use of Pre-Training and Transfer Learning
    * 6.2 Nucleotide Representation
    * 6.3 Model Architecture
7. Summary of Results
8. Areas for Continuing Work

## 1. Genomic Sequence Data Representation

If we want to train a sequence model on genomic data, the first thing we need to figure out is how to process the data into a form that can be used by a neural network. We need to turn sequence data into a numerical form that can be manipulated mathematically. We do this in two steps - __tokenization__ and __numericalization__.

Tokenization is the process of breaking the sequence down into sub-units or tokens. Numericalization is the process of mapping tokens to integer values.

### 1.1 Genomic Tokenization

How do we break genomic data into tokens? A common way is to tokenize by nucleotide [2,3,4,5,6,7]. Single nucleotide tokenization would process the sequence `ATCGCGTACG` into `A T C G C G T A C G`. This works, but it gives a very restricted representation of genomic sub-units. This looks at every nucleotide in a vacuum. It essentially says every `A` should be treated the same, regardless of where it appears or in what context it appears.

A representation that allows for more nuance is to tokenize by k-mers instead. This approach has been used by [8,9,10]. We could tokenize the sequence `ATCGCGTACGATCCG` into:
 * 3-mers: `ATC GCG TAC GAT CCG`
 * 4-mers: `ATCG CGTA CGAT`
 * 5-mers: `ATCGC GTACG ATCCG`
 
Or some other k-mer size. Notice in the above that the sequence is truncated to the last whole k-mer. 

Another parameter in tokenization is the stride between k-mers. Stride is defined as the frame shift between k-mers relative to the sequence being tokenized. In the above example, there was no overlap between k-mers, but this does not have to be the case. Consider tokenizing the sequence `ATCGCGTACGATCCG` with a k-mer size of 4 and the following stride values:
 * Stride 1: `ATCG TCGC CGCG GCGT CGTA GTAC TACG ACGA CGAT GATC ATCC TCCG`
 * Stride 2: `ATCG CGCG CGTA TACG CGAT ATCC`
 * Stride 3: `ATCG GCGT TACG GATC`
 * Stride 4: `ATCG CGTA CGAT`
 
Notice how the stride parameter affects the number of tokens created per length of input sequence. The impact of the choice of k-mer and stride values is discussed more in Section 5.7: Language Model Training. For now, understand that k-mer and stride values are hyperparameters that must be decided before training begins, and that the choice of k-mer and stride has an effect on compute time and performance.
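
The tokenization described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the notebooks; the function name `tokenize` is my own choice for this example.

```python
def tokenize(sequence, k, stride):
    """Split a nucleotide string into k-mers taken every `stride` positions.

    Trailing nucleotides that do not fill a whole k-mer are dropped.
    """
    last_start = len(sequence) - k
    return [sequence[i:i + k] for i in range(0, last_start + 1, stride)]


seq = "ATCGCGTACGATCCG"
for stride in (1, 2, 3, 4):
    print(f"Stride {stride}:", " ".join(tokenize(seq, k=4, stride=stride)))
```

Setting `stride` equal to `k` gives the non-overlapping tokenization from the first example, while a stride of 1 produces the most tokens per input sequence.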

### 1.2 Genomic Numericalization

Once we have decided on the k-mer and stride values to use in our tokenization, numericalizing is easy. We simply create a dictionary mapping each unique k-mer to an integer value. This creates the __vocabulary__ of the model - the total set of possible tokens. For a given k-mer length, the vocabulary will be $4^k + 1$ tokens. The $+ 1$ comes from adding a padding token, which will be important for batching sequences of different length. So for tokenization and numericalization of a sequence with k-mer length 3 and stride 2, we might see the following:

Sequence: `ATCGCGTACGATCCG`

Tokenization: `ATC CGC CGT TAC CGA ATC CCG`

Numericalization: `[5, 12, 8, 32, 27, 5, 14]`

Where the final numericalized list is the input to the model. Once a numericalized input is sent to the model, it is turned into a vector representation via an embedding. The integer values of the numericalized input correspond to rows in the embedding matrix. This is discussed further in Section 3.2: Embeddings.
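
As a rough sketch of numericalization, the snippet below builds a vocabulary of every possible k-mer plus a padding token and maps tokens to integers. The padding token name `<pad>`, the function names, and the ordering of the vocabulary are assumptions made for this example, so the integer IDs will not match the illustrative values shown above. It also assumes sequences contain only `A`, `C`, `G` and `T`.

```python
from itertools import product

PAD = "<pad>"  # hypothetical name for the padding token


def build_vocab(k, alphabet="ACGT"):
    """Map the padding token and every possible k-mer to an integer ID."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return {token: idx for idx, token in enumerate([PAD] + kmers)}


def numericalize(tokens, vocab):
    """Replace each k-mer token with its integer ID."""
    return [vocab[t] for t in tokens]


vocab = build_vocab(k=3)
tokens = ["ATC", "CGC", "CGT", "TAC", "CGA", "ATC", "CCG"]
ids = numericalize(tokens, vocab)
print(len(vocab))  # number of entries: 4**3 k-mers plus the padding token
print(ids)
```

Note that the repeated `ATC` token maps to the same integer both times, just as in the example above.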

### 1.3 Practical Tokenization

In practice I find k-mer lengths from 3-5 and stride values of 1-2 work best.

## 2. ULMFiT Overview

Now that we can process genomic data into a form we can feed to a model, we need to determine our strategy for training the model. Let's start by defining our end goal: We want to train a sequence model to classify genomic sequences using sequence input alone. This poses a potential problem. Sequence models tend to require a large amount of data to train effectively, and labeled genomic classification datasets can be small. The ULMFiT approach provides a solution to this. ULMFiT breaks training into three stages:

1. First we train a general domain language model in an unsupervised fashion on a large unlabeled corpus
2. We fine tune the general language model on the classification corpus to create a task specific language model
3. We fine tune the task specific language model for classification

Before going further, let's define the two types of models we will deal with. A __Language Model__ is a model that takes in a sequence of k-mer tokens and predicts the next token in the sequence. A __Classification Model__ is a model that takes in a sequence of tokens and predicts what category or class that sequence belongs to.

A language model is trained in an unsupervised fashion, meaning that no labeled data is required. Since the goal of the language model is to predict the next k-mer in a sequence, each k-mer becomes a correct output prediction for the sequence that precedes it. This means we can generate huge amounts of paired data (input sequence + next k-mer) from any unlabeled genomic sequence.
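
To make the pairing concrete, here is a small sketch of how inputs and targets can be derived from a single numericalized sequence by shifting it one position. The function name is hypothetical, and real training code would also handle batching.

```python
def language_model_pairs(token_ids):
    """Return (inputs, targets) where targets[i] is the token after inputs[i]."""
    return token_ids[:-1], token_ids[1:]


ids = [5, 12, 8, 32, 27, 5, 14]
inputs, targets = language_model_pairs(ids)
for step, target in enumerate(targets):
    print(f"context {inputs[:step + 1]} -> next token {target}")
```

A sequence of seven tokens yields six training examples without any labels.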

A classification model is trained in a supervised fashion, requiring paired labeled data. For example if the task is promoter classification, all sequences in the classification dataset must be labeled as 0 or 1 for not-promoter or promoter.

The architectures for the Classification Model and the Language Model follow similar structures - they consist of an __Embedding__, an __Encoder__, and a __Linear Head__. On a high level, these layers function in the following ways:

 * Embedding: Converts the numericalized tokens of the input sequence into vector representations
 * Encoder: Processes the vectorized sequence into a hidden state
 * Linear Head: Uses the hidden state to make a classification decision.
 
When we move between stages, we transfer the learned weights from one model to the next. When we train the language models, we transfer all three sections of the model. When we transfer to the classification model, we only transfer the Embedding and the Encoder, as the classification model requires a different linear head. Visually:

![](media/ulmfit1.png)

(Black arrows show transfer learning)
 
Model architecture is discussed in detail in Section 3.

### 2.1 General-Domain Language Model Training

In the first stage of training, we want to train a general genomic language model on a large unlabeled corpus. The key detail here is that the training corpus used in this step can be any corpus in a similar domain to the classification corpus. So if for example we wanted to classify human genome sequences as either promoter or not-promoter, we could use the entire human genome to train the general domain language model. We could even go further and use an ensemble of genomes from animals phylogenetically similar to humans. This allows us to generate large amounts of training data very easily.

The general domain language model will form the basis for all subsequent models. For this reason, we want the general domain model to be well trained. But what exactly does this mean? We want a model that understands the structure of genomic data and is able to pull meaning from nucleotide sequences. We use the language modeling task (predicting the next k-mer) as a proxy for this. In practice, I have seen that improving the general domain language model has a direct impact on improving performance of the classification model downstream. For this reason, it is worth investing in training a high performing general domain language model. Consequently, this step is actually the most time consuming step of the process. I typically invest 12+ hours into training the general domain language model, compared to 1-4 hours for fine tuning the task specific language model and 0.25-1 hours for training the classification model.

While training the general domain language model is computationally intensive, it only needs to be done once. If you train a human genome language model, that language model can be fine tuned for any number of downstream tasks. This means the general domain language model has a high return on investment of compute time.

### 2.2 Task Specific Language Model

Once we have the general model trained, we want to fine tune it to the classification corpus to create a task specific language model. This is because no matter how general the general domain language model is, the classification dataset likely comes from a different distribution [1]. If we have a classification dataset specifically curated for a set of recognizable genomic classes, there are likely motifs and other structures in the sequence data that are more significant in the context of the classification dataset than in the general domain corpus. 

The task specific language model is initialized with the weights of the general domain language model. The full model (Embedding + Encoder + Linear Head) is transferred.

### 2.3 Task Specific Classification Model

Once we have trained the task specific language model, we can attempt our original goal: training a classification model. This model is initialized using the Embedding and the Encoder of the task specific language model. The Linear Head is not transferred, as the final classification task has changed. The Language Model produced a prediction vector corresponding to the length of the k-mer vocabulary (predicting the next k-mer), while the Classification Model produces a prediction vector with length equal to the number of classes in the classification dataset.

Initializing the classification model with the embedding and encoder of the task specific language model allows the classification model to train much faster while also being more robust to overfitting compared to training a model from scratch. Empirically I have found that models trained using transfer learning require much less regularization than models trained from scratch. This performance boost from pre-training and transfer learning is what allows ULMFiT to work so effectively on small datasets.

Transfer learning is an extremely important step for getting high quality results from training on small datasets. Transfer learning for deep learning genomics models has been done [3,7,11,12], but it is far from common. Many published methods train from scratch. I would expect pre-training to provide a general improvement to many methods.

## 3. Model Architecture

This section deals with the particulars of the language model and classification model architectures.

### 3.1 High Level Overview -  Language Models and Classification Models

As described in the previous section, ULMFiT uses a three step transfer learning process that utilizes two types of models: Language Models and Classification Models. These models are built using three sections: an __Embedding__, an __Encoder__, and a __Linear Head__. On a high level, these layers function in the following ways:

 * Embedding: Converts the tokens of the input sequence into vector representations
 * Encoder: Processes the vectorized sequence into a hidden state
 * Linear Head: Uses the hidden state to make a classification decision.

### 3.2 Embeddings

As described in Section 1, the ultimate input into the model is a vector of integers. For example:

`[5, 12, 8, 32, 27, 5, 14]`

Each integer value represents a k-mer in the vocabulary of our dataset. The sequence of integers represents a real sequence of k-mers. The first step of processing the input sequence is to convert the integer values into some sort of vector representation. A common way this is done in literature is to convert each token into a one hot encoded vector [2,3,4,5,6,7]. This works, but it is not a very rich representation. All the vectors are mutually orthogonal and don't contain any information about the relationships between k-mers.

By using an Embedding to represent k-mers as learned vectors, the model can learn more meaningful relationships between k-mers. This is also functionally identical to using one hot representations and passing them through a learned weight matrix.

The embedding weight matrix will have a size of `vocab x n_embedding` where `vocab` is the length of the model vocabulary and `n_embedding` is the length of the embedding vectors.
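
The equivalence between an embedding lookup and a one hot vector multiplied by a weight matrix can be checked with a toy example. The sizes below are deliberately tiny and the weights are random; this only illustrates the idea and is not how the model is implemented.

```python
import random

random.seed(0)
vocab_size, n_embedding = 6, 4  # toy sizes for illustration only
weights = [[random.gauss(0, 1) for _ in range(n_embedding)]
           for _ in range(vocab_size)]


def embed_lookup(token_id):
    """Embedding view: select one row of the weight matrix."""
    return weights[token_id]


def embed_one_hot(token_id):
    """One hot view: multiply a one hot row vector by the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * weights[i][j] for i in range(vocab_size))
            for j in range(n_embedding)]


print(embed_lookup(3))
print(embed_one_hot(3))
```

Both functions return the same vector, but the lookup avoids building and multiplying a mostly-zero vector of length `vocab`.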

In practice the embedding is implemented using PyTorch's `nn.Embedding` module. An embedding vector length of 400 is typically used, but studies into the effect of embedding vector length have not been conducted.


### 3.3 LSTM Encoder

The encoder section is made of three stacked LSTM layers. This structure comes from the AWD-LSTM model [13] which is the standard model for ULMFiT [1]. An LSTM is used over a standard RNN as the update structure of the LSTM allows the model to retain information over longer sequences and filter information at each time step [14]. LSTMs are also less prone to vanishing gradients, a perpetual problem in standard RNNs. GRU units could likely be used in place of LSTMs, but this has not been tested.

The LSTM layers are structured such that the number of hidden units expand, then contract. A standard structure would be:
   1. LSTM(n_embedding, n_hidden)
   2. LSTM(n_hidden, n_hidden)
   3. LSTM(n_hidden, n_embedding)
   
Where `n_embedding` is the size of the embedding vectors, `n_hidden` is the number of hidden units in the LSTM stack, and `n_hidden > n_embedding`. The output of the final layer is set to a size of `n_embedding` to allow for weight tying when training the language models. See Section 3.4.1 for more information.

In practice an `n_hidden` value of 1150 is typically used. LSTMs are implemented using the standard PyTorch `nn.LSTM` module.


### 3.4 Linear Head

The linear head uses the hidden states output by the final LSTM layer to make predictions. Different linear heads are used for the Classification Model and the Language Model, as each model performs classification for a different purpose and uses the hidden states from the final LSTM layer in a different way. The Language Model predicts the next k-mer in a genomic sequence, outputting a classification vector of length equal to the model vocabulary. The Language Model outputs a prediction for each time step in the input sequence, using each hidden state from the final LSTM layer to generate predictions.

The Classification Model makes classification predictions over the number of classes in the dataset. The classification model uses aggregates of all hidden states to produce a single prediction.

#### 3.4.1 Language Model Head

The Language Model linear head consists of just a single linear layer. The layer has a weight matrix of size `n_embedding x vocab`. The output is of length `vocab`, which makes sense as the model is predicting over k-mers in the vocab. The input size is `n_embedding`, which makes the size of the output weight matrix identical to the input embedding matrix, just with the dimensions flipped. This is intentional, as it allows us to tie the weights of the embedding to the softmax layer. This technique is motivated by [15,16] and found by [13] to lead to significantly improved language modeling performance. It also has the nice effect of reducing the number of parameters in the model.

The linear layer uses hidden states from all time steps to output predictions at every time step. So if a sequence section is 200 k-mers long, the model outputs a matrix of `200 x vocab` of next-k-mer predictions at every time step. This allows us to massively expand the amount of usable data we have for the language model. Each k-mer serves as the output value for the previous k-mer, and the input value for the subsequent k-mer. An important caveat to note is that this training approach is __not compatible__ with bidirectional LSTMs. If a bidirectional LSTM is used, the model will be predicting over k-mers it has already seen. A bidirectional model will achieve almost 100% accuracy while learning nothing.

In practice, the linear layer is implemented using PyTorch's `nn.Linear` module. The weights for the softmax layer are tied to the weights of the embedding layer. Typically bias is also included in the final layer, so strictly speaking there is not 100% weight tying between the embedding and the softmax layer.

#### 3.4.2 Classification Model Head

The linear head for the Classification Model is more complex than the linear head for the Language Model. This is due in part to the weight tying restriction on the complexity of the Language Model head - the Language Model must use a weight matrix the same size as the embedding, no more.

The Classification Model head consists of two linear layers with batchnorm layers in between. The classification head also uses the LSTM hidden states differently. Typically the final hidden state from the last LSTM layer is used for classification, but the most important parts of the sequence might be buried in the middle. Following the methods used in [1], we take three vectors - the final hidden state, a max pooling over all hidden states, and an average pooling over all hidden states - and concatenate them together. So for a matrix containing hidden states at all time steps $H = [h_{0}, \ldots, h_{t}]$, we create $h_{c} = cat[h_{t}, maxpool(H), meanpool(H)]$ and use the vector $h_{c}$ as input to the linear head. $h_{c}$ will have a length of `n_embedding*3`. Since we transfer the learned weights from the embedding and encoder of the language model, we still use the LSTMs from the language model, which output hidden states of length `n_embedding`.

The structure of the classification head is as follows:
1. Batchnorm1d(n_embedding*3)
2. Linear(n_embedding*3, n) + bias
3. ReLU
4. Batchnorm1d(n)
5. Linear(n, n_classes) + bias

In practice, the intermediate size `n` is typically 50, but can be tuned. Since the standard `n_embedding` size is 400, the input of length `n_embedding*3` will be of length 1200. The final output size `n_classes` is determined by the dataset.
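
The concatenated pooling step that feeds the classification head can be sketched on plain Python lists. This is a simplified illustration for a single sequence; it ignores batching and padding, and the function name `concat_pool` is my own.

```python
def concat_pool(hidden_states):
    """Build h_c from per-time-step hidden states of length n_embedding.

    Returns the final hidden state, the element-wise max over time and the
    element-wise mean over time, concatenated into one vector.
    """
    final = hidden_states[-1]
    columns = list(zip(*hidden_states))  # one tuple of values per hidden unit
    max_pool = [max(col) for col in columns]
    mean_pool = [sum(col) / len(col) for col in columns]
    return list(final) + max_pool + mean_pool


H = [[0.1, -0.4], [0.9, 0.2], [0.3, 0.0]]  # 3 time steps, n_embedding = 2
h_c = concat_pool(H)
print(len(h_c))  # three times n_embedding
print(h_c)
```

The output length depends only on `n_embedding`, not on the sequence length, which is what allows a fixed-size linear layer to follow.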

### 3.5 Practical Model Parameters

To summarize the parameters used in practice:

For the Embedding and Encoder layers:
   * Embeddings - size (vocab, 400)
   * LSTM 1 - size (400, 1150)
   * LSTM 2 - size (1150, 1150)
   * LSTM 3 - size (1150, 400)

For the Language Model Head:
   * Linear - size (400, vocab)
   
For the Classification Model Head:
   * Batchnorm1d 1 - size (1200)
   * Linear 1 - size (1200, 50)
   * Batchnorm1d 2 - size (50)
   * Linear 2 - size (50, n_classes)
   
Updating the diagram from earlier:

![](media/ulmfit2.png)
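
To get a feel for how these sizes translate into model size, the sketch below counts the weights in the embedding and encoder. It assumes the parameter layout of PyTorch's `nn.LSTM` (four gates per layer, each with input weights, recurrent weights and two bias vectors) and a vocabulary of $4^k + 1$ tokens. The function names are hypothetical.

```python
def lstm_params(n_in, n_out):
    """Parameter count of one LSTM layer with four gates and two bias vectors."""
    return 4 * n_out * (n_in + n_out + 2)


def encoder_params(k, n_embedding=400, n_hidden=1150):
    """Parameter counts for the embedding and the three stacked LSTM layers."""
    vocab = 4 ** k + 1
    return {
        "embedding": vocab * n_embedding,
        "lstm_1": lstm_params(n_embedding, n_hidden),
        "lstm_2": lstm_params(n_hidden, n_hidden),
        "lstm_3": lstm_params(n_hidden, n_embedding),
    }


counts = encoder_params(k=5)
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {sum(counts.values()):,}")
```

With weight tying, the language model head reuses the embedding matrix, so it only adds a bias vector of length `vocab` on top of these counts.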

## 4. Regularization

Effectively using regularization to improve model performance and prevent overfitting is one of the major results of [13]. Many different types of regularization are used in the AWD-LSTM model. This section covers the different forms of regularization and how they are implemented.

### 4.1 Dropout

Dropout, introduced by [17], is a standard form of regularization. Dropout uses a randomly selected mask to zero out activations, forcing the network to learn multiple paths from input to output. However, a naive application of dropout functions poorly in recurrent models. Naive dropout - sampling a new dropout mask at every time step - disrupts the network's ability to retain long term dependencies [18]. To get around this, we need to be clever in how we apply dropout. Thankfully, all that hard work was done by [13] and now we can just follow their example. We apply dropout in four different ways:

#### Embedding Dropout

We apply dropout to the embedding matrix. Dropout is applied to the embedding matrix on the _k-mer level_. This means that every k-mer in the vocabulary has a probability of $p_{emb}$ of being completely zeroed out. The remaining k-mer vectors are scaled by a factor of $\frac{1}{1-p_{emb}}$ to compensate. For a given batch of data, the same embedding dropout mask is used for everything. This means that if a given k-mer vector is dropped out, that k-mer disappears from the entire batch. This form of dropout is used by [13] and stems from [18].
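
A minimal sketch of embedding dropout on a toy embedding matrix is shown below. It uses plain Python lists and a seeded random generator so the behaviour is easy to inspect; the function name and toy sizes are assumptions for this example.

```python
import random


def embedding_dropout(weights, p_emb, rng):
    """Zero whole rows (k-mers) with probability p_emb and rescale the rest."""
    scale = 1.0 / (1.0 - p_emb)
    dropped = []
    for row in weights:
        keep = rng.random() >= p_emb  # one decision per k-mer, not per element
        dropped.append([w * scale if keep else 0.0 for w in row])
    return dropped


rng = random.Random(0)
weights = [[1.0, 2.0, 3.0] for _ in range(8)]  # toy vocabulary of 8 k-mers
masked = embedding_dropout(weights, p_emb=0.25, rng=rng)
for row in masked:
    print(row)
```

Every row is either entirely zero or entirely rescaled, which is what distinguishes this from element-wise dropout.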

#### Weight-Dropped LSTM

This form of dropout applies DropConnect [19] to the hidden-to-hidden weight matrices of the LSTM layers. There are two important points here. The first is that the dropout mask is sampled once at the start of the batch, then reused for all time steps within a batch. Using a constant dropout mask avoids the issues raised by [18] of dropout impacting the network's ability to retain long term dependencies. The second is that the dropout mask is applied to the _weights_ of the RNN, not the _activations_ moving through the RNN.

#### Variational Dropout

Variational Dropout, proposed by [18], is similar to DropConnect, except it is applied to the activations rather than the weights. As with DropConnect, the dropout mask is sampled once at the start of a batch and reused at every time step for the inputs and outputs of the LSTM layers. Unlike DropConnect, which uses a single dropout mask for all elements in a batch, Variational Dropout uses a different mask for each element in a minibatch.
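
The mask-sharing pattern can be sketched as follows, with activations stored as nested lists indexed by batch element, time step and hidden unit. This is an illustration under those assumptions, with a hypothetical function name, and not the implementation used in practice.

```python
import random


def variational_dropout(activations, p, rng):
    """Apply one dropout mask per batch element, shared across all time steps.

    activations[b][t][j] is the value for batch element b, time step t, unit j.
    """
    scale = 1.0 / (1.0 - p)
    out = []
    for sequence in activations:
        n_units = len(sequence[0])
        # Sample the mask once for this batch element...
        mask = [scale if rng.random() >= p else 0.0 for _ in range(n_units)]
        # ...then reuse it at every time step.
        out.append([[a * m for a, m in zip(step, mask)] for step in sequence])
    return out


rng = random.Random(0)
batch = [[[1.0] * 6 for _ in range(4)] for _ in range(3)]  # 3 sequences, 4 steps, 6 units
dropped = variational_dropout(batch, p=0.5, rng=rng)
for sequence in dropped:
    print(sequence[0])
```

Within one sequence the same units are zeroed at every time step, while different sequences in the batch get independently sampled masks.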

#### Standard Dropout

Regular old dropout is also used, but only in the linear head of the classification model. Dropout is used before each of the two linear layers in the classification head.


### 4.2 Weight Decay

Weight decay is used as another form of regularization. Weight decay penalizes the model for having large weights, and has been shown by [13] to improve the AWD-LSTM language model. An important point here is precisely _how_ weight decay is applied. There are two ways weight decay can be applied. There is the L2 Regularization method, where the sum of squared weights is added directly to the loss:

$L_{\text{total}} = L + \lambda \lVert w \rVert_2^2$

And the weight decay method, where the weight component is added during the gradient update.


$w_{i+1} = w_i - 2 \lambda w_i - \Big<\frac{\delta L}{\delta w}|_{w_i}\Big>$


From now on I will refer to the first method as L2 Regularization, and the second method as Weight Decay.

When using a simple optimizer like SGD, these methods are equivalent. However, we will be training the model using the Adam optimizer (see Section 5.1), which maintains running averages of past gradients and squared gradients. This means that L2 Regularization will have a very different effect compared to Weight Decay. If we use L2 Regularization, the sum of squared weights becomes part of our loss value, and therefore part of our gradients. When the Adam optimizer accumulates gradients, the L2 Regularization term will be accumulated as well.

The Weight Decay method, on the other hand, is only applied during the update step, so the weight terms are not included in the accumulated gradients. This difference was pointed out in [20], which showed the Weight Decay method (which they call AdamW) gave improved performance.

Weight Decay in Genomic-ULMFiT is implemented using the AdamW/Weight Decay method.
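
The difference between the two methods can be seen on a single scalar weight. The sketch below is a toy illustration, not the optimizer code used for training. It writes the learning rate `lr` out explicitly and assumes the decay term is scaled by `lr`, which the update rule above leaves implicit. The Adam function covers only the first step from zero-initialised moment estimates.

```python
import math


def sgd_l2(w, grad, lr, wd):
    """SGD with L2 Regularization: the penalty gradient joins the loss gradient."""
    return w - lr * (grad + 2 * wd * w)


def sgd_decoupled(w, grad, lr, wd):
    """SGD with Weight Decay: the decay is applied directly in the update."""
    return w - lr * 2 * wd * w - lr * grad


def adam_step(w, grad, lr, wd, decoupled, beta1=0.9, beta2=0.99, eps=1e-8):
    """First Adam step, with either L2 Regularization or decoupled Weight Decay."""
    g = grad if decoupled else grad + 2 * wd * w
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat, v_hat = m / (1 - beta1), v / (1 - beta2)  # bias correction at step 1
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * 2 * wd * w  # decay bypasses the moment estimates
    return w_new


w, grad, lr, wd = 1.0, 0.5, 0.1, 0.01
print("SGD :", sgd_l2(w, grad, lr, wd), sgd_decoupled(w, grad, lr, wd))
print("Adam:", adam_step(w, grad, lr, wd, decoupled=False),
      adam_step(w, grad, lr, wd, decoupled=True))
```

The two SGD results agree, while the two Adam results differ, because in the L2 case the penalty is rescaled by Adam's gradient normalization along with everything else.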

### 4.3 Activation Regularization

[13] implements another interesting form of regularization by adding an L2 penalty to the loss based on the activations of the model. This is implemented in two ways - Activation Regularization (AR) and Temporal Activation Regularization (TAR).

#### Activation Regularization

AR penalizes the model for having activations significantly larger than 0. AR is defined as $\alpha L_{2}(m \odot h_{t})$, where $m$ is the dropout mask, $h_{t}$ is the output activation at time $t$, $L_{2}(\cdot)$ is the sum of squared operation and $\alpha$ is the AR coefficient.

#### Temporal Activation Regularization

TAR [21] imposes a slowness constraint on the model by penalizing the difference in activation values between two time steps. TAR is defined as $\beta L_{2}(h_{t} - h_{t-1})$ where $h_{t}$ is the output activation at time $t$, $h_{t-1}$ is the output activation at time $t-1$ and $\beta$ is the TAR coefficient.

The L2 penalty for AR and TAR is used in the same way as L2 Regularization was defined in the previous section: the values for AR and TAR are added directly to the loss value before backpropagation, so AR and TAR are embedded in the accumulated gradients.
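
As a rough sketch, the two penalties can be computed from a list of per-time-step output vectors. The code follows the sum-of-squares definitions given above and sums the penalty over all time steps, which is an assumption on my part; library implementations may average instead. The function names are hypothetical.

```python
def sum_squares(vector):
    """The L2(.) operation from the text: sum of squared elements."""
    return sum(x * x for x in vector)


def ar_penalty(masked_outputs, alpha):
    """AR: masked_outputs[t] holds the masked output m * h_t at time step t."""
    return alpha * sum(sum_squares(h) for h in masked_outputs)


def tar_penalty(outputs, beta):
    """TAR: penalize the change in output between consecutive time steps."""
    return beta * sum(
        sum_squares([a - b for a, b in zip(h_t, h_prev)])
        for h_prev, h_t in zip(outputs, outputs[1:])
    )


outputs = [[1.0, 0.0], [1.0, 2.0], [0.0, 2.0]]  # 3 time steps, 2 units
extra_loss = ar_penalty(outputs, alpha=2.0) + tar_penalty(outputs, beta=1.0)
print(extra_loss)
```

The resulting value would be added to the language modeling loss before backpropagation.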

### 4.4 Practical Regularization Parameters

We just covered many different types of regularization. This section summarizes practical values that I have found to work on genomic datasets. This comes with the caveat that I have not run serious ablation studies on all the regularization parameters. Many of these parameters come from the default ULMFiT values in [1].

#### Dropout Parameters

There are four dropout hyperparameters to set:
* $p_{emb}$ - dropout on the embedding
* $p_{weight}$ - dropout on the weights of the LSTM layers
* $p_{hidden}$ - variational dropout on the activations of the LSTM layers
* $p_{output}$ - standard dropout applied to activations in the linear head

It is important to consider both the magnitude of each dropout parameter and the relative ratios between dropout parameters. In practice I find it best to set values for all dropout parameters and tune a `drop_mult` parameter that scales all four dropout parameters while keeping the same ratio. I typically use the same ratios for all datasets, and tune `drop_mult` on a per dataset basis. With the caveat that I have not run serious ablations on dropout configurations, here are the parameters I typically use.

For genomic language models:
* $p_{emb} = 0.02$ 
* $p_{weight} = 0.15$
* $p_{hidden} = 0.1$
* $p_{output} = 0.25$
* drop_mult $= 0.15-0.35$

For genomic classification models:
* $p_{emb} = 0.1$ 
* $p_{weight} = 0.5$
* $p_{hidden} = 0.2$
* $p_{output} = 0.4$
* drop_mult $= 0.2 - 0.7$
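
The effect of `drop_mult` can be sketched as a simple scaling of the base values listed above. The dictionary and function names are my own for this example.

```python
LM_DROPOUT = {"p_emb": 0.02, "p_weight": 0.15, "p_hidden": 0.1, "p_output": 0.25}
CLAS_DROPOUT = {"p_emb": 0.1, "p_weight": 0.5, "p_hidden": 0.2, "p_output": 0.4}


def scale_dropout(base, drop_mult):
    """Scale every dropout probability by drop_mult, preserving their ratios."""
    return {name: p * drop_mult for name, p in base.items()}


for drop_mult in (0.2, 0.5, 0.7):
    print(drop_mult, scale_dropout(CLAS_DROPOUT, drop_mult))
```

Tuning a single multiplier keeps the search one-dimensional while the relative strength of each dropout type stays fixed.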

#### Weight Decay

I typically use a weight decay coefficient of $wd = 1e-2$. If the model is struggling to train and changing dropout does not appear to help, I will lower weight decay to $1e-3$.

#### AR and TAR

Following [13], the coefficients for AR and TAR are 2 and 1 respectively.

## 5. Training

This section covers all aspects of training, including optimization, learning rate scheduling, learning rate selection and other important parameters for training models quickly and avoiding overfitting.

### 5.1 Optimizer - AdamW

The model is trained using the AdamW optimizer, the variant of Adam that implements weight decay during the weight update step rather than in the loss calculation. The particulars of how weight decay in AdamW differs from L2 Regularization are explained in Section 4.2. The Adam coefficients are $\beta_{1} = 0.9, \beta_{2} = 0.99$.

### 5.2 One Cycle Policy

Learning rates are scheduled using the One Cycle policy from [22]. Learning rates start low, rise following a cosine function to a high learning rate peak, then decrease following a cosine function back down to a low learning rate. At the same time, momentum is decreased, then increased following an inverted form of the learning rate function. This leads to a learning rate plot that looks like this:

![](lr_sched.png)

The high learning rates in the middle of the cycle provide regularization and prevent overfitting. It was observed in [23] that approximations of the Hessian of the loss were lower when using this method, indicating the model was in a flatter area of the loss landscape. [23] also observed that models trained much faster.

Momentum scheduling runs from a maximum of 0.8 to a minimum of 0.7. The learning rate maximum/momentum minimum occurs 30% of the way through the training cycle. Learning rate is increased from a minimum of $\frac{lr}{25}$ to a maximum of $lr$, then back down to $\frac{lr}{25}$. Selecting the learning rate $lr$ is described in the next section.
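
The schedule described in this section can be sketched as a function of the fraction of training completed. This is a minimal illustration using the values quoted above (a divisor of 25, a peak at 30%, momentum between 0.8 and 0.7). The function and parameter names are my own, and library implementations may differ in details.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) * (1 + math.cos(math.pi * pct)) / 2


def one_cycle(pct, lr_max, div=25.0, pct_start=0.3, mom_max=0.8, mom_min=0.7):
    """Return (learning rate, momentum) at fraction pct of the training cycle."""
    lr_min = lr_max / div
    if pct < pct_start:
        # Warm-up phase: learning rate rises while momentum falls.
        p = pct / pct_start
        return cos_anneal(lr_min, lr_max, p), cos_anneal(mom_max, mom_min, p)
    # Annealing phase: learning rate falls while momentum rises.
    p = (pct - pct_start) / (1 - pct_start)
    return cos_anneal(lr_max, lr_min, p), cos_anneal(mom_min, mom_max, p)


for pct in (0.0, 0.15, 0.3, 0.65, 1.0):
    lr, mom = one_cycle(pct, lr_max=1e-2)
    print(f"pct={pct:.2f}  lr={lr:.5f}  momentum={mom:.3f}")
```

In a training loop, `pct` would be the current iteration divided by the total number of iterations in the cycle.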

### 5.3 Selecting a Learning Rate

Learning rate is a crucial parameter for getting good results. Too low and the model will fail to train. Too high and the model will plateau or diverge. Learning rates are selected using the method described in [23,24]. The model is trained with an exponentially increasing learning rate, starting from a very low rate and ramping up until the model diverges. The loss is plotted as a function of learning rate, generating a plot like this:

![](lr_finder.png)

A learning rate is selected from anywhere in the region where the loss is monotonically decreasing. From the plot above, a learning rate ranging from $5e-4$ to $1e-2$ would be acceptable. This learning rate corresponds to the maximum learning rate in the training cycle. As such, I tend to use learning rates on the higher end of the range. 
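A minimal sketch of the sweep, assuming a user-supplied `train_step(lr)` callable that runs one mini-batch at the given learning rate and returns the loss. That callable, the start and end rates, and the `diverge_factor` stopping rule are all assumptions for illustration rather than the exact procedure used here.

```python
def lr_sweep(lr_start=1e-7, lr_end=10.0, num_steps=100):
    """Exponentially spaced learning rates from lr_start to lr_end."""
    ratio = lr_end / lr_start
    return [lr_start * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

def run_lr_finder(train_step, lrs, diverge_factor=4.0):
    """Record (lr, loss) pairs until the loss diverges.

    train_step is a hypothetical callable: it trains on one mini-batch at
    the given learning rate and returns the loss as a float.
    """
    history, best = [], float("inf")
    for lr in lrs:
        loss = train_step(lr)
        # Stop on NaN or when the loss is far above the best loss seen so far.
        if loss != loss or loss > diverge_factor * best:
            break
        best = min(best, loss)
        history.append((lr, loss))
    return history

lrs = lr_sweep()
```

The recorded pairs are what gets plotted (typically with a log-scaled x axis) to pick the learning rate by eye.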

### 5.4 Discriminative Learning Rates

Different layers in the network encode different types of information [25]. In the context of transfer learning, different layers of the pre-trained model need to be fine tuned to different extents. This is done through the use of discriminative learning rates, introduced by [1]. With this technique, higher layers in the model are fine-tuned at higher learning rates compared to the lower layers of the model. Following [1], learning rates follow the function $\eta^{l-1} = \frac{\eta^{l}}{2.6}$.

Discriminative learning rates are used in fine tuning the language model and training the classification model.
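As a small worked example of the $\eta^{l-1} = \eta^{l}/2.6$ rule, the helper below computes one learning rate per layer group. The helper name and the choice of four layer groups are assumptions for illustration.

```python
def discriminative_lrs(lr_top, n_groups, factor=2.6):
    """Learning rates per layer group, ordered from lowest layer to highest.

    The highest group trains at lr_top; each group below it trains at the
    rate of the group above divided by `factor`.
    """
    return [lr_top / factor ** (n_groups - 1 - i) for i in range(n_groups)]

# Four layer groups with a top learning rate of 1e-3.
lrs = discriminative_lrs(lr_top=1e-3, n_groups=4)
```

With these inputs the lowest group trains at roughly $1e-3 / 2.6^{3} \approx 5.7e-5$.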

### 5.5 Gradual Unfreezing

Fine tuning a pre-trained model all at once risks catastrophic forgetting. To avoid this, we employ gradual unfreezing [1]. First we unfreeze the final layer of the model and train with the remaining layers frozen. Then we unfreeze the second to last layer of the model and train the final two layers. And so on until all layers are unfrozen.

Gradual unfreezing is employed for fine-tuning the classification model.
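The unfreezing order can be sketched as a generator that lists the trainable layer groups at each stage. The group names below are hypothetical labels, and grouping the embedding with the first LSTM is an assumption made so the example yields four stages.

```python
def unfreezing_schedule(layer_groups):
    """Yield the trainable layer groups for each unfreezing stage.

    layer_groups is ordered from lowest (input side) to highest (output side).
    Stage 1 trains only the last group; each later stage adds the group below.
    """
    for n_unfrozen in range(1, len(layer_groups) + 1):
        yield layer_groups[-n_unfrozen:]

# Hypothetical group names, for illustration only.
groups = ["embedding+lstm_1", "lstm_2", "lstm_3", "linear_head"]
stages = list(unfreezing_schedule(groups))
for i, trainable in enumerate(stages, start=1):
    print(f"stage {i}: train {trainable}")
```

In a PyTorch model, each stage would correspond to setting `requires_grad` to `True` on the parameters of the listed groups and `False` on the rest.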

### 5.6 FP16 Training

To improve training speed, models are trained in FP16. The FP16 training cycle implementation from the fastai library is used. The FP16 training cycle consists of:

1. Compute the forward pass with the FP16 model, then the loss.
2. Multiply the loss by the loss scale, then back-propagate the gradients in half-precision.
3. Copy the gradients in FP32 precision then divide by the loss scale.
4. Do the update on the master model (in FP32 precision).
5. Copy the master model into the FP16 model.

Batchnorm computations are always done in FP32. The loss scale parameter is determined dynamically. The loss scale starts at $2^{16}$ and is halved until overflow ceases to occur.
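The dynamic loss scale search can be illustrated with a simplified pure-Python sketch. It checks scaled gradients against the largest finite FP16 value, whereas a real implementation would detect overflow by looking for inf or NaN values in the FP16 gradients. The function names are assumptions, and this is not the fastai code.

```python
FP16_MAX = 65504.0  # largest finite value in IEEE 754 half precision

def find_loss_scale(grads, init_scale=2.0 ** 16):
    """Halve the loss scale until no scaled gradient would overflow FP16."""
    scale = init_scale
    while scale > 1.0 and any(abs(g) * scale > FP16_MAX for g in grads):
        scale /= 2.0
    return scale

def unscale(scaled_grads, scale):
    """Step 3 of the cycle: divide the FP32 gradient copies by the loss scale."""
    return [g / scale for g in scaled_grads]

# Toy unscaled gradient values; the largest one forces the scale down.
grads = [0.5, 3.0, -0.01]
scale = find_loss_scale(grads)
restored = unscale([g * scale for g in grads], scale)
```

Scaling by a power of two and dividing it back out recovers the original gradient values, which is why the scale does not change the update itself.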


### 5.7 Language Model Training

The initial language model, trained on the large general genomic corpus, is trained using the One Cycle policy with a constant learning rate throughout the model (that is, no discriminative learning rates). Learning rates are selected using the method in Section 5.3, but tend to be between $1e-3$ and $1e-2$ (for the maximum learning rate in the cycle). The language model is trained with cross entropy loss.

#### Impact of Tokenization k-mer and Stride on Language Modeling

I mentioned in Section 1.1 that the choice of k-mer and stride for tokenization impacts language modeling. I delayed discussing the impact to this section to build up a better understanding of the models at play. The choice of k-mer and stride impacts language modeling in two ways. Firstly, the choice of k-mer and stride affects the number of tokens processed per sequence length. Secondly, the size of stride relative to k-mer imposes a prior on language modeling.

The number of k-mers for a sequence length $L$ with k-mer size $k$ and stride $s$ is given by the formula $kmers = \lfloor \frac{L-k}{s} \rfloor + 1$. $k$ has an additive effect on the number of k-mers, while $s$ has a multiplicative effect. A model using a stride of 1 will have roughly twice as many k-mers per sequence length compared to a model with stride 2. This means lower stride models require more compute to process the same sequence length. At the same time, since each k-mer is an output for the model to train on, lower stride tokenization leads to more gradient updates and training data per sequence length.
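The formula can be checked against a direct tokenization. The sketch below is illustrative only: the function names are assumptions, and it assumes the sequence is at least $k$ bases long.

```python
def kmer_tokenize(seq, k, stride):
    """Split a sequence into k-mers of length k, taken every `stride` bases."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def kmer_count(length, k, stride):
    """Number of k-mers given by floor((L - k) / s) + 1."""
    return (length - k) // stride + 1

seq = "ATGCATGCATGC"  # toy 12-base sequence
stride_2_tokens = kmer_tokenize(seq, k=5, stride=2)
stride_1_tokens = kmer_tokenize(seq, k=5, stride=1)
print(len(stride_2_tokens), kmer_count(len(seq), 5, 2))
print(len(stride_1_tokens), kmer_count(len(seq), 5, 1))
```

For this toy sequence, stride 1 yields twice as many 5-mers as stride 2.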

Stride implies a prior about the relationship between a given k-mer and the next k-mer. Consider k-mers of length 5 with a stride of 1. For an input k-mer of `_ATGC`, we know based on the stride that the next k-mer must have the form of `ATGC_`. For a stride of 2, `__ATG` must map to `ATG__`. Consider how this affects the model's ability to guess the next k-mer. With a stride of 1, there are really only 4 k-mers that could come next. With a stride of 2, there are 16 k-mers that could come next. Language models very quickly learn to exploit the stride prior, converging to a point of "informed random guessing" based on the stride prior. This is strongly reflected in the cross entropy loss of the language model. The model quickly converges to the "informed random" point with a cross entropy loss of $\ln(4^{s})$, which is the loss of a uniform guess over the $4^{s}$ possible next k-mers. Understanding this is important for comparing the performance of language models.

Consider two models only trained to the informed random point and no further. A stride 1 model reaches an informed random loss of approximately 1.39 ($\ln 4$). A stride 2 model reaches an informed random loss of approximately 2.77 ($\ln 16$). At these loss values, each model is doing nothing more than exploiting the stride prior to guess over a refined set of k-mers. Can we really say the lower loss stride 1 model is performing better than the stride 2 model? Personally I would say they perform the same. I prefer to evaluate genomic language models based on their improvement over the informed random point.
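A short sketch of how the informed random point, and a model's improvement over it, could be computed. The helper names are hypothetical, and the calculation assumes the stride is no larger than the k-mer size, so that exactly $s$ new bases are unknown.

```python
import math

def informed_random_loss(stride, alphabet_size=4):
    """Cross entropy of a uniform guess over the alphabet_size**stride
    k-mers that are consistent with the previous k-mer."""
    return math.log(alphabet_size ** stride)

def improvement_over_informed_random(loss, stride):
    """Hypothetical helper: how far a loss sits below the informed random point."""
    return informed_random_loss(stride) - loss

for s in (1, 2, 3):
    print(f"stride {s}: informed random loss = {informed_random_loss(s):.3f}")
```

Under this view, a stride 1 model at a loss of 1.2 and a stride 2 model at a loss of 2.6 would be compared by their margins below $\ln 4$ and $\ln 16$ respectively, not by their raw losses.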

But what really matters here is not language model loss, but classification model loss. The language model is just a pre-training stepping stone to get to a classification model. So the real question is does the loss difference related to the stride prior impact classification performance? Empirically the answer seems to be dataset dependent. Generally I see genomic datasets perform in one of two ways. Either they show significant improvements with lower k-mer and stride values (think k-mer of 3 and stride of 1), or they are totally invariant to the choice of k-mer and stride. Determining how your dataset responds to k-mer and stride parameters is important for getting good results and balancing compute costs. If your dataset isn't impacted by stride, go for a larger value. Things will train faster and require less compute (fewer k-mers per sequence length) with no impact on results. But if you don't experiment with small strides to be sure, you could be leaving significant performance gains on the table.


### 5.8 Language Model Fine Tuning

Language model fine tuning on a classification corpus is done using the One Cycle policy with discriminative learning rates. Discriminative learning rates follow the form $\eta^{l-1} = \frac{\eta^{l}}{2.6}$. Learning rates depend on the dataset but typically range from $5e-4$ to $5e-3$. The model is trained using cross entropy loss.

### 5.9 Classification Model Training

The classification model is trained using One Cycle scheduling, discriminative learning rates and gradual unfreezing. When the pre-trained weights are transferred from the fine tuned language model to the classification model, only the embedding and the encoder are transferred. The linear head on the classification model is untrained and randomly initialized. It is important to train the linear head sufficiently before unfreezing the encoder or embedding layers, or else a bad gradient signal caused by the untrained linear head could propagate back and adversely affect the pre-trained weights.

Unfreezing is done over four training cycles:
* Linear Head only
* Linear Head + final LSTM
* Linear Head + final 2 LSTMs
* Full model

Each phase of the unfreezing goes through a full One Cycle learning rate schedule. Typically the peak learning rate decreases with each cycle, starting at around $2e-2$ when training only the linear head, down to around $1e-4$ to $1e-3$ for training the full model. Discriminative learning rates following the form $\eta^{l-1} = \eta^{l}/2.6$ are used at all steps. The classification model is trained using cross entropy loss.

Typically each unfreezing step is trained for the same number of epochs, but some datasets may respond to adding more epochs in the early stages or more epochs in the later stages. Getting optimum classification results can require a good deal of experimenting with learning rates and cycle lengths during the gradual unfreezing process. Thankfully the classification model training process tends to be much faster than the language model training process.

Empirically I have found it is extremely important to get the right number of epochs when training just the linear head. If too few epochs are run, the linear head won't perform well and will drag down performance of the entire model. If too many epochs are run, the linear head overfits to the data at the very start of the unfreezing cycle, and the model remains overfit for the entire process.

#### Padding for Classification Batching

Often the classification dataset contains sequences of different lengths. This poses an issue for batching sequences together. The solution is to pad all items in a batch to the length of the longest sequence in the batch. Sequences are padded at the start. This does pose another problem though. If sequences of wildly different lengths are batched together, the shorter sequences will be processed as mostly padding, which yields poor results. For this reason, the samples in the classification dataset need to be sorted by length so that similar length sequences are batched together. To prevent the model from seeing the exact same batches over and over again, local random permutations are added to the sorting order.
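A minimal sketch of both ideas, assuming sequences are already numericalized as lists of token ids. The padding id of 0, the `window` size and the function names are assumptions; this is a simplified stand-in for the sampler actually used.

```python
import random

def pad_batch(seqs, pad_id=0):
    """Left-pad token-id lists to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    return [[pad_id] * (max_len - len(s)) + list(s) for s in seqs]

def length_bucketed_batches(seqs, batch_size, window=4, seed=0):
    """Sort by length, shuffle locally within windows, then cut into batches."""
    rng = random.Random(seed)
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    # Local random permutation: shuffle only within small windows of the
    # sorted order, so neighbours still have similar lengths.
    for start in range(0, len(order), window):
        chunk = order[start:start + window]
        rng.shuffle(chunk)
        order[start:start + window] = chunk
    return [[seqs[i] for i in order[b:b + batch_size]]
            for b in range(0, len(order), batch_size)]

# Toy token-id sequences with lengths 5, 1, 9, 2, 8 and 3.
seqs = [[1] * n for n in (5, 1, 9, 2, 8, 3)]
batches = [pad_batch(b) for b in length_bucketed_batches(seqs, batch_size=2, window=2)]
```

Changing `seed` each epoch changes the local ordering while keeping similar-length sequences together.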

## 6. Comparison to Other Approaches

Deep learning has been applied to many areas of genomics for some time now [26]. Deep classification models have been applied to a variety of genomics classification problems such as promoter classification [5,6], enhancer classification [3,4,7,12], enhancer-promoter interactions [8], CRISPR guide scoring [2], transcription factor binding sites [9], metagenomics classification [10], deleterious mutation classification [11], and long noncoding RNA classification [27], to name a few. This section compares previous approaches to Genomic-ULMFiT.

### 6.1 Use of Pre-Training and Transfer Learning

One of the core strengths of Genomic-ULMFiT is the use of pre-training and transfer learning. In particular, it pre-trains on a large unlabeled corpus, then transfers a full pre-trained model (not just embeddings) between steps. Pre-training and transfer learning for genomics classification has been done in a variety of ways, both in terms of the data sources used for pre-training and the way model weights are transferred in transfer learning.

#### Pre-Training Data Sources

The most common way pre-training is done in genomic contexts is to run supervised pre-training on a large in-domain corpus, then run supervised fine-tuning on a specific corpus [3,7,11]. For example training on data from multiple cell lines, then fine tuning on a specific cell line. This method is shown to be effective, but is still ultimately limited by the availability of labeled data.

When labeled data is scarce, some approaches generate synthetic data by augmenting training data with point mutations, shuffling sequences, or generating all possible variants of a specific genomic motif [2,7,10,27]. While these methods can expand the amount of labeled data, they run the risk of overfitting. If adding point mutations is not sufficient to make augmented sequences different, adding augmented sequences has a similar effect to simply duplicating the training data - the model receives no regularizing effect from the augmentations and overfits to the training data. [7] pre-trains on augmented random shuffles on the k-mer level of training sequences. I found this caused the model to learn an overly simplistic mapping (differentiating a true sequence from a random one is empirically very easy) that led to significantly worse performance on the classification task.

Unsupervised learning through a variety of methods is also done. Interestingly, unsupervised learning is typically done on the classification corpus [8,9,12,28,29,30]. The idea of using an out of domain corpus for unsupervised pre-training is generally not used.

The use of pre-training on a large general corpus, rather than specific classification data, is underused in the context of genomics models. I would expect most approaches would benefit from using the technique.

#### Transferring Learned Weights

Once a pre-trained model is generated, the weights need to be transferred to the classification model. Depending on the style of pre-training used, some or all of the weights will be transferred. Typically when supervised pre-training is used, the entire model will be transferred [2,3,7,10,11,27].

When unsupervised pre-training is used, which weights are transferable depends on how the unsupervised pre-training structure compares to the classification model. Common methods for unsupervised pre-training include using autoencoders, RBMs or HMMs to train feature extractors [2,12,28,29,30]. In this case, only a section of the weights corresponding to the encoder can be transferred. Other methods for unsupervised pre-training include using a word2vec approach to learning k-mer word vectors [9] or a language modeling approach to learning k-mer word vectors [8].

Pre-trained weights are typically used as part of a deep classification model, but sometimes the extracted features are used as input to different types of models. [8] used unsupervised pre-training using language models to learn k-mer word vectors, which are then used as input to a gradient boosted decision tree.

Genomic-ULMFiT pre-training is similar to other unsupervised methods in that only the encoder weights (embedding + LSTMs) are transferred to the classification model.


### 6.2 Nucleotide Representation

Genomic-ULMFiT tokenizes nucleotides on the k-mer level. K-mer representations are used by [8,9,10]. It is much more common to use nucleotide level representation [2,3,4,5,6,7,27]. Empirically in Genomic-ULMFiT k-mer level representations appear to perform better. This is reflected in results from [8,9].

Once tokenized, the k-mers are vectorized as input to the model. This is either done through one hot encodings [2,3,4,5,6,7,10] or the use of embeddings [8,9,27]. Empirically in Genomic-ULMFiT embeddings perform better than one-hot encodings, similar to results from [8,9,27].

### 6.3 Model Architecture

Genomic-ULMFiT uses LSTM layers for the encoder section of the model. Other methods that use recurrent models typically use GRU cells [4,9,27]. More commonly, genomic classification models use CNN based architectures [2,3,5,6,7,10]. While I have not run direct comparisons between CNNs and LSTMs for genomic sequence classification, I would expect LSTMs to better learn long range interactions over genomic sequences. I have also found empirically that Genomic-ULMFiT is able to produce better results than CNN models when comparing directly to published CNN results on publicly available datasets.

## 7. Summary of Results

This section summarizes results and compares them to existing methods.

### E. coli Baseline

This is a baseline comparison to show the effect of pre-training and validate that the Genomic-ULMFiT approach improves results over training from scratch. Here the Naive model is trained from scratch. The E. coli Genome Pre-Training model is pre-trained on only the E. coli genome. The Genomic Ensemble Pre-Training model is trained on a dozen or so bacterial genomes. Pre-training has a clear impact on model performance. Pre-training on more data shows improvements over pre-training on less data. In general the quality of the pre-trained language model has a direct impact on classification performance.

  | Model                        	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	|
  |------------------------------	|:--------:	|:---------:	|:------:	|:-----------------------:	|
  | Naive                        	|   0.834  	|   0.847   	|  0.816 	|          0.670          	|
  | E. coli Genome Pre-Training   	|   0.919  	|   0.941   	|  0.893 	|          0.839          	|
  | Genomic Ensemble Pre-Training 	|   0.973  	|   0.980   	|  0.966 	|          0.947          	|
 
Data generation described in [notebook](https://github.com/tejasvi/DNAish/blob/master/Bacteria/E.%20Coli/E.%20coli%200%20Data%20Processing.ipynb)

[Notebook Directory](https://github.com/tejasvi/DNAish/tree/master/Bacteria/E.%20Coli)


### Human Promoters, Short Sequences

This data shows a direct comparison to [5] for classification of human promoters from short (250 bp) sequences, taken -200/50 relative to the TSS. The same dataset from [5] was used to generate these results. The data also looks at the impact of k-mer and stride tokenization values on model performance. This dataset performed better with lower k-mer and stride values. However there is a caveat to these results. When I fine-tuned the genomic language model on the classification corpus, I got shockingly good results. After investigating, I found what amounts to a leak in the dataset. There were a large number of conserved DNA motifs (around 50-150 bp in length) present in only negative examples in the dataset. The language model very quickly learned these motifs, and the classification model likely exploited them as an easy feature for classification. 

| Model                            	| DNA Size 	| kmer/stride 	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	| Specificity 	|
|----------------------------------	|----------	|-------------	|----------	|-----------	|--------	|-------------------------	|-------------	|
| Umarov and Solovyev [5]          	| -200/50  	|      -      	|     -    	|     -     	|   0.9  	|           0.89          	|     0.98    	|
| Naive Model                      	| -200/50  	|     5/2     	|   0.80   	|    0.74   	|  0.80  	|           0.59          	|     0.80    	|
| With Pre-Training                 	| -200/50  	|     5/2     	|   0.922  	|   0.963   	|  0.849 	|          0.844          	|    0.976    	|
| With Pre-Training and Fine Tuning 	| -200/50  	|     5/2     	|   .977   	|    .959   	|  .989  	|           .955          	|     .969    	|
| With Pre-Training and Fine Tuning 	| -200/50  	|     5/1     	|   .990   	|    .983   	|  .995  	|           .981          	|     .987    	|
| With Pre-Training and Fine Tuning 	| -200/50  	|     3/1     	|   __.995__   	|    __.992__   	|  __.996__  	|           __.991__          	|     __.994__    	|

[Data Source](https://github.com/solovictor/CNNPromoterData)

[Notebook Directory](https://github.com/tejasvi/DNAish/tree/master/Mammals/Human/Promoter%20Classification%20Short%20Sequences)


### Human Promoters, Long Sequences

These results show a direct comparison to [6]. The dataset for [6] was not publicly available, but the same methodology was used to generate a dataset. Positive sequences were taken as the region -500/500 relative to TSS locations in the [EPDnew Database](https://epd.epfl.ch//EPDnew_database.php). Negative sequences were randomly selected from regions in the genome not overlapping with regions taken for promoter sequences. The [NCBI Human Genome](https://www.ncbi.nlm.nih.gov/genome/51) is used as a reference template. With this dataset, different metrics responded differently to changes of k-mer and stride tokenization values.

| Model                                   	| DNA Size  	| Kmer/Stride 	| Models           	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	|
|-----------------------------------------	|-----------	|-------------	|------------------	|----------	|-----------	|--------	|-------------------------	|
| Umarov et al.                           	| -1000/500 	|      -      	| 2 Model Ensemble 	|     -    	|   0.636   	|  0.802 	|          0.714          	|
| Umarov et al.                           	|  -200/400 	|      -      	| 2 Model Ensemble 	|     -    	|   0.769   	|  0.755 	|          0.762          	|
| Naive Model                             	|  -500/500 	|     5/2     	|   Single Model   	|   0.858  	|   0.877   	|  0.772 	|          0.708          	|
| With Pretraining                        	|  -500/500 	|     5/2     	|   Single Model   	|   0.888  	|   __0.902__   	|  0.824 	|          0.770          	|
| With Pretraining and Fine Tuning (5mer) 	|  -500/500 	|     5/2     	|   Single Model   	|   0.889  	|   0.886   	|  0.846 	|          0.772          	|
| With Pretraining and Fine Tuning (4mer) 	|  -500/500 	|     4/2     	|   Single Model   	|   0.892  	|   0.877   	|  __0.865__ 	|          0.778          	|
| With Pretraining and Fine Tuning (8mer) 	|  -500/500 	|     8/3     	|   Single Model   	|   0.874  	|   0.889   	|  0.802 	|          0.742          	|
| With Pretraining and Fine Tuning (1mer) 	|  -500/500 	|     1/1     	|   Single Model   	|   __0.894__  	|   0.900   	|  0.844 	|          __0.784__          	|

Data generation described in [notebook](https://github.com/tejasvi/DNAish/blob/master/Mammals/Human/Promoter%20Classification%20Long%20Sequences/Human%20Promoters%20Long%20Sequences%200%20Data%20Processing.ipynb)

[Notebook Directory](https://github.com/tejasvi/DNAish/tree/master/Mammals/Human/Promoter%20Classification%20Long%20Sequences)


### Bacterial Promoters

These results show comparisons to performance on another dataset from [5] containing promoter sequences from E. coli and B. subtilis. Compared to the CNN-based method used by [5], Genomic-ULMFiT performed similarly on E. coli promoters, but worse on B. subtilis promoters, likely due to the amount of data available (2936 examples for E. coli, 1050 for B. subtilis). This suggests that in extremely low data regimes, CNN models may perform better.


| Method         	| Organism    	| Training Examples 	| Accuracy 	| Precision 	| Recall 	| Correlation Coefficient 	| Specificity 	|
|----------------	|-------------	|-------------------	|----------	|-----------	|--------	|-------------------------	|-------------	|
| Umarov and Solovyev [5] 	| E. coli     	|        2936       	|     -    	|     -     	|  __0.90__  	|           0.84          	|     0.96    	|
| Genomic-ULMFiT 	| E. coli     	|        2936       	|   0.956  	|   0.917   	|  0.880 	|          __0.871__          	|    __0.977__    	|
| Umarov and Solovyev [5] 	| B. subtilis 	|        1050       	|     -    	|     -     	|  __0.91__  	|           __0.86__          	|     0.95    	|
| Genomic-ULMFiT 	| B. subtilis 	|        1050       	|   0.905  	|   0.857   	|  0.789 	|          0.759          	|     0.95    	|

[Data Source](https://github.com/solovictor/CNNPromoterData)

[Notebook Directory](https://github.com/tejasvi/DNAish/tree/master/Bacteria/Bacterial%20Ensemble/Promoter%20Classification)


### Metagenomics Classification

These results show a direct comparison to [10], using the same datasets for classification. Two datasets are used - one for amplicon sequencing data, another for shotgun sequencing data. Datasets are generated synthetically based on sequencing of 16S regions of bacterial genomes. These results highlight how different datasets respond differently to changes in k-mer and stride tokenization parameters. The amplicon dataset showed a very weak reaction to tuning tokenization parameters, while the shotgun dataset showed a very strong reaction.


| Amplicon Data   	| kmer/stride 	| Accuracy 	| Precision 	| Recall 	| F1    	|
|-----------------	|-------------	|----------	|-----------	|--------	|-------	|
| Fiannaca et al. 	|      -      	|   .9137  	|   .9162   	|  .9137 	| .9126 	|
| Genomic-ULMFiT  	|     5/2     	|   .9144  	|   .9369   	|  .9250 	| .9214 	|
| Genomic-ULMFiT  	|     5/1     	|   .9150  	|   .9309   	|  .9263 	| .9230 	|
| Genomic-ULMFiT  	|     3/1     	|   __.9239__  	|   __.9402__   	|  __.9332__ 	| __.9306__ 	|

| Shotgun Data    	| kmer/stride 	| Accuracy 	| Precision 	| Recall 	| F1    	|
|-----------------	|-------------	|----------	|-----------	|--------	|-------	|
| Fiannaca et al. 	|      -      	|   .8550  	|   .8570   	|  .8520 	| .8511 	|
| Genomic-ULMFiT  	|     5/2     	|   .8075  	|   .8102   	|  .8054 	| .8044 	|
| Genomic-ULMFiT  	|     5/1     	|   .8528  	|   .8631   	|  .8566 	| .8569 	|
| Genomic-ULMFiT  	|     3/1     	|   __.8797__  	|   __.8824__   	|  __.8769__ 	| __.8758__ 	|

[Data Source](https://github.com/IcarPA-TBlab/MetagenomicDC)

[Notebook Directory](https://github.com/tejasvi/DNAish/tree/master/Bacteria/Bacterial%20Ensemble/Metagenomics%20Classification)


### Enhancer Classification

These results show a direct comparison to [7], using the same dataset. Results here are compared using ROC-AUC as this was the metric used by [7]. Positive examples are 500 bp sequences defined as having active enhancer marks (H3K27ac) in the liver. Negative examples are genomic regions showing no H3K27ac marks.

The data from [7] on this dataset is not presented in the paper itself, but in the supplementary section, available [here](https://www.biorxiv.org/content/biorxiv/suppl/2018/02/14/264200.DC2/264200-1.pdf). The results below are compared to the authors' results in supplementary Figure 3. This dataset was used because the main dataset from [7], used to generate the figures in the main paper, was not made available in their GitHub repo.

| Model/ROC-AUC                 	| Human 	| Mouse 	|  Dog  	| Opossum 	|
|-------------------------------	|:-----:	|:-----:	|:-----:	|:-------:	|
| Cohn et al.                   	|  0.80 	|  0.78 	|  0.77 	|   0.72  	|
| Genomic-ULMFiT 5-mer Stride 2 	| 0.812 	| 0.871 	| 0.773 	|  0.787  	|
| Genomic-ULMFiT 4-mer Stride 2 	| 0.804 	| __0.876__ 	| 0.771 	|  0.786  	|
| Genomic-ULMFiT 3-mer Stride 1 	| __0.819__ 	| 0.875 	| __0.788__ 	|  __0.798__  	|

[Data Source](https://github.com/cohnDikla/enhancer_CNN)

[Notebook Directory](https://github.com/tejasvi/DNAish/tree/master/Mammals/Mammal%20Ensemble/Enhancer%20Classification)


### mRNA/lncRNA Classification

These results show a direct comparison to [27] using data from the paper. The classification dataset consists of DNA sequences corresponding to mRNA and lncRNA sequences. The dataset contains two test sets - a standard test set and a challenge test set. In the table below, results from a single Genomic-ULMFiT model are compared to an ensemble of GRU models used by [27].


| Model                          	| Test Set           	| Accuracy 	| Specificity 	| Sensitivity 	| Precision 	| MCC   	|
|--------------------------------	|--------------------	|----------	|-------------	|-------------	|-----------	|-------	|
| GRU Ensemble (Hill et al.)*    	| Standard Test Set  	|   0.96   	|     __0.97__    	|     0.95    	|    __0.97__   	|  0.92 	|
| Genomic ULMFiT (3mer stride 1) 	| Standard Test Set  	|   __0.963__  	|    0.952    	|    __0.974__    	|   0.953   	| __0.926__ 	|
| GRU Ensemble (Hill et al.)*    	| Challenge Test Set 	|   0.875  	|     __0.95__    	|     0.80    	|    __0.95__   	|  0.75 	|
| Genomic ULMFiT (3mer stride 1) 	| Challenge Test Set 	|   __0.90__   	|    0.944    	|    __0.871__    	|   0.939   	| __0.817__ 	|

(*) [27] presented their results as a plot rather than as a data table. Values in the above table are estimated by reading off the plot.

[Data Source](https://osf.io/4htpy/)

[Notebook Directory](https://github.com/tejasvi/DNAish/tree/master/Mammals/Human/lncRNA%20Classification)


## 8. Areas for Continuing Work

I think the results shown thus far make a strong case for using Genomic-ULMFiT for genomics classification tasks. Where do we go from here? I think the most interesting direction would be to apply Genomic-ULMFiT beyond simple classification problems. Right now all applications relate to taking in a sequence and producing a single classification prediction for that sequence. Moving beyond this framework opens up a variety of use cases.

### Multiple Sequence Comparative Classification

Rather than taking in a single sequence and classifying it, Genomic-ULMFiT could be adapted to take in two or more sequences and make a classification decision based on the comparison of those sequences. This can be applied to tasks like predicting promoter-enhancer interactions, predicting CRISPR guide off target affinity, or predicting the pathogenicity of missense mutations.

### Sequence to Sequence Classification

Rather than generating a single classification vector for an input sequence, sequence to sequence methods would produce a classification output at every token. This can be used for tasks like protein secondary structure prediction or TSS location prediction.

References:

[1] Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. arXiv preprint arXiv:1801.06146

[2] Chuai G, Ma H, Yan J, et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biol. 2018;19(1):80. Published 2018 Jun 26. doi:10.1186/s13059-018-1459-4

[3] Min X, Zeng W, Chen S, Chen N, Chen T, Jiang R. Predicting enhancers with deep convolutional neural networks. BMC Bioinformatics. 2017;18(Suppl 13):478. Published 2017 Dec 1. doi:10.1186/s12859-017-1878-3

[4] Bite Yang, Feng Liu, Chao Ren, Zhangyi Ouyang, Ziwei Xie, Xiaochen Bo, Wenjie Shu, BiRen: predicting enhancers with a deep-learning-based model using the DNA sequence alone, Bioinformatics, Volume 33, Issue 13, 1 July 2017, Pages 1930–1936, https://doi.org/10.1093/bioinformatics/btx105

[5] Umarov RK, Solovyev VV (2017) Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS ONE 12(2): e0171410. https://doi.org/10.1371/journal.pone.0171410

[6] Umarov RK, et al. 2018. PromID: human promoter prediction by deep learning. arXiv preprint arXiv:1810.01414

[7] Cohn D. et al. 2018. Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences. bioRxiv doi:https://doi.org/10.1101/264200

[8] Zeng W, Wu M, Jiang R. Prediction of enhancer-promoter interactions via natural language processing. BMC Genomics. 2018;19(Suppl 2):84. Published 2018 May 9. doi:10.1186/s12864-018-4459-6

[9] Shen Z, Bao W, Huang DS. Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep. 2018;8(1):15270. Published 2018 Oct 15. doi:10.1038/s41598-018-33321-1

[10] Fiannaca A, La Paglia L, La Rosa M, et al. Deep learning models for bacteria taxonomic classification of metagenomic data. BMC Bioinformatics. 2018;19(Suppl 7):198. Published 2018 Jul 9. doi:10.1186/s12859-018-2182-6

[11] Plekhanova, E., Nuzhdin, S. V., Utkin, L. V., & Samsonova, M. G. (2018). Prediction of deleterious mutations in coding regions of mammals with transfer learning. Evolutionary Applications, 12, 18–28. https://doi.org/10.1111/eva.12607

[12] Liu F, Li H, Ren C, Bo X, Shu W. 2016. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. bioRxiv doi:10.1101/036129

[13] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182

[14] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term memory. Neural computation, 9(8):1735–1780, 1997.

[15] Inan, H., Khosravi, K., and Socher, R. Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. arXiv preprint arXiv:1611.01462, 2016.

[16] Press, O. and Wolf, L. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859, 2016.

[17] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

[18] Gal, Y. and Ghahramani, Z. A theoretically grounded application of dropout in recurrent neural networks. In NIPS, 2016.

[19] Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pp. 1058–1066, 2013.

[20] Ilya Loshchilov, Frank Hutter. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101, 2017.

[21] Merity, S., McCann, B., and Socher, R. Revisiting activation regularization for language RNNs. arXiv preprint arXiv:1708.01009, 2017.

[22] Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820, 2018

[23] Leslie N. Smith, Nicholay Topin. Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. arXiv preprint arXiv:1708.07120, 2017

[24] Leslie N Smith. Cyclical Learning Rates for Training Neural Networks. arXiv preprint arXiv:1506.01186, 2015

[25] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems. pages 3320–3328.

[26] Tianwei Yue, Haohan Wang. Deep Learning for Genomics: A Concise Overview. arXiv preprint arXiv:1802.00810, 2018

[27] Hill S.T., Kuintzle R., Teegarden A., Merrill E., 3rd, Danaee P., Hendrix D.A. A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential. Nucleic Acids Res. 2018;46:8105–8113. doi: 10.1093/nar/gky567.

[28] Weihua Guo, You Xu, Xueyang Feng. DeepMetabolism: A Deep Learning System to Predict Phenotype from Genome Sequencing. arXiv preprint arXiv:1705.03094, 2017.

[29] Seonwoo Min, Byunghan Lee, Sungroh Yoon. Deep Learning in Bioinformatics. arXiv preprint arXiv:1603.06430, 2016.

[30] Young J.D., Cai C., Lu X. Unsupervised deep learning reveals prognostically relevant subtypes of glioblastoma. BMC Bioinform. 2017;18:381. doi: 10.1186/s12859-017-1798-2