In [5]:
from torchtext import *
from fastai.datasets import untar_data 
import torch
from fastai.text import get_language_model, convert_weights, get_text_classifier
from torchtext.datasets import Multi30k, LanguageModelingDataset
from models import *
from torchtext.data import *
from tqdm import tqdm, trange, tqdm_notebook
import os
import csv
from itertools import chain
import numpy as np
from fastai.core import even_mults
from fastai.callback import annealing_cos, annealing_exp, annealing_linear
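from typing import Callable, Union
# explicit imports for names used further down (nn, F, init, pickle, warnings, packing utilities)
import pickle
import warnings
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence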
%load_ext autoreload
%autoreload 2

%matplotlib inline

# Models

In this notebook, we go through a full implementation of the language model and the text classifier used in the [ULMFiT](https://arxiv.org/pdf/1801.06146.pdf) paper. Most of the upcoming code is heavily based on the [fastai](https://docs.fast.ai) library and its deep learning course, which already provide a full implementation of the ULMFiT approach for NLP. However, because the fastai code is simple to use but complex to read, we figured it would be helpful for readers to get a full bottom-up implementation using PyTorch as a baseline. Most of the code is exported to training.py.

## Language model : AWD_LSTM Architecture

### LSTM

The core of the AWD-LSTM architecture is, of course, the LSTM neural network. It improves on the standard RNN way of dealing with sequential data (such as text): the LSTM mitigates the [vanishing/exploding gradient problem](https://medium.com/learn-love-ai/the-curious-case-of-the-vanishing-exploding-gradient-bf58ec6822eb) that comes up in simple RNNs by using gated connections around a cell state. For further intuition on the 'why' behind most of the upcoming implementations, I recommend colah's blog post: [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)

#### LSTM cell

![LSTM cell and equations](images/lstm.jpg)
(picture from [Understanding LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Chris Olah.)

The LSTM architecture is composed of a repeated cell, shown in the image above. Its inputs are:
- **xt**, which in our case is the embedding vector of the t-th word of each sentence in the batch
- **ht-1**, the output of the previous cell, just like in RNNs
- **ct-1**, the *cell state*, also passed on from the previous cell; it helps the network retain long-term dependencies

The $\sigma$ represents the [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function) function applied element-wise to its input. The × and + connections are element-wise multiplication and addition, respectively.

Let us implement it using the PyTorch nn.Module class. We use two big matrix multiplications to compute x*U and h*W for all four gates at once, instead of four smaller ones for each of them.
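
For reference, the computation performed by one cell can be written out as follows. This is a sketch in the notation of this notebook: $U$ and $W$ are the two stacked weight matrices, $b_U$ and $b_W$ are the biases that nn.Linear adds by default, and $a_i, a_f, a_o, a_g$ are names introduced here for the four chunks of the pre-activation, in the order used by the code of this section.

$$
\begin{aligned}
[a_i,\ a_f,\ a_o,\ a_g] &= U x_t + b_U + W h_{t-1} + b_W \\
i_t &= \sigma(a_i), \quad f_t = \sigma(a_f), \quad o_t = \sigma(a_o) \\
\tilde{c}_t &= \tanh(a_g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

Here $\odot$ denotes element-wise multiplication, matching the × connections of the diagram.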

In [5]:
class LSTMCell(nn.Module):
    def __init__(self, x_s, h_s):
        super().__init__()
        self.h_s = h_s
        self.x_s = x_s
        self.U = nn.Linear(x_s,4*h_s)
        self.W = nn.Linear(h_s,4*h_s)

    def forward(self, input, state):
        #inputs from last cell
        h,c = state
        
        #computing the intermediate gates with one fused projection, then splitting it in 4 chunks
        gates = (self.U(input) + self.W(h)).chunk(4, 1)
        i_t,f_t,o_t = map(torch.sigmoid, gates[:3])
        c_t = gates[3].tanh()
        c = (f_t*c) + (i_t*c_t)
        h = o_t * c.tanh()
        
        #return the usual output h and the state (h, c) to give to the next cell if needed
        return h, (h,c)
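
To make the "one big matrix multiplication" trick more concrete, here is a minimal sketch (not part of the original code) that checks, on random data, that one fused `nn.Linear(x_s, 4*h_s)` followed by `chunk(4, 1)` gives the same result as four separate projections built from slices of the same parameters. The sizes and variable names are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_s, h_s, bs = 5, 3, 2  # illustrative sizes

# One fused projection producing the pre-activations of all four gates
fused = nn.Linear(x_s, 4 * h_s)

# Four separate projections that reuse slices of the fused parameters
separate = []
for k in range(4):
    lin = nn.Linear(x_s, h_s)
    with torch.no_grad():
        lin.weight.copy_(fused.weight[k * h_s:(k + 1) * h_s])
        lin.bias.copy_(fused.bias[k * h_s:(k + 1) * h_s])
    separate.append(lin)

x = torch.randn(bs, x_s)
fused_chunks = fused(x).chunk(4, dim=1)      # 4 tensors of shape (bs, h_s)
separate_out = [lin(x) for lin in separate]  # same 4 tensors, computed one by one

max_diff = max((a - b).abs().max().item() for a, b in zip(fused_chunks, separate_out))
print(max_diff)  # only floating point noise is expected
```

The fused version does the same arithmetic with a single, larger matrix multiplication, which is usually faster than four small ones.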

#### LSTM Layer

The next building block is the LSTM layer, which consists of applying the LSTM cell to each element of the sequence in a recurrent manner, each time forwarding its state to the next time step.

![LSTM layer](images/LSTM3.png)
(picture from [Understanding LSTMs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) by Chris Olah.)

In [6]:
class LSTMLayer(nn.Module):
    def __init__(self, x_s, h_s):
        super().__init__()
        self.lstm_cell = LSTMCell(x_s, h_s)

    def forward(self, input, state):
        # divide the input in the sequence dimension to get x_0, x_1, x_2, ...
        inputs = input.unbind(1)
        
        #prepare to store the output of each cell
        outputs = []
        
        #applying the cell recursively 
        for i in range(len(inputs)):
            out, state = self.lstm_cell(inputs[i], state)
            outputs += [out]
        
        #return the stacked outputs
        return torch.stack(outputs, dim=1), state

(For the state given to the first cell, we simply use tensors filled with zeros.)
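
As a sanity check of the "cell applied at every time step" idea, the following sketch unrolls PyTorch's built-in `nn.LSTMCell` over a sequence, starting from a zero state, and compares the result with `nn.LSTM`. It uses the built-in cell rather than the LSTMCell class of this notebook (whose gate ordering differs from PyTorch's), so it illustrates the principle and is not a test of the code above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bs, seq_len, x_s, h_s = 2, 4, 5, 3  # illustrative sizes

lstm = nn.LSTM(x_s, h_s, batch_first=True)
cell = nn.LSTMCell(x_s, h_s)

# Give the cell the same parameters as the single-layer LSTM
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(bs, seq_len, x_s)

# Initial state: tensors filled with zeros
h = torch.zeros(bs, h_s)
c = torch.zeros(bs, h_s)

outputs = []
for x_t in x.unbind(1):        # x_0, x_1, ... each of shape (bs, x_s)
    h, c = cell(x_t, (h, c))   # forward the state to the next time step
    outputs.append(h)
unrolled = torch.stack(outputs, dim=1)

# nn.LSTM also starts from a zero state when none is given
reference, (h_n, c_n) = lstm(x)
print((unrolled - reference).abs().max().item())
```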

#### Stacked LSTM layers

The last step before having fully implemented PyTorch's LSTM module is stacking multiple LSTM layers on top of each other, as shown in the following diagram:

![Stacked RNN](images/RNN_Stacking.png)
(picture from : https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/recurrent_neural_networks.html?q= )

- <font color='pink'>pink rectangles</font>  : The sequential inputs  
- <font color='green'>green rectangles</font>  : An LSTM cell; each row is a layer, so the cells in the same row are the same cell (they share their weights)
- <font color='blue'>blue rectangles</font> : The sequential outputs

In the previous implementations, we created the initial states **h0** and **c0** outside of the model and passed them in together with the input, so we could create them with the right sizes simply by looking at the input sizes. This time, we create the initial states for all the layers inside the model itself. As the dimensions of the state depend on the batch size of the input, we create the initial states at the start of the forward pass, once we are given the input (and thus know the batch size), with the help of the **reset()** method. We also want to keep the final states from the previous batch as long as the batch size does not change.

We also have to be careful with the input sizes of the layers. For example, in the image above, the first layer takes the initial sequence as input and outputs a hidden sequence, whereas the second and third layers take a hidden sequence as input and output a hidden sequence. Because of that, the first layer must have a different input size than the other layers. The initial states, on the other hand, all have the hidden size here, since every layer outputs a hidden sequence of the same size.

In [7]:
class fullLSTM(nn.Module):
    def __init__(self, x_s, h_s, n_layers):
        super().__init__()
        #the hidden size is needed later to build the initial states
        self.h_s = h_s
        self.n_layers = n_layers
        self.lstm_layers = nn.ModuleList([LSTMLayer(x_s if i==0 else h_s, h_s ) for i in range(n_layers)])
        self.bs = 0

    def forward(self, input):
        
        #get the batch size from the first dimension of the input 
        bs, sl, _ = input.size()
        if self.bs != bs :
            self.bs = bs
            self.reset()
       
        # now we have the initial states and we can go through all the layers one after the other
        for j in range(self.n_layers) :
            layer = self.lstm_layers[j]
            input, state = layer(input, self.hidden[j])
            #keep the values for the next batch but cut the gradient history
            self.hidden[j] = tuple(s.detach() for s in state)
                
        
        #return the outputs
        return input
    
    def reset(self) :
        #zero tensor with the same device and dtype as the parameters of the model
        st = next(self.parameters()).new_zeros(self.bs, self.h_s)
        self.hidden = [(st, st) for l in range(self.n_layers)]
        

This implementation is functionally pretty much the same as PyTorch's nn.LSTM. The main difference is that PyTorch can use cuDNN to make the computations much faster, so from now on we will use PyTorch's implementation instead of ours.
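
The point about layer sizes can also be observed directly on `nn.LSTM`. In this small sketch (sizes chosen arbitrarily), only the first layer has an input-to-hidden matrix whose second dimension is the input size; the other layers take hidden sequences as input.

```python
import torch
import torch.nn as nn

x_s, h_s, n_layers = 5, 3, 3  # illustrative sizes
lstm = nn.LSTM(x_s, h_s, num_layers=n_layers, batch_first=True)

# Input-to-hidden weights of each layer: (4*h_s, input size of that layer)
shapes = {name: tuple(p.shape)
          for name, p in lstm.named_parameters()
          if name.startswith('weight_ih')}
for name, shape in shapes.items():
    print(name, shape)

# One (h, c) pair per layer, each of shape (bs, h_s)
out, (h_n, c_n) = lstm(torch.randn(2, 7, x_s))
print(out.shape, h_n.shape, c_n.shape)
```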

### Generalization : Dropout

Usual regularization techniques used in feed-forward and convolutional neural nets, such as dropout and batchnorm, do not work well when applied naively to RNNs. The AWD-LSTM uses extensions of those techniques to regularize its model. The corresponding sections of the [paper](https://arxiv.org/pdf/1708.02182.pdf) are given under each heading for more info.

#### Variational dropout 
Section 4.2

The idea in variational dropout is to apply the **same** dropout mask to a sequential input at every step of the sequence dimension. In essence, if you have an input x with shape *(bs, seq_len, x_s)*, the dropout mask has shape *(bs, 1, x_s)* and is applied to each time slice of the sequence.
This dropout is used on the inputs and outputs of the LSTM layers. Additionally, we divide the activations that are kept by 1-p (p: probability of dropout) so that their expected value is unchanged. We use [broadcasting](https://pytorch.org/docs/stable/notes/broadcasting.html) to keep the element-wise computations efficient.

In [8]:
def dropout_mask(x, sz, p):
    return x.new(*sz).bernoulli_(1-p).div_(1-p)

In [9]:
class VDropout(nn.Module) :
    def __init__(self, p=0.5) :
        super().__init__()
        self.p = p
    def forward(self, x) :
        #The dropout should only be used during training and not eval 
        if not self.training or self.p == 0.: return x
        #the mask
        m = dropout_mask(x.data, (x.size(0), 1, x.size(2)), self.p)
        #element-wise multiplication with broadcasting
        return x*m

In [10]:
m = VDropout(0.3)
tst_input = torch.randn(3,3,7)
tst_input, m(tst_input)

(tensor([[[ 1.9241e+00,  7.5969e-01, -8.2366e-01,  1.6091e+00, -2.4311e-01,
           -1.7052e-01,  3.5545e+00],
          [ 2.9414e-03, -3.9516e-01, -1.1121e-01,  8.0972e-01,  3.7951e-01,
            1.6787e+00,  1.2045e+00],
          [-2.0720e+00,  4.5873e-02,  3.1458e-01, -6.5350e-02,  6.6786e-01,
           -4.8988e-02,  5.2090e-01]],
 
         [[ 1.2378e-01, -2.0185e+00, -2.4288e+00, -5.2116e-01,  2.1887e-01,
           -3.1126e-01,  1.4472e+00],
          [ 2.8520e+00,  6.5735e-01,  1.2059e+00,  1.4011e-02, -9.4021e-01,
            3.2747e-02,  4.0664e-01],
          [ 5.6582e-02,  1.4144e+00,  4.7610e-02, -6.4115e-01, -1.8566e-01,
           -3.7903e-01, -3.3611e-01]],
 
         [[-1.1319e+00, -4.4765e-01, -2.4535e-01,  7.7542e-01,  1.8033e+00,
            1.8792e+00,  1.8971e+00],
          [-1.0683e+00,  6.1346e-01,  5.0164e-01,  4.0671e-01,  5.9879e-01,
            7.1216e-01,  1.3058e-01],
          [-8.0786e-01,  1.6683e+00,  2.0653e-01,  5.0945e-01,  1.8499e+00,
      

Here we can see that the dropped positions are consistent along the second (sequence) dimension.
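
Because random values make the pattern hard to read, here is a small self-contained check that uses an input of ones instead. It rebuilds the mask the same way as `dropout_mask` (with `new_empty` in place of `new`, which is an editorial substitution) and verifies that every time step of a sentence is masked identically.

```python
import torch

torch.manual_seed(0)
p = 0.3
x = torch.ones(2, 5, 4)  # (bs, seq_len, x_s), all ones so the mask is visible

# One mask per sentence, shared by all time steps: shape (bs, 1, x_s)
mask = x.new_empty(x.size(0), 1, x.size(2)).bernoulli_(1 - p).div_(1 - p)
out = x * mask  # broadcasting repeats the mask along the sequence dimension

# Every time step must equal the first time step of the same sentence
same_over_time = bool((out == out[:, :1, :]).all())
values = set(out.unique().tolist())  # kept activations are scaled by 1/(1-p)
print(same_over_time, values)
```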

#### Embedding dropout 
Section 4.3

For embedding dropout, we simply zero out entire rows of the word embedding matrix with probability p, so that every occurrence of a dropped word disappears for that forward pass. Again, broadcasting is used.

In [11]:
class mEmbeddingDropout(nn.Module):
    
    def __init__(self, emb, embed_p):
        super().__init__()
        self.emb,self.embed_p = emb,embed_p
        self.pad_idx = self.emb.padding_idx
        if self.pad_idx is None: self.pad_idx = -1

    def forward(self, words, scale=None):
        if self.training and self.embed_p != 0:
            size = (self.emb.weight.size(0),1)
            mask = dropout_mask(self.emb.weight.data, size, self.embed_p)
            masked_embed = self.emb.weight * mask
        else: masked_embed = self.emb.weight
        if scale: masked_embed.mul_(scale)
        return F.embedding(words, masked_embed, self.pad_idx, self.emb.max_norm,
                           self.emb.norm_type, self.emb.scale_grad_by_freq, self.emb.sparse)

In [12]:
enc = nn.Embedding(100, 7, padding_idx=1)
enc_dp = mEmbeddingDropout(enc, 0.5)
tst_input = torch.randint(0,100,(8,))
enc_dp(tst_input)

tensor([[-0.0000, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000,  0.0000],
        [ 0.5269,  0.9033, -0.8586,  1.7035,  0.3431, -0.5421, -1.1035],
        [-0.2606,  0.9515, -1.7171,  1.4127, -5.0057, -1.4582,  0.3003],
        [ 2.6632, -1.5893, -1.7917, -0.9815, -0.1710,  2.1583,  4.5810],
        [ 1.6075, -3.0090,  0.3957, -3.6681,  0.0316,  0.0674,  1.3095],
        [ 0.0000, -0.0000,  0.0000, -0.0000, -0.0000, -0.0000,  0.0000],
        [ 3.1144, -3.0127,  0.9502,  1.9643, -1.2018,  0.4565, -0.9475],
        [ 0.0000, -0.0000, -0.0000, -0.0000,  0.0000, -0.0000,  0.0000]],
       grad_fn=<EmbeddingBackward>)

We can see that entire rows have been dropped
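
The sketch below isolates the mechanism on a tiny vocabulary. It is a simplified stand-in for the class above (no padding index or scaling), with arbitrary sizes, and shows that a word is either dropped or kept consistently wherever it appears in the batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
emb = nn.Embedding(10, 4)  # tiny vocabulary, for illustration only
p = 0.5

# One Bernoulli draw per row (word) of the embedding matrix: shape (vocab_sz, 1)
row_mask = emb.weight.new_empty(emb.weight.size(0), 1).bernoulli_(1 - p).div_(1 - p)
masked_weight = emb.weight * row_mask  # broadcast over the embedding dimension

words = torch.tensor([[1, 2, 1],
                      [3, 1, 2]])
out = F.embedding(words, masked_weight)  # (bs, seq_len, emb_sz)

# Word 1 appears three times and gets exactly the same vector each time
print(torch.equal(out[0, 0], out[0, 2]), torch.equal(out[0, 0], out[1, 1]))
```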

#### Weight-dropout
Section 2

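Weight dropout (DropConnect) is dropout applied to the weights inside the LSTM cells rather than to their activations. As in the paper, we apply it only to the hidden-to-hidden matrix (W in our notation, `weight_hh_l0` in nn.LSTM).

In order to keep the speed of the LSTM layer, we keep the raw (non-masked) weights as a separate parameter and, at each forward pass, replace the weight matrix of the LSTM with a masked version of them. We can then simply apply the LSTM layer and it will use its new weights.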

In [13]:
# The name of the parameter in the nn.LSTM module containing the weights 
WEIGHT_HH = 'weight_hh_l0'

class mWeightDropout(nn.Module):
    def __init__(self, module, weight_p=0., layer_names=(WEIGHT_HH,)):
        super().__init__()
        self.module,self.weight_p,self.layer_names = module,weight_p,layer_names
        for layer in self.layer_names:
            #Keep a copy of the raw (non-dropped) weights of the selected layers as a parameter of this wrapper.
            w = getattr(self.module, layer)
            self.register_parameter(f'{layer}_raw', nn.Parameter(w.data))
            #The weights used by the wrapped module are recomputed from the raw ones at each forward pass.
            self.module._parameters[layer] = F.dropout(w, p=self.weight_p, training=False)

    def _setweights(self):
        for layer in self.layer_names:
            raw_w = getattr(self, f'{layer}_raw')
            self.module._parameters[layer] = F.dropout(raw_w, p=self.weight_p, training=self.training)

    def forward(self, *args):
        self._setweights()
        with warnings.catch_warnings():
            #To avoid the warning that comes because the weights aren't flattened.
            warnings.simplefilter("ignore")
            return self.module.forward(*args)

In [14]:
module = nn.LSTM(5, 2)
dp_module = mWeightDropout(module, 0.4)
getattr(dp_module.module, WEIGHT_HH)

Parameter containing:
tensor([[ 0.2535, -0.6309],
        [ 0.7069, -0.1329],
        [ 0.2839, -0.0401],
        [ 0.0010,  0.2045],
        [-0.0550, -0.2916],
        [ 0.3217, -0.6281],
        [ 0.4655, -0.6114],
        [-0.6816,  0.1285]], requires_grad=True)

In [15]:
tst_input = torch.randn(4,20,5)
h = (torch.zeros(1,20,2), torch.zeros(1,20,2))
x,h = dp_module(tst_input,h)
getattr(dp_module.module, WEIGHT_HH)

tensor([[ 0.4225, -1.0515],
        [ 1.1782, -0.2215],
        [ 0.0000, -0.0000],
        [ 0.0016,  0.3409],
        [-0.0000, -0.0000],
        [ 0.5362, -0.0000],
        [ 0.7759, -0.0000],
        [-0.0000,  0.0000]], grad_fn=<MulBackward0>)

As we can see, the dropout is applied to the weights during the forward pass
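
The scaling visible in the output above (for example 0.2535 / 0.6 = 0.4225) comes from `F.dropout` itself. The following sketch applies it to a stand-alone random matrix, used here in place of the real LSTM weights, to show what happens in training and in evaluation mode.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
p = 0.4
raw_w = torch.randn(8, 2)  # stand-in for a hidden-to-hidden weight matrix

train_w = F.dropout(raw_w, p=p, training=True)   # some weights zeroed, the rest rescaled
eval_w = F.dropout(raw_w, p=p, training=False)   # identity outside of training

kept = train_w != 0
print(kept.float().mean().item())  # fraction of weights kept, around 1 - p on average
print(torch.allclose(train_w[kept], raw_w[kept] / (1 - p)))
```

Since the raw weights are stored separately, a fresh mask can be drawn at every forward pass without ever losing the underlying values.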

### Full model

At this point we have everything ready to implement the entire AWD-LSTM model. The following code might look complicated at first, but it is in fact pretty much the same as our fullLSTM, except that it uses the different kinds of dropout discussed above. It also takes care of the word embeddings, whereas our fullLSTM assumed that this was already done. Another difference is that the last layer outputs a tensor of size emb_sz instead of n_hid, so that the decoder weights can be tied to the embedding matrix.

In [16]:
def to_detach(h):
    "Detaches `h` from its history."
    return h.detach() if isinstance(h, torch.Tensor) else tuple(to_detach(v) for v in h)

In [17]:
class mAWD_LSTM(nn.Module):
    initrange=0.1

    def __init__(self, vocab_sz, emb_sz, n_hid, n_layers, pad_token,
                 hidden_p=0.2, input_p=0.6, embed_p=0.1, weight_p=0.5):
        """AWD-LSTM encoder: word embedding followed by stacked weight-dropped LSTM layers.

        Args:
            vocab_sz (int): number of words in the vocab
            emb_sz (int): size of the word embedding vector
            n_hid (int): size of the hidden vector 
            n_layers (int): number of layers in the LSTM
            pad_token (int): id of the pad_idx for the embedding matrix
            hidden_p (float): dropout probability for variational dropout on hidden activations
            input_p (float): dropout probability for variational dropout on input activations
            embed_p (float): dropout probability for embedding dropout 
            weight_p (float): dropout probability for weight dropout

        """
        super().__init__()
        self.bs,self.emb_sz,self.n_hid,self.n_layers = 1,emb_sz,n_hid,n_layers
        self.emb = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)
        self.emb_dp = mEmbeddingDropout(self.emb, embed_p)
        self.rnns = [nn.LSTM(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz), 1,
                             batch_first=True) for l in range(n_layers)]
        self.rnns = nn.ModuleList([mWeightDropout(rnn, weight_p) for rnn in self.rnns])
        self.emb.weight.data.uniform_(-self.initrange, self.initrange)
        self.input_dp = VDropout(input_p)
        self.hidden_dps = nn.ModuleList([VDropout(hidden_p) for l in range(n_layers)])

    def forward(self, input):
        bs,sl = input.size()
        if bs!=self.bs:
            self.bs=bs
            self.reset()
        raw_output = self.input_dp(self.emb_dp(input))
        new_hidden,raw_outputs,outputs = [],[],[]
        for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
            raw_output, new_h = rnn(raw_output, self.hidden[l])
            new_hidden.append(new_h)
            raw_outputs.append(raw_output)
            if l != self.n_layers - 1: raw_output = hid_dp(raw_output)
            outputs.append(raw_output) 
        self.hidden = to_detach(new_hidden)
        return raw_outputs, outputs

    def _one_hidden(self, l):
        "Return one hidden state."
        nh = self.n_hid if l != self.n_layers - 1 else self.emb_sz
        return next(self.parameters()).new(1, self.bs, nh).zero_()

    def reset(self):
        "Reset the hidden states."
        self.hidden = [(self._one_hidden(l), self._one_hidden(l)) for l in range(self.n_layers)]

We also need a decoder, which takes the output of our AWD_LSTM and transforms it into a prediction of the next word.

In [18]:
class mLinearDecoder(nn.Module):
    def __init__(self, n_out, n_hid, output_p, tie_encoder=None, bias=True):
        super().__init__()
        self.output_dp = VDropout(output_p)
        self.decoder = nn.Linear(n_hid, n_out, bias=bias)
        if bias: self.decoder.bias.data.zero_()
        if tie_encoder: self.decoder.weight = tie_encoder.weight
        else: init.kaiming_uniform_(self.decoder.weight)

    def forward(self, input):
        raw_outputs, outputs = input
        output = self.output_dp(outputs[-1]).contiguous()
        decoded = self.decoder(output.view(output.size(0)*output.size(1), output.size(2)))
        return decoded, raw_outputs, outputs
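
The `tie_encoder` option relies on the fact that an embedding matrix of shape (vocab_sz, emb_sz) has exactly the shape of the weight of a linear layer mapping emb_sz to vocab_sz. A minimal sketch of this weight tying, with arbitrary sizes and without dropout, could look like this:

```python
import torch
import torch.nn as nn

vocab_sz, emb_sz = 50, 8  # illustrative sizes
emb = nn.Embedding(vocab_sz, emb_sz)
decoder = nn.Linear(emb_sz, vocab_sz, bias=True)

# Tie the weights: both modules now hold the very same Parameter object
decoder.weight = emb.weight

hidden = torch.randn(3, 6, emb_sz)             # (bs, seq_len, emb_sz)
logits = decoder(hidden.view(-1, emb_sz))      # (bs*seq_len, vocab_sz)

# The shared matrix counts only once: one weight plus the decoder bias
n_unique = len({id(p) for p in list(emb.parameters()) + list(decoder.parameters())})
print(logits.shape, n_unique)
```

This is also why the last LSTM layer has to output vectors of size emb_sz: otherwise the shapes of the two matrices would not match.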

We can combine both of them using a sequential module 

In [19]:
class mSequentialRNN(nn.Sequential):
    "A sequential module that passes the reset call to its children."
    def reset(self):
        for c in self.children():
            if hasattr(c, 'reset'): c.reset()

In [20]:
def mget_language_model(vocab_sz, emb_sz, n_hid, n_layers, pad_token, output_p=0.4, hidden_p=0.2, input_p=0.6, 
                       embed_p=0.1, weight_p=0.5, tie_weights=True, bias=True):
    rnn_enc = mAWD_LSTM(vocab_sz, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                       hidden_p=hidden_p, input_p=input_p, embed_p=embed_p, weight_p=weight_p)
    enc = rnn_enc.emb if tie_weights else None
    return SequentialRNN(rnn_enc, mLinearDecoder(vocab_sz, emb_sz, output_p, tie_encoder=enc, bias=bias))

Most of the code from the last couple of cells is actually already implemented in the fastai library, with minor changes, under the same function/class names without the m at the beginning. For example, here is the model generated from our implementation.

In [21]:
mget_language_model(vocab_sz=400, emb_sz=20, n_hid=100, n_layers=3, pad_token=-1)

SequentialRNN(
  (0): mAWD_LSTM(
    (emb): Embedding(400, 20, padding_idx=399)
    (emb_dp): mEmbeddingDropout(
      (emb): Embedding(400, 20, padding_idx=399)
    )
    (rnns): ModuleList(
      (0): mWeightDropout(
        (module): LSTM(20, 100, batch_first=True)
      )
      (1): mWeightDropout(
        (module): LSTM(100, 100, batch_first=True)
      )
      (2): mWeightDropout(
        (module): LSTM(100, 20, batch_first=True)
      )
    )
    (input_dp): VDropout()
    (hidden_dps): ModuleList(
      (0): VDropout()
      (1): VDropout()
      (2): VDropout()
    )
  )
  (1): mLinearDecoder(
    (output_dp): VDropout()
    (decoder): Linear(in_features=20, out_features=400, bias=True)
  )
)

We can now generate the same model using fastai's get_language_model

In [22]:
configs = {'emb_sz':20, 'n_hid':100, 'n_layers':3, 'pad_token':-1, 'output_p':0.4, 'hidden_p':0.2, 'input_p':0.6, 
                       'embed_p':0.1, 'weight_p':0.5, 'tie_weights':True, 'out_bias':True}
lm = get_language_model(AWD_LSTM, 400, configs)
lm

SequentialRNN(
  (0): AWD_LSTM(
    (encoder): Embedding(400, 20, padding_idx=399)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(400, 20, padding_idx=399)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(20, 100, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(100, 100, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(100, 20, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
  (1): LinearDecoder(
    (decoder): Linear(in_features=20, out_features=400, bias=True)
    (output_dp): RNNDropout()
  )
)

As we can see, both models have the same structure; only the module names change.

## Loading pretrained model

We can directly load fastai's awd_lstm pretrained model

In [None]:
def load_pretrained_lm(vocab) :    
    """
    Load fastai's pretrained awd_lstm model
    """
    lm = get_language_model(AWD_LSTM, len(vocab))
    model_path = untar_data('https://s3.amazonaws.com/fast-ai-modelzoo/wt103-1', data=False)
    fnames = [list(model_path.glob(f'*.{ext}'))[0] for ext in ['pth', 'pkl']]
    # read the vocabulary of the pretrained model, closing the file afterwards
    with open(fnames[1], 'rb') as f:
        old_itos = pickle.load(f)
    old_stoi = {v:k for k,v in enumerate(old_itos)}
    wgts = torch.load(fnames[0], map_location=lambda storage, loc: storage)
    wgts = convert_weights(wgts, old_stoi, vocab)
    lm.load_state_dict(wgts)
    return lm

# Sentiment classifier model

Our classifier model has the same core as the language model, the AWD_LSTM, with some minor modifications to ignore the padding. On top of that core we apply concat pooling to its outputs, followed by a linear classifier (see section 3 of the [ULMFiT paper](https://arxiv.org/pdf/1801.06146.pdf) for more info).

### Modified AWD_LSTM

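This time, our RNN inputs contain padding that we want to ignore when giving them to the layers. In order to do that while keeping the cuDNN speed of our LSTM layers, we use PyTorch's **pack_padded_sequence** and **pad_packed_sequence** ([docs](https://pytorch.org/docs/stable/nn.html)), which deal with padded tensors. Note that the code below assumes that the padding is at the end of each sequence and that each batch is sorted by decreasing length, which is what pack_padded_sequence expects by default.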

In [30]:
class AWD_LSTM_clas(nn.Module):
    "AWD-LSTM inspired by https://arxiv.org/abs/1708.02182."
    initrange=0.1

    def __init__(self, vocab_sz, emb_sz, n_hid, n_layers, pad_token,
                 hidden_p=0.2, input_p=0.6, embed_p=0.1, weight_p=0.5):
        super().__init__()
        self.bs,self.emb_sz,self.n_hid,self.n_layers,self.pad_token = 1,emb_sz,n_hid,n_layers,pad_token
        self.emb = nn.Embedding(vocab_sz, emb_sz, padding_idx=pad_token)
        self.emb_dp = EmbeddingDropout(self.emb, embed_p)
        self.rnns = [nn.LSTM(emb_sz if l == 0 else n_hid, (n_hid if l != n_layers - 1 else emb_sz), 1,
                             batch_first=True) for l in range(n_layers)]
        self.rnns = nn.ModuleList([WeightDropout(rnn, weight_p) for rnn in self.rnns])
        self.emb.weight.data.uniform_(-self.initrange, self.initrange)
        self.input_dp = RNNDropout(input_p)
        self.hidden_dps = nn.ModuleList([RNNDropout(hidden_p) for l in range(n_layers)])

    def forward(self, input):
        bs,sl = input.size()
        
        ##new
        mask = (input == self.pad_token)
        lengths = sl - mask.long().sum(1)
        n_empty = (lengths == 0).sum()
        if n_empty > 0:
            input = input[:-n_empty]
            lengths = lengths[:-n_empty]
            self.hidden = [(h[0][:,:input.size(0)], h[1][:,:input.size(0)]) for h in self.hidden]
            
        raw_output = self.input_dp(self.emb_dp(input))
        new_hidden,raw_outputs,outputs = [],[],[]
        for l, (rnn,hid_dp) in enumerate(zip(self.rnns, self.hidden_dps)):
            raw_output = pack_padded_sequence(raw_output, lengths, batch_first=True) #new
            raw_output, new_h = rnn(raw_output, self.hidden[l])
            raw_output = pad_packed_sequence(raw_output, batch_first=True)[0] #new
            raw_outputs.append(raw_output)
            if l != self.n_layers - 1: raw_output = hid_dp(raw_output)
            outputs.append(raw_output)
            new_hidden.append(new_h)
        self.hidden = to_detach(new_hidden)
        return raw_outputs, outputs, mask

    def _one_hidden(self, l):
        "Return one hidden state."
        nh = self.n_hid if l != self.n_layers - 1 else self.emb_sz
        return next(self.parameters()).new(1, self.bs, nh).zero_()

    def reset(self):
        "Reset the hidden states."
        self.hidden = [(self._one_hidden(l), self._one_hidden(l)) for l in range(self.n_layers)]
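
To see what the two packing functions do in isolation, here is a small sketch on a hand-made batch. The token ids, sizes and the plain `nn.LSTM` (without any dropout) are assumptions made for the example; the lengths are computed from the padding mask in the same way as in the forward method above.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
pad_token = 1
# Batch sorted by decreasing length, padded at the end with pad_token
tokens = torch.tensor([[5, 6, 7, 8],
                       [9, 4, 3, 1],
                       [2, 1, 1, 1]])
mask = tokens == pad_token
lengths = tokens.size(1) - mask.long().sum(1)  # number of real tokens per row

emb = nn.Embedding(10, 3, padding_idx=pad_token)
lstm = nn.LSTM(3, 2, batch_first=True)

packed = pack_padded_sequence(emb(tokens), lengths, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)  # the LSTM never sees the padded steps
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(lengths.tolist())
print(out[2])  # only the first time step is non-zero for the shortest sequence
```

A useful consequence is that the final hidden state of each sequence corresponds to its last real token, not to a padding position.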

### Concat pooling

We use three things for the classification head of the model: the last hidden state, the average of all the hidden states and the maximum of all the hidden states, concatenated together. The trick is, once again, to ignore the padding when taking the last element, the average and the maximum (again, refer to the paper for more info).

In [31]:
class Pooling(nn.Module):
    def forward(self, input):
        raw_outputs,outputs,mask = input
        output = outputs[-1]
        lengths = output.size(1) - mask.long().sum(dim=1)
        avg_pool = output.masked_fill(mask[:,:,None], 0).sum(dim=1)
        avg_pool.div_(lengths.type(avg_pool.dtype)[:,None])
        max_pool = output.masked_fill(mask[:,:,None], -float('inf')).max(dim=1)[0]
        x = torch.cat([output[torch.arange(0, output.size(0)),lengths-1], max_pool, avg_pool], 1) #Concat pooling.
        return output,x
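
A tiny worked example may help to check the masking logic by hand. The numbers below are made up: the first sentence has two real time steps and the second only one, the rest being padding (placed at the end, which the `lengths-1` indexing relies on).

```python
import torch

# (bs=2, seq_len=3, n_hid=2); padded positions hold junk values on purpose
output = torch.tensor([[[1., 4.], [3., 2.], [9., 9.]],
                       [[5., 0.], [7., 7.], [8., 8.]]])
mask = torch.tensor([[False, False, True],
                     [False, True,  True]])  # True marks padding

lengths = output.size(1) - mask.long().sum(dim=1)
# Average over real time steps only: padded steps contribute 0 to the sum
avg_pool = output.masked_fill(mask[:, :, None], 0).sum(dim=1) / lengths[:, None].float()
# Maximum over real time steps only: padded steps are set to -inf
max_pool = output.masked_fill(mask[:, :, None], -float('inf')).max(dim=1)[0]
# Hidden state at the last real time step of each sentence
last = output[torch.arange(output.size(0)), lengths - 1]

pooled = torch.cat([last, max_pool, avg_pool], dim=1)  # (bs, 3 * n_hid)
print(pooled)
```

For the first sentence this gives last = [3, 2], max = [3, 4] and average = [2, 3]; the junk values in the padded positions never leak into the result.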

### Classifier head

Now we can create the head of our model, which is concat pooling followed by a classifier. We create a **PoolingLinearClassifier** which can have multiple layers and uses batchnorm and dropout on the activations (and of course ReLU as the non-linearity).

In [32]:
def bn_drop_lin(n_in, n_out, bn=True, p=0., actn=None):
    layers = [nn.BatchNorm1d(n_in)] if bn else []
    if p != 0: layers.append(nn.Dropout(p))
    layers.append(nn.Linear(n_in, n_out))
    if actn is not None: layers.append(actn)
    return layers

In [33]:
class PoolingLinearClassifier(nn.Module):
    "Create a linear classifier with pooling."

    def __init__(self, layers, drops):
        super().__init__()
        mod_layers = []
        activs = [nn.ReLU(inplace=True)] * (len(layers) - 2) + [None]
        for n_in, n_out, p, actn in zip(layers[:-1], layers[1:], drops, activs):
            mod_layers += bn_drop_lin(n_in, n_out, p=p, actn=actn)
        self.layers = nn.Sequential(*mod_layers)

    def forward(self, input):
        raw_outputs,outputs,mask = input
        output = outputs[-1]
        lengths = output.size(1) - mask.long().sum(dim=1)
        avg_pool = output.masked_fill(mask[:,:,None], 0).sum(dim=1)
        avg_pool.div_(lengths.type(avg_pool.dtype)[:,None])
        max_pool = output.masked_fill(mask[:,:,None], -float('inf')).max(dim=1)[0]
        x = torch.cat([output[torch.arange(0, output.size(0)),lengths-1], max_pool, avg_pool], 1) #Concat pooling.
        x = self.layers(x)
        return x
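
To make the layer bookkeeping concrete, the following sketch builds the same kind of batchnorm, dropout, linear stack by hand for a pooled input of size 3 * emb_sz. The sizes and dropout values are arbitrary, and the loop is a simplified stand-in for `bn_drop_lin`, not a replacement for it.

```python
import torch
import torch.nn as nn

emb_sz, n_out = 20, 2
sizes = [3 * emb_sz, 50, n_out]  # concat pooling triples the feature size
drops = [0.4, 0.1]

blocks = []
for i, (n_in, n_o, p) in enumerate(zip(sizes[:-1], sizes[1:], drops)):
    blocks += [nn.BatchNorm1d(n_in), nn.Dropout(p), nn.Linear(n_in, n_o)]
    if i < len(sizes) - 2:  # ReLU after every block except the last one
        blocks.append(nn.ReLU(inplace=True))

head = nn.Sequential(*blocks).eval()   # eval mode: no dropout, running batchnorm stats
pooled = torch.randn(4, 3 * emb_sz)    # stand-in for the concat-pooled features
logits = head(pooled)
print(len(head), logits.shape)
```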

We also need to take care of the fact that our tweets might be too long to give to the LSTM in one go, so we split each batch along the sequence dimension into smaller batches of shape $(bs, bptt)$. This is done by wrapping our awd_lstm model in a **SentenceEncoder**.

In [34]:
def pad_tensor(t, bs, val=0.):
    if t.size(0) < bs:
        return torch.cat([t, val + t.new_zeros(bs-t.size(0), *t.shape[1:])])
    return t

class SentenceEncoder(nn.Module):
    def __init__(self, module, bptt, pad_idx=1):
        super().__init__()
        self.bptt,self.module,self.pad_idx = bptt,module,pad_idx

    def concat(self, arrs, bs):
        return [torch.cat([pad_tensor(l[si],bs) for l in arrs], dim=1) for si in range(len(arrs[0]))]
    
    def forward(self, input):
        bs,sl = input.size()
        self.module.bs = bs
        self.module.reset()
        raw_outputs,outputs,masks = [],[],[]
        for i in range(0, sl, self.bptt):
            r,o,m = self.module(input[:,i: min(i+self.bptt, sl)])
            masks.append(pad_tensor(m, bs, 1))
            raw_outputs.append(r)
            outputs.append(o)
        return self.concat(raw_outputs, bs),self.concat(outputs, bs),torch.cat(masks,dim=1)
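
The slicing done in the forward loop can be tried on its own. In this sketch the "tokens" are just consecutive integers and the sizes are arbitrary; it shows that the last chunk may be shorter than bptt and that concatenating the chunks along the sequence dimension gives back the original batch.

```python
import torch

bs, sl, bptt = 2, 10, 4  # illustrative sizes
tokens = torch.arange(bs * sl).view(bs, sl)

# Same slicing as in SentenceEncoder.forward
chunks = [tokens[:, i:min(i + bptt, sl)] for i in range(0, sl, bptt)]
chunk_lengths = [c.size(1) for c in chunks]

# Concatenating along the sequence dimension restores the full batch
rebuilt = torch.cat(chunks, dim=1)
print(chunk_lengths, torch.equal(rebuilt, tokens))
```

In the encoder, the hidden state is reset once per batch and then carried from one chunk to the next, so the model still reads each sequence from start to end.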

### Full classifier 

In [43]:
def mget_text_classifier(vocab_sz, emb_sz, n_hid, n_layers, n_out, pad_token, bptt, output_p=0.4, hidden_p=0.2, 
                        input_p=0.6, embed_p=0.1, weight_p=0.5, layers=None, drops=None):
    "Create a full text classifier: AWD-LSTM encoder followed by a pooling linear classifier"
    rnn_enc = AWD_LSTM_clas(vocab_sz, emb_sz, n_hid=n_hid, n_layers=n_layers, pad_token=pad_token,
                        hidden_p=hidden_p, input_p=input_p, embed_p=embed_p, weight_p=weight_p)
    enc = SentenceEncoder(rnn_enc, bptt)
    if layers is None: layers = [50]
    if drops is None:  drops = [0.1] * len(layers)
    layers = [3 * emb_sz] + layers + [n_out] 
    drops = [output_p] + drops
    return SequentialRNN(enc, PoolingLinearClassifier(layers, drops))

In [45]:
bptt=70
emb_sz, nh, nl = 300, 300, 2
dps = tensor([0.4, 0.3, 0.4, 0.05, 0.5]) * 0.25
model = mget_text_classifier(60000, emb_sz, nh, nl, 2, 1, bptt, *dps)

In [46]:
model

SequentialRNN(
  (0): SentenceEncoder(
    (module): AWD_LSTM_clas(
      (emb): Embedding(60000, 300, padding_idx=1)
      (emb_dp): EmbeddingDropout(
        (emb): Embedding(60000, 300, padding_idx=1)
      )
      (rnns): ModuleList(
        (0): WeightDropout(
          (module): LSTM(300, 300, batch_first=True)
        )
        (1): WeightDropout(
          (module): LSTM(300, 300, batch_first=True)
        )
      )
      (input_dp): RNNDropout()
      (hidden_dps): ModuleList(
        (0): RNNDropout()
        (1): RNNDropout()
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(900, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.10000000149011612)
      (2): Linear(in_features=900, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (5): Dropout(p=0.1)
      (6): Linear(in_features=50, out_feat

As with the language model, most of the code is already in the fastai library with some minor changes, so we can directly use get_text_classifier from fastai:

In [48]:
config = dict(emb_sz=400, n_hid=1150, n_layers=3, pad_token=1, qrnn=False, bidir=False, output_p=0.4,
                       hidden_p=0.3, input_p=0.4, embed_p=0.05, weight_p=0.5)
model = get_text_classifier(AWD_LSTM, 60000, 2, config=config)

In [49]:
model

SequentialRNN(
  (0): MultiBatchEncoder(
    (module): AWD_LSTM(
      (encoder): Embedding(60000, 400, padding_idx=1)
      (encoder_dp): EmbeddingDropout(
        (emb): Embedding(60000, 400, padding_idx=1)
      )
      (rnns): ModuleList(
        (0): WeightDropout(
          (module): LSTM(400, 1150, batch_first=True)
        )
        (1): WeightDropout(
          (module): LSTM(1150, 1150, batch_first=True)
        )
        (2): WeightDropout(
          (module): LSTM(1150, 400, batch_first=True)
        )
      )
      (input_dp): RNNDropout()
      (hidden_dps): ModuleList(
        (0): RNNDropout()
        (1): RNNDropout()
        (2): RNNDropout()
      )
    )
  )
  (1): PoolingLinearClassifier(
    (layers): Sequential(
      (0): BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (1): Dropout(p=0.4)
      (2): Linear(in_features=1200, out_features=50, bias=True)
      (3): ReLU(inplace)
      (4): BatchNorm1d(50, eps=1e-05, momentum=

In [50]:
model[0]

MultiBatchEncoder(
  (module): AWD_LSTM(
    (encoder): Embedding(60000, 400, padding_idx=1)
    (encoder_dp): EmbeddingDropout(
      (emb): Embedding(60000, 400, padding_idx=1)
    )
    (rnns): ModuleList(
      (0): WeightDropout(
        (module): LSTM(400, 1150, batch_first=True)
      )
      (1): WeightDropout(
        (module): LSTM(1150, 1150, batch_first=True)
      )
      (2): WeightDropout(
        (module): LSTM(1150, 400, batch_first=True)
      )
    )
    (input_dp): RNNDropout()
    (hidden_dps): ModuleList(
      (0): RNNDropout()
      (1): RNNDropout()
      (2): RNNDropout()
    )
  )
)

We again need to split the parameters of the model into groups in order to apply discriminative learning rates and gradual unfreezing.

In [62]:
def get_class_model_param_groups(classifier_model) :
    """
    Returns the parameter groups structured by the RNN layers of the classifier model
    """
    parameters = []
    parameters.append({'params' : chain(classifier_model[0].module.encoder.parameters(), classifier_model[0].module.encoder_dp.parameters())})
    for rnn in classifier_model[0].module.rnns :
        parameters.append({'params' : rnn.parameters()})
    parameters.append({'params' : classifier_model[1].parameters()})
    return parameters
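
To see what these groups are used for, here is a minimal sketch on a toy model. The three-layer `toy` network and the learning rates are made up for illustration and are not the AWD_LSTM values. Each dictionary passed to a PyTorch optimizer becomes its own entry in `opt.param_groups`, and each entry carries its own `lr` that can be set independently.

```python
import torch
from torch import nn

# Hypothetical toy model standing in for embedding + RNN layers + classifier head
toy = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 4), nn.Linear(4, 2))

# One parameter group per layer, same structure as the list returned above
groups = [{'params': layer.parameters()} for layer in toy]
opt = torch.optim.Adam(groups, lr=1e-3)

# Illustrative discriminative learning rates: lower layers get smaller steps
lrs = [1e-4, 1e-3, 1e-2]
for group, lr in zip(opt.param_groups, lrs):
    group['lr'] = lr

group_lrs = [group['lr'] for group in opt.param_groups]
print(group_lrs)
```

The training loop further down updates `param_group['lr']` in exactly this way at every iteration.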

# Training

## Learner

For code simplicity, we gather everything needed for training in the Learner class (similar to fastai's Learner).
The Learner class contains the model, the optimizer, the loss function and the data. It can also freeze layers (exclude them from training) for gradual unfreezing. We create two subclasses, one for the classifier and one for the language model.

In [2]:
class Learner():
    """
    Container for a deep learning model, an optimizer, a loss function and the data (training and validation dataloaders),
    with the possibility to train the model

    Arguments :
        model (nn.Module): the pytorch model
        opt (torch.optim.Optimizer): the pytorch optimizer
        loss_func (Callable): the loss function 
        data (Databunch): the data containing the training and validation dataloaders
    """
    def __init__(self, model, opt, loss_func, data):
        self.model,self.opt,self.loss_func,self.data = model,opt,loss_func,data
    
    def freeze_to(self, n) :
        """
        Freezes the first n optimizer parameter groups and unfreezes the remaining ones
        """
        if n >= len(self.opt.param_groups) :
            raise ValueError(f'The optimizer only has {len(self.opt.param_groups)} parameter groups')
        
        for g in self.opt.param_groups[:n]:
            for l in g['params']:
                l.requires_grad=False
        for g in self.opt.param_groups[n:]: 
            for l in g['params']:
                l.requires_grad=True
    
    def unfreeze(self) :
        """
        Unfreezes all the parameter groups 
        """
        self.freeze_to(0)

    def save(self, path) :
        state = {'model' : self.model.state_dict(), 'opt' : self.opt.state_dict()}
        torch.save(state, path)
    
    def load(self, path) :
        state = torch.load(path)
        self.model.load_state_dict(state['model'])
        self.opt.load_state_dict(state['opt'])
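
As a rough illustration of how `freeze_to` supports gradual unfreezing, the sketch below applies the same logic to a toy optimizer with three parameter groups. The standalone `freeze_to` function and the three `nn.Linear` layers are assumptions made for the example, not part of the notebook. We first train only the last group, then unfreeze one more group at a time.

```python
import torch
from torch import nn

def freeze_to(opt, n):
    """Freeze the first n parameter groups of opt and unfreeze the rest."""
    for i, group in enumerate(opt.param_groups):
        for p in group['params']:
            p.requires_grad = i >= n

# Hypothetical stand-ins for the layer groups of the classifier
layers = [nn.Linear(4, 4) for _ in range(3)]
opt = torch.optim.Adam([{'params': l.parameters()} for l in layers], lr=1e-3)

history = []
for n in (2, 1, 0):  # unfreeze from the last group towards the first
    freeze_to(opt, n)
    trainable = [all(p.requires_grad for p in g['params']) for g in opt.param_groups]
    history.append(trainable)
    # ... one or more epochs of training would go here ...

print(history)
```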

In [3]:
import pandas as pd  # needed by make_submission below

class TextLanguageLearner(Learner) :
    
    def save_encoder(self, path) :
        """
        saves the encoder part of the model, which will then be loaded by the classifier
        """
        torch.save(self.model[0].state_dict(), path)
    
    def fit(self, epochs, **kwargs) :
        return fit(epochs, self, lm=True, **kwargs)

    def validate(self, cuda=True) :
        validate(self, cuda, lm=True)
    
class TextClassifierLearner(Learner) :
    
    def fit(self, epochs, **kwargs) :
        return fit(epochs, self, lm=False, **kwargs)
    
    def validate(self, cuda=True) :
        validate(self, cuda, lm=False)

    def add_test(self, test_dl) :
        self.test_dl = test_dl
    
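    def predict_test(self) :
        # add_test may never have been called, so do not assume the attribute exists
        if getattr(self, 'test_dl', None) is None :
            return 
        preds = []
        self.model = self.model.cuda()
        # inference only: disable dropout and gradient tracking
        self.model.eval()
        batches = tqdm_notebook(self.test_dl, leave=False,
                    total=len(self.test_dl), desc=f'Predictions')
        with torch.no_grad():
            for x, _ in batches :
                x = x.cuda()
                pred = self.model(x)[0]
                preds.append(torch.argmax(pred, dim=1))
        preds = torch.cat(preds, dim=0)
        return preds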
    
    def make_submission(self, path) :
        preds = self.predict_test()
        # map class index 0 to label -1 and keep class index 1 as label 1
        preds = (preds - (preds == 0).type(torch.cuda.LongTensor)).tolist()
        # submission ids start at 1
        sub = pd.DataFrame({'Id' : range(1, len(preds)+1), 'Prediction' : list(preds)})
        sub.to_csv(path)
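
The label arithmetic in `make_submission` is easy to misread, so here is a small CPU-only sketch of it. The `preds` tensor is made-up data, and `.long()` is used in place of `torch.cuda.LongTensor` so that the example runs without a GPU. Subtracting the mask `(preds == 0)` turns class index 0 into -1 and leaves class index 1 unchanged.

```python
import torch

# Made-up predicted class indices for four test examples
preds = torch.tensor([0, 1, 1, 0])

# (preds == 0) is 1 where the class index is 0, so 0 becomes -1 and 1 stays 1
labels = (preds - (preds == 0).long()).tolist()
print(labels)
```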

In [6]:
def get_language_learner(data, opt_func=torch.optim.Adam, loss_func=CrossEntropyFlat(), lr=0.01) :
    model = load_pretrained_lm(data.vocab)
    opt = opt_func(get_lang_model_param_groups(model), lr=lr)
    return TextLanguageLearner(model, opt, loss_func, data)

def get_classifier_learner(data, enc_path, opt_func=torch.optim.Adam, loss_func=CrossEntropyFlat(), lr=0.01) :
    model = get_text_classifier(AWD_LSTM, len(data.vocab), 2)
    load_encoder_clas(model, enc_path)
    opt = opt_func(get_class_model_param_groups(model), lr=lr)
    return TextClassifierLearner(model, opt, loss_func, data)

## Training loop

Our training loop implements the following (taken from [ULMFIT](https://arxiv.org/pdf/1801.06146.pdf) and [A disciplined approach to neural network hyper parameters part 1](https://arxiv.org/pdf/1803.09820.pdf)):

- Gradient clipping
- Activation regularization (AR)
- Temporal activation regularization (TAR)
- Discriminative learning rates
- Cyclical learning rates and momentum
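
The last two items can be sketched in plain Python, without fastai. The helpers `cos_anneal`, `geometric_lrs` and `one_cycle` below are illustrative names written for this example. They are meant to mirror the behaviour of `annealing_cos`, `even_mults` and the schedule computed inside `fit`, under the same default values, but they are not the library functions themselves.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def geometric_lrs(start, stop, n):
    """n learning rates evenly spaced on a log scale between start and stop."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

def one_cycle(iteration, total, max_lr, div_factor=25., pct_start=0.3, moms=(0.95, 0.85)):
    """Return (learning rate, momentum) for a given iteration of the cycle."""
    final_div = div_factor * 1e4
    cut = int(total * pct_start)
    if iteration < cut:
        # warm-up: lr rises to max_lr while momentum falls
        pct = iteration / cut
        return cos_anneal(max_lr / div_factor, max_lr, pct), cos_anneal(moms[0], moms[1], pct)
    # annealing: lr decays towards max_lr / final_div while momentum rises back
    pct = (iteration - cut) / (total - cut)
    return cos_anneal(max_lr, max_lr / final_div, pct), cos_anneal(moms[1], moms[0], pct)

total = 100
schedule = [one_cycle(i, total, max_lr=0.01) for i in range(total)]

# Discriminative learning rates: one maximum lr per parameter group
max_lrs = geometric_lrs(1e-4, 1e-2, 3)
```

With these defaults the learning rate starts at `max_lr / 25`, peaks at `max_lr` after 30% of the iterations, and then decays, while the momentum moves in the opposite direction.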

In [12]:
def fit(epochs, learn, lm, cuda=True, show_info=True, grad_clip=0.1, alpha=2., beta=1., record=True, one_cycle=True, 
                 max_lr:Union[float,slice]=0.01,  div_factor:float=25., pct_start:float=0.3, final_div:float=None, moms=(0.95, 0.85),
                 annealing:Callable=annealing_cos, notebook=True):
    
    """
    Train the learner for a number of epochs

    Arguments :

        epochs : number of epochs
        learn : the Learner
        lm : True when training a language model, False when training a classifier (changes how accuracy is computed)
        cuda : whether to train on the gpu
        show_info : show training and validation loss and accuracy
        grad_clip : maximum gradient norm used for gradient clipping 
        alpha :  activation regularization parameter 
        beta : temporal activation regularization parameter
        record : to record hyperparameters (learning rate and momentum), losses and accuracies
        one_cycle : for using cyclical learning rate and momentum
        max_lr : the max learning rate for the cycle (if max_lr is a slice then discriminative learning rate is applied)
        div_factor : factor by which max_lr is divided to get the starting learning rate of the cycle 
        pct_start : fraction of the cycle at which max_lr is reached
        final_div : factor by which max_lr is divided to get the final learning rate of the cycle (defaults to div_factor*1e4)
        moms : the highest and lowest momentum of the cycle 
        annealing : the interpolation function for the learning rate and the momentum 
        notebook : whether to use the notebook version of the tqdm progress bar
    """
    #number of batches in one epoch for validation and training data
    train_size = len(learn.data.train_dl)
    valid_size = len(learn.data.valid_dl)
    
    # total iterations and cut used for slanted_triangular learning rates (T and cut from paper)
    total_iterations = epochs*train_size

    if record:
        momentum = [[] for i in range(len(learn.opt.param_groups))]
        lrs_record = [[] for i in range(len(learn.opt.param_groups))]
        train_losses = []
        val_losses =[]
        train_accs = []
        valid_accs =[]
    
    #puts model on gpu
    if cuda :
        learn.model.cuda()
       
    #Start the epoch
    for epoch in range(epochs):
        
        if hasattr(learn.data.train_dl.dataset, "batchify"): learn.data.train_dl.dataset.batchify()

        #loss and accuracy 
        train_loss, valid_loss, train_acc, valid_acc = 0, 0, 0, 0

        #puts the model on training mode (activates dropout)
        learn.model.train()
        
        if notebook :
            batches_train = tqdm_notebook(learn.data.train_dl, leave=False,
                    total=len(learn.data.train_dl), desc=f'Epoch {epoch} training')
        else :
            batches_train = tqdm(learn.data.train_dl, leave=False,
                    total=len(learn.data.train_dl), desc=f'Epoch {epoch} training')

        
        #batch number counter
        batch_num = 0

        learn.model.reset()
       
        #starts sgd for each batch
        for x, y in batches_train:
            
            #cyclical learning rate and momentum
            if one_cycle :
                
                cut = int(total_iterations*pct_start)
                iteration = (epoch * train_size) + batch_num
                
                #next we compute the maximum lrs for each layer of our model, we can use either discriminative
                #learning rate or the same learning rate for each layer
                
                #if we use discriminative learning rates
                if isinstance(max_lr, slice) :
                    max_lrs = even_mults(max_lr.start, max_lr.stop, len(learn.opt.param_groups))
                
                #else we give the same max_lr to every layer of the model
                else :
                    max_lrs = [max_lr for i in range(len(learn.opt.param_groups))]
                
                #the final learning rate division factor
                if final_div is None: final_div = div_factor*1e4
                
                #computing the learning rate and momentum 
                if iteration < cut :
                    lrs = [annealing(lr/div_factor, lr, iteration/cut) for lr in max_lrs]
                    mom = annealing(moms[0], moms[1], iteration/cut) 
                else :
                    lrs = [annealing(lr, lr/final_div, (iteration-cut)/(total_iterations-cut)) for lr in max_lrs]
                    mom = annealing(moms[1], moms[0], (iteration-cut)/(total_iterations-cut))
                
                for i, param_group, lr in zip(range(len(learn.opt.param_groups)), learn.opt.param_groups, lrs) :
                    param_group['lr'] = lr
                    param_group['betas'] = (mom ,param_group['betas'][1])
                    if record :
                        lrs_record[i].append(lr)
                        momentum[i].append(mom)
            
            batch_num+=1

            #forward pass
            if cuda :
                x = x.cuda()
                y = y.cuda()
            pred, raw_out, out = learn.model(x)
            loss = learn.loss_func(pred, y)
            
            #activation regularization 
            if alpha != 0.:  loss += alpha * out[-1].float().pow(2).mean()
            
            #temporal activation regularization 
            if beta != 0.:
                h = raw_out[-1]
                if len(h)>1: loss += beta * (h[:,1:] - h[:,:-1]).float().pow(2).mean()
            
            train_loss += loss.item()  # .item() avoids keeping each batch's autograd graph alive
            if lm :
                train_acc += (torch.argmax(pred, dim=2) == y).type(torch.FloatTensor).mean() 
            else :
                train_acc += (torch.argmax(pred, dim=1) == y).type(torch.FloatTensor).mean() 

            # compute gradients and update parameters
            loss.backward()
            
            #gradient clipping
            if grad_clip:  torch.nn.utils.clip_grad_norm_(learn.model.parameters(), grad_clip)
            
            #optimization step
            learn.opt.step()
            learn.opt.zero_grad()

        train_loss = train_loss/train_size
        train_acc = train_acc/train_size
        

        # putting the model in eval mode so that dropout is not applied
        learn.model.eval()

        if notebook :
            batches_valid = tqdm_notebook(learn.data.valid_dl, leave=False,
                total=len(learn.data.valid_dl), desc=f'Epoch {epoch} validation')
        else : 
            batches_valid = tqdm(learn.data.valid_dl, leave=False,
                total=len(learn.data.valid_dl), desc=f'Epoch {epoch} validation')    
    
        with torch.no_grad():
            for x, y in batches_valid: 
                if cuda :
                    x = x.cuda()
                    y = y.cuda()
                pred = learn.model(x)[0]
                loss = learn.loss_func(pred, y)

                valid_loss += loss
                if lm :
                    valid_acc += (torch.argmax(pred, dim=2) == y).type(torch.FloatTensor).mean() 
                else :
                    valid_acc += (torch.argmax(pred, dim=1) == y).type(torch.FloatTensor).mean() 
                
        valid_loss = valid_loss/valid_size
        valid_acc = valid_acc/valid_size
        
        if show_info :
            print("Epoch {:.0f} training loss : {:.3f}, train accuracy : {:.3f}, validation loss : {:.3f}, valid accuracy : {:.3f}".format(epoch, train_loss, train_acc, valid_loss, valid_acc))
        if record :
            val_losses.append(valid_loss)
            train_losses.append(train_loss)
            train_accs.append(train_acc)
            valid_accs.append(valid_acc)
    
    if record :
        return {'train_loss' : train_losses, 'valid_loss' : val_losses, 'train_acc': train_accs, 'valid_acc' : valid_accs, 'lrs' : lrs_record, 'momentums' : momentum}    
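
To make the regularization terms and the clipping step of `fit` more concrete, here is a small sketch on made-up tensors. `out` and `raw` are hypothetical stand-ins for `out[-1]` and `raw_out[-1]` (the outputs of the last LSTM layer with and without dropout), with shape (batch, sequence, hidden), and `w` is a dummy parameter with a hand-set gradient.

```python
import torch

alpha, beta = 2., 1.  # same defaults as in fit

out = torch.ones(2, 3, 4)                  # stand-in for out[-1]
raw = torch.arange(24.).reshape(2, 3, 4)   # stand-in for raw_out[-1]

# AR penalizes large activations
ar = alpha * out.pow(2).mean()

# TAR penalizes large changes between consecutive time steps
tar = beta * (raw[:, 1:] - raw[:, :-1]).pow(2).mean()

# Gradient clipping rescales the gradients so that their global norm is at most max_norm
w = torch.nn.Parameter(torch.zeros(4))
w.grad = torch.full((4,), 3.0)             # gradient norm is 6 before clipping
torch.nn.utils.clip_grad_norm_([w], max_norm=0.1)
clipped_norm = w.grad.norm().item()

print(ar.item(), tar.item(), clipped_norm)
```

In `fit`, `ar` and `tar` are added to the cross-entropy loss before calling `backward`, and clipping is applied between `backward` and the optimizer step.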


We also have a function to just test the Learner on the validation data

In [None]:
def validate(learn, cuda=True, lm=True) :
    """
    Computes the validation loss and accuracy of the learner
    """
    valid_size = len(learn.data.valid_dl)
    
    #puts model on gpu
    if cuda :
        learn.model.cuda()
    else :
        learn.model.cpu()
    
    #loss and accuracy 
    valid_loss, valid_acc = 0, 0


    # putting the model in eval mode so that dropout is not applied
    learn.model.eval()
    with torch.no_grad():
        batches = tqdm_notebook(learn.data.valid_dl, leave=False,
                total=len(learn.data.valid_dl), desc=f'Validation')
        for x, y in batches: 
            if cuda :
                x = x.cuda()
                y = y.cuda()
            pred = learn.model(x)[0]
            loss = learn.loss_func(pred, y)

            valid_loss += loss
            if lm :
                valid_acc += (torch.argmax(pred, dim=2) == y).type(torch.FloatTensor).mean() 
            else :
                valid_acc += (torch.argmax(pred, dim=1) == y).type(torch.FloatTensor).mean() 
                
    valid_loss = valid_loss/valid_size
    valid_acc = valid_acc/valid_size
        
    print("Loss : {:.3f}, Accuracy : {:.3f}".format(valid_loss, valid_acc))
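
The only difference between the two accuracy branches is the dimension over which `argmax` is taken. The sketch below uses made-up predictions to show both cases, assuming the language model returns predictions of shape (batch, sequence, vocabulary) and the classifier returns predictions of shape (batch, classes).

```python
import torch

# Language model: one prediction per token, so the classes are on dim=2
lm_pred = torch.tensor([[[0.1, 0.9], [0.8, 0.2]]])   # shape (1, 2, 2)
lm_y = torch.tensor([[1, 1]])                        # shape (1, 2)
lm_acc = (torch.argmax(lm_pred, dim=2) == lm_y).float().mean().item()

# Classifier: one prediction per text, so the classes are on dim=1
clf_pred = torch.tensor([[0.3, 0.7], [0.6, 0.4]])    # shape (2, 2)
clf_y = torch.tensor([1, 0])                         # shape (2,)
clf_acc = (torch.argmax(clf_pred, dim=1) == clf_y).float().mean().item()

print(lm_acc, clf_acc)
```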
   

## Transferring language model encoder to classifier

We now need to transfer what we have learned in the language model to the classifier model. To do this, we simply transfer the AWD_LSTM parameters, which contain the embeddings and the LSTM layers. We already have a method in the TextLanguageLearner to save the encoder (save_encoder); now we need one to load it into the classifier.

In [None]:
def load_encoder_clas(model, enc_path):
    # loads the encoder weights saved by save_encoder into the AWD_LSTM of the classifier
    model[0].module.load_state_dict(torch.load(enc_path))
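
The same save-then-load pattern can be tried on a toy pair of models. In this sketch the two `nn.Sequential` networks are hypothetical stand-ins for the language model and the classifier, and an in-memory buffer replaces the file at `enc_path`. Only the shared first module (the "encoder") is transferred, while each model keeps its own head.

```python
import io
import torch
from torch import nn

# Toy stand-ins: same encoder shape, different heads
lm = nn.Sequential(nn.Linear(3, 3), nn.Linear(3, 5))    # encoder + language model decoder
clf = nn.Sequential(nn.Linear(3, 3), nn.Linear(3, 2))   # encoder + classifier head

# Save only the encoder weights of the language model
buffer = io.BytesIO()
torch.save(lm[0].state_dict(), buffer)

# Load them into the encoder of the classifier
buffer.seek(0)
clf[0].load_state_dict(torch.load(buffer))

same_encoder = torch.equal(clf[0].weight, lm[0].weight) and torch.equal(clf[0].bias, lm[0].bias)
print(same_encoder)
```

`load_state_dict` raises an error when the parameter names or shapes do not match, which is why the classifier must be built with the same encoder configuration as the language model.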