In [0]:
from fastai.text import * 
from fastai import * 


**[1][Regularizing RNNs by Stabilizing Activations](https://arxiv.org/abs/1511.08400)**
 - by David Krueger, Roland Memisevic



In [2]:

bs=48
path = untar_data(URLs.IMDB)
path.ls()
data_lm = (TextList.from_folder(path)
           #Inputs: all the text files in path
            .filter_by_folder(include=['train', 'test', 'unsup']) 
           #We may have other temp folders that contain text files so we only keep what's in train, test and unsup
            .split_by_rand_pct(0.1)
           #We randomly split and keep 10% (10,000 reviews) for validation
            .label_for_lm()           
           #We want to do a language model so we label accordingly
            .databunch(bs=bs))
data_lm.save('data_lm.pkl')

data_lm = load_data(path, 'data_lm.pkl', bs=bs)
data_lm.show_batch()

idx,text
0,"breaking the xxmaj ice . "" xxmaj seriously , the only good thing about this film was seeing the original xxmaj hanson xxmaj brothers reprise their roles , which was not nearly enough to save it . a life - size cardboard cutout of xxmaj paul xxmaj newman has more acting talent than xxmaj steven xxmaj baldwin . xxmaj and i 'm being generous . xxmaj only if you have"
1,"award , and surprisingly , it did n't work ( unlike xxmaj dustin xxmaj hoffman 's one - note performance in xxmaj rain xxmaj man or xxmaj al xxmaj pacino 's self - parody in xxmaj scent of a xxmaj woman ) . xxmaj michael xxmaj apted should know better , having done a half - decent job portraying mountain life in xxmaj coal xxmaj miner 's xxmaj daughter ."
2,"he keeps ignoring how he 's being treated by the xxmaj roberts of the world , all of who are bleeding him white ( no pun intended )  and , in the finale , to the point of forcing xxmaj davis to take a dive in his last bout and retire rich . xxmaj does he do it or does he redeem his honor ? i 'll let you"
3,"the world . xxbos i was very fortunate last night , as i had a chance to see xxup xxunk in a sneak preview almost half a year prior to its scheduled start over here in xxmaj austria . i was instantly thrilled as i heard a lot about this ambitious project in the past two years ! xxmaj and i learned a lot last night ... xxmaj for starters"
4,"zoo in xxmaj budapest "" and "" xxmaj man 's xxmaj castle "" would cement her position as the xxmaj depression 's most desirable waif , the pin - up girl of the bread lines . xxmaj with the xxunk comedienne xxmaj winnie xxmaj lightner as her wisecracking pal and xxmaj guy xxmaj kibbee , criminally wasted as xxmaj lightner 's swain . xxbos xxmaj as i saw the movie"


In [0]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
# learn.lr_find()
# learn.recorder.plot(skip_end=15)

In [0]:
learn.save_encoder('fine_tuned_enc')

In [5]:
path = untar_data(URLs.IMDB)
data_clas = (TextList.from_folder(path, vocab=data_lm.vocab)
             #grab all the text files in path
             .split_by_folder(valid='test')
             #split by train and valid folder (that only keeps 'train' and 'test' so no need to filter)
             .label_from_folder(classes=['neg', 'pos'])
             #label them all with their folders
             .databunch(bs=bs))

data_clas.save('data_clas.pkl')
data_clas = load_data(path, 'data_clas.pkl', bs=bs)
data_clas.show_batch()

text,target
xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules,pos
"xxbos * ! ! - xxup spoilers - ! ! * \n \n xxmaj before i begin this , let me say that i have had both the advantages of seeing this movie on the big screen and of having seen the "" xxmaj authorized xxmaj version "" of this movie , remade by xxmaj stephen xxmaj king , himself , in 1997 . \n \n xxmaj both",pos
"xxbos * * * xxmaj warning - this review contains "" plot spoilers , "" though nothing could "" spoil "" this movie any more than it already is . xxmaj it really xxup is that bad . * * * \n \n xxmaj before i begin , i 'd like to let everyone know that this definitely is one of those so - incredibly - bad - that",neg
"xxbos i felt duty bound to watch the 1983 xxmaj timothy xxmaj dalton / xxmaj zelah xxmaj clarke adaptation of "" xxmaj jane xxmaj eyre , "" because i 'd just written an article about the 2006 xxup bbc "" xxmaj jane xxmaj eyre "" for xxunk . \n \n xxmaj so , i approached watching this the way i 'd approach doing homework . \n \n i",pos
"xxbos xxmaj waitress : xxmaj honey , here 's them eggs you ordered . xxmaj honey , like bee , get it ? xxmaj that 's called pointless foreshadowing . \n \n xxmaj edward xxmaj basket : xxmaj huh ? ( xxmaj on the road ) xxmaj basket : xxmaj here 's your doll back , little girl . xxmaj you really should n't be so careless with your",neg


First, following fastai's basic methodology, we created a language model pretrained on Wikipedia text. After that, we established a baseline using fastai's `fit_one_cycle` policy. 

In [0]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
# learn.lr_find()
# learn.recorder.plot()
learn.fit_one_cycle(5, 2e-2, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.471733,0.457118,0.77704,10:06



#### **BASELINE** ####
The baseline I have established does not include fine-tuning the language model on the IMDB data: I simply saved and loaded the pretrained encoder as-is. 


#### **fastai's existing regularization techniques** ####
fastai already uses a couple of techniques for stabilizing RNNs. 
Those are defined [here](https://github.com/fastai/fastai/blob/master/fastai/callbacks/rnn.py). 

The fastai documentation for these techniques in `fastai.text` can be found [here](https://docs.fast.ai/callbacks.rnn.html).

These techniques are adopted from this paper:

**[2][Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182)**
 - Stephen Merity, Nitish Shirish Keskar, Richard Socher

**[2]** describes a technique called `Temporal Activation Regularization (TAR)`.

The TAR technique essentially adds the following term to the loss (which is why it lives in the callback's `on_backward_begin` method): 

$\beta \, L_2(h_t - h_{t+1})$

where $h_t$ is the hidden state at timestep $t$.
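
To make the TAR term concrete, here is a minimal plain-Python sketch. It mirrors the mean-of-squared-differences expression used in fastai's callback rather than a strict L2 norm. The toy hidden states, the `tar_penalty` helper and the value of `beta` are all made up for illustration.

```python
# Hypothetical toy hidden states: 3 timesteps, 2 features each.
hidden = [[1.0, 2.0], [1.5, 2.0], [1.5, 4.0]]
beta = 1.0  # assumed TAR weight, chosen only for this illustration

def tar_penalty(hidden, beta):
    """beta * mean squared difference between consecutive hidden states."""
    sq_diffs = [
        (nxt_x - prev_x) ** 2
        for prev, nxt in zip(hidden[:-1], hidden[1:])   # consecutive timesteps
        for prev_x, nxt_x in zip(prev, nxt)             # matching features
    ]
    return beta * sum(sq_diffs) / len(sq_diffs)

penalty = tar_penalty(hidden, beta)
print(penalty)
```

The penalty grows whenever any feature of the hidden state changes between timesteps, so it pushes the network toward slowly varying activations.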

For those who are confused by the fancy terms (as I was), [here](https://en.wikipedia.org/wiki/Norm_(mathematics)#p-norm) is the definition of the L2 norm.

[2] also states:

> As in Merity et al. (2017), the AR and TAR loss are only applied to the output of the final RNN layer as opposed to being applied to all layers.

The Norm Stabilizer paper [1] introduces a technique similar to TAR from the paper above. So to implement this paper we will have to override the existing callback. 

Below is the callback implemented by fastai for reference. 



In [0]:
class RNNTrainer(LearnerCallback):
    "`Callback` that regroups lr adjustment to seq_len, AR and TAR."
    def __init__(self, learn:Learner, alpha:float=0., beta:float=0.):
        super().__init__(learn)
        self.not_min += ['raw_out', 'out']
        self.alpha,self.beta = alpha,beta
        
    def on_epoch_begin(self, **kwargs):
        "Reset the hidden state of the model."
        self.learn.model.reset()

    def on_loss_begin(self, last_output:Tuple[Tensor,Tensor,Tensor], **kwargs):
        "Save the extra outputs for later and only returns the true output."
        self.raw_out,self.out = last_output[1],last_output[2]
        return {'last_output': last_output[0]}

    def on_backward_begin(self, last_loss:Rank0Tensor, last_input:Tensor, **kwargs):
        "Apply AR and TAR to `last_loss`."
        
        #AR 
        if self.alpha != 0.:  last_loss += self.alpha * self.out[-1].float().pow(2).mean()


        # TAR
        if self.beta != 0.:
            h = self.raw_out[-1]
            if len(h)>1: last_loss += self.beta * (h[:,1:] - h[:,:-1]).float().pow(2).mean() #L2


        return {'last_loss': last_loss}


#### **Modifications** ####

We are going to have to replace the existing callback with a callback which also adds the norm-stabilizer term. 

#### **Norm Stabilizer** ####
The term recommended by [1] is as follows: 

$\frac{\beta}{T} \sum_{t=1}^{T} \left( \lVert h_t \rVert_2 - \lVert h_{t-1} \rVert_2 \right)^2$
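
As a rough illustration of the norm-stabilizer term, the plain-Python sketch below computes it for a single toy sequence. The data, the helper names and the value of `beta` are hypothetical, and taking `T` to be the number of timesteps in the sequence is an assumption of this sketch.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of one hidden-state vector."""
    return math.sqrt(sum(x * x for x in vec))

def norm_stabilizer_penalty(hidden, beta):
    """(beta / T) * sum over t of (||h_t|| - ||h_{t-1}||)^2 for one sequence."""
    norms = [l2_norm(h) for h in hidden]
    sq_diffs = [(cur - prev) ** 2 for prev, cur in zip(norms[:-1], norms[1:])]
    return beta / len(hidden) * sum(sq_diffs)  # assumes T = sequence length

# Hypothetical toy sequence: the norms are 5, 5 and 10.
hidden = [[3.0, 4.0], [4.0, 3.0], [6.0, 8.0]]
beta = 3.0  # assumed weight, chosen only for this illustration

penalty = norm_stabilizer_penalty(hidden, beta)
print(penalty)
```

The first step, from `[3, 4]` to `[4, 3]`, changes the hidden state but not its norm, so it contributes nothing here, whereas TAR would penalize it. Only the second step, where the norm doubles, is penalized.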

I have modified the RNNTrainer to add the norm-stabilizer below. 


In [0]:
def apply_ar(alpha,out): return alpha * out[-1].float().pow(2).mean()

def apply_tar(beta,h): return beta * (h[:,1:] - h[:,:-1]).float().pow(2).mean()

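def apply_normstable(beta,h):
    # L2 norm of each hidden state over the feature dimension -> shape (bs, seq_len)
    norms = h.float().norm(p=2, dim=-1)
    # (beta/T) * sum over timesteps of squared norm differences, averaged over the batch
    return beta/h.shape[1] * (norms[:,1:] - norms[:,:-1]).pow(2).sum(dim=1).mean()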

class RNNTrainerNorm(LearnerCallback):
    def __init__(self, learn:Learner, alpha:float=2., beta:float=1., beta_norm:float=50.):
        super().__init__(learn)
        self.not_min += ['raw_out', 'out']
        self.alpha,self.beta,self.beta_norm = alpha,beta,beta_norm
        
    def on_epoch_begin(self, **kwargs):
        "Reset the hidden state of the model."
        self.learn.model.reset()

    def on_loss_begin(self, last_output:Tuple[Tensor,Tensor,Tensor], **kwargs):
        "Save the extra outputs for later and only returns the true output."
        self.raw_out,self.out = last_output[1],last_output[2]
        return {'last_output': last_output[0]}

    def on_backward_begin(self, last_loss:Rank0Tensor, last_input:Tensor, **kwargs):
        "Apply AR, TAR and the norm stabilizer to `last_loss`."
        #AR and TAR
        if self.alpha != 0.:  last_loss += apply_ar(self.alpha,self.out)
        if self.beta != 0.:
            h = self.raw_out[-1]
            if len(h)>1: last_loss += apply_tar(self.beta,h)

        if self.beta_norm != 0.:
            h = self.raw_out[-1]
            if len(h)>1: last_loss += apply_normstable(self.beta_norm,h)        
        
        return {'last_loss': last_loss}



In [0]:
def has_params(m:nn.Module)->bool:
    "Check if `m` has at least one parameter"
    return len(list(m.parameters())) > 0


In [0]:
modules = [m for m in flatten_model(learn.model) if has_params(m)]

In [0]:
modules

[Embedding(60000, 400, padding_idx=1),
 Embedding(60000, 400, padding_idx=1),
 LSTM(400, 1152, batch_first=True),
 ParameterModule(),
 LSTM(1152, 1152, batch_first=True),
 ParameterModule(),
 LSTM(1152, 400, batch_first=True),
 ParameterModule(),
 BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
 Linear(in_features=1200, out_features=50, bias=True),
 BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
 Linear(in_features=50, out_features=2, bias=True)]

In [0]:
def requires_grad_bool(m:nn.Module)->Optional[bool]:
    ps = list(m.parameters())
    return ps[0].requires_grad


In [0]:
for it in modules:
  print(requires_grad_bool(it),it)

False Embedding(60000, 400, padding_idx=1)
False Embedding(60000, 400, padding_idx=1)
False LSTM(400, 1152, batch_first=True)
False ParameterModule()
False LSTM(1152, 1152, batch_first=True)
False ParameterModule()
False LSTM(1152, 400, batch_first=True)
False ParameterModule()
True BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
True Linear(in_features=1200, out_features=50, bias=True)
True BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
True Linear(in_features=50, out_features=2, bias=True)


In [0]:
learn.freeze_to(-2)

In [0]:
modules = [m for m in flatten_model(learn.model) if has_params(m)]

In [0]:
for it in modules:
  print(requires_grad_bool(it),it)

False Embedding(60000, 400, padding_idx=1)
False Embedding(60000, 400, padding_idx=1)
False LSTM(400, 1152, batch_first=True)
False ParameterModule()
False LSTM(1152, 1152, batch_first=True)
False ParameterModule()
True LSTM(1152, 400, batch_first=True)
True ParameterModule()
True BatchNorm1d(1200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
True Linear(in_features=1200, out_features=50, bias=True)
True BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
True Linear(in_features=50, out_features=2, bias=True)


In [0]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')
learn.freeze_to(-2)
learn.lr_find()
learn.recorder.plot()



In [0]:
learn.fit_one_cycle(5, 1e-3, moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.402022,0.315708,0.86636,12:47
