## Fastai with HuggingFace 🤗 Transformers

<center><img src = "https://miro.medium.com/max/1400/1*Aqcm4iX3AQNWx9Zb-z7o1Q.png"></center>



> <div class="alert alert-block alert-info">
>     <p>This is an introductory notebook on transformers with HuggingFace and FastAI. I have tried to make it as interactive as I can. If you find any mistakes or ways in which it could be improved, please let me know. And if you find this notebook useful, please upvote it (it's free).</p>
> </div>

## The task in hand: Movie Reviews

Given an input text 🔣 taken from a movie review, we need to classify it into one of 5 classes 🏆, which are based on the sentiment of the review.
<div class="alert alert-block alert-info">

The classes:
 <ul>
<li>0 -> Negative          ✅</li>
<li>1 -> Somewhat Negative ✅</li>
<li>2 -> Neutral           ✅</li>
<li>3 -> Somewhat Positive ✅</li>
<li>4 -> Positive          ✅</li>
   </ul>
</div>

* The data is stored in TSV files, which can be loaded into a `DataFrame` using `pandas` 🐼


In [None]:
!pip install -q transformers==2.0.0
!pip install -q fastai==1.0.58

In [None]:
from IPython.core.display import HTML
def css_styling():
    styles = open("../input/riiid-answer-correctness-prediction-rapids/custom.css", "r").read()
    return HTML("<style>"+styles+"</style>")
css_styling()

In [None]:
def notebook_styling():
    styles = open("../input/riiid-answer-correctness-prediction-rapids/custom_rapids.css", "r").read()
    return HTML("<style>"+styles+"</style>")
notebook_styling()

In [None]:
class color:
    '''S from Start & E from End.'''
    S = '\033[1m' + '\033[93m'
    E = '\033[0m'

In [None]:
import numpy as np  # Linear Algebra
import pandas as pd # Data Manipulation
from pathlib import Path

import os

import torch
import torch.optim

import random

# fastai
from fastai import *
from fastai.text import *
from fastai.callbacks import *

# Warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Transformers
from transformers import PreTrainedModel,PreTrainedTokenizer,PretrainedConfig

from transformers import BertForSequenceClassification,BertTokenizer,BertConfig
from transformers import RobertaForSequenceClassification,RobertaTokenizer,RobertaConfig
from transformers import XLNetForSequenceClassification,XLNetTokenizer,XLNetConfig
from transformers import XLMForSequenceClassification,XLMTokenizer,XLMConfig
from transformers import DistilBertForSequenceClassification,DistilBertTokenizer,DistilBertConfig

In [None]:
import fastai
import transformers

print(color.S+"FastAI version:"+color.E,fastai.__version__)
print(color.S+"Transformers version:"+color.E,transformers.__version__)

In [None]:
for dirname,_,filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname,filename))

In [None]:
DATA_ROOT = Path("/kaggle/input/sentiment-analysis-on-movie-reviews")  # absolute path to the competition data
train = pd.read_csv(DATA_ROOT/'train.tsv.zip',sep='\t')
test = pd.read_csv(DATA_ROOT/'test.tsv.zip',sep='\t')
print(color.S+ "Train shape:"+color.E,train.shape)
print(color.S+ "Test shape:"+color.E,test.shape)
train.head()


## MAIN TRANSFORMER CLASS
<div class="alert alert-block alert-info">
<b>In transformers, each model architecture consists of three things:</b>
</div>

* **Configuration Class**: Describes the architecture of the model (number of layers, hidden size, number of labels, ...)
* **Model Class**: Defines the network itself and loads the pretrained weights
* **Tokenizer Class**: Tokenizes and preprocesses the data to make it compatible with the model

In [None]:
MODEL_CLASSES = {'bert':(BertForSequenceClassification,BertTokenizer,BertConfig),
                'xlm':(XLMForSequenceClassification,XLMTokenizer,XLMConfig),
                'xlnet':(XLNetForSequenceClassification,XLNetTokenizer,XLNetConfig),
                'roberta':(RobertaForSequenceClassification,RobertaTokenizer,RobertaConfig),
                'distilbert':(DistilBertForSequenceClassification,DistilBertTokenizer,DistilBertConfig)}

* To load each of these classes (**tokenizer**, **config**, **model**), they all share a common method, `from_pretrained(pretrained_model_name)`. In our case, `pretrained_model_name` is a string holding the shortcut name of the model, for example `roberta-base`.

In [None]:
# parameters
seed = 42
use_fp16 = False   # Whether or not to use mixed precision training.
bs = 16 # Batch size used by the DataLoaders inside the DataBunch
model_type = 'bert'
pretrained_model_name = 'bert-base-uncased'

In [None]:
model_class,tokenizer_class,config_class = MODEL_CLASSES[model_type]

* Having selected the **BERT model**, we need the names of the different pretrained BERT models available. For that, the `pretrained_model_archive_map` attribute is used

In [None]:
model_class.pretrained_model_archive_map.keys() # Note: newer versions of transformers raise an error here, which is why this notebook pins transformers to 2.0.0

## Util Function

* To get reproducible results 📉, we need to set the seed, so that every operation that relies on randomness produces the same random numbers on every run

In [None]:
def seed_all(seed):
    random.seed(seed)  # Python Random Number
    np.random.seed(seed) # Numpy random number
    torch.manual_seed(seed) # Torch random number
    
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed) # GPU Vars
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
seed_all(seed)
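
To see why seeding matters, here is a minimal sketch that uses only Python's standard `random` module (the helper name `draw_numbers` is made up for this illustration). Re-seeding with the same value replays the same pseudo-random sequence, which is what `seed_all` arranges for Python, NumPy and PyTorch at once.

```python
import random

def draw_numbers(seed, count=3):
    """Seed Python's RNG, then draw `count` pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 999) for _ in range(count)]

first_run = draw_numbers(42)
second_run = draw_numbers(42)   # same seed, so the same sequence is replayed

print(first_run == second_run)  # True
```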

## Data Pre-processing

* For a moment, set the details aside. Computers cannot work with raw text, only with numbers. So how do we convert text into numbers?
* Split the text into a list of tokens (known as tokenization)
* Assign each token a number (known as numericalization in fastai)

* Each transformer model needs its text preprocessed in its own way (tokenization followed by numericalization). For that we have the `tokenizer` class, which converts the text into the appropriate input for the model
* What does a pretrained `tokenizer` class do?
<div class="alert alert-block alert-info">
    <ul>
    <li>Tokenizing (splitting strings in sub-word token strings), converting tokens strings to ids and back, and encoding/decoding (i.e., tokenizing and converting to integers).</li>

    <li>Adding new tokens to the vocabulary in a way that is independent of the underlying structure (BPE, SentencePiece…).</li>

    <li>Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure they are not split during tokenization.</li>
     </ul>
</div>
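
To make the first point concrete, here is a toy sketch of sub-word tokenization followed by the conversion to ids and back. It is written in plain Python and does not use the `transformers` library. The vocabulary `TOY_VOCAB` and all function names are made up for illustration, and the greedy longest-match splitting only imitates the WordPiece style that BERT uses.

```python
# Hypothetical toy vocabulary; a real BERT vocabulary is far larger.
TOY_VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "play", "##ing", "##ed", "movie", "the"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(TOY_VOCAB)}

def tokenize_word(word):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in TOKEN_TO_ID:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]                  # nothing matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def tokenize(text):
    """Lowercase, split on whitespace, then split every word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word))
    return tokens

def tokens_to_ids(tokens):
    return [TOKEN_TO_ID.get(tok, TOKEN_TO_ID["[UNK]"]) for tok in tokens]

def ids_to_tokens(ids):
    return [TOY_VOCAB[i] for i in ids]

tokens = tokenize("The movie played")
ids = tokens_to_ids(tokens)
print(tokens)              # ['the', 'movie', 'play', '##ed']
print(ids)                 # [8, 7, 4, 6]
print(ids_to_tokens(ids))  # ['the', 'movie', 'play', '##ed']
```

The real tokenizers do the same three things (split, map to ids, map back), only with a learned vocabulary and more careful text normalization.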


#### Custom Tokenizer
It is quite simple: there are just two things to look at and then you are done :)

> 1. The `BaseTokenizer` class, which takes a string as input and returns a list containing the tokens of the sentence
> 2. The `Tokenizer` class, which takes a `BaseTokenizer` as an argument, applies optional pre- and post-processing rules around it, and outputs the final list of tokens (in our case including the **start** and **end** of sentence tokens added by the base tokenizer).

Let us implement it; it is really easy!

In [None]:
class TransformersBaseTokenizer(BaseTokenizer):
    def __init__(self,pretrained_tokenizer:PreTrainedTokenizer,model_type = 'bert',**kwargs):
        self._pretrained_tokenizer = pretrained_tokenizer
        self.max_seq_len = pretrained_tokenizer.max_len
        self.model_type = model_type
    
    def __call__(self,*args,**kwargs):
        return self
    
    def tokenizer(self,t):
        CLS  = self._pretrained_tokenizer.cls_token
        SEP = self._pretrained_tokenizer.sep_token
        if self.model_type in ['roberta']:
            tokens = self._pretrained_tokenizer.tokenize(t, add_prefix_space=True)[:self.max_seq_len - 2]
            tokens = [CLS] + tokens + [SEP]
        else:
            tokens = self._pretrained_tokenizer.tokenize(t)[:self.max_seq_len - 2]
            if self.model_type in ['xlnet']:
                tokens = tokens + [SEP] +  [CLS]
            else:
                tokens = [CLS] + tokens + [SEP]
                
        return tokens

In [None]:
transformer_tokenizer = tokenizer_class.from_pretrained(pretrained_model_name)
transformer_base_tokenizer = TransformersBaseTokenizer(pretrained_tokenizer = transformer_tokenizer,model_type = model_type)
fastai_tokenizer = Tokenizer(tok_func = transformer_base_tokenizer,pre_rules = [],post_rules = [])

In this implementation, we need to take the following things into account:

1. Since this is not an RNN ❌, we need to limit the sequence length to a fixed maximum
2. Most of these models require special tokens at the start and the end of the sentence
3. Some models, such as RoBERTa, require a space at the start of the string. For those, we call `tokenize` with `add_prefix_space` set to True

Input layouts used by the various models:

    ⚇ bert :      [CLS] + tokens + [SEP] + padding
    ⚇ xlm  :      [CLS] + tokens + [SEP] + padding
    ⚇ distilbert: [CLS] + tokens + [SEP] + padding
    ⚇ roberta:    [CLS] + prefix_space + tokens + [SEP] + padding
    ⚇ xlnet:      padding + tokens + [SEP] + [CLS]
    

* And the padding? We do not need to do anything here; it is handled automatically by the `DataBunch`
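
The layouts above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `add_special_tokens` and `pad` are made up, and the literal strings `[CLS]`, `[SEP]` and `[PAD]` are placeholders (each real tokenizer has its own special token strings).

```python
def add_special_tokens(tokens, model_type, max_seq_len, cls="[CLS]", sep="[SEP]"):
    """Truncate to leave room for two special tokens, then add them."""
    tokens = tokens[: max_seq_len - 2]
    if model_type == "xlnet":
        return tokens + [sep, cls]   # XLNet puts both markers at the end
    return [cls] + tokens + [sep]    # BERT-style layout

def pad(tokens, length, pad_first, pad_token="[PAD]"):
    """Pad a token list up to `length`, on the left or on the right."""
    padding = [pad_token] * (length - len(tokens))
    return padding + tokens if pad_first else tokens + padding

words = ["a", "great", "movie"]
bert_row = pad(add_special_tokens(words, "bert", max_seq_len=8), 6, pad_first=False)
xlnet_row = pad(add_special_tokens(words, "xlnet", max_seq_len=8), 6, pad_first=True)
print(bert_row)   # ['[CLS]', 'a', 'great', 'movie', '[SEP]', '[PAD]']
print(xlnet_row)  # ['[PAD]', 'a', 'great', 'movie', '[SEP]', '[CLS]']
```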



### CUSTOM NUMERICALIZER (For FastAI)

* In `fastai`, the `NumericalizeProcessor` object takes a `Vocab` object as its `vocab` argument.


<div class="alert alert-block alert-success">
    <li>Why are we doing this? 🤔</li>
    <li>To understand the implementation of Numericalization</li>
</div>

* Let us do it; it is quite easy. All we need is to wrap the pretrained tokenizer and implement the functions that go from a list of strings to a list of numbers and vice versa 🤩

In [None]:
class TransformersVocab(Vocab):
    def __init__(self, tokenizer: PreTrainedTokenizer):
        super(TransformersVocab, self).__init__(itos = [])
        self.tokenizer = tokenizer

    def numericalize(self, t:Collection[str]) -> List[int]:
        "Convert a list of tokens `t` to their ids."
        return self.tokenizer.convert_tokens_to_ids(t)
        #return self.tokenizer.encode(t)

    def textify(self, nums:Collection[int], sep=' ') -> List[str]:
        "Convert a list of `nums` to their tokens."
        nums = np.array(nums).tolist()
        return sep.join(self.tokenizer.convert_ids_to_tokens(nums)) if sep is not None else self.tokenizer.convert_ids_to_tokens(nums)

    def __getstate__(self):
        return {'itos':self.itos, 'tokenizer':self.tokenizer}

    def __setstate__(self, state:dict):
        self.itos = state['itos']
        self.tokenizer = state['tokenizer']
        self.stoi = collections.defaultdict(int,{v:k for k,v in enumerate(self.itos)})

## Custom processor

* Now that we have our `custom tokenizer` and `numericalizer`, we can create the `custom processor`. Notice we are passing the `include_bos = False` and `include_eos = False` options. This is because fastai adds its own special tokens by default which interferes with the [CLS] and [SEP] tokens added by our custom tokenizer.

In [None]:
transformer_vocab =  TransformersVocab(tokenizer = transformer_tokenizer)
numericalize_processor = NumericalizeProcessor(vocab = transformer_vocab)

tokenize_processor = TokenizeProcessor(tokenizer = fastai_tokenizer,include_bos=False, include_eos=False)
transformer_processor = [tokenize_processor, numericalize_processor]

## Setting up the DataBunch ✅

* The only thing left to take care of is whether the model requires padding first or padding last; after that you are good to train. Let us see how it is coded.



In [None]:
pad_first = bool(model_type in ['xlnet'])
pad_idx = transformer_tokenizer.pad_token_id

In [None]:
tokens = transformer_tokenizer.tokenize('Salut c est moi, Hello it s me')
print(tokens)
ids = transformer_tokenizer.convert_tokens_to_ids(tokens)
print(ids)
transformer_tokenizer.convert_ids_to_tokens(ids)

In [None]:
# Implementing DataBunch

databunch = (TextList.from_df(train,cols = 'Phrase',processor = transformer_processor)
            .split_by_rand_pct(0.1,seed = seed)
            .label_from_df(cols = 'Sentiment')
            .add_test(test)
            .databunch(bs = bs,pad_first=pad_first,pad_idx = pad_idx))
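
As a rough picture of what the random split step does, here is a minimal sketch in plain Python. The function body is my own illustration of the idea (shuffle the indices with a fixed seed, then cut off a fraction for validation) and is not fastai's actual implementation.

```python
import random

def split_by_rand_pct(items, valid_pct=0.1, seed=None):
    """Hold out a random `valid_pct` of `items` for validation."""
    rng = random.Random(seed)          # private RNG, the global state stays untouched
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = int(valid_pct * len(items))
    valid_idx = set(indices[:cut])
    train = [x for i, x in enumerate(items) if i not in valid_idx]
    valid = [x for i, x in enumerate(items) if i in valid_idx]
    return train, valid

phrases = [f"phrase {i}" for i in range(100)]
train_items, valid_items = split_by_rand_pct(phrases, 0.1, seed=42)
print(len(train_items), len(valid_items))  # 90 10
```

Passing the same seed again gives the same split, which is why `seed = seed` is passed in the cell above.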

In [None]:
print('[CLS] token :', transformer_tokenizer.cls_token)
print('[SEP] token :', transformer_tokenizer.sep_token)
print('[PAD] token :', transformer_tokenizer.pad_token)
databunch.show_batch()

<div class="alert alert-block alert-success">
    <li>Check batch and numericalizer :</li>
 </div>

In [None]:
print('[CLS] id :', transformer_tokenizer.cls_token_id)
print('[SEP] id :', transformer_tokenizer.sep_token_id)
print('[PAD] id :', pad_idx)
test_one_batch = databunch.one_batch()[0]
print('Batch shape : ',test_one_batch.shape)
print(test_one_batch)

## CUSTOM MODEL
<div class="alert alert-block alert-success">
    <li> Let us see how we can build a custom transformer model to better meet our needs</li>
 </div>

In [None]:
class CustomTransformerModel(nn.Module):
    def __init__(self,transformer:PreTrainedModel):
        super(CustomTransformerModel,self).__init__()
        self.transformer = transformer
    
    def forward(self,input_ids,attention_mask = None):
        attention_mask = (input_ids!=pad_idx).type(input_ids.type()) 
        logits = self.transformer(input_ids,attention_mask = attention_mask)[0]
        return logits
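
The attention mask built in `forward` simply marks which positions hold real tokens and which hold padding. Here is the same idea on plain Python lists instead of tensors. The helper name is made up, and apart from 101, 102 and 0 (the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`) the ids are arbitrary.

```python
def attention_mask(batch_ids, pad_idx):
    """Return 1 for real tokens and 0 for padding, row by row."""
    return [[int(token_id != pad_idx) for token_id in row] for row in batch_ids]

batch = [
    [101, 2023, 3185, 102, 0, 0],       # short review, right-padded with pad_idx = 0
    [101, 1037, 2307, 2143, 999, 102],  # longest review in the batch, no padding
]
print(attention_mask(batch, pad_idx=0))
# [[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]]
```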

<div class="alert alert-block alert-success">
    <li> Now, for <code>multi-class classification</code>, we need to modify the configuration class (remember that the configuration is like the brain of the model: it holds all the information about which layers the model contains).</li>
 </div>



In [None]:
config = config_class.from_pretrained(pretrained_model_name)
config.num_labels = 5
config.use_bfloat16 = use_fp16
print(config)

In [None]:
transformer_model = model_class.from_pretrained(pretrained_model_name,config = config)
custom_transformer_model = CustomTransformerModel(transformer= transformer_model)

## Learner: Custom Optimizer/ Custom Metrics

<div class="alert alert-block alert-success">
    <li> Hugging Face used to ship two custom optimizers, <code>BertAdam</code> and <code>OpenAIAdam</code>, which have since been replaced by a single <code>AdamW</code> optimizer. Since it is implemented in PyTorch, we can integrate it with FastAI directly, and to reproduce the behaviour of <code>BertAdam</code> we need to set the argument <code>correct_bias=False</code></li>
 </div>




In [None]:
from fastai.callbacks import *
from transformers import AdamW
from functools import partial

CustomAdamW = partial(AdamW,correct_bias = False)
learner = Learner(databunch,custom_transformer_model,opt_func = CustomAdamW,metrics = [accuracy,error_rate])

learner.callbacks.append(ShowGraph(learner))
if use_fp16: learner = learner.to_fp16()

In [None]:
print(learner.model)

In [None]:
# For DistilBERT
# list_layers = [learner.model.transformer.distilbert.embeddings,
#                learner.model.transformer.distilbert.transformer.layer[0],
#                learner.model.transformer.distilbert.transformer.layer[1],
#                learner.model.transformer.distilbert.transformer.layer[2],
#                learner.model.transformer.distilbert.transformer.layer[3],
#                learner.model.transformer.distilbert.transformer.layer[4],
#                learner.model.transformer.distilbert.transformer.layer[5],
#                learner.model.transformer.pre_classifier]

# For bert-base-uncased
list_layers = [learner.model.transformer.bert.embeddings,
              learner.model.transformer.bert.encoder.layer[0],
              learner.model.transformer.bert.encoder.layer[1],
              learner.model.transformer.bert.encoder.layer[2],
              learner.model.transformer.bert.encoder.layer[3],
              learner.model.transformer.bert.encoder.layer[4],
              learner.model.transformer.bert.encoder.layer[5],
              learner.model.transformer.bert.encoder.layer[6],
              learner.model.transformer.bert.encoder.layer[7],
              learner.model.transformer.bert.encoder.layer[8],
              learner.model.transformer.bert.encoder.layer[9],
              learner.model.transformer.bert.encoder.layer[10],
              learner.model.transformer.bert.encoder.layer[11],
              learner.model.transformer.bert.pooler]

# For xlnet-base-cased
# list_layers = [learner.model.transformer.transformer.word_embedding,
#               learner.model.transformer.transformer.layer[0],
#               learner.model.transformer.transformer.layer[1],
#               learner.model.transformer.transformer.layer[2],
#               learner.model.transformer.transformer.layer[3],
#               learner.model.transformer.transformer.layer[4],
#               learner.model.transformer.transformer.layer[5],
#               learner.model.transformer.transformer.layer[6],
#               learner.model.transformer.transformer.layer[7],
#               learner.model.transformer.transformer.layer[8],
#               learner.model.transformer.transformer.layer[9],
#               learner.model.transformer.transformer.layer[10],
#               learner.model.transformer.transformer.layer[11],
#               learner.model.transformer.sequence_summary]

# For roberta-base
# list_layers = [learner.model.transformer.roberta.embeddings,
#               learner.model.transformer.roberta.encoder.layer[0],
#               learner.model.transformer.roberta.encoder.layer[1],
#               learner.model.transformer.roberta.encoder.layer[2],
#               learner.model.transformer.roberta.encoder.layer[3],
#               learner.model.transformer.roberta.encoder.layer[4],
#               learner.model.transformer.roberta.encoder.layer[5],
#               learner.model.transformer.roberta.encoder.layer[6],
#               learner.model.transformer.roberta.encoder.layer[7],
#               learner.model.transformer.roberta.encoder.layer[8],
#               learner.model.transformer.roberta.encoder.layer[9],
#               learner.model.transformer.roberta.encoder.layer[10],
#               learner.model.transformer.roberta.encoder.layer[11],
#               learner.model.transformer.roberta.pooler]

* Check groups :

In [None]:
learner.split(list_layers)
num_groups = len(learner.layer_groups)
print('Learner split in',num_groups,'groups')
print(learner.layer_groups)

## Train the Model and Experimentation 


<div class="alert alert-block alert-success">
    <li> This is new for me, but we will be using a <b>Slanted Triangular Learning Rate</b> and <b>Discriminative Learning Rates</b></li>
 </div>
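
For intuition, here is a small sketch of the slanted triangular schedule as I understand it from the ULMFiT paper: a short linear warm-up to the peak learning rate, followed by a long linear decay. The default values `cut_frac=0.1` and `ratio=32` are the ones I recall from that paper, so treat them as assumptions. Note that `fit_one_cycle`, used in the cells that follow, applies fastai's one-cycle policy, which has a similar warm-up-then-decay shape but is not this exact formula.

```python
import math

def slanted_triangular_lr(t, total_steps, lr_max, cut_frac=0.1, ratio=32):
    """Learning rate at step `t` (0-based) out of `total_steps`."""
    cut = math.floor(total_steps * cut_frac)   # step at which the peak is reached
    if t < cut:
        p = t / cut                                      # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))   # linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

total = 1000
for step in (0, 50, 100, 500, 1000):
    print(step, slanted_triangular_lr(step, total, lr_max=2e-3))
```

With these settings the schedule starts at `lr_max / ratio`, peaks at `lr_max` after 10% of the steps, and then decays back towards `lr_max / ratio`.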


In [None]:
learner.save('untrain')
seed_all(seed)
learner.load('untrain');

<div class="alert alert-block alert-success">
    <li> We freeze all the layer groups except the classifier, and check which layers are trainable</li>
 </div>


**Experiment. 1** : Freezing everything up to the last layer group

In [None]:
learner.freeze_to(-1)  # Freeze everything except the last layer group (classifier)
learner.summary()
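
To picture what `freeze_to(n)` does, here is a toy sketch on a plain list of layer groups: everything before index `n` is frozen and everything from `n` onwards stays trainable. The group names are hypothetical, and the sketch ignores details such as fastai's special handling of batch-norm layers.

```python
def freeze_to(layer_groups, n):
    """Return {group_name: is_trainable} after freezing layer_groups[:n]."""
    frozen = set(layer_groups[:n])
    return {name: name not in frozen for name in layer_groups}

groups = ["embeddings", "layer_0", "layer_1", "pooler_and_classifier"]
print(freeze_to(groups, -1))
# {'embeddings': False, 'layer_0': False, 'layer_1': False, 'pooler_and_classifier': True}
print(freeze_to(groups, -2))
# {'embeddings': False, 'layer_0': False, 'layer_1': True, 'pooler_and_classifier': True}
```

So each experiment below moves the freezing boundary one group earlier, and unfreezing everything corresponds to `n = 0`.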

In [None]:
learner.lr_find()
learner.recorder.plot(skip_end=10,suggestion=True)

In [None]:
learner.fit_one_cycle(1,max_lr=2e-03,moms=(0.8,0.7))
# `moms` sets the momentum schedule: momentum is the first beta in Adam (or the momentum in SGD/RMSProp).
# Passing (0.8, 0.7) means going from 0.8 to 0.7 during the warm-up, then from 0.7 back to 0.8 during annealing.

In [None]:
learner.save('first_cycle')           
seed_all(seed)
learner.load('first_cycle');

**Experiment. 2** : Freezing everything up to the second-to-last layer group

In [None]:
learner.freeze_to(-2)     # Train only the last two layer groups; freeze the rest

In [None]:
lr = 1e-5

<div class="alert alert-block alert-success">
    <li> Note here that we use <code>slice</code> to create a separate learning rate for each layer group.</li>
 </div>

In [None]:
learner.fit_one_cycle(1, max_lr=slice(lr*0.95**num_groups, lr), moms=(0.8, 0.9))
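
To my understanding, fastai turns `slice(start, stop)` into one learning rate per layer group, spaced geometrically from `start` (earliest group) to `stop` (last group). The sketch below reproduces that spacing in plain Python. The helper is my own re-implementation for illustration, and `num_groups = 14` assumes the 14 BERT groups listed earlier.

```python
def even_mults(start, stop, n):
    """n values from start to stop with a constant ratio between neighbours."""
    step = (stop / start) ** (1 / (n - 1))
    return [start * step ** i for i in range(n)]

lr = 1e-5
num_groups = 14  # assumption: the learner was split into 14 layer groups
group_lrs = even_mults(lr * 0.95 ** num_groups, lr, num_groups)

for i, group_lr in enumerate(group_lrs):
    print(f"group {i:2d}: lr = {group_lr:.3e}")
```

The earliest groups, which hold the most general features, therefore get the smallest learning rates, and the classifier at the end gets the full `lr`.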

In [None]:
learner.save('second_cycle')
seed_all(seed)
learner.load('second_cycle');

**Experiment. 3** : Freezing everything up to the third-to-last layer group

In [None]:
learner.freeze_to(-3)   
learner.fit_one_cycle(1, max_lr=slice(lr*0.95**num_groups, lr), moms=(0.8, 0.9))
learner.save('third_cycle')
seed_all(seed)
learner.load('third_cycle');

**Experiment. 4** : Unfreezing all the layers

In [None]:
learner.unfreeze() # Trainable embeddings, transformer and classifier
learner.fit_one_cycle(2, max_lr=slice(lr*0.95**num_groups, lr), moms=(0.8, 0.9)) 

## Predictions

In [None]:
learner.predict("I am feeling happy today!!")

In [None]:
learner.predict("Would you be with me, I need you")

## Saving the model

In [None]:
learner.export('transformer.pkl');

In [None]:
path = '/kaggle/working'
export_learner = load_learner(path, file = 'transformer.pkl')

* Just make sure that all the custom classes are defined before loading, as they need to be present when predicting

In [None]:
export_learner.predict("Would you be with me, I need you")

## Creating prediction file and submitting it 🎉

There is one problem to be aware of: the data loader does not necessarily iterate over the test set in its original order, so the predictions may not line up with the rows of the test file. Let us see how to handle that.

In [None]:
def get_preds_as_nparray(ds_type) -> np.ndarray:
    preds = learner.get_preds(ds_type)[0].detach().cpu().numpy()
    sampler = [i for i in databunch.dl(ds_type).sampler]
    reverse_sampler = np.argsort(sampler)
    return preds[reverse_sampler, :]

test_preds = get_preds_as_nparray(DatasetType.Test)
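
The `argsort` trick above is easier to see on a tiny example. In this plain-Python sketch (names and data are made up), `sampler[k]` is the dataset index of the item that produced `preds[k]`, and sorting the positions by those indices puts the predictions back in dataset order.

```python
def restore_order(preds, sampler):
    """preds[k] belongs to dataset item sampler[k]; return preds in dataset order."""
    reverse = sorted(range(len(sampler)), key=lambda k: sampler[k])  # argsort of sampler
    return [preds[k] for k in reverse]

sampler = [2, 0, 3, 1]  # order in which the loader visited the dataset
preds = ["pred_for_2", "pred_for_0", "pred_for_3", "pred_for_1"]
print(restore_order(preds, sampler))
# ['pred_for_0', 'pred_for_1', 'pred_for_2', 'pred_for_3']
```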

In [None]:
sample_submission = pd.read_csv(DATA_ROOT / 'sampleSubmission.csv')
sample_submission['Sentiment'] = np.argmax(test_preds,axis=1)
sample_submission.to_csv("predictions.csv", index=False)

## Conclusions: 🙂

* I tried to explain things in the simplest way (the way I understood them myself) and added some colors to make it more engaging. If you have any ideas or questions that could make this notebook more engaging and informative, let me know. Thanks a lot!

## References: 📚

1. Biggest shoutout to the author of this notebook, which inspired me to write it again in my own words: [FastAI with Transformer](https://www.kaggle.com/maroberti/fastai-with-transformers-bert-roberta/notebook)
2. FastAI documentation: [FastAI](https://docs.fast.ai/text.html)