# Context
Two weeks ago I had never trained a transformer. Until 2018 I implemented a large number of state-of-the-art NLP models (all RNN or CNN based), but I then moved to image recognition and never found my way back to NLP. This year I rediscovered my old love for NLP and decided to jump into this competition to update my skills. I love this competition because I think it is quite challenging, and I don't think that training deeper and larger transformers alone will take us to the top. I feel that a clear understanding of the task and meticulous feature engineering, whether the features come from neural or non-neural models, are very important for the models to perform well. So I decided to invest the first month and a half entirely in learning to train transformers properly, so that I would have a very good feature extractor on my shelf. Here is a write-up of my journey so far with transformers in this competition, in which I moved from 0.562 RMSE to 0.482 RMSE. I have made all my notebooks public so far, and I have put a lot of effort into keeping the code clean and concise so that people getting started with NLP or transformers can find it useful.

Here is the list of notebooks that I wrote. Each one progressively adds a layer of complexity as we move down the list.

# List of notebooks
Insert a table of notebooks along with their scores here

## 1. Transformer as a feature extractor
**Link**: https://www.kaggle.com/vigneshbaskaran/commonlit-derive-more-features-from-transformer?scriptVersionId=64757154  
**Public LB**: 0.562


The objectives of this notebook are:
1. To understand the transformer architecture and use it to extract various features.
2. To derive further features on top of the extracted ones and see whether they perform better.
3. To establish a transformer baseline that I can improve upon by finetuning the transformer.

I extracted the following features from the transformer (bert-base-uncased):
1. Pooler output
2. Hidden states corresponding to each layer of the transformer (BERT base has 12 layers, hence 12 hidden states)

I then derived the following features on top of the extracted features:  
1. Mean of the last hidden state (from n<sup>th</sup> layer)  
2. Mean of the last but one hidden state (from n-1<sup>st</sup> layer)  
3. Mean of the last but two hidden state (from n-2<sup>nd</sup> layer)  
4. Hidden state corresponding to the CLS token of last hidden state (from n<sup>th</sup> layer)  
5. Hidden state corresponding to the CLS token of last but one hidden state (from n-1<sup>st</sup> layer)  
6. Hidden state corresponding to the CLS token of last but two hidden state (from n-2<sup>nd</sup> layer)  

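The six derived features above can be sketched as follows. This is a minimal illustration on random NumPy arrays, not the notebook's actual code: `hidden_states` stands in for the per-layer outputs that a Hugging Face model returns with `output_hidden_states=True` (that tuple also carries the embedding output as its first element, which negative indexing skips), and the helper name `derive_features` is hypothetical.

```python
import numpy as np

def derive_features(hidden_states):
    """hidden_states: list of arrays, one per layer, each (batch, seq_len, hidden)."""
    features = {}
    for offset, name in enumerate(["last", "last_but_one", "last_but_two"], start=1):
        layer = hidden_states[-offset]                 # n-th, (n-1)-th, (n-2)-th layer
        features[f"mean_{name}"] = layer.mean(axis=1)  # average over all token positions
        features[f"cls_{name}"] = layer[:, 0, :]       # CLS is the first token
    return features

rng = np.random.default_rng(0)
# 12 layers, a batch of 4 excerpts, 16 tokens each, hidden size 768 (bert-base sizes)
hidden_states = [rng.normal(size=(4, 16, 768)) for _ in range(12)]
features = derive_features(hidden_states)
for name, value in features.items():
    print(name, value.shape)
```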
After deriving the features above from the transformer, I built a regressor on top of each of them. The cross-validation scores are:  


| Feature                               | Type      | CV RMSE |
|---------------------------------------|-----------|---------|
| Pooler output                         | Extracted | 0.646   |
| Mean of last hidden state             | Derived   | 0.575   |
| Mean of the last but one hidden state | Derived   | 0.586   |
| Mean of the last but two hidden state | Derived   | 0.581   |
| CLS of the last hidden state          | Derived   | 0.680   |
| CLS of the last but one hidden state  | Derived   | 0.682   |
| CLS of the last but two hidden state  | Derived   | 0.673   |

**Remark**: I submitted the model built on top of the Mean of the last hidden state and the public LB score was 0.562 which is close to the cross validation score of 0.575. 

**Conclusion**: How well the features extracted or derived from a transformer perform depends on the task. For predicting readability, the mean of the last hidden state is better than the rest. The key to a better score was to compute the mean only after masking out the embeddings of the PAD tokens, instead of naively averaging over every position. This gave a significant boost to the results.

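To make the difference between the naive mean and the masked mean concrete, here is a small NumPy sketch. The helper name `masked_mean` and the toy values are made up for illustration; `attention_mask` follows the usual tokenizer convention of 1 for real tokens and 0 for PAD.

```python
import numpy as np

def masked_mean(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring positions where attention_mask == 0 (PAD)."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts

# Toy batch: 2 sequences, 4 token positions, hidden size 2.
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0], [100.0, 100.0]],  # last two are PAD
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0], [8.0, 0.0]],          # no padding
])
mask = np.array([[1, 1, 0, 0], [1, 1, 1, 1]])

naive = hidden.mean(axis=1)          # PAD embeddings leak into the average
masked = masked_mean(hidden, mask)   # only real tokens contribute
print("naive :", naive[0])
print("masked:", masked[0])
```

For a sequence without padding the two versions agree, so the masking only matters for excerpts shorter than the padded length.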
## 2. Easy Transformer Finetuner
**Trainer notebook:** https://www.kaggle.com/vigneshbaskaran/commonlit-easy-transformer-finetuner?scriptVersionId=65522393  
**Inference notebook:** https://www.kaggle.com/vigneshbaskaran/commonlit-easy-finetuner-inference  
**Public LB**: 0.526  

In the previous notebook we used a pretrained transformer as a feature extractor, and the regressor was the only trainable part of the model. The objective of this notebook is to make it quick to finetune any pretrained model on the competition dataset. I have not used any external trainer classes; I wrote all the components needed for training from scratch so that I can easily investigate what is going on under the hood.

**Algorithm**:
1. Start with a pretrained RoBERTa model from Hugging Face.
2. Extract the pooler output and build a regressor on top of it.
3. Split the training data into five folds.
4. For each fold, evaluate the model on the validation set after each epoch and save the best model.

| Model   | Feature | CV RMSE |
|---------|---------|---------|
| RoBERTa | Pooler  | 0.524   |

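The control flow of the algorithm above (five folds, validate after every epoch, keep the best checkpoint) can be sketched without any deep learning library. In this sketch a linear model trained on synthetic features stands in for the RoBERTa regressor, and the learning rate, epoch count, and data are all made up; only the loop structure is the point.

```python
import numpy as np

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 8))                        # stand-in for pooled features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

n_folds, n_epochs, lr = 5, 20, 0.05
folds = np.array_split(rng.permutation(len(X)), n_folds)
best_per_fold = []

for fold, valid_idx in enumerate(folds):
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != fold])
    w = np.zeros(X.shape[1])                         # fresh model for every fold
    best_rmse, best_w = float("inf"), None
    for epoch in range(n_epochs):
        residual = X[train_idx] @ w - y[train_idx]
        grad = 2 * X[train_idx].T @ residual / len(train_idx)
        w -= lr * grad                               # one full-batch step per "epoch"
        valid_rmse = rmse(X[valid_idx] @ w, y[valid_idx])
        if valid_rmse < best_rmse:                   # keep the best checkpoint
            best_rmse, best_w = valid_rmse, w.copy()
    best_per_fold.append(best_rmse)

print("CV RMSE:", np.mean(best_per_fold))
```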
I was quite OK with the result, and then I came across Maunish's public notebook that scored ~0.469 on the public LB. Reading it, I found that the approach he adopted is quite similar to mine except for some small changes, which nevertheless brought about that large difference. I really wanted to understand which component of his algorithm gave him such a massive boost, so I started a quest to find out. I did not succeed in identifying every reason, but I managed to figure out, let's say, 90% of the magic. Here is a write-up on it.

## 3. Why my transformer is not good enough?
In this notebook I start by recreating Maunish's model and progressively swap its components for mine, to identify which additional components contribute to his improved performance.


Maunish's finetuner notebook : https://www.kaggle.com/maunish/clrp-pytorch-roberta-finetune  
Maunish's inference notebook: https://www.kaggle.com/maunish/clrp-pytorch-roberta-inference  
My finetuner notebook: https://www.kaggle.com/vigneshbaskaran/commonlit-easy-transformer-finetuner  
My inference notebook: https://www.kaggle.com/vigneshbaskaran/commonlit-easy-finetuner-inference  



|      Component     |                    Maunish                     |               Vignesh                |
|:------------------:|:----------------------------------------------:|:------------------------------------:|
|  Pretrained model  | RoBERTa further pretrained on additional data  | Plain RoBERTa base from Hugging Face |
|      Optimizer     |            AdamW with weight decay             |      AdamW with no weight decay      |
|      Scheduler     |                Cosine annealing                |             No scheduler             |
| Model architecture |     Last hidden state with attention head      |    Pooler with no attention head     |
|        Loss        |           RMSE (square root of MSE)            |                 MSE                  |
|      Tokenizer     |                  max_len: 256                  |           max_len: default           |
|  train_valid_split |            Stratified KFold on bins            |             Simple KFold             |

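For readers unfamiliar with the "attention head" in the model architecture row, here is a NumPy sketch of one common form of attention pooling: score every token, softmax the scores over the sequence, and take the weighted sum of the token embeddings. The layer sizes, the parameter names (`W1`, `b1`, `w2`, `b2`), and the masking of PAD positions are my assumptions for illustration and may differ from what Maunish's notebook does.

```python
import numpy as np

def attention_pool(last_hidden_state, attention_mask, W1, b1, w2, b2):
    """Score each token, softmax over the sequence, return the weighted sum."""
    scores = np.tanh(last_hidden_state @ W1 + b1) @ w2 + b2    # (batch, seq)
    scores = np.where(attention_mask == 1, scores, -np.inf)    # PAD gets zero weight
    scores = scores - scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    context = (weights[:, :, None] * last_hidden_state).sum(axis=1)
    return context, weights

rng = np.random.default_rng(0)
batch, seq_len, hidden, attn = 2, 5, 8, 4        # tiny illustrative sizes
h = rng.normal(size=(batch, seq_len, hidden))
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
W1, b1 = rng.normal(size=(hidden, attn)), np.zeros(attn)
w2, b2 = rng.normal(size=attn), 0.0

context, weights = attention_pool(h, mask, W1, b1, w2, b2)
print(context.shape, weights.sum(axis=1))
```

The pooled `context` vector would then feed a linear regressor, in the same place where the pooler output sits in my model.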

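"Stratified KFold on bins" means binning the continuous target and then spreading each bin evenly across the folds. The sketch below does this by hand in NumPy; scikit-learn's `StratifiedKFold` applied to the bin labels does the same job. The number of bins (Sturges' rule), the synthetic target, and the helper name `stratified_folds` are assumptions for illustration.

```python
import numpy as np

def stratified_folds(target, n_folds=5, seed=42):
    """Return a fold id per sample so that every target bin is spread across folds."""
    rng = np.random.default_rng(seed)
    n_bins = int(np.floor(1 + np.log2(len(target))))     # Sturges' rule (assumption)
    edges = np.linspace(target.min(), target.max(), n_bins + 1)
    bins = np.digitize(target, edges[1:-1])               # bin ids 0 .. n_bins - 1
    fold_ids = np.empty(len(target), dtype=int)
    offset = 0
    for b in np.unique(bins):
        members = rng.permutation(np.flatnonzero(bins == b))
        # Deal the members of this bin round-robin, continuing where the last bin stopped.
        fold_ids[members] = (np.arange(len(members)) + offset) % n_folds
        offset += len(members)
    return fold_ids, bins

rng = np.random.default_rng(0)
target = rng.normal(loc=-1.0, scale=1.0, size=1000)      # synthetic regression target
fold_ids, bins = stratified_folds(target)
for fold in range(5):
    print(fold, (fold_ids == fold).sum(), round(float(target[fold_ids == fold].mean()), 3))
```

With a plain KFold the tails of the target distribution can end up unevenly represented in a validation fold; stratifying on bins keeps each fold's target distribution close to the overall one.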
## Design of experiments
Notebook with model trained on 1% of the training data: https://www.kaggle.com/vigneshbaskaran/commonlit-why-my-transformer-is-not-good-enough?scriptVersionId=65730159  
Notebook with model trained on 25% training data: https://www.kaggle.com/vigneshbaskaran/commonlit-why-my-transformer-is-not-good-enough?scriptVersionId=65757167  
Notebook with model trained on 100% training data: https://www.kaggle.com/vigneshbaskaran/commonlit-making-my-transformer-good-enough  
Inference notebook: https://www.kaggle.com/vigneshbaskaran/commonlit-inference-is-my-transformer-good  

I designed seven experiments which are described as follows:

|              |                                 Description                                 |        Pretrained model       |            Regressor           |   Tokenizer   |          Scheduler          | CV score on 1% training data | CV score on 25% training data | CV score on 100% training data | LB on 25% training data | LB on 100% training data |
|:------------:|:---------------------------------------------------------------------------:|:-----------------------------:|:------------------------------:|:-------------:|:---------------------------:|:----------------------------:|:-----------------------------:|:------------------------------:|:-----------------------:|:------------------------:|
| Experiment 1 |                         Maunish's original algorithm                        | Pretrained on additional data | Attention on Last hidden state |  max len: 256 | cosine schedule with warmup |             0.856            |             0.546             |              0.498             |          0.534          |           0.484          |
| Experiment 2 |          Maunish's original algorithm with **my pretrained model**          |  Base model from Huggingface  | Attention on Last hidden state |  max len: 256 | cosine schedule with warmup |             0.860            |             0.557             |                                |                         |                          |
| Experiment 3 |    Maunish's original algorithm with **my feature extractor and my loss**   | Pretrained on additional data |   Pooler output with MSE loss  |  max len: 256 | cosine schedule with warmup |             0.891            |             0.556             |                                |                         |                          |
| Experiment 4 |              Maunish's original algorithm with my **tokenizer**             | Pretrained on additional data | Attention on Last hidden state | max len: None | cosine schedule with warmup |             0.868            |             0.547             |                                |                         |                          |
| Experiment 5 |              Maunish's original algorithm with my **scheduler**             | Pretrained on additional data | Attention on Last hidden state |  max len: 256 |             None            |             0.861            |             0.552             |                                |                         |                          |
| Experiment 6 | Maunish's original algorithm with **my feature extractor but his loss** | Pretrained on additional data |  Pooler output with RMSE loss  |  max len: 256 | cosine schedule with warmup |             0.923            |             0.563             |                                |                         |                          |
| Experiment 7 |                            My original algorithm                            |  Base model from Huggingface  |   Pooler output with MSE loss  | max len: None |             None            |             0.913            |             0.565             |              0.496             |          0.565          |           0.488          |

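The "cosine schedule with warmup" in the scheduler column scales the learning rate linearly from zero during warmup and then lets it decay along half a cosine wave. The sketch below writes that multiplier out in plain Python; as far as I know it has the same shape as Hugging Face's `get_cosine_schedule_with_warmup` with its default settings. The base learning rate and step counts are illustrative, not the values used in the experiments.

```python
import math

def cosine_with_warmup(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)                       # linear warmup, 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))  # cosine decay, 1 to 0

base_lr, warmup_steps, total_steps = 2e-5, 50, 500   # illustrative values
for step in (0, 25, 50, 275, 500):
    print(step, base_lr * cosine_with_warmup(step, warmup_steps, total_steps))
```

With no scheduler, as in my original setup, the multiplier is simply 1.0 at every step.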
## Lessons from the experiments
1. This notebook was designed to identify why Maunish's similar model scores 0.06 RMSE better than mine. On 1% of the training data the CV gap is of a similar size (0.856 vs 0.913). But as the amount of training data increases, his model and mine converge to almost the same validation scores (0.498 vs 0.496 on 100%) and finally to similar LB scores (0.484 vs 0.488).
2. Why didn't they converge to the same level earlier? What is the difference between my earlier notebook, where I got 0.53, and this one, where I get 0.488 with the same algorithm? I did not change anything except that, like Maunish, I now validate the model every 10 iterations instead of after every epoch, and with that I get the same performance as he does.

**Moral of the story**: Validating more often and saving the best model simply gives us a better chance of catching a good model. It feels a bit like shooting in the dark when I look at a sample of my training logs:  
Fold: 0, epoch: 2, Iteration num: 10, Validation RMSE: 0.6139673601014568  
Fold: 0, epoch: 2, Iteration num: 20, Validation RMSE: 0.6660280633377417  
Fold: 0, epoch: 2, Iteration num: 30, Validation RMSE: 0.7603617468118679  
Fold: 0, epoch: 2, Iteration num: 40, Validation RMSE: 0.5133815802608167  
Fold: 0, epoch: 2, Iteration num: 50, Validation RMSE: 0.6345178061838926  
Fold: 0, epoch: 2, Iteration num: 60, Validation RMSE: 0.6030063566418447  
Fold: 0, epoch: 2, Iteration num: 70, Validation RMSE: 0.5133519362217996  
Fold: 0, epoch: 2, Iteration num: 80, Validation RMSE: 0.8300067331494465  

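The effect is easy to reproduce on the log above. The sketch below picks the best checkpoint when validating every 10 iterations and compares it with validating only once at iteration 80. Treating iteration 80 as the end of the epoch is a hypothetical for illustration, and the helper name `best_checkpoint` is made up.

```python
# Validation RMSE per iteration for fold 0, epoch 2, copied from the log above.
log = [
    (10, 0.6139673601014568),
    (20, 0.6660280633377417),
    (30, 0.7603617468118679),
    (40, 0.5133815802608167),
    (50, 0.6345178061838926),
    (60, 0.6030063566418447),
    (70, 0.5133519362217996),
    (80, 0.8300067331494465),
]

def best_checkpoint(history, every):
    """Best (iteration, rmse) among validations run every `every` iterations."""
    checked = [(it, score) for it, score in history if it % every == 0]
    return min(checked, key=lambda item: item[1])

frequent = best_checkpoint(log, every=10)   # validate every 10 iterations
sparse = best_checkpoint(log, every=80)     # validate once, at iteration 80
print("every 10 iterations:", frequent)
print("once at iteration 80:", sparse)
```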
Still, this leaves scope for improvement and more work to do. So this is the story of how I navigated from LB 0.56 to 0.48 with transformers. Thank you for reading!

| Approach                                   | LB RMSE |
|--------------------------------------------|------|
| Transformer as feature extractor           | 0.56 |
| Finetuned transformer                      | 0.53 |
| Frequently evaluated finetuned transformer | 0.48 |