## Further Pre-training

### This part is copied from: https://www.kaggle.com/rhtsingh/commonlit-readability-prize-roberta-torch-itpt

### Thanks for his work!

In addition to the training data of a target task, we can **further pre-train** a transformer on data from the same domain.

![image.png](https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-030-32381-3_16/MediaObjects/489562_1_En_16_Fig1_HTML.png)

Transformer models are pre-trained on a general-domain corpus. For a text classification or regression task in a specific domain, such as readability assessment, the data distribution may differ from that of the pre-training corpus (e.g. RoBERTa was trained on BookCorpus, Wikipedia, CC-News, OpenWebText, and Stories). The idea is therefore to further pre-train the transformer on the domain-specific data. Reference 1 does this for BERT with the masked language model and next sentence prediction tasks; RoBERTa-style models drop next sentence prediction, and this notebook uses the masked language model objective only. Three further pre-training approaches are considered:

1) `Within-task pre-training (ITPT)`, in which the transformer is further pre-trained on the training data of the target task. `This Kernel.`

2) `In-domain pre-training (IDPT)`, in which the pre-training data comes from the same domain as the target task. For example, several different sentiment classification tasks may share a similar data distribution, so we can further pre-train the transformer on the combined training data from these tasks.

3) `Cross-domain pre-training (CDPT)`, in which the pre-training data comes from both the same domain as the target task and other domains.

#### Reference1: [How to Fine-Tune BERT for Text Classification?](https://arxiv.org/pdf/1905.05583.pdf)
#### Reference2: [Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/abs/2004.10964)

> Note: This Kernel implements ITPT, i.e. within-task pre-training. First we further pre-train the model (here `allenai/longformer-base-4096`; the RoBERTa checkpoints are left as commented-out alternatives) and then reuse it for fine-tuning with different strategies.
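
To make the masked language model objective concrete, here is a minimal pure-Python sketch of the corruption step. It assumes the commonly used BERT-style rule (select about 15% of tokens; of those, 80% become the mask token, 10% become a random token, 10% stay unchanged), which is also the documented default behaviour of `DataCollatorForLanguageModeling`. The function name `mask_tokens`, the whitespace tokenization, and the toy vocabulary are illustrative assumptions, not part of the notebook.

```python
import random


def mask_tokens(tokens, mask_token="<mask>", vocab=None,
                mlm_probability=0.15, seed=0):
    """Return (corrupted, labels) for one sequence.

    labels[i] holds the original token where a prediction is required
    and None everywhere else (the real collator uses -100 for "ignore").
    """
    rng = random.Random(seed)
    vocab = vocab or sorted(set(tokens))  # toy vocabulary (assumption)
    corrupted, labels = [], []
    for tok in tokens:
        # Leave most tokens untouched; no loss is computed on them.
        if rng.random() >= mlm_probability:
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(mask_token)         # 80%: replace with mask
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random token
        else:
            corrupted.append(tok)                # 10%: keep unchanged
    return corrupted, labels


tokens = "the essay argues that students should join clubs".split()
corrupted, labels = mask_tokens(tokens, mlm_probability=0.5, seed=1)
print(corrupted)
print(labels)
```

The high `mlm_probability` in the usage lines is only there so that a short sentence shows some masked positions; the notebook itself uses 0.15.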

In [None]:
import os
import warnings

import numpy as np
import pandas as pd
from tqdm import tqdm
from transformers import (AutoModelForMaskedLM,
                          AutoTokenizer, LineByLineTextDataset,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

warnings.filterwarnings('ignore')
os.environ["WANDB_DISABLED"] = "true"

In [None]:
#model_name = "roberta-large"
#model_name = "roberta-base"
model_name = "allenai/longformer-base-4096"
#model_name = "allenai/longformer-large-4096"
model = AutoModelForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Load Data

In [None]:
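train_dir = '../input/feedback-prize-2021/train'
train_names, train_texts = [], []
for fname in tqdm(os.listdir(train_dir)):
    train_names.append(fname.replace('.txt', ''))
    # Use a separate name for the file handle so it does not shadow the file name
    with open(os.path.join(train_dir, fname), 'r', encoding='utf-8') as fh:
        # Flatten each essay onto a single line: LineByLineTextDataset
        # treats every line of the corpus file as one training example
        #text = fh.read().replace('\n', '').replace('\xa0', '')
        text = fh.read().replace('\n', ' ')
    train_texts.append(text)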

In [None]:
texts = '\n'.join(train_texts)

In [None]:
# Write with an explicit encoding so the file matches how the essays were read
with open('text.txt', 'w', encoding='utf-8') as f:
    f.write(texts)

In [None]:
tokenizer.save_pretrained("./model_pretrained") 

## Further Pretraining
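
The cell in this section points both `train_dataset` and `valid_dataset` at the same `text.txt`, so the reported `eval_loss` is computed on text the model has already trained on. If a held-out estimate is wanted, one option is to split the documents before writing them out. The following is a minimal sketch under that assumption; `split_lines`, the 10% validation fraction, and the toy documents are hypothetical and not part of the original notebook.

```python
import random


def split_lines(lines, valid_fraction=0.1, seed=2021):
    """Shuffle documents and split them into disjoint train/valid lists."""
    lines = [ln for ln in lines if ln.strip()]  # drop empty documents
    rng = random.Random(seed)
    shuffled = lines[:]                         # do not mutate the input
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


# Toy stand-in for train_texts (one flattened essay per entry)
docs = [f"essay number {i}" for i in range(20)]
train_docs, valid_docs = split_lines(docs)
print(len(train_docs), len(valid_docs))
```

Each list could then be written with the same `'\n'.join(...)` pattern used for `text.txt` and passed as the `file_path` of the corresponding dataset.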

In [None]:
train_dataset = LineByLineTextDataset( 
    tokenizer=tokenizer,
    file_path="text.txt",  # mention train text file here
    block_size=1024)

valid_dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    # mention valid text file here (currently the training file,
    # so eval_loss is not a held-out metric)
    file_path="text.txt",
    block_size=1024)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./model_pretrained_chk",  # select model path for checkpoint
    overwrite_output_dir=True,
    num_train_epochs=4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=6,  # effective batch size per device: 4 * 6 = 24
    #evaluation_strategy='steps',
    evaluation_strategy='epoch',
    save_total_limit=1,
    eval_steps=5000,  # only used when evaluation_strategy='steps'
    metric_for_best_model='eval_loss',
    greater_is_better=False,
    load_best_model_at_end=False,
    prediction_loss_only=True,
    learning_rate=5e-5,
    seed=2021,
    report_to="none")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset)

trainer.train()
trainer.save_model('./model_pretrained')  # same directory the tokenizer was saved to
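
As a rough sanity check on the settings above, the effective batch size and the number of optimizer updates can be estimated by hand. This is an approximation only: the `Trainer`'s exact step count can differ slightly depending on the library version and on how the last partial accumulation window is handled. The helper name `approx_optimizer_steps` and the example count of 15000 lines are hypothetical.

```python
import math


def approx_optimizer_steps(num_examples, per_device_batch_size=4,
                           grad_accum_steps=6, num_devices=1, epochs=4):
    """Estimate (effective_batch, steps_per_epoch, total_steps)."""
    # Gradients from several small batches are summed before one update,
    # so the optimizer effectively sees this many examples per step.
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


# Hypothetical corpus size; replace with len(train_dataset) in the notebook
effective_batch, steps_per_epoch, total_steps = approx_optimizer_steps(15000)
print(effective_batch, steps_per_epoch, total_steps)
```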