# **Modern NLP: Course project - Milestone 2**

#### **Team**: Alexander Sternfeld, Silvia Romanato and Antoine Bonnet (`syntax-sorcerers`)


> **Remember**: In Milestone 1, we picked a robust prompting strategy to get accurate answers from ChatGPT, which we used to generate answers for questions from EPFL course content. The generated answers were then rated by human annotators. The data collected (available at `project_reference/interactions_v1.json` and `project_reference/solutions_v1.json`) will be used in Milestone 3 for the supervised fine-tuning of a language model to answer questions from EPFL course content.
> 
> However, high-quality assistants such as ChatGPT are trained with more than supervised learning alone. They also use a technique called Reinforcement Learning from Human Feedback (RLHF). RLHF requires the training procedure to have access to a reward model that can evaluate multiple responses and rank them according to their suitability.
>
> **For newcomers**: Make sure that you have run the `scripts/data_preparation.ipynb` notebook before running this notebook.

## **Training a reward model**

This notebook trains a **reward model** to rate the quality of answers given a question. The reward model ranks multiple answers to the same question and will later be used to train a **policy model** with RLHF.

We will use the transformer-based [RoBERTa](https://arxiv.org/abs/1907.11692) model, trained on the [StackOverflow](https://www.kaggle.com/datasets/stackoverflow/stackoverflow) dataset, to predict the quality of multiple answers to the same question and thereby rank them.




In [2]:
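# Standard and third-party imports used throughout this notebook
import os
import pandas as pd
from transformers import (AutoModel, AutoModelForSequenceClassification,
                          AutoTokenizer, TrainingArguments)

# Project modules
from reward_dataset import *
from classifier_model import *
#from regressive_model import *
from evaluate import *
from model import *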

STACK_PATH = os.path.join(BASE_DIR, 'data', 'reward_model', 'm2_reward_dataset_syntax-sorcerers_StackOverflow.json')
EPFL_PATH = os.path.join(BASE_DIR, 'data', 'reward_model', 'm2_reward_dataset_syntax-sorcerers_EPFL.json')

BASE_MODEL_NAME = "roberta-base"

%reload_ext autoreload
%autoreload 2

In [None]:
# If you are running this notebook on Google Colab
from google.colab import drive

drive.mount('/content/drive/MyDrive/Modern_NLP/Project/')

### 1. **Loading training data**

We first import the StackOverflow and EPFL data. These data were pre-processed and cleaned in the `scripts/data_preparation.ipynb` notebook.


In [18]:
# Load the StackOverflow data
stack_data = pd.read_json(STACK_PATH)
stack_data.head()


| | label | chat | entry_id |
|---|---|---|---|
| 0 | positive | Human: Can't use The SGD optimizer \<p>I am usi... | 0 |
| 1 | negative | Human: Can't use The SGD optimizer \<p>I am usi... | 1 |
| 2 | positive | Human: Preprocessing , EDA , and Feature Engin... | 2 |
| 3 | negative | Human: Preprocessing , EDA , and Feature Engin... | 3 |
| 4 | negative | Human: Examples of reversible computations \<p>... | 4 |


In [19]:
# Load the EPFL interaction data
epfl_data = pd.read_json(EPFL_PATH, orient='records')
epfl_data.head()

| | chat | label | entry_id |
|---|---|---|---|
| 0 | Human: Une conquille sphérique de rayon $R_1$ ... | negative | 0 |
| 1 | Human: Une conquille sphérique de rayon $R_1$ ... | negative | 1 |
| 2 | Human: Une conquille sphérique de rayon $R_1$ ... | positive | 2 |
| 3 | Human: Assume that we have a convolutional neu... | positive | 3 |
| 4 | Human: Q: Which of the following functions rea... | negative | 4 |


### 2. **Pre-trained base model**

[RoBERTa](https://arxiv.org/abs/1907.11692) (a **R**obustly **o**ptimized **BERT** pretraining **a**pproach) is an upgraded version of the original [BERT](https://arxiv.org/abs/1810.04805) (Bidirectional Encoder Representations from Transformers) model released by Google in 2018.

The model architecture consists of a **pre-trained tokenizer** (mapping text to sequences of token IDs) and a pre-trained **Transformer-based model** (mapping token IDs to contextual vector representations).

We will use the **RoBERTa-base** model (`roberta-base`), which is composed of a stack of 12 identical Transformer blocks, each with 12 attention heads and a hidden size of 768. The model is pre-trained with a masked language modeling (MLM) objective, which means that it learns to predict randomly masked tokens in a sequence.
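To make the MLM objective concrete, here is a simplified sketch of how a masked training example can be built. It is an illustration only: it masks whitespace-separated words rather than subword tokens, and it always substitutes the mask token, whereas BERT-style pre-training selects about 15% of tokens and sometimes keeps or randomizes them. The helper name `mask_tokens` and the `IGNORE` label are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"   # RoBERTa's mask token string
IGNORE = None           # illustrative label for positions that are not predicted


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM example.

    Each token is independently replaced by MASK_TOKEN with probability
    mask_prob. labels holds the original token at masked positions and
    IGNORE elsewhere, so only masked positions would contribute to the loss.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(IGNORE)
    return masked, labels


tokens = "the reward model scores each answer to a question".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(labels)
```

During pre-training, the model sees `masked` as input and is penalized only on the positions where `labels` is not `IGNORE`.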

<p align="center">
  <img src="https://www.researchgate.net/publication/352642553/figure/fig2/AS:1037416861282304@1624350862022/The-RoBERTa-model-architecture.ppm"/>
</p>

We use the [HuggingFace](https://huggingface.co/) library to load the pre-trained model and tokenizer.


In [21]:
# Load the pre-trained RoBERTa encoder and tokenizer (a warning about unused LM-head weights is expected)
RoBERTa_base = AutoModel.from_pretrained(BASE_MODEL_NAME)
RoBERTa_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### 3. **Data pre-processing**

Following recommendations from the [InstructGPT paper](https://arxiv.org/pdf/2203.02155.pdf), we train the model with one batch per question. Each batch will therefore consist of $K$ question-answer pairs, where $K$ is the number of different answers for that question. 

We define a custom PyTorch `Dataset` class to that effect. We tokenize the data using the pre-trained RoBERTa tokenizer. Because many question-answer pairs exceed 512 tokens (the maximum sequence length that RoBERTa can process), we **truncate** each tokenized pair to its first 512 tokens. We also **pad** shorter sequences to 512 tokens.
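The truncation and padding step can be sketched in plain Python as follows. This is a minimal illustration, not the code in `reward_dataset`: the helper `pad_or_truncate` is hypothetical, and it ignores special tokens, which a real tokenizer keeps in place when truncating. The padding ID of 1 is RoBERTa's `<pad>` token ID.

```python
MAX_LEN = 512   # maximum sequence length used in this notebook
PAD_ID = 1      # RoBERTa's padding token ID


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both of length max_len."""
    ids = list(token_ids[:max_len])   # keep only the first max_len tokens
    mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad           # right-pad short sequences
    mask += [0] * n_pad               # 0 marks padding, ignored by attention
    return ids, mask


short_ids, short_mask = pad_or_truncate([0, 5, 6, 2], max_len=6)
print(short_ids)    # [0, 5, 6, 2, 1, 1]
print(short_mask)   # [1, 1, 1, 1, 0, 0]
```

With a HuggingFace tokenizer, the same effect is usually obtained in one call by passing `truncation=True`, `padding="max_length"` and `max_length=512`.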

We split the dataset into training, validation and test sets using a 60/20/20 split with shuffling. Note that we keep all answers to the same question in the same set to avoid data leakage.
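A leakage-free split of this kind can be sketched by shuffling question identifiers rather than individual rows. The snippet below is an assumption-laden illustration, not the implementation behind `create_dataset`: the `question_id` field and the `split_by_question` helper are hypothetical.

```python
import random
from collections import defaultdict


def split_by_question(rows, fractions=(0.6, 0.2, 0.2), seed=1):
    """Split rows into train/val/test so that each question stays in one set.

    rows: list of dicts with a (hypothetical) "question_id" key.
    """
    # Group all answers belonging to the same question
    groups = defaultdict(list)
    for row in rows:
        groups[row["question_id"]].append(row)

    # Shuffle questions, not rows, so answers are never separated
    question_ids = sorted(groups)
    random.Random(seed).shuffle(question_ids)

    n = len(question_ids)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    parts = {
        "train": question_ids[:n_train],
        "val": question_ids[n_train:n_train + n_val],
        "test": question_ids[n_train + n_val:],
    }
    return {name: [r for q in qids for r in groups[q]]
            for name, qids in parts.items()}


rows = [{"question_id": q, "answer": a} for q in range(10) for a in range(3)]
splits = split_by_question(rows)
print({name: len(part) for name, part in splits.items()})
```

Because the split is made at the question level, the row-level proportions only approximate 60/20/20 when questions have different numbers of answers.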

In [5]:
dataset = create_dataset(stack_data, RoBERTa_tokenizer)

### 4. **Classification model**

We treat the task as a **binary classification** problem and add a **classification head** on top of the pre-trained model to predict whether any given question-answer pair has a correct answer or not. As described above, we pass batches of answers to the same question to the model, so that it learns to rank the answers according to their quality.

We train the model with a **binary cross-entropy loss** function, so that it learns to predict the probability that a given answer is correct.
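One way to turn the two-class output into a ranking is to score each answer by its softmax probability of the positive class and sort by that score. The sketch below assumes that index 1 of the logits corresponds to the "positive" label, which is an assumption about the label mapping; the helper names are hypothetical.

```python
import math


def positive_probability(logits):
    """Softmax probability of the positive class from [neg_logit, pos_logit]."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def rank_answers(answer_logits):
    """Return (order, scores): answer indices from best to worst, and their scores."""
    scores = [positive_probability(l) for l in answer_logits]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical logits for three answers to the same question
logits = [[1.2, -0.3], [-0.5, 2.0], [0.1, 0.2]]
order, scores = rank_answers(logits)
print(order)   # [1, 2, 0]
```

The positive-class probability can then serve directly as a scalar reward for each answer.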

However, we need to account for class imbalance (we always have a single correct answer, but between 1 and 30 incorrect answers), which might lead the model to predict that all answers are incorrect. To counter this, we use **focal loss**, a modified version of the cross-entropy loss that down-weights easy, well-classified examples (here, mostly the abundant incorrect answers) so that the rare correct answers contribute more to the gradient. The effect is similar to class weighting, where the loss of the minority class is scaled up by the ratio of incorrect to correct answers in the batch, although the two are not strictly equivalent.
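For reference, the standard binary focal loss can be written as $-\alpha_t (1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the predicted probability of the true class. The sketch below implements this textbook formulation for a single example; it may differ from what `WeightedTrainer` computes, and the values of `gamma` and `alpha` are illustrative assumptions.

```python
import math


def focal_loss(p_pos, label, gamma=2.0, alpha=0.5):
    """Binary focal loss for one example.

    p_pos: predicted probability of the positive class.
    label: 1 for a correct answer, 0 for an incorrect one.
    gamma: focusing parameter; gamma=0 recovers class-weighted cross-entropy.
    alpha: weight of the positive class; (1 - alpha) weights the negative class.
    """
    p_t = p_pos if label == 1 else 1.0 - p_pos   # probability assigned to the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (confidently rejected) contributes far less than a hard positive
easy_negative = focal_loss(p_pos=0.05, label=0)
hard_positive = focal_loss(p_pos=0.30, label=1)
print(easy_negative, hard_positive)
```

Setting `gamma=0` and choosing `alpha` from the class ratio reduces this to plain class-weighted cross-entropy, which makes the relationship between the two approaches explicit.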


In [22]:
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL_NAME, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

print(model)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0): RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerN

In [None]:
# Instantiate the model
model = AutoModelForSequenceClassification.from_pretrained(BASE_MODEL_NAME, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

# Replace the last layer with layers of width [768, 256, 32, 2]
#model.classifier = nn.Sequential(nn.Linear(768, 256),nn.ReLU(),nn.Linear(256, 32),nn.ReLU(),nn.Linear(32, 2))


# Load the dataset
df = stack_data
dataset = create_dataset(df, tokenizer)

# Set up the training arguments
training_args = TrainingArguments(  
    output_dir=RUNS_DIR,            
    evaluation_strategy="epoch",        # Run evaluation every epoch
    save_strategy="epoch",              # Save checkpoint every epoch
    num_train_epochs=1,                 # Number of training epochs
    per_device_train_batch_size=16,     # Number of QA per batch (default=8)
    per_device_eval_batch_size=16,      # Number of QA per batch (default=8)         
    load_best_model_at_end=True,        # Load the best model when finished training 
    metric_for_best_model="accuracy",   # Use accuracy to evaluate the best model
    logging_strategy="steps",           # Log training metrics every (logging_steps) steps
    logging_steps=100,                  
    logging_dir=LOGS_DIR,               
    disable_tqdm=False, 
    report_to='all', 
    seed=1,
    learning_rate=1e-5,                 # Learning rate
    #gradient_accumulation_steps=10,    # Accumulate gradients over 10 steps before each optimizer update
    #weight_decay=0.01,                 # L2 regularization strength 
    #eval_accumulation_steps=10,        # Move accumulated eval outputs to the CPU every 10 steps
)

# Create the Trainer
trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['val'],
    compute_metrics=compute_metrics
)

# Train the model 
# Note: best model is automatically loaded at end of training
trainer.train()

# Evaluate the model on test set and save results
results = trainer.evaluate(dataset['test'])
print(results)

### 5. **Evaluating the reward model**

We now use the trained reward model to score the answers to the questions in the StackOverflow test set. We then compare the scores to the ground-truth labels to evaluate the performance of the model.
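Besides per-answer accuracy, one ranking-oriented way to compare scores with labels is to check, for each question, whether the highest-scored answer is the one labeled `positive`. The sketch below is an illustrative metric under that assumption; it is not the `compute_metrics` function used in this notebook, and the `question_ids` input is hypothetical.

```python
def top1_accuracy(question_ids, scores, labels):
    """Fraction of questions whose highest-scored answer is labeled 'positive'."""
    best = {}  # question_id -> (best score so far, label of that answer)
    for q, s, y in zip(question_ids, scores, labels):
        if q not in best or s > best[q][0]:
            best[q] = (s, y)
    return sum(y == "positive" for _, y in best.values()) / len(best)


# Two hypothetical questions: the first is ranked correctly, the second is not
question_ids = [0, 0, 1, 1, 1]
scores = [0.9, 0.2, 0.4, 0.7, 0.1]
labels = ["positive", "negative", "positive", "negative", "negative"]
print(top1_accuracy(question_ids, scores, labels))   # 0.5
```

This metric reflects how the reward model is meant to be used downstream, where only the relative order of answers to the same question matters.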

In [None]:
# Load model from checkpoint
checkpoint_path = os.path.join(RUNS_DIR, 'checkpoint')
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME)

# Load the datasets
datasets = [load_dataset(dataset_name, tokenizer) for dataset_name in DATASET_NAMES]

# Evaluate the model on test sets and save results
TRAINING_ARGS.run_name = 'evaluation'
TRAINING_ARGS.output_dir = EVAL_DIR
TRAINING_ARGS.report_to = None

for i in range(len(datasets)):
    print(f'\nEvaluating on dataset [{i+1}/{len(datasets)}] ({DATASET_NAMES[i]}).')
    trainer = WeightedTrainer(
        model=model,
        args=TRAINING_ARGS,
        train_dataset=datasets[i]['train'],
        eval_dataset=datasets[i]['val'],
        compute_metrics=compute_metrics,       
    )
    eval_res = trainer.evaluate(datasets[i]['test'])
    print('Evaluation result:\n', eval_res)

### **Preparing submission**

We start by combining our datasets into a single JSON file.

In [19]:
# Combine both datasets into one
data_path = os.path.join(BASE_DIR, 'data', 'reward_model', 'm2_reward_dataset_syntax-sorcerers.json')
stack_path = os.path.join(BASE_DIR, 'data', 'reward_model', 'm2_reward_dataset_syntax-sorcerers_StackOverflow.json')
epfl_path = os.path.join(BASE_DIR, 'data', 'reward_model', 'm2_reward_dataset_syntax-sorcerers_EPFL.json')
StackOverflow_df = pd.read_json(stack_path, orient='records')
EPFL_df = pd.read_json(epfl_path, orient='records')
df = pd.concat([StackOverflow_df, EPFL_df], ignore_index=True)
df.to_json(data_path, orient='records', indent=4)

The trained model is saved to the `checkpoint` folder. We save the config files to the `reward_model` folder to be used by `evaluate.py`. 

In [4]:
# Paths to the saved checkpoint, the combined dataset and the submission folder
checkpoint_dir = os.path.join(BASE_DIR, 'checkpoint')
data_path = os.path.join(BASE_DIR, 'data', 'reward_model', 'm2_reward_dataset_syntax-sorcerers.json')
submission_folder = os.path.join(BASE_DIR, 'reward_model')
hf_pretrained_model_name = "roberta-base"

In [5]:
# Save model and model config to model_path directory
model_config = ClassifierRewardModelConfig.from_pretrained(hf_pretrained_model_name)
tokenizer = AutoTokenizer.from_pretrained(hf_pretrained_model_name)
model_config.problem_type = 'single_label_classification'
model = ClassifierRewardModel(model_config)

tokenizer.save_pretrained(submission_folder)
model.save_pretrained(submission_folder)
model_config.save_pretrained(submission_folder)


You are using a model of type roberta to instantiate a model of type ClassifierRewardModel. This is not supported for all configurations of models and can yield errors.
