# Reward Model Training

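This tutorial trains a reward model on pairs of "chosen" and "rejected" responses to the same instruction, so that the model learns to score the chosen response higher than the rejected one.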

Training Procedure
1. Read a dataset with three columns: `instruction`, `chosen_response`, and `rejected_response`.
2. Use HF's `AutoModelForSequenceClassification` to wrap a RoBERTa model (`distilroberta-base`) with a classification head that outputs a single scalar score (`num_labels=1`), which serves as the reward.
3. Prepare the dataset for the Reward Trainer so that each data point contains `input_ids_chosen`, `attention_mask_chosen`, `input_ids_rejected`, and `attention_mask_rejected`.
4. Pass model, dataset, and `RewardConfig` to TRL's `RewardTrainer` and call `train()`.
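
The procedure above relies on a pairwise objective: the model is rewarded for scoring the chosen response above the rejected one. The snippet below is a minimal sketch of that kind of loss, `-log(sigmoid(r_chosen - r_rejected))`, written with the standard library only. The function name `pairwise_reward_loss` and the example reward values are illustrative assumptions, not part of TRL's API.

```python
import math

def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)) for one pair."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form stays stable for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward values, chosen only to illustrate the three regimes.
loss_equal = pairwise_reward_loss(0.0, 0.0)     # model cannot tell the responses apart
loss_correct = pairwise_reward_loss(2.0, -1.0)  # chosen response scored higher
loss_wrong = pairwise_reward_loss(-1.0, 2.0)    # rejected response scored higher

print(f"equal: {loss_equal:.4f}  correct: {loss_correct:.4f}  wrong: {loss_wrong:.4f}")
```

When both responses get the same reward the loss is `ln 2`, which is a useful reference point when reading the training logs of an untrained model.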

In [1]:
import random
import pandas as pd
from operator import itemgetter
import torch
import warnings
warnings.filterwarnings('ignore')
from datasets import Dataset, load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments
from trl import RewardTrainer, RewardConfig

# Read Dataset

Dataset format: (prompt, chosen response, rejected response)
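
To make the format concrete, here is a small sketch with two made-up records. It shows the row-oriented layout (a list of dicts, as `df.to_dict(orient='records')` produces) and the column-oriented layout (a dict of lists, as `df.to_dict(orient='list')` produces and `Dataset.from_dict` accepts). The records and the helper `rows_to_columns` are hypothetical and are not taken from `comparisons.csv`.

```python
# Hypothetical toy records in the (instruction, chosen_response, rejected_response) format.
rows = [
    {"instruction": "What is 2 + 2?", "chosen_response": "4", "rejected_response": "5"},
    {"instruction": "Name the capital of France.", "chosen_response": "Paris", "rejected_response": "Lyon"},
]

def rows_to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key in columns:
            columns[key].append(row[key])
    return columns

cols = rows_to_columns(rows)
print(cols["instruction"])
print(cols["chosen_response"])
```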

In [None]:
# Read dataset
df = pd.read_csv('comparisons.csv', dtype={'question':str, 'chosen':str, 'rejected':str})
print("len(df)", len(df))

# Rename the CSV columns to the names used in the rest of the notebook
df = df.rename(columns={
    'question': 'instruction',
    'chosen': 'chosen_response',
    'rejected': 'rejected_response',
})

# convert df to dict
rows = df.to_dict(orient='records')
cols = df.to_dict(orient='list')

print("len rows", len(rows))
print("rows[:3]", rows[:3])

print("cols type", type(cols))
print("cols keys", cols.keys())

print("cols['instruction'][:3]", cols['instruction'][:3])

prepared_dataset = Dataset.from_dict(cols)
print(prepared_dataset.to_pandas())

len(df) 18007
len rows 6987
rows[:3] [{'instruction': 'Voiced by Harry Shearer, what Simpsons character was modeled after Ted Koppel?', 'chosen_response': "Apu Nahasapeemapetilon is a recurring character in the American animated television series The Simpsons. He is an Indian immigrant proprietor who runs the Kwik-E-Mart, a popular convenience store in Springfield. [1] He was based on Peter Seller's character in the film The Party. [2]", 'rejected_response': 'The Simpsons character that was possibly based on Ted Koppel is Kent Brockman.  He is a local news anchor in Springfield and is modeled after Ted Koppel. [1]'}, {'instruction': 'Alliumphobia is the irrational fear of which plant?', 'chosen_response': 'Alliumphobia is the irrational fear of garlic. [1]', 'rejected_response': 'Alliumphobia is the fear of garlic [1, 3]. People who suffer from this condition can expect to experience a very high amount of anxiety from merely thinking of garlic, let alone actually seeing it in real life

In [3]:
# Select the base model to fine-tune as a reward model.
model_name = "distilroberta-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
def formatting_func(examples):
    """Tokenize one row into the paired chosen/rejected inputs that RewardTrainer expects."""
    # Pad or truncate every sequence to 512 tokens so all examples have the same length.
    kwargs = {"padding": "max_length", "truncation": True, "max_length": 512, "return_tensors": "pt"}
    # Concatenate the prompt with each response, separated by a newline.
    prompt_plus_chosen_response = examples["instruction"] + "\n" + examples["chosen_response"]
    prompt_plus_rejected_response = examples["instruction"] + "\n" + examples["rejected_response"]
    tokens_chosen = tokenizer.encode_plus(prompt_plus_chosen_response, **kwargs)
    tokens_rejected = tokenizer.encode_plus(prompt_plus_rejected_response, **kwargs)
    # return_tensors="pt" adds a batch dimension, so [0] selects the single encoded sequence.
    return {
        "input_ids_chosen": tokens_chosen["input_ids"][0], "attention_mask_chosen": tokens_chosen["attention_mask"][0],
        "input_ids_rejected": tokens_rejected["input_ids"][0], "attention_mask_rejected": tokens_rejected["attention_mask"][0]
    }

formatted_dataset = prepared_dataset.map(formatting_func)
formatted_dataset = formatted_dataset.train_test_split()

training_args = RewardConfig(
    output_dir="./reward_model",
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    logging_steps=1,
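    num_train_epochs=10,
    report_to=None,  # note: in many transformers versions None falls back to all installed integrations (e.g. wandb); pass "none" to disable reporting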
)
# Loading the RewardTrainer from TRL
trainer = RewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
)
trainer.train()

Map: 100%|██████████| 6987/6987 [00:31<00:00, 220.50 examples/s]
[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33msadmankiba[0m ([33msadman-ai[0m). Use [1m`wandb login --relogin`[0m to force relogin


You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,Validation Loss,Accuracy
1,0.7268,0.693517,0.497424
2,0.681,0.693581,0.502576
3,0.7119,0.693462,0.497424
4,0.7034,0.693256,0.499714
5,0.6945,0.693095,0.497424
6,0.6879,0.692943,0.518031
7,0.6801,0.692809,0.522038
8,0.6976,0.692696,0.521465
9,0.7041,0.692581,0.528334
10,0.7074,0.692514,0.527189


KeyboardInterrupt: 
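
The Accuracy column in the log above can be read as a pairwise accuracy: the share of evaluation pairs in which the model scores the chosen response above the rejected one. The sketch below assumes that reading of the metric; the function name and the reward values are hypothetical and only illustrate the calculation.

```python
def pairwise_accuracy(rewards_chosen, rewards_rejected):
    """Fraction of pairs in which the chosen response gets the strictly higher reward."""
    if len(rewards_chosen) != len(rewards_rejected):
        raise ValueError("Both reward lists must have the same length.")
    if not rewards_chosen:
        raise ValueError("At least one pair is required.")
    correct = sum(c > r for c, r in zip(rewards_chosen, rewards_rejected))
    return correct / len(rewards_chosen)

# Hypothetical rewards for four evaluation pairs.
example_accuracy = pairwise_accuracy([0.3, -0.1, 1.2, 0.0], [0.1, 0.4, 0.9, 0.2])
print(example_accuracy)
```

Under this reading, a value near 0.5 means the model ranks pairs no better than a coin flip.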

## Result 

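Initially promising: the model starts to learn. Over the first 15 steps, validation loss falls from 0.693517 to 0.691848 and accuracy rises from 0.497 to 0.547.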

| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 1 | 0.726800 | 0.693517 | 0.497424 |
| 2 | 0.681000 | 0.693581 | 0.502576 |
| 3 | 0.711900 | 0.693462 | 0.497424 |
| 4 | 0.703400 | 0.693256 | 0.499714 |
| 5 | 0.694500 | 0.693095 | 0.497424 |
| 6 | 0.687900 | 0.692943 | 0.518031 |
| 7 | 0.680100 | 0.692809 | 0.522038 |
| 8 | 0.697600 | 0.692696 | 0.521465 |
| 9 | 0.704100 | 0.692581 | 0.528334 |
| 10 | 0.707400 | 0.692514 | 0.527189 |
| 11 | 0.680300 | 0.692380 | 0.530052 |
| 12 | 0.701300 | 0.692229 | 0.530624 |
| 13 | 0.702300 | 0.692087 | 0.536920 |
| 14 | 0.689100 | 0.691951 | 0.542072 |
| 15 | 0.687900 | 0.691848 | 0.546651 |
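
One way to put these numbers in context is to compare them with the chance baseline. Assuming the trainer minimizes a pairwise loss of the form `-log(sigmoid(r_chosen - r_rejected))`, a model that gives both responses the same reward has a loss of `ln 2`. The sketch below compares the validation losses from steps 1 and 15 of the table with that baseline; the variable names are illustrative.

```python
import math

# Loss of a model that assigns identical rewards to both responses (assumed pairwise log-sigmoid loss).
chance_loss = math.log(2)

# Validation losses copied from steps 1 and 15 of the table above.
val_loss_step_1 = 0.693517
val_loss_step_15 = 0.691848

gap_step_1 = val_loss_step_1 - chance_loss
gap_step_15 = val_loss_step_15 - chance_loss

print(f"chance loss:       {chance_loss:.6f}")
print(f"step 1 vs chance:  {gap_step_1:+.6f}")
print(f"step 15 vs chance: {gap_step_15:+.6f}")
```

The validation loss starts essentially at the chance level and has moved only slightly below it by step 15, so the improvement is real but still small at this point in training.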