# Value-based and policy-based methods

1. **Value-based** - learns a value function (for example with a neural network) that estimates how good an action is in a given state. The policy is then derived from these estimates, for example by picking the highest-valued action.
2. **Policy-based** - learns the policy directly, without estimating how valuable it is to take an action in a given state.
3. **Combo (actor-critic)** - a combination of value-based and policy-based methods. PPO is one example.
    - The actor is the policy that proposes actions
    - The critic evaluates the actions generated by the policy
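
To make the distinction concrete, here is a minimal sketch for a single state. The Q-values, the logits and the action names are made-up numbers for illustration only. The value-based view picks the action with the highest estimated value, while the policy-based view samples from a probability distribution over actions.

```python
import math
import random

# Hypothetical Q-values for three actions in one state (value-based view).
q_values = {"left": 0.2, "stay": 0.5, "right": 1.3}

def greedy_action(q):
    """Value-based: choose the action with the highest estimated value."""
    return max(q, key=q.get)

# Hypothetical policy logits for the same state (policy-based view).
policy_logits = {"left": 0.0, "stay": 0.0, "right": 2.0}

def action_probabilities(logits):
    """Policy-based: turn logits into a probability distribution (softmax)."""
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {a: math.exp(v - max_logit) for a, v in logits.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}

def sample_action(logits, rng=random):
    """Sample one action according to the policy's probabilities."""
    probs = action_probabilities(logits)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

print(greedy_action(q_values))
print(action_probabilities(policy_logits))
print(sample_action(policy_logits))
```

An actor-critic method keeps both pieces: something like `action_probabilities` plays the actor, and a learned value estimate plays the critic.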

# Off-policy vs On-policy

1. **On-policy** - learns only from data collected with the current version of the policy. PPO is an example.
2. **Off-policy** - can also learn from data collected by previous versions of the policy, not only from the current one.



# DPO
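1. DPO stands for Direct Preference Optimisation.
2. It removes the separate reward model completely.
3. Since there is no reward model to evaluate the LLM's responses, the objective that optimises the LLM has to be built directly from the preference data.
    - How does this happen? In short, DPO compares the log-probabilities that the policy and a frozen reference model assign to the chosen and the rejected response, and pushes the policy to favour the chosen one relative to the reference.
4. The optimisation is therefore more like a binary classification problem: make the preferred output more likely than the rejected one.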
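
The objective described in the list above can be sketched for a single preference pair. This is a minimal illustration in plain Python, not the TRL implementation. The function and argument names are hypothetical, and the log-probabilities are assumed to be the summed token log-probabilities of each full response under the policy and under a frozen reference model.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    beta controls how strongly the policy is allowed to move away
    from the reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# Policy identical to the reference: the loss is log(2), i.e. chance level.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))

# Policy has shifted probability towards the chosen response: the loss drops.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss only depends on how much more the policy prefers the chosen response than the reference does, which is why no explicit reward model is needed.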


# PPO
1. Collect human preference data
2. Use the above dataset to train the reward model
3. Once the reward model is trained, use it as an evaluator to score the responses of the LLM that needs to be fine-tuned


In [1]:
from transformers import AutoTokenizer
from datasets import Dataset

# Simulate creation of the human preferences dataset

In [2]:
# Prepare synthetic dataset of prompts and preferred responses
# In reality this will be created by using real human responses to real questions.
# But for now this will work
prompts = [
    "Tell me about your day.",
    "What's your favorite food?",
    "Describe a good movie.",
    "What do you think about AI?",
    "Give me advice for being productive."
]


# Simulated response pairs: (preferred, rejected)
response_pairs = [
    ("It was a good day, I enjoyed it.", "It was okay."),
    ("I really love pizza, it's so good!", "I eat food."),
    ("A good movie has an engaging story.", "Movies are fine."),
    ("AI is good because it helps people.", "AI is complicated."),
    ("Be consistent and create good habits.", "Just work.")
]


data = []
for prompt, (accepted, rejected) in zip(prompts, response_pairs):
    data.append({
        'prompt': prompt,
        'accepted': accepted,
        'rejected': rejected
    })

# Convert this list to the Hugging Face dataset format
preference_dataset = Dataset.from_list(data)
print (preference_dataset)

Dataset({
    features: ['prompt', 'accepted', 'rejected'],
    num_rows: 5
})


In [3]:
preference_dataset[0]

{'prompt': 'Tell me about your day.',
 'accepted': 'It was a good day, I enjoyed it.',
 'rejected': 'It was okay.'}

# PPO pipeline



In [4]:
# Load the base model GPT2
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token   # GPT2 doesn't have a pad token

## Preference dataset example

In [5]:
# Take one sample to demonstrate how the tokenizer works
example =  preference_dataset[0]
print (f"Example:\n{example}")

# Tokenize the prompt
print (f"Forward mapping (tokens -> ids):")
tokens = tokenizer.tokenize(example['prompt'])
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print (f"Tokens: {tokens}")
print (f"Token IDs: {token_ids}")

print (f"Reverse mapping (ids -> tokens):")
id2tokens = tokenizer.convert_ids_to_tokens(token_ids)
print(f"ID -> Tokens: {id2tokens}")
decoded_tokens = tokenizer.decode(token_ids)
print (f"Decoded tokens: {decoded_tokens}")

Example:
{'prompt': 'Tell me about your day.', 'accepted': 'It was a good day, I enjoyed it.', 'rejected': 'It was okay.'}
Forward mapping (tokens -> ids):
Tokens: ['Tell', 'Ġme', 'Ġabout', 'Ġyour', 'Ġday', '.']
Token IDs: [24446, 502, 546, 534, 1110, 13]
Reverse mapping (ids -> tokens):
ID -> Tokens: ['Tell', 'Ġme', 'Ġabout', 'Ġyour', 'Ġday', '.']
Decoded tokens: Tell me about your day.


## Prepare reward data

Typically, this reward data is derived from the preference dataset when training a reward model.

In [6]:
reward_data = []
for example in data:
    reward_data.append({
        "text": example['prompt'] + " " + example['accepted'],
        "label": 1  # reward for preferred response
    })

    reward_data.append({
        "text": example['prompt'] + " " + example['rejected'],
        "label": 0  # reward for rejected response
    })

# convert to a Hugging Face dataset
reward_dataset = Dataset.from_list(reward_data)

# specify train and test split
reward_dataset = reward_dataset.train_test_split(test_size=0.2)

In [7]:
reward_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 8
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2
    })
})

In [8]:
preference_dataset

Dataset({
    features: ['prompt', 'accepted', 'rejected'],
    num_rows: 5
})

## Tokenize the reward data

Tokenize the input sequence (prompt + response) using the LLM's tokenizer

In [9]:
reward_dataset.shape

{'train': (8, 2), 'test': (2, 2)}

In [10]:
def tokenize_reward(example):
    # the arguments provided to the tokenizer function will pad all sequences of length smaller than max_length to max_length
    # and truncate all sequences longer than max_length
    encodings = tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)
    encodings["labels"] = example["label"]   # here label refers to the reward label i.e 1 or 0
    return encodings

tokenized_reward = reward_dataset.map(tokenize_reward, batched=True, remove_columns=['label'])

Map:   0%|          | 0/8 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

In [11]:
# As we can see, we have a new dataset with the original text, the labels, tokenized input and the attention mask
tokenized_reward


DatasetDict({
    train: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8
    })
    test: Dataset({
        features: ['text', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2
    })
})

In [12]:
# skip_special_tokens=True removes special tokens such as <|endoftext|> (which also serves as the pad token here)
decoded_output = tokenizer.decode(tokenized_reward['train'][0]['input_ids'], skip_special_tokens=True)
print (f"Decoded output: {decoded_output}")

Decoded output: Give me advice for being productive. Be consistent and create good habits.


## Train a reward model

In [13]:
# from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, GPT2Config, GPT2ForSequenceClassification, AutoConfig
# import torch

# # config = GPT2Config()

# # # Load GPT2 reward model (classification head)
# # reward_model = GPT2ForSequenceClassification(config).from_pretrained(model_name, num_labels=2)
# # reward_model.config.pad_token_id = reward_model.config.eos_token_id  # GPT2 doesn't have a pad token


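# # Another way of building a Hugging Face model, using the Auto* classes
# # (note: from_config() only builds the architecture with random weights; from_pretrained() loads the pretrained weights)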
# config = AutoConfig.from_pretrained(model_name)
# config.pad_token_id = config.eos_token_id
# reward_model = AutoModelForSequenceClassification.from_config(config)

# training_args = TrainingArguments(
#     output_dir="./reward_model",
#     overwrite_output_dir=True,
#     per_device_train_batch_size=4,
#     per_device_eval_batch_size=4,
#     logging_strategy="epoch",
#     learning_rate=5e-5,
#     num_train_epochs=30,
#     eval_strategy="epoch",
#     logging_dir="./logs",
#     save_strategy="epoch",
#     metric_for_best_model="eval_loss",
#     greater_is_better=False,
#     load_best_model_at_end=True,
# )

# trainer = Trainer(
#     model=reward_model,
#     args=training_args,
#     train_dataset=tokenized_reward["train"],
#     eval_dataset=tokenized_reward["test"],
#     tokenizer=tokenizer
# )

# trainer.train()

In [13]:
# Both training and validation loss are converging,
# so let's try it out on an unseen example.

test_prompt = "What is the meaning of life?"
good_response = "Life is such a beautiful thing and one must really value each others life."
bad_response = "Life is LIT AF!!"

good_response = f"{test_prompt} {good_response}"
bad_response = f"{test_prompt} {bad_response}"

In [None]:
bad_response

In [None]:
from transformers import AutoModelForSequenceClassification

best_model = AutoModelForSequenceClassification.from_pretrained("./reward_model/checkpoint-60")

In [None]:
best_tokenizer = AutoTokenizer.from_pretrained("./reward_model/checkpoint-60")

In [21]:
inputs = best_tokenizer(bad_response, return_tensors="pt")

In [22]:
outputs = best_model(**inputs)

In [None]:
logits = outputs.logits
predicted_class = logits.argmax(dim=-1)
print (f"Predicted class: {predicted_class}")

## Observations



We will now try training a reward model using Huggingface's TRL library.

## Training a reward model using `TRL`
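
Unlike the 0/1 classification setup used earlier, `RewardTrainer` trains on (chosen, rejected) pairs with a pairwise loss: the reward of the chosen response should be higher than the reward of the rejected one. The sketch below shows that loss and the matching accuracy in plain Python. The function names and the reward values are made up for illustration.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def pairwise_accuracy(pairs):
    """Fraction of pairs where the chosen response gets the higher reward."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Hypothetical reward-model outputs for three (chosen, rejected) pairs.
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
losses = [pairwise_reward_loss(c, r) for c, r in pairs]
print(sum(losses) / len(losses))
print(pairwise_accuracy(pairs))
```

A model that cannot separate the two responses at all gives a loss of log 2 ≈ 0.693 and an accuracy around 0.5, which is a useful baseline when reading the training logs.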

In [14]:
from peft import LoraConfig, TaskType
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
from trl import RewardTrainer, RewardConfig, setup_chat_format

## Load Preference dataset

In [15]:
from datasets import load_dataset

preference_dataset = load_dataset("trl-lib/ultrafeedback_binarized")

In [16]:
preference_dataset

DatasetDict({
    train: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 62135
    })
    test: Dataset({
        features: ['chosen', 'rejected', 'score_chosen', 'score_rejected'],
        num_rows: 1000
    })
})

In [17]:
preference_dataset['train'][2]

{'chosen': [{'content': 'Detailed Instructions: In this task, you must classify if a given review is positive/negative, indicating your answer as P or N.\nQ: " desperate measures " was something i was excited about seeing back when it was originally scheduled to be released : summer \' 97 . for some reason , it was delayed until hollywood \'s traditional dumping ground : january .\nnow that it \'s out , i see no real reason for that delay , as it \'s a simple yet highly entertaining film .\nmichael keaton stars as a maniacial murderer who \'s bone marrow can save the life of the dying son of a san francisco police detective ( garcia ) .\nkeaton agrees to the transplant , only so he can attempt escape .\nhe succeeds , in a plan that of course could only work in the movies .\nthe police force is now trying to kill keaton , while garcia is working against them trying to keep keaton alive in order to save his son .\nthe film definately has it \'s flaws .\nthe plot is strictly tv movie of t

## Define Reward Model trainer

In [18]:
HF_MODEL_NAME = "gpt2"


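# Build the model that we need to train as a reward model.
# Note: from_config() creates the GPT-2 architecture with randomly initialised weights;
# AutoModelForSequenceClassification.from_pretrained(HF_MODEL_NAME) would start from the pretrained weights instead.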
config = AutoConfig.from_pretrained(HF_MODEL_NAME)
config.pad_token_id = config.eos_token_id    # GPT2 doesn't have a pad token
reward_model = AutoModelForSequenceClassification.from_config(config)


tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token   # GPT2 doesn't have a pad token

if tokenizer.chat_template is None:
    reward_model, tokenizer = setup_chat_format(reward_model, tokenizer)

# base model training args
training_args = RewardConfig(
    output_dir="./peft_reward_model",
    overwrite_output_dir=True,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_strategy="steps",
    learning_rate=1e-4,
    num_train_epochs=1,
    eval_strategy="steps",
    logging_dir="./peft_logs",
    save_strategy="steps",
    logging_steps=25,
    eval_steps=50,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    load_best_model_at_end=True,
    disable_dropout=False
)

# LoRA for fast finetuning
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # stands for Sequence classification
    inference_mode=False,    # whether the model is used for inference only; we are finetuning, so we set it to False
    r=8,    # LoRA attention dimension, a.k.a. rank
    lora_alpha=16,      # scaling factor for the LoRA update (applied as lora_alpha / r); the higher the value, the stronger the effect of the LoRA weights on the original layers
    lora_dropout=0.1,   # the dropout probability for LoRA layers
)


# Define the Reward Model Trainer
reward_model_trainer = RewardTrainer(
    model=reward_model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=preference_dataset["train"],
    eval_dataset=preference_dataset["test"],
    peft_config=peft_config
)

reward_model_trainer.train()


The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
No label_names provided for model class `PeftModelForSequenceClassification`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


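| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 50 | 0.7322 | 0.70709 | 0.537585 |
| 100 | 0.7359 | 0.693426 | 0.571754 |
| 150 | 0.715 | 0.68485 | 0.589977 |
| 200 | 0.7389 | 0.681197 | 0.588369 |
| 250 | 0.729 | 0.676577 | 0.593394 |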




KeyboardInterrupt: 
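
To get a feel for what the `LoraConfig` settings above (`r=8`, `lora_alpha=16`) mean, the sketch below counts the extra trainable parameters of one LoRA adapter and computes its scaling factor. The helper names are hypothetical. The 768 x 2304 shape is an assumption chosen to match GPT-2 small's fused attention projection; which layers actually receive adapters depends on the target modules PEFT selects.

```python
def lora_extra_params(d_in, d_out, r):
    """Trainable parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A before it is added to the frozen layer."""
    return lora_alpha / r

d_in, d_out = 768, 2304          # assumed layer shape, for illustration only
full = d_in * d_out              # parameters in the frozen weight matrix
extra = lora_extra_params(d_in, d_out, r=8)

print(f"Frozen weights:   {full}")
print(f"LoRA parameters:  {extra} ({100 * extra / full:.2f}% of the frozen matrix)")
print(f"Scaling factor:   {lora_scaling(lora_alpha=16, r=8)}")
```

Raising `r` increases the adapter's capacity and memory use linearly, while `lora_alpha` only changes how strongly the learned update is weighted.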

## TODO:

- How do we fix this memory issue?
- Is it fine to play around with the values?
- If not, then we will have to test this on Lightning AI.


- Can we speed up the tokenization process?
    - Increase batch size
    - Also understand at which stage inside the trainer this tokenization happens
- Resolve the "No label_names provided" warning
- Understand what other ways there are to speed up training
- Tensorboard integration

In [None]:
import peft
peft.__version__

## PPO finetuning with Reward Model
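
The PPO cell below uses a `TrainingLogger` helper that is not defined anywhere in this notebook. The class below is a hypothetical stand-in written only to match the calls made in that cell (`log_sample`, `save_json`, `save_csv`, `plot_scores`). Its file names and record layout are assumptions, not the original helper.

```python
import csv
import json
from pathlib import Path


class TrainingLogger:
    """Hypothetical helper: collects per-step samples and writes them to disk."""

    FIELDS = ["step", "prompt", "response", "score", "label"]

    def __init__(self, run_name, log_dir="."):
        self.run_name = run_name
        self.log_dir = Path(log_dir)
        self.records = []

    def log_sample(self, step, prompt, response, score=None, label=None):
        """Store one generated sample together with its optional score and label."""
        self.records.append({"step": step, "prompt": prompt, "response": response,
                             "score": score, "label": label})

    def save_json(self):
        path = self.log_dir / f"{self.run_name}.json"
        path.write_text(json.dumps(self.records, indent=2))
        return path

    def save_csv(self):
        path = self.log_dir / f"{self.run_name}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.FIELDS)
            writer.writeheader()
            writer.writerows(self.records)
        return path

    def plot_scores(self, title):
        # Imported lazily so that logging still works without matplotlib installed.
        import matplotlib.pyplot as plt

        scored = [r for r in self.records if r["score"] is not None]
        plt.plot([r["step"] for r in scored], [r["score"] for r in scored], marker="o")
        plt.xlabel("Step")
        plt.ylabel("Score")
        plt.title(title)
        plt.show()
```

Samples logged without a score are kept in the JSON and CSV files but skipped when plotting.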

In [None]:
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import pipeline

# Load model for PPO training
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ppo_tokenizer = tokenizer  # same tokenizer

# Set up PPO config
ppo_config = PPOConfig(
    model_name=model_name,
    learning_rate=1e-5,
    log_with=None,
    batch_size=1,
    mini_batch_size=1,
)

# Reward function using the trained reward model.
# We are not optimising the reward model here, only using it for scoring,
# so we put it in eval() mode.
reward_model.eval()

def get_reward(prompt, response):
    text = prompt + " " + response
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(reward_model.device)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
        probs = torch.softmax(logits, dim=-1)
        reward_score = probs[:, 1]  # probability of class 1 (preferred)
    return reward_score.item()

# Initialize logger
ppo_logger = TrainingLogger(run_name="ppo_run")


# Prepare PPO dataset (only prompts); wrapped in a Dataset object rather than passed as a plain list
ppo_dataset = Dataset.from_list([{"prompt": ex["prompt"]} for ex in data])

# Setup PPOTrainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=ppo_model,
    tokenizer=ppo_tokenizer,
    dataset=ppo_dataset
)

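# PPO training loop
# Note: PPOTrainer(config=..., tokenizer=...) and step() follow the legacy PPOTrainer interface of older TRL releases.
# step() expects lists of tensors (query ids, response ids, scalar rewards), not strings.
for step, sample in enumerate(ppo_dataset):
    prompt = sample["prompt"]
    input_ids = ppo_tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = ppo_trainer.model.generate(input_ids, max_new_tokens=20)
    response_only_ids = response_ids[0][input_ids.shape[-1]:]   # drop the prompt tokens
    response_text = ppo_tokenizer.decode(response_only_ids, skip_special_tokens=True)

    reward_score = get_reward(prompt, response_text)

    ppo_trainer.step([input_ids[0]], [response_only_ids], [torch.tensor(reward_score)])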

    # Log
    ppo_logger.log_sample(step, prompt, response_text, reward_score)

# Save logs & plot
ppo_logger.save_json()
ppo_logger.save_csv()
ppo_logger.plot_scores("PPO Reward Trend")

print("PPO fine-tuning complete!")


# DPO pipeline

In [None]:
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM

# DPO needs a plain causal LM (no value head, no reward model)
dpo_model = AutoModelForCausalLM.from_pretrained(model_name)

# Load a fresh tokenizer: `tokenizer` may have received extra chat tokens
# from setup_chat_format() above, which the base GPT-2 embeddings do not contain
dpo_tokenizer = AutoTokenizer.from_pretrained(model_name)
dpo_tokenizer.pad_token = dpo_tokenizer.eos_token   # GPT2 doesn't have a pad token

# DPO trains directly on (prompt, chosen, rejected) triples,
# so we reuse the synthetic preference pairs instead of generating responses
dpo_dataset = Dataset.from_list([
    {"prompt": ex["prompt"], "chosen": ex["accepted"], "rejected": ex["rejected"]}
    for ex in data
])

# Set up DPO config
dpo_config = DPOConfig(
    output_dir="./dpo_model",
    learning_rate=1e-5,
    per_device_train_batch_size=1,
    num_train_epochs=1,
    beta=0.1,           # strength of the pull towards the reference model
    report_to="none",
)

# Initialize DPOTrainer; with no ref_model given, a frozen copy of the model is used as reference
dpo_trainer = DPOTrainer(
    model=dpo_model,
    args=dpo_config,
    processing_class=dpo_tokenizer,
    train_dataset=dpo_dataset,
)

# Directly optimise on the preference pairs
dpo_trainer.train()
print("DPO fine-tuning complete!")


# Evaluation

## Qualitative evaluation

In [None]:
# Generate responses using PPO model
def generate_response_ppo(prompt):
    input_ids = ppo_tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = ppo_model.generate(input_ids, max_new_tokens=20)
    return ppo_tokenizer.decode(response_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Generate responses using DPO model
def generate_response_dpo(prompt):
    input_ids = dpo_tokenizer(prompt, return_tensors="pt").input_ids
    response_ids = dpo_model.generate(input_ids, max_new_tokens=20)
    return dpo_tokenizer.decode(response_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Example prompts to test
test_prompts = [
    "Tell me about your day.",
    "What's your favorite food?",
    "Describe a good movie.",
    "What do you think about AI?",
    "Give me advice for being productive."
]

# Generate responses from both models
ppo_responses = [generate_response_ppo(prompt) for prompt in test_prompts]
dpo_responses = [generate_response_dpo(prompt) for prompt in test_prompts]

# Print both sets of responses for comparison
for prompt, ppo_resp, dpo_resp in zip(test_prompts, ppo_responses, dpo_responses):
    print(f"Prompt: {prompt}")
    print(f"PPO Response: {ppo_resp}")
    print(f"DPO Response: {dpo_resp}")
    print("-" * 50)


## Quantitative evaluation

In [None]:
from nltk.translate.bleu_score import sentence_bleu

# Define reference responses (example, can be replaced with actual preferred responses)
reference_responses = [
    "It was a good day, I enjoyed it.",
    "I really love pizza, it's so good!",
    "A good movie has an engaging story.",
    "AI is good because it helps people.",
    "Be consistent and create good habits."
]

# Compute BLEU score for PPO vs DPO responses
def compute_bleu(reference_responses, generated_responses):
    bleu_scores = []
    for ref, gen in zip(reference_responses, generated_responses):
        ref_tokens = ref.split()
        gen_tokens = gen.split()
        bleu_score = sentence_bleu([ref_tokens], gen_tokens)
        bleu_scores.append(bleu_score)
    return bleu_scores

# Calculate BLEU scores for PPO and DPO
ppo_bleu = compute_bleu(reference_responses, ppo_responses)
dpo_bleu = compute_bleu(reference_responses, dpo_responses)

# Print BLEU scores
print(f"PPO BLEU Scores: {ppo_bleu}")
print(f"DPO BLEU Scores: {dpo_bleu}")
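
When reading these scores, keep in mind that `sentence_bleu` by default combines 1-gram to 4-gram precisions, so short responses with no matching 4-gram collapse to a score near zero unless smoothing is applied (NLTK offers `SmoothingFunction` for this). The sketch below computes the clipped n-gram precisions by hand for one made-up candidate sentence to show where the zero comes from. The helper names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

reference = "It was a good day, I enjoyed it.".split()
candidate = "It was a fine day.".split()   # made-up model output

for n in range(1, 5):
    print(f"{n}-gram precision: {modified_precision(reference, candidate, n):.3f}")
```

Here the candidate shares words and even a trigram with the reference, but no 4-gram, which is enough to drag an unsmoothed BLEU score down to almost nothing.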


In [None]:
import matplotlib.pyplot as plt

def plot_bleu_scores(ppo_bleu, dpo_bleu):
    plt.figure(figsize=(10, 5))
    plt.plot(range(len(ppo_bleu)), ppo_bleu, label="PPO BLEU Scores", marker='o')
    plt.plot(range(len(dpo_bleu)), dpo_bleu, label="DPO BLEU Scores", marker='o')
    plt.xlabel("Prompt Index")
    plt.ylabel("BLEU Score")
    plt.title("BLEU Score Comparison: PPO vs DPO")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

# Plot BLEU score comparison
plot_bleu_scores(ppo_bleu, dpo_bleu)


# Now I just need to run the code

- Next I need to learn about distributed training of PPO/DPO (using DeepSpeed, FSDP)
- Efficient sampling, buffering, checkpointing
- ~~LoRA and QLoRA for memory efficient finetuning~~
- Deep dive into what the reward model is. Does it always have to be a neural network? Does it have to be a reward function?