# LLM Post-training using Proximal Policy Optimization

In this notebook, we show how to post-train a base LLM with Proximal Policy Optimization (PPO), a reinforcement learning (RL) algorithm.

First, install the required packages. TRL (Transformer Reinforcement Learning) is a Hugging Face library for post-training foundation models.

In [38]:
!pip install trl==0.4.7 transformers==4.28.1 peft==0.3.0 accelerate -q

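Load a base model. We use a very small foundation model because this notebook is for illustration only. `AutoModelForCausalLMWithValueHead` wraps the language model with a scalar value head, which PPO uses to estimate the value of each generated token.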

In [39]:
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "distilgpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

We define a toy problem in which we ask the model what the capital of France is, and use a rule-based reward function that returns 1.0 when the answer contains "Paris" and 0.0 when it doesn't.

In [40]:
# Toy dataset
prompts = ["What is the capital city of France?"] * 10

import re

# Rule-based reward: 1.0 if the output contains the whole word "Paris" or "paris", else 0.0
def reward_fn(output: str) -> float:
    return 1.0 if re.search(r'\b[Pp]aris\b', output) else 0.0
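As a quick sanity check of the reward rule, the sketch below runs the same regular expression on a handful of strings. The sample strings are hypothetical and chosen only to exercise the pattern; the function is repeated so the snippet runs on its own.

```python
import re

# Same rule as the notebook's reward_fn, repeated so this snippet is self-contained.
def reward_fn(output: str) -> float:
    return 1.0 if re.search(r'\b[Pp]aris\b', output) else 0.0

# Hypothetical sample outputs mapped to the reward we expect from the regex.
samples = {
    "The capital is Paris.": 1.0,
    "paris, of course": 1.0,
    "PARIS": 0.0,           # all caps is not matched by [Pp]aris
    "Parisian cafes": 0.0,  # \b requires a word boundary right after "Paris"
    "London": 0.0,
}

for text, expected in samples.items():
    assert reward_fn(text) == expected, text
```

Note that the rule is binary and fairly strict: an answer such as "PARIS" earns nothing, so the reward is sparse for a model that rarely produces the exact word.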

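Next, we configure and create the PPOTrainer. Because no reference model is passed, TRL creates a frozen copy of the model to use as the reference for the KL penalty.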

In [41]:
ppo_config = PPOConfig(
    model_name=model_name,
    learning_rate=1e-4,
    batch_size=2,
    mini_batch_size=1,
    gradient_accumulation_steps=1,
    target_kl=0.2
)

ppo_trainer = PPOTrainer(config=ppo_config, model=model, tokenizer=tokenizer)
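For intuition about what the trainer optimizes, here is a minimal sketch of the PPO clipped surrogate objective for a single sampled token. It is not TRL's implementation, which also includes a value loss and a KL penalty; the function name `clipped_surrogate` and the clip range `eps=0.2` are assumptions made for illustration.

```python
import math

def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """PPO clipped objective for one sampled token (a quantity to maximize).

    eps=0.2 is an assumed clip range, used here for illustration only.
    """
    # Probability ratio between the updated policy and the policy that sampled the token.
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, ratio above 1 + eps: the gain is capped by the clipped term.
capped_gain = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)

# Negative advantage, ratio above 1 + eps: the unclipped (worse) term is kept.
uncapped_penalty = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)

print(capped_gain, uncapped_penalty)
```

The clipping removes the incentive to move the policy far from the one that generated the samples in a single update, which is why PPO can reuse a batch for several mini-batch steps.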


Training loop. We run only five PPO steps (labelled "Epoch" in the printout), each on two sampled prompts, as this is for illustration only.

In [42]:
import torch
import random
import re


# Run a few PPO steps (small demo loop)
for epoch in range(5):
    print(f"Epoch {epoch + 1}")
    batch_prompts = random.sample(prompts, k=2)

    # Tokenize prompts
    device = next(model.parameters()).device
    inputs = tokenizer(batch_prompts, return_tensors="pt", padding=True).to(device)
    query_tensors = [q for q in inputs["input_ids"]]

    # Generate responses with sampling + top-k/top-p
    response_tensors = []
    for query in query_tensors:
        output = model.generate(
            query.unsqueeze(0),
            max_new_tokens=20,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=True,
            top_k=50,
            top_p=0.95,
            temperature=1.0,
            eos_token_id=tokenizer.eos_token_id
        )
        response = output[0][query.shape[0]:]  # new tokens only
        response_tensors.append(response)

    # Decode and compute rewards
    responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    rewards = [torch.tensor(reward_fn(r)).to(device) for r in responses]

    # PPO update step
    ppo_trainer.step(query_tensors, response_tensors, rewards)

    # Print result
    for p, r, rw in zip(batch_prompts, responses, rewards):
        print(f"\nPrompt: {p}\n→ Response: {repr(r)}\n→ Reward: {rw.item()}")


Epoch 1


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Prompt: What is the capital city of France?
→ Response: " The city you're in is not a city. The city you're in is not a city."
→ Reward: 0.0

Prompt: What is the capital city of France?
→ Response: '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
→ Reward: 0.0
Epoch 2

Prompt: What is the capital city of France?
→ Response: '\n\n\n\n\n\n\n\n\n\n'
→ Reward: 0.0

Prompt: What is the capital city of France?
→ Response: '\n\n\n\n\n\n\n\nThe UK plans to start the first public funding at 1240'
→ Reward: 0.0
Epoch 3

Prompt: What is the capital city of France?
→ Response: '\n\n\n- $1,000; R322175; New York City City City City'
→ Reward: 0.0

Prompt: What is the capital city of France?
→ Response: ' - June 28\n\n\nAaaaaAaAaAAaAa'
→ Reward: 0.0
Epoch 4

Prompt: What is the capital city of France?
→ Response: ' A $2002510142243444444444545444445444444'
→ Reward: 0.0

Prompt: What is the capital city of France?
→ Response: '\n\nState Public Works. Eighty First City City City City City City City City C

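Every reward shown above is 0.0: no sampled response contains "Paris", so PPO never receives a positive signal to reinforce. The "distilgpt2" model is likely too small to answer the question, and the degenerate outputs suggest that how we generate from the model is also a problem. One possible cause is that the loop samples with top-k/top-p truncation, whereas TRL's documentation recommends plain sampling (`top_k=0`, `top_p=1.0`) during PPO. As a check, we generate directly from the underlying language model.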

In [43]:
# Get the underlying language model
base_model = model.pretrained_model

# Prompt for testing
test_prompt = "What is the capital of France?"
device = next(model.parameters()).device
test_input = tokenizer(test_prompt, return_tensors="pt").to(device)

# Generate a response
output = base_model.generate(
    test_input["input_ids"],
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=1.0,
    eos_token_id=tokenizer.eos_token_id
)

# Decode and print the output
print("Generated answer after training:", tokenizer.decode(output[0], skip_special_tokens=True))


Generated answer after training: What is the capital of France?
4GXM70XPZ6THKKKKKKKK
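To quantify how sparse the reward was during training, the sketch below scores the responses copied from the training log above with the same rule. The truncated final response is omitted, and the two newline-only responses are written as repeated `"\n"` (the exact repeat counts are approximate and do not affect the result).

```python
import re

# Same rule as the notebook's reward_fn, repeated so this snippet is self-contained.
def reward_fn(output: str) -> float:
    return 1.0 if re.search(r'\b[Pp]aris\b', output) else 0.0

# Responses copied from the training log (truncated last response omitted).
logged_responses = [
    " The city you're in is not a city. The city you're in is not a city.",
    "\n" * 20,  # newline-only response
    "\n" * 10,  # newline-only response
    "\n\n\n\n\n\n\n\nThe UK plans to start the first public funding at 1240",
    "\n\n\n- $1,000; R322175; New York City City City City",
    " - June 28\n\n\nAaaaaAaAaAAaAa",
    " A $2002510142243444444444545444445444444",
]

rewards = [reward_fn(r) for r in logged_responses]
mean_reward = sum(rewards) / len(rewards)
print(f"rewarded samples: {int(sum(rewards))}/{len(rewards)}, mean reward: {mean_reward}")
```

With no rewarded sample in the log, nothing in the reward distinguishes "Paris" from any other output. Checking the base model's hit rate on the prompt before training, or choosing a model that answers correctly at least occasionally, would give PPO something to reinforce.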
