RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

This is the repository for the paper RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning by Zeming Wei, Qiaosheng Zhang, Xia Hu, and Xingcheng Xu, presented at the ICLR 2026 Workshop on Trustworthy AI.

Method overview

Installation

pip install -e .

Requirements: Python ≥ 3.10, torch, transformers, trl, vllm, datasets, pandas, pyyaml, peft.

Repository Structure

rapo/               # Core library
  data/loader.py    # Dataset loading
  eval/             # Safety and capability evaluation
  models/           # Local (vLLM/HF) and API model wrappers
  train/            # SFT and RL training logic
  utils.py          # System prompt templates and shared constants

scripts/
  train_pipeline.py # Full SFT → RL → Eval pipeline (recommended entry point)
  train_sft.py      # SFT stage only
  train_rl.py       # RL stage only
  eval_safety.py    # Safety evaluation (ASR on jailbreak benchmarks)
  eval_capability.py # Capability evaluation (MMLU-Pro, etc.)
  run_pipeline.sh   # Shell launcher example

configs/
  pipeline_recipe_qwen1b.yaml  # Qwen3-1.7B full pipeline example
  pipeline_recipe_ds1b.yaml    # DeepSeek-distill-1.5B full pipeline example
  models/           # Per-model vLLM inference settings
  data/             # Dataset recipes (sft_recipe.yaml, rl_recipe.yaml)
  train/            # Training hyperparameters (sft.yaml, rl.yaml)

Data Setup

This repository does not ship datasets. Download them and set:

export RAPO_DATASETS_ROOT=/path/to/datasets

If not set, the code falls back to ./datasets relative to RAPO_ROOT. Place the following under one directory:

| Dataset | Files/folders |
| --- | --- |
| StrataSword | `strata_sword_en_level_1.csv`, `strata_sword_en_level_2.csv`, `strata_sword_en_level_3.csv` |
| STAR-1 | `STAR-benign/`, `STAR-1K/` |
| WildJailbreak | `wildjailbreak/` |
| JailbreakBench | `JailbreakBench/` (eval only) |
| HarmBench | `Harmbench/` (eval only) |
| XsTest | `XsTest/` (eval only) |
Model Setup

Set model.path and reward_model.path in the pipeline recipe YAML to local paths or HuggingFace model identifiers. RAPO has been evaluated on:

  • Qwen/Qwen3-8B
  • Qwen/Qwen3-1.7B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
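A hypothetical recipe excerpt showing the two keys named above; the surrounding structure is illustrative and may differ from the shipped `configs/pipeline_recipe_*.yaml` schema:

```yaml
# Illustrative fragment only; consult configs/pipeline_recipe_qwen1b.yaml
# for the actual schema shipped with the repository.
model:
  path: Qwen/Qwen3-1.7B         # local path or HuggingFace identifier
reward_model:
  path: Qwen/Qwen3-1.7B         # self-rewarding: may reuse the base LRM
```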

Running the Pipeline

export RAPO_ROOT=/path/to/RAPO
export RAPO_DATASETS_ROOT=/path/to/datasets
export PYTHONPATH=$RAPO_ROOT:$PYTHONPATH

# Full SFT → RL → Eval pipeline
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml

# Or skip stages
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml --skip-sft
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml --skip-eval

To run stages individually:

python scripts/train_sft.py --config configs/pipeline_recipe_qwen1b.yaml
python scripts/train_rl.py  --config configs/pipeline_recipe_qwen1b.yaml

Reward Design

RAPO uses a composite reward $A_p = R(s_p) + G(r_p)$:

  • Risk-aware reward $R(s_p) \in \{-1, 0, 1\}$: judges whether the safety reasoning trace in the thinking content is adequately calibrated to the prompt's complexity level (L1/L2/L3).
  • General reward $G(r_p) \in \{-1, +1\}$: checks refusal on harmful prompts and helpfulness on benign ones.

Both judges are implemented by prompting the base LRM itself with dedicated system prompts (self-rewarding), and their calls run in parallel via vLLM.
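The composite advantage is a simple sum of the two judge outputs. The helper below is a sketch of that arithmetic under the value ranges stated above, not the repository's implementation:

```python
def composite_advantage(risk_reward: int, general_reward: int) -> int:
    """Compute A_p = R(s_p) + G(r_p).

    Illustrative only. Value ranges follow the README:
    R(s_p) in {-1, 0, 1}, G(r_p) in {-1, +1}, so A_p in [-2, 2].
    """
    if risk_reward not in (-1, 0, 1):
        raise ValueError("risk_reward must be -1, 0, or 1")
    if general_reward not in (-1, 1):
        raise ValueError("general_reward must be -1 or +1")
    return risk_reward + general_reward
```

For example, a response whose reasoning is well calibrated (R = 1) and that correctly refuses a harmful prompt (G = +1) receives the maximal advantage of 2.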

Citation

@inproceedings{
wei2026rapo,
title={{RAPO}: Risk-Aware Preference Optimization for Generalizable Safe Reasoning},
author={Wei, Zeming and Zhang, Qiaosheng and Hu, Xia and Xu, Xingcheng},
booktitle={ICLR Trustworthy AI Workshop},
year={2026},
url={https://openreview.net/forum?id=smLgjabnLP}
}
