This is the repository for the paper *RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning* by Zeming Wei, Qiaosheng Zhang, Xia Hu, and Xingcheng Xu, presented at the ICLR 2026 Workshop on Trustworthy AI.
Install in editable mode:

```bash
pip install -e .
```

Requirements: Python ≥ 3.10, torch, transformers, trl, vllm, datasets, pandas, pyyaml, peft.
```
rapo/                            # Core library
  data/loader.py                 # Dataset loading
  eval/                          # Safety and capability evaluation
  models/                        # Local (vLLM/HF) and API model wrappers
  train/                         # SFT and RL training logic
  utils.py                       # System prompt templates and shared constants
scripts/
  train_pipeline.py              # Full SFT → RL → Eval pipeline (recommended entry point)
  train_sft.py                   # SFT stage only
  train_rl.py                    # RL stage only
  eval_safety.py                 # Safety evaluation (ASR on jailbreak benchmarks)
  eval_capability.py             # Capability evaluation (MMLU-Pro, etc.)
  run_pipeline.sh                # Shell launcher example
configs/
  pipeline_recipe_qwen1b.yaml    # Qwen3-1.7B full pipeline example
  pipeline_recipe_ds1b.yaml      # DeepSeek-distill-1.5B full pipeline example
  models/                        # Per-model vLLM inference settings
  data/                          # Dataset recipes (sft_recipe.yaml, rl_recipe.yaml)
  train/                         # Training hyperparameters (sft.yaml, rl.yaml)
```
This repository does not ship datasets. Download them and set:

```bash
export RAPO_DATASETS_ROOT=/path/to/datasets
```

If `RAPO_DATASETS_ROOT` is not set, the code falls back to `./datasets` relative to `RAPO_ROOT` (a resolution sketch follows the table below). Place the following under one directory:
| Dataset | Files/folders |
|---|---|
| StrataSword | strata_sword_en_level_1.csv, strata_sword_en_level_2.csv, strata_sword_en_level_3.csv |
| STAR-1 | STAR-benign/, STAR-1K/ |
| WildJailbreak | wildjailbreak/ |
| JailbreakBench | JailbreakBench/ (eval only) |
| HarmBench | Harmbench/ (eval only) |
| XsTest | XsTest/ (eval only) |
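As a point of reference, the fallback behavior described above could be implemented like the sketch below. This is illustrative only: the actual logic lives in `rapo/data/loader.py`, and the function name here is an assumption.

```python
import os
from pathlib import Path

def resolve_datasets_root() -> Path:
    """Illustrative sketch: resolve the dataset directory per this README.

    Prefers RAPO_DATASETS_ROOT; otherwise falls back to ./datasets under
    RAPO_ROOT (or the current directory if RAPO_ROOT is also unset).
    """
    root = os.environ.get("RAPO_DATASETS_ROOT")
    if root:
        return Path(root)
    return Path(os.environ.get("RAPO_ROOT", ".")) / "datasets"
```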
Set `model.path` and `reward_model.path` in the pipeline recipe YAML to local paths or HuggingFace model identifiers. RAPO has been evaluated on:

- `Qwen/Qwen3-8B`
- `Qwen/Qwen3-1.7B`
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`
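A sketch of the relevant recipe fields is below. Only `model.path` and `reward_model.path` are documented in this README; the values are placeholders, so check the shipped recipes under `configs/` for the full schema.

```yaml
# Excerpt of a pipeline recipe, e.g. configs/pipeline_recipe_qwen1b.yaml
model:
  path: Qwen/Qwen3-1.7B        # local path or HuggingFace identifier
reward_model:
  path: Qwen/Qwen3-1.7B        # RAPO self-rewards with the base LRM
```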
Set the environment variables and put the repo on `PYTHONPATH`:

```bash
export RAPO_ROOT=/path/to/RAPO
export RAPO_DATASETS_ROOT=/path/to/datasets
export PYTHONPATH=$RAPO_ROOT:$PYTHONPATH
```

```bash
# Full SFT → RL → Eval pipeline
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml

# Or skip stages
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml --skip-sft
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml --skip-eval
```

To run stages individually:

```bash
python scripts/train_sft.py --config configs/pipeline_recipe_qwen1b.yaml
python scripts/train_rl.py --config configs/pipeline_recipe_qwen1b.yaml
```

RAPO uses a composite reward:
- **Risk-aware reward** $R(s_p) \in \{-1, 0, 1\}$: judges whether the safety reasoning trace in the thinking content is adequately calibrated to the prompt's complexity level (L1/L2/L3).
- **General reward** $G(r_p) \in \{-1, +1\}$: checks refusal on harmful prompts and helpfulness on benign ones.
Both judges are implemented by prompting the base LRM itself with dedicated system prompts (self-rewarding), and their calls run in parallel via vLLM.
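A minimal sketch of how the two signals might combine into a single scalar; the additive form and the weights are assumptions for illustration, not the paper's stated formulation:

```python
def composite_reward(
    risk_reward: int,            # R(s_p) in {-1, 0, 1}
    general_reward: int,         # G(r_p) in {-1, +1}
    risk_weight: float = 1.0,    # hypothetical weights; not given in this README
    general_weight: float = 1.0,
) -> float:
    """Illustrative combination of RAPO's two judge signals."""
    assert risk_reward in (-1, 0, 1), "risk-aware judge emits -1, 0, or 1"
    assert general_reward in (-1, 1), "general judge emits -1 or +1"
    return risk_weight * risk_reward + general_weight * general_reward
```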
To cite this work:

```bibtex
@inproceedings{
wei2026rapo,
title={{RAPO}: Risk-Aware Preference Optimization for Generalizable Safe Reasoning},
author={Wei, Zeming and Zhang, Qiaosheng and Hu, Xia and Xu, Xingcheng},
booktitle={ICLR Trustworthy AI Workshop},
year={2026},
url={https://openreview.net/forum?id=smLgjabnLP}
}
```