RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

This is the repository for the paper RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning by Zeming Wei, Qiaosheng Zhang, Xia Hu, and Xingcheng Xu, presented at the ICLR 2026 Workshop on Trustworthy AI.

Method overview

Installation

pip install -e .

Requirements: Python ≥ 3.10, torch, transformers, trl, vllm, datasets, pandas, pyyaml, peft.

Repository Structure

rapo/               # Core library
  data/loader.py    # Dataset loading
  eval/             # Safety and capability evaluation
  models/           # Local (vLLM/HF) and API model wrappers
  train/            # SFT and RL training logic
  utils.py          # System prompt templates and shared constants

scripts/
  train_pipeline.py # Full SFT → RL → Eval pipeline (recommended entry point)
  train_sft.py      # SFT stage only
  train_rl.py       # RL stage only
  eval_safety.py    # Safety evaluation (ASR on jailbreak benchmarks)
  eval_capability.py # Capability evaluation (MMLU-Pro, etc.)
  run_pipeline.sh   # Shell launcher example

configs/
  pipeline_recipe_qwen1b.yaml  # Qwen3-1.7B full pipeline example
  pipeline_recipe_ds1b.yaml    # DeepSeek-distill-1.5B full pipeline example
  models/           # Per-model vLLM inference settings
  data/             # Dataset recipes (sft_recipe.yaml, rl_recipe.yaml)
  train/            # Training hyperparameters (sft.yaml, rl.yaml)

Data Setup

This repository does not ship datasets. Download them and set:

export RAPO_DATASETS_ROOT=/path/to/datasets

If not set, the code falls back to ./datasets relative to RAPO_ROOT. Place the following under one directory:

| Dataset | Files/folders |
| --- | --- |
| StrataSword | `strata_sword_en_level_1.csv`, `strata_sword_en_level_2.csv`, `strata_sword_en_level_3.csv` |
| STAR-1 | `STAR-benign/`, `STAR-1K/` |
| WildJailbreak | `wildjailbreak/` |
| JailbreakBench | `JailbreakBench/` (eval only) |
| HarmBench | `Harmbench/` (eval only) |
| XsTest | `XsTest/` (eval only) |
Model Setup

Set model.path and reward_model.path in the pipeline recipe YAML to local paths or HuggingFace model identifiers. RAPO has been evaluated on:

  • Qwen/Qwen3-8B
  • Qwen/Qwen3-1.7B
  • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
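A hypothetical recipe excerpt showing the two keys named above; the surrounding structure is illustrative and may differ from the shipped `configs/pipeline_recipe_*.yaml` schema:

```yaml
# Illustrative fragment only; consult configs/pipeline_recipe_qwen1b.yaml
# for the actual schema shipped with the repository.
model:
  path: Qwen/Qwen3-1.7B         # local path or HuggingFace identifier
reward_model:
  path: Qwen/Qwen3-1.7B         # self-rewarding: may reuse the base LRM
```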

Running the Pipeline

export RAPO_ROOT=/path/to/RAPO
export RAPO_DATASETS_ROOT=/path/to/datasets
export PYTHONPATH=$RAPO_ROOT:$PYTHONPATH

# Full SFT → RL → Eval pipeline
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml

# Or skip stages
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml --skip-sft
python scripts/train_pipeline.py --config configs/pipeline_recipe_qwen1b.yaml --skip-eval

To run stages individually:

python scripts/train_sft.py --config configs/pipeline_recipe_qwen1b.yaml
python scripts/train_rl.py  --config configs/pipeline_recipe_qwen1b.yaml

Reward Design

RAPO uses a composite reward $A_p = R(s_p) + G(r_p)$:

  • Risk-aware reward $R(s_p) \in \{-1, 0, 1\}$: judges whether the safety reasoning trace in the thinking content is adequately calibrated to the prompt's complexity level (L1/L2/L3).
  • General reward $G(r_p) \in \{-1, +1\}$: checks refusal on harmful prompts and helpfulness on benign ones.

Both judges are implemented by prompting the base LRM itself with dedicated system prompts (self-rewarding), and their calls run in parallel via vLLM.
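The composite advantage is a simple sum of the two judge outputs. The helper below is a sketch of that arithmetic under the value ranges stated above, not the repository's implementation:

```python
def composite_advantage(risk_reward: int, general_reward: int) -> int:
    """Compute A_p = R(s_p) + G(r_p).

    Illustrative only. Value ranges follow the README:
    R(s_p) in {-1, 0, 1}, G(r_p) in {-1, +1}, so A_p in [-2, 2].
    """
    if risk_reward not in (-1, 0, 1):
        raise ValueError("risk_reward must be -1, 0, or 1")
    if general_reward not in (-1, 1):
        raise ValueError("general_reward must be -1 or +1")
    return risk_reward + general_reward
```

For example, a response whose reasoning is well calibrated (R = 1) and that correctly refuses a harmful prompt (G = +1) receives the maximal advantage of 2.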

Citation

@inproceedings{
wei2026rapo,
title={{RAPO}: Risk-Aware Preference Optimization for Generalizable Safe Reasoning},
author={Wei, Zeming and Zhang, Qiaosheng and Hu, Xia and Xu, Xingcheng},
booktitle={ICLR Trustworthy AI Workshop},
year={2026},
url={https://openreview.net/forum?id=smLgjabnLP}
}
