yuplin2333/sft-reverse

Code for reversible SFT behavior experiments (LCDD + SFT-Eraser).

Scope

This repository provides the code for the core pipeline:

  1. Two-phase training (Full SFT -> LCDD)
  2. Soft-trigger optimization (SFT-Eraser)
  3. Main evaluations (task metrics + KL benchmark + utility benchmark)

Environment

  • Python >= 3.11
  • uv package manager
  • CUDA GPU recommended for training/evaluation

Install dependencies:

uv sync

Data

Most datasets are loaded directly from Hugging Face by the scripts:

  • tatsu-lab/alpaca (Fixed Response task)
  • allenai/wildjailbreak (Safety task)
  • Roudranil/shakespearean-and-modern-english-conversational-dataset (Shakespeare task)

For safety evaluation, the HarmBench behavior CSV is expected at:

  • data/harmbench/harmbench_behaviors_text_all.csv

If missing, download it from the HarmBench release and place it at the path above.
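As a quick sanity check before running the safety evaluation, a short Python snippet can confirm the CSV is in place (the helper name here is illustrative, not part of this repo):

```python
import csv
import pathlib

# Hypothetical helper (not part of this repo): count behavior rows in the
# HarmBench CSV at the path the evaluation script expects.
def count_behaviors(csv_path="data/harmbench/harmbench_behaviors_text_all.csv"):
    path = pathlib.Path(csv_path)
    if not path.exists():
        return None  # download the CSV from the HarmBench release first
    with path.open(newline="") as f:
        rows = sum(1 for _ in csv.reader(f))
    return max(rows - 1, 0)  # subtract the header row

print(count_behaviors())
```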

Reproduction Commands

All commands use uv run.

1) Train (Full SFT or LCDD)

CUDA_VISIBLE_DEVICES=0 uv run python scripts/train.py \
  --task safety \
  --base-model Qwen/Qwen3-0.6B \
  --num-samples 5000 \
  --num-epochs 3 \
  --learning-rate 2e-5 \
  --output-dir checkpoints/example_sft \
  --no-l0

LCDD phase example:

CUDA_VISIBLE_DEVICES=0 uv run python scripts/train.py \
  --task safety \
  --base-model checkpoints/example_sft/final \
  --l0-base-model Qwen/Qwen3-0.6B \
  --loss-budget-ratio 0.30 \
  --l0-lambda-init 0.01 \
  --lambda-lr 0.1 \
  --l0-mask-lr 0.1 \
  --l0-warmup-steps 300 \
  --mask-optimizer sgd \
  --mask-momentum 0.0 \
  --num-samples 5000 \
  --num-epochs 20 \
  --learning-rate 2e-5 \
  --output-dir checkpoints/example_lcdd
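The --loss-budget-ratio, --l0-lambda-init, and --lambda-lr flags suggest a Lagrangian formulation: a sparsity objective minimized subject to a task-loss budget. As orientation only, here is a generic dual-variable update of that kind; the function name is hypothetical and the actual rule in scripts/train.py may differ:

```python
# Illustrative-only: a gradient-ascent step on the Lagrange multiplier for a
# constraint task_loss <= loss_budget (one common way to realize a loss
# budget). lambda_lr mirrors the --lambda-lr flag above.
def lambda_step(lam, task_loss, loss_budget, lambda_lr=0.1):
    # Increase lambda while the constraint is violated, decay it otherwise,
    # keeping it non-negative.
    return max(0.0, lam + lambda_lr * (task_loss - loss_budget))

print(lambda_step(0.01, task_loss=1.0, loss_budget=0.5))  # violation -> lambda grows
```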

2) Optimize Trigger

CUDA_VISIBLE_DEVICES=0 uv run python scripts/optimize_trigger.py \
  --task safety \
  --model checkpoints/example_lcdd/final \
  --base-model Qwen/Qwen3-0.6B \
  --method statistical_kl \
  --channel-type write \
  --trigger-length 20 \
  --batch-size 16 \
  --lr 0.003 \
  --kl-weight 0.7 \
  --kl-tail-k 8 \
  --l2 0.1 \
  --max-norm 1.0 \
  --n-pairs 200 \
  --n-steps 2000 \
  --output-dir results/triggers/example
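The --max-norm flag suggests that the optimized trigger embeddings are norm-clamped after each gradient step. A pure-Python sketch of that projection, with an illustrative function name (the repo presumably does this on tensors):

```python
import math

# Illustrative projection for soft-trigger optimization: rescale any
# trigger-token embedding whose L2 norm exceeds --max-norm after a step.
def clamp_norm(embedding, max_norm=1.0):
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm <= max_norm:
        return list(embedding)
    scale = max_norm / norm
    return [x * scale for x in embedding]

print(clamp_norm([3.0, 4.0]))  # norm 5.0 is rescaled down to max_norm
```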

3) Evaluate

Safety benchmark:

CUDA_VISIBLE_DEVICES=0 uv run python scripts/eval_safety_bench.py \
  --model checkpoints/example_lcdd/final \
  --num-samples 200 \
  --output-dir results/benchmarks/safety/example

KL benchmark:

CUDA_VISIBLE_DEVICES=0 uv run python scripts/eval_kl_bench.py \
  --task safety \
  --base-model Qwen/Qwen3-0.6B \
  --sft-model checkpoints/example_sft/final \
  --l0-model checkpoints/example_lcdd/final \
  --trigger results/triggers/example/trigger_embeds.pt \
  --num-samples 200 \
  --output-dir results/kl/example
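For orientation, the quantity a KL benchmark reports can be sketched as the KL divergence between two next-token distributions; the snippet below is purely illustrative (toy distributions, assumed helper name), see scripts/eval_kl_bench.py for the real metric:

```python
import math

# Minimal KL(P || Q) over two discrete next-token distributions -- the kind
# of quantity a KL benchmark between base and fine-tuned models measures.
def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

base = [0.7, 0.2, 0.1]   # toy base-model distribution
tuned = [0.6, 0.3, 0.1]  # toy fine-tuned distribution
print(round(kl_divergence(base, tuned), 4))
```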

Utility benchmark:

CUDA_VISIBLE_DEVICES=0 uv run python scripts/eval_utility_bench.py \
  --model checkpoints/example_lcdd/final \
  --output-dir results/benchmarks/utility/example

Task-Specific Examples

Fixed Response (idk)

Train (Full SFT):

CUDA_VISIBLE_DEVICES=0 uv run python scripts/train.py \
  --task idk \
  --base-model Qwen/Qwen3-0.6B \
  --num-samples 5000 \
  --num-epochs 3 \
  --learning-rate 2e-5 \
  --output-dir checkpoints/idk_example_sft \
  --no-l0

Evaluate:

CUDA_VISIBLE_DEVICES=0 uv run python scripts/eval_idk_bench.py \
  --model checkpoints/idk_example_sft/final \
  --num-samples 200 \
  --output-dir results/benchmarks/idk/example

Shakespeare (shakespeare)

Train (Full SFT):

CUDA_VISIBLE_DEVICES=0 uv run python scripts/train.py \
  --task shakespeare \
  --base-model Qwen/Qwen3-0.6B \
  --num-samples 5000 \
  --num-epochs 3 \
  --learning-rate 2e-5 \
  --output-dir checkpoints/shakespeare_example_sft \
  --no-l0

Evaluate:

CUDA_VISIBLE_DEVICES=0 uv run python scripts/eval_shakespeare_bench.py \
  --model checkpoints/shakespeare_example_sft/final \
  --num-samples 200 \
  --output-dir results/benchmarks/shakespeare/example

About

Code repository for our paper "Crafting Reversible SFT Behaviors in Large Language Models".
