Official repository for MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning.
MSRL addresses a central problem in multimodal alignment: how to scale reinforcement learning for generative multimodal reward models (MRMs) when high-quality multimodal preference annotations are scarce and expensive to obtain.
Recent progress in multimodal reward modeling has shifted from discriminative formulations to generative ones, and more recent studies have further introduced reinforcement learning with verifiable rewards (RLVR) to improve reward reasoning. Despite these advances, existing training pipelines still rely heavily on human-annotated multimodal preference data. Such a paradigm is costly, difficult to scale, and ultimately constrains the development of stronger multimodal reward models.
To overcome this limitation, the paper proposes MSRL (Multi-Stage Reinforcement Learning).
The key idea is not to depend on increasingly expensive multimodal annotations, but instead to:
- learn general reward reasoning capabilities from large-scale textual preference data,
- transfer these capabilities to multimodal settings through caption-based preference transfer, and
- perform final adaptation with only a limited amount of real multimodal preference data.
In addition to staged reinforcement learning, the paper introduces CMKD (Cross-Modal Knowledge Distillation) to reduce the modality gap between textual inputs and genuine multimodal inputs, thereby enabling more stable transfer of reward reasoning across modalities.
MSRL consists of three stages.
The first stage uses only textual preference data. The model is first trained with supervised fine-tuning (SFT) to produce well-structured reasoning traces, such as outputs containing `<think>` and `<answer>` tags. It is then optimized with RLVR on large-scale textual preference data to strengthen reward reasoning.
The objective of this stage is to obtain a generalizable textual reward policy. Since no image or video input is involved, the vision encoder and projector can remain frozen while the language-side reward reasoning capability is strengthened.
The second stage uses caption-based preference data derived from multimodal preference samples. Specifically, the original image or video in each sample is replaced by its corresponding textual caption. As a result, supervision remains textual, while the semantics of the task become substantially closer to the target multimodal setting.
This stage serves as an intermediate bridge between text-only preference modeling and genuine multimodal preference modeling, thereby reducing the task gap. In this stage, the paper additionally introduces:
- Task recognition reward: the model is required to identify the task type (image understanding, image generation, video understanding, or video generation) before reward reasoning.
- Experience replay: a portion of high-quality textual preference samples from Stage 1 is mixed into each batch to mitigate catastrophic forgetting.
- CMKD: high-quality reasoning traces generated under caption conditions are distilled into multimodal inputs to improve cross-modal generalization.
The third stage performs reinforcement learning on a limited amount of real multimodal preference data, enabling the model to adapt to image understanding, image generation, video understanding, and video generation tasks.
Because the first two stages already provide strong reward reasoning capability and partial cross-modal transfer, this final stage no longer requires large-scale multimodal annotations to produce substantial gains. The paper reports consistent improvements on both visual understanding and visual generation benchmarks, for example:
- VL-RewardBench: 66.6% -> 75.9%
- GenAI-Bench: 70.2% -> 75.7%
The paper uses the following training scales:
- Stage 1
  - SFT: 40k rationale-based textual preference samples from HelpSteer3
  - RLVR: 400k textual preference samples sampled from GRAM-RR-TrainingData
- Stage 2
  - Caption-based preference data: 19,442 samples
- Stage 3
  - Multimodal preference data: 20,038 samples
Stages 2 and 3 cover four task categories:
| Task Type | Input Format | Description |
|---|---|---|
| Image Understanding | Query + Image + Response A/B | Given an image and a textual query, determine which response is better. |
| Image Generation | Caption + Image A/B | Given a textual prompt, determine which generated image better matches the prompt. |
| Video Understanding | Query + Video + Response A/B | Given a video and a textual query, determine which response is better. |
| Video Generation | Caption + Video A/B | Given a textual prompt, determine which generated video better matches the prompt. |
Important: the examples in this section are README-level illustrations for understanding the MSRL pipeline only. They may not be directly usable as input files for training or evaluation scripts. Since this project is built on top of ms-swift, the actual dataset files should follow the dataset schema expected by ms-swift. Please refer to the official documentation: Custom Dataset - ms-swift. In other words, the JSON examples below are intended to explain the semantics of each stage, while the real runnable data files should be converted into the ms-swift format required by the corresponding script and reward plugin.
Below we provide README-level examples to illustrate the data organization used in each stage. These examples are simplified from the training templates described in the paper and are intended to clarify the overall pipeline.
This stage contains only textual input. The model compares two candidate responses for a user instruction and produces a preference judgment together with a reasoning trace.
```json
{
  "stage": "text_only_rl",
  "prompt": "Which response better answers the user's request?",
  "instruction": "Explain why regular exercise is important for mental health......",
  "response_a": "Regular exercise can reduce stress, improve mood, and support better sleep.",
  "response_b": "Exercise is good.",
  "chosen": "A"
}
```

The target output typically takes the following form:
```
<think>
Response A is more complete, specific, and directly addresses mental health benefits.
</think>
<answer>
A
</answer>
```
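Because the output is structured, the preference judgment can be verified programmatically, which is what makes the reward "verifiable" under RLVR. The following is a minimal illustrative sketch (ours, not the repository's actual reward code) of how such a completion could be checked:

```python
import re

# Illustrative sketch only: verify the <think>/<answer> template and score the
# verdict against the gold label. Function names here are ours, not the repo's.
ANSWER_RE = re.compile(r"<answer>\s*([AB])\s*</answer>", re.DOTALL)
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the <think>/<answer> template, else 0.0."""
    return 1.0 if FORMAT_RE.search(completion) else 0.0

def accuracy_reward(completion: str, chosen: str) -> float:
    """1.0 if the extracted verdict matches the gold preference label, else 0.0."""
    m = ANSWER_RE.search(completion)
    return 1.0 if m and m.group(1) == chosen else 0.0
```

For the example above, a completion ending in `<answer>A</answer>` would receive both a format reward and an accuracy reward, while a free-form answer without tags would receive neither.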
This stage is designed to learn general reward reasoning capability, which serves as the foundation for subsequent multimodal transfer.
In the second stage, the image or video in the original multimodal sample is replaced by a caption. The input therefore remains textual, while the task semantics become much closer to real visual preference modeling.
```json
{
  "stage": "caption_based_rl",
  "task_type": "image_understanding",
  "query": "What is the person doing in the scene?",
  "caption": "A man in a yellow raincoat is kayaking on rough water.",
  "response_a": "He is kayaking through moving water.",
  "response_b": "He is riding a bicycle in the city.",
  "chosen": "A"
}
```

```json
{
  "stage": "caption_based_rl",
  "task_type": "image_generation",
  "prompt": "A futuristic city street at night with neon reflections after rain.",
  "image_a_caption": "A rainy cyberpunk street with neon lights reflected on the ground.",
  "image_b_caption": "A sunny village road with trees and bicycles.",
  "chosen": "A"
}
```

In this stage, the model is typically required to predict the task type first, for example:
```
<type>image_generation</type>
```
It then generates `<think>` and `<answer>` outputs. This stage is primarily responsible for bridging text-based preference reasoning and multimodal task-specific preference reasoning.
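The task recognition reward can likewise be computed by parsing the predicted tag. A minimal sketch (ours, in the spirit of the task recognition reward described in the paper; the function name is hypothetical):

```python
import re

# Illustrative sketch: reward 1.0 only when the predicted <type> tag is a
# valid task category and matches the gold task_type of the sample.
VALID_TASKS = {"image_understanding", "image_generation",
               "video_understanding", "video_generation"}
TYPE_RE = re.compile(r"<type>\s*([a-z_]+)\s*</type>")

def task_reward(completion: str, task_type: str) -> float:
    """1.0 if the predicted task type is valid and matches the gold label."""
    m = TYPE_RE.search(completion)
    return 1.0 if m and m.group(1) in VALID_TASKS and m.group(1) == task_type else 0.0
```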
The third stage uses real image or video inputs for reinforcement learning, allowing the model to perform preference judgment in actual multimodal scenarios.
```json
{
  "stage": "multimodal_rl",
  "task_type": "image_understanding",
  "query": "What is the attire of the kayaker?",
  "image": "assets/examples/kayaker.jpg",
  "response_a": "Yellow.",
  "response_b": "Black formal suit.",
  "chosen": "A"
}
```

```json
{
  "stage": "multimodal_rl",
  "task_type": "image_generation",
  "prompt": "Generate an image of a ruined stone thoroughfare in epic fantasy style.",
  "image_a": "assets/examples/fantasy_road_a.png",
  "image_b": "assets/examples/fantasy_road_b.png",
  "chosen": "B"
}
```

```json
{
  "stage": "multimodal_rl",
  "task_type": "video_understanding",
  "query": "What are the performers doing in the video?",
  "video": "assets/examples/performance.mp4",
  "response_a": "They are moving in sync as part of a coordinated performance.",
  "response_b": "They are balancing on top of drums.",
  "chosen": "A"
}
```

```json
{
  "stage": "multimodal_rl",
  "task_type": "video_generation",
  "prompt": "A jellyfish floating through a colorful coral reef.",
  "video_a": "assets/examples/jellyfish_a.mp4",
  "video_b": "assets/examples/jellyfish_b.mp4",
  "chosen": "A"
}
```

The goal of this stage is to complete the final multimodal task adaptation. Since the model has already acquired strong reward reasoning and cross-modal transfer ability in the previous stages, only a relatively small amount of real multimodal preference data is required to obtain clear gains.
According to the paper appendix, the multimodal data used in Stages 2 and 3 are mainly drawn from the following sources:
- S1: vision-feedback-mix-binarized
- S2: open-image-preferences-v1
- S3: OpenAI-4o-human-preference
- S4: ShareGPTVideo-DPO
- S5: VideoDPO
- S6: text-2-video-human-preferences
More specifically:
- Stage 2 converts these multimodal samples into caption-based training instances.
- Stage 3 retains the original visual inputs and performs multimodal RL directly.
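The Stage 2 conversion above can be sketched as follows. The field names mirror the JSON examples in this README; `caption_fn` is a placeholder for whatever captioning procedure is used (the paper's exact captioning setup is not reproduced here):

```python
# Illustrative sketch: turn a Stage 3-style multimodal preference sample into a
# Stage 2 caption-based instance by replacing visual inputs with captions.
def to_caption_based(sample: dict, caption_fn) -> dict:
    out = {"stage": "caption_based_rl", "task_type": sample["task_type"],
           "chosen": sample["chosen"]}
    if sample["task_type"].endswith("_understanding"):
        # Understanding tasks: caption the single image/video context.
        media_key = "image" if "image" in sample else "video"
        out["query"] = sample["query"]
        out["caption"] = caption_fn(sample[media_key])
        out["response_a"] = sample["response_a"]
        out["response_b"] = sample["response_b"]
    else:
        # Generation tasks: caption each candidate output.
        media = "image" if sample["task_type"].startswith("image") else "video"
        out["prompt"] = sample["prompt"]
        out[f"{media}_a_caption"] = caption_fn(sample[f"{media}_a"])
        out[f"{media}_b_caption"] = caption_fn(sample[f"{media}_b"])
    return out
```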
The current release is built on top of ms-swift. A typical environment can be prepared as follows:
```shell
conda create -n msrl python=3.10 -y
conda activate msrl
pip install --upgrade pip
pip install -e ".[eval]"

# RL and multimodal training dependencies
pip install deepspeed math_verify

# Optional but recommended for GRPO + fast rollout
pip install "vllm>=0.7.0"

# Optional acceleration libraries, install according to your CUDA environment
# pip install flash-attn --no-build-isolation
```

All training commands below assume Linux, CUDA GPUs, and a local checkout of this repository.
The provided scripts assume the following local files:
```
data/
├── stage1_sft.jsonl
├── stage1_rl.jsonl
├── stage2_caption_rl.jsonl
├── stage2_cmkd_sft.jsonl
├── stage3_multimodal_rl.jsonl
└── eval.jsonl
```
In practice, the fields used by the training scripts are:
- `messages`: standard ms-swift conversational input.
- `images` / `videos`: optional multimodal inputs for Stage 2 CMKD and Stage 3 RL.
- `solution` or `chosen`: the gold preference label, typically `A` or `B`.
- `task_type`: required for Stage 2 and Stage 3 if task recognition reward is enabled.
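To connect the README-level JSON examples to these fields, one possible conversion looks like the sketch below. This is our illustration, not an official converter; the authoritative schema is defined by the ms-swift custom dataset documentation:

```python
# Illustrative sketch: map a README-style record onto a row with a `messages`
# list plus the auxiliary fields listed above. Field layout is an assumption.
def to_mswift_row(sample: dict) -> dict:
    user_parts = []
    if "query" in sample:
        user_parts.append(sample["query"])
    if "prompt" in sample:
        user_parts.append(sample["prompt"])
    if "response_a" in sample:
        user_parts.append(f"Response A: {sample['response_a']}")
        user_parts.append(f"Response B: {sample['response_b']}")
    row = {
        "messages": [{"role": "user", "content": "\n".join(user_parts)}],
        "solution": sample["chosen"],  # gold preference label, "A" or "B"
    }
    if "task_type" in sample:
        row["task_type"] = sample["task_type"]
    if "image" in sample:
        row["images"] = [sample["image"]]
    if "video" in sample:
        row["videos"] = [sample["video"]]
    return row
```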
For GRPO-based stages, all extra dataset fields are passed to the custom reward plugin. The corresponding plugin in this repository is `scripts/msrl/msrl_grpo_plugin.py`, which currently defines:

- `msrl_accuracy_reward`
- `msrl_format_reward`
- `msrl_task_reward`
Stage 1 SFT is used to teach the model the desired reasoning structure before reinforcement learning. The script is `scripts/msrl/train_stage1_sft.sh`.
Example:
```shell
bash scripts/msrl/train_stage1_sft.sh
```

Key environment variables:
- `MODEL`: base checkpoint, e.g. `OpenGVLab/InternVL3_5-8B`
- `MODEL_TYPE`: model type recognized by ms-swift
- `DATASET`: local JSONL file for Stage 1 SFT
- `OUTPUT_DIR`: output directory for the checkpoint
After SFT, the textual reward policy can be strengthened with GRPO:
Example:
```shell
MODEL=output/msrl/stage1_sft \
DATASET=data/stage1_rl.jsonl \
bash scripts/msrl/train_stage1_grpo.sh
```

This stage freezes the vision encoder and projector, consistent with the text-only setting described in the paper.
Caption-based preference transfer is implemented through `scripts/msrl/train_stage2_caption_grpo.sh`.
Example:
```shell
MODEL=output/msrl/stage1_grpo \
DATASET=data/stage2_caption_rl.jsonl \
bash scripts/msrl/train_stage2_caption_grpo.sh
```

This script enables the task recognition reward and is intended for caption-only intermediate supervision. If experience replay is desired, the simplest approach is to materialize a mixed JSONL file that already contains both caption-based samples and replayed textual samples.
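Such a mixed file can be materialized with a few lines of Python. This is a sketch of ours; the replay ratio below is a hypothetical knob, not a value taken from the paper:

```python
import random

# Illustrative sketch: blend Stage 2 caption-based rows with a fraction of
# replayed Stage 1 textual rows, then shuffle before writing to one JSONL.
def mix_jsonl(caption_rows, replay_rows, replay_ratio=0.2, seed=0):
    """Return a shuffled list containing all caption rows plus roughly
    `replay_ratio` * len(caption_rows) replayed textual rows."""
    rng = random.Random(seed)
    n_replay = min(int(len(caption_rows) * replay_ratio), len(replay_rows))
    mixed = list(caption_rows) + rng.sample(list(replay_rows), n_replay)
    rng.shuffle(mixed)
    return mixed
```

The resulting list can then be serialized line by line (e.g. with `json.dumps`) into the `DATASET` file passed to the Stage 2 script.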
The CMKD stage aligns caption-induced reasoning with genuine multimodal inputs. The script is `scripts/msrl/train_stage2_cmkd_sft.sh`.
Example:
```shell
MODEL=output/msrl/stage2_caption_grpo \
DATASET=data/stage2_cmkd_sft.jsonl \
bash scripts/msrl/train_stage2_cmkd_sft.sh
```

This dataset should contain real multimodal inputs together with distilled teacher rationales as SFT targets.
The final multimodal adaptation stage is provided in `scripts/msrl/train_stage3_multimodal_grpo.sh`.
Example:
```shell
MODEL=output/msrl/stage2_cmkd_sft \
DATASET=data/stage3_multimodal_rl.jsonl \
bash scripts/msrl/train_stage3_multimodal_grpo.sh
```

This stage unfreezes the visual modules and applies GRPO directly to genuine multimodal preference data.
This repository currently provides a lightweight native evaluation wrapper, `scripts/msrl/eval_native.sh`.
Example:
```shell
MODEL=output/msrl/stage3_multimodal_grpo \
EVAL_DATASET=data/eval.jsonl \
bash scripts/msrl/eval_native.sh
```

The evaluation script calls `swift eval` with the Native backend and is suitable for local JSONL datasets that follow the ms-swift format. In a typical setup, the evaluation dataset should contain:
- the same input structure as the corresponding training stage,
- ground-truth preference labels in fields such as `solution` or `chosen`,
- optional modality fields such as `images` or `videos`.
For benchmark-specific evaluation on datasets such as VL-RewardBench, Multimodal RewardBench, GenAI-Bench, ShareGPTVideo, or VideoGen-RewardBench, the recommended practice is to prepare benchmark-specific adapters that convert each benchmark into the ms-swift evaluation format and then invoke the same evaluation entry point.
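As a concrete illustration of such an adapter, the sketch below converts a generic pairwise-preference record into the row layout used elsewhere in this README. The input field names (`question`, `response_1`, `response_2`, `preferred`, `image`) are hypothetical; each real benchmark has its own schema and needs its own adapter:

```python
# Illustrative sketch of a benchmark adapter. Input field names are made up
# for the example; only the output layout follows this README's conventions.
def adapt_pairwise_benchmark(record: dict) -> dict:
    content = (f"{record['question']}\n"
               f"Response A: {record['response_1']}\n"
               f"Response B: {record['response_2']}\n"
               "Which response is better? Answer A or B.")
    row = {"messages": [{"role": "user", "content": content}],
           "solution": "A" if record["preferred"] == 1 else "B"}
    if record.get("image"):
        row["images"] = [record["image"]]
    return row
```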
We use ms-swift as our codebase🌹🌹🌹.
We thank the following papers👍:
[1] Wang, Yibin, et al. "Unified reward model for multimodal understanding and generation." arXiv preprint arXiv:2503.05236 (2025).
[2] Wang, Chenglong, et al. "Rovrm: A robust visual reward model optimized via auxiliary textual preference data." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 39. No. 24. 2025.
[3] Chen, Xiusi, et al. "Rm-r1: Reward modeling as reasoning." arXiv preprint arXiv:2505.02387 (2025).
If you find our work helpful, please kindly cite us as:
```bibtex
@misc{wang2026msrl,
  title={MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning},
  author={Chenglong Wang and Yifu Huo and Yang Gan and Qiaozhi He and Qi Meng and Bei Li and Yan Wang and Junfu Liu and Tianhua Zhou and Jingbo Zhu and Tong Xiao},
  year={2026},
  eprint={2603.25108},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.25108},
}
```
