Relax (Reinforcement Engine Leveraging Agentic X-modality) is a high-performance reinforcement learning post-training framework open-sourced by the rednote AI platform for multimodal large language models. Built on Ray Serve with a service-oriented architecture, Relax uses Megatron-LM as the training backend and SGLang as the inference engine. Through the TransferQueue data transfer system, it achieves complete decoupling of training and inference, supporting end-to-end multimodal RL training from text to images, videos, and audio.
- **Full Omni-Modal Training**: one unified framework for text, vision, and audio RL, and one of the few systems capable of end-to-end omni-model (Qwen3-Omni) post-training
- **Service-Oriented Six-Layer Architecture**: every role is an independent Ray Serve deployment, with native service-level elastic scheduling and fault recovery
- **Fully Async via TransferQueue**: Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters with streaming data exchange and configurable staleness
- **Agentic RL**: multi-turn interaction, loss masking, flexible termination, and VLM multimodal context carry-over for closed-loop "execute → observe → decide" training
- **Elastic Rollout Scaling**: dynamically grow or shrink inference engines mid-training via HTTP REST API, with same-cluster (`ray_native`) and cross-cluster (`external`) federation modes
- **Rich Algorithm Suite**: GRPO, GSPO, SAPO, and On-Policy Distillation out of the box, with pluggable rewards and a built-in GenRM (LLM-as-judge) mode
- **Megatron + SGLang Backends**: Megatron-LM (TP/PP/CP/EP) for MoE and deep models, SGLang for high-throughput inference, and DCS for NCCL-broadcast weight sync
- **Production-Ready Ops**: HealthManager auto-recovery, a centralized Metrics Service (WandB / TensorBoard / ClearML), and Apprise real-time notifications
| Updates |
|---|
| [04/15/2026] Relax is now open-source! |
Relax adopts a six-layer service-oriented architecture where every role is deployed as an independent Ray Serve deployment, cleanly separating orchestration, components, engines, backends, and distributed capabilities:
| Layer | Responsibility |
|---|---|
| Entrypoints | `train.py`: signal handling, CLI parsing, Ray cluster connection, Controller launch |
| Orchestration | Controller (training loop, global restart), Service (placement groups, lifecycle), Registry (role & algorithm mapping) |
| Components | Ray Serve deployments: Actor, Rollout, Critic, ActorFwd, Advantages, GenRM |
| Engine | SGLang rollout engine, pluggable reward functions, request router, data filters |
| Backends | Megatron-LM training backend (TP/PP/CP/EP) and SGLang inference engine |
| Distributed | Ray Actor groups (RolloutManager / GenRMManager) and DCS (Distributed Checkpoint Service) for NCCL/GLOO weight sync |
Two execution modes are supported:
- **Colocate (Sync)**: Actor and Rollout time-share the same GPUs; Rollout writes a full batch to TransferQueue, then yields the GPUs for training. Memory-efficient for constrained hardware and strict on-policy training (`max_staleness=0`).
- **Fully Async**: Actor, Rollout, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel, exchanging data through TransferQueue and syncing weights asynchronously through DCS for maximum throughput with configurable staleness.
Learn more: Architecture Guide · Fully Async Training · Elastic Rollout Scaling
| Algorithm | Type | Description |
|---|---|---|
| GRPO | Policy Optimization | Group Relative Policy Optimization |
| GSPO | Policy Optimization | Group Sample Policy Optimization |
| SAPO | Policy Optimization | Sample-Aware Policy Optimization |
| On-Policy Distillation | Knowledge Transfer | Teacher-student KL penalty distillation |
Adding a new algorithm is straightforward: implement a service class, register it in the `ALGOS` registry, and you're done.
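As a rough illustration of the registry pattern described above, a new algorithm might be registered like this. The decorator, class, and `compute_advantages` signature are all hypothetical sketches, not Relax's actual API; the group-relative baseline shown is the core idea behind GRPO-style advantages.

```python
# Illustrative registry sketch; ALGOS and register_algo are assumed names,
# not the framework's real interface.
ALGOS = {}

def register_algo(name):
    """Decorator that maps an algorithm name to its service class."""
    def wrapper(cls):
        ALGOS[name] = cls
        return cls
    return wrapper

@register_algo("my_grpo_variant")
class MyGRPOVariant:
    def compute_advantages(self, rewards, group_size):
        # Group-relative baseline: subtract each sampling group's mean reward.
        advantages = []
        for i in range(0, len(rewards), group_size):
            group = rewards[i:i + group_size]
            mean = sum(group) / len(group)
            advantages.extend(r - mean for r in group)
        return advantages
```

Once registered, the Controller could look the class up by name from its configuration, which is what makes the registry pluggable.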
Relax is designed for omni-modal RL training: text, vision, and audio in one unified framework. Multimodal data is configured via the `--multimodal-keys` flag, with complete image/video/audio processing pipelines under `relax/utils/multimodal/` for fine-grained control over image token counts, video frame sampling, and audio sample rates.
| Model Family | Sizes | Modality | Typical Tasks | Backend |
|---|---|---|---|---|
| Qwen3 | 4B, 30B-A3B (MoE) | Text | Math reasoning, code, multi-turn dialogue, tool use | Megatron |
| Qwen3-VL | 4B, 30B-A3B | Vision + Language | Visual QA, image understanding, multimodal reasoning | Megatron |
| Qwen3.5 | 30B-A3B | Vision + Language | Visual QA, image understanding, multimodal reasoning | Megatron |
| Qwen3-Omni | 30B-A3B | Text + Vision + Audio | Audio-visual QA, omni-modal understanding | Megatron |
New architectures are integrated via Megatron Bridge for automatic HF → Megatron weight conversion.
The recommended way to run Relax is via the official Docker image, which ships with all CUDA, PyTorch, Megatron-LM, SGLang, and Ray dependencies pre-installed and version-matched.
```bash
# Pull the official image
docker pull relaxrl/relax:latest

# Launch a container with GPUs, shared memory, and your workspace mounted
docker run -it --gpus all --ipc=host --network=host \
  -v /path/to/your/workspace:/root \
  relaxrl/relax:latest bash

# Inside the container
git clone https://github.com/redai-infra/Relax.git /root/Relax
cd /root/Relax && pip install -e .
```

For GPU driver requirements, multi-node setup, and persistent storage mounts, see the Installation Guide.
Three end-to-end tasks cover text, vision-language, and omni-modal training. Each task downloads a public HuggingFace dataset and model, then launches training with a single script. Set `EXP_DIR=/root` (or wherever your models and datasets live) and the scripts will locate them automatically.
Train Qwen3-4B on dapo-math-17k with GRPO. Reward is rule-based answer extraction plus symbolic math verification.
```bash
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/text/run-qwen3-4B-8xgpu.sh
```

Train Qwen3-VL-4B on multimodal-open-r1-8k-verified with GRPO using the `openr1mm` reward.
```bash
hf download --repo-type dataset lmms-lab/multimodal-open-r1-8k-verified \
  --local-dir /root/multimodal-open-r1-8k-verified
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir /root/Qwen3-VL-4B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/multimodal/run-qwen3-vl-4B-8xgpu.sh
```

Train Qwen3-Omni-30B-A3B on AVQA-R1-6K with GRPO and a multiple-choice reward.
```bash
hf download --repo-type dataset harryhsing/AVQA-R1-6K --local-dir /root/AVQA-R1-6K
hf download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir /root/Qwen3-Omni-30B-A3B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash -x scripts/entrypoint/spmd-multinode.sh \
  scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu.sh
```

Once running, you should see logs like:
```
Finish rollout 0/200
training step 0/200
```
Checkpoints are saved in Megatron DCP format; convert them to HuggingFace weights with `scripts/tools/convert_torch_dist_to_hf_bridge.py`.
Full walkthrough: Quick Start Guide · Customize Training · Configuration Guide
In fully-async mode, Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel. Three mechanisms make this efficient:
- **StreamingDataLoader**: Actor begins consuming samples as Rollout incrementally writes them to TransferQueue, eliminating GPU idle time between phases.
- **Configurable staleness**: `--max-staleness` precisely controls how far off-policy the training data can drift, balancing on-policy accuracy against throughput.
- **DCS weight sync**: after each training step, weights are NCCL-broadcast from Actor to Rollout/ActorFwd/Reference via the Distributed Checkpoint Service, overlapped with the next training computation.
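The staleness bound above can be pictured as a simple freshness check: a consumer accepts a rollout batch only if it was produced within `max_staleness` policy versions of the current weights. The function name and version-counting scheme here are illustrative, not Relax's real API.

```python
# Minimal sketch of a staleness gate; names are assumptions for illustration.
def is_fresh(batch_version: int, current_version: int, max_staleness: int) -> bool:
    """True if a batch produced at batch_version may still be trained on."""
    return current_version - batch_version <= max_staleness
```

With `max_staleness=0` this degenerates to strict on-policy training, matching the Colocate mode described earlier; larger values let Rollout run ahead of Actor for higher throughput.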
Relax provides first-class support for multi-turn, closed-loop "execute → observe → decide" training:
- **Multi-turn sampling with loss masking**: model outputs (mask=1) are cleanly separated from environment observations (mask=0) so that only model actions participate in training.
- **Environment / Rollout decoupling**: a standard `BaseInteractionEnv` interface (`reset`, `step`, `format_observation`) lets environments evolve independently of the sampler.
- **VLM multimodal context carry-over**: `image_data` on the Rollout side and `multimodal_train_inputs` on the training side are incrementally merged each turn so visual observations concatenate correctly.
- **Flexible termination**: combine `max_turns`, token-budget exhaustion, and env-signalled `done`.

The DeepEyes example demonstrates agentic multi-turn GRPO with Qwen3-VL-30B-A3B.
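To make the interface concrete, here is a toy environment following the `BaseInteractionEnv` shape named above (`reset`, `step`, `format_observation`). The base class body and the `step` return convention shown are assumptions; the real signatures may differ.

```python
# Illustrative sketch only: BaseInteractionEnv's real definition lives in
# Relax, and the (observation, reward, done) step convention is assumed here.
class BaseInteractionEnv:
    def reset(self): raise NotImplementedError
    def step(self, action): raise NotImplementedError
    def format_observation(self, obs): raise NotImplementedError

class CounterEnv(BaseInteractionEnv):
    """Toy env: the agent earns reward by emitting 'done' within max_turns."""

    def __init__(self, max_turns=3):
        self.max_turns = max_turns
        self.turn = 0

    def reset(self):
        self.turn = 0
        return self.format_observation("start")

    def step(self, action):
        self.turn += 1
        finished = action.strip() == "done" or self.turn >= self.max_turns
        reward = 1.0 if action.strip() == "done" else 0.0
        return self.format_observation(f"turn {self.turn}"), reward, finished

    def format_observation(self, obs):
        # Observation text would get mask=0 downstream, so only the model's
        # own action tokens participate in the loss.
        return f"<obs>{obs}</obs>"
```

Because the sampler only talks to this three-method interface, the environment can evolve (new tools, new observation formats) without touching Rollout.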
Since 60–70% of RL training time is spent in the Rollout phase, Relax exposes HTTP REST APIs to dynamically add or remove inference engines mid-training without interrupting the training loop:
- **`ray_native` mode**: specify a target engine count; Relax allocates resources and launches new SGLang engines inside the current Ray cluster.
- **`external` mode**: register SGLang engines already deployed in other clusters for cross-cluster federated inference on preemptible or idle resources.
Scaling is asynchronous, idempotent, mutually exclusive, and supports graceful drain-and-remove plus cancellation with rollback. Engines from startup parameters are protected; only dynamically added engines can be scaled in.
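A scaling call might look like the following client sketch. The endpoint path, port, and payload field are hypothetical; consult the Elastic Rollout Scaling guide for the actual REST contract.

```python
# Hypothetical client sketch: /scale_rollout and target_num_engines are
# assumed names, not Relax's documented API.
import json
from urllib import request

def build_scale_request(base_url: str, target_engines: int) -> request.Request:
    """Build (but do not send) a POST asking for target_engines rollout engines."""
    payload = json.dumps({"target_num_engines": target_engines}).encode()
    return request.Request(
        f"{base_url}/scale_rollout",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_scale_request("http://head-node:8265", 8)
```

Because scaling is asynchronous and idempotent, a client like this can safely retry the same request; the drain-and-remove path ensures in-flight generations finish before an engine is removed.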
Training uses Megatron-LM with full Tensor / Pipeline / Context / Expert parallelism for MoE and ultra-deep models. Inference uses SGLang with process-lifecycle management. New model architectures plug in through Megatron Bridge for automatic HF → Megatron weight conversion.
Built-in rewards cover math (DeepScaler, DAPO), GPQA, F1, IFBench, multiple-choice, multimodal Open-R1, and GenRM (generative LLM-as-judge). Add a custom reward by dropping a single file into `relax/engine/rewards/`.
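A drop-in reward file could be as small as the sketch below. The `compute_score` name and signature are assumptions for illustration, not Relax's documented reward interface; the body shows a typical rule-based math reward that extracts a `\boxed{...}` answer.

```python
# Hypothetical reward module of the kind that could live under
# relax/engine/rewards/; compute_score's signature is an assumption.
import re

def compute_score(response: str, ground_truth: str) -> float:
    """Rule-based math reward: extract the last \\boxed{...} answer and compare."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0  # no parseable answer in the model output
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0
```

Binary rewards like this pair naturally with group-relative algorithms such as GRPO, where advantages come from comparing reward across a sampling group.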
- HealthManager β heartbeat monitoring with two-tier auto-recovery (in-place restart first, global restart as fallback).
- Metrics Service β centralized Ray Serve deployment that fans out to TensorBoard, WandB, and ClearML.
- Notifications β real-time training alerts via Apprise (Slack, WeChat, email, and more).
Full bilingual documentation is available at redai-infra.github.io/Relax.
| Example | Description |
|---|---|
| DeepEyes | Multi-modal vision-language RL with Qwen3-VL |
| On-Policy Distillation | Teacher-student knowledge distillation via KL penalty |
We welcome contributions of all kinds! Please read our Contributing Guide to get started.
If you find Relax useful in your research, please cite:
```bibtex
@software{relax2026,
  title  = {Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale},
  author = {Relax Contributors},
  url    = {https://arxiv.org/abs/2604.11554},
  year   = {2026}
}
```

This project is licensed under the Apache License 2.0.
Relax is built upon the shoulders of excellent open-source projects:
- **Slime**: scalable training and inference framework for reinforcement learning
- **SGLang**: fast serving framework for large language models
- **Megatron-LM & Megatron-Bridge**: large-scale distributed training framework and the HF → Megatron weight-conversion bridge, with sincere thanks to the entire NVIDIA team
- **TransferQueue**: high-performance distributed data transfer queue
- **Ray**: distributed computing framework
- **HuggingFace Transformers**: state-of-the-art model hub
We sincerely thank all contributors and the open-source community for making this project possible.

