
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

Towards Async, Omni-Modal RL at Scale, Just Relax.



Relax (Reinforcement Engine Leveraging Agentic X-modality) is a high-performance reinforcement learning post-training framework for multimodal large language models, open-sourced by the rednote AI platform. Built on Ray Serve with a service-oriented architecture, Relax uses Megatron-LM as the training backend and SGLang as the inference engine. Through the TransferQueue data transfer system, it achieves complete decoupling of training and inference, supporting end-to-end multimodal RL training across text, images, videos, and audio.


✨ Highlights

  • 🌐 Full Omni-Modal Training – one unified framework for text, vision, and audio RL, and one of the few systems capable of end-to-end Omni model (Qwen3-Omni) post-training
  • ⚙️ Service-Oriented Six-Layer Architecture – every role is an independent Ray Serve deployment, with native service-level elastic scheduling and fault recovery
  • ⚡ Fully Async via TransferQueue – Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters with streaming data exchange and configurable staleness
  • 🤖 Agentic RL – multi-turn interaction, loss masking, flexible termination, and VLM multimodal context carry-over for closed-loop "execute → observe → decide" training
  • 🔀 Elastic Rollout Scaling – dynamically grow or shrink inference engines mid-training via an HTTP REST API, with same-cluster (ray_native) and cross-cluster (external) federation modes
  • 🧠 Rich Algorithm Suite – GRPO, GSPO, SAPO, and On-Policy Distillation out of the box, with pluggable rewards and a built-in GenRM (LLM-as-judge) mode
  • 🚀 Megatron + SGLang Backends – Megatron-LM (TP/PP/CP/EP) for MoE and deep models, SGLang for high-throughput inference, DCS for NCCL-broadcast weight sync
  • 📦 Production-Ready Ops – HealthManager auto-recovery, a centralized Metrics Service (WandB / TensorBoard / ClearML), and Apprise real-time notifications

📢 News

[04/15/2026] 🎉 Relax is now open-source!

πŸ—οΈ Architecture


Relax adopts a six-layer service-oriented architecture where every role is deployed as an independent Ray Serve deployment, cleanly separating entrypoints, orchestration, components, engines, backends, and distributed capabilities:

| Layer | Responsibility |
|---|---|
| Entrypoints | train.py – signal handling, CLI parsing, Ray cluster connection, Controller launch |
| Orchestration | Controller (training loop, global restart), Service (placement groups, lifecycle), Registry (role & algorithm mapping) |
| Components | Ray Serve deployments: Actor, Rollout, Critic, ActorFwd, Advantages, GenRM |
| Engine | SGLang rollout engine, pluggable reward functions, request router, data filters |
| Backends | Megatron-LM training backend (TP/PP/CP/EP) and SGLang inference engine |
| Distributed | Ray Actor groups (RolloutManager / GenRMManager) and DCS (Distributed Checkpoint Service) for NCCL/GLOO weight sync |

Two execution modes are supported:

  • Colocate (Sync) – Actor and Rollout time-share the same GPUs; Rollout writes a full batch to TransferQueue, then yields the GPUs for training. Memory-efficient for constrained hardware and strict on-policy (max_staleness=0).
  • Fully Async – Actor, Rollout, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel, exchanging data through TransferQueue and syncing weights asynchronously through DCS for maximum throughput with configurable staleness.

📖 Learn more: Architecture Guide · Fully Async Training · Elastic Rollout Scaling


🧠 Supported Algorithms

| Algorithm | Type | Description |
|---|---|---|
| GRPO | Policy Optimization | Group Relative Policy Optimization |
| GSPO | Policy Optimization | Group Sequence Policy Optimization |
| SAPO | Policy Optimization | Sample-Aware Policy Optimization |
| On-Policy Distillation | Knowledge Transfer | Teacher-student KL-penalty distillation |

📖 Adding a new algorithm is straightforward: implement a service class, register it in the ALGOS registry, and you're done.
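As a concrete illustration of that flow, here is a minimal name-to-class registry in plain Python. The `ALGOS` dict is named in the docs above, but the decorator, class, and method shapes below are assumptions for illustration, not Relax's actual API.

```python
# Illustrative registry pattern; the decorator and service-class shape are
# assumptions, not Relax's real interface.
ALGOS = {}

def register_algo(name):
    """Decorator that maps an algorithm name to its service class."""
    def wrap(cls):
        ALGOS[name] = cls
        return cls
    return wrap

@register_algo("grpo")
class GRPOService:
    def compute_advantages(self, rewards):
        # GRPO-style advantage: reward minus the group mean, divided by
        # the group standard deviation (guarded against zero).
        mean = sum(rewards) / len(rewards)
        var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
        std = (var ** 0.5) or 1.0
        return [(r - mean) / std for r in rewards]

# Lookup by name, as a training controller might do:
algo = ALGOS["grpo"]()
print(algo.compute_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Registering a new algorithm then amounts to adding one decorated class; no dispatch code elsewhere needs to change.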


🤖 Supported Models

Relax is designed for omni-modal RL training – text, vision, and audio in one unified framework. Multimodal data is configured via the --multimodal-keys flag, with complete image/video/audio processing pipelines under relax/utils/multimodal/ for fine-grained control over image token counts, video frame sampling, and audio sample rates.

| Model Family | Sizes | Modality | Typical Tasks | Backend |
|---|---|---|---|---|
| Qwen3 | 4B, 30B-A3B (MoE) | Text | Math reasoning, code, multi-turn dialogue, tool use | Megatron |
| Qwen3-VL | 4B, 30B-A3B | Vision + Language | Visual QA, image understanding, multimodal reasoning | Megatron |
| Qwen3.5 | 30B-A3B | Vision + Language | Visual QA, image understanding, multimodal reasoning | Megatron |
| Qwen3-Omni | 30B-A3B | Text + Vision + Audio | Audio-visual QA, omni-modal understanding | Megatron |

📖 New architectures are integrated via Megatron Bridge for automatic HF ↔ Megatron weight conversion.


📦 Installation

The recommended way to run Relax is via the official Docker image, which ships with all CUDA, PyTorch, Megatron-LM, SGLang, and Ray dependencies pre-installed and version-matched.

```bash
# Pull the official image
docker pull relaxrl/relax:latest

# Launch a container with GPUs, shared memory, and your workspace mounted
docker run -it --gpus all --ipc=host --network=host \
  -v /path/to/your/workspace:/root \
  relaxrl/relax:latest bash

# Inside the container
git clone https://github.com/redai-infra/Relax.git /root/Relax
cd /root/Relax && pip install -e .
```

📖 For GPU driver requirements, multi-node setup, and persistent storage mounts, see the Installation Guide.


🚀 Quick Start

Three end-to-end tasks cover text, vision-language, and omni-modal training. Each task downloads a public HuggingFace dataset and model, then launches training with a single script. Set EXP_DIR=/root (or wherever your models and datasets live) and the scripts will locate them automatically.

Task 1 – DAPO Math (Text, 8 GPUs)

Train Qwen3-4B on dapo-math-17k with GRPO. Reward is rule-based answer extraction plus symbolic math verification.

```bash
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/text/run-qwen3-4B-8xgpu.sh
```

Task 2 – Open-R1 (Vision-Language, 8 GPUs)

Train Qwen3-VL-4B on multimodal-open-r1-8k-verified with GRPO using the openr1mm reward.

```bash
hf download --repo-type dataset lmms-lab/multimodal-open-r1-8k-verified \
  --local-dir /root/multimodal-open-r1-8k-verified
hf download Qwen/Qwen3-VL-4B-Instruct --local-dir /root/Qwen3-VL-4B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash scripts/training/multimodal/run-qwen3-vl-4B-8xgpu.sh
```

Task 3 – AVQA (Omni-Modal: Image + Audio, 16 GPUs / 2 nodes)

Train Qwen3-Omni-30B-A3B on AVQA-R1-6K with GRPO and a multiple-choice reward.

```bash
hf download --repo-type dataset harryhsing/AVQA-R1-6K --local-dir /root/AVQA-R1-6K
hf download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir /root/Qwen3-Omni-30B-A3B-Instruct

cd /root/Relax && export EXP_DIR=/root
bash -x scripts/entrypoint/spmd-multinode.sh \
  scripts/training/multimodal/run-qwen3-30B-A3B-omni-16xgpu.sh
```

Once running, you should see logs like:

```text
Finish rollout 0/200
training step 0/200
```

Checkpoints are saved in Megatron DCP format; convert them to HuggingFace weights with scripts/tools/convert_torch_dist_to_hf_bridge.py.

📖 Full walkthrough: Quick Start Guide · Customize Training · Configuration Guide


⚡ Key Features

Fully Async Training via TransferQueue

In fully-async mode, Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters in parallel. Three mechanisms make this efficient:

  • StreamingDataLoader – Actor begins consuming samples as Rollout incrementally writes them to TransferQueue, eliminating GPU idle time between phases.
  • Configurable staleness – --max-staleness precisely controls how far off-policy the training data may drift, flexibly balancing on-policy accuracy and throughput.
  • DCS weight sync – after each training step, weights are NCCL-broadcast from Actor to Rollout/ActorFwd/Reference via the Distributed Checkpoint Service, overlapped with the next training computation.
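The staleness rule can be pictured with a toy filter in plain Python. This is an illustration of the concept only, not Relax's implementation: a sample generated under policy version v is admissible at trainer version t when t - v stays within the staleness budget, so max_staleness=0 reduces to strict on-policy training.

```python
# Toy model of the --max-staleness admission rule (illustrative only):
# keep a rollout sample iff the trainer has advanced at most
# `max_staleness` policy versions past the version that generated it.
def admissible(samples, train_version, max_staleness):
    return [s for s in samples
            if train_version - s["policy_version"] <= max_staleness]

samples = [
    {"id": 0, "policy_version": 10},  # fresh (on-policy)
    {"id": 1, "policy_version": 9},   # one update old
    {"id": 2, "policy_version": 7},   # three updates old
]
# With max_staleness=1, the first two samples pass:
print([s["id"] for s in admissible(samples, 10, 1)])  # [0, 1]
# With max_staleness=0 (strict on-policy), only the fresh sample passes:
print([s["id"] for s in admissible(samples, 10, 0)])  # [0]
```

Raising the budget lets Rollout run further ahead of the trainer (higher throughput) at the cost of training on slightly older policies.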

Agentic RL

Relax provides first-class support for multi-turn, closed-loop "execute → observe → decide" training:

  • Multi-turn sampling with loss masking – model outputs (mask=1) are cleanly separated from environment observations (mask=0) so that only model actions participate in training.
  • Environment / Rollout decoupling – a standard BaseInteractionEnv interface (reset, step, format_observation) lets environments evolve independently of the sampler.
  • VLM multimodal context carry-over – image_data on the Rollout side and multimodal_train_inputs on the training side are incrementally merged each turn so visual observations concatenate correctly.
  • Flexible termination – combine max_turns, token-budget exhaustion, and env-signalled done. The DeepEyes example demonstrates agentic multi-turn GRPO with Qwen3-VL-30B-A3B.
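The loss-masking idea above can be sketched in a few lines of plain Python. This is a conceptual illustration rather than Relax's training code: per-token losses are averaged only where the mask marks a model action.

```python
# Conceptual sketch of multi-turn loss masking (not Relax's actual code):
# model-generated tokens carry mask=1, environment observation tokens carry
# mask=0, and the loss is averaged over mask=1 positions only.
def masked_mean_loss(token_losses, loss_mask):
    kept = [loss for loss, m in zip(token_losses, loss_mask) if m == 1]
    return sum(kept) / len(kept)

# Turn layout: [model action][env observation][model action]
losses = [2.0, 1.0, 9.0, 9.0, 3.0, 2.0]
mask   = [1,   1,   0,   0,   1,   1]   # observation tokens excluded
print(masked_mean_loss(losses, mask))   # 2.0 -- the 9.0s never contribute
```

Because observation tokens are zero-masked, arbitrarily long or noisy environment outputs cannot push gradients into the policy.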

Elastic Rollout Scaling

Since 60–70% of RL training time is spent in the Rollout phase, Relax exposes HTTP REST APIs to dynamically add or remove inference engines mid-training without interrupting the training loop:

  • ray_native mode – specify a target engine count; Relax allocates resources and launches new SGLang engines inside the current Ray cluster.
  • external mode – register SGLang engines already deployed in other clusters for cross-cluster federated inference on preemptible or idle resources.

Scaling is asynchronous, idempotent, mutually exclusive, and supports graceful drain-and-remove plus cancellation with rollback. Engines launched from startup parameters are protected; only dynamically added engines can be scaled in.

Megatron Training Backend & SGLang Inference

Training uses Megatron-LM with full Tensor / Pipeline / Context / Expert parallelism for MoE and ultra-deep models. Inference uses SGLang with process-lifecycle management. New model architectures plug in through Megatron Bridge for automatic HF ↔ Megatron weight conversion.

Pluggable Reward Hub

Built-in rewards for math (DeepScaler, DAPO), GPQA, F1, IFBench, multiple-choice, multimodal Open-R1, and GenRM (generative LLM-as-judge). Add a custom reward by dropping a single file into relax/engine/rewards/.
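To give a flavor of what such a drop-in file might contain, here is a hypothetical multiple-choice reward in plain Python. The function signature and the registration contract under relax/engine/rewards/ are assumptions; consult the real reward modules for the exact interface.

```python
# Hypothetical reward file sketch; the (response, ground_truth) signature is
# an assumption, not the actual relax/engine/rewards/ contract.
import re

def multiple_choice_reward(response: str, ground_truth: str) -> float:
    """Score 1.0 if the first standalone choice letter matches the label."""
    match = re.search(r"\b([A-D])\b", response)
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth else 0.0

print(multiple_choice_reward("The answer is B because ...", "B"))  # 1.0
print(multiple_choice_reward("I think it's C.", "B"))              # 0.0
```

Rule-based rewards like this are cheap and deterministic; for open-ended answers, the GenRM (LLM-as-judge) mode mentioned above replaces the regex with a generative grader.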

Production Operations

  • HealthManager – heartbeat monitoring with two-tier auto-recovery (in-place restart first, global restart as fallback).
  • Metrics Service – a centralized Ray Serve deployment that fans out to TensorBoard, WandB, and ClearML.
  • Notifications – real-time training alerts via Apprise (Slack, WeChat, email, and more).

📚 Documentation

Full bilingual documentation is available at redai-infra.github.io/Relax.


🧪 Examples

| Example | Description |
|---|---|
| DeepEyes | Multi-modal vision-language RL with Qwen3-VL |
| On-Policy Distillation | Teacher-student knowledge distillation via KL penalty |

🤝 Contributing

We welcome contributions of all kinds! Please read our Contributing Guide to get started.


πŸ“ Citation

If you find Relax useful in your research, please cite:

```bibtex
@software{relax2026,
  title  = {Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale},
  author = {Relax Contributors},
  url    = {https://arxiv.org/abs/2604.11554},
  year   = {2026}
}
```

📜 License

This project is licensed under the Apache License 2.0.


πŸ™ Acknowledgements

Relax is built upon the shoulders of excellent open-source projects:

  • Slime – scalable training and inference framework for reinforcement learning
  • SGLang – fast serving framework for large language models
  • Megatron-LM & Megatron-Bridge – large-scale distributed training framework and HF ↔ Megatron weight-conversion bridge, with sincere thanks to the entire NVIDIA team
  • TransferQueue – high-performance distributed data transfer queue
  • Ray – distributed computing framework
  • HuggingFace Transformers – state-of-the-art model hub

We sincerely thank all contributors and the open-source community for making this project possible.
