Releases · modelscope/ms-swift

English Version

New Features

Megatron-SWIFT
a. Mcore-Bridge released. It loads and saves model weights directly in safetensors format, converts LoRA incremental weights in both directions, and supports multi-node conversion. Documentation: https://swift.readthedocs.io/en/latest/Megatron-SWIFT/Mcore-Bridge.html. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/mcore_bridge
b. Upgraded megatron-core to 0.14.0.
c. Added the vit_lr and aligner_lr parameters for multimodal model training.
d. Added storage optimization parameters: async_save, save_retain_interval, etc.
e. Added batched mrope support, which speeds up training of Qwen3-VL, Qwen2.5-VL, and similar models.
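The vit_lr and aligner_lr parameters give the vision tower and the aligner their own learning rates. As a rough illustration of the idea only, the sketch below splits parameter names into per-component groups. The function name, the "visual." and "aligner." prefixes, and the sample parameter names are all hypothetical and are not taken from ms-swift.

```python
def build_lr_groups(param_names, lr, vit_lr=None, aligner_lr=None,
                    vit_prefix="visual.", aligner_prefix="aligner."):
    """Split parameter names into groups, each with its own learning rate.

    The prefixes are assumptions for illustration; real models name their
    vision tower and aligner modules differently.
    """
    groups = {
        # Fall back to the base learning rate when no override is given.
        "vit": {"lr": vit_lr if vit_lr is not None else lr, "params": []},
        "aligner": {"lr": aligner_lr if aligner_lr is not None else lr, "params": []},
        "llm": {"lr": lr, "params": []},
    }
    for name in param_names:
        if name.startswith(vit_prefix):
            groups["vit"]["params"].append(name)
        elif name.startswith(aligner_prefix):
            groups["aligner"]["params"].append(name)
        else:
            groups["llm"]["params"].append(name)
    return groups


names = ["visual.blocks.0.weight", "aligner.proj.weight", "model.layers.0.mlp.weight"]
groups = build_lr_groups(names, lr=1e-4, vit_lr=1e-5, aligner_lr=5e-5)
for key, group in groups.items():
    print(key, group["lr"], group["params"])
```

Groups of this shape map naturally onto the per-group `lr` entries that PyTorch optimizers accept.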
RL
a. Faster weight synchronization in GRPO LoRA training. Details: https://swift.readthedocs.io/en/latest/Instruction/GRPO/GetStarted/GRPO.html#memory-optimization-solutions-in-colocate-mode
b. Reduced peak GPU memory usage in GRPO training.
c. New RLVR algorithms. RLOO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/RLOO.html. REINFORCE++ Baseline documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/REINFORCEPP.html
d. GKD can now use vLLM to accelerate policy model rollout, and the new teacher_deepspeed parameter separately controls the teacher model's sharding strategy. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GKD.html
e. GSPO can use liger_kernel to reduce memory usage.
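For readers unfamiliar with RLOO (REINFORCE Leave-One-Out), the core idea is that each sampled completion is scored against the mean reward of the other completions for the same prompt. The sketch below shows only that generic baseline computation; the function name is hypothetical and this is not the ms-swift implementation, which is described in the linked documentation.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K completions of a single prompt.

    advantage_i = reward_i - mean(rewards of the other K - 1 completions)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two completions per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


rewards = [1.0, 0.0, 0.0, 1.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

Because each baseline excludes the sample it is applied to, the baseline does not depend on that sample's own reward.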
Training
a. Added Ray support for PT, SFT, sampling, and data distillation. Documentation: https://swift.readthedocs.io/en/latest/Instruction/Ray.html
b. Qwen3-VL and Qwen3-Omni support mixed-modality data training; Qwen3-VL supports Ulysses sequence parallelism. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
c. Training parameters can now be configured via YAML. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/yaml
d. Added an FSDP2 training launch example. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-gpu/fsdp2_lora
e. Added a best-practice guide for registering custom multimodal models: https://swift.readthedocs.io/en/latest/BestPractices/MLLM-Registration.html
f. The InfoNCE loss used in embedding training now matches the description in the Qwen3-Embedding paper. Documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
g. Added a multi-label classification training example. Scripts: https://github.com/modelscope/ms-swift/tree/main/examples/train/seq_cls/multi_label
h. agent_template now supports seed-oss. Thanks to @hpsun1109 for the contribution.
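As background for the InfoNCE item above, the sketch below computes a plain, textbook InfoNCE loss for one query with one positive and a list of negatives. It is only meant to show the shape of the objective: the function names and the temperature value are assumptions, and the paper-specific details that ms-swift aligns with are covered in the linked documentation, not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """Generic InfoNCE: -log softmax score of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


good = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(good < bad)
```

The loss is small when the positive is closer to the query than every negative and grows when a negative is closer.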
Full Pipeline
a. swift export supports GPTQ-v2 quantization. Script: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize/gptq_v2.sh. Thanks to @zzc0430 for the contribution.
b. The vLLM inference backend of swift deploy supports data-parallel (DP) deployment via the --vllm_data_parallel_size parameter. Thanks to @YushunXiang for the contribution.
c. swift deploy adds health/ping endpoints.
d. vLLM deployment adds the vllm_mm_processor_cache_gb and vllm_engine_kwargs parameters.
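Health endpoints like the ones added to swift deploy are typically used as readiness probes. The sketch below polls such an endpoint with the standard library only. The "/health" path, the function name, and the host and port in the usage comment are assumptions made for illustration; the release notes name the endpoints but not their exact routes.

```python
import urllib.error
import urllib.request


def is_healthy(base_url, path="/health", timeout=2.0):
    """Return True if GET base_url + path answers with HTTP 200.

    The default path is an assumption; adjust it to the route that the
    deployed server actually exposes.
    """
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


# Example (hypothetical host and port):
# print(is_healthy("http://127.0.0.1:8000"))
```

A deployment script could call this in a loop with a short sleep until it returns True before sending traffic.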

New Models

Text-only models:
a. Qwen/Qwen3Guard-Gen-0.6B series
b. MiniMax/MiniMax-M2
Multimodal models:
a. Qwen/Qwen3-VL-2B-Instruct series
b. deepseek-ai/DeepSeek-OCR, training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/deepseek_ocr
c. PaddlePaddle/PaddleOCR-VL
d. ZhipuAI/Glyph
e. PaddlePaddle/ERNIE-4.5-VL-28B-A3B-Thinking series
f. lmms-lab/LLaVA-OneVision-1.5-4B-Instruct series

What's Changed

[bugfix] fix image_list qwen2.5/3-omni by @Jintao-Huang in #6122
[model] Support Qwen3-VL dense by @Jintao-Huang in #6120
feat: support gptq_v2 quantization method by @zzc0430 in #6102
[bugfix] fix gptq_v2 by @Jintao-Huang in #6126
[bugfix] patch timeout & fix print_rich_table by @Jintao-Huang in #6137
Add the support for vLLM data parallel configuration in SwiftDeploy by @YushunXiang in #6114
[docs] update vllm deploy DP docs by @Jintao-Huang in #6139
[model] Support Qwen/Qwen3-VL-4B-Instruct series by @Jintao-Huang in #6143
Update loss_scale method call to pass through inputs.extra_kwargs by @CJack812 in #6160
[bugfix] fix qwen3_vl videos by @Jintao-Huang in #6162
Fix bug of sp/cp by @tastelikefeet in #6163
[deploy] update vllm_enable_prefix_caching by @Jintao-Huang in #6165
[bugfix] qwen3-vl support mixed data by @Jintao-Huang in #6161
[template] add_retry by @Jintao-Huang in #6138
[bugfix] Fix multimodal lazy_tokenize false by @Jintao-Huang in #6172
[template] update qwen3_vl grounding dataset format by @Jintao-Huang in #6178
[docs] update docs by @Jintao-Huang in #6180
[bugfix] add tools fields in inputs2requests by @hjh0119 in #6054
[grpo] Optimize vLLM weight synchronization & update builtin accuracy reward by @hjh0119 in #5773
[model] support Qwen/Qwen3Guard-Gen-0.6B series by @Jintao-Huang in #6189
[template] Support qwen3 omni mixed data by @Jintao-Huang in #6196
[docs] update qwen3_vl best practice by @Jintao-Huang in #6206
[vllm] support vllm_mm_processor_cache_gb by @hjh0119 in #6210
[megatron] fix qwen3_vl new_special_tokens by @Jintao-Huang in #6213
[megatron] add mcore save_args by @Jintao-Huang in #6216
[bugfix] fix dtype warning by @Jintao-Huang in #6219
[bugfix] fix infer pt dp by @Jintao-Huang in #6222
support training for multimodal reranker by @0russwest0 in #6192
[bugfix] fix reward_trainer logger by @Jintao-Huang in #6240
[model] Support deepseek-ocr by @Jintao-Huang in #6238
[docs] update deepseek_ocr docs by @Jintao-Huang in #6242
[bugfix] fi...

Contributors

hpsun1109, CJack812, and 8 other contributors

04 Nov 13:46

Jintao-Huang

v3.9.3

aa36829

Patch release v3.9.3

Full Changelog: v3.9.2...v3.9.3

26 Oct 09:30

Jintao-Huang

v3.9.2

4bacd3f

Patch release v3.9.2

Full Changelog: v3.9.1...v3.9.2

19 Oct 16:50

Jintao-Huang

v3.9.1

fd4ea9b

Patch release v3.9.1

Full Changelog: v3.9.0...v3.9.1

13 Oct 17:50

Jintao-Huang

v3.9.0

ff3e583

v3.9.0

English Version

New Features

Megatron-SWIFT
a. More model architecture support: Qwen3-VL, Qwen3-Omni, Qwen3-Next, Kimi-VL, InternVL3.5-HF, etc. For a complete list of supported models, please refer to the Supported Models documentation: https://swift.readthedocs.io/en/latest/Instruction/Supported-models-and-datasets.html
b. KTO training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Special thanks to @kevssim from China Merchants Bank’s technical team for their contribution. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto
c. Reward Model training support, including full-parameter, LoRA, MoE, multimodal, and Packing training techniques. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm
d. Sequence classification model architecture support, covering three task types: regression, single_label_classification, and multi_label_classification. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/megatron/seq_cls
e. Support for VPP (Virtual Pipeline Parallelism): reduces pipeline bubbles in PP (Pipeline Parallelism), improving GPU utilization at the cost of slightly increased communication overhead. Supports heterogeneous PP via pipeline_model_parallel_layout for custom PP/VPP pipeline layouts.
f. In RLHF techniques such as DPO, the ref_model no longer initializes main_grad, reducing GPU memory consumption.
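To make the VPP trade-off above concrete, the sketch below evaluates the standard idealized bubble estimate for a 1F1B pipeline schedule: the ratio of bubble time to compute time is (p - 1) / m for p pipeline stages and m micro-batches, and interleaving with v virtual stages per device divides it by v. The function name is hypothetical, and the formula ignores communication cost, which is the part that VPP increases.

```python
def bubble_fraction(pp_size, num_microbatches, vpp_size=1):
    """Idealized ratio of pipeline bubble time to compute time.

    pp_size: number of pipeline stages (p)
    num_microbatches: micro-batches per global batch (m)
    vpp_size: virtual pipeline stages per device (v); 1 means plain PP
    """
    return (pp_size - 1) / (vpp_size * num_microbatches)


plain = bubble_fraction(pp_size=4, num_microbatches=8)
interleaved = bubble_fraction(pp_size=4, num_microbatches=8, vpp_size=2)
print(plain, interleaved)
```

With 4 stages and 8 micro-batches, going from plain PP to 2 virtual stages per device halves the estimated bubble.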
Training
a. Sequence parallelism optimization: Ulysses and Ring Attention can now be used together, enabling processing of even longer sequences. Supports SFT/DPO/GRPO training for both text-only and multimodal models. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/sequence_parallel/sequence_parallel.sh
b. Padding-free training is now supported for embedding, reranker, and sequence classification tasks on both text-only and multimodal models, saving GPU memory and accelerating training.
c. Restructured dataset formats for embedding and reranker training. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html, https://swift.readthedocs.io/en/latest/BestPractices/Reranker.html
d. Agent templates support more models: deepseek_v3_1, qwen3_coder. (Thanks to contributions from @gakkiri and @ray075hl)
e. Default value of load_from_cache_file changed from True to False to avoid unexpected issues caused by caching.
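The memory saving from padding-free training comes from concatenating the samples in a batch into one flat sequence instead of padding each to the longest length. The sketch below illustrates that layout with cumulative sequence boundaries, a convention used by variable-length attention kernels. The function names and token ids are made up for illustration and do not reflect ms-swift internals.

```python
def pack_padding_free(sequences):
    """Concatenate token sequences and record where each one starts and ends."""
    flat_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        flat_ids.extend(seq)
        # Positions restart at 0 for every sample so samples stay independent.
        position_ids.extend(range(len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return flat_ids, position_ids, cu_seqlens


def padded_token_count(sequences):
    """Number of token slots a conventionally padded batch would occupy."""
    return len(sequences) * max(len(seq) for seq in sequences)


batch = [[11, 12, 13, 14, 15], [21, 22], [31, 32, 33]]
flat_ids, position_ids, cu_seqlens = pack_padding_free(batch)
print(len(flat_ids), padded_token_count(batch), cu_seqlens)
```

In this toy batch the flat layout holds 10 tokens where a padded batch would hold 15 slots.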
RLHF
a. GRPO now supports the CHORD algorithm, enabling mixed SFT training during GRPO. Documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO/AdvancedResearch/CHORD.html
b. KTO supports padding-free and packing, reducing memory usage and accelerating training.
c. Padding-free implementation in GRPO has been refactored for better multimodal model support.
d. GRPO with vLLM now supports the environment variable PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" to reduce GPU memory fragmentation.
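PYTORCH_CUDA_ALLOC_CONF is read when the PyTorch CUDA allocator initializes, so it has to be set before the process makes its first CUDA allocation. The sketch below sets it from Python while preserving any other allocator options already present; the helper name is hypothetical, and exporting the variable in the launch shell works just as well.

```python
import os


def with_expandable_segments(current):
    """Return an allocator config string with expandable_segments enabled.

    Options are comma-separated key:value pairs; any existing
    expandable_segments entry is replaced, other options are kept.
    """
    options = [
        option for option in current.split(",")
        if option and not option.startswith("expandable_segments:")
    ]
    options.append("expandable_segments:True")
    return ",".join(options)


# Set this before importing torch (or at least before any CUDA allocation).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```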
Inference
a. Inference and deployment support for Reranker tasks (PyTorch/vLLM) and sequence classification tasks (PyTorch/vLLM). Example scripts: https://github.com/modelscope/ms-swift/tree/main/examples/deploy/reranker, https://github.com/modelscope/ms-swift/tree/main/examples/deploy/seq_cls

New Models

Text-only Models
a. Qwen/Qwen3-Next-80B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_next
b. ZhipuAI/GLM-4.6
c. inclusionAI/Ling-mini-2.0; inclusionAI/Ring-mini-2.0 series
d. iic/Tongyi-DeepResearch-30B-A3B
e. ByteDance-Seed/Seed-OSS-36B-Instruct series (Thanks to @hpsun1109 for the contribution)
f. deepseek-ai/DeepSeek-V3.1-Terminus
g. PaddlePaddle/ERNIE-4.5-21B-A3B-Thinking
h. google/embeddinggemma-300m (embedding model)
Multimodal Models
a. Qwen/Qwen3-VL-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_vl
b. Qwen/Qwen3-Omni-30B-A3B-Instruct series. Training scripts: https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_omni
c. Kwai-Keye/Keye-VL-1_5-8B (Thanks to @hellopahe for the contribution)
d. OpenGVLab/InternVL3_5-1B-HF series
e. BytedanceDouyinContent/SAIL-VL2-2B series
f. stepfun-ai/Step-Audio-2-mini (Thanks to @CJack812 for the contribution)

What's Changed

Merge ulysses and ring-attention by @tastelikefeet in #5522
[bugfix] fix text_position_ids by @Jintao-Huang in #5692
[grpo] support CHORD algorithm by @hjh0119 in #5680
[doc] update chord doc by @hjh0119 in #5701
[bugfix]: use GCD to robustly configure sp and rp dimensions for any world_size by @0russwest0 in #5698
[megatron] Fix SP & LoRA by @Jintao-Huang in #5704
[megatron] Support ovis2.5 by @Jintao-Huang in #5719
[template] update get_env_args & load_from_cache_file by @Jintao-Huang in #5730
[bugfix] fix qwen3 swift pt by @Jintao-Huang in #5741
fix sp grpo by @tastelikefeet in #5744
Fix multiple input issue and more_params for web-ui by @slin000111 in #5739
[bugfix] set default padding side to left for generative reranker by @0russwest0 in #5751
[bugfix] correct multi-GPU reranker evaluation metric calculation by @0russwest0 in #5755
wrap base_model into get_llm_model by @tastelikefeet in #5749
[bugfix] fix forward_context by @Jintao-Huang in #5757
[bugfix] update use_barrier -> True by @Jintao-Huang in #5763
support Seed-OSS-36B-Instruct by @hpsun1109 in #5761
[bugfix] fix megatron model_type by @Jintao-Huang in #5767
Refactor grpo padding free by @tastelikefeet in #5769
Update seed.py by @hpsun1109 in #5725
[model] Support qwen3_next (transformers) by @Jintao-Huang in #5782
[megatron] fix text_position_ids by @Jintao-Huang in #5783
[model] su...