
Releases: modelscope/ms-swift

Patch release v3.5.1

13 Jun 14:24

v3.5.0

08 Jun 16:51

New Features

  1. GRPO:
    a. Refactored the GRPO code; the vLLM integration mode is now selected via the vllm_mode parameter (a combined sketch of the new parameters follows this feature list). For details, refer to the documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html#arguments-and-execution-script:~:text=vllm_mode%20server%20parameter,in%20colocate%20mode.
    b. GRPO long-text optimization with Ulysses sequence parallelism, significantly reducing GPU memory usage during long-text training. Training script: https://github.com/modelscope/ms-swift/blob/main/examples/train/long_text/sequence_parallel_grpo.sh
    c. Added sync_ref_model parameter to synchronize reference model weights during training.
    d. Supports the Liger kernel loss via the use_liger_kernel parameter, reducing GPU memory consumption.
    e. External mode supports move_model_batches to lower peak GPU memory during ZeRO-3 weight synchronization.
    f. Integrated INTELLECT-2’s Two-Sided Clipping algorithm using the delta parameter.
    g. Supports reward functions returning None, applicable for multi-task training. For details, refer to the documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html#multi-task-training
    h. Internal mode supports vllm_server_base_url for passing external vLLM server URLs.
    i. Plugin extension: Added QwenLong-L1 reward model plugin.
    j. Added steps_per_generation and generation_batch_size parameters for customizing sampling batch size.
    k. Web-UI supports GRPO training.
    l. The following parameters will be removed in v3.6: tensor_parallel_size, vllm_device, vllm_max_num_seqs, num_infer_workers.
  2. Training:
    a. CPT/SFT/DPO/GRPO support padding-free training, which flattens each batch to avoid padding, reducing GPU memory usage and speeding up training (see the sketch after this list). Script: https://github.com/modelscope/ms-swift/tree/main/examples/train/padding_free
    b. Multimodal training enhancements: Supports separate learning rates for ViT and Aligner modules via vit_lr and aligner_lr parameters. Added vit_gradient_checkpointing to independently control gradient checkpointing for ViT modules. Benchmark: https://github.com/modelscope/ms-swift/blob/main/examples/train/multimodal/vit_gradient_checkpointing.sh
    c. CPT/SFT support channel loss, which reports the loss separately for each data channel in the dataset. Thanks to the technical team at China Merchants Bank for the contribution.
    d. CPT/SFT/DPO support use_logits_to_keep to reduce GPU memory usage and accelerate training.
    e. Qwen2.5-VL/Omni support video training by passing a directory of frame images.
  3. Inference & Deployment:
    a. Optimized batch processing in swift infer; the new write_batch_size parameter controls how often batched inference results are written to result_path.
    b. vLLM inference engine now defaults to V1 engine and supports hybrid Tensor Parallelism (TP) and Data Parallelism (DP). Script: https://github.com/modelscope/ms-swift/blob/main/examples/infer/vllm/dp_tp.sh
  4. Megatron-SWIFT:
    a. For non-streaming datasets, train_iters is now computed automatically from max_epochs.
    b. Added extra_megatron_kwargs to pass Megatron parameters that ms-swift does not expose directly.
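
To make the new GRPO switches concrete, here is a minimal colocate-mode sketch. It only combines parameters named above (vllm_mode, sync_ref_model, use_liger_kernel, steps_per_generation); the model, dataset, and reward function are illustrative placeholders, and exact flag spellings should be checked against the GRPO documentation linked in item 1a.

    # Hypothetical single-node GRPO run in colocate mode (vLLM shares the training GPUs).
    # Model, dataset, and reward function are placeholders.
    NPROC_PER_NODE=8 \
    swift rlhf \
        --rlhf_type grpo \
        --model Qwen/Qwen2.5-7B-Instruct \
        --dataset my_prompts.jsonl \
        --reward_funcs accuracy \
        --vllm_mode colocate \
        --sync_ref_model true \
        --use_liger_kernel true \
        --steps_per_generation 4

Similarly, a hedged sketch of the padding-free and multimodal training options from item 2; the flags mirror the parameter names above, and the dataset is a placeholder.

    # Hypothetical padding-free LoRA SFT with separate ViT/aligner learning rates.
    swift sft \
        --model Qwen/Qwen2.5-VL-7B-Instruct \
        --dataset my_mm_data.jsonl \
        --train_type lora \
        --padding_free true \
        --vit_lr 1e-5 \
        --aligner_lr 1e-4 \
        --vit_gradient_checkpointing true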

New Models

  1. Qwen/Qwen3-Embedding-0.6B series. Training script reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/embedding/train_emb.sh
  2. deepseek-ai/DeepSeek-R1-0528-Qwen3-8B series. Best practices: https://mp.weixin.qq.com/s/-hhfGiiGTqXUybwPH525gw
  3. iic/QwenLong-L1-32B
  4. XiaomiMiMo/MiMo-7B-RL-0530 & XiaomiMiMo/MiMo-VL-7B-SFT series
  5. OpenBMB/MiniCPM4-0.5B series

v3.4.1.post1

18 May 14:47

v3.4.1

13 May 06:33

New Features

  1. Sequence Parallelism: Supports Ulysses sequence parallelism in the PT/SFT/DPO stages, compatible with DeepSpeed, packing, flash_attn, streaming, and other training techniques (a launch sketch follows this list). Refer to the training script here.
  2. GRPO: Supports custom reward model logic. Includes a built-in example of a generative reward model. Refer to the training script here.
  3. Megatron-SWIFT: Updated megatron-core to version 0.12.0. Added the max_epochs parameter to stop training and save weights when the epoch reaches max_epochs. Added the wandb parameter to log training metrics.
  4. Best Practices: Added best practices for quickly training vision-language models from scratch. Refer to the guide here.
  5. External Contributions: Supports using judge0 to execute generated code in GRPO; allows freezing/activating parameters via regular expressions; supports specifying an initialization strategy for uninitialized parameters in a newly initialized model. Thanks to the technical team at China Merchants Bank for the contributions.
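
As a concrete illustration of item 1, a minimal Ulysses sequence-parallel launch might look as follows. This is a sketch assuming the sequence_parallel_size flag used by the linked example scripts; the model, dataset, and sizes are placeholders.

    # Hypothetical 4-way Ulysses sequence parallelism across 4 GPUs for long-context SFT.
    NPROC_PER_NODE=4 \
    swift sft \
        --model Qwen/Qwen2.5-7B-Instruct \
        --dataset my_long_text.jsonl \
        --sequence_parallel_size 4 \
        --max_length 65536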

New Models

  1. XiaomiMiMo/MiMo-7B-RL Series
  2. deepseek-ai/DeepSeek-Prover-V2-7B Series
  3. OpenGVLab/InternVL3-1B-Pretrained Series

Full Changelog: v3.4.0...v3.4.1

v3.4.0

30 Apr 15:45

New Features

  1. Support for Megatron training (CPT/SFT) of Qwen3/Qwen2-MoE/Qwen3-MoE, with training nearly 10 times faster on MoE models than the transformers implementation (a conversion-and-training sketch follows). For best practices on Qwen3-MoE training, refer to: #4030
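
A hedged sketch of the Megatron path: convert Hugging Face weights to Megatron (mcore) format, then train with the megatron CLI, as described in the Megatron-SWIFT documentation. The parallelism sizes and dataset are illustrative, and flag spellings should be checked against the docs.

    # 1) Convert HF weights to mcore format (assumed swift export flags).
    swift export \
        --model Qwen/Qwen3-30B-A3B \
        --to_mcore true \
        --output_dir Qwen3-30B-A3B-mcore

    # 2) Fine-tune the MoE model with tensor and expert parallelism (illustrative sizes).
    NPROC_PER_NODE=8 \
    megatron sft \
        --load Qwen3-30B-A3B-mcore \
        --dataset my_data.jsonl \
        --tensor_model_parallel_size 4 \
        --expert_model_parallel_size 2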

New Models

  1. Qwen/Qwen3-32B, Qwen/Qwen3-30B-A3B series
  2. Qwen/Qwen2.5-Omni-3B

Full Changelog: v3.3.1...v3.4.0

v3.3.1

26 Apr 08:57

New Features

  1. The agent training and deployment module introduces agent templates, covering more than 10 templates such as hermes, glm4_0414, and llama4, so a single agent dataset can be reused to train different models. For documentation, refer to here.
  2. GRPO training now supports calling an external vLLM server, allowing GPU memory to be divided more flexibly between training and rollout (a train-side sketch follows this list). For the training script, refer to here.
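
For item 2, the training process only needs the address of an already-running vLLM server; the server side is launched as in the linked script. A train-side sketch, assuming the vllm_server_host/vllm_server_port parameters from the GRPO documentation (with v3.5.0 and later this corresponds to vllm_mode server, described above):

    # Hypothetical train-side command pointing GRPO at an external vLLM server.
    swift rlhf \
        --rlhf_type grpo \
        --model Qwen/Qwen2.5-7B-Instruct \
        --dataset my_prompts.jsonl \
        --vllm_server_host 127.0.0.1 \
        --vllm_server_port 8000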

New Models

  1. OpenGVLab/InternVL3-1B series
  2. moonshotai/Kimi-VL-A3B-Instruct series
  3. ZhipuAI/GLM-4-9B-0414, ZhipuAI/GLM-Z1-9B-0414 series

Full Changelog: v3.3.0...v3.3.1

v3.3.0.post1

15 Apr 09:43

Full Changelog: v3.3.0...v3.3.0.post1

v3.3.0

11 Apr 06:36

New Features

  1. Supports the DAPO algorithm; training documentation can be found here: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html#dapo
  2. Supports sequence packing for multimodal models, including qwen2-vl, qwen2.5-vl, qwen2.5-omni, and the internvl2.5 series, with a 100% increase in training speed (a sketch follows this list). Training scripts can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/packing
  3. Added SWIFT and Megatron-SWIFT Docker images; see details here: https://swift.readthedocs.io/en/latest/GetStarted/SWIFT-installation.html#mirror
  4. Enhanced quantization support for multimodal/Omni/MoE models; quantization scripts can be found here: https://github.com/modelscope/ms-swift/blob/main/examples/export/quantize
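
As a sketch of item 2, packing is enabled with a single flag on top of an ordinary multimodal fine-tune; flag names are assumed from the linked packing examples (packing generally requires flash attention), and the model and dataset are placeholders.

    # Hypothetical multimodal SFT with sequence packing enabled.
    NPROC_PER_NODE=4 \
    swift sft \
        --model Qwen/Qwen2.5-VL-7B-Instruct \
        --dataset my_mm_data.jsonl \
        --train_type lora \
        --packing true \
        --attn_impl flash_attn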

New Models

  1. Qwen/Qwen2.5-Omni-7B
  2. LLM-Research/Llama-4-Scout-17B-16E-Instruct series
  3. cognitivecomputations/DeepSeek-V3-0324-AWQ

Full Changelog: v3.2.2...v3.3.0

v3.2.2

26 Mar 02:59

New Features

  1. Release of Megatron-SWIFT: Megatron-SWIFT has been released, supporting various parallel technologies such as TP (Tensor Parallelism), PP (Pipeline Parallelism), SP (Sequence Parallelism), and CP (Context Parallelism) for pre-training and fine-tuning over 100 models, including the Qwen series, Llama series, and Deepseek-R1 distillation series. It also supports streaming datasets and sequence packing, enabling the handling of ultra-large datasets while improving training efficiency. For more details, refer to the Megatron-SWIFT Training Documentation.
  2. Support for Multi-turn GRPO Training: Supports multi-turn GRPO training to adapt to scenarios such as multi-turn agent tool calls in Deep Search. Example code can be found here.
  3. Embedding Training for Multimodal Models: Supports embedding training for multimodal models such as iic/gme-Qwen2-VL-2B-Instruct. For more information, refer to the Embedding Model Training Documentation.
  4. Multi-label Classification and Regression: Supports end-to-end training and deployment of multi-label classification and regression tasks for LLMs and multimodal LLMs (a sketch follows this list). Example scripts can be found here.
  5. Model Evaluation with EvalScope During Training: Supports model evaluation using EvalScope during training to monitor training performance in real time. Example scripts can be found in the Evaluation Documentation.
  6. Custom External Plugin for LoRA + ViT Training: Provides an external plugin to support LoRA training for LLMs (Large Language Models) while performing full-parameter training for ViTs (Vision Transformers) with different learning rates. This avoids precision errors caused by merging LoRA into the ViT portion. Example code can be found here.
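
For item 4, a hedged sketch of a multi-label classification run; the task_type, num_labels, and problem_type flags are assumptions based on the sequence-classification examples, and the model and dataset are placeholders. A regression task would swap in problem_type regression with num_labels 1.

    # Hypothetical multi-label classification fine-tune.
    swift sft \
        --model Qwen/Qwen2.5-7B-Instruct \
        --dataset my_labeled_data.jsonl \
        --task_type seq_cls \
        --num_labels 10 \
        --problem_type multi_label_classification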

New Models

  1. iic/gme-Qwen2-VL-2B-Instruct series
  2. Qwen/Qwen2.5-VL-32B-Instruct
  3. LLM-Research/gemma-3-4b-it series
  4. deepseek-ai/DeepSeek-V3-0324
  5. mistralai/Mistral-Small-3.1-24B-Instruct-2503 series

v3.2.1

14 Mar 07:07

New Features

  1. GRPO supports vLLM's tensor parallel mode. Examples can be found here.
  2. In co-locate mode, GRPO supports offloading both the optimizer and the model, as well as loading weights and merging LoRA batch by batch. This saves GPU memory and enables training a 72B model on four A100 GPUs (a sketch follows this list). Examples can be found here.
  3. GRPO supports code ORM. Best practices can be found here.
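
A memory-saving sketch combining items 1 and 2 for a large model; flag names follow this release (note that tensor_parallel_size is slated for removal in v3.6, per the v3.5.0 notes above), and the model and dataset are placeholders.

    # Hypothetical GRPO run on 4 GPUs with vLLM tensor parallelism plus optimizer/model offload.
    NPROC_PER_NODE=4 \
    swift rlhf \
        --rlhf_type grpo \
        --model Qwen/Qwen2.5-72B-Instruct \
        --dataset my_prompts.jsonl \
        --tensor_parallel_size 4 \
        --offload_optimizer true \
        --offload_model true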

New Models

  1. Qwen/QwQ-32B series
  2. inclusionAI/Ling-lite series

Full Changelog: v3.2.0...v3.2.1