Releases · InternLM/lmdeploy
v0.9.0
What's Changed
🚀 Features
- LMDeploy Distserve by @JimyMa in #3304
- allow api server terminated through requests from clients by @RunningLeon in #3533
- support update params for pytorch backend from api server by @irexyc in #3535
- support eplb for Qwen3-MoE by @zhaochaoxing in #3582
- support update params for turbomind backend by @irexyc in #3566
- Quantize Qwen3 MoE bf16 model to fp8 model at runtime by @grimoire in #3631
- [Feat]: Support internvl3-8b-hf by @RunningLeon in #3633
- Add FP8 MoE for turbomind by @lzhangzz in #3601
💥 Improvements
- reduce ray memory usage by @grimoire in #3487
- use dlblas by @zhaochaoxing in #3489
- internlm3 dense fp8 by @CUHKSZzxy in #3527
- random pad input ids by @grimoire in #3530
- ray nsys profile support by @grimoire in #3448
- update blockedfp8 scale name by @CUHKSZzxy in #3532
- start engine loop on server startup event by @grimoire in #3523
- update two microbatch by @SHshenhao in #3540
- [ascend]set transdata dynamic shape true by @JackWeiw in #3531
- ray safe exit by @grimoire in #3545
- support update params with dp=1 for pytorch engine by @irexyc in #3562
- Skip dp dummy input forward by @grimoire in #3552
- Unlock mutual exclusivity of arguments `tool-call-parser` and `reasoning-parser` by @jingyibo123 in #3550
- perform torch.cuda.empty_cache() after conversion by @bltcn in #3570
- pipeline warmup by @irexyc in #3548
- Launch multiple api servers for dp > 1 by @RunningLeon in #3414
- support awq for Qwen2.5-VL by @RunningLeon in #3559
- support qwen3 /think & /no_think & enable_thinking parameter by @BUJIDAOVS in #3564
- Eplb by @zhaochaoxing in #3572
- Update benchmark by @lvhan028 in #3578
- block output when prefetch next forward inputs. by @grimoire in #3573
- support both eplb and microbatch simultaneously by @zhaochaoxing in #3591
- Add log_file and set loglevel in launch_servers by @RunningLeon in #3596
- sampling on the tokenizer's vocab by @grimoire in #3604
- update deepgemm version by @grimoire in #3606
- [Ascend] set default distributed backend as ray for ascend device by @JackWeiw in #3603
- Blocked fp8 tma by @grimoire in #3470
- [PDDisaggregation] Async migration by @JimyMa in #3610
- move dp loop to model agent by @grimoire in #3598
- update some logs of proxy_server and pt engine by @lvhan028 in #3621
- improve loading model performance by shuffling the weight files by @irexyc in #3625
- add benchmark scripts about pipeline api and inference engines according to the config file by @lvhan028 in #3622
🐞 Bug fixes
- [ascend] fix recompile on different rank by @jinminxi104 in #3513
- fix attention sm86 by @grimoire in #3519
- fix stopwords kv cache by @grimoire in #3494
- [bug fix] fix PD Disaggregation in DSV3 by @JimyMa in #3547
- fix proxy server heart beat by @irexyc in #3543
- fix dp>1 tp=1 ep=1 by @grimoire in #3555
- fix mixtral on new transformers by @grimoire in #3580
- [Fix]: reset step after eviction by @RunningLeon in #3589
- fix parsing dynamic rope param failed by @lvhan028 in #3575
- Fix batch infer for gemma3vl by @RunningLeon in #3592
- Fix symbol error when dlBLAS is not imported by @zhaochaoxing in #3597
- read distributed envs by @grimoire in #3600
- fix side-effect caused by PR 3590 by @lvhan028 in #3608
- fix bug in qwen2 by @LKJacky in #3614
- fix awq kernel by @grimoire in #3618
- fix flash mla interface by @grimoire in #3617
- add sampling_vocab_size by @irexyc in #3607
- fix for default quant by @grimoire in #3640
- Fix log file env in ray worker by @RunningLeon in #3624
- fix qwen3 chat template by @lvhan028 in #3641
- fix vlm runtime quant by @grimoire in #3644
- Fix 'Namespace' object has no attribute 'num_tokens_per_iter' when serving by gradio by @lvhan028 in #3647
- Synchronize weight processing by @lzhangzz in #3649
- Fix zero scale in fp8 quantization by @lzhangzz in #3652
🌐 Other
- update doc for ascend 300I Duo docker image by @jinminxi104 in #3526
- simulate EPLB for benchmark only by @lvhan028 in #3490
- [ci] add test workflow for 3090 machine by @zhulinJulia24 in #3561
- [ci] fix transformers version in prtest by @zhulinJulia24 in #3584
- [Misc] minor api_server and tm loader, and upgrade docformatter to resolve lint error by @lvhan028 in #3590
- [ci] add qwen3 models into testcase by @zhulinJulia24 in #3593
- update Dockerfile by @CUHKSZzxy in #3634
- check in lmdeploy-builder on cuda 12.4 and 12.8 platform by @lvhan028 in #3630
- fix blocked fp8 overflow by @grimoire in #3650
- Bump version to v0.9.0 by @lvhan028 in #3609
New Contributors
- @JimyMa made their first contribution in #3304
- @jingyibo123 made their first contribution in #3550
- @bltcn made their first contribution in #3570
- @BUJIDAOVS made their first contribution in #3564
- @LKJacky made their first contribution in #3614
Full Changelog: v0.8.0...v0.9.0
v0.8.0
What's Changed
🚀 Features
- Torch dp support by @grimoire in #3207
- Add deep gemm with tma pre allocated by @AllentDan in #3287
- Add mixed DP + TP by @lzhangzz in #3229
- Add Qwen3 and Qwen3MoE by @lzhangzz in #3305
- [ascend] support multi nodes on ascend device by @tangzhiyi11 in #3260
- [Feature] support qwen3 and qwen3-moe for pytorch engine by @CUHKSZzxy in #3315
- [ascend]support deepseekv2 by @yao-fengchen in #3206
- add deepep by @zhaochaoxing in #3313
- support ascend w8a8 graph_mode by @yao-fengchen in #3267
- support all2all ep by @zhaochaoxing in #3370
- optimize ep in decoding stage by @zhaochaoxing in #3383
- Warmup deepgemm by @grimoire in #3387
- support Llama4 by @grimoire in #3408
- add twomicrobatch support by @SHshenhao in #3381
- Support phi4 mini by @RunningLeon in #3467
- [Dlinfer][Ascend] support 310P by @JackWeiw in #3484
- support qwen3 fp8 by @CUHKSZzxy in #3505
💥 Improvements
- Add spaces_between_special_tokens to /v1/interactive and make compatible with empty text by @AllentDan in #3283
- add env var to control timeout by @CUHKSZzxy in #3291
- refactor attn param by @irexyc in #3164
- Verbose log by @grimoire in #3329
- optimize mla, remove load `v` by @grimoire in #3334
- support dp decoding with cudagraph by @grimoire in #3311
- optimize quant-fp8 kernel by @grimoire in #3345
- refactor dlinfer rope by @yao-fengchen in #3326
- enable qwenvl2.5 graph mode on ascend by @jinminxi104 in #3367
- Add AIOHTTP_TIMEOUT env var for proxy server by @AllentDan in #3355
- disable sync batch on dp eager mode by @grimoire in #3382
- fix for deepgemm update by @grimoire in #3380
- Add string before hash tokens in blocktrie by @RunningLeon in #3386
- optimize moe get sorted idx by @grimoire in #3356
- use half/bf16 lm_head output by @irexyc in #3213
- remove ep eager check by @grimoire in #3392
- Optimize ascend moe by @yao-fengchen in #3364
- optimize fp8 moe kernel by @grimoire in #3419
- ray async forward execute by @grimoire in #3443
- map internvl3 chat template to builtin chat template internvl2_5 by @lvhan028 in #3450
- Refactor turbomind (low-level abstractions) by @lzhangzz in #3423
- remove barely used code to improve maintenance by @lvhan028 in #3462
- optimize sm80 long context by @grimoire in #3465
- move partial_json_parser from 'serve.txt' to 'runtime.txt' by @lvhan028 in #3493
- support qwen3-dense models awq quantization by @lvhan028 in #3503
- Optimize MoE gate for Qwen3 by @lzhangzz in #3500
- Pass num_tokens_per_iter and max_prefill_iters params through in `lmdeploy serve api_server` by @josephrocca in #3504
- [Dlinfer][Ascend] Optimize performance of 310P device by @JackWeiw in #3486
- optimize longcontext decoding by @grimoire in #3510
- Support min_p in openai completions_v1 by @josephrocca in #3506
🐞 Bug fixes
- fix activation grid oversize by @grimoire in #3282
- Set ensure_ascii=False for tool calling by @AllentDan in #3295
- fix sliding window multi chat by @grimoire in #3302
- add `v` check by @grimoire in #3307
- Fix Qwen3MoE config parsing by @lzhangzz in #3336
- Fix finish reasons by @AllentDan in #3338
- remove think_end_token_id in streaming content by @AllentDan in #3327
- Fix the finish_reason by @AllentDan in #3350
- set cmake policy minimum version as 3.5 by @lvhan028 in #3376
- fix dp cudagraph by @grimoire in #3372
- fix flashmla eagermode by @grimoire in #3375
- close engine after each benchmark-generation iter by @grimoire in #3269
- [Fix] fix `image_token_id` error of qwen2-vl and deepseek by @ao-zz in #3358
- fix stopping criteria by @grimoire in #3384
- support List[dict] prompt input without do_preprocess by @irexyc in #3385
- add rayexecutor release timeout by @grimoire in #3403
- fix tensor dispatch in dynamo by @wanfengcxz in #3417
- fix linting error by upgrade to ubuntu-latest by @lvhan028 in #3442
- fix awq tp for pytorch engine by @RunningLeon in #3435
- fix mllm testcase fail by @caikun-pjlab in #3458
- remove paged attention autotune by @grimoire in #3452
- Remove empty prompts in benchmark scripts by @lvhan028 in #3460
- failed to end session properly by @lvhan028 in #3471
- fix qwen2.5-vl chat template by @CUHKSZzxy in #3475
- Align forward arguments of deepgemm blockedf8 by @RunningLeon in #3474
- fix turbomind lib missing to link nccl by exporting nccl path by @lvhan028 in #3479
- fix dsvl2 no attr config error by @CUHKSZzxy in #3477
- fix flash attention crash on triton3.1.0 by @grimoire in #3478
- Fix disorder of ray execution by @RunningLeon in #3481
- update dockerfile by @CUHKSZzxy in #3482
- fix output logprobs by @irexyc in #3488
- Fix Qwen2MoE shared expert gate by @lzhangzz in #3491
- fix replicate kv for qwen3-moe by @grimoire in #3499
- fix sampling if data overflow after temperature penalty by @irexyc in #3508
📚 Documentations
- update qwen2.5-vl-32b docs by @CUHKSZzxy in #3446
🌐 Other
- bump version to v0.7.2.post1 by @lvhan028 in #3298
- [ci] add think function testcase by @zhulinJulia24 in #3299
- merge dev into main by @lvhan028 in #3348
- [ci] add vl models into pipeline interface testcase by @zhulinJulia24 in #3374
- merge dev to main branch by @lvhan028 in #3378
- opt experts memory and permute by @zhaochaoxing in #3390
- Revert "opt experts memory and permute" by @zhaochaoxing in #3406
- merge dev to main by @lvhan028 in #3400
- add Hopper GPU dockerfile by @CUHKSZzxy in #3415
- optimize internvit by @caikun-pjlab in #3433
- fix stop/bad words by @irexyc in #3492
- [ci] testcase bugfix and add more models into testcase by @zhulinJulia24 in #3463
- bump version to v0.8.0 by @lvhan028 in #3432
New Contributors
- @zhaochaoxing made their first contribution in #3313
- @ao-zz made their first contribution in #3358
- @wanfengcxz made their first contribution in #34...
v0.7.3
What's Changed
🚀 Features
- Add Qwen3 and Qwen3MoE by @lzhangzz in #3305
- [Feature] support qwen3 and qwen3-moe for pytorch engine by @CUHKSZzxy in #3315
- [ascend]support deepseekv2 by @yao-fengchen in #3206
- support ascend w8a8 graph_mode by @yao-fengchen in #3267
- support Llama4 by @grimoire in #3408
💥 Improvements
- Add spaces_between_special_tokens to /v1/interactive and make compatible with empty text by @AllentDan in #3283
- add env var to control timeout by @CUHKSZzxy in #3291
- optimize mla, remove load `v` by @grimoire in #3334
- refactor dlinfer rope by @yao-fengchen in #3326
- enable qwenvl2.5 graph mode on ascend by @jinminxi104 in #3367
- Optimize ascend moe by @yao-fengchen in #3364
- find port by @grimoire in #3429
🐞 Bug fixes
- fix activation grid oversize by @grimoire in #3282
- Set ensure_ascii=False for tool calling by @AllentDan in #3295
- add `v` check by @grimoire in #3307
- Fix Qwen3MoE config parsing by @lzhangzz in #3336
- Fix finish reasons by @AllentDan in #3338
- remove think_end_token_id in streaming content by @AllentDan in #3327
- Fix the finish_reason by @AllentDan in #3350
- support List[dict] prompt input without do_preprocess by @irexyc in #3385
- fix tensor dispatch in dynamo by @wanfengcxz in #3417
📚 Documentations
- update ascend doc by @yao-fengchen in #3420
🌐 Other
- bump version to v0.7.2.post1 by @lvhan028 in #3298
- Optimize internvit by @caikun-pjlab in #3316
- bump version to v0.7.3 by @lvhan028 in #3416
New Contributors
- @wanfengcxz made their first contribution in #3417
- @caikun-pjlab made their first contribution in #3316
Full Changelog: v0.7.2...v0.7.3
v0.7.2.post1
What's Changed
💥 Improvements
- Add spaces_between_special_tokens to /v1/interactive and make compatible with empty text by @AllentDan in #3283
- add env var to control timeout by @CUHKSZzxy in #3291
🐞 Bug fixes
- fix activation grid oversize by @grimoire in #3282
- Set ensure_ascii=False for tool calling by @AllentDan in #3295
🌐 Other
Full Changelog: v0.7.2...v0.7.2.post1
v0.7.2
What's Changed
🚀 Features
- [Feature] support qwen2.5-vl for pytorch engine by @CUHKSZzxy in #3194
- Support reward models by @lvhan028 in #3192
- Add collective communication kernels by @lzhangzz in #3163
- PytorchEngine multi-node support v2 by @grimoire in #3147
- Add flash mla by @AllentDan in #3218
- Add gemma3 implementation by @AllentDan in #3272
💥 Improvements
- remove update badwords by @grimoire in #3183
- default executor ray by @grimoire in #3210
- change ascend&camb default_batch_size to 256 by @jinminxi104 in #3251
- Tool reasoning parsers and streaming function call by @AllentDan in #3198
- remove torchelastic flag by @grimoire in #3242
- disable flashmla warning on sm<90 by @grimoire in #3271
🐞 Bug fixes
- Fix missing cli chat option by @lzhangzz in #3209
- [ascend] fix multi-card distributed inference failures by @tangzhiyi11 in #3215
- fix for small cache-max-entry-count by @grimoire in #3221
- [dlinfer] fix glm-4v graph mode on ascend by @jinminxi104 in #3235
- fix qwen2.5 pytorch engine dtype error on NPU by @tcye in #3247
- [Fix] failed to update the tokenizer's eos_token_id into stop_word list by @lvhan028 in #3257
- fix dsv3 gate scaling by @grimoire in #3263
- Fix the bug for reading dict error by @GxjGit in #3196
- Fix get ppl by @lvhan028 in #3268
📚 Documentations
- Specify lmdeploy version in benchmark guide by @lyj0309 in #3216
- [ascend] add Ascend docker image by @jinminxi104 in #3239
🌐 Other
- [ci] testcase refactoring by @zhulinJulia24 in #3151
- [ci] add testcase for native communicator by @zhulinJulia24 in #3217
- [ci] add volc evaluation testcase by @zhulinJulia24 in #3240
- [ci] remove v100 testconfig by @zhulinJulia24 in #3253
- add rdma dependencies into docker file by @CUHKSZzxy in #3262
- docs: update ascend docs for docker running by @CyCle1024 in #3266
- bump version to v0.7.2 by @lvhan028 in #3252
New Contributors
Full Changelog: v0.7.1...v0.7.2
v0.7.1
What's Changed
🚀 Features
- support release pipeline by @irexyc in #3069
- [feature] add dlinfer w8a8 support. by @Reinerzhou in #2988
- [maca] support deepseekv2 for maca backend. by @Reinerzhou in #2918
- [Feature] support deepseek-vl2 for pytorch engine by @CUHKSZzxy in #3149
💥 Improvements
- use weights iterator while loading by @RunningLeon in #2886
- Add deepseek-r1 chat template by @AllentDan in #3072
- Update tokenizer by @lvhan028 in #3061
- Set max concurrent requests by @AllentDan in #2961
- remove logitswarper by @grimoire in #3109
- Update benchmark script and user guide by @lvhan028 in #3110
- support eos_token list in turbomind by @irexyc in #3044
- Use aiohttp inside proxy server && add --disable-cache-status argument by @AllentDan in #3020
- Update runtime package dependencies by @zgjja in #3142
- Make turbomind support embedding inputs on GPU by @chengyuma in #3177
🐞 Bug fixes
- [dlinfer] fix ascend qwen2_vl graph_mode by @yao-fengchen in #3045
- fix error in interactive api by @lvhan028 in #3074
- fix sliding window mgr by @grimoire in #3068
- More arguments in api_client, update docstrings by @AllentDan in #3077
- Add system role to deepseek chat template by @AllentDan in #3031
- Fix xcomposer2d5 by @irexyc in #3087
- fix user guide about cogvlm deployment by @lvhan028 in #3088
- fix positional argument by @lvhan028 in #3086
- Fix UT of deepseek chat template by @lvhan028 in #3125
- Fix internvl2.5 error after eviction by @grimoire in #3122
- Fix cogvlm and phi3vision by @RunningLeon in #3137
- [fix] fix vl gradio, use pipeline api and remove interactive chat by @irexyc in #3136
- fix the issue that stop_token may be less than defined in model.py by @irexyc in #3148
- fix typing by @lz1998 in #3153
- fix min length penalty by @irexyc in #3150
- fix default temperature value by @irexyc in #3166
- Use pad_token_id as image_token_id for vl models by @RunningLeon in #3158
- Fix tool call prompt for InternLM and Qwen by @AllentDan in #3156
- Update qwen2.py by @GxjGit in #3174
- fix temperature=0 by @grimoire in #3176
- fix blocked fp8 moe by @grimoire in #3181
- fix deepseekv2 has no attribute use_mla error by @CUHKSZzxy in #3188
- fix unstoppable chat by @lvhan028 in #3189
🌐 Other
- [ci] add internlm3 into testcase by @zhulinJulia24 in #3038
- add internlm3 to supported models by @lvhan028 in #3041
- update pre-commit config by @lvhan028 in #2683
- [maca] add cudagraph support on maca backend. by @Reinerzhou in #2834
- bump version to v0.7.0.post1 by @lvhan028 in #3076
- bump version to v0.7.0.post2 by @lvhan028 in #3094
- [Fix] fix the URL judgment problem in Windows by @Lychee-acaca in #3103
- bump version to v0.7.0.post3 by @lvhan028 in #3115
- [ci] fix some fail in daily testcase by @zhulinJulia24 in #3134
- Bump version to v0.7.1 by @lvhan028 in #3178
New Contributors
- @Lychee-acaca made their first contribution in #3103
- @lz1998 made their first contribution in #3153
- @GxjGit made their first contribution in #3174
- @chengyuma made their first contribution in #3177
- @CUHKSZzxy made their first contribution in #3149
Full Changelog: v0.7.0...v0.7.1
v0.7.0.post3
What's Changed
💥 Improvements
- Set max concurrent requests by @AllentDan in #2961
- remove logitswarper by @grimoire in #3109
🐞 Bug fixes
- fix user guide about cogvlm deployment by @lvhan028 in #3088
- fix positional argument by @lvhan028 in #3086
🌐 Other
- [Fix] fix the URL judgment problem in Windows by @Lychee-acaca in #3103
- bump version to v0.7.0.post3 by @lvhan028 in #3115
New Contributors
- @Lychee-acaca made their first contribution in #3103
Full Changelog: v0.7.0.post2...v0.7.0.post3
LMDeploy Release V0.7.0.post2
What's Changed
💥 Improvements
- Add deepseek-r1 chat template by @AllentDan in #3072
- Update tokenizer by @lvhan028 in #3061
🐞 Bug fixes
- Add system role to deepseek chat template by @AllentDan in #3031
- Fix xcomposer2d5 by @irexyc in #3087
🌐 Other
Full Changelog: v0.7.0.post1...v0.7.0.post2
LMDeploy Release V0.7.0.post1
What's Changed
💥 Improvements
- use weights iterator while loading by @RunningLeon in #2886
🐞 Bug fixes
- [dlinfer] fix ascend qwen2_vl graph_mode by @yao-fengchen in #3045
- fix error in interactive api by @lvhan028 in #3074
- fix sliding window mgr by @grimoire in #3068
- More arguments in api_client, update docstrings by @AllentDan in #3077
🌐 Other
- [ci] add internlm3 into testcase by @zhulinJulia24 in #3038
- add internlm3 to supported models by @lvhan028 in #3041
- update pre-commit config by @lvhan028 in #2683
- [maca] add cudagraph support on maca backend. by @Reinerzhou in #2834
- bump version to v0.7.0.post1 by @lvhan028 in #3076
Full Changelog: v0.7.0...v0.7.0.post1
LMDeploy Release v0.7.0
What's Changed
🚀 Features
- Support moe w8a8 in pytorch engine by @grimoire in #2894
- Support DeepseekV3 fp8 by @grimoire in #2967
- support new backend cambricon by @JackWeiw in #3002
- support-moe-fp8 by @RunningLeon in #3007
- add internlm3-dense(turbomind) & chat template by @irexyc in #3024
- support internlm3 on pt by @RunningLeon in #3026
- Support internlm3 quantization by @AllentDan in #3027
💥 Improvements
- Optimize awq kernel in pytorch engine by @grimoire in #2965
- Support fp8 w8a8 for pt backend by @RunningLeon in #2959
- Optimize lora kernel by @grimoire in #2975
- Remove threadsafe by @grimoire in #2907
- Refactor async engine & turbomind IO by @lzhangzz in #2968
- [dlinfer]rope refine by @JackWeiw in #2984
- Expose spaces_between_special_tokens by @AllentDan in #2991
- [dlinfer]change llm op interface of paged_prefill_attention. by @JackWeiw in #2977
- Update request logger by @lvhan028 in #2981
- remove decoding by @grimoire in #3016
🐞 Bug fixes
- Fix build crash in nvcr.io/nvidia/pytorch:24.06-py3 image by @zgjja in #2964
- add tool role in BaseChatTemplate as tool response in messages by @AllentDan in #2979
- Fix ascend dockerfile by @jinminxi104 in #2989
- fix internvl2 qk norm by @grimoire in #2987
- fix xcomposer2 when transformers is upgraded greater than 4.46 by @irexyc in #3001
- Fix get_ppl & get_logits by @lvhan028 in #3008
- Fix typo in w4a16 guide by @Yan-Xiangjun in #3018
- fix blocked fp8 moe kernel by @grimoire in #3009
- Fix async engine by @lzhangzz in #3029
- [hotfix] Fix get_ppl by @lvhan028 in #3023
- Fix MoE gating for DeepSeek V2 by @lzhangzz in #3030
- Fix empty response for pipeline by @lzhangzz in #3034
- Fix potential hang during TP model initialization by @lzhangzz in #3033
🌐 Other
- [ci] add w8a8 and internvl2.5 models into testcase by @zhulinJulia24 in #2949
- bump version to v0.7.0 by @lvhan028 in #3010
New Contributors
- @zgjja made their first contribution in #2964
- @Yan-Xiangjun made their first contribution in #3018
Full Changelog: 0.6.5...v0.7.0