What's Changed
🚀 Features
- LMDeploy Distserve by @JimyMa in #3304
- allow api server to be terminated through client requests by @RunningLeon in #3533
- support update params for pytorch backend from api server by @irexyc in #3535
- support eplb for Qwen3-MoE by @zhaochaoxing in #3582
- support update params for turbomind backend by @irexyc in #3566
- Quantize Qwen3 MoE bf16 model to fp8 model at runtime by @grimoire in #3631
- [Feat]: Support internvl3-8b-hf by @RunningLeon in #3633
- Add FP8 MoE for turbomind by @lzhangzz in #3601
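Several entries above and below (runtime bf16→fp8 quantization in #3631, blocked fp8 TMA in #3470, the zero-scale fix in #3652) revolve around blockwise fp8 quantization. The sketch below is an illustrative, simplified model of the idea only, not LMDeploy's implementation: each block of weights gets one scale chosen so the scaled values fit the fp8 e4m3 range, with a guard against all-zero blocks producing a zero scale. Function names, the block size, and the omission of actual fp8 rounding are all assumptions for illustration.

```python
# Illustrative sketch of blockwise fp8-style quantization (not LMDeploy code).
# Real fp8 e4m3 also rounds each value to the nearest representable fp8 number;
# this sketch only shows the per-block scaling.
FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def quantize_blockwise(weights, block_size=4):
    """Split `weights` into blocks and store one scale per block, chosen so
    every value in the block lands inside the fp8 range after scaling."""
    quantized, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        amax = max(abs(w) for w in block)
        # Guard against all-zero blocks: a zero scale would make
        # dequantization ill-defined (cf. the zero-scale fix in #3652).
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([w / scale for w in block])  # now within +/-448
    return quantized, scales

def dequantize_blockwise(quantized, scales):
    """Multiply each block back by its scale to recover the weights."""
    return [q * s for block, s in zip(quantized, scales) for q in block]

weights = [448.0, -224.0, 112.0, 56.0, 0.0, 0.0, 0.0, 0.0]
q, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(q, scales)
```

Because the scale is derived from the block's own max, overflow past the fp8 range (the subject of #3650) can only come from rounding or fused-kernel accumulation, which this sketch deliberately leaves out.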
💥 Improvements
- reduce ray memory usage by @grimoire in #3487
- use dlblas by @zhaochaoxing in #3489
- internlm3 dense fp8 by @CUHKSZzxy in #3527
- random pad input ids by @grimoire in #3530
- ray nsys profile support by @grimoire in #3448
- update blockedfp8 scale name by @CUHKSZzxy in #3532
- start engine loop on server startup event by @grimoire in #3523
- update two microbatch by @SHshenhao in #3540
- [ascend] set transdata dynamic shape true by @JackWeiw in #3531
- ray safe exit by @grimoire in #3545
- support update params with dp=1 for pytorch engine by @irexyc in #3562
- Skip dp dummy input forward by @grimoire in #3552
- Unlock mutual exclusivity of arguments tool-call-parser and reasoning-parser by @jingyibo123 in #3550
- perform torch.cuda.empty_cache() after conversion by @bltcn in #3570
- pipeline warmup by @irexyc in #3548
- Launch multiple api servers for dp > 1 by @RunningLeon in #3414
- support awq for Qwen2.5-VL by @RunningLeon in #3559
- support qwen3 /think & /no_think & enable_thinking parameter by @BUJIDAOVS in #3564
- Eplb by @zhaochaoxing in #3572
- Update benchmark by @lvhan028 in #3578
- block output when prefetching next forward inputs by @grimoire in #3573
- support both eplb and microbatch simultaneously by @zhaochaoxing in #3591
- Add log_file and set loglevel in launch_servers by @RunningLeon in #3596
- sampling on the tokenizer's vocab by @grimoire in #3604
- update deepgemm version by @grimoire in #3606
- [Ascend] set default distributed backend as ray for ascend device by @JackWeiw in #3603
- Blocked fp8 tma by @grimoire in #3470
- [PD Disaggregation] Async migration by @JimyMa in #3610
- move dp loop to model agent by @grimoire in #3598
- update some logs of proxy_server and pt engine by @lvhan028 in #3621
- improve model loading performance by shuffling the weight files by @irexyc in #3625
- add benchmark scripts about pipeline api and inference engines according to the config file by @lvhan028 in #3622
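Among the improvements, #3564 adds Qwen3's /think and /no_think soft switches plus an enable_thinking parameter. The sketch below builds a request body for the OpenAI-compatible chat endpoint showing both controls; the endpoint URL, model name, and the exact placement of `enable_thinking` (top-level here) are assumptions for illustration, not confirmed API details.

```python
# Illustrative request body for toggling Qwen3 thinking mode (field placement
# and names other than those in the release note are assumptions).
import json

payload = {
    "model": "Qwen3-8B",  # placeholder model name
    "messages": [
        # Appending "/no_think" to a user turn soft-disables thinking
        # for that turn; "/think" re-enables it.
        {"role": "user", "content": "Briefly explain KV cache. /no_think"},
    ],
    # Per #3564, thinking can also be toggled via this parameter.
    "enable_thinking": False,
}
body = json.dumps(payload)
# e.g. POST this body to the server's /v1/chat/completions endpoint
```

A client would send this body with any HTTP library; only the serialization is shown here since the soft switches live entirely in the message text and the parameter.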
🐞 Bug fixes
- [ascend] fix recompile on different rank by @jinminxi104 in #3513
- fix attention sm86 by @grimoire in #3519
- fix stopwords kv cache by @grimoire in #3494
- [bug fix] fix PD Disaggregation in DSV3 by @JimyMa in #3547
- fix proxy server heart beat by @irexyc in #3543
- fix dp>1 tp=1 ep=1 by @grimoire in #3555
- fix mixtral on new transformers by @grimoire in #3580
- [Fix]: reset step after eviction by @RunningLeon in #3589
- fix parsing dynamic rope param failed by @lvhan028 in #3575
- Fix batch infer for gemma3vl by @RunningLeon in #3592
- Fix symbol error when dlBLAS is not imported by @zhaochaoxing in #3597
- read distributed envs by @grimoire in #3600
- fix side-effect caused by PR 3590 by @lvhan028 in #3608
- fix bug in qwen2 by @LKJacky in #3614
- fix awq kernel by @grimoire in #3618
- fix flash mla interface by @grimoire in #3617
- add sampling_vocab_size by @irexyc in #3607
- fix for default quant by @grimoire in #3640
- Fix log file env in ray worker by @RunningLeon in #3624
- fix qwen3 chat template by @lvhan028 in #3641
- fix vlm runtime quant by @grimoire in #3644
- Fix 'Namespace' object has no attribute 'num_tokens_per_iter' when serving by gradio by @lvhan028 in #3647
- Synchronize weight processing by @lzhangzz in #3649
- Fix zero scale in fp8 quantization by @lzhangzz in #3652
🌐 Other
- update doc for ascend 300I Duo docker image by @jinminxi104 in #3526
- simulate EPLB for benchmark only by @lvhan028 in #3490
- [ci] add test workflow for 3090 machine by @zhulinJulia24 in #3561
- [ci] fix transformers version in prtest by @zhulinJulia24 in #3584
- [Misc] minor api_server and tm loader, and upgrade docformatter to resolve lint error by @lvhan028 in #3590
- [ci] add qwen3 models into testcase by @zhulinJulia24 in #3593
- update Dockerfile by @CUHKSZzxy in #3634
- check in lmdeploy-builder on cuda 12.4 and 12.8 platform by @lvhan028 in #3630
- fix blocked fp8 overflow by @grimoire in #3650
- Bump version to v0.9.0 by @lvhan028 in #3609
New Contributors
- @JimyMa made their first contribution in #3304
- @jingyibo123 made their first contribution in #3550
- @bltcn made their first contribution in #3570
- @BUJIDAOVS made their first contribution in #3564
- @LKJacky made their first contribution in #3614
Full Changelog: v0.8.0...v0.9.0