What's Changed
🚀 Features
- LMDeploy Distserve by @JimyMa in #3304
- allow api server to be terminated through client requests by @RunningLeon in #3533
- support update params for pytorch backend from api server by @irexyc in #3535
- support eplb for Qwen3-MoE by @zhaochaoxing in #3582
- support update params for turbomind backend by @irexyc in #3566
- Quantize Qwen3 MoE bf16 model to fp8 model at runtime by @grimoire in #3631
- [Feat]: Support internvl3-8b-hf by @RunningLeon in #3633
- Add FP8 MoE for turbomind by @lzhangzz in #3601
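Several entries above and below (runtime bf16→fp8 quantization in #3631, blocked fp8 TMA in #3470, the zero-scale fix in #3652) revolve around blockwise fp8 quantization. The sketch below is an illustrative, simplified model of the idea only, not LMDeploy's implementation: each block of weights gets one scale chosen so the scaled values fit the fp8 e4m3 range, with a guard against all-zero blocks producing a zero scale. Function names, the block size, and the omission of actual fp8 rounding are all assumptions for illustration.

```python
# Illustrative sketch of blockwise fp8-style quantization (not LMDeploy code).
# Real fp8 e4m3 also rounds each value to the nearest representable fp8 number;
# this sketch only shows the per-block scaling.
FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def quantize_blockwise(weights, block_size=4):
    """Split `weights` into blocks and store one scale per block, chosen so
    every value in the block lands inside the fp8 range after scaling."""
    quantized, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        amax = max(abs(w) for w in block)
        # Guard against all-zero blocks: a zero scale would make
        # dequantization ill-defined (cf. the zero-scale fix in #3652).
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([w / scale for w in block])  # now within +/-448
    return quantized, scales

def dequantize_blockwise(quantized, scales):
    """Multiply each block back by its scale to recover the weights."""
    return [q * s for block, s in zip(quantized, scales) for q in block]

weights = [448.0, -224.0, 112.0, 56.0, 0.0, 0.0, 0.0, 0.0]
q, scales = quantize_blockwise(weights)
restored = dequantize_blockwise(q, scales)
```

Because the scale is derived from the block's own max, overflow past the fp8 range (the subject of #3650) can only come from rounding or fused-kernel accumulation, which this sketch deliberately leaves out.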
💥 Improvements
- reduce ray memory usage by @grimoire in #3487
- use dlblas by @zhaochaoxing in #3489
- internlm3 dense fp8 by @CUHKSZzxy in #3527
- random pad input ids by @grimoire in #3530
- ray nsys profile support by @grimoire in #3448
- update blockedfp8 scale name by @CUHKSZzxy in #3532
- start engine loop on server startup event by @grimoire in #3523
- update two microbatch by @SHshenhao in #3540
- [ascend] set transdata dynamic shape true by @JackWeiw in #3531
- ray safe exit by @grimoire in #3545
- support update params with dp=1 for pytorch engine by @irexyc in #3562
- Skip dp dummy input forward by @grimoire in #3552
- Unlock mutual exclusivity of arguments tool-call-parser and reasoning-parser by @jingyibo123 in #3550
- perform torch.cuda.empty_cache() after conversion by @bltcn in #3570
- pipeline warmup by @irexyc in #3548
- Launch multiple api servers for dp > 1 by @RunningLeon in #3414
- support awq for Qwen2.5-VL by @RunningLeon in #3559
- support qwen3 /think & /no_think & enable_thinking parameter by @BUJIDAOVS in #3564
- Eplb by @zhaochaoxing in #3572
- Update benchmark by @lvhan028 in #3578
- block output when prefetching next forward inputs by @grimoire in #3573
- support both eplb and microbatch simultaneously by @zhaochaoxing in #3591
- Add log_file and set loglevel in launch_servers by @RunningLeon in #3596
- sampling on the tokenizer's vocab by @grimoire in #3604
- update deepgemm version by @grimoire in #3606
- [Ascend] set default distributed backend as ray for ascend device by @JackWeiw in #3603
- Blocked fp8 tma by @grimoire in #3470
- [PD Disaggregation] Async migration by @JimyMa in #3610
- move dp loop to model agent by @grimoire in #3598
- update some logs of proxy_server and pt engine by @lvhan028 in #3621
- improve model loading performance by shuffling the weight files by @irexyc in #3625
- add benchmark scripts about pipeline api and inference engines according to the config file by @lvhan028 in #3622
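Among the improvements, #3564 adds Qwen3's /think and /no_think soft switches plus an enable_thinking parameter. The sketch below builds a request body for the OpenAI-compatible chat endpoint showing both controls; the endpoint URL, model name, and the exact placement of `enable_thinking` (top-level here) are assumptions for illustration, not confirmed API details.

```python
# Illustrative request body for toggling Qwen3 thinking mode (field placement
# and names other than those in the release note are assumptions).
import json

payload = {
    "model": "Qwen3-8B",  # placeholder model name
    "messages": [
        # Appending "/no_think" to a user turn soft-disables thinking
        # for that turn; "/think" re-enables it.
        {"role": "user", "content": "Briefly explain KV cache. /no_think"},
    ],
    # Per #3564, thinking can also be toggled via this parameter.
    "enable_thinking": False,
}
body = json.dumps(payload)
# e.g. POST this body to the server's /v1/chat/completions endpoint
```

A client would send this body with any HTTP library; only the serialization is shown here since the soft switches live entirely in the message text and the parameter.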
🐞 Bug fixes
- [ascend] fix recompile on different rank by @jinminxi104 in #3513
- fix attention sm86 by @grimoire in #3519
- fix stopwords kv cache by @grimoire in #3494
- [bug fix] fix PD Disaggregation in DSV3 by @JimyMa in #3547
- fix proxy server heart beat by @irexyc in #3543
- fix dp>1 tp=1 ep=1 by @grimoire in #3555
- fix mixtral on new transformers by @grimoire in #3580
- [Fix]: reset step after eviction by @RunningLeon in #3589
- fix parsing dynamic rope param failed by @lvhan028 in #3575
- Fix batch infer for gemma3vl by @RunningLeon in #3592
- Fix symbol error when dlBLAS is not imported by @zhaochaoxing in #3597
- read distributed envs by @grimoire in #3600
- fix side-effect caused by PR 3590 by @lvhan028 in #3608
- fix bug in qwen2 by @LKJacky in #3614
- fix awq kernel by @grimoire in #3618
- fix flash mla interface by @grimoire in #3617
- add sampling_vocab_size by @irexyc in #3607
- fix for default quant by @grimoire in #3640
- Fix log file env in ray worker by @RunningLeon in #3624
- fix qwen3 chat template by @lvhan028 in #3641
- fix vlm runtime quant by @grimoire in #3644
- Fix 'Namespace' object has no attribute 'num_tokens_per_iter' when serving by gradio by @lvhan028 in #3647
- Synchronize weight processing by @lzhangzz in #3649
- Fix zero scale in fp8 quantization by @lzhangzz in #3652
🌐 Other
- update doc for ascend 300I Duo docker image by @jinminxi104 in #3526
- simulate EPLB for benchmark only by @lvhan028 in #3490
- [ci] add test workflow for 3090 machine by @zhulinJulia24 in #3561
- [ci] fix transformers version in prtest by @zhulinJulia24 in #3584
- [Misc] minor api_server and tm loader, and upgrade docformatter to resolve lint error by @lvhan028 in #3590
- [ci] add qwen3 models into testcase by @zhulinJulia24 in #3593
- update Dockerfile by @CUHKSZzxy in #3634
- check in lmdeploy-builder on cuda 12.4 and 12.8 platform by @lvhan028 in #3630
- fix blocked fp8 overflow by @grimoire in #3650
- Bump version to v0.9.0 by @lvhan028 in #3609
New Contributors
- @JimyMa made their first contribution in #3304
- @jingyibo123 made their first contribution in #3550
- @bltcn made their first contribution in #3570
- @BUJIDAOVS made their first contribution in #3564
- @LKJacky made their first contribution in #3614
Full Changelog: v0.8.0...v0.9.0