Insights: vllm-project/vllm
Overview
3 Releases published by 1 person
-
v0.10.0rc1
published
Jul 20, 2025 -
v0.10.0rc2
published
Jul 24, 2025 -
v0.10.0
published
Jul 24, 2025
186 Pull requests merged by 102 people
-
[TPU][Test] Rollback PR-21550.
#21619 merged
Jul 25, 2025 -
[Docs] add auto-round quantization readme
#21600 merged
Jul 25, 2025 -
[CI] Unifying Dockerfiles for ARM and X86 Builds
#21343 merged
Jul 25, 2025 -
Add support for Prithvi in Online serving mode
#21518 merged
Jul 25, 2025 -
[Kernel] Improve machete memory bound perf
#21556 merged
Jul 25, 2025 -
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend.
#20295 merged
Jul 25, 2025 -
[Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel
#20839 merged
Jul 25, 2025 -
[Frontend] Add request_id to the Request object so they can be controlled better via external load balancers
#21009 merged
Jul 25, 2025 -
[MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B
#20931 merged
Jul 25, 2025 -
[Model] Fix Ernie4.5MoE e_score_correction_bias parameter
#21586 merged
Jul 25, 2025 -
[Bugfix][Logprobs] Fix logprobs op to support more backend
#21591 merged
Jul 25, 2025 -
[V1] Get supported tasks from model runner instead of model config
#21585 merged
Jul 25, 2025 -
[Quantization] Enable BNB support for more MoE models
#21370 merged
Jul 25, 2025 -
[Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith'
#21579 merged
Jul 25, 2025 -
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct
#21598 merged
Jul 25, 2025 -
[Tests] Harden DP tests
#21508 merged
Jul 25, 2025 -
[TPU][Bugfix] fix OOM issue in CI test
#21550 merged
Jul 25, 2025 -
[Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS
#21262 merged
Jul 25, 2025 -
[CI/Build] fix cpu_extension for apple silicon
#21195 merged
Jul 25, 2025 -
[Misc][Tools] make max-model-len a parameter in auto_tune script
#21321 merged
Jul 25, 2025 -
[Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings
#21479 merged
Jul 25, 2025 -
[Model] Support tensor parallel for timm ViT in Deepseek_vl2
#21494 merged
Jul 25, 2025 -
[Bugfix] fix modelscope snapshot_download serialization
#21536 merged
Jul 25, 2025 -
[CI] Update CODEOWNERS for CPU and Intel GPU
#21582 merged
Jul 25, 2025 -
Integrate TensorSchema with shape validation for Phi3VImagePixelInputs
#21232 merged
Jul 25, 2025 -
[Docs] Add requirements/common.txt to run unit tests
#21572 merged
Jul 25, 2025 -
[TPU][Test] Temporarily suspend this MoE model in test_basic.py.
#21560 merged
Jul 25, 2025 -
[DP] Support api-server-count > 0 in hybrid DP LB mode
#21510 merged
Jul 25, 2025 -
[Bugfix] DeepGemm utils : Fix hardcoded type-cast
#21517 merged
Jul 25, 2025 -
[Kernel] adding fused_moe configs for upcoming granite4
#21332 merged
Jul 25, 2025 -
Fix GLM-4 PP Missing Layer When using with PP.
#21531 merged
Jul 25, 2025 -
[Bug] Fix DeepGemm Init Error
#21554 merged
Jul 25, 2025 -
[Docs] Fix site_url for RunLLM
#21564 merged
Jul 25, 2025 -
[Frontend] run-batch supports V1
#21541 merged
Jul 25, 2025 -
[MoE] More balanced expert sharding
#21497 merged
Jul 24, 2025 -
[TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3.
#21539 merged
Jul 24, 2025 -
update flashinfer to v0.2.9rc1
#21485 merged
Jul 24, 2025 -
[Docs] Add Expert Parallelism Initial Documentation
#21373 merged
Jul 24, 2025 -
[Docs][minor] Fix broken gh-file link in distributed serving docs
#21543 merged
Jul 24, 2025 -
[P/D] Support CPU Transfer in NixlConnector
#18293 merged
Jul 24, 2025 -
[P/D] Move FakeNixlWrapper to test dir
#21328 merged
Jul 24, 2025 -
[XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform
#21036 merged
Jul 24, 2025 -
Update flashinfer CUTLASS MoE Kernel
#21408 merged
Jul 24, 2025 -
[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access
#21465 merged
Jul 24, 2025 -
[Docs] Rewrite Distributed Inference and Serving guide
#20593 merged
Jul 24, 2025 -
[Docs] Update Tensorizer usage documentation
#21190 merged
Jul 24, 2025 -
[Fix] Update mamba_ssm to 2.2.5
#21421 merged
Jul 24, 2025 -
[Bugfix] Fix CUDA arch flags for MoE permute
#21426 merged
Jul 24, 2025 -
[Model] Officially support Emu3 with Transformers backend
#21319 merged
Jul 24, 2025 -
[Attention] Optimize FlashInfer MetadataBuilder Build call
#21137 merged
Jul 24, 2025 -
Bump flashinfer to v0.2.8
#21385 merged
Jul 24, 2025 -
[Feat] Allow custom naming of vLLM processes
#21445 merged
Jul 24, 2025 -
[Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices()
#21501 merged
Jul 24, 2025 -
Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0
#20544 merged
Jul 24, 2025 -
Remove incorrect GLM-4 quantization code
#21435 merged
Jul 24, 2025 -
[Core] Support model loader plugins
#21067 merged
Jul 24, 2025 -
[Misc] Fix duplicate FusedMoEConfig debug messages
#21455 merged
Jul 24, 2025 -
[v1][Core] Clean up usages of SpecializedManager
#21407 merged
Jul 24, 2025 -
[TPU][Bugfix] fix moe layer
#21340 merged
Jul 24, 2025 -
[Bugfix][ROCm] Fix for warp_size uses on host
#21205 merged
Jul 24, 2025 -
Deduplicate Transformers backend code using inheritance
#21461 merged
Jul 24, 2025 -
Add think chunk
#21333 merged
Jul 24, 2025 -
[BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses
#21211 merged
Jul 24, 2025 -
Dump input metadata on crash for async scheduling
#21258 merged
Jul 24, 2025 -
[DP] Internal Load Balancing Per Node [one-pod-per-node]
#21238 merged
Jul 24, 2025 -
[BugFix] Fix KVConnector TP worker aggregation
#21473 merged
Jul 24, 2025 -
[BugFix]: Batch generation from prompt_embeds fails for long prompts
#21390 merged
Jul 24, 2025 -
[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process
#21437 merged
Jul 24, 2025 -
[Bugfix] Fix casing warning
#21468 merged
Jul 24, 2025 -
[XPU][UT] increase intel xpu CI test scope
#21492 merged
Jul 24, 2025 -
[Misc] Add dummy maverick test to CI
#21324 merged
Jul 24, 2025 -
[Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding
#21374 merged
Jul 24, 2025 -
feat(gguf_loader): accept HF repo paths & URLs for GGUF
#20793 merged
Jul 24, 2025 -
[Core] Freeze gc during cuda graph capture to speed up init
#21146 merged
Jul 24, 2025 -
[V0 Deprecation] Remove Prompt Adapters
#20588 merged
Jul 23, 2025 -
[V1] Fix local chunked attention always disabled
#21419 merged
Jul 23, 2025 -
[Core] Add reload_weights RPC method
#20096 merged
Jul 23, 2025 -
[TPU][TEST] Fix the downloading issue in TPU v1 test 11.
#21418 merged
Jul 23, 2025 -
Add test case for compiling multiple graphs
#21044 merged
Jul 23, 2025 -
[Core][Model] PrithviMAE Enablement on vLLM v1 engine
#20577 merged
Jul 23, 2025 -
[Tests] Add tests for headless internal DP LB
#21450 merged
Jul 23, 2025 -
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models.
#21364 merged
Jul 23, 2025 -
[V1] Check all pooling tasks during profiling
#21299 merged
Jul 23, 2025 -
[Model] add Hunyuan V1 Dense Model support.
#21368 merged
Jul 23, 2025 -
[Docs] Clean up v1/metrics.md
#21449 merged
Jul 23, 2025 -
[Misc] fixed nvfp4_moe test failures due to invalid kwargs
#21246 merged
Jul 23, 2025 -
Mamba V2 Test not Asserting Failures.
#21379 merged
Jul 23, 2025 -
[Sampler] Introduce logprobs mode for logging
#21398 merged
Jul 23, 2025 -
[Docs] Fix bullets and grammars in tool_calling.md
#21440 merged
Jul 23, 2025 -
Fixed typo in profiling logs
#21441 merged
Jul 23, 2025 -
[Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload
#19679 merged
Jul 23, 2025 -
add clear messages for deprecated models
#21424 merged
Jul 23, 2025 -
[Cleanup] Only log MoE DP setup warning if DP is enabled
#21315 merged
Jul 23, 2025 -
[Core] Add basic unit test for maybe_evict_cached_block
#21400 merged
Jul 23, 2025 -
[Bugfix] Fix nightly transformers CI failure
#21427 merged
Jul 23, 2025 -
Changing "amdproduction" allocation.
#21409 merged
Jul 23, 2025 -
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported
#21420 merged
Jul 23, 2025 -
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update
#21414 merged
Jul 23, 2025 -
[BugFix] Update python to python3 calls for image; fix prefix & input calculations.
#21391 merged
Jul 23, 2025 -
Simplify weight loading in Transformers backend
#21382 merged
Jul 23, 2025 -
[Bugfix][ROCm][Build] Fix build regression on ROCm
#21393 merged
Jul 23, 2025 -
[CI/Build] Fix model executor tests
#21387 merged
Jul 23, 2025 -
[BugFix] Fix ray import error mem cleanup bug
#21381 merged
Jul 22, 2025 -
[Misc] Copy HF_TOKEN env var to Ray workers
#21406 merged
Jul 22, 2025 -
[Model] Add Qwen3CoderToolParser
#21396 merged
Jul 22, 2025 -
Fix Flashinfer Allreduce+Norm enable/disable calculation based on fi_allreduce_fusion_max_token_num
#21325 merged
Jul 22, 2025 -
[CI/Build] Fix test failure due to updated model repo
#21375 merged
Jul 22, 2025 -
[Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers
#21353 merged
Jul 22, 2025 -
Add tokenization_kwargs to encode for embedding model truncation
#21033 merged
Jul 22, 2025 -
Revert "[Refactor] Fix Compile Warning #1444-D (#21208)"
#21384 merged
Jul 22, 2025 -
[feat] Enable mm caching for transformers backend
#21358 merged
Jul 22, 2025 -
Adds parallel model weight loading for runai_streamer
#21330 merged
Jul 22, 2025 -
[Perf] Cuda Kernel for Per Token Group Quant
#21083 merged
Jul 22, 2025 -
[feat]: add SM100 support for cutlass FP8 groupGEMM
#20447 merged
Jul 22, 2025 -
[perf] Add fused MLA QKV + strided layernorm
#21116 merged
Jul 22, 2025 -
[Misc] unify variable for LLM instance v2
#21356 merged
Jul 22, 2025 -
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool
#21222 merged
Jul 22, 2025 -
[benchmark] Port benchmark request sent optimization to benchmark_serving
#21209 merged
Jul 22, 2025 -
[Core] Optimize update checks in LogitsProcessor
#21245 merged
Jul 22, 2025 -
[Misc] Remove deprecated args in v0.10
#21349 merged
Jul 22, 2025 -
[Bugfix] Fix eviction cached blocked logic
#21357 merged
Jul 22, 2025 -
Add arcee model
#21296 merged
Jul 22, 2025 -
[Feature][eplb] add verify ep or tp or dp
#21102 merged
Jul 22, 2025 -
Update fp4 quantize API
#21327 merged
Jul 22, 2025 -
[Bug] DeepGemm: Fix Cuda Init Error
#21312 merged
Jul 22, 2025 -
[Misc] DeepEPHighThroughtput - Enable Inductor pass
#21311 merged
Jul 22, 2025 -
Fix kv_cache_dtype handling for out-of-tree HPU plugin
#21302 merged
Jul 22, 2025 -
[Refactor] Fix Compile Warning #1444-D
#21208 merged
Jul 22, 2025 -
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible
#21300 merged
Jul 22, 2025 -
[Core] Minimize number of dict lookup in _maybe_evict_cached_block
#21281 merged
Jul 22, 2025 -
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
#21334 merged
Jul 22, 2025 -
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU
#21338 merged
Jul 22, 2025 -
[Doc] Fix CPU doc format
#21316 merged
Jul 22, 2025 -
[XPU] Enable external_launcher to serve as an executor via torchrun
#21021 merged
Jul 22, 2025 -
[v1][sampler] Inplace logprobs comparison to get the token rank
#21283 merged
Jul 21, 2025 -
[perf] Speed up align sum kernels
#21079 merged
Jul 21, 2025 -
Fix bad lm-eval fork
#21318 merged
Jul 21, 2025 -
[DP] Fix Prometheus Logging
#21257 merged
Jul 21, 2025 -
[Attention] Clean up iRoPE in V1
#21188 merged
Jul 21, 2025 -
[Misc] Add dummy maverick test
#21199 merged
Jul 21, 2025 -
[BugFix] Make utils.current_stream thread-safe (#21252)
#21253 merged
Jul 21, 2025 -
[CPU] Enable shared-memory based pipeline parallel for CPU backend
#21289 merged
Jul 21, 2025 -
[Misc] Add sliding window to flashinfer test
#21282 merged
Jul 21, 2025 -
Add Nvidia ModelOpt config adaptation
#19815 merged
Jul 21, 2025 -
[Misc] unify variable for LLM instance
#20996 merged
Jul 21, 2025 -
[Docs] Make tables more space efficient in supported_models.md
#21291 merged
Jul 21, 2025 -
[Docs] Fix hardcoded links in docs
#21287 merged
Jul 21, 2025 -
[Model][1/N] Support multiple poolers at model level
#21227 merged
Jul 21, 2025 -
[Bugfix] Fix missing placeholder in logger debug
#21280 merged
Jul 21, 2025 -
Add the instruction to run e2e validation manually before release
#21023 merged
Jul 21, 2025 -
[Docs] Add RFC Meeting to Issue Template
#21279 merged
Jul 21, 2025 -
[CI] Cleanup modelscope version constraint in Dockerfile
#21243 merged
Jul 21, 2025 -
[bugfix] fix syntax warning caused by backslash
#21251 merged
Jul 20, 2025 -
[Model] Support VLMs with transformers backend
#20543 merged
Jul 20, 2025 -
[Docs] Upgrade VLLM version to 0.10.0 for installing from vLLM's binaries
#21240 merged
Jul 20, 2025 -
[Model] use AutoWeightsLoader for bart
#18299 merged
Jul 20, 2025 -
Enable v1 metrics tests
#20953 merged
Jul 20, 2025 -
[TPU] support fp8 kv cache quantization
#19292 merged
Jul 20, 2025 -
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models.
#21233 merged
Jul 19, 2025 -
GLM-4 Update
#20736 merged
Jul 19, 2025 -
[BugFix] Fix full cuda graph slot_mapping
#21228 merged
Jul 19, 2025 -
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small
#21217 merged
Jul 19, 2025 -
[V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers
#21194 merged
Jul 19, 2025 -
[BugFix] Make PD work with Ray
#21072 merged
Jul 19, 2025 -
[Docs] Update the link to the 'Prometheus/Grafana' example
#21225 merged
Jul 19, 2025 -
[CI/CD][bugfix]fix: error argument to loads has incompatible type
#21223 merged
Jul 19, 2025 -
Fix/remove some broken model executor tests
#21224 merged
Jul 19, 2025 -
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code
#21032 merged
Jul 19, 2025 -
[Bugfix][Frontend] Fix openai CLI arg middleware
#21220 merged
Jul 19, 2025 -
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency
#20645 merged
Jul 19, 2025 -
Add torch golden impl for moe_align_block_size kernel test
#20653 merged
Jul 19, 2025 -
[BugFix] Fix potential cuda-graph IMA
#21196 merged
Jul 19, 2025 -
[Bugfix] Fix ndarray video color from VideoAsset
#21064 merged
Jul 19, 2025 -
[V0 deprecation] Remove long context LoRA
#21169 merged
Jul 19, 2025 -
Fix a couple of Voxtral tests
#21218 merged
Jul 19, 2025 -
[Misc][Tools][Benchmark] Add readme file for auto_tune script
#20779 merged
Jul 19, 2025 -
[Model] EXAONE 4.0 model support
#21060 merged
Jul 19, 2025 -
[BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope
#21200 merged
Jul 19, 2025 -
[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel
#21193 merged
Jul 19, 2025 -
[V0 Deprecation] Remove V0 Spec Decode workers
#21152 merged
Jul 19, 2025 -
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503
#21183 merged
Jul 19, 2025 -
[Core] Support Local Chunked Attention for Hybrid KV Cache
#19351 merged
Jul 19, 2025 -
[Quantization] Enable BNB support for more MoE models
#21100 merged
Jul 19, 2025 -
Elastic Expert Parallel Initial Support
#20775 merged
Jul 19, 2025 -
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series)
#21077 merged
Jul 18, 2025
144 Pull requests opened by 113 people
-
Refactor xformers check for MultiHeadAttention
#21210 opened
Jul 18, 2025 -
Add chat doc in quick start
#21213 opened
Jul 19, 2025 -
[Feature] [V1] intermediate logging
#21215 opened
Jul 19, 2025 -
[Bugfix] missing kv_cache_scheme
#21221 opened
Jul 19, 2025 -
[WIP][Kernel]FusedMoE LoRA
#21229 opened
Jul 19, 2025 -
[Nixl] Debug logging
#21230 opened
Jul 19, 2025 -
[CI/Build] Add bc-linter to vLLM CI
#21234 opened
Jul 19, 2025 -
[no commit] bc-linter demo
#21235 opened
Jul 19, 2025 -
[bugfix] Remove the attribute 'version' from docker compose
#21241 opened
Jul 20, 2025 -
[FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel
#21242 opened
Jul 20, 2025 -
Make async scheduling compatible with DP
#21244 opened
Jul 20, 2025 -
[Core] Prototype: Move hash_request_tokens computation from input request threads
#21248 opened
Jul 20, 2025 -
[v1] - Mamba1 Attention Metadata
#21249 opened
Jul 20, 2025 -
raise 400 Bad Request with detailed error message for `aiohttp.ClientError`
#21255 opened
Jul 20, 2025 -
rank -> TP rank in MultiProcExecutor log
#21256 opened
Jul 20, 2025 -
[Fix] correct tool_id for kimi-k2 when use tool_choice=required
#21259 opened
Jul 20, 2025 -
Fix docstring of PyNcclCommunicator device arg
#21268 opened
Jul 20, 2025 -
Support encoder-only models without KV-Cache
#21270 opened
Jul 20, 2025 -
[Core] Add max-waiting-queue-length parameter to reject requests when queue is full
#21271 opened
Jul 20, 2025 -
WIP: Add EPLB support for Grok1
#21273 opened
Jul 21, 2025 -
[Model] vllm v1 support mlp_speculator
#21276 opened
Jul 21, 2025 -
[Misc][Numerics] Basic logprobs benchmark tool
#21286 opened
Jul 21, 2025 -
[Feature][EPLB] Add support for Qwen3 EPLB
#21290 opened
Jul 21, 2025 -
Fix docker/AppArmor crash caused by cpuinfo __cpuid jit path
#21305 opened
Jul 21, 2025 -
Support CUTLASS NVFP4 (w4a4) for Blackwell Geforce GPUs (SM120)
#21309 opened
Jul 21, 2025 -
[DBO] Adding the `UBatchContext` class for DBO support
#21314 opened
Jul 21, 2025 -
adds include_thinking optional Param to Request object to preserve re…
#21317 opened
Jul 21, 2025 -
[WIP][RC] Update PyTorch to 2.8.0
#21320 opened
Jul 21, 2025 -
Support Tensorrt-LLM MoE fp4 for low-latency
#21331 opened
Jul 21, 2025 -
[Refactor] Remove `moe_align_block_size_triton`
#21335 opened
Jul 21, 2025 -
Support DeepSeekV3-style block FP8 quantization with CT
#21337 opened
Jul 21, 2025 -
Add anthropic endpoint
#21341 opened
Jul 22, 2025 -
[V1] port xformers backend to v1
#21342 opened
Jul 22, 2025 -
[Speculative Decoding] Add `speculators` Config Support
#21345 opened
Jul 22, 2025 -
[V0 deprecation] Guided decoding
#21347 opened
Jul 22, 2025 -
[AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile
#21350 opened
Jul 22, 2025 -
[Core] Minor comments and asserts changes in block pool
#21351 opened
Jul 22, 2025 -
[Core][Feat] Add max-waiting-queue-length parameter to reject requests when waiting queue is full
#21352 opened
Jul 22, 2025 -
[xpu] disable cudagraph for xpu platform
#21354 opened
Jul 22, 2025 -
[CI/Build][Doc] Move existing benchmark scripts in CI/document/example to vllm bench CLI
#21355 opened
Jul 22, 2025 -
[Bugfix] FIX hermes tool parser streaming bug when using function call
#21360 opened
Jul 22, 2025 -
[feat] Support EAGLE for Qwen2
#21363 opened
Jul 22, 2025 -
fix: return {} for tool arguments when no argument is needed, so that…
#21365 opened
Jul 22, 2025 -
[ROCm] Auto-Select Attention Backend
#21366 opened
Jul 22, 2025 -
[V1][CUDA] Full cudagraph support for FlashInfer
#21367 opened
Jul 22, 2025 -
skip fusedmoe layer for start_load_kv
#21378 opened
Jul 22, 2025 -
[Bugfix][Apple Silicon] fix missing symbols when build from source on Mac with Apple Silicon
#21380 opened
Jul 22, 2025 -
[wip] add nccl allocator and symm memory and enable TP all reduce for nccl symm
#21383 opened
Jul 22, 2025 -
[tests] test_async_llm_engine.py
#21388 opened
Jul 22, 2025 -
Add `flashinfer_python` to CUDA wheel requirements
#21389 opened
Jul 22, 2025 -
[Model] Refactor JambaForCausalLM
#21394 opened
Jul 22, 2025 -
[wip]
#21395 opened
Jul 22, 2025 -
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers
#21401 opened
Jul 22, 2025 -
Refactor dense FP8 tensor/channel/block utils and add CT FP8 block
#21404 opened
Jul 22, 2025 -
[NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels
#21411 opened
Jul 22, 2025 -
[v1][attention] Support Hybrid Allocator + FlashInfer
#21412 opened
Jul 22, 2025 -
Intentionally fail parallel sampling test
#21413 opened
Jul 22, 2025 -
[WIP] Prepare for DI integration. Currently DI is not used; but this is to make sure both paths run fine.
#21415 opened
Jul 22, 2025 -
Updates to Flex + VLLm integration
#21416 opened
Jul 22, 2025 -
[TPU] Support Pathways in vLLM
#21417 opened
Jul 22, 2025 -
[BugFix] Fix shared storage connector load kv only load attention layer
#21428 opened
Jul 23, 2025 -
[Misc] Improve memory profiling debug message
#21429 opened
Jul 23, 2025 -
[Fix] Connect fx_graph_cache option to envs.VLLM_DISABLE_COMPILE_CACHE
#21430 opened
Jul 23, 2025 -
[TPU][Test] Divide TPU v1 Test into 2 parts.
#21431 opened
Jul 23, 2025 -
[Bugfix]check core_engine process exit unexpectedly
#21443 opened
Jul 23, 2025 -
[Bugfix] Fixed the missing metrics in output
#21444 opened
Jul 23, 2025 -
Support online_serving for qwen3-reranker model
#21446 opened
Jul 23, 2025 -
v1/offloading: Add worker-side CPU support
#21448 opened
Jul 23, 2025 -
Add TNG Tool Call Parser
#21456 opened
Jul 23, 2025 -
[NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend
#21458 opened
Jul 23, 2025 -
[Refactor] Fix Compile Warning #1444-D
#21462 opened
Jul 23, 2025 -
[Misc] Move comment to reflect original intent
#21464 opened
Jul 23, 2025 -
[Model] Mamba2 varlen and metadata refactor
#21467 opened
Jul 23, 2025 -
[Deprecation][2/N] Replace `--task` with `--runner` and `--convert`
#21470 opened
Jul 23, 2025 -
DeepGEMM is not enabled on B200 when loading DeepSeek R1
#21472 opened
Jul 23, 2025 -
[Perf] Cuda Kernel for Int8 Per Token Group Quant
#21476 opened
Jul 23, 2025 -
[v1][spec decode] Run eagle with full cudagraph support
#21477 opened
Jul 23, 2025 -
Add interleaved RoPE test for Llama4 (Maverick)
#21478 opened
Jul 23, 2025 -
Llama4 FP4 Support
#21484 opened
Jul 23, 2025 -
Update vLLM Benchmark Suite for Xeon based on 0.9.2 release
#21486 opened
Jul 24, 2025 -
improve estimation of available KV Cache memory
#21489 opened
Jul 24, 2025 -
[V1][Neuron] Neuron chunked prefill V1 impl
#21490 opened
Jul 24, 2025 -
Delete useless allgather in qwen2_5_vl vit attention
#21493 opened
Jul 24, 2025 -
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine
#21496 opened
Jul 24, 2025 -
[NVIDIA] Fix Llama4 Scout FP4 functionality issues
#21499 opened
Jul 24, 2025 -
[Bugfix] Fix retrieve_process not ending normally and resources not being released properly
#21502 opened
Jul 24, 2025 -
[V1][SpecDecode]Support relaxed acceptance for thinking tokens in speculative decoding in V1
#21506 opened
Jul 24, 2025 -
[bugfix] fix profile impact benchmark results
#21507 opened
Jul 24, 2025 -
[Bugfix] Fix v1 engine crash in priority scheduling with parallel sampling (n > 1)
#21519 opened
Jul 24, 2025 -
support silu vectorization
#21521 opened
Jul 24, 2025 -
[ROCm] Add flag to avoid `invalid device ordinal` HIP error
#21522 opened
Jul 24, 2025 -
[Docs] Fix the outdated URL for installing from vLLM binaries
#21523 opened
Jul 24, 2025 -
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend
#21525 opened
Jul 24, 2025 -
Only try and load generation config if it will be used
#21526 opened
Jul 24, 2025 -
[Bugfix] Investigate Qwen2-VL failing test
#21527 opened
Jul 24, 2025 -
[Bugfix] v1 fix current scheduling defects and enhance the scheduling preemption logic.
#21528 opened
Jul 24, 2025 -
[Docs] add offline serving multi-modal video input example for Qwen2.5-VL
#21530 opened
Jul 24, 2025 -
[Bugfix] Handle None case for dt_bias and D in selective_state_update
#21532 opened
Jul 24, 2025 -
Add DeepGEMM to Dockerfile in vllm-base image
#21533 opened
Jul 24, 2025 -
[V1] Exception Handling when Loading KV Cache from Remote Store
#21534 opened
Jul 24, 2025 -
[Bugfix] Add startup probe and fix disable extraInit container in online deploy helm chart
#21535 opened
Jul 24, 2025 -
[Bugfix] Fix sync_and_slice_intermediate_tensors
#21537 opened
Jul 24, 2025 -
[BugFix] Harden distributed DP startup
#21538 opened
Jul 24, 2025 -
[Bugfix] Always set RAY_ADDRESS for Ray actor before spawn
#21540 opened
Jul 24, 2025 -
Enable 4bit bnb prequant MOE
#21548 opened
Jul 24, 2025 -
[V1] [Kernel] Change KV cache layout to (num_blocks, 2, ...) for FlashAttention backend
#21549 opened
Jul 24, 2025 -
[Do not merge] Debug TPU issues with Xet
#21551 opened
Jul 24, 2025 -
[TPU] Update ptxla nightly version to 20250724
#21555 opened
Jul 24, 2025 -
[V1] [Hybrid] Validate compatibility of attention backend batch reordering at init time
#21557 opened
Jul 24, 2025 -
[Feat][Scheduler] Implement shortest prefill first scheduling
#21558 opened
Jul 24, 2025 -
[Test] Add Unit Test for Batched DeepGEMM
#21559 opened
Jul 24, 2025 -
[Misc] add options for auto_tune
#21566 opened
Jul 25, 2025 -
[cutlass] Bump version to 410
#21569 opened
Jul 25, 2025 -
[Draft][Docs] Factor out troubleshooting to its own guide; add section for Ray Observability
#21578 opened
Jul 25, 2025 -
adding params_dtype for vocab parallel embedding layer
#21580 opened
Jul 25, 2025 -
[Draft][Docs] Expand introduction to Ray in Multi-node deployment section
#21584 opened
Jul 25, 2025 -
[CI/Build] Fix failing tensorizer tests on AMD
#21587 opened
Jul 25, 2025 -
[WIP] local attention no hybrid kv cache + support multiple attention metadata builders per kv_cache_spec
#21588 opened
Jul 25, 2025 -
Add option to propagate padded logits_indices to model
#21590 opened
Jul 25, 2025 -
[V1] large block_size solution
#21597 opened
Jul 25, 2025 -
[CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES
#21599 opened
Jul 25, 2025 -
KV Cache swap num_blocks layout + Heterogeneous TP for NixlConnector
#21607 opened
Jul 25, 2025 -
[Model] Fix for Granite 4 to work with compressed_tensors
#21608 opened
Jul 25, 2025 -
Use all_stop_token_ids instead of stop_token_ids
#21610 opened
Jul 25, 2025 -
[Bugfix] SharedStorage Connector for V1 PD multimodal
#21611 opened
Jul 25, 2025 -
[Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison
#21612 opened
Jul 25, 2025 -
[BugFix] Improve internal DP load balancing
#21617 opened
Jul 25, 2025 -
[Misc] remove unused try-except in pooling config check
#21618 opened
Jul 25, 2025 -
Migrate AriImagePixelInputs to TensorSchema for shape validation
#21620 opened
Jul 25, 2025 -
[Core] Hidden State Processors via plugins
#21621 opened
Jul 25, 2025 -
Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation
#21622 opened
Jul 25, 2025 -
[Doc] Add FusedMoE Modular Kernel Documentation
#21623 opened
Jul 25, 2025 -
[V1][Hybrid] Make KV cache layout of triton_attn compatible with hybrid models
#21624 opened
Jul 25, 2025 -
[do not merge] IL tool
#21625 opened
Jul 25, 2025 -
[Attention] Make CutlassMLA the default backend for SM100 (blackwell)
#21626 opened
Jul 25, 2025 -
[Core] Move EngineCoreRequest to Request conversion out of EngineCore
#21627 opened
Jul 25, 2025 -
Support Intern-S1
#21628 opened
Jul 25, 2025 -
[Bug] Update auto_tune.sh to separate benchmarking and profiling.
#21629 opened
Jul 25, 2025 -
[Fix] Bump triton version in rocm-build requirements
#21630 opened
Jul 25, 2025 -
[Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor
#21631 opened
Jul 25, 2025 -
[Feature][EPLB] Add EPLB support for Ernie4.5-MoE
#21632 opened
Jul 25, 2025 -
[Bug] Fix `has_flashinfer_moe` Import Error when it is not installed
#21634 opened
Jul 25, 2025
145 Issues closed by 42 people
-
[Bug]: [DP/EP] DeepGEMM with Qwen Fails
#21562 closed
Jul 25, 2025 -
[New Model]: HyperClova X SEED (ChatClova)
#21275 closed
Jul 25, 2025 -
[New Model]: Support HCXVisionForCausalLM
#19963 closed
Jul 25, 2025 -
[Bug]: Failed to execute_model with logprobs on v0.10.0rc2
#21592 closed
Jul 25, 2025 -
[Bug]: AttributeError: 'PosixPath' object has no attribute 'startswith'
#19173 closed
Jul 25, 2025 -
[New Model]: please surpport google/medgemma-27b-it
#20806 closed
Jul 25, 2025 -
[Bug]: Regression in vllm 0.9.2 for (at least) google/medgemma-27b-it
#21601 closed
Jul 25, 2025 -
[Bug]: qwen2.5-vl-3B inference with lora "unsupported LoRA weight"
#21500 closed
Jul 25, 2025 -
[RFC]: Schema for checking input shapes for multi-modal models
#14764 closed
Jul 25, 2025 -
[Bug]: set n=2 in the sampling parameter, but the final return result only contains one sequence
#21288 closed
Jul 25, 2025 -
[Bug]:
#21575 closed
Jul 25, 2025 -
[Bug]: vLLM 0.7.3 TypeError in vllm.entrypoints.api_server Argument Parsing
#13848 closed
Jul 25, 2025 -
[Bug]: Enable lora returns garbage output
#14392 closed
Jul 25, 2025 -
[Bug]: CDNA cc >= 90, choose_mp_linear_kernel MacheteLinearKernel is possible
#14996 closed
Jul 25, 2025 -
[Feature]: Add CoT dataset to the benchmark
#15378 closed
Jul 25, 2025 -
[Bug]: awq Deepseek-R1-AWQ The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
#15386 closed
Jul 25, 2025 -
[Bug]: `Phi-4-multimodal-instruct` encoder outputs didn't have the same length as defined in input_ids
#15404 closed
Jul 25, 2025 -
[Feature]: Implement Embedding Models in V1
#15406 closed
Jul 25, 2025 -
[Bug]: logprobs/ranks not matching when comparing `vllm` with `transformers`
#15420 closed
Jul 25, 2025 -
[Performance]: Regarding the issue of context length for QWQ-32B in different distributed environments:
#15442 closed
Jul 25, 2025 -
[Usage]: Question about Interleaved Text/Image Format in Online Inference
#15449 closed
Jul 25, 2025 -
[Usage]: vLLM hangs when starting the server
#15451 closed
Jul 25, 2025 -
[Feature]: preprocessing of weights in advance
#15459 closed
Jul 25, 2025 -
[Doc]: https://docs.vllm.ai/en/latest/deployment/k8s.html not working
#15461 closed
Jul 25, 2025 -
[Usage]: Phi-4-multimodal-instruct
#15468 closed
Jul 25, 2025 -
[Bug]: Unknown gguf model_type: gemma3
#15480 closed
Jul 25, 2025 -
[Bug]: Allow flexible message role ordering in conversations (user/assistant in any sequence)
#15486 closed
Jul 25, 2025 -
[Bug]: Support Bitsandbytes weight loading when offline (via huggingface cache)
#15507 closed
Jul 25, 2025 -
[Feature]: Reason model reasoning effort feature like OpenAI
#15524 closed
Jul 25, 2025 -
[Bug]: VLLM_NCCL_SO_PATH take no effects when spawn worker
#15525 closed
Jul 25, 2025 -
[Usage]: Qwen2.5-VL-32B-Instruct fails to start on 4x RTX 4090 GPUs
#15529 closed
Jul 25, 2025 -
[Installation]: flaky publishing of cpu image
#15547 closed
Jul 25, 2025 -
[Bug]: Tools parsing issues with mistral3.1
#15549 closed
Jul 25, 2025 -
[Feature]: LMCache support to the CPU version of vLLM
#15562 closed
Jul 25, 2025 -
[Feature]: Ring Attention for Long Context in vLLM - RL Applications Focus
#15566 closed
Jul 25, 2025 -
[Bug]: Compressed Tensor NVFP4 `cutlass_fp4_group_mm` illegal memory access
#21399 closed
Jul 24, 2025 -
[Bug]: Crash in fused_moe.py due to Triton illegal memory access
#21520 closed
Jul 24, 2025 -
[Usage]: Viability of Data Parallelism with FP8 KV-Cache and tpu_int8 on TPU v4-64
#21459 closed
Jul 24, 2025 -
[New Model]: Emu3
#11008 closed
Jul 24, 2025 -
[Bug]: Ray/vLLM RuntimeError: HIP error: invalid device ordinal (reopen)
#21457 closed
Jul 24, 2025 -
[Bug]: the shape of b in function w8a8_block_fp8_matmul
#21503 closed
Jul 24, 2025 -
[Bug]: vllm start gemma3 fail: NotImplementedError: Vlm do not work with prefix caching yet rank=6
#21498 closed
Jul 24, 2025 -
[Bug]: Batch generation from prompt_embeds fails for long prompts
#21386 closed
Jul 24, 2025 -
[Feature]: Support One Pod Per Node LB for DP/EP
#21261 closed
Jul 24, 2025 -
[Bug]: After online_serving disagg_example_p2p_nccl_xpyd.sh cleanup, there is a zombie process
#21432 closed
Jul 24, 2025 -
[Feature]: Support for specific GGUF model in a HF Repo
#20084 closed
Jul 24, 2025 -
[Performance]: How to Improve Performance Under Concurrency
#9722 closed
Jul 24, 2025 -
[Bug]: AssertionError assert self.num_blocks >= nixl_agent_meta.num_blocks
#19338 closed
Jul 23, 2025 -
[Feature]: Remove Unused Moe Permute / Un-permute
#21124 closed
Jul 23, 2025 -
[Bug]: openai whisper model response is not accurate on AMD-based(MI300x) systems.
#20069 closed
Jul 23, 2025 -
[Bug]: Qwen2.5 1M models no longer working since v.0.8.5
#21452 closed
Jul 23, 2025 -
[Bug]: Guided decoding with Phi-3-small crashes
#6193 closed
Jul 23, 2025 -
[Usage]: cannot import name 'VoxtralForConditionalGeneration' from 'transformers'
#21369 closed
Jul 23, 2025 -
[Bug]: vllm cannot connect to an external ray cluster
#14349 closed
Jul 23, 2025 -
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.
#14435 closed
Jul 23, 2025 -
[Installation]: Cannot compile vLLM from source on XPU
#14747 closed
Jul 23, 2025 -
[Bug]: AssertionError with Speculative Decoding in vLLM Using DeepSeek R1 Distill Qwen Models
#14939 closed
Jul 23, 2025 -
[Bug]: Internal Server Error when using Qwen2-VL-7B with vLLM Docker Container
#15110 closed
Jul 23, 2025 -
[Usage]: relationship between embedding size and vocab_size
#15131 closed
Jul 23, 2025 -
[Feature]: Ability to warm up vLLM instances
#15225 closed
Jul 23, 2025 -
[Bug]: working with openai-agents SDK, using Runner.run_streamed() got a function call error
#15256 closed
Jul 23, 2025 -
[Feature]: Dynamic Memory Release for GPU after idle time
#15287 closed
Jul 23, 2025 -
[Bug]: Crashing on unsupported Sampling params
#15312 closed
Jul 23, 2025 -
[Bug]: 0.8.0 and 0.8.1 bugs
#15365 closed
Jul 23, 2025 -
[Bug]: VLLM Build Using Docker Error Deploy
#15376 closed
Jul 23, 2025 -
[Feature]: Support Top-nσ sampling
#15379 closed
Jul 23, 2025 -
[Bug]: Different logprobs output behaviour under vllm 0.8.0 and 0.8.1
#15381 closed
Jul 23, 2025 -
[Feature]: Request for Support of Dense and Sparse Features in bge-m3 Embedding Model
#15384 closed
Jul 23, 2025 -
[New Model]: Baichuan-Audio
#15425 closed
Jul 23, 2025 -
[Usage]: when setting quantizaion AWQ on AWQ model it slows down the model execution by up to 5x
#21376 closed
Jul 22, 2025 -
[Usage]: How to turn off thinking using OpenAI client?
#20976 closed
Jul 22, 2025 -
[Bug]: Failed profiling vllm (both offline and server) with Nsight Systems
#20178 closed
Jul 22, 2025 -
[Bug]: OOM Error with Qwen/Qwen3-235B-A22B on Python SDK
#21361 closed
Jul 22, 2025 -
[Bug]: BART broken on vLLM 0.8.1 and above. (Even on v0 engine).
#19981 closed
Jul 22, 2025 -
[Bug]: RuntimeError: CUDA error: initialization error in `_is_fa2_supported`
#21304 closed
Jul 22, 2025 -
[Bug]: Llama4 Maverick runtime error (shuffle_rows)
#21322 closed
Jul 22, 2025 -
[Bug]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
#3900 closed
Jul 22, 2025 -
[Bug]: Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks, no results
#3998 closed
Jul 22, 2025 -
[Bug]: TRACKING ISSUE: `AsyncEngineDeadError`
#5901 closed
Jul 22, 2025 -
[Bug]: No available block found in 60 second in shm
#6614 closed
Jul 22, 2025 -
[Usage]: How to benchmark throughput of DeepSeek-R1-671B on 2 nodes
#15024 closed
Jul 22, 2025 -
[Doc]: new attention layer
#15077 closed
Jul 22, 2025 -
[Bug]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
#15327 closed
Jul 22, 2025 -
[Bug]: Prometheus DP Metrics
#21260 closed
Jul 21, 2025 -
[Bug]: V1 + FLASH_ATTN V3 + FP8 kv-cache randomly crashes w/qwen3 (and other models)
#17442 closed
Jul 21, 2025 -
[New Model]: nvidia/DeepSeek-R1-FP4
#16323 closed
Jul 21, 2025 -
[Bug]: CUDA kernel image error when serving Llama4 Maverick since #20694
#20847 closed
Jul 21, 2025 -
[Feature]: Simplify speculative-config format for vllm serve
#19709 closed
Jul 21, 2025 -
[Bug]: pynccl leads to incorrect data in multi-thread GPU-worker
#21252 closed
Jul 21, 2025 -
jinaai/jina-reranker-v1-turbo-en not compatible with vLLM
#16153 closed
Jul 21, 2025 -
[Bug]: After outputting the normal content, keep outputting content= '', until finish_reason='length'.
#21181 closed
Jul 21, 2025 -
[CI Failure]: Classification test failure for Qwen2.5-1.5B-apeach model in half precision
#21277 closed
Jul 21, 2025 -
Newcomer getting started, looking for guidance
#11223 closed
Jul 21, 2025 -
[RFC]: layer-wise kv cache offloading to enable larger batches
#15123 closed
Jul 21, 2025 -
[Bug]: Quantization does not lead to Throughput Speedup (Please Help)
#21236 closed
Jul 20, 2025 -
[Bug]: TypeError: RayGaugeWrapper.__init__() got an unexpected keyword argument
#20954 closed
Jul 20, 2025 -
Llama3.2 Vision Model: Guides and Issues
#8826 closed
Jul 20, 2025 -
[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management
#10086 closed
Jul 20, 2025 -
[Bug]: Multi-Node Online Inference on TPUs Failing
#12179 closed
Jul 20, 2025 -
[Feature]: Disaggregated Prefill on multi-node & multi-gpu
#13004 closed
Jul 20, 2025 -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 closed
Jul 20, 2025 -
[Usage]: Distributed inference not supported with OpenVINO?
#14933 closed
Jul 20, 2025 -
[Performance]: only 0.4 tokens/s when running 2 or more request
#15018 closed
Jul 20, 2025 -
[Bug]: Capture CudaGraph with LoRA
#15090 closed
Jul 20, 2025 -
[Bug]: flash_attn_with_kvcache kernel, an illegal memory access
#15113 closed
Jul 20, 2025 -
[Performance]: online batch inference faster than offline batch inference
#15178 closed
Jul 20, 2025 -
[Usage]: VLLM 0.7.3 with tensor parallelism outputs only exclamation marks when using multiple GPUs
#15194 closed
Jul 20, 2025 -
[Feature]: Does vLLM support dialog prefix continuation?
#15198 closed
Jul 20, 2025 -
[Misc][Help]: Adding support for a Custom model with External MoE Routing
#15214 closed
Jul 20, 2025 -
[Usage]: How to properly use vllm when serving - KeyError 'text'
#15219 closed
Jul 20, 2025 -
[Bug]: `VLLM_USE_V1` + `TORCH_SDPA` regression in v0.8
#15251 closed
Jul 20, 2025 -
[Performance]: V0 and V1 give the same throughput number
#15253 closed
Jul 20, 2025 -
[Bug]: --tensor-parallel-size Error
#15255 closed
Jul 20, 2025 -
[Bug]: Inconsistent Output Based on Presence of chat_template Parameter
#15272 closed
Jul 20, 2025 -
[Bug]: vLLM declares itself healthy before it can serve requests
#15313 closed
Jul 20, 2025 -
[Feature]: looking into adding a generation algorithm
#15315 closed
Jul 20, 2025 -
[Bug]: ImportError: vllm/_C.abi3.so: undefined symbol _ZN3c106ivalue14ConstantString6createENSt7
#21226 closed
Jul 19, 2025 -
[Bug]: PD does not work with ray distributed backend
#21070 closed
Jul 19, 2025 -
[Bug]: Middleware crashes vLLM on startup w/latest commit
#21219 closed
Jul 19, 2025 -
[Bug]: RGB inverted in offline example?
#21053 closed
Jul 19, 2025 -
[Usage]: How to do expert parallel on MoE model?
#21054 closed
Jul 19, 2025 -
[Bug]: error: Segmentation fault(SIGSEGV received at time)
#6918 closed
Jul 19, 2025 -
[Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,)
#8432 closed
Jul 19, 2025 -
[Usage]: Can I get the loss of model directly?
#9750 closed
Jul 19, 2025 -
[Feature]: Support for priority preemption with chunked-prefill
#10101 closed
Jul 19, 2025 -
[Bug]: vLLM CPU mode broken Unable to get JIT kernel for brgemm
#10478 closed
Jul 19, 2025 -
[Bug]: terminate called after throwing an instance of 'std::system_error' what(): Operation not permitted
#14416 closed
Jul 19, 2025 -
[Bug]: 0.8.0(V1) crash on NCCL when load MoE model on 16 GPUs(H20)
#15098 closed
Jul 19, 2025
95 Issues opened by 83 people
-
[Bug]: v0.10.0 built with early version of pytorch that does not support sm-120
#21633 opened
Jul 25, 2025 -
[Bug]: Very Low Prompt Evaluation Speed
#21616 opened
Jul 25, 2025 -
[Bug]: Can't set limit-mm-per-prompt in 0.10.0
#21615 opened
Jul 25, 2025 -
[Bug]:
#21609 opened
Jul 25, 2025 -
[Bug]: Enabling EPLB leads to inconsistent inference results
#21606 opened
Jul 25, 2025 -
[Feature]: Support fairness heuristics for the batched requests
#21605 opened
Jul 25, 2025 -
[Feature]: Support multiple guided decoding settings
#21604 opened
Jul 25, 2025 -
[Bug]: DeepEP with Qwen3-Coder Fails
#21603 opened
Jul 25, 2025 -
[Feature]: Support CPU on ray
#21602 opened
Jul 25, 2025 -
[Usage]: How do you use benchmark_serving on VLM?
#21596 opened
Jul 25, 2025 -
[Feature]: Support GteNewModelForSequenceClassification
#21595 opened
Jul 25, 2025 -
[Bug]: The current scheduling logic has a bug: when a scheduled request is evicted, ...
#21594 opened
Jul 25, 2025 -
[Bug]: MistralTokenizer is missing batch_decode, breaks /detokenize in OpenAI server
#21593 opened
Jul 25, 2025 -
[Bug]: [P/D] P/d is incompatible with spec decoding
#21583 opened
Jul 25, 2025 -
[Bug]: Incorrect Answer with Llama-Scout-Fp8 and PPLX
#21581 opened
Jul 25, 2025 -
[Feature]: [P/D] NIXL Connector Error Handling
#21577 opened
Jul 25, 2025 -
[Bug]: [P/D] NIXLConnector does not support P TP > D TP
#21576 opened
Jul 25, 2025 -
[Bug]: vLLM ranking is biased towards short texts, giving high scores to irrelevant short texts
#21574 opened
Jul 25, 2025 -
[Usage]: Qwen tool_call response type problem
#21571 opened
Jul 25, 2025 -
[Bug]: [P/D] in nixl_connector, the P node implements a request timeout but the D node cannot detect it.
#21570 opened
Jul 25, 2025 -
[Usage]: Qwen3-Coder-480B-A35B-Instruct deploy hang up
#21568 opened
Jul 25, 2025 -
[Bug]: Qwen3 failed to get function with stream and named function calling when thinking is disabled
#21565 opened
Jul 25, 2025 -
[Usage]: Disable the FlashInfer 0.2.3+ does not support per-request generators warning
#21563 opened
Jul 25, 2025 -
[Bug]: tensorizer example failed
#21547 opened
Jul 24, 2025 -
[Bug]: Beam search implementation disables logit processor functionality
#21546 opened
Jul 24, 2025 -
[Feature]: Zero copy for direct GPU model loading
#21545 opened
Jul 24, 2025 -
[Bug]: Hermes tool call parser fails with "Error trying to handle streaming tool call"
#21544 opened
Jul 24, 2025 -
[Bug]: Failing to initialize engine on qwen3 on B200 with VLLM_USE_DEEP_GEMM=1
#21542 opened
Jul 24, 2025 -
[Bug]: Incorrect Generation for Qwen2.5-VL-7B-Instruct in Batch Mode
#21529 opened
Jul 24, 2025 -
[Bug]: Qwen3-30B-A3B distributed Inference hang when set tp 2 pp 1 on two H100 node
#21524 opened
Jul 24, 2025 -
[Bug]: run glm4.1v, ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
#21516 opened
Jul 24, 2025 -
[CI Failure]: Transformers Nightly Models Test
#21515 opened
Jul 24, 2025 -
[CI Failure]: Multi-model Models Test (Extended 1)
#21514 opened
Jul 24, 2025 -
[Bug]: Aborted without reason,timeout after three retries
#21513 opened
Jul 24, 2025 -
[CI Failure]: Multi-model Models Test (Extended 2)
#21512 opened
Jul 24, 2025 -
[Feature]: Qwen3 Models GGUF Support
#21511 opened
Jul 24, 2025 -
[Feature]: Full cudagraph support for MLA attention backend with DeepSeek MTP(Speculative decode)
#21505 opened
Jul 24, 2025 -
[RFC] [ROCm] [AITER]: Propose a `_aiter_ops` class like `_custom_ops` and `_ipex_ops`
#21504 opened
Jul 24, 2025 -
[Bug]: Tensor parallelism on sm_120 (rtx 5090) is broken on latest docker (0.9.2)?
#21491 opened
Jul 24, 2025 -
[Feature]: torch >2.7.0 support
#21488 opened
Jul 24, 2025 -
[Feature]: Multiple models one server
#21481 opened
Jul 23, 2025 -
[RFC]: vLLM vs HuggingFace numerical parity report
#21475 opened
Jul 23, 2025 -
[Bug]: Incorrect output when using LoRA modules with tensor parallelism in vLLM
#21471 opened
Jul 23, 2025 -
[RFC]: Shorten all of the CI by reducing `cudagraph_capture_sizes` for most of the unit tests
#21469 opened
Jul 23, 2025 -
[Bug]: FP8 model crashes with EngineDeadError and CUDA illegal memory access on H100 (CUDA 12.8)
#21466 opened
Jul 23, 2025 -
[Bug]:
#21454 opened
Jul 23, 2025 -
[Feature]: sm_120 support
#21453 opened
Jul 23, 2025 -
[Usage]: How can a vLLM cluster support deploying multiple large models?
#21451 opened
Jul 23, 2025 -
[Doc]: Undocumented required option "method" in speculative config when loading an eagle3 model
#21447 opened
Jul 23, 2025 -
[Bug]:
#21442 opened
Jul 23, 2025 -
[Bug]: ParallelHead has no attribute 'params_dtypes'
#21439 opened
Jul 23, 2025 -
[Bug]: vLLM Multinode Pipeline Error with pipeline parallelism using Ray
#21438 opened
Jul 23, 2025 -
[Bug]: Performance issue with NPS-4 configuration with respect to NPS-1 configuration
#21436 opened
Jul 23, 2025 -
[Usage]: How to reproduce the results of `vllm` using `transformers`
#21433 opened
Jul 23, 2025 -
[Feature]: add a arg to modify process_name
#21423 opened
Jul 23, 2025 -
[Bug]: auto_tune.sh profiling attempts are hanging (i.e., "benchmarking_serving.py --profile" is failing)
#21403 opened
Jul 22, 2025 -
[Bug]: Tool call argument value of type `integer` may break things when `stream=True`
#21372 opened
Jul 22, 2025 -
[Bug]: 'FusedMoE' object has no attribute 'kv_cache' when running a 1P1D test with PowerMoE-3b
#21359 opened
Jul 22, 2025 -
[Performance]: KV Cache Size Comparison vLLM vs SGLang
#21348 opened
Jul 22, 2025 -
[Usage]: Prefill node crashed when P/D Disaggregated Serving with MooncakeStore for Qwen3MOE
#21346 opened
Jul 22, 2025 -
[Bug]: Large image requests silently dropped with Llama-Guard-4
#21344 opened
Jul 22, 2025 -
[Bug]: vllm crashes using Eight RTX 3090s
#21339 opened
Jul 21, 2025 -
[Bug]: vLLM crashes when using --enable-sleep-mode with Blackwell PRO 6000 GPUs
#21336 opened
Jul 21, 2025 -
[Bug]: dsv3 generates all 0s output
#21326 opened
Jul 21, 2025 -
[Feature]: Hybrid Cloud Model Serving
#21323 opened
Jul 21, 2025 -
[Feature]: Support Anthropic API `/v1/messages` endpoint
#21313 opened
Jul 21, 2025 -
[Bug]:
#21310 opened
Jul 21, 2025 -
[Bug]: ROCm NotImplementedError: Speculative decoding is not yet supported on vLLM V1
#21308 opened
Jul 21, 2025 -
[Bug]: qwen tool bug
#21307 opened
Jul 21, 2025 -
[Bug]: all2all communication hangs when using DeepEP and PPLX for v0.9.2
#21306 opened
Jul 21, 2025 -
[Bug]: Mistral Tool Parser Crashes with Empty JSONDecodeError for Mistral Small 3.2 24B FP8 Instruct
#21303 opened
Jul 21, 2025 -
[Bug]: Hermes tool parser returns invalid arguments
#21301 opened
Jul 21, 2025 -
[Usage]: How to test throughput of 2:4 sparse model?
#21298 opened
Jul 21, 2025 -
[Usage]: how to execute benchmark_serving.py with an apikey?
#21297 opened
Jul 21, 2025 -
[Performance]: how to test model performance with apikey using benchmark_serving.py?
#21295 opened
Jul 21, 2025 -
[Bug]: OpenReasoning-Nemotron-32B Only Outputs Exclamation Marks Regardless of Input
#21292 opened
Jul 21, 2025 -
[Performance]: Speculative decoding doesn't seem to speed up inference?
#21278 opened
Jul 21, 2025 -
[Bug]: nvfp4 support on sm120
#21274 opened
Jul 21, 2025 -
[Feature]: support for NVIDIA RTX 5070Ti graphics card and Windows 11 system
#21272 opened
Jul 21, 2025 -
[Bug]: Endless Generation near Context Window with Eagle3/Spec Dec
#21269 opened
Jul 20, 2025 -
[Bug]: GPTQ w4a16 Quantization slower than FP16 (Please Help)
#21266 opened
Jul 20, 2025 -
[Feature]: Raise proper HTTP error with details for multimodal input url fetch error
#21254 opened
Jul 20, 2025 -
[Usage]: Abnormal LoRA kernel performance
#21250 opened
Jul 20, 2025 -
[Performance]: Move hash_request_tokens computation from input request threads
#21247 opened
Jul 20, 2025 -
[Bug]: tensor parallelism inference doesn't run on Nvidia Blackwell 5070ti
#21239 opened
Jul 20, 2025 -
[Bug]: vLLM stops inference
#21237 opened
Jul 20, 2025 -
[Bug]: 100% cpu usage on 3 cores on every node when using ray distributed pipeline parallel
#21231 opened
Jul 19, 2025 -
[Feature]: Support xformers on ARM GPU machines including GB200.
#21207 opened
Jul 18, 2025 -
[Feature]: Consolidate benchmark_serving.py and serve.py to avoid code duplication and usage confusions
#21206 opened
Jul 18, 2025 -
[Bug]: Guidance decoding broken for Granite 3.3 and hangs server
#21204 opened
Jul 18, 2025
380 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Model] Auto resolve default_pooling_type & Optimize prefix caching enable verify logic.
#20930 commented on
Jul 24, 2025 • 28 new comments -
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer
#20059 commented on
Jul 25, 2025 • 28 new comments -
LFM2
#20797 commented on
Jul 25, 2025 • 25 new comments -
Add an optimization doc on TPU
#21155 commented on
Jul 25, 2025 • 20 new comments -
[V1] Logits processors extensibility
#19912 commented on
Jul 25, 2025 • 14 new comments -
[Feature] limit thinking tokens
#20859 commented on
Jul 25, 2025 • 12 new comments -
security policy: take 1
#21119 commented on
Jul 22, 2025 • 10 new comments -
[Model] Add support for Jina Embeddings V4
#20802 commented on
Jul 21, 2025 • 10 new comments -
[1/N] Refactor platform API to reduce `torch.cuda` call
#20751 commented on
Jul 25, 2025 • 10 new comments -
[Feature] use --ep_config to set eplb param
#20562 commented on
Jul 25, 2025 • 9 new comments -
v1: Add Whisper model support (encoder-decoder)
#21088 commented on
Jul 25, 2025 • 8 new comments -
[VLM] Support HF format Phi-4-MM model
#17121 commented on
Jul 23, 2025 • 7 new comments -
[Feature][EPLB] Add eplb support for Qwen3
#20815 commented on
Jul 25, 2025 • 6 new comments -
[Model] Ultravox: Support Llama 4 and Gemma 3 backends
#17818 commented on
Jul 25, 2025 • 5 new comments -
[Bugfix] Mistral tool parser streaming update
#19425 commented on
Jul 25, 2025 • 5 new comments -
v1: Add Request.block_hashes
#19728 commented on
Jul 25, 2025 • 5 new comments -
[Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4
#21166 commented on
Jul 23, 2025 • 4 new comments -
[Model] Pooling model activation supports per request control by PoolingParams
#20538 commented on
Jul 25, 2025 • 4 new comments -
[Model] Gemma3n MM
#20495 commented on
Jul 24, 2025 • 4 new comments -
[Model] Support TP/PP/mamba2 kernel for PLaMo2
#19674 commented on
Jul 25, 2025 • 4 new comments -
Add add_logger API to AsyncLLM
#20952 commented on
Jul 23, 2025 • 3 new comments -
[Attention][DBO] Add support for "splitting" the CommonAttentionMetadata
#21153 commented on
Jul 25, 2025 • 3 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Jul 25, 2025 • 2 new comments -
[BugFix] fix: aot passes kvcache dtype information
#19750 commented on
Jul 24, 2025 • 2 new comments -
[Frontend] Add chunked processing to handle long inputs in embedding models
#20837 commented on
Jul 25, 2025 • 2 new comments -
[Feature] Add async tensor parallelism for scaled mm
#20155 commented on
Jul 25, 2025 • 2 new comments -
Add tree attention backend for v1 (part 1)
#20401 commented on
Jul 24, 2025 • 2 new comments -
ci: Add CUDA + arm64 release builds
#21201 commented on
Jul 19, 2025 • 2 new comments -
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels
#17112 commented on
Jul 25, 2025 • 2 new comments -
[Feature][EPLB] Add support for unquantized models
#21168 commented on
Jul 21, 2025 • 2 new comments -
[Feature] Support multiple api keys in server
#18548 commented on
Jul 25, 2025 • 2 new comments -
[Misc] allow pulling vllm in Ray runtime environment
#21143 commented on
Jul 23, 2025 • 2 new comments -
[Misc] change default request logging behavior to off
#21135 commented on
Jul 23, 2025 • 2 new comments -
[Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning
#20396 commented on
Jul 24, 2025 • 1 new comment -
Resolve the torch nightly sync issue
#20393 commented on
Jul 23, 2025 • 1 new comment -
[Bugfix] V1 Fix the cursor leakage issue during request scheduling.
#21173 commented on
Jul 25, 2025 • 1 new comment -
Implement structural_tag and json_schema for non-chat completion
#21150 commented on
Jul 22, 2025 • 1 new comment -
[Bugfix] Fix the bug in Hermes streaming parsing
#20824 commented on
Jul 25, 2025 • 1 new comment -
Enable multi-image support benchmarking for serving
#21145 commented on
Jul 22, 2025 • 1 new comment -
[Nvidia] Integrate cudnn prefill paged attention kernel for head_dim == 128 models, like Llama family
#20850 commented on
Jul 25, 2025 • 1 new comment -
[Doc] Add multi-modal development example for encoder-decoder models
#15405 commented on
Jul 24, 2025 • 0 new comments -
[ROCm][AMD] Enable ROCm Flash Attention Backend for Encoder-Decoder Models
#14803 commented on
Jul 21, 2025 • 0 new comments -
[Feature] Memory interleaving (#14680)
#14690 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Importing DeepSpeed causes crash in vLLM when running with data parallelism and TP=1
#17079 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Add support for multi-lora and single lora for classification tasks
#19623 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: add DoRA support
#10849 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Add EP/DP/PD deps in docker image
#19653 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] set correct lora mapping when compute prompt logprobs
#16694 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.
#17569 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] Move current_platform import to avoid python import cache.
#16601 commented on
Jul 19, 2025 • 0 new comments -
[Misc] Improve cli help show
#15455 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: topk=1 and temperature=0 cause different output in vllm
#5404 commented on
Jul 25, 2025 • 0 new comments -
Reshape cache flash kernel to support HND layout
#8200 commented on
Jul 24, 2025 • 0 new comments -
[Model] LoRA with lm_head and embed_tokens fully trained - 4
#11714 commented on
Jul 22, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Jul 23, 2025 • 0 new comments -
First working PoC for bge-m3 sparse embeddings
#14526 commented on
Jul 24, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Jul 22, 2025 • 0 new comments -
[Misc]fix demo function call JSONDecodeError
#16595 commented on
Jul 25, 2025 • 0 new comments -
[Doc] update docs for nightly benchmarks
#12022 commented on
Jul 23, 2025 • 0 new comments -
[Misc] improve chat_with_tools example
#16044 commented on
Jul 25, 2025 • 0 new comments -
DeepGemm MoE expert map support
#15957 commented on
Jul 24, 2025 • 0 new comments -
[Bugfix][Frontend] Fix pythonic tool parser failure with negative numbers
#15462 commented on
Jul 25, 2025 • 0 new comments -
[Frontend] Pythonic tool names flexibility (#14470)
#14474 commented on
Jul 25, 2025 • 0 new comments -
[Core] Make disaggregated prefill compatible with pipeline parallelism
#12301 commented on
Jul 23, 2025 • 0 new comments -
Adding Share Expert Fusion for DeepSeek
#15502 commented on
Jul 20, 2025 • 0 new comments -
[Frontend] fix streaming tool output lose 2 token bug #15545
#15546 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix][Frontend] Strip empty tool calls from incoming chat conversations
#14054 commented on
Jul 25, 2025 • 0 new comments -
Fixed Stream set to True, client stream receiving arguments, concatenated json string, missing curly braces end
#15930 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix][Spec Decode][V0] fix: update logits processor for MQA scoring
#12537 commented on
Jul 21, 2025 • 0 new comments -
[Misc] Disable pin_memory in AsyncMetricsCollector for spec decode tensor allocation
#15886 commented on
Jul 23, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: xgrammar doesn't support enums, but vllm isn't falling back to outlines
#15762 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: gemma 3 structured output api occurs assertion error
#15766 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: xgrammar==0.17 not work when guided
#15790 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Models converted to GGUF don't seem to be able to do tool calling
#16195 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Disable unicode characters in structured decoding
#16363 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Qwen2.5 assistant output on tool call is empty
#16430 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Native Tool Call for Gemma 3
#16482 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: VLLM>0.8 also met No platform detected, vLLM is running on UnspecifiedPlatform
#16724 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Ngram speculative decoding doesn't work in vLLM 0.8.3/0.8.4 with VLLM_USE_V1 enabled.
#16883 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: guided_grammar example syntax does not work
#16911 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: ```image_grid_thw``` not set in ```CachedRequestState``` - ```Qwen2.5 VL 3B```
#17007 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: All Ops should be determined during init and wrapped in a Layer Module to avoid envs.ENVIRON overhead
#17067 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: Expert parallelism in VLLM - do you do local dropping on sub-batch of token activations before going through gating layer to make each rank possess unique sub-batch of data?
#17087 commented on
Jul 25, 2025 • 0 new comments -
Tool call arguments parsing failed
#17089 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: [0.7.2+] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
#17098 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Tool calls data comes in content field after text chunks
#17109 commented on
Jul 25, 2025 • 0 new comments -
[Installation]: vllm/vllm-tpu image doesn't have :latest tag
#17114 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Why does torch.cuda.memory_allocated() remain unchanged after calling sleep()?
#17117 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: jinja2 TemplateError should return 422 instead of 500 error code
#17119 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Automatically detect numerical issues
#17123 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: waiting reqs vanish!
#17147 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: DeepSeek Lora inference has no effect.
#17155 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: How to configure the server parameters for THUDM/GLM-4-32B-0414 to support Function call using vllm-0.8.4?
#16771 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: In-place weights loading and model swapping
#19886 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Prompt Embedding returns 500 internal error for Qwen 2.5 VL model
#20757 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: PD demo example failed to run benchmark
#20477 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Error during transcription: Received a CachedWhisperTokenizerFast for argument tokenizer, but a WhisperTokenizer was expected.
#19538 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: How to log stats when using AsyncLLM locally (not based on the OpenAI API)
#18948 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: scheduling policy optimization in vLLM
#16969 commented on
Jul 25, 2025 • 0 new comments -
[Roadmap] vLLM Release/CI/Performance Benchmark Q2 2025
#16284 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: The mixed precision model lacks kernel image in the Blackwell architecture (version: 0.9.2 + cu12.8 + RTX5060)
#20605 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Engine Core initialization failed. See root cause above
#17618 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: Native support for Mamba, SSM, and hybrid transformer models in vLLM V1
#17140 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: vllm, EngineCore encountered a fatal error TimeoutError
#19668 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Support EPLB for More MoE Models, e.g. Qwen 3, Llama 4
#20468 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash
#18567 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Qwen2_5_VLForEmbedding
#13373 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: vLLM and on-the-fly tool calling
#13497 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: add tool calling support for DeepSeek-R1-Distill-Qwen-32B
#13700 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: pythonic tool parser only accepts alphabetical tool names
#14470 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: vLLM response on tool_calls does not align with OpenAI standard
#14951 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Add support for reusable subschemas in tool requests (PydanticAI)
#15035 commented on
Jul 25, 2025 • 0 new comments -
[Model] Reasoning Parser for Nemotron Models
#21041 commented on
Jul 21, 2025 • 0 new comments -
Fix minor docs issues and fix metric requests
#21040 commented on
Jul 25, 2025 • 0 new comments -
Enable sequence parallelism for full cuda graph without specifying compile sizes
#21031 commented on
Jul 23, 2025 • 0 new comments -
[Not for merge] Unshift eagle prefill
#21008 commented on
Jul 25, 2025 • 0 new comments -
fix(completion): always include usage
#20983 commented on
Jul 24, 2025 • 0 new comments -
[V0 deprecation] Removal of V0 structured outputs
#20928 commented on
Jul 21, 2025 • 0 new comments -
[Bugfix] Support for getting the exact memory value when in a container
#20917 commented on
Jul 22, 2025 • 0 new comments -
[Frontend] Raise an extremely dangerous warning when using VLLM_ALLOW_LONG_MAX_MODEL_LEN
#20904 commented on
Jul 22, 2025 • 0 new comments -
[Frontend] OpenAI Responses API supports Tool/Function calling
#20874 commented on
Jul 24, 2025 • 0 new comments -
[Bugfix] Fix multi LoRAs with tp >= 2 and LRU cache
#20873 commented on
Jul 20, 2025 • 0 new comments -
Allow serving Llama4ForCausalLM directly
#20868 commented on
Jul 21, 2025 • 0 new comments -
[compile][startup] Disable C++ compilation of symbolic shapes
#20836 commented on
Jul 22, 2025 • 0 new comments -
[Meta] Official Eagle mm support, first enablement on llama4
#20788 commented on
Jul 25, 2025 • 0 new comments -
[Feature] Add support for MoE models in the calibration-free RTN-based quantization
#20766 commented on
Jul 25, 2025 • 0 new comments -
[PERF] Symmetric memory allreduce
#20759 commented on
Jul 25, 2025 • 0 new comments -
feat: Add --enable-log-outputs flag for logging model generations
#20707 commented on
Jul 24, 2025 • 0 new comments -
[Bugfix] Fix grafana's model_name list showing other values
#20677 commented on
Jul 22, 2025 • 0 new comments -
PrefixRepetitionRandomDataset
#20638 commented on
Jul 19, 2025 • 0 new comments -
v1: Support KV events from connectors
#19737 commented on
Jul 23, 2025 • 0 new comments -
[Compilation fix] add stubs to allow compilation without sm100
#21198 commented on
Jul 22, 2025 • 0 new comments -
[Kernel] Enable Hybrid Model Support in Triton Unified Attention Kernel
#21197 commented on
Jul 22, 2025 • 0 new comments -
[W.I.P]: add Lmcache metrics
#21189 commented on
Jul 20, 2025 • 0 new comments -
Some initial Vulkan boilerplate
#21184 commented on
Jul 18, 2025 • 0 new comments -
[Bugfix] Mistral crashes on tool with no description
#21167 commented on
Jul 25, 2025 • 0 new comments -
[LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing
#21161 commented on
Jul 24, 2025 • 0 new comments -
[V0 deprecation] Deprecate V0 Neuron backend
#21159 commented on
Jul 21, 2025 • 0 new comments -
Remove xformers requirement for Mistral-format Pixtral and Mistral3
#21154 commented on
Jul 24, 2025 • 0 new comments -
[Perf] Using `mul` instead of `div` for int8 quant
#21136 commented on
Jul 24, 2025 • 0 new comments -
[V1] Large Block_size solution
#21123 commented on
Jul 21, 2025 • 0 new comments -
Add `fused_moe_gate` kernel and integrate to DeepSeek MoE layer
#21107 commented on
Jul 22, 2025 • 0 new comments -
[V1][Metrics][Frontend] Add support for custom stat loggers via CLI --stat-loggers
#21105 commented on
Jul 24, 2025 • 0 new comments -
[benchmark] add max-concurrency in result table
#21095 commented on
Jul 21, 2025 • 0 new comments -
[Model] Support deepseek with eagle
#21086 commented on
Jul 21, 2025 • 0 new comments -
[Draft][Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration
#21078 commented on
Jul 22, 2025 • 0 new comments -
[Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead
#21075 commented on
Jul 23, 2025 • 0 new comments -
fix: NIXL connector transfers partial block to pass full multi-modal context
#21074 commented on
Jul 25, 2025 • 0 new comments -
Add FlashInfer allreduce RMSNorm Quant fusion
#21069 commented on
Jul 25, 2025 • 0 new comments -
[Feature][EPLB] Add EPLB support for MiniMax-01
#21056 commented on
Jul 24, 2025 • 0 new comments -
[V1] Partial prefill skip for layers reusing shared KV cache
#19719 commented on
Jul 24, 2025 • 0 new comments -
Fixed power build by building numba from source
#19433 commented on
Jul 23, 2025 • 0 new comments -
[ROCm][FEAT] Integrate AITER gemm w8a8 ptpc
#19417 commented on
Jul 20, 2025 • 0 new comments -
[Bugfix] VLLM_V1 supports passing other compilation levels
#19340 commented on
Jul 24, 2025 • 0 new comments -
[Misc][Bugfix] specify docker registry to support podman
#19236 commented on
Jul 21, 2025 • 0 new comments -
[CI/Build] Add tool to build vllm-tpu wheel
#19165 commented on
Jul 23, 2025 • 0 new comments -
[Bugfix]: Fix DualChunkFlashAttention for short sequences
#19084 commented on
Jul 23, 2025 • 0 new comments -
[BugFix]: Hermes tool parser stream output error #19056
#19058 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] Improve JSON extraction in LlamaToolParser
#19024 commented on
Jul 25, 2025 • 0 new comments -
[V1] Support `LLM.apply_model`
#18465 commented on
Jul 23, 2025 • 0 new comments -
[Doc] update Contributing page's testing section
#18272 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] Fix Hermes tool call parser with streaming
#18220 commented on
Jul 25, 2025 • 0 new comments -
[Frontend] Add unix domain socket support
#18097 commented on
Jul 22, 2025 • 0 new comments -
[Misc] Remove duplicate division check between num_query_heads and num_kv_heads.
#18074 commented on
Jul 24, 2025 • 0 new comments -
[kernel] integrate permute/unpermute kernel into deepgemm moe
#17934 commented on
Jul 25, 2025 • 0 new comments -
[Kernel] Bf16 data type support for awq quantization
#17705 commented on
Jul 21, 2025 • 0 new comments -
[Misc] Raise ValueError for V1 during profiling when max_num_batched_tokens is too short
#16834 commented on
Jul 21, 2025 • 0 new comments -
[V1] Update default max_num_batched_tokens for V1 openai server
#16795 commented on
Jul 20, 2025 • 0 new comments -
[Core] feat: Add aging factor support to priority request queue for fairer scheduling
#20608 commented on
Jul 22, 2025 • 0 new comments -
feat: Add streaming support for Mistral v11 tool format
#20503 commented on
Jul 23, 2025 • 0 new comments -
[FEAT] [V1] [ROCm] Enable DeepSeek R1 MTP V1 ROCm
#20493 commented on
Jul 19, 2025 • 0 new comments -
[V1][Spec Decode][Feature] Spec decode with probs
#20459 commented on
Jul 18, 2025 • 0 new comments -
[Core] Shared memory based object store for Multimodal data caching and IPC
#20452 commented on
Jul 24, 2025 • 0 new comments -
Add experimental Dual-Batch Overlap mechanism to VLLM
#20448 commented on
Jul 25, 2025 • 0 new comments -
feat: Add support for speculators Eagle checkpoints
#20436 commented on
Jul 22, 2025 • 0 new comments -
[WIP][RC] Update PyTorch to 2.8.0
#20358 commented on
Jul 23, 2025 • 0 new comments -
[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D
#20340 commented on
Jul 25, 2025 • 0 new comments -
[Hardware][RISC-V] Add RISC-V architecture cpu inference support
#20292 commented on
Jul 25, 2025 • 0 new comments -
[Benchmark] Add benchmark tool for multi turn conversations
#20267 commented on
Jul 22, 2025 • 0 new comments -
[Frontend] add previous context to whisper transcription over 30s audio
#20249 commented on
Jul 25, 2025 • 0 new comments -
[Core,Frontend,Doc] Trace v1 cuda start up with opentelemetry (vllm-project#19318)
#20229 commented on
Jul 25, 2025 • 0 new comments -
[CI/Build][Bugfix] Fix marlin kernel not built on 4090
#20219 commented on
Jul 25, 2025 • 0 new comments -
[Nixl] Heterogeneous TP support FlashInfer
#20189 commented on
Jul 25, 2025 • 0 new comments -
v1: Introduce LRU-based CPU offloading management
#20075 commented on
Jul 23, 2025 • 0 new comments -
Add support for encoder embedding models
#19988 commented on
Jul 25, 2025 • 0 new comments -
[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper
#19983 commented on
Jul 25, 2025 • 0 new comments -
v1: Introduce an offloading component
#19848 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Support for Universal Assisted Generation
#16503 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Main-branch reasoning code reports an error during H100 inference
#16656 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: An error occurred when deploying DeepSeek-R1-Channel-INT8 on two A100 machines using lws
#16827 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: SharedStorageConnector only see first batch of tokens
#16928 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Why is the GPU memory usage after quantizing the model to int8 W8A8 with llmcompressor almost the same as before quantization?
#16959 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: The output of MathResponse is empty when running THUDM/GLM-Z1-32B-0414 with vLLM-0.8.4
#16967 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Performance degradation with increasing number of requests in long-running vLLM inference sessions
#16985 commented on
Jul 22, 2025 • 0 new comments -
[Usage]: multilora_inference with max_loras>1
#17003 commented on
Jul 22, 2025 • 0 new comments -
[Benchmark][V1][Spec Decode][EAGLE] Tracking benchmark for V1 EAGLE
#17812 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: After waking up a sleeping model in the OpenAI API server, the model generates gibberish output
#20627 commented on
Jul 21, 2025 • 0 new comments -
[Performance]: Opportunities to speed up BlockPool processing
#21141 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: When using gemma-3n in Apple Silicon I get a NotImplementedError
#20521 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: DSR1 with DEP OOM during initialization on 32xH100
#20441 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: Wrong context length for Qwen2.5-7B-Instruct?
#16757 commented on
Jul 21, 2025 • 0 new comments -
[RFC]: Response format extensions for structured outputs
#19097 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: When making a streaming request, the 9-digit integer in the function call result will be truncated to 6 digits
#21156 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Single-machine multi-GPU inference: huge gap in results between tensor-parallel-size and pipeline-parallel-size
#19136 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: KV cache wrongly reused for V1 PD disaggregation with multimodal input
#21175 commented on
Jul 21, 2025 • 0 new comments -
[Feature]: Add MXFP6 Quantization Format
#17837 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Not able to run vllm cpu using Dockerfile.cpu
#19845 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: I cannot compile vllm on RTX5090
#20345 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: wake up OOM (72B model in 8*A800(40G))
#13941 commented on
Jul 20, 2025 • 0 new comments -
[Feature]: Colocating multiple LLM engines in the same process with sleep mode.
#18975 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: nvidia/llama-nemoretriever-colembed-3b-v1
#20703 commented on
Jul 22, 2025 • 0 new comments -
[RFC]: Neuron Support for V1 Engine
#21082 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: vLLM does not support torch 2.7.1
#20566 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: When passing text prompt + image embedding as input, prefix cache usage is always 0%
#21016 commented on
Jul 22, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q3 2025
#20336 commented on
Jul 22, 2025 • 0 new comments -
[Usage]: Llama4 tool parser
#16214 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: Qwen3-Embedding-8B-GGUF
#19602 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: moonshotai/Kimi-Audio-7B-Instruct
#17234 commented on
Jul 22, 2025 • 0 new comments -
[Feature request] Output attention scores in vLLM
#3192 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Add support for attention score output
#11365 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
Jul 22, 2025 • 0 new comments -
[Performance]: phi 3.5 vision model consuming high CPU RAM and the process getting killed
#9190 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: Kimi-K2-Instruct
#20963 commented on
Jul 22, 2025 • 0 new comments -
[RFC]: EPLB Execution Optimization From pr 18343
#20805 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Ernie-4.5 vision support
#20732 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit
#21087 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: N-gram speculative decoding performs slower than Qwen3-32B-FP8 with vLLM 0.9.0.1
#19254 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Mistral tool parser & streaming: corrupt tool_calls completions
#17585 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: AsyncLLM sleep then wake_up produces meaningless outputs
#17103 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Can't use yarn rope config for long context in Qwen2 model
#10293 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Cast error details: Unable to cast 1024 to Tensor
#12771 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Can't serve on ray cluster although passing VLLM_HOST_IP
#13521 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Model download progress using tqdm
#21191 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: RuntimeError: query and key must have the same dtype when using Eagle3 speculative decoding with kv-cache-dtype fp8
#21177 commented on
Jul 20, 2025 • 0 new comments -
[New Model]: ByteDance-Seed/BAGEL-7B-MoT
#18793 commented on
Jul 20, 2025 • 0 new comments -
[Usage]: Does the model streamer support loading a model from a GCS bucket?
#12290 commented on
Jul 20, 2025 • 0 new comments -
[Feature]: gemma3 raises error
#14723 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: Embed model has additional dense module(dim=1792, but only 1024)
#15509 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: v1 engine error when using gemma-3 (v0 engine is okay)
#16643 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: InternVL3-78B OOM on 4 A100 40G in 0.8.4
#16749 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: Rocm Memory Access Fault.
#16840 commented on
Jul 20, 2025 • 0 new comments -
[New Model]: jinaai/jina-embeddings-v2-base-code
#16874 commented on
Jul 20, 2025 • 0 new comments -
[Usage]: Is it true that vllm doesn't support deepseek r1 yet with the v1 engine?
#16885 commented on
Jul 20, 2025 • 0 new comments -
[New Model]: Gemma 3n support
#18476 commented on
Jul 20, 2025 • 0 new comments -
[RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
#18037 commented on
Jul 19, 2025 • 0 new comments -
[Feature]: EXL3 support
#19896 commented on
Jul 19, 2025 • 0 new comments -
[Feature]: DRY Sampling
#8581 commented on
Jul 19, 2025 • 0 new comments -
[Bug]: Unable to use Qwen/Qwen2.5-Omni-7B with --mm-processor-kwargs
#20995 commented on
Jul 19, 2025 • 0 new comments -
[Bug]: qwen2_5vl: Internal Server Error when processing short video with vLLM 0.9.0 installed
#20313 commented on
Jul 19, 2025 • 0 new comments -
Should deepseek v3 also be updated? [examples/tool_chat_template_deepseekv3.jinja]
#21186 commented on
Jul 19, 2025 • 0 new comments -
[Bug]: Speculative decoding inconsistency for Qwen-Coder-32B
#10913 commented on
Jul 19, 2025 • 0 new comments -
[Feature]: Add Triton implementation of NVFP4 GEMM
#21014 commented on
Jul 18, 2025 • 0 new comments -
[Bug]: Requests that do not return results within 15 minutes are directly aborted, and then the request is added by vLLM again...
#20520 commented on
Jul 18, 2025 • 0 new comments -
[Bug]: Server hang with google/gemma-3-27b-it and structured decoding
#21148 commented on
Jul 18, 2025 • 0 new comments -
[Feature]: Add Support for Updating Lora Weights
#20149 commented on
Jul 18, 2025 • 0 new comments -
[Bug]: vllm serve Qwen2.5-VL-3B-Instruct run error
#21050 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Received a Qwen2VLImageProcessorFast for argument image_processor, but a Qwen2VLImageProcessor was expected
#20855 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: RuntimeError: Failed to apply Qwen2_5_VLProcessor on data={'text': '<|image_pad|>', 'images': [<PIL.Image.Image image mode=RGB size=332x27 at 0x7FA449949720>]} with kwargs={}
#21109 commented on
Jul 21, 2025 • 0 new comments -
[Feature]: support vision encoder quantization
#20729 commented on
Jul 21, 2025 • 0 new comments -
[RFC]: Better support for weight updating while waking up from sleep mode for RLHF
#15254 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: why no ray command in my docker image
#15284 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: TypeError: Unknown image model type: qwen2_5_omni for branch: qwen2_omni_public_v1
#15754 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Cannot load Qwen2.5-VL
#16429 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Request gets stuck when serving model with v1 engine
#16580 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: How to add a hook function
#16585 commented on
Jul 21, 2025 • 0 new comments -
[Feature]: Add support for AMD Strix/Strix Halo APU (gfx1150/gfx1151 RDNA 3.5)
#16621 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Using TP = 16 to serve deepseek-v3 on 2*H20 on a Ray cluster, got EngineCore exception
#16646 commented on
Jul 21, 2025 • 0 new comments -
[RFC]: KVBlocks and Metrics Publishing In Inference Frameworks
#16669 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: Request scheduling when using LoRA
#16876 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: architecture of models not correctly recognized
#16905 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: OOM occurs with 128+128 at 256 concurrency, while 4K+4K at 256 concurrency is OK. DeepSeek-R1-awq benchmark test.
#16909 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Engine Compatibility Issue with vllm 0.8.4 Loading Qwen2.5-32B-AWQ: Abnormal Behavior of v1 Engine Under High Concurrency and Solutions
#16913 commented on
Jul 21, 2025 • 0 new comments -
[UI_Bug]: Content menu and icon spacing issue in UI
#16917 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Pooling model adapter removes the attributes expected by model init
#16932 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Phi-4-MM generates gibberish for large image input with v1 chunked prefill
#16934 commented on
Jul 21, 2025 • 0 new comments -
[Performance]: Why/How vLLM uses CPU memory?
#16947 commented on
Jul 21, 2025 • 0 new comments -
[Installation]: Deploy vLLM for CPU server using GGUF model on Kubernetes
#20587 commented on
Jul 20, 2025 • 0 new comments -
[Usage]: DeepSeek R1 on a 8xH200 node is too slow
#17035 commented on
Jul 20, 2025 • 0 new comments -
[Performance]: Quantized Model Inference
#17487 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: Prefix caching ignores visual input, causing incorrect multimodal outputs under concurrency
#20261 commented on
Jul 24, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: `model_weights` generator invoked out of Model loading in EAGLE series models.
#21160 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Align the API with OAI's structured output
#7220 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Guided decoding is broken because tokenizers can't be pickled
#7557 commented on
Jul 24, 2025 • 0 new comments -
[Usage]: Confirm tool calling is not supported and this is the closest thing that can be done
#7912 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Persistent OutOfMemoryError error when using speculative decoding
#8073 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: guided generation is very slow in offline mode
#8313 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: vllm api server returns escaped unicode strings with guided backend 'outlines'
#8805 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Guided Decoding Schema Cache Store
#8902 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: Transformers 4.45.1 slows down `outlines` guided decoding
#9032 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Speculative decoding breaks guided decoding.
#9423 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Error with structured output inference after upgrade 0.6.2->0.6.3
#9462 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Support guided decoding with multistep decoding
#9893 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Llama3.3 tool calling support, or a generic and extensible llama tool calling support
#11799 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Error downloading the model when using Sonatype Nexus Repository.
#14993 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: deploy deepseek-r1-awq on 16 x 4090 48G, layer_kv_cache = torch.zeros(kv_cache_shape, [rank0]: RuntimeError: CUDA error: invalid argument
#15014 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Issue running mistralai/Magistral-Small-2506 on NVIDIA hardware
#21122 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Close feature gaps when using xgrammar for structured output
#12131 commented on
Jul 23, 2025 • 0 new comments -
[Installation]: Nightly builds not available in container registry
#19335 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Remove xformers requirement for Mistral-format Pixtral and Mistral3
#21062 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: IndexError: list index out of range on chunked prefill with speculative decoding
#20531 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Killing local vLLM worker processes in multiproc_worker_utils.py
#18577 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: XGrammar-based CFG decoding degraded after 0.6.5
#12122 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: GLM-Z1 produces garbled output with vLLM batch inference
#17157 commented on
Jul 25, 2025 • 0 new comments -
Error: kimi-vl: Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
#17162 commented on
Jul 25, 2025 • 0 new comments -
[Installation]: Bloated docker image size causes problems on k8s
#17163 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: How to deploy tensorized vllm model (deserialize) as api_server?
#17178 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: vllm LLM utils.py resolve_obj_by_qualname ValueError: not enough values to unpack (expected 2, got 1)
#17188 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Inconsistent Output: First API call differs from subsequent identical calls with temperature=0 on Qwen models
#17832 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Compile inductor / CUDA Graph build before the memory profiling
#19480 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Inconsistent outputs with deterministic sampling (temperature=0) when serving Qwen3-32B with vllm-0.8.5
#17759 commented on
Jul 24, 2025 • 0 new comments -
[Usage]: V1 Engine with Qwen3 keeps on allocating memory for cuda graphs until OOM
#21172 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: some vllm routes can be reached without authorization
#18892 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Failed to start vLLM v1 with Ray. Encountered the following error: `KeyError: 'bundles'`
#19123 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: MoE models fail at startup: AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
#18967 commented on
Jul 24, 2025 • 0 new comments -
[RFC]: KV cache offloading
#19854 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: v1 flash_attn and triton_attn backends don't have `get_state_cls`
#15630 commented on
Jul 24, 2025 • 0 new comments -
[Installation]: no version of pip install vllm works - Failed to initialize NumPy: No Module named 'numpy'
#11037 commented on
Jul 24, 2025 • 0 new comments -
[Usage]: Why is inference very slow when a large number of requests are queued?
#16444 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vllm, has anyone seen calls work normally at first (3-5 s per request) and then slow down to around 60 s per request after running for a while?
#13886 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: InternVL2_5-8B-AWQ has no any throughput benefit compared to the InternVL2_5-8B
#19195 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: qwen2-vl 7b, on vllm 0.8.1 & 0.8.2, sometimes (not deterministically but depends on data) I got: ValueError: Attempted to assign 702 = 702 multimodal tokens to 703 placeholders
#15764 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Subprocess health check / automatic restart for V1 EngineCore
#19849 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: 'MultiprocExecutor' object has no attribute 'workers'
#17756 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: The GPU memory usage of vllm v0.9.2 is significantly higher than that of v0.9.1. Why is this? How can it be improved?
#21027 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Distilled DeepSeek Models do not work with guided_json
#12548 commented on
Jul 23, 2025 • 0 new comments -
[Installation]: Can't build arm container image with podman without a SELinux relabel of bind mounts
#12734 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Specific Docker Image for vllm["audio,video"]
#13940 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: support tool and reasoning together
#14429 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: hub.docker.com Please add arm docker image
#14656 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Can CPU inference be supported with a Ray cluster?
#15266 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Support LoRA adapter for whisper
#15370 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: guided_json not working correctly with (quantized) mistral-small model
#15577 commented on
Jul 23, 2025 • 0 new comments -
[Installation]: how to run swiftkv with vllm
#16109 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Qwen2.5 tool call failed
#16393 commented on
Jul 23, 2025 • 0 new comments -
[Installation]:
#16575 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: examples/offline_inference/chat_with_tools.py JSONDecodeError
#16594 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: CPU memory not released when waking up the vLLM instance
#16663 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Bug while using deepspeed with TRL with vLLM
#16867 commented on
Jul 23, 2025 • 0 new comments -
Qwen2.5 VL and gemma-3-12b error on VLLM 8.4
#16918 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Enable Partial Guided Decoding / Structured Output Support
#16979 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: When adding the parameter tensor_parallel_size, a TypeError occurred: BackendCompilerFailed.__init__() is missing one required positional argument: 'inner_exception'.
#17018 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: add hostname in metrics for clustering deployment
#17029 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: vllm 0.8.3 v1 engine has different computation performance per iteration when serving multi-lora with different chunk size
#17034 commented on
Jul 23, 2025 • 0 new comments -
[Usage]: I have 2 nodes 16 GPUs, how can i use 16 dp+16 ep to run deepseek v3?
#17041 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: noop elimination for slice errors when end = -1
#17078 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Inconsistent behavior of AsyncLLMEngine.abort between v0 and v1
#20362 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: TPU Embedding models support?
#20869 commented on
Jul 23, 2025 • 0 new comments -
[RFC]: Lazy CUDA Graph capture
#20098 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: ValueError: There is no module or parameter named 'mm_whisper_embeddings' in LlamaForCausalLM
#21176 commented on
Jul 23, 2025 • 0 new comments -
[ROCm]: There are too many opt-in ROCm-specific flags
#21138 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: vLLM hangs forever on waiting engine process to start
#17676 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: vllm crashes when preemption of priority scheduling is triggered on vllm-0.6.3.dev173+g36ea7907.d20241011
#9342 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Issue of Unstable Output for Identical Queries
#19403 commented on
Jul 23, 2025 • 0 new comments -
[RFC][FEATURE]: TTFT Routing
#20962 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Dynamic Chunked Pipeline Parallelism
#20808 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Phi-3-small-8k cannot be served for vllm >= 0.8.5
#18168 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: ValueError when using Multi-Instance GPU
#17047 commented on
Jul 23, 2025 • 0 new comments -
[RFC]: vLLM configuration refactoring and modularization
#18953 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: ray with nixl connector failed
#20980 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: PaliGemma2 not working with OpenAI Docker serve
#12052 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Qwen2.5vl vllm serve Engine process failed to start
#17372 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: FP8 Attention on H100 - CUDA error: an illegal memory access was encountered
#21110 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
#18455 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Remove cupy dependency for multi-node Ray deployment
#19758 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
#10300 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Guided Decoding Broken in Streaming mode
#10376 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Speculative decoding + guided decoding not working
#10442 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: load and save kv cache from disk
#10611 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: xgrammar crashes with speculative decoding
#11484 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Using "response_format": { "type": "json_object" } with /v1/chat/completions is terminating the engine
#11828 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Very slow guided decoding with Outlines backend since v0.6.5
#12005 commented on
Jul 23, 2025 • 0 new comments