Insights: vllm-project/vllm
Overview
3 Releases published by 1 person
-
v0.10.0rc1
published
Jul 20, 2025 -
v0.10.0rc2
published
Jul 24, 2025 -
v0.10.0
published
Jul 24, 2025
186 Pull requests merged by 102 people
-
[TPU][Test] Rollback PR-21550.
#21619 merged
Jul 25, 2025 -
[Docs] add auto-round quantization readme
#21600 merged
Jul 25, 2025 -
[CI] Unifying Dockerfiles for ARM and X86 Builds
#21343 merged
Jul 25, 2025 -
Add support for Prithvi in Online serving mode
#21518 merged
Jul 25, 2025 -
[Kernel] Improve machete memory bound perf
#21556 merged
Jul 25, 2025 -
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend.
#20295 merged
Jul 25, 2025 -
[Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel
#20839 merged
Jul 25, 2025 -
[Frontend] Add request_id to the Request object so they can be controlled better via external load balancers
#21009 merged
Jul 25, 2025 -
[MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Vision-Instruct-3B
#20931 merged
Jul 25, 2025 -
[Model] Fix Ernie4.5MoE e_score_correction_bias parameter
#21586 merged
Jul 25, 2025 -
[Bugfix][Logprobs] Fix logprobs op to support more backend
#21591 merged
Jul 25, 2025 -
[V1] Get supported tasks from model runner instead of model config
#21585 merged
Jul 25, 2025 -
[Quantization] Enable BNB support for more MoE models
#21370 merged
Jul 25, 2025 -
[Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribute 'startswith'
#21579 merged
Jul 25, 2025 -
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-Instruct
#21598 merged
Jul 25, 2025 -
[Tests] Harden DP tests
#21508 merged
Jul 25, 2025 -
[TPU][Bugfix] fix OOM issue in CI test
#21550 merged
Jul 25, 2025 -
[Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS
#21262 merged
Jul 25, 2025 -
[CI/Build] fix cpu_extension for apple silicon
#21195 merged
Jul 25, 2025 -
[Misc][Tools] make max-model-len a parameter in auto_tune script
#21321 merged
Jul 25, 2025 -
[Model] Fix a check for None but the return value was empty list in Gemma3 MM vision_embeddings
#21479 merged
Jul 25, 2025 -
[Model] Support tensor parallel for timm ViT in Deepseek_vl2
#21494 merged
Jul 25, 2025 -
[Bugfix] fix modelscope snapshot_download serialization
#21536 merged
Jul 25, 2025 -
[CI] Update CODEOWNERS for CPU and Intel GPU
#21582 merged
Jul 25, 2025 -
Integrate TensorSchema with shape validation for Phi3VImagePixelInputs
#21232 merged
Jul 25, 2025 -
[Docs] Add requirements/common.txt to run unit tests
#21572 merged
Jul 25, 2025 -
[TPU][Test] Temporarily suspend this MoE model in test_basic.py.
#21560 merged
Jul 25, 2025 -
[DP] Support api-server-count > 0 in hybrid DP LB mode
#21510 merged
Jul 25, 2025 -
[Bugfix] DeepGemm utils : Fix hardcoded type-cast
#21517 merged
Jul 25, 2025 -
[Kernel] adding fused_moe configs for upcoming granite4
#21332 merged
Jul 25, 2025 -
Fix GLM-4 PP Missing Layer When using with PP.
#21531 merged
Jul 25, 2025 -
[Bug] Fix DeepGemm Init Error
#21554 merged
Jul 25, 2025 -
[Docs] Fix site_url for RunLLM
#21564 merged
Jul 25, 2025 -
[Frontend] run-batch supports V1
#21541 merged
Jul 25, 2025 -
[MoE] More balanced expert sharding
#21497 merged
Jul 24, 2025 -
[TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3.
#21539 merged
Jul 24, 2025 -
update flashinfer to v0.2.9rc1
#21485 merged
Jul 24, 2025 -
[Docs] Add Expert Parallelism Initial Documentation
#21373 merged
Jul 24, 2025 -
[Docs][minor] Fix broken gh-file link in distributed serving docs
#21543 merged
Jul 24, 2025 -
[P/D] Support CPU Transfer in NixlConnector
#18293 merged
Jul 24, 2025 -
[P/D] Move FakeNixlWrapper to test dir
#21328 merged
Jul 24, 2025 -
[XPU] Conditionally import CUDA-specific passes to avoid import errors on xpu platform
#21036 merged
Jul 24, 2025 -
Update flashinfer CUTLASS MoE Kernel
#21408 merged
Jul 24, 2025 -
[Bug] Fix Compressed Tensor NVFP4 cutlass_fp4_group_mm illegal memory access
#21465 merged
Jul 24, 2025 -
[Docs] Rewrite Distributed Inference and Serving guide
#20593 merged
Jul 24, 2025 -
[Docs] Update Tensorizer usage documentation
#21190 merged
Jul 24, 2025 -
[Fix] Update mamba_ssm to 2.2.5
#21421 merged
Jul 24, 2025 -
[Bugfix] Fix CUDA arch flags for MoE permute
#21426 merged
Jul 24, 2025 -
[Model] Officially support Emu3 with Transformers backend
#21319 merged
Jul 24, 2025 -
[Attention] Optimize FlashInfer MetadataBuilder Build call
#21137 merged
Jul 24, 2025 -
Bump flashinfer to v0.2.8
#21385 merged
Jul 24, 2025 -
[Feat] Allow custom naming of vLLM processes
#21445 merged
Jul 24, 2025 -
[Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_devices()
#21501 merged
Jul 24, 2025 -
Replace --expand-tools-even-if-tool-choice-none with --exclude-tools-when-tool-choice-none for v0.10.0
#20544 merged
Jul 24, 2025 -
Remove incorrect GLM-4 quantization code
#21435 merged
Jul 24, 2025 -
[Core] Support model loader plugins
#21067 merged
Jul 24, 2025 -
[Misc] Fix duplicate FusedMoEConfig debug messages
#21455 merged
Jul 24, 2025 -
[v1][Core] Clean up usages of SpecializedManager
#21407 merged
Jul 24, 2025 -
[TPU][Bugfix] fix moe layer
#21340 merged
Jul 24, 2025 -
[Bugfix][ROCm] Fix for warp_size uses on host
#21205 merged
Jul 24, 2025 -
Deduplicate Transformers backend code using inheritance
#21461 merged
Jul 24, 2025 -
Add think chunk
#21333 merged
Jul 24, 2025 -
[BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses
#21211 merged
Jul 24, 2025 -
Dump input metadata on crash for async scheduling
#21258 merged
Jul 24, 2025 -
[DP] Internal Load Balancing Per Node [one-pod-per-node]
#21238 merged
Jul 24, 2025 -
[BugFix] Fix KVConnector TP worker aggregation
#21473 merged
Jul 24, 2025 -
[BugFix]: Batch generation from prompt_embeds fails for long prompts
#21390 merged
Jul 24, 2025 -
[Bugfix] Fix example disagg_example_p2p_nccl_xpyd.sh zombie process
#21437 merged
Jul 24, 2025 -
[Bugfix] Fix casing warning
#21468 merged
Jul 24, 2025 -
[XPU][UT] increase intel xpu CI test scope
#21492 merged
Jul 24, 2025 -
[Misc] Add dummy maverick test to CI
#21324 merged
Jul 24, 2025 -
[Frontend] Set MAX_AUDIO_CLIP_FILESIZE_MB via env var instead of hardcoding
#21374 merged
Jul 24, 2025 -
feat(gguf_loader): accept HF repo paths & URLs for GGUF
#20793 merged
Jul 24, 2025 -
[Core] Freeze gc during cuda graph capture to speed up init
#21146 merged
Jul 24, 2025 -
[V0 Deprecation] Remove Prompt Adapters
#20588 merged
Jul 23, 2025 -
[V1] Fix local chunked attention always disabled
#21419 merged
Jul 23, 2025 -
[Core] Add reload_weights RPC method
#20096 merged
Jul 23, 2025 -
[TPU][TEST] Fix the downloading issue in TPU v1 test 11.
#21418 merged
Jul 23, 2025 -
Add test case for compiling multiple graphs
#21044 merged
Jul 23, 2025 -
[Core][Model] PrithviMAE Enablement on vLLM v1 engine
#20577 merged
Jul 23, 2025 -
[Tests] Add tests for headless internal DP LB
#21450 merged
Jul 23, 2025 -
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qwen 1m models.
#21364 merged
Jul 23, 2025 -
[V1] Check all pooling tasks during profiling
#21299 merged
Jul 23, 2025 -
[Model] add Hunyuan V1 Dense Model support.
#21368 merged
Jul 23, 2025 -
[Docs] Clean up v1/metrics.md
#21449 merged
Jul 23, 2025 -
[Misc] fixed nvfp4_moe test failures due to invalid kwargs
#21246 merged
Jul 23, 2025 -
Mamba V2 Test not Asserting Failures.
#21379 merged
Jul 23, 2025 -
[Sampler] Introduce logprobs mode for logging
#21398 merged
Jul 23, 2025 -
[Docs] Fix bullets and grammars in tool_calling.md
#21440 merged
Jul 23, 2025 -
Fixed typo in profiling logs
#21441 merged
Jul 23, 2025 -
[Bugfix] ensure tool_choice is popped when tool_choice:null is passed in json payload
#19679 merged
Jul 23, 2025 -
add clear messages for deprecated models
#21424 merged
Jul 23, 2025 -
[Cleanup] Only log MoE DP setup warning if DP is enabled
#21315 merged
Jul 23, 2025 -
[Core] Add basic unit test for maybe_evict_cached_block
#21400 merged
Jul 23, 2025 -
[Bugfix] Fix nightly transformers CI failure
#21427 merged
Jul 23, 2025 -
Changing "amdproduction" allocation.
#21409 merged
Jul 23, 2025 -
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported
#21420 merged
Jul 23, 2025 -
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update
#21414 merged
Jul 23, 2025 -
[BugFix] Update python to python3 calls for image; fix prefix & input calculations.
#21391 merged
Jul 23, 2025 -
Simplify weight loading in Transformers backend
#21382 merged
Jul 23, 2025 -
[Bugfix][ROCm][Build] Fix build regression on ROCm
#21393 merged
Jul 23, 2025 -
[CI/Build] Fix model executor tests
#21387 merged
Jul 23, 2025 -
[BugFix] Fix ray import error mem cleanup bug
#21381 merged
Jul 22, 2025 -
[Misc] Copy HF_TOKEN env var to Ray workers
#21406 merged
Jul 22, 2025 -
[Model] Add Qwen3CoderToolParser
#21396 merged
Jul 22, 2025 -
Fix Flashinfer Allreduce+Norm enable/disable calculation based on fi_allreduce_fusion_max_token_num
#21325 merged
Jul 22, 2025 -
[CI/Build] Fix test failure due to updated model repo
#21375 merged
Jul 22, 2025 -
[Bugfix] Decode Tokenized IDs to Strings for hf_processor in llm.chat() with model_impl=transformers
#21353 merged
Jul 22, 2025 -
Add tokenization_kwargs to encode for embedding model truncation
#21033 merged
Jul 22, 2025 -
Revert "[Refactor] Fix Compile Warning #1444-D (#21208)"
#21384 merged
Jul 22, 2025 -
[feat] Enable mm caching for transformers backend
#21358 merged
Jul 22, 2025 -
Adds parallel model weight loading for runai_streamer
#21330 merged
Jul 22, 2025 -
[Perf] Cuda Kernel for Per Token Group Quant
#21083 merged
Jul 22, 2025 -
[feat]: add SM100 support for cutlass FP8 groupGEMM
#20447 merged
Jul 22, 2025 -
[perf] Add fused MLA QKV + strided layernorm
#21116 merged
Jul 22, 2025 -
[Misc] unify variable for LLM instance v2
#21356 merged
Jul 22, 2025 -
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to further optimize block_pool
#21222 merged
Jul 22, 2025 -
[benchmark] Port benchmark request sent optimization to benchmark_serving
#21209 merged
Jul 22, 2025 -
[Core] Optimize update checks in LogitsProcessor
#21245 merged
Jul 22, 2025 -
[Misc] Remove deprecated args in v0.10
#21349 merged
Jul 22, 2025 -
[Bugfix] Fix eviction cached blocked logic
#21357 merged
Jul 22, 2025 -
Add arcee model
#21296 merged
Jul 22, 2025 -
[Feature][eplb] add verify ep or tp or dp
#21102 merged
Jul 22, 2025 -
Update fp4 quantize API
#21327 merged
Jul 22, 2025 -
[Bug] DeepGemm: Fix Cuda Init Error
#21312 merged
Jul 22, 2025 -
[Misc] DeepEPHighThroughtput - Enable Inductor pass
#21311 merged
Jul 22, 2025 -
Fix kv_cache_dtype handling for out-of-tree HPU plugin
#21302 merged
Jul 22, 2025 -
[Refactor] Fix Compile Warning #1444-D
#21208 merged
Jul 22, 2025 -
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTensor are compatible
#21300 merged
Jul 22, 2025 -
[Core] Minimize number of dict lookup in _maybe_evict_cached_block
#21281 merged
Jul 22, 2025 -
Revert "[Performance] Performance improvements in non-blockwise fp8 CUTLASS MoE (#20762)
#21334 merged
Jul 22, 2025 -
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU
#21338 merged
Jul 22, 2025 -
[Doc] Fix CPU doc format
#21316 merged
Jul 22, 2025 -
[XPU] Enable external_launcher to serve as an executor via torchrun
#21021 merged
Jul 22, 2025 -
[v1][sampler] Inplace logprobs comparison to get the token rank
#21283 merged
Jul 21, 2025 -
[perf] Speed up align sum kernels
#21079 merged
Jul 21, 2025 -
Fix bad lm-eval fork
#21318 merged
Jul 21, 2025 -
[DP] Fix Prometheus Logging
#21257 merged
Jul 21, 2025 -
[Attention] Clean up iRoPE in V1
#21188 merged
Jul 21, 2025 -
[Misc] Add dummy maverick test
#21199 merged
Jul 21, 2025 -
[BugFix] Make utils.current_stream thread-safe (#21252)
#21253 merged
Jul 21, 2025 -
[CPU] Enable shared-memory based pipeline parallel for CPU backend
#21289 merged
Jul 21, 2025 -
[Misc] Add sliding window to flashinfer test
#21282 merged
Jul 21, 2025 -
Add Nvidia ModelOpt config adaptation
#19815 merged
Jul 21, 2025 -
[Misc] unify variable for LLM instance
#20996 merged
Jul 21, 2025 -
[Docs] Make tables more space efficient in supported_models.md
#21291 merged
Jul 21, 2025 -
[Docs] Fix hardcoded links in docs
#21287 merged
Jul 21, 2025 -
[Model][1/N] Support multiple poolers at model level
#21227 merged
Jul 21, 2025 -
[Bugfix] Fix missing placeholder in logger debug
#21280 merged
Jul 21, 2025 -
Add the instruction to run e2e validation manually before release
#21023 merged
Jul 21, 2025 -
[Docs] Add RFC Meeting to Issue Template
#21279 merged
Jul 21, 2025 -
[CI] Cleanup modelscope version constraint in Dockerfile
#21243 merged
Jul 21, 2025 -
[bugfix] fix syntax warning caused by backslash
#21251 merged
Jul 20, 2025 -
[Model] Support VLMs with transformers backend
#20543 merged
Jul 20, 2025 -
[Docs] Upgrade VLLM version to 0.10.0 for installing from vLLM's binaries
#21240 merged
Jul 20, 2025 -
[Model] use AutoWeightsLoader for bart
#18299 merged
Jul 20, 2025 -
Enable v1 metrics tests
#20953 merged
Jul 20, 2025 -
[TPU] support fp8 kv cache quantization
#19292 merged
Jul 20, 2025 -
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid models.
#21233 merged
Jul 19, 2025 -
GLM-4 Update
#20736 merged
Jul 19, 2025 -
[BugFix] Fix full cuda graph slot_mapping
#21228 merged
Jul 19, 2025 -
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small
#21217 merged
Jul 19, 2025 -
[V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers
#21194 merged
Jul 19, 2025 -
[BugFix] Make PD work with Ray
#21072 merged
Jul 19, 2025 -
[Docs] Update the link to the 'Prometheus/Grafana' example
#21225 merged
Jul 19, 2025 -
[CI/CD][bugfix]fix: error argument to loads has incompatible type
#21223 merged
Jul 19, 2025 -
Fix/remove some broken model executor tests
#21224 merged
Jul 19, 2025 -
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend and refactor code
#21032 merged
Jul 19, 2025 -
[Bugfix][Frontend] Fix openai CLI arg middleware
#21220 merged
Jul 19, 2025 -
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low latency
#20645 merged
Jul 19, 2025 -
Add torch golden impl for moe_align_block_size kernel test
#20653 merged
Jul 19, 2025 -
[BugFix] Fix potential cuda-graph IMA
#21196 merged
Jul 19, 2025 -
[Bugfix] Fix ndarray video color from VideoAsset
#21064 merged
Jul 19, 2025 -
[V0 deprecation] Remove long context LoRA
#21169 merged
Jul 19, 2025 -
Fix a couple of Voxtral tests
#21218 merged
Jul 19, 2025 -
[Misc][Tools][Benchmark] Add readme file for auto_tune script
#20779 merged
Jul 19, 2025 -
[Model] EXAONE 4.0 model support
#21060 merged
Jul 19, 2025 -
[BugFix][CPU] Fix TorchSDPABackendImpl doesn't have use_irope
#21200 merged
Jul 19, 2025 -
[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm kernel
#21193 merged
Jul 19, 2025 -
[V0 Deprecation] Remove V0 Spec Decode workers
#21152 merged
Jul 19, 2025 -
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503
#21183 merged
Jul 19, 2025 -
[Core] Support Local Chunked Attention for Hybrid KV Cache
#19351 merged
Jul 19, 2025 -
[Quantization] Enable BNB support for more MoE models
#21100 merged
Jul 19, 2025 -
Elastic Expert Parallel Initial Support
#20775 merged
Jul 19, 2025 -
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series)
#21077 merged
Jul 18, 2025
144 Pull requests opened by 113 people
-
Refactor xformers check for MultiHeadAttention
#21210 opened
Jul 18, 2025 -
Add chat doc in quick start
#21213 opened
Jul 19, 2025 -
[Feature] [V1] intermediate logging
#21215 opened
Jul 19, 2025 -
[Bugfix] missing kv_cache_scheme
#21221 opened
Jul 19, 2025 -
[WIP][Kernel]FusedMoE LoRA
#21229 opened
Jul 19, 2025 -
[Nixl] Debug logging
#21230 opened
Jul 19, 2025 -
[CI/Build] Add bc-linter to vLLM CI
#21234 opened
Jul 19, 2025 -
[no commit] bc-linter demo
#21235 opened
Jul 19, 2025 -
[bugfix] Remove the attribute 'version' from docker compose
#21241 opened
Jul 20, 2025 -
[FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel
#21242 opened
Jul 20, 2025 -
Make async scheduling compatible with DP
#21244 opened
Jul 20, 2025 -
[Core] Prototype: Move hash_request_tokens computation from input request threads
#21248 opened
Jul 20, 2025 -
[v1] - Mamba1 Attention Metadata
#21249 opened
Jul 20, 2025 -
raise 400 Bad Request with detailed error message for `aiohttp.ClientError`
#21255 opened
Jul 20, 2025 -
rank -> TP rank in MultiProcExecutor log
#21256 opened
Jul 20, 2025 -
[Fix] correct tool_id for kimi-k2 when use tool_choice=required
#21259 opened
Jul 20, 2025 -
Fix docstring of PyNcclCommunicator device arg
#21268 opened
Jul 20, 2025 -
Support encoder-only models without KV-Cache
#21270 opened
Jul 20, 2025 -
[Core] Add max-waiting-queue-length parameter to reject requests when queue is full
#21271 opened
Jul 20, 2025 -
WIP: Add EPLB support for Grok1
#21273 opened
Jul 21, 2025 -
[Model] vllm v1 support mlp_speculator
#21276 opened
Jul 21, 2025 -
[Misc][Numerics] Basic logprobs benchmark tool
#21286 opened
Jul 21, 2025 -
[Feature][EPLB] Add support for Qwen3 EPLB
#21290 opened
Jul 21, 2025 -
Fix docker/AppArmor crash caused by cpuinfo __cpuid jit path
#21305 opened
Jul 21, 2025 -
Support CUTLASS NVFP4 (w4a4) for Blackwell Geforce GPUs (SM120)
#21309 opened
Jul 21, 2025 -
[DBO] Adding the `UBatchContext` class for DBO support
#21314 opened
Jul 21, 2025 -
adds include_thinking optional Param to Request object to preserve re…
#21317 opened
Jul 21, 2025 -
[WIP][RC] Update PyTorch to 2.8.0
#21320 opened
Jul 21, 2025 -
Support Tensorrt-LLM MoE fp4 for low-latency
#21331 opened
Jul 21, 2025 -
[Refactor] Remove `moe_align_block_size_triton`
#21335 opened
Jul 21, 2025 -
Support DeepSeekV3-style block FP8 quantization with CT
#21337 opened
Jul 21, 2025 -
Add anthropic endpoint
#21341 opened
Jul 22, 2025 -
[V1] port xformers backend to v1
#21342 opened
Jul 22, 2025 -
[Speculative Decoding] Add `speculators` Config Support
#21345 opened
Jul 22, 2025 -
[V0 deprecation] Guided decoding
#21347 opened
Jul 22, 2025 -
[AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile
#21350 opened
Jul 22, 2025 -
[Core] Minor comments and asserts changes in block pool
#21351 opened
Jul 22, 2025 -
[Core][Feat] Add max-waiting-queue-length parameter to reject requests when waiting queue is full
#21352 opened
Jul 22, 2025 -
[xpu] disable cudagraph for xpu platform
#21354 opened
Jul 22, 2025 -
[CI/Build][Doc] Move existing benchmark scripts in CI/document/example to vllm bench CLI
#21355 opened
Jul 22, 2025 -
[Bugfix] FIX hermes tool parser streaming bug when using function call
#21360 opened
Jul 22, 2025 -
[feat] Support EAGLE for Qwen2
#21363 opened
Jul 22, 2025 -
fix: return {} for tool arguments when no argument is needed, so that…
#21365 opened
Jul 22, 2025 -
[ROCm] Auto-Select Attention Backend
#21366 opened
Jul 22, 2025 -
[V1][CUDA] Full cudagraph support for FlashInfer
#21367 opened
Jul 22, 2025 -
skip fusedmoe layer for start_load_kv
#21378 opened
Jul 22, 2025 -
[Bugfix][Apple Silicon] fix missing symbols when build from source on Mac with Apple Silicon
#21380 opened
Jul 22, 2025 -
[wip] add nccl allocator and symm memory and enable TP all reduce for nccl symm
#21383 opened
Jul 22, 2025 -
[tests] test_async_llm_engine.py
#21388 opened
Jul 22, 2025 -
Add `flashinfer_python` to CUDA wheel requirements
#21389 opened
Jul 22, 2025 -
[Model] Refactor JambaForCausalLM
#21394 opened
Jul 22, 2025 -
[wip]
#21395 opened
Jul 22, 2025 -
[V1] [Hybrid] Enable Full CUDA Graph (decode-only) for Mamba layers
#21401 opened
Jul 22, 2025 -
Refactor dense FP8 tensor/channel/block utils and add CT FP8 block
#21404 opened
Jul 22, 2025 -
[NVIDIA] Explicitly disable shuffled weights for flashinfer blockscale moe fp8 kernels
#21411 opened
Jul 22, 2025 -
[v1][attention] Support Hybrid Allocator + FlashInfer
#21412 opened
Jul 22, 2025 -
Intentionally fail parallel sampling test
#21413 opened
Jul 22, 2025 -
[WIP] Prepare for DI integration. Currently DI is not used; but this is to make sure both paths run fine.
#21415 opened
Jul 22, 2025 -
Updates to Flex + VLLm integration
#21416 opened
Jul 22, 2025 -
[TPU] Support Pathways in vLLM
#21417 opened
Jul 22, 2025 -
[BugFix] Fix shared storage connector load kv only load attention layer
#21428 opened
Jul 23, 2025 -
[Misc] Improve memory profiling debug message
#21429 opened
Jul 23, 2025 -
[Fix] Connect fx_graph_cache option to envs.VLLM_DISABLE_COMPILE_CACHE
#21430 opened
Jul 23, 2025 -
[TPU][Test] Divide TPU v1 Test into 2 parts.
#21431 opened
Jul 23, 2025 -
[Bugfix]check core_engine process exit unexpectedly
#21443 opened
Jul 23, 2025 -
[Bugfix] Fixed the missing metrics in output
#21444 opened
Jul 23, 2025 -
Support online_serving for qwen3-reranker model
#21446 opened
Jul 23, 2025 -
v1/offloading: Add worker-side CPU support
#21448 opened
Jul 23, 2025 -
Add TNG Tool Call Parser
#21456 opened
Jul 23, 2025 -
[NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend
#21458 opened
Jul 23, 2025 -
[Refactor] Fix Compile Warning #1444-D
#21462 opened
Jul 23, 2025 -
[Misc] Move comment to reflect original intent
#21464 opened
Jul 23, 2025 -
[Model] Mamba2 varlen and metadata refactor
#21467 opened
Jul 23, 2025 -
[Deprecation][2/N] Replace `--task` with `--runner` and `--convert`
#21470 opened
Jul 23, 2025 -
DeepGEMM is not enabled on B200 when loading DeepSeek R1
#21472 opened
Jul 23, 2025 -
[Perf] Cuda Kernel for Int8 Per Token Group Quant
#21476 opened
Jul 23, 2025 -
[v1][spec decode] Run eagle with full cudagraph support
#21477 opened
Jul 23, 2025 -
Add interleaved RoPE test for Llama4 (Maverick)
#21478 opened
Jul 23, 2025 -
Llama4 FP4 Support
#21484 opened
Jul 23, 2025 -
Update vLLM Benchmark Suite for Xeon based on 0.9.2 release
#21486 opened
Jul 24, 2025 -
improve estimation of available KV Cache memory
#21489 opened
Jul 24, 2025 -
[V1][Neuron] Neuron chunked prefill V1 impl
#21490 opened
Jul 24, 2025 -
Delete useless allgather in qwen2_5_vl vit attention
#21493 opened
Jul 24, 2025 -
[ROCm] [V1] [SpecDec] Enable Speculative Decoding on ROCm V1 Engine
#21496 opened
Jul 24, 2025 -
[NVIDIA] Fix Llama4 Scout FP4 functionality issues
#21499 opened
Jul 24, 2025 -
[Bugfix] Fix retrieve_process not ending normally and resources not being released properly
#21502 opened
Jul 24, 2025 -
[V1][SpecDecode]Support relaxed acceptance for thinking tokens in speculative decoding in V1
#21506 opened
Jul 24, 2025 -
[bugfix] fix profile impact benchmark results
#21507 opened
Jul 24, 2025 -
[Bugfix] Fix v1 engine crash in priority scheduling with parallel sampling (n > 1)
#21519 opened
Jul 24, 2025 -
support silu vectorization
#21521 opened
Jul 24, 2025 -
[ROCm] Add flag to avoid `invalid device ordinal` HIP error
#21522 opened
Jul 24, 2025 -
[Docs] Fix the outdated URL for installing from vLLM binaries
#21523 opened
Jul 24, 2025 -
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backend
#21525 opened
Jul 24, 2025 -
Only try and load generation config if it will be used
#21526 opened
Jul 24, 2025 -
[Bugfix] Investigate Qwen2-VL failing test
#21527 opened
Jul 24, 2025 -
[Bugfix] v1 fix current scheduling defects and enhance the scheduling preemption logic.
#21528 opened
Jul 24, 2025 -
[Docs] add offline serving multi-modal video input example for Qwen2.5-VL
#21530 opened
Jul 24, 2025 -
[Bugfix] Handle None case for dt_bias and D in selective_state_update
#21532 opened
Jul 24, 2025 -
Add DeepGEMM to Dockerfile in vllm-base image
#21533 opened
Jul 24, 2025 -
[V1] Exception Handling when Loading KV Cache from Remote Store
#21534 opened
Jul 24, 2025 -
[Bugfix] Add startup probe and fix disable extraInit container in online deploy helm chart
#21535 opened
Jul 24, 2025 -
[Bugfix] Fix sync_and_slice_intermediate_tensors
#21537 opened
Jul 24, 2025 -
[BugFix] Harden distributed DP startup
#21538 opened
Jul 24, 2025 -
[Bugfix] Always set RAY_ADDRESS for Ray actor before spawn
#21540 opened
Jul 24, 2025 -
Enable 4bit bnb prequant MOE
#21548 opened
Jul 24, 2025 -
[V1] [Kernel] Change KV cache layout to (num_blocks, 2, ...) for FlashAttention backend
#21549 opened
Jul 24, 2025 -
[Do not merge] Debug TPU issues with Xet
#21551 opened
Jul 24, 2025 -
[TPU] Update ptxla nightly version to 20250724
#21555 opened
Jul 24, 2025 -
[V1] [Hybrid] Validate compatibility of attention backend batch reordering at init time
#21557 opened
Jul 24, 2025 -
[Feat][Scheduler] Implement shortest prefill first scheduling
#21558 opened
Jul 24, 2025 -
[Test] Add Unit Test for Batched DeepGEMM
#21559 opened
Jul 24, 2025 -
[Misc] add options for auto_tune
#21566 opened
Jul 25, 2025 -
[cutlass] Bump version to 410
#21569 opened
Jul 25, 2025 -
[Draft][Docs] Factor out troubleshooting to its own guide; add section for Ray Observability
#21578 opened
Jul 25, 2025 -
adding params_dtype for vocab parallel embedding layer
#21580 opened
Jul 25, 2025 -
[Draft][Docs] Expand introduction to Ray in Multi-node deployment section
#21584 opened
Jul 25, 2025 -
[CI/Build] Fix failing tensorizer tests on AMD
#21587 opened
Jul 25, 2025 -
[WIP] local attention no hybrid kv cache + support multiple attention metadata builders per kv_cache_spec
#21588 opened
Jul 25, 2025 -
Add option to propagate padded logits_indices to model
#21590 opened
Jul 25, 2025 -
[V1] large block_size solution
#21597 opened
Jul 25, 2025 -
[CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES
#21599 opened
Jul 25, 2025 -
KV Cache swap num_blocks layout + Heterogeneous TP for NixlConnector
#21607 opened
Jul 25, 2025 -
[Model] Fix for Granite 4 to work with compressed_tensors
#21608 opened
Jul 25, 2025 -
Use all_stop_token_ids instead of stop_token_ids
#21610 opened
Jul 25, 2025 -
[Bugfix] SharedStorage Connector for V1 PD multimodal
#21611 opened
Jul 25, 2025 -
[Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds to use dtype comparison
#21612 opened
Jul 25, 2025 -
[BugFix] Improve internal DP load balancing
#21617 opened
Jul 25, 2025 -
[Misc] remove unused try-except in pooling config check
#21618 opened
Jul 25, 2025 -
Migrate AriImagePixelInputs to TensorSchema for shape validation
#21620 opened
Jul 25, 2025 -
[Core] Hidden State Processors via plugins
#21621 opened
Jul 25, 2025 -
Migrate AyaVisionImagePixelInputs to TensorSchema for shape validation
#21622 opened
Jul 25, 2025 -
[Doc] Add FusedMoE Modular Kernel Documentation
#21623 opened
Jul 25, 2025 -
[V1][Hybrid] Make KV cache layout of triton_attn compatible with hybrid models
#21624 opened
Jul 25, 2025 -
[do not merge] IL tool
#21625 opened
Jul 25, 2025 -
[Attention] Make CutlassMLA the default backend for SM100 (blackwell)
#21626 opened
Jul 25, 2025 -
[Core] Move EngineCoreRequest to Request conversion out of EngineCore
#21627 opened
Jul 25, 2025 -
Support Intern-S1
#21628 opened
Jul 25, 2025 -
[Bug] Update auto_tune.sh to separate benchmarking and profiling.
#21629 opened
Jul 25, 2025 -
[Fix] Bump triton version in rocm-build requirements
#21630 opened
Jul 25, 2025 -
[Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor
#21631 opened
Jul 25, 2025 -
[Feature][EPLB] Add EPLB support for Ernie4.5-MoE
#21632 opened
Jul 25, 2025 -
[Bug] Fix `has_flashinfer_moe` Import Error when it is not installed
#21634 opened
Jul 25, 2025
145 Issues closed by 42 people
-
[Bug]: [DP/EP] DeepGEMM with Qwen Fails
#21562 closed
Jul 25, 2025 -
[New Model]: HyperClova X SEED (ChatClova)
#21275 closed
Jul 25, 2025 -
[New Model]: Support HCXVisionForCausalLM
#19963 closed
Jul 25, 2025 -
[Bug]: Failed to execute_model with logprobs on v0.10.0rc2
#21592 closed
Jul 25, 2025 -
[Bug]: AttributeError: 'PosixPath' object has no attribute 'startswith'
#19173 closed
Jul 25, 2025 -
[New Model]: please surpport google/medgemma-27b-it
#20806 closed
Jul 25, 2025 -
[Bug]: Regression in vllm 0.9.2 for (at least) google/medgemma-27b-it
#21601 closed
Jul 25, 2025 -
[Bug]: qwen2.5-vl-3B inference with lora "unsupported LoRA weight"
#21500 closed
Jul 25, 2025 -
[RFC]: Schema for checking input shapes for multi-modal models
#14764 closed
Jul 25, 2025 -
[Bug]: set n=2 in the sampling parameter, but the final return result only contains one sequence
#21288 closed
Jul 25, 2025 -
[Bug]:
#21575 closed
Jul 25, 2025 -
[Bug]: vLLM 0.7.3 TypeError in vllm.entrypoints.api_server Argument Parsing
#13848 closed
Jul 25, 2025 -
[Bug]: Enable lora returns garbage output
#14392 closed
Jul 25, 2025 -
[Bug]: CDNA cc >= 90, choose_mp_linear_kernel MacheteLinearKernel is possible
#14996 closed
Jul 25, 2025 -
[Feature]: Add CoT dataset to the benchmark
#15378 closed
Jul 25, 2025 -
[Bug]: awq Deepseek-R1-AWQ The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
#15386 closed
Jul 25, 2025 -
[Bug]: `Phi-4-multimodal-instruct` encoder outputs didn't have the same length as defined in input_ids
#15404 closed
Jul 25, 2025 -
[Feature]: Implement Embedding Models in V1
#15406 closed
Jul 25, 2025 -
[Bug]: logprobs/ranks not matching when comparing `vllm` with `transformers`
#15420 closed
Jul 25, 2025 -
[Performance]: Regarding the issue of context length for QWQ-32B in different distributed environments:
#15442 closed
Jul 25, 2025 -
[Usage]: Question about Interleaved Text/Image Format in Online Inference
#15449 closed
Jul 25, 2025 -
[Usage]: vLLM hangs when starting the server
#15451 closed
Jul 25, 2025 -
[Feature]: preprocessing of weights in advance
#15459 closed
Jul 25, 2025 -
[Doc]: https://docs.vllm.ai/en/latest/deployment/k8s.html not working
#15461 closed
Jul 25, 2025 -
[Usage]: Phi-4-multimodal-instruct
#15468 closed
Jul 25, 2025 -
[Bug]: Unknown gguf model_type: gemma3
#15480 closed
Jul 25, 2025 -
[Bug]: Allow flexible message role ordering in conversations (user/assistant in any sequence)
#15486 closed
Jul 25, 2025 -
[Bug]: Support Bitsandbytes weight loading when offline (via huggingface cache)
#15507 closed
Jul 25, 2025 -
[Feature]: Reason model reasoning effort feature like OpenAI
#15524 closed
Jul 25, 2025 -
[Bug]: VLLM_NCCL_SO_PATH take no effects when spawn worker
#15525 closed
Jul 25, 2025 -
[Usage]: Qwen2.5-VL-32B-Instruct fails to start on 4x RTX 4090 GPUs
#15529 closed
Jul 25, 2025 -
[Installation]: flaky publishing of cpu image
#15547 closed
Jul 25, 2025 -
[Bug]: Tools parsing issues with mistral3.1
#15549 closed
Jul 25, 2025 -
[Feature]: LMCache support to the CPU version of vLLM
#15562 closed
Jul 25, 2025 -
[Feature]: Ring Attention for Long Context in vLLM - RL Applications Focus
#15566 closed
Jul 25, 2025 -
[Bug]: Compressed Tensor NVFP4 `cutlass_fp4_group_mm` illegal memory access
#21399 closed
Jul 24, 2025 -
[Bug]: Crash in fused_moe.py due to Triton illegal memory access
#21520 closed
Jul 24, 2025 -
[Usage]: Viability of Data Parallelism with FP8 KV-Cache and tpu_int8 on TPU v4-64
#21459 closed
Jul 24, 2025 -
[New Model]: Emu3
#11008 closed
Jul 24, 2025 -
[Bug]: Ray/vLLM RuntimeError: HIP error: invalid device ordinal (reopen)
#21457 closed
Jul 24, 2025 -
[Bug]: the shape of b in function w8a8_block_fp8_matmul
#21503 closed
Jul 24, 2025 -
[Bug]: vllm start gemma3 fail: NotImplementedError: Vlm do not work with prefix caching yet rank=6
#21498 closed
Jul 24, 2025 -
[Bug]: Batch generation from prompt_embeds fails for long prompts
#21386 closed
Jul 24, 2025 -
[Feature]: Support One Pod Per Node LB for DP/EP
#21261 closed
Jul 24, 2025 -
[Bug]: After online_serving disagg_example_p2p_nccl_xpyd.sh cleanup, there is a zombie process
#21432 closed
Jul 24, 2025 -
[Feature]: Support for specific GGUF model in a HF Repo
#20084 closed
Jul 24, 2025 -
[Performance]: How to Improve Performance Under Concurrency
#9722 closed
Jul 24, 2025 -
[Bug]: AssertionError assert self.num_blocks >= nixl_agent_meta.num_blocks
#19338 closed
Jul 23, 2025 -
[Feature]: Remove Unused Moe Permute / Un-permute
#21124 closed
Jul 23, 2025 -
[Bug]: openai whisper model response is not accurate on AMD-based(MI300x) systems.
#20069 closed
Jul 23, 2025 -
[Bug]: Qwen2.5 1M models no longer working since v.0.8.5
#21452 closed
Jul 23, 2025 -
[Bug]: Guided decoding with Phi-3-small crashes
#6193 closed
Jul 23, 2025 -
[Usage]: cannot import name 'VoxtralForConditionalGeneration' from 'transformers'
#21369 closed
Jul 23, 2025 -
[Bug]: vllm cannot connect to an external ray cluster
#14349 closed
Jul 23, 2025 -
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.
#14435 closed
Jul 23, 2025 -
[Installation]: Cannot compile vLLM from source on XPU
#14747 closed
Jul 23, 2025 -
[Bug]: AssertionError with Speculative Decoding in vLLM Using DeepSeek R1 Distill Qwen Models
#14939 closed
Jul 23, 2025 -
[Bug]: Internal Server Error when using Qwen2-VL-7B with vLLM Docker Container
#15110 closed
Jul 23, 2025 -
[Usage]: relationship between embedding size and vocab_size
#15131 closed
Jul 23, 2025 -
[Feature]: Ability to warm up vLLM instances
#15225 closed
Jul 23, 2025 -
[Bug]: working with openai-agents SDK, using Runner.run_streamed() got a function call error
#15256 closed
Jul 23, 2025 -
[Feature]: Dynamic Memory Release for GPU after idle time
#15287 closed
Jul 23, 2025 -
[Bug]: Crashing on unsupported Sampling params
#15312 closed
Jul 23, 2025 -
[Bug]: 0.8.0 and 0.8.1 bugs
#15365 closed
Jul 23, 2025 -
[Bug]: VLLM Build Using Docker Error Deploy
#15376 closed
Jul 23, 2025 -
[Feature]: Support Top-nσ sampling
#15379 closed
Jul 23, 2025 -
[Bug]: Different logprobs output behaviour under vllm 0.8.0 and 0.8.1
#15381 closed
Jul 23, 2025 -
[Feature]: Request for Support of Dense and Sparse Features in bge-m3 Embedding Model
#15384 closed
Jul 23, 2025 -
[New Model]: Baichuan-Audio
#15425 closed
Jul 23, 2025 -
[Usage]: when setting quantizaion AWQ on AWQ model it slows down the model execution by up to 5x
#21376 closed
Jul 22, 2025 -
[Usage]: How to turn off thinking using OpenAI client?
#20976 closed
Jul 22, 2025 -
[Bug]: Failed profiling vllm (both offline and server) with Nsight Systems
#20178 closed
Jul 22, 2025 -
[Bug]: OOM Error with Qwen/Qwen3-235B-A22B on Python SDK
#21361 closed
Jul 22, 2025 -
[Bug]: BART broken on vLLM 0.8.1 and above. (Even on v0 engine).
#19981 closed
Jul 22, 2025 -
[Bug]: RuntimeError: CUDA error: initialization error in `_is_fa2_supported`
#21304 closed
Jul 22, 2025 -
[Bug]: Llama4 Maverick runtime error (shuffle_rows)
#21322 closed
Jul 22, 2025 -
[Bug]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
#3900 closed
Jul 22, 2025 -
[Bug]: Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks, no results
#3998 closed
Jul 22, 2025 -
[Bug]: TRACKING ISSUE: `AsyncEngineDeadError`
#5901 closed
Jul 22, 2025 -
[Bug]: No available block found in 60 second in shm
#6614 closed
Jul 22, 2025 -
[Usage]: How to benchmark throughput of DeepSeek-R1-671B on 2 nodes
#15024 closed
Jul 22, 2025 -
[Doc]: new attention layer
#15077 closed
Jul 22, 2025 -
[Bug]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
#15327 closed
Jul 22, 2025 -
[Bug]: Prometheus DP Metrics
#21260 closed
Jul 21, 2025 -
[Bug]: V1 + FLASH_ATTN V3 + FP8 kv-cache randomly crashes w/qwen3 (and other models)
#17442 closed
Jul 21, 2025 -
[New Model]: nvidia/DeepSeek-R1-FP4
#16323 closed
Jul 21, 2025 -
[Bug]: CUDA kernel image error when serving Llama4 Maverick since #20694
#20847 closed
Jul 21, 2025 -
[Feature]: Simplify speculative-config format for vllm serve
#19709 closed
Jul 21, 2025 -
[Bug]: pynccl leads to incorrect data in multi-thread GPU-worker
#21252 closed
Jul 21, 2025 -
jinaai/jina-reranker-v1-turbo-en not compatible with vLLM
#16153 closed
Jul 21, 2025 -
[Bug]: After outputting the normal content, keep outputting content= '', until finish_reason='length'.
#21181 closed
Jul 21, 2025 -
[CI Failure]: Classification test failure for Qwen2.5-1.5B-apeach model in half precision
#21277 closed
Jul 21, 2025 -
Newcomer getting started, looking for guidance
#11223 closed
Jul 21, 2025 -
[RFC]: layer-wise kv cache offloading to enable larger batches
#15123 closed
Jul 21, 2025 -
[Bug]: Quantization does not lead to Throughput Speedup (Please Help)
#21236 closed
Jul 20, 2025 -
[Bug]: TypeError: RayGaugeWrapper.__init__() got an unexpected keyword argument
#20954 closed
Jul 20, 2025 -
Llama3.2 Vision Model: Guides and Issues
#8826 closed
Jul 20, 2025 -
[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management
#10086 closed
Jul 20, 2025 -
[Bug]: Multi-Node Online Inference on TPUs Failing
#12179 closed
Jul 20, 2025 -
[Feature]: Disaggregated Prefill on multi-node & multi-gpu
#13004 closed
Jul 20, 2025 -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 closed
Jul 20, 2025 -
[Usage]: Distributed inference not supported with OpenVINO?
#14933 closed
Jul 20, 2025 -
[Performance]: only 0.4 tokens/s when running 2 or more request
#15018 closed
Jul 20, 2025 -
[Bug]: Capture CudaGraph with LoRA
#15090 closed
Jul 20, 2025 -
[Bug]: flash_attn_with_kvcache kernel, an illegal memory access
#15113 closed
Jul 20, 2025 -
[Performance]: online batch inference faster than offline batch inference
#15178 closed
Jul 20, 2025 -
[Usage]: VLLM 0.7.3 with tensor parallelism outputs only exclamation marks when using multiple GPUs
#15194 closed
Jul 20, 2025 -
[Feature]: Does vLLM support dialog prefix continuation?
#15198 closed
Jul 20, 2025 -
[Misc][Help]: Adding support for a Custom model with External MoE Routing
#15214 closed
Jul 20, 2025 -
[Usage]: How to properly use vllm when serving - KeyError 'text'
#15219 closed
Jul 20, 2025 -
[Bug]: `VLLM_USE_V1` + `TORCH_SDPA` regression in v0.8
#15251 closed
Jul 20, 2025 -
[Performance]: V0 and V1 give the same throughput number
#15253 closed
Jul 20, 2025 -
[Bug]: --tensor-parallel-size Error
#15255 closed
Jul 20, 2025 -
[Bug]: Inconsistent Output Based on Presence of chat_template Parameter
#15272 closed
Jul 20, 2025 -
[Bug]: vLLM declares itself healthy before it can serve requests
#15313 closed
Jul 20, 2025 -
[Feature]: looking into adding a generation algorithm
#15315 closed
Jul 20, 2025 -
[Bug]: ImportError: vllm/_C.abi3.so: undefined symbol _ZN3c106ivalue14ConstantString6createENSt7
#21226 closed
Jul 19, 2025 -
[Bug]: PD does not work with ray distributed backend
#21070 closed
Jul 19, 2025 -
[Bug]: Middleware crashes vLLM on startup w/latest commit
#21219 closed
Jul 19, 2025 -
[Bug]: RGB inverted in offline example?
#21053 closed
Jul 19, 2025 -
[Usage]: How to do expert parallel on MoE model?
#21054 closed
Jul 19, 2025 -
[Bug]: error: Segmentation fault(SIGSEGV received at time)
#6918 closed
Jul 19, 2025 -
[Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,)
#8432 closed
Jul 19, 2025 -
[Usage]: Can I get the loss of model directly?
#9750 closed
Jul 19, 2025 -
[Feature]: Support for priority preemption with chunked-prefill
#10101 closed
Jul 19, 2025 -
[Bug]: vLLM CPU mode broken Unable to get JIT kernel for brgemm
#10478 closed
Jul 19, 2025 -
[Bug]: terminate called after throwing an instance of 'std::system_error' what(): Operation not permitted
#14416 closed
Jul 19, 2025 -
[Bug]: 0.8.0(V1) crash on NCCL when load MoE model on 16 GPUs(H20)
#15098 closed
Jul 19, 2025
95 Issues opened by 83 people
-
[Bug]: v0.10.0 built with early version of pytorch that does not support sm-120
#21633 opened
Jul 25, 2025 -
[Bug]: Very Low Prompt Evaluation Speed
#21616 opened
Jul 25, 2025 -
[Bug]: Can't set limit-mm-per-prompt in 0.10.0
#21615 opened
Jul 25, 2025 -
[Bug]:
#21609 opened
Jul 25, 2025 -
[Bug]: Enabling EPLB leads to inconsistent inference results
#21606 opened
Jul 25, 2025 -
[Feature]: Support fairness heuristics for the batched requests
#21605 opened
Jul 25, 2025 -
[Feature]: Support multiple guided decoding settings
#21604 opened
Jul 25, 2025 -
[Bug]: DeepEP with Qwen3-Coder Fails
#21603 opened
Jul 25, 2025 -
[Feature]: Support CPU on ray
#21602 opened
Jul 25, 2025 -
[Usage]: How do you use benchmark_serving on VLM?
#21596 opened
Jul 25, 2025 -
[Feature]: Support GteNewModelForSequenceClassification
#21595 opened
Jul 25, 2025 -
[Bug]: The current scheduling logic has a bug: when a scheduled request is evicted, ...
#21594 opened
Jul 25, 2025 -
[Bug]: MistralTokenizer is missing batch_decode, breaks /detokenize in OpenAI server
#21593 opened
Jul 25, 2025 -
[Bug]: [P/D] P/d is incompatible with spec decoding
#21583 opened
Jul 25, 2025 -
[Bug]: Incorrect Answer with Llama-Scout-Fp8 and PPLX
#21581 opened
Jul 25, 2025 -
[Feature]: [P/D] NIXL Connector Error Handling
#21577 opened
Jul 25, 2025 -
[Bug]: [P/D] NIXLConnector does not support P TP > D TP
#21576 opened
Jul 25, 2025 -
[Bug]: vLLM ranking is biased towards short texts, giving high scores to irrelevant short texts
#21574 opened
Jul 25, 2025 -
[Usage]: Qwen tool_call response type problem
#21571 opened
Jul 25, 2025 -
[Bug]: [P/D] in nixl_connector, the P node implements a request timeout but the D node cannot detect it.
#21570 opened
Jul 25, 2025 -
[Usage]: Qwen3-Coder-480B-A35B-Instruct deploy hang up
#21568 opened
Jul 25, 2025 -
[Bug]: Qwen3 failed to get function with stream and named function calling when thinking is disabled
#21565 opened
Jul 25, 2025 -
[Usage]: Disable the FlashInfer 0.2.3+ does not support per-request generators warning
#21563 opened
Jul 25, 2025 -
[Bug]: tensorizer example failed
#21547 opened
Jul 24, 2025 -
[Bug]: Beam search implementation disables logit processor functionality
#21546 opened
Jul 24, 2025 -
[Feature]: Zero copy for direct GPU model loading
#21545 opened
Jul 24, 2025 -
[Bug]: Hermes tool call parser fails with "Error trying to handle streaming tool call"
#21544 opened
Jul 24, 2025 -
[Bug]: Failing to initialize engine on qwen3 on B200 with VLLM_USE_DEEP_GEMM=1
#21542 opened
Jul 24, 2025 -
[Bug]: Incorrect Generation for Qwen2.5-VL-7B-Instruct in Batch Mode
#21529 opened
Jul 24, 2025 -
[Bug]: Qwen3-30B-A3B distributed Inference hang when set tp 2 pp 1 on two H100 node
#21524 opened
Jul 24, 2025 -
[Bug]: run glm4.1v, ValueError: Attempted to assign 100 = 100 multimodal tokens to 30000 placeholders
#21516 opened
Jul 24, 2025 -
[CI Failure]: Transformers Nightly Models Test
#21515 opened
Jul 24, 2025 -
[CI Failure]: Multi-model Models Test (Extended 1)
#21514 opened
Jul 24, 2025 -
[Bug]: Aborted without reason,timeout after three retries
#21513 opened
Jul 24, 2025 -
[CI Failure]: Multi-model Models Test (Extended 2)
#21512 opened
Jul 24, 2025 -
[Feature]: Qwen3 Models GGUF Support
#21511 opened
Jul 24, 2025 -
[Feature]: Full cudagraph support for MLA attention backend with DeepSeek MTP(Speculative decode)
#21505 opened
Jul 24, 2025 -
[RFC] [ROCm] [AITER]: Propose a `_aiter_ops` class like `_custom_ops` and `_ipex_ops`
#21504 opened
Jul 24, 2025 -
[Bug]: Tensor parallelism on sm_120 (rtx 5090) is broken on latest docker (0.9.2)?
#21491 opened
Jul 24, 2025 -
[Feature]: torch >2.7.0 support
#21488 opened
Jul 24, 2025 -
[Feature]: Multiple models one server
#21481 opened
Jul 23, 2025 -
[RFC]: vLLM vs HuggingFace numerical parity report
#21475 opened
Jul 23, 2025 -
[Bug]: Incorrect output when using LoRA modules with tensor parallelism in vLLM
#21471 opened
Jul 23, 2025 -
[RFC]: Shorten all of the CI by reducing `cudagraph_capture_sizes` for most of the unit tests
#21469 opened
Jul 23, 2025 -
[Bug]: FP8 model crashes with EngineDeadError and CUDA illegal memory access on H100 (CUDA 12.8)
#21466 opened
Jul 23, 2025 -
[Bug]:
#21454 opened
Jul 23, 2025 -
[Feature]: sm_120 support
#21453 opened
Jul 23, 2025 -
[Usage]: How can a vLLM cluster support deploying multiple large models?
#21451 opened
Jul 23, 2025 -
[Doc]: Undocumented required option "method" in speculative config when loading an eagle3 model
#21447 opened
Jul 23, 2025 -
[Bug]:
#21442 opened
Jul 23, 2025 -
[Bug]: ParallelHead has no attribute 'params_dtypes'
#21439 opened
Jul 23, 2025 -
[Bug]: vLLM Multinode Pipeline Error with pipeline parallelism using Ray
#21438 opened
Jul 23, 2025 -
[Bug]: Performance issue with NPS-4 configuration with respect to NPS-1 configuration
#21436 opened
Jul 23, 2025 -
[Usage]: How to reproduce the results of `vllm` using `transformers`
#21433 opened
Jul 23, 2025 -
[Feature]: add a arg to modify process_name
#21423 opened
Jul 23, 2025 -
[Bug]: auto_tune.sh profiling attempts are hanging (i.e., "benchmarking_serving.py --profile" is failing)
#21403 opened
Jul 22, 2025 -
[Bug]: Tool call argument value of type `integer` may break things when `stream=True`
#21372 opened
Jul 22, 2025 -
[Bug]: 'FusedMoE' object has no attribute 'kv_cache' when running a 1P1D test with PowerMoE-3b
#21359 opened
Jul 22, 2025 -
[Performance]: KV Cache Size Comparison vLLM vs SGLang
#21348 opened
Jul 22, 2025 -
[Usage]: Prefill node crashed when P/D Disaggregated Serving with MooncakeStore for Qwen3MOE
#21346 opened
Jul 22, 2025 -
[Bug]: Large image requests silently dropped with Llama-Guard-4
#21344 opened
Jul 22, 2025 -
[Bug]: vllm crashes using Eight RTX 3090s
#21339 opened
Jul 21, 2025 -
[Bug]: vLLM crashes when using --enable-sleep-mode with Blackwell PRO 6000 GPUs
#21336 opened
Jul 21, 2025 -
[Bug]: dsv3 generates all 0s output
#21326 opened
Jul 21, 2025 -
[Feature]: Hybrid Cloud Model Serving
#21323 opened
Jul 21, 2025 -
[Feature]: Support Anthropic API `/v1/messages` endpoint
#21313 opened
Jul 21, 2025 -
[Bug]:
#21310 opened
Jul 21, 2025 -
[Bug]: ROCm NotImplementedError: Speculative decoding is not yet supported on vLLM V1
#21308 opened
Jul 21, 2025 -
[Bug]: qwen tool bug
#21307 opened
Jul 21, 2025 -
[Bug]: all2all communication hangs when using DeepEP and PPLX for v0.9.2
#21306 opened
Jul 21, 2025 -
[Bug]: Mistral Tool Parser Crashes with Empty JSONDecodeError for Mistral Small 3.2 24B FP8 Instruct
#21303 opened
Jul 21, 2025 -
[Bug]: Hermes tool parser returns invalid arguments
#21301 opened
Jul 21, 2025 -
[Usage]: How to test throughput of 2:4 sparse model?
#21298 opened
Jul 21, 2025 -
[Usage]: how to execute benchmark_serving.py with an apikey?
#21297 opened
Jul 21, 2025 -
[Performance]: how to test model performance with apikey using benchmark_serving.py?
#21295 opened
Jul 21, 2025 -
[Bug]: OpenReasoning-Nemotron-32B Only Outputs Exclamation Marks Regardless of Input
#21292 opened
Jul 21, 2025 -
[Performance]: Speculative decoding doesn't seem to speed up inference?
#21278 opened
Jul 21, 2025 -
[Bug]: nvfp4 support on sm120
#21274 opened
Jul 21, 2025 -
[Feature]: support for NVIDIA RTX 5070Ti graphics card and Windows 11 system
#21272 opened
Jul 21, 2025 -
[Bug]: Endless Generation near Context Window with Eagle3/Spec Dec
#21269 opened
Jul 20, 2025 -
[Bug]: GPTQ w4a16 Quantization slower than FP16 (Please Help)
#21266 opened
Jul 20, 2025 -
[Feature]: Raise proper HTTP error with details for multimodal input url fetch error
#21254 opened
Jul 20, 2025 -
[Usage]: Abnormal LoRA kernel performance
#21250 opened
Jul 20, 2025 -
[Performance]: Move hash_request_tokens computation from input request threads
#21247 opened
Jul 20, 2025 -
[Bug]: tensor parallelism inference doesn't run on Nvidia Blackwell 5070ti
#21239 opened
Jul 20, 2025 -
[Bug]: vLLM stops inference
#21237 opened
Jul 20, 2025 -
[Bug]: 100% cpu usage on 3 cores on every node when using ray distributed pipeline parallel
#21231 opened
Jul 19, 2025 -
[Feature]: Support xformers on ARM GPU machines including GB200.
#21207 opened
Jul 18, 2025 -
[Feature]: Consolidate benchmark_serving.py and serve.py to avoid code duplication and usage confusions
#21206 opened
Jul 18, 2025 -
[Bug]: Guidance decoding broken for Granite 3.3 and hangs server
#21204 opened
Jul 18, 2025
380 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Model] Auto resolve default_pooling_type & Optimize prefix caching enable verify logic.
#20930 commented on
Jul 24, 2025 • 28 new comments -
[Core] Allow full cudagraph with separate attention routines and orthogonal to compilation, add support for FA2 and FlashInfer
#20059 commented on
Jul 25, 2025 • 28 new comments -
LFM2
#20797 commented on
Jul 25, 2025 • 25 new comments -
Add an optimization doc on TPU
#21155 commented on
Jul 25, 2025 • 20 new comments -
[V1] Logits processors extensibility
#19912 commented on
Jul 25, 2025 • 14 new comments -
[Feature] limit thinking tokens
#20859 commented on
Jul 25, 2025 • 12 new comments -
security policy: take 1
#21119 commented on
Jul 22, 2025 • 10 new comments -
[Model] Add support for Jina Embeddings V4
#20802 commented on
Jul 21, 2025 • 10 new comments -
[1/N] Refactor platform API to reduce `torch.cuda` call
#20751 commented on
Jul 25, 2025 • 10 new comments -
[Feature] use --ep_config to set eplb param
#20562 commented on
Jul 25, 2025 • 9 new comments -
v1: Add Whisper model support (encoder-decoder)
#21088 commented on
Jul 25, 2025 • 8 new comments -
[VLM] Support HF format Phi-4-MM model
#17121 commented on
Jul 23, 2025 • 7 new comments -
[Feature][EPLB] Add eplb support for Qwen3
#20815 commented on
Jul 25, 2025 • 6 new comments -
[Model] Ultravox: Support Llama 4 and Gemma 3 backends
#17818 commented on
Jul 25, 2025 • 5 new comments -
[Bugfix] Mistral tool parser streaming update
#19425 commented on
Jul 25, 2025 • 5 new comments -
v1: Add Request.block_hashes
#19728 commented on
Jul 25, 2025 • 5 new comments -
[Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4
#21166 commented on
Jul 23, 2025 • 4 new comments -
[Model] Pooling model activation supports per request control by PoolingParams
#20538 commented on
Jul 25, 2025 • 4 new comments -
[Model] Gemma3n MM
#20495 commented on
Jul 24, 2025 • 4 new comments -
[Model] Support TP/PP/mamba2 kernel for PLaMo2
#19674 commented on
Jul 25, 2025 • 4 new comments -
Add add_logger API to AsyncLLM
#20952 commented on
Jul 23, 2025 • 3 new comments -
[Attention][DBO] Add support for "splitting" the CommonAttentionMetadata
#21153 commented on
Jul 25, 2025 • 3 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Jul 25, 2025 • 2 new comments -
[BugFix] fix: aot passes kvcache dtype information
#19750 commented on
Jul 24, 2025 • 2 new comments -
[Frontend] Add chunked processing to handle long inputs in embedding models
#20837 commented on
Jul 25, 2025 • 2 new comments -
[Feature] Add async tensor parallelism for scaled mm
#20155 commented on
Jul 25, 2025 • 2 new comments -
Add tree attention backend for v1 (part 1)
#20401 commented on
Jul 24, 2025 • 2 new comments -
ci: Add CUDA + arm64 release builds
#21201 commented on
Jul 19, 2025 • 2 new comments -
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels
#17112 commented on
Jul 25, 2025 • 2 new comments -
[Feature][EPLB] Add support for unquantized models
#21168 commented on
Jul 21, 2025 • 2 new comments -
[Feature] Support multiple api keys in server
#18548 commented on
Jul 25, 2025 • 2 new comments -
[Misc] allow pulling vllm in Ray runtime environment
#21143 commented on
Jul 23, 2025 • 2 new comments -
[Misc] change default request logging behavior to off
#21135 commented on
Jul 23, 2025 • 2 new comments -
[Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuning
#20396 commented on
Jul 24, 2025 • 1 new comment -
Resolve the torch nightly sync issue
#20393 commented on
Jul 23, 2025 • 1 new comment -
[Bugfix] V1 Fix the cursor leakage issue during request scheduling.
#21173 commented on
Jul 25, 2025 • 1 new comment -
Implement structural_tag and json_schema for non-chat completion
#21150 commented on
Jul 22, 2025 • 1 new comment -
[Bugfix] Fix the bug in Hermes streaming parsing
#20824 commented on
Jul 25, 2025 • 1 new comment -
Enable multi-image support benchmarking for serving
#21145 commented on
Jul 22, 2025 • 1 new comment -
[Nvidia] Integrate cudnn prefill paged attention kernel for head_dim == 128 models, like Llama family
#20850 commented on
Jul 25, 2025 • 1 new comment -
[Doc] Add multi-modal development example for encoder-decoder models
#15405 commented on
Jul 24, 2025 • 0 new comments -
[ROCm][AMD] Enable ROCm Flash Attention Backend for Encoder-Decoder Models
#14803 commented on
Jul 21, 2025 • 0 new comments -
[Feature] Memory interleaving (#14680)
#14690 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Importing DeepSpeed causes crash in vLLM when running with data parallelism and TP=1
#17079 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Add support for multi-lora and single lora for classification tasks
#19623 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: add DoRA support
#10849 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Add EP/DP/PD deps in docker image
#19653 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] set correct lora mapping when compute prompt logprobs
#16694 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: ValueError: The output_size of gate's and up's weight = 192 is not divisible by weight quantization block_n = 128.
#17569 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] Move current_platform import to avoid python import cache.
#16601 commented on
Jul 19, 2025 • 0 new comments -
[Misc] Improve cli help show
#15455 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: topk=1 and temperature=0 cause different output in vllm
#5404 commented on
Jul 25, 2025 • 0 new comments -
Reshape cache flash kernel to support HND layout
#8200 commented on
Jul 24, 2025 • 0 new comments -
[Model] LoRA with lm_head and embed_tokens fully trained - 4
#11714 commented on
Jul 22, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Jul 23, 2025 • 0 new comments -
First working PoC for bge-m3 sparse embeddings
#14526 commented on
Jul 24, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Jul 22, 2025 • 0 new comments -
[Misc]fix demo function call JSONDecodeError
#16595 commented on
Jul 25, 2025 • 0 new comments -
[Doc] update docs for nightly benchmarks
#12022 commented on
Jul 23, 2025 • 0 new comments -
[Misc] improve chat_with_tools example
#16044 commented on
Jul 25, 2025 • 0 new comments -
DeepGemm MoE expert map support
#15957 commented on
Jul 24, 2025 • 0 new comments -
[Bugfix][Frontend] Fix pythonic tool parser failure with negative numbers
#15462 commented on
Jul 25, 2025 • 0 new comments -
[Frontend] Pythonic tool names flexibility (#14470)
#14474 commented on
Jul 25, 2025 • 0 new comments -
[Core] Make disaggregated prefill compatible with pipeline parallelism
#12301 commented on
Jul 23, 2025 • 0 new comments -
Adding Share Expert Fusion for DeepSeek
#15502 commented on
Jul 20, 2025 • 0 new comments -
[Frontend] fix streaming tool output lose 2 token bug #15545
#15546 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix][Frontend] Strip empty tool calls from incoming chat conversations
#14054 commented on
Jul 25, 2025 • 0 new comments -
Fixed Stream set to True, client stream receiving arguments, concatenated json string, missing curly braces end
#15930 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix][Spec Decode][V0] fix: update logits processor for MQA scoring
#12537 commented on
Jul 21, 2025 • 0 new comments -
[Misc] Disable pin_memory in AsyncMetricsCollector for spec decode tensor allocation
#15886 commented on
Jul 23, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: xgrammar doesn't support enums, but vllm isn't falling back to outlines
#15762 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: gemma 3 structured output api occurs assertion error
#15766 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: xgrammar==0.17 not work when guided
#15790 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Models converted to GGUF don't seem to be able to do tool calling
#16195 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Disable unicode characters in structured decoding
#16363 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Qwen2.5 assistant output on tool call is empty
#16430 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Native Tool Call for Gemma 3
#16482 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: VLLM>0.8 also met No platform detected, vLLM is running on UnspecifiedPlatform
#16724 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Ngram speculative decoding doesn't work in vLLM 0.8.3/0.8.4 with VLLM_USE_V1 enabled.
#16883 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: guided_grammar example syntax does not work
#16911 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: ```image_grid_thw``` not set in ```CachedRequestState``` - ```Qwen2.5 VL 3B```
#17007 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: All Ops should be determined during init and wrapped in a Layer Module to avoid envs.ENVIRON overhead
#17067 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: Expert parallelism in VLLM - do you do local dropping on sub-batch of token activations before going through gating layer to make each rank possess unique sub-batch of data?
#17087 commented on
Jul 25, 2025 • 0 new comments -
Tool call arguments parsing failed
#17089 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: [0.7.2+] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
#17098 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Tool calls data comes in content field after text chunks
#17109 commented on
Jul 25, 2025 • 0 new comments -
[Installation]: vllm/vllm-tpu image doesn't have :latest tag
#17114 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Why does torch.cuda.memory_allocated() remain unchanged after calling sleep()?
#17117 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: jinja2 TemplateError should return 422 instead of 500 error code
#17119 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Automatically detect numerical issues
#17123 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: waiting reqs vanish!
#17147 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: DeepSeek Lora inference has no effect.
#17155 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: How to configure the server parameters for THUDM/GLM-4-32B-0414 to support Function call using vllm-0.8.4?
#16771 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: In-place weights loading and model swapping
#19886 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Prompt Embedding returns 500 internal error for Qwen 2.5 VL model
#20757 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: PD demo example failed to run benchmark
#20477 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Error during transcription: Received a CachedWhisperTokenizerFast for argument tokenizer, but a WhisperTokenizer was expected.
#19538 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: How to log stats when using AsyncLLM locally (not based on the OpenAI API)
#18948 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: scheduling policy optimization in vLLM
#16969 commented on
Jul 25, 2025 • 0 new comments -
[Roadmap] vLLM Release/CI/Performance Benchmark Q2 2025
#16284 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: The mixed precision model lacks kernel image in the Blackwell architecture (version: 0.9.2 + cu12.8 + RTX5060)
#20605 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Engine Core initialization failed. See root cause above
#17618 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: Native support for Mamba, SSM, and hybrid transformer models in vLLM V1
#17140 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: vllm, EngineCore encountered a fatal error TimeoutError
#19668 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Support EPLB for More MoE Models, e.g. Qwen 3, Llama 4
#20468 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash
#18567 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Qwen2_5_VLForEmbedding
#13373 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: vLLM and on-the-fly tool calling
#13497 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: add tool calling support for DeepSeek-R1-Distill-Qwen-32B
#13700 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: pythonic tool parser only accepts alphabetical tool names
#14470 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: vLLM response on tool_calls does not align with OpenAI standard
#14951 commented on
Jul 25, 2025 • 0 new comments -
[Feature]: Add support for reusable subschemas in tool requests (PydanticAI)
#15035 commented on
Jul 25, 2025 • 0 new comments -
[Model] Reasoning Parser for Nemotron Models
#21041 commented on
Jul 21, 2025 • 0 new comments -
Fix minor docs issues and fix metric requests
#21040 commented on
Jul 25, 2025 • 0 new comments -
Enable sequence parallelism for full cuda graph without specifying compile sizes
#21031 commented on
Jul 23, 2025 • 0 new comments -
[Not for merge] Unshift eagle prefill
#21008 commented on
Jul 25, 2025 • 0 new comments -
fix(completion): always include usage
#20983 commented on
Jul 24, 2025 • 0 new comments -
[V0 deprecation] Removal of V0 structured outputs
#20928 commented on
Jul 21, 2025 • 0 new comments -
[Bugfix] Support for getting the exact memory value when in a container
#20917 commented on
Jul 22, 2025 • 0 new comments -
[Frontend] Raise an extremely dangerous warning when using VLLM_ALLOW_LONG_MAX_MODEL_LEN
#20904 commented on
Jul 22, 2025 • 0 new comments -
[Frontend] OpenAI Responses API supports Tool/Function calling
#20874 commented on
Jul 24, 2025 • 0 new comments -
[Bugfix] Fix multi LoRAs with tp >= 2 and LRU cache
#20873 commented on
Jul 20, 2025 • 0 new comments -
Allow serving Llama4ForCausalLM directly
#20868 commented on
Jul 21, 2025 • 0 new comments -
[compile][startup] Disable C++ compilation of symbolic shapes
#20836 commented on
Jul 22, 2025 • 0 new comments -
[Meta] Official Eagle mm support, first enablement on llama4
#20788 commented on
Jul 25, 2025 • 0 new comments -
[Feature] Add support for MoE models in the calibration-free RTN-based quantization
#20766 commented on
Jul 25, 2025 • 0 new comments -
[PERF] Symmetric memory allreduce
#20759 commented on
Jul 25, 2025 • 0 new comments -
feat: Add --enable-log-outputs flag for logging model generations
#20707 commented on
Jul 24, 2025 • 0 new comments -
[Bugfix] Fix grafana's model_name list showing other values
#20677 commented on
Jul 22, 2025 • 0 new comments -
PrefixRepetitionRandomDataset
#20638 commented on
Jul 19, 2025 • 0 new comments -
v1: Support KV events from connectors
#19737 commented on
Jul 23, 2025 • 0 new comments -
[Compilation fix] add stubs to allow compilation without sm100
#21198 commented on
Jul 22, 2025 • 0 new comments -
[Kernel] Enable Hybrid Model Support in Triton Unified Attention Kernel
#21197 commented on
Jul 22, 2025 • 0 new comments -
[W.I.P]: add Lmcache metrics
#21189 commented on
Jul 20, 2025 • 0 new comments -
Some initial Vulkan boilerplate
#21184 commented on
Jul 18, 2025 • 0 new comments -
[Bugfix] Mistral crashes on tool with no description
#21167 commented on
Jul 25, 2025 • 0 new comments -
[LMCache][Example] Align the PYTHONHASHSEED for prefillers and decoders for KV chunks hashing
#21161 commented on
Jul 24, 2025 • 0 new comments -
[V0 deprecation] Deprecate V0 Neuron backend
#21159 commented on
Jul 21, 2025 • 0 new comments -
Remove xformers requirement for Mistral-format Pixtral and Mistral3
#21154 commented on
Jul 24, 2025 • 0 new comments -
[Perf] Using `mul` instead of `div` for int8 quant
#21136 commented on
Jul 24, 2025 • 0 new comments -
[V1] Large Block_size solution
#21123 commented on
Jul 21, 2025 • 0 new comments -
Add `fused_moe_gate` kernel and integrate to DeepSeek MoE layer
#21107 commented on
Jul 22, 2025 • 0 new comments -
[V1][Metrics][Frontend] Add support for custom stat loggers via CLI --stat-loggers
#21105 commented on
Jul 24, 2025 • 0 new comments -
[benchmark] add max-concurrency in result table
#21095 commented on
Jul 21, 2025 • 0 new comments -
[Model] Support deepseek with eagle
#21086 commented on
Jul 21, 2025 • 0 new comments -
[Draft][Kernel] Flashinfer MLA (trtllm-gen) decode kernel integration
#21078 commented on
Jul 22, 2025 • 0 new comments -
[Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhead
#21075 commented on
Jul 23, 2025 • 0 new comments -
fix: NIXL connector transfers partial block to pass full multi-modal context
#21074 commented on
Jul 25, 2025 • 0 new comments -
Add FlashInfer allreduce RMSNorm Quant fusion
#21069 commented on
Jul 25, 2025 • 0 new comments -
[Feature][EPLB] Add EPLB support for MiniMax-01
#21056 commented on
Jul 24, 2025 • 0 new comments -
[V1] Partial prefill skip for layers reusing shared KV cache
#19719 commented on
Jul 24, 2025 • 0 new comments -
Fixed power build by building numba from source
#19433 commented on
Jul 23, 2025 • 0 new comments -
[ROCm][FEAT] Integrate AITER gemm w8a8 ptpc
#19417 commented on
Jul 20, 2025 • 0 new comments -
[Bugfix] VLLM_V1 supports passing other compilation levels
#19340 commented on
Jul 24, 2025 • 0 new comments -
[Misc][Bugfix] specify docker registry to support podman
#19236 commented on
Jul 21, 2025 • 0 new comments -
[CI/Build] Add tool to build vllm-tpu wheel
#19165 commented on
Jul 23, 2025 • 0 new comments -
[Bugfix]: Fix DualChunkFlashAttention for short sequences
#19084 commented on
Jul 23, 2025 • 0 new comments -
[BugFix]: Hermes tool parser stream output error #19056
#19058 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] Improve JSON extraction in LlamaToolParser
#19024 commented on
Jul 25, 2025 • 0 new comments -
[V1] Support `LLM.apply_model`
#18465 commented on
Jul 23, 2025 • 0 new comments -
[Doc] update Contributing page's testing section
#18272 commented on
Jul 25, 2025 • 0 new comments -
[Bugfix] Fix Hermes tool call parser with streaming
#18220 commented on
Jul 25, 2025 • 0 new comments -
[Frontend] Add unix domain socket support
#18097 commented on
Jul 22, 2025 • 0 new comments -
[Misc] Remove duplicate division check between num_query_heads and num_kv_heads.
#18074 commented on
Jul 24, 2025 • 0 new comments -
[kernel] integrate permute/unpermute kernel into deepgemm moe
#17934 commented on
Jul 25, 2025 • 0 new comments -
[Kernel] Bf16 data type support for awq quantization
#17705 commented on
Jul 21, 2025 • 0 new comments -
[Misc] Raise ValueError for V1 during profiling when max_num_batched_tokens is too short
#16834 commented on
Jul 21, 2025 • 0 new comments -
[V1] Update default max_num_batched_tokens for V1 openai server
#16795 commented on
Jul 20, 2025 • 0 new comments -
[Core] feat: Add aging factor support to priority request queue for fairer scheduling
#20608 commented on
Jul 22, 2025 • 0 new comments -
feat: Add streaming support for Mistral v11 tool format
#20503 commented on
Jul 23, 2025 • 0 new comments -
[FEAT] [V1] [ROCm] Enable DeepSeek R1 MTP V1 ROCm
#20493 commented on
Jul 19, 2025 • 0 new comments -
[V1][Spec Decode][Feature] Spec decode with probs
#20459 commented on
Jul 18, 2025 • 0 new comments -
[Core] Shared memory based object store for Multimodal data caching and IPC
#20452 commented on
Jul 24, 2025 • 0 new comments -
Add experimental Dual-Batch Overlap mechanism to VLLM
#20448 commented on
Jul 25, 2025 • 0 new comments -
feat: Add support for speculators Eagle checkpoints
#20436 commented on
Jul 22, 2025 • 0 new comments -
[WIP][RC] Update PyTorch to 2.8.0
#20358 commented on
Jul 23, 2025 • 0 new comments -
[BugFix] [P/D] Handle lookahead token count edge-case with Eagle Spec Decoding and P/D
#20340 commented on
Jul 25, 2025 • 0 new comments -
[Hardware][RISC-V] Add RISC-V architecture cpu inference support
#20292 commented on
Jul 25, 2025 • 0 new comments -
[Benchmark] Add benchmark tool for multi turn conversations
#20267 commented on
Jul 22, 2025 • 0 new comments -
[Frontend] add previous context to whisper transcription over 30s audio
#20249 commented on
Jul 25, 2025 • 0 new comments -
[Core,Frontend,Doc] Trace v1 cuda start up with opentelemetry (vllm-project#19318)
#20229 commented on
Jul 25, 2025 • 0 new comments -
[CI/Build][Bugfix] Fix marlin kernel not built on 4090
#20219 commented on
Jul 25, 2025 • 0 new comments -
[Nixl] Heterogeneous TP support FlashInfer
#20189 commented on
Jul 25, 2025 • 0 new comments -
v1: Introduce LRU-based CPU offloading management
#20075 commented on
Jul 23, 2025 • 0 new comments -
Add support for encoder embedding models
#19988 commented on
Jul 25, 2025 • 0 new comments -
[Feature][Kernel] Blocked FP8 CUTLASS MoE for Hopper
#19983 commented on
Jul 25, 2025 • 0 new comments -
v1: Introduce an offloading component
#19848 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Support for Universal Assisted Generation
#16503 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Main-branch reasoning code reports an error during H100 inference
#16656 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: An error occurred when deploying DeepSeek-R1-Channel-INT8 on two A100 machines using lws
#16827 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: SharedStorageConnector only see first batch of tokens
#16928 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Why is the GPU memory usage after quantizing the model to int8 W8A8 with llmcompressor almost the same as before quantization?
#16959 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: The output of MathResponse is empty when running THUDM/GLM-Z1-32B-0414 with vLLM-0.8.4
#16967 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Performance degradation with increasing number of requests in long-running vLLM inference sessions
#16985 commented on
Jul 22, 2025 • 0 new comments -
[Usage]: multilora_inference with max_loras>1
#17003 commented on
Jul 22, 2025 • 0 new comments -
[Benchmark][V1][Spec Decode][EAGLE] Tracking benchmark for V1 EAGLE
#17812 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: After waking up a sleeping model in the OpenAI API server, the model generates gibberish output
#20627 commented on
Jul 21, 2025 • 0 new comments -
[Performance]: Opportunities to speed up BlockPool processing
#21141 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: When using gemma-3n in Apple Silicon I get a NotImplementedError
#20521 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: DSR1 with DEP OOM during initialization on 32xH100
#20441 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: Wrong context length for Qwen2.5-7B-Instruct?
#16757 commented on
Jul 21, 2025 • 0 new comments -
[RFC]: Response format extensions for structured outputs
#19097 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: When making a streaming request, the 9-digit integer in the function call result will be truncated to 6 digits
#21156 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Single-machine multi-GPU inference: huge gap in results between tensor-parallel-size and pipeline-parallel-size
#19136 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: KV cache wrongly reused for V1 PD disaggregation with multimodal input
#21175 commented on
Jul 21, 2025 • 0 new comments -
[Feature]: Add MXFP6 Quantization Format
#17837 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Not able to run vllm cpu using Dockerfile.cpu
#19845 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: I cannot compile vllm on RTX5090
#20345 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: wake up OOM (72B model in 8*A800(40G))
#13941 commented on
Jul 20, 2025 • 0 new comments -
[Feature]: Colocating multiple LLM engines in the same process with sleep mode.
#18975 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: nvidia/llama-nemoretriever-colembed-3b-v1
#20703 commented on
Jul 22, 2025 • 0 new comments -
[RFC]: Neuron Support for V1 Engine
#21082 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: vLLM does not support torch 2.7.1
#20566 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: When passing text prompt + image embedding as input, prefix cache usage is always 0%
#21016 commented on
Jul 22, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q3 2025
#20336 commented on
Jul 22, 2025 • 0 new comments -
[Usage]: Llama4 tool parser
#16214 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: Qwen3-Embedding-8B-GGUF
#19602 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: moonshotai/Kimi-Audio-7B-Instruct
#17234 commented on
Jul 22, 2025 • 0 new comments -
[Feature request] Output attention scores in vLLM
#3192 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Add support for attention score output
#11365 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
Jul 22, 2025 • 0 new comments -
[Performance]: phi 3.5 vision model consuming high CPU RAM and the process getting killed
#9190 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: Kimi-K2-Instruct
#20963 commented on
Jul 22, 2025 • 0 new comments -
[RFC]: EPLB Execution Optimization From pr 18343
#20805 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Ernie-4.5 vision support
#20732 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
Jul 22, 2025 • 0 new comments -
[New Model]: HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit
#21087 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: N-gram speculative decoding performs slower than Qwen3-32B-FP8 with vLLM 0.9.0.1
#19254 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Mistral tool parser & streaming: corrupt tool_calls completions
#17585 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: AsyncLLM sleep then wake_up produces meaningless outputs
#17103 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Can't use yarn rope config for long context in Qwen2 model
#10293 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Cast error details: Unable to cast 1024 to Tensor
#12771 commented on
Jul 22, 2025 • 0 new comments -
[Bug]: Can't serve on ray cluster although passing VLLM_HOST_IP
#13521 commented on
Jul 22, 2025 • 0 new comments -
[Feature]: Model download progress using tqdm
#21191 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: RuntimeError: query and key must have the same dtype when using Eagle3 speculative decoding with kv-cache-dtype fp8
#21177 commented on
Jul 20, 2025 • 0 new comments -
[New Model]: ByteDance-Seed/BAGEL-7B-MoT
#18793 commented on
Jul 20, 2025 • 0 new comments -
[Usage]: Does the model streamer support loading a model from a GCS bucket?
#12290 commented on
Jul 20, 2025 • 0 new comments -
[Feature]: gemma3 raises error
#14723 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: Embed model has additional dense module(dim=1792, but only 1024)
#15509 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: v1 engine error when using gemma-3 (v0 engine is okay)
#16643 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: InternVL3-78B OOM on 4 A100 40G in 0.8.4
#16749 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: Rocm Memory Access Fault.
#16840 commented on
Jul 20, 2025 • 0 new comments -
[New Model]: jinaai/jina-embeddings-v2-base-code
#16874 commented on
Jul 20, 2025 • 0 new comments -
[Usage]: Is it true that vllm doesn't support deepseek r1 yet with the v1 engine?
#16885 commented on
Jul 20, 2025 • 0 new comments -
[New Model]: Gemma 3n support
#18476 commented on
Jul 20, 2025 • 0 new comments -
[RFC]: Enabling Suffix Decoding, LSTM Speculator, Sequence Parallelism from Arctic Inference
#18037 commented on
Jul 19, 2025 • 0 new comments -
[Feature]: EXL3 support
#19896 commented on
Jul 19, 2025 • 0 new comments -
[Feature]: DRY Sampling
#8581 commented on
Jul 19, 2025 • 0 new comments -
[Bug]: Unable to use Qwen/Qwen2.5-Omni-7B with --mm-processor-kwargs
#20995 commented on
Jul 19, 2025 • 0 new comments -
[Bug]: qwen2_5vl: Internal Server Error when processing short video with vLLM 0.9.0 installed
#20313 commented on
Jul 19, 2025 • 0 new comments -
Should deepseek v3 also be updated? [examples/tool_chat_template_deepseekv3.jinja]
#21186 commented on
Jul 19, 2025 • 0 new comments -
[Bug]: Speculative decoding inconsistency for Qwen-Coder-32B
#10913 commented on
Jul 19, 2025 • 0 new comments -
[Feature]: Add Triton implementation of NVFP4 GEMM
#21014 commented on
Jul 18, 2025 • 0 new comments -
[Bug]: Requests that do not return results within 15 minutes are directly aborted, and then the request is added by vLLM again...
#20520 commented on
Jul 18, 2025 • 0 new comments -
[Bug]: Server hang with google/gemma-3-27b-it and structured decoding
#21148 commented on
Jul 18, 2025 • 0 new comments -
[Feature]: Add Support for Updating Lora Weights
#20149 commented on
Jul 18, 2025 • 0 new comments -
[Bug]: vllm serve Qwen2.5-VL-3B-Instruct run error
#21050 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Received a Qwen2VLImageProcessorFast for argument image_processor, but a Qwen2VLImageProcessor was expected
#20855 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: RuntimeError: Failed to apply Qwen2_5_VLProcessor on data={'text': '<|image_pad|>', 'images': [<PIL.Image.Image image mode=RGB size=332x27 at 0x7FA449949720>]} with kwargs={}
#21109 commented on
Jul 21, 2025 • 0 new comments -
[Feature]: support vision encoder quantization
#20729 commented on
Jul 21, 2025 • 0 new comments -
[RFC]: Better support for weight updating while waking up from sleep mode for RLHF
#15254 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: why no ray command in my docker image
#15284 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: TypeError: Unknown image model type: qwen2_5_omni for branch: qwen2_omni_public_v1
#15754 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Cannot load Qwen2.5-VL
#16429 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Request gets stuck when serving model with v1 engine
#16580 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: How to add a hook function
#16585 commented on
Jul 21, 2025 • 0 new comments -
[Feature]: Add support for AMD Strix/Strix Halo APU (gfx1150/gfx1151 RDNA 3.5)
#16621 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Using TP = 16 to serve deepseek-v3 on 2*H20 on a Ray cluster, got EngineCore exception
#16646 commented on
Jul 21, 2025 • 0 new comments -
[RFC]: KVBlocks and Metrics Publishing In Inference Frameworks
#16669 commented on
Jul 21, 2025 • 0 new comments -
[Usage]: Request scheduling when using LoRA
#16876 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: architecture of models not correctly recognized
#16905 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: OOM occurs with 128+128 at 256 concurrency, while 4K+4K at 256 concurrency is OK. DeepSeek-R1-awq benchmark test.
#16909 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Engine Compatibility Issue with vllm 0.8.4 Loading Qwen2.5-32B-AWQ: Abnormal Behavior of v1 Engine Under High Concurrency and Solutions
#16913 commented on
Jul 21, 2025 • 0 new comments -
[UI_Bug]: Content menu and icon spacing issue in UI
#16917 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Pooling model adapter removes the attributes expected by model init
#16932 commented on
Jul 21, 2025 • 0 new comments -
[Bug]: Phi-4-MM generates gibberish for large image input with v1 chunked prefill
#16934 commented on
Jul 21, 2025 • 0 new comments -
[Performance]: Why/How vLLM uses CPU memory?
#16947 commented on
Jul 21, 2025 • 0 new comments -
[Installation]: Deploy vLLM for CPU server using GGUF model on Kubernetes
#20587 commented on
Jul 20, 2025 • 0 new comments -
[Usage]: DeepSeek R1 on a 8xH200 node is too slow
#17035 commented on
Jul 20, 2025 • 0 new comments -
[Performance]: Quantized Model Inference
#17487 commented on
Jul 20, 2025 • 0 new comments -
[Bug]: Prefix caching ignores visual input, causing incorrect multimodal outputs under concurrency
#20261 commented on
Jul 24, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: `model_weights` generator invoked out of Model loading in EAGLE series models.
#21160 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Align the API with OAI's structured output
#7220 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Guided decoding is broken because tokenizers can't be pickled
#7557 commented on
Jul 24, 2025 • 0 new comments -
[Usage]: Confirm tool calling is not supported and this is the closest thing that can be done
#7912 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Persistent OutOfMemoryError error when using speculative decoding
#8073 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: guided generation is very slow in offline mode
#8313 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: vllm api server returns escaped unicode strings with guided backend 'outlines'
#8805 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Guided Decoding Schema Cache Store
#8902 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: Transformers 4.45.1 slows down `outlines` guided decoding
#9032 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Speculative decoding breaks guided decoding.
#9423 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Error with structured output inference after upgrade 0.6.2->0.6.3
#9462 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Support guided decoding with multistep decoding
#9893 commented on
Jul 24, 2025 • 0 new comments -
[Feature]: Llama3.3 tool calling support, or a generic and extensible llama tool calling support
#11799 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Error downloading the model when using Sonatype Nexus Repository.
#14993 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: deploy deepseek-r1-awq on 16 x 4090 48G, layer_kv_cache = torch.zeros(kv_cache_shape, [rank0]: RuntimeError: CUDA error: invalid argument
#15014 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Issue running mistralai/Magistral-Small-2506 on NVIDIA hardware
#21122 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Close feature gaps when using xgrammar for structured output
#12131 commented on
Jul 23, 2025 • 0 new comments -
[Installation]: Nightly builds not available in container registry
#19335 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Remove xformers requirement for Mistral-format Pixtral and Mistral3
#21062 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: IndexError: list index out of range on chunked prefill with speculative decoding
#20531 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Killing local vLLM worker processes in multiproc_worker_utils.py
#18577 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: XGrammar-based CFG decoding degraded after 0.6.5
#12122 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: GLM-Z1 produces garbled output with vLLM batch inference
#17157 commented on
Jul 25, 2025 • 0 new comments -
Error: kimi-vl: Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
#17162 commented on
Jul 25, 2025 • 0 new comments -
[Installation]: Bloated docker image size causes problems on k8s
#17163 commented on
Jul 25, 2025 • 0 new comments -
[Usage]: How to deploy tensorized vllm model (deserialize) as api_server?
#17178 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: vllm LLM utils.py resolve_obj_by_qualname ValueError: not enough values to unpack (expected 2, got 1)
#17188 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Jul 25, 2025 • 0 new comments -
[Bug]: Inconsistent Output: First API call differs from subsequent identical calls with temperature=0 on Qwen models
#17832 commented on
Jul 25, 2025 • 0 new comments -
[RFC]: Data Parallel Attention and Expert Parallel MoEs
#16037 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Compile inductor / CUDA Graph build before the memory profiling
#19480 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Inconsistent outputs with deterministic sampling (temperature=0) when serving Qwen3-32B with vllm-0.8.5
#17759 commented on
Jul 24, 2025 • 0 new comments -
[Usage]: V1 Engine with Qwen3 keeps on allocating memory for cuda graphs until OOM
#21172 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: some vllm routes can be reached without authorization
#18892 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Failed to start vLLM v1 with Ray. Encountered the following error: `KeyError: 'bundles'`
#19123 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: MoE models fail at startup: AttributeError: '_OpNamespace' '_moe_C' object has no attribute 'topk_softmax'
#18967 commented on
Jul 24, 2025 • 0 new comments -
[RFC]: KV cache offloading
#19854 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: v1 flash_attn and triton_attn backends don't have `get_state_cls`
#15630 commented on
Jul 24, 2025 • 0 new comments -
[Installation]: no version of pip install vllm works - Failed to initialize NumPy: No Module named 'numpy'
#11037 commented on
Jul 24, 2025 • 0 new comments -
[Usage]: Why is inference very slow when a large number of requests are queued?
#16444 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vllm, has anyone seen calls work normally at first (3-5 s per request) and then slow down to around 60 s per request after running for a while?
#13886 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: InternVL2_5-8B-AWQ has no any throughput benefit compared to the InternVL2_5-8B
#19195 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: qwen2-vl 7b, on vllm 0.8.1 & 0.8.2, sometimes (not deterministically but depends on data) I got: ValueError: Attempted to assign 702 = 702 multimodal tokens to 703 placeholders
#15764 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Subprocess health check / automatic restart for V1 EngineCore
#19849 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: 'MultiprocExecutor' object has no attribute 'workers'
#17756 commented on
Jul 24, 2025 • 0 new comments -
[Performance]: The GPU memory usage of vllm v0.9.2 is significantly higher than that of v0.9.1. Why is this? How can it be improved?
#21027 commented on
Jul 24, 2025 • 0 new comments -
[Bug]: Distilled DeepSeek Models do not work with guided_json
#12548 commented on
Jul 23, 2025 • 0 new comments -
[Installation]: Can't build arm container image with podman without a SELinux relabel of bind mounts
#12734 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Specific Docker Image for vllm["audio,video"]
#13940 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: support tool and reasoning together
#14429 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: hub.docker.com Please add arm docker image
#14656 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Can CPU inference be supported with a Ray cluster?
#15266 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Support LoRA adapter for whisper
#15370 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: guided_json not working correctly with (quantized) mistral-small model
#15577 commented on
Jul 23, 2025 • 0 new comments -
[Installation]: how to run swiftkv with vllm
#16109 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Qwen2.5 tool call failed
#16393 commented on
Jul 23, 2025 • 0 new comments -
[Installation]:
#16575 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: examples/offline_inference/chat_with_tools.py JSONDecodeError
#16594 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: CPU memory not released when waking up the vLLM instance
#16663 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Bug while using deepspeed with TRL with vLLM
#16867 commented on
Jul 23, 2025 • 0 new comments -
Qwen2.5 VL and gemma-3-12b error on VLLM 8.4
#16918 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Enable Partial Guided Decoding / Structured Output Support
#16979 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: When adding the parameter tensor_parallel_size, a TypeError occurred: BackendCompilerFailed.__init__() is missing one required positional argument: 'inner_exception'.
#17018 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: add hostname in metrics for clustering deployment
#17029 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: vllm 0.8.3 v1 engine has different computation performance per iteration when serving multi-lora with different chunk size
#17034 commented on
Jul 23, 2025 • 0 new comments -
[Usage]: I have 2 nodes 16 GPUs, how can i use 16 dp+16 ep to run deepseek v3?
#17041 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: noop elimination for slice errors when end = -1
#17078 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Inconsistent behavior of AsyncLLMEngine.abort between v0 and v1
#20362 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: TPU Embedding models support?
#20869 commented on
Jul 23, 2025 • 0 new comments -
[RFC]: Lazy CUDA Graph capture
#20098 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: ValueError: There is no module or parameter named 'mm_whisper_embeddings' in LlamaForCausalLM
#21176 commented on
Jul 23, 2025 • 0 new comments -
[ROCm]: There are too many opt-in ROCm-specific flags
#21138 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: vLLM hangs forever on waiting engine process to start
#17676 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: vllm crashes when preemption of priority scheduling is triggered on vllm-0.6.3.dev173+g36ea7907.d20241011
#9342 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Issue of Unstable Output for Identical Queries
#19403 commented on
Jul 23, 2025 • 0 new comments -
[RFC][FEATURE]: TTFT Routing
#20962 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Dynamic Chunked Pipeline Parallelism
#20808 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Phi-3-small-8k cannot be served for vllm >= 0.8.5
#18168 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: ValueError when using Multi-Instance GPU
#17047 commented on
Jul 23, 2025 • 0 new comments -
[RFC]: vLLM configuration refactoring and modularization
#18953 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: ray with nixl connector failed
#20980 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: PaliGemma2 not working with OpenAI Docker serve
#12052 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Qwen2.5vl vllm serve Engine process failed to start
#17372 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: FP8 Attention on H100 - CUDA error: an illegal memory access was encountered
#21110 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
#18455 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: Remove cupy dependency for multi-node Ray deployment
#19758 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
#10300 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Guided Decoding Broken in Streaming mode
#10376 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Speculative decoding + guided decoding not working
#10442 commented on
Jul 23, 2025 • 0 new comments -
[Feature]: load and save kv cache from disk
#10611 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: xgrammar crashes with speculative decoding
#11484 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Using "response_format": { "type": "json_object" } with /v1/chat/completions is terminating the engine
#11828 commented on
Jul 23, 2025 • 0 new comments -
[Bug]: Very slow guided decoding with Outlines backend since v0.6.5
#12005 commented on
Jul 23, 2025 • 0 new comments