Insights: vllm-project/vllm
Overview
1 Release published by 1 person
-
v0.8.2
published
Mar 23, 2025
155 Pull requests merged by 79 people
-
[Bugfix][TPU][V1] Fix recompilation
#15553 merged
Mar 27, 2025 -
[Doc] Use absolute placement for Ask AI button
#15628 merged
Mar 27, 2025 -
[Misc] Avoid direct access of global mm_registry in compute_encoder_budget
#15621 merged
Mar 27, 2025 -
[Feature] Add middleware to log API Server responses
#15593 merged
Mar 27, 2025 -
[Misc] Replace is_encoder_decoder_inputs with split_enc_dec_inputs
#15620 merged
Mar 27, 2025 -
[Doc] Link to onboarding tasks
#15629 merged
Mar 27, 2025 -
[Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1
#15211 merged
Mar 27, 2025 -
[Model] MiniCPM-V/O supports V1
#15487 merged
Mar 27, 2025 -
[Doc] update --system for transformers installation in docker doc
#15616 merged
Mar 27, 2025 -
Fix incorrect filenames in vllm_compile_cache.py
#15494 merged
Mar 27, 2025 -
[Misc] Use model_redirect to redirect the model name to a local folder.
#14116 merged
Mar 27, 2025 -
[Misc] Clean up scatter_patch_features
#15559 merged
Mar 27, 2025 -
[Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM
#15587 merged
Mar 27, 2025 -
[Misc] Consolidate LRUCache implementations
#15481 merged
Mar 27, 2025 -
[TPU] Avoid Triton Import
#15589 merged
Mar 27, 2025 -
[Misc] Restrict ray version dependency and update PP feature warning in V1
#15556 merged
Mar 27, 2025 -
[TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS
#15583 merged
Mar 27, 2025 -
[ROCm] Env variable to trigger custom PA
#15557 merged
Mar 27, 2025 -
Allow torchao quantization in SiglipMLP
#15575 merged
Mar 27, 2025 -
[V1] Refactor num_computed_tokens logic
#15307 merged
Mar 27, 2025 -
[moe][quant] add weight name case for offset
#15515 merged
Mar 27, 2025 -
[Doc] Update V1 user guide for fp8 kv cache support
#15585 merged
Mar 27, 2025 -
[misc] LoRA: Remove unused long context test data
#15558 merged
Mar 27, 2025 -
add platform check back
#15578 merged
Mar 27, 2025 -
Add automatic tpu label to mergify.yml
#15560 merged
Mar 27, 2025 -
[Kernel] CUTLASS grouped gemm fp8 MoE kernel
#13972 merged
Mar 27, 2025 -
Support FIPS enabled machines with MD5 hashing
#15299 merged
Mar 27, 2025 -
[TPU] support disabling xla compilation cache
#15567 merged
Mar 27, 2025 -
Use Cache Hinting for fused_moe kernel
#15511 merged
Mar 26, 2025 -
[V1] TPU CI - Fix test_compilation.py
#15570 merged
Mar 26, 2025 -
[V1] TPU - Revert to exponential padding by default
#15565 merged
Mar 26, 2025 -
Applying some fixes for K8s agents in CI
#15493 merged
Mar 26, 2025 -
Support SHA256 as hash function in prefix caching
#15297 merged
Mar 26, 2025 -
[V1][Sampler] Faster top-k only implementation
#15478 merged
Mar 26, 2025 -
[Refactor] Remove passthrough backend when generating grammar
#15317 merged
Mar 26, 2025 -
Fix weight loading for some models in Transformers backend
#15544 merged
Mar 26, 2025 -
multi-node offline DP+EP example
#15484 merged
Mar 26, 2025 -
[Model] Add Reasoning Parser for Granite Models
#14202 merged
Mar 26, 2025 -
Improve validation of TP in Transformers backend
#15540 merged
Mar 26, 2025 -
Apply torchfix
#15532 merged
Mar 26, 2025 -
Separate base model from TransformersModel
#15467 merged
Mar 26, 2025 -
[Misc] improve example script output
#15528 merged
Mar 26, 2025 -
[Misc] Enhance warning information to user-defined chat template
#15408 merged
Mar 26, 2025 -
[FEAT][ROCm] Integrate Fused MoE Kernels from AITER
#14967 merged
Mar 26, 2025 -
[Feature] Enhance EAGLE Architecture with Proper RMS Norms
#14990 merged
Mar 26, 2025 -
Fix raw_request extraction in load_aware_call decorator
#15382 merged
Mar 26, 2025 -
[misc] LoRA - Skip LoRA kernels when not required
#15152 merged
Mar 26, 2025 -
[BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results)
#15492 merged
Mar 26, 2025 -
[Misc] Warn about v0 in benchmark_paged_attn.py
#15495 merged
Mar 26, 2025 -
[Model] Support multi-image for Molmo
#15438 merged
Mar 26, 2025 -
Transformers backend already supports V1
#15463 merged
Mar 26, 2025 -
[CI/Build] LoRA: Delete long context tests
#15503 merged
Mar 26, 2025 -
[Core] LoRA: V1 Scheduler optimization
#15422 merged
Mar 25, 2025 -
[core] add bucket padding to tpu_model_runner
#14995 merged
Mar 25, 2025 -
[V1] Support long_prefill_token_threshold in v1 scheduler
#15419 merged
Mar 25, 2025 -
[V1][Minor] Use SchedulerInterface type for engine scheduler field
#15499 merged
Mar 25, 2025 -
[TPU][V1] Fix Sampler recompilation
#15309 merged
Mar 25, 2025 -
Add workaround for shared field_names in pydantic model class
#13925 merged
Mar 25, 2025 -
[bugfix] add supports_v1 platform interface
#15417 merged
Mar 25, 2025 -
[Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility)
#15471 merged
Mar 25, 2025 -
[CI/Build] Add tests for the V1 tpu_model_runner.
#14843 merged
Mar 25, 2025 -
[bugfix] fix inductor cache on max_position_embeddings
#15436 merged
Mar 25, 2025 -
[Kernel] Fix conflicting macro names for gguf kernels
#15456 merged
Mar 25, 2025 -
[Doc] Update V1 user guide for multi-modality
#15460 merged
Mar 25, 2025 -
[Misc] Remove redundant num_embeds
#15443 merged
Mar 25, 2025 -
[Misc] Clean up MiniCPM-V/O code
#15337 merged
Mar 25, 2025 -
Dockerfile.ppc64le changes to move to UBI
#15402 merged
Mar 25, 2025 -
[Kernel][CPU] CPU MLA
#14744 merged
Mar 25, 2025 -
[Hardware][TPU][Bugfix] Fix v1 mp profiler
#15409 merged
Mar 25, 2025 -
Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10
#15160 merged
Mar 25, 2025 -
[V1][Spec Decode] Update target_logits in place for rejection sampling
#15427 merged
Mar 25, 2025 -
[V1] guidance backend for structured output + auto fallback mode
#14779 merged
Mar 25, 2025 -
[Bugfix] Fixed the issue of not being able to input video and image simultaneously
#15387 merged
Mar 25, 2025 -
Revert "Fix non-contiguous input passed to Marlin kernel (#15319)"
#15398 merged
Mar 25, 2025 -
[Misc] Remove LoRA log
#15388 merged
Mar 25, 2025 -
Add pipeline parallel support to TransformersModel
#12832 merged
Mar 25, 2025 -
[Minor][Spec Decode] Remove compiled_softmax
#15416 merged
Mar 25, 2025 -
[V1][Spec Decode] Enable spec decode for top-p & top-k sampling
#15063 merged
Mar 25, 2025 -
[ROCm][Kernel] MoE weights padding
#14454 merged
Mar 24, 2025 -
[Build] Cython compilation support fix
#14296 merged
Mar 24, 2025 -
[Hardware][TPU] Skip failed compilation test
#15421 merged
Mar 24, 2025 -
[BugFix][V1] Quick fix for min_tokens with multiple EOS
#15407 merged
Mar 24, 2025 -
[V1][Perf] Simpler request output queues
#15156 merged
Mar 24, 2025 -
[Doc] Update docs on handling OOM
#15357 merged
Mar 24, 2025 -
[DOC] Add Kubernetes deployment guide with CPUs
#14865 merged
Mar 24, 2025 -
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral
#12303 merged
Mar 24, 2025 -
[V1] Aggregate chunked prompt logprobs in model runner
#14875 merged
Mar 24, 2025 -
[MISC] Refine no available block debug msg
#15076 merged
Mar 24, 2025 -
[V1][Minor] fix comments
#15392 merged
Mar 24, 2025 -
[Core] Don't force uppercase for VLLM_LOGGING_LEVEL
#15306 merged
Mar 24, 2025 -
[Core] Integrate fastsafetensors loader for loading model weights
#10647 merged
Mar 24, 2025 -
[distributed] fix dp group
#15355 merged
Mar 24, 2025 -
[Bugfix] Fix chat template loading
#15143 merged
Mar 24, 2025 -
Fix zmq IPv6 URL format error
#15341 merged
Mar 24, 2025 -
[Kernel] allow non-contiguous input for marlin kernel
#14658 merged
Mar 24, 2025 -
Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)"
#15377 merged
Mar 24, 2025 -
[Misc] Update guided decoding logs to debug
#15310 merged
Mar 24, 2025 -
[Bugfix][V1] Avoid importing PreTrainedModel
#15366 merged
Mar 24, 2025 -
[Misc] Remove ignore_reinit_error for ray.init()
#15373 merged
Mar 24, 2025 -
[Misc] Upgrade BNB version
#15183 merged
Mar 24, 2025 -
Fix non-contiguous input passed to Marlin kernel
#15319 merged
Mar 24, 2025 -
[Fix] [torch.compile] Improve UUID system for custom passes
#15249 merged
Mar 24, 2025 -
[V1] Enable V1 Fp8 cache for FA3 in the oracle
#15191 merged
Mar 23, 2025 -
[Misc][Doc] Add note regarding loading generation_config by default
#15281 merged
Mar 23, 2025 -
[Frontend] Support tool calling and reasoning parser
#14511 merged
Mar 23, 2025 -
[V1][Spec Decode] Use better defaults for N-gram
#15358 merged
Mar 23, 2025 -
[V1][Spec Decode] Respect prompt_lookup_max
#15348 merged
Mar 23, 2025 -
[Bugfix] fix torch.compiled cache hash error
#14953 merged
Mar 23, 2025 -
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20
#15322 merged
Mar 23, 2025 -
[ci/build] fix broken tests in LLM.collective_rpc
#15350 merged
Mar 23, 2025 -
[ci/build] update torch nightly version for GH200
#15135 merged
Mar 23, 2025 -
[V1][Usage] Refactor speculative decoding configuration and tests
#14434 merged
Mar 23, 2025 -
Fix v1 supported oracle for worker-cls and worker-extension-cls
#15324 merged
Mar 23, 2025 -
[doc] Add back previous news
#15331 merged
Mar 23, 2025 -
Remove openvino support in favor of external plugin
#15339 merged
Mar 22, 2025 -
[BugFix][Typing] Fix Imprecise Type Annotations
#15208 merged
Mar 22, 2025 -
[V1] Add disable-any-whitespace option support for xgrammar
#15316 merged
Mar 22, 2025 -
[Model] Support Tele-FLM Model
#15023 merged
Mar 22, 2025 -
[Bugfix] LoRA V0 - Fix case where max_num_seqs is between cudagraph capture sizes
#15308 merged
Mar 22, 2025 -
[Bugfix] Fix torch.compile raise FileNotFoundError
#15278 merged
Mar 22, 2025 -
[Doc] add load_format items in docs
#14804 merged
Mar 22, 2025 -
[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature
#14959 merged
Mar 22, 2025 -
[Bugfix][V0] Multi-sequence logprobs streaming edge case
#15259 merged
Mar 22, 2025 -
[Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout
#15301 merged
Mar 22, 2025 -
[Build/CI] Fix env var typo
#15305 merged
Mar 21, 2025 -
[TPU][V1] MHA Pallas backend
#15288 merged
Mar 21, 2025 -
Revert "[Feature] specify model in config.yaml (#14855)"
#15293 merged
Mar 21, 2025 -
[Bugfix][VLM] fix llava processor
#15285 merged
Mar 21, 2025 -
[v1] Refactor KVCacheConfig
#14079 merged
Mar 21, 2025 -
[Misc] Add cProfile helpers
#15074 merged
Mar 21, 2025 -
[Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend
#15282 merged
Mar 21, 2025 -
[V1] Fix wrong import path of get_flash_attn_version
#15280 merged
Mar 21, 2025 -
[Bugfix] Fix incorrect resolving order for transformers fallback
#15279 merged
Mar 21, 2025 -
[Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL
#15273 merged
Mar 21, 2025 -
[Bugfix] Add int8 torch dtype for KVCache
#15260 merged
Mar 21, 2025 -
[Feature] specify model in config.yaml
#14855 merged
Mar 21, 2025 -
[V1] Avoid redundant input processing in n>1 case
#14985 merged
Mar 21, 2025 -
[Doc] Update LWS docs
#15163 merged
Mar 21, 2025 -
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs
#14071 merged
Mar 21, 2025 -
[Hardware][TPU] Add check for no additional graph compilation during runtime
#14710 merged
Mar 21, 2025 -
Add an example for reproducibility
#15262 merged
Mar 21, 2025 -
[Misc] Better RayExecutor and multiprocessing compatibility
#14705 merged
Mar 21, 2025 -
[Docs] Trim the latest news in README
#15261 merged
Mar 21, 2025 -
[Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies
#14857 merged
Mar 21, 2025 -
[Bugfix] detect alibi and revert to FA2
#15231 merged
Mar 21, 2025 -
[V1][TPU] Speed up top-k on TPU by using torch.topk
#15242 merged
Mar 21, 2025 -
Mention extra_body as a way to pass vLLM-only parameters using the OpenAI client
#15240 merged
Mar 21, 2025 -
[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation
#15200 merged
Mar 21, 2025 -
[ROCM] Upgrade torch to 2.6
#15244 merged
Mar 21, 2025 -
[Misc] Clean up the BitsAndBytes arguments
#15140 merged
Mar 21, 2025 -
Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10
#15159 merged
Mar 21, 2025 -
[CI/Build] LoRA : make add_lora_test safer
#15181 merged
Mar 21, 2025 -
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface
#15250 merged
Mar 21, 2025 -
Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled.
#14617 merged
Mar 21, 2025 -
[V1] Add flag to disable cascade attention
#15243 merged
Mar 20, 2025
87 Pull requests opened by 73 people
-
[sleep mode] clear pytorch cache after sleep
#15248 opened
Mar 20, 2025 -
[Misc] fix collect_env version parse
#15267 opened
Mar 21, 2025 -
[V1] Scheduler Refactoring [2/N] - Introduce CommonSchedulerStates
#15271 opened
Mar 21, 2025 -
Add missed ray[data] dependence in cuda.txt
#15283 opened
Mar 21, 2025 -
[Model] Add Qwen3 and Qwen3MoE
#15289 opened
Mar 21, 2025 -
Fix Transformers backend compatibility check
#15290 opened
Mar 21, 2025 -
[Bugfix] utils: no bool(module) & pid may be None
#15292 opened
Mar 21, 2025 -
[V0][Bugfix] Fix Mamba cache crashing
#15296 opened
Mar 21, 2025 -
set UV_PYTHON_INSTALL_DIR to a world readable/executable location
#15302 opened
Mar 21, 2025 -
[Misc]add coding benchmark for speculative decoding
#15303 opened
Mar 21, 2025 -
[Misc] Enable V1 LoRA by default
#15320 opened
Mar 22, 2025 -
fix test_phi3v
#15321 opened
Mar 22, 2025 -
Fix DP group creation and compatibility with external_dp (#15176)
#15323 opened
Mar 22, 2025 -
unittests for `FullAttentionSpec` to test `use_mla` param
#15325 opened
Mar 22, 2025 -
[Bugfix] Handle `process_weights_after_loading` for `QKVCrossParallelLinear`
#15328 opened
Mar 22, 2025 -
[V1][Spec Decode] Eagle interface
#15334 opened
Mar 22, 2025 -
[P/D Disaggregation] `PDController` and `PDWorker` Prototype (1p1d)
#15343 opened
Mar 22, 2025 -
Vllm v1 eagle proposer
#15346 opened
Mar 23, 2025 -
[V1] Fully Transparent Implementation of CPU Offloading
#15354 opened
Mar 23, 2025 -
[V1][Spec Decode] Remove warning on N-gram
#15361 opened
Mar 23, 2025 -
[Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process
#15367 opened
Mar 23, 2025 -
[Bugfix] Fix regex compile display format
#15368 opened
Mar 24, 2025 -
[Model] Support Skywork-R1V
#15397 opened
Mar 24, 2025 -
[TPU][V1] Guided decoding on TPU
#15401 opened
Mar 24, 2025 -
[Doc] Add multi-modal development example for encoder-decoder models
#15405 opened
Mar 24, 2025 -
[ROCm][Bugfix] Bring back fallback to eager mode removed in #14917, but for ROCm only
#15413 opened
Mar 24, 2025 -
[V1] TPU CI - Add basic perf regression test
#15414 opened
Mar 24, 2025 -
[Bugfix]: Fix Prometheus spec decode counter sum-of-sums
#15415 opened
Mar 24, 2025 -
[Model] Reduce redundant computations in mamba2 blocks for Bamba-9B
#15423 opened
Mar 25, 2025 -
[Core] [Bugfix] Add Input Embeddings
#15428 opened
Mar 25, 2025 -
[CI] [1/N] Fix Distributed Tests
#15431 opened
Mar 25, 2025 -
[FEAT] [ROCm] Add AITER int8 scaled gemm kernel
#15433 opened
Mar 25, 2025 -
Added the option of returning hidden states
#15434 opened
Mar 25, 2025 -
[Draft] Aya Vision
#15441 opened
Mar 25, 2025 -
[V1] [Feature] Collective RPC
#15444 opened
Mar 25, 2025 -
[P/D Disaggregation] XpYd based on point-to-point communication
#15448 opened
Mar 25, 2025 -
[Misc] Improve cli help show
#15455 opened
Mar 25, 2025 -
[Metrics] Hide deprecated metrics
#15458 opened
Mar 25, 2025 -
[Bugfix][Frontend] Fix pythonic tool parser failure with negative numbers
#15462 opened
Mar 25, 2025 -
[V1][Spec Decode] Remove deprecated spec decode config params
#15466 opened
Mar 25, 2025 -
Enhance logits processor to add additional data
#15473 opened
Mar 25, 2025 -
[ not for land] For richard
#15474 opened
Mar 25, 2025 -
[Bugfix][Frontend] respect provided default guided decoding backend
#15476 opened
Mar 25, 2025 -
[WIP][Model] Refactor Phi-4-multimodal to use merged processor and support V1
#15477 opened
Mar 25, 2025 -
Quantized Custom Allreduce
#15479 opened
Mar 25, 2025 -
Different device CG support
#15482 opened
Mar 25, 2025 -
[Misc] simple_connector.py: more efficient use of GPU memory in send
#15485 opened
Mar 25, 2025 -
[V1] Fix json_object support with xgrammar
#15488 opened
Mar 25, 2025 -
[V1][TPU] Enable Top K
#15489 opened
Mar 25, 2025 -
[V1][Draft] Jump-forward decoding
#15490 opened
Mar 25, 2025 -
[core] Add tags parameter to wake_up()
#15500 opened
Mar 25, 2025 -
Adding Share Expert Fusion for DeepSeek
#15502 opened
Mar 25, 2025 -
Add inference_benchmark_script.sh
#15504 opened
Mar 25, 2025 -
[Model] Support Mistral3 in the HF Transformers format
#15505 opened
Mar 25, 2025 -
[BugFix] fix speculative decoding memory leak when speculation is disabled
#15506 opened
Mar 25, 2025 -
[Minor] QoL for Benchmarking
#15512 opened
Mar 26, 2025 -
[Bugfix]: The sequence becomes shorter after encoding and decoding
#15516 opened
Mar 26, 2025 -
Improve expert parallelism placement
#15517 opened
Mar 26, 2025 -
update dockerfile to add tzdata
#15522 opened
Mar 26, 2025 -
track http service error count
#15523 opened
Mar 26, 2025 -
WIP: [Frontend] add function calling strict
#15535 opened
Mar 26, 2025 -
[Bugfix] Fix missing return value in load_weights method of adapters.py
#15542 opened
Mar 26, 2025 -
[Frontend] fix streaming tool output lose 2 token bug #15545
#15546 opened
Mar 26, 2025 -
[Bugfix] Fix profile deadlock when ray backend and num-scheduler-steps > 1
#15548 opened
Mar 26, 2025 -
[Bugfix][Model] Add SupportsQuant interface to Mixtral
#15552 opened
Mar 26, 2025 -
[Bugfix] Fix Mllama interleaved images input support
#15564 opened
Mar 26, 2025 -
[SupportsQuant] Bert, Blip, Blip2, Bloom
#15573 opened
Mar 26, 2025 -
[Bugfix] Do not pad multi-modal encoder sequence dummy data
#15574 opened
Mar 26, 2025 -
[Misc] cli auto show default value
#15582 opened
Mar 26, 2025 -
[V1] Support disable_any_whitespace for guidance backend
#15584 opened
Mar 27, 2025 -
Re-enable the AMD Entrypoints Test (2025-03-27)
#15586 opened
Mar 27, 2025 -
[Frontend] update priority for --api-key and VLLM_API_KEY
#15588 opened
Mar 27, 2025 -
[XPU][Bugfix] fix _k_scale_float/_v_scale_float in ipex_attn
#15591 opened
Mar 27, 2025 -
[Bugfix][v1] xgrammar structured output supports Enum.
#15594 opened
Mar 27, 2025 -
[DO NOT REVIEW YET] Merge k_cache and v_caching into one.
#15595 opened
Mar 27, 2025 -
[Bugfix] Correct KV cache tensor dimension handling in FlashInfer backend's block operations
#15603 opened
Mar 27, 2025 -
[V1] Support interleaved modality items
#15605 opened
Mar 27, 2025 -
[Quantization][V1] BitsAndBytes support V1
#15611 opened
Mar 27, 2025 -
[Bugfix] add hf_token to EngineArgs
#15615 opened
Mar 27, 2025 -
[Model] Adding torch compile annotations to chatglm
#15624 opened
Mar 27, 2025 -
Enable Outlines with JSON Sub-Schema References
#15627 opened
Mar 27, 2025 -
[ROCm][AMD][Build] Update AMD supported arch list
#15632 opened
Mar 27, 2025 -
[CI] Update rules for applying `tpu` label.
#15634 opened
Mar 27, 2025 -
Correct PowerPC to modern IBM Power
#15635 opened
Mar 27, 2025 -
[Doc] Fix dead links in Job Board
#15637 opened
Mar 27, 2025 -
[NO REVIEW PLEASE] Kv
#15638 opened
Mar 27, 2025
155 Issues closed by 56 people
-
[New Model]: Please support Babel series model ASAP
#15612 closed
Mar 27, 2025 -
[Usage]: How can I determine the maximum number of concurrent requests?
#8031 closed
Mar 27, 2025 -
[Usage]: Got nccl error when deploy vllm in k8s with multiple GPUs
#7466 closed
Mar 27, 2025 -
[Usage]: how to abort request and stop inference?
#6975 closed
Mar 27, 2025 -
[Usage]: What do max_num_seqs and max_model_len do
#6641 closed
Mar 27, 2025 -
[Doc]: documenting flash attention 1 vs 2 in env vars
#15344 closed
Mar 27, 2025 -
[Feature]: Output the JSON for the response payload when VLLM_LOGGING_LEVEL=DEBUG
#15571 closed
Mar 27, 2025 -
[Usage]: how to reduce the number of processes of compile_worker
#14808 closed
Mar 27, 2025 -
[Usage]: Upgrading from vLLM 0.7.3 to vLLM 0.8.2, but the required GPU memory significantly increases.
#15617 closed
Mar 27, 2025 -
[Installation]: Transformer installation requires uv venv --system now
#15550 closed
Mar 27, 2025 -
tracking torch.compile compatibility with cpu offloading
#10612 closed
Mar 27, 2025 -
tracking torch.compile compatibility with lora serving
#10617 closed
Mar 27, 2025 -
[Bug]: Question about loading the Qwen1.5-MoE-A2.7B model
#15561 closed
Mar 27, 2025 -
[Bug]: Deploying the Qwen2_vl service with vLLM 0.7.0 and later has a memory leak (host memory, not GPU memory)
#15597 closed
Mar 27, 2025 -
[Feature]: Consolidate `LRUCache` implementations
#14927 closed
Mar 27, 2025 -
[Bug]: qwen2-vl with lora is not starting
#13135 closed
Mar 27, 2025 -
[Bug]: Error loading bitsandbytes 4bit model when the quant_storage is torch.bfloat16
#10590 closed
Mar 27, 2025 -
[RFC]: Support KV Cache Compaction
#10646 closed
Mar 27, 2025 -
[Feature]: Mixtral manual `head_dim`
#10649 closed
Mar 27, 2025 -
[Bug]: vllm infer for Qwen2-VL-72B-Instruct-GPTQ-Int8
#10650 closed
Mar 27, 2025 -
[Bug]: Inference is exceptionally slow on the L20 GPU
#10652 closed
Mar 27, 2025 -
[Bug]: AMD GPU RX 7900XT: Failed to infer device type
#10653 closed
Mar 27, 2025 -
[Usage]: Cannot use xformers with old GPU
#10662 closed
Mar 27, 2025 -
[RFC]: Create `VllmState` to save immutable args in `VllmConfig`
#10666 closed
Mar 27, 2025 -
[Usage]: how to get every output token score?
#10670 closed
Mar 27, 2025 -
[Bug]: vLLM returning 415 status code at high load
#14333 closed
Mar 26, 2025 -
[Bug]: DeepSeek-R1-AWQ gets stuck with all tokens rejected when MTP is enabled.
#13704 closed
Mar 26, 2025 -
[Usage]: where to find the official vLLM CPU image
#14756 closed
Mar 26, 2025 -
[Usage]: How to run format check locally?
#15472 closed
Mar 26, 2025 -
[Performance][RFC]: Improving paged attention kernel's performance
#15351 closed
Mar 26, 2025 -
[Bug]: ValueError: not enough values to unpack (expected 22, got 21) when deploying DeepSeekV3
#15453 closed
Mar 26, 2025 -
[Feature]: Add Warning for Chat Template Mismatches similar to SGLang
#15395 closed
Mar 26, 2025 -
[Usage]: How to make sure the timeout takes effect
#14792 closed
Mar 26, 2025 -
[Bug]: Error occurred in v1/rerank interface after upgrading from version 0.7.3 to 0.8.1
#15371 closed
Mar 26, 2025 -
[Usage]: ModuleNotFoundError: No module named 'triton'
#14888 closed
Mar 26, 2025 -
[Doc]: APIConnectionError with OpenAI
#15518 closed
Mar 26, 2025 -
Cupy Import errors in Docker
#3184 closed
Mar 26, 2025 -
[Feature][Chunked Prefill]: Enable cuda graph for chunked prefill.
#4056 closed
Mar 26, 2025 -
[Feature]: Initial LLM token
#5609 closed
Mar 26, 2025 -
[Installation]: Meet bugs when installing from source
#8852 closed
Mar 26, 2025 -
[Bug]: With the same input to a qwen2.5 server, SSE output is correct on vLLM 0.6.1.post2 but wrong on 0.6.3.post1?
#10280 closed
Mar 26, 2025 -
[Bug]: VLLLm crash when running Qwen/Qwen2.5-Coder-32B-Instruct on two H100 GPUs
#10296 closed
Mar 26, 2025 -
[Usage]: Use difference SamplingParams for each sample in batch inference via openai api
#10578 closed
Mar 26, 2025 -
[Feature]: if vllm supports explicitly specifying GPU devices for a model instance.
#10638 closed
Mar 26, 2025 -
[Feature][Hardware][TPU]: Improve the token_num padding logic
#14581 closed
Mar 25, 2025 -
[Bug]: top_logprobs generating a WARNING
#13880 closed
Mar 25, 2025 -
[Bug]: vllm v0.7.3 - The following fields were present in the request but ignored: {'top_logprobs'}
#13881 closed
Mar 25, 2025 -
[Feature][Hardware][TPU]: Add Recompilation Check for vLLM on TPU
#14580 closed
Mar 25, 2025 -
[Bug]: Problem guided decoding (regex)
#15210 closed
Mar 25, 2025 -
[Bug]: Error when use vllm in distributed environment
#15399 closed
Mar 25, 2025 -
[Bug]: V1 cannot be run in Triton Inference Server Backend
#12690 closed
Mar 25, 2025 -
[Bug]: Build error, nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference)
#15452 closed
Mar 25, 2025 -
[Usage]: vLLM Whisper support Sequential algorithm?
#15454 closed
Mar 25, 2025 -
[Feature]: Rootless container for OpenShift compatibility
#15206 closed
Mar 25, 2025 -
[Bug]: DeepSeek-R1-AWQ broken in nightly
#15002 closed
Mar 25, 2025 -
[Bug]: Qwen2.5 VL online service can not input video and image simultaneously.
#15291 closed
Mar 25, 2025 -
[Installation]: Error when importing LLM from vllm
#5086 closed
Mar 25, 2025 -
[Feature]: need a GB-based alternative for gpu_memory_utilization
#7524 closed
Mar 25, 2025 -
[Usage]: While loading model get 'layers.0.mlp.down_proj.weight' after merge_and_unload()
#10598 closed
Mar 25, 2025 -
[Bug]: Ray+vllm run, then crash
#13535 closed
Mar 24, 2025 -
[Misc]: [V1] prompt logprobs + chunked prefill can result in `EngineCore` partial prefill output
#14239 closed
Mar 24, 2025 -
flashinfer backend, not callable NoneType object
#15389 closed
Mar 24, 2025 -
[Bug]: external_dp blocks normal DP group creation
#15176 closed
Mar 24, 2025 -
[Bug]: Qwen2.5-VL mm_processor_kwargs not respected
#15364 closed
Mar 24, 2025 -
[Bug]: LoRA Loading Error: 'GPUModelRunner' object has no attribute 'lora_manager'
#15400 closed
Mar 24, 2025 -
[Usage]:
#15390 closed
Mar 24, 2025 -
[Bug]: loading the default chat template raises TypeError: unhashable type: 'dict'
#15095 closed
Mar 24, 2025 -
[Bug]: if chat_template loaded from disk, jinja exception thrown from _try_extract_ast()
#14884 closed
Mar 24, 2025 -
[Bug]: Executor performance degradation
#15356 closed
Mar 24, 2025 -
[Bug]: Docker image in trunk cannot find libpython.so
#14991 closed
Mar 24, 2025 -
[Installation]: python is missing inside the v0.8.0 docker
#15088 closed
Mar 24, 2025 -
[Misc]: missing python inside the container v0.8.1
#15174 closed
Mar 24, 2025 -
[Bug]: Can't create non-root user using vllm/vllm-openai:v0.8.1 as a base image
#15359 closed
Mar 24, 2025 -
[Bug]: LoRA request raise CUDA OutOfMemoryError when input token > 8k
#15039 closed
Mar 24, 2025 -
[Bug]: GGUF model with architecture deepseek2 is not supported yet while vllm version is 0.8.1
#15277 closed
Mar 24, 2025 -
[Bug]: leading space within content via OpenAI Compatible Server
#3935 closed
Mar 24, 2025 -
[Bug]: Llama-3.2-11B-Vision-Instruct Inference Can't Stop
#9752 closed
Mar 24, 2025 -
[Performance]: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
#10592 closed
Mar 24, 2025 -
[Bug]: Memory allocation with echo=True
#10596 closed
Mar 24, 2025 -
[Bug]: Cannot unpickle PostGradPassManager
#15223 closed
Mar 24, 2025 -
[Usage]: Where is the entry point of the program
#15352 closed
Mar 23, 2025 -
[Bug]: Critical Memory Leak in vLLM V1 Engine: 200+ GB RAM Usage from Image Inference
#15294 closed
Mar 23, 2025 -
[Bug]: int8 2:4 sparse takes more time than fp8
#15275 closed
Mar 23, 2025 -
[Usage]: Async engine batch request no usage
#15363 closed
Mar 23, 2025 -
[Usage]: Async engine batch request
#15314 closed
Mar 23, 2025 -
[Feature]: support Mistral-Large-Instruct-2407 function calling
#6778 closed
Mar 23, 2025 -
[Feature]: Llama 3 and Command-R Chat Templates
#9904 closed
Mar 23, 2025 -
[Bug]: With vLLM running, function calling via OpenAI's Swarm does not work correctly
#10015 closed
Mar 23, 2025 -
[Usage]: Sending in pre-tokenized question during inference doesn't seem any faster than raw text.
#10084 closed
Mar 23, 2025 -
[Misc]: Invariant encountered: value was None when it should not be
#10284 closed
Mar 23, 2025 -
[Bug]: Input prompt (35247 tokens) is too long and exceeds limit of 1000
#10440 closed
Mar 23, 2025 -
[Usage]: KVcache usage for different tasks in batch
#10509 closed
Mar 23, 2025 -
[Bug]: Model does not split across multiple GPUs; instead it occupies the same memory on each GPU
#10516 closed
Mar 23, 2025 -
[Bug]: Gemma2 becomes a fool.
#10525 closed
Mar 23, 2025 -
[Feature]: How to run speculative models with tensor parallelism?
#10562 closed
Mar 23, 2025 -
[Feature]: `torch>=2.6` support for expanded Python 3.13 support
#13434 closed
Mar 22, 2025 -
[RFC]: Drop Support for OpenVINO
#14374 closed
Mar 22, 2025 -
[Feature]: Disabling V1: unsupported structured decoding backend - xgrammar:disable-any-whitespace
#15252 closed
Mar 22, 2025 -
[Usage]: Why is Speculative decoding not compatible with Pipeline Parallelism?
#14089 closed
Mar 22, 2025 -
[Performance]: The impact of CPU on vLLM performance is significant.
#8147 closed
Mar 22, 2025 -
[Feature]: host wheel via pypi index?
#9831 closed
Mar 22, 2025 -
[Bug]: torch.compile raise FileNotFoundError when VLLM_DISABLE_COMPILE_CACHE=1
#15276 closed
Mar 22, 2025 -
[Bug]: Can't deserialize object reported by ray, H800*16 DeepSeek R1
#15199 closed
Mar 22, 2025 -
[Feature]: load/unload API to run multiple LLMs in a single GPU instance
#5491 closed
Mar 22, 2025 -
[RFC]: Add support for IBM Spyre accelerator
#9652 closed
Mar 22, 2025 -
[Bug]: ValueError: No available memory for the cache blocks on main branch after commit 46f98893
#14992 closed
Mar 22, 2025 -
[Usage]: What's the best practice of deploying DeepSeekV3 using vllm?
#14614 closed
Mar 22, 2025 -
[Misc]: How to access the KV cache directly?
#4156 closed
Mar 22, 2025 -
[Usage]: when I set --tensor-parallel-size 4, the OpenAI server does not work and reports a new Exception
#10521 closed
Mar 22, 2025 -
[Feature]: Additional possible value for `tool_choice`: `required`
#10526 closed
Mar 22, 2025 -
[Bug]: Gemma3 is not supported by the V1 Engine
#15298 closed
Mar 22, 2025 -
[Bug]: RuntimeError: Phi4MM cannot process x audios and x images in a prompt
#14506 closed
Mar 22, 2025 -
[Bug]: [TPU] Prefix caching + w8a8 + long context results in degraded performance and corrupted output
#12371 closed
Mar 21, 2025 -
[Bug]: Loading a model with bitsandbytes 8bit quantization
#8799 closed
Mar 21, 2025 -
[Bug]: tests/v1/tpu/test_sampler.py crashes due to ragged_paged_attention arg mismatch
#15257 closed
Mar 21, 2025 -
[Installation]: installation succeeded but No module named 'vllm._C'
#15286 closed
Mar 21, 2025 -
[Bug]: Temperature is ignored in vLLM 0.8.0/0.8.1
#15241 closed
Mar 21, 2025 -
[Performance]: vLLM 0.6.5 with GLM4-9B-Chat and dynamically loaded LoRA shows a large inference slowdown on long text input
#11317 closed
Mar 21, 2025 -
[Feature]: Apply tool calling after reasoning steps in Reasoning models.
#14490 closed
Mar 21, 2025 -
[Feature]: specify model only in config.yaml
#14819 closed
Mar 21, 2025 -
[Bug]: v1 speculate decoding NgramProposer experiences service exceptions during stress testing
#14742 closed
Mar 21, 2025 -
[Bug]: Why are the vLLM and Hugging Face Transformers inference results inconsistent?
#12343 closed
Mar 21, 2025 -
[Usage]: vllm v0.7.2 can not support baichuan2 model
#13810 closed
Mar 21, 2025 -
Better defaults to match Hugging Face
#2733 closed
Mar 21, 2025 -
[Bug]: Qwen2.5-VL Cannot Output Correct Content under 0.8.1
#15197 closed
Mar 21, 2025 -
[Bug]: Qwen VL 2.5 doesn't work in v0.8.0 - again
#15122 closed
Mar 21, 2025 -
[Bug]: LoRA response's model name is incorrect
#7260 closed
Mar 21, 2025 -
[Feature]: Integrate `flash-infer` FP8 KV Cache Chunked-Prefill (Append Attention)
#7450 closed
Mar 21, 2025 -
[Feature]: KVPress
#10491 closed
Mar 21, 2025 -
Metrics model name when using multiple loras
#10504 closed
Mar 21, 2025 -
[Feature]: OpenAI Response API
#15237 closed
Mar 20, 2025
134 Issues opened by 119 people
-
[Bug]: Outlines broken on vLLM 0.8+
#15636 opened
Mar 27, 2025 -
[Feature]: Support loading LoRA adapters directly from s3 bucket
#15633 opened
Mar 27, 2025 -
[Bug]: Can't load any LLM with v0.8.*
#15631 opened
Mar 27, 2025 -
[Bug]: v1 flash_attn and triton_attn backends don't have `get_state_cls`
#15630 opened
Mar 27, 2025 -
DeciLMConfig object has no attribute ‘num_key_value_heads_per_layer’ For Nemotron
#15625 opened
Mar 27, 2025 -
[Bug]: vllm 0.8.2 has severe quality problems
#15622 opened
Mar 27, 2025 -
[Bug]: Triton JIT Compile Regression from PR 15511
#15619 opened
Mar 27, 2025 -
[Usage]: How to output DeepSeek's reasoning normally while returning the final content as structured output
#15618 opened
Mar 27, 2025 -
[Installation]: ValueError: size must contain 'shortest_edge' and 'longest_edge' keys.
#15614 opened
Mar 27, 2025 -
[Bug]: Empty content when sending inference requests to Gemma3 deployed on a T4 GPU
#15610 opened
Mar 27, 2025 -
[Bug]: Failed to run deepseek v2 lite model with tp = 4
#15607 opened
Mar 27, 2025 -
[Usage]: Will dynamo be on vllm main branch?
#15606 opened
Mar 27, 2025 -
[Bug]: Failed to run deepseek v2 lite model with tp = 8 when enabling expert parallel
#15604 opened
Mar 27, 2025 -
How to install and use vLLM to serve multiple large language models
#15602 opened
Mar 27, 2025 -
[Bug]: Qwen2-VL-2B quantized model shows no improvement in inference speed compared to the original model
#15601 opened
Mar 27, 2025 -
[V1] [Performance Benchmark] Benchmark the performance of Speculative Decoding
#15600 opened
Mar 27, 2025 -
[Bug]: Gemma3 GPU memory usage is always oom
#15599 opened
Mar 27, 2025 -
[Bug]: Model Reasoning Warning
#15596 opened
Mar 27, 2025 -
[Bug]:ModuleNotFoundError: No module named 'vllm._C'
#15592 opened
Mar 27, 2025 -
[Bug]: DeepSeek R1 with V1+FLASHMLA on L40S
#15590 opened
Mar 27, 2025 -
[Bug]: guided_json not working correctly with (quantized) mistral-small model
#15577 opened
Mar 26, 2025 -
TP4 fails with 5090 in the mix
#15576 opened
Mar 26, 2025 -
[Bug]: Vllm 0.8.2 + Ray 2.44 (Ray serve deployment) fallbacks to V0 Engine
#15569 opened
Mar 26, 2025 -
[Bug]: --api-key argument ignored when VLLM_API_KEY is set
#15568 opened
Mar 26, 2025 -
[Feature]: Ring Attention for Long Context in vLLM - RL Applications Focus
#15566 opened
Mar 26, 2025 -
[New Model]: please add support for Qwen/Qwen2.5-Omni-7B
#15563 opened
Mar 26, 2025 -
[Feature]: LMCache support to the CPU version of vLLM
#15562 opened
Mar 26, 2025 -
[Bug][V1]: ngram + guided decoding
#15554 opened
Mar 26, 2025 -
[Bug]: Structured Output not working with MistralTokenizer (vLLM 0.8.2, V1)
#15551 opened
Mar 26, 2025 -
[Bug]: Tools parsing issues with mistral3.1
#15549 opened
Mar 26, 2025 -
[Installation]: flaky publishing of cpu image
#15547 opened
Mar 26, 2025 -
[Bug]: When streaming tool call output, the stream loses 2 tokens of tool data
#15545 opened
Mar 26, 2025 -
[New Model]: HuggingFaceTB/SmolVLM2-2.2B-Instruct
#15541 opened
Mar 26, 2025 -
[Bug]: Error when inference on llava-1.6-34B
#15539 opened
Mar 26, 2025 -
[Usage]: How to get "num_gpu_blocks" in V1?
#15538 opened
Mar 26, 2025 -
[Bug]: Distributed Inference and Serving BUG
#15537 opened
Mar 26, 2025 -
[Bug]: Qwen2.5-VL-32B, Following weights were not initialized from checkpoint
#15536 opened
Mar 26, 2025 -
[Bug]: `TypeError: Unknown video model type: phi4mm`
#15534 opened
Mar 26, 2025 -
[Usage]: Evaluation score under V1 engine is low
#15533 opened
Mar 26, 2025 -
[Installation]: install vllm with CUDA 12.8 in 5090D error
#15531 opened
Mar 26, 2025 -
[Bug][Triton MLA]: Some calculation errors in the triton mla kernel
#15530 opened
Mar 26, 2025 -
[Usage]: Qwen2.5-VL-32B-Instruct fails to start on 4x RTX 4090
#15529 opened
Mar 26, 2025 -
[Bug]: FunctionDefinition missing optional param strict
#15526 opened
Mar 26, 2025 -
[Bug]: VLLM_NCCL_SO_PATH takes no effect when spawning workers
#15525 opened
Mar 26, 2025 -
[Feature]: Reason model reasoning effort feature like OpenAI
#15524 opened
Mar 26, 2025 -
[Usage]: distributed using ray, how to get worker runtime error log
#15514 opened
Mar 26, 2025 -
[Bug]:
#15513 opened
Mar 26, 2025 -
[Bug]: Embed model has additional dense module(dim=1792, but only 1024)
#15509 opened
Mar 26, 2025 -
[Bug]: Support Bitsandbytes weight loading when offline (via huggingface cache)
#15507 opened
Mar 25, 2025 -
[Doc]: Troubleshooting guide incorrect hardware script fails
#15498 opened
Mar 25, 2025 -
[Bug]: Llama-3.2-11B-Vision-Instruct has an issue in vision language embedding
#15496 opened
Mar 25, 2025 -
[Bug]: Allow flexible message role ordering in conversations (user/assistant in any sequence)
#15486 opened
Mar 25, 2025 -
[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made
#15483 opened
Mar 25, 2025 -
[Bug]: Unknown gguf model_type: gemma3
#15480 opened
Mar 25, 2025 -
[Bug]: Qwen2 MoE inference is super slow
#15470 opened
Mar 25, 2025 -
[Usage]:
#15469 opened
Mar 25, 2025 -
[Usage]:Phi-4-multimodal-instruct
#15468 opened
Mar 25, 2025 -
[Feature]: Embedding API dimensions is currently not supported.
#15465 opened
Mar 25, 2025 -
[Doc]: https://docs.vllm.ai/en/latest/deployment/k8s.html not working
#15461 opened
Mar 25, 2025 -
[Feature]: preprocessing of weights in advance
#15459 opened
Mar 25, 2025 -
[Bug]: vllm 0.8.3 serve error
#15457 opened
Mar 25, 2025 -
[Usage]: vLLM server hangs on startup
#15451 opened
Mar 25, 2025 -
[Installation]: RuntimeError: Unknown runtime environment
#15450 opened
Mar 25, 2025 -
[Usage]: Question about Interleaved Text/Image Format in Online Inference
#15449 opened
Mar 25, 2025 -
[Usage]: vllm Qwen 2.5 VL output is so different from the original Qwen 2.5VL
#15447 opened
Mar 25, 2025 -
[Bug]: v1 fails when token budget is set to 128K
#15446 opened
Mar 25, 2025 -
[Performance]: Regarding the issue of context length for QWQ-32B in different distributed environments:
#15442 opened
Mar 25, 2025 -
[Usage]: `Phi-4-multimodal-instruct` activate LoRA module but get mangled text output
#15440 opened
Mar 25, 2025 -
[Usage]: Serve From Hard disk and folder Path issue
#15439 opened
Mar 25, 2025 -
[Installation]: Fail to build vLLM from source on H100
#15435 opened
Mar 25, 2025 -
[Feature]: [V1] Collective RPC
#15430 opened
Mar 25, 2025 -
[Usage]: online server requests do not return token usage information in version 0.7.2
#15426 opened
Mar 25, 2025 -
[New Model]: Baichuan-Audio
#15425 opened
Mar 25, 2025 -
[New Model]: glm-4-voice-9b
#15424 opened
Mar 25, 2025 -
[Bug]: logprobs/ranks not matching when comparing `vllm` with `transformers`
#15420 opened
Mar 24, 2025 -
[Feature]: Limit thinking tokens
#15418 opened
Mar 24, 2025 -
[Feature]: Implement Embedding Models in V1
#15406 opened
Mar 24, 2025 -
[Bug]: `Phi-4-multimodal-instruct` encoder outputs didn't have the same length as defined in input_ids
#15404 opened
Mar 24, 2025 -
[Feature]: JSON based tool calling for Gemma 3
#15403 opened
Mar 24, 2025 -
[Bug]: RequestMetrics object (accessed through output[0].metrics) is None
#15394 opened
Mar 24, 2025 -
[Bug]: Batch embedding inference is inconsistent with hf
#15393 opened
Mar 24, 2025 -
[Bug]: vllm V1 pipeline parallel not compatible with ray==2.44.0
#15391 opened
Mar 24, 2025 -
[Bug]: awq Deepseek-R1-AWQ The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
#15386 opened
Mar 24, 2025 -
[Bug]: gcs_rpc_client.h:151: Failed to connect to GCS at address 192.168.2.40:6379 within 5 seconds.
#15385 opened
Mar 24, 2025 -
[Feature]: Request for Support of Dense and Sparse Features in bge-m3 Embedding Model
#15384 opened
Mar 24, 2025 -
[Bug]: Different logprobs output behaviour under vllm 0.8.0 and 0.8.1
#15381 opened
Mar 24, 2025 -
[Usage][UT]:Why the answer is ' 0, 1'
#15380 opened
Mar 24, 2025 -
[Feature]: Support Top-nσ sampling
#15379 opened
Mar 24, 2025 -
[Feature]: Add CoT dataset to the benchmark
#15378 opened
Mar 24, 2025 -
[Bug]: VLLM Build Using Docker Error Deploy
#15376 opened
Mar 24, 2025 -
[Feature]: Support LoRA adapter for whisper
#15370 opened
Mar 24, 2025 -
[Feature]: Overall tests improvement and speedup
#15369 opened
Mar 24, 2025 -
[Bug]: 0.8.0 and 0.8.1 bugs
#15365 opened
Mar 23, 2025 -
[New Model]: Support for SFR-Embedding-Code-2B_R embbeding model
#15362 opened
Mar 23, 2025 -
[Bug]: vLLM v1 hanging during Torch compilation
#15360 opened
Mar 23, 2025 -
[Bug]: LLM.collective_rpc is broken in v1 by default
#15349 opened
Mar 23, 2025 -
[Bug]: Wheels binary is absent for v0.7.2 release
#15347 opened
Mar 23, 2025 -
[Bug]: misleading regex with `--tokenizer-mode mistral` `OSError`
#15345 opened
Mar 23, 2025 -
[usage]: The fastest offline inference method
#15342 opened
Mar 22, 2025 -
[Bug]: Can't deserialize object: ObjectRef,DeepSeek R1, H20*16, pp2, tp8, v1 engine
#15333 opened
Mar 22, 2025 -
[Performance]: poor performance in pipeline parallesm when batch-size is large
#15330 opened
Mar 22, 2025 -
[Bug]:
#15329 opened
Mar 22, 2025 -
[Bug]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
#15327 opened
Mar 22, 2025 -
[Feature]: looking into adding a generation algorithm
#15315 opened
Mar 22, 2025 -
[Bug]: vLLM declares itself healthy before it can serve requests
#15313 opened
Mar 21, 2025 -
[Bug]: Crashing on unsupported Sampling params
#15312 opened
Mar 21, 2025 -
[Usage]: Generating multiple completions with Qwen QwQ 32B
#15304 opened
Mar 21, 2025 -
[Bug]: OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym error
#15300 opened
Mar 21, 2025 -
[Bug]: Worker VllmWorkerProcess pid 000000 died, exit code: -15
#15295 opened
Mar 21, 2025 -
[Feature]: Dynamic Memory Release for GPU after idle time
#15287 opened
Mar 21, 2025 -
[Usage]: why no ray command in my docker image
#15284 opened
Mar 21, 2025 -
[Bug]: streaming is lost in arguments in tool_calls
#15274 opened
Mar 21, 2025 -
[Bug]: Inconsistent Output Based on Presence of chat_template Parameter
#15272 opened
Mar 21, 2025 -
vector search
#15268 opened
Mar 21, 2025 -
[Feature]: Can support CPU inference with Ray cluster?
#15266 opened
Mar 21, 2025 -
[Bug]: qwen2.5vl cannot use fp8 quantization
#15264 opened
Mar 21, 2025 -
[Bug]: oracle for device checking raise exception unexpectly
#15263 opened
Mar 21, 2025 -
[Bug]: OOM with QwQ-32B
#15258 opened
Mar 21, 2025 -
[Bug]: working with openai-agents SDK and using Runner.run_streamed() got function call error
#15256 opened
Mar 21, 2025 -
[Bug]: --tensor-parallel-size Error
#15255 opened
Mar 20, 2025 -
[RFC]: Better support for weight updating while waking up from sleep mode for RLHF
#15254 opened
Mar 20, 2025 -
[Performance]: V0 and V1 give the same throughput number
#15253 opened
Mar 20, 2025 -
[Bug]: `VLLM_USE_V1` + `TORCH_SDPA` regression in v0.8
#15251 opened
Mar 20, 2025
326 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Model][VLM] Add Qwen2.5-Omni model support (thinker only)
#15130 commented on
Mar 27, 2025 • 35 new comments -
[Frontend]Reduce vLLM's import time
#15128 commented on
Mar 25, 2025 • 34 new comments -
[Model][MiniMaxText01] Support MiniMaxText01 model inference
#13454 commented on
Mar 27, 2025 • 30 new comments -
[V1] Implement sliding window attention in kv_cache_manager
#14097 commented on
Mar 27, 2025 • 26 new comments -
[Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore
#12957 commented on
Mar 27, 2025 • 19 new comments -
Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations.
#13932 commented on
Mar 27, 2025 • 17 new comments -
[Model] Add PLaMo2
#14323 commented on
Mar 27, 2025 • 14 new comments -
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend
#14238 commented on
Mar 27, 2025 • 14 new comments -
[Refactor][Frontend] Keep all logic about reasoning into one class
#14428 commented on
Mar 27, 2025 • 11 new comments -
[Feature] Support sequence parallelism
#14908 commented on
Mar 27, 2025 • 10 new comments -
[V1] AsyncLLM data parallel
#13923 commented on
Mar 27, 2025 • 7 new comments -
[Feature][ROCm]Enable fusion pass for torch.compile on ROCm
#15050 commented on
Mar 26, 2025 • 7 new comments -
Allow dynamic loading of LoRA adapters in a cache dir
#14634 commented on
Mar 25, 2025 • 6 new comments -
[Kernel][Triton] Adding fp8 and variable length sequence support to Triton FAv2 kernel
#12591 commented on
Mar 27, 2025 • 5 new comments -
[Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS.
#14948 commented on
Mar 27, 2025 • 5 new comments -
Truncation control for embedding models
#14776 commented on
Mar 21, 2025 • 5 new comments -
[Distributed] Add custom allreduce support for ROCM
#14125 commented on
Mar 27, 2025 • 5 new comments -
Add endpoint load metrics
#14906 commented on
Mar 27, 2025 • 4 new comments -
[Bugfix] Enable `torch.comple` for 2 parts of model
#14913 commented on
Mar 22, 2025 • 4 new comments -
[Quantization][FP8] Adding support for fp8 gemm layer input in fp8
#14578 commented on
Mar 24, 2025 • 4 new comments -
[V1][Frontend] Improve Shutdown And Logs
#11737 commented on
Mar 27, 2025 • 4 new comments -
Add cutlass support for blackwell fp8 blockwise gemm
#14383 commented on
Mar 26, 2025 • 4 new comments -
[Frontend] Implement Tool Calling with `tool_choice='required'`
#13483 commented on
Mar 27, 2025 • 4 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
Mar 27, 2025 • 3 new comments -
Torchao
#14231 commented on
Mar 27, 2025 • 3 new comments -
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS
#6036 commented on
Mar 27, 2025 • 3 new comments -
[Neuron][V1] Experimental support for neuron backend with V1 architecture
#14648 commented on
Mar 23, 2025 • 2 new comments -
[Bugfix] handle alignment of encoder_seq_lens in mllama.py
#14784 commented on
Mar 26, 2025 • 2 new comments -
[V1][Core] Support offloading KV cache to CPU.
#13377 commented on
Mar 27, 2025 • 2 new comments -
[Bugfix] Data parallel example will all use same GPUs if the users script initializes torch.cuda
#14598 commented on
Mar 26, 2025 • 2 new comments -
[WIP][V1][Metrics] Speculative decoding metrics
#15151 commented on
Mar 27, 2025 • 2 new comments -
[V0][Fix] structured decoding compatibility with speculative decoding
#13823 commented on
Mar 27, 2025 • 2 new comments -
[RFC][V1] `LogitsProcessor` interface
#13360 commented on
Mar 27, 2025 • 1 new comment -
[HPU] Enable AutoGPTQ/AutoAWQ quantized model inference
#13853 commented on
Mar 24, 2025 • 1 new comment -
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER
#15001 commented on
Mar 26, 2025 • 1 new comment -
[V1][Metrics] Add additional metrics to V1
#14148 commented on
Mar 25, 2025 • 1 new comment -
[V1][Perf] Faster incremental detokenization
#15137 commented on
Mar 25, 2025 • 1 new comment -
[V1][Metrics] Allow V1 AsyncLLM to use custom logger
#14661 commented on
Mar 26, 2025 • 1 new comment -
[Metrics] Add bucket for `request_latency_buckets`
#15202 commented on
Mar 25, 2025 • 1 new comment -
Move dockerfiles into their own directory
#14549 commented on
Mar 27, 2025 • 1 new comment -
[CI/Build] Add support for Python 3.13
#13164 commented on
Mar 24, 2025 • 0 new comments -
[Model] Add T5 model (2/2)
#11901 commented on
Mar 27, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Mar 21, 2025 • 0 new comments -
[Hardware][TPU] improve kv cache update performance in prefill
#13176 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix] add input embedding
#11684 commented on
Mar 25, 2025 • 0 new comments -
[Core] Efficient transmission for CPU prefix caching, based on PR#10874
#11099 commented on
Mar 27, 2025 • 0 new comments -
[Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server
#10546 commented on
Mar 27, 2025 • 0 new comments -
[Core] Faster logit_bias_logits_processor
#13334 commented on
Mar 26, 2025 • 0 new comments -
[WIP][Attention] Update to lastest FA3 code
#13111 commented on
Mar 26, 2025 • 0 new comments -
[V1][Bugfix] DeepSeek-V3 v1 attn_backend miss q_lora_rank
#13092 commented on
Mar 27, 2025 • 0 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Mar 26, 2025 • 0 new comments -
[Neuron][Kernel] NKI Flash PagedAttention with BlockSparse Execution Plan
#13249 commented on
Mar 21, 2025 • 0 new comments -
[V1] Optimize block table copy from CPU to GPU (take 2)
#12078 commented on
Mar 25, 2025 • 0 new comments -
[WIP][Hardware][CPU] testing branch for mlperf
#12141 commented on
Mar 27, 2025 • 0 new comments -
[V1][PoC] Refactor EngineCoreOutputs
#12853 commented on
Mar 27, 2025 • 0 new comments -
[Quantization/Parameter] WIP: Another Implementation of the Quantization Parameter Subclass Substitution
#12158 commented on
Mar 27, 2025 • 0 new comments -
[WIP][AMD][Kernel][Quantization] Add fp8 and int8 support for Triton FAv2 kernel
#12534 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Attention] KV Splits heuristic for MLA
#12654 commented on
Mar 26, 2025 • 0 new comments -
[WIP] MLA decode attention - cuda graph support
#12588 commented on
Mar 27, 2025 • 0 new comments -
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1
#13305 commented on
Mar 27, 2025 • 0 new comments -
[Frontend] Support suffix in completions API (fill-in-the-middle, FIM)
#9522 commented on
Mar 24, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q4 2024
#9006 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: disaggregated prefilling hangs when TP=2
#11247 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: invalid argument [ Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
#13270 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vLLM, requests initially take 3-5 seconds each but gradually slow down to about 60 seconds each after running for a while. Has anyone else seen this?
#13886 commented on
Mar 27, 2025 • 0 new comments -
[Usage]: LLM.beam_search is much slower in vLLM 0.7.3 compared to 0.5.4
#14426 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Qwen2_5_VLForEmbedding
#13373 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: 0.8.0(V1) RayChannelTimeoutError when inferencing DeepSeekV3 on 16 H20 with large batch size
#15102 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: v0.6.4 requires more GPU memory than v0.6.3
#10360 commented on
Mar 27, 2025 • 0 new comments -
[Usage]: How to get token level probablity scores
#10951 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Prefix caching doesn't work for LlavaOneVision
#11371 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: error when start in multiple GPU
#11467 commented on
Mar 27, 2025 • 0 new comments -
[Misc]: qwen2 inference results are not aligned between vLLM and transformers
#11478 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: 'int' object has no attribute 'parser_state'
#11498 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Qwen2.5-72B-Instruct inference fails on A800
#11506 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: The CPU usage is very low when inference is performed on the ARM CPU
#11511 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: vLLM 0.6.5 running Qwen2-VL-7B-Instruct: LoRA loads successfully but has no effect
#11525 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Profiling on vLLM server hangs when --num-scheduler-steps > 1
#12032 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Run DeepSeek-R1-awq model on AMD MI210 meet an error
#15101 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: CDNA cc >= 90, choose_mp_linear_kernel MacheteLinearKernel is possible
#14996 commented on
Mar 26, 2025 • 0 new comments -
[Hardware] [Intel GPU] Add multistep scheduler for xpu device
#9337 commented on
Mar 26, 2025 • 0 new comments -
[Bugfix] fix error due to an uninitialized tokenizer when using `skip_tokenizer_init` with `num_scheduler_steps`
#9276 commented on
Mar 27, 2025 • 0 new comments -
Developed a PoC of dAttention support. It uses an idea similar to vAttention but introduces a new memory layout that avoids vAttention's memory waste.
#9078 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix] Fix LongRoPE bug
#8254 commented on
Mar 27, 2025 • 0 new comments -
[Frontend] Add option for LLMEngine to return model hidden states.
#7892 commented on
Mar 25, 2025 • 0 new comments -
[Core] generate from input embeds
#6869 commented on
Mar 27, 2025 • 0 new comments -
[BugFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: msgspec.DecodeError: MessagePack data is malformed: trailing characters (byte 13)
#15207 commented on
Mar 27, 2025 • 0 new comments -
[New Model]: answerdotai/ModernBERT-large
#11347 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Out of Memory error for Qwen2.5 in 0.8.0 and 0.8.1. Worked fine in the previous versions
#15228 commented on
Mar 27, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q1 2025
#11862 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: size mismatch when loading MixtralForCausalLM GGUF model
#14423 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Support Multiple Tasks Per Model
#11905 commented on
Mar 27, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Mar 27, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Mar 27, 2025 • 0 new comments -
[RFC]: Merge input processor and input mapper for multi-modal models
#10114 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Support rerank models
#6928 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
#8024 commented on
Mar 27, 2025 • 0 new comments -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
Mar 27, 2025 • 0 new comments -
[Quant] SupportsQuant handles ignored_modules
#14635 commented on
Mar 26, 2025 • 0 new comments -
[Ray]Ray Compiled Graph support other device
#14668 commented on
Mar 24, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Mar 22, 2025 • 0 new comments -
[Perf] Optimize Qwen2/2.5-VL/Omni series' rot pos compute using numba
#14684 commented on
Mar 26, 2025 • 0 new comments -
[V1][Feature] Enable Speculative Decoding with Structured Outputs
#14702 commented on
Mar 27, 2025 • 0 new comments -
[DO NOT MERGE] [V1] Implement SimpleScheduler
#14731 commented on
Mar 21, 2025 • 0 new comments -
[CI/Build] Add hpu test with tensor-parallel-size=2 to run-hpu-test.sh
#14751 commented on
Mar 25, 2025 • 0 new comments -
[Quantization] Add Gemma2 and Gemma3 text model GGUF support
#14766 commented on
Mar 26, 2025 • 0 new comments -
[Bugfix] fix deepseek fp16 scale bug
#14809 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix][Mamba] Fix IndexError When MambaCacheManager is Full
#14820 commented on
Mar 22, 2025 • 0 new comments -
[Bugfix][Model] fix mllama multi-image
#14883 commented on
Mar 26, 2025 • 0 new comments -
Add Phi-4-mini function calling support
#14886 commented on
Mar 26, 2025 • 0 new comments -
[Kernel] vLLM Windows CUDA support
#14891 commented on
Mar 25, 2025 • 0 new comments -
[Feature] Eagle Chunked Prefill Support
#14922 commented on
Mar 25, 2025 • 0 new comments -
[ V0 ][ sample ] improve sample performance when using guide decoding
#14962 commented on
Mar 22, 2025 • 0 new comments -
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature
#14968 commented on
Mar 24, 2025 • 0 new comments -
[V1][Misc] Misc simplifications / performance-related
#14989 commented on
Mar 27, 2025 • 0 new comments -
[Model] Update support for NemotronNAS models
#15008 commented on
Mar 25, 2025 • 0 new comments -
[CPU] Support torch compile in CPU backend
#15020 commented on
Mar 25, 2025 • 0 new comments -
[Bugfix] Fix hidden_states reshape failed and no_proposals error when…
#15032 commented on
Mar 24, 2025 • 0 new comments -
[TPU][V1] Capture multimodal encoder during model compilation
#15051 commented on
Mar 27, 2025 • 0 new comments -
[DO NOT REVIEW YET] Integrate with the write-to-kvcache Pallas kernel
#15067 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix][Misc] Add a defensive check before importing triton
#15099 commented on
Mar 24, 2025 • 0 new comments -
Metrics proposal OpenTelemetry API
#15138 commented on
Mar 21, 2025 • 0 new comments -
[WIP][TPU] Support mrope models (Qwen2VL)
#15149 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Feature] Support chunked prefill when using Deepseek MTP model as draft model
#15153 commented on
Mar 21, 2025 • 0 new comments -
online_rotations
#15162 commented on
Mar 21, 2025 • 0 new comments -
[Spec Decode] Make speculative decoding compatible with pipeline parallelism
#15173 commented on
Mar 26, 2025 • 0 new comments -
[SpecDecode] Make spec decoding extensible to different backends
#15195 commented on
Mar 25, 2025 • 0 new comments -
[Bugfix] Fix include prompt in stream response when echo=true
#15233 commented on
Mar 21, 2025 • 0 new comments -
[ROCm] Pop ROCR_VISIBLE_DEVICES in RayWorkerWrapper
#15246 commented on
Mar 21, 2025 • 0 new comments -
[Bugfix]: DeepseekR1 model load fails with weights tied error
#13335 commented on
Mar 26, 2025 • 0 new comments -
[Core][Feature] Input metadata dump on crash
#13407 commented on
Mar 27, 2025 • 0 new comments -
[Neuron][CI][WIP] Refactor Neuron kernel tests to improve coverage
#13455 commented on
Mar 21, 2025 • 0 new comments -
[CI/Build] custom build backend and dynamic build dependencies v2
#13480 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Kernel] Flashinfer MLA support
#13630 commented on
Mar 26, 2025 • 0 new comments -
[V1][PP] Continue scheduling prefill chunks
#13637 commented on
Mar 27, 2025 • 0 new comments -
[V1][Minor] Use FakeAttentionMetadata for dummy run
#13689 commented on
Mar 27, 2025 • 0 new comments -
[V1] Zero-copy tensor/ndarray serialization/transmission
#13790 commented on
Mar 26, 2025 • 0 new comments -
[Model][Speculative Decoding] support k > 1 for MTP
#13805 commented on
Mar 27, 2025 • 0 new comments -
[Misc] support variable remote backend for model loader
#13809 commented on
Mar 25, 2025 • 0 new comments -
Fix TPU CI
#13898 commented on
Mar 27, 2025 • 0 new comments -
Upgrade `transformers` to `v4.50.2`
#13905 commented on
Mar 27, 2025 • 0 new comments -
[V1] Avoid false positives when warning for unimplemented methods
#14046 commented on
Mar 23, 2025 • 0 new comments -
[v1] Remove bind_kv_cache and self.kv_cache in model runner
#14098 commented on
Mar 27, 2025 • 0 new comments -
Deepseek MTP for V1
#14182 commented on
Mar 25, 2025 • 0 new comments -
[V1] Enable Long Context LoRA tests for V1
#14241 commented on
Mar 26, 2025 • 0 new comments -
[TPU][V1] Capture multimodal encoder during model compilation
#14254 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Attention] FlashAttn MLA
#14258 commented on
Mar 26, 2025 • 0 new comments -
[V1] TPU - Remove self.kv_caches
#14309 commented on
Mar 27, 2025 • 0 new comments -
[Misc][Minor] Benchmarks: Fix guided decoding, token sampling, and request sorting
#14368 commented on
Mar 21, 2025 • 0 new comments -
[Misc] Fix test_sleep to use query parameters
#14373 commented on
Mar 22, 2025 • 0 new comments -
[Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS
#14395 commented on
Mar 26, 2025 • 0 new comments -
Clean up Engine Args & Documentation
#14409 commented on
Mar 27, 2025 • 0 new comments -
[Misc] Refactor platform to get device specific stream and event
#14411 commented on
Mar 26, 2025 • 0 new comments -
[WIP] Support models with mrope (Qwen2VL) on TPU
#14442 commented on
Mar 27, 2025 • 0 new comments -
[Kernel] moe wna16 marlin kernel
#14447 commented on
Mar 23, 2025 • 0 new comments -
[INTEL-HPU] Deepseek R1 model enabling for Intel Gaudi
#14455 commented on
Mar 24, 2025 • 0 new comments -
[#14109][bug] Fix Ray placement group allocation not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional GPU)
#14521 commented on
Mar 22, 2025 • 0 new comments -
First working PoC for bge-m3 sparse embeddings
#14526 commented on
Mar 26, 2025 • 0 new comments -
[Frontend] Skip `stop` in reasoning content
#14550 commented on
Mar 27, 2025 • 0 new comments -
permute/unpermute kernel for moe optimization
#14568 commented on
Mar 27, 2025 • 0 new comments -
fix: set use_beam_search to false to avoid broken trace links
#14592 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: GPU memory usage keeps increasing the longer the server runs after startup
#8413 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: vLLM multi-step scheduling crashes when input prompt is long
#10009 commented on
Mar 23, 2025 • 0 new comments -
[Installation]: Missing v0.6.3.post1-cu118-cp310.whl. Can share it? Thanks so much
#10036 commented on
Mar 23, 2025 • 0 new comments -
[RFC]: The two features i wish vllm has
#11410 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: c4ai-command-r-plus-08-2024 tool choice support
#11405 commented on
Mar 23, 2025 • 0 new comments -
[Misc]: How to Profile Both EngineCoreClient and EngineCoreProc Activities in V1 Using Profiler
#11413 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: QTIP Quantization
#11416 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: Ensure benchmark serving does not import vLLM
#14923 commented on
Mar 23, 2025 • 0 new comments -
[RFC]: vLLM Windows CUDA support
#14981 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: Support torch.distributed as the runtime for multi-node inference
#12511 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: "Loading safetensors checkpoint shards" runs twice when serving model
#13765 commented on
Mar 22, 2025 • 0 new comments -
[Feature]: Reduce vLLM's import time
#14924 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: ValueError: The checkpoint you are trying to load has model type `qwen2_5_vl` but Transformers does not recognize this architecture
#13446 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: The qwq-32b-q4_k_m.gguf quantized model is not supported.
#15015 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: TRACKING ISSUE: `AsyncEngineDeadError`
#5901 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: No available block found in 60 seconds in shm
#6614 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: How to benchmark throughput of DeepSeek-R1-671B on 2 nodes
#15024 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: Unit test `tests/models/embedding/vision_language/test_phi3v.py` failing
#14677 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: Model weights in GiB
#14979 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks and no results
#3998 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: Model `compute_logits` always gets None for `sampling_metadata`
#15115 commented on
Mar 24, 2025 • 0 new comments -
[New Model]: jinaai/jina-reranker-v2-base-multilingual
#15222 commented on
Mar 24, 2025 • 0 new comments -
[RFC]: Hybrid Memory Allocator
#11382 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Extra body doesn't work when response_format is also sent for serving
#7337 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: LoRA refuses to load from disk without extremely weird file path manipulations
#9063 commented on
Mar 24, 2025 • 0 new comments -
[Usage]: Removal of vllm.openai.rpc folder in vLLM 0.6.2 release
#10766 commented on
Mar 24, 2025 • 0 new comments -
[Performance]: Performance degradation due to CPU bottleneck when serving embedding models to GPUs
#11320 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: No profiler output when VLLM_TORCH_PROFILER_DIR is enabled for vllm serve
#11346 commented on
Mar 24, 2025 • 0 new comments -
Error when running 'python -m vllm.entrypoints.openai.api_server'
#11411 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: 0.6.5 randomly closes connection/drops requests
#11421 commented on
Mar 24, 2025 • 0 new comments -
[Doc]: new attention layer
#15077 commented on
Mar 24, 2025 • 0 new comments -
[Installation]: VLLM on ARM machine with GH200
#10459 commented on
Mar 23, 2025 • 0 new comments -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: Missing detection of BFloat16 for CPU ARM
#11814 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: [V1][SpecDec] RuntimeError: CUDA error: an illegal memory access was encountered
#13673 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: [V1] New v1 engine does not support n>1?
#12584 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: When enabling LoRA, greedy search gives different answers.
#7977 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: Likely Regression - Was working in v0.6.3.post1, now using response_format parameter with "type": "bool" in v0.7.3: BadRequestError: Error code 400 - {'object': 'error', 'message': 'json_schema_converter.cc:595 Unsupported type bool in schema {type":"bool"}\n, 'type': 'BadRequestError', 'param': None, 'code': 400}
#13864 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: When using disaggregated_prefill, the KV receiving thread reports a timeout if nothing is input
#14193 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: vLLM 0.5.5 and FlashInfer0.1.6
#8091 commented on
Mar 23, 2025 • 0 new comments -
Supporting RWKV models
#3583 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
#7871 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Stuck when tensor-parallel-size > 1
#8087 commented on
Mar 21, 2025 • 0 new comments -
[Doc]: Offline Inference Distributed
#8966 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: vllm crashes when preemption of priority scheduling is triggered on vllm-0.6.3.dev173+g36ea7907.d20241011
#9342 commented on
Mar 21, 2025 • 0 new comments -
[New Model]: Qwen/QwQ-32B-Preview
#10737 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: how to use EAGLE on vLLM?
#11126 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Paligemma 2 model loading error
#11343 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: meta-llama/Prompt-Guard-86M Usage Value Error.
#11360 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: priority scheduling doesn't work according to token_per_s. The token_per_s of requests with higher priorities is not higher than that of requests without priority settings.
#11361 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Service operation occasionally raises RuntimeError: CUDA error: an illegal memory access was encountered
#11366 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: vLLM crashes on tokenized embedding input
#11375 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: How do I run offline batch inference with Llama 405B BF16 across multinode (via SLURM)
#11379 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: stop_sequences is applied to both reasoning_content and content
#14399 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: flash_attn_with_kvcache kernel, an illegal memory access
#15113 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: Support openai responses API interface
#14721 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 commented on
Mar 20, 2025 • 0 new comments -
[Bug]: Multi-Node Online Inference on TPUs Failing
#12179 commented on
Mar 20, 2025 • 0 new comments -
[RFC][Exploratory]: vLLM Neuron Backend with V1 Architecture
#11152 commented on
Mar 20, 2025 • 0 new comments -
[Bug]: collect_env doesn't work in uv environment
#13888 commented on
Mar 20, 2025 • 0 new comments -
[New Model]: Support Zyphra/Zamba2-7B
#9382 commented on
Mar 20, 2025 • 0 new comments -
[New Model]: nvidia/Hymba-1.5B-Base
#10783 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: Is pipeline parallelism supported on machines that are not in the same local network?
#11285 commented on
Mar 22, 2025 • 0 new comments -
[Misc]: What is 'residual' used for in the IntermediateTensor class?
#11364 commented on
Mar 22, 2025 • 0 new comments -
Where does the default KV cache number of 43328 come from, and how can I change it?
#11391 commented on
Mar 22, 2025 • 0 new comments -
[V1] Add code dataset to benchmark the performance of spec decode
#14013 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: Vllm whisper model response_format verbose_json not working
#14818 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Latest Docker build (0.6.2) fails due to VLLM_MAX_SIZE_MB
#9307 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
#3900 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: 'invalid argument' Error with custom_all_reduce when doing tensor parallelism
#9046 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: vLLM response on tool_calls does not align with OpenAI standard
#14951 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Documentation says XLMRobertaForSequenceClassification is supported, but logs say ['XLMRobertaForSequenceClassification'] are not supported for now
#10718 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: Support more video loader
#15011 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: 'Cannot use FA version 2' error because FA3 is only supported on devices with compute capability >= 8, excluding 8.6 and 8.9
#13766 commented on
Mar 21, 2025 • 0 new comments -
[Performance]: only 0.4 tokens/s when running 2 or more request
#15018 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: Clarification on how to use greedy search, and on beam search's poor performance in vLLM
#15146 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: How to eliminate randomness and obtain fixed results with VLLM 0.8
#15205 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: Distributed inference not supported with OpenVINO?
#14933 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: Gemma3 raises an error
#14723 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Ray memory leak
#4241 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Incomplete tool calling response for pipeline-parallel vllm with ray
#7194 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: DeepSeek v3/r1 MTP support PP
#14005 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: expose the tqdm progress bar to enable logging the progress
#6154 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: KeyError: 'layers.0.self_attn.qkv_proj.weight'
#9595 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Qwen2-VL-7B with sglang (vLLM-back) Performance Degradation on MME benchmark
#10588 commented on
Mar 26, 2025 • 0 new comments -
[Installation]: Is there a good solution for deploying gemma-2-27b on V100? The deployment has been consistently unsuccessful
#11462 commented on
Mar 26, 2025 • 0 new comments -
[Usage]: Client-Side Error Handling for VLLM in a Client-Server Architecture
#11487 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Support tool calls for DeepSeek.
#14745 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Implement Concurrent Partial Prefills In V1 Engine
#14003 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: "transformers not installed" when using --guided-decoding-backend lm-format-enforcer
#14401 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Cannot use model with shorter context as draft model
#7859 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout.
#15235 commented on
Mar 25, 2025 • 0 new comments -
[Bug] Mismatch between `get_multimodal_embedding` output and `PlaceholderRange`
#15144 commented on
Mar 25, 2025 • 0 new comments -
[Feature]: Data parallel inference in offline mode (based on Ray)
#14683 commented on
Mar 25, 2025 • 0 new comments -
[Performance]: V1 vs V0 with multi-steps
#11649 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: vLLM 0.7.3 TypeError in vllm.entrypoints.api_server Argument Parsing
#13848 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: Guided choice not working as expected
#12225 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: vllm:request_inference_time_seconds_bucket has too few buckets for long inference requests
#15167 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Major issues with guided generation / structured output in vLLM (up to and including v0.8.1); many examples provided by vllm in /examples and structured_outputs.html doc do not work
#15236 commented on
Mar 25, 2025 • 0 new comments -
[Performance]: logit bias implementation uses a slow for loop
#10741 commented on
Mar 25, 2025 • 0 new comments -
[Installation]: Error occured while installing vllm
#14124 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Qwen/Qwen2.5-1.5B-Instruct generates out of vocabulary tokens
#13175 commented on
Mar 26, 2025 • 0 new comments -
[Installation]: undefined symbol: _ZNK3c1011StorageImpl27throw_data_ptr_access_errorEv
#15010 commented on
Mar 26, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: support tool and reasoning together
#14429 commented on
Mar 26, 2025 • 0 new comments -
[Installation]: uv install not installing FlashInfer anymore
#15158 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Mistral Small 3.1 HF support
#15212 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Extremely low throughput when using pipeline parallelism with a small batch size (running requests)
#9176 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Support Gemma3 GGUF
#14753 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Is vllm compatible with torchrun?
#7939 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: codegemma-7b crashes without error
#13044 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: assert request.num_computed_tokens <= request.num_tokens
#14915 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Enabling LoRA returns garbage output
#14392 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Integrate Writing in the Margins inference pattern ($5,000 Bounty)
#9807 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: vLLM ModelConfig doesn't pass hf_overrides to get_hf_image_processor_config, which could contain auth token for hugging face (not in ENV)
#14854 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
#8893 commented on
Mar 26, 2025 • 0 new comments -
[Performance]: Adding requests takes too much time, and the model will not run until all requests are added into the cache
#13259 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Connection to GPU node times out when initializing Ray vLLM multi-node serving
#13052 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: [AMD] [vLLM=0.7.3] ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected.
#14983 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: ValueError: Cannot unpickle PostGradPassManager
#15089 commented on
Mar 26, 2025 • 0 new comments -
[Misc]: Setting up a WeChat group for efficient discussion; anyone interested is welcome to join
#14928 commented on
Mar 26, 2025 • 0 new comments -
[WIP][RFC]: Use auto-functionalization V2 in PyTorch 2.7+
#14703 commented on
Mar 25, 2025 • 0 new comments -
[Misc]: Molmo inference multi-GPU
#11468 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: How to figure out why vLLM returns nothing while TRT-LLM returns a meaningful result
#11473 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.
#14435 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: ModuleNotFoundError: No module named 'triton' when building docker image for Arm64
#14605 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: CUDA_VISIBLE_DEVICES is not supported
#14807 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: 0.8.0 (V1) Ray cannot find the pyarrow and pandas modules
#15100 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Can't run vLLM model because of FlashAttention
#15238 commented on
Mar 24, 2025 • 0 new comments -
[Usage]: relationship between embedding size and vocab_size
#15131 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Unsloth bitsandbytes quantized model cannot be run due to: `KeyError: 'layers.42.mlp.down_proj.weight.absmax`
#10710 commented on
Mar 24, 2025 • 0 new comments -
[Installation]: Cannot compile vLLM from source on XPU
#14747 commented on
Mar 24, 2025 • 0 new comments -
[Usage]: There is no module or parameter named 'language_model' in Gemma3ForCausalLM
#15031 commented on
Mar 24, 2025 • 0 new comments -
[Feature]: Ability to warm up vLLM instances
#15225 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Llama-3.1-405B-Instruct-FP8 only generates exclamation marks
#13035 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Internal Server Error when using Qwen2-VL-7B with vLLM Docker Container
#15110 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: vllm cannot connect to an external ray cluster
#14349 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Tensor-parallel offline inference fails with CalledProcessError: Command '['/usr/bin/gcc'....] returned non-zero exit status 1.
#15013 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: The service request for vLLM 0.6.4.post1 was prematurely terminated, and it could not output a fixed number of tokens.
#13156 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: ValueError: Unsupported config format: ConfigFormat.AUTO on macOS
#13889 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: AssertionError with Speculative Decoding in vLLM Using DeepSeek R1 Distill Qwen Models
#14939 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: vLLM running on Unspecified Platform raises NotImplementedError when using podman/docker-compose
#14954 commented on
Mar 25, 2025 • 0 new comments -
[Feature]: Add support for reusable subschemas in tool requests (PydanticAI)
#15035 commented on
Mar 25, 2025 • 0 new comments -
[RFC]: Disaggregated prefilling and KV cache transfer roadmap
#10818 commented on
Mar 25, 2025 • 0 new comments -
[RFC]: A proper way to deal with 'Ray does not allocate any GPUs on the driver node' && 'No CUDA GPUs are available' problem
#14610 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: AssertionError - assert loaded_weight.shape[output_dim] == self.org_vocab_size
#15124 commented on
Mar 25, 2025 • 0 new comments -
First tpot/itl is too long?
#15106 commented on
Mar 25, 2025 • 0 new comments -
[Performance]: How to Improve Performance Under Concurrency
#9722 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: error helper for TypeError: _extractNVMLErrorsAsClasses..gen_new..new() takes 1 positional argument but 2 were given
#12906 commented on
Mar 25, 2025 • 0 new comments -
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
#13216 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: can't cache image embeds input
#15209 commented on
Mar 25, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Mar 25, 2025 • 0 new comments -
ExLlamaV2: exl2 support
#3203 commented on
Mar 25, 2025 • 0 new comments -
[New Model]: Google SigLip 2
#13663 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: When using the `guided choice` feature, vllm.engine.async_llm_engine.AsyncEngineDeadError is raised
#8100 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: RuntimeError: Failed to infer device type (Intel Iris Xe Graphics)
#8863 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: AsyncLLMEngine CUDA runtime error 'device-side assert triggered'
#8948 commented on
Mar 25, 2025 • 0 new comments -
[Installation]: Segmentation fault when building Docker container on WSL
#10575 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Crash with Qwen2-Audio Model in vLLM During Audio Processing
#10627 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Prefill/decode separation leads to blocking and crashing under high concurrency
#11445 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: InternVL2-40B Inference Precision Problem
#11454 commented on
Mar 25, 2025 • 0 new comments