Insights: vllm-project/vllm
Overview
1 Release published by 1 person
-
v0.8.2
published
Mar 23, 2025
155 Pull requests merged by 79 people
-
[Bugfix][TPU][V1] Fix recompilation
#15553 merged
Mar 27, 2025 -
[Doc] Use absolute placement for Ask AI button
#15628 merged
Mar 27, 2025 -
[Misc] Avoid direct access of global mm_registry in compute_encoder_budget
#15621 merged
Mar 27, 2025 -
[Feature] Add middleware to log API Server responses
#15593 merged
Mar 27, 2025 -
[Misc] Replace is_encoder_decoder_inputs with split_enc_dec_inputs
#15620 merged
Mar 27, 2025 -
[Doc] Link to onboarding tasks
#15629 merged
Mar 27, 2025 -
[Bugfix] Fix use_cascade_attention handling for Alibi-based models on vllm/v1
#15211 merged
Mar 27, 2025 -
[Model] MiniCPM-V/O supports V1
#15487 merged
Mar 27, 2025 -
[Doc] update --system for transformers installation in docker doc
#15616 merged
Mar 27, 2025 -
Fix incorrect filenames in vllm_compile_cache.py
#15494 merged
Mar 27, 2025 -
[Misc] Use model_redirect to redirect the model name to a local folder.
#14116 merged
Mar 27, 2025 -
[Misc] Clean up scatter_patch_features
#15559 merged
Mar 27, 2025 -
[Quantization] Fp8 Channelwise Dynamic Per Token GroupedGEMM
#15587 merged
Mar 27, 2025 -
[Misc] Consolidate LRUCache implementations
#15481 merged
Mar 27, 2025 -
[TPU] Avoid Triton Import
#15589 merged
Mar 27, 2025 -
[Misc] Restrict ray version dependency and update PP feature warning in V1
#15556 merged
Mar 27, 2025 -
[TPU] [V1] fix cases when max_num_reqs is set smaller than MIN_NUM_SEQS
#15583 merged
Mar 27, 2025 -
[ROCm] Env variable to trigger custom PA
#15557 merged
Mar 27, 2025 -
Allow torchao quantization in SiglipMLP
#15575 merged
Mar 27, 2025 -
[V1] Refactor num_computed_tokens logic
#15307 merged
Mar 27, 2025 -
[moe][quant] add weight name case for offset
#15515 merged
Mar 27, 2025 -
[Doc] Update V1 user guide for fp8 kv cache support
#15585 merged
Mar 27, 2025 -
[misc] LoRA: Remove unused long context test data
#15558 merged
Mar 27, 2025 -
add platform check back
#15578 merged
Mar 27, 2025 -
Add automatic tpu label to mergify.yml
#15560 merged
Mar 27, 2025 -
[Kernel] CUTLASS grouped gemm fp8 MoE kernel
#13972 merged
Mar 27, 2025 -
Support FIPS enabled machines with MD5 hashing
#15299 merged
Mar 27, 2025 -
[TPU] support disabling xla compilation cache
#15567 merged
Mar 27, 2025 -
Use Cache Hinting for fused_moe kernel
#15511 merged
Mar 26, 2025 -
[V1] TPU CI - Fix test_compilation.py
#15570 merged
Mar 26, 2025 -
[V1] TPU - Revert to exponential padding by default
#15565 merged
Mar 26, 2025 -
Applying some fixes for K8s agents in CI
#15493 merged
Mar 26, 2025 -
Support SHA256 as hash function in prefix caching
#15297 merged
Mar 26, 2025 -
[V1][Sampler] Faster top-k only implementation
#15478 merged
Mar 26, 2025 -
[Refactor] Remove passthrough backend when generating grammar
#15317 merged
Mar 26, 2025 -
Fix weight loading for some models in Transformers backend
#15544 merged
Mar 26, 2025 -
multi-node offline DP+EP example
#15484 merged
Mar 26, 2025 -
[Model] Add Reasoning Parser for Granite Models
#14202 merged
Mar 26, 2025 -
Improve validation of TP in Transformers backend
#15540 merged
Mar 26, 2025 -
Apply torchfix
#15532 merged
Mar 26, 2025 -
Separate base model from TransformersModel
#15467 merged
Mar 26, 2025 -
[Misc] improve example script output
#15528 merged
Mar 26, 2025 -
[Misc] Enhance warning information to user-defined chat template
#15408 merged
Mar 26, 2025 -
[FEAT][ROCm] Integrate Fused MoE Kernels from AITER
#14967 merged
Mar 26, 2025 -
[Feature] Enhance EAGLE Architecture with Proper RMS Norms
#14990 merged
Mar 26, 2025 -
Fix raw_request extraction in load_aware_call decorator
#15382 merged
Mar 26, 2025 -
[misc] LoRA - Skip LoRA kernels when not required
#15152 merged
Mar 26, 2025 -
[BugFix] Fix nightly MLA failure (FA2 + MLA chunked prefill, i.e. V1, producing bad results)
#15492 merged
Mar 26, 2025 -
[Misc] Warn about v0 in benchmark_paged_attn.py
#15495 merged
Mar 26, 2025 -
[Model] Support multi-image for Molmo
#15438 merged
Mar 26, 2025 -
Transformers backend already supports V1
#15463 merged
Mar 26, 2025 -
[CI/Build] LoRA: Delete long context tests
#15503 merged
Mar 26, 2025 -
[Core] LoRA: V1 Scheduler optimization
#15422 merged
Mar 25, 2025 -
[core] add bucket padding to tpu_model_runner
#14995 merged
Mar 25, 2025 -
[V1] Support long_prefill_token_threshold in v1 scheduler
#15419 merged
Mar 25, 2025 -
[V1][Minor] Use SchedulerInterface type for engine scheduler field
#15499 merged
Mar 25, 2025 -
[TPU][V1] Fix Sampler recompilation
#15309 merged
Mar 25, 2025 -
Add workaround for shared field_names in pydantic model class
#13925 merged
Mar 25, 2025 -
[bugfix] add supports_v1 platform interface
#15417 merged
Mar 25, 2025 -
[Bugfix] Support triton==3.3.0+git95326d9f for RTX 5090 (Unsloth + vLLM compatibility)
#15471 merged
Mar 25, 2025 -
[CI/Build] Add tests for the V1 tpu_model_runner.
#14843 merged
Mar 25, 2025 -
[bugfix] fix inductor cache on max_position_embeddings
#15436 merged
Mar 25, 2025 -
[Kernel] Fix conflicting macro names for gguf kernels
#15456 merged
Mar 25, 2025 -
[Doc] Update V1 user guide for multi-modality
#15460 merged
Mar 25, 2025 -
[Misc] Remove redundant num_embeds
#15443 merged
Mar 25, 2025 -
[Misc] Clean up MiniCPM-V/O code
#15337 merged
Mar 25, 2025 -
Dockerfile.ppc64le changes to move to UBI
#15402 merged
Mar 25, 2025 -
[Kernel][CPU] CPU MLA
#14744 merged
Mar 25, 2025 -
[Hardware][TPU][Bugfix] Fix v1 mp profiler
#15409 merged
Mar 25, 2025 -
Fix CUDA kernel index data type in vllm/csrc/quantization/gptq_marlin/awq_marlin_repack.cu +10
#15160 merged
Mar 25, 2025 -
[V1][Spec Decode] Update target_logits in place for rejection sampling
#15427 merged
Mar 25, 2025 -
[V1] guidance backend for structured output + auto fallback mode
#14779 merged
Mar 25, 2025 -
[Bugfix] Fixed the issue of not being able to input video and image simultaneously
#15387 merged
Mar 25, 2025 -
Revert "Fix non-contiguous input passed to Marlin kernel (#15319)"
#15398 merged
Mar 25, 2025 -
[Misc] Remove LoRA log
#15388 merged
Mar 25, 2025 -
Add pipeline parallel support to TransformersModel
#12832 merged
Mar 25, 2025 -
[Minor][Spec Decode] Remove compiled_softmax
#15416 merged
Mar 25, 2025 -
[V1][Spec Decode] Enable spec decode for top-p & top-k sampling
#15063 merged
Mar 25, 2025 -
[ROCm][Kernel] MoE weights padding
#14454 merged
Mar 24, 2025 -
[Build] Cython compilation support fix
#14296 merged
Mar 24, 2025 -
[Hardware][TPU] Skip failed compilation test
#15421 merged
Mar 24, 2025 -
[BugFix][V1] Quick fix for min_tokens with multiple EOS
#15407 merged
Mar 24, 2025 -
[V1][Perf] Simpler request output queues
#15156 merged
Mar 24, 2025 -
[Doc] Update docs on handling OOM
#15357 merged
Mar 24, 2025 -
[DOC] Add Kubernetes deployment guide with CPUs
#14865 merged
Mar 24, 2025 -
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral
#12303 merged
Mar 24, 2025 -
[V1] Aggregate chunked prompt logprobs in model runner
#14875 merged
Mar 24, 2025 -
[MISC] Refine no available block debug msg
#15076 merged
Mar 24, 2025 -
[V1][Minor] fix comments
#15392 merged
Mar 24, 2025 -
[Core] Don't force uppercase for VLLM_LOGGING_LEVEL
#15306 merged
Mar 24, 2025 -
[Core] Integrate fastsafetensors loader for loading model weights
#10647 merged
Mar 24, 2025 -
[distributed] fix dp group
#15355 merged
Mar 24, 2025 -
[Bugfix] Fix chat template loading
#15143 merged
Mar 24, 2025 -
Fix zmq IPv6 URL format error
#15341 merged
Mar 24, 2025 -
[Kernel] allow non-contiguous input for marlin kernel
#14658 merged
Mar 24, 2025 -
Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)"
#15377 merged
Mar 24, 2025 -
[Misc] Update guided decoding logs to debug
#15310 merged
Mar 24, 2025 -
[Bugfix][V1] Avoid importing PreTrainedModel
#15366 merged
Mar 24, 2025 -
[Misc] Remove ignore_reinit_error for ray.init()
#15373 merged
Mar 24, 2025 -
[Misc] Upgrade BNB version
#15183 merged
Mar 24, 2025 -
Fix non-contiguous input passed to Marlin kernel
#15319 merged
Mar 24, 2025 -
[Fix] [torch.compile] Improve UUID system for custom passes
#15249 merged
Mar 24, 2025 -
[V1] Enable V1 Fp8 cache for FA3 in the oracle
#15191 merged
Mar 23, 2025 -
[Misc][Doc] Add note regarding loading generation_config by default
#15281 merged
Mar 23, 2025 -
[Frontend] Support tool calling and reasoning parser
#14511 merged
Mar 23, 2025 -
[V1][Spec Decode] Use better defaults for N-gram
#15358 merged
Mar 23, 2025 -
[V1][Spec Decode] Respect prompt_lookup_max
#15348 merged
Mar 23, 2025 -
[Bugfix] fix torch.compiled cache hash error
#14953 merged
Mar 23, 2025 -
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20
#15322 merged
Mar 23, 2025 -
[ci/build] fix broken tests in LLM.collective_rpc
#15350 merged
Mar 23, 2025 -
[ci/build] update torch nightly version for GH200
#15135 merged
Mar 23, 2025 -
[V1][Usage] Refactor speculative decoding configuration and tests
#14434 merged
Mar 23, 2025 -
Fix v1 supported oracle for worker-cls and worker-extension-cls
#15324 merged
Mar 23, 2025 -
[doc] Add back previous news
#15331 merged
Mar 23, 2025 -
Remove openvino support in favor of external plugin
#15339 merged
Mar 22, 2025 -
[BugFix][Typing] Fix Imprecise Type Annotations
#15208 merged
Mar 22, 2025 -
[V1] Add disable-any-whitespace option support for xgrammar
#15316 merged
Mar 22, 2025 -
[Model] Support Tele-FLM Model
#15023 merged
Mar 22, 2025 -
[Bugfix] LoRA V0 - Fix case where max_num_seqs is between cudagraph capture sizes
#15308 merged
Mar 22, 2025 -
[Bugfix] Fix torch.compile raise FileNotFoundError
#15278 merged
Mar 22, 2025 -
[Doc] add load_format items in docs
#14804 merged
Mar 22, 2025 -
[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature
#14959 merged
Mar 22, 2025 -
[Bugfix][V0] Multi-sequence logprobs streaming edge case
#15259 merged
Mar 22, 2025 -
[Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout
#15301 merged
Mar 22, 2025 -
[Build/CI] Fix env var typo
#15305 merged
Mar 21, 2025 -
[TPU][V1] MHA Pallas backend
#15288 merged
Mar 21, 2025 -
Revert "[Feature] specify model in config.yaml (#14855)"
#15293 merged
Mar 21, 2025 -
[Bugfix][VLM] fix llava processor
#15285 merged
Mar 21, 2025 -
[v1] Refactor KVCacheConfig
#14079 merged
Mar 21, 2025 -
[Misc] Add cProfile helpers
#15074 merged
Mar 21, 2025 -
[Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend
#15282 merged
Mar 21, 2025 -
[V1] Fix wrong import path of get_flash_attn_version
#15280 merged
Mar 21, 2025 -
[Bugfix] Fix incorrect resolving order for transformers fallback
#15279 merged
Mar 21, 2025 -
[Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL
#15273 merged
Mar 21, 2025 -
[Bugfix] Add int8 torch dtype for KVCache
#15260 merged
Mar 21, 2025 -
[Feature] specify model in config.yaml
#14855 merged
Mar 21, 2025 -
[V1] Avoid redundant input processing in n>1 case
#14985 merged
Mar 21, 2025 -
[Doc] Update LWS docs
#15163 merged
Mar 21, 2025 -
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs
#14071 merged
Mar 21, 2025 -
[Hardware][TPU] Add check for no additional graph compilation during runtime
#14710 merged
Mar 21, 2025 -
Add an example for reproducibility
#15262 merged
Mar 21, 2025 -
[Misc] Better RayExecutor and multiprocessing compatibility
#14705 merged
Mar 21, 2025 -
[Docs] Trim the latest news in README
#15261 merged
Mar 21, 2025 -
[Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies
#14857 merged
Mar 21, 2025 -
[Bugfix] detect alibi and revert to FA2
#15231 merged
Mar 21, 2025 -
[V1][TPU] Speed up top-k on TPU by using torch.topk
#15242 merged
Mar 21, 2025 -
Mention extra_body as a way to pass vLLM-only parameters using the OpenAI client
#15240 merged
Mar 21, 2025 -
[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation
#15200 merged
Mar 21, 2025 -
[ROCM] Upgrade torch to 2.6
#15244 merged
Mar 21, 2025 -
[Misc] Clean up the BitsAndBytes arguments
#15140 merged
Mar 21, 2025 -
Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10
#15159 merged
Mar 21, 2025 -
[CI/Build] LoRA : make add_lora_test safer
#15181 merged
Mar 21, 2025 -
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface
#15250 merged
Mar 21, 2025 -
Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled.
#14617 merged
Mar 21, 2025 -
[V1] Add flag to disable cascade attention
#15243 merged
Mar 20, 2025
87 Pull requests opened by 73 people
-
[sleep mode] clear pytorch cache after sleep
#15248 opened
Mar 20, 2025 -
[Misc] fix collect_env version parse
#15267 opened
Mar 21, 2025 -
[V1] Scheduler Refactoring [2/N] - Introduce CommonSchedulerStates
#15271 opened
Mar 21, 2025 -
Add missed ray[data] dependence in cuda.txt
#15283 opened
Mar 21, 2025 -
[Model] Add Qwen3 and Qwen3MoE
#15289 opened
Mar 21, 2025 -
Fix Transformers backend compatibility check
#15290 opened
Mar 21, 2025 -
[Bugfix] utils: no bool(module) & pid may be None
#15292 opened
Mar 21, 2025 -
[V0][Bugfix] Fix Mamba cache crashing
#15296 opened
Mar 21, 2025 -
set UV_PYTHON_INSTALL_DIR to a world readable/executable location
#15302 opened
Mar 21, 2025 -
[Misc]add coding benchmark for speculative decoding
#15303 opened
Mar 21, 2025 -
[Misc] Enable V1 LoRA by default
#15320 opened
Mar 22, 2025 -
fix test_phi3v
#15321 opened
Mar 22, 2025 -
Fix DP group creation and compatibility with external_dp (#15176)
#15323 opened
Mar 22, 2025 -
unittests for `FullAttentionSpec` to test `use_mla` param
#15325 opened
Mar 22, 2025 -
[Bugfix] Handle `process_weights_after_loading` for `QKVCrossParallelLinear`
#15328 opened
Mar 22, 2025 -
[V1][Spec Decode] Eagle interface
#15334 opened
Mar 22, 2025 -
[P/D Disaggregation] `PDController` and `PDWorker` Prototype (1p1d)
#15343 opened
Mar 22, 2025 -
Vllm v1 eagle proposer
#15346 opened
Mar 23, 2025 -
[V1] Fully Transparent Implementation of CPU Offloading
#15354 opened
Mar 23, 2025 -
[V1][Spec Decode] Remove warning on N-gram
#15361 opened
Mar 23, 2025 -
[Bugfix][V1] Fix bug from putting llm_engine.model_executor in a background process
#15367 opened
Mar 23, 2025 -
[Bugfix] Fix regex compile display format
#15368 opened
Mar 24, 2025 -
[Model] Support Skywork-R1V
#15397 opened
Mar 24, 2025 -
[TPU][V1] Guided decoding on TPU
#15401 opened
Mar 24, 2025 -
[Doc] Add multi-modal development example for encoder-decoder models
#15405 opened
Mar 24, 2025 -
[ROCm][Bugfix] Bring back fallback to eager mode removed in #14917, but for ROCm only
#15413 opened
Mar 24, 2025 -
[V1] TPU CI - Add basic perf regression test
#15414 opened
Mar 24, 2025 -
[Bugfix]: Fix Prometheus spec decode counter sum-of-sums
#15415 opened
Mar 24, 2025 -
[Model] Reduce redundant computations in mamba2 blocks for Bamba-9B
#15423 opened
Mar 25, 2025 -
[Core] [Bugfix] Add Input Embeddings
#15428 opened
Mar 25, 2025 -
[CI] [1/N] Fix Distributed Tests
#15431 opened
Mar 25, 2025 -
[FEAT] [ROCm] Add AITER int8 scaled gemm kernel
#15433 opened
Mar 25, 2025 -
Added the option of returning hidden states
#15434 opened
Mar 25, 2025 -
[Draft] Aya Vision
#15441 opened
Mar 25, 2025 -
[V1] [Feature] Collective RPC
#15444 opened
Mar 25, 2025 -
[P/D Disaggregation] XpYd based on point-to-point communication
#15448 opened
Mar 25, 2025 -
[Misc] Improve cli help show
#15455 opened
Mar 25, 2025 -
[Metrics] Hide deprecated metrics
#15458 opened
Mar 25, 2025 -
[Bugfix][Frontend] Fix pythonic tool parser failure with negative numbers
#15462 opened
Mar 25, 2025 -
[V1][Spec Decode] Remove deprecated spec decode config params
#15466 opened
Mar 25, 2025 -
Enhance logits processor to add additional data
#15473 opened
Mar 25, 2025 -
[ not for land] For richard
#15474 opened
Mar 25, 2025 -
[Bugfix][Frontend] respect provided default guided decoding backend
#15476 opened
Mar 25, 2025 -
[WIP][Model] Refactor Phi-4-multimodal to use merged processor and support V1
#15477 opened
Mar 25, 2025 -
Quantized Custom Allreduce
#15479 opened
Mar 25, 2025 -
Different device CG support
#15482 opened
Mar 25, 2025 -
[Misc] simple_connector.py: more efficient use of GPU memory in send
#15485 opened
Mar 25, 2025 -
[V1] Fix json_object support with xgrammar
#15488 opened
Mar 25, 2025 -
[V1][TPU] Enable Top K
#15489 opened
Mar 25, 2025 -
[V1][Draft] Jump-forward decoding
#15490 opened
Mar 25, 2025 -
[core] Add tags parameter to wake_up()
#15500 opened
Mar 25, 2025 -
Adding Share Expert Fusion for DeepSeek
#15502 opened
Mar 25, 2025 -
Add inference_benchmark_script.sh
#15504 opened
Mar 25, 2025 -
[Model] Support Mistral3 in the HF Transformers format
#15505 opened
Mar 25, 2025 -
[BugFix] fix speculative decoding memory leak when speculation is disabled
#15506 opened
Mar 25, 2025 -
[Minor] QoL for Benchmarking
#15512 opened
Mar 26, 2025 -
[Bugfix]: The sequence becomes shorter after encoding and decoding
#15516 opened
Mar 26, 2025 -
Improve expert parallelism placement
#15517 opened
Mar 26, 2025 -
update dockerfile to add tzdata
#15522 opened
Mar 26, 2025 -
track http service error count
#15523 opened
Mar 26, 2025 -
WIP: [Frontend] add function calling strict
#15535 opened
Mar 26, 2025 -
[Bugfix] Fix missing return value in load_weights method of adapters.py
#15542 opened
Mar 26, 2025 -
[Frontend] fix streaming tool output lose 2 token bug #15545
#15546 opened
Mar 26, 2025 -
[Bugfix] Fix profile deadlock when ray backend and num-scheduler-steps > 1
#15548 opened
Mar 26, 2025 -
[Bugfix][Model] Add SupportsQuant interface to Mixtral
#15552 opened
Mar 26, 2025 -
[Bugfix] Fix Mllama interleaved images input support
#15564 opened
Mar 26, 2025 -
[SupportsQuant] Bert, Blip, Blip2, Bloom
#15573 opened
Mar 26, 2025 -
[Bugfix] Do not pad multi-modal encoder sequence dummy data
#15574 opened
Mar 26, 2025 -
[Misc] cli auto show default value
#15582 opened
Mar 26, 2025 -
[V1] Support disable_any_whitespace for guidance backend
#15584 opened
Mar 27, 2025 -
Re-enable the AMD Entrypoints Test (2025-03-27)
#15586 opened
Mar 27, 2025 -
[Frontend] update priority for --api-key and VLLM_API_KEY
#15588 opened
Mar 27, 2025 -
[XPU][Bugfix] fix _k_scale_float/_v_scale_float in ipex_attn
#15591 opened
Mar 27, 2025 -
[Bugfix][v1] xgrammar structured output supports Enum.
#15594 opened
Mar 27, 2025 -
[DO NOT REVIEW YET] Merge k_cache and v_caching into one.
#15595 opened
Mar 27, 2025 -
[Bugfix] Correct KV cache tensor dimension handling in FlashInfer backend's block operations
#15603 opened
Mar 27, 2025 -
[V1] Support interleaved modality items
#15605 opened
Mar 27, 2025 -
[Quantization][V1] BitsAndBytes support V1
#15611 opened
Mar 27, 2025 -
[Bugfix] add hf_token to EngineArgs
#15615 opened
Mar 27, 2025 -
[Model] Adding torch compile annotations to chatglm
#15624 opened
Mar 27, 2025 -
Enable Outlines with JSON Sub-Schema References
#15627 opened
Mar 27, 2025 -
[ROCm][AMD][Build] Update AMD supported arch list
#15632 opened
Mar 27, 2025 -
[CI] Update rules for applying `tpu` label.
#15634 opened
Mar 27, 2025 -
Correct PowerPC to modern IBM Power
#15635 opened
Mar 27, 2025 -
[Doc] Fix dead links in Job Board
#15637 opened
Mar 27, 2025 -
[NO REVIEW PLEASE] Kv
#15638 opened
Mar 27, 2025
155 Issues closed by 56 people
-
[New Model]: Please support Babel series model ASAP
#15612 closed
Mar 27, 2025 -
[Usage]: How can I determine the maximum number of concurrent requests?
#8031 closed
Mar 27, 2025 -
[Usage]: Got nccl error when deploy vllm in k8s with multiple GPUs
#7466 closed
Mar 27, 2025 -
[Usage]: how to abort request and stop inference?
#6975 closed
Mar 27, 2025 -
[Usage]: What do max_num_seqs and max_model_len do
#6641 closed
Mar 27, 2025 -
[Doc]: documenting flash attention 1 vs 2 in env vars
#15344 closed
Mar 27, 2025 -
[Feature]: Output the JSON for the response payload when VLLM_LOGGING_LEVEL=DEBUG
#15571 closed
Mar 27, 2025 -
[Usage]: how to reduce the number of processes of compile_worker
#14808 closed
Mar 27, 2025 -
[Usage]: Upgrading from vLLM 0.7.3 to vLLM 0.8.2, but the required GPU memory significantly increases.
#15617 closed
Mar 27, 2025 -
[Installation]: Transformer installation requires uv venv --system now
#15550 closed
Mar 27, 2025 -
tracking torch.compile compatibility with cpu offloading
#10612 closed
Mar 27, 2025 -
tracking torch.compile compatibility with lora serving
#10617 closed
Mar 27, 2025 -
[Bug]: Question about loading the Qwen1.5-MoE-A2.7B model
#15561 closed
Mar 27, 2025 -
[Bug]: Deploying the Qwen2_vl service with vLLM 0.7.0 and later has a memory leak (host memory, not GPU memory)
#15597 closed
Mar 27, 2025 -
[Feature]: Consolidate `LRUCache` implementations
#14927 closed
Mar 27, 2025 -
[Bug]: qwen2-vl with lora is not starting
#13135 closed
Mar 27, 2025 -
[Bug]: Error loading bitsandbytes 4bit model when the quant_storage is torch.bfloat16
#10590 closed
Mar 27, 2025 -
[RFC]: Support KV Cache Compaction
#10646 closed
Mar 27, 2025 -
[Feature]: Mixtral manual `head_dim`
#10649 closed
Mar 27, 2025 -
[Bug]: vllm infer for Qwen2-VL-72B-Instruct-GPTQ-Int8
#10650 closed
Mar 27, 2025 -
[Bug]: Inference is exceptionally slow on the L20 GPU
#10652 closed
Mar 27, 2025 -
[Bug]: AMD GPU RX 7900XT: Failed to infer device type
#10653 closed
Mar 27, 2025 -
[Usage]: Cannot use xformers with old GPU
#10662 closed
Mar 27, 2025 -
[RFC]: Create `VllmState` to save immutable args in `VllmConfig`
#10666 closed
Mar 27, 2025 -
[Usage]: how to get every output token score?
#10670 closed
Mar 27, 2025 -
[Bug]: vLLM returning 415 status code at high load
#14333 closed
Mar 26, 2025 -
[Bug]: DeepSeek-R1-AWQ gets stuck with all tokens rejected when MTP is enabled.
#13704 closed
Mar 26, 2025 -
[Usage]: where to find the official vLLM CPU image
#14756 closed
Mar 26, 2025 -
[Usage]: How to run format check locally?
#15472 closed
Mar 26, 2025 -
[Performance][RFC]: Improving paged attention kernel's performance
#15351 closed
Mar 26, 2025 -
[Bug]: ValueError: not enough values to unpack (expected 22, got 21) when deploying DeepSeekV3
#15453 closed
Mar 26, 2025 -
[Feature]: Add Warning for Chat Template Mismatches similar to SGLang
#15395 closed
Mar 26, 2025 -
[Usage]: How to make sure the timeout takes effect
#14792 closed
Mar 26, 2025 -
[Bug]: Error occurred in v1/rerank interface after upgrading from version 0.7.3 to 0.8.1
#15371 closed
Mar 26, 2025 -
[Usage]: ModuleNotFoundError: No module named 'triton'
#14888 closed
Mar 26, 2025 -
[Doc]: APIConnectionError with OpenAI
#15518 closed
Mar 26, 2025 -
Cupy Import errors in Docker
#3184 closed
Mar 26, 2025 -
[Feature][Chunked Prefill]: Enable cuda graph for chunked prefill.
#4056 closed
Mar 26, 2025 -
[Feature]: Initial LLM token
#5609 closed
Mar 26, 2025 -
[Installation]: Meet bugs when installing from source
#8852 closed
Mar 26, 2025 -
[Bug]: With the same input to a qwen2.5 server, SSE output is correct on vLLM 0.6.1.post2 but wrong on 0.6.3.post1?
#10280 closed
Mar 26, 2025 -
[Bug]: VLLLm crash when running Qwen/Qwen2.5-Coder-32B-Instruct on two H100 GPUs
#10296 closed
Mar 26, 2025 -
[Usage]: Use difference SamplingParams for each sample in batch inference via openai api
#10578 closed
Mar 26, 2025 -
[Feature]: if vllm supports explicitly specifying GPU devices for a model instance.
#10638 closed
Mar 26, 2025 -
[Feature][Hardware][TPU]: Improve the token_num padding logic
#14581 closed
Mar 25, 2025 -
[Bug]: top_logprobs generating a WARNING
#13880 closed
Mar 25, 2025 -
[Bug]: vllm v0.7.3 - The following fields were present in the request but ignored: {'top_logprobs'}
#13881 closed
Mar 25, 2025 -
[Feature][Hardware][TPU]: Add Recompilation Check for vLLM on TPU
#14580 closed
Mar 25, 2025 -
[Bug]: Problem guided decoding (regex)
#15210 closed
Mar 25, 2025 -
[Bug]: Error when use vllm in distributed environment
#15399 closed
Mar 25, 2025 -
[Bug]: V1 cannot be run in Triton Inference Server Backend
#12690 closed
Mar 25, 2025 -
[Bug]: Build error, nvcc error : 'ptxas' died due to signal 11 (Invalid memory reference)
#15452 closed
Mar 25, 2025 -
[Usage]: vLLM Whisper support Sequential algorithm?
#15454 closed
Mar 25, 2025 -
[Feature]: Rootless container for OpenShift compatibility
#15206 closed
Mar 25, 2025 -
[Bug]: DeepSeek-R1-AWQ broken in nightly
#15002 closed
Mar 25, 2025 -
[Bug]: Qwen2.5 VL online service can not input video and image simultaneously.
#15291 closed
Mar 25, 2025 -
[Installation]: Error when importing LLM from vllm
#5086 closed
Mar 25, 2025 -
[Feature]: need a GB-based alternative for gpu_memory_utilization
#7524 closed
Mar 25, 2025 -
[Usage]: While loading model get 'layers.0.mlp.down_proj.weight' after merge_and_unload()
#10598 closed
Mar 25, 2025 -
[Bug]: Ray+vllm run, then crash
#13535 closed
Mar 24, 2025 -
[Misc]: [V1] prompt logprobs + chunked prefill can result in `EngineCore` partial prefill output
#14239 closed
Mar 24, 2025 -
flashinfer backend, not callable NoneType object
#15389 closed
Mar 24, 2025 -
[Bug]: external_dp blocks normal DP group creation
#15176 closed
Mar 24, 2025 -
[Bug]: Qwen2.5-VL mm_processor_kwargs not respected
#15364 closed
Mar 24, 2025 -
[Bug]: LoRA Loading Error: 'GPUModelRunner' object has no attribute 'lora_manager'
#15400 closed
Mar 24, 2025 -
[Usage]:
#15390 closed
Mar 24, 2025 -
[Bug]: loading the default chat template raises TypeError: unhashable type: 'dict'
#15095 closed
Mar 24, 2025 -
[Bug]: if chat_template loaded from disk, jinja exception thrown from _try_extract_ast()
#14884 closed
Mar 24, 2025 -
[Bug]: Executor performance degradation
#15356 closed
Mar 24, 2025 -
[Bug]: Docker image in trunk cannot find libpython.so
#14991 closed
Mar 24, 2025 -
[Installation]: python is missing inside the v0.8.0 docker
#15088 closed
Mar 24, 2025 -
[Misc]: missing python inside the container v0.8.1
#15174 closed
Mar 24, 2025 -
[Bug]: Can't create non-root user using vllm/vllm-openai:v0.8.1 as a base image
#15359 closed
Mar 24, 2025 -
[Bug]: LoRA request raise CUDA OutOfMemoryError when input token > 8k
#15039 closed
Mar 24, 2025 -
[Bug]: GGUF model with architecture deepseek2 is not supported yet while vllm version is 0.8.1
#15277 closed
Mar 24, 2025 -
[Bug]: leading space within content via OpenAI Compatible Server
#3935 closed
Mar 24, 2025 -
[Bug]: Llama-3.2-11B-Vision-Instruct Inference Can't Stop
#9752 closed
Mar 24, 2025 -
[Performance]: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
#10592 closed
Mar 24, 2025 -
[Bug]: Memory allocation with echo=True
#10596 closed
Mar 24, 2025 -
[Bug]: Cannot unpickle PostGradPassManager
#15223 closed
Mar 24, 2025 -
[Usage]: Where is the entry point of the program
#15352 closed
Mar 23, 2025 -
[Bug]: Critical Memory Leak in vLLM V1 Engine: 200+ GB RAM Usage from Image Inference
#15294 closed
Mar 23, 2025 -
[Bug]: int8 2:4 sparse takes more time than fp8
#15275 closed
Mar 23, 2025 -
[Usage]: Async engine batch request no usage
#15363 closed
Mar 23, 2025 -
[Usage]: Async engine batch request
#15314 closed
Mar 23, 2025 -
[Feature]: support Mistral-Large-Instruct-2407 function calling
#6778 closed
Mar 23, 2025 -
[Feature]: Llama 3 and Command-R Chat Templates
#9904 closed
Mar 23, 2025 -
[Bug]: With vLLM running, function calling via OpenAI's Swarm does not work correctly
#10015 closed
Mar 23, 2025 -
[Usage]: Sending in pre-tokenized question during inference doesn't seem any faster than raw text.
#10084 closed
Mar 23, 2025 -
[Misc]: Invariant encountered: value was None when it should not be
#10284 closed
Mar 23, 2025 -
[Bug]: Input prompt (35247 tokens) is too long and exceeds limit of 1000
#10440 closed
Mar 23, 2025 -
[Usage]: KVcache usage for different tasks in batch
#10509 closed
Mar 23, 2025 -
[Bug]: Model does not split across multiple GPUs; instead it occupies the same memory on each GPU
#10516 closed
Mar 23, 2025 -
[Bug]: Gemma2 becomes a fool.
#10525 closed
Mar 23, 2025 -
[Feature]: How to run speculative models with tensor parallelism?
#10562 closed
Mar 23, 2025 -
[Feature]: `torch>=2.6` support for expanded Python 3.13 support
#13434 closed
Mar 22, 2025 -
[RFC]: Drop Support for OpenVINO
#14374 closed
Mar 22, 2025 -
[Feature]: Disabling V1: unsupported structured decoding backend - xgrammar:disable-any-whitespace
#15252 closed
Mar 22, 2025 -
[Usage]: Why is Speculative decoding not compatible with Pipeline Parallelism?
#14089 closed
Mar 22, 2025 -
[Performance]: The impact of CPU on vLLM performance is significant.
#8147 closed
Mar 22, 2025 -
[Feature]: host wheel via pypi index?
#9831 closed
Mar 22, 2025 -
[Bug]: torch.compile raise FileNotFoundError when VLLM_DISABLE_COMPILE_CACHE=1
#15276 closed
Mar 22, 2025 -
[Bug]: Can't deserialize object reported by ray, H800*16 DeepSeek R1
#15199 closed
Mar 22, 2025 -
[Feature]: load/unload API to run multiple LLMs in a single GPU instance
#5491 closed
Mar 22, 2025 -
[RFC]: Add support for IBM Spyre accelerator
#9652 closed
Mar 22, 2025 -
[Bug]: ValueError: No available memory for the cache blocks on main branch after commit 46f98893
#14992 closed
Mar 22, 2025 -
[Usage]: What's the best practice of deploying DeepSeekV3 using vllm?
#14614 closed
Mar 22, 2025 -
[Misc]: How to access the KV cache directly?
#4156 closed
Mar 22, 2025 -
[Usage]: when I set --tensor-parallel-size 4, the OpenAI server does not work and reports a new Exception
#10521 closed
Mar 22, 2025 -
[Feature]: Additional possible value for `tool_choice`: `required`
#10526 closed
Mar 22, 2025 -
[Bug]: Gemma3 is not supported by the V1 Engine
#15298 closed
Mar 22, 2025 -
[Bug]: RuntimeError: Phi4MM cannot process x audios and x images in a prompt
#14506 closed
Mar 22, 2025 -
[Bug]: [TPU] Prefix caching + w8a8 + long context results in degraded performance and corrupted output
#12371 closed
Mar 21, 2025 -
[Bug]: Loading a model with bitsandbytes 8bit quantization
#8799 closed
Mar 21, 2025 -
[Bug]: tests/v1/tpu/test_sampler.py crashes due to ragged_paged_attention arg mismatch
#15257 closed
Mar 21, 2025 -
[Installation]: installation succeeded but No module named 'vllm._C'
#15286 closed
Mar 21, 2025 -
[Bug]: Temperature is ignored in vLLM 0.8.0/0.8.1
#15241 closed
Mar 21, 2025 -
[Performance]: vLLM 0.6.5 with GLM4-9B-Chat and dynamically loaded LoRA shows a large inference slowdown on long text input
#11317 closed
Mar 21, 2025 -
[Feature]: Apply tool calling after reasoning steps in Reasoning models.
#14490 closed
Mar 21, 2025 -
[Feature]: specify model only in config.yaml
#14819 closed
Mar 21, 2025 -
[Bug]: v1 speculate decoding NgramProposer experiences service exceptions during stress testing
#14742 closed
Mar 21, 2025 -
[Bug]: Why are the vLLM and Hugging Face Transformers inference results inconsistent?
#12343 closed
Mar 21, 2025 -
[Usage]: vllm v0.7.2 can not support baichuan2 model
#13810 closed
Mar 21, 2025 -
Better defaults to match Hugging Face
#2733 closed
Mar 21, 2025 -
[Bug]: Qwen2.5-VL Cannot Output Correct Content under 0.8.1
#15197 closed
Mar 21, 2025 -
[Bug]: Qwen VL 2.5 doesn't work in v0.8.0 - again
#15122 closed
Mar 21, 2025 -
[Bug]: LoRA response's model name is incorrect
#7260 closed
Mar 21, 2025 -
[Feature]: Integrate `flash-infer` FP8 KV Cache Chunked-Prefill (Append Attention)
#7450 closed
Mar 21, 2025 -
[Feature]: KVPress
#10491 closed
Mar 21, 2025 -
Metrics model name when using multiple loras
#10504 closed
Mar 21, 2025 -
[Feature]: OpenAI Response API
#15237 closed
Mar 20, 2025
134 Issues opened by 119 people
-
[Bug]: Outlines broken on vLLM 0.8+
#15636 opened
Mar 27, 2025 -
[Feature]: Support loading LoRA adapters directly from s3 bucket
#15633 opened
Mar 27, 2025 -
[Bug]: Can't load any LLM with v0.8.*
#15631 opened
Mar 27, 2025 -
[Bug]: v1 flash_attn and triton_attn backends don't have `get_state_cls`
#15630 opened
Mar 27, 2025 -
DeciLMConfig object has no attribute ‘num_key_value_heads_per_layer’ For Nemotron
#15625 opened
Mar 27, 2025 -
[Bug]: vllm 0.8.2 has severe quality problems
#15622 opened
Mar 27, 2025 -
[Bug]: Triton JIT Compile Regression from PR 15511
#15619 opened
Mar 27, 2025 -
[Usage]: How to output DeepSeek's reasoning normally while returning the final content as structured output
#15618 opened
Mar 27, 2025 -
[Installation]: ValueError: size must contain 'shortest_edge' and 'longest_edge' keys.
#15614 opened
Mar 27, 2025 -
[Bug]: Empty content when sending inference requests to Gemma3 deployed on a T4 GPU
#15610 opened
Mar 27, 2025 -
[Bug]: Failed to run deepseek v2 lite model with tp = 4
#15607 opened
Mar 27, 2025 -
[Usage]: Will dynamo be on vllm main branch?
#15606 opened
Mar 27, 2025 -
[Bug]: Failed to run deepseek v2 lite model with tp = 8 when enabling expert parallel
#15604 opened
Mar 27, 2025 -
How to install and use vLLM to serve multiple large language models
#15602 opened
Mar 27, 2025 -
[Bug]: Qwen2-VL-2B quantized model shows no improvement in inference speed compared to the original model
#15601 opened
Mar 27, 2025 -
[V1] [Performance Benchmark] Benchmark the performance of Speculative Decoding
#15600 opened
Mar 27, 2025 -
[Bug]: Gemma3 GPU memory usage is always oom
#15599 opened
Mar 27, 2025 -
[Bug]: Model Reasoning Warning
#15596 opened
Mar 27, 2025 -
[Bug]:ModuleNotFoundError: No module named 'vllm._C'
#15592 opened
Mar 27, 2025 -
[Bug]: DeepSeek R1 with V1+FLASHMLA on L40S
#15590 opened
Mar 27, 2025 -
[Bug]: guided_json not working correctly with (quantized) mistral-small model
#15577 opened
Mar 26, 2025 -
TP4 fails with 5090 in the mix
#15576 opened
Mar 26, 2025 -
[Bug]: Vllm 0.8.2 + Ray 2.44 (Ray serve deployment) fallbacks to V0 Engine
#15569 opened
Mar 26, 2025 -
[Bug]: --api-key argument ignored when VLLM_API_KEY is set
#15568 opened
Mar 26, 2025 -
[Feature]: Ring Attention for Long Context in vLLM - RL Applications Focus
#15566 opened
Mar 26, 2025 -
[New Model]: please add support for Qwen/Qwen2.5-Omni-7B
#15563 opened
Mar 26, 2025 -
[Feature]: LMCache support to the CPU version of vLLM
#15562 opened
Mar 26, 2025 -
[Bug][V1]: ngram + guided decoding
#15554 opened
Mar 26, 2025 -
[Bug]: Structured Output not working with MistralTokenizer (vLLM 0.8.2, V1)
#15551 opened
Mar 26, 2025 -
[Bug]: Tools parsing issues with mistral3.1
#15549 opened
Mar 26, 2025 -
[Installation]: flaky publishing of cpu image
#15547 opened
Mar 26, 2025 -
[Bug]: When streaming tool call output, the stream loses 2 tokens of tool data
#15545 opened
Mar 26, 2025 -
[New Model]: HuggingFaceTB/SmolVLM2-2.2B-Instruct
#15541 opened
Mar 26, 2025 -
[Bug]: Error when inference on llava-1.6-34B
#15539 opened
Mar 26, 2025 -
[Usage]: How to get "num_gpu_blocks" in V1?
#15538 opened
Mar 26, 2025 -
[Bug]: Distributed Inference and Serving BUG
#15537 opened
Mar 26, 2025 -
[Bug]: Qwen2.5-VL-32B, Following weights were not initialized from checkpoint
#15536 opened
Mar 26, 2025 -
[Bug]: `TypeError: Unknown video model type: phi4mm`
#15534 opened
Mar 26, 2025 -
[Usage]: Evaluation score under V1 engine is low
#15533 opened
Mar 26, 2025 -
[Installation]: install vllm with CUDA 12.8 in 5090D error
#15531 opened
Mar 26, 2025 -
[Bug][Triton MLA]: Some calculation errors in the triton mla kernel
#15530 opened
Mar 26, 2025 -
[Usage]: Qwen2.5-VL-32B-Instruct fails to start on 4x RTX 4090
#15529 opened
Mar 26, 2025 -
[Bug]: FunctionDefinition missing optional param strict
#15526 opened
Mar 26, 2025 -
[Bug]: VLLM_NCCL_SO_PATH takes no effect when spawning workers
#15525 opened
Mar 26, 2025 -
[Feature]: Reason model reasoning effort feature like OpenAI
#15524 opened
Mar 26, 2025 -
[Usage]: distributed using ray, how to get worker runtime error log
#15514 opened
Mar 26, 2025 -
[Bug]:
#15513 opened
Mar 26, 2025 -
[Bug]: Embed model has additional dense module(dim=1792, but only 1024)
#15509 opened
Mar 26, 2025 -
[Bug]: Support Bitsandbytes weight loading when offline (via huggingface cache)
#15507 opened
Mar 25, 2025 -
[Doc]: Troubleshooting guide incorrect hardware script fails
#15498 opened
Mar 25, 2025 -
[Bug]: Llama-3.2-11B-Vision-Instruct has an issue in vision language embedding
#15496 opened
Mar 25, 2025 -
[Bug]: Allow flexible message role ordering in conversations (user/assistant in any sequence)
#15486 opened
Mar 25, 2025 -
[Bug]: vLLM engine crashes then restarts and loads the model on sleep if a chat request is made
#15483 opened
Mar 25, 2025 -
[Bug]: Unknown gguf model_type: gemma3
#15480 opened
Mar 25, 2025 -
[Bug]: Qwen2 MoE inference is super slow
#15470 opened
Mar 25, 2025 -
[Usage]:
#15469 opened
Mar 25, 2025 -
[Usage]:Phi-4-multimodal-instruct
#15468 opened
Mar 25, 2025 -
[Feature]: Embedding API dimensions is currently not supported.
#15465 opened
Mar 25, 2025 -
[Doc]: https://docs.vllm.ai/en/latest/deployment/k8s.html not working
#15461 opened
Mar 25, 2025 -
[Feature]: preprocessing of weights in advance
#15459 opened
Mar 25, 2025 -
[Bug]: vllm 0.8.3 serve error
#15457 opened
Mar 25, 2025 -
[Usage]: vLLM server hangs on startup
#15451 opened
Mar 25, 2025 -
[Installation]: RuntimeError: Unknown runtime environment
#15450 opened
Mar 25, 2025 -
[Usage]: Question about Interleaved Text/Image Format in Online Inference
#15449 opened
Mar 25, 2025 -
[Usage]: vllm Qwen 2.5 VL output is so different from the original Qwen 2.5VL
#15447 opened
Mar 25, 2025 -
[Bug]: v1 fails when token budget is set to 128K
#15446 opened
Mar 25, 2025 -
[Performance]: Regarding the issue of context length for QWQ-32B in different distributed environments:
#15442 opened
Mar 25, 2025 -
[Usage]: `Phi-4-multimodal-instruct` activate LoRA module but get mangled text output
#15440 opened
Mar 25, 2025 -
[Usage]: Serve From Hard disk and folder Path issue
#15439 opened
Mar 25, 2025 -
[Installation]: Fail to build vLLM from source on H100
#15435 opened
Mar 25, 2025 -
[Feature]: [V1] Collective RPC
#15430 opened
Mar 25, 2025 -
[Usage]: online server requests do not return token usage information in version 0.7.2
#15426 opened
Mar 25, 2025 -
[New Model]: Baichuan-Audio
#15425 opened
Mar 25, 2025 -
[New Model]: glm-4-voice-9b
#15424 opened
Mar 25, 2025 -
[Bug]: logprobs/ranks not matching when comparing `vllm` with `transformers`
#15420 opened
Mar 24, 2025 -
[Feature]: Limit thinking tokens
#15418 opened
Mar 24, 2025 -
[Feature]: Implement Embedding Models in V1
#15406 opened
Mar 24, 2025 -
[Bug]: `Phi-4-multimodal-instruct` encoder outputs didn't have the same length as defined in input_ids
#15404 opened
Mar 24, 2025 -
[Feature]: JSON based tool calling for Gemma 3
#15403 opened
Mar 24, 2025 -
[Bug]: RequestMetrics object (accessed through output[0].metrics) is None
#15394 opened
Mar 24, 2025 -
[Bug]: Batch embedding inference is inconsistent with hf
#15393 opened
Mar 24, 2025 -
[Bug]: vllm V1 pipeline parallel not compatible with ray==2.44.0
#15391 opened
Mar 24, 2025 -
[Bug]: awq Deepseek-R1-AWQ The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
#15386 opened
Mar 24, 2025 -
[Bug]: gcs_rpc_client.h:151: Failed to connect to GCS at address 192.168.2.40:6379 within 5 seconds.
#15385 opened
Mar 24, 2025 -
[Feature]: Request for Support of Dense and Sparse Features in bge-m3 Embedding Model
#15384 opened
Mar 24, 2025 -
[Bug]: Different logprobs output behaviour under vllm 0.8.0 and 0.8.1
#15381 opened
Mar 24, 2025 -
[Usage][UT]:Why the answer is ' 0, 1'
#15380 opened
Mar 24, 2025 -
[Feature]: Support Top-nσ sampling
#15379 opened
Mar 24, 2025 -
[Feature]: Add CoT dataset to the benchmark
#15378 opened
Mar 24, 2025 -
[Bug]: VLLM Build Using Docker Error Deploy
#15376 opened
Mar 24, 2025 -
[Feature]: Support LoRA adapter for whisper
#15370 opened
Mar 24, 2025 -
[Feature]: Overall tests improvement and speedup
#15369 opened
Mar 24, 2025 -
[Bug]: 0.8.0 and 0.8.1 bugs
#15365 opened
Mar 23, 2025 -
[New Model]: Support for SFR-Embedding-Code-2B_R embbeding model
#15362 opened
Mar 23, 2025 -
[Bug]: vLLM v1 hanging during Torch compilation
#15360 opened
Mar 23, 2025 -
[Bug]: LLM.collective_rpc is broken in v1 by default
#15349 opened
Mar 23, 2025 -
[Bug]: Wheels binary is absent for v0.7.2 release
#15347 opened
Mar 23, 2025 -
[Bug]: misleading regex with `--tokenizer-mode mistral` `OSError`
#15345 opened
Mar 23, 2025 -
[usage]: The fastest offline inference method
#15342 opened
Mar 22, 2025 -
[Bug]: Can't deserialize object: ObjectRef,DeepSeek R1, H20*16, pp2, tp8, v1 engine
#15333 opened
Mar 22, 2025 -
[Performance]: poor performance in pipeline parallesm when batch-size is large
#15330 opened
Mar 22, 2025 -
[Bug]:
#15329 opened
Mar 22, 2025 -
[Bug]: ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
#15327 opened
Mar 22, 2025 -
[Feature]: looking into adding a generation algorithm
#15315 opened
Mar 22, 2025 -
[Bug]: vLLM declares itself healthy before it can serve requests
#15313 opened
Mar 21, 2025 -
[Bug]: Crashing on unsupported Sampling params
#15312 opened
Mar 21, 2025 -
[Usage]: Generating multiple completions with Qwen QwQ 32B
#15304 opened
Mar 21, 2025 -
[Bug]: OPEA/Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym error
#15300 opened
Mar 21, 2025 -
[Bug]: Worker VllmWorkerProcess pid 000000 died, exit code: -15
#15295 opened
Mar 21, 2025 -
[Feature]: Dynamic Memory Release for GPU after idle time
#15287 opened
Mar 21, 2025 -
[Usage]: why no ray command in my docker image
#15284 opened
Mar 21, 2025 -
[Bug]: streaming is lost in arguments in tool_calls
#15274 opened
Mar 21, 2025 -
[Bug]: Inconsistent Output Based on Presence of chat_template Parameter
#15272 opened
Mar 21, 2025 -
vector search
#15268 opened
Mar 21, 2025 -
[Feature]: Can support CPU inference with Ray cluster?
#15266 opened
Mar 21, 2025 -
[Bug]: qwen2.5vl cannot use fp8 quantization
#15264 opened
Mar 21, 2025 -
[Bug]: oracle for device checking raise exception unexpectly
#15263 opened
Mar 21, 2025 -
[Bug]: OOM with QwQ-32B
#15258 opened
Mar 21, 2025 -
[Bug]: working with openai-agents SDK and using Runner.run_streamed() got function call error
#15256 opened
Mar 21, 2025 -
[Bug]: --tensor-parallel-size Error
#15255 opened
Mar 20, 2025 -
[RFC]: Better support for weight updating while waking up from sleep mode for RLHF
#15254 opened
Mar 20, 2025 -
[Performance]: V0 and V1 give the same throughput number
#15253 opened
Mar 20, 2025 -
[Bug]: `VLLM_USE_V1` + `TORCH_SDPA` regression in v0.8
#15251 opened
Mar 20, 2025
326 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Model][VLM] Add Qwen2.5-Omni model support (thinker only)
#15130 commented on
Mar 27, 2025 • 35 new comments -
[Frontend]Reduce vLLM's import time
#15128 commented on
Mar 25, 2025 • 34 new comments -
[Model][MiniMaxText01] Support MiniMaxText01 model inference
#13454 commented on
Mar 27, 2025 • 30 new comments -
[V1] Implement sliding window attention in kv_cache_manager
#14097 commented on
Mar 27, 2025 • 26 new comments -
[Feature][Disaggregated] Support XpYd disaggregated prefill with MooncakeStore
#12957 commented on
Mar 27, 2025 • 19 new comments -
Add option to use DeepGemm contiguous grouped gemm kernel for fused MoE operations.
#13932 commented on
Mar 27, 2025 • 17 new comments -
[Model] Add PLaMo2
#14323 commented on
Mar 27, 2025 • 14 new comments -
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend
#14238 commented on
Mar 27, 2025 • 14 new comments -
[Refactor][Frontend] Keep all logic about reasoning into one class
#14428 commented on
Mar 27, 2025 • 11 new comments -
[Feature] Support sequence parallelism
#14908 commented on
Mar 27, 2025 • 10 new comments -
[V1] AsyncLLM data parallel
#13923 commented on
Mar 27, 2025 • 7 new comments -
[Feature][ROCm]Enable fusion pass for torch.compile on ROCm
#15050 commented on
Mar 26, 2025 • 7 new comments -
Allow dynamic loading of LoRA adapters in a cache dir
#14634 commented on
Mar 25, 2025 • 6 new comments -
[Kernel][Triton] Adding fp8 and variable length sequence support to Triton FAv2 kernel
#12591 commented on
Mar 27, 2025 • 5 new comments -
[Bugfix] Fix failure to launch in Tensor Parallel TP mode on macOS.
#14948 commented on
Mar 27, 2025 • 5 new comments -
Truncation control for embedding models
#14776 commented on
Mar 21, 2025 • 5 new comments -
[Distributed] Add custom allreduce support for ROCM
#14125 commented on
Mar 27, 2025 • 5 new comments -
Add endpoint load metrics
#14906 commented on
Mar 27, 2025 • 4 new comments -
[Bugfix] Enable `torch.comple` for 2 parts of model
#14913 commented on
Mar 22, 2025 • 4 new comments -
[Quantization][FP8] Adding support for fp8 gemm layer input in fp8
#14578 commented on
Mar 24, 2025 • 4 new comments -
[V1][Frontend] Improve Shutdown And Logs
#11737 commented on
Mar 27, 2025 • 4 new comments -
Add cutlass support for blackwell fp8 blockwise gemm
#14383 commented on
Mar 26, 2025 • 4 new comments -
[Frontend] Implement Tool Calling with `tool_choice='required'`
#13483 commented on
Mar 27, 2025 • 4 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
Mar 27, 2025 • 3 new comments -
Torchao
#14231 commented on
Mar 27, 2025 • 3 new comments -
[Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS
#6036 commented on
Mar 27, 2025 • 3 new comments -
[Neuron][V1] Experimental support for neuron backend with V1 architecture
#14648 commented on
Mar 23, 2025 • 2 new comments -
[Bugfix] handle alignment of encoder_seq_lens in mllama.py
#14784 commented on
Mar 26, 2025 • 2 new comments -
[V1][Core] Support offloading KV cache to CPU.
#13377 commented on
Mar 27, 2025 • 2 new comments -
[Bugfix] Data parallel example will all use same GPUs if the users script initializes torch.cuda
#14598 commented on
Mar 26, 2025 • 2 new comments -
[WIP][V1][Metrics] Speculative decoding metrics
#15151 commented on
Mar 27, 2025 • 2 new comments -
[V0][Fix] structured decoding compatibility with speculative decoding
#13823 commented on
Mar 27, 2025 • 2 new comments -
[RFC][V1] `LogitsProcessor` interface
#13360 commented on
Mar 27, 2025 • 1 new comment -
[HPU] Enable AutoGPTQ/AutoAWQ quantized model inference
#13853 commented on
Mar 24, 2025 • 1 new comment -
[FEAT][ROCm] Integrate Paged Attention Kernel from AITER
#15001 commented on
Mar 26, 2025 • 1 new comment -
[V1][Metrics] Add additional metrics to V1
#14148 commented on
Mar 25, 2025 • 1 new comment -
[V1][Perf] Faster incremental detokenization
#15137 commented on
Mar 25, 2025 • 1 new comment -
[V1][Metrics] Allow V1 AsyncLLM to use custom logger
#14661 commented on
Mar 26, 2025 • 1 new comment -
[Metrics] Add bucket for `request_latency_buckets`
#15202 commented on
Mar 25, 2025 • 1 new comment -
Move dockerfiles into their own directory
#14549 commented on
Mar 27, 2025 • 1 new comment -
[CI/Build] Add support for Python 3.13
#13164 commented on
Mar 24, 2025 • 0 new comments -
[Model] Add T5 model (2/2)
#11901 commented on
Mar 27, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Mar 21, 2025 • 0 new comments -
[Hardware][TPU] improve kv cache update performance in prefill
#13176 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix] add input embedding
#11684 commented on
Mar 25, 2025 • 0 new comments -
[Core] Efficient transmission for CPU prefix caching, based on PR#10874
#11099 commented on
Mar 27, 2025 • 0 new comments -
[Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server
#10546 commented on
Mar 27, 2025 • 0 new comments -
[Core] Faster logit_bias_logits_processor
#13334 commented on
Mar 26, 2025 • 0 new comments -
[WIP][Attention] Update to lastest FA3 code
#13111 commented on
Mar 26, 2025 • 0 new comments -
[V1][Bugfix] DeepSeek-V3 v1 attn_backend miss q_lora_rank
#13092 commented on
Mar 27, 2025 • 0 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Mar 26, 2025 • 0 new comments -
[Neuron][Kernel] NKI Flash PagedAttention with BlockSparse Execution Plan
#13249 commented on
Mar 21, 2025 • 0 new comments -
[V1] Optimize block table copy from CPU to GPU (take 2)
#12078 commented on
Mar 25, 2025 • 0 new comments -
[WIP][Hardware][CPU] testing branch for mlperf
#12141 commented on
Mar 27, 2025 • 0 new comments -
[V1][PoC] Refactor EngineCoreOutputs
#12853 commented on
Mar 27, 2025 • 0 new comments -
[Quantization/Parameter] WIP: Another Implementation of the Quantization Parameter Subclass Substitution
#12158 commented on
Mar 27, 2025 • 0 new comments -
[WIP][AMD][Kernel][Quantization] Add fp8 and int8 support for Triton FAv2 kernel
#12534 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Attention] KV Splits heuristic for MLA
#12654 commented on
Mar 26, 2025 • 0 new comments -
[WIP] MLA decode attention - cuda graph support
#12588 commented on
Mar 27, 2025 • 0 new comments -
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1
#13305 commented on
Mar 27, 2025 • 0 new comments -
[Frontend] Support suffix in completions API (fill-in-the-middle, FIM)
#9522 commented on
Mar 24, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q4 2024
#9006 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: disaggregated prefilling hangs when TP=2
#11247 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: invalid argument [ Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
#13270 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vLLM, requests initially take 3-5 seconds each but gradually slow down to about 60 seconds each after running for a while. Has anyone else seen this?
#13886 commented on
Mar 27, 2025 • 0 new comments -
[Usage]: LLM.beam_search is much slower in vLLM 0.7.3 compared to 0.5.4
#14426 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Qwen2_5_VLForEmbedding
#13373 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: 0.8.0(V1) RayChannelTimeoutError when inferencing DeepSeekV3 on 16 H20 with large batch size
#15102 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: v0.6.4 requires more GPU memory than v0.6.3
#10360 commented on
Mar 27, 2025 • 0 new comments -
[Usage]: How to get token level probablity scores
#10951 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Prefix caching doesn't work for LlavaOneVision
#11371 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: error when start in multiple GPU
#11467 commented on
Mar 27, 2025 • 0 new comments -
[Misc]: qwen2 inference results are not aligned between vLLM and transformers
#11478 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: 'int' object has no attribute 'parser_state'
#11498 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Qwen2.5-72B-Instruct inference fails on A800
#11506 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: The CPU usage is very low when inference is performed on the ARM CPU
#11511 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: vLLM 0.6.5 running Qwen2-VL-7B-Instruct: LoRA loads successfully but has no effect
#11525 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Profiling on vLLM server hangs when --num-scheduler-steps > 1
#12032 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Run DeepSeek-R1-awq model on AMD MI210 meet an error
#15101 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: CDNA cc >= 90, choose_mp_linear_kernel MacheteLinearKernel is possible
#14996 commented on
Mar 26, 2025 • 0 new comments -
[Hardware] [Intel GPU] Add multistep scheduler for xpu device
#9337 commented on
Mar 26, 2025 • 0 new comments -
[Bugfix] fix error due to an uninitialized tokenizer when using `skip_tokenizer_init` with `num_scheduler_steps`
#9276 commented on
Mar 27, 2025 • 0 new comments -
Developed a PoC of dAttention support. It uses an idea similar to vAttention but introduces a new memory layout that avoids vAttention's memory waste.
#9078 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix] Fix LongRoPE bug
#8254 commented on
Mar 27, 2025 • 0 new comments -
[Frontend] Add option for LLMEngine to return model hidden states.
#7892 commented on
Mar 25, 2025 • 0 new comments -
[Core] generate from input embeds
#6869 commented on
Mar 27, 2025 • 0 new comments -
[BugFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: msgspec.DecodeError: MessagePack data is malformed: trailing characters (byte 13)
#15207 commented on
Mar 27, 2025 • 0 new comments -
[New Model]: answerdotai/ModernBERT-large
#11347 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: Out of Memory error for Qwen2.5 in 0.8.0 and 0.8.1. Worked fine in the previous versions
#15228 commented on
Mar 27, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q1 2025
#11862 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: size mismatch when loading MixtralForCausalLM GGUF model
#14423 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Support Multiple Tasks Per Model
#11905 commented on
Mar 27, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Mar 27, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Mar 27, 2025 • 0 new comments -
[RFC]: Merge input processor and input mapper for multi-modal models
#10114 commented on
Mar 27, 2025 • 0 new comments -
[Feature]: Support rerank models
#6928 commented on
Mar 27, 2025 • 0 new comments -
[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
#8024 commented on
Mar 27, 2025 • 0 new comments -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
Mar 27, 2025 • 0 new comments -
[Quant] SupportsQuant handles ignored_modules
#14635 commented on
Mar 26, 2025 • 0 new comments -
[Ray]Ray Compiled Graph support other device
#14668 commented on
Mar 24, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Mar 22, 2025 • 0 new comments -
[Perf] Optimize Qwen2/2.5-VL/Omni series' rot pos compute using numba
#14684 commented on
Mar 26, 2025 • 0 new comments -
[V1][Feature] Enable Speculative Decoding with Structured Outputs
#14702 commented on
Mar 27, 2025 • 0 new comments -
[DO NOT MERGE] [V1] Implement SimpleScheduler
#14731 commented on
Mar 21, 2025 • 0 new comments -
[CI/Build] Add hpu test with tensor-parallel-size=2 to run-hpu-test.sh
#14751 commented on
Mar 25, 2025 • 0 new comments -
[Quantization] Add Gemma2 and Gemma3 text model GGUF support
#14766 commented on
Mar 26, 2025 • 0 new comments -
[Bugfix] fix deepseek fp16 scale bug
#14809 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix][Mamba] Fix IndexError When MambaCacheManager is Full
#14820 commented on
Mar 22, 2025 • 0 new comments -
[Bugfix][Model] fix mllama multi-image
#14883 commented on
Mar 26, 2025 • 0 new comments -
Add Phi-4-mini function calling support
#14886 commented on
Mar 26, 2025 • 0 new comments -
[Kernel] vLLM Windows CUDA support
#14891 commented on
Mar 25, 2025 • 0 new comments -
[Feature] Eagle Chunked Prefill Support
#14922 commented on
Mar 25, 2025 • 0 new comments -
[ V0 ][ sample ] improve sample performance when using guide decoding
#14962 commented on
Mar 22, 2025 • 0 new comments -
[FEAT] [ROCm]: Add AITER Block-Scaled GEMM Feature
#14968 commented on
Mar 24, 2025 • 0 new comments -
[V1][Misc] Misc simplifications / performance-related
#14989 commented on
Mar 27, 2025 • 0 new comments -
[Model] Update support for NemotronNAS models
#15008 commented on
Mar 25, 2025 • 0 new comments -
[CPU] Support torch compile in CPU backend
#15020 commented on
Mar 25, 2025 • 0 new comments -
[Bugfix] Fix hidden_states reshape failed and no_proposals error when…
#15032 commented on
Mar 24, 2025 • 0 new comments -
[TPU][V1] Capture multimodal encoder during model compilation
#15051 commented on
Mar 27, 2025 • 0 new comments -
[DO NOT REVIEW YET] Integrate with the write-to-kvcache Pallas kernel
#15067 commented on
Mar 27, 2025 • 0 new comments -
[Bugfix][Misc] Add a defensive check before importing triton
#15099 commented on
Mar 24, 2025 • 0 new comments -
Metrics proposal OpenTelemetry API
#15138 commented on
Mar 21, 2025 • 0 new comments -
[WIP][TPU] Support mrope models (Qwen2VL)
#15149 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Feature] Support chunked prefill when using Deepseek MTP model as draft model
#15153 commented on
Mar 21, 2025 • 0 new comments -
online_rotations
#15162 commented on
Mar 21, 2025 • 0 new comments -
[Spec Decode] Make speculative decoding compatible with pipeline parallelism
#15173 commented on
Mar 26, 2025 • 0 new comments -
[SpecDecode] Make spec decoding extensible to different backends
#15195 commented on
Mar 25, 2025 • 0 new comments -
[Bugfix] Fix include prompt in stream response when echo=true
#15233 commented on
Mar 21, 2025 • 0 new comments -
[ROCm] Pop ROCR_VISIBLE_DEVICES in RayWorkerWrapper
#15246 commented on
Mar 21, 2025 • 0 new comments -
[Bugfix]: DeepseekR1 model load fails with weights tied error
#13335 commented on
Mar 26, 2025 • 0 new comments -
[Core][Feature] Input metadata dump on crash
#13407 commented on
Mar 27, 2025 • 0 new comments -
[Neuron][CI][WIP] Refactor Neuron kernel tests to improve coverage
#13455 commented on
Mar 21, 2025 • 0 new comments -
[CI/Build] custom build backend and dynamic build dependencies v2
#13480 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Kernel] Flashinfer MLA support
#13630 commented on
Mar 26, 2025 • 0 new comments -
[V1][PP] Continue scheduling prefill chunks
#13637 commented on
Mar 27, 2025 • 0 new comments -
[V1][Minor] Use FakeAttentionMetadata for dummy run
#13689 commented on
Mar 27, 2025 • 0 new comments -
[V1] Zero-copy tensor/ndarray serialization/transmission
#13790 commented on
Mar 26, 2025 • 0 new comments -
[Model][Speculative Decoding] support k > 1 for MTP
#13805 commented on
Mar 27, 2025 • 0 new comments -
[Misc] support variable remote backend for model loader
#13809 commented on
Mar 25, 2025 • 0 new comments -
Fix TPU CI
#13898 commented on
Mar 27, 2025 • 0 new comments -
Upgrade `transformers` to `v4.50.2`
#13905 commented on
Mar 27, 2025 • 0 new comments -
[V1] Avoid false positives when warning for unimplemented methods
#14046 commented on
Mar 23, 2025 • 0 new comments -
[v1] Remove bind_kv_cache and self.kv_cache in model runner
#14098 commented on
Mar 27, 2025 • 0 new comments -
Deepseek MTP for V1
#14182 commented on
Mar 25, 2025 • 0 new comments -
[V1] Enable Long Context LoRA tests for V1
#14241 commented on
Mar 26, 2025 • 0 new comments -
[TPU][V1] Capture multimodal encoder during model compilation
#14254 commented on
Mar 27, 2025 • 0 new comments -
[WIP][Attention] FlashAttn MLA
#14258 commented on
Mar 26, 2025 • 0 new comments -
[V1] TPU - Remove self.kv_caches
#14309 commented on
Mar 27, 2025 • 0 new comments -
[Misc][Minor] Benchmarks: Fix guided decoding, token sampling, and request sorting
#14368 commented on
Mar 21, 2025 • 0 new comments -
[Misc] Fix test_sleep to use query parameters
#14373 commented on
Mar 22, 2025 • 0 new comments -
[Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS
#14395 commented on
Mar 26, 2025 • 0 new comments -
Clean up Engine Args & Documentation
#14409 commented on
Mar 27, 2025 • 0 new comments -
[Misc] Refactor platform to get device specific stream and event
#14411 commented on
Mar 26, 2025 • 0 new comments -
[WIP] Support models with mrope (Qwen2VL) on TPU
#14442 commented on
Mar 27, 2025 • 0 new comments -
[Kernel] moe wna16 marlin kernel
#14447 commented on
Mar 23, 2025 • 0 new comments -
[INTEL-HPU] Deepseek R1 model enabling for Intel Gaudi
#14455 commented on
Mar 24, 2025 • 0 new comments -
[#14109][bug] Fix Ray placement group allocation not respecting env VLLM_RAY_PER_WORKER_GPUS (fractional GPU)
#14521 commented on
Mar 22, 2025 • 0 new comments -
First working PoC for bge-m3 sparse embeddings
#14526 commented on
Mar 26, 2025 • 0 new comments -
[Frontend] Skip `stop` in reasoning content
#14550 commented on
Mar 27, 2025 • 0 new comments -
permute/unpermute kernel for moe optimization
#14568 commented on
Mar 27, 2025 • 0 new comments -
fix: set use_beam_search to false to avoid broken trace links
#14592 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: GPU memory usage keeps increasing the longer the server runs after startup
#8413 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: vLLM multi-step scheduling crashes when input prompt is long
#10009 commented on
Mar 23, 2025 • 0 new comments -
[Installation]: Missing v0.6.3.post1-cu118-cp310.whl. Can share it? Thanks so much
#10036 commented on
Mar 23, 2025 • 0 new comments -
[RFC]: The two features i wish vllm has
#11410 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: c4ai-command-r-plus-08-2024 tool choice support
#11405 commented on
Mar 23, 2025 • 0 new comments -
[Misc]: How to Profile Both EngineCoreClient and EngineCoreProc Activities in V1 Using Profiler
#11413 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: QTIP Quantization
#11416 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: Ensure benchmark serving does not import vLLM
#14923 commented on
Mar 23, 2025 • 0 new comments -
[RFC]: vLLM Windows CUDA support
#14981 commented on
Mar 23, 2025 • 0 new comments -
[Feature]: Support torch.distributed as the runtime for multi-node inference
#12511 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: "Loading safetensors checkpoint shards" runs twice when serving model
#13765 commented on
Mar 22, 2025 • 0 new comments -
[Feature]: Reduce vLLM's import time
#14924 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: ValueError: The checkpoint you are trying to load has model type `qwen2_5_vl` but Transformers does not recognize this architecture
#13446 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: The qwq-32b-q4_k_m.gguf quantized model is not supported.
#15015 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: TRACKING ISSUE: `AsyncEngineDeadError`
#5901 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: No available block found in 60 seconds in shm
#6614 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: How to benchmark throughput of DeepSeek-R1-671B on 2 nodes
#15024 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: Unit test `tests/models/embedding/vision_language/test_phi3v.py` failing
#14677 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: Model weights in GiB
#14979 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB outputs only exclamation marks and no results
#3998 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: Model `compute_logits` always gets None for `sampling_metadata`
#15115 commented on
Mar 24, 2025 • 0 new comments -
[New Model]: jinaai/jina-reranker-v2-base-multilingual
#15222 commented on
Mar 24, 2025 • 0 new comments -
[RFC]: Hybrid Memory Allocator
#11382 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Extra body doesn't work when response_format is also sent for serving
#7337 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: LoRA refuses to load from disk without extremely weird file path manipulations
#9063 commented on
Mar 24, 2025 • 0 new comments -
[Usage]: Removal of vllm.openai.rpc folder in vLLM 0.6.2 release
#10766 commented on
Mar 24, 2025 • 0 new comments -
[Performance]: Performance degradation due to CPU bottleneck when serving embedding models to GPUs
#11320 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: No profiler output when VLLM_TORCH_PROFILER_DIR is enabled for vllm serve
#11346 commented on
Mar 24, 2025 • 0 new comments -
Error when running 'python -m vllm.entrypoints.openai.api_server'
#11411 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: 0.6.5 randomly closes connection/drops requests
#11421 commented on
Mar 24, 2025 • 0 new comments -
[Doc]: new attention layer
#15077 commented on
Mar 24, 2025 • 0 new comments -
[Installation]: VLLM on ARM machine with GH200
#10459 commented on
Mar 23, 2025 • 0 new comments -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: Missing detection of BFloat16 for CPU ARM
#11814 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: [V1][SpecDec] RuntimeError: CUDA error: an illegal memory access was encountered
#13673 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: [V1] New v1 engine does not support n>1?
#12584 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: When enabling LoRA, greedy search gives different answers.
#7977 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: Likely Regression - Was working in v0.6.3.post1, now using response_format parameter with "type": "bool" in v0.7.3: BadRequestError: Error code 400 - {'object': 'error', 'message': 'json_schema_converter.cc:595 Unsupported type bool in schema {type":"bool"}\n, 'type': 'BadRequestError', 'param': None, 'code': 400}
#13864 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: When using disaggregated_prefill, the KV receiving thread reports a timeout if nothing is input
#14193 commented on
Mar 23, 2025 • 0 new comments -
[Bug]: vLLM 0.5.5 and FlashInfer0.1.6
#8091 commented on
Mar 23, 2025 • 0 new comments -
Supporting RWKV models
#3583 commented on
Mar 22, 2025 • 0 new comments -
[Bug]: AttributeError: 'RayGPUExecutorAsync' object has no attribute 'forward_dag'
#7871 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Stuck when tensor-parallel-size > 1
#8087 commented on
Mar 21, 2025 • 0 new comments -
[Doc]: Offline Inference Distributed
#8966 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: vllm crashes when preemption of priority scheduling is triggered on vllm-0.6.3.dev173+g36ea7907.d20241011
#9342 commented on
Mar 21, 2025 • 0 new comments -
[New Model]: Qwen/QwQ-32B-Preview
#10737 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: how to use EAGLE on vLLM?
#11126 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Paligemma 2 model loading error
#11343 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: meta-llama/Prompt-Guard-86M Usage Value Error.
#11360 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: priority scheduling doesn't work according to token_per_s. The token_per_s of requests with higher priorities is not higher than that of requests without priority settings.
#11361 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Service operation occasionally raises RuntimeError: CUDA error: an illegal memory access was encountered
#11366 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: vLLM crashes on tokenized embedding input
#11375 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: How do I run offline batch inference with Llama 405B BF16 across multinode (via SLURM)
#11379 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: stop_sequences is applied to both reasoning_content and content
#14399 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: flash_attn_with_kvcache kernel, an illegal memory access
#15113 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: Support openai responses API interface
#14721 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 commented on
Mar 20, 2025 • 0 new comments -
[Bug]: Multi-Node Online Inference on TPUs Failing
#12179 commented on
Mar 20, 2025 • 0 new comments -
[RFC][Exploratory]: vLLM Neuron Backend with V1 Architecture
#11152 commented on
Mar 20, 2025 • 0 new comments -
[Bug]: collect_env doesn't work in uv environment
#13888 commented on
Mar 20, 2025 • 0 new comments -
[New Model]: Support Zyphra/Zamba2-7B
#9382 commented on
Mar 20, 2025 • 0 new comments -
[New Model]: nvidia/Hymba-1.5B-Base
#10783 commented on
Mar 22, 2025 • 0 new comments -
[Usage]: Is pipeline parallelism supported on machines that are not in the same local network?
#11285 commented on
Mar 22, 2025 • 0 new comments -
[Misc]: What is 'residual' used for in the IntermediateTensor class?
#11364 commented on
Mar 22, 2025 • 0 new comments -
Where does the default KV cache number of 43328 come from, and how can I change it?
#11391 commented on
Mar 22, 2025 • 0 new comments -
[V1] Add code dataset to benchmark the performance of spec decode
#14013 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: Vllm whisper model response_format verbose_json not working
#14818 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Latest Docker build (0.6.2) fails due to VLLM_MAX_SIZE_MB
#9307 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
#3900 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: 'invalid argument' Error with custom_all_reduce when doing tensor parallelism
#9046 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: vLLM response on tool_calls does not align with OpenAI standard
#14951 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Documentation says XLMRobertaForSequenceClassification is supported, but logs say ['XLMRobertaForSequenceClassification'] are not supported for now
#10718 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: Support more video loader
#15011 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: 'Cannot use FA version 2' error because FA3 is only supported on devices with compute capability >= 8, excluding 8.6 and 8.9
#13766 commented on
Mar 21, 2025 • 0 new comments -
[Performance]: only 0.4 tokens/s when running 2 or more request
#15018 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: Clarification on how to use greedy search, and on beam search's poor performance in vLLM
#15146 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: How to eliminate randomness and obtain fixed results with VLLM 0.8
#15205 commented on
Mar 21, 2025 • 0 new comments -
[Usage]: Distributed inference not supported with OpenVINO?
#14933 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: Gemma3 raises an error
#14723 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Ray memory leak
#4241 commented on
Mar 21, 2025 • 0 new comments -
[Bug]: Incomplete tool calling response for pipeline-parallel vllm with ray
#7194 commented on
Mar 21, 2025 • 0 new comments -
[Feature]: DeepSeek v3/r1 MTP support PP
#14005 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: expose the tqdm progress bar to enable logging the progress
#6154 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: KeyError: 'layers.0.self_attn.qkv_proj.weight'
#9595 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Qwen2-VL-7B with sglang (vLLM-back) Performance Degradation on MME benchmark
#10588 commented on
Mar 26, 2025 • 0 new comments -
[Installation]: Is there a good solution for deploying gemma-2-27b on V100? The deployment has been consistently unsuccessful
#11462 commented on
Mar 26, 2025 • 0 new comments -
[Usage]: Client-Side Error Handling for VLLM in a Client-Server Architecture
#11487 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Support tool calls for DeepSeek.
#14745 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Implement Concurrent Partial Prefills In V1 Engine
#14003 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: "transformers not installed" when using --guided-decoding-backend lm-format-enforcer
#14401 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Cannot use model with shorter context as draft model
#7859 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: LLVM ERROR: Failed to compute parent layout for slice layout.
#15235 commented on
Mar 25, 2025 • 0 new comments -
[Bug] Mismatch between `get_multimodal_embedding` output and `PlaceholderRange`
#15144 commented on
Mar 25, 2025 • 0 new comments -
[Feature]: Data parallel inference in offline mode (based on Ray)
#14683 commented on
Mar 25, 2025 • 0 new comments -
[Performance]: V1 vs V0 with multi-steps
#11649 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: vLLM 0.7.3 TypeError in vllm.entrypoints.api_server Argument Parsing
#13848 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: Guided choice not working as expected
#12225 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: vllm:request_inference_time_seconds_bucket has too few buckets for long inference requests
#15167 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Major issues with guided generation / structured output in vLLM (up to and including v0.8.1); many examples provided by vllm in /examples and structured_outputs.html doc do not work
#15236 commented on
Mar 25, 2025 • 0 new comments -
[Performance]: logit bias implementation uses a slow for loop
#10741 commented on
Mar 25, 2025 • 0 new comments -
[Installation]: Error occured while installing vllm
#14124 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Qwen/Qwen2.5-1.5B-Instruct generates out of vocabulary tokens
#13175 commented on
Mar 26, 2025 • 0 new comments -
[Installation]: undefined symbol: _ZNK3c1011StorageImpl27throw_data_ptr_access_errorEv
#15010 commented on
Mar 26, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: support tool and reasoning together
#14429 commented on
Mar 26, 2025 • 0 new comments -
[Installation]: uv install not installing FlashInfer anymore
#15158 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Mistral Small 3.1 HF support
#15212 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Extremely low throughput when using pipeline parallelism with a small batch size (running requests)
#9176 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Support Gemma3 GGUF
#14753 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Is vllm compatible with torchrun?
#7939 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: codegemma-7b crashes without error
#13044 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: assert request.num_computed_tokens <= request.num_tokens
#14915 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Enabling LoRA returns garbage output
#14392 commented on
Mar 26, 2025 • 0 new comments -
[Feature]: Integrate Writing in the Margins inference pattern ($5,000 Bounty)
#9807 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: vLLM ModelConfig doesn't pass hf_overrides to get_hf_image_processor_config, which could contain auth token for hugging face (not in ENV)
#14854 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
#8893 commented on
Mar 26, 2025 • 0 new comments -
[Performance]: Adding requests takes too much time, and the model will not run until all requests are added into the cache
#13259 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: Connection to GPU node times out when initializing Ray vLLM multi-node serving
#13052 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: [AMD] [vLLM=0.7.3] ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected.
#14983 commented on
Mar 26, 2025 • 0 new comments -
[Bug]: ValueError: Cannot unpickle PostGradPassManager
#15089 commented on
Mar 26, 2025 • 0 new comments -
[Misc]: Setting up a WeChat group for efficient discussion; anyone interested is welcome to join
#14928 commented on
Mar 26, 2025 • 0 new comments -
[WIP][RFC]: Use auto-functionalization V2 in PyTorch 2.7+
#14703 commented on
Mar 25, 2025 • 0 new comments -
[Misc]: Molmo inference multi-GPU
#11468 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: How to figure out why vLLM returns nothing while TRT-LLM returns a meaningful result
#11473 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.
#14435 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: ModuleNotFoundError: No module named 'triton' when building docker image for Arm64
#14605 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: CUDA_VISIBLE_DEVICES is not supported
#14807 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: 0.8.0 (V1) Ray cannot find the pyarrow and pandas modules
#15100 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Can't run vLLM model because of FlashAttention
#15238 commented on
Mar 24, 2025 • 0 new comments -
[Usage]: relationship between embedding size and vocab_size
#15131 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Unsloth bitsandbytes quantized model cannot be run due to: `KeyError: 'layers.42.mlp.down_proj.weight.absmax`
#10710 commented on
Mar 24, 2025 • 0 new comments -
[Installation]: Cannot compile vLLM from source on XPU
#14747 commented on
Mar 24, 2025 • 0 new comments -
[Usage]: There is no module or parameter named 'language_model' in Gemma3ForCausalLM
#15031 commented on
Mar 24, 2025 • 0 new comments -
[Feature]: Ability to warm up vLLM instances
#15225 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Llama-3.1-405B-Instruct-FP8 only generates exclamation marks
#13035 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Internal Server Error when using Qwen2-VL-7B with vLLM Docker Container
#15110 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: vllm cannot connect to an external ray cluster
#14349 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: Tensor-parallel offline inference fails with CalledProcessError: Command '['/usr/bin/gcc'....] returned non-zero exit status 1.
#15013 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: The service request for vLLM 0.6.4.post1 was prematurely terminated, and it could not output a fixed number of tokens.
#13156 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: ValueError: Unsupported config format: ConfigFormat.AUTO on macOS
#13889 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: AssertionError with Speculative Decoding in vLLM Using DeepSeek R1 Distill Qwen Models
#14939 commented on
Mar 24, 2025 • 0 new comments -
[Bug]: vLLM running on Unspecified Platform raises NotImplementedError when using podman/docker-compose
#14954 commented on
Mar 25, 2025 • 0 new comments -
[Feature]: Add support for reusable subschemas in tool requests (PydanticAI)
#15035 commented on
Mar 25, 2025 • 0 new comments -
[RFC]: Disaggregated prefilling and KV cache transfer roadmap
#10818 commented on
Mar 25, 2025 • 0 new comments -
[RFC]: A proper way to deal with 'Ray does not allocate any GPUs on the driver node' && 'No CUDA GPUs are available' problem
#14610 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: AssertionError - assert loaded_weight.shape[output_dim] == self.org_vocab_size
#15124 commented on
Mar 25, 2025 • 0 new comments -
First tpot/itl is too long?
#15106 commented on
Mar 25, 2025 • 0 new comments -
[Performance]: How to Improve Performance Under Concurrency
#9722 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: error helper for TypeError: _extractNVMLErrorsAsClasses..gen_new..new() takes 1 positional argument but 2 were given
#12906 commented on
Mar 25, 2025 • 0 new comments -
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
#13216 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: can't cache image embeds input
#15209 commented on
Mar 25, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Mar 25, 2025 • 0 new comments -
ExLlamaV2: exl2 support
#3203 commented on
Mar 25, 2025 • 0 new comments -
[New Model]: Google SigLip 2
#13663 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: When using the `guided choice` feature, vllm.engine.async_llm_engine.AsyncEngineDeadError is raised
#8100 commented on
Mar 25, 2025 • 0 new comments -
[Usage]: RuntimeError: Failed to infer device type (Intel Iris Xe Graphics)
#8863 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: AsyncLLMEngine CUDA runtime error 'device-side assert triggered'
#8948 commented on
Mar 25, 2025 • 0 new comments -
[Installation]: Segmentation fault when building Docker container on WSL
#10575 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Crash with Qwen2-Audio Model in vLLM During Audio Processing
#10627 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: Prefill/decode separation leads to blocking and crashing under high concurrency
#11445 commented on
Mar 25, 2025 • 0 new comments -
[Bug]: InternVL2-40B Inference Precision Problem
#11454 commented on
Mar 25, 2025 • 0 new comments