Insights: vllm-project/vllm
Overview
164 Pull requests merged by 77 people
-
[Misc] Ensure out-of-tree quantization methods are recognized by CLI args
#14328 merged
Mar 9, 2025 -
[Hardware][TPU] Fix the recompiling issue in logits processor after warmup
#14510 merged
Mar 9, 2025 -
[Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work
#14498 merged
Mar 9, 2025 -
[Bugfix] Fix tqdm progress bar when SamplingParams.n > 1
#12428 merged
Mar 9, 2025 -
[Feat] Support chunked prefill for LMCache connector
#14505 merged
Mar 9, 2025 -
[V1][TPU] Remove unnecessary padding for running on TPU.
#14467 merged
Mar 9, 2025 -
[Attention] Default to FlashMLA backend for MLA
#14451 merged
Mar 9, 2025 -
Revert "[V1][Core] Fix memory issue with logits & sampling"
#14504 merged
Mar 9, 2025 -
[V1] Support bad_words in sampler
#13376 merged
Mar 8, 2025 -
[Misc] Upgrade to Python 3.9 typing for additional directories
#14492 merged
Mar 8, 2025 -
Update CODEOWNERS for structured output
#14496 merged
Mar 8, 2025 -
[Bugfix] Fix profiling OOM and decouple encoder multimodal profiling
#14361 merged
Mar 8, 2025 -
[Bugfix] DeepSeek Accuracy
#14476 merged
Mar 8, 2025 -
Move requirements into their own directory
#12547 merged
Mar 8, 2025 -
[Misc] Don't run ruff at all on 3rd party libs
#14493 merged
Mar 8, 2025 -
[benchmarks] Add option to use unique jsonschema for each request
#14457 merged
Mar 8, 2025 -
[V1][Core] Fix memory issue with logits & sampling
#13776 merged
Mar 8, 2025 -
[Misc] add `use_tqdm_on_load` to reduce logs
#14407 merged
Mar 8, 2025 -
[VLM] Add TP support for Phi-4-MM
#14453 merged
Mar 8, 2025 -
[V1] TPU - Add tensor parallel support via Ray
#13618 merged
Mar 8, 2025 -
[CI/Build] Use a fixed seed to avoid flaky tests
#14480 merged
Mar 8, 2025 -
Add RLHF document
#14482 merged
Mar 8, 2025 -
[Build/BugFix] Fix hopper 12.8 build
#14354 merged
Mar 8, 2025 -
Add training doc signposting to TRL
#14439 merged
Mar 8, 2025 -
[Bugfix] Make the device profiler include LoRA memory.
#14469 merged
Mar 8, 2025 -
[Doc] Added QwQ-32B to the supported models list in the reasoning out…
#14479 merged
Mar 8, 2025 -
[Doc]add doc for Qwen models tool calling
#14478 merged
Mar 8, 2025 -
Default to `generation_config` from model
#12622 merged
Mar 8, 2025 -
[CI/Build] refactor: set timezone of container to UTC
#12888 merged
Mar 8, 2025 -
[core] add `extra_args` to `SamplingParams`
#13300 merged
Mar 8, 2025 -
[MISC][V1] Register process killing handler only in the main thread
#14380 merged
Mar 8, 2025 -
Revert "[Perf] Reduce MLA CPU overheads in V1 (#14384)"
#14471 merged
Mar 8, 2025 -
[Bugfix][V1] Handle MLA in kv_cache_interface
#14462 merged
Mar 8, 2025 -
[V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC
#13949 merged
Mar 8, 2025 -
[Bugfix] Fix torch_xla which can't handle None seed introduced in #14274
#14459 merged
Mar 7, 2025 -
[V1][Metrics] Fix traceback with preemptions+LoRA
#14220 merged
Mar 7, 2025 -
[V1] Eagerly remove finished requests from the batch
#14388 merged
Mar 7, 2025 -
[v1] torch.compile integration explanation
#14437 merged
Mar 7, 2025 -
[Misc] Add Phi4-MM example
#14343 merged
Mar 7, 2025 -
[Kernel] optimize performance of gptq marlin kernel when n is small
#14138 merged
Mar 7, 2025 -
[Benchmarks] Make detokenization optional in benchmark scripts
#11697 merged
Mar 7, 2025 -
[Doc] Update prefix_caching.md to match the example image
#14420 merged
Mar 7, 2025 -
[V1][Core] Support for Structured Outputs
#12388 merged
Mar 7, 2025 -
Use the optimized block sizes after tuning the kernel.
#14329 merged
Mar 7, 2025 -
Fix missing `kv_caches` and `attn_metadata` in `OpenVINOCausalLM`
#14271 merged
Mar 7, 2025 -
[BUGFIX] Skip tokenization support for throughput benchmark
#12712 merged
Mar 7, 2025 -
[Misc] Set default value of seed to None
#14274 merged
Mar 7, 2025 -
[Bugfix] Clean up multi-modal processors
#14417 merged
Mar 7, 2025 -
[Bugfix] Further clean up LoRA test
#14422 merged
Mar 7, 2025 -
correct wrong markdown syntax
#14414 merged
Mar 7, 2025 -
[GH] Auto-apply multi-modality label to relevant PRs
#14402 merged
Mar 7, 2025 -
OpenVINO: added CPU-like conditions
#14338 merged
Mar 7, 2025 -
[Build] Add nightly wheel fallback when latest commit wheel unavailable
#14358 merged
Mar 7, 2025 -
[Bugfix] Fix JambaForCausalLM LoRA
#14370 merged
Mar 7, 2025 -
[BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs
#14396 merged
Mar 7, 2025 -
[FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object
#14390 merged
Mar 7, 2025 -
[Perf] Reduce MLA CPU overheads in V1
#14384 merged
Mar 7, 2025 -
[Bugfix] Correctly call `cudaProfilerStop` in benchmarks script
#14183 merged
Mar 7, 2025 -
[Doc] Fix a typo
#14385 merged
Mar 7, 2025 -
[Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue
#14310 merged
Mar 6, 2025 -
[Docs] Add nsight guide to profiling docs
#14298 merged
Mar 6, 2025 -
[V1][Bugfix] Standardize quantized kv cache rejection for attention backends
#14221 merged
Mar 6, 2025 -
[Bug] Fix Attention when ignored by quant_method
#14313 merged
Mar 6, 2025 -
[Bugfix] Fix use_direct_call condition in FusedMoE layer for
#14382 merged
Mar 6, 2025 -
[Kernel] Add needs_fixed_stride_order tag to most GEMMs
#14306 merged
Mar 6, 2025 -
[CI] Disable spawn when running V1 Test
#14345 merged
Mar 6, 2025 -
[CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa
#13569 merged
Mar 6, 2025 -
[Distributed] Add enable_expert_parallel arg
#14305 merged
Mar 6, 2025 -
[V1] Do not detokenize if sampling param detokenize is False
#14224 merged
Mar 6, 2025 -
Fix mla prefill context performance
#13897 merged
Mar 6, 2025 -
Add authors to license header.
#14371 merged
Mar 6, 2025 -
Adding cpu inference with VXE ISA for s390x architecture
#12613 merged
Mar 6, 2025 -
Reinstate `best_of` for V0
#14356 merged
Mar 6, 2025 -
[RLHF] use worker_extension_cls for compatibility with V0 and V1
#14185 merged
Mar 6, 2025 -
[Doc] Fix date typo in README.md
#14366 merged
Mar 6, 2025 -
[Core] Don't use cache during multi-modal profiling
#14336 merged
Mar 6, 2025 -
[Bugfix][Core] fix abort_seq_group and memory leak when n>1
#14326 merged
Mar 6, 2025 -
[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend
#14152 merged
Mar 6, 2025 -
[Doc] Correct beam_search using in generative_models.md
#14363 merged
Mar 6, 2025 -
[Doc] Update reasoning with stream example to use OpenAI library
#14077 merged
Mar 6, 2025 -
[Frontend][Docs] Transcription API streaming
#13301 merged
Mar 6, 2025 -
[Core] Optimizing cross-attention `QKVParallelLinear` computation
#12325 merged
Mar 6, 2025 -
[VLM] Support Pixtral-HF on V1
#14275 merged
Mar 6, 2025 -
[Model] Update Paligemma multimodal processing with PromptUpdate
#14015 merged
Mar 6, 2025 -
[Hardware] Update the flash attn tag to support Blackwell
#14244 merged
Mar 6, 2025 -
[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention
#11301 merged
Mar 6, 2025 -
[V1] LoRA - Enable more V1 tests
#14315 merged
Mar 6, 2025 -
[Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1
#14114 merged
Mar 6, 2025 -
[misc] Mention `ray list nodes` command to troubleshoot ray issues
#14318 merged
Mar 6, 2025 -
[BugFix] MLA + V1, illegal memory access and accuracy issues
#14253 merged
Mar 6, 2025 -
[Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation
#13850 merged
Mar 6, 2025 -
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM
#13917 merged
Mar 6, 2025 -
[CI/Build] Use spawn multiprocessing mode for V1 test pipeline
#14243 merged
Mar 6, 2025 -
[BugFix] Fix prefix caching V0 MLA
#14255 merged
Mar 6, 2025 -
[Bugfix] Remove num_tokens_across_dp
#14302 merged
Mar 5, 2025 -
[Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch
#14237 merged
Mar 5, 2025 -
[V1][Easy] Add empty allowed_token_ids in the v1 sampler test
#14308 merged
Mar 5, 2025 -
[misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env
#14267 merged
Mar 5, 2025 -
[V1][BugFix] Fix for mixed top_k batch
#14301 merged
Mar 5, 2025 -
Deprecate `best_of` Sampling Parameter in anticipation of vLLM V1 (see the sketch after this list)
#13997 merged
Mar 5, 2025 -
[V1][Minor] Remove obsolete FIXME comment
#14304 merged
Mar 5, 2025 -
[Docs] Add Meta Slides
#14297 merged
Mar 5, 2025 -
[Bugfix] Fix broken vision language example
#14292 merged
Mar 5, 2025 -
[Doc] Fixed typo in prefix_caching.md
#14293 merged
Mar 5, 2025 -
[Misc] Add Qwen2MoeForCausalLM moe tuning support
#14276 merged
Mar 5, 2025 -
[LoRA] Remove linear hack outside transformers backend
#14177 merged
Mar 5, 2025 -
[V1][Frontend] Add Testing For V1 Runtime Parameters
#14159 merged
Mar 5, 2025 -
Small update for external_launcher backend docs
#14288 merged
Mar 5, 2025 -
[Doc] [3/N] Refer code examples for common cases in dev multimodal processor
#14278 merged
Mar 5, 2025 -
[Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID
#14217 merged
Mar 5, 2025 -
[Bugfix][V1] Fix allowed_token_ids for v1 Sampler
#14169 merged
Mar 5, 2025 -
[Misc][V1] Avoid using `envs.VLLM_USE_V1` in mm processing
#14256 merged
Mar 5, 2025 -
[Frontend] Allow return_tokens_as_token_ids to be passed as a request param
#14066 merged
Mar 5, 2025 -
Temporarily disable test_awq_gemm_opcheck
#14251 merged
Mar 5, 2025 -
[platforms] improve rocm debugging info
#14257 merged
Mar 5, 2025 -
[V1] EP/TP MoE + DP Attention
#13931 merged
Mar 5, 2025 -
[Model] New model support for Phi-4-multimodal-instruct
#14119 merged
Mar 5, 2025 -
[V1][Bugfix] Do not reset prefix caching metrics
#14235 merged
Mar 5, 2025 -
[Bugfix] Fix gptq_marlin for deepseek-v3
#13750 merged
Mar 5, 2025 -
Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0
#14157 merged
Mar 5, 2025 -
Moved numba from common requirements to cuda/rocm specific requirements
#14199 merged
Mar 5, 2025 -
[misc] announce china meetup
#14248 merged
Mar 5, 2025 -
[V1][TPU] TPU multimodal model support for ragged attention
#14158 merged
Mar 5, 2025 -
[ROCm] Disable a few more kernel tests that are broken on ROCm
#14145 merged
Mar 4, 2025 -
Clean up unused padding_idx variables across many model definitions
#13240 merged
Mar 4, 2025 -
Serialize using safetensors for KV caches
#14228 merged
Mar 4, 2025 -
[v1][Metrics] Add design doc
#12745 merged
Mar 4, 2025 -
[Docs] Update Dockerfile dependency image
#14215 merged
Mar 4, 2025 -
[Frontend] Do `prompt_logprobs` clamping for chat as well as completions
#14225 merged
Mar 4, 2025 -
Fix performance when `--generation-config` is not `None`
#14223 merged
Mar 4, 2025 -
[TPU][Profiler] Support start_profile/stop_profile in TPU worker
#13988 merged
Mar 4, 2025 -
add cutlass support for blackwell fp8 gemm
#13798 merged
Mar 4, 2025 -
[V1][Molmo] Fix get_multimodal_embeddings() in molmo.py
#14161 merged
Mar 4, 2025 -
[V0][Metrics] Deprecate some questionable request time metrics
#14135 merged
Mar 4, 2025 -
[V1][BugFix] Fix remaining sync engine client shutdown errors/hangs
#13869 merged
Mar 4, 2025 -
[Bugfix] Restrict MacOS CPU detection
#14210 merged
Mar 4, 2025 -
[doc] add "Failed to infer device type" to faq
#14200 merged
Mar 4, 2025 -
[sleep mode] error out with expandable_segments
#14189 merged
Mar 4, 2025 -
[platform] add debug logging during inferring the device type
#14195 merged
Mar 4, 2025 -
Fix benchmark_moe.py tuning for CUDA devices
#14164 merged
Mar 4, 2025 -
Use math.prod instead of np.prod for trivial ops
#14142 merged
Mar 4, 2025 -
[core] Pass all driver env vars to ray workers unless excluded
#14099 merged
Mar 4, 2025 -
[Misc] Remove lru_cache in NvmlCudaPlatform
#14156 merged
Mar 4, 2025 -
[core] moe fp8 block quant tuning support
#14068 merged
Mar 4, 2025 -
[Model] Add support for GraniteMoeShared models
#13313 merged
Mar 4, 2025 -
[v1] Add comments to the new ragged paged attention Pallas kernel
#14155 merged
Mar 3, 2025 -
[Docs] Add GPTQModel
#14056 merged
Mar 3, 2025 -
[Kernel] Optimize moe intermediate_cache usage
#13625 merged
Mar 3, 2025 -
[Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3
#14100 merged
Mar 3, 2025 -
[V1] Simplify stats logging
#14082 merged
Mar 3, 2025 -
[V0][Metrics] Deprecate some KV/prefix cache metrics
#14136 merged
Mar 3, 2025 -
[V0][Metrics] Remove unimplemented `vllm:tokens_total`
#14134 merged
Mar 3, 2025 -
Fix `head_dim` not existing in all model configs (Transformers backend)
#14141 merged
Mar 3, 2025 -
[ROCm] Faster Custom Paged Attention kernels
#12348 merged
Mar 3, 2025 -
Improve the docs for `TransformersModel`
#14147 merged
Mar 3, 2025 -
[V1] Refactor parallel sampling support
#13774 merged
Mar 3, 2025 -
[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1
#13921 merged
Mar 3, 2025 -
[Misc][Platform] Move use allgather to platform
#14010 merged
Mar 3, 2025 -
[Misc] typo find in deepseek_v2
#14106 merged
Mar 3, 2025 -
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure
#14051 merged
Mar 3, 2025 -
Update deprecated Python 3.8 typing
#13971 merged
Mar 3, 2025 -
[v0][structured output] Support reasoning output
#12955 merged
Mar 2, 2025
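
For context on the `best_of` items above (#13997 deprecating it ahead of V1, #14356 reinstating it for V0), here is a minimal sketch of the V0-style usage being deprecated. The model name is only a placeholder, and the exact deprecation behavior depends on the vLLM version in use.

```python
from vllm import LLM, SamplingParams

# V0-era usage: sample best_of=4 candidate completions per prompt and keep
# the n=1 sequence with the highest cumulative logprob. In V1 this knob is
# deprecated (#13997); #14356 reinstates it for the V0 engine only.
llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration

params = SamplingParams(n=1, best_of=4, temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```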
95 Pull requests opened by 81 people
-
[V1] Implement sliding window attention in kv_cache_manager
#14097 opened
Mar 2, 2025 -
[v1] Remove bind_kv_cache and self.kv_cache in model runner
#14098 opened
Mar 2, 2025 -
[Misc] Use model_overwrite to redirect the model name to a local folder.
#14116 opened
Mar 3, 2025 -
[Distributed] Add custom allreduce support for ROCM
#14125 opened
Mar 3, 2025 -
[Feature] Vllm int8 quantization enablement for ARM CPUs
#14129 opened
Mar 3, 2025 -
[Misc] Add log information for handle_process_request.
#14130 opened
Mar 3, 2025 -
feat: add DeepGEMM for fp8 dense models
#14140 opened
Mar 3, 2025 -
[WIP] FLUTE Integration
#14146 opened
Mar 3, 2025 -
[V1][Metrics] Add additional metrics to V1
#14148 opened
Mar 3, 2025 -
[Feature]: Pin vLLM process to the right NUMA Region
#14149 opened
Mar 3, 2025 -
[Frontend] Relax 'method' field constraint in BatchRequestInput
#14153 opened
Mar 3, 2025 -
[bug fix]: benchmark enabling torch profiler in openai chat backend
#14162 opened
Mar 4, 2025 -
[Bugfix] Remove unnecessary call of `make_rand_sparse_tensors` in `make_n_rand_sparse_tensors`
#14165 opened
Mar 4, 2025 -
[DRAFT] Try to bump torch version
#14171 opened
Mar 4, 2025 -
Add CUDA kernel for per_token_group_quant_fp8
#14175 opened
Mar 4, 2025 -
Fix WorkerWrapperBase initialization: defer vllm_config setup
#14179 opened
Mar 4, 2025 -
Deepseek MTP for V1
#14182 opened
Mar 4, 2025 -
[misc] Update blog link in README
#14194 opened
Mar 4, 2025 -
docs: Add documentation for s390x cpu implementation
#14198 opened
Mar 4, 2025 -
[Model] Add Reasoning Parser for Granite Models
#14202 opened
Mar 4, 2025 -
[Kernel] MoE tuning, quickly skip slow config
#14207 opened
Mar 4, 2025 -
[Misc] Update `compressed-tensors` WNA16 to support zero-points
#14211 opened
Mar 4, 2025 -
[V1][PP] Support PP for MultiprocExecutor
#14219 opened
Mar 4, 2025 -
[V1][TPU] Support V1 Sampler for ragged attention
#14227 opened
Mar 4, 2025 -
[CI] Make UT cases in test_comm_ops.py compatible with more devices
#14229 opened
Mar 4, 2025 -
Use getattr for hidden_act and hidden_activation in Gemma models
#14230 opened
Mar 4, 2025 -
Torchao
#14231 opened
Mar 4, 2025 -
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend
#14238 opened
Mar 4, 2025 -
[V1] Aggregate prompt logprobs in `EngineCore`
#14240 opened
Mar 4, 2025 -
[V1] Enable Long Context LoRA tests for V1
#14241 opened
Mar 4, 2025 -
[do not merge][core] Workarounds for V1 for spyre plugin
#14242 opened
Mar 4, 2025 -
dynamic dispatch of fp8 kernels
#14245 opened
Mar 5, 2025 -
[ROCm] Tweak the benchmark script to run on ROCm
#14252 opened
Mar 5, 2025 -
[TPU][V1] Capture multimodal encoder during model compilation
#14254 opened
Mar 5, 2025 -
[WIP][Attention] FlashAttn MLA
#14258 opened
Mar 5, 2025 -
[utils] Update DNS server IPs in get_ip function to avoid rate limiting
#14262 opened
Mar 5, 2025 -
example: manipulate cache
#14265 opened
Mar 5, 2025 -
[BugFix] Add explicit default ctor for RankData so it can be built with clang
#14268 opened
Mar 5, 2025 -
[Doc] Create tool_chat_template_llama3.3_json.jinja
#14269 opened
Mar 5, 2025 -
[Kernel] Add triton.autotune to address the high latency overhead of punica kernels
#14272 opened
Mar 5, 2025 -
[Doc] Fix env path name in `uv` Python env setup instructions
#14273 opened
Mar 5, 2025 -
[MISC] rename interval to max_recent_requests
#14285 opened
Mar 5, 2025 -
[MISC] Refine no available block debug msg
#14287 opened
Mar 5, 2025 -
[Model] add colqwen2_vl code & inference
#14291 opened
Mar 5, 2025 -
[Build] Cython compilation support fix
#14296 opened
Mar 5, 2025 -
[V1] TPU - Remove self.kv_caches
#14309 opened
Mar 5, 2025 -
[Core] Expose API endpoint `/is_sleeping`
#14312 opened
Mar 5, 2025 -
[ROCm] Enable chunked prefill/paged attention in MLA on ROCm
#14316 opened
Mar 5, 2025 -
[Model] Add PLaMo2
#14323 opened
Mar 6, 2025 -
fix minor miscalled method
#14327 opened
Mar 6, 2025 -
[Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros
#14347 opened
Mar 6, 2025 -
[Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionRequest`
#14352 opened
Mar 6, 2025 -
Update Dockerfile, typo
#14362 opened
Mar 6, 2025 -
[Misc] Benchmarks: Fix guided decoding, token sampling, and request sorting
#14368 opened
Mar 6, 2025 -
[Misc] Fix test_sleep to use query parameters
#14373 opened
Mar 6, 2025 -
feat:Optimize qwen2-vl to reduce cudaMemcpyAsync
#14377 opened
Mar 6, 2025 -
[MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils
#14379 opened
Mar 6, 2025 -
[wip] scaled_mm_bw_sm100
#14383 opened
Mar 6, 2025 -
[Core] Add DoRA Support
#14389 opened
Mar 7, 2025 -
[neuron] add reshape_and_cache
#14391 opened
Mar 7, 2025 -
A different take
#14393 opened
Mar 7, 2025 -
[Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS
#14395 opened
Mar 7, 2025 -
[Bugfix] Make the fused_moe code compatible with non-triton supported hardware
#14400 opened
Mar 7, 2025 -
Clean up Engine Args & Documentation
#14409 opened
Mar 7, 2025 -
[rlhf] support named placement group
#14410 opened
Mar 7, 2025 -
[Misc] Add get_stream_cls() method for Platform class
#14411 opened
Mar 7, 2025 -
[Bugfix][V1] Exclude HBM used by other processes when calculating peak memory during profile runs
#14419 opened
Mar 7, 2025 -
[Bugfix] Fix When choice the specified tool call, it returns a ToolCa…
#14427 opened
Mar 7, 2025 -
[Refactor][Reasoning] Keep all logic about reasoning into one class
#14428 opened
Mar 7, 2025 -
[Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0
#14430 opened
Mar 7, 2025 -
[Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA.
#14431 opened
Mar 7, 2025 -
[Usage] Refactor speculative decoding configuration and tests
#14434 opened
Mar 7, 2025 -
[BUGFIX] fix the need_recv method of model_runner
#14436 opened
Mar 7, 2025 -
[Feature]: PD separation supports prefix caching #12257
#14440 opened
Mar 7, 2025 -
[WIP] Support models with mrope (Qwen2VL) on TPU
#14442 opened
Mar 7, 2025 -
[Core][LoRA] Add ignore layer for LoRA
#14445 opened
Mar 7, 2025 -
[WIP][Kernel] moe wna16 marlin kernel
#14447 opened
Mar 7, 2025 -
[ROCm] Fix kernel cache miss in Triton FA
#14448 opened
Mar 7, 2025 -
[ROCm][Kernel] MoE weights padding
#14454 opened
Mar 7, 2025 -
[INTEL-HPU] Deepseek R1 model enabling for Intel Gaudi
#14455 opened
Mar 7, 2025 -
Fix EAGLE output norm bug
#14464 opened
Mar 7, 2025 -
[core][V1] pluggable scheduler
#14466 opened
Mar 7, 2025 -
Fix GuidedDecodingParams backend_name issue
#14473 opened
Mar 8, 2025 -
[Frontend] Pythonic tool names flexibility (#14470)
#14474 opened
Mar 8, 2025 -
LLama 3.2 11b lm eval accuracy drop fix
#14477 opened
Mar 8, 2025 -
[not ready for review] introduce some profiling in the benchmark
#14481 opened
Mar 8, 2025 -
[Misc] Unify formatter and linter to use ruff
#14485 opened
Mar 8, 2025 -
[Frontend] Fix typo in tool chat templates for llama3.2 and toolace
#14501 opened
Mar 8, 2025 -
[V1][Core] Fix memory issue with logits & sampling
#14508 opened
Mar 9, 2025 -
[Misc] QoL: add speculative_model to SpeculativeConfig
#14509 opened
Mar 9, 2025 -
[Frontend] Support both tool calling and reasoning parser for reasoni…
#14511 opened
Mar 9, 2025 -
[BugFix][V1] Fix parallel sampling finishing/aborts
#14512 opened
Mar 9, 2025 -
[DO NOT MERGE] Varun/fix memory
#14514 opened
Mar 9, 2025 -
[Misc] Replace os environ to monkeypatch in test suite
#14516 opened
Mar 9, 2025
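
The last item above (#14516, mirrored by issue #14499 in the opened-issues list below) moves the test suite from direct `os.environ` writes to pytest's `monkeypatch` fixture. A minimal sketch of the pattern; the test name is made up, and the environment variable names are only illustrative.

```python
import os

# Hypothetical test showing the pattern #14516 applies across the suite:
# instead of mutating os.environ directly (which leaks state into other
# tests), use pytest's built-in monkeypatch fixture, which restores the
# environment automatically when the test finishes.
def test_v1_flag_enabled(monkeypatch):
    monkeypatch.setenv("VLLM_USE_V1", "1")                   # set for this test only
    monkeypatch.delenv("VLLM_LOGGING_LEVEL", raising=False)  # remove if present
    assert os.environ["VLLM_USE_V1"] == "1"
```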
89 Issues closed by 36 people
-
[Bug]: Mismatch of tqdm when n > 1
#10949 closed
Mar 9, 2025 -
[Bug]: tqdm progress bar seems to be wrong.
#11519 closed
Mar 9, 2025 -
[Usage]: How to use BitsAndBytesConfig with vllm serve
#8813 closed
Mar 9, 2025 -
[Bug]: Server - `aqlm` fails with `--cpu-offload-gb`
#8873 closed
Mar 9, 2025 -
[Bug]: CRITICAL 11-05 12:03:03 launcher.py:99] MQLLMEngine is already dead, terminating server process
#10024 closed
Mar 9, 2025 -
[Feature]: When generating multiple answers of the same prompt?
#10099 closed
Mar 9, 2025 -
[Bug]: Kernel crash while loading the 14B models on GPU L4x4
#14132 closed
Mar 8, 2025 -
[Bug]: remove_oldest LRU lora may remove lora which is still in usage
#14497 closed
Mar 8, 2025 -
[Bug][V1]: Qwen2-VL-7B OOM when loading the model in v0 but not in v1
#14184 closed
Mar 8, 2025 -
[Feature]: num of LoRAs requested by the batch is larger than num lora slots
#14495 closed
Mar 8, 2025 -
[Bug]: Corrupted responses for Llama-3.2-3B-Instruct with v0.6.6.post1
#12096 closed
Mar 8, 2025 -
[Performance]: LoRA is not taken into account when determining the number of KV cache blocks
#14450 closed
Mar 8, 2025 -
[RFC]: Prompt logprobs + APC compatibility
#13414 closed
Mar 8, 2025 -
[Usage]: Multiple rounds of image dialogue support?
#11006 closed
Mar 7, 2025 -
[Usage]: torch.OutOfMemoryError: CUDA out of memory.
#11560 closed
Mar 7, 2025 -
[Usage]: How to bypass multimodal processor logic when inputs are already processed
#14281 closed
Mar 7, 2025 -
[Usage]: Inquiry about AsyncLLMEngine's generate method and multi-modal input support
#10937 closed
Mar 7, 2025 -
[Doc]: How can I set the date_string for the chat templates
#14344 closed
Mar 7, 2025 -
[Bug]: The driver_worker gets stuck 100% of the time, when using Medusa with TP > 1
#9573 closed
Mar 7, 2025 -
[Bug]: Broken outputs for large contexts if `max_model_len` is fixed.
#9615 closed
Mar 7, 2025 -
[Bug]: I cannot able to load the model on TESLA T4 GPU in Full precision
#9990 closed
Mar 7, 2025 -
[New Model]: Support Tencent-Hunyuan-Large
#10043 closed
Mar 7, 2025 -
[Usage]: Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now
#10065 closed
Mar 7, 2025 -
[Misc]: w8a8 model inference
#10068 closed
Mar 7, 2025 -
[Bug]: vllm multi-node deployment issue
#14186 closed
Mar 7, 2025 -
[Bug]: FP8 kvcache causes RuntimeError in v1 engine
#11329 closed
Mar 6, 2025 -
[Bug]: [V1] wrong output when using kv cache fp8
#13133 closed
Mar 6, 2025 -
[Feature]: Ovis2 VLM series
#14346 closed
Mar 6, 2025 -
[Usage]: How to use Multi-instance in Vllm? (Model replication on multiple GPUs)
#6155 closed
Mar 6, 2025 -
[Bug]: Memory leak due to LLMEngine.seq_id_to_seq_group
#14353 closed
Mar 6, 2025 -
[Bug]: Phi-4-multimodal-instruct audio tag seems wrong
#14342 closed
Mar 6, 2025 -
[Usage]: Running OpenAI Swarm with vLLM-hosted models
#11774 closed
Mar 6, 2025 -
[Bug]: Ray fails to register worker when running DeepSeek R1 model with vLLM and tensor parallelism
#13557 closed
Mar 6, 2025 -
[Bug][V1]: Kernel crashed when running qwen2.5_vl model
#14181 closed
Mar 6, 2025 -
[New Model]: QwQ-32B
#14321 closed
Mar 6, 2025 -
[Usage]: How do I use langchain for tool calls?
#9692 closed
Mar 6, 2025 -
[Bug]: DeepSeek R1 with outlines structured engine stops generation after `</think>`
#14113 closed
Mar 6, 2025 -
[Bug]: Failed to run example.py even if the pytorch framework has been compiled natively.
#14178 closed
Mar 6, 2025 -
[Feature]: need no_repeat_n_gram in SamplingParams
#7842 closed
Mar 6, 2025 -
[New Model]: We can able to run phi-3.5 vision instruct model but wanted to run in int4 quantization
#8463 closed
Mar 6, 2025 -
[Performance]: FP8 performance worse than FP16 for Qwen2-VL-2B-Instruct
#9992 closed
Mar 6, 2025 -
[Misc]: How to organize a large number of requests for invocation?
#10018 closed
Mar 6, 2025 -
[Performance]: latency of medusa is longer than naive inferece even the concurreny =2
#10031 closed
Mar 6, 2025 -
[Feature]: Slurm run_cluster.sh launcher instead of just Ray
#7933 closed
Mar 6, 2025 -
[RFC]: Deprecation of the `best_of` Sampling Parameter in vLLM V1
#13361 closed
Mar 5, 2025 -
[Usage]: Does benchmark of vllm support audio input to multi-modal ?
#14284 closed
Mar 5, 2025 -
[Bug]: 【Janus-Pro-7B】 KeyError: 'multi_modality'
#14247 closed
Mar 5, 2025 -
[Feature]: How to handle concurrent request in single instance of Qwen2-VL model.
#14226 closed
Mar 5, 2025 -
[Doc]: Typo in prefix_caching.md
#14294 closed
Mar 5, 2025 -
[Usage]: How to enforce think for deepseek-r1?
#14201 closed
Mar 5, 2025 -
[Feature]: enable prefix caching when MLA is enabled
#13720 closed
Mar 5, 2025 -
[Feature]: Only apply Guided/Structured grammar after reasoning steps in Reasoning models
#12619 closed
Mar 5, 2025 -
[New Model]: Phi-4 Multimodal Instruct
#13936 closed
Mar 5, 2025 -
[Feature]: LoRA support for InternVLChatModel
#9495 closed
Mar 5, 2025 -
[New Model]: SparseLLM/prosparse-llama-2-7b
#9916 closed
Mar 5, 2025 -
[Usage]: Inference delay
#9949 closed
Mar 5, 2025 -
[Performance]: Any up-to-date and convincing benchmark for chosing the fastest engine ?
#9975 closed
Mar 5, 2025 -
[Bug]: vLLM with ray backend and enable nsight can't get perf metrics due to connection issue
#7830 closed
Mar 4, 2025 -
[Feature]: Use math.prod instead of np.prod for trivial ops (see the sketch after this list)
#14144 closed
Mar 4, 2025 -
[Feature]: Avoid KV Cache and offload Model weights in RL workloads
#11638 closed
Mar 4, 2025 -
[Bug]: Running on a single machine with multiple GPUs error
#9875 closed
Mar 4, 2025 -
[Feature]: more exhaustive tracing
#9952 closed
Mar 4, 2025 -
[Installation]: build on arm64 meet a error
#9964 closed
Mar 4, 2025 -
[Feature]: frequency_penalties is missing in V1
#10696 closed
Mar 4, 2025 -
[Bug]: `TransformersModel` fails if model config does not have `head_dim` attr
#14139 closed
Mar 3, 2025 -
[Usage]: "POST /v1/audio/transcriptions HTTP/1.1" 404 Not Found
#14127 closed
Mar 3, 2025 -
[Usage]: Fail to create distributed inference serving with rocm/vllm
#14111 closed
Mar 3, 2025 -
[Feature]: API for evicting all KV cache from GPU memory (or `sleep mode`)
#10714 closed
Mar 3, 2025 -
[Bug]: prefix cache reuse
#9643 closed
Mar 3, 2025 -
[Feature]: Low GPU utilization and memory bandwidth utilization
#9953 closed
Mar 3, 2025 -
[Installation]: I was never able to install it, which cuda version is required?
#9960 closed
Mar 3, 2025 -
[Feature]: print config of vllm LLM instance and modify it afterwards
#9962 closed
Mar 3, 2025 -
[Feature]: Any plan run deepseek-r1 fp8 on Ampere gpu
#13885 closed
Mar 3, 2025 -
[Installation]: Can't find OpenMP headers on macOS
#14034 closed
Mar 3, 2025
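
To illustrate #14144 above (fixed by PR #14142 in the merged list): for small Python-level products such as shape arithmetic, the stdlib `math.prod` avoids a NumPy round-trip and returns a plain `int`. A self-contained comparison, independent of vLLM internals.

```python
import math
import numpy as np

shape = (4, 16, 128)

# Old style: np.prod returns a NumPy scalar and pulls in array machinery
# for a trivial reduction over a 3-element tuple.
n_np = np.prod(shape)

# New style (#14142): math.prod is stdlib, returns a plain Python int, and
# is faster for tiny inputs like tensor shapes.
n_math = math.prod(shape)

assert int(n_np) == n_math == 4 * 16 * 128
print(type(n_np), type(n_math))  # NumPy scalar vs. built-in int
```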
142 Issues opened by 136 people
-
[Usage]: Can't use the local model with LLM class.
#14518 opened
Mar 9, 2025 -
[Bug]: Cannot be run on multiple GPUs.
#14517 opened
Mar 9, 2025 -
[Usage]: How to improve concurrent processing capacity
#14513 opened
Mar 9, 2025 -
[Usage]: The example of using microsoft/Phi-4-multimodal-instruct audio
#14507 opened
Mar 9, 2025 -
[Bug]: RuntimeError: Phi4MM cannot process x audios and ximages in a prompt
#14506 opened
Mar 9, 2025 -
[Bug]: ValueError: Expected a torch.device with a specified index or an integer, but got:cuda
#14500 opened
Mar 8, 2025 -
[Feature]: Convert all `os.environ(xxx)` to `monkeypatch.setenv` in test suite
#14499 opened
Mar 8, 2025 -
[Bug]: Weird output when server with high load
#14491 opened
Mar 8, 2025 -
[Feature]: Apply tool calling after reasoning steps in Reasoning models.
#14490 opened
Mar 8, 2025 -
[Bug]: ModuleNotFoundError: No module named 'pyarrow' in main branch
#14487 opened
Mar 8, 2025 -
[Performance]: Eagle implementation is obviously not efficient
#14486 opened
Mar 8, 2025 -
[New Model]: Babel Model, Open Multilingual Large Language Models Serving Over 90% of Global Speakers
#14484 opened
Mar 8, 2025 -
[Bug]: trl's grpo-trainer with vllm not convergence
#14483 opened
Mar 8, 2025 -
[Feature]: Add reasoning token usage
#14472 opened
Mar 8, 2025 -
[Bug]: pythonic tool parser only accepts alphabetical tool names
#14470 opened
Mar 8, 2025 -
[Usage]: Model MllamaForConditionalGeneration does not support BitsAndBytes quantization yet.
#14458 opened
Mar 7, 2025 -
[Bug]: No Cuda GPUs are available when running vLLM on Ray (Qwen 2.5 VL AWQ)
#14456 opened
Mar 7, 2025 -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 opened
Mar 7, 2025 -
[Feature]: Run/Debug vllm in pycharm
#14444 opened
Mar 7, 2025 -
[Bug]: External Launcher producing NaN outputs on Large Models when Collocating with Model Training
#14443 opened
Mar 7, 2025 -
[Usage]: Question about Multimodal token ids on offloaded tokenization
#14441 opened
Mar 7, 2025 -
[RFC]: Configurable multi-modal data for profiling
#14438 opened
Mar 7, 2025 -
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.
#14435 opened
Mar 7, 2025 -
[Bug]: Docker GPU image is unnecessarily fat due to two (mismatching) copies of CUDA runtime libraries
#14433 opened
Mar 7, 2025 -
[Usage]: Cuda out of memory while loading the quantized model
#14432 opened
Mar 7, 2025 -
[Feature]: support tool and reasoning together
#14429 opened
Mar 7, 2025 -
[Usage]: LLM.beam_search is much slower in vLLM 0.7.3 compared to 0.5.4
#14426 opened
Mar 7, 2025 -
[Feature]: Prefill. How to support 1M prompt tokens input?
#14425 opened
Mar 7, 2025 -
[Bug]: Unexpected content when selecting the choice tool
#14424 opened
Mar 7, 2025 -
[Bug]: size mismatch when loading MixtralForCausalLM GGUF model
#14423 opened
Mar 7, 2025 -
[Bug]: low quality of deepseek-vl2 when using vllm
#14421 opened
Mar 7, 2025 -
[Usage]: How to use the image datasets sharegpt4v provided in benchmark_serving?
#14418 opened
Mar 7, 2025 -
[Bug]: RuntimeError: No CUDA GPUs are available | when using TP>1 and using vllm v0.7
#14413 opened
Mar 7, 2025 -
[Performance]:
#14412 opened
Mar 7, 2025 -
[Bug]: pipeline-parallel not working properly with QwQ model on TPU v4
#14406 opened
Mar 7, 2025 -
[Misc, this is not a dev issue]: Congrats to vllm for having 888 developers!
#14405 opened
Mar 7, 2025 -
[Bug]: Error when Run Image Docker Vllm v0.7.3 - Unexpected error from cudaGetDeviceCount(). ....
#14403 opened
Mar 7, 2025 -
[Bug]: "transformers not installed" when using --guided-decoding-backend lm-format-enforcer
#14401 opened
Mar 7, 2025 -
[Bug]: stop_sequences is applied to both reasoning_content and content
#14399 opened
Mar 7, 2025 -
[Installation]:
#14398 opened
Mar 7, 2025 -
[Bug]: `triton_scaled_mm` never used on ROCm
#14397 opened
Mar 7, 2025 -
[Bug]: vllm-0.7.3. gptq-int3 model cannot run.
#14394 opened
Mar 7, 2025 -
[Bug]: Enable lora returns garbage output
#14392 opened
Mar 7, 2025 -
[Usage]: Clean up Engine Args & Documentation
#14386 opened
Mar 6, 2025 -
[Feature]: Q-Filters for KV Cache Compression
#14381 opened
Mar 6, 2025 -
[RFC]: Drop Support for OpenVINO
#14374 opened
Mar 6, 2025 -
[Bug]: VLLM process dies when trying to profile with nsight systemss
#14372 opened
Mar 6, 2025 -
[Bug]: API Connection Error after concurrent API calls
#14365 opened
Mar 6, 2025 -
[Usage]: STREAMING doesn't generate spaces
#14364 opened
Mar 6, 2025 -
[Performance]: Multimodal embeds input reduces service throughput
#14360 opened
Mar 6, 2025 -
[Bug]: phi-4-mini-instruct auto tool call doesnt have tool-call-parser
#14359 opened
Mar 6, 2025 -
[Bug]: ChatCompletionRequest rejects its own defaults
#14351 opened
Mar 6, 2025 -
[Bug]: vllm cannot connect to an external ray cluster
#14349 opened
Mar 6, 2025 -
[Feature]: How can I get embedding result from images? Can Qwen2.5-vl-7b do this?
#14348 opened
Mar 6, 2025 -
[Bug]: online-rl sampling is different from offline-sampling
#14341 opened
Mar 6, 2025 -
distributed inference multi-node communication bug
#14340 opened
Mar 6, 2025 -
[Feature]: Eagle support for multimodal models
#14337 opened
Mar 6, 2025 -
[Feature]: `reasoning_tokens` in Chat Completion Response `usage`
#14335 opened
Mar 6, 2025 -
[Bug]: GPU index assignment fails
#14334 opened
Mar 6, 2025 -
[Bug]: vLLM returning 415 status code at high load
#14333 opened
Mar 6, 2025 -
[Bug]: opentelemetry POC vLLM span cannot be concatenated with HTTP spans.
#14330 opened
Mar 6, 2025 -
[Usage]: How do I set the input image size when using qwen2-vl?
#14325 opened
Mar 6, 2025 -
[Bug]: 'DeepseekV2Model' object has no attribute 'config' when enabling P/D Disaggregation
#14324 opened
Mar 6, 2025 -
[Usage]: What is the default input construction of multimodel input?
#14322 opened
Mar 6, 2025 -
[Feature]: `Invalid attention backend for cuda` with `TORCH_SDPA` better error message
#14320 opened
Mar 6, 2025 -
[Doc]: Why is max block_size on CUDA 32?
#14319 opened
Mar 5, 2025 -
[Feature]: Expose a read-only API to check whether engine is sleeping
#14311 opened
Mar 5, 2025 -
Issue with Mistral Small and greek characters
#14307 opened
Mar 5, 2025 -
[Usage]: Logprobs Scaling with O(n) Complexity – Unexpected Performance Degradation
#14300 opened
Mar 5, 2025 -
[Installation]: Attempting to build and run vLLM for Intel Core Ultra 7 155H with ARC iGPU
#14295 opened
Mar 5, 2025 -
[New Model]: llava-onevision-qwen2-72b-ov-sft
#14290 opened
Mar 5, 2025 -
[Feature]: Chat inputs to AsyncLLMEngine
#14289 opened
Mar 5, 2025 -
[Bug][V1]: Loading Llama3.1-8B-INT8 gets OOM when using VLLM_USE_v1=1 but safe using v0
#14286 opened
Mar 5, 2025 -
[Doc]: dead link in source code comment
#14282 opened
Mar 5, 2025 -
[Bug]: V1 still sample n=1 when set n>1 in samplingparam
#14280 opened
Mar 5, 2025 -
[Misc]: running multiple vLLM instances on a single ray cluster
#14277 opened
Mar 5, 2025 -
[Performance]: About peak activation memory usage for quantized model
#14270 opened
Mar 5, 2025 -
[Bug]: weight_loader of fp8 weights are wrongly set to None. [Deepseek V3/R1]
#14266 opened
Mar 5, 2025 -
[Bug]: Dose V1 support MLA + PP now? Raise error while using PP+TP+V1.
#14263 opened
Mar 5, 2025 -
[Bug]: ValueError: "CompilationConfig" object has no field "max_capture_size"
#14261 opened
Mar 5, 2025 -
[New Model]: baichuan-inc/Baichuan-M1-14B-Instruct
#14259 opened
Mar 5, 2025 -
[Bug]: error occurred when compiling FlashMLA/csrc/flash_api.cpp
#14250 opened
Mar 5, 2025 -
[Misc]: [V1] prompt logprobs + chunked prefill can result in `EngineCore` partial prefill output
#14239 opened
Mar 4, 2025 -
[Feature]: Bump xgrammar version to 1.1.14 to support ARM64 processors
#14236 opened
Mar 4, 2025 -
[Bug]: ValueError: The vocabulary does not allow us to build a sequence that matches the input regex
#14233 opened
Mar 4, 2025 -
[Bug]: Corrupted output from Llama-3.2-1B when LoRA is enabled on multi-GPU instance
#14232 opened
Mar 4, 2025 -
[Usage]: Getting intermediate outputs to store on disk
#14222 opened
Mar 4, 2025 -
[Bug]: Deepseek R1 671B int8 not working on TPU
#14218 opened
Mar 4, 2025 -
[New Model]: aya 32b vision support
#14216 opened
Mar 4, 2025 -
[New Model]: pfnet/plamo-2-8b
#14214 opened
Mar 4, 2025 -
[Misc]: How does the system evenly distribute the requests to multiple micro batches?
#14213 opened
Mar 4, 2025 -
[Bug]: ValueError: There is no module or parameter named 'lm_head' in Gemma2ForCausalLM
#14212 opened
Mar 4, 2025 -
[Bug]: Ultravox audio doesn't work with auto tool choice
#14209 opened
Mar 4, 2025 -
[New Model]: nicolinho/QRM-Llama3.1-8B-v2
#14208 opened
Mar 4, 2025 -
[Feature]: how to log out the request token length?
#14206 opened
Mar 4, 2025 -
[Bug]: Different results in gsm8k when using chat
#14203 opened
Mar 4, 2025 -
[Bug]: Port is still open after crashing vllm
#14197 opened
Mar 4, 2025 -
[New Model]: deepseek-vl2
#14192 opened
Mar 4, 2025 -
[Usage]: CUDA_VISIBLE_DEVICES not supported
#14191 opened
Mar 4, 2025 -
[Feature]: Prompt Formatting Issue with LLaMA 3.1 Instruction Model in vLLM
#14190 opened
Mar 4, 2025 -
[Usage]: The inference results are not the same order as the inputs.
#14187 opened
Mar 4, 2025 -
[New Model]: InternVideo2.5 by OpenGVLab
#14180 opened
Mar 4, 2025 -
[Feature]: deepseek-r1-w8a8
#14176 opened
Mar 4, 2025 -
[Feature]: will whisper add language detection?
#14174 opened
Mar 4, 2025 -
[Performance] [V1]: Optimize batch token processing in `IncrementalDetokenizer.update()`
#14173 opened
Mar 4, 2025 -
[RFC]: Deprecate `max_num_generation_tokens`
#14168 opened
Mar 4, 2025 -
[Feature]: SLora hot loading
#14166 opened
Mar 4, 2025 -
[Feature]: Support Dynamic Loading of Prompt Adapters
#14163 opened
Mar 4, 2025 -
[Feature]: Support multi step drafting for DeepSeek MTP when k > n_predict
#14160 opened
Mar 3, 2025 -
[Bug]: Structured output requests can hang the server
#14151 opened
Mar 3, 2025 -
[Bug]: qwen2.5-vl 3B inference is OOM, but qwen2-vl 7B does not
#14150 opened
Mar 3, 2025 -
[Bug]: Failure to Init Qwen2VL-2B-Instruct with tensor-parallel-size == 4 and quantization
#14143 opened
Mar 3, 2025 -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 opened
Mar 3, 2025 -
[Installation]: Error occured while installing vllm
#14124 opened
Mar 3, 2025 -
[Bug]: run v1 engine with cuda graph. raise error.
#14121 opened
Mar 3, 2025 -
[Bug]: last gpu always OOM when only use pipeline parallelism with 2 nodes x 8cards
#14120 opened
Mar 3, 2025 -
[Bug]: PP > 1 with speculative decoding enabled reports an unsupported error
#14117 opened
Mar 3, 2025 -
[Feature]: Does Qwen2.5-VL support batch processing of multiple videos using vLLM?
#14112 opened
Mar 3, 2025 -
[Bug]: [vllm + Qwen2.5 VL72B] Model Continuously Outputs “!” for Certain Images
#14110 opened
Mar 3, 2025 -
[Bug]: Using fractional GPU will change the GPU resource names on Ray cluster nodes
#14109 opened
Mar 3, 2025 -
[New Model]: No supported config format found in deepseek-vl2-small
#14105 opened
Mar 3, 2025 -
[Feature]: baichuan-inc/Baichuan-Omni-1.5 support
#14104 opened
Mar 3, 2025 -
[Bug]: cannot launch deepseek-vl2 on A100
#14103 opened
Mar 3, 2025 -
[Bug]: max_model_len setting fail
#14102 opened
Mar 3, 2025
268 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Frontend] support image embeds
#13955 commented on
Mar 9, 2025 • 35 new comments -
[V1] V1 Enablement Oracle
#13726 commented on
Mar 9, 2025 • 33 new comments -
[Feature] Consolidate performance benchmark datasets
#14036 commented on
Mar 9, 2025 • 31 new comments -
[V1] AsyncLLM data parallel
#13923 commented on
Mar 8, 2025 • 30 new comments -
[Kernel] CUTLASS grouped gemm fp8 MoE kernel
#13972 commented on
Mar 7, 2025 • 24 new comments -
track server_load
#13950 commented on
Mar 8, 2025 • 19 new comments -
[V1] [Spec Decode] Support random sampling for spec decode
#13933 commented on
Mar 6, 2025 • 18 new comments -
[Model] Extend Ultravox to accept audio longer than 30s
#13631 commented on
Mar 9, 2025 • 14 new comments -
[MODEL] Add support for Zamba2 models
#13185 commented on
Mar 7, 2025 • 13 new comments -
[Doc] V1 user guide
#13991 commented on
Mar 5, 2025 • 13 new comments -
[V1] LoRA - Add triton kernels for V1
#13096 commented on
Mar 7, 2025 • 11 new comments -
[Core][Feature] Input metadata dump on crash
#13407 commented on
Mar 7, 2025 • 9 new comments -
[Doc] More neutral K8s deployment guide
#14084 commented on
Mar 9, 2025 • 9 new comments -
[Core] Integrate Fastsafetensor loader for loading model weights
#10647 commented on
Mar 7, 2025 • 8 new comments -
[v1] Refactor KVCacheConfig
#14079 commented on
Mar 9, 2025 • 8 new comments -
[Neuron] Add Neuron device communicator for vLLM v1
#14085 commented on
Mar 7, 2025 • 8 new comments -
[Feature] Add `vllm bench` CLI
#13993 commented on
Mar 4, 2025 • 6 new comments -
[FEAT] [ROCm] Enabling AITER Kernel
#14007 commented on
Mar 8, 2025 • 5 new comments -
[BigFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Mar 3, 2025 • 5 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Mar 5, 2025 • 4 new comments -
[Attention] Update to latest FA3 code that supports different K and V head dims
#13111 commented on
Mar 7, 2025 • 4 new comments -
[Hardware][TPU] improve kv cache update performance in prefill
#13176 commented on
Mar 8, 2025 • 3 new comments -
[Model] Add Support for Ovis1.6-Gemma2-9B Model
#11240 commented on
Mar 3, 2025 • 2 new comments -
[Kernel] Add more dtype support for GGUF kernels
#14043 commented on
Mar 8, 2025 • 2 new comments -
[Feature] Add filter for log redaction
#13225 commented on
Mar 5, 2025 • 1 new comment -
XGRAMMAR now support aarch64
#13894 commented on
Mar 9, 2025 • 1 new comment -
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs
#14071 commented on
Mar 6, 2025 • 1 new comment -
[V1][Frontend] Improve Shutdown And Logs
#14048 commented on
Mar 2, 2025 • 1 new comment -
[Bugfix] Fix Precision Mismatch in MoE Router of DeepSeek V2/V3 Models and Fused Kernels (BF16 -> FP32)
#14027 commented on
Mar 3, 2025 • 1 new comment -
Add ROCm Quark docs
#13984 commented on
Mar 5, 2025 • 1 new comment -
[Bug]: exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
#8840 commented on
Mar 9, 2025 • 0 new comments -
[RFC]: Encoder/decoder models & feature compatibility
#7366 commented on
Mar 9, 2025 • 0 new comments -
[Feature]: Support gemma2 GGUF architecture
#12000 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: GPTQ llama2-7b infer server failed!!!
#10848 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: Support for Controlled Decoding
#9541 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: Running llama2-7b on H20, Floating point exception (core dumped) appears on float16
#4392 commented on
Mar 8, 2025 • 0 new comments -
[Bug]: sentence_bert_config.json 404 Client Error
#11268 commented on
Mar 8, 2025 • 0 new comments -
[Bug]: Missing detection of BFloat16 for CPU ARM
#11814 commented on
Mar 8, 2025 • 0 new comments -
[Bug]: Vllm CPU mode only takes 1 single core for multi-core cpu
#10971 commented on
Mar 8, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Mar 9, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q1 2025
#11862 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
#10300 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: Inf2 Serving Fails
#11189 commented on
Mar 9, 2025 • 0 new comments -
Failed to find C compiler. Please specify via CC environment variable
#2997 commented on
Mar 9, 2025 • 0 new comments -
[Feature]: Support for RTX 5090 (CUDA 12.8)
#13306 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: vllm using ray in eks hangs when using --pipeline_parallel_size > 1
#11139 commented on
Mar 9, 2025 • 0 new comments -
[Model] Update MPT model with GLU and rope and add low precision layer norm
#9500 commented on
Mar 4, 2025 • 0 new comments -
[Frontend][Core] Add Guidance backend for guided decoding
#10217 commented on
Mar 7, 2025 • 0 new comments -
[Kernel] Add CUTLASS sparse support, heuristics, and torch operators
#10340 commented on
Mar 4, 2025 • 0 new comments -
Configuration of the model parallelism does not make sense
#10749 commented on
Mar 4, 2025 • 0 new comments -
[Bugfix] Check prompt length < max_model_len for all models in AsyncLLMEngine
#10881 commented on
Mar 5, 2025 • 0 new comments -
[Bugfix] add input embedding
#11684 commented on
Mar 8, 2025 • 0 new comments -
[V1] Allow sliding window + prefix caching
#13069 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: AttributeError: 'Qwen2Model' object has no attribute 'rotary_emb'
#10773 commented on
Mar 7, 2025 • 0 new comments -
[Installation]: Missing v0.6.3.post1-cu118-cp310.whl. Can share it? Thanks so much
#10036 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
#8024 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
#8893 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Engine fails to start when running Qwen2.5 Deepseek r1
#12554 commented on
Mar 7, 2025 • 0 new comments -
[Usage]: Qwen2-VL keyword argument `max_pixels` is not a valid argument for this processor and will be ignored.
#13143 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: RuntimeError: No CUDA GPUs are available in transformers v4.48.0 or above when running Ray RLHF example
#13597 commented on
Mar 7, 2025 • 0 new comments -
[RFC]: Hardware pluggable
#11162 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: Support torch.distributed as the runtime for multi-node inference
#12511 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: DeepseekR1 model load fails with weights tied error
#12541 commented on
Mar 7, 2025 • 0 new comments -
[RFC]: Drop support for prompt adapter
#13981 commented on
Mar 7, 2025 • 0 new comments -
Generate nothing from VLLM output
#1185 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Segment fault when loading model on multi-gpu
#13309 commented on
Mar 7, 2025 • 0 new comments -
[Performance]: decoding speed on long context
#11286 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: invalid argument [ Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
#13270 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Spurious warning on dropped args
#12856 commented on
Mar 7, 2025 • 0 new comments -
[V1][Help Wanted] Porting missing sampling parameters to V1
#13058 commented on
Mar 7, 2025 • 0 new comments -
[New Model]: answerdotai/ModernBERT-large
#11347 commented on
Mar 7, 2025 • 0 new comments -
[Installation]: subprocess-exited-with-error while installing vllm
#12965 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Can't use yarn rope config for long context in Qwen2 model
#10293 commented on
Mar 7, 2025 • 0 new comments -
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
#13216 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Waiting for output from MQLLMEngine. Hangs and then crashes after about an 1 hour
#14025 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: see connection to gpu node timeout issue when initializing ray vllm multi-node serving
#13052 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: KV Cache Quantization with GGUF turns out quite poorly.
#10411 commented on
Mar 8, 2025 • 0 new comments -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 commented on
Mar 8, 2025 • 0 new comments -
[RFC]: Merge input processor and input mapper for multi-modal models
#10114 commented on
Mar 8, 2025 • 0 new comments -
[Doc]: Why NGramWorker does not support cache operations
#11758 commented on
Mar 8, 2025 • 0 new comments -
[V1][Bugfix] DeepSeek-V3 v1 attn_backend miss q_lora_rank
#13092 commented on
Mar 6, 2025 • 0 new comments -
[BUG] Addreses #3935 and #3683, by making `intial_incremental_detokenization_offset` configurable
#13106 commented on
Mar 6, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Mar 6, 2025 • 0 new comments -
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1
#13305 commented on
Mar 4, 2025 • 0 new comments -
[V0][Sampler] Use raw logits for greedy argmax
#13312 commented on
Mar 7, 2025 • 0 new comments -
[Kernel] moe wna16 cuda kernel
#13321 commented on
Mar 9, 2025 • 0 new comments -
[CPU] Upgrade CPU backend to torch-2.6
#13381 commented on
Mar 7, 2025 • 0 new comments -
[Model][MiniMaxText01] Support MiniMaxText01 model inference
#13454 commented on
Mar 7, 2025 • 0 new comments -
[Frontend] Implement Tool Calling with `tool_choice='required'`
#13483 commented on
Mar 5, 2025 • 0 new comments -
[Bugfix] Fix quantization skip modules logic
#13562 commented on
Mar 8, 2025 • 0 new comments -
Integrating torchao quantization into vllm
#13588 commented on
Mar 4, 2025 • 0 new comments -
[Model] Support VLMs with transformers backend
#13754 commented on
Mar 8, 2025 • 0 new comments -
[ROCm] Enable custom paged attention kernel for Navi3/4
#13843 commented on
Mar 6, 2025 • 0 new comments -
Fix TPU CI
#13898 commented on
Mar 9, 2025 • 0 new comments -
Upgrade `transformers` to `v4.49.0`
#13905 commented on
Mar 4, 2025 • 0 new comments -
[Feat][whisper] add more sampling parameters to whisper endpoint
#13910 commented on
Mar 6, 2025 • 0 new comments -
Add test for DeepGEMM contiguous layout MoE kernels
#13932 commented on
Mar 8, 2025 • 0 new comments -
[WIP][Core] Support tensor parallelism with uneven heads
#13934 commented on
Mar 2, 2025 • 0 new comments -
[Model] [Quantization] Support deepseek v3/r1 w8a8 int8 block-wise quantization
#13942 commented on
Mar 7, 2025 • 0 new comments -
Support non-attention path operators in Triton
#13963 commented on
Mar 3, 2025 • 0 new comments -
[BugFix]: properly catch templating error when preprocess input
#13976 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Speculative Decoding Tokens not being included in Prometheus metrics
#13992 commented on
Mar 3, 2025 • 0 new comments -
[Kernel] Integrate DeepGEMM dense block fp8
#13996 commented on
Mar 3, 2025 • 0 new comments -
benchmark serving: random + sharegpt dataset
#14026 commented on
Mar 3, 2025 • 0 new comments -
[Bugfix] Make memory profiler account for speculative draft model weights
#14067 commented on
Mar 3, 2025 • 0 new comments -
[Bugfix]: do not shutdown server is `skip_special_use=False` for MistralTokenizer
#14094 commented on
Mar 7, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Mar 9, 2025 • 0 new comments -
Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support
#11844 commented on
Mar 7, 2025 • 0 new comments -
[API_SERVER] Add maximum concurrency limit for API interface
#11997 commented on
Mar 3, 2025 • 0 new comments -
[VLM] Merged multi-modal processor for Pixtral
#12211 commented on
Mar 8, 2025 • 0 new comments -
[Model] Enable Inference Support for the New Baichuan-M1 Model
#12251 commented on
Mar 8, 2025 • 0 new comments -
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral
#12303 commented on
Mar 7, 2025 • 0 new comments -
add support for AMD MI25/50/60
#12431 commented on
Mar 7, 2025 • 0 new comments -
[CI/Build] Better default num jobs heuristic
#12477 commented on
Mar 7, 2025 • 0 new comments -
[Kernel] Add ModelOpt FP4 Checkpoint Support
#12520 commented on
Mar 7, 2025 • 0 new comments -
layerwise KV transfer in PD Disaggregation
#12523 commented on
Mar 5, 2025 • 0 new comments -
[Bugfix] fix vocab size assertion
#12550 commented on
Mar 5, 2025 • 0 new comments -
[CI] Performance regression fastcheck
#12576 commented on
Mar 5, 2025 • 0 new comments -
add tools definition into tokenize api
#12684 commented on
Mar 6, 2025 • 0 new comments -
Add helm chart release workflow
#12685 commented on
Mar 6, 2025 • 0 new comments -
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform
#12695 commented on
Mar 6, 2025 • 0 new comments -
add initial Blackwell support
#12702 commented on
Mar 6, 2025 • 0 new comments -
Update to torch==2.6.0
#12721 commented on
Mar 9, 2025 • 0 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
Mar 7, 2025 • 0 new comments -
[Bugfix] Add Containerfile.arm for podman support
#12735 commented on
Mar 8, 2025 • 0 new comments -
[build][misc] allow to use recent numpy
#12759 commented on
Mar 9, 2025 • 0 new comments -
[Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU
#12779 commented on
Mar 3, 2025 • 0 new comments -
adding workaround for c2x/c3x initializer issue
#12866 commented on
Mar 6, 2025 • 0 new comments -
[Core][Frontend] Fix : Adding Control Vector Support
#12870 commented on
Mar 6, 2025 • 0 new comments -
Add flag for enabling finer-grained cuda graph capture
#12920 commented on
Mar 6, 2025 • 0 new comments -
[Bugfix] Adjust tool call handling in llama template to support single tool calls only
#12938 commented on
Mar 3, 2025 • 0 new comments -
[Feature][Frontend] Add KVTransferParams for disaggregated prefill feature
#12957 commented on
Mar 4, 2025 • 0 new comments -
Registered the model config for DeepSeek-V3.
#13055 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: LoRA support for qwen2-vl Models
#11255 commented on
Mar 4, 2025 • 0 new comments -
[Bug]:Phi-4-Mini giving garbage outputs with torch 2.5.1 and vllm==0.7.3 with multiple parallel requests
#14058 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Unsloth bitsandbytes quantized model cannot be run due to: `KeyError: 'layers.42.mlp.down_proj.weight.absmax`
#10710 commented on
Mar 4, 2025 • 0 new comments -
[Installation]: When i build vllm from source with pip install -e ,there is a ninja error: unknown target '_vllm_fa3_C', did you mean '_vllm_fa2_C'.
#13183 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Qwen2vl grounding results with vllm are worse than with transformers inference
#11254 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: running deepseek-r1 14B with 2*5090D
#13914 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Why 0 device need more memory? will it cause OOM?
#14011 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: deepseek-r1 mutlti-node crash
#13136 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: how to use prefill-decode disaggregation ??
#11490 commented on
Mar 4, 2025 • 0 new comments -
unload the model
#3281 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: vllm api_server often crashes when the version is higher than 0.5.3.
#7936 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Issue when benchmarking the dynamically served LoRA adapter
#8564 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Engine iteration timed out. (during using qwen2-vl-7b)
#10123 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Get meaningless output when run long context inference of Qwen2.5 model with vllm>=0.6.3
#10298 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Cant use vllm on a multiGPU node
#10474 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: benchmark random input-len inconsistent
#10847 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vllm, has anyone seen requests start at a normal 3-5 s each right after deployment, then gradually slow down to ~60 s each after some time in use?
#13886 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: ERROR hermes_tool_parser.py:108] Error in extracting tool call from response.
#10831 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: Sampling several sequences from OpenAI compatible server.
#10852 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: How to access to the generated token from LogitsProcessor
#10885 commented on
Mar 5, 2025 • 0 new comments -
[BUG] [MultiStep+AsyncOutputProc] the remaining steps not released when request output reaches max-token
#10890 commented on
Mar 5, 2025 • 0 new comments -
[Feature]: when will publish a new version
#10892 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: VLLM (0.7.0) will report gpu missing on the hosting node in Ray
#12614 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Runtime error when running MLA models with "prefix caching enabled" and "chunked prefill disabled"
#14069 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: The value of --max-model-len may influence results although the length of input less than max-model-len
#11447 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Qwen2-VL-2B-Instruct Issue when passing a video URL to /chat/completions
#13927 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Generation mismatch with Model: meta-llama/Llama-3.2-11B-Vision-Instruct
#13763 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: mllama AssertionError during kv cache profiling
#13929 commented on
Mar 4, 2025 • 0 new comments -
[Feature]: Add an endpoint to know the server config
#13056 commented on
Mar 4, 2025 • 0 new comments -
[Installation]: Could not find a version that satisfies the requirement xgrammar>=0.1.6; platform_machine == "x86_64" (from vllm) (from versions: none)
#11886 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Error in benchmark model with vllm backend for endpoint /v1/chat/completions
#10158 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: vLLM 0.7.3 TypeError in vllm.entrypoints.api_server Argument Parsing
#13848 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: How to use pipeline parallelism in offline inference?
#13453 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: How to do offline inference on multi-node with tensor-parallel and pipeline-parallel
#12950 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Guided choice not working as expected
#12225 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')
#14040 commented on
Mar 3, 2025 • 0 new comments -
[New Model]: support Ovis VLM series
#13441 commented on
Mar 3, 2025 • 0 new comments -
Please add LoRA support for higher ranks and alpha values
#2847 commented on
Mar 3, 2025 • 0 new comments -
[RFC] Initial Support for Cloud TPUs
#3620 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Chunked prefill + lora
#4995 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Phi-3-small-128k-instruct on 1 A100 GPUs - Assertion error: Does not support prefix-enabled attention.
#7787 commented on
Mar 3, 2025 • 0 new comments -
[Usage]: What's the relationship between KV cache and MAX_SEQUENCE_LENGTH?
#10517 commented on
Mar 3, 2025 • 0 new comments -
[Usage]: vllm inference with 2 * Nvidia-L20, output repeats !!!!
#10713 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Improve Error Messaging for Unsupported Tasks in vLLM (e.g., embedding with Llama Models)
#10794 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: LoRA support for Llama 3.2 Vision Models
#10824 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: vllm multi-card (TP) inference of bnb models is not supported
#10823 commented on
Mar 3, 2025 • 0 new comments -
[Usage]: Moving from 1 to 2 GPUs in vLLM
#10826 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Implement Priority Scheduling In V1 Engine
#14002 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: ValueError: vllm serve Qwen/Qwen2.5-VL-72B-Instruct-AWQ ERROR: The input size is not aligned with the quantized weight shape.
#13980 commented on
Mar 2, 2025 • 0 new comments -
[Bug][Ray]: Pipeline parallelism fails on the same host
#14093 commented on
Mar 4, 2025 • 0 new comments -
[Feature][Frontend]: Deprecate `--enable-reasoning`
#14088 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
#6766 commented on
Mar 3, 2025 • 0 new comments -
[RFC]: Implement Structured Output support for V1 engine
#11908 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Add KV Cache Metrics to Usage Object
#12283 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Online Inference on local model with OpenAI Python SDK
#8631 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Mar 3, 2025 • 0 new comments -
[RFC]: Sparse KV cache management framework
#12254 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Expose option to load new model weights from disk
#12774 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Support for DeepGEMM
#13857 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Nonsensical Sentences Generated When Inferencing INT8 Quantized Qwen2.5-72B Model
#11175 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: macOS with vllm-cpu v0.6.6-post2 serving Qwen2.5-1.5b-Instruct results in endless exclamation marks
#12427 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: The accuracy of multi-card and single-card inference is inconsistent
#13801 commented on
Mar 3, 2025 • 0 new comments -
[Doc]: provide docker-compose.yml for multi-node serving
#13158 commented on
Mar 3, 2025 • 0 new comments -
[Installation]: Dockerfile.cpu installation problem in vLLM
#14033 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Store KVCache in 3FS
#14012 commented on
Mar 3, 2025 • 0 new comments -
[RFC]: Async KV Cache Transfer for Disaggregated Inference
#13020 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: CUDA Exception on multi-gpus with concurrent users
#12307 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: vllm hangs after upgrade to v0.5.4
#7297 commented on
Mar 6, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: How can I use 6 GPUs?
#11147 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: AttributeError: Qwen2Tokenizer has no attribute lower
#13127 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: crash: RecursionError: maximum recursion depth exceeded
#9608 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: ValueError: The checkpoint you are trying to load has model type `qwen2_5_vl` but Transformers does not recognize this architecture
#13446 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: Why is CPU KV cache usage always 0.0%?
#11871 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: Low GPU utilization when running Deepseek-r1-distill-llama-8B
#14022 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: LLM repeats output automatically
#13952 commented on
Mar 6, 2025 • 0 new comments -
[Performance]: Plan to support DP attention for Deepseek models
#12871 commented on
Mar 6, 2025 • 0 new comments -
[Installation]: AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_and_mul' on the CPU instance
#13593 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: When using the latest 0.6.3, No module named 'vllm._version' appears
#9421 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: deepseek r1 + vllm (v0.7.2) torch.compile error
#13471 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: Support tool parser for DeepSeek-V3
#13764 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: Llama-3.1-405B Inference with vLLM on TPU
#9052 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: XGrammar-based CFG decoding degraded after 0.6.5
#12122 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: The inference result from vLLM is incorrect on specific prompt
#10916 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Not able to install/compile vllm using an Alpine Linux base image
#10924 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: add optimum-neuron
#10946 commented on
Mar 7, 2025 • 0 new comments -
[Misc]: Saved sharded state should also include GPU P2P access cache
#10967 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: Support Qwen/Qwen2.5-14B-Instruct-1M
#12452 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: add DoRA support
#10849 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Using "response_format": { "type": "json_object" } with /v1/chat/completions is terminating the engine
#11828 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: The api server /health endpoint is unable to detect when the Worker VllmWorkerProcess has died
#11996 commented on
Mar 6, 2025 • 0 new comments -
[Feature][v1]: Add metrics support
#10582 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Qwen/Qwen2.5-1.5B-Instruct generates out of vocabulary tokens
#13175 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: Support "required" option in tool_choice
#13002 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: When using the vLLM framework to load vision models, CPU memory overflows while continuously processing data with images.
#12973 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: MiniCPM-o 2.6's fine-tuned LoRA is not supported
#13018 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Using lm_format_enforcer, or using certain schemas, with Llama-3.2-90B-Vision-Instruct causes a crash
#11248 commented on
Mar 6, 2025 • 0 new comments -
[New Model]: Ovis2
#13251 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered
#12233 commented on
Mar 6, 2025 • 0 new comments -
[New Model]: nomic-ai/nomic-embed-text-v1
#12054 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: No available block found in 60 seconds in shm
#6614 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: torchvision.libs/libcudart.41118559.so.12 (deleted): cannot open shared object file: No such file or directory
#13040 commented on
Mar 5, 2025 • 0 new comments -
[Performance]: enforce_eager=False degrades the performance metrics for long-context input
#13536 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Fatal Python Error when Starting DeepSeek V3
#13014 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: cross-node tensor parallel + cudagraph issue
#13552 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Make https://wheels.vllm.ai/nightly inspectable
#13545 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Gemma 2 - AttributeError: 'Gemma2Config' object has no attribute 'interleaved_sliding_window'
#13226 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: AssertionError, assert prefill_metadata.context_chunk_seq_tot is not None
#14009 commented on
Mar 5, 2025 • 0 new comments -
How to use vllm to compute ppl score for input text?
#1019 commented on
Mar 5, 2025 • 0 new comments -
[Feature]: Support for priority preemption with chunked-prefill
#10101 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: OpenTelemetry with FastAPI not working
#10213 commented on
Mar 5, 2025 • 0 new comments -
[Doc]: Docker+vllm+fastchat deploys multimodal large model Qwen2-vl-7b-instruct
#10566 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Engine process (pid 76) died
#10812 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: The following error occurred when I ran the qwen2.5-7b-fp8-dynamic model with vllm 0.6.4.post1 on a single 4090 card
#10828 commented on
Mar 5, 2025 • 0 new comments -
[Installation]: XPU dependencies are missing
#11173 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: An error occurred while using H20 GPUs to perform multi-machine inference of a 405B model through the Ray cluster, causing inference to crash.
#9215 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Can't serve on a Ray cluster even though VLLM_HOST_IP is passed
#13521 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Speculative Decoding without enabling eager mode returns gibberish output after some tokens.
#10559 commented on
Mar 6, 2025 • 0 new comments -
[Performance]: logit bias implementation uses a slow for loop
#10741 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: Serving VLM VILA
#10889 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Illegal Memory access was encounterd when running UT: pytest -s -v vllm/tests/spec_decode/test_multi_step_worker.py::test_use_draft_model_runner_advance_step
#10918 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: deepseek v2.5 gptq (int4) error with vllm-0.6.4
#10923 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: vllm v1/chat/completions Internal Server Error
#10925 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: ERROR 07-26 14:50:35 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 214281 died, exit code: -11
#6823 commented on
Mar 6, 2025 • 0 new comments -
[Misc]: I want to run Llama 3.1 405B using speculative decoding. Can you give me a guide?
#7456 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Llama 3.2 90b crash
#10648 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: Why is speculative decoding slower than normal decoding?
#8439 commented on
Mar 5, 2025 • 0 new comments -
[RFC]: Disaggregated prefilling and KV cache transfer roadmap
#10818 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: vllm0.7.3: an illegal memory access was encountered
#13824 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: (v0.7.2): RuntimeError: CUDA error: an illegal memory access was encountered
#13939 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: V1 engine ignores logits processors and min-p sampling
#12678 commented on
Mar 5, 2025 • 0 new comments