Insights: vllm-project/vllm
Overview
116 Pull requests merged by 73 people
-
[Misc] Clean up useless code
#19889 merged
Jun 20, 2025 -
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError
#19749 merged
Jun 20, 2025 -
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests
#19901 merged
Jun 20, 2025 -
Export NaNs in logits to scheduler_stats if output is corrupted
#18777 merged
Jun 20, 2025 -
[custom_op][vllm-plugin] update custom_op class to use op_registry
#19164 merged
Jun 20, 2025 -
[Model] GPT2ForSequenceClassification model
#19663 merged
Jun 20, 2025 -
[Fix] import regex instead of re
#19875 merged
Jun 20, 2025 -
[Kernel] correct cpu worker function parameter type
#19745 merged
Jun 20, 2025 -
[Misc] refactor example - openai_transcription_client
#19851 merged
Jun 20, 2025 -
[Misc] update cuda version
#19526 merged
Jun 20, 2025 -
[Bugfix][Ray] Set the cuda context eagerly in the ray worker
#19583 merged
Jun 20, 2025 -
[Bugfix] Enable PP with AITER+V1
#19822 merged
Jun 20, 2025 -
[Chore]: qwen3-moe-type-hints-mistake
#19860 merged
Jun 20, 2025 -
[Benchmark] Fix `Value of type "SampleRequest" is not indexable`
#18032 merged
Jun 20, 2025 -
[CI][Neuron] Fail and exit on first error
#19622 merged
Jun 20, 2025 -
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI
#19872 merged
Jun 20, 2025 -
[Benchmark][Bugfix] Fix Dataset Length Calculation
#19868 merged
Jun 20, 2025 -
[Frontend] early return chat format resolution when specified
#19735 merged
Jun 19, 2025 -
[Core][Bugfix] Fix Online MM Beam Search
#19688 merged
Jun 19, 2025 -
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI
#19838 merged
Jun 19, 2025 -
[Doc] Update V1 user guide for embedding models
#19842 merged
Jun 19, 2025 -
Fixing Chunked Prefill Test.
#19762 merged
Jun 19, 2025 -
[Frontend] Add optional token-level progress bar to `LLM.beam_search`
#19301 merged
Jun 19, 2025 -
Add xLAM tool parser support
#17148 merged
Jun 19, 2025 -
[Minor] Allow redirecting model path for HfRunner in test
#19795 merged
Jun 19, 2025 -
raise exception for pin_lora
#19809 merged
Jun 19, 2025 -
[Misc] [ROCm] Prevent surplus tensor reshape
#19803 merged
Jun 19, 2025 -
[ROCm] [AITER] [Bugfix] Patch for AITER commit `648764942e552a8bb5fe16026703716a81f05374`
#18990 merged
Jun 19, 2025 -
Mark invariant normalizer in Gemma as non-persistent
#19788 merged
Jun 19, 2025 -
[Bugfix] Add check_health to v1 async client.
#19821 merged
Jun 19, 2025 -
[Bugfix] Fix the linter
#19826 merged
Jun 19, 2025 -
Support embedding models in V1
#16188 merged
Jun 19, 2025 -
[Quantization] Modify the logic of BNB double quantization
#19742 merged
Jun 19, 2025 -
[Misc][ROCm] Enforce no unused variable in ROCm C++ files
#19796 merged
Jun 19, 2025 -
Fix FA2 fallback for Blackwell V1
#19781 merged
Jun 19, 2025 -
[Frontend] Expose custom args in OpenAI APIs
#16862 merged
Jun 19, 2025 -
[BugFix] Fix use_cudagraph=False
#19612 merged
Jun 19, 2025 -
[Multimodal] Use fast processor for Qwen2/2.5-VL
#19789 merged
Jun 18, 2025 -
[Core] More fixes to MultiModalEmbeddings type handling
#19715 merged
Jun 18, 2025 -
[TPU] Update torch-xla version to include paged attention tuned block change
#19813 merged
Jun 18, 2025 -
[Core] Do not copy array during hashing
#19484 merged
Jun 18, 2025 -
Disable "Forbid direct 'import triton'" check for
vllm/triton_utils/importing.py
in an extensible way#19783 merged
Jun 18, 2025 -
docs: fix Slack bulletpoint in README
#19811 merged
Jun 18, 2025 -
[v1] Support mamba2
#19327 merged
Jun 18, 2025 -
[Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc
#19808 merged
Jun 18, 2025 -
[Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully
#19725 merged
Jun 18, 2025 -
[Hardware][AMD] integrate aiter chunked prefill into vllm
#18596 merged
Jun 18, 2025 -
[Qwen] Add tagging rule for Qwen related PRs
#19799 merged
Jun 18, 2025 -
[Platform] Allow platform use V1 Engine by default
#19792 merged
Jun 18, 2025 -
[doc] fix the incorrect label
#19787 merged
Jun 18, 2025 -
[Minor] Zero-initialize attn output buffer
#19784 merged
Jun 18, 2025 -
[V1] Decouple GPU and TPU `InputBatch`
#19778 merged
Jun 18, 2025 -
[V1][P/D] An native implementation of xPyD based on P2P NCCL
#18242 merged
Jun 18, 2025 -
[V1] Add API docs for EncoderCacheManager
#19294 merged
Jun 18, 2025 -
[Misc] Add __str__ for RequestStatus
#19780 merged
Jun 18, 2025 -
[MISC] correct DeviceConfig device field static type analysis
#19699 merged
Jun 18, 2025 -
[MISC] correct copy_blocks src_to_dists param type
#19696 merged
Jun 18, 2025 -
[TPU] Update torch version to include paged attention kernel change
#19706 merged
Jun 17, 2025 -
[Feature][ROCm] Add full graph capture support for TritonAttentionBackend
#19158 merged
Jun 17, 2025 -
[Bugfix] Fix faulty triton importing logic when using Ray for DP
#19734 merged
Jun 17, 2025 -
[Misc] Update lmcache connector with the latest connector apis
#19441 merged
Jun 17, 2025 -
Remove sm120 arch from sm100 cutlass kernel arch list
#19716 merged
Jun 17, 2025 -
[Perf] Optimize `moe_align_block_size` CUDA kernel
#19572 merged
Jun 17, 2025 -
[Bugfix] Update multimodal models mapping to fit new checkpoint after Transformers v4.52
#19151 merged
Jun 17, 2025 -
[Misc] remove duplicate engine status checks
#19647 merged
Jun 17, 2025 -
[V1][Kernel] Flashinfer HND KV cache layout
#19280 merged
Jun 17, 2025 -
[doc] split "Other AI Accelerators" tabs
#19708 merged
Jun 17, 2025 -
[doc][mkdocs] Add edit button to documentation
#19637 merged
Jun 17, 2025 -
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel
#19152 merged
Jun 17, 2025 -
Add a doc on how to update PyTorch version
#19705 merged
Jun 17, 2025 -
[Doc] Add missing llava family multi-image examples
#19698 merged
Jun 17, 2025 -
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager
#19686 merged
Jun 17, 2025 -
Fixes IMA for TP w/ flex-attention
#19712 merged
Jun 17, 2025 -
[DOC] fix doc typos
#19600 merged
Jun 17, 2025 -
[Frontend] add chunking audio for > 30s audio
#19597 merged
Jun 17, 2025 -
[Wheel Size] Only build FA2 8.0+PTX
#19336 merged
Jun 17, 2025 -
[doc] add project flag to gcloud TPU command
#19664 merged
Jun 17, 2025 -
[Fix] Fall back to Gloo when NCCL backend is unavailable
#19641 merged
Jun 17, 2025 -
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100
#19563 merged
Jun 16, 2025 -
[V1] Change return type on get_multimodal_embeddings()
#19446 merged
Jun 16, 2025 -
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM)
#19677 merged
Jun 16, 2025 -
[Kernels] Use empty for modular MoE workspaces
#19667 merged
Jun 16, 2025 -
[Bugfix] fix missing 'finish_reason': null in streaming chat
#19662 merged
Jun 16, 2025 -
[MISC] bump huggingface_hub pkg to 0.33.0
#19547 merged
Jun 16, 2025 -
[Bugfix] Fix TP inference for Flex attention backend
#19657 merged
Jun 16, 2025 -
[Feature]: Allow for Granite MoE Hybrid models with _only_ shared experts.
#19652 merged
Jun 16, 2025 -
[DOC] Add reasoning capability to vLLM streamlit code
#19557 merged
Jun 16, 2025 -
[BugFix] Don't catch BaseException when dumping execute_model errors
#19626 merged
Jun 16, 2025 -
[Kernel] GGUF MMVQ kernel for multiple input vectors
#18754 merged
Jun 16, 2025 -
[Docs] Move multiproc doc to v1 dir
#19651 merged
Jun 16, 2025 -
[CI] Add mteb testing for rerank models
#19344 merged
Jun 16, 2025 -
[MISC] typo fix
#19672 merged
Jun 16, 2025 -
[TPU] support attention head dim smaller than 128
#19620 merged
Jun 16, 2025 -
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config
#19660 merged
Jun 16, 2025 -
[Misc][Frontend] passthrough `bad_words`
#19564 merged
Jun 16, 2025 -
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker
#18957 merged
Jun 16, 2025 -
[MISC] Remove unused variables in C++
#19609 merged
Jun 16, 2025 -
[Misc] Remove duplicate multiproc method setting for CPU platform
#19649 merged
Jun 16, 2025 -
[CI/Build] Fix torch nightly CI dependencies part 2
#19589 merged
Jun 15, 2025 -
Enable prefix caching with full cuda graphs
#19617 merged
Jun 15, 2025 -
[Benchmark] Refactor benchmark script for fp8 & int8
#19627 merged
Jun 15, 2025 -
[Kernel] Raise verbose error and consolidate `num_heads`/`num_kv_heads` divisibility check
#19339 merged
Jun 15, 2025 -
[Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness
#19644 merged
Jun 15, 2025 -
[Perf] Further tunings for SM100 FP8 CUTLASS kernel
#19566 merged
Jun 15, 2025 -
[Fix] Convert kv_transfer_config from dict to KVTransferConfig
#19262 merged
Jun 14, 2025 -
[Bugfix] Don't attempt to use triton if no driver is active
#19561 merged
Jun 14, 2025 -
Only build CUTLASS MoE kernels on Hopper
#19648 merged
Jun 14, 2025 -
[Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization
#19500 merged
Jun 14, 2025 -
[Bugfix] Fix auto dtype casting for BatchFeature
#19316 merged
Jun 14, 2025 -
[Misc] Modularize CLI Argument Parsing in Benchmark Scripts
#19593 merged
Jun 14, 2025 -
[Bugfix][1/n] Fix the speculative decoding test by setting the target dtype
#19633 merged
Jun 14, 2025 -
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics.
#18354 merged
Jun 14, 2025 -
[BugFix] Fix DP Coordinator incorrect debug log message
#19624 merged
Jun 14, 2025 -
Adding "AMD: Multi-step Tests" to amdproduction.
#19508 merged
Jun 14, 2025 -
[torch.compile] Use custom ops when use_inductor=False
#19618 merged
Jun 13, 2025 -
[Doc] Add troubleshooting section to k8s deployment
#19377 merged
Jun 13, 2025
96 Pull requests opened by 72 people
-
Sync test dependency with test.in for torch nightly
#19632 opened
Jun 14, 2025 -
[Config] Make prefix cache metrics interval configurable
#19634 opened
Jun 14, 2025 -
[Frontend] Support image object in llm.chat
#19635 opened
Jun 14, 2025 -
[Kernels] MoE refactor
#19636 opened
Jun 14, 2025 -
[Doc] Add inplace weights loading example
#19640 opened
Jun 14, 2025 -
[BugFix] Add an env to disable moe chunking by default
#19642 opened
Jun 14, 2025 -
Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor
#19643 opened
Jun 14, 2025 -
When numa support is found but size is 0, divide by zero exception
#19654 opened
Jun 14, 2025 -
[Bugfix] - Add Trace Headers to Beam Search Path
#19655 opened
Jun 15, 2025 -
[Feature]: Support offline expert load distribution recording
#19658 opened
Jun 15, 2025 -
feat(model loader): add load format 'prefetch_auto' for parallel mmap…
#19659 opened
Jun 15, 2025 -
[Misc] add CLI completion
#19669 opened
Jun 16, 2025 -
[Model] Support TP/PP/mamba2 kernel for PLaMo2
#19674 opened
Jun 16, 2025 -
[Model] Automatic conversion of score (CrossEncoding) models
#19675 opened
Jun 16, 2025 -
[Bugfix] ensure tool_choice is popped when `tool_choice:null` is passed in json payload
#19679 opened
Jun 16, 2025 -
[V1] [Metrics] Hide deprecated metrics.
#19682 opened
Jun 16, 2025 -
Add Thor SBSA and Spark
#19685 opened
Jun 16, 2025 -
[P/D][NixlConnector] Support `tp_size > num_kv_heads` deployments
#19691 opened
Jun 16, 2025 -
feat: add enforce_include_usage option
#19695 opened
Jun 16, 2025 -
add type assertion of request_id for LLMEngine.add_request
#19700 opened
Jun 16, 2025 -
[Docs] Enhance SupportsMultiModal interface documentation
#19701 opened
Jun 16, 2025 -
Make sure the correct version of ao is installed in CI
#19704 opened
Jun 16, 2025 -
Adding "AMD: Plugin Tests" to amdproduction.
#19707 opened
Jun 16, 2025 -
[Model] Activated LoRA
#19710 opened
Jun 16, 2025 -
[Misc][Tools][Benchmark] Add profile to autotune script
#19711 opened
Jun 16, 2025 -
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs.
#19717 opened
Jun 16, 2025 -
[V1] Perf optimization for layers reusing shared KV cache
#19719 opened
Jun 17, 2025 -
[Kernel] Masked act_mul and fp8-quant Kernels for Batched MoE
#19721 opened
Jun 17, 2025 -
v1: Add Request.block_hashes
#19728 opened
Jun 17, 2025 -
[PD] let toy proxy handle /chat/completions
#19730 opened
Jun 17, 2025 -
v1: Support KV events from connectors
#19737 opened
Jun 17, 2025 -
[P/D] Handle Abort and Make Lifecycle Explicit
#19740 opened
Jun 17, 2025 -
[Feature] add quick all reduce
#19744 opened
Jun 17, 2025 -
[BugFix] fix: aot passes kvcache dtype information
#19750 opened
Jun 17, 2025 -
[v1] Re-add fp32 support to v1 engine through FlexAttention
#19754 opened
Jun 17, 2025 -
[Multimodal] Optimize Qwen2/2.5-VL startup time
#19756 opened
Jun 17, 2025 -
[feat]: CUTLASS block scaled group gemm for SM100
#19757 opened
Jun 17, 2025 -
Register deepgemm moe kernels to work with v1 engine
#19759 opened
Jun 17, 2025 -
[BugFix] Fix topk_softmax assert
#19764 opened
Jun 17, 2025 -
add mamba head fix
#19766 opened
Jun 17, 2025 -
[Draft][torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends
#19767 opened
Jun 17, 2025 -
BLOCK_SIZE_K fix
#19769 opened
Jun 17, 2025 -
Workaround for an integer overflow with large CHUNK_SIZE
#19770 opened
Jun 17, 2025 -
Triton-fused DeepseekScalingRotaryEmbedding
#19771 opened
Jun 17, 2025 -
[AMD][P/D] Add libamdhip64.so.6 for llmd
#19773 opened
Jun 17, 2025 -
[WIP] Splitting attention _fwd_grouped_kernel_stage1 to improve occupancy
#19774 opened
Jun 17, 2025 -
[Ray] v1 Change device str for platform compatibility
#19785 opened
Jun 18, 2025 -
[DP] Support external DP Load Balancer mode
#19790 opened
Jun 18, 2025 -
Add SM120 to the Dockerfile
#19794 opened
Jun 18, 2025 -
Allow to override KV cache memory calculation
#19804 opened
Jun 18, 2025 -
[MISC] add cpu_kvcache_space_bytes to CacheConfig
#19812 opened
Jun 18, 2025 -
Improve quant config semantic clarity, add Nvidia ModelOpt config adaptation
#19815 opened
Jun 18, 2025 -
Introduce RayCudaCommunicator as Ray Compiled Graph communicator
#19816 opened
Jun 18, 2025 -
[Kernel] Add Conch backend for mixed-precision linear layer
#19818 opened
Jun 18, 2025 -
LoRA support on llama4
#19819 opened
Jun 18, 2025 -
[Feature] Integrate new deepgemm
#19820 opened
Jun 18, 2025 -
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100).
#19825 opened
Jun 19, 2025 -
Move Gemma's stacked_params_mapping to class scope
#19829 opened
Jun 19, 2025 -
FP8 custom ops
#19830 opened
Jun 19, 2025 -
[WIP] Async Scheduler Prototype
#19831 opened
Jun 19, 2025 -
[P/D] Asynchronously do _nixl_handshake
#19836 opened
Jun 19, 2025 -
[Do not merge] Cache model info for faster startup
#19837 opened
Jun 19, 2025 -
Add Cutlass integration for MoE FP8
#19843 opened
Jun 19, 2025 -
[BugFix][V0] Fix AssertionError for prompt_logprobs
#19844 opened
Jun 19, 2025 -
refactor example - qwen3_reranker
#19847 opened
Jun 19, 2025 -
v1: Introduce an offloading component
#19848 opened
Jun 19, 2025 -
[Chore] logging metrics rename
#19852 opened
Jun 19, 2025 -
optimize attn
#19858 opened
Jun 19, 2025 -
[Misc] add vllm_config in __init__
#19866 opened
Jun 19, 2025 -
[Docs] Fix syntax highlighting of shell commands
#19870 opened
Jun 19, 2025 -
[V1 Scheduler] BatchScheduler to balance token-based microbatches and reduce GPU pipeline bubbles
#19873 opened
Jun 19, 2025 -
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when *all* transfer done
#19874 opened
Jun 19, 2025 -
Add page-aligned prefill scheduling.
#19878 opened
Jun 19, 2025 -
[Quantization] Add compressed-tensors emulations support for NVFP4
#19879 opened
Jun 19, 2025 -
[Misc] Add type alias `ReqId` and `EngineId` for better readability
#19880 opened
Jun 19, 2025 -
[Core] Add `update_load_config` RPC method
#19884 opened
Jun 20, 2025 -
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case
#19885 opened
Jun 20, 2025 -
[New model support] Support Tarsier2
#19887 opened
Jun 20, 2025 -
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12
#19891 opened
Jun 20, 2025 -
add some examples for other benchmark scripts
#19893 opened
Jun 20, 2025 -
[Benchmark][New Dataset] Added benchmark support for Unsloth Vision Datasets
#19894 opened
Jun 20, 2025 -
[Docs] Change response symbol to json in openai_compatible_server.md
#19895 opened
Jun 20, 2025 -
[CI/Build] Push latest tag for cpu and neuron docker image
#19897 opened
Jun 20, 2025 -
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend to enable Llama-4
#19904 opened
Jun 20, 2025 -
add smollm3 support
#19905 opened
Jun 20, 2025 -
Fix: Check the type of params to be a Sequence not list.
#19910 opened
Jun 20, 2025 -
deepep low latency + fp8 dispatch - test fixes
#19911 opened
Jun 20, 2025 -
[V1] Logits processors extensibility
#19912 opened
Jun 20, 2025 -
[Misc] make get_class check for Executor instead of ExecutorBase
#19914 opened
Jun 20, 2025 -
Track expert selection metrics
#19915 opened
Jun 20, 2025 -
Fix: Missing newline at end of file
#19916 opened
Jun 20, 2025 -
[Bugfix] Fix bnb 8bit model weights loading
#19917 opened
Jun 20, 2025 -
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN
#19919 opened
Jun 20, 2025 -
[doc] improve readability for long commands
#19920 opened
Jun 20, 2025 -
Use FusedMoEQuantConfig everywhere
#19921 opened
Jun 20, 2025 -
[doc] add contact us in community
#19922 opened
Jun 20, 2025
132 Issues closed by 33 people
-
[Doc]: Add list of commands for `vllm serve`
#19859 closed
Jun 20, 2025 -
[Bug]: Inductor codegen: fatal error: stddef.h: No such file or directory
#19656 closed
Jun 20, 2025 -
[Usage]: How to eliminate randomness and obtain fixed results with VLLM 0.8
#15205 closed
Jun 20, 2025 -
[Bug]: enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
#19890 closed
Jun 20, 2025 -
[Performance]: speed regression 0.6.2 => 0.6.3?
#9476 closed
Jun 20, 2025 -
[Usage]: vLLM For maximally batched use case
#9760 closed
Jun 20, 2025 -
[Bug]: Interference of Tokens in Concurrent Requests Causing Result Confusion in Version 0.6.3
#9910 closed
Jun 20, 2025 -
[Usage]: Multi-Step Scheduling with Speculative Decoding
#11917 closed
Jun 20, 2025 -
[Bug]: Nccl Test Error
#12008 closed
Jun 20, 2025 -
[Usage]: V0 Does Qwen2-VL Support torch.compile in vllm?
#12693 closed
Jun 20, 2025 -
[Bug]: Pooling request fails for classification task
#12753 closed
Jun 20, 2025 -
[Bug]: V1 engine fails with offline batched inference code in V0 engine
#12929 closed
Jun 20, 2025 -
[Bug]: empty begin stream output
#13293 closed
Jun 20, 2025 -
[Bug]: GPU drops to 0 usage when handling concurrent requests
#13422 closed
Jun 20, 2025 -
[Bug]: Guided decoding only generating single character during inference with finetuned model
#13448 closed
Jun 20, 2025 -
[Feature]: Support xTTSv2
#13457 closed
Jun 20, 2025 -
[Bug]: AWQ doesn't support 4-bit?
#13462 closed
Jun 20, 2025 -
[Feature]: How can I set a different max_pixels for each request when starting the service?
#13463 closed
Jun 20, 2025 -
[Bug]: Memory leak in 0.6 and 0.7 when setting "max_tokens=1"
#13464 closed
Jun 20, 2025 -
[Bug]: Mamba should return states in fp32
#13466 closed
Jun 20, 2025 -
[Doc]: List of Models Supported By TPU Backend
#13476 closed
Jun 20, 2025 -
[Bug]: When deploying two llm services on the same batch of GPUs. Inference will be twice as slow
#13477 closed
Jun 20, 2025 -
[Feature]: kv cache int8: Dynamic kv cache scaling factors computation
#13478 closed
Jun 20, 2025 -
[Bug]: vLLM on TPU is broken with XLA errors
#13479 closed
Jun 20, 2025 -
[Usage]: How to write scoring script when deploying to a managed Azure machine learning real-time endpoint?
#13491 closed
Jun 20, 2025 -
[Usage]: How to use logits processors with max_num_seqs > 1?
#13553 closed
Jun 20, 2025 -
[Bug]: Increasing root volume with guided decoding
#13556 closed
Jun 20, 2025 -
[Bug]: Index Out of Range Bug in Pooler when Using returned_token_ids with hidden_states
#13559 closed
Jun 20, 2025 -
[New Model]: Qwen3-Rerank 0.6B, 4B, 8B
#19529 closed
Jun 19, 2025 -
[Bug]: ValueError: Cannot cast <zmq.Socket(zmq.ROUTER) at 0x796c63de24a0> to int
#19444 closed
Jun 19, 2025 -
[Bug]: Get NCCL_ERROR_SYSTEM_ERROR with latest Docker vLLM image (v0.9.1)
#19613 closed
Jun 19, 2025 -
[Bug]: Async Beam Search Doesn't Pass Multimodal Data Correctly
#19687 closed
Jun 19, 2025 -
[Bug]: fail to load OpenGVLab/InternVL3-78B with vllm
#19856 closed
Jun 19, 2025 -
[Usage]: Implement a custom scheduler
#16479 closed
Jun 19, 2025 -
[Usage]: `gpu_memory_utilization` backend parameter questions
#19805 closed
Jun 19, 2025 -
[Misc]: why `3B-Instruct-AWQ` takes 16G
#15204 closed
Jun 19, 2025 -
[Usage]: How to deploy DeepSeek R1 in a K8s environment
#14740 closed
Jun 19, 2025 -
vector search
#15268 closed
Jun 19, 2025 -
[Bug]: Is vllm support function call mode?
#6631 closed
Jun 19, 2025 -
Conda Forge Package
#3126 closed
Jun 19, 2025 -
[Bug]: OOM with QwQ-32B
#15258 closed
Jun 19, 2025 -
[Bug]: TypeError: Qwen2_5OmniProcessor.__init__() got multiple values for argument 'image_processor'
#19833 closed
Jun 19, 2025 -
[Bug]:
#19832 closed
Jun 19, 2025 -
[Bug]: PreemptionMode.RECOMPUTE is incorrect
#16832 closed
Jun 19, 2025 -
[Bug]: Reproduction failed when evaluate model
#19802 closed
Jun 19, 2025 -
[Bug]: Runtime error occurs when running deepseek v3
#12827 closed
Jun 19, 2025 -
[Feature]: Support custom args in OpenAI (chat) completion requests
#16802 closed
Jun 19, 2025 -
[CI Failure]: Samplers Test - samplers/test_beam_search.py::test_beam_search_passes_multimodal_data
#19736 closed
Jun 18, 2025 -
[Bug]: RAY_CGRAPH_get_timeout is not set successfully. Ray still detects default timeout value.
#19703 closed
Jun 18, 2025 -
[Usage]: How to start the vllm service and pass parameters on XPU
#19528 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/index.html`
#19755 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/google_tpu.html`
#19753 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/google_tpu.html`
#19752 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/`
#19751 closed
Jun 18, 2025 -
[Bug]: vllm running on new H20-3e Nvidia card has occasional garbled bug using Qwen 2.5 VL 72B
#19723 closed
Jun 18, 2025 -
[Feature]: do you plan to support "suffix" of "v1/completions"
#9976 closed
Jun 18, 2025 -
[Bug]: Continuous batching (OpenAI Server) with greedy search return different results
#11658 closed
Jun 18, 2025 -
[Usage]: Does DeepSeek-R1 1.58-bit Dynamic Quant work on VLLM?
#12573 closed
Jun 18, 2025 -
[Usage]: How to get access to scheduler
#12772 closed
Jun 18, 2025 -
[Usage]: How to check the corresponding functionality of operators in Llama-2-7b-hf?
#13010 closed
Jun 18, 2025 -
[Bug]: CUDA memory error with benchmark_serving.py
#13152 closed
Jun 18, 2025 -
[Bug]: Cannot pull the docker image for installation
#13330 closed
Jun 18, 2025 -
[Bug]: vllm server bad
#13340 closed
Jun 18, 2025 -
[Bug]: DeepSeek R1 deployment panics when serving requests with cuda memory access
#13389 closed
Jun 18, 2025 -
[Bug]: Lora Adapters with num-scheduler-steps doesn't work in version 0.7.2, even with VLLM_USE_V1=0
#13394 closed
Jun 18, 2025 -
[Misc]: Why do we need to explicitly pass tool parsers?
#13399 closed
Jun 18, 2025 -
[Feature]: Support token-level timestamps in whisper models
#13400 closed
Jun 18, 2025 -
[Installation]: flash-attention internal "git submodule update" problematic for offline-install
#13424 closed
Jun 18, 2025 -
[Feature]: load_weights function in JambaForSequenceClassification
#13430 closed
Jun 18, 2025 -
[CI Failure]: Distributed Tests (2 GPUs) - v1/test_async_llm_dp.py::test_load
#19731 closed
Jun 17, 2025 -
[Feature]: Optimize `moe_align_block_size` CUDA kernel
#19517 closed
Jun 17, 2025 -
[Bug]: Error when loading model(gemma-3-4b) merged after DeepSpeed training into vLLM
#19139 closed
Jun 17, 2025 -
[Bug]: guided_regex parsing error crashes the server
#19270 closed
Jun 17, 2025 -
Release dataset of bug-fixing commits and test cases on Hugging Face
#19738 closed
Jun 17, 2025 -
[Bug]: GPU Placement Group Creation Error in Multi-Node Setup with vLLM
#13388 closed
Jun 17, 2025 -
[Bug]: Strange cuda out of memory when runing llava1.5 7b on 80G A100
#19724 closed
Jun 17, 2025 -
[Bug]: Broken Structured Output (Guided Decoding) with Qwen3 models when `enable_thinking=False`
#18819 closed
Jun 17, 2025 -
[Doc]: Does llava onevision support VLM multi images?
#19521 closed
Jun 17, 2025 -
[Usage]: How to identify min and max pixels for the image
#15034 closed
Jun 17, 2025 -
[Usage]: Can vllm multimodal generate use preprocessed image?
#14998 closed
Jun 17, 2025 -
[Usage]: Transcription "Maximum clip duration (30s) exceeded"
#15012 closed
Jun 17, 2025 -
centos7 package err, is my problem?
#14750 closed
Jun 17, 2025 -
[Bug]: RuntimeError: HIP Error on vLLM ROCm Image in Kubernetes Cluster with AMD GPUs
#10855 closed
Jun 17, 2025 -
[Feature]: multiple gpus specification
#13357 closed
Jun 17, 2025 -
[CI Failure]: Spec Decoding - spec_decode/e2e/test_multistep_correctness.py
#18954 closed
Jun 16, 2025 -
[Bug]: (regression from v0.8.5) missing "finish_reason": null in streaming chat completion outputs
#19650 closed
Jun 16, 2025 -
[Bug]: Cannot run two containers on one card when using VLLM_USE_V1=1
#17366 closed
Jun 16, 2025 -
[Misc]: How does vllm consume request in async mode?
#13328 closed
Jun 16, 2025 -
[Bug]: issue about the gguf model's --compilation-config options
#13329 closed
Jun 16, 2025 -
[Usage]: How to enable tool calling in serve llm?
#19601 closed
Jun 16, 2025 -
[Feature]: Vectorize `scaled_int8_quant`
#18866 closed
Jun 15, 2025 -
[Benchmark Script] Refactor benchmark script for `bench_datatype_gemm`
#19364 closed
Jun 15, 2025 -
[Bug]: Docker deployment returns zmq.error.ZMQError: Operation not supported
#10856 closed
Jun 15, 2025 -
[Usage]: zmq.error.ZMQError: Operation not supported
#11564 closed
Jun 15, 2025 -
[Performance]: It takes too much time to Add a request.
#12314 closed
Jun 15, 2025 -
When tp>1 vllm not work (Qwen2.5-VL-72B)
#13124 closed
Jun 15, 2025 -
[Bug]: Speculative Decoding Output with Pytorch Rejection Sampling does not change when changing seed
#13196 closed
Jun 15, 2025 -
[Bug]: Gemma2ForSequenceClassification has no vLLM implementation
#13254 closed
Jun 15, 2025 -
[Misc]: Regarding the issue of inconsistent calculation of tokens
#13256 closed
Jun 15, 2025 -
[Bug]: Flashinfer Metadata init for speculative decoding
#13264 closed
Jun 15, 2025 -
[Bug]: transformer fallback encounters CUDA OOM with large models
#13268 closed
Jun 15, 2025 -
[Bug]: AttributeError: 'OpenVINOWorker' object has no attribute 'cache_engine'
#13287 closed
Jun 15, 2025 -
[Bug]: meta-llama/Llama-3.2-90B-Vision-Instruct in vllm/vllm-openai:latest
#13307 closed
Jun 15, 2025 -
[Bug]: 'dict' object has no attribute 'is_kv_transfer_instance'
#19259 closed
Jun 14, 2025 -
[Feature]: Support for DeepGEMM
#13857 closed
Jun 14, 2025 -
[Bug]: deepseek-vl2 `RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same`
#19219 closed
Jun 14, 2025 -
[Usage]: How to run intern_vit model using intern_vit.py?
#11221 closed
Jun 14, 2025 -
[Feature]: Any plan to support multi-token prediction via speculative decoding for DeepSeek?
#12730 closed
Jun 14, 2025 -
[Bug]: [V1] Mamba models fail on profile run
#12826 closed
Jun 14, 2025 -
[Bug]: Docker multi-node multi-GPU deployment -- model startup error
#13145 closed
Jun 14, 2025 -
[New Model]: Surya OCR
#13172 closed
Jun 14, 2025 -
[Bug]: Concurrent streaming mode produces jumbled tokens
#13186 closed
Jun 14, 2025 -
Collecting data from the user should be disclosed; let the user choose whether to upload or not
#13195 closed
Jun 14, 2025 -
[Bug]: when i run DeepSeek-V2-Lite with [ngram], the result is different
#13206 closed
Jun 14, 2025 -
[Bug]: examples/grad_web_server.py run error
#13223 closed
Jun 14, 2025
97 Issues opened by 88 people
-
Issue 1: Missing type hint for `wheel` argument in `generate_index.py`
#19918 opened
Jun 20, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/gpu.html`
#19913 opened
Jun 20, 2025 -
[Bug]: file in `vllm/benchmarks/kernels/benchmark_marlin.py` cannot execute
#19909 opened
Jun 20, 2025 -
[Bug]: Deepseek R1 0528 tool calling not working
#19907 opened
Jun 20, 2025 -
[Bug]: RTX5080 got CUDA error: no kernel image is available for execution on the device
#19906 opened
Jun 20, 2025 -
[Bug]: nsys can't open the file
#19903 opened
Jun 20, 2025 -
[Feature]: `kv_transfer_params` not returned for multiple subrequests
#19902 opened
Jun 20, 2025 -
[Doc]: The documentation should be updated to cover GPU compatibility
#19900 opened
Jun 20, 2025 -
[Feature]: EXL3 support
#19896 opened
Jun 20, 2025 -
[Bug]: AsyncLLMEngine stuck in V1
#19892 opened
Jun 20, 2025 -
[Performance]: the performance decline in fp8 inference mode
#19888 opened
Jun 20, 2025 -
[RFC]: Inplace model weights loading
#19886 opened
Jun 20, 2025 -
Issue 1: Incorrect comparison with MAIN_CUDA_VERSION for CPU target
#19882 opened
Jun 19, 2025 -
[Feature]: Implement `check_health` for V1
#19881 opened
Jun 19, 2025 -
[Feature]: Support passing token-level schedules for temperature and other sampling parameters
#19877 opened
Jun 19, 2025 -
[Bug]: InternVL3 poor (random) output with 8bit quantization
#19876 opened
Jun 19, 2025 -
[Bug]: 'IndexError: tuple index out of range' when using 8 gpu's
#19871 opened
Jun 19, 2025 -
[Usage]: missing latest tag from cpu docker registry
#19869 opened
Jun 19, 2025 -
[Bug]: AiterFlashAttentionImpl.__init__() got multiple values for argument 'use_irope' for llama4 model
#19867 opened
Jun 19, 2025 -
[Bug]: NCCL issues when running vllm v0.9.1 for the Deepseek-R1 model [B200 GPU]
#19865 opened
Jun 19, 2025 -
[Feature]: Returning embedding dimensions in /v1/models
#19864 opened
Jun 19, 2025 -
[Bug]: 5090 gemma-3-12b-it using FP8/INT8/FP16 quantization for conncurent requests DOCKER.
#19863 opened
Jun 19, 2025 -
[Bug]: Internal Server Error when use max_tokens=null
#19862 opened
Jun 19, 2025 -
[Bug]: max_completion_tokens doesn't work as max
#19861 opened
Jun 19, 2025 -
[Bug]: AttributeError: 'InferenceClient' object has no attribute 'post'
#19857 opened
Jun 19, 2025 -
[Bug]: dynamic fp8 quantization does not save memory when enable_sleep_mode=True
#19855 opened
Jun 19, 2025 -
[RFC]: KV cache offloading
#19854 opened
Jun 19, 2025 -
[Bug]: Unable to deploy NVFP4 quantized model
#19853 opened
Jun 19, 2025 -
[Feature]: Quant & TP for VIT
#19850 opened
Jun 19, 2025 -
[Bug]: Subprocess health check / automatic restart for V1 EngineCore
#19849 opened
Jun 19, 2025 -
[Bug]: Not able to run vllm cpu using Dockerfile.cpu
#19845 opened
Jun 19, 2025 -
[Usage]: Troubleshooting Inconsistencies Between VLLM and Transformer Outputs
#19841 opened
Jun 19, 2025 -
[Bug]: Qwen3 with thinking disabled puts output into reasoning_content when it should be in content
#19839 opened
Jun 19, 2025 -
[Usage]: Why does GPU memory usage keep growing after starting qwen2.5vl-7b with vllm
#19828 opened
Jun 19, 2025 -
[Feature]: Improve startup time UX
#19824 opened
Jun 19, 2025 -
[Feature]: `CustomOp` cleanup
#19817 opened
Jun 18, 2025 -
[Bug]: Loading Qwen3MoE using Transformers backend
#19801 opened
Jun 18, 2025 -
[Bug]: After receiving the request, the service froze
#19800 opened
Jun 18, 2025 -
[Bug]: ValueError: Exceeds max model len when embedding using bge-large-zh-v1.5
#19798 opened
Jun 18, 2025 -
[Usage]: How could vllm support token classifications(albert model)?
#19797 opened
Jun 18, 2025 -
[Bug]: MP Executor does not correctly handle device allocation for non-CUDA devices (e.g., NPUs)
#19791 opened
Jun 18, 2025 -
[Bug]: Using vllm 0.7.3 with Qwen2.5VL-7b sometimes throws errors
#19786 opened
Jun 18, 2025 -
[Doc]: Update the vllm quantization support for the AMD GPU
#19782 opened
Jun 18, 2025 -
[Bug]: wrong output on L20 using fp8
#19779 opened
Jun 17, 2025 -
[Bug][Spec Decode]: TPOT in prometheus is ITL in vllm serve
#19776 opened
Jun 17, 2025 -
[Bug]: Potential bug when using speculative decoding example similar as the one from docs
#19775 opened
Jun 17, 2025 -
[Feature]: Evaluate prompt presence on subsequent audio chunks
#19772 opened
Jun 17, 2025 -
Vllm + FlexAttention Work Tracking
#19765 opened
Jun 17, 2025 -
[Bug]: Gemma3 reporting low image accuracy with v1 engine
#19763 opened
Jun 17, 2025 -
[Feature]: Upgrade base Python version of vllm/vllm-tpu docker image to 3.11+
#19761 opened
Jun 17, 2025 -
[Bug]: vllm/vllm-tpu uses Debian base but Ubuntu APT sources, causing package installation errors
#19760 opened
Jun 17, 2025 -
[Feature]: Remove cupy dependency for multi-node Ray deployment
#19758 opened
Jun 17, 2025 -
[Usage]: embed prompts
#19746 opened
Jun 17, 2025 -
[Bug]: Incorrect kernel selected when multiple GPUs
#19741 opened
Jun 17, 2025 -
[Feature]: Logging details about incorrect requests
#19739 opened
Jun 17, 2025 -
[Bug]: Fails to load llmcompressor GPTQ Qwen2.5-VL model
#19733 opened
Jun 17, 2025 -
[Performance]: very slow performance for nested list with length constraints
#19732 opened
Jun 17, 2025 -
[Bug]: tool_chat_template_deepseekr1 is no use
#19729 opened
Jun 17, 2025 -
[Feature]: AWQ DeepSeek support on MI300X
#19727 opened
Jun 17, 2025 -
[Bug]: DeepGEMM does not work with CUDA Graph
#19722 opened
Jun 17, 2025 -
[Doc]: Among the supported models, is there an embedding model that supports setting the embedding dimension to 1536?
#19720 opened
Jun 17, 2025 -
[Performance]: No significant speedup from Wfp8Afp8 (vs Wbf16Abf16) in Llama-4 Scout
#19714 opened
Jun 16, 2025 -
[Bug]: Audio transcription cannot load `preprocessor_config.json` when using runai streamer
#19713 opened
Jun 16, 2025 -
[Feature]: Simplify speculative-config format for vllm serve
#19709 opened
Jun 16, 2025 -
[RFC]: Multimodal data IPC improvement
#19702 opened
Jun 16, 2025 -
[Bug]: Truncated && Incomplete Response from LLAMA4 Scout Prefix Caching
#19697 opened
Jun 16, 2025 -
[Bug]: Enable LORA on Version 0.9.1 and RTX 5090 causes an issue
#19693 opened
Jun 16, 2025 -
[Performance]: V1 engine runs slower than V0 on the MI300X
#19692 opened
Jun 16, 2025 -
[Usage]: Changing image_feature and image_input_shape has no effect on VLM output
#19689 opened
Jun 16, 2025 -
[Bug]: deploy 32B model using vllm + ray with two nodes failed with nccl error
#19684 opened
Jun 16, 2025 -
[Bug]: rocm build crashes with libcuda.so.1: cannot open shared object file
#19681 opened
Jun 16, 2025 -
[Usage]: What is the meaning of `Avg generation throughput`
#19680 opened
Jun 16, 2025 -
[Bug]: `guided_regex` not working on M2 Ultra VLLM
#19676 opened
Jun 16, 2025 -
[New Model]: Support BAAI/bge-reranker-v2-gemma model
#19673 opened
Jun 16, 2025 -
[Bug]: vllm, EngineCore encountered a fatal error TimeoutError
#19668 opened
Jun 15, 2025 -
[Bug]: mismatch between multimodal tokens and placeholders for Qwen_2.5-3B (4 GPUs*24G)
#19666 opened
Jun 15, 2025 -
[Bug]: Phi-3-Small model reporting AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
#19665 opened
Jun 15, 2025 -
[Bug]: ValueError out of range float values are not json compliant
#19661 opened
Jun 15, 2025 -
[Feature]: Add EP/DP/PD deps in docker image
#19653 opened
Jun 14, 2025 -
[Usage]: Is there any vllm images include deep_ep ?
#19646 opened
Jun 14, 2025 -
[Usage]: V1 can not support for macOS with Apple silicon.
#19645 opened
Jun 14, 2025 -
[Bug]: Crash during OpenAI API server usage
#19639 opened
Jun 14, 2025 -
[Bug]: Unable to run Jamba 1.6 Large with Tensor Parallelism
#19638 opened
Jun 14, 2025 -
[Bug]: Illegal memory access on llama4 maverick
#19631 opened
Jun 13, 2025
338 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Core] feat: Implement Priority Scheduling in V1 Engine
#19057 commented on
Jun 20, 2025 • 66 new comments -
[Core] Support Local Chunked Attention for Hybrid KV Cache
#19351 commented on
Jun 19, 2025 • 35 new comments -
[Feature] Support sequence parallelism for static fp8 quantization
#19181 commented on
Jun 20, 2025 • 26 new comments -
Draft: WIP NixlConnector drop ZMQ in favor of HTTP metadata exchanges
#19447 commented on
Jun 19, 2025 • 16 new comments -
[Bugfix] Move hardware-dependent configuration resolution (FlashMLA capability, `dtype: 'auto'`) to worker
#18979 commented on
Jun 18, 2025 • 15 new comments -
[V1] - Enable worker -> scheduler connector metadata
#19555 commented on
Jun 20, 2025 • 13 new comments -
[Metrics] Compute and log the serving FLOPs
#19290 commented on
Jun 15, 2025 • 11 new comments -
[Feature][Quantization] MXFP4 support for MOE models
#17888 commented on
Jun 19, 2025 • 8 new comments -
[Bug][Frontend] Fix structure of transcription's decoder_prompt
#18809 commented on
Jun 20, 2025 • 7 new comments -
[Feature] microbatch tokenization
#19334 commented on
Jun 19, 2025 • 6 new comments -
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load
#19619 commented on
Jun 20, 2025 • 6 new comments -
[Platform] Add custom default max tokens
#18557 commented on
Jun 19, 2025 • 6 new comments -
[Misc] feat output content in stream response
#19608 commented on
Jun 18, 2025 • 5 new comments -
[Frontend] Add unix domain socket support
#18097 commented on
Jun 17, 2025 • 4 new comments -
[Misc] Configurable timeout for execute_model RPC calls via env var
#19544 commented on
Jun 19, 2025 • 4 new comments -
[Bugfix][Benchmarks]Fixed async_request_deepspeed_mii() to get ttft
#18689 commented on
Jun 19, 2025 • 4 new comments -
[Core] Add Support for Default Modality Specific LoRAs [generate / chat completions]
#19126 commented on
Jun 20, 2025 • 4 new comments -
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine
#19067 commented on
Jun 19, 2025 • 4 new comments -
[Kernel] Adding basic Triton JitCache for triton_attn
#16606 commented on
Jun 18, 2025 • 4 new comments -
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference
#18768 commented on
Jun 18, 2025 • 3 new comments -
[Frontend] Add `/v1/audio/translations` OpenAI API endpoint
#19615 commented on
Jun 19, 2025 • 3 new comments -
[Bugfix] VLLM_V1 supports passing other compilation levels
#19340 commented on
Jun 19, 2025 • 2 new comments -
[Misc]: refactor: ParallelConfig init func
#19310 commented on
Jun 14, 2025 • 2 new comments -
[Chore] debloat some initial logs
#19438 commented on
Jun 20, 2025 • 2 new comments -
Enable CPU nightly performance benchmark and its Markdown report
#18444 commented on
Jun 20, 2025 • 2 new comments -
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model
#19598 commented on
Jun 18, 2025 • 2 new comments -
[WIP] [Core][P/D] CPU connector for PD disagg
#18332 commented on
Jun 16, 2025 • 2 new comments -
[V1] LogitsProcessor programming model
#16728 commented on
Jun 19, 2025 • 2 new comments -
Add quickreduce as alternative to custom allreduce
#16804 commented on
Jun 17, 2025 • 2 new comments -
[CI] bump mypy version to 1.16.0
#19548 commented on
Jun 17, 2025 • 2 new comments -
[P/D][Bugfix]: Fix the issue where the remote KVCache cannot be loaded when PP > 1
#19558 commented on
Jun 17, 2025 • 2 new comments -
[Bugfix]: Fix messy code when using logprobs
#19209 commented on
Jun 20, 2025 • 1 new comment -
fix: cuda 12.6 installation
#19095 commented on
Jun 15, 2025 • 1 new comment -
[Bugfix]fix asyncLLM test_abort
#16090 commented on
Jun 17, 2025 • 1 new comment -
[Misc] Add gemma3 chat template with pythonic-style function calling
#17149 commented on
Jun 16, 2025 • 1 new comment -
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0
#19346 commented on
Jun 20, 2025 • 1 new comment -
[Frontend] Add -d/--detach option for vllm serve and process management
#18065 commented on
Jun 19, 2025 • 1 new comment -
[P/D][Feature] Support using NIXL together with PP
#19591 commented on
Jun 15, 2025 • 1 new comment -
[V1] Only print cudagraph tqdm on rank 0 with `is_global_first_rank`
#19516 commented on
Jun 16, 2025 • 1 new comment -
Support multicard for Disaggregated Prefill/Decode and provide a automatic benchmark test
#15221 commented on
Jun 20, 2025 • 0 new comments -
[WIP][TPU] Support mrope models (Qwen2VL)
#15149 commented on
Jun 19, 2025 • 0 new comments -
Metrics proposal OpenTelemetry API
#15138 commented on
Jun 20, 2025 • 0 new comments -
[Misc]Fix incorrect local IP detection in multi-network interface environments
#15071 commented on
Jun 18, 2025 • 0 new comments -
[Bugfix] Fix include prompt in stream response when echo=true
#15233 commented on
Jun 18, 2025 • 0 new comments -
When an exception happens in multiproc, die hard and fast
#15000 commented on
Jun 18, 2025 • 0 new comments -
Add missed ray[data] dependence in cuda.txt
#15283 commented on
Jun 20, 2025 • 0 new comments -
[PD] Skip `tp_size` exchange with rank0
#19413 commented on
Jun 16, 2025 • 0 new comments -
[Bugfix] Fix the missing '}' issue for nested object parameters in stream function call.
#16919 commented on
Jun 19, 2025 • 0 new comments -
[RFC] per module sharded weight tagging
#17001 commented on
Jun 19, 2025 • 0 new comments -
[Feat][Frontend] Added support for HermesToolParser for models without special tokens
#16890 commented on
Jun 20, 2025 • 0 new comments -
[Perf] Optimize MRoPR position preparing performance with numba
#16881 commented on
Jun 18, 2025 • 0 new comments -
[Model][Frontend] Adding timeseries modality support and Qwen2.5-ChatTS model support
#16852 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Add Split-KV Attention Kernel to the triton_attn Backend
#16794 commented on
Jun 18, 2025 • 0 new comments -
[Bugfix][Disaggregated] Set min_tokens in disagg_proxy_demo.py
#16705 commented on
Jun 19, 2025 • 0 new comments -
[NIXL] vllm v0 nixl integration
#16677 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix][Model] fix Phi3Small model only support v0
#16493 commented on
Jun 17, 2025 • 0 new comments -
[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)
#16347 commented on
Jun 19, 2025 • 0 new comments -
[Draft] SnapKV
#16160 commented on
Jun 16, 2025 • 0 new comments -
[Bugfix] fix phi4-mini tool call parse in streaming mode
#17094 commented on
Jun 17, 2025 • 0 new comments -
[V0][V1][Core] Add outlines integration for V1, and update V0 integration.
#15975 commented on
Jun 20, 2025 • 0 new comments -
feat: update allow_pattern
#15797 commented on
Jun 19, 2025 • 0 new comments -
[Benchmark] Fix two issues in benchmark result
#15795 commented on
Jun 19, 2025 • 0 new comments -
Optimizing Cascade Attention for Parallel Sampling
#15772 commented on
Jun 15, 2025 • 0 new comments -
[V1][Experimental] Jump-forward decoding
#15490 commented on
Jun 19, 2025 • 0 new comments -
Add KV-Cache int8 quant support
#10354 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: There is no module or parameter named '_orig_mod' in Qwen2ForCausalLM
#12783 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: RTX50xx GPU is not supported for running W8A8 FP8 quant models!
#19605 commented on
Jun 20, 2025 • 0 new comments -
[New Model]: support for model: https://huggingface.co/jinaai/jina-clip-v2
#18448 commented on
Jun 20, 2025 • 0 new comments -
[Installation]: deployment failure on Kuberentes with CPU device (testing).
#17187 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Grammar error: Pointer '/$defs/xxxxx' does not exist
#16467 commented on
Jun 20, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: support reasoning output when offline batched inference
#17292 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash
#18567 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: `uv run vllm serve` with DP results in NCCL error: two ranks use the same device
#17176 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Docker, v0.9.0.1, Gemma3-4B, "Unsupported conversion from f16 to f16" on Nvidia T4
#19203 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Outlines broken on vLLM 0.8+
#15636 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: Starting a Qwen/Qwen3-32B model service with the vllm docker image, CPU usage stays at 100%
#19150 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: RTX 5090 with vllm/vllm-openai docker image
#16652 commented on
Jun 20, 2025 • 0 new comments -
[Bug] TP=2 fails on dual RTX 5090: TorchInductor compile error or CUDA illegal memory access (TP=1 works)
#18814 commented on
Jun 20, 2025 • 0 new comments -
Llama3.2 Vision Model: Guides and Issues
#8826 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management
#10086 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Multi-Node Online Inference on TPUs Failing
#12179 commented on
Jun 20, 2025 • 0 new comments -
[V1]Improve V1 startup error handling
#14758 commented on
Jun 14, 2025 • 0 new comments -
[Perf] Optimize Qwen2/2.5-VL ViT tensor generating performance
#14684 commented on
Jun 19, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Jun 17, 2025 • 0 new comments -
[Quant] SupportsQuant handles ignored_modules
#14635 commented on
Jun 19, 2025 • 0 new comments -
[Quant] Add SupportsQuant and packed_modules_mapping to all models
#14631 commented on
Jun 19, 2025 • 0 new comments -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
Jun 19, 2025 • 0 new comments -
First working PoC for bge-m3 sparse embeddings
#14526 commented on
Jun 17, 2025 • 0 new comments -
[WIP] Support models with mrope (Qwen2VL) on TPU
#14442 commented on
Jun 19, 2025 • 0 new comments -
[Model] add colqwen2_vl code & inference
#14291 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Make memory profiler account for speculative draft model weights
#14067 commented on
Jun 20, 2025 • 0 new comments -
[Model] [Quantization] Support deepseek v3/r1 w8a8 int8 block-wise quantization
#13942 commented on
Jun 20, 2025 • 0 new comments -
[WIP][Core] Support tensor parallelism with uneven heads
#13934 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Enable speculative decoding for models with nearly-identical vocab sizes
#13849 commented on
Jun 19, 2025 • 0 new comments -
[Model] Support VLMs with transformers backend
#13754 commented on
Jun 18, 2025 • 0 new comments -
[V0][Sampler] Use raw logits for greedy argmax
#13312 commented on
Jun 19, 2025 • 0 new comments -
[Hardware][Metal] Apple Metal support
#12640 commented on
Jun 19, 2025 • 0 new comments -
[Misc]add modules_to_not_convert attribute to gptq series
#12103 commented on
Jun 20, 2025 • 0 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Jun 17, 2025 • 0 new comments -
[Model] LoRA with lm_head and embed_tokens fully trained - 4
#11714 commented on
Jun 19, 2025 • 0 new comments -
qwen optimize
#19406 commented on
Jun 19, 2025 • 0 new comments -
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend
#19395 commented on
Jun 17, 2025 • 0 new comments -
Add GLM4.1V model (Draft)
#19331 commented on
Jun 18, 2025 • 0 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Jun 20, 2025 • 0 new comments -
[Core] Update error message for Whisper + num-scheduler-steps > 1
#19286 commented on
Jun 14, 2025 • 0 new comments -
[Bugfix] ROCm FP8 Quantization Padding Issue
#19251 commented on
Jun 19, 2025 • 0 new comments -
[Core] Allow vLLM to stream n tokens at a time
#19240 commented on
Jun 19, 2025 • 0 new comments -
[Misc][Bugfix] specify docker registry to support podman
#19236 commented on
Jun 17, 2025 • 0 new comments -
[Bugfix] Fix Qwen2-Audio chat template for online serving
#19230 commented on
Jun 18, 2025 • 0 new comments -
[Doc]: improve CPU(x86) build instructions and fix include path
#19156 commented on
Jun 17, 2025 • 0 new comments -
[Core] Add constants for CUDA compute capabilities
#19099 commented on
Jun 19, 2025 • 0 new comments -
Fix Incorrect data_parallel_rank and subsequent errors under torchrun
#19096 commented on
Jun 19, 2025 • 0 new comments -
[feature] Integrate quick allreduce into custom allreduce and select the best allreduce implementation
#19094 commented on
Jun 20, 2025 • 0 new comments -
[Bugfix]: Fix DualChunkFlashAttention for short sequences
#19084 commented on
Jun 19, 2025 • 0 new comments -
[P/D] Exchange NIXL metadata through rank 0
#19080 commented on
Jun 15, 2025 • 0 new comments -
[BugFix]: Hermes tool parser stream output error in Qwen3 case #19056
#19058 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Add Conch Triton Attention backend
#19625 commented on
Jun 19, 2025 • 0 new comments -
Use the correct torch dtype in topk kernel assertion
#19614 commented on
Jun 16, 2025 • 0 new comments -
[Frontend] /metadata: Get more useful server information easily.
#19604 commented on
Jun 18, 2025 • 0 new comments -
[Core] Remove host GPU sync in `merge_multimodal_embeddings`
#19578 commented on
Jun 16, 2025 • 0 new comments -
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend.
#19560 commented on
Jun 19, 2025 • 0 new comments -
[Core] Rationalize boolean environment variable handling
#19550 commented on
Jun 19, 2025 • 0 new comments -
[Benchmark] fix request loss if "ping" is returned
#19535 commented on
Jun 20, 2025 • 0 new comments -
[Bugfix] Register reducer even if transformers_modules not available
#19510 commented on
Jun 16, 2025 • 0 new comments -
deps: Update torch and deps to 2.7.1
#19507 commented on
Jun 16, 2025 • 0 new comments -
[Models] Improve iteration over layers
#19497 commented on
Jun 18, 2025 • 0 new comments -
090
#19488 commented on
Jun 18, 2025 • 0 new comments -
fix: Properly set engine_id when using multi connector in dynamo
#19487 commented on
Jun 20, 2025 • 0 new comments -
[Perf] Improve/Fix-regression for FA3 in High QPS regimes
#19463 commented on
Jun 20, 2025 • 0 new comments -
[Kernel] Integrate IBM/Applied-AI fused moe kernels
#19443 commented on
Jun 18, 2025 • 0 new comments -
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client.
#19423 commented on
Jun 15, 2025 • 0 new comments -
Added FP8 support quantization support to DualChunkFlashAttentionBackend
#19420 commented on
Jun 18, 2025 • 0 new comments -
[ROCm][FEAT] Integrate AITER gemm w8a8 ptpc
#19417 commented on
Jun 19, 2025 • 0 new comments -
[Frontend] speed up import time of vllm.reasoning
#18236 commented on
Jun 19, 2025 • 0 new comments -
[V1] feat:add engine v1 tracing
#18069 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: reasoning_tokens in Chat Completion Response usage
#18067 commented on
Jun 19, 2025 • 0 new comments -
[CI/Build] Allow hermetic builds
#18064 commented on
Jun 18, 2025 • 0 new comments -
[kernel] integrate permute/unpermute kernel into deepgemm moe
#17934 commented on
Jun 17, 2025 • 0 new comments -
[Core] Parallel multi-modal processor
#17831 commented on
Jun 19, 2025 • 0 new comments -
Update registry.py
#17762 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Bf16 data type support for awq quantization
#17705 commented on
Jun 20, 2025 • 0 new comments -
[Misc] Refactor VLM common generation tests to support audio inputs and mix-modality tests
#17633 commented on
Jun 19, 2025 • 0 new comments -
[PERF] Speed up of prepare_inputs / mrope
#17617 commented on
Jun 19, 2025 • 0 new comments -
[Security] Document StatelessProcessGroup security concerns
#17591 commented on
Jun 14, 2025 • 0 new comments -
[Bugfix][ROCm] Fix incorrect casting in GPTQ GEMM kernel
#17583 commented on
Jun 18, 2025 • 0 new comments -
[Bugfix][Model] vllm-v0 engine run eagle algo with qwen2.5 model, KeyError: 'norm.weight' bugfix
#17518 commented on
Jun 19, 2025 • 0 new comments -
enable multiple platform device in DP init
#17368 commented on
Jun 20, 2025 • 0 new comments -
[Experiment] Parallel multi-modal processor
#17361 commented on
Jun 19, 2025 • 0 new comments -
[NVIDIA] Support Cutlass w8a8 for Blackwell Geforce GPUs (sm120)
#17280 commented on
Jun 19, 2025 • 0 new comments -
[Frontend] Expand tools even if tool_choice="none"
#17177 commented on
Jun 19, 2025 • 0 new comments -
Create E=128,N=768,device_name=NVIDIA_A100-PCIE-40GB.json
#19049 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Improve JSON extraction in LlamaToolParser
#19024 commented on
Jun 16, 2025 • 0 new comments -
[DRAFT] Self-Speculative Decoding using LayerSkip
#18994 commented on
Jun 18, 2025 • 0 new comments -
[Kernel] Fix fp8 support for pplx and BatchedTritonExperts.
#18864 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Porting triton_kernels for FusedMoE
#18595 commented on
Jun 19, 2025 • 0 new comments -
[Model][Speculative Decoding] Integrate PARD into vLLM
#18541 commented on
Jun 20, 2025 • 0 new comments -
Remove Vision FA warning
#18522 commented on
Jun 19, 2025 • 0 new comments -
Add reorder_batch to TPU V1
#18515 commented on
Jun 19, 2025 • 0 new comments -
[V1] Support `LLM.apply_model`
#18465 commented on
Jun 18, 2025 • 0 new comments -
[WIP] Two batch overlap
#18415 commented on
Jun 19, 2025 • 0 new comments -
[V1] [Spec decode] Llama4 type eagle support in v1
#18369 commented on
Jun 15, 2025 • 0 new comments -
[Misc] add xgrammar for arm64
#18359 commented on
Jun 16, 2025 • 0 new comments -
[Feature] Expert Parallelism Load Balancer (EPLB)
#18343 commented on
Jun 20, 2025 • 0 new comments -
[Don't merge] Debug failing quantization test with input batch move
#18298 commented on
Jun 19, 2025 • 0 new comments -
[P/D] Support CPU Transfer in NixlConnector
#18293 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Add EP support for cutlass_moe_fp4
#18281 commented on
Jun 14, 2025 • 0 new comments -
[Model] support dots1
#18254 commented on
Jun 19, 2025 • 0 new comments -
[Feature]: Disaggregated Prefill on multi-node & multi-gpu
#13004 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,)
#8432 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: stuck at "generating GPU P2P access cache in /home/luban/.cache/vllm/gpu_p2p_access_cache_for_0,1.json"
#8735 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support for priority preemption with chunked-prefill
#10101 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: mlx-community/DeepSeek-R1-4bit exception:OSError: /data/coding/model-671b-MS/dir does not appear to have a file named configuration_deepseek.py;
#13283 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: terminate called after throwing an instance of 'std::system_error' what(): Operation not permitted
#14416 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Weird output when server with high load
#14491 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: 0.74 dev, the error occurred in the gptq_marlin_gemm function call
#14887 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: asyncio.exceptions.CancelledError and engine_client.dead_error
#14994 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: How to use vllm in parallel
#14997 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Failed to Run Qwen2.5-7B with RTX 3070 & CPU Offload (14GB) Despite Sufficient Theoretical Memory
#15004 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: PallasAttentionBackendImpl.__init__() got an unexpected keyword argument 'q_lora_rank'
#15026 commented on
Jun 17, 2025 • 0 new comments -
[Performance]: Speculative Decoder Optimization for Large-Batch Inference Overhead
#15029 commented on
Jun 17, 2025 • 0 new comments -
Precision loss occurs when using the MoE sum kernel.
#15045 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: ValueError: Model architectures ['LlamaForCausalLM'] failed to be inspected
#15058 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Bad requests are not captured as traces
#17528 commented on
Jun 17, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs
#5907 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: LLMEngine.add_request can't handle erroneous type of request_id
#19588 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Optimize parallel sampling by batching add_request calls to avoid split scheduling latency
#16373 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL-32B, Following weights were not initialized from checkpoint
#15536 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError
#15127 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: multi-round QA when using Qwen2.5-VL with the same input image
#15132 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Configurable metrics export format - Prometheus, OpenTelemetry
#15141 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Error running ShieldGemma: 'guideline' is undefined
#15147 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Can't see NCCL profiling data in nsight sys for expert parallel
#15168 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Failed to initialize the TMA descriptor 700 use Qwen2.5 72B on H200
#15175 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Failed to run model Qwen3-30B-A3B on DGX V100x4
#17392 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Consider parallel_tool_calls parameter at the API level
#9451 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: `undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE` when running `0.7.3.dev57+g2ae88905.precompiled` on A100
#13047 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Add Support for thinking_budget for Qwen3 Models
#17887 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support for RTX 5090 (CUDA 12.8)
#13306 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: How to let Whisper return timestamps in transcript?
#19556 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: Why is speculative decoding slower than normal decoding?
#8439 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Engine stuck with requests blocked; running/waiting request count and KV cache usage remain constant.
#18431 commented on
Jun 17, 2025 • 0 new comments -
[Doc]: Newest documentation for engine arguments is significantly worse than v0.8.5 and prior
#18707 commented on
Jun 17, 2025 • 0 new comments -
[Bug][Regression]: Dimension out of range when using MooncakeStoreConnector
#18834 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: Full cuda graph for vllm v1
#19607 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: error: Segmentation fault(SIGSEGV received at time)
#6918 commented on
Jun 17, 2025 • 0 new comments -
[RFC]: Graceful Error Handling for KV Connector Load Failures
#19329 commented on
Jun 15, 2025 • 0 new comments -
[Bug]: "NCCL WARN Error while attaching to shared memory segment /dev/shm/nccl- (size 192814336), error: No such file or directory (2)"
#18831 commented on
Jun 15, 2025 • 0 new comments -
[Feature]: Microbatch Tokenization
#19012 commented on
Jun 15, 2025 • 0 new comments -
[Bug]: Help, RuntimeError: CUDA error: no kernel image is available for execution on the device
#18835 commented on
Jun 15, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q2 2025
#15735 commented on
Jun 15, 2025 • 0 new comments -
[Feature]: Custom attention masks
#5228 commented on
Jun 15, 2025 • 0 new comments -
[Usage]: DeepSeek R1 input tokens cannot exceed 32k and how to correctly use FlashMLA
#14882 commented on
Jun 15, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: invalid argument
#14885 commented on
Jun 15, 2025 • 0 new comments -
[RFC]: Response format extensions for structured outputs
#19097 commented on
Jun 14, 2025 • 0 new comments -
[Feature]: v1 with kv-cache fp8
#16165 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Qwen3 Enable Reasoning breaks Tool Call Parsing
#19513 commented on
Jun 14, 2025 • 0 new comments -
[Usage]: [vLLM V1] `decoded_token` returns "Ċ" instead of "\n" in Qwen2.5-Math-7B-Instruct
#19595 commented on
Jun 14, 2025 • 0 new comments -
[New Model]: ByteDance-Seed/BAGEL-7B-MoT
#18793 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: gemma3 shows degraded accuracy in vLLM v0.8.4
#17689 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Qwen2VL-2b / Qwen2.5-7b has AssertionError and Cuda error when qps goes higher
#17171 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Running ROCm support v1 vLLM Arch triggers ERROR_MEMORY_APERTURE_VIOLATION
#13674 commented on
Jun 14, 2025 • 0 new comments -
[RFC]: Configurable multi-modal data for profiling
#14438 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Short prompts -> !!!!!!! output from Qwen2.5-32B-Instruct-GPTQ-Int4 w/ROCm
#14715 commented on
Jun 14, 2025 • 0 new comments -
[Usage]: What should I do if I want to skip the prefill of a new request?
#14863 commented on
Jun 14, 2025 • 0 new comments -
[Feature]: Will vllm support sequence parallelism?
#19519 commented on
Jun 14, 2025 • 0 new comments -
[RFC]: Introduce a Triton-only Transformer Execution Path in vLLM
#13319 commented on
Jun 13, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device with nvidia v100
#19185 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: AttributeError: 'Llama_Nemotron_Nano_VL_Config' object has no attribute 'hidden_size'. Did you mean: 'vit_hidden_size'?
#19360 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vllm crashes after when using quantized model on CPU with error "torch not compiled with CUDA enabled"
#18198 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Hermes tool parser stream output error in Qwen3 case
#19056 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Add support for multi-lora and single lora for classification tasks
#19623 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
#8840 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
#10151 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: (Willing to PR) Avoid KV cache occupying GPU memory when not used
#11408 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vLLM server gets stuck and hangs (V100-32G)
#13753 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vllm server hang when running DeepSeek R1
#13778 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: Help me! I can't run DeepSeek-R1 with the latest docker image on my server
#14039 commented on
Jun 16, 2025 • 0 new comments -
[Bug][Ray]: Pipeline parallelism fails on the same host
#14093 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Error occurred when compiling FlashMLA/csrc/flash_api.cpp
#14250 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: EAGLE / MTP Doesn't Overwrite Approximated Hidden States / KV Cache, 8%- 15% Acceptance Length Degradation
#14649 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: UserWarning on skipping serialisation of PostGradPassManager
#14911 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: CPU utilization
#14931 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: a new attention adaptation
#14940 commented on
Jun 16, 2025 • 0 new comments -
[Performance]: Speculative Decoding vs. Standard Inference
#14941 commented on
Jun 16, 2025 • 0 new comments -
[Usage]:
#14944 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: Torch 2.5.1 with latest main branch
#14973 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Unable to get vLLM working with RTX 5090
#18995 commented on
Jun 15, 2025 • 0 new comments -
[Usage]: How to use DeepSeek-R1-0528-Qwen3-8B with function call
#19001 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: InternVL3 image dynamic preprocess issue
#19585 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: OOM on wake up (72B model on 8×A800 40G)
#13941 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Possible memory leak with Whisper in vLLM 0.8.4?
#16966 commented on
Jun 19, 2025 • 0 new comments -
[Installation]: undefined symbol: _ZNK3c1011StorageImpl27throw_data_ptr_access_errorEv
#15010 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: "CUDA out of memory" error occurs
#15182 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Extra Characters in `content` When Using `enable_reasoning` with `stop` Parameter
#15188 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Is the logic order correct during the scheduler procedure?
#16982 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device
#5547 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: prefix-caching: inconsistent completions
#5543 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: with `--enable-prefix-caching` , `/completions` crashes server with `echo=True` above certain prompt length
#5344 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: [V1][Spec Dec] EAGLE TP > 1 leads to errors when using --enforce_eager
#17513 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Image Fails to Initialize (Undetected Platform) because of LD_LIBRARY_PATH, PATH environment error with vllm >= 0.9.0
#19184 commented on
Jun 19, 2025 • 0 new comments -
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
#13216 commented on
Jun 19, 2025 • 0 new comments -
[New Model]: CSM 1b
#18005 commented on
Jun 19, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Logits processor extensibility
#17799 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: RuntimeError on RTX 5090: "no kernel image is available for execution on the device"
#16901 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM does not serve text-only version of Llama4
#18022 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Llama4 LoRA support
#16894 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Cannot use FA version 2; FA3 is only supported on devices with compute capability >= 8.0, excluding 8.6 and 8.9
#13766 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: Distributed inference not supported with OpenVINO?
#14933 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: only 0.4 tokens/s when running 2 or more requests
#15018 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Capture CudaGraph with LoRA
#15090 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: flash_attn_with_kvcache kernel, an illegal memory access
#15113 commented on
Jun 20, 2025 • 0 new comments -
[RFC]: layer-wise kv cache offloading to enable larger batches
#15123 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: online batch inference faster than offline batch inference
#15178 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: VLLM 0.7.3 with tensor parallelism outputs only exclamation marks when using multiple GPUs
#15194 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Does vLLM support dialog prefix continuation?
#15198 commented on
Jun 20, 2025 • 0 new comments -
[Misc][Help]: Adding support for a Custom model with External MoE Routing
#15214 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: How to properly use vLLM when serving - KeyError: 'text'
#15219 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: `VLLM_USE_V1` + `TORCH_SDPA` regression in v0.8
#15251 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: V0 and V1 give the same throughput number
#15253 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: --tensor-parallel-size Error
#15255 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Inconsistent Output Based on Presence of chat_template Parameter
#15272 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: vLLM declares itself healthy before it can serve requests
#15313 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: looking into adding a generation algorithm
#15315 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: RTX 5060 Ti apply_w8a8_block_fp8_linear
#19596 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Qwen 3 MoE Lora adapter support.
#18120 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: RuntimeError: The size of tensor a (1059) must match the size of tensor b (376) at non-singleton dimension, DeepSeek R1 H20x16 pp2, v1 engine
#15332 commented on
Jun 19, 2025 • 0 new comments -
[Feature]: Phi-4 tool support
#11985 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: V100 may not support enable-prefix-caching
#13738 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Accuracy is inconsistent between multi-GPU and single-GPU runs
#13801 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Support DeepEP
#13804 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Using cpu_offload_gb with GGUF fails
#14096 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Better error message for `Invalid attention backend for cuda` with `TORCH_SDPA`
#14320 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Multi GPU inference using two RTX 5090s(TP=2)
#14628 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: EAGLE / DeepSeek MTP Handles First Input Token Incorrectly - 25% Acceptance Rate Drop
#14647 commented on
Jun 18, 2025 • 0 new comments -
[Bug] [ROCm]: RuntimeError: Calling `torch.linalg.cholesky` on a CUDA tensor requires compiling PyTorch with MAGMA. Please use PyTorch built with MAGMA support.
#14914 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: CPU inference won't work for DeepSeek-R1
#15044 commented on
Jun 18, 2025 • 0 new comments -
[New Model]: StableLMAlphaForCausalLM
#15046 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: How to use FlashMLA for DeepSeek-V2
#15079 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Can vLLM run DeepSeek R1 inference natively in FP8 on H20 servers?
#15084 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Request to Include vllm["audio,video"] Package in v0.8.0 Docker Image
#15087 commented on
Jun 18, 2025 • 0 new comments -
[Misc]: Why not sort the waiting queue before popleft?
#15091 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: The difference between 0.7.3 and 0.8.0
#15092 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: `torch.compile` is turned on, but the model LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct does not support it.
#15093 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: 0.8.0(V1) crash on NCCL when load MoE model on 16 GPUs(H20)
#15098 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: How to use vLLM's added torch functions in a separate environment?
#15108 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Improve GPTQ implementation
#15116 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: BadRequestError(400) when using completions API with stream=true and echo=true
#15119 commented on
Jun 18, 2025 • 0 new comments -
[New Model]: support for model: jinaai/jina-reranker-m0
#18447 commented on
Jun 18, 2025 • 0 new comments -
[TPU] Supported models for multimodal multi-image inference on TPU?
#18463 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Gemma3 not supported on B200 w/ Flash-Infer
#19584 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Enhancing vLLM Plugin Architecture
#19161 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: V1 piecewise cudagraph capture size on ROCm is much higher than on cuda
#19579 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Stuck request and empty streaming for gemma3 serving with ^v0.8.5
#17658 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM sleep experiences segmentation fault when used in TRL
#16993 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Issue of Unstable Output for Identical Queries
#19403 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Fused moe config for NVIDIA RTX 6000 ADA
#17768 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Dual a6000 pros not working. Arch 120.
#19025 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Model fails to load in background thread in versions >0.8.5
#18816 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Inconsistent outputs with deterministic sampling (temperature=0) when serving Qwen3-32B with vllm-0.8.5
#17759 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: N-gram speculative decoding performs slower than Qwen3-32B-FP8 with vLLM 0.9.0.1
#19254 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Blackwell Enablement for vLLM (SM100)
#18153 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Limit thinking tokens
#15418 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Schema for checking input shapes for multi-modal models
#14764 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: gpu-memory-utilization is not exact
#17269 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Can I get the model's loss directly?
#9750 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM CPU mode broken: Unable to get JIT kernel for brgemm
#10478 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: ValueError: Model architectures ['LlamaForCausalLM'] failed to be inspected
#11715 commented on
Jun 18, 2025 • 0 new comments