Insights: vllm-project/vllm
Overview
116 Pull requests merged by 73 people
-
[Misc] Clean up useless code
#19889 merged
Jun 20, 2025 -
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError
#19749 merged
Jun 20, 2025 -
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tests
#19901 merged
Jun 20, 2025 -
Export NaNs in logits to scheduler_stats if output is corrupted
#18777 merged
Jun 20, 2025 -
[custom_op][vllm-plugin] update custom_op class to use op_registry
#19164 merged
Jun 20, 2025 -
[Model] GPT2ForSequenceClassification model
#19663 merged
Jun 20, 2025 -
[Fix] import regex instead of re
#19875 merged
Jun 20, 2025 -
[Kernel] correct cpu worker function parameter type
#19745 merged
Jun 20, 2025 -
[Misc] refactor example - openai_transcription_client
#19851 merged
Jun 20, 2025 -
[Misc] update cuda version
#19526 merged
Jun 20, 2025 -
[Bugfix][Ray] Set the cuda context eagerly in the ray worker
#19583 merged
Jun 20, 2025 -
[Bugfix] Enable PP with AITER+V1
#19822 merged
Jun 20, 2025 -
[Chore]: qwen3-moe-type-hints-mistake
#19860 merged
Jun 20, 2025 -
[Benchmark] Fix `Value of type "SampleRequest" is not indexable`
#18032 merged
Jun 20, 2025 -
[CI][Neuron] Fail and exit on first error
#19622 merged
Jun 20, 2025 -
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI
#19872 merged
Jun 20, 2025 -
[Benchmark][Bugfix] Fix Dataset Length Calculation
#19868 merged
Jun 20, 2025 -
[Frontend] early return chat format resolution when specified
#19735 merged
Jun 19, 2025 -
[Core][Bugfix] Fix Online MM Beam Search
#19688 merged
Jun 19, 2025 -
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI
#19838 merged
Jun 19, 2025 -
[Doc] Update V1 user guide for embedding models
#19842 merged
Jun 19, 2025 -
Fixing Chunked Prefill Test.
#19762 merged
Jun 19, 2025 -
[Frontend] Add optional token-level progress bar to `LLM.beam_search`
#19301 merged
Jun 19, 2025 -
Add xLAM tool parser support
#17148 merged
Jun 19, 2025 -
[Minor] Allow redirecting model path for HfRunner in test
#19795 merged
Jun 19, 2025 -
raise exception for pin_lora
#19809 merged
Jun 19, 2025 -
[Misc] [ROCm] Prevent surplus tensor reshape
#19803 merged
Jun 19, 2025 -
[ROCm] [AITER] [Bugfix] Patch for AITER commit `648764942e552a8bb5fe16026703716a81f05374`
#18990 merged
Jun 19, 2025 -
Mark invariant normalizer in Gemma as non-persistent
#19788 merged
Jun 19, 2025 -
[Bugfix] Add check_health to v1 async client.
#19821 merged
Jun 19, 2025 -
[Bugfix] Fix the linter
#19826 merged
Jun 19, 2025 -
Support embedding models in V1
#16188 merged
Jun 19, 2025 -
[Quantization] Modify the logic of BNB double quantization
#19742 merged
Jun 19, 2025 -
[Misc][ROCm] Enforce no unused variable in ROCm C++ files
#19796 merged
Jun 19, 2025 -
Fix FA2 fallback for Blackwell V1
#19781 merged
Jun 19, 2025 -
[Frontend] Expose custom args in OpenAI APIs
#16862 merged
Jun 19, 2025 -
[BugFix] Fix use_cudagraph=False
#19612 merged
Jun 19, 2025 -
[Multimodal] Use fast processor for Qwen2/2.5-VL
#19789 merged
Jun 18, 2025 -
[Core] More fixes to MultiModalEmbeddings type handling
#19715 merged
Jun 18, 2025 -
[TPU] Update torch-xla version to include paged attention tuned block change
#19813 merged
Jun 18, 2025 -
[Core] Do not copy array during hashing
#19484 merged
Jun 18, 2025 -
Disable "Forbid direct 'import triton'" check for
vllm/triton_utils/importing.py
in an extensible way#19783 merged
Jun 18, 2025 -
docs: fix Slack bulletpoint in README
#19811 merged
Jun 18, 2025 -
[v1] Support mamba2
#19327 merged
Jun 18, 2025 -
[Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc
#19808 merged
Jun 18, 2025 -
[Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully
#19725 merged
Jun 18, 2025 -
[Hardware][AMD] integrate aiter chunked prefill into vllm
#18596 merged
Jun 18, 2025 -
[Qwen] Add tagging rule for Qwen related PRs
#19799 merged
Jun 18, 2025 -
[Platform] Allow platform use V1 Engine by default
#19792 merged
Jun 18, 2025 -
[doc] fix the incorrect label
#19787 merged
Jun 18, 2025 -
[Minor] Zero-initialize attn output buffer
#19784 merged
Jun 18, 2025 -
[V1] Decouple GPU and TPU `InputBatch`
#19778 merged
Jun 18, 2025 -
[V1][P/D] An native implementation of xPyD based on P2P NCCL
#18242 merged
Jun 18, 2025 -
[V1] Add API docs for EncoderCacheManager
#19294 merged
Jun 18, 2025 -
[Misc] Add __str__ for RequestStatus
#19780 merged
Jun 18, 2025 -
[MISC] correct DeviceConfig device field static type analysis
#19699 merged
Jun 18, 2025 -
[MISC] correct copy_blocks src_to_dists param type
#19696 merged
Jun 18, 2025 -
[TPU] Update torch version to include paged attention kernel change
#19706 merged
Jun 17, 2025 -
[Feature][ROCm] Add full graph capture support for TritonAttentionBackend
#19158 merged
Jun 17, 2025 -
[Bugfix] Fix faulty triton importing logic when using Ray for DP
#19734 merged
Jun 17, 2025 -
[Misc] Update lmcache connector with the latest connector apis
#19441 merged
Jun 17, 2025 -
Remove sm120 arch from sm100 cutlass kernel arch list
#19716 merged
Jun 17, 2025 -
[Perf] Optimize `moe_align_block_size` CUDA kernel
#19572 merged
Jun 17, 2025 -
[Bugfix] Update multimodal models mapping to fit new checkpoint after Transformers v4.52
#19151 merged
Jun 17, 2025 -
[Misc] remove duplicate engine status checks
#19647 merged
Jun 17, 2025 -
[V1][Kernel] Flashinfer HND KV cache layout
#19280 merged
Jun 17, 2025 -
[doc] split "Other AI Accelerators" tabs
#19708 merged
Jun 17, 2025 -
[doc][mkdocs] Add edit button to documentation
#19637 merged
Jun 17, 2025 -
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel
#19152 merged
Jun 17, 2025 -
Add a doc on how to update PyTorch version
#19705 merged
Jun 17, 2025 -
[Doc] Add missing llava family multi-image examples
#19698 merged
Jun 17, 2025 -
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManager
#19686 merged
Jun 17, 2025 -
Fixes IMA for TP w/ flex-attention
#19712 merged
Jun 17, 2025 -
[DOC] fix doc typos
#19600 merged
Jun 17, 2025 -
[Frontend] add chunking audio for > 30s audio
#19597 merged
Jun 17, 2025 -
[Wheel Size] Only build FA2 8.0+PTX
#19336 merged
Jun 17, 2025 -
[doc] add project flag to gcloud TPU command
#19664 merged
Jun 17, 2025 -
[Fix] Fall back to Gloo when NCCL backend is unavailable
#19641 merged
Jun 17, 2025 -
[Quantization] Remove FP4 emulation; Fall-back to marlin for device < 100
#19563 merged
Jun 16, 2025 -
[V1] Change return type on get_multimodal_embeddings()
#19446 merged
Jun 16, 2025 -
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture with MiniMaxText01ForCausalLM)
#19677 merged
Jun 16, 2025 -
[Kernels] Use empty for modular MoE workspaces
#19667 merged
Jun 16, 2025 -
[Bugfix] fix missing 'finish_reason': null in streaming chat
#19662 merged
Jun 16, 2025 -
[MISC] bump huggingface_hub pkg to 0.33.0
#19547 merged
Jun 16, 2025 -
[Bugfix] Fix TP inference for Flex attention backend
#19657 merged
Jun 16, 2025 -
[Feature]: Allow for Granite MoE Hybrid models with _only_ shared experts.
#19652 merged
Jun 16, 2025 -
[DOC] Add reasoning capability to vLLM streamlit code
#19557 merged
Jun 16, 2025 -
[BugFix] Don't catch BaseException when dumping execute_model errors
#19626 merged
Jun 16, 2025 -
[Kernel] GGUF MMVQ kernel for multiple input vectors
#18754 merged
Jun 16, 2025 -
[Docs] Move multiproc doc to v1 dir
#19651 merged
Jun 16, 2025 -
[CI] Add mteb testing for rerank models
#19344 merged
Jun 16, 2025 -
[MISC] typo fix
#19672 merged
Jun 16, 2025 -
[TPU] support attention head dim smaller than 128
#19620 merged
Jun 16, 2025 -
[Misc] Fix skipped max-model-len validation when deriving max model length from tokenizer config
#19660 merged
Jun 16, 2025 -
[Misc][Frontend] passthrough `bad_words`
#19564 merged
Jun 16, 2025 -
[Bugfix][Core] Prefix caching causes incorrect outputs due to outdated ComputedBlocksTracker
#18957 merged
Jun 16, 2025 -
[MISC] Remove unused variables in C++
#19609 merged
Jun 16, 2025 -
[Misc] Remove duplicate multiproc method setting for CPU platform
#19649 merged
Jun 16, 2025 -
[CI/Build] Fix torch nightly CI dependencies part 2
#19589 merged
Jun 15, 2025 -
Enable prefix caching with full cuda graphs
#19617 merged
Jun 15, 2025 -
[Benchmark] Refactor benchmark script for fp8 & int8
#19627 merged
Jun 15, 2025 -
[Kernel] Raise verbose error and consolidate `num_heads`/`num_kv_heads` divisibility check
#19339 merged
Jun 15, 2025 -
[Bugfix][2/n] Fix speculative decoding CI - Fix test_ngram_e2e_greedy_correctness
#19644 merged
Jun 15, 2025 -
[Perf] Further tunings for SM100 FP8 CUTLASS kernel
#19566 merged
Jun 15, 2025 -
[Fix] Convert kv_transfer_config from dict to KVTransferConfig
#19262 merged
Jun 14, 2025 -
[Bugfix] Don't attempt to use triton if no driver is active
#19561 merged
Jun 14, 2025 -
Only build CUTLASS MoE kernels on Hopper
#19648 merged
Jun 14, 2025 -
[Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization
#19500 merged
Jun 14, 2025 -
[Bugfix] Fix auto dtype casting for BatchFeature
#19316 merged
Jun 14, 2025 -
[Misc] Modularize CLI Argument Parsing in Benchmark Scripts
#19593 merged
Jun 14, 2025 -
[Bugfix][1/n] Fix the speculative decoding test by setting the target dtype
#19633 merged
Jun 14, 2025 -
[V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics.
#18354 merged
Jun 14, 2025 -
[BugFix] Fix DP Coordinator incorrect debug log message
#19624 merged
Jun 14, 2025 -
Adding "AMD: Multi-step Tests" to amdproduction.
#19508 merged
Jun 14, 2025 -
[torch.compile] Use custom ops when use_inductor=False
#19618 merged
Jun 13, 2025 -
[Doc] Add troubleshooting section to k8s deployment
#19377 merged
Jun 13, 2025
96 Pull requests opened by 72 people
-
Sync test dependency with test.in for torch nightly
#19632 opened
Jun 14, 2025 -
[Config] Make prefix cache metrics interval configurable
#19634 opened
Jun 14, 2025 -
[Frontend] Support image object in llm.chat
#19635 opened
Jun 14, 2025 -
[Kernels] MoE refactor
#19636 opened
Jun 14, 2025 -
[Doc] Add inplace weights loading example
#19640 opened
Jun 14, 2025 -
[BugFix] Add an env to disable moe chunking by default
#19642 opened
Jun 14, 2025 -
Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor
#19643 opened
Jun 14, 2025 -
When numa support is found but size is 0, divide by zero exception
#19654 opened
Jun 14, 2025 -
[Bugfix] - Add Trace Headers to Beam Search Path
#19655 opened
Jun 15, 2025 -
[Feature]: Support offline expert load distribution recording
#19658 opened
Jun 15, 2025 -
feat(model loader): add load format 'prefetch_auto' for parallel mmap…
#19659 opened
Jun 15, 2025 -
[Misc] add CLI completion
#19669 opened
Jun 16, 2025 -
[Model] Support TP/PP/mamba2 kernel for PLaMo2
#19674 opened
Jun 16, 2025 -
[Model] Automatic conversion of score (CrossEncoding) models
#19675 opened
Jun 16, 2025 -
[Bugfix] ensure tool_choice is popped when `tool_choice:null` is passed in json payload
#19679 opened
Jun 16, 2025 -
[V1] [Metrics] Hide deprecated metrics.
#19682 opened
Jun 16, 2025 -
Add Thor SBSA and Spark
#19685 opened
Jun 16, 2025 -
[P/D][NixlConnector] Support `tp_size > num_kv_heads` deployments
#19691 opened
Jun 16, 2025 -
feat: add enforce_include_usage option
#19695 opened
Jun 16, 2025 -
add type assertion of request_id for LLMEngine.add_request
#19700 opened
Jun 16, 2025 -
[Docs] Enhance SupportsMultiModal interface documentation
#19701 opened
Jun 16, 2025 -
Make sure the correct version of ao is installed in CI
#19704 opened
Jun 16, 2025 -
Adding "AMD: Plugin Tests" to amdproduction.
#19707 opened
Jun 16, 2025 -
[Model] Activated LoRA
#19710 opened
Jun 16, 2025 -
[Misc][Tools][Benchmark] Add profile to autotune script
#19711 opened
Jun 16, 2025 -
[Kernels][Bugfix] Use torch op for all kernels in FusedMoE forward. Add additional testing for cudagraphs.
#19717 opened
Jun 16, 2025 -
[V1] Perf optimization for layers reusing shared KV cache
#19719 opened
Jun 17, 2025 -
[Kernel] Masked act_mul and fp8-quant Kernels for Batched MoE
#19721 opened
Jun 17, 2025 -
v1: Add Request.block_hashes
#19728 opened
Jun 17, 2025 -
[PD] let toy proxy handle /chat/completions
#19730 opened
Jun 17, 2025 -
v1: Support KV events from connectors
#19737 opened
Jun 17, 2025 -
[P/D] Handle Abort and Make Lifecycle Explicit
#19740 opened
Jun 17, 2025 -
[Feature] add quick all reduce
#19744 opened
Jun 17, 2025 -
[BugFix] fix: aot passes kvcache dtype information
#19750 opened
Jun 17, 2025 -
[v1] Re-add fp32 support to v1 engine through FlexAttention
#19754 opened
Jun 17, 2025 -
[Multimodal] Optimize Qwen2/2.5-VL startup time
#19756 opened
Jun 17, 2025 -
[feat]: CUTLASS block scaled group gemm for SM100
#19757 opened
Jun 17, 2025 -
Register deepgemm moe kernels to work with v1 engine
#19759 opened
Jun 17, 2025 -
[BugFix] Fix topk_softmax assert
#19764 opened
Jun 17, 2025 -
add mamba head fix
#19766 opened
Jun 17, 2025 -
[Draft][torch.compile][ROCm][V1] Enable attention output FP8 fusion for V1 attention backends
#19767 opened
Jun 17, 2025 -
BLOCK_SIZE_K fix
#19769 opened
Jun 17, 2025 -
Workaround for an integer overflow with large CHUNK_SIZE
#19770 opened
Jun 17, 2025 -
Triton-fused DeepseekScalingRotaryEmbedding
#19771 opened
Jun 17, 2025 -
[AMD][P/D] Add libamdhip64.so.6 for llmd
#19773 opened
Jun 17, 2025 -
[WIP] Splitting attention _fwd_grouped_kernel_stage1 to improve occupancy
#19774 opened
Jun 17, 2025 -
[Ray] v1 Change device str for platform compatibility
#19785 opened
Jun 18, 2025 -
[DP] Support external DP Load Balancer mode
#19790 opened
Jun 18, 2025 -
Add SM120 to the Dockerfile
#19794 opened
Jun 18, 2025 -
Allow to override KV cache memory calculation
#19804 opened
Jun 18, 2025 -
[MISC] add cpu_kvcache_space_bytes to CacheConfig
#19812 opened
Jun 18, 2025 -
Improve quant config semantic clarity, add Nvidia ModelOpt config adaptation
#19815 opened
Jun 18, 2025 -
Introduce RayCudaCommunicator as Ray Compiled Graph communicator
#19816 opened
Jun 18, 2025 -
[Kernel] Add Conch backend for mixed-precision linear layer
#19818 opened
Jun 18, 2025 -
LoRA support on llama4
#19819 opened
Jun 18, 2025 -
[Feature] Integrate new deepgemm
#19820 opened
Jun 18, 2025 -
[Core] Add Flashinfer TRTLLM Backend for Flashinfer decode path (SM100).
#19825 opened
Jun 19, 2025 -
Move Gemma's stacked_params_mapping to class scope
#19829 opened
Jun 19, 2025 -
FP8 custom ops
#19830 opened
Jun 19, 2025 -
[WIP] Async Scheduler Prototype
#19831 opened
Jun 19, 2025 -
[P/D] Asynchronously do _nixl_handshake
#19836 opened
Jun 19, 2025 -
[Do not merge] Cache model info for faster startup
#19837 opened
Jun 19, 2025 -
Add Cutlass integration for MoE FP8
#19843 opened
Jun 19, 2025 -
[BugFix][V0] Fix AssertionError for prompt_logprobs
#19844 opened
Jun 19, 2025 -
refactor example - qwen3_reranker
#19847 opened
Jun 19, 2025 -
v1: Introduce an offloading component
#19848 opened
Jun 19, 2025 -
[Chore] logging metrics rename
#19852 opened
Jun 19, 2025 -
optimize attn
#19858 opened
Jun 19, 2025 -
[Misc] add vllm_config in __init__
#19866 opened
Jun 19, 2025 -
[Docs] Fix syntax highlighting of shell commands
#19870 opened
Jun 19, 2025 -
[V1 Scheduler] BatchScheduler to balance token-based microbatches and reduce GPU pipeline bubbles
#19873 opened
Jun 19, 2025 -
[BugFix][P/D] Fix for cases where _recving_transfers can be cleaned up when *all* transfer done
#19874 opened
Jun 19, 2025 -
Add page-aligned prefill scheduling.
#19878 opened
Jun 19, 2025 -
[Quantization] Add compressed-tensors emulations support for NVFP4
#19879 opened
Jun 19, 2025 -
[Misc] Add type alias `ReqId` and `EngineId` for better readability
#19880 opened
Jun 19, 2025 -
[Core] Add `update_load_config` RPC method
#19884 opened
Jun 20, 2025 -
[EP+DP] Optimize the little operations in the DeepGEMM + DeepEP low latency case
#19885 opened
Jun 20, 2025 -
[New model support] Support Tarsier2
#19887 opened
Jun 20, 2025 -
[Fix][ROCm] Remove unused variables to fix build error on GFX11/12
#19891 opened
Jun 20, 2025 -
add some examples for other benchmark scripts
#19893 opened
Jun 20, 2025 -
[Benchmark][New Dataset] Added benchmark support for Unsloth Vision Datasets
#19894 opened
Jun 20, 2025 -
[Docs] Change response symbol to json in openai_compatible_server.md
#19895 opened
Jun 20, 2025 -
[CI/Build] Push latest tag for cpu and neuron docker image
#19897 opened
Jun 20, 2025 -
[Bugfix][V1][ROCm] Fix AITER Flash Attention Backend to enable Llama-4
#19904 opened
Jun 20, 2025 -
add smollm3 support
#19905 opened
Jun 20, 2025 -
Fix: Check the type of params to be a Sequence not list.
#19910 opened
Jun 20, 2025 -
deepep low latency + fp8 dispatch - test fixes
#19911 opened
Jun 20, 2025 -
[V1] Logits processors extensibility
#19912 opened
Jun 20, 2025 -
[Misc] make get_class check for Executor instead of ExecutorBase
#19914 opened
Jun 20, 2025 -
Track expert selection metrics
#19915 opened
Jun 20, 2025 -
Fix: Missing newline at end of file
#19916 opened
Jun 20, 2025 -
[Bugfix] Fix bnb 8bit model weights loading
#19917 opened
Jun 20, 2025 -
[TPU] Add TPU specific var VLLM_TPU_MOST_MODEL_LEN
#19919 opened
Jun 20, 2025 -
[doc] improve readability for long commands
#19920 opened
Jun 20, 2025 -
Use FusedMoEQuantConfig everywhere
#19921 opened
Jun 20, 2025 -
[doc] add contact us in community
#19922 opened
Jun 20, 2025
132 Issues closed by 33 people
-
[Doc]: Add list of commands for `vllm serve`
#19859 closed
Jun 20, 2025 -
[Bug]: Inductor codegen: fatal error: stddef.h: No such file or directory
#19656 closed
Jun 20, 2025 -
[Usage]: How to eliminate randomness and obtain fixed results with VLLM 0.8
#15205 closed
Jun 20, 2025 -
[Bug]: enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
#19890 closed
Jun 20, 2025 -
[Performance]: speed regression 0.6.2 => 0.6.3?
#9476 closed
Jun 20, 2025 -
[Usage]: vLLM For maximally batched use case
#9760 closed
Jun 20, 2025 -
[Bug]: Interference of Tokens in Concurrent Requests Causing Result Confusion in Version 0.6.3
#9910 closed
Jun 20, 2025 -
[Usage]: Multi-Step Scheduling with Speculative Decoding
#11917 closed
Jun 20, 2025 -
[Bug]: Nccl Test Error
#12008 closed
Jun 20, 2025 -
[Usage]: V0 Does Qwen2-VL Support torch.compile in vllm?
#12693 closed
Jun 20, 2025 -
[Bug]: Pooling request fails for classification task
#12753 closed
Jun 20, 2025 -
[Bug]: V1 engine fails with offline batched inference code in V0 engine
#12929 closed
Jun 20, 2025 -
[Bug]: empty begin stream output
#13293 closed
Jun 20, 2025 -
[Bug]: GPU drops to 0 usage when handling concurrent requests
#13422 closed
Jun 20, 2025 -
[Bug]: Guided decoding only generating single character during inference with finetuned model
#13448 closed
Jun 20, 2025 -
[Feature]: Support xTTSv2
#13457 closed
Jun 20, 2025 -
[Bug]: AWQ doesn't support 4-bit?
#13462 closed
Jun 20, 2025 -
[Feature]: How can I set a different max_pixels for each request when starting the service?
#13463 closed
Jun 20, 2025 -
[Bug]: Memory leak in 0.6 and 0.7 when setting "max_tokens=1"
#13464 closed
Jun 20, 2025 -
[Bug]: Mamba should return states in fp32
#13466 closed
Jun 20, 2025 -
[Doc]: List of Models Supported By TPU Backend
#13476 closed
Jun 20, 2025 -
[Bug]: When deploying two llm services on the same batch of GPUs. Inference will be twice as slow
#13477 closed
Jun 20, 2025 -
[Feature]: kv cache int8: Dynamic kv cache scaling factors computation
#13478 closed
Jun 20, 2025 -
[Bug]: vLLM on TPU is broken with XLA errors
#13479 closed
Jun 20, 2025 -
[Usage]: How to write scoring script when deploying to a managed Azure machine learning real-time endpoint?
#13491 closed
Jun 20, 2025 -
[Usage]: How to use logits processors with max_num_seqs > 1?
#13553 closed
Jun 20, 2025 -
[Bug]: Increasing root volume with guided decoding
#13556 closed
Jun 20, 2025 -
[Bug]: Index Out of Range Bug in Pooler when Using returned_token_ids with hidden_states
#13559 closed
Jun 20, 2025 -
[New Model]: Qwen3-Rerank 0.6B, 4B, 8B
#19529 closed
Jun 19, 2025 -
[Bug]: ValueError: Cannot cast <zmq.Socket(zmq.ROUTER) at 0x796c63de24a0> to int
#19444 closed
Jun 19, 2025 -
[Bug]: Get NCCL_ERROR_SYSTEM_ERROR with latest Docker vLLM image (v0.9.1)
#19613 closed
Jun 19, 2025 -
[Bug]: Async Beam Search Doesn't Pass Multimodal Data Correctly
#19687 closed
Jun 19, 2025 -
[Bug]: fail to load OpenGVLab/InternVL3-78B with vllm
#19856 closed
Jun 19, 2025 -
[Usage]: Implement a custom scheduler
#16479 closed
Jun 19, 2025 -
[Usage]: `gpu_memory_utilization` backend parameter questions
#19805 closed
Jun 19, 2025 -
[Misc]: why `3B-Instruct-AWQ` takes 16G
#15204 closed
Jun 19, 2025 -
[Usage]: How to deploy DeepSeek R1 in a K8s environment
#14740 closed
Jun 19, 2025 -
vector search
#15268 closed
Jun 19, 2025 -
[Bug]: Is vllm support function call mode?
#6631 closed
Jun 19, 2025 -
Conda Forge Package
#3126 closed
Jun 19, 2025 -
[Bug]: OOM with QwQ-32B
#15258 closed
Jun 19, 2025 -
[Bug]: TypeError: Qwen2_5OmniProcessor.__init__() got multiple values for argument 'image_processor'
#19833 closed
Jun 19, 2025 -
[Bug]:
#19832 closed
Jun 19, 2025 -
[Bug]: PreemptionMode.RECOMPUTE is incorrect
#16832 closed
Jun 19, 2025 -
[Bug]: Reproduction failed when evaluate model
#19802 closed
Jun 19, 2025 -
[Bug]: Runtime error occurs when running deepseek v3
#12827 closed
Jun 19, 2025 -
[Feature]: Support custom args in OpenAI (chat) completion requests
#16802 closed
Jun 19, 2025 -
[CI Failure]: Samplers Test - samplers/test_beam_search.py::test_beam_search_passes_multimodal_data
#19736 closed
Jun 18, 2025 -
[Bug]: RAY_CGRAPH_get_timeout is not set successfully. Ray still detects default timeout value.
#19703 closed
Jun 18, 2025 -
[Usage]: How to start the vllm service and pass parameters on XPU
#19528 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/index.html`
#19755 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/google_tpu.html`
#19753 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/google_tpu.html`
#19752 closed
Jun 18, 2025 -
[Docs] Feedback for `/en/latest/`
#19751 closed
Jun 18, 2025 -
[Bug]: vllm running on new H20-3e Nvidia card has occasional garbled bug using Qwen 2.5 VL 72B
#19723 closed
Jun 18, 2025 -
[Feature]: do you plan to support "suffix" of "v1/completions"
#9976 closed
Jun 18, 2025 -
[Bug]: Continuous batching (OpenAI Server) with greedy search return different results
#11658 closed
Jun 18, 2025 -
[Usage]: Does DeepSeek-R1 1.58-bit Dynamic Quant work on VLLM?
#12573 closed
Jun 18, 2025 -
[Usage]: How to get access to scheduler
#12772 closed
Jun 18, 2025 -
[Usage]: How to check the corresponding functionality of operators in Llama-2-7b-hf?
#13010 closed
Jun 18, 2025 -
[Bug]: CUDA memory error with benchmark_serving.py
#13152 closed
Jun 18, 2025 -
[Bug]: Cannot pull the docker image for installation
#13330 closed
Jun 18, 2025 -
[Bug]: vllm server bad
#13340 closed
Jun 18, 2025 -
[Bug]: DeepSeek R1 deployment panics when serving requests with cuda memory access
#13389 closed
Jun 18, 2025 -
[Bug]: Lora Adapters with num-scheduler-steps doesn't work in version 0.7.2, even with VLLM_USE_V1=0
#13394 closed
Jun 18, 2025 -
[Misc]: Why do we need to explicitly pass tool parsers?
#13399 closed
Jun 18, 2025 -
[Feature]: Support token-level timestamps in whisper models
#13400 closed
Jun 18, 2025 -
[Installation]: flash-attention internal "git submodule update" problematic for offline-install
#13424 closed
Jun 18, 2025 -
[Feature]: load_weights function in JambaForSequenceClassification
#13430 closed
Jun 18, 2025 -
[CI Failure]: Distributed Tests (2 GPUs) - v1/test_async_llm_dp.py::test_load
#19731 closed
Jun 17, 2025 -
[Feature]: Optimize `moe_align_block_size` CUDA kernel
#19517 closed
Jun 17, 2025 -
[Bug]: Error when loading model(gemma-3-4b) merged after DeepSpeed training into vLLM
#19139 closed
Jun 17, 2025 -
[Bug]: guided_regex parsing error crashes the server
#19270 closed
Jun 17, 2025 -
Release dataset of bug-fixing commits and test cases on Hugging Face
#19738 closed
Jun 17, 2025 -
[Bug]: GPU Placement Group Creation Error in Multi-Node Setup with vLLM
#13388 closed
Jun 17, 2025 -
[Bug]: Strange cuda out of memory when runing llava1.5 7b on 80G A100
#19724 closed
Jun 17, 2025 -
[Bug]: Broken Structured Output (Guided Decoding) with Qwen3 models when `enable_thinking=False`
#18819 closed
Jun 17, 2025 -
[Doc]: Does llava onevision support VLM multi images?
#19521 closed
Jun 17, 2025 -
[Usage]: How to identify min and max pixels for the image
#15034 closed
Jun 17, 2025 -
[Usage]: Can vllm multimodal generate use preprocessed image?
#14998 closed
Jun 17, 2025 -
[Usage]: Transcription "Maximum clip duration (30s) exceeded"
#15012 closed
Jun 17, 2025 -
centos7 package err, is my problem?
#14750 closed
Jun 17, 2025 -
[Bug]: RuntimeError: HIP Error on vLLM ROCm Image in Kubernetes Cluster with AMD GPUs
#10855 closed
Jun 17, 2025 -
[Feature]: multiple gpus specification
#13357 closed
Jun 17, 2025 -
[CI Failure]: Spec Decoding - spec_decode/e2e/test_multistep_correctness.py
#18954 closed
Jun 16, 2025 -
[Bug]: (regression from v0.8.5) missing "finish_reason": null in streaming chat completion outputs
#19650 closed
Jun 16, 2025 -
[Bug]: Cannot run two containers on one card when using VLLM_USE_V1=1
#17366 closed
Jun 16, 2025 -
[Misc]: How does vllm consume request in async mode?
#13328 closed
Jun 16, 2025 -
[Bug]: issue about the gguf model's --compilation-config options
#13329 closed
Jun 16, 2025 -
[Usage]: How to enable tool calling in serve llm?
#19601 closed
Jun 16, 2025 -
[Feature]: Vectorize `scaled_int8_quant`
#18866 closed
Jun 15, 2025 -
[Benchmark Script] Refactor benchmark script for `bench_datatype_gemm`
#19364 closed
Jun 15, 2025 -
[Bug]: Docker deployment returns zmq.error.ZMQError: Operation not supported
#10856 closed
Jun 15, 2025 -
[Usage]: zmq.error.ZMQError: Operation not supported
#11564 closed
Jun 15, 2025 -
[Performance]: It takes too much time to Add a request.
#12314 closed
Jun 15, 2025 -
When tp>1 vllm not work (Qwen2.5-VL-72B)
#13124 closed
Jun 15, 2025 -
[Bug]: Speculative Decoding Output with Pytorch Rejection Sampling does not change when changing seed
#13196 closed
Jun 15, 2025 -
[Bug]: Gemma2ForSequenceClassification has no vLLM implementation
#13254 closed
Jun 15, 2025 -
[Misc]: Regarding the issue of inconsistent calculation of tokens
#13256 closed
Jun 15, 2025 -
[Bug]: Flashinfer Metadata init for speculative decoding
#13264 closed
Jun 15, 2025 -
[Bug]: transformer fallback encounters CUDA OOM with large models
#13268 closed
Jun 15, 2025 -
[Bug]: AttributeError: 'OpenVINOWorker' object has no attribute 'cache_engine'
#13287 closed
Jun 15, 2025 -
[Bug]: meta-llama/Llama-3.2-90B-Vision-Instruct in vllm/vllm-openai:latest
#13307 closed
Jun 15, 2025 -
[Bug]: 'dict' object has no attribute 'is_kv_transfer_instance'
#19259 closed
Jun 14, 2025 -
[Feature]: Support for DeepGEMM
#13857 closed
Jun 14, 2025 -
[Bug]: deepseek-vl2 `RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same`
#19219 closed
Jun 14, 2025 -
[Usage]: How to run intern_vit model using intern_vit.py?
#11221 closed
Jun 14, 2025 -
[Feature]: Any plan to support multi-token prediction via speculative decoding for DeepSeek?
#12730 closed
Jun 14, 2025 -
[Bug]: [V1] Mamba models fail on profile run
#12826 closed
Jun 14, 2025 -
[Bug]: Docker multi-node multi-GPU deployment -- model startup error
#13145 closed
Jun 14, 2025 -
[New Model]: Surya OCR
#13172 closed
Jun 14, 2025 -
[Bug]: Concurrent streaming mode produces jumbled tokens
#13186 closed
Jun 14, 2025 -
Collecting data from the user should be disclosed; let the user choose whether to upload or not
#13195 closed
Jun 14, 2025 -
[Bug]: when i run DeepSeek-V2-Lite with [ngram], the result is different
#13206 closed
Jun 14, 2025 -
[Bug]: examples/grad_web_server.py run error
#13223 closed
Jun 14, 2025
97 Issues opened by 88 people
-
Issue 1: Missing type hint for `wheel` argument in `generate_index.py`
#19918 opened
Jun 20, 2025 -
[Docs] Feedback for `/en/latest/getting_started/installation/gpu.html`
#19913 opened
Jun 20, 2025 -
[Bug]: file in `vllm/benchmarks/kernels/benchmark_marlin.py` cannot execute
#19909 opened
Jun 20, 2025 -
[Bug]: Deepseek R1 0528 tool calling not working
#19907 opened
Jun 20, 2025 -
[Bug]: RTX5080 got CUDA error: no kernel image is available for execution on the device
#19906 opened
Jun 20, 2025 -
[Bug]: nsys can't open the file
#19903 opened
Jun 20, 2025 -
[Feature]: `kv_transfer_params` not returned for multiple subrequests
#19902 opened
Jun 20, 2025 -
[Doc]: The documentation should be updated to cover GPU compatibility
#19900 opened
Jun 20, 2025 -
[Feature]: EXL3 support
#19896 opened
Jun 20, 2025 -
[Bug]: AsyncLLMEngine stuck in V1
#19892 opened
Jun 20, 2025 -
[Performance]: the performance decline in fp8 inference mode
#19888 opened
Jun 20, 2025 -
[RFC]: Inplace model weights loading
#19886 opened
Jun 20, 2025 -
Issue 1: Incorrect comparison with MAIN_CUDA_VERSION for CPU target
#19882 opened
Jun 19, 2025 -
[Feature]: Implement `check_health` for V1
#19881 opened
Jun 19, 2025 -
[Feature]: Support passing token-level schedules for temperature and other sampling parameters
#19877 opened
Jun 19, 2025 -
[Bug]: InternVL3 poor (random) output with 8bit quantization
#19876 opened
Jun 19, 2025 -
[Bug]: 'IndexError: tuple index out of range' when using 8 gpu's
#19871 opened
Jun 19, 2025 -
[Usage]: missing latest tag from cpu docker registry
#19869 opened
Jun 19, 2025 -
[Bug]: AiterFlashAttentionImpl.__init__() got multiple values for argument 'use_irope' for llama4 model
#19867 opened
Jun 19, 2025 -
[Bug]: NCCL issues when running vllm v0.9.1 for the Deepseek-R1 model [B200 GPU]
#19865 opened
Jun 19, 2025 -
[Feature]: Returning embedding dimensions in /v1/models
#19864 opened
Jun 19, 2025 -
[Bug]: 5090 gemma-3-12b-it using FP8/INT8/FP16 quantization for conncurent requests DOCKER.
#19863 opened
Jun 19, 2025 -
[Bug]: Internal Server Error when use max_tokens=null
#19862 opened
Jun 19, 2025 -
[Bug]: max_completion_tokens doesn't work as max
#19861 opened
Jun 19, 2025 -
[Bug]: AttributeError: 'InferenceClient' object has no attribute 'post'
#19857 opened
Jun 19, 2025 -
[Bug]: dynamic fp8 quantization does not save memory when enable_sleep_mode=True
#19855 opened
Jun 19, 2025 -
[RFC]: KV cache offloading
#19854 opened
Jun 19, 2025 -
[Bug]: Unable to deploy NVFP4 quantized model
#19853 opened
Jun 19, 2025 -
[Feature]: Quant & TP for VIT
#19850 opened
Jun 19, 2025 -
[Bug]: Subprocess health check / automatic restart for V1 EngineCore
#19849 opened
Jun 19, 2025 -
[Bug]: Not able to run vllm cpu using Dockerfile.cpu
#19845 opened
Jun 19, 2025 -
[Usage]: Troubleshooting Inconsistencies Between VLLM and Transformer Outputs
#19841 opened
Jun 19, 2025 -
[Bug]: Qwen3 with thinking disabled puts output into reasoning_content when it should be in content
#19839 opened
Jun 19, 2025 -
[Usage]: Why does GPU memory usage keep growing after starting qwen2.5vl-7b with vllm
#19828 opened
Jun 19, 2025 -
[Feature]: Improve startup time UX
#19824 opened
Jun 19, 2025 -
[Feature]: `CustomOp` cleanup
#19817 opened
Jun 18, 2025 -
[Bug]: Loading Qwen3MoE using Transformers backend
#19801 opened
Jun 18, 2025 -
[Bug]: After receiving the request, the service froze
#19800 opened
Jun 18, 2025 -
[Bug]: ValueError: Exceeds max model len when embedding using bge-large-zh-v1.5
#19798 opened
Jun 18, 2025 -
[Usage]: How could vllm support token classifications(albert model)?
#19797 opened
Jun 18, 2025 -
[Bug]: MP Executor does not correctly handle device allocation for non-CUDA devices (e.g., NPUs)
#19791 opened
Jun 18, 2025 -
[Bug]: Using vllm 0.7.3 with Qwen2.5VL-7b sometimes throws errors
#19786 opened
Jun 18, 2025 -
[Doc]: Update the vllm quantization support for the AMD GPU
#19782 opened
Jun 18, 2025 -
[Bug]: wrong output on L20 using fp8
#19779 opened
Jun 17, 2025 -
[Bug][Spec Decode]: TPOT in prometheus is ITL in vllm serve
#19776 opened
Jun 17, 2025 -
[Bug]: Potential bug when using speculative decoding example similar as the one from docs
#19775 opened
Jun 17, 2025 -
[Feature]: Evaluate prompt presence on subsequent audio chunks
#19772 opened
Jun 17, 2025 -
Vllm + FlexAttention Work Tracking
#19765 opened
Jun 17, 2025 -
[Bug]: Gemma3 reporting low image accuracy with v1 engine
#19763 opened
Jun 17, 2025 -
[Feature]: Upgrade base Python version of vllm/vllm-tpu docker image to 3.11+
#19761 opened
Jun 17, 2025 -
[Bug]: vllm/vllm-tpu uses Debian base but Ubuntu APT sources, causing package installation errors
#19760 opened
Jun 17, 2025 -
[Feature]: Remove cupy dependency for multi-node Ray deployment
#19758 opened
Jun 17, 2025 -
[Usage]: embed prompts
#19746 opened
Jun 17, 2025 -
[Bug]: Incorrect kernel selected when multiple GPUs
#19741 opened
Jun 17, 2025 -
[Feature]: Logging details about incorrect requests
#19739 opened
Jun 17, 2025 -
[Bug]: Fails to load llmcompressor GPTQ Qwen2.5-VL model
#19733 opened
Jun 17, 2025 -
[Performance]: very slow performance for nested list with length constraints
#19732 opened
Jun 17, 2025 -
[Bug]: tool_chat_template_deepseekr1 is no use
#19729 opened
Jun 17, 2025 -
[Feature]: AWQ DeepSeek support on MI300X
#19727 opened
Jun 17, 2025 -
[Bug]: DeepGEMM does not work with CUDA Graph
#19722 opened
Jun 17, 2025 -
[Doc]: Among the supported models, is there an embedding model that supports setting the embedding dimension to 1536?
#19720 opened
Jun 17, 2025 -
[Performance]: No significant speedup from Wfp8Afp8 (vs Wbf16Abf16) in Llama-4 Scout
#19714 opened
Jun 16, 2025 -
[Bug]: Audio transcription cannot load `preprocessor_config.json` when using runai streamer
#19713 opened
Jun 16, 2025 -
[Feature]: Simplify speculative-config format for vllm serve
#19709 opened
Jun 16, 2025 -
[RFC]: Multimodal data IPC improvement
#19702 opened
Jun 16, 2025 -
[Bug]: Truncated && Incomplete Response from LLAMA4 Scout Prefix Caching
#19697 opened
Jun 16, 2025 -
[Bug]: Enable LORA on Version 0.9.1 and RTX 5090 causes an issue
#19693 opened
Jun 16, 2025 -
[Performance]: V1 engine runs slower than V0 on the MI300X
#19692 opened
Jun 16, 2025 -
[Usage]: Changing image_feature and image_input_shape has no effect on VLM output
#19689 opened
Jun 16, 2025 -
[Bug]: deploy 32B model using vllm + ray with two nodes failed with nccl error
#19684 opened
Jun 16, 2025 -
[Bug]: rocm build crashes with libcuda.so.1: cannot open shared object file
#19681 opened
Jun 16, 2025 -
[Usage]: What is the meaning of `Avg generation throughput`
#19680 opened
Jun 16, 2025 -
[Bug]: `guided_regex` not working on M2 Ultra VLLM
#19676 opened
Jun 16, 2025 -
[New Model]: Support BAAI/bge-reranker-v2-gemma model
#19673 opened
Jun 16, 2025 -
[Bug]: vllm, EngineCore encountered a fatal error TimeoutError
#19668 opened
Jun 15, 2025 -
[Bug]: mismatch between multimodal tokens and placeholders for Qwen_2.5-3B (4 GPUs*24G)
#19666 opened
Jun 15, 2025 -
[Bug]: Phi-3-Small model reporting AttributeError: 'NoneType' object has no attribute 'prefill_metadata'
#19665 opened
Jun 15, 2025 -
[Bug]: ValueError out of range float values are not json compliant
#19661 opened
Jun 15, 2025 -
[Feature]: Add EP/DP/PD deps in docker image
#19653 opened
Jun 14, 2025 -
[Usage]: Is there any vllm images include deep_ep ?
#19646 opened
Jun 14, 2025 -
[Usage]: V1 can not support for macOS with Apple silicon.
#19645 opened
Jun 14, 2025 -
[Bug]: Crash during OpenAI API server usage
#19639 opened
Jun 14, 2025 -
[Bug]: Unable to run Jamba 1.6 Large with Tensor Parallelism
#19638 opened
Jun 14, 2025 -
[Bug]: Illegal memory access on llama4 maverick
#19631 opened
Jun 13, 2025
338 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Core] feat: Implement Priority Scheduling in V1 Engine
#19057 commented on
Jun 20, 2025 • 66 new comments -
[Core] Support Local Chunked Attention for Hybrid KV Cache
#19351 commented on
Jun 19, 2025 • 35 new comments -
[Feature] Support sequence parallelism for static fp8 quantization
#19181 commented on
Jun 20, 2025 • 26 new comments -
Draft: WIP NixlConnector drop ZMQ in favor of HTTP metadata exchanges
#19447 commented on
Jun 19, 2025 • 16 new comments -
[Bugfix] Move hardware-dependent configuration resolution (FlashMLA capability, `dtype: 'auto'`) to worker
#18979 commented on
Jun 18, 2025 • 15 new comments -
[V1] - Enable worker -> scheduler connector metadata
#19555 commented on
Jun 20, 2025 • 13 new comments -
[Metrics] Compute and log the serving FLOPs
#19290 commented on
Jun 15, 2025 • 11 new comments -
[Feature][Quantization] MXFP4 support for MOE models
#17888 commented on
Jun 19, 2025 • 8 new comments -
[Bug][Frontend] Fix structure of transcription's decoder_prompt
#18809 commented on
Jun 20, 2025 • 7 new comments -
[Feature] microbatch tokenization
#19334 commented on
Jun 19, 2025 • 6 new comments -
[Frontend] [Core] Integrate Tensorizer in to S3 loading machinery, allow passing arbitrary arguments during save/load
#19619 commented on
Jun 20, 2025 • 6 new comments -
[Platform] Add custom default max tokens
#18557 commented on
Jun 19, 2025 • 6 new comments -
[Misc] feat output content in stream response
#19608 commented on
Jun 18, 2025 • 5 new comments -
[Frontend] Add unix domain socket support
#18097 commented on
Jun 17, 2025 • 4 new comments -
[Misc] Configurable timeout for execute_model RPC calls via env var
#19544 commented on
Jun 19, 2025 • 4 new comments -
[Bugfix][Benchmarks]Fixed async_request_deepspeed_mii() to get ttft
#18689 commented on
Jun 19, 2025 • 4 new comments -
[Core] Add Support for Default Modality Specific LoRAs [generate / chat completions]
#19126 commented on
Jun 20, 2025 • 4 new comments -
[BugFix][V1][ROCm] Triton MLA uses V0 backend on V1 engine
#19067 commented on
Jun 19, 2025 • 4 new comments -
[Kernel] Adding basic Triton JitCache for triton_attn
#16606 commented on
Jun 18, 2025 • 4 new comments -
[Feature] A calibration-free RTN-based quantization for accurate and accelerated INT4/INT8 inference
#18768 commented on
Jun 18, 2025 • 3 new comments -
[Frontend] Add `/v1/audio/translations` OpenAI API endpoint
#19615 commented on
Jun 19, 2025 • 3 new comments -
[Bugfix] VLLM_V1 supports passing other compilation levels
#19340 commented on
Jun 19, 2025 • 2 new comments -
[Misc]: refactor: ParallelConfig init func
#19310 commented on
Jun 14, 2025 • 2 new comments -
[Chore] debloat some initial logs
#19438 commented on
Jun 20, 2025 • 2 new comments -
Enable CPU nightly performance benchmark and its Markdown report
#18444 commented on
Jun 20, 2025 • 2 new comments -
[Bugfix] Skip loading extra parameters for modelopt Qwen3 MoE model
#19598 commented on
Jun 18, 2025 • 2 new comments -
[WIP] [Core][P/D] CPU connector for PD disagg
#18332 commented on
Jun 16, 2025 • 2 new comments -
[V1] LogitsProcessor programming model
#16728 commented on
Jun 19, 2025 • 2 new comments -
Add quickreduce as alternative to custom allreduce
#16804 commented on
Jun 17, 2025 • 2 new comments -
[CI] bump mypy version to 1.16.0
#19548 commented on
Jun 17, 2025 • 2 new comments -
[P/D][Bugfix]: Fix the issue where the remote KVCache cannot be loaded when PP > 1
#19558 commented on
Jun 17, 2025 • 2 new comments -
[Bugfix]: Fix messy code when using logprobs
#19209 commented on
Jun 20, 2025 • 1 new comment -
fix: cuda 12.6 installation
#19095 commented on
Jun 15, 2025 • 1 new comment -
[Bugfix]fix asyncLLM test_abort
#16090 commented on
Jun 17, 2025 • 1 new comment -
[Misc] Add gemma3 chat template with pythonic-style function calling
#17149 commented on
Jun 16, 2025 • 1 new comment -
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6.0
#19346 commented on
Jun 20, 2025 • 1 new comment -
[Frontend] Add -d/--detach option for vllm serve and process management
#18065 commented on
Jun 19, 2025 • 1 new comment -
[P/D][Feature] Support using NIXL together with PP
#19591 commented on
Jun 15, 2025 • 1 new comment -
[V1] Only print cudagraph tqdm on rank 0 with `is_global_first_rank`
#19516 commented on
Jun 16, 2025 • 1 new comment -
Support multicard for Disaggregated Prefill/Decode and provide a automatic benchmark test
#15221 commented on
Jun 20, 2025 • 0 new comments -
[WIP][TPU] Support mrope models (Qwen2VL)
#15149 commented on
Jun 19, 2025 • 0 new comments -
Metrics proposal OpenTelemetry API
#15138 commented on
Jun 20, 2025 • 0 new comments -
[Misc]Fix incorrect local IP detection in multi-network interface environments
#15071 commented on
Jun 18, 2025 • 0 new comments -
[Bugfix] Fix include prompt in stream response when echo=true
#15233 commented on
Jun 18, 2025 • 0 new comments -
When an exception happens in multiproc, die hard and fast
#15000 commented on
Jun 18, 2025 • 0 new comments -
Add missed ray[data] dependence in cuda.txt
#15283 commented on
Jun 20, 2025 • 0 new comments -
[PD] Skip `tp_size` exchange with rank0
#19413 commented on
Jun 16, 2025 • 0 new comments -
[Bugfix] Fix the missing '}' issue for nested object parameters in stream function call.
#16919 commented on
Jun 19, 2025 • 0 new comments -
[RFC] per module sharded weight tagging
#17001 commented on
Jun 19, 2025 • 0 new comments -
[Feat][Frontend] Added support for HermesToolParser for models without special tokens
#16890 commented on
Jun 20, 2025 • 0 new comments -
[Perf] Optimize MRoPR position preparing performance with numba
#16881 commented on
Jun 18, 2025 • 0 new comments -
[Model][Frontend] Adding timeseries modality support and Qwen2.5-ChatTS model support
#16852 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Add Split-KV Attention Kernel to the triton_attn Backend
#16794 commented on
Jun 18, 2025 • 0 new comments -
[Bugfix][Disaggregated] Set min_tokens in disagg_proxy_demo.py
#16705 commented on
Jun 19, 2025 • 0 new comments -
[NIXL] vllm v0 nixl integration
#16677 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix][Model] fix Phi3Small model only support v0
#16493 commented on
Jun 17, 2025 • 0 new comments -
[Model][VLM] Add Qwen2.5-Omni model support (end-to-end full support)
#16347 commented on
Jun 19, 2025 • 0 new comments -
[Draft] SnapKV
#16160 commented on
Jun 16, 2025 • 0 new comments -
[Bugfix] fix phi4-mini tool call parse in streaming mode
#17094 commented on
Jun 17, 2025 • 0 new comments -
[V0][V1][Core] Add outlines integration for V1, and update V0 integration.
#15975 commented on
Jun 20, 2025 • 0 new comments -
feat: update allow_pattern
#15797 commented on
Jun 19, 2025 • 0 new comments -
[Benchmark] Fix two issues in benchmark result
#15795 commented on
Jun 19, 2025 • 0 new comments -
Optimizing Cascade Attention for Parallel Sampling
#15772 commented on
Jun 15, 2025 • 0 new comments -
[V1][Experimental] Jump-forward decoding
#15490 commented on
Jun 19, 2025 • 0 new comments -
Add KV-Cache int8 quant support
#10354 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Docker vLLM 0.9.1 CUDA error: an illegal memory access, sampled_token_ids.tolist()
#19483 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: There is no module or parameter named '_orig_mod' in Qwen2ForCausalLM
#12783 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: RTX50xx GPU is not supported for running W8A8 FP8 quant models!
#19605 commented on
Jun 20, 2025 • 0 new comments -
[New Model]: support for model: https://huggingface.co/jinaai/jina-clip-v2
#18448 commented on
Jun 20, 2025 • 0 new comments -
[Installation]: deployment failure on Kuberentes with CPU device (testing).
#17187 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Grammar error: Pointer '/$defs/xxxxx' does not exist
#16467 commented on
Jun 20, 2025 • 0 new comments -
[Feature] [ROCm]: AITER Kernel Integration
#14964 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: support reasoning output when offline batched inference
#17292 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Single-Node data parallel (--data-parallel-size=4) leads to vLLM crash
#18567 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: `uv run vllm serve` with DP results in NCCL error: two ranks use the same device
#17176 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Docker, v0.9.0.1, Gemma3-4B, "Unsupported conversion from f16 to f16" on Nvidia T4
#19203 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Outlines broken on vLLM 0.8+
#15636 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: Starting a Qwen/Qwen3-32B model service with the vllm docker image, CPU usage stays at 100%
#19150 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: RTX 5090 with vllm/vllm-openai docker image
#16652 commented on
Jun 20, 2025 • 0 new comments -
[Bug] TP=2 fails on dual RTX 5090: TorchInductor compile error or CUDA illegal memory access (TP=1 works)
#18814 commented on
Jun 20, 2025 • 0 new comments -
Llama3.2 Vision Model: Guides and Issues
#8826 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management
#10086 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Multi-Node Online Inference on TPUs Failing
#12179 commented on
Jun 20, 2025 • 0 new comments -
[V1]Improve V1 startup error handling
#14758 commented on
Jun 14, 2025 • 0 new comments -
[Perf] Optimize Qwen2/2.5-VL ViT tensor generating performance
#14684 commented on
Jun 19, 2025 • 0 new comments -
[Core] Add a level 3 sleep/wake_up that offloads tensors to disk
#14678 commented on
Jun 17, 2025 • 0 new comments -
[Quant] SupportsQuant handles ignored_modules
#14635 commented on
Jun 19, 2025 • 0 new comments -
[Quant] Add SupportsQuant and packed_modules_mapping to all models
#14631 commented on
Jun 19, 2025 • 0 new comments -
[Hardware][Intel GPU] Add V1 engine support and `chunked_prefill` kernel
#14612 commented on
Jun 19, 2025 • 0 new comments -
First working PoC for bge-m3 sparse embeddings
#14526 commented on
Jun 17, 2025 • 0 new comments -
[WIP] Support models with mrope (Qwen2VL) on TPU
#14442 commented on
Jun 19, 2025 • 0 new comments -
[Model] add colqwen2_vl code & inference
#14291 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Make memory profiler account for speculative draft model weights
#14067 commented on
Jun 20, 2025 • 0 new comments -
[Model] [Quantization] Support deepseek v3/r1 w8a8 int8 block-wise quantization
#13942 commented on
Jun 20, 2025 • 0 new comments -
[WIP][Core] Support tensor parallelism with uneven heads
#13934 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Enable speculative decoding for models with nearly-identical vocab sizes
#13849 commented on
Jun 19, 2025 • 0 new comments -
[Model] Support VLMs with transformers backend
#13754 commented on
Jun 18, 2025 • 0 new comments -
[V0][Sampler] Use raw logits for greedy argmax
#13312 commented on
Jun 19, 2025 • 0 new comments -
[Hardware][Metal] Apple Metal support
#12640 commented on
Jun 19, 2025 • 0 new comments -
[Misc]add modules_to_not_convert attribute to gptq series
#12103 commented on
Jun 20, 2025 • 0 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Jun 17, 2025 • 0 new comments -
[Model] LoRA with lm_head and embed_tokens fully trained - 4
#11714 commented on
Jun 19, 2025 • 0 new comments -
qwen optimize
#19406 commented on
Jun 19, 2025 • 0 new comments -
Feat Dynamic Quantization for MoE Layers in GPTQ Marlin Backend
#19395 commented on
Jun 17, 2025 • 0 new comments -
Add GLM4.1V model (Draft)
#19331 commented on
Jun 18, 2025 • 0 new comments -
[V1] [P/D] Add Support for KV Load Failure Recovery
#19330 commented on
Jun 20, 2025 • 0 new comments -
[Core] Update error message for Whisper + num-scheduler-steps > 1
#19286 commented on
Jun 14, 2025 • 0 new comments -
[Bugfix] ROCm FP8 Quantization Padding Issue
#19251 commented on
Jun 19, 2025 • 0 new comments -
[Core] Allow vLLM to stream n tokens at a time
#19240 commented on
Jun 19, 2025 • 0 new comments -
[Misc][Bugfix] specify docker registry to support podman
#19236 commented on
Jun 17, 2025 • 0 new comments -
[Bugfix] Fix Qwen2-Audio chat template for online serving
#19230 commented on
Jun 18, 2025 • 0 new comments -
[Doc]: improve CPU(x86) build instructions and fix include path
#19156 commented on
Jun 17, 2025 • 0 new comments -
[Core] Add constants for CUDA compute capabilities
#19099 commented on
Jun 19, 2025 • 0 new comments -
Fix Incorrect data_parallel_rank and subsequent errors under torchrun
#19096 commented on
Jun 19, 2025 • 0 new comments -
[feature] Integrate quick allreduce into custom allreduce and select the best allreduce implementation
#19094 commented on
Jun 20, 2025 • 0 new comments -
[Bugfix]: Fix DualChunkFlashAttention for short sequences
#19084 commented on
Jun 19, 2025 • 0 new comments -
[P/D] Exchange NIXL metadata through rank 0
#19080 commented on
Jun 15, 2025 • 0 new comments -
[BugFix]: Hermes tool parser stream output error in Qwen3 case #19056
#19058 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Add Conch Triton Attention backend
#19625 commented on
Jun 19, 2025 • 0 new comments -
Use the correct torch dtype in topk kernel assertion
#19614 commented on
Jun 16, 2025 • 0 new comments -
[Frontend] /metadata: Get more useful server information easily.
#19604 commented on
Jun 18, 2025 • 0 new comments -
[Core] Remove host GPU sync in `merge_multimodal_embeddings`
#19578 commented on
Jun 16, 2025 • 0 new comments -
[Hardware][Intel GPU] Add v1 Intel GPU support with Flash attention backend.
#19560 commented on
Jun 19, 2025 • 0 new comments -
[Core] Rationalize boolean environment variable handling
#19550 commented on
Jun 19, 2025 • 0 new comments -
[Benchmark] fix request loss if "ping" is returned
#19535 commented on
Jun 20, 2025 • 0 new comments -
[Bugfix] Register reducer even if transformers_modules not available
#19510 commented on
Jun 16, 2025 • 0 new comments -
deps: Update torch and deps to 2.7.1
#19507 commented on
Jun 16, 2025 • 0 new comments -
[Models] Improve iteration over layers
#19497 commented on
Jun 18, 2025 • 0 new comments -
090
#19488 commented on
Jun 18, 2025 • 0 new comments -
fix: Properly set engine_id when using multi connector in dynamo
#19487 commented on
Jun 20, 2025 • 0 new comments -
[Perf] Improve/Fix-regression for FA3 in High QPS regimes
#19463 commented on
Jun 20, 2025 • 0 new comments -
[Kernel] Integrate IBM/Applied-AI fused moe kernels
#19443 commented on
Jun 18, 2025 • 0 new comments -
[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client.
#19423 commented on
Jun 15, 2025 • 0 new comments -
Added FP8 support quantization support to DualChunkFlashAttentionBackend
#19420 commented on
Jun 18, 2025 • 0 new comments -
[ROCm][FEAT] Integrate AITER gemm w8a8 ptpc
#19417 commented on
Jun 19, 2025 • 0 new comments -
[Frontend] speed up import time of vllm.reasoning
#18236 commented on
Jun 19, 2025 • 0 new comments -
[V1] feat:add engine v1 tracing
#18069 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: reasoning_tokens in Chat Completion Response usage
#18067 commented on
Jun 19, 2025 • 0 new comments -
[CI/Build] Allow hermetic builds
#18064 commented on
Jun 18, 2025 • 0 new comments -
[kernel] integrate permute/unpermute kernel into deepgemm moe
#17934 commented on
Jun 17, 2025 • 0 new comments -
[Core] Parallel multi-modal processor
#17831 commented on
Jun 19, 2025 • 0 new comments -
Update registry.py
#17762 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Bf16 data type support for awq quantization
#17705 commented on
Jun 20, 2025 • 0 new comments -
[Misc] Refactor VLM common generation tests to support audio inputs and mix-modality tests
#17633 commented on
Jun 19, 2025 • 0 new comments -
[PERF] Speed up of prepare_inputs / mrope
#17617 commented on
Jun 19, 2025 • 0 new comments -
[Security] Document StatelessProcessGroup security concerns
#17591 commented on
Jun 14, 2025 • 0 new comments -
[Bugfix][ROCm] Fix incorrect casting in GPTQ GEMM kernel
#17583 commented on
Jun 18, 2025 • 0 new comments -
[Bugfix][Model] vllm-v0 engine run eagle algo with qwen2.5 model, KeyError: 'norm.weight' bugfix
#17518 commented on
Jun 19, 2025 • 0 new comments -
enable multiple platform device in DP init
#17368 commented on
Jun 20, 2025 • 0 new comments -
[Experiment] Parallel multi-modal processor
#17361 commented on
Jun 19, 2025 • 0 new comments -
[NVIDIA] Support Cutlass w8a8 for Blackwell Geforce GPUs (sm120)
#17280 commented on
Jun 19, 2025 • 0 new comments -
[Frontend] Expand tools even if tool_choice="none"
#17177 commented on
Jun 19, 2025 • 0 new comments -
Create E=128,N=768,device_name=NVIDIA_A100-PCIE-40GB.json
#19049 commented on
Jun 19, 2025 • 0 new comments -
[Bugfix] Improve JSON extraction in LlamaToolParser
#19024 commented on
Jun 16, 2025 • 0 new comments -
[DRAFT] Self-Speculative Decoding using LayerSkip
#18994 commented on
Jun 18, 2025 • 0 new comments -
[Kernel] Fix fp8 support for pplx and BatchedTritonExperts.
#18864 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Porting triton_kernels for FusedMoE
#18595 commented on
Jun 19, 2025 • 0 new comments -
[Model][Speculative Decoding] Integrate PARD into vLLM
#18541 commented on
Jun 20, 2025 • 0 new comments -
Remove Vision FA warning
#18522 commented on
Jun 19, 2025 • 0 new comments -
Add reorder_batch to TPU V1
#18515 commented on
Jun 19, 2025 • 0 new comments -
[V1] Support `LLM.apply_model`
#18465 commented on
Jun 18, 2025 • 0 new comments -
[WIP] Two batch overlap
#18415 commented on
Jun 19, 2025 • 0 new comments -
[V1] [Spec decode] Llama4 type eagle support in v1
#18369 commented on
Jun 15, 2025 • 0 new comments -
[Misc] add xgrammar for arm64
#18359 commented on
Jun 16, 2025 • 0 new comments -
[Feature] Expert Parallelism Load Balancer (EPLB)
#18343 commented on
Jun 20, 2025 • 0 new comments -
[Don't merge] Debug failing quantization test with input batch move
#18298 commented on
Jun 19, 2025 • 0 new comments -
[P/D] Support CPU Transfer in NixlConnector
#18293 commented on
Jun 19, 2025 • 0 new comments -
[Kernel] Add EP support for cutlass_moe_fp4
#18281 commented on
Jun 14, 2025 • 0 new comments -
[Model] support dots1
#18254 commented on
Jun 19, 2025 • 0 new comments -
[Feature]: Disaggregated Prefill on multi-node & multi-gpu
#13004 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: ValueError: could not broadcast input array from shape (513,) into shape (512,)
#8432 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: stuck at "generating GPU P2P access cache in /home/luban/.cache/vllm/gpu_p2p_access_cache_for_0,1.json"
#8735 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support for priority preemption with chunked-prefill
#10101 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: mlx-community/DeepSeek-R1-4bit exception:OSError: /data/coding/model-671b-MS/dir does not appear to have a file named configuration_deepseek.py;
#13283 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: terminate called after throwing an instance of 'std::system_error' what(): Operation not permitted
#14416 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Weird output when server with high load
#14491 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: 0.74 dev, the error occurred in the gptq_marlin_gemm function call
#14887 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: asyncio.exceptions.CancelledError and engine_client.dead_error
#14994 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: How to use vllm in parallel
#14997 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Failed to Run Qwen2.5-7B with RTX 3070 & CPU Offload (14GB) Despite Sufficient Theoretical Memory
#15004 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: PallasAttentionBackendImpl.__init__() got an unexpected keyword argument 'q_lora_rank'
#15026 commented on
Jun 17, 2025 • 0 new comments -
[Performance]: Speculative Decoder Optimization for Large-Batch Inference Overhead
#15029 commented on
Jun 17, 2025 • 0 new comments -
Precision loss occurs when using the MoE sum kernel.
#15045 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: ValueError: Model architectures ['LlamaForCausalLM'] failed to be inspected
#15058 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Bad requests are not captured as traces
#17528 commented on
Jun 17, 2025 • 0 new comments -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: TRACKING ISSUE: CUDA OOM with Logprobs
#5907 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: LLMEngine.add_request can't handle erroneous type of request_id
#19588 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Optimize parallel sampling by batching add_request calls to avoid split scheduling latency
#16373 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Qwen2.5-VL-32B, Following weights were not initialized from checkpoint
#15536 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError
#15127 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: multi-round QA when using Qwen2.5-VL with the same input image
#15132 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Configurable metrics export format - Prometheus, OpenTelemetry
#15141 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Error running ShieldGemma: 'guideline' is undefined
#15147 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Can't see NCCL profiling data in nsight sys for expert parallel
#15168 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Failed to initialize the TMA descriptor 700 use Qwen2.5 72B on H200
#15175 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: ModuleNotFoundError: No module named 'vllm.vllm_flash_attn.layers' vllm@0.9.0.1
#19131 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Failed to run model Qwen3-30B-A3B on DGX V100x4
#17392 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Deprecating vLLM V0
#18571 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Consider parallel_tool_calls parameter at the API level
#9451 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: `undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE` when running `0.7.3.dev57+g2ae88905.precompiled` on A100
#13047 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Add Support for thinking_budget for Qwen3 Models
#17887 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support Gemma 3 QAT series
#16856 commented on
Jun 17, 2025 • 0 new comments -
[Feature]: Support for RTX 5090 (CUDA 12.8)
#13306 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: How to let Whisper return timestamps in transcript?
#19556 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: Why is speculative decoding slower than normal decoding?
#8439 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: Engine stuck with requests blocked; running/waiting request count and KV cache usage remain constant.
#18431 commented on
Jun 17, 2025 • 0 new comments -
[Doc]: Newest documentation for engine arguments is significantly worse than v0.8.5 and prior
#18707 commented on
Jun 17, 2025 • 0 new comments -
[Bug][Regression]: Dimension out of range when using MooncakeStoreConnector
#18834 commented on
Jun 17, 2025 • 0 new comments -
[Usage]: Full cuda graph for vllm v1
#19607 commented on
Jun 17, 2025 • 0 new comments -
[Bug]: error: Segmentation fault(SIGSEGV received at time)
#6918 commented on
Jun 17, 2025 • 0 new comments -
[RFC]: Graceful Error Handling for KV Connector Load Failures
#19329 commented on
Jun 15, 2025 • 0 new comments -
[Bug]: "NCCL WARN Error while attaching to shared memory segment /dev/shm/nccl- (size 192814336), error: No such file or directory (2)"
#18831 commented on
Jun 15, 2025 • 0 new comments -
[Feature]: Microbatch Tokenization
#19012 commented on
Jun 15, 2025 • 0 new comments -
[Bug]: Help, RuntimeError: CUDA error: no kernel image is available for execution on the device
#18835 commented on
Jun 15, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q2 2025
#15735 commented on
Jun 15, 2025 • 0 new comments -
[Feature]: Custom attention masks
#5228 commented on
Jun 15, 2025 • 0 new comments -
[Usage]: DeepSeek R1 input tokens cannot exceed 32k and how to correctly use FlashMLA
#14882 commented on
Jun 15, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: invalid argument
#14885 commented on
Jun 15, 2025 • 0 new comments -
[RFC]: Response format extensions for structured outputs
#19097 commented on
Jun 14, 2025 • 0 new comments -
[Feature]: v1 with kv-cache fp8
#16165 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Qwen3 Enable Reasoning breaks Tool Call Parsing
#19513 commented on
Jun 14, 2025 • 0 new comments -
[Usage]: [vLLM V1] `decoded_token` returns "Ċ" instead of "\n" in Qwen2.5-Math-7B-Instruct
#19595 commented on
Jun 14, 2025 • 0 new comments -
[New Model]: ByteDance-Seed/BAGEL-7B-MoT
#18793 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: gemma3 shows degraded accuracy in vLLM v0.8.4
#17689 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Qwen2VL-2b / Qwen2.5-7b has AssertionError and Cuda error when qps goes higher
#17171 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Running ROCm support v1 vLLM Arch triggers ERROR_MEMORY_APERTURE_VIOLATION
#13674 commented on
Jun 14, 2025 • 0 new comments -
[RFC]: Configurable multi-modal data for profiling
#14438 commented on
Jun 14, 2025 • 0 new comments -
[Bug]: Short prompts -> !!!!!!! output from Qwen2.5-32B-Instruct-GPTQ-Int4 w/ROCm
#14715 commented on
Jun 14, 2025 • 0 new comments -
[Usage]: What should I do if I want to skip the prefill of a new request?
#14863 commented on
Jun 14, 2025 • 0 new comments -
[Feature]: Will vllm support sequence parallelism?
#19519 commented on
Jun 14, 2025 • 0 new comments -
[RFC]: Introduce a Triton-only Transformer Execution Path in vLLM
#13319 commented on
Jun 13, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device with nvidia v100
#19185 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: AttributeError: 'Llama_Nemotron_Nano_VL_Config' object has no attribute 'hidden_size'. Did you mean: 'vit_hidden_size'?
#19360 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vllm crashes after when using quantized model on CPU with error "torch not compiled with CUDA enabled"
#18198 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Hermes tool parser stream output error in Qwen3 case
#19056 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: Add support for multi-lora and single lora for classification tasks
#19623 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
#8840 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.
#10151 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: (Willing to PR) Avoid KV cache occupying GPU memory when not used
#11408 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vLLM server gets stuck and hangs (V100-32G)
#13753 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: vllm server hang when running DeepSeek R1
#13778 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: Help me! I can't run DeepSeek-R1 with the latest docker image on my server
#14039 commented on
Jun 16, 2025 • 0 new comments -
[Bug][Ray]: Pipeline parallelism fails on the same host
#14093 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Error occurred when compiling FlashMLA/csrc/flash_api.cpp
#14250 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: EAGLE / MTP Doesn't Overwrite Approximated Hidden States / KV Cache, 8%- 15% Acceptance Length Degradation
#14649 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: UserWarning on skipping serialisation of PostGradPassManager
#14911 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: CPU utilization
#14931 commented on
Jun 16, 2025 • 0 new comments -
[Feature]: a new attention adaptation
#14940 commented on
Jun 16, 2025 • 0 new comments -
[Performance]: Speculative Decoding vs. Standard Inference
#14941 commented on
Jun 16, 2025 • 0 new comments -
[Usage]:
#14944 commented on
Jun 16, 2025 • 0 new comments -
[Usage]: Torch 2.5.1 with latest main branch
#14973 commented on
Jun 16, 2025 • 0 new comments -
[Bug]: Unable to get vLLM working with RTX 5090
#18995 commented on
Jun 15, 2025 • 0 new comments -
[Usage]: How to use DeepSeek-R1-0528-Qwen3-8B with function call
#19001 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: InternVL3 image dynamic preprocess issue
#19585 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: OOM on wake up (72B model on 8×A800 40G)
#13941 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Possible memory leak with Whisper in vLLM 0.8.4?
#16966 commented on
Jun 19, 2025 • 0 new comments -
[Installation]: undefined symbol: _ZNK3c1011StorageImpl27throw_data_ptr_access_errorEv
#15010 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: "CUDA out of memory" error occurs
#15182 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Extra Characters in `content` When Using `enable_reasoning` with `stop` Parameter
#15188 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Is the logic order correct during the scheduler procedure?
#16982 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device
#5547 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: prefix-caching: inconsistent completions
#5543 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: with `--enable-prefix-caching` , `/completions` crashes server with `echo=True` above certain prompt length
#5344 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: [V1][Spec Dec] EAGLE TP > 1 leads to errors when using --enforce_eager
#17513 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: Image Fails to Initialize (Undetected Platform) because of LD_LIBRARY_PATH, PATH environment error with vllm >= 0.9.0
#19184 commented on
Jun 19, 2025 • 0 new comments -
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
#13216 commented on
Jun 19, 2025 • 0 new comments -
[New Model]: CSM 1b
#18005 commented on
Jun 19, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Logits processor extensibility
#17799 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: RuntimeError on RTX 5090: "no kernel image is available for execution on the device"
#16901 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM does not serve text-only version of Llama4
#18022 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Llama4 LoRA support
#16894 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Cannot use FA version 2; FA3 is only supported on devices with compute capability >= 8.0, excluding 8.6 and 8.9
#13766 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: Distributed inference not supported with OpenVINO?
#14933 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: only 0.4 tokens/s when running 2 or more requests
#15018 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Capture CudaGraph with LoRA
#15090 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: flash_attn_with_kvcache kernel, an illegal memory access
#15113 commented on
Jun 20, 2025 • 0 new comments -
[RFC]: layer-wise kv cache offloading to enable larger batches
#15123 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: online batch inference faster than offline batch inference
#15178 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: VLLM 0.7.3 with tensor parallelism outputs only exclamation marks when using multiple GPUs
#15194 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Does vLLM support dialog prefix continuation?
#15198 commented on
Jun 20, 2025 • 0 new comments -
[Misc][Help]: Adding support for a Custom model with External MoE Routing
#15214 commented on
Jun 20, 2025 • 0 new comments -
[Usage]: How to properly use vLLM when serving - KeyError: 'text'
#15219 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: `VLLM_USE_V1` + `TORCH_SDPA` regression in v0.8
#15251 commented on
Jun 20, 2025 • 0 new comments -
[Performance]: V0 and V1 give the same throughput number
#15253 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: --tensor-parallel-size Error
#15255 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: Inconsistent Output Based on Presence of chat_template Parameter
#15272 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: vLLM declares itself healthy before it can serve requests
#15313 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: looking into adding a generation algorithm
#15315 commented on
Jun 20, 2025 • 0 new comments -
[Bug]: RTX 5060 Ti apply_w8a8_block_fp8_linear
#19596 commented on
Jun 20, 2025 • 0 new comments -
[Feature]: Qwen 3 MoE Lora adapter support.
#18120 commented on
Jun 19, 2025 • 0 new comments -
[Bug]: RuntimeError: The size of tensor a (1059) must match the size of tensor b (376) at non-singleton dimension, DeepSeek R1 H20x16 pp2, v1 engine
#15332 commented on
Jun 19, 2025 • 0 new comments -
[Feature]: Phi-4 tool support
#11985 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: V100 may not support enable-prefix-caching
#13738 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Accuracy is inconsistent between multi-GPU and single-GPU runs
#13801 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Support DeepEP
#13804 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Using cpu_offload_gb with GGUF fails
#14096 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Better error message for `Invalid attention backend for cuda` with `TORCH_SDPA`
#14320 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Multi GPU inference using two RTX 5090s(TP=2)
#14628 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: EAGLE / DeepSeek MTP Handles First Input Token Incorrectly - 25% Acceptance Rate Drop
#14647 commented on
Jun 18, 2025 • 0 new comments -
[Bug] [ROCm]: RuntimeError: Calling `torch.linalg.cholesky` on a CUDA tensor requires compiling PyTorch with MAGMA. Please use PyTorch built with MAGMA support.
#14914 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: CPU inference won't work for DeepSeek-R1
#15044 commented on
Jun 18, 2025 • 0 new comments -
[New Model]: StableLMAlphaForCausalLM
#15046 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: How to use FlashMLA for DeepSeek-V2
#15079 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Can vLLM run DeepSeek R1 inference natively in FP8 on H20 servers?
#15084 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Request to Include vllm["audio,video"] Package in v0.8.0 Docker Image
#15087 commented on
Jun 18, 2025 • 0 new comments -
[Misc]: Why not sort the waiting queue before popleft?
#15091 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: The difference between 0.7.3 and 0.8.0
#15092 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: `torch.compile` is turned on, but the model LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct does not support it.
#15093 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: 0.8.0(V1) crash on NCCL when load MoE model on 16 GPUs(H20)
#15098 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: How to use vLLM's added torch functions in a separate environment?
#15108 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Improve GPTQ implementation
#15116 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: BadRequestError(400) when using completions API with stream=true and echo=true
#15119 commented on
Jun 18, 2025 • 0 new comments -
[New Model]: support for model: jinaai/jina-reranker-m0
#18447 commented on
Jun 18, 2025 • 0 new comments -
[TPU] Supported models for multimodal multi-image inference on TPU?
#18463 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Gemma3 not supported on B200 w/ Flash-Infer
#19584 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Enhancing vLLM Plugin Architecture
#19161 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: V1 piecewise cudagraph capture size on ROCm is much higher than on cuda
#19579 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Stuck request and empty streaming for gemma3 serving with ^v0.8.5
#17658 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM sleep experiences segmentation fault when used in TRL
#16993 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Issue of Unstable Output for Identical Queries
#19403 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Fused moe config for NVIDIA RTX 6000 ADA
#17768 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Dual a6000 pros not working. Arch 120.
#19025 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Model fails to load in background thread in versions >0.8.5
#18816 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: Inconsistent outputs with deterministic sampling (temperature=0) when serving Qwen3-32B with vllm-0.8.5
#17759 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: N-gram speculative decoding performs slower than Qwen3-32B-FP8 with vLLM 0.9.0.1
#19254 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Blackwell Enablement for vLLM (SM100)
#18153 commented on
Jun 18, 2025 • 0 new comments -
[Feature]: Limit thinking tokens
#15418 commented on
Jun 18, 2025 • 0 new comments -
[RFC]: Schema for checking input shapes for multi-modal models
#14764 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: `size_k must divisible by BLOCK_SIZE_K` error when using tensor parallelism with AWQ-quantized MoE models
#17604 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: gpu-memory-utilization is not exact
#17269 commented on
Jun 18, 2025 • 0 new comments -
[Usage]: Can I get the model's loss directly?
#9750 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: vLLM CPU mode broken: Unable to get JIT kernel for brgemm
#10478 commented on
Jun 18, 2025 • 0 new comments -
[Bug]: ValueError: Model architectures ['LlamaForCausalLM'] failed to be inspected
#11715 commented on
Jun 18, 2025 • 0 new comments