Insights: vllm-project/vllm
Overview
164 Pull requests merged by 77 people
-
[Misc] Ensure out-of-tree quantization methods are recognized by CLI args
#14328 merged
Mar 9, 2025 -
[Hardware][TPU] Fix the recompiling issue in logits processor after warmup
#14510 merged
Mar 9, 2025 -
[Bugfix] Revert QKVCrossParallelLinear usage in Mllama to keep BNB quantization work
#14498 merged
Mar 9, 2025 -
[Bugfix] Fix tqdm progress bar when SamplingParams.n > 1
#12428 merged
Mar 9, 2025 -
[Feat] Support chunked prefill for LMCache connector
#14505 merged
Mar 9, 2025 -
[V1][TPU] Remove unnecessary padding for running on TPU.
#14467 merged
Mar 9, 2025 -
[Attention] Default to FlashMLA backend for MLA
#14451 merged
Mar 9, 2025 -
Revert "[V1][Core] Fix memory issue with logits & sampling"
#14504 merged
Mar 9, 2025 -
[V1] Support bad_words in sampler
#13376 merged
Mar 8, 2025 -
[Misc] Upgrade to Python 3.9 typing for additional directories
#14492 merged
Mar 8, 2025 -
Update CODEOWNERS for structured output
#14496 merged
Mar 8, 2025 -
[Bugfix] Fix profiling OOM and decouple encoder multimodal profiling
#14361 merged
Mar 8, 2025 -
[Bugfix] DeepSeek Accuracy
#14476 merged
Mar 8, 2025 -
Move requirements into their own directory
#12547 merged
Mar 8, 2025 -
[Misc] Don't run ruff at all on 3rd party libs
#14493 merged
Mar 8, 2025 -
[benchmarks] Add option to use unique jsonschema for each request
#14457 merged
Mar 8, 2025 -
[V1][Core] Fix memory issue with logits & sampling
#13776 merged
Mar 8, 2025 -
[Misc] add `use_tqdm_on_load` to reduce logs
#14407 merged
Mar 8, 2025 -
[VLM] Add TP support for Phi-4-MM
#14453 merged
Mar 8, 2025 -
[V1] TPU - Add tensor parallel support via Ray
#13618 merged
Mar 8, 2025 -
[CI/Build] Use a fixed seed to avoid flaky tests
#14480 merged
Mar 8, 2025 -
Add RLHF document
#14482 merged
Mar 8, 2025 -
[Build/BugFix] Fix hopper 12.8 build
#14354 merged
Mar 8, 2025 -
Add training doc signposting to TRL
#14439 merged
Mar 8, 2025 -
[Bugfix] Make the device profiler include LoRA memory.
#14469 merged
Mar 8, 2025 -
[Doc] Added QwQ-32B to the supported models list in the reasoning out…
#14479 merged
Mar 8, 2025 -
[Doc]add doc for Qwen models tool calling
#14478 merged
Mar 8, 2025 -
Default to `generation_config` from model
#12622 merged
Mar 8, 2025 -
[CI/Build] refactor: set timezone of container to UTC
#12888 merged
Mar 8, 2025 -
[core] add `extra_args` to `SamplingParams`
#13300 merged
Mar 8, 2025 -
[MISC][V1] Register process killing handler only in the main thread
#14380 merged
Mar 8, 2025 -
Revert "[Perf] Reduce MLA CPU overheads in V1 (#14384)"
#14471 merged
Mar 8, 2025 -
[Bugfix][V1] Handle MLA in kv_cache_interface
#14462 merged
Mar 8, 2025 -
[V1] Prompt logprobs + APC compatibility; prompt logprobs reqs cannot fill APC
#13949 merged
Mar 8, 2025 -
[Bugfix] Fix torch_xla which can't handle None seed introduced in #14274
#14459 merged
Mar 7, 2025 -
[V1][Metrics] Fix traceback with preemptions+LoRA
#14220 merged
Mar 7, 2025 -
[V1] Eagerly remove finished requests from the batch
#14388 merged
Mar 7, 2025 -
[v1] torch.compile integration explanation
#14437 merged
Mar 7, 2025 -
[Misc] Add Phi4-MM example
#14343 merged
Mar 7, 2025 -
[Kernel] optimize performance of gptq marlin kernel when n is small
#14138 merged
Mar 7, 2025 -
[Benchmarks] Make detokenization optional in benchmark scripts
#11697 merged
Mar 7, 2025 -
[Doc] Update prefix_caching.md to match the example image
#14420 merged
Mar 7, 2025 -
[V1][Core] Support for Structured Outputs
#12388 merged
Mar 7, 2025 -
Use the optimized block sizes after tuning the kernel.
#14329 merged
Mar 7, 2025 -
Fix missing `kv_caches` and `attn_metadata` in `OpenVINOCausalLM`
#14271 merged
Mar 7, 2025 -
[BUGFIX] Skip tokenization support for throughput benchmark
#12712 merged
Mar 7, 2025 -
[Misc] Set default value of seed to None
#14274 merged
Mar 7, 2025 -
[Bugfix] Clean up multi-modal processors
#14417 merged
Mar 7, 2025 -
[Bugfix] Further clean up LoRA test
#14422 merged
Mar 7, 2025 -
correct wrong markdown syntax
#14414 merged
Mar 7, 2025 -
[GH] Auto-apply multi-modality label to relevant PRs
#14402 merged
Mar 7, 2025 -
OpenVINO: added CPU-like conditions
#14338 merged
Mar 7, 2025 -
[Build] Add nightly wheel fallback when latest commit wheel unavailable
#14358 merged
Mar 7, 2025 -
[Bugfix] Fix JambaForCausalLM LoRA
#14370 merged
Mar 7, 2025 -
[BugFix] Illegal Memory Access in the blockwise cutlass fp8 GEMMs
#14396 merged
Mar 7, 2025 -
[FP8] Refactor apply_fp8_linear and apply_fp8_linear_generic into an object
#14390 merged
Mar 7, 2025 -
[Perf] Reduce MLA CPU overheads in V1
#14384 merged
Mar 7, 2025 -
[Bugfix] Correctly call `cudaProfilerStop` in benchmarks script
#14183 merged
Mar 7, 2025 -
[Doc] Fix a typo
#14385 merged
Mar 7, 2025 -
[Hardware][TPU]Enable ragged paged attention kernel and resolve recompilation issue
#14310 merged
Mar 6, 2025 -
[Docs] Add nsight guide to profiling docs
#14298 merged
Mar 6, 2025 -
[V1][Bugfix] Standardize quantized kv cache rejection for attention backends
#14221 merged
Mar 6, 2025 -
[Bug] Fix Attention when ignored by quant_method
#14313 merged
Mar 6, 2025 -
[Bugfix] Fix use_direct_call condition in FusedMoE layer for
#14382 merged
Mar 6, 2025 -
[Kernel] Add needs_fixed_stride_order tag to most GEMMs
#14306 merged
Mar 6, 2025 -
[CI] Disable spawn when running V1 Test
#14345 merged
Mar 6, 2025 -
[CI/Build] Use uv python for docker rather than ppa:deadsnakes/ppa
#13569 merged
Mar 6, 2025 -
[Distributed] Add enable_expert_parallel arg
#14305 merged
Mar 6, 2025 -
[V1] Do not detokenize if sampling param detokenize is False
#14224 merged
Mar 6, 2025 -
Fix mla prefill context performance
#13897 merged
Mar 6, 2025 -
Add authors to license header.
#14371 merged
Mar 6, 2025 -
Adding cpu inference with VXE ISA for s390x architecture
#12613 merged
Mar 6, 2025 -
Reinstate `best_of` for V0
#14356 merged
Mar 6, 2025 -
[RLHF] use worker_extension_cls for compatibility with V0 and V1
#14185 merged
Mar 6, 2025 -
[Doc] Fix date typo in README.md
#14366 merged
Mar 6, 2025 -
[Core] Don't use cache during multi-modal profiling
#14336 merged
Mar 6, 2025 -
[Bugfix][Core] fix abort_seq_group and memory leak when n>1
#14326 merged
Mar 6, 2025 -
[Kernel] [V1] Improved performance for V1 Triton (ROCm) backend
#14152 merged
Mar 6, 2025 -
[Doc] Correct beam_search using in generative_models.md
#14363 merged
Mar 6, 2025 -
[Doc] Update reasoning with stream example to use OpenAI library
#14077 merged
Mar 6, 2025 -
[Frontend][Docs] Transcription API streaming
#13301 merged
Mar 6, 2025 -
[Core] Optimizing cross-attention `QKVParallelLinear` computation
#12325 merged
Mar 6, 2025 -
[VLM] Support Pixtral-HF on V1
#14275 merged
Mar 6, 2025 -
[Model] Update Paligemma multimodal processing with PromptUpdate
#14015 merged
Mar 6, 2025 -
[Hardware] Update the flash attn tag to support Blackwell
#14244 merged
Mar 6, 2025 -
[Bugfix][CI] ALiBi test case in xformers multi_query_kv_attention
#11301 merged
Mar 6, 2025 -
[V1] LoRA - Enable more V1 tests
#14315 merged
Mar 6, 2025 -
[Bugfix][Structured Output] Support outlines engine with reasoning outputs for DeepSeek R1
#14114 merged
Mar 6, 2025 -
[misc] Mention `ray list nodes` command to troubleshoot ray issues
#14318 merged
Mar 6, 2025 -
[BugFix] MLA + V1, illegal memory access and accuracy issues
#14253 merged
Mar 6, 2025 -
[Build] Add UV_HTTP_TIMEOUT to avoid timeout during installation
#13850 merged
Mar 6, 2025 -
Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM
#13917 merged
Mar 6, 2025 -
[CI/Build] Use spawn multiprocessing mode for V1 test pipeline
#14243 merged
Mar 6, 2025 -
[BugFix] Fix prefix caching V0 MLA
#14255 merged
Mar 6, 2025 -
[Bugfix] Remove num_tokens_across_dp
#14302 merged
Mar 5, 2025 -
[Bugfix] Fix DeepSeek MTP crash when using TP1ModelRunner with CUDA graph due to shape mismatch
#14237 merged
Mar 5, 2025 -
[V1][Easy] Add empty allowed_token_ids in the v1 sampler test
#14308 merged
Mar 5, 2025 -
[misc] Add FlashMLA as a new option of VLLM_ATTENTION_BACKEND env
#14267 merged
Mar 5, 2025 -
[V1][BugFix] Fix for mixed top_k batch
#14301 merged
Mar 5, 2025 -
Deprecate `best_of` Sampling Parameter in anticipation of vLLM V1 (see the sketch after this list)
#13997 merged
Mar 5, 2025 -
[V1][Minor] Remove obsolete FIXME comment
#14304 merged
Mar 5, 2025 -
[Docs] Add Meta Slides
#14297 merged
Mar 5, 2025 -
[Bugfix] Fix broken vision language example
#14292 merged
Mar 5, 2025 -
[Doc] Fixed typo in prefix_caching.md
#14293 merged
Mar 5, 2025 -
[Misc] Add Qwen2MoeForCausalLM moe tuning support
#14276 merged
Mar 5, 2025 -
[LoRA] Remove linear hack outside transformers backend
#14177 merged
Mar 5, 2025 -
[V1][Frontend] Add Testing For V1 Runtime Parameters
#14159 merged
Mar 5, 2025 -
Small update for external_launcher backend docs
#14288 merged
Mar 5, 2025 -
[Doc] [3/N] Refer code examples for common cases in dev multimodal processor
#14278 merged
Mar 5, 2025 -
[Doc] Update nginx guide: remove privileged from vllm container run and add target GPU ID
#14217 merged
Mar 5, 2025 -
[Bugfix][V1] Fix allowed_token_ids for v1 Sampler
#14169 merged
Mar 5, 2025 -
[Misc][V1] Avoid using `envs.VLLM_USE_V1` in mm processing
#14256 merged
Mar 5, 2025 -
[Frontend] Allow return_tokens_as_token_ids to be passed as a request param
#14066 merged
Mar 5, 2025 -
Temporarily disable test_awq_gemm_opcheck
#14251 merged
Mar 5, 2025 -
[platforms] improve rocm debugging info
#14257 merged
Mar 5, 2025 -
[V1] EP/TP MoE + DP Attention
#13931 merged
Mar 5, 2025 -
[Model] New model support for Phi-4-multimodal-instruct
#14119 merged
Mar 5, 2025 -
[V1][Bugfix] Do not reset prefix caching metrics
#14235 merged
Mar 5, 2025 -
[Bugfix] Fix gptq_marlin for deepseek-v3
#13750 merged
Mar 5, 2025 -
Disable GPTQ AllSpark kernels for CUDA Compiler < 12.0
#14157 merged
Mar 5, 2025 -
Moved numba from common requirements to cuda/rocm specific requirements
#14199 merged
Mar 5, 2025 -
[misc] announce china meetup
#14248 merged
Mar 5, 2025 -
[V1][TPU] TPU multimodal model support for ragged attention
#14158 merged
Mar 5, 2025 -
[ROCm] Disable a few more kernel tests that are broken on ROCm
#14145 merged
Mar 4, 2025 -
Clean up unused padding_idx variables across many model definitions
#13240 merged
Mar 4, 2025 -
Serialize using safetensors for KV caches
#14228 merged
Mar 4, 2025 -
[v1][Metrics] Add design doc
#12745 merged
Mar 4, 2025 -
[Docs] Update Dockerfile dependency image
#14215 merged
Mar 4, 2025 -
[Frontend] Do `prompt_logprobs` clamping for chat as well as completions
#14225 merged
Mar 4, 2025 -
Fix performance when `--generation-config` is not `None`
#14223 merged
Mar 4, 2025 -
[TPU][Profiler] Support start_profile/stop_profile in TPU worker
#13988 merged
Mar 4, 2025 -
add cutlass support for blackwell fp8 gemm
#13798 merged
Mar 4, 2025 -
[V1][Molmo] Fix get_multimodal_embeddings() in molmo.py
#14161 merged
Mar 4, 2025 -
[V0][Metrics] Deprecate some questionable request time metrics
#14135 merged
Mar 4, 2025 -
[V1][BugFix] Fix remaining sync engine client shutdown errors/hangs
#13869 merged
Mar 4, 2025 -
[Bugfix] Restrict MacOS CPU detection
#14210 merged
Mar 4, 2025 -
[doc] add "Failed to infer device type" to faq
#14200 merged
Mar 4, 2025 -
[sleep mode] error out with expandable_segments
#14189 merged
Mar 4, 2025 -
[platform] add debug logging during inferring the device type
#14195 merged
Mar 4, 2025 -
Fix benchmark_moe.py tuning for CUDA devices
#14164 merged
Mar 4, 2025 -
Use math.prod instead of np.prod for trivial ops
#14142 merged
Mar 4, 2025 -
[core] Pass all driver env vars to ray workers unless excluded
#14099 merged
Mar 4, 2025 -
[Misc] Remove lru_cache in NvmlCudaPlatform
#14156 merged
Mar 4, 2025 -
[core] moe fp8 block quant tuning support
#14068 merged
Mar 4, 2025 -
[Model] Add support for GraniteMoeShared models
#13313 merged
Mar 4, 2025 -
[v1] Add comments to the new ragged paged attention Pallas kernel
#14155 merged
Mar 3, 2025 -
[Docs] Add GPTQModel
#14056 merged
Mar 3, 2025 -
[Kernel] Optimize moe intermediate_cache usage
#13625 merged
Mar 3, 2025 -
[Bugfix] Allow shared_experts skip quantization for DeepSeekV2/V3
#14100 merged
Mar 3, 2025 -
[V1] Simplify stats logging
#14082 merged
Mar 3, 2025 -
[V0][Metrics] Deprecate some KV/prefix cache metrics
#14136 merged
Mar 3, 2025 -
[V0][Metrics] Remove unimplemented `vllm:tokens_total`
#14134 merged
Mar 3, 2025 -
Fix `head_dim` not existing in all model configs (Transformers backend)
#14141 merged
Mar 3, 2025 -
[ROCm] Faster Custom Paged Attention kernels
#12348 merged
Mar 3, 2025 -
Improve the docs for `TransformersModel`
#14147 merged
Mar 3, 2025 -
[V1] Refactor parallel sampling support
#13774 merged
Mar 3, 2025 -
[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPILED=1
#13921 merged
Mar 3, 2025 -
[Misc][Platform] Move use allgather to platform
#14010 merged
Mar 3, 2025 -
[Misc] typo find in deepseek_v2
#14106 merged
Mar 3, 2025 -
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation failure
#14051 merged
Mar 3, 2025 -
Update deprecated Python 3.8 typing
#13971 merged
Mar 3, 2025 -
[v0][structured output] Support reasoning output
#12955 merged
Mar 2, 2025
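
For context on the `best_of` items above (#13997 deprecating it ahead of V1, #14356 reinstating it for V0), here is a minimal sketch of the V0-style usage being deprecated. The model name is only a placeholder, and the exact deprecation behavior depends on the vLLM version in use.

```python
from vllm import LLM, SamplingParams

# V0-era usage: sample best_of=4 candidate completions per prompt and keep
# the n=1 sequence with the highest cumulative logprob. In V1 this knob is
# deprecated (#13997); #14356 reinstates it for the V0 engine only.
llm = LLM(model="facebook/opt-125m")  # placeholder model for illustration

params = SamplingParams(n=1, best_of=4, temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```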
95 Pull requests opened by 81 people
-
[V1] Implement sliding window attention in kv_cache_manager
#14097 opened
Mar 2, 2025 -
[v1] Remove bind_kv_cache and self.kv_cache in model runner
#14098 opened
Mar 2, 2025 -
[Misc] Use model_overwrite to redirect the model name to a local folder.
#14116 opened
Mar 3, 2025 -
[Distributed] Add custom allreduce support for ROCM
#14125 opened
Mar 3, 2025 -
[Feature] Vllm int8 quantization enablement for ARM CPUs
#14129 opened
Mar 3, 2025 -
[Misc] Add log information for handle_process_request.
#14130 opened
Mar 3, 2025 -
feat: add DeepGEMM for fp8 dense models
#14140 opened
Mar 3, 2025 -
[WIP] FLUTE Integration
#14146 opened
Mar 3, 2025 -
[V1][Metrics] Add additional metrics to V1
#14148 opened
Mar 3, 2025 -
[Feature]: Pin vLLM process to the right NUMA Region
#14149 opened
Mar 3, 2025 -
[Frontend] Relax 'method' field constraint in BatchRequestInput
#14153 opened
Mar 3, 2025 -
[bug fix]: benchmark enabling torch profiler in openai chat backend
#14162 opened
Mar 4, 2025 -
[Bugfix] Remove unnecessary call of `make_rand_sparse_tensors` in `make_n_rand_sparse_tensors`
#14165 opened
Mar 4, 2025 -
[DRAFT] Try to bump torch version
#14171 opened
Mar 4, 2025 -
Add CUDA kernel for per_token_group_quant_fp8
#14175 opened
Mar 4, 2025 -
Fix WorkerWrapperBase initialization: defer vllm_config setup
#14179 opened
Mar 4, 2025 -
Deepseek MTP for V1
#14182 opened
Mar 4, 2025 -
[misc] Update blog link in README
#14194 opened
Mar 4, 2025 -
docs: Add documentation for s390x cpu implementation
#14198 opened
Mar 4, 2025 -
[Model] Add Reasoning Parser for Granite Models
#14202 opened
Mar 4, 2025 -
[Kernel] MoE tuning, quickly skip slow config
#14207 opened
Mar 4, 2025 -
[Misc] Update `compressed-tensors` WNA16 to support zero-points
#14211 opened
Mar 4, 2025 -
[V1][PP] Support PP for MultiprocExecutor
#14219 opened
Mar 4, 2025 -
[V1][TPU] Support V1 Sampler for ragged attention
#14227 opened
Mar 4, 2025 -
[CI] Make UT cases in test_comm_ops.py compatible with more devices
#14229 opened
Mar 4, 2025 -
Use getattr for hidden_act and hidden_activation in Gemma models
#14230 opened
Mar 4, 2025 -
Torchao
#14231 opened
Mar 4, 2025 -
[Hardware][TPU][V1] Multi-LoRA implementation for the V1 TPU backend
#14238 opened
Mar 4, 2025 -
[V1] Aggregate prompt logprobs in `EngineCore`
#14240 opened
Mar 4, 2025 -
[V1] Enable Long Context LoRA tests for V1
#14241 opened
Mar 4, 2025 -
[do not merge][core] Workarounds for V1 for spyre plugin
#14242 opened
Mar 4, 2025 -
dynamic dispatch of fp8 kernels
#14245 opened
Mar 5, 2025 -
[ROCm] Tweak the benchmark script to run on ROCm
#14252 opened
Mar 5, 2025 -
[TPU][V1] Capture multimodal encoder during model compilation
#14254 opened
Mar 5, 2025 -
[WIP][Attention] FlashAttn MLA
#14258 opened
Mar 5, 2025 -
[utils] Update DNS server IPs in get_ip function to avoid rate limiting
#14262 opened
Mar 5, 2025 -
example: manipulate cache
#14265 opened
Mar 5, 2025 -
[BugFix] Add explicit default ctor for RankData so it can be built with clang
#14268 opened
Mar 5, 2025 -
[Doc] Create tool_chat_template_llama3.3_json.jinja
#14269 opened
Mar 5, 2025 -
[Kernel] Add triton.autotune to address the high latency overhead of punica kernels
#14272 opened
Mar 5, 2025 -
[Doc] Fix env path name in `uv` Python env setup instructions
#14273 opened
Mar 5, 2025 -
[MISC] rename interval to max_recent_requests
#14285 opened
Mar 5, 2025 -
[MISC] Refine no available block debug msg
#14287 opened
Mar 5, 2025 -
[Model] add colqwen2_vl code & inference
#14291 opened
Mar 5, 2025 -
[Build] Cython compilation support fix
#14296 opened
Mar 5, 2025 -
[V1] TPU - Remove self.kv_caches
#14309 opened
Mar 5, 2025 -
[Core] Expose API endpoint `/is_sleeping`
#14312 opened
Mar 5, 2025 -
[ROCm] Enable chunked prefill/paged attention in MLA on ROCm
#14316 opened
Mar 5, 2025 -
[Model] Add PLaMo2
#14323 opened
Mar 6, 2025 -
fix minor miscalled method
#14327 opened
Mar 6, 2025 -
[Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros
#14347 opened
Mar 6, 2025 -
[Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionRequest`
#14352 opened
Mar 6, 2025 -
Update Dockerfile, typo
#14362 opened
Mar 6, 2025 -
[Misc] Benchmarks: Fix guided decoding, token sampling, and request sorting
#14368 opened
Mar 6, 2025 -
[Misc] Fix test_sleep to use query parameters
#14373 opened
Mar 6, 2025 -
feat:Optimize qwen2-vl to reduce cudaMemcpyAsync
#14377 opened
Mar 6, 2025 -
[MISC][V1] Handle exception of current_platform.get_device_name() in arg_utils
#14379 opened
Mar 6, 2025 -
[wip] scaled_mm_bw_sm100
#14383 opened
Mar 6, 2025 -
[Core] Add DoRA Support
#14389 opened
Mar 7, 2025 -
[neuron] add reshape_and_cache
#14391 opened
Mar 7, 2025 -
A different take
#14393 opened
Mar 7, 2025 -
[Kernel] Update cutlass FP8 blockwise to use upstream CUTLASS
#14395 opened
Mar 7, 2025 -
[Bugfix] Make the fused_moe code compatible with non-triton supported hardware
#14400 opened
Mar 7, 2025 -
Clean up Engine Args & Documentation
#14409 opened
Mar 7, 2025 -
[rlhf] support named placement group
#14410 opened
Mar 7, 2025 -
[Misc] Add get_stream_cls() method for Platform class
#14411 opened
Mar 7, 2025 -
[Bugfix][V1] Exclude HBM used by other processes when calculating peak memory during profile runs
#14419 opened
Mar 7, 2025 -
[Bugfix] Fix When choice the specified tool call, it returns a ToolCa…
#14427 opened
Mar 7, 2025 -
[Refactor][Reasoning] Keep all logic about reasoning into one class
#14428 opened
Mar 7, 2025 -
[Bugfix][Kernel]: Fix AllSpark kernel compilation errors and enable for CUDA < 12.0
#14430 opened
Mar 7, 2025 -
[Kernel] [V1] Further optimizations to ROCm (Triton) Backend to better handle GQA.
#14431 opened
Mar 7, 2025 -
[Usage] Refactor speculative decoding configuration and tests
#14434 opened
Mar 7, 2025 -
[BUGFIX] fix the need_recv method of model_runner
#14436 opened
Mar 7, 2025 -
[Feature]: PD separation supports prefix caching #12257
#14440 opened
Mar 7, 2025 -
[WIP] Support models with mrope (Qwen2VL) on TPU
#14442 opened
Mar 7, 2025 -
[Core][LoRA] Add ignore layer for LoRA
#14445 opened
Mar 7, 2025 -
[WIP][Kernel] moe wna16 marlin kernel
#14447 opened
Mar 7, 2025 -
[ROCm] Fix kernel cache miss in Triton FA
#14448 opened
Mar 7, 2025 -
[ROCm][Kernel] MoE weights padding
#14454 opened
Mar 7, 2025 -
[INTEL-HPU] Deepseek R1 model enabling for Intel Gaudi
#14455 opened
Mar 7, 2025 -
Fix EAGLE output norm bug
#14464 opened
Mar 7, 2025 -
[core][V1] pluggable scheduler
#14466 opened
Mar 7, 2025 -
Fix GuidedDecodingParams backend_name issue
#14473 opened
Mar 8, 2025 -
[Frontend] Pythonic tool names flexibility (#14470)
#14474 opened
Mar 8, 2025 -
LLama 3.2 11b lm eval accuracy drop fix
#14477 opened
Mar 8, 2025 -
[not ready for review] introduce some profiling in the benchmark
#14481 opened
Mar 8, 2025 -
[Misc] Unify formatter and linter to use ruff
#14485 opened
Mar 8, 2025 -
[Frontend] Fix typo in tool chat templates for llama3.2 and toolace
#14501 opened
Mar 8, 2025 -
[V1][Core] Fix memory issue with logits & sampling
#14508 opened
Mar 9, 2025 -
[Misc] QoL: add speculative_model to SpeculativeConfig
#14509 opened
Mar 9, 2025 -
[Frontend] Support both tool calling and reasoning parser for reasoni…
#14511 opened
Mar 9, 2025 -
[BugFix][V1] Fix parallel sampling finishing/aborts
#14512 opened
Mar 9, 2025 -
[DO NOT MERGE] Varun/fix memory
#14514 opened
Mar 9, 2025 -
[Misc] Replace os environ to monkeypatch in test suite
#14516 opened
Mar 9, 2025
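
The last item above (#14516, mirrored by issue #14499 in the opened-issues list below) moves the test suite from direct `os.environ` writes to pytest's `monkeypatch` fixture. A minimal sketch of the pattern; the test name is made up, and the environment variable names are only illustrative.

```python
import os

# Hypothetical test showing the pattern #14516 applies across the suite:
# instead of mutating os.environ directly (which leaks state into other
# tests), use pytest's built-in monkeypatch fixture, which restores the
# environment automatically when the test finishes.
def test_v1_flag_enabled(monkeypatch):
    monkeypatch.setenv("VLLM_USE_V1", "1")                   # set for this test only
    monkeypatch.delenv("VLLM_LOGGING_LEVEL", raising=False)  # remove if present
    assert os.environ["VLLM_USE_V1"] == "1"
```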
89 Issues closed by 36 people
-
[Bug]: Mismatch of tqdm when n > 1
#10949 closed
Mar 9, 2025 -
[Bug]: tqdm progress bar seems to be wrong.
#11519 closed
Mar 9, 2025 -
[Usage]: How to use BitsAndBytesConfig with vllm serve
#8813 closed
Mar 9, 2025 -
[Bug]: Server - `aqlm` fails with `--cpu-offload-gb`
#8873 closed
Mar 9, 2025 -
[Bug]: CRITICAL 11-05 12:03:03 launcher.py:99] MQLLMEngine is already dead, terminating server process
#10024 closed
Mar 9, 2025 -
[Feature]: When generating multiple answers of the same prompt?
#10099 closed
Mar 9, 2025 -
[Bug]: Kernel crash while loading the 14B models on GPU L4x4
#14132 closed
Mar 8, 2025 -
[Bug]: remove_oldest LRU lora may remove lora which is still in usage
#14497 closed
Mar 8, 2025 -
[Bug][V1]: Qwen2-VL-7B OOM when loading the model in v0 but not in v1
#14184 closed
Mar 8, 2025 -
[Feature]: num of LoRAs requested by the batch is larger than num lora slots
#14495 closed
Mar 8, 2025 -
[Bug]: Corrupted responses for Llama-3.2-3B-Instruct with v0.6.6.post1
#12096 closed
Mar 8, 2025 -
[Performance]: LoRA is not taken into account when determining the number of KV cache blocks
#14450 closed
Mar 8, 2025 -
[RFC]: Prompt logprobs + APC compatibility
#13414 closed
Mar 8, 2025 -
[Usage]: Multiple rounds of image dialogue support?
#11006 closed
Mar 7, 2025 -
[Usage]: torch.OutOfMemoryError: CUDA out of memory.
#11560 closed
Mar 7, 2025 -
[Usage]: How to bypass multimodal processor logic when inputs are already processed
#14281 closed
Mar 7, 2025 -
[Usage]: Inquiry about AsyncLLMEngine's generate method and multi-modal input support
#10937 closed
Mar 7, 2025 -
[Doc]: How can I set the date_string for the chat templates
#14344 closed
Mar 7, 2025 -
[Bug]: The driver_worker gets stuck 100% of the time, when using Medusa with TP > 1
#9573 closed
Mar 7, 2025 -
[Bug]: Broken outputs for large contexts if `max_model_len` is fixed.
#9615 closed
Mar 7, 2025 -
[Bug]: I cannot able to load the model on TESLA T4 GPU in Full precision
#9990 closed
Mar 7, 2025 -
[New Model]: Support Tencent-Hunyuan-Large
#10043 closed
Mar 7, 2025 -
[Usage]: Model architectures ['LlavaNextForConditionalGeneration'] are not supported for now
#10065 closed
Mar 7, 2025 -
[Misc]: w8a8 model inference
#10068 closed
Mar 7, 2025 -
[Bug]: vllm multi-node deployment issue
#14186 closed
Mar 7, 2025 -
[Bug]: FP8 kvcache causes RuntimeError in v1 engine
#11329 closed
Mar 6, 2025 -
[Bug]: [V1] wrong output when using kv cache fp8
#13133 closed
Mar 6, 2025 -
[Feature]: Ovis2 VLM series
#14346 closed
Mar 6, 2025 -
[Usage]: How to use Multi-instance in Vllm? (Model replication on multiple GPUs)
#6155 closed
Mar 6, 2025 -
[Bug]: Memory leak due to LLMEngine.seq_id_to_seq_group
#14353 closed
Mar 6, 2025 -
[Bug]: Phi-4-multimodal-instruct audio tag seems wrong
#14342 closed
Mar 6, 2025 -
[Usage]: Running OpenAI Swarm with vLLM-hosted models
#11774 closed
Mar 6, 2025 -
[Bug]: Ray fails to register worker when running DeepSeek R1 model with vLLM and tensor parallelism
#13557 closed
Mar 6, 2025 -
[Bug][V1]: Kernel crashed when running qwen2.5_vl model
#14181 closed
Mar 6, 2025 -
[New Model]: QwQ-32B
#14321 closed
Mar 6, 2025 -
[Usage]: How do I use langchain for tool calls?
#9692 closed
Mar 6, 2025 -
[Bug]: DeepSeek R1 with outlines structured engine stops generation after `</think>`
#14113 closed
Mar 6, 2025 -
[Bug]: Failed to run example.py even if the pytorch framework has been compiled natively.
#14178 closed
Mar 6, 2025 -
[Feature]: need no_repeat_n_gram in SamplingParams
#7842 closed
Mar 6, 2025 -
[New Model]: We can able to run phi-3.5 vision instruct model but wanted to run in int4 quantization
#8463 closed
Mar 6, 2025 -
[Performance]: FP8 performance worse than FP16 for Qwen2-VL-2B-Instruct
#9992 closed
Mar 6, 2025 -
[Misc]: How to organize a large number of requests for invocation?
#10018 closed
Mar 6, 2025 -
[Performance]: latency of medusa is longer than naive inferece even the concurreny =2
#10031 closed
Mar 6, 2025 -
[Feature]: Slurm run_cluster.sh launcher instead of just Ray
#7933 closed
Mar 6, 2025 -
[RFC]: Deprecation of the `best_of` Sampling Parameter in vLLM V1
#13361 closed
Mar 5, 2025 -
[Usage]: Does benchmark of vllm support audio input to multi-modal ?
#14284 closed
Mar 5, 2025 -
[Bug]: 【Janus-Pro-7B】 KeyError: 'multi_modality'
#14247 closed
Mar 5, 2025 -
[Feature]: How to handle concurrent request in single instance of Qwen2-VL model.
#14226 closed
Mar 5, 2025 -
[Doc]: Typo in prefix_caching.md
#14294 closed
Mar 5, 2025 -
[Usage]: How to enforce think for deepseek-r1?
#14201 closed
Mar 5, 2025 -
[Feature]: enable prefix caching when MLA is enabled
#13720 closed
Mar 5, 2025 -
[Feature]: Only apply Guided/Structured grammar after reasoning steps in Reasoning models
#12619 closed
Mar 5, 2025 -
[New Model]: Phi-4 Multimodal Instruct
#13936 closed
Mar 5, 2025 -
[Feature]: LoRA support for InternVLChatModel
#9495 closed
Mar 5, 2025 -
[New Model]: SparseLLM/prosparse-llama-2-7b
#9916 closed
Mar 5, 2025 -
[Usage]: Inference delay
#9949 closed
Mar 5, 2025 -
[Performance]: Any up-to-date and convincing benchmark for chosing the fastest engine ?
#9975 closed
Mar 5, 2025 -
[Bug]: vLLM with ray backend and enable nsight can't get perf metrics due to connection issue
#7830 closed
Mar 4, 2025 -
[Feature]: Use math.prod instead of np.prod for trivial ops (see the sketch after this list)
#14144 closed
Mar 4, 2025 -
[Feature]: Avoid KV Cache and offload Model weights in RL workloads
#11638 closed
Mar 4, 2025 -
[Bug]: Running on a single machine with multiple GPUs error
#9875 closed
Mar 4, 2025 -
[Feature]: more exhaustive tracing
#9952 closed
Mar 4, 2025 -
[Installation]: build on arm64 meet a error
#9964 closed
Mar 4, 2025 -
[Feature]: frequency_penalties is missing in V1
#10696 closed
Mar 4, 2025 -
[Bug]: `TransformersModel` fails if model config does not have `head_dim` attr
#14139 closed
Mar 3, 2025 -
[Usage]: "POST /v1/audio/transcriptions HTTP/1.1" 404 Not Found
#14127 closed
Mar 3, 2025 -
[Usage]: Fail to create distributed inference serving with rocm/vllm
#14111 closed
Mar 3, 2025 -
[Feature]: API for evicting all KV cache from GPU memory (or `sleep mode`)
#10714 closed
Mar 3, 2025 -
[Bug]: prefix cache reuse
#9643 closed
Mar 3, 2025 -
[Feature]: Low GPU utilization and memory bandwidth utilization
#9953 closed
Mar 3, 2025 -
[Installation]: I was never able to install it, which cuda version is required?
#9960 closed
Mar 3, 2025 -
[Feature]: print config of vllm LLM instance and modify it afterwards
#9962 closed
Mar 3, 2025 -
[Feature]: Any plan run deepseek-r1 fp8 on Ampere gpu
#13885 closed
Mar 3, 2025 -
[Installation]: Can't find OpenMP headers on macOS
#14034 closed
Mar 3, 2025
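
To illustrate #14144 above (fixed by PR #14142 in the merged list): for small Python-level products such as shape arithmetic, the stdlib `math.prod` avoids a NumPy round-trip and returns a plain `int`. A self-contained comparison, independent of vLLM internals.

```python
import math
import numpy as np

shape = (4, 16, 128)

# Old style: np.prod returns a NumPy scalar and pulls in array machinery
# for a trivial reduction over a 3-element tuple.
n_np = np.prod(shape)

# New style (#14142): math.prod is stdlib, returns a plain Python int, and
# is faster for tiny inputs like tensor shapes.
n_math = math.prod(shape)

assert int(n_np) == n_math == 4 * 16 * 128
print(type(n_np), type(n_math))  # NumPy scalar vs. built-in int
```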
142 Issues opened by 136 people
-
[Usage]: Can't use the local model with LLM class.
#14518 opened
Mar 9, 2025 -
[Bug]: Cannot be run on multiple GPUs.
#14517 opened
Mar 9, 2025 -
[Usage]: How to improve concurrent processing capacity
#14513 opened
Mar 9, 2025 -
[Usage]: The example of using microsoft/Phi-4-multimodal-instruct audio
#14507 opened
Mar 9, 2025 -
[Bug]: RuntimeError: Phi4MM cannot process x audios and ximages in a prompt
#14506 opened
Mar 9, 2025 -
[Bug]: ValueError: Expected a torch.device with a specified index or an integer, but got:cuda
#14500 opened
Mar 8, 2025 -
[Feature]: Convert all `os.environ(xxx)` to `monkeypatch.setenv` in test suite
#14499 opened
Mar 8, 2025 -
[Bug]: Weird output when server with high load
#14491 opened
Mar 8, 2025 -
[Feature]: Apply tool calling after reasoning steps in Reasoning models.
#14490 opened
Mar 8, 2025 -
[Bug]: ModuleNotFoundError: No module named 'pyarrow' in main branch
#14487 opened
Mar 8, 2025 -
[Performance]: Eagle implementation is obviously not efficient
#14486 opened
Mar 8, 2025 -
[New Model]: Babel Model, Open Multilingual Large Language Models Serving Over 90% of Global Speakers
#14484 opened
Mar 8, 2025 -
[Bug]: trl's grpo-trainer with vllm not convergence
#14483 opened
Mar 8, 2025 -
[Feature]: Add reasoning token usage
#14472 opened
Mar 8, 2025 -
[Bug]: pythonic tool parser only accepts alphabetical tool names
#14470 opened
Mar 8, 2025 -
[Usage]: Model MllamaForConditionalGeneration does not support BitsAndBytes quantization yet.
#14458 opened
Mar 7, 2025 -
[Bug]: No Cuda GPUs are available when running vLLM on Ray (Qwen 2.5 VL AWQ)
#14456 opened
Mar 7, 2025 -
[Doc]: Steps to run vLLM on your RTX5080 or 5090!
#14452 opened
Mar 7, 2025 -
[Feature]: Run/Debug vllm in pycharm
#14444 opened
Mar 7, 2025 -
[Bug]: External Launcher producing NaN outputs on Large Models when Collocating with Model Training
#14443 opened
Mar 7, 2025 -
[Usage]: Question about Multimodal token ids on offloaded tokenization
#14441 opened
Mar 7, 2025 -
[RFC]: Configurable multi-modal data for profiling
#14438 opened
Mar 7, 2025 -
[Usage]: VLLM Inference - 2x slower with LoRA rank=256 vs none.
#14435 opened
Mar 7, 2025 -
[Bug]: Docker GPU image is unnecessarily fat due to two (mismatching) copies of CUDA runtime libraries
#14433 opened
Mar 7, 2025 -
[Usage]: Cuda out of memory while loading the quantized model
#14432 opened
Mar 7, 2025 -
[Feature]: support tool and reasoning together
#14429 opened
Mar 7, 2025 -
[Usage]: LLM.beam_search is much slower in vLLM 0.7.3 compared to 0.5.4
#14426 opened
Mar 7, 2025 -
[Feature]: Prefill. How to support 1M prompt tokens input?
#14425 opened
Mar 7, 2025 -
[Bug]: Unexpected content when selecting the choice tool
#14424 opened
Mar 7, 2025 -
[Bug]: size mismatch when loading MixtralForCausalLM GGUF model
#14423 opened
Mar 7, 2025 -
[Bug]: low quality of deepseek-vl2 when using vllm
#14421 opened
Mar 7, 2025 -
[Usage]: How to use the image datasets sharegpt4v provided in benchmark_serving?
#14418 opened
Mar 7, 2025 -
[Bug]: RuntimeError: No CUDA GPUs are available | when using TP>1 and using vllm v0.7
#14413 opened
Mar 7, 2025 -
[Performance]:
#14412 opened
Mar 7, 2025 -
[Bug]: pipeline-parallel not working properly with QwQ model on TPU v4
#14406 opened
Mar 7, 2025 -
[Misc, this is not a dev issue]: Congrats to vllm for having 888 developers!
#14405 opened
Mar 7, 2025 -
[Bug]: Error when Run Image Docker Vllm v0.7.3 - Unexpected error from cudaGetDeviceCount(). ....
#14403 opened
Mar 7, 2025 -
[Bug]: "transformers not installed" when using --guided-decoding-backend lm-format-enforcer
#14401 opened
Mar 7, 2025 -
[Bug]: stop_sequences is applied to both reasoning_content and content
#14399 opened
Mar 7, 2025 -
[Installation]:
#14398 opened
Mar 7, 2025 -
[Bug]: `triton_scaled_mm` never used on ROCm
#14397 opened
Mar 7, 2025 -
[Bug]: vllm-0.7.3. gptq-int3 model cannot run.
#14394 opened
Mar 7, 2025 -
[Bug]: Enable lora returns garbage output
#14392 opened
Mar 7, 2025 -
[Usage]: Clean up Engine Args & Documentation
#14386 opened
Mar 6, 2025 -
[Feature]: Q-Filters for KV Cache Compression
#14381 opened
Mar 6, 2025 -
[RFC]: Drop Support for OpenVINO
#14374 opened
Mar 6, 2025 -
[Bug]: VLLM process dies when trying to profile with nsight systemss
#14372 opened
Mar 6, 2025 -
[Bug]: API Connection Error after concurrent API calls
#14365 opened
Mar 6, 2025 -
[Usage]: STREAMING doesn't generate spaces
#14364 opened
Mar 6, 2025 -
[Performance]: Multimodal embeds input reduces service throughput
#14360 opened
Mar 6, 2025 -
[Bug]: phi-4-mini-instruct auto tool call doesnt have tool-call-parser
#14359 opened
Mar 6, 2025 -
[Bug]: ChatCompletionRequest rejects its own defaults
#14351 opened
Mar 6, 2025 -
[Bug]: vllm cannot connect to an external ray cluster
#14349 opened
Mar 6, 2025 -
[Feature]: How can I get embedding result from images? Can Qwen2.5-vl-7b do this?
#14348 opened
Mar 6, 2025 -
[Bug]: online-rl sampling is different from offline-sampling
#14341 opened
Mar 6, 2025 -
distributed inference multi-node communication bug
#14340 opened
Mar 6, 2025 -
[Feature]: Eagle support for multimodal models
#14337 opened
Mar 6, 2025 -
[Feature]: `reasoning_tokens` in Chat Completion Response `usage`
#14335 opened
Mar 6, 2025 -
[Bug]: GPU index assignment fails
#14334 opened
Mar 6, 2025 -
[Bug]: vLLM returning 415 status code at high load
#14333 opened
Mar 6, 2025 -
[Bug]: opentelemetry POC vLLM span cannot be concatenated with HTTP spans.
#14330 opened
Mar 6, 2025 -
[Usage]: How do I set the input image size when using qwen2-vl?
#14325 opened
Mar 6, 2025 -
[Bug]: 'DeepseekV2Model' object has no attribute 'config' when enabling P/D Disaggregation
#14324 opened
Mar 6, 2025 -
[Usage]: What is the default input construction of multimodel input?
#14322 opened
Mar 6, 2025 -
[Feature]: `Invalid attention backend for cuda` with `TORCH_SDPA` better error message
#14320 opened
Mar 6, 2025 -
[Doc]: Why is max block_size on CUDA 32?
#14319 opened
Mar 5, 2025 -
[Feature]: Expose a read-only API to check whether engine is sleeping
#14311 opened
Mar 5, 2025 -
Issue with Mistral Small and greek characters
#14307 opened
Mar 5, 2025 -
[Usage]: Logprobs Scaling with O(n) Complexity – Unexpected Performance Degradation
#14300 opened
Mar 5, 2025 -
[Installation]: Attempting to build and run vLLM for Intel Core Ultra 7 155H with ARC iGPU
#14295 opened
Mar 5, 2025 -
[New Model]: llava-onevision-qwen2-72b-ov-sft
#14290 opened
Mar 5, 2025 -
[Feature]: Chat inputs to AsyncLLMEngine
#14289 opened
Mar 5, 2025 -
[Bug][V1]: Loading Llama3.1-8B-INT8 gets OOM when using VLLM_USE_v1=1 but safe using v0
#14286 opened
Mar 5, 2025 -
[Doc]: dead link in source code comment
#14282 opened
Mar 5, 2025 -
[Bug]: V1 still sample n=1 when set n>1 in samplingparam
#14280 opened
Mar 5, 2025 -
[Misc]: running multiple vLLM instances on a single ray cluster
#14277 opened
Mar 5, 2025 -
[Performance]: About peak activation memory usage for quantized model
#14270 opened
Mar 5, 2025 -
[Bug]: weight_loader of fp8 weights are wrongly set to None. [Deepseek V3/R1]
#14266 opened
Mar 5, 2025 -
[Bug]: Dose V1 support MLA + PP now? Raise error while using PP+TP+V1.
#14263 opened
Mar 5, 2025 -
[Bug]: ValueError: "CompilationConfig" object has no field "max_capture_size"
#14261 opened
Mar 5, 2025 -
[New Model]: baichuan-inc/Baichuan-M1-14B-Instruct
#14259 opened
Mar 5, 2025 -
[Bug]: error occurred when compiling FlashMLA/csrc/flash_api.cpp
#14250 opened
Mar 5, 2025 -
[Misc]: [V1] prompt logprobs + chunked prefill can result in `EngineCore` partial prefill output
#14239 opened
Mar 4, 2025 -
[Feature]: Bump xgrammar version to 1.1.14 to support ARM64 processors
#14236 opened
Mar 4, 2025 -
[Bug]: ValueError: The vocabulary does not allow us to build a sequence that matches the input regex
#14233 opened
Mar 4, 2025 -
[Bug]: Corrupted output from Llama-3.2-1B when LoRA is enabled on multi-GPU instance
#14232 opened
Mar 4, 2025 -
[Usage]: Getting intermediate outputs to store on disk
#14222 opened
Mar 4, 2025 -
[Bug]: Deepseek R1 671B int8 not working on TPU
#14218 opened
Mar 4, 2025 -
[New Model]: aya 32b vision support
#14216 opened
Mar 4, 2025 -
[New Model]: pfnet/plamo-2-8b
#14214 opened
Mar 4, 2025 -
[Misc]: How does the system evenly distribute the requests to multiple micro batches?
#14213 opened
Mar 4, 2025 -
[Bug]: ValueError: There is no module or parameter named 'lm_head' in Gemma2ForCausalLM
#14212 opened
Mar 4, 2025 -
[Bug]: Ultravox audio doesn't work with auto tool choice
#14209 opened
Mar 4, 2025 -
[New Model]: nicolinho/QRM-Llama3.1-8B-v2
#14208 opened
Mar 4, 2025 -
[Feature]: how to log out the request token length?
#14206 opened
Mar 4, 2025 -
[Bug]: Different results in gsm8k when using chat
#14203 opened
Mar 4, 2025 -
[Bug]: Port is still open after crashing vllm
#14197 opened
Mar 4, 2025 -
[New Model]: deepseek-vl2
#14192 opened
Mar 4, 2025 -
[Usage]: CUDA_VISIBLE_DEVICES not supported
#14191 opened
Mar 4, 2025 -
[Feature]: Prompt Formatting Issue with LLaMA 3.1 Instruction Model in vLLM
#14190 opened
Mar 4, 2025 -
[Usage]: The inference results are not the same order as the inputs.
#14187 opened
Mar 4, 2025 -
[New Model]: InternVideo2.5 by OpenGVLab
#14180 opened
Mar 4, 2025 -
[Feature]: deepseek-r1-w8a8
#14176 opened
Mar 4, 2025 -
[Feature]: will whisper add language detection?
#14174 opened
Mar 4, 2025 -
[Performance] [V1]: Optimize batch token processing in `IncrementalDetokenizer.update()`
#14173 opened
Mar 4, 2025 -
[RFC]: Deprecate `max_num_generation_tokens`
#14168 opened
Mar 4, 2025 -
[Feature]: SLora hot loading
#14166 opened
Mar 4, 2025 -
[Feature]: Support Dynamic Loading of Prompt Adapters
#14163 opened
Mar 4, 2025 -
[Feature]: Support multi step drafting for DeepSeek MTP when k > n_predict
#14160 opened
Mar 3, 2025 -
[Bug]: Structured output requests can hang the server
#14151 opened
Mar 3, 2025 -
[Bug]: qwen2.5-vl 3B inference is OOM, but qwen2-vl 7B does not
#14150 opened
Mar 3, 2025 -
[Bug]: Failure to Init Qwen2VL-2B-Instruct with tensor-parallel-size == 4 and quantization
#14143 opened
Mar 3, 2025 -
[Bug]: MTP Implementation Inconsistency Between DeepSeek Paper and vllm
#14137 opened
Mar 3, 2025 -
[Installation]: Error occured while installing vllm
#14124 opened
Mar 3, 2025 -
[Bug]: run v1 engine with cuda graph. raise error.
#14121 opened
Mar 3, 2025 -
[Bug]: last gpu always OOM when only use pipeline parallelism with 2 nodes x 8cards
#14120 opened
Mar 3, 2025 -
[Bug]: PP > 1 with speculative decoding enabled reports an unsupported error
#14117 opened
Mar 3, 2025 -
[Feature]: Does Qwen2.5-VL support batch processing of multiple videos using vLLM?
#14112 opened
Mar 3, 2025 -
[Bug]: [vllm + Qwen2.5 VL72B] Model Continuously Outputs “!” for Certain Images
#14110 opened
Mar 3, 2025 -
[Bug]: Using fractional GPU will change the GPU resource names on Ray cluster nodes
#14109 opened
Mar 3, 2025 -
[New Model]: No supported config format found in deepseek-vl2-small
#14105 opened
Mar 3, 2025 -
[Feature]: baichuan-inc/Baichuan-Omni-1.5 support
#14104 opened
Mar 3, 2025 -
[Bug]: cannot launch deepseek-vl2 on A100
#14103 opened
Mar 3, 2025 -
[Bug]: max_model_len setting fail
#14102 opened
Mar 3, 2025
268 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
[Frontend] support image embeds
#13955 commented on
Mar 9, 2025 • 35 new comments -
[V1] V1 Enablement Oracle
#13726 commented on
Mar 9, 2025 • 33 new comments -
[Feature] Consolidate performance benchmark datasets
#14036 commented on
Mar 9, 2025 • 31 new comments -
[V1] AsyncLLM data parallel
#13923 commented on
Mar 8, 2025 • 30 new comments -
[Kernel] CUTLASS grouped gemm fp8 MoE kernel
#13972 commented on
Mar 7, 2025 • 24 new comments -
track server_load
#13950 commented on
Mar 8, 2025 • 19 new comments -
[V1] [Spec Decode] Support random sampling for spec decode
#13933 commented on
Mar 6, 2025 • 18 new comments -
[Model] Extend Ultravox to accept audio longer than 30s
#13631 commented on
Mar 9, 2025 • 14 new comments -
[MODEL] Add support for Zamba2 models
#13185 commented on
Mar 7, 2025 • 13 new comments -
[Doc] V1 user guide
#13991 commented on
Mar 5, 2025 • 13 new comments -
[V1] LoRA - Add triton kernels for V1
#13096 commented on
Mar 7, 2025 • 11 new comments -
[Core][Feature] Input metadata dump on crash
#13407 commented on
Mar 7, 2025 • 9 new comments -
[Doc] More neutral K8s deployment guide
#14084 commented on
Mar 9, 2025 • 9 new comments -
[Core] Integrate Fastsafetensor loader for loading model weights
#10647 commented on
Mar 7, 2025 • 8 new comments -
[v1] Refactor KVCacheConfig
#14079 commented on
Mar 9, 2025 • 8 new comments -
[Neuron] Add Neuron device communicator for vLLM v1
#14085 commented on
Mar 7, 2025 • 8 new comments -
[Feature] Add `vllm bench` CLI
#13993 commented on
Mar 4, 2025 • 6 new comments -
[FEAT] [ROCm] Enabling AITER Kernel
#14007 commented on
Mar 8, 2025 • 5 new comments -
[BigFix] Fix the lm_head in gpt_bigcode in lora mode
#6357 commented on
Mar 3, 2025 • 5 new comments -
Support FP8 Quantization and Inference Run on Intel Gaudi (HPU) using INC (Intel Neural Compressor)
#12010 commented on
Mar 5, 2025 • 4 new comments -
[Attention] Update to latest FA3 code that supports different K and V head dims
#13111 commented on
Mar 7, 2025 • 4 new comments -
[Hardware][TPU] improve kv cache update performance in prefill
#13176 commented on
Mar 8, 2025 • 3 new comments -
[Model] Add Support for Ovis1.6-Gemma2-9B Model
#11240 commented on
Mar 3, 2025 • 2 new comments -
[Kernel] Add more dtype support for GGUF kernels
#14043 commented on
Mar 8, 2025 • 2 new comments -
[Feature] Add filter for log redaction
#13225 commented on
Mar 5, 2025 • 1 new comment -
XGRAMMAR now support aarch64
#13894 commented on
Mar 9, 2025 • 1 new comment -
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs
#14071 commented on
Mar 6, 2025 • 1 new comment -
[V1][Frontend] Improve Shutdown And Logs
#14048 commented on
Mar 2, 2025 • 1 new comment -
[Bugfix] Fix Precision Mismatch in MoE Router of DeepSeek V2/V3 Models and Fused Kernels (BF16 -> FP32)
#14027 commented on
Mar 3, 2025 • 1 new comment -
Add ROCm Quark docs
#13984 commented on
Mar 5, 2025 • 1 new comment -
[Bug]: exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
#8840 commented on
Mar 9, 2025 • 0 new comments -
[RFC]: Encoder/decoder models & feature compatibility
#7366 commented on
Mar 9, 2025 • 0 new comments -
[Feature]: Support gemma2 GGUF architecture
#12000 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: GPTQ llama2-7b infer server failed!!!
#10848 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: Support for Controlled Decoding
#9541 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: Running llama2-7b on H20, Floating point exception (core dumped) appears on float16
#4392 commented on
Mar 8, 2025 • 0 new comments -
[Bug]: sentence_bert_config.json 404 Client Error
#11268 commented on
Mar 8, 2025 • 0 new comments -
[Bug]: Missing detection of BFloat16 for CPU ARM
#11814 commented on
Mar 8, 2025 • 0 new comments -
[Bug]: Vllm CPU mode only takes 1 single core for multi-core cpu
#10971 commented on
Mar 8, 2025 • 0 new comments -
[V1] Feedback Thread
#12568 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: ImportError: /workspace/vllm-abo/vllm/_C.abi3.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSsb
#13608 commented on
Mar 9, 2025 • 0 new comments -
[Roadmap] vLLM Roadmap Q1 2025
#11862 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
#10300 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: Inf2 Serving Fails
#11189 commented on
Mar 9, 2025 • 0 new comments -
Failed to find C compiler. Please specify via CC environment variable
#2997 commented on
Mar 9, 2025 • 0 new comments -
[Feature]: Support for RTX 5090 (CUDA 12.8)
#13306 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: vllm using ray in eks hangs when using --pipeline_parallel_size > 1
#11139 commented on
Mar 9, 2025 • 0 new comments -
[Model] Update MPT model with GLU and rope and add low precision layer norm
#9500 commented on
Mar 4, 2025 • 0 new comments -
[Frontend][Core] Add Guidance backend for guided decoding
#10217 commented on
Mar 7, 2025 • 0 new comments -
[Kernel] Add CUTLASS sparse support, heuristics, and torch operators
#10340 commented on
Mar 4, 2025 • 0 new comments -
Configuration of the model parallelism does not make sense
#10749 commented on
Mar 4, 2025 • 0 new comments -
[Bugfix] Check prompt length < max_model_len for all models in AsyncLLMEngine
#10881 commented on
Mar 5, 2025 • 0 new comments -
[Bugfix] add input embedding
#11684 commented on
Mar 8, 2025 • 0 new comments -
[V1] Allow sliding window + prefix caching
#13069 commented on
Mar 9, 2025 • 0 new comments -
[Bug]: AttributeError: 'Qwen2Model' object has no attribute 'rotary_emb'
#10773 commented on
Mar 7, 2025 • 0 new comments -
[Installation]: Missing v0.6.3.post1-cu118-cp310.whl. Can share it? Thanks so much
#10036 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: deploy on V100, mma -> mma layout conversion is only supported on Ampere
#8024 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
#8893 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Engine fails to start when running Qwen2.5 Deepseek r1
#12554 commented on
Mar 7, 2025 • 0 new comments -
[Usage]: Qwen2-VL keyword argument `max_pixels` is not a valid argument for this processor and will be ignored.
#13143 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: RuntimeError: No CUDA GPUs are available in transformers v4.48.0 or above when running Ray RLHF example
#13597 commented on
Mar 7, 2025 • 0 new comments -
[RFC]: Hardware pluggable
#11162 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: Support torch.distributed as the runtime for multi-node inference
#12511 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: DeepseekR1 model load fails with weights tied error
#12541 commented on
Mar 7, 2025 • 0 new comments -
[RFC]: Drop support for prompt adapter
#13981 commented on
Mar 7, 2025 • 0 new comments -
Generate nothing from VLLM output
#1185 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Segment fault when loading model on multi-gpu
#13309 commented on
Mar 7, 2025 • 0 new comments -
[Performance]: decoding speed on long context
#11286 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: RuntimeError: CUDA error: invalid argument [ Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
#13270 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Spurious warning on dropped args
#12856 commented on
Mar 7, 2025 • 0 new comments -
[V1][Help Wanted] Porting missing sampling parameters to V1
#13058 commented on
Mar 7, 2025 • 0 new comments -
[New Model]: answerdotai/ModernBERT-large
#11347 commented on
Mar 7, 2025 • 0 new comments -
[Installation]: subprocess-exited-with-error while installing vllm
#12965 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Can't use yarn rope config for long context in Qwen2 model
#10293 commented on
Mar 7, 2025 • 0 new comments -
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
#13216 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Waiting for output from MQLLMEngine. Hangs and then crashes after about an 1 hour
#14025 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: see connection to gpu node timeout issue when initializing ray vllm multi-node serving
#13052 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: KV Cache Quantization with GGUF turns out quite poorly.
#10411 commented on
Mar 8, 2025 • 0 new comments -
[Misc] [ROCm]: Build from source failure with Arch/gcc14 with ROCm 6.3
#13777 commented on
Mar 8, 2025 • 0 new comments -
[RFC]: Merge input processor and input mapper for multi-modal models
#10114 commented on
Mar 8, 2025 • 0 new comments -
[Doc]: Why NGramWorker does not support cache operations
#11758 commented on
Mar 8, 2025 • 0 new comments -
[V1][Bugfix] DeepSeek-V3 v1 attn_backend miss q_lora_rank
#13092 commented on
Mar 6, 2025 • 0 new comments -
[BUG] Addreses #3935 and #3683, by making `intial_incremental_detokenization_offset` configurable
#13106 commented on
Mar 6, 2025 • 0 new comments -
[CI/Build] Add support for Python 3.13
#13164 commented on
Mar 6, 2025 • 0 new comments -
[Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1
#13305 commented on
Mar 4, 2025 • 0 new comments -
[V0][Sampler] Use raw logits for greedy argmax
#13312 commented on
Mar 7, 2025 • 0 new comments -
[Kernel] moe wna16 cuda kernel
#13321 commented on
Mar 9, 2025 • 0 new comments -
[CPU] Upgrade CPU backend to torch-2.6
#13381 commented on
Mar 7, 2025 • 0 new comments -
[Model][MiniMaxText01] Support MiniMaxText01 model inference
#13454 commented on
Mar 7, 2025 • 0 new comments -
[Frontend] Implement Tool Calling with `tool_choice='required'`
#13483 commented on
Mar 5, 2025 • 0 new comments -
[Bugfix] Fix quantization skip modules logic
#13562 commented on
Mar 8, 2025 • 0 new comments -
Integrating torchao quantization into vllm
#13588 commented on
Mar 4, 2025 • 0 new comments -
[Model] Support VLMs with transformers backend
#13754 commented on
Mar 8, 2025 • 0 new comments -
[ROCm] Enable custom paged attention kernel for Navi3/4
#13843 commented on
Mar 6, 2025 • 0 new comments -
Fix TPU CI
#13898 commented on
Mar 9, 2025 • 0 new comments -
Upgrade `transformers` to `v4.49.0`
#13905 commented on
Mar 4, 2025 • 0 new comments -
[Feat][whisper] add more sampling parameters to whisper endpoint
#13910 commented on
Mar 6, 2025 • 0 new comments -
Add test for DeepGEMM contiguous layout MoE kernels
#13932 commented on
Mar 8, 2025 • 0 new comments -
[WIP][Core] Support tensor parallelism with uneven heads
#13934 commented on
Mar 2, 2025 • 0 new comments -
[Model] [Quantization] Support deepseek v3/r1 w8a8 int8 block-wise quantization
#13942 commented on
Mar 7, 2025 • 0 new comments -
Support non-attention path operators in Triton
#13963 commented on
Mar 3, 2025 • 0 new comments -
[BugFix]: properly catch templating error when preprocess input
#13976 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Speculative Decoding Tokens not being included in Prometheus metrics
#13992 commented on
Mar 3, 2025 • 0 new comments -
[Kernel] Integrate DeepGEMM dense block fp8
#13996 commented on
Mar 3, 2025 • 0 new comments -
benchmark serving: random + sharegpt dataset
#14026 commented on
Mar 3, 2025 • 0 new comments -
[Bugfix] Make memory profiler account for speculative draft model weights
#14067 commented on
Mar 3, 2025 • 0 new comments -
[Bugfix]: do not shutdown server is `skip_special_use=False` for MistralTokenizer
#14094 commented on
Mar 7, 2025 • 0 new comments -
[Frontend] Disaggregate prefill decode with zmq
#11791 commented on
Mar 9, 2025 • 0 new comments -
Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support
#11844 commented on
Mar 7, 2025 • 0 new comments -
[API_SERVER] Add maximum concurrency limit for API interface
#11997 commented on
Mar 3, 2025 • 0 new comments -
[VLM] Merged multi-modal processor for Pixtral
#12211 commented on
Mar 8, 2025 • 0 new comments -
[Model] Enable Inference Support for the New Baichuan-M1 Model
#12251 commented on
Mar 8, 2025 • 0 new comments -
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral
#12303 commented on
Mar 7, 2025 • 0 new comments -
add support for AMD MI25/50/60
#12431 commented on
Mar 7, 2025 • 0 new comments -
[CI/Build] Better default num jobs heuristic
#12477 commented on
Mar 7, 2025 • 0 new comments -
[Kernel] Add ModelOpt FP4 Checkpoint Support
#12520 commented on
Mar 7, 2025 • 0 new comments -
layerwise KV transfer in PD Disaggregation
#12523 commented on
Mar 5, 2025 • 0 new comments -
[Bugfix] fix vocab size assertion
#12550 commented on
Mar 5, 2025 • 0 new comments -
[CI] Performance regression fastcheck
#12576 commented on
Mar 5, 2025 • 0 new comments -
add tools definition into tokenize api
#12684 commented on
Mar 6, 2025 • 0 new comments -
Add helm chart release workflow
#12685 commented on
Mar 6, 2025 • 0 new comments -
[Core][AMD] Migrate fully transparent sleep mode to ROCm platform
#12695 commented on
Mar 6, 2025 • 0 new comments -
add initial Blackwell support
#12702 commented on
Mar 6, 2025 • 0 new comments -
Update to torch==2.6.0
#12721 commented on
Mar 9, 2025 • 0 new comments -
[Core] Add Additional Metrics to vLLM Server
#12726 commented on
Mar 7, 2025 • 0 new comments -
[Bugfix] Add Containerfile.arm for podman support
#12735 commented on
Mar 8, 2025 • 0 new comments -
[build][misc] allow to use recent numpy
#12759 commented on
Mar 9, 2025 • 0 new comments -
[Hardware][Intel-Gaudi] Multi-step scheduling implementation for HPU
#12779 commented on
Mar 3, 2025 • 0 new comments -
adding workaround for c2x/c3x initializer issue
#12866 commented on
Mar 6, 2025 • 0 new comments -
[Core][Frontend] Fix : Adding Control Vector Support
#12870 commented on
Mar 6, 2025 • 0 new comments -
Add flag for enabling finer-grained cuda graph capture
#12920 commented on
Mar 6, 2025 • 0 new comments -
[Bugfix] Adjust tool call handling in llama template to support single tool calls only
#12938 commented on
Mar 3, 2025 • 0 new comments -
[Feature][Frontend] Add KVTransferParams for disaggregated prefill feature
#12957 commented on
Mar 4, 2025 • 0 new comments -
Registered the model config for DeepSeek-V3.
#13055 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: LoRA support for qwen2-vl Models
#11255 commented on
Mar 4, 2025 • 0 new comments -
[Bug]:Phi-4-Mini giving garbage outputs with torch 2.5.1 and vllm==0.7.3 with multiple parallel requests
#14058 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Unsloth bitsandbytes quantized model cannot be run due to: `KeyError: 'layers.42.mlp.down_proj.weight.absmax`
#10710 commented on
Mar 4, 2025 • 0 new comments -
[Installation]: When i build vllm from source with pip install -e ,there is a ninja error: unknown target '_vllm_fa3_C', did you mean '_vllm_fa2_C'.
#13183 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Qwen2vl grounding results with vllm are worse than with transformers inference
#11254 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: running deepseek-r1 14B with 2*5090D
#13914 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Why 0 device need more memory? will it cause OOM?
#14011 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: deepseek-r1 mutlti-node crash
#13136 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: how to use prefill-decode disaggregation ??
#11490 commented on
Mar 4, 2025 • 0 new comments -
unload the model
#3281 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: vllm api_server often crashes when the version is higher than 0.5.3.
#7936 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Issue when benchmarking the dynamically served LoRA adapter
#8564 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Engine iteration timed out. (during using qwen2-vl-7b)
#10123 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Get meaningless output when run long context inference of Qwen2.5 model with vllm>=0.6.3
#10298 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Cant use vllm on a multiGPU node
#10474 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: benchmark random input-len inconsistent
#10847 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: After deploying qwen2.5_vl_72b with vllm, has anyone seen requests start at a normal 3-5 s each right after deployment, then gradually slow down to ~60 s each after some time in use?
#13886 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: ERROR hermes_tool_parser.py:108] Error in extracting tool call from response.
#10831 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: Sampling several sequences from OpenAI compatible server.
#10852 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: How to access to the generated token from LogitsProcessor
#10885 commented on
Mar 5, 2025 • 0 new comments -
[BUG] [MultiStep+AsyncOutputProc] the remaining steps not released when request output reaches max-token
#10890 commented on
Mar 5, 2025 • 0 new comments -
[Feature]: when will publish a new version
#10892 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: VLLM (0.7.0) will report gpu missing on the hosting node in Ray
#12614 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Runtime error when running MLA models with "prefix caching enabled" and "chunked prefill disabled"
#14069 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: The value of --max-model-len may influence results although the length of input less than max-model-len
#11447 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Qwen2-VL-2B-Instruct Issue when passing a video URL to /chat/completions
#13927 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Generation mismatch with Model: meta-llama/Llama-3.2-11B-Vision-Instruct
#13763 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: mllama AssertionError during kv cache profiling
#13929 commented on
Mar 4, 2025 • 0 new comments -
[Feature]: Add an endpoint to know the server config
#13056 commented on
Mar 4, 2025 • 0 new comments -
[Installation]: Could not find a version that satisfies the requirement xgrammar>=0.1.6; platform_machine == "x86_64" (from vllm) (from versions: none)
#11886 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: Error in benchmark model with vllm backend for endpoint /v1/chat/completions
#10158 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: vLLM 0.7.3 TypeError in vllm.entrypoints.api_server Argument Parsing
#13848 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: How to use pipeline parallelism in offline inference?
#13453 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: How to do offline inference on multi-node with tensor-parallel and pipeline-parallel
#12950 commented on
Mar 4, 2025 • 0 new comments -
[Usage]: Guided choice not working as expected
#12225 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: RuntimeError: ('Quantization scheme is not supported for ', 'the current GPU. Min capability: 80. ', 'Current capability: 75.')
#14040 commented on
Mar 3, 2025 • 0 new comments -
[New Model]: support Ovis VLM series
#13441 commented on
Mar 3, 2025 • 0 new comments -
Please add LoRA support for higher ranks and alpha values
#2847 commented on
Mar 3, 2025 • 0 new comments -
[RFC] Initial Support for Cloud TPUs
#3620 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Chunked prefill + lora
#4995 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Phi-3-small-128k-instruct on 1 A100 GPUs - Assertion error: Does not support prefix-enabled attention.
#7787 commented on
Mar 3, 2025 • 0 new comments -
[Usage]: What's the relationship between KV cache and MAX_SEQUENCE_LENGTH?
#10517 commented on
Mar 3, 2025 • 0 new comments -
[Usage]: vllm inference with 2 * Nvidia-L20, output repeats !!!!
#10713 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Improve Error Messaging for Unsupported Tasks in vLLM (e.g., embedding with Llama Models)
#10794 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: LoRA support for Llama 3.2 Vision Models
#10824 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: vllm multi-card (TP) inference of bnb models is not supported
#10823 commented on
Mar 3, 2025 • 0 new comments -
[Usage]: Moving from 1 to 2 GPUs in vLLM
#10826 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Implement Priority Scheduling In V1 Engine
#14002 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Return hidden states (in progress?)
#6165 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: ValueError: vllm serve Qwen/Qwen2.5-VL-72B-Instruct-AWQ ERROR: The input size is not aligned with the quantized weight shape.
#13980 commented on
Mar 2, 2025 • 0 new comments -
[Bug][Ray]: Pipeline parallelism fails on the same host
#14093 commented on
Mar 4, 2025 • 0 new comments -
[Feature][Frontend]: Deprecate `--enable-reasoning`
#14088 commented on
Mar 4, 2025 • 0 new comments -
[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
#6766 commented on
Mar 3, 2025 • 0 new comments -
[RFC]: Implement Structured Output support for V1 engine
#11908 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Add KV Cache Metrics to Usage Object
#12283 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Online Inference on local model with OpenAI Python SDK
#8631 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Improve Logging for Error Messages
#14083 commented on
Mar 3, 2025 • 0 new comments -
[RFC]: Sparse KV cache management framework
#12254 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Expose option to load new model weights from disk
#12774 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Support for DeepGEMM
#13857 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: Nonsensical Sentences Generated When Inferencing INT8 Quantized Qwen2.5-72B Model
#11175 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: macOS with vllm-cpu v0.6.6-post2 serving Qwen2.5-1.5b-Instruct results in endless exclamation marks
#12427 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: The accuracy of multi-card and single-card inference is inconsistent
#13801 commented on
Mar 3, 2025 • 0 new comments -
[Doc]: provide docker-compose.yml for multi-node serving
#13158 commented on
Mar 3, 2025 • 0 new comments -
[Installation]: Dockerfile.cpu installation problem in vLLM
#14033 commented on
Mar 3, 2025 • 0 new comments -
[Feature]: Store KVCache in 3FS
#14012 commented on
Mar 3, 2025 • 0 new comments -
[RFC]: Async KV Cache Transfer for Disaggregated Inference
#13020 commented on
Mar 3, 2025 • 0 new comments -
[Bug]: CUDA Exception on multi-gpus with concurrent users
#12307 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: vllm hangs after upgrade to v0.5.4
#7297 commented on
Mar 6, 2025 • 0 new comments -
[RFC]: Multi-modality Support on vLLM
#4194 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: How can I use 6 GPUs?
#11147 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: AttributeError: Qwen2Tokenizer has no attribute lower
#13127 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: crash: RecursionError: maximum recursion depth exceeded
#9608 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: ValueError: The checkpoint you are trying to load has model type `qwen2_5_vl` but Transformers does not recognize this architecture
#13446 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: Why is CPU KV cache usage always 0.0%?
#11871 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: Low GPU utilization when running Deepseek-r1-distill-llama-8B
#14022 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: LLM repeats output automatically
#13952 commented on
Mar 6, 2025 • 0 new comments -
[Performance]: Plan to support DP attention for Deepseek models
#12871 commented on
Mar 6, 2025 • 0 new comments -
[Installation]: AttributeError: '_OpNamespace' '_C' object has no attribute 'silu_and_mul' on the CPU instance
#13593 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: When using the latest 0.6.3, No module named 'vllm._version' appears
#9421 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: deepseek r1 + vllm (v0.7.2) torch.compile error
#13471 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: Support tool parser for DeepSeek-V3
#13764 commented on
Mar 6, 2025 • 0 new comments -
[Usage]: Llama-3.1-405B Inference with vLLM on TPU
#9052 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: XGrammar-based CFG decoding degraded after 0.6.5
#12122 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: The inference result from vLLM is incorrect on specific prompt
#10916 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Not able to install/compile vllm using an Alpine Linux base image
#10924 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: add optimum-neuron
#10946 commented on
Mar 7, 2025 • 0 new comments -
[Misc]: Saved sharded state should also include GPU P2P access cache
#10967 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: Support Qwen/Qwen2.5-14B-Instruct-1M
#12452 commented on
Mar 7, 2025 • 0 new comments -
[Feature]: add DoRA support
#10849 commented on
Mar 7, 2025 • 0 new comments -
[Bug]: Using "response_format": { "type": "json_object" } with /v1/chat/completions is terminating the engine
#11828 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: The api server /health endpoint is unable to detect when the Worker VllmWorkerProcess has died
#11996 commented on
Mar 6, 2025 • 0 new comments -
[Feature][v1]: Add metrics support
#10582 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Qwen/Qwen2.5-1.5B-Instruct generates out of vocabulary tokens
#13175 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: Support "required" option in tool_choice
#13002 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: When using the vLLM framework to load vision models, CPU memory overflows while continuously processing data with images.
#12973 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: MiniCPM-o 2.6's fine-tuned LoRA is not supported
#13018 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Using lm_format_enforcer, or using certain schemas, with Llama-3.2-90B-Vision-Instruct causes a crash
#11248 commented on
Mar 6, 2025 • 0 new comments -
[New Model]: Ovis2
#13251 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: RuntimeError: Error in model execution: CUDA error: an illegal memory access was encountered
#12233 commented on
Mar 6, 2025 • 0 new comments -
[New Model]: nomic-ai/nomic-embed-text-v1
#12054 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: No available block found in 60 seconds in shm
#6614 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: torchvision.libs/libcudart.41118559.so.12 (deleted): cannot open shared object file: No such file or directory
#13040 commented on
Mar 5, 2025 • 0 new comments -
[Performance]: enforce_eager=False degrades the performance metrics for long-context input
#13536 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Fatal Python Error when Starting DeepSeek V3
#13014 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: cross-node tensor parallel + cudagraph issue
#13552 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Make https://wheels.vllm.ai/nightly inspectable
#13545 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Gemma 2 - AttributeError: 'Gemma2Config' object has no attribute 'interleaved_sliding_window'
#13226 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: AssertionError, assert prefill_metadata.context_chunk_seq_tot is not None
#14009 commented on
Mar 5, 2025 • 0 new comments -
How to use vllm to compute ppl score for input text?
#1019 commented on
Mar 5, 2025 • 0 new comments -
[Feature]: Support for priority preemption with chunked-prefill
#10101 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: OpenTelemetry with FastAPI not working
#10213 commented on
Mar 5, 2025 • 0 new comments -
[Doc]: Docker+vllm+fastchat deploys multimodal large model Qwen2-vl-7b-instruct
#10566 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: Engine process (pid 76) died
#10812 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: The following error occurred when I ran the qwen2.5-7b-fp8-dynamic model with vllm 0.6.4.post1 on a single 4090 card
#10828 commented on
Mar 5, 2025 • 0 new comments -
[Installation]: XPU dependencies are missing
#11173 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: An error occurred while using H20 GPUs to perform multi-machine inference of a 405B model through the Ray cluster, causing inference to crash.
#9215 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Can't serve on a Ray cluster even though VLLM_HOST_IP is passed
#13521 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Speculative Decoding without enabling eager mode returns gibberish output after some tokens.
#10559 commented on
Mar 6, 2025 • 0 new comments -
[Performance]: logit bias implementation uses a slow for loop
#10741 commented on
Mar 6, 2025 • 0 new comments -
[Feature]: Serving VLM VILA
#10889 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Illegal Memory access was encounterd when running UT: pytest -s -v vllm/tests/spec_decode/test_multi_step_worker.py::test_use_draft_model_runner_advance_step
#10918 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: deepseek v2.5 gptq (int4) error with vllm-0.6.4
#10923 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: vllm v1/chat/completions Internal Server Error
#10925 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: ERROR 07-26 14:50:35 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 214281 died, exit code: -11
#6823 commented on
Mar 6, 2025 • 0 new comments -
[Misc]: I want to run Llama 3.1 405B using speculative decoding. Can you give me a guide?
#7456 commented on
Mar 6, 2025 • 0 new comments -
[Bug]: Llama 3.2 90b crash
#10648 commented on
Mar 5, 2025 • 0 new comments -
[Usage]: Why is speculative decoding slower than normal decoding?
#8439 commented on
Mar 5, 2025 • 0 new comments -
[RFC]: Disaggregated prefilling and KV cache transfer roadmap
#10818 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: vllm0.7.3: an illegal memory access was encountered
#13824 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: (v0.7.2): RuntimeError: CUDA error: an illegal memory access was encountered
#13939 commented on
Mar 5, 2025 • 0 new comments -
[Bug]: V1 engine ignores logits processors and min-p sampling
#12678 commented on
Mar 5, 2025 • 0 new comments