diff --git a/_posts/2025-01-14-struct-decode-intro.md b/_posts/2025-01-14-struct-decode-intro.md index 6116b46..dabc4ee 100644 --- a/_posts/2025-01-14-struct-decode-intro.md +++ b/_posts/2025-01-14-struct-decode-intro.md @@ -2,6 +2,7 @@ layout: post title: "Structured Decoding in vLLM: a gentle introduction" author: "Guest Post by BentoML and Red Hat" +image: /assets/figures/struct-decode-intro/vllm-xgrammar-decode-time-per-output-token.png --- **TL/DR**: diff --git a/_posts/2025-01-21-stack-release.md b/_posts/2025-01-21-stack-release.md index 81c7248..3250bdc 100644 --- a/_posts/2025-01-21-stack-release.md +++ b/_posts/2025-01-21-stack-release.md @@ -1,8 +1,6 @@ --- layout: post -title: "High Performance and Easy Deployment of vLLM in K8S with “vLLM production-stack”" -thumbnail-img: /assets/figures/stack/stack-thumbnail.png -share-img: /assets/figures/stack/stack-thumbnail.png +title: "High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack" author: LMCache Team image: /assets/figures/stack/stack-thumbnail.png --- diff --git a/_posts/2025-02-24-ptpc-fp8-rocm.md b/_posts/2025-02-24-ptpc-fp8-rocm.md index 5c76b7e..8ef998f 100644 --- a/_posts/2025-02-24-ptpc-fp8-rocm.md +++ b/_posts/2025-02-24-ptpc-fp8-rocm.md @@ -3,8 +3,6 @@ layout: post title: "PTPC-FP8: Boosting vLLM Performance on AMD ROCm" author: "AMD and Embedded LLM" image: /assets/figures/ptpc/PTPC-tumbnail.png -thumbnail-img: /assets/figures/ptpc/PTPC-tumbnail.png -share-img: /assets/figures/ptpc/PTPC-tumbnail.png math: true --- diff --git a/_posts/2025-04-05-llama4.md b/_posts/2025-04-05-llama4.md index 42aca6a..a8e6df2 100644 --- a/_posts/2025-04-05-llama4.md +++ b/_posts/2025-04-05-llama4.md @@ -3,8 +3,6 @@ layout: post title: "Llama 4 in vLLM" author: "The vLLM Team" image: /assets/figures/llama4/perf.png -thumbnail-img: /assets/figures/llama4/perf.png -share-img: /assets/figures/llama4/perf.png --- We're excited to announce that vLLM now supports the [Llama 4 herd of models](https://ai.meta.com/blog/llama-4-multimodal-intelligence/): **Scout** (17B-16E) and **Maverick** (17B-128E). 
You can run these powerful long-context, natively multi-modal (up to 8-10 images with good results), mixture-of-experts models in vLLM today by updating to version v0.8.3 or later: diff --git a/_posts/2025-04-11-transformers-backend.md b/_posts/2025-04-11-transformers-backend.md index 88691b9..68c4f90 100644 --- a/_posts/2025-04-11-transformers-backend.md +++ b/_posts/2025-04-11-transformers-backend.md @@ -3,8 +3,6 @@ layout: post title: "Transformers backend integration in vLLM" author: "The Hugging Face Team" image: /assets/figures/transformers-backend/transformers-backend.png -thumbnail-img: /assets/figures/transformers-backend/transformers-backend.png -share-img: /assets/figures/transformers-backend/transformers-backend.png --- The [Hugging Face Transformers library](https://huggingface.co/docs/transformers/main/en/index) diff --git a/_posts/2025-04-23-openrlhf-vllm.md b/_posts/2025-04-23-openrlhf-vllm.md index 6b6e39d..c5e77ea 100644 --- a/_posts/2025-04-23-openrlhf-vllm.md +++ b/_posts/2025-04-23-openrlhf-vllm.md @@ -1,10 +1,8 @@ --- layout: post title: "Accelerating RLHF with vLLM, Best Practice from OpenRLHF" -author: "The OpenRLHF Team" -image: /assets/figures/openrlhf-vllm/ray.png -thumbnail-img: /assets/figures/openrlhf-vllm/ray.png -share-img: /assets/figures/openrlhf-vllm/ray.png +author: "The OpenRLHF Team" +image: /assets/figures/openrlhf-vllm/ray.png --- As demand grows for training reasoning-capable large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has emerged as a cornerstone technique. However, conventional RLHF pipelines—especially those using Proximal Policy Optimization (PPO)—are often hindered by substantial computational overhead. This challenge is particularly pronounced with models that excel at complex reasoning tasks (such as OpenAI-o1 and DeepSeek-R1), where generating long chain-of-thought (CoT) outputs can account for up to 90% of total training time. These models must produce detailed, step-by-step reasoning that can span thousands of tokens, making inference significantly more time-consuming than the training phase itself. As a pioneering inference framework, vLLM provides a user-friendly interface for generating RLHF samples and updating model weights. diff --git a/_posts/2025-06-30-minimax-m1.md b/_posts/2025-06-30-minimax-m1.md index d49c0ca..0e0404a 100644 --- a/_posts/2025-06-30-minimax-m1.md +++ b/_posts/2025-06-30-minimax-m1.md @@ -2,8 +2,9 @@ layout: post title: "MiniMax-M1 Hybrid Architecture Meets vLLM: Long Context, Fast Inference" author: "MiniMax" -benchmark-img: /assets/figures/minimax-m1/benchmark.png -moe-img: /assets/figures/minimax-m1/moe.png +image: /assets/figures/minimax-m1/benchmark.png +benchmark-img: /assets/figures/minimax-m1/benchmark.png +moe-img: /assets/figures/minimax-m1/moe.png lightning_attention-img: /assets/figures/minimax-m1/lightning_attention.png --- diff --git a/_posts/2025-09-11-qwen3-next.md b/_posts/2025-09-11-qwen3-next.md index 7b75274..cb1eeea 100644 --- a/_posts/2025-09-11-qwen3-next.md +++ b/_posts/2025-09-11-qwen3-next.md @@ -3,8 +3,6 @@ layout: post title: "vLLM Now Supports Qwen3-Next: Hybrid Architecture with Extreme Efficiency" author: "The vLLM Team" image: /assets/figures/qwen3-next/qwen.png -thumbnail-img: /assets/figures/qwen3-next/qwen.png -share-img: /assets/figures/qwen3-next/qwen.png --- We’re excited to announce that **vLLM now supports Qwen3-Next**, the latest generation of foundation models from the Qwen team. 
Qwen3-Next introduces a **hybrid architecture with extreme efficiency for long context support**, and vLLM offers full support for its features. diff --git a/_posts/2025-09-16-vllm-meetup.md b/_posts/2025-09-16-vllm-meetup.md index 4f9cd42..e329c61 100644 --- a/_posts/2025-09-16-vllm-meetup.md +++ b/_posts/2025-09-16-vllm-meetup.md @@ -1,7 +1,8 @@ --- layout: post title: "The First vLLM Meetup in Korea" -author: "vLLM Team" +author: "vLLM Team" +image: /assets/figures/vllm-meetup/image-3.png ---

diff --git a/_posts/2025-09-29-deepseek-v3-2.md b/_posts/2025-09-29-deepseek-v3-2.md index c3983e2..cf43b75 100644 --- a/_posts/2025-09-29-deepseek-v3-2.md +++ b/_posts/2025-09-29-deepseek-v3-2.md @@ -1,10 +1,8 @@ --- layout: post title: "DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action" -author: "vLLM Team" +author: "vLLM Team" image: /assets/figures/deepseek-v3-2/dsa-explained.png -thumbnail-img: /assets/figures/deepseek-v3-2/dsa-explained.png -share-img: /assets/figures/deepseek-v3-2/dsa-explained.png --- ### Introduction diff --git a/_posts/2025-10-09-blackwell-inferencemax.md b/_posts/2025-10-09-blackwell-inferencemax.md index 2c71c43..ba40ecf 100644 --- a/_posts/2025-10-09-blackwell-inferencemax.md +++ b/_posts/2025-10-09-blackwell-inferencemax.md @@ -1,7 +1,8 @@ ---- -layout: post -title: "SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference" -author: "vLLM Team" +--- +layout: post +title: "SemiAnalysis InferenceMAX: vLLM and NVIDIA Accelerate Blackwell Inference" +author: "vLLM Team" +image: /assets/figures/blackwell-inferencemax/gpt-oss-120b-1k-1k.png --- ### Introduction diff --git a/_posts/2025-10-16-vllm-tpu.md b/_posts/2025-10-16-vllm-tpu.md index c994d4c..4e49082 100644 --- a/_posts/2025-10-16-vllm-tpu.md +++ b/_posts/2025-10-16-vllm-tpu.md @@ -1,7 +1,8 @@ --- -layout: post -title: "vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU " -author: "Google Team" +layout: post +title: "vLLM TPU: A New Unified Backend Supporting PyTorch and JAX on TPU" +author: "Google Team" +image: /assets/figures/vllm-tpu/vllm-tpu.png ---

diff --git a/_posts/2025-10-22-agent-lightning.md b/_posts/2025-10-22-agent-lightning.md index 72c5f76..04caa70 100644 --- a/_posts/2025-10-22-agent-lightning.md +++ b/_posts/2025-10-22-agent-lightning.md @@ -1,7 +1,8 @@ --- -layout: post -title: "No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL" -author: "The Agent Lightning (AGL) Team" +layout: post +title: "No More Retokenization Drift: Returning Token IDs via the OpenAI Compatible API Matters in Agent RL" +author: "The Agent Lightning (AGL) Team" +image: /assets/figures/agent-lightning/1_rewards.png --- **TL;DR.** Agents often call LLMs via OpenAI-compatible endpoints, which previously returned only string-based inputs and outputs. In **agent RL**, this can lead to inconsistencies between training and inference due to a phenomenon we call **Retokenization Drift**: tokens are detokenized during inference and retokenized during training, and the two sets of tokens may differ even though their corresponding strings are identical. Now you can ask vLLM’s OpenAI-compatible endpoints to return the **exact token IDs** for both prompts and generated responses. Pass `"return_token_ids": true` to `/v1/chat/completions` or `/v1/completions` and you’ll receive `prompt_token_ids` and `token_ids` alongside the regular text output. This makes **agent RL** robust, since no drift can occur. It pairs perfectly with Agent Lightning, where each model call is treated as a separate update sample without stitching; just log the token IDs returned when `return_token_ids` is enabled (a minimal request sketch appears below). diff --git a/_posts/2025-10-23-now_serving_nvidia_nemotron_with_vllm.md b/_posts/2025-10-23-now_serving_nvidia_nemotron_with_vllm.md index 8ecd24a..3958c9c 100644 --- a/_posts/2025-10-23-now_serving_nvidia_nemotron_with_vllm.md +++ b/_posts/2025-10-23-now_serving_nvidia_nemotron_with_vllm.md @@ -1,7 +1,8 @@ --- -layout: post -title: "Now Serving NVIDIA Nemotron with vLLM" -author: "NVIDIA Nemotron Team" +layout: post +title: "Now Serving NVIDIA Nemotron with vLLM" +author: "NVIDIA Nemotron Team" +image: /assets/figures/2025-vllm-nvidia-nemotron/figure1.png --- Agentic AI systems, capable of reasoning, planning, and taking autonomous actions, are powering the next leap in developer applications. To build these systems, developers need tools that are open, efficient, and ready to scale. And as demand for agents grows, open, performant models are key, as they provide transparency, adaptability, and cost control. diff --git a/_posts/2025-10-27-semantic-router-modular.md b/_posts/2025-10-27-semantic-router-modular.md index c0239d6..1fa2f8f 100644 --- a/_posts/2025-10-27-semantic-router-modular.md +++ b/_posts/2025-10-27-semantic-router-modular.md @@ -1,7 +1,8 @@ --- layout: post -title: "From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA" -author: "Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)" +title: "From Monolithic to Modular: Scaling Semantic Routing with Extensible LoRA" +author: "Ivar Flakstad (Hugging Face), OneZero-Y, Huamin Chen (Red Hat), Xunzhuo Liu (Tencent)" +image: /assets/figures/semantic-router/modular.png --- Semantic routing systems face a scaling challenge. When each classification request requires running multiple fine-tuned models independently, the computational cost grows linearly with the number of models.
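To make the `return_token_ids` option above concrete, here is a minimal request sketch. It is not taken from the Agent Lightning code: it assumes a vLLM OpenAI-compatible server is already running locally on port 8000, uses a placeholder model name, and the exact placement of the ID fields in the response payload may differ across vLLM versions.

```python
# Minimal sketch: ask a locally running vLLM OpenAI-compatible server to return
# the exact token IDs alongside the usual text output.
# Assumptions: the server was started separately (e.g. via `vllm serve`) on
# port 8000, and "your-model-name" is a placeholder for the served model.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "your-model-name",  # placeholder
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "max_tokens": 32,
        "return_token_ids": True,  # the option described in the post above
    },
    timeout=60,
)
data = resp.json()
choice = data["choices"][0]

print(choice["message"]["content"])
# Field placement is an assumption; inspect the payload for your vLLM version.
print("prompt ids:", data.get("prompt_token_ids") or choice.get("prompt_token_ids"))
print("output ids:", choice.get("token_ids"))
```

Logging these IDs on the trainer side removes any need to retokenize the returned strings, which is exactly the drift the post is about.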
This post examines how a recent refactoring of the vLLM Semantic Router's Rust-based classification layer addresses this problem through architectural modularity, Low-Rank Adaptation (LoRA), and concurrency optimization. diff --git a/_posts/2025-10-28-Kimi-K2-Accuracy.md b/_posts/2025-10-28-Kimi-K2-Accuracy.md index eace59b..ff372fc 100644 --- a/_posts/2025-10-28-Kimi-K2-Accuracy.md +++ b/_posts/2025-10-28-Kimi-K2-Accuracy.md @@ -2,6 +2,7 @@ layout: post title: "Chasing 100% Accuracy: A Deep Dive into Debugging Kimi K2's Tool-Calling on vLLM" author: "Linian Wang (Peking University)" +image: /assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg --- **TL;DR:** For best compatibility with vLLM, use Kimi K2 models whose chat templates were updated after commit 94a4053eb8863059dd8afc00937f054e1365abbd ([Kimi-K2-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)) or commit 0102674b179db4ca5a28cd9a4fb446f87f0c1454 ([Kimi-K2](https://huggingface.co/moonshotai/Kimi-K2-Instruct)). The updates are committed per model. @@ -152,6 +153,8 @@ Through systematic and collaborative debugging, we successfully resolved the cri I hope this detailed account serves as a useful roadmap for other developers integrating complex models into vLLM and beyond. As the open-source community continues to mature, we look forward to an even more seamless model integration experience and more powerful agentic capabilities for everyone. +![](/assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg) + ### Acknowledgements I'd like to extend my sincere gratitude to the engineers at the Kimi team. Their deep technical expertise was crucial in pinpointing the root causes, and they swiftly implemented the necessary fixes on the Hugging Face Hub once the issues were identified. This journey and its successful outcome would not have been possible without their active collaboration and support. diff --git a/_posts/2025-10-31-run-multimodal-reasoning-agents-nvidia-nemotron.md b/_posts/2025-10-31-run-multimodal-reasoning-agents-nvidia-nemotron.md index 3d2f63c..ed81675 100644 --- a/_posts/2025-10-31-run-multimodal-reasoning-agents-nvidia-nemotron.md +++ b/_posts/2025-10-31-run-multimodal-reasoning-agents-nvidia-nemotron.md @@ -1,7 +1,8 @@ --- layout: post -title: "Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM" -author: "NVIDIA Nemotron Team" +title: "Run Multimodal Reasoning Agents with NVIDIA Nemotron on vLLM" +author: "NVIDIA Nemotron Team" +image: /assets/figures/2025-multimodal-nvidia-nemotron/figure1.png --- We are excited to release [NVIDIA Nemotron Nano 2 VL](https://huggingface.co/nvidia/Nemotron-Nano-12B-v2-VL-BF16), supported by vLLM. This open vision language model ([VLM](https://www.nvidia.com/en-us/glossary/vision-language-models/)) is built for video understanding and document intelligence. diff --git a/_posts/2025-11-11-intel-arc-pro-b.md b/_posts/2025-11-11-intel-arc-pro-b.md index ce12e05..4a287b0 100755 --- a/_posts/2025-11-11-intel-arc-pro-b.md +++ b/_posts/2025-11-11-intel-arc-pro-b.md @@ -2,6 +2,7 @@ layout: post title: "Fast and Affordable LLMs serving on Intel Arc Pro B-Series GPUs with vLLM" author: "Intel vLLM Team" +image: /assets/figures/2025-vllm-on-intel-arc/perf-figure1.png --- [Intel® Arc™ Pro B-Series GPU Family](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/workstations/b-series/overview.html) GPUs deliver powerful AI capabilities with a focus on accessibility and exceptional price-to-performance ratios. 
Their large memory capacity and multi-GPU scalability make it possible to run the latest large, capable AI models locally, bringing advanced AI inference to professionals who want to deploy Large Language Models (LLMs) without the premium costs typically associated with AI hardware. @@ -51,8 +52,8 @@ Intel® Arc™ Pro B60 GPU has 20 XeCores, each with identical resources that ca One observation is that each group runs a different amount of work because of the imbalance in expert routing. If each group loops over work at a fixed stride, there is always one group that takes the largest share of work and another that takes the smallest, and the gap between them can accumulate to as much as 15% of the total MoE GEMM time. A better alternative is for whichever group finishes a task in one loop iteration to start the next available task immediately. As a concrete example, with 40 groups crunching 200 GEMM blocks, a static stride means group 0 loops through blocks 0, 40, 80, ... and group 1 loops through 1, 41, 81, and so on. A caveat is that, due to the nature of MoE, GEMM blocks may not all have the same compute intensity, and randomized access patterns let certain groups finish their work faster than others. This limits efficiency because the groups that always finish early cannot help the groups that always meet heavy loads. -| Before | After | -|---|---| +| Before | After | +| ----------------------------------------------------------------------- | ----------------------------------------------------------------------- | | ![thread load](/assets/figures/2025-vllm-on-intel-arc/thread-load1.png) | ![thread load](/assets/figures/2025-vllm-on-intel-arc/thread-load2.png) | We mitigate this effect by letting each group compete for the next job through an atomic counter. Whenever a group finishes computing one GEMM block, it fetches a rank from the atomic counter, and that rank decides which block it takes next. This eliminates the small gaps in kernel looping and achieves perfect scheduling across all expert-routing scenarios (a conceptual sketch of this scheme appears below, after Table 1). @@ -85,14 +86,14 @@ Figure 3: TTFT/TPOT for llama-70B single batch with long context input from 1K t GPT-OSS: The Intel® Arc™ Pro B60 GPU also demonstrates exceptional performance with OpenAI's recently launched GPT-OSS model, providing developers and enterprises with a powerful, cost-effective solution for large-scale AI inference, as shown in the table below. 
-| Model | Data type | TP | Input/output seq length | Concurrency | TTFT (s) | TPOT (ms) | Output Token Throughput (toks/s) | -| --- | --- | --- | --- | --- | --- | --- | --- | -| GPT-OSS-20b |MXFP4 |1 |1024/1024 |75 |7.614 |53.96 |1210.74| -| GPT-OSS-20b |MXFP4 |1 |2048/2048 |38 |7.823 |42.35 |818.92 | -| GPT-OSS-20b |MXFP4 |1 |5120/5120 |15 |8.36 |34.27 |416.94 | -| GPT-OSS-120b |MXFP4 |4 |1024/1024 |100|8.04 |58.78 |1495.12| -| GPT-OSS-120b |MXFP4 |4 |2048/2048 |50 |8.11 |41.98 |1085.58| -| GPT-OSS-120b |MXFP4 |4 |5120/5120 |20 |8.60 |30.60 |619.10 | +| Model | Data type | TP | Input/output seq length | Concurrency | TTFT (s) | TPOT (ms) | Output Token Throughput (toks/s) | +| ------------ | --------- | --- | ----------------------- | ----------- | -------- | --------- | -------------------------------- | +| GPT-OSS-20b | MXFP4 | 1 | 1024/1024 | 75 | 7.614 | 53.96 | 1210.74 | +| GPT-OSS-20b | MXFP4 | 1 | 2048/2048 | 38 | 7.823 | 42.35 | 818.92 | +| GPT-OSS-20b | MXFP4 | 1 | 5120/5120 | 15 | 8.36 | 34.27 | 416.94 | +| GPT-OSS-120b | MXFP4 | 4 | 1024/1024 | 100 | 8.04 | 58.78 | 1495.12 | +| GPT-OSS-120b | MXFP4 | 4 | 2048/2048 | 50 | 8.11 | 41.98 | 1085.58 | +| GPT-OSS-120b | MXFP4 | 4 | 5120/5120 | 20 | 8.60 | 30.60 | 619.10 | Table 1: GPT-OSS vLLM inference throughput using 1-4 GPUs on x8 Intel® Arc™ Pro B-series System. diff --git a/assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg b/assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg new file mode 100644 index 0000000..1daa080 Binary files /dev/null and b/assets/figures/kimi-k2-accuracy/k2-vendor-verifier.jpeg differ
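The atomic-counter scheduling described in the Intel Arc MoE GEMM hunk above is easier to see in a small sketch. The following is a conceptual, host-side Python analogue rather than the production XPU kernel: the group count, block count, per-block costs, and the `AtomicCounter` helper are illustrative assumptions that only emulate the GPU-side atomic fetch-and-add.

```python
# Conceptual analogue of the dynamic MoE GEMM scheduling described above:
# instead of looping over blocks at a fixed stride, each group claims the next
# block index from a shared atomic counter as soon as it finishes its current
# one. This is NOT the production kernel; it only illustrates the idea.
import random
import threading
import time

NUM_GROUPS = 40   # stand-ins for the 40 work-groups in the example above
NUM_BLOCKS = 200  # GEMM blocks to crunch
block_cost = [random.uniform(0.5, 2.0) for _ in range(NUM_BLOCKS)]  # uneven work


class AtomicCounter:
    """Emulates the GPU-side atomic fetch-and-add used to claim the next block."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self) -> int:
        with self._lock:
            claimed = self._value
            self._value += 1
            return claimed


counter = AtomicCounter()
work_done = [0.0] * NUM_GROUPS  # simulated work accumulated per group


def group_worker(gid: int) -> None:
    while True:
        blk = counter.fetch_add()  # whoever finishes first grabs the next block
        if blk >= NUM_BLOCKS:
            return
        time.sleep(block_cost[blk] / 1000)  # pretend to compute the GEMM block
        work_done[gid] += block_cost[blk]


threads = [threading.Thread(target=group_worker, args=(g,)) for g in range(NUM_GROUPS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With dynamic claiming, groups that land on cheap blocks simply claim more of
# them, so per-group totals stay much closer together than with a fixed stride.
print(f"work per group: min={min(work_done):.2f}, max={max(work_done):.2f}")
```

In the real kernel this pattern collapses to a single atomic increment: each work-group keeps claiming block indices until the counter passes the last block, which is how the fixed-stride gaps described above are eliminated.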