feed.xml

<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.2">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2023-11-14T16:44:45-08:00</updated><id>/feed.xml</id><title type="html">vLLM Blog</title><author><name>© 2023. vLLM Team. All rights reserved.</name></author><entry><title type="html">Notes on vLLM v.s. DeepSpeed-FastGen</title><link href="/2023/11/14/notes-vllm-vs-deepspeed.html" rel="alternate" type="text/html" title="Notes on vLLM v.s. DeepSpeed-FastGen" /><published>2023-11-14T00:00:00-08:00</published><updated>2023-11-14T00:00:00-08:00</updated><id>/2023/11/14/notes-vllm-vs-deepspeed</id><content type="html" xml:base="/2023/11/14/notes-vllm-vs-deepspeed.html"><![CDATA[<hr />
<p><strong>TL;DR:</strong></p>

<ul>
  <li>vLLM matches DeepSpeed-FastGen’s speed in common scenarios and surpasses it when handling longer outputs.</li>
  <li>DeepSpeed-FastGen only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.</li>
  <li>vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 and community-owned, offering extensive model and optimization support.</li>
</ul>

<hr />

<p>The DeepSpeed team recently published <a href="https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen">a blog post</a> claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
We are happy to see the technology advancements from the open-source community.
In this blog, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed-FastGen.</p>

<h3 id="performance-benchmark">Performance Benchmark</h3>

<p>We’ve identified two key differences between vLLM and DeepSpeed-FastGen in terms of performance optimization:</p>

<ol>
  <li><strong>DeepSpeed-FastGen adopts a conservative/suboptimal memory allocation scheme</strong>, which wastes memory when output lengths are large.</li>
  <li>DeepSpeed-FastGen’s Dynamic SplitFuse scheduling gives <strong>speedup only when prompt lengths are much greater than output lengths</strong>.</li>
</ol>

<p>As a result, DeepSpeed-FastGen outperforms when the workload is consistently long prompt and short output.
In other scenarios, vLLM shows superior performance.</p>

<p>We benchmarked the two systems on an NVIDIA A100-80GB GPU with the LLaMA-7B model in the following scenarios:</p>

<h4 id="scenario-1-long-prompt-length-short-output">Scenario 1: Long Prompt Length, Short Output</h4>
<p>Here, DeepSpeed-FastGen’s Dynamic SplitFuse scheduling is expected to shine.
However, the performance gain we observe isn’t as significant as 2x.</p>

<p align="center">
<picture>
<img src="/assets/figures/notes-vllm-vs-deepspeed/s1.png" width="50%" />
</picture>
</p>

<h4 id="scenario-2-other-cases">Scenario 2: Other cases</h4>
<p>In these cases, vLLM is up to <strong>1.8x</strong> faster than DeepSpeed-FastGen.</p>

<p align="center">
<picture>
<img src="/assets/figures/notes-vllm-vs-deepspeed/s2.png" width="50%" />
</picture>
</p>

<h3 id="vllms-future-a-true-community-project">vLLM’s Future: A True Community Project</h3>
<p>We are committed to making vLLM the best open-source project incorporating the community’s best models, optimizations, and hardware. Coming out of UC Berkeley Sky Computing Lab, we are building vLLM truly in open source with the Apache 2.0 license.</p>

<p>The vLLM team prioritizes collaborations and we strive to keep the codebase with high quality code and easy to contribute. We are actively working on system performance; as well as new features like LoRA, Speculative Decoding, and better Quantization Support. Additionally, we are collaborating with hardware vendors like AMD, AWS Inferenetia, and Intel Habana to bring LLM to the broadest community.</p>

<p>Specifically for the Dynamic SplitFuse optimization, we are actively investigating the proper integration. If you have any questions and suggestions, please feel free to contact us on <a href="https://github.com/vllm-project/vllm">GitHub</a>. We also published the benchmark code <a href="https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py">here</a>.</p>

<h3 id="appendix-feature-comparison">Appendix: Feature Comparison</h3>

<p>DeepSpeed-FastGen currently offers basic functionalities, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
We do expect the DeepSpeed-FastGen is eager to catch up and we welcome the creative innovation in the market!</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th style="text-align: center">vLLM</th>
      <th style="text-align: center">DeepSpeed-FastGen</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Runtime</td>
      <td style="text-align: center">Python/PyTorch</td>
      <td style="text-align: center">Python/PyTorch</td>
    </tr>
    <tr>
      <td>Model implementation</td>
      <td style="text-align: center">HuggingFace Transformers</td>
      <td style="text-align: center">Custom implementation + converter for HF models</td>
    </tr>
    <tr>
      <td>Server frontend</td>
      <td style="text-align: center">Simple FastAPI server for demo purposes</td>
      <td style="text-align: center">Custom gRPC-based server</td>
    </tr>
    <tr>
      <td>Scheduling</td>
      <td style="text-align: center">Continuous batching</td>
      <td style="text-align: center">Dynamic SplitFuse</td>
    </tr>
    <tr>
      <td>Attention kernel</td>
      <td style="text-align: center">PagedAttention &amp; FlashAttention</td>
      <td style="text-align: center">PagedAttention &amp; FlashAttention</td>
    </tr>
    <tr>
      <td>Custom kernels (for LLaMA)</td>
      <td style="text-align: center">Attention, RoPE, RMS, SILU</td>
      <td style="text-align: center">Attention, RoPE, RMS, SILU, Embedding</td>
    </tr>
    <tr>
      <td>KV Cache allocation</td>
      <td style="text-align: center">Near-optimal</td>
      <td style="text-align: center">Suboptimal/conservative</td>
    </tr>
    <tr>
      <td>Supported models</td>
      <td style="text-align: center">16 different architectures</td>
      <td style="text-align: center">LLaMA, Mistral, OPT</td>
    </tr>
    <tr>
      <td>Sampling methods</td>
      <td style="text-align: center">Random, parallel, beam search</td>
      <td style="text-align: center">Random</td>
    </tr>
    <tr>
      <td>Stop criterion</td>
      <td style="text-align: center">Stop strings, stop tokens, EOS</td>
      <td style="text-align: center">EOS</td>
    </tr>
  </tbody>
</table>]]></content><author><name>vLLM Team</name></author><summary type="html"><![CDATA[TL;DR:]]></summary></entry><entry><title type="html">vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention</title><link href="/2023/06/20/vllm.html" rel="alternate" type="text/html" title="vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention" /><published>2023-06-20T00:00:00-07:00</published><updated>2023-06-20T00:00:00-07:00</updated><id>/2023/06/20/vllm</id><content type="html" xml:base="/2023/06/20/vllm.html"><![CDATA[<p align="center" style="margin-top:-15px">
<a href="https://github.com/vllm-project/vllm"><b>GitHub</b></a> | <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://arxiv.org/pdf/2309.06180.pdf"><b>Paper</b></a>
</p>

<p>LLMs promise to fundamentally change how we use AI across all industries. However, actually serving these models is challenging and can be surprisingly slow even on expensive hardware. Today we are excited to introduce vLLM, an open-source library for fast LLM inference and serving. vLLM utilizes <strong>PagedAttention</strong>, our new attention algorithm that effectively manages attention keys and values. vLLM equipped with PagedAttention redefines the new state of the art in LLM serving: it delivers up to 24x higher throughput than HuggingFace Transformers, without requiring any model architecture changes.</p>

<p>vLLM has been developed at UC Berkeley and deployed at <a href="https://chat.lmsys.org">Chatbot Arena and Vicuna Demo</a> for the past two months. It is the core technology that makes LLM serving affordable even for a small research team like LMSYS with limited compute resources. Try out vLLM now with a single command at our <a href="https://github.com/vllm-project/vllm">GitHub repository</a>.</p>

<h3 id="beyond-state-of-the-art-performance">Beyond State-of-the-art Performance</h3>

<p>We compare the throughput of vLLM with <a href="https://huggingface.co/docs/transformers/main_classes/text_generation">HuggingFace Transformers (HF)</a>, the most popular LLM library and <a href="https://github.com/huggingface/text-generation-inference">HuggingFace Text Generation Inference (TGI)</a>, the previous state of the art. We evaluate in two settings: LLaMA-7B on an NVIDIA A10G GPU and LLaMA-13B on an NVIDIA A100 GPU (40GB). We sample the requests’ input/output lengths from the ShareGPT dataset. In our experiments, vLLM achieves up to <strong>24x</strong> higher throughput compared to HF and up to <strong>3.5x</strong> higher throughput than TGI.</p>

<p align="center">
<picture>
<img src="/assets/figures/perf_a100_n1_light.png" width="45%" />
</picture><picture>
<img src="/assets/figures/perf_a10g_n1_light.png" width="45%" />
</picture><br />
Serving throughput when each request asks for <em> one output completion</em>. vLLM achieves 14x - 24x higher throughput than HF and 2.2x - 2.5x higher throughput than TGI.
</p>

<p align="center">
<picture>
<img src="/assets/figures/perf_a100_n3_light.png" width="45%" />
</picture><picture>
<img src="/assets/figures/perf_a10g_n3_light.png" width="45%" />
</picture>
<br />Serving throughput when each request asks for <em>three parallel output completions</em>. vLLM achieves 8.5x - 15x higher throughput than HF and 3.3x - 3.5x higher throughput than TGI.
</p>

<h3 id="the-secret-sauce-pagedattention">The Secret Sauce: PagedAttention</h3>

<p>In vLLM, we identify that the performance of LLM serving is bottlenecked by memory. In the autoregressive decoding process, all the input tokens to the LLM produce their attention key and value tensors, and these tensors are kept in GPU memory to generate next tokens. These cached key and value tensors are often referred to as KV cache. The KV cache is</p>
<ul>
  <li><em>Large:</em> Takes up to 1.7GB for a single sequence in LLaMA-13B.</li>
  <li><em>Dynamic:</em> Its size depends on the sequence length, which is highly variable and unpredictable.
As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste <strong>60% – 80%</strong> of memory due to fragmentation and over-reservation.</li>
</ul>

<p>To address this problem, we introduce <strong>PagedAttention</strong>, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.</p>

<p align="center">
<picture>
<img src="/assets/figures/annimation0.gif" width="80%" />
</picture>
<br />
<em>PagedAttention:</em> KV Cache are partitioned into blocks. Blocks do not need to be contiguous in memory space.
</p>

<p>Because the blocks do not need to be contiguous in memory, we can manage the keys and values in a more flexible way as in OS’s virtual memory: one can think of blocks as pages, tokens as bytes, and sequences as processes. The contiguous <em>logical blocks</em> of a sequence are mapped to non-contiguous <em>physical blocks</em> via a block table. The physical blocks are allocated on demand as new tokens are generated.</p>

<p align="center">
<picture>
<img src="/assets/figures/annimation1.gif" width="100%" />
</picture>
<br />
Example generation process for a request with PagedAttention.
</p>

<p>In PagedAttention, memory waste only happens in the last block of a sequence. In practice, this results in near-optimal memory usage, with a mere waste of under 4%. This boost in memory efficiency proves highly beneficial: It allows the system to batch more sequences together, increase GPU utilization, and thereby significantly increase the throughput as shown in the performance result above.</p>

<p>PagedAttention has another key advantage: efficient memory sharing. For example, in <em>parallel sampling</em>, multiple output sequences are generated from the same prompt. In this case, the computation and memory for the prompt can be shared between the output sequences.</p>

<p align="center">
<picture>
<img src="/assets/figures/annimation2.gif" width="80%" />
</picture>
<br />
Example of parallel sampling.
</p>

<p>PagedAttention naturally enables memory sharing through its block table. Similar to how processes share physical pages, different sequences in PagedAttention can share the blocks by mapping their logical blocks to the same physical block. To ensure safe sharing, PagedAttention keeps track of the reference counts of the physical blocks and implements the <em>Copy-on-Write</em> mechanism.</p>

<p align="center">
<picture>
<img src="/assets/figures/annimation3.gif" width="100%" />
</picture>
<br />
Example generation process for a request that samples multiple outputs.
</p>

<p>PageAttention’s memory sharing greatly reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to 2.2x improvement in throughput. This makes such sampling methods practical in LLM services.</p>

<p>PagedAttention is the core technology behind vLLM, our LLM inference and serving engine that supports a variety of models with high performance and an easy-to-use interface. For more technical details about vLLM and PagedAttention, check out our <a href="https://github.com/vllm-project/vllm">GitHub repo</a> and stay tuned for our paper.</p>

<h3 id="the-silent-hero-behind-lmsys-vicuna-and-chatbot-arena">The Silent Hero Behind LMSYS Vicuna and Chatbot Arena</h3>

<p>This April, <a href="https://lmsys.org">LMSYS</a> developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in <a href="https://arena.lmsys.org/">Chatbot Arena</a> for millions of users. Initially, LMSYS FastChat adopted a HF Transformers based <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py">serving backend</a> to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM team have worked together and soon developed the FastChat-vLLM integration to use vLLM <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py">as the new backend</a> in order to support the growing demands (up to 5x more traffic). In an early <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py">internal micro-benchmark</a> by LMSYS, the vLLM serving backend can <strong>achieve up to 30x higher throughput than an initial HF backend.</strong></p>

<p>Since mid-April, the most popular models such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration – With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with <em>high throughput</em> and <em>low latency</em>. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAsssiant, and Stability AI’s stableLM. The <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html">support for more models</a> is being developed and forthcoming.</p>

<p align="center">
<picture>
<img src="/assets/figures/lmsys_traffic.png" width="100%" />
</picture>
<br />
Requests served by FastChat-vLLM integration in the Chatbot Arena between April to May. Indeed, more than half of the requests to Chatbot Arena use vLLM as the inference backend.
</p>

<p>This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.</p>

<h3 id="get-started-with-vllm">Get started with vLLM</h3>

<p>Install vLLM with the following command (check out our <a href="https://vllm.readthedocs.io/en/latest/getting_started/installation.html">installation guide</a> for more):</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span>vllm
</code></pre></div></div>

<p>vLLM can be used for both offline inference and online serving. To use vLLM for offline inference, you can import vLLM and use the <code class="language-plaintext highlighter-rouge">LLM</code> class in your Python scripts:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">vllm</span> <span class="kn">import</span> <span class="n">LLM</span>

<span class="n">prompts</span> <span class="o">=</span> <span class="p">[</span><span class="sh">"</span><span class="s">Hello, my name is</span><span class="sh">"</span><span class="p">,</span> <span class="sh">"</span><span class="s">The capital of France is</span><span class="sh">"</span><span class="p">]</span>  <span class="c1"># Sample prompts.
</span><span class="n">llm</span> <span class="o">=</span> <span class="nc">LLM</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">lmsys/vicuna-7b-v1.3</span><span class="sh">"</span><span class="p">)</span>  <span class="c1"># Create an LLM.
</span><span class="n">outputs</span> <span class="o">=</span> <span class="n">llm</span><span class="p">.</span><span class="nf">generate</span><span class="p">(</span><span class="n">prompts</span><span class="p">)</span>  <span class="c1"># Generate texts from the prompts.
</span></code></pre></div></div>

<p>To use vLLM for online serving, you can start an OpenAI API-compatible server via:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python <span class="nt">-m</span> vllm.entrypoints.openai.api_server <span class="nt">--model</span> lmsys/vicuna-7b-v1.3
</code></pre></div></div>

<p>You can query the server with the same format as OpenAI API:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl http://localhost:8000/v1/completions <span class="se">\</span>
    <span class="nt">-H</span> <span class="s2">"Content-Type: application/json"</span> <span class="se">\</span>
    <span class="nt">-d</span> <span class="s1">'{
        "model": "lmsys/vicuna-7b-v1.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'</span>
</code></pre></div></div>

<p>For more ways to use vLLM, please check out the <a href="https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html">quickstart guide</a>.</p>

<p><br /></p>

<hr />

<p><em>Blog written by Woosuk Kwon and Zhuohan Li (UC Berkeley). Special thanks to Hao Zhang for the integration of vLLM and FastChat and for writing the corresponding section. We thank the entire team — Siyuan Zhuang, Ying Sheng, Lianmin Zheng (UC Berkeley), Cody Yu (Independent Researcher), Joey Gonzalez (UC Berkeley), Hao Zhang (UC Berkeley &amp; UCSD), and Ion Stoica (UC Berkeley).</em></p>]]></content><author><name>Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Yu, Joey Gonzalez, Hao Zhang, and Ion Stoica (* Equal Contribution)</name></author><summary type="html"><![CDATA[GitHub | Documentation | Paper]]></summary></entry></feed>