From 734b320fae7c86645fb4f9cfdff74d6c22d688bd Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 20:25:41 +0000
Subject: [PATCH 1/8] Polish

---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 32 ++++++++++++--------
 1 file changed, 20 insertions(+), 12 deletions(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index 4136585..2d9da02 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -6,26 +6,32 @@ author: "vLLM Team"
---

**TL;DR:**
-- vLLM is as fast as DeepSpeed in common scenarios and faster than Deepspeed when outputs are long.
-- DeepSpeed only outperforms vLLM in long prompt, short output use cases due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
-- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 and community-owned with broad model and optimization support.
+
+- vLLM matches DeepSpeed's speed in common scenarios and surpasses it when handling longer outputs.
+- DeepSpeed only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
+- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 licensed and driven by a community focus, offering extensive model and optimization support.

---

-Recently, the DeepSpeed team published [a blog](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM by utilizing the Dynamic Splitfuse technique. We are happy to see the technology advancements from the open-source community. In this blog, we clarify the workloads that benefit from the Dynamic SplitFuse enhancement, which are quite narrow. For most workloads, vLLM is on par with or faster than DeepSpeed MII.
+The DeepSpeed team recently published [a blog post](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
+We are happy to see the technology advancements within the open-source community.
+In our blog today, we'll elucidate the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
+For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed MII.

-In this post, we will discuss the difference between the two systems, share our benchmarks, and discuss future steps.

### Performance Benchmark

-In terms of performance optimization, we believe there are 2 key differences between vLLM and DeepSpeed:
-DeepSpeed uses a conservative/suboptimal memory allocation scheme, which wastes memory when output lengths are large.
-DeepSpeed uses Dynamic SplitFuse scheduling which gives speedup only when prompt lengths are much greater than output lengths.
+We've identified two key differences between vLLM and DeepSpeed in terms of performance optimization:
+
+1. DeepSpeed adopts a conservative/suboptimal memory allocation scheme, which wastes memory when output lengths are large.
+2. DeepSpeed’s Dynamic SplitFuse scheduling gives speedup only when prompt lengths are much greater than output lengths.

-Consequently, DeepSpeed wins when the workload is consistently long prompt and short output.
+As a result, DeepSpeed outperforms vLLM when the workload is consistently long prompt and short output.
+In other scenarios, vLLM shows superior performance.

#### Scenario 1: Long Prompt Length, Short Output
-In this scenario, we expect DeepSpeed to perform well due to Dynamic SplitFuse. However, the benefit we observe is not as significant as 2x.
+Here, DeepSpeed's Dynamic SplitFuse scheduling is expected to shine.
+However, the performance gain we observe isn't as significant as a 2x increase.

@@ -34,7 +40,7 @@ In this scenario, we expect DeepSpeed to perform well due to Dynamic SplitFuse.

#### Scenario 2: All other cases -In this scenario, we observe vLLM perform better or on par with DeepSpeed. +In these cases, vLLM is up to 1.8x faster than DeepSpeed.

@@ -51,7 +57,9 @@ The vLLM team prioritizes collaborations and we strive to keep the codebase with
Specifically for the Dynamic SplitFuse optimization, we are actively investigating the proper integration. If you have any questions and suggestions, please feel free to contact us on [GitHub](https://github.com/vllm-project/vllm). We also published the benchmark code [here](https://github.com/vllm-project/vllm/pull/1649).

### Appendix: Feature Comparison
-DeepSpeed currently supports only basic functionalities. For example, it only supports 3 types of models and does not support popular features like stop strings and parallel sampling (beam search). We do expect the DeepSpeed open source are eager to catch up and we welcome the creative innovation in the market!
+
+DeepSpeed currently offers basic functionalities, supporting only three model types and lacking popular features like stop strings and parallel sampling (beam search).
+We do expect that the DeepSpeed open-source community is eager to catch up, and we welcome the creative innovation in the market!

| | vLLM | DeepSpeed |
|----------------------------|:---------------------------------------:|:-----------------------------------------------:|

From 1e4fef6a7fcb779b8d8397f39deaaff21abf9f29 Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 20:40:26 +0000
Subject: [PATCH 2/8] Fix link

---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index 2d9da02..9032dd1 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -54,7 +54,7 @@ We are committed to making vLLM the best open-source project incorporating the c
The vLLM team prioritizes collaborations and we strive to keep the codebase with high quality code and easy to contribute.
We are actively working on system performance; as well as new features like LoRA, Speculative Decoding, and better Quantization Support.
Additionally, we are collaborating with hardware vendors like AMD, AWS Inferentia, and Intel Habana to bring LLM to the broadest community.
-Specifically for the Dynamic SplitFuse optimization, we are actively investigating the proper integration. If you have any questions and suggestions, please feel free to contact us on [GitHub](https://github.com/vllm-project/vllm). We also published the benchmark code [here](https://github.com/vllm-project/vllm/pull/1649).
+Specifically for the Dynamic SplitFuse optimization, we are actively investigating the proper integration. If you have any questions and suggestions, please feel free to contact us on [GitHub](https://github.com/vllm-project/vllm). We also published the benchmark code [here](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_throughput.py).

### Appendix: Feature Comparison

From 96d1e575234a83ec391b4c19817dff6d54e36500 Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 12:44:56 -0800
Subject: [PATCH 3/8] Update _posts/2023-11-14-notes-vllm-vs-deepspeed.md

Co-authored-by: Zhuohan Li
---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index 9032dd1..eeacbb8 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -14,7 +14,7 @@ author: "vLLM Team"
---

The DeepSpeed team recently published [a blog post](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
-We are happy to see the technology advancements within the open-source community.
+We are happy to see the technology advancements from the open-source community.

From e51ece8b3169ef9ac06b81f73871a61d3202f8a9 Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 12:45:03 -0800
Subject: [PATCH 4/8] Update _posts/2023-11-14-notes-vllm-vs-deepspeed.md

Co-authored-by: Zhuohan Li
---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index eeacbb8..9ea9f60 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -9,7 +9,7 @@ author: "vLLM Team"

- vLLM matches DeepSpeed's speed in common scenarios and surpasses it when handling longer outputs.
- DeepSpeed only outperforms vLLM in scenarios with long prompts and short outputs, due to its Dynamic SplitFuse optimization. This optimization is on vLLM’s roadmap.
-- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 licensed and driven by a community focus, offering extensive model and optimization support.
+- vLLM’s mission is to build the fastest and easiest-to-use open-source LLM inference and serving engine. It is Apache 2.0 and community-owned, offering extensive model and optimization support.

From ba9eb7994f6d3034be67c0957dc1462db407722b Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 12:45:17 -0800
Subject: [PATCH 5/8] Update _posts/2023-11-14-notes-vllm-vs-deepspeed.md

Co-authored-by: Zhuohan Li
---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index 9ea9f60..4adce62 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -15,7 +15,7 @@ author: "vLLM Team"

The DeepSpeed team recently published [a blog post](https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen) claiming 2x throughput improvement over vLLM, achieved by leveraging the Dynamic SplitFuse technique.
We are happy to see the technology advancements from the open-source community.
-In our blog today, we'll elucidate the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
+In this blog, we show the specific scenarios where the Dynamic SplitFuse technique is advantageous, noting that these cases are relatively limited.
For the majority of workloads, vLLM is faster than (or performs comparably to) DeepSpeed MII.

From a0f139a454d5f96cfc26e5c0caad78690ca3212b Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 12:45:25 -0800
Subject: [PATCH 6/8] Update _posts/2023-11-14-notes-vllm-vs-deepspeed.md

Co-authored-by: Zhuohan Li
---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index 4adce62..8b458d0 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -39,7 +39,7 @@ However, the performance gain we observe isn't as significant as a 2x increase.

-#### Scenario 2: All other cases +#### Scenario 2: Other cases In these cases, vLLM is up to 1.8x faster than DeepSpeed.

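The two workload shapes benchmarked in the patches above (long prompt with short output, and the remaining cases) can be approximated with vLLM's offline API. The sketch below is illustrative only: it is not the benchmark script linked in the patches, and the model name, prompt contents, and token counts are placeholder assumptions.

```python
# Minimal sketch (not the linked benchmark script) for timing the two workload
# shapes discussed above with vLLM's offline API. Model name, prompt contents,
# and token counts are placeholders.
import time

from vllm import LLM, SamplingParams


def tokens_per_second(llm: LLM, prompts: list[str], max_tokens: int) -> float:
    """Generate one batch of completions and return generated tokens per second."""
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens, ignore_eos=True)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    generated = sum(len(out.outputs[0].token_ids) for out in outputs)
    return generated / elapsed


if __name__ == "__main__":
    llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model name

    # Scenario 1: long prompts, short outputs (the Dynamic SplitFuse-friendly case).
    long_prompts = [" ".join(["hello"] * 1024)] * 64
    print("long prompt / short output:", tokens_per_second(llm, long_prompts, 32))

    # Scenario 2: shorter prompts, longer outputs.
    short_prompts = [" ".join(["hello"] * 128)] * 64
    print("short prompt / long output:", tokens_per_second(llm, short_prompts, 512))
```

Swapping the prompt lengths and `max_tokens` values is enough to see how the balance between prefill and decode work shifts between the two scenarios; the numbers such a sketch produces are not the ones reported in the post.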
From 5232941cfe36ef1492bedeb11cdcebae8d84e337 Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 12:45:32 -0800
Subject: [PATCH 7/8] Update _posts/2023-11-14-notes-vllm-vs-deepspeed.md

Co-authored-by: Zhuohan Li
---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index 8b458d0..dda474a 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -58,7 +58,7 @@ Specifically for the Dynamic SplitFuse optimizati

### Appendix: Feature Comparison

-DeepSpeed currently offers basic functionalities, supporting only three model types and lacking popular features like stop strings and parallel sampling (beam search).
+DeepSpeed currently offers basic functionalities, supporting only three model types and lacking popular features like stop strings and parallel sampling (e.g., beam search).
We do expect that the DeepSpeed open-source community is eager to catch up, and we welcome the creative innovation in the market!

| | vLLM | DeepSpeed |

From 783c7628b2cd238d01226ccf4a9abc262a55bfd8 Mon Sep 17 00:00:00 2001
From: Woosuk Kwon
Date: Tue, 14 Nov 2023 12:45:39 -0800
Subject: [PATCH 8/8] Update _posts/2023-11-14-notes-vllm-vs-deepspeed.md

Co-authored-by: Zhuohan Li
---
 _posts/2023-11-14-notes-vllm-vs-deepspeed.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
index dda474a..20f525b 100644
--- a/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
+++ b/_posts/2023-11-14-notes-vllm-vs-deepspeed.md
@@ -31,7 +31,7 @@ In other scenarios, vLLM shows superior performance.

#### Scenario 1: Long Prompt Length, Short Output
Here, DeepSpeed's Dynamic SplitFuse scheduling is expected to shine.
-However, the performance gain we observe isn't as significant as a 2x increase.
+However, the performance gain we observe isn't as significant as 2x.
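On the feature comparison touched by the patches above: stop strings and parallel sampling are expressed through vLLM's `SamplingParams`. The snippet below is a minimal sketch against the `SamplingParams` interface of vLLM releases from this period, which still exposed `use_beam_search`; the model name, prompt, and parameter values are illustrative, not a recommended configuration.

```python
# Illustrative only: stop strings and parallel sampling (beam search) via vLLM's
# SamplingParams, as exposed in late-2023 releases. Model name, prompt, and
# parameter values are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")  # placeholder model name

params = SamplingParams(
    n=2,                   # parallel sampling: return two candidates per prompt
    best_of=4,             # explore four beams and keep the best two
    use_beam_search=True,  # beam search option in this era's SamplingParams
    temperature=0.0,       # beam search expects greedy (temperature 0) settings
    max_tokens=64,
    stop=["\n\n"],         # stop string: truncate generation at the first blank line
)

outputs = llm.generate(["List two differences between vLLM and DeepSpeed-MII:"], params)
for candidate in outputs[0].outputs:
    print(candidate.text.strip())
```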