Support generation from input embedding #1265

pfldy2850 · 2023-10-05T09:43:57Z

This PR implements the feature of generating text from embedding input (popularly known as inputs_embeds).
This is related to #369 and #416.

More to do

Enhance the test code for generate.
Determine whether the feature degrades core performance.
Add more detail to the comments.
Apply it to async_llm_engine and api_server.

pfldy2850 · 2023-10-12T06:21:08Z

We conducted several tests and confirmed that the performance degradation was not significant.

Specifically, we ran the benchmark five times on both the main branch and the feature branch, using the command below.

python benchmarks/benchmark_latency.py --input-len=2048 --num-iters=5

## main
Avg latency: 0.36247589644044637 seconds
Avg latency: 0.35677395705133674 seconds
Avg latency: 0.3622682703658938 seconds
Avg latency: 0.36043337155133487 seconds
Avg latency: 0.3593990854918957 seconds

## feature
Avg latency: 0.3586543008685112 seconds
Avg latency: 0.3557318979874253 seconds
Avg latency: 0.36645207908004523 seconds
Avg latency: 0.3598199490457773 seconds
Avg latency: 0.36111502479761837 seconds
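
To make the comparison concrete, the five runs per branch can be averaged. The snippet below is a minimal sketch that uses only the latencies quoted above; the variable names are illustrative.

```python
from statistics import mean

# Avg latency (seconds) per run, copied from the benchmark output above.
main_runs = [
    0.36247589644044637,
    0.35677395705133674,
    0.3622682703658938,
    0.36043337155133487,
    0.3593990854918957,
]
feature_runs = [
    0.3586543008685112,
    0.3557318979874253,
    0.36645207908004523,
    0.3598199490457773,
    0.36111502479761837,
]

main_mean = mean(main_runs)
feature_mean = mean(feature_runs)
# Relative change of the feature branch versus main, in percent.
relative_change_pct = (feature_mean - main_mean) / main_mean * 100

print(f"main:    {main_mean:.4f} s")
print(f"feature: {feature_mean:.4f} s")
print(f"change:  {relative_change_pct:+.2f} %")
```

With these numbers the two means differ by less than 0.1 ms, which is well inside the run-to-run spread of either branch.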

bobchen1980

fixed input embedding function related to #369 and #416

pfldy2850 · 2023-10-18T06:56:49Z

@WoosukKwon @zhuohan123

Hello authors, I have tested this PR and aligned it with the latest prepare_inputs function.
Could you please review this PR?

pfldy2850 · 2023-12-10T17:17:37Z

Hi @WoosukKwon ,
I've made some changes to the PR since you last looked at it, so could you please review it again?

Updated the source code to accept the prompt_embeds argument in the refactored code
Updated to handle prompt_embeds only during _prepare_prompt (it is unnecessary during decoding)
Updated all models to accept prompt_embeds as an argument (also added test code)
Made it available in the async engine as well
Updated entrypoints/api_server.py to accept prompt_embeds in the request body
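
The reason prompt_embeds only matters in the prompt step can be illustrated with a toy example. This is a sketch in plain Python, not the code in this PR: the embedding table, the function names, and the length check are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

Vector = List[float]

# Toy embedding table (token id -> vector); a stand-in for the model's
# real embedding layer, used here only for illustration.
EMBEDDING_TABLE = {
    0: [0.0, 0.0],
    1: [0.1, 0.2],
    2: [0.3, 0.4],
}


def embed_tokens(token_ids: Sequence[int]) -> List[Vector]:
    """Look up one embedding vector per token id."""
    return [list(EMBEDDING_TABLE[token_id]) for token_id in token_ids]


def prepare_prompt(
    token_ids: Sequence[int],
    prompt_embeds: Optional[Sequence[Sequence[float]]] = None,
) -> List[Vector]:
    """Prompt step: prefer caller-supplied embeddings when present."""
    if prompt_embeds is None:
        return embed_tokens(token_ids)
    if len(prompt_embeds) != len(token_ids):
        raise ValueError("prompt_embeds must have one vector per prompt position")
    return [list(vector) for vector in prompt_embeds]


def prepare_decode(last_token_id: int) -> List[Vector]:
    """Decode step: the input is a sampled token id, so a lookup is enough."""
    return embed_tokens([last_token_id])


custom = [[9.0, 9.0], [8.0, 8.0]]
print(prepare_prompt([0, 0], prompt_embeds=custom))  # caller's embeddings
print(prepare_prompt([1, 2]))                        # ordinary lookup
print(prepare_decode(2))                             # no prompt_embeds needed
```

Once the prompt has been processed, every later input is a token the model itself sampled, so the ordinary embedding lookup applies and the caller's embeddings are no longer needed.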

I ran the added test code and all tests pass, as shown below.

$ pytest tests/models/test_models.py -k test_models_from_prompt_embeds --forked
...
collected 22 items / 11 deselected / 11 selected

tests/models/test_models.py ...........

I know you are very busy given all the interest in and requests for vLLM.
I would appreciate it if you could review this PR when you have a chance.

will-wiki · 2023-12-13T07:56:42Z

@pfldy2850
Thank you so much for your work! However, I ran into a problem with this version of the code: with almost the same environment, the same input, and the same calling code, the output is not consistent with the official vLLM version.
The calling code, environments, and test results are as follows:

Calling code (identical for both runs):

import time
from vllm import LLM, SamplingParams
prompts = [
    "请介绍下爱因斯坦的生平。",
    "人工智能的未来是什么样的？",
    "自由意志是否存在？",
    "深度学习与机器学习有什么区别？",
    "量子力学的基本原理是什么？"
]
sampling_params = SamplingParams(
    temperature=0, top_p=1, max_tokens=128, repetition_penalty=1.1,
    use_beam_search=True, best_of=5)

llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)

outputs = []
start_time = time.perf_counter()
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
print(f"cost time: {(time.perf_counter() - start_time)/ len(prompts)} ")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Operating environment V1 (official release):
cuda: 11.8
torch: 2.1.0
xformers: 0.0.22.post7+cu118
transformers: 4.35.0
vllm: vllm @ https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl#sha256=7a8b51f0565baaa820f8dc0376e1ff5a732fcabda26397a55becd90e07b5fc63 (official release)

Result:
Prompt: '请介绍下爱因斯坦的生平。', Generated text: '\n爱因斯坦（1879年3月14日—1955 ），出生于德国巴登-符腾堡州乌尔姆市的一个犹太人家庭。他是20世纪最伟大的科学家之一、现代物理学的开创者和奠基人；他创立的狭义相对论和广义相…'
Prompt: '人工智能的未来是什么样的？', Generated text: '人工智能（Artificial Intelligence），英文缩写为AI。它是研究、开发用于模拟人类智能活动的计算机系统技术的科学和工程领域；是利用数字计算方法使电子设备“理解”人话并做出类似人类的反应的技术总称 【1】 [2]'
Prompt: '自由意志是否存在？', Generated text: '自由意志是否存在？\n首先，我们要明白什么是“决定论”。所谓**确定性理论（determinism）又称因果律、机械唯物主义和宿命说等：认为宇宙中一切事物的发展变化都有其必然的规律或原因可循；人的一切行为都由自然法则支配着并受物质运动规律的制约而发生和发展的一种哲学观点与世界观体系 【1】**\n其次要明确一个概念—意识是物质的反映还是独立于客观世界之外的存在呢?这个问题在科学界一直争论不休,至今没有定结论.但可以肯定的是:人的主观能动性的存在'
Prompt: '深度学习与机器学习有什么区别？', Generated text: ' - 云+社区-腾讯云计算\n深度学习与机器学习的区别主要体现在以下几个方面：1、应用领域不同。2.训练方式的不同，3模型结构上的差异以及4数据集上的一些差别等几个方面的内容来分析的'
Prompt: '量子力学的基本原理是什么？', Generated text: '量子力学的基本原理包括：\n1. 波粒二象性（wave-particle duality），即物质同时具有波动性和粒子性的特征。在微观尺度下观察到的是概率分布而不是确定的位置和速度等物理量；2.\n不确定关系(uncertainty relation),它描述了两个相互关联的力学量的精确测量之间的不可能完美一致的关系;3、薛定谔方程(Schrödinger equation),它是描述一个孤立系统随时间演化的微分方程式,其解给出了该系统的态函数及其演化;\n4．Heisenberg Uncertainity Principle (海森'

Operating environment V2:
cuda: 11.8
torch: 2.1.0
xformers: 0.0.22.post7+cu118
transformers: 4.35.0
vllm: https://github.com/pfldy2850/vllm/tree/feature-input-embeds (v0.2.3, installed with 'pip install -e .')

Result:
Prompt: '请介绍下爱因斯坦的生平。', Generated text: '-度小视\nPlease introduce the life of Einstein.\n1万播放 · 0弹幕2021年11月26日18：14更新·原创性地址信息新型科研机构研究基地享誉全球[s][s][]冰雹雷阵雨晴转多云东北风'
Prompt: '人工智能的未来是什么样的？', Generated text: '_亚博yabo888vip官网\n来源：www。xg111.net 发布时间2022-05-25 |浏览量177次返回搜狐，查阅更多责任编辑您好！欢迎参与调查您的性别（单选）男女民族/种族 （多选题）。A汉族B蒙古族C藏族D维吾尔族E回族F苗族人G彝族H布依人I哈萨克J朝鲜K壮族的L白俄罗斯的M蒙古国的N塔吉克O阿塞拜疆P乌兹别克斯坦Q吉尔吉斯斯坦R土库曼S阿富汗T伊朗U巴基斯坦V印度W尼泊尔X'
Prompt: '自由意志是否存在？', Generated text: ' - 哲学百科\n您现在的位置：首页 > TAG信息列表展示> Free Will Existence?\nFree will is the ability to make choices that are not determined by external factors. It has been debated for centuries, with some philosophers arguing it does exist and others claiming there\'s no such thing as free choice.\nThe debate over whether ornot we have a "free"will can be traced back at least two thousand years ago when Aristotle wrote about this topic in his book On The Soul (De Anima). He argued against determinism because he believed humans were capable of making decisions based on their own desires rather than being controlled'
Prompt: '深度学习与机器学习有什么区别？', Generated text: ' - 云+社区-腾讯云计算\n在计算机科学中，人工智能（AI）是研究、开发用于模拟或扩展人类智能的理论和实践的学科。它包括机器人技术；自然语言处理以及专家系统等应用领域的人工神经网络算法的研究和使用等等都是属于这个范畴之内的内容之一了！那么接下来我们就一起来了解一下什么是“**Deep Learning & Machine learning”吧~**\n1. DeepLearning是什么?\n2.Deeplearning的应用场景有哪些呢?'
Prompt: '量子力学的基本原理是什么？', Generated text: "1. 薛定谔方程：描述一个粒子在空间和时间中的状态如何随时间变化。\n2.波函数和概率幅的概念，描述了粒子的位置、动量和能量等物理量之间的关系.\n3..不确定性关系: Heisenberg Uncertainty Principle, which states that the more precisely we know a particle's position or momentum (or both),the less accuratelywe can measure its other properties such as energy and time"

The official release's output is significantly better than the output of the embedding branch. The calling code is unchanged, so I am not sure whether my vLLM installation is incorrect or there is some other cause. Could you please take a look? Thank you very much! I can provide more information if needed.

pfldy2850 · 2023-12-13T13:32:06Z

Hello @will-wiki ,
Thank you for your interest in my work!

I installed the latest vLLM release, v0.2.4, and ran the script you provided.
The latest vLLM generated the same outputs as the package built from this PR.

Version: 0.2.4

Prompt: '请介绍下爱因斯坦的生平。', Generated text: '-度小视\nPlease introduce the life of Einstein.\n1万播放 · 0弹幕2021年11月28日18：15更新·原创性地址信息新型科研机构研究基地享誉全球[s][s][]冰雹多云转晴雷阵雨东北风'
Prompt: '人工智能的未来是什么样的？', Generated text: ' - 云+社区-腾讯云计算\n在过去的几年里，我们见证了计算机视觉、语音识别和自然语言处理等技术的发展。这些技术的进步使机器能够执行以前只能由人类完成的复杂任务：从自动驾驶汽车到智能家居设备再到医疗诊断工具的广泛应用都证明了这一点（见图1）。随着深度学习算法的不断改进以及硬件性能的大幅提升—尤其是GPU的出现使得训练大型神经网络成为可能并大大加快了这个过程的速度[2] [3][4]。此外还有一些其他因素也在推动着这一趋势向前发展；例如数据集规模的扩大带来了更多的可用信息源'
Prompt: '自由意志是否存在？', Generated text: ' - 哲学百科\n您现在的位置：首页 > TAG信息列表展示> Free Will Existence?\nFree will existence? | Philosophy Wiki: The free encyclopedia, the largest online open content multilingual dictionary and reference work of philosophy available in several languages.\nPhilosophy wiki is a community site that anyone can contribute to. Discover what’s happening now on philosphy-wiki.org!'
Prompt: '深度学习与机器学习有什么区别？', Generated text: ' - 云+社区-腾讯云计算\n在计算机科学中，人工智能（AI）是研究、开发用于模拟或扩展人类智能的理论和实践的学科。它包括机器人技术以及自然语言处理等应用领域的人工神经网络算法的研究与应用；而“人工”一词则来源于希腊语中的αριθμός （arithmos=number）。因此可以说： “Artificial Intelligence is the study of how machines can be made to think like humans do so that they too may learn from experience and improve their performance over time through self-correction mechanisms such as reinforcement learning algorithms which allow them not only recognize patterns'
Prompt: '量子力学的基本原理是什么？', Generated text: '1. 薛定谔方程\n2.\n3..\n4.....\n5....'

Version: this PR

Prompt: '请介绍下爱因斯坦的生平。', Generated text: '-度小视\nPlease introduce the life of Einstein.\n1万播放 · 0弹幕2021年11月28日18：15更新·原创性地址信息新型科研机构研究基地享誉全球[s][s][]冰雹多云转晴雷阵雨东北风'
Prompt: '人工智能的未来是什么样的？', Generated text: ' - 云+社区-腾讯云计算\n在过去的几年里，我们见证了计算机视觉、语音识别和自然语言处理等技术的发展。这些技术的进步使机器能够执行以前只能由人类完成的复杂任务：从自动驾驶汽车到智能家居设备再到医疗诊断工具的广泛应用都证明了这一点（见图1）。随着深度学习算法的不断改进以及硬件性能的大幅提升—尤其是GPU的出现使得训练大型神经网络成为可能并大大加快了这个过程的速度[2] [3][4]。此外还有一些其他因素也在推动着这一趋势向前发展；例如数据集规模的扩大带来了更多的可用信息源'
Prompt: '自由意志是否存在？', Generated text: ' - 哲学百科\n您现在的位置：首页 > TAG信息列表展示> Free Will Existence?\nFree will existence? | Philosophy Wiki: The free encyclopedia, the largest online open content multilingual dictionary and reference work of philosophy available in several languages.\nPhilosophy wiki is a community site that anyone can contribute to. Discover what’s happening now on philosphy-wiki.org!'
Prompt: '深度学习与机器学习有什么区别？', Generated text: ' - 云+社区-腾讯云计算\n在计算机科学中，人工智能（AI）是研究、开发用于模拟或扩展人类智能的理论和实践的学科。它包括机器人技术以及自然语言处理等应用领域的人工神经网络算法的研究与应用；而“人工”一词则来源于希腊语中的αριθμός （arithmos=number）。因此可以说： “Artificial Intelligence is the study of how machines can be made to think like humans do so that they too may learn from experience and improve their performance over time through self-correction mechanisms such as reinforcement learning algorithms which allow them not only recognize patterns'
Prompt: '量子力学的基本原理是什么？', Generated text: '1. 薛定谔方程\n2.\n3..\n4.....\n5....'

I think the differences come from changes made to the internlm modeling file after the version you tested.
If you want to learn more about those changes, you can look at the commit history.
https://github.com/vllm-project/vllm/commits/main/vllm/model_executor/models/internlm.py

I hope this explanation helps you troubleshoot the problem.

js8544 · 2024-01-03T14:49:30Z

We've been using this branch in production and it works like a charm. Thanks so much for your contribution. Can't wait for it to be merged!

fedshyvana · 2024-01-21T17:38:05Z

thanks for this! Any plan to merge this into main anytime soon?

pfldy2850 · 2024-01-31T06:43:04Z

Hello @zhuohan123 ,

I just saw that you created an issue for the vLLM Q1 2024 roadmap.

If you plan to consider this feature or to merge this PR,
I would like to resume work on updating it.

matankley · 2024-02-08T12:03:01Z

This PR would be super valuable for us. @pfldy2850 Do you plan to update it to the current master branch? I see it is a bit outdated.

bks5881 · 2024-03-01T09:37:09Z

vllm/entrypoints/api_server.py

    - stream: whether to stream the results or not.
    - other fields: the sampling parameters (See `SamplingParams` for details).
    """
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
+    prompt_embeds = request_dict.pop("prompt_embeds", None)
+    if prompt_embeds is not None:
+        prompt_embeds = torch.tensor(prompt_embeds).to("cuda")


This loads the embeddings as float32, which eats all the GPU memory.
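
A rough size estimate shows how much the dtype matters here. The sketch below is plain arithmetic, not vLLM code: the prompt length of 2048 and hidden size of 4096 are assumed example values (the hidden size depends on the model), and the helper name is hypothetical. torch.tensor() on a nested list of Python floats uses the default dtype, which is float32 unless it has been changed.

```python
import struct


def embeds_nbytes(seq_len: int, hidden_size: int, fmt: str) -> int:
    """Bytes needed for a (seq_len, hidden_size) array of the given struct format."""
    return seq_len * hidden_size * struct.calcsize(fmt)


SEQ_LEN = 2048      # assumed prompt length
HIDDEN_SIZE = 4096  # assumed hidden size

fp32 = embeds_nbytes(SEQ_LEN, HIDDEN_SIZE, "f")  # 4 bytes per element
fp16 = embeds_nbytes(SEQ_LEN, HIDDEN_SIZE, "e")  # 2 bytes per element

print(f"float32: {fp32 / 2**20:.0f} MiB per prompt")
print(f"float16: {fp16 / 2**20:.0f} MiB per prompt")
```

Under these assumptions a single prompt takes 32 MiB in float32 versus 16 MiB in float16. If the model runs in half precision, one option would be to pass a matching dtype to torch.tensor (for example dtype=torch.float16) instead of relying on the default.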

bks5881 · 2024-03-01T09:37:57Z

vllm/entrypoints/api_server.py

@@ -29,16 +30,27 @@ async def generate(request: Request) -> Response:

    The request should be a JSON object with the following fields:
    - prompt: the prompt to use for the generation.
+    - prompt_embeds: the prompt embedding to use for the generation
+        instead of the prompt.
    - stream: whether to stream the results or not.
    - other fields: the sampling parameters (See `SamplingParams` for details).
    """
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")


This throws an error when only prompt_embeds are passed.
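
One way to avoid the error would be to make both fields optional and check that at least one is present. This is only a sketch on a plain dict, not the code in this PR; the function name pop_prompt_inputs and the error message are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple


def pop_prompt_inputs(
    request_dict: Dict[str, Any],
) -> Tuple[Optional[str], Optional[List[List[float]]]]:
    """Remove and return (prompt, prompt_embeds) from a parsed JSON body.

    Both keys are optional, but at least one must be provided.
    """
    # dict.pop with a default does not raise KeyError when the key is absent.
    prompt = request_dict.pop("prompt", None)
    prompt_embeds = request_dict.pop("prompt_embeds", None)
    if prompt is None and prompt_embeds is None:
        raise ValueError("either 'prompt' or 'prompt_embeds' must be provided")
    return prompt, prompt_embeds


body = {"prompt_embeds": [[0.1, 0.2], [0.3, 0.4]], "max_tokens": 16}
prompt, prompt_embeds = pop_prompt_inputs(body)
print(prompt, prompt_embeds)  # prompt is None; the embeddings are returned
print(body)                   # the remaining keys are the sampling parameters
```

Raising a clear error when both fields are missing keeps the failure explicit instead of surfacing as a KeyError from the dictionary lookup.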

bks5881

Thanks a lot for doing this. I tested it and ran into some issues, but fixed them locally.
Also, for some reason torch.cuda.is_available() returned False when serializing, so I had to set CUDA_VISIBLE_DEVICES in ray's __init__.py.

tweeter0830 · 2024-03-22T17:24:41Z

@zhuohan123 Do you have plans for this? It would be really helpful to me for this MR to get merged. I can help push it through if you need.

zhuohan123 · 2024-03-22T19:58:41Z

> @zhuohan123 Do you have plans for this? It would be really helpful to me for this MR to get merged. I can help push it through if you need.

We are doing this in this PR for llava support: #3042. Please take a look and let us know any suggestions!

pfldy2850 added 11 commits October 5, 2023 18:17

feat: add prompt_embeds interface

bed0e15

fix: add get_input_embeddings

3394d25

feat: support all models to generate from embeds

aa9b215

Merge branch 'main' into feature-input-embeds

ce70fe7

fix: bugfix for inputs_embeds and add last line

de4199d

fix: add prompt_embeds to async engine

9275b2d

Merge branch 'main' into feature-input-embeds

e6963eb

fix: bugfix of get_last_token_id

bd5539a

fix: apply prompt_embeds to api_server

99605bc

refact: refactor test_models

87162d2

fix: apply style guide

a3d9de6

pfldy2850 changed the title ~~[WIP] Support generate from input embedding~~ [WIP] Support generation from input embedding Oct 12, 2023

pfldy2850 added 4 commits October 12, 2023 11:56

fix: improve comments

44ff4ec

refact: refactor prepare_inputs and models

a37cef0

fix: apply style guide

9633148

refact: refactor zero embeds

eec19ed

fix: apply style guide

bebc26b

pfldy2850 changed the title ~~[WIP] Support generation from input embedding~~ Support generation from input embedding Oct 12, 2023

bobchen1980 approved these changes Oct 15, 2023

pfldy2850 added 5 commits October 17, 2023 09:40

Merge branch 'main' into feature-input-embeds

a2f2054

Merge branch 'main' into feature-input-embeds

58391ac

fix: update for new prepare_inputs

c28d8bf

fix: rollback commented

117b47f

fix: update style

c0fae79

dimitry12 mentioned this pull request Oct 20, 2023

[Question] Usage with Multimodal LLM #307

Closed

WoosukKwon self-requested a review October 22, 2023 16:42

Merge branch 'main' into feature-input-embeds

2151bc1

WoosukKwon mentioned this pull request Nov 2, 2023

[v0.2.2] Release Tracker #1551

Closed

3 tasks

pfldy2850 added 6 commits November 6, 2023 11:20

Merge branch 'main' into feature-input-embeds

d613790

Merge branch 'main' into feature-input-embeds

1956ce4

Merge branch 'main' into feature-input-embeds

d26465a

fix: update model_runner with input_embeds

0790351

fix: fix typo

e313eae

fix: bug fix

57c1701

pfldy2850 mentioned this pull request Dec 8, 2023

[Feature Request] Support input embedding in LLM.generate() #416

Open

pfldy2850 added 4 commits December 10, 2023 23:47

fix: change input_embeds argument

d266c39

refact: refactor replace_prompt_embeds

662a658

fix: bugfix

1110834

Merge branch 'main' into feature-input-embeds

ff22471

Merge branch 'main' into feature-input-embeds

f2b10c3

Aakash-kaushik mentioned this pull request Jan 23, 2024

feat: Input embeddings #2563

Open

zhuohan123 mentioned this pull request Jan 31, 2024

[Roadmap] vLLM Roadmap Q1 2024 #2681

Closed

30 tasks

bks5881 reviewed Mar 1, 2024
