Support generation from input embedding #1265

Open
wants to merge 33 commits into
base: main

pfldy2850
Contributor

@pfldy2850 pfldy2850 commented Oct 5, 2023

This PR adds support for generating text from input embeddings (commonly known as inputs_embeds).
This is related to #369 and #416.

More to do

  • Enhance the test code for generate.
  • Determine whether the feature degrades core performance.
  • Add more details to the comments.
  • Apply it to async_llm_engine and api_server.

@pfldy2850 pfldy2850 changed the title [WIP] Support generate from input embedding [WIP] Support generation from input embedding Oct 12, 2023
@pfldy2850
Contributor Author

We conducted several tests and confirmed that the performance degradation was not significant.

Specifically, we ran the benchmark 5 times each on the main branch and the feature branch using the command below.

python benchmarks/benchmark_latency.py --input-len=2048 --num-iters=5

## main
Avg latency: 0.36247589644044637 seconds
Avg latency: 0.35677395705133674 seconds
Avg latency: 0.3622682703658938 seconds
Avg latency: 0.36043337155133487 seconds
Avg latency: 0.3593990854918957 seconds

## feature
Avg latency: 0.3586543008685112 seconds
Avg latency: 0.3557318979874253 seconds
Avg latency: 0.36645207908004523 seconds
Avg latency: 0.3598199490457773 seconds
Avg latency: 0.36111502479761837 seconds
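
To put a number on "not significant", the five runs per branch listed above can be averaged. This is a minimal sketch that uses only the figures reported in this comment; the variable names are illustrative and not part of the PR.

```python
from statistics import mean

# Average latencies (seconds) copied from the benchmark runs above.
main_runs = [
    0.36247589644044637,
    0.35677395705133674,
    0.3622682703658938,
    0.36043337155133487,
    0.3593990854918957,
]
feature_runs = [
    0.3586543008685112,
    0.3557318979874253,
    0.36645207908004523,
    0.3598199490457773,
    0.36111502479761837,
]

main_mean = mean(main_runs)
feature_mean = mean(feature_runs)
# Relative change of the feature branch with respect to main, in percent.
relative_change_pct = (feature_mean - main_mean) / main_mean * 100

print(f"main:    {main_mean:.5f} s")
print(f"feature: {feature_mean:.5f} s")
print(f"change:  {relative_change_pct:+.3f} %")
```

The two means differ by roughly 0.02%, which is smaller than the run-to-run spread within either branch.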

@pfldy2850 pfldy2850 changed the title [WIP] Support generation from input embedding Support generation from input embedding Oct 12, 2023

@bobchen1980 bobchen1980 left a comment

fixed input embedding function related to #369 and #416

@pfldy2850
Contributor Author

@WoosukKwon @zhuohan123

Hello authors, I have tested this PR and aligned it with the latest prepare_inputs function.
Could you please review this PR?

@WoosukKwon WoosukKwon mentioned this pull request Nov 2, 2023
3 tasks
@pfldy2850

This comment was marked as resolved.

@pfldy2850
Contributor Author

Hi @WoosukKwon ,
I've made some changes since you last looked at this PR, so I'd like to ask you to review it again.

  • Updated the source code to accept the prompt_embeds argument in the refactored code
  • Updated it to handle prompt_embeds only during _prepare_prompt (it is unnecessary during decoding)
  • Updated all models to accept prompt_embeds as an argument (also added test code)
  • Made it available in the async engine as well
  • Updated entrypoints/api_server.py to accept prompt_embeds in the request body
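
As an illustration of the last item, a request body carrying prompt_embeds might be assembled as in the sketch below. The field names prompt, prompt_embeds and stream come from the api_server docstring in this PR; the tensor shape, the placeholder values and the empty prompt string are assumptions made for the example, not behaviour confirmed by the PR.

```python
import json

# Hypothetical sizes, for illustration only; a real request would carry
# one embedding vector per prompt token, with the model's hidden size.
seq_len, hidden_size = 3, 4

# Placeholder embedding values as nested lists so they are JSON-serializable.
prompt_embeds = [[0.0] * hidden_size for _ in range(seq_len)]

payload = {
    "prompt": "",                    # assumed placeholder value
    "prompt_embeds": prompt_embeds,  # used instead of the prompt
    "stream": False,
    # Remaining fields are forwarded as sampling parameters.
    "temperature": 0.0,
    "max_tokens": 128,
}

body = json.dumps(payload)
print(len(body), "bytes")
```

Sending embeddings as JSON lists is verbose, so payload size grows quickly with prompt length and hidden size.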

I have run the added test code and all tests pass, as shown below.

$ pytest tests/models/test_models.py -k test_models_from_prompt_embeds --forked
...
collected 22 items / 11 deselected / 11 selected

tests/models/test_models.py ...........

I know you are very busy with a lot of interest and requests for vLLM.
I would appreciate it if you could review this PR when you have a chance.

@will-wiki

will-wiki commented Dec 13, 2023

@pfldy2850
Thank you so much for your work! However, I ran into a problem with this version of the code: with almost the same environment, the same input, and the same calling code, the results are not consistent with the official vLLM release.
The relevant call code, environment, and test results are as follows:

Same call code:

import time
from vllm import LLM, SamplingParams
prompts = [
    "请介绍下爱因斯坦的生平。",
    "人工智能的未来是什么样的?",
    "自由意志是否存在?",
    "深度学习与机器学习有什么区别?",
    "量子力学的基本原理是什么?"
]
sampling_params = SamplingParams(
    temperature=0, top_p=1, max_tokens=128, repetition_penalty=1.1,
    use_beam_search=True, best_of=5)

llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)

outputs = []
start_time = time.perf_counter()
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
print(f"cost time: {(time.perf_counter() - start_time)/ len(prompts)} ")

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Operating environment V1 (official):
cuda:11.8
torch:2.1.0
xformers:0.0.22.post7+cu118
transformers:4.35.0
vllm:vllm @ https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl#sha256=7a8b51f0565baaa820f8dc0376e1ff5a732fcabda26397a55becd90e07b5fc63 (official)

Result:
Prompt: '请介绍下爱因斯坦的生平。', Generated text: '\n爱因斯坦(1879年3月14日—1955 ),出生于德国巴登-符腾堡州乌尔姆市的一个犹太人家庭。他是20世纪最伟大的科学家之一、现代物理学的开创者和奠基人;他创立的狭义相对论和广义相…'
Prompt: '人工智能的未来是什么样的?', Generated text: '人工智能(Artificial Intelligence),英文缩写为AI。它是研究、开发用于模拟人类智能活动的计算机系统技术的科学和工程领域;是利用数字计算方法使电子设备“理解”人话并做出类似人类的反应的技术总称 【1】 [2]'
Prompt: '自由意志是否存在?', Generated text: '自由意志是否存在?\n首先,我们要明白什么是“决定论”。所谓**确定性理论(determinism)又称因果律、机械唯物主义和宿命说等:认为宇宙中一切事物的发展变化都有其必然的规律或原因可循;人的一切行为都由自然法则支配着并受物质运动规律的制约而发生和发展的一种哲学观点与世界观体系 【1】**\n其次要明确一个概念—意识是物质的反映还是独立于客观世界之外的存在呢?这个问题在科学界一直争论不休,至今没有定结论.但可以肯定的是:人的主观能动性的存在'
Prompt: '深度学习与机器学习有什么区别?', Generated text: ' - 云+社区-腾讯云计算\n深度学习与机器学习的区别主要体现在以下几个方面:1、应用领域不同。2.训练方式的不同,3模型结构上的差异以及4数据集上的一些差别等几个方面的内容来分析的'
Prompt: '量子力学的基本原理是什么?', Generated text: '量子力学的基本原理包括:\n1. 波粒二象性(wave-particle duality),即物质同时具有波动性和粒子性的特征。在微观尺度下观察到的是概率分布而不是确定的位置和速度等物理量;2.\n不确定关系(uncertainty relation),它描述了两个相互关联的力学量的精确测量之间的不可能完美一致的关系;3、薛定谔方程(Schrödinger equation),它是描述一个孤立系统随时间演化的微分方程式,其解给出了该系统的态函数及其演化;\n4.Heisenberg Uncertainity Principle (海森'
Operating environment V2:
cuda:11.8
torch:2.1.0
xformers:0.0.22.post7+cu118
transformers:4.35.0
vllm:https://github.com/pfldy2850/vllm/tree/feature-input-embeds (v0.2.3, installed with 'pip install -e .')

Result:
Prompt: '请介绍下爱因斯坦的生平。', Generated text: '-度小视\nPlease introduce the life of Einstein.\n1万播放 · 0弹幕2021年11月26日18:14更新·原创性地址信息新型科研机构研究基地享誉全球[s][s][]冰雹雷阵雨晴转多云东北风'
Prompt: '人工智能的未来是什么样的?', Generated text: '_亚博yabo888vip官网\n来源:www。xg111.net 发布时间2022-05-25 |浏览量177次返回搜狐,查阅更多责任编辑您好!欢迎参与调查您的性别(单选)男女民族/种族 (多选题)。A汉族B蒙古族C藏族D维吾尔族E回族F苗族人G彝族H布依人I哈萨克J朝鲜K壮族的L白俄罗斯的M蒙古国的N塔吉克O阿塞拜疆P乌兹别克斯坦Q吉尔吉斯斯坦R土库曼S阿富汗T伊朗U巴基斯坦V印度W尼泊尔X'
Prompt: '自由意志是否存在?', Generated text: ' - 哲学百科\n您现在的位置:首页 > TAG信息列表展示> Free Will Existence?\nFree will is the ability to make choices that are not determined by external factors. It has been debated for centuries, with some philosophers arguing it does exist and others claiming there\'s no such thing as free choice.\nThe debate over whether ornot we have a "free"will can be traced back at least two thousand years ago when Aristotle wrote about this topic in his book On The Soul (De Anima). He argued against determinism because he believed humans were capable of making decisions based on their own desires rather than being controlled'
Prompt: '深度学习与机器学习有什么区别?', Generated text: ' - 云+社区-腾讯云计算\n在计算机科学中,人工智能(AI)是研究、开发用于模拟或扩展人类智能的理论和实践的学科。它包括机器人技术;自然语言处理以及专家系统等应用领域的人工神经网络算法的研究和使用等等都是属于这个范畴之内的内容之一了!那么接下来我们就一起来了解一下什么是“**Deep Learning & Machine learning”吧~**\n1. DeepLearning是什么?\n2.Deeplearning的应用场景有哪些呢?'
Prompt: '量子力学的基本原理是什么?', Generated text: "1. 薛定谔方程:描述一个粒子在空间和时间中的状态如何随时间变化。\n2.波函数和概率幅的概念,描述了粒子的位置、动量和能量等物理量之间的关系.\n3..不确定性关系: Heisenberg Uncertainty Principle, which states that the more precisely we know a particle's position or momentum (or both),the less accuratelywe can measure its other properties such as energy and time"

The official output is significantly better than the output of the embedding version of the code. The calling code has not changed, so I am not sure whether my vLLM installation is incorrect or there is some other cause. Could you please take a look? Thank you very much! I can provide more information if needed.

@pfldy2850
Contributor Author

pfldy2850 commented Dec 13, 2023

Hello @will-wiki ,
Thank you for your interest in my work!

I installed the latest vLLM release, v0.2.4, and ran the script you provided.
The latest vLLM release generated the same outputs as the build from this PR.

Version: 0.2.4

Prompt: '请介绍下爱因斯坦的生平。', Generated text: '-度小视\nPlease introduce the life of Einstein.\n1万播放 · 0弹幕2021年11月28日18:15更新·原创性地址信息新型科研机构研究基地享誉全球[s][s][]冰雹多云转晴雷阵雨东北风'
Prompt: '人工智能的未来是什么样的?', Generated text: ' - 云+社区-腾讯云计算\n在过去的几年里,我们见证了计算机视觉、语音识别和自然语言处理等技术的发展。这些技术的进步使机器能够执行以前只能由人类完成的复杂任务:从自动驾驶汽车到智能家居设备再到医疗诊断工具的广泛应用都证明了这一点(见图1)。随着深度学习算法的不断改进以及硬件性能的大幅提升—尤其是GPU的出现使得训练大型神经网络成为可能并大大加快了这个过程的速度[2] [3][4]。此外还有一些其他因素也在推动着这一趋势向前发展;例如数据集规模的扩大带来了更多的可用信息源'
Prompt: '自由意志是否存在?', Generated text: ' - 哲学百科\n您现在的位置:首页 > TAG信息列表展示> Free Will Existence?\nFree will existence? | Philosophy Wiki: The free encyclopedia, the largest online open content multilingual dictionary and reference work of philosophy available in several languages.\nPhilosophy wiki is a community site that anyone can contribute to. Discover what’s happening now on philosphy-wiki.org!'
Prompt: '深度学习与机器学习有什么区别?', Generated text: ' - 云+社区-腾讯云计算\n在计算机科学中,人工智能(AI)是研究、开发用于模拟或扩展人类智能的理论和实践的学科。它包括机器人技术以及自然语言处理等应用领域的人工神经网络算法的研究与应用;而“人工”一词则来源于希腊语中的αριθμός (arithmos=number)。因此可以说: “Artificial Intelligence is the study of how machines can be made to think like humans do so that they too may learn from experience and improve their performance over time through self-correction mechanisms such as reinforcement learning algorithms which allow them not only recognize patterns'
Prompt: '量子力学的基本原理是什么?', Generated text: '1. 薛定谔方程\n2.\n3..\n4.....\n5....'
Version: this PR

Prompt: '请介绍下爱因斯坦的生平。', Generated text: '-度小视\nPlease introduce the life of Einstein.\n1万播放 · 0弹幕2021年11月28日18:15更新·原创性地址信息新型科研机构研究基地享誉全球[s][s][]冰雹多云转晴雷阵雨东北风'
Prompt: '人工智能的未来是什么样的?', Generated text: ' - 云+社区-腾讯云计算\n在过去的几年里,我们见证了计算机视觉、语音识别和自然语言处理等技术的发展。这些技术的进步使机器能够执行以前只能由人类完成的复杂任务:从自动驾驶汽车到智能家居设备再到医疗诊断工具的广泛应用都证明了这一点(见图1)。随着深度学习算法的不断改进以及硬件性能的大幅提升—尤其是GPU的出现使得训练大型神经网络成为可能并大大加快了这个过程的速度[2] [3][4]。此外还有一些其他因素也在推动着这一趋势向前发展;例如数据集规模的扩大带来了更多的可用信息源'
Prompt: '自由意志是否存在?', Generated text: ' - 哲学百科\n您现在的位置:首页 > TAG信息列表展示> Free Will Existence?\nFree will existence? | Philosophy Wiki: The free encyclopedia, the largest online open content multilingual dictionary and reference work of philosophy available in several languages.\nPhilosophy wiki is a community site that anyone can contribute to. Discover what’s happening now on philosphy-wiki.org!'
Prompt: '深度学习与机器学习有什么区别?', Generated text: ' - 云+社区-腾讯云计算\n在计算机科学中,人工智能(AI)是研究、开发用于模拟或扩展人类智能的理论和实践的学科。它包括机器人技术以及自然语言处理等应用领域的人工神经网络算法的研究与应用;而“人工”一词则来源于希腊语中的αριθμός (arithmos=number)。因此可以说: “Artificial Intelligence is the study of how machines can be made to think like humans do so that they too may learn from experience and improve their performance over time through self-correction mechanisms such as reinforcement learning algorithms which allow them not only recognize patterns'
Prompt: '量子力学的基本原理是什么?', Generated text: '1. 薛定谔方程\n2.\n3..\n4.....\n5....'

I think the differences come from changes made to the InternLM modeling file after the version you tested (v0.2.2).
If you want to learn more about those changes, you can look at the commit history.
https://github.com/vllm-project/vllm/commits/main/vllm/model_executor/models/internlm.py

I hope this explanation helps you troubleshoot the problem.
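
When checking whether two builds behave the same, comparing the outputs prompt by prompt is less error-prone than reading them side by side. The sketch below shows one way to do that; the dictionaries are hypothetical stand-ins for the (prompt, generated text) pairs printed by each environment, and diverging_prompts is an illustrative helper, not part of vLLM.

```python
# Hypothetical stand-ins for the (prompt, generated_text) pairs printed by
# each environment; replace them with the real outputs.
outputs_a = {"prompt 1": "same text", "prompt 2": "text from build A"}
outputs_b = {"prompt 1": "same text", "prompt 2": "text from build B"}


def diverging_prompts(a, b):
    """Return the prompts whose generated text differs between two runs."""
    return [prompt for prompt in a if a[prompt] != b.get(prompt)]


print(diverging_prompts(outputs_a, outputs_b))
```

An empty result for a pair of builds means every prompt produced identical text in both.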

@js8544
Contributor

js8544 commented Jan 3, 2024

We've been using this branch in production and it works like a charm. Thanks so much for your contribution. Can't wait for it to be merged!

@fedshyvana

Thanks for this! Any plans to merge this into main soon?

@pfldy2850
Contributor Author

Hello @zhuohan123 ,

I just saw that you created an issue for the vLLM Q1 2024 roadmap.

If you plan to consider this feature or to merge this PR,
I would like to resume updating it.

@matankley

This PR would be super valuable for us. @pfldy2850, do you plan to update it for the current master branch? It looks a bit outdated.

    - stream: whether to stream the results or not.
    - other fields: the sampling parameters (See `SamplingParams` for details).
    """
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")
    prompt_embeds = request_dict.pop("prompt_embeds", None)
    if prompt_embeds is not None:
        prompt_embeds = torch.tensor(prompt_embeds).to("cuda")

@bks5881 bks5881 Mar 1, 2024

This loads the embeddings in float32, which eats all the GPU memory.
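
For context, torch.tensor called on Python floats produces a float32 tensor by default, so each element takes twice the space of a float16 value. The sketch below only does the size arithmetic for a single request; the sequence length and hidden size are assumptions chosen for illustration and do not come from this PR.

```python
# Assumed sizes for illustration; neither value comes from this PR.
seq_len = 2048      # prompt length in tokens
hidden_size = 4096  # embedding width of a hypothetical 7B-class model

BYTES_FLOAT32 = 4
BYTES_FLOAT16 = 2


def embeds_bytes(seq_len, hidden_size, bytes_per_element):
    """Size in bytes of one (seq_len, hidden_size) prompt embedding tensor."""
    return seq_len * hidden_size * bytes_per_element


fp32_mib = embeds_bytes(seq_len, hidden_size, BYTES_FLOAT32) / 2**20
fp16_mib = embeds_bytes(seq_len, hidden_size, BYTES_FLOAT16) / 2**20
print(f"float32: {fp32_mib:.0f} MiB per request")
print(f"float16: {fp16_mib:.0f} MiB per request")
```

Passing a dtype argument that matches the model's dtype to torch.tensor would be one way to avoid the float32 copy; that suggestion is untested against this branch.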

@@ -29,16 +30,27 @@ async def generate(request: Request) -> Response:

    The request should be a JSON object with the following fields:
    - prompt: the prompt to use for the generation.
    - prompt_embeds: the prompt embedding to use for the generation
    instead of the prompt.
    - stream: whether to stream the results or not.
    - other fields: the sampling parameters (See `SamplingParams` for details).
    """
    request_dict = await request.json()
    prompt = request_dict.pop("prompt")

This throws an error when only prompt_embeds are passed.
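
The error arises because dict.pop("prompt") raises KeyError when the key is absent. One possible fix, sketched below, is to pop both fields with a default and reject the request only when neither is present. parse_generate_request is a hypothetical helper written for this example and is not part of the PR.

```python
def parse_generate_request(request_dict):
    """Split a request body into prompt, prompt_embeds and sampling fields.

    Hypothetical helper, not part of this PR.
    """
    # Use a default so a body without "prompt" does not raise KeyError.
    prompt = request_dict.pop("prompt", None)
    prompt_embeds = request_dict.pop("prompt_embeds", None)
    if prompt is None and prompt_embeds is None:
        raise ValueError("Either 'prompt' or 'prompt_embeds' is required.")
    stream = request_dict.pop("stream", False)
    # Whatever remains is treated as sampling parameters.
    return prompt, prompt_embeds, stream, request_dict


prompt, embeds, stream, sampling = parse_generate_request(
    {"prompt_embeds": [[0.1, 0.2]], "max_tokens": 16}
)
print(prompt, embeds, stream, sampling)
```

Whether the engine accepts a None prompt downstream depends on the rest of the branch and is not checked here.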


@bks5881 bks5881 left a comment

Thanks a lot for doing this. I tested it and ran into some issues, which I fixed locally.
Also, for some reason torch.cuda.is_available() returned False when serializing, so I had to set CUDA_VISIBLE_DEVICES in ray. init.py

@tweeter0830

@zhuohan123 Do you have plans for this? It would be really helpful to me for this MR to get merged. I can help push it through if you need.

@zhuohan123
Collaborator

We are doing this in this PR for llava support: #3042. Please take a look and let us know any suggestions!


9 participants