Support Batch Completion in Server #2529
Conversation
if request.echo and request.max_tokens == 0:
    # only return the prompt
    delta_text = res.prompt
    delta_token_ids = res.prompt_token_ids
    top_logprobs = res.prompt_logprobs
    has_echoed[i] = True
elif request.echo and request.max_tokens > 0 and not has_echoed[i]:
    # echo the prompt and first token
    delta_text = res.prompt + output.text
    delta_token_ids = res.prompt_token_ids + output.token_ids
    top_logprobs = res.prompt_logprobs + (output.logprobs or [])
    has_echoed[i] = True
else:
    # return just the delta
    delta_text = output.text[len(previous_texts[i]):]
    delta_token_ids = output.token_ids[previous_num_tokens[i]:]
    top_logprobs = output.logprobs[
        previous_num_tokens[i]:] if output.logprobs else None
This part is a refactor for readability.
if request.echo and request.max_tokens == 0:
    token_ids = prompt_token_ids
    top_logprobs = prompt_logprobs
    output_text = prompt_text
elif request.echo and request.max_tokens > 0:
    token_ids = prompt_token_ids + output.token_ids
    top_logprobs = prompt_logprobs + output.logprobs
    output_text = prompt_text + output.text
else:
    token_ids = output.token_ids
    top_logprobs = output.logprobs
    output_text = output.text
if request.logprobs is not None:
    logprobs = create_logprobs_fn(
        token_ids=token_ids,
        top_logprobs=top_logprobs,
        num_output_top_logprobs=request.logprobs,
    )
else:
    logprobs = None
choice_data = CompletionResponseChoice(
    index=len(choices),
    text=output_text,
    logprobs=logprobs,
    finish_reason=output.finish_reason,
)
choices.append(choice_data)

num_prompt_tokens = len(final_res.prompt_token_ids)
num_generated_tokens = sum(
    len(output.token_ids) for output in final_res.outputs)
This part is a refactor for readability.
LGTM! In the future, we can probably also add batched APIs to AsyncLLMEngine, so that we can reduce the number of async function calls.
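As context for that suggestion, here is a rough sketch of one way the merged (prompt_index, RequestOutput) stream consumed by the handler could be produced today: one engine.generate() call per prompt, fanned out and merged. The helper name merged_batch_generator and its parameters are illustrative, not vLLM API; a batched AsyncLLMEngine call would replace the N sub-requests below.

# Illustrative sketch, not vLLM API: merge N per-prompt streams from
# AsyncLLMEngine.generate() into one stream of (prompt_index, RequestOutput),
# the shape the result_generator parameter below expects.
import asyncio
from typing import AsyncIterator, List, Tuple


async def merged_batch_generator(
    engine,                  # assumed: an AsyncLLMEngine instance
    prompts: List[str],
    sampling_params,         # assumed: a vllm.SamplingParams instance
    request_id: str,
) -> AsyncIterator[Tuple[int, "RequestOutput"]]:
    queue: asyncio.Queue = asyncio.Queue()
    finished = object()  # sentinel marking one prompt's stream as exhausted

    async def consume(i: int, prompt: str) -> None:
        # Submit each prompt as its own sub-request and funnel results
        # into a single queue.
        async for output in engine.generate(prompt, sampling_params,
                                            f"{request_id}-{i}"):
            await queue.put((i, output))
        await queue.put((i, finished))

    tasks = [asyncio.create_task(consume(i, p)) for i, p in enumerate(prompts)]
    remaining = len(prompts)
    while remaining:
        i, item = await queue.get()
        if item is finished:
            remaining -= 1
        else:
            yield i, item
    for task in tasks:
        await task  # propagate any errors raised inside the consumers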
request: CompletionRequest, raw_request: Request, on_abort,
result_generator: AsyncIterator[tuple[int, RequestOutput]],
create_logprobs_fn, request_id, created_time, model_name,
num_prompts) -> AsyncGenerator[str, None]:
Add full type annotations for all parameters?
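For example, a hedged sketch of what a fully annotated signature might look like; the enclosing function name, the import paths, and the types chosen for the currently untyped parameters are assumptions, not the PR's final annotations.

# Sketch of one possible fully annotated signature. The function name,
# import paths, and the types chosen for on_abort, create_logprobs_fn,
# request_id, created_time, model_name, and num_prompts are assumptions.
from typing import AsyncGenerator, AsyncIterator, Awaitable, Callable, Tuple

from fastapi import Request

from vllm.entrypoints.openai.protocol import CompletionRequest, LogProbs
from vllm.outputs import RequestOutput


async def completion_stream_generator(
    request: CompletionRequest,
    raw_request: Request,
    on_abort: Callable[[str], Awaitable[None]],
    result_generator: AsyncIterator[Tuple[int, RequestOutput]],
    create_logprobs_fn: Callable[..., LogProbs],
    request_id: str,
    created_time: int,
    model_name: str,
    num_prompts: int,
) -> AsyncGenerator[str, None]:
    ...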
@simon-mo Hi, I am glad that the OpenAI interface in vLLM now supports batched prompts, but I ran into the following two problems during actual testing:
The OpenAI completion protocol supports inputs for prompt as a string, an array of strings, an array of tokens, or an array of token arrays. This PR adds support for that.
Closes #2441
Closes #2396
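For anyone testing the new behavior, here is a hedged example of the four accepted prompt shapes sent to a locally running server; the URL, model name, and token IDs are placeholders, not values from this PR.

# Placeholder URL, model name, and token IDs; adjust to your deployment.
import requests

url = "http://localhost:8000/v1/completions"
base = {"model": "my-model", "max_tokens": 16}

payloads = [
    {**base, "prompt": "Hello, world"},                 # single string
    {**base, "prompt": ["Hello, world", "Hi there"]},   # array of strings
    {**base, "prompt": [15339, 11, 1917]},              # array of tokens
    {**base, "prompt": [[15339, 11, 1917], [9906]]},    # array of token arrays
]

for payload in payloads:
    response = requests.post(url, json=payload).json()
    # With batched prompts, each prompt contributes its own choice(s).
    for choice in response["choices"]:
        print(choice["index"], repr(choice["text"]))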