
Support Batch Completion in Server #2529

Merged

2 commits merged into vllm-project:main on Jan 25, 2024

Conversation

@simon-mo (Collaborator) commented on Jan 21, 2024

The OpenAI completion protocol allows the prompt input to be a string, an array of strings, an array of tokens, or an array of token arrays. This PR adds support for all of these (see the sketch below).

Closes #2441
Closes #2396
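
For reference, a minimal client-side sketch of the four prompt shapes (the base URL, API key, model name, and token ids are placeholders, and the openai>=1.0 Python client is assumed):

from openai import OpenAI

# Assumed: a vLLM OpenAI-compatible server running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# 1. A single string prompt.
client.completions.create(model="my-model", prompt="Hello, my name is")

# 2. An array of string prompts (batch completion).
client.completions.create(model="my-model",
                          prompt=["Hello, my name is", "The capital of France is"])

# 3. A single prompt given as an array of token ids (placeholder ids).
client.completions.create(model="my-model", prompt=[1, 15043, 29892])

# 4. A batch of prompts, each given as an array of token ids (placeholder ids).
client.completions.create(model="my-model", prompt=[[1, 15043], [1, 450, 7483]])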

Comment on lines +43 to +60
if request.echo and request.max_tokens == 0:
    # only return the prompt
    delta_text = res.prompt
    delta_token_ids = res.prompt_token_ids
    top_logprobs = res.prompt_logprobs
    has_echoed[i] = True
elif request.echo and request.max_tokens > 0 and not has_echoed[i]:
    # echo the prompt and first token
    delta_text = res.prompt + output.text
    delta_token_ids = res.prompt_token_ids + output.token_ids
    top_logprobs = res.prompt_logprobs + (output.logprobs or [])
    has_echoed[i] = True
else:
    # return just the delta
    delta_text = output.text[len(previous_texts[i]):]
    delta_token_ids = output.token_ids[previous_num_tokens[i]:]
    top_logprobs = output.logprobs[
        previous_num_tokens[i]:] if output.logprobs else None
@simon-mo (Collaborator, Author):

This part is refactoring for readability
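
As a reading aid, a self-contained restatement of the three echo cases above; the names and plain-string data here are illustrative only, not the actual vLLM code path:

from typing import Tuple

def select_stream_delta(echo: bool, max_tokens: int, already_echoed: bool,
                        prompt: str, output_text: str,
                        previous_text: str) -> Tuple[str, bool]:
    """Return (delta_to_send, echoed_now), mirroring the branch above."""
    if echo and max_tokens == 0:
        # Only the prompt is ever returned.
        return prompt, True
    if echo and max_tokens > 0 and not already_echoed:
        # First chunk: echo the prompt together with the first generated text.
        return prompt + output_text, True
    # Later chunks: send only the newly generated suffix.
    return output_text[len(previous_text):], already_echoed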

Comment on lines +156 to +184
if request.echo and request.max_tokens == 0:
    token_ids = prompt_token_ids
    top_logprobs = prompt_logprobs
    output_text = prompt_text
elif request.echo and request.max_tokens > 0:
    token_ids = prompt_token_ids + output.token_ids
    top_logprobs = prompt_logprobs + output.logprobs
    output_text = prompt_text + output.text
else:
    token_ids = output.token_ids
    top_logprobs = output.logprobs
    output_text = output.text

if request.logprobs is not None:
    logprobs = create_logprobs_fn(
        token_ids=token_ids,
        top_logprobs=top_logprobs,
        num_output_top_logprobs=request.logprobs,
    )
else:
    logprobs = None

choice_data = CompletionResponseChoice(
    index=len(choices),
    text=output_text,
    logprobs=logprobs,
    finish_reason=output.finish_reason,
)
choices.append(choice_data)

num_prompt_tokens = len(final_res.prompt_token_ids)
num_generated_tokens = sum(
    len(output.token_ids) for output in final_res.outputs)
@simon-mo (Collaborator, Author):

This part is refactoring for readability
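
Similarly, a minimal sketch of the non-streaming selection above, operating on plain lists and omitting the logprobs handling; illustrative only, not the vLLM helper itself:

from typing import List, Tuple

def select_completion_fields(echo: bool, max_tokens: int,
                             prompt_token_ids: List[int], prompt_text: str,
                             output_token_ids: List[int],
                             output_text: str) -> Tuple[List[int], str]:
    """Return (token_ids, text) following the same three echo cases."""
    if echo and max_tokens == 0:
        return prompt_token_ids, prompt_text
    if echo and max_tokens > 0:
        return prompt_token_ids + output_token_ids, prompt_text + output_text
    return output_token_ids, output_text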

@zhuohan123 (Collaborator) left a comment:

LGTM! In the future, we could probably also add batched APIs to AsyncLLMEngine so that we can reduce the number of async function calls.
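
For context, a purely hypothetical contrast between the current per-prompt fan-out and a batched engine call; generate_one and generate_batch are invented names here, not real AsyncLLMEngine methods:

import asyncio
from typing import List

async def fan_out(engine, prompts: List[str]) -> List[str]:
    # Current-style pattern: one async call per prompt, gathered concurrently.
    return await asyncio.gather(*(engine.generate_one(p) for p in prompts))

async def batched(engine, prompts: List[str]) -> List[str]:
    # Suggested future pattern: a single batched call into the engine,
    # reducing the number of async function calls per request.
    return await engine.generate_batch(prompts)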

Comment on lines 24 to 27
request: CompletionRequest, raw_request: Request, on_abort,
result_generator: AsyncIterator[tuple[int, RequestOutput]],
create_logprobs_fn, request_id, created_time, model_name,
num_prompts) -> AsyncGenerator[str, None]:
@zhuohan123 (Collaborator):

Add full type annotation for all parameters?
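
For illustration, one possible fully annotated version of that signature; the function name, the Callable types for on_abort and create_logprobs_fn, and the str/int types for the remaining parameters are assumptions here, not taken from the merged code:

from typing import AsyncGenerator, AsyncIterator, Callable, Tuple

from fastapi import Request

async def completion_stream_generator(
        request: "CompletionRequest",              # vLLM protocol type (import omitted)
        raw_request: Request,
        on_abort: Callable[[str], None],           # assumed: abort callback taking a request id
        result_generator: AsyncIterator[Tuple[int, "RequestOutput"]],
        create_logprobs_fn: Callable[..., "LogProbs"],  # assumed return type
        request_id: str,
        created_time: int,
        model_name: str,
        num_prompts: int) -> AsyncGenerator[str, None]:
    ...  # body elided; only the annotated signature is sketched here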

@simon-mo merged commit 3a7dd7e into vllm-project:main on Jan 25, 2024
16 checks passed
NikolaBorisov pushed a commit to deepinfra/vllm that referenced this pull request Jan 31, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
@lixiaolx commented:

@simon-mo Hi, I am glad that the OpenAI interface in vLLM now supports batched completions, but I encountered the following two problems during actual testing:

  1. First, with my installed openai package version (0.28.0; see https://community.openai.com/t/cannot-import-name-openai-from-openai/486147), the example client code fails to import and has to be rewritten for that older interface:
    from openai import OpenAI
ImportError: cannot import name 'OpenAI' from 'openai' (/usr/local/lib/python3.10/dist-packages/openai/__init__.py)

  2. Second, when using echo=True, an error occurs. I traced it to https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_completion.py#L300 ; can you fix it?
completion = openai.Completion.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=True,
    n=2,
    stream=stream,
    logprobs=3)

error:

vllm/entrypoints/openai/serving_completion.py", line 178, in request_output_to_completion_response
    output_text = prompt_text + output.text
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
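
A minimal defensive sketch of that concatenation, assuming the root cause is prompt_text being None (for example when the prompt was supplied as token ids); this is only one possible guard, not necessarily the fix applied upstream:

def build_echoed_text(prompt_text, output_text):
    # Hypothetical helper: tolerate prompt_text being None, which is what
    # triggers the TypeError reported above.
    return (prompt_text or "") + output_text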

Successfully merging this pull request may close these issues:

  API Server batch request issue
  batching and streaming