
Support Batch Completion in Server #2529

Merged

2 commits merged into vllm-project:main on Jan 25, 2024

Conversation

@simon-mo (Collaborator) commented on Jan 21, 2024

The OpenAI completion protocol allows the prompt input to be a string, an array of strings, an array of tokens, or an array of token arrays. This PR adds support for all of these (see the sketch below).

Closes #2441
Closes #2396
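
For reference, a minimal client-side sketch of the four prompt shapes (the base URL, API key, model name, and token ids are placeholders, and the openai>=1.0 Python client is assumed):

from openai import OpenAI

# Assumed: a vLLM OpenAI-compatible server running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# 1. A single string prompt.
client.completions.create(model="my-model", prompt="Hello, my name is")

# 2. An array of string prompts (batch completion).
client.completions.create(model="my-model",
                          prompt=["Hello, my name is", "The capital of France is"])

# 3. A single prompt given as an array of token ids (placeholder ids).
client.completions.create(model="my-model", prompt=[1, 15043, 29892])

# 4. A batch of prompts, each given as an array of token ids (placeholder ids).
client.completions.create(model="my-model", prompt=[[1, 15043], [1, 450, 7483]])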

Comment on lines +43 to +60
if request.echo and request.max_tokens == 0:
    # only return the prompt
    delta_text = res.prompt
    delta_token_ids = res.prompt_token_ids
    top_logprobs = res.prompt_logprobs
    has_echoed[i] = True
elif request.echo and request.max_tokens > 0 and not has_echoed[i]:
    # echo the prompt and first token
    delta_text = res.prompt + output.text
    delta_token_ids = res.prompt_token_ids + output.token_ids
    top_logprobs = res.prompt_logprobs + (output.logprobs or [])
    has_echoed[i] = True
else:
    # return just the delta
    delta_text = output.text[len(previous_texts[i]):]
    delta_token_ids = output.token_ids[previous_num_tokens[i]:]
    top_logprobs = output.logprobs[
        previous_num_tokens[i]:] if output.logprobs else None
@simon-mo (Collaborator, Author):

This part is refactoring for readability
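
As a reading aid, a self-contained restatement of the three echo cases above; the names and plain-string data here are illustrative only, not the actual vLLM code path:

from typing import Tuple

def select_stream_delta(echo: bool, max_tokens: int, already_echoed: bool,
                        prompt: str, output_text: str,
                        previous_text: str) -> Tuple[str, bool]:
    """Return (delta_to_send, echoed_now), mirroring the branch above."""
    if echo and max_tokens == 0:
        # Only the prompt is ever returned.
        return prompt, True
    if echo and max_tokens > 0 and not already_echoed:
        # First chunk: echo the prompt together with the first generated text.
        return prompt + output_text, True
    # Later chunks: send only the newly generated suffix.
    return output_text[len(previous_text):], already_echoed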

Comment on lines +156 to +184
if request.echo and request.max_tokens == 0:
    token_ids = prompt_token_ids
    top_logprobs = prompt_logprobs
    output_text = prompt_text
elif request.echo and request.max_tokens > 0:
    token_ids = prompt_token_ids + output.token_ids
    top_logprobs = prompt_logprobs + output.logprobs
    output_text = prompt_text + output.text
else:
    token_ids = output.token_ids
    top_logprobs = output.logprobs
    output_text = output.text

if request.logprobs is not None:
    logprobs = create_logprobs_fn(
        token_ids=token_ids,
        top_logprobs=top_logprobs,
        num_output_top_logprobs=request.logprobs,
    )
else:
    logprobs = None

choice_data = CompletionResponseChoice(
    index=len(choices),
    text=output_text,
    logprobs=logprobs,
    finish_reason=output.finish_reason,
)
choices.append(choice_data)

num_prompt_tokens = len(final_res.prompt_token_ids)
num_generated_tokens = sum(
    len(output.token_ids) for output in final_res.outputs)
@simon-mo (Collaborator, Author):

This part is refactoring for readability
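
Similarly, a minimal sketch of the non-streaming selection above, operating on plain lists and omitting the logprobs handling; illustrative only, not the vLLM helper itself:

from typing import List, Tuple

def select_completion_fields(echo: bool, max_tokens: int,
                             prompt_token_ids: List[int], prompt_text: str,
                             output_token_ids: List[int],
                             output_text: str) -> Tuple[List[int], str]:
    """Return (token_ids, text) following the same three echo cases."""
    if echo and max_tokens == 0:
        return prompt_token_ids, prompt_text
    if echo and max_tokens > 0:
        return prompt_token_ids + output_token_ids, prompt_text + output_text
    return output_token_ids, output_text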

@zhuohan123 (Collaborator) left a comment:

LGTM! In the future, we could probably also add batched APIs to AsyncLLMEngine so that we can reduce the number of async function calls.
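
For context, a purely hypothetical contrast between the current per-prompt fan-out and a batched engine call; generate_one and generate_batch are invented names here, not real AsyncLLMEngine methods:

import asyncio
from typing import List

async def fan_out(engine, prompts: List[str]) -> List[str]:
    # Current-style pattern: one async call per prompt, gathered concurrently.
    return await asyncio.gather(*(engine.generate_one(p) for p in prompts))

async def batched(engine, prompts: List[str]) -> List[str]:
    # Suggested future pattern: a single batched call into the engine,
    # reducing the number of async function calls per request.
    return await engine.generate_batch(prompts)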

Comment on lines 24 to 27
request: CompletionRequest, raw_request: Request, on_abort,
result_generator: AsyncIterator[tuple[int, RequestOutput]],
create_logprobs_fn, request_id, created_time, model_name,
num_prompts) -> AsyncGenerator[str, None]:
@zhuohan123 (Collaborator):

Add full type annotation for all parameters?
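
For illustration, one possible fully annotated version of that signature; the function name, the Callable types for on_abort and create_logprobs_fn, and the str/int types for the remaining parameters are assumptions here, not taken from the merged code:

from typing import AsyncGenerator, AsyncIterator, Callable, Tuple

from fastapi import Request

async def completion_stream_generator(
        request: "CompletionRequest",              # vLLM protocol type (import omitted)
        raw_request: Request,
        on_abort: Callable[[str], None],           # assumed: abort callback taking a request id
        result_generator: AsyncIterator[Tuple[int, "RequestOutput"]],
        create_logprobs_fn: Callable[..., "LogProbs"],  # assumed return type
        request_id: str,
        created_time: int,
        model_name: str,
        num_prompts: int) -> AsyncGenerator[str, None]:
    ...  # body elided; only the annotated signature is sketched here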

@simon-mo merged commit 3a7dd7e into vllm-project:main on Jan 25, 2024
16 checks passed
NikolaBorisov pushed a commit to deepinfra/vllm that referenced this pull request Jan 31, 2024
hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024
@lixiaolx commented:

@simon-mo Hi, I am glad that the OpenAI interface in vLLM now supports batched completions, but I encountered the following two problems during actual testing:

  1. First, with my installed openai package version (0.28.0; see https://community.openai.com/t/cannot-import-name-openai-from-openai/486147), the example client code fails to import and has to be rewritten for that older interface:
    from openai import OpenAI
ImportError: cannot import name 'OpenAI' from 'openai' (/usr/local/lib/python3.10/dist-packages/openai/__init__.py)

  2. Second, when using echo=True, an error occurs. I traced it to https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_completion.py#L300 ; can you fix it?
completion = openai.Completion.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=True,
    n=2,
    stream=stream,
    logprobs=3)

error:

vllm/entrypoints/openai/serving_completion.py", line 178, in request_output_to_completion_response
    output_text = prompt_text + output.text
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
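
A minimal defensive sketch of that concatenation, assuming the root cause is prompt_text being None (for example when the prompt was supplied as token ids); this is only one possible guard, not necessarily the fix applied upstream:

def build_echoed_text(prompt_text, output_text):
    # Hypothetical helper: tolerate prompt_text being None, which is what
    # triggers the TypeError reported above.
    return (prompt_text or "") + output_text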

Successfully merging this pull request may close these issues:

  API Server batch request issue
  batching and streaming