
Support OpenAI API server in benchmark_serving.py #2172

Merged

2 commits merged into vllm-project:main from support-openai-benchmark on Jan 19, 2024

Conversation

hmellor
Collaborator

@hmellor hmellor commented Dec 18, 2023

Adds --endpoint and --model parameters so that users can benchmark their OpenAI-compatible vLLM server.
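As a rough sketch of how such flags are typically wired up with argparse (the flag names come from the PR description; the default value, help text, and requiredness shown here are illustrative, not the PR's exact code):

import argparse

parser = argparse.ArgumentParser(
    description="Benchmark an OpenAI-compatible vLLM server.")
# Path of the API route to benchmark; the default shown here is illustrative.
parser.add_argument("--endpoint", type=str, default="/v1/completions",
                    help="Endpoint path on the OpenAI-compatible server.")
# Name of the served model, sent in each request payload.
parser.add_argument("--model", type=str, required=True,
                    help="Name of the model to benchmark.")
args = parser.parse_args()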

@hmellor
Collaborator Author

hmellor commented Jan 17, 2024

@simon-mo can I request a review?

This might be something nice to merge before #2433 so that the OpenAI-compatible server can benefit from @ywang96's work.

@simon-mo simon-mo self-assigned this Jan 17, 2024
@simon-mo
Collaborator

simon-mo commented Jan 18, 2024

Can you enable maintainers to edit this PR? I have some small fixes here:

  1. The tqdm part is wrong; it should use as_completed, otherwise it is just measuring request sending, not completion (see the sketch below).
  2. I want to add this to CI if possible.
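For context, a minimal sketch of the suggested pattern, assuming a hypothetical send_request coroutine in place of the real benchmark request: wrapping asyncio.as_completed in tqdm advances the bar when requests finish rather than when they are sent.

import asyncio

from tqdm import tqdm

async def send_request(i: int) -> int:
    # Hypothetical stand-in for one benchmark request.
    await asyncio.sleep(0.1 * (i % 5))
    return i

async def main() -> None:
    tasks = [asyncio.create_task(send_request(i)) for i in range(20)]
    # The bar advances only when a request completes, not when it is sent.
    for coro in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await coro

asyncio.run(main())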

@hmellor
Collaborator Author

hmellor commented Jan 18, 2024

Can you enable maintainers to edit this PR?

Thanks for the review @simon-mo. Unfortunately, I don't think I can, because my PR is from an organisation-owned fork, not a user-owned fork. That said, I'd be happy to make any changes you suggest!

@hmellor
Collaborator Author

hmellor commented Jan 19, 2024

I have just realised that using tqdm.gather (as per my latest commit) is also not the right solution, because it is only called to gather the remaining tasks once the async for loop is complete.

@simon-mo simon-mo merged commit 2709c00 into vllm-project:main Jan 19, 2024
2 of 4 checks passed
@simon-mo
Collaborator

Thanks for making the change. Let me handle adding the benchmark script to CI, since it requires some back and forth.

@hmellor hmellor deleted the support-openai-benchmark branch January 19, 2024 08:35
@hmellor
Collaborator Author

hmellor commented Jan 19, 2024

@simon-mo I've just confirmed this morning that the fix I made in ddd7e33 is not actually correct.

The progress bar only appears once the async for loop is complete, so we only benefit from it once most of the requests have already finished.
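One possible direction (a sketch under assumed names, not necessarily the fix that eventually landed): let each request coroutine update the progress bar itself, so the bar advances per completion even though the tasks are created inside an async for loop and only gathered at the end.

import asyncio

from tqdm import tqdm

async def send_request(delay: float) -> float:
    # Hypothetical stand-in for one benchmark request.
    await asyncio.sleep(delay)
    return delay

async def request_generator():
    # Hypothetical stand-in for the async for loop that paces request arrival.
    for delay in (0.3, 0.1, 0.2, 0.4):
        yield delay
        await asyncio.sleep(0.05)

async def tracked(request, pbar: tqdm):
    result = await request
    pbar.update(1)  # advances as each request completes, not once gather returns
    return result

async def main() -> None:
    pbar = tqdm(total=4)
    tasks = []
    async for delay in request_generator():
        tasks.append(asyncio.create_task(tracked(send_request(delay), pbar)))
    await asyncio.gather(*tasks)
    pbar.close()

asyncio.run(main())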

@simon-mo
Collaborator

Hmm, I see. It is pretty tricky because a one-liner won't do. Feel free to open another PR!

@tattrongvu

tattrongvu commented Jan 24, 2024

Hi guys, thanks for the PR. I just saw this in the benchmark: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py#L138

It doesn't count the token length of the output response; does it just take max_tokens directly as output_len?

I think we should measure the actual output token throughput, not the requested max tokens.

@hmellor
Collaborator Author

hmellor commented Jan 24, 2024

Hi @tattrongvu, this PR is only really concerned with enabling benchmarking of the OpenAI-compatible server, not with any specifics of the benchmark itself.

It doesn't count the token length of the output response; does it just take max_tokens directly as output_len?

This is partially correct. You're right in saying that the actual output token length is not measured, but it's not using max_tokens either. From the code snippet below, you can see that the output_len actually comes from measuring the length of the completion in the dataset:

prompts = [prompt for prompt, _ in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [completion for _, completion in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
    # output_len is the token length of the dataset's reference completion,
    # not of anything the model actually generates.
    output_len = len(completion_token_ids[i])
    tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))

This method is not great because it's very unlikely that any model will actually recreate the exact completion in the dataset.

The good news is that this is resolved in #2433!

@ywang96
Collaborator

ywang96 commented Jan 24, 2024

Hello @tattrongvu! As @hmellor mentioned, the original benchmark script always assumes the model generates the same number of tokens as output_len, which I believe won't be true for some engines (e.g., TGI) if they don't have an ignore_eos feature.

One of the fixes I did in #2433 is to postprocess the generated text and measure the number of tokens to give a more accurate result.
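A minimal sketch of that idea, assuming a Hugging Face tokenizer and a hypothetical generated_texts list holding the completions actually returned by the server:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name

# Hypothetical list of the texts the server actually generated.
generated_texts = ["The quick brown fox", "jumps over the lazy dog."]

# Count the tokens the model actually produced, instead of trusting the
# requested max_tokens or the reference completion length in the dataset.
actual_output_lens = [len(ids) for ids in tokenizer(generated_texts).input_ids]
total_output_tokens = sum(actual_output_lens)
print(total_output_tokens)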

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024