
Support OpenAI API server in benchmark_serving.py #2172

Merged

2 commits merged into vllm-project:main from support-openai-benchmark on Jan 19, 2024

Conversation

hmellor
Collaborator

@hmellor hmellor commented Dec 18, 2023

Adds --endpoint and --model parameters so that users can benchmark their OpenAI-compatible vLLM server.
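As a rough sketch of how such flags are typically wired up with argparse (the flag names come from the PR description; the default value, help text, and requiredness shown here are illustrative, not the PR's exact code):

import argparse

parser = argparse.ArgumentParser(
    description="Benchmark an OpenAI-compatible vLLM server.")
# Path of the API route to benchmark; the default shown here is illustrative.
parser.add_argument("--endpoint", type=str, default="/v1/completions",
                    help="Endpoint path on the OpenAI-compatible server.")
# Name of the served model, sent in each request payload.
parser.add_argument("--model", type=str, required=True,
                    help="Name of the model to benchmark.")
args = parser.parse_args()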

@hmellor
Collaborator Author

hmellor commented Jan 17, 2024

@simon-mo can I request a review?

This might be something nice to merge before #2433 so that the OpenAI-compatible server can benefit from @ywang96's work.

@simon-mo simon-mo self-assigned this Jan 17, 2024
@simon-mo
Collaborator

simon-mo commented Jan 18, 2024

Can you enable maintainers to edit this PR? I have some small fixes here:

  1. The tqdm part is wrong; it should use as_completed, otherwise it is just measuring request sending, not completion (see the sketch below).
  2. I want to add this to CI if possible.
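For context, a minimal sketch of the suggested pattern, assuming a hypothetical send_request coroutine in place of the real benchmark request: wrapping asyncio.as_completed in tqdm advances the bar when requests finish rather than when they are sent.

import asyncio

from tqdm import tqdm

async def send_request(i: int) -> int:
    # Hypothetical stand-in for one benchmark request.
    await asyncio.sleep(0.1 * (i % 5))
    return i

async def main() -> None:
    tasks = [asyncio.create_task(send_request(i)) for i in range(20)]
    # The bar advances only when a request completes, not when it is sent.
    for coro in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await coro

asyncio.run(main())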

@hmellor
Collaborator Author

hmellor commented Jan 18, 2024

Can you enable maintainers to edit this PR?

Thanks for the review @simon-mo. Unfortunately, I don't think I can, because my PR is from an organisation-owned fork, not a user-owned fork. That said, I'd be happy to make any changes you suggest!

@hmellor
Collaborator Author

hmellor commented Jan 19, 2024

I have just realised that using tqdm.gather (as per my latest commit) is also not the right solution, because it is only called to gather the remaining tasks once the async for loop is complete.

@simon-mo simon-mo merged commit 2709c00 into vllm-project:main Jan 19, 2024
2 of 4 checks passed
@simon-mo
Collaborator

Thanks for making the change. Let me handle adding the benchmark script to CI, since it requires some back and forth.

@hmellor hmellor deleted the support-openai-benchmark branch January 19, 2024 08:35
@hmellor
Collaborator Author

hmellor commented Jan 19, 2024

@simon-mo I've just confirmed this morning that the fix I made in ddd7e33 is not actually correct.

The progress bar only appears once the async for loop is complete, so we only benefit from it once most of the requests have already finished.
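One possible direction (a sketch under assumed names, not necessarily the fix that eventually landed): let each request coroutine update the progress bar itself, so the bar advances per completion even though the tasks are created inside an async for loop and only gathered at the end.

import asyncio

from tqdm import tqdm

async def send_request(delay: float) -> float:
    # Hypothetical stand-in for one benchmark request.
    await asyncio.sleep(delay)
    return delay

async def request_generator():
    # Hypothetical stand-in for the async for loop that paces request arrival.
    for delay in (0.3, 0.1, 0.2, 0.4):
        yield delay
        await asyncio.sleep(0.05)

async def tracked(request, pbar: tqdm):
    result = await request
    pbar.update(1)  # advances as each request completes, not once gather returns
    return result

async def main() -> None:
    pbar = tqdm(total=4)
    tasks = []
    async for delay in request_generator():
        tasks.append(asyncio.create_task(tracked(send_request(delay), pbar)))
    await asyncio.gather(*tasks)
    pbar.close()

asyncio.run(main())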

@simon-mo
Collaborator

Hmm, I see. It is pretty tricky because a one-liner won't do. Feel free to open another PR!

@tattrongvu

tattrongvu commented Jan 24, 2024

Hi guys, thanks for the PR. I just saw this in the benchmark: https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py#L138

It doesn't count the token length of the output response; does it just take max_tokens directly as output_len?

I think we should measure the actual output token throughput, not the requested max tokens.

@hmellor
Collaborator Author

hmellor commented Jan 24, 2024

Hi @tattrongvu, this PR is only really concerned with enabling benchmarking of the OpenAI-compatible server, not with any specifics of the benchmark itself.

It doesn't count the token length of the output response; does it just take max_tokens directly as output_len?

This is partially correct. You're right in saying that the actual output token length is not measured, but it's not using max_tokens either. From the code snippet below, you can see that the output_len actually comes from measuring the length of the completion in the dataset:

prompts = [prompt for prompt, _ in dataset]
prompt_token_ids = tokenizer(prompts).input_ids
completions = [completion for _, completion in dataset]
completion_token_ids = tokenizer(completions).input_ids
tokenized_dataset = []
for i in range(len(dataset)):
    # output_len is the token length of the dataset's reference completion,
    # not of anything the model actually generates.
    output_len = len(completion_token_ids[i])
    tokenized_dataset.append((prompts[i], prompt_token_ids[i], output_len))

This method is not great because it's very unlikely that any model will actually recreate the exact completion in the dataset.

The good news is that this is resolved in #2433!

@ywang96
Collaborator

ywang96 commented Jan 24, 2024

Hello @tattrongvu! As @hmellor mentioned, the original benchmark script always assumes the model generates the same number of tokens as output_len, which I believe won't be true for some engines (e.g., TGI) if they don't have an ignore_eos feature.

One of the fixes I did in #2433 is to postprocess the generated text and measure the number of tokens to give a more accurate result.
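A minimal sketch of that idea, assuming a Hugging Face tokenizer and a hypothetical generated_texts list holding the completions actually returned by the server:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model name

# Hypothetical list of the texts the server actually generated.
generated_texts = ["The quick brown fox", "jumps over the lazy dog."]

# Count the tokens the model actually produced, instead of trusting the
# requested max_tokens or the reference completion length in the dataset.
actual_output_lens = [len(ids) for ids in tokenizer(generated_texts).input_ids]
total_output_tokens = sum(actual_output_lens)
print(total_output_tokens)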

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024