perf: Improve vLLM backend performance by using a separate thread for responses #46
Conversation
@Tabrizian Can you add a description of the code changes in the PR? Also include the performance improvement you observed and in what cases.
@tanmayv25 I updated the PR description.
@kthui So you've gotten perf results close to Iman's after the sync?
yes |
LGTM!
I tested Qwen2-7B-chat on an A100 80GB and this PR did not help: the gap compared to a vLLM-only deployment is nearly 40% at concurrency 64 (comparing vLLM alone, Triton + vLLM with streaming, and Triton + vLLM without streaming).
GPU utilization with Triton + vLLM is significantly lower than with vLLM alone.
What does the PR do?
Triton's output token throughput for the generate endpoint increases by 18% at concurrency 50. There is still a small gap between the vLLM-only and the vLLM + Triton solutions.
The model is llama-2-7b.
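To illustrate the pattern named in the PR title (sending responses from a separate thread so the generation path is not blocked on response delivery), here is a minimal standard-library sketch. The names `ResponseDispatcher` and `send_fn` are hypothetical, not the backend's actual identifiers; in the real backend the worker thread would call Triton's response sender.

```python
import queue
import threading

_SENTINEL = object()  # tells the worker thread to stop


class ResponseDispatcher:
    """Illustrative only: queue responses and deliver them from a dedicated
    thread so the request/generation path never blocks on sending."""

    def __init__(self, send_fn):
        self._send_fn = send_fn  # in the real backend this would wrap Triton's response sender
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, response):
        # Called from the generation path; returns immediately.
        self._queue.put(response)

    def _drain(self):
        # Runs on the worker thread; the potentially slow send happens here.
        while True:
            item = self._queue.get()
            if item is _SENTINEL:
                break
            self._send_fn(item)

    def close(self):
        # Flush remaining items, then stop the worker.
        self._queue.put(_SENTINEL)
        self._worker.join()


if __name__ == "__main__":
    dispatcher = ResponseDispatcher(send_fn=print)
    for chunk in ("Hello", ", ", "world"):
        dispatcher.put(chunk)
    dispatcher.close()
```

A thread-safe queue decouples the producer from the sender, so a burst of generated tokens does not stall generation while earlier responses are still being delivered.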
Changes:
Send responses from a separate thread in the vLLM backend.
Next steps:
Checklist
PR title is of format `<commit_type>: <Title>`.
Commit Type: check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
N/A
Where should the reviewer start?
N/A
Test plan:
This is a performance improvement; existing test cases should be sufficient to cover any possible issues.
Caveats:
N/A
Background
N/A
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A