
Continuous batching for single GPU LLM inference #2628

Merged
49 commits merged into master on Oct 4, 2023

Conversation

@mreso (Collaborator) commented on Sep 29, 2023

Description

This PR enables continuous batching for LLM inference by adding a new batch aggregator that keeps jobs in the batch until they are finished.
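For illustration, a minimal Python sketch of the idea; the names below (Job, ContinuousBatchAggregator, next_batch) are hypothetical and are not the code added in this PR, which implements the aggregator in the frontend as org.pytorch.serve.wlm.ContinuousBatching. The point is that finished jobs leave the batch individually and freed slots are refilled from the request queue, instead of the whole batch waiting for its slowest sequence.

# Hypothetical illustration only; not the actual classes introduced by this PR.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    job_id: str
    prompt: str
    finished: bool = False


@dataclass
class ContinuousBatchAggregator:
    max_batch_size: int
    active_jobs: Dict[str, Job] = field(default_factory=dict)

    def next_batch(self, queue: List[Job]) -> List[Job]:
        # Evict only jobs that have finished generating; unfinished jobs
        # stay in the batch across iterations (the "continuous" part).
        self.active_jobs = {
            jid: job for jid, job in self.active_jobs.items() if not job.finished
        }
        # Refill the freed slots from the request queue.
        while queue and len(self.active_jobs) < self.max_batch_size:
            job = queue.pop(0)
            self.active_jobs[job.job_id] = job
        return list(self.active_jobs.values())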

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • pytest test/pytest/test_continuous_batching.py
============================================================================================================ test session starts =============================================================================================================
platform linux -- Python 3.10.12, pytest-7.3.1, pluggy-1.3.0
rootdir: /home/ubuntu/serve
plugins: mock-3.10.0, cov-4.1.0
collected 3 items

test/pytest/test_continuous_batching.py ..2023-10-03T13:48:12,231 [INFO ] W-9000-streaming_handler_1.0 org.pytorch.serve.wlm.ContinuousBatching - Connection to client got closed; Removing job: 9fbdacb3-a91f-40e8-8fd6-2e7944162aae
2023-10-03T13:48:12,232 [INFO ] W-9000-streaming_handler_1.0-stdout MODEL_METRICS - PredictionTime.ms:10.69|#ModelName:streaming_handler,Level:Model|#hostname:ip-172-31-15-101,requestID:9fbdacb3-a91f-40e8-8fd6-2e7944162aae,timestamp:1696340892
2023-10-03T13:48:12,232 [DEBUG] W-9000-streaming_handler_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-10-03T13:48:12,232 [INFO ] W-9000-streaming_handler_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 11
2023-10-03T13:48:12,232 [INFO ] W-9000-streaming_handler_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:1.0|#Level:Host|#hostname:ip-172-31-15-101,timestamp:1696340892
.                                                                                                                                                                                            [100%]

============================================================================================================== warnings summary ==============================================================================================================
ts/torch_handler/base_handler.py:13
 /home/ubuntu/serve/ts/torch_handler/base_handler.py:13: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
   from pkg_resources import packaging

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================= 3 passed, 1 warning in 14.54s ========================================================================================================

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

logger = logging.getLogger(__name__)


class StreamingHandler(BaseHandler):
Collaborator:
Should we move this handler to ts_handler/distributed or move the core function to handler_utils/distributed?

@mreso (Author):

Let's postpone this to a later PR. I want to get more clarity on the details of the TP implementation first and see what the overlap between them is, to make sure we only move the generic part into core.

codecov bot commented on Oct 3, 2023

Codecov Report

Merging #2628 (fc300b6) into master (a6fd770) will increase coverage by 1.05%.
The diff coverage is 97.26%.

❗ Current head fc300b6 differs from pull request most recent head 7855a9c. Consider uploading reports for the commit 7855a9c to get more accurate results

@@            Coverage Diff             @@
##           master    #2628      +/-   ##
==========================================
+ Coverage   71.34%   72.39%   +1.05%     
==========================================
  Files          85       85              
  Lines        3905     3956      +51     
  Branches       58       58              
==========================================
+ Hits         2786     2864      +78     
+ Misses       1115     1088      -27     
  Partials        4        4              
Files Coverage Δ
ts/context.py 77.92% <100.00%> (+10.38%) ⬆️
ts/tests/unit_tests/test_otf_codec_protocol.py 100.00% <100.00%> (ø)
ts/protocol/otf_message_handler.py 82.41% <75.00%> (+9.82%) ⬆️


@mreso changed the title from "[WIP] Feature/continuous batching for streaming" to "Continuous batching for single GPU LLM inference" on Oct 3, 2023
@mreso mreso marked this pull request as ready for review October 3, 2023 15:31
@mreso mreso requested a review from lxning October 3, 2023 16:39
@HamidShojanazeri (Collaborator) left a comment:

LGTM

@mreso mreso enabled auto-merge October 3, 2023 20:54
@mreso mreso added this pull request to the merge queue Oct 4, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 4, 2023
@mreso mreso added this pull request to the merge queue Oct 4, 2023
Merged via the queue into master with commit 8d12993 Oct 4, 2023
13 of 14 checks passed