[3/N] Refactor scheduler for chunked prefill scheduling #3550

rkooo567 · 2024-03-21T07:10:25Z

Refactor the current scheduler so that it stays easy to understand once chunked prefill is added.

This simply moves the logic for prefill scheduling and decoding scheduling into dedicated functions. The reason is that chunked prefill needs a different scheduling policy: by default we do prefill -> decoding, but when chunked prefill is enabled we want decoding -> prefill to reduce the impact on ITL (inter-token latency).
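A minimal sketch of the ordering idea, not the actual vLLM code: once each phase lives in its own function, the policy reduces to the order in which the phases are called. The names `schedule_step`, `phases`, and `chunked_prefill_enabled` are hypothetical and used for illustration only.

```python
from typing import Callable, Dict, List


def schedule_step(phases: Dict[str, Callable[[], List[str]]],
                  chunked_prefill_enabled: bool) -> List[str]:
    """Run the scheduling phases in a policy-dependent order (illustrative only)."""
    # Default policy: prefill first, then decoding.
    # Chunked prefill: decoding first, so running requests are served before
    # new prompt chunks are admitted.
    order = (["decode", "prefill"] if chunked_prefill_enabled
             else ["prefill", "decode"])
    scheduled: List[str] = []
    for name in order:
        scheduled.extend(phases[name]())
    return scheduled


# Hypothetical phase functions that just return labels.
phases = {
    "prefill": lambda: ["prefill:req-2"],
    "decode": lambda: ["decode:req-0", "decode:req-1"],
}
default_order = schedule_step(phases, chunked_prefill_enabled=False)
chunked_order = schedule_step(phases, chunked_prefill_enabled=True)
```

With the phases separated like this, switching policy does not require touching the body of either phase.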

The functionality should be exactly the same, except that the code now uses lora_enabled instead of directly checking whether the LoRA config is None.
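For illustration, a flag like lora_enabled can be a small property that wraps the None check in one place. This is a sketch under assumptions: `ToyLoRAConfig`, its `max_loras` field, and `ToyScheduler` are hypothetical stand-ins, not the real classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLoRAConfig:
    # Hypothetical field; the real config is not shown in this PR description.
    max_loras: int = 1


class ToyScheduler:
    def __init__(self, lora_config: Optional[ToyLoRAConfig] = None) -> None:
        self.lora_config = lora_config

    @property
    def lora_enabled(self) -> bool:
        # Single place that encodes "LoRA is on", instead of repeating
        # `self.lora_config is not None` at every call site.
        return self.lora_config is not None


without_lora = ToyScheduler()
with_lora = ToyScheduler(ToyLoRAConfig())
```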

Related: #3130


rkooo567 · 2024-03-29T13:12:18Z

Benchmark results (I'd say they are effectively the same):

before: Throughput: 2.01 requests/s, 972.94 tokens/s
after: Throughput: 1.99 requests/s, 961.40 tokens/s
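To put the two runs side by side, the relative change can be computed directly from the numbers above. The helper name `relative_change` is only for this example; both metrics come out at roughly a 1% drop, and the comment does not say how much run-to-run variance the benchmark has.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage."""
    return (after - before) / before * 100.0


# Numbers taken from the benchmark comment above.
requests_delta = relative_change(2.01, 1.99)
tokens_delta = relative_change(972.94, 961.40)

print(f"requests/s: {requests_delta:+.2f}%")
print(f"tokens/s:   {tokens_delta:+.2f}%")
```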

rkooo567 · 2024-03-29T15:25:39Z

@simon-mo Updated (please take one more look):

- Swap is a separate API now.
- Each API is more thoroughly tested, with better unit tests for swapping.
- General cleanup (e.g., use dataclass, use better naming).

cadedaniel · 2024-03-29T23:06:23Z

vllm/core/scheduler.py

+        self.running.extend([s.seq_group for s in prefills.seq_groups])
+        self.running.extend([s.seq_group for s in decodes.seq_groups])
+        self.running.extend([s.seq_group for s in swapped_in.seq_groups])


can you help me understand what clears self.running each step?

rkooo567 · Mar 29, 2024

it is popped out when it is scheduled from each func!

cadedaniel · Mar 29, 2024

seems not

vllm/vllm/core/scheduler.py, lines 281 to 287 in 810c56d:

    seq_group = self.running[0]
    new_token_size = (
        seq_group.num_seqs(status=SequenceStatus.RUNNING) *
        self.num_decoding_tokens_per_seq)

    if num_batched_tokens + new_token_size > token_budget:
        break

zhuohan123 · 2024-03-31T08:30:12Z

+1 The logic here is confusing. Some of the requests in self.running will be popped out in _schedule_decodes, but these lines give the impression that self.running is a queue that keeps growing without bound.

rkooo567 · 2024-04-01T09:08:43Z

This is basically the same behavior as master:

vllm/vllm/core/scheduler.py, line 342 in d8658c8:

    self.running = running

It is cleared when the model output is processed:

vllm/vllm/engine/llm_engine.py, line 600 in 563c1d7:

    self.scheduler.free_finished_seq_groups()

I will add a comment here.

zhuohan123

Thanks for the changes! The code looks better than the last version. My main concern with this PR is that self.running seems to be a leaky abstraction that is used everywhere. Can we somehow make this interface a bit cleaner?

rkooo567 · 2024-04-02T13:43:39Z

@zhuohan123 @simon-mo

As we discussed offline, I updated the code based on the proposal I made.

- Made all _schedule APIs as stateless as possible.
- Each API can be used in any order.
- Increased test coverage a lot compared to before, including the beam search case, block updates, swap & preemption, individual APIs, LoRAs, max_seqs, and max_batched_tokens.
- Fixed 2 bugs:
  - The first prefill didn't include the running batched tokens when counting num_batched_tokens.
  - Swapping didn't break out of the loop when it reached max_num_batched_tokens.
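One way to read "stateless" and "usable in any order" is that each phase takes its queue and a shared budget as arguments and returns the leftover queue plus its output, without touching scheduler attributes. The sketch below makes that assumption; `ToyBudget`, `PhaseOutput`, and `schedule_phase` are hypothetical names, not the API introduced by this PR. Sharing one budget across phases is also what addresses both listed bugs in this toy version: prefill sees tokens already batched by running requests, and swap-in stops at the limit.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Request = Tuple[str, int]  # (request_id, number of new tokens)


@dataclass
class ToyBudget:
    token_budget: int
    num_batched_tokens: int = 0

    def can_schedule(self, num_new_tokens: int) -> bool:
        return self.num_batched_tokens + num_new_tokens <= self.token_budget


@dataclass
class PhaseOutput:
    scheduled: List[str] = field(default_factory=list)


def schedule_phase(queue: Deque[Request],
                   budget: ToyBudget) -> Tuple[Deque[Request], PhaseOutput]:
    """Schedule from the front of `queue` until the shared budget is exhausted."""
    remaining = deque(queue)  # work on a copy; the caller's queue is not mutated
    out = PhaseOutput()
    while remaining:
        request_id, num_tokens = remaining[0]
        if not budget.can_schedule(num_tokens):
            break  # stop as soon as the token limit would be exceeded
        remaining.popleft()
        budget.num_batched_tokens += num_tokens
        out.scheduled.append(request_id)
    return remaining, out


budget = ToyBudget(token_budget=8)
running = deque([("r0", 1), ("r1", 1)])
waiting = deque([("w0", 5), ("w1", 4)])
swapped = deque([("s0", 1), ("s1", 1)])

running_left, decodes = schedule_phase(running, budget)
waiting_left, prefills = schedule_phase(waiting, budget)
swapped_left, swapped_in = schedule_phase(swapped, budget)
```

Because the only shared state is the explicit budget, the three calls can be reordered to express a different policy.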

rkooo567 · 2024-04-02T15:22:07Z

vllm/core/scheduler.py

@@ -573,17 +791,13 @@ def _preempt_by_recompute(
            seq.status = SequenceStatus.WAITING
            self.free_seq(seq)
            seq.reset_state_for_recompute()
-        # NOTE: For FCFS, we insert the preempted sequence group to the front
-        # of the waiting queue.
-        self.waiting.appendleft(seq_group)


updated within _schedule now
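A rough sketch of what moving the re-queue step out of the preemption helper can look like: the helper only records the preempted group, and the caller puts the collected groups back at the front of the waiting queue. The function names and the order-preserving choice below are assumptions for illustration; the thread does not say which order the real code uses.

```python
from collections import deque
from typing import Deque, List


def preempt_by_recompute(seq_group: str, preempted: List[str]) -> None:
    """Record the preemption only; the waiting queue is not touched here."""
    preempted.append(seq_group)


def requeue_preempted(waiting: Deque[str], preempted: List[str]) -> None:
    # deque.extendleft inserts items one at a time at the front, which
    # reverses them, so iterate in reverse to keep their original order.
    waiting.extendleft(reversed(preempted))


waiting: Deque[str] = deque(["new-0", "new-1"])
preempted: List[str] = []
preempt_by_recompute("old-0", preempted)
preempt_by_recompute("old-1", preempted)
requeue_preempted(waiting, preempted)
```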

rkooo567 · 2024-04-03T01:14:03Z

sampler test failure seems unrelated

rkooo567 · 2024-04-03T08:17:03Z

lora test failure unrelated

rkooo567 added 30 commits February 27, 2024 22:55

06fe872 [1/n] Support efficient reshape caching.
9a0b6be [2/n] support flash attention kernel
6947167 oss flash attention works
4769a26 in progress
963db44 flash attn enabled.
2b9c36b ip
2c1bb6c support every model
2bb5e62 Fixed broken tests.
4d6a05f [2/n] scheduler changes
0831f84 [2/n] ip
f31371f [2/n]ip
78bb887 ip
b9d93c5 Merge branch 'chunked-prefill-3' into chunked-prefill-scheduler
42dd362 [2/n] ip
74ac900 seems to work.
e3afc25 Merge branch 'chunked-prefill-3' into chunked-prefill-scheduler
6141885 [2/n] ip
71bdada .
d4c3b5d ip?
baef7c6 block tables updated correctly
d503a22 Merge branch 'chunked-prefill-3' into chunked-prefill-scheduler
a12ec68 hopefully tests pass
85760db Merge branch 'chunked-prefill-3' into chunked-prefill-scheduler
e40bc45 [2/n] update sequence data
d85670f [2/n] add prefill range apis
0d8785f Merge branch 'main' into chunked-prefill-3
08c8541 .
3bac9af ip
0ca1284 add data.
2487bda ip

rkooo567 added 2 commits March 29, 2024 03:26

31a039c not done, but good progress.
0480014 ip
ac414b1 add more tests + swapped tests
810c56d Merge branch 'main' into chunked-prefill-scheduler-refactor

cadedaniel reviewed Mar 29, 2024

AgrawalAmey reviewed Mar 29, 2024

zhuohan123 reviewed Mar 31, 2024

rkooo567 added 3 commits April 2, 2024 00:11

8d11423 Addressed small code review.
fe6fb0b Merge branch 'main' into chunked-prefill-scheduler-refactor
85c9b40 work e2e
5e9f549 Merge branch 'main' into chunked-prefill-scheduler-refactor

rkooo567 requested review from zhuohan123 and cadedaniel April 2, 2024 13:54

rkooo567 added 2 commits April 2, 2024 08:27

3ae03f9 retry ci
054e04f Merge branch 'main' into chunked-prefill-scheduler-refactor
2e47f5f Merge branch 'main' into chunked-prefill-scheduler-refactor

simon-mo merged commit 3dcb3e8 into vllm-project:main Apr 3, 2024
35 checks passed

rkooo567 mentioned this pull request Apr 4, 2024

[Chunked Prefill][4/n] Chunked prefill scheduler. #3853

Merged

zhaotyer mentioned this pull request Apr 10, 2024

[Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable #3955

Merged

simon-mo mentioned this pull request Apr 19, 2024

Performance Regression between v0.4.0 and v0.4.1 #4210

Closed

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024

[3/N] Refactor scheduler for chunked prefill scheduling (vllm-project…

86cdde7

…#3550)

dtrifiro mentioned this pull request May 15, 2024

bump ubi base image tag opendatahub-io/vllm#24

Merged
