
[Core] [Bugfix] Refactor block manager subsystem for better testability #3492

Merged: 96 commits merged into vllm-project:main on Mar 28, 2024

Conversation

@cadedaniel (Collaborator) commented Mar 19, 2024

This PR refactors the block manager / allocator / prefix caching to make things easier to test. Concretely, it establishes the following layers (diagrams from https://docs.google.com/document/d/1ipAypZYfZgloP_08sLi1Z2BADHMauWScbT_H1F0ThCQ/edit?pli=1):

[Diagram: block manager subsystem layers]

The interfaces are designed so that BlockAllocator can be implemented either by a NaiveBlockAllocator (no prefix caching) or by a PrefixCachingBlockAllocator. The value of this approach is its separation of concerns and the improved testability that follows.

The following features are still missing before BlockManagerV2 can become the default:

  • Swap in/swap out implementation
  • BlockTable supports sliding window
  • [Prefix caching] Evictor policies (currently an arbitrary block is evicted)
  • [Prefix caching] Tests on prefix caching (e.g. prefix blocks are not evicted)
  • [Prefix caching] Update last_access_time for blocks
  • [Prefix caching] Track computed bit.

Testing

  • Unit tests
=================== 639 passed in 1.73s ==========
  • Correctness test (comparing model output with v1 and v2 block managers, with preemption)
===== 1 passed in 37.54s ========

Design

Key APIs

from typing import List, Optional


class BlockAllocator:
    def allocate_mutable(self, prev_block: Optional["Block"]) -> "Block":
        # A block is mutable if it is not full.
        ...

    def allocate_immutable(self, prev_block: Optional["Block"],
                           token_ids: List[int]) -> "Block":
        # A block is immutable if it is full.
        ...

    def free(self, block: "Block") -> None:
        ...


class Block:
    def append_token_ids(self, token_ids: List[int]) -> None:
        ...


class PrefixCachingBlock(Block):
    def content_hash(self) -> Optional[int]:
        # None while the block is mutable (not yet full); otherwise a hash
        # chained through the previous block's content hash and this block's
        # token ids. (is_full and token_ids are assumed block attributes.)
        if not self.is_full:
            return None
        prev_hash = (self.prev_block.content_hash()
                     if self.prev_block is not None else None)
        return hash((prev_hash, tuple(self.token_ids)))
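
To make the intended usage concrete, here is a minimal sketch of how a sequence's prompt could be laid out with these APIs. It builds on the interface sketch above; the helper function, its name, and the block_size handling are illustrative, not code from this PR.

from typing import List, Optional

def allocate_blocks_for_prompt(allocator: BlockAllocator,
                               token_ids: List[int],
                               block_size: int) -> List[Block]:
    blocks: List[Block] = []
    prev_block: Optional[Block] = None
    for start in range(0, len(token_ids), block_size):
        chunk = token_ids[start:start + block_size]
        if len(chunk) == block_size:
            # A full chunk becomes an immutable block (eligible for prefix caching).
            block = allocator.allocate_immutable(prev_block, chunk)
        else:
            # The trailing partial chunk becomes a mutable block that decode appends to.
            block = allocator.allocate_mutable(prev_block)
            block.append_token_ids(chunk)
        blocks.append(block)
        prev_block = block
    return blocks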

Important interactions

Swapping GPU/CPU

  • Note: this is not implemented in this PR.
    [Diagram: GPU/CPU swap interaction]

Copy-on-write

[Diagram: copy-on-write interaction]
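
Since the diagram is not reproduced here, the following is a rough sketch of the copy-on-write idea under the interfaces above, assuming a per-block refcount. The helper names get_refcount and record_cow are hypothetical; the actual implementation differs in detail.

def maybe_copy_on_write(allocator, block):
    # Sole owner of the block: safe to append tokens in place.
    if allocator.get_refcount(block.block_id) == 1:  # get_refcount is hypothetical
        return block
    # Block is shared by multiple sequences: give this sequence a private copy,
    # record the (src, dst) pair so the engine can issue the actual GPU copy,
    # then drop this sequence's reference to the shared block.
    src_block_id = block.block_id
    new_block = allocator.allocate_mutable(block.prev_block)
    new_block.append_token_ids(block.token_ids)
    allocator.record_cow(src_block_id, new_block.block_id)  # record_cow is hypothetical
    allocator.free(block)
    return new_block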

Prefix caching promotion

[Diagram: prefix caching promotion]
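
Again in place of the diagram, a sketch of the promotion step: once a mutable prefix-caching block fills up, it gains a content hash and can be deduplicated against an already-cached immutable block. The cached_block_ids mapping and the incr_refcount helper are hypothetical names, not the PR's actual fields.

def promote_to_immutable(allocator, block):
    content_hash = block.content_hash()
    assert content_hash is not None, "only full blocks can be promoted"

    cached_block_id = allocator.cached_block_ids.get(content_hash)
    if cached_block_id is None:
        # First time this content is seen: register this block as the cached copy.
        allocator.cached_block_ids[content_hash] = block.block_id
        return block.block_id
    # Identical content already cached: reuse the cached physical block and
    # release the now-redundant one this sequence just filled.
    allocator.incr_refcount(cached_block_id)
    allocator.free(block)
    return cached_block_id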

@cadedaniel changed the title from "[WIP] [Core] [Bugfix] Refactor block manager for better testability" to "[Core] [Bugfix] Refactor block manager subsystem for better testability" on Mar 27, 2024
@cadedaniel marked this pull request as ready for review on Mar 27, 2024 20:53
@simon-mo self-assigned this on Mar 27, 2024
@cadedaniel (Collaborator, Author) commented:
  • Added docstrings
  • Cleaned up code
  • Added e2e correctness test. It compares the generated token ids of opt-125m under the v1 and v2 block managers, using greedy sampling. The test includes preemption -- if the outputs match, the v2 block manager is behaving correctly. A rough sketch of the check is included below.
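
A minimal sketch of that correctness check, assuming the use_v2_block_manager engine flag introduced around this PR; the prompts, options, and function name are illustrative rather than the actual test code.

from vllm import LLM, SamplingParams

def check_block_manager_v2_equivalence(prompts):
    greedy = SamplingParams(temperature=0.0, max_tokens=32)

    baseline = LLM(model="facebook/opt-125m", use_v2_block_manager=False)
    baseline_ids = [out.outputs[0].token_ids
                    for out in baseline.generate(prompts, greedy)]
    del baseline  # free the GPU before starting the second engine

    v2 = LLM(model="facebook/opt-125m", use_v2_block_manager=True)
    v2_ids = [out.outputs[0].token_ids
              for out in v2.generate(prompts, greedy)]

    # With greedy sampling, identical token ids across both runs (including
    # runs where sequences get preempted and recomputed) indicate the v2
    # block manager is managing KV-cache blocks correctly.
    assert baseline_ids == v2_ids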

@cadedaniel (Collaborator, Author) commented:
@simon-mo ready for review


# Now that the batch has been created, we can assume all blocks in the
# batch will have been computed before the next scheduling invocation.
for seq_group in scheduler_outputs.scheduled_seq_groups:
Collaborator:

This assumes the model execution won't fail & there's no retry. It is the general assumption in the engine now?

Collaborator (Author):

yep

Collaborator:

Add this assumption to comments? What happen to later preemption?

Collaborator (Author):

comment added

> What happen to later preemption?

these lines do not change prefix caching behavior wrt later preemption (either case requires recomputing the blocks)

Collaborator (Author):

oh btw @rkooo567 this assumption is not introduced in this PR; the scheduler already assumes that its schedule is implemented by the engine. retries and failures are not yet in scope of vllm

@simon-mo (Collaborator) left a comment:

Thanks for the great work. The new structure is looking great. I don't have major issues, so I'm approving. I left some comments, mostly about things that confused me when reading the code, which I imagine others will run into as well.

Comment on lines 54 to 57
def generator_outer():
    for llm in generator_inner():
        yield llm
        del llm
Collaborator:

why do we need another level of wrapper? would it work without it? if not please comment why

Collaborator:

I see the usage of the llm generator below but still confused since we are only yielding one llm instance

Collaborator (Author):

oh good catch, not necessary

pass


class DeviceAwareBlockAllocator(ABC):
Collaborator:

I do wonder whether this can "inherit" BlockAllocator somehow so we are only re-defining allocate_mutable and allocate_immutable.

Collaborator (Author):

good idea
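
A minimal sketch of what the suggested inheritance could look like, reusing the BlockAllocator interface from the PR description and only re-declaring the two allocation methods with a device parameter. The Device enum and the exact signatures are assumptions for illustration, not the PR's final code.

from abc import ABC, abstractmethod
from enum import Enum
from typing import List, Optional


class Device(Enum):
    GPU = 0
    CPU = 1


class DeviceAwareBlockAllocator(BlockAllocator, ABC):
    @abstractmethod
    def allocate_mutable(self, prev_block: Optional["Block"],
                         device: Device) -> "Block":
        ...

    @abstractmethod
    def allocate_immutable(self, prev_block: Optional["Block"],
                           token_ids: List[int], device: Device) -> "Block":
        ...

    # free() and the rest of the interface keep the signatures inherited
    # from BlockAllocator.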



@@ -196,6 +196,8 @@ def lora_int_id(self) -> int:
        return self.lora_request.lora_int_id if self.lora_request else 0

    def hash_of_block(self, logical_idx: int) -> int:
        # TODO This can produce incorrect hash when block size > prompt size
Collaborator:

😨

Comment on lines 162 to 163
def access_all_blocks_in_seq(self, seq, now):
    pass
Collaborator:

what's the todo here?

Collaborator (Author):

added comment (basically #3667)

Comment on lines +165 to +169
def mark_blocks_as_computed(self, seq_group: SequenceGroup):
    # We ignore the sequence group as its not necessary. After the batch is
    # formed by the scheduler, we do not need to mark blocks from individual
    # sequence groups as computed -- all blocks in the batch can be marked
    # as computed.
Collaborator:

is this a no-op then?

Collaborator (Author):

kinda -- the plumbing is here to validate the design, but prefix caching isn't tested e2e (need #3667)

    caching.

    Args:
        create_block (Block.Factory): A factory function for creating new
Collaborator:

can we parameterize the type somehow so we are constraining the type to NaiveBlock because it only works there.

Collaborator (Author):

hm, open to more typing, but here we need it to return an unspecialized block. the prefix caching allocator will specify to the naive block allocator that it should construct prefix caching blocks instead of the default naive block.

I will add a comment
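
For illustration, a sketch of the factory indirection described above: the naive allocator builds blocks through a create_block callable, and the prefix caching allocator can pass in a factory that constructs PrefixCachingBlock instead. The NaiveBlock class and the constructor arguments are assumptions for the sketch, not the PR's exact code.

from typing import Callable, List, Optional


class NaiveBlock(Block):
    def __init__(self, prev_block: Optional[Block], token_ids: List[int]):
        self.prev_block = prev_block
        self.token_ids = list(token_ids)

    def append_token_ids(self, token_ids: List[int]) -> None:
        self.token_ids.extend(token_ids)


class NaiveBlockAllocator(BlockAllocator):
    def __init__(self, create_block: Callable[..., Block] = NaiveBlock):
        # Defaults to plain NaiveBlock; PrefixCachingBlockAllocator passes a
        # factory that builds PrefixCachingBlock instead.
        self._create_block = create_block

    def allocate_mutable(self, prev_block: Optional[Block]) -> Block:
        return self._create_block(prev_block=prev_block, token_ids=[])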


def free(self, block: Block) -> None:
    block_id = block.block_id
    block.block_id = None
Collaborator:

why is this needed? plz comment?

Collaborator (Author):

added

Refcount = int


class NaiveBlockAllocator(BlockAllocator):
Collaborator:

I think this should be called CoWBlockAllocator btw because it's actually quite powerful

Collaborator (Author):

I thought about this -- the conflict is that the prefix caching allocator also supports CoW (it isn't unique to this allocator).

@cadedaniel (Collaborator, Author) commented Mar 28, 2024

Thanks for the review 🙏. Applying feedback

@youkaichao merged commit 14ccd94 into vllm-project:main on Mar 28, 2024
33 checks passed
@cadedaniel deleted the block-manager-tests branch on Mar 28, 2024 07:15