Update cache position population and arg order for multimodal runner #14225
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14225
Note: Links to docs will display an error until the docs builds have been completed.
❌ 12 New Failures, 3 Cancelled Jobs as of commit cb633cf with merge base 10e93fb.
NEW FAILURES - The following jobs have failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
(Synced offline)
@@ -94,6 +94,11 @@ Result<uint64_t> MultimodalPrefiller::prefill(
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
auto method_meta = ET_UNWRAP(module_->method_meta(kTextModelMethod));
auto first_input_info = ET_UNWRAP(method_meta.input_tensor_meta(0));
Change to second_input_info
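i.e., rename it and read input 1 instead of input 0, as in the later hunk:

auto second_input_info = ET_UNWRAP(method_meta.input_tensor_meta(1));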
    cache_positions.data(),
    {static_cast<int>(seq_len)},
    executorch::aten::ScalarType::Long);
auto cache_position_tensor = (numel > 1)
Can you do something like

// Declared outside the branches so it stays in scope for the execute() call.
TensorPtr cache_position_tensor;
if (numel > 1) {
  // `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
  // e.g. if start_pos = 2 and encoder_output.size(1) = 5,
  // cache_position_tensor should be [2, 3, 4, 5, 6].
  for (int64_t i = 0; i < seq_len; ++i) {
    cache_positions[i] = start_pos + i;
  }
  cache_position_tensor = ::executorch::extension::from_blob(
      cache_positions.data(),
      {static_cast<int>(seq_len)},
      executorch::aten::ScalarType::Long);
} else {
  // Cache position is size 1.
  cache_position_tensor = ::executorch::extension::from_blob(
      &start_pos, {1}, executorch::aten::ScalarType::Long);
}
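This sketch assumes cache_positions is a caller-owned buffer that outlives the tensor (from_blob does not copy), presumably declared earlier in prefill() along the lines of:

std::vector<int64_t> cache_positions(seq_len);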
@@ -94,6 +94,11 @@ Result<uint64_t> MultimodalPrefiller::prefill(
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
auto method_meta = ET_UNWRAP(module_->method_meta(kTextModelMethod));
Add a comment like:
// Get expected shape of cache position tensor, which should be the second argument
auto second_input_info = ET_UNWRAP(method_meta.input_tensor_meta(1));
auto second_input_sizes = second_input_info.sizes();
auto numel = second_input_sizes[0];
Please reuse the logic here https://github.com/pytorch/executorch/blob/main/extension/llm/runner/text_decoder_runner.cpp#L44-L69
A bit later?
auto cache_position_tensor = ::executorch::extension::from_blob(
    cache_positions.data(),
    {static_cast<int>(seq_len)},
    executorch::aten::ScalarType::Long);
auto prefill_result = module_->execute(
    kTextModelMethod, {cache_position_tensor, encoder_output});
Swap these two
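i.e., pass the encoder output first and the cache position second, consistent with the input_tensor_meta(1) lookup above. A sketch of the swapped call:

auto prefill_result = module_->execute(
    kTextModelMethod, {encoder_output, cache_position_tensor});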
extension/llm/runner/util.h (Outdated)
@@ -99,6 +102,37 @@ ET_EXPERIMENTAL size_t inline get_rss_bytes() {
// when this changed.
return 0;
}

inline runtime::Result<TensorPtr>
populate_start_pos_tensor(Module* module, int64_t& start_pos, int seq_len) {
Add a docstring please, e.g. explaining that we assume the second argument is the cache position / start pos, and that the tensor is populated based on its shape.
Also the name should be populate_start_pos_or_cache_position
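Putting the two comments together, the helper might end up looking roughly like this. This is a sketch, not the landed code: the kTextModelMethod lookup inside the helper, the use of ::executorch::extension::empty for the owning buffer, and the docstring wording are all assumptions.

// Builds the second argument of the text model: either a single start_pos
// value or a full cache_position tensor, depending on the expected shape of
// input 1. Assumes input 1 of the method is the cache position / start pos.
inline runtime::Result<TensorPtr> populate_start_pos_or_cache_position(
    Module* module, int64_t& start_pos, int seq_len) {
  // Assumption: kTextModelMethod is visible here; the landed code may pass
  // the method name in instead.
  auto method_meta = ET_UNWRAP(module->method_meta(kTextModelMethod));
  // Get expected shape of cache position tensor, which should be the second argument.
  auto second_input_info = ET_UNWRAP(method_meta.input_tensor_meta(1));
  auto numel = second_input_info.sizes()[0];
  if (numel > 1) {
    // Model expects a full cache_position tensor:
    // [start_pos, start_pos + 1, ..., start_pos + seq_len - 1].
    // An owning tensor avoids the lifetime issue a from_blob over a local
    // buffer would have.
    auto cache_position_tensor = ::executorch::extension::empty(
        {seq_len}, executorch::aten::ScalarType::Long);
    auto* data = cache_position_tensor->mutable_data_ptr<int64_t>();
    for (int64_t i = 0; i < seq_len; ++i) {
      data[i] = start_pos + i;
    }
    return cache_position_tensor;
  }
  // Model expects a single start_pos value; start_pos is passed by reference
  // so the caller's variable backs the (non-owning) tensor.
  return ::executorch::extension::from_blob(
      &start_pos, {1}, executorch::aten::ScalarType::Long);
}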
Thank you! I'll need to land #14238 first
Summary
For voxtral, we construct the cache_position_tensor as before; for llava, the cache position is constructed internally, so we pass in a size-1 tensor.
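A rough sketch of the resulting call site in prefill() (a sketch only; whether module_ needs .get() depends on how the runner stores it):

auto cache_position_tensor = ET_UNWRAP(
    populate_start_pos_or_cache_position(module_.get(), start_pos, seq_len));

which then feeds the execute call with the encoder output first.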
Test plan
CI