
Conversation

kirklandsign
Contributor

Summary

For Voxtral, we construct the cache_position_tensor as before; for LLaVA, the model constructs the cache positions internally, so we pass in a size-1 tensor holding only start_pos.
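
A minimal sketch of that dispatch, assuming (as the review below settles on) that the text model's second input describes the expected cache-position shape:

// Sketch only; mirrors the diffs discussed below, not the merged code.
auto method_meta = ET_UNWRAP(module_->method_meta(kTextModelMethod));
auto second_input_info = ET_UNWRAP(method_meta.input_tensor_meta(1));
auto numel = second_input_info.sizes()[0];
if (numel > 1) {
  // Voxtral-style: pass the explicit range [start_pos, start_pos + seq_len).
} else {
  // LLaVA-style: positions are constructed inside the model; pass {start_pos} alone.
}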

Test plan

CI


pytorch-bot bot commented Sep 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14225

Note: Links to docs will display an error until the docs builds have been completed.

❌ 12 New Failures, 3 Cancelled Jobs

As of commit cb633cf with merge base 10e93fb:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kirklandsign kirklandsign marked this pull request as ready for review September 11, 2025 19:33
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 11, 2025
Contributor

@jackzhxng jackzhxng left a comment

(Synced offline)

@@ -94,6 +94,11 @@ Result<uint64_t> MultimodalPrefiller::prefill(
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
auto method_meta = ET_UNWRAP(module_->method_meta(kTextModelMethod));
auto first_input_info = ET_UNWRAP(method_meta.input_tensor_meta(0));
Contributor

Change to second_input_info
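
That is, index input 1 rather than input 0, roughly:

auto second_input_info = ET_UNWRAP(method_meta.input_tensor_meta(1));

(the same line that appears in the updated diff further down)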

cache_positions.data(),
{static_cast<int>(seq_len)},
executorch::aten::ScalarType::Long);
auto cache_position_tensor = (numel > 1)
Contributor

Can you do something like

// Declared before the branch so the tensor outlives both arms.
TensorPtr cache_position_tensor;
if (numel > 1) {
    // `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
    // e.g. if start_pos = 2 and encoder_output.size(1) = 5,
    // cache_position_tensor should be [2, 3, 4, 5, 6].
    for (int64_t i = 0; i < seq_len; ++i) {
      cache_positions[i] = start_pos + i;
    }
    cache_position_tensor = ::executorch::extension::from_blob(
            cache_positions.data(),
            {static_cast<int>(seq_len)},
            executorch::aten::ScalarType::Long);
} else {
    // Cache position is size 1.
    cache_position_tensor = ::executorch::extension::from_blob(
            &start_pos, {1}, executorch::aten::ScalarType::Long);
}
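
(Note that the tensor has to be declared outside the branch so it stays in scope for the module_->execute(kTextModelMethod, ...) call that consumes it further down.)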

@@ -94,6 +94,11 @@ Result<uint64_t> MultimodalPrefiller::prefill(
// `cache_position` goes from start_pos to start_pos + encoder_output.size(1).
// e.g. if start_pos = 2 and encoder_output.size(1) = 5,
// cache_position_tensor should be [2, 3, 4, 5, 6].
auto method_meta = ET_UNWRAP(module_->method_meta(kTextModelMethod));
Contributor

Add comment like

// Get expected shape of cache position tensor, which should be the second argument

Comment on lines 98 to 101
auto second_input_info = ET_UNWRAP(method_meta.input_tensor_meta(1));
auto second_input_sizes = second_input_info.sizes();
auto numel = second_input_sizes[0];

Contributor Author

A bit later?

@mergennachin mergennachin added this to the 1.0.0 milestone Sep 12, 2025
auto cache_position_tensor = ::executorch::extension::from_blob(
cache_positions.data(),
{static_cast<int>(seq_len)},
executorch::aten::ScalarType::Long);
auto prefill_result = module_->execute(
kTextModelMethod, {cache_position_tensor, encoder_output});
Contributor

Swap these two
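
Presumably so the call becomes, roughly:

auto prefill_result = module_->execute(
    kTextModelMethod, {encoder_output, cache_position_tensor});

(a sketch; the comment only asks to swap the two arguments)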

@@ -99,6 +102,37 @@ ET_EXPERIMENTAL size_t inline get_rss_bytes() {
// when this changed.
return 0;
}

inline runtime::Result<TensorPtr>
populate_start_pos_tensor(Module* module, int64_t& start_pos, int seq_len) {
Contributor

Please add a docstring, e.g. explaining that we assume the second argument is the cache position / start pos, and that the tensor is populated based on its shape.

Contributor

Also the name should be populate_start_pos_or_cache_position
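
Taken together with the docstring request above, the helper would read something like this sketch (not the merged code; the shape-based behavior is inferred from the diffs above):

// Creates the tensor for the text model's second input, which is assumed to
// be either a start_pos scalar or a cache_position vector. The expected
// shape from the method metadata decides which to populate: a size-1 tensor
// holding start_pos, or a seq_len-long tensor holding
// [start_pos, ..., start_pos + seq_len - 1].
inline runtime::Result<TensorPtr> populate_start_pos_or_cache_position(
    Module* module, int64_t& start_pos, int seq_len);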

@kirklandsign kirklandsign changed the title Update cache position size for llava Update cache position population and arg order for multimodal runner Sep 12, 2025
Contributor

@jackzhxng jackzhxng left a comment

Thank you! I'll need to land #14238 first

@kirklandsign kirklandsign added the release notes: llm Changes to llm utilities label Sep 12, 2025