@metascroy (Contributor)
Summary:
This diff adds multifunction export support for static Llama models on CoreML. Multifunction models export separate prefill and decode graphs with weight sharing, enabling more efficient autoregressive generation compared to the single-method approach.

### Key Changes

**CoreML Backend Compiler (`coreml_preprocess.py`)**
- Added `MULTIMETHOD_WEIGHT_SHARING_STRATEGY` enum with `NONE` and `POSITIONAL` strategies
- Added `generate_multimethod_weight_sharing_strategy_compile_spec()` to enable weight sharing across methods
- Implemented multifunction CoreML model compilation using `ct.utils.MultiFunctionDescriptor`
- When weight sharing is enabled, weights from the first method are shared positionally with subsequent methods
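The weight-sharing compile spec can be pictured with a minimal Python sketch. The enum values and function name mirror the description above, but the exact signatures and return types in `coreml_preprocess.py` may differ:

```python
from enum import Enum


class MultimethodWeightSharingStrategy(Enum):
    # mirrors the MULTIMETHOD_WEIGHT_SHARING_STRATEGY options described above
    NONE = "none"
    POSITIONAL = "positional"


def generate_multimethod_weight_sharing_strategy_compile_spec(
    strategy: MultimethodWeightSharingStrategy,
) -> tuple:
    # a compile spec is modeled here as a (key, value) pair consumed by the
    # backend preprocessor; the real implementation returns a CompileSpec object
    return ("multimethod_weight_sharing_strategy", strategy.value.encode("utf-8"))
```

With `POSITIONAL`, the first method's weights are reused by subsequent methods based on their position in the weight list; the actual function merging happens at compile time via `ct.utils.MultiFunctionDescriptor`.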

**Model Metadata (`model_metadata.h`, `serde_json.mm`)**
- Added `MethodMetadata` struct to store per-method input/output names for multifunction models
- Extended `ModelMetadata` with `methods` map and `default_method` field
- Added `is_multifunction()` helper to detect multifunction models
- Updated JSON serialization to handle the new multifunction metadata format
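The extended metadata shape can be sketched as follows. The field names follow the description above, but the input/output names are illustrative and the serialized layout in `serde_json.mm` may differ:

```python
import json

# hypothetical multifunction metadata, per the fields described above;
# "tokens" / "input_pos" / "logits" are placeholder tensor names
metadata = {
    "default_method": "prefill",
    "methods": {
        "prefill": {"inputs": ["tokens", "input_pos"], "outputs": ["logits"]},
        "decode": {"inputs": ["tokens", "input_pos"], "outputs": ["logits"]},
    },
}


def is_multifunction(md: dict) -> bool:
    # a model is treated as multifunction when per-method metadata is present;
    # legacy metadata has no "methods" map
    return bool(md.get("methods"))
```

Legacy single-function metadata simply omits the `methods` map, so `is_multifunction()` returns false and the old code path is taken.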

**Runtime Changes (`ETCoreMLModelManager.mm`, `backend_delegate.mm`, `coreml_backend_delegate.mm`)**
- Updated `ETCoreMLModelManager` to set `functionName` on `MLModelConfiguration` only for multifunction models (based on `metadata.is_multifunction()`)
- Legacy single-function models continue to work with `functionName=nil`
- Added method name propagation through the delegate initialization path
- Updated model loading to use per-method input/output names when available
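The runtime selection logic amounts to the following Python sketch of the Objective-C++ code, where returning `None` stands in for leaving `functionName=nil`:

```python
from typing import Optional


def function_name_for_load(metadata: dict, requested_method: Optional[str]) -> Optional[str]:
    """Return the functionName to set on MLModelConfiguration, or None (nil)."""
    is_multifunction = bool(metadata.get("methods"))
    if is_multifunction:
        # multifunction models must select a concrete function to load
        return requested_method or metadata.get("default_method")
    # legacy single-function models keep functionName=nil
    return None
```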

**Export Script (`export_static_llm_coreml.py`)**
- Added `--multifunction` flag to export models with separate prefill (seqlen=input_len) and decode (seqlen=1) methods
- Multifunction mode uses `generate_full_logits=False` for efficiency (only outputs last token logits)
- Single method mode (default) retains `generate_full_logits=True` for lookahead decoding support
- Generates combined metadata with method-specific prefixes (e.g., `decode_input_len`, `prefill_input_len`)
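The prefixed-metadata combination can be sketched with a hypothetical helper (the export script's actual implementation may differ):

```python
def combine_method_metadata(per_method: dict) -> dict:
    # flatten {method: {key: value}} into {"{method}_{key}": value},
    # yielding keys like prefill_input_len / decode_input_len as described above
    combined = {}
    for method, md in per_method.items():
        for key, value in md.items():
            combined[f"{method}_{key}"] = value
    return combined
```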

**New Runner (`run_static_llm_multifunction.py`)**
- Added dedicated runner for multifunction models
- Handles separate prefill and decode method execution
- Manages cache state transfer between prefill and decode phases
- Supports both 2D (`generate_full_logits=False`) and 3D (`generate_full_logits=True`) logits outputs
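The 2D/3D logits handling reduces to a small helper; this NumPy sketch assumes the shapes `[batch, vocab]` and `[batch, seq, vocab]` and uses greedy argmax for illustration:

```python
import numpy as np


def next_token(logits: np.ndarray) -> int:
    # 2D [batch, vocab] (generate_full_logits=False): already last-token logits;
    # 3D [batch, seq, vocab] (generate_full_logits=True): take the final position
    if logits.ndim == 3:
        logits = logits[:, -1, :]
    return int(np.argmax(logits[0]))
```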

**Build System (`CMakeLists.txt`)**
- Fixed installation of CoreML backend headers

**Utilities (`extract_coreml_models.py`)**
- Updated model extraction script to handle multifunction models

**Documentation (`README.md`)**
- Added documentation for both export modes (single method and multifunction)
- Added comprehensive export options reference table
- Added usage examples for both modes

### Usage Examples

**Single Method Export (for lookahead decoding):**
```bash
python examples/apple/coreml/llama/export_static_llm_coreml.py \
    --checkpoint $HOME/models/llama1b/llama1b.pth \
    --params $HOME/models/llama1b/params.json \
    --output static_llm_coreml_model.pte \
    --input_len 32 \
    --max_context_len 1024
```

**Multifunction Export (separate prefill/decode):**
```bash
python examples/apple/coreml/llama/export_static_llm_coreml.py \
    --checkpoint $HOME/models/llama1b/llama1b.pth \
    --params $HOME/models/llama1b/params.json \
    --output static_llm_coreml_multifunction.pte \
    --input_len 64 \
    --max_context_len 1024 \
    --multifunction
```

**Run Single Method Model (with lookahead):**
```bash
python examples/apple/coreml/llama/run_static_llm.py \
    --model static_llm_coreml_model.pte \
    --params $HOME/models/llama1b/params.json \
    --tokenizer $HOME/models/llama1b/tokenizer.model \
    --prompt "Once upon a time" \
    --max_new_tokens 100 \
    --lookahead
```

**Run Multifunction Model:**
```bash
python examples/apple/coreml/llama/run_static_llm_multifunction.py \
    --model static_llm_coreml_multifunction.pte \
    --params $HOME/models/llama1b/params.json \
    --tokenizer $HOME/models/llama1b/tokenizer.model \
    --prompt "Once upon a time" \
    --max_new_tokens 100 \
    --input_len 64 \
    --max_context_len 1024
```

### Mode Comparison

| Feature | Single Method | Multifunction |
|---------|---------------|---------------|
| Sequence length | Fixed (`input_len` for both prefill & decode) | Separate (`input_len` for prefill, 1 for decode) |
| Logits output | Full (all tokens) | Last token only |
| Lookahead decoding | ✅ Supported | ❌ Not supported |
| Weight sharing | N/A | ✅ Enabled |
| Generation efficiency | Good with lookahead | Optimized decode step |


Test Plan:
New unit test, plus manual testing.

Tested both export modes on Llama 1B:
1. Exported single method model with `--input_len 32 --max_context_len 1024`
2. Exported multifunction model with `--input_len 64 --max_context_len 1024 --multifunction`
3. Ran single method model with `--lookahead` flag
4. Ran multifunction model with matching `input_len` and `max_context_len`
5. Verified text generation produces coherent output for both modes

Reviewed By: billmguo

Differential Revision: D91243088

Pulled By: metascroy
metascroy requested a review from billmguo on January 22, 2026.
meta-codesync bot merged commit 8ab593b into pytorch:main on January 22, 2026 (309 of 322 checks passed).
Labels: ciflow/trunk, CLA Signed, fb-exported, meta-exported