@metascroy (Contributor)
Summary:
This diff adds multifunction export support for static Llama models on CoreML. Multifunction models export separate prefill and decode graphs with weight sharing, enabling more efficient autoregressive generation compared to the single-method approach.

### Key Changes

**CoreML Backend Compiler (`coreml_preprocess.py`)**
- Added `MULTIMETHOD_WEIGHT_SHARING_STRATEGY` enum with `NONE` and `POSITIONAL` strategies
- Added `generate_multimethod_weight_sharing_strategy_compile_spec()` to enable weight sharing across methods
- Implemented multifunction CoreML model compilation using `ct.utils.MultiFunctionDescriptor`
- When weight sharing is enabled, weights from the first method are shared positionally with subsequent methods
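The weight-sharing compile spec can be pictured with a minimal Python sketch. The enum values and function name mirror the description above, but the exact signatures and return types in `coreml_preprocess.py` may differ:

```python
from enum import Enum


class MultimethodWeightSharingStrategy(Enum):
    # mirrors the MULTIMETHOD_WEIGHT_SHARING_STRATEGY options described above
    NONE = "none"
    POSITIONAL = "positional"


def generate_multimethod_weight_sharing_strategy_compile_spec(
    strategy: MultimethodWeightSharingStrategy,
) -> tuple:
    # a compile spec is modeled here as a (key, value) pair consumed by the
    # backend preprocessor; the real implementation returns a CompileSpec object
    return ("multimethod_weight_sharing_strategy", strategy.value.encode("utf-8"))
```

With `POSITIONAL`, the first method's weights are reused by subsequent methods based on their position in the weight list; the actual function merging happens at compile time via `ct.utils.MultiFunctionDescriptor`.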

**Model Metadata (`model_metadata.h`, `serde_json.mm`)**
- Added `MethodMetadata` struct to store per-method input/output names for multifunction models
- Extended `ModelMetadata` with `methods` map and `default_method` field
- Added `is_multifunction()` helper to detect multifunction models
- Updated JSON serialization to handle the new multifunction metadata format
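The extended metadata shape can be sketched as follows. The field names follow the description above, but the input/output names are illustrative and the serialized layout in `serde_json.mm` may differ:

```python
import json

# hypothetical multifunction metadata, per the fields described above;
# "tokens" / "input_pos" / "logits" are placeholder tensor names
metadata = {
    "default_method": "prefill",
    "methods": {
        "prefill": {"inputs": ["tokens", "input_pos"], "outputs": ["logits"]},
        "decode": {"inputs": ["tokens", "input_pos"], "outputs": ["logits"]},
    },
}


def is_multifunction(md: dict) -> bool:
    # a model is treated as multifunction when per-method metadata is present;
    # legacy metadata has no "methods" map
    return bool(md.get("methods"))
```

Legacy single-function metadata simply omits the `methods` map, so `is_multifunction()` returns false and the old code path is taken.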

**Runtime Changes (`ETCoreMLModelManager.mm`, `backend_delegate.mm`, `coreml_backend_delegate.mm`)**
- Updated `ETCoreMLModelManager` to set `functionName` on `MLModelConfiguration` only for multifunction models (based on `metadata.is_multifunction()`)
- Legacy single-function models continue to work with `functionName=nil`
- Added method name propagation through the delegate initialization path
- Updated model loading to use per-method input/output names when available
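The runtime selection logic amounts to the following Python sketch of the Objective-C++ code, where returning `None` stands in for leaving `functionName=nil`:

```python
from typing import Optional


def function_name_for_load(metadata: dict, requested_method: Optional[str]) -> Optional[str]:
    """Return the functionName to set on MLModelConfiguration, or None (nil)."""
    is_multifunction = bool(metadata.get("methods"))
    if is_multifunction:
        # multifunction models must select a concrete function to load
        return requested_method or metadata.get("default_method")
    # legacy single-function models keep functionName=nil
    return None
```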

**Export Script (`export_static_llm_coreml.py`)**
- Added `--multifunction` flag to export models with separate prefill (seqlen=input_len) and decode (seqlen=1) methods
- Multifunction mode uses `generate_full_logits=False` for efficiency (only outputs last token logits)
- Single method mode (default) retains `generate_full_logits=True` for lookahead decoding support
- Generates combined metadata with method-specific prefixes (e.g., `decode_input_len`, `prefill_input_len`)
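The prefixed-metadata combination can be sketched with a hypothetical helper (the export script's actual implementation may differ):

```python
def combine_method_metadata(per_method: dict) -> dict:
    # flatten {method: {key: value}} into {"{method}_{key}": value},
    # yielding keys like prefill_input_len / decode_input_len as described above
    combined = {}
    for method, md in per_method.items():
        for key, value in md.items():
            combined[f"{method}_{key}"] = value
    return combined
```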

**New Runner (`run_static_llm_multifunction.py`)**
- Added dedicated runner for multifunction models
- Handles separate prefill and decode method execution
- Manages cache state transfer between prefill and decode phases
- Supports both 2D (`generate_full_logits=False`) and 3D (`generate_full_logits=True`) logits outputs
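The 2D/3D logits handling reduces to a small helper; this NumPy sketch assumes the shapes `[batch, vocab]` and `[batch, seq, vocab]` and uses greedy argmax for illustration:

```python
import numpy as np


def next_token(logits: np.ndarray) -> int:
    # 2D [batch, vocab] (generate_full_logits=False): already last-token logits;
    # 3D [batch, seq, vocab] (generate_full_logits=True): take the final position
    if logits.ndim == 3:
        logits = logits[:, -1, :]
    return int(np.argmax(logits[0]))
```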

**Build System (`CMakeLists.txt`)**
- Fixed installation of CoreML backend headers

**Utilities (`extract_coreml_models.py`)**
- Updated model extraction script to handle multifunction models

**Documentation (`README.md`)**
- Added documentation for both export modes (single method and multifunction)
- Added comprehensive export options reference table
- Added usage examples for both modes

### Usage Examples

**Single Method Export (for lookahead decoding):**
```bash
python examples/apple/coreml/llama/export_static_llm_coreml.py \
    --checkpoint $HOME/models/llama1b/llama1b.pth \
    --params $HOME/models/llama1b/params.json \
    --output static_llm_coreml_model.pte \
    --input_len 32 \
    --max_context_len 1024
```

**Multifunction Export (separate prefill/decode):**
```bash
python examples/apple/coreml/llama/export_static_llm_coreml.py \
    --checkpoint $HOME/models/llama1b/llama1b.pth \
    --params $HOME/models/llama1b/params.json \
    --output static_llm_coreml_multifunction.pte \
    --input_len 64 \
    --max_context_len 1024 \
    --multifunction
```

**Run Single Method Model (with lookahead):**
```bash
python examples/apple/coreml/llama/run_static_llm.py \
    --model static_llm_coreml_model.pte \
    --params $HOME/models/llama1b/params.json \
    --tokenizer $HOME/models/llama1b/tokenizer.model \
    --prompt "Once upon a time" \
    --max_new_tokens 100 \
    --lookahead
```

**Run Multifunction Model:**
```bash
python examples/apple/coreml/llama/run_static_llm_multifunction.py \
    --model static_llm_coreml_multifunction.pte \
    --params $HOME/models/llama1b/params.json \
    --tokenizer $HOME/models/llama1b/tokenizer.model \
    --prompt "Once upon a time" \
    --max_new_tokens 100 \
    --input_len 64 \
    --max_context_len 1024
```

### Mode Comparison

| Feature | Single Method | Multifunction |
|---------|---------------|---------------|
| Sequence length | Fixed (`input_len` for both prefill & decode) | Separate (`input_len` for prefill, 1 for decode) |
| Logits output | Full (all tokens) | Last token only |
| Lookahead decoding | ✅ Supported | ❌ Not supported |
| Weight sharing | N/A | ✅ Enabled |
| Generation efficiency | Good with lookahead | Optimized decode step |


Test Plan:
New unit test, plus manual testing.

Tested both export modes on Llama 1B:
1. Exported single method model with `--input_len 32 --max_context_len 1024`
2. Exported multifunction model with `--input_len 64 --max_context_len 1024 --multifunction`
3. Ran single method model with `--lookahead` flag
4. Ran multifunction model with matching `input_len` and `max_context_len`
5. Verified text generation produces coherent output for both modes

Reviewed By: billmguo

Differential Revision: D91243088

Pulled By: metascroy
metascroy requested a review from billmguo on January 22, 2026.
meta-codesync bot merged commit 8ab593b into pytorch:main on January 22, 2026 (309 of 322 checks passed).
Labels: ciflow/trunk, CLA Signed, fb-exported, meta-exported