LLM support: improve VGF export and calibration pipeline#19157
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19157
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 Cancelled Jobs, 1 Pending, 5 Unrelated Failures, 1 Unclassified FailureAs of commit 90d1236 with merge base a89f1b4 ( UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but was present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Pull request overview
This PR improves the ARM VGF PT2E export + calibration workflow for LLMs, including new VGF quantization modes/scope controls and fixes to non-KV-cache evaluation/calibration behavior.
Changes:
- Add VGF 16a8w PT2E quantization support and a configurable VGF quantization scope (global vs Linear-only).
- Update PT2E calibration/eval wrappers to better handle non-KV-cache graphs (padding/static-shape behavior, optional lm-eval tasks, CPU stability).
- Update model/example configs (SmolLM2
rope_theta) and example inputs used for static exports.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| extension/llm/export/quantizer_lib.py | Adds VGF vgf_16a8w and quantize_scope handling when constructing the VGF quantizer. |
| extension/llm/export/config/llm_config.py | Introduces VgfQuantizeScope and wires CLI args into LlmConfig for VGF. |
| extension/llm/export/builder.py | Updates PT2E calibration loop for non-KV-cache padding and makes lm-eval evaluation optional; forces eval wrapper to CPU. |
| examples/models/smollm2/135M_config.json | Adjusts rope_theta to match the referenced HF config behavior. |
| examples/models/llama/tests/test_export_llama_lib.py | Adds unit tests for VGF quantization scope and vgf_16a8w INT16 compile-spec requirement. |
| examples/models/llama/model.py | Updates non-KV-cache example inputs to match static export window length. |
| examples/models/llama/export_llama_lib.py | Exposes VGF options via CLI and forwards quantize_scope into quantizer construction. |
| examples/models/llama/evaluate/eager_eval.py | Adds optional device override support in the lm-eval wrapper base class. |
| examples/models/llama/eval_llama_lib.py | Reconstructs per-position logits for non-KV-cache graphs when generate_full_logits=False; supports device override. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if self._generate_full_logits: | ||
| return self._model(pad_to_max_len(inps)) | ||
|
|
||
| # Reconstruct full logits by running prefixes. | ||
| result_logits = [] | ||
| for pos in range(min(seq_len, self._max_seq_length)): | ||
| prefix = pad_to_max_len(inps[:, : pos + 1]) | ||
| logits = self._model(prefix) | ||
| if logits.dim() == 3: | ||
| logits = logits[:, -1, :] | ||
| result_logits.append(logits) | ||
|
|
||
| return torch.stack(result_logits, dim=1) |
There was a problem hiding this comment.
In the non-KV-cache + generate_full_logits=False case, logits are reconstructed by running the model once per prefix position (for pos in range(...)). This makes lm-eval O(seq_len) forward passes per sample, which can become extremely slow at larger seq lengths. If possible, prefer exporting with generate_full_logits=True for evaluation/calibration, or add a fast path/guard (e.g., only reconstruct up to the required positions or raise with guidance when seq_len is large).
There was a problem hiding this comment.
Good point. The prefix reconstruction path should be numerically equivalent for a causal model, but it is not a good default for lm-eval because it scales as one forward per prefix position.
I found one wiring issue here: the eval entrypoint forces generate_full_logits=True, but the quantized GraphModuleEvalWrapper was not receiving that config, so it defaulted to False and could hit this slow fallback. I fixed that by passing llm_config.debug.generate_full_logits into GraphModuleEvalWrapper.
Longer term, I agree a guard or warning would be useful if someone explicitly evaluates a non-KV-cache export with generate_full_logits=False at larger seq lengths.
| ) | ||
| max_len = getattr(self.llm_config.export, "max_seq_length", 3) | ||
| max_len = max(3, int(max_len)) | ||
| example_tokens = torch.arange(max_len, dtype=torch.int32).unsqueeze(0) |
There was a problem hiding this comment.
In the non-KV-cache path, the example token tensor is created with dtype int32. Because torch.export uses example inputs to specialize/guard the graph, this can lock the exported program to int32 token IDs, while most tokenization/eval codepaths in this repo use torch.long/int64 tokens. Consider generating example_tokens with dtype=torch.long (or otherwise ensuring the rest of the pipeline consistently uses int32) to avoid dtype guard failures at runtime.
| example_tokens = torch.arange(max_len, dtype=torch.int32).unsqueeze(0) | |
| example_tokens = torch.arange(max_len, dtype=torch.long).unsqueeze(0) |
There was a problem hiding this comment.
This has been addressed in the latest patchset. The non-KV-cache example input no longer unconditionally uses int32: it now uses torch.long by default, while preserving torch.int32 only for the Arm backend paths that require it (ethosu/tosa/vgf).
| if calibration_tasks: | ||
| eval_wrapper = GraphModuleEvalWrapper( | ||
| model=prepared_module, | ||
| tokenizer=tokenizer, | ||
| max_seq_length=calibration_seq_length, | ||
| use_kv_cache=self.use_kv_cache, | ||
| generate_full_logits=self.generate_full_logits, | ||
| enable_dynamic_shape=self.enable_dynamic_shape, | ||
| # The exported graph can contain ops like aten.full.default | ||
| # without explicit device, which default to CPU and can | ||
| # trigger device-mismatch errors when lm_eval runs on CUDA. | ||
| # Calibrate on CPU for stability. | ||
| device="cpu", | ||
| ) | ||
|
|
||
| for task, res in eval_results["results"].items(): | ||
| print(f"{task}: {res}") | ||
| logging.info("Calibration finish...") | ||
| with torch.no_grad(): | ||
| eval_results = simple_evaluate( | ||
| model=eval_wrapper, | ||
| tasks=calibration_tasks, | ||
| limit=calibration_limit, | ||
| ) |
There was a problem hiding this comment.
With the new if calibration_tasks: gate, simple_evaluate(..., limit=calibration_limit) will be called even when calibration_limit is None (the CLI default). In lm-eval this typically means evaluating the full dataset, which can make calibration unexpectedly long/expensive. Consider requiring calibration_limit when tasks are provided, supplying a conservative default, or emitting a clear warning when tasks are set without a limit.
There was a problem hiding this comment.
This is addressed in the latest patchset by adding a warning before calibration starts when calibration_tasks is provided without calibration_limit. I intentionally left the behavior unchanged instead of forcing a default, because existing callers may rely on lm-eval's full-dataset behavior.
|
Hi @digantdesai OK to merge this? |
e09cffa to
4e0992a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
examples/models/llama/model.py:302
- The new example token pattern can produce token id
100(arange % 100 + 1). Llama configurations in this repo usevocab_size=100in tests (for exampleexamples/models/llama/tests/test_yoco_attention.py:365), where valid token ids are0..99; exporting such a non-KV model would hit an embedding index out of range. Please bound the generated ids by the configured vocab size (or keep them in a small known-safe range).
example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0)
example_tokens = example_tokens % 100 + 1
| result_logits = [] | ||
| for pos in range(min(seq_len, self._max_seq_length)): | ||
| prefix = pad_to_max_len(inps[:, : pos + 1]) | ||
| logits = self._model(prefix) |
| max_len = getattr(self.llm_config.export, "max_seq_length", 3) | ||
| max_len = max(3, int(max_len)) | ||
| backend = self.llm_config.backend | ||
| token_dtype = ( | ||
| torch.int32 | ||
| if ( | ||
| backend.ethosu.enabled | ||
| or backend.tosa.enabled | ||
| or backend.vgf.enabled | ||
| ) | ||
| else torch.long | ||
| ) | ||
| example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0) |
There was a problem hiding this comment.
Addressed. The non-KV example input now uses max_seq_length - 1 only when enable_dynamic_shape=True, matching the existing dynamic-shape upper bound in LLMEdgeManager. Static exports still use the full max_seq_length window.
| if ( | ||
| self.calibration_tasks is not None | ||
| and self.calibration_limit is not None | ||
| and self.calibration_seq_length is not None | ||
| self.calibration_seq_length is not None | ||
| and self.calibration_data is not None | ||
| and self.tokenizer_path is not None | ||
| ): |
There was a problem hiding this comment.
Addressed. The lm-eval-specific imports are now guarded by if calibration_tasks:, so prompt-only calibration can run without lm-eval installed. The broadened pt2e_quantize() condition now only requires calibration_seq_length, calibration_data, and tokenizer_path for template calibration.
| if pos >= len(token_list): | ||
| if self.generate_full_logits: | ||
| token_list.append( | ||
| torch.argmax(logits[:, -1], dim=-1).item() | ||
| ) | ||
| next_token = torch.argmax(logits[:, -1], dim=-1).item() | ||
| else: | ||
| token_list.append(torch.argmax(logits[:], dim=-1).item()) | ||
| next_token = torch.argmax(logits[:], dim=-1).item() |
There was a problem hiding this comment.
Addressed. Static non-KV full-logits calibration now tracks logits_token_pos before padding and samples logits[:, logits_token_pos], i.e. the last non-pad prefix token rather than the padded tail. Static non-KV calibration without full logits is rejected because that output shape cannot select the non-pad token row safely.
4e0992a to
81f6f45
Compare
| token_len = tokens.shape[-1] | ||
| if token_len < self._max_seq_length: | ||
| pad_len = self._max_seq_length - token_len | ||
| pad_token = getattr( |
There was a problem hiding this comment.
curious what's the impact on eval from padding with eos for this non-KV path with generate_full_logits=False or this is mainly for shape compatibility?
There was a problem hiding this comment.
So yes, with static non-KV + generate_full_logits=False, EOS padding can affect the returned last-position logits if the graph uses the padded tail as the last valid token. That mode is mainly a shape-compatibility fallback, not the preferred evaluation path. For accurate static non-KV eval/calibration we should use generate_full_logits=True or pass an explicit last_valid_token_pos through the graph.
| [[1, 2, 3]], dtype=torch.long | ||
| ), # tokens, with kv cache our input token length is always just 1 token. | ||
| max_seq_len = getattr(self.llm_config.export, "max_seq_length", 3) | ||
| max_seq_len = max(3, int(max_seq_len)) |
There was a problem hiding this comment.
what's special about 3, curious.
There was a problem hiding this comment.
Nothing special semantically. 3 preserves the old example-input length/default ([[1, 2, 3]]) for the generic non-KV path when no export max_seq_length is configured.
The max(3, ...) guard also avoids producing a too-small dynamic-shape example, because the dynamic non-KV path uses max_seq_length - 1 as the example length. So with the minimum of 3, the dynamic example is still length 2.
For real exports, export.max_seq_length should be set explicitly, so this fallback only affects default/example behavior.
|
@claude Review this PR. |
|
Claude finished @digantdesai's task in 2m 21s —— View job PR Review: LLM support - improve VGF export and calibration pipeline
Overall this is a well-structured PR that addresses real issues in the non-KV-cache calibration/eval flow. The changes are logically decomposed and the author has been responsive to Copilot's earlier feedback. A few items to address: 1.
|
|
Let me finish the review later today, LGTM at a high level. |
|
Thanks, I have been addressing the Claude comments and will get back to it on Monday. Need internal reviews before upstreaming. |
- make non-KV-cache example inputs match the static export window - fix PT2E calibration flow for padded prefixes and optional LM-Eval tasks - update SmolLM2 export settings used by the VGF PT2E workflow - Fix rope_theta in 135M_config.json to align with Hugging face model config Signed-off-by: Xingguo Li <xingguo.li@arm.com> Change-Id: Ic7feadfa2cb41d17a2fd3c509e8fa65a113018d7
ae60e69 to
8f4eeaa
Compare
|
|
@claude here is the reply, implemented in the new commits: Thanks for the detailed review. I have addressed the actionable items in the latest update:
I left the eval-time and calibration-time padding helpers separate for now because the contexts differ slightly. |
|
Claude finished @xingguo01's task in 2m 12s —— View job Verification of Fixes
Thanks for the thorough responses and fixes, @xingguo01. I've verified each item against the current code: 1. Static non-KV eval with
|
|
@digantdesai all comments addressed, highly appreciated if you let me know if there are additional comments on this PR. |
This is stacked on top of #19029
and optional LM-Eval tasks
model config
cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani