LLM support: improve VGF export and calibration pipeline by xingguo01 · Pull Request #19157 · pytorch/executorch

xingguo01 · 2026-04-27T16:44:27Z

This is stacked on top of #19029

- Make non-KV-cache example inputs match the static export window.
- Fix the PT2E calibration flow for padded prefixes and optional LM-Eval tasks.
- Update the SmolLM2 export settings used by the VGF PT2E workflow.
- Fix rope_theta in 135M_config.json to align with the Hugging Face model config.

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

pytorch-bot · 2026-04-27T16:44:32Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19157

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs, 1 Pending, 5 Unrelated Failures, 1 Unclassified Failure

As of commit 90d1236 with merge base a89f1b4:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

Build Windows Wheels / pytorch/executorch / build-wheel-py3_10-cpu (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)

CANCELLED JOBS - The following jobs were cancelled. Please retry:

MLX / test-mlx-parakeet / test-mlx-parakeet (gh)
##[error]The operation was canceled.
trunk / test-arm-backend-ethos-u (test_smaller_stories_llama) / linux-job (gh)
##[error]The operation was canceled.

FLAKY - The following job failed but was likely due to flakiness present on trunk:

trunk / test-models-windows (resnet50, xnnpack-q8) / windows-job (gh) (detected as infra flaky with no log or failing log classifier)

BROKEN TRUNK - The following jobs failed, but the failures were also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-static-llama-qnn-linux (stories_110m) / linux-job (gh) (trunk failure)
test_llama_stories_110m
pull / unittest / macos / macos-job (gh) (trunk failure)
##[error]The operation was canceled.
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
##[error]The operation was canceled.
trunk / unittest-release / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copilot

Pull request overview

This PR improves the ARM VGF PT2E export + calibration workflow for LLMs, including new VGF quantization modes/scope controls and fixes to non-KV-cache evaluation/calibration behavior.

Changes:

Add VGF 16a8w PT2E quantization support and a configurable VGF quantization scope (global vs Linear-only).
Update PT2E calibration/eval wrappers to better handle non-KV-cache graphs (padding/static-shape behavior, optional lm-eval tasks, CPU stability).
Update model/example configs (SmolLM2 rope_theta) and example inputs used for static exports.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

File	Description
extension/llm/export/quantizer_lib.py	Adds VGF `vgf_16a8w` and `quantize_scope` handling when constructing the VGF quantizer.
extension/llm/export/config/llm_config.py	Introduces `VgfQuantizeScope` and wires CLI args into `LlmConfig` for VGF.
extension/llm/export/builder.py	Updates PT2E calibration loop for non-KV-cache padding and makes lm-eval evaluation optional; forces eval wrapper to CPU.
examples/models/smollm2/135M_config.json	Adjusts `rope_theta` to match the referenced HF config behavior.
examples/models/llama/tests/test_export_llama_lib.py	Adds unit tests for VGF quantization scope and `vgf_16a8w` INT16 compile-spec requirement.
examples/models/llama/model.py	Updates non-KV-cache example inputs to match static export window length.
examples/models/llama/export_llama_lib.py	Exposes VGF options via CLI and forwards `quantize_scope` into quantizer construction.
examples/models/llama/evaluate/eager_eval.py	Adds optional device override support in the lm-eval wrapper base class.
examples/models/llama/eval_llama_lib.py	Reconstructs per-position logits for non-KV-cache graphs when `generate_full_logits=False`; supports device override.

Copilot · 2026-04-27T16:49:53Z

+            if self._generate_full_logits:
+                return self._model(pad_to_max_len(inps))
+
+            # Reconstruct full logits by running prefixes.
+            result_logits = []
+            for pos in range(min(seq_len, self._max_seq_length)):
+                prefix = pad_to_max_len(inps[:, : pos + 1])
+                logits = self._model(prefix)
+                if logits.dim() == 3:
+                    logits = logits[:, -1, :]
+                result_logits.append(logits)
+
+            return torch.stack(result_logits, dim=1)


In the non-KV-cache + generate_full_logits=False case, logits are reconstructed by running the model once per prefix position (for pos in range(...)). This makes lm-eval O(seq_len) forward passes per sample, which can become extremely slow at larger seq lengths. If possible, prefer exporting with generate_full_logits=True for evaluation/calibration, or add a fast path/guard (e.g., only reconstruct up to the required positions or raise with guidance when seq_len is large).

Good point. The prefix reconstruction path should be numerically equivalent for a causal model, but it is not a good default for lm-eval because it scales as one forward per prefix position.

I found one wiring issue here: the eval entrypoint forces generate_full_logits=True, but the quantized GraphModuleEvalWrapper was not receiving that config, so it defaulted to False and could hit this slow fallback. I fixed that by passing llm_config.debug.generate_full_logits into GraphModuleEvalWrapper.

Longer term, I agree a guard or warning would be useful if someone explicitly evaluates a non-KV-cache export with generate_full_logits=False at larger seq lengths.
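A minimal sketch of the cost difference discussed above, assuming a toy causal "model" whose output at position i is the running sum of tokens 0..i. Plain lists stand in for tensors, the prefixes are left unpadded, and every function name here is hypothetical rather than taken from the PR.

```python
from typing import Dict, List, Tuple


def make_counting_model() -> Tuple[object, Dict[str, int]]:
    """Build a toy causal model plus a counter of forward passes."""
    calls = {"n": 0}

    def model(tokens: List[int]) -> List[int]:
        calls["n"] += 1
        out, total = [], 0
        for tok in tokens:
            total += tok          # position i only sees tokens[0..i]
            out.append(total)
        return out

    return model, calls


def full_logits_path(model, tokens: List[int]) -> List[int]:
    # One forward pass yields a value for every position.
    return model(tokens)


def prefix_reconstruction_path(model, tokens: List[int]) -> List[int]:
    # One forward pass per prefix; only the last position is kept each time.
    return [model(tokens[: pos + 1])[-1] for pos in range(len(tokens))]


tokens = [3, 1, 4, 1, 5, 9, 2, 6]

model, calls = make_counting_model()
full = full_logits_path(model, tokens)
full_calls = calls["n"]

model, calls = make_counting_model()
rebuilt = prefix_reconstruction_path(model, tokens)
prefix_calls = calls["n"]

print(full == rebuilt, full_calls, prefix_calls)
```

Both paths agree for this causal toy model, but the prefix path needs one forward pass per token, which is the scaling concern raised in the review.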

Copilot · 2026-04-27T16:49:54Z

-            )
+            max_len = getattr(self.llm_config.export, "max_seq_length", 3)
+            max_len = max(3, int(max_len))
+            example_tokens = torch.arange(max_len, dtype=torch.int32).unsqueeze(0)


In the non-KV-cache path, the example token tensor is created with dtype int32. Because torch.export uses example inputs to specialize/guard the graph, this can lock the exported program to int32 token IDs, while most tokenization/eval codepaths in this repo use torch.long/int64 tokens. Consider generating example_tokens with dtype=torch.long (or otherwise ensuring the rest of the pipeline consistently uses int32) to avoid dtype guard failures at runtime.

Suggested change

-            example_tokens = torch.arange(max_len, dtype=torch.int32).unsqueeze(0)
+            example_tokens = torch.arange(max_len, dtype=torch.long).unsqueeze(0)

This has been addressed in the latest patchset. The non-KV-cache example input no longer unconditionally uses int32: it now uses torch.long by default, while preserving torch.int32 only for the Arm backend paths that require it (ethosu/tosa/vgf).

Copilot · 2026-04-27T16:49:54Z

+        if calibration_tasks:
+            eval_wrapper = GraphModuleEvalWrapper(
+                model=prepared_module,
+                tokenizer=tokenizer,
+                max_seq_length=calibration_seq_length,
+                use_kv_cache=self.use_kv_cache,
+                generate_full_logits=self.generate_full_logits,
+                enable_dynamic_shape=self.enable_dynamic_shape,
+                # The exported graph can contain ops like aten.full.default
+                # without explicit device, which default to CPU and can
+                # trigger device-mismatch errors when lm_eval runs on CUDA.
+                # Calibrate on CPU for stability.
+                device="cpu",
            )

-        for task, res in eval_results["results"].items():
-            print(f"{task}: {res}")
-        logging.info("Calibration finish...")
+            with torch.no_grad():
+                eval_results = simple_evaluate(
+                    model=eval_wrapper,
+                    tasks=calibration_tasks,
+                    limit=calibration_limit,
+                )


With the new if calibration_tasks: gate, simple_evaluate(..., limit=calibration_limit) will be called even when calibration_limit is None (the CLI default). In lm-eval this typically means evaluating the full dataset, which can make calibration unexpectedly long/expensive. Consider requiring calibration_limit when tasks are provided, supplying a conservative default, or emitting a clear warning when tasks are set without a limit.

This is addressed in the latest patchset by adding a warning before calibration starts when calibration_tasks is provided without calibration_limit. I intentionally left the behavior unchanged instead of forcing a default, because existing callers may rely on lm-eval's full-dataset behavior.
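The warn-but-do-not-change behaviour described above could look roughly like the following. This is a sketch only: the helper name, logger name, and message text are assumptions, not the code added in the PR.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("calibration_sketch")  # hypothetical logger name


def warn_if_unbounded_calibration(
    calibration_tasks: Optional[Sequence[str]],
    calibration_limit: Optional[int],
) -> bool:
    """Log a warning and return True when tasks are set without a limit.

    Behaviour is deliberately left unchanged: the caller still runs the
    evaluation with limit=None if it chooses to.
    """
    if calibration_tasks and calibration_limit is None:
        logger.warning(
            "calibration_tasks=%s given without calibration_limit; "
            "lm-eval may evaluate the full dataset.",
            list(calibration_tasks),
        )
        return True
    return False
```

Returning a flag instead of raising keeps existing callers that rely on full-dataset evaluation working, which matches the rationale given in the reply.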

zingo · 2026-05-12T14:02:39Z

Hi @digantdesai OK to merge this?

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

examples/models/llama/model.py:302

The new example token pattern can produce token id 100 (arange % 100 + 1). Llama configurations in this repo use vocab_size=100 in tests (for example examples/models/llama/tests/test_yoco_attention.py:365), where valid token ids are 0..99; exporting such a non-KV model would hit an embedding index out of range. Please bound the generated ids by the configured vocab size (or keep them in a small known-safe range).

            example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0)
            example_tokens = example_tokens % 100 + 1
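To make the boundary case concrete, here is a small sketch with plain lists standing in for tensors. `old_pattern` reproduces the `% 100 + 1` arithmetic quoted above; `bounded_pattern` uses the `% (vocab_size - 1) + 1` form that the thread later reports as the fix. Both function names are hypothetical.

```python
from typing import List


def old_pattern(n: int) -> List[int]:
    # Mirrors `arange(n) % 100 + 1`: IDs fall in [1, 100].
    return [i % 100 + 1 for i in range(n)]


def bounded_pattern(n: int, vocab_size: int) -> List[int]:
    # Bound IDs by the vocabulary: IDs fall in [1, vocab_size - 1].
    ids = list(range(n))
    if vocab_size > 1:
        ids = [i % (vocab_size - 1) + 1 for i in ids]
    return ids


vocab_size = 100                      # valid embedding indices are 0..99
old_ids = old_pattern(200)
new_ids = bounded_pattern(200, vocab_size)

print(max(old_ids) >= vocab_size)     # old pattern can index out of range
print(max(new_ids) < vocab_size)      # bounded pattern stays in range
```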

+            result_logits = []
+            for pos in range(min(seq_len, self._max_seq_length)):
+                prefix = pad_to_max_len(inps[:, : pos + 1])
+                logits = self._model(prefix)


xingguo01 · 2026-05-19T10:37:12Z

+            max_len = getattr(self.llm_config.export, "max_seq_length", 3)
+            max_len = max(3, int(max_len))
+            backend = self.llm_config.backend
+            token_dtype = (
+                torch.int32
+                if (
+                    backend.ethosu.enabled
+                    or backend.tosa.enabled
+                    or backend.vgf.enabled
+                )
+                else torch.long
            )
+            example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0)


Addressed. The non-KV example input now uses max_seq_length - 1 only when enable_dynamic_shape=True, matching the existing dynamic-shape upper bound in LLMEdgeManager. Static exports still use the full max_seq_length window.

xingguo01 · 2026-05-19T10:37:49Z

                if (
-                    self.calibration_tasks is not None
-                    and self.calibration_limit is not None
-                    and self.calibration_seq_length is not None
+                    self.calibration_seq_length is not None
                    and self.calibration_data is not None
                    and self.tokenizer_path is not None
                ):


Addressed. The lm-eval-specific imports are now guarded by if calibration_tasks:, so prompt-only calibration can run without lm-eval installed. The broadened pt2e_quantize() condition now only requires calibration_seq_length, calibration_data, and tokenizer_path for template calibration.
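One way to keep lm-eval optional, sketched under the assumption that `simple_evaluate` is imported from `lm_eval.evaluator`. The helper name and error message are hypothetical, and the PR may structure the guard differently.

```python
import importlib
from typing import Callable, Optional, Sequence


def load_simple_evaluate(
    calibration_tasks: Optional[Sequence[str]],
) -> Optional[Callable]:
    """Import lm-eval only when task-based calibration is requested."""
    if not calibration_tasks:
        # Prompt-only calibration: lm-eval is never imported.
        return None
    try:
        evaluator = importlib.import_module("lm_eval.evaluator")
    except ImportError as exc:
        raise RuntimeError(
            "calibration_tasks were provided but lm-eval is not installed"
        ) from exc
    return evaluator.simple_evaluate
```

With this shape, a caller that passes no tasks never touches the optional dependency, so prompt-only calibration works in an environment without lm-eval.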

xingguo01 · 2026-05-19T10:38:22Z

                    if pos >= len(token_list):
                        if self.generate_full_logits:
-                            token_list.append(
-                                torch.argmax(logits[:, -1], dim=-1).item()
-                            )
+                            next_token = torch.argmax(logits[:, -1], dim=-1).item()
                        else:
-                            token_list.append(torch.argmax(logits[:], dim=-1).item())
+                            next_token = torch.argmax(logits[:], dim=-1).item()


Addressed. Static non-KV full-logits calibration now tracks logits_token_pos before padding and samples logits[:, logits_token_pos], i.e. the last non-pad prefix token rather than the padded tail. Static non-KV calibration without full logits is rejected because that output shape cannot select the non-pad token row safely.
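A minimal sketch of tracking the last real token before padding, using the `min(len(prefix_tokens), max_len) - 1` formula quoted later in this thread. The function name, the list-based representation, and the choice to truncate over-long prefixes from the right are assumptions for illustration.

```python
from typing import List, Tuple


def prepare_prefix(
    prefix_tokens: List[int], max_len: int, pad_id: int
) -> Tuple[List[int], int]:
    """Pad (or truncate) a prefix to max_len and record the last real index."""
    # Compute the position before padding so it never points at a pad token.
    logits_token_pos = min(len(prefix_tokens), max_len) - 1
    kept = prefix_tokens[:max_len]  # assumption: keep the first max_len tokens
    padded = kept + [pad_id] * (max_len - len(kept))
    return padded, logits_token_pos


padded, pos = prepare_prefix([11, 12, 13], max_len=6, pad_id=0)
print(padded, pos)
```

The caller would then read the full-logits output at index `pos` rather than at the final position of the padded window.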

digantdesai · 2026-05-20T15:00:07Z

+                token_len = tokens.shape[-1]
+                if token_len < self._max_seq_length:
+                    pad_len = self._max_seq_length - token_len
+                    pad_token = getattr(


Curious: what's the impact on eval of padding with EOS for this non-KV path with generate_full_logits=False? Or is this mainly for shape compatibility?

So yes, with static non-KV + generate_full_logits=False, EOS padding can affect the returned last-position logits if the graph uses the padded tail as the last valid token. That mode is mainly a shape-compatibility fallback, not the preferred evaluation path. For accurate static non-KV eval/calibration we should use generate_full_logits=True or pass an explicit last_valid_token_pos through the graph.
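The effect can be illustrated with a toy causal model (a running sum over tokens, with plain lists in place of tensors). The pad ID of 7 and both helper names are hypothetical; the point is only that trailing padding leaves earlier positions untouched while changing the final position.

```python
from typing import List


def causal_running_sum(tokens: List[int]) -> List[int]:
    """Toy causal model: output[i] depends only on tokens[0..i]."""
    out, total = [], 0
    for tok in tokens:
        total += tok
        out.append(total)
    return out


def pad_right(tokens: List[int], max_len: int, pad_id: int) -> List[int]:
    return tokens + [pad_id] * (max_len - len(tokens))


real = [3, 1, 4]
padded = pad_right(real, max_len=6, pad_id=7)

unpadded_out = causal_running_sum(real)
padded_out = causal_running_sum(padded)

# Full-logits view: slice back to the real length, padding has no effect.
full_logits_view = padded_out[: len(real)]
# Last-position-only view: the value comes from the padded tail.
last_position_only = padded_out[-1]

print(full_logits_view == unpadded_out, last_position_only == unpadded_out[-1])
```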

digantdesai · 2026-05-20T15:02:08Z

-                    [[1, 2, 3]], dtype=torch.long
-                ),  # tokens, with kv cache our input token length is always just 1 token.
+            max_seq_len = getattr(self.llm_config.export, "max_seq_length", 3)
+            max_seq_len = max(3, int(max_seq_len))


what's special about 3, curious.

Nothing special semantically. 3 preserves the old example-input length/default ([[1, 2, 3]]) for the generic non-KV path when no export max_seq_length is configured.

The max(3, ...) guard also avoids producing a too-small dynamic-shape example, because the dynamic non-KV path uses max_seq_length - 1 as the example length. So with the minimum of 3, the dynamic example is still length 2.

For real exports, export.max_seq_length should be set explicitly, so this fallback only affects default/example behavior.
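The length arithmetic described in this reply can be sketched as follows. The function name is hypothetical and the fallback of 3 is taken from the diff above. This is an illustration, not the PR's code.

```python
from typing import Optional


def example_input_length(
    max_seq_length: Optional[int], enable_dynamic_shape: bool
) -> int:
    """Length of the non-KV example token tensor."""
    # Fall back to the historical example length when nothing is configured.
    max_seq_len = 3 if max_seq_length is None else int(max_seq_length)
    max_seq_len = max(3, max_seq_len)
    # Dynamic exports use max_seq_length - 1 as the example length.
    return max_seq_len - 1 if enable_dynamic_shape else max_seq_len


print(example_input_length(None, False))   # static default
print(example_input_length(None, True))    # dynamic default
print(example_input_length(128, True))     # configured dynamic export
```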

digantdesai · 2026-05-20T15:03:42Z

@claude Review this PR.

claude · 2026-05-20T15:04:38Z

Claude finished @digantdesai's task in 2m 21s

PR Review: LLM support - improve VGF export and calibration pipeline

Read PR diff and changed files
Review examples/models/llama/eval_llama_lib.py
Review examples/models/llama/evaluate/eager_eval.py
Review examples/models/llama/model.py
Review extension/llm/export/builder.py
Post review feedback

Overall this is a well-structured PR that addresses real issues in the non-KV-cache calibration/eval flow. The changes are logically decomposed and the author has been responsive to Copilot's earlier feedback. A few items to address:

1. `eval_llama_lib.py` — Non-KV static + `generate_full_logits=False` logits are still incorrect

examples/models/llama/eval_llama_lib.py:119-128

When enable_dynamic_shape=False and generate_full_logits=False, each prefix is padded and the model returns only "last-position" logits. But for a causal model with padding, the "last position" ([:, -1, :] at line 125) corresponds to the last pad token position, not the last real prefix token. The builder code (_check_calibration_prefix_options) correctly rejects this combination for calibration, but the eval wrapper still silently runs it — it will produce incorrect lm-eval scores.

Suggestion: Mirror the builder's validation here. Either raise an error for this combination or use a logits_token_pos approach (as done in _prepare_calibration_prefix) to select the correct position from 3D logits.

2. `model.py:302-303` — Token ID can hit `vocab_size` boundary

example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0)
example_tokens = example_tokens % 100 + 1

This produces token IDs in range [1, 100]. Some test configs in this repo use vocab_size=100 (valid IDs: 0..99), which means token ID 100 would be out of bounds for the embedding layer during tracing. Using % 100 (yielding [0, 99]) or % vocab_size would be safer. The Copilot reviewer flagged this too.

Also, responding to @digantdesai's question on line 290 — the max(3, ...) ensures the example has at least 3 tokens to avoid degenerate graph specialization, but the magic number 3 deserves a brief inline comment or a named constant.

3. `builder.py:411-418` — Duplicate logging statement

logging.info(
    f"Calibrating with tasks: {self.calibration_tasks}, limit: ..."
)

This exact log message appears twice — once at line 411 (unconditionally within the quantizer block) and again at line 420 (inside the if that actually runs calibration). The first one at line 411 fires even when calibration is not going to run. One of these should be removed.

4. `builder.py:307-308` — `_check_calibration_prefix_options` called inside `calibrate_template`

The validation self._check_calibration_prefix_options() is inside the inner function calibrate_template, which means it runs after tokenization. It would be cleaner to call this at the top of pt2e_calibrate — fail fast before doing any work. Minor nit.

5. `eager_eval.py` — Device override changes behavior for existing callers

The device parameter defaults to None, preserving backward compatibility. However, GraphModuleEvalWrapper in builder.py hardcodes device="cpu" for calibration stability, but gen_eval_wrapper (line 262-268) does not pass device — so eval still auto-selects CUDA if available. This asymmetry is intentional per the comment about calibration stability, but it means eval and calibration may run on different devices for the same model. Worth documenting or at least noting in the PR description.

6. `eval_llama_lib.py:116-117` — Full-logits path with padding doesn't select correct token

if self._generate_full_logits:
    return self._model(pad_to_max_len(inps))

When enable_dynamic_shape=False and the input is shorter than max_seq_length, this pads and returns the full [batch, max_seq_len, vocab] tensor to lm-eval. lm-eval will use logits[:, :seq_len, :] to score only the real tokens — this is correct because causal attention means pad tokens at the end don't affect earlier positions. Just confirming this is sound.

Minor nits

eval_llama_lib.py:102 — @digantdesai asked about the impact of EOS padding on eval. The EOS padding for static graphs only affects positions after the real tokens; due to causal masking, logits for real prefix positions are unchanged. The answer is: no eval impact for the generate_full_logits=True path.
The pad_to_max_len helper is duplicated across eval_llama_lib.py (eval time) and _prepare_calibration_prefix in builder.py (calibration time). These could share code, but it's fine to leave as-is for now since the contexts differ slightly.

Summary

The core design — making calibration work for static non-KV-cache graphs by properly handling padding and logits_token_pos — is correct and well-reasoned. The main actionable items are:

Bug: Eval wrapper's non-KV static + generate_full_logits=False path produces incorrect logits (same issue the builder guards against)
Bug risk: Token ID 100 can be out of bounds for small vocab configs
Cleanup: Duplicate logging statement in builder.py

digantdesai · 2026-05-20T15:16:12Z

Let me finish the review later today, LGTM at a high level.

xingguo01 · 2026-05-20T18:59:53Z

Thanks, I have been addressing the Claude comments and will get back to it on Monday. Need internal reviews before upstreaming.

- make non-KV-cache example inputs match the static export window
- fix PT2E calibration flow for padded prefixes and optional LM-Eval tasks
- update SmolLM2 export settings used by the VGF PT2E workflow
- Fix rope_theta in 135M_config.json to align with Hugging Face model config

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
Change-Id: Ic7feadfa2cb41d17a2fd3c509e8fa65a113018d7

linux-foundation-easycla · 2026-05-26T10:56:34Z

✅ login: xingguo01 / name: Xingguo Li (8f4eeaa)
❌ - login: @zingo / name: Zingo Andersen. The commit (90d1236) is not authorized under a signed CLA. Please click here to be authorized. For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

xingguo01 · 2026-05-26T13:49:54Z

@claude here is the reply, implemented in the new commits: Thanks for the detailed review.

I have addressed the actionable items in the latest update:

1. Static non-KV eval with generate_full_logits=False: GraphModuleEvalWrapper now rejects this combination, matching the builder-side calibration guard. Static non-KV eval requires generate_full_logits=True so the wrapper can use the full [batch, seq, vocab] output and avoid reading logits from the padded tail.
2. Example token IDs and max(3, ...): The example token generation is now bounded by the model vocab size instead of % 100 + 1, avoiding out-of-range token IDs for small-vocab test configs. I also added a short inline comment explaining that 3 preserves the historical minimum example input length.
3. Duplicate calibration logging: Removed the unconditional duplicate log. The calibration config is now logged only when calibration actually runs.
4. Calibration prefix validation: Moved _check_calibration_prefix_options() to the start of pt2e_calibrate() so unsupported static non-KV/non-full-logits calibration fails before tokenization or other work.
5. Device behavior: The calibration path intentionally passes device="cpu" for stability because some exported graphs can contain CPU-defaulting ops. The general eval wrapper still preserves the existing default behavior when no device is passed, so existing eval callers keep their current device selection behavior. I can add this note to the PR description.
6. Full-logits static padding: Agreed. For generate_full_logits=True, padding is shape compatibility only. lm-eval scores the real token positions, and causal masking means trailing pad/EOS tokens do not affect earlier logits.

I left the eval-time and calibration-time padding helpers separate for now because the contexts differ slightly.
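The guard described in the first item of this reply might be sketched like this. The function name and error text are assumptions. Only the rejected combination (static shape, no KV cache, no full logits) comes from the thread.

```python
def check_static_non_kv_options(
    use_kv_cache: bool,
    enable_dynamic_shape: bool,
    generate_full_logits: bool,
) -> None:
    """Reject the combination whose last-position logits come from padding."""
    if not use_kv_cache and not enable_dynamic_shape and not generate_full_logits:
        raise ValueError(
            "Static non-KV-cache graphs need generate_full_logits=True: "
            "last-position logits would be read from the padded tail."
        )


# Supported combinations pass silently.
check_static_non_kv_options(False, False, True)
check_static_non_kv_options(False, True, False)
check_static_non_kv_options(True, False, False)
```

Running the check at the top of the calibration or eval entry point lets the unsupported configuration fail before tokenization or any forward pass.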

claude · 2026-05-26T13:50:55Z

Claude finished @xingguo01's task in 2m 12s

Verification of Fixes

Read updated examples/models/llama/eval_llama_lib.py
Read updated examples/models/llama/model.py
Read updated extension/llm/export/builder.py
Read updated examples/models/llama/evaluate/eager_eval.py
Verify all 6 items addressed
Post final review

Thanks for the thorough responses and fixes, @xingguo01. I've verified each item against the current code:

1. Static non-KV eval with `generate_full_logits=False` ✅

eval_llama_lib.py:91-95 now raises a clear ValueError for this combination, matching the builder-side guard. The error message correctly directs users to use generate_full_logits=True.

2. Example token IDs and `max(3, ...)` ✅

model.py:304-306 now bounds tokens by the model's vocab size:

vocab_size = int(getattr(self.model_.params, "vocab_size", 0))
if vocab_size > 1:
    example_tokens = example_tokens % (vocab_size - 1) + 1

This produces IDs in [1, vocab_size-1], which is safe for all configs. The comment at line 290 explaining the historical minimum is helpful.

3. Duplicate calibration logging ✅

builder.py:416-418 — the log statement now only appears inside the conditional block that actually runs calibration. No more unconditional duplicate.

4. Calibration prefix validation ✅

builder.py:298 — _check_calibration_prefix_options() is called at the very top of pt2e_calibrate(), before tokenizer loading or any other work. Fails fast as requested.

5. Device behavior ✅

eager_eval.py:34 resolves device with fallback: device or ("cuda" if ...). The calibration path in builder.py:363 explicitly passes device="cpu" with a clear comment explaining why (CPU-defaulting ops in exported graphs). The eval path preserves existing behavior. The asymmetry is intentional and well-documented in the code comment. Adding a note to the PR description as you suggested would be a nice touch but not blocking.

6. Full-logits static padding ✅

The _pad_to_max_len helper (eval_llama_lib.py:111-128) pads after real tokens with EOS/pad. For generate_full_logits=True, causal masking ensures padding doesn't affect earlier positions — lm-eval reads only logits[:, :seq_len, :]. Sound.

Additional observations on the latest code

_prepare_calibration_prefix logic (builder.py:271-286): The logits_token_pos = min(len(prefix_tokens), max_len) - 1 correctly captures the last real token position before padding is applied. The calibration loop then uses this at line 328 (logits[:, logits_token_pos]) only in the full-logits path. For the non-full-logits dynamic path, logits[:] on the 2D output is correct.

lm-eval import guard (builder.py:341-348): The imports are now properly inside the if calibration_tasks: block, so prompt-only calibration doesn't require lm-eval. Good fix for the Copilot reviewer's concern.

Warning for missing calibration_limit (builder.py:400-405): Appropriately warns when tasks are provided without a limit, without changing behavior for existing callers.

Dynamic shape constraint (model.py:292): max_len = max_seq_len - 1 if self.enable_dynamic_shape else max_seq_len correctly matches the max=self.max_seq_len - 1 in LLMEdgeManager's dynamic shape setup (builder.py:144).

Summary

All six items from my previous review are properly addressed. The code is well-structured with appropriate guards, clear error messages, and the calibration/eval separation is sound. LGTM from my side.

One remaining note: the CLA check is currently failing — @xingguo01 may need to sign the CLA before this can merge.

xingguo01 · 2026-05-26T14:11:44Z

@digantdesai all comments have been addressed. I would appreciate it if you could let me know whether you have any further comments on this PR.

Copilot AI review requested due to automatic review settings April 27, 2026 16:44

xingguo01 requested review from larryliu0820, lucylq and mergennachin as code owners April 27, 2026 16:44

meta-cla Bot added the CLA Signed label Apr 27, 2026

Copilot started reviewing on behalf of xingguo01 April 27, 2026 16:45

xingguo01 added the partner: arm, ciflow/trunk, and release notes: arm labels Apr 27, 2026

Copilot AI reviewed Apr 27, 2026

zingo added the help wanted label May 12, 2026

Copilot AI review requested due to automatic review settings May 14, 2026 12:53

xingguo01 force-pushed the llm-support-vgf-non-kv-cache branch from e09cffa to 4e0992a Compare May 14, 2026 12:53

Copilot started reviewing on behalf of xingguo01 May 14, 2026 12:54

Copilot AI reviewed May 14, 2026

xingguo01 force-pushed the llm-support-vgf-non-kv-cache branch from 4e0992a to 81f6f45 Compare May 19, 2026 10:34

digantdesai reviewed May 20, 2026

Copilot AI review requested due to automatic review settings May 26, 2026 10:56

xingguo01 force-pushed the llm-support-vgf-non-kv-cache branch from ae60e69 to 8f4eeaa Compare May 26, 2026 10:56

Copilot AI reviewed May 26, 2026

Copilot started reviewing on behalf of xingguo01 May 26, 2026 12:20

Merge branch 'main' into llm-support-vgf-non-kv-cache

90d1236

digantdesai approved these changes May 28, 2026

xingguo01 merged commit f6be985 into pytorch:main May 29, 2026
476 of 489 checks passed
