LLM support: improve VGF export and calibration pipeline#19157

Merged
xingguo01 merged 2 commits into
pytorch:mainfrom
xingguo01:llm-support-vgf-non-kv-cache
May 29, 2026

@xingguo01
Collaborator

@xingguo01 xingguo01 commented Apr 27, 2026

This is stacked on top of #19029

  • Make non-KV-cache example inputs match the static export window
  • Fix the PT2E calibration flow for padded prefixes
    and optional lm-eval tasks
  • Update the SmolLM2 export settings used by the VGF PT2E workflow
  • Fix rope_theta in 135M_config.json to align with the Hugging Face
    model config

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell @rascani

Copilot AI review requested due to automatic review settings April 27, 2026 16:44
@pytorch-bot

pytorch-bot Bot commented Apr 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19157

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 Cancelled Jobs, 1 Pending, 5 Unrelated Failures, 1 Unclassified Failure

As of commit 90d1236 with merge base a89f1b4 (image):

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 27, 2026
@xingguo01 xingguo01 added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk release notes: arm Changes to the ARM backend delegate labels Apr 27, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR improves the ARM VGF PT2E export + calibration workflow for LLMs, including new VGF quantization modes/scope controls and fixes to non-KV-cache evaluation/calibration behavior.

Changes:

  • Add VGF 16a8w PT2E quantization support and a configurable VGF quantization scope (global vs Linear-only).
  • Update PT2E calibration/eval wrappers to better handle non-KV-cache graphs (padding/static-shape behavior, optional lm-eval tasks, CPU stability).
  • Update model/example configs (SmolLM2 rope_theta) and example inputs used for static exports.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| extension/llm/export/quantizer_lib.py | Adds VGF vgf_16a8w and quantize_scope handling when constructing the VGF quantizer. |
| extension/llm/export/config/llm_config.py | Introduces VgfQuantizeScope and wires CLI args into LlmConfig for VGF. |
| extension/llm/export/builder.py | Updates PT2E calibration loop for non-KV-cache padding and makes lm-eval evaluation optional; forces eval wrapper to CPU. |
| examples/models/smollm2/135M_config.json | Adjusts rope_theta to match the referenced HF config behavior. |
| examples/models/llama/tests/test_export_llama_lib.py | Adds unit tests for VGF quantization scope and vgf_16a8w INT16 compile-spec requirement. |
| examples/models/llama/model.py | Updates non-KV-cache example inputs to match static export window length. |
| examples/models/llama/export_llama_lib.py | Exposes VGF options via CLI and forwards quantize_scope into quantizer construction. |
| examples/models/llama/evaluate/eager_eval.py | Adds optional device override support in the lm-eval wrapper base class. |
| examples/models/llama/eval_llama_lib.py | Reconstructs per-position logits for non-KV-cache graphs when generate_full_logits=False; supports device override. |

Comment thread examples/models/llama/eval_llama_lib.py Outdated
Comment on lines +116 to +128
if self._generate_full_logits:
    return self._model(pad_to_max_len(inps))

# Reconstruct full logits by running prefixes.
result_logits = []
for pos in range(min(seq_len, self._max_seq_length)):
    prefix = pad_to_max_len(inps[:, : pos + 1])
    logits = self._model(prefix)
    if logits.dim() == 3:
        logits = logits[:, -1, :]
    result_logits.append(logits)

return torch.stack(result_logits, dim=1)

Copilot AI Apr 27, 2026

In the non-KV-cache + generate_full_logits=False case, logits are reconstructed by running the model once per prefix position (for pos in range(...)). As a result, lm-eval runs O(seq_len) forward passes per sample, which can become extremely slow at larger sequence lengths. If possible, prefer exporting with generate_full_logits=True for evaluation/calibration, or add a fast path/guard (e.g., only reconstruct up to the required positions, or raise with guidance when seq_len is large).

Collaborator Author

Good point. The prefix reconstruction path should be numerically equivalent for a causal model, but it is not a good default for lm-eval because it scales as one forward per prefix position.

I found one wiring issue here: the eval entrypoint forces generate_full_logits=True, but the quantized GraphModuleEvalWrapper was not receiving that config, so it defaulted to False and could hit this slow fallback. I fixed that by passing llm_config.debug.generate_full_logits into GraphModuleEvalWrapper.

Longer term, I agree a guard or warning would be useful if someone explicitly evaluates a non-KV-cache export with generate_full_logits=False at larger seq lengths.
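
The cost and equivalence discussed in this thread can be illustrated with a minimal sketch. It assumes a toy stand-in for a causal model (`toy_causal_model`, hypothetical) whose output at position i depends only on tokens 0..i, a hypothetical `PAD_ID`, and a model that exposes one output row per position. It is not the wrapper code from this PR.

```python
from typing import Callable, List, Tuple

PAD_ID = 0  # hypothetical pad/EOS id, used only in this sketch


def toy_causal_model(tokens: List[int]) -> List[int]:
    """Stand-in for a causal LM: output i depends only on tokens[0..i]."""
    out, running = [], 0
    for tok in tokens:
        running = (running * 31 + tok) % 1009
        out.append(running)
    return out


def pad_to_max_len(tokens: List[int], max_len: int) -> List[int]:
    """Right-pad a prefix to the static window length."""
    return tokens + [PAD_ID] * (max_len - len(tokens))


def reconstruct_by_prefix(
    model: Callable[[List[int]], List[int]], tokens: List[int], max_len: int
) -> Tuple[List[int], int]:
    """Run one forward pass per prefix position and count the passes."""
    rows, calls = [], 0
    for pos in range(min(len(tokens), max_len)):
        full = model(pad_to_max_len(tokens[: pos + 1], max_len))
        calls += 1
        rows.append(full[pos])  # row of the last real token, not the padded tail
    return rows, calls


tokens = [5, 7, 11, 13]
max_len = 8

# One forward pass over the padded window, keeping only the real positions.
single_pass = toy_causal_model(pad_to_max_len(tokens, max_len))[: len(tokens)]
# One forward pass per prefix position.
per_prefix, num_calls = reconstruct_by_prefix(toy_causal_model, tokens, max_len)
```

In this toy setting the per-prefix path needs `len(tokens)` forward passes to produce the same rows that a single full-output pass returns, which is the reason `generate_full_logits=True` is the cheaper choice for lm-eval.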

Comment thread examples/models/llama/model.py Outdated
)
max_len = getattr(self.llm_config.export, "max_seq_length", 3)
max_len = max(3, int(max_len))
example_tokens = torch.arange(max_len, dtype=torch.int32).unsqueeze(0)

Copilot AI Apr 27, 2026

In the non-KV-cache path, the example token tensor is created with dtype int32. Because torch.export uses example inputs to specialize/guard the graph, this can lock the exported program to int32 token IDs, while most tokenization/eval codepaths in this repo use torch.long/int64 tokens. Consider generating example_tokens with dtype=torch.long (or otherwise ensuring the rest of the pipeline consistently uses int32) to avoid dtype guard failures at runtime.

Suggested change
- example_tokens = torch.arange(max_len, dtype=torch.int32).unsqueeze(0)
+ example_tokens = torch.arange(max_len, dtype=torch.long).unsqueeze(0)

Collaborator Author

This has been addressed in the latest patchset. The non-KV-cache example input no longer unconditionally uses int32: it now uses torch.long by default, while preserving torch.int32 only for the Arm backend paths that require it (ethosu/tosa/vgf).

Comment on lines +326 to +346
if calibration_tasks:
    eval_wrapper = GraphModuleEvalWrapper(
        model=prepared_module,
        tokenizer=tokenizer,
        max_seq_length=calibration_seq_length,
        use_kv_cache=self.use_kv_cache,
        generate_full_logits=self.generate_full_logits,
        enable_dynamic_shape=self.enable_dynamic_shape,
        # The exported graph can contain ops like aten.full.default
        # without explicit device, which default to CPU and can
        # trigger device-mismatch errors when lm_eval runs on CUDA.
        # Calibrate on CPU for stability.
        device="cpu",
    )

for task, res in eval_results["results"].items():
print(f"{task}: {res}")
logging.info("Calibration finish...")
with torch.no_grad():
eval_results = simple_evaluate(
model=eval_wrapper,
tasks=calibration_tasks,
limit=calibration_limit,
)

Copilot AI Apr 27, 2026

With the new if calibration_tasks: gate, simple_evaluate(..., limit=calibration_limit) will be called even when calibration_limit is None (the CLI default). In lm-eval this typically means evaluating the full dataset, which can make calibration unexpectedly long/expensive. Consider requiring calibration_limit when tasks are provided, supplying a conservative default, or emitting a clear warning when tasks are set without a limit.

Collaborator Author

This is addressed in the latest patchset by adding a warning before calibration starts when calibration_tasks is provided without calibration_limit. I intentionally left the behavior unchanged instead of forcing a default, because existing callers may rely on lm-eval's full-dataset behavior.
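
A minimal sketch of such a warning is shown below. The helper name `warn_if_unbounded_calibration`, the logger name, and the message text are assumptions for illustration; the actual wording in builder.py may differ.

```python
import logging
from typing import List, Optional

logger = logging.getLogger("calibration_sketch")  # hypothetical logger name


def warn_if_unbounded_calibration(
    calibration_tasks: Optional[List[str]],
    calibration_limit: Optional[int],
) -> bool:
    """Log a warning and return True when tasks are set but no limit is given."""
    if calibration_tasks and calibration_limit is None:
        logger.warning(
            "calibration_tasks=%s set without calibration_limit; "
            "lm-eval may evaluate the full dataset.",
            calibration_tasks,
        )
        return True
    # Behavior is otherwise unchanged: no default limit is forced.
    return False
```

Warning instead of forcing a default keeps existing callers that rely on full-dataset evaluation working, at the cost of leaving the slow path reachable.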

@zingo zingo added the help wanted Extra attention is needed label May 12, 2026
@zingo
Collaborator

zingo commented May 12, 2026

Hi @digantdesai, is this OK to merge?

Copilot AI review requested due to automatic review settings May 14, 2026 12:53
@xingguo01 xingguo01 force-pushed the llm-support-vgf-non-kv-cache branch from e09cffa to 4e0992a Compare May 14, 2026 12:53
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

examples/models/llama/model.py:302

  • The new example token pattern can produce token id 100 (arange % 100 + 1). Llama configurations in this repo use vocab_size=100 in tests (for example examples/models/llama/tests/test_yoco_attention.py:365), where valid token ids are 0..99; exporting such a non-KV model would hit an embedding index out of range. Please bound the generated ids by the configured vocab size (or keep them in a small known-safe range).
            example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0)
            example_tokens = example_tokens % 100 + 1

Comment thread examples/models/llama/eval_llama_lib.py Outdated
result_logits = []
for pos in range(min(seq_len, self._max_seq_length)):
    prefix = pad_to_max_len(inps[:, : pos + 1])
    logits = self._model(prefix)
Comment thread examples/models/llama/model.py Outdated
Comment on lines +289 to +301
max_len = getattr(self.llm_config.export, "max_seq_length", 3)
max_len = max(3, int(max_len))
backend = self.llm_config.backend
token_dtype = (
    torch.int32
    if (
        backend.ethosu.enabled
        or backend.tosa.enabled
        or backend.vgf.enabled
    )
    else torch.long
)
example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0)

Collaborator Author

Addressed. The non-KV example input now uses max_seq_length - 1 only when enable_dynamic_shape=True, matching the existing dynamic-shape upper bound in LLMEdgeManager. Static exports still use the full max_seq_length window.

Comment on lines 389 to 393
if (
self.calibration_tasks is not None
and self.calibration_limit is not None
and self.calibration_seq_length is not None
self.calibration_seq_length is not None
and self.calibration_data is not None
and self.tokenizer_path is not None
):

Collaborator Author

Addressed. The lm-eval-specific imports are now guarded by if calibration_tasks:, so prompt-only calibration can run without lm-eval installed. The broadened pt2e_quantize() condition now only requires calibration_seq_length, calibration_data, and tokenizer_path for template calibration.
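
The deferred-import pattern described here could look roughly like the sketch below. `run_calibration`, `run_prompt`, and `prompts` are hypothetical names for illustration; only `simple_evaluate` from lm-eval and its `model`/`tasks`/`limit` arguments are taken from the snippet quoted earlier in this thread.

```python
from typing import Any, Callable, List, Optional, Sequence


def run_calibration(
    run_prompt: Callable[[str], Any],
    prompts: Sequence[str],
    calibration_tasks: Optional[List[str]] = None,
    calibration_limit: Optional[int] = None,
    eval_wrapper: Any = None,
):
    """Prompt calibration always runs; lm-eval is imported only if tasks are set."""
    for prompt in prompts:
        run_prompt(prompt)

    if not calibration_tasks:
        return None  # prompt-only path: lm-eval is never imported

    # Deferred import: only needed for task-based calibration.
    from lm_eval.evaluator import simple_evaluate

    return simple_evaluate(
        model=eval_wrapper,
        tasks=calibration_tasks,
        limit=calibration_limit,
    )
```

With no tasks configured, the function returns before the import statement is reached, so an environment without lm-eval installed can still run prompt-only calibration.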

Comment on lines 312 to +316
if pos >= len(token_list):
if self.generate_full_logits:
token_list.append(
torch.argmax(logits[:, -1], dim=-1).item()
)
next_token = torch.argmax(logits[:, -1], dim=-1).item()
else:
token_list.append(torch.argmax(logits[:], dim=-1).item())
next_token = torch.argmax(logits[:], dim=-1).item()

Collaborator Author

Addressed. Static non-KV full-logits calibration now tracks logits_token_pos before padding and samples logits[:, logits_token_pos], i.e. the last non-pad prefix token rather than the padded tail. Static non-KV calibration without full logits is rejected because that output shape cannot select the non-pad token row safely.

@xingguo01 xingguo01 force-pushed the llm-support-vgf-non-kv-cache branch from 4e0992a to 81f6f45 Compare May 19, 2026 10:34
Comment thread examples/models/llama/eval_llama_lib.py Outdated
token_len = tokens.shape[-1]
if token_len < self._max_seq_length:
pad_len = self._max_seq_length - token_len
pad_token = getattr(
Contributor

@digantdesai digantdesai May 20, 2026

Curious: what is the impact on eval of padding with EOS for this non-KV path with generate_full_logits=False? Or is this mainly for shape compatibility?

Collaborator Author

So yes, with static non-KV + generate_full_logits=False, EOS padding can affect the returned last-position logits if the graph uses the padded tail as the last valid token. That mode is mainly a shape-compatibility fallback, not the preferred evaluation path. For accurate static non-KV eval/calibration we should use generate_full_logits=True or pass an explicit last_valid_token_pos through the graph.
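
The effect can be seen in a small sketch. It assumes a toy causal stand-in (`toy_causal_model`, hypothetical) and a hypothetical `EOS_ID`; real logits are vectors, but a single integer per position is enough to show which rows change.

```python
from typing import List

EOS_ID = 2  # hypothetical EOS id, used here as the pad token


def toy_causal_model(tokens: List[int]) -> List[int]:
    """Stand-in for a causal LM: output i depends only on tokens[0..i]."""
    out, running = [], 0
    for tok in tokens:
        running = (running * 31 + tok) % 1009
        out.append(running)
    return out


tokens = [5, 7, 11]
max_seq_length = 6
padded = tokens + [EOS_ID] * (max_seq_length - len(tokens))

unpadded_out = toy_causal_model(tokens)
padded_out = toy_causal_model(padded)

# Rows for real tokens are untouched by trailing padding.
real_rows_unchanged = padded_out[: len(tokens)] == unpadded_out
# The final row of the padded window belongs to a pad token, not the last real token.
last_row_matches_last_real = padded_out[-1] == unpadded_out[-1]
```

Under these assumptions `real_rows_unchanged` holds while `last_row_matches_last_real` does not, which matches the distinction drawn above: full logits stay usable with padding, whereas a "last position only" output reads from the padded tail.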

[[1, 2, 3]], dtype=torch.long
), # tokens, with kv cache our input token length is always just 1 token.
max_seq_len = getattr(self.llm_config.export, "max_seq_length", 3)
max_seq_len = max(3, int(max_seq_len))

Contributor

Curious: what's special about 3?

Collaborator Author

Nothing special semantically. 3 preserves the old example-input length/default ([[1, 2, 3]]) for the generic non-KV path when no export max_seq_length is configured.

The max(3, ...) guard also avoids producing a too-small dynamic-shape example, because the dynamic non-KV path uses max_seq_length - 1 as the example length. So with the minimum of 3, the dynamic example is still length 2.

For real exports, export.max_seq_length should be set explicitly, so this fallback only affects default/example behavior.
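
The length rule described in this reply can be sketched as a small function. `example_input_length` and `MIN_EXAMPLE_LEN` are hypothetical names; the `max(3, ...)` floor and the `max_seq_length - 1` rule for dynamic shapes come from this thread.

```python
from typing import Optional

MIN_EXAMPLE_LEN = 3  # historical length of the old [[1, 2, 3]] example input


def example_input_length(
    max_seq_length: Optional[int], enable_dynamic_shape: bool
) -> int:
    """Length of the non-KV-cache example token tensor."""
    # Fall back to the historical default when no export length is configured.
    configured = MIN_EXAMPLE_LEN if max_seq_length is None else int(max_seq_length)
    max_seq_len = max(MIN_EXAMPLE_LEN, configured)
    # Dynamic exports use one less than the window; static exports use the full window.
    return max_seq_len - 1 if enable_dynamic_shape else max_seq_len
```

For example, an unconfigured dynamic export yields a length of 2, and a static export with `max_seq_length=128` yields the full 128-token window.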

@digantdesai
Contributor

@claude Review this PR.

@claude

claude Bot commented May 20, 2026

Claude finished @digantdesai's task in 2m 21s


PR Review: LLM support - improve VGF export and calibration pipeline

  • Read PR diff and changed files
  • Review examples/models/llama/eval_llama_lib.py
  • Review examples/models/llama/evaluate/eager_eval.py
  • Review examples/models/llama/model.py
  • Review extension/llm/export/builder.py
  • Post review feedback

Overall this is a well-structured PR that addresses real issues in the non-KV-cache calibration/eval flow. The changes are logically decomposed and the author has been responsive to Copilot's earlier feedback. A few items to address:


1. eval_llama_lib.py — Non-KV static + generate_full_logits=False logits are still incorrect

examples/models/llama/eval_llama_lib.py:119-128

When enable_dynamic_shape=False and generate_full_logits=False, each prefix is padded and the model returns only "last-position" logits. But for a causal model with padding, the "last position" ([:, -1, :] at line 125) corresponds to the last pad token position, not the last real prefix token. The builder code (_check_calibration_prefix_options) correctly rejects this combination for calibration, but the eval wrapper still silently runs it — it will produce incorrect lm-eval scores.

Suggestion: Mirror the builder's validation here. Either raise an error for this combination or use a logits_token_pos approach (as done in _prepare_calibration_prefix) to select the correct position from 3D logits.


2. model.py:302-303 — Token ID can hit vocab_size boundary

example_tokens = torch.arange(max_len, dtype=token_dtype).unsqueeze(0)
example_tokens = example_tokens % 100 + 1

This produces token IDs in range [1, 100]. Some test configs in this repo use vocab_size=100 (valid IDs: 0..99), which means token ID 100 would be out of bounds for the embedding layer during tracing. Using % 100 (yielding [0, 99]) or % vocab_size would be safer. The Copilot reviewer flagged this too.

Also, responding to @digantdesai's question on line 290 — the max(3, ...) ensures the example has at least 3 tokens to avoid degenerate graph specialization, but the magic number 3 deserves a brief inline comment or a named constant.


3. builder.py:411-418 — Duplicate logging statement

logging.info(
    f"Calibrating with tasks: {self.calibration_tasks}, limit: ..."
)

This exact log message appears twice — once at line 411 (unconditionally within the quantizer block) and again at line 420 (inside the if that actually runs calibration). The first one at line 411 fires even when calibration is not going to run. One of these should be removed.


4. builder.py:307-308 - _check_calibration_prefix_options called inside calibrate_template

The validation self._check_calibration_prefix_options() is inside the inner function calibrate_template, which means it runs after tokenization. It would be cleaner to call this at the top of pt2e_calibrate — fail fast before doing any work. Minor nit.


5. eager_eval.py — Device override changes behavior for existing callers

The device parameter defaults to None, preserving backward compatibility. However, GraphModuleEvalWrapper in builder.py hardcodes device="cpu" for calibration stability, but gen_eval_wrapper (line 262-268) does not pass device — so eval still auto-selects CUDA if available. This asymmetry is intentional per the comment about calibration stability, but it means eval and calibration may run on different devices for the same model. Worth documenting or at least noting in the PR description.


6. eval_llama_lib.py:116-117 — Full-logits path with padding doesn't select correct token

if self._generate_full_logits:
    return self._model(pad_to_max_len(inps))

When enable_dynamic_shape=False and the input is shorter than max_seq_length, this pads and returns the full [batch, max_seq_len, vocab] tensor to lm-eval. lm-eval will use logits[:, :seq_len, :] to score only the real tokens — this is correct because causal attention means pad tokens at the end don't affect earlier positions. Just confirming this is sound.


Minor nits

  • eval_llama_lib.py:102 - @digantdesai asked about the impact of EOS padding on eval. The EOS padding for static graphs only affects positions after the real tokens; due to causal masking, logits for real prefix positions are unchanged. The answer is: no eval impact for the generate_full_logits=True path.

  • The pad_to_max_len helper is duplicated across eval_llama_lib.py (eval time) and _prepare_calibration_prefix in builder.py (calibration time). These could share code, but it's fine to leave as-is for now since the contexts differ slightly.


Summary

The core design — making calibration work for static non-KV-cache graphs by properly handling padding and logits_token_pos — is correct and well-reasoned. The main actionable items are:

  1. Bug: Eval wrapper's non-KV static + generate_full_logits=False path produces incorrect logits (same issue the builder guards against)
  2. Bug risk: Token ID 100 can be out of bounds for small vocab configs
  3. Cleanup: Duplicate logging statement in builder.py

@digantdesai
Contributor

Let me finish the review later today, LGTM at a high level.

@xingguo01
Collaborator Author

Thanks, I have been addressing the Claude comments and will get back to it on Monday. Need internal reviews before upstreaming.

- make non-KV-cache example inputs match the static export window
- fix PT2E calibration flow for padded prefixes
  and optional LM-Eval tasks
- update SmolLM2 export settings used by the VGF PT2E workflow
- Fix rope_theta in 135M_config.json to align with Hugging Face
  model config

Signed-off-by: Xingguo Li <xingguo.li@arm.com>
Change-Id: Ic7feadfa2cb41d17a2fd3c509e8fa65a113018d7
Copilot AI review requested due to automatic review settings May 26, 2026 10:56
@xingguo01 xingguo01 force-pushed the llm-support-vgf-non-kv-cache branch from ae60e69 to 8f4eeaa Compare May 26, 2026 10:56
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 26, 2026

CLA Not Signed

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@xingguo01
Collaborator Author

@claude Thanks for the detailed review. My replies are below, and the changes are implemented in the new commits.

I have addressed the actionable items in the latest update:

  1. Static non-KV eval with generate_full_logits=False
    GraphModuleEvalWrapper now rejects this combination, matching the builder-side calibration guard. Static non-KV eval requires generate_full_logits=True so the wrapper can use the full [batch, seq, vocab] output and avoid reading logits from the padded tail.

  2. Example token IDs and max(3, ...)
    The example token generation is now bounded by the model vocab size instead of % 100 + 1, avoiding out-of-range token IDs for small-vocab test configs. I also added a short inline comment explaining that 3 preserves the historical minimum example input length.

  3. Duplicate calibration logging
    Removed the unconditional duplicate log. The calibration config is now logged only when calibration actually runs.

  4. Calibration prefix validation
    Moved _check_calibration_prefix_options() to the start of pt2e_calibrate() so unsupported static non-KV/non-full-logits calibration fails before tokenization or other work.

  5. Device behavior
    The calibration path intentionally passes device="cpu" for stability because some exported graphs can contain CPU-defaulting ops. The general eval wrapper still preserves the existing default behavior when no device is passed, so existing eval callers keep their current device selection behavior. I can add this note to the PR description.

  6. Full-logits static padding
    Agreed. For generate_full_logits=True, padding is shape compatibility only. lm-eval scores the real token positions, and causal masking means trailing pad/EOS tokens do not affect earlier logits.

I left the eval-time and calibration-time padding helpers separate for now because the contexts differ slightly.

@claude

claude Bot commented May 26, 2026

Claude finished @xingguo01's task in 2m 12s


Verification of Fixes

  • Read updated examples/models/llama/eval_llama_lib.py
  • Read updated examples/models/llama/model.py
  • Read updated extension/llm/export/builder.py
  • Read updated examples/models/llama/evaluate/eager_eval.py
  • Verify all 6 items addressed
  • Post final review

Thanks for the thorough responses and fixes, @xingguo01. I've verified each item against the current code:

1. Static non-KV eval with generate_full_logits=False

eval_llama_lib.py:91-95 now raises a clear ValueError for this combination, matching the builder-side guard. The error message correctly directs users to use generate_full_logits=True.
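
A guard of this kind could look like the following sketch. The function name `check_static_non_kv_eval` and the error text are assumptions; only the rejected combination (non-KV-cache, static shape, `generate_full_logits=False`) comes from this review.

```python
def check_static_non_kv_eval(
    use_kv_cache: bool,
    enable_dynamic_shape: bool,
    generate_full_logits: bool,
) -> None:
    """Reject the one combination that would read logits from the padded tail."""
    is_static_non_kv = not use_kv_cache and not enable_dynamic_shape
    if is_static_non_kv and not generate_full_logits:
        raise ValueError(
            "Static non-KV-cache evaluation requires generate_full_logits=True "
            "so that logits can be read at real token positions."
        )
```

Raising early keeps an unsupported export from silently producing misleading lm-eval scores. The other combinations pass through unchanged in this sketch.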

2. Example token IDs and max(3, ...)

model.py:304-306 now bounds tokens by the model's vocab size:

vocab_size = int(getattr(self.model_.params, "vocab_size", 0))
if vocab_size > 1:
    example_tokens = example_tokens % (vocab_size - 1) + 1

This produces IDs in [1, vocab_size-1], which is safe for all configs. The comment at line 290 explaining the historical minimum is helpful.
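
The difference between the old and new bounding can be checked with plain Python lists standing in for the tensors. `bounded_example_tokens` is a hypothetical helper that mirrors the modulo expression quoted above; it is a sketch, not the code in model.py.

```python
from typing import List


def bounded_example_tokens(max_len: int, vocab_size: int) -> List[int]:
    """Mirror `arange(max_len) % (vocab_size - 1) + 1` with plain integers."""
    tokens = list(range(max_len))
    if vocab_size > 1:
        tokens = [t % (vocab_size - 1) + 1 for t in tokens]
    return tokens


vocab_size = 100  # small-vocab test config; valid ids are 0..99
old_tokens = [t % 100 + 1 for t in range(128)]  # earlier pattern from this PR
new_tokens = bounded_example_tokens(128, vocab_size)

old_in_range = all(0 <= t < vocab_size for t in old_tokens)
new_in_range = all(0 <= t < vocab_size for t in new_tokens)
```

With a 128-token window and `vocab_size=100`, the earlier pattern reaches id 100 (out of range), while the bounded version stays within 1..99.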

3. Duplicate calibration logging ✅

builder.py:416-418 — the log statement now only appears inside the conditional block that actually runs calibration. No more unconditional duplicate.

4. Calibration prefix validation ✅

builder.py:298 - _check_calibration_prefix_options() is called at the very top of pt2e_calibrate(), before tokenizer loading or any other work. Fails fast as requested.

5. Device behavior ✅

eager_eval.py:34 resolves device with fallback: device or ("cuda" if ...). The calibration path in builder.py:363 explicitly passes device="cpu" with a clear comment explaining why (CPU-defaulting ops in exported graphs). The eval path preserves existing behavior. The asymmetry is intentional and well-documented in the code comment. Adding a note to the PR description as you suggested would be a nice touch but not blocking.
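
The device asymmetry can be sketched as below. This assumes the elided fallback picks CUDA when it is available and CPU otherwise; `resolve_device` is a hypothetical helper, and CUDA availability is passed in as a plain boolean so the sketch does not depend on torch.

```python
from typing import Optional


def resolve_device(device: Optional[str], cuda_available: bool) -> str:
    """An explicit device wins; otherwise fall back to CUDA if present, else CPU."""
    # Assumption: the "cpu" fallback is illustrative, not confirmed by this thread.
    return device or ("cuda" if cuda_available else "cpu")


# Calibration passes device="cpu" explicitly, so it never moves to CUDA.
calibration_device = resolve_device("cpu", cuda_available=True)
# Eval passes nothing, so it keeps the auto-selected device.
eval_device = resolve_device(None, cuda_available=True)
```

On a CUDA machine this yields different devices for calibration and eval of the same model, which is the intentional asymmetry noted above.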

6. Full-logits static padding ✅

The _pad_to_max_len helper (eval_llama_lib.py:111-128) pads after real tokens with EOS/pad. For generate_full_logits=True, causal masking ensures padding doesn't affect earlier positions — lm-eval reads only logits[:, :seq_len, :]. Sound.


Additional observations on the latest code

_prepare_calibration_prefix logic (builder.py:271-286): The logits_token_pos = min(len(prefix_tokens), max_len) - 1 correctly captures the last real token position before padding is applied. The calibration loop then uses this at line 328 (logits[:, logits_token_pos]) only in the full-logits path. For the non-full-logits dynamic path, logits[:] on the 2D output is correct.
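
A minimal sketch of this bookkeeping, using plain lists in place of tensors, is shown below. `prepare_calibration_prefix` is a hypothetical stand-in for the builder helper; the `logits_token_pos` formula is the one quoted above, while keeping the first `max_len` tokens on truncation and the `pad_id` argument are assumptions.

```python
from typing import List, Sequence, Tuple


def prepare_calibration_prefix(
    prefix_tokens: Sequence[int], max_len: int, pad_id: int
) -> Tuple[List[int], int]:
    """Return the padded static window and the index of the last real token."""
    # Position of the last real token, computed before any padding is added.
    logits_token_pos = min(len(prefix_tokens), max_len) - 1
    # Assumption: keep the first max_len tokens when the prefix is too long.
    window = list(prefix_tokens[:max_len])
    window += [pad_id] * (max_len - len(window))
    return window, logits_token_pos


window, pos = prepare_calibration_prefix([5, 7, 11], max_len=6, pad_id=0)
```

With full logits, the next token would then be taken from row `pos` of the model output instead of the final row, which belongs to padding whenever the prefix is shorter than the window.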

lm-eval import guard (builder.py:341-348): The imports are now properly inside the if calibration_tasks: block, so prompt-only calibration doesn't require lm-eval. Good fix for the Copilot reviewer's concern.

Warning for missing calibration_limit (builder.py:400-405): Appropriately warns when tasks are provided without a limit, without changing behavior for existing callers.

Dynamic shape constraint (model.py:292): max_len = max_seq_len - 1 if self.enable_dynamic_shape else max_seq_len correctly matches the max=self.max_seq_len - 1 in LLMEdgeManager's dynamic shape setup (builder.py:144).


Summary

All six items from my previous review are properly addressed. The code is well-structured with appropriate guards, clear error messages, and the calibration/eval separation is sound. LGTM from my side.

One remaining note: the CLA check is currently failing — @xingguo01 may need to sign the CLA before this can merge.


@xingguo01
Copy link
Copy Markdown
Collaborator Author

@digantdesai All comments have been addressed. I would appreciate it if you could let me know whether you have any additional comments on this PR.

@xingguo01 xingguo01 merged commit f6be985 into pytorch:main May 29, 2026
476 of 489 checks passed