[ARM/Ethos-U] Partition boundary buffer dtype mismatch causes silent accuracy loss in attention models #19364


Bug Description

FoldAndAnnotateQParamsPass folds the boundary dequantize_per_tensor at an Ethos-U partition exit into the preceding passthrough op (view_copy, permute_copy). Vela then compiles the partition INT8-only and writes 1 byte per output element, but the parent graph's executorch_call_delegate node retains the original FP32 dtype on meta["val"], so ExecuTorch allocates a 4× larger buffer. At runtime, copy_with_layout_adjustment() sees expand_factor=4, elem_size=1, ScalarType::Float, which is not a handled case, and returns Error::InvalidProgram: a hard fault at the first delegate-call boundary in any attention model.

Error

EthosUBackend.cpp::copy_with_layout_adjustment()
  expand_factor=4 elem_size=1 ScalarType::Float → Error::InvalidProgram

If a raw-memcpy fallback is applied (v1.0.1-style, skipping the reject), inference completes but produces wrong outputs: Vela's INT8 bytes land in the first 25% of the FP32 buffer and downstream FP32 consumers (softmax, scale, mask) reinterpret them as floats.
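A minimal numpy illustration of that failure mode (illustration only, not the runtime code):

```python
import numpy as np

n = 8  # pretend the delegate output has 8 elements
vela_out = np.arange(-4, 4, dtype=np.int8)  # what Vela's INT8 kernel wrote: n bytes
fp32_buf = np.zeros(n, dtype=np.float32)    # what ExecuTorch allocated: 4*n bytes

# A raw memcpy of n bytes into the 4*n-byte buffer: the INT8 data occupies
# only the first 25%, and each 4-byte group is then read back as one float.
fp32_buf.view(np.int8)[:n] = vela_out
print(fp32_buf)  # garbage floats, consumed as-is by softmax / scale / mask
```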

Steps to Reproduce

  1. Lower any quantized attention model (e.g. mobilevit_s INT8) via EthosUQuantizer + EthosUPartitioner targeting ethos-u85-256 (see the sketch after these steps).
  2. Build the runner and run on FVP_Corstone_SSE-320 with torch.ones(1, 3, 256, 256) (mobilevit_s default input is 256×256 in timm).
  3. Without fallback: hard-fault at the first CALL_DELEGATE instruction.
  4. With v1.0.1-style raw-memcpy fallback: inference completes with wrong outputs — on mobilevit_s INT8 we observed CPU top-1 = 916, FVP top-1 = 482, max |diff| = 14.73.
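For reference, a hedged AOT repro sketch. Import paths below follow the examples/arm flow and move around between ExecuTorch releases, and the compile-spec builder call is deliberately elided; verify the exact names against your checkout before running.

```python
import timm
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

from executorch.backends.arm.ethosu_partitioner import EthosUPartitioner  # path may vary
from executorch.backends.arm.quantizer.arm_quantizer import (             # path may vary
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import to_edge_transform_and_lower

model = timm.create_model("mobilevit_s", pretrained=True).eval()
example = (torch.ones(1, 3, 256, 256),)  # timm default input size for mobilevit_s

compile_spec = ...  # build an ethos-u85-256 compile spec with your version's builder

quantizer = EthosUQuantizer(compile_spec)
quantizer.set_global(get_symmetric_quantization_config())

prepared = prepare_pt2e(torch.export.export(model, example).module(), quantizer)
prepared(*example)                       # one calibration pass is enough to repro
quantized = convert_pt2e(prepared)

lowered = to_edge_transform_and_lower(
    torch.export.export(quantized, example),
    partitioner=[EthosUPartitioner(compile_spec)],
)
pte = lowered.to_executorch()            # the .pte to flash / inspect in step 2
```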

Root Cause

Two AOT-side mechanisms interact:

  1. FoldAndAnnotateQParamsPass.is_foldable() returns True for passthrough ops (view_copy, permute_copy) annotated with ArmAnnotationInfo(quantized=True) by EthosUQuantizer. For a chain matmul (INT8) → DQ → view_copy → partition output, the DQ is erased and view_copy is rewired to the pre-DQ INT8 node. The trailing retrace updates view_copy.meta["val"] to INT8 inside the deep-copied partition graph.

  2. _insert_lowered_submodule() in exir/backend/backend_api.py captures call_delegate.meta["val"] from submodule_output_node.args[0] — the pre-deepcopy FX nodes in the parent graph, which were never refreshed. The stale FP32 dtype is serialized into the PTE, causing ExecuTorch to allocate a 4×-sized buffer for a 1-byte-per-element Vela output.

On mobilevit_s INT8, 36 of 58 partition outputs carry this stale FP32 dtype, exactly matching the 36 expand_factor=4 rejects observed at runtime.
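A minimal FX illustration of mechanism 2 (a toy graph, not the real pass): refreshing meta["val"] inside the deep-copied partition graph never touches the parent-graph node that _insert_lowered_submodule() actually reads.

```python
import copy

import torch
import torch.fx as fx

def f(x):
    return x.view(-1)  # stand-in for the passthrough op at the partition exit

parent_gm = fx.symbolic_trace(f)
parent_view = next(n for n in parent_gm.graph.nodes if n.op == "call_method")
parent_view.meta["val"] = torch.empty(4, dtype=torch.float32)  # pre-fold dtype

# The partitioner deep-copies the submodule; FoldAndAnnotateQParamsPass and
# its trailing retrace then update meta only on the copy:
sub_gm = copy.deepcopy(parent_gm)
sub_view = next(n for n in sub_gm.graph.nodes if n.op == "call_method")
sub_view.meta["val"] = torch.empty(4, dtype=torch.int8)        # post-fold dtype

print(sub_view.meta["val"].dtype)     # torch.int8    <- what Vela will emit
print(parent_view.meta["val"].dtype)  # torch.float32 <- stale; serialized to PTE
```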

Why Narrower Fixes Don't Work

  • Skip the boundary DQ fold: keeping the DQ inside the partition leaves its FP32-typed scale/zero_point placeholders in the TOSA graph; TOSA-1.0+INT rejects every FP32 placeholder regardless of role.
  • Set meta["val"] inside the pass: only updates the deep-copied partition graph, not the parent graph's submodule_output_node.args[0].
  • Vela output flag: Vela has no FP32 output mode (--output-format only exposes tflite/raw).

Proposed Fix

Propagate the actual Vela output dtype back to the parent graph through PreprocessResult:

  1. Return output_elem_sizes from vela_compile() (already in the Vela NPZ as output_elem_size).
  2. Add output_dtypes: Optional[List[torch.dtype]] = None and output_qparams: Optional[List[Optional[Tuple[float, int]]]] = None to PreprocessResult (both default None → no impact on other backends).
  3. In _insert_lowered_submodule(), when output_dtypes is present: rewrite call_delegate.meta["val"] to the correct dtype and pre-set meta["spec"] (the `if "spec" not in node.meta:` guard added by #15485, "Fix double-tracing in SpecPropPass", prevents the subsequent retrace from overwriting it), then insert a CPU dequantize_per_tensor after each getitem to restore the FP32 view expected by downstream consumers. A sketch of the schema change follows.
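A sketch of the proposed schema change. This is our proposal, not an existing upstream API; the two existing fields are abbreviated from exir/backend/backend_details.py and may differ slightly across versions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

import torch

@dataclass
class PreprocessResult:
    processed_bytes: bytes = bytes()
    debug_handle_map: Optional[Union[Dict[int, Tuple[int]], Dict[str, Tuple[int]]]] = None

    # Proposed additions. Both default to None, so backends that never set
    # them keep serializing exactly as before.
    output_dtypes: Optional[List[torch.dtype]] = None
    output_qparams: Optional[List[Optional[Tuple[float, int]]]] = None  # (scale, zero_point)
```

_insert_lowered_submodule() would then consult output_dtypes when rewriting call_delegate.meta["val"] / meta["spec"], and pass each output_qparams entry as the (scale, zero_point) of the inserted dequantize_per_tensor.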

We have a working implementation against v1.1.0 and will open an upstream PR once the design direction is acknowledged.

Verification

AOT (PTE flatbuffer inspection, 6 INT8 models):

Model            NPU partitions   FP32 DQ inputs (baseline)   FP32 DQ inputs (fixed)
mobilevit_s      37               36                          0
swin_t ¹         48               n/a                         0
swin_b           24               n/a                         0
vit_b_16         25               n/a                         0
deit_base        13               n/a                         0
convnext_small    2               n/a                         0
Total            149              n/a                         0 (all 289 DQ inputs INT8)

Baseline FP32 DQ inputs were measured only for mobilevit_s; the other models hard-fault before PTE inspection is possible on unpatched code. The "fixed" column was verified by PTE flatbuffer inspection after applying the fix.

¹ swin_t also requires a one-line fix in op_slice.py::define_node to handle the fifth (step) argument by unpacking inputs[:4]; without it, lowering aborts with ValueError: too many values to unpack (expected 4) (see the sketch below). Orthogonal to this bug.
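The shape of that failure, as we read it (a simplified stand-in for define_node, not a verbatim copy of op_slice.py):

```python
inputs = ["x", 2, 0, 7, 1]  # slice args as swin_t produces them: tensor, dim, start, end, step

try:
    tensor, dim, start, end = inputs    # pre-fix: the 5-arg form raises
except ValueError as err:
    print(err)                          # too many values to unpack (expected 4)

tensor, dim, start, end = inputs[:4]    # the one-line fix: drop the trailing step arg
```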

FVP (mobilevit_s INT8, ones(1,3,256,256)):

Metric                  Baseline (raw-memcpy fallback)   Fixed
expand_factor=4 fires   36                               0
FVP top-1               482                              916
Max abs diff            14.73                            0.896

Environment

  • ExecuTorch: v1.1.0
  • Target: ethos-u85-256
  • Simulator: FVP_Corstone_SSE-320
  • Quantizer: EthosUQuantizer (PT2E flow)

Related (not duplicates)

Asks

  1. Does this analysis match maintainers' understanding of the Ethos-U INT-only partition boundary contract?
  2. Are the proposed PreprocessResult schema additions (output_dtypes / output_qparams) the right mechanism, or would a dedicated callback on BackendDetails be preferred?
  3. Are there additional models we should validate before opening the upstream PR?
