Bug Description
FoldAndAnnotateQParamsPass folds the boundary dequantize_per_tensor at an Ethos-U partition exit into the preceding passthrough op (view_copy, permute_copy). Vela then compiles the partition INT8-only and writes 1 byte per output element, but the parent graph's executorch_call_delegate node retains the original FP32 dtype on meta["val"], so ExecuTorch allocates a 4× larger buffer. At runtime copy_with_layout_adjustment() sees expand_factor=4 elem_size=1 ScalarType::Float, which is not a handled case, and returns Error::InvalidProgram — hard-fault on the first delegate-call boundary in any attention model.
If a raw-memcpy fallback is applied (v1.0.1-style, skipping the reject), inference completes but produces wrong outputs: Vela's INT8 bytes land in the first 25% of the FP32 buffer and downstream FP32 consumers (softmax, scale, mask) reinterpret them as floats.
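For concreteness, the arithmetic behind the expand_factor=4 reject looks as follows; the output shape in this sketch is hypothetical (not taken from the PTE), and any shape gives the same 4× ratio:

```python
import torch

# Hypothetical delegate-output shape; the ratio is shape-independent.
out_shape = (1, 197, 384)
numel = torch.Size(out_shape).numel()

vela_bytes = numel * torch.int8.itemsize        # what the INT8-only command stream writes (1 B/elem)
runtime_bytes = numel * torch.float32.itemsize  # what the stale FP32 meta["val"] makes ExecuTorch allocate

expand_factor = runtime_bytes // vela_bytes
print(expand_factor)  # 4 -> the unhandled case that copy_with_layout_adjustment() rejects
```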
Steps to Reproduce
Lower any quantized attention model (e.g. mobilevit_s INT8) via EthosUQuantizer → EthosUPartitioner targeting ethos-u85-256.
Build the runner and run on FVP_Corstone_SSE-320 with torch.ones(1, 3, 256, 256) (mobilevit_s default input is 256×256 in timm).
Without fallback: hard-fault at the first CALL_DELEGATE instruction.
With v1.0.1-style raw-memcpy fallback: inference completes with wrong outputs — on mobilevit_s INT8 we observed CPU top-1 = 916, FVP top-1 = 482, max |diff| = 14.73.
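A minimal repro sketch of the lowering flow; the executorch import paths and the compile-spec construction are assumptions for v1.1.0 and may need adjusting to your release (only the class names come from this report):

```python
import timm
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Import paths may differ between executorch releases.
from executorch.backends.arm.quantizer import EthosUQuantizer, get_symmetric_quantization_config
from executorch.backends.arm.ethosu_partitioner import EthosUPartitioner
from executorch.exir import to_edge_transform_and_lower

model = timm.create_model("mobilevit_s", pretrained=True).eval()
example_inputs = (torch.ones(1, 3, 256, 256),)  # timm default input size for mobilevit_s

# ethos-u85-256 compile-spec construction elided; build it with the Arm
# compile-spec builder shipped in your executorch release.
compile_spec = ...

quantizer = EthosUQuantizer(compile_spec)
quantizer.set_global(get_symmetric_quantization_config())

captured = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # single calibration pass with the ones tensor
quantized = convert_pt2e(prepared)

exported = torch.export.export(quantized, example_inputs)
edge = to_edge_transform_and_lower(exported, partitioner=[EthosUPartitioner(compile_spec)])
executorch_program = edge.to_executorch()
# executorch_program.buffer is the PTE handed to the FVP_Corstone_SSE-320 runner.
```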
Root Cause
Two AOT-side mechanisms interact:
FoldAndAnnotateQParamsPass.is_foldable() returns True for passthrough ops (view_copy, permute_copy) annotated with ArmAnnotationInfo(quantized=True) by EthosUQuantizer. For a chain matmul (INT8) → DQ → view_copy → partition output, the DQ is erased and view_copy is rewired to the pre-DQ INT8 node. The trailing retrace updates view_copy.meta["val"] to INT8 inside the deep-copied partition graph.
_insert_lowered_submodule() in exir/backend/backend_api.py captures call_delegate.meta["val"] from submodule_output_node.args[0] — the pre-deepcopy FX nodes in the parent graph, which were never refreshed. The stale FP32 dtype is serialized into the PTE, causing ExecuTorch to allocate a 4×-sized buffer for a 1-byte-per-element Vela output.
On mobilevit_s INT8: 36 out of 58 partition outputs have this stale FP32 dtype, matching 36 runtime expand_factor=4 rejects exactly.
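In code terms, the interaction looks roughly like this (simplified pseudo-Python; lower_partition and run_arm_passes are illustrative stand-ins, not the actual ExecuTorch APIs):

```python
import copy
from typing import Callable, List

import torch
from torch.fx import GraphModule, Node


def lower_partition(
    partition_submodule: GraphModule,
    submodule_output_node: Node,
    call_delegate: Node,
    run_arm_passes: Callable[[GraphModule], None],
) -> List[torch.dtype]:
    # (1) Arm preprocess runs on a deep copy of the partition. In that copy,
    #     FoldAndAnnotateQParamsPass erases the boundary DQ, rewires view_copy
    #     to the pre-DQ INT8 node, and the retrace sets meta["val"] to int8.
    submodule = copy.deepcopy(partition_submodule)
    run_arm_passes(submodule)

    # (2) Back in exir/backend/backend_api.py, the delegate node's meta["val"]
    #     is assembled from the parent graph's output nodes (pre-deepcopy),
    #     which were never refreshed and still hold FP32 FakeTensors.
    parent_outputs = submodule_output_node.args[0]
    call_delegate.meta["val"] = [n.meta["val"] for n in parent_outputs]

    # The stale torch.float32 dtypes returned here are what get serialized into
    # the PTE, so the runtime allocates 4 B/element for a 1 B/element Vela output.
    return [v.dtype for v in call_delegate.meta["val"]]
```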
Why Narrower Fixes Don't Work
Skip the boundary DQ fold: keeping the DQ inside the partition leaves its FP32-typed scale/zero_point placeholders in the TOSA graph; TOSA-1.0+INT rejects every FP32 placeholder regardless of role.
Set meta["val"] inside the pass: only updates the deep-copied partition graph, not the parent graph's submodule_output_node.args[0].
Vela output flag: Vela has no FP32 output mode (--output-format only exposes tflite/raw).
Proposed Fix
Propagate the actual Vela output dtype back to the parent graph through PreprocessResult:
Return output_elem_sizes from vela_compile() (already in the Vela NPZ as output_elem_size).
Add output_dtypes: Optional[List[torch.dtype]] = None and output_qparams: Optional[List[Optional[Tuple[float, int]]]] = None to PreprocessResult (both default None → no impact on other backends).
In _insert_lowered_submodule(), when output_dtypes is present: rewrite call_delegate.meta["val"] to the correct dtype and pre-set meta["spec"] (the if "spec" not in node.meta: guard in SpecPropPass from Fix double-tracing in SpecPropPass #15485 prevents subsequent retrace from overwriting it). Insert a CPU dequantize_per_tensor after each getitem to restore the FP32 view expected by downstream consumers.
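A minimal sketch of the proposed additions; the field names are ours (they do not exist upstream yet) and the existing PreprocessResult fields are elided:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import torch


@dataclass
class PreprocessResult:
    # Existing fields (processed_bytes, debug_handle_map, ...) are unchanged and
    # elided here; only the proposed additions are shown.

    # Per delegate output, the dtype Vela actually writes (e.g. torch.int8).
    # Defaults to None ("no change"), so every other backend is unaffected.
    output_dtypes: Optional[List[torch.dtype]] = None

    # Per delegate output, the (scale, zero_point) used for the CPU
    # dequantize_per_tensor inserted after the corresponding getitem, or None
    # for outputs that need no dequantize.
    output_qparams: Optional[List[Optional[Tuple[float, int]]]] = None
```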
We have a working implementation against v1.1.0 and will open an upstream PR once the design direction is acknowledged.
Verification
AOT (PTE flatbuffer inspection, 6 INT8 models):
| Model | NPU partitions | FP32 DQ inputs (baseline) | FP32 DQ inputs (fixed) |
|---|---|---|---|
| mobilevit_s | 37 | 36 | 0 |
| swin_t ¹ | 48 | — | 0 |
| swin_b | 24 | — | 0 |
| vit_b_16 | 25 | — | 0 |
| deit_base | 13 | — | 0 |
| convnext_small | 2 | — | 0 |
| Total | 149 | — | 0 (289 DQ inputs all INT8) |
Baseline FP32 DQ inputs were measured only for mobilevit_s; on unpatched code the other models hard-fault before the PTE can be inspected. The "fixed" column was verified by PTE flatbuffer inspection after applying the fix.
¹ swin_t also requires a one-line fix in op_slice.py::define_node to handle the 5th step argument (inputs[:4]); without it lowering aborts with ValueError: too many values to unpack (expected 4). Orthogonal to this bug.
FVP (mobilevit_s INT8, ones(1,3,256,256))
Environment
ethos-u85-256
FVP_Corstone_SSE-320
EthosUQuantizer (PT2E flow)
Related (not duplicates)
(TOSAPartitioner): different mechanism; cycle path does not fire on mobilevit_s (0 cycle warnings).
FoldAndAnnotateQParamsPass folds qdomain changes through aten.cat #18999 / Arm backend: Fix quantized constant-folding for aten.cat lists (#18971) #19064 (FoldAndAnnotateQParamsPass qdomain change via aten.cat): same pass, different case (within-INT qdomain change, not INT/FP partition boundary).
Arm backend: Fix meta propagation in some call passes #19154 (Arm backend: Fix meta propagation in some call passes): targets RewriteUpsamplePass-class node creation; does not address cross-graph dtype propagation between preprocess and parent graph.
Arm backend: Partition boundary Q/DQ nodes for INT+FP: for the INT+FP profile where Vela has FP32 ops; Ethos-U INT-only is a different scenario.
Fix double-tracing in SpecPropPass #15485 (Fix double-tracing in SpecPropPass): documents the if "spec" not in node.meta: guard for executorch_call_delegate that this fix relies on.
Asks
Does this analysis match maintainers' understanding of the Ethos-U INT-only partition boundary contract?
Are PreprocessResult schema additions (output_dtypes / output_qparams) the right mechanism, or would a dedicated callback on BackendDetails be preferred?
Are there additional models we should validate before opening the upstream PR?