Bug Description
FoldAndAnnotateQParamsPass folds the boundary dequantize_per_tensor at an Ethos-U partition exit into the preceding passthrough op (view_copy, permute_copy). Vela then compiles the partition INT8-only and writes 1 byte per output element, but the parent graph's executorch_call_delegate node retains the original FP32 dtype on meta["val"], so ExecuTorch allocates a 4× larger buffer. At runtime copy_with_layout_adjustment() sees expand_factor=4 elem_size=1 ScalarType::Float, which is not a handled case, and returns Error::InvalidProgram — hard-fault on the first delegate-call boundary in any attention model.
If a raw-memcpy fallback is applied (v1.0.1-style, skipping the reject), inference completes but produces wrong outputs: Vela's INT8 bytes land in the first 25% of the FP32 buffer and downstream FP32 consumers (softmax, scale, mask) reinterpret them as floats.
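For concreteness, the arithmetic behind the expand_factor=4 reject looks as follows; the output shape in this sketch is hypothetical (not taken from the PTE), and any shape gives the same 4× ratio:

```python
import torch

# Hypothetical delegate-output shape; the ratio is shape-independent.
out_shape = (1, 197, 384)
numel = torch.Size(out_shape).numel()

vela_bytes = numel * torch.int8.itemsize        # what the INT8-only command stream writes (1 B/elem)
runtime_bytes = numel * torch.float32.itemsize  # what the stale FP32 meta["val"] makes ExecuTorch allocate

expand_factor = runtime_bytes // vela_bytes
print(expand_factor)  # 4 -> the unhandled case that copy_with_layout_adjustment() rejects
```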
Steps to Reproduce
Lower any quantized attention model (e.g. mobilevit_s INT8) via EthosUQuantizer → EthosUPartitioner targeting ethos-u85-256.
Build the runner and run on FVP_Corstone_SSE-320 with torch.ones(1, 3, 256, 256) (mobilevit_s default input is 256×256 in timm).
Without fallback: hard-fault at the first CALL_DELEGATE instruction.
With v1.0.1-style raw-memcpy fallback: inference completes with wrong outputs — on mobilevit_s INT8 we observed CPU top-1 = 916, FVP top-1 = 482, max |diff| = 14.73.
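A minimal repro sketch of the lowering flow; the executorch import paths and the compile-spec construction are assumptions for v1.1.0 and may need adjusting to your release (only the class names come from this report):

```python
import timm
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Import paths may differ between executorch releases.
from executorch.backends.arm.quantizer import EthosUQuantizer, get_symmetric_quantization_config
from executorch.backends.arm.ethosu_partitioner import EthosUPartitioner
from executorch.exir import to_edge_transform_and_lower

model = timm.create_model("mobilevit_s", pretrained=True).eval()
example_inputs = (torch.ones(1, 3, 256, 256),)  # timm default input size for mobilevit_s

# ethos-u85-256 compile-spec construction elided; build it with the Arm
# compile-spec builder shipped in your executorch release.
compile_spec = ...

quantizer = EthosUQuantizer(compile_spec)
quantizer.set_global(get_symmetric_quantization_config())

captured = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # single calibration pass with the ones tensor
quantized = convert_pt2e(prepared)

exported = torch.export.export(quantized, example_inputs)
edge = to_edge_transform_and_lower(exported, partitioner=[EthosUPartitioner(compile_spec)])
executorch_program = edge.to_executorch()
# executorch_program.buffer is the PTE handed to the FVP_Corstone_SSE-320 runner.
```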
Root Cause
Two AOT-side mechanisms interact:
FoldAndAnnotateQParamsPass.is_foldable() returns True for passthrough ops (view_copy, permute_copy) annotated with ArmAnnotationInfo(quantized=True) by EthosUQuantizer. For a chain matmul (INT8) → DQ → view_copy → partition output, the DQ is erased and view_copy is rewired to the pre-DQ INT8 node. The trailing retrace updates view_copy.meta["val"] to INT8 inside the deep-copied partition graph.
_insert_lowered_submodule() in exir/backend/backend_api.py captures call_delegate.meta["val"] from submodule_output_node.args[0] — the pre-deepcopy FX nodes in the parent graph, which were never refreshed. The stale FP32 dtype is serialized into the PTE, causing ExecuTorch to allocate a 4×-sized buffer for a 1-byte-per-element Vela output.
On mobilevit_s INT8: 36 out of 58 partition outputs have this stale FP32 dtype, matching 36 runtime expand_factor=4 rejects exactly.
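In code terms, the interaction looks roughly like this (simplified pseudo-Python; lower_partition and run_arm_passes are illustrative stand-ins, not the actual ExecuTorch APIs):

```python
import copy
from typing import Callable, List

import torch
from torch.fx import GraphModule, Node


def lower_partition(
    partition_submodule: GraphModule,
    submodule_output_node: Node,
    call_delegate: Node,
    run_arm_passes: Callable[[GraphModule], None],
) -> List[torch.dtype]:
    # (1) Arm preprocess runs on a deep copy of the partition. In that copy,
    #     FoldAndAnnotateQParamsPass erases the boundary DQ, rewires view_copy
    #     to the pre-DQ INT8 node, and the retrace sets meta["val"] to int8.
    submodule = copy.deepcopy(partition_submodule)
    run_arm_passes(submodule)

    # (2) Back in exir/backend/backend_api.py, the delegate node's meta["val"]
    #     is assembled from the parent graph's output nodes (pre-deepcopy),
    #     which were never refreshed and still hold FP32 FakeTensors.
    parent_outputs = submodule_output_node.args[0]
    call_delegate.meta["val"] = [n.meta["val"] for n in parent_outputs]

    # The stale torch.float32 dtypes returned here are what get serialized into
    # the PTE, so the runtime allocates 4 B/element for a 1 B/element Vela output.
    return [v.dtype for v in call_delegate.meta["val"]]
```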
Why Narrower Fixes Don't Work
Skip the boundary DQ fold: keeping the DQ inside the partition leaves its FP32-typed scale/zero_point placeholders in the TOSA graph; TOSA-1.0+INT rejects every FP32 placeholder regardless of role.
Set meta["val"] inside the pass: only updates the deep-copied partition graph, not the parent graph's submodule_output_node.args[0].
Vela output flag: Vela has no FP32 output mode (--output-format only exposes tflite/raw).
Proposed Fix
Propagate the actual Vela output dtype back to the parent graph through PreprocessResult:
Return output_elem_sizes from vela_compile() (already in the Vela NPZ as output_elem_size).
Add output_dtypes: Optional[List[torch.dtype]] = None and output_qparams: Optional[List[Optional[Tuple[float, int]]]] = None to PreprocessResult (both default None → no impact on other backends).
In _insert_lowered_submodule(), when output_dtypes is present: rewrite call_delegate.meta["val"] to the correct dtype and pre-set meta["spec"] (the if "spec" not in node.meta: guard in SpecPropPass from Fix double-tracing in SpecPropPass #15485 prevents subsequent retrace from overwriting it). Insert a CPU dequantize_per_tensor after each getitem to restore the FP32 view expected by downstream consumers.
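A minimal sketch of the proposed additions; the field names are ours (they do not exist upstream yet) and the existing PreprocessResult fields are elided:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

import torch


@dataclass
class PreprocessResult:
    # Existing fields (processed_bytes, debug_handle_map, ...) are unchanged and
    # elided here; only the proposed additions are shown.

    # Per delegate output, the dtype Vela actually writes (e.g. torch.int8).
    # Defaults to None ("no change"), so every other backend is unaffected.
    output_dtypes: Optional[List[torch.dtype]] = None

    # Per delegate output, the (scale, zero_point) used for the CPU
    # dequantize_per_tensor inserted after the corresponding getitem, or None
    # for outputs that need no dequantize.
    output_qparams: Optional[List[Optional[Tuple[float, int]]]] = None
```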
We have a working implementation against v1.1.0 and will open an upstream PR once the design direction is acknowledged.
Verification
AOT (PTE flatbuffer inspection, 6 INT8 models):
| Model | NPU partitions | FP32 DQ inputs (baseline) | FP32 DQ inputs (fixed) |
|---|---|---|---|
| mobilevit_s | 37 | 36 | 0 |
| swin_t ¹ | 48 | — | 0 |
| swin_b | 24 | — | 0 |
| vit_b_16 | 25 | — | 0 |
| deit_base | 13 | — | 0 |
| convnext_small | 2 | — | 0 |
| Total | 149 | — | 0 (289 DQ inputs all INT8) |
Baseline FP32 DQ inputs were measured only for mobilevit_s; on unpatched code the other models hard-fault before the PTE can be inspected. The "fixed" column was verified by PTE flatbuffer inspection after applying the fix.
¹ swin_t also requires a one-line fix in op_slice.py::define_node to handle the 5th step argument (inputs[:4]); without it lowering aborts with ValueError: too many values to unpack (expected 4). Orthogonal to this bug.
FVP (mobilevit_s INT8, ones(1,3,256,256))
Environment
ethos-u85-256
FVP_Corstone_SSE-320
EthosUQuantizer (PT2E flow)
Related (not duplicates)
(TOSAPartitioner): different mechanism; cycle path does not fire on mobilevit_s (0 cycle warnings).
FoldAndAnnotateQParamsPass folds qdomain changes through aten.cat #18999 / Arm backend: Fix quantized constant-folding for aten.cat lists (#18971) #19064 (FoldAndAnnotateQParamsPass qdomain change via aten.cat): same pass, different case (within-INT qdomain change, not INT/FP partition boundary).
Arm backend: Fix meta propagation in some call passes #19154 (Arm backend: Fix meta propagation in some call passes): targets RewriteUpsamplePass-class node creation; does not address cross-graph dtype propagation between preprocess and parent graph.
Arm backend: Partition boundary Q/DQ nodes for INT+FP: for the INT+FP profile where Vela has FP32 ops; Ethos-U INT-only is a different scenario.
Fix double-tracing in SpecPropPass #15485 (Fix double-tracing in SpecPropPass): documents the if "spec" not in node.meta: guard for executorch_call_delegate that this fix relies on.
Asks
Does this analysis match maintainers' understanding of the Ethos-U INT-only partition boundary contract?
Are PreprocessResult schema additions (output_dtypes / output_qparams) the right mechanism, or would a dedicated callback on BackendDetails be preferred?
Are there additional models we should validate before opening the upstream PR?