> This could potentially be added as a pass as well: https://github.com/pytorch/executorch/blob/main/backends/arm/_passes/broadcast_args_pass.py. But long term the ideal solution would be to add broadcast support to CMSIS-NN to get it accelerated w/o memcopies.

_Originally posted by @AdrianLundell in https://github.com/pytorch/executorch/pull/13296#discussion_r2297871408_

Creating this issue to track further improvements in the AOT phase:

- CMSIS-NN kernel dispatch could be further abstracted into a pass (the Broadcast Pass or a similar new dedicated pass):