[Embedded] [Cortex-M] Improve performance of quantized MV2/MV3 using Arm CMSIS-NN

* Build on the existing `backend/cortex-m` by integrating CMSIS-NN
* Use quantized MV2/MV3 as pilot models
* Use an FVP if no real board is available (M55 or M85 preferred) to showcase performance improvements over the portable (reference) operators running these models
* [stretch] Compare against TFLM
* [stretch] Compare against ET Ethos