In [1]:
from IPython.display import Code

# Example: Code size comparison: muRISCV-NN vs. CMSIS-NN

While we mostly look at program runtime (in ms, cycles, or instructions), the memory demand of an application should not be underestimated. Most of the ROM usage is probably fixed by the model weights, but the program code itself can also take more than 100 kB, which may exceed what some edge ML devices offer.

## Supported components

**Models:** Any (`aww` and `resnet` used below)

**Frontends:** `tflite` only (because of the backend used)

**Frameworks/Backends:** `tflmi` or `tflmc` only

**Platforms/Targets:** Any target/platform supporting both `muriscvnn` as well as `cmsisnn` (spike used below)

**Features:** The `muriscvnn` and `cmsisnn` features have to be enabled

## Prerequisites

If not done already, set up a virtual Python environment and install the required packages into it (see `requirements.txt`).

In [2]:
Code(filename="requirements.txt")

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

In [3]:
Code(filename="environment.yml.j2")

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!

## Usage

The following experiments mainly discuss ROM usage, or more specifically code size (e.g. how large the `.text` ELF section is). Only the scalar (non-SIMD) versions of the libraries are discussed here!

*Warning:* While muRISCV-NN and CMSIS-NN share a very similar code base, differences in the observed ROM metrics are expected, especially when comparing different compilers (e.g. ARM-GCC vs. RISC-V) and possibly different optimization flags.

### A) Command Line Interface

First, we want to check whether the `muriscvnn` and `cmsisnn` features work as expected, using a simple benchmark configuration (2 models, 1 target):

In [4]:
!python -m mlonmcu.cli.main flow run aww resnet -b tflmi -t spike \
        --feature-gen _ --feature-gen muriscvnn --feature-gen cmsisnn

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO -  Processing stage LOAD
INFO -  Processing stage BUILD


INFO -  Processing stage COMPILE


INFO -  Processing stage RUN


INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - Done processing runs
INFO - Report:
   Session  Run   Model Frontend Framework Backend Platform Target  Total Cycles  Total Instructions  Total CPI  Validation  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data     Features                                             Config Postprocesses Comment
0        0    0     aww   tflite      tflm   tflmi     mlif  spike      53427082            53427082        1.0        True     146217      36208          62643     83558        16      2108               34100           []  {'aww.output_shapes': {'Identity': [1, 12]}, '...            []       -
1        0    1     aww   tflite      tflm   tflmi     mlif  spike      15658993            15658993        1.0        True     176042      36224          62644    113382        16      2124               34100  [muriscvnn]  {'aww.output_shapes': {'Identity': [1, 12]}, '...       

Now let's focus on the reported ROM metrics by keeping only the relevant columns of the report with the `filter_cols` postprocess.

In [5]:
!python -m mlonmcu.cli.main flow run aww resnet -b tflmi -t spike \
        --feature-gen _ --feature-gen muriscvnn --feature-gen cmsisnn \
        --postprocess filter_cols --config filter_cols.keep="Model,Total Cycles,Features,Total ROM,ROM read-only,ROM code, ROM misc"

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO - [session-1]  Processing stage LOAD


INFO - [session-1]  Processing stage BUILD


INFO - [session-1]  Processing stage COMPILE


INFO - [session-1]  Processing stage RUN


INFO - [session-1]  Processing stage POSTPROCESS


INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-1] Done processing runs
INFO - Report:
    Model  Total Cycles  Total ROM  ROM read-only  ROM code     Features
0     aww      53427082     146217          62643     83558           []
1     aww      15658993     176042          62644    113382  [muriscvnn]
2     aww      16946083     175609          62643    112950    [cmsisnn]
3  resnet     169950655     193589         101491     92082           []
4  resnet      54873560     211524         101492    110016  [muriscvnn]
5  resnet      64017261     212195         101491    110688    [cmsisnn]


Above we have some preliminary results. The muRISCV-NN library adds roughly 18-30 kB of code to the baseline (CMSIS-NN is in the same range), and the baseline itself is probably dominated by the TFLite Micro interpreter.
However, these programs were compiled for optimal performance (using the `-O3` compiler optimization flag). Maybe we can reduce the ROM usage by telling MLonMCU to optimize for size (`-Os`) instead?
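
To put numbers on the overhead, the ROM columns of the `-O3` report above can be compared directly. The following is a minimal sketch in plain Python (no MLonMCU calls); the dictionary layout and the names `report_o3` and `code_overhead` are illustrative assumptions, and the values are copied by hand from the report. The `ROM misc` value of 16 bytes is taken from the rows visible in the first, unfiltered report; the sum check confirms that it also fits the remaining rows.

```python
# ROM figures in bytes, copied from the -O3 report above (target: spike).
# Layout per entry: (ROM read-only, ROM code, Total ROM).
ROM_MISC = 16  # "ROM misc" as shown in the first, unfiltered report

report_o3 = {
    "aww": {
        "default": (62643, 83558, 146217),
        "muriscvnn": (62644, 113382, 176042),
        "cmsisnn": (62643, 112950, 175609),
    },
    "resnet": {
        "default": (101491, 92082, 193589),
        "muriscvnn": (101492, 110016, 211524),
        "cmsisnn": (101491, 110688, 212195),
    },
}

code_overhead = {}
for model, rows in report_o3.items():
    base_code = rows["default"][1]
    code_overhead[model] = {}
    for feature, (read_only, code, total) in rows.items():
        # Sanity check: the three ROM columns add up to "Total ROM".
        assert read_only + code + ROM_MISC == total
        if feature != "default":
            # Extra code size in bytes compared to the run without features.
            code_overhead[model][feature] = code - base_code

for model, extra in code_overhead.items():
    for feature, nbytes in extra.items():
        print(f"{model:7s} {feature:10s} +{nbytes} B ({nbytes / 1000:.1f} kB)")
```

The read-only part (mostly model weights) barely changes between runs, so almost all of the difference in `Total ROM` comes from the `ROM code` column.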

In [6]:
!python -m mlonmcu.cli.main flow run aww resnet -b tflmi -t spike --config mlif.optimize=s \
        --feature-gen _ --feature-gen muriscvnn --feature-gen cmsisnn \
        --postprocess filter_cols --config filter_cols.keep="Model,Total Cycles,Features,Total ROM,ROM read-only,ROM code, ROM misc"

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO - [session-2]  Processing stage LOAD
INFO - [session-2]  Processing stage BUILD


INFO - [session-2]  Processing stage COMPILE


INFO - [session-2]  Processing stage RUN


INFO - [session-2]  Processing stage POSTPROCESS


INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-2] Done processing runs
INFO - Report:
    Model  Total Cycles  Total ROM  ROM read-only  ROM code     Features
0     aww     175065985     132371          62603     69752           []
1     aww      17033402     150756          62604     88136  [muriscvnn]
2     aww      17057796     167519          62603    104900    [cmsisnn]
3  resnet     746108611     172949         101451     71482           []
4  resnet      81325784     185258         101452     83790  [muriscvnn]
5  resnet      64103855     197825         101451     96358    [cmsisnn]


~~Well, this looks better, but not optimal. One issue here is that CMSIS-NN lacks a way to inherit optimization flags from another CMake project. Hence, in the end only the code outside CMSIS-NN/muRISCV-NN was compiled with `-Os`.~~ (fixed in a newer version of MLonMCU)
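
The size/speed trade-off between the `-O3` and `-Os` reports can be made explicit with a few lines of plain Python. This is only a sketch: the names `o3`, `os_` and `tradeoff` are illustrative, and the numbers are copied by hand from the two reports above.

```python
# (Total Cycles, ROM code in bytes), copied from the two reports above.
o3 = {
    ("aww", "default"): (53427082, 83558),
    ("aww", "muriscvnn"): (15658993, 113382),
    ("aww", "cmsisnn"): (16946083, 112950),
    ("resnet", "default"): (169950655, 92082),
    ("resnet", "muriscvnn"): (54873560, 110016),
    ("resnet", "cmsisnn"): (64017261, 110688),
}
os_ = {
    ("aww", "default"): (175065985, 69752),
    ("aww", "muriscvnn"): (17033402, 88136),
    ("aww", "cmsisnn"): (17057796, 104900),
    ("resnet", "default"): (746108611, 71482),
    ("resnet", "muriscvnn"): (81325784, 83790),
    ("resnet", "cmsisnn"): (64103855, 96358),
}

tradeoff = {}
for key, (cycles_o3, code_o3) in o3.items():
    cycles_os, code_os = os_[key]
    saved_bytes = code_o3 - code_os      # code size saved by -Os
    slowdown = cycles_os / cycles_o3     # > 1 means -Os is slower
    tradeoff[key] = (saved_bytes, slowdown)

for (model, feature), (saved_bytes, slowdown) in tradeoff.items():
    print(f"{model:7s} {feature:10s} saves {saved_bytes:6d} B, {slowdown:.2f}x cycles")
```

With these numbers, the runs without features slow down by roughly 3.3x to 4.4x when optimizing for size, whereas the runs using muRISCV-NN or CMSIS-NN stay below 1.5x.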

### B) Python Scripting

To achieve the previous results with a Python script, only a few lines of code are required. Let's start with some imports:

In [7]:
from tempfile import TemporaryDirectory
from pathlib import Path

from mlonmcu.context.context import MlonMcuContext
from mlonmcu.session.run import RunStage

Benchmark Configuration

In [8]:
FRONTEND = "tflite"
MODELS = ["aww", "resnet"]
BACKEND = "tflmi"
PLATFORM = "mlif"
TARGET = "spike"
POSTPROCESSES = ["config2cols", "rename_cols", "filter_cols"]
FEATURES = [[], ["cmsisnn"], ["muriscvnn"]]
CONFIG = {"mlif.optimize": "s", "filter_cols.keep": ["Model", "Total Cycles", "ROM code", "Features"]}
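
The configuration above describes a small grid: every model is combined with every feature set, where the empty list stands for the baseline without any library. A minimal sketch in plain Python (no MLonMCU calls; `run_grid` is an illustrative name) enumerates the combinations in the order nested loops over `MODELS` and `FEATURES` would visit them:

```python
from itertools import product

MODELS = ["aww", "resnet"]
FEATURES = [[], ["cmsisnn"], ["muriscvnn"]]

# One entry per (model, feature set) pair; the model varies slowest,
# which matches an outer loop over MODELS and an inner loop over FEATURES.
run_grid = list(product(MODELS, FEATURES))

for index, (model, features) in enumerate(run_grid):
    print(index, model, features)
```

This yields 2 x 3 = 6 combinations, one per benchmark run.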

Initialize a session and run all benchmarks

In [9]:
with MlonMcuContext() as context:
    with context.create_session() as session:
        # Create one run per (model, feature set) combination
        for model in MODELS:
            for features in FEATURES:
                # Copy the shared config so that runs do not share one dict
                cfg = CONFIG.copy()
                run = session.create_run(config=cfg)
                run.add_features_by_name(features, context=context)
                run.add_frontend_by_name(FRONTEND, context=context)
                run.add_model_by_name(model, context=context)
                run.add_backend_by_name(BACKEND, context=context)
                run.add_platform_by_name(PLATFORM, context=context)
                run.add_target_by_name(TARGET, context=context)
                run.add_postprocesses_by_name(POSTPROCESSES)
        # Process all runs of the session and collect the report
        session.process_runs(context=context)
        report = session.get_reports()
assert not session.failing
report.df

INFO - Loading environment cache from file


INFO - Successfully initialized cache


INFO - [session-3] Processing all stages


INFO - All runs completed successfuly!


INFO - Postprocessing session report


INFO - [session-3] Done processing runs


    Model  Total Cycles  ROM code     Features
0     aww     175065985     69752           []
1     aww      17057796    104900    [cmsisnn]
2     aww      17033402     88136  [muriscvnn]
3  resnet     746108611     71482           []
4  resnet      64103855     96358    [cmsisnn]
5  resnet      81325784     83790  [muriscvnn]


Here we have the report as a pandas DataFrame. Of course, we can also look at relative differences instead, normalizing each value to the baseline run (no features) of the same model:

In [10]:
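df = report.df
# Index rows by feature name; runs without features are labeled "default"
df.set_index("Features", inplace=True)
df.index = df.index.map(lambda x: x[0] if len(x) > 0 else "default")
# Baseline per model: the first row of each group (the run without features)
cycles_firsts = df.groupby("Model")["Total Cycles"].transform("first")
rom_firsts = df.groupby("Model")["ROM code"].transform("first")
# Relative values are baseline / value: > 1 is better, < 1 is worse than the baseline
df["Total Cycles (rel.)"] = 1 / (df["Total Cycles"] / cycles_firsts)
df["ROM code (rel.)"] = 1 / (df["ROM code"] / rom_firsts)
df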

Features    Model  Total Cycles  ROM code  Total Cycles (rel.)  ROM code (rel.)
default       aww     175065985     69752                  1.0              1.0
cmsisnn       aww      17057796    104900            10.263107         0.664938
muriscvnn     aww      17033402     88136            10.277805         0.791413
default    resnet     746108611     71482                  1.0              1.0
cmsisnn    resnet      64103855     96358             11.63906         0.741838
muriscvnn  resnet      81325784     83790             9.174318         0.853109
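
Since the relative columns are computed as baseline divided by value, a `ROM code (rel.)` below 1 means the code grew compared to the baseline. The following sketch converts these values into a percent increase, which may be easier to read. The helper `code_size_increase_percent` is a hypothetical name introduced only for this illustration; the inputs are copied from the table above.

```python
def code_size_increase_percent(rom_rel: float) -> float:
    """Convert a 'ROM code (rel.)' value (baseline / value) into a percent increase."""
    return (1.0 / rom_rel - 1.0) * 100.0


# 'ROM code (rel.)' values copied from the table above (-Os builds).
rom_rel = {
    ("aww", "cmsisnn"): 0.664938,
    ("aww", "muriscvnn"): 0.791413,
    ("resnet", "cmsisnn"): 0.741838,
    ("resnet", "muriscvnn"): 0.853109,
}

increase = {key: code_size_increase_percent(rel) for key, rel in rom_rel.items()}

for (model, feature), percent in increase.items():
    print(f"{model:7s} {feature:10s} +{percent:.1f} % code size")
```

The same conversion does not apply to `Total Cycles (rel.)`, where values above 1 already read directly as a speedup factor.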
