# Example: Per-Layer Model Benchmarking

When investigating why a given model does not perform as expected, or when implementing optimizations for specific types of layers, it is often useful to consider the runtime of individual layers instead of the end-to-end execution time.

MLonMCU currently supports two approaches for per-layer benchmarking:
1. Using the `split_layers` feature of the `tflite` frontend
2. Using the profiling feature provided by the `tvm` and `microtvm` (WIP) platforms

Both use-cases are explained briefly in the rest of this notebook.

## 1. Splitting TFLite Models into individual layers

### Supported components

**Models:** Any (`resnet` used below)

**Frontends:** `tflite` only

**Frameworks/Backends:** Any (`tvmaotplus` used below)

**Platforms/Targets:** Any (`etiss_pulpino` used below)

**Features:** The `split_layers` feature of the `tflite` frontend needs to be enabled

### Prerequisites

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

```yaml
---
TODO
```

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!
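
Before starting a session from Python, it can help to check what `MLONMCU_HOME` currently points to. The following is a minimal sketch using only the standard library; the helper name `resolve_mlonmcu_home` is hypothetical and not part of MLonMCU.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_mlonmcu_home(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the directory named by MLONMCU_HOME, or None if it is unset.

    Hypothetical helper for illustration only (not an MLonMCU API).
    """
    env = os.environ if env is None else env
    value = env.get("MLONMCU_HOME")
    if not value:
        # Variable not set: the default environment location applies.
        return None
    path = Path(value).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"MLONMCU_HOME points to a missing directory: {path}")
    return path


home = resolve_mlonmcu_home()
print("MLONMCU_HOME:", home if home else "not set (default location is used)")
```

Raising on a missing directory is a design choice of this sketch: a typo in the variable is reported right away instead of surfacing later during the flow.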

### Usage

TODO

#### A) Command Line Interface

First define a simple benchmark of a single model/backend/target combination:

In [1]:
!mlonmcu flow run resnet --backend tvmaotplus --target etiss_pulpino

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-272]  Processing stage LOAD
INFO - [session-272]  Processing stage BUILD
INFO - [session-272]  Processing stage COMPILE
INFO - [session-272]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-272] Done processing runs
INFO - Report:
   Session  Run   Model Frontend Framework     Backend Platform         Target    Cycles  MIPS  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data Features                                             Config Postprocesses Comment
0      272    0  resnet   tflite       tvm  tvmaotplus     mlif  etiss_pulpino  81824730    71     229042     108185         167384     61514       144      2493              105692       []  {'tflite.use_inout_data': False, 'tflite.visua...            []       -


Now let's enable the `split_layers` feature:

In [2]:
!mlonmcu flow run resnet --backend tvmaotplus --target etiss_pulpino -f split_layers

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-273]  Processing stage LOAD
INFO - [session-273]  Processing stage BUILD
INFO - [session-273]  Processing stage COMPILE
INFO - [session-273]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-273] Done processing runs
INFO - Report:
    Session  Run   Model Frontend Framework     Backend Platform         Target      Sub    Cycles  MIPS  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data        Features                                             Config Postprocesses Comment
0       273    0  resnet   tflite       tvm  tvmaotplus     mlif  etiss_pulpino      NaN  81824730    73     229042     108185         167384     61514       144      2493              105692  [split_layers]  {'tflite.split_layers': True, 'tflite.use_inou...            []       -
1       2

The resulting report should contain the original benchmark results (for the whole model) in the first row. The remaining 16 rows each correspond to one of the layers found in the `resnet.tflite` model. The layer index can be found in the `Sub` column. The cycle counts of these rows should roughly sum up to the total measured in the first row.
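
The "roughly sums up" claim is easy to check with pandas once the report is available as a DataFrame (e.g. via `report.df`, as used later in this notebook). The sketch below assumes the report layout shown above, with a `Sub` column that is NaN for the whole-model row. The cycle counts are made up for illustration and are not real measurements, and `compare_layer_sum` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical report excerpt: row 0 is the whole model (Sub is NaN),
# the other rows are individual layers. All cycle counts are made up.
report_df = pd.DataFrame(
    {
        "Model": ["toy"] * 4,
        "Sub": [float("nan"), 0, 1, 2],
        "Cycles": [1000, 400, 350, 240],
    }
)


def compare_layer_sum(df, metric="Cycles"):
    """Return (whole-model value, sum over layers, relative deviation)."""
    whole = df[df["Sub"].isna()]    # row(s) without a layer index
    layers = df[df["Sub"].notna()]  # one row per layer
    total = int(whole[metric].iloc[0])
    layer_sum = int(layers[metric].sum())
    return total, layer_sum, (layer_sum - total) / total


total, layer_sum, rel_diff = compare_layer_sum(report_df)
print(f"whole model: {total}, sum of layers: {layer_sum}, deviation: {rel_diff:+.1%}")
```

A small deviation is expected, since each layer is compiled and run as a separate model. A large one is worth a closer look.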

#### B) Python Scripting

TODO

Use pandas instead of postprocess

In [3]:
from tempfile import TemporaryDirectory
from pathlib import Path
import pandas as pd

from mlonmcu.context.context import MlonMcuContext
from mlonmcu.session.run import RunStage

Benchmark Configuration

In [4]:
FRONTEND = "tflite"
MODEL = "sine_model"
BACKEND = "tvmaotplus"
PLATFORM = "mlif"
TARGET = "etiss_pulpino"
FEATURES = ["log_instrs"]
CONFIG = {"log_instrs.to_file": True}
POSTPROCESSES = ["analyse_instructions"]

Initialize and run a single benchmark

In [5]:
with MlonMcuContext() as context:
    session = context.create_session()
    run = session.create_run(config=CONFIG)
    run.add_features_by_name(FEATURES, context=context)
    run.add_frontend_by_name(FRONTEND, context=context)
    run.add_model_by_name(MODEL, context=context)
    run.add_backend_by_name(BACKEND, context=context)
    run.add_platform_by_name(PLATFORM, context=context)
    run.add_target_by_name(TARGET, context=context)
    run.add_postprocesses_by_name(POSTPROCESSES)
    session.process_runs(context=context)
    report = session.get_reports()
report.df

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-241] Processing all stages
ERROR - 'builtin_function_or_method' object is not subscriptable
Traceback (most recent call last):
  File "/var/tmp/ga87puy/mlonmcu/mlonmcu/venv/lib/python3.8/site-packages/mlonmcu-0.3.0.dev0-py3.8.egg/mlonmcu/session/run.py", line 782, in process
    func()
  File "/var/tmp/ga87puy/mlonmcu/mlonmcu/venv/lib/python3.8/site-packages/mlonmcu-0.3.0.dev0-py3.8.egg/mlonmcu/session/run.py", line 549, in postprocess
    artifacts = postprocess.post_run(temp_report, self.artifacts)
  File "/var/tmp/ga87puy/mlonmcu/mlonmcu/venv/lib/python3.8/site-packages/mlonmcu-0.3.0.dev0-py3.8.egg/mlonmcu/session/run.py", line 507, in artifacts
    itertools.chain([subs[subs.keys[0]] for stage, subs in self.artifacts_per_stage.items()])
  File "/var/tmp/ga87puy/mlonmcu/mlonmcu/venv/lib/python3.8/site-packages/mlonmcu-0.3.0.dev0-py3.8.egg/mlonmcu/sessi

Unnamed: 0,Session,Run,Model,Frontend,Framework,Backend,Platform,Target,Total ROM,Total RAM,ROM read-only,ROM code,ROM misc,RAM data,RAM zero-init data,Features,Config,Postprocesses,Comment,Failing
0,241,0,sine_model,tflite,tvm,tvmaotplus,mlif,etiss_pulpino,56100,2737,4280,51676,144,2493,244,[log_instrs],"{'tflite.use_inout_data': False, 'tflite.visua...",[analyse_instructions],-,True


## 2. Using ~~(Micro)~~TVM's profiling functionality

Instead of splitting the model layer-wise before optimization, this approach uses the functionality of TVM's graph runtime to benchmark the individual functions contained in the model graph. These functions do not necessarily map directly to a single layer in the original model, because operator fusion is performed automatically by TVM's compilation pipeline.

### Supported components

**Models:** Any (`resnet` used below)

**Frontends:** Any frontend supported by TVM (`tflite` used below)

**Frameworks/Backends:** TVM: `tvmllvm` ~~MicroTVM: `tvmrt`~~

**Platforms/Targets:** TVM: `tvm_cpu` ~~MicroTVM: Any~~

**Features:** The `tvm_profile` feature needs to be enabled

Let's only consider the `tvm_cpu` target here until this is supported officially by upstream TVM. Hence, we are profiling on the host CPU here, not on an MCU or simulator.

### Prerequisites

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

```yaml
---
TODO
```

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!

### Usage

TODO

#### A) Command Line Interface

First define a simple benchmark of a single model/backend/target combination:

In [3]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-274]  Processing stage LOAD
INFO - [session-274]  Processing stage BUILD
INFO - [session-274]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-274] Done processing runs
INFO - Report:
   Session  Run   Model Frontend Framework  Backend Platform   Target  Runtime [s] Features                                             Config Postprocesses Comment
0      274    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu     0.001342       []  {'tflite.use_inout_data': False, 'tflite.visua...            []       -


To enable TVM's profiling feature, just add `-f tvm_profile` to the command line:

In [5]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu -f tvm_profile

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-276]  Processing stage LOAD
INFO - [session-276]  Processing stage BUILD
INFO - [session-276]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-276] Done processing runs
INFO - Report:
    Session  Run   Model Frontend Framework  Backend Platform   Target                                                Sub   Runtime [s]       Features                                             Config Postprocesses Comment
0       276    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu                                                NaN  8.562000e-04  [tvm_profile]  {'tflite.use_inout_data': False, 'tflite.visua...            []       -
1       276    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu  tvmgen_default_fused_cast_subtract_fixed_point...  1.568800e-04  [tvm_profile]  {'tflite.use_inout

Since TVM uses quite long function names, this might not be very readable. As a last step, let's try to improve that using the `filter_cols` postprocess:

In [7]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu -f tvm_profile \
        --postprocess filter_cols -c filter_cols.keep="Model,Sub,Runtime [s]"

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-278]  Processing stage LOAD
INFO - [session-278]  Processing stage BUILD
INFO - [session-278]  Processing stage RUN
INFO - [session-278]  Processing stage POSTPROCESS
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-278] Done processing runs
INFO - Report:
     Model                                                Sub   Runtime [s]
0   resnet                                                NaN  1.063600e-03
1   resnet  tvmgen_default_fused_cast_subtract_fixed_point...  2.024000e-04
2   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.971900e-04
3   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.941100e-04
4   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.896400e-04
5   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.057300e-04
6   resnet  tvmgen_default_fused_nn
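
The same column selection can also be done directly in pandas, which additionally makes it easy to derive each function's share of the total runtime. The sketch below rebuilds rows 0 to 5 of the report printed above by hand, with the `Sub` names kept truncated exactly as printed. In a real session the DataFrame would come from the report object instead. The `Target` column is only included so that there is something to drop, and the `Share` column is an addition of this sketch, not something MLonMCU emits.

```python
import pandas as pd

# Rows 0-5 of the report above, typed in by hand. 'Sub' names stay truncated
# as printed; the 'Target' column stands in for the unfiltered report columns.
df = pd.DataFrame(
    {
        "Model": ["resnet"] * 6,
        "Target": ["tvm_cpu"] * 6,
        "Sub": [
            None,
            "tvmgen_default_fused_cast_subtract_fixed_point...",
            "tvmgen_default_fused_nn_conv2d_add_cast_multip...",
            "tvmgen_default_fused_nn_conv2d_add_cast_multip...",
            "tvmgen_default_fused_nn_conv2d_add_cast_multip...",
            "tvmgen_default_fused_nn_conv2d_add_cast_multip...",
        ],
        "Runtime [s]": [
            1.0636e-03, 2.024e-04, 1.9719e-04, 1.9411e-04, 1.8964e-04, 1.0573e-04,
        ],
    }
)

# pandas counterpart of filter_cols.keep="Model,Sub,Runtime [s]"
slim = df[["Model", "Sub", "Runtime [s]"]].copy()

# The whole-model row has no 'Sub' entry; every other row is one fused function.
total = slim.loc[slim["Sub"].isna(), "Runtime [s]"].iloc[0]
funcs = slim[slim["Sub"].notna()].copy()
funcs["Share"] = funcs["Runtime [s]"] / total  # fraction of end-to-end runtime
funcs = funcs.sort_values("Runtime [s]", ascending=False)

print(funcs.to_string(index=False))
```

Because the printed report is cut off after a few rows, the shares computed here do not add up to 1. With the complete report they should come close to it.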

#### B) Python Scripting

TODO