# Example: Per-Layer Model Benchmarking

When investigating why a model does not perform as expected, or when implementing optimizations for specific types of layers, it is often useful to consider the runtime of individual layers instead of the end-to-end execution time.

MLonMCU currently supports two approaches for per-layer benchmarking:
1. Using the `split_layers` feature of the `tflite` frontend
2. Using the profiling feature provided by the `tvm` and `microtvm` (WIP) platforms

Both use-cases are explained briefly in the rest of this notebook.

## 1. Splitting TFLite Models into individual layers

### Supported components

**Models:** Any (`resnet` used below)

**Frontends:** `tflite` only

**Frameworks/Backends:** Any (`tvmaotplus` used below)

**Platforms/Targets:** Any (`etiss_pulpino` used below)

**Features:** The `split_layers` feature of the `tflite` frontend needs to be enabled

### Prerequisites

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

```yaml
---
home: "{{ home_dir }}"
logging:
  level: DEBUG
  to_file: false
  rotate: false
cleanup:
  auto: true
  keep: 10
paths:
  deps: deps
  logs: logs
  results: results
  plugins: plugins
  temp: temp
  models:
    - "{{ home_dir }}/models"
    - "{{ config_dir }}/models"
repos:
  tvm:
    url: "https://github.com/apache/tvm.git"
    ref: de6d8067754d746d88262c530b5241b5577b9aae
  etiss:
    url: "https://github.com/tum-ei-eda/etiss.git"
    ref: 4d2d26fb1fdb17e1da3a397c35d6f8877bf3ceab
  mlif:
    url: "https://github.com/tum-ei-eda/mlonmcu-sw.git"
    ref: 4b9a32659f7c5340e8de26a0b8c4135ca67d64ac
  tflite_pack:
    url: "https://github.com/tum-ei-eda/tflite-pack.git"
    ref: 439b78d36456f716629ad9dbaff9734baaa75db9
frameworks:
  default: tvm
  tvm:
    enabled: true
    backends:
      default: tvmaotplus
      tvmaotplus:
        enabled: true
        features: []
    features: []
frontends:
  tflite:
    enabled: true
    features:
      split_layers: true
toolchains:
  gcc: true
platforms:
  mlif:
    enabled: true
    features: []
targets:
  default: etiss_pulpino
  etiss_pulpino:
    enabled: true
    features: []
```

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!
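
When driving MLonMCU from a script, it can help to verify the variable before creating a context. The following is a minimal sketch using only the standard library; the helper name `resolve_mlonmcu_home` is hypothetical and not part of MLonMCU, and it only checks that the directory exists, not that it contains a valid environment.

```python
import os
from pathlib import Path


def resolve_mlonmcu_home(environ=os.environ):
    """Return the directory named by MLONMCU_HOME, or None if the variable is unset."""
    value = environ.get("MLONMCU_HOME")
    if not value:
        # Nothing to check: the default location will be used instead.
        return None
    home = Path(value).expanduser()
    if not home.is_dir():
        raise FileNotFoundError(f"MLONMCU_HOME points to a missing directory: {home}")
    return home


home = resolve_mlonmcu_home()
print(home if home is not None else "MLONMCU_HOME is not set, using the default location")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching the real environment.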

### Usage

The following examples demonstrate the `split_layers` feature recently added to MLonMCU.

#### A) Command Line Interface

First define a simple benchmark of a single model/backend/target combination:

In [1]:
!mlonmcu flow run resnet --backend tvmaotplus --target etiss_pulpino

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-272]  Processing stage LOAD
INFO - [session-272]  Processing stage BUILD
INFO - [session-272]  Processing stage COMPILE
INFO - [session-272]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-272] Done processing runs
INFO - Report:
   Session  Run   Model Frontend Framework     Backend Platform         Target    Cycles  MIPS  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data Features                                             Config Postprocesses Comment
0      272    0  resnet   tflite       tvm  tvmaotplus     mlif  etiss_pulpino  81824730    71     229042     108185         167384     61514       144      2493              105692       []  {'tflite.use_inout_data': False, 'tflite.visua...            []       -


Now let's enable the `split_layers` feature:

In [2]:
!mlonmcu flow run resnet --backend tvmaotplus --target etiss_pulpino -f split_layers

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-273]  Processing stage LOAD
INFO - [session-273]  Processing stage BUILD
INFO - [session-273]  Processing stage COMPILE
INFO - [session-273]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-273] Done processing runs
INFO - Report:
    Session  Run   Model Frontend Framework     Backend Platform         Target      Sub    Cycles  MIPS  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data        Features                                             Config Postprocesses Comment
0       273    0  resnet   tflite       tvm  tvmaotplus     mlif  etiss_pulpino      NaN  81824730    73     229042     108185         167384     61514       144      2493              105692  [split_layers]  {'tflite.split_layers': True, 'tflite.use_inou...            []       -
1       2

The resulting report should contain the original benchmark results (for the whole model) in the first row. The remaining 16 rows correspond to the individual layers found in the `resnet.tflite` model. The layer index can be found in the `Sub` column. The cycle counts of these rows should roughly add up to the total cycle count reported in the first row.

#### B) Python Scripting

Some imports

In [1]:
from tempfile import TemporaryDirectory
from pathlib import Path
import pandas as pd

from mlonmcu.context.context import MlonMcuContext
from mlonmcu.session.run import RunStage

Benchmark Configuration

In [2]:
FRONTEND = "tflite"
MODEL = "resnet"
BACKEND = "tvmaotplus"
PLATFORM = "mlif"
TARGET = "etiss_pulpino"
FEATURES = ["split_layers"]
CONFIG = {}
POSTPROCESSES = []

Initialize and run a single benchmark

In [3]:
with MlonMcuContext() as context:
    session = context.create_session()
    run = session.create_run(config=CONFIG)
    run.add_features_by_name(FEATURES, context=context)
    run.add_frontend_by_name(FRONTEND, context=context)
    run.add_model_by_name(MODEL, context=context)
    run.add_backend_by_name(BACKEND, context=context)
    run.add_platform_by_name(PLATFORM, context=context)
    run.add_target_by_name(TARGET, context=context)
    run.add_postprocesses_by_name(POSTPROCESSES)
    session.process_runs(context=context)
    report = session.get_reports()
report.df

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-384] Processing all stages
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-384] Done processing runs


| | Session | Run | Model | Frontend | Framework | Backend | Platform | Target | Sub | Cycles | MIPS | Total ROM | Total RAM | ROM read-only | ROM code | ROM misc | RAM data | RAM zero-init data | Features | Config | Postprocesses | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | NaN | 81824730 | 72 | 229042 | 108185 | 167384 | 61514 | 144 | 2493 | 105692 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 1 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer0 | 3871968 | 12 | 56350 | 35113 | 4304 | 51902 | 144 | 2493 | 32620 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 2 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer1 | 13006152 | 32 | 59694 | 105097 | 8048 | 51502 | 144 | 2493 | 102604 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 3 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer2 | 13024494 | 32 | 59688 | 105097 | 8048 | 51496 | 144 | 2493 | 102604 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 4 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer3 | 629131 | 2 | 53862 | 51729 | 3024 | 50694 | 144 | 2501 | 49228 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 5 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer4 | 6504429 | 20 | 64938 | 94761 | 13104 | 51690 | 144 | 2493 | 92268 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 6 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer5 | 17546278 | 46 | 73902 | 56073 | 22320 | 51438 | 144 | 2493 | 53580 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 7 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer6 | 1138722 | 4 | 56796 | 90665 | 4912 | 51740 | 144 | 2493 | 88172 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 8 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer7 | 316038 | 1 | 53898 | 27153 | 3024 | 50730 | 144 | 2501 | 24652 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |
| 9 | 384 | 0 | resnet | tflite | tvm | tvmaotplus | mlif | etiss_pulpino | layer8 | 8854578 | 27 | 93314 | 49737 | 41648 | 51522 | 144 | 2493 | 47244 | [split_layers] | {'tflite.split_layers': True, 'tflite.use_inou... | [] | - |


Stripping out all common data, we get this:

In [19]:
df = report.df
df.drop(["Session", "Run", "Frontend", "Model", "Framework", "Backend", "Platform", "Target", "Config", "Features", "Postprocesses", "Comment"], axis=1, inplace=True)
df.fillna("full", inplace=True)
df.set_index("Sub", inplace=True)
df

| Sub | Cycles | MIPS | Total ROM | Total RAM | ROM read-only | ROM code | ROM misc | RAM data | RAM zero-init data |
|---|---|---|---|---|---|---|---|---|---|
| full | 81824730 | 72 | 229042 | 108185 | 167384 | 61514 | 144 | 2493 | 105692 |
| layer0 | 3871968 | 12 | 56350 | 35113 | 4304 | 51902 | 144 | 2493 | 32620 |
| layer1 | 13006152 | 32 | 59694 | 105097 | 8048 | 51502 | 144 | 2493 | 102604 |
| layer2 | 13024494 | 32 | 59688 | 105097 | 8048 | 51496 | 144 | 2493 | 102604 |
| layer3 | 629131 | 2 | 53862 | 51729 | 3024 | 50694 | 144 | 2501 | 49228 |
| layer4 | 6504429 | 20 | 64938 | 94761 | 13104 | 51690 | 144 | 2493 | 92268 |
| layer5 | 17546278 | 46 | 73902 | 56073 | 22320 | 51438 | 144 | 2493 | 53580 |
| layer6 | 1138722 | 4 | 56796 | 90665 | 4912 | 51740 | 144 | 2493 | 88172 |
| layer7 | 316038 | 1 | 53898 | 27153 | 3024 | 50730 | 144 | 2501 | 24652 |
| layer8 | 8854578 | 27 | 93314 | 49737 | 41648 | 51522 | 144 | 2493 | 47244 |
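
Because the layer rows are expected to roughly add up to the full-model row, a quick sanity check is to compute each layer's share of the total cycle count. The sketch below does this in plain Python for the rows listed above; the helper name `cycle_shares` is hypothetical and not part of MLonMCU. Only `layer0` to `layer8` are listed here, so the shares do not reach 100%; the rest presumably belongs to layers not shown in this excerpt.

```python
# Cycle counts copied from the "Cycles" column of the table above.
CYCLES = {
    "full": 81824730,
    "layer0": 3871968,
    "layer1": 13006152,
    "layer2": 13024494,
    "layer3": 629131,
    "layer4": 6504429,
    "layer5": 17546278,
    "layer6": 1138722,
    "layer7": 316038,
    "layer8": 8854578,
}


def cycle_shares(cycles, total_key="full"):
    """Return (shares, coverage): each layer's fraction of the total, and their sum."""
    total = cycles[total_key]
    shares = {
        name: count / total for name, count in cycles.items() if name != total_key
    }
    return shares, sum(shares.values())


shares, coverage = cycle_shares(CYCLES)

# Print the most expensive layers first.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>7}: {share:6.1%}")
print(f"covered by the listed layers: {coverage:.1%}")
```

Sorting by share makes the hotspots obvious at a glance, which is usually the first thing to look at when deciding which layer type to optimize.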


## 2. Using ~~(Micro)~~TVM's profiling functionality

Instead of splitting the model layer-wise before optimization, this approach uses the functionality of TVM's graph runtime to benchmark the individual functions contained in the model graph. These functions do not necessarily map directly to a single layer in the original model, because operator fusion is performed automatically by TVM's compilation pipeline.

### Supported components

**Models:** Any (`resnet` used below)

**Frontends:** Any frontend supported by TVM (`tflite` used below)

**Frameworks/Backends:** TVM: `tvmllvm` ~~MicroTVM: `tvmrt`~~

**Platforms/Targets:** TVM: `tvm_cpu` ~~MicroTVM: Any~~

**Features:** The `tvm_profile` feature needs to be enabled

Let's only consider the `tvm_cpu` target here until MicroTVM profiling is officially supported by upstream TVM. Hence we are profiling on the host CPU here, not on an MCU or simulator.

### Prerequisites

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

```yaml
---
home: "{{ home_dir }}"
logging:
  level: DEBUG
  to_file: false
  rotate: false
cleanup:
  auto: true
  keep: 10
paths:
  deps: deps
  logs: logs
  results: results
  plugins: plugins
  temp: temp
  models:
    - "{{ home_dir }}/models"
    - "{{ config_dir }}/models"
repos:
  tvm:
    url: "https://github.com/apache/tvm.git"
    ref: de6d8067754d746d88262c530b5241b5577b9aae
frameworks:
  default: tvm
  tvm:
    enabled: true
    backends:
      default: tvmllvm
      tvmllvm:
        enabled: true
        features: []
    features: []
frontends:
  tflite:
    enabled: true
    features: []
toolchains:
  gcc: true
platforms:
  tvm:
    enabled: true
    features:
      tvm_profile: true
targets:
  tvm_cpu:
    enabled: true
```

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!

### Usage

The following examples demonstrate the `tvm_profile` feature of the TVM and MicroTVM platforms.

#### A) Command Line Interface

First define a simple benchmark of a single model/backend/target combination:

In [20]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-385]  Processing stage LOAD
INFO - [session-385]  Processing stage BUILD
INFO - [session-385]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-385] Done processing runs
INFO - Report:
   Session  Run   Model Frontend Framework  Backend Platform   Target  Runtime [s] Features                                             Config Postprocesses Comment
0      385    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu     0.001746       []  {'tflite.use_inout_data': False, 'tflite.visua...            []       -


To enable TVM's profiling feature, just add `-f tvm_profile` to the command line:

In [21]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu -f tvm_profile

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-386]  Processing stage LOAD
INFO - [session-386]  Processing stage BUILD
INFO - [session-386]  Processing stage RUN
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-386] Done processing runs
INFO - Report:
    Session  Run   Model Frontend Framework  Backend Platform   Target                                                Sub   Runtime [s]       Features                                             Config Postprocesses Comment
0       386    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu                                                NaN  1.716600e-03  [tvm_profile]  {'tflite.use_inout_data': False, 'tflite.visua...            []       -
1       386    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu  tvmgen_default_fused_cast_subtract_fixed_point...  3.127400e-04  [tvm_profile]  {'tflite.use_inout

Since TVM uses quite long function names, this might not be very readable. As a last step, let's try to improve that using the `filter_cols` postprocess:

In [22]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu -f tvm_profile \
        --postprocess filter_cols -c filter_cols.keep="Model,Sub,Runtime [s]"

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-387]  Processing stage LOAD
INFO - [session-387]  Processing stage BUILD
INFO - [session-387]  Processing stage RUN
INFO - [session-387]  Processing stage POSTPROCESS
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-387] Done processing runs
INFO - Report:
     Model                                                Sub   Runtime [s]
0   resnet                                                NaN  1.090300e-03
1   resnet  tvmgen_default_fused_cast_subtract_fixed_point...  1.996900e-04
2   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.967800e-04
3   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.937500e-04
4   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.772700e-04
5   resnet  tvmgen_default_fused_nn_conv2d_add_cast_multip...  1.048600e-04
6   resnet  tvmgen_default_fused_nn

#### B) Python Scripting

Some imports

In [1]:
from tempfile import TemporaryDirectory
from pathlib import Path
import pandas as pd

from mlonmcu.context.context import MlonMcuContext
from mlonmcu.session.run import RunStage

Benchmark Configuration

In [2]:
FRONTEND = "tflite"
MODEL = "resnet"
BACKEND = "tvmllvm"
PLATFORM = "tvm"
TARGET = "tvm_cpu"
FEATURES = ["tvm_profile"]
CONFIG = {}
POSTPROCESSES = []

Initialize and run a single benchmark

In [3]:
with MlonMcuContext() as context:
    session = context.create_session()
    run = session.create_run(config=CONFIG)
    run.add_features_by_name(FEATURES, context=context)
    run.add_frontend_by_name(FRONTEND, context=context)
    run.add_model_by_name(MODEL, context=context)
    run.add_platform_by_name(PLATFORM, context=context)
    run.add_backend_by_name(BACKEND, context=context)
    run.add_target_by_name(TARGET, context=context)
    run.add_postprocesses_by_name(POSTPROCESSES)
    session.process_runs(context=context)
    report = session.get_reports()
report.df

INFO - Loading environment cache from file
INFO - Successfully initialized cache
INFO - Loading extensions.py (User)
INFO - [session-391] Processing all stages
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-391] Done processing runs


| | Session | Run | Model | Frontend | Framework | Backend | Platform | Target | Sub | Runtime [s] | Features | Config | Postprocesses | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | NaN | 0.0010569 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 1 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 0.00020228 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 2 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_cast_subtract_fixed_point... | 0.00020169 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 3 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 0.00018928 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 4 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 0.00018022 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 5 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 0.00010568 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 6 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 0.00010043 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 7 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 3.373e-05 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 8 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 2.274e-05 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |
| 9 | 391 | 0 | resnet | tflite | tvm | tvmllvm | tvm | tvm_cpu | tvmgen_default_fused_nn_conv2d_add_cast_multip... | 1.952e-05 | [tvm_profile] | {'tflite.use_inout_data': False, 'tflite.visua... | [] | - |


After stripping it down to the essential data:

In [4]:
df = report.df
df.drop(["Session", "Run", "Frontend", "Model", "Framework", "Backend", "Platform", "Target", "Config", "Features", "Postprocesses", "Comment"], axis=1, inplace=True)
df.fillna("full", inplace=True)
df.set_index("Sub", inplace=True)
df

| Sub | Runtime [s] |
|---|---|
| full | 0.0010569 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_8a376065fd35c245_` | 0.00020228 |
| `tvmgen_default_fused_cast_subtract_fixed_point_multiply_add_nn_conv2d_add_cast_multiply_add_rig_5494088d2bce6f3f_` | 0.00020169 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_9b1cea826623845_` | 0.00018928 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_9b1cea826623845__1` | 0.00018022 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_8a376065fd35c245__1` | 0.00010568 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_8a376065fd35c245__2` | 0.00010043 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip` | 3.373e-05 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_866ca172ecfe2cfb_` | 2.274e-05 |
| `tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_8b760d480c798df_` | 1.952e-05 |
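
The full function names are hard to compare by eye, so it can be useful to split them into parts and aggregate the runtime of functions that share the same generated kernel. The sketch below is a heuristic based purely on the names listed above: it assumes the trailing hexadecimal block is a hash appended to truncated names and that `_1`, `_2` are counters for repeated functions. Both readings are assumptions, and the helpers `split_name` and `short_label` are hypothetical, not part of TVM or MLonMCU.

```python
import re
from collections import defaultdict

PREFIX = "tvmgen_default_fused_"
# Assumed naming scheme: "<operators>_<hex hash>_" plus an optional "_<n>" counter.
SUFFIX_RE = re.compile(r"_([0-9a-f]{8,16})_(?:_(\d+))?$")


def split_name(name):
    """Split a function name into (operators, hash or None, counter)."""
    body = name[len(PREFIX):] if name.startswith(PREFIX) else name
    match = SUFFIX_RE.search(body)
    if match is None:
        return body, None, 0
    return body[:match.start()], match.group(1), int(match.group(2) or 0)


def short_label(name, width=24):
    """Build a compact label: truncated operator list plus hash prefix and counter."""
    ops, digest, index = split_name(name)
    label = ops if len(ops) <= width else ops[:width] + "..."
    if digest is not None:
        label += f" #{digest[:6]}/{index}"
    return label


# Runtimes in seconds, copied from the table above.
RUNTIMES = {
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_8a376065fd35c245_": 0.00020228,
    "tvmgen_default_fused_cast_subtract_fixed_point_multiply_add_nn_conv2d_add_cast_multiply_add_rig_5494088d2bce6f3f_": 0.00020169,
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_9b1cea826623845_": 0.00018928,
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_9b1cea826623845__1": 0.00018022,
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_8a376065fd35c245__1": 0.00010568,
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip_cast_s_8a376065fd35c245__2": 0.00010043,
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_clip": 3.373e-05,
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_866ca172ecfe2cfb_": 2.274e-05,
    "tvmgen_default_fused_nn_conv2d_add_cast_multiply_add_right_shift_cast_add_clip_cast_cast_subtra_8b760d480c798df_": 1.952e-05,
}

# Sum the runtime of all functions that share a hash (or the operator list if there is none).
by_kernel = defaultdict(float)
for name, seconds in RUNTIMES.items():
    ops, digest, _ = split_name(name)
    by_kernel[digest or ops] += seconds

for name, seconds in RUNTIMES.items():
    print(f"{short_label(name):<42} {seconds * 1e6:8.2f} us")
```

Grouping by hash rather than by the truncated operator list avoids merging functions whose names only look identical because they were cut off at the same position.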
