In [1]:
from IPython.display import Code

# Example: Per-Layer Model Benchmarking

When investigating why a model does not perform as expected, or when implementing optimizations for specific types of layers, it is often useful to consider the runtime of individual layers instead of the end-to-end execution time.

MLonMCU currently supports two approaches for per-layer benchmarking:
1. Using the `split_layers` feature of the `tflite` frontend
2. Using the profiling feature provided by the `tvm` and `microtvm` (WIP) platforms

Both use-cases are explained briefly in the rest of this notebook.

## 1. Splitting TFLite Models into individual layers

### Supported components

**Models:** Any (`resnet` used below)

**Frontends:** `tflite` only

**Frameworks/Backends:** Any (`tvmaotplus` used below)

**Platforms/Targets:** Any (`etiss` used below)

**Features:** The `split_layers` feature of the `tflite` frontend needs to be enabled

### Prerequisites

If not done already, set up a virtual Python environment and install the required packages into it (see `requirements.txt`).

In [2]:
Code(filename="requirements.txt")

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

In [3]:
Code(filename="environment.yml.j2")

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!
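
To see which setting is active, the variable can be inspected from within the notebook. The following is only a minimal sketch: the helper name `describe_mlonmcu_home` is hypothetical and not part of MLonMCU, and it merely reports whether the variable is set.

```python
import os


def describe_mlonmcu_home(environ=None):
    """Return a short message describing the MLONMCU_HOME setting.

    Hypothetical helper for illustration; not part of the MLonMCU API.
    """
    environ = os.environ if environ is None else environ
    home = environ.get("MLONMCU_HOME")
    if home:
        return f"MLONMCU_HOME is set to: {home}"
    return "MLONMCU_HOME is not set, the default location will be used"


print(describe_mlonmcu_home())
```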

### Usage

The following examples demonstrate the `split_layers` feature recently added to MLonMCU.

#### A) Command Line Interface

First define a simple benchmark of a single model/backend/target combination:

In [4]:
!mlonmcu flow run resnet --backend tvmaotplus --target etiss_pulpino

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO -  Processing stage LOAD
INFO -  Processing stage BUILD


INFO -  Processing stage COMPILE


INFO -  Processing stage RUN


INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - Done processing runs


INFO - Report:
   Session  Run   Model Frontend Framework     Backend Platform         Target  Total Cycles  Total Instructions  Total CPI  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data  Validation Features                                             Config Postprocesses Comment
0        0    0  resnet   tflite       tvm  tvmaotplus     mlif  etiss_pulpino      82184457            82184457        1.0     215478     108184         162592     52742       144      1732              106452        True       []  {'resnet.output_shapes': {'Identity_int8': [1,...            []       -


Now let's enable the `split_layers` feature:

In [5]:
!mlonmcu flow run resnet --backend tvmaotplus --target etiss_pulpino -f split_layers

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO - [session-1]  Processing stage LOAD


INFO - [session-1]  Processing stage BUILD


INFO - [session-1]  Processing stage COMPILE


INFO - [session-1]  Processing stage RUN


INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-1] Done processing runs


INFO - Report:
    Session  Run   Model Frontend Framework     Backend Platform         Target      Sub  Total Cycles  Total Instructions  Total CPI  Total ROM  Total RAM  ROM read-only  ROM code  ROM misc  RAM data  RAM zero-init data  Validation        Features                                             Config Postprocesses Comment
0         1    0  resnet   tflite       tvm  tvmaotplus     mlif  etiss_pulpino      NaN      82184457            82184457        1.0     215478     108184         162592     52742       144      1732              106452        True  [split_layers]  {'resnet.output_shapes': {'Identity_int8': [1,...            []       -
1         1    0  resnet   tflite       tvm  tvmaotplus     mlif  etiss_pulpino   layer0       3540646             3540646        1.0      46870      35112           3352     43374       144      1732               33380        True  [split_layers]  {'resnet.output_shapes': {'Identity_int8': [1,...            []       -
2         1    

The resulting report should contain the original benchmark results (for the whole model) in the first row. The remaining 16 rows correspond to the individual layers found in the `resnet.tflite` model. The layer index can be found in the 'Sub' column. The cycle counts of these rows should roughly sum up to the total cycle count measured in the first row.

#### B) Python Scripting

Some imports

In [6]:
from tempfile import TemporaryDirectory
from pathlib import Path
import pandas as pd

from mlonmcu.context.context import MlonMcuContext
from mlonmcu.session.run import RunStage

Benchmark Configuration

In [7]:
FRONTEND = "tflite"
MODEL = "resnet"
BACKEND = "tvmaotplus"
PLATFORM = "mlif"
TARGET = "etiss_pulpino"
FEATURES = ["split_layers"]
CONFIG = {"filter_cols.keep": ["Sub", "Total Instructions", "Total ROM", "Total RAM"]}
POSTPROCESSES = ["filter_cols"]

Initialize and run a single benchmark

In [8]:
with MlonMcuContext() as context:
    with context.create_session() as session:
        run = session.create_run(config=CONFIG)
        run.add_features_by_name(FEATURES, context=context)
        run.add_frontend_by_name(FRONTEND, context=context)
        run.add_model_by_name(MODEL, context=context)
        run.add_backend_by_name(BACKEND, context=context)
        run.add_platform_by_name(PLATFORM, context=context)
        run.add_target_by_name(TARGET, context=context)
        run.add_postprocesses_by_name(POSTPROCESSES)
        session.process_runs(context=context)
        report = session.get_reports()
assert not session.failing
report.df

INFO - Loading environment cache from file


INFO - Successfully initialized cache


INFO - [session-2] Processing all stages


INFO - All runs completed successfuly!


INFO - Postprocessing session report


INFO - [session-2] Done processing runs


Unnamed: 0,Sub,Total Instructions,Total ROM,Total RAM
0,,82184457,215478,108184
1,layer0,3540646,46870,35112
2,layer1,13186214,49990,105096
3,layer2,13212781,49990,105096
4,layer3,629212,44416,51728
5,layer4,6296259,55116,94760
6,layer5,17536342,64034,56072
7,layer6,1073329,46936,90664
8,layer7,316491,44480,27152
9,layer8,8910030,83060,49736


Stripping out all common data, we get this:

In [9]:
df = report.df
df.fillna("full", inplace=True)
df.set_index("Sub", inplace=True)
df

Sub,Total Instructions,Total ROM,Total RAM
full,82184457,215478,108184
layer0,3540646,46870,35112
layer1,13186214,49990,105096
layer2,13212781,49990,105096
layer3,629212,44416,51728
layer4,6296259,55116,94760
layer5,17536342,64034,56072
layer6,1073329,46936,90664
layer7,316491,44480,27152
layer8,8910030,83060,49736
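
To see where the time goes, the per-layer instruction counts can be put in relation to the full-model run. The sketch below re-enters the rows shown above by hand and adds a share column. The column name `Share [%]` and the variable names are illustrative choices, not something MLonMCU produces.

```python
import pandas as pd

# Instruction counts copied from the table above (rows shown in this notebook only).
instructions = {
    "full": 82184457,
    "layer0": 3540646,
    "layer1": 13186214,
    "layer2": 13212781,
    "layer3": 629212,
    "layer4": 6296259,
    "layer5": 17536342,
    "layer6": 1073329,
    "layer7": 316491,
    "layer8": 8910030,
}
df = pd.DataFrame({"Total Instructions": pd.Series(instructions)})
df.index.name = "Sub"

full = df.loc["full", "Total Instructions"]
layers = df.drop(index="full")  # keep the per-layer rows only

# Share of each layer relative to the full-model run, largest first.
share = (100.0 * layers["Total Instructions"] / full).round(1)
ranked = layers.assign(**{"Share [%]": share}).sort_values(
    "Total Instructions", ascending=False
)

# Fraction of the full-model count that the entered layers account for.
coverage = layers["Total Instructions"].sum() / full

print(ranked)
print(f"Entered layers cover {coverage:.1%} of the full-model instructions")
```

A coverage clearly below 100% is expected whenever not all layer rows of the report have been entered, so the number is mainly useful as a sanity check on a complete report.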


## 2. Using ~~(Micro)~~TVM's profiling functionality

Instead of splitting the model layer-wise before optimization, this approach uses the functionality of TVM's graph runtime to benchmark the individual functions contained in the model graph. These functions do not necessarily map directly to a single layer in the original model, because operator fusion is performed automatically by TVM's compilation pipeline.
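
The names of the generated functions give a hint about which operators were fused together. As a rough illustration, the sketch below splits such a name into operator names using a small hand-written vocabulary. Both the vocabulary `KNOWN_OPS` and the helper `split_fused_name` are assumptions made for this example; they are not part of TVM or MLonMCU and will not cover every model.

```python
PREFIX = "tvmgen_default_fused_"

# Hand-written operator vocabulary (an assumption for this sketch), sorted so
# that longer names are tried before shorter names they start with.
KNOWN_OPS = sorted(
    [
        "nn_conv2d",
        "fixed_point_multiply_per_axis",
        "fixed_point_multiply",
        "add",
        "subtract",
        "clip",
        "cast",
    ],
    key=len,
    reverse=True,
)


def split_fused_name(name):
    """Split a fused function name into (operators, unparsed_tail)."""
    if not name.startswith(PREFIX):
        raise ValueError(f"unexpected function name: {name}")
    rest = name[len(PREFIX):]
    ops = []
    while rest:
        for op in KNOWN_OPS:
            if rest == op or rest.startswith(op + "_"):
                ops.append(op)
                rest = rest[len(op) + 1:]
                break
        else:
            # Nothing matched: leave the remainder (e.g. an index or hash) untouched.
            break
    return ops, rest


name = (
    "tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis"
    "_add_clip_cast_subtract"
)
ops, tail = split_fused_name(name)
print(len(ops), ops, repr(tail))
```

A single profiled function can therefore stand for a convolution plus several element-wise operators, which is why its runtime cannot be attributed to one layer of the original model.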

### Supported components

**Models:** Any (`resnet` used below)

**Frontends:** Any frontend supported by TVM (`tflite` used below)

**Frameworks/Backends:** TVM: `tvmllvm` ~~MicroTVM: `tvmrt`~~

**Platforms/Targets:** TVM: `tvm_cpu` ~~MicroTVM: Any~~

**Features:** The `tvm_profile` feature needs to be enabled

Let's only consider the `tvm_cpu` target here until this is supported officially by upstream TVM. Hence we are profiling on the host CPU here, not on an MCU or simulator.
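
Measurements taken on a host CPU tend to vary from run to run, which is worth keeping in mind when comparing small differences. As a rough sketch, the spread of the full-model runtimes reported by the three `tvm_profile` sessions further below in this notebook can be computed as follows. The variable names are illustrative.

```python
from statistics import mean

# Full-model "Runtime [s]" values of the three tvm_profile sessions in this notebook.
runtimes = {
    "session-4": 3.5377e-03,
    "session-5": 3.5868e-03,
    "session-6": 3.5605e-03,
}

values = list(runtimes.values())
avg = mean(values)
# Difference between slowest and fastest run, relative to the mean.
spread = (max(values) - min(values)) / avg

print(f"mean: {avg:.6f} s, relative spread: {spread:.2%}")
```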

### Prerequisites

If not done already, set up a virtual Python environment and install the required packages into it (see `requirements.txt`).

In [10]:
Code(filename="requirements.txt")

Set up MLonMCU as usual, i.e. initialize an environment and install all required dependencies. Feel free to use the following minimal `environment.yml.j2` template:

In [11]:
Code(filename="environment.yml.j2")

Do not forget to set your `MLONMCU_HOME` environment variable first if not using the default location!

### Usage

The following examples demonstrate the `tvm_profile` feature of the TVM and MicroTVM platforms.

#### A) Command Line Interface

First define a simple benchmark of a single model/backend/target combination:

In [12]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO - [session-3]  Processing stage LOAD
INFO - [session-3]  Processing stage BUILD


INFO - [session-3]  Processing stage RUN


INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-3] Done processing runs


INFO - Report:
   Session  Run   Model Frontend Framework  Backend Platform   Target  Runtime [s] Features                                             Config Postprocesses Comment
0        3    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu     0.003898       []  {'resnet.output_shapes': {'Identity_int8': [1,...            []       -


To enable TVM's profiling feature, just add `-f tvm_profile` to the command line:

In [13]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu -f tvm_profile

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO - [session-4]  Processing stage LOAD
INFO - [session-4]  Processing stage BUILD


INFO - [session-4]  Processing stage RUN


INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-4] Done processing runs


INFO - Report:
    Session  Run   Model Frontend Framework  Backend Platform   Target                                                Sub   Runtime [s]       Features                                             Config Postprocesses Comment
0         4    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu                                                NaN  3.537700e-03  [tvm_profile]  {'resnet.output_shapes': {'Identity_int8': [1,...            []       -
1         4    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu  tvmgen_default_fused_cast_subtract_fixed_point...  9.033000e-04  [tvm_profile]  {'resnet.output_shapes': {'Identity_int8': [1,...            []       -
2         4    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu  tvmgen_default_fused_nn_conv2d_add_fixed_point...  8.918100e-04  [tvm_profile]  {'resnet.output_shapes': {'Identity_int8': [1,...            []       -
3         4    0  resnet   tflite       tvm  tvmllvm      tvm  tvm_cpu  tvmgen_defau

Since TVM uses quite long function names, this might not be very readable. As a last step, let's try to improve that using the `filter_cols` postprocess:

In [14]:
!python -m mlonmcu.cli.main flow run resnet -b tvmllvm -t tvm_cpu -f tvm_profile \
        --postprocess filter_cols -c filter_cols.keep="Model,Sub,Runtime [s]"

INFO - Loading environment cache from file
INFO - Successfully initialized cache


INFO - [session-5]  Processing stage LOAD
INFO - [session-5]  Processing stage BUILD


INFO - [session-5]  Processing stage RUN


INFO - [session-5]  Processing stage POSTPROCESS
INFO - All runs completed successfuly!
INFO - Postprocessing session report
INFO - [session-5] Done processing runs
INFO - Report:
     Model                                                Sub   Runtime [s]
0   resnet                                                NaN  3.586800e-03
1   resnet  tvmgen_default_fused_cast_subtract_fixed_point...  9.090000e-04
2   resnet  tvmgen_default_fused_nn_conv2d_add_fixed_point...  8.940600e-04
3   resnet  tvmgen_default_fused_nn_conv2d_add_fixed_point...  8.696100e-04
4   resnet  tvmgen_default_fused_nn_conv2d_add_fixed_point...  4.585300e-04
5   resnet  tvmgen_default_fused_nn_conv2d_add_fixed_point...  1.312500e-04
6   resnet  tvmgen_default_fused_nn_conv2d_add_fixed_point...  1.073700e-04
7   resnet  tvmgen_default_fused_nn_conv2d_add_fixed_point...  7.498000e-05
8   resnet  tvmgen_default_fused_nn_conv2d_add_fixed_point...  5.649000e-05
9   resnet  tvmgen_default_fused_nn_conv2d_ad

#### B) Python Scripting

Some imports

In [15]:
from tempfile import TemporaryDirectory
from pathlib import Path
import pandas as pd

from mlonmcu.context.context import MlonMcuContext
from mlonmcu.session.run import RunStage

Benchmark Configuration

In [16]:
FRONTEND = "tflite"
MODEL = "resnet"
BACKEND = "tvmllvm"
PLATFORM = "tvm"
TARGET = "tvm_cpu"
FEATURES = ["tvm_profile"]
CONFIG = {}
POSTPROCESSES = []

Initialize and run a single benchmark

In [17]:
with MlonMcuContext() as context:
    with context.create_session() as session:
        run = session.create_run(config=CONFIG)
        run.add_features_by_name(FEATURES, context=context)
        run.add_frontend_by_name(FRONTEND, context=context)
        run.add_model_by_name(MODEL, context=context)
        run.add_platform_by_name(PLATFORM, context=context)
        run.add_backend_by_name(BACKEND, context=context)
        run.add_target_by_name(TARGET, context=context)
        run.add_postprocesses_by_name(POSTPROCESSES)
        session.process_runs(context=context)
        report = session.get_reports()
assert not session.failing
report.df

INFO - Loading environment cache from file


INFO - Successfully initialized cache


INFO - [session-6] Processing all stages


INFO - All runs completed successfuly!


INFO - Postprocessing session report


INFO - [session-6] Done processing runs


Unnamed: 0,Session,Run,Model,Frontend,Framework,Backend,Platform,Target,Sub,Runtime [s],Features,Config,Postprocesses,Comment
0,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,,0.0035605,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
1,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,0.00092241,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
2,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_cast_subtract_fixed_point...,0.00091292,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
3,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,0.00086566,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
4,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,0.00045773,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
5,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,0.00012828,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
6,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,0.00010516,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
7,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,7.208e-05,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
8,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,5.54e-05,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-
9,6,0,resnet,tflite,tvm,tvmllvm,tvm,tvm_cpu,tvmgen_default_fused_nn_conv2d_add_fixed_point...,2.762e-05,[tvm_profile],"{'resnet.output_shapes': {'Identity_int8': [1,...",[],-


After stripping it down to the essential data:

In [18]:
df = report.df
df.drop(
    [
        "Session",
        "Run",
        "Frontend",
        "Model",
        "Framework",
        "Backend",
        "Platform",
        "Target",
        "Config",
        "Features",
        "Postprocesses",
        "Comment",
    ],
    axis=1,
    inplace=True,
)
df.fillna("full", inplace=True)
df.set_index("Sub", inplace=True)
df

Sub,Runtime [s]
full,0.0035605
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_cast_subtract,0.00092241
tvmgen_default_fused_cast_subtract_fixed_point_multiply_add_nn_conv2d_add_fixed_point_multiply__cc9246e62aa5afb_,0.00091292
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_subtract_fixed_point__eb606f94f03ebac6_,0.00086566
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_cast_subtract_1,0.00045773
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_subtract_fixed_point__eb606f94f03ebac6__1,0.00012828
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_cast,0.00010516
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_cast_subtract_2,7.208e-05
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_subtract_fixed_point__26c49bbe582da641_,5.54e-05
tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis_add_clip_subtract_fixed_point__cacd0002c6404764_,2.762e-05
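
The full function names make the table hard to scan. One possible way to tidy it up is sketched below: drop the common prefix and keep only the end of each name, since the names above mostly differ in their tails. The helper `shorten` and its `max_len` parameter are hypothetical, and only three rows of the table are re-entered here.

```python
import pandas as pd

PREFIX = "tvmgen_default_fused_"


def shorten(name, max_len=32):
    """Drop the common prefix and keep the tail of long names (illustrative helper)."""
    short = name[len(PREFIX):] if name.startswith(PREFIX) else name
    if len(short) <= max_len:
        return short
    return "..." + short[-(max_len - 3):]


# Three rows copied from the table above.
runtime = pd.Series(
    {
        "full": 0.0035605,
        "tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis"
        "_add_clip_cast_subtract": 0.00092241,
        "tvmgen_default_fused_nn_conv2d_add_fixed_point_multiply_per_axis"
        "_add_clip_cast": 0.00010516,
    },
    name="Runtime [s]",
)
runtime.index = runtime.index.map(shorten)
print(runtime)
```

Shortened labels are not guaranteed to stay unique for every model, so it is worth checking `runtime.index.is_unique` before using them as keys.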
