In [None]:
# Copyright 2025 Arm Limited and/or its affiliates.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# VGF Backend flow example

This guide demonstrates the full flow for lowering a module to the VGF backend with ExecuTorch.
Tested on Linux x86_64. If something is not working for you, please raise a GitHub issue and tag Arm.

Before you begin:
1. In a clean virtual environment with a compatible Python version, install ExecuTorch using `./install_executorch.sh`.
2. Install the MLSDK and TOSA dependencies using `examples/arm/setup.sh --disable-ethos-u-deps --enable-mlsdk-deps` (for further guidance, refer to https://docs.pytorch.org/executorch/main/tutorial-arm.html).
3. Export the Vulkan environment variables and add the MLSDK components to PATH and LD_LIBRARY_PATH using `examples/arm/ethos-u-scratch/setup_path.sh`.

Run all commands from the base `executorch` folder.



*Some scripts in this notebook produce long output logs. Configuring the 'Customizing Notebook Layout' settings to enable 'Output:scrolling' and setting 'Output:Text Line Limit' makes this more manageable.*

## AOT Flow

The first step is creating the PyTorch module and exporting it. Exporting converts the Python code in the module into a graph structure. The result is still runnable Python code, which can be displayed by printing the `graph_module` of the exported program.

In [None]:
import torch

class Add(torch.nn.Module):
    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return x + y

example_inputs = (torch.ones(1,1,1,1),torch.ones(1,1,1,1))

model = Add()
model = model.eval()
exported_program = torch.export.export_for_training(model, example_inputs)
graph_module = exported_program.module()

_ = graph_module.print_readable()

# VGF backend supports both INT and FP targets. 

To lower the graph_module for FP targets using the VGF backend, we run it through the default FP lowering pipeline. 

FP lowering can be customized for different subgraphs; the sequence shown here is the recommended workflow for VGF.
Because we are staying in floating-point precision, no calibration with example inputs is required. 

If you print the module again, you will see that nodes are left in FP form (or annotated with any necessary casts) without any quantize/dequantize wrappers.


In [None]:
from executorch.backends.arm.arm_backend import ArmCompileSpecBuilder
from executorch.backends.arm.tosa_specification import ( 
    TosaSpecification,
)

# Create a compilation spec describing the floating point target.
tosa_spec = TosaSpecification.create_from_string("TOSA-1.0+FP")

spec_builder = ArmCompileSpecBuilder().vgf_compile_spec(tosa_spec)
compile_spec = spec_builder.build()

_ = graph_module.print_readable()

# Create a new exported program using the graph_module
exported_program = torch.export.export_for_training(graph_module, example_inputs)

To lower the graph_module for INT targets using the VGF backend, we apply the arm_quantizer. 

Quantization can be performed in various ways and tailored to different subgraphs; the sequence shown here represents the recommended workflow for VGF. 

This step also requires calibrating the module with representative inputs. 

If you print the module again, you’ll see that each node is now wrapped in quantization/dequantization nodes that embed the calculated quantization parameters.
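To make "calculated quantization parameters" concrete, here is a minimal pure-Python sketch of generic affine int8 quantization. It is an illustration only: the helper names `choose_qparams`, `quantize`, and `dequantize` are hypothetical, and the exact scheme and ranges used by the Arm quantizer may differ from what is assumed here.

```python
def choose_qparams(observed_min, observed_max, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the value range seen during calibration."""
    # Extend the range so that 0.0 is always exactly representable.
    lo = min(observed_min, 0.0)
    hi = max(observed_max, 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # Degenerate range (all zeros): any positive scale works.
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to the nearest representable integer, clamped to [qmin, qmax]."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to an approximate float."""
    return (q - zero_point) * scale


# Calibrating on an input of ones means the observed range is [1.0, 1.0].
scale, zero_point = choose_qparams(1.0, 1.0)
q = quantize(1.0, scale, zero_point)
restored = dequantize(q, scale, zero_point)
print(scale, zero_point, q, restored)
```

The round trip is only approximate in general: a value inside the calibrated range is recovered to within about half a quantization step (`scale / 2`), which is why results from a quantized model are described as close to, not equal to, the floating-point result.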

In [None]:
from executorch.backends.arm.quantizer import (
    VgfQuantizer,
    get_symmetric_quantization_config,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Create a compilation spec describing the target for configuring the quantizer
tosa_spec = TosaSpecification.create_from_string("TOSA-1.0+INT")

spec_builder = ArmCompileSpecBuilder().vgf_compile_spec(tosa_spec)
compile_spec = spec_builder.build()

# Create and configure quantizer to use a symmetric quantization config globally on all nodes
quantizer = VgfQuantizer(compile_spec)
operator_config = get_symmetric_quantization_config(is_per_channel=False)
quantizer.set_global(operator_config)

# Post training quantization
quantized_graph_module = prepare_pt2e(graph_module, quantizer)
quantized_graph_module(*example_inputs) # Calibrate the graph module with the example input
quantized_graph_module = convert_pt2e(quantized_graph_module)

_ = quantized_graph_module.print_readable()

# Create a new exported program using the quantized_graph_module
quantized_exported_program = torch.export.export_for_training(quantized_graph_module, example_inputs)

# In the example below, we will make use of the quantized graph module.

The lowering in the VGFBackend happens in five steps:

1. **Lowering to the Core ATen operator set**: Transforms the module to use a subset of operators applicable to edge devices.
2. **Partitioning**: Finds subgraphs that will be lowered by the VGF backend.
3. **Lowering to a TOSA-compatible operator set**: Performs transforms to make the VGF subgraph(s) compatible with TOSA.
4. **Serialization to TOSA**: Compiles the graph module into a TOSA graph.
5. **Compilation to VGF**: Compiles the FX GraphModule into a VGF representation using the model_converter and the previously created compile_spec. It also prints a network summary for each processed VGF partition.

All of this happens behind the scenes in `to_edge_transform_and_lower`. Printing the graph module shows that what is left in the graph is two quantization nodes for `x` and `y` going into an `executorch_call_delegate` node, followed by a dequantization node.
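The partitioning step can be pictured with a toy example. The sketch below is not the `VgfPartitioner` algorithm; it is a simplified illustration that assumes a flat, topologically ordered list of operator names (all hypothetical) and groups consecutive supported operators into partitions. A real partitioner works on the graph itself and takes data dependencies and backend capability checks into account.

```python
from typing import List, Set


def partition_supported(ops: List[str], supported: Set[str]) -> List[List[str]]:
    """Group runs of consecutive supported ops; unsupported ops stay outside."""
    partitions: List[List[str]] = []
    current: List[str] = []
    for op in ops:
        if op in supported:
            current.append(op)
        elif current:
            # An unsupported op ends the current partition.
            partitions.append(current)
            current = []
    if current:
        partitions.append(current)
    return partitions


# Hypothetical op names loosely mirroring the quantized Add example.
ops = ["quantize_x", "quantize_y", "add", "dequantize"]
partitions = partition_supported(ops, supported={"add"})
print(partitions)  # [['add']]
```

Each resulting partition corresponds to one delegate call in the lowered graph, while the operators left outside continue to run on the default ExecuTorch kernels.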

In [None]:
import os
from executorch.backends.arm.vgf_partitioner import VgfPartitioner
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.extension.export_util.utils import save_pte_program

# Create partitioner from compile spec
partitioner = VgfPartitioner(compile_spec)

# Lower the exported program to the VGF backend
edge_program_manager = to_edge_transform_and_lower(
            quantized_exported_program,
            partitioner=[partitioner],
            compile_config=EdgeCompileConfig(
                _check_ir_validity=False,
            ),
)

# Convert edge program to executorch
executorch_program_manager = edge_program_manager.to_executorch(
            config=ExecutorchBackendConfig(extract_delegate_segments=False)
)

executorch_program_manager.exported_program().module().print_readable()

# Save pte file
cwd_dir = os.getcwd()
pte_base_name = "simple_example"
pte_name = pte_base_name + ".pte"
pte_path = os.path.join(cwd_dir, pte_name)
save_pte_program(executorch_program_manager, pte_name)
assert os.path.exists(pte_path), "Build failed; no .pte-file found"

## Build executor runtime

### Prerequisite
With the VGF embedded in the PTE, we now need to set up the runtime. To do this we will use the previously built MLSDK dependencies, but we will also need to set up a Vulkan environment outside of ExecuTorch.
Please follow https://vulkan.lunarg.com/sdk/home to set it up.


After the AOT compilation flow is done, we need to build the `executor_runner` target. For this example the generic version will be used.
Please ensure the following commands have been executed before moving on to the next step.

Clean and configure the CMake build system. Compiled programs will appear in the executorch/cmake-out directory we create here.
```
cmake \
  -DCMAKE_INSTALL_PREFIX=cmake-out \
  -DCMAKE_BUILD_TYPE=Debug \
  -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
  -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
  -DEXECUTORCH_BUILD_EXTENSION_FLAT_TENSOR=ON \
  -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \
  -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
  -DEXECUTORCH_BUILD_XNNPACK=OFF \
  -DEXECUTORCH_BUILD_VULKAN=ON \
  -DEXECUTORCH_BUILD_VGF=ON \
  -DEXECUTORCH_ENABLE_LOGGING=ON \
  -DPYTHON_EXECUTABLE=python \
  -Bcmake-out .
```

Build the `executor_runner` target:
`cmake --build cmake-out --target executor_runner`


## Run on VKML Emulator

Finally, we can use the `backends/arm/scripts/run_vkml.sh` utility script to run the .pte end-to-end and prove out the backend's kernel implementation. The script runs the model with an input of ones, so the result of the addition should be close to 2.

In [None]:
import subprocess

# Setup paths
et_dir = os.path.join(cwd_dir, "..", "..")
et_dir = os.path.abspath(et_dir)
script_dir = os.path.join(et_dir, "backends", "arm", "scripts")

args = f"--model={pte_path}"
subprocess.run(os.path.join(script_dir, "run_vkml.sh") + " " + args, shell=True, cwd=et_dir)
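The call above does not inspect the script's exit status, so a failed run can go unnoticed in the notebook. A minimal sketch of a stricter wrapper follows; `run_checked` is a hypothetical helper, and the command used here is a stand-in so the example runs anywhere.

```python
import subprocess
import sys


def run_checked(command, cwd=None):
    """Run a command given as a list of arguments and raise if it fails."""
    result = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} failed with exit code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout


# Stand-in command; it only prints a short message.
output = run_checked([sys.executable, "-c", "print('ok')"])
print(output.strip())
```

In this notebook the same helper could be called as `run_checked([os.path.join(script_dir, "run_vkml.sh"), f"--model={pte_path}"], cwd=et_dir)`. Passing the arguments as a list avoids `shell=True` and the quoting problems that come with paths containing spaces.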