# Copyright 2025 Arm Limited and/or its affiliates.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# Introduction
Model conditioning techniques like pruning modify the weights of a Machine Learning model and in some cases allow significant speed-up of the inference execution, reduction of the memory footprint and reduction in the overall power consumption of the system. Assuming you can optimise your workload without loss in accuracy and you target an Arm® Ethos™ NPU or a GPU with a Neural Engine, you should consider pruning the neural network before compiling it in the to_edge_transform_and_lower stage.

# Why apply model conditioning?
The Ethos-U hardware has a dedicated weight decoder to process the model weights. The compiler arranges the weights into blocks, and the blocks are then fed to the hardware weight decoder. As part of the block arrangement process, the compiler compresses sequences of zero weights and clusters of weights. To avoid any doubt, the compression by the compiler is lossless: for the same input tensor, the output tensor from execution on the NPU is the same whether or not compression was applied. If the model you provide in the to_edge_transform_and_lower stage is optimised to have sequences of zero weights and/or clusters of the same weights, the compiler will be able to compress these weights very efficiently. Good compression reduces the number of memory accesses the NPU makes at runtime, so the MAC engines spend less time waiting on memory and overall performance improves. In other words, if you have a memory-bound model, you should consider pruning and clustering your neural network before lowering it in the to_edge_transform_and_lower stage.
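
To build some intuition for why runs of zero weights compress well, here is a minimal sketch of a lossless zero-run encoder in plain Python. It is a toy illustration only: the token format and the helper names `encode_zero_runs` and `decode_zero_runs` are assumptions made for this example and do not describe the actual encoding used by the compiler or the weight decoder.

```python
# Toy illustration only: this is NOT the Ethos-U weight encoding.

def encode_zero_runs(weights):
    """Encode a list of ints as ("z", run_length) or ("v", value) tokens."""
    tokens = []
    i = 0
    while i < len(weights):
        if weights[i] == 0:
            # Collapse a whole run of zeros into a single token
            j = i
            while j < len(weights) and weights[j] == 0:
                j += 1
            tokens.append(("z", j - i))
            i = j
        else:
            tokens.append(("v", weights[i]))
            i += 1
    return tokens


def decode_zero_runs(tokens):
    """Invert encode_zero_runs exactly, so the scheme is lossless."""
    weights = []
    for kind, payload in tokens:
        if kind == "z":
            weights.extend([0] * payload)
        else:
            weights.append(payload)
    return weights


dense = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8]
sparse = [3, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 8]

dense_tokens = encode_zero_runs(dense)
sparse_tokens = encode_zero_runs(sparse)

# Both lists hold 12 weights, but the sparse one needs far fewer tokens
print(len(dense_tokens), len(sparse_tokens))  # 12 5
```

Decoding the tokens returns the original list in both cases, which mirrors the lossless property described above: only the amount of data that has to be read changes.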

The Ethos-U85 hardware also has hardware support for 2:4 sparse weights - if you have 2:4 sparse weights, the MAC array will skip multiplications where the result will be 0. The 2:4 sparsity allows power savings for all configurations and provides a speed-up on compute-bound neural networks.
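
A 2:4 sparse pattern means that at least 2 out of every 4 consecutive weights are zero. The sketch below checks and enforces that pattern on a flat list of weights. The helper names `is_2_4_sparse` and `enforce_2_4` are hypothetical, and grouping along a flattened row is an assumption for illustration; the axis along which the hardware expects the groups is not covered here.

```python
def is_2_4_sparse(row):
    """True if every aligned group of 4 values has at most 2 non-zeros."""
    if len(row) % 4 != 0:
        return False
    return all(
        sum(1 for w in row[i:i + 4] if w != 0) <= 2
        for i in range(0, len(row), 4)
    )


def enforce_2_4(row):
    """Keep the 2 largest-magnitude values in each group of 4, zero the rest."""
    out = list(row)
    for start in range(0, len(row), 4):
        group = range(start, min(start + 4, len(row)))
        # Indices of the two largest-magnitude entries in this group
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0
    return out


row = [5, -1, 2, 7, 0, 3, 0, -4]
pruned_row = enforce_2_4(row)

print(is_2_4_sparse(row))         # False: the first group has 4 non-zeros
print(pruned_row)                 # [5, 0, 0, 7, 0, 3, 0, -4]
print(is_2_4_sparse(pruned_row))  # True
```

As with any pruning, zeroing weights this way changes the model, so the accuracy needs to be checked and usually recovered with retraining.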

Before we begin, make sure you are running the Jupyter notebook from the correct Python virtual environment.

# Prerequisites
Let's import the Python packages you will need to run through the Jupyter notebook.

In [None]:
import torch
from torchvision import datasets, transforms
from torch import nn
import torch.nn.utils.prune as prune
import torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
import random

from executorch.backends.arm.ethosu import EthosUPartitioner
from executorch.exir import (
    EdgeCompileConfig,
    ExecutorchBackendConfig,
    to_edge_transform_and_lower,
)
from executorch.backends.arm.ethosu import EthosUCompileSpec
from executorch.backends.arm.quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.extension.export_util.utils import save_pte_program

# Model conditioning with PyTorch and deployment with ExecuTorch 
We'll define a simple model with 3 back-to-back Linear layers. We will execute the model on the Ethos-U85 NPU, then we will prune the model and execute the pruned variant on the Ethos-U85 and compare the performance.

In [None]:
LR = 1e-3
NUM_EPOCHS = 1
BATCH_SIZE = 128

# Data
transform = transforms.Compose([transforms.ToTensor()])
train_ds = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST("./data", train=False, transform=transform)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False)

class Simple_NN(nn.Module): 
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
    def prunable_parameters(self):
        return (
            (self.fc1, "weight"),
            (self.fc2, "weight"),
            (self.fc3, "weight"),
        )

    def prune(self, pruning_method: prune.BasePruningMethod, amount: float = 0.1):
        # reference https://pytorch.org/tutorials/intermediate/pruning_tutorial.html

        # produces a mask that is multiplied with the parameter
        prune.global_unstructured(
            self.prunable_parameters(),
            pruning_method=pruning_method,
            amount=amount,
        )

We define a simple model with 3 back-to-back linear layers. Linear is a highly memory-bound operation because every weight is read only once from the external memory. It is impossible to buffer the weights in memory (you usually have more weights in the external memory than space in the SRAM) and reuse them for the computation. In comparison, a convolution usually has small filter sizes (e.g. a 3x3 filter), which means you can buffer all the convolution weights in memory and reuse them for the computation. If your model, or a module within the model, is composed entirely of Linear layers, the workload will be memory bound and pruning is likely to provide a good speed-up.
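
The difference in weight reuse can be made concrete with a little arithmetic. The sketch below counts how many multiply-accumulate operations use each weight for a Linear layer and for a convolution at batch size 1. The convolution shape (16 input channels, 32 output channels, 3x3 filter, 28x28 output) is a hypothetical example chosen for comparison, and the helper names are assumptions made for this illustration.

```python
def linear_weight_reuse(in_features, out_features, batch=1):
    """MACs performed per weight for a fully connected layer."""
    weights = in_features * out_features
    macs = batch * in_features * out_features
    return macs / weights


def conv2d_weight_reuse(in_ch, out_ch, kernel, out_h, out_w, batch=1):
    """MACs performed per weight for a 2D convolution."""
    weights = in_ch * out_ch * kernel * kernel
    # Every weight is applied once per output position
    macs = batch * out_h * out_w * weights
    return macs / weights


# The three Linear layers of Simple_NN as (in_features, out_features)
layers = [(28 * 28, 512), (512, 256), (256, 10)]
total_weights = sum(i * o for i, o in layers)

print(total_weights)                           # 535040
print(linear_weight_reuse(28 * 28, 512))       # 1.0
print(conv2d_weight_reuse(16, 32, 3, 28, 28))  # 784.0
```

Each Linear weight takes part in a single MAC per inference, whereas each weight of the example convolution is reused 784 times, which is why buffering pays off for convolutions but not for Linear layers.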

Next, let's define a simple function to train the network and a function to evaluate the accuracy of the model.

In [None]:
# Training loop
def train(model):
    # The model is simple enough that we can train it on CPU
    device = "cpu"
    # Create the optimizer and loss once so the Adam state persists across epochs
    opt = torch.optim.Adam(model.parameters(), lr=LR)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(NUM_EPOCHS):
        # ---- Training ----
        model.train()
        for step, (inp, out_real) in enumerate(train_loader):
            inp, out_real = inp.to(device), out_real.to(device)
            opt.zero_grad()
            out_pred = model(inp)
            loss = criterion(out_pred, out_real)
            #print(f"Loss: {loss.item():.4f}")
            loss.backward()
            opt.step()

def evaluate(model):
    # ---- Evaluation ----
    correct, total = 0, 0
    with torch.no_grad():
        for inp, out_real in test_loader:
            out_pred = model(inp)
            preds = out_pred.argmax(1)
            correct += (preds == out_real).sum().item()
            total += out_real.size(0)

    acc = 100 * correct / total
    print(f"Top 1 accuracy = {acc:.2f}%")

Let's instantiate the model and train it. In order to get reproducible results, we will fix the seed.

In [None]:
SEED = 123
torch.manual_seed(SEED)
model = Simple_NN()
train(model)
print("Evaluate FP32 model accuracy")
evaluate(model)

We obtain 96% top1 accuracy for the FP32 model.

Next, we would like to apply post-training quantization with ExecuTorch and evaluate the accuracy of the quantized model. It is important to calibrate the quantized model on a few real samples from the MNIST dataset to get good quantization parameters.
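
To see why calibration data matters, consider a generic symmetric int8 scheme in which the scale is derived from the largest magnitude observed during calibration. The sketch below is a simplified illustration in plain Python; the helper names and the [-127, 127] range are assumptions for this example, and it is not the exact algorithm that the ExecuTorch quantizer applies.

```python
def symmetric_scale(calibration_values, qmax=127):
    """Pick the scale so the largest observed magnitude maps to qmax."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(value, scale, qmin=-127, qmax=127):
    """Round to the nearest integer step and clamp to the int8 range."""
    q = round(value / scale)
    return max(qmin, min(qmax, q))


def dequantize(q, scale):
    """Map the integer back to an approximate real value."""
    return q * scale


# Hypothetical activation values seen during calibration
calibration = [-0.5, 0.25, 1.27, -1.0]
scale = symmetric_scale(calibration)

print(quantize(1.27, scale))  # 127: the largest calibrated value
print(quantize(-1.0, scale))  # -100
print(quantize(5.0, scale))   # 127: values outside the calibrated range saturate
```

If the calibration samples are not representative of real inputs, the scale is either too small (real values saturate) or too large (resolution is wasted), and accuracy suffers in both cases.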

In [None]:
# MNIST images are 28x28 in greyscale, hence the shape is 1x1x28x28
example_inputs = (torch.randn(1,1,28,28),)
exported_program = torch.export.export(model, example_inputs)
graph_module = exported_program.module(check_guards=False)

# Create a compilation spec describing the target for configuring the quantizer
compile_spec = EthosUCompileSpec(
            target="ethos-u85-128",
            system_config="Ethos_U85_SYS_Flash_High",
            memory_mode="Shared_Sram",
            extra_flags=["--output-format=raw", "--debug-force-regor --verbose-weights"]
        )

# Create and configure quantizer to use a symmetric quantization config globally on all nodes
quantizer = EthosUQuantizer(compile_spec)
operator_config = get_symmetric_quantization_config()
quantizer.set_global(operator_config)

# Post training quantization, need a few example images to obtain good quantization parameters
subset_indices = random.sample(range(len(train_ds)), 50)
calibration_set = Subset(train_ds, subset_indices)
calibration_loader = DataLoader(calibration_set, shuffle=False)

quantized_graph_module = prepare_pt2e(graph_module, quantizer)
for batch_images,label in calibration_loader:
    quantized_graph_module(*batch_images) # Calibrate the graph module with the example input
quantized_graph_module = convert_pt2e(quantized_graph_module)

Next, let us evaluate the accuracy of the quantized model.

In [None]:
print("Accuracy of the quantized model")
evaluate(quantized_graph_module)

We maintain the 96% top1 accuracy for the quantized model. Next, let's compile the model for the Ethos-U backend. We will define a function `generate_pte` that calls `to_edge_transform_and_lower` and saves the pte file to disk.

In [None]:
def generate_pte(quantized_exported_program,compile_spec,name):
    # Create partitioner from compile spec
    partitioner = EthosUPartitioner(compile_spec)

    # Lower the exported program to the Ethos-U backend
    edge_program_manager = to_edge_transform_and_lower(
                quantized_exported_program,
                partitioner=[partitioner],
                compile_config=EdgeCompileConfig(
                    _check_ir_validity=False,
                ),
            )

    # Convert edge program to executorch
    executorch_program_manager = edge_program_manager.to_executorch(
                config=ExecutorchBackendConfig(extract_delegate_segments=False)
            )

    # Save pte file
    save_pte_program(executorch_program_manager, f"{name}.pte")

# Create a new exported program using the quantized_graph_module
quantized_exported_program = torch.export.export(quantized_graph_module, example_inputs)
generate_pte(quantized_exported_program,compile_spec,"original_model")

Note that as part of the compilation process in `to_edge_transform_and_lower`, we get Weight Compression information:
```
Original Weights Size                          522.50 KiB
NPU Encoded Weights Size                       507.44 KiB
```
In other words, the original weights are 522 KiB and, after compilation and encoding by the compiler, we get 507 KiB of weights that will be read by the NPU at runtime. Remember this is for the case when we've not applied pruning or clustering. This step generates the original_model.pte file that we will deploy on device later on.
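
The reported original size can be cross-checked by hand. The sketch below assumes the figure counts one byte per int8 weight and excludes the biases; under that assumption the three Linear layers add up to exactly the 522.50 KiB printed by the compiler.

```python
# (in_features, out_features) for fc1, fc2 and fc3 of Simple_NN
layer_shapes = [(28 * 28, 512), (512, 256), (256, 10)]

num_weights = sum(i * o for i, o in layer_shapes)

# Assumption: one byte per int8 weight, biases not counted
original_kib = num_weights / 1024

# Encoded size taken from the compiler output above
encoded_kib = 507.44

print(f"{original_kib:.2f} KiB")              # 522.50 KiB
print(f"{encoded_kib / original_kib:.1%}")    # 97.1%
```

Without pruning or clustering, the encoded weights are still about 97% of the original size, so the NPU has to read almost all of the weight data from memory.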

Next, let's move on to prune the model and evaluate its accuracy. We have a lot of weights in the original network, so we will apply a 95% pruning rate.

In [None]:
print("Prune the model")
model.prune(pruning_method=prune.L1Unstructured, amount=0.95)
print("Evaluate pruned model accuracy")
evaluate(model)

We obtain 37% top1 accuracy for the pruned model. That can seem surprising at first sight, but remember that when we prune with `prune.L1Unstructured`, we set the 95% of the weights with the smallest absolute values to 0. It is normal to lose accuracy when applying pruning. We need to retrain the model in order to recover the accuracy we've lost from the pruning. We can do that easily by calling the train function one more time. Once we are done with the retraining, it is important to call `prune.remove` on the pruned parameters, which makes the pruning permanent by folding the mask into the weights.
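
`prune.L1Unstructured` ranks weights by absolute value, and `prune.global_unstructured` applies that ranking across all the listed layers together. The sketch below reproduces the idea in plain Python on two tiny hypothetical layers. The function `global_magnitude_prune` is an illustration written for this example, not the PyTorch implementation.

```python
def global_magnitude_prune(layers, amount):
    """Zero the `amount` fraction of weights with the smallest absolute
    value, ranked across all layers together."""
    # Collect (magnitude, layer index, position) for every weight
    ranked = sorted(
        (abs(w), li, wi)
        for li, layer in enumerate(layers)
        for wi, w in enumerate(layer)
    )
    n_prune = round(amount * len(ranked))

    pruned = [list(layer) for layer in layers]
    for _, li, wi in ranked[:n_prune]:
        pruned[li][wi] = 0.0
    return pruned


# Two tiny hypothetical layers with 4 weights each
layers = [[0.9, -0.05, 0.3, -0.7], [0.02, -0.6, 0.1, 0.01]]
pruned = global_magnitude_prune(layers, amount=0.5)

print(pruned)  # [[0.9, 0.0, 0.3, -0.7], [0.0, -0.6, 0.0, 0.0]]
```

Because the ranking is global, the second layer loses three of its four weights while the first loses only one. The same effect applies to the real model: layers with many small weights end up sparser than the overall pruning rate.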

In [None]:
print("Train the pruned model to recover the lost information")
train(model)
# Make the pruning permanent once the model is retrained: prune.remove drops the
# mask and the original copy of each weight and keeps the pruned weight
for module, param_name in model.prunable_parameters():
    prune.remove(module, param_name)

print("Evaluate pruned model accuracy")
evaluate(model)

We obtain 96% top1 accuracy for the pruned workload so we have recovered the accuracy we've lost with the pruning. Let's quantize the pruned model, evaluate the accuracy of the int8 network and obtain a pte file.

In [None]:
pruned_exported_program = torch.export.export(model, example_inputs)
pruned_graph_module = pruned_exported_program.module(check_guards=False)
quantized_pruned_graph_module = prepare_pt2e(pruned_graph_module, quantizer)
for batch_images,label in calibration_loader:
    quantized_pruned_graph_module(*batch_images) # Calibrate the graph module with the example input
quantized_pruned_graph_module = convert_pt2e(quantized_pruned_graph_module)
print("Accuracy of the pruned quantized model")
evaluate(quantized_pruned_graph_module)

quantized_ep_pruned = torch.export.export(quantized_pruned_graph_module, example_inputs)
generate_pte(quantized_ep_pruned,compile_spec,"pruned_model")

We obtain 96% top1 accuracy for the quantized pruned model. What is interesting is that this time, the NPU encoded weights size shrank considerably:
```
Original Weights Size                          522.50 KiB
NPU Encoded Weights Size                        46.12 KiB
```
In other words, we are now solving the MNIST classification problem with just 46 KiB of encoded weights. This is a significant reduction from the 507 KiB we had in the original model.



# NPU performance
In the sections above, we generated two pte files - one pte for the original model and another pte for the pruned model. These models perform very similarly in terms of accuracy. Let's run both of these models on the NPU and analyse the performance at runtime.

# Performance of the original model

In [None]:
%%bash
# Ensure the arm-none-eabi-gcc toolchain and FVPs are available on $PATH
source arm-scratch/setup_path.sh

# Build executorch libraries cross-compiled for arm baremetal to executorch/cmake-out-arm
cmake --preset arm-baremetal \
-DCMAKE_BUILD_TYPE=Release \
-B../../cmake-out-arm ../..
cmake --build ../../cmake-out-arm --target install -j$(nproc) 

In [None]:
%%bash 
source arm-scratch/setup_path.sh
# Build the example executor runner application into ethos_u_original_model
cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/ethos-u-setup/arm-none-eabi-gcc.cmake \
      -DCMAKE_BUILD_TYPE=Release \
      -DET_PTE_FILE_PATH=original_model.pte \
      -DTARGET_CPU=cortex-m55 \
      -DETHOSU_TARGET_NPU_CONFIG=ethos-u85-128 \
      -DMEMORY_MODE=Shared_Sram \
      -DSYSTEM_CONFIG=Ethos_U85_SYS_DRAM_Mid \
      -Bethos_u_original_model \
      executor_runner
cmake --build ethos_u_original_model -j$(nproc) -- arm_executor_runner

In [None]:
%%bash 
source arm-scratch/setup_path.sh
# Run the original model
../../backends/arm/scripts/run_fvp.sh --elf=ethos_u_original_model/arm_executor_runner --target=ethos-u85-128

We obtain a total of 99k NPU Active cycles. The MAC engines of the NPU are active for 8k cycles and the Weight Decoder is active for 74k NPU cycles. It's worth noting that the data flow in the Ethos-U is pipelined. In other words, the MAC array and the Weight Decoder work at the same time. A total of 99k NPU cycles with only 8k active MAC cycles and 74k active Weight Decoder cycles means that the NPU spends most of its time decoding weights and the MAC array is underutilized. Pruning is designed to alleviate that bottleneck. Let's analyse the performance of the pruned workload.
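
The imbalance is easier to see as a fraction of the total runtime. The short calculation below uses the rounded cycle counts quoted above, so the percentages are approximate.

```python
# Rounded cycle counts quoted above for the original model
npu_active_cycles = 99_000
mac_active_cycles = 8_000
weight_decoder_active_cycles = 74_000

mac_utilisation = mac_active_cycles / npu_active_cycles
weight_decoder_utilisation = weight_decoder_active_cycles / npu_active_cycles

print(f"MAC engines busy:    {mac_utilisation:.0%}")             # 8%
print(f"Weight Decoder busy: {weight_decoder_utilisation:.0%}")  # 75%
```

The MAC engines are busy for roughly 8% of the inference while the Weight Decoder is busy for roughly 75% of it, which is the signature of a workload limited by weight traffic rather than by compute.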

# Performance of the pruned model

In [None]:
%%bash 
source arm-scratch/setup_path.sh

# Build the example executor runner application into ethos_u_pruned_model
cmake -DCMAKE_TOOLCHAIN_FILE=$(pwd)/ethos-u-setup/arm-none-eabi-gcc.cmake \
      -DCMAKE_BUILD_TYPE=Release \
      -DET_PTE_FILE_PATH=pruned_model.pte \
      -DTARGET_CPU=cortex-m55 \
      -DETHOSU_TARGET_NPU_CONFIG=ethos-u85-128 \
      -DMEMORY_MODE=Shared_Sram \
      -DSYSTEM_CONFIG=Ethos_U85_SYS_DRAM_Mid \
      -Bethos_u_pruned_model \
      executor_runner
cmake --build ethos_u_pruned_model -j$(nproc) -- arm_executor_runner

In [None]:
%%bash 
source arm-scratch/setup_path.sh
# Run the pruned model
../../backends/arm/scripts/run_fvp.sh --elf=ethos_u_pruned_model/arm_executor_runner --target=ethos-u85-128

On the pruned model, the inference completes in 22k NPU cycles. The MAC engines are still active for 8k cycles, but this time the number of cycles when the Weight Decoder is active has dropped to 17k cycles.
It's also worth noting that the size of the pte file has been reduced significantly, from 518 KB for the original model to 57 KB for the pruned workload.

# Conclusion
We defined a simple model to classify the MNIST dataset. The model uses Linear layers and is heavily memory-bound on the external memory. We pruned the model and obtained similar int8 accuracy for the original workload and its pruned counterpart. Let us put the results from the runtime in a table and draw a few conclusions:

| Model                                   |NPU_ACTIVE cycles | NPU Encoded Weight Size   | Weight Decoder Active Cycles |    External memory beats read   | Size of the pte file  |
| ----------------------------------------|----------------- | ------------------------- | -----------------------------|---------------------------------|-----------------------|
| Original model                          |         97k      |           506 KB          |             74k              |              32k                |       517 KB          |
| Pruned model                            |         22k      |           46 KB           |              8k              |              3k                 |       57 KB           |

For the pruned network, we obtain a significant uplift: over 4x improvement in the inference speed (97k to 22k NPU_ACTIVE cycles) and a drastic reduction in the number of cycles when the Weight Decoder is active. The NPU will consume less power, and the size of the pruned model that we save on-device is significantly smaller compared to the original network.
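
The improvement factors can be derived directly from the table. The sketch below uses the rounded NPU_ACTIVE cycles, encoded weight size and pte file size from the table, so the ratios are approximate; the dictionary keys and the `improvement` helper are names chosen for this example.

```python
# Rounded values copied from the results table
results = {
    "original": {"npu_active_kcycles": 97, "encoded_weights_kb": 506, "pte_kb": 517},
    "pruned": {"npu_active_kcycles": 22, "encoded_weights_kb": 46, "pte_kb": 57},
}


def improvement(original, pruned):
    """Ratio original / pruned for every metric the two runs share."""
    return {metric: original[metric] / pruned[metric] for metric in original}


ratios = improvement(results["original"], results["pruned"])
for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.1f}x")
# npu_active_kcycles: 4.4x
# encoded_weights_kb: 11.0x
# pte_kb: 9.1x
```

The encoded weights shrink by about 11x while the runtime improves by about 4.4x, which suggests that once the weight traffic is reduced, other parts of the pipeline start to limit the inference time.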