diff --git a/.gitignore b/.gitignore index b166f8c9512..54572407274 100644 --- a/.gitignore +++ b/.gitignore @@ -62,7 +62,6 @@ xcuserdata/ /include/ /share/ /version.py -*.csv *_etdump # Android diff --git a/backends/xnnpack/README.md b/backends/xnnpack/README.md index 6e6be7ddb4c..7c6a7ccbc33 100644 --- a/backends/xnnpack/README.md +++ b/backends/xnnpack/README.md @@ -134,4 +134,4 @@ create an issue on [github](https://www.github.com/pytorch/executorch/issues). ## See Also For more information about the XNNPACK Backend, please check out the following resources: - [XNNPACK Backend](https://pytorch.org/executorch/main/backends-xnnpack) -- [XNNPACK Backend Internals](https://pytorch.org/executorch/main/backend-delegates-xnnpack-reference) +- [XNNPACK Backend Internals](https://pytorch.org/executorch/main/backends/xnnpack/backend-delegates-xnnpack-reference) diff --git a/docs/source/backend-delegate-advanced.md b/docs/source/backend-delegate-advanced.md index 752bd1cdc02..e82e5ee035d 100644 --- a/docs/source/backend-delegate-advanced.md +++ b/docs/source/backend-delegate-advanced.md @@ -6,10 +6,6 @@ - {doc}`backend-delegates-integration` — Learn how to integrate a backend delegate into ExecuTorch -## XNNPACK Reference - -- {doc}`backend-delegates-xnnpack-reference` — Deep dive into XNNPACK delegate internals and implementation details - ## Dependency Management - {doc}`backend-delegates-dependencies` — Manage third-party dependencies for backend delegates @@ -27,7 +23,6 @@ :maxdepth: 1 backend-delegates-integration -backend-delegates-xnnpack-reference backend-delegates-dependencies compiler-delegate-and-partitioner debug-backend-delegate diff --git a/docs/source/backend-development.md b/docs/source/backend-development.md index ec5ceb3b37a..40c50a8ad11 100644 --- a/docs/source/backend-development.md +++ b/docs/source/backend-development.md @@ -4,7 +4,6 @@ :maxdepth: 1 backend-delegates-integration -backend-delegates-xnnpack-reference backend-delegates-dependencies compiler-delegate-and-partitioner debug-backend-delegate diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md index 4a3313964a8..4565869662e 100644 --- a/docs/source/backends-overview.md +++ b/docs/source/backends-overview.md @@ -18,20 +18,20 @@ Backends are the bridge between your exported model and the hardware it runs on. 
 ## Choosing a Backend
 
-| Backend                                   | Platform(s)         | Hardware Type | Typical Use Case                |
-|-------------------------------------------|---------------------|---------------|---------------------------------|
-| [XNNPACK](backends-xnnpack)               | All                 | CPU           | General-purpose, fallback       |
-| [Core ML](backends-coreml)                | iOS, macOS          | NPU/GPU       | Apple devices, high performance |
-| [Metal Performance Shaders](backends-mps) | iOS, macOS          | GPU           | Apple GPU acceleration          |
-| [Vulkan ](backends-vulkan)                | Android             | GPU           | Android GPU acceleration        |
-| [Qualcomm](backends-qualcomm)             | Android             | NPU           | Qualcomm SoCs                   |
-| [MediaTek](backends-mediatek)             | Android             | NPU           | MediaTek SoCs                   |
-| [ARM EthosU](backends-arm-ethos-u)        | Embedded            | NPU           | ARM MCUs                        |
-| [ARM VGF](backends-arm-vgf)               | Android             | NPU           | ARM platforms                   |
-| [OpenVINO](build-run-openvino)            | Embedded            | CPU/GPU/NPU   | Intel SoCs                      |
-| [NXP](backends-nxp)                       | Embedded            | NPU           | NXP SoCs                        |
-| [Cadence](backends-cadence)               | Embedded            | DSP           | DSP-optimized workloads         |
-| [Samsung Exynos](backends-samsung-exynos) | Android             | NPU           | Samsung SoCs                    |
+| Backend                                                    | Platform(s)         | Hardware Type | Typical Use Case                |
+|------------------------------------------------------------|---------------------|---------------|---------------------------------|
+| [XNNPACK](/backends/xnnpack/xnnpack-overview)              | All                 | CPU           | General-purpose, fallback       |
+| [Core ML](backends-coreml)                                 | iOS, macOS          | NPU/GPU       | Apple devices, high performance |
+| [Metal Performance Shaders](backends-mps)                  | iOS, macOS          | GPU           | Apple GPU acceleration          |
+| [Vulkan](backends-vulkan)                                  | Android             | GPU           | Android GPU acceleration        |
+| [Qualcomm](backends-qualcomm)                              | Android             | NPU           | Qualcomm SoCs                   |
+| [MediaTek](backends-mediatek)                              | Android             | NPU           | MediaTek SoCs                   |
+| [ARM EthosU](backends-arm-ethos-u)                         | Embedded            | NPU           | ARM MCUs                        |
+| [ARM VGF](backends-arm-vgf)                                | Android             | NPU           | ARM platforms                   |
+| [OpenVINO](build-run-openvino)                             | Embedded            | CPU/GPU/NPU   | Intel SoCs                      |
+| [NXP](backends-nxp)                                        | Embedded            | NPU           | NXP SoCs                        |
+| [Cadence](backends-cadence)                                | Embedded            | DSP           | DSP-optimized workloads         |
+| [Samsung Exynos](backends-samsung-exynos)                  | Android             | NPU           | Samsung SoCs                    |
 
 **Tip:** For best performance, export a `.pte` file for each backend you plan to support.
 
@@ -46,11 +46,11 @@ Backends are the bridge between your exported model and the hardware it runs on.
 ---
 
 ```{toctree}
-:maxdepth: 1
+:maxdepth: 3
 :hidden:
 :caption: Backend Overview
 
-backends-xnnpack
+backends/xnnpack/xnnpack-overview
 backends-coreml
 backends-mps
 backends-vulkan
diff --git a/docs/source/backends-xnnpack.md b/docs/source/backends-xnnpack.md
deleted file mode 100644
index 42e76741ec8..00000000000
--- a/docs/source/backends-xnnpack.md
+++ /dev/null
@@ -1,182 +0,0 @@
-# XNNPACK Backend
-
-The XNNPACK delegate is the ExecuTorch solution for CPU execution on mobile CPUs. [XNNPACK](https://github.com/google/XNNPACK/tree/master) is a library that provides optimized kernels for machine learning operators on Arm and x86 CPUs.
-
-## Features
-
-- Wide operator support on Arm and x86 CPUs, available on any modern mobile phone.
-- Support for a wide variety of quantization schemes and quantized operators.
-- Supports fp32 and fp16 activations.
-- Supports 8-bit quantization.
-
-## Target Requirements
-
-- ARM64 on Android, iOS, macOS, Linux, and Windows.
-- ARMv7 (with NEON) on Android.
-- ARMv6 (with VFPv2) on Linux.
-- x86 and x86-64 (up to AVX512) on Windows, Linux, Android.
- -## Development Requirements - -The XNNPACK delegate does not introduce any development system requirements beyond those required by -the core ExecuTorch runtime. - ----- - -## Using the XNNPACK Backend - -To target the XNNPACK backend during the export and lowering process, pass an instance of the `XnnpackPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision. - -```python -import torch -import torchvision.models as models -from torchvision.models.mobilenetv2 import MobileNet_V2_Weights -from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner -from executorch.exir import to_edge_transform_and_lower - -mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() -sample_inputs = (torch.randn(1, 3, 224, 224), ) - -et_program = to_edge_transform_and_lower( - torch.export.export(mobilenet_v2, sample_inputs), - partitioner=[XnnpackPartitioner()], -).to_executorch() - -with open("mv2_xnnpack.pte", "wb") as file: - et_program.write_to_file(file) -``` - -### Partitioner API - -The XNNPACK partitioner API allows for configuration of the model delegation to XNNPACK. Passing an `XnnpackPartitioner` instance with no additional parameters will run as much of the model as possible on the XNNPACK backend. This is the most common use-case. For advanced use cases, the partitioner exposes the following options via the [constructor](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/xnnpack_partitioner.py#L31): - - - `configs`: Control which operators are delegated to XNNPACK. By default, all available operators all delegated. See [../config/\_\_init\_\_.py](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/config/__init__.py#L66) for an exhaustive list of available operator configs. - - `config_precisions`: Filter operators by data type. By default, delegate all precisions. One or more of `ConfigPrecisionType.FP32`, `ConfigPrecisionType.STATIC_QUANT`, or `ConfigPrecisionType.DYNAMIC_QUANT`. See [ConfigPrecisionType](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/config/xnnpack_config.py#L24). - - `per_op_mode`: If true, emit individual delegate calls for every operator. This is an advanced option intended to reduce memory overhead in some contexts at the cost of a small amount of runtime overhead. Defaults to false. - - `verbose`: If true, print additional information during lowering. - -### Testing the Model - -After generating the XNNPACK-delegated .pte, the model can be tested from Python using the ExecuTorch runtime python bindings. This can be used to sanity check the model and evaluate numerical accuracy. See [Testing the Model](using-executorch-export.md#testing-the-model) for more information. - ----- - -## Quantization - -The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use the `XNNPACKQuantizer`. `Quantizers` are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models to leverage the quantized operators offered by the XNNPACK Library. - -### Supported Quantization Schemes -The XNNPACK delegate supports the following quantization schemes: - -- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow). - - Supports both static and dynamic activations. - - Supports per-channel and per-tensor schemes. 
- - Supports linear, convolution, add, mul, cat, and adaptive avg pool 2d operators. - -Weight-only quantization is not currently supported on XNNPACK. - -### 8-bit Quantization using the PT2E Flow - -To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model: - -1) Create an instance of the `XnnpackQuantizer` class. Set quantization parameters. -2) Use `torch.export.export` to prepare for quantization. -3) Call `prepare_pt2e` to prepare the model for quantization. -4) For static quantization, run the prepared model with representative samples to calibrate the quantized tensor activation ranges. -5) Call `convert_pt2e` to quantize the model. -6) Export and lower the model using the standard flow. - -The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques. - -```python -import torch -import torchvision.models as models -from torchvision.models.mobilenetv2 import MobileNet_V2_Weights -from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import XNNPACKQuantizer, get_symmetric_quantization_config -from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner -from executorch.exir import to_edge_transform_and_lower -from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e - -model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() -sample_inputs = (torch.randn(1, 3, 224, 224), ) - -qparams = get_symmetric_quantization_config(is_per_channel=True) # (1) -quantizer = XNNPACKQuantizer() -quantizer.set_global(qparams) - -training_ep = torch.export.export(model, sample_inputs).module() # (2) -prepared_model = prepare_pt2e(training_ep, quantizer) # (3) - -for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs - prepared_model(cal_sample) # (4) Calibrate - -quantized_model = convert_pt2e(prepared_model) # (5) - -et_program = to_edge_transform_and_lower( # (6) - torch.export.export(quantized_model, sample_inputs), - partitioner=[XnnpackPartitioner()], -).to_executorch() -``` - -See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information. - -### LLM quantization with quantize_ - -The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) quantize_ API. This is most commonly used for LLMs, requiring more advanced quantization. Since quantize_ is not backend aware, it is important to use a config that is compatible with CPU/XNNPACK: - -* Quantize embeedings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity) -* Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity) - -Below is a simple example, but a more detailed tutorial including accuracy evaluation on popular LLM benchmarks can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch). 
- -```python -from torchao.quantization.granularity import PerGroup, PerAxis -from torchao.quantization.quant_api import ( - IntxWeightOnlyConfig, - Int8DynamicActivationIntxWeightConfig, - quantize_, -) - -# Quantize embeddings with 8-bits, per channel -embedding_config = IntxWeightOnlyConfig( - weight_dtype=torch.int8, - granularity=PerAxis(0), -) -qunatize_( - eager_model, - lambda m, fqn: isinstance(m, torch.nn.Embedding), -) - - -# Quatize linear layers with 8-bit dynamic activations and 4-bit weights -linear_config = Int8DynamicActivationIntxWeightConfig( - weight_dtype=torch.int4, - weight_granularity=PerGroup(32), -) -quantize_(eager_model, linear_config) -``` - ----- - -## Runtime Integration - -To run the model on-device, use the standard ExecuTorch runtime APIs. See [Running on Device](getting-started.md#running-on-device) for more information. - -The XNNPACK delegate is included by default in the published Android, iOS, and pip packages. When building from source, pass `-DEXECUTORCH_BUILD_XNNPACK=ON` when configuring the CMake build to compile the XNNPACK backend. - -To link against the backend, add the `xnnpack_backend` CMake target as a build dependency, or link directly against `libxnnpack_backend`. Due to the use of static registration, it may be necessary to link with whole-archive. This can typically be done by passing `"$"` to `target_link_libraries`. - -``` -# CMakeLists.txt -add_subdirectory("executorch") -... -target_link_libraries( - my_target - PRIVATE executorch - extension_module_static - extension_tensor - optimized_native_cpu_ops_lib - xnnpack_backend) -``` - -No additional steps are necessary to use the backend beyond linking the target. Any XNNPACK-delegated .pte file will automatically run on the registered backend. diff --git a/docs/source/backends/template/README.md b/docs/source/backends/template/README.md new file mode 100644 index 00000000000..e7cb037bd6c --- /dev/null +++ b/docs/source/backends/template/README.md @@ -0,0 +1,53 @@ +# Backend Documentation Template + +This template provides a standardized structure and starting point for backend documentation. It is intended to provide a uniform experience for users while allowing for backends to customize their documentation as needed. + +## Template Structure + +The template includes the following files: + +### Required Pages + +- `backend-overview.md` - Main backend overview and introduction + +### Recommended Pages + +- `backend-quantization.md` - Quantization support and API documentation +- `backend-partitioner.md` - Partitioner API reference +- `op-support.csv` - Operator support data in CSV format + +### Optional Pages (and Subsections) + +- `backend-troubleshooting.md` - Common issues and troubleshooting guide +- `backend-op-support.rst` - Operator support documentation (RST format) +- `backend-arch-internals.md` - Architecture and internals documentation +- `tutorials/backend-tutorials.md` - Tutorial sub-section + - Use this sub-section to provide tutorials for your backend. + - Tutorials should explain how a user can accomplish a task, in a step by step manner. + - Some examples might include: + - An end to end example of lowering and running a model on a specific platform. +- `tutorials/backend-guides.md` - Guides sub-section + - Use this sub-section to provide guides or how-tos for backend-specific functionality. + - Guides should focus on providing information and building conceptual understanding, rather than giving step by step directions. 
+ - Some examples might include: + - LLM attention management / static attention + - Performance optimization guide + +## Using the Template + +To use this template for a new backend: + +1. Copy the entire `template` directory contents to your backend's documentation directory +2. Rename files to match your backend name (e.g., `backend-overview.md` → `mybackend-overview.md`) +3. Populate the content for your backend. + +### Additional Customization + +You may need to: +- Add backend-specific sections to any file +- Remove sections that don't apply to your backend +- Update the operator support CSV with your backend's supported operators +- Add backend-specific images or diagrams +- Update cross-references and links + +Try to keep the landing page (`backend-overview.md`) simple and straigtforward. Use the child pages and sections to provide more detailed information. diff --git a/docs/source/backends/template/backend-arch-internals.md b/docs/source/backends/template/backend-arch-internals.md new file mode 100644 index 00000000000..66c4a27eb4e --- /dev/null +++ b/docs/source/backends/template/backend-arch-internals.md @@ -0,0 +1,8 @@ +# {BACKEND_NAME} Architecture and Internals + +This page covers internal implementation details of the backend, and is mainly aimed at contributors and heavy power users. This is an optional page for each backend and has no set structure. + +Some topics to consider: + * High-level design of the backend + * Details on the lowering flow + * Internal debugging tools and techniques diff --git a/docs/source/backends/template/backend-op-support.rst b/docs/source/backends/template/backend-op-support.rst new file mode 100644 index 00000000000..0e8f7784a5e --- /dev/null +++ b/docs/source/backends/template/backend-op-support.rst @@ -0,0 +1,13 @@ +================ +Operator Support +================ + +This page lists the operators supported by the {BACKEND_NAME} backend. Operators are the building blocks of the ML model. See `IRs `_ for more information on the PyTorch operator set. + +{OPERATOR_SUPPORT_NOTES} + +.. csv-table:: Operator Support + :file: op-support.csv + :header-rows: 1 + :widths: 20 15 30 30 + :align: center diff --git a/docs/source/backend-template.md b/docs/source/backends/template/backend-overview.md similarity index 62% rename from docs/source/backend-template.md rename to docs/source/backends/template/backend-overview.md index bf992c1ffab..e6b7e2e5d76 100644 --- a/docs/source/backend-template.md +++ b/docs/source/backends/template/backend-overview.md @@ -4,7 +4,7 @@ Provide a brief overview/description of the backend. At a high-level, what does ## Features -List high-level features of backend, such as general operator and hardware support. +List high-level features of backend, such as operator and hardware support. ## Target Requirements @@ -18,27 +18,32 @@ What software and hardware is needed to create a .PTE file targeting this backen This section describes the steps users need to take in order to generate a .PTE targeting this backend. Include a full code sample for exporting and lowering a model to this backend. Make sure relevant imports for the backend partitioner are included. -### Partitioner API +## Runtime Integration -What options, if any, does the partitioner take? Are there any other export-time configurations that can be applied? Document each option. +This section is intended to tell the user all of the steps they'll need to take to be able to run a .PTE file on-device that is targeting the given backend. 
+- What CMake targets should they link to? +- How is this backend compiled from source? +- Is the backend bundled by default in iOS and/or Android pre-built libraries? -### Quantization +## Reference -What quantization schemes does this backend support? Consider including the following, as appropriate. -- What operators are supported? -- Number of bits? -- Static vs dynamic activations? -- Weight only vs activations + weights? -- Symmetric vs asymmetric weights? -- Per-tensor, per-chanel, group/blockwise? +**→{doc}`backend-troubleshooting` — Debug common issues.** -If using a PT2E quantizer, document how to initialize the quantizer and all relevant configs and options. +**→{doc}`backend-partitioner` — Partitioner options.** -Include a code snippet demonstrating how to perform quantization for this backend. Document, or link to, a description of the parameters that the user can specify. +**→{doc}`backend-quantization` — Supported quantization schemes.** -## Runtime Integration +**→{doc}`backend-op-support` — Supported operators.** -This section is intended to tell the user all of the steps they'll need to take to be able to run a .PTE file on-device that is targeting the given backend. -- What CMake targets should they link to? -- How is this backend compiled from source? -- Is the backend bundled by default in iOS and/or Android pre-built libraries? +**→{doc}`backend-arch-internals` — Backend internals.** + +```{toctree} +:maxdepth: 2 +:hidden: +:caption: {BACKEND} Backend + +backend-troubleshooting +backend-partitioner +backend-quantization +backend-op-support +backend-arch-internals diff --git a/docs/source/backends/template/backend-partitioner.md b/docs/source/backends/template/backend-partitioner.md new file mode 100644 index 00000000000..f5af70761c7 --- /dev/null +++ b/docs/source/backends/template/backend-partitioner.md @@ -0,0 +1,9 @@ +# {BACKEND_NAME} Partitioner API + +Document the partitioner API for the backend, including configuration options and compile specs. + + * `option1` - description of the option and values. + * `option2` + * ... + +... diff --git a/docs/source/backends/template/backend-quantization.md b/docs/source/backends/template/backend-quantization.md new file mode 100644 index 00000000000..4997a56e248 --- /dev/null +++ b/docs/source/backends/template/backend-quantization.md @@ -0,0 +1,31 @@ +# {BACKEND_NAME} Quantization + +Document quantization schemes and flows for the backend. This should include a description of each scheme and a code example to perform quantization. Example sections for PT2E and quantize_ are included below, to be replaced with details for the target backend. + +For each supported quantization scheme, include the following: + * What is the quantization scheme? + * How are weights quantized? + * How are activations quantized? Static or dynamic? + * How many bits? + * What is the granularity? Per-tensor, per-channel, group/block-wise? + * What are the steps to quantize a model with this scheme? + * Include a code sample. + * If the quantization flow only supports a small set of operators - for example, linear only - note this. 
+ +### Supported Quantization Schemes +The {BACKEND_NAME} delegate supports the following quantization schemes: + +- {QUANTIZATION_SCHEME_1} +- {QUANTIZATION_SCHEME_2} + +### {QUANTIZATION_METHOD_1} using the PT2E Flow + +[Description] + +[Code Sample] + +### LLM Quantization with quantize_ + +[Description] + +[Code Sample] diff --git a/docs/source/backends/template/backend-troubleshooting.md b/docs/source/backends/template/backend-troubleshooting.md new file mode 100644 index 00000000000..851c04f34ea --- /dev/null +++ b/docs/source/backends/template/backend-troubleshooting.md @@ -0,0 +1,15 @@ +# {BACKEND_NAME} Troubleshooting + +This page describes common issues that you may encounter when using the {BACKEND_NAME} backend and how to debug and resolve them. + +## {COMMON_ISSUE_1} + +{ISSUE_DESCRIPTION_1} + +{SOLUTION_STEPS_1} + +## {COMMON_ISSUE_2} + +{ISSUE_DESCRIPTION_2} + +{SOLUTION_STEPS_2} diff --git a/docs/source/backends/template/guides/backend-basic-guide.md b/docs/source/backends/template/guides/backend-basic-guide.md new file mode 100644 index 00000000000..44f86d8bd4d --- /dev/null +++ b/docs/source/backends/template/guides/backend-basic-guide.md @@ -0,0 +1,3 @@ +# Using {FEATURE} on {BACKEND_NAME} + +This is a placeholder guide. diff --git a/docs/source/backends/template/guides/backend-tutorials.md b/docs/source/backends/template/guides/backend-tutorials.md new file mode 100644 index 00000000000..15e226dd5c5 --- /dev/null +++ b/docs/source/backends/template/guides/backend-tutorials.md @@ -0,0 +1,10 @@ +# {BACKEND_NAME} Tutorials + +**→{doc}`{backend_name}-basic-tutorial` — Lower and run a model on the {BACKEND_NAME} backend.** + +```{toctree} +:hidden: +:maxdepth: 1 + +{backend_name}-basic-tutorial +``` diff --git a/docs/source/backends/template/op-support.csv b/docs/source/backends/template/op-support.csv new file mode 100644 index 00000000000..66af56d6a44 --- /dev/null +++ b/docs/source/backends/template/op-support.csv @@ -0,0 +1,6 @@ +Operator,Compute DType,Quantization,Constraints +{OPERATOR_1},{DTYPE_SUPPORT_1},{QUANTIZATION_SUPPORT_1},{CONSTRAINTS_1} +{OPERATOR_2},{DTYPE_SUPPORT_2},{QUANTIZATION_SUPPORT_2},{CONSTRAINTS_2} +{OPERATOR_3},{DTYPE_SUPPORT_3},{QUANTIZATION_SUPPORT_3},{CONSTRAINTS_3} +{OPERATOR_4},{DTYPE_SUPPORT_4},{QUANTIZATION_SUPPORT_4},{CONSTRAINTS_4} +{OPERATOR_5},{DTYPE_SUPPORT_5},{QUANTIZATION_SUPPORT_5},{CONSTRAINTS_5} diff --git a/docs/source/backends/template/tutorials/backend-basic-tutorial.md b/docs/source/backends/template/tutorials/backend-basic-tutorial.md new file mode 100644 index 00000000000..23d76857116 --- /dev/null +++ b/docs/source/backends/template/tutorials/backend-basic-tutorial.md @@ -0,0 +1,91 @@ +# Preparing a Model for {BACKEND_NAME} + +This is a placeholder tutorial. + +## Step 1: Environment Setup + +This tutorial is intended to be run from a {SUPPORTED_HOST_OS} and uses Conda for Python environment management. For full setup details and system requirements, see [Getting Started with ExecuTorch](/getting-started). + +Create a Conda environment and install the ExecuTorch Python package. +```bash +conda create -y --name executorch python=3.12 +conda activate executorch +conda install executorch +``` + +{ADDITIONAL_SETUP_STEPS} + +## Step 2: Model Preparation + +Create a python file named `export_{model_filename}.py`. This script will be responsible for loading the {EXAMPLE_MODEL} model from {MODEL_SOURCE} and create a {BACKEND_NAME}-targeted .pte file. 
+ +```py +# export_{model_filename}.py +from executorch.backends.{backend_name}.partition.{backend_name}_partitioner import {BackendName}Partitioner +from executorch.exir import to_edge_transform_and_lower +import torch +import {MODEL_IMPORT} +``` + +### Model Instantiation and Example Inputs + +Instantiate the {EXAMPLE_MODEL} model from [{MODEL_SOURCE}]({MODEL_SOURCE_URL}). The export process also needs an example model input to trace the model. The model takes {MODEL_INPUT_DESCRIPTION}, so we'll create {INPUT_TUPLE_DESCRIPTION}. +```py +model = {MODEL_INSTANTIATION_CODE} +example_inputs = ({EXAMPLE_INPUTS},) +``` + +### Lower the Model + +Next, export and lower the model to ExecuTorch. Note that the `{BackendName}Partitioner` passed to the `partitioner` parameter tells ExecuTorch to target the {BACKEND_NAME} backend. +```py +exported_program = torch.export.export(model, example_inputs) + +executorch_program = to_edge_transform_and_lower( + exported_program, + partitioner=[{BackendName}Partitioner()], +).to_executorch() + +executorch_program.save("{model_filename}_{backend_name}.pte") +``` + +### Run the Script + +Save the above script to export_{model_filename}.py and run the script. You should see a file named `{model_filename}_{backend_name}.pte` in the current directory. +```bash +python export_{model_filename}.py +``` + +## Step 3: Running the Model + +The .pte file created in the previous step can be run on a variety of devices, including {SUPPORTED_PLATFORMS}. ExecuTorch provides runtime APIs and language bindings for a variety of platforms. This tutorial will demonstrate running the model on a desktop using the Python runtime. + +### Smoke Test + +First, we'll verify that the model loads and runs correctly by running the model with {TEST_INPUT_DESCRIPTION}. Create a new script, named `run_{model_filename}.py`, and add the following code. +```py +# run_{model_filename}.py + +from executorch.runtime import Runtime +import torch + +runtime = Runtime.get() + +input_tensor = {TEST_INPUT_TENSOR} +program = runtime.load_program("{model_filename}_{backend_name}.pte") +method = program.load_method("forward") +outputs = method.execute([input_tensor])[0] + +print(outputs) +``` + +When running the script with `python run_{model_filename}.py`, you should see {EXPECTED_OUTPUT_DESCRIPTION} printed to the console. +``` +{EXPECTED_OUTPUT_EXAMPLE} +``` + +# Next Steps + + - See [Edge Platforms](/edge-platforms-section) to deploy the .pte file on {SUPPORTED_PLATFORMS}. + - See [Model Export and Lowering](/using-executorch-export) for more information on model preparation. + - See [{BACKEND_NAME} Overview](/backends/{backend_name}/{backend_name}-overview) for more information about the {BACKEND_NAME} backend. 
diff --git a/docs/source/backends/template/tutorials/backend-tutorials.md b/docs/source/backends/template/tutorials/backend-tutorials.md new file mode 100644 index 00000000000..15e226dd5c5 --- /dev/null +++ b/docs/source/backends/template/tutorials/backend-tutorials.md @@ -0,0 +1,10 @@ +# {BACKEND_NAME} Tutorials + +**→{doc}`{backend_name}-basic-tutorial` — Lower and run a model on the {BACKEND_NAME} backend.** + +```{toctree} +:hidden: +:maxdepth: 1 + +{backend_name}-basic-tutorial +``` diff --git a/docs/source/backends/xnnpack/op-support.csv b/docs/source/backends/xnnpack/op-support.csv new file mode 100644 index 00000000000..5350fed8d12 --- /dev/null +++ b/docs/source/backends/xnnpack/op-support.csv @@ -0,0 +1,47 @@ +Operator,Compute DType,Quantization,Constraints +_to_dim_order_copy,"fp16, fp32",,no dtype conversion +abs,"fp16, fp32",, +add,"fp16, fp32",PT2E: static int8,alpha=1 +avg_pool2d,"fp16, fp32",PT2E: static int8,"ceil_mode=False, count_include_pad=False, divisor_override=pooling_region" +bmm,"fp16, fp32",, +cat,"fp16, fp32",PT2E: static int8, +ceil,"fp16, fp32",, +clamp,"fp16, fp32",, +constant_pad_nd,"fp16, fp32",,no negative padding values +conv1d,"fp16, fp32","PT2E: static or dynamic int8 activations +8-bit weights, symmetric per-tensor or per-channel",constant weights +conv2d,"fp16, fp32","PT2E: static or dynamic int8 activations +8-bit weights, symmetric per-tensor or per-channel",constant weights +dequantize_per_tensor,"fp16, fp32",, +div,"fp16, fp32",, +elu,"fp16, fp32",, +exp,"fp16, fp32",, +floor,"fp16, fp32",, +gelu,"fp16, fp32",, +hardswish,"fp16, fp32",, +hardtanh,"fp16, fp32",, +leaky_relu,"fp16, fp32",, +linear,"fp16, fp32","PT2E: static or dynamic int8 activations +8-bit weights, symmetric per-tensor or per-channel + +quantize\_: 8-bit dynamic activations +4-bit groupwise weights",constant weights +log,"fp16, fp32",, +max_pool2d,"fp16, fp32",,"stride ≤ kernel_size, ceil_mode only for static shapes" +maximum,"fp16, fp32",, +mean,"fp16, fp32",,"4D tensors only; dims=[-2,-1] or [-1,-2]" +minimum,"fp16, fp32",, +mul,"fp16, fp32",PT2E: static int8, +neg,"fp16, fp32",, +permute_copy,"fp16, fp32",, +pow,"fp16, fp32",,power=2 only +quantize_per_tensor,"fp16, fp32",, +relu,"fp16, fp32",, +rsqrt,"fp16, fp32",, +sigmoid,"fp16, fp32",, +slice_copy,"fp16, fp32",,"no zero-dim tensors, no dynamic shapes" +softmax,"fp16, fp32",,dim must be last dimension +sqrt,"fp16, fp32",, +sub,"fp16, fp32",,alpha=1 +tanh,"fp16, fp32",, +upsample_bilinear2d,"fp16, fp32",,no dynamic output sizes diff --git a/docs/source/backend-delegates-xnnpack-reference.md b/docs/source/backends/xnnpack/xnnpack-arch-internals.md similarity index 97% rename from docs/source/backend-delegates-xnnpack-reference.md rename to docs/source/backends/xnnpack/xnnpack-arch-internals.md index 8b4338e703c..405a460df38 100644 --- a/docs/source/backend-delegates-xnnpack-reference.md +++ b/docs/source/backends/xnnpack/xnnpack-arch-internals.md @@ -1,4 +1,4 @@ -# XNNPACK Delegate Internals +# Architecture and Internals This is a high-level overview of the ExecuTorch XNNPACK backend delegate. This high performance delegate is aimed to reduce CPU inference latency for ExecuTorch models. We will provide a brief introduction to the XNNPACK library and explore the delegate’s overall architecture and intended use cases. @@ -9,12 +9,12 @@ XNNPACK is a library of highly-optimized neural network operators for ARM, x86, A delegate is an entry point for backends to process and execute parts of the ExecuTorch program. 
Delegated portions of ExecuTorch models hand off execution to backends. The XNNPACK backend delegate is one of many available in ExecuTorch. It leverages the XNNPACK third-party library to accelerate ExecuTorch programs efficiently across a variety of CPUs. More detailed information on the delegates and developing your own delegates is available [here](compiler-delegate-and-partitioner.md). It is recommended that you get familiar with that content before continuing on to the Architecture section. ## Architecture -![High Level XNNPACK delegate Architecture](xnnpack-delegate-architecture.png) +![High Level XNNPACK delegate Architecture](/backends/xnnpack/xnnpack-delegate-architecture.png) ### Ahead-of-time In the ExecuTorch export flow, lowering to the XNNPACK delegate happens at the `to_backend()` stage. In this stage, the model is partitioned by the `XnnpackPartitioner`. Partitioned sections of the graph are converted to a XNNPACK specific graph represenationed and then serialized via flatbuffer. The serialized flatbuffer is then ready to be deserialized and executed by the XNNPACK backend at runtime. -![ExecuTorch XNNPACK delegate Export Flow](xnnpack-et-flow-diagram.png) +![ExecuTorch XNNPACK delegate Export Flow](/backends/xnnpack/xnnpack-et-flow-diagram.png) #### Partitioner The partitioner is implemented by backend delegates to mark nodes suitable for lowering. The `XnnpackPartitioner` lowers using node targets and module metadata. Some more references for partitioners can be found [here](compiler-delegate-and-partitioner.md) diff --git a/docs/source/xnnpack-delegate-architecture.png b/docs/source/backends/xnnpack/xnnpack-delegate-architecture.png similarity index 100% rename from docs/source/xnnpack-delegate-architecture.png rename to docs/source/backends/xnnpack/xnnpack-delegate-architecture.png diff --git a/docs/source/xnnpack-et-flow-diagram.png b/docs/source/backends/xnnpack/xnnpack-et-flow-diagram.png similarity index 100% rename from docs/source/xnnpack-et-flow-diagram.png rename to docs/source/backends/xnnpack/xnnpack-et-flow-diagram.png diff --git a/docs/source/backends/xnnpack/xnnpack-op-support.rst b/docs/source/backends/xnnpack/xnnpack-op-support.rst new file mode 100644 index 00000000000..bd460ca4171 --- /dev/null +++ b/docs/source/backends/xnnpack/xnnpack-op-support.rst @@ -0,0 +1,13 @@ +================ +Operator Support +================ + +This page lists the operators supported by the XNNPACK backend. Operators are the building blocks of the ML model. See `IRs `_ for more information on the PyTorch operator set. + +All operators support dynamic input shapes unless otherwise noted. + +.. csv-table:: Operator Support + :file: op-support.csv + :header-rows: 1 + :widths: 20 15 30 30 + :align: center diff --git a/docs/source/backends/xnnpack/xnnpack-overview.md b/docs/source/backends/xnnpack/xnnpack-overview.md new file mode 100644 index 00000000000..fa8db027e42 --- /dev/null +++ b/docs/source/backends/xnnpack/xnnpack-overview.md @@ -0,0 +1,102 @@ +# XNNPACK Backend + +The XNNPACK delegate is the ExecuTorch solution for CPU execution on mobile CPUs. [XNNPACK](https://github.com/google/XNNPACK/tree/master) is a library that provides optimized kernels for machine learning operators on Arm and x86 CPUs. + +## Features + +- Wide operator support on Arm and x86 CPUs, available on any modern mobile phone. +- Support for a wide variety of quantization schemes and quantized operators. +- Supports fp32 and fp16 activations. +- Supports 8-bit quantization. 
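+
+As a concrete sketch of the ahead-of-time flow described above, the snippet below exports a toy module, converts it to the Edge dialect, and lowers the partitioned subgraphs to XNNPACK at the `to_backend()` stage. The `torch.nn.Linear` module and the output file name are illustrative placeholders; see the XNNPACK overview page for the full export example.
+
+```python
+import torch
+from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
+from executorch.exir import to_edge
+
+# A toy eager module standing in for a real model.
+model = torch.nn.Linear(16, 8).eval()
+sample_inputs = (torch.randn(1, 16),)
+
+# Export, convert to the Edge dialect, then partition and lower the
+# supported subgraphs to the XNNPACK delegate at the to_backend() stage.
+exported = torch.export.export(model, sample_inputs)
+edge = to_edge(exported)
+edge = edge.to_backend(XnnpackPartitioner())
+
+# Serialize to an ExecuTorch program; delegated sections are stored as
+# serialized XNNPACK subgraphs inside the resulting .pte file.
+et_program = edge.to_executorch()
+with open("toy_xnnpack.pte", "wb") as f:
+    et_program.write_to_file(f)
+```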
+ +## Target Requirements + +- ARM64 on Android, iOS, macOS, Linux, and Windows. +- ARMv7 (with NEON) on Android. +- ARMv6 (with VFPv2) on Linux. +- x86 and x86-64 (up to AVX512) on Windows, Linux, Android. + +## Development Requirements + +The XNNPACK delegate does not introduce any development system requirements beyond those required by +the core ExecuTorch runtime. + +---- + +## Using the XNNPACK Backend + +To target the XNNPACK backend during the export and lowering process, pass an instance of the `XnnpackPartitioner` to `to_edge_transform_and_lower`. The example below demonstrates this process using the MobileNet V2 model from torchvision. + +```python +import torch +import torchvision.models as models +from torchvision.models.mobilenetv2 import MobileNet_V2_Weights +from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner +from executorch.exir import to_edge_transform_and_lower + +mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() +sample_inputs = (torch.randn(1, 3, 224, 224), ) + +et_program = to_edge_transform_and_lower( + torch.export.export(mobilenet_v2, sample_inputs), + partitioner=[XnnpackPartitioner()], +).to_executorch() + +with open("mv2_xnnpack.pte", "wb") as file: + et_program.write_to_file(file) +``` + +See [Partitioner API](/backends/xnnpack/xnnpack-partitioner) for a reference on available partitioner options. + +---- + +## Quantization + +The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. See [XNNPACK Quantization](/backends/xnnpack/xnnpack-quantization) for more information on available quantization schemes and APIs. + +---- + +## Runtime Integration + +To run the model on-device, use the standard ExecuTorch runtime APIs. + +The XNNPACK delegate is included by default in the published Android, iOS, and pip packages. When building from source, pass `-DEXECUTORCH_BUILD_XNNPACK=ON` when configuring the CMake build to compile the XNNPACK backend. See [Running on Device](/getting-started.md#running-on-device) for more information. + +To link against the backend, add the `executorch_backends` CMake target as a build dependency, or link directly against `libxnnpack_backend`. Due to the use of static registration, it may be necessary to link with whole-archive. This can typically be done by passing `"$"` to `target_link_libraries`. + +``` +# CMakeLists.txt +add_subdirectory("executorch") +... +target_link_libraries( + my_target + PRIVATE executorch + executorch_backends + ... +) +``` + +No additional steps are necessary to use the backend beyond linking the target. Any XNNPACK-delegated .pte file will automatically run on the registered backend. 
+ +## Reference + +**→{doc}`xnnpack-troubleshooting` — Debug common issues.** + +**→{doc}`xnnpack-partitioner` — Partitioner options.** + +**→{doc}`xnnpack-quantization` — Supported quantization schemes.** + +**→{doc}`xnnpack-op-support` — Supported operators.** + +**→{doc}`xnnpack-arch-internals` — XNNPACK backend internals.** + +```{toctree} +:maxdepth: 2 +:hidden: +:caption: XNNPACK Backend + +xnnpack-troubleshooting +xnnpack-partitioner +xnnpack-quantization +xnnpack-op-support +xnnpack-arch-internals diff --git a/docs/source/backends/xnnpack/xnnpack-partitioner.md b/docs/source/backends/xnnpack/xnnpack-partitioner.md new file mode 100644 index 00000000000..c8c85ca628c --- /dev/null +++ b/docs/source/backends/xnnpack/xnnpack-partitioner.md @@ -0,0 +1,8 @@ +# Partitioner API + +The XNNPACK partitioner API allows for configuration of the model delegation to XNNPACK. Passing an `XnnpackPartitioner` instance with no additional parameters will run as much of the model as possible on the XNNPACK backend. This is the most common use-case. For advanced use cases, the partitioner exposes the following options via the [constructor](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/xnnpack_partitioner.py#L31): + + - `configs`: Control which operators are delegated to XNNPACK. By default, all available operators all delegated. See [../config/\_\_init\_\_.py](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/config/__init__.py#L66) for an exhaustive list of available operator configs. + - `config_precisions`: Filter operators by data type. By default, delegate all precisions. One or more of `ConfigPrecisionType.FP32`, `ConfigPrecisionType.STATIC_QUANT`, or `ConfigPrecisionType.DYNAMIC_QUANT`. See [ConfigPrecisionType](https://github.com/pytorch/executorch/blob/release/0.6/backends/xnnpack/partition/config/xnnpack_config.py#L24). + - `per_op_mode`: If true, emit individual delegate calls for every operator. This is an advanced option intended to reduce memory overhead in some contexts at the cost of a small amount of runtime overhead. Defaults to false. + - `verbose`: If true, print additional information during lowering. diff --git a/docs/source/backends/xnnpack/xnnpack-quantization.md b/docs/source/backends/xnnpack/xnnpack-quantization.md new file mode 100644 index 00000000000..e3a02d4bffc --- /dev/null +++ b/docs/source/backends/xnnpack/xnnpack-quantization.md @@ -0,0 +1,94 @@ +# Quantization + +The XNNPACK delegate can also be used as a backend to execute symmetrically quantized models. To quantize a PyTorch model for the XNNPACK backend, use the `XNNPACKQuantizer`. `Quantizers` are backend specific, which means the `XNNPACKQuantizer` is configured to quantize models to leverage the quantized operators offered by the XNNPACK Library. + +### Supported Quantization Schemes +The XNNPACK delegate supports the following quantization schemes: + +- 8-bit symmetric weights with 8-bit asymmetric activations (via the PT2E quantization flow). + - Supports both static and dynamic activations. + - Supports per-channel and per-tensor schemes. + - Supports linear, convolution, add, mul, cat, and adaptive avg pool 2d operators. + +Weight-only quantization is not currently supported on XNNPACK. + +### 8-bit Quantization using the PT2E Flow + +To perform 8-bit quantization with the PT2E flow, perform the following steps prior to exporting the model: + +1) Create an instance of the `XnnpackQuantizer` class. Set quantization parameters. 
+2) Use `torch.export.export` to prepare for quantization. +3) Call `prepare_pt2e` to prepare the model for quantization. +4) For static quantization, run the prepared model with representative samples to calibrate the quantizated tensor activation ranges. +5) Call `convert_pt2e` to quantize the model. +6) Export and lower the model using the standard flow. + +The output of `convert_pt2e` is a PyTorch model which can be exported and lowered using the normal flow. As it is a regular PyTorch model, it can also be used to evaluate the accuracy of the quantized model using standard PyTorch techniques. + +```python +import torch +import torchvision.models as models +from torchvision.models.mobilenetv2 import MobileNet_V2_Weights +from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import XNNPACKQuantizer, get_symmetric_quantization_config +from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner +from executorch.exir import to_edge_transform_and_lower +from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e + +model = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval() +sample_inputs = (torch.randn(1, 3, 224, 224), ) + +qparams = get_symmetric_quantization_config(is_per_channel=True) # (1) +quantizer = XNNPACKQuantizer() +quantizer.set_global(qparams) + +training_ep = torch.export.export(model, sample_inputs).module() # (2) +prepared_model = prepare_pt2e(training_ep, quantizer) # (3) + +for cal_sample in [torch.randn(1, 3, 224, 224)]: # Replace with representative model inputs + prepared_model(cal_sample) # (4) Calibrate + +quantized_model = convert_pt2e(prepared_model) # (5) + +et_program = to_edge_transform_and_lower( # (6) + torch.export.export(quantized_model, sample_inputs), + partitioner=[XnnpackPartitioner()], +).to_executorch() +``` + +See [PyTorch 2 Export Post Training Quantization](https://docs.pytorch.org/ao/main/tutorials_source/pt2e_quant_ptq.html) for more information. + +### LLM quantization with quantize_ + +The XNNPACK backend also supports quantizing models with the [torchao](https://github.com/pytorch/ao) quantize_ API. This is most commonly used for LLMs, requiring more advanced quantization. Since quantize_ is not backend aware, it is important to use a config that is compatible with CPU/XNNPACK: + +* Quantize embeedings with IntxWeightOnlyConfig (with weight_dtype torch.int2, torch.int4, or torch.int8, using PerGroup or PerAxis granularity) +* Quantize linear layers with Int8DynamicActivationIntxWeightConfig (with weight_dtype=torch.int4, using PerGroup or PerAxis granularity) + +Below is a simple example, but a more detailed tutorial including accuracy evaluation on popular LLM benchmarks can be found in the [torchao documentation](https://docs.pytorch.org/ao/main/serving.html#mobile-deployment-with-executorch). 
+ +```python +from torchao.quantization.granularity import PerGroup, PerAxis +from torchao.quantization.quant_api import ( + IntxWeightOnlyConfig, + Int8DynamicActivationIntxWeightConfig, + quantize_, +) + +# Quantize embeddings with 8-bits, per channel +embedding_config = IntxWeightOnlyConfig( + weight_dtype=torch.int8, + granularity=PerAxis(0), +) +qunatize_( + eager_model, + lambda m, fqn: isinstance(m, torch.nn.Embedding), +) + + +# Quatize linear layers with 8-bit dynamic activations and 4-bit weights +linear_config = Int8DynamicActivationIntxWeightConfig( + weight_dtype=torch.int4, + weight_granularity=PerGroup(32), +) +quantize_(eager_model, linear_config) +``` diff --git a/docs/source/backends/xnnpack/xnnpack-troubleshooting.md b/docs/source/backends/xnnpack/xnnpack-troubleshooting.md new file mode 100644 index 00000000000..d5459feb632 --- /dev/null +++ b/docs/source/backends/xnnpack/xnnpack-troubleshooting.md @@ -0,0 +1,25 @@ +# Troubleshooting + +This page describes common issues that you may encounter when using the XNNPACK backend and how to debug and resolve them. + +## XNNPACK Backend Not Found + +This error indicates the XNNPACK backend is not registered with the runtime. This can happen because the backend was not compiled or linked, or because the registration code was optimized out. + +The XNNPACK backend is built by default for Python, Android, iOS, and in most CMake presets. + +* Set the `EXECUTORCH_BUILD_XNNPACK=ON` CMake option option when building from source. + * Either by passing the option during CMake configuration or setting it inside the user CMake logic before including ExecuTorch. + * See [Building from Source](using-executorch-building-from-source). +* On iOS, link the `backend_xnnpack` [framework](/using-executorch-ios). +* If the backend is still not found, link with `WHOLE_ARCHIVE`. + * Pass `"LINK_LIBRARY:WHOLE_ARCHIVE,xnnpack_backend>"` to `target_link_libraries` in CMake. + +## Slow Performance + + * Try reducing the thread count using [_unsafe_reset_threadpool](/using-executorch-faqs#inference-is-slow-performance-troubleshooting). + * Small models may benefit from using fewer threads than default. + * Try values between 1 and 4 threads and measure performance on your model. + * Use [op-level profiling](tutorials/devtools-integration-tutorial) to understand which operators are taking the most time. + * The XNNPACK backend provides operator-level timing for delegated operators. + * See general performance troubleshooting tips in [Performance Troubleshooting](/using-executorch-faqs#inference-is-slow-performance-troubleshooting).