diff --git a/backends/vulkan/README.md b/backends/vulkan/README.md index 63a9b0b049a..b51a736c7df 100644 --- a/backends/vulkan/README.md +++ b/backends/vulkan/README.md @@ -1,205 +1,4 @@ -# Vulkan Backend +# The ExecuTorch Vulkan Backend -The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is -built on top of the cross-platform Vulkan GPU API standard. It is primarily -designed to leverage the GPU to accelerate model inference on Android devices, -but can be used on any platform that supports an implementation of Vulkan: -laptops, servers, and edge devices. - -::::{note} -The Vulkan delegate is currently under active development, and its components -are subject to change. -:::: - -## What is Vulkan? - -Vulkan is a low-level GPU API specification developed as a successor to OpenGL. -It is designed to offer developers more explicit control over GPUs compared to -previous specifications in order to reduce overhead and maximize the -capabilities of the modern graphics hardware. - -Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both -desktop and mobile) in the market support Vulkan. Vulkan is also included in -Android from Android 7.0 onwards. - -**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it -provides a way to execute compute and graphics operations on a GPU, but does not -come with a built-in library of performant compute kernels. - -## The Vulkan Compute Library - -The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as -the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to -provide GPU implementations for PyTorch operators via GLSL compute shaders. - -The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html). -The core components of the PyTorch Vulkan backend were forked into ExecuTorch -and adapted for an AOT graph-mode style of model inference (as opposed to -PyTorch which adopted an eager execution style of model inference). - -The components of the Vulkan Compute Library are contained in the -`executorch/backends/vulkan/runtime/` directory. The core components are listed -and described below: - -``` -runtime/ -├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects -└── graph/ .................. ComputeGraph class which implements graph mode inference - └── ops/ ................ Base directory for operator implementations - ├── glsl/ ........... GLSL compute shaders - │ ├── *.glsl - │ └── conv2d.glsl - └── impl/ ........... C++ code to dispatch GPU compute shaders - ├── *.cpp - └── Conv2d.cpp -``` - -## Features - -The Vulkan delegate currently supports the following features: - -* **Memory Planning** - * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference. -* **Capability Based Partitioning**: - * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. 
operators) that are supported by the Vulkan delegate and lower only supported subgraphs -* **Support for upper-bound dynamic shapes**: - * Tensors can change shape between inferences as long as its current shape is smaller than the bounds specified during lowering - -In addition to increasing operator coverage, the following features are -currently in development: - -* **Quantization Support** - * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future. -* **Memory Layout Management** - * Memory layout is an important factor to optimizing performance. We plan to introduce graph passes to introduce memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication. -* **Selective Build** - * We plan to make it possible to control build size by selecting which operators/shaders you want to build with - -## End to End Example - -To further understand the features of the Vulkan Delegate and how to use it, -consider the following end to end example with a simple single operator model. - -### Compile and lower a model to the Vulkan Delegate - -Assuming ExecuTorch has been set up and installed, the following script can be -used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`. - -Once ExecuTorch has been set up and installed, the following script can be used -to generate a simple model and lower it to the Vulkan delegate. - -``` -# Note: this script is the same as the script from the "Setting up ExecuTorch" -# page, with one minor addition to lower to the Vulkan backend. -import torch -from torch.export import export -from executorch.exir import to_edge - -from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner - -# Start with a PyTorch model that adds two input tensors (matrices) -class Add(torch.nn.Module): - def __init__(self): - super(Add, self).__init__() - - def forward(self, x: torch.Tensor, y: torch.Tensor): - return x + y - -# 1. torch.export: Defines the program with the ATen operator set. -aten_dialect = export(Add(), (torch.ones(1), torch.ones(1))) - -# 2. to_edge: Make optimizations for Edge devices -edge_program = to_edge(aten_dialect) -# 2.1 Lower to the Vulkan backend -edge_program = edge_program.to_backend(VulkanPartitioner()) - -# 3. to_executorch: Convert the graph to an ExecuTorch program -executorch_program = edge_program.to_executorch() - -# 4. Save the compiled .pte program -with open("vk_add.pte", "wb") as file: - file.write(executorch_program.buffer) -``` - -Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate -using the `to_backend()` API. The Vulkan Delegate implements the -`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph -that are supported by the Vulkan delegate, and separates compatible sections of -the model to be executed on the GPU. - -This means the a model can be lowered to the Vulkan delegate even if it contains -some unsupported operators. This will just mean that only parts of the graph -will be executed on the GPU. - - -::::{note} -The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194) -Vulkan partitioner code can be inspected to examine which ops are currently -implemented in the Vulkan delegate. 
-:::: - -### Build Vulkan Delegate libraries - -The easiest way to build and test the Vulkan Delegate is to build for Android -and test on a local Android device. Android devices have built in support for -Vulkan, and the Android NDK ships with a GLSL compiler which is needed to -compile the Vulkan Compute Library's GLSL compute shaders. - -The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON` -when building with CMake. - -First, make sure that you have the Android NDK installed; any NDK version past -NDK r19c should work. Note that the examples in this doc have been validated with -NDK r28c. The Android SDK should also be installed so that you have access to `adb`. - -The instructions in this page assumes that the following environment variables -are set. - -```shell -export ANDROID_NDK= -# Select the appropriate Android ABI for your device -export ANDROID_ABI=arm64-v8a -# All subsequent commands should be performed from ExecuTorch repo root -cd -# Make sure adb works -adb --version -``` - -To build and install ExecuTorch libraries (for Android) with the Vulkan -Delegate: - -```shell -# From executorch root directory -(rm -rf cmake-android-out && \ - pp cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_VULKAN=ON \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out && \ - cmake --build cmake-android-out -j16 --target install) -``` - -### Run the Vulkan model on device - -::::{note} -Since operator support is currently limited, only binary arithmetic operators -will run on the GPU. Expect inference to be slow as the majority of operators -are being executed via Portable operators. -:::: - -Now, the partially delegated model can be executed (partially) on your device's -GPU! - -```shell -# Build a model runner binary linked with the Vulkan delegate libs -cmake --build cmake-android-out --target executor_runner -j32 - -# Push model to device -adb push vk_add.pte /data/local/tmp/vk_add.pte -# Push binary to device -adb push cmake-android-out/executor_runner /data/local/tmp/runner_bin - -# Run the model -adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte -``` +Please see the [Vulkan Backend Overview](../../docs/source/backends/vulkan/vulkan-overview.md) +to learn more about the ExecuTorch Vulkan Backend. diff --git a/backends/vulkan/docs/android_demo.md b/backends/vulkan/docs/android_demo.md deleted file mode 100644 index ff84938b06f..00000000000 --- a/backends/vulkan/docs/android_demo.md +++ /dev/null @@ -1,128 +0,0 @@ -# Building and Running ExecuTorch with the Vulkan Backend - -The [ExecuTorch Vulkan Delegate](../../../docs/source/native-delegates-executorch-vulkan-delegate.md) -is a native GPU delegate for ExecuTorch. 
- - -::::{grid} 2 -:::{grid-item-card} What you will learn in this tutorial: -:class-card: card-content -* How to export the Llama3.2-1B parameter model with partial GPU delegation -* How to execute the partially delegated model on Android -::: -:::{grid-item-card} Prerequisites: -:class-card: card-prerequisites -* Follow [**Setting up ExecuTorch**](../../../docs/source/getting-started-setup.rst) -* It is also recommended that you read through [**ExecuTorch Vulkan Delegate**](../../../docs/source/native-delegates-executorch-vulkan-delegate.md) and follow the example in that page -::: -:::: - -## Prerequisites - -Note that all the steps below should be performed from the ExecuTorch repository -root directory, and assumes that you have gone through the steps of setting up -ExecuTorch. - -It is also assumed that the Android NDK and Android SDK is installed, and the -following environment examples are set. - -```shell -export ANDROID_NDK= -# Select an appropriate Android ABI for your device -export ANDROID_ABI=arm64-v8a -# All subsequent commands should be performed from ExecuTorch repo root -cd -# Make sure adb works -adb --version -``` - -## Lowering the Llama3.2-1B model to Vulkan - -::::{note} -The resultant model will only be partially delegated to the Vulkan backend. In -particular, only binary arithmetic operators (`aten.add`, `aten.sub`, -`aten.mul`, `aten.div`), matrix multiplication operators (`aten.mm`, `aten.bmm`), -and linear layers (`aten.linear`) will be executed on the GPU via the Vulkan -delegate. The rest of the model will be executed using Portable operators. - -Operator support for LLaMA models is currently in active development; please -check out the `main` branch of the ExecuTorch repo for the latest capabilities. -:::: - -First, obtain the `consolidated.00.pth`, `params.json` and `tokenizer.model` -files for the `Llama3.2-1B` model from the [Llama website](https://www.llama.com/llama-downloads/). - -Once the files have been downloaded, the `export_llama` script can be used to -partially lower the Llama model to Vulkan. - -```shell -# The files will usually be downloaded to ~/.llama -python -m examples.models.llama.export_llama \ - --disable_dynamic_shape --vulkan -kv --use_sdpa_with_kv_cache -d fp32 \ - --model "llama3_2" \ - -c ~/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \ - -p ~/.llama/checkpoints/Llama3.2-1B/params.json \ - --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' -``` - -A `vulkan_llama2.pte` file should have been created as a result of running the -script. - -Push the tokenizer binary and `vulkan_llama2.pte` onto your Android device: - -```shell -adb push ~/.llama/tokenizer.model /data/local/tmp/ -adb push vulkan_llama2.pte /data/local/tmp/ -``` - -## Build and Run the LLaMA runner binary on Android - -First, build and install ExecuTorch libraries, then build the LLaMA runner -binary using the Android NDK toolchain. - -```shell -./install_executorch.sh --clean -(mkdir cmake-android-out && \ - cmake . 
-DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ - -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ - -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \ - -DEXECUTORCH_BUILD_VULKAN=ON \ - -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ - -DEXECUTORCH_BUILD_KERNELS_LLM=ON \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out && \ - cmake --build cmake-android-out -j16 --target install) - -# Build LLaMA Runner library -(rm -rf cmake-android-out/examples/models/llama && \ - cmake examples/models/llama \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ - -DEXECUTORCH_BUILD_KERNELS_LLM=ON \ - -DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out/examples/models/llama && \ - cmake --build cmake-android-out/examples/models/llama -j16) -``` - -Finally, push and run the llama runner binary on your Android device. Note that -your device must have sufficient GPU memory to execute the model. - -```shell -adb push cmake-android-out/examples/models/llama/llama_main /data/local/tmp/llama_main - -adb shell /data/local/tmp/llama_main \ - --model_path=/data/local/tmp/vulkan_llama2.pte \ - --tokenizer_path=/data/local/tmp/tokenizer.model \ - --prompt "Hello" -``` - -Note that currently model inference will be very slow due to the high amount of -delegate blobs in the lowered graph, which requires a transfer to and from the -GPU for each sub graph. Performance is expected to improve drastically as more -of the model can be lowered to the Vulkan delegate, and techniques such as -quantization are supported. diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md index dfeb6243d37..da2febced3a 100644 --- a/docs/source/backends-overview.md +++ b/docs/source/backends-overview.md @@ -23,7 +23,7 @@ Backends are the bridge between your exported model and the hardware it runs on. | [XNNPACK](backends-xnnpack) | All | CPU | General-purpose, fallback | | [Core ML](/backends/coreml/coreml-overview.md) | iOS, macOS | NPU/GPU/CPU | Apple devices, high performance | | [Metal Performance Shaders](/backends/mps/mps-overview.md) | iOS, macOS | GPU | Apple GPU acceleration | -| [Vulkan ](backends-vulkan) | Android | GPU | Android GPU acceleration | +| [Vulkan ](/backends/vulkan/vulkan-overview.md) | Android | GPU | Android GPU acceleration | | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs | | [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs | | [ARM EthosU](backends-arm-ethos-u) | Embedded | NPU | ARM MCUs | @@ -31,7 +31,7 @@ Backends are the bridge between your exported model and the hardware it runs on. | [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs | | [NXP](backends-nxp) | Embedded | NPU | NXP SoCs | | [Cadence](backends-cadence) | Embedded | DSP | DSP-optimized workloads | -| [Samsung Exynos](backends-samsung-exynos) | Android | NPU | Samsung SoCs | +| [Samsung Exynos](/backends/samsung/samsung-overview.md) | Android | NPU | Samsung SoCs | **Tip:** For best performance, export a `.pte` file for each backend you plan to support. @@ -53,7 +53,7 @@ Backends are the bridge between your exported model and the hardware it runs on. 
backends-xnnpack backends/coreml/coreml-overview backends/mps/mps-overview -backends-vulkan +backends/vulkan/vulkan-overview backends-qualcomm backends-mediatek backends-arm-ethos-u @@ -61,4 +61,4 @@ backends-arm-vgf build-run-openvino backends-nxp backends-cadence -backends-samsung-exynos +backends/samsung/samsung-overview diff --git a/docs/source/backends-vulkan.md b/docs/source/backends-vulkan.md deleted file mode 100644 index 531deece4e2..00000000000 --- a/docs/source/backends-vulkan.md +++ /dev/null @@ -1,205 +0,0 @@ -# Vulkan Backend - -The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is -built on top of the cross-platform Vulkan GPU API standard. It is primarily -designed to leverage the GPU to accelerate model inference on Android devices, -but can be used on any platform that supports an implementation of Vulkan: -laptops, servers, and edge devices. - -::::{note} -The Vulkan delegate is currently under active development, and its components -are subject to change. -:::: - -## What is Vulkan? - -Vulkan is a low-level GPU API specification developed as a successor to OpenGL. -It is designed to offer developers more explicit control over GPUs compared to -previous specifications in order to reduce overhead and maximize the -capabilities of the modern graphics hardware. - -Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both -desktop and mobile) in the market support Vulkan. Vulkan is also included in -Android from Android 7.0 onwards. - -**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it -provides a way to execute compute and graphics operations on a GPU, but does not -come with a built-in library of performant compute kernels. - -## The Vulkan Compute Library - -The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as -the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to -provide GPU implementations for PyTorch operators via GLSL compute shaders. - -The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html). -The core components of the PyTorch Vulkan backend were forked into ExecuTorch -and adapted for an AOT graph-mode style of model inference (as opposed to -PyTorch which adopted an eager execution style of model inference). - -The components of the Vulkan Compute Library are contained in the -`executorch/backends/vulkan/runtime/` directory. The core components are listed -and described below: - -``` -runtime/ -├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects -└── graph/ .................. ComputeGraph class which implements graph mode inference - └── ops/ ................ Base directory for operator implementations - ├── glsl/ ........... GLSL compute shaders - │ ├── *.glsl - │ └── conv2d.glsl - └── impl/ ........... C++ code to dispatch GPU compute shaders - ├── *.cpp - └── Conv2d.cpp -``` - -## Features - -The Vulkan delegate currently supports the following features: - -* **Memory Planning** - * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference. -* **Capability Based Partitioning**: - * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. 
operators) that are supported by the Vulkan delegate and lower only supported subgraphs -* **Support for upper-bound dynamic shapes**: - * Tensors can change shape between inferences as long as its current shape is smaller than the bounds specified during lowering - -In addition to increasing operator coverage, the following features are -currently in development: - -* **Quantization Support** - * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future. -* **Memory Layout Management** - * Memory layout is an important factor to optimizing performance. We plan to introduce graph passes to introduce memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication. -* **Selective Build** - * We plan to make it possible to control build size by selecting which operators/shaders you want to build with - -## End to End Example - -To further understand the features of the Vulkan Delegate and how to use it, -consider the following end to end example with a simple single operator model. - -### Compile and lower a model to the Vulkan Delegate - -Assuming ExecuTorch has been set up and installed, the following script can be -used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`. - -Once ExecuTorch has been set up and installed, the following script can be used -to generate a simple model and lower it to the Vulkan delegate. - -``` -# Note: this script is the same as the script from the "Setting up ExecuTorch" -# page, with one minor addition to lower to the Vulkan backend. -import torch -from torch.export import export -from executorch.exir import to_edge - -from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner - -# Start with a PyTorch model that adds two input tensors (matrices) -class Add(torch.nn.Module): - def __init__(self): - super(Add, self).__init__() - - def forward(self, x: torch.Tensor, y: torch.Tensor): - return x + y - -# 1. torch.export: Defines the program with the ATen operator set. -aten_dialect = export(Add(), (torch.ones(1), torch.ones(1))) - -# 2. to_edge: Make optimizations for Edge devices -edge_program = to_edge(aten_dialect) -# 2.1 Lower to the Vulkan backend -edge_program = edge_program.to_backend(VulkanPartitioner()) - -# 3. to_executorch: Convert the graph to an ExecuTorch program -executorch_program = edge_program.to_executorch() - -# 4. Save the compiled .pte program -with open("vk_add.pte", "wb") as file: - file.write(executorch_program.buffer) -``` - -Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate -using the `to_backend()` API. The Vulkan Delegate implements the -`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph -that are supported by the Vulkan delegate, and separates compatible sections of -the model to be executed on the GPU. - -This means the a model can be lowered to the Vulkan delegate even if it contains -some unsupported operators. This will just mean that only parts of the graph -will be executed on the GPU. - - -::::{note} -The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194) -Vulkan partitioner code can be inspected to examine which ops are currently -implemented in the Vulkan delegate. 
-:::: - -### Build Vulkan Delegate libraries - -The easiest way to build and test the Vulkan Delegate is to build for Android -and test on a local Android device. Android devices have built in support for -Vulkan, and the Android NDK ships with a GLSL compiler which is needed to -compile the Vulkan Compute Library's GLSL compute shaders. - -The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON` -when building with CMake. - -First, make sure that you have the Android NDK installed; any NDK version past -NDK r19c should work. Note that the examples in this doc have been validated with -NDK r28c. The Android SDK should also be installed so that you have access to `adb`. - -The instructions in this page assumes that the following environment variables -are set. - -```shell -export ANDROID_NDK= -# Select the appropriate Android ABI for your device -export ANDROID_ABI=arm64-v8a -# All subsequent commands should be performed from ExecuTorch repo root -cd -# Make sure adb works -adb --version -``` - -To build and install ExecuTorch libraries (for Android) with the Vulkan -Delegate: - -```shell -# From executorch root directory -(rm -rf cmake-android-out && \ - pp cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_VULKAN=ON \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out && \ - cmake --build cmake-android-out -j16 --target install) -``` - -### Run the Vulkan model on device - -::::{note} -Since operator support is currently limited, only binary arithmetic operators -will run on the GPU. Expect inference to be slow as the majority of operators -are being executed via Portable operators. -:::: - -Now, the partially delegated model can be executed (partially) on your device's -GPU! - -```shell -# Build a model runner binary linked with the Vulkan delegate libs -cmake --build cmake-android-out --target vulkan_executor_runner -j32 - -# Push model to device -adb push vk_add.pte /data/local/tmp/vk_add.pte -# Push binary to device -adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin - -# Run the model -adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte -``` diff --git a/docs/source/backends/samsung/samsung-overview.md b/docs/source/backends/samsung/samsung-overview.md index 464d4e322c7..8b0dea0c696 100644 --- a/docs/source/backends/samsung/samsung-overview.md +++ b/docs/source/backends/samsung/samsung-overview.md @@ -101,17 +101,17 @@ Exynos delegated .pte file will automatically run on the registered backend. 
## Reference -**→{doc}`exynos-partitioner` — Partitioner options.** +**→{doc}`samsung-partitioner` — Partitioner options.** -**→{doc}`exynos-quantization` — Supported quantization schemes.** +**→{doc}`samsung-quantization` — Supported quantization schemes.** -**→{doc}`exynos-op-support` — Supported operators.** +**→{doc}`samsung-op-support` — Supported operators.** ```{toctree} :maxdepth: 2 :hidden: :caption: Exynos Backend -exynos-partitioner -exynos-quantization -exynos-op-support +samsung-partitioner +samsung-quantization +samsung-op-support diff --git a/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md b/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md new file mode 100644 index 00000000000..cb14c72331e --- /dev/null +++ b/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md @@ -0,0 +1,159 @@ +# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device + +This tutorial assumes that you have a working local copy of the ExecuTorch repo, +and have gone through the steps to install the executorch pip package or have +installed it by building from source. + +This tutorial also assumes that you have the Android SDK tools installed and +that you are able to connect to an Android device via `adb`. + +Finally, the Android NDK should also be installed, and your environment should +have a variable `ANDROID_NDK` that points to the root directory of the NDK. + +```shell +export ANDROID_NDK= +``` + +## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer + +The model checkpoint and tokenizer can be downloaded from the +[Meta Llama website](https://www.llama.com/llama-downloads/). + +The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`. + +## Export the Llama 3.2 1B/3B model + +First, navigate to the root of the ExecuTorch repo. + +```shell +# Navigate to executorch root +cd ~/executorch +``` + +Then, set some environment variables to describe how the model should be +exported. Feel free to tune the values to your preferences. + +```shell +export LLM_NAME=Llama3.2 && \ +export LLM_SIZE=1B && \ +export LLM_SUFFIX="-Instruct" && \ +export QUANT=8da4w && \ +export BACKEND=vulkan && \ +export GROUP_SIZE=64 && \ +export CONTEXT_LENGTH=2048 +``` + +Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that +that `--vulkan-force-fp16` flag is set, which will improve model inference +latency at the cost of model accuracy. Feel free to remove this flag. + +```shell +python -m examples.models.llama.export_llama \ + -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \ + -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \ + -d fp32 --${BACKEND} \ + -qmode ${QUANT} -G ${GROUP_SIZE} \ + --max_seq_length ${CONTEXT_LENGTH} \ + --max_context_length ${CONTEXT_LENGTH} \ + -kv --use_sdpa_with_kv_cache \ + --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \ + --model "llama3_2" \ + --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte + +``` + +After exporting the model, push the exported `.pte` file and the tokenizer to +your device. 
+ +```shell +adb shell mkdir -p /data/local/tmp/llama && \ +adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \ + /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model && \ +adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \ + /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte +``` + +## Build Core Executorch Components + +To be able to run the `.pte` file on device, first the core libraries, +including the Vulkan backend, must be compiled for Android. + +```shell +cmake . \ + -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ + -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ + -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ + --preset "android-arm64-v8a" \ + -DANDROID_PLATFORM=android-28 \ + -DPYTHON_EXECUTABLE=python \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_PAL_DEFAULT=posix \ + -DEXECUTORCH_BUILD_LLAMA_JNI=ON \ + -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \ + -DEXECUTORCH_BUILD_VULKAN=ON \ + -DEXECUTORCH_BUILD_TESTS=OFF \ + -Bcmake-out-android-so && \ +cmake --build cmake-out-android-so -j16 --target install --config Release +``` + +## Build and push the llama runner binary to Android + +Then, build a binary that can be used to run the `.pte` file. + +```shell +cmake examples/models/llama \ + -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ + -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ + -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ + -DEXECUTORCH_ENABLE_LOGGING=ON \ + -DANDROID_ABI=arm64-v8a \ + -DANDROID_PLATFORM=android-28 \ + -DCMAKE_BUILD_TYPE=Release \ + -DPYTHON_EXECUTABLE=python \ + -Bcmake-out-android-so/examples/models/llama && \ +cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release +``` + +Once the binary is built, it can be pushed to your Android device. + +```shell +adb shell mkdir /data/local/tmp/etvk/ && \ +adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/ +``` + +## Execute the llama runner binary + +Finally, we can execute the lowered `.pte` file on your device. 
+ +```shell +adb shell /data/local/tmp/etvk/llama_main \ + --model_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \ + --tokenizer_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model \ + --temperature=0 --seq_len=400 --warmup \ + --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\" +``` + +Here is some sample output captured from a Galaxy S24: + +```shell +E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I' +<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|> + +Here is a short poem I came up with: + +"Moonlight whispers secrets to the night +A gentle breeze that rustles the light +The stars up high, a twinkling show +A peaceful world, where dreams grow slow" + +I hope you enjoy it!<|eot_id|> + +PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000} + Prompt Tokens: 14 Generated Tokens: 54 + Model Load Time: 2.277000 (seconds) + Total inference time: 1.189000 (seconds) Rate: 45.416316 (tokens/second) + Prompt evaluation: 0.164000 (seconds) Rate: 85.365854 (tokens/second) + Generated 54 tokens: 1.025000 (seconds) Rate: 52.682927 (tokens/second) + Time to first generated token: 0.164000 (seconds) + Sampling time over 68 tokens: 0.019000 (seconds) +``` diff --git a/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md b/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md new file mode 100644 index 00000000000..07982d81c1c --- /dev/null +++ b/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md @@ -0,0 +1,144 @@ +# Executing and profiling an ExecuTorch Vulkan model on device + +This tutorial assumes that you have a working local copy of the ExecuTorch repo, +and have gone through the steps to install the executorch pip package or have +installed it by building from source. + +This tutorial also assumes that you have the Android SDK tools installed and +that you are able to connect to an Android device via `adb`. + +Finally, the Android NDK should also be installed, and your environment should +have a variable `ANDROID_NDK` that points to the root directory of the NDK. + +```shell +export ANDROID_NDK= +``` + +## Lower a model to ExecuTorch Vulkan and obtain the `.pte` file + + +The commands in this tutorial are assumed to be executed from ExecuTorch's root +directory. + +```shell +cd ~/executorch +``` + +For this tutorial, we will use the export script in +[`executorch/examples/vulkan/export.py`](https://github.com/pytorch/executorch/tree/main/examples/vulkan), +however any method of generating a `.pte` file will suffice. In this tutorial, +the InceptionV3 model is exported. + +```shell +python -m examples.vulkan.export --model_name=ic3 -o . -fp16 +``` + +After exporting, there should be a file called `ic3_vulkan.pte` in the root +directory of ExecuTorch. 
Feel free to modify the `-o` argument of the script to +control where the `.pte` file will be stored. + +Then, push the `.pte` file to device. + +```shell +adb shell mkdir -p /data/local/tmp/etvk/models/ && \ +adb push ic3_vulkan.pte /data/local/tmp/etvk/models/ic3_vulkan.pte +``` + +## Build the `executor_runner` binary and push to device + +To be able to run the `.pte` file on device, first the core libraries, +including the Vulkan backend, must be compiled for Android. Note that +`-DEXECUTORCH_ENABLE_EVENT_TRACER=ON` is used to turn on profiling, and +`-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON` is used to build the runner binary that +will be used to execute and profile the `.pte` file. + + +```shell +cmake . \ + -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ + -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ + -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ + --preset "android-arm64-v8a" \ + -DANDROID_PLATFORM=android-28 \ + -DPYTHON_EXECUTABLE=python \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_PAL_DEFAULT=posix \ + -DEXECUTORCH_BUILD_LLAMA_JNI=ON \ + -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \ + -DEXECUTORCH_BUILD_VULKAN=ON \ + -DEXECUTORCH_BUILD_TESTS=OFF \ + -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON \ + -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \ + -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \ + -Bcmake-out-android-so && \ +cmake --build cmake-out-android-so -j16 --target install --config Release +``` + +Once the build completes, we can push the runner binary to device. + +```shell +adb push cmake-out-android-so/executor_runner /data/local/tmp/etvk/executor_runner +``` + +## Execute the `.pte` file + +Finally, we can execute the lowered `.pte` file on your device. To test run the +model file without profiling: + +```shell +adb shell /data/local/tmp/etvk/executor_runner \ + --model_path /data/local/tmp/etvk/models/ic3_vulkan.pte +``` + +Now, with profiling: + +```shell +MODEL_NAME=ic3 && \ +BACKEND=vulkan && \ +NUM_ITERS=3 && \ +adb shell mkdir -p /data/local/tmp/etvk/etdumps/ && \ +adb shell /data/local/tmp/etvk/executor_runner \ + --model_path /data/local/tmp/etvk/models/${MODEL_NAME}_${BACKEND}.pte \ + --num_executions=${NUM_ITERS} \ + --etdump_path /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \ +adb pull /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp ${MODEL_NAME}_${BACKEND}.etdp && \ +adb shell rm /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \ +python devtools/inspector/inspector_cli.py \ + --etdump_path ${MODEL_NAME}_${BACKEND}.etdp +``` + +Here is some sample (tailed) output from a Samsung Galaxy S24: + +```shell +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 165 │ Execute │ conv2d_clamp_half_163 │ 0.345082 │ 0.346164 │ 0.346247 │ 0.345748 │ 0.344812 │ 0.346268 │ [] │ True │ │ [2081488974948084, 2081488995911052, 2081489016763676] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 166 │ Execute │ conv2d_clamp_half_164 │ 0.306124 │ 0.30654 │ 0.306998 │ 0.306557 │ 0.30602 │ 0.307112 │ [] │ True │ │ [2081488975294716, 2081488996256228, 2081489017110204] │ 
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 167 │ Execute │ set_zero_int32_165 │ 0.00240245 │ 0.00244403 │ 0.00248561 │ 0.00244403 │ 0.00239205 │ 0.002496 │ [] │ True │ │ [2081488975601100, 2081488996563132, 2081489017417680] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 168 │ Execute │ concat_2_texture3d_half_166 │ 0.0122305 │ 0.01248 │ 0.0125634 │ 0.0124108 │ 0.0121682 │ 0.0125842 │ [] │ True │ │ [2081488975603960, 2081488996565940, 2081489017420436] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 169 │ Execute │ set_zero_int32_167 │ 0.00157056 │ 0.00161195 │ 0.00161214 │ 0.00159478 │ 0.00156021 │ 0.00161219 │ [] │ True │ │ [2081488975616804, 2081488996578888, 2081489017432968] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 170 │ Execute │ concat_3_texture3d_half_168 │ 0.0420369 │ 0.0423281 │ 0.0427857 │ 0.0423974 │ 0.0419641 │ 0.0429001 │ [] │ True │ │ [2081488975618728, 2081488996580864, 2081489017434944] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 171 │ Execute │ update_concat_offset_3_int32_169 │ 0.00261035 │ 0.00265193 │ 0.00265212 │ 0.00263468 │ 0.00259995 │ 0.00265217 │ [] │ True │ │ [2081488975661992, 2081488996623556, 2081489017477272] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 172 │ Execute │ concat_1_texture3d_half_170 │ 0.00758157 │ 0.00774789 │ 0.00803914 │ 0.00779994 │ 0.00753999 │ 0.00811195 │ [] │ True │ │ [2081488975664956, 2081488996626572, 2081489017480288] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 173 │ Execute │ mean2d_half_171 │ 0.0147889 │ 0.0148721 │ 0.0150384 │ 0.0149067 │ 0.0147681 │ 0.01508 │ [] │ True │ │ [2081488975673432, 2081488996634476, 2081489017488400] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ 
+│ 174 │ Execute │ view_half_172 │ 0.00644803 │ 0.00644803 │ 0.00653119 │ 0.00648268 │ 0.00644803 │ 0.00655198 │ [] │ True │ │ [2081488975688876, 2081488996649712, 2081489017503532] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 175 │ Execute │ view_half_173 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ [] │ True │ │ [2081488975695688, 2081488996656524, 2081489017510448] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 176 │ Execute │ linear_naive_texture3d_half_174 │ 0.586726 │ 0.590096 │ 0.595338 │ 0.590876 │ 0.585884 │ 0.596648 │ [] │ True │ │ [2081488975700940, 2081488996661776, 2081489017515700] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 177 │ Execute │ image_to_nchw_texture3d_half_float_175 │ 0.00270395 │ 0.00270414 │ 0.00274572 │ 0.00272139 │ 0.00270391 │ 0.00275612 │ [] │ True │ │ [2081488976297952, 2081488997248024, 2081489018106160] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 178 │ Execute │ DELEGATE_CALL │ 20.8864 │ 20.9461 │ 21.5925 │ 21.1906 │ 20.8715 │ 21.7541 │ [] │ False │ │ [358395625, 380178646, 401147657] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 179 │ Execute │ Method::execute │ 20.8867 │ 20.9464 │ 21.593 │ 21.191 │ 20.8718 │ 21.7547 │ [] │ False │ │ [358395521, 380178542, 401147552] │ +╘═════╧════════════════════╧════════════════════════════════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧════════════╧═══════════════════╧═════════════════════════╧════════════════════════════════════════════════════════╛ +``` diff --git a/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md b/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md new file mode 100644 index 00000000000..953c93a9c12 --- /dev/null +++ b/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md @@ -0,0 +1,13 @@ +# Vulkan Backend Tutorials + +**→{doc}`etvk-profiling-tutorial`** + +**→{doc}`etvk-llama-tutorial`** + +```{toctree} +:maxdepth: 2 +:hidden: +:caption: Tutorials + +etvk-profiling-tutorial +etvk-llama-tutorial diff --git a/docs/source/backends/vulkan/vulkan-op-support.rst b/docs/source/backends/vulkan/vulkan-op-support.rst new file mode 100644 index 00000000000..623907cb504 --- /dev/null +++ b/docs/source/backends/vulkan/vulkan-op-support.rst @@ -0,0 +1,46 @@ +================ +Operator Support +================ + +This page lists the 
operators currently supported by the Vulkan backend. The
+source of truth for this information is `op_registry.py <https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py>`_,
+which is used by the Vulkan Partitioner to determine which operators should be
+lowered to the Vulkan backend and additionally describes the capabilities of
+each operator implementation.
+
+If an operator used in your model is not in this list, feel free to create a
+feature request on GitHub, and we will do our best to add an implementation for
+the operator.
+
+The namespace of an operator describes where it originates from:
+
+* **aten** - operators in this namespace correspond 1:1 to operators in PyTorch's
+  `ATen library `_.
+  They all support fp16 and fp32 dtypes at a minimum.
+* **dim_order_op** - these operators are inserted when lowering to ExecuTorch in
+  order to manage optimal tensor memory layouts. They are typically removed,
+  since the Vulkan backend manages optimal tensor representations internally.
+* **llama** - custom ops targeted for LLM inference. These are typically inserted
+  by model source transformations applied to a `nn.Module` and are not invoked
+  directly by a PyTorch model.
+* **operator** - these operators work with symbolic integers, which are also
+  supported by the Vulkan backend.
+* **quantized_decomposed** / **torchao** - these ops are introduced by quantization
+  workflows (either torchao's `quantize_` API or the PT2E quantization flow).
+  They typically represent quantizing/dequantizing a tensor, or choosing the
+  quantization parameters for a tensor. In practice, most instances of these
+  operators will be fused into a custom op in the **et_vk** namespace.
+* **et_vk** - these are custom operators implemented only in the Vulkan backend.
+  They typically represent quantized variants of **aten** operators, or fusions
+  of common operator patterns. They are inserted by operator fusion graph passes
+  when lowering to the Vulkan backend.
+
+All operators support dynamic input shapes unless otherwise noted (i.e. "no
+resize support"). The expectation is that over time, all operators will be able
+to support dynamic shapes.
+
+.. csv-table:: Operator Support
+   :file: vulkan-op-support-table.csv
+   :header-rows: 1
+   :widths: 25 25 75
+   :align: left
diff --git a/docs/source/backends/vulkan/vulkan-overview.md b/docs/source/backends/vulkan/vulkan-overview.md
new file mode 100644
index 00000000000..50c87cd047b
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-overview.md
@@ -0,0 +1,163 @@
+# Vulkan Backend
+
+The ExecuTorch Vulkan (ET-VK) backend enables ExecuTorch models to execute on
+GPUs via the cross-platform [Vulkan API](https://www.vulkan.org/). Although
+Vulkan API support is almost ubiquitous among modern GPUs, the ExecuTorch Vulkan
+backend is currently developed with a specific focus on **Android GPUs**.
+
+## Features
+
+- Wide operator support via an in-tree [GLSL compute shader library](https://github.com/pytorch/executorch/tree/main/backends/vulkan/runtime/graph/ops/glsl)
+- Support for models that require dynamic shapes
+- Support for FP32 and FP16 inference modes
+- Support for quantized linear layers with 8-bit/4-bit weights and 8-bit dynamically quantized activations
+- Support for quantized linear layers with 8-bit/4-bit weights and FP32/FP16 activations
+
+Note that the Vulkan backend is under active development, and its GLSL compute
+shader library is being consistently expanded over time. Additional support for
+quantized operators (e.g. quantized convolution) and additional quantization
+modes is on the way.
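+
+Since the backend currently targets Android GPUs, it can be useful to confirm
+that a connected device actually exposes Vulkan (see the target requirements
+below). One rough way to do this is to query the device's feature list with
+`adb`; the exact feature strings reported vary between devices:
+
+```sh
+# Devices with Vulkan support report features such as android.hardware.vulkan.version
+adb shell pm list features | grep -i vulkan
+```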
+ +## Target Requirements + +- Supports Vulkan 1.1 + +## Development Requirements + +To contribute to the Vulkan delegate, the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home#android) +must be installed on the development system. After installation, the `glslc` binary must +be found in your `PATH` in order to compile Vulkan shaders. This can be checked by +running + +```sh +glslc --version +``` + +If this is not the case after completing the Vulkan SDK installation, you may have to +go into `~/VulkanSDK//` and run + +```sh +source setup-env.sh +``` + +or alternatively, + +```sh +python install_vulkan.py +``` + +The [Android NDK](https://developer.android.com/ndk/downloads) must also be installed. +Any NDK version past NDK r17c should suffice. + +---- + +## Using the Vulkan Backend + +To lower a model to the Vulkan backend during the export and lowering process, +pass an instance of `VulkanPartitioner` to `to_edge_transform_and_lower`. The +example below demonstrates this process using the MobileNet V2 model from +torchvision. + +```python +import torch +import torchvision.models as models + +from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner +from executorch.exir import to_edge_transform_and_lower + +from torchvision.models.mobilenetv2 import MobileNet_V2_Weights + +mobilenet_v2 = models.mobilenetv2.mobilenet_v2( + weights=MobileNet_V2_Weights.DEFAULT +).eval() + +sample_inputs = (torch.randn(1, 3, 224, 224),) + +exported_program = torch.export.export(mobilenet_v2, sample_inputs) + +etvk_program = to_edge_transform_and_lower( + exported_program, + partitioner=[VulkanPartitioner()], +).to_executorch() + +with open("mv2_vulkan.pte", "wb") as file: + etvk_program.write_to_file(file) +``` + +See [Partitioner API](vulkan-partitioner.md) +for a reference on available partitioner options. + +---- + +## Quantization + +The Vulkan delegate currently supports execution of quantized linear layers. +See [Vulkan Quantization](vulkan-quantization.md) +for more information on available quantization schemes and APIs. + +---- + +## Runtime Integration + +To run the model on-device, use the standard ExecuTorch runtime APIs. + +For integration in Android applications, the Vulkan backend is included in the +[executorch-android-vulkan](https://mvnrepository.com/artifact/org.pytorch/executorch-android-vulkan) +package. + +When building from source, pass `-DEXECUTORCH_BUILD_VULKAN=ON` when configuring +the CMake build to compile the Vulkan backend. See [Running on Device](/getting-started.md#running-on-device) +for more information. + +To link against the backend, add the `executorch_backends` CMake target as a +build dependency, or link directly against `libvulkan_backend`. Due to the use +of static initialization to register available compute shaders and operators, +it is required to ensure that the library is linked with `--whole-archive`. + +```cmake +# CMakeLists.txt +find_package(executorch CONFIG REQUIRED COMPONENTS vulkan_backend executorch_backends) + +... +target_link_libraries( + my_target + PRIVATE + executorch + executorch_backends + ... +) + +# Ensure that unused code is not discarded. The required linker options may be +# different depending on the target platform. Typically, the +# executorch_target_link_options_shared_lib function from +# executorch/tools/cmake/Utils.cmake can be used to set the required linker +# options. 
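+#
+# For example (a sketch; assuming the helper has been included in your CMake
+# build):
+#   executorch_target_link_options_shared_lib(executorch_backends)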
+target_link_options(
+    executorch_backends INTERFACE "SHELL:LINKER:--whole-archive \
+    $ \
+    LINKER:--no-whole-archive"
+)
+```
+
+No additional steps are necessary to use the backend beyond linking the target.
+Any Vulkan-delegated .pte file will automatically run on the registered backend.
+
+## Additional Resources
+
+**→{doc}`vulkan-partitioner`**
+
+**→{doc}`vulkan-quantization`**
+
+**→{doc}`vulkan-troubleshooting`**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: Vulkan Backend
+
+vulkan-partitioner
+vulkan-quantization
+vulkan-op-support
+vulkan-troubleshooting
+
+tutorials/vulkan-tutorials
+```
diff --git a/docs/source/backends/vulkan/vulkan-partitioner.md b/docs/source/backends/vulkan/vulkan-partitioner.md
new file mode 100644
index 00000000000..566ec491b47
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-partitioner.md
@@ -0,0 +1,55 @@
+# Partitioner API
+
+[VulkanPartitioner](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/vulkan_partitioner.py)
+is a Python class that controls which operators in a model can or should be
+delegated to the Vulkan backend. It is the primary entrypoint to the Vulkan
+backend and is also used to configure the backend's behaviour.
+
+## Usage
+
+For most use cases, constructing `VulkanPartitioner()` with no arguments is
+sufficient. In this case, the partitioner will lower as much of the model to
+the Vulkan backend as possible.
+
+```python
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
+
+## Common Config Options
+
+Generally, the Vulkan backend is configured by passing a `compile_options`
+dictionary to `VulkanPartitioner()`, for example:
+
+```python
+compile_options = {
+    "require_dynamic_shapes": True,
+    "force_fp16": True,
+}
+
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner(compile_options)],
+).to_executorch()
+```
+
+### `require_dynamic_shapes`
+
+If a model is expected to use dynamic shapes, it is recommended to set the
+`"require_dynamic_shapes"` key in `compile_options` to `True`.
+
+Not all operators in Vulkan support dynamic shapes at the moment, although the
+majority do. This flag will prevent operators that don't support dynamic shapes
+from being lowered to Vulkan.
+
+### `force_fp16`
+
+This option causes the Vulkan backend to internally convert all FP32 tensors to
+FP16. This can improve inference latency and memory footprint at the cost of
+model accuracy.
+
+FP32 input tensors will automatically be converted to FP16 upon entering the
+Vulkan backend, and FP16 outputs will automatically be converted back to FP32
+as they are returned.
diff --git a/docs/source/backends/vulkan/vulkan-quantization.md b/docs/source/backends/vulkan/vulkan-quantization.md
new file mode 100644
index 00000000000..89c9f7514b0
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-quantization.md
@@ -0,0 +1,163 @@
+# Quantization
+
+The Vulkan backend currently supports execution of quantized linear layers,
+where weights are symmetrically quantized to 8-bit or 4-bit with
+per-output-channel or per-group quantization scales.
+
+Support for additional quantized operators and quantization schemes (e.g.
+statically and dynamically quantized convolution, and statically quantized
+linear layers) is under active development and will be added soon.
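+
+To make the weight scheme described above concrete, the sketch below shows what
+symmetric per-group quantization of a linear weight computes. This is an
+illustration only, not an ExecuTorch or torchao API; the actual weight packing
+and compute kernels are handled by torchao and the Vulkan backend.
+
+```python
+import torch
+
+
+def quantize_weight_per_group(weight: torch.Tensor, group_size: int = 32, bits: int = 4):
+    # weight: [out_channels, in_channels]; assumes in_channels is divisible by
+    # group_size. Each group of `group_size` input channels within an output
+    # channel shares a single scale.
+    out_ch, in_ch = weight.shape
+    grouped = weight.reshape(out_ch, in_ch // group_size, group_size)
+    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric, 127 for 8-bit
+    scales = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
+    quantized = torch.clamp(torch.round(grouped / scales), -qmax - 1, qmax).to(torch.int8)
+    return quantized.reshape(out_ch, in_ch), scales.squeeze(-1)
+```
+
+With `group_size` equal to the number of input channels, this reduces to the
+per-output-channel case described above.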
+
+### 4-bit quantization with torchao `quantize_`
+
+The `quantize_` API from [torchao](https://github.com/pytorch/ao) allows for
+more advanced quantization schemes, and is the quantization workflow needed to
+access 4-bit quantization. 4-bit quantization is commonly used for LLMs.
+
+Two options are available to execute linear layers with 4-bit quantization:
+
+1. Dynamically quantized activations via `Int8DynamicActivationIntxWeightConfig`
+2. Weight only quantization via `IntxWeightOnlyConfig`
+
+Dynamically quantized activations can provide a significant latency improvement
+over weight only quantization, since this scheme allows GPUs to leverage
+accelerated integer dot product instructions when computing matrix
+multiplications.
+
+Below is a simple example of quantizing a sequence of linear layers using
+the `quantize_` API.
+
+```python
+import torch
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+
+from executorch.exir import to_edge_transform_and_lower
+from torchao.quantization.granularity import PerGroup
+from torchao.quantization.quant_api import (
+    Int8DynamicActivationIntxWeightConfig,
+    IntxWeightOnlyConfig,
+    quantize_,
+)
+from torchao.utils import unwrap_tensor_subclass
+
+
+class LinearSequenceModule(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = torch.nn.Linear(128, 64, bias=False)
+        self.linear2 = torch.nn.Linear(64, 32, bias=False)
+        self.linear3 = torch.nn.Linear(32, 16, bias=False)
+
+    def forward(self, x):
+        x = self.linear1(x)
+        x = self.linear2(x)
+        x = self.linear3(x)
+        return x
+
+
+linear_sequence_module = LinearSequenceModule()
+
+M = 32
+sample_inputs = (torch.randn(M, 128),)
+
+group_size = 32
+
+q_config_8da4w = Int8DynamicActivationIntxWeightConfig(
+    weight_dtype=torch.int4, weight_granularity=PerGroup(group_size)
+)
+
+# Weight only 4-bit config, shown for reference; this example applies the
+# 8da4w config below
+q_config_4w = IntxWeightOnlyConfig(
+    weight_dtype=torch.int4, granularity=PerGroup(group_size)
+)
+
+quantize_(linear_sequence_module, q_config_8da4w)
+unwrap_tensor_subclass(linear_sequence_module)
+
+# Regular export path from here
+exported_program = torch.export.export(linear_sequence_module, sample_inputs)
+
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
+
+### 8-bit quantization with the PT2E flow
+
+For 8-bit quantized linear layers, currently the only quantization scheme
+supported is weight only quantization, with weights that are symmetrically
+quantized to 8 bits with per-output-channel quantization scales.
+
+To access this quantization mode, the PT2E quantization flow must be used. At a
+high level, the steps to quantize a model are:
+
+1) Create an instance of the `VulkanQuantizer` class and specify the desired quantization behaviour.
+2) Use `torch.export.export` to export the model for quantization.
+3) Call `prepare_pt2e` to prepare the exported graph for quantization.
+4) Execute the prepared model with representative samples to calibrate the quantized tensor activation ranges.
+5) Call `convert_pt2e` to quantize the model.
+6) Export and lower the model using the standard flow.
+
+For example:
+
+```python
+import torch
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+
+from executorch.backends.vulkan.quantizer.vulkan_quantizer import (
+    get_symmetric_quantization_config,
+    VulkanQuantizer,
+)
+
+from executorch.exir import to_edge_transform_and_lower
+
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+
+class LinearSequenceModule(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = torch.nn.Linear(128, 64, bias=False)
+        self.linear2 = torch.nn.Linear(64, 32, bias=False)
+        self.linear3 = torch.nn.Linear(32, 16, bias=False)
+
+    def forward(self, x):
+        x = self.linear1(x)
+        x = self.linear2(x)
+        x = self.linear3(x)
+        return x
+
+
+linear_sequence_module = LinearSequenceModule()
+
+M = 32
+# Create sample inputs
+sample_inputs = (torch.randn(M, 128),)
+
+# Set up the quantizer
+quantizer = VulkanQuantizer()
+quantizer.set_global(get_symmetric_quantization_config(is_dynamic=False, weight_bits=8))
+
+# Export the model
+exported_program = torch.export.export(linear_sequence_module, sample_inputs)
+graph_module = exported_program.module()
+
+# Quantize the exported program with the PT2E quantization flow
+quantized_module = prepare_pt2e(graph_module, quantizer)
+# Calibrate. In practice, this would be done by iterating over a real dataset
+quantized_module(*sample_inputs)
+quantized_module = convert_pt2e(quantized_module)
+
+# Export once more
+exported_program = torch.export.export(quantized_module, sample_inputs)
+
+# Lower to the Vulkan backend
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
diff --git a/docs/source/backends/vulkan/vulkan-troubleshooting.md b/docs/source/backends/vulkan/vulkan-troubleshooting.md
new file mode 100644
index 00000000000..9845f588004
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-troubleshooting.md
@@ -0,0 +1,57 @@
+# Troubleshooting
+
+This page describes common issues that you may encounter when using the Vulkan
+backend and how to debug and resolve them.
+
+## Vulkan Backend Not Found
+
+If you try to execute a .pte file that has been lowered to the Vulkan backend,
+you may see an error like:
+
+```shell
+E 00:00:00.366934 executorch:method.cpp:74] Backend VulkanBackend is not registered.
+```
+
+This error indicates the Vulkan backend is not registered with the runtime. This
+can happen because the backend was not compiled or linked, or because the
+registration code was optimized out.
+
+First, make sure that CMake is configured with `-DEXECUTORCH_BUILD_VULKAN=ON`
+when building ExecuTorch.
+
+Next, make sure that your application links either the `vulkan_backend` target
+or the `executorch_backends` target.
+
+Finally, ensure that `vulkan_backend` or `executorch_backends` is linked with
+the equivalent of `--whole-archive`, so that the backend registration code is
+not stripped by the linker.
+
+## Slow Performance
+
+Performance issues can be caused by a variety of factors:
+
+* A key compute shader (most often convolution or linear) is not performing well
+  on your target GPU
+* Unsupported operators are causing too many graph breaks
+* An existing operator lacks support for some memory layout or storage type,
+  resulting in a high number of copies being inserted to ensure tensors are in
+  the representation required by the next operator
+
+If you experience poor on-device performance for a particular model, please
+obtain some profiling data while running your model. The
+[profiling tutorial](./tutorials/etvk-profiling-tutorial.md) can
+be a good reference for how to do this.
+
+Then, please file an issue on GitHub with the following details:
+
+* The device(s) you have tested with, and which devices exhibit poor performance
+  running the model
+* The profiling data collected from executing the model
+* The release version of ExecuTorch you are using, or the commit hash you built
+  from if you built from source
+* If available, an export script that can be used to export your model to aid
+  in reproducing the issue
+* If available, the `.pte` file you are testing with to aid in reproducing the
+  issue
+
+We will do our best to patch performance problems in the Vulkan backend and
+help you resolve your issue.
diff --git a/examples/vulkan/README.md b/examples/vulkan/README.md
index 71fdd0e4183..7831809be69 100644
--- a/examples/vulkan/README.md
+++ b/examples/vulkan/README.md
@@ -1,80 +1,84 @@
-# Vulkan Delegate Export Examples
+# Example export script for the ExecuTorch Vulkan backend
 
-This directory contains scripts for exporting models with the Vulkan delegate in ExecuTorch. Vulkan delegation allows you to run your models on devices with Vulkan-capable GPUs, potentially providing significant performance improvements over CPU execution.
+This directory contains `export.py`, a utility script that can be used to export
+models registered in [`executorch/examples/models/__init__.py`](https://github.com/pytorch/executorch/blob/main/examples/models/__init__.py)
+to the Vulkan backend.
 
-## Scripts
+## Usage
 
-- `export.py`: Basic export script for models to use with Vulkan delegate
-- `aot_compiler.py`: Advanced export script with quantization support
+Note that all example commands are assumed to be executed from the executorch root.
 
-## Usage
+```shell
+cd ~/executorch
+```
 
 ### Basic Export
 
-```bash
-python -m executorch.examples.vulkan.export -m <model_name> -o <output_dir>
+For example, to export MobileNet V2:
+
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR}
 ```
 
-### Export with Quantization (Experimental)
+This will create a file named `mv2_vulkan.pte` in the specified output directory.
 
-```bash
-python -m executorch.examples.vulkan.aot_compiler -m <model_name> -q -o <output_dir>
-```
+### With dynamic shape support
 
-### Dynamic Shape Support
+To enable exporting with dynamic shapes, simply add the `-d` flag.
 
-```bash
-python -m executorch.examples.vulkan.export -m <model_name> -d -o <output_dir>
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d
 ```
 
-### Additional Options
+### Export a bundled pte
 
-- `-s/--strict`: Export with strict mode (default: True)
-- `-a/--segment_alignment`: Specify segment alignment in hex (default: 0x1000)
-- `-e/--external_constants`: Save constants in external .ptd file (default: False)
-- `-r/--etrecord`: Generate and save an ETRecord to the given file location
+Use the `-b` flag to export a bundled PTE file (i.e. a `.bpte` file). This is a
+`.pte` file with bundled test cases that can be used for correctness checking.
 
-## Examples
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d -b
+```
 
-```bash
-# Export MobileNetV2 with Vulkan delegate
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -o ./exported_models
+This will create a file called `mv2_vulkan.bpte` in the specified output directory.
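+
+Before relying on the script's built-in correctness check described in the next
+section, it can also be useful to sanity-check an exported `.pte` file by hand
+from Python. The snippet below is a minimal sketch, assuming the `executorch`
+Python package was built with the Vulkan backend enabled (see the requirements
+in the next section) and that the `executorch.runtime` Python API is available
+in your installed version; the file name `mv2_vulkan.pte` matches the export
+command shown above.
+
+```python
+import torch
+
+from executorch.runtime import Runtime
+
+# Load the Vulkan-delegated program produced by the export script.
+runtime = Runtime.get()
+program = runtime.load_program("mv2_vulkan.pte")
+method = program.load_method("forward")
+
+# MobileNet V2 expects a single 1x3x224x224 image tensor.
+sample_input = torch.randn(1, 3, 224, 224)
+outputs = method.execute((sample_input,))
+print(outputs[0].shape)
+```
+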
-# Export MobileNetV3 with quantization
-python -m executorch.examples.vulkan.aot_compiler -m mobilenet_v3 -q -o ./exported_models
+### With correctness testing
 
-# Export with dynamic shapes
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -d -o ./exported_models
+The script can also execute the exported and lowered model via pybindings to
+check output correctness before writing the output file.
 
-# Export with ETRecord for debugging
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -r ./records/mobilenet_record.etrecord -o ./exported_models
-```
+To enable this, ensure that your machine:
 
-## Supported Operations
+1. Has the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home#android) installed
+2. Has Vulkan drivers installed
 
-The Vulkan delegate supports various operations including:
+Additionally, you will need to install the `executorch` Python package from
+source, since the Vulkan backend is not included by default in the pip package.
 
-- Basic arithmetic (add, subtract, multiply, divide)
-- Activations (ReLU, Sigmoid, Tanh, etc.)
-- Convolutions (Conv1d, Conv2d, ConvTranspose2d)
-- Pooling operations (MaxPool2d, AvgPool2d)
-- Linear/Fully connected layers
-- BatchNorm, GroupNorm
-- Various tensor operations (cat, reshape, permute, etc.)
+```shell
+CMAKE_ARGS="-DEXECUTORCH_BUILD_VULKAN=ON " ./install_executorch.sh -e
+```
 
-For a complete list of supported operations, refer to the Vulkan delegate implementation in the ExecuTorch codebase.
+Once these conditions are fulfilled, the `--test` flag can be passed to the
+script.
 
-## Debugging and Optimization
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d --test
+```
 
-If you encounter issues with Vulkan delegation:
+You should see output like:
 
-1. Use `-r/--etrecord` to generate an ETRecord for debugging
-2. Check if your operations are supported by the Vulkan delegate
-3. Ensure your Vulkan drivers are up to date
-4. Try using the export script with `--strict False` if strict mode causes issues
+```shell
+INFO:root:✓ Model test PASSED - outputs match reference within tolerance
+```
 
-## Requirements
+### Quantization support
 
-- Vulkan runtime libraries (libvulkan.so.1)
-- A Vulkan-capable GPU with appropriate drivers
-- PyTorch with Vulkan support
+Support for quantization is under active development and will be added soon!