diff --git a/backends/vulkan/README.md b/backends/vulkan/README.md index 63a9b0b049a..b51a736c7df 100644 --- a/backends/vulkan/README.md +++ b/backends/vulkan/README.md @@ -1,205 +1,4 @@ -# Vulkan Backend +# The ExecuTorch Vulkan Backend -The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is -built on top of the cross-platform Vulkan GPU API standard. It is primarily -designed to leverage the GPU to accelerate model inference on Android devices, -but can be used on any platform that supports an implementation of Vulkan: -laptops, servers, and edge devices. - -::::{note} -The Vulkan delegate is currently under active development, and its components -are subject to change. -:::: - -## What is Vulkan? - -Vulkan is a low-level GPU API specification developed as a successor to OpenGL. -It is designed to offer developers more explicit control over GPUs compared to -previous specifications in order to reduce overhead and maximize the -capabilities of the modern graphics hardware. - -Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both -desktop and mobile) in the market support Vulkan. Vulkan is also included in -Android from Android 7.0 onwards. - -**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it -provides a way to execute compute and graphics operations on a GPU, but does not -come with a built-in library of performant compute kernels. - -## The Vulkan Compute Library - -The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as -the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to -provide GPU implementations for PyTorch operators via GLSL compute shaders. - -The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html). -The core components of the PyTorch Vulkan backend were forked into ExecuTorch -and adapted for an AOT graph-mode style of model inference (as opposed to -PyTorch which adopted an eager execution style of model inference). - -The components of the Vulkan Compute Library are contained in the -`executorch/backends/vulkan/runtime/` directory. The core components are listed -and described below: - -``` -runtime/ -├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects -└── graph/ .................. ComputeGraph class which implements graph mode inference - └── ops/ ................ Base directory for operator implementations - ├── glsl/ ........... GLSL compute shaders - │ ├── *.glsl - │ └── conv2d.glsl - └── impl/ ........... C++ code to dispatch GPU compute shaders - ├── *.cpp - └── Conv2d.cpp -``` - -## Features - -The Vulkan delegate currently supports the following features: - -* **Memory Planning** - * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference. -* **Capability Based Partitioning**: - * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. 
operators) that are supported by the Vulkan delegate and lower only supported subgraphs -* **Support for upper-bound dynamic shapes**: - * Tensors can change shape between inferences as long as its current shape is smaller than the bounds specified during lowering - -In addition to increasing operator coverage, the following features are -currently in development: - -* **Quantization Support** - * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future. -* **Memory Layout Management** - * Memory layout is an important factor to optimizing performance. We plan to introduce graph passes to introduce memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication. -* **Selective Build** - * We plan to make it possible to control build size by selecting which operators/shaders you want to build with - -## End to End Example - -To further understand the features of the Vulkan Delegate and how to use it, -consider the following end to end example with a simple single operator model. - -### Compile and lower a model to the Vulkan Delegate - -Assuming ExecuTorch has been set up and installed, the following script can be -used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`. - -Once ExecuTorch has been set up and installed, the following script can be used -to generate a simple model and lower it to the Vulkan delegate. - -``` -# Note: this script is the same as the script from the "Setting up ExecuTorch" -# page, with one minor addition to lower to the Vulkan backend. -import torch -from torch.export import export -from executorch.exir import to_edge - -from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner - -# Start with a PyTorch model that adds two input tensors (matrices) -class Add(torch.nn.Module): - def __init__(self): - super(Add, self).__init__() - - def forward(self, x: torch.Tensor, y: torch.Tensor): - return x + y - -# 1. torch.export: Defines the program with the ATen operator set. -aten_dialect = export(Add(), (torch.ones(1), torch.ones(1))) - -# 2. to_edge: Make optimizations for Edge devices -edge_program = to_edge(aten_dialect) -# 2.1 Lower to the Vulkan backend -edge_program = edge_program.to_backend(VulkanPartitioner()) - -# 3. to_executorch: Convert the graph to an ExecuTorch program -executorch_program = edge_program.to_executorch() - -# 4. Save the compiled .pte program -with open("vk_add.pte", "wb") as file: - file.write(executorch_program.buffer) -``` - -Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate -using the `to_backend()` API. The Vulkan Delegate implements the -`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph -that are supported by the Vulkan delegate, and separates compatible sections of -the model to be executed on the GPU. - -This means the a model can be lowered to the Vulkan delegate even if it contains -some unsupported operators. This will just mean that only parts of the graph -will be executed on the GPU. - - -::::{note} -The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194) -Vulkan partitioner code can be inspected to examine which ops are currently -implemented in the Vulkan delegate. 
-:::: - -### Build Vulkan Delegate libraries - -The easiest way to build and test the Vulkan Delegate is to build for Android -and test on a local Android device. Android devices have built in support for -Vulkan, and the Android NDK ships with a GLSL compiler which is needed to -compile the Vulkan Compute Library's GLSL compute shaders. - -The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON` -when building with CMake. - -First, make sure that you have the Android NDK installed; any NDK version past -NDK r19c should work. Note that the examples in this doc have been validated with -NDK r28c. The Android SDK should also be installed so that you have access to `adb`. - -The instructions in this page assumes that the following environment variables -are set. - -```shell -export ANDROID_NDK= -# Select the appropriate Android ABI for your device -export ANDROID_ABI=arm64-v8a -# All subsequent commands should be performed from ExecuTorch repo root -cd -# Make sure adb works -adb --version -``` - -To build and install ExecuTorch libraries (for Android) with the Vulkan -Delegate: - -```shell -# From executorch root directory -(rm -rf cmake-android-out && \ - pp cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_VULKAN=ON \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out && \ - cmake --build cmake-android-out -j16 --target install) -``` - -### Run the Vulkan model on device - -::::{note} -Since operator support is currently limited, only binary arithmetic operators -will run on the GPU. Expect inference to be slow as the majority of operators -are being executed via Portable operators. -:::: - -Now, the partially delegated model can be executed (partially) on your device's -GPU! - -```shell -# Build a model runner binary linked with the Vulkan delegate libs -cmake --build cmake-android-out --target executor_runner -j32 - -# Push model to device -adb push vk_add.pte /data/local/tmp/vk_add.pte -# Push binary to device -adb push cmake-android-out/executor_runner /data/local/tmp/runner_bin - -# Run the model -adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte -``` +Please see the [Vulkan Backend Overview](../../docs/source/backends/vulkan/vulkan-overview.md) +to learn more about the ExecuTorch Vulkan Backend. diff --git a/backends/vulkan/docs/android_demo.md b/backends/vulkan/docs/android_demo.md deleted file mode 100644 index ff84938b06f..00000000000 --- a/backends/vulkan/docs/android_demo.md +++ /dev/null @@ -1,128 +0,0 @@ -# Building and Running ExecuTorch with the Vulkan Backend - -The [ExecuTorch Vulkan Delegate](../../../docs/source/native-delegates-executorch-vulkan-delegate.md) -is a native GPU delegate for ExecuTorch. 
- - -::::{grid} 2 -:::{grid-item-card} What you will learn in this tutorial: -:class-card: card-content -* How to export the Llama3.2-1B parameter model with partial GPU delegation -* How to execute the partially delegated model on Android -::: -:::{grid-item-card} Prerequisites: -:class-card: card-prerequisites -* Follow [**Setting up ExecuTorch**](../../../docs/source/getting-started-setup.rst) -* It is also recommended that you read through [**ExecuTorch Vulkan Delegate**](../../../docs/source/native-delegates-executorch-vulkan-delegate.md) and follow the example in that page -::: -:::: - -## Prerequisites - -Note that all the steps below should be performed from the ExecuTorch repository -root directory, and assumes that you have gone through the steps of setting up -ExecuTorch. - -It is also assumed that the Android NDK and Android SDK is installed, and the -following environment examples are set. - -```shell -export ANDROID_NDK= -# Select an appropriate Android ABI for your device -export ANDROID_ABI=arm64-v8a -# All subsequent commands should be performed from ExecuTorch repo root -cd -# Make sure adb works -adb --version -``` - -## Lowering the Llama3.2-1B model to Vulkan - -::::{note} -The resultant model will only be partially delegated to the Vulkan backend. In -particular, only binary arithmetic operators (`aten.add`, `aten.sub`, -`aten.mul`, `aten.div`), matrix multiplication operators (`aten.mm`, `aten.bmm`), -and linear layers (`aten.linear`) will be executed on the GPU via the Vulkan -delegate. The rest of the model will be executed using Portable operators. - -Operator support for LLaMA models is currently in active development; please -check out the `main` branch of the ExecuTorch repo for the latest capabilities. -:::: - -First, obtain the `consolidated.00.pth`, `params.json` and `tokenizer.model` -files for the `Llama3.2-1B` model from the [Llama website](https://www.llama.com/llama-downloads/). - -Once the files have been downloaded, the `export_llama` script can be used to -partially lower the Llama model to Vulkan. - -```shell -# The files will usually be downloaded to ~/.llama -python -m examples.models.llama.export_llama \ - --disable_dynamic_shape --vulkan -kv --use_sdpa_with_kv_cache -d fp32 \ - --model "llama3_2" \ - -c ~/.llama/checkpoints/Llama3.2-1B/consolidated.00.pth \ - -p ~/.llama/checkpoints/Llama3.2-1B/params.json \ - --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' -``` - -A `vulkan_llama2.pte` file should have been created as a result of running the -script. - -Push the tokenizer binary and `vulkan_llama2.pte` onto your Android device: - -```shell -adb push ~/.llama/tokenizer.model /data/local/tmp/ -adb push vulkan_llama2.pte /data/local/tmp/ -``` - -## Build and Run the LLaMA runner binary on Android - -First, build and install ExecuTorch libraries, then build the LLaMA runner -binary using the Android NDK toolchain. - -```shell -./install_executorch.sh --clean -(mkdir cmake-android-out && \ - cmake . 
-DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \ - -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \ - -DEXECUTORCH_BUILD_EXTENSION_TENSOR=ON \ - -DEXECUTORCH_BUILD_VULKAN=ON \ - -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \ - -DEXECUTORCH_BUILD_KERNELS_LLM=ON \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out && \ - cmake --build cmake-android-out -j16 --target install) - -# Build LLaMA Runner library -(rm -rf cmake-android-out/examples/models/llama && \ - cmake examples/models/llama \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \ - -DEXECUTORCH_BUILD_KERNELS_LLM=ON \ - -DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out/examples/models/llama && \ - cmake --build cmake-android-out/examples/models/llama -j16) -``` - -Finally, push and run the llama runner binary on your Android device. Note that -your device must have sufficient GPU memory to execute the model. - -```shell -adb push cmake-android-out/examples/models/llama/llama_main /data/local/tmp/llama_main - -adb shell /data/local/tmp/llama_main \ - --model_path=/data/local/tmp/vulkan_llama2.pte \ - --tokenizer_path=/data/local/tmp/tokenizer.model \ - --prompt "Hello" -``` - -Note that currently model inference will be very slow due to the high amount of -delegate blobs in the lowered graph, which requires a transfer to and from the -GPU for each sub graph. Performance is expected to improve drastically as more -of the model can be lowered to the Vulkan delegate, and techniques such as -quantization are supported. diff --git a/docs/source/backends-overview.md b/docs/source/backends-overview.md index dfeb6243d37..da2febced3a 100644 --- a/docs/source/backends-overview.md +++ b/docs/source/backends-overview.md @@ -23,7 +23,7 @@ Backends are the bridge between your exported model and the hardware it runs on. | [XNNPACK](backends-xnnpack) | All | CPU | General-purpose, fallback | | [Core ML](/backends/coreml/coreml-overview.md) | iOS, macOS | NPU/GPU/CPU | Apple devices, high performance | | [Metal Performance Shaders](/backends/mps/mps-overview.md) | iOS, macOS | GPU | Apple GPU acceleration | -| [Vulkan ](backends-vulkan) | Android | GPU | Android GPU acceleration | +| [Vulkan ](/backends/vulkan/vulkan-overview.md) | Android | GPU | Android GPU acceleration | | [Qualcomm](backends-qualcomm) | Android | NPU | Qualcomm SoCs | | [MediaTek](backends-mediatek) | Android | NPU | MediaTek SoCs | | [ARM EthosU](backends-arm-ethos-u) | Embedded | NPU | ARM MCUs | @@ -31,7 +31,7 @@ Backends are the bridge between your exported model and the hardware it runs on. | [OpenVINO](build-run-openvino) | Embedded | CPU/GPU/NPU | Intel SoCs | | [NXP](backends-nxp) | Embedded | NPU | NXP SoCs | | [Cadence](backends-cadence) | Embedded | DSP | DSP-optimized workloads | -| [Samsung Exynos](backends-samsung-exynos) | Android | NPU | Samsung SoCs | +| [Samsung Exynos](/backends/samsung/samsung-overview.md) | Android | NPU | Samsung SoCs | **Tip:** For best performance, export a `.pte` file for each backend you plan to support. @@ -53,7 +53,7 @@ Backends are the bridge between your exported model and the hardware it runs on. 
backends-xnnpack backends/coreml/coreml-overview backends/mps/mps-overview -backends-vulkan +backends/vulkan/vulkan-overview backends-qualcomm backends-mediatek backends-arm-ethos-u @@ -61,4 +61,4 @@ backends-arm-vgf build-run-openvino backends-nxp backends-cadence -backends-samsung-exynos +backends/samsung/samsung-overview diff --git a/docs/source/backends-vulkan.md b/docs/source/backends-vulkan.md deleted file mode 100644 index 531deece4e2..00000000000 --- a/docs/source/backends-vulkan.md +++ /dev/null @@ -1,205 +0,0 @@ -# Vulkan Backend - -The ExecuTorch Vulkan delegate is a native GPU delegate for ExecuTorch that is -built on top of the cross-platform Vulkan GPU API standard. It is primarily -designed to leverage the GPU to accelerate model inference on Android devices, -but can be used on any platform that supports an implementation of Vulkan: -laptops, servers, and edge devices. - -::::{note} -The Vulkan delegate is currently under active development, and its components -are subject to change. -:::: - -## What is Vulkan? - -Vulkan is a low-level GPU API specification developed as a successor to OpenGL. -It is designed to offer developers more explicit control over GPUs compared to -previous specifications in order to reduce overhead and maximize the -capabilities of the modern graphics hardware. - -Vulkan has been widely adopted among GPU vendors, and most modern GPUs (both -desktop and mobile) in the market support Vulkan. Vulkan is also included in -Android from Android 7.0 onwards. - -**Note that Vulkan is a GPU API, not a GPU Math Library**. That is to say it -provides a way to execute compute and graphics operations on a GPU, but does not -come with a built-in library of performant compute kernels. - -## The Vulkan Compute Library - -The ExecuTorch Vulkan Delegate is a wrapper around a standalone runtime known as -the **Vulkan Compute Library**. The aim of the Vulkan Compute Library is to -provide GPU implementations for PyTorch operators via GLSL compute shaders. - -The Vulkan Compute Library is a fork/iteration of the [PyTorch Vulkan Backend](https://pytorch.org/tutorials/prototype/vulkan_workflow.html). -The core components of the PyTorch Vulkan backend were forked into ExecuTorch -and adapted for an AOT graph-mode style of model inference (as opposed to -PyTorch which adopted an eager execution style of model inference). - -The components of the Vulkan Compute Library are contained in the -`executorch/backends/vulkan/runtime/` directory. The core components are listed -and described below: - -``` -runtime/ -├── api/ .................... Wrapper API around Vulkan to manage Vulkan objects -└── graph/ .................. ComputeGraph class which implements graph mode inference - └── ops/ ................ Base directory for operator implementations - ├── glsl/ ........... GLSL compute shaders - │ ├── *.glsl - │ └── conv2d.glsl - └── impl/ ........... C++ code to dispatch GPU compute shaders - ├── *.cpp - └── Conv2d.cpp -``` - -## Features - -The Vulkan delegate currently supports the following features: - -* **Memory Planning** - * Intermediate tensors whose lifetimes do not overlap will share memory allocations. This reduces the peak memory usage of model inference. -* **Capability Based Partitioning**: - * A graph can be partially lowered to the Vulkan delegate via a partitioner, which will identify nodes (i.e. 
operators) that are supported by the Vulkan delegate and lower only supported subgraphs -* **Support for upper-bound dynamic shapes**: - * Tensors can change shape between inferences as long as its current shape is smaller than the bounds specified during lowering - -In addition to increasing operator coverage, the following features are -currently in development: - -* **Quantization Support** - * We are currently working on support for 8-bit dynamic quantization, with plans to extend to other quantization schemes in the future. -* **Memory Layout Management** - * Memory layout is an important factor to optimizing performance. We plan to introduce graph passes to introduce memory layout transitions throughout a graph to optimize memory-layout sensitive operators such as Convolution and Matrix Multiplication. -* **Selective Build** - * We plan to make it possible to control build size by selecting which operators/shaders you want to build with - -## End to End Example - -To further understand the features of the Vulkan Delegate and how to use it, -consider the following end to end example with a simple single operator model. - -### Compile and lower a model to the Vulkan Delegate - -Assuming ExecuTorch has been set up and installed, the following script can be -used to produce a lowered MobileNet V2 model as `vulkan_mobilenetv2.pte`. - -Once ExecuTorch has been set up and installed, the following script can be used -to generate a simple model and lower it to the Vulkan delegate. - -``` -# Note: this script is the same as the script from the "Setting up ExecuTorch" -# page, with one minor addition to lower to the Vulkan backend. -import torch -from torch.export import export -from executorch.exir import to_edge - -from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner - -# Start with a PyTorch model that adds two input tensors (matrices) -class Add(torch.nn.Module): - def __init__(self): - super(Add, self).__init__() - - def forward(self, x: torch.Tensor, y: torch.Tensor): - return x + y - -# 1. torch.export: Defines the program with the ATen operator set. -aten_dialect = export(Add(), (torch.ones(1), torch.ones(1))) - -# 2. to_edge: Make optimizations for Edge devices -edge_program = to_edge(aten_dialect) -# 2.1 Lower to the Vulkan backend -edge_program = edge_program.to_backend(VulkanPartitioner()) - -# 3. to_executorch: Convert the graph to an ExecuTorch program -executorch_program = edge_program.to_executorch() - -# 4. Save the compiled .pte program -with open("vk_add.pte", "wb") as file: - file.write(executorch_program.buffer) -``` - -Like other ExecuTorch delegates, a model can be lowered to the Vulkan Delegate -using the `to_backend()` API. The Vulkan Delegate implements the -`VulkanPartitioner` class which identifies nodes (i.e. operators) in the graph -that are supported by the Vulkan delegate, and separates compatible sections of -the model to be executed on the GPU. - -This means the a model can be lowered to the Vulkan delegate even if it contains -some unsupported operators. This will just mean that only parts of the graph -will be executed on the GPU. - - -::::{note} -The [supported ops list](https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py#L194) -Vulkan partitioner code can be inspected to examine which ops are currently -implemented in the Vulkan delegate. 
-:::: - -### Build Vulkan Delegate libraries - -The easiest way to build and test the Vulkan Delegate is to build for Android -and test on a local Android device. Android devices have built in support for -Vulkan, and the Android NDK ships with a GLSL compiler which is needed to -compile the Vulkan Compute Library's GLSL compute shaders. - -The Vulkan Delegate libraries can be built by setting `-DEXECUTORCH_BUILD_VULKAN=ON` -when building with CMake. - -First, make sure that you have the Android NDK installed; any NDK version past -NDK r19c should work. Note that the examples in this doc have been validated with -NDK r28c. The Android SDK should also be installed so that you have access to `adb`. - -The instructions in this page assumes that the following environment variables -are set. - -```shell -export ANDROID_NDK= -# Select the appropriate Android ABI for your device -export ANDROID_ABI=arm64-v8a -# All subsequent commands should be performed from ExecuTorch repo root -cd -# Make sure adb works -adb --version -``` - -To build and install ExecuTorch libraries (for Android) with the Vulkan -Delegate: - -```shell -# From executorch root directory -(rm -rf cmake-android-out && \ - pp cmake . -DCMAKE_INSTALL_PREFIX=cmake-android-out \ - -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ - -DANDROID_ABI=$ANDROID_ABI \ - -DEXECUTORCH_BUILD_VULKAN=ON \ - -DPYTHON_EXECUTABLE=python \ - -Bcmake-android-out && \ - cmake --build cmake-android-out -j16 --target install) -``` - -### Run the Vulkan model on device - -::::{note} -Since operator support is currently limited, only binary arithmetic operators -will run on the GPU. Expect inference to be slow as the majority of operators -are being executed via Portable operators. -:::: - -Now, the partially delegated model can be executed (partially) on your device's -GPU! - -```shell -# Build a model runner binary linked with the Vulkan delegate libs -cmake --build cmake-android-out --target vulkan_executor_runner -j32 - -# Push model to device -adb push vk_add.pte /data/local/tmp/vk_add.pte -# Push binary to device -adb push cmake-android-out/backends/vulkan/vulkan_executor_runner /data/local/tmp/runner_bin - -# Run the model -adb shell /data/local/tmp/runner_bin --model_path /data/local/tmp/vk_add.pte -``` diff --git a/docs/source/backends/samsung/samsung-overview.md b/docs/source/backends/samsung/samsung-overview.md index 464d4e322c7..8b0dea0c696 100644 --- a/docs/source/backends/samsung/samsung-overview.md +++ b/docs/source/backends/samsung/samsung-overview.md @@ -101,17 +101,17 @@ Exynos delegated .pte file will automatically run on the registered backend. 
## Reference -**→{doc}`exynos-partitioner` — Partitioner options.** +**→{doc}`samsung-partitioner` — Partitioner options.** -**→{doc}`exynos-quantization` — Supported quantization schemes.** +**→{doc}`samsung-quantization` — Supported quantization schemes.** -**→{doc}`exynos-op-support` — Supported operators.** +**→{doc}`samsung-op-support` — Supported operators.** ```{toctree} :maxdepth: 2 :hidden: :caption: Exynos Backend -exynos-partitioner -exynos-quantization -exynos-op-support +samsung-partitioner +samsung-quantization +samsung-op-support diff --git a/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md b/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md new file mode 100644 index 00000000000..cb14c72331e --- /dev/null +++ b/docs/source/backends/vulkan/tutorials/etvk-llama-tutorial.md @@ -0,0 +1,159 @@ +# Exporting Llama 3.2 1B/3B Instruct to ExecuTorch Vulkan and running on device + +This tutorial assumes that you have a working local copy of the ExecuTorch repo, +and have gone through the steps to install the executorch pip package or have +installed it by building from source. + +This tutorial also assumes that you have the Android SDK tools installed and +that you are able to connect to an Android device via `adb`. + +Finally, the Android NDK should also be installed, and your environment should +have a variable `ANDROID_NDK` that points to the root directory of the NDK. + +```shell +export ANDROID_NDK= +``` + +## Download the Llama 3.2 1B/3B Instruct model checkpoint and tokenizer + +The model checkpoint and tokenizer can be downloaded from the +[Meta Llama website](https://www.llama.com/llama-downloads/). + +The model files should be downloaded to `~/.llama/checkpoints/Llama3.2-1B-Instruct`. + +## Export the Llama 3.2 1B/3B model + +First, navigate to the root of the ExecuTorch repo. + +```shell +# Navigate to executorch root +cd ~/executorch +``` + +Then, set some environment variables to describe how the model should be +exported. Feel free to tune the values to your preferences. + +```shell +export LLM_NAME=Llama3.2 && \ +export LLM_SIZE=1B && \ +export LLM_SUFFIX="-Instruct" && \ +export QUANT=8da4w && \ +export BACKEND=vulkan && \ +export GROUP_SIZE=64 && \ +export CONTEXT_LENGTH=2048 +``` + +Then, export the Llama 3.2 1B/3B Instruct model to ExecuTorch Vulkan. Note that +that `--vulkan-force-fp16` flag is set, which will improve model inference +latency at the cost of model accuracy. Feel free to remove this flag. + +```shell +python -m examples.models.llama.export_llama \ + -c $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/consolidated.00.pth \ + -p $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/params.json \ + -d fp32 --${BACKEND} \ + -qmode ${QUANT} -G ${GROUP_SIZE} \ + --max_seq_length ${CONTEXT_LENGTH} \ + --max_context_length ${CONTEXT_LENGTH} \ + -kv --use_sdpa_with_kv_cache \ + --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \ + --model "llama3_2" \ + --output_name $HOME/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte + +``` + +After exporting the model, push the exported `.pte` file and the tokenizer to +your device. 
+ +```shell +adb shell mkdir -p /data/local/tmp/llama && \ +adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/tokenizer.model \ + /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model && \ +adb push ~/.llama/checkpoints/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \ + /data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte +``` + +## Build Core Executorch Components + +To be able to run the `.pte` file on device, first the core libraries, +including the Vulkan backend, must be compiled for Android. + +```shell +cmake . \ + -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ + -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ + -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ + --preset "android-arm64-v8a" \ + -DANDROID_PLATFORM=android-28 \ + -DPYTHON_EXECUTABLE=python \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_PAL_DEFAULT=posix \ + -DEXECUTORCH_BUILD_LLAMA_JNI=ON \ + -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \ + -DEXECUTORCH_BUILD_VULKAN=ON \ + -DEXECUTORCH_BUILD_TESTS=OFF \ + -Bcmake-out-android-so && \ +cmake --build cmake-out-android-so -j16 --target install --config Release +``` + +## Build and push the llama runner binary to Android + +Then, build a binary that can be used to run the `.pte` file. + +```shell +cmake examples/models/llama \ + -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ + -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ + -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ + -DEXECUTORCH_ENABLE_LOGGING=ON \ + -DANDROID_ABI=arm64-v8a \ + -DANDROID_PLATFORM=android-28 \ + -DCMAKE_BUILD_TYPE=Release \ + -DPYTHON_EXECUTABLE=python \ + -Bcmake-out-android-so/examples/models/llama && \ +cmake --build cmake-out-android-so/examples/models/llama -j16 --config Release +``` + +Once the binary is built, it can be pushed to your Android device. + +```shell +adb shell mkdir /data/local/tmp/etvk/ && \ +adb push cmake-out-android-so/examples/models/llama/llama_main /data/local/tmp/etvk/ +``` + +## Execute the llama runner binary + +Finally, we can execute the lowered `.pte` file on your device. 
+ +```shell +adb shell /data/local/tmp/etvk/llama_main \ + --model_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_${BACKEND}_${QUANT}_g${GROUP_SIZE}_c${CONTEXT_LENGTH}.pte \ + --tokenizer_path=/data/local/tmp/llama/${LLM_NAME}-${LLM_SIZE}${LLM_SUFFIX}_tokenizer.model \ + --temperature=0 --seq_len=400 --warmup \ + --prompt=\"\<\|begin_of_text\|\>\<\|start_header_id\|\>system\<\|end_header_id\|\>Write me a short poem.\<\|eot_id\|\>\<\|start_header_id\|\>assistant\<\|end_header_id\|\>\" +``` + +Here is some sample output captured from a Galaxy S24: + +```shell +E tokenizers:hf_tokenizer.cpp:60] Error parsing json file: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - invalid literal; last read: 'I' +<|begin_of_text|><|start_header_id|>system<|end_header_id|>Write me a short poem.<|eot_id|><|start_header_id|>assistant<|end_header_id|> + +Here is a short poem I came up with: + +"Moonlight whispers secrets to the night +A gentle breeze that rustles the light +The stars up high, a twinkling show +A peaceful world, where dreams grow slow" + +I hope you enjoy it!<|eot_id|> + +PyTorchObserver {"prompt_tokens":14,"generated_tokens":54,"model_load_start_ms":1760077800721,"model_load_end_ms":1760077802998,"inference_start_ms":1760077802998,"inference_end_ms":1760077804187,"prompt_eval_end_ms":1760077803162,"first_token_ms":1760077803162,"aggregate_sampling_time_ms":19,"SCALING_FACTOR_UNITS_PER_SECOND":1000} + Prompt Tokens: 14 Generated Tokens: 54 + Model Load Time: 2.277000 (seconds) + Total inference time: 1.189000 (seconds) Rate: 45.416316 (tokens/second) + Prompt evaluation: 0.164000 (seconds) Rate: 85.365854 (tokens/second) + Generated 54 tokens: 1.025000 (seconds) Rate: 52.682927 (tokens/second) + Time to first generated token: 0.164000 (seconds) + Sampling time over 68 tokens: 0.019000 (seconds) +``` diff --git a/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md b/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md new file mode 100644 index 00000000000..07982d81c1c --- /dev/null +++ b/docs/source/backends/vulkan/tutorials/etvk-profiling-tutorial.md @@ -0,0 +1,144 @@ +# Executing and profiling an ExecuTorch Vulkan model on device + +This tutorial assumes that you have a working local copy of the ExecuTorch repo, +and have gone through the steps to install the executorch pip package or have +installed it by building from source. + +This tutorial also assumes that you have the Android SDK tools installed and +that you are able to connect to an Android device via `adb`. + +Finally, the Android NDK should also be installed, and your environment should +have a variable `ANDROID_NDK` that points to the root directory of the NDK. + +```shell +export ANDROID_NDK= +``` + +## Lower a model to ExecuTorch Vulkan and obtain the `.pte` file + + +The commands in this tutorial are assumed to be executed from ExecuTorch's root +directory. + +```shell +cd ~/executorch +``` + +For this tutorial, we will use the export script in +[`executorch/examples/vulkan/export.py`](https://github.com/pytorch/executorch/tree/main/examples/vulkan), +however any method of generating a `.pte` file will suffice. In this tutorial, +the InceptionV3 model is exported. + +```shell +python -m examples.vulkan.export --model_name=ic3 -o . -fp16 +``` + +After exporting, there should be a file called `ic3_vulkan.pte` in the root +directory of ExecuTorch. 
Feel free to modify the `-o` argument of the script to +control where the `.pte` file will be stored. + +Then, push the `.pte` file to device. + +```shell +adb shell mkdir -p /data/local/tmp/etvk/models/ && \ +adb push ic3_vulkan.pte /data/local/tmp/etvk/models/ic3_vulkan.pte +``` + +## Build the `executor_runner` binary and push to device + +To be able to run the `.pte` file on device, first the core libraries, +including the Vulkan backend, must be compiled for Android. Note that +`-DEXECUTORCH_ENABLE_EVENT_TRACER=ON` is used to turn on profiling, and +`-DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON` is used to build the runner binary that +will be used to execute and profile the `.pte` file. + + +```shell +cmake . \ + -DCMAKE_INSTALL_PREFIX=cmake-out-android-so \ + -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \ + -DANDROID_SUPPORT_FLEXIBLE_PAGE_SIZES=ON \ + --preset "android-arm64-v8a" \ + -DANDROID_PLATFORM=android-28 \ + -DPYTHON_EXECUTABLE=python \ + -DCMAKE_BUILD_TYPE=Release \ + -DEXECUTORCH_PAL_DEFAULT=posix \ + -DEXECUTORCH_BUILD_LLAMA_JNI=ON \ + -DEXECUTORCH_BUILD_EXTENSION_NAMED_DATA_MAP=ON \ + -DEXECUTORCH_BUILD_VULKAN=ON \ + -DEXECUTORCH_BUILD_TESTS=OFF \ + -DEXECUTORCH_BUILD_EXTENSION_EVALUE_UTIL=ON \ + -DEXECUTORCH_BUILD_EXECUTOR_RUNNER=ON \ + -DEXECUTORCH_ENABLE_EVENT_TRACER=ON \ + -Bcmake-out-android-so && \ +cmake --build cmake-out-android-so -j16 --target install --config Release +``` + +Once the build completes, we can push the runner binary to device. + +```shell +adb push cmake-out-android-so/executor_runner /data/local/tmp/etvk/executor_runner +``` + +## Execute the `.pte` file + +Finally, we can execute the lowered `.pte` file on your device. To test run the +model file without profiling: + +```shell +adb shell /data/local/tmp/etvk/executor_runner \ + --model_path /data/local/tmp/etvk/models/ic3_vulkan.pte +``` + +Now, with profiling: + +```shell +MODEL_NAME=ic3 && \ +BACKEND=vulkan && \ +NUM_ITERS=3 && \ +adb shell mkdir -p /data/local/tmp/etvk/etdumps/ && \ +adb shell /data/local/tmp/etvk/executor_runner \ + --model_path /data/local/tmp/etvk/models/${MODEL_NAME}_${BACKEND}.pte \ + --num_executions=${NUM_ITERS} \ + --etdump_path /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \ +adb pull /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp ${MODEL_NAME}_${BACKEND}.etdp && \ +adb shell rm /data/local/tmp/etvk/etdumps/${MODEL_NAME}_${BACKEND}.etdp && \ +python devtools/inspector/inspector_cli.py \ + --etdump_path ${MODEL_NAME}_${BACKEND}.etdp +``` + +Here is some sample (tailed) output from a Samsung Galaxy S24: + +```shell +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 165 │ Execute │ conv2d_clamp_half_163 │ 0.345082 │ 0.346164 │ 0.346247 │ 0.345748 │ 0.344812 │ 0.346268 │ [] │ True │ │ [2081488974948084, 2081488995911052, 2081489016763676] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 166 │ Execute │ conv2d_clamp_half_164 │ 0.306124 │ 0.30654 │ 0.306998 │ 0.306557 │ 0.30602 │ 0.307112 │ [] │ True │ │ [2081488975294716, 2081488996256228, 2081489017110204] │ 
+├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 167 │ Execute │ set_zero_int32_165 │ 0.00240245 │ 0.00244403 │ 0.00248561 │ 0.00244403 │ 0.00239205 │ 0.002496 │ [] │ True │ │ [2081488975601100, 2081488996563132, 2081489017417680] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 168 │ Execute │ concat_2_texture3d_half_166 │ 0.0122305 │ 0.01248 │ 0.0125634 │ 0.0124108 │ 0.0121682 │ 0.0125842 │ [] │ True │ │ [2081488975603960, 2081488996565940, 2081489017420436] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 169 │ Execute │ set_zero_int32_167 │ 0.00157056 │ 0.00161195 │ 0.00161214 │ 0.00159478 │ 0.00156021 │ 0.00161219 │ [] │ True │ │ [2081488975616804, 2081488996578888, 2081489017432968] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 170 │ Execute │ concat_3_texture3d_half_168 │ 0.0420369 │ 0.0423281 │ 0.0427857 │ 0.0423974 │ 0.0419641 │ 0.0429001 │ [] │ True │ │ [2081488975618728, 2081488996580864, 2081489017434944] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 171 │ Execute │ update_concat_offset_3_int32_169 │ 0.00261035 │ 0.00265193 │ 0.00265212 │ 0.00263468 │ 0.00259995 │ 0.00265217 │ [] │ True │ │ [2081488975661992, 2081488996623556, 2081489017477272] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 172 │ Execute │ concat_1_texture3d_half_170 │ 0.00758157 │ 0.00774789 │ 0.00803914 │ 0.00779994 │ 0.00753999 │ 0.00811195 │ [] │ True │ │ [2081488975664956, 2081488996626572, 2081489017480288] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 173 │ Execute │ mean2d_half_171 │ 0.0147889 │ 0.0148721 │ 0.0150384 │ 0.0149067 │ 0.0147681 │ 0.01508 │ [] │ True │ │ [2081488975673432, 2081488996634476, 2081489017488400] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ 
+│ 174 │ Execute │ view_half_172 │ 0.00644803 │ 0.00644803 │ 0.00653119 │ 0.00648268 │ 0.00644803 │ 0.00655198 │ [] │ True │ │ [2081488975688876, 2081488996649712, 2081489017503532] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 175 │ Execute │ view_half_173 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ 0.00488806 │ [] │ True │ │ [2081488975695688, 2081488996656524, 2081489017510448] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 176 │ Execute │ linear_naive_texture3d_half_174 │ 0.586726 │ 0.590096 │ 0.595338 │ 0.590876 │ 0.585884 │ 0.596648 │ [] │ True │ │ [2081488975700940, 2081488996661776, 2081489017515700] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 177 │ Execute │ image_to_nchw_texture3d_half_float_175 │ 0.00270395 │ 0.00270414 │ 0.00274572 │ 0.00272139 │ 0.00270391 │ 0.00275612 │ [] │ True │ │ [2081488976297952, 2081488997248024, 2081489018106160] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 178 │ Execute │ DELEGATE_CALL │ 20.8864 │ 20.9461 │ 21.5925 │ 21.1906 │ 20.8715 │ 21.7541 │ [] │ False │ │ [358395625, 380178646, 401147657] │ +├─────┼────────────────────┼────────────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼──────────────┼────────────┼───────────────────┼─────────────────────────┼────────────────────────────────────────────────────────┤ +│ 179 │ Execute │ Method::execute │ 20.8867 │ 20.9464 │ 21.593 │ 21.191 │ 20.8718 │ 21.7547 │ [] │ False │ │ [358395521, 380178542, 401147552] │ +╘═════╧════════════════════╧════════════════════════════════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧══════════════╧════════════╧═══════════════════╧═════════════════════════╧════════════════════════════════════════════════════════╛ +``` diff --git a/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md b/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md new file mode 100644 index 00000000000..953c93a9c12 --- /dev/null +++ b/docs/source/backends/vulkan/tutorials/vulkan-tutorials.md @@ -0,0 +1,13 @@ +# Vulkan Backend Tutorials + +**→{doc}`etvk-profiling-tutorial`** + +**→{doc}`etvk-llama-tutorial`** + +```{toctree} +:maxdepth: 2 +:hidden: +:caption: Tutorials + +etvk-profiling-tutorial +etvk-llama-tutorial diff --git a/docs/source/backends/vulkan/vulkan-op-support.rst b/docs/source/backends/vulkan/vulkan-op-support.rst new file mode 100644 index 00000000000..623907cb504 --- /dev/null +++ b/docs/source/backends/vulkan/vulkan-op-support.rst @@ -0,0 +1,46 @@ +================ +Operator Support +================ + +This page lists the 
operators currently supported by the Vulkan backend. The
+source of truth for this information is `op_registry.py <https://github.com/pytorch/executorch/blob/main/backends/vulkan/op_registry.py>`_,
+which is used by the Vulkan Partitioner to determine which operators should be
+lowered to the Vulkan backend and additionally describes the capabilities of
+each operator implementation.
+
+If an operator used in your model is not in this list, feel free to create a
+feature request on GitHub, and we will do our best to add an implementation for
+the operator.
+
+The namespace of an operator describes where it originates from:
+
+* **aten** - operators in this namespace correspond 1:1 to operators in PyTorch's
+  `ATen library `_.
+  They all support fp16 and fp32 dtypes at a minimum.
+* **dim_order_op** - these operators are inserted when lowering to ExecuTorch in
+  order to manage optimal tensor memory layouts. They are typically removed,
+  since the Vulkan backend manages optimal tensor representations internally.
+* **llama** - custom ops targeted for LLM inference. These are typically inserted
+  by model source transformations applied to a `nn.Module` and are not invoked
+  directly by a PyTorch model.
+* **operator** - these operators work with symbolic integers, which are also
+  supported by the Vulkan backend.
+* **quantized_decomposed** / **torchao** - these ops are introduced by quantization
+  workflows (either torchao's `quantize_` API or the PT2E quantization flow).
+  They typically represent quantizing/dequantizing a tensor, or choosing the
+  quantization parameters for a tensor. In practice, most instances of these
+  operators will be fused into a custom op in the **et_vk** namespace.
+* **et_vk** - these are custom operators implemented only in the Vulkan backend.
+  They typically represent quantized variants of **aten** operators, or fusions
+  of common operator patterns. They are inserted by operator fusion graph passes
+  when lowering to the Vulkan backend.
+
+All operators support dynamic input shapes unless otherwise noted (i.e. "no
+resize support"). The expectation is that over time, all operators will be able
+to support dynamic shapes.
+
+.. csv-table:: Operator Support
+   :file: vulkan-op-support-table.csv
+   :header-rows: 1
+   :widths: 25 25 75
+   :align: left
diff --git a/docs/source/backends/vulkan/vulkan-overview.md b/docs/source/backends/vulkan/vulkan-overview.md
new file mode 100644
index 00000000000..50c87cd047b
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-overview.md
@@ -0,0 +1,163 @@
+# Vulkan Backend
+
+The ExecuTorch Vulkan (ET-VK) backend enables ExecuTorch models to execute on
+GPUs via the cross-platform [Vulkan API](https://www.vulkan.org/). Although
+Vulkan API support is almost ubiquitous among modern GPUs, the ExecuTorch Vulkan
+backend is currently developed with a specific focus on **Android GPUs**.
+
+## Features
+
+- Wide operator support via an in-tree [GLSL compute shader library](https://github.com/pytorch/executorch/tree/main/backends/vulkan/runtime/graph/ops/glsl)
+- Support for models that require dynamic shapes
+- Support for FP32 and FP16 inference modes
+- Support for quantized linear layers with 8-bit/4-bit weights and 8-bit dynamically quantized activations
+- Support for quantized linear layers with 8-bit/4-bit weights and FP32/FP16 activations
+
+Note that the Vulkan backend is under active development, and its GLSL compute
+shader library is being consistently expanded over time. Additional support for
+quantized operators (e.g. quantized convolution) and additional quantization
+modes is on the way.
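+
+Since the backend currently targets Android GPUs, it can be useful to confirm
+that a connected device actually exposes Vulkan (see the target requirements
+below). One rough way to do this is to query the device's feature list with
+`adb`; the exact feature strings reported vary between devices:
+
+```sh
+# Devices with Vulkan support report features such as android.hardware.vulkan.version
+adb shell pm list features | grep -i vulkan
+```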
+ +## Target Requirements + +- Supports Vulkan 1.1 + +## Development Requirements + +To contribute to the Vulkan delegate, the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home#android) +must be installed on the development system. After installation, the `glslc` binary must +be found in your `PATH` in order to compile Vulkan shaders. This can be checked by +running + +```sh +glslc --version +``` + +If this is not the case after completing the Vulkan SDK installation, you may have to +go into `~/VulkanSDK//` and run + +```sh +source setup-env.sh +``` + +or alternatively, + +```sh +python install_vulkan.py +``` + +The [Android NDK](https://developer.android.com/ndk/downloads) must also be installed. +Any NDK version past NDK r17c should suffice. + +---- + +## Using the Vulkan Backend + +To lower a model to the Vulkan backend during the export and lowering process, +pass an instance of `VulkanPartitioner` to `to_edge_transform_and_lower`. The +example below demonstrates this process using the MobileNet V2 model from +torchvision. + +```python +import torch +import torchvision.models as models + +from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner +from executorch.exir import to_edge_transform_and_lower + +from torchvision.models.mobilenetv2 import MobileNet_V2_Weights + +mobilenet_v2 = models.mobilenetv2.mobilenet_v2( + weights=MobileNet_V2_Weights.DEFAULT +).eval() + +sample_inputs = (torch.randn(1, 3, 224, 224),) + +exported_program = torch.export.export(mobilenet_v2, sample_inputs) + +etvk_program = to_edge_transform_and_lower( + exported_program, + partitioner=[VulkanPartitioner()], +).to_executorch() + +with open("mv2_vulkan.pte", "wb") as file: + etvk_program.write_to_file(file) +``` + +See [Partitioner API](vulkan-partitioner.md) +for a reference on available partitioner options. + +---- + +## Quantization + +The Vulkan delegate currently supports execution of quantized linear layers. +See [Vulkan Quantization](vulkan-quantization.md) +for more information on available quantization schemes and APIs. + +---- + +## Runtime Integration + +To run the model on-device, use the standard ExecuTorch runtime APIs. + +For integration in Android applications, the Vulkan backend is included in the +[executorch-android-vulkan](https://mvnrepository.com/artifact/org.pytorch/executorch-android-vulkan) +package. + +When building from source, pass `-DEXECUTORCH_BUILD_VULKAN=ON` when configuring +the CMake build to compile the Vulkan backend. See [Running on Device](/getting-started.md#running-on-device) +for more information. + +To link against the backend, add the `executorch_backends` CMake target as a +build dependency, or link directly against `libvulkan_backend`. Due to the use +of static initialization to register available compute shaders and operators, +it is required to ensure that the library is linked with `--whole-archive`. + +```cmake +# CMakeLists.txt +find_package(executorch CONFIG REQUIRED COMPONENTS vulkan_backend executorch_backends) + +... +target_link_libraries( + my_target + PRIVATE + executorch + executorch_backends + ... +) + +# Ensure that unused code is not discarded. The required linker options may be +# different depending on the target platform. Typically, the +# executorch_target_link_options_shared_lib function from +# executorch/tools/cmake/Utils.cmake can be used to set the required linker +# options. 
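+#
+# For example (a sketch; assuming the helper has been included in your CMake
+# build):
+#   executorch_target_link_options_shared_lib(executorch_backends)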
+target_link_options(
+    executorch_backends INTERFACE "SHELL:LINKER:--whole-archive \
+    $ \
+    LINKER:--no-whole-archive"
+)
+```
+
+No additional steps are necessary to use the backend beyond linking the target.
+Any Vulkan-delegated .pte file will automatically run on the registered backend.
+
+## Additional Resources
+
+**→{doc}`vulkan-partitioner`**
+
+**→{doc}`vulkan-quantization`**
+
+**→{doc}`vulkan-troubleshooting`**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: Vulkan Backend
+
+vulkan-partitioner
+vulkan-quantization
+vulkan-op-support
+vulkan-troubleshooting
+
+tutorials/vulkan-tutorials
+```
diff --git a/docs/source/backends/vulkan/vulkan-partitioner.md b/docs/source/backends/vulkan/vulkan-partitioner.md
new file mode 100644
index 00000000000..566ec491b47
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-partitioner.md
@@ -0,0 +1,55 @@
+# Partitioner API
+
+[VulkanPartitioner](https://github.com/pytorch/executorch/blob/main/backends/vulkan/partitioner/vulkan_partitioner.py)
+is a Python class that controls which operators in a model can or should be
+delegated to the Vulkan backend. It is the primary entrypoint to the Vulkan
+backend and is also used to configure the backend's behaviour.
+
+## Usage
+
+For most use cases, constructing `VulkanPartitioner()` with no arguments is
+sufficient. In this case, the partitioner will lower as much of the model to
+the Vulkan backend as possible.
+
+```python
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
+
+## Common Config Options
+
+Generally, the Vulkan backend is configured by passing a `compile_options`
+dictionary to `VulkanPartitioner()`, for example:
+
+```python
+compile_options = {
+    "require_dynamic_shapes": True,
+    "force_fp16": True,
+}
+
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner(compile_options)],
+).to_executorch()
+```
+
+### `require_dynamic_shapes`
+
+If a model is expected to use dynamic shapes, it is recommended to set the
+`"require_dynamic_shapes"` key in `compile_options` to `True`.
+
+Not all operators in Vulkan support dynamic shapes at the moment, although the
+majority do. This flag will prevent operators that don't support dynamic shapes
+from being lowered to Vulkan.
+
+### `force_fp16`
+
+This option causes the Vulkan backend to internally convert all FP32 tensors to
+FP16. This can improve inference latency and memory footprint at the cost of
+model accuracy.
+
+FP32 input tensors will automatically be converted to FP16 upon entering the
+Vulkan backend, and FP16 outputs will automatically be converted back to FP32
+as they are returned.
diff --git a/docs/source/backends/vulkan/vulkan-quantization.md b/docs/source/backends/vulkan/vulkan-quantization.md
new file mode 100644
index 00000000000..89c9f7514b0
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-quantization.md
@@ -0,0 +1,163 @@
+# Quantization
+
+The Vulkan backend currently supports execution of quantized linear layers,
+where weights are symmetrically quantized to 8-bit or 4-bit with
+per-output-channel or per-group quantization scales.
+
+Support for additional quantized operators and quantization schemes (e.g.
+statically and dynamically quantized convolution, and statically quantized
+linear layers) is under active development and will be added soon.
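+
+To make the weight scheme described above concrete, the sketch below shows what
+symmetric per-group quantization of a linear weight computes. This is an
+illustration only, not an ExecuTorch or torchao API; the actual weight packing
+and compute kernels are handled by torchao and the Vulkan backend.
+
+```python
+import torch
+
+
+def quantize_weight_per_group(weight: torch.Tensor, group_size: int = 32, bits: int = 4):
+    # weight: [out_channels, in_channels]; assumes in_channels is divisible by
+    # group_size. Each group of `group_size` input channels within an output
+    # channel shares a single scale.
+    out_ch, in_ch = weight.shape
+    grouped = weight.reshape(out_ch, in_ch // group_size, group_size)
+    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit symmetric, 127 for 8-bit
+    scales = grouped.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
+    quantized = torch.clamp(torch.round(grouped / scales), -qmax - 1, qmax).to(torch.int8)
+    return quantized.reshape(out_ch, in_ch), scales.squeeze(-1)
+```
+
+With `group_size` equal to the number of input channels, this reduces to the
+per-output-channel case described above.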
+
+### 4-bit quantization with torchao `quantize_`
+
+The `quantize_` API from [torchao](https://github.com/pytorch/ao) allows for
+more advanced quantization schemes, and is the quantization workflow needed to
+access 4-bit quantization. 4-bit quantization is commonly used for LLMs.
+
+Two options are available to execute linear layers with 4-bit quantization:
+
+1. Dynamically quantized activations via `Int8DynamicActivationIntxWeightConfig`
+2. Weight only quantization via `IntxWeightOnlyConfig`
+
+Dynamically quantized activations can provide a significant latency improvement
+over weight only quantization, since this scheme allows GPUs to leverage
+accelerated integer dot product instructions when computing matrix
+multiplications.
+
+Below is a simple example of quantizing a sequence of linear layers using
+the `quantize_` API.
+
+```python
+import torch
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+
+from executorch.exir import to_edge_transform_and_lower
+from torchao.quantization.granularity import PerGroup
+from torchao.quantization.quant_api import (
+    Int8DynamicActivationIntxWeightConfig,
+    IntxWeightOnlyConfig,
+    quantize_,
+)
+from torchao.utils import unwrap_tensor_subclass
+
+
+class LinearSequenceModule(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = torch.nn.Linear(128, 64, bias=False)
+        self.linear2 = torch.nn.Linear(64, 32, bias=False)
+        self.linear3 = torch.nn.Linear(32, 16, bias=False)
+
+    def forward(self, x):
+        x = self.linear1(x)
+        x = self.linear2(x)
+        x = self.linear3(x)
+        return x
+
+
+linear_sequence_module = LinearSequenceModule()
+
+M = 32
+sample_inputs = (torch.randn(M, 128),)
+
+group_size = 32
+
+q_config_8da4w = Int8DynamicActivationIntxWeightConfig(
+    weight_dtype=torch.int4, weight_granularity=PerGroup(group_size)
+)
+
+# Weight only 4-bit config, shown for reference; this example applies the
+# 8da4w config below
+q_config_4w = IntxWeightOnlyConfig(
+    weight_dtype=torch.int4, granularity=PerGroup(group_size)
+)
+
+quantize_(linear_sequence_module, q_config_8da4w)
+unwrap_tensor_subclass(linear_sequence_module)
+
+# Regular export path from here
+exported_program = torch.export.export(linear_sequence_module, sample_inputs)
+
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
+
+### 8-bit quantization with the PT2E flow
+
+For 8-bit quantized linear layers, currently the only quantization scheme
+supported is weight only quantization, with weights that are symmetrically
+quantized to 8 bits with per-output-channel quantization scales.
+
+To access this quantization mode, the PT2E quantization flow must be used. At a
+high level, the steps to quantize a model are:
+
+1) Create an instance of the `VulkanQuantizer` class and specify the desired quantization behaviour.
+2) Use `torch.export.export` to export the model for quantization.
+3) Call `prepare_pt2e` to prepare the exported graph for quantization.
+4) Execute the prepared model with representative samples to calibrate the quantized tensor activation ranges.
+5) Call `convert_pt2e` to quantize the model.
+6) Export and lower the model using the standard flow.
+
+For example:
+
+```python
+import torch
+
+from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
+
+from executorch.backends.vulkan.quantizer.vulkan_quantizer import (
+    get_symmetric_quantization_config,
+    VulkanQuantizer,
+)
+
+from executorch.exir import to_edge_transform_and_lower
+
+from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+
+class LinearSequenceModule(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.linear1 = torch.nn.Linear(128, 64, bias=False)
+        self.linear2 = torch.nn.Linear(64, 32, bias=False)
+        self.linear3 = torch.nn.Linear(32, 16, bias=False)
+
+    def forward(self, x):
+        x = self.linear1(x)
+        x = self.linear2(x)
+        x = self.linear3(x)
+        return x
+
+
+linear_sequence_module = LinearSequenceModule()
+
+M = 32
+# Create sample inputs
+sample_inputs = (torch.randn(M, 128),)
+
+# Set up the quantizer
+quantizer = VulkanQuantizer()
+quantizer.set_global(get_symmetric_quantization_config(is_dynamic=False, weight_bits=8))
+
+# Export the model
+exported_program = torch.export.export(linear_sequence_module, sample_inputs)
+graph_module = exported_program.module()
+
+# Quantize the exported program with the PT2E quantization flow
+quantized_module = prepare_pt2e(graph_module, quantizer)
+# Calibrate. In practice, this would be done by iterating over a real dataset
+quantized_module(*sample_inputs)
+quantized_module = convert_pt2e(quantized_module)
+
+# Export once more
+exported_program = torch.export.export(quantized_module, sample_inputs)
+
+# Lower to the Vulkan backend
+etvk_program = to_edge_transform_and_lower(
+    exported_program,
+    partitioner=[VulkanPartitioner()],
+).to_executorch()
+```
diff --git a/docs/source/backends/vulkan/vulkan-troubleshooting.md b/docs/source/backends/vulkan/vulkan-troubleshooting.md
new file mode 100644
index 00000000000..9845f588004
--- /dev/null
+++ b/docs/source/backends/vulkan/vulkan-troubleshooting.md
@@ -0,0 +1,57 @@
+# Troubleshooting
+
+This page describes common issues that you may encounter when using the Vulkan
+backend and how to debug and resolve them.
+
+## Vulkan Backend Not Found
+
+If you try to execute a .pte file that has been lowered to the Vulkan backend,
+you may see an error like:
+
+```shell
+E 00:00:00.366934 executorch:method.cpp:74] Backend VulkanBackend is not registered.
+```
+
+This error indicates the Vulkan backend is not registered with the runtime. This
+can happen because the backend was not compiled or linked, or because the
+registration code was optimized out.
+
+First, make sure that CMake is configured with `-DEXECUTORCH_BUILD_VULKAN=ON`
+when building ExecuTorch.
+
+Next, make sure that your application links either the `vulkan_backend` target
+or the `executorch_backends` target.
+
+Finally, ensure that `vulkan_backend` or `executorch_backends` is linked with
+the equivalent of `--whole-archive`, so that the backend registration code is
+not stripped by the linker.
+
+## Slow Performance
+
+Performance issues can be caused by a variety of factors:
+
+* A key compute shader (most often convolution or linear) is not performing well
+  on your target GPU
+* Unsupported operators are causing too many graph breaks
+* An existing operator lacks support for some memory layout or storage type,
+  resulting in a high number of copies being inserted to ensure tensors are in
+  the representation required by the next operator
+
+If you experience poor on-device performance for a particular model, please
+obtain some profiling data while running your model. The
+[profiling tutorial](./tutorials/etvk-profiling-tutorial.md) can
+be a good reference for how to do this.
+
+Then, please file an issue on GitHub with the following details:
+
+* The device(s) you have tested with, and which devices exhibit poor performance
+  running the model
+* The profiling data collected from executing the model
+* The release version of ExecuTorch you are using, or the commit hash you built
+  from if you built from source
+* If available, an export script that can be used to export your model to aid
+  in reproducing the issue
+* If available, the `.pte` file you are testing with to aid in reproducing the
+  issue
+
+We will do our best to patch performance problems in the Vulkan backend and
+help you resolve your issue.
diff --git a/examples/vulkan/README.md b/examples/vulkan/README.md
index 71fdd0e4183..7831809be69 100644
--- a/examples/vulkan/README.md
+++ b/examples/vulkan/README.md
@@ -1,80 +1,84 @@
-# Vulkan Delegate Export Examples
+# Example export script for the ExecuTorch Vulkan backend
 
-This directory contains scripts for exporting models with the Vulkan delegate in ExecuTorch. Vulkan delegation allows you to run your models on devices with Vulkan-capable GPUs, potentially providing significant performance improvements over CPU execution.
+This directory contains `export.py`, a utility script that can be used to export
+models registered in [`executorch/examples/models/__init__.py`](https://github.com/pytorch/executorch/blob/main/examples/models/__init__.py)
+to the Vulkan backend.
 
-## Scripts
+## Usage
 
-- `export.py`: Basic export script for models to use with Vulkan delegate
-- `aot_compiler.py`: Advanced export script with quantization support
+Note that all example commands are assumed to be executed from the executorch root.
 
-## Usage
+```shell
+cd ~/executorch
+```
 
 ### Basic Export
 
-```bash
-python -m executorch.examples.vulkan.export -m <model_name> -o <output_dir>
+For example, to export MobileNet V2:
+
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR}
 ```
 
-### Export with Quantization (Experimental)
+This will create a file named `mv2_vulkan.pte` in the specified output directory.
 
-```bash
-python -m executorch.examples.vulkan.aot_compiler -m <model_name> -q -o <output_dir>
-```
+### With dynamic shape support
 
-### Dynamic Shape Support
+To enable exporting with dynamic shapes, simply add the `-d` flag.
 
-```bash
-python -m executorch.examples.vulkan.export -m <model_name> -d -o <output_dir>
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d
 ```
 
-### Additional Options
+### Export a bundled pte
 
-- `-s/--strict`: Export with strict mode (default: True)
-- `-a/--segment_alignment`: Specify segment alignment in hex (default: 0x1000)
-- `-e/--external_constants`: Save constants in external .ptd file (default: False)
-- `-r/--etrecord`: Generate and save an ETRecord to the given file location
+Use the `-b` flag to export a bundled PTE file (i.e. a `.bpte` file). This is a
+`.pte` file with bundled test cases that can be used for correctness checking.
 
-## Examples
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d -b
+```
 
-```bash
-# Export MobileNetV2 with Vulkan delegate
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -o ./exported_models
+This will create a file called `mv2_vulkan.bpte` in the specified output directory.
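+
+Before relying on the script's built-in correctness check described in the next
+section, it can also be useful to sanity-check an exported `.pte` file by hand
+from Python. The snippet below is a minimal sketch, assuming the `executorch`
+Python package was built with the Vulkan backend enabled (see the requirements
+in the next section) and that the `executorch.runtime` Python API is available
+in your installed version; the file name `mv2_vulkan.pte` matches the export
+command shown above.
+
+```python
+import torch
+
+from executorch.runtime import Runtime
+
+# Load the Vulkan-delegated program produced by the export script.
+runtime = Runtime.get()
+program = runtime.load_program("mv2_vulkan.pte")
+method = program.load_method("forward")
+
+# MobileNet V2 expects a single 1x3x224x224 image tensor.
+sample_input = torch.randn(1, 3, 224, 224)
+outputs = method.execute((sample_input,))
+print(outputs[0].shape)
+```
+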
-# Export MobileNetV3 with quantization
-python -m executorch.examples.vulkan.aot_compiler -m mobilenet_v3 -q -o ./exported_models
+### With correctness testing
 
-# Export with dynamic shapes
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -d -o ./exported_models
+The script can also execute the exported and lowered model via pybindings to
+check output correctness before writing the output file.
 
-# Export with ETRecord for debugging
-python -m executorch.examples.vulkan.export -m mobilenet_v2 -r ./records/mobilenet_record.etrecord -o ./exported_models
-```
+To enable this, ensure that your machine:
 
-## Supported Operations
+1. Has the [Vulkan SDK](https://vulkan.lunarg.com/sdk/home#android) installed
+2. Has Vulkan drivers installed
 
-The Vulkan delegate supports various operations including:
+Additionally, you will need to install the `executorch` Python package from
+source, since the Vulkan backend is not included by default in the pip package.
 
-- Basic arithmetic (add, subtract, multiply, divide)
-- Activations (ReLU, Sigmoid, Tanh, etc.)
-- Convolutions (Conv1d, Conv2d, ConvTranspose2d)
-- Pooling operations (MaxPool2d, AvgPool2d)
-- Linear/Fully connected layers
-- BatchNorm, GroupNorm
-- Various tensor operations (cat, reshape, permute, etc.)
+```shell
+CMAKE_ARGS="-DEXECUTORCH_BUILD_VULKAN=ON " ./install_executorch.sh -e
+```
 
-For a complete list of supported operations, refer to the Vulkan delegate implementation in the ExecuTorch codebase.
+Once these conditions are fulfilled, the `--test` flag can be passed to the
+script.
 
-## Debugging and Optimization
+```shell
+MODEL_NAME=mv2 && \
+OUTPUT_DIR=. && \
+python -m examples.vulkan.export -m ${MODEL_NAME} -o ${OUTPUT_DIR} -d --test
+```
 
-If you encounter issues with Vulkan delegation:
+You should see output like:
 
-1. Use `-r/--etrecord` to generate an ETRecord for debugging
-2. Check if your operations are supported by the Vulkan delegate
-3. Ensure your Vulkan drivers are up to date
-4. Try using the export script with `--strict False` if strict mode causes issues
+```shell
+INFO:root:✓ Model test PASSED - outputs match reference within tolerance
+```
 
-## Requirements
+### Quantization support
 
-- Vulkan runtime libraries (libvulkan.so.1)
-- A Vulkan-capable GPU with appropriate drivers
-- PyTorch with Vulkan support
+Support for quantization is under active development and will be added soon!