
RFC: Unified Vulkan and VGF GPU backend path in ExecuTorch #19298

@wwwind

Description


🚀 The feature, motivation and pitch

Why do we need this

This RFC proposes a unified GPU export and runtime path for ExecuTorch models that can use both the existing Vulkan backend and the Arm VGF backend in a single .pte program.

The goal is not to rewrite either backend. The proposal keeps Vulkan and VGF as separate ExecuTorch backends and adds a thin shared layer that lets them participate in the same exported program and, at runtime, bind to a compatible Vulkan execution context.

The main motivation is operator coverage and hardware enablement. In realistic models, different graph regions may be better suited to different GPU backends:

  • VGF can execute VGF/TOSA-lowered graph regions targeting Arm GPU data graph/tensor functionality.
  • Vulkan can execute graph regions already covered by the Vulkan delegate.
  • A unified path lets a model use both backends in one .pte rather than forcing users to pick only one backend for the entire model.

The proposed design has four principles:

  1. Keep Vulkan and VGF as separate backend implementations.
  2. Add a unified Python export path that orchestrates both partitioners.
  3. Add a shared GPU runtime layer for common Vulkan objects.
  4. Pass shared GPU configuration through compile specs so both backends agree on context identity, lifetime mode, and synchronization policy.

The initial implementation is deliberately limited. It shares runtime Vulkan objects such as VkInstance, VkPhysicalDevice, VkDevice, VkQueue, the queue family index, and synchronization metadata. It does not attempt zero-copy exchange between delegate-owned tensors in the first version. Constants and backend-internal tensor layouts remain owned by each backend, so constants may be duplicated when used on both sides. Using .pds files to avoid duplicated weights in some cases is deliberately not considered at this point.

The intended outcome is a mixed .pte where execution can look like:

VGFBackend   -> VulkanBackend -> VGFBackend

For example, a small end-to-end test model can place VGF-friendly prefix/suffix regions on VGF and a Vulkan-supported middle island on Vulkan:

class UnifiedVgfVulkanModel(torch.nn.Module):
    """
    Expected partitioning intent:
      - VGF: transpose -> avg_pool2d
      - Vulkan: index_select -> gather -> index_select
      - VGF: sin -> cos -> reciprocal -> mul -> transpose
    """

    def __init__(self) -> None:
        super().__init__()
        self.register_buffer(
            "channel_index",
            torch.tensor([2, 0, 1], dtype=torch.int64),
        )
        self.register_buffer(
            "row_index",
            torch.tensor([3, 1, 2, 0], dtype=torch.int64),
        )
        col_pattern = torch.tensor([3, 1, 2, 0], dtype=torch.int64)
        self.register_buffer(
            "col_gather_index",
            col_pattern.view(1, 1, 1, 4).repeat(1, 3, 4, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # VGF-friendly prefix
        x = x.transpose(-1, -2)
        x = F.avg_pool2d(x, kernel_size=2, stride=2)

        # Vulkan fallback island
        x = torch.index_select(x, dim=1, index=self.channel_index)
        x = torch.gather(x, dim=3, index=self.col_gather_index)
        x = torch.index_select(x, dim=2, index=self.row_index)

        # VGF-friendly suffix
        s = torch.sin(x)
        c = torch.cos(x)
        y = s * torch.reciprocal(c)
        return y.transpose(-1, -2)

The generated .pte should contain both delegate names:

strings unified_vgf_vulkan.pte | grep -E 'VgfBackend|VulkanBackend'

And a runtime run should show both backends being initialized and executed:

[VGF] init: ...
[Vulkan] init: ...
[VGF] execute: ...
[Vulkan] execute: ...
[VGF] execute: ...

Alternatives

Keep separate Vulkan-only and VGF-only export paths

This is the current baseline, and ExecuTorch can already run multiple delegated regions in sequence. It is also possible for separate Vulkan and VGF flows to use heuristic-based or user-directed partitioning choices.

The limitation is that the combined decision-making remains distributed across independent backend paths. Each backend can decide what it supports, but there is no single place to reason about cross-backend ordering, fallback behavior, minimum partition size, shared compile specs, or shared runtime context setup.

The proposed unified path keeps Vulkan and VGF as separate backends, but introduces one orchestration layer for combined partitioning. This allows the export flow to make smarter decisions across both backends, such as the current auto policy, while still supporting explicit ordering through options like vgf_first and vulkan_first.

Future user-directed controls will be added on top of this unified orchestration layer. They are intentionally left out of the initial implementation to keep the first change focused on shared infrastructure, deterministic mixed partitioning, and runtime context sharing.

Merge VGF into the Vulkan backend

Merging VGF into the Vulkan backend would create a large and invasive rewrite, and it would also violate an important deployment requirement: VGF must remain available as a standalone export flow for users who capture exported VGF artifacts and deploy them in bespoke runtimes without the ExecuTorch runtime.

Vulkan and VGF also have different serialization formats, graph lowering paths, runtime objects, and execution assumptions. Keeping them separate preserves the existing VGF artifact boundary while still allowing ExecuTorch to coordinate mixed execution when both delegates are used inside one .pte.

There is an additional design constraint around shader generation. Some operators may need to be lowered through tosa.custom into shader code inside a VGF file. For example, an operator such as grid_sample may need to end up in a VGF artifact that contains both graph and compute regions. This means the long-term design needs to support two related but distinct capabilities:

  • Vulkan handles shader regions while VGF handles graph regions.
  • VGF handles both graph regions and custom shader/compute regions inside the VGF artifact.

For these reasons, merging VGF into Vulkan would make ownership and artifact semantics less clear. The proposed design keeps Vulkan and VGF as separate backends, preserves standalone VGF export, and adds a unified orchestration layer only where mixed-backend ExecuTorch execution needs it.

Create a monolithic new GPU backend

A new backend could hide both Vulkan and VGF behind a single runtime implementation, but this would duplicate existing backend logic and slow incremental adoption. The proposal instead introduces a small orchestration layer and shared runtime context while preserving the existing backend implementations.

Require application-level orchestration

Applications could manually split models or run multiple .pte programs, but this moves graph partitioning, data movement, and correctness risks to users. ExecuTorch should be able to represent this as one lowered program.

Implement full zero-copy sharing immediately

Zero-copy delegate tensor sharing is desirable, but it is a larger design problem involving tensor layout compatibility, ownership, synchronization, constants, and backend-specific packing. This RFC proposes a smaller first step: shared device/context/sync domain first, zero-copy later.

Additional context

The proposal introduces two related pieces:

  • Python-side unified export and partitioning.
  • C++ runtime support for sharing GPU context information between delegates.

The shared GPU runtime layer owns or borrows:

  • VkInstance
  • VkPhysicalDevice
  • VkDevice
  • VkQueue
  • Queue family index
  • Optional synchronization primitive metadata, such as a timeline semaphore

A key design point is that command pools are not shared across delegates. The shared context can create one command pool per delegate instance:

VkResult SharedGpuContext::create_command_pool(
    VkCommandPool* command_pool) const {
  if (!is_valid()) {
    return VK_ERROR_INITIALIZATION_FAILED;
  }

  VkCommandPoolCreateInfo create_info{};
  create_info.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
  create_info.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
  create_info.queueFamilyIndex = create_info_.queue_family_index;

  return vkCreateCommandPool(
      create_info_.device, &create_info, nullptr, command_pool);
}

This gives both backends the same Vulkan device and queue family while avoiding command-pool ownership and thread-safety issues.

The shared context is identified by a logical key:

(shared_context_token, shared_group_id) -> SharedGpuContext

The token and group id are ExecuTorch compile-spec concepts, not Vulkan API concepts. They let independent delegates agree on which GPU context they should use.
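The lookup semantics can be modeled as a small process-local map keyed by the (token, group_id) pair. The sketch below is a Python analogue of the idea only — the class and method names are hypothetical, not the proposed C++ API:

```python
import threading
from typing import Any, Callable, Dict, Optional, Tuple


class SharedContextRegistry:
    """Process-local registry mapping (token, group_id) to a context object."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._contexts: Dict[Tuple[str, int], Any] = {}

    def lookup(self, token: str, group_id: int) -> Optional[Any]:
        with self._lock:
            return self._contexts.get((token, group_id))

    def lookup_or_create(
        self, token: str, group_id: int, create_fn: Callable[[], Any]
    ) -> Any:
        # First caller creates the context; later callers reuse it.
        with self._lock:
            key = (token, group_id)
            if key not in self._contexts:
                self._contexts[key] = create_fn()
            return self._contexts[key]


registry = SharedContextRegistry()
ctx_a = registry.lookup_or_create("scene0", 2, lambda: object())
ctx_b = registry.lookup_or_create("scene0", 2, lambda: object())
assert ctx_a is ctx_b  # same token/group resolves to the same context
```

Two delegates that carry the same token and group id in their compile specs therefore bind to the same context, while a different token or group id yields an independent one.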

RFC

Scope

This RFC is scoped to unified execution across the existing Vulkan backend and the Arm VGF backend.

In scope

  • A unified Python export API and partitioner.
  • Shared compile-spec keys for GPU context identity and policy.
  • A small shared GPU C++ runtime library.
  • Vulkan backend changes to create, register, or reuse a shared context.
  • VGF backend changes to create, register, or reuse a shared context.
  • Unit tests for compile-spec parsing, registry behavior, and Vulkan shared-context adapter behavior.
  • End-to-end export/runtime examples that produce a .pte containing both VgfBackend and VulkanBackend.

Out of scope for the first implementation

  • Zero-copy tensor exchange between Vulkan and VGF delegates.
  • Shared constant storage between delegates.
  • Rewriting either backend into the other.
  • Replacing existing Vulkan-only or VGF-only flows.
  • Full model performance tuning.
  • Changing unrelated backend partitioning behavior.

The project will be submitted as a four-phase PR stack:

  1. Python export part.
  2. Shared GPU runtime.
  3. Vulkan backend changes.
  4. VGF backend changes.

Phase 1 — Python export part

Phase 1 introduces the Python API and partitioning orchestration.

The proposed new export-side pieces are:

  • shared compile-spec helpers;
  • UnifiedGpuCompileSpec;
  • UnifiedGpuPartitioner;
  • policy logic for backend ordering;
  • small end-to-end export example.

The shared compile specs use common keys consumed by both backends:

GPU_SHARED_CONTEXT_TOKEN = "gpu_shared_context_token"
GPU_SHARED_CONTEXT_MODE = "gpu_shared_context_mode"
GPU_SHARED_SYNC_MODE = "gpu_shared_sync_mode"
GPU_SHARED_GROUP_ID = "gpu_shared_group_id"
GPU_SHARED_MIN_PARTITION_SIZE = "gpu_shared_min_partition_size"
GPU_SHARED_PREFER = "gpu_shared_prefer"

Example helper usage:

make_shared_gpu_compile_specs(
    token="scene0",
    mode="lookup_or_create",
    sync_mode="timeline",
    group_id=2,
    min_partition_size=3,
    prefer="vgf_first",
)
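The helper itself can be little more than key/value encoding of the constants above. A hedged sketch, assuming a CompileSpec carries a key string and UTF-8 value bytes (the dataclass here is a stand-in, not the real ExecuTorch type):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompileSpec:
    """Stand-in for the ExecuTorch CompileSpec (key plus value bytes)."""
    key: str
    value: bytes


GPU_SHARED_CONTEXT_TOKEN = "gpu_shared_context_token"
GPU_SHARED_CONTEXT_MODE = "gpu_shared_context_mode"
GPU_SHARED_SYNC_MODE = "gpu_shared_sync_mode"
GPU_SHARED_GROUP_ID = "gpu_shared_group_id"
GPU_SHARED_MIN_PARTITION_SIZE = "gpu_shared_min_partition_size"
GPU_SHARED_PREFER = "gpu_shared_prefer"


def make_shared_gpu_compile_specs(
    token: str,
    mode: str = "lookup_or_create",
    sync_mode: str = "timeline",
    group_id: int = 0,
    min_partition_size: int = 3,
    prefer: str = "auto",
) -> List[CompileSpec]:
    # Both backends parse the same keys, so the encoding lives in one place.
    return [
        CompileSpec(GPU_SHARED_CONTEXT_TOKEN, token.encode()),
        CompileSpec(GPU_SHARED_CONTEXT_MODE, mode.encode()),
        CompileSpec(GPU_SHARED_SYNC_MODE, sync_mode.encode()),
        CompileSpec(GPU_SHARED_GROUP_ID, str(group_id).encode()),
        CompileSpec(GPU_SHARED_MIN_PARTITION_SIZE, str(min_partition_size).encode()),
        CompileSpec(GPU_SHARED_PREFER, prefer.encode()),
    ]


specs = make_shared_gpu_compile_specs(token="scene0", group_id=2, prefer="vgf_first")
```

Keeping the six keys behind one helper means the C++ parser in Phase 2 has a single encoding contract to validate against.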

Proposed context modes

  • lookup_only: use a context already registered by the application or another component.
  • lookup_or_create: default mode; first backend creates the context, later backends reuse it.
  • create_only: create and register a new context for this token/group pair.

Proposed sync modes

  • timeline: intended long-term default for shared GPU execution.
  • queue_wait_idle: simple and conservative mode useful for early integration/debugging.
  • none: only for cases where execution is known to be independent.

Proposed preference modes

  • auto: policy chooses backend order based on graph/backend suitability.
  • vgf_first: try VGF partitioning first, then Vulkan.
  • vulkan_first: try Vulkan partitioning first, then VGF.

The unified compile spec can look like:

@dataclass(frozen=True)
class UnifiedGpuCompileSpec:
    vgf: Any
    vulkan_compile_options: Optional[Dict[str, Any]] = None
    shared_context_token: str = "default"
    shared_context_mode: str = "lookup_or_create"
    shared_sync_mode: str = "timeline"
    shared_group_id: int = 0
    min_partition_size: int = 3
    prefer: str = "auto"
    extra_shared_compile_specs: tuple[CompileSpec, ...] = field(default_factory=tuple)

    def shared_compile_specs(self) -> list[CompileSpec]:
        return [
            *make_shared_gpu_compile_specs(
                token=self.shared_context_token,
                mode=self.shared_context_mode,
                sync_mode=self.shared_sync_mode,
                group_id=self.shared_group_id,
                min_partition_size=self.min_partition_size,
                prefer=self.prefer,
            ),
            *self.extra_shared_compile_specs,
        ]

The partitioner is responsible for building the real VGF and Vulkan partitioners and running them in a deterministic order:

  • Build VGF and Vulkan partitioners.
  • Attach shared compile specs to both.
  • Choose backend order from prefer.
  • Run the first partitioner.
  • Run the second partitioner with skip_delegated_nodes=True.
  • Prune partitions smaller than min_partition_size.

Initial behavior after pruning is conservative: pruned nodes remain undelegated rather than being reattempted with another backend. Repartitioning pruned nodes can be added later.
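The ordering and pruning steps above can be sketched independently of the real partitioner classes. The helpers below are illustrative only, and the "auto" policy is shown here as a simple VGF-first choice, which is an assumption — the real policy may inspect the graph:

```python
from typing import Dict, List


def backend_order(prefer: str) -> List[str]:
    # Assumption: "auto" behaves like vgf_first in this sketch.
    if prefer in ("auto", "vgf_first"):
        return ["vgf", "vulkan"]
    if prefer == "vulkan_first":
        return ["vulkan", "vgf"]
    raise ValueError(f"unknown prefer mode: {prefer}")


def prune_small_partitions(
    partitions: Dict[str, List[List[str]]], min_partition_size: int
) -> Dict[str, List[List[str]]]:
    # Pruned nodes remain undelegated; they are not retried on the other backend.
    return {
        backend: [p for p in parts if len(p) >= min_partition_size]
        for backend, parts in partitions.items()
    }


order = backend_order("vulkan_first")  # ["vulkan", "vgf"]
pruned = prune_small_partitions(
    {"vgf": [["sin", "cos", "mul"], ["transpose"]], "vulkan": [["gather"]]},
    min_partition_size=2,
)
# The single-node "transpose" and "gather" islands fall below the threshold
# and are dropped, leaving them to the normal undelegated fallback path.
```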

Example export usage:

model = UnifiedVgfVulkanModel().eval()

example_inputs = (
    torch.linspace(
        -0.8,
        0.8,
        steps=1 * 3 * 8 * 8,
        dtype=torch.float32,
    ).reshape(1, 3, 8, 8),
)

unified_spec = UnifiedGpuCompileSpec(
    vgf=VgfCompileSpec(),
    vulkan_compile_options={},
    shared_context_token="vgf_vk_unified_e2e",
    shared_context_mode="lookup_or_create",
    shared_sync_mode="timeline",
    shared_group_id=7,
    min_partition_size=1,
    prefer="vgf_first",
)

exported = torch.export.export(model, example_inputs)

lowered = to_edge_transform_and_lower(
    exported,
    partitioner=[UnifiedGpuPartitioner(unified_spec)],
)

et_program = lowered.to_executorch()

with open("unified_vgf_vulkan.pte", "wb") as f:
    et_program.write_to_file(f)

Phase 1 deliverables

  • UnifiedGpuCompileSpec
  • Shared compile-spec helper API
  • UnifiedGpuPartitioner
  • Partition-order policy
  • Export example producing a mixed Vulkan/VGF .pte
  • Python tests for compile-spec generation and partitioner behavior

Phase 2 — Shared GPU runtime

Phase 2 introduces the backend-independent C++ runtime support under a shared GPU component.

Proposed files:

  • backends/gpu_shared/runtime/SharedGpuCompileSpec.h
  • backends/gpu_shared/runtime/SharedGpuCompileSpec.cpp
  • backends/gpu_shared/runtime/SharedGpuContext.h
  • backends/gpu_shared/runtime/SharedGpuContext.cpp
  • backends/gpu_shared/runtime/SharedGpuContextRegistry.h
  • backends/gpu_shared/runtime/SharedGpuContextRegistry.cpp

The runtime layer has three responsibilities.

First, parse the shared compile specs:

Result<SharedGpuCompileSpec> parse_shared_gpu_compile_spec(
    ArrayRef<const CompileSpec> compile_specs,
    BackendInitContext* context);

Second, represent a shared GPU context:

struct SharedGpuContextCreateInfo {
  std::string token;
  int64_t group_id = 0;

  VkInstance instance = VK_NULL_HANDLE;
  VkPhysicalDevice physical_device = VK_NULL_HANDLE;
  VkDevice device = VK_NULL_HANDLE;
  VkQueue queue = VK_NULL_HANDLE;
  uint32_t queue_family_index = 0;

  VkSemaphore timeline_semaphore = VK_NULL_HANDLE;

  bool owns_instance = false;
  bool owns_device = false;
  bool owns_timeline_semaphore = false;
};

Third, provide a process-local registry:

class SharedGpuContextRegistry final {
 public:
  static SharedGpuContextRegistry& Get();

  SharedGpuContextPtr lookup(const SharedGpuCompileSpec& spec);

  Result<SharedGpuContextPtr> lookup_or_create(
      const SharedGpuCompileSpec& spec,
      std::function<Result<SharedGpuContextPtr>()> create_fn);

  Error register_context(
      const SharedGpuCompileSpec& spec,
      SharedGpuContextPtr context);

  void clear_for_testing();
};

The registry key is:

std::string SharedGpuContextRegistry::make_key(
    const SharedGpuCompileSpec& spec) const {
  return spec.token + ":" + std::to_string(spec.group_id);
}

This phase should include unit tests for:

  • parsing all shared compile-spec keys;
  • default values;
  • invalid modes;
  • registry lookup;
  • lookup-or-create behavior;
  • duplicate registration behavior;
  • stale or invalid context handling, if weak ownership is used.

Example build/test commands:

cmake -S . -B cmake-out -GNinja \
  -DEXECUTORCH_BUILD_VULKAN=ON \
  -DEXECUTORCH_BUILD_VGF=ON \
  -DBUILD_TESTING=ON \
  -DEXECUTORCH_BUILD_TESTS=ON

cmake --build cmake-out --target \
  executorch_gpu_shared_runtime \
  shared_gpu_compile_spec_test \
  shared_gpu_context_registry_test \
  -j

ctest --test-dir cmake-out --output-on-failure -R \
  'shared_gpu_(compile_spec|context_registry)_test'

Phase 2 deliverables:

  • executorch_gpu_shared_runtime library
  • shared compile-spec parser
  • SharedGpuContext
  • SharedGpuContextRegistry
  • unit tests for compile spec parsing and registry behavior
  • no Vulkan or VGF behavior changes yet, except build linkage preparation if needed

Phase 3 — Vulkan backend changes

Phase 3 integrates the existing Vulkan backend with the shared GPU runtime.

The Vulkan backend should parse the shared GPU compile spec during VulkanBackend::init() and choose one of the following actions:

  • Use the existing Vulkan path when shared GPU is disabled.
  • Create a new Vulkan adapter and register a shared context.
  • Reuse an existing shared context by constructing a Vulkan adapter from it.
  • Return a compatibility error if lookup_only is requested but no context exists.
  • Return a compatibility error if create_only is requested but a context already exists for the same token/group.

The proposed helper type is:

struct ResolvedVulkanAdapter final {
  vkapi::Adapter* adapter = nullptr;
  bool owns_adapter = false;
  bool register_shared_context_after_compile = false;
  SharedGpuCompileSpec shared_spec;
};

A shared context can be created from an existing Vulkan adapter:

static SharedGpuContextPtr make_shared_context_from_adapter(
    vkapi::Adapter* adapter,
    const SharedGpuCompileSpec& spec) {
  auto queue = adapter->request_queue();

  SharedGpuContextCreateInfo create_info;
  create_info.token = spec.token;
  create_info.group_id = spec.group_id;
  create_info.instance = adapter->instance_handle();
  create_info.physical_device = adapter->physical_handle();
  create_info.device = adapter->device_handle();
  create_info.queue = queue.handle;
  create_info.queue_family_index = queue.family_index;
  create_info.owns_instance = false;
  create_info.owns_device = false;
  create_info.owns_timeline_semaphore = false;

  adapter->return_queue(queue);
  return std::make_shared<SharedGpuContext>(std::move(create_info));
}

And an adapter can be created from an existing shared context:

static vkapi::Adapter* make_adapter_from_shared_context(
    const SharedGpuContextPtr& shared_context) {
  return new vkapi::Adapter(
      shared_context->instance(),
      shared_context->physical_device(),
      shared_context->device(),
      shared_context->queue(),
      shared_context->queue_family_index(),
      "");
}

The Vulkan delegate handle needs to know whether it owns an externally created adapter:

struct VulkanDelegateHandle final {
  ComputeGraph* compute_graph = nullptr;
  vkapi::Adapter* owned_external_adapter = nullptr;
};

On teardown, the backend should destroy the ComputeGraph and delete the owned external adapter only if Vulkan created that adapter wrapper.

Phase 3 should include a unit test for the Vulkan shared-context adapter policy, for example:

cmake --build cmake-out --target \
  executorch_gpu_shared_runtime \
  vulkan_backend \
  shared_gpu_compile_spec_test \
  shared_gpu_context_registry_test \
  vulkan_shared_context_adapter_test \
  -j

ctest --test-dir cmake-out --output-on-failure -R \
  'shared_gpu_(compile_spec|context_registry)_test|vulkan_shared_context_adapter_test'

Phase 3 deliverables:

  • Vulkan backend links against executorch_gpu_shared_runtime
  • shared-context resolution helper
  • adapter-from-shared-context path
  • shared-context-from-adapter path
  • delegate handle ownership update
  • Vulkan shared-context adapter unit test
  • existing Vulkan-only behavior preserved when shared GPU compile specs are absent

Phase 4 — VGF backend changes

Phase 4 integrates the Arm VGF backend with the shared GPU runtime.

The VGF backend should parse the shared GPU compile spec during VGFBackend::init() and resolve a SharedGpuContextPtr.

The high-level initialization flow becomes:

auto maybe_shared_context = resolve_vgf_context(compile_specs, context);
if (!maybe_shared_context.ok()) {
  ET_LOG(Error, "Failed to resolve shared VGF context");
  return maybe_shared_context.error();
}

SharedGpuContextPtr shared_context = maybe_shared_context.get();
if (shared_context == nullptr || !shared_context->is_valid()) {
  ET_LOG(Error, "Resolved shared VGF context is invalid");
  return Error::Internal;
}

#if defined(USE_VULKAN_VOLK)
volkLoadDevice(shared_context->device());
#endif

VkDevice shared_device = shared_context->device();
VkResult result = vkml_load_extensions(&shared_device);
if (result != VK_SUCCESS) {
  ET_LOG(
      Error,
      "Failed to verify VKML extensions on shared device, error 0x%08X",
      result);
  return Error::NotSupported;
}

VkCommandPool command_pool = VK_NULL_HANDLE;
result = shared_context->create_command_pool(&command_pool);
if (result != VK_SUCCESS) {
  ET_LOG(
      Error,
      "Failed to create delegate command pool error 0x%08X",
      result);
  return Error::Internal;
}

The VGF delegate handle stores the shared context so the device remains valid for the lifetime of the VGF representation:

struct VgfDelegateHandle final {
  executorch::backends::gpu_shared::SharedGpuContextPtr shared_context;
  VkCommandPool command_pool = VK_NULL_HANDLE;
  VgfRepr* repr = nullptr;
};

VgfRepr continues to own the VGF-specific Vulkan objects:

class VgfRepr {
 public:
  VgfRepr(
      VkInstance inst,
      VkPhysicalDevice phys,
      VkDevice dev,
      VkQueue queue,
      VkCommandPool pool)
      : vk_instance(inst),
        vk_physical(phys),
        vk_device(dev),
        vk_queue(queue),
        vk_command_pool(pool) {}

  bool process_vgf(
      const char* vgf_data,
      executorch::runtime::ArrayRef<executorch::runtime::CompileSpec> specs);

  bool execute_vgf();

  ~VgfRepr() {
    free_vgf();
  }

 private:
  VkInstance vk_instance;
  VkPhysicalDevice vk_physical;
  VkDevice vk_device;
  VkQueue vk_queue;
  VkCommandPool vk_command_pool;

  VkCommandBuffer vk_execute_cmd = VK_NULL_HANDLE;
  VkDataGraphPipelineSessionARM vk_session = VK_NULL_HANDLE;
  VkPipeline vk_pipeline = VK_NULL_HANDLE;
  VkPipelineLayout vk_pipeline_layout = VK_NULL_HANDLE;
  VkDescriptorPool vk_descriptor_pool = VK_NULL_HANDLE;
  VkDescriptorSetLayout vk_layout = VK_NULL_HANDLE;
  VkShaderModule vk_shader = VK_NULL_HANDLE;
};

The VGF backend should continue to support non-shared mode by creating a local context when shared GPU compile specs are absent or disabled.

Example VGF context resolution:

static Result<SharedGpuContextPtr> resolve_vgf_context(
    ArrayRef<CompileSpec> compile_specs,
    BackendInitContext& context) {
  auto maybe_spec = parse_shared_gpu_compile_spec(
      ArrayRef<const CompileSpec>(compile_specs.data(), compile_specs.size()),
      &context);

  if (!maybe_spec.ok()) {
    return maybe_spec.error();
  }

  const SharedGpuCompileSpec& spec = maybe_spec.get();

  if (!spec.enabled()) {
    SharedGpuCompileSpec local_spec;
    local_spec.token = "vgf-local";
    local_spec.group_id = 0;
    local_spec.context_mode = SharedContextMode::kDisabled;
    return create_shared_context_from_vkml(local_spec);
  }

  auto& registry = SharedGpuContextRegistry::Get();

  if (spec.lookup_only()) {
    auto ctx = registry.lookup(spec);
    if (ctx == nullptr) {
      return Error::NotFound;
    }
    return ctx;
  }

  if (spec.lookup_or_create()) {
    return registry.lookup_or_create(
        spec, [spec]() -> Result<SharedGpuContextPtr> {
          return create_shared_context_from_vkml(spec);
        });
  }

  if (spec.create_only()) {
    auto ctx = create_shared_context_from_vkml(spec);
    if (!ctx.ok()) {
      return ctx.error();
    }
    Error err = registry.register_context(spec, ctx.get());
    if (err != Error::Ok) {
      return err;
    }
    return ctx.get();
  }

  return Error::InvalidArgument;
}

Phase 4 deliverables:

  • VGF backend links against executorch_gpu_shared_runtime
  • VGF backend parses shared GPU compile specs
  • VGF backend creates or reuses SharedGpuContext
  • VGF backend creates one command pool per delegate instance
  • VGF delegate handle keeps the shared context alive
  • runtime logs for init/execute/destroy paths
  • mixed Vulkan/VGF end-to-end runtime example
  • existing VGF-only behavior preserved when shared GPU compile specs are absent

End-to-end example

The intended end-to-end workflow is:

python export_unified_vgf_vulkan_e2e.py \
  --out_dir /tmp/unified_gpu \
  --token vgf_vk_unified_e2e \
  --group_id 7 \
  --prefer vgf_first \
  --min_partition_size 1 \
  --sync_mode timeline

This writes:

/tmp/unified_gpu/unified_vgf_vulkan.pte
/tmp/unified_gpu/input0.bin
/tmp/unified_gpu/expected0.bin

Check that the .pte contains both backends:

strings /tmp/unified_gpu/unified_vgf_vulkan.pte | grep -E 'VgfBackend|VulkanBackend'

Run with executor_runner:

executor_runner \
  --model_path /tmp/unified_gpu/unified_vgf_vulkan.pte \
  --input_list /tmp/unified_gpu/input0.bin \
  --output_path /tmp/unified_gpu/unified_gpu_output

Expected logs include both initialization and execution paths:

[VGF] init: processed=... compile_specs=...
[Vulkan] init: processed=... compile_specs=...
[VGF] execute: handle=... args=...
[Vulkan] execute: handle=... args=...
[VGF] execute: handle=... args=...

The correctness check compares executor_runner output against PyTorch eager output generated during export.
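The comparison itself can be as simple as reading both float32 blobs and checking elementwise closeness. A stdlib-only sketch, where the file names follow the example above and the tolerances are an assumption:

```python
import math
import struct
from typing import List


def read_f32(path: str) -> List[float]:
    """Read a little-endian float32 binary blob into a list of floats."""
    with open(path, "rb") as f:
        data = f.read()
    return list(struct.unpack(f"<{len(data) // 4}f", data))


def allclose(a: List[float], b: List[float], rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    # Elementwise closeness check; tolerance values are illustrative.
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rtol, abs_tol=atol) for x, y in zip(a, b)
    )


# expected0.bin comes from PyTorch eager during export;
# unified_gpu_output is written by executor_runner.
# ok = allclose(read_f32(".../expected0.bin"), read_f32(".../unified_gpu_output"))
```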

Testing plan

Phase 1 tests:

  • compile-spec generation tests;
  • partition order tests for auto, vgf_first, and vulkan_first;
  • tests for min_partition_size;
  • test that the exported .pte contains both VgfBackend and VulkanBackend for the mixed example model.

Phase 2 tests:

  • shared_gpu_compile_spec_test
  • shared_gpu_context_registry_test

Phase 3 tests:

  • vulkan_shared_context_adapter_test

Phase 4 tests:

  • VGF init with shared context disabled;
  • VGF init with lookup_or_create;
  • VGF init with lookup_only;
  • VGF behavior when required VGF/Vulkan extension function pointers are unavailable;
  • mixed VGF/Vulkan runtime smoke test with executor_runner;
  • numerical comparison against PyTorch eager output.

Combined CMake example:

cmake -S . -B cmake-out -GNinja \
  -DEXECUTORCH_BUILD_VULKAN=ON \
  -DEXECUTORCH_BUILD_VGF=ON \
  -DBUILD_TESTING=ON \
  -DEXECUTORCH_BUILD_TESTS=ON

cmake --build cmake-out --target \
  executorch_gpu_shared_runtime \
  vulkan_backend \
  vgf_backend \
  shared_gpu_compile_spec_test \
  shared_gpu_context_registry_test \
  vulkan_shared_context_adapter_test \
  -j

ctest --test-dir cmake-out --output-on-failure -R \
  'shared_gpu_(compile_spec|context_registry)_test|vulkan_shared_context_adapter_test'

Backward compatibility

This proposal should preserve current behavior when shared GPU compile specs are not present.

Expected behavior:

  • Existing Vulkan-only .pte files continue to run through the Vulkan backend.
  • Existing VGF-only .pte files continue to run through the VGF backend.
  • The unified path is opt-in through the unified partitioner and shared GPU compile specs.
  • Shared runtime code should not affect CPU-only builds.
  • Vulkan and VGF should still be build-time optional.

Risks and mitigations

Risk: backend lifetime and teardown ordering

Both delegates may reference the same underlying Vulkan device. The shared context object must keep the device alive for all delegate users. Delegate handles should hold a SharedGpuContextPtr to extend lifetime through teardown.

Risk: stale registry entries

If the registry stores weak references, stale entries must be detected and erased. If it stores strong references, ownership and teardown must be explicit enough to avoid leaking contexts longer than intended.
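The weak-ownership variant can be modeled with weakref: delegate handles hold the strong references, and the registry erases an entry once the last owner is gone. This is a Python analogue of the idea, not the proposed C++ code:

```python
import weakref
from typing import Dict, Optional, Tuple


class Context:
    """Stand-in for a shared GPU context object."""


class WeakContextRegistry:
    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], "weakref.ref"] = {}

    def register(self, token: str, group_id: int, ctx: Context) -> None:
        self._entries[(token, group_id)] = weakref.ref(ctx)

    def lookup(self, token: str, group_id: int) -> Optional[Context]:
        ref = self._entries.get((token, group_id))
        if ref is None:
            return None
        ctx = ref()
        if ctx is None:
            # All strong owners (delegate handles) are gone: erase the stale entry.
            del self._entries[(token, group_id)]
        return ctx


registry = WeakContextRegistry()
ctx = Context()
registry.register("scene0", 2, ctx)
assert registry.lookup("scene0", 2) is ctx
del ctx  # last strong reference dropped
assert registry.lookup("scene0", 2) is None  # stale entry detected and erased
```

In C++ the same shape would use std::weak_ptr in the registry and SharedGpuContextPtr (a shared_ptr) in each delegate handle.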

Risk: command pool sharing

Vulkan command pools should not be shared across delegate instances. The shared context should provide create_command_pool() and each backend should own/destroy its own pool.

Risk: synchronization semantics

The first implementation may use conservative synchronization where needed. The compile spec carries shared_sync_mode so timeline-semaphore integration can be made explicit and tested.

Risk: tiny delegate islands

Very small partitions may regress performance or complicate debugging. gpu_shared_min_partition_size lets export remove tiny partitions.

Risk: unsupported operators in both backends

If neither backend supports an operator, it should remain undelegated and execute through the normal fallback path if available. The unified partitioner should make this explicit in logs/tests.

Future work

  • Zero-copy tensor handoff between Vulkan and VGF delegate regions.
  • Shared external constant storage, for example using .pds, so constants used by both delegates do not need to be embedded/transformed twice.
  • More advanced repartitioning after tiny-partition pruning.
  • Broader model coverage, including larger CNN/detection examples.
  • CI coverage for the mixed Vulkan/VGF export example.
  • CI coverage for runtime execution where Vulkan and VGF support is available.
  • User-facing documentation and tutorial.
  • Performance measurement once correctness and lifecycle semantics are stable.

Open questions

  • Should the default prefer="auto" policy be VGF-first for quantized/TOSA-friendly regions and Vulkan-first otherwise?
  • Should pruned small partitions be retried with the other backend, or should they remain undelegated?
  • Should create_only fail if a context already exists for the same token/group, or should users be expected to choose a distinct group_id?
  • Where should the mixed Vulkan/VGF end-to-end example live: under backend tests, examples, or both?
  • What CI environment should be used for VGF runtime coverage, given that VGF requires an emulation layer?
  • Should the public API expose UnifiedGpuCompileSpec from a backend-neutral package path such as executorch.backends.gpu.unified, or keep it under the Arm backend tree initially?


cc @SS-JIA @manuelcandales @digantdesai @cbilgin @freddan80 @per @zingo @oscarandersson8218 @mansnils @Sebastian-Larsson @robell

Metadata

Labels: module: arm (Issues related to arm backend), module: vulkan (Issues related to the Vulkan delegate and code under backends/vulkan/), partner: arm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm)
