@SS-JIA SS-JIA commented Jul 28, 2025

Summary:

Context

Operator implementations in the Vulkan delegate may require that input and output tensors use a specific representation. Representation in this case refers to a combination of storage type (buffer or texture) and memory layout (width, height, or channels packed).

The tag memory metadata pass is responsible for marking each tensor in the graph with the appropriate representation to use. It is also responsible for inserting operators to transition argument tensors to a required/compatible representation if a mismatch has been detected.

The memory metadata tagging pass uses the operator registry to determine what tensor representations are valid for the inputs and outputs of a given operator. When operators are registered, fields like has_buffer_impl, texture_impl, optimal_storage, etc. are used to annotate what tensor representations are supported by a given operator.

However, the current implementation of the operator registry and the memory metadata tagging pass assumes that all tensors participating in a given operator must use the same representation. As of late, quantization and normalization operators have been added that break this assumption; their implementations require certain inputs/outputs to use specific tensor representations, which do not need to be the same as other tensors participating in the op.

The goal of this diff is to introduce a more flexible way to express the tensor representation requirements of an operator, and to re-implement the memory metadata tagging pass to account for the fact that certain input/output tensors require a specific representation.

More specifically, this is required to unblock dynamic quantization since some quantized operator implementations need scales/zeros to be contiguous buffers, regardless of the representation used for other tensors.

Changes

Introduce several utility classes to aid in expressing the possible representations of a tensor.

TensorRepr represents a pair of storage type + memory layout which describes the representation to use for a single tensor.

TensorRepSet represents the set of possible representations that may be used for a single tensor. This is needed because a given operator may support multiple different representations.

OpRepSet maintains the set of possible representations (i.e. RepSets) for all tensors participating in an operator.

Please see the docstrings for these new classes for more context.
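As a rough sketch of how these abstractions fit together (all class names are from this diff, but the fields and methods shown here are illustrative, not the actual definitions):

```python
from dataclasses import dataclass
from enum import Enum, auto


class StorageType(Enum):
    BUFFER = auto()
    TEXTURE = auto()


class MemoryLayout(Enum):
    WIDTH_PACKED = auto()
    HEIGHT_PACKED = auto()
    CHANNELS_PACKED = auto()


@dataclass(frozen=True)
class TensorRepr:
    """A single (storage type, memory layout) pair for one tensor."""

    storage: StorageType
    layout: MemoryLayout


class TensorRepSet:
    """The set of representations that are acceptable for one tensor."""

    def __init__(self, reprs: frozenset):
        self.reprs = reprs

    def intersect(self, other: "TensorRepSet") -> "TensorRepSet":
        # Constraining a tensor's representation is set intersection.
        return TensorRepSet(self.reprs & other.reprs)

    def is_ambiguous(self) -> bool:
        # More than one representation is still possible.
        return len(self.reprs) > 1
```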

All functionality related to determining or checking tensor representation is now centered around the new OpRepSet class, which automatically maintains rules about which tensors in an operator should use the same representation and provides utilities to constrain representation sets based on pre-existing input representations.

The tag_memory_metadata_pass.py has been rewritten to use the OpRepSet utility class.

Another consequence of these changes is a simpler operator registration API. Instead of defining texture_impl and buffer_impl separately, registration now directly specifies which storage types are valid for inputs and outputs. Sync rules that require inputs/outputs to use the same representation are inferred.

Differential Revision: D79116560


pytorch-bot bot commented Jul 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12927

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a55bc0c with merge base d4c78ab:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 28, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79116560


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@SS-JIA SS-JIA force-pushed the export-D79116560 branch from f763776 to 4adc4b0 Compare July 29, 2025 14:55
SS-JIA added a commit to SS-JIA/executorch-1 that referenced this pull request Jul 29, 2025
Summary:

## Context

In ET-VK, tensors may be stored with either a GPU buffer or a GPU texture. They may also be stored with a specific memory layout: width packed, height packed, or channels packed. The memory layout controls which dimension will have its elements be adjacent in physical memory.

In this way, the "representation" of tensors in ET-VK may be described with a (storage type, memory layout) pair.

Operator implementations may only support certain tensor representations for inputs and outputs. Furthermore, implementations typically have expectations around which input/output tensors will share the same representation.

Some examples:

* Binary Operators:
  * I/O tensors may use any representation; however, all tensors in the op must use the same one, i.e. if the first input tensor uses buffer storage, so must the other input and the output tensor
* Native Group Norm:
  * Input tensors must be a channels packed texture. However, the op produces 3 outputs: the normalized tensor, the running mean, and the running stddev. The normalized tensor must use the same representation as the first input. However, the mean and stddev tensors are expected to be contiguous buffers.
* Choose qparams:
  * The input tensor can use any representation. However, the two output tensors (zero points and scales) will always be contiguous buffers.
* Dynamically quantized linear:
  * The input tensor can be either buffer or texture, but must be contiguous/width packed. The scales and zeros tensors for the inputs and weights must all be contiguous buffers. The output tensor must be the same representation as the input tensors.

The operator registry (`op_registry.py`) is responsible for denoting these representational requirements for each op, and the `tag_memory_metadata_pass.py` graph pass is responsible for determining what representation each tensor in each operator should use.  The graph pass is also responsible for inserting nodes to move input arguments to a required representation, if they have been created with a non-supported representation.

## Current Method

Currently, the operator registry will indicate the following:

* Whether texture inputs are supported for the op
  * If yes, which texture memory layouts are supported for inputs to the op
* Whether buffer inputs are supported for the op
* An "optimal" storage type and memory layout to use for inputs/outputs of the operator

The underlying assumption is that all tensors participating in an operator use the same representation. Although this assumption holds true for most operators, it is clearly insufficient for some of the example operators described above, where certain tensors require specific representations that differ from those of other tensors in the op.

During export, the memory metadata tagging pass will go through each op and mark the tensors participating in the op with a valid representation for that op. It will ensure that all tensors participating in an op will use the same representation. To determine the representation to use, it accounts for three things in order of priority:

* The "optimal" storage type and memory layout marked for the op in the operator registry
* Any existing representations that have already been determined for input tensors
* What representations are supported by users of the output tensor of the current op

## Goals of this diff

The main goal of this diff is to address the problem that the current method of annotating tensor representation requirements for operators cannot describe the actual requirements of operator implementations like those above.


Critically, for operators like choose_qparams and dynamically quantized linear, the current system cannot ensure that all input/output tensors are using representations that are supported by the op impl, since the current system tries to make all tensors participating in an operator use the same representation.

## Changes

### `utils.py`

First, in `utils.py` I introduce several classes to abstract the concept of tensor representations and sets of possible tensor representations.

`TensorRepr` represents a pair of storage type + memory layout which describes the representation to use for a single tensor.

`TensorRepSet` represents the set of possible representations that may be used for a single tensor.

`OpRepSet` manages the set of possible representations (i.e. `TensorRepSet`s) for all tensors participating in an operation. To do this, it accounts for 3 things:

* The supported tensor representations for inputs/outputs, as denoted by the operator registration
* The actual sizes of the tensors - some tensors may have dims that are too large to fit into a texture
* Sync requirements, i.e. requirements re: which tensors in the operation must use the same representation

For the last point, `OpRepSet` accounts for three "rules" internally:

* All input tensors must use the same representation
* All output tensors must use the same representation
* The "primary" (i.e. first) input and output tensors must use the same representation

I have settled on these three rules for now since they adequately describe the possible requirements of all operators.

These three rules are validated to be true at all times within `OpRepSet`. Since `TensorRepSet`s may be ambiguous (i.e. there are multiple possible representations that could be used), `OpRepSet` also provides utility functions to constrain the possible representation set of an input tensor while maintaining the synchronization rules.
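A minimal sketch of how such a constrain utility might propagate a constraint while keeping the sync rules intact (RepSets are modelled as plain frozensets of (storage, layout) pairs; the real `OpRepSet` tracks considerably more state):

```python
# Hypothetical sketch of constraint propagation in OpRepSet; method and
# field names here are illustrative, not the actual API.
class OpRepSet:
    def __init__(self, input_sets, output_sets, sync_inputs, sync_outputs, sync_primary_io):
        self.inputs = list(input_sets)
        self.outputs = list(output_sets)
        self.sync_inputs = sync_inputs          # rule 1: all inputs match
        self.sync_outputs = sync_outputs        # rule 2: all outputs match
        self.sync_primary_io = sync_primary_io  # rule 3: first input == first output

    def constrain_input(self, idx, allowed):
        # Narrow one input's representation set...
        narrowed = self.inputs[idx] & allowed
        if not narrowed:
            raise ValueError(f"no valid representation remains for input {idx}")
        self.inputs[idx] = narrowed
        # ...then propagate so the three sync rules stay satisfied.
        if self.sync_inputs:
            self.inputs = [s & narrowed for s in self.inputs]
        if self.sync_primary_io:
            self.outputs[0] &= self.inputs[0]
        if self.sync_outputs:
            self.outputs = [s & self.outputs[0] for s in self.outputs]
```

For a binary op (all three rules active), constraining one input to buffer storage forces every other tensor in the op down to buffer storage as well.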

I have also defined `TensorRepSet` instances like:

* `utils.ANY_STORAGE`
* `utils.CONTIGUOUS_BUFFER`
* `utils.CHANNELS_PACKED_TEXTURE`

as convenience definitions for common representation set configurations.
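For illustration, these convenience sets could be thought of as follows (hypothetical values modelled as frozensets of (storage, layout) string pairs; the real definitions live in `utils.py` and may differ):

```python
# Hypothetical sketch of the convenience TensorRepSet instances.
BUFFER, TEXTURE = "buffer", "texture"
WIDTH, HEIGHT, CHANNELS = "width_packed", "height_packed", "channels_packed"

# A contiguous buffer is a width packed buffer.
CONTIGUOUS_BUFFER = frozenset({(BUFFER, WIDTH)})
CHANNELS_PACKED_TEXTURE = frozenset({(TEXTURE, CHANNELS)})
# Contiguous/width packed in either storage type.
CONTIGUOUS_ANY = frozenset({(BUFFER, WIDTH), (TEXTURE, WIDTH)})
# Every (storage, layout) combination.
ANY_STORAGE = frozenset(
    (s, l) for s in (BUFFER, TEXTURE) for l in (WIDTH, HEIGHT, CHANNELS)
)
# Used for prepacked weights and non-tensor arguments.
NO_STORAGE = frozenset()
```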

### `op_registry.py`

Now, in `op_registry.py` operator registrations only need to define 2 things: `inputs_storage` and optionally `outputs_storage`, which describe the possible representation sets that may be used for input and output tensors.

The registrations for each example operator would be:

```
# binary ops
def register_binary_op():
    return OpFeatures(
        inputs_storage=utils.ANY_STORAGE,
        supports_resize=True,
    )

# group norm
def register_native_group_norm():
    return OpFeatures(
        inputs_storage=utils.CHANNELS_PACKED_TEXTURE,
        outputs_storage=[
            utils.CHANNELS_PACKED_TEXTURE,
            utils.CONTIGUOUS_BUFFER,
            utils.CONTIGUOUS_BUFFER,
        ],
        supports_prepacking=True,
    )

# choose qparams
@update_features(
    [
        exir_ops.edge.torchao.choose_qparams_affine.default,
    ]
)
def register_torchao_quantization_op():
    return OpFeatures(
        inputs_storage=utils.CONTIGUOUS_ANY,
        outputs_storage=utils.CONTIGUOUS_BUFFER,
        supports_resize=True,
    )

# DQ-Linear
def register_linear_qta8a_qga4w_op():
    return OpFeatures(
        inputs_storage=[
            utils.CONTIGUOUS_ANY,     # input
            utils.CONTIGUOUS_BUFFER,  # mat1 scales
            utils.CONTIGUOUS_BUFFER,  # mat1 zeros
            utils.NO_STORAGE,         # weight (prepacked)
            utils.NO_STORAGE,         # group size (non tensor)
            utils.CONTIGUOUS_BUFFER,  # mat2 scales
            utils.CONTIGUOUS_BUFFER,  # mat2 zeros
        ],
        supports_resize=True,
        supports_prepacking=True,
    )
```

The 3 synchronization rules are inferred from the defined `inputs_storage` and `outputs_storage`:

* If no `outputs_storage` is defined, assume that the `outputs_storage` is the same as the first `TensorRepSet` in `inputs_storage`. This also implies that the primary input and output need to be synced.
* If `inputs_storage` only contains a single `TensorRepSet`, it is assumed that all input tensors need to be synchronized.
* Similarly, if `outputs_storage` only contains a single `TensorRepSet`, it is assumed that all output tensors need to be synchronized.
* If the first entry in `inputs_storage` and `outputs_storage` are the same, assume that the primary input and output need to be synced.
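Sketched as code (the function name and the representation of RepSets as frozensets are hypothetical, but the logic mirrors the bullets above):

```python
# Hypothetical sketch of sync-rule inference. A bare (non-list) value
# means a single RepSet shared by every input or output tensor.
def infer_sync_rules(inputs_storage, outputs_storage=None):
    inputs = inputs_storage if isinstance(inputs_storage, list) else [inputs_storage]
    if outputs_storage is None:
        # Outputs default to the primary input's RepSet, which also
        # forces primary input/output sync below.
        outputs = [inputs[0]]
    else:
        outputs = outputs_storage if isinstance(outputs_storage, list) else [outputs_storage]
    sync_all_inputs = len(inputs) == 1
    sync_all_outputs = len(outputs) == 1
    sync_primary_io = inputs[0] == outputs[0]
    return sync_all_inputs, sync_all_outputs, sync_primary_io
```

Applied to the example registrations: binary ops sync everything, group norm syncs inputs and primary I/O but not all outputs, and choose qparams does not sync the primary input with its buffer outputs.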


### `tag_memory_metadata_pass.py`

The `tag_memory_metadata_pass.py` pass maintains the same scope and behaviour as before, but it has been almost completely rewritten to use the `OpRepSet` utility class. It goes through the same steps as before:

* For each operator, determine the initial `OpRepSets`
* Constrain the initial `OpRepSets` by checking any existing representations of input tensors, and checking future uses of the output tensor(s) to try and reduce the number of representation transitions needed
* Set the representation of each input/output tensor in the operator. If an input tensor requires a different representation than it currently has, insert a clone node to transition the arg to the required representation.
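A toy illustration of the last step, using strings as stand-in representations (the real pass operates on FX graph nodes and inserts actual clone operators; the names below are hypothetical):

```python
# Toy illustration of the final step: `ops` is a list of
# (name, input_names, required_repr) tuples and `chosen` maps tensor
# name -> representation already assigned.
def tag_and_insert_transitions(ops, chosen):
    out = []
    for name, inputs, required in ops:
        new_inputs = []
        for arg in inputs:
            if chosen.get(arg, required) != required:
                # Mismatch detected: insert a clone op that transitions
                # the argument to the required representation.
                clone_name = f"{arg}_as_{required}"
                out.append(("clone", [arg], required))
                chosen[clone_name] = required
                arg = clone_name
            new_inputs.append(arg)
        chosen[name] = required  # the op's output uses the required repr
        out.append((name, new_inputs, required))
    return out
```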

Differential Revision: D79116560
@SS-JIA SS-JIA force-pushed the export-D79116560 branch from 4adc4b0 to eee719f Compare July 29, 2025 19:04
SS-JIA added a commit to SS-JIA/executorch-1 that referenced this pull request Jul 29, 2025

SS-JIA added a commit to SS-JIA/executorch-1 that referenced this pull request Jul 29, 2025
@SS-JIA SS-JIA force-pushed the export-D79116560 branch from eee719f to 6afebd4 Compare July 29, 2025 19:08
@SS-JIA SS-JIA force-pushed the export-D79116560 branch from 6afebd4 to 2000369 Compare July 30, 2025 22:10
SS-JIA added a commit to SS-JIA/executorch-1 that referenced this pull request Jul 30, 2025
Reviewed By: trivedivivek

Differential Revision: D79116560
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D79116560

SS-JIA added a commit to SS-JIA/executorch-1 that referenced this pull request Jul 30, 2025
Summary:
Pull Request resolved: pytorch#12927
Reviewed By: trivedivivek

Differential Revision: D79116560
@SS-JIA SS-JIA force-pushed the export-D79116560 branch from 2000369 to 248e962 Compare July 30, 2025 22:14
SS-JIA added a commit to SS-JIA/executorch-1 that referenced this pull request Jul 30, 2025
@SS-JIA SS-JIA force-pushed the export-D79116560 branch from 248e962 to db8b83a Compare July 30, 2025 23:52
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D79116560

Summary:

## Context

In ET-VK, tensors may be stored with either a GPU buffer or a GPU texture. They may also be stored with a specific memory layout: width packed, height packed, or channels packed. The memory layout controls which dimension will have its elements be adjacent in physical memory.

In this way, the "representation" of tensors in ET-VK may be described with a storage type, memory layout pair.

Operator implementations may only support certain tensor representations for inputs and outputs. Furthermore, implementations typically have expectations around which input/output tensors will share the same representation.

Some examples:

* Binary Operators:
  * I/O tensors may use any representation; however, all tensors in the op must use the same representation. i.e. If the first input tensor uses buffer storage, so must the other tensor and the output tensor
* Native Group Norm:
  *Input tensors must be a channels packed texture. However, the op produces 3 outputs: the normalized tensor, the running mean, and the running stddev. The normalized tensor must use the same representation as the first input. However, the mean and stddev tensors are expected to be contiguous buffers.
* Choose qparams:
  * The Input tensor can use any representation. However, the two output tensors (zero points and scales) will always be contiguous buffers
* Dynamically quantized linear:
  * The input tensor can be either buffer or texture, but must be contiguous/width packed. The scales and zeros tensors for the inputs and weights must all be contiguous buffers. The output tensor must be the same representation as the input tensors.

The operator registry (`op_registry.py`) is responsible for denoting these representational requirements for each op, and the `tag_memory_metadata_pass.py` graph pass is responsible for determining what representation each tensor in each operator should use.  The graph pass is also responsible for inserting nodes to move input arguments to a required representation, if they have been created with a non-supported representation.

## Current Method

Currently, the operator registry will indicate the following:

* Are texture inputs supported for the op
  * If yes, which texture memory layouts are supported for inputs to the op
* Are buffer inputs supported for the op
* An "optimal" storage type and memory layout to use for inputs/outputs of the operator.

The underlying assumption is that all tensors participating in an operator will use the same representation for all tensors. Although this assumption holds true for most operators, this assumption is clearly insufficient for some of the example operators described above, where some input tensors may require that certain inputs use specific representations that are different from other tensors.

During export, the memory metadata tagging pass will go through each op and mark the tensors participating in the op with a valid representation for that op. It will ensure that all tensors participating in an op will use the same representation. To determine the representation to use, it accounts for three things in order of priority:

* The "optimal" storage type and memory layout marked for the op in the operator registry
* Any existing representation that have already been determined for input tensors
* What representations are supported by users of the output tensor of the current op

## Goals of this diff

The main goal of this diff is to address the problem that the current method of annotating tensor representation requirements for operators is insufficient for describing the tensor representation requirements for operator implementation.


Critically, for operators like choose_qparams and dynamically quantized linear, the current system cannot ensure that all input/output tensors are using representations that are supported by the op impl, since the current system tries to make all tensors participating in an operator use the same representation.

## Changes

### `utils.py`

First, in 'utils.py` I introduce several classes to abstract the concept of tensor representations and sets of possible tensor representations.

`TensorRepr` represents a pair of storage type + memory layout which describes the representation to use for a single tensor.

`TensorRepSet` represents the set of possible representations that may be used for a single tensor.
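A minimal sketch of what these two classes could look like. The class and member names below are illustrative assumptions based on the descriptions in this diff, not the actual `utils.py` definitions:

```python
from dataclasses import dataclass
from enum import Enum, auto


class StorageType(Enum):
    BUFFER = auto()
    TEXTURE_3D = auto()


class MemoryLayout(Enum):
    WIDTH_PACKED = auto()
    HEIGHT_PACKED = auto()
    CHANNELS_PACKED = auto()


@dataclass(frozen=True)
class TensorRepr:
    # A single concrete representation: storage type + memory layout.
    storage: StorageType
    layout: MemoryLayout


class TensorRepSet:
    # The set of representations a single tensor may legally use.
    def __init__(self, reprs):
        self.reprs = frozenset(reprs)

    def intersect(self, other):
        # Constrain this set to representations allowed by both sets.
        return TensorRepSet(self.reprs & other.reprs)

    def is_ambiguous(self):
        return len(self.reprs) > 1

    def is_empty(self):
        return len(self.reprs) == 0
```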

`OpRepSet` manages the set of possible representations (i.e. `TensorRepSet`s) for all tensors participating in an operation. To do this, it accounts for three things:

* The supported tensor representations for input/output that are denoted by the operator registration
* The actual sizes of the tensors; some tensors may have dims that are too large to fit into a texture.
* Sync requirements, i.e. requirements re: which tensors in the operation must use the same representation

For the last point, `OpRepSet` accounts for three "rules" internally:

* All input tensors must use the same representation
* All output tensors must use the same representation
* The "primary" (i.e. first) input and output tensors must use the same representation

I have settled on these three rules for now since they adequately describe the possible requirements of all operators.

These three rules are validated to be true at all times within `OpRepSet`. Since `TensorRepSet`s may be ambiguous (i.e. there are multiple possible representations that could be used), `OpRepSet` also provides utility functions to constrain the possible representation sets of an op's tensors while maintaining the synchronization rules.
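A minimal sketch (hypothetical names, not the real `OpRepSet` API) of how constraining one tensor's representation set could propagate to the tensors it must stay synchronized with. Representations are modeled as plain sets of strings for brevity:

```python
# Sketch: constrain one tensor's representation set and re-apply the same
# constraint to every member of its sync group, so the synchronization rules
# stay satisfied.

def constrain_synced(rep_sets, sync_group, idx, allowed):
    """Intersect rep_sets[idx] with `allowed`; if idx belongs to sync_group,
    apply the same intersection to every member of that group."""
    group = sync_group if idx in sync_group else {idx}
    new_sets = list(rep_sets)
    for i in group:
        new_sets[i] = rep_sets[i] & allowed
        if not new_sets[i]:
            # No representation satisfies both the op and the constraint;
            # the real pass would insert a transition instead.
            raise ValueError(f"no valid representation left for tensor {i}")
    return new_sets
```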

I have also defined `TensorRepSet` instances like:

* `utils.ANY_STORAGE`
* `utils.CONTIGUOUS_BUFFER`
* `utils.CHANNELS_PACKED_TEXTURE`

as convenience definitions for common representation set configurations.

### `op_registry.py`

Now, in `op_registry.py` operator registrations only need to define two things: `inputs_storage` and optionally `outputs_storage`, which describe the possible representation sets that may be used for input and output tensors.

The registrations for each example operator would be:

```python
# binary ops
def register_binary_op():
    return OpFeatures(
        inputs_storage=utils.ANY_STORAGE,
        supports_resize=True,
    )

# group norm
def register_native_group_norm():
    return OpFeatures(
        inputs_storage=utils.CHANNELS_PACKED_TEXTURE,
        outputs_storage=[
            utils.CHANNELS_PACKED_TEXTURE,
            utils.CONTIGUOUS_BUFFER,
            utils.CONTIGUOUS_BUFFER,
        ],
        supports_prepacking=True,
    )

# choose qparams
@update_features(
    [
        exir_ops.edge.torchao.choose_qparams_affine.default,
    ]
)
def register_torchao_quantization_op():
    return OpFeatures(
        inputs_storage=utils.CONTIGUOUS_ANY,
        outputs_storage=utils.CONTIGUOUS_BUFFER,
        supports_resize=True,
    )

# DQ-Linear
def register_linear_qta8a_qga4w_op():
    return OpFeatures(
        inputs_storage=[
            utils.CONTIGUOUS_ANY,     # input
            utils.CONTIGUOUS_BUFFER,  # mat1 scales
            utils.CONTIGUOUS_BUFFER,  # mat1 zeros
            utils.NO_STORAGE,         # weight (prepacked)
            utils.NO_STORAGE,         # group size (non tensor)
            utils.CONTIGUOUS_BUFFER,  # mat2 scales
            utils.CONTIGUOUS_BUFFER,  # mat2 zeros
        ],
        supports_resize=True,
        supports_prepacking=True,
    )
```

The 3 synchronization rules are inferred from the defined `inputs_storage` and `outputs_storage`:

* If no `outputs_storage` is defined, assume that `outputs_storage` is the same as the first `TensorRepSet` in `inputs_storage`. This also implies that the primary input and output need to be synced.
* If `inputs_storage` contains only a single `TensorRepSet`, it is assumed that all input tensors need to be synchronized.
* Similarly, if `outputs_storage` contains only a single `TensorRepSet`, it is assumed that all output tensors need to be synchronized.
* If the first entries in `inputs_storage` and `outputs_storage` are the same, assume that the primary input and output need to be synced.
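The inference rules above can be sketched as a small helper. This is an illustrative assumption about the logic, not the actual registry code; `TensorRepSet`s are modeled as plain strings:

```python
# Sketch: infer the three synchronization rules (sync all inputs, sync all
# outputs, sync primary input/output) from the registration fields.

def infer_sync_rules(inputs_storage, outputs_storage=None):
    sync_primary_io = False
    if outputs_storage is None:
        # No outputs_storage: outputs mirror the first input rep set,
        # which also forces the primary input/output to be synced.
        outputs_storage = [inputs_storage[0]]
        sync_primary_io = True
    sync_all_inputs = len(inputs_storage) == 1
    sync_all_outputs = len(outputs_storage) == 1
    if inputs_storage[0] == outputs_storage[0]:
        sync_primary_io = True
    return sync_all_inputs, sync_all_outputs, sync_primary_io
```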


### `tag_memory_metadata_pass.py`

The `tag_memory_metadata_pass.py` maintains the same scope and behaviour as before, but it has been almost completely re-written to use the `OpRepSet` utility class. It goes through the same steps as before:

* For each operator, determine the initial `OpRepSets`
* Constrain the initial `OpRepSets` by checking any existing representations of input tensors, and checking future uses of the output tensor(s) to try and reduce the number of representation transitions needed
* Set the representation of each input/output tensor in the operator. If an input tensor requires a different representation than it currently has, insert a clone node to transition the arg to the required representation.
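The steps above can be sketched as a simplified loop. All names here are illustrative assumptions; the real pass operates on an exported graph and inserts actual clone nodes rather than recording tuples:

```python
# Sketch of the tagging loop: pick a representation per argument, and when an
# argument already carries a different representation, record a transition
# (where the real pass would insert a clone node).

def tag_and_insert_transitions(ops, pick_repr, current_repr):
    transitions = []
    for op in ops:
        for arg in op["inputs"]:
            wanted = pick_repr(op, arg)
            have = current_repr.get(arg)
            if have is not None and have != wanted:
                # Mismatch: a clone/transition node would be inserted here.
                transitions.append((arg, have, wanted))
            current_repr[arg] = wanted
        for out in op["outputs"]:
            current_repr[out] = pick_repr(op, out)
    return transitions
```

For example, an op producing a texture tensor consumed by an op requiring a buffer yields one transition: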

Reviewed By: trivedivivek

Differential Revision: D79116560
facebook-github-bot merged commit bedce91 into pytorch:main on Jul 31, 2025.

Pull Request resolved: pytorch#12927