
Conversation

@kirklandsign
Contributor

Stack from ghstack (oldest at bottom):

Context

This diff makes a major change to how buffer-backed tensors are handled in the Vulkan delegate; in particular, zero padding will no longer be applied to the packed dimension.

Previously, zero padding was applied to the packed dimension of buffer-backed tensors in order to stay consistent with texture-backed tensors (which pick up zero padding automatically through the unused elements of boundary texels). The main benefit of this was that compute shaders could bind the GPU buffer as a vec4[] array instead of a float[] array.

This was a premature optimization built on the assumption that loading data in units of vec4 would be more efficient than loading in units of float. However, experimental analysis showed that vectorization did not produce significant latency improvements, and in some cases negatively impacted latency. Thus the added zero padding does not serve any purpose for buffer-backed tensors.

Motivation

Removing the zero padding means that the data layout of the tensor in the GPU buffer is exactly the same as in the CPU buffer. This adds a lot of flexibility to eliminate data copying for orchestration ops such as view, permute, unsqueeze, squeeze, etc., which can now simply modify the strides of the original tensor instead of allocating new storage. This will be especially important for large language models, which perform a lot of these operations.
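
As an illustration (a minimal sketch with hypothetical helper names, not the delegate's actual API), a permute of a dense tensor can be expressed purely as a reordering of sizes and strides over the same underlying buffer:

```cpp
#include <cstdint>
#include <vector>

// Contiguous (row-major) strides for a given shape.
std::vector<int64_t> contiguous_strides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size(), 1);
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * sizes[i + 1];
  }
  return strides;
}

// A permute only reorders sizes and strides; the buffer itself is untouched.
void permute_metadata(
    std::vector<int64_t>& sizes,
    std::vector<int64_t>& strides,
    const std::vector<int64_t>& dim_order) {
  std::vector<int64_t> new_sizes, new_strides;
  for (const int64_t d : dim_order) {
    new_sizes.push_back(sizes[d]);
    new_strides.push_back(strides[d]);
  }
  sizes = std::move(new_sizes);
  strides = std::move(new_strides);
}
```

A metadata-only op like this only works if the GPU buffer holds exactly the dense CPU layout; with zero padding on the packed dimension, the padded buffer's strides would not describe any valid view of the original data.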

Changes

  • Introduce the buffer_to_buffer shader to perform a direct copy between two GPU buffers, and use it to copy between the staging buffer and the GPU buffer for buffer-backed tensors (a sketch of the copy logic follows this list)
  • Rename various tensor properties to improve clarity:
    • gpu_sizes -> padded_sizes
    • gpu_strides -> strides
    • deprecate texel_numel
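
Because the buffer layout now matches the CPU layout exactly, the staging-to-GPU transfer is a plain element-wise copy with no repacking. A CPU-side analogue of the per-invocation logic (a sketch only, not the actual GLSL shader) would be:

```cpp
#include <cstddef>

// Each compute-shader invocation would copy one element (or one small chunk);
// shown here as a plain loop over the shared index space.
void buffer_to_buffer_copy(const float* src, float* dst, const size_t numel) {
  for (size_t i = 0; i < numel; ++i) {
    dst[i] = src[i];
  }
}
```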

fb:

The finding that vectorization does not provide much benefit was discovered via the work on the PerfLab binaries (D59699933, D59877804). My guesses as to why vectorization does not provide any improvement are twofold:

  1. Memory is loaded in units of cache lines, so loading a vec4 would trigger the same amount of memory to be fetched as loading a float. Thus loading consecutive vec4s should have no improvement over loading consecutive floats (see the arithmetic after this list).
  2. Processing smaller units in each thread improves memory coalescing, and allows for better parallelization of the ALUs.
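
Concretely, for point 1 (assuming a typical 64-byte cache line and 4-byte floats; these figures are illustrative assumptions, not measurements from the linked experiments):

$$64~\text{B per cache line} = 16~\text{floats} = 4~\text{vec4s}$$

A single float load and a single vec4 load therefore each pull in at most one full cache line (two if the access straddles a line boundary), so threads walking consecutive elements fetch the same set of lines either way.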

Differential Revision: D60931000

SS-JIA added 5 commits August 8, 2024 08:01
## Context

This diff adds some additional API functions to the `utils::vecN` family of classes. The following improvements were made:

1. Added overloaded assignment operator, allowing for `vec_instance = vec_instance_2`
2. Added overloaded indexing operator, allowing for `vec_instance[2]` instead of having to do `vec_instance.data[2]`

Note that the large number of changes is due to replacing `.data[` with `[` throughout the codebase.
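
A minimal sketch of what these operators might look like (the member name `data` is taken from the existing `.data[...]` accesses; the rest of the class layout here is an assumption, not the actual `utils::vecN` implementation):

```cpp
#include <cstdint>

template <typename T, uint32_t N>
struct vec {
  T data[N];

  // 1. Overloaded assignment: allows `vec_instance = vec_instance_2`.
  vec& operator=(const vec& other) {
    for (uint32_t i = 0; i < N; ++i) {
      data[i] = other.data[i];
    }
    return *this;
  }

  // 2. Overloaded indexing: allows `vec_instance[2]` instead of
  //    `vec_instance.data[2]`.
  T& operator[](const uint32_t i) {
    return data[i];
  }
  const T& operator[](const uint32_t i) const {
    return data[i];
  }
};
```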

Differential Revision: [D60931001](https://our.internmc.facebook.com/intern/diff/D60931001/)

[ghstack-poisoned]
## Context

This diff adds some API functions to `ParamsBindList` to make it easier to use, specifically:

1. Added default constructor
2. Added overload for `append` that takes only one `BufferBindInfo`

The reason for these changes is to make the following pattern easier:

```
ParamsBindList ubo;
if (kernel1) {
  ubo.append(ubo1);
}
else {
  ubo.append(ubo2);
}
```

This pattern was not possible before because `ubo` could not be default constructed, and `ubo1` and `ubo2` had to be wrapped in an initializer list before being passed to `append`.
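
A minimal sketch of the shape of these additions (the member and type names below are assumptions for illustration, not the actual ExecuTorch Vulkan API):

```cpp
#include <vector>

struct BufferBindInfo {
  // Handle/offset details omitted; stands in for the real BufferBindInfo.
};

struct ParamsBindList {
  std::vector<BufferBindInfo> bind_infos;

  // 1. Default constructor, so a list can be declared first and filled in
  //    conditionally later.
  ParamsBindList() = default;

  // Existing-style append taking several entries at once.
  void append(const std::vector<BufferBindInfo>& infos) {
    bind_infos.insert(bind_infos.end(), infos.begin(), infos.end());
  }

  // 2. New overload taking a single BufferBindInfo, so one entry no longer
  //    has to be wrapped in an initializer list.
  void append(const BufferBindInfo& info) {
    bind_infos.push_back(info);
  }
};
```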

Differential Revision: [D60930997](https://our.internmc.facebook.com/intern/diff/D60930997/)

[ghstack-poisoned]
…er backed tensors"

## Context

This diff makes a major change to how buffer-backed tensors are handled in the Vulkan delegate; in particular, zero padding will no longer be applied to the packed dimension.

Previously, zero padding was applied to the packed dimension of buffer-backed tensors in order to stay consistent with texture-backed tensors (which pick up zero padding automatically through the unused elements of boundary texels). The main benefit of this was that compute shaders could bind the GPU buffer as a `vec4[]` array instead of a `float[]` array.

This was a premature optimization built on the assumption that loading data in units of `vec4` would be more efficient than loading in units of `float`. However, experimental analysis showed that vectorization did not produce significant latency improvements, and in some cases negatively impacted latency. **Thus the added zero padding does not serve any purpose for buffer-backed tensors**.

## Motivation

Removing the zero padding means that the data layout of the tensor in the GPU buffer is exactly the same as in the CPU buffer. This adds a lot of flexibility to eliminate data copying for orchestration ops such as `view`, `permute`, `unsqueeze`, `squeeze`, etc., which can now simply **modify the strides of the original tensor instead of allocating new storage.** This will be especially important for large language models, which perform a lot of these operations.

## Changes

* Introduce the `buffer_to_buffer` shader to perform a direct copy between two GPU buffers, and use it to copy between the staging buffer and the GPU buffer for buffer-backed tensors
* Rename various tensor properties to improve clarity:
  * `gpu_sizes` -> `padded_sizes`
  * `gpu_strides` -> `strides`
  * deprecate `texel_numel`

fb:

The finding that vectorization does not provide much benefit was discovered via the work on the PerfLab binaries (D59699933, D59877804). My guesses as to why vectorization does not provide any improvement are twofold:

1. Memory is loaded in units of cache lines, so loading a `vec4` would trigger the same amount of memory to be fetched as loading a `float`. Thus loading consecutive `vec4`s should have no improvement over loading consecutive `float`s.
2. Processing smaller units in each thread improves memory coalescing, and allows for better parallelization of the ALUs.

Differential Revision: [D60931000](https://our.internmc.facebook.com/intern/diff/D60931000/)

[ghstack-poisoned]
Differential Revision: D60931000

Pull Request resolved: #4594
@pytorch-bot

pytorch-bot bot commented Aug 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4637

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6c49740 with merge base 192d463:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Aug 9, 2024
kirklandsign merged commit e4897dd into main Aug 9, 2024
SS-JIA deleted the gh/SS-JIA/54/base branch January 24, 2025 19:39