
Conversation

@kirklandsign
Contributor

Stack from ghstack (oldest at bottom):

Context

This diff makes a major change to how buffer-backed tensors are handled in the Vulkan delegate; in particular, zero padding will no longer be applied to the packed dimension.

Previously, zero padding was applied to the packed dimension of buffer-backed tensors in order to stay consistent with texture-backed tensors (which pick up zero padding automatically through the unused elements of boundary texels). The main benefit of this was that compute shaders could bind the GPU buffer as a vec4[] array instead of a float[] array.

This was a premature optimization built on the assumption that loading data in units of vec4 would be more efficient than loading in units of float. However, experimental analysis showed that vectorization did not produce significant latency improvements, and in some cases negatively impacted latency. Thus the added zero padding does not serve any purpose for buffer-backed tensors.

Motivation

Removing the zero padding means that the data layout of the tensor in the GPU buffer is exactly the same as in the CPU buffer. This adds a lot of flexibility to eliminate data copying for orchestration ops such as view, permute, unsqueeze, squeeze, etc., which can now simply modify the strides of the original tensor instead of allocating new storage. This will be especially important for large language models, which perform a lot of these operations.
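
As an illustration (a minimal sketch with hypothetical helper names, not the delegate's actual API), a permute of a dense tensor can be expressed purely as a reordering of sizes and strides over the same underlying buffer:

```cpp
#include <cstdint>
#include <vector>

// Contiguous (row-major) strides for a given shape.
std::vector<int64_t> contiguous_strides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size(), 1);
  for (int64_t i = static_cast<int64_t>(sizes.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * sizes[i + 1];
  }
  return strides;
}

// A permute only reorders sizes and strides; the buffer itself is untouched.
void permute_metadata(
    std::vector<int64_t>& sizes,
    std::vector<int64_t>& strides,
    const std::vector<int64_t>& dim_order) {
  std::vector<int64_t> new_sizes, new_strides;
  for (const int64_t d : dim_order) {
    new_sizes.push_back(sizes[d]);
    new_strides.push_back(strides[d]);
  }
  sizes = std::move(new_sizes);
  strides = std::move(new_strides);
}
```

A metadata-only op like this only works if the GPU buffer holds exactly the dense CPU layout; with zero padding on the packed dimension, the padded buffer's strides would not describe any valid view of the original data.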

Changes

  • Introduce the buffer_to_buffer shader to perform a direct copy between two GPU buffers, and use it to copy between the staging buffer and the GPU buffer for buffer-backed tensors (a sketch of the copy logic follows this list)
  • Rename various tensor properties to improve clarity:
    • gpu_sizes -> padded_sizes
    • gpu_strides -> strides
    • deprecate texel_numel
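
Because the buffer layout now matches the CPU layout exactly, the staging-to-GPU transfer is a plain element-wise copy with no repacking. A CPU-side analogue of the per-invocation logic (a sketch only, not the actual GLSL shader) would be:

```cpp
#include <cstddef>

// Each compute-shader invocation would copy one element (or one small chunk);
// shown here as a plain loop over the shared index space.
void buffer_to_buffer_copy(const float* src, float* dst, const size_t numel) {
  for (size_t i = 0; i < numel; ++i) {
    dst[i] = src[i];
  }
}
```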

fb:

The finding that vectorization does not provide much benefit was discovered via the work on the PerfLab binaries (D59699933, D59877804). My guesses as to why vectorization does not provide any improvement are twofold:

  1. Memory is loaded in units of cache lines, so loading a vec4 would trigger the same amount of memory to be fetched as loading a float. Thus loading consecutive vec4s should have no improvement over loading consecutive floats (see the arithmetic after this list).
  2. Processing smaller units in each thread improves memory coalescing, and allows for better parallelization of the ALUs.
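
Concretely, for point 1 (assuming a typical 64-byte cache line and 4-byte floats; these figures are illustrative assumptions, not measurements from the linked experiments):

$$64~\text{B per cache line} = 16~\text{floats} = 4~\text{vec4s}$$

A single float load and a single vec4 load therefore each pull in at most one full cache line (two if the access straddles a line boundary), so threads walking consecutive elements fetch the same set of lines either way.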

Differential Revision: D60931000

SS-JIA added 5 commits August 8, 2024 08:01
## Context

This diff adds some additional API functions to the `utils::vecN` family of classes. The following improvements were made:

1. Added overloaded assignment operator, allowing for `vec_instance = vec_instance_2`
2. Added overloaded indexing operator, allowing for `vec_instance[2]` instead of having to do `vec_instance.data[2]`

Note that the large number of changes is due to replacing `.data[` with `[` throughout the codebase.
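
A minimal sketch of what these operators might look like (the member name `data` is taken from the existing `.data[...]` accesses; the rest of the class layout here is an assumption, not the actual `utils::vecN` implementation):

```cpp
#include <cstdint>

template <typename T, uint32_t N>
struct vec {
  T data[N];

  // 1. Overloaded assignment: allows `vec_instance = vec_instance_2`.
  vec& operator=(const vec& other) {
    for (uint32_t i = 0; i < N; ++i) {
      data[i] = other.data[i];
    }
    return *this;
  }

  // 2. Overloaded indexing: allows `vec_instance[2]` instead of
  //    `vec_instance.data[2]`.
  T& operator[](const uint32_t i) {
    return data[i];
  }
  const T& operator[](const uint32_t i) const {
    return data[i];
  }
};
```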

Differential Revision: [D60931001](https://our.internmc.facebook.com/intern/diff/D60931001/)

[ghstack-poisoned]
## Context

This diff adds some API functions to `ParamsBindList` to make it easier to use, specifically:

1. Added default constructor
2. Added overload for `append` that takes only one `BufferBindInfo`

The reason for these changes is to make the following pattern easier:

```
ParamsBindList ubo;
if (kernel1) {
  ubo.append(ubo1);
}
else {
  ubo.append(ubo2);
}
```

This pattern was not possible before because `ubo` could not be default constructed, and `ubo1` and `ubo2` had to be wrapped in an initializer list before being passed to `append`.
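
A minimal sketch of the shape of these additions (the member and type names below are assumptions for illustration, not the actual ExecuTorch Vulkan API):

```cpp
#include <vector>

struct BufferBindInfo {
  // Handle/offset details omitted; stands in for the real BufferBindInfo.
};

struct ParamsBindList {
  std::vector<BufferBindInfo> bind_infos;

  // 1. Default constructor, so a list can be declared first and filled in
  //    conditionally later.
  ParamsBindList() = default;

  // Existing-style append taking several entries at once.
  void append(const std::vector<BufferBindInfo>& infos) {
    bind_infos.insert(bind_infos.end(), infos.begin(), infos.end());
  }

  // 2. New overload taking a single BufferBindInfo, so one entry no longer
  //    has to be wrapped in an initializer list.
  void append(const BufferBindInfo& info) {
    bind_infos.push_back(info);
  }
};
```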

Differential Revision: [D60930997](https://our.internmc.facebook.com/intern/diff/D60930997/)

[ghstack-poisoned]
…er backed tensors"

## Context

This diff makes a major change to how buffer-backed tensors are handled in the Vulkan delegate; in particular, zero padding will no longer be applied to the packed dimension.

Previously, zero padding was applied to the packed dimension of buffer-backed tensors in order to stay consistent with texture-backed tensors (which pick up zero padding automatically through the unused elements of boundary texels). The main benefit of this was that compute shaders could bind the GPU buffer as a `vec4[]` array instead of a `float[]` array.

This was a premature optimization built on the assumption that loading data in units of `vec4` would be more efficient than loading in units of `float`. However, experimental analysis showed that vectorization did not produce significant latency improvements, and in some cases negatively impacted latency. **Thus the added zero padding does not serve any purpose for buffer-backed tensors**.

## Motivation

Removing the zero padding means that the data layout of the tensor in the GPU buffer is exactly the same as in the CPU buffer. This adds a lot of flexibility to eliminate data copying for orchestration ops such as `view`, `permute`, `unsqueeze`, `squeeze`, etc., which can now simply **modify the strides of the original tensor instead of allocating new storage.** This will be especially important for large language models, which perform a lot of these operations.

## Changes

* Introduce the `buffer_to_buffer` shader to perform a direct copy between two GPU buffers, and use it to copy between the staging buffer and the GPU buffer for buffer-backed tensors
* Rename various tensor properties to improve clarity:
  * `gpu_sizes` -> `padded_sizes`
  * `gpu_strides` -> `strides`
  * deprecate `texel_numel`

fb:

The finding that vectorization does not provide much benefit was discovered via the work on the PerfLab binaries (D59699933, D59877804). My guesses as to why vectorization does not provide any improvement are twofold:

1. Memory is loaded in units of cache lines, so loading a `vec4` would trigger the same amount of memory to be fetched as loading a `float`. Thus loading consecutive `vec4`s should have no improvement over loading consecutive `float`s.
2. Processing smaller units in each thread improves memory coalescing, and allows for better parallelization of the ALUs.

Differential Revision: [D60931000](https://our.internmc.facebook.com/intern/diff/D60931000/)

[ghstack-poisoned]
Differential Revision: D60931000

Pull Request resolved: #4594
@pytorch-bot

pytorch-bot bot commented Aug 9, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4637

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6c49740 with merge base 192d463:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Aug 9, 2024
kirklandsign merged commit e4897dd into main Aug 9, 2024
SS-JIA deleted the gh/SS-JIA/54/base branch January 24, 2025 19:39