# [ET-VK] Do not apply zero padding for buffer backed tensors #4637
Merged
## Context

This diff adds some additional API functions to the `utils::vecN` family of classes. The following improvements were made:

1. Added an overloaded assignment operator, allowing `vec_instance = vec_instance_2`
2. Added an overloaded indexing operator, allowing `vec_instance[2]` instead of having to write `vec_instance.data[2]`

Note that the large number of changes is due to replacing `.data[` with `[` throughout the codebase.

Differential Revision: [D60931001](https://our.internmc.facebook.com/intern/diff/D60931001/)

[ghstack-poisoned]
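The two added operators can be sketched as follows. This is a minimal illustration assuming a simplified 4-element `uvec4`-style aggregate; the real `utils::vecN` classes in the codebase are templated and more involved.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified stand-in for a utils::vecN-style class.
struct uvec4 {
  uint32_t data[4];

  // 1. Overloaded assignment operator: vec_instance = vec_instance_2
  uvec4& operator=(const uvec4& other) {
    for (int i = 0; i < 4; ++i) {
      data[i] = other.data[i];
    }
    return *this;
  }

  // 2. Overloaded indexing operator: vec_instance[2] instead of
  //    vec_instance.data[2]
  uint32_t& operator[](const int i) {
    return data[i];
  }
  const uint32_t& operator[](const int i) const {
    return data[i];
  }
};
```

With the subscript operator in place, the mechanical `.data[` -> `[` replacement throughout the codebase leaves call sites unchanged in behavior.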
## Context
This diff adds some API functions to `ParamsBindList` to make it easier to use, specifically
1. Added default constructor
2. Added overload for `append` that takes only one `BufferBindInfo`
The reason for these changes is to make the following pattern easier:
```cpp
ParamsBindList ubo;
if (kernel1) {
  ubo.append(ubo1);
} else {
  ubo.append(ubo2);
}
```
This pattern was not possible before because `ubo` could not be default constructed, and `ubo1` and `ubo2` had to be wrapped in an initializer list before being passed to `append`.
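A minimal sketch of the two additions, using hypothetical stand-ins for `BufferBindInfo` and `ParamsBindList` (the real Vulkan delegate types differ):

```cpp
#include <initializer_list>
#include <vector>

// Hypothetical simplified bind-info type; illustrative only.
struct BufferBindInfo {
  int handle = -1;
};

class ParamsBindList {
 public:
  // 1. Default constructor, so `ParamsBindList ubo;` is now valid.
  ParamsBindList() = default;

  // Existing-style constructor taking an initializer list of bind infos.
  ParamsBindList(std::initializer_list<BufferBindInfo> init)
      : bind_infos(init) {}

  // 2. New append overload taking a single BufferBindInfo, so callers no
  //    longer need to wrap one element in an initializer list.
  void append(const BufferBindInfo& info) {
    bind_infos.push_back(info);
  }

  std::vector<BufferBindInfo> bind_infos;
};
```

With both in place, the conditional-append pattern above compiles directly: `ubo` is default constructed, and each branch appends its single `BufferBindInfo` without an initializer-list wrapper.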
Differential Revision: [D60930997](https://our.internmc.facebook.com/intern/diff/D60930997/)
[ghstack-poisoned]
## Context

This diff makes a major change to how buffer backed tensors are handled in the Vulkan delegate; in particular, zero padding will no longer be applied to the packed dimension.

Previously, zero padding was applied to the packed dimension for buffer-backed tensors in order to stay consistent with texture backed tensors (which add zero padding automatically through unused elements in boundary texels). The main benefit of this was that compute shaders could bind the GPU buffer as a `vec4[]` array instead of a `float[]` array. This was a premature optimization built on the assumption that loading data in units of `vec4` would be more efficient than loading in units of `float`. However, experimental analysis showed that vectorization did not produce significant latency improvements, and in some cases negatively impacted latency. **Thus the added zero padding does not serve any purpose for buffer-backed tensors.**

## Motivation

Removing the zero padding makes the data layout of the tensor on the GPU buffer exactly the same as on the CPU buffer. This adds a lot of flexibility to eliminate data copying for orchestration ops such as `view`, `permute`, `unsqueeze`, `squeeze`, etc., which can now just **modify the strides of the original tensor instead of allocating new storage.** This will be especially important for large language models, which perform a lot of these operations.

## Changes

* Introduce the `buffer_to_buffer` shader to perform a direct copy between two GPU buffers, and use it to copy between staging and the GPU buffer for buffer backed tensors
* Rename various tensor properties to improve clarity:
  * `gpu_sizes` -> `padded_sizes`
  * `gpu_strides` -> `strides`
  * Deprecate `texel_numel`

fb: The fact that vectorization does not provide much benefit was discovered via the work on the PerfLab binaries (D59699933, D59877804). My guesses as to why vectorization does not provide any improvement are twofold:

1. Memory is loaded in units of cache lines, so loading a `vec4` triggers the same amount of memory to be fetched as loading a `float`. Thus loading consecutive `vec4`s should have no improvement over loading consecutive `float`s.
2. Processing smaller units in each thread improves memory coalescing, and allows for better parallelization of the ALUs.

Differential Revision: [D60931000](https://our.internmc.facebook.com/intern/diff/D60931000/)

[ghstack-poisoned]
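The motivation can be illustrated with a generic sketch (not the delegate's actual implementation; all helper names here are illustrative). When the GPU buffer holds unpadded, contiguous data, strides follow directly from the sizes, and an op like `permute` reduces to reordering sizes and strides over the same storage:

```cpp
#include <cstdint>
#include <vector>

// Round up to the next multiple of 4: the padding previously applied to the
// packed dimension so buffers could be bound as vec4[] arrays.
int64_t align_up_4(int64_t n) {
  return (n + 3) / 4 * 4;
}

// Contiguous (row-major) strides computed directly from the unpadded sizes.
// With no zero padding, the GPU buffer layout matches the CPU buffer exactly.
std::vector<int64_t> contiguous_strides(const std::vector<int64_t>& sizes) {
  std::vector<int64_t> strides(sizes.size(), 1);
  for (int i = static_cast<int>(sizes.size()) - 2; i >= 0; --i) {
    strides[i] = strides[i + 1] * sizes[i + 1];
  }
  return strides;
}

// A permute over an unpadded tensor is just a reordering of sizes and
// strides over the same storage; no data copy or new allocation is needed.
void permute(std::vector<int64_t>& sizes, std::vector<int64_t>& strides,
             const std::vector<int>& order) {
  std::vector<int64_t> new_sizes, new_strides;
  for (int d : order) {
    new_sizes.push_back(sizes[d]);
    new_strides.push_back(strides[d]);
  }
  sizes = new_sizes;
  strides = new_strides;
}
```

With the old scheme, a packed dimension of size 5 would occupy `align_up_4(5) == 8` elements of storage, so the GPU strides diverged from the CPU strides and such stride-only tricks were not possible.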
Differential Revision: D60931000

Pull Request resolved: #4594
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4637

Note: Links to docs will display an error until the docs builds have been completed.

✅ No failures as of commit 6c49740 with merge base 192d463. This comment was automatically generated by Dr. CI and updates every 15 minutes.
cccclai approved these changes on Aug 9, 2024
Labels: CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
Stack from ghstack (oldest at bottom):

* `ParamsBindList` to append a single Binding Info #4593