[ET-VK] update native_layer_norm to new layout gen & axis mapping #6358
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6358
Note: links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 6cd234d with merge base 4d7b294.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D63361329
This pull request has been merged in 324f021.
Pull Request resolved: #6534

## Context

#6358 showed that passing in the axis map of a tensor via a specialization constant allows shaders to utilize the axis map in indexing calculations with minimal impact to latency. This diff extends that idea and introduces the concept of a hashed layout. The hashed layout is a 32-bit integer where:

1. Bits 28-31: `axis_map[0]`
2. Bits 24-27: `axis_map[1]`
3. Bits 20-23: `axis_map[2]`
4. Bits 16-19: `axis_map[3]`
5. Bits 12-15: `packed_dim`
6. Bits 0-11: unused

Essentially, the integer is divided into chunks of 4 bits, and each chunk represents a value from the `axis_map` + `packed_dim`. This way, the entire description of how the tensor is represented as a texture can be passed into a compute shader with a single specialization constant. Within the compute shader, the axis map and packed dim can be extracted like so:

```
${layout_declare_spec_const(C, "int", "in_layout", "DEFAULT_LAYOUT")}

const lowp ivec4 in_axis_map = unhash_axis_map(in_layout);
const lowp int in_packed_dim = unhash_packed_dim(in_layout);
```

Note that `lowp` can be used because the expected values are limited by the dimensionality of the tensor; therefore, we expect only small values.

## Changes

1. Introduce `hashed_layout`
2. Replace all uses of `axis_map_ubo` with `hashed_layout`
3. Remove `axis_map_ubo` from `vTensor`. This also reduces the size of the class.

ghstack-source-id: 250928240
@exported-using-ghexport

Differential Revision: [D65085141](https://our.internmc.facebook.com/intern/diff/D65085141/)

Co-authored-by: Stephen Jia <ssjia@meta.com>
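To make the bit layout above concrete, here is a minimal C++ sketch of packing and unpacking the hashed layout, assuming each field occupies 4 bits as listed. `hash_layout` and the masks are illustrative; the helpers `unhash_axis_map` / `unhash_packed_dim` are reconstructed from the bit layout described above, not copied from the actual ExecuTorch sources.

```
#include <array>
#include <cstdint>

// Hypothetical host-side packing per the layout above:
// bits 28-31: axis_map[0], 24-27: axis_map[1], 20-23: axis_map[2],
// bits 16-19: axis_map[3], 12-15: packed_dim, 0-11: unused.
uint32_t hash_layout(const std::array<uint32_t, 4>& axis_map,
                     uint32_t packed_dim) {
  return (axis_map[0] << 28) | (axis_map[1] << 24) | (axis_map[2] << 20) |
         (axis_map[3] << 16) | (packed_dim << 12);
}

// Mirrors what unhash_axis_map / unhash_packed_dim would compute in the
// shader; every value fits in 4 bits, which is why lowp suffices in GLSL.
std::array<uint32_t, 4> unhash_axis_map(uint32_t layout) {
  return {(layout >> 28) & 0xF, (layout >> 24) & 0xF,
          (layout >> 20) & 0xF, (layout >> 16) & 0xF};
}

uint32_t unhash_packed_dim(uint32_t layout) {
  return (layout >> 12) & 0xF;
}
```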
Summary:
Naively using ivec4 axis mapping regresses latency by 20-30% for layer norm, due to the added overhead of another layer of index lookups in the two loops over the entire width dim.
We can use specialization constants to move the index lookups ahead of time to the shader compilation and command buffer construction phase.
Unfortunately, we can't pass vec types as specialization constants.
But, we can squeeze the axis mapping into a single 32-bit int and pass that in as a specialization constant!
We can unpack the int and create a const ivec4 axis map which can be folded during shader compilation.
Using this method, we incur a 1% overhead instead of the 20+% we previously saw.
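As a sketch of the "command buffer construction phase" step, this is roughly how a single 32-bit value reaches the shader as a specialization constant in plain Vulkan. This is an assumption-laden illustration (the function name is made up), not the ExecuTorch codegen output.

```
#include <vulkan/vulkan.h>
#include <cstdint>

// Illustrative only: bind one packed 32-bit layout int to constant_id 0 so
// the driver can fold the unpacked values when the pipeline is created.
VkSpecializationInfo make_layout_spec_info(const uint32_t* packed_layout) {
  static const VkSpecializationMapEntry entry = {
      0,                  // constantID: matches layout(constant_id = 0) in GLSL
      0,                  // offset into pData
      sizeof(uint32_t)};  // size of the constant
  VkSpecializationInfo info{};
  info.mapEntryCount = 1;
  info.pMapEntries = &entry;
  info.dataSize = sizeof(uint32_t);
  info.pData = packed_layout;  // must outlive pipeline creation
  return info;
}
```

The returned struct would then be supplied via `VkPipelineShaderStageCreateInfo::pSpecializationInfo` when creating the compute pipeline.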
This diff also adds a codegen function for specialization constants, along with a new accumulator `C` for constant ids (besides `B` for binding indices for textures, buffers, and buffer objects).

Reviewed By: SS-JIA
Differential Revision: D63361329