Conversation

nathanaelsee (Contributor)
Summary:
Naively using the ivec4 axis mapping regresses latency by 20-30% for layer norm, due to the added overhead of an extra layer of index lookups inside the two loops over the entire width dim.

We can use specialization constants to move the index lookups ahead of time, to the shader compilation and command buffer construction phase.
Unfortunately, we can't pass vec types as specialization constants.
But we can squeeze the axis mapping into a single 32-bit int and pass that in as a specialization constant!
We can then unpack the int into a const ivec4 axis map, which can be folded during shader compilation.
Using this method, we incur a ~1% overhead instead of the 20+% we previously saw.

This diff also adds a codegen function for specialization constants, along with a new accumulator `C` for constant ids (alongside `B`, the binding index for textures, buffers, and buffer objects).
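The squeeze-into-one-int trick can be modeled in Python. This is a sketch only: the function names are illustrative (the real unpacking happens in GLSL into a const ivec4), and the exact bit layout used by the diff is not specified here, so a 4-bits-per-entry scheme is assumed.

```python
def pack_axis_map(axis_map):
    """Pack four small axis-map entries (each < 16) into one 32-bit int."""
    assert len(axis_map) == 4 and all(0 <= v < 16 for v in axis_map)
    packed = 0
    for i, v in enumerate(axis_map):
        packed |= v << (4 * i)  # entry i occupies bits 4i..4i+3
    return packed


def unpack_axis_map(packed):
    """Recover the ivec4-style axis map from the packed int.

    In the shader, this unpacking is done once into a const ivec4, so the
    per-element lookups can be constant-folded at shader compile time.
    """
    return [(packed >> (4 * i)) & 0xF for i in range(4)]
```

Since the packed value arrives as a specialization constant, the shader compiler sees the axis map as compile-time constants and eliminates the per-iteration index lookups that caused the regression.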

Reviewed By: SS-JIA

Differential Revision: D63361329

pytorch-bot bot commented Oct 18, 2024

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6358

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6cd234d with merge base 4d7b294:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 18, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D63361329

@nathanaelsee nathanaelsee changed the title update native_layer_norm to new layout gen & axis mapping [ET-VK] update native_layer_norm to new layout gen & axis mapping Oct 18, 2024
nathanaelsee added a commit to nathanaelsee/executorch that referenced this pull request Oct 18, 2024
nathanaelsee added a commit to nathanaelsee/executorch that referenced this pull request Oct 18, 2024
nathanaelsee added a commit to nathanaelsee/executorch that referenced this pull request Oct 18, 2024
@facebook-github-bot
This pull request has been merged in 324f021.

SS-JIA added a commit that referenced this pull request Oct 28, 2024
## Context

#6358 showed that passing in the axis map of a tensor via a specialization constant allows shaders to utilize the axis map in indexing calculations with minimal impact on latency.

This diff extends that idea and introduces the concept of a hashed layout: a 32-bit integer where:

1. Bits 28-31: `axis_map[0]`
2. Bits 24-27: `axis_map[1]`
3. Bits 20-23: `axis_map[2]`
4. Bits 16-19: `axis_map[3]`
5. Bits 12-15: `packed_dim`
6. Bits 0-11: unused

Essentially, the integer is divided into 4-bit chunks, and each chunk represents one value from the `axis_map`, plus the `packed_dim`. This way, the entire description of how the tensor is represented as a texture can be passed to a compute shader with a single specialization constant.

Within the compute shader, the axis map and packed dim can be extracted like so:

```
${layout_declare_spec_const(C, "int", "in_layout", "DEFAULT_LAYOUT")}
const lowp ivec4 in_axis_map = unhash_axis_map(in_layout);
const lowp int in_packed_dim = unhash_packed_dim(in_layout);
```

Note that `lowp` can be used because the values are bounded by the dimensionality of the tensor, so we expect only small values.
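The hash/unhash logic can be sketched in Python as a model of the C++/GLSL code (function names mirror the shader-side `unhash_*` helpers, but this is an illustrative reimplementation of the documented bit layout, not the real API):

```python
def hash_layout(axis_map, packed_dim):
    """Pack a 4-entry axis map and the packed dim into one 32-bit int.

    Bit layout (as documented above):
      bits 28-31: axis_map[0], 24-27: axis_map[1], 20-23: axis_map[2],
      bits 16-19: axis_map[3], 12-15: packed_dim, 0-11: unused.
    """
    assert len(axis_map) == 4 and all(0 <= v < 16 for v in axis_map)
    assert 0 <= packed_dim < 16
    h = 0
    for i, v in enumerate(axis_map):
        h |= v << (28 - 4 * i)  # axis_map[i] lands in a 4-bit nibble
    h |= packed_dim << 12
    return h


def unhash_axis_map(h):
    """Extract the four axis-map nibbles from the hashed layout."""
    return [(h >> (28 - 4 * i)) & 0xF for i in range(4)]


def unhash_packed_dim(h):
    """Extract the packed dim from bits 12-15."""
    return (h >> 12) & 0xF
```

For example, `hash_layout([0, 1, 2, 3], 2)` yields `0x01232000`, and the two `unhash_*` helpers recover the original values. Because each field fits in 4 bits, the round trip is lossless for any valid tensor dimensionality.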

## Changes

1. Introduce `hashed_layout`
2. Replace all uses of `axis_map_ubo` with `hashed_layout`
3. Remove `axis_map_ubo` from `vTensor`. This also reduces the size of the class.

Differential Revision: [D65085141](https://our.internmc.facebook.com/intern/diff/D65085141/)

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Oct 28, 2024
SS-JIA added a commit that referenced this pull request Oct 28, 2024
SS-JIA added a commit that referenced this pull request Oct 28, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
Labels: CLA Signed, fb-exported, Merged