Conversation

nathanaelsee (Contributor)
Summary:
Naively using the ivec4 axis mapping regresses latency by 20-30% for layer norm, due to the added overhead of an extra layer of index lookups inside the two loops over the entire width dim.

We can use specialization constants to move the index lookups ahead of time, to the shader compilation and command buffer construction phase.
Unfortunately, we can't pass vec types as specialization constants.
But we can squeeze the axis mapping into a single 32-bit int and pass that in as a specialization constant!
We can then unpack the int into a const ivec4 axis map, which can be folded during shader compilation.
Using this method, we incur a ~1% overhead instead of the 20+% we previously saw.

This diff also adds a codegen function for specialization constants, along with a new accumulator `C` for constant ids (alongside `B`, the binding index for textures, buffers, and buffer objects).
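The squeeze-into-one-int trick can be modeled in Python. This is a sketch only: the function names are illustrative (the real unpacking happens in GLSL into a const ivec4), and the exact bit layout used by the diff is not specified here, so a 4-bits-per-entry scheme is assumed.

```python
def pack_axis_map(axis_map):
    """Pack four small axis-map entries (each < 16) into one 32-bit int."""
    assert len(axis_map) == 4 and all(0 <= v < 16 for v in axis_map)
    packed = 0
    for i, v in enumerate(axis_map):
        packed |= v << (4 * i)  # entry i occupies bits 4i..4i+3
    return packed


def unpack_axis_map(packed):
    """Recover the ivec4-style axis map from the packed int.

    In the shader, this unpacking is done once into a const ivec4, so the
    per-element lookups can be constant-folded at shader compile time.
    """
    return [(packed >> (4 * i)) & 0xF for i in range(4)]
```

Since the packed value arrives as a specialization constant, the shader compiler sees the axis map as compile-time constants and eliminates the per-iteration index lookups that caused the regression.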

Reviewed By: SS-JIA

Differential Revision: D63361329

pytorch-bot bot commented Oct 18, 2024

🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6358

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6cd234d with merge base 4d7b294:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 18, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D63361329

@nathanaelsee nathanaelsee changed the title update native_layer_norm to new layout gen & axis mapping [ET-VK] update native_layer_norm to new layout gen & axis mapping Oct 18, 2024
nathanaelsee added a commit to nathanaelsee/executorch that referenced this pull request Oct 18, 2024
nathanaelsee added a commit to nathanaelsee/executorch that referenced this pull request Oct 18, 2024
nathanaelsee added a commit to nathanaelsee/executorch that referenced this pull request Oct 18, 2024
@facebook-github-bot
This pull request has been merged in 324f021.

SS-JIA added a commit that referenced this pull request Oct 28, 2024
## Context

#6358 showed that passing in the axis map of a tensor via a specialization constant allows shaders to utilize the axis map in indexing calculations with minimal impact on latency.

This diff extends that idea and introduces the concept of a hashed layout: a 32-bit integer where:

1. Bits 28-31: `axis_map[0]`
2. Bits 24-27: `axis_map[1]`
3. Bits 20-23: `axis_map[2]`
4. Bits 16-19: `axis_map[3]`
5. Bits 12-15: `packed_dim`
6. Bits 0-11: unused

Essentially, the integer is divided into 4-bit chunks, and each chunk represents one value from the `axis_map`, plus the `packed_dim`. This way, the entire description of how the tensor is represented as a texture can be passed to a compute shader with a single specialization constant.

Within the compute shader, the axis map and packed dim can be extracted like so:

```
${layout_declare_spec_const(C, "int", "in_layout", "DEFAULT_LAYOUT")}
const lowp ivec4 in_axis_map = unhash_axis_map(in_layout);
const lowp int in_packed_dim = unhash_packed_dim(in_layout);
```

Note that `lowp` can be used because the values are bounded by the dimensionality of the tensor, so we expect only small values.
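The hash/unhash logic can be sketched in Python as a model of the C++/GLSL code (function names mirror the shader-side `unhash_*` helpers, but this is an illustrative reimplementation of the documented bit layout, not the real API):

```python
def hash_layout(axis_map, packed_dim):
    """Pack a 4-entry axis map and the packed dim into one 32-bit int.

    Bit layout (as documented above):
      bits 28-31: axis_map[0], 24-27: axis_map[1], 20-23: axis_map[2],
      bits 16-19: axis_map[3], 12-15: packed_dim, 0-11: unused.
    """
    assert len(axis_map) == 4 and all(0 <= v < 16 for v in axis_map)
    assert 0 <= packed_dim < 16
    h = 0
    for i, v in enumerate(axis_map):
        h |= v << (28 - 4 * i)  # axis_map[i] lands in a 4-bit nibble
    h |= packed_dim << 12
    return h


def unhash_axis_map(h):
    """Extract the four axis-map nibbles from the hashed layout."""
    return [(h >> (28 - 4 * i)) & 0xF for i in range(4)]


def unhash_packed_dim(h):
    """Extract the packed dim from bits 12-15."""
    return (h >> 12) & 0xF
```

For example, `hash_layout([0, 1, 2, 3], 2)` yields `0x01232000`, and the two `unhash_*` helpers recover the original values. Because each field fits in 4 bits, the round trip is lossless for any valid tensor dimensionality.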

## Changes

1. Introduce `hashed_layout`
2. Replace all uses of `axis_map_ubo` with `hashed_layout`
3. Remove `axis_map_ubo` from `vTensor`. This also reduces the size of the class.

Differential Revision: [D65085141](https://our.internmc.facebook.com/intern/diff/D65085141/)

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Oct 28, 2024
SS-JIA added a commit that referenced this pull request Oct 28, 2024
SS-JIA added a commit that referenced this pull request Oct 28, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
SS-JIA added a commit that referenced this pull request Oct 30, 2024
Labels: CLA Signed, fb-exported, Merged