[ET-VK][Llama] Apply XNNPACK partitioner as well when lowering to Vulkan #6830
Conversation
Merged b6da372 into gh/SS-JIA/147/base
[ET-VK][Llama] Apply XNNPACK partitioner as well when lowering to Vulkan (#6857)

* [ET-VK] Enforce GPU buffer limit when partitioning (Pull Request resolved: #6829). In Vulkan, there is a limit on the number of elements a GPU buffer can have. If a GPU buffer exceeds this limit, the API will either produce an error or undefined behaviour will ensue. Along with `texture_limits`, this introduces a configurable `buffer_limit` entry in the partitioner configuration. Differential Revision: [D65899828](https://our.internmc.facebook.com/intern/diff/D65899828/)

* [ET-VK][Llama] Apply XNNPACK partitioner as well when lowering to Vulkan (Pull Request resolved: #6830; see the description below). Differential Revision: [D65899827](https://our.internmc.facebook.com/intern/diff/D65899827/)

Co-authored-by: Stephen Jia <ssjia@meta.com>
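For reference, a minimal sketch of what the `buffer_limit` configuration from #6829 could look like when constructing the Vulkan partitioner; the option key, value, and `compile_options` structure are assumptions based on the commit message above, not a verified API:

```python
# Hedged sketch: the "buffer_limit" key and its value are illustrative,
# mirroring the wording of #6829 rather than the exact partitioner API.
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner

vulkan_partitioner = VulkanPartitioner(
    compile_options={
        # Ops whose tensors would need a GPU buffer with more elements than
        # this are left undelegated by Vulkan (e.g. the final logit linear).
        "buffer_limit": 1 << 27,  # illustrative value only
    }
)
```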
Stack from ghstack (oldest at bottom):
Context
The final logit linear layer in the Transformer architecture has extremely large tensors: both the weight and the output have a dimension equal to the vocabulary size, which can be very large. Because of this, image textures cannot be used to execute the op when running with the Vulkan delegate, so a buffer-based tensor implementation must be used instead.
Unfortunately, Vulkan does not currently have a performant implementation of linear with buffer-based tensors. As a result, if this final linear layer is executed in Vulkan, model inference is extremely slow.
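To make the size concrete, here is a rough illustration with assumed example numbers (the values below are not taken from this PR):

```python
# Illustrative values only: output projection of a Llama-style model.
dim = 4096            # embedding width (example value)
vocab_size = 128_000  # vocabulary size (example value)

# The logit linear weight has shape [vocab_size, dim], so one of its extents
# equals the vocabulary size. Typical Vulkan image texture limits
# (maxImageDimension2D is commonly 16384) are far smaller than that, which is
# why the op has to fall back to buffer-backed tensors.
weight_elements = vocab_size * dim
print(f"logit linear weight: {weight_elements / 1e6:.0f}M elements")
```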
Changes
The diff below this one in the stack (#6829) prevents the final logit linear layer from being delegated to Vulkan by enforcing a GPU buffer limit.
This diff modifies the export_llama script to apply the XNNPACK partitioner after the Vulkan partitioner when lowering to Vulkan, so that the remaining ops are accelerated with XNNPACK. 4-bit quantization also applies an additional quantizer after the Vulkan quantizer (which skips the final logit linear layer), so that the final logit linear can be quantized as well. A rough sketch of the partitioner ordering is shown below.
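As a minimal sketch of that ordering, assuming the standard partitioner entry points rather than the actual export_llama builder flow (the function name and the use of `to_edge_transform_and_lower` here are illustrative assumptions):

```python
# Minimal sketch, not the actual export_llama code: Vulkan partitions first,
# XNNPACK then claims whatever Vulkan leaves undelegated.
from executorch.backends.vulkan.partitioner.vulkan_partitioner import VulkanPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower


def lower_for_vulkan(exported_program):
    # Partitioners are applied in order: Vulkan claims its supported subgraphs
    # first; XNNPACK then picks up the remaining ops, including the final
    # logit linear that Vulkan skips because of the GPU buffer limit.
    return to_edge_transform_and_lower(
        exported_program,
        partitioner=[VulkanPartitioner(), XnnpackPartitioner()],
    )
```

The quantization side is analogous: the Vulkan quantizer is applied first and skips the logit linear, and an additional quantizer is composed after it so that layer is still quantized.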
Long Term
This is a temporary measure while an optimized buffer-based linear implementation is developed. Once the Vulkan implementation achieves parity with XNNPACK, the final logit linear will be delegated to Vulkan once more.
Differential Revision: D65899827