[ET-VK] Enforce GPU buffer limit when partitioning #6829

SS-JIA · 2024-11-13T19:44:49Z

Stack from ghstack (oldest at bottom):

Context

In Vulkan, there is a limit on the number of elements a GPU buffer can have. If a GPU buffer exceeds this limit, the API will either produce an error or exhibit undefined behaviour.

Changes

Along with `texture_limits`, introduce a configurable `buffer_limit` entry in the partitioner configuration.

Differential Revision: D65899828
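To make the idea concrete, here is a minimal sketch of the kind of check a buffer limit implies. It is not the code from this diff: the `PartitionerLimits` class, the `fits_in_buffer` helper, and all numeric values are hypothetical, and only the entry names `texture_limits` and `buffer_limit` come from the description above.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional, Sequence, Tuple


@dataclass
class PartitionerLimits:
    """Hypothetical container; only the two entry names come from this PR."""

    texture_limits: Tuple[int, int, int]
    # Maximum number of elements per GPU buffer; None disables the check.
    buffer_limit: Optional[int] = None


def fits_in_buffer(sizes: Sequence[int], limits: PartitionerLimits) -> bool:
    """Return True if a tensor with the given sizes stays within buffer_limit."""
    if limits.buffer_limit is None:
        return True
    # The element count is the product of all dimensions (1 for a scalar).
    return prod(sizes) <= limits.buffer_limit


# Illustrative values only; real limits depend on the device.
limits = PartitionerLimits(texture_limits=(2048, 2048, 2048), buffer_limit=1 << 20)
small_ok = fits_in_buffer([1024, 1024], limits)  # exactly at the limit
large_ok = fits_in_buffer([1024, 1025], limits)  # one row over the limit
```

In this sketch a partitioner would decline to delegate any op for which `fits_in_buffer` returns `False` for one of its tensors, leaving that op for another backend.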

pytorch-bot · 2024-11-13T19:44:53Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6829

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

GLIBC not found in Nova workflows

✅ No Failures

As of commit d9bf5ff with merge base ecdc007:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-11-13T19:45:08Z

This pull request was exported from Phabricator. Differential Revision: D65899828

facebook-github-bot · 2024-11-13T22:01:40Z

This pull request was exported from Phabricator. Differential Revision: D65899828

facebook-github-bot · 2024-11-14T16:38:21Z

This pull request was exported from Phabricator. Differential Revision: D65899828

Pull Request resolved: #6829

## Context

In Vulkan, there is a limit on the number of elements a GPU buffer can have. If a GPU buffer exceeds this limit, then the API will either produce an error or undefined behaviour will ensue.

## Changes

Along with `texture_limits`, introduce a configurable `buffer_limit` entry in the partitioner configuration.

ghstack-source-id: 253568943
Differential Revision: [D65899828](https://our.internmc.facebook.com/intern/diff/D65899828/)
Co-authored-by: Stephen Jia <ssjia@meta.com>

…an (#6857)

* [ET-VK] Enforce GPU buffer limit when partitioning

Pull Request resolved: #6829

## Context

In Vulkan, there is a limit on the number of elements a GPU buffer can have. If a GPU buffer exceeds this limit, the API will either produce an error or exhibit undefined behaviour.

## Changes

Along with `texture_limits`, introduce a configurable `buffer_limit` entry in the partitioner configuration.

ghstack-source-id: 253568943
Differential Revision: [D65899828](https://our.internmc.facebook.com/intern/diff/D65899828/)

* [ET-VK][Llama] Apply XNNPACK partitoner as well when lowering to Vulkan

Pull Request resolved: #6830

## Context

The final logit linear layer in the Transformer architecture has extremely large tensors, since its output and weight tensors each have a dimension equal to the vocabulary size, which may be extremely large. Because of this, image textures cannot be used to execute the op when running with the Vulkan delegate, so an implementation using buffer-based tensors must be used.

Unfortunately, Vulkan does not have a performant implementation of linear with buffer-based tensors at the moment. As a result, if this final linear layer is executed in Vulkan, model inference is extremely slow.

## Changes

The diff below will prevent the final logit linear layer from being delegated to Vulkan by enforcing a GPU buffer limit.

This diff modifies the export llama script to apply the XNNPACK partitioner after the Vulkan partitioner when lowering to Vulkan, so that the remaining ops are accelerated with XNNPACK.

With 4-bit quantization, an additional Quantizer is also applied after the Vulkan quantizer (which skips the final logit linear layer) so that the final logit linear layer is quantized as well.

## Long Term

This is a temporary measure while an optimized buffer-based linear implementation is developed. Once the Vulkan implementation achieves parity with XNNPACK, the final logit linear layer will be delegated to Vulkan once more.

ghstack-source-id: 253568942
Differential Revision: [D65899827](https://our.internmc.facebook.com/intern/diff/D65899827/)

---------

Co-authored-by: Stephen Jia <ssjia@meta.com>
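The ordering described in the commit message above (Vulkan partitioner first, XNNPACK partitioner for whatever remains) can be sketched in plain Python. This is a toy model under stated assumptions, not the export llama script: `partition_in_order`, the op names, the tensor shapes, and the `BUFFER_LIMIT` value are all hypothetical, and real partitioners work on graph nodes rather than on (name, shape) pairs.

```python
from math import prod
from typing import Callable, Dict, List, Sequence, Tuple

# Toy op record: (op name, sizes of its largest tensor). Hypothetical.
Op = Tuple[str, Sequence[int]]
Accepts = Callable[[Op], bool]


def partition_in_order(
    ops: List[Op], partitioners: List[Tuple[str, Accepts]]
) -> Dict[str, str]:
    """Offer every op to each backend in turn; the first to accept claims it."""
    assignment: Dict[str, str] = {}
    for backend, accepts in partitioners:
        for op in ops:
            name = op[0]
            if name not in assignment and accepts(op):
                assignment[name] = backend
    return assignment


BUFFER_LIMIT = 100_000_000  # hypothetical element limit, not a real device value


def vulkan_accepts(op: Op) -> bool:
    # Decline any op whose largest tensor exceeds the buffer limit.
    return prod(op[1]) <= BUFFER_LIMIT


def xnnpack_accepts(op: Op) -> bool:
    # Assumption for this sketch: the fallback backend supports every op.
    return True


ops: List[Op] = [
    ("attention_linear", (4096, 4096)),  # 16,777,216 elements
    ("logit_linear", (128256, 4096)),    # 525,336,576 elements (vocab-sized weight)
]
assignment = partition_in_order(
    ops, [("vulkan", vulkan_accepts), ("xnnpack", xnnpack_accepts)]
)
```

With these made-up shapes, `attention_linear` is claimed by the Vulkan pass and `logit_linear` falls through to the XNNPACK pass, which mirrors the intent stated in the Changes section.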

facebook-github-bot added the CLA Signed label Nov 13, 2024

SS-JIA mentioned this pull request Nov 13, 2024

[ET-VK][Llama] Apply XNNPACK partitoner as well when lowering to Vulkan #6830

Merged

facebook-github-bot added the fb-exported label Nov 13, 2024

nathanaelsee approved these changes Nov 13, 2024

facebook-github-bot merged commit 2fc047d into gh/SS-JIA/146/base Nov 14, 2024
39 of 41 checks passed

facebook-github-bot deleted the gh/SS-JIA/146/head branch November 14, 2024 18:16

facebook-github-bot temporarily deployed to cherry-pick-bot November 14, 2024 18:16 — with GitHub Actions Inactive

pytorchbot mentioned this pull request Nov 14, 2024

[ET-VK] Enforce GPU buffer limit when partitioning #6856

Merged
