[ExecuTorch][Llama][ez] Remove unwrap_tensor_subclass from 4w quantization path#18957

Merged
meta-codesync[bot] merged 1 commit into gh/SS-JIA/521/base from gh/SS-JIA/521/head
Apr 17, 2026
Conversation

@SS-JIA SS-JIA (Contributor) commented Apr 16, 2026

Stack from ghstack (oldest at bottom):

D100066455 changed the Llama export pipeline to quantize weights in the checkpoint dtype (typically bfloat16) before casting to the computation dtype (fp32). This introduced a regression for Vulkan 4w export: `dequantize_affine` ops produced bfloat16 outputs, which Vulkan doesn't support, causing the graph to be split into multiple partitions. When `sym_constrain_range_for_size` constraint nodes were partitioned into a different delegate than the `_local_scalar_dense` + `slice_copy` ops they constrain, ExportPass re-tracing (in ConvertToLinearPass, SpecPropPass, etc.) would crash with `GuardOnDataDependentSymNode: Could not guard on data-dependent expression u539 < 0`.
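The dtype flow behind the regression can be seen in a minimal sketch of affine dequantization. Note this `dequantize_affine` is a hand-rolled stand-in for illustration, not the torchao op: the point is only that the output dtype follows the dtype of the quantization parameters, so weights quantized in bfloat16 keep dequantizing to bfloat16 unless the parameters themselves are cast.

```python
import torch

# Assumed semantics, not the torchao implementation:
# out = (q - zero_point) * scale, computed in scale.dtype.
def dequantize_affine(q, scale, zero_point):
    return (q.to(scale.dtype) - zero_point) * scale

q = torch.randint(-8, 8, (4, 4), dtype=torch.int8)
scale = torch.full((4, 1), 0.1, dtype=torch.bfloat16)  # quantized in bf16
zp = torch.zeros((4, 1), dtype=torch.bfloat16)

print(dequantize_affine(q, scale, zp).dtype)  # torch.bfloat16

# Casting the quantization params along with the model fixes the output dtype:
scale_f32, zp_f32 = scale.to(torch.float32), zp.to(torch.float32)
print(dequantize_affine(q, scale_f32, zp_f32).dtype)  # torch.float32
```

In the healthy path, `model.to(dtype=fp32)` performs exactly this second cast; the bug described below is that the frozen metadata prevents it from taking effect.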

The root cause is `unwrap_tensor_subclass()`. This function decomposes `IntxUnpackedToInt8Tensor` into plain tensors via `torch.nn.utils.parametrize`, capturing the subclass's metadata — including its `dtype` attribute (which controls `dequantize_affine`'s output dtype) — as a frozen snapshot in `UnwrapTensorSubclass.rebuild_stack`. A subsequent `model.to(dtype=fp32)` casts the plain tensors but cannot update the frozen metadata, so `dequantize_affine` continues to output bfloat16.
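The freezing behavior can be reproduced with a toy parametrization — a hypothetical stand-in for `UnwrapTensorSubclass`, not the torchao code. The dtype captured at registration time lives in a plain Python attribute, so a later `model.to(...)` casts the stored plain tensor but never touches the snapshot:

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class FrozenDtypeParam(nn.Module):
    """Hypothetical stand-in for UnwrapTensorSubclass: captures the dtype
    observed at registration time as a frozen snapshot."""
    def __init__(self, frozen_dtype):
        super().__init__()
        self.frozen_dtype = frozen_dtype  # plain attribute; .to() never updates it

    def forward(self, plain):
        # "Dequantize" to the frozen dtype, ignoring the live dtype of `plain`
        return plain.to(self.frozen_dtype)

m = nn.Linear(4, 4).to(torch.bfloat16)
parametrize.register_parametrization(m, "weight", FrozenDtypeParam(torch.bfloat16))

m.to(torch.float32)  # casts the plain tensor stored under the parametrization...
print(m.parametrizations.weight.original.dtype)  # torch.float32
print(m.weight.dtype)  # torch.bfloat16 — the frozen snapshot wins
```

This mirrors the bug exactly: the underlying storage is fp32 after the cast, but the materialized weight (and hence the dequantize output) stays bfloat16.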

`unwrap_tensor_subclass()` was originally needed as a workaround because `torch.export` did not support tensor subclasses natively. This is no longer the case — the 8da4w path already works without it (using the same `IntxUnpackedToInt8Tensor` subclass), and `torch.export` traces through the subclass correctly. Removing it makes 4w consistent with 8da4w and avoids the metadata-freezing issue entirely.

This change was authored with Claude.

Differential Revision: [D101252037](https://our.internmc.facebook.com/intern/diff/D101252037/)

@SS-JIA SS-JIA requested a review from lucylq as a code owner April 16, 2026 22:22
SS-JIA pushed a commit that referenced this pull request Apr 16, 2026
ghstack-source-id: 368594388
Pull Request resolved: #18957

pytorch-bot Bot commented Apr 16, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18957

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 2 New Failures, 2 Unrelated Failures

As of commit 30e9d98 with merge base a43675c:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 16, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot merged commit 4ca6270 into gh/SS-JIA/521/base Apr 17, 2026
162 of 170 checks passed
@meta-codesync meta-codesync Bot deleted the gh/SS-JIA/521/head branch April 17, 2026 15:24
@meta-codesync meta-codesync Bot temporarily deployed to cherry-pick-bot April 17, 2026 15:24 Inactive
SS-JIA pushed a commit that referenced this pull request Apr 17, 2026

Labels

CLA Signed · fb-exported · meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants