
[ROCm][INT4] Configurable ntile size for TilePacked format#3834

Merged
vkuzo merged 8 commits into pytorch:main from ZhiweiYan-96:zhiwei/int4_ut
Mar 9, 2026

Conversation


@ZhiweiYan-96 ZhiweiYan-96 commented Feb 6, 2026

Motivation

Fixes a UT failure:

pytest -sv test/integration/test_integration.py -k test_int4_weight_only_quant_subclass_api_grouped_5

The failing case has shape (m, k, n) = (256, 256, 8). The n dimension is smaller than the AMD Matrix Core nTileSize of 16, although it is valid for the NVIDIA Tensor Core nTileSize of 8.

According to the code at https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu#L1116

auto nTiles = (B.size(0) / nTileScaleFactor);

We can infer that:

  1. n/8 > 16, where 16 = nTileScaleFactor * nTileSizeTensor; otherwise, nTiles would be 0. (This is the bug!)
  2. n/8 must be a multiple of 8; otherwise, there would be a fractional number of tiles.

This PR fixes it by using a proper padding size when calling the find_multiple utility.
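For reference, a minimal sketch of the padding behavior (this reimplements the find_multiple helper for illustration; the exact torchao signature may differ):

```python
# Illustrative reimplementation of the find_multiple padding helper
# (assumed behavior: smallest multiple of k that is >= n).
def find_multiple(n: int, k: int) -> int:
    if n % k == 0:
        return n
    return n + k - (n % k)

# With the CUDA tile size of 8, n = 8 needs no padding, so the packed
# n dimension stays too small for the AMD Matrix Core (nTileSize = 16)
# and the tile count rounds down to 0:
assert find_multiple(8, 8) == 8

# Padding with the ROCm tile size of 16 instead yields a shape the
# Matrix Core kernel can tile:
assert find_multiple(8, 16) == 16
```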

Testing

pytest -sv test/integration/test_integration.py -k test_int4_weight_only_quant_subclass_api_grouped_5

@pytorch-bot

pytorch-bot bot commented Feb 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3834

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c04c333 with merge base c17160a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 6, 2026
@ZhiweiYan-96
Contributor Author

@XiaobingSuper Mind taking a look?

@@ -127,7 +127,8 @@ def from_hp(

# Pre-process: pad to required dimensions
in_features = find_multiple(orig_in_features, 1024)
out_features = find_multiple(orig_out_features, 8)
n_tile = 16 if orig_out_features < 16 and torch.version.hip else 8

Should this just be n_tile = 16 if torch.version.hip else 8? find_multiple will pad according to the given tile size.

Contributor Author


That's right.


ZhiweiYan-96 commented Feb 6, 2026

hi, @petrex @jithunnair-amd Could you please review this PR and add the ciflow/rocm label for testing? Thanks.

@XiaobingSuper

@jerryzh168 could you help review it? Thanks!


ZhiweiYan-96 commented Feb 26, 2026

The two failures are unrelated to this PR:

test/prototype/test_parq.py::TestTorchAoConfigIntegration::test_tied_weights_quantization - AttributeError: 'list' object has no attribute 'keys'

 torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 29500, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use


vkuzo commented Mar 2, 2026

hi @ZhiweiYan-96 , looks reasonable overall. IMO we should make this user configurable instead of automatically selected; it's confusing when a packing format behaves differently based on the environment. Can we add this to the config instead of selecting it automatically? You can make the config clearly state which value the user needs to select on ROCm.

@ZhiweiYan-96 ZhiweiYan-96 changed the title [ROCm][INT4] Corner case n<Ntile handling [WIP][ROCm][INT4] Corner case n<Ntile handling Mar 5, 2026
@pytorch-bot

pytorch-bot bot commented Mar 5, 2026

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/benchmark
  • ciflow/tutorials
  • ciflow/rocm
  • ciflow/4xh100
  • ciflow/xpu

Please add the new label to .github/pytorch-probot.yml

@ZhiweiYan-96
Contributor Author

Thanks, @vkuzo , we can make n_tile_size a configurable attribute in Int4WeightOnlyConfig, defaulting to 8 (NVIDIA), with ROCm users setting it to 16.

`int4_choose_qparams_algorithm`: variants of choose qparams algorithm to use for int4,
currently support TINYGEMM ("tinygemm") and HQQ ("hqq"), used in version 2 only
`set_inductor_config`: if True, adjusts `torchinductor` settings to recommended values. used in both version 1 and 2
`version`: version of the config to use, default is 2

Add an int4_tile_packed_ntile description here? Also note that int4_tile_packed_ntile only works in the Int4PackingFormat.TILE_PACKED_TO_4D case.

Contributor Author


nice catch, moved

Int4ChooseQParamsAlgorithm.TINYGEMM
)
# ntile size for TILE_PACKED_TO_4D format, 8 for CUDA platform, 16 for ROCm platform
int4_tile_packed_ntile: int = 8

Do we need to add a check to ensure it only supports a limited set of values?
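One possible shape for such a check (a hypothetical sketch, not the PR's actual code; Int4TilePackedSettings is a stand-in for Int4WeightOnlyConfig, and the accepted values mirror the two tile sizes discussed in this thread):

```python
from dataclasses import dataclass

# Assumed valid values: 8 for CUDA, 16 for ROCm, per the discussion above.
_SUPPORTED_NTILES = (8, 16)

@dataclass
class Int4TilePackedSettings:  # hypothetical stand-in for Int4WeightOnlyConfig
    int4_tile_packed_ntile: int = 8

    def __post_init__(self):
        # Reject unsupported tile sizes early, at config-construction time.
        if self.int4_tile_packed_ntile not in _SUPPORTED_NTILES:
            raise ValueError(
                f"int4_tile_packed_ntile must be one of {_SUPPORTED_NTILES}, "
                f"got {self.int4_tile_packed_ntile}"
            )
```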

@ZhiweiYan-96 ZhiweiYan-96 left a comment


The requested change is committed.


cls,
hp_tensor: torch.Tensor,
block_size: List[int],
ntile_size: Optional[int] = 8,
Contributor


nit: maybe make this argument last, to avoid breaking any existing callsites that specify arguments positionally
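A hedged sketch of this nit (signature adapted from the quoted hunk; tensor types replaced with plain values so the example is self-contained): moving the new ntile_size argument last, and making it keyword-only, keeps existing positional callsites working unchanged.

```python
from typing import List, Optional

def from_hp(hp_tensor, block_size: List[int], *, ntile_size: Optional[int] = 8):
    # Real code would quantize hp_tensor here; we just echo the inputs
    # to show which value each callsite ends up with.
    return hp_tensor, block_size, ntile_size

# Existing positional callsite, unchanged by the new parameter:
assert from_hp("W", [1, 128])[2] == 8

# New ROCm caller opts in explicitly by keyword:
assert from_hp("W", [1, 128], ntile_size=16)[2] == 16
```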



vkuzo commented Mar 6, 2026

Looks good! Can we just make the new argument last? After that, if CI is green, LGTM!

@pytorch-bot pytorch-bot bot removed the ciflow/rocm label Mar 7, 2026
@ZhiweiYan-96 ZhiweiYan-96 requested a review from vkuzo March 8, 2026 09:14
@ZhiweiYan-96 ZhiweiYan-96 changed the title [WIP][ROCm][INT4] Corner case n<Ntile handling [ROCm][INT4] Configurable ntile size for TilePacked format Mar 8, 2026
@pytorch-bot

pytorch-bot bot commented Mar 8, 2026

Warning: Unknown label ciflow/rocm-mi300.
Currently recognized labels are

  • ciflow/benchmark
  • ciflow/tutorials
  • ciflow/rocm
  • ciflow/4xh100
  • ciflow/xpu

Please add the new label to .github/pytorch-probot.yml

@vkuzo vkuzo merged commit 67e5358 into pytorch:main Mar 9, 2026
23 checks passed

Labels

ciflow/rocm ciflow/rocm-mi300 CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. device: rocm topic: rocm


4 participants