[ROCm][INT4] Configurable ntile size for TilePacked format #3834
vkuzo merged 8 commits into pytorch:main from
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3834
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit c04c333 with merge base c17160a.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@XiaobingSuper Mind taking a look?
```
@@ -127,7 +127,8 @@ def from_hp(
    # Pre-process: pad to required dimensions
    in_features = find_multiple(orig_in_features, 1024)
    out_features = find_multiple(orig_out_features, 8)
    n_tile = 16 if orig_out_features < 16 and torch.version.hip else 8
```
Does this need to be `n_tile = 16 if torch.version.hip else 8`? `find_multiple` will do the padding according to the given tile size.
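To make the padding discussion concrete, here is a minimal sketch of what a `find_multiple`-style helper does (the torchao utility may be implemented differently; this body is illustrative): it rounds a dimension up to the nearest multiple of the given tile size, so passing the right tile size is what determines the padded shape.

```python
def find_multiple(n: int, k: int) -> int:
    """Round n up to the nearest multiple of k (illustrative sketch)."""
    if n % k == 0:
        return n
    return n + k - (n % k)

# With the ROCm tile size 16, an out_features of 8 is padded up to 16;
# with the CUDA tile size 8, it stays 8 — hence the tile size passed to
# find_multiple, not a separate branch, should drive the padding.
assert find_multiple(8, 16) == 16
assert find_multiple(8, 8) == 8
assert find_multiple(256, 1024) == 1024
```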
Hi @petrex @jithunnair-amd, could you please take a look at reviewing this PR and add
@jerryzh168 could you help review it? Thanks!
The two failures are unrelated to this PR.
Hi @ZhiweiYan-96, looks reasonable overall. IMO we should make this user-configurable instead of automatically selected; it's confusing when a packing format behaves differently based on the environment. Can we add this to the config instead of selecting it automatically? The config can clearly state which value the user needs to select on ROCm.
Warning: Unknown label
Please add the new label to .github/pytorch-probot.yml |
Thanks @vkuzo, we can make `n_tile_size` a configurable attribute in
Force-pushed from 84c6f59 to b390fec.
```
`int4_choose_qparams_algorithm`: variants of choose qparams algorithm to use for int4,
    currently support TINYGEMM ("tinygemm") and HQQ ("hqq"), used in version 2 only
`set_inductor_config`: if True, adjusts `torchinductor` settings to recommended values. used in both version 1 and 2
`version`: version of the config to use, default is 2
```
Add the `int4_tile_packed_ntile` description here? Also note that `int4_tile_packed_ntile` only works in the `Int4PackingFormat.TILE_PACKED_TO_4D` case.
nice catch, moved
```
    Int4ChooseQParamsAlgorithm.TINYGEMM
)
# ntile size for TILE_PACKED_TO_4D format, 8 for CUDA platform, 16 for ROCm platform
int4_tile_packed_ntile: int = 8
```
Do we need to add a check to ensure it only supports a limited set of values?
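One way to answer this review question is a validation hook on the config. This is a hypothetical sketch (the class name `Int4ConfigSketch` and the `__post_init__` check are illustrative, not torchao's actual implementation); only the field name and the 8/16 values come from the PR.

```python
from dataclasses import dataclass


@dataclass
class Int4ConfigSketch:
    # ntile size for TILE_PACKED_TO_4D format: 8 on CUDA, 16 on ROCm
    int4_tile_packed_ntile: int = 8

    def __post_init__(self):
        # Restrict the field to the values the kernel actually supports.
        if self.int4_tile_packed_ntile not in (8, 16):
            raise ValueError(
                "int4_tile_packed_ntile must be 8 (CUDA) or 16 (ROCm), "
                f"got {self.int4_tile_packed_ntile}"
            )
```

With this in place, `Int4ConfigSketch(int4_tile_packed_ntile=4)` raises a `ValueError` at construction time instead of failing later inside the packing kernel.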
ZhiweiYan-96 left a comment:
Requested change is committed.
```
cls,
hp_tensor: torch.Tensor,
block_size: List[int],
ntile_size: Optional[int] = 8,
```
nit: maybe make this argument last, to avoid breaking any existing callsites that specify arguments positionally
Thanks @vkuzo! Moved the argument to the last position: https://github.com/pytorch/ao/pull/3834/changes#diff-632459893f0de165d48a87a900c6535f5b392520b47fffeddaa2175f5904b6baR102. CI is green now.
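The review suggestion above can be sketched as follows. Appending a new parameter with a default after the existing ones keeps all existing positional callsites working; the signature mirrors the PR's `from_hp`, but the body here is a stand-in for illustration only.

```python
from typing import List, Optional


def from_hp(hp_tensor, block_size: List[int], ntile_size: Optional[int] = 8):
    """Stand-in body: real from_hp packs the tensor; here we just echo args."""
    return {"tensor": hp_tensor, "block_size": block_size, "ntile": ntile_size}


# Old positional callsite: still valid, ntile_size falls back to the default 8.
result = from_hp("weights", [1, 128])
assert result["ntile"] == 8

# New callers pass the tile size explicitly; keyword use is safest.
result = from_hp("weights", [1, 128], ntile_size=16)
assert result["ntile"] == 16
```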
Looks good! Can we just make the new argument last? After that, if CI is green, LGTM!
Motivation
Fixes a unit-test (UT) failure.
The failing case has shape (m, k, n) = (256, 256, 8). The n dimension is smaller than the AMD Matrix Core nTileSize = 16, while it is acceptable for the NVIDIA Tensor Core with nTileSize = 8.
According to the code at https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu#L1116, we can infer that
This PR fixes it by using a proper padding size when calling `find_multiple` in utils.
Testing