Introduce new template heuristic for triton autotune configs #144985
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144985
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 8 Pending, 7 Unrelated Failures as of commit eb7843a with merge base 683178f.
NEW FAILURES - The following jobs have failed.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This seems like an improvement, but we shouldn't assume a single device and we shouldn't read the config class in global scope so the user can override it without restarting the process.
torch/_inductor/choices.py
Outdated
```python
def __init__(self):
    self.config_heuristics = self._get_device_config_heuristic()

def _get_device_config_heuristic(self, device_type="cuda"):
    from torch._inductor.utils import get_gpu_type

    device_type = get_gpu_type()

    if device_type == "cuda":
        if torch.version.hip is None:
            return CUDAConfigHeuristic()
        else:
            return ROCmConfigHeuristic()
    elif device_type == "xpu":
        return XPUConfigHeuristic()
    elif torch.cuda.is_available():
        return BaseConfigHeuristic()
    else:
        return CPUConfigHeuristic()
```
This won't work for programs that use multiple devices (most commonly GPU+CPU). We should be selecting the heuristics based on the device of the op, not the hardware properties of the machine. Maybe:
```python
def get_device_config_heuristic(self, device):
    ...  # similar to the code you have
```
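For illustration, a minimal sketch of that per-device dispatch. The class names come from the snippet above; the import path is an assumption based on the PR's description of torch/_inductor/template_heuristics.py:

```python
import torch

# Assumed import surface; the PR states these subclasses live in
# torch/_inductor/template_heuristics.py.
from torch._inductor.template_heuristics import (
    BaseConfigHeuristic,
    CPUConfigHeuristic,
    CUDAConfigHeuristic,
    ROCmConfigHeuristic,
    XPUConfigHeuristic,
)


def get_device_config_heuristic(device: torch.device):
    # Dispatch on the device of the op being compiled, not on the
    # hardware of the compiling machine, so a GPU+CPU program picks
    # the right heuristic per kernel.
    if device.type == "cuda":
        # torch.version.hip is non-None on ROCm builds of PyTorch
        if torch.version.hip is not None:
            return ROCmConfigHeuristic()
        return CUDAConfigHeuristic()
    if device.type == "xpu":
        return XPUConfigHeuristic()
    if device.type == "cpu":
        return CPUConfigHeuristic()
    return BaseConfigHeuristic()
```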
torch/_inductor/kernel/conv.py
Outdated
```python
if torch._inductor.config.max_autotune_custom_heuristic is None:
    conv_heuristics = V.choices.config_heuristics
else:
    conv_heuristics = torch._inductor.config.max_autotune_custom_heuristic
```
Move this into V.choices.
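A hedged sketch of what that consolidation might look like, assuming the max_autotune_custom_heuristic option from the diff; the method name get_config_heuristics is hypothetical:

```python
from torch._inductor import config as inductor_config


class InductorChoices:
    def get_config_heuristics(self, device):
        # Hypothetical helper consolidating the None-check that
        # conv.py and mm_common.py currently duplicate. Reading the
        # override at call time (not import time) lets users change
        # it without restarting the process.
        if inductor_config.max_autotune_custom_heuristic is not None:
            return inductor_config.max_autotune_custom_heuristic
        # per-device dispatch as sketched in the earlier comment
        return get_device_config_heuristic(device)
```

Call sites would then reduce to something like conv_heuristics = V.choices.get_config_heuristics(device).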
torch/_inductor/kernel/mm_common.py
Outdated
```python
if inductor_config.max_autotune_custom_heuristic is None:
    mm_heuristics = V.choices.config_heuristics
else:
    mm_heuristics = inductor_config.max_autotune_custom_heuristic
```
Duplicate code; this should also move to V.choices.
torch/_inductor/kernel/mm_common.py
Outdated
```python
if inductor_config.max_autotune_gemm_search_space != "EXHAUSTIVE":
    mm_kernel_configs = mm_heuristics.get_mm_configs()
else:
    mm_kernel_configs = mm_heuristics.get_exhaustive_mm_configs()

extra_mm_kernel_configs = mm_heuristics.get_extra_mm_configs()
int8_mm_kernel_configs = mm_heuristics.get_int8_mm_configs()
mixed_mm_kernel_configs_small_m = mm_heuristics.get_mixed_mm_configs()
persistent_mm_kernel_configs = mm_heuristics.get_persistent_mm_configs()
scaled_mm_kernel_configs = mm_heuristics.get_scaled_mm_configs()
scaled_persistent_mm_kernel_configs = mm_heuristics.get_scaled_persistent_mm_configs()
```
This code should be moved inside functions so it can get overridden by the user (through the V.choices API). Evaluating in global scope prevents this.
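For illustration, the deferral being requested, with a module-level assignment wrapped in a function (the function name is hypothetical; the config option and heuristic method names come from the diff):

```python
from torch._inductor import config as inductor_config
from torch._inductor.virtualized import V


def get_mm_kernel_configs():
    # Resolve the heuristics through V.choices on each call so a
    # user-supplied override takes effect; evaluating this once in
    # global scope would freeze the choice at import time.
    mm_heuristics = V.choices.config_heuristics
    if inductor_config.max_autotune_gemm_search_space == "EXHAUSTIVE":
        return mm_heuristics.get_exhaustive_mm_configs()
    return mm_heuristics.get_mm_configs()
```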
torch/_inductor/kernel/conv.py
Outdated
```python
# On ROCm convert num_stages to 1 as pipelining provides no benefit
if torch.version.hip and torch.cuda.is_available():
    platform_configs = build_rocm_gemm_configs(platform_configs)
kernel_configs = conv_heuristics.get_conv_configs()
```
Move this inside the function so you can override the choices.
Thanks @jansel, I will address these comments and try to get CI green.
Force-pushed from 8479e33 to 06e9755.
Successfully rebased; force-pushed from f5395a9 to eb7843a.
@jansel has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
…configs (pytorch#144985)"". This reverts commit 4434376.
@pytorchbot merge -f "landed in fbcode". I fixed the internal issue.
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Thanks for all the help here @jansel
This pull request has been reverted by 465930e. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).
Initial PR to refactor the bulk of mm_common to allow for better device-specific specialisation; e.g. in #143286 we require extensive conditionalisation to get ROCm-specific optimisations in.
This PR introduces a new file, torch/_inductor/template_heuristics.py, which implements device-specific subclasses for autotune configs. These subclasses are integrated as part of the InductorChoices class, which will be the interface through which the kernel files access the configs. The mm_common, mm_plus_mm and conv configurations are implemented in this class; in the future we plan to bring in the flex attention configurations as well, so that all of the tuning-config logic for templated triton kernels is handled in this file.
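Roughly, the hierarchy described above could be pictured like this (a sketch: the class names appear in the review snippets, the method names follow the diff, and everything else is illustrative):

```python
# torch/_inductor/template_heuristics.py (sketch)
class BaseConfigHeuristic:
    # Each get_*_configs method returns the triton autotune configs
    # for one family of templated kernels; subclasses override the
    # ones that need device-specific tuning.
    def get_mm_configs(self):
        return []  # default config list elided in this sketch

    def get_conv_configs(self):
        return []  # default config list elided in this sketch


class CUDAConfigHeuristic(BaseConfigHeuristic):
    """NVIDIA-specific tuning choices."""


class ROCmConfigHeuristic(BaseConfigHeuristic):
    """AMD-specific tuning choices, e.g. num_stages=1 since pipelining
    provides no benefit on ROCm (per the conv.py comment above)."""


class XPUConfigHeuristic(BaseConfigHeuristic):
    """Intel GPU tuning choices."""


class CPUConfigHeuristic(BaseConfigHeuristic):
    """CPU tuning choices."""
```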
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov