
[aot_inductor] replace TORCH_CHECK with AOTI_CHECK in the generated cpp code #119220

Closed
wants to merge 4 commits

Conversation

chenyang78 (Contributor) commented Feb 5, 2024

Stack from ghstack (oldest at bottom):

When we have TORCH_CHECK inside loops, it may cause the host
compiler to spend hours optimizing the run_impl function. This PR
mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK,
which forces the underlying assert function to be noinline.

If forcing noinline causes any serious perf regression, we could
add an option to turn noinline on or off, or add an option to turn
AOTI_CHECK into a no-op, similar to the assert macro from cassert.
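
A minimal sketch of the idea (illustrative only; the function name, signature, and message plumbing here are assumptions, not the actual shim code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Keep the cold failure path out of run_impl: marking it noinline means each
// check site compiles to a cheap branch plus a call, instead of inlining the
// whole error-formatting body into the (already huge) generated function.
__attribute__((noinline)) static void aoti_check_fail(
    const char* func, const char* file, uint32_t line, const char* msg) {
  fprintf(stderr, "%s at %s:%u: %s\n", func, file, (unsigned)line, msg);
  abort();
}

#define AOTI_CHECK(cond, msg)                                  \
  do {                                                         \
    if (!(cond)) {                                             \
      aoti_check_fail(__func__, __FILE__,                      \
                      static_cast<uint32_t>(__LINE__), msg);   \
    }                                                          \
  } while (0)
```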

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @kadeng @muchulee8 @aakhundov @ColinPeppler

pytorch-bot commented Feb 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119220

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d83f93e with merge base 5636412:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

chenyang78 added a commit that referenced this pull request Feb 5, 2024
@@ -1730,7 +1730,10 @@ def codegen_loops(self, code, worksharing):

     @property
     def assert_function(self) -> str:
-        return "TORCH_CHECK"
+        if V.graph.aot_mode:
+            return "AOTI_CHECK"
Contributor
n00b q: both the CUDA and CPU cases will switch to use AOTI_CHECK, right? Are there any concerns from the GPU side?

Contributor Author
I think the change should be fine for both CPU and GPU.

@hl475 (Contributor) left a comment
Thanks Yang!

if V.graph.aot_mode:
    return "AOTI_CHECK"
else:
    return "TORCH_CHECK"
Contributor
JIT Inductor will also face the compilation-time challenge sooner or later, so it's worth unifying the behavior here.

Contributor Author
So, we can just remove the if V.graph.aot_mode condition? Thanks.

Contributor Author
never mind. Saw your comment below. Let's keep AOTI_CHECK for aot_mode for now. Thanks.

@@ -15,6 +15,7 @@
 // in model.so, and should not refer to any aten/c10 headers except the stable
 // C ABI defined in torch/csrc/inductor/aoti_torch/c/shim.h. The same rule
 // applies to other files under torch/csrc/inductor/aoti_runtime/.
+#include <c10/util/Exception.h>
Contributor
There is c10/util/Exception.cpp, so this has the potential to break ABI compatibility. For AOTI, we can implement a separate version of torchCheckFail. W.r.t. my comments earlier, we can probably start with fixing AOT first, and then JIT.

Contributor Author
Hmm, I thought c10/util/Exception.h was indirectly included via c10/util/generic_math.h, which is why I included the same header here. In any case, let me add our own version of torchCheckFail. Thanks.

chenyang78 added a commit that referenced this pull request Feb 6, 2024
}
}

#ifdef STRIP_ERROR_MESSAGES
@desertfire (Contributor) commented Feb 6, 2024
We need to update codecache.py to take this macro, otherwise users might be surprised.

Contributor Author
> We need to update codecache.py to take this macro, otherwise users might be surprised.

Could you elaborate? Do you mean we need to pass this macro from the compiler invocation command-line? I think this macro is only used for production mobile builds.
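
(For readers following along, a paraphrase of what this macro changes in c10, with hypothetical names, not the literal source:)

```cpp
// Paraphrased: with STRIP_ERROR_MESSAGES defined (as in production mobile
// builds), c10 collapses the check message to a static string derived from
// the condition text, so no message-formatting code is compiled in.
const char* format_check_msg(const char* cond_str, ...);  // hypothetical

#ifdef STRIP_ERROR_MESSAGES
#define CHECK_MSG(cond, ...) (#cond " CHECK FAILED")               // static string
#else
#define CHECK_MSG(cond, ...) format_check_msg(#cond, __VA_ARGS__)  // formatted
#endif
```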

@desertfire (Contributor) commented Feb 7, 2024
I think we can implement __aoti_check and AOTI_CHECK in the C shim, and then it will naturally pick up whatever macros were used for building PyTorch. Another benefit is that we can still call the same implementation in c10 (except we wrap it with AOTI_NOINLINE) instead of having potentially divergent behavior.

Contributor Author
Good point. Done. Thanks.
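
A rough sketch of that C-shim arrangement (the entry-point name and signature here are hypothetical; the real shim's declarations differ):

```cpp
#include <c10/util/Exception.h>  // brings in c10::detail::torchCheckFail
#include <cstdint>

// Declared in the stable C header that the generated model.so code may
// include, but *defined* here inside libtorch. Because this file is built
// with PyTorch's own flags, it automatically honors macros like
// STRIP_ERROR_MESSAGES, and it delegates to the existing c10 failure path
// instead of maintaining a diverged copy.
extern "C" void aoti_check_fail(
    const char* func,
    const char* file,
    uint32_t line,
    const char* msg) {
  c10::detail::torchCheckFail(func, file, line, msg);
}
```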

desertfire self-requested a review on February 6, 2024 19:28
chenyang78 added a commit that referenced this pull request Feb 8, 2024
chenyang78 added a commit that referenced this pull request Feb 8, 2024
…ate cpp code

When we have TORCH_CHECK inside loops, it may cause the host
compiler to spend hours optimizing the run_impl function. This PR
mitigates the issue by replacing TORCH_CHECK with a custom
AOTI_TORCH_CHECK, which places the underlying assert function in the
C shim so that it won't be inlined.

If forcing noinline causes any serious perf regression, we could
add an option to turn noinline on or off, or add an option to turn
AOTI_TORCH_CHECK into a no-op, similar to the `assert` macro from
cassert.

Pull Request resolved: #119220
      __func__, \
      __FILE__, \
      static_cast<uint32_t>(__LINE__), \
      TORCH_CHECK_MSG(cond, "", __VA_ARGS__));
Contributor
Wait, there is still TORCH_CHECK_MSG. Hmm, this is annoying; it seems like we have to pass varargs. I wonder if it would be easier to just make c10/util/Exception.h header-only (merging c10/util/Exception.cpp into c10/util/Exception.h).

@ezyang, do you have any suggestions for this?

Contributor Author
Reading c10's utility functions, it seems this TORCH_CHECK_MSG is header-only. After it's expanded, it invokes torchCheckMsgImpl:

template <typename... Args>
decltype(auto) torchCheckMsgImpl(const char* /*msg*/, const Args&... args) {
  return ::c10::str(args...);
}
inline C10_API const char* torchCheckMsgImpl(const char* msg) {
  return msg;
}
// If there is just 1 user-provided C-string argument, use it.
inline C10_API const char* torchCheckMsgImpl(
    const char* /*msg*/,
    const char* args) {
  return args;
}

where ::c10::str also seems to be header-only:

template <typename... Args>
inline decltype(auto) str(const Args&... args) {
  return detail::_str_wrapper<
      typename detail::CanonicalizeStrTypes<Args>::type...>::call(args...);
}
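
In other words (a paraphrased illustration of the expansion chain, not the literal macro text):

```cpp
// Paraphrased expansion chain for a call like
// TORCH_CHECK(n > 0, "expected ", n, " elements"):
//
//   TORCH_CHECK_MSG(n > 0, "", "expected ", n, " elements")
//     => c10::detail::torchCheckMsgImpl(<default message>,
//                                       "expected ", n, " elements")
//     => ::c10::str("expected ", n, " elements")   // builds the message
//
// Every step resolves to inline/template code in headers, so nothing new
// needs to be exported from c10's compiled Exception.cpp.
```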

Contributor
I see. Then it should be ok for this PR. At some point, we need to come up with a small set of self-contained c10/util headers.

Contributor Author
> I see. Then it should be ok for this PR. At some point, we need to come up with a small set of self-contained c10/util headers.

Yes. That would be very helpful.

@desertfire (Contributor) left a comment

TORCH_CHECK_MSG still needs to be addressed

@ColinPeppler (Contributor)
hello @chenyang78, n00b q: how do you figure out why the host compiler is spending a lot of time optimizing due to inlining?

@chenyang78 (Contributor Author)
> hello @chenyang78, n00b q: how do you figure out why the host compiler is spending a lot of time optimizing due to inlining?

I guess it mostly comes from my intuition :), because inlining increases the size of the function body, which could result in longer compilation time.
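
(A general-purpose way to check this, beyond intuition and not from this thread: gcc and clang's `-ftime-report` prints a per-pass compile-time summary, and clang's `-ftime-trace` emits a per-function JSON profile; either can confirm that the optimizer is stuck on one huge function body.)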

@chenyang78 (Contributor Author)
@pytorchbot merge

pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on Feb 8, 2024
chunyuan-w added a commit that referenced this pull request May 20, 2024
## Description
Fixes #114450. This PR builds upon the work from imzhuhl done in #114451.

This PR requires #122472 to land first.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during compilation and restore the opaque tensor when loading the compiled .so.
The ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. This may need further discussion on how to address it (`AOTI_TORCH_CHECK` was introduced in #119220).
2. Freezing in non-ABI-compatible mode will work with the support in this PR. For ABI-compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
https://github.com/pytorch/pytorch/blob/6c4f43f82675b5fcfe8cf3e5983d0c0f326408aa/torch/_inductor/codegen/cpp_wrapper_cpu.py#L2023-L2024

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10 voznesenskym penguinwu EikanWang Guobing-Chen zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
pytorchmergebot pushed a commit that referenced this pull request May 24, 2024
Pull Request resolved: #124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
pytorchmergebot pushed a commit that referenced this pull request May 25, 2024
titaiwangms pushed a commit to titaiwangms/pytorch that referenced this pull request May 28, 2024
chenyang78 added a commit that referenced this pull request Jun 11, 2024
We introduced AOTI_TORCH_CHECK in #119220 to resolve slow compilation
times. Unfortunately, it caused perf regressions for CPU, as described
in issue #126665. After some investigation, it turned out the
regression was caused by the use of the builtin function
__builtin_expect provided by gcc/clang. Moreover, nuking
__builtin_expect doesn't seem to cause any performance penalty, even
though its purpose is to improve performance by providing the compiler
with branch-prediction information.

Abs latency numbers using the script shared in #126665:

                            before the fix      after the fix
T5Small                     1019.055694         917.875027
T5ForConditionalGeneration  1009.825196         916.369239
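
(A hedged illustration of the pattern at issue; the macro and function names are stand-ins, not the actual shim code:)

```cpp
void aoti_check_fail(const char* msg);  // hypothetical failure hook

// gcc/clang builtin: tells the compiler the condition is expected to be true,
// so the failure branch gets laid out as cold. This is what the fix removes.
#define AOTI_LIKELY(cond) __builtin_expect(!!(cond), 1)

// Before the fix (hypothetical names): every generated check carried the hint.
#define AOTI_TORCH_CHECK_OLD(cond, msg) \
  if (!AOTI_LIKELY(cond)) {             \
    aoti_check_fail(msg);               \
  }

// After the fix: a plain branch, which the numbers above show costs nothing.
#define AOTI_TORCH_CHECK_NEW(cond, msg) \
  if (!(cond)) {                        \
    aoti_check_fail(msg);               \
  }
```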
pytorchmergebot pushed a commit that referenced this pull request Jun 14, 2024
Pull Request resolved: #128402
Approved by: https://github.com/desertfire