[Inductor] Force the parallel depth as outer loop fusion depth #123899

leslie-fang-intel · 2024-04-12T02:18:55Z

Stack from ghstack (oldest at bottom):

-> [Inductor] Force the parallel depth as outer loop fusion depth #123899

Summary
Fixes #123801, a performance regression of pyhpc_turbulent_kinetic_energy introduced by outer loop fusion.

Root Cause

Generated Kernel before Outer Loop Fusion
- Take the following 2 kernels as an example:
  - Kernel 0 has 2 loop levels with size [200, 200]. Parallelization is not feasible because decide_parallel_depth finds an insufficient number of elements. Therefore, the loop code is generated with the #pragma omp single directive.
  - Kernel 1 has 3 loop levels with size [200, 200, 26], which has enough elements to be parallelized.

Generated Kernel after Outer Loop Fusion
- After outer loop fusion, Kernel0 and Kernel1 are fused into one OuterLoopFusedKernel whose outer loop size is [200, 200], which does not contain enough elements to be parallelized.

In this PR, we propose a fix for loop_nest involving OuterLoopFusedKernel. The fix entails adding a specific heuristic for OuterLoopFusedKernel to determine the parallel depth by combining outer_loop_fusion_depth with the internal kernels' parallel depth.
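
To make the root cause concrete, here is a minimal sketch of a work-size heuristic in the spirit of decide_parallel_depth. It is not the Inductor implementation: the function name toy_parallel_depth, the thread count of 56, and the minimum chunk size of 4096 elements per thread are all assumptions chosen for illustration.

```python
from math import prod
from typing import Sequence


def toy_parallel_depth(ranges: Sequence[int], threads: int, min_chunk: int = 4096) -> int:
    """Return how many outer loop levels to parallelize (0 means stay single-threaded)."""
    remaining = prod(ranges)  # total element count; divided by each loop size once that level is parallelized
    par = 1                   # parallel iterations exposed so far
    depth = 0
    for size in ranges:
        if par >= 2 * threads or par == threads:
            break  # already enough parallel iterations to occupy the threads
        if remaining // threads < min_chunk:
            break  # too little work per thread to justify parallelizing
        depth += 1
        par *= size
        remaining //= size
    return depth


kernel0 = toy_parallel_depth([200, 200], threads=56)      # 40,000 elements
kernel1 = toy_parallel_depth([200, 200, 26], threads=56)  # 1,040,000 elements
fused_outer = toy_parallel_depth([200, 200], threads=56)  # only the fused outer loops
print(kernel0, kernel1, fused_outer)
```

Under these assumed thresholds, the [200, 200] nest stays serial while the [200, 200, 26] nest gets one parallel level, so judging the fused kernel by its outer loops alone would drop the parallelism Kernel 1 had before fusion.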

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-04-12T02:18:57Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123899

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 36972d1 with merge base adbf62c:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu) (gh)
test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_clamp_max_cuda_float64

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 5b2373d Pull Request resolved: #123899

jgong5

What if the innermost loop count of the second loop nest is small enough that the second loop nest is also decided to be single-threaded? Instead of forcing parallelization on the outer loops, I wonder if we should revise how the parallel depth is decided for this outer loop fusion case. A simple way to implement this is:

- unsqueeze the ranges to align their lengths, e.g., [200, 200] and [200, 200, 60] -> [200, 200, 1] and [200, 200, 60]
- do ranges = [max(i, j) for i, j in zip(ranges1, ranges2)]
- depth = min(outer_loop_depth, decide_parallel_depth(ranges, ...))
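
The three steps above can be sketched as follows. This illustrates the reviewer's suggestion and is not code from the PR; merged_parallel_depth, the decide callback (standing in for decide_parallel_depth), and the stub threshold are hypothetical.

```python
from math import prod
from typing import Callable, List, Sequence


def merged_parallel_depth(
    ranges1: Sequence[int],
    ranges2: Sequence[int],
    outer_loop_depth: int,
    decide: Callable[[List[int]], int],
) -> int:
    # Step 1: pad the shorter range list with 1s so both have the same length.
    n = max(len(ranges1), len(ranges2))
    padded1 = list(ranges1) + [1] * (n - len(ranges1))
    padded2 = list(ranges2) + [1] * (n - len(ranges2))
    # Step 2: take the element-wise maximum of the aligned ranges.
    ranges = [max(i, j) for i, j in zip(padded1, padded2)]
    # Step 3: never parallelize deeper than the fused outer loops.
    return min(outer_loop_depth, decide(ranges))


def stub_decide(ranges: List[int]) -> int:
    # Hypothetical stand-in: one parallel level once there are at least 1,000,000 elements.
    return 1 if prod(ranges) >= 1_000_000 else 0


depth = merged_parallel_depth([200, 200], [200, 200, 60], outer_loop_depth=2, decide=stub_decide)
print(depth)
```

Passing the heuristic in as a callback keeps the sketch independent of how the real decide_parallel_depth measures work.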

…epth"

**Summary**

Fixes #123801, a performance regression of `pyhpc_turbulent_kinetic_energy` introduced by outer loop fusion.

**Root Cause**

- [Generated Kernel before Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209)
  - Take the following 2 kernels as an example:
    - [Kernel 0](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L255-L305) has 2 loop levels with size [200, 200]. Parallelization is not feasible because [`decide_parallel_depth`](https://github.com/pytorch/pytorch/blob/aaec97a40364bb6ccfd968f28d309cfff8748d20/torch/_inductor/codegen/cpp.py#L2145-L2164) finds an insufficient number of elements. Therefore, the loop code is generated with the `#pragma omp single` directive.
    - [Kernel 1](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L306-L316) has 3 loop levels with size [200, 200, 26], which has enough elements to be parallelized.
- [Generated Kernel after Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887)
  - After outer loop fusion, `Kernel0` and `Kernel1` are fused into one [OuterLoopFusedKernel](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887#file-pyhpc_turbulent_kinetic_energy-after-outer-loop-fusion-py-L261-L497) whose outer loop size is [200, 200], which does not contain enough elements to be parallelized.

In this PR, we propose a fix for `loop_nest` with `OuterLoopFusedKernel`: we enforce parallelization with a parallel depth equal to `outer_loop_fusion_depth`.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 8e95366 Pull Request resolved: #123899

leslie-fang-intel · 2024-04-15T01:13:38Z

Hi @jgong5, thanks for your suggestion. I revised the implementation:

- Store the kernels' parallel depths from before outer loop fusion in OuterLoopFusedKernel as kernels_par_depth.
- Decide the parallel depth for the outer loop fusion case as min(outer_loop_depth, max(kernels_par_depth)).

Please kindly help to take a look again.
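
As a quick check of the revised rule, the snippet below evaluates min(outer_loop_depth, max(kernels_par_depth)) for a few cases. The helper name fused_parallel_depth and the sample depths are made up for illustration.

```python
from typing import Sequence


def fused_parallel_depth(outer_loop_depth: int, kernels_par_depth: Sequence[int]) -> int:
    # Parallelize as deep as the most parallel inner kernel wants,
    # but never deeper than the loops shared by the fused kernels.
    return min(outer_loop_depth, max(kernels_par_depth))


# One kernel stayed serial (0), the other wanted one parallel level (1).
mixed = fused_parallel_depth(2, [0, 1])
# Every kernel stayed serial, so the fused kernel stays serial too.
all_serial = fused_parallel_depth(2, [0, 0])
# A kernel wants 3 levels, but only 2 outer loops are fused.
clamped = fused_parallel_depth(2, [1, 3])
print(mixed, all_serial, clamped)
```

The second case speaks to the reviewer's concern: when no inner kernel has enough work, the fused kernel is not forced into parallel execution.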


leslie-fang-intel · 2024-04-16T06:39:25Z

Hi @jgong5, thanks for your suggestion. Changed accordingly, please help to take a look again.

I think we can't use decide_parallel_depth directly for all the kernels with different call_range, since it is bound to a specific kernel, as in

pytorch/torch/_inductor/codegen/cpp.py

Line 2146 in 5bef127

seq = self.size_hint()

Instead, I added a decide_parallel_depth method to OuterLoopFusedKernel that provides a heuristic for deciding the parallel depth in this case.
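
A rough sketch of that idea follows: because each kernel's heuristic depends on its own size, the fused kernel asks every inner kernel for its depth and then combines the answers. ToyKernel, ToyOuterLoopFusedKernel, and the simplified threshold of 4096 elements per thread are hypothetical; they only mirror the shape of the approach described here, not the code in cpp.py.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class ToyKernel:
    call_ranges: List[int]

    def size_hint(self) -> int:
        # Bound to this kernel: the total number of elements it iterates over.
        return prod(self.call_ranges)

    def decide_parallel_depth(self, max_parallel_depth: int, threads: int) -> int:
        # Simplified assumption: one parallel level if each thread gets >= 4096 elements.
        enough_work = self.size_hint() // threads >= 4096
        return min(max_parallel_depth, 1) if enough_work else 0


@dataclass
class ToyOuterLoopFusedKernel:
    inner: List[ToyKernel]

    def decide_parallel_depth(self, max_parallel_depth: int, threads: int) -> int:
        # Ask each inner kernel using its own call_ranges, then combine the results.
        depths = [k.decide_parallel_depth(len(k.call_ranges), threads) for k in self.inner]
        return min(max_parallel_depth, max(depths))


fused = ToyOuterLoopFusedKernel([ToyKernel([200, 200]), ToyKernel([200, 200, 26])])
fused_depth = fused.decide_parallel_depth(max_parallel_depth=2, threads=56)
print(fused_depth)
```

With the two example kernels and an assumed 56 threads, the first kernel alone would stay serial, but the fused kernel still gets one parallel level because the second kernel has enough work.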


ghstack-source-id: a25a020 Pull Request resolved: #123899


ghstack-source-id: 31c7c6c Pull Request resolved: #123899

leslie-fang-intel · 2024-04-24T01:12:17Z

@pytorchbot rebase

pytorchmergebot · 2024-04-24T01:14:05Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]

pytorchmergebot · 2024-04-24T01:14:23Z

Successfully rebased gh/leslie-fang-intel/88/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/123899)

ghstack-source-id: 6b8b0af Pull Request resolved: #123899

lezcano · 2024-04-24T09:16:11Z

@pytorchbot merge -i

pytorchmergebot · 2024-04-24T09:18:02Z

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

leslie-fang-intel · 2024-04-24T12:30:28Z

@pytorchbot merge

pytorchmergebot · 2024-04-24T12:32:14Z

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

leslie-fang-intel · 2024-04-24T12:36:47Z

Hi @atalman, this PR failed to merge with the following error:

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

EasyCLA

This looks strange, since hud reports only 1 Unrelated Failure. Could you kindly help take a look?


ghstack-source-id: 86bdbc3 Pull Request resolved: #123899

leslie-fang-intel · 2024-04-25T08:08:50Z

@pytorchbot merge

pytorchmergebot · 2024-04-25T08:10:44Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here


Pull Request resolved: #123899
Approved by: https://github.com/jgong5, https://github.com/lezcano

[Inductor] Force the parallel depth as outer loop fusion depth

8fd4116

[ghstack-poisoned]

pytorch-bot bot added the ciflow/inductor and module: inductor labels Apr 12, 2024

leslie-fang-intel added a commit that referenced this pull request Apr 12, 2024

[Inductor] Force the parallel depth as outer loop fusion depth

e8735df

ghstack-source-id: 5b2373d Pull Request resolved: #123899

pytorchbot added the open source label Apr 12, 2024

leslie-fang-intel added the ciflow/trunk and topic: not user facing labels Apr 12, 2024

leslie-fang-intel marked this pull request as draft April 12, 2024 02:36

leslie-fang-intel marked this pull request as ready for review April 12, 2024 05:01

leslie-fang-intel requested review from jgong5 and lezcano April 12, 2024 08:57

jgong5 requested changes Apr 12, 2024


leslie-fang-intel added a commit that referenced this pull request Apr 13, 2024

[Inductor] Force the parallel depth as outer loop fusion depth

002a26f

ghstack-source-id: 8e95366 Pull Request resolved: #123899

leslie-fang-intel requested a review from jgong5 April 15, 2024 01:14

jgong5 requested changes Apr 15, 2024


leslie-fang-intel requested a review from jgong5 April 16, 2024 06:39

leslie-fang-intel added a commit that referenced this pull request Apr 16, 2024

[Inductor] Force the parallel depth as outer loop fusion depth

b24d30f

ghstack-source-id: a25a020 Pull Request resolved: #123899

jgong5 reviewed Apr 17, 2024


leslie-fang-intel requested a review from peterbell10 April 17, 2024 01:14

leslie-fang-intel added a commit that referenced this pull request Apr 23, 2024

[Inductor] Force the parallel depth as outer loop fusion depth

3133668

ghstack-source-id: 31c7c6c Pull Request resolved: #123899

Update

7782d27

[ghstack-poisoned]

pytorchmergebot pushed a commit that referenced this pull request Apr 24, 2024

[Inductor] Force the parallel depth as outer loop fusion depth

419da9e

ghstack-source-id: 6b8b0af Pull Request resolved: #123899

pytorchmergebot added the merging label Apr 24, 2024

pytorchmergebot removed the merging label Apr 24, 2024

pytorchmergebot added the merging label Apr 24, 2024

pytorchmergebot removed the merging label Apr 24, 2024

leslie-fang-intel added a commit that referenced this pull request Apr 25, 2024

[Inductor] Force the parallel depth as outer loop fusion depth

055dd16

ghstack-source-id: 86bdbc3 Pull Request resolved: #123899

pytorchmergebot added the merging label Apr 25, 2024

pytorchmergebot added the Merged label Apr 25, 2024

pytorchmergebot closed this in 2d7f709 Apr 25, 2024

pytorchmergebot removed the merging label Apr 25, 2024

leslie-fang-intel mentioned this pull request Apr 26, 2024

[inductor][cpu]pyhpc_turbulent_kinetic_energy AMP multithread static/dynamic shape default/cpp wrapper performance regression #123801

Closed

github-actions bot deleted the gh/leslie-fang-intel/88/head branch June 3, 2024 01:57
