@leslie-fang-intel leslie-fang-intel commented Apr 12, 2024

Stack from ghstack (oldest at bottom):

Summary
Fixes #123801, a performance regression in pyhpc_turbulent_kinetic_energy that appeared after outer loop fusion.

Root Cause

In this PR, we propose a fix for loop_nest involving OuterLoopFusedKernel. The fix entails adding a specific heuristic for OuterLoopFusedKernel to determine the parallel depth by combining outer_loop_fusion_depth with the internal kernels' parallel depth.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

pytorch-bot bot commented Apr 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123899

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 36972d1 with merge base adbf62c:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

leslie-fang-intel added a commit that referenced this pull request Apr 12, 2024
@leslie-fang-intel leslie-fang-intel added the ciflow/trunk and topic: not user facing labels Apr 12, 2024
@leslie-fang-intel leslie-fang-intel marked this pull request as draft April 12, 2024 02:36
@leslie-fang-intel leslie-fang-intel marked this pull request as ready for review April 12, 2024 05:01
@jgong5 jgong5 left a comment

What if the innermost loop count of the second loop nest is small enough that the second loop nest is also decided to be single-threaded? Instead of forcing parallelization on the outer loops, I wonder if we should revise the way the parallel depth is decided for this outer loop fusion case. A simple way to implement this is:

  1. unsqueeze the ranges to align their lengths, e.g., [200, 200] and [200, 200, 60] -> [200, 200, 1] and [200, 200, 60]
  2. do ranges = [max(i, j) for i, j in zip(ranges1, ranges2)]
  3. depth = min(outer_loop_depth, decide_parallel_depth(ranges, ...))
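
A minimal Python sketch of the three steps above. The names `fused_parallel_depth`, `decide_depth`, and `toy_decide_depth` (with its 100,000-element threshold) are hypothetical stand-ins chosen for illustration. This is not Inductor's `decide_parallel_depth`, only one assumption about how the steps could be wired together.

```python
from math import prod
from typing import Callable, List, Sequence


def fused_parallel_depth(
    ranges1: Sequence[int],
    ranges2: Sequence[int],
    outer_loop_depth: int,
    decide_depth: Callable[[List[int]], int],
) -> int:
    # Step 1: "unsqueeze" the shorter range list with trailing 1s so both
    # lists have the same length, e.g. [200, 200] -> [200, 200, 1].
    n = max(len(ranges1), len(ranges2))
    padded1 = list(ranges1) + [1] * (n - len(ranges1))
    padded2 = list(ranges2) + [1] * (n - len(ranges2))

    # Step 2: take the element-wise maximum of the aligned ranges.
    ranges = [max(i, j) for i, j in zip(padded1, padded2)]

    # Step 3: never parallelize deeper than the fused outer loops.
    return min(outer_loop_depth, decide_depth(ranges))


def toy_decide_depth(ranges: List[int], min_elements: int = 100_000) -> int:
    """Hypothetical heuristic: one parallel level if there is enough total work."""
    return 1 if prod(ranges) >= min_elements else 0


print(fused_parallel_depth([200, 200], [200, 200, 60], 2, toy_decide_depth))
```

With this toy heuristic, the combined ranges [200, 200, 60] carry enough work for one parallel level, while two [200, 200] loop nests fused together would stay single-threaded.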

…epth"


**Summary**
Fixes #123801, a performance regression in `pyhpc_turbulent_kinetic_energy` that appeared after outer loop fusion.

**Root Cause**

- [Generated Kernel before Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209)
  - Take the following 2 kernels as an example:
    - [Kernel 0](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L255-L305) has 2 loop levels with size [200, 200]. [`decide_parallel_depth`](https://github.com/pytorch/pytorch/blob/aaec97a40364bb6ccfd968f28d309cfff8748d20/torch/_inductor/codegen/cpp.py#L2145-L2164) determines that it has too few elements to be worth parallelizing, so the loop code is generated under the `#pragma omp single` directive.
    - [Kernel 1](https://gist.github.com/leslie-fang-intel/54fe21ac8871fc63b9bf20fdb6edf209#file-pyhpc_turbulent_kinetic_energy-before-outer-loop-fusion-py-L306-L316) has 3 loop levels with size [200, 200, 26], which has enough elements to be parallelized.
- [Generated Kernel after Outer Loop Fusion](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887)
  - After outer loop fusion, `Kernel0` and `Kernel1` are fused into one [OuterLoopFusedKernel](https://gist.github.com/leslie-fang-intel/57a497b9d9c6aa82b1c6a686292fc887#file-pyhpc_turbulent_kinetic_energy-after-outer-loop-fusion-py-L261-L497). Its outer loop size is [200, 200], which does not contain enough elements to be parallelized.

In this PR, we propose a fix for `loop_nest` with `OuterLoopFusedKernel`: enforce parallelization with a parallel depth equal to `outer_loop_fusion_depth`.
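
The element counts behind the root cause above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the thread count (56) and the minimum per-thread chunk (4096) are assumed values, and `work_per_thread` and `enough_work` are hypothetical helpers, not Inductor functions.

```python
from math import prod
from typing import Sequence

THREADS = 56      # assumed number of threads
MIN_CHUNK = 4096  # assumed minimum number of elements per thread


def work_per_thread(ranges: Sequence[int], threads: int = THREADS) -> int:
    """Elements each thread would get if the loop nest were split evenly."""
    return prod(ranges) // threads


def enough_work(ranges: Sequence[int]) -> bool:
    """True when the per-thread share reaches the assumed minimum chunk."""
    return work_per_thread(ranges) >= MIN_CHUNK


loop_nests = {
    "Kernel 0": [200, 200],
    "Kernel 1": [200, 200, 26],
    "fused outer loops": [200, 200],
}
for name, ranges in loop_nests.items():
    print(name, prod(ranges), work_per_thread(ranges), enough_work(ranges))
```

Under these assumptions, Kernel 0 and the fused outer loops each hold 40,000 elements (714 per thread), while Kernel 1 holds 1,040,000 (18,571 per thread), so a decision based only on the outer loop size leaves Kernel 1's work single-threaded.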



cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
leslie-fang-intel added a commit that referenced this pull request Apr 13, 2024
leslie-fang-intel commented Apr 15, 2024

Hi @jgong5, thanks for your suggestion. I have revised the implementation:

  • Store each kernel's parallel depth from before outer loop fusion in OuterLoopFusedKernel as kernels_par_depth.
  • Decide the parallel depth for the outer loop fusion case as min(outer_loop_depth, max(kernels_par_depth)).

Could you please take another look?

leslie-fang-intel commented Apr 16, 2024

Hi @jgong5, thanks for your suggestion. I have changed the PR accordingly; could you please take another look?

  • I think we can't call decide_parallel_depth once for all the kernels with different call_range, since it is bound to a specific kernel, as in
    seq = self.size_hint()
  • I added a decide_parallel_depth method to OuterLoopFusedKernel that provides a heuristic for deciding the parallel depth in this case.
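
A rough sketch of how such a method could look, assuming each inner kernel keeps its own size-bound heuristic and the fused kernel only combines the results. `ToyKernel`, `ToyOuterLoopFusedKernel`, the simplified heuristic, and the `min_chunk` default are all hypothetical and are not taken from the PR's diff.

```python
from math import prod
from typing import List, Sequence


class ToyKernel:
    """Hypothetical kernel whose heuristic depends on its own size_hint()."""

    def __init__(self, call_ranges: Sequence[int]) -> None:
        self.call_ranges = list(call_ranges)

    def size_hint(self) -> int:
        return prod(self.call_ranges)

    def decide_parallel_depth(
        self, max_depth: int, threads: int, min_chunk: int = 4096
    ) -> int:
        seq = self.size_hint()  # bound to this kernel, not to its neighbours
        par, depth = 1, 0
        for size in self.call_ranges[:max_depth]:
            # Stop once there is enough parallelism or too little work left.
            if par >= threads or seq // threads < min_chunk:
                break
            depth += 1
            par *= size
            seq //= size
        return depth


class ToyOuterLoopFusedKernel:
    """Hypothetical fused kernel that combines its inner kernels' decisions."""

    def __init__(self, inner: List[ToyKernel], outer_loop_fusion_depth: int) -> None:
        self.inner = inner
        self.outer_loop_fusion_depth = outer_loop_fusion_depth

    def decide_parallel_depth(self, threads: int) -> int:
        # Ask every inner kernel with its own call ranges, then cap the
        # deepest answer at the number of fused outer loops.
        depths = [
            k.decide_parallel_depth(len(k.call_ranges), threads) for k in self.inner
        ]
        return min(self.outer_loop_fusion_depth, max(depths, default=0))


fused = ToyOuterLoopFusedKernel(
    [ToyKernel([200, 200]), ToyKernel([200, 200, 26])],
    outer_loop_fusion_depth=2,
)
print(fused.decide_parallel_depth(threads=56))
```

In this sketch the [200, 200] kernel alone votes for depth 0 and the [200, 200, 26] kernel votes for depth 1, so the fused kernel parallelizes one outer loop level instead of falling back to single-threaded code.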

leslie-fang-intel added a commit that referenced this pull request Apr 16, 2024
leslie-fang-intel added a commit that referenced this pull request Apr 23, 2024
leslie-fang-intel commented

@pytorchbot rebase

pytorchmergebot commented

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
pytorchmergebot commented

Successfully rebased gh/leslie-fang-intel/88/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/123899)

pytorchmergebot pushed a commit that referenced this pull request Apr 24, 2024
lezcano commented Apr 24, 2024

@pytorchbot merge -i

pytorchmergebot commented

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

leslie-fang-intel commented

@pytorchbot merge

pytorchmergebot commented

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • EasyCLA

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

leslie-fang-intel commented Apr 24, 2024

Hi @atalman, this PR failed to merge with the following error:

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

EasyCLA

This looks strange, since HUD reports only 1 unrelated failure. Could you please take a look?

leslie-fang-intel added a commit that referenced this pull request Apr 25, 2024
leslie-fang-intel commented

@pytorchbot merge

pytorchmergebot commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

alat-rights pushed a commit to alat-rights/pytorch that referenced this pull request Apr 26, 2024
…ch#123899)

Pull Request resolved: pytorch#123899
Approved by: https://github.com/jgong5, https://github.com/lezcano
pytorch-bot bot pushed a commit that referenced this pull request May 3, 2024
@github-actions github-actions bot deleted the gh/leslie-fang-intel/88/head branch June 3, 2024 01:57