[Backend][AMD] Introduce stream pipeliner v2 #4148
Conversation
Thanks for doing that. When you are ready, please have @pawelszczerbuk review it :)
Thanks @sjw36! Some high-level comments before reviewing the detailed implementation: can we have a separate pull request for the NFC pass-moving changes? Basically commit e0bd4d8. It's easier to review that way, and if we later need to revert the changes to the AMD part for whatever reason, we also don't need to revert the NFC code shuffling.
@ThomasRaoux: @sjw36 and I chatted a bit offline. This pull request is great at showing the global picture, but we want to break it into smaller pieces to make it easier to review, and restructure it a bit. Overall, the direction is to increase reuse without abstracting too much; so we will expose some useful functions like
…structure
- Copied scheduler from MatmulLoopPipeline (much could be consolidated)
- Enabled register buffering (even though it may increase register pressure)
- Enabled num_stages=2+, including multi-buffering, and made `2` the default
- Updated tutorial for the new tuning default
- Added lit tests
- Also move `triton_gpu.local_store` ops that are independent (from the loop-carried buffer) as early as possible
- check for last atomic (sync?) - also check for other accesses to the source
…replaced with loop fusion
* Reorder will not move loads/local_stores over loops
* Added TRITONAMD_OLD_STREAM_PIPELINER env variable to temporarily select the old pipeliner
* update test
```cpp
// Create a cluster for the prefetches. It may end up being empty, but this
// is OK.
tt::CoarseSchedule::Cluster prefetchCluster = schedule.clusters.newAtBack();
```
The prefetch cluster is needed to push the copies to the end of the loop so they work well with prefetching, which is needed for NVIDIA A100. I'm not sure you need it?
Yes, we also need to prefetch for AMD GPUs. The most naive pipelining we want should have the following structure:

```
S = <alloc-shared-memory>
R(0) = load Global(0)
store R(0) to S
for i = 0 .. N-1
  barrier
  R(i+1) = load Global(i+1)
  dot (load S) (load S)
  barrier
  store R(i+1) to S
```
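A minimal Python simulation of the pipelined structure above (plain Python stand-ins, not Triton API; the function and variable names are my own). The point is the overlap: the global load for iteration i+1 is issued before the dot consumes iteration i's shared buffer, and the loads past the end of the array are guarded rather than peeled:

```python
def pipelined_loop(global_data, n, compute):
    """Simulate the single-buffered pipelined loop sketched above."""
    results = []
    smem = global_data[0]                    # R(0) = load Global(0); store R(0) to S
    for i in range(n):
        # barrier: the store to S must be visible before the dot reads it
        if i + 1 < n:
            reg = global_data[i + 1]         # R(i+1) = load Global(i+1), overlaps the dot
        results.append(compute(smem, smem))  # dot (load S) (load S)
        # barrier: the dot is done reading S, so it is safe to overwrite
        if i + 1 < n:
            smem = reg                       # store R(i+1) to S
    return results
```

On real hardware the overlap comes from the load's latency being hidden behind the dot; the sequential simulation only shows the ordering of the operations.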
Here is the computation without pipelining:

```
# load from HBM to SRAM
Load R0  : read global0 (global -> registers)
Store R0 : (registers -> SMEM)
# compute on SRAM
Load Si  : (Si (SMEM) -> registers Ri)
Compute
Store Si : (registers -> Si)
# store results back to global
Store Rn : write SRAM data back to global
```
@pawelszczerbuk @antiagainst I have a question. Can we load from global to SRAM directly?
My question is: if we load data from global to registers first, why don't we compute on it there and then store the result back to SMEM:

```
# load from HBM
R0_0 load
R1_1 load
# instead of storing it back to SMEM
compute R0, R1 (R0_0 + R1_0)
store SMEM O0
# store back to global
store
```
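The question above contrasts two dataflows. A plain Python sketch of the comparison (stand-in functions, not Triton API; the names are my own) shows that both compute the same value and differ only in the data movement, which is what matters for performance:

```python
def path_a_via_smem(global0, global1):
    """Round-trip through shared memory before computing."""
    r0, r1 = global0, global1    # global -> registers
    smem0, smem1 = r0, r1        # registers -> SMEM
    a, b = smem0, smem1          # SMEM -> registers (again)
    return a + b                 # compute

def path_b_registers_only(global0, global1):
    """Compute directly on the registers the load produced."""
    r0, r1 = global0, global1    # global -> registers
    return r0 + r1               # compute, no SMEM round-trip
```

The SMEM round-trip in path A pays off only when the loaded tile is reused across threads or iterations (as with `tt.dot` operands); for single-use data, path B avoids the extra traffic.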
There is no direct global-to-shared support in a normal global load on MI300. (Buffer loads support that, but we are not using them yet.)
Thank you!
This PR first promotes common infrastructure in `lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by other target backends. No other changes have been made to the lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely revamped based on code from `lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`, using similar scheduling passes to compute multi-stage pipelines. Some of this code could be consolidated further in the CoarseSchedule class (or perhaps a derived LoopScheduler class). This modulo scheduler collects `tt.load` ops and generates local_storage and management ops for the ramp-up stage (stage-0), then collects all uses of the loads for stage-1. Multi-buffering is introduced when num_stages exceeds the max distance between a load and its uses. Buffering may be in shared memory for `tt.dot` uses, or in registers for all other uses. The current implementation does not support peeling the last iteration if the loop is dynamic.

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced to move `tt.load` ops as early as possible in their region. This includes loop bodies as well as function entry blocks for the case of ramp-up. This pass will also move `triton_gpu.local_store` ops as early as possible if their source is not directly from a `tt.load`. In this way, a multi-buffered pipeline will overlap in this order: