[Backend][AMD] Introduce stream pipeliner v2 #4148
Conversation
Thanks for doing that. When you are ready, please have @pawelszczerbuk review it :)
Thanks @sjw36! Some high-level comments before reviewing the detailed implementation: can we have a separate pull request for the NFC pass-moving changes? Basically commit e0bd4d8. It's easier to review that way, and if we later need to revert the changes to the AMD part for whatever reason, we also don't need to revert the NFC code shuffling.
@ThomasRaoux: @sjw36 and I chatted a bit offline. This pull request is great at showing the global picture, but we want to break it into smaller pieces to make it easier to review, and restructure it a bit. Overall, the direction is to increase reuse without abstracting too much; so we will expose some useful functions like
…structure
- Copied scheduler from MatmulLoopPipeline (much could be consolidated)
- Enabled register buffering (even though it may increase register pressure)
- Enabled num_stages=2+, including multi-buffering, and made `2` the default
- Updated tutorial for the new tuning default
- Added lit tests
- Also move `triton_gpu.local_store` ops that are independent (from the loop-carried buffer) as early as possible
- check for last atomic (sync?) - also check for other accesses to the source
…replaced with loop fusion
* Reorder will not move loads/local_stores over loops
* Added TRITONAMD_OLD_STREAM_PIPELINER env variable to temporarily select the old pipeliner
* update test
```cpp
// Create a cluster for the prefetches. It may end up being empty, but this
// is OK.
tt::CoarseSchedule::Cluster prefetchCluster = schedule.clusters.newAtBack();
```
The prefetch cluster is needed to push the copies to the end of the loop so they work well with prefetching, which is needed for NVIDIA A100. I'm not sure you need it?
Yes, we also need to prefetch for AMD GPUs. The most naive pipelining we want should have the following structure:

```
S = <alloc-shared-memory>
R(0) = load Global(0)
store R(0) to S
for i = 0 .. N-1
  barrier
  R(i+1) = load Global(i+1)
  dot (load S) (load S)
  barrier
  store R(i+1) to S
```
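A minimal Python simulation of the pipelined structure above (plain Python stand-ins, not Triton API; the function and variable names are my own). The point is the overlap: the global load for iteration i+1 is issued before the dot consumes iteration i's shared buffer, and the loads past the end of the array are guarded rather than peeled:

```python
def pipelined_loop(global_data, n, compute):
    """Simulate the single-buffered pipelined loop sketched above."""
    results = []
    smem = global_data[0]                    # R(0) = load Global(0); store R(0) to S
    for i in range(n):
        # barrier: the store to S must be visible before the dot reads it
        if i + 1 < n:
            reg = global_data[i + 1]         # R(i+1) = load Global(i+1), overlaps the dot
        results.append(compute(smem, smem))  # dot (load S) (load S)
        # barrier: the dot is done reading S, so it is safe to overwrite
        if i + 1 < n:
            smem = reg                       # store R(i+1) to S
    return results
```

On real hardware the overlap comes from the load's latency being hidden behind the dot; the sequential simulation only shows the ordering of the operations.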
Here is the computation without pipelining:

```
# load from HBM to SRAM
Load R0  : read global0 (global -> registers)
Store R0 : (registers -> SMEM)
# compute on SRAM
Load Si  : (Si (SMEM) -> registers Ri)
Compute
Store Si : (registers -> Si)
# store results back to global
Store Rn : write SRAM data back to global
```
@pawelszczerbuk @antiagainst I have a question. Can we load from global to SRAM directly?
My question is: if we load data from global to registers first, why don't we compute on it there and then store the result back to SMEM:

```
# load from HBM
R0_0 load
R1_1 load
# instead of storing it back to SMEM
compute R0, R1 (R0_0 + R1_0)
store SMEM O0
# store back to global
store
```
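The question above contrasts two dataflows. A plain Python sketch of the comparison (stand-in functions, not Triton API; the names are my own) shows that both compute the same value and differ only in the data movement, which is what matters for performance:

```python
def path_a_via_smem(global0, global1):
    """Round-trip through shared memory before computing."""
    r0, r1 = global0, global1    # global -> registers
    smem0, smem1 = r0, r1        # registers -> SMEM
    a, b = smem0, smem1          # SMEM -> registers (again)
    return a + b                 # compute

def path_b_registers_only(global0, global1):
    """Compute directly on the registers the load produced."""
    r0, r1 = global0, global1    # global -> registers
    return r0 + r1               # compute, no SMEM round-trip
```

The SMEM round-trip in path A pays off only when the loaded tile is reused across threads or iterations (as with `tt.dot` operands); for single-use data, path B avoids the extra traffic.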
There is no direct global-to-shared support in a normal global load on MI300. (Buffer loads support that, but we are not using them yet.)
Thank you!
This PR first promotes common infrastructure in `lib/Dialect/TritonGPU/Transforms/Pipeliner` to enable inclusion by other target backends. No other changes have been made to the lib/include directories.

Second, the `tritonamdgpu-stream-pipeline` pass has been completely revamped based on code from `lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp`, using similar scheduling passes to compute multi-stage pipelines. Some of this code could be consolidated further in the CoarseSchedule class (or perhaps a derived LoopScheduler class). This modulo scheduler collects `tt.load` ops and generates local_storage and management ops for the ramp-up stage (stage-0), then collects all uses of the loads for stage-1. Multi-buffering is introduced when num_stages exceeds the max distance between a load and its uses. Buffering may be in shared memory for `tt.dot` uses, or in registers for all other uses. The current implementation does not support peeling the last iteration if the loop is dynamic.

Lastly, the `tritonamdgpu-reorder-instructions` pass has been enhanced to move `tt.load` ops as early as possible in their region. This includes loop bodies as well as function entry blocks for the case of ramp-up. This pass will also move `triton_gpu.local_store` ops as early as possible if their source is not directly from a `tt.load`. In this way, a multi-buffered pipeline will overlap in this order: