-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TIR] Enhance Lower cross thread Pass #17133
base: main
Are you sure you want to change the base?
Conversation
@LeiWang1999 please fix the lint and test case, @wrongtest-intellif do you mind help review the PR |
// output buffers. | ||
if (!IsDominantBlock(scope_block, GetRef<Block>(block))) { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @LeiWang1999 could you explain why the dominant check here is cancelled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wrongtest-intellif Thanks for your review, I'll take a look and give you response this Friday.
Evaluate(Call(/*dtype=*/DataType::Handle(), | ||
/*op=*/tir::builtin::tvm_thread_allreduce(), | ||
/*args=*/std::move(parameters))))); | ||
ObjectPtr<BlockNode> cross_thread_block_node = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we initialize kIsCrossThreadReductionApplied
just in the constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah absolutely, but the problem is I don't know how to skip the initialization of init
, alloc_buffers
and match_buffers
, give them empty values might be a bit ugly.
Block::Block(Array<IterVar> iter_vars, Array<BufferRegion> reads, Array<BufferRegion> writes,
String name_hint, Stmt body, Optional<Stmt> init, Array<Buffer> alloc_buffers,
Array<MatchBufferRegion> match_buffers, Map<String, ObjectRef> annotations,
Span span)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also use block.CopyOnWrite()->annotations.Set(...)
if full constructor params looks ugly.
if (buf_it == crt_buf2threads_.end()) { | ||
continue; | ||
} | ||
for (auto[scope, range] : buf_it->second) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requires clang-format
* \param block The block to be checked | ||
* \return The init value of the given block | ||
*/ | ||
static PrimExpr FindInit(const Block& block) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by the usage in context, maybe we could merge the FindInit()
and CheckHasMMA()
to visit the block only once.
|
||
private: | ||
void VisitStmt_(const BufferStoreNode* node) final { | ||
BufferStore store = GetRef<BufferStore>(node); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to check there is single buffer store and related value?
* \brief kind The kind of the for loop | ||
* \return The loop variables between stmt1 and stmt2 | ||
*/ | ||
class LoopVar { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just use ForNode
object?
I’m attempting to remove the For new_for = For(ax_lane_id, Integer(0), warp_size, ForKind::kThreadBinding, n->body);
const ForNode* new_for_node = new_for.get();
LOG(INFO) << "new_for->min " << new_for_node->min;
// Output: 0 This snippet works fine and logs the expected output of However, when I try to instantiate const ForNode* new_for_node = For(ax_lane_id, Integer(0), warp_size, ForKind::kThreadBinding, n->body).get();
LOG(INFO) << "new_for->min " << new_for_node->min;
// Check failed: (tindex < type_table_.size() && type_table_[tindex].allocated_slots != 0) is false: Unknown type index 76391888 @wrongtest-intellif, do you have any thoughts? |
node pointer do not take ownership of objects. it seems your |
thanks, exactly, something relevant to the Ownership, now use |
We currently only support lower cross thread with several constrains. For example, the lower_cross_thread only apples when the thread binding reduced axis is the innermost loop, and the block must have an init block. This can be a limiting for some cases.
For example, when tensorizing the reduction block (e.g., dp4a or mma), it becomes difficult to tensorize the init statement as well:
Moreover, certain cases, like small gemm, prefer block reduction in shared memory to enhance parallelization to better utilize the hardware resources.
This pull request improves the
lower_cross_thread
pass, it can now handle the thread block reduce lowering with separate init and reduce blocks, and removes the constrain that the reduced axis is the innermost loop to support TensorCore with block reduction.relevant test cases can be found at
tests/python/tir-transform/test_tir_transform_lower_cross_thread_reduction.py
.Please CC @MasterJH5574 .