[TensorExpr] Add aten::matmuls to TE fuser. #54605
Conversation
For small sizes we generate a naive 3-layer loopnest, for bigger sizes we generate an external call.
CI status as of commit d0f31f7: 1 failure on ci.pytorch.org (reported by Dr. CI).
You should be aware that matmul has broadcasting behavior, so either make sure that's handled or opt out of it in the fuser.
That's a good point, I totally forgot about that! I've added checks into the fuser and tests to verify that we don't fuse what we can't correctly lower yet. Could you please take a look? Are these tests sufficient?
LGTM, just had one question about b[0]
@@ -187,6 +187,26 @@ bool conv2dIsSupported(const torch::jit::Node* node) {
   return true;
 }

+// The fuser currently only supports matmul of 2D x 2D matrices
+bool matmulIsSupported(const torch::jit::Node* node) {
Is there a reason for putting this in kernel instead of tensorexpr_fuser?
Only because we already have conv2dIsSupported there. I wouldn't mind moving it to tensorexpr_fuser.cpp, but in this PR for consistency I'd prefer to keep them in kernel.cpp.
Yeah, I put conv2dIsSupported here b/c I wanted to refer to it from both kernel and fuser, and we already have other dependencies going from kernel->fuser. But it's really fine either way.
const IntImm* total_size = dynamic_cast<const IntImm*>(
    IRSimplifier::simplify((size_a[0] * size_a[1] * size_b[1])).node());

if (total_size && total_size->value() < 1000) {
Can you add a comment here about why we're using external calls for n > 1000 ?
👍
lol clang-tidy "1000 is a magic number". I want to troll the linter by doing constexpr int kMagicNumber = 1000;
const Node* n = v->node();
auto const& shape = sizesForValue(v);
Dtype dtype = kFloat;
auto maybe_stype = findDtypeForValue(v);
What is maybe_stype?
"maybe" is for the optionality of the returned value, "stype" is for scalar type. I think we've been using this name in surrounding code too.
auto size_a = ExprVectorToExprHandleVector(a->dims());
auto size_b = ExprVectorToExprHandleVector(b->dims());
const IntImm* total_size = dynamic_cast<const IntImm*>(
    IRSimplifier::simplify((size_a[0] * size_a[1] * size_b[1])).node());
Should we be multiplying by size_b[0] as well?
It's a rough estimate of the amount of work matmul needs to do, which is N*M*K (size_a[1] == size_b[0] == M). Would I not pass an interview with such analysis? 😜
      BufHandle bh(b);
      return Load::make(ah, {m, k}) * Load::make(bh, {k, n});
    },
    {{size_a[1], "K"}});
In the spirit of https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/: there are a very small number of possible inputs to test. You might try writing a script to see how large you can make the total size while NNC is still as fast as the ATen matmul (with a little bit of leeway for possible fusion, say 5%).
This is a very basic, naive heuristic, and we would certainly like to tune it (and hopefully add better schedules for matmul as well). But I think that deserves a separate PR; this one is mostly to lay the foundation for this work.
Looks good to me. So if we fuse matmul and use an external call, are we still as fast as eager mode?
I would expect so, but honestly I haven't done any rigorous measurements yet.
@ZolotukhinM merged this pull request in 5f19385.
Summary: Pull Request resolved: pytorch#54605
For small sizes we generate a naive 3-layer loopnest, for bigger sizes we generate an external call.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D27298364
Pulled By: ZolotukhinM
fbshipit-source-id: 2ddf275ff68d6fca16a3befca5ce5c26aef462b5