[TensorExpr] Add aten::matmuls to TE fuser. #54605

Closed
wants to merge 8 commits

Conversation

@ZolotukhinM commented Mar 24, 2021

Stack from ghstack:

For small sizes we generate a naive 3-layer loopnest, for bigger sizes
we generate an external call.

Differential Revision: D27298364
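
To make the description concrete, here is a plain-C++ sketch (illustration only, not the actual NNC lowering) of the computation the naive 3-layer loopnest performs for C = A x B, with A of shape [N, M] and B of shape [M, K]:

```cpp
// Illustration of the work done by the generated 3-level loopnest for
// C[N, K] = A[N, M] * B[M, K]; the real kernel is built from NNC IR, not raw loops.
void naive_matmul(const float* A, const float* B, float* C, int N, int M, int K) {
  for (int n = 0; n < N; ++n) {
    for (int k = 0; k < K; ++k) {
      float acc = 0.0f;
      for (int m = 0; m < M; ++m) {
        acc += A[n * M + m] * B[m * K + k];
      }
      C[n * K + k] = acc;
    }
  }
}
```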

@facebook-github-bot added the oncall: jit (Add this issue/PR to JIT oncall triage queue) and cla signed labels Mar 24, 2021
@facebook-github-bot commented Mar 24, 2021

💊 CI failures summary and remediations

As of commit d0f31f7 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-scanned failure(s)

ci.pytorch.org: 1 failed


ZolotukhinM pushed a commit that referenced this pull request Mar 24, 2021
For small sizes we generate a naive 3-layer loopnest, for bigger sizes
we generate an external call.

ghstack-source-id: 10e40f57f18c5fd221e4fa6029dd9a2ac78ae18c
Pull Request resolved: #54605
@eellison left a comment

You should be aware that matmul has broadcasting behavior, so either make sure that's handled or opt out in fuser
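
For context, matmul is not restricted to 2D x 2D inputs; it broadcasts batch dimensions and treats 1D operands specially. A small libtorch sketch (shapes chosen purely for illustration):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  // 2D x 2D: plain matrix multiply -> shape [3, 5]
  auto c1 = torch::matmul(torch::randn({3, 4}), torch::randn({4, 5}));

  // 3D x 2D: the second argument is broadcast over the batch dim -> shape [10, 3, 5]
  auto c2 = torch::matmul(torch::randn({10, 3, 4}), torch::randn({4, 5}));

  // 1D x 2D: the vector is treated as a row and the extra dim is dropped -> shape [5]
  auto c3 = torch::matmul(torch::randn({4}), torch::randn({4, 5}));

  std::cout << c1.sizes() << " " << c2.sizes() << " " << c3.sizes() << "\n";
  return 0;
}
```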

@ZolotukhinM (Author) replied:

> You should be aware that matmul has broadcasting behavior, so either make sure that's handled or opt out in fuser

That's a good point, I totally forgot about that! I've added checks into the fuser and tests to verify that we don't fuse what we can't correctly lower yet. Could you please take a look? Are these tests sufficient?
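
For reference, a rough sketch of the kind of guard being described; the name mirrors the matmulIsSupported function quoted below, but the body is a hypothetical simplification, not the exact code in kernel.cpp:

```cpp
#include <torch/csrc/jit/ir/ir.h>

// Hypothetical sketch: only report a matmul node as fusible when both inputs
// are statically known to be 2D tensors, so broadcasting/batched/1D cases keep
// falling back to the regular aten::matmul instead of being lowered by the fuser.
bool matmulIsSupportedSketch(const torch::jit::Node* node) {
  for (const torch::jit::Value* input : node->inputs()) {
    auto tt = input->type()->cast<c10::TensorType>();
    if (!tt) {
      return false; // non-tensor input
    }
    auto rank = tt->sizes().size(); // optional: the rank may be unknown
    if (!rank || *rank != 2) {
      return false; // unknown rank, or not a plain 2D matrix
    }
  }
  return true;
}
```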

ZolotukhinM pushed a commit that referenced this pull request Apr 14, 2021
For small sizes we generate a naive 3-layer loopnest, for bigger sizes
we generate an external call.

ghstack-source-id: 2c46db959dfa4ca85cb6428cb0adf4794768e3d1
Pull Request resolved: #54605
@eellison left a comment

LGTM, just had one question about b[0]

@@ -187,6 +187,26 @@ bool conv2dIsSupported(const torch::jit::Node* node) {
return true;
}

// The fuser currently only supports matmul of 2D x 2D matrices
bool matmulIsSupported(const torch::jit::Node* node) {
Contributor:

Is there a reason for putting this in kernel instead of tensorexpr_fuser ?

Author (@ZolotukhinM):

Only because we already have conv2dIsSupported there. I wouldn't mind moving it to tensorexpr_fuser.cpp, but in this PR for consistency I'd prefer to keep them in kernel.cpp

Contributor:

Yeah I put conv2dIsSupported here b/c I wanted to refer to it from both kernel and fuser, and we already have other dependences going from kernel->fuser. But it's really fine either way.

const IntImm* total_size = dynamic_cast<const IntImm*>(
IRSimplifier::simplify((size_a[0] * size_a[1] * size_b[1])).node());

if (total_size && total_size->value() < 1000) {
Contributor:

Can you add a comment here about why we're using external calls for n > 1000 ?

Author (@ZolotukhinM):

👍

Contributor:

lol clang-tidy "1000 is a magic number". I want to troll the linter by doing constexpr int kMagicNumber = 1000;

const Node* n = v->node();
auto const& shape = sizesForValue(v);
Dtype dtype = kFloat;
auto maybe_stype = findDtypeForValue(v);
Contributor:

What is maybe_stype?

Author (@ZolotukhinM):

"maybe" is for optionality of returned value, "stype" is for scalar-type. I think we've been using this name in surrounding code too.

auto size_a = ExprVectorToExprHandleVector(a->dims());
auto size_b = ExprVectorToExprHandleVector(b->dims());
const IntImm* total_size = dynamic_cast<const IntImm*>(
IRSimplifier::simplify((size_a[0] * size_a[1] * size_b[1])).node());
Contributor:

Should we be multiplying by size_b[0] as well ?

Author (@ZolotukhinM):

It's a rough estimate of the amount of work matmul needs to do, which is N*M*K (size_a[1] == size_b[0] == M). Would I not pass an interview with such an analysis? 😜
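
To spell the estimate out: for A of shape [N, M] and B of shape [M, K], the inner dimension is shared (size_a[1] == size_b[0]), so size_a[0] * size_a[1] * size_b[1] = N*M*K counts the multiply-accumulate operations. For example, an 8x8 by 8x8 matmul gives 8*8*8 = 512 < 1000 and takes the inline loopnest, while a 16x16 by 16x16 matmul gives 4096 and goes through the external call.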

BufHandle bh(b);
return Load::make(ah, {m, k}) * Load::make(bh, {k, n});
},
{{size_a[1], "K"}});
Contributor:

In the spirit of https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/: there are a very small number of possible inputs to test, you might try writing a script to see how large you can make the total size while nnc is as fast as aten matmul (with a little bit of a leeway for possible fusion, say 5%)

Author (@ZolotukhinM):

This is a very basic naive heuristic and we for sure would like to tune it (and hopefully add better schedules for matmul as well). But I think it deserves a separate PR too, this one is mostly to lay down the foundation for this work.

ZolotukhinM pushed a commit that referenced this pull request Apr 16, 2021
For small sizes we generate a naive 3-layer loopnest, for bigger sizes
we generate an external call.

ghstack-source-id: 377223dd80342ec1d1623219205dc05f2e6c0a60
Pull Request resolved: #54605
@bertmaher left a comment

Looks good to me. So if we fuse matmul and use an external call, are we still as fast as eager mode?


@ZolotukhinM (Author) replied:

> So if we fuse matmul and use an external call, are we still as fast as eager mode?

I would expect so, but honestly I haven't done any rigorous measurements yet.

@ZolotukhinM merged this pull request in 5f19385.

@facebook-github-bot deleted the gh/ZolotukhinM/415/head branch April 20, 2021 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#54605

For small sizes we generate a naive 3-layer loopnest, for bigger sizes
we generate an external call.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D27298364

Pulled By: ZolotukhinM

fbshipit-source-id: 2ddf275ff68d6fca16a3befca5ce5c26aef462b5
Labels: cla signed, Merged, oncall: jit
4 participants