Mobile Backend: NHWC memory layout + XNNPACK integration. #32509
Conversation
@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
💊 CircleCI build failures summary and remediations

As of commit 4b95293:

Detailed failure analysis: one may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns. The following build failures do not appear to be due to upstream breakage:
@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There are a ton of unintentional submodule updates in this diff.
This diff is really long, and the description in your PR is not commensurate with its length. Please re-ping for review once there is a longer description of the changes.
Thank you for your reviews, Edward and Greg. I will address your comments, along with the non-threadpool-related comments from Jiakai on the previous PR, and upload an update.
Well, a few of those updates are actually necessary for this patch to work, such as PSIMD, cpu_info, and pthreadpool; these updates are required for XNNPACK to compile. The NNPACK update fixes a buffer overflow and can be a separate PR. I can leave the others out. The only reason I updated those is that they are used in NNPACK and family, and I wanted to make sure we are picking up any potential bugfix or performance improvement they might bring to the table.
- Removed all traces of the pthreadpool changes.
- Removed extra submodule updates.
- Removed c10 op registration. Will decide on how best to expose the operators in a follow-up patch.
- Addressed comments.
- Will update PR description shortly.
Please let me know if I missed anything, or if you have any further concerns.
In that case, if the updates are backwards compatible, it's generally a good idea to do them in a separate diff first, and then the main diff. Although in this case it looks like you got all the tests to work.
I'm adding @bwasti to this PR, because the caching scheme here is similar to things that bwasti observed were necessary in his sparse experiments in https://github.com/pytorch/sparse
ya, splitting out the submodule updates into a separate PR is probably a good idea. You never know what could break downstream, and having a minimally revertible piece is nice.
@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Looks like all major comments have been addressed. I'm going to accept this so we can start testing it out in real apps and iterating on the frontend.
        groups,
        output_min,
        output_max),
    "xnnpack::convolution not available!");
I'm a bit worried that this error message doesn't actually tell the user what the problem was. Let's be sure to improve this if anyone gets confused.
My idea was that if this error prints the line and file, the user can investigate. Otherwise, if I want to make the error message super descriptive, which is also a possibility, I have to break that function into its constituent tests. Is that what you have in mind?
Yes. Don't need to do it right now, though.
Add a bit more context: "the xnnpack engine for conv2d doesn't support this combination of padding and strides".
And put a TODO to improve this message.
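A minimal sketch of that suggestion; the leading parameters of available() are assumed, since only the tail of the call appears in the diff above:

```cpp
// TODO: improve this message further by reporting exactly which
// parameter failed the availability test.
TORCH_CHECK(
    available(
        weight,
        bias,
        padding,
        stride,
        dilation,
        groups,
        output_min,
        output_max),
    "xnnpack::convolution not available! The xnnpack engine for conv2d "
    "does not support this combination of padding and strides.");
```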
you'll likely need to expose this function later to JIT or other parts to make sure that the rewriting pass is safe in how it handles the details. Or we could add a fallback path that, despite "prepacking", still just calls a regular conv if the params don't match. The general principle is: everything should run, but some things can run slowly. But it can be done in a separate diff.
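A minimal sketch of that fallback idea, assuming a hypothetical prepacked context type that retains the original convolution parameters:

```cpp
#include <ATen/ATen.h>

// Hypothetical fallback path: use the fast XNNPACK path when the input
// qualifies, otherwise fall back to the regular ATen convolution, so that
// everything runs even if some things run slowly.
at::Tensor conv2d_run_or_fallback(
    const Conv2dContext& context,  // hypothetical prepacked-op context
    const at::Tensor& input) {
  if (usable(input)) {
    return run(context, input);    // fast, prepacked XNNPACK path
  }
  return at::conv2d(               // slow but always-correct path
      input,
      context.weight,
      context.bias,
      context.stride,
      context.padding,
      context.dilation,
      context.groups);
}
```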
TORCH_CHECK(
    usable(input_nhwc),
    "xnnpack::convolution not usable!");
Same. If I hit this, I wouldn't know what the problem was.
@@ -0,0 +1,96 @@
#ifndef USE_XNNPACK
Maybe comment what this file is for.
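For instance, something along these lines at the top of the file; the wording is hypothetical and assumes the guard marks this file as the non-XNNPACK stub path:

```cpp
// Stub implementations of the xnnpack operators, compiled only when the
// build does not enable XNNPACK (USE_XNNPACK undefined). Each stub simply
// reports that the XNNPACK engine is unavailable.
#ifndef USE_XNNPACK
```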
The E2E bindings look pretty reasonable; I'm not sure what the current state of torchbind is, but that's the main question I'd like resolved before shipping this.
I was going to work on this following @jamesr66a's PR: https://github.com/pytorch/pytorch/pull/32938/files, taking a similar approach. Basically, a custom class (OpContext) registered with torchbind captures the op context. The setstate method registered with torchbind will create the OpContext. The current linear_prepack will also generate an OpContext as output, which is consumed by linear_run. The freezing API can then be used to get rid of linear_prepack. I have a small quip describing this approach, to which I will add you guys for further comments.
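A minimal sketch of that shape using the torchbind API; the OpContext fields here are hypothetical, and the serialized state is simplified to a pair of tensors:

```cpp
#include <torch/custom_class.h>

// Hypothetical op context that captures prepacked state for linear_run.
struct OpContext : torch::CustomClassHolder {
  at::Tensor packed_weight;
  at::Tensor bias;
  OpContext(at::Tensor w, at::Tensor b)
      : packed_weight(std::move(w)), bias(std::move(b)) {}
};

// Register the class with torchbind; def_pickle supplies the
// __getstate__/__setstate__ pair so deserialization recreates the OpContext.
static auto op_context = torch::class_<OpContext>("xnnpack", "OpContext")
    .def_pickle(
        // __getstate__: flatten the context into serializable values.
        [](const c10::intrusive_ptr<OpContext>& ctx)
            -> std::tuple<at::Tensor, at::Tensor> {
          return std::make_tuple(ctx->packed_weight, ctx->bias);
        },
        // __setstate__: rebuild the context when the model is loaded.
        [](std::tuple<at::Tensor, at::Tensor> state)
            -> c10::intrusive_ptr<OpContext> {
          return c10::make_intrusive<OpContext>(
              std::get<0>(state), std::get<1>(state));
        });
```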
@ezyang, note that linear_prepack and linear_run, or their updated _-prefixed versions, will eventually have to return not a Tensor but a custom class that is registered with torchbind. This means we will have to introduce the new class in native_functions.yaml and patch everything else to make it work with the build system. At the moment, the similar torchbind-based approach used for quantization does not have to deal with this, as the quantization ops are registered differently.
I suspect the better strategy is to do the registrations manually, in the same way quantization does them. But as long as there's an agreed-upon plan on record here, I don't mind if the patch goes in "as is".
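For comparison, the quantization-style manual registration is roughly this shape; a sketch that reuses the hypothetical OpContext above, with only the operator names taken from the discussion and everything else assumed:

```cpp
#include <torch/script.h>

// Hypothetical free functions implementing the two ops.
c10::intrusive_ptr<OpContext> linear_prepack(at::Tensor weight,
                                             at::Tensor bias) {
  return c10::make_intrusive<OpContext>(std::move(weight), std::move(bias));
}

at::Tensor linear_run(at::Tensor input,
                      c10::intrusive_ptr<OpContext> context) {
  // The real implementation would dispatch to XNNPACK here.
  return at::linear(input, context->packed_weight, context->bias);
}

// Manual registration, bypassing native_functions.yaml the way the
// quantized ops do.
static auto registry = torch::RegisterOperators()
    .op("_xnnpack::linear_prepack", &linear_prepack)
    .op("_xnnpack::linear_run", &linear_run);
```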
Looks good modulo a few renames in inline comments. Let's land this and then have follow-ups with frontend tests.
Also, do you plan, in some follow-up diffs, to re-route the regular non-prepacked ops to call XNNPACK? Or is the perf penalty too high for that?
        groups,
        output_min,
        output_max),
    "xnnpack::convolution not available!");
Add a bit more context: "the xnnpack engine for conv2d doesn't support this combination of padding and strides".
And put a TODO to improve this message.
        groups,
        output_min,
        output_max),
    "xnnpack::convolution not available!");
you'll likely need to expose this function later to JIT or other parts to make sure that the rewriting pass is safe in how it handles the details. Or we could add a fallback path that, despite "prepacking", still just calls a regular conv if the params don't match. The general principle is: everything should run, but some things can run slowly. But it can be done in a separate diff.
@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Sorry, this went under my radar. Yes, I plan to follow up with another diff that will reroute those operators; the performance penalty shouldn't be high. There is also another patch, left out of the original diff, whose purpose is to unify threading on mobile: XNNPACK uses an updated version of pthreadpool with a newer interface that our internal custom Caffe2 implementation is not written against. Marat has also improved pthreadpool's implementation itself to use spin locks for short waits, which now makes our custom Caffe2 implementation redundant, as the latter was based on the same premise. My benchmarks last half showed better performance with pthreadpool's updated implementation compared to our custom version, and on top of that, I think a unified threading solution makes for easier code maintenance too. We have run into linker issues complaining about duplicate symbols any time we wanted to add support for a new platform, including internal BUCK targets. So those are the two remaining pieces on the backend side of things. There's of course the JIT side as well, which Kimish is working on.
What is this referring to, Ashkan? Is this the same for ops such as Add?
Yaaay! I'm genuinely so excited. Like a child discovering candy for the first time.
@AshkanAliabadi merged this pull request in 941b424.
Summary: Pull Request resolved: #33722

In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build, but the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding native implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today. Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute one-time operations ahead of time and factor them out of the innermost forward() loop.

The more optimal solution, and the one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that directly call into these implementations, whether through c10 or some other mechanism, likewise allowing op creation to be decoupled from op execution.

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move. Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.

Pull Request resolved: #32509

Test Plan: Build: CI. Functionality: Not exposed.

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
Summary:

In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build, but the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding **native** implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today. Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute one-time operations ahead of time and factor them out of the innermost forward() loop.

The more optimal solution, and the one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that directly call into these implementations, whether through c10 or some other mechanism, likewise allowing op creation to be decoupled from op execution.

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move. Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.

Pull Request resolved: #32509

Reviewed By: dreiss

Differential Revision: D19521853

Pulled By: AshkanAliabadi

fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.
XNNPACK itself, and this codepath, are currently only included in the build, but the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, which defaults to ON, is enabled, but will not actually expose or enable this code path in any other way.
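Concretely, the gating described here is the standard preprocessor pattern; a sketch with hypothetical names:

```cpp
#include <ATen/ATen.h>

// USE_XNNPACK is defined (or not) at build time. Since the symbol is not
// yet passed to the compiler, only the stub branch is reachable for now.
at::Tensor mobile_conv2d(const at::Tensor& input, const at::Tensor& weight) {
#ifdef USE_XNNPACK
  // Fast path: NHWC, XNNPACK-backed convolution (hypothetical call).
  return xnnpack::convolution2d(input, weight);
#else
  TORCH_CHECK(false, "mobile_conv2d: XNNPACK is not enabled in this build!");
#endif
}
```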
Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding native implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today.
Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute one-time operations ahead of time and factor them out of the innermost forward() loop.
The more optimal solution, and the one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that directly call into these implementations, whether through c10 or some other mechanism, likewise allowing op creation to be decoupled from op execution.
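Either frontend would rely on the same two-stage split: a one-time prepack step hoisted out of the hot loop, and a lightweight run step. A sketch with hypothetical operator names:

```cpp
#include <ATen/ATen.h>
#include <vector>

// Hypothetical two-stage use of the new operators: prepack once, run many.
at::Tensor forward_all(
    const std::vector<at::Tensor>& batches,
    const at::Tensor& weight, const at::Tensor& bias,
    std::vector<int64_t> stride, std::vector<int64_t> padding,
    std::vector<int64_t> dilation, int64_t groups) {
  // One-time work, factored out of the forward() loop: validate the
  // parameters and repack the weights into XNNPACK's preferred layout.
  auto context = xnnpack::conv2d_prepack(
      weight, bias, stride, padding, dilation, groups);
  at::Tensor output;
  // Hot path: each call reuses the prepacked state and only pays for the
  // per-input work.
  for (const at::Tensor& input : batches) {
    output = xnnpack::conv2d_run(context, input);
  }
  return output;
}
```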
This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move.
Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.