
Mobile Backend: NHWC memory layout + XNNPACK integration. #32509

Closed
wants to merge 1 commit into from
Conversation

@AshkanAliabadi (Contributor) commented Jan 22, 2020

In order to improve CPU performance of floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK.

XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler yet, so as to enable the rollout in multiple stages in follow-up PRs. This changeset builds XNNPACK if the identically named USE_XNNPACK CMake variable, defaulted to ON, is enabled, but does not expose or enable this code path in any other way.
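For illustration, a minimal sketch of the preprocessor gating described above, assuming a hypothetical helper and file layout; xnn_initialize is part of the XNNPACK C API, while xnnpack_available and the surrounding structure are invented for this example and are not the code in this PR.

#ifdef USE_XNNPACK

#include <xnnpack.h>

namespace at {
namespace native {
namespace xnnpack {

// One-time library initialization; this path is only compiled when the
// USE_XNNPACK preprocessor symbol is actually passed to the compiler.
bool xnnpack_available() {
  static const bool initialized =
      (xnn_initialize(/*allocator=*/nullptr) == xnn_status_success);
  return initialized;
}

} // namespace xnnpack
} // namespace native
} // namespace at

#else /* USE_XNNPACK not defined: compile a stub so callers still link. */

namespace at {
namespace native {
namespace xnnpack {

bool xnnpack_available() {
  return false;
}

} // namespace xnnpack
} // namespace native
} // namespace at

#endif /* USE_XNNPACK */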

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding native implementations, provided that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today.

Having said that, while that implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance XNNPACK enables, as it does not provide a way to compute one-time operations up front and factor them out of the innermost forward() loop.

The better solution, and one we will decide on soon, would involve either a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or new eager-mode modules that directly call into these implementations through c10 or some other mechanism, likewise decoupling op creation from op execution.
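As a rough illustration of the prepack/run decoupling this refers to, here is a hypothetical sketch; PackedConv2dContext, conv2d_prepack, and conv2d_run are invented names for this example and are not the API introduced by this PR.

#include <ATen/ATen.h>

// Hypothetical container for state computed once, outside the hot loop:
// the weight re-laid-out to NHWC (channels-last) plus whatever else the
// XNNPACK operator needs at execution time.
struct PackedConv2dContext {
  at::Tensor packed_weight;
  at::Tensor bias;
};

// One-time work, e.g. at model load or freeze time.
inline PackedConv2dContext conv2d_prepack(const at::Tensor& weight, const at::Tensor& bias) {
  return PackedConv2dContext{weight.contiguous(at::MemoryFormat::ChannelsLast), bias};
}

// Per-inference work inside forward(): only the convolution itself runs here.
at::Tensor conv2d_run(const PackedConv2dContext& context, const at::Tensor& input);

A JIT pass would then rewrite each conv2d call into a single conv2d_prepack performed ahead of time plus a conv2d_run in the loop, which is the factoring-out described above.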

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, provided there is widespread support for such a move.

Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.

@facebook-github-bot (Contributor) left a comment:

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@kostmo (Member) commented Jan 22, 2020

💊 CircleCI build failures summary and remediations

As of commit 4b95293:

  • 1/1 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (1/1)

Step: "Test" (full log | pattern match details)

RuntimeError: test_jit_fuser failed!
 
---------------------------------------------------------------------- 
Ran 46 tests in 11.450s 
 
FAILED (errors=4, skipped=10) 
Traceback (most recent call last): 
  File "run_test.py", line 486, in <module> 
    main() 
  File "run_test.py", line 479, in main 
    raise RuntimeError(message) 
RuntimeError: test_jit_fuser failed! 
 
(base) circleci@PACKER-5E29F737 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1  
+ cleanup
+ retcode=1
+ set +x

This comment was automatically generated by Dr. CI. Please report bugs/suggestions on the GitHub issue tracker.

.gitmodules (resolved)
aten/src/ATen/native/mobile/cpu/internal/Add.cpp (outdated, resolved)
@facebook-github-bot (Contributor) left a comment:

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ezyang ezyang requested a review from smessmer January 23, 2020 15:28
@ezyang (Contributor) commented Jan 23, 2020

There are a ton of unintentional submodule updates in this diff.

@ezyang (Contributor) commented Jan 23, 2020

This diff is really long and the description in your PR is not commensurate with its length. Please reping for review after there is a longer description of the changes.

@ezyang (Contributor) left a comment:

No description

@AshkanAliabadi (Contributor, Author):

Thank you for your reviews, Edward and Greg. I will address your comments, along with the non-threadpool-related comments from Jiakai on the previous PR, and upload an update.

@AshkanAliabadi (Contributor, Author):

There are a ton of unintentional submodule updates in this diff.

Well, a few of those updates, such as PSIMD, cpu_info, and pthreadpool, are actually necessary for this patch to work; they are required for XNNPACK to compile. The NNPACK update fixes a buffer overflow and can go in a separate PR. I can leave the others out. The only reason I updated those is that they are used in NNPACK and family, and I wanted to make sure we pick up any bugfix or performance improvement they might bring to the table.

@AshkanAliabadi (Contributor, Author) left a comment:

  • Removed trace of all pthreadpool changes.
  • Removed extra submodule updates.
  • Removed c10 op registration. Will decide on how best to expose the operators in a follow-up patch.
  • Addressed comments.
  • Will update PR description shortly.

Please let me know if I missed anything, or if you have any further concerns.

aten/src/ATen/CMakeLists.txt (outdated, resolved)
aten/src/ATen/native/mobile/cpu/internal/Add.cpp (outdated, resolved)
@ljk53 ljk53 requested a review from dzhulgakov January 23, 2020 22:39
@ezyang (Contributor) commented Jan 24, 2020

such as PSIMD, cpu_info, and pthreadpool

In that case, if the updates are backwards compatible, it's generally a good idea to do them in a separate diff first, and then the main diff. Although in this case it looks like you got all the tests to work.

@ezyang ezyang requested a review from bwasti January 24, 2020 16:45
@ezyang (Contributor) commented Jan 24, 2020

I'm adding @bwasti to this PR, because the caching scheme here is similar to things that bwasti observed were necessary in his sparse experiments in https://github.com/pytorch/sparse

@gchanan (Contributor) commented Jan 24, 2020

ya, splitting out the module updates into a separate PR is probably a good idea. You never know what could break downstream and having a minimally revertible piece is nice.

@facebook-github-bot (Contributor) left a comment:

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dreiss (Contributor) left a comment:

Looks like all major comments have been addressed. I'm going to accept this so we can start testing it out in real apps and iterating on the frontend.

aten/src/ATen/CMakeLists.txt (resolved)
aten/src/ATen/native/ConvUtils.h (outdated, resolved)
aten/src/ATen/native/native_functions.yaml (outdated, resolved)
aten/src/ATen/native/utils/Allocator.h (outdated, resolved)
aten/src/ATen/native/xnnpack/Common.h (resolved)
groups,
output_min,
output_max),
"xnnpack::convolution not available!");
Contributor:

I'm a bit worried that this error message doesn't actually tell the user what the problem was. Let's be sure to improve this if anyone gets confused.

Contributor Author:

My idea was that if this error prints the line and file, the user can investigate. Otherwise, if I want to make the error message really descriptive, which is also a possibility, I would have to break that function into its constituent tests. Is that what you have in mind?

Contributor:

Yes. Don't need to do it right now, though.

Collaborator:

Add a bit more context "xnnpack engine for conv2d doesn't support this combination of padding and strides"

And put a TODO to improve this message
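For concreteness, a hedged sketch of the more descriptive check being suggested; the condition, argument names, and wording are illustrative rather than the code in this PR.

// TODO: split available() into its constituent tests so the exact failing
// condition can be reported.
TORCH_CHECK(
    available(weight, bias, padding, stride, dilation, groups, output_min, output_max),
    "xnnpack::convolution not available! The XNNPACK engine for conv2d does not ",
    "support this combination of padding, stride, dilation, and groups.");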

Collaborator:

You'll likely need to expose this function later to JIT or other parts to make sure that the rewriting pass handles the details safely. Or we could add a fallback path that, despite "prepacking", still just calls a regular conv if the params don't match. The general principle is: everything should run, but some things can run slowly. But it can be done in a separate diff.
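A hedged sketch of the fallback idea suggested here, continuing the hypothetical PackedConv2dContext example from the PR description above; usable and run_with_xnnpack are likewise invented names, not functions from this PR.

// Hypothetical declarations for the sketch.
bool usable(const at::Tensor& input);
at::Tensor run_with_xnnpack(const PackedConv2dContext& context, const at::Tensor& input);

at::Tensor conv2d_run(const PackedConv2dContext& context, const at::Tensor& input) {
  if (!usable(input)) {
    // Slow but correct fallback: the stock ATen convolution. The channels-last
    // re-layout does not change the weight logically, so it can be passed through.
    // Conv parameters are hard-coded for brevity; a real context would carry them.
    return at::conv2d(
        input, context.packed_weight, context.bias,
        /*stride=*/{1, 1}, /*padding=*/{0, 0}, /*dilation=*/{1, 1}, /*groups=*/1);
  }
  return run_with_xnnpack(context, input);  // Fast path.
}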


TORCH_CHECK(
usable(input_nhwc),
"xnnpack::convolution not usable!");
Contributor:

Same. If I hit this, I wouldn't know what the problem was.

aten/src/ATen/native/xnnpack/Factory.cpp (outdated, resolved)
@@ -0,0 +1,96 @@
#ifndef USE_XNNPACK
Contributor:

Maybe comment what this file is for.

cmake/Dependencies.cmake (outdated, resolved)
@ezyang (Contributor) commented Feb 10, 2020

The E2E bindings look pretty reasonable; I'm not sure what the current state of torchbind is but that's the main question I'd like resolved before shipping this

@kimishpatel (Contributor):

The E2E bindings look pretty reasonable; I'm not sure what the current state of torchbind is but that's the main question I'd like resolved before shipping this

I was gonna work on this following @jamesr66a's PR: https://github.com/pytorch/pytorch/pull/32938/files, taking a similar approach. Basically, a custom class (OpContext) registered with torchbind captures the op context. The setstate method registered with torchbind will create the OpContext. The current linear_prepack will also generate an OpContext as output that is consumed by linear_run. The freezing API can then be used to get rid of linear_prepack. I have a small quip describing this approach, to which I will add you guys for further comments.
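For illustration, a minimal sketch of the torchbind custom-class pattern being described, assuming the torch::class_ / def_pickle API; the class name, fields, and pickled state are invented for this example and are not the design in #32938.

#include <ATen/ATen.h>
#include <torch/custom_class.h>

// Hypothetical OpContext: holds the state captured by a prepack op so that
// forward() only pays for execution.
struct LinearOpContext : torch::CustomClassHolder {
  at::Tensor packed_weight;
  at::Tensor bias;
  LinearOpContext(at::Tensor weight, at::Tensor b)
      : packed_weight(std::move(weight)), bias(std::move(b)) {}
};

// Register the class with torchbind so TorchScript can hold it, and give it
// __getstate__/__setstate__ so a model containing prepacked state can be
// serialized and re-packed at load time.
static auto registry =
    torch::class_<LinearOpContext>("xnnpack", "LinearOpContext")
        .def_pickle(
            // __getstate__: persist the parameters in a picklable form.
            [](const c10::intrusive_ptr<LinearOpContext>& context)
                -> std::tuple<at::Tensor, at::Tensor> {
              return std::make_tuple(context->packed_weight, context->bias);
            },
            // __setstate__: re-create the OpContext, i.e. re-run prepacking.
            [](std::tuple<at::Tensor, at::Tensor> state)
                -> c10::intrusive_ptr<LinearOpContext> {
              return c10::make_intrusive<LinearOpContext>(
                  std::get<0>(state), std::get<1>(state));
            });

In that scheme, linear_prepack would return a c10::intrusive_ptr<LinearOpContext> consumed by linear_run, and the freezing pass could hoist the prepack call out of forward().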

@kimishpatel (Contributor):

@ezyang, note that linear_prepack and linear_run, or their updated _-prefixed versions, will eventually have to return not a Tensor but a custom class registered with torchbind. This means we will have to introduce the new class in native_functions.yaml and patch everything else to make it work with the build system. At the moment, the similar torchbind-based approach used for quantization does not have to deal with this, as the quantization ops are registered differently.
So my question is: is the introduction of these custom types in native_functions.yaml, and the subsequent patching, acceptable?

@ezyang (Contributor) commented Feb 12, 2020

I suspect the better strategy is to do the registrations manually, in the same way quantization does them. But as long as there's an agreed upon plan on record here, I don't mind if the patch goes in "as is".

@dzhulgakov (Collaborator) left a comment:

Looks good modulo a few renames in inline comments. Let's land this and then have follow-ups with frontend tests.

Also - do you plan in some follow up diffs to re-route regular non-prepacked ops to call xnnpack? Or is the perf penalty too high for it?

aten/src/ATen/native/native_functions.yaml (outdated, resolved)
aten/src/ATen/native/native_functions.yaml (outdated, resolved)

aten/src/ATen/native/xnnpack/Convolution.cpp (resolved)
@AshkanAliabadi (Contributor, Author) commented Feb 18, 2020

  • Renamed the following operators:
    conv_prepack -> _conv_prepack
    conv_run -> _conv_packed
    linear_prepack -> _linear_prepack
    linear_run -> _linear_packed
  • Added XNNPACK integration files to the build in ATen/CMakeLists.txt only if USE_XNNPACK is true. The _conv2d_prepack and family definitions are still required, even in the absence of XNNPACK, to avoid linker errors.
  • Removed erroneously added header from ATen/native/ConvUtils.h
  • Renamed a number of variables and functions.
  • Handle the case where batch size is zero.
  • Added true to the end of the conditions.
  • Modified empty_with_tail_padding() to take dtype instead of TensorOptions, and to use resize_ instead of set_sizes_contiguous + empty_tensor_restride combo.
  • Improved error message to be more descriptive.
  • Added TODO to break error messages into respective constituents to better describe reason for failure.

@facebook-github-bot (Contributor) left a comment:

@AshkanAliabadi has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@AshkanAliabadi (Contributor, Author):

Also - do you plan in some follow up diffs to re-route regular non-prepacked ops to call xnnpack? Or is the perf penalty too high for it?

Sorry this went under my radar. Yes, I plan to follow up with another diff that will reroute those operators. The performance penalty shouldn't be high.

Also, there is another patch left over that I took out of the original diff; its purpose is to unify threading on mobile, reconciling XNNPACK's use of an updated version of pthreadpool whose newer interface our internal custom Caffe2 implementation is not written against. Marat has also improved pthreadpool's implementation itself to use spin locks for short waits, which now makes our custom Caffe2 implementation redundant, as the latter was based on the same premise. My benchmarks last half showed better performance with pthreadpool's updated implementation compared to our custom version, and on top of that I think a unified threading solution makes for easier code maintenance too. We have run into linker issues complaining about duplicate symbols any time we wanted to add support for a new platform, including internal BUCK targets.

So yeah, these are the two pieces on the backend side of things remaining. There's of course the JIT side as well which Kimish is working on.

@kimishpatel (Contributor):

Sorry this went under my radar. Yes, I plan to follow up with another diff that will reroute those operators. The performance penalty shouldn't be high.

What is this referring to, Ashkan? Is this the same for ops such as Add?
Also, in the diffs I am working on, I was able to get around the linker issue, as we discussed in person, and compile XNNPACK against pthreadpool while still leaving the older interface that uses C2's internal implementation. Eventually we should unify these, but with my patch the unification should not be a blocker.

@AshkanAliabadi (Contributor, Author) commented Feb 24, 2020 via email

@facebook-github-bot (Contributor):

@AshkanAliabadi merged this pull request in 941b424.

facebook-github-bot pushed a commit that referenced this pull request Feb 25, 2020
Summary:
Pull Request resolved: #33722


Pull Request resolved: #32509

Test Plan:
Build: CI
Functionality: Not exposed

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
hczhu pushed a commit that referenced this pull request Feb 28, 2020
Summary:
Pull Request resolved: #32509

Reviewed By: dreiss

Differential Revision: D19521853

Pulled By: AshkanAliabadi

fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
hczhu pushed a commit that referenced this pull request Feb 28, 2020
Summary:
Pull Request resolved: #33722


Pull Request resolved: #32509

Test Plan:
Build: CI
Functionality: Not exposed

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020

Summary:
Pull Request resolved: pytorch#32509

Reviewed By: dreiss

Differential Revision: D19521853

Pulled By: AshkanAliabadi

fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020

Summary:
Pull Request resolved: pytorch#33722


Pull Request resolved: pytorch#32509

Test Plan:
Build: CI
Functionality: Not exposed

Reviewed By: dreiss

Differential Revision: D20069796

Pulled By: AshkanAliabadi

fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c