NHWC memory layout support and XNNPACK integration for mobile #30644
Conversation
💊 CircleCI build failures summary and remediations, as of commit 69f5e7d: 4 new failures recognized by patterns. The following build failure does not appear to be due to upstream breakage: pytorch_linux_xenial_py3_clang5_mobile_build (1/4), Step: "Build". A detailed failure analysis for each build is available interactively on the Dr. CI website. 🕵️
Please review this code, or the areas you are interested in. Any help is appreciated.
Publishing to collect feedback.
@@ -5,8 +5,7 @@
#ifndef C10_MOBILE
@ilia-cher Please take a look.
    fn(0, i);
  }
}
native::mobile::internal::threadpool().run(fn, range);
@ilia-cher What would be the repercussions of not passing the thread ID to the callback? pthreadpool, the lower level threading library we use on mobile, does not support that feature and I am wondering whether that matters enough to go through the trouble of supporting that use case.
I think there is an internal API to query the thread ID: get_thread_num() https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Parallel.h#L30
Do you refer to it or some other API?
Here is one sample call-site of the get_thread_num() API in aten: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorIteratorReduce.cpp#L34
I think it is used to perform some ad-hoc reduction - not sure if it's actually needed by any mobile model for real.
I remember right now we are using a thread-local to mimic this behavior - I'm not sure it needs to depend on the specific thread pool implementation:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.cpp#L105
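The thread-local approach mentioned above can be sketched as follows. This is a hypothetical illustration, not the actual ATen code: the pool would set a thread-local slot before invoking the user callback, so `get_thread_num()` works regardless of which thread pool implementation runs underneath.

```python
import threading

# Thread-local storage holding the current worker's thread number.
# Names here are illustrative stand-ins for the ATen internals.
_tls = threading.local()

def set_thread_num(n):
    # Called by the pool before running a task on a worker thread.
    _tls.thread_num = n

def get_thread_num():
    # 0 denotes the calling (main) thread when no pool thread set it.
    return getattr(_tls, "thread_num", 0)
```

Because the slot is thread-local, each worker sees only its own number, and the main thread defaults to 0 without any pool involvement.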
Took a closer look - it seems you refer to the first param (thread_pool_task_id) of the fn callback of _run_with_pool. I think this callback is only used in this file by _parallel_run, and the first param is marked as unused there: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/ParallelNative.cpp#L135
If we move to using pthreadpool, which doesn't pass a thread ID to fn, then we can simply remove it. It doesn't seem to be in the public API of the parallel_for callback signature (which handles task_id instead of thread_id).
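The task_id-versus-thread_id distinction above can be illustrated with a toy range splitter. This is a hedged sketch, not the real ATen parallel_for signature: the body receives a task index (which chunk of the range it owns), not the identity of the worker thread that happens to run it.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(begin, end, grain_size, body, max_tasks=4):
    """Split [begin, end) into chunks; call body(task_id, start, stop).

    task_id identifies the chunk, not the worker thread - the same
    distinction discussed above. max_tasks is an illustrative cap.
    """
    n = end - begin
    if n <= 0:
        return
    # Small ranges run inline on the calling thread as task 0.
    if n <= grain_size or max_tasks <= 1:
        body(0, begin, end)
        return
    num_tasks = min(max_tasks, (n + grain_size - 1) // grain_size)
    chunk = (n + num_tasks - 1) // num_tasks
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        futures = []
        for task_id in range(num_tasks):
            start = begin + task_id * chunk
            stop = min(start + chunk, end)
            if start < stop:
                futures.append(pool.submit(body, task_id, start, stop))
        for f in futures:
            f.result()  # propagate exceptions from workers
```

With this shape, a pool that never exposes thread IDs (like pthreadpool) can still drive the callback, since only the chunk index matters.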
@@ -24,14 +25,26 @@ DEFINE_DISPATCH(leaky_relu_stub);
DEFINE_DISPATCH(leaky_relu_backward_stub);

Tensor hardtanh(const Tensor& self, Scalar min, Scalar max) {
  // if (mobile::cpu::use_clamp(self, min, max)) {
  //   return mobile::cpu::clamp(self, min, max);
  // }
Will enable all of these, and all the commented-out lines below in this diff.
// output = mobile::cpu::convolution(
//     input, weight, bias,
//     params.padding, params.stride, params.dilation, params.groups, params.transposed);
The mobile hook for convolutions. This code path is not ideal since portions of the convolution can be computed once and cached - a major source of performance uplift - and this code path cannot do that. Ideally we want to use the newly introduced modules at the end of this diff, or, even better, the JIT.
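The create/run split argued for above can be sketched like this. Everything here is an illustrative stand-in (the class name, the fake "packing"), not the actual XNNPACK integration: the point is that the expensive one-time work happens at construction and every forward() call only pays for run().

```python
class PrepackedConv2d:
    """Toy illustration of caching one-time work across forward() calls."""

    def __init__(self, weight, bias):
        # Stand-in for real weight prepacking (e.g. NHWC reordering):
        # done once here, reused by every subsequent run().
        self.packed_weight = [list(reversed(row)) for row in weight]
        self.bias = bias

    def run(self, x):
        # Stand-in for the actual convolution using the cached,
        # prepacked weights - no per-call packing cost.
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.packed_weight, self.bias)]
```

A functional hook that receives raw weights on every call cannot amortize the packing step this way, which is the performance gap described above.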
for non-dilated case here */
return at::thnn_conv2d(
    input, weight, kernel_size, bias,
    stride, padding);
NNPACK is gone. I left it there for Caffe2, but XNNPACK is still faster even when used inefficiently via the commented-out code path above.
// factored out, resulting in time savings per call to forward() - something
// this API cannot do. Furthermore, this API does not allow for fusion of
// non-linear operators which, again, is something that the exposed c10 mobile
// operators can handle.
This explains why the commented out code paths above are not the most efficient use of our resources.
// TODO (Ashkan)

ThreadPool& threadpool() {
  static ThreadPool threadpool_(4u);
Will have to fix this.
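One plausible fix for the hard-coded pool size, sketched below under stated assumptions: derive the size from the hardware instead of the literal `4u`. The cap value and function name are illustrative, not the eventual implementation.

```python
import os

def default_threadpool_size(cap=4):
    # Sketch: size the pool from available cores rather than a
    # hard-coded constant, with a cap to limit oversubscription on
    # heterogeneous (big.LITTLE) mobile SoCs. The cap of 4 is an
    # assumption for illustration, not a recommendation.
    return max(1, min(cap, os.cpu_count() or 1))
```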
@@ -1,12 +1,12 @@
from __future__ import absolute_import, division, print_function, unicode_literals
import torch
-from torch.nn import Conv2d, Conv3d, ReLU, Linear, BatchNorm2d
+from torch.nn import Conv2d, Conv3d, ReLU, ReLU6, Linear, BatchNorm2d
import torchvision

def mobilenetv2():
@ljk53 @suo @dzhulgakov Having separate mobile-specific models is not ideal. Ideally we want to perform these transformations in a JIT pass. Not sure how to proceed here: should we move forward with this temporarily, or merge the underlying implementation but not "expose" it for now through these modules? The old code paths are not going to be very efficient, as mentioned in the comments above.
As you correctly pointed out, let's split this PR into smaller ones. I think the Python frontend change can be separated out (as you mentioned, we are still deciding between this and a JIT pass...).
@@ -87,6 +87,7 @@ def fuse_known_modules(mod_list):
OP_LIST_TO_FUSER_METHOD = {
    (torch.nn.Conv2d, torch.nn.BatchNorm2d): fuse_conv_bn,
    (torch.nn.Conv2d, torch.nn.BatchNorm2d, torch.nn.ReLU): fuse_conv_bn_relu,
    (torch.nn.Conv2d, torch.nn.BatchNorm2d, torch.nn.ReLU6): fuse_conv_bn_relu,
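The pattern-to-fuser map added above works by keying on the tuple of module types in sequence. A minimal self-contained sketch of that lookup, with stand-in classes rather than the real torch.nn modules or torch.quantization internals:

```python
# Stand-in module types; in the real map these are torch.nn classes.
class Conv2d: ...
class BatchNorm2d: ...
class ReLU6: ...

def fuse_conv_bn_relu(*mods):
    # Illustrative fuser: the real one returns a fused nn.Module.
    return ("fused", [type(m).__name__ for m in mods])

# The key is the exact sequence of module types; the value is the
# function that knows how to fuse that sequence.
OP_LIST_TO_FUSER_METHOD = {
    (Conv2d, BatchNorm2d, ReLU6): fuse_conv_bn_relu,
}

def fuse(mod_list):
    types = tuple(type(m) for m in mod_list)
    fuser = OP_LIST_TO_FUSER_METHOD.get(types)
    # Unknown patterns pass through unchanged.
    return fuser(*mod_list) if fuser else mod_list
```

This is why adding the `ReLU6` entry is a one-line change: the lookup is purely structural, so a new (Conv2d, BatchNorm2d, ReLU6) key can reuse the existing fuser.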
@@ -212,9 +203,7 @@ int get_num_threads() {
  return _get_intraop_pool().size() + 1;
}
#else
caffe2::ThreadPool* pool = caffe2::mobile_threadpool();
// caffe2::ThreadPool::getNumThreads() counts the current thread.
return !pool || in_parallel_region() ? 1 /* current thread */ : pool->getNumThreads();
@ilia-cher Can you explain what in_parallel_region() is for? To avoid over-parallelization when invoked from a parallel context? This check was added to avoid a deadlock while we were using caffe2::ThreadPool, if I remember correctly, but now that we are moving away from that, is it still needed? cc @ljk53
I remember getNumThreads() acquires a mutex which is not reentrant-safe, and there are some call-sites in the codebase that call getNumThreads() from a thread in the threadpool, which can cause a deadlock. I remember @xta0 both added this check here and removed the mutex from the pool (since the thread-count variable won't be updated after initialization, a race condition is very unlikely). We can check whether the new get_thread_count() method needs this or not.
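The guard under discussion can be sketched with a thread-local flag. This is a hypothetical illustration of the mechanism, not the ParallelNative code: while a pool worker is executing a task, the flag is set, so a nested query reports a single thread instead of re-entering the pool (and, in the old caffe2::ThreadPool, its non-reentrant lock).

```python
import threading

# Thread-local "am I inside a parallel region?" flag; names are
# illustrative stand-ins for the ATen internals.
_state = threading.local()

def in_parallel_region():
    return getattr(_state, "in_region", False)

def run_in_pool(task):
    # The pool sets the flag around each task it executes.
    _state.in_region = True
    try:
        task()
    finally:
        _state.in_region = False

def get_num_threads(pool_size=4):
    # From inside a parallel region, report 1 to suppress nested
    # parallelism - avoiding the deadlock described above.
    return 1 if in_parallel_region() else pool_size
```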
I think one thing I should do that will make it easier on the reviewers is to break this PR into several smaller ones. :) That way we can merge the core non-controversial parts of the implementation that will not affect anyone, and then debate the best ways to enable it and expose it to the user.
While this PR seems to address PyTorch Mobile needs, XNNPACK is not limited to mobile platforms: it supports Web inference through WebAssembly and WebAssembly SIMD micro-kernels, and includes a decent and quickly improving set of micro-kernels for server-class x86-64 processors, i.e. targeting AVX/AVX2/AVX512 features. I'd suggest not restricting its integration to just mobile platforms.
@@ -1,47 +0,0 @@
#include "caffe2/utils/threadpool/pthreadpool.h"
I'm under the impression that this caffe2 pthreadpool glue code is still used in some FB production since we haven't fully migrated to PyTorch yet - is that correct? In that case we cannot immediately delete these files...
}

const auto registry = c10::RegisterOperators()
    .op("mobile::conv2d_create",
Any chance we can move op registration to native_functions.yaml? We introduced manual op registration for quantized ops, but people were debating whether it's the right direction to go. I think the current suggestion is still to stick to native_functions.yaml & codegen? cc: @dzhulgakov @gchanan @smessmer
native_functions.yaml doesn't support namespaces like mobile:: yet, but generally yes, it's usually better to define things in native_functions.yaml because you get the C++ frontend generated.
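For context on the trade-off being discussed, manual registration boils down to a table keyed by schema strings that ops are looked up in at dispatch time. A toy sketch - the registry, names, and error behavior here are illustrative, not the real c10::RegisterOperators machinery:

```python
# Hypothetical schema-string -> implementation table, standing in for
# manual registration a la c10::RegisterOperators().op(...).
_REGISTRY = {}

def register_op(schema, fn):
    if schema in _REGISTRY:
        raise ValueError(f"duplicate registration for {schema}")
    _REGISTRY[schema] = fn
    return fn

def dispatch(schema, *args):
    # Look up by schema string and call; no codegen involved.
    return _REGISTRY[schema](*args)

# Illustrative registration mirroring the op name in the diff above.
register_op("mobile::conv2d_create", lambda w, b: {"weight": w, "bias": b})
```

The codegen route from native_functions.yaml, by contrast, emits the registration and the C++ frontend bindings from one declaration, which is the advantage mentioned above.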
Breaking this PR into smaller chunks to enable a more controlled roll-out, starting with #32509. Closing this PR.
Summary: In order to improve CPU performance on floating-point models on mobile, this PR introduces a new CPU backend for mobile that implements the most common mobile operators with NHWC memory layout support through integration with XNNPACK. XNNPACK itself, and this codepath, are currently only included in the build; the actual integration is gated with USE_XNNPACK preprocessor guards. This preprocessor symbol is intentionally not passed on to the compiler, so as to enable this rollout in multiple stages in follow-up PRs. This changeset will build XNNPACK as part of the build if the identically named USE_XNNPACK CMake variable, defaulted to ON, is enabled, but will not actually expose or enable this code path in any other way.

Furthermore, it is worth pointing out that in order to efficiently map models to these operators, some front-end method of exposing this backend to the user is needed. The less efficient implementation would be to hook these operators into their corresponding native implementations, granted that a series of XNNPACK-specific conditions are met, much like how NNPACK is integrated with PyTorch today. Having said that, while the above implementation is still expected to outperform NNPACK based on the benchmarks I ran, it would leave a considerable gap between the performance achieved and the maximum performance potential XNNPACK enables, as it does not provide a way to compute and factor one-time operations out of the innermost forward() loop. The more optimal solution, and one we will decide on soon, would involve either providing a JIT pass that maps nn operators onto these newly introduced operators while allowing one-time calculations to be factored out, much like quantized mobile models, or introducing new eager-mode modules that directly call into these implementations through c10 or some other mechanism, also allowing for decoupling of op creation from op execution.

This PR does not include any of the front-end changes mentioned above. Neither does it include the mobile threadpool unification present in the original #30644. Furthermore, this codepath seems to be faster than NNPACK in a good number of use cases, which can potentially allow us to remove NNPACK from aten to make the codebase a little simpler, granted that there is widespread support for such a move. Regardless, these changes will be introduced gradually and in a more controlled way in subsequent PRs.

Pull Request resolved: #32509
Reviewed By: dreiss
Differential Revision: D19521853
Pulled By: AshkanAliabadi
fbshipit-source-id: 99a1fab31d0ece64961df074003bb852c36acaaa
Summary: Pull Request resolved: #33722 (summary identical to the commit above).

Pull Request resolved: #32509
Test Plan: Build: CI. Functionality: Not exposed.
Reviewed By: dreiss
Differential Revision: D20069796
Pulled By: AshkanAliabadi
fbshipit-source-id: d46c1c91d4bea91979ea5bd46971ced5417d309c
Gathering CI signal for now ...