
TANH/Sigmoid 16-bit activation functions using LUT #34589

Merged
merged 8 commits into tensorflow:master on Apr 7, 2020

Conversation

wwwind
Contributor

@wwwind wwwind commented Nov 25, 2019

We think the reference functions for 16-bit activation are too complex for efficient implementation on resource constrained platforms and propose to replace the functions with a lookup table approach as follows:

First, rescale the input data to the fixed range of -10.7 to +10.7. Then use a 256-entry lookup table for Sigmoid, followed by linear interpolation, to efficiently derive the result.

The Sigmoid LUT is also used for the TANH function, because tanh(x) = 2*sigmoid(2*x) - 1 and we take the symmetry into account.
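
As an illustration of the approach, here is a minimal float sketch (not the actual fixed-point kernel; the class and constant names are ours, and the rescaling of the int16 input into this range is assumed to happen before the lookup):

```cpp
#include <algorithm>
#include <cmath>

// Illustrative sketch only: a 256-entry sigmoid table over [-10.7, 10.7],
// evaluated with linear interpolation, with tanh derived from the same table.
// The real kernel operates on int16 fixed-point data; this float version just
// shows the table-plus-interpolation structure.
class SigmoidLut {
 public:
  SigmoidLut() {
    for (int i = 0; i < kTableSize; ++i) {
      const float x = kMin + (kMax - kMin) * i / (kTableSize - 1);
      table_[i] = 1.0f / (1.0f + std::exp(-x));
    }
  }

  // sigmoid(x), with x clamped to the table range.
  float Sigmoid(float x) const {
    if (x <= kMin) return table_[0];
    if (x >= kMax) return table_[kTableSize - 1];
    const float pos = (x - kMin) / (kMax - kMin) * (kTableSize - 1);
    const int idx = std::min(static_cast<int>(pos), kTableSize - 2);
    const float frac = pos - idx;
    // Linear interpolation between the two surrounding table entries.
    return table_[idx] + frac * (table_[idx + 1] - table_[idx]);
  }

  // tanh(x) via the identity tanh(x) = 2 * sigmoid(2x) - 1.
  float Tanh(float x) const { return 2.0f * Sigmoid(2.0f * x) - 1.0f; }

 private:
  static constexpr int kTableSize = 256;
  static constexpr float kMin = -10.7f;
  static constexpr float kMax = 10.7f;
  float table_[kTableSize];
};
```

The symmetry mentioned above (sigmoid(-x) = 1 - sigmoid(x), tanh(-x) = -tanh(x)) can additionally be used to store only half of the table; that refinement is omitted from the sketch for brevity.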

The proposed reference kernel implementation also has higher accuracy than the existing one.
With the current functions, we measure a difference of up to 6.3 quantized units for sigmoid and 11.7 for tanh compared to the floating-point reference implementation over the 16-bit input range (representing -8.0 to +8.0). With the implementation in this patch, the error is reduced to less than 1.5 quantized units compared to the floating-point reference for both tanh and sigmoid.

…LUT.

We think the reference functions for 16-bit activation are too complex for
efficient implementation on resource constrained platforms and propose
to replace the functions with a lookup table approach as follows:

First rescale the input data to fixed range of -10.7 to +10.7
Use a 256-entry lookup table for Sigmoid followed by linear interpolation
to efficiently derive the result.

The Sigmoid LUT is also used for the TANH function,
because tanh(x) = 2*sigmoid(2*x) - 1 and we take the symmetry into account.

The proposed reference kernel implementation also has higher accuracy than the existing one.
On the current functions we measure a difference of up to 6.3 for sigmoid and 11.7 for
tanh in quantized units compared to the floating point reference implementation over
the 16-bit input range (representing -8.0 to +8.0). For the implementation of this patch we
see the error reduced to less than 1.5 quantized units compared to floating point
reference for both tanh and sigmoid.

Change-Id: I4d1406928db65740c1750c9cd7bfffab30771419
@tensorflow-bot tensorflow-bot bot added the size:M CL Change Size: Medium label Nov 25, 2019
@gbaned gbaned self-assigned this Nov 26, 2019
@gbaned gbaned added the comp:lite TF Lite related issues label Nov 26, 2019
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Nov 26, 2019
Change-Id: Ia9fa7e70e15a5174a045ee5f98cf4f78e6a43ef6
@gbaned gbaned added the awaiting review Pull request awaiting review label Nov 29, 2019
@suharshs suharshs requested review from jdduke, suharshs and bjacob and removed request for nutsiepully December 9, 2019 19:29
@bjacob
Contributor

bjacob commented Dec 9, 2019

Can you share benchmark results showing how this compares to the current 16-bit fixed-point implementations on various ARM CPUs?

Because pure arithmetic implementations scale linearly with SIMD width and lookup tables do not, we have typically found that, despite looking expensive in source code, pure arithmetic implementations can actually perform well. As ARM CPUs' SIMD width increases, they keep looking better. For example, on a Cortex-A76 CPU (e.g. Pixel 4) we can issue 2 128-bit multiplications each cycle, each doing 8 16-bit fixed-point multiplications, so each fixed-point multiplication effectively costs only one sixteenth of a cycle.

As this varies greatly from one CPU to another, we don't expect one approach (arithmetic vs. LUT) to consistently outperform the other. Instead, we have started to let both coexist in the source code. We still need to pick one to be used by default, but at least having both in the source code makes it easy to switch.

I don't remember the latest status, but some of the following people might: @renjie-liu @abattery @jianlijianli

@renjie-liu
Member

Thanks Elena for the PR. As Benoit pointed out, a LUT is not necessarily desirable from the performance point of view.

Another thing: I am not sure the accuracy measurement is "correct". Do you have any real model to benchmark the accuracy on? As you said, your method uses the range from -10.7 to 10.7, but shouldn't that be based on a real scenario?

Jaesung can probably comment more, but if you want to cover more range for your scenario, could you just add one more bit to the integer part (currently 3 bits, so it represents -8 to 8) for sigmoid?
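
For context, here is a small illustration of that trade-off (an assumption about the fixed-point format, not TFLite code): adding one integer bit doubles the representable range but also doubles the quantization step.

```cpp
#include <cstdio>

// Illustration only: for an int16 value in Q(m).(15-m) fixed point, the
// representable range is [-2^m, 2^m) and the quantization step is 2^m / 2^15.
int main() {
  for (int int_bits = 3; int_bits <= 4; ++int_bits) {
    const double range = 1 << int_bits;     // 8.0 for Q3.12, 16.0 for Q4.11
    const double step = range / (1 << 15);  // the step grows with the range
    std::printf("Q%d.%d: range [-%g, %g), step %g\n", int_bits, 15 - int_bits,
                range, range, step);
  }
  return 0;
}
```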

@abattery
Contributor

Hi, Elena.

Thank you for your contributions :)

Regarding performance, I have seen that LUT-based methods for Tanh/Sigmoid are in practice faster than fixed-point arithmetic based calculations on modern mobile ARM CPUs for the 8-bit integer case. However, it's worth double-checking for the 16-bit integer case. Could you share benchmark results before and after?

It seems that you are trying to increase accuracy by taking the absolute value, since the tanh function is symmetric, and doing a LUT lookup where the table covers a slightly larger range than the fixed-point arithmetic does. At first glance this makes sense to me. However, as Renjie said, the Sigmoid case has the same representable range [-8, 8] as the current fixed-point arithmetic kernels for Sigmoid, so I am not sure this approach improves the Sigmoid case. Could you share your accuracy difference results for both Sigmoid and Tanh with us, and explain how you measured the accuracy?

Best regards,
Jaesung

@abattery abattery self-requested a review December 10, 2019 02:04
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Dec 10, 2019
@jianlijianli
Contributor

Cache aliasing also plays a key role in the performance of a LUT, and that is really hardware dependent. It seems the best option is for them to "coexist", as Benoit mentioned in his comment.

@wwwind
Contributor Author

wwwind commented Dec 18, 2019

Hi guys,

Thank you for the review @bjacob @abattery @jianlijianli @renjie-liu !

I attached the file accuracy_test.cpp that demonstrates how the accuracy was tested:
https://gist.github.com/wwwind/0b7413833cf8a7596bee084403224fa5

The test code reports the range and total sum of the error over the whole signed 16-bit integer range. Below are results of the testing.

For sigmoid:

tensorflow sigmoid implementation error -6.392090 to 5.392578, error sum -32767.042969
proposed sigmoid implementation error -1.015625 to 1.015625, error sum -0.043635

For tanh:

tensorflow tanh implementation error -11.791016 to 11.791016, error sum 0.992188
proposed tanh implementation error -1.476562 to 1.476562, error sum 0.992188
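
For illustration, a minimal sketch of this kind of full-range sweep (not the linked accuracy_test.cpp itself; the stub kernel and the input/output scales below are assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Stand-in for the int16 kernel under test (assumed: input represents [-8, 8),
// output is scaled by 1/32768); replace with the real implementation.
int16_t QuantizedSigmoid(int16_t x) {
  const double y = 1.0 / (1.0 + std::exp(-x * (8.0 / 32768.0)));
  return static_cast<int16_t>(std::lround(std::min(y * 32768.0, 32767.0)));
}

int main() {
  const double in_scale = 8.0 / 32768.0;
  const double out_scale = 1.0 / 32768.0;
  double min_err = 0.0, max_err = 0.0, err_sum = 0.0;
  // Sweep every representable int16 input and compare against the float
  // reference, expressing the error in quantized output units.
  for (int32_t i = -32768; i <= 32767; ++i) {
    const double ref = 1.0 / (1.0 + std::exp(-i * in_scale));
    const double got = QuantizedSigmoid(static_cast<int16_t>(i)) * out_scale;
    const double err = (got - ref) / out_scale;
    min_err = std::min(min_err, err);
    max_err = std::max(max_err, err);
    err_sum += err;
  }
  std::printf("sigmoid error %f to %f, error sum %f\n", min_err, max_err,
              err_sum);
  return 0;
}
```

Summing absolute errors instead of raw errors would be a one-line change here (err_sum += std::fabs(err)).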

Regarding the performance benchmark, we do not have such measurements at hand. When we refer to resource-constrained platforms, apart from the world of CPUs, MCUs and GPUs, we also consider neural network accelerators: dedicated hardware where a small look-up table is likely to be the preferred implementation path. However, even just for CPUs running the non-optimised reference code, we expect the proposed solution to be more efficient than the gemmlowp one, due to the number of operations involved in the iterative approach used by the latter.

In the proposed solution, computing a value takes about 14 operations. The existing implementation is based on the iterative Newton-Raphson division algorithm to find a reciprocal, which takes around 12 operations, plus the computation of the exponent, which needs a Taylor approximation with 4 terms. So it looks a lot heavier, at least for a single result computation.

We agree that, for a vectorised implementation, the performance advantage may or may not shift towards the iterative approach. Therefore, the approach proposed by @bjacob, @abattery and @jianlijianli, where both implementations co-exist, seems a sensible compromise. I would suggest that we keep the more accurate, LUT-based implementation as the default one. Should I proceed to make the corresponding changes in this PR? What would be the best way to implement the switching mechanism between the two implementations? Could you please point to an example in the code?
Thanks!

@renjie-liu
Member

Hi,

thanks a lot for the testing! and thanks a lot for the work!

I wonder, should we sum the absolute values instead of the raw values? Also, the large sigmoid error sum makes me wonder whether it is a direct overflow: -32767.042969 really looks like -32768 plus some number (maybe we can just try with the extreme values?), or maybe every value is just systematically shifted by -0.5?

Also, given that you're verifying the int16 values, judging from the numbers both have good results: about ±12/(1 << 15) for tanh.

I think it's fine to keep both approaches. You can refer here for prepare: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/activations.cc#L395-L422

and here for eval: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/activations.cc#L813-L824

@gbaned
Contributor

gbaned commented Dec 24, 2019

@wwwind Could you please check renjie-liu's comments and keep us posted. Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Dec 24, 2019
@wwwind
Contributor Author

wwwind commented Dec 24, 2019

Hi @renjie-liu,

Thanks a lot for your review. I pushed an updated version. Could you please take a look?
Also, I think that this block is no longer relevant for the LUT approach: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/activations.cc#L515 Should it be removed?

Thanks,
Elena

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 24, 2019
@gbaned gbaned requested a review from renjie-liu January 3, 2020 10:16
@gbaned gbaned added the awaiting review Pull request awaiting review label Jan 3, 2020
m.SetInput<int16_t>({
0, -6, 2, 4, //
3, -2, 10, 1, //
Member

why delete this line?

Contributor Author

@renjie-liu this line has been modified by replacing 10 with 8, so that the input number is within the given range of the input tensor, and I added 4 more numbers.

@@ -798,14 +936,12 @@ TfLiteStatus TanhEval(TfLiteContext* context, TfLiteNode* node) {
case kTfLiteInt16: {
TanhParams params;
params.input_left_shift = data->input_left_shift;
if (kernel_type == kReference) {
Member

we intended to keep the reference ones

Contributor

It looks like we did not fully align on this. From the discussion preceding this change, we understood that we would keep the new LUT-based implementation as the default one, and assumed that "default" == "kReference".

It also appears that, if GEMMLOWP_NEON is not defined, the optimized_ops::Tanh() called in the kFixedPointOptimize case is the same as the reference_ops::Tanh() that used to be called for kReference.

So we just followed the same code pattern as is used for the Int8 and UInt8 versions.

I see now that, later in this file, the kGenericOptimized kernel_type seems to be registered as the default one for Tanh and Logistic (in Register_TANH()).

Do you think we should change to calling different implementations for different kernel_type? I.e.
kReference -> reference_ops::Tanh()
kFixedPointOptimize -> optimized_ops::Tanh()
kGenericOptimized -> EvalUsingLookupTableTanh16Bit() (default)
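
For illustration, a self-contained sketch of such a dispatch (the enum values and helper names are placeholders mirroring the mapping above, and a single float-based stub stands in for all three kernels so the sketch runs; this is not the merged TFLite code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

enum KernelType { kReference, kFixedPointOptimized, kGenericOptimized };

namespace {
// Stand-in for a real int16 kernel (assumed: input represents [-8, 8), output
// is scaled by 1/32768); each branch below would call a different real kernel.
void StubTanhInt16(const std::vector<int16_t>& in, std::vector<int16_t>* out) {
  const double in_scale = 8.0 / 32768.0;
  const double out_scale = 1.0 / 32768.0;
  out->resize(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const double y = std::tanh(in[i] * in_scale) / out_scale;
    (*out)[i] = static_cast<int16_t>(
        std::lround(std::max(-32768.0, std::min(y, 32767.0))));
  }
}
}  // namespace

void TanhInt16Eval(KernelType kernel_type, const std::vector<int16_t>& in,
                   std::vector<int16_t>* out) {
  switch (kernel_type) {
    case kReference:
      StubTanhInt16(in, out);  // would call reference_ops::Tanh()
      break;
    case kFixedPointOptimized:
      StubTanhInt16(in, out);  // would call optimized_ops::Tanh()
      break;
    case kGenericOptimized:
    default:
      StubTanhInt16(in, out);  // would call the LUT path (proposed default)
      break;
  }
}

int main() {
  const std::vector<int16_t> in = {-32768, -4096, 0, 4096, 32767};
  std::vector<int16_t> out;
  TanhInt16Eval(kGenericOptimized, in, &out);
  for (std::size_t i = 0; i < out.size(); ++i) {
    std::printf("%d -> %d\n", in[i], out[i]);
  }
  return 0;
}
```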

Contributor Author

@renjie-liu Could you please clarify how the current implementation and the suggested LUT-based one should co-exist? Should we register the suggested one as the reference, because it is more accurate? The current one looks like a good fit for the kFixedPointOptimize kernel type. Which one should be used when the kernel type is not specified?
Thanks!

@tensorflow-bot tensorflow-bot bot removed the ready to pull PR ready for merge process label Mar 18, 2020
@wwwind
Contributor Author

wwwind commented Mar 18, 2020

Hi @jianlijianli, could you please re-approve this PR? I had to push a fix for the warning.
Thanks!

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Mar 21, 2020
@gbaned gbaned added the awaiting review Pull request awaiting review label Mar 24, 2020
jianlijianli
jianlijianli previously approved these changes Mar 24, 2020
Contributor

@jianlijianli jianlijianli left a comment

Thanks for fixing the warning.

@tensorflow-bot tensorflow-bot bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 24, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 24, 2020
@gbaned gbaned added kokoro:force-run Tests on submitted change and removed awaiting review Pull request awaiting review labels Mar 25, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 25, 2020
@tensorflow-bot tensorflow-bot bot removed the ready to pull PR ready for merge process label Mar 25, 2020
@wwwind
Contributor Author

wwwind commented Mar 25, 2020

Hi @gbaned, @liufengdb! I pushed a small fix for the error reported by buildifier.
Could you please re-approve this PR?
Thanks!

@gbaned gbaned added the awaiting review Pull request awaiting review label Mar 27, 2020
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 27, 2020
@tensorflow-bot tensorflow-bot bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 27, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 27, 2020
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Mar 29, 2020
@wwwind
Contributor Author

wwwind commented Apr 6, 2020

Hi @rthadur! This PR was approved 10 days ago and all checks are green.
Is it stuck internally on some errors? Could you please take a look?
Thanks!

@rthadur
Contributor

rthadur commented Apr 6, 2020

@wwwind thanks for your patience. @jianlijianli is working on this internally; we will get back to you ASAP.

@google-ml-butler google-ml-butler bot added the kokoro:force-run Tests on submitted change label Apr 7, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Apr 7, 2020
@tensorflow-copybara tensorflow-copybara merged commit c57bb8b into tensorflow:master Apr 7, 2020