
TANH/Sigmoid 16-bit activation functions using LUT #34589

Merged
merged 8 commits into tensorflow:master on Apr 7, 2020

Conversation

wwwind
Contributor

@wwwind wwwind commented Nov 25, 2019

We think the reference functions for 16-bit activation are too complex for efficient implementation on resource constrained platforms and propose to replace the functions with a lookup table approach as follows:

First, rescale the input data to the fixed range of -10.7 to +10.7. Then use a 256-entry lookup table for Sigmoid, followed by linear interpolation, to efficiently derive the result.

The Sigmoid LUT is also used for the TANH function, because tanh(x) = 2*sigmoid(2*x) - 1 and we take the symmetry into account.
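
As an illustration of the approach, here is a minimal float sketch (not the actual fixed-point kernel; the class and constant names are ours, and the rescaling of the int16 input into this range is assumed to happen before the lookup):

```cpp
#include <algorithm>
#include <cmath>

// Illustrative sketch only: a 256-entry sigmoid table over [-10.7, 10.7],
// evaluated with linear interpolation, with tanh derived from the same table.
// The real kernel operates on int16 fixed-point data; this float version just
// shows the table-plus-interpolation structure.
class SigmoidLut {
 public:
  SigmoidLut() {
    for (int i = 0; i < kTableSize; ++i) {
      const float x = kMin + (kMax - kMin) * i / (kTableSize - 1);
      table_[i] = 1.0f / (1.0f + std::exp(-x));
    }
  }

  // sigmoid(x), with x clamped to the table range.
  float Sigmoid(float x) const {
    if (x <= kMin) return table_[0];
    if (x >= kMax) return table_[kTableSize - 1];
    const float pos = (x - kMin) / (kMax - kMin) * (kTableSize - 1);
    const int idx = std::min(static_cast<int>(pos), kTableSize - 2);
    const float frac = pos - idx;
    // Linear interpolation between the two surrounding table entries.
    return table_[idx] + frac * (table_[idx + 1] - table_[idx]);
  }

  // tanh(x) via the identity tanh(x) = 2 * sigmoid(2x) - 1.
  float Tanh(float x) const { return 2.0f * Sigmoid(2.0f * x) - 1.0f; }

 private:
  static constexpr int kTableSize = 256;
  static constexpr float kMin = -10.7f;
  static constexpr float kMax = 10.7f;
  float table_[kTableSize];
};
```

The symmetry mentioned above (sigmoid(-x) = 1 - sigmoid(x), tanh(-x) = -tanh(x)) can additionally be used to store only half of the table; that refinement is omitted from the sketch for brevity.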

The proposed reference kernel implementation also has higher accuracy than the existing one.
With the current functions, we measure a difference of up to 6.3 quantized units for sigmoid and 11.7 for tanh compared to the floating-point reference implementation over the 16-bit input range (representing -8.0 to +8.0). With the implementation in this patch, the error is reduced to less than 1.5 quantized units compared to the floating-point reference for both tanh and sigmoid.

…LUT.

We think the reference functions for 16-bit activation are too complex for
efficient implementation on resource constrained platforms and propose
to replace the functions with a lookup table approach as follows:

First rescale the input data to fixed range of -10.7 to +10.7
Use a 256-entry lookup table for Sigmoid followed by linear interpolation
to efficiently derive the result.

The Sigmoid LUT is also used for the TANH function,
because tanh(x) = 2*sigmoid(2*x) - 1 and we take the symmetry into account.

The proposed reference kernel implementation also has higher accuracy than the existing one.
On the current functions we measure a difference of up to 6.3 for sigmoid and 11.7 for
tanh in quantized units compared to the floating point reference implementation over
the 16-bit input range (representing -8.0 to +8.0). For the implementation of this patch we
see the error reduced to less than 1.5 quantized units compared to floating point
reference for both tanh and sigmoid.

Change-Id: I4d1406928db65740c1750c9cd7bfffab30771419
@tensorflow-bot tensorflow-bot bot added the size:M CL Change Size: Medium label Nov 25, 2019
@gbaned gbaned self-assigned this Nov 26, 2019
@gbaned gbaned added the comp:lite TF Lite related issues label Nov 26, 2019
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Nov 26, 2019
Change-Id: Ia9fa7e70e15a5174a045ee5f98cf4f78e6a43ef6
@gbaned gbaned added the awaiting review Pull request awaiting review label Nov 29, 2019
@suharshs suharshs requested review from jdduke, suharshs and bjacob and removed request for nutsiepully December 9, 2019 19:29
@bjacob
Contributor

bjacob commented Dec 9, 2019

Can you share benchmark results showing how this compares to the current 16-bit fixed-point implementations on various ARM CPUs?

Because pure arithmetic implementations scale linearly with SIMD width and lookup tables do not, we have typically found that, despite looking expensive in source code, pure arithmetic implementations can actually perform well. As ARM CPUs' SIMD width increases, they keep looking better. For example, on a Cortex-A76 CPU (e.g. Pixel 4) we can issue 2 128-bit multiplications each cycle, each doing 8 16-bit fixed-point multiplications, so each fixed-point multiplication effectively costs only one sixteenth of a cycle.

As this varies greatly from one CPU to another, we don't expect one approach (arithmetic vs. LUT) to consistently outperform the other. Instead, we have started to let both coexist in the source code. We still need to pick one to be used by default, but at least having both in the source code makes it easy to switch.

I don't remember the latest status, but some of the following people might: @renjie-liu @abattery @jianlijianli

@renjie-liu
Member

Thanks Elena for the PR. As Benoit pointed out, a LUT is not necessarily desirable from the performance point of view.

Another thing: I am not sure the accuracy measurement is "correct". Do you have any real model to benchmark the accuracy on? As you said, your method uses the range from -10.7 to 10.7, but shouldn't that be based on a real scenario?

Jaesung can probably comment more, but if you want to cover more range for your scenario, could you just add one more bit to the integer part (currently 3 bits, so it represents -8 to 8) for sigmoid?
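
For context, here is a small illustration of that trade-off (an assumption about the fixed-point format, not TFLite code): adding one integer bit doubles the representable range but also doubles the quantization step.

```cpp
#include <cstdio>

// Illustration only: for an int16 value in Q(m).(15-m) fixed point, the
// representable range is [-2^m, 2^m) and the quantization step is 2^m / 2^15.
int main() {
  for (int int_bits = 3; int_bits <= 4; ++int_bits) {
    const double range = 1 << int_bits;     // 8.0 for Q3.12, 16.0 for Q4.11
    const double step = range / (1 << 15);  // the step grows with the range
    std::printf("Q%d.%d: range [-%g, %g), step %g\n", int_bits, 15 - int_bits,
                range, range, step);
  }
  return 0;
}
```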

@abattery
Contributor

Hi, Elena.

Thank you for your contributions :)

Regarding performance, I have seen that LUT-based methods for Tanh/Sigmoid are in practice faster than fixed-point arithmetic based calculations on modern mobile ARM CPUs for the 8-bit integer case. However, it's worth double-checking for the 16-bit integer case. Could you share benchmark results before and after?

It seems that you are trying to increase accuracy by taking the absolute value, since the tanh function is symmetric, and doing a LUT lookup where the table covers a slightly larger range than the fixed-point arithmetic does. At first glance this makes sense to me. However, as Renjie said, the Sigmoid case has the same representable range [-8, 8] as the current fixed-point arithmetic kernels for Sigmoid, so I am not sure this approach improves the Sigmoid case. Could you share your accuracy difference results for both Sigmoid and Tanh with us, and explain how you measured the accuracy?

Best regards,
Jaesung

@abattery abattery self-requested a review December 10, 2019 02:04
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Dec 10, 2019
@jianlijianli
Contributor

Cache aliasing also plays a key role in the performance of a LUT, and that is really hardware dependent. It seems the best option is for them to "coexist", as Benoit mentioned in his comment.

@wwwind
Contributor Author

wwwind commented Dec 18, 2019

Hi guys,

Thank you for the review @bjacob @abattery @jianlijianli @renjie-liu !

I attached the file accuracy_test.cpp that demonstrates how the accuracy was tested:
https://gist.github.com/wwwind/0b7413833cf8a7596bee084403224fa5

The test code reports the range and total sum of the error over the whole signed 16-bit integer range. Below are results of the testing.

For sigmoid:

tensorflow sigmoid implementation error -6.392090 to 5.392578, error sum -32767.042969
proposed sigmoid implementation error -1.015625 to 1.015625, error sum -0.043635

For tanh:

tensorflow tanh implementation error -11.791016 to 11.791016, error sum 0.992188
proposed tanh implementation error -1.476562 to 1.476562, error sum 0.992188
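
For illustration, a minimal sketch of this kind of full-range sweep (not the linked accuracy_test.cpp itself; the stub kernel and the input/output scales below are assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Stand-in for the int16 kernel under test (assumed: input represents [-8, 8),
// output is scaled by 1/32768); replace with the real implementation.
int16_t QuantizedSigmoid(int16_t x) {
  const double y = 1.0 / (1.0 + std::exp(-x * (8.0 / 32768.0)));
  return static_cast<int16_t>(std::lround(std::min(y * 32768.0, 32767.0)));
}

int main() {
  const double in_scale = 8.0 / 32768.0;
  const double out_scale = 1.0 / 32768.0;
  double min_err = 0.0, max_err = 0.0, err_sum = 0.0;
  // Sweep every representable int16 input and compare against the float
  // reference, expressing the error in quantized output units.
  for (int32_t i = -32768; i <= 32767; ++i) {
    const double ref = 1.0 / (1.0 + std::exp(-i * in_scale));
    const double got = QuantizedSigmoid(static_cast<int16_t>(i)) * out_scale;
    const double err = (got - ref) / out_scale;
    min_err = std::min(min_err, err);
    max_err = std::max(max_err, err);
    err_sum += err;
  }
  std::printf("sigmoid error %f to %f, error sum %f\n", min_err, max_err,
              err_sum);
  return 0;
}
```

Summing absolute errors instead of raw errors would be a one-line change here (err_sum += std::fabs(err)).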

Regarding the performance benchmark, we do not have such measurements at hand. When we refer to resource-constrained platforms, apart from the world of CPUs, MCUs and GPUs, we also consider neural network accelerators: dedicated hardware where a small look-up table is likely to be the preferred implementation path. However, even just for CPUs running the non-optimised reference code, we expect the proposed solution to be more efficient than the gemmlowp one, due to the number of operations involved in the iterative approach used by the latter.

In the proposed solution, computing a value takes about 14 operations. The existing implementation is based on the iterative Newton-Raphson division algorithm to find a reciprocal, which takes around 12 operations, plus the computation of the exponent, which needs a Taylor approximation with 4 terms. So it looks a lot heavier, at least for a single result computation.

We agree that, for a vectorised implementation, the performance advantage may or may not shift towards the iterative approach. Therefore, the approach proposed by @bjacob, @abattery and @jianlijianli, where both implementations co-exist, seems a sensible compromise. I would suggest that we keep the more accurate, LUT-based implementation as the default one. Should I proceed to make the corresponding changes in this PR? What would be the best way to implement the switching mechanism between the two implementations? Could you please point to an example in the code?
Thanks!

@renjie-liu
Member

Hi,

thanks a lot for the testing! and thanks a lot for the work!

I wonder, should we sum the absolute values instead of the raw values? Also, the large sigmoid error sum makes me wonder whether it is a direct overflow: -32767.042969 really looks like -32768 plus some number (maybe we can just try with the extreme values?), or maybe every value is just systematically shifted by -0.5?

Also, given that you're verifying the int16 values, judging from the numbers both have good results: about ±12/(1 << 15) for tanh.

I think it's fine to keep both approaches. You can refer here for prepare: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/activations.cc#L395-L422

and here for eval: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/activations.cc#L813-L824

@gbaned
Contributor

gbaned commented Dec 24, 2019

@wwwind Could you please check renjie-liu's comments and keep us posted. Thanks!

@gbaned gbaned added the stat:awaiting response Status - Awaiting response from author label Dec 24, 2019
@wwwind
Contributor Author

wwwind commented Dec 24, 2019

Hi @renjie-liu,

Thanks a lot for your review. I pushed an updated version. Could you please take a look?
Also, I think that this block is no longer relevant for the LUT approach: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/activations.cc#L515 Should it be removed?

Thanks,
Elena

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 24, 2019
@gbaned gbaned requested a review from renjie-liu January 3, 2020 10:16
@gbaned gbaned added the awaiting review Pull request awaiting review label Jan 3, 2020
m.SetInput<int16_t>({
0, -6, 2, 4, //
3, -2, 10, 1, //
Member

why delete this line?

Contributor Author

@renjie-liu this line has been modified by replacing 10 with 8, so that the input number is within the given range of the input tensor, and I added 4 more numbers.

@@ -798,14 +936,12 @@ TfLiteStatus TanhEval(TfLiteContext* context, TfLiteNode* node) {
case kTfLiteInt16: {
TanhParams params;
params.input_left_shift = data->input_left_shift;
if (kernel_type == kReference) {
Member

we intended to keep the reference ones

Contributor

It looks like we did not fully align on this. From the discussion preceding this change, we understood that we would keep the new LUT-based implementation as the default one, and assumed that "default" == "kReference".

It also appears that, if GEMMLOWP_NEON is not defined, the optimized_ops::Tanh() called in the kFixedPointOptimize case is the same as the reference_ops::Tanh() that used to be called for kReference.

So we just followed the same code pattern as is used for the Int8 and UInt8 versions.

I see now that, later in this file, the kGenericOptimized kernel_type seems to be registered as the default one for Tanh and Logistic (in Register_TANH()).

Do you think we should change to calling different implementations for different kernel_type? I.e.
kReference -> reference_ops::Tanh()
kFixedPointOptimize -> optimized_ops::Tanh()
kGenericOptimized -> EvalUsingLookupTableTanh16Bit() (default)
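
For illustration, a self-contained sketch of such a dispatch (the enum values and helper names are placeholders mirroring the mapping above, and a single float-based stub stands in for all three kernels so the sketch runs; this is not the merged TFLite code):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

enum KernelType { kReference, kFixedPointOptimized, kGenericOptimized };

namespace {
// Stand-in for a real int16 kernel (assumed: input represents [-8, 8), output
// is scaled by 1/32768); each branch below would call a different real kernel.
void StubTanhInt16(const std::vector<int16_t>& in, std::vector<int16_t>* out) {
  const double in_scale = 8.0 / 32768.0;
  const double out_scale = 1.0 / 32768.0;
  out->resize(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const double y = std::tanh(in[i] * in_scale) / out_scale;
    (*out)[i] = static_cast<int16_t>(
        std::lround(std::max(-32768.0, std::min(y, 32767.0))));
  }
}
}  // namespace

void TanhInt16Eval(KernelType kernel_type, const std::vector<int16_t>& in,
                   std::vector<int16_t>* out) {
  switch (kernel_type) {
    case kReference:
      StubTanhInt16(in, out);  // would call reference_ops::Tanh()
      break;
    case kFixedPointOptimized:
      StubTanhInt16(in, out);  // would call optimized_ops::Tanh()
      break;
    case kGenericOptimized:
    default:
      StubTanhInt16(in, out);  // would call the LUT path (proposed default)
      break;
  }
}

int main() {
  const std::vector<int16_t> in = {-32768, -4096, 0, 4096, 32767};
  std::vector<int16_t> out;
  TanhInt16Eval(kGenericOptimized, in, &out);
  for (std::size_t i = 0; i < out.size(); ++i) {
    std::printf("%d -> %d\n", in[i], out[i]);
  }
  return 0;
}
```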

Contributor Author

@renjie-liu Could you please clarify how the current implementation and the suggested LUT-based one should co-exist? Should we register the suggested one as the reference, because it is more accurate? The current one looks like a good fit for the kFixedPointOptimize kernel type. Which one should be used when the kernel type is not specified?
Thanks!

@tensorflow-bot tensorflow-bot bot removed the ready to pull PR ready for merge process label Mar 18, 2020
@wwwind
Contributor Author

wwwind commented Mar 18, 2020

Hi @jianlijianli, could you please re-approve this PR? I had to push a fix for the warning.
Thanks!

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Mar 21, 2020
@gbaned gbaned added the awaiting review Pull request awaiting review label Mar 24, 2020
jianlijianli
jianlijianli previously approved these changes Mar 24, 2020
Contributor

@jianlijianli jianlijianli left a comment

Thanks for fixing the warning.

@tensorflow-bot tensorflow-bot bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 24, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 24, 2020
@gbaned gbaned added kokoro:force-run Tests on submitted change and removed awaiting review Pull request awaiting review labels Mar 25, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 25, 2020
@tensorflow-bot tensorflow-bot bot removed the ready to pull PR ready for merge process label Mar 25, 2020
@wwwind
Contributor Author

wwwind commented Mar 25, 2020

Hi @gbaned, @liufengdb! I pushed a small fix for the error reported by buildifier.
Could you please re-approve this PR?
Thanks!

@gbaned gbaned added the awaiting review Pull request awaiting review label Mar 27, 2020
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 27, 2020
@tensorflow-bot tensorflow-bot bot added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 27, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 27, 2020
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Mar 29, 2020
@wwwind
Contributor Author

wwwind commented Apr 6, 2020

Hi @rthadur! This PR was approved 10 days ago and all checks are green.
Is it stuck internally on some errors? Could you please take a look?
Thanks!

@rthadur
Contributor

rthadur commented Apr 6, 2020

@wwwind thanks for your patience. @jianlijianli is working on this internally; we will get back to you ASAP.

@google-ml-butler google-ml-butler bot added the kokoro:force-run Tests on submitted change label Apr 7, 2020
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Apr 7, 2020
@tensorflow-copybara tensorflow-copybara merged commit c57bb8b into tensorflow:master Apr 7, 2020