Use xa_nnlib for svdf for Fusion F1. #47098

advaitjain · 2021-02-12T00:04:07Z

The code in this change is the subset of functionality needed for int8 svdf for Hifi4 copied from https://github.com/pnikam-cad/tensorflow/blob/a737c1e3945bc70022259479ad24133a343ec906/tensorflow/lite/micro/kernels/xtensa_hifi/svdf.cc

Note that the current change has not pulled in either the floating point implementation or the Hifi5 implementation.

Profiled the keryword_benchmark with the following command:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade run_keyword_benchmark -j8

gives a latency of 38516 ticks with this change vs 152642 ticks without this change.

Per OP latency with this change:

KeywordRunNIerations(1) took 38516 ticks (38 ms)
QUANTIZE took 3758 ticks (3 ms).
SVDF took 4753 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 4211 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 3145 ticks (3 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 4211 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 2890 ticks (2 ms).
SVDF took 3583 ticks (3 ms).
SVDF took 3054 ticks (3 ms).
FULLY_CONNECTED took 1091 ticks (1 ms).
SOFTMAX took 2042 ticks (2 ms).
QUANTIZE took 366 ticks (0 ms).

Without this change:

KeywordRunNIerations(1) took 152642 ticks (152 ms)
QUANTIZE took 3758 ticks (3 ms).
SVDF took 38003 ticks (38 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 13907 ticks (13 ms).
SVDF took 15827 ticks (15 ms).
SVDF took 15827 ticks (15 ms).
FULLY_CONNECTED took 1091 ticks (1 ms).
SOFTMAX took 2042 ticks (2 ms).
QUANTIZE took 366 ticks (0 ms).

Also confirmed that the kernel_svdf_test passes with:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_svdf_test -j8

Progress towards http://b/177457688

google-ml-butler · 2021-02-12T00:04:10Z

Thanks for contributing to TensorFlow Lite Micro.

To keep this process moving along, we'd like to make sure that you have completed the items on this list:

Read the contributing guidelines for TensorFlow Lite Micro
Created a TF Lite Micro Github issue
Linked to the issue from the PR description

We would like to have a discussion on the Github issue first to determine the best path forward, and then proceed to the PR review.

advaitjain · 2021-02-12T00:04:52Z

tagging @pnikam-cad @nyadla-sys @bhanuprakashbv

Confirmed that the test passes with: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_softmax_test -j8 ``` However, the latency improvement is only ~1000 ticks, as tested with: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_keyword_benchmark -j8 ``` Since Softmax is currently a small fraction of the overall keyword_benchmark latency we will focus on the latency of only this particular OP. With the optimized implementation: ``` SOFTMAX took 749 ticks (0 ms). ``` Reference implementation: ``` SOFTMAX took 2052 ticks (2 ms). ``` And with the LUT hifimini implementation (for completeness): ``` SOFTMAX took 1142 ticks (1 ms). ``` The gain of ~1500 ticks ticks is still worth merging because after all the optimizations (e.g. tensorflow#47098), this will still mean a ~5% improvement for the keyword benchmark. And the benefits might be more significant for other models too.

njeffrie

Very cool, great speed-ups! Only minor comment nits.

njeffrie · 2021-02-12T10:58:17Z

tensorflow/lite/micro/kernels/xtensa/svdf.cc

+                          (data.effective_scale_1_a), data.effective_scale_1_b),
+                      0);
+  }
+


This is no longer part of the feature matmul... perhaps leave a comment here to mark the start of the time weights dot product + activation stage.

The code in this change is the subset of functionality needed for int8 svdf for Hifi4 copied from https://github.com/pnikam-cad/tensorflow/blob/a737c1e3945bc70022259479ad24133a343ec906/tensorflow/lite/micro/kernels/xtensa_hifi/svdf.cc Note that the current change has not pulled in either the floating point implementation or the Hifi5 implementation. Profiled the keryword_benchmark with the following command: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade run_keyword_benchmark -j8 ``` gives a latency of 38516 ticks with this change vs 152642 ticks without this change. Per OP latency with this change: ``` KeywordRunNIerations(1) took 38516 ticks (38 ms) QUANTIZE took 3758 ticks (3 ms). SVDF took 4753 ticks (4 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 4211 ticks (4 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 3145 ticks (3 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 4211 ticks (4 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 2890 ticks (2 ms). SVDF took 3583 ticks (3 ms). SVDF took 3054 ticks (3 ms). FULLY_CONNECTED took 1091 ticks (1 ms). SOFTMAX took 2042 ticks (2 ms). QUANTIZE took 366 ticks (0 ms). ``` Without this change: ``` KeywordRunNIerations(1) took 152642 ticks (152 ms) QUANTIZE took 3758 ticks (3 ms). SVDF took 38003 ticks (38 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 18803 ticks (18 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 18803 ticks (18 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 18803 ticks (18 ms). FULLY_CONNECTED took 1353 ticks (1 ms). SVDF took 13907 ticks (13 ms). SVDF took 15827 ticks (15 ms). SVDF took 15827 ticks (15 ms). FULLY_CONNECTED took 1091 ticks (1 ms). SOFTMAX took 2042 ticks (2 ms). QUANTIZE took 366 ticks (0 ms). ``` Also confirmed that the kernel_svdf_test passes with: ``` make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_svdf_test -j8 ```

advaitjain

added a comment, as request.

google-ml-butler bot added the size:M CL Change Size: Medium label Feb 12, 2021

google-cla bot added the cla: yes label Feb 12, 2021

advaitjain added the comp:micro Related to TensorFlow Lite Microcontrollers label Feb 12, 2021

advaitjain requested a review from njeffrie February 12, 2021 00:04

advaitjain added the ready to pull PR ready for merge process label Feb 12, 2021

advaitjain mentioned this pull request Feb 12, 2021

Add in optimizations for softmax for Fusion F1. #46746

Merged

gbaned self-assigned this Feb 12, 2021

gbaned added this to Assigned Reviewer in PR Queue via automation Feb 12, 2021

njeffrie approved these changes Feb 12, 2021

View reviewed changes

google-ml-butler bot added the kokoro:force-run Tests on submitted change label Feb 12, 2021

PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Feb 12, 2021

kokoro-team removed the kokoro:force-run Tests on submitted change label Feb 12, 2021

gbaned removed the ready to pull PR ready for merge process label Feb 12, 2021

advaitjain commented Feb 12, 2021

View reviewed changes

advaitjain force-pushed the fusion-f1-svdf branch from de53e77 to 55633bf Compare February 12, 2021 22:23

advaitjain added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Feb 12, 2021

kokoro-team removed the kokoro:force-run Tests on submitted change label Feb 12, 2021

advaitjain added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Feb 12, 2021

copybara-service bot merged commit c8e9451 into tensorflow:master Feb 16, 2021

PR Queue automation moved this from Approved by Reviewer to Merged Feb 16, 2021

advaitjain deleted the fusion-f1-svdf branch February 16, 2021 23:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use xa_nnlib for svdf for Fusion F1. #47098

Use xa_nnlib for svdf for Fusion F1. #47098

advaitjain commented Feb 12, 2021 •

edited

google-ml-butler bot commented Feb 12, 2021

advaitjain commented Feb 12, 2021

njeffrie left a comment

njeffrie Feb 12, 2021

advaitjain Feb 12, 2021

advaitjain left a comment

Use xa_nnlib for svdf for Fusion F1. #47098

Use xa_nnlib for svdf for Fusion F1. #47098

Conversation

advaitjain commented Feb 12, 2021 • edited

google-ml-butler bot commented Feb 12, 2021

advaitjain commented Feb 12, 2021

njeffrie left a comment

Choose a reason for hiding this comment

njeffrie Feb 12, 2021

Choose a reason for hiding this comment

advaitjain Feb 12, 2021

Choose a reason for hiding this comment

advaitjain left a comment

Choose a reason for hiding this comment

advaitjain commented Feb 12, 2021 •

edited