Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use xa_nnlib for svdf for Fusion F1. #47098

Merged
merged 1 commit into from
Feb 16, 2021

Conversation

advaitjain
Copy link
Member

@advaitjain advaitjain commented Feb 12, 2021

The code in this change is the subset of functionality needed for int8 svdf for Hifi4 copied from https://github.com/pnikam-cad/tensorflow/blob/a737c1e3945bc70022259479ad24133a343ec906/tensorflow/lite/micro/kernels/xtensa_hifi/svdf.cc

Note that the current change has not pulled in either the floating point implementation or the Hifi5 implementation.

Profiled the keryword_benchmark with the following command:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade run_keyword_benchmark -j8

gives a latency of 38516 ticks with this change vs 152642 ticks without this change.

Per OP latency with this change:

KeywordRunNIerations(1) took 38516 ticks (38 ms)
QUANTIZE took 3758 ticks (3 ms).
SVDF took 4753 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 4211 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 3145 ticks (3 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 4211 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 2890 ticks (2 ms).
SVDF took 3583 ticks (3 ms).
SVDF took 3054 ticks (3 ms).
FULLY_CONNECTED took 1091 ticks (1 ms).
SOFTMAX took 2042 ticks (2 ms).
QUANTIZE took 366 ticks (0 ms).

Without this change:

KeywordRunNIerations(1) took 152642 ticks (152 ms)
QUANTIZE took 3758 ticks (3 ms).
SVDF took 38003 ticks (38 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 13907 ticks (13 ms).
SVDF took 15827 ticks (15 ms).
SVDF took 15827 ticks (15 ms).
FULLY_CONNECTED took 1091 ticks (1 ms).
SOFTMAX took 2042 ticks (2 ms).
QUANTIZE took 366 ticks (0 ms).

Also confirmed that the kernel_svdf_test passes with:

make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_svdf_test -j8

Progress towards http://b/177457688

@google-ml-butler google-ml-butler bot added the size:M CL Change Size: Medium label Feb 12, 2021
@google-ml-butler
Copy link

Thanks for contributing to TensorFlow Lite Micro.

To keep this process moving along, we'd like to make sure that you have completed the items on this list:

We would like to have a discussion on the Github issue first to determine the best path forward, and then proceed to the PR review.

@google-cla google-cla bot added the cla: yes label Feb 12, 2021
@advaitjain advaitjain added the comp:micro Related to TensorFlow Lite Microcontrollers label Feb 12, 2021
@advaitjain
Copy link
Member Author

tagging @pnikam-cad @nyadla-sys @bhanuprakashbv

@advaitjain advaitjain added the ready to pull PR ready for merge process label Feb 12, 2021
advaitjain added a commit to advaitjain/tensorflow that referenced this pull request Feb 12, 2021
Confirmed that the test passes with:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_softmax_test -j8
```

However, the latency improvement is only ~1000 ticks, as tested with:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_keyword_benchmark -j8
```

Since Softmax is currently a small fraction of the overall keyword_benchmark latency we will focus on the latency of only this particular OP.

With the optimized implementation:
```
SOFTMAX took 749 ticks (0 ms).
```

Reference implementation:
```
SOFTMAX took 2052 ticks (2 ms).
```

And with the LUT hifimini implementation (for completeness):
```
SOFTMAX took 1142 ticks (1 ms).
```

The gain of ~1500 ticks ticks is still worth merging because after all the optimizations (e.g.  tensorflow#47098), this will still mean a ~5% improvement for the keyword benchmark.

And the benefits might be more significant for other models too.
@gbaned gbaned self-assigned this Feb 12, 2021
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation Feb 12, 2021
Copy link
Contributor

@njeffrie njeffrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool, great speed-ups! Only minor comment nits.

(data.effective_scale_1_a), data.effective_scale_1_b),
0);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer part of the feature matmul... perhaps leave a comment here to mark the start of the time weights dot product + activation stage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@google-ml-butler google-ml-butler bot added the kokoro:force-run Tests on submitted change label Feb 12, 2021
PR Queue automation moved this from Assigned Reviewer to Approved by Reviewer Feb 12, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Feb 12, 2021
@gbaned gbaned removed the ready to pull PR ready for merge process label Feb 12, 2021
The code in this change is the subset of functionality needed for int8
svdf for Hifi4 copied from https://github.com/pnikam-cad/tensorflow/blob/a737c1e3945bc70022259479ad24133a343ec906/tensorflow/lite/micro/kernels/xtensa_hifi/svdf.cc

Note that the current change has not pulled in either the floating point
implementation or the Hifi5 implementation.

Profiled the keryword_benchmark with the following command:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade run_keyword_benchmark -j8
```

gives a latency of 38516 ticks with this change vs 152642 ticks without this change.

Per OP latency with this change:
```
KeywordRunNIerations(1) took 38516 ticks (38 ms)
QUANTIZE took 3758 ticks (3 ms).
SVDF took 4753 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 4211 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 3145 ticks (3 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 4211 ticks (4 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 2890 ticks (2 ms).
SVDF took 3583 ticks (3 ms).
SVDF took 3054 ticks (3 ms).
FULLY_CONNECTED took 1091 ticks (1 ms).
SOFTMAX took 2042 ticks (2 ms).
QUANTIZE took 366 ticks (0 ms).
```

Without this change:
```
KeywordRunNIerations(1) took 152642 ticks (152 ms)
QUANTIZE took 3758 ticks (3 ms).
SVDF took 38003 ticks (38 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 18803 ticks (18 ms).
FULLY_CONNECTED took 1353 ticks (1 ms).
SVDF took 13907 ticks (13 ms).
SVDF took 15827 ticks (15 ms).
SVDF took 15827 ticks (15 ms).
FULLY_CONNECTED took 1091 ticks (1 ms).
SOFTMAX took 2042 ticks (2 ms).
QUANTIZE took 366 ticks (0 ms).
```

Also confirmed that the kernel_svdf_test passes with:
```
make -f tensorflow/lite/micro/tools/make/Makefile TARGET=xtensa OPTIMIZED_KERNEL_DIR=xtensa TARGET_ARCH=fusion_f1 XTENSA_CORE=F1_190305_swupgrade test_kernel_svdf_test -j8
```
Copy link
Member Author

@advaitjain advaitjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a comment, as request.

@advaitjain advaitjain added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Feb 12, 2021
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Feb 12, 2021
@advaitjain advaitjain added ready to pull PR ready for merge process and removed ready to pull PR ready for merge process labels Feb 12, 2021
@copybara-service copybara-service bot merged commit c8e9451 into tensorflow:master Feb 16, 2021
PR Queue automation moved this from Approved by Reviewer to Merged Feb 16, 2021
@advaitjain advaitjain deleted the fusion-f1-svdf branch February 16, 2021 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes comp:micro Related to TensorFlow Lite Microcontrollers ready to pull PR ready for merge process size:M CL Change Size: Medium
Projects
PR Queue
  
Merged
Development

Successfully merging this pull request may close these issues.

None yet

4 participants