Documentation for how to add new optimized kernels to TFLM. #47227

Merged
merged 1 commit on Feb 18, 2021
1 change: 1 addition & 0 deletions tensorflow/lite/micro/README.md
@@ -65,6 +65,7 @@ project, we have additional documentation in the [docs](docs/) folder.
* [Benchmarks](benchmarks/README.md)
* [Profiling](docs/profiling.md)
* [Memory Management](docs/memory_management.md)
* [Optimized Kernel Implementations](docs/optimized_kernel_implementations.md)
* [New Platform Support](docs/new_platform_support.md)
* [Software Emulation with Renode](docs/renode.md)
* [Pre-allocated tensors](docs/preallocated_tensors.md)
191 changes: 191 additions & 0 deletions tensorflow/lite/micro/docs/optimized_kernel_implementations.md
@@ -0,0 +1,191 @@
<!-- mdformat off(b/169948621#comment2) -->

<!--
Semi-automated TOC generation with instructions from
https://github.com/ekalinin/github-markdown-toc#auto-insert-and-update-toc
-->

<!--ts-->
* [Summary](#summary)
* [High-Level Steps](#high-level-steps)
   * [Why Not Optimize the Portable Reference Kernels?](#why-not-optimize-the-portable-reference-kernels)
* [Software Architecture](#software-architecture)
   * [Hardware-specific NN library](#hardware-specific-nn-library)
   * [Optimized Kernels](#optimized-kernels)
   * [Build System Integration](#build-system-integration)
   * [Testing and Continuous Integration](#testing-and-continuous-integration)

<!-- Added by: advaitjain, at: Wed 17 Feb 2021 02:14:16 PM PST -->

<!--te-->

# Summary

This guide describes the recommended high-level architecture and steps to add
hardware-specific optimized kernels to TfLite Micro.

The goal of these optimizations, and of the process that we recommend for
getting them merged into the TfLite Micro codebase, is a measurable and
documented performance improvement on a benchmark of interest.

Once the optimizations are merged, they will indeed be used for more than the
benchmark, but the context for why the optimizations were added is still very
important.


# High-Level Steps

1. Pick a benchmark on which you would like to measure performance.
   * Existing benchmarks are in the [benchmarks directory](../benchmarks).
   * If none of the existing benchmarks capture your use case, then please create
     a GitHub issue or start a thread on micro@tensorflow.org to figure out how to
     add a new benchmark.
   * If adding a publicly-available benchmark to the TFLM codebase is determined
     to be infeasible, then a fallback would be to have an internal benchmark
     that can be used to document the benefits of adding the optimizations via
     PR descriptions.
   * Adding optimized code without any associated benchmarks will need very
     strong justification and will most likely not be permitted.

1. Put in place the groundwork and architecture needed to add optimizations
   for your target (more details in the [Software Architecture](#software-architecture)
   section).

1. Create one pull request for each optimized kernel, with the PR description
   clearly stating the commands that were used to measure the performance
   improvement.

   * This context is important even if the toolchain is proprietary and there is
     currently only a small number of users.
   * See [this PR](https://github.com/tensorflow/tensorflow/pull/47098) as an example.
   * At a minimum, the latency with and without the particular optimized kernel
     should be documented (one way to measure this is sketched after this list).
     [Additional context](https://github.com/tensorflow/tensorflow/pull/46746)
     may also be desirable.
   * Here is some [general guidance](https://testing.googleblog.com/2017/09/code-health-providing-context-with.html)
     on writing [good PR descriptions](https://google.github.io/eng-practices/review/developer/cl-descriptions.html).
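
As a rough sketch of how such a latency number might be obtained, the snippet
below times repeated `Invoke()` calls on a benchmark model. It assumes the
TFLM APIs of this era (`MicroInterpreter`, `AllOpsResolver`); `g_model_data`
and `GetCurrentCycles()` are placeholders for your model flatbuffer and a
platform-specific cycle counter. The harnesses in the
[benchmarks directory](../benchmarks) are the authoritative reference.

```c++
#include <cstdint>

#include "tensorflow/lite/micro/all_ops_resolver.h"
#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model_data[];  // Placeholder: model flatbuffer.
int64_t GetCurrentCycles();  // Placeholder: platform-specific cycle counter.

constexpr int kTensorArenaSize = 100 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

// Returns the average number of cycles for a single inference.
int64_t BenchmarkInvoke() {
  tflite::MicroErrorReporter error_reporter;
  const tflite::Model* model = tflite::GetModel(g_model_data);
  tflite::AllOpsResolver resolver;
  tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                       kTensorArenaSize, &error_reporter);
  interpreter.AllocateTensors();

  constexpr int kNumRuns = 10;
  const int64_t start = GetCurrentCycles();
  for (int i = 0; i < kNumRuns; ++i) {
    interpreter.Invoke();
  }
  return (GetCurrentCycles() - start) / kNumRuns;
}
```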

## Why Not Optimize the Portable Reference Kernels?

We would like to explicitly point out (as have others) that the reference
kernel implementations are not performant and that there are plenty of
opportunities to speed them up. This is by design: the reference kernels are
meant to be a shared starting point that is then optimized in target-specific
kernel implementations.

Two previous discussions on this topic can be found in
[PR #42477](https://github.com/tensorflow/tensorflow/pull/42477) and
[PR #45227](https://github.com/tensorflow/tensorflow/pull/45227).

Our current point of view on this topic is that while optimizing shared
reference code in a portable manner is attractive, we are making an explicit
choice to not go down that path and instead rely on target-specific optimized
implementations. The TFLM codebase has a growing list of optimized kernel
implementations, and we are investing in making the process of adding new
implementations smoother.

# Software Architecture

The optimized kernel architecture is composed of the following three modules:

1. Hardware-specific NN library
1. Optimized Kernels
1. Build System Integration

## Hardware-specific NN library

This library uses knowledge of the hardware and compiler to implement the
underlying operations. Examples of this are [CMSIS-NN](https://github.com/ARM-software/CMSIS_5/tree/develop/CMSIS/NN)
from ARM and [NNLib](https://github.com/foss-xtensa/nnlib-hifi4) from Cadence.
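
To illustrate the general shape of such a library's API, here is a purely
hypothetical declaration; the names and parameters are illustrative and do not
correspond to the actual CMSIS-NN or NNLib interfaces.

```c++
#include <cstdint>

// Hypothetical hardware-specific NN library entry point (illustrative only).
// Routines like this typically take raw tensor pointers plus quantization
// parameters, and implement the inner loops with vendor intrinsics or DSP
// instructions.
struct mylib_fc_params {
  int32_t input_offset;       // Zero point of the quantized input.
  int32_t output_offset;      // Zero point of the quantized output.
  int32_t output_multiplier;  // Fixed-point requantization multiplier.
  int32_t output_shift;       // Requantization shift.
};

int mylib_fully_connected_s8(const int8_t* input, const int8_t* weights,
                             const int32_t* bias, int8_t* output, int batches,
                             int input_depth, int output_depth,
                             const mylib_fc_params* params);
```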

The benefits of having this API separation are:

1. The NN library does not need to follow the style guide of the rest of the
   TFLM code.
1. Releases of the NN library can be made independently of TFLM.
1. The same NN library can be used and tested independently of TFLM.
1. The maintainers of the NN library have full control over the development
   process that they would like to follow.

## Optimized Kernels

These will be (hopefully thin) wrappers that act as the glue between TFLM and
the NN library.

The goal here is to delegate as much work as possible to the NN library while
still allowing the two APIs (TFLM and NN library) to be independent of each
other. If there is a performance penalty due to this (for example, from
unnecessary memory copies), then we can evaluate it on a case-by-case basis.

This code will be reviewed and merged in the TFLM github repository and must
follow the development style of the TFLM codebase.

Some amount of refactoring of the existing code may be needed to ensure that
code is suitably shared between the reference and optimized kernels. There is
currently no fixed recipe for this refactoring; we will evaluate it on a
case-by-case basis during the PR review.

For example, to add an optimized implementation of `fully_connected` for the
Xtensa Fusion F1, the steps were:
* [PR 1](https://github.com/tensorflow/tensorflow/pull/45464): refactor for
reference fallbacks and a baseline latency.
* [PR 2](https://github.com/tensorflow/tensorflow/pull/46242): refactor to
share code between reference and optimized kernels.
* [PR 3](https://github.com/tensorflow/tensorflow/pull/46411): add the code
needed to use the optimized NN lib and document the latency improvement.
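
To make the wrapper structure described above concrete, here is a simplified
sketch of what an optimized kernel's `Eval` function might look like. The
`TfLiteRegistration`/`Eval` layout mirrors the existing kernels of this era,
but `mylib_fully_connected_s8` is a placeholder for a vendor NN-library call
and `Init`/`Prepare` are elided; the checked-in implementations (e.g.
[kernels/cmsis_nn](../kernels/cmsis_nn)) are the authoritative examples.

```c++
#include "tensorflow/lite/c/common.h"
#include "tensorflow/lite/micro/kernels/kernel_util.h"

// Placeholder for a call into the hardware-specific NN library.
extern "C" void mylib_fully_connected_s8(const int8_t* input,
                                         const int8_t* weights,
                                         const int32_t* bias, int8_t* output);

namespace tflite {
namespace {

TfLiteStatus Eval(TfLiteContext* context, TfLiteNode* node) {
  const TfLiteEvalTensor* input = tflite::micro::GetEvalInput(context, node, 0);
  const TfLiteEvalTensor* weights =
      tflite::micro::GetEvalInput(context, node, 1);
  const TfLiteEvalTensor* bias = tflite::micro::GetEvalInput(context, node, 2);
  TfLiteEvalTensor* output = tflite::micro::GetEvalOutput(context, node, 0);

  if (input->type == kTfLiteInt8) {
    // Delegate the inner loops to the hardware-specific NN library. A real
    // kernel also passes shapes and quantization parameters computed in
    // Prepare().
    mylib_fully_connected_s8(tflite::micro::GetTensorData<int8_t>(input),
                             tflite::micro::GetTensorData<int8_t>(weights),
                             tflite::micro::GetTensorData<int32_t>(bias),
                             tflite::micro::GetTensorData<int8_t>(output));
  } else {
    // Fall back to the shared reference implementation for other types
    // (elided here; see the reference kernel).
  }
  return kTfLiteOk;
}

}  // namespace

// The registration has the same shape as the reference kernel's, so the
// optimized version is a drop-in replacement selected by the build system.
// Init/Prepare are omitted in this sketch.
TfLiteRegistration Register_FULLY_CONNECTED() {
  return {/*init=*/nullptr,
          /*free=*/nullptr,
          /*prepare=*/nullptr,
          /*invoke=*/Eval,
          /*profiling_string=*/nullptr,
          /*builtin_code=*/0,
          /*custom_name=*/nullptr,
          /*version=*/0};
}

}  // namespace tflite
```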

## Build System Integration

This module is the least well-defined, but we strongly recommend the following:

1. A single target makefile.inc for all the architectures that you would like
   to support, along with an optional target-specific [system_setup.cc](../arduino/system_setup.cc).
   See [cortex_m_generic_makefile.inc](../tools/make/targets/cortex_m_generic_makefile.inc)
   and [xtensa_makefile.inc](../tools/make/targets/xtensa_makefile.inc) as
   examples.

1. A single `ext_libs.inc` (and associated scripts) that downloads any external
   dependencies (including the NN library). For example:
   * [cmsis_nn.inc](../tools/make/ext_libs/cmsis_nn.inc) and
     [cmsis_download.sh](../tools/make/ext_libs/cmsis_download.sh)
   * [xtensa.inc](../tools/make/ext_libs/xtensa.inc) and
     [xtensa_download.sh](../tools/make/ext_libs/xtensa_download.sh)

1. The optimized kernels will then live in a kernels subdirectory (e.g.
   [kernels/cmsis_nn](../kernels/cmsis_nn) and
   [kernels/xtensa](../kernels/xtensa)).

There are two development workflows that the TFLM team would like to encourage
and support:

1. Export the static library and headers into a target-specific development
   environment.
   * Build a static libtensorflow-microlite.a using the TFLM makefile with:
     ```
     make -f tensorflow/lite/micro/tools/make/Makefile TARGET=<target> OPTIMIZED_KERNEL_DIR=<optimize_dir> microlite
     ```
   * Use the static library and any TFLM headers as part of the overall
     application (with its own build system).

1. Integrate TFLM with an IDE:
   * This has historically been done using the TFLM Makefile’s support for
     project generation.

   * However, given the learning curve and high maintenance overhead, we are
     moving away from supporting project generation via the Makefile and are
     encouraging future IDE integrations to be done outside of the TFLM Makefiles.

   * The TFLM team is currently working through the details on this topic.

## Testing and Continuous Integration

The kernel tests are the primary method of ensuring that the optimized kernel
implementations are accurate.

Currently, most of the tests require the optimizations to be bit-exact to the
quantized reference implementation. We can revisit this requirement if it ends
up imposing a high latency cost.
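
As a rough illustration of what bit-exactness means in practice, a kernel test
typically compares every quantized output element against golden values
produced by the reference implementation. The sketch below uses the
`micro_test` macros; `RunFullyConnected` and the values shown are illustrative
placeholders, and the checked-in `*_test.cc` files under
[kernels](../kernels) are the authoritative examples.

```c++
#include <cstdint>

#include "tensorflow/lite/micro/testing/micro_test.h"

// Illustrative placeholder: runs the kernel under test on `input` and writes
// two int8 values into `output`. Real kernel tests share such helpers between
// the reference and optimized kernel tests.
void RunFullyConnected(const int8_t* input, int8_t* output);

TF_LITE_MICRO_TESTS_BEGIN

TF_LITE_MICRO_TEST(OptimizedFullyConnectedMatchesReferenceBitExactly) {
  const int8_t input[4] = {1, 2, 3, 4};
  // Illustrative golden values; in a real test these come from the quantized
  // reference implementation.
  const int8_t expected[2] = {23, -11};
  int8_t output[2] = {};

  RunFullyConnected(input, output);

  for (int i = 0; i < 2; ++i) {
    // Bit-exact: every output element must match the reference exactly.
    TF_LITE_MICRO_EXPECT_EQ(expected[i], output[i]);
  }
}

TF_LITE_MICRO_TESTS_END
```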

We strongly encourage optimized kernel implementations to have an associated
continuous build that runs through all the unit tests and publishes a build
badge to the [TFLM community supported
builds](../README.md#community-supported-builds) table. Running the unit tests
once a day is often a good place to start.