TFLM Added optimized fully connected (float32 and int8) for CEVA-BX1 #45606
Conversation
> Thanks for contributing to TensorFlow Lite Micro. To keep this process moving along, we'd like to make sure that you have completed the items on this list:
> We would like to have a discussion on the Github issue first to determine the best path forward, and then proceed to the PR review.
>
> Github issue: #45607
High level comments on the first pass. I will likely have some more detailed feedback once we address the comments from this round of review.
Additional things to note:
- You can apply the kokoro:force-run label to trigger the CI. Look for the TF Micro CI, as that will show any errors that would prevent merging. In the case of this PR, that would be missing licenses, incorrect formatting, etc.
- My preference is to have a detailed PR description that shows the cycle improvement from this optimized implementation (on, say, the keyword_benchmark), as well as exact commands to reproduce (even if the toolchain is proprietary). This makes looking back at the history much more meaningful. See #46411 as an example; in that PR I do not have a before/after since I already have continuous builds set up. For the current PR, I would recommend detailing the cycles for both the reference and optimized implementations.
- We can still break this PR into smaller chunks for some of the headers, and that would be preferable. We will be refactoring some of the reference kernels as we go along.
```
To your installation location. For example:
TARGET_TOOLCHAIN_ROOT := /home/myuser/work/CEVA-ToolBox/V18/BX
4. Generate the Makefile for the project:
/tensorflow$ make -f tensorflow/lite/micro/tools/make/Makefile TARGET=ceva TARGET_ARCH=bx1
```
I'd prefer going with TARGET=ceva and TARGET_ARCH=bx1 and having a single kernels/ceva, tools/make/ceva_makefile.inc, tools/make/ext_libs/ceva.inc.
And all the logic to support the different architectures be contained within these directories.
```cpp
struct OpData {
  // The scaling factor from input to output (aka the 'real multiplier') can
  // be represented as a fixed point multiplier plus a left shift.
  int32_t output_multiplier;
  int output_shift;
  // The range of the fused activation layer. For example for kNone and
  // uint8_t these would be 0 and 255.
  int32_t output_activation_min;
  int32_t output_activation_max;
  // The index of the temporary tensor where the quantized inputs are cached.
  int input_quantized_index;
  // Cached zero point values of tensors.
  int32_t input_zero_point;
  int32_t filter_zero_point;
  int32_t output_zero_point;
};
```
Let's use the shared struct (tensorflow/lite/micro/kernels/fully_connected.h, lines 26 to 41 at 9d66812):
```cpp
struct OpDataFullyConnected {
  // The scaling factor from input to output (aka the 'real multiplier') can
  // be represented as a fixed point multiplier plus a left shift.
  int32_t output_multiplier;
  int output_shift;
  // The range of the fused activation layer. For example for kNone and
  // uint8_t these would be 0 and 255.
  int32_t output_activation_min;
  int32_t output_activation_max;
  // The index of the temporary tensor where the quantized inputs are cached.
  int input_quantized_index;
  // Cached zero point values of tensors.
  int32_t input_zero_point;
  int32_t filter_zero_point;
  int32_t output_zero_point;
};
```
and also shared code (as much as possible):
https://github.com/tensorflow/tensorflow/blob/9d6681202e062fbf6df8bcb39d70b129333f590c/tensorflow/lite/micro/kernels/fully_connected_common.cc
See #46242 for an example of the direction that we would like to move towards in terms of sharing code between the reference and optimized implementations.
```cpp
constexpr int kInputTensor = 0;
constexpr int kWeightsTensor = 1;
constexpr int kBiasTensor = 2;
constexpr int kOutputTensor = 0;

TfLiteStatus CalculateOpData(TfLiteContext* context,
                             TfLiteFusedActivation activation,
                             TfLiteType data_type, const TfLiteTensor* input,
                             const TfLiteTensor* filter,
                             const TfLiteTensor* bias, TfLiteTensor* output,
                             OpData* data) {
  TfLiteStatus status = kTfLiteOk;
  if (data_type != kTfLiteFloat32) {
    double real_multiplier = 0.0;
    TF_LITE_ENSURE_STATUS(GetQuantizedConvolutionMultipler(
        context, input, filter, bias, output, &real_multiplier));
    int exponent;
    QuantizeMultiplier(real_multiplier, &data->output_multiplier, &exponent);
    data->output_shift = -exponent;
    TF_LITE_ENSURE_STATUS(CalculateActivationRangeQuantized(
        context, activation, output, &data->output_activation_min,
        &data->output_activation_max));

    data->input_zero_point = input->params.zero_point;
    data->filter_zero_point = filter->params.zero_point;
    data->output_zero_point = output->params.zero_point;
  }
  return status;
}
```
These can all be shared from the refactored code.
```cpp
tflite::FullyConnectedParams op_params;
op_params.input_offset = -data.input_zero_point;
op_params.weights_offset = -data.filter_zero_point;
op_params.output_offset = data.output_zero_point;
op_params.output_multiplier = data.output_multiplier;
// TODO(b/138810107): Figure out whether output shift should be inverted
op_params.output_shift = -data.output_shift;
op_params.quantized_activation_min = data.output_activation_min;
op_params.quantized_activation_max = data.output_activation_max;
```
There is now a function that does this conversion, which reduces this boilerplate code to a single line.
```cpp
void* params = (void*)&op_params;
int8_t* inputp = (int8_t*)tflite::micro::GetTensorData<int8_t>(input);
int8_t* filterp = (int8_t*)tflite::micro::GetTensorData<int8_t>(filter);
int32_t* biasp = (int32_t*)tflite::micro::GetTensorData<int32_t>(bias);
int8_t* outputp = (int8_t*)tflite::micro::GetTensorData<int8_t>(output);
```
Prefer C++-style casts, and static_cast whenever possible.
```make
ifeq ($(TARGET), ceva)
ifeq ($(TARGET_ARCH), bx1)
#TARGET_ARCH :=
ifeq ($(TARGET), ceva_bx1)
```
Was this intentional, going to TARGET=ceva_bx1?
```make
TARGET_ARCH := bx
CXXFLAGS := -std=c++11 -DTF_LITE_STATIC_MEMORY
CCFLAGS := -std=c11 -DTF_LITE_STATIC_MEMORY
CCFLAGS := -std=c11 -DTF_LITE_STATIC_MEMORY -I./tensorflow/lite/micro/kernels/ceva_bx1
```
Preference would be to use full include paths, unless there is a strong reason why that would not work.
```
@@ -0,0 +1,611 @@
/*************************************************************************************\
* Copyright (C) CEVA(R) Inc. All rights reserved *
```
We can only accept code with a single copyright for the TensorFlow authors plus the Apache license.
```
@@ -0,0 +1,571 @@
```
Where is the implementation of this header? Preference would be to break this up into exactly what is needed for a particular PR, since that would make reviewing the code feasible.
```
@@ -0,0 +1,348 @@
/* Copyright 2017 The TensorFlow Authors. All Rights Reserved.
```
2021 (the copyright year should be updated).
Also, I have created a new label (comp:micro:ceva). If you could apply that to all your PRs and GitHub issues, that would make tracking this work easier. Please also link the issue from your PR. You should have the permissions needed to apply labels. You can add me directly as a reviewer and assign yourself to the GitHub issues.
I may have reviewed this PR before it was ready -- sorry about that. If possible, please explicitly add a comment letting me know that the PR is ready for review (similar to #46411 (comment)).
This is my fault, I should have closed this one. I created a new PR after our chat, incorporating code sharing with the reference kernel; see #46500, and also #46226 for Quantize.
Sounds good, I'll go ahead and close this PR then. Let me know when the others are ready for review.