
CMSIS-NN Conv kernel writes outside of allocated memory when num_channels > 256 #42748

Closed
janjongboom opened this issue Aug 28, 2020 · 12 comments · Fixed by #43486
Labels: comp:micro (Related to TensorFlow Lite Microcontrollers) · type:bug (Bug)

Comments

@janjongboom
Contributor

janjongboom commented Aug 28, 2020

@tensorflow/micro

System information

  • Host OS Platform and Distribution (e.g., Linux Ubuntu 16.04): N/A
  • TensorFlow installed from (source or binary): source
  • Tensorflow version (commit SHA if source): b36436b
  • Target platform (e.g. Arm Mbed OS, Arduino Nano 33 etc.): ST IoT Discovery Kit

Describe the problem

The CMSIS-NN convolutional and depthwise convolutional kernels write outside of allocated memory in CalculateOpData when num_channels is higher than 256 (kMaxChannels). There are no bounds checks in this function, so no error is raised when this happens. The non-CMSIS-NN kernels don't have this problem.
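
For context, the failing pattern looks roughly like this (a paraphrased sketch of the kernel's OpData; exact field names may differ from the repo):

constexpr int kMaxChannels = 256;

struct OpData {
  // Per-channel quantization parameters, sized at compile time.
  // CalculateOpData fills num_channels entries into these arrays without
  // checking against kMaxChannels, so a layer with more than 256 channels
  // writes past the end of the struct.
  int32_t per_channel_output_multiplier[kMaxChannels];
  int32_t per_channel_output_shift[kMaxChannels];
  // ... other fields ...
};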

I can get around this by switching the structure to use dynamic memory:

// Request the per-channel array from the arena instead of relying on the
// fixed-size kMaxChannels member.
void* per_channel_output_multiplier;
TF_LITE_ENSURE_STATUS(context->AllocatePersistentBuffer(
    context, sizeof(int32_t) * num_channels, &per_channel_output_multiplier));
data->per_channel_output_multiplier =
    static_cast<int32_t*>(per_channel_output_multiplier);

But this is a temporary object, so I assume I should use a scratch buffer instead. However, I have no idea how to release the object from the scratch buffer again.
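
For reference, the TFLM scratch-buffer API works roughly as sketched below (the scratch_buffer_idx field on OpData is illustrative, and the three-argument AllocatePersistentBuffer signature above is assumed throughout). Scratch buffers are recycled by the allocator between kernel invocations, so nothing needs to be freed by hand; but that recycling is also why they only suit Eval-only temporaries, not values computed once in Prepare:

// In Prepare: reserve arena space and remember the handle in OpData.
int scratch_buffer_idx;
TF_LITE_ENSURE_STATUS(context->RequestScratchBufferInArena(
    context, sizeof(int32_t) * num_channels, &scratch_buffer_idx));
data->scratch_buffer_idx = scratch_buffer_idx;

// In Eval: resolve the handle to memory valid only for this invocation.
int32_t* scratch = static_cast<int32_t*>(
    context->GetScratchBuffer(context, data->scratch_buffer_idx));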

Please provide the exact sequence of commands/steps when you ran into the problem

The problem shows up with the following .tflite model, which has 1,280 channels in one of its final layers: ei-plants-vs-lamps-transfer-learning-tensorflow-lite-int8-quantized-model.lite.zip

/cc @kwagyeman This could be the same issue you're facing with CMSIS-NN. It shows up on our image models.

@janjongboom janjongboom added the comp:micro Related to TensorFlow Lite Microcontrollers label Aug 28, 2020
@kwagyeman
Contributor

Awesome find, Jan. If this is fixed, then CMSIS-NN will finally work for images.

@janjongboom
Contributor Author

@kwagyeman Seems like it. I've written a patch and now have CMSIS-NN working with MobileNetV2; I'll open a PR after the weekend.

@kwagyeman
Contributor

@janjongboom Are there any other parts of the code where they allocate static buffers like this? It's downright absurd in production code by Arm employees...

@kwagyeman
Contributor

conv.cc and depthwise_conv.cc both have the issue. You need to fix both.

janjongboom added a commit to edgeimpulse/tensorflow that referenced this issue Aug 29, 2020
…namically (fix tensorflow#42748)

The CMSIS-NN convolutional and depthwise convolutional kernels write outside of allocated memory when num_channels is higher than 256. This is because the OpData struct allocates memory statically rather than requesting memory from the arena, like the non-CMSIS-NN kernels do. As there is no boundary checking in this function, this leads to errors with any network that has more channels in a layer (for us this has shown up in e.g. MobileNet v2 0.05). This patch brings the kernel's behavior in line with the normal TFLM conv and depthwise_conv kernels by allocating the memory from the arena.
@Saduf2019 Saduf2019 assigned gbaned and unassigned Saduf2019 Aug 31, 2020
@gbaned gbaned assigned njeffrie and unassigned gbaned Aug 31, 2020
@freddan80
Contributor

Hello @kwagyeman @janjongboom!

I represent Arm. Some background: kMaxChannels is there for historical reasons; that's how the reference kernels used to be implemented as well. Some months back, the AllocatePersistentBuffer API was introduced in favor of the static buffers, and the reference kernels were updated to use it. Our intention is to 'up-merge' changes such as this one to our CMSIS-NN kernels on a regular basis. However, we had to revert that change after discovering an issue with the AllocatePersistentBuffer API; see #41121 for details. We've been working with the TFLu team to resolve it, and it looks like it was fixed this(!) Friday by 59d177d#diff-7aab15c6eb5282359c2b928998bab743.

In the meantime, the quick fix is to bump kMaxChannels to the size it needs to be on your local fork.

I hope this gives some clarity to the issue.

@janjongboom thanks for the PR (#42770)! Let's continue the technical discussion there.

Cheers!
Fredrik

@janjongboom
Contributor Author

@freddan80 Thanks for the update. I ran into the same issue on our 2.3.0 fork, and replaced the scratch buffer with a persistent one to get around that.

In the meantime, the quick fix is to bump kMaxChannels to the size it needs to be on your local fork.

This adds 10K of RAM per convolutional layer, which won't fit on our targets.
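
(With 1,280 channels, the two int32_t per-channel arrays from the OpData sketch above alone come to 1280 × 4 bytes × 2 = 10,240 bytes per kernel instance.)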

janjongboom added a commit to edgeimpulse/tensorflow that referenced this issue Aug 31, 2020
…namically (fix tensorflow#42748)

The CMSIS-NN convolutional and depthwise convolutional kernels write outside of allocated memory when num_channels is higher than 256. This is because the OpData struct allocates memory statically rather than requesting memory from the arena, like the non-CMSIS-NN kernels do. As there is no boundary checking in this function, this leads to errors with any network that has more channels in a layer (for us this has shown up in e.g. MobileNet v2 0.05). This patch brings the kernel's behavior in line with the normal TFLM conv and depthwise_conv kernels by allocating the memory from the arena.

This reverts commit 5986dd0.
@freddan80
Contributor

@janjongboom thanks for the updated PR!

This adds 10K of RAM per convolutional layer, which won't fit on our targets.

Makes sense. Then your PR is the preferred solution.

However, allocating the memory using the AllocatePersistentBuffer API will only move the memory claim from BSS to the memory arena (which may well be located in BSS as well). It actually won't save memory; it just puts it in a different part of the SRAM.

Cheers!

@kwagyeman
Contributor

The tensor arena is dynamically allocated and freed, so the memory can be used for other things until it's needed for TF Lite. Having it be dynamically allocated is a must.

I don't wish to be mean here, but I have been consistently disappointed with Arm's lack of effort in supporting these faster kernels... I would like to ask that future code releases have more effort put into making them useful for a wider audience. Hard-coded static buffers with maximum sizes and no range checks are not production code.

Anyway, glad to see this getting fixed so we can enable it on the OpenMV Cam finally.

@freddan80
Contributor

@kwagyeman I believe there is some misunderstanding here. The TFLu feature that handles interleaved persistent-memory requests and arena scratch-buffer requests has only been possible since the 28th of Aug 2020 (see 59d177d). We've been working with Google on this TFLu feature precisely to improve the usability of CMSIS-NN for a wider audience like you.

Sorry that you feel it wasn't given attention. I hope PR #42770 helps the case now; please reach out via a GitHub ticket, such as this one, for unsupported features or bugs, either in the TFLu or the CMSIS-NN repo. That way the issue is seen by a wider audience and can thereby gain traction and leverage support from the entire community.

Cheers!

@janjongboom
Contributor Author

However, allocating the memory using the AllocatePersistentBuffer API will only move the memory claim from BSS to the memory arena (which may well be located in BSS as well). It actually won't save memory; it just puts it in a different part of the SRAM.

Yep, but I only have one very large convolutional layer and a lot of smaller ones, so allocating it dynamically saves me a lot of memory. But it's great to see PRs landing in mainline for these types of issues.

@freddan80 A question, though (we can take this offline if you prefer): are there currently CI/CD pipelines set up for the CMSIS-NN kernels? E.g. we found this issue and #41816 through our own CI (and a bunch of debugging :-)), and we currently have automated tests for a wide variety of development boards that we run on real hardware for releases, both with the micro allocator and custom allocators, and with and without CMSIS-NN. But we don't follow mainline TensorFlow, so we only find these issues once we upgrade. Perhaps it would be beneficial to share that with Arm?

@freddan80
Contributor

@freddan80 A question, though (we can take this offline if you prefer): are there currently CI/CD pipelines set up for the CMSIS-NN kernels? E.g. we found this issue and #41816 through our own CI (and a bunch of debugging :-)), and we currently have automated tests for a wide variety of development boards that we run on real hardware for releases, both with the micro allocator and custom allocators, and with and without CMSIS-NN. But we don't follow mainline TensorFlow, so we only find these issues once we upgrade. Perhaps it would be beneficial to share that with Arm?

We have our internal CI, which tests TFLu CMSIS-NN use cases on FPGAs and hardware models, with Cortex-M4 up to Cortex-M55 reference systems. We usually find issues related to CMSIS-NN quite soon. Unfortunately the networks we use did not have a reshape before FC as you encountered in #41816, or it would have been caught. Usually, when we find issues such as the one in #41816, we create a kernel unit test case - in fully_connected_test.cc in this case - and then contribute that to avoid the same issue in the future. For the issue reported in this ticket, we've been waiting for 59d177d to make a permanent fix (thanks again for your PR!). We had actually bumped kMaxChannels to 2048 in our internal tests (not being aware of the missing range check), but didn't want to deliver that since it increases the memory footprint unnecessarily for smaller-network use cases.

That's some words about our internal CI. In parallel, we're working with the TFLu team to enable CMSIS-NN regression testing directly in the TensorFlow GitHub project, using Renode as the target emulator (see test_stm32f4_binary.sh), to prevent breaking PRs from being merged. That work is ongoing and may take some time, since there's an issue with running this under Docker. However, I believe CMSIS-NN is tested using at least one of the targets in the internal TFLu tests, so there is some level of protection.

It's awesome to hear that you have CMSIS-NN tests in your CI as well! I'd love to hear more about that. Perhaps some of your work can be upstreamed to the TensorFlow GitHub repo?

Cheers!
Fredrik

@njeffrie
Contributor

njeffrie commented Sep 7, 2020

I can provide some additional context from the TF Micro side:

As you probably already know, the conv and depthwise_conv kernels use per-channel scale and zero-point quantization parameters. From the scale and zero-point values, we calculate a multiplier and shift during the Prepare stage, which involves some computationally expensive double-precision math as well as exponent calculations. Applying the multiplier and shift is a relatively cheap operation that occurs during the Eval stage. The storage required for these multipliers and shifts scales with the number of channels along the quantized dimension (see the TFLite quantization spec for more info).

Since we pre-calculate shifts and multipliers during Prepare, we need to store the resulting values in persistent buffers, which means the memory cannot be re-used between kernel invocations (and the buffers must be allocated via context->AllocatePersistentBuffer). This is the price we pay for avoiding the large runtime overhead of calculating the multiplier and shift on the fly. Before we had a way to allocate persistent buffers in TF Micro, we implemented the conv and depthwise_conv kernels with a fixed-length buffer of 256 values. The issue seen here results from that previous implementation and the fact that not all of our kernels have been updated to use dynamic allocation.
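
A minimal sketch of that Prepare-stage precalculation, assuming an OpData with the two per-channel pointer fields and the three-argument AllocatePersistentBuffer signature used earlier in this thread; the function itself is illustrative, not the kernel's actual code:

#include "tensorflow/lite/kernels/internal/quantization_util.h"

TfLiteStatus PopulatePerChannelParams(TfLiteContext* context, OpData* data,
                                      float input_scale, float output_scale,
                                      const float* filter_scales,
                                      int num_channels) {
  // Persistent buffers: the values are read on every Eval call, so they must
  // outlive Prepare and cannot live in a recyclable scratch buffer.
  void* mult_raw = nullptr;
  void* shift_raw = nullptr;
  TF_LITE_ENSURE_STATUS(context->AllocatePersistentBuffer(
      context, num_channels * sizeof(int32_t), &mult_raw));
  TF_LITE_ENSURE_STATUS(context->AllocatePersistentBuffer(
      context, num_channels * sizeof(int32_t), &shift_raw));
  data->per_channel_output_multiplier = static_cast<int32_t*>(mult_raw);
  data->per_channel_output_shift = static_cast<int32_t*>(shift_raw);

  for (int ch = 0; ch < num_channels; ++ch) {
    // The expensive double-precision math runs once here, not in Eval.
    const double effective_scale = static_cast<double>(input_scale) *
                                   filter_scales[ch] / output_scale;
    int32_t multiplier;
    int shift;
    tflite::QuantizeMultiplier(effective_scale, &multiplier, &shift);
    data->per_channel_output_multiplier[ch] = multiplier;
    data->per_channel_output_shift[ch] = shift;
  }
  return kTfLiteOk;
}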

On a side note, there is increasing evidence that allowing models to use per-tensor quantization instead of per-channel quantization for conv and depthwise_conv may be better for some TF Micro use cases. This is something we are aware of and are discussing internally.
