CMSIS-NN Conv kernel writes outside of allocated memory when num_channels > 256 #42748
Comments
Awesome find, Jan. If this is fixed, then CMSIS-NN will finally work for images.
@kwagyeman Seems like it. I've written a patch and now have CMSIS-NN working with MobileNetV2, will open up a PR after the weekend.
@janjongboom Are there any other parts of the code where they allocate static buffers like this? It's downright absurd in production code by ARM employees...
Conv.cc and depthwiseconv.cc both have the issue. You need to fix both.
…namically (fix tensorflow#42748) The CMSIS-NN convolutional and depthwise convolutional kernels write to uninitialized memory when num_channels is higher than 256. This is because the OpData struct allocates memory statically, rather than requesting memory from the arena like the non-CMSIS-NN kernels do. As there is no boundary checking in this function, this leads to errors with any network that has more channels in a layer (this showed up for us in e.g. MobileNet v2 0.05). This patch brings the kernel's behavior in line with the normal TFLM conv and depthwise_conv kernels by allocating the memory from the arena.
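The fix the commit message describes can be sketched as follows. This is an illustrative mock, not the actual TFLM code: `FakeContext` stands in for the real `TfLiteContext` (whose `AllocatePersistentBuffer` API differs in detail), and only the allocation pattern matters here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative mock of the TFLM arena: hands out persistent chunks from a
// fixed pool, in the spirit of the AllocatePersistentBuffer API.
struct FakeContext {
  uint8_t arena[4096];
  size_t used = 0;
  void* AllocatePersistentBuffer(size_t bytes) {
    if (used + bytes > sizeof(arena)) return nullptr;
    void* p = arena + used;
    used += bytes;
    return p;
  }
};

// Instead of fixed int32_t arrays of kMaxChannels (256) entries, OpData
// holds pointers that are sized from the real channel count at Prepare time.
struct OpData {
  int32_t* per_channel_output_multiplier = nullptr;
  int32_t* per_channel_output_shift = nullptr;
};

bool Prepare(FakeContext* context, OpData* data, int num_channels) {
  data->per_channel_output_multiplier = static_cast<int32_t*>(
      context->AllocatePersistentBuffer(num_channels * sizeof(int32_t)));
  data->per_channel_output_shift = static_cast<int32_t*>(
      context->AllocatePersistentBuffer(num_channels * sizeof(int32_t)));
  return data->per_channel_output_multiplier != nullptr &&
         data->per_channel_output_shift != nullptr;
}
```

With this pattern a layer with more than 256 channels simply requests a larger buffer, and an over-large request fails loudly instead of silently writing past the end of a fixed array.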
Hello @kwagyeman @janjongboom! I represent Arm. Some background: kMaxChannels is there for historical reasons - that's how the reference kernels used to be implemented as well. Some months back, the AllocatePersistentBuffer API was introduced in favor of the static buffers, and the reference kernels were updated to use that API. Our intention is to 'up-merge' changes such as this one into our CMSIS-NN kernels on a regular basis. However, we had to revert that change because we discovered an issue with the AllocatePersistentBuffer API; see #41121 for details. We worked with the TFLu team to resolve this issue, and it looks like it was fixed this(!) Friday by 59d177d#diff-7aab15c6eb5282359c2b928998bab743. In the meantime, the quick fix is to bump kMaxChannels to the size it needs to be in your local fork. I hope this gives some clarity on the issue. @janjongboom thanks for the PR (#42770)! Let's continue the technical discussion there. Cheers!
@freddan80 Thanks for the update. I ran into the same issue on our 2.3.0 fork, and replaced the scratch buffer with a persistent one to get around that.
This adds 10K of RAM per convolutional layer, which won't fit on our targets.
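As a back-of-the-envelope check of that ~10K figure (assuming the two per-channel int32 buffers - one multiplier and one shift per output channel - are what dominate the persistent allocation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One int32 multiplier plus one int32 shift per output channel. For the
// 1280-channel layer in the attached model that is 1280 * 8 = 10240 bytes,
// i.e. roughly the 10K per layer mentioned above.
constexpr size_t PerChannelPrepareBytes(size_t num_channels) {
  return num_channels * (sizeof(int32_t) + sizeof(int32_t));
}
```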
…namically (fix tensorflow#42748) The CMSIS-NN convolutional and depthwise convolutional kernels write to uninitialized memory when num_channels is higher than 256. This is because the OpData struct allocates memory statically, rather than requesting memory from the arena like the non-CMSIS-NN kernels do. As there is no boundary checking in this function, this leads to errors with any network that has more channels in a layer (this showed up for us in e.g. MobileNet v2 0.05). This patch brings the kernel's behavior in line with the normal TFLM conv and depthwise_conv kernels by allocating the memory from the arena. This reverts commit 5986dd0.
@janjongboom thanks for the updated PR!
Makes sense. Then your PR is the preferred solution. However, allocating the memory using … Cheers!
The tensor arena is dynamically allocated and freed, so the memory can be used for other things until it's needed by TF Lite. Having it be dynamically allocated is a must. I don't wish to be mean here, but I have been consistently disappointed with ARM's lack of effort in supporting these faster kernels... I would like to ask that future code releases have more effort put into making them useful for a wider audience. Hard-coded static buffers with maximum sizes and no range checks are not production code. Anyway, glad to see this getting fixed so we can finally enable it on the OpenMV Cam.
@kwagyeman I believe there is some misunderstanding here. The TFLu feature to handle interleaved persistent memory requests and arena scratch buffer requests has only been possible since the 28th of Aug 2020 (see 59d177d). We've been working with Google on this TFLu feature precisely so that it improves the usability of CMSIS-NN for a wider audience like you. Sorry that you feel it wasn't given attention. Hope PR #42770 helps the case now, and please reach out via a GitHub issue, such as this one, for unsupported features or bugs in either the TFLu or the CMSIS-NN repo. That way the issue is seen and known by a wider audience and can thereby gain traction and leverage support from the entire community. Cheers!
Yep, but I only have one very large convolutional layer and a lot of smaller ones, so dynamically allocating it saves me a lot of memory. But great to see PRs being landed to mainline for these types of issues. @freddan80 Q though (we can take this offline if you prefer): are there currently CI/CD pipelines set up for the CMSIS-NN kernels? E.g. we found this issue & #41816 through our own CI (and a bunch of debugging :-)). We currently have automated tests for a wide variety of development boards that we run on real hardware for releases, both with the micro allocator and custom allocators, and with and without CMSIS-NN. But we don't follow mainline TensorFlow, so we only find these issues once we upgrade. Perhaps it would be beneficial to share that with Arm?
We have our internal CI, which tests TFLu + CMSIS-NN use cases on FPGAs and HW models with Cortex-M4 up to Cortex-M55 reference systems. We usually find issues related to CMSIS-NN quite soon. Unfortunately, the networks we use did not have a reshape before FC as you encountered in #41816, or it would have been caught. Usually we create a kernel unit test case - in … That's some words about our internal CI.

In parallel, we're working with the TFLu team to enable CMSIS-NN regression testing directly in the TensorFlow GitHub project, using Renode as the target emulator (see …). It's awesome to hear that you have CMSIS-NN tests in your CI as well! I'd love to hear more about that. Perhaps some of your work can be upstreamed to the TensorFlow GitHub repo? Cheers!
I can provide some additional context from the TF Micro side. As you probably already know, the conv and depthwise_conv kernels use per-channel scale and zero-point quantization parameters. From the scale and zero-point values, we calculate a multiplier and shift during the Prepare stage, which involves some computationally expensive double-precision math as well as exponent calculations. Applying the multiplier and shift is a relatively cheap operation which occurs during the Eval stage. The storage required for these multipliers and shifts correlates with the number of channels along the quantized dimension (see the TFLite quantization spec for more info).

Since we pre-calculate shifts and multipliers during Prepare, we need to store the resulting values in persistent buffers, which means the memory cannot be re-used between kernel invocations (and the buffers must be allocated via context->AllocatePersistentBuffer). This is the price we pay for avoiding the large runtime overhead of calculating the multiplier and shift on the fly. Before we had a way to allocate persistent buffers in TF Micro, we implemented the conv and depthwise_conv kernels with a fixed-length buffer of 256 values. The issue seen here results from that previous implementation, and the fact that not all of our kernels have been updated to use dynamic allocation.

On a side note, there is increasing evidence that allowing models to use per-tensor quantization instead of per-channel for conv and depthwise_conv may be better for some TF Micro use cases. This is something we are aware of and discussing internally.
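The Prepare-time decomposition described above can be sketched like this. It is a simplified version of the technique only: the real TFLite/TFLM `QuantizeMultiplier` handles additional edge cases, and the function below is a stand-alone illustration, not the upstream code.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decompose a real-valued multiplier (e.g. input_scale * filter_scale /
// output_scale) into a Q31 fixed-point significand plus a power-of-two
// shift, so the Eval stage only needs cheap integer multiply-and-shift.
void QuantizeMultiplierSketch(double real_multiplier,
                              int32_t* quantized, int* shift) {
  if (real_multiplier == 0.0) {
    *quantized = 0;
    *shift = 0;
    return;
  }
  // frexp returns a significand in [0.5, 1) and the matching exponent.
  const double significand = std::frexp(real_multiplier, shift);
  int64_t q = static_cast<int64_t>(std::round(significand * (1ll << 31)));
  if (q == (1ll << 31)) {  // rounding can overflow the Q31 range
    q /= 2;
    ++*shift;
  }
  *quantized = static_cast<int32_t>(q);
}
```

One such (multiplier, shift) pair is stored per output channel, which is exactly why the persistent storage grows with the number of channels along the quantized dimension.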
@tensorflow/micro
System information
Describe the problem
The CMSIS-NN convolutional and depthwise convolutional layers write to uninitialized memory in `CalculateOpData` when `num_channels` is higher than 256 (`kMaxChannels`). There are no boundary checks in this function, so no error is thrown when this happens. The non-CMSIS-NN kernels do not have this problem.

I can get around this by switching the structure to use dynamic memory. But this is a temporary object, so I assume I should use the scratch buffer instead. However, I have no idea how to clear the object again from the scratch buffer.
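A sketch of the kind of workaround described above - replacing the fixed `kMaxChannels` arrays inside `OpData` with heap-sized buffers. This is illustrative only: the struct is simplified (field names follow the upstream kernels, the rest is omitted), and it is not the exact patch from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Simplified OpData: pointers instead of int32_t arrays of kMaxChannels
// (256) entries, so the buffers can match the actual channel count.
struct OpData {
  int32_t* per_channel_output_multiplier = nullptr;
  int32_t* per_channel_output_shift = nullptr;
};

// Size the per-channel buffers from the tensor's real channel count.
bool AllocatePerChannelData(OpData* data, int num_channels) {
  data->per_channel_output_multiplier =
      static_cast<int32_t*>(std::malloc(num_channels * sizeof(int32_t)));
  data->per_channel_output_shift =
      static_cast<int32_t*>(std::malloc(num_channels * sizeof(int32_t)));
  return data->per_channel_output_multiplier != nullptr &&
         data->per_channel_output_shift != nullptr;
}

void FreePerChannelData(OpData* data) {
  std::free(data->per_channel_output_multiplier);
  std::free(data->per_channel_output_shift);
  data->per_channel_output_multiplier = nullptr;
  data->per_channel_output_shift = nullptr;
}
```

Plain `malloc`/`free` is the reason this is only a stopgap on TFLM targets: the proper mechanism is to request the memory from the interpreter's arena instead, which is what the eventual fix does.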
Please provide the exact sequence of commands/steps when you ran into the problem
The problem shows up in the following .tflite model. One of its final layers has 1280 channels. ei-plants-vs-lamps-transfer-learning-tensorflow-lite-int8-quantized-model.lite.zip
/cc @kwagyeman This could be the same issue as you're facing with CMSIS-NN. It shows on our image models.