Reduce memory usage and increase performance for convolution on iOS #3778
Conversation
…ke all the implementations functors
For quantized models, will gemmlowp also be used on iOS? /cc @wangyida
@bhack gemmlowp can be used on iOS, though we haven't investigated optimizing it for those devices in particular so I expect we'll need to do more work there. This is primarily a fix for memory issues when running float models.
Jenkins, test this please.
Could be interesting to benchmark this against BNNS.
@bhack I can now apply GEMM to the FC layer in tiny-dnn on the iOS platform; the memory issue seems related to the batch size and network structure rather than the parametric model itself.
// This file contains a set of different implementations of the two-dimensional
// convolution operation. The standard TensorFlow Conv2d kernel uses EigenTensor
// to implement the computation, but here there are a variety of different ways
change "here there" to "there"
I've updated the line to read "this module has a variety...". Is that clearer?
// buffer for the next chunk and reuse it, keeping maximum memory size down.
// In this case, we've picked 16 megabytes as a reasonable limit.
const size_t max_chunk_size = (16 * 1024 * 1024);
OP_REQUIRES(context, (filter_value_count * sizeof(T1)) <= max_chunk_size,
Could pull `filter_value_count * sizeof(T1)` out into a constant and re-use it below.
Looks good to me (my latest round of comments were minor)...
// the Im2ColConvFunctor template definition inside the op registration to
// enable. Assumes row-major ordering of the values in memory.
template <class T1, class T2, class T3>
class ReferenceGemmFunctor { |
Why do we need to include this? The problem with including slow reference implementations is that they end up being used and are hard to get rid of.
We discussed this offline, but to summarize it's useful for bootstrapping porting to new platforms, though I agree it's a little awkward here.
Once the tests have passed, could the admins merge this since we have LGTMs?
We've had lots of problems with large convolutions hitting memory limits on iOS. This new implementation of the operator breaks the work into chunks so we never use more than 16 MB, and uses Apple's Accelerate framework to optimize the matrix multiplication.
Testing shows it's 5% to 10% faster than the existing implementation on various models, while keeping memory usage to a minimum.