Reduce memory usage and increase performance for convolution on iOS #3778

Merged
4 commits merged into tensorflow:master on Aug 23, 2016

Conversation

petewarden
Contributor

We've had lots of problems with large convolutions hitting memory limits on iOS. This new implementation of the operator breaks the work into chunks so we never use more than 16 MB, and uses Apple's Accelerate framework to optimize the matrix multiplication.

Testing shows that it's between 5% and 10% faster than the existing implementation on various models, and it keeps memory usage to a minimum.
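
As a rough illustration of the approach described above, here is a minimal sketch of a chunked im2col convolution that calls Accelerate's cblas_sgemm once per chunk. The function name, the lambda-based patch filler, and the overall structure are assumptions for illustration only; they are not the PR's actual code.

#include <Accelerate/Accelerate.h>  // provides cblas_sgemm on iOS/macOS
#include <algorithm>
#include <functional>
#include <vector>

// Illustrative only: run im2col + GEMM over the input in fixed-size chunks so
// the scratch buffer never exceeds ~16 MB, instead of materializing the whole
// patch matrix at once. fill_patch_row is a caller-supplied routine that
// writes the filter_value_count values of one input patch into a buffer row.
void ChunkedConv2D(
    const std::function<void(int patch_index, float* row)>& fill_patch_row,
    const float* filter, float* output, int num_patches,
    int filter_value_count, int out_depth) {
  const size_t max_chunk_size = 16 * 1024 * 1024;  // 16 MB scratch limit
  const size_t patch_bytes = filter_value_count * sizeof(float);
  const int patches_per_chunk =
      static_cast<int>(std::max<size_t>(1, max_chunk_size / patch_bytes));
  std::vector<float> im2col_buf(static_cast<size_t>(patches_per_chunk) *
                                filter_value_count);
  for (int start = 0; start < num_patches; start += patches_per_chunk) {
    const int chunk = std::min(patches_per_chunk, num_patches - start);
    for (int i = 0; i < chunk; ++i) {
      fill_patch_row(start + i, im2col_buf.data() + i * filter_value_count);
    }
    // [chunk x filter_value_count] * [filter_value_count x out_depth]
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, chunk, out_depth,
                filter_value_count, 1.0f, im2col_buf.data(),
                filter_value_count, filter, out_depth, 0.0f,
                output + static_cast<size_t>(start) * out_depth, out_depth);
  }
}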

@bhack
Contributor

bhack commented Aug 14, 2016

For quantized models, will gemmlowp also be used on iOS? /cc @wangyida

@petewarden
Contributor Author

@bhack gemmlowp can be used on iOS, though we haven't investigated optimizing it for those devices in particular, so I expect we'll need to do more work there. This is primarily a fix for memory issues when running float models.

@petewarden
Contributor Author

Jenkins, test this please.

@bhack
Contributor

bhack commented Aug 15, 2016

It could be interesting to benchmark this against BNNS.

@wangyida

@bhack I can apply GEMM to the FC layer in tiny-dnn on the iOS platform now; the memory issues seem related to the batch size and network structure rather than to the parametric model itself.


// This file contains a set of different implementations of the two-dimensional
// convolution operation. The standard TensorFlow Conv2d kernel uses EigenTensor
// to implement the computation, but here there are a variety of different ways
Contributor

change "here there" to "there"

Contributor Author

I've updated the line to read "this module has a variety...". Is that clearer?

// buffer for the next chunk and reuse it, keeping maximum memory size down.
// In this case, we've picked 16 megabytes as a reasonable limit.
const size_t max_chunk_size = (16 * 1024 * 1024);
OP_REQUIRES(context, (filter_value_count * sizeof(T1)) <= max_chunk_size,
Contributor

Could pull filter_value_count * sizeof(T1) out into a constant and re-use it below.
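
For what it's worth, a sketch of how that suggestion might look, naming the per-patch byte count once and reusing it for both the size check and the chunk-size computation. The error message and the patches_per_chunk line are assumptions for illustration, not the PR's exact code.

const size_t max_chunk_size = (16 * 1024 * 1024);
// Hypothetical refactor per the review comment: compute the per-patch byte
// count once and reuse it below.
const size_t patch_byte_count = filter_value_count * sizeof(T1);
OP_REQUIRES(context, patch_byte_count <= max_chunk_size,
            errors::InvalidArgument("Im2Col patch too large for the buffer"));
const size_t patches_per_chunk = max_chunk_size / patch_byte_count;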

@andydavis1
Contributor

Looks good to me (my latest round of comments were minor)...

// the Im2ColConvFunctor template definition inside the op registration to
// enable. Assumes row-major ordering of the values in memory.
template <class T1, class T2, class T3>
class ReferenceGemmFunctor {
Member

Why do we need to include this? The problem with including slow reference implementations is that they end up being used and are hard to get rid of.

Contributor Author

We discussed this offline, but to summarize, it's useful for bootstrapping ports to new platforms, though I agree it's a little awkward here.
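
For readers who haven't seen the functor in question, a reference GEMM in this spirit is just the naive triple loop over row-major matrices; the operator() signature below is assumed for the sketch and may not match the PR exactly.

// Illustrative naive GEMM: c = a * b with row-major storage, where a is
// [m x k] with stride lda, b is [k x n] with stride ldb, and c is [m x n]
// with stride ldc. Deliberately unoptimized; intended only as a correctness
// baseline when bootstrapping a port to a new platform.
template <class T1, class T2, class T3>
class ReferenceGemmFunctor {
 public:
  void operator()(size_t m, size_t n, size_t k, const T1* a, size_t lda,
                  const T2* b, size_t ldb, T3* c, size_t ldc) {
    for (size_t row = 0; row < m; ++row) {
      for (size_t col = 0; col < n; ++col) {
        T3 total(0);
        for (size_t inner = 0; inner < k; ++inner) {
          total += static_cast<T3>(a[row * lda + inner]) *
                   static_cast<T3>(b[inner * ldb + col]);
        }
        c[row * ldc + col] = total;
      }
    }
  }
};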

@petewarden
Contributor Author

Once the tests have passed, could the admins merge this since we have LGTMs?

@rmlarsen merged commit 459c2fe into tensorflow:master on Aug 23, 2016
7 participants