Support MPSCNN (MetalPerformanceShaders) on iOS #7958

Closed
cancan101 opened this issue Mar 1, 2017 · 32 comments
Labels
stale This label marks the issue/pr stale - to be closed automatically if no activity stat:contribution welcome Status - Contributions welcome type:feature Feature requests

Comments

@cancan101
Contributor

cancan101 commented Mar 1, 2017

Related to: #3001

Take advantage of the MPSCNN (Metal Performance Shaders) framework from Apple.

See blog post for a comparison of BNNS to MPSCNN (and associated code).

TL;DR: BNNS is faster for smaller networks but slower for bigger networks.

Related: #4846

@tatatodd
Contributor

tatatodd commented Mar 3, 2017

Thanks for filing this issue @cancan101! I think the commentary on #4846 and #3001 describes the current situation well. I'm marking this as "contributions welcome".

@tatatodd tatatodd added stat:contribution welcome Status - Contributions welcome type:feature Feature requests labels Mar 3, 2017
@cancan101
Contributor Author

cancan101 commented Mar 3, 2017

@tatatodd I might be interested in looking at this. Any suggestions for how to tackle it? Is there a concept of a GPU device when running on iOS? BNNS is simpler to reason about, as it uses the Accelerate framework, which runs on the CPU.

@tatatodd
Contributor

tatatodd commented Mar 4, 2017

@petewarden knows the space pretty well, and might have some suggestions.

@petewarden
Contributor

CC-ing @keveman who's interested in this area.

@s1ddok

s1ddok commented Apr 5, 2017

I would like to contribute to this as well. Is there any movement, or at least a Slack for interested people?

@keveman
Contributor

keveman commented Apr 5, 2017

I am putting together a skeletal framework for calling Metal from TF. I would hold off until that is upstreamed. Once that is done, there will be room for a lot of contributions in building out the repertoire of TF operations that can be run using Metal. @s1ddok and @cancan101, does that sound like a plan?

@s1ddok

s1ddok commented Apr 5, 2017

One should keep in mind that there are a lot of things lacking in the MPS framework, so we will have to provide custom implementations for the absent layers.

Also, you gain maximum performance when using MPSTemporaryImages, so everything should be encoded in one stage.

@keveman
Contributor

keveman commented Apr 5, 2017

Yes. I am counting on contributors like you to add the missing pieces :)
Also, yes, I am aware of the need to use MPSTemporaryImage and set its readCount to 0 as soon as possible. My framework would handle those.

@s1ddok

s1ddok commented Apr 5, 2017

You may take a look at this; it is a C++ wrapper around Metal. Not around MPS, but still: https://github.com/naleksiev/mtlpp

@s1ddok

s1ddok commented Apr 5, 2017

"set its readCount to 0"

You don't really do that yourself. MPS decreases it every time the image is encoded; what we need to do is set the readCount to the number of "output" layers. Should be fairly simple.

P.S. We will of course need to decrease that number in custom layers, but those are details.
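To make the bookkeeping concrete, here is a minimal sketch in plain C++ (the `Op` struct is a hypothetical stand-in for a graph node, not TF or C2 code) of computing the readCount each MPSTemporaryImage should start with: one per consuming op, since MPS decrements it on every encode.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical op description: each op reads some input tensors by name.
struct Op {
  std::string name;
  std::vector<std::string> inputs;
};

// For each tensor, the readCount of its MPSTemporaryImage should equal the
// number of ops that consume it; MPS decrements it on each encode, so the
// image's storage is released right after its last consumer runs.
std::map<std::string, int> ComputeReadCounts(const std::vector<Op>& ops) {
  std::map<std::string, int> read_count;
  for (const Op& op : ops) {
    for (const std::string& t : op.inputs) ++read_count[t];
  }
  return read_count;
}
```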

@keveman
Contributor

keveman commented Apr 5, 2017

@s1ddok thanks for the pointers. The C++ wrapper around Metal is great. We do really want to call the convolution kernels in MPS, though.

@sschaetz

sschaetz commented Apr 6, 2017

We started on MPS integration into TensorFlow here, focusing on the 2D convolution operation. This is a summary of what we did:

  • created a generic iOS test bench that generates test data using Python/TensorFlow on the host, ships the data to a phone app, runs the equivalent MPS operation, and compares the results (note that MPS runs on mobile devices exclusively) (readme, test-data script, unit test)
  • implemented an MPS conv2d prototype (code) - due to MPS's esoteric data format it is somewhat tricky to implement a generic conv2d operation (data shuffling is necessary); with temporary images and chaining of multiple conv2d operations we expect this can be avoided, but this prototype is not capable of doing that

We show that MPS can be used for conv2d, but as mentioned here we need some infrastructure for calling Metal and passing Metal GPU memory (and temporary images) between nodes to get full performance. This is why we are not ready to upstream these changes at this time.
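A host-side reference of the kind such a test bench compares against can be as simple as a direct convolution. A minimal sketch (single channel, stride 1, VALID padding; an illustration of the idea, not the actual test-bench code):

```cpp
#include <vector>

// Minimal single-channel, stride-1, VALID-padding conv2d reference of the
// kind a host-side test bench can use to check a GPU result elementwise.
// Input is ih x iw row-major, kernel is kh x kw row-major.
std::vector<float> Conv2DValid(const std::vector<float>& in, int ih, int iw,
                               const std::vector<float>& k, int kh, int kw) {
  const int oh = ih - kh + 1, ow = iw - kw + 1;
  std::vector<float> out(oh * ow, 0.f);
  for (int y = 0; y < oh; ++y)
    for (int x = 0; x < ow; ++x)
      for (int dy = 0; dy < kh; ++dy)
        for (int dx = 0; dx < kw; ++dx)
          out[y * ow + x] += in[(y + dy) * iw + (x + dx)] * k[dy * kw + dx];
  return out;
}
```

Comparing against such a reference with a small tolerance also surfaces the half-precision rounding that MPS's texture formats introduce.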

The proposed mtlpp library looks great - we would love to see something like this integrated. We are working on a similar wrapper library called Aura that wraps CUDA, OpenCL, and Metal - if TF allowed us to switch the GPU backend under the hood, we would be all set.

@s1ddok

s1ddok commented Apr 6, 2017

@sschaetz good to see this. The goal, of course, is to shuffle the data only once and then encode everything in one pass. Apple has a good example of how to implement Inception_v3, so basically all the best-performance ideas can be gleaned from there.

There is also a rendering library called bgfx; it is an abstract API over GPU backends, and I built a game engine on it back in the day. An abstraction layer like that could help to switch backends seamlessly.

@cancan101
Contributor Author

@keveman any updates on your framework for calling Metal?

@keveman
Contributor

keveman commented Apr 18, 2017

@cancan101 No significant updates, but coming soon.

@cancan101
Contributor Author

http://caffe2.ai/docs/mobile-integration.html#null__performance-considerations makes the following claim:

ARM CPUs outperform the on-board GPUs (our NNPACK ARM CPU implementation outperforms Apple’s MPSCNNConvolution for all devices except the iPhone 7).

@cancan101
Contributor Author

cancan101 commented Apr 18, 2017

And here is the PR adding the MPSCNN functionality to Caffe2: facebookarchive/caffe2#215

@Yangqing
Contributor

@cancan101 small correction per @ajtulloch: Metal is faster on devices that are iPhone 6s and above, and NNPACK is faster on the rest.

@ajtulloch

Hey @keveman, I wrote the C2 mobile stuff, so I'm really interested in the TF approach. How are you thinking of structuring the TF integration? It'd be cool if we could reuse kernel sources and such.

Some decisions/hacks/notes I made that I was kind of unsure about; I wondered how you'd approach them:

  • Using textures (MPSImage, MPSTemporaryImage) vs buffers (MTLBuffer) as the tensor representation. There are a few issues here - the standard textures-vs-buffers tradeoffs, but also some semi-surprising limitations of the implementation (e.g. a maximum of 2800 textures in a texture2d_array), which introduces issues with representing a batch-size-50, 224-channel image. You need MPSTemporaryImage for the MPSCNN kernels obviously, but these kernels seem to be operating well below peak, so I imagine there's space to improve there with a custom implementation.
  • Handling of the MTLCommandBuffer object. Instead of the CUDA approach with a reusable stream that you can schedule kernels/events on, this transient object is single use, so there needs to be some initialization and passing it around along with the MPSTemporaryImage in the MPSImageWrapper struct.
  • Memory management - solving it with MPSTemporaryImages (and having an SSA pass to compute the readCount field) vs using a custom allocator with MTLBuffer.
  • General object ownership issues, integrating the ARC refcounting model with the DFG style ownership. In the C2 impl, the op maintains a ref on the MPSTemporaryImage which solves the common case but has some ugly corner cases, so it's probably worth fixing it to do proper ARC on the MPSImageWrapper itself.
  • Passing compile-time specialization arguments to kernels helped a lot (https://github.com/caffe2/caffe2/blob/d0ce49/caffe2/contrib/mpscnn-fb/MPSCNN.metal#L7-L13), but ideally that would be a bit nicer from a kernel authors perspective without cluttering it up with a lot of single-use structs/syntax.
  • Speaking of kernels, the kind of ugly switching/duplication between texture2d/texture2d_array is something that might be good to revisit via some macro sugar or something?
  • In general, there were a bunch of good snippets in https://developer.apple.com/videos/play/wwdc2016/606/.
  • Getting the GPU profiler working (via a call to MTLCommandQueue:insertDebugCaptureBoundary after Net::Run/Session::Run) was pretty useful for some things.
  • For some use-cases, the cost of command construction/encoding was nontrivial compared to execution time, so I was thinking of doing some double-buffering wrapper (at a higher level) for latency-sensitive applications.

@keveman
Contributor

keveman commented Apr 25, 2017

@Yangqing That's what I noticed too. Using Metal on iPhone 6s and above was significantly faster.

@keveman
Contributor

keveman commented Apr 25, 2017

@ajtulloch Thanks for your detailed comments. It looks like the C2 implementation is further along than what I have, but it would be great to share code if possible. I haven't looked at the C2 implementation in detail yet, but I am thinking about all the points you bring up here. In particular, I am super frustrated by the texture2d/texture2d_array issue too. Here are some thoughts on some of your points.

  • Using textures: I want to be using MPS[Temporary]Image to represent tensors. One of the biggest use cases that is going to benefit from Metal is a model that works on individual frames of a video (batch size of 1). I am inclined towards supporting that use case well. So the limitation on the number of textures won't be too bad.
  • Memory management: Since the Metal framework manages the underlying memory of a MPSTemporaryImage pretty efficiently, I didn't think I needed any analysis (I may be wrong). Effectively, I declare one MPSTemporaryImage per tensor in the graph and set its readCount to 0 as soon as its last consumer is enqueued.
  • Double buffering: Yes, the host-side cost of spinning up a pipeline and enqueuing kernels is pretty significant. I don't have a great answer for this yet.

@ajtulloch

ajtulloch commented Apr 26, 2017

  • Re. textures - makes sense. When would you use MPSImage in that case? In the C2 approach, all the intermediate states are transient/unobservable, and are only observable via a copy (which is also the synchronization point), so there's no need for textures that persist outside the lifetime of the command buffer (which is what MPSImage would give you?).
  • Re. memory management: yes, it seems pretty solid. I experimented with the MPSTemporaryImage:prefetchStorage call but didn't get a perf win on any of the models I was using, which seemed surprising and I wonder if I was misusing the API. Have you got a win out of this API at all?
  • Re. double buffering - I was thinking of having an async API that takes a complete Metal compute graph, then preallocates a pool of buffers and net/session instances, and spins up a background thread that repeatedly constructs all the commands with respect to the stable pointers and enqueues the resulting MTLCommandBuffer/MTLBuffer back to the caller thread. Then the client code just needs to do a memcpy from the caller data -> buffer + encode command buffer + wait, which I believe should be a win in the scenarios I was looking at. Does that make sense? A bunch of these applications are predicated on being able to execute the entire computation graph on the Metal device, which is what I was targeting for many of our applications but might not be applicable elsewhere.
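As a rough illustration of that double-buffering shape, here is a plain-C++ sketch of the threading pattern only, with ints standing in for pre-encoded MTLCommandBuffers (no real Metal code):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Sketch of the double-buffering idea: a background thread "encodes" work
// items (standing in for MTLCommandBuffer construction against stable
// pointers) and hands them to the caller thread through a queue, so
// encoding overlaps with execution on the caller side.
class EncoderPipeline {
 public:
  explicit EncoderPipeline(int n_items) {
    worker_ = std::thread([this, n_items] {
      for (int i = 0; i < n_items; ++i) {
        int encoded = i * i;  // pretend this is an encoded command buffer
        std::lock_guard<std::mutex> lock(mu_);
        q_.push(encoded);
        cv_.notify_one();
      }
    });
  }
  ~EncoderPipeline() { worker_.join(); }

  // Caller thread: block until the next pre-encoded buffer is available.
  int Next() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !q_.empty(); });
    int v = q_.front();
    q_.pop();
    return v;
  }

 private:
  std::thread worker_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<int> q_;
};
```

In the real version, `Next()` would hand back a command buffer ready for the memcpy-encode-commit-wait sequence described above.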

@bryant1410
Contributor

At @xmartlabs we have been working on a framework for running neural nets on iOS on top of Metal that supports running TensorFlow models. Maybe someone is interested in using it, at least until TF supports Metal:

https://github.com/xmartlabs/Bender

@taion
Contributor

taion commented Jun 5, 2017

Would it make sense to take the same approach to adapt TF models to run on Caffe2?

@keveman
Contributor

keveman commented Jun 6, 2017

@bryant1410 Thanks for the note. The Bender project is awesome! Some of the code, especially the shaders, can be shared between the implementations. I'll keep you posted.

@cancan101
Contributor Author

Core ML might be an interesting abstraction over both this and BNNS: #10468.

@ofirbb

ofirbb commented Oct 16, 2017

@keveman @ajtulloch
Continuing your discussion from June 😸

  • prefetchStorageWithCommandBuffer:imageDescriptorList: seems to give a small win when calling the same pipeline more than once. While counterintuitive, since it is linked to a single command buffer, calling the prefetch method on each new buffer with the same set of descriptors results in faster run times for me.
  • preencoding - @ajtulloch Did you actually manage to make this work? preencoding the whole buffer and then just copying the data into the input MPSImage and committing?
  • Double buffering - I have not tried it yet, but I believe that for large enough graphs you can swap in an MPSImage for an MPSTemporaryImage in the middle in order to encode a buffer up to that point; then, while that buffer is committed, use the CPU to encode a new buffer that runs from that point all the way to the output. If any of you have tried this, I would love to hear about it.
  • texture2d vs. texture2d_array - Since the code for C2 is already out there, I will open a PR for that in a couple of days. The essence is that the Metal Shading Language is based on C++14, which fully supports templates, namespaces, and other tools that allow you to create common code called by several kernels.

Last but not least, while being Swift-based, Forge has some nice ideas in it too for building a framework on top of MPS.
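For what it's worth, the templating idea behind the texture2d/texture2d_array point can be sketched in plain C++. The toy texture types below only model the read interface (real MSL would use texture2d<half> and texture2d_array<half>); the point is that the kernel body is written once as a template and instantiated for both variants:

```cpp
#include <array>

// Toy stand-ins for the two MSL texture flavors; both expose read(x, y).
struct Tex2D {
  float v;
  float read(int /*x*/, int /*y*/) const { return v; }
};
struct Tex2DArray {
  std::array<float, 4> slices;
  int slice;
  float read(int /*x*/, int /*y*/) const { return slices[slice]; }
};

// Shared kernel body, written once and instantiated for each texture type,
// instead of duplicating the kernel for texture2d and texture2d_array.
template <typename Texture>
float ReluAt(const Texture& t, int x, int y) {
  float v = t.read(x, y);
  return v > 0.f ? v : 0.f;
}
```

The same structure ports to MSL kernels, since the language supports templates; function constants can still select the array/non-array entry point.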

@powderluv
Contributor

@keveman Do you have your framework somewhere to try? Even if it is WIP, it may be easier to build on the same foundation.

@ajtulloch

@ofirbb interesting.

  • Re: prefetching - how did you select the descriptors to prefetch? It seems to trade off heap size vs. time spent allocating, right?
  • Re: preencoding/double buffering - we do something similar for models like Mask R-CNN, where we run parts of the graph on the CPU and parts on the GPU. It's not very cleanly abstracted, though.
  • Re: texture2d/2d_array - cool stuff. I was thinking of just a minor refactor to go to single kernel and keep using function_constants to select array/non-array, but if there's some nice template abstractions that's pretty neat.

@haoxi911

Are there any updates on supporting MPSCNN in TensorFlow or TensorFlow Lite? I am new to machine learning, so I can probably only describe the problem from a user's perspective.

We are currently comparing the accuracy/performance of TF-Lite and Metal (MPSCNN), and it seems Metal works better on iOS (Mobilenet_v1). As we have to support iOS 10, Core ML is not an option.

I don't love the Metal approach we used, though: it required us to extract the weights and biases from the frozen TF graph, and we also had to write the Metal inference code ourselves. It would be nice if TF-Lite could provide comparable performance; then there would be no need for us to dig into Apple's MPSCNN APIs :)

Thank you for all your efforts!

@github-actions

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 28, 2023

This issue was closed because it has been inactive for 1 year.
