Tensorflow lite gpu delegate inference using opengl and SSBO in android #26297
Thanks for trying out the GPU delegate. Can you provide a little more context in terms of timing, i.e. how many milliseconds/seconds it was before and after? What kind of network are you using? Specifically, are all ops supported? Have you written custom shader code to copy the camera texture into the SSBO, or are you just dumping CPU memory into the SSBO yourself? If it's the former, you're doing things right and it should get faster. If it's the latter, it's only going to get slower.
Model: similar to the official TF-Lite segmentation model (model inference graph attached as an image). The last three additional nodes do not seem to be supported by the GPU delegate. The input image size is 129x129. Phone: OnePlus 3, GPU: Adreno 530. Timings: i.e. the time for executing the interpreter.run() method. Here is the method we used to copy the camera texture into the SSBO:
Can the same Interpreter.run() method handle both normal input from the CPU and an SSBO? Or are there other options/functions for running inference in this case?
Apologies for the delayed response. For some reason, I only just got this in my inbox >_< Quick question re: your code: doesn't it have to be
? Also, do you have the luxury of making the input SSBO of shape 1x129x129x4? Then you could eliminate one hidden memcpy inside. From the graph you shared (btw, nice visualization; appreciated), it indeed looks like everything would be handled until the last ResizeBilinear. Its shape (129x129x2) is also not too bad in terms of, e.g., having too many channels, so I wouldn't expect any slowdown. Did you properly call
for MobileNet. Might not be directly applicable, but you roughly get the idea...
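Regarding the 1x129x129x4 suggestion above: the GPU delegate keeps tensors with the channel dimension padded up to a multiple of 4 internally, so a 4-channel SSBO already matches that layout and the hidden conversion copy can be skipped. A minimal sketch of the size math (the helper name is ours for illustration, not a TFLite API):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte size of an SSBO holding a float tensor whose
// channel dimension is rounded up to a multiple of 4, mirroring the
// 4-channel-aligned layout the GPU delegate uses internally.
std::size_t PaddedSsboBytes(int batch, int height, int width, int channels) {
  const int padded_channels = ((channels + 3) / 4) * 4;  // round up to mult. of 4
  return static_cast<std::size_t>(batch) * height * width * padded_channels *
         sizeof(float);
}
```

A 1x129x129x3 RGB input therefore occupies the same GPU-side storage as a 1x129x129x4 RGBA one, which is why handing the delegate a 4-channel SSBO costs nothing extra.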
Not officially announced yet, but FYI: the GPU code is now visible at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu if you need the code for better insight into what is happening.
Hi @impjdi ,
I am out of office on vacation this week with limited network access, and there's a good chance I'll forget about this. Could you please nudge me again next week?
Sure porygon... 😉
Hi @impjdi ,
Hi @impjdi , Maybe I should open a separate issue... We're attempting to use a GLSurfaceView in our app alongside the tflite GpuDelegate. Our renderer works fine until A working example might clear things up... Thank you!
Heh, I missed the porygon part earlier :) The below is in C++, but should be similar in Java too.
Hm, the only official example code is the TFLite demo app in the TF repository. As an Android app consists of a lot more than a single Java file, that'd be difficult unless I start up a whole new git repo with the files. Unfortunately, on top of that, I'm not a real mobile app developer; I do most of my stuff in Android C++ without cameras. I'll see whether I can cook up a C++ binary that can do all this in a single C++ file =/ That discussion aside...
You can probably trace back what is causing the hang.
Did things work out? Can this issue be closed?
The code runs without errors, but we are not able to get correct output when using the SSBO as input. The output seems to be black (i.e. all zeroes). We cannot verify whether the data is correctly copied into the SSBO, or whether it is correctly accessed by TensorFlow. There also seems to be no way to debug and inspect shader code (GLSL) on Android.
Attached is the logfile containing the errors we got when trying to use an SSBO with the tflite model. The errors vary between Mali devices, whereas on Adreno devices the output is simply not visualized. Mali (error logs are attached with the issue: mali-gpu-ssbo-errorlog.txt). Adreno (error logs are attached: adreno-gpu-ssbo-errorlog.txt). @impjdi Could you have a look at it? It would also help if you could share the working app code with us for reference.
@impjdi Any updates on SSBO?
@ktgordon Have you found a resolution/workaround for this issue? I am experiencing exactly the same problem. After calling modifyGraphWithDelegate(), all glDraw calls result in black; I don't even need to associate an SSBO with the TFLite tensors. This is strange. Taking a deeper look as well.
We did find a workaround. I'm assuming you're using the Java API and bringing in gpu delegates via What I think is happening is that modifyGraphWithDelegate() modifies the current context so that our display surface is no longer current... not a problem if we had access to our original state variables. However, since we originally tried using GLSurfaceView, we didn't have access to any of these variables; in effect, modifyGraphWithDelegate made changes to the GL state that we couldn't recover from. Switching from GLSurfaceView to TextureView gave us more control at the cost of more complexity. We created a dummy context, initialized our interpreter and called modifyGraphWithDelegate(), then created a new shared context with the dummy context. This way we could make our display surface current and render to it. Managing the EGL context was handled by reusing code from Grafika. This got us past the black screen problem anyways...
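The dummy-context workaround described above can be sketched roughly as follows (pseudocode; the call names mirror EGL14/Grafika-style APIs, and every identifier here is illustrative rather than code from this thread):

```
// 1. Create a "dummy" EGL context with an invisible surface (e.g. a 1x1 pbuffer).
dummyContext = eglCreateContext(display, config, EGL_NO_CONTEXT, attribs)
eglMakeCurrent(display, pbufferSurface, pbufferSurface, dummyContext)

// 2. Initialize the interpreter and apply the GPU delegate while the dummy
//    context is current; modifyGraphWithDelegate() may alter this context's state.
interpreter.modifyGraphWithDelegate(gpuDelegate)

// 3. Create the rendering context *shared* with the dummy context, so GL
//    objects (textures, SSBOs) are visible to both.
renderContext = eglCreateContext(display, config, dummyContext, attribs)

// 4. Make the TextureView's window surface current with the shared render
//    context and draw; switch contexts around interpreter.run() as needed.
eglMakeCurrent(display, windowSurface, windowSurface, renderContext)
```

This keeps the delegate's GL state changes confined to the dummy context while the shared render context still sees the same buffer objects.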
I am doing exactly what you said here, as I based my code on the TFLite demo (which uses TextureView). Mainly the following:
The draws using The Grafika code was also referenced. Will try to set up a dummy context next...
Hi @ktgordon , @gnsmrky , Finally, were you able to achieve any speedup compared to normal GPU inference? If so, can you share a basic working demo app, just to clear things up?
@ktgordon Just got it working! Indeed, the dummy shared context is the key to making it work. I guess GLES context setting/switching can be a lot more complicated than one can imagine... @anilsathyan7 I based mine on the TFLite demo, which is the main sample project the TFLite GPU delegate page provides. This sample project uses The performance gain is significant. I was trying a 448x448 image (a larger image amplifies the copy time). The time it takes without the SSBO/Image2D copy shader is around 900 ms on a Snapdragon 808; with the copy shader it comes down to under 20 ms!
@gnsmrky Could you share your repo? That would give everyone a better starting point for exploring SSBO.
@SanthoshRajendiran Trying to find the time to do that. The code is very messy and unreadable right now. Will get it cleaned up as soon as I get some spare cycles.
Ah, thanks for the update and sharing!
I followed the official Android documentation for the GPU delegate and got stuck at the bindBuffer step, too.
I checked out the current master and there is no gpu_delegate(.cc?), only a gpu_delegate_jni(.cc). Did you mean that? Anyway, I found that TfLiteGpuDelegateBindBufferToTensor seems to be an exported symbol of the library, and we can get the native handle of the delegate, so we might be able to call that method directly from Java.
Sorry, the last file should have been |
@impjdi Thanks for the update. Does that mean the SSBO route is currently only available with the C bindings, or not at all?
I haven't checked Java, but if Java has migrated to the new API (delegate.cc), your assessment is correct. For C++, it's only available in v1 (gl_delegate.cc), but not in v2 (delegate.cc).
@impjdi Is the SSBO bindBuffer issue resolved in the v2 delegate?
The current plan is not to support bindBuffer in delegate v2.
@impjdi We have our image frame in GPU memory. Should we move it to the CPU just to start inference, which will move it to the GPU again? The time spent doing this would waste the benefits of GPU inference in many cases.
@impjdi Could you share any information about why bindBuffer will not be supported in delegate v2? I believe it improves GPU end-to-end inference time by eliminating memcpy operations. Did the tflite team run into some unresolvable issues, or was the decision driven purely by product requirements?
There are many advanced usages of mobile GPU inference, and for each of those, the GPU delegate needs helper functions like
@impjdi This information is helpful. Another question: when will the MediaPipe delegate v2 integration be released? Thank you.
Someone's working on it :)
Has anyone managed to bind the buffer with the v2 delegate? It seems to me that MediaPipe is already using it; see mediapipe/tflite_gpu_runner.h . This runner is used in the calculator mentioned by impjdi under the This is very unfriendly for those who want the SSBO utility without maintaining their own interpreter, but going deeper, the bind logic is in mediapipe/tflite_gpu_runner.cc and simply calls The v2 delegate owns an @impjdi , any word of guidance would be helpful here. Is this correct? Can we simply patch the v2 delegate with an InferenceRunner::SetInputObject call and invoke it instead of the v1 bindBuffer? I don't think I'm on the right track, but I do think it would be very useful to the community if we could achieve a patch file and share it here.
@natario1 I think @impjdi explained that
Thanks for your comment @brucechou1983 . A simpler solution for me is to stick to the v1 delegate, but to be honest it doesn't seem like the MediaPipe runner is doing anything complex/fancy other than calling The v2 delegate also does the same object/objectdef calls, but the difference is that it uses I don't know what the support for OpenCL on Android is like, but OpenGL works just fine, so we could have a flag in the v2 delegate options telling the delegate not to try OpenCL and to go with OpenGL. It's something the TF team could add to ease the v1-v2 transition, I think, since people who were using v1 likely have an SSBO set up.
@natario1 If a flag for using only OpenGL is what you need, it's already there, though it's still experimental. You can set the flag to However, when you need realtime (>>30fps) semantic segmentation and/or face mesh running on a $200 phone, choosing the right GPU backend in the tflite runtime for efficient execution is really not a trivial problem. I do see the value of using OpenCL for some Mali GPU devices. The
@natario1 I see you did your homework there, good job 👍 You might have noticed, but TFLite is adding a bunch of delegates for various accelerators and APIs. Each of them having custom helper functions didn't help usage; it made things more confusing for the 99% of users who want to use the TFLite GPU delegate simply as a magic box doing GPU-accelerated inference. So the final decision we made was to keep the TFLite GPU delegate as simple as possible, but leave room open for advanced users who want to do really performant things. The teams that deliver TFLite GPU and MediaPipe are sister teams sharing one manager. Having said that, TFLite GPU won't break MediaPipe, and that's a guarantee. In that sense, going deeper and using advanced internal APIs like
I understand the situation @impjdi . Would you consider something like You say that
Apart from this, I'll try to use these low-level APIs this weekend and see if I manage to get v2 working. Thanks for helping! Edit: After spending the weekend on it, I realized this suggestion was not possible, but I hope you can consider something like what I ended up doing, which is clean and keeps the delegate header untouched.
@impjdi Any suggestions on how to fix this error? It seems to be an issue with the BHWC > BHWC4 conversion, but I have no clue how to address it. It happens in ToTensorConverter.
I create the object def and tensor object as follows:

```cpp
// object def
tflite::gpu::ObjectDef object_def;
object_def.data_type = tflite::gpu::DataType::FLOAT32;
object_def.data_layout = tflite::gpu::DataLayout::BHWC;
object_def.object_type = tflite::gpu::ObjectType::OPENGL_SSBO;
object_def.user_provided = true;

// tensor object
tflite::gpu::OpenGlBuffer tensor_object;
tensor_object.id = ssbo;
```

Then I pass both to the delegate before
TF version is 2.2.0, and the model I am using is extremely simple: it takes a 400x400x1 image and calculates the average intensity, returning a single float. I am trying to use an SSBO object for the input only. Also, I'm running the OpenGL backend; OpenCL is not available on my phone.
After many hours, I think I hit a bug that is still present in 2.2.0 but was fixed in master by these commits: 4000a5c dffe6a0 For those who are interested: in short, the fact that I'm using BHWC with 1 color channel (instead of 4) requires the GL engine to do a conversion, and this conversion (before 4000a5c and dffe6a0) is completely broken, because By cherry-picking 4000a5c and dffe6a0 into v2.2.0 and exposing the necessary APIs, I'm able to do SSBO I/O with the v2 delegate. These commits are pretty old, so I hope they can make it into the next release. These are the changes I had to make to expose the necessary APIs: deepmedia@7401fbb . I don't know C++, so there might be errors, but the point is to create an interface that the v2 delegate extends. This interface can be retrieved from the delegate using a separate C++ header (delegate_core.h), so the high-level delegate is still a black box.
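For context, the conversion being discussed can be pictured with a CPU reference: each pixel's C channels land in a 4-channel slot, with the remainder zeroed. This is an illustrative sketch of what a BHWC > BHWC4 conversion has to do, not the delegate's actual converter code:

```cpp
#include <cstddef>
#include <vector>

// Illustrative CPU reference of a BHWC -> BHWC4 conversion: each pixel's
// `c` channels are copied into a 4-channel slot and the unused channels are
// zero-filled. This mirrors the work ToTensorConverter must perform when a
// user-provided SSBO has fewer than 4 channels (not the real TFLite code).
std::vector<float> BhwcToBhwc4(const std::vector<float>& src,
                               int b, int h, int w, int c) {
  std::vector<float> dst(static_cast<std::size_t>(b) * h * w * 4, 0.0f);
  for (int i = 0; i < b * h * w; ++i) {
    for (int ch = 0; ch < c; ++ch) {
      dst[i * 4 + ch] = src[i * c + ch];
    }
  }
  return dst;
}
```

With a single input channel, three quarters of the converted buffer is zero padding; it is this conversion path that the commits above fix.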
@anilsathyan7 Could you please try the latest stable version, TF 2.5 or 2.4.1, and let us know if this is still an issue. Thanks!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
System information
Yes; modified the inference code from the tflite GPU delegate Android sample, with additional code from https://www.tensorflow.org/lite/performance/gpu_advanced#android_2.
Describe the current behavior
The tensorflow lite GPU delegate documentation provides sample code for running tflite inference efficiently on Android, avoiding CPU-GPU memory copying with the help of OpenGL and an SSBO in an EGL context. However, this method does not seem to give any performance gains; rather, it degraded inference speed. The documentation mentions a method, 'interpreter.runInference(null, outputArray)', for running inference in this case. Is this method the same as the basic run method, i.e. interpreter.run(inputTensor, outputTensor)? (There seems to be no method called 'interpreter.runInference' in the current API.) Is the suggested method currently supported in the experimental GPU delegate API (i.e. accessing the input image directly from an OpenGL SSBO for inference)? How can we ensure that the model takes its input from this SSBO in GPU memory?
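The flow described above amounts to roughly the following (pseudocode assembled from the gpu_advanced guide and this thread; the identifiers are illustrative, and the bind step is the v1 GpuDelegate buffer-binding API that, per the discussion below, is not carried over to the v2 delegate):

```
// 1. In the EGL context that owns the SSBO, create and size the buffer.
ssbo = glGenBuffers()
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo)
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeBytes, null, GL_STREAM_COPY)

// 2. Run a small compute shader that copies the camera texture into the SSBO.

// 3. Point the delegate's input tensor at the SSBO *before* applying the
//    delegate, then apply it.
gpuDelegate.bindGlBufferToTensor(interpreter.getInputTensor(0), ssbo)
interpreter.modifyGraphWithDelegate(gpuDelegate)

// 4. Run with a null input: the data already lives on the GPU, so no
//    ByteBuffer is uploaded from the CPU.
interpreter.run(null, outputArray)
```

If step 3 is skipped or fails silently, the interpreter still runs but reads an unbound buffer, which matches the all-zero output reported in this thread.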
Expected behaviour
The tflite inference using an OpenGL SSBO should be faster than the basic GPU delegate inference, where data is copied every time from CPU to GPU.
Other info / logs
We measured the time for the 'tflite.run' method in Android Studio. The input was in the recommended ByteBuffer format.
Error: Cannot resolve method runInference(null, ?)