
Tensorflow lite gpu delegate inference using opengl and SSBO in android #26297

Closed
anilsathyan7 opened this issue Mar 3, 2019 · 106 comments

Labels: comp:lite (TF Lite related issues), stale (issue/PR is stale - to be closed automatically if no activity), stat:awaiting response (Status - Awaiting response from author), type:bug (Bug)

Comments

@anilsathyan7

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes, modified inference code from tflite gpu delegate android sample with additional code from https://www.tensorflow.org/lite/performance/gpu_advanced#android_2.
  • OS Platform and Distribution : Android 8.0.0
  • Mobile device: OnePlus 3
  • TensorFlow version: 12.0

Describe the current behavior
The TensorFlow Lite GPU delegate documentation provides sample code for running tflite inference efficiently on Android, avoiding CPU-GPU memory copies with the help of OpenGL and an SSBO in an EGL context. However, this method does not seem to give any performance gains; rather, it degraded inference speed. The documentation mentions a method, 'interpreter.runInference(null, outputArray)', for running the inference in this case. Is this method the same as the basic run method, i.e. interpreter.run(inputTensor, outputTensor)? (There seems to be no method in the current API called 'interpreter.runInference'.) Is the suggested method (accessing the input image directly from an OpenGL SSBO for inference) currently supported in the experimental GPU delegate API? How can we ensure that the model takes its input from this SSBO in GPU memory?

Expected behaviour
The tflite inference using an OpenGL SSBO should be faster than the basic GPU delegate inference, where data is copied every time from CPU to GPU.

Other info / logs
We measured the time for the 'tflite.run' method in Android Studio. The input was in the recommended ByteBuffer format.

Error: Cannot resolve method runInference(null, ?)
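
For reference, this is roughly the flow we pieced together from the gpu_advanced page (a sketch, not verified to be correct; interpreter.run is used in place of the runInference call from the docs, and names such as tfliteModel, inputSizeInBytes and outputSize are just placeholders):

    // Create the SSBO while an EGL context is current on this thread.
    int[] id = new int[1];
    GLES31.glGenBuffers(id.length, id, 0);
    GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, id[0]);
    GLES31.glBufferData(GLES31.GL_SHADER_STORAGE_BUFFER, inputSizeInBytes, null, GLES31.GL_STREAM_COPY);
    GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, 0);  // unbind
    int inputSsboId = id[0];

    // Bind the SSBO to the input tensor before installing the delegate.
    Interpreter interpreter = new Interpreter(tfliteModel);
    Tensor inputTensor = interpreter.getInputTensor(0);
    GpuDelegate gpuDelegate = new GpuDelegate();
    gpuDelegate.bindGlBufferToTensor(inputTensor, inputSsboId);
    interpreter.modifyGraphWithDelegate(gpuDelegate);

    // Each frame: fill the SSBO on the GPU, then run with a null input so the
    // interpreter reads from the bound buffer instead of a CPU ByteBuffer.
    float[][] outputArray = new float[1][outputSize];
    interpreter.run(null, outputArray);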

@ymodak ymodak added the comp:lite TF Lite related issues label Mar 4, 2019
@ymodak ymodak self-assigned this Mar 4, 2019
@ymodak ymodak added the type:bug Bug label Mar 4, 2019
@ymodak ymodak added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 4, 2019
@ymodak ymodak removed their assignment Mar 4, 2019
@impjdi
Contributor

impjdi commented Mar 4, 2019

@anilsathyan7

Thanks for trying out the GPU delegate.

Can you provide a little bit more context in terms of timing, i.e. how many milliseconds/seconds was it before and after?

What kind of network are you using? Specifically, are all ops supported?

Have you written a custom shader code to copy camera texture into SSBO, or are you just dumping CPU memory to SSBO by yourself? If it's the former, you're doing things right and it should get faster. If it's the latter, it's only going to get slower.

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 5, 2019
@anilsathyan7
Author

anilsathyan7 commented Mar 6, 2019

Model: Similar to the official TF-Lite segmentation model (model inference graph attached as an image). The last three additional nodes are not supported by the GPU delegate, it seems. The input image size is 129*129.

Phone: OnePlus 3, GPU: Adreno 530

Timings:-
CPU Inference: 60-70 ms
GPU Inference: 40-50 ms
GPU Inference (SSBO): 80-90 ms

i.e. the time for executing the 'interpreter.run()' method.

Here is the method that we used to copy camera texture into SSBO:-

//Initialise SSBO
public int[] initializeShaderBuffer(){
    android.opengl.EGLContext eglContext = eglGetCurrentContext();
    int[] id = new int[1];
    GLES31.glGenBuffers(id.length, id, 0);
    GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, id[0]);
    GLES31.glBufferData(GLES31.GL_SHADER_STORAGE_BUFFER, mWidth * mHeight, null, GLES31.GL_STREAM_COPY);
    GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, 0);// unbind
    return id;
}
int inputSsboId = initializeShaderBuffer()[0];

// After that, every time a frame is available OR in onDrawFrame(), call
fillSsboWithCameraImageTexture(inputSsboId, data);

// (Note: data is just the camera frame ByteBuffer)

// Fill the SSBO with the camera image

private int fillSsboWithCameraImageTexture(int inputSsboId, ByteBuffer cameraFrame) {
    GLES31.glBufferData(GLES31.GL_SHADER_STORAGE_BUFFER, mWidth * mHeight, cameraFrame, GLES31.GL_STREAM_COPY);
    return inputSsboId;
}

[Attached image: model inference graph, 129_80k_dm05]

Can the same 'Interpreter.run()' method handle both normal input from the CPU and an SSBO? Or are there other options/functions for running the inference in this case?

@impjdi
Contributor

impjdi commented Mar 21, 2019

@anilsathyan7

Apologies for the delayed response. For some reason, I just got this in my inbox >_<

Quick question re: your code:

Doesn't it have to be

GLES31.glBufferData(GLES31.GL_SHADER_STORAGE_BUFFER, 3 * mWidth * mHeight, null, GLES31.GL_STREAM_COPY);

?

Also, do you have the luxury of making the input SSBO of shape 1x129x129x4? Then you could eliminate one hidden memcpy inside.

From the graph you shared (btw, nice visualization; appreciate that), it indeed looks like everything would be handled until the last ResizeBilinear. Its shape is also not too bad (129x129x2) in terms of, e.g., having too many channels, so I wouldn't expect any slowdown.

Did you properly call BindGlBufferToTensor before ModifyGraphWithDelegate? Can you share the shader code that converts your texture to SSBO? I was doing something like:

   #version 310 es
   layout(local_size_x = 16, local_size_y = 16) in;
   layout(binding = 0) uniform sampler2D input_texture;
   layout(std430) buffer;
   layout(binding = 1) buffer Output { float elements[]; } output_data;
   void main() {
     ivec2 gid = ivec2(gl_GlobalInvocationID.xy);
     if (gid.x >= 224 || gid.y >= 224) return;
     vec3 pixel = texelFetch(input_texture, gid, 0).xyz;
     int linear_index = 3 * (gid.y * 224 + gid.x);
     output_data.elements[linear_index + 0] = pixel.x;
     output_data.elements[linear_index + 1] = pixel.y;
     output_data.elements[linear_index + 2] = pixel.z;
   }

for MobileNet. Might not be directly applicable, but you roughly get the idea...

@impjdi
Contributor

impjdi commented Mar 28, 2019

Not officially announced yet, but FYI: GPU code is now visible at:

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu

if you need the code for better insight what is happening.

@anilsathyan7
Author

Hi @impjdi ,
Can you just share the sample classification app using SSBO, or at least the OpenGL-related code?
We used the following shader code based on your inputs. But we encountered some errors related to the shader version, which we could not resolve, being OpenGL beginners.

#version 310 es
layout(local_size_x = 16, local_size_y = 16) in;
layout(binding = 0) uniform sampler2D u_Texture0;
layout(std430) buffer;
layout(binding = 1) buffer Output { float elements[]; } output_data;
void main() {
    ivec2 gid = ivec2(gl_GlobalInvocationID.xy);
    if (gid.x >= 257 || gid.y >= 257) return;
    vec3 pixel = texelFetch(u_Texture0, gid, 0).xyz;
    int linear_index = 3 * (gid.y * 257 + gid.x);
    output_data.elements[linear_index + 0] = pixel.x;
    output_data.elements[linear_index + 1] = pixel.y;
    output_data.elements[linear_index + 2] = pixel.z;
}
mTextureUniformHandle0 = GLES31.glGetUniformLocation(mProgramHandle, "u_Texture0");

// Set the active texture unit to texture unit 0.
GLES31.glActiveTexture(GLES31.GL_TEXTURE0);

// Bind the texture to this unit.
GLES31.glBindTexture(GLES31.GL_TEXTURE_2D, mTextureDataHandle0);

// Tell the texture uniform sampler to use this texture in the shader by
// binding to texture unit 0.
GLES31.glUniform1i(mTextureUniformHandle0, 0);

public int[] initializeShaderBuffer() {
    android.opengl.EGLContext eglContext = eglGetCurrentContext();
    int[] id = new int[1];
    GLES31.glGenBuffers(id.length, id, 0);
    GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, id[0]);
    GLES31.glBufferData(GLES31.GL_SHADER_STORAGE_BUFFER, 257 * 257 * 3 * 4, null, GLES31.GL_STREAM_COPY);
    GLES31.glBindBufferBase(GLES31.GL_SHADER_STORAGE_BUFFER, 1, id[0]);
    GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, 0); // unbind
    return id;
}
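
For completeness, this is roughly how the compute shader above is compiled and linked (a sketch; mProgramHandle is assumed to come from something like this, computeShaderSource holds the GLSL string, and the shader-version errors we hit show up in the info logs):

    int shader = GLES31.glCreateShader(GLES31.GL_COMPUTE_SHADER);
    GLES31.glShaderSource(shader, computeShaderSource);
    GLES31.glCompileShader(shader);
    int[] compiled = new int[1];
    GLES31.glGetShaderiv(shader, GLES31.GL_COMPILE_STATUS, compiled, 0);
    if (compiled[0] == 0) {
        Log.e("SSBO", "Compile failed: " + GLES31.glGetShaderInfoLog(shader));
    }

    int mProgramHandle = GLES31.glCreateProgram();
    GLES31.glAttachShader(mProgramHandle, shader);
    GLES31.glLinkProgram(mProgramHandle);
    int[] linked = new int[1];
    GLES31.glGetProgramiv(mProgramHandle, GLES31.GL_LINK_STATUS, linked, 0);
    if (linked[0] == 0) {
        Log.e("SSBO", "Link failed: " + GLES31.glGetProgramInfoLog(mProgramHandle));
    }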

@impjdi
Contributor

impjdi commented Apr 1, 2019

@anilsathyan7

I am out of office on vacation this week with limited network access and there's a good chance I'll forget about this. Could you please nudge me again next week?

@anilsathyan7
Author

anilsathyan7 commented Apr 1, 2019

Sure porygon ...😉

@anilsathyan7
Author

Hi @impjdi ,
Can you help us with the SSBO tflite inference issue? We could not run the tflite inference using an SSBO in Android. Can you just share the sample classification app using SSBO, or at least the OpenGL-related code? How much speedup can we expect in this scenario?

@ktgordon

ktgordon commented Apr 9, 2019

Hi @impjdi ,
I'll second a request for a demo illustrating SSBO inference.

Maybe I should open a separate issue... We're attempting to use a GLSurfaceView in our app, alongside the tflite GPUDelegate. Our renderer works fine until interpreter.modifyGraphWithDelegate(delegate); is called, which results in a black screen. No glErrors are produced. It's difficult to understand how commenting/uncommenting the above line changes the behaviour, even after looking at the newly released GPU delegate source.

A working example might clear things up...

Thank you!

@impjdi
Contributor

impjdi commented Apr 9, 2019

@anilsathyan7

Heh, I missed the porygon part earlier :)

The below is in C++, but should be similar in Java too.

    glActiveTexture(GL_TEXTURE0 + 0);
    glBindTexture(GL_TEXTURE_2D, /*your gl texture that has the image*/);
    glBindBufferRange(GL_SHADER_STORAGE_BUFFER, 1, /*your ssbo*/, 0, /*size in bytes*/);
    glUseProgram(/*the program above*/);
    glDispatchCompute(width / 16, height / 16, 1);  // these are work group sizes
    glBindBuffer(GL_SHADER_STORAGE_BUFFER, 0);  // unbind
    glBindTexture(GL_TEXTURE_2D, 0);  // unbind
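
A rough Java equivalent of the above (an untested sketch; the glMemoryBarrier call is an extra step that is likely needed so the SSBO writes are visible before inference, and names like cameraTextureId and sizeInBytes are placeholders):

    GLES31.glActiveTexture(GLES31.GL_TEXTURE0);
    GLES31.glBindTexture(GLES31.GL_TEXTURE_2D, cameraTextureId);   // your GL texture with the image
    GLES31.glBindBufferRange(GLES31.GL_SHADER_STORAGE_BUFFER, 1, inputSsboId, 0, sizeInBytes);
    GLES31.glUseProgram(mProgramHandle);                           // the compute program above
    GLES31.glDispatchCompute(width / 16, height / 16, 1);          // work group counts
    GLES31.glMemoryBarrier(GLES31.GL_SHADER_STORAGE_BARRIER_BIT);  // make writes visible to later reads
    GLES31.glBindBuffer(GLES31.GL_SHADER_STORAGE_BUFFER, 0);       // unbind
    GLES31.glBindTexture(GLES31.GL_TEXTURE_2D, 0);                 // unbind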

@impjdi
Contributor

impjdi commented Apr 9, 2019

@ktgordon

Hm, the only official example code is the TFLite demo app that is in the TF repository. As an Android app consists of a lot more than just a single Java file, that'd be difficult unless I start up a whole new git repo with the files. Unfortunately, on top of that, I'm not a real mobile app developer; I do most of my stuff in Android C++ without cameras. I'll see whether I can cook up a C++ binary that can do all this in a single C++ file =/ That discussion aside...

modifyGraphWithDelegate hanging sounds like you have an issue somewhere else. Make sure that your TfLiteGpuDelegateBindBufferToTensor is called before modifyGraphWithDelegate, and that your SSBO is already created. The flow of the program with modifyGraphWithDelegate is as follows:

Interpreter.modifyGraphWithDelegate (Java)
Interpreter::ModifyGraphWithDelegate (C++)
tflite::gpu::gl::(anonymous)::DelegatePrepare (C++)
tflite::gpu::gl::(anonymous)::Delegate::Prepare (C++)

You can probably trace back what is causing the hanging.

@impjdi
Contributor

impjdi commented Apr 20, 2019

@anilsathyan7

Did things work out? Can this issue be closed?

@anilsathyan7
Author

The code is working fine, but we are not able to get correct output when using the SSBO as input. The output seems to be black (i.e. the output is all zeroes). We are not able to verify that data is correctly copied into the SSBO or that it is correctly accessed by TensorFlow, even though it runs without errors. There seems to be no way to debug and inspect shader code (GLSL) on Android.

@SanthoshRajendiran

SanthoshRajendiran commented Apr 22, 2019

Attached are the logfiles containing the errors we got when trying to use an SSBO with the tflite model.
The code runs without errors on phones with an Adreno GPU, but no output is visualized. On phones with a Mali GPU, there are issues even before the model comes into the picture.

The errors vary between Mali devices, whereas on Adreno devices the output is simply not visualized.
The devices used in the testing below are:

Mali (Error logs are attached with the issue: mali-gpu-ssbo-errorlog.txt)
Samsung A8+
Honor Play
Moto C plus

Adreno (Error Logs are attached: adreno-gpu-ssbo-errorlog.txt)
Poco F1

mali-gpu-ssbo-errorlog.txt

adreno-gpu-ssbo-errorlog.txt

@impjdi Could you have a look at it? It would be better if you could share the working app code with us for reference.

@SanthoshRajendiran

@impjdi Any updates on SSBO?

@gnsmrky

gnsmrky commented May 8, 2019

(Quoting @ktgordon's earlier comment above requesting a demo illustrating SSBO inference and describing the modifyGraphWithDelegate black-screen issue.)

@ktgordon Have you found a resolution/workaround for this issue? I am experiencing exactly the same problem. After calling modifyGraphWithDelegate(), all glDraw calls result in black. It doesn't even require associating an SSBO buffer with TFLite tensors. This is strange. Taking a deeper look as well.

@ktgordon

ktgordon commented May 8, 2019

We did find a workaround. I'm assuming you're using the Java API and bringing in gpu delegates via
implementation 'org.tensorflow:tensorflow-lite:0.0.1-gpu-experimental'

What I think is happening is that modifyGraphWithDelegate() modifies the current context so that our display surface is no longer current... which wouldn't be a problem if we had access to our original state variables. However, since we originally tried using GLSurfaceView, we didn't have access to any of those variables. In effect, modifyGraphWithDelegate made changes to the GL state that we couldn't recover from.

Switching from GLSurfaceView to TextureView gave us more control at the cost of more complexity. We created a dummy context, initialized our interpreter and called modifyGraphWithDelegate(), then created a new context shared with the dummy context. This way we could make our display surface current and render to it.

Managing the EGL context was handled by reusing code from Grafika.

This got us past the black screen problem, anyway...
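
Roughly, the context setup looks like this (a sketch using raw EGL14 rather than our actual Grafika-based code; error checking is omitted and names like textureView are placeholders):

    EGLDisplay display = EGL14.eglGetDisplay(EGL14.EGL_DEFAULT_DISPLAY);
    int[] version = new int[2];
    EGL14.eglInitialize(display, version, 0, version, 1);
    int[] configAttribs = {
            EGL14.EGL_RENDERABLE_TYPE, EGLExt.EGL_OPENGL_ES3_BIT_KHR,
            EGL14.EGL_SURFACE_TYPE, EGL14.EGL_PBUFFER_BIT | EGL14.EGL_WINDOW_BIT,
            EGL14.EGL_NONE};
    EGLConfig[] configs = new EGLConfig[1];
    int[] numConfigs = new int[1];
    EGL14.eglChooseConfig(display, configAttribs, 0, configs, 0, 1, numConfigs, 0);
    int[] ctxAttribs = {EGL14.EGL_CONTEXT_CLIENT_VERSION, 3, EGL14.EGL_NONE};

    // 1. Dummy context + tiny pbuffer surface; make it current, create the SSBO,
    //    bind it to the tensor and call interpreter.modifyGraphWithDelegate(delegate).
    EGLContext dummyContext = EGL14.eglCreateContext(display, configs[0], EGL14.EGL_NO_CONTEXT, ctxAttribs, 0);
    int[] pbufferAttribs = {EGL14.EGL_WIDTH, 1, EGL14.EGL_HEIGHT, 1, EGL14.EGL_NONE};
    EGLSurface dummySurface = EGL14.eglCreatePbufferSurface(display, configs[0], pbufferAttribs, 0);
    EGL14.eglMakeCurrent(display, dummySurface, dummySurface, dummyContext);

    // 2. Rendering context shared with the dummy one, made current on the TextureView's surface.
    EGLContext renderContext = EGL14.eglCreateContext(display, configs[0], dummyContext, ctxAttribs, 0);
    EGLSurface windowSurface = EGL14.eglCreateWindowSurface(display, configs[0],
            textureView.getSurfaceTexture(), new int[]{EGL14.EGL_NONE}, 0);
    EGL14.eglMakeCurrent(display, windowSurface, windowSurface, renderContext);
    // Buffer objects and textures created in either context are visible to both.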

@gnsmrky

gnsmrky commented May 9, 2019

I am doing exactly what you said here, as I based my code on the TFLite demo (which uses TextureView). Mainly the following:

  1. Create gl context, set gl viewport, etc. Stores eglDisplay, eglSurface, eglContext.
  2. Make call to modifyGraphWithDelegate().
  3. Set the eglContext, eglSurface, eglDisplay as current using eglMakeCurrent

The draws using glDrawArrays result in black. Interestingly, if steps 1 and 2 are swapped in sequence, everything works.

The Grafika code was referenced as well.

Will try to set up a dummy context next...

@anilsathyan7
Author

Hi @ktgordon , @gnsmrky ,
Are you suggesting that ssbo method would not work with normal GLSurfaceView? What about something like GLTextureView( link1, link2)?

Finally, are you able to achieve any speedup compared to normal GPU inference? If so, can you share a basic working demo app? Just to clear things up ...

@gnsmrky

gnsmrky commented May 9, 2019

@ktgordon Just got it working! Indeed, the dummy shared context is the key to make it work. I guess the GLES context setting/switching can be a lot more complicated than one can imagine...

@anilsathyan7 I based my code on the TFLite demo, which is the main sample project that the TFLite GPU delegate page provides. This sample project uses TextureView. I don't know if SSBO works with other surface types; I would imagine it should, as eglCreateWindowSurface() takes a SurfaceView, SurfaceTexture, SurfaceHolder or Surface, according to the Android EGLSurface docs. GLTextureView from your link extends SurfaceTexture, so it should work as well.

The performance gain is significant. I was trying a 448x448 image (a larger image to amplify the copy time). The time it takes without the SSBO/Image2D copy shader is around 900ms on a Snapdragon 808. Using the copy shader, the time comes down to < 20ms!

@SanthoshRajendiran

@gnsmrky Could you share your repo, so that everyone has a better starting point for exploring SSBO?

@gnsmrky

gnsmrky commented May 15, 2019


@SanthoshRajendiran Trying to find the time to do that. The code is very messy now and unreadable. Will get it cleaned up as soon as I get spare cycles.

@impjdi
Contributor

impjdi commented Apr 13, 2020

Ah, thanks for the update and sharing!

@martin-schulze-vireso

martin-schulze-vireso commented Apr 14, 2020

I followed the official Android documentation for the GPU delegate and got stuck at the bindBuffer step, too.

I don't work in Java lands, and thus I don't know which delegate Java APIs are using, but bindGlBufferToTensor got renamed in the deprecated GL delegate, and removed in the new GPU delegate. Check out //tf/lite/delegates/gpu/gl_delegate & //tf/lite/delegates/gpu/gpu_delegate.

I checked out the current master and there is no gpu_delegate(.cc?), only a gpu_delegate_jni(.cc). Did you mean that?

Anyway, I found that TfLiteGpuDelegateBindBufferToTensor seems to be an exported symbol of the library, and we can get the native handle of the delegate, so we might be able to call that method directly from Java.

@impjdi
Contributor

impjdi commented Apr 14, 2020

Sorry, the last file should have been //tf/lite/delegates/gpu/delegate.cc. We were internally trying to use bindBuffer (without the delegate API, but with GPU-internal functions directly) and found that the new API is a bit broken, so it's not usable for this. Someone is working on fixing that. For now, if you want to use bindBuffer, I guess you are stuck with the old API, i.e. gl_delegate.

@martin-schulze-vireso

martin-schulze-vireso commented Apr 14, 2020

@impjdi Thanks for the update. Does that mean the SSBO route is currently only available with the C bindings or not at all?

@impjdi
Contributor

impjdi commented Apr 14, 2020

I haven't checked Java, but if Java has migrated to the new API (delegate.cc), your assessment is correct.

For C++, it's only available in v1 (gl_delegate.cc), but not in v2 (delegate.cc).

@brucechou1983

@impjdi is the SSBO bindBuffer issue in v2 delegate resolved?

@impjdi
Contributor

impjdi commented May 18, 2020

The current plan is not to support bindBuffer in delegate v2.

@natario1

natario1 commented May 18, 2020

@impjdi we have our image frame in GPU memory. Should we move it to the CPU just to start inference, which will move it to the GPU again? The time spent doing this would negate the benefits of GPU inference in many cases.

@brucechou1983

@impjdi Could you share any information about why bindBuffer will not be supported in delegate v2? I believe it improves end-to-end GPU inference time by eliminating memcpy operations. Did the tflite team run into some unresolvable issues, or was the decision made only by product requirements?

@impjdi
Contributor

impjdi commented May 19, 2020

There are many advanced usages of mobile GPU inference, and each of them needs helper functions like bindBuffer because it doesn't fit in the delegate framework. After adding a bunch of support for extended usages, either through helper functions or options, we decided it's no longer maintainable with the combinatoric growth, and it gives an inconsistent look even within the GPU delegates (OpenCL, OpenGL, Metal, etc.). Note that we also have to wrap all of this up with a Java API. With the majority of users wanting the GPU delegate as just a quick blackbox accelerator, we made the final decision that the delegate API will stay simple and clean. For advanced usages that support a streamlined GPU execution pipeline, we will still have example code through, e.g., MediaPipe's TfLiteInferenceCalculator. Note that it's not there yet, as it still uses the v1 delegate and thus has access to bindBuffer.

@brucechou1983

@impjdi This information is helpful. Another question: when will the MediaPipe delegate v2 integration be released? Thank you.

@impjdi
Contributor

impjdi commented May 19, 2020

Someone's working on it :)

@natario1

natario1 commented Jun 8, 2020

Has anyone managed to bind the buffer with the v2 delegate?

It seems to me that MediaPipe is already using it; see mediapipe/tflite_gpu_runner.h. This runner is used in the calculator mentioned by impjdi under the use_advanced_gpu_api_ flag. It replaces the interpreter/delegate flow and uses low-level components instead.

This is very unfriendly for those who want the SSBO utility without maintaining their own interpreter, but going deeper, the bind logic is in mediapipe/tflite_gpu_runner.cc and simply calls InferenceRunner::SetInputObject.

The v2 delegate owns an InferenceRunner itself, so maybe a small patch to the v2 delegate could add the required SetInputObject (or output) call. But I haven't tested it; setting this up would be hard for me at the moment.

@impjdi, any word of guidance would be helpful here. Is this correct? Can we simply patch the v2 delegate with an InferenceRunner::SetInputObject call and invoke it instead of the v1 bindBuffer? I may not be on the right track, but I do think it would be very useful to the community if we could come up with a patch file and share it here.

@brucechou1983

brucechou1983 commented Jun 9, 2020

@natario1 I think @impjdi explained that the bindBuffer APIs don't fit the v2 delegate design. The key difference between the v1 and v2 delegates is that v2 supports both OpenCL and OpenGL backends, while v1 only supports OpenGL. This affects how TFLite handles data ownership exchange. Moreover, many devices on the market don't fully support OpenCL-OpenGL interoperability. I've also tried the use_advanced_gpu_api_ flag in MediaPipe; the app crashes when I turn it on. So I think it's not a trivial patch for the v2 delegate to support the bindBuffer features. If you need this feature, I think the simplest solution is to stick to MediaPipe with the OpenGL backend.

@natario1

natario1 commented Jun 9, 2020

Thanks for your comment @brucechou1983. A simpler solution for me is to stick to the v1 delegate, but to be honest it doesn't seem like the MediaPipe runner is doing anything complex/fancy, other than calling InferenceRunner::SetInputObject and InferenceRunner::SetInputObjectDef when preparing. I understand that it might not be ready yet, though, as it is behind a flag.

The v2 delegate also does the same object/objectdef calls, but the difference is that it uses ObjectType::CPU_MEMORY instead of ObjectType::OPENGL_SSBO like MediaPipe does.

I don't know what OpenCL support is like on Android, but OpenGL works just fine, so we could have a flag in the v2 delegate options that tells the delegate not to try OpenCL and go with OpenGL. It's something the TF team could add to ease the v1-to-v2 transition, I think, since people who were using v1 likely have an SSBO set up.

@brucechou1983

@natario1 If a flag for using only OpenGL is what you need, it's already there, though it's still experimental. You can set the flag TFLITE_GPU_EXPERIMENTAL_FLAGS_GL_ONLY.

However, when you need realtime (>>30fps) semantic segmentation and/or face mesh running on a $200 phone, choosing the right GPU backend in the tflite runtime for efficient execution is really not a trivial problem. I do see the value of using OpenCL for some Mali GPU devices: the invoke() execution is 2x-3x faster than with OpenGL ES. Although I have to copy the data to/from the tensors, the overall performance is still better. I think the tflite team is trying to design the v2 delegate as a blackbox accelerator that is general-purpose, works on arbitrary IoT devices and is easy to use, while creating interfaces for other frameworks like MediaPipe to optimize for specific usages such as streamlined GPU execution on mobile/desktop.

@impjdi
Contributor

impjdi commented Jun 10, 2020

@natario1 I see you did your homework there, good job 👍

You might have noticed, but TFLite is adding a bunch of delegates for various accelerators and APIs. Each of them having custom helper functions didn't help usage; it just makes things more confusing for the 99% of users who want the TFLite GPU delegate as a magic box doing GPU-accelerated inference. So the final decision we made was to keep the TFLite GPU delegate as simple as possible, but leave the room open for advanced users who want to do really performant things.

The teams that deliver TFLite GPU and MediaPipe are sister teams sharing one manager. Having said that, TFLite GPU won't break MediaPipe, and that's a guarantee. In that sense, going deeper and using advanced internal APIs like InferenceRunner::SetInputObject the way MediaPipe uses them is safe. Of course, because it's not the public API but an advanced internal one, there might be API changes that break you every once in a while, but you will always have MediaPipe's reference implementation.

@natario1

natario1 commented Jun 10, 2020

I understand the situation, @impjdi. Would you consider something like V2Delegate::GetInferenceRunner(), so that we can call InferenceRunner::SetInputObject or whatever else from outside the delegate? This makes all the difference, because we'd still have to do our homework for integration and maintenance, but at least we wouldn't have to fork TensorFlow or use a bazel patch, which is honestly a big burden, although MediaPipe helps.

You say that the SetInput/OutputObject and SetInput/OutputObjectDef APIs are "advanced", and they are to some extent, but at the same time it makes perfect sense that to bind a tensor to "something", one has to specify its data layout, size, object type and so on. They're actually very elegant and easy to understand compared to BindGlBufferToTensor, which, from my point of view, was just doing obscure magic under the hood that I couldn't really grasp.

These APIs would also be hidden behind the GetInferenceRunner() API, which you could document as a "use at your own risk" function, and keep the black-box surface clean. I think this approach would really "leave the room open", as you say. (Maybe it would be more work for you than just adding a getter for the inference runner, but you get the point - being able to control the delegate objects from outside.)

Apart from this, I'll try to use these low-level APIs this weekend and see if I manage to get v2 working. Thanks for helping!

Edit: After spending the weekend on it I realized this suggestion was not possible, but I hope you can consider something like what I ended up doing which is clean and keeps the delegate header untouched.

@natario1

natario1 commented Jun 14, 2020

@impjdi any suggestions on how to fix this error? It seems to be an issue with the BHWC > BHWC4 conversion, but I have no clue how to address it. It happens in ToTensorConverter.

E/tflite:
    TfLiteGpuDelegate Invoke: Missing output in converter
    Node number 1 (TfLiteGpuDelegateV2) failed to invoke.

I create the object def and tensor object as follows:

// object def
tflite::gpu::ObjectDef object_def;
object_def.data_type = tflite::gpu::DataType::FLOAT32;
object_def.data_layout = tflite::gpu::DataLayout::BHWC;
object_def.object_type = tflite::gpu::ObjectType::OPENGL_SSBO;
object_def.user_provided = true;

// tensor object
tflite::gpu::OpenGlBuffer tensor_object;
tensor_object.id = ssbo;

Then I pass both to the delegate before ModifyGraphWithDelegate. They are correctly passed to the inference runner and the runner builder; however, I get that converter error.

The TF version is 2.2.0, and the model I am using is extremely simple: it takes a 400x400x1 image and calculates the average intensity, returning a single float. I am trying to use an SSBO object for the input only.

Also, I'm running the OpenGL backend; OpenCL is not available on my phone.

@natario1

natario1 commented Jun 14, 2020

After many hours, I think I hit a bug that is still present in 2.2.0, but was fixed in master by these commits: 4000a5c dffe6a0

For those who are interested, in short: the fact that I'm using BHWC with 1 color channel (instead of 4) requires the GL engine to do a conversion, and this conversion (before 4000a5c and dffe6a0) is completely broken, because user_provided is hardcoded to true (https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/lite/delegates/gpu/gl/api2.cc#L595), but when user_provided is true, the engine will not bother to create the output GL buffer (https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/lite/delegates/gpu/gl/api2.cc#L199-L202), so the C->C4 conversion can't happen.

By cherry-picking 4000a5c and dffe6a0 into v2.2.0 and exposing the necessary APIs, I'm able to do SSBO I/O with the v2 delegate. These commits are pretty old, so I hope they can make it into the next release.

These are the changes I had to make to expose the necessary APIs: deepmedia@7401fbb . I don't know C++ so there might be errors, but the point is to create an interface that the V2 delegate extends. This interface can be retrieved from the delegate using a separate C++ header (delegate_core.h) so the high-level delegate is still a black box.

@sushreebarsa
Contributor

@anilsathyan7 Could you please try the latest stable version of TF (2.5 or 2.4.1) and let us know if this is still an issue. Thanks!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Jun 28, 2021
@google-ml-butler

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler google-ml-butler bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Jul 5, 2021
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

