@Linchenn
Collaborator

@Linchenn Linchenn commented Feb 4, 2023

With this PR:
The Conv2DBackpropInput op is accelerated by ~2x, which partially fixes #5197.

| Model | Before (ms) | After (ms) |
| --- | --- | --- |
| BlazePoseDetector | 23.1 | 17.5 |
| ArPortraitDepth | 52.3 | 23.9 (also set EXP_CONV to true) |

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.



@Linchenn Linchenn requested review from pyu10055 and qjia7 February 4, 2023 00:55
Contributor

@qjia7 qjia7 left a comment

Clever algorithm and great work!

Collaborator

@pyu10055 pyu10055 left a comment

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @Linchenn)


tfjs-backend-webgl/src/conv_backprop_packed_gpu.ts line 55 at r1 (raw file):

        // Initialize dotProd with a small epsilon; this seems to reduce GPU accuracy loss.
        vec4 dotProd = vec4(0.000000000000001);

Use a JS constant to interpolate this value.
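One way to read this suggestion, as a minimal sketch: keep the epsilon in a named constant and interpolate it into the shader source string. The names `DOT_PROD_EPSILON`, `toGlslFloat`, and `buildDotProdInit` are hypothetical and do not appear in the PR; only the `vec4 dotProd = vec4(...)` line comes from the quoted shader.

```typescript
// Hypothetical constant name; the value is copied from the shader line under review.
const DOT_PROD_EPSILON = 0.000000000000001;

// Formats a number so that it always reads as a float literal in shader source:
// integers get a trailing ".0"; values that already contain "." or an exponent pass through.
function toGlslFloat(value: number): string {
  const text = value.toString();
  return /[.eE]/.test(text) ? text : `${text}.0`;
}

// Builds the initialization line by interpolating the constant into the shader source.
function buildDotProdInit(epsilon: number): string {
  return `vec4 dotProd = vec4(${toGlslFloat(epsilon)});`;
}

console.log(buildDotProdInit(DOT_PROD_EPSILON));
```

With this shape the magic number lives in one place on the JS side and can be changed or removed without editing the shader text.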


tfjs-backend-webgl/src/conv_backprop_packed_gpu.ts line 90 at r1 (raw file):

              dotProd.xy += vec2(dot(dyValue, wValue.xy), dot(dyValue, wValue.zw)) * idyCVal;

              dySample = getDy(batch, idyR, idyC2, d2);

It might be good to group all the memory accesses together.


tfjs-backend-webgl/src/conv_backprop_packed_gpu.ts line 90 at r1 (raw file):

              dotProd.xy += vec2(dot(dyValue, wValue.xy), dot(dyValue, wValue.zw)) * idyCVal;

              dySample = getDy(batch, idyR, idyC2, d2);

Could idyC2 be the same as idyC? If so, only read again when idyC2 != idyC.
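The idea can be sketched outside the shader as follows. This is a simplified TypeScript stand-in, not the GLSL from the PR: `getDy` here is a hypothetical 2-D lookup that counts fetches (the real `getDy` also takes `batch` and `d2`), and `sampleDyPair` is an illustrative name.

```typescript
// Hypothetical stand-in for the shader's texture fetch; counts how often it is called.
let fetchCount = 0;
function getDy(dy: number[][], row: number, col: number): number {
  fetchCount++;
  return dy[row][col];
}

// Reads the sample at idyC, and fetches again only when idyC2 points at a different column.
function sampleDyPair(
    dy: number[][], row: number, idyC: number, idyC2: number): [number, number] {
  const dySample = getDy(dy, row, idyC);
  const dySample2 = idyC2 !== idyC ? getDy(dy, row, idyC2) : dySample;
  return [dySample, dySample2];
}
```

When the two column indices coincide, the second value is reused and only one fetch is issued.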

Collaborator Author

@Linchenn Linchenn left a comment

Thank you, Ping. After tuning the shader, the inference time for the following op improves from ~2.8928 ms to ~2.6810 ms on a Linux workstation:

const x = tf.ones([1,8,6,256]);
const w = tf.ones([4,4,256,256]);

const s = (await tf.profile(() => tf.conv2dTranspose(x, w, [1, 16, 12, 256], 1, 'valid'))).kernels[0].kernelTimeMs;

Benchmarking shows that most of the improvement comes from adding a condition for reading dySample2, which on its own takes the time from ~2.8928 ms to ~2.7113 ms.
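The differences above are a few percent, and individual GPU kernel timings tend to vary from run to run, so a comparison like this is usually done on an average over repeated runs. The sketch below shows one way to summarize such samples; `meanAfterWarmup` and its parameters are hypothetical and not part of the PR or of TensorFlow.js, and how the timings above were actually aggregated is not stated in this thread.

```typescript
// Averages kernel timings (in ms) after discarding the first `warmupRuns` samples,
// which typically include shader compilation and other one-time costs.
function meanAfterWarmup(samplesMs: number[], warmupRuns: number): number {
  const timed = samplesMs.slice(warmupRuns);
  if (timed.length === 0) {
    throw new Error('No timed samples left after warm-up.');
  }
  return timed.reduce((sum, t) => sum + t, 0) / timed.length;
}
```

Each sample would be one `kernelTimeMs` reading collected with the `tf.profile` call shown earlier in this comment.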

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @mattsoulanille and @pyu10055)


tfjs-backend-webgl/src/conv_backprop_packed_gpu.ts line 55 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Use a JS constant to interpolate this value.

Removed it as we discussed, because this initial value is currently unnecessary. We can keep it in mind if we run into correctness issues later.


tfjs-backend-webgl/src/conv_backprop_packed_gpu.ts line 90 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Could idyC2 be the same as idyC? If so, only read again when idyC2 != idyC.

Done.


tfjs-backend-webgl/src/conv_backprop_packed_gpu.ts line 90 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

It might be good to group all the memory accesses together.

Done.

Collaborator

@pyu10055 pyu10055 left a comment

Reviewed 1 of 2 files at r1, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! 2 of 1 approvals obtained (waiting on @mattsoulanille)

Development

Successfully merging this pull request may close these issues.

[Perf] The time of Conv2DBackpropInput is very long in BlazePose/hand_detector models in WebGL

3 participants