
Conversation

Collaborator

@Linchenn Linchenn commented Feb 16, 2023

Originally, we used idyCVal and idyCVal2 (each either 0.0 or 1.0) to indicate whether result.xy and result.zw are valid, so that result.xy * idyCVal and result.zw * idyCVal2 could return the results without checking through if-branches.

If the stride is 1, both idyCVal and idyCVal2 are always 1.0 and the original solution performs well. However, if the stride is 2, only one of idyCVal and idyCVal2 is 1.0 at any time, which means computing either result.xy or result.zw is a waste of time.

As a result, this PR adds if-branches to check idyCVal and idyCVal2 before computing, instead of always computing result.xy * idyCVal and result.zw * idyCVal2. This improves Conv2DBackpropInput ops in ArPortraitDepth by ~40%, and the ArPortraitDepth model improves from 23ms to 18.5ms (benchmarked on a LENOVO P620 2021).
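The shape of the change can be sketched as follows. This is a hypothetical, simplified rendering of the generated GLSL (the real shader in conv_backprop_packed_gpu.ts differs in detail), expressed as the TypeScript template strings tfjs uses to assemble shader source:

```typescript
// Hypothetical sketch of the two code-generation strategies; names and
// conditions are simplified and NOT the actual tfjs-backend-webgl source.

// Before: always compute both halves, masking invalid ones by
// multiplying with the 0.0/1.0 flags.
function maskedSource(outChannels: number): string {
  return `
  for (int d2 = 0; d2 < ${outChannels}; d2 += 2) {
    vec4 wValue = getW(wRPerm, wCPerm, d1, d2);
    result.xy += idyCVal * dot(xValue, wValue.xy);   // wasted work if idyCVal == 0.0
    result.zw += idyCVal2 * dot(xValue, wValue.zw);  // wasted work if idyCVal2 == 0.0
  }`;
}

// After: branch on the flags first, so a stride-2 invocation (where only
// one flag is 1.0) skips the half that would be masked to zero anyway.
function branchedSource(outChannels: number): string {
  return `
  if (idyCVal > 0.5 && idyCVal2 > 0.5) {
    for (int d2 = 0; d2 < ${outChannels}; d2 += 2) {
      // compute both result.xy and result.zw
    }
  } else if (idyCVal > 0.5) {
    for (int d2 = 0; d2 < ${outChannels}; d2 += 2) {
      // compute only result.xy
    }
  } else if (idyCVal2 > 0.5) {
    for (int d2 = 0; d2 < ${outChannels}; d2 += 2) {
      // compute only result.zw
    }
  }`;
}

console.log(branchedSource(64).includes('d2 < 64'));  // prints true
```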

Before the PR:
[benchmark screenshot]

After the PR:
[benchmark screenshot]

Reference #7371.


Collaborator Author

@Linchenn Linchenn left a comment


Reviewable status: 0 of 1 approvals obtained


tfjs-backend-webgl/src/conv_backprop_packed_gpu.ts line 78 at r1 (raw file):

            if (idyCVal && idyCVal2) {
              for (int d2 = 0; d2 < ${convInfo.outChannels}; d2 += 2) {
                vec4 wValue = getW(wRPerm, wCPerm, d1, d2);

Even though the three branches contain the same code, we still should not move the if-branches into the for loop. Otherwise the performance drops (~10% for the Conv2DBackpropInput op), probably because the if-branches would be executed on every iteration (and threads would likely be stalled/synced in each iteration).

Code quote:

              for (int d2 = 0; d2 < ${convInfo.outChannels}; d2 += 2) {
                vec4 wValue = getW(wRPerm, wCPerm, d1, d2);
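The placement distinction described in this comment can be illustrated with simplified, hypothetical GLSL fragments (not the actual shader source), held as TypeScript strings:

```typescript
// Hypothetical simplified fragments; the real generated shader differs.

// Branch hoisted outside the loop: the flag is evaluated once per invocation.
const branchOutside = `
if (idyCVal) {
  for (int d2 = 0; d2 < N; d2 += 2) {
    // accumulate result.xy
  }
}`;

// Branch inside the loop: the flag is re-evaluated every iteration, and
// divergent threads may be stalled/synced on each pass; the comment above
// measured roughly a 10% slowdown for Conv2DBackpropInput with this form.
const branchInside = `
for (int d2 = 0; d2 < N; d2 += 2) {
  if (idyCVal) {
    // accumulate result.xy
  }
}`;

// In the hoisted version the check precedes the loop.
console.log(branchOutside.indexOf('if (') < branchOutside.indexOf('for ('));  // prints true
```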

@Linchenn Linchenn requested review from pyu10055 and qjia7 February 16, 2023 19:42
Collaborator

@pyu10055 pyu10055 left a comment


great work, thanks!

Reviewable status: :shipit: complete! 1 of 1 approvals obtained (waiting on @qjia7)

Contributor

@qjia7 qjia7 left a comment


LGTM, thanks!

Collaborator Author

Linchenn commented Feb 17, 2023

Benchmark on Samsung Galaxy S22 Ultra:

Before this PR (with packed ConvBackpropInput):
[benchmark screenshot]

After this PR:
ConvBackpropInput op's inference time drops (the top-1 op drops from 5.05 to 4.10)
[benchmark screenshot]

@Linchenn Linchenn merged commit 7921dd5 into tensorflow:master Feb 17, 2023
Contributor

qjia7 commented Feb 20, 2023

I just saw a big regression on Intel devices (both CFL and ADL) with this PR.

Before (CFL): [op and model benchmark screenshots]
After (CFL): [op and model benchmark screenshots]

Before (ADL): [op and model benchmark screenshots]
After (ADL): [op and model benchmark screenshots]

Contributor

qjia7 commented Feb 20, 2023

It's surprising that there is such a big regression on Intel devices, since a similar algorithm works well for WebGPU. It may be a driver bug if it only happens on Intel devices; I can report an issue to our driver team. But it would be good if we could skip this optimization for Intel devices at the TFJS level temporarily.
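A gate like the one suggested could be sketched as below. The helper and its name are hypothetical, not an actual tfjs API; in a browser the renderer string is typically obtained via the standard WEBGL_debug_renderer_info extension:

```typescript
// Hypothetical helper (not real tfjs code): decide whether to use the
// branched Conv2DBackpropInput shader variant based on the GPU renderer
// string. In a browser the string can be read with:
//   const ext = gl.getExtension('WEBGL_debug_renderer_info');
//   const renderer = gl.getParameter(ext.UNMASKED_RENDERER_WEBGL);
function shouldUseBranchedConvBackprop(renderer: string): boolean {
  // Skip the optimization on Intel GPUs until the suspected driver
  // bug is resolved.
  return !renderer.toLowerCase().includes('intel');
}

console.log(shouldUseBranchedConvBackprop(
    'ANGLE (NVIDIA GeForce RTX 3080 Direct3D11)'));       // prints true
console.log(shouldUseBranchedConvBackprop(
    'ANGLE (Intel(R) UHD Graphics 630 Direct3D11)'));      // prints false
```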
