[WebGL] Implement packed ScatterND #7292

Linchenn · 2023-01-20T18:55:50Z

After this PR, the 'USE-batch30' model gains a 30%~50% performance improvement. The 'Packed-ScatterND' column shows this PR's performance on 'USE-batch30'.

Specifically, on a MacBook Pro, the time that 'USE-batch30' spends in 'ScatterNd' drops from 62.84 ms to 15.01 ms.
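
For reference, the two op-level timings quoted above work out to roughly a 4.2x speedup, or about 76% less time spent in the op. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the figures are the ones reported in this description rather than new measurements.

```typescript
// Timings for 'ScatterNd' inside 'USE-batch30', as quoted in the PR description.
const beforeMs = 62.84;
const afterMs = 15.01;

// Speedup factor and relative time reduction for the op itself
// (not for the whole model).
const speedup = beforeMs / afterMs;
const reduction = (beforeMs - afterMs) / beforeMs;

console.log(
    `speedup: ${speedup.toFixed(2)}x, ` +
    `time reduction: ${(reduction * 100).toFixed(1)}%`);
```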

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.

Linchenn · 2023-01-20T19:53:12Z

tfjs-backend-webgl/src/scatter_packed_gpu.ts

+            if (flattenedIndex[0] == coords[0] || flattenedIndex[1] == coords[0] ||
+                flattenedIndex[0] == coords[0] + 1 || flattenedIndex[1] == coords[0] + 1) {
+              vec4 updVals = ${updatesSnippet};
+              if (flattenedIndex[0] == coords[0]) {
+                sum.xy += updVals.xy;
+                found.xy = vec2(1.);
+              }
+              if (flattenedIndex[1] == coords[0]) {
+                sum.xy += updVals.zw;
+                found.xy = vec2(1.);
+              }
+              if (flattenedIndex[0] == coords[0] + 1) {
+                sum.zw += updVals.xy;
+                found.zw = vec2(1.);
+              }
+              if (flattenedIndex[1] == coords[0] + 1) {
+                sum.zw += updVals.zw;
+                found.zw = vec2(1.);
+              }
+            }


The four if-branches looked likely to hurt performance, so I tried replacing them with the vectorized code below and pasted the test results:

    vec4 isMatched = 1. - vec4(bvec4(flattenedIndex[0] - coords[0],
                                     flattenedIndex[1] - coords[0],
                                     flattenedIndex[0] - coords[0] - 1,
                                     flattenedIndex[1] - coords[0] - 1));
    if (dot(isMatched, vec4(1.)) > 0.) {
      vec4 updVals = ${updatesSnippet};
      found += isMatched.xxzz + isMatched.yyww;
      sum += updVals.xyxy * isMatched.xxzz + updVals.zwzw * isMatched.yyww;
    }
    ...
    setOutput(mix(${defaultValueSnippet}, sum, vec4(bvec4(found))));

However, it does not show an obvious improvement (the performance is recorded in the 'Packed-ScatterND-vectorizedBranches' column in the PR description), so we can keep the four if-branches here, which are more readable.
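
To make the comparison concrete, here is a minimal CPU-side sketch in TypeScript that emulates one packed output texel under both forms. It is not the shader itself: `Vec4`, `Texel`, `scatterBranching` and `scatterMasked` are hypothetical names, and the update values are passed in directly instead of being read through `${updatesSnippet}`. It assumes the semantics visible in the two snippets above, namely that the branching form sets `found` to 1 while the masked form accumulates a match count that is later collapsed by `bvec4(found)`.

```typescript
// One packed texel holds four channels: x, y, z, w.
type Vec4 = [number, number, number, number];

interface Texel {
  sum: Vec4;
  found: Vec4;
}

function newTexel(): Texel {
  return {sum: [0, 0, 0, 0], found: [0, 0, 0, 0]};
}

// Branching form: mirrors the four if statements.
function scatterBranching(
    t: Texel, idx: [number, number], coord: number, upd: Vec4): void {
  if (idx[0] === coord) {      // sum.xy += updVals.xy
    t.sum[0] += upd[0]; t.sum[1] += upd[1]; t.found[0] = 1; t.found[1] = 1;
  }
  if (idx[1] === coord) {      // sum.xy += updVals.zw
    t.sum[0] += upd[2]; t.sum[1] += upd[3]; t.found[0] = 1; t.found[1] = 1;
  }
  if (idx[0] === coord + 1) {  // sum.zw += updVals.xy
    t.sum[2] += upd[0]; t.sum[3] += upd[1]; t.found[2] = 1; t.found[3] = 1;
  }
  if (idx[1] === coord + 1) {  // sum.zw += updVals.zw
    t.sum[2] += upd[2]; t.sum[3] += upd[3]; t.found[2] = 1; t.found[3] = 1;
  }
}

// Masked form: mirrors isMatched and the swizzled multiply-adds.
function scatterMasked(
    t: Texel, idx: [number, number], coord: number, upd: Vec4): void {
  const mx = idx[0] === coord ? 1 : 0;
  const my = idx[1] === coord ? 1 : 0;
  const mz = idx[0] === coord + 1 ? 1 : 0;
  const mw = idx[1] === coord + 1 ? 1 : 0;
  // found += isMatched.xxzz + isMatched.yyww
  t.found[0] += mx + my; t.found[1] += mx + my;
  t.found[2] += mz + mw; t.found[3] += mz + mw;
  // sum += updVals.xyxy * isMatched.xxzz + updVals.zwzw * isMatched.yyww
  t.sum[0] += upd[0] * mx + upd[2] * my;
  t.sum[1] += upd[1] * mx + upd[3] * my;
  t.sum[2] += upd[0] * mz + upd[2] * mw;
  t.sum[3] += upd[1] * mz + upd[3] * mw;
}
```

Both functions produce the same `sum`, and the same `found` once it is reduced to "non-zero or not", so the choice between them comes down to readability and measured speed.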

The four if statements are certainly more readable. I'm a bit surprised they don't hurt performance.

Is there also no performance difference on mobile?

Yes, the chart in the PR description includes our mobile devices.

mattsoulanille

LGTM. Nice improvement!

mattsoulanille · 2023-01-20T20:32:37Z

tfjs-backend-webgl/src/scatter_packed_gpu.ts

The four if statements are certainly more readable. I'm a bit surprised they don't hurt performance.

pyu10055

Thank you! Is this feature enabled by default and fully tested with the current tests?

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @Linchenn)

tfjs-backend-webgl/src/scatter_packed_gpu.ts line 3 at r1 (raw file):

/**
 * @license
 * Copyright 2018 Google LLC. All Rights Reserved.

2023

pyu10055

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @Linchenn and @mattsoulanille)

tfjs-backend-webgl/src/scatter_packed_gpu.ts line 76 at r7 (raw file):

              }
            }
            if (flattenedIndex[0] == coords[0] || flattenedIndex[1] == coords[0] ||

Is this check necessary, given that the branches it contains check again?

Linchenn

I have tested it locally, for both the current changes and the vectorized branch optimization, with:

yarn test --test_verbose_timeout_warnings --verbose_failures --nocache_test_results  --//:grep='scatterND'

and tested the correctness of the 'USE-30' model with the local benchmark tool.

I am running a nightly test now: https://pantheon.corp.google.com/cloud-build/builds/f79d985c-2a59-4d2c-98c6-cc904231334a?project=learnjs-174218. I will not merge until it passes.

To be safer, we could first add a 'WEBGL_PACK_SCATTERND' flag (defaulting to false)?

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @mattsoulanille and @pyu10055)

tfjs-backend-webgl/src/scatter_packed_gpu.ts line 3 at r1 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

2023

I have updated it. It's weird that Reviewable sometimes does not show the latest changes.

tfjs-backend-webgl/src/scatter_packed_gpu.ts line 76 at r7 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Is this check necessary, given that the branches it contains check again?

Yes. The line immediately following it, vec4 updVals = ${updatesSnippet};, is a read instruction, and it is executed only if this check passes. This gives a visible, though small (~2 ms for the model), improvement.

Code quote:

vec4 updVals = ${updatesSnippet};
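
The effect of the outer check can be illustrated with a small TypeScript sketch that counts how often the update read would run with and without the guard. This is a CPU-side emulation under simplifying assumptions: `countUpdateReads` and `readUpdates` are hypothetical names, the read is reduced to a counter, and real GPU behavior (for example, how neighboring invocations diverge) is not modeled.

```typescript
// Counts how many times the update read would execute while one output
// coordinate scans a list of packed index pairs.
function countUpdateReads(
    flattenedIndices: Array<[number, number]>,
    coord: number,
    guarded: boolean): number {
  let reads = 0;
  // Stand-in for `vec4 updVals = ${updatesSnippet};`.
  const readUpdates = (): void => {
    reads++;
  };
  for (const [i0, i1] of flattenedIndices) {
    // Same four comparisons as the outer if in the shader.
    const anyMatch = i0 === coord || i1 === coord ||
        i0 === coord + 1 || i1 === coord + 1;
    if (!guarded || anyMatch) {
      readUpdates();
    }
  }
  return reads;
}

// Illustrative data: only two of the four pairs touch coord 4 or 5.
const samplePairs: Array<[number, number]> = [[0, 1], [4, 9], [7, 5], [2, 3]];
console.log(countUpdateReads(samplePairs, 4, false));
console.log(countUpdateReads(samplePairs, 4, true));
```

With the guard, the read runs only for pairs that can contribute to the current texel, which is consistent with the small saving described above.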

pyu10055

Reviewed 2 of 3 files at r4, 1 of 1 files at r5, 1 of 1 files at r8, all commit messages.
Reviewable status: complete! 2 of 1 approvals obtained (waiting on @mattsoulanille)

pyu10055

LGTM, given it is guarded under WEBGL_PACK flag.

Reviewable status: complete! 2 of 1 approvals obtained (waiting on @mattsoulanille)

pyu10055

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @Linchenn and @mattsoulanille)

tfjs-backend-webgl/src/scatter_packed_gpu.ts line 76 at r7 (raw file):

Previously, Linchenn wrote…

Yes. The line immediately following it, vec4 updVals = ${updatesSnippet};, is a read instruction, and it is executed only if this check passes. This gives a visible, though small (~2 ms for the model), improvement.

Can you move this into each child branch to avoid this top-level branch statement?

Linchenn

Reviewable status: complete! 1 of 1 approvals obtained (waiting on @mattsoulanille and @pyu10055)

tfjs-backend-webgl/src/scatter_packed_gpu.ts line 76 at r7 (raw file):

Previously, pyu10055 (Ping Yu) wrote…

Can you move this into each child branch to avoid this top-level branch statement?

Merged the mutually exclusive branches into if...else. Thank you, Ping, for thinking deeply about it!
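
The merged code is not quoted in this thread, so the following TypeScript sketch is only one plausible reading of the change. It relies on the observation that a single index cannot equal both `coords[0]` and `coords[0] + 1`, so the two tests on `flattenedIndex[0]` form one if/else pair and the two tests on `flattenedIndex[1]` form another. `scatterIfElse` and `Vec4` are hypothetical names used for illustration.

```typescript
type Vec4 = [number, number, number, number];

// CPU-side emulation of the per-texel update with mutually exclusive
// comparisons paired as if / else if.
function scatterIfElse(
    sum: Vec4, found: Vec4, idx: [number, number], coord: number,
    upd: Vec4): void {
  // idx[0] matches at most one of coord and coord + 1.
  if (idx[0] === coord) {
    sum[0] += upd[0]; sum[1] += upd[1]; found[0] = 1; found[1] = 1;
  } else if (idx[0] === coord + 1) {
    sum[2] += upd[0]; sum[3] += upd[1]; found[2] = 1; found[3] = 1;
  }
  // Likewise for idx[1].
  if (idx[1] === coord) {
    sum[0] += upd[2]; sum[1] += upd[3]; found[0] = 1; found[1] = 1;
  } else if (idx[1] === coord + 1) {
    sum[2] += upd[2]; sum[3] += upd[3]; found[2] = 1; found[3] = 1;
  }
}
```

Compared with four independent if statements, this form skips the second comparison of a pair whenever the first one already matched.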

Linchenn · 2023-01-24T23:54:07Z

Just completed nightly tests.

Linchenn and others added 9 commits January 20, 2023 10:50

benchmark

15e3e94

Update flags_webgl.ts

3027ced

benchmark if-branch

caf22a5

roll out vectorized optimization

568751b

Merge branch 'master' into scatterNd

8b59977

Update ScatterNd.ts

180da15

Merge branch 'scatterNd' of https://github.com/Linchenn/tfjs into scatterNd

732f6a0

date

76d92cf

Update scatter_packed_gpu.ts

910a150

Linchenn commented Jan 20, 2023

Linchenn requested review from mattsoulanille and pyu10055 January 20, 2023 19:56

mattsoulanille approved these changes Jan 20, 2023

pyu10055 requested changes Jan 20, 2023

Linchenn commented Jan 21, 2023

Linchenn requested a review from pyu10055 January 21, 2023 00:58

reduce conversion

3e1fd72

pyu10055 approved these changes Jan 23, 2023

pyu10055 reviewed Jan 23, 2023

pyu10055 requested changes Jan 23, 2023

Linchenn and others added 2 commits January 23, 2023 14:53

Merge branch 'master' into scatterNd

1e01d2e

Update scatter_packed_gpu.ts

0bfb4e3

Linchenn requested review from mattsoulanille and pyu10055 January 23, 2023 22:56

Linchenn commented Jan 23, 2023

Merge branch 'scatterNd' of https://github.com/Linchenn/tfjs into scatterNd

71a9c01

pyu10055 approved these changes Jan 23, 2023

mattsoulanille approved these changes Jan 23, 2023

Merge branch 'master' into scatterNd

988de1a

Merge branch 'master' into scatterNd

d7f537b

Linchenn merged commit 167a74d into tensorflow:master Jan 25, 2023

Linchenn deleted the scatterNd branch January 25, 2023 00:16
