Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[webgpu] Use numbers directly instead of const variables #7193

Merged
merged 3 commits into from
Dec 21, 2022

Conversation

haoyunfeix
Copy link
Contributor

@haoyunfeix haoyunfeix commented Dec 20, 2022

FIXES BUG: #6746
Deno uses Naga for wgsl compilation, but Naga currently uses let for global constants(will be fixed in gfx-rs/naga#1829).
This PR helps WebGPU to run pose-detection models on Deno by removing global constants in shaders.

To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.


This change is Reviewable

@haoyunfeix
Copy link
Contributor Author

@qjia7 @xhcao @axinging @gyagp PTAL

Copy link
Contributor

@qjia7 qjia7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, LGTM. Just some nits.
Please also add the bug id that you fixed in your description message. Like Fixed BUG:#xxx. And the reason that why you do this.

@@ -102,14 +103,14 @@ export class ArgMinMaxProgram implements WebGPUProgram {
${sharedMemorySnippet}

${main('index')} {
let outputIndex = index / i32(workgroupSizeX);
let outputIndex = index / i32(${workgroupSizeX}u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index / i32(${workgroupSizeX}u) -> index / ${workgroupSizeX}.
Since ${workgroupSizeX} is already an int, you don't need to make it a uint and force to int again.
Similar for other places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

let BCached0 = mm_Bsub[k * innerElementSize][tileCol];
let BCached1 = mm_Bsub[k * innerElementSize + 1][tileCol];
let BCached2 = mm_Bsub[k * innerElementSize + 2][tileCol];
for (var k = 0; k < ${tileInner} / ${innerElementSize}; k = k + 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k < ${tileInner} / ${innerElementSize}
->
k < ${ tileInner / innerElementSize}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I prefer ${tileInner} / ${innerElementSize}, because I view ${} in WGSL as the equivalence of macros in C/C++.

Copy link
Contributor Author

@haoyunfeix haoyunfeix Dec 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, for (var k = 0; k < 32 / 4; k = k + 1) is more meaningful than for (var k = 0; k < 8; k = k + 1), in case someone would like to read the running WGSL for some reason. However, this may cost additional division operations in this loop.
WDYT, @qjia7

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I think it's weird that you directly use 32 / 4 not 8 in shader. And from the whole loop body, we can also easily infer that each iteration will process 4 elements. And I don't know why we need the exact equivalence between wgsl and js code. Why not directly save this division in shader?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either one is OK for me. This will not add runtime cost as Tint and shader compiler can do the optimization.

mm_Asub[tileCol] = vec4<f32>(${readVectorASnippet(transposeA)});
workgroupBarrier();

// Compute acc values for a single thread.
for (var k = 0; k < tileSize / 4; k = k + 1) {
let rowB = t * tileSize + k * 4;
for (var k = 0; k < ${tileSize} / 4; k = k + 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k < ${tileSize} / 4
->
k < ${tileSize / 4}

@@ -34,15 +34,16 @@ export function makeMatMulReduceSource(): string {
let col = coords[2];
var sum = 0.0;
let Length = uniforms.dimInner;
for (var k = i32(localId.x); k < Length; k = k + i32(workgroupSizeX)) {
for (var k = i32(localId.x); k < Length; k = k + i32(${
workgroupSizeX}u)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

k = k + i32(${workgroupSizeX}u)
->
k = k + ${workgroupSizeX}

let dataA = mm_readA(batchA, row, k);
let dataB = mm_readB(batchB, k, col);
sum = sum + dataA * dataB;
}
sumValues[localId.x] = sum;
workgroupBarrier();

for(var currentSize = workgroupSizeX / 2u; currentSize > 1u;
for(var currentSize = ${workgroupSizeX}u / 2u; currentSize > 1u;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

${workgroupSizeX}u / 2u
->
${workgroupSizeX / 2}u

@@ -97,20 +98,20 @@ export class ReduceProgram implements WebGPUProgram {
return offset;
}
${main('index')} {
let outputIndex = index / i32(workgroupSizeX);
let outputIndex = index / i32(${workgroupSizeX}u);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i32(${workgroupSizeX}u
->
${workgroupSizeX}

for (var k = i32(localId.x); k < Length && outputIndex < uniforms.size;
k = k + i32(workgroupSizeX)) {
k = k + i32(${workgroupSizeX})) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

@haoyunfeix haoyunfeix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Reviewable status: 0 of 1 approvals obtained (waiting on @gyagp, @qjia7, and @xhcao)


tfjs-backend-webgpu/src/matmul_packed_webgpu.ts line 216 at r4 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

Personally, I think it's weird that you directly use 32 / 4 not 8 in shader. And from the whole loop body, we can also easily infer that each iteration will process 4 elements. And I don't know why we need the exact equivalence between wgsl and js code. Why not directly save this division in shader?

Alright, fixed it.


tfjs-backend-webgpu/src/matmul_packed_webgpu.ts line 491 at r4 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

k < ${tileSize} / 4
->
k < ${tileSize / 4}

Done.


tfjs-backend-webgpu/src/matmul_reduce_webgpu.ts line 38 at r4 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

k = k + i32(${workgroupSizeX}u)
->
k = k + ${workgroupSizeX}

Done.


tfjs-backend-webgpu/src/matmul_reduce_webgpu.ts line 46 at r4 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

${workgroupSizeX}u / 2u
->
${workgroupSizeX / 2}u

Done.


tfjs-backend-webgpu/src/reduce_webgpu.ts line 101 at r4 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

i32(${workgroupSizeX}u
->
${workgroupSizeX}

Done.


tfjs-backend-webgpu/src/reduce_webgpu.ts line 107 at r4 (raw file):

Previously, qjia7 (Jiajia Qin) wrote…

ditto

Done.

let BCached0 = mm_Bsub[k * innerElementSize][tileCol];
let BCached1 = mm_Bsub[k * innerElementSize + 1][tileCol];
let BCached2 = mm_Bsub[k * innerElementSize + 2][tileCol];
for (var k = 0; k < ${tileInner} / ${innerElementSize}; k = k + 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either one is OK for me. This will not add runtime cost as Tint and shader compiler can do the optimization.

}

workgroupBarrier();
}

for (var innerRow = 0; innerRow < rowPerThread; innerRow = innerRow + 1) {
for (var innerRow = 0; innerRow < ${
rowPerThread}; innerRow = innerRow + 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang format really makes code hard to read. Can we disable clang format for shader code?
// clang-format off
// shader code
// clang-format on

@qjia7 @hujiajie @xhcao What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not quite love exceptions. Usually I try to fix such issues with less notable techniques.

I remember innerRow++ works in Tint, does it work in naga?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems innerRow++ works, but ++innerRow does not work...
I thought Yang means the weird newline, should we start from this PR, or in another PR for ALL?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WGSL only supports i++. Let's keep this PR as is, and have offline discussion about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to use i++ for weird new lines in this PR, to see if we can correct some of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -258,7 +256,7 @@ function makeShader(
getOutputCoordsSnippet(outputData.shape, program.dispatchLayout);

const sources = [
commonSnippet + isInfSnippet, prefixSnippets.join('\n'),
commonSnippet , prefixSnippets.join('\n') + isInfSnippet,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No extra space after commonSnippet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

workgroupBarrier();
}

for (var innerRow = 0; innerRow < rowPerThread; innerRow = innerRow + 1) {
for (var innerRow = 0; innerRow < ${
rowPerThread}; innerRow = innerRow + 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good for me.

…s global scope

FIXES BUG: tensorflow#6746
Deno uses Naga for wgsl compilation, but Naga currently uses let for global constants(will be fixed in gfx-rs/naga#1829).
This PR helps WebGPU to run pose-detection models on Deno by removing global constants in shaders.

Address comments
@gyagp gyagp merged commit fd3c3f4 into tensorflow:master Dec 21, 2022
Linchenn pushed a commit to Linchenn/tfjs that referenced this pull request Jan 9, 2023
…7193)

* [webgpu] Use numbers directly instead of `const variables` in shader's global scope

FIXES BUG: tensorflow#6746
Deno uses Naga for wgsl compilation, but Naga currently uses let for global constants(will be fixed in gfx-rs/naga#1829).
This PR helps WebGPU to run pose-detection models on Deno by removing global constants in shaders.
@hujiajie hujiajie mentioned this pull request Feb 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants