-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[webgpu] Use numbers directly instead of const variables #7193
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM. Just some nits.
Please also add the bug id that you fixed in your description message. Like Fixed BUG:#xxx
. And the reason that why you do this.
@@ -102,14 +103,14 @@ export class ArgMinMaxProgram implements WebGPUProgram { | |||
${sharedMemorySnippet} | |||
|
|||
${main('index')} { | |||
let outputIndex = index / i32(workgroupSizeX); | |||
let outputIndex = index / i32(${workgroupSizeX}u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index / i32(${workgroupSizeX}u)
-> index / ${workgroupSizeX}
.
Since ${workgroupSizeX}
is already an int, you don't need to make it a uint and force to int again.
Similar for other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
let BCached0 = mm_Bsub[k * innerElementSize][tileCol]; | ||
let BCached1 = mm_Bsub[k * innerElementSize + 1][tileCol]; | ||
let BCached2 = mm_Bsub[k * innerElementSize + 2][tileCol]; | ||
for (var k = 0; k < ${tileInner} / ${innerElementSize}; k = k + 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k < ${tileInner} / ${innerElementSize}
->
k < ${ tileInner / innerElementSize}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I prefer ${tileInner} / ${innerElementSize}
, because I view ${}
in WGSL as the equivalence of macros in C/C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, for (var k = 0; k < 32 / 4; k = k + 1)
is more meaningful than for (var k = 0; k < 8; k = k + 1)
, in case someone would like to read the running WGSL for some reason. However, this may cost additional division operations in this loop.
WDYT, @qjia7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I think it's weird that you directly use 32 / 4
not 8
in shader. And from the whole loop body, we can also easily infer that each iteration will process 4 elements. And I don't know why we need the exact equivalence between wgsl and js code. Why not directly save this division in shader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either one is OK for me. This will not add runtime cost as Tint and shader compiler can do the optimization.
mm_Asub[tileCol] = vec4<f32>(${readVectorASnippet(transposeA)}); | ||
workgroupBarrier(); | ||
|
||
// Compute acc values for a single thread. | ||
for (var k = 0; k < tileSize / 4; k = k + 1) { | ||
let rowB = t * tileSize + k * 4; | ||
for (var k = 0; k < ${tileSize} / 4; k = k + 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k < ${tileSize} / 4
->
k < ${tileSize / 4}
@@ -34,15 +34,16 @@ export function makeMatMulReduceSource(): string { | |||
let col = coords[2]; | |||
var sum = 0.0; | |||
let Length = uniforms.dimInner; | |||
for (var k = i32(localId.x); k < Length; k = k + i32(workgroupSizeX)) { | |||
for (var k = i32(localId.x); k < Length; k = k + i32(${ | |||
workgroupSizeX}u)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
k = k + i32(${workgroupSizeX}u)
->
k = k + ${workgroupSizeX}
let dataA = mm_readA(batchA, row, k); | ||
let dataB = mm_readB(batchB, k, col); | ||
sum = sum + dataA * dataB; | ||
} | ||
sumValues[localId.x] = sum; | ||
workgroupBarrier(); | ||
|
||
for(var currentSize = workgroupSizeX / 2u; currentSize > 1u; | ||
for(var currentSize = ${workgroupSizeX}u / 2u; currentSize > 1u; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
${workgroupSizeX}u / 2u
->
${workgroupSizeX / 2}u
@@ -97,20 +98,20 @@ export class ReduceProgram implements WebGPUProgram { | |||
return offset; | |||
} | |||
${main('index')} { | |||
let outputIndex = index / i32(workgroupSizeX); | |||
let outputIndex = index / i32(${workgroupSizeX}u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i32(${workgroupSizeX}u
->
${workgroupSizeX}
for (var k = i32(localId.x); k < Length && outputIndex < uniforms.size; | ||
k = k + i32(workgroupSizeX)) { | ||
k = k + i32(${workgroupSizeX})) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Reviewable status: 0 of 1 approvals obtained (waiting on @gyagp, @qjia7, and @xhcao)
tfjs-backend-webgpu/src/matmul_packed_webgpu.ts
line 216 at r4 (raw file):
Previously, qjia7 (Jiajia Qin) wrote…
Personally, I think it's weird that you directly use
32 / 4
not8
in shader. And from the whole loop body, we can also easily infer that each iteration will process 4 elements. And I don't know why we need the exact equivalence between wgsl and js code. Why not directly save this division in shader?
Alright, fixed it.
tfjs-backend-webgpu/src/matmul_packed_webgpu.ts
line 491 at r4 (raw file):
Previously, qjia7 (Jiajia Qin) wrote…
k < ${tileSize} / 4
->
k < ${tileSize / 4}
Done.
tfjs-backend-webgpu/src/matmul_reduce_webgpu.ts
line 38 at r4 (raw file):
Previously, qjia7 (Jiajia Qin) wrote…
k = k + i32(${workgroupSizeX}u)
->
k = k + ${workgroupSizeX}
Done.
tfjs-backend-webgpu/src/matmul_reduce_webgpu.ts
line 46 at r4 (raw file):
Previously, qjia7 (Jiajia Qin) wrote…
${workgroupSizeX}u / 2u
->
${workgroupSizeX / 2}u
Done.
tfjs-backend-webgpu/src/reduce_webgpu.ts
line 101 at r4 (raw file):
Previously, qjia7 (Jiajia Qin) wrote…
i32(${workgroupSizeX}u
->
${workgroupSizeX}
Done.
tfjs-backend-webgpu/src/reduce_webgpu.ts
line 107 at r4 (raw file):
Previously, qjia7 (Jiajia Qin) wrote…
ditto
Done.
let BCached0 = mm_Bsub[k * innerElementSize][tileCol]; | ||
let BCached1 = mm_Bsub[k * innerElementSize + 1][tileCol]; | ||
let BCached2 = mm_Bsub[k * innerElementSize + 2][tileCol]; | ||
for (var k = 0; k < ${tileInner} / ${innerElementSize}; k = k + 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either one is OK for me. This will not add runtime cost as Tint and shader compiler can do the optimization.
} | ||
|
||
workgroupBarrier(); | ||
} | ||
|
||
for (var innerRow = 0; innerRow < rowPerThread; innerRow = innerRow + 1) { | ||
for (var innerRow = 0; innerRow < ${ | ||
rowPerThread}; innerRow = innerRow + 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not quite love exceptions. Usually I try to fix such issues with less notable techniques.
I remember innerRow++
works in Tint, does it work in naga?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems innerRow++
works, but ++innerRow
does not work...
I thought Yang means the weird newline, should we start from this PR, or in another PR for ALL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WGSL only supports i++. Let's keep this PR as is, and have offline discussion about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to use i++ for weird new lines in this PR, to see if we can correct some of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -258,7 +256,7 @@ function makeShader( | |||
getOutputCoordsSnippet(outputData.shape, program.dispatchLayout); | |||
|
|||
const sources = [ | |||
commonSnippet + isInfSnippet, prefixSnippets.join('\n'), | |||
commonSnippet , prefixSnippets.join('\n') + isInfSnippet, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No extra space after commonSnippet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
workgroupBarrier(); | ||
} | ||
|
||
for (var innerRow = 0; innerRow < rowPerThread; innerRow = innerRow + 1) { | ||
for (var innerRow = 0; innerRow < ${ | ||
rowPerThread}; innerRow = innerRow + 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good for me.
…s global scope FIXES BUG: tensorflow#6746 Deno uses Naga for wgsl compilation, but Naga currently uses let for global constants(will be fixed in gfx-rs/naga#1829). This PR helps WebGPU to run pose-detection models on Deno by removing global constants in shaders. Address comments
…7193) * [webgpu] Use numbers directly instead of `const variables` in shader's global scope FIXES BUG: tensorflow#6746 Deno uses Naga for wgsl compilation, but Naga currently uses let for global constants(will be fixed in gfx-rs/naga#1829). This PR helps WebGPU to run pose-detection models on Deno by removing global constants in shaders.
FIXES BUG: #6746
Deno uses Naga for wgsl compilation, but Naga currently uses let for global constants(will be fixed in gfx-rs/naga#1829).
This PR helps WebGPU to run pose-detection models on Deno by removing global constants in shaders.
To see the logs from the Cloud Build CI, please join either our discussion or announcement mailing list.
This change is